bobby_dreamer

Python pandas read_html()

python, pandas, scraping

Just saying, scraping can be a lot of work: you go to a webpage and there are lots of tables in different formats, in the sense of different numbers of columns and rows. Pandas handles all of that and makes it look easy, so you get more time to look at the data and work on it.

Here we have a simple program that reads a Wikipedia page containing multiple tables. The code is pretty simple and there is only one thing to note: if there are multiple tables in a webpage, pd.read_html() returns a list of DataFrames, one per table. Other than that there is not much to explain, it's that easy. But there is a bit of work in the data wrangling part after reading the tables, like below:

  1. In table[1], there are no column names.
  2. table[2] might look confusing at first: there is a pandas index, and there is also a sequence number column in the table itself.
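
The two clean-up steps above can be sketched roughly as follows. This is only an illustration, not the original program: SAMPLE_HTML is a small made-up page standing in for the Wikipedia one, and the helper names (read_tables, name_columns, use_sequence_as_index, demo) and column names are hypothetical. It also assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for a page with several tables (NOT the Wikipedia page
# from the post). The second table has no header row; the third carries
# its own sequence-number column.
SAMPLE_HTML = """
<table>
  <tr><th>Language</th><th>Year</th></tr>
  <tr><td>Python</td><td>1991</td></tr>
</table>
<table>
  <tr><td>Delhi</td><td>India</td></tr>
  <tr><td>Tokyo</td><td>Japan</td></tr>
</table>
<table>
  <tr><th>No</th><th>Name</th></tr>
  <tr><td>1</td><td>Alpha</td></tr>
  <tr><td>2</td><td>Beta</td></tr>
</table>
"""


def read_tables(html):
    """Return one DataFrame per <table> found in the markup."""
    # Wrapping the string in StringIO avoids the deprecation warning that
    # newer pandas versions emit for literal HTML strings.
    return pd.read_html(StringIO(html))


def name_columns(df, names):
    """Case 1: the table has no header row, so assign column names."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out.columns = list(names)
    return out


def use_sequence_as_index(df, seq_col):
    """Case 2: use the table's own sequence column as the index."""
    return df.set_index(seq_col)


def demo():
    tables = read_tables(SAMPLE_HTML)
    print(len(tables))  # one DataFrame per <table> in the page
    print(name_columns(tables[1], ["City", "Country"]))
    print(use_sequence_as_index(tables[2], "No"))
```

Calling demo() parses the sample markup and prints the two cleaned-up tables. For a real page, the same helpers would be applied to whichever entries of the returned list need them, after inspecting each table first.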