
bobby_dreamer

Python pandas read_html()

python, pandas, scraping · 1 min read

Just saying, scraping can be a lot of work: you go to a webpage and there are lots of tables in different formats, in the sense of different numbers of columns and rows. Pandas pulls all of these in and makes it look easy, so you get more time to look at the data and work on it.

Here we have a simple program that reads a Wikipedia page containing multiple tables. The code is pretty simple, and there is only one thing to note: if there are multiple tables in a webpage, pd.read_html() returns a list of DataFrames. Other than that there is not much to explain; it's that easy. But there is a bit of work in the data-wrangling part after reading the tables, as below:

  1. In table[1], there are no column names.
  2. In table[2], it might look confusing at first: there is a pandas index, and there is also a sequence-number column in the table itself.
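Before the full script, here is a minimal offline sketch of those two wrangling steps. The HTML snippet and the column names "Attribute"/"Value" are made up for illustration (the real script below fetches the live Wikipedia page); it assumes one of pd.read_html()'s parser backends (lxml, html5lib, or bs4) is installed.

```python
from io import StringIO
import pandas as pd

# Made-up HTML mimicking the two situations described above.
html = """
<table>
  <tr><td>Type</td><td>Stock exchange</td></tr>
  <tr><td>Location</td><td>Mumbai, Maharashtra, India</td></tr>
</table>
<table>
  <tr><th>#</th><th>Companies</th></tr>
  <tr><td>1</td><td>Asian Paints</td></tr>
  <tr><td>2</td><td>Axis Bank</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # always a list, one DataFrame per <table>
print(len(tables))                     # 2

t1, t2 = tables

# First table has no header row, so pandas assigns integer column labels.
print(list(t1.columns))                # [0, 1]
t1.columns = ["Attribute", "Value"]    # give the nameless columns names ourselves

# Second table carries its own '#' sequence column alongside the pandas index;
# one option is to promote '#' to be the index.
t2 = t2.set_index("#")
print(t2.loc[1, "Companies"])          # Asian Paints
```

Renaming the columns and collapsing the duplicate sequence column up front makes every later selection (loc, merge, plotting) read naturally.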
read_html.py
import pandas as pd
import lxml      # parser backends used by pd.read_html()
import requests
import html5lib
from tabulate import tabulate

url = "https://en.wikipedia.org/wiki/BSE_SENSEX"
html = requests.get(url)
df_arr = pd.read_html(html.text)

print(tabulate(df_arr[0], headers='keys', tablefmt='psql', showindex=True))
# Output
+----+----------+----------------------------+
|    | 0        | 1                          |
|----+----------+----------------------------|
|  0 | nan      | nan                        |
|  1 | Type     | Stock exchange             |
|  2 | Location | Mumbai, Maharashtra, India |
|  3 | Currency | Indian rupee ()            |
|  4 | Website  | www.bseindia.com           |
+----+----------+----------------------------+
22
print(tabulate(df_arr[1], headers='keys', tablefmt='psql', showindex=True))
# Output
+----+-----+-------------------+---------------------------------+--------------------------------+---------------------+
|    |   # | Exchange ticker   | Companies                       | Sector                         | Date Added          |
|----+-----+-------------------+---------------------------------+--------------------------------+---------------------|
|  0 |   1 | 500820            | Asian Paints                    | Paints                         | 21 December 2015[7] |
|  1 |   2 | 532215            | Axis Bank                       | Banking - Private              | nan                 |
|  2 |   3 | 532977            | Bajaj Auto                      | Automobile                     | nan                 |
|  3 |   4 | 500034            | Bajaj Finance                   | Finance (NBFC)                 | 24 December 2018[8] |
|  4 |   5 | 532978            | Bajaj Finserv                   | Finance (Investment)           | nan                 |
|  5 |   6 | 532454            | Bharti Airtel                   | Telecommunications             | nan                 |
|  6 |   7 | 532281            | HCL Technologies                | IT Services & Consulting       | nan                 |
|  7 |   8 | 500010            | HDFC                            | Finance (Housing)              | nan                 |
|  8 |   9 | 500180            | HDFC Bank                       | Banking - Private              | nan                 |
|  9 |  10 | 500696            | Hindustan Unilever Limited      | FMCG                           | nan                 |
| 10 |  11 | 532174            | ICICI Bank                      | Banking - Private              | nan                 |
| 11 |  12 | 532187            | IndusInd Bank                   | Banking - Private              | 18 December 2017[9] |
| 12 |  13 | 500209            | Infosys                         | IT Services & Consulting       | nan                 |
| 13 |  14 | 500875            | ITC Limited                     | Cigarettes & FMCG              | nan                 |
| 14 |  15 | 500247            | Kotak Mahindra Bank             | Banking - Private              | 19 June 2017[10]    |
| 15 |  16 | 500510            | Larsen & Toubro                 | Engineering & Construction     | nan                 |
| 16 |  17 | 500520            | Mahindra & Mahindra             | Automobile                     | nan                 |
| 17 |  18 | 532500            | Maruti Suzuki                   | Automobile                     | nan                 |
| 18 |  19 | 500790            | Nestle India                    | FMCG                           | 23 Dec 2019[11]     |
| 19 |  20 | 532555            | NTPC                            | Power generation/Distribution  | nan                 |
| 20 |  21 | 500312            | Oil and Natural Gas Corporation | Oil exploration and Production | nan                 |
| 21 |  22 | 532898            | Power Grid Corporation of India | Power generation/Distribution  | 20 June 2016[12]    |
| 22 |  23 | 500325            | Reliance Industries Limited     | Conglomerate                   | nan                 |
| 23 |  24 | 500112            | State Bank of India             | Banking - Public               | nan                 |
| 24 |  25 | 524715            | Sun Pharmaceutical              | Pharmaceuticals                | 8 August 2011[13]   |
| 25 |  26 | 500470            | Tata Steel                      | Iron & Steel                   | nan                 |
| 26 |  27 | 532540            | Tata Consultancy Services       | IT Services & Consulting       | nan                 |
| 27 |  28 | 532755            | Tech Mahindra                   | IT Services & Consulting       | nan                 |
| 28 |  29 | 500114            | Titan Company                   | Diamond & Jewellery            | 23 Dec 2019[11]     |
| 29 |  30 | 532538            | UltraTech Cement                | Cement                         | 23 Dec 2019[11]     |
+----+-----+-------------------+---------------------------------+--------------------------------+---------------------+
59
print(tabulate(df_arr[2], headers='keys', tablefmt='psql', showindex=True))
# Output
+----+------------+----------------------+-----------------------------+
|    | Category   |   All-Time Highs[14] | All-Time Highs[14].1        |
|----+------------+----------------------+-----------------------------|
|  0 | Closing    |              41952.6 | Tuesday, 14 January 2020    |
|  1 | Intraday   |              42063   | Friday, 17 January 2020[15] |
+----+------------+----------------------+-----------------------------+
68
print(tabulate(df_arr[3].tail(), headers='keys', tablefmt='psql', showindex=True))
# Output
+----+---------+---------------+--------------+-----------------------------------------+
|    |   S.No. | Date          | Points       | Reason                                  |
|----+---------+---------------+--------------+-----------------------------------------|
| 75 |      75 | 24 April 2020 | 535.86[110]  | Due to Coronavirus Pandemic.            |
| 76 |      76 | 4 May 2020    | 2002.27[111] | Due to Coronavirus Pandemic.            |
| 77 |      77 | 14 May 2020   | 885.72[112]  | Due to Coronavirus Pandemic.            |
| 78 |      78 | 18 May 2020   | 1068.75[113] | Due to Coronavirus Pandemic.            |
| 79 |      79 | 11 June 2020  | 708.68 [114] | Driven by global equity sell-off. [115] |
+----+---------+---------------+--------------+-----------------------------------------+
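One more trick worth knowing: instead of indexing into the returned list by position (which silently shifts if Wikipedia adds a table to the page), pd.read_html() accepts a `match` argument, a string or regex that a table's text must contain for it to be returned. A minimal offline sketch, with made-up HTML standing in for the live page:

```python
from io import StringIO
import pandas as pd

# Two made-up tables mimicking the page: an all-time-highs table and a
# losses table. We want only the first one.
html = """
<table>
  <tr><th>Category</th><th>High</th></tr>
  <tr><td>Closing</td><td>41952.6</td></tr>
</table>
<table>
  <tr><th>S.No.</th><th>Date</th></tr>
  <tr><td>75</td><td>24 April 2020</td></tr>
</table>
"""

# Only tables whose text matches "Category" are parsed and returned.
wanted = pd.read_html(StringIO(html), match="Category")
print(len(wanted))                   # 1
print(wanted[0].loc[0, "Category"])  # Closing
```

Note that read_html raises a ValueError if no table matches, which is arguably better than quietly wrangling the wrong table.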