read_html(): reading tabular data from an HTML file or URL

import pandas as pd
df_l=pd.read_html('https://www.plus2net.com/php_tutorial/site_map-date.php') # List 
#print(type(df_l)) # <class 'list'>
df=df_l[0] # create a DataFrame from the list object
print(df.head()) # Top 5 rows from DataFrame

Output is here

             Function                             Description
0                Date                  PHP Date & Time object
1  createfromformat()                      Change date format
2         checkdate()                         Validating date
3              date()  Required date and time in given format
4       date_create()                   Creating date objects

read_html() looks for <table>, <tr>, <th> and <td> tags, and handles the colspan and rowspan attributes of <td> and <th> cells. It returns a list of DataFrames, one for each table found.
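
To see how spanned cells are handled, here is a minimal sketch using a small made-up table held in a string. It assumes the lxml parser is installed; the names and marks are only for illustration. The value of a cell with colspan or rowspan is repeated in every position it covers.

```python
from io import StringIO
import pandas as pd

# Made-up table: one cell spans two columns, another spans two rows
html = """
<table>
  <tr><th>name</th><th>term1</th><th>term2</th></tr>
  <tr><td>Alex</td><td colspan="2">80</td></tr>
  <tr><td rowspan="2">Ravi</td><td>70</td><td>75</td></tr>
  <tr><td>60</td><td>65</td></tr>
</table>
"""

# Wrap the string in StringIO so read_html treats it as a file-like object
df_span = pd.read_html(StringIO(html))[0]
print(df_span)  # 80 fills both term columns, Ravi fills both of the last rows
```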

Creating a DataFrame from a local HTML file

The examples below use the sample student.html file, saved here as C:\data\student.html.

import pandas as pd
df_l=pd.read_html('C:\\data\\student.html',index_col='id') 
df=df_l[0] # creating dataframe from list
print(df.tail()) # Last five rows of DataFrame

output

           name  class  mark  gender
id
31  Marry Toeey   Four    88    male
32    Binn Rott  Seven    90  female
33    Kenn Rein    Six    96  female
34     Gain Toe  Seven    69    male
35   Rows Noump    Six    88  female

Using a file object

We can also open the file ourselves, read its contents, and pass them to read_html().

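import pandas as pd
from io import StringIO
# Open in read mode; the with block closes the file automatically
with open('C:\\data\\student.html','r') as fob:
    data=fob.read() # read the file data
# pandas 2.1 and later deprecate literal HTML strings, so wrap the text in StringIO
df_l=pd.read_html(StringIO(data),index_col='id')
df=df_l[0]
print(df.tail())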

The output is the same as above.
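
If the sample file is not available, the same idea can be tried with a small temporary file. This is only a sketch under a few assumptions: the table contents and the file name sample.html are made up, lxml is installed, and the open file object itself is passed to read_html() instead of the text read from it.

```python
import os
import tempfile
import pandas as pd

# Made-up student table written to a temporary file for the demonstration
html = """
<table>
  <tr><th>id</th><th>name</th><th>mark</th></tr>
  <tr><td>1</td><td>Alex</td><td>88</td></tr>
  <tr><td>2</td><td>Ravi</td><td>90</td></tr>
</table>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.html')  # hypothetical file name
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    # Pass the open file object directly; index_col=0 uses the first column (id)
    with open(path, 'r', encoding='utf-8') as fob:
        df_file = pd.read_html(fob, index_col=0)[0]

print(df_file)
```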

match

When a page contains several tables, we can use the match option to collect only the tables containing text that matches a string or regular expression. The page below has multiple tables.

https://www.plus2net.com/python/site_map.php

The code below shows different match values, with all but one line commented out. Uncomment them one at a time to try each.

import pandas as pd
#df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='Operators')  
#df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='MySQL database')
#df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='Django')
df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='Pygsheet')
df=df_l[0] # create a DataFrame
print(df)
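
The effect of match can also be checked without a network connection. The sketch below assumes lxml is installed and uses three tiny made-up tables; it compares no match value, a plain string, and a regular expression.

```python
from io import StringIO
import pandas as pd

# Three made-up single-column tables in one HTML string
html = """
<table><tr><th>Topic</th></tr><tr><td>Operators</td></tr></table>
<table><tr><th>Topic</th></tr><tr><td>Django</td></tr></table>
<table><tr><th>Topic</th></tr><tr><td>MySQL database</td></tr></table>
"""

all_tables = pd.read_html(StringIO(html))                          # no filter
django_only = pd.read_html(StringIO(html), match='Django')         # plain string
regex_tables = pd.read_html(StringIO(html), match=r'Django|MySQL') # regex

print(len(all_tables), len(django_only), len(regex_tables))
print(django_only[0])
```

Only tables containing text that matches are kept, so the list length changes with the match value.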

attrs

my_dict={'id': 'tb1'} # valid HTML table attributes
df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',attrs=my_dict)

Change the id value to tb2 and check the result. On this page several tables (11 in total) share the same 'class' attribute. read_html() returns all of them in a list, so we can pick any table by its list index when creating the DataFrame. Here len() is used to get the number of tables having the matching attribute.

my_dict={'class': 'table table-striped'}
df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',attrs=my_dict)
print(len(df_l)) # 11
df=df_l[1] # Change this value to get different table 
print(df)

Output

             0                                                1
0  BooleanVar()  Tkinter Variable for handling True / False data
1            IP               IP address and host name in Python
2          Json      Json methods to manage Json data formatting
3       tkinter                         Python GUI Module module
4        Turtle                          Draw graphics in Python
5         tuple                  Ordered unchangeable items list
6        Django                             Python web framework
7        Pickle               Pickle or Un-pickle Python objects
8        Pillow                    Python Imageing Library : PIL

Change the index in this line to get different tables.

df=df_l[2] # Change this value to get different table
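
Here is a small offline sketch of the same attrs idea, assuming lxml is installed. The three tables and their id and class values are made up for illustration: an id selects one table, while a shared class returns several.

```python
from io import StringIO
import pandas as pd

# Made-up page: two tables share a class, each table has its own id
html = """
<table id="tb1" class="table table-striped"><tr><th>Name</th></tr><tr><td>tkinter</td></tr></table>
<table id="tb2" class="table table-striped"><tr><th>Name</th></tr><tr><td>Django</td></tr></table>
<table id="tb3"><tr><th>Name</th></tr><tr><td>Pillow</td></tr></table>
"""

by_id = pd.read_html(StringIO(html), attrs={'id': 'tb2'})
by_class = pd.read_html(StringIO(html), attrs={'class': 'table table-striped'})

print(len(by_id), len(by_class))  # number of tables matched in each case
print(by_class[1])                # second table having the matching class
```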

Parameters

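io : Path, URL or file-like object ( check the above examples ), required. A raw HTML string should be wrapped in io.StringIO.
match : String or regex; only tables containing matching text are returned ( examples above )
flavor : Parser to use, 'lxml', 'bs4' or 'html5lib'. If not given, lxml is tried first, with bs4 and html5lib as the fallback ( You may have to install these libraries if not there )
header : The row number ( or list of row numbers ) to be used as column headers
index_col : The column to be used as index ( see example above )
skiprows : Number of rows to skip, or a list or slice of row numbers to skip
attrs : Valid html attributes passed as dictionary to identify table. ( see example above )
parse_dates : Parsing of date columns
thousands : Separator used for thousands marking ( default is , )
encoding : Encoding to be used while reading the file.
decimal : Char to be used as decimal point ( , is used in European data )
na_values : Additional values to be treated as NA
keep_default_na : Whether to keep the default NA values when na_values is given
displayed_only : Whether to parse only displayed elements, skipping those with display: none
extract_links : Table section ( 'header', 'body', 'footer' or 'all' ) from which href values are extracted; cells in that section are returned as ( text, href ) tuples.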
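
Two of these parameters are easier to follow with a short example. This is a sketch only, assuming lxml is installed and a pandas version that supports extract_links (1.5 or later); both tables are made up. The first call reads European-style numbers with thousands and decimal, the second collects link targets with extract_links.

```python
from io import StringIO
import pandas as pd

# Made-up price table using European number formatting
price_html = """
<table>
  <tr><th>item</th><th>price</th></tr>
  <tr><td>Laptop</td><td>1.250,50</td></tr>
  <tr><td>Mouse</td><td>19,99</td></tr>
</table>
"""
# '.' marks thousands and ',' marks the decimal point
df_price = pd.read_html(StringIO(price_html), thousands='.', decimal=',')[0]
print(df_price)

# Made-up table whose body cell contains a link
link_html = """
<table>
  <tr><th>page</th></tr>
  <tr><td><a href="https://www.plus2net.com/python/site_map.php">Python</a></td></tr>
</table>
"""
# extract_links='body' turns each body cell into a (text, href) tuple
df_link = pd.read_html(StringIO(link_html), extract_links='body')[0]
print(df_link.loc[0, 'page'])
```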

Questions

What is the purpose of the read_html() function in Pandas?
How do you use the read_html() function to read data from an HTML table?
What is the return type of the read_html() function?
Can the read_html() function read multiple tables from a single HTML page?
What are the optional parameters that can be passed to the read_html() function?
How does the header parameter work in the read_html() function?
Can the read_html() function handle tables with merged cells or complex structures?
How can you specify a specific table to read when using the read_html() function?
Is it possible to read data from a remote HTML page using the read_html() function?
What happens if the HTML page contains non-tabular data along with the table?

Subhendu Mohapatra

Author

Passionate about coding and teaching, I publish practical tutorials on PHP, Python, JavaScript, SQL, and web development. My goal is to make learning simple, engaging, and project‑oriented with real examples and source code.
