read_html(): read HTML tables into a list of DataFrame objects
import pandas as pd
df_l=pd.read_html('https://www.plus2net.com/php_tutorial/site_map-date.php') # List
#print(type(df_l)) # <class 'list'>
df=df_l[0] # create a DataFrame from the list object
print(df.head()) # Top 5 rows from DataFrame
Output is here
Function Description
0 Date PHP Date & Time object
1 createfromformat() Change date format
2 checkdate() Validating date
3 date() Required date and time in given format
4 date_create() Creating date objects
This function looks for the <table>, <tr>, <th> and <td> tags and takes care of the colspan and rowspan attributes of the <td> and <th> tags.
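To see how a merged cell is handled, here is a minimal sketch. The table, the names and the marks are made up for illustration, and the HTML is wrapped in StringIO so nothing is downloaded. It assumes a parser library such as lxml (or bs4 with html5lib) is installed.

```python
from io import StringIO
import pandas as pd

# Hypothetical table: in the second row one cell spans both mark columns.
html = """
<table>
  <tr><th>name</th><th>math</th><th>science</th></tr>
  <tr><td>Alex</td><td colspan="2">80</td></tr>
  <tr><td>Ravi</td><td>70</td><td>75</td></tr>
</table>
"""

df = pd.read_html(StringIO(html))[0]  # first (and only) table in the list
print(df)  # the merged value is repeated in the math and science columns
```

The value of the cell with colspan="2" is copied into each column it covers, so the DataFrame keeps a regular shape.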
import pandas as pd
df_l=pd.read_html('C:\\data\\student.html',index_col='id')
df=df_l[0] # creating dataframe from list
print(df.tail()) # Last five rows of DataFrame
Output
name class mark gender
id
31 Marry Toeey Four 88 male
32 Binn Rott Seven 90 female
33 Kenn Rein Six 96 female
34 Gain Toe Seven 69 male
35 Rows Noump Six 88 female
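We can also read the file first and pass its content to read_html().
from io import StringIO
with open('C:\\data\\student.html','r') as fob: # Open in read mode, file is closed automatically
    data=fob.read() # read the file data
# pandas 2.1 and later warn when a raw HTML string is passed, so wrap it in StringIO
df_l=pd.read_html(StringIO(data),index_col='id')
df=df_l[0]
print(df.tail())
Output is the same as above.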
match
When a page has more than one table, we can use the match option to collect only the tables containing text that matches a string or regex. The page at this URL is used as an example.
https://www.plus2net.com/python/site_map.php
This page has multiple tables. The code here shows how the string or regex is matched. The commented lines use different match values. Try them.
import pandas as pd
#df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='Operators')
#df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='MySQL database')
#df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='Django')
df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='Pygsheet')
df=df_l[0] # create a DataFrame
print(df)
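The same idea can be tried without a network connection. This is only a sketch: the two small tables are hypothetical and are not taken from the page above.

```python
from io import StringIO
import pandas as pd

# Hypothetical page with two tables; only the second one contains "Django".
html = """
<table>
  <tr><th>topic</th><th>details</th></tr>
  <tr><td>Operators</td><td>Arithmetic and logical operators</td></tr>
</table>
<table>
  <tr><th>topic</th><th>details</th></tr>
  <tr><td>Django</td><td>Python web framework</td></tr>
</table>
"""

all_tables = pd.read_html(StringIO(html))                # every table on the page
matched = pd.read_html(StringIO(html), match='Django')   # tables with matching text only
print(len(all_tables), len(matched))
print(matched[0])
```

Without match the list holds both tables. With match='Django' only the table containing that text is kept.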
attrs
my_dict={'id': 'tb1'} # valid HTML table attributes
df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',attrs=my_dict)
Change the id value to tb2 and check the result.
There are multiple tables (11 in total) with the same 'class' attribute. read_html() returns all of them in the list, so we can pick any table by its position in the list. Here len() is used to get the number of tables having the matching attribute.
my_dict={'class': 'table table-striped'}
df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',attrs=my_dict)
print(len(df_l)) # 11
df=df_l[1] # Change this value to get different table
print(df)
Output
0 1
0 BooleanVar() Tkinter Variable for handling True / False data
1 IP IP address and host name in Python
2 Json Json methods to manage Json data formatting
3 tkinter Python GUI Module module
4 Turtle Draw graphics in Python
5 tuple Ordered unchangeable items list
6 Django Python web framework
7 Pickle Pickle or Un-pickle Python objects
8 Pillow Python Imageing Library : PIL
Keep changing this line and get different tables.
df=df_l[2] # Change this value to get different table
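A small offline sketch of the attrs option is given here. The three tables, their id and class values and the data are assumptions made for this example only.

```python
from io import StringIO
import pandas as pd

# Hypothetical page: two tables share a class, one table has only an id.
html = """
<table class="table table-striped">
  <tr><th>item</th><th>qty</th></tr>
  <tr><td>Pen</td><td>10</td></tr>
</table>
<table class="table table-striped">
  <tr><th>item</th><th>qty</th></tr>
  <tr><td>Book</td><td>4</td></tr>
</table>
<table id="tb1">
  <tr><th>item</th><th>qty</th></tr>
  <tr><td>Bag</td><td>2</td></tr>
</table>
"""

by_class = pd.read_html(StringIO(html), attrs={'class': 'table table-striped'})
by_id = pd.read_html(StringIO(html), attrs={'id': 'tb1'})
print(len(by_class), len(by_id))  # number of tables matching each attribute

for i, df in enumerate(by_class):  # walk through the list instead of editing the index by hand
    print(i)
    print(df)
```

Looping over the list is one way to check all matching tables at once instead of changing the index value each time.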
Parameters
io : Path, URL or file object (check the above examples), required
match : Regex or string to match; only tables containing matching text are returned (examples above)
flavor : Parsing engine to use: 'lxml', 'bs4' or 'html5lib' (you may have to install these libraries if not there)
header : The row number(s) to be used as column headers
index_col : The column to be used as index (see example above)
skiprows : Number of rows to skip
attrs : Valid HTML attributes passed as a dictionary to identify the table (see example above)
parse_dates : Parsing date columns
thousands : Separator used for thousands marking
encoding : Encoding to be used while reading the file
decimal : Character to be used as the decimal point (, is used in European data)
na_values : Additional values to be treated as NA
keep_default_na : Whether to keep the default NA values when na_values is given
displayed_only : Whether to parse only displayed elements (elements with display: none are skipped)
extract_links : Extract href values from the given part of the table (header, body, footer or all)
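Some of these parameters can be combined. Here is a sketch using index_col, thousands and decimal on European-style numbers. The table and its figures are invented for this example.

```python
from io import StringIO
import pandas as pd

# Hypothetical sales table written with . for thousands and , for decimals.
html = """
<table>
  <tr><th>id</th><th>city</th><th>sales</th></tr>
  <tr><td>1</td><td>Berlin</td><td>1.234,56</td></tr>
  <tr><td>2</td><td>Madrid</td><td>987,10</td></tr>
</table>
"""

df = pd.read_html(
    StringIO(html),
    index_col='id',   # use the id column as index
    thousands='.',    # . separates thousands in this data
    decimal=',',      # , marks the decimal point
)[0]
print(df)
print(df['sales'].dtype)  # numeric dtype when the separators are understood
```

thousands and decimal are set together here because the default thousands separator is a comma, which would clash with a comma used as the decimal point.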
Questions
What is the purpose of the read_html() function in Pandas?
How do you use the read_html() function to read data from an HTML table?
What is the return type of the read_html() function?
Can the read_html() function read multiple tables from a single HTML page?
What are the optional parameters that can be passed to the read_html() function?
How does the header parameter work in the read_html() function?
Can the read_html() function handle tables with merged cells or complex structures?
How can you specify a specific table to read when using the read_html() function?
Is it possible to read data from a remote HTML page using the read_html() function?
What happens if the HTML page contains non-tabular data along with the table?