read_html(): Read HTML tables into a list of DataFrame objects

import pandas as pd
df_l=pd.read_html('https://www.plus2net.com/php_tutorial/site_map-date.php') # returns a list of DataFrames
#print(type(df_l)) # <class 'list'>
df=df_l[0] # create a DataFrame from the list object
print(df.head()) # Top 5 rows from DataFrame
Output
             Function                             Description
0                Date                  PHP Date & Time object
1  createfromformat()                      Change date format
2         checkdate()                         Validating date
3              date()  Required date and time in given format
4       date_create()                   Creating date objects
read_html() reads the <table>, <tr>, <th> and <td> tags of the page and takes care of the colspan and rowspan attributes of the <td> and <th> cells.
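To see how spanned cells come out, here is a small sketch. The inline HTML table is made up for illustration (it is not taken from the page above) and is wrapped in StringIO so no network access is needed. The value of a cell carrying rowspan or colspan should be repeated in every position it covers.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: 'Four' spans two rows, 'not enrolled' spans two columns
html = """
<table>
  <tr><th>id</th><th>name</th><th>class</th></tr>
  <tr><td>1</td><td>Alex</td><td rowspan="2">Four</td></tr>
  <tr><td>2</td><td>Ronn</td></tr>
  <tr><td>3</td><td colspan="2">not enrolled</td></tr>
</table>
"""

df_span = pd.read_html(StringIO(html))[0]  # first (and only) table
print(df_span)
```

The second row gets 'Four' from the rowspan cell above it, and the third row shows 'not enrolled' in both the name and class columns.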

Creating a DataFrame from a local HTML file

Download sample student.html file
import pandas as pd
df_l=pd.read_html('C:\\data\\student.html',index_col='id') 
df=df_l[0] # creating dataframe from list
print(df.tail()) # Last five rows of DataFrame
Output
           name  class  mark  gender
id
31  Marry Toeey   Four    88    male
32    Binn Rott  Seven    90  female
33    Kenn Rein    Six    96  female
34     Gain Toe  Seven    69    male
35   Rows Noump    Six    88  female

Using a file object

We can also open the file, read its content and pass that content to read_html().
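from io import StringIO
import pandas as pd
with open('C:\\data\\student.html','r') as fob: # Open in read mode, closed automatically
    data=fob.read() # read the file data
# pandas 2.1 and later deprecate passing a raw HTML string, so wrap it in StringIO
df_l=pd.read_html(StringIO(data),index_col='id')
df=df_l[0]
print(df.tail())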
The output is the same as above.
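The three ways of supplying a local table (path, open file object, text wrapped in StringIO) can be compared with a short sketch. The file name sample.html and its rows are assumptions made for this example, written to a temporary folder so the code runs anywhere. The index column is picked by position (index_col=0) here.

```python
import os
import tempfile
from io import StringIO

import pandas as pd

# Hypothetical sample data, not the student.html file used above
html = """<table>
  <tr><th>id</th><th>name</th><th>mark</th></tr>
  <tr><td>1</td><td>Alex</td><td>70</td></tr>
  <tr><td>2</td><td>Ronn</td><td>85</td></tr>
</table>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.html")
    with open(path, "w", encoding="utf-8") as fob:
        fob.write(html)

    df_path = pd.read_html(path, index_col=0)[0]      # 1. from a path
    with open(path, "r", encoding="utf-8") as fob:
        df_file = pd.read_html(fob, index_col=0)[0]   # 2. from a file object

df_text = pd.read_html(StringIO(html), index_col=0)[0]  # 3. from text in StringIO

print(df_path.equals(df_file), df_path.equals(df_text))
```

All three calls should give the same DataFrame, so the choice depends only on where the HTML is coming from.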

match

When a page has more than one table, we can use the match option to collect only the tables containing text that matches a given string or regex. The Python home page of plus2net is one such page.
https://www.plus2net.com/python/site_map.php
There are multiple tables in this page. The commented lines below use different match values. Uncomment them one at a time and compare the results.
import pandas as pd
#df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='Operators')  
#df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='MySQL database')
#df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='Django')
df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',match='Pygsheet')
df=df_l[0] # create a DataFrame
print(df)
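The same idea can be tried without a network connection. This is a minimal sketch with two made-up tables (the HTML is an assumption for illustration, not the content of the page above). It also shows that read_html() raises ValueError when no table contains the matching text.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two small tables
html = """
<table>
  <tr><th>Operator</th><th>Meaning</th></tr>
  <tr><td>and</td><td>Logical AND</td></tr>
</table>
<table>
  <tr><th>Framework</th><th>Description</th></tr>
  <tr><td>Django</td><td>Python web framework</td></tr>
</table>
"""

all_tables = pd.read_html(StringIO(html))                      # no filter
django_tables = pd.read_html(StringIO(html), match="Django")   # plain string
regex_tables = pd.read_html(StringIO(html), match="Oper[a-z]+")  # regex

print(len(all_tables), len(django_tables), len(regex_tables))

try:
    pd.read_html(StringIO(html), match="Pygsheet")  # text not present
    no_match_error = False
except ValueError as err:
    no_match_error = True
    print("No table matched:", err)
```

Wrapping the call in try / except is useful when the page content can change and the searched text may disappear.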

attrs

import pandas as pd
my_dict={'id': 'tb1'} # valid HTML table attributes
df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',attrs=my_dict)
Change the id value to tb2 and check the result. The page has multiple tables ( total 11 ) with the same 'class' attribute, so read_html() returns all of them in the list and we can pick any table by its position in the list. Here len() is used to get the number of tables having the matching attribute.

my_dict={'class': 'table table-striped'}
df_l=pd.read_html('https://www.plus2net.com/python/site_map.php',attrs=my_dict)
print(len(df_l)) # 11
df=df_l[1] # Change this value to get different table 
print(df)
Output
             0                                                1
0  BooleanVar()  Tkinter Variable for handling True / False data
1            IP               IP address and host name in Python
2          Json      Json methods to manage Json data formatting
3       tkinter                         Python GUI Module module
4        Turtle                          Draw graphics in Python
5         tuple                  Ordered unchangeable items list
6        Django                             Python web framework
7        Pickle               Pickle or Un-pickle Python objects
8        Pillow                    Python Imageing Library : PIL
Change the index in this line to get the other tables.
df=df_l[2] # Change this value to get different table 
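Selecting by id and by class can be sketched offline as well. The three tables and their id values below are assumptions for illustration. An id should identify one table, while a shared class value can return several.

```python
from io import StringIO

import pandas as pd

# Hypothetical page: two tables share a class, each table has its own id
html = """
<table id="tb1" class="table table-striped">
  <tr><th>Topic</th><th>Description</th></tr>
  <tr><td>tuple</td><td>Ordered unchangeable items list</td></tr>
</table>
<table id="tb2" class="table table-striped">
  <tr><th>Topic</th><th>Description</th></tr>
  <tr><td>Django</td><td>Python web framework</td></tr>
</table>
<table id="tb3">
  <tr><th>Topic</th><th>Description</th></tr>
  <tr><td>Pickle</td><td>Pickle or Un-pickle Python objects</td></tr>
</table>
"""

by_id = pd.read_html(StringIO(html), attrs={"id": "tb2"})
by_class = pd.read_html(StringIO(html), attrs={"class": "table table-striped"})

print(len(by_id), len(by_class))
print(by_id[0])
```

Note that the class value is given here exactly as it is written in the HTML, including both class names.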

Parameters

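io : Path, URL or file object ( check the above examples ), required
match : Regex or string. Only tables containing matching text are returned ( examples above )
flavor : Parsing engine to use, 'lxml', 'bs4' or 'html5lib'. By default lxml is tried first and then bs4 with html5lib ( You may have to install these libraries if not there )
header : The row number ( or list of row numbers ) to be used as column headers
index_col : The column ( or list of columns ) to be used as index ( see example above )
skiprows : Number of rows to skip, or a list of row numbers to skip
attrs : Valid HTML attributes passed as a dictionary to identify the table ( see example above )
parse_dates : Parse date columns
thousands : Separator used for thousands marking, default is ,
encoding : Encoding to be used while reading the file.
decimal : Char to be used as decimal point, default is . ( , is used in European data )
na_values : Additional values to be treated as NA
keep_default_na : When na_values is given, True adds them to the default NA values and False replaces the default NA values.
displayed_only : Whether to skip elements hidden with display: none, default True
extract_links : Table section ( 'header', 'body', 'footer' or 'all' ) from which href values of links are extracted, default None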
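A few of these parameters can be combined in one call. The sketch below uses a made-up price table in European number format (the data and the 'missing' marker are assumptions for illustration) to show thousands, decimal and na_values working together.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: '.' marks thousands, ',' is the decimal point
html = """
<table>
  <tr><th>item</th><th>price</th><th>stock</th></tr>
  <tr><td>Laptop</td><td>1.234,50</td><td>12</td></tr>
  <tr><td>Mouse</td><td>19,99</td><td>missing</td></tr>
</table>
"""

df_eu = pd.read_html(
    StringIO(html),
    thousands=".",          # remove '.' used as thousands separator
    decimal=",",            # read ',' as the decimal point
    na_values=["missing"],  # treat this text as NA
)[0]

print(df_eu)
print(df_eu.dtypes)
```

With these options the price column should be read as numbers rather than text, and the 'missing' entry in stock becomes NaN.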

