Cleaning Data : dropna() thresh option

  1. Keep only the rows having 2 or more valid data
  2. Keep only the rows having 3 or more valid data
  3. Keep only columns where 11 or more than 11 valid data is available
  4. Keep only rows where 70% or more valid data is available
  5. Keep only columns where 80% or more valid data is available
           name class1  mark     gender
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
NaN          NaN    NaN   NaN     NaN
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female
13.0         Ron    NaN  55.0     NaN
14.0        King    NaN   NaN     NaN
You can create DataFrame from the excel file and you can see from above figures that there are some blank data.
Download student-dropna_1.xlsx file
How to read the excel file and create the DataFrame?.
import pandas as pd 
df = pd.read_excel('D:\student-dropna_1.xlsx',index_col='id')


thresh: removing rows and columns based on number of missing values #C02


Keep only the rows having 2 or more valid data

We kept axis=0 (default value for axis) so we can delete rows. thresh=2 so 2 or more valid data is required to keep the row.
Based on this you can see row No 8 and row No 14 is removed as they have 4 and 3 blank data cells. Here is the code to delete these two rows.
df=df.dropna(how='any',axis=0,thresh=2)
TypeError: You cannot set both the how and thresh arguments at the same time.
Here is the corrected line witout how='any'
df=df.dropna(axis=0,thresh=2)
Output
            name class1  mark     gender
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female
13.0         Ron    NaN  55.0     NaN

Keep only the rows having 3 or more valid data

Now let us raise the limit to 3 valid data. You can see the row No 13 is to be removed as it has only two valid data cells( we are not considering the index column ).
df=df.dropna(axis=0,thresh=3)
Output
            name class1  mark     gender
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female
In above code you can see the 13th row is removed.

Keep only columns where 11 or more valid data is available

In our DataFrame gender column has 10 valid data. If we keep 11 as thresh value to retain the columns then the column gender will be removed. To check the columns we will be using axis=1
df=df.dropna(axis=1,thresh=11) 
Output
            name class1  mark
id                           
1.0     John Deo   Four  75.0
2.0     Max Ruin  Three  85.0
NaN       Arnold  Three  55.0
4.0   Krish Star   Four  60.0
NaN     John Mik   Four  60.0
6.0      Alex Jo   Four  55.0
7.0      My John   Five  78.0
NaN          NaN    NaN   NaN
9.0      Tes Qry    Six  78.0
10.0         NaN   Four  55.0
11.0      Ronald    Six   NaN
12.0       Recky    Six  94.0
13.0         Ron    NaN  55.0
14.0        King    NaN   NaN

Keep only rows where 70% or more valid data is available

To get the details about number of rows or columns we will use shape.
df.shape[0] Number of rows
df.shape[1] Number of columns.
df=df.dropna(axis=0,thresh=df.shape[1]*0.7)
Output
            name class1  mark     gender
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female

Keep only columns where 80% or more valid data is available

df=df.dropna(axis=1,thresh=df.shape[0]*0.8)
Output
            name
id              
1.0     John Deo
2.0     Max Ruin
NaN       Arnold
4.0   Krish Star
NaN     John Mik
6.0      Alex Jo
7.0      My John
NaN          NaN
9.0      Tes Qry
10.0         NaN
11.0      Ronald
12.0       Recky
13.0         Ron
14.0        King

inplace

Default vlaue is False, so the original DataFrame is not changed. To change the source dataframe df we can use this parameter.
df.dropna(axis=0,thresh=2,inplace=True)
print(df) # source dataframe is changed.

Questions

Answers

Data Cleaning dropna()
contains() to display and delete row based on Conditions
Pandas columns() add_prefix() add_suffix()
Subscribe to our YouTube Channel here


Subscribe

* indicates required
Subscribe to plus2net

    plus2net.com



    Post your comments , suggestion , error , requirements etc here





    Python Video Tutorials
    Python SQLite Video Tutorials
    Python MySQL Video Tutorials
    Python Tkinter Video Tutorials
    We use cookies to improve your browsing experience. . Learn more
    HTML MySQL PHP JavaScript ASP Photoshop Articles FORUM . Contact us
    ©2000-2024 plus2net.com All rights reserved worldwide Privacy Policy Disclaimer