Cleaning of data using dropna() thresh option

Keep only the rows having 2 or more valid data
Keep only the rows having 3 or more valid data
Keep only columns where 11 or more than 11 valid data is available
Keep only rows where 70% or more valid data is available
Keep only columns where 80% or more valid data is available

           name class1  mark     gender
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
NaN          NaN    NaN   NaN     NaN
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female
13.0         Ron    NaN  55.0     NaN
14.0        King    NaN   NaN     NaN

You can create DataFrame from the excel file and you can see from above figures that there are some blank data.
Download student-dropna_1.xlsx file
How to read the excel file and create the DataFrame?.

import pandas as pd 
df = pd.read_excel('D:\student-dropna_1.xlsx',index_col='id')

We used id column as index_col so the data in index_col is not included in our calculation of valid data and we can't delete the index column.

thresh: removing rows and columns based on number of missing values #C02

Keep only the rows having 2 or more valid data

We kept axis=0 (default value for axis) so we can delete rows. thresh=2 so 2 or more valid data is required to keep the row.
Based on this you can see row No 8 and row No 14 is removed as they have 4 and 3 blank data cells. Here is the code to delete these two rows.

df=df.dropna(how='any',axis=0,thresh=2)

TypeError: You cannot set both the how and thresh arguments at the same time.
Here is the corrected line witout how='any'

df=df.dropna(axis=0,thresh=2)

Output

            name class1  mark     gender
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female
13.0         Ron    NaN  55.0     NaN

Keep only the rows having 3 or more valid data

Now let us raise the limit to 3 valid data. You can see the row No 13 is to be removed as it has only two valid data cells( we are not considering the index column ).

df=df.dropna(axis=0,thresh=3)

Output

            name class1  mark     gender
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female

In above code you can see the 13th row is removed.

Keep only columns where 11 or more valid data is available

In our DataFrame gender column has 10 valid data. If we keep 11 as thresh value to retain the columns then the column gender will be removed. To check the columns we will be using axis=1

df=df.dropna(axis=1,thresh=11)

Output

            name class1  mark
id                           
1.0     John Deo   Four  75.0
2.0     Max Ruin  Three  85.0
NaN       Arnold  Three  55.0
4.0   Krish Star   Four  60.0
NaN     John Mik   Four  60.0
6.0      Alex Jo   Four  55.0
7.0      My John   Five  78.0
NaN          NaN    NaN   NaN
9.0      Tes Qry    Six  78.0
10.0         NaN   Four  55.0
11.0      Ronald    Six   NaN
12.0       Recky    Six  94.0
13.0         Ron    NaN  55.0
14.0        King    NaN   NaN

Keep only rows where 70% or more valid data is available

To get the details about number of rows or columns we will use shape.
df.shape[0] Number of rows
df.shape[1] Number of columns.

df=df.dropna(axis=0,thresh=df.shape[1]*0.7)

Output

            name class1  mark     gender
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female

Keep only columns where 80% or more valid data is available

df=df.dropna(axis=1,thresh=df.shape[0]*0.8)

Output

            name
id              
1.0     John Deo
2.0     Max Ruin
NaN       Arnold
4.0   Krish Star
NaN     John Mik
6.0      Alex Jo
7.0      My John
NaN          NaN
9.0      Tes Qry
10.0         NaN
11.0      Ronald
12.0       Recky
13.0         Ron
14.0        King

inplace

Default vlaue is False, so the original DataFrame is not changed. To change the source dataframe df we can use this parameter.

df.dropna(axis=0,thresh=2,inplace=True)
print(df) # source dataframe is changed.

Questions

What is the thresh parameter in the dropna() function in Pandas?
How does the thresh parameter work?
What are the different values that the thresh parameter can take?
What are the advantages and disadvantages of using the thresh parameter?

Answers

The thresh parameter in the dropna() function in Pandas specifies the minimum number of non-NA values in a row or column for it to be considered in the final result. Any row or column with the number of non-NA values less than the thresh value is removed irrespective of other parameters passed.
The thresh parameter works by counting the number of non-NA values in each row or column. If the number of non-NA values is greater than or equal to the thresh value, then the row or column is kept. Otherwise, the row or column is dropped.
The thresh parameter can take any integer value. For example, if thresh=2, then only rows or columns with at least 2 non-NA values will be kept.
The advantages of using the thresh parameter include:
* It allows you to control the number of rows or columns that are dropped.
* It can be used to remove rows or columns that have a large percentage of missing values.
The disadvantages of using the thresh parameter include:
* It can be difficult to determine the optimal value for thresh.
* If thresh is set too high, then you may end up dropping rows or columns that you do not want to drop.

Data Cleaning dropna()
contains() to display and delete row based on Conditions
Pandas columns() add_prefix() add_suffix()

Numpy arrays Python & MySQL Python- Tutorials

Subscribe to our YouTube Channel here

Cleaning Data : dropna() thresh option

Keep only the rows having 2 or more valid data

Keep only the rows having 3 or more valid data

Keep only columns where 11 or more valid data is available

Keep only rows where 70% or more valid data is available

Keep only columns where 80% or more valid data is available

inplace

Questions

Answers

Subscribe