Keep only columns where 11 or more than 11 valid data is available
Keep only rows where 70% or more valid data is available
Keep only columns where 80% or more valid data is available
name class1 mark gender
id
1.0 John Deo Four 75.0 female
2.0 Max Ruin Three 85.0 male
NaN Arnold Three 55.0 male
4.0 Krish Star Four 60.0 female
NaN John Mik Four 60.0 female
6.0 Alex Jo Four 55.0 NaN
7.0 My John Five 78.0 male
NaN NaN NaN NaN NaN
9.0 Tes Qry Six 78.0 male
10.0 NaN Four 55.0 female
11.0 Ronald Six NaN female
12.0 Recky Six 94.0 female
13.0 Ron NaN 55.0 NaN
14.0 King NaN NaN NaN
import pandas as pd
df = pd.read_excel('D:\student-dropna_1.xlsx',index_col='id')
We used id column as index_col so the data in index_col is not included in our calculation of valid data and we can't delete the index column.
thresh: removing rows and columns based on number of missing values #C02
Keep only the rows having 2 or more valid data
We kept axis=0 (default value for axis) so we can delete rows. thresh=2 so 2 or more valid data is required to keep the row.
Based on this you can see row No 8 and row No 14 is removed as they have 4 and 3 blank data cells. Here is the code to delete these two rows.
df=df.dropna(how='any',axis=0,thresh=2)
TypeError: You cannot set both the how and thresh arguments at the same time. Here is the corrected line witout how='any'
df=df.dropna(axis=0,thresh=2)
Output
name class1 mark gender
id
1.0 John Deo Four 75.0 female
2.0 Max Ruin Three 85.0 male
NaN Arnold Three 55.0 male
4.0 Krish Star Four 60.0 female
NaN John Mik Four 60.0 female
6.0 Alex Jo Four 55.0 NaN
7.0 My John Five 78.0 male
9.0 Tes Qry Six 78.0 male
10.0 NaN Four 55.0 female
11.0 Ronald Six NaN female
12.0 Recky Six 94.0 female
13.0 Ron NaN 55.0 NaN
Keep only the rows having 3 or more valid data
Now let us raise the limit to 3 valid data. You can see the row No 13 is to be removed as it has only two valid data cells( we are not considering the index column ).
df=df.dropna(axis=0,thresh=3)
Output
name class1 mark gender
id
1.0 John Deo Four 75.0 female
2.0 Max Ruin Three 85.0 male
NaN Arnold Three 55.0 male
4.0 Krish Star Four 60.0 female
NaN John Mik Four 60.0 female
6.0 Alex Jo Four 55.0 NaN
7.0 My John Five 78.0 male
9.0 Tes Qry Six 78.0 male
10.0 NaN Four 55.0 female
11.0 Ronald Six NaN female
12.0 Recky Six 94.0 female
In above code you can see the 13th row is removed.
Keep only columns where 11 or more valid data is available
In our DataFrame gender column has 10 valid data. If we keep 11 as thresh value to retain the columns then the column gender will be removed. To check the columns we will be using axis=1
df=df.dropna(axis=1,thresh=11)
Output
name class1 mark
id
1.0 John Deo Four 75.0
2.0 Max Ruin Three 85.0
NaN Arnold Three 55.0
4.0 Krish Star Four 60.0
NaN John Mik Four 60.0
6.0 Alex Jo Four 55.0
7.0 My John Five 78.0
NaN NaN NaN NaN
9.0 Tes Qry Six 78.0
10.0 NaN Four 55.0
11.0 Ronald Six NaN
12.0 Recky Six 94.0
13.0 Ron NaN 55.0
14.0 King NaN NaN
Keep only rows where 70% or more valid data is available
To get the details about number of rows or columns we will use shape. df.shape[0] Number of rows df.shape[1] Number of columns.
df=df.dropna(axis=0,thresh=df.shape[1]*0.7)
Output
name class1 mark gender
id
1.0 John Deo Four 75.0 female
2.0 Max Ruin Three 85.0 male
NaN Arnold Three 55.0 male
4.0 Krish Star Four 60.0 female
NaN John Mik Four 60.0 female
6.0 Alex Jo Four 55.0 NaN
7.0 My John Five 78.0 male
9.0 Tes Qry Six 78.0 male
10.0 NaN Four 55.0 female
11.0 Ronald Six NaN female
12.0 Recky Six 94.0 female
Keep only columns where 80% or more valid data is available
df=df.dropna(axis=1,thresh=df.shape[0]*0.8)
Output
name
id
1.0 John Deo
2.0 Max Ruin
NaN Arnold
4.0 Krish Star
NaN John Mik
6.0 Alex Jo
7.0 My John
NaN NaN
9.0 Tes Qry
10.0 NaN
11.0 Ronald
12.0 Recky
13.0 Ron
14.0 King
inplace
Default vlaue is False, so the original DataFrame is not changed. To change the source dataframe df we can use this parameter.
df.dropna(axis=0,thresh=2,inplace=True)
print(df) # source dataframe is changed.
Questions
What is the thresh parameter in the dropna() function in Pandas?
How does the thresh parameter work?
What are the different values that the thresh parameter can take?
What are the advantages and disadvantages of using the thresh parameter?
Answers
The thresh parameter in the dropna() function in Pandas specifies the minimum number of non-NA values in a row or column for it to be considered in the final result. Any row or column with the number of non-NA values less than the thresh value is removed irrespective of other parameters passed.
The thresh parameter works by counting the number of non-NA values in each row or column. If the number of non-NA values is greater than or equal to the thresh value, then the row or column is kept. Otherwise, the row or column is dropped.
The thresh parameter can take any integer value. For example, if thresh=2, then only rows or columns with at least 2 non-NA values will be kept.
The advantages of using the thresh parameter include:
* It allows you to control the number of rows or columns that are dropped.
* It can be used to remove rows or columns that have a large percentage of missing values.
The disadvantages of using the thresh parameter include:
* It can be difficult to determine the optimal value for thresh.
* If thresh is set too high, then you may end up dropping rows or columns that you do not want to drop.