« Pandas « Data Cleaning
- Keep only the rows having 2 or more valid data
- Keep only the rows having 3 or more valid data
- Keep only columns where 11 or more than 11 valid data is available
- Keep only rows where 70% or more valid data is available
- Keep only columns where 80% or more valid data is available
name class1 mark sex
id
1.0 John Deo Four 75.0 female
2.0 Max Ruin Three 85.0 male
NaN Arnold Three 55.0 male
4.0 Krish Star Four 60.0 female
NaN John Mik Four 60.0 female
6.0 Alex Jo Four 55.0 NaN
7.0 My John Five 78.0 male
NaN NaN NaN NaN NaN
9.0 Tes Qry Six 78.0 male
10.0 NaN Four 55.0 female
11.0 Ronald Six NaN female
12.0 Recky Six 94.0 female
13.0 Ron NaN 55.0 NaN
14.0 King NaN NaN NaN
|
 |
You can create DataFrame from the excel file and you can see from above figures that there are some blank data.
import pandas as pd
my_data = pd.read_excel('D:\student-dropna_1.xlsx',index_col='id')
We used id column as index_col so the data in index_col is not included in our calculation of valid data and we can't delete the index column.
Data is kept in an excel file, we can read the excel file and create the DataFrame.
«dropna() : Deleting rows or columns based on NaN or blank data
Download student-dropna_1.xlsx file ⇓
Keep only the rows having 2 or more valid data
We kept axis=0
so we can delete rows. thresh=2
so 2 or more valid data is required to keep the row.
Based on this you can see row No 8 and row No 14 is removed as they have 4 and 3 blank data cells. Here is the code to delete these two rows.
my_data=my_data.dropna(how='any',axis=0,thresh=2)
Output
name class1 mark sex
id
1.0 John Deo Four 75.0 female
2.0 Max Ruin Three 85.0 male
NaN Arnold Three 55.0 male
4.0 Krish Star Four 60.0 female
NaN John Mik Four 60.0 female
6.0 Alex Jo Four 55.0 NaN
7.0 My John Five 78.0 male
9.0 Tes Qry Six 78.0 male
10.0 NaN Four 55.0 female
11.0 Ronald Six NaN female
12.0 Recky Six 94.0 female
13.0 Ron NaN 55.0 NaN
Keep only the rows having 3 or more valid data
In abobe code two rows are removed as they were not having minimum 2 valid data. Now let us raise the limit to 3 valid data. Now we require 3 or more valid data cells in a row. You can see the row No 13 is to be removed as it has only two valid data cells( we are not considering the index column ).
my_data=my_data.dropna(how='any',axis=0,thresh=3)
Output
name class1 mark sex
id
1.0 John Deo Four 75.0 female
2.0 Max Ruin Three 85.0 male
NaN Arnold Three 55.0 male
4.0 Krish Star Four 60.0 female
NaN John Mik Four 60.0 female
6.0 Alex Jo Four 55.0 NaN
7.0 My John Five 78.0 male
9.0 Tes Qry Six 78.0 male
10.0 NaN Four 55.0 female
11.0 Ronald Six NaN female
12.0 Recky Six 94.0 female
In above code you can see the 13th row is removed.
Keep only columns where 11 or more valid data is available
In our DataFrame sex column has 10 valid data. If we keep 11 as thresh value to retain the columns then the column sex will be removed. To check the columns we will be using axis=1
my_data=my_data.dropna(how='any',axis=1,thresh=11)
Output
name class1 mark
id
1.0 John Deo Four 75.0
2.0 Max Ruin Three 85.0
NaN Arnold Three 55.0
4.0 Krish Star Four 60.0
NaN John Mik Four 60.0
6.0 Alex Jo Four 55.0
7.0 My John Five 78.0
NaN NaN NaN NaN
9.0 Tes Qry Six 78.0
10.0 NaN Four 55.0
11.0 Ronald Six NaN
12.0 Recky Six 94.0
13.0 Ron NaN 55.0
14.0 King NaN NaN
Keep only rows where 70% or more valid data is available
To get the details about number of rows or columns we will use shape. Number of rows is my_data.shape[0]
and number of columns is my_data.shape[1]
my_data=my_data.dropna(how='any',axis=0,thresh=my_data.shape[1]*0.7)
Output
name class1 mark sex
id
1.0 John Deo Four 75.0 female
2.0 Max Ruin Three 85.0 male
NaN Arnold Three 55.0 male
4.0 Krish Star Four 60.0 female
NaN John Mik Four 60.0 female
6.0 Alex Jo Four 55.0 NaN
7.0 My John Five 78.0 male
9.0 Tes Qry Six 78.0 male
10.0 NaN Four 55.0 female
11.0 Ronald Six NaN female
12.0 Recky Six 94.0 female
Keep only columns where 80% or more valid data is available
my_data=my_data.dropna(how='any',axis=1,thresh=my_data.shape[0]*0.8)
Output
name
id
1.0 John Deo
2.0 Max Ruin
NaN Arnold
4.0 Krish Star
NaN John Mik
6.0 Alex Jo
7.0 My John
NaN NaN
9.0 Tes Qry
10.0 NaN
11.0 Ronald
12.0 Recky
13.0 Ron
14.0 King
Data Cleaning
contains() to display and delete row based on Conditions »
« Pandas
columns() add_prefix() add_suffix()
← Subscribe to our YouTube Channel here