Cleaning Data : dropna() thresh option

Pandas Data Cleaning

  1. Keep only the rows having 2 or more valid data
  2. Keep only the rows having 3 or more valid data
  3. Keep only columns where 11 or more than 11 valid data is available
  4. Keep only rows where 70% or more valid data is available
  5. Keep only columns where 80% or more valid data is available
           name class1  mark     sex
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
NaN          NaN    NaN   NaN     NaN
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female
13.0         Ron    NaN  55.0     NaN
14.0        King    NaN   NaN     NaN


Cleaning of Data
You can create DataFrame from the excel file and you can see from above figures that there are some blank data.
import pandas as pd 
my_data = pd.read_excel('D:\student-dropna_1.xlsx',index_col='id')
Data is kept in an excel file, we can read the excel file and create the DataFrame.
dropna() : Deleting rows or columns based on NaN or blank data
Download student-dropna_1.xlsx file

Keep only the rows having 2 or more valid data

We kept axis=0 so we can delete rows. thresh=2 so 2 or more valid data is required to keep the row.
Based on this you can see row No 8 and row No 14 is removed as they have 4 and 3 blank data cells. Here is the code to delete these two rows.
my_data=my_data.dropna(how='any',axis=0,thresh=2)
Output
            name class1  mark     sex
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female
13.0         Ron    NaN  55.0     NaN

Keep only the rows having 3 or more valid data

In abobe code two rows are removed as they were not having minimum 2 valid data. Now let us raise the limit to 3 valid data. Now we require 3 or more valid data cells in a row. You can see the row No 13 is to be removed as it has only two valid data cells( we are not considering the index column ).
my_data=my_data.dropna(how='any',axis=0,thresh=3)
Output
            name class1  mark     sex
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female
In above code you can see the 13th row is removed.

Keep only columns where 11 or more valid data is available

In our DataFrame sex column has 10 valid data. If we keep 11 as thresh value to retain the columns then the column sex will be removed. To check the columns we will be using axis=1
my_data=my_data.dropna(how='any',axis=1,thresh=11) 
Output
            name class1  mark
id                           
1.0     John Deo   Four  75.0
2.0     Max Ruin  Three  85.0
NaN       Arnold  Three  55.0
4.0   Krish Star   Four  60.0
NaN     John Mik   Four  60.0
6.0      Alex Jo   Four  55.0
7.0      My John   Five  78.0
NaN          NaN    NaN   NaN
9.0      Tes Qry    Six  78.0
10.0         NaN   Four  55.0
11.0      Ronald    Six   NaN
12.0       Recky    Six  94.0
13.0         Ron    NaN  55.0
14.0        King    NaN   NaN

Keep only rows where 70% or more valid data is available

To get the details about number of rows or columns we will use shape. Number of rows is my_data.shape[0] and number of columns is my_data.shape[1]
my_data=my_data.dropna(how='any',axis=0,thresh=my_data.shape[1]*0.7)
Output
            name class1  mark     sex
id                                   
1.0     John Deo   Four  75.0  female
2.0     Max Ruin  Three  85.0    male
NaN       Arnold  Three  55.0    male
4.0   Krish Star   Four  60.0  female
NaN     John Mik   Four  60.0  female
6.0      Alex Jo   Four  55.0     NaN
7.0      My John   Five  78.0    male
9.0      Tes Qry    Six  78.0    male
10.0         NaN   Four  55.0  female
11.0      Ronald    Six   NaN  female
12.0       Recky    Six  94.0  female

Keep only columns where 80% or more valid data is available

my_data=my_data.dropna(how='any',axis=1,thresh=my_data.shape[0]*0.8)
Output
            name
id              
1.0     John Deo
2.0     Max Ruin
NaN       Arnold
4.0   Krish Star
NaN     John Mik
6.0      Alex Jo
7.0      My John
NaN          NaN
9.0      Tes Qry
10.0         NaN
11.0      Ronald
12.0       Recky
13.0         Ron
14.0        King
Data Cleaning
Pandas columns() add_prefix() add_suffix()


plus2net.com



Post your comments , suggestion , error , requirements etc here




We use cookies to improve your browsing experience. . Learn more
HTML MySQL PHP JavaScript ASP Photoshop Articles FORUM . Contact us
©2000-2020 plus2net.com All rights reserved worldwide Privacy Policy Disclaimer