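DataFrame.duplicated(subset=None, keep='first')
subset | Optional, column label or list of labels to compare. By default all columns are used.
keep | Optional,
'first' : default, all duplicates are marked True except the first one
'last' : all duplicates are marked True except the last one
False : (the boolean, not the string 'False') all duplicates are marked True
|
Series : indicates duplicate values.
DataFrame : indicates duplicate rows (the check can be limited to selected columns).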
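The three keep options are easiest to compare side by side. The snippet below is a minimal sketch on a small made-up DataFrame (the names and marks are illustrative only and are not the sample data used later on this page).

```python
import pandas as pd

# Illustrative data: rows 0, 2 and 4 are identical.
df = pd.DataFrame({
    'name': ['John', 'Max', 'John', 'Krish', 'John'],
    'mark': [75, 85, 75, 60, 75],
})

first_flags = df.duplicated(keep='first')  # default: first occurrence stays False
last_flags = df.duplicated(keep='last')    # last occurrence stays False
all_flags = df.duplicated(keep=False)      # every member of a duplicate group is True

print(first_flags.tolist())  # [False, False, True, False, True]
print(last_flags.tolist())   # [True, False, True, False, False]
print(all_flags.tolist())    # [True, False, True, False, True]
```

Note that keep=False takes the boolean False; passing the string 'False' is not a valid option.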
Using DataFrame
Here is a sample DataFrame.
import pandas as pd
my_dict={
'id':[1,2,3,4,5,4,2],
'name':['John','Max','Arnold','Krish','John','Krish','Max'],
'class1':['Four','Three','Three','Four','Four','Four','Three'],
'mark':[75,85,55,60,60,60,85],
'sex':['female','male','male','female','female','female','male']
}
my_data = pd.DataFrame(data=my_dict)
print(my_data)
Output (the last two rows are duplicates: row 6 is a duplicate of row 1 and row 5 is a duplicate of row 3)
id name class1 mark sex
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
4 5 John Four 60 female
5 4 Krish Four 60 female
6 2 Max Three 85 male
Display rows indicating duplicates
print(my_data.duplicated())
Output
0 False
1 False
2 False
3 False
4 False
5 True
6 True
dtype: bool
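We can add a column holding the above status. assign() returns a new DataFrame, so my_data itself stays unchanged for the examples that follow.
my_data_status = my_data.assign(status=my_data.duplicated())
print(my_data_status)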
Display duplicate rows only
print(my_data[my_data.duplicated()])
Output
id name class1 mark sex
5 4 Krish Four 60 female
6 2 Max Three 85 male
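The filter above shows only the later copies. As a hedged sketch of two related checks, the snippet below counts the duplicate rows and uses keep=False to list every row that belongs to a duplicate group, including the first occurrence. The variable names dup_count and all_copies are chosen here for illustration.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'sex': ['female', 'male', 'male', 'female', 'female', 'female', 'male'],
}
my_data = pd.DataFrame(data=my_dict)

# True counts as 1, so summing the boolean Series counts the duplicate rows.
dup_count = int(my_data.duplicated().sum())
print(dup_count)  # 2

# keep=False marks the originals as well as their later copies.
all_copies = my_data[my_data.duplicated(keep=False)]
print(all_copies.index.tolist())  # [1, 3, 5, 6]
```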
Display unique rows
print(my_data[~my_data.duplicated()])
Output (rows 5 and 6 are excluded)
id name class1 mark sex
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
4 5 John Four 60 female
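Filtering with the inverted mask should give the same result as calling drop_duplicates() with its default settings. The following sketch checks that on the sample data; the names unique_by_mask and unique_by_drop are illustrative.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'sex': ['female', 'male', 'male', 'female', 'female', 'female', 'male'],
}
my_data = pd.DataFrame(data=my_dict)

# ~ inverts the boolean mask, keeping rows that are not flagged as duplicates.
unique_by_mask = my_data[~my_data.duplicated()]

# drop_duplicates() with default arguments keeps the first occurrence of each row.
unique_by_drop = my_data.drop_duplicates()

print(unique_by_mask.equals(unique_by_drop))  # True
print(unique_by_drop.index.tolist())          # [0, 1, 2, 3, 4]
```

The mask form is useful when the boolean flags are needed for something else as well; drop_duplicates() is shorter when only the cleaned rows matter.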
Display rows based on unique values of a column
Here we check only the class1 column: the first row for each distinct class1 value is kept and every later row with the same value is treated as a duplicate.
print(my_data[~my_data['class1'].duplicated(keep='first')])
Output
id name class1 mark sex
0 1 John Four 75 female
1 2 Max Three 85 male
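The same column-based check can also be written with the subset parameter of DataFrame.duplicated(), which accepts more than one column. This is a small sketch on the sample data; by_series, by_subset and by_pair are names introduced here for illustration.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'sex': ['female', 'male', 'male', 'female', 'female', 'female', 'male'],
}
my_data = pd.DataFrame(data=my_dict)

# Series-based check on one column, as in the example above.
by_series = my_data[~my_data['class1'].duplicated(keep='first')]

# Equivalent DataFrame-level check limited to the class1 column.
by_subset = my_data[~my_data.duplicated(subset=['class1'], keep='first')]
print(by_series.equals(by_subset))  # True

# Rows are compared on the (name, class1) pair only.
by_pair = my_data[~my_data.duplicated(subset=['name', 'class1'])]
print(by_pair.index.tolist())  # [0, 1, 2, 3]
```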