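DataFrame.drop_duplicates(keep)
keep : Optional. Decides which occurrence of a duplicate is retained.
'first' (default) : delete all duplicate rows except the first occurrence
'last' : delete all duplicate rows except the last occurrence
False (the boolean, not the string 'False') : delete every row that has a duplicate, keeping none of them
Series : deletes duplicate values.
DataFrame : deletes duplicate rows. The comparison can be limited to some columns by using the subset parameter.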
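As noted above, a Series drops duplicate values rather than rows. The short sketch below shows the three keep options on a Series; the values in s are illustrative and are not part of the sample DataFrame used in this tutorial.

```python
import pandas as pd

# Illustrative Series with repeated values (made up for this sketch)
s = pd.Series(['Four', 'Three', 'Three', 'Four', 'Five'])

first = s.drop_duplicates()             # keep='first' is the default
last = s.drop_duplicates(keep='last')   # keep the last occurrence instead
none = s.drop_duplicates(keep=False)    # drop every value that repeats

print(first.tolist())  # ['Four', 'Three', 'Five']
print(last.tolist())   # ['Three', 'Four', 'Five']
print(none.tolist())   # ['Five']
```

The original index labels are retained in each result, just as they are for a DataFrame.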
Using DataFrame
Here is a sample DataFrame.
import pandas as pd
my_dict={
'id':[1,2,3,4,5,4,2],
'name':['John','Max','Arnold','Krish','John','Krish','Max'],
'class1':['Four','Three','Three','Four','Four','Four','Three'],
'mark':[75,85,55,60,60,60,85],
'sex':['female','male','male','female','female','female','male']
}
my_data = pd.DataFrame(data=my_dict)
print(my_data)
Output ( the last two rows are duplicates: the row at index 6 repeats the row at index 1, and the row at index 5 repeats the row at index 3 )
id name class1 mark sex
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
4 5 John Four 60 female
5 4 Krish Four 60 female
6 2 Max Three 85 male
Delete duplicate rows after first occurrence
Remove all duplicate rows but keep the first occurrence. This is keep='first', which is the default value of keep, so the argument can be omitted.
import pandas as pd
my_dict={
'id':[1,2,3,4,5,4,2],
'name':['John','Max','Arnold','Krish','John','Krish','Max'],
'class1':['Four','Three','Three','Four','Four','Four','Three'],
'mark':[75,85,55,60,60,60,85],
'sex':['female','male','male','female','female','female','male']
}
my_data = pd.DataFrame(data=my_dict)
df=my_data.drop_duplicates(keep='first')
print(df)
Output : Note that we have assigned the output to a new DataFrame df because by default inplace=False ( explained below ). The original index labels of the retained rows are kept.
id name class1 mark sex
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
4 5 John Four 60 female
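By default drop_duplicates() compares every column, so two rows count as duplicates only when all values match. The sketch below shows how the subset parameter limits the comparison to chosen columns; the column choices here are only for illustration.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'sex': ['female', 'male', 'male', 'female', 'female', 'female', 'male']
}
my_data = pd.DataFrame(data=my_dict)

# Compare only the 'name' column: rows 0 and 4 are both 'John', so row 4
# is now treated as a duplicate even though its id and mark differ.
by_name = my_data.drop_duplicates(subset=['name'])
print(by_name.index.tolist())  # [0, 1, 2, 3]

# Compare on two columns together: only the first row of each
# (class1, sex) combination is kept.
by_class_sex = my_data.drop_duplicates(subset=['class1', 'sex'])
print(by_class_sex.index.tolist())  # [0, 1]
```

subset can be combined with keep, for example subset=['name'], keep='last'.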
Delete duplicate rows but keep last occurrence
keep='last'
df=my_data.drop_duplicates(keep='last')
print(df)
Output
id name class1 mark sex
0 1 John Four 75 female
2 3 Arnold Three 55 male
4 5 John Four 60 female
5 4 Krish Four 60 female
6 2 Max Three 85 male
Delete all occurrences of duplicate rows
keep=False
df=my_data.drop_duplicates(keep=False)
print(df)
Output ( every row that has a duplicate is deleted, including its first occurrence )
id name class1 mark sex
0 1 John Four 75 female
2 3 Arnold Three 55 male
4 5 John Four 60 female
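To check which rows each keep value removes before deleting anything, duplicated() can be used, since it accepts the same keep argument and returns a boolean mask. This is a minimal sketch using the sample DataFrame; the names dropped_rows and matches are introduced here only for illustration.

```python
import pandas as pd

my_data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'sex': ['female', 'male', 'male', 'female', 'female', 'female', 'male']
})

dropped_rows = {}  # keep value -> index labels that would be removed
matches = {}       # keep value -> does filtering by the mask equal drop_duplicates()?

for keep in ('first', 'last', False):
    mask = my_data.duplicated(keep=keep)  # True marks a row to be dropped
    dropped_rows[keep] = my_data.index[mask].tolist()
    matches[keep] = my_data[~mask].equals(my_data.drop_duplicates(keep=keep))
    print(keep, dropped_rows[keep], matches[keep])

# first [5, 6] True
# last [1, 3] True
# False [1, 3, 5, 6] True
```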
inplace=True
By default inplace=False, so our main DataFrame my_data is not altered when we use drop_duplicates(). That is why the code above stores the output of drop_duplicates() in another DataFrame, df. By using inplace=True we can modify our main DataFrame my_data directly.
my_data.drop_duplicates(inplace=True)
print(my_data)
Output
id name class1 mark sex
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
4 5 John Four 60 female
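Two details are easy to miss here, and the sketch below illustrates both. With inplace=True the method returns None, so its result should not be assigned back to a variable. Also, the surviving rows keep their original index labels unless ignore_index=True is passed, a parameter that assumes pandas 1.0 or later. The names cleaned and result are used only in this example.

```python
import pandas as pd

my_data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'sex': ['female', 'male', 'male', 'female', 'female', 'female', 'male']
})

# Without ignore_index the retained rows keep their old labels
print(my_data.drop_duplicates(keep='last').index.tolist())  # [0, 2, 4, 5, 6]

# ignore_index=True renumbers the result from 0 (assumes pandas >= 1.0)
cleaned = my_data.drop_duplicates(keep='last', ignore_index=True)
print(cleaned.index.tolist())  # [0, 1, 2, 3, 4]

# inplace=True modifies my_data itself and returns None
result = my_data.drop_duplicates(inplace=True)
print(result)        # None
print(len(my_data))  # 5
```

Reassigning, as in my_data = my_data.drop_duplicates(), has the same effect as inplace=True and allows further method calls to be chained.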