Output (the last two rows are duplicates: row 6 is a duplicate of row 1, and row 5 is a duplicate of row 3)
id name class1 mark gender
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
4 5 John Four 60 female
5 4 Krish Four 60 female
6 2 Max Three 85 male
drop_duplicates: Deleting rows based on duplicate data
Syntax
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
keep
Optional, default is 'first'.
'first': delete all duplicate rows except the first occurrence.
'last': delete all duplicate rows except the last occurrence.
False: delete all duplicate rows (this is the boolean False, not the string 'False').
Series: deletes duplicate values.
DataFrame: deletes duplicate rows (duplicates can be identified using only some of the columns).
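As a rough illustration of the difference between the two, here is a minimal sketch. The values are made up for this example and are not the sample data used in the rest of this page.

```python
import pandas as pd

# Hypothetical values, used only for this illustration
s = pd.Series(["Four", "Three", "Three", "Four", "Four"])
unique_values = s.drop_duplicates()  # keeps the first occurrence of each value
print(unique_values)

# On a DataFrame, whole rows are compared
df_small = pd.DataFrame({"name": ["John", "Max", "Max"], "mark": [75, 85, 85]})
unique_rows = df_small.drop_duplicates()  # the third row repeats the second
print(unique_rows)
```

In both cases the original index labels of the surviving entries are retained.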
Delete duplicate rows after first occurrence
Remove all duplicate rows but keep the first occurrence. This corresponds to keep='first', which is the default value of keep, so it does not have to be passed explicitly.
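The default behaviour can be sketched as follows. The data is rebuilt from the output table at the top of this section; the names my_dict, my_data and df_explicit are assumptions made for this sketch.

```python
import pandas as pd

# Sample data rebuilt from the 7-row output table above (index 0 to 6).
# my_dict and my_data are assumed names, used here for illustration.
my_dict = {
    "id": [1, 2, 3, 4, 5, 4, 2],
    "name": ["John", "Max", "Arnold", "Krish", "John", "Krish", "Max"],
    "class1": ["Four", "Three", "Three", "Four", "Four", "Four", "Three"],
    "mark": [75, 85, 55, 60, 60, 60, 85],
    "gender": ["female", "male", "male", "female", "female", "female", "male"],
}
my_data = pd.DataFrame(my_dict)

# keep='first' is the default, so these two calls give the same result
df = my_data.drop_duplicates()
df_explicit = my_data.drop_duplicates(keep="first")
print(df)
```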
Output (note that the result is assigned to a new DataFrame df because, by default, inplace=False; this is explained below)
id name class1 mark gender
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
4 5 John Four 60 female
Delete duplicate rows but keep the last occurrence
keep='last'
df = df.drop_duplicates(keep='last')  # df starts as the original 7-row data
print(df)
Output
id name class1 mark gender
0 1 John Four 75 female
2 3 Arnold Three 55 male
4 5 John Four 60 female
5 4 Krish Four 60 female
6 2 Max Three 85 male
Delete every occurrence of duplicate rows
keep=False
df = df.drop_duplicates(keep=False)  # df starts as the original 7-row data
print(df)
Output (every row that has a duplicate is deleted, including its first occurrence)
id name class1 mark gender
0 1 John Four 75 female
2 3 Arnold Three 55 male
4 5 John Four 60 female
inplace=True
By default inplace=False, so the DataFrame on which drop_duplicates() is called is not altered and the result is returned as a new DataFrame. That is why the code above assigns the output of drop_duplicates() to df. With inplace=True the DataFrame df is modified directly and nothing is returned.
df.drop_duplicates(inplace=True)
print(df)
Output
id name class1 mark gender
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
4 5 John Four 60 female
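The difference between the two settings can be checked with a small sketch. It uses a hypothetical three-row DataFrame rather than the sample data above, and the names result and returned are chosen only for this example.

```python
import pandas as pd

# Hypothetical frame with one duplicate row (the third repeats the second)
df = pd.DataFrame({"id": [1, 2, 2], "name": ["John", "Max", "Max"]})

result = df.drop_duplicates()  # inplace=False: a new DataFrame is returned
print(len(df), len(result))    # df still has 3 rows, result has 2

returned = df.drop_duplicates(inplace=True)  # df itself is modified
print(returned)                # None, because nothing is returned
print(len(df))                 # df now has 2 rows
```

Because the inplace=True call returns None, writing df = df.drop_duplicates(inplace=True) would replace df with None.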
subset
Consider only certain columns when identifying duplicates. By default all of the columns are used.
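A minimal sketch of subset, assuming the same seven rows shown in the output table at the top of this section. The variable names by_name and by_name_last are chosen only for this example.

```python
import pandas as pd

# Sample data rebuilt from the 7-row output table (index 0 to 6)
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 4, 2],
    "name": ["John", "Max", "Arnold", "Krish", "John", "Krish", "Max"],
    "class1": ["Four", "Three", "Three", "Four", "Four", "Four", "Three"],
    "mark": [75, 85, 55, 60, 60, 60, 85],
    "gender": ["female", "male", "male", "female", "female", "female", "male"],
})

# Rows are treated as duplicates when the name column matches,
# even if the other columns differ (rows 0 and 4 are both John)
by_name = df.drop_duplicates(subset=["name"])
print(by_name.index.tolist())       # [0, 1, 2, 3]

# subset can be combined with keep
by_name_last = df.drop_duplicates(subset=["name"], keep="last")
print(by_name_last.index.tolist())  # [2, 4, 5, 6]
```

Note that row 4 (John, mark 60) is dropped in the first call although it is not a full duplicate of row 0, since only the name column is compared.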