import pandas as pd

# Sample data: rows 5 and 6 repeat rows 3 and 1
my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male']
}
df = pd.DataFrame(data=my_dict)
print(df)
Output (the last two rows are duplicates: row 6 is a duplicate of row 1 and row 5 is a duplicate of row 3)
id name class1 mark gender
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
4 5 John Four 60 female
5 4 Krish Four 60 female
6 2 Max Three 85 male
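Before deleting anything, it can help to check which rows pandas treats as duplicates. The sketch below uses DataFrame.duplicated() on the same sample data; the variable names dup_mask and dup_labels are only illustrative.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male'],
}
df = pd.DataFrame(data=my_dict)

# duplicated() returns a boolean Series: True where a row repeats an earlier row
dup_mask = df.duplicated()

# Index labels of the rows flagged as duplicates
dup_labels = df.index[dup_mask].tolist()
print(dup_labels)  # [5, 6]
```

These are the same rows that drop_duplicates() removes with its default settings.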
DataFrame.drop_duplicates(keep)
keep | Optional. 'first' (default): delete all duplicate rows except the first occurrence. 'last': delete all duplicate rows except the last occurrence. False (the boolean, not a string): delete every row that has a duplicate, keeping no occurrence.
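As a quick side-by-side comparison, the following sketch applies each value of keep to the same sample data and records which index labels survive. The dictionary kept_labels is just a helper name for this illustration.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male'],
}
df = pd.DataFrame(data=my_dict)

# Apply each option to the original df and collect the surviving index labels
kept_labels = {}
for keep in ('first', 'last', False):
    kept_labels[keep] = df.drop_duplicates(keep=keep).index.tolist()
    print(repr(keep), kept_labels[keep])

# 'first' [0, 1, 2, 3, 4]
# 'last' [0, 2, 4, 5, 6]
# False [0, 2, 4]
```

Note that df itself is never reassigned here, so every option starts from the full seven rows.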
keep='first'
Keeps the first occurrence of each duplicated row. This is the default value of keep, so it can be omitted.
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male']
}
df = pd.DataFrame(data=my_dict)
# Keep the first occurrence of each duplicated row
df = df.drop_duplicates(keep='first')
print(df)
Output. Note that we assign the result back to df because, by default, inplace=False (explained below), so drop_duplicates() returns a new DataFrame instead of changing the original.
id name class1 mark gender
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
4 5 John Four 60 female
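keep='last'
Keeps the last occurrence of each duplicated row.
df = pd.DataFrame(data=my_dict)  # start again from the original data
df = df.drop_duplicates(keep='last')
print(df)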
Output
id name class1 mark gender
0 1 John Four 75 female
2 3 Arnold Three 55 male
4 5 John Four 60 female
5 4 Krish Four 60 female
6 2 Max Three 85 male
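keep=False
Keeps no occurrence of a duplicated row.
df = pd.DataFrame(data=my_dict)  # start again from the original data
df = df.drop_duplicates(keep=False)
print(df)
Output (every row that has a duplicate is deleted, including its first occurrence)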
id name class1 mark gender
0 1 John Four 75 female
2 3 Arnold Three 55 male
4 5 John Four 60 female
inplace
By default, inplace=False, so the original DataFrame df is not altered when we call drop_duplicates(). That is why the code above assigns the result of drop_duplicates() back to df. With inplace=True, the DataFrame df is modified directly.
df = pd.DataFrame(data=my_dict)  # start again from the original data
df.drop_duplicates(inplace=True)
print(df)
Output
id name class1 mark gender
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
4 5 John Four 60 female
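The difference between the two modes can be checked with a short sketch like the one below. The names deduped and result are illustrative; the point is that the default call leaves df untouched, while inplace=True changes df and returns None.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male'],
}
df = pd.DataFrame(data=my_dict)

# Default (inplace=False): a new DataFrame is returned, df keeps all 7 rows
deduped = df.drop_duplicates()
rows_before = len(df)
print(rows_before, len(deduped))  # 7 5

# inplace=True: df itself is modified and the call returns None
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 5
```

Because the in-place call returns None, writing df = df.drop_duplicates(inplace=True) would replace df with None, so the two styles should not be combined.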
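subset
By default all columns are compared when looking for duplicates. With subset, only the listed columns are compared.
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male']
}
df = pd.DataFrame(data=my_dict)
# Rows count as duplicates when class1, mark and gender all match
df.drop_duplicates(subset=['class1', 'mark', 'gender'], inplace=True)
print(df)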
Output
id name class1 mark gender
0 1 John Four 75 female
1 2 Max Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Four 60 female
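subset can also be combined with keep. As a small variation on the example above, this sketch treats rows as duplicates when only the name column matches and keeps the last occurrence of each name; by_name is an illustrative variable name.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male'],
}
df = pd.DataFrame(data=my_dict)

# Compare only the 'name' column and keep the last row for each name
by_name = df.drop_duplicates(subset=['name'], keep='last')
print(by_name.index.tolist())       # [2, 4, 5, 6]
print(by_name['name'].tolist())     # ['Arnold', 'John', 'Krish', 'Max']
```

Here row 0 (John, id 1) is dropped even though its id and mark differ from row 4, because only name is compared.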
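ignore_index=True
With ignore_index=True, the remaining rows are relabelled 0 to n-1 instead of keeping their original index labels. Starting again from the original data (re-create df from my_dict as above):
df.drop_duplicates(keep='last', inplace=True, ignore_index=True)
print(df)
Output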
id name class1 mark gender
0 1 John Four 75 female
1 3 Arnold Three 55 male
2 5 John Four 60 female
3 4 Krish Four 60 female
4 2 Max Three 85 male
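To see what ignore_index=True does, the sketch below compares it with dropping duplicates first and then calling reset_index(drop=True). The names with_flag and with_reset are illustrative, and the example assumes a pandas version that supports ignore_index in drop_duplicates() (1.0 or later).

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male'],
}
df = pd.DataFrame(data=my_dict)

# Option 1: let drop_duplicates() relabel the rows
with_flag = df.drop_duplicates(keep='last', ignore_index=True)

# Option 2: drop duplicates, then discard the old index labels
with_reset = df.drop_duplicates(keep='last').reset_index(drop=True)

print(with_flag.index.tolist())      # [0, 1, 2, 3, 4]
print(with_flag.equals(with_reset))  # True
```

Without ignore_index=True, the same call would keep the original labels 0, 2, 4, 5, 6.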