drop_duplicates() : delete duplicate rows

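DataFrame.drop_duplicates(keep)
keep : Optional, decides which of the duplicate rows to retain.
'first' : (default) delete all duplicate rows except the first occurrence
'last' : delete all duplicate rows except the last occurrence
False : delete all duplicate rows, including the first occurrence. This is the boolean False, not the string 'False'.
Series.drop_duplicates() : deletes duplicate values.
DataFrame.drop_duplicates() : deletes duplicate rows ( the comparison can be limited to the values of some columns ).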
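
The two cases above can be sketched as follows. This is a minimal illustration, not part of the original examples: the Series marks and the DataFrame students are made-up sample data, and the column comparison is assumed to be done with the subset parameter of drop_duplicates().

```python
import pandas as pd

# Series: duplicate values are removed, the first occurrence is kept by default
marks = pd.Series([75, 85, 55, 60, 60, 60, 85])  # hypothetical sample data
unique_marks = marks.drop_duplicates()
print(unique_marks.tolist())

# DataFrame: compare only the 'name' column when looking for duplicates
students = pd.DataFrame({  # hypothetical sample data
    'id': [1, 2, 3, 4, 5],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John'],
    'mark': [75, 85, 55, 60, 60],
})
by_name = students.drop_duplicates(subset=['name'])
print(by_name)
```

With subset=['name'] the second John ( id 5 ) is treated as a duplicate even though its id and mark differ from the first John.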

Using DataFrame

Here is a sample DataFrame.
import pandas as pd
my_dict={
  'id':[1,2,3,4,5,4,2],
  'name':['John','Max','Arnold','Krish','John','Krish','Max'],
  'class1':['Four','Three','Three','Four','Four','Four','Three'],
  'mark':[75,85,55,60,60,60,85],
  'sex':['female','male','male','female','female','female','male']
}
my_data = pd.DataFrame(data=my_dict)
print(my_data)
Output ( the last two rows are duplicates: the row at index 6 is a duplicate of the row at index 1, and the row at index 5 is a duplicate of the row at index 3 )
   id    name class1  mark     sex
0   1    John   Four    75  female
1   2     Max  Three    85    male
2   3  Arnold  Three    55    male
3   4   Krish   Four    60  female
4   5    John   Four    60  female
5   4   Krish   Four    60  female
6   2     Max  Three    85    male
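
Before deleting anything, the duplicate rows can be listed to confirm which ones will be affected. The sketch below uses duplicated(), which marks every row that repeats an earlier row; the variable name dup_mask is only chosen for this illustration.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'sex': ['female', 'male', 'male', 'female', 'female', 'female', 'male'],
}
my_data = pd.DataFrame(data=my_dict)

# True for each row that is identical to a row appearing earlier
dup_mask = my_data.duplicated()
print(my_data[dup_mask])  # shows only the repeated rows
```

Here only the rows at index 5 and 6 are marked. The row at index 4 shares its name with the row at index 0, but its id and mark are different, so it is not a duplicate.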

Delete duplicate rows after first occurrence

Remove all duplicate rows but keep the first occurrence. We pass keep='first' here, though it can be omitted because 'first' is the default value of keep.
import pandas as pd
my_dict={
  'id':[1,2,3,4,5,4,2],
  'name':['John','Max','Arnold','Krish','John','Krish','Max'],
  'class1':['Four','Three','Three','Four','Four','Four','Three'],
  'mark':[75,85,55,60,60,60,85],
  'sex':['female','male','male','female','female','female','male']
}
my_data = pd.DataFrame(data=my_dict)
df=my_data.drop_duplicates(keep='first')
print(df)
Output : Note that we have assigned the output to a new DataFrame df because by default inplace=False ( explained below ), so my_data itself is not changed.
   id    name class1  mark     sex
0   1    John   Four    75  female
1   2     Max  Three    85    male
2   3  Arnold  Three    55    male
3   4   Krish   Four    60  female
4   5    John   Four    60  female

Delete duplicate rows but keep the last occurrence

keep='last'
df=my_data.drop_duplicates(keep='last')
print(df)
Output
   id    name class1  mark     sex
0   1    John   Four    75  female
2   3  Arnold  Three    55    male
4   5    John   Four    60  female
5   4   Krish   Four    60  female
6   2     Max  Three    85    male

Delete duplicate rows in all places

keep=False
df=my_data.drop_duplicates(keep=False)
print(df)
Output ( all duplicate rows are deleted from all places )
   id    name class1  mark     sex
0   1    John   Four    75  female
2   3  Arnold  Three    55    male
4   5    John   Four    60  female
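
In the outputs above the remaining rows keep their original index labels, so the index has gaps ( 0, 2, 4 ). If a continuous index is preferred, one possible approach is to chain reset_index(drop=True) after drop_duplicates(). This step is an addition for illustration and is not required for removing duplicates.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'sex': ['female', 'male', 'male', 'female', 'female', 'female', 'male'],
}
my_data = pd.DataFrame(data=my_dict)

# Drop every row that has a duplicate, then renumber the index from 0
df = my_data.drop_duplicates(keep=False).reset_index(drop=True)
print(df)
```

drop=True discards the old index instead of adding it to the DataFrame as a new column.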

inplace=True

By default inplace=False, so our main DataFrame my_data is not altered when we use drop_duplicates(). That is why the code above stores the output of drop_duplicates() in another DataFrame, df. By using inplace=True we can modify our main DataFrame my_data directly.
my_data.drop_duplicates(inplace=True)
print(my_data)
Output
   id    name class1  mark     sex
0   1    John   Four    75  female
1   2     Max  Three    85    male
2   3  Arnold  Three    55    male
3   4   Krish   Four    60  female
4   5    John   Four    60  female
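
One point to watch with inplace=True is that drop_duplicates() then returns None, so its result should not be assigned back to a variable. A small sketch with made-up data ( the name result is only for this illustration ):

```python
import pandas as pd

my_data = pd.DataFrame({
    'id': [1, 2, 2],
    'name': ['John', 'Max', 'Max'],
})

# With inplace=True the DataFrame is changed directly and nothing is returned
result = my_data.drop_duplicates(inplace=True)
print(result)        # None
print(len(my_data))  # number of rows left in my_data
```

Writing my_data = my_data.drop_duplicates(inplace=True) would therefore replace my_data with None.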