Pandas DataFrame cut()

Segment data into bins
Parameters

x : The one-dimensional input array to be binned.
bins : The segments used for binning. This can be an integer (number of equal-width bins), a sequence of bin edges (which allows non-uniform widths) or an IntervalIndex.
right : Default True. Whether each bin includes its rightmost edge (see the examples below).
labels : Default None. A list of labels for the bins; its length must match the number of bins.
retbins : Default False. Whether to also return the bin edges.
precision : int, default 3. The precision at which the bin labels are stored and displayed.
include_lowest : Default False. Whether the first interval should be left-inclusive.
duplicates : Default 'raise'. If the bin edges are not unique, either raise a ValueError ('raise') or drop the duplicate edges ('drop').
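The retbins option is not exercised by the examples on this page, so here is a minimal sketch of it. It assumes the same MATH marks used in the examples; the names marks, segments and edges are chosen only for illustration.

```python
import pandas as pd

# MATH marks used in the examples on this page
marks = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')

# retbins=True returns a tuple: (binned values, array of bin edges)
segments, edges = pd.cut(x=marks, bins=5, retbins=True)

print(segments)
print(edges)  # the 6 edges that delimit the 5 bins
```

Keeping the returned edges is handy when the same bins must later be applied to another column or dataset.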

Examples using options

In this example the mark of each student in MATH is used for segmentation. We pass a list of bin edges to make three segments of non-uniform width: from 1 to 50, from 50 to 70 and from 70 to 100.
import pandas as pd 
my_dict={'NAME':['Ravi','Raju','Alex','Ron','King','Jack'],
         'ID':[1,2,3,4,5,6],
         'MATH':[80,40,70,70,60,30],
         'ENGLISH':[80,70,40,50,60,30]}
my_data = pd.DataFrame(data=my_dict)
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1, 50, 70, 100]) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  (70, 100]
1  Raju   2    40       70    (1, 50]
2  Alex   3    70       40   (50, 70]
3   Ron   4    70       50   (50, 70]
4  King   5    60       60   (50, 70]
5  Jack   6    30       30    (1, 50]

category data type

The column holding the output of cut() has the categorical data type. You can check it like this.
print(my_data['my_cut'].dtypes) # category
Read more on checking data types with the dtypes attribute and on the categorical data type.
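Because the result is categorical, the .cat accessor can be used to inspect the bins. The following is a small sketch under the assumption that the same MATH marks and bin edges are used; marks and binned are illustrative names.

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')
binned = pd.cut(x=marks, bins=[1, 50, 70, 100])

print(binned.cat.categories)      # the three intervals, held as an IntervalIndex
print(binned.cat.ordered)         # True: the bins keep their natural order
print(binned.cat.codes.tolist())  # position of the bin each mark fell into

# Number of marks per bin, listed in bin order
per_bin = binned.value_counts().sort_index()
print(per_bin)
```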

Bins

How do we decide on the segments for the distribution of values? There are three ways to specify bins.

Fixed width bins : By specifying an integer we say how many segments we want. Here the marks range from 30 to 80, a span of 50, so bins=5 creates segments of fixed width 10. The range of x is extended by 0.1% so that the minimum and maximum values are included, which is why the lowest edge appears as 29.95 rather than 30.
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=5) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH         my_cut
0  Ravi   1    80       80   (70.0, 80.0]
1  Raju   2    40       70  (29.95, 40.0]
2  Alex   3    70       40   (60.0, 70.0]
3   Ron   4    70       50   (60.0, 70.0]
4  King   5    60       60   (50.0, 60.0]
5  Jack   6    30       30  (29.95, 40.0]
Sequence of scalars : We specify the edges of the bins.
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100]) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  (70, 100]
1  Raju   2    40       70    (1, 50]
2  Alex   3    70       40   (50, 70]
3   Ron   4    70       50   (50, 70]
4  King   5    60       60   (50, 70]
5  Jack   6    30       30    (1, 50]
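IntervalIndex : Defines the exact, non-overlapping bins to be used. The example below passes a longer list of edges to get narrow, exact bins rather than an IntervalIndex object.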
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,49,50,69,70,79,80,100]) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH    my_cut
0  Ravi   1    80       80  (79, 80]
1  Raju   2    40       70   (1, 49]
2  Alex   3    70       40  (69, 70]
3   Ron   4    70       50  (69, 70]
4  King   5    60       60  (50, 69]
5  Jack   6    30       30   (1, 49]
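The bins argument also accepts an actual pd.IntervalIndex. Here is a minimal sketch, assuming the same MATH marks; the names full, gapped and with_gap are illustrative. Unlike a list of edges, an IntervalIndex may leave gaps, and values falling in a gap are returned as NaN.

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')

# Three adjoining intervals, closed on the right by default
full = pd.IntervalIndex.from_tuples([(1, 50), (50, 70), (70, 100)])
no_gap = pd.cut(x=marks, bins=full)
print(no_gap)

# Leave a gap between 50 and 70: marks in the gap match no interval
gapped = pd.IntervalIndex.from_tuples([(1, 50), (70, 100)])
with_gap = pd.cut(x=marks, bins=gapped)
print(with_gap)  # the marks 60 and 70 become NaN
```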

right

In which bin should we place a mark that falls on the edge of two bins?
Alex got 70 and is placed in the 50 to 70 segment, but he could also be placed in the 70 to 100 segment. The right option controls this. By default right=True, so a MATH value of 70 is included in the 50 to 70 segment. With right=False it is included in the 70 to 100 segment instead.
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100],right=True) 
Output
   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  (70, 100]
1  Raju   2    40       70    (1, 50]
2  Alex   3    70       40   (50, 70]
3   Ron   4    70       50   (50, 70]
4  King   5    60       60   (50, 70]
5  Jack   6    30       30    (1, 50]
Let us change to right=False
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100],right=False) 
Output
   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  [70, 100)
1  Raju   2    40       70    [1, 50)
2  Alex   3    70       40  [70, 100)
3   Ron   4    70       50  [70, 100)
4  King   5    60       60   [50, 70)
5  Jack   6    30       30    [1, 50)
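The right option also decides which outermost edge is left out. The sketch below uses three hypothetical marks (1, 70 and 100) that sit exactly on bin edges; they are not part of the student data above.

```python
import pandas as pd

# Hypothetical marks sitting exactly on the bin edges
edge_marks = pd.Series([1, 70, 100])
bins = [1, 50, 70, 100]

closed_right = pd.cut(x=edge_marks, bins=bins, right=True)
closed_left = pd.cut(x=edge_marks, bins=bins, right=False)

# right=True  : 1 is outside (1, 50], so it becomes NaN
# right=False : 100 is outside [70, 100), so it becomes NaN
print(pd.DataFrame({'mark': edge_marks,
                    'right=True': closed_right,
                    'right=False': closed_left}))
```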

Labels

Default is None. We can supply a label for each segment; the number of labels must match the number of bins.
my_labels=['Fail','Second','First']
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1, 50, 75, 100],labels=my_labels) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH  my_cut
0  Ravi   1    80       80   First
1  Raju   2    40       70    Fail
2  Alex   3    70       40  Second
3   Ron   4    70       50  Second
4  King   5    60       60  Second
5  Jack   6    30       30    Fail
We can use the sum of two columns as our input array.
my_labels=['Fail','Second','First']
my_data['my_cut'] = pd.cut(x=my_data['MATH']+my_data['ENGLISH'],bins=[1, 100, 150, 200],labels=my_labels) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH  my_cut
0  Ravi   1    80       80   First
1  Raju   2    40       70  Second
2  Alex   3    70       40  Second
3   Ron   4    70       50  Second
4  King   5    60       60  Second
5  Jack   6    30       30    Fail
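Besides a list of names, labels also accepts False, in which case cut() returns the integer position of each bin instead of an interval. A minimal sketch, assuming the same MATH marks; bin_index is an illustrative name.

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')

# labels=False returns the position of each bin (0, 1, 2) instead of an interval
bin_index = pd.cut(x=marks, bins=[1, 50, 75, 100], labels=False)
print(bin_index.tolist())  # [2, 0, 1, 1, 1, 0]
```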

include_lowest

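Default value is False. It decides whether the first interval is left-inclusive. With False, a value equal to the lowest bin edge (here 30) falls outside every bin and becomes NaN.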
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[30,60,80,100],include_lowest=False)
Output
   NAME  ID  MATH  ENGLISH        my_cut
0  Ravi   1    80       80  (60.0, 80.0]
1  Raju   2    40       70  (30.0, 60.0]
2  Alex   3    70       40  (60.0, 80.0]
3   Ron   4    70       50  (60.0, 80.0]
4  King   5    60       60  (30.0, 60.0]
5  Jack   6    30       30           NaN
Let us try include_lowest=True
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[30,60,80,100],include_lowest=True)
Output
   NAME  ID  MATH  ENGLISH          my_cut
0  Ravi   1    80       80    (60.0, 80.0]
1  Raju   2    40       70  (29.999, 60.0]
2  Alex   3    70       40    (60.0, 80.0]
3   Ron   4    70       50    (60.0, 80.0]
4  King   5    60       60  (29.999, 60.0]
5  Jack   6    30       30  (29.999, 60.0]
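One quick way to see the effect of include_lowest is to count how many marks end up without a bin. This is a small sketch assuming the same MATH marks and bin edges; open_low and closed_low are illustrative names.

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')
bins = [30, 60, 80, 100]

open_low = pd.cut(x=marks, bins=bins, include_lowest=False)
closed_low = pd.cut(x=marks, bins=bins, include_lowest=True)

print(open_low.isna().sum())    # 1: the mark 30 sits on the excluded lowest edge
print(closed_low.isna().sum())  # 0: every mark now has a bin
```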

duplicates

Default value is 'raise'. We can change this to duplicates='drop' so that repeated bin edges are dropped instead of raising an error.
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[40,50,50,100],duplicates='drop') 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH         my_cut
0  Ravi   1    80       80  (50.0, 100.0]
1  Raju   2    40       70            NaN
2  Alex   3    70       40  (50.0, 100.0]
3   Ron   4    70       50  (50.0, 100.0]
4  King   5    60       60  (50.0, 100.0]
5  Jack   6    30       30            NaN
Let us change to duplicates='raise'
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[40,50,50,100],duplicates='raise') 
Output
This raises a ValueError because the bin edges are not unique.
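If duplicate edges can occur in practice, the error can be caught and reported. A minimal sketch, assuming the same MATH marks; the raised flag is only there for illustration.

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')

try:
    # The edge 50 is repeated, so the default duplicates='raise' rejects it
    pd.cut(x=marks, bins=[40, 50, 50, 100], duplicates='raise')
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)
```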
