Pandas cut()

Segment data into bins
Parameters

x : The one-dimensional input array to be categorized.
bins : The segments used for categorization. We can specify an integer (number of equal-width bins), a sequence of bin edges (non-uniform widths are allowed) or an IntervalIndex.
right : Default True. Whether each bin includes its rightmost edge (see examples below).
labels : Default None. A list of labels for the bins; its length must match the number of bins.
retbins : Default False. Whether to return the bin edges along with the result.
precision : int, default 3. The precision used when storing and displaying the bin labels.
include_lowest : Default False. Whether the first interval should be left-inclusive.
duplicates : Default 'raise'. Either 'raise' or 'drop'. If the bin edges are not unique, raise a ValueError or drop the duplicate edges.
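The retbins and precision options are not used in the examples that follow, so here is a minimal sketch of both. The names marks, binned and edges are chosen only for this illustration.

```python
import pandas as pd

# Same MATH marks as in the examples of this page
marks = pd.Series([80, 40, 70, 70, 60, 30])

# retbins=True returns a tuple: (binned values, array of bin edges)
# precision=1 limits the decimals used in the interval labels
binned, edges = pd.cut(x=marks, bins=5, retbins=True, precision=1)

print(binned)  # one interval per mark
print(edges)   # the 6 edges that delimit the 5 bins
```

The returned edges can be passed as bins to another cut() call to segment a second column with exactly the same boundaries.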

Examples using options

In this example the mark of each student in MATH is used for segmentation. We use bins to make three non-uniform segments: from 1 to 50, from 50 to 70 and from 70 to 100.
import pandas as pd 
my_dict={'NAME':['Ravi','Raju','Alex','Ron','King','Jack'],
         'ID':[1,2,3,4,5,6],
         'MATH':[80,40,70,70,60,30],
         'ENGLISH':[80,70,40,50,60,30]}
my_data = pd.DataFrame(data=my_dict)
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1, 50, 70, 100]) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  (70, 100]
1  Raju   2    40       70    (1, 50]
2  Alex   3    70       40   (50, 70]
3   Ron   4    70       50   (50, 70]
4  King   5    60       60   (50, 70]
5  Jack   6    30       30    (1, 50]

category data type

The column holding the output of cut() has the categorical data type. You can check it like this.
print(my_data['my_cut'].dtypes) # category
Read more on data types using the dtypes attribute and on the categorical data type.
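Because the result is categorical, we can inspect its categories and codes through the .cat accessor. The following is a small sketch; the names marks and segments are used only for this illustration.

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30])
segments = pd.cut(x=marks, bins=[1, 50, 70, 100])

print(segments.dtype)               # category
print(segments.cat.categories)      # the three intervals used as categories
print(segments.cat.codes.tolist())  # position of the bin for each mark
print(segments.value_counts(sort=False))  # number of marks in each bin
```

The categories are ordered, so the segments can be sorted or compared from lowest bin to highest bin.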

Bins

How do we decide on the segments for the distribution of values? There are three ways to specify the bins.

Fixed width bins : By specifying an integer we say how many segments we want. Here the marks vary from 30 to 80, a range of 50, so with bins=5 we create segments of fixed width 10. The range of x is extended by 0.1% so that the minimum value falls inside the first bin, which is why the first interval starts at 29.95.
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=5) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH         my_cut
0  Ravi   1    80       80   (70.0, 80.0]
1  Raju   2    40       70  (29.95, 40.0]
2  Alex   3    70       40   (60.0, 70.0]
3   Ron   4    70       50   (60.0, 70.0]
4  King   5    60       60   (50.0, 60.0]
5  Jack   6    30       30  (29.95, 40.0]
Sequence of scalars : We specify the edges of the bins.
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100]) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  (70, 100]
1  Raju   2    40       70    (1, 50]
2  Alex   3    70       40   (50, 70]
3   Ron   4    70       50   (50, 70]
4  King   5    60       60   (50, 70]
5  Jack   6    30       30    (1, 50]
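IntervalIndex : Non-overlapping exact bins. The example below passes a plain list of edges, which pandas treats as a sequence of scalars and converts into adjacent intervals.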
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,49,50,69,70,79,80,100]) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH    my_cut
0  Ravi   1    80       80  (79, 80]
1  Raju   2    40       70   (1, 49]
2  Alex   3    70       40  (69, 70]
3   Ron   4    70       50  (69, 70]
4  King   5    60       60  (50, 69]
5  Jack   6    30       30   (1, 49]
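To pass a real IntervalIndex instead of a list of edges, we can build one with pd.IntervalIndex.from_tuples(). This is a minimal sketch; the names marks, my_bins and segments are chosen only for this illustration.

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30])

# Each tuple is one interval; intervals are closed on the right by default
my_bins = pd.IntervalIndex.from_tuples([(1, 50), (50, 70), (70, 100)])

segments = pd.cut(x=marks, bins=my_bins)
print(segments)
```

When bins is an IntervalIndex, the intervals are used exactly as given, and any value that falls outside every interval becomes NaN.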

right

Into which bin should we place a mark that falls on the edge of a bin?
Alex got 70 and he is kept in the 50 to 70 segment. We could place him in the 70 to 100 segment instead. For this we use the right option. By default right=True, so a MATH value of 70 is included in the 50 to 70 segment. If we set right=False, 70 is included in the 70 to 100 segment.
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100],right=True) 
Output
   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  (70, 100]
1  Raju   2    40       70    (1, 50]
2  Alex   3    70       40   (50, 70]
3   Ron   4    70       50   (50, 70]
4  King   5    60       60   (50, 70]
5  Jack   6    30       30    (1, 50]
Let us change to right=False
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100],right=False) 
Output
   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  [70, 100)
1  Raju   2    40       70    [1, 50)
2  Alex   3    70       40  [70, 100)
3   Ron   4    70       50  [70, 100)
4  King   5    60       60   [50, 70)
5  Jack   6    30       30    [1, 50)

Labels

Default is None. We can give a label to each segment; the number of labels must equal the number of bins.
my_labels=['Fail','Second','First']
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1, 50, 75, 100],labels=my_labels) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH  my_cut
0  Ravi   1    80       80   First
1  Raju   2    40       70    Fail
2  Alex   3    70       40  Second
3   Ron   4    70       50  Second
4  King   5    60       60  Second
5  Jack   6    30       30    Fail
We can use the sum of two columns as our input array.
my_labels=['Fail','Second','First']
my_data['my_cut'] = pd.cut(x=my_data['MATH']+my_data['ENGLISH'],bins=[1, 100, 150, 200],labels=my_labels) 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH  my_cut
0  Ravi   1    80       80   First
1  Raju   2    40       70  Second
2  Alex   3    70       40  Second
3   Ron   4    70       50  Second
4  King   5    60       60  Second
5  Jack   6    30       30    Fail
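Besides a list of names, labels also accepts False, in which case cut() returns the integer position of each bin instead of a category. This is a small sketch; the names marks and bin_numbers are used only for this illustration.

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30])

# labels=False returns the bin number (starting from 0) for each mark
bin_numbers = pd.cut(x=marks, bins=[1, 50, 75, 100], labels=False)
print(bin_numbers.tolist())  # [2, 0, 1, 1, 1, 0]
```

Here 0, 1 and 2 correspond to the segments we labelled Fail, Second and First above.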

include_lowest

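Default value is False. It decides whether the first interval is left-inclusive. With False, a value equal to the lowest bin edge (30 here) falls outside every bin and becomes NaN.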
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[30,60,80,100],include_lowest=False)
Output
   NAME  ID  MATH  ENGLISH        my_cut
0  Ravi   1    80       80  (60.0, 80.0]
1  Raju   2    40       70  (30.0, 60.0]
2  Alex   3    70       40  (60.0, 80.0]
3   Ron   4    70       50  (60.0, 80.0]
4  King   5    60       60  (30.0, 60.0]
5  Jack   6    30       30           NaN
Let us try include_lowest=True
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[30,60,80,100],include_lowest=True)
Output
   NAME  ID  MATH  ENGLISH          my_cut
0  Ravi   1    80       80    (60.0, 80.0]
1  Raju   2    40       70  (29.999, 60.0]
2  Alex   3    70       40    (60.0, 80.0]
3   Ron   4    70       50    (60.0, 80.0]
4  King   5    60       60  (29.999, 60.0]
5  Jack   6    30       30  (29.999, 60.0]

duplicates

Default value is 'raise'. When the bin edges are not unique, we can set duplicates='drop' to discard the repeated edges.
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[40,50,50,100],duplicates='drop') 
print(my_data)
Output
   NAME  ID  MATH  ENGLISH         my_cut
0  Ravi   1    80       80  (50.0, 100.0]
1  Raju   2    40       70            NaN
2  Alex   3    70       40  (50.0, 100.0]
3   Ron   4    70       50  (50.0, 100.0]
4  King   5    60       60  (50.0, 100.0]
5  Jack   6    30       30            NaN
Let us change to duplicates='raise'
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[40,50,50,100],duplicates='raise') 
Output
This will raise a ValueError because the bin edges must be unique.
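If the bin edges come from user input or from another calculation, we may want to catch this error rather than let the script stop. This is a minimal sketch; the names marks and error_message are chosen only for this illustration.

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30])

try:
    # The edge 50 is repeated, so the default duplicates='raise' fails
    pd.cut(x=marks, bins=[40, 50, 50, 100], duplicates='raise')
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)  # explains that the bin edges are not unique
```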

Subhendu Mohapatra — author at plus2net