groupby: Pandas DataFrame

Pandas

We can combining data based on header and apply different aggregate function to it.

DataFrame.groupby(by,axis, level, as_index, sort, group_keys, squeeze, observed)

by : The column or label or list over which the groups will be formed.
axis : 0 ( default ) over different axis
lavel :groupby particular level or levels
as_index : bool ( default True )
sort: (default True) Sort group keys
group_keys: While using apply we can refer to
squeez: (default False ) reduce the dimension
observed: (default False ) applies if any of the groupers are Categoricals
dropna: (default True ) NA values are treated as keys if value is False

Returns a groupby object

We used this CSV file for our examples. You can download and run the script.

sales.csv Six columns , sale_id, c_id ( customer id ), p_id (product_id), qty ( quantity sold) ,store ( name )

You can also use this DataFrame for our demo examples here.
import pandas as pd 
my_dict={'sale_id':[1,2,3,4,5,6,7,8,9],
         'c_id':[2,2,1,4,2,3,2,3,2],
         'p_id':[3,4,3,2,3,3,2,2,3],
	 'product':['Monitor','CPU','Monitor','RAM','Monitor','Monitor','RAM','RAM','Monitor'],
         'qty':[2,1,3,2,3,2,3,2,2],
         'store':['ABC','DEF','ABC','DEF','ABC','DEF','ABC','DEF','ABC']}
sales = pd.DataFrame(data=my_dict)
print(sales)
using groupby on product
import pandas as pd 
sales=pd.read_csv("sales.csv") 
my_sale=sales.groupby(['product'])
print(my_sale.first())
Output is here
         sale_id  c_id  p_id  qty store
product                                
CPU            2     2     4    1   DEF
Monitor        1     2     3    2   ABC
RAM            4     4     2    2   DEF
Along with product we are also getting first matching row values.

groupby with multiple columns

We can apply groupby on two columns ( product and store )
my_sale=sales.groupby(['product','store'])
Output
               sale_id  c_id  p_id  qty
product store                          
CPU     DEF          2     2     4    1
Monitor ABC          1     2     3    2
        DEF          6     3     3    2
RAM     ABC          7     2     2    3
        DEF          4     4     2    2
Using as_index=False. Note the difference in product column.
my_sale=sales.groupby(['product','store'],as_index=False)
   product store  sale_id  c_id  p_id  qty
0      CPU   DEF        2     2     4    1
1  Monitor   ABC        1     2     3    2
2  Monitor   DEF        6     3     3    2
3      RAM   ABC        7     2     2    3
4      RAM   DEF        4     4     2    2
We can apply all above parameters
my_sale=sales.groupby(['product'],observed=False,squeeze=False,as_index=False,group_keys=True,sort=False)
Output
   product  sale_id  c_id  p_id  qty store
0  Monitor        1     2     3    2   ABC
1      CPU        2     2     4    1   DEF
2      RAM        4     4     2    2   DEF

Using get_group()

Collect rows of a particular group
my_sale=sales.groupby(['product'])
print(my_sale.get_group('Monitor'))
Output ( rows of all Monitor under product )
   sale_id  c_id  p_id  product  qty store
0        1     2     3  Monitor    2   ABC
2        3     1     3  Monitor    3   ABC
4        5     2     3  Monitor    3   ABC
5        6     3     3  Monitor    2   DEF
8        9     2     3  Monitor    2   ABC

Using sum()

We can get sum of any column data by grouping them under another column. Here we will get total quantity sold against different types of products.
my_sale=sales.groupby(['product'])[['qty']].sum()
Output
         qty
product     
CPU        1
Monitor   12
RAM        7
We can use two columns for grouping and get the breakup of sum of quantity sold
my_sale=sales.groupby(['product','store'])[['qty']].sum()
Output
               qty
product store     
CPU     DEF      1
Monitor ABC     10
        DEF      2
RAM     ABC      3
        DEF      4

Using max()

What is highest quantity sold within every ( group of ) products ?
my_sale=sales.groupby(['product'])[['qty']].max()
Output
         qty
product     
CPU        1
Monitor    3
RAM        3

Using min()

What is the lowest quantity sold within every ( group of ) products ?
my_sale=sales.groupby(['product'])[['qty']].min()
Output
         qty
product     
CPU        1
Monitor    2
RAM        2

Using mean()

Get the mean of each product quantity sold
my_sale=sales.groupby(['product'])[['qty']].mean()
Output
              qty
product          
CPU      1.000000
Monitor  2.400000
RAM      2.333333

Using std()

Using standard deviation function std()
my_sale=sales.groupby(['product'])[['qty']].std()
              qty
product          
CPU           NaN
Monitor  0.547723
RAM      0.577350

Using count()

Total number of sales against each product. This can be treated as total number of rows against each product
my_sale=sales.groupby(['product'])[['qty']].count()
Output
         qty
product     
CPU        1
Monitor    5
RAM        3

Using apply()

my_sale=sales.groupby(['product'])
print(my_sale.apply(lambda x: x['qty'].sum() > 2))
Output
product
CPU        False
Monitor     True
RAM         True
dtype: bool

Using different aggregate functions against different columns

One price column is added. The aggregate function sum is useed for qty column and mean is used for price column. The option as_index=False is kept to keep the product values in all rows.
import pandas as pd 
my_dict={'sale_id':[1,2,3,4,5,6,7,8,9],
         'c_id':[2,2,1,4,2,3,2,3,2],
         'p_id':[3,4,3,2,3,3,2,2,3],
         'price':[30,20,40,20,30,40,20,30,40],
	 'product':['Monitor','CPU','Monitor','RAM','Monitor','Monitor','RAM','RAM','Monitor'],
         'qty':[2,1,3,2,3,2,3,2,2],
         'store':['ABC','DEF','ABC','DEF','ABC','DEF','ABC','DEF','ABC']}
sales = pd.DataFrame(data=my_dict)
my_sale=sales.groupby(['product','store'],as_index=False).agg({'qty':'sum','price':'mean'})
print(my_sale)
Output
   product store  qty  price
0      CPU   DEF    1     20
1  Monitor   ABC   10     35
2  Monitor   DEF    2     40
3      RAM   ABC    3     20
4      RAM   DEF    4     25
We can use dictionary to apply aggregate function on multiple columns.
d={'qty':'sum','price':'mean'} # dictionary to use
my_sale=sales.groupby(['product','store'],as_index=False).agg(d)
Mulitple aggregate values of same column.
d={'qty':['sum'],'price':['mean','max','min']}
Output is here
   product store qty price        
                 sum  mean max min
0      CPU   DEF   1    20  20  20
1  Monitor   ABC  10    35  40  30
2  Monitor   DEF   2    40  40  40
3      RAM   ABC   3    20  20  20
4      RAM   DEF   4    25  30  20

Handling Null values

While grouping by default the NA values are ignored. To include the blank data as key while grouping we have to use dropna=False.
my_sale=sales.groupby(['product','store'],dropna=False).agg(d)


Pandas value_counts cut() segment and sort data values into bins
Subscribe to our YouTube Channel here


Subscribe

* indicates required
Subscribe to plus2net

    plus2net.com



    Post your comments , suggestion , error , requirements etc here





    Python Video Tutorials
    Python SQLite Video Tutorials
    Python MySQL Video Tutorials
    Python Tkinter Video Tutorials
    We use cookies to improve your browsing experience. . Learn more
    HTML MySQL PHP JavaScript ASP Photoshop Articles FORUM . Contact us
    ©2000-2024 plus2net.com All rights reserved worldwide Privacy Policy Disclaimer