Pandas Grouper and Agg Functions Explained

Introduction

Every once in a while it is useful to take a step back and look at pandas’
functions and see if there is a new or better way to do things. I was recently
working on a problem and noticed that pandas had a Grouper function
that I had never used before. I looked into how it can be used and it turns out
it is useful for the type of summary analysis I tend to do on a frequent basis.

In addition to functions that have been around a while, pandas continues to provide
new and improved capabilities with every release. The updated agg function
is another very useful and intuitive tool for summarizing data.

This article will walk through how and why you may want to use the
Grouper

and

agg

functions on your own data. Along the way, I will include a few tips
and tricks on how to use them most effectively.

Grouping Time Series Data

Pandas’ origins are in the financial industry so it should not be a surprise that
it has robust capabilities to manipulate and summarize time series data. Just look at the
extensive time series documentation to get a feel for all the options.
I encourage you to review it so that you’re aware of the concepts.

In order to illustrate this particular concept better, I will walk through an example of sales
data and some simple operations to get total sales by month, day, year, etc.

For this example, I’ll use my trusty transaction data that I’ve used in other articles.
You can follow along in the notebook as well.

import pandas as pd
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
df["date"] = pd.to_datetime(df['date'])
df.head()
account numbernameskuquantityunit priceext pricedate
0740150Barton LLCB1-200003986.693380.912014-01-01 07:21:51
1714466Trantow-BarrowsS2-77896-163.16-63.162014-01-01 10:00:47
2218895Kulas IncB1-699242390.702086.102014-01-01 13:24:58
3307599Kassulke, Ondricka and MetzS1-654814121.05863.052014-01-01 15:05:22
4412290Jerde-HilpertS2-34077683.21499.262014-01-01 23:26:55

Before I go much further, it’s useful to become familiar with Offset Aliases.
These strings are used to represent various common time frequencies like days vs. weeks
vs. years. I always forget what these are called and how to use the more esoteric ones
so make sure to bookmark the link!

For example, if you were interested in summarizing all of the sales by month, you could use the

resample

function. The tricky part about using resample is that it only
operates on an index. In this data set, the data is not indexed by the date column
so resample would not work without restructuring the data. In order to make it work,
use
set_index

to make the date column an index and then resample:

df.set_index('date').resample('M')["ext price"].sum()
date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
2014-12-31    163867.26
Freq: M, Name: ext price, dtype: float64

This is a fairly straightforward way to summarize the data but it gets a little more
challenging if you would like to group the data as well. If we would like to see
the monthly results for each customer, then you could do this (results truncated
to 20 rows):

df.set_index('date').groupby('name')["ext price"].resample("M").sum()
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64

This certainly works but it feels a bit clunky. Fortunately
Grouper

makes
this a little more streamlined. Instead of having to play around with reindexing, we
can use our normal
groupby

syntax but provide a little more info on how
to group the data in the date column:

df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64

Since
groupby

is one of my standard functions, this approach seems simpler
to me and it is more likely to stick in my brain.

The nice benefit of this capability is that if you are interested in looking at
data summarized in a different time frame, just change the
freq

parameter
to one of the valid offset aliases. For instance, an annual summary using December
as the last month would look like this:

df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
name                             date
Barton LLC                       2014-12-31    109438.50
Cronin, Oberbrunner and Spencer  2014-12-31     89734.55
Frami, Hills and Schmidt         2014-12-31    103569.59
Fritsch, Russel and Anderson     2014-12-31    112214.71
Halvorson, Crona and Champlin    2014-12-31     70004.36
Herman LLC                       2014-12-31     82865.00
Jerde-Hilpert                    2014-12-31    112591.43
Kassulke, Ondricka and Metz      2014-12-31     86451.07
Keeling LLC                      2014-12-31    100934.30
Kiehn-Spinka                     2014-12-31     99608.77
Koepp Ltd                        2014-12-31    103660.54
Kuhn-Gusikowski                  2014-12-31     91094.28
Kulas Inc                        2014-12-31    137351.96
Pollich LLC                      2014-12-31     87347.18
Purdy-Kunde                      2014-12-31     77898.21
Sanford and Sons                 2014-12-31     98822.98
Stokes LLC                       2014-12-31     91535.92
Trantow-Barrows                  2014-12-31    123381.38
White-Trantow                    2014-12-31    135841.99
Will LLC                         2014-12-31    104437.60
Name: ext price, dtype: float64

If your annual sales were on a non-calendar basis, then the data can be easily
changed by modifying the
freq

parameter. I encourage you to play around
with different offsets to get a feel for how it works. When dealing with summarizing
time series data, this is incredibly handy. To put this in perspective, try doing
this in Excel. It is certainly possible (using pivot tables and
custom grouping) but I do not think it is nearly as intuitive as the pandas approach.

New and improved aggregate function

In pandas 0.20.1, there was a new
agg

function added that makes it a lot simpler
to summarize data in a manner similar to the
groupby

API.

To illustrate the functionality, let’s say we need to get the total of the
ext price

and

quantity

column as well as the average of the
unit price

. The process
is not very convenient:

df[["ext price", "quantity"]].sum()
ext price    2018784.32
quantity       36463.00
dtype: float64
df["unit price"].mean()
55.007526666666664

This works but it’s a bit messy. The new
agg

makes this simpler:

df[["ext price", "quantity", "unit price"]].agg(['sum', 'mean'])
ext pricequantityunit price
sum2.018784e+0636463.00000082511.290000
mean1.345856e+0324.30866755.007527

The results are good but including the sum of the unit price is not really that
useful. Fortunately we can pass a dictionary to
agg

and specify what
operations to apply to each column.

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean']})
quantityext priceunit price
mean24.3086671.345856e+0355.007527
sum36463.0000002.018784e+06NaN

I find this approach really handy when I want to summarize several columns of data.
In the past, I would run the individual calculations and build up the resulting dataframe
a row at a time. It was tedious. This is a much better approach.

As an added bonus, you can define your own functions. For instance, I frequently
find myself needing to aggregate data and use a mode function that works on text.
I found a lambda function that uses
value_counts

to do what I need and
frequently use this
get_max

function:

get_max = lambda x: x.value_counts(dropna=False).index[0]

Then, if I want to include the most frequent sku in my summary table:

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
quantityskuext priceunit price
<lambda>NaNS2-77896NaNNaN
mean24.308667NaN1.345856e+0355.007527
sum36463.000000NaN2.018784e+06NaN

This is pretty cool but there is one thing that has always bugged me about this approach.
The fact that the column says “<lambda>” bothers me. Ideally I want it to say
“most frequent.” In the past I’d jump through some hoops to rename it. But, when
working on this article I stumbled on another approach – explicitly defining the name
of the lambda function.

get_max.__name__ = "most frequent"

Now, when I do the aggregation:

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
quantityskuext priceunit price
most frequentNaNS2-77896NaNNaN
mean24.308667NaN1.345856e+0355.007527
sum36463.000000NaN2.018784e+06NaN

I get a much nicer label! It’s a small thing but I am definitely glad I finally
figured that out.

As a final final bonus, here’s one other trick. The aggregate function using a
dictionary is useful but one challenge is that it does not preserve order. If
you want to make sure your columns are in a specific order, you can use an

OrderedDict

:

import collections
f = collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])
df.agg(f)
ext pricequantitysku
mean1.345856e+0324.308667NaN
most frequentNaNNaNS2-77896
sum2.018784e+0636463.000000NaN

Conclusion

The pandas library continues to grow and evolve over time. Sometimes it is useful
to make sure there aren’t simpler approaches to some of the frequent approaches
you may use to solve your problems. Pandas’ Grouper function and the updated
agg function are really useful when aggregating and summarizing data. I hope this
article will be useful to you in your data analysis. Are there any other pandas
functions that you just learned about or might be useful to others? Feel free
to give your input in the comments.


Source From: pbpython.com.
Original article title: Pandas Grouper and Agg Functions Explained.
This full article can be read at: Pandas Grouper and Agg Functions Explained.

Advertisement


Random Article You May Like

Leave a Reply

Your email address will not be published. Required fields are marked *

*
*