Understanding the Transform Function in Pandas

Introduction

One of the compelling features of pandas is that it has a rich library of methods for manipulating
data. However, there are times when it is not clear what the various functions
do and how to use them. If you are approaching a problem from an Excel mindset,
it can be difficult to translate the planned solution into the unfamiliar pandas command.
One of those “unknown” functions is the
transform

method.
Even after using pandas for a while, I have never had the chance to use this function
so I recently took some time to figure out what it is and how it could be helpful
for real world analysis. This article will walk through an example where

transform

can be used to efficiently summarize data.

What is transform?

I have found the best coverage of this topic in Jake VanderPlas’ excellent
Python Data Science Handbook. I plan to write a review on this book in the future
but the short and sweet is that it is a great resource that I highly recommend.

As described in the book,
transform

is an operation used in conjunction
with
groupby

(which is one of the most useful operations in pandas). I
suspect most pandas users likely have used
aggregate

,
filter

or

apply

with
groupby

to summarize data. However,
transform

is a little more difficult to understand – especially coming from an Excel world.
Since Jake made all of his book available via jupyter notebooks it is a good place
to start to understand how transform is unique:

While aggregation must return a reduced version of the data, transformation can
return some transformed version of the full data to recombine. For such a transformation,
the output is the same shape as the input. A common example is to center the data
by subtracting the group-wise mean.

With that basic definition, I will go through another example that can explain how this
is useful in other instances outside of centering data.

Problem Set

For this example, we will analyze some fictitious sales data. In order to keep the
dataset small, here is a sample of 12 sales transactions for our company:

accountnameorderskuquantityunit priceext price
0383080Will LLC10001B1-20000733.69235.83
1383080Will LLC10001S1-277221121.12232.32
2383080Will LLC10001B1-86481335.99107.97
3412290Jerde-Hilpert10005S1-065324855.822679.36
4412290Jerde-Hilpert10005S1-828012113.62286.02
5412290Jerde-Hilpert10005S1-06532992.55832.95
6412290Jerde-Hilpert10005S1-474124478.913472.04
7412290Jerde-Hilpert10005S1-277223625.42915.12
8218895Kulas Inc10006S1-277223295.663061.12
9218895Kulas Inc10006B1-330872322.55518.65
10218895Kulas Inc10006B1-33364372.30216.90
11218895Kulas Inc10006B1-20000-172.18-72.18

You can see in the data that the file contains 3 different orders (10001, 10005 and 10006)
and that each order consists has multiple products (aka skus).

The question we would like to answer is: “What percentage of the order total
does each sku represent?”

For example, if we look at order 10001 with a total of $576.12, the break down would be:

  • B1-20000 = $235.83 or 40.9%
  • S1-27722 = $232.32 or 40.3%
  • B1-86481 = $107.97 or 18.7%

The tricky part in this calculation is that we need to get a total for each order
and combine it back with the transaction level detail in order to get the percentages.
In Excel, you could try to use some version of a subtotal to try to calculate the values.

First Approach – Merging

If you are familiar with pandas, your first inclination is going to be trying
to group the data into a new dataframe and combine it in a multi-step process.
Here’s what that approach would look like.

Import all the modules we need and read in our data:

import pandas as pd

df = pd.read_excel("sales_transactions.xlsx")

Now that the data is in a dataframe, determining the total by order is simple with the
help of the standard
groupby

aggregation.

df.groupby('order')["ext price"].sum()
order
10001     576.12
10005    8185.49
10006    3724.49
Name: ext price, dtype: float64

Here is a simple image showing what is happening with the standard
groupby

Groupby Example

The tricky part is figuring out how to combine this data back with the original
dataframe. The first instinct is to create a new dataframe with the totals by order
and merge it back with the original. We could do something like this:

order_total = df.groupby('order')["ext price"].sum().rename("Order_Total").reset_index()
df_1 = df.merge(order_total)
df_1["Percent_of_Order"] = df_1["ext price"] / df_1["Order_Total"]
accountnameorderskuquantityunit priceext priceorder totalOrder_TotalPercent_of_Order
0383080Will LLC10001B1-20000733.69235.83576.12576.120.409342
1383080Will LLC10001S1-277221121.12232.32576.12576.120.403249
2383080Will LLC10001B1-86481335.99107.97576.12576.120.187409
3412290Jerde-Hilpert10005S1-065324855.822679.368185.498185.490.327330
4412290Jerde-Hilpert10005S1-828012113.62286.028185.498185.490.034942
5412290Jerde-Hilpert10005S1-06532992.55832.958185.498185.490.101759
6412290Jerde-Hilpert10005S1-474124478.913472.048185.498185.490.424170
7412290Jerde-Hilpert10005S1-277223625.42915.128185.498185.490.111798
8218895Kulas Inc10006S1-277223295.663061.123724.493724.490.821890
9218895Kulas Inc10006B1-330872322.55518.653724.493724.490.139254
10218895Kulas Inc10006B1-33364372.30216.903724.493724.490.058236
11218895Kulas Inc10006B1-20000-172.18-72.183724.493724.49-0.019380

This certainly works but there are several steps needed to get the data combined
in the manner we need.

Second Approach – Using Transform

Using the original data, let’s try using
transform

and
groupby

and
see what we get:

df.groupby('order')["ext price"].transform('sum')
0      576.12
1      576.12
2      576.12
3     8185.49
4     8185.49
5     8185.49
6     8185.49
7     8185.49
8     3724.49
9     3724.49
10    3724.49
11    3724.49
dtype: float64

You will notice how this returns a different size data set from our normal
groupby

functions. Instead of only showing the totals for 3 orders, we retain the same number
of items as the original data set. That is the unique feature of using
transform

.

The final step is pretty simple:

df["Order_Total"] = df.groupby('order')["ext price"].transform('sum')
df["Percent_of_Order"] = df["ext price"] / df["Order_Total"]
accountnameorderskuquantityunit priceext priceorder totalOrder_TotalPercent_of_Order
0383080Will LLC10001B1-20000733.69235.83576.12576.120.409342
1383080Will LLC10001S1-277221121.12232.32576.12576.120.403249
2383080Will LLC10001B1-86481335.99107.97576.12576.120.187409
3412290Jerde-Hilpert10005S1-065324855.822679.368185.498185.490.327330
4412290Jerde-Hilpert10005S1-828012113.62286.028185.498185.490.034942
5412290Jerde-Hilpert10005S1-06532992.55832.958185.498185.490.101759
6412290Jerde-Hilpert10005S1-474124478.913472.048185.498185.490.424170
7412290Jerde-Hilpert10005S1-277223625.42915.128185.498185.490.111798
8218895Kulas Inc10006S1-277223295.663061.123724.493724.490.821890
9218895Kulas Inc10006B1-330872322.55518.653724.493724.490.139254
10218895Kulas Inc10006B1-33364372.30216.903724.493724.490.058236
11218895Kulas Inc10006B1-20000-172.18-72.183724.493724.49-0.019380

As an added bonus, you could combine into one statement if you did not want to
show the individual order totals:

df["Percent_of_Order"] = df["ext price"] / df.groupby('order')["ext price"].transform('sum')

Here is a diagram to show what is happening:

Groupby Example

After taking the time to understand
transform

, I think you will agree that this
tool can be very powerful – even if it is a unique approach as compared to the
standard Excel mindset.

Conclusion

I am continually amazed at the power of pandas to make complex numerical manipulations
very efficient. Despite working with pandas for a while, I never took the time to figure
out how to use
transform.

Now that I understand how it works, I am sure
I will be able to use it in future analysis and hope that you will find this useful
as well.


Source From: pbpython.com.
Original article title: Understanding the Transform Function in Pandas.
This full article can be read at: Understanding the Transform Function in Pandas.

Advertisement
StudioPress Premium WordPress Themes: Foodie Pro Theme


Random Article You May Like

Leave a Reply

Your email address will not be published. Required fields are marked *

*
*