## Introduction

One of the compelling features of pandas is that it has a rich library of methods for manipulating

data. However, there are times when it is not clear what the various functions

do and how to use them. If you are approaching a problem from an Excel mindset,

it can be difficult to translate the planned solution into the unfamiliar pandas command.

One of those “unknown” functions is the

transform

method.

Even after using pandas for a while, I have never had the chance to use this function

so I recently took some time to figure out what it is and how it could be helpful

for real world analysis. This article will walk through an example where

transform

can be used to efficiently summarize data.

## What is transform?

I have found the best coverage of this topic in Jake VanderPlas’ excellent

Python Data Science Handbook. I plan to write a review on this book in the future

but the short and sweet is that it is a great resource that I highly recommend.

As described in the book,

transform

is an operation used in conjunction

with

groupby

(which is one of the most useful operations in pandas). I

suspect most pandas users likely have used

aggregate

,

filter

or

apply

with

groupby

to summarize data. However,

transform

is a little more difficult to understand – especially coming from an Excel world.

Since Jake made all of his book available via jupyter notebooks it is a good place

to start to understand how transform is unique:

While aggregation must return a reduced version of the data, transformation can

return some transformed version of the full data to recombine. For such a transformation,

the output is the same shape as the input. A common example is to center the data

by subtracting the group-wise mean.

With that basic definition, I will go through another example that can explain how this

is useful in other instances outside of centering data.

## Problem Set

For this example, we will analyze some fictitious sales data. In order to keep the

dataset small, here is a sample of 12 sales transactions for our company:

account | name | order | sku | quantity | unit price | ext price | |
---|---|---|---|---|---|---|---|

0 | 383080 | Will LLC | 10001 | B1-20000 | 7 | 33.69 | 235.83 |

1 | 383080 | Will LLC | 10001 | S1-27722 | 11 | 21.12 | 232.32 |

2 | 383080 | Will LLC | 10001 | B1-86481 | 3 | 35.99 | 107.97 |

3 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 48 | 55.82 | 2679.36 |

4 | 412290 | Jerde-Hilpert | 10005 | S1-82801 | 21 | 13.62 | 286.02 |

5 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 9 | 92.55 | 832.95 |

6 | 412290 | Jerde-Hilpert | 10005 | S1-47412 | 44 | 78.91 | 3472.04 |

7 | 412290 | Jerde-Hilpert | 10005 | S1-27722 | 36 | 25.42 | 915.12 |

8 | 218895 | Kulas Inc | 10006 | S1-27722 | 32 | 95.66 | 3061.12 |

9 | 218895 | Kulas Inc | 10006 | B1-33087 | 23 | 22.55 | 518.65 |

10 | 218895 | Kulas Inc | 10006 | B1-33364 | 3 | 72.30 | 216.90 |

11 | 218895 | Kulas Inc | 10006 | B1-20000 | -1 | 72.18 | -72.18 |

You can see in the data that the file contains 3 different orders (10001, 10005 and 10006)

and that each order consists has multiple products (aka skus).

The question we would like to answer is: “What percentage of the order total

does each sku represent?”

For example, if we look at order 10001 with a total of $576.12, the break down would be:

- B1-20000 = $235.83 or 40.9%
- S1-27722 = $232.32 or 40.3%
- B1-86481 = $107.97 or 18.7%

The tricky part in this calculation is that we need to get a total for each order

and combine it back with the transaction level detail in order to get the percentages.

In Excel, you could try to use some version of a subtotal to try to calculate the values.

## First Approach – Merging

If you are familiar with pandas, your first inclination is going to be trying

to group the data into a new dataframe and combine it in a multi-step process.

Here’s what that approach would look like.

Import all the modules we need and read in our data:

```
import pandas as pd
df = pd.read_excel("sales_transactions.xlsx")
```

Now that the data is in a dataframe, determining the total by order is simple with the

help of the standard

groupby

aggregation.

```
df.groupby('order')["ext price"].sum()
```

order 10001 576.12 10005 8185.49 10006 3724.49 Name: ext price, dtype: float64

Here is a simple image showing what is happening with the standard

groupby

The tricky part is figuring out how to combine this data back with the original

dataframe. The first instinct is to create a new dataframe with the totals by order

and merge it back with the original. We could do something like this:

```
order_total = df.groupby('order')["ext price"].sum().rename("Order_Total").reset_index()
df_1 = df.merge(order_total)
df_1["Percent_of_Order"] = df_1["ext price"] / df_1["Order_Total"]
```

account | name | order | sku | quantity | unit price | ext price | order total | Order_Total | Percent_of_Order | |
---|---|---|---|---|---|---|---|---|---|---|

0 | 383080 | Will LLC | 10001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 | 576.12 | 0.409342 |

1 | 383080 | Will LLC | 10001 | S1-27722 | 11 | 21.12 | 232.32 | 576.12 | 576.12 | 0.403249 |

2 | 383080 | Will LLC | 10001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 | 576.12 | 0.187409 |

3 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 | 8185.49 | 0.327330 |

4 | 412290 | Jerde-Hilpert | 10005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 | 8185.49 | 0.034942 |

5 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 | 8185.49 | 0.101759 |

6 | 412290 | Jerde-Hilpert | 10005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 | 8185.49 | 0.424170 |

7 | 412290 | Jerde-Hilpert | 10005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 | 8185.49 | 0.111798 |

8 | 218895 | Kulas Inc | 10006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 | 3724.49 | 0.821890 |

9 | 218895 | Kulas Inc | 10006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 | 3724.49 | 0.139254 |

10 | 218895 | Kulas Inc | 10006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 | 3724.49 | 0.058236 |

11 | 218895 | Kulas Inc | 10006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 | 3724.49 | -0.019380 |

This certainly works but there are several steps needed to get the data combined

in the manner we need.

## Second Approach – Using Transform

Using the original data, let’s try using

transform

and

groupby

and

see what we get:

```
df.groupby('order')["ext price"].transform('sum')
```

0 576.12 1 576.12 2 576.12 3 8185.49 4 8185.49 5 8185.49 6 8185.49 7 8185.49 8 3724.49 9 3724.49 10 3724.49 11 3724.49 dtype: float64

You will notice how this returns a different size data set from our normal

groupby

functions. Instead of only showing the totals for 3 orders, we retain the same number

of items as the original data set. That is the unique feature of using

transform

.

The final step is pretty simple:

```
df["Order_Total"] = df.groupby('order')["ext price"].transform('sum')
df["Percent_of_Order"] = df["ext price"] / df["Order_Total"]
```

account | name | order | sku | quantity | unit price | ext price | order total | Order_Total | Percent_of_Order | |
---|---|---|---|---|---|---|---|---|---|---|

0 | 383080 | Will LLC | 10001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 | 576.12 | 0.409342 |

1 | 383080 | Will LLC | 10001 | S1-27722 | 11 | 21.12 | 232.32 | 576.12 | 576.12 | 0.403249 |

2 | 383080 | Will LLC | 10001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 | 576.12 | 0.187409 |

3 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 | 8185.49 | 0.327330 |

4 | 412290 | Jerde-Hilpert | 10005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 | 8185.49 | 0.034942 |

5 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 | 8185.49 | 0.101759 |

6 | 412290 | Jerde-Hilpert | 10005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 | 8185.49 | 0.424170 |

7 | 412290 | Jerde-Hilpert | 10005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 | 8185.49 | 0.111798 |

8 | 218895 | Kulas Inc | 10006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 | 3724.49 | 0.821890 |

9 | 218895 | Kulas Inc | 10006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 | 3724.49 | 0.139254 |

10 | 218895 | Kulas Inc | 10006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 | 3724.49 | 0.058236 |

11 | 218895 | Kulas Inc | 10006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 | 3724.49 | -0.019380 |

As an added bonus, you could combine into one statement if you did not want to

show the individual order totals:

```
df["Percent_of_Order"] = df["ext price"] / df.groupby('order')["ext price"].transform('sum')
```

Here is a diagram to show what is happening:

After taking the time to understand

transform

, I think you will agree that this

tool can be very powerful – even if it is a unique approach as compared to the

standard Excel mindset.

## Conclusion

I am continually amazed at the power of pandas to make complex numerical manipulations

very efficient. Despite working with pandas for a while, I never took the time to figure

out how to use

transform.

Now that I understand how it works, I am sure

I will be able to use it in future analysis and hope that you will find this useful

as well.

**
Source From: pbpython.com.
Original article title: Understanding the Transform Function in Pandas.
This full article can be read at: Understanding the Transform Function in Pandas.
**

**Advertisement**