## Introduction

Pandas includes multiple built in functions such as

sum

,

mean

,

max

,

min

, etc. that you can apply to a DataFrame or grouped data.

However, building and using your own function is a good way to learn more about

how pandas works and can increase your productivity with data wrangling and analysis.

The weighted average is a good example use case because it is easy to understand but useful formula

that is not included in pandas. I find that it can be more intuitive than a simple average

when looking at certain collections of data. Building a weighted average function

in pandas is relatively simple but can be incredibly useful when combined with

other pandas functions such as

groupby

.

This article will discuss the basics of why you might choose to use a weighted

average to look at your data then walk through how to build and use this function

in pandas. The basic principles shown in this article will be helpful for building

more complex analysis in pandas and should also be helpful in understanding

how to work with grouped data in pandas.

## Why Use A Weighted Average?

A simple example shows why the weighted average can be a helpful statistic. The table below

shows the prices and quantities that 3 different customers pay for the same product.

Customer | Shoe Price | Shoe Quantity |
---|---|---|

Small Customer | 300 | 20 |

Medium Customer | 200 | 100 |

Big Customer | 150 | 225 |

If someone were to ask, what is the average price of our shoes? The simple average

of the shoe prices would be:

frac{300+200+150}{3} = 6.67

end{equation*}

While this is an accurate average, this does not intuitively make sense for understanding our

average selling price. This is especially challenging if we want to use an average for

revenue projections.

If you look at the numbers, you can see we are selling far more shoes for < $200

than we are above $200. Therefore an average of $216.67 does not accurately reflect

the real average selling price in the market.

What would be more useful is to weight those prices based on the quantity purchased.

Let’s build a weighted average such that the average shoe price will be more

representative of all customers’ purchase patterns.

A weighted average can be calculated like this:

frac{(300*20 + 200*100 + 150*225)}{(20 + 100 + 225)} = 3.19

end{equation*}

Since we are selling the vast majority of our shoes between $200 and $150, this number

represents the overall average price of our products more accurately than the simple average.

This concept is simple but can be a little bit more difficult to calculate in pandas

because you need two values: the value to average (shoe price) and the weight

(shoe quantity). Let’s walk through how to build and use this in pandas.

## Calculating Weighted Average in Pandas

As shown above, the mathematical concept for a weighted average is straightforward.

Because we need values and weights, it can be a little less intuitive to implement in pandas when

you are doing complex groupings of data. However, once you figure it out, it can be incredibly easy

to use the weighted average in a bunch of different scenarios.

Additionally, the process of building out this functionality and using it in various

situations should be useful for building your day to day pandas data manipulation skills.

Before I go any further, I wanted to call out that the basic code for this function is

based on this Stack Overflow question.

We are going to use a simple DataFrame that contains fictious sales data as the

basis for our analysis. Let’s start by importing all the modules we need and read

in our Excel file:

```
import pandas as pd
import numpy as np
sales = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sales-estimate.xlsx?raw=True", sheetname="projections")
sales.head()
```

Account | Name | State | Rep | Manager | Current_Price | Quantity | New_Product_Price | |
---|---|---|---|---|---|---|---|---|

0 | 714466 | Trantow-Barrows | MN | Craig Booker | Debra Henley | 500 | 100 | 550 |

1 | 737550 | Fritsch, Russel and Anderson | MN | Craig Booker | Debra Henley | 600 | 90 | 725 |

2 | 146832 | Kiehn-Spinka | TX | Daniel Hilton | Debra Henley | 225 | 475 | 255 |

3 | 218895 | Kulas Inc | TX | Daniel Hilton | Debra Henley | 290 | 375 | 300 |

4 | 412290 | Jerde-Hilpert | WI | John Smith | Debra Henley | 375 | 400 | 400 |

In our example data, we have a bunch of account information that includes a current price and

quantity as well as a projected **New_Product_Price**.

If we want to determine a simple mean, we can use the built in functions to easily

calculate it:

```
sales["Current_Price"].mean()
sales["New_Product_Price"].mean()
```

405.41666 447.08333

In order to calculate a weighted average using the long approach:

```
(sales["Current_Price"] * sales["Quantity"]).sum() / sales["Quantity"].sum()
(sales["New_Product_Price"] * sales["Quantity"]).sum() / sales["Quantity"].sum()
```

374.6383 342.5406

average function? We absolutely could but I wanted to show how to create a formula.

At the end of the article, I will show how to use

np.average

The weighted average formula is not complicated but it is verbose. It also is going

to be difficult to use when we group data. Life will be much easier if we build

a function for calculating the data.

## Grouping Data with the Weighted Average

Panda’s

groupby

is commonly used to summarize data. For instance, if we want

to look at the mean of the **Current_Price** by manager, it is simple with

groupby

:

```
sales.groupby("Manager")["Current_Price"].mean()
```

Manager Debra Henley 423.333333 Fred Anderson 387.500000 Name: Current_Price, dtype: float64

Ideally we would like to do the same thing with the weighted average, but how do

we pass in the weights we want to use? Hmmm.

The answer is to define a custom function that takes the names of the columns of

our data and calculates the weighted average. Then, use

apply

to execute it

against our grouped data.

```
def wavg(group, avg_name, weight_name):
""" http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns
In rare instance, we may not have weights, so just return the mean. Customize this if your business case
should return otherwise.
"""
d = group[avg_name]
w = group[weight_name]
try:
return (d * w).sum() / w.sum()
except ZeroDivisionError:
return d.mean()
```

I want to return the simple mean. In your case you might want to return a

NaN

or some other value. This is one example of the power you have by

building your own function.

In order to get our weighted average:

```
wavg(sales, "Current_Price", "Quantity")
```

342.54068716094031

The nice thing is that this will also work on grouped data. The key is that we

need to use

apply

in order for pandas to pass the various groupings

to the function.

```
sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")
```

Manager Debra Henley 340.665584 Fred Anderson 344.897959 dtype: float64

Using this on our projected price is easy because you just need to pass in a

new column name:

```
sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")
```

Manager Debra Henley 372.646104 Fred Anderson 377.142857 dtype: float64

It is also possible to group by multiple criteria and the function will make sure

that the correct data is used in each grouping:

```
sales.groupby(["Manager", "State"]).apply(wavg, "New_Product_Price", "Quantity")
```

Manager State Debra Henley MN 632.894737 TX 274.852941 WI 440.000000 Fred Anderson CA 446.428571 NV 325.000000 WA 610.000000 dtype: float64

This is a simple but really useful approach to understanding your data better.

## Multiple Aggregations

One final item I wanted to cover is the ability to perform multiple aggregations on data.

For instance, if we want to get the mean for some columns, median for one and sum for another,

we can do this by defining a dictionary with the column names and aggregation functions to call.

Then, we call it on the grouped data with

agg

```
f = {'New_Product_Price': ['mean'],'Current_Price': ['median'], 'Quantity': ['sum', 'mean']}
sales.groupby("Manager").agg(f)
```

New_Product_Price | Current_Price | Quantity | ||
---|---|---|---|---|

mean | median | sum | mean | |

Manager | ||||

Debra Henley | 471.666667 | 437.5 | 1540 | 256.666667 |

Fred Anderson | 422.500000 | 375.0 | 1225 | 204.166667 |

me know in the comments.

Unfortunately, I could not figure out how to do something similar with a custom function that

takes arguments. I’m hoping that I am missing something and that a reader will point it out.

In the meantime, here is the approach I use to combine multiple custom functions into a single DataFrame.

First create two datasets of the various weighted averages:

```
data_1 = sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")
data_2 = sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")
```

Then combine them into a single DataFrame and give it a meaningful label:

```
summary = pd.DataFrame(data=dict(s1=data_1, s2=data_2))
summary.columns = ["New Product Price","Current Product Price"]
summary.head()
```

New Product Price | Current Product Price | |
---|---|---|

Manager | ||

Debra Henley | 372.646104 | 340.665584 |

Fred Anderson | 377.142857 | 344.897959 |

I have actually found myself using this pattern in several different scenarios

so I’m hoping it is useful to others as well.

## Using Numpy

As I mentioned above, Numpy has an average function which can take a list of weights

and calculate a weighted average.

Here is how to use it to get the weighted average for all the ungrouped data:

```
np.average(sales["Current_Price"], weights=sales["Quantity"])
```

342.54068716094031

If you want to call this on grouped data, you would need to build a

lambda

function:

```
sales.groupby("Manager").apply(lambda x: np.average(x['New_Product_Price'], weights=x['Quantity']))
```

Manager Debra Henley 372.646104 Fred Anderson 377.142857 dtype: float64

## Conclusion

Sometimes when I’m working with pandas, I know something is possible but get stuck

on a minor implementation detail that trips me up. The process I describe above

shows one example of how I worked through a relatively simple math problem and

built a robust solution in pandas that can work on grouped or ungrouped data.

The principals shown here can be used to build your own complex formulas for

your own needs. If you would prefer looking at this in a notebook, you can find

it on github.

Thanks for reading and if you have any input or suggestions, feel free to comment below.

if (!document.getElementById(‘mathjaxscript_pelican_#%@#$@#’)) {

var align = “center”,

indent = “0em”,

linebreak = “false”;

if (false) {

align = (screen.width < 768) ? “left” : align;

indent = (screen.width < 768) ? “0em” : indent;

linebreak = (screen.width < 768) ? ‘true’ : linebreak;

}

var mathjaxscript = document.createElement(‘script’);

var location_protocol = (false) ? ‘https’ : document.location.protocol;

if (location_protocol !== ‘http’ && location_protocol !== ‘https’) location_protocol = ‘https:’;

mathjaxscript.id = ‘mathjaxscript_pelican_#%@#$@#’;

mathjaxscript.type = ‘text/javascript’;

mathjaxscript.src = location_protocol + ‘//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML’;

mathjaxscript[(window.opera ? “innerHTML” : “text”)] =

“MathJax.Hub.Config({” +

” config: [‘MMLorHTML.js’],” +

” TeX: { extensions: [‘AMSmath.js’,’AMSsymbols.js’,’noErrors.js’,’noUndefined.js’], equationNumbers: { autoNumber: ‘AMS’ } },” +

” jax: [‘input/TeX’,’input/MathML’,’output/HTML-CSS’],” +

” extensions: [‘tex2jax.js’,’mml2jax.js’,’MathMenu.js’,’MathZoom.js’],” +

” displayAlign: ‘”+ align +”‘,” +

” displayIndent: ‘”+ indent +”‘,” +

” showMathMenu: true,” +

” messageStyle: ‘normal’,” +

” tex2jax: { ” +

” inlineMath: [ [‘\(‘,’\)’] ], ” +

” displayMath: [ [‘$$’,’$$’] ],” +

” processEscapes: true,” +

” preview: ‘TeX’,” +

” }, ” +

” ‘HTML-CSS’: { ” +

” styles: { ‘.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn’: {color: ‘inherit ! important’} },” +

” linebreaks: { automatic: “+ linebreak +”, width: ‘90% container’ },” +

” }, ” +

“}); ” +

“if (‘default’ !== ‘default’) {” +

“MathJax.Hub.Register.StartupHook(‘HTML-CSS Jax Ready’,function () {” +

“var VARIANT = MathJax.OutputJax[‘HTML-CSS’].FONTDATA.VARIANT;” +

“VARIANT[‘normal’].fonts.unshift(‘MathJax_default’);” +

“VARIANT[‘bold’].fonts.unshift(‘MathJax_default-bold’);” +

“VARIANT[‘italic’].fonts.unshift(‘MathJax_default-italic’);” +

“VARIANT[‘-tex-mathit’].fonts.unshift(‘MathJax_default-italic’);” +

“});” +

“MathJax.Hub.Register.StartupHook(‘SVG Jax Ready’,function () {” +

“var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;” +

“VARIANT[‘normal’].fonts.unshift(‘MathJax_default’);” +

“VARIANT[‘bold’].fonts.unshift(‘MathJax_default-bold’);” +

“VARIANT[‘italic’].fonts.unshift(‘MathJax_default-italic’);” +

“VARIANT[‘-tex-mathit’].fonts.unshift(‘MathJax_default-italic’);” +

“});” +

“}”;

(document.body || document.getElementsByTagName(‘head’)[0]).appendChild(mathjaxscript);

}

**Source From: pbpython.com.** **Original article title: “Learn More About Pandas By Building and Using a Weighted Average Function”.**

**This full article can be read at Learn More About Pandas By Building and Using a Weighted Average Function.**

**Advertisement**