Introduction to Market Basket Analysis in Python

Introduction

There are many data analysis tools available to the Python analyst, and it can be
challenging to know which ones to use in a particular situation. A useful
(but somewhat overlooked) technique called association analysis
attempts to find common patterns of items in large data sets. One specific
application is often called market basket analysis. The most
commonly cited example of market basket analysis is the so-called “beer and diapers”
case. The basic story is that a large retailer
was able to mine its transaction data and find an unexpected purchase pattern:
individuals were buying beer and baby diapers at the same time.

Unfortunately this story is most likely a data urban legend. However, it is
an illustrative (and entertaining) example of the types of insights
that can be gained by mining transactional data.

While these types of associations are normally used for analyzing sales transactions,
the basic approach can be applied to other situations like clickstream tracking,
spare parts ordering and online recommendation engines, just to name a few.

If you have some basic understanding of the Python data science world, your first
inclination might be to look at scikit-learn for a ready-made algorithm. However,
scikit-learn does not support this algorithm. Fortunately, the very useful MLxtend
library by Sebastian Raschka has an implementation of the Apriori algorithm
for extracting frequent item sets for further analysis.

The rest of this article will walk through an example of using this library
to analyze a relatively large online retail data set and try to find
interesting purchase combinations. By the end of this article, you should be
familiar enough with the basic approach to apply it to your own data sets.

Why Association Analysis?

In today’s world, there are many complex ways to analyze data (clustering, regression,
neural networks, random forests, SVMs, etc.). The challenge with many of these approaches
is that they can be difficult to tune, challenging to interpret and require quite a bit
of data prep and feature engineering to get good results. In other words, they
can be very powerful but require a lot of knowledge to implement properly.

Association analysis is relatively light on the math concepts and easy to explain
to non-technical people. In addition, it is an unsupervised learning tool that looks
for hidden patterns so there is limited need for data prep and feature engineering.
It is a good start for certain cases of data exploration and can point the way for a deeper dive
into the data using other approaches.

As an added bonus, the Python implementation in MLxtend should be very familiar
to anyone who has exposure to scikit-learn and pandas. For all these reasons,
I think it is a useful tool to be familiar with, and one that can help you with
your data analysis problems.

One quick note – technically, market basket analysis is just one application of
association analysis. In this post though, I will use association
analysis and market basket analysis interchangeably.

Association Analysis 101

There are a couple of terms used in association analysis that are important to understand.
This chapter in Introduction to Data Mining is a great reference for those
interested in the math behind these definitions and the details of the algorithm implementation.

Association rules are normally written like this: {Diapers} -> {Beer}, which means that
there is a strong relationship between customers who purchased diapers and beer
in the same transaction.

In the above example, {Diapers} is the antecedent and {Beer} is the consequent.
Both antecedents and consequents can have multiple items. In other words,
{Diapers, Gum} -> {Beer, Chips} is a valid rule.

Support is the relative frequency with which a rule shows up. In many instances, you
may want to look for high support in order to make sure it is a useful relationship.
However, there may be instances where low support is useful if you are trying to find
“hidden” relationships.

Confidence is a measure of the reliability of the rule. A confidence of .5 in
the above example would mean that in 50% of the cases where Diapers and Gum were
purchased, the purchase also included Beer and Chips. For product
recommendation, a 50% confidence may be perfectly acceptable, but in a medical
situation this level may not be high enough.

Lift is the ratio of the observed support to that expected if the antecedent and
consequent were independent (see Wikipedia). The basic rule of thumb is that a lift value close
to 1 means the two sides of the rule are independent. Lift values > 1 are generally
more “interesting” and could be indicative of a useful rule pattern.
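These three metrics can be illustrated with a small worked example. The toy transactions below are my own invention, not the article’s data set; the arithmetic simply follows the definitions above:

```python
import pandas as pd

# Five hypothetical one-hot encoded transactions (1 = item in basket)
transactions = pd.DataFrame({
    'Diapers': [1, 1, 1, 0, 1],
    'Beer':    [1, 1, 0, 0, 1],
})

n = len(transactions)
support_diapers = transactions['Diapers'].sum() / n   # 4/5 = 0.8
support_beer = transactions['Beer'].sum() / n         # 3/5 = 0.6
# Support for the rule: transactions containing both items
support_both = ((transactions['Diapers'] + transactions['Beer']) == 2).sum() / n  # 3/5 = 0.6

# {Diapers} -> {Beer}: how often does Beer appear when Diapers does?
confidence = support_both / support_diapers           # 0.75
# Lift > 1: the pair shows up more often than independence would predict
lift = support_both / (support_diapers * support_beer)  # 1.25
```

So 75% of diaper purchases also include beer, and the combination occurs 1.25 times more often than it would if the two items were bought independently.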

One final note related to the data: this analysis requires that all the data for
a transaction be included in one row, with the items one-hot encoded.
The MLxtend documentation example is useful:

   Apple  Corn  Dill  Eggs  Ice cream  Kidney Beans  Milk  Nutmeg  Onion  Unicorn  Yogurt
0      0     0     0     1          0             1     1       1      1        0       1
1      0     0     1     1          0             1     0       1      1        0       1
2      1     0     0     1          0             1     1       0      0        0       0
3      0     1     0     0          0             1     1       0      0        1       1
4      0     1     0     1          1             1     0       0      1        0       0
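For reference, a one-hot frame like the one above can be built from raw transaction lists with a few lines of pandas. This is a sketch using the item lists implied by the table (MLxtend also ships a TransactionEncoder that does the same job):

```python
import pandas as pd

# Transaction lists corresponding to the table above
dataset = [
    ['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'],
]

# Build the 0/1 matrix: one row per transaction, one column per item
items = sorted({item for transaction in dataset for item in transaction})
onehot = pd.DataFrame(
    [[int(item in transaction) for item in items] for transaction in dataset],
    columns=items)
```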

The specific data for this article comes from the UCI Machine Learning Repository
and represents transactional data from a UK retailer from 2010-2011. This mostly
represents sales to wholesalers so it is slightly different from consumer purchase
patterns but is still a useful case study.

Let’s Code

MLxtend can be installed using pip (pip install mlxtend), so make sure that is done
before trying to execute any of the code below. Once it is installed, the code below
shows how to get it up and running. I have made the notebook available so feel free to
follow along with the examples below.

Get our pandas and MLxtend code imported and read the data:

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df.head()
  InvoiceNo StockCode                          Description  Quantity          InvoiceDate  UnitPrice  CustomerID         Country
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  2010-12-01 08:26:00       2.55     17850.0  United Kingdom
1    536365     71053                  WHITE METAL LANTERN         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8  2010-12-01 08:26:00       2.75     17850.0  United Kingdom
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom

There is a little cleanup we need to do. First, some of the descriptions have trailing
spaces that need to be removed. We’ll also drop the rows that don’t have invoice numbers
and remove the credit transactions (those with invoice numbers containing C).

df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]

After the cleanup, we need to consolidate the items into one transaction per row with each
product one-hot encoded. For the sake of keeping the data set small, I’m only
looking at sales for France. However, in additional code below, I will compare
these results to sales from Germany. Further country comparisons would be
interesting to investigate.

basket = (df[df['Country'] == "France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

Here’s what the first few columns look like (note, I added some numbers
to the columns to illustrate the concept – the actual data in this example
is all 0’s):

Description  10 COLOUR SPACEBOY PEN  12 COLOURED PARTY BALLOONS  12 EGG HOUSE PAINTED WOOD  12 MESSAGE CARDS WITH ENVELOPES  12 PENCIL SMALL TUBE WOODLAND  12 PENCILS SMALL TUBE RED RETROSPOT  12 PENCILS SMALL TUBE SKULL  12 PENCILS TALL TUBE POSY
InvoiceNo
536370                         11.0                         0.0                        0.0                              0.0                            0.0                                  0.0                          0.0                        1.0
536852                          0.0                         0.0                        0.0                              0.0                            5.0                                  0.0                          0.0                        0.0
536974                          0.0                         0.0                        0.0                              0.0                            0.0                                  0.0                          0.0                        0.0
537065                          0.0                         0.0                        0.0                              0.0                            0.0                                  7.0                          0.0                        0.0
537463                          0.0                         0.0                        9.0                              0.0                            0.0                                  0.0                          0.0                        0.0

There are a lot of zeros in the data, but we also need to make sure any positive
values are converted to a 1 and anything less than or equal to 0 is set to 0. This step will
complete the one-hot encoding of the data and remove the POSTAGE column (since
that charge is not one we wish to explore):

def encode_units(x):
    # Flag whether the item appears in the transaction (1) or not (0)
    if x <= 0:
        return 0
    if x >= 1:
        return 1

basket_sets = basket.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)
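As an aside, the same encoding can be done in one vectorized step, which avoids the element-wise applymap. This is a minimal sketch on a made-up frame; the result is identical to encode_units for non-negative quantities:

```python
import pandas as pd

# Hypothetical pivoted basket: summed quantities per invoice/item
basket = pd.DataFrame({'ITEM A': [11.0, 0.0, 0.0],
                       'ITEM B': [0.0, 5.0, 0.0]},
                      index=['536370', '536852', '536974'])

# Any positive quantity -> 1, zero or negative -> 0
basket_sets = (basket > 0).astype(int)
```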

Now that the data is structured properly, we can generate frequent item sets
that have a support of at least 7% (this number was chosen so that I could
get enough useful examples):

frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)

The final step is to generate the rules with their corresponding support, confidence and lift:

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()
                          antecedants                         consequents   support  confidence      lift
0  (PLASTERS IN TIN WOODLAND ANIMALS)     (PLASTERS IN TIN CIRCUS PARADE)  0.170918    0.597015  3.545907
1     (PLASTERS IN TIN CIRCUS PARADE)  (PLASTERS IN TIN WOODLAND ANIMALS)  0.168367    0.606061  3.545907
2     (PLASTERS IN TIN CIRCUS PARADE)          (PLASTERS IN TIN SPACEBOY)  0.168367    0.530303  3.849607
3          (PLASTERS IN TIN SPACEBOY)     (PLASTERS IN TIN CIRCUS PARADE)  0.137755    0.648148  3.849607
4  (PLASTERS IN TIN WOODLAND ANIMALS)          (PLASTERS IN TIN SPACEBOY)  0.170918    0.611940  4.442233

That’s all there is to it! Build the frequent item sets using apriori, then
build the rules with association_rules.

Now, the tricky part is figuring out what this tells us.
For instance, we can see that there are quite a few rules with a high lift value,
which means they occur more frequently than would be expected given the number
of transactions and product combinations. We can also see several where the confidence
is high as well. This part of the analysis is where the domain knowledge
will come in handy. Since I do not have that, I’ll just look for a couple of
illustrative examples.

We can filter the dataframe using standard pandas code. In this case, look for rules
with a large lift (at least 6) and high confidence (at least .8):

rules[ (rules['lift'] >= 6) &
       (rules['confidence'] >= 0.8) ]
                                          antecedants                           consequents   support  confidence      lift
8                       (SET/6 RED SPOTTY PAPER CUPS)       (SET/6 RED SPOTTY PAPER PLATES)  0.137755    0.888889  6.968889
9                     (SET/6 RED SPOTTY PAPER PLATES)         (SET/6 RED SPOTTY PAPER CUPS)  0.127551    0.960000  6.968889
10                       (ALARM CLOCK BAKELIKE GREEN)            (ALARM CLOCK BAKELIKE RED)  0.096939    0.815789  8.642959
11                         (ALARM CLOCK BAKELIKE RED)          (ALARM CLOCK BAKELIKE GREEN)  0.094388    0.837838  8.642959
16  (SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY...  (SET/20 RED RETROSPOT PAPER NAPKINS)  0.122449    0.812500  6.125000
17  (SET/6 RED SPOTTY PAPER CUPS, SET/20 RED RETRO...       (SET/6 RED SPOTTY PAPER PLATES)  0.102041    0.975000  7.644000
18  (SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET...         (SET/6 RED SPOTTY PAPER CUPS)  0.102041    0.975000  7.077778
22                    (SET/6 RED SPOTTY PAPER PLATES)  (SET/20 RED RETROSPOT PAPER NAPKINS)  0.127551    0.800000  6.030769

In looking at the rules, it seems that the green and red alarm clocks are purchased
together and the red paper cups, napkins and plates are purchased together in
a manner that is higher than the overall probability would suggest.

At this point, you may want to look at how much opportunity there is to use the popularity
of one product to drive sales of another. For instance, we can see that we sell
340 Green Alarm clocks but only 316 Red Alarm Clocks so maybe we can drive more
Red Alarm Clock sales through recommendations?

basket['ALARM CLOCK BAKELIKE GREEN'].sum()

340.0

basket['ALARM CLOCK BAKELIKE RED'].sum()

316.0

What is also interesting is to see how the combinations vary by country of
purchase. Let’s check out what some popular combinations might be in Germany:

basket2 = (df[df['Country'] == "Germany"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

basket_sets2 = basket2.applymap(encode_units)
basket_sets2.drop('POSTAGE', inplace=True, axis=1)
frequent_itemsets2 = apriori(basket_sets2, min_support=0.05, use_colnames=True)
rules2 = association_rules(frequent_itemsets2, metric="lift", min_threshold=1)

rules2[ (rules2['lift'] >= 4) &
        (rules2['confidence'] >= 0.5)]
                        antecedants                          consequents   support  confidence      lift
7        (PLASTERS IN TIN SPACEBOY)  (PLASTERS IN TIN WOODLAND ANIMALS)  0.107221    0.571429  4.145125
9   (PLASTERS IN TIN CIRCUS PARADE)  (PLASTERS IN TIN WOODLAND ANIMALS)  0.115974    0.584906  4.242887
10    (RED RETROSPOT CHARLOTTE BAG)             (WOODLAND CHARLOTTE BAG)  0.070022    0.843750  6.648168

It seems that in addition to David Hasselhoff, Germans love Plasters in Tin Spaceboy and
Woodland Animals.

In all seriousness, an analyst who has familiarity with the data would probably
have a dozen different questions that this type of analysis could drive. I did not
replicate this analysis for additional countries or customer combinations, but the overall
process would be relatively simple given the basic pandas code shown above.
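For example, the per-country pivot and encoding steps above could be wrapped in a small helper to make those comparisons easy. This is a sketch; the function name and the vectorized encoding are mine, not from the original code:

```python
import pandas as pd

def country_basket(df, country, drop_items=('POSTAGE',)):
    """One-hot basket (invoices x items) for one country, ready for apriori."""
    basket = (df[df['Country'] == country]
              .groupby(['InvoiceNo', 'Description'])['Quantity']
              .sum().unstack().reset_index().fillna(0)
              .set_index('InvoiceNo'))
    # Drop charges we don't want to mine, then one-hot encode
    basket = basket.drop(columns=[c for c in drop_items if c in basket.columns])
    return (basket > 0).astype(int)

# Tiny synthetic transaction log, just to show the shape of the result
demo = pd.DataFrame({
    'InvoiceNo':   ['1', '1', '2'],
    'Description': ['MUG', 'PEN', 'MUG'],
    'Quantity':    [2, 1, 3],
    'Country':     ['Spain', 'Spain', 'Spain'],
})
spain = country_basket(demo, 'Spain')
```

From there, apriori and association_rules apply to the returned frame exactly as in the France and Germany examples.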

Conclusion

The really nice aspect of association analysis is that it is easy to run and relatively
easy to interpret. If you did not have access to MLxtend and this association
analysis, it would be exceedingly difficult to find these patterns using
basic Excel analysis. With Python and MLxtend, the analysis process is relatively
straightforward, and since you are in Python, you have access to all the additional
visualization techniques and data analysis tools in the Python ecosystem.

Finally, I encourage you to check out the rest of the MLxtend library. If you are
doing any work in scikit-learn it is helpful to be familiar with MLxtend and how
it could augment some of the existing tools in your data science toolkit.


Source From: pbpython.com.
Original article title: Introduction to Market Basket Analysis in Python.