Forecasting Website Traffic Using Facebook’s Prophet Library

Introduction

A common business analytics task is trying to forecast the future
based on known historical data. Forecasting is a complicated topic and relies
on an analyst knowing the ins and outs of the domain as well as knowledge of
relatively complex mathematical theories. Because the mathematical concepts can
be complex, a lot of business forecasting approaches are “solved”
with a little linear regression and “intuition.” More complex models would yield
better results but are too difficult to implement.

Given that background, I was very interested to see that Facebook recently open
sourced a Python and R library called prophet, which seeks to automate the
forecasting process with a more sophisticated but easily tunable model. In this
article, I’ll introduce prophet and show how to use it to predict the volume of
traffic in the next year for Practical Business Python. To make this a little
more interesting, I will post the prediction through the end of March so we can
take a look at how accurate the forecast is.

Overview of Prophet

For those interested in learning more about prophet, I
recommend reading Facebook’s white paper on the topic. The paper is relatively
light on math and heavy on the background of forecasting and some of the business
challenges associated with building and using forecasting models at scale.

The paper’s introduction contains a good overview of the challenges with current
forecasting approaches:

Producing high quality forecasts is not an easy problem for either machines or
for most analysts. We have observed two main themes in the practice of creating
business forecasts:

1. Completely automatic forecasting techniques can be brittle and they are
often too inflexible to incorporate useful assumptions or heuristics.

2. Analysts who can produce high quality forecasts are quite rare because
forecasting is a specialized data science skill requiring substantial experience.

The result of these themes is that the demand for high quality forecasts often
far outstrips the pace at which the organization can produce them.

Prophet seeks to provide a simple-to-use model that is sophisticated enough
to provide useful results – even when run by someone without deep knowledge of
the mathematical theories of forecasting. However, the modeling solution does
provide several tunable parameters so that analysts can easily make changes to
the model based on their unique business needs.

Installation

Before going any further, make sure to install prophet. The complex statistical modeling
is handled by the Stan library, which is a prerequisite for prophet. As long as
you are using Anaconda, the installation process is pretty simple:

conda install pystan
pip install fbprophet

Starting the Analysis

For this analysis, I will be using a spreadsheet of the actual web traffic volume from
pbpython starting in Sept 2014 and going through early March 2017. The data was
downloaded from Google Analytics and looks like this:

Google Analytics Dataset
import pandas as pd
import numpy as np
from fbprophet import Prophet

data_file = "All Web Site Data Audience Overview.xlsx"
df = pd.read_excel(data_file)
df.head()
   Day Index  Sessions
0 2014-09-25         1
1 2014-09-26         4
2 2014-09-27         8
3 2014-09-28        42
4 2014-09-29       233

The first thing we need to check is that the Day Index column came through
as a datetime type:

df.dtypes
Day Index    datetime64[ns]
Sessions              int64
dtype: object
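
If the column had come through as plain strings instead (it would show up as object in the dtypes output), it could be converted before going any further. Here is a minimal sketch, assuming a small hand-built frame rather than the real export; the name raw and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for an export whose dates arrived as strings.
raw = pd.DataFrame({
    "Day Index": ["2014-09-25", "2014-09-26", "2014-09-27"],
    "Sessions": [1, 4, 8],
})

# Parse the strings into proper datetimes; unparseable values raise an error.
raw["Day Index"] = pd.to_datetime(raw["Day Index"])

print(raw.dtypes)
```

Running the conversion explicitly is cheap insurance, since the later steps rely on real datetime values.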

Since that looks good, let’s see what kind of insight we can get with just simple
pandas plots:

df.set_index('Day Index').plot();

The basic plot is interesting but, like most time series data, it is difficult to get
much out of this without doing further analysis. Additionally, if you wanted to
add a predicted trend-line, it is a non-trivial task with stock pandas.

Before going further, I do want to address the outlier in the July 2015 timeframe. My
most popular article is Pandas Pivot Table Explained, which saw the biggest traffic spike
on this blog. Since that article represents an outlier in volume, I am going to
change those values to nan so that they do not unduly influence the projection.

This change is not strictly required but it will be useful to show that prophet
can handle this missing data without further manipulation. This process also
highlights the need for the analyst to still be involved in the process of making
the forecast.

df.loc[(df['Sessions'] > 5000), 'Sessions'] = np.nan
df.set_index('Day Index').plot();
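
To see what the masking step does on its own, it can be tried on a toy frame. This is only a sketch with made-up numbers (demo and its values are hypothetical); it uses Series.mask, which is an alternative spelling of the df.loc assignment used on the real data:

```python
import pandas as pd

# Hypothetical daily series with one obvious spike.
demo = pd.DataFrame({
    "Day Index": pd.date_range("2015-07-01", periods=5, freq="D"),
    "Sessions": [900.0, 1100.0, 9500.0, 1000.0, 950.0],
})

# Replace values above the threshold with NaN instead of dropping the rows,
# so the date column stays continuous.
demo["Sessions"] = demo["Sessions"].mask(demo["Sessions"] > 5000)

print(demo)
print("missing days:", int(demo["Sessions"].isna().sum()))
```

The row for the spike is still there; only its value is missing, which is the situation prophet is being asked to cope with.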

This is pretty good but I am going to do one other data transformation
before continuing. I will convert the Sessions column to be a log
value. This article has more information on why a log transform is useful for
these types of data sets. From the article:

… logging converts multiplicative relationships to additive relationships,
and by the same token it converts exponential (compound growth) trends to linear
trends. By taking logarithms of variables which are multiplicatively related and/or
growing exponentially over time, we can often explain their behavior with
linear models.
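
A quick way to convince yourself of this is to log a series that grows at a fixed percentage. The sketch below uses a made-up series growing 10% per period (the numbers are assumptions chosen only for illustration) and compares the step sizes before and after taking logs:

```python
import math

# Hypothetical traffic that grows 10% per period (compound growth).
sessions = [100 * 1.10 ** t for t in range(6)]

# Differences between consecutive raw values keep getting larger...
raw_steps = [b - a for a, b in zip(sessions, sessions[1:])]

# ...while differences between consecutive logged values are constant,
# i.e. the logged series is a straight line with slope log(1.10).
log_steps = [math.log(b) - math.log(a) for a, b in zip(sessions, sessions[1:])]

print([round(s, 2) for s in raw_steps])
print([round(s, 4) for s in log_steps])
```

One thing to watch for with real data: the log of zero is undefined, so a day with no sessions would need separate handling.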

df['Sessions'] = np.log(df['Sessions'])
df.set_index('Day Index').plot();

The data set is almost ready to make a prediction. The final step is to rename
the columns to ds and y in order to comply with the prophet API.

df.columns = ["ds", "y"]
df.head()
          ds         y
0 2014-09-25  0.000000
1 2014-09-26  1.386294
2 2014-09-27  2.079442
3 2014-09-28  3.737670
4 2014-09-29  5.451038
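
Assigning to df.columns works here because the frame has exactly two columns in a known order. If the export ever gained a column or changed order, renaming by name would be a safer option. A small sketch, assuming a hypothetical two-row frame called sample:

```python
import pandas as pd

# Hypothetical two-row frame using the original Google Analytics column names.
sample = pd.DataFrame({
    "Day Index": pd.to_datetime(["2014-09-25", "2014-09-26"]),
    "Sessions": [0.0, 1.386294],
})

# Map each old name to the name prophet expects; column order no longer matters.
sample = sample.rename(columns={"Day Index": "ds", "Sessions": "y"})

print(sample.columns.tolist())
```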

Now that the data is cleaned and labeled correctly, let’s see what prophet can
do with it.

Making a Prediction

The prophet API is similar to scikit-learn. The general flow is to fit the
data then predict the future time series. In addition, prophet supports
some nice plotting features using plot and plot_components.

Create the first model (m1) and fit it to our dataframe:

m1 = Prophet()
m1.fit(df)

In order to tell prophet how far to predict in the future, use
make_future_dataframe. In this example, we will predict out 1 year (365 days).

future1 = m1.make_future_dataframe(periods=365)
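
To get a feel for what that future dataframe contains, the sketch below builds a comparable one by hand with pandas. It is only an approximation of the idea and not prophet's own implementation; history, periods and future are hypothetical names, and the frame is a toy with five days of history:

```python
import pandas as pd

# Hypothetical history: five consecutive days in a 'ds' column.
history = pd.DataFrame({"ds": pd.date_range("2017-02-27", periods=5, freq="D")})
periods = 3  # how many days past the last observation to forecast

# Daily dates starting the day after the last observed date.
last_date = history["ds"].max()
new_dates = pd.date_range(last_date + pd.Timedelta(days=1), periods=periods, freq="D")

# Historical dates followed by the new future dates.
future = pd.concat([history, pd.DataFrame({"ds": new_dates})], ignore_index=True)

print(future)
```

The result is a single ds column covering both the dates already observed and the dates to be predicted, which is why the forecast later contains rows for the whole history as well as the future.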

Then make the forecast:

forecast1 = m1.predict(future1)

The forecast1 object is just a pandas dataframe with several columns of data.
The predicted value is called yhat and the range is defined by yhat_lower and
yhat_upper. To see the last 5 predicted values:

forecast1[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
             ds      yhat  yhat_lower  yhat_upper
1250 2018-02-27  7.848040    6.625887    9.081303
1251 2018-02-28  7.787314    6.565903    9.008327
1252 2018-03-01  7.755146    6.517481    8.948139
1253 2018-03-02  7.552382    6.309191    8.785648
1254 2018-03-03  7.011651    5.795778    8.259777

To convert back to the numerical values representing sessions, use np.exp:

np.exp(forecast1[['yhat', 'yhat_lower', 'yhat_upper']].tail())
             yhat  yhat_lower   yhat_upper
1250  2560.709477  754.373407  8789.412841
1251  2409.836175  710.452848  8170.840734
1252  2333.549138  676.871358  7693.563414
1253  1905.275686  549.600404  6539.712030
1254  1109.484324  328.907843  3865.233952
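
It is worth checking one row by hand to see what the back-transform does to the interval. The sketch below takes the log-scale values for row 1250 from the forecast output and exponentiates them with the standard library; the variable names are just for illustration:

```python
import math

# Row 1250 from the forecast output, still on the log scale.
yhat, yhat_lower, yhat_upper = 7.848040, 6.625887, 9.081303

# exp() undoes the earlier log transform.
sessions = math.exp(yhat)
sessions_lower = math.exp(yhat_lower)
sessions_upper = math.exp(yhat_upper)

# The interval is roughly symmetric in log space but lopsided in sessions.
print(round(yhat - yhat_lower, 3), round(yhat_upper - yhat, 3))
print(round(sessions - sessions_lower), round(sessions_upper - sessions))
```

Because the model works on logged values, the upper side of the range stretches much further than the lower side once it is expressed in sessions.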

To make this look nice and impress management, plot the data:

m1.plot(forecast1);

Very cool. The other useful feature is the ability to plot the various components:

m1.plot_components(forecast1);

I really like this view because it is a very simple way to pull out the weekly and yearly
patterns. For instance, the charts make it easy to see that Monday-Thursday are peak times
with big fall offs on the weekend. Additionally, I appear to have bigger jumps in
traffic towards the end of the year.

Refining the Model

I hope you’ll agree that the basic process to create a model is relatively straightforward
and you can see that the results include more rigor than a simple linear trend line.
Where prophet really shines is the ability to iterate the models with different
assumptions and inputs.

One of the features that prophet supports is the concept of a “holiday.” The
simplest way to think about this idea is the typical up-tick in store sales
seen around the Thanksgiving and Christmas holidays. If we have certain known
events that have major impacts on our time series, we can define them and the
model will use these data points to try to make better future predictions.

For this blog, any time a new article is published, there is an uptick in traffic for
about 1 week, then there is a slow decay back to steady state. Therefore for this
analysis, we can define a holiday as a blog post. Since I know that the post
drives increased traffic for about 5-7 days, I can define an upper_window of 5
to include the days after publication in that holiday window. There is also a
corresponding lower_window for days leading up to the holiday. For this analysis,
I will only look at the upper_window.
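
To make the window idea concrete, the sketch below lists the dates that a single holiday row would cover. It reflects my understanding that the windows are day offsets counted from the holiday date, inclusive at both ends; holiday_days is a hypothetical helper written for this illustration and is not part of prophet:

```python
from datetime import date, timedelta

def holiday_days(ds, lower_window, upper_window):
    """Return every date covered by one holiday row.

    Offsets run from lower_window to upper_window inclusive, so
    lower_window=0 and upper_window=5 cover the day itself plus five more.
    """
    return [ds + timedelta(days=offset)
            for offset in range(lower_window, upper_window + 1)]

covered = holiday_days(date(2017, 3, 6), lower_window=0, upper_window=5)
for day in covered:
    print(day.isoformat())
```

Under that reading, a negative lower_window such as -1 would pull in the day before the event as well.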

To capture the holidays, define a holiday dataframe with a datestamp and the
description of the holiday:

articles = pd.DataFrame({
  'holiday': 'publish',
  'ds': pd.to_datetime(['2014-09-27', '2014-10-05', '2014-10-14', '2014-10-26', '2014-11-09',
                        '2014-11-18', '2014-11-30', '2014-12-17', '2014-12-29', '2015-01-06',
                        '2015-01-20', '2015-02-02', '2015-02-16', '2015-03-23', '2015-04-08',
                        '2015-05-04', '2015-05-17', '2015-06-09', '2015-07-02', '2015-07-13',
                        '2015-08-17', '2015-09-14', '2015-10-26', '2015-12-07', '2015-12-30',
                        '2016-01-26', '2016-04-06', '2016-05-16', '2016-06-15', '2016-08-23',
                        '2016-08-29', '2016-09-06', '2016-11-21', '2016-12-19', '2017-01-17',
                        '2017-02-06', '2017-02-21', '2017-03-06']),
  'lower_window': 0,
  'upper_window': 5,
})
articles.head()
          ds  holiday  lower_window  upper_window
0 2014-09-27  publish             0             5
1 2014-10-05  publish             0             5
2 2014-10-14  publish             0             5
3 2014-10-26  publish             0             5
4 2014-11-09  publish             0             5

Astute readers may have noticed that you can include dates in the future. In
this instance, I am including today’s blog post in the holiday dataframe.

To use the publish dates in the model, pass them in via the holidays keyword.
Perform the normal fit, make_future_dataframe (this time we’ll try 90 days),
predict and plot:

m2 = Prophet(holidays=articles).fit(df)
future2 = m2.make_future_dataframe(periods=90)
forecast2 = m2.predict(future2)
m2.plot(forecast2);

Because we have defined holidays, we get a little more information when we plot components:

m2.plot_components(forecast2);

Predictions

Prophet offers a couple of other options for continuing to tweak the model. I
encourage you to play around with them to get a feel for how they work and how
they can be used for your models. I have included one new option, mcmc_samples,
in the final example below.

As promised, here is my forecast for website traffic between today and
the end of March:

m3 = Prophet(holidays=articles, mcmc_samples=500).fit(df)
future3 = m3.make_future_dataframe(periods=90)
forecast3 = m3.predict(future3)
forecast3["Sessions"] = np.exp(forecast3.yhat).round()
forecast3["Sessions_lower"] = np.exp(forecast3.yhat_lower).round()
forecast3["Sessions_upper"] = np.exp(forecast3.yhat_upper).round()
forecast3[(forecast3.ds > "2017-03-05") &
          (forecast3.ds < "2017-04-01")][["ds", "yhat", "Sessions_lower",
                                          "Sessions", "Sessions_upper"]]
            ds      yhat  Sessions_lower  Sessions  Sessions_upper
892 2017-03-06  7.845280          1432.0    2554.0          4449.0
893 2017-03-07  8.087120          1795.0    3252.0          5714.0
894 2017-03-08  7.578796          1142.0    1956.0          3402.0
895 2017-03-09  7.556725          1079.0    1914.0          3367.0
896 2017-03-10  7.415903           917.0    1662.0          2843.0
897 2017-03-11  6.796987           483.0     895.0          1587.0
898 2017-03-12  6.627355           417.0     755.0          1267.0
899 2017-03-13  7.240586           811.0    1395.0          2341.0

The model passes the intuitive test in that there is a big spike anticipated
with the publishing of this article. The upper and lower bounds represent a fairly
large range but for the purposes of this forecast, that is likely acceptable.

To keep me honest, you can see all of the values in the GitHub notebook.
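
Once real numbers are in, one simple way to score a forecast like this is to look at the average percentage error and at how often the actual value lands inside the predicted range. The sketch below shows the mechanics only. The forecast rows are copied from the output above, but the actuals are PLACEHOLDER values invented to exercise the code, and compare is a hypothetical helper:

```python
# Forecast rows copied from the output above: (date, lower, predicted, upper).
forecast = [
    ("2017-03-06", 1432.0, 2554.0, 4449.0),
    ("2017-03-07", 1795.0, 3252.0, 5714.0),
    ("2017-03-08", 1142.0, 1956.0, 3402.0),
]

# PLACEHOLDER actuals, invented purely to exercise the code.
actuals = {"2017-03-06": 2400.0, "2017-03-07": 3600.0, "2017-03-08": 1500.0}

def compare(forecast_rows, actual_by_date):
    """Return (mean absolute percentage error, share of days inside the interval)."""
    errors, inside = [], 0
    for ds, lower, predicted, upper in forecast_rows:
        actual = actual_by_date[ds]
        errors.append(abs(actual - predicted) / actual)
        inside += lower <= actual <= upper
    return sum(errors) / len(errors), inside / len(forecast_rows)

mape, coverage = compare(forecast, actuals)
print(f"MAPE: {mape:.1%}, coverage: {coverage:.0%}")
```

With a range as wide as the one above, coverage alone will look flattering, so it makes sense to read it alongside the error figure.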

Final Thoughts

It is always interesting to get insights into the ways big companies use various
open source tools in their business. I am impressed with the functionality
that Facebook has given us with prophet. The API is relatively simple and since it uses
standard pandas dataframes and matplotlib for displaying the data, it fits very
easily into the Python data science workflow. There is a lot of recent GitHub activity
for this library so I expect it to get more useful and powerful over the months ahead.

As Yogi Berra said, “It’s tough to make predictions, especially about the future.”
I think this library is going to be very useful for people trying to improve their
forecasting approaches. I will be interested to see how well this particular forecast
works on this site’s data. Stay tuned for an update where I will compare
the prediction against the actuals and we will see what insight can be gained.


Source From: pbpython.com.
Original article title: Forecasting Website Traffic Using Facebook’s Prophet Library.
This full article can be read at: Forecasting Website Traffic Using Facebook’s Prophet Library.
