Forecasting Time Series Data with Prophet

In my last post, I used an ARIMA model to forecast a time series dataset. Although it worked, it was a bit too complicated and hard to understand for a non-expert user like me. Recently I came across Prophet, an awesome library from Facebook for forecasting time series data. After playing with it for a bit, I reckon it is a much better alternative to classic techniques like ARIMA. Why? Here is a quote from the Facebook Research blog:

We have frequently used Prophet as a replacement for the forecast package in many settings because of two main advantages:

Prophet makes it much more straightforward to create a reasonable, accurate forecast. The forecast package includes many different forecasting techniques (ARIMA, exponential smoothing, etc), each with their own strengths, weaknesses, and tuning parameters. We have found that choosing the wrong model or parameters can often yield poor results, and it is unlikely that even experienced analysts can choose the correct model and parameters efficiently given this array of choices.
Prophet forecasts are customizable in ways that are intuitive to non-experts. There are smoothing parameters for seasonality that allow you to adjust how closely to fit historical cycles, as well as smoothing parameters for trends that allow you to adjust how aggressively to follow historical trend changes. For growth curves, you can manually specify “capacities” or the upper limit of the growth curve, allowing you to inject your own prior information about how your forecast will grow (or decline). Finally, you can specify irregular holidays to model like the dates of the Super Bowl, Thanksgiving and Black Friday.

Prophet works with data that comes in different time intervals, such as daily, weekly and monthly. Unlike ARIMA, it does not need a large amount of data to produce a forecast. It also copes with missing values in the time series, so it is more tolerant and flexible about its input.

More importantly for me, as someone who is not a stats expert, it is more intuitive and user friendly. It feels like using a decent point-and-shoot camera: you get nice photos without worrying about ISO, shutter speed and aperture, because it is all figured out automatically and the result is quite good. Producing a reasonable forecast does not require much effort or expertise. It is that simple.

Dataset

So let's put it into action with the classic sales dataset from Walmart, which is freely available to the public. You can download the full dataset from here.

The files contain historical sales data for 45 Walmart stores located in different regions. Here are the files and what the data in them looks like (first few rows):

  • train.csv (dates between 05/02/2010 and 26/10/2012)
Store  Dept  Date     Weekly_Sales  IsHoliday
1      1     5/2/10   24924.5       FALSE
1      1     12/2/10  46039.49      TRUE
1      1     19/2/10  41595.55      FALSE
1      1     26/2/10  19403.54      FALSE
  • test.csv. We will use the holiday information for our forecast
Store  Dept  Date     Weekly_Sales  IsHoliday
1      1     5/2/10   24924.5       FALSE
1      1     12/2/10  46039.49      TRUE
1      1     19/2/10  41595.55      FALSE
1      1     26/2/10  19403.54      FALSE

For simplicity, we pick only one store and one department (store #1, department #1) and use Prophet to forecast its sales for the next year.

Process the data with Prophet

According to the Prophet docs, we need to get our data into a format that Prophet recognises. Since Prophet follows the sklearn-style API, we call model.fit to train the model and model.predict to produce the forecast. I think I can call it Machine Learning, just like the many services that claim to have “Machine Learning” features :) Simply put, it is just statistics.

The dataset has to include 2 columns, ds and y:

  • ds is the datetime column, holding the timestamp of each observation.
  • y is the value being forecast, here the weekly sales figure.
import pandas as pd
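# Note: releases from 1.0 onwards are published as `prophet`,
# so the import there is `from prophet import Prophet`
from fbprophet import Prophet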
import matplotlib.pyplot as plt

# Read train.csv
train = pd.read_csv('train.csv')

# Convert the Date column to datetime type
train['Date'] = pd.to_datetime(train['Date'])

# Filtering out only store #1 and department #1 to a new dataframe
# with the required column names
store1_dept1_sales = train[(train['Store'] == 1) & (
    train['Dept'] == 1)][['Date', 'Weekly_Sales']]
store1_dept1_sales.columns = ['ds', 'y']
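
As a quick check of that reshaping without downloading train.csv, here is a minimal sketch that applies the same filter and rename to the four sample rows shown earlier, plus one made-up row for another store. The day-first date format string is an assumption based on how the dates are displayed in this post; the real file may store them differently.

```python
import pandas as pd

# Sample rows copied from the train.csv preview above; the last row is
# made up so that the store/department filter has something to drop.
sample = pd.DataFrame({
    'Store': [1, 1, 1, 1, 2],
    'Dept': [1, 1, 1, 1, 1],
    'Date': ['5/2/10', '12/2/10', '19/2/10', '26/2/10', '5/2/10'],
    'Weekly_Sales': [24924.5, 46039.49, 41595.55, 19403.54, 1000.0],
    'IsHoliday': [False, True, False, False, False],
})

# Assumption: dates are written day/month/two-digit-year, as displayed above.
sample['Date'] = pd.to_datetime(sample['Date'], format='%d/%m/%y')

# Same filter as the main script, then rename to the names Prophet expects.
mask = (sample['Store'] == 1) & (sample['Dept'] == 1)
prophet_input = (
    sample.loc[mask, ['Date', 'Weekly_Sales']]
    .rename(columns={'Date': 'ds', 'Weekly_Sales': 'y'})
    .reset_index(drop=True)
)

print(prophet_input.dtypes)
print(prophet_input)
```

If ds does not come out as a datetime type here, Prophet is unlikely to be happy with the real data either, so it is a cheap thing to check first.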

Fit the model and do the forecasting.

model = Prophet()
model.fit(store1_dept1_sales)
future = model.make_future_dataframe(periods=52, freq='W')
forecast = model.predict(future)
figure = model.plot(forecast)
plt.show()
forecast.to_csv('sales_forecast.csv')

Since this is a weekly sales report, we use weekly as the frequency. Interestingly, Prophet can also forecast at a daily frequency from the weekly data we fitted, so it is flexible.
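
One detail worth checking is which weekday the weekly frequency lands on. In pandas, the alias 'W' means weeks anchored on Sunday, while the last training date quoted above (26/10/2012) is a Friday. The sketch below uses plain pandas, with no Prophet call, to compare the two anchors. Passing freq='W-FRI' to make_future_dataframe is a suggestion of mine, not something the original script does.

```python
import pandas as pd

# Last date in train.csv according to the range quoted above (26/10/2012).
last_training_date = pd.Timestamp('2012-10-26')
print(last_training_date.day_name())

# 'W' is shorthand for 'W-SUN': every generated date is a Sunday.
sunday_weeks = pd.date_range(start=last_training_date, periods=3, freq='W')

# 'W-FRI' keeps the generated dates on the same weekday as the sales data.
friday_weeks = pd.date_range(start=last_training_date, periods=3, freq='W-FRI')

print(list(sunday_weeks.day_name()))
print(list(friday_weeks.day_name()))
```

This matters mostly once holidays are involved: if the holiday rows are dated on Fridays but the future dates are Sundays, the holiday effect may not line up with the forecast weeks.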

That is it. At the end of the code, it renders this forecast chart and saves the data into a csv file.

Figure_1.png

It looks pretty good to me considering how little effort went into producing this forecast chart. This is what the forecast contains (last 5 rows),

print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(5))

 

     ds          yhat          yhat_lower    yhat_upper
190  2013-09-22  20443.488962  12518.858795  27468.180441
191  2013-09-29  20641.660038  12938.020617  27909.093330
192  2013-10-06  20477.833384  13299.782936  28176.877186
193  2013-10-13  22547.212544  15202.682886  29934.752752
194  2013-10-20  27710.351046  19897.129162  35997.097132

It has ds as the timestamp and yhat as the forecast value, which is the column we care about most. There are also columns for the components and the uncertainty intervals; with these values, you can do more analysis. To my understanding, yhat_lower and yhat_upper define a range around the forecast figure to indicate its uncertainty.
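
To make the uncertainty columns a little more concrete, here is a small sketch that rebuilds the five printed rows by hand and derives the width of the band for each week. The first four column names come from the output above; band_width and relative_width are names I made up for illustration. As far as I know, Prophet's default is an 80% interval (the interval_width argument of Prophet), so treat the range as a rough band rather than a guarantee.

```python
import pandas as pd

# The five forecast rows printed above, typed in by hand (rounded to 2 decimals).
sample_forecast = pd.DataFrame({
    'ds': pd.to_datetime(['2013-09-22', '2013-09-29', '2013-10-06',
                          '2013-10-13', '2013-10-20']),
    'yhat': [20443.49, 20641.66, 20477.83, 22547.21, 27710.35],
    'yhat_lower': [12518.86, 12938.02, 13299.78, 15202.68, 19897.13],
    'yhat_upper': [27468.18, 27909.09, 28176.88, 29934.75, 35997.10],
})

# Width of the uncertainty band, in the same units as weekly sales.
sample_forecast['band_width'] = (
    sample_forecast['yhat_upper'] - sample_forecast['yhat_lower']
)

# Band width relative to the point forecast (hypothetical helper column).
sample_forecast['relative_width'] = (
    sample_forecast['band_width'] / sample_forecast['yhat']
)

print(sample_forecast[['ds', 'yhat', 'band_width', 'relative_width']].round(2))
```

For these rows the band is roughly 15,000 to 16,000 wide, which is large next to point forecasts of 20,000 to 28,000, so the single yhat figure should be read with that in mind.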

Factor in holiday information

The holiday dataframe needs 2 columns:

  • ds: the datetime column
  • holiday: a string naming the holiday
# Build the holiday dataframe combining the current and future holiday information
train_store1_dep1_holiday = train[(train['Store'] == 1) & (
    train['Dept'] == 1) & (train['IsHoliday'] == True)][['Date', 'IsHoliday']]
test = pd.read_csv('test.csv')
test_store1_dep1_holiday = test[(test['Store'] == 1) & (
    test['Dept'] == 1) & (test['IsHoliday'] == True)][['Date', 'IsHoliday']]
combined_holiday = pd.concat(
    [train_store1_dep1_holiday, test_store1_dep1_holiday])
combined_holiday['Date'] = pd.to_datetime(combined_holiday['Date'])
combined_holiday.columns = ['ds', 'holiday']

# Prophet requires the holiday column to be string so we need to convert the boolean
# to string
combined_holiday['holiday'] = combined_holiday['holiday'].map(
    {True: 'Yes', False: 'No'})

# Pass in the holiday information
model = Prophet(holidays=combined_holiday)
model.fit(store1_dept1_sales)
future = model.make_future_dataframe(periods=52, freq='W')
forecast_holiday = model.predict(future)
figure = model.plot(forecast_holiday)
figure2 = model.plot_components(forecast_holiday)
# Display both charts, as in the first script
plt.show()

So here is the chart,

Figure_1

It looks very similar to the previous chart, but it does take the holidays into account, as shown in the components plot.

Figure_2.png
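
One possible refinement: the holiday code above labels every flagged week with the same string, 'Yes', so as I understand it Prophet estimates a single shared effect for all holidays. The sketch below gives each holiday its own name instead. The month-to-name mapping is my assumption, and apart from 12/2/10 (flagged TRUE in the sample rows) the dates are placeholders, so check both against the IsHoliday column before relying on them.

```python
import pandas as pd

# Hypothetical mapping from calendar month to a holiday name.
HOLIDAY_BY_MONTH = {2: 'SuperBowl', 9: 'LaborDay', 11: 'Thanksgiving', 12: 'Christmas'}


def name_holidays(flagged_dates):
    """Build a Prophet-style holidays frame with columns 'ds' and 'holiday'."""
    ds = pd.to_datetime(pd.Series(flagged_dates, name='ds'))
    # Months missing from the mapping fall back to a generic label.
    names = ds.dt.month.map(lambda month: HOLIDAY_BY_MONTH.get(month, 'Other'))
    return pd.DataFrame({'ds': ds, 'holiday': names})


# 2010-02-12 is flagged in the sample rows; the other dates are placeholders.
holidays = name_holidays(['2010-02-12', '2010-09-10', '2010-11-26', '2010-12-31'])
print(holidays)

# The frame is passed in the same way as before: Prophet(holidays=holidays)
```

With separate names, the components plot should be able to show a different effect per holiday, which may or may not improve the forecast for this department.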

So that is it.

Thoughts

I really like that Prophet lets you produce a reasonably good forecast with so little effort. You do not need to be a stats expert to use it. You can of course tweak the options if you know what you are doing, such as the trend, seasonalities and holidays.
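
For readers wondering what those tweaks look like in code, here is a hedged sketch. The constructor arguments (growth, changepoint_prior_scale, seasonality_prior_scale, holidays_prior_scale) and the cap column for logistic growth are real Prophet options as far as I know, but the specific values and the 60,000 capacity are arbitrary placeholders, not tuned settings. The import inside build_tuned_model assumes the newer package name prophet; older installs use fbprophet as in the code above.

```python
import pandas as pd


def add_capacity(df, cap):
    """Return a copy of df with the 'cap' column that logistic growth needs."""
    out = df.copy()
    out['cap'] = cap
    return out


def build_tuned_model(holidays=None):
    """Create a Prophet model with a few non-default knobs (placeholder values)."""
    from prophet import Prophet  # assumption: package installed as 'prophet'

    return Prophet(
        growth='logistic',             # saturating trend; needs a 'cap' column
        changepoint_prior_scale=0.1,   # higher = trend follows changes more closely
        seasonality_prior_scale=5.0,   # lower = smoother seasonal pattern
        holidays_prior_scale=5.0,      # lower = smaller holiday effects
        holidays=holidays,
    )


# Two rows from the sample data, only to show the resulting shape.
history = pd.DataFrame({
    'ds': pd.to_datetime(['2010-02-05', '2010-02-12']),
    'y': [24924.5, 46039.49],
})
history = add_capacity(history, cap=60000)  # 60000 is an arbitrary ceiling
print(history)

# With the full history in place, fitting would look like:
#   model = build_tuned_model()
#   model.fit(history)
#   future = add_capacity(model.make_future_dataframe(periods=52, freq='W'), cap=60000)
#   forecast = model.predict(future)
```

The defaults already gave a reasonable chart here, so I would only reach for these knobs if the trend or the seasonal pattern looked clearly off.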

Compared with an ARIMA model, which requires 4 to 5 whole seasons in the dataset, Prophet has no such requirement, so it is not data hungry. A large amount of historical data can be impossible to come by for young companies.

The data files and the code can be found here.

Thanks.
