I seem to be doing a lot of forecasting work these days. My challenge is generally to predict demand or utilization in the future, knowing only previous values and the accompanying dates/times. I'm also working on a few forecasting demos in R for the OpenBI website.
I had the privilege in grad school many years ago of taking a time series course taught by eminent statistician George Box, co-originator of the Box-Jenkins forecast modeling methodologies. A lot's changed in the statistical world in the 30 years since, but the Box-Jenkins autoregressive integrated moving average (ARIMA) models that use past values of a series to predict the future have stood the test of time. A review of the table of contents of the excellent text “Time Series Analysis With Applications in R” provides quick affirmation.
The many series I'm currently investigating are pretty messy. Most have non-linear trends, seasons and periods, as well as significant autocorrelation. And it's a stretch for ARIMA models to handle the deterministic patterns. Out of the box, ARIMA likes stationary series, and these are anything but stationary. True, the data can often be differenced or otherwise transformed to make them tractable, but for complicated models like the ones I'm working with, that exercise might be more trouble than it's worth.
Instead, what I've discovered through experimentation is that either spectral analysis or a combination of statistical learning (SL) and ARIMA modeling approaches might work best for forecasting with these data. For the SL/ARIMA approach, I first use a regression-like model to capture the deterministic trend, seasonality and periodicity components. Then, once those effects have been “removed” from the series, I use ARIMA to handle the autocorrelation that remains in the residuals. In effect, each model type handles what it does best.
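The two-stage idea can be sketched in a few lines of base R. This is a minimal illustration under assumptions of my own choosing, not the production setup: the toy weekly series, the 52-week harmonic terms, the spline degrees of freedom and the AR(1) residual model are all hypothetical, and the names `dat`, `fit_det`, `fit_ar` and `fc` are made up for the example.

```r
library(splines)  # ships with base R; provides ns()

# Toy weekly series (all values are assumptions for illustration):
# curved trend + annual season + AR(1) noise.
set.seed(1)
n <- 156
t <- seq_len(n)
y <- 20 + 0.5 * t - 0.003 * t^2 + 5 * sin(2 * pi * t / 52) +
  as.numeric(arima.sim(model = list(ar = 0.5), n = n))
dat <- data.frame(t = t, y = y)

# Stage 1: regression-like model for the deterministic parts.
# A natural spline picks up the trend; sine/cosine terms pick up the season.
fit_det <- lm(y ~ ns(t, df = 4) + sin(2 * pi * t / 52) + cos(2 * pi * t / 52),
              data = dat)

# Stage 2: ARIMA on what is left over, here a plain AR(1) with no mean term.
fit_ar <- arima(residuals(fit_det), order = c(1, 0, 0), include.mean = FALSE)

# Forecast h steps ahead by adding the two pieces back together.
h <- 8
new <- data.frame(t = n + seq_len(h))
det_fc <- predict(fit_det, newdata = new)           # deterministic component
res_fc <- predict(fit_ar, n.ahead = h)$pred         # autocorrelated residual
fc <- as.numeric(det_fc) + as.numeric(res_fc)
```

One caveat worth keeping in mind: a natural spline extrapolates linearly beyond its boundary knots, so the trend part of `fc` is only as trustworthy as that linear extension over the forecast horizon.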
There's certainly no shortage of statistical learning models to try on my data in the first pass. Just consult the 700+ page “Elements of Statistical Learning” book to make your head spin. Over the last few years, I've settled on about a dozen ESL models implemented in R as staples for my predictive modeling. Those were the points of departure for my forecast work.
As I started working with the different models and time series, an additional problem emerged. I was not simply fitting a few different time series, where I could idiosyncratically finesse the best models, but rather scores of them. So it'd be difficult to give each series personal attention once the model types were chosen. I had to be comfortable that a given few techniques could reasonably handle the different data challenges. Once the models were identified and productionized, I then had to be assured they'd continue to provide reasonable forecasts when re-estimated in the weekly updates.
To meet the challenge of covering many series with just a few modeling techniques, I've opted to use a tool that's become increasingly part of the standard statistical modeler's tool chest: natural splines. Just as a draftsman's spline is a flexible strip of metal or rubber used for drawing curves, a restricted cubic or natural spline is a flexible function built from piecewise polynomials and used by modelers for curve fitting. The power of natural splines is that they can adapt to different functions, so modelers can get close approximations even when they're unsure of the exact forms.
In one of my first investigations, the natural spline function immediately “found” the inverted parabola series trend. At their best, splines can flexibly handle many functional form “signals” in the data. I routinely use splines with the linear model, generalized additive model and multivariate adaptive regression splines packages in R.
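To make the “inverted parabola” case concrete, here is a small sketch using `ns()` from the `splines` package inside `lm()`. The data are simulated for the purpose, and the peak location, noise level and `df = 4` are arbitrary assumptions rather than values from the actual series.

```r
library(splines)  # ships with base R; provides ns()

# Simulated data (assumed for illustration): an inverted parabola plus noise.
set.seed(7)
x <- seq(0, 10, length.out = 200)
truth <- 25 - (x - 5)^2                 # known signal, peaks at x = 5
y <- truth + rnorm(length(x), sd = 3)

# The model is never told the shape is quadratic;
# it only gets a natural spline basis with 4 degrees of freedom.
fit <- lm(y ~ ns(x, df = 4))
fitted_curve <- fitted(fit)

# Where does the fitted curve peak, and how closely does it track the truth?
peak_x <- x[which.max(fitted_curve)]
agreement <- cor(fitted_curve, truth)
```

Comparing `fitted_curve` with `truth`, for instance by plotting both against `x`, shows how closely the spline recovers a shape it was never given explicitly.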
Another significant tool in the statistical forecaster's quiver is simulation. Because statistical platforms like R make it easy to generate precise functional forms and embellish them with random numbers, analysts can conjure a new series with a specific “look” at a moment's notice. For example, I just created a data frame in R with curvilinear trend, seasonality, periodicity and autocorrelation, all with plenty of random variation to mask the known functional forms.
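A simulated data frame of that kind might be built along the following lines. Every number here is an assumption picked for illustration (weekly data, a 52-week season, a 13-week cycle, AR(1) noise with coefficient 0.6), and the column names are hypothetical.

```r
# All parameters below are illustrative assumptions, not the original settings.
set.seed(42)
n <- 208                                   # four years of weekly observations
t <- seq_len(n)

trend    <- 50 + 0.8 * t - 0.004 * t^2     # curvilinear (inverted parabola) trend
seasonal <- 6 * sin(2 * pi * t / 52)       # annual seasonality
periodic <- 3 * cos(2 * pi * t / 13)       # shorter quarterly cycle

# Autocorrelated noise: AR(1) process with innovations of sd 2.
ar_noise <- as.numeric(arima.sim(model = list(ar = 0.6), n = n, sd = 2))

# Keep the known signal next to the noisy series so that candidate
# models can later be scored against the truth.
sim <- data.frame(t = t, signal = trend + seasonal + periodic)
sim$y <- sim$signal + ar_noise
```

Changing the seed and regenerating `sim` gives a fresh draw from the same known signal, which is a quick way to see how much a model's apparent performance depends on one particular set of random numbers.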
Having programmed the “signals,” though, I'm well aware of how the series “should” behave. I can then test candidate models to determine which faithfully identify and separate the known signal from noise. My work to date with both the real and simulated data leads me to favor gam over mars and lm, though each method produces best results in certain cases. Mars is like the little girl with a curl: when it's good, it's very, very good. But when it's bad, it's horrid.
A second benefit of simulation is to show users that, even with strong signals, random variation can distance actual from predicted values. I sometimes find it useful to have simulation help demonstrate the limits of predictive science to skeptical business consumers.
Alas, simulation can “help” the modeler appreciate randomness as well. I recently regenerated random data for a particular series I was working with. Many of the insights I'd gleaned on model performance from the previous data were undone with the new. I gained a renewed appreciation of randomness. A humbling comeuppance, indeed.