I was tied up in statistical knots. Earlier this week, a colleague good-naturedly “called me out” in a discussion of forecasting work I had just completed for a customer. As part of the project, I'd tried many of the techniques supported by Rob Hyndman's excellent R forecast package. From exponential smoothing to Holt-Winters filtering to ARIMA, the fit of the trained models to hold-out data was just so-so, certainly nothing to brag about to the customer. In the end, with what visual exploration suggested were multiple seasonalities and curvilinear trends, I finally settled on a linear regression model using cubic splines, with month and day of the week as categorical regressors. The fit of these models to the test data was much better than the earlier efforts.

My colleague, a statistical purist, noted that standard regression is generally unsuitable for time series data, since the assumptions underlying the linear model, especially independence of errors, aren't tenable given the autocorrelation typical of time series. And, by the way, hadn't I just posted a blog that staked my position on statistical orthodoxy with “causal” analysis? Was I a statistical hypocrite? What was next, promoting mindless gradient boosting models?
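
To make my colleague's objection concrete, here is a minimal Python sketch of how one might check for autocorrelated errors after an ordinary regression on a time series. The series is synthetic (a linear trend plus AR(1) errors with a coefficient of 0.7), chosen purely for illustration; none of it comes from the customer project. It computes the lag-1 residual autocorrelation and the Durbin-Watson statistic, which sits near 2 when residuals are uncorrelated and falls toward 0 under positive autocorrelation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic series (an assumption for illustration): linear trend + AR(1) errors.
n, phi = 500, 0.7
shocks = rng.normal(size=n)
errors = np.zeros(n)
for i in range(1, n):
    errors[i] = phi * errors[i - 1] + shocks[i]

t = np.arange(n, dtype=float)
y = 5.0 + 0.02 * t + errors

# Ordinary least squares on an intercept and a time trend.
X = np.column_stack([np.ones(n), t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Lag-1 autocorrelation of the residuals.
r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# Durbin-Watson statistic, roughly 2 * (1 - r1).
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

print(f"lag-1 residual autocorrelation: {r1:.2f}")
print(f"Durbin-Watson statistic:        {dw:.2f}")
```

A strongly positive lag-1 autocorrelation is the signal that the usual standard errors and p-values from the regression should not be trusted, which is exactly the purist's point.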

I must admit, I was statistically despondent and immediately set out to rationalize my thinking. Yes, my intention with the hypothetical marketing campaign was to compare several treatment offers to determine “causally” which outperformed. A data design to tease out what worked best was called for – perhaps a randomized experiment, perhaps statistical matching or propensity scores.

And yes, my colleague was correct that inference from linear models requires independent errors. On the other hand, the forecasting exercise simply took time series data, split it into training and test sets, and looked to make the best predictions on the “future” from the past – with no interpretation or inference intended. In the first case, I was looking for explanation; in the second, I wanted only the most accurate forecasts on hold-out data.
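
The prediction-only workflow can be sketched in a few lines of Python. This is not the model I built for the customer (that work was done in R); it is a rough illustration under stated assumptions: the daily series is synthetic, the cubic spline uses a simple truncated-power basis with three arbitrarily placed knots, only day of the week is included as a categorical regressor, and the function name `design` is hypothetical. The point is the shape of the exercise: split by time, fit on the past, and judge the model only by its error on the held-out "future".

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily series (an assumption, not the customer data):
# a curved trend, a weekly pattern, and noise.
n = 728
t = np.arange(n)
dow = t % 7
weekly = np.array([0.0, 1.0, 1.5, 1.2, 0.8, -2.0, -2.5])
y = 10 + 0.00002 * (t - 300) ** 2 + weekly[dow] + rng.normal(0, 0.5, n)

def design(t, dow, knots, scale):
    """Truncated-power cubic spline basis in time plus day-of-week dummies."""
    x = t / scale  # rescale time to keep the powers well conditioned
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k / scale, 0, None) ** 3 for k in knots]
    cols += [(dow == d).astype(float) for d in range(1, 7)]  # day 0 is the baseline
    return np.column_stack(cols)

X = design(t, dow, knots=[175, 350, 525], scale=n)

# Split by time, never at random: the last 28 days play the "future".
n_train = 700
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
pred = X_test @ beta

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# A deliberately crude baseline: forecast the training mean everywhere.
naive = np.full_like(y_test, y_train.mean())
rmse_model = rmse(y_test, pred)
rmse_naive = rmse(y_test, naive)

print(f"hold-out RMSE, spline regression: {rmse_model:.2f}")
print(f"hold-out RMSE, training mean:     {rmse_naive:.2f}")
```

Nothing here relies on standard errors or p-values; the only yardstick is hold-out accuracy. Polynomial and spline trends can extrapolate badly over long horizons, so the short 28-day hold-out window is a deliberate choice in this sketch.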

Fortunately, I was able to start working my way out of the explanation-prediction funk by researching several informative articles and one solid book on the topic.

Author Galit Shmueli contrasts explanatory statistical models with predictive ones. With explanatory models, “statistical methods are used nearly exclusively for testing causal theory. Given a causal theoretical model, statistical models are applied to data in order to test causal hypotheses. In such models, a set of underlying factors that are measured by variables X are assumed to cause an underlying effect, measured by variable Y.” Thus, with an explanatory focus, the analyst carefully specifies models and relationships between outcomes and predictors, testing theories via statistical inference.

Predictive modeling, according to Shmueli, is “the process of applying a statistical model or data mining algorithm to data for the purpose of predicting new or future observations. In particular, I focus on nonstochastic prediction....where the goal is to predict the output value (Y) for new observations given their input values (X). This definition also includes temporal forecasting, where observations until time t (the input) are used to forecast future values at time t + k.”

Similarly, Richard Berk distinguishes confirmatory from exploratory statistical learning models. “An important, although somewhat fuzzy, distinction is sometimes made between a confirmatory data analysis and an exploratory data analysis. For a confirmatory data analysis, the form of the statistical model is determined before looking at the data. All that remains is to estimate the values of some key parameters. For an exploratory data analysis, there is no model. Statistical tools are used to extract patterns in the data.”

So Shmueli's explanatory is Berk's confirmatory, while their predictives align. The explanatory/confirmatory models are more “causal”, seeking to test theories with model specification and coefficient values – and are more closely aligned with orthodox statistical modeling/inference.

Predictive models, in contrast, are far less grand, “inducting” from data via flexible algorithms, evaluating themselves by their predictive prowess rather than by inference and theory-testing. Predictive models are quite comfortable in the statistical learning world.

As Berk notes, though, “There is nothing in regression analysis that requires statistical inference: formal tests of null hypotheses or confidence intervals. These can sometimes be very useful but go beyond the definition of regression analysis....They're an add-on.... It cannot be overemphasized that causal inference comes from knowledge about the experiment, not from the regression analysis”.

Shmueli agrees, offering the explanatory problem of multicollinearity as an illustration: “Multicollinearity is not a problem unless either (i) the individual regression coefficients are of interest, or (ii) attempts are made to isolate the contribution of one explanatory variable to Y, without the influence of the other explanatory variables. Multicollinearity will not affect the ability of the model to predict.” That made me feel a bit better about my prediction-only autocorrelation violations.
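
The multicollinearity point is easy to see in a small simulation. The Python sketch below is my own illustration, not something from Shmueli's paper: the data are synthetic, the helper `simulate` is hypothetical, and the two predictors are built to be nearly identical. It refits the same regression on many fresh samples and compares how much the individual coefficients move around with how much the predictions for a fixed set of new observations move around.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    """Draw a sample in which x2 is almost a copy of x1 (synthetic data)."""
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=1.0, size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    return X, y

# New observations drawn from the same process, held fixed across refits.
X_new, _ = simulate(200)

coefs, preds = [], []
for _ in range(200):
    X, y = simulate(200)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    coefs.append(beta)
    preds.append(X_new @ beta)

coefs = np.array(coefs)
preds = np.array(preds)

coef_sd = coefs.std(axis=0)         # spread of each coefficient across refits
pred_sd = preds.std(axis=0).mean()  # average spread of each prediction

print("sd of [intercept, b1, b2] across refits:", np.round(coef_sd, 2))
print("average sd of predictions across refits:", round(float(pred_sd), 2))
```

The individual slopes swing wildly from sample to sample while the predictions barely move. One caveat: this holds when the new observations share the same collinearity pattern as the training data, which is the situation the simulation assumes.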

So how did I “un-conflict”, reconciling the prediction-only needs of my latest work, which ignored data generation and design, with other, more causal data science requirements? It's actually pretty simple: The data scientist works in both explanatory and predictive worlds, and must adopt the best modeling tools for the challenges at hand. Both traditional statistical models and their learning cousins are needed to cover the DS bases.

For inquiries that purport to test/explain, the data design/experiment and statistical inference are critical, the analytics more aligned with Gary King's Research Designs for Causal Inference.

For predictive challenges where explanation is irrelevant, adaptive statistical learning models are often the better choice, the SL emphasis on predictive accuracy replacing the statistical focus on inference.

The data scientist should be facile with both methodologies.

