This is the first in a series of columns on statistical learning for business intelligence (BI). Column one contrasts what’s come to be called statistical or machine learning (ML) with traditional statistical (TS) methods for predictive modeling. Subsequent columns will discuss many of the latest statistical learning procedures and identify software for deployment. Illustrations will be drawn from business applications.

My fondness for the R statistical computing platform is no secret. R combines two current passions at OpenBI: statistical analysis and open source software. Both the R platform and its community appear to be growing exponentially in the academic, research and, more recently, business worlds. Increasingly, the benefits of this growth are evident in BI.

One advantage of working with R is the collateral learning derived from participation in the various support lists. The lists certainly help me keep up to date with coding tricks, best practices, bug resolution, new release schedules and the availability of the latest statistical packages from the worldwide community. But occasionally the correspondence evolves into an important philosophical discussion of an issue pertinent to statisticians and business analytics professionals alike.

Not long ago, a “newbie” - R list-speak for a community novice - presented his research data and statistical analysis on one of the forums, asking for commentary on the work. Though he didn’t realize it at the time, his overture would start a lively discussion that escalated from evaluations of his R coding prowess, to the reliability of his approach for testing the findings and, finally, to whether the chosen model was even appropriate for the data, given the model’s stringent assumptions. Questions were raised about the relative importance of a model’s ability to accurately predict the future versus sensibly explain findings by estimating parameters and determining their significance. A community “elder” framed the evolving discussion as one that positioned traditional statistics, with its mathematical models grounded in inference, against machine learning, with its focus on algorithms and its obsession with predictive accuracy. He cited a 2001 article by UC Berkeley Professor Leo Breiman that provocatively articulated the TS-versus-ML tension in the field.

Professor Breiman was certainly qualified to offer commentary on the disparate approaches to predictive modeling. He earned degrees in physics and mathematics before accepting a teaching position at UCLA. Years later, he left academia for a stint in freelance statistical consulting, which helped shape his views on algorithmic modeling. Breiman then joined the statistics department at UC Berkeley, where he made seminal contributions to statistics and the nascent field of ML for 25 years before his untimely passing in 2005. BI analysts recognize his noteworthy contributions of classification and regression trees (CART) and random forests to the predictive modeling knowledge base.

Breiman’s illustrations focus primarily on supervised learning, in which there’s a designated dependent variable to predict from one or more explanatory attributes. Supervised learning is further delineated by whether that dependent variable is continuous (regression) or categorical (classification), as the short R sketch below illustrates. Breiman’s major criticism of TS is that the culture of academic statistics is too focused on the mathematical models and assumptions underpinning the data-generating process, obsessing unproductively over model parameters and over validation measured by goodness-of-fit significance tests. In his mind, this deductive approach to modeling is too limiting for the growing complexity of problems and volumes of data.
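
To make the regression/classification distinction concrete, here is a minimal sketch in R - assuming the rpart package is installed - of one tree-based learner handling both flavors of supervised learning with R’s built-in data sets:

  library(rpart)

  # Regression: the dependent variable (mpg) is continuous
  reg_tree <- rpart(mpg ~ ., data = mtcars)

  # Classification: the dependent variable (Species) is categorical (a factor)
  class_tree <- rpart(Species ~ ., data = iris)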

The machine learning paradigm, which has roots outside statistics in the computer and mathematical sciences, is more inductive; it can learn from the data with fewer preconceived notions and assumptions. Whereas precisely specified (but limiting) models with parameters are de rigueur in traditional statistics, flexible “black box” methods that mask complexity are the norm with ML. The criterion for model validation with ML is predictive accuracy – the ability of models to reliably predict or classify new observations.
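
As an illustration of that criterion, the following sketch - assuming the randomForest package is installed - fits a “black box” learner and judges it solely by how well it classifies observations it never saw during training:

  library(randomForest)
  set.seed(42)

  # Hold out a test set that plays no role in fitting the model
  train_rows <- sample(nrow(iris), 100)
  train <- iris[train_rows, ]
  test  <- iris[-train_rows, ]

  # Fit the learner on the training data only
  fit  <- randomForest(Species ~ ., data = train)
  pred <- predict(fit, newdata = test)

  # Predictive accuracy: the share of new observations classified correctly
  mean(pred == test$Species)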

According to Breiman, the TS approach has led to irrelevant theory that can promote questionable conclusions while deterring statisticians from using more suitable methods. This direction has also prevented statisticians from working on exciting new problems for which traditional models are ill-suited. He notes that imposing a parametric model on data is often too simplistic and constraining, and it may lead to incorrect results. The ML approach, in contrast, shifts the focus from data models to algorithms, from deduction to induction, with predictive accuracy as the ultimate goal and the strength of explanatory variables judged by their contribution to it. Rather than look for simple models that are mathematically tractable, ML embraces the complexity of high dimensionality for analysis. From the ML perspective, it’s more important to be accurate than simple.

Breiman doesn’t cavalierly dismiss traditional statistics for machine learning. Instead, he advocates a pragmatic approach to modeling that uses the best of both TS and ML. His mantra for the practice of analytics offers guidance for BI:

  1. Explore the data before starting the modeling process,
  2. Search for a suitable model, either from TS or ML, that provides a good solution and
  3. Make predictive accuracy on test data the determinant of success.

Breiman seems the ultimate analytics pragmatist; a rough sketch of his mantra in R follows.
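
The sketch below assumes the rpart package is installed and frames a toy classification problem from R’s built-in mtcars data; the glm and rpart calls simply stand in for a TS candidate and an ML candidate.

  library(rpart)
  set.seed(1)

  # 1. Explore the data before starting the modeling process
  summary(mtcars)

  # A toy problem: predict transmission type (automatic vs. manual)
  mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))
  idx   <- sample(nrow(mtcars), 22)
  train <- mtcars[idx, ]
  test  <- mtcars[-idx, ]

  # 2. Search for a suitable model, either from TS or ML
  ts_fit <- glm(am ~ hp + wt, data = train, family = binomial)   # traditional statistics
  ml_fit <- rpart(am ~ ., data = train)                          # machine learning

  # 3. Make predictive accuracy on test data the determinant of success
  ts_pred <- ifelse(predict(ts_fit, test, type = "response") > 0.5, "manual", "automatic")
  ml_pred <- predict(ml_fit, test, type = "class")
  mean(ts_pred == test$am)
  mean(ml_pred == test$am)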

Breiman’s argument that traditional statistical approaches are at least somewhat dated is valid. Reactions to Breiman’s work among statisticians at the time of his paper, however, were muted at best. Most acknowledged that the article presented valid criticisms, even if not all its precepts were universally accepted. If nothing else, the paper helped open statistical eyes to the contributions of other disciplines and popularize the term “statistical learning,” a phrasing more palatable to statisticians than ML.

Statistician Richard Berk provides some foundation for conciliation between TS and ML in his new book, Statistical Learning from a Regression Perspective. Berk delineates four types of stories to be told by data analysis: causal, conditional distribution, data summary, and forecasting.1 Though he ostensibly resists generalizations and argues that both TS and ML techniques can be applied successfully to each story, Berk hints that theoretically based, top-down causal and conditional distribution problems are often better suited to traditional statistics, while bottom-up summary and forecasting questions are served better by ML. In a final acknowledgment to an analytics world increasingly at ease with the learning approach, Berk claims that causal and distribution stories are increasingly taking a back seat to forecasting, where “many statistical learning procedures perform well, and often considerably better than conventional causal modeling.”2

My take is a pragmatic one that recognizes the unique contributions of both TS and ML, though I’m glad Breiman sent a clarion call to the staid statistical world seven years ago. Over the past 30 months, I’ve migrated from a full reliance on linear models for prediction to enthusiastic adoption of flexible learning procedures such as generalized additive models, regression trees and gradient boosting - the last two of which add the wisdom of ensemble calculations. I’ve also confirmed that learning procedures often perform better than their TS counterparts. At the same time, I generally use TS methods like propensity models to evaluate the impact of business interventions and test specific performance hypotheses. For problems in which pure prediction is paramount, statistical learning is indicated. For problems that involve hypothesis testing and parameter interpretation, linear models are very much in order. The prudent BI predictive analyst should search for - and embrace - established, new and improving methods as practically indicated, regardless of their origins.
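
As a closing illustration of that division of labor, here is a hedged sketch - assuming the gbm and MASS packages are available - in which a linear model is kept for its interpretable coefficients and significance tests, while gradient boosting is scored purely on out-of-sample error:

  library(MASS)   # Boston housing data
  library(gbm)
  set.seed(7)

  idx   <- sample(nrow(Boston), 350)
  train <- Boston[idx, ]
  test  <- Boston[-idx, ]

  # TS: interpretable parameters, standard errors and significance tests
  lm_fit <- lm(medv ~ lstat + rm, data = train)
  summary(lm_fit)

  # ML: gradient boosting aimed squarely at prediction
  gbm_fit <- gbm(medv ~ ., data = train, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3, shrinkage = 0.05)

  # Out-of-sample root mean squared error is the arbiter
  rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
  rmse(test$medv, predict(lm_fit, test))
  rmse(test$medv, predict(gbm_fit, test, n.trees = 1000))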

References:

  1. Richard A. Berk. Statistical Learning from a Regression Perspective. Springer, 2008.
  2. Berk.