Statistical Learning for BI, Part 1
OpenBI Forum
Information Management Online, November 26, 2008
This is the first in a series of columns on statistical learning for business intelligence (BI). Column one contrasts what's come to be called statistical or machine learning (ML) with traditional statistical (TS) methods for predictive modeling. Subsequent columns will discuss many of the latest statistical learning procedures and identify software for deployment. Illustrations will be drawn from business applications.
One advantage of working with R is the collateral learning derived from participation in the various support lists. The lists certainly help me keep up to date with coding tricks, best practices, bug resolution, new release schedules and the availability of the latest statistical packages from the worldwide community. But occasionally correspondence evolves into an important philosophical discussion of an issue pertinent to statisticians and business analytics professionals alike.
Not long ago, a newbie - R list-speak for a community novice - presented his research data and statistical analysis on one of the forums, asking for commentary on the work. Though he didn't realize it at the time, his overture would start a lively discussion that escalated from evaluations of his R coding prowess to the reliability of his approach for testing the findings, and, finally, to talk of whether the chosen model was even appropriate for the data, given the model's stringent assumptions. Questions were raised on the relative importance of the model's ability to accurately predict the future versus its ability to sensibly explain findings by estimating parameters and determining their significance. A community elder framed the evolving discussion as one that positioned traditional statistics, with mathematical models that have inference foundations, against machine learning, which focuses on algorithms and obsesses on predictive accuracy. He cited a 2001 article by UC Berkeley Professor Leo Breiman that provocatively articulated the TS versus ML tension in the field.
Professor Breiman was certainly qualified to offer commentary on the disparate approaches to predictive modeling. He earned degrees in both physics and mathematics before accepting a teaching position at UCLA. Years later, he left academia for a stint in freelance statistical consulting, which helped shape his views on algorithmic modeling. Breiman then joined the statistics department at UC Berkeley, where he made seminal contributions to statistics and the nascent field of ML for 25 years before his untimely passing in 2005. BI analysts recognize his noteworthy contributions of classification and regression trees (CART) and random forests to the predictive modeling knowledge base.
Breiman's illustrations focus primarily on supervised learning, for which there's a designated dependent variable to predict with one or more explanatory attributes. Supervised learning is further delineated by whether that dependent variable is continuous (regression) or categorical (classification). Breiman's major criticism of TS is that the culture that comes from academic statistics is too focused on the mathematical models and assumptions underpinning the data-generating process, obsessing unproductively on model parameters and on validation measured by goodness-of-fit significance tests. In his mind, this deductive approach to modeling is too limiting for the growing complexity of problems and volumes of data.
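To make the TS habit of mind concrete, here is a minimal Python sketch of a simple linear regression judged the traditional way: by its estimated parameters, the slope's standard error and t-statistic, and in-sample R-squared. The ad spend and sales figures are simulated purely for illustration, and the function name `ols_simple` is an assumption of this sketch, not something taken from Breiman's article.

```python
import math
import random


def ols_simple(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns the parameter estimates together with the usual
    inference and goodness-of-fit summaries.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual and total sums of squares, computed on the fitting data.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)

    # Unbiased residual variance (n - 2 degrees of freedom).
    sigma2 = rss / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)

    return {
        "intercept": intercept,
        "slope": slope,
        "se_slope": se_slope,
        "t_slope": slope / se_slope,
        "r_squared": 1.0 - rss / tss,
    }


# Simulated (hypothetical) data: sales rise linearly with ad spend, plus noise.
rng = random.Random(42)
spend = [rng.uniform(0, 10) for _ in range(200)]
sales = [3.0 + 2.0 * s + rng.gauss(0, 1.5) for s in spend]

fit = ols_simple(spend, sales)
print(f"slope     = {fit['slope']:.3f} (se {fit['se_slope']:.3f})")
print(f"t (slope) = {fit['t_slope']:.1f}")
print(f"R-squared = {fit['r_squared']:.3f}")
```

Every number reported here is computed on the same observations used to fit the model, and each one is meaningful only if the assumed linear form with independent, constant-variance errors actually holds. That dependence on an assumed data model is the target of Breiman's criticism.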
The machine learning paradigm, which has roots outside statistics in the computer and mathematical sciences, is more inductive; it can learn from the data with fewer preconceived notions and assumptions. Whereas precisely specified (but limiting) models with parameters are de rigueur in traditional statistics, flexible black-box methods that mask complexity are the norm with ML. The criterion for model validation with ML is predictive accuracy: the ability of models to reliably predict or classify new observations.
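The validation criterion described above can be sketched in a few lines of Python: hold back observations the model never sees during fitting, then score each candidate by its error on those observations alone. The example below is only an illustration under stated assumptions. The data are simulated from a curved relationship, a k-nearest-neighbor average stands in for the algorithmic approach (it is not one of Breiman's own methods), and the helper names `fit_line`, `fit_knn` and `mse` are invented for this sketch.

```python
import math
import random


def fit_line(x, y):
    """Parametric model: least-squares straight line; returns a predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x_new: intercept + slope * x_new


def fit_knn(x, y, k=10):
    """Algorithmic model: average the k nearest training responses."""
    pairs = list(zip(x, y))

    def predict(x_new):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x_new))[:k]
        return sum(p[1] for p in nearest) / len(nearest)

    return predict


def mse(model, x, y):
    """Mean squared prediction error of a fitted model on (x, y)."""
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


# Simulated (hypothetical) data with a curved signal a straight line cannot follow.
rng = random.Random(7)
x_all = [rng.uniform(0, 6) for _ in range(400)]
y_all = [math.sin(2 * xi) + rng.gauss(0, 0.3) for xi in x_all]

# Holdout split: fit on the first 300 observations, judge on the last 100.
x_train, y_train = x_all[:300], y_all[:300]
x_test, y_test = x_all[300:], y_all[300:]

line = fit_line(x_train, y_train)
knn = fit_knn(x_train, y_train, k=10)

line_test_mse = mse(line, x_test, y_test)
knn_test_mse = mse(knn, x_test, y_test)

print(f"straight line, holdout MSE: {line_test_mse:.3f}")
print(f"k-nearest neighbors, holdout MSE: {knn_test_mse:.3f}")
```

Neither model is rewarded for how well it fits the training data or for how interpretable its parameters are; the only score that counts is error on the held-out observations. On data like these, the assumption-light method predicts better because the straight line imposes a form the data do not have.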
According to Breiman, the TS approach has led to irrelevant theory that can promote questionable conclusions while deterring statisticians from using more suitable methods. This direction has also prevented statisticians from working on exciting new problems for which traditional models are ill-suited. He notes that imposing a parametric model on data is often too simplistic and constraining, and it may lead to incorrect results. The approach of ML, in contrast, changes the focus from data models to algorithms, from deduction to induction, with predictive accuracy characterized by the strength of explanatory variables as the ultimate goal. Rather than look for simple models that are mathematically tractable, ML embraces the complexity of high dimensionality for analysis. From the ML perspective, it's more important to be accurate than simple.
Breiman doesn't cavalierly dismiss traditional statistics for machine learning. Instead, he advocates a pragmatic approach to modeling that uses the best of both TS and ML. His mantra for the practice of analytics provides guidance for BI:
- Explore the data before starting the modeling process,
- Search for a suitable model, either from TS or ML, that provides a good solution and






