Statistical Learning for BI, Part 1

OpenBI Forum

Information Management Online, November 26, 2008

Steve Miller

This is the first in a series of columns on statistical learning for business intelligence (BI). Column one contrasts what’s come to be called statistical or machine learning (ML) with traditional statistical (TS) methods for predictive modeling. Subsequent columns will discuss many of the latest statistical learning procedures and identify software for deployment. Illustrations will be drawn from business applications.

My fondness for the R statistical computing platform is no secret. R combines two current passions at OpenBI: statistical analysis and open source software. Both the R platform and community appear to be growing exponentially in the academic, research and, more recently, business worlds. Increasingly, the benefits of this growth are manifest to BI.

One advantage of working with R is the collateral learning derived from participation in the various support lists. The lists certainly help me keep up to date with coding tricks, best practices, bug resolution, new release schedules and the availability of the latest statistical packages from the worldwide community. But occasionally correspondence evolves into an important philosophical discussion of an issue pertinent to statisticians and business analytics professionals alike.

Not long ago, a “newbie” - R list-speak for a community novice - presented his research data and statistical analysis on one of the forums, asking for commentary on the work. Though he didn’t realize it at the time, his overture would start a lively discussion that escalated from evaluations of his R coding prowess to the reliability of his approach for testing the findings, and, finally, to talk of whether the chosen model was even appropriate for the data, given the model’s stringent assumptions. Questions were raised on the relative importance of the model’s ability to accurately predict the future versus sensibly explain findings by estimating parameters and determining their significance. A community “elder” framed the evolving discussion as one that positioned traditional statistics, with mathematical models that have inference foundations, against machine learning, which focuses on algorithms and obsesses over predictive accuracy. He cited a 2001 article by UC Berkeley Professor Leo Breiman that provocatively articulated the TS versus ML tension in the field.

Professor Breiman was certainly qualified to offer commentary on the disparate approaches to predictive modeling. He studied physics as an undergraduate and earned a doctorate in mathematics before accepting a teaching position at UCLA. Years later, he left academia for a stint in freelance statistical consulting, which helped shape his views on algorithmic modeling. Breiman then joined the statistics department at UC Berkeley, where he made seminal contributions to statistics and the nascent field of ML for 25 years before his passing in 2005. BI analysts recognize his noteworthy contributions of classification and regression trees (CART) and random forests to the predictive modeling knowledge base.

Breiman’s illustrations focus primarily on supervised learning, for which there’s a designated dependent variable to predict with one or more explanatory attributes. Supervised learning is further delineated by whether that dependent variable is continuous (regression) or categorical (classification). Breiman’s major criticism of TS is that the culture that comes from academic statistics is too focused on the mathematical models and assumptions underpinning the data generating process, obsessing unproductively over model parameters and over validation measured by goodness-of-fit significance tests. In his mind, this deductive approach to modeling is too limiting for the growing complexity of problems and volumes of data.

The machine learning paradigm, which has roots outside statistics in the computer and mathematical sciences, is more inductive; it can learn from the data with fewer preconceived notions and assumptions. Whereas precisely specified (but limiting) models with parameters are de rigueur in traditional statistics, flexible “black box” methods that mask complexity are the norm with ML. The criterion for model validation with ML is predictive accuracy – the ability of models to reliably predict or classify new observations.
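To make the predictive-accuracy criterion concrete, here is a minimal sketch in Python. It is an illustration, not anything drawn from Breiman’s article: the data are synthetic, a straight-line least-squares fit stands in for a precisely specified parametric model, and a k-nearest-neighbors average stands in for a flexible algorithmic method. The function names (`fit_line`, `knn_predict`, `holdout_comparison`) and all settings are assumptions chosen for the example. Both methods are judged only on observations held out from fitting.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def knn_predict(train_x, train_y, x, k=5):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(preds, ys):
    """Mean squared prediction error."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def holdout_comparison(seed=0, n=300, n_test=100):
    """Fit both methods on training data; score both on held-out data."""
    rng = random.Random(seed)
    # Synthetic data (assumption): a curved relationship plus noise,
    # which a straight line cannot represent.
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [x * x + rng.gauss(0, 0.3) for x in xs]

    # The last n_test observations are never seen during fitting.
    train_x, test_x = xs[:-n_test], xs[-n_test:]
    train_y, test_y = ys[:-n_test], ys[-n_test:]

    a, b = fit_line(train_x, train_y)
    linear_mse = mse([a + b * x for x in test_x], test_y)
    knn_mse = mse([knn_predict(train_x, train_y, x) for x in test_x], test_y)
    return linear_mse, knn_mse


if __name__ == "__main__":
    linear_mse, knn_mse = holdout_comparison()
    print(f"linear held-out MSE: {linear_mse:.3f}")
    print(f"k-NN held-out MSE:   {knn_mse:.3f}")
```

On data like these, the straight line is the wrong model and its held-out error is far larger than that of the nearest-neighbor average. The comparison is deliberately one-sided: when the true relationship really is linear, the parametric fit would be expected to hold its own, which is consistent with a pragmatic approach of trying methods from both camps.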

According to Breiman, the TS approach has led to irrelevant theory that can promote questionable conclusions while deterring statisticians from using more suitable methods. This direction has also prevented statisticians from working on exciting new problems for which traditional models are ill-suited. He notes that imposing a parametric model on data is often too simplistic and constraining, and it may lead to incorrect results. The approach of ML, in contrast, shifts the focus from data models to algorithms and from deduction to induction, with predictive accuracy, characterized by the strength of explanatory variables, as the ultimate goal. Rather than look for simple models that are mathematically tractable, ML embraces the complexity of high dimensionality for analysis. From the ML perspective, it’s more important to be accurate than simple.

Breiman doesn’t cavalierly dismiss traditional statistics for machine learning. Instead, he advocates a pragmatic approach to modeling that uses the best of both TS and ML. His mantra for the practice of analytics provides guidance for BI:

  1. Explore the data before starting the modeling process,
  2. Search for a suitable model, either from TS or ML, that provides a good solution and
