NOV 26, 2008 4:20am ET

Statistical Learning for BI, Part 1


This is the first in a series of columns on statistical learning for business intelligence (BI). Column one contrasts what’s come to be called statistical or machine learning (ML) with traditional statistical (TS) methods for predictive modeling. Subsequent columns will discuss many of the latest statistical learning procedures and identify software for deployment. Illustrations will be drawn from business applications.

My fondness for the R statistical computing platform is no secret. R combines two current passions at OpenBI: statistical analysis and open source software. Both the R platform and community appear to be growing exponentially in the academic, research and, more recently, business worlds. Increasingly, the benefits of this growth are manifest to BI.

One advantage of working with R is the collateral learning derived from participation in the various support lists. The lists certainly help me keep up to date with coding tricks, best practices, bug resolution, new release schedules and the availability of the latest statistical packages from the worldwide community. But occasionally correspondence evolves into an important philosophical discussion of an issue pertinent to statisticians and business analytics professionals alike.

Not long ago, a “newbie” - R list-speak for a community novice - presented his research data and statistical analysis on one of the forums, asking for commentary on the work. Though he didn’t realize it at the time, his overture would start a lively discussion that escalated from evaluations of his R coding prowess to the reliability of his approach for testing the findings, and, finally, to talk of whether the chosen model was even appropriate for the data, given the model’s stringent assumptions. Questions were raised on the relative importance of the model’s ability to accurately predict the future versus sensibly explain findings by estimating parameters and determining their significance. A community “elder” framed the evolving discussion as one that positioned traditional statistics, with mathematical models that have inference foundations, against machine learning, which focuses on algorithms and obsesses on predictive accuracy. He cited a 2001 article by UC Berkeley Professor Leo Breiman, “Statistical Modeling: The Two Cultures,” that provocatively articulated the TS versus ML tension in the field.

Professor Breiman was certainly qualified to offer commentary on the disparate approaches to predictive modeling. He studied physics as an undergraduate and earned a doctorate in mathematics before accepting a teaching position at UCLA. Years later, he left academia for a stint in freelance statistical consulting, which helped shape his views on algorithmic modeling. Breiman then joined the statistics department at UC Berkeley, where he made seminal contributions to statistics and the nascent field of ML for 25 years before his untimely passing in 2005. BI analysts recognize his noteworthy contributions of classification and regression trees (CART) and random forests to the predictive modeling knowledge base.

Breiman’s illustrations focus primarily on supervised learning, for which there’s a designated dependent variable to predict with one or more explanatory attributes. Supervised learning is further delineated by whether that dependent variable is continuous (regression) or categorical (classification). Breiman’s major criticism of TS is that the culture that comes from academic statistics is too focused on the mathematical models and assumptions underpinning the data generating process, obsessing unproductively over model parameters and over validation measured by goodness-of-fit significance tests. In his mind, this deductive approach to modeling is too limiting for the growing complexity of problems and volumes of data.

The machine learning paradigm, which has roots outside statistics in the computer and mathematical sciences, is more inductive; it can learn from the data with fewer preconceived notions and assumptions. Whereas precisely specified (but limiting) models with parameters are de rigueur in traditional statistics, flexible “black box” methods that mask complexity are the norm with ML. The criterion for model validation with ML is predictive accuracy – the ability of models to reliably predict or classify new observations.
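To make the validation criterion concrete, here is a minimal sketch in Python, using only the standard library. Everything in it is an assumption for illustration: the synthetic two-class data, the helper names (make_data, predict_1nn, accuracy) and the choice of a one-nearest-neighbor classifier as the “black box.” It is not drawn from Breiman’s paper. The point is only that a flexible method can look perfect on the data it was fitted to, so the figure that matters is accuracy on observations the model has not seen.

```python
import random

def make_data(n, rng):
    """Two overlapping classes on a line: label 1 tends to have larger x."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label, 1.0)  # class means 0 and 2, unit spread
        rows.append((x, label))
    return rows

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train, rows):
    """Share of rows whose label the 1-NN rule predicts correctly."""
    hits = sum(1 for x, label in rows if predict_1nn(train, x) == label)
    return hits / len(rows)

rng = random.Random(0)  # fixed seed so the run is repeatable
data = make_data(200, rng)
train, test = data[:150], data[150:]  # hold out 50 observations

train_accuracy = accuracy(train, train)  # each point is its own neighbor
test_accuracy = accuracy(train, test)    # the honest, out-of-sample figure

print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"accuracy on new data:      {test_accuracy:.2f}")
```

Because every training point is its own nearest neighbor, the training figure is perfect by construction and says nothing about how the classifier will fare on new observations. Only the held-out figure speaks to that.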

According to Breiman, the TS approach has led to irrelevant theory that can promote questionable conclusions while deterring statisticians from using more suitable methods. This direction has also prevented statisticians from working on exciting new problems for which traditional models are ill-suited. He notes that imposing a parametric model on data is often too simplistic and constraining, and it may lead to incorrect results. The approach of ML, in contrast, changes the focus from data models to algorithms, from deduction to induction, with predictive accuracy characterized by the strength of explanatory variables as the ultimate goal. Rather than look for simple models that are mathematically tractable, ML embraces the complexity of high dimensionality for analysis. From the ML perspective, it’s more important to be accurate than simple.

Breiman doesn’t cavalierly dismiss traditional statistics for machine learning. Instead, he advocates a pragmatic approach to modeling that uses the best of both TS and ML. His mantra for the practice of analytics provides guidance for BI:

  1. Explore the data before starting the modeling process,
  2. Search for a suitable model, either from TS or ML, that provides a good solution and
  3. Make predictive accuracy on test data the determinant of success.

Breiman seems the ultimate analytics pragmatist.

Breiman’s argument that traditional statistical approaches are at least somewhat dated is valid. Reactions to Breiman’s work among statisticians at the time of his paper, however, were muted at best. Most acknowledged that the article presented valid criticisms, even if not all of its precepts were universally accepted. If nothing else, the paper helped open statistical eyes to the contributions of other disciplines and popularize the term “statistical learning,” a phrasing more palatable to statisticians than ML.
