for Information Management Blogs
APR 20, 2009 3:07am ET

More Statistical Learning

The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (ESL), by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, is now available. The authors, along with mentor Brad Efron and other faculty and students from the top-ranked Statistics department at Stanford University, continue to advance the discipline of statistical learning, a convergence of statistics with machine learning, at a feverish pace, much to the benefit of business intelligence.

ESL is encyclopedic in its command of the latest methods, especially those surrounding supervised learning, in which there is a known dependent variable to predict, either numeric (e.g. lifetime customer value) or categorical (e.g. fraud or abuse). Alas, ESL is a book for the mathematically sophisticated, and the reading is probably a bit heavy for many BI analysts.

Those without a strong math background but with a solid business understanding of multiple regression might benefit first from Richard Berk's Statistical Learning from a Regression Perspective (SLRP). Berk, a statistician and social scientist, provides a gentler, applied foundation for statistical learning that should resonate with a BI audience.

Berk outlines four separate “stories” for which traditional parametric regression or the more flexible statistical learning techniques are applicable:
  • A Causal Story – in which the analyst is looking to test a theory – to relate the independent X's to the dependent Y in such a way as to conclude the X's cause Y.
  • A Conditional Distribution Story – in which the analyst deploys the traditional linear regression model with assumptions about the behavior of the errors.
  • A Data Summary Story – in which the learning is used to reduce the dimensionality of the problem space.
  • A Forecasting Story – in which the analyst constructs a model to forecast future behavior.
Though I'm not sure Berk would agree, I see the first two stories as more relevant to scientific research and academia, and the latter two as more pertinent to business analytics. I've also found that for summary and forecasting challenges, the newer learning techniques often perform better than traditional regression, especially when the relationship of the X's to Y is complex.

Much to my satisfaction, Berk provides a comprehensive discussion of regression smoothers. Though smoothers look much like linear regression models, they differ in that they adapt to the patterns in the data more readily than traditional regression, where the analyst must specify the functional form of the relationship in advance. As Berk notes: “As long as one is content to merely describe, these methods are consistent with the goals of exploratory data analysis.”
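
To make the contrast concrete, here is a minimal sketch in Python, not taken from Berk's book. It fits a straight line and a simple kernel smoother (a locally weighted average, one of many possible smoothers) to made-up data that follow a sine curve. The data, the bandwidth of 0.4, and all function names are assumptions chosen for illustration, and the errors are measured on the same data used for fitting, so this describes fit rather than forecasting skill.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def kernel_smoother(xs, ys, bandwidth):
    """Locally weighted average of y, with Gaussian weights centred on x0."""
    def predict(x0):
        weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
        return sum(w * y for w, y in zip(weights, ys)) / sum(weights)
    return predict

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on the given points."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: a curved relationship plus noise.
rng = random.Random(0)
xs = [i / 20 for i in range(200)]
ys = [math.sin(x) + rng.gauss(0, 0.2) for x in xs]

# The line needs its functional form fixed in advance; the smoother does not.
line = fit_line(xs, ys)
smooth = kernel_smoother(xs, ys, bandwidth=0.4)

line_mse = mse(line, xs, ys)
smooth_mse = mse(smooth, xs, ys)
print(f"linear MSE: {line_mse:.3f}, smoother MSE: {smooth_mse:.3f}")
```

The bandwidth plays the role of a tuning knob: smaller values follow the data more closely, larger values approach a flat average.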

Berk also pays homage to Classification and Regression Trees (CART), a foundational learning method that's served the BI world well for over 15 years. The next generation of CART-like methods builds on the wisdom of crowds, combining many models into ensembles that make ever more precise predictions. Bagging deploys bootstrapping methods to resample the data and average multiple predictions, often with significant forecasting lift. Boosting, by contrast, aggregates a group of weak classifiers – perhaps each little better than random guessing – into a committee with often powerful predictive insights.
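
The two ideas can be sketched in a few lines of Python. This is an illustrative toy, not code from either book: the weak learner is a one-split "stump", boosting is the standard AdaBoost weighting scheme, and the data set, function names, and round counts are all assumptions. The toy problem (a positive class sitting in the middle of the range) was picked because no single stump can solve it, which favors boosting; bagging tends to help most with unstable learners such as deep trees, which this sketch does not include.

```python
import math
import random

def stump_predict(x, threshold, sign):
    """A stump predicts `sign` when x > threshold and `-sign` otherwise."""
    return sign if x > threshold else -sign

def fit_stump(xs, ys, weights):
    """Return (weighted error, threshold, sign) of the best stump."""
    best = None
    for threshold in [min(xs) - 1.0] + sorted(set(xs)):
        for sign in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, sign) != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best

def bagging(xs, ys, n_models, rng):
    """Fit one stump per bootstrap resample (sampling rows with replacement)."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
        _, threshold, sign = fit_stump(bx, by, [1.0 / n] * n)
        models.append((threshold, sign))
    return models

def bagged_predict(x, models):
    """Majority vote over the bagged stumps."""
    votes = sum(stump_predict(x, t, s) for t, s in models)
    return 1 if votes > 0 else -1

def adaboost(xs, ys, rounds):
    """AdaBoost: reweight the data so later stumps focus on earlier mistakes."""
    n = len(xs)
    weights = [1.0 / n] * n
    committee = []
    for _ in range(rounds):
        error, threshold, sign = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        committee.append((alpha, threshold, sign))
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return committee

def boosted_predict(x, committee):
    """Weighted vote of the committee of stumps."""
    score = sum(a * stump_predict(x, t, s) for a, t, s in committee)
    return 1 if score > 0 else -1

def accuracy(predict, xs, ys):
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: label +1 only in the middle band of x.
xs = [i / 100 for i in range(100)]
ys = [1 if 0.3 < x < 0.7 else -1 for x in xs]

_, t0, s0 = fit_stump(xs, ys, [1.0 / len(xs)] * len(xs))
models = bagging(xs, ys, n_models=25, rng=random.Random(0))
committee = adaboost(xs, ys, rounds=30)

stump_acc = accuracy(lambda x: stump_predict(x, t0, s0), xs, ys)
bag_acc = accuracy(lambda x: bagged_predict(x, models), xs, ys)
boost_acc = accuracy(lambda x: boosted_predict(x, committee), xs, ys)
print(f"single stump: {stump_acc:.2f}, bagged: {bag_acc:.2f}, boosted: {boost_acc:.2f}")
```

All three accuracies are measured on the training data, so they show how the committees are assembled rather than how well they would forecast new cases.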

SLRP is an excellent text that can serve analysts well as either a prerequisite or companion reading for ESL. Let me suggest a third book to be written for a statistical learning trilogy. This text would take the methods so elegantly formulated in ESL and explained in SLRP and provide comprehensive illustrations with real business and social science data sets. The analysis could be done with R packages hot off SourceForge, written by the method developers themselves. With patterned prediction problems and ample R code, the newest learning techniques would get a major “boost” in the BI world.

Steve Miller's blog can also be found at miller.openbi.com.
