I guess I'm proof positive that an old dog can learn new tricks. I was a bit concerned as I considered registering for the early-October statistical learning seminar in Boston. Though I own and am familiar with the book The Elements of Statistical Learning by course instructors Trevor Hastie and Robert Tibshirani and their colleague Jerome Friedman, it had been some time since I'd participated in an intensive, multiday training course. Besides, the trip would be quite expensive, and I wasn't sure how receptive my partners would be to my skipping out for a few days and spending a large chunk of our discretionary funds. I ruminated on whether to commit when I first saw the announcement in midsummer, but the October 6-7 course dates closed the deal. I could spend all day Sunday the 5th gadding about Boston, visiting with friends, and maybe even experiencing live, post-season baseball.
On my early Sunday morning flight from Chicago to Boston, I prepped for Monday by reviewing the Breiman article discussed in my last column, as well as a thoughtful 1997 sister article by Jerome Friedman, entitled "Data Mining and Statistics: What's the Connection?" Like Breiman, Friedman envisions an expanding academic analytics world - one that is progressing far beyond the purview of the traditional statistics currently taught in universities. He sees the then-emerging field of data mining as a close match to statistics in the types of problems addressed, arguing that both statistics and data mining should make accommodations to collaborate more closely for the common good. Friedman cautions that an isolated discipline of statistics will lose students, researchers and standing to other, evolving information sciences.
Hastie, Tibshirani and Friedman are on the faculty of Stanford University, home of the top-rated statistics department in the U.S. The late Leo Breiman was a professor at UC Berkeley, number two in statistics. Both Stanford and Berkeley have assumed leadership positions in the cross-discipline study of statistics and machine learning - what's come to be known as statistical learning. The field is evolving so quickly that new methods are seemingly sprouting overnight, keeping faculty and generations of graduate students busy developing algorithms, refining optimization techniques and coding software. Indeed, a second edition of the award-winning Elements has already been released. Stanford and Berkeley statistical learners are gobbled up quickly in the job market as academia competes with consultancies and internet information companies like Yahoo! and Google.
The primary focus of both days of the seminar was supervised learning, in which dependent or outcome measures supervise the learning of independent attributes for use in predicting future instances. If the outcome consists of categories such as fraud/no fraud, disease/disease-free or churn/no churn, the problem is one of classification. If the outcome is an interval variable like income or revenue, on the other hand, the problem is one of regression. Monday was dedicated primarily to tall data - data sets for which the number of cases (N) is substantially larger than the number of attributes or predictors (p). Predictive modeling in business intelligence revolves primarily around supervised learning of tall data.
The training was certainly fast-paced. After introductions and distribution of course notes, the morning session on day one started with a review of least squares regression, along with a discussion of mean squared error and the bias-variance tradeoff. The topic of variable selection in linear regression led to the recurring theme of model assessment with the basic testing tools of independent train/tune/test data sets, bootstrapping and cross-validation. The instructors are sticklers for rigorous testing, noting on several occasions the negative consequences of cavalier modeling without appropriate validation.
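For readers who want to see the cross-validation idea in miniature, here is a rough sketch in plain Python. It is my own illustration, not material from the course notes: the helper names (fit_line, k_fold_mse), the choice of five folds and the synthetic data are all assumptions. The model is fit on all folds but one, scored on the fold held out, and the held-out errors are averaged.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds (hypothetical helper)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score only on cases the model never saw during fitting.
        mse = sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / k

# Synthetic data (an assumption for illustration): y = 2 + 3x plus noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.5) for x in xs]
cv_mse = k_fold_mse(xs, ys)
print("5-fold CV estimate of MSE:", round(cv_mse, 3))
```

The point of the exercise is that the error estimate comes only from data the fit did not touch, which is the discipline the instructors kept returning to.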
The rest of the morning was fast and furious, continuing with a discussion of shrinkage methods that pair model fitting with cross-validation to protect against overfitting the training data, accepting a little bias in exchange for lower variance. In semi-mathematical terms, the instructors explained the algorithms for ridge regression, the Lasso, principal components regression and partial least squares, demonstrating how the methods behaved in the extreme and compared with one another.
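To make shrinkage concrete, here is a small sketch of my own for the simplest possible case: one centered predictor and no intercept. The function names, the toy data and the penalty values are assumptions chosen for illustration. In this setting both ridge and the Lasso have closed forms: ridge scales the least squares coefficient toward zero, while the Lasso subtracts a fixed amount and can set the coefficient exactly to zero.

```python
def soft_threshold(z, lam):
    """Move z toward zero by lam, stopping at zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ridge_coef(xs, ys, lam):
    """Minimizes sum((y - b*x)**2) + lam * b**2 for centered x and y."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

def lasso_coef(xs, ys, lam):
    """Minimizes 0.5 * sum((y - b*x)**2) + lam * abs(b) for centered x and y."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return soft_threshold(sxy, lam) / sxx

# Toy centered data (an assumption for illustration).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.7]

for lam in (0.0, 5.0, 10.0, 25.0):
    print(lam, round(ridge_coef(xs, ys, lam), 3), round(lasso_coef(xs, ys, lam), 3))
```

With a penalty of zero both functions return the ordinary least squares slope. As the penalty grows, the ridge coefficient shrinks smoothly but never reaches zero, while the Lasso coefficient hits zero once the penalty exceeds the cross-product of x and y. In practice the penalty would be chosen by cross-validation rather than fixed by hand.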
After lunch we returned to least squares with twists of forward stagewise, the Lasso and least angle regression (LAR). The instructors also introduced boosting, where models are composed of multiple, slow-learning components. Generalized additive models (GAMs) provide a flexible approach for fitting semi-parametric models for tall supervised learning applications. I like GAMs a lot, finding them especially useful when it's known that relationships are not strictly linear. Hastie and Tibshirani published a book on GAMs in 1990 called Generalized Additive Models.
Toward the end of the day, attention moved from structured, regression-like models that are used for high-dimensional (i.e., many predictive attributes) problems to support vector machine, tree and ensemble methods. Ensemble methods bring the benefits of crowd wisdom to learning problems. Two of the more prominent crowd approaches are bagging and boosting. For bagging (bootstrap aggregating), the same learner, such as a tree, is fit many times to bootstrapped resamples of the training data. The predictions of the individual models are then averaged for regression or put to a vote for classification, often with dramatically reduced variance. Boosting fits a sequence of weak learners to reweighted versions of the training data, where each new learner in turn focuses on the cases its predecessors got wrong, ultimately consolidating with a weighted majority vote. Random forest classification and regression, developed by Breiman, is a refined bagging approach that bootstraps cases and considers only a random subset of attributes at each tree split, computing the out-of-bag error rate from observations not included in a given bootstrap sample. Gradient boosting builds additive tree models, inheriting positive tree features while improving on prediction performance. At the end of the day, gradient boosting and random forests fared best in a Consumer Reports-like bakeoff of discussed methods.
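Bagging and the out-of-bag error are easy to sketch for the curious. What follows is my own minimal illustration in plain Python, not the instructors' code: the base learner is a one-split regression tree (a stump), and the function names, the 50 resamples and the synthetic step-function data are all assumptions. It is plain bagging, not a random forest, since with a single predictor there are no attributes to sample.

```python
import random

def fit_stump(xs, ys):
    """Best single-split regression tree: (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    if best is None:  # every x identical: fall back to the overall mean
        m = sum(ys) / len(ys)
        return (pairs[0][0], m, m)
    return best[1:]

def predict_stump(stump, x):
    thr, lm, rm = stump
    return lm if x <= thr else rm

def bag_stumps(xs, ys, n_models=50, seed=0):
    """Fit one stump per bootstrap resample; remember which cases each one saw."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]  # draw n cases with replacement
        stump = fit_stump([xs[i] for i in sample], [ys[i] for i in sample])
        models.append((stump, set(sample)))
    return models

def bagged_predict(models, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s, _ in models) / len(models)

def oob_mse(models, xs, ys):
    """Score each case using only the stumps whose resample left it out."""
    errs = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        preds = [predict_stump(s, x) for s, used in models if i not in used]
        if preds:
            errs.append((y - sum(preds) / len(preds)) ** 2)
    return sum(errs) / len(errs)

# Synthetic data (an assumption for illustration): a noisy step at x = 2.
rng = random.Random(1)
xs = [i / 10 for i in range(40)]
ys = [(0.0 if x < 2 else 5.0) + rng.gauss(0, 0.5) for x in xs]
models = bag_stumps(xs, ys)
print("prediction at 0.5:", round(bagged_predict(models, 0.5), 2))
print("prediction at 3.5:", round(bagged_predict(models, 3.5), 2))
print("out-of-bag MSE:", round(oob_mse(models, xs, ys), 3))
```

Each bootstrap resample leaves out roughly a third of the cases, so the out-of-bag error comes almost for free as an honest estimate of prediction error, with no separate test set required.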