I guess I'm proof positive that an old dog can learn new tricks. I was a bit concerned as I considered registering for the early-October statistical learning seminar in Boston. Though I own and am familiar with the book The Elements of Statistical Learning by course instructors Trevor Hastie and Robert Tibshirani and their colleague Jerome Friedman, it had been some time since I'd participated in an intensive, multiday training course. Besides, the trip would be quite expensive, and I wasn't sure how receptive my partners would be to my skipping out for a few days and spending a large chunk of our discretionary funds. I ruminated on whether to commit when I first saw the announcement in midsummer, but the October 6-7 course dates closed the deal. I could spend all day Sunday the 5th gadding about Boston, visiting with friends, and maybe even experiencing live, post-season baseball.
On my early Sunday morning flight from Chicago to Boston, I prepped for Monday by reviewing the Breiman article discussed in my last column, as well as a thoughtful 1997 sister article by Jerome Friedman, entitled "Data Mining and Statistics: What's the Connection?" Like Breiman, Friedman envisions an expanding academic analytics world - one that is progressing far beyond the purview of the traditional statistics currently taught in universities. He sees the then-emerging field of data mining as a close match to statistics in the types of problems addressed, arguing that both statistics and data mining should make accommodations to collaborate more closely for the common good. Friedman cautions that an isolated discipline of statistics will lose students, researchers and standing to other, evolving information sciences.
Hastie, Tibshirani and Friedman are on the faculty of Stanford University, home of the top-rated statistics department in the U.S. The late Leo Breiman was a professor at UC Berkeley, number two in statistics. Both Stanford and Berkeley have assumed leadership positions in the cross-discipline study of statistics and machine learning - what's come to be known as statistical learning. The field is evolving so quickly that new methods are seemingly sprouting overnight, keeping faculty and generations of graduate students busy developing algorithms, refining optimization techniques and coding software. Indeed, version two of the award-winning Elements has already been released. Stanford and Berkeley statistical learners are gobbled up quickly in the job market as academia competes with consultancies and internet information companies like Yahoo! and Google.
The primary focus of both days of the seminar was supervised learning, in which dependent or outcome measures supervise the learning of independent attributes for use in predicting future instances. If the outcome consists of categories such as fraud/no fraud, disease/disease-free or churn/no churn, the problem is one of classification. If the outcome is an interval variable like income or revenue, on the other hand, the problem is one of regression. Monday was dedicated primarily to tall data - data sets for which the number of cases (N) is substantially larger than the number of attributes or predictors (p). Predictive modeling in business intelligence revolves primarily around supervised learning of tall data.
The training was certainly fast-paced. After introductions and distribution of course notes, the morning session on day one started with a review of least squares regression, along with a discussion of mean square error and the bias-variance tradeoff. The topic of variable selection in linear regression led to the recurring theme of model assessment with the basic testing tools of independent train/tune/test data sets, bootstrapping and cross-validation. The instructors are sticklers for rigorous testing, noting on several occasions the negative consequences of cavalier modeling without appropriate validation.
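For readers who want to see the model assessment idea in code, here is a minimal sketch of k-fold cross-validation for a one-predictor least squares fit. It is my own illustration rather than anything from the course notes: the synthetic data, the helper names (`fit_line`, `mse`, `k_fold_mse`) and the choice of five folds are all assumptions made for the example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(a, b, xs, ys):
    """Mean squared error of the line a + b*x on the given cases."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average held-out MSE over k folds (hypothetical helper)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(a, b, [xs[i] for i in held_out],
                          [ys[i] for i in held_out]))
    return sum(scores) / k

# Synthetic data (an assumption): y = 2 + 3x plus unit-variance noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

a, b = fit_line(xs, ys)
train_mse = mse(a, b, xs, ys)
cv_mse = k_fold_mse(xs, ys)
print(f"slope={b:.3f} training MSE={train_mse:.3f} 5-fold CV MSE={cv_mse:.3f}")
```

The training error is computed on the same cases used to fit the line, so it tends to flatter the model; the cross-validated figure is the more honest estimate of how the fit would do on new cases.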
The rest of the morning was fast and furious, continuing with a discussion of shrinkage methods that combine model fitting with cross-validation to protect against overfitting the training data, accepting some bias in exchange for reduced variance. In semi-mathematical terms, the instructors explained the algorithms for ridge regression, the Lasso, principal components regression and partial least squares, demonstrating how the methods behaved in the extreme and how they compared with one another.
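To make the contrast between ridge and the Lasso concrete, here is a small sketch of how each one shrinks a least squares coefficient in the textbook special case of orthonormal predictors. That special case is an assumption I am making to keep the formulas to one line each, and the exact constants depend on how the penalty is scaled; the function names and example coefficients are mine.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge with orthonormal predictors: proportional shrinkage toward zero."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso with orthonormal predictors: soft-thresholding.

    Coefficients smaller than lam in absolute value are set exactly to zero;
    larger ones are moved toward zero by lam.
    """
    magnitude = max(abs(beta_ols) - lam, 0.0)
    return magnitude if beta_ols >= 0 else -magnitude

# Hypothetical least squares coefficients for three predictors.
ols = [3.0, -1.5, 0.4]

for lam in (0.0, 0.5, 2.0):
    ridge = [round(ridge_shrink(b, lam), 3) for b in ols]
    lasso = [round(lasso_shrink(b, lam), 3) for b in ols]
    print(f"lambda={lam}: ridge={ridge} lasso={lasso}")
```

Ridge scales every coefficient down but never removes one, while the Lasso zeroes out the small ones as the penalty grows, which is why it doubles as a variable selection method.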
After lunch we returned to least squares with twists of forward stagewise, the Lasso and least angle regression (LAR). The instructors also introduced boosting, where models are composed of multiple, slow-learning components. Generalized additive models provide a flexible approach for fitting semi-parametric models for tall supervised learning applications. I like GAMs a lot, finding them especially useful when it's known that relationships are not strictly linear. Hastie and Tibshirani published a book on GAMs in 1990 called Generalized Additive Models.
Toward the end of the day, attention moved from structured, regression-like models that are used for high-dimensional (i.e., many predictive attributes) problems to support vector machine, tree and ensemble methods. Ensemble methods bring the benefits of crowd wisdom to learning problems. Two of the more prominent crowd classifiers involve bagging and boosting. For bagging (bootstrap aggregating), the same classifier, such as a tree, is fit many times to bootstrapped resamples of training data. The results for the individual instances are then averaged to determine regression predictions or classification decisions, often with dramatically reduced variance. Boosting fits a sequence of weak learners to reweighted versions of training data, where each new learner in turn focuses on regions missed earlier, ultimately consolidating with a weighted majority vote. Random forest classification and regression, developed by Breiman, is a refined bagging approach that bootstraps the cases and considers only a random subset of the attributes at each tree split, computing the out-of-bag error rate for observations not included in the bootstrap. Gradient boosting builds additive tree models, inheriting positive tree features while improving on prediction performance. At the end of the day, gradient boosting and random forests fared best in a Consumer Reports-like bakeoff of discussed methods.
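The bagging recipe is simple enough to sketch in a few lines. What follows is my own toy illustration, not the instructors' code: it uses one-split regression trees (stumps) on a single predictor, synthetic step-shaped data and 50 bootstrap resamples, all of which are assumptions chosen to keep the example short. A random forest would additionally sample attributes at each split, and boosting would fit the learners sequentially rather than independently.

```python
import random

def fit_stump(xs, ys):
    """Best single-split regression stump: (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    if best is None:  # all x identical: predict the overall mean everywhere
        m = sum(ys) / len(ys)
        return (pairs[0][0], m, m)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def bag_stumps(xs, ys, n_models=50, seed=0):
    """Fit one stump per bootstrap resample of the training cases."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]  # draw n cases with replacement
        models.append(fit_stump([xs[i] for i in sample],
                                [ys[i] for i in sample]))
    return models

def predict_bagged(models, x):
    """Average the individual stump predictions (regression bagging)."""
    return sum(predict_stump(m, x) for m in models) / len(models)

# Synthetic data (an assumption): a step from 0 to 10 at x = 5, plus noise.
rng = random.Random(1)
xs = [i / 6 for i in range(60)]
ys = [(0.0 if x < 5 else 10.0) + rng.gauss(0, 0.5) for x in xs]

models = bag_stumps(xs, ys)
for x in (1.0, 9.0):
    print(f"bagged prediction at x={x}: {predict_bagged(models, x):.2f}")
```

Each stump sees a slightly different resample, so the individual fits wobble; averaging them is what buys the variance reduction.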
Day two was devoted to wide data, in which the number of attributes is larger than the number of cases. At first I was wary of spending a whole day on the topic, given there are many more tall BI applications than wide. But a sizeable percentage of participants were interested in wide data, and I was quite intrigued by the genomic examples discussed in class.
Tibshirani set the tone for the day by detailing his reanalysis of cancer findings presented in a prestigious medical journal that found several super genes differentiated in a Cox survival model. The problem was that with 49,000 genes (predictors) measured on 179 patient samples (cases), it was easy to overfit the data and find spurious results. His post-analysis recommendations encourage journal authors to publish both raw data and a script of their work as well as develop measures of fragility. In analyses where the number of predictors is substantially greater than the number of cases, overfitting can occur quickly. Simple methods are generally preferred.
Wide data presents a dilemma for determining which of the many predictors are significant. For example, with 12,625 genes, one would expect about 631 (12,625 x .05) to be significant at the .05 level by chance alone if they are independent, which they certainly are not. What is needed is an approach for multiple testing that controls an overall error rate, such as the family-wise error rate or the false discovery rate, to accommodate the large number of tests. Tibshirani outlined several methods, including Bonferroni and Benjamini-Hochberg.
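The two corrections named above are easy to sketch. This is a minimal illustration under my own assumptions: the p-values are made up, and the function names are mine. Bonferroni compares every p-value to alpha divided by the number of tests, while Benjamini-Hochberg is a step-up procedure that finds the largest rank k whose ordered p-value is at most (k/m) times alpha and rejects everything up to that rank.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject test i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff_rank = rank  # keep the largest rank that passes
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[i] = True
    return reject

# Hypothetical p-values from six tests.
pvals = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]

print("Bonferroni:        ", bonferroni(pvals))
print("Benjamini-Hochberg:", benjamini_hochberg(pvals))
```

On these six made-up p-values Bonferroni rejects two hypotheses and Benjamini-Hochberg rejects three, which reflects the general pattern: controlling the false discovery rate is less conservative than controlling the family-wise error rate.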
During the second half of lunch each day, there were informal product demonstrations by commercial data mining companies. The first presentation was on the mature product of the statistical industry leader; the second was by a company that specializes in commercializing statistical developments from Stanford and Berkeley faculty. Product one seemed a bit dated, with implementations of only a few of the algorithms discussed in class. Product two showcased a select subset of key procedures from a nice graphical user interface. Professor Hastie confided at a break that the open source R Project for Statistical Computing and commercial MATLAB are the lingua franca of Stanford statistical learners. Indeed, both Hastie and Tibshirani have made notable contributions to the R platform, and each of the procedures discussed in class had a corresponding R package available for download. For me, this endorsement makes R an easy (and inexpensive) choice for getting started with statistical learning computation.
As befits top professors from an elite university like Stanford, Hastie and Tibshirani are excellent teachers with both a cutting-edge command of material and outstanding class demeanor and communication skills. Overall, I would rate the seminar highly and give it a grade of AB. Several minor tweaks to the curriculum would easily bring the grade up to an A or A+. First, the tall data sets used in the examples need to be updated to include one in excess of 100,000 cases with a business, economics or other social science focus. That data set should have at least one classification and one numeric outcome measure for course illustrations. Second, a comprehensive case study should be added that focuses on an R implementation showing code to implement best practices with a prominent procedure like random forests or gradient boosting. Third, the evaluation of the different methods should be enhanced to include measures of prediction performance on a variety of meaningful data sets, as well as computer resource utilization and performance. The additional metrics could be critical for assessing the feasibility of deploying the resource-consuming ensemble procedures, for example. Finally, it would be nice to have a discussion on how the learning procedures could enhance traditional BI statistical methods - like propensity and panel models, which are used to evaluate business performance. With these easy-to-make changes in place, I might take the course again in a year or two and count on exposure to many new and vanguard methods.