Frank's comprehensive text, Regression Modeling Strategies, With Applications to Linear Models, Logistic Regression, and Survival Analysis, has been a close companion over the last year and a half, helping me update to the very latest statistical developments for predictive analytics. The book is great because it gives equal weight to the problem domain, theory and math, and programming solutions in the S+/R statistical languages. I've also learned quite a bit from reading the excellent statistical and programming teaching documents on Frank's Web site at Vanderbilt University: http://biostat.mc.vanderbilt.edu/FrankHarrell.
And perhaps best of all, Hmisc, Frank's S+/R package of miscellaneous metadata, data manipulation, statistical, predictive modeling and graphical functions, is an absolute godsend of helpful goodies for R programmers. I was recently asked by a data warehousing colleague for an open source, quick-and-dirty, data profiling recommendation. Without hesitation, I offered Hmisc's describe function. The modified boxplot graph included in Hmisc is also a personal favorite, able to consolidate a good deal of exploratory information in a small space.
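For readers who have not seen this kind of quick-and-dirty data profiling, here is a minimal sketch of the idea in plain Python. It is not Hmisc's describe function and makes no attempt to reproduce its output; the function name `profile_column` and the sample `ages` data are made up for illustration.

```python
import statistics

def profile_column(values):
    """Profile one column: present, missing and distinct counts,
    plus basic location statistics when the data are numeric."""
    present = [v for v in values if v is not None]
    summary = {
        "n": len(present),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Only numeric columns get min/median/mean/max.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["min"] = min(present)
        summary["median"] = statistics.median(present)
        summary["mean"] = statistics.mean(present)
        summary["max"] = max(present)
    return summary

# Hypothetical column with two missing entries.
ages = [23, 35, None, 35, 41, 58, None, 29]
profile = profile_column(ages)
for key, value in profile.items():
    print(f"{key:>8}: {value}")
```

Even a summary this small answers the first questions a data warehousing team tends to ask of a new column: how much is missing, how many distinct values there are, and whether the range looks plausible.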
So when I saw the opportunity not only to participate in the international R user conference (useR!2007), but also to attend a one-day modeling course taught by Frank, I immediately made time in my schedule. Even then, as a late registrant, I had to connive to secure a spot in the class. Apparently many others felt as I did about the opportunity.
The preconference tutorial on regression modeling strategies didn't disappoint. Starting with a 200-page handout to cover in six hours, the class, composed of college teachers, researchers, analysts and graduate students, was challenged to keep pace. I'm sure most thought much of the day would be review. We were, however, exposed to a good deal of new material reflecting many of the latest approaches in regression, classification and prediction, covered at a rapid pace.
Frank's soft-spoken demeanor belies his command of the discipline. A biostatistician, he has seen just about everything in medical research, epidemiology and health care evaluation - including a lot he doesn't like. Many current "habits" in the trade are anathema to Frank. Stepwise regression is non grata, as is the pernicious practice of recoding continuous attributes into categories to perform logistic regression, discriminant analysis or data mining classification. Frank's beef is that information is always lost in moving from continuous to categorical variables. I made the mistake of asking whether it was a lesser evil to use many categories rather than two. Frank's deadpan response: don't do it. He's also intolerant of lazy analysis that depicts strong relationships as linear only, offering from experience that they often involve significant curvature.
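The information-loss argument is easy to see in a small simulation. The sketch below is my own illustration, not material from the course: it assumes a simple made-up model (a standard normal predictor plus standard normal noise), then compares the share of variance explained by the continuous predictor with the share explained after cutting it into two and into four equal-count categories. All names and the seed are arbitrary.

```python
import random

def r2_linear(x, y):
    """R^2 of a straight-line fit of y on x (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def r2_groups(labels, y):
    """R^2 when each y is predicted by its group mean (one-way ANOVA)."""
    n = len(y)
    my = sum(y) / n
    total = sum((b - my) ** 2 for b in y)
    sums, counts = {}, {}
    for g, b in zip(labels, y):
        sums[g] = sums.get(g, 0.0) + b
        counts[g] = counts.get(g, 0) + 1
    between = sum(counts[g] * (sums[g] / counts[g] - my) ** 2 for g in sums)
    return between / total

def cut_equal_count(x, k):
    """Label each value 0..k-1 by rank so the k bins hold about equal counts."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    labels = [0] * len(x)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(x)
    return labels

# Assumed data-generating model: y = x + noise, both standard normal.
rng = random.Random(2007)
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [a + rng.gauss(0, 1) for a in x]

r2_cont = r2_linear(x, y)
r2_two = r2_groups(cut_equal_count(x, 2), y)
r2_four = r2_groups(cut_equal_count(x, 4), y)

print(f"continuous predictor: R^2 = {r2_cont:.3f}")
print(f"four categories:      R^2 = {r2_four:.3f}")
print(f"two categories:       R^2 = {r2_two:.3f}")
```

Under these assumptions, four categories recover more of the signal than two but still fall short of the continuous predictor, which fits Frank's answer to my question: more categories lose less, yet something is lost either way.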
I could tell from discussions during breaks that many of us were guilty of at least a few of the statistical sins noted in class. At the same time, we all realized how invaluable a "retreat" with an expert like Frank can be. Perhaps BI analytics practitioners should be required to participate in an annual seminar with an authority to snuff out bad habits before they become too ingrained - and costly for business. In fact, I think most professions could profit from regular "rebalancings" with experts like Frank Harrell.
Statistics and ISU
Iowa State University in Ames, Iowa, might seem an unlikely location for an international conference. Ames is approximately 350 miles from Chicago, about 45 minutes from the Des Moines airport, in the heartland of America. But ISU has been a juggernaut in statistics seemingly forever, starting from its land-grant roots of agricultural research. Snedecor Hall is today the epicenter of a thriving program in theoretical, applied and computational statistics that is very pertinent for business intelligence. Indeed, some of the current ISU research on statistical graphics and visualization will hopefully find a home soon in the next iterations of BI tools - and I'd be a more-than-willing beta tester. ISU was ever hospitable as well. Professor Dianne Cook and the ISU staff enabled a quality eleventh-hour conference experience for the 250+ participants from around the world. I even felt a bit of a personal connection from the past as I strolled around campus, recalling three texts from my undergraduate studies many years ago: Calculus by Thomas, Economics by Samuelson, and Statistical Methods by Snedecor and Cochran - the same Snedecor as the statistics building.
Many of the presentations given in the two-day conference were relevant to BI; others not so much. And with multiple concurrent presentations, it was sometimes difficult to maneuver where desired, when desired. I was hopelessly lost in discussions of "mixed models using residual maximum likelihood" and the "analysis of soybean seed transcriptomics data." But I found the sessions on R GUIs I and II and the discussion of Web analytics in R quite informative.
I was in my element at the two presentations on social science and statistics. The theme of the combined session was the use of advanced statistical designs and techniques to help assure the validity of observational study findings when the gold standard of randomization isn't feasible - which is often the case in the business world. The questions posed were quite relevant for BI and have been addressed in other OpenBI Forum columns (http://www.dmreview.com/authors/author_sub.cfm?authorId=1052295). The approach outlined in these presentations involves matching subjects across potential confounding variables so that comparison groups are "equal" on factors other than the treatment - and can thus be compared validly. Each talk addressed the optimization of some aspect of the matching problem. The first speaker presented an approach and software for matched sampling and pair matching. The second discussed problems of matching algorithms in very large samples, emphasizing efficient use of computer code and machine resources.
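To make the matching idea concrete, here is a minimal sketch of one simple variant: greedy one-to-one nearest-neighbor matching on a single covariate, with a caliper that rejects pairs too far apart. This is not the software or the algorithms presented in the talks, which I won't attempt to reproduce; the function name, the caliper value and the sample ages are all hypothetical.

```python
def greedy_pair_match(treated, controls, caliper):
    """Pair each treated subject with the closest unused control.

    treated, controls: dicts mapping subject id -> covariate value.
    caliper: largest covariate gap allowed within a pair.
    Returns a list of (treated_id, control_id) tuples.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs = []
    # Process treated subjects in covariate order for reproducibility.
    for t_id, t_val in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_val))
        if abs(available[c_id] - t_val) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is matched at most once
    return pairs

# Hypothetical subjects, matched on age.
treated = {"t1": 34, "t2": 51, "t3": 70}
controls = {"c1": 30, "c2": 36, "c3": 50, "c4": 58, "c5": 90}

pairs = greedy_pair_match(treated, controls, caliper=5)
print(pairs)  # [('t1', 'c2'), ('t2', 'c3')]; t3 has no control within 5 years
```

A greedy pass like this is easy to follow but can leave good pairings on the table, since an early match may use up a control that a later subject needed more. It also compares every treated subject with every remaining control, which hints at why the second talk stressed efficient code for very large samples.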