I finally got to meet Frank Harrell. Over the last few years, Frank has been an outstanding mentor to me, though he doesn't know it and, before mid-August, wouldn't have known me from Adam.
Frank's comprehensive text, Regression Modeling Strategies, With Applications to Linear Models, Logistic Regression, and Survival Analysis, has been a close companion over the last year and a half, helping me catch up on the latest statistical developments for predictive analytics. The book is great because it gives equal weight to the problem domain, theory and math, and programming solutions in the S+/R statistical languages. I've also learned quite a bit from reading the excellent statistical and programming teaching documents on Frank's Web site at Vanderbilt University: http://biostat.mc.vanderbilt.edu/FrankHarrell.
And perhaps best of all, Hmisc, Frank's S+/R package of miscellaneous metadata, data manipulation, statistical, predictive modeling and graphical functions, is an absolute godsend of helpful goodies for R programmers. I was recently asked by a data warehousing colleague for an open source, quick-and-dirty, data profiling recommendation. Without hesitation, I offered Hmisc's describe function. The modified boxplot graph included in Hmisc is also a personal favorite, able to consolidate a good deal of exploratory information in a small space.
So when I saw the opportunity not only to participate in the international R user conference (useR!2007), but also to attend a one-day modeling course taught by Frank, I immediately made time in my schedule. Even then, as a late registrant, I had to connive to secure a spot in the class. Apparently many others felt as I did about the opportunity.
The preconference tutorial on regression modeling strategies didn't disappoint. Starting with a 200-page handout to cover in six hours, the class, composed of college teachers, researchers, analysts and graduate students, was challenged to keep pace. I'm sure most thought much of the day would be review. We were, however, exposed to a good deal of new material reflecting many of the latest approaches in regression, classification and prediction, covered at a rapid pace.
Frank's soft-spoken demeanor belies his command of the discipline. A biostatistician, he has seen just about everything in medical research, epidemiology and health care evaluation - including a lot he doesn't like. Many current "habits" in the trade are anathema to Frank. Stepwise regression is non grata, as is the pernicious practice of recoding continuous attributes into categories to perform logistic regression, discriminant analysis or data mining classification. Frank's beef is that information is always lost when migrating from continuous to categorical variables. I made the mistake of asking whether it was a lesser evil to use many categories rather than two. Frank's deadpan response: don't do it. He's also intolerant of lazy analysis that depicts strong relationships as linear only, offering from experience that they often involve significant curvature.
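The information-loss point is easy to see in a small simulation. What follows is my own minimal sketch, not anything from Frank's handout: it is written in Python with only the standard library, and the function names, sample size, slope and seed are all made up for illustration. It generates a continuous predictor, splits it at the median, and compares how strongly each version correlates with the outcome. For a normally distributed predictor, a median split should shrink the correlation to roughly 80 percent of its original value.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=5000, slope=0.6, seed=1):
    """Return (r_continuous, r_median_split) for simulated data.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, 1) for x in xs]
    cutoff = sorted(xs)[n // 2]                      # sample median
    split = [1.0 if x > cutoff else 0.0 for x in xs]  # "high" vs. "low"
    return pearson(xs, ys), pearson(split, ys)


if __name__ == "__main__":
    r_cont, r_split = simulate()
    print(f"continuous x: r = {r_cont:.3f}")
    print(f"median split: r = {r_split:.3f}")
```

The two-category version is the worst case, but the same exercise with more bins still shows a loss, which is presumably why Frank's answer to my question was so short.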
I could tell from discussions during breaks that many of us were guilty of at least a few of the statistical sins noted in class. At the same time, we all realized how invaluable a "retreat" with an expert like Frank can be. Perhaps BI analytics practitioners should be required to participate in an annual seminar with an authority to snuff out bad habits before they become too ingrained - and costly for business. In fact, I think most professions could profit from regular "rebalancings" with experts like Frank Harrell.
Statistics and ISU
Iowa State University in Ames, Iowa, might seem an unlikely location for an international conference. Ames is approximately 350 miles from Chicago, about 45 minutes from the Des Moines airport, in the heartland of America. But ISU has been a juggernaut in statistics seemingly forever, starting from its land-grant roots of agricultural research. Snedecor Hall is today the epicenter of a thriving program in theoretical, applied and computational statistics that is very pertinent for business intelligence. Indeed, some of the current ISU research on statistical graphics and visualization will hopefully find a home soon in the next iterations of BI tools - and I'd be a more-than-willing beta tester. ISU was ever hospitable as well. Professor Dianne Cook and the ISU staff enabled a quality eleventh-hour conference experience for the 250+ participants from around the world. I even felt a bit of a personal connection from the past as I strolled around campus, recalling three texts from my undergraduate studies many years ago: Calculus by Thomas, Economics by Samuelson, and Statistical Methods by Snedecor and Cochran - the same Snedecor as the statistics building.
Many of the presentations given in the two-day conference were relevant to BI; others not so much. And with multiple concurrent presentations, it was sometimes difficult to maneuver where desired, when desired. I was hopelessly lost in discussions of "mixed models using residual maximum likelihood" and the "analysis of soybean seed transcriptomics data." But I found the sessions on R GUIs I and II and the discussion of Web analytics in R quite informative.
I was in my element at the two presentations on social science and statistics. The theme of the combined session was the use of advanced statistical designs and techniques to help assure the validity of observational study findings when the gold standard of randomization isn't feasible - which is often the case in the business world. The questions posed were quite relevant for BI and have been addressed in other OpenBI Forum columns (http://www.dmreview.com/authors/author_sub.cfm?authorId=1052295). The approach outlined in these presentations involves matching subjects across potential confounding variables so that comparison groups are "equal" on factors other than the treatment - and can thus be compared validly. Each talk addressed the optimization of some aspect of the matching problem. The first speaker presented an approach and software for matched sampling and pair matching. The second discussed problems of matching algorithms in very large samples, emphasizing efficient use of computer code and machine resources.
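To make the matching idea concrete, here is a minimal sketch of greedy one-to-one nearest-neighbor matching on a single covariate. It is not the speakers' method or software, which I can't reproduce here; it is Python using only built-ins, and the function name, the optional caliper argument and the sample data are all hypothetical. Real matching software handles many covariates at once and optimizes the pairing globally rather than greedily.

```python
def greedy_pair_match(treated, controls, caliper=None):
    """Pair each treated subject with the closest unused control.

    treated, controls: dicts mapping subject id -> covariate value.
    caliper: optional maximum allowed covariate distance for a pair.
    Returns a list of (treated_id, control_id) tuples.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs = []
    # Work through treated subjects in covariate order.
    for t_id, t_val in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_val))
        if caliper is not None and abs(available[c_id] - t_val) > caliper:
            continue  # no acceptable control; leave this subject unmatched
        pairs.append((t_id, c_id))
        del available[c_id]  # each control is used at most once
    return pairs


if __name__ == "__main__":
    # Illustrative data: covariate is age.
    treated = {"t1": 30, "t2": 50}
    controls = {"c1": 29, "c2": 52, "c3": 80}
    print(greedy_pair_match(treated, controls))
```

Greedy matching is quick to write but can paint itself into a corner: an early pairing may use up a control that a later subject needed more. That is exactly the kind of problem the optimization work in these talks addresses.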
Though I didn't attend the "Teaching with R" sessions, I was able to get my hands on the materials for review. Both undergraduate and master's-level mathematical statistics courses at ISU now use R significantly for instruction, complementing arduous mathematical proofs and derivations with the simulation capabilities of R. Students might investigate the distributions of transformed random variables, for example, by examining samples of size 10,000. Or simulations might be used to examine the robustness of assumptions in statistical inference. R's powerful graphics can be paired with sampling techniques to visualize the distribution of order statistics. And, of course, the bootstrap method of examining the precision of parameter estimates through resampling is now de rigueur in the discipline. I love to play with large sample simulations in R on my notebook, watching practice converge with theory. The ascendance of simulation methods is, in my opinion, a major boon for learning and understanding statistical thinking. What I would have given for such computer support back in the day!
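For readers who haven't tried it, this is roughly what "watching practice converge with theory" looks like. The sketch below is mine, not taken from the ISU course materials, and it is in Python with the standard library rather than R; the function names, sample size, number of resamples and seeds are illustrative assumptions. It bootstraps the standard error of a sample mean and compares it with the textbook formula, the sample standard deviation divided by the square root of n.

```python
import random
import statistics


def bootstrap_se(sample, stat=statistics.mean, n_boot=2000, seed=0):
    """Estimate the standard error of `stat` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(n_boot):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(stat(resample))
    # The spread of the replicated statistics approximates the standard error.
    return statistics.stdev(replicates)


def demo():
    """Return (bootstrap SE, formula SE) for a simulated normal sample."""
    rng = random.Random(42)
    sample = [rng.gauss(0.0, 1.0) for _ in range(400)]
    theory = statistics.stdev(sample) / len(sample) ** 0.5
    return bootstrap_se(sample), theory


if __name__ == "__main__":
    boot, theory = demo()
    print(f"bootstrap SE: {boot:.4f}")
    print(f"formula SE:   {theory:.4f}")
```

The payoff comes when you swap `statistics.mean` for `statistics.median` or some other statistic with no convenient formula: the resampling code doesn't change at all.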
John Chambers, the architect of the S language, R's predecessor, was the keynote speaker at the end of the first conference day. In 1998, Chambers won a prestigious "Software System Award" from the Association for Computing Machinery (ACM) for developing the S system. The ACM observed that Dr. Chambers' work "will forever alter the way people analyze, visualize, and manipulate data." Chambers' talk, "Programming with R," noted two tenets in the development of S: rapid and effective exploration and trustworthy software. He credited the open source movement with improving software quality through its many eyes. Ever the programmer and architect, Chambers then highlighted several important design considerations for S/R, closing with code. An appreciative audience understood the R community's debt to Chambers' seminal work.
Different perspectives on graphics and visualization in R created excitement the second morning of the conference. The author of R's outstanding lattice graphics gave a programming tips talk, and the developer of a new high-level, easy-to-deploy package called ggplot demoed his wares with a comprehensive statistical example. At the conclusion of these two very useful presentations, the moderator created a stir by opining that such static, programmed graphics, while certainly valuable for analysis, were not in keeping with Chambers' mission for R to "enable effective and rapid exploration of data." He then showed yet a third R visual package, iplots, which is interactive and live. A spirited discussion of almost religious proportions followed on the relative benefits of each. My take? Both interactive and programmed graphics have significant roles in statistics and BI. Keep them coming!
There were many sideline discussions of how to handle large data sets in R. In the current implementation, the size of a data structure is limited by physical memory. Users circumvent this limitation in a number of ways. As might be expected, statisticians use sampling techniques to reduce the amount of data needed for their models. Crafty programmers "chunk" large data into smaller subsets that can be accessed and then discarded. Database-savvy users store their data in relational tables, feeding R as needed. One presentation noted a seamless use of a database to serve data to R for large-scale surveys. I like to use the "pipe" function to filter data with agile languages like Python or Ruby before loading R data frames. A programming competition winner developed a package to handle data larger than memory with binary flat files. And R's commercial cousin S-Plus from Insightful has developed a virtual memory large data capability. For the sake of R's future in the business world, I'm hopeful that the large data concern will soon become a priority for the core R development team.
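The "chunking" workaround is simple enough to sketch. The example below is my own illustration in Python with the standard library, not code from any of the presentations or packages mentioned; the function name, column name and chunk size are assumptions. It reads CSV rows a chunk at a time, folds each chunk into running totals, and then lets the chunk go, so only one chunk is ever held in memory.

```python
import csv
import io
from itertools import islice


def chunked_mean(lines, column, chunk_size=1000):
    """Mean of a numeric CSV column, computed one chunk at a time.

    `lines` is any iterable of text lines (an open file, a pipe, etc.)
    whose first line is a header row.
    """
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    while True:
        chunk = list(islice(reader, chunk_size))  # next chunk_size rows
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
        # The chunk is replaced on the next pass, so memory use stays flat.
    return total / count if count else float("nan")


def demo():
    """Run the chunked mean over 10,000 in-memory rows standing in for a big file."""
    text = "x\n" + "\n".join(str(i) for i in range(1, 10001))
    return chunked_mean(io.StringIO(text), "x", chunk_size=256)


if __name__ == "__main__":
    print(demo())
```

A mean is the easy case because it reduces to two running sums. Fitting a model chunk by chunk takes more care, which is part of why sampling and database-backed approaches remain popular.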
My connection to R and the R community is well-chronicled in http://www.dmreview.com/article_sub.cfm?articleId=1065015 and http://www.dmreview.com/article_sub.cfm?articleId=1084643, but I must admit to being continually amazed at the level of contributions of such a high-octane open source community. As a result of its volunteer efforts, the latest in statistical methods from top practitioners are readily and freely available months and even years before the commercial competition. Almost all presenters at useR!2007 demoed new packages they'd developed to showcase their methods and analytics. And the worldwide R community continues to expand dramatically. Little wonder that R is now the lingua franca of academic statistical computing. Little wonder as well that useR!2008 (http://www.statistik.uni-dortmund.de/useR-2008/) in Dortmund, Germany, promises to be the largest and best yet.
- Frank E. Harrell, Jr. Regression Modeling Strategies, With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer Series in Statistics. 2001.