I hadn’t corresponded with Frank Harrell in about six months, but had to ping him after his pithy forum response to the article on R in the NY Times. Begrudging the meteoric rise in open source R’s popularity, a VP from proprietary statistical software market leader SAS noted: “I think it addresses a niche market for high-end data analysts that want free, readily available code … We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.” To which Frank deadpanned: “It’s interesting that SAS Institute feels that non-peer-reviewed software with hidden implementations of analytic methods that cannot be reproduced by others should be trusted when building aircraft engines.” Touché.

I’d written about Frank in a previous Information Management article, after meeting him in person and taking his Regression Modeling Strategies short course at useR! 2007 in Ames, Iowa. Even before that conference, I felt I knew him pretty well. I use his Hmisc and Design R packages all the time, and regularly learn from his informative wiki. And Frank is one of perhaps a dozen or so esteemed R forum participants I religiously follow, regardless of topic. Frank and I are about the same age, but he’s the teacher and I’m the student.

In addition to his status as R elder, Frank has “side” jobs as professor of biostatistics and department chair at Vanderbilt University, having previously served on the faculties of both Duke and Virginia after earning his doctorate at North Carolina. His research revolves around the use of various adaptations of multivariable predictive modeling and attempts to weave rigorous biostatistical thinking into the fabric of biomedical research. Preventing bad research has also been a common thread throughout his career. A perusal of Frank’s vita affirms his status as a leading academic biostatistician.

Frank’s biostatistical wisdom is quite suitable for BI. Evidence-based management derives from evidence-based medicine and its obsession with evaluating the merits of study designs using the evidence hierarchy. He’s a stickler for designs that reduce the bias of analyses so that researchers can confidently conclude not just that A is correlated with B but that A caused B. Frank’s proddings and analytic conservatism push students and researchers to do right statistically. Indeed, I think all modelers in business would be well served by a periodic retreat on predictive modeling strategies with Frank Harrell.

Frank is also a leader in statistical computing. His Hmisc and Design package contributions to the R project provide analysts and statisticians with a wealth of statistical goodies for programming analytics with R. Frank’s roots in computation are very deep, starting with contributions as a young student in the late 1960s to the SAS software platform before its public release. His evolution in statistical computing seems a metaphor for the open source movement that’s gaining momentum now.

My correspondence with Frank provided the opportunity to ask him to do an interview for the OpenBI Forum. He graciously accepted, turning around deft responses to my sometimes ponderous questions in very short order. What follows is the text of our question and answer session. I trust that readers will learn as much from Frank’s responses as I did.

 

1. Much of your work focuses on statistical analysis in health care/medical research and the field of epidemiology, which has given us the “evidence hierarchy” of designs for evaluating research. How important is a solid design for “proving” the efficacy of interventions?

It is extremely important, because any non-experimental approach to assessing the efficacy of interventions has to involve getting much more “right” in terms of specifying models. The freedom of not worrying about unmeasured variables in randomized clinical trials can never be forgotten.

2. David Sackett has defined evidence-based medicine (EBM) as "the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients ... integrating individual clinical expertise with the best available clinical evidence from systematic research." How would you characterize the current state of EBM?

Much of my work has been indirectly involved with EBM. The current state of EBM is not something we can take a lot of pride in. First, the number of medical, surgical, herbal, and alternative treatments for which true evidence is even sought is frighteningly low. Second, some of EBM is not itself evidence-based. Much EBM involves the use of crude non-patient-specific data in meta-analysis or it involves unwarranted extrapolations. Some national figures such as estimates of the number of unnecessary deaths in hospitals were obtained by studies that were not designed as well as they should have been. In the future we will continue to see EBM progress, but until incentives and regulations are changed, many therapies will not be adequately studied. Let me also add that in many cases an individual excellently designed database can lead to multivariable analysis that provides better answers than a meta-analysis of 20 studies each contributing only crude marginal summaries.

3. In his well-received book Super Crunchers, Yale economist Ian Ayres notes the predictive superiority of analytics over experts in many disciplines, observing that “Unlike self-involved experts, statistical regressions don't have egos or feelings.” Your experience and thoughts pertaining to experts versus analytics in health care?

This issue has been pretty well settled in the medical decision-making and cognitive psychology literatures, which support Ayres. Multivariable models can make optimum use of continuous variables and can handle many more variables than can a human.

4. BI can be defined as the use of data, technology, methods, and analytics to measure and improve the performance of business processes. More and more companies are using experimental methods with randomization as a foundation for BI initiatives to both test their strategies and learn from findings. Could you comment on the benefits of randomized testing and other sophisticated designs as aids to learning?

I recently learned of this trend and am gratified. The benefits of well-designed observational and experimental research in business will be the same as in medicine: better reliability and generalizability of results. Without careful design, bias can ruin any analysis.

5. As a statistician planning health care studies, you're very concerned with adequate sample size. In business, predictive modelers often enjoy the luxury of hundreds of thousands or even millions of cases to work from – but often without random assignment to treatment groups. How does sample richness change analyses for business? What cautions would you offer to business modelers who have millions of records to analyze?

The volume of data greatly reduces the problems of non-reproducibility due to over-fitting that we see in other fields. Unfortunately, biases are constant as the sample size increases, so a large database does nothing to reduce bias, other than providing the sample size that allows more confounder variables to be adjusted for.
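To make that point concrete, here is a toy simulation in R (my own sketch, not Frank's, with made-up variables): when a confounder is left out of the model, the naive estimate of a treatment effect is just as biased at a million records as at a thousand.

    # Toy illustration: confounding bias does not shrink as the sample size grows
    sim.bias <- function(n) {
      z     <- rnorm(n)                      # unmeasured confounder
      treat <- rbinom(n, 1, plogis(z))       # treatment more likely when z is high
      y     <- 1.0 * z + rnorm(n)            # true treatment effect is zero
      coef(lm(y ~ treat))["treat"]           # naive estimate with z omitted
    }
    set.seed(1)
    sapply(c(1e3, 1e5, 1e6), sim.bias)       # estimates stay near the same biased value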

6. Statistician Rudolf Beran has offered a modern definition of statistics as “the study of algorithms for data analysis”. He also notes that the study of statistics has changed over time, with now three different competing interests: 1) the use of probability models (and randomization) to analyze behavior; 2) computationally effective statistical algorithms without concern for a probability model; and 3) data analyses, generally without randomization. To my probably naive thinking, these sound suspiciously like 1) the traditional probability and statistics we learned in school; 2) machine learning for knowledge discovery; and 3) exploratory data analysis. (Historically, BI’s focus has centered on 3), but now 1) and 2) are much more in play and growing.) Your thoughts?

First of all, Beran has omitted several important areas of statistics, including experimental design, refinement of measurements, detective work regarding bias, and incorporating all sources of variation and uncertainty into an analysis. Aside from that, I think you are right. Probability models can be de-emphasized in many circumstances, although the best analyses are done using full Bayesian models which often involve probability models.

7. The use of statistical learning techniques for data mining has exploded in business. The automation of many of these techniques, however, seems to run counter to the deliberate approaches of traditional statisticians. What would you say to the dutiful marketing predictive modeler who previously used statistical techniques like multiple and logistic regression, but now has turned to random forests and gradient boosting to meet most of his predictive needs? He’s a very conscientious analyst, carefully cross-validating his ample training/tuning/testing data sets and conservatively interpreting his bootstrapped results.

I would differentiate fields to some extent by whether they are trying to develop understanding of phenomena or whether they are trying to predict outcomes. For the latter, the newer methods you mentioned can result in predictive accuracy that equals that of more time-consuming, comprehensive modeling approaches. But the new techniques don't always yield very interpretable models or formal inferences.
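To illustrate the trade-off Frank describes, here is a minimal sketch in R on simulated data (the data set, variable names, and model choices are mine, for illustration only), fitting both a spline-based logistic model from his Design package and a random forest from the randomForest package to the same outcome.

    library(Design)            # lrm(), rcs(), datadist(); successor package is rms
    library(randomForest)

    set.seed(2)
    n     <- 2000
    age   <- rnorm(n, 50, 10)
    spend <- rexp(n, 1/300)
    p     <- plogis(-1 + 0.05*(age - 50) - 0.002*(age - 50)^2 + 0.002*spend)  # nonlinear in age
    buy   <- rbinom(n, 1, p)
    d     <- data.frame(buy, age, spend)
    dd    <- datadist(d); options(datadist = "dd")

    # Interpretable model: spline terms, odds ratios, confidence intervals, formal tests
    fit.lrm <- lrm(buy ~ rcs(age, 4) + rcs(spend, 4), data = d)
    summary(fit.lrm)
    anova(fit.lrm)

    # Black-box model: often comparable predictive accuracy, less interpretable
    fit.rf <- randomForest(factor(buy) ~ age + spend, data = d)
    fit.rf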

8. In a provocative June 2008 Wired magazine article, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete,” author Chris Anderson argues that faced with massive data, the “hypothesize, model, test” approach to determining causation in science has become obsolete. The thinking is that with petabytes of information, correlation is enough. Your thoughts?

That is completely bogus in my view. Look at the small yield of massive genomics and proteomics experiments that relied on complex empirical analysis without guiding biological theory. Look even at the small yield of the massive data mining taking place in the national intelligence community (e.g., by massive screening of unselected phone call and e-mail data with dubious support from the Constitution). My bet is that “old style” human intelligence is far from obsolete in national intelligence.

9. You are a hands-on statistician with a focus on statistical computing and graphics. Could you tell us of the evolution of statistical computing platforms that you've used for your work? Have the platforms themselves helped shape your approaches? Could you briefly contrast SAS, S-Plus and R? What are the benefits to open source statistical software?

Statistical computing platforms have had a major impact on my work and my thinking. I started with SAS, being one of the first SAS users outside the core development team in the late 1960s. I developed many SAS procedures using a low-level, tedious to program, language. I fell into the SAS regression syntax trap that makes an analyst likely to assume linearity for continuous variables. I was also putting up with low-resolution non-informative statistical graphics. After a visit to Terry Therneau at Mayo Clinic in 1991 and seeing a demonstration of S, I immediately became an S-Plus user. I developed software to make the estimation of predictor transformations a standard part of regression modeling, through the use of regression splines. I started using Bill Cleveland's philosophy of statistical graphics. When the S-Plus developers made decisions that made the system less reliable and their decisions cost me a couple of hundred hours of reprogramming of working modules, I started paying attention to R. The open source model of R has had tremendous benefits, first among them being the incredible pace of additions and improvements to R by a huge community of statisticians and other quantitative folk.
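As a small concrete example of the spline approach (a sketch with hypothetical data and variable names, using the Design package, whose successor is rms): the shape of a predictor's effect is estimated from the data rather than assumed linear, and the fitted transformation can be plotted directly.

    library(Design)                        # ols(), rcs(), datadist(), plot method for fits

    set.seed(3)
    age  <- runif(500, 20, 80)
    cost <- 1000 + 15 * pmax(age - 55, 0) + rnorm(500, sd = 100)   # effect begins near age 55
    d    <- data.frame(age, cost)
    dd   <- datadist(d); options(datadist = "dd")

    fit.linear <- ols(cost ~ age, data = d)           # the linearity trap
    fit.spline <- ols(cost ~ rcs(age, 4), data = d)   # transformation estimated from the data

    anova(fit.spline)            # includes a test of nonlinearity in the age effect
    plot(fit.spline, age = NA)   # partial-effect plot of the estimated transformation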

10. You are one of the esteemed “members” of the R open source community. Your responses and opinions are held in the highest regard in the forums, and your Hmisc and Design packages offer a wealth of statistical and reporting goodies for the community. R was recently the subject of a gratifying article in the NY Times. Can you comment on the impact of R and the open source model on the statistical world over the last five years and going forward?

The impact is so large that one hardly knows where to start. The explosion of new predictive modeling procedures, Bayesian modeling, robust regression, model validation, missing data imputation, and new graphics models has affected every area of quantitative research where the researcher is not wedded to the statistical package they first learned. The use of R in primary statistical and machine learning research has accelerated research greatly. Researchers can prototype new methods, test them quickly (including inside a simulation loop), and then spend more time on the methods that work.
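As a toy illustration of the prototyping style described here (entirely my own example), a few lines of R suffice to study a method inside a simulation loop, here checking the coverage of the nominal 95% t-interval when the data are skewed.

    set.seed(4)
    true.mean <- 1                          # mean of an Exponential(1) distribution
    covered <- replicate(5000, {
      x  <- rexp(30)                        # small, skewed sample
      ci <- t.test(x)$conf.int
      ci[1] <= true.mean && true.mean <= ci[2]
    })
    mean(covered)                           # empirical coverage, a bit below the nominal 0.95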

11. Your Regression Modeling Strategies book is must reading for serious predictive modeling practitioners, and your short course of the same name should be required continuing education. In the course, you cite many examples of flawed analyses in the health/medical sciences. You also “scold” for various worst practices like categorizing a continuous variable, presuming linearity, and indiscriminate use of stepwise regression. Do you think the prevalence of bad statistical analysis is growing? If so, is it related to the ease of using (and abusing) predictive modeling software? What can/should be done?

Thanks for the kind words. The prevalence is growing, because of the increase in the number of non-statisticians doing statistical analysis. The biggest enemies of quality analytic practice are confirmation bias (finding analyses that support personal biases or just advance careers) and the fact that few analysts understand the subtlety that what we learn from data is not “real” in the sense that it may easily be noise or may be explained by a bias we don't understand. When an analyst makes a “finding” from extensive data analysis, she is unlikely to remember the adage “take everything you read in the newspaper with a grain of salt.” She doesn't see the parallel between unreliable, selective, or biased reporting and over-analysis or over-interpretation of data. Returning again to national intelligence, we know that torture does not even result in the desired outcome. In data analysis, the data can be tortured until they tell us what we want to hear, but this confession is not what is true. In the end, analysts need to test all of the strategies they use to find out what really works, by stringently validating their discoveries and predictions to see if the discoveries replicate and to see if the predictive discrimination is as strong as indicated in the data mining step. Then the researcher can do the even harder work of determining whether the predictions and observed relationships mean anything or are merely the result of biases in the data.
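One concrete way to follow that advice with Frank's own software (a sketch on made-up data; the variable names and settings are mine): the validate() function in the Design package uses the bootstrap to estimate how much of a model's apparent discrimination is optimism.

    library(Design)                         # lrm(), validate(); successor package is rms

    set.seed(5)
    n <- 300
    x <- replicate(15, rnorm(n))            # many candidate predictors, modest sample size
    y <- rbinom(n, 1, plogis(0.5 * x[, 1]))
    d <- data.frame(y = y, x)

    fit <- lrm(y ~ ., data = d, x = TRUE, y = TRUE)   # keep design matrix for resampling
    validate(fit, B = 200)                  # optimism-corrected Dxy, R2, calibration slope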