© 2019 SourceMedia. All rights reserved.

Books To Love: 'Regression Modeling Strategies, Second Edition'

One of the perks of being an Information Management blogger is that I'm often invited to review a book or attend a seminar/conference gratis, the quid pro quo being that I write an article on my reactions. Ok, truth be told, I'm able to ask for a gratis conference or book in exchange for a blog as a consequence of my IM affiliation.

I was recently the beneficiary of a copy of Regression Modeling Strategies, Second Edition, a holiday gift from author Frank Harrell, professor and head of Biostatistics at Vanderbilt University. I'd written flatteringly of the first edition of RMS seven years ago and had posted an interview with Frank in the same time period. I'd anticipated V2 since a correspondence with Frank last summer.

RMS 2 was well worth the wait. A heavy, 500+ page book, RMS 2 succeeds in its mission to provide a state-of-the-art, comprehensive, applied treatment of linear models. For Harrell, “strategies” have to do with the balance between techniques of theoretical regression modeling and approaches to practical problem solving. Among the latter are “methods for relaxing linearity assumptions”, “non-additive modeling approaches”, “methods for imputing missing data”, “methods for handling large numbers of predictors”, “data reduction methods”, “powerful model validation techniques”, and “graphical methods for understanding complex models.” You won't find treatment of this material in most introductory regression methods texts.

RMS is a hands-on book, suitable for those with intermediate backgrounds in statistical methods and the R statistical platform. Though there's not a ton of math in the text, the tone is nonetheless serious. Those looking for quick, simple answers might not be happy. On the other hand, if you're searching for a rigorous treatment of regression in the applied setting, this book might be for you.

RMS is primarily concerned with supervised learning, focusing on both classification and regression problems. It addresses all permutations of linear models, including simple and multiple regression, penalized regression, longitudinal analysis, binary logistic regression, ordinal logistic regression, and survival models, including parametric and Cox proportional hazards models. There are answers here for just about any regression challenge.

A no-nonsense modeling practitioner, Harrell is less credulous than many machine learning advocates, and is particularly adept at applying ML techniques within a statistical framework. RMS's short but cogent chapter on model validation, especially involving the bootstrap (“The bootstrap is a breakthrough for statistical modeling, and the analyst should use it for many steps of the modeling strategy.”), is outstanding. And its treatment of missing data imputation, starting with the premise that “Imputation of missing data is better than discarding incomplete observations,” is both theoretically and practically satisfying.
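For readers who haven't seen bootstrap validation in action, here is a minimal sketch of the general idea: refit the model on resampled data, measure how much the fit flatters itself, and subtract that "optimism" from the apparent fit. This is my own Python illustration on synthetic data, not code from the book and not the rms package's API; the function names, the ordinary least squares model, and the choice of R-squared as the accuracy measure are all assumptions made for the example.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y ~ X (X already includes an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def r_squared(X, y, beta):
    """R^2 of the predictions X @ beta, evaluated against y."""
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def optimism_corrected_r2(X, y, n_boot=200, seed=0):
    """Return (apparent R^2, optimism-corrected R^2) using the bootstrap.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = r_squared(X, y, fit_ols(X, y))

    optimism = 0.0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # resample rows with replacement
        beta_b = fit_ols(X[idx], y[idx])           # refit on the bootstrap sample
        fit_on_boot = r_squared(X[idx], y[idx], beta_b)
        fit_on_orig = r_squared(X, y, beta_b)      # same model, original data
        optimism += fit_on_boot - fit_on_orig
    optimism /= n_boot

    return apparent, apparent - optimism


# Synthetic data (an assumption for the demo): one real predictor, nine noise predictors.
rng = np.random.default_rng(42)
n, p = 60, 10
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n)

apparent, corrected = optimism_corrected_r2(X, y)
print(f"apparent R^2:  {apparent:.3f}")
print(f"corrected R^2: {corrected:.3f}")
```

With a small sample and several noise predictors, the corrected figure should come in below the apparent one, which is the overfitting that this kind of validation is meant to expose.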

Frank Harrell is an elder in the R statistical community (Vanderbilt, under his leadership, sponsored the 8th International R User Conference in 2012) and uses R for all analyses in RMS, sharing a bounty of functions and modeling code he developed. The R package rms “is a series of over 200 functions for model fitting, testing, estimation, validation, prediction, and typesetting.” Learning how to use the rms package is reason enough to purchase RMS 2. All of the ample book code is available on the RMS website.

I'd suggest a three-pronged approach to mastering RMS 2's material. First, purchase the text and work your way through the material at a conceptual level. Don't fret over all the details. Translate Frank's biostats/epidemiology examples to your business problems. Second, download RMS 2's code/data and get it working in your R environment. Progress to understanding the examples in the book, then apply as many of RMS's techniques to your own regression challenges/data as is practical. Third, take my word that Frank's an outstanding instructor and sign up for his annual five-day course that covers the RMS material. The next class is offered in May at Vanderbilt. Great time to be in Nashville.

I highly recommend having RMS 2 as both an applied teaching text and a staple reference in a data science library. The combination of rigorous statistical methods implemented with open source software and readily accessible code is indeed compelling. Top-notch theory demonstrated with the latest methods, using practical, state-of-the-art, freely available software from an exponentially growing, world-class statistical ecosystem. What's not to like?
