I’m getting ready to start another predictive modeling effort and decided to turn to several trusted stats books for a quick review. My three favorites are Maindonald and Braun’s Data Analysis and Graphics Using R; The Elements of Statistical Learning (ESL), by Hastie, Tibshirani and Friedman; and Frank Harrell’s Regression Modeling Strategies (RMS). Together, the books provide a nice balance of theory and practice, and of statistical inference and statistical learning.

I didn’t even get past the Preface to RMS before I started taking notes on important considerations for planning my new prediction studies. Indeed, I found the emphases spot on, even though I’m not certain whether I’ll use the regression models that Frank espouses or the statistical learning models of ESL.

The following are nuggets of wisdom from RMS for planning/executing modeling studies, along with a statistical blogger’s commentary:

  1. The cost of data collection outweighs the cost of data analysis. This means it’s critical to maximize the value of data in hand and to analyze it judiciously. It also underscores the oft-heard warning from Predictive Analytics World that quality data is perhaps the leading critical risk/success factor for predictive analytics projects.
  2. Prudent handling of missing data is critical. Simple deletion of cases for which there are missing attributes can lead to prediction coefficients that are either terribly biased or grossly inefficient. There are well-developed methodologies and statistical procedures for “imputing” missing values that should be a part of the analyst’s arsenal.
  3. Mean square error, which equals variance plus squared bias, is generally a good criterion for evaluating a model. Statisticians often look first for unbiased estimates, but in many cases it may be better to accept a small amount of bias in exchange for a larger reduction in variance.
  4. Analysts need to pay special attention to non-linearity and non-additivity in their models. The careless deployment of simple linear models is often a by-product of the regression capabilities of BI tools. A misspecified model may lead to erroneous predictions and results. Techniques like cubic splines are available for testing and incorporating these complications in standard models.
  5. Graphical methods to support the understanding of complex models are critical. The connection of predictive models to graphics is particularly strong in R. The lattice graphics included in R, which implement the Trellis approach pioneered by William Cleveland, are central to its productivity and popularity.
  6. Methods for handling large numbers of predictors are central to today’s predictive models. Fortunately, there are answers like data reduction methods (e.g. principal components) from the multivariate statistics world, as well as Least Angle Regression (LARS), the Lasso, Random Forests, and Gradient Boosting from statistical learning.
  7. Overfitting is a common problem. Model validation approaches that include the bootstrap and cross-validation are now central to estimating and testing. The stepwise regression procedures I learned in grad school 30 years ago are now persona non grata in the prediction world. Fortunately, resampling techniques that are part and parcel of modern statistical practice have come to the rescue.
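
To make point 3 concrete, here is a small simulation sketch of the bias/variance trade-off. It is my own illustration, not an example from RMS: the function name `simulate`, the shrinkage factor, and all parameter values are assumptions chosen for the demo. It estimates a population mean two ways, with the plain sample mean (unbiased) and with the sample mean shrunk toward zero (biased but less variable), and reports bias, variance and mean square error for each.

```python
import random
import statistics


def simulate(shrink, true_mean=1.0, sd=2.0, n=10, reps=20000, seed=0):
    """Return (bias, variance, mse) of the estimator shrink * sample_mean.

    All defaults are illustrative assumptions, not values from any book.
    """
    rng = random.Random(seed)  # fixed seed so both estimators see the same samples
    estimates = []
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        estimates.append(shrink * statistics.fmean(sample))

    mean_est = statistics.fmean(estimates)
    bias = mean_est - true_mean
    # Population variance of the estimates around their own mean.
    variance = statistics.pvariance(estimates, mu=mean_est)
    # Mean square error: average squared distance from the true value.
    mse = statistics.fmean([(e - true_mean) ** 2 for e in estimates])
    return bias, variance, mse


if __name__ == "__main__":
    for label, shrink in [("sample mean", 1.0), ("shrunken mean", 0.8)]:
        bias, variance, mse = simulate(shrink)
        print(f"{label:14s} bias={bias:+.3f} variance={variance:.3f} "
              f"mse={mse:.3f} variance+bias^2={variance + bias ** 2:.3f}")
```

With these settings the shrunken estimator should come out clearly biased yet with a lower mean square error than the unbiased sample mean, and in both rows the last two columns should agree, since MSE decomposes into variance plus squared bias. How much shrinkage helps depends on the true mean, the noise level and the sample size, so treat the factor of 0.8 as a demo value rather than a recommendation.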