I got an email the last week of January from the R-help list announcing the release of the newest version of glmnet, an R package that fits lasso and elastic net regularization paths for squared-error, binomial and multinomial models via coordinate descent. Don’t be ashamed if you find that description a bit abstruse: just know you’re not alone! Suffice it to say that glmnet is a state-of-the-art modeling package that handles the prediction of interval and categorical dependent variables efficiently.
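For the mathematically curious: as I read the accompanying paper, the elastic net blends the lasso's absolute-value penalty with ridge regression's squared penalty, so the squared-error version of the problem glmnet solves looks roughly like

$$\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i-\beta_0-x_i^{T}\beta\bigr)^2 \;+\; \lambda\Bigl[\tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\Bigr],$$

where $\alpha = 1$ gives the lasso, $\alpha = 0$ gives ridge regression, and coordinate descent cycles through the coefficients one at a time along a path of $\lambda$ values.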

The package’s creator is Trevor Hastie, co-author with Jerome Friedman and Rob Tibshirani of the accompanying arcane-sounding paper, “Regularized Paths for Generalized Linear Models via Coordinate Descent,” published last summer. Hastie, Friedman and Tibshirani are also eminent professors of Statistics at Stanford University, the top-rated such department in the country. Last fall, I attended a statistical learning seminar with Hastie and Tibshirani where similar models were presented at a dizzying pace.

So the R user community had just been given access to the latest learning algorithm, hot off the development presses from three world-renowned practitioners – for free. And glmnet is readily accessible from the internet, installing painlessly on existing R platforms. No commercial stats package that I know of – certainly not the market leader – is even close to releasing a competitive offering. I’d say that’s a pretty good deal for stats types like me, and a benefit of working with a fertile, worldwide open source initiative like R.
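For anyone who wants to follow along, the painless install really is just a couple of lines at the R prompt:

    ## fetch glmnet from CRAN and load it into the session
    install.packages("glmnet")
    library(glmnet)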

After installing glmnet on my PC, I tested it against a 1988 Current Population Survey (CPS) data set consisting of 25,631 cases. My objective was to predict the log of weekly wages from experience and education, both measured in years. I first divided the base data set into two subsets: a training set with two thirds of the cases, randomly selected, and a test set with the remaining records. I then developed two separate models on the training data – one a straight linear model with an interaction term, the other using cubic spline mappings of experience and education. Once the model parameters were estimated on the training data, I evaluated and graphed the results using the held-out test data.
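Here’s a minimal sketch of that workflow in R. The file name and variable names (wage, exper, educ) are assumptions for illustration, as is the choice of five degrees of freedom for the splines; the actual analysis details differed.

    ## sketch of the modeling workflow; names and df are illustrative
    library(glmnet)
    library(splines)

    cps <- read.csv("cps88.csv")             # 1988 CPS extract, 25,631 cases
    cps$logwage <- log(cps$wage)             # dependent variable: log weekly wages

    ## split: two thirds training, one third test
    set.seed(1988)
    trn <- sample(nrow(cps), floor(2 * nrow(cps) / 3))
    tst <- setdiff(seq_len(nrow(cps)), trn)

    ## model 1: linear terms for experience and education plus their interaction
    x.lin <- model.matrix(~ exper * educ, data = cps)[, -1]

    ## model 2: cubic spline bases, defined once on the full data so the
    ## knots are identical for training, test, and later plotting
    b.exper <- ns(cps$exper, df = 5)
    b.educ  <- ns(cps$educ,  df = 5)
    x.spl   <- cbind(b.exper, b.educ)

    ## fit the regularization paths; cv.glmnet picks lambda by cross-validation
    fit.lin <- cv.glmnet(x.lin[trn, ], cps$logwage[trn])
    fit.spl <- cv.glmnet(x.spl[trn, ], cps$logwage[trn])

    ## score the held-out test cases
    p.lin <- predict(fit.lin, newx = x.lin[tst, ], s = "lambda.min")
    p.spl <- predict(fit.spl, newx = x.spl[tst, ], s = "lambda.min")
    mean((cps$logwage[tst] - p.lin)^2)       # test MSE, linear model
    mean((cps$logwage[tst] - p.spl)^2)       # test MSE, spline model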

The plot on the left shows the linear plane generated by glmnet; the one on the right depicts the curvilinear surface from the cubic spline mapping. The linear model seems naïve in contrast to the cubic spline alternative, which provides a much closer fit between actual and predicted wages. Indeed, preliminary exploration of the training data confirmed the curvilinear nature of the relationships between education, experience and wages, with wages actually declining at the high end of experience. The linear model incorrectly depicts uniformly increasing wages across the ranges of both education and experience.
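Continuing the sketch above (the grid ranges here are again assumptions), surfaces like these can be drawn by predicting both models over a grid of experience and education values – reusing the stored spline bases so the knots match the fitted model – and rendering with lattice’s wireframe:

    ## predict both models over a grid and draw the surfaces
    library(lattice)

    grid <- expand.grid(exper = seq(0, 40, length.out = 50),
                        educ  = seq(0, 18, length.out = 50))
    gx.lin <- model.matrix(~ exper * educ, data = grid)[, -1]
    gx.spl <- cbind(predict(b.exper, grid$exper), predict(b.educ, grid$educ))

    grid$lin <- as.vector(predict(fit.lin, newx = gx.lin, s = "lambda.min"))
    grid$spl <- as.vector(predict(fit.spl, newx = gx.spl, s = "lambda.min"))

    wireframe(lin ~ exper * educ, data = grid, zlab = "log wage",
              main = "linear model with interaction")
    wireframe(spl ~ exper * educ, data = grid, zlab = "log wage",
              main = "cubic spline mappings")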

The relationship on the left is thus mis-specified and produces predictions out of sync with actual outcomes. A naive linear specification like this is, unfortunately, more the rule than the exception for BI analysts using Excel or other standard BI tools for their models. Prudent analysts will turn to the sophisticated modeling packages of platforms like R for predictions that closely reflect the subtlety of their data.

Steve Miller's blog can also be found at miller.openbi.com.