for Information Management Blogs
APR 13, 2009 7:31am ET

The R Statistical Learning Lasso

I got an email the last week in January from the R help list announcing the release of the newest version of glmnet, a statistical learning algorithm that fits lasso and elastic net regularization paths for squared error, binomial and multinomial models via coordinate descent. Don’t be ashamed if you find that description a bit abstruse: just know you’re not alone! Suffice it to say that glmnet is a state-of-the-art modeling package that handles the prediction of interval and categorical dependent variables efficiently. 
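For readers curious about what "coordinate descent" amounts to in the lasso case, here is a minimal sketch in plain Python. It is not glmnet's code and makes several simplifying assumptions: squared-error loss only, no intercept, no standardization of predictors, a single penalty value rather than a full path, and a fixed number of sweeps instead of a convergence check. The function names are hypothetical. The objective assumed is (1/2n) * sum of squared residuals + lam * sum of |beta_j|; each coefficient is updated in turn by soft-thresholding its correlation with the partial residual.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Lasso for squared-error loss via cyclic coordinate descent.

    X is a list of rows (lists of floats), y a list of floats.
    Assumes no intercept and no standardization (illustration only).
    """
    n = len(y)
    p = len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X*beta while beta is all zeros

    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            norm_sq = sum(v * v for v in col) / n
            if norm_sq == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the partial residual, i.e. the
            # residual with column j's current contribution added back in.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam) / norm_sq
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                beta[j] = new_beta
    return beta
```

With lam set to 0 this reduces to ordinary least squares; as lam grows, more coefficients are driven exactly to zero, which is what makes the lasso useful for variable selection.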

The package’s creator is Trevor Hastie, co-author with Jerome Friedman and Rob Tibshirani of the accompanying arcane-sounding paper, Regularization Paths for Generalized Linear Models via Coordinate Descent, published last summer. Hastie, Friedman and Tibshirani are also eminent professors of statistics at Stanford University, the top-rated such department in the country. Last fall, I attended a statistical learning seminar with Hastie and Tibshirani where similar models were presented at a dizzying pace.

So the R user community had just been given access to the latest learning algorithm, hot off the development presses from three world-renowned practitioners, for free. And glmnet is readily accessible from the internet and installs painlessly on existing R platforms. No commercial stats package that I know of, certainly not the market leader, is even close to releasing a competitive offering. I’d say that’s a pretty good deal for stats types like me, and a benefit of working with a fertile, worldwide open source initiative like R.

After installing glmnet on my PC, I tested it against a 1988 Current Population Survey (CPS) data set of 25,631 cases. My objective was to predict the log of weekly wages from experience and education, both measured in years. I first divided the base data set into two subsets: a training set with two thirds of the cases, randomly selected, and a test set with the remaining records. I then fit two separate models to the training data: one a straight linear model with an interaction term, the other using cubic spline mappings of experience and education. Once the model parameters were estimated from the training data, I evaluated and graphed the results using the separate test data.
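The workflow just described can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code used for the analysis, which was done in R. The helper names are hypothetical. The post does not say which spline basis was used, so the truncated-power cubic basis and its knots below are stand-ins. Mean squared error is likewise only one reasonable way to score the test set.

```python
import random


def train_test_split(rows, train_fraction=2 / 3, seed=0):
    """Randomly assign about train_fraction of the rows to training, the rest to test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def linear_features(experience, education):
    """Predictors for the straight linear model: two main effects plus their interaction."""
    return [experience, education, experience * education]


def cubic_spline_features(x, knots):
    """Truncated-power cubic spline basis for one variable (assumed basis, assumed knots)."""
    return [x, x ** 2, x ** 3] + [max(x - k, 0.0) ** 3 for k in knots]


def mean_squared_error(actual, predicted):
    """Average squared gap between actual and predicted values on held-out cases."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

The point of the held-out test set is that both models are scored on cases they never saw during fitting, so a more flexible specification such as the spline mapping gets no credit for merely memorizing the training data.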

The plot on the left shows the linear plane generated by glmnet; the one on the right depicts the curved surface from the cubic spline mapping. The linear model seems naïve in contrast to the cubic spline alternative, which provides a much closer fit between actual and predicted wages. Indeed, preliminary exploration of the training data set confirmed the curvilinear nature of the relationships between education, experience and wages, with wages actually declining at the high end of experience. The linear model incorrectly depicts uniformly increasing wages across the ranges of both education and experience.

The relationship on the left is thus misspecified and produces predictions out of sync with actual outcomes. A naïve linear specification like this is, unfortunately, more the rule than the exception for BI analysts using Excel or other standard BI tools for their models. Prudent analysts will turn to the sophisticated packages of platforms like R for predictions that closely reflect the subtlety of their data.

Steve Miller's blog can also be found at miller.openbi.com. 
