I just completed my Christmas shopping, of which about 75 percent was done online, most of that with Amazon. Books, DVDs and music CDs were staples, but I also bought a camera, a computer and a backpack. And, of course, I always play Santa for myself, this year securing four books for the holidays and the start of 2011.

I get my money's worth from Amazon Prime, generally purchasing a batch of books every other month. Many of the volumes are business and technology related, probably half on BI and analytics with R. I'd especially recommend two of my recent R purchases as last-minute stocking stuffers for geeky analytics friends.

Peter Dalgaard's “Introductory Statistics with R, Second Edition,” delivers on its title, with a good bit more material than the first edition to tempt owners to re-purchase. Though I'm not sure readers without a background in stats and R will find the book an easy read, those with some programming and statistics background should get a lot out of this well-written text.

The emphases of “Introductory Statistics” reflect the author's training in biostatistics and epidemiology. Indeed, Dalgaard notes in the preface that the “book was originally based upon a set of notes for a course in Basic Statistics for Health Researchers.” The R basics covered in “Introductory Statistics” are more than adequate to get the uninitiated going with basic programming, probability distributions, graphics, tables and summary statistics. The chapter on advanced data handling is excellent.

The book's treatment of linear models, which includes ANOVA, linear regression, polynomial regression and analysis of covariance, is strong, deftly combining both “theory” and best R programming practices. And the epidemiology-oriented chapters on logistic regression, survival analysis and rates/Poisson regression are simply superb. I recently referenced the chapter on survival analysis and the Cox proportional hazards model to guide work for a customer on product reliability. Not only is “Introductory Statistics” a teaching text, it's an R/stats reference as well.

I didn't purchase “R in a Nutshell, a Desktop Quick Reference,” by Joseph Adler, until it had been on the market for six or seven months, because my experience with other O'Reilly Nutshell books had been uneven. That was a mistake. “R in a Nutshell” is one of the best R programming books available, not only expertly covering basic language and syntax, but also advanced features such as R packages, objects and object-oriented programming.

For OO R, “R in a Nutshell” is the best teaching guide I've seen outside John Chambers' “Software for Data Analysis: Programming with R.” The coverage of relational database integration and advanced data management in R is also excellent. And those who, like me, have searched a long time for a comprehensive guide to R's powerful Lattice graphics subsystem need look no further than “R in a Nutshell.” Finally, the book takes a welcome fork in the predictive modeling road from traditional linear models, focusing in addition on machine learning algorithms like regression trees, random forests, boosted regression, multivariate adaptive regression splines (MARS), neural networks, support vector machines, generalized additive models, k-nearest neighbors and association rules. The 100-page appendix reference is worth the price of the book alone.

My personal request to Santa is for a book that, to the best of my knowledge, has not yet been written, but one I'd love to see in 2011. Like many predictive modelers/data miners, I'm never without my trusty copy of the ponderous “Elements of Statistical Learning” (ESL) by Trevor Hastie, Rob Tibshirani and Jerome Friedman. But while that book is encyclopedic in its discussion of predictive modeling methods, its treatment of the material is primarily theoretical and mathematical. What I'd like to see is a parallel book that demonstrates how to apply the ESL techniques using existing R packages, with detailed worked-out examples on meaty data sets.

As illustrations, the book might have chapters on R packages that include functions for generalized additive models, support vector machines, neural networks, random forests, gradient boosting with regression trees and clustering. The author would apply the models to several moderately sized business data sets (more than 50,000 records), available on the book website, that have both categorical and interval-level variables for regression and classification problems. The chapters would show how to deploy the methods to calibrate, tune and test different models.

Special attention would be paid to the tuning “knobs” available with the different functions, along with cross-validation, bootstrapping and other approaches to detecting and countering “overfit.” If available, techniques for highlighting the importance of variables would be illustrated. The methods would then be compared to traditional regression approaches and to each other for predictive accuracy and computer “cost.” A final chapter might include a series of Consumer Reports-like graphics contrasting the relative strengths and weaknesses of each method for different types of data sets and analytics questions.
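
As a rough sketch of the kind of worked example I have in mind, the base R snippet below uses k-fold cross-validation to compare a plain linear regression with a deliberately flexible polynomial fit. Everything specific here is my own assumption for illustration: the built-in mtcars data (far smaller than the data sets the book would use), the five folds, the model formulas and the helper name cv_rmse. It is not taken from either book reviewed above.

```r
# Sketch: k-fold cross-validation to compare out-of-sample error of two models.
# Assumptions (illustrative only): built-in mtcars data, 5 folds,
# a straight-line fit versus a degree-5 polynomial in car weight.
set.seed(42)
k <- 5
# Randomly assign each row to one of k folds of near-equal size
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# cv_rmse is a hypothetical helper: average held-out RMSE across folds
cv_rmse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]
  errs <- sapply(sort(unique(folds)), function(i) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- lm(formula, data = train)      # fit on the training folds only
    pred  <- predict(fit, newdata = test)   # predict the held-out fold
    sqrt(mean((test[[response]] - pred)^2))
  })
  mean(errs)
}

rmse_linear <- cv_rmse(mpg ~ wt, mtcars, folds)
rmse_poly   <- cv_rmse(mpg ~ poly(wt, 5), mtcars, folds)

print(c(linear = rmse_linear, polynomial = rmse_poly))
```

If the flexible model shows a higher cross-validated error than the simple one despite fitting the training data more closely, that gap is the “overfit” the tuning knobs are meant to control. The same loop could wrap any model that has a predict method.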

I'm excited about the new book. Any volunteers to write it?
