I received an email from John Maindonald the other day. A little over a year ago, I wrote a review of an excellent statistical text, Data Analysis and Graphics Using R, which John co-authored with John Braun. Part of his message was to let me know that the 3rd edition of Data Analysis would be coming out soon. Maindonald is also on the faculty of the Australian National University, where he co-teaches a course on data mining with Graham Williams. Williams is the developer of Rattle, the R Analytical Tool To Learn Easily, a front end to the significant machine learning/data mining capabilities of R. The second piece of John’s message was a request to update the URL to the course for Information Management readers. Done. I would highly recommend Math3346 for those seeking an accessible treatment of applied data mining.

Predictive Analytics World

As I mentioned in last week’s blogs, I was pleasantly surprised by version one of Predictive Analytics World, finding it quite useful on a number of levels. Today, I offer a few final observations on the conference.

I guess I shouldn’t be too surprised that the most oft-cited success (or risk) factors for analytics deployments have to do not with analytics per se, but rather with business sponsorship, business/IT/analytics team alignment, methodology, data quality, communication, incremental wins, and governance. It appears lessons learned for predictive analytics look much like those for broader business intelligence.

On the evening of Wednesday, Feb 18, the Bay Area useR Group (R Programming Language) held its meeting using PAW hotel facilities. Seventy people, many of whom were not R users, listened to presentations by commercial R vendor Revolution Computing as well as web titans Facebook and Google. Both Facebook and Google are big advocates of R’s open source analytics and graphical capabilities, employing analysts who learned the package in grad school. R is particularly popular for preliminary, exploratory data analysis (EDA) tasks.

I was a bit surprised by the limited range of analytics techniques demonstrated in the technical sessions I attended. Logistic regression and CART seemed the norm for classification problems, while ordinary least squares and stepwise regression appeared to be the choice for interval-level prediction. One session presented a hand-rolled ensemble of logistic regressions, demonstrating reduced variance and sharpened predictions, results R users take for granted with Random Forests and Gradient Boosting. Maybe I’m just spoiled by the embarrassment of riches available to predictive modelers in R. There are now scores of the very latest techniques accessible for free.
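For readers who have not seen one, a hand-rolled ensemble of logistic regressions might look something like the sketch below. This is my own illustration, not the presenter's code: I am assuming a bagging approach (each model is fit to a bootstrap resample and the predicted probabilities are averaged), and the function names, learning rate, model count, and toy data are all made up for the example. It is written in plain Python with no libraries so that it runs anywhere.

```python
import math
import random


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # w[0] is the intercept; w[1:] are the feature weights.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=200):
    # Fit a single logistic regression by batch gradient descent.
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def fit_bagged(X, y, n_models=25, seed=0):
    # Assumed ensemble scheme: one model per bootstrap resample of the rows.
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_logistic([X[i] for i in idx], [y[i] for i in idx]))
    return models


def ensemble_proba(models, x):
    # Average the member probabilities; the averaging is what damps variance.
    return sum(predict_proba(w, x) for w in models) / len(models)


# Hypothetical toy data: one feature, two overlapping classes.
data_rng = random.Random(42)
X = [[data_rng.gauss(-1, 1)] for _ in range(50)] + \
    [[data_rng.gauss(1, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

models = fit_bagged(X, y)
print(ensemble_proba(models, [2.0]), ensemble_proba(models, [-2.0]))
```

Averaging over resampled fits is the same idea that Random Forests apply to trees, which is why the result feels familiar to anyone used to those packages.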

The Bay Area is home to the top two schools of statistics in the U.S., Stanford and Cal Berkeley. It would have been nice to have an academic perspective on the current state of predictive analytics, especially given the rapid developments in both statistics and machine learning. Any one of the Stanford professors Trevor Hastie, Rob Tibshirani, or Jerome Friedman, co-authors of the just-released book, The Elements of Statistical Learning, Second Edition, would have been an ideal presenter. Perhaps next year there can be sessions surveying both statistical learning and Bayes modeling.

Looking forward to PAW 2010!