The URL to a pretty good article on R in the New York Times was in my email inbox early in the morning of January 7. I say only "pretty good" because the author didn't adequately trace R's lineage to the S language developed by John Chambers et al. at Bell Labs in the 1980s and 1990s, and I've never heard anyone from the R community refer to the platform as a supercharged version of Microsoft Excel. I also received an email from an R support list announcing an updated release of Rattle, the R Analytical Tool to Learn Easily, an open source data mining front end for R that I've been investigating for the last few weeks.
Data mining or machine learning, like just about everything in R, suffers from an embarrassment of riches. Unlike other open source projects, where the lion's share of development work is done by a few key programmers, there are hundreds of participants in R. The core team handles the base platform, but new packages offering diverse capabilities, such as statistical models, database access, graphical user interface front ends, graphics and boosting machine learning algorithms, are routinely developed by the R community at large. Indeed, there are over 1,650 such packages available for download today - and the pace of new development is escalating. More often than not, the new procedures are brought to life by the very statisticians (or their students) who developed the methods and algorithms, years before their availability in the commercial, proprietary competition.
With this largesse, however, comes the challenge of tracking what's available and when. You can't tell the players without a scorecard, so the R community developed task views for areas of significant interest, where a single expert tracks, organizes and publishes listings of the latest available packages. There are currently 22 task views pertaining to topics as diverse as graphics, econometrics, machine learning, spatial analysis and computational physics (ChemPhys).
The machine learning task view can only provide so much help, however, because at last count it cites 42 packages - not including the standard regression procedures. For those new to R or data mining, the Rattle front end to many of the machine learning packages is a great place to become productive quickly. Developed with the cross-platform GUI toolkit Gtk2 by Graham Williams, a leading health care data miner and adjunct professor at the Australian National University, open source Rattle generates and executes R code through a consistent interface that guides novices through the often difficult early going.
Invoked as a function within an R session, the Rattle GUI presents analysts with a set of tabs corresponding to tasks of the data mining life cycle: the pre-mining jobs of data (loading), explore and transform; the unsupervised learning techniques cluster and association; the supervised learning tab, model; model evaluation; and log, which records activity. Log is a particular favorite of mine, because it tracks all R code generated and executed in the session. Those familiar with SAS's Enterprise Guide, a VB front end to the SAS language and procedures, will appreciate the productivity of such a language generator. Indeed, I've learned more than a few coding tips over the last few weeks just reviewing the output of log.
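For readers who want to see what "invoked as a function" looks like, here is a minimal sketch. It assumes the rattle package and its GUI dependencies are already installed; the guard and the can_launch name are my own additions for illustration, so that the snippet does nothing in a batch session or on a machine without Rattle.

```r
# Assumption: the rattle package and its GUI toolkit are already installed.
# The guard keeps the script harmless in batch mode or where rattle is absent.
can_launch <- interactive() && requireNamespace("rattle", quietly = TRUE)

if (can_launch) {
  library(rattle)
  rattle()  # opens the tabbed Rattle window
}
```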
Version 2.4 of Rattle, released recently, can access existing R data as well as import from CSV and ARFF file formats and ODBC. It offers a wealth of R algorithms/models, including: unsupervised clustering/association with kmeans, hclust and arules; the linear modeling functions lm and glm; the tree algorithms rpart and party; the versatile bagging package randomForest; the neural networks procedure nnet; and the boosting algorithm ada. In addition to support for R's standard and lattice graphics, Rattle interfaces with the interactive visualization tools GGobi and PlayWith. It also provides a host of performance evaluation graphs/statistics for classification models, including Confusion, Lift, Risk, Sensitivity and ROC. It can also score regression models for both training and test subsets. Finally, Rattle supports PMML for model export to other tools, such as the enterprise decision management platform Zementis.
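To give a flavor of the kind of R code behind those tabs, here is a hand-written sketch, not actual Rattle output, that walks through clustering, a classification tree and a confusion matrix. It assumes only base R plus the rpart package that ships with standard R distributions, and it uses the built-in iris data; the variable names, the 70/30 split and the seed are arbitrary choices of mine.

```r
# Sketch only: base R plus the recommended rpart package, using built-in iris data.
library(rpart)

set.seed(42)

# Unsupervised: k-means and hierarchical clustering on the scaled numeric columns.
features <- scale(iris[, 1:4])
km <- kmeans(features, centers = 3, nstart = 10)
hc <- hclust(dist(features), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# Supervised: split into training and test subsets, then fit a classification tree.
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]
tree <- rpart(Species ~ ., data = train, method = "class")

# Evaluation: confusion matrix and accuracy on the held-out test subset.
predicted <- predict(tree, newdata = test, type = "class")
confusion <- table(actual = test$Species, predicted = predicted)
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

The point of the sketch is the shape of the workflow: load, cluster, model, evaluate. Rattle lets a newcomer click through the same sequence and then read the generated code in the log.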
Williams built an accompanying Web page, The Data Mining Desktop Survival Guide, available for purchase as a PDF file, which articulates a modeling framework and serves as a user guide to Rattle, while also providing a wealth of data mining and R goodies. Newbies will find the examples helpful in getting started with both R and Rattle. I found the multiple sections on data to be particularly useful, with clear discussions of R data types and ways to load data using Rattle. The author provides access to several datasets used for subsequent analysis, showcasing nifty R code that can be reused to download data from FTP sites to R. The sections describing visualization - including Graphics in R, Exploring Data and Understanding Data - are especially noteworthy.
My investigation of Rattle and data mining at the Australian National University came full circle when I happened on the Web page for Math 3346, a course co-taught by members of the ANU Data Mining Group, which includes Graham Williams. The 3346 course coordinator was none other than John Maindonald, co-author of the excellent Data Analysis and Graphics Using R book I reviewed in a column a year ago. Those looking to ramp up in data mining will find a wealth of excellent material here that closely aligns with the approach of Rattle. Course lecture presentations on the background, theory and applications of data mining are generally first rate. The assignments covering mining techniques in Rattle and R are also quite helpful. The enterprising student can use these materials to take an excellent self-study course for free.
My overall assessment is that, even as a work in progress, Rattle is an important productivity tool for R modelers, especially those just getting started. In tandem with the documentation and examples from the Survival Guide, Rattle will help users through the often thorny early steps of modeling. And those just learning the R language will find a great source of inspiration in the code written to the log.
Analysts at an intermediate or higher level of R and statistical learning expertise will probably choose to pick their spots with Rattle. The tool cannot possibly provide complete access to all the algorithms available to R, nor does it generally handle all parameters of supported models, so analysts will at times have to work outside Rattle. For core models supported by the tool, I'll often use Rattle to generate preliminary code, then tweak what is generated and reuse it later. I maintain such scripts of working code for important flavors of machine learning like random forests and gradient boosting, and will generally build new models from these rather than starting from scratch in Rattle. Even with such an incomplete commitment, I find the investment in Rattle well worthwhile and enthusiastically recommend its use to the modeling community.
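As an illustration of what such a reusable script might look like, here is a small template in base R. It is a sketch under my own assumptions rather than code from Rattle's log: the fit_and_score name, its arguments and the 70/30 split are hypothetical, and the example fits a plain linear model on the built-in mtcars data so that it runs without any add-on packages.

```r
# Hypothetical reusable template: fit_and_score() is an illustrative name,
# not something generated by Rattle.
fit_and_score <- function(formula, data, fit_fun, train_frac = 0.7, seed = 42, ...) {
  set.seed(seed)
  # Hold out a test subset so the model is scored on data it has not seen.
  train_idx <- sample(nrow(data), size = round(train_frac * nrow(data)))
  model <- fit_fun(formula, data = data[train_idx, ], ...)
  test <- data[-train_idx, ]
  list(model = model, test = test, predictions = predict(model, newdata = test))
}

# Example: a linear model on the built-in mtcars data, scored by RMSE.
result <- fit_and_score(mpg ~ wt + hp, mtcars, lm)
rmse <- sqrt(mean((result$test$mpg - result$predictions)^2))
print(rmse)
```

Because the fitting function is passed in as an argument, the same skeleton should accept other formula-based fitters, for instance randomForest from the randomForest package if it is installed, though models with unusual predict arguments would need their own small wrapper.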
Incidentally, the open source R community is not the only beneficiary of Rattle and the R machine learning capabilities. Proprietary BI vendor Information Builders now offers a commercially licensed version of Rattle, called RStat, for its flagship product WebFOCUS.