The URL to a pretty good article on R in the New York Times was in my email inbox early in the morning of January 7. I say only pretty good because the author didn't adequately explain R's lineage from the S language developed by John Chambers et al. at Bell Labs in the '80s and '90s, and I've never heard anyone from the R community refer to the platform as a supercharged version of Microsoft Excel. I also received an email from an R support list announcing an updated release of Rattle, the R Analytical Tool to Learn Easily, an open source data mining front end for R that I've been investigating for the last few weeks.
Data mining or machine learning, like just about everything in R, suffers from an embarrassment of riches. Unlike other open source projects, where the lion's share of development work is done by a few key programmers, there are hundreds of participants in R. The core team handles the base platform, but new packages offering diverse capabilities, such as statistical models, database access, graphical user interface front ends, graphics, and boosting and other machine learning algorithms, are routinely developed by the R community at large. Indeed, there are over 1,650 such packages available for download today - and the pace of new development is escalating. More often than not, the new procedures are brought to life by the very statisticians (or their students) who developed the methods and algorithms, years before they become available in the commercial, proprietary competition.
With this largesse, however, comes the challenge of tracking what's available and when. You can't tell the players without a scorecard, so the R community developed task views for areas of significant interest, where a single expert tracks, organizes and publishes listings of the latest available packages. There are currently 22 task views pertaining to topics as diverse as graphics, econometrics, machine learning, spatial analysis and computational physics (ChemPhys).
The machine learning task view can only provide so much help, however, because at last count it cites 42 packages - not including the standard regression procedures. For those new to R or data mining, the Rattle front end to many of the machine learning packages is a great place to become productive quickly. Developed with the cross-platform GUI toolkit Gtk2 by Graham Williams, a leading health care data miner and adjunct professor at the Australian National University, open source Rattle generates and executes R code through a consistent interface that guides novices through the often difficult early going.
Invoked as a function within an R session, the Rattle GUI presents analysts with a set of tabs corresponding to tasks of the data mining life cycle, including: the pre-mining jobs of data (loading), explore and transform; the unsupervised learning techniques cluster and association; supervised learning model; model evaluation; and log, which notes activity. Log is a particular favorite of mine, because it tracks all R code generated and executed in the session. Those familiar with SAS's Enterprise Guide, a VB front end to the SAS language and procedures, will appreciate the productivity of such a language generator. Indeed, I've learned more than a few coding tips over the last few weeks just reviewing the output of log.
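For readers who have not launched Rattle before, the invocation can be sketched roughly as follows. This assumes the rattle package and its GTK dependencies have already been installed from CRAN; the wrapper name launch_rattle is hypothetical, introduced here only so that the GUI is skipped when R is run non-interactively.

```r
# Minimal sketch: start the Rattle GUI from within an R session.
# Assumes rattle was installed beforehand, e.g. install.packages("rattle").
# launch_rattle is a hypothetical helper name, not part of Rattle itself.
launch_rattle <- function() {
  # Only open the GUI when a user is present and the package is available.
  if (interactive() && requireNamespace("rattle", quietly = TRUE)) {
    library(rattle)
    rattle()  # opens the tabbed Rattle window
    return(invisible(TRUE))
  }
  invisible(FALSE)
}

launch_rattle()
```

At an interactive prompt, the two lines library(rattle) and rattle() are all that is really needed; the guard simply keeps scripts from failing on machines without the package.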
Version 2.4 of Rattle, released recently, can access existing R data as well as import from CSV and ARFF file formats and ODBC. It offers a wealth of R algorithms/models, including: the unsupervised clustering/association tools kmeans, hclust and arules; the linear modeling functions lm and glm; the tree algorithms rpart and party; the versatile bagging package randomForest; the neural networks procedure nnet; and the boosting algorithm ada. In addition to support for R's standard and lattice graphics, Rattle interfaces with the interactive visualization tools GGobi and PlayWith. It also provides a host of performance evaluation graphs/statistics for classification models, including Confusion, Lift, Risk, Sensitivity and ROC. It can also score regression models for both training and test subsets. Finally, Rattle supports PMML for model export to other tools, such as the enterprise decision management platform Zementis.
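To give a flavor of what sits behind those menu choices, here is a small sketch of the underlying R calls for a few of the models named above. It is not Rattle's actual log output, which will differ; it assumes the built-in iris data set, an arbitrary 70/30 training/test split, and the rpart package that ships with standard R distributions.

```r
# Illustrative sketch only: the kind of R calls behind several of the
# models listed above. This is not Rattle's generated log.
# Assumes the built-in iris data and the rpart package shipped with R.
library(rpart)

set.seed(42)                                   # reproducible split and clusters
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))  # 70/30 training/test split
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Unsupervised: k-means and hierarchical clustering on the numeric columns.
km <- kmeans(train[, 1:4], centers = 3)
hc <- hclust(dist(train[, 1:4]))

# Supervised: a classification tree and a simple linear regression.
tree_fit <- rpart(Species ~ ., data = train, method = "class")
lm_fit   <- lm(Sepal.Length ~ Petal.Length, data = train)

# Evaluation: confusion matrix of tree predictions on the held-out rows.
pred <- predict(tree_fit, newdata = test, type = "class")
confusion <- table(actual = test$Species, predicted = pred)
print(confusion)
```

The confusion matrix here is the plain-R counterpart of the Confusion option on the evaluation tab; the other evaluation graphs would require additional packages not shown in this sketch.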
Williams built an accompanying Web page, The Data Mining Desktop Survival Guide, also available for purchase as a PDF file, which articulates a modeling framework and serves as a user guide to Rattle, while also providing a wealth of data mining and R goodies. Newbies will find the examples helpful in getting started with both R and Rattle. I found the multiple sections on data to be particularly useful, with clear discussions of R data types and ways to load data using Rattle. The author provides access to several datasets used for subsequent analysis, showcasing nifty R code that can be reused to download data from FTP sites into R. The sections describing visualization - including Graphics in R, Exploring Data and Understanding Data - are especially noteworthy.