The URL to a pretty good article on R in the New York Times was in my email inbox early in the morning of January 7. I say only "pretty good" because the author didn't adequately trace R's lineage to the S language developed by John Chambers et al. at Bell Labs in the '80s and '90s, and I've never heard anyone from the R community refer to the platform as a supercharged version of Microsoft Excel. I also received an email from an R support list announcing an updated release of Rattle, the R Analytical Tool to Learn Easily, an open source data mining front end for R that I've been investigating for the last few weeks.
Data mining or machine learning, like just about everything in R, suffers from an embarrassment of riches. Unlike other open source projects, where the lion's share of development work is done by a few key programmers, R has hundreds of participants. The core team handles the base platform, but new packages offering diverse capabilities, such as statistical models, database access, graphical user interface front ends, graphics, and boosting machine learning algorithms, are routinely developed by the R community at large. Indeed, there are over 1,650 such packages available for download today - and the pace of new development is escalating. More often than not, the new procedures are brought to life by the very statisticians (or their students) who developed the methods and algorithms, years before they become available in commercial, proprietary competitors.
With this largesse, however, comes the challenge of tracking what's available and when. You can't tell the players without a scorecard, so the R community developed task views for areas of significant interest, where a single expert tracks, organizes and publishes listings of the latest available packages. There are currently 22 task views pertaining to topics as diverse as graphics, econometrics, machine learning, spatial analysis and computational physics (chemphys).
The machine learning task view can only provide so much help, however, because at last count it cites 42 packages - not including the standard regression procedures. For those new to R or data mining, the Rattle front end to many of the machine learning packages is a great place to become productive quickly. Developed with the cross-platform GUI toolkit Gtk2 by Graham Williams, a leading health care data miner and adjunct professor at the Australian National University, open source Rattle generates and executes R code through a consistent interface that guides novices through the often difficult early going.
Invoked as a function within an R session, the Rattle GUI presents analysts with a set of tabs corresponding to tasks of the data mining life cycle: the pre-mining jobs of data (loading), explore and transform; the unsupervised learning techniques cluster and association; supervised learning model; model evaluation; and log, which records activity. Log is a particular favorite of mine, because it tracks all R code generated and executed in the session. Those familiar with SAS's Enterprise Guide, a VB front end to the SAS language and procedures, will appreciate the productivity of such a language generator. Indeed, I've learned more than a few coding tips over the last few weeks just reviewing the output of log.
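For readers who have not yet tried it, starting the GUI from an R session looks roughly like the sketch below. The wrapper name launch_rattle is my own invention for illustration and is not part of Rattle; the only Rattle call is rattle() itself, and the sketch assumes the rattle package and its GUI toolkit are already installed.

```r
# Hypothetical helper (not part of Rattle): open the GUI only when it can
# actually appear, i.e. in an interactive session with the package installed.
launch_rattle <- function() {
  if (!interactive() || !requireNamespace("rattle", quietly = TRUE)) {
    message("Rattle needs an interactive R session with the 'rattle' package installed.")
    return(invisible(FALSE))
  }
  rattle::rattle()  # opens the tabbed Rattle window
  invisible(TRUE)
}

launch_rattle()
```

At a plain interactive prompt, the two lines library(rattle) and rattle() do the same job; the guard simply keeps the snippet from failing in a batch script.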
Version 2.4 of Rattle, released recently, can access existing R data as well as import from CSV and ARFF files and ODBC sources. It offers a wealth of R algorithms and models, including: the unsupervised clustering and association methods kmeans, hclust and arules; the linear modeling functions lm and glm; the tree algorithms rpart and party; the versatile bagging package randomForest; the neural network procedure nnet; and the boosting algorithm ada. In addition to support for R's standard and lattice graphics, Rattle interfaces with the interactive visualization tools GGobi and playwith. It also provides a host of performance evaluation graphs and statistics for classification models, including Confusion, Lift, Risk, Sensitivity and ROC. It can also score regression models for both training and test subsets. Finally, Rattle supports PMML for model export to other tools, such as the enterprise decision management platform from Zementis.
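To give a feel for what sits behind those tabs, here is a minimal sketch written directly in base R. It is not Rattle's generated log output. It uses two of the functions named above (kmeans and glm), and everything else is my assumption for illustration: the built-in iris data, the 70/30 split, the 0.5 cutoff, and the variable names.

```r
# Illustrative sketch only: base R (stats) on the built-in iris data.
set.seed(42)

# Unsupervised step: k-means clustering on the four numeric measurements.
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
cluster_vs_species <- table(cluster = clusters$cluster, species = iris$Species)

# Supervised step: a two-class target and a 70/30 training/test split.
flowers <- iris
flowers$is_virginica <- as.integer(flowers$Species == "virginica")
train_rows <- sample(nrow(flowers), size = round(0.7 * nrow(flowers)))
train <- flowers[train_rows, ]
test  <- flowers[-train_rows, ]

# Logistic regression fitted with glm on the training subset.
model <- glm(is_virginica ~ Sepal.Length + Sepal.Width,
             data = train, family = binomial)

# Score the held-out test subset and build a confusion matrix.
prob      <- predict(model, newdata = test, type = "response")
predicted <- factor(prob > 0.5, levels = c(FALSE, TRUE))
actual    <- factor(test$is_virginica == 1, levels = c(FALSE, TRUE))
confusion <- table(actual = actual, predicted = predicted)

# Sensitivity: the share of true positives that the model recovers.
sensitivity <- confusion["TRUE", "TRUE"] / sum(confusion["TRUE", ])

print(cluster_vs_species)
print(confusion)
print(sensitivity)
```

Rattle's value is that it writes and runs code of this general shape for you, then adds the lift, risk and ROC views on top of it.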
Williams built an accompanying Web page, The Data Mining Desktop Survival Guide, available for purchase as a PDF file, which articulates a modeling framework and serves as a user guide to Rattle, while also providing a wealth of data mining and R goodies. Newbies will find the examples helpful in getting started with both R and Rattle. I found the multiple sections on data to be particularly useful, with clear discussions of R data types and ways to load data using Rattle. The author provides access to several datasets used for subsequent analysis, showcasing nifty R code that can be reused to download data from FTP sites to R. The sections describing visualization - including Graphics in R, Exploring Data and Understanding Data - are especially noteworthy.
My investigation of Rattle and data mining at the Australian National University came full circle when I happened on the Web page for Math 3346, a course co-taught by members of the ANU Data Mining Group, which includes Graham Williams. The 3346 course coordinator was none other than John Maindonald, co-author of the excellent book Data Analysis and Graphics Using R, which I reviewed in a column a year ago. Those looking to ramp up in data mining will find a wealth of material here that closely aligns with the approach of Rattle. Course lecture presentations on the background, theory and applications of data mining are generally first rate. The assignments covering mining techniques in Rattle and R are also quite helpful. The enterprising student can use these materials to take a thorough self-study course for free.