I’m finally back from my travels, which started at the international R user group conference in Nashville. Almost 500 attendees were treated to a delightfully geeky week at useR! 2012, hosted by Frank Harrell and the Biostatistics Department of Vanderbilt University. The weather, facilities and hospitality couldn’t have been better.
The tough part of my two useR! days was deciding which presentations to attend; on my office shelves are books written by at least eight conference participants. The single-threaded keynotes were no-brainers, but I had to make hard choices among the multi-tasked Kaleidoscope and Focus sessions. To make my conference life easier, I settled on a “data science” theme, opting to concentrate on R business applications, visualization, and the “big data” sub-topics of the cloud, Hadoop, database integration and performance/capacity enhancement.
Iowa State Professor and visualization expert Di Cook kicked off Tuesday’s itinerary with her keynote “Every plot must tell a story – even in R”. Cook gently criticized R package developers for not being graphical enough in their documentation. Her take is that R analysts must be attentive to both the visualization demands of exploring raw data and the needs of evaluating statistical models. She went on to illustrate good practices using examples drawn from popular R data sets and the ggplot2 package. Location, juxtaposition, color and faceting are major graphical tools at the analyst’s disposal.
David Kahle demoed his ggmap package, which links ggplot2 with RgoogleMaps, OpenStreetMap, Stamen Design Maps and CloudMade Maps. The power of ggmap is that its functionality is available simply as an additional geom layer to ggplot2. ggmap also provides several useful utility functions. I’m excited to be using ggmap now.
Iowa State Ph.D. student Yihui Xie was quite the star at useR! 2012, delivering several well-received presentations. His cranvas package of interactive statistical graphics based on Qt appears to hold much promise as a foundation for live visualization of “big” data in R.
When I first saw the topic on the HiveR package, I thought it had to do with integration of R with the Hadoop ecosystem’s Hive. This interesting presentation by Bryan Hanson, though, was on using his package to visualize 2D and 3D hive plots of networks. I can certainly see HiveR as an emerging tool for the data scientist.
Norman Matloff, author of the popular R book “The Art of R Programming”, presented on parallel programming in R. Matloff’s starting point is the snow package, which delivers much of R’s current parallel capabilities. He distinguishes simple challenges – “embarrassingly parallel” – from tough ones, and offers the graphics processor (GPU) as an example of low-hanging, performance-enhancing fruit. His in-development Rth package, though, takes on the tough stuff, attempting software alchemy that replaces initial “linear” problems with alternatives that are embarrassingly parallel. Many of his preliminary findings are encouraging, with several producing order-of-magnitude performance gains on multicore computer hardware.
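For readers new to the term, here is a minimal sketch of what “embarrassingly parallel” means. It is my own illustration in Python, not code from Matloff’s talk and not how snow or Rth work: the job (counting primes) splits into chunks that need nothing from one another, so each chunk can be handed to a separate worker and the partial results simply added up. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Trial-division primality check; deliberately simple."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def count_primes(bounds):
    """Count primes in the half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))


def parallel_count_primes(limit, n_chunks=4):
    """Split [0, limit) into independent chunks and count each concurrently."""
    if limit <= 0:
        return 0
    step = max(1, -(-limit // n_chunks))  # ceiling division
    chunks = [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]
    # No chunk depends on another, so the workers never need to communicate.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial_counts = list(pool.map(count_primes, chunks))
    return sum(partial_counts)
```

Threads keep the sketch short, but in Python they will not speed up CPU-bound work like this; swapping in `ProcessPoolExecutor` is the usual route to real multicore gains. The point is only the shape of the problem: split, compute independently, combine.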
Though the talk by AT&T Research Labs’ Simon Urbanek, “Web-based Interactive Graphics and R in the Cloud”, was nominally on visualization, it was as much about scalable, distributed, interactive analytical computing on the Web. Advances in the Rserve and FastRWeb packages as well as recent developments in browser performance using the binary protocol show promise for large data collaboration using R in the cloud. Karim Chine’s introduction of Elastic-R and an accompanying portal to facilitate plug-and-play scientific statistical computing, and Jeroen Ooms’ discussion on “Scalable Embedded Scientific Computing with OpenCPU”, affirm Urbanek’s direction.
For big data enthusiasts, Seonghak Hong’s “RHive in a Data Scientist’s Tool Box” was a great start to using R in the Hadoop ecosystem. Hong sees Hive as the Hadoop “data warehousing” platform, while RHive combines the power of SQL and R against HDFS data. Antonio Piccolboni offered another perspective on big data and R with his “Slicing and dicing big data with RHadoop/rmr”. He views Hadoop as a cloud operating system and map reduce as a low-level, assembler-like language for programming it. His rmr – R map reduce – allows programmers to embed highly productive R functions in their map reduce algorithms.
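To make the map reduce idea concrete, here is a toy, single-machine sketch of the pattern. It is written in Python for illustration only and is not rmr’s API: a map function emits key/value pairs from each record, the pairs are grouped by key, and a reduce function collapses each group. All names here are hypothetical, and a real Hadoop job would distribute each stage across a cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def map_reduce(records, mapper, reducer):
    """Run mapper over records, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reducer(key, values) for key, values in groups.items())


word_counts = map_reduce(["big data", "Big R"], map_words, reduce_counts)
```

The appeal Piccolboni described is that the analyst supplies only the two small functions; the framework owns the grouping and the distribution.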
Having worked at Oracle many years ago, I was gratified to see the integration of R with the mainstream relational/analytical databases Oracle, Netezza and Greenplum. Each professes an answer to the R memory constraint and offers in-database parallel computation to R programmers. There’s no free lunch, however: the performance largesse comes at the cost of commitment to expensive, proprietary software. At least for existing Oracle, Netezza and Greenplum customers, though, the R integration is certainly worth a look.
As I look through my conference notes and the program agenda, I realize I’ve just scratched the surface of the terrific useR! 2012 content. I’ll have more to say about additional topics like reproducibility in future posts. For now, kudos to Frank Harrell, the Organizing Committee, the Program Committee and Vanderbilt University for an outstanding conference.
On an R high, I’m already looking ahead to next summer, when useR! 2013 will be held July 10-12 at the University of Castilla-La Mancha, Albacete, Spain.