Steve would like to thank his partner Dave Reinke for his important input to this column.
The R Community
When I returned from the annual August family vacation in the Outer Banks, I prepared for a major email backlog that I'd have to sort through. Though dreading the specter of responding to hundreds of messages, I did look forward to logging on to my open source help list account and catching up with the R community. The 660 messages that had accumulated from the various R groups didn't disappoint. I spent four hours over the next two days scouring the messages and enjoyed every minute of my labor, learning a lot about statistics, programming, and the workings of an open source community in the process.
Having been an active participant on the R help lists for several years now, I am able to make a quick first pass through accumulated messages, prioritizing for personal interests, education and amusement. Topics such as graphics, database connectivity, programming puzzles, predictive models, new package availability, Python integration and financial engineering are tagged for early review. Of the world-wide R community, there are probably a couple of dozen list authors who are so knowledgeable that I prioritize their messages regardless of content, privileged for the opportunity to learn intimately from respected experts. Then there is the always-entertaining community policing and control, where "newbies" are taught the norms of list participation. Indeed the community publishes a posting guide that prescribes the method for posing questions. Identifying oneself as a newbie generally buys some slack, but there are limits. Be sure to choose the proper list for the question; don't, for example, send R-sig-DB questions to R-help. Do your homework first; it's not good form to ask a trivial question readily answered through online help. Nor is it proper to vaguely frame a question without sample code, data or mention of R version or platform. And woe to the unwitting dupe who disparages a language or package feature through ignorance. I've gotten pretty good at predicting which notes will be torched just by reading the first sentence of the message.
I respond to list questions as I'm able, though a request just an hour old from Europe may already be answered by the time I craft my note. A relatively simple inquiry might get half a dozen responses within minutes, each unaware of the others. There's a pecking order to both questions and respondents, indicative of the many levels of R and subject expertise. It is testimony to the commitment of the R list community that participation by leading experts is so pervasive. These experts generally pick their spots, offering support as their skills are needed. For the most part, questions are addressed by the least experienced qualified respondents, promoting participation from apprentice as well as seasoned users. A simple question will be answered by someone like me, not a distinguished professor or researcher. That distinguished professor, however, might take on language and statistical esoterica out of the reach of most, or adjudicate a lively discussion on an issue of list disagreement. Once the distinguished opine, the discussion thread generally ends. Occasionally there are illuminating messages from academics on packages they've developed for their latest statistical procedures. Programming puzzle questions sometimes become competitions where, within the domain of correct responses, terseness wins. I'm generally wary of submitting a programming solution, lest someone respond to my note with, "Yes, but simpler is ..." or "Why not use such and such function?" It somewhat reminds me of the old APL mentality: one program, one line of code.
The R Platform
So just what is this phenomenon called R with such a talented and entertaining community? As described on its Web site, R is:
a "language and environment for statistical computing and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions and the ability to run programs stored in script files."
In the statistical world, R is differentiated as an extensible, object-oriented language modeled on S, from Bell Labs, and on Scheme. R is also open source, available under the GNU General Public License (GPL). As such, source code is readily available and modifiable, and R is free to use, within the boundaries of the GPL. R was originally developed by two researchers at the University of Auckland in New Zealand (both of whose first names start with R), and is now maintained by a world-wide group of core developers. There are ports of R for Windows, Linux, UNIX, and Mac OS, with source code available for all, and binaries downloadable for Windows and Mac OS. R's syntax is quite similar to that of S, which is commercialized as S-Plus by Insightful Corporation. As the senior product, S has had more written about it to date than R, though R specialty books are starting to proliferate. Fortunately, much of the code written for S works with R, despite the different internals of the two languages. Several seminal statistical texts include both R and S code, noting differences as appropriate. Alas, commercial S-Plus, an excellent product, continues to struggle in the marketplace, even as R's popularity soars. I wonder what that suggests about open source versus proprietary software?
Extensive documentation of R's workings is available from the R Web site as either downloadable PDF files or HTML. An R newsletter offering both practical and arcane insights is published, though it is not for the faint of heart. The biennial international R user conference, useR!, has held its last two meetings in Vienna, Austria. The R community provides numerous help and special interest group mailing lists to support its users. I subscribe to R-help, R-packages, R-sig-DB, R-sig-finance, R-sig-GUI and R-sig-Wiki. New users can enroll through the main R page. A promising R Wiki has been established and provides collaborative, community documentation outside the formal R manuals as well as links about R, tips, reference cards, galleries of graphics, extensive code samples, etc. R-based projects substantial enough to exist independently have been established and are accessible from the R Web site. Bioconductor (bioinformatics with R) and Rmetrics (a project for financial engineering) are examples. Finally, the site's Search page offers a series of engines to help the R community locate pertinent R Web pages and mail archives.
A major benefit of the R platform is the quantity and quality of add-on modules readily accessible to the community. Indeed, a notable strength of R is its role as a platform on which domain-specific solutions can be easily developed and shared. These solutions, in turn, promote a viral adoption of the core R technology. Several hundred packages are available on CRAN and other sites for free download to R users. The latest methods developed in university and research worlds are generally first "published" as R packages before being adapted by commercial statistical vendors. Publication of articles introducing new methods in statistical journals is, at times, contingent on the development of R packages showcasing the work.
A basic setup of R will install a default set of packages including base, stats, graphics, methods, datasets and utils. Figure 1 is a sample of other downloadable packages (and modules) from my home environment and reveals just how diverse and valuable community participation can be.
Figure 1: Downloadable Packages and Usages
Note: The main author of Zelig is Gary King, David Florence Professor of Government and Director of the Institute for Quantitative Social Science at Harvard University. In coming months, OpenBI Forum will publish interviews with Professor King discussing his research and the Institute's work as well as their relevance to business intelligence (BI).
R and Business Intelligence
R's meteoric rise has probably moved it to second place in popularity among statistical cognoscenti, though SAS still dominates in the commercial arena. SAS has done a masterful job over the past few years expanding its brand from statistical package to enterprise-scale BI platform and is little challenged by other commercial stats competitors. In fact, SAS probably sees larger BI players such as Business Objects and Cognos, which are moving upstream from reporting to analytics, as its major competition at this point, and vice versa.
A comparison of SAS with R would be an article by itself, but two important differences should be noted for BI. The R language is object oriented and extensible by design, so integration of stats, programming and graphics is natural, and packages expanding capabilities are readily built. In contrast, the SAS language, while comprehensive and fully functional, looks like PL/I circa 1975. Communication between the data step and procedures is through SAS data sets and macro variables, rather than through objects that functions recognize and treat with different behaviors. On the other hand, SAS is almost certainly a better choice for large data. Because R holds its data in main memory, it is limited for very large volumes, though a 300M data frame is certainly doable with adequate RAM and should be suitable for most deployments. SAS makes efficient use of memory and disk and performs much better than R for very large data sets, handling volumes that would choke R. If multigigabyte analytics data sets are in the plans (though I'd certainly question such an application), SAS is the choice. S-Plus, R's commercial kin, has implemented support for data sets larger than the size of RAM.
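The in-memory limit described above comes down to simple arithmetic, which the following sketch works through. It is written in Python purely for illustration and rests on several assumptions: the "300M" figure is read as roughly 300 megabytes, every value is stored as an 8-byte double (as R does for numeric vectors), and per-object overhead is ignored. The function names and the working_copies safety factor are hypothetical, not part of R or any library.

```python
# Back-of-the-envelope memory estimate for an all-numeric table held in RAM.
# Assumptions: 8 bytes per value (double precision), no per-object overhead.

BYTES_PER_DOUBLE = 8


def table_bytes(rows: int, cols: int, bytes_per_value: int = BYTES_PER_DOUBLE) -> int:
    """Approximate bytes needed to hold rows x cols numeric values."""
    return rows * cols * bytes_per_value


def fits_in_ram(rows: int, cols: int, ram_bytes: int, working_copies: int = 3) -> bool:
    """Check whether the table plus a few working copies fits in ram_bytes.

    working_copies is a hypothetical safety factor: analysis routines often
    duplicate data while they run, so budgeting for one copy is optimistic.
    """
    return table_bytes(rows, cols) * working_copies <= ram_bytes


if __name__ == "__main__":
    mb = 1024 ** 2
    gb = 1024 ** 3
    size = table_bytes(rows=2_000_000, cols=20)
    # 2,000,000 rows * 20 columns * 8 bytes = 320,000,000 bytes
    print(f"{size / mb:.0f} MB")
    print(fits_in_ram(2_000_000, 20, ram_bytes=2 * gb))
```

Under these assumptions, a table of two million rows by twenty numeric columns lands near the 300 MB mark, and whether it is comfortable to analyze depends on how many temporary copies the chosen procedures make.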
The unique combination of a powerful and extensible language, industry-leading graphics, state-of-the-art statistical procedures, and a vibrant open source community positions R enviably in a BI space that is increasing its analytical sophistication. The successful transition of SAS from geeky stats tool to enterprise BI solution in recent years shows market acceptance of powerful statistical platforms. An unscientific inspection of bookstores at the University of Chicago, Northwestern, Illinois, Wisconsin and UIC revealed R in use for statistics courses at each school, providing grist for the oft-heard observation that R is now the lingua franca of academic statistical computing. Graduates of these programs bring their knowledge of R to the work world, so look for R to grow and be used increasingly in commercial BI, much as SAS has been over the last 25 years. What was once the realm of quant heads is now approaching mainstream. Continued development of connectivity capabilities like those to relational databases, spreadsheet packages, other stats programs, other applications and the Internet will strengthen R's role in serious BI environments. Finally, for those wary of open source because of the absence of accountable, single-provider support, there are rumblings of a commercial (though still open source) R supported by a real company.
Even with today's heightened use of predictive analytics in business, it is more than just the advanced statistical models that position R favorably for BI. R's rich language and graphics provide BI benefits on their own, supporting exploratory capabilities lacking in many current tools. R programming and graphics can be used at the outset of performance management (PM) intelligence deployments to quickly determine which metrics and dimensions really matter, that is, which are the best indicators of enterprise performance. These early investigations serve the dual functions of identifying key measures for subsequent PM dashboards and strategic reporting while also eliminating the clutter of nonpredictive attributes in the data warehouse. A small investment in a targeted statistical and graphical "study" can easily pay for itself by quickly focusing PM activities on those facts that matter most. OpenBI has developed an exploratory BI offering around the capabilities of R, analyzing business performance metrics by important operating dimensions over time. The point of departure for this work is the data warehouse that sources "fact" and "dimension" data to R. Part of the engagement's deliverable is output in the form of trellis statistical graphs and HTML reports, which are then used to formulate and refine the design of PM dashboards and executive reports for subsequent development. OpenBI plans to actively promote the use of R for early exploration with our customers as a preamble to performance measurement applications going forward.
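To make the idea of an early screening study concrete, here is a minimal sketch of one possible first pass: ranking candidate metrics by the strength of their linear correlation with a performance outcome. It is not the OpenBI offering and not the only reasonable approach. It is written in Python for illustration, the function names are hypothetical, and the numbers are made-up toy data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / sqrt(var_x * var_y)


def rank_metrics(candidates, outcome):
    """Sort metric names by absolute correlation with the outcome, strongest first."""
    scored = [(name, pearson(values, outcome)) for name, values in candidates.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


if __name__ == "__main__":
    # Made-up toy numbers, for illustration only.
    outcome = [10, 12, 15, 19, 24]  # hypothetical performance measure per period
    candidates = {
        "orders": [1, 2, 3, 4, 5],
        "returns": [5, 3, 4, 3, 1],
        "headcount": [7, 7, 7, 7, 7],
    }
    for name, r in rank_metrics(candidates, outcome):
        print(f"{name:10s} {r:+.2f}")
```

A ranking like this is only a screen. Correlation does not establish that a metric drives performance, and a real study would follow up with graphics and models before anything is promoted to a dashboard.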
The BI world is waiting. R we ready?