I participate in the Advanced Business Analytics, Data Mining and Predictive Modeling group on LinkedIn. It's a terrific forum, with at least half a dozen discussions always active among its more than 26,000 members.
One recent theme, SAS versus R, particularly caught my eye. The discourse is lively, with now over 100 comments representing different perspectives on the relative merits of the two statistical platforms. My connection is that I've worked closely with each during my career – 20 years with SAS and now 10 with open source R and its proprietary kin S+.
The initiating question asked whether anyone had ever been asked to justify if/why R is to be preferred to SAS. Over 100 responses later, the focus has taken many twists and turns as users on both sides of the SAS/R divide weigh in. I guess it's my turn now.
A point of departure for discussants is the observation that one of the big differences between SAS and R is the size of the data sets each can accommodate. Out of the box, that size is limited by physical memory in R, while SAS, with virtual memory management, theoretically has no limits.
On my 4 GB RAM Wintel PC using the 64-bit build of R, I can process a data frame with 11M records and 4 attributes – albeit barely – and adequately run all my favorite predictive models against another data set with over half a million records and 11 variables.
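A quick back-of-the-envelope calculation in R shows why a data frame of that shape gets tight on a 4 GB machine. This is only a sketch: it assumes all four attributes are stored as doubles, and it ignores the temporary copies R often makes during modeling, which can push peak usage to several multiples of the raw figure.

```r
# Rough in-memory footprint of an 11M-row, 4-column numeric data frame
rows  <- 11e6
cols  <- 4
bytes <- rows * cols * 8   # a double occupies 8 bytes in R
round(bytes / 2^20)        # ~336 MB for the raw columns alone
```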
With 64 bit Linux and lots of memory, the size of data that can be accommodated in R is quite large. One forum thread from an R elder notes: "If you are on 64-bit completely and have invested some coins in memory modules, the limit may be really huge." Indeed, a group participant opined: "R doesn't have any real limitations as far as size. I've used a very fast (~16Tb RAM) computer to run simulations on hundreds of billions of observations. I think someone mentioned this above, but R is only limited to the physical ability of your computer ..."
Commercial R vendor Revolution Analytics has taken the R size limitation head on, developing a file-based R dialect that can handle unlimited data set size. A frustration with this approach, though, is that models and statistical procedures must be rewritten, and hence prioritized, for porting to the new framework. Revolution R is also embedded in the Netezza analytic database for parallel processing of large data in situ. Finally, RA has built an interface to Hadoop/MapReduce to bring the power of big data processing to R for data scientists.
On the SAS side, one data miner added that in his experience, while SAS may work well with unlimited size for programming and data preparation, he's generally forced to work with a sample from a 1 GB data set for his association rules analysis.
A recurring skirmish between SAS and R advocates revolves around the benefits and risks of the open source development model. There's little argument that the vast international R community provides access to the latest statistical models and procedures before they're available in proprietary SAS. But SAS proponents counter that R users assume more risk with software quality than do those of SAS. In fact, an oft-quoted comment from a SAS executive on the "benefits" of R goes something like “I think it addresses a niche market for high-end data analysts that want free, readily available code. We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.” My take after 8 years of heavy R usage is that I've never worked with a more stable, bug-free piece of software.
Another thread contrasts the R and SAS languages for both capabilities and ease of learning. A discussant offered that if he had 10M records and one day to "find" something, SAS would be the more productive tool. I remember thinking the same thing when I first started using S+/R 10 years ago. Now I much prefer R for statistics, visualization and data programming, even as I acknowledge the superiority of SAS database access. When I first used R, I also agreed with the observation of another discussant that "proc summary" is the killer SAS app. At this point, though, I find the R motifs for "by group" summary processing more accommodating. I guess you like what you know.
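For readers unfamiliar with the R idiom, here is a minimal base-R sketch of "by group" summary processing, using the built-in mtcars data set to compute mean mpg and hp by cylinder count – roughly what a simple proc summary with a class statement would produce in SAS:

```r
# Group means by cylinder count, analogous to a simple
# PROC SUMMARY with CLASS cyl; VAR mpg hp; OUTPUT MEAN=
aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
```

The same pattern extends to any data frame, grouping columns and summary function; packages like plyr build richer variations on this motif.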
A related observation compares R unfavorably with SAS as an enterprise platform. Fair enough. But with a solid company, Revolution Analytics, now supporting and enhancing the base, there should be much less angst with commercial R customers going forward.
And though I haven't looked in years, I was never a fan of SAS data warehousing and business intelligence solutions, at the time preferring instead a best-of-breed lineup of Informatica, Business Objects and Hyperion. Now I recommend R as a component of low-cost BI/analytics platforms such as Pentaho and Jaspersoft/Talend.
So how did SAS get to be the big gorilla in the first place? By doing the right thing in the right place at the right time. The SAS of 1980 was well ahead of its competitors, SPSS and BMDP, in data management capabilities and programmability. And the decision in the early 80s to migrate from the IBM mainframe-only platform to a portable one written in C suitable for the soon-to-be-exploding mini/micro/pc market was nothing short of brilliant. SAS saw the opportunity and grabbed it with gusto.
Indeed, thirty years ago, the SAS programming paradigm of data steps, proc steps and a macro glue to tie them together was innovative. In 2011, though, the same paradigm seems old and ugly, especially in contrast to the extensible object orientation of R. My big complaint about developing with R is that I often program functions similar to those already available, discovering the pre-built R answers only after the fact. SAS certainly leads in documentation and training, though the number of excellent R teaching books is skyrocketing.
Another strategy that's served SAS well over the decades, affirmed by several writers, is winning over statistics students while they're in school. For years, SAS "owned" statistics graduate programs in the U.S. and abroad, with SAS the platform for teaching/learning the latest methods. When those students graduated and entered commerce, they naturally sought to work with what they knew – SAS. And once the analysts were in the fold, SAS did a marvelous job cultivating relationships with user groups to secure its position.
Alas, SAS is no longer king of academic statistics. That title now belongs to R. R students are quite enthusiastic about the platform they've trained in. SAS programmers approaching retirement will be replaced by R-trained new grads. And these R aficionados will increasingly demand the platform they know and love when they reach the work world. My bet is that Revolution Analytics will be the beneficiary of this trend to R in the commercial world.
The obsession with big data and the emergence of data science as a "competitor" to traditional business intelligence/analytics also favors R. Big data software innovators, such as Google, Amazon, Yahoo, Facebook, Twitter, LinkedIn, et al., prefer open source components, including R, in the platforms they build. They also increasingly choose Hadoop instead of relational databases to address their big data challenges.
In the short run, with its huge lead, SAS will continue to rule the business predictive analytics landscape. But look for that dominance to erode. One astute discussion observer sees the encroachment of R as similar to that of Linux a dozen years ago: "I see R and other open source software ... as being at the stage that Linux was 10-12 years ago when Sun Microsystems still had robust business. We know what's happened to Sun and to Linux over those 10-12 years. I'm going to stick my neck out and predict the same fate for many commercial statistics software providers ..." I think he's right.