Excel for Statistical Analysis?
I belong to several LinkedIn BI professional groups, the most notable of which is Advanced Business Analytics, Data Mining and Predictive Modeling, capably run by Vincent Granville. The ABADMPM discussion that I've been following most closely of late is Which data mining/analytic tools have you used in the past 12 months for a real project? And, I must admit, I've been pretty surprised by the direct responses to that question and also by the results of polls of the same inquiry commissioned by AnalyticBridge, The Social Network for Analytics Professionals, and KDNuggets, The Data Mining Community's Top Resource.
When I first looked at one of the polls, I was taken back by the selection of forced-choice categories. First of all, industry heavy IBM SPSS was not identified as a distinct choice, but instead lumped with Weka and Oracle. I would have set up Oracle and SPSS as separate choices and perhaps included Weka with Other Open Source. In addition, I would have added S-Plus and SQL (non-Oracle) as separate categories. What I might have missed, though, turns out to be the most selected choice of all: Excel.
I must admit, I'm not an Excel guru. An open source type, I use Open Office Calc when I “spreadsheet”, but even then barely at an intermediate level. I generally use Calc for planning and other modest computation tasks. Though I've programmed with it, I've never liked Excel's macro language and, truth be told, don't care much for its Visual Basic interface either. Where many use Excel for data manipulation, I currently use Ruby. When in the past I've embedded Excel in programs, it's been to use the graphics with agile languages Perl and Python. But aside from investigating capabilities, I've done very little spreadsheet statistical work, even though there's now quite a bit of stats functionality included. It seems I'm never without access to superior statistical platforms like SAS, S-Plus and R. So why would I condescend to work with a tool that's not designed for my complicated statistical needs? Indeed, I'd say my most common use for spreadsheets is as a universal storage format for data transfer.
Apparently, there are many poll respondents who disagree with my thinking. In one poll, where three data mining choices were allowed, Excel was selected by over 38% of 486 respondents and was indeed the top pick overall, outdistancing even SAS and R. The ABADMPM discussion includes supportive Excel for analytics comments like: “Truly MS Excel is a great tool for analysis.”, “I'm using MS Excel at most as we can do a lot with this by using PIVOT Tables through which we can do both weighted and unweighted analysis.”, “Surely Excel is popular, accessible, and easy. I use it daily.”, and, “First let me defend excel as god's gift to analysts. The flexibility, presentation, scalability and usability (yes all different) are unmatched by any platform.” Yikes! Maybe I've been missing out all these years?
I take some solace, though, from being on the same side of the Excel as Statistical Platform discussion as statistician, author and R community luminary Patrick Burns. Burns' thoughtful but provocative article Spreadsheet Addiction comes with strong admonitions on the proliferation of Excel for statistical analysis.
The author offers two points of departure defending traditional statistical packages against Excel: 1) “The perception of the ease-of-use of spreadsheets is to some extent an illusion. It is dead easy to get an answer from a spreadsheet, however, it is not necessarily easy to get the right answer. Thus the distorted view.” and 2) “The difficulty of using alternatives to spreadsheets is overestimated by many people. Safety features can give the appearance of difficulty when in fact these are an aid.”
Burns cites a litany of computational problems with Excel. As examples of how to get the wrong answer easily, he offers the ambiguity of value and formula, challenging the reader to: “create a column of calls to a random function ("=rand()" in Excel). Now sort those cells.” A second illustration has to do with numeric precision: “If you write a text file (csv or txt) from Excel or Works, then numbers will be written with a limited number of significant digits. Microsoft is aware that there is displeasure about this, but regards it as a "feature".....This loss of precision is especially troublesome since it limits the use of Excel as a staging area for data -- gathering data from one or more sources and writing the data out to be used by other programs ...There is a trick to get full precision, however -- turn the numbers into text. One way of doing this is to use the concatenate function with just one argument.” Finally, Burns notes a common spreadsheet problem he calls Data Extent. “It is extremely common for data to be added to a spreadsheet after it has been created. The augmentation of data can go wrong, rendering a correct spreadsheet incorrect ...When there is separation between functions and data, it is possible to refine the functions so that they work on all data of any size. In spreadsheets, where there is no such separation, it is easy for bugs to creep into the calculation on any use of the spreadsheet. “ Yes, been there; done that.
The author pulls no punches decrying the graphics metaphor of Excel. He starts by taking on the omnipresent Excel 3D pie chart, noting the gratuitous orientation of this and other Excel visuals: “Much, much worse is the three-dimensional effect. In most graphics a three-dimensional effect merely makes the graph harder to interpret.”
But Burns is a statistician and saves his most strident comment for last. “to the extent that Excel's poor statistical functionality has driven people to other programs, it has been a service. A spreadsheet is not a proper place to perform statistical analyses except in the very simplest of cases. “ He cites the scathing presentation: Problems With Using Microsoft Excel for Statistics, by statistician Jonathan Cryer, as evidence. Stats-intensive Excel users should review this document carefully. Cryer's conclusion: “Due to substantial deficiencies, Excel should not be used for statistical analysis. We should discourage students and practitioners from such use. The following pretty much sums it up: Get the Right Tool for the Job! Friends Don’t Let Friends Use Excel for Statistics!” Ouch!
A compromise, noted by some in the ABADMPM discussion, includes using Excel in tandem with one or more statistical packages such as SAS or R. No doubt, Excel is well connected to relational databases and just about any other data source. And many analysts are magicians using Excel to pivot and otherwise “munge” their data into formats friendly for data mining. Certainly Excel's ubiquitous connectivity to just about any other analytics tool minimally pre-ordains it to a top-tier supporting analytics role. So perhaps Excel can function well as a statistical assistant, “serving” the preliminary needs of more comprehensive statistical platforms. One illustration of this approach is the marriage of R and Excel. An unabashed R advocate, I like the freely-available RExcel addin a lot and recommend it enthusiastically to the Excel community. My motto: Render unto Excel the things which are Excel’s, and unto R the things that are R’s!