The results are in for the 2013 KDnuggets survey of “programming/statistics languages being used for analytics / data mining / data science work.” I respond annually and track year-to-year changes. For 2013, the leading vote getters (R, Python, and SQL) are, along with Tableau for visualization, also my top data science tools. I’d like to think I’m a bellwether, but suspect I’m more the dutiful follower.
Much as I’m gratified that R and Python set the analytic pace in the KDnuggets poll, I’m generally skeptical of online surveys for measuring the pulse of a tech population, since self-selected rather than representative response samples are the rule. I suspect there’s a fair amount of bias in this year’s votes, with the typical KDnuggets respondent, like me, predisposed towards open source and programming language solutions. Run a similar survey at the SAS Global Forum and you’re likely to see a quite different breakdown. All in all, though, the findings fit my analytics worldview nicely.
One observation buried in the KDnuggets write-up I found particularly intriguing: “We also find a small affinity between R and Python users”. This too affirms my experience of a long involvement with R and a renewed preference for Python. Three years ago, while R was my analytics platform, I wavered between Python and Ruby for data munging/wrangling. I then discovered NumPy, a comprehensive numerical computation library for Python. The most noteworthy benefit of NumPy for me was that it changed the Python programming paradigm for data operations from procedural/scalar to vectorized. Rather than loops, iterators and list comprehensions, I could now work with vector operators, functions and methods. And classes built on top of NumPy extend its substantial capabilities with structures that provide R dataframe-like functionality. Indeed, much of my Python code now reads like R.
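To make the shift from scalar to vectorized code concrete, here is a minimal sketch. The order data and variable names are invented for illustration; only the NumPy calls themselves are real.

```python
import numpy as np

# Hypothetical data: unit prices and quantities for five orders.
prices = [19.99, 5.49, 102.00, 7.25, 42.10]
quantities = [3, 10, 1, 4, 2]

# Procedural/scalar style: a list comprehension handles one pair at a time.
totals_loop = [p * q for p, q in zip(prices, quantities)]

# Vectorized style: arithmetic applies to whole arrays at once, much as in R.
price_arr = np.array(prices)
qty_arr = np.array(quantities)
totals_vec = price_arr * qty_arr

# A boolean mask replaces an explicit filtering loop.
large_orders = totals_vec[totals_vec > 50]

print(totals_vec)
print(large_orders)
```

The last two statements are close to what an R user would write as `totals <- prices * quantities` and `totals[totals > 50]`, which is why this style of Python starts to read like R.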
Several Python packages also make it straightforward to interoperate between Python and R. PypeR connects to R within a Python script and allows data to be both pushed and pulled between the two environments. I can also execute R code in Python and thus divide analytics labor. If the computer gods are smiling and you’re able to get the Python package rpy2 installed, you have even more capabilities. I especially like invoking R functions on Python data.
I still do most of my statistical modeling in R, with its wealth of procedures and outstanding applied guides. There are performance limitations to R, however, that make the execution of certain models on even medium-sized data sets daunting. Python comes at least partially to the rescue here with its scikit-learn package of statistical learning algorithms. I’m indebted to my colleagues from the Illinois Institute of Technology for their recent encouragement to give the package a second chance.
So far, I like what I see from scikit-learn. Not surprisingly, it’s less mature than R (it has no model specification language, for example) but does show promise. And while I haven’t benchmarked relative performance, I suspect that scikit-learn outperforms R in many cases, especially for resource-consuming, re-sampling models. I use the Python versions of gradient boosting and random forests on models of a half million cases without hesitation, while I can’t recall running an equivalent R model on N > 200,000 with impunity.
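For readers who haven’t tried the package, here is a minimal sketch of fitting the two model types mentioned above. It is not my production code: the data are synthetic, the sizes are kept small so it runs quickly, the parameter values are arbitrary, and the imports follow the current scikit-learn module layout, which may differ from older releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real data set (assumption: binary classification).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative settings only; tune these for real work.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

# Fit each model and record its accuracy on the held-out cases.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, round(scores[name], 3))
```

Note that the models take a feature matrix and a response vector directly. There is no counterpart to an R formula such as `y ~ x1 + x2`, which is the missing model specification language noted above.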
The capability to interoperate between Python and R lets me temper the frustration of learning a new language. I can build a model in scikit-learn and then push a prediction matrix to R for visualizing with lattice. Once I’m comfortable with the equivalent graphics using Python’s matplotlib, I can easily make the switch back.
I suspect the Python-R collaboration will continue to grow, especially as APIs for big data evolve from Java MapReduce to more productive languages. Those willing to invest the effort to learn the powerful programming, data management, analytics, visualization and interoperability capabilities of R and Python will almost certainly be rewarded.