I can’t get enough of Python and R. My last blog of 2013 extolled Python; I wrote a flattering R-Python piece several months ago; and I’ve authored countless articles on R over the years. It’s safe to say Python and R are my favorite programming languages.
And count me all in as a promoter of the Python-R combo for data science. Both open source platforms share a large, enthusiastic and growing community of developers busy advancing functionality. Python’s my choice for more generic, agile development and data wrangling, while R’s the pick for statistical analysis/graphics and machine learning. There’s overlap in the broad area of data analysis, which combines components of wrangling, data management/manipulation, numerical computation, elementary statistical functions and graphics.
Developing thought by some practitioners, though, suggests that Python will soon supplant R and assume the mantle of lingua franca for data science computing. The reasoning is as follows: “While R has traditionally been the programming language of choice for data scientists, it is quickly ceding ground to Python ... there are several reasons for the shift, perhaps the biggest one is that Python is general purpose and comparatively easy to learn whereas R remains a somewhat complex programming environment to master ... Python still lacks some of R's richness for data analytics, but it is closing the gap fast.”
“Data science is moving out of the realm of the alpha geeks, something that was clearly evident at O'Reilly's Strata conference in New York last month. PhDs used to haunt its halls. Now mortal business analysts and others, tasked by their enterprises to figure out Big Data, made up the majority of attendees. This new, early majority of "data scientists" is far more likely to use Python than R. It's comparatively simple to use, and they've likely been able to use it in another project already. As in other markets, the tool you already know or is easy to learn is far more likely to win than the powerful-but-complex tool you'd really rather avoid if possible.”
Several of these points give me angst. I wasn’t at the most recent Strata in NYC, but did attend last year, and have participated in all the Santa Clara “Stratas.” I agree that the mix of Strata attendees is evolving from younger and geeky to older and “businessy,” as might be expected with data science becoming more mainstream. I don’t, however, buy the declaration that the new majority of data scientists are “far more likely to use Python than R,” as if the business crowd prefers Python. The overflow gathering of R DS’ers at Michael Bailey’s Strata discussion of forecasting a year ago suggests R’s doing quite well with the DS folks, thank you.
Python may well erode R’s statistical hegemony in time, but it’d be a bad decision to go all-in on Python for DS right now. Much better, in my view, would be to exploit the complementary strengths of both platforms. Python’s preferred for general development, munging and wrangling; R’s the choice for its breadth of statistical and machine learning functions. For data analysis, that wide territory between wrangling and statistical analysis, the choice can be either R or the splendid Python library pandas to handle “the nuts and bolts of manipulating, processing, cleaning and crunching data.”
Alas, Python for computation suffers from the same memory-limitation constraints as R. Yes, Python’s more efficient, and pandas’ optimizations promote better-than-R data management/manipulation performance, but the memory ceiling’s still there. And Python’s graphical, statistical and machine learning libraries, while progressing at a rapid pace, still lag behind those of R in number and sophistication.
Perhaps just as important, as Python evolves for numerical computation, data analysis and statistics/machine learning, its programming metaphor is starting to look more like R than the Python 2.0 I learned years ago -- in large part because significant new capabilities are being provided outside core Python libraries. As I noted in my last blog, the addition of array-handling capabilities in numpy and pandas has fundamentally changed Python computational programming. Programmers who haven’t kept up will discover a far different language for data science than the one they mastered years ago. And a learning curve that may be steeper than they suspect.
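To make the shift concrete, here is a minimal sketch of the same small calculation written first in loop-oriented core Python and then with numpy arrays. The prices and quantities are made-up illustration values, not data from my tests, and the R equivalents in the comments are only there for comparison.

```python
import numpy as np

# Hypothetical illustration data, not taken from any real example.
prices = [10.0, 12.5, 9.0, 11.0]
quantities = [3, 1, 4, 2]

# Core-Python style: walk the lists element by element.
revenue_loop = []
for p, q in zip(prices, quantities):
    revenue_loop.append(p * q)

# numpy style: arithmetic on whole arrays at once,
# much like R's  prices * quantities
revenue_vec = np.array(prices) * np.array(quantities)

# Boolean masking to filter, much like R's  revenue[revenue > 20]
large = revenue_vec[revenue_vec > 20]

print(revenue_loop)
print(revenue_vec)
print(large)
```

Both versions compute the same numbers; what changes is the idiom. The second reads far closer to R's vectorized style than to the loop a Python programmer of a decade ago would have written.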
I’m currently in the process of porting several soup-to-nuts R examples I’ve evolved over the years to a comparable 2014 Python environment. When I’m done with my tests in a few weeks, I’ll write an article contrasting my experience using the numpy, scipy, pandas, patsy, statsmodels, sklearn and ggplot libraries with similar package capabilities in R.
A sneak peek of what I’m seeing? The resulting Python code with these libraries looks as much like R as it does core Python. This shouldn’t be surprising: pandas data frames come directly from R; patsy was introduced to mimic the R model specification language; and Python’s ggplot is an early adaptation of the ever-popular ggplot2 package in R.
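As a rough illustration of that resemblance, the sketch below runs a few everyday pandas data frame operations, with an approximate R counterpart noted beside each. The tiny data frame and its column names are hypothetical, invented only for this comparison, and the R lines are loose equivalents rather than exact translations.

```python
import pandas as pd

# Hypothetical toy data frame; the columns are made up for illustration.
df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "x": [1.0, 2.0, 3.0, 4.0],
})

# Derive a new column from an existing one.
# R: df$y <- 2 * df$x + 1
df["y"] = 2 * df["x"] + 1

# Keep only the rows that satisfy a condition.
# R: df[df$x > 1.5, ]
subset = df[df["x"] > 1.5]

# Mean of y within each group.
# R: aggregate(y ~ group, data = df, FUN = mean)
means = df.groupby("group")["y"].mean()

print(df)
print(subset)
print(means)
```

Column-wise arithmetic, logical subsetting and grouped summaries are the bread and butter of R data frames, and the pandas versions follow the same pattern nearly line for line.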
For the foreseeable future, Python and R will co-exist as primary languages of data science. Python-averse analysts obsessed with R shouldn’t weep and gnash teeth just yet but should certainly start down a modern Python learning path if they haven’t already. At the same time, Python programmers from a decade ago must learn a very different environment as they adapt to Python for data science. Modern Python for statistical computation looks very much like R, which, while a challenge for some, is likely a benefit for data science.