© 2019 SourceMedia. All rights reserved.

Defining Data Scientists & Their Tools

My thoughts of the day involve reactions to two blog entries. The first is titled “Data Scientists Must Also Be Research Methodology Scientists.” The second is “SAS vs. R (vs. Python) – which tool should I learn?” Here's my take on both.

The first, cited by Alex Liu from Research Methods and Data Science, references a blog posted by Informatics Professor Bill Hersh of the Oregon Health and Science University.

While Hersh is a big proponent of data science in health care, he chides clinical DS for both its bravado and its credulity – in essence scolding DS for not being scientifically rigorous in its research methods. “Whereas the academic and clinical operational types were cautious in their methods and results, the data scientists implied their techniques would revolutionize healthcare and threw around terms like ‘big data’ and ‘analytics’ at every turn.”

Moreover, “while there are many opportunities for using clinical data for research and analytics, we also must remember the limitations of such data.... The bottom line is that while data scientists may be able to generate interesting and important results with their methods, they must also understand basic principles of research science, such as inferential statistics, clinical significance, and cause and effect.”

Hersh Makes a Strong Case

While some of this divide can be explained by differences between business and academia, Hersh's points are well taken. I agree with his position and think it critical to put the scientific method first in data science work. I expressed reservations about the “correlation is good enough with big data” methodological mantra espoused by “Big Data: A Revolution That Will Transform How We Live, Work, and Think” in a piece posted a year and a half ago, and reiterated that thinking in a series of blogs several weeks back.
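To make the reservation about “correlation is good enough” concrete, here is a minimal sketch in Python. It is my own illustration, not anything taken from Hersh or from the book, and every name and number in it is hypothetical. It simulates two measurements that never influence each other but share a hidden common cause, then shows that they correlate strongly until that common cause is accounted for.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, seed=42):
    """Return (raw, adjusted) correlations for a confounded pair of variables."""
    rng = random.Random(seed)
    # Hidden confounder (hypothetical), e.g. overall severity of illness.
    z = [rng.gauss(0, 1) for _ in range(n)]
    # x and y each depend on z plus independent noise; neither affects the other.
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [zi + rng.gauss(0, 1) for zi in z]
    raw = pearson(x, y)
    # Subtract the confounder's contribution (its coefficient is 1 by
    # construction in this simulation), then correlate what remains.
    adjusted = pearson(
        [xi - zi for xi, zi in zip(x, z)],
        [yi - zi for yi, zi in zip(y, z)],
    )
    return raw, adjusted


raw, adjusted = simulate()
print(f"raw correlation: {raw:.2f}")
print(f"after adjusting for the confounder: {adjusted:.2f}")
```

The raw correlation comes out substantial while the adjusted one sits near zero. In real clinical data the confounder is rarely known, let alone measured, which is the reason a correlation alone should not be read as cause and effect.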

My take is that the data science field should aspire to “Think Like a Freak,” driven by the scientific method, which includes hypothesizing, measuring, testing, computing, analyzing and reporting. “When all is said and done, thinking like a freak means becoming more methodologically sophisticated, enhancing one's ability to generate and test hypotheses, unassailably link cause and effect in the social and scientific worlds, and persuade the public to accept conclusions they might well be inclined to reject. This same methodological rigor bridges the what-how gap of DS and answers the critics who believe data science is little more than the latest moniker for overly paid programmers.”

So my current composition of data science is roughly equal parts business acumen, scientific methodology, data, computation, statistics/machine learning, and storytelling. I also believe that academic science training combined with strong computation skills makes for a solid entry-level DS background.

Another Blog: SAS vs. R (vs. Python)

From the Data Mining, Statistics, Big Data, and Data Visualization group I came across an on-the-mark blog by Kunal Jain entitled “SAS vs. R (vs. Python) – which tool should I learn?”

There's no shortage of opinion in the blogosphere on the relative merits of SAS, R and Python for data science. Most are passionate for one camp or another. Many, like me, are pro R-Python and con SAS. Actually, having worked with SAS extensively for 20 years, my beef is more with its cost relative to open source R and Python than the software itself. Still, I acknowledge SAS's ubiquity in commercial analytics. And I admit I'm smitten with SAS-clone WPS as a collaborating statistical platform for R.

Over the last five years, I've experienced the generational rift between baby-boom SAS lifers and millennial Python-R aficionados. Given that divide, Jain's assessment is more thoughtful and even-handed than most. After reviewing his slides, though I didn't totally agree with Jain's scoring, I was hard-pressed to peg him as biased toward any of the language camps.

Keeping Score

As a point of departure, Jain's background assessment of SAS, R and Python squarely hits the mark: SAS is the legacy commercial analytics leader; open source R owns academia and research; and Python, with its wide swath of programming relevance and the addition of the highly productive numpy, pandas, scipy, statsmodels and scikit libraries, is fast becoming a serious analytics contender.

I especially like the author's three-way tie among the contenders for data handling. In most other reviews, R pales in comparison to both SAS and Python for data programming. R cognoscenti disagree, arguing R's at least the equal of SAS. I buy that view, but might give Python, the more robust language, a slight advantage over both SAS and R.

I agree with R's lead over SAS and Python in graphics, but would argue that Python is R's peer in tool advancements. Both Python and R offer community-developed packages outside the core languages that significantly enhance programmer productivity. And I'd say that, assessing support and community together, Python and R are now the equal of SAS.

My strongest area of disagreement with Jain concerns job trends, where he gives SAS a notable advantage over both R and Python. I suspect the Indian market where he works might be different from the US. In the States, my advice for budding data scientists is to focus on R-Python with their many add-on packages rather than SAS as data science languages of choice.
