Open Thoughts on Analytics
for Information Management Blogs
JAN 3, 2012 8:54am ET

Data Science Skepticism

I don't think you'd get much argument from the data science community that the emerging field involves components of business, technology and statistical science. “Veteran” DS'ers will also note both inquisitive and skeptical dispositions as keys to success in the discipline.

LinkedIn's Monica Rogati observes that data scientists are at the intersection of Columbus and Columbo – “starry eyed explorers and skeptical detectives.” Amazon's John Rauser opines “A healthy dose of skepticism comprises the fourth dimension of the data scientist. If you have a healthy skepticism, you will look as hard for evidence that refutes your thesis as you will for evidence that confirms it.”

In a terrific article “Top Holiday Gifts For Data Scientists,” Cloudera co-founder and chief scientist Jeff Hammerbacher recommends a multitude of books, websites and software tools for the budding data scientist. Among his choices are the texts “Statistics as Principled Argument” and “Bias and Causation,” both of which encourage healthy skepticism in interpreting relationships from observational or non-experimental investigations. The latter details a “taxonomy of bias and its potential sources. It is a must read and constant reference for those designing survey studies and a reminder of cautions for those who must contend with study results and conclusions.”

Why the obsession with bias? Because data scientists generally work with messy observational data from which it can be difficult to prove that factor A caused outcome B. Does a high correlation between A and B indicate that A caused B? Or maybe that both A and B are caused by a confounding factor C? Or perhaps that A and B are spuriously related? In the absence of random sampling or random assignment to experimental groups, these questions can be nearly impossible to answer with certainty – hence the skepticism of good data scientists.
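A small simulation makes the confounding case concrete. The sketch below is purely illustrative: the variables, coefficients and sample size are invented, not drawn from any study. C drives both A and B, so A and B correlate strongly even though neither affects the other; once C is regressed out of each, the relationship largely disappears.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def residuals(y, x):
    """Residuals of y after a simple least-squares fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


# ASSUMPTION: made-up data-generating process, for illustration only.
rng = random.Random(2012)          # fixed seed so the run is repeatable
n = 5000
c = [rng.gauss(0, 1) for _ in range(n)]       # the confounder C
a = [2 * ci + rng.gauss(0, 1) for ci in c]    # A depends on C, not on B
b = [2 * ci + rng.gauss(0, 1) for ci in c]    # B depends on C, not on A

raw = pearson(a, b)
partial = pearson(residuals(a, c), residuals(b, c))

print(f"corr(A, B)               = {raw:.2f}")
print(f"corr(A, B) controlling C = {partial:.2f}")
```

The catch with real observational data is that C is often unmeasured or unknown, so this kind of adjustment is not always available.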

I often put on my cynic's hat when I review the results and interpretations of surveys in BI/analytics. And so it was with the recently published “Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field,” a survey of business intelligence and data science professionals conducted by EMC. It's not that I think the DSR study was poorly done; rather, I believe there are significant weaknesses in an online survey methodology that might bias the findings. Are the survey findings valid?

The first question a skeptical data scientist would ask is how representative the DSR data is of the population of BI and DS professionals it purports to describe. I'd love, for example, to know the demographics of the DSR sample. If it includes many more data scientists than business intelligence practitioners even though BI professionals currently dwarf data scientists in the work world, does that introduce bias?
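To see how much sample composition could matter, here is a rough sketch. The per-group rates are the approximate one-in-five and one-in-twenty open source usage figures the study reports; the sample and population shares are hypothetical numbers I made up, since the DSR demographics are exactly what I don't know.

```python
def pooled_rate(shares, rates):
    """Overall rate implied by group shares (summing to 1) and per-group rates."""
    return sum(shares[g] * rates[g] for g in shares)


# Approximate usage rates quoted from the survey: about one in five DS
# professionals and about one in twenty BI professionals.
rates = {"DS": 0.20, "BI": 0.05}

# ASSUMPTION: both sets of shares are invented for illustration.
sample_shares = {"DS": 0.60, "BI": 0.40}       # hypothetical survey mix
population_shares = {"DS": 0.10, "BI": 0.90}   # hypothetical work-world mix

unweighted = pooled_rate(sample_shares, rates)
reweighted = pooled_rate(population_shares, rates)

print(f"pooled rate with the sample mix:     {unweighted:.3f}")
print(f"pooled rate with the population mix: {reweighted:.3f}")
```

Under these made-up shares, any figure pooled across both groups would more than double simply because data scientists are over-represented. Reweighting only corrects the mix, though; it does nothing about bias in who responded within each group.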

Is the sample of 497 respondents large enough to detect small percentages? Does the fact that respondents choose whether or not to participate in any way bias the results? Might it be the case, for example, that those who consider themselves data scientists were more likely to complete the survey than those who identified as BI professionals? And could it be that the sexier title of data scientist is now the self-reported professional designation of choice for many BI professionals – regardless of the work they do? The data scientist must ask questions like these.
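The sample-size question, at least, has a back-of-the-envelope answer. The sketch below uses the textbook normal-approximation margin of error for a proportion and assumes a simple random sample, which an opt-in online survey is not, so treat the numbers as a best case rather than as a property of the DSR data.

```python
from math import sqrt

Z_95 = 1.96  # normal critical value for a two-sided 95% interval


def margin_of_error(p, n, z=Z_95):
    """Half-width of the normal-approximation interval for a proportion p."""
    return z * sqrt(p * (1 - p) / n)


n = 497  # respondents reported by the survey
for p in (0.50, 0.20, 0.05):
    moe = margin_of_error(p, n)
    print(f"observed share {p:.0%}: +/- {100 * moe:.1f} percentage points")
```

For the full sample, that works out to roughly plus or minus 4.4 points at a 50 percent share and about 1.9 points at 5 percent. Any DS-versus-BI comparison rests on smaller subgroups, so its margins are wider, and no sample size fixes the self-selection problem.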

I agree with DSR's declaration that data science is a young field, much as business intelligence was 20 years ago. About BI, the study notes: “As the field grew rapidly in the 90s, it also coalesced around a smaller number of tools, more consistent expectations for talent, better training, and more rigorous organizational standards. As our data demonstrates, data scientists are currently going through that transition.”

This disparity in maturity levels may explain some of DSR's findings. As an illustration, the observation “that data science professionals were over 2.5 times more likely to have a master’s degree, and over 9 times more likely to have a doctoral degree as business intelligence professionals” is probably an artifact of the relative maturity and size of BI in contrast to DS. Think back 20 years when BI was in its infancy. There was then a high percentage of advanced degrees among the small population of BI professionals as well. Recall the seminal work of Bill Inmon, Ralph Kimball and Claudia Imhoff – Ph.D.s all.

I don't buy DSR's assertion that “the data science toolkit is more varied and more technically sophisticated than the BI toolkit. While most BI professionals do their analysis and data processing in Excel, data science professionals are using SQL, advanced statistical packages, and NoSQL databases.” Huh? Excel as the primary BI data processing tool?  SQL for DS but not BI? Not.

And don't tell Tableau founder Pat Hanrahan that while “advanced visualization tools like Tableau are just starting to emerge in the data science world, they are almost unseen in the business intelligence world.” On the contrary, Tableau and kin Spotfire, Omniscope and QlikView are now inundating self-service BI, as Tableau's startup screen greeting “Fast analytics and rapid-fire business intelligence” attests.

That BI is more mature than DS probably suggests that BI professionals are, on average, older than their DS counterparts, many of whom started their data science careers just out of school. That could explain why “Open Source tools, like the R statistics package, Python, and Perl, are each used by one in five data science professionals, but around one in twenty BI professionals.” R, Python and Perl are languages many DS'ers learned in graduate school and brought with them to the work world. And while I'm a big fan of all three, I find it curious that the Data Management tool section doesn't include ETL stalwarts Informatica, DataStage, and Pentaho PDI. I don't think I'd choose to use Perl for a big data integration initiative in 2012.

While some of the findings of Data Science Revealed contrasting DS and BI give me heartburn, I'm pretty much in agreement with the survey's organizational implications. The admonition that DS professionals must be built rather than bought is spot on. Companies should “find practitioners with the intellectual curiosity and technical depth to solve big data problems, with academic concentrations in the hard sciences, statistics, and mathematics ... Rather than hiring for experience with a certain toolkit, companies should invest in on-the-job training with their chosen set of emerging technologies.” This aligns with OpenBI's strategy of hiring scientifically-inclined graduates for BI consulting, most of whom are not CS majors.

As DS matures, look for additional division of labor in the discipline, with sub-specialties evolving in the science of business, big data integration, statistical learning, visualization, user experience, et al. “Once companies have brought in the right talent, they need to create an environment conducive to effective data science. That means building high-performing, cross-functional teams that include a variety of roles, including programmers, statisticians, and graphic designers, and aligning them to directly support interested business decision makers.”

(To read Steve’s prior posts in this series, click here to read part 1, and click here to read part 2.)

Comments (3)
As a former statbrat and BI professional, now a data-driven marketer, I'll agree with your perspective that technically there is not much difference between the DS and BI professional except in the use of statistical packages (yes, Excel can do stats, but this is a layman's use for that type of work). Where I do think the distinction lies is in how certain the analytic outcome is and what use it is put to. BI more often than not answers the question of business and business process performance in a dashboard. DS takes it a step further to gain insight into more elusive aspects of the business and into prediction. What DS provides is not an exact answer but a possibility and a probability. BI is absolute, as it typically crunches the entire data set and tallies up.

In marketing it is A/B testing of email vs. factor factorial. A/B is easier, faster, and points in the right direction. Factor factorial lets me squeeze out the 1/2 percentage improvement on a wider B2C campaign that could translate into significant revenue increases. 9 times out of 10, A/B works just fine and BI wins. It is that perspective that is probably most impactful to the establishment of DS.

Posted by michele g | Tuesday, January 03 2012 at 1:03PM ET
Steve,

I am in agreement with you that being skeptical with a dash of cynicism is healthy. Bias can certainly lead to trying to prove preconceived notions.

I would like to add a different element. I believe data scientists can be leaders, not necessarily like the executive leaders at the top of the organization chart. I refer to the definition where leaders have followers. However, effective leadership requires periods of solitude, which I believe can aid in being skeptical.

"What does solitude have to do with leadership? Solitude means being alone, and leadership necessitates the presence of others - the people you're leading. When we think about leadership in American history we are likely to think of Washington, at the head of an army, or Lincoln, at the head of a nation, or King, at the head of a movement - people with multitudes behind them, looking to them for direction. And when we think of solitude, we are apt to think of Thoreau, a man alone in the woods, keeping a journal and communing with nature in silence."

Solitude allows one to be alone with one's thoughts. Arguably, solitude is crucial to carrying out the task of leadership and to being a data scientist. Everyone needs it for the chance to consider deeply the lasting improvements and skills their organization will need for sustained organizational performance improvement. These include exploiting the emerging practices of business analytics and deploying and integrating enterprise performance management methodologies such as strategy maps, scorecards, dashboards, risk management, activity-based costing, predictive analytics, rolling financial forecasts, and many others.

Data scientists need to take time to think and to first frame a problem before they start solving it.

Gary Cokins, SAS

Posted by Gary C | Tuesday, January 03 2012 at 1:36PM ET
Blog Archive for Steve Miller

In With the New RAM
Analytics: The Widening Divide
Omniscope and R
Out With the Old RAM ...
Applications of R in Business
