I came across an analytics article the other day that made me pause. The author quoted a respected industry executive as saying: "Algorithmic business is here ... Big data is not where the value is ... Algorithms are where the action lies. Data is dumb. Algorithms define the way the world works."

The author himself then opines that "Big data is looking increasingly like a backend plumbing topic. If data science can't find the signals and real business use, all you're going to have is a big lake of information."

I don't disagree with these observations, but am compelled to add a big “but ...”.

The commentary is certainly correct that the infrastructure around big data is becoming a solved problem, and that effective management of big data will soon be a non-differentiator. At the same time, other issues with the data of data science continue to be cause for concern.

I gave a talk a while back where I defined data science as the sum of: Business + Data + Designs + Algorithms + Analytics + Communication.  Assume for now that any depiction of DS includes Business and Communication, and that Analytics can be subsumed under Algorithms. Data Science then reduces to Data + Designs + Algorithms.

My experience with the Data of DS is that infrastructure challenges of size and organization are solvable. Much more problematic is data quality. Missing and error-plagued data are often the rule rather than the exception. In fact, I cannot remember the last DS project I was associated with that didn't present serious challenges with ugly, invalid data. The more data noise, the less algorithmic signal.

In his defining article on data science a few years back, Mike Loukides emphasizes the messiness of DS data, which must be “conditioned” and “imputed” to produce data sets that can profit from analytics. Without those pre-algorithm cleanups, the analytics would likely find mostly noise.
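To make “conditioned” and “imputed” concrete, here's a minimal pandas sketch of that pre-algorithm cleanup. The file and column names (survey.csv, age, income, segment) are hypothetical placeholders, and median/mode imputation is only one of many defensible choices:

    import pandas as pd

    # Hypothetical messy extract; file and column names are illustrative.
    df = pd.read_csv("survey.csv")

    # Drop records that are beyond repair: every field blank.
    df = df.dropna(how="all")

    # Coerce a dirty numeric field; unparseable entries become NaN.
    df["income"] = pd.to_numeric(df["income"], errors="coerce")

    # Treat implausible values as missing rather than letting them pass as signal.
    df["age"] = df["age"].mask((df["age"] < 0) | (df["age"] > 110))

    # Simple imputations: median for a skewed numeric, mode for a categorical.
    df["income"] = df["income"].fillna(df["income"].median())
    df["segment"] = df["segment"].fillna(df["segment"].mode().iloc[0])

Each of those steps encodes a judgment call about what the data should look like, which is exactly why this work can't be waved off as backend plumbing.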

New York University Data Science professor Neal Walk agrees on the challenges of messy data, noting “I think what distinguishes data science from statistics is a real appreciation for exciting new sources of data and a willingness to deal with the very messy problems of such data. The trick is to not lose all the good done by statisticians (understanding causality and uncertainty) while devising methods to deal with this very messy data.”

My persistent take is that the acquisition, conditioning, and curation of data are collectively the single most significant challenge in data science.

In their book, Big Data: A Revolution That Will Transform How We Live, Work, and Think, authors Viktor Mayer-Schönberger and Kenneth Cukier cite three developments that are transforming the data world: 1) the ability to collect and analyze incredibly large data stores (N = all); 2) tolerance for messy data; and 3) the transition from the experimental method’s hunger for cause and effect to big data’s tolerance for the much less rigorous correlation.

I certainly agree with 1) and 2), but must reject 3), Big Data's acceptance of correlation in lieu of causation. Though causation is impossible to prove definitively, it’s the data scientist’s responsibility to “build a story around the data” that demonstrates the case. Skepticism, an attitude that assumes many plausible explanations and demands a relentless methodological attack on proposed theories, must be a tenet of the DS's approach.

Skepticism in Data Science is enabled by theory-testing designs that systematically eliminate rival plausible explanations. The platinum design is the experiment, in which subjects are assigned randomly to treatment and control groups. With random assignment, potentially confounding explanations for outcome differences are balanced across groups, so those differences can credibly be attributed to the treatment.
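A small simulation makes the point. In the sketch below (all numbers invented for illustration), treatment is assigned at random, so a lurking confounder affects both groups equally and a simple difference in means recovers the true effect:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    # A lurking variable that influences the outcome for everyone.
    confounder = rng.normal(size=n)

    # Random assignment: treatment is independent of the confounder.
    treated = rng.integers(0, 2, size=n).astype(bool)

    # True treatment effect is 2.0; noise and confounding are layered on top.
    outcome = 2.0 * treated + 1.5 * confounder + rng.normal(size=n)

    # Because assignment was random, the naive difference in means is ~2.0.
    print(outcome[treated].mean() - outcome[~treated].mean())

Had treatment instead been correlated with the confounder, that same difference in means would have been biased, and the skeptic's alternative explanations would remain alive.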

The data scientist has other options for cases where randomized experiments aren't feasible. Less rigorous but still helpful, quasi-experiments can be important skeptic's tools. For example, there's much methodological benefit in comparing the outcomes of two natural groups before and after an intervention. And where pre-intervention measurement isn't feasible, methods that match treatment and non-treatment groups on confounding variables, with statistical adjustment after the data are collected, can also be fruitful.
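The first of those quasi-experiments is essentially a difference-in-differences design. The toy sketch below (data invented, and resting on the standard parallel-trends assumption) nets out the trend shared by both groups and attributes the remainder to the intervention:

    import pandas as pd

    # Hypothetical panel: two natural groups, observed pre and post intervention.
    df = pd.DataFrame({
        "group":   ["treated"] * 4 + ["control"] * 4,
        "period":  ["pre", "pre", "post", "post"] * 2,
        "outcome": [10.0, 11.0, 16.0, 17.0,   # treated group rises by ~6
                    10.0, 12.0, 12.0, 14.0],  # control group rises by ~2
    })

    means = df.groupby(["group", "period"])["outcome"].mean()
    treated_change = means["treated", "post"] - means["treated", "pre"]
    control_change = means["control", "post"] - means["control", "pre"]

    # Difference-in-differences: the ~4-unit residual change is the estimated
    # effect, assuming both groups would have trended alike absent treatment.
    print(treated_change - control_change)

The matching approach plays the same skeptical role after the fact: propensity-score or covariate matching constructs comparable groups when neither randomization nor pre-intervention measures are available.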

NYU's Walk summarizes: “We have all these great new applications, whether it be using tax returns to measure inequality, looking at how people are moving around, on a daily or permanent basis, or something like NYU’s Center for Urban Science and Progress’ metered city project. Public policy is also taking advantage of new ideas in research design, whether in the use of experiments or quasi-experiments. So many of the courses stress what we can learn from data, but also, how we learn from data.”

The importance of designs for attempting to assess causality is being recognized in data science academia. Both NYU and Berkeley include design courses in the core curricula for their MS in Data Science programs.

My big “but ...”?  For this data scientist, an obsession with data quality and rigorous causal design is every bit as important as the relentless pursuit of the best algorithm.
