Data Science – Part 1
He may not have invented “data science,” but Berkeley professor and Google Chief Economist Hal Varian certainly gave the discipline a jolt of credibility with his now oft-repeated October 2008 quote:
"I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it."
Varian went on to ground data science as a statistical discipline, but was quick to point out there's more to it than that, adding the essential tasks of visualization, communication and utilization: “I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills – of being able to access, understand, and communicate the insights you get from data analysis – are going to be extremely important. Managers need to be able to access and understand the data themselves.”
There's been a lot written recently to frame the boundaries of data science, much of it by statisticians who've migrated to business. Revolution Analytics CTO David Champagne, in a January 2011 article, borrowed from NYU political and data scientist Drew Conway to conceptualize data science as sitting at the confluence of statistical science, “computer hacking” and substantive expertise.
Conway weighs in on the topic with “Data Science in the U.S. Intelligence Community,” in which he argues that “the data science movement is about the people and the tools [that] drive innovation and promote discovery.” He proceeds to articulate the “three primary areas of expertise needed to be a successful data scientist.”
“First, one must have hacking skills …(which) in this context mean proficiency working with large, unstructured chunks of electronic data … Second, one needs a basic understanding of mathematics and statistics … Finally, and perhaps most importantly, a data scientist must have some substantive expertise in the data being analyzed.” Conway's Venn diagram of the intersections of the hacking, statistical and substantive expertise components is illuminating.
The ability to improvise data integration solutions using operating systems, databases and programming languages clearly distinguishes data science from traditional research statistics. According to Michael Elashoff, director of biostatistics at CardioDx, the term "data scientist" is really "more of an acknowledgment that people in this field need multiple types of expertise. It recognizes the fact that looking at data requires more than just analytic skill."
Perhaps my curriculum for an M.S. in Applied Statistics, attentive to the skills of both data conditioning and statistical reasoning I've found pertinent over the years, could be a starting point for budding data scientists.
Statistician and R user group leader Mike Driscoll agrees with Champagne and Conway that statistical and data manipulation foci are central to data science. As a third area, he cites visualization. For Driscoll, it's statistics for studying data, data “munging” (his term for hacking) for suffering with data, and visualization for storytelling with data.
UCLA statistician, visualization expert and infographics website author Nathan Yau opines that it's the combination of disparate skills that sets the great data scientists apart. “Why is their work always of such high quality? Because they're not just students of computer science, math, statistics, or graphic design. They have a combination of skills that not just makes independent work easier and quicker; it makes collaboration more exciting and opens up possibilities in what can be done.”
Yau cites the pioneering research on computational information design by MIT's Ben Fry as a methodology template for the soup-to-nuts charter of data science. Fry's point of departure is a series of well-defined disciplines that unfortunately do not communicate with one another.
“In an attempt to gain better understanding of data, fields such as information visualization, data mining and graphic design are employed, each solving an isolated part of the specific problem, but failing in a broader sense: there are too many unsolved problems in the visualization of complex data. As a solution, this dissertation proposes that the individual fields be brought together as part of a singular process titled Computational Information Design.”
Fry conceptualizes CID as a sequence that begins with the computer science tasks of acquiring and parsing data, followed by the math/stat work of filtering and mining. At this point, a storytelling focus kicks in: first the graphic design steps of represent and refine, followed finally by human-computer interaction. It's not a stretch to see Fry's CID as an elaboration of Conway's data science.
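For readers who think in code, here is a minimal Python sketch of how Fry's sequence might look as a chain of small functions. It is only an illustration under my own assumptions, not Fry's implementation: the function names, the tiny inline dataset and the text bar chart are all hypothetical, the "refine" step is reduced to flagging above-average values, and the final interaction step is left out entirely.

```python
import csv
import io
import statistics

# Hypothetical inline data standing in for a real source (file, database, web service).
RAW = """city,temp_c
Boston,12.5
Chicago,9.0
Denver,n/a
Austin,24.0
"""


def acquire():
    """Acquire: obtain the raw data (here, an in-memory text stream)."""
    return io.StringIO(RAW)


def parse(stream):
    """Parse: give the raw text structure as a list of dict rows."""
    return list(csv.DictReader(stream))


def filter_rows(rows):
    """Filter: keep only rows whose temperature is a valid number."""
    kept = []
    for row in rows:
        try:
            kept.append((row["city"], float(row["temp_c"])))
        except ValueError:
            continue  # drop unusable records such as "n/a"
    return kept


def mine(records):
    """Mine: apply a simple statistic, here the mean temperature."""
    return statistics.mean(temp for _, temp in records)


def represent(records, mean_temp, width=30):
    """Represent (with a crude refine): a text bar per city, flagging values above the mean."""
    top = max(temp for _, temp in records)
    lines = []
    for city, temp in records:
        bar = "#" * round(width * temp / top)
        marker = " (above mean)" if temp > mean_temp else ""
        lines.append(f"{city:<8} {bar} {temp:.1f}{marker}")
    return lines


def run_pipeline():
    """Chain the stages in the order the text describes."""
    records = filter_rows(parse(acquire()))
    mean_temp = mine(records)
    return records, mean_temp, represent(records, mean_temp)


if __name__ == "__main__":
    _, _, chart = run_pipeline()
    print("\n".join(chart))
```

The point of the sketch is the hand-off: each stage consumes what the previous one produced, which is the integration Fry argues the separate disciplines lack.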
So it seems there's at least some consensus on data science as a discipline driven by statistics, with healthy doses of practical computer and data management and a dash of substantive business expertise – all topped off by the storytelling benefits of visualization and human-computer interaction. Next week, I'll take a more detailed look at the data science manifesto of Mike Loukides and offer my thoughts on how the “new” field relates to business intelligence.