Data Science – Part 2
(Editor's note: Click on the following link to read the first installment of Steve's blog, "Data Science - Part 1")
I was a bit taken aback on the first day of O'Reilly's Strata Conference on data science in early February. The crowd was decidedly younger than those I generally encounter at business intelligence conferences. And while the topics of conversation – big data, data integration, statistics and visualization – were similar, the products of focus were very different.
Instead of Oracle and Netezza, data storage buzz was about MapReduce/Hadoop and Cassandra. Instead of Informatica and Kettle, data integration discussion was on the languages Python and Ruby. Instead of OLAP and dashboards, analytics attention was on predictive models and machine learning. And instead of mature project organizations with well-specified roles, it seemed the data science teams were small, with a few jack-of-all-trades individuals handling most of the work.
I wasn't sure at first what to make of my observations. Trained as a statistician, I liked the practical focus of data science on both statistics and data integration. But having spent the last 25 years in decision support and business intelligence, I got the sense that DS was unappreciative of the history of intelligence in business.
It almost seemed that data science suffered from “not invented here” syndrome. So I determined right there and then to further investigate the differences between BI and DS. Fortunately, I was able to get a lot of help from the seminal article, “What is Data Science?,” by prolific O'Reilly Unix author and industry expert Mike Loukides.
From the get-go, Loukides distinguishes DS as “not just an application with data; it's a data product. Data science enables the creation of data products.” He notes Google's PageRank algorithm and Amazon's recommendation engine that exploits the exhaust of searches as examples of data science apps. This is certainly in contrast to BI, whose primary role is support of performance measurement.
A second difference is in data preparation. DS relies on “data conditioning” that includes mashups and “munging” manipulations with tools such as Perl, Python and Ruby, in contrast to the formal extract, transform and load (ETL) of data warehousing and BI. DS also appears to deal with missing and incongruous data more often than BI.
“In data science, what you have is frequently all you're going to get. It's usually impossible to get 'better' data, and you have no alternative but to work with the data at hand.” One mitigating factor: you can often just kill the missing data problem with sheer volume in DS. Relatedly, DS appears to lead BI in exploiting “approximate” answers. “Most data analysis is comparative: if you're asking whether sales to Northern Europe are increasing faster than sales to Southern Europe, you aren't concerned about the difference between 5.92 percent annual growth and 5.93 percent.”
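To make the "approximate answers" point concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the sales figures are synthetic, the function names `make_region` and `growth_rate` are hypothetical, and the missing values are assumed to be scattered at random. It simply drops incomplete records and compares regional growth on a small random sample versus the full data.

```python
import random

def make_region(n, mean_growth, rng):
    """Synthetic (last_year, this_year) sales pairs; about 10% lack this_year."""
    records = []
    for _ in range(n):
        last_year = rng.uniform(50, 150)
        this_year = last_year * rng.gauss(mean_growth, 0.05)
        if rng.random() < 0.10:
            this_year = None  # simulate a missing value (assumed random)
        records.append((last_year, this_year))
    return records

def growth_rate(records):
    """Aggregate growth over complete records; incomplete ones are dropped."""
    complete = [(a, b) for a, b in records if b is not None]
    before = sum(a for a, _ in complete)
    after = sum(b for _, b in complete)
    return (after - before) / before

rng = random.Random(42)  # fixed seed so the run is repeatable
north = make_region(100_000, 1.08, rng)  # assumed ~8% growth
south = make_region(100_000, 1.03, rng)  # assumed ~3% growth

# Comparative question: is the North growing faster than the South?
exact = growth_rate(north) > growth_rate(south)
approx = growth_rate(rng.sample(north, 1_000)) > growth_rate(rng.sample(south, 1_000))
print(exact, approx)
```

With a gap this wide between the two regions, a 1 percent sample should give the same comparative answer as the full data set. The sketch would not hold up if values were missing systematically rather than at random, or if the two growth rates were nearly equal.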
Although there are many very large data warehouses in the BI world, data science seems obsessed with handling “big data – when the size of the data itself becomes part of the problem.” For DS, this means the database structures that serve BI don't adequately scale for their problems. “Most of the organizations that have built data platforms have found it necessary to go beyond the relational database model. Traditional relational database systems stop being effective at this scale.”
In response, DS has begun to adopt a new breed of non-relational or NoSQL databases, such as Cassandra and HBase, that are modeled on pioneering big data work by Google, Amazon, Facebook and Yahoo. But Google's divide-and-conquer MapReduce, which can distribute large problems across mammoth computer clusters, seems to be getting the most attention in the DS space. The Hadoop open source implementation of MapReduce, in tandem with Amazon's cloud-centric EC2, “makes it much easier to put Hadoop to work without investing in racks of Linux machines.”
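For readers who haven't seen the divide-and-conquer idea behind MapReduce, a toy word count in plain Python gives the flavor. This is only a single-machine sketch of the pattern, not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are my own, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    # Each document could be mapped on a different machine; here it is a loop.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big clusters", "big problems"])
print(counts)  # {'big': 3, 'data': 1, 'clusters': 1, 'problems': 1}
```

Because each map call looks at one document and each reduce call looks at one key, the work can be split across as many machines as are available, which is the property that lets the approach scale.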
Interestingly, interfaces to these open source products have emerged from many BI vendors. The rise of these technologies has dramatically reduced turnaround times for many types of analysis, making it “possible to pursue intriguing possibilities that you'd otherwise have to drop for lack of time.”
The reports, dashboards and OLAP cubes of BI are often replaced by the experiments/inferential statistics, machine learning and optimizations of DS. For the latter, DS prefers the open source R Project for Statistical Computing, with hundreds of freely available add-on packages developed by many of the world's leading practitioners. Proprietary software from tech giants IBM, SAP, Oracle and Microsoft continues to lead the way in the BI world, with open source competitors like Pentaho and Jaspersoft on the rise. On the DS side, commercial companies look to add value to big open source projects.
Both BI and DS show explosive growth in the use of graphics and interactive visualization. One difference might be the DS emphasis on exploration of new relationships in contrast to the BI dashboarding focus of confirmed, existing patterns.
Former Facebook DS innovator Jeff Hammerbacher described a day in the life of a data scientist: “... on any given day, a team member could author a multistage processing pipeline in python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.”
LinkedIn DS head DJ Patil feels hard scientists such as physicists are most likely to meet those versatility challenges. I'd counter that quantitative social scientists should be DS-ready coming out of grad school. BI, on the other hand, is equally well served by students from computer science/information systems, mathematics and business. Regardless, you're unlikely to see Hammerbacher's jack-of-all-trades mentality as much in a mature BI organization.
Below is a summary of what I currently see as some of the differences between DS and BI. I consider the observations preliminary, more gray than black and white. My hope is that in time the “best practices” of BI will weigh positively on DS – and vice versa. Just as I think BI will benefit from a deeper focus on approximate answers and ubiquitous machine learning, data science should appreciate what's been learned over the years in BI on methodology and governance.
Readers, what do you think? BI or DS? Or both?