Open Thoughts on Analytics
for Information Management Blogs
DEC 13, 2011 9:01am ET

Data Science or BI? – Part 1

OpenBI held our all-hands quarterly meeting a few weeks back. Included in the full-day agenda were individual presentations on customer projects and up-and-coming technologies of interest to our BI team. For my talk, I chose to illustrate how a data scientist might conduct her work.

The idea was to assemble a data set from information readily accessible on the web that would be suitable for interactive visualization and predictive modeling. My searches took me to the website of the Current Population Survey, a joint data initiative between the bureaus of Census and Labor Statistics. There I discovered the annual CPS March Supplement files, which contain a treasure trove of information on the demographics, education, occupation and income of the U.S. population at annual points in time. So I decided to put together a soup-to-nuts analytics demo using the CPS data.

Starting with raw files from 2003 through 2010, I wrote a Ruby program to process each and assemble a final comma-delimited data set of over 575,000 individual records with attributes such as age, sex, race, education, occupation and income. I then piped those records into both the R statistical package and the visualization tool Omniscope from Visokio. In my presentation, I displayed relationships between age, sex and education with income using slick Omniscope trellis visuals, and then fit R machine learning models and lattice graphics to show “predicted” income as a function of age, education and health status. Since mine was the last presentation of the day, I'm not sure whether the smiling faces indicated enthusiasm for data science or excitement over the pending happy hour festivities.
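To make the assembly step concrete, here is a minimal Python sketch of the general idea: slice fixed-width records from several yearly files and write them to one comma-delimited data set. It is not the Ruby program described above. The column positions in `LAYOUT`, the function names and the sample lines are all made up for illustration and do not reflect the real CPS record layout.

```python
import csv
import io

# Hypothetical fixed-width layout: (field name, start, end) as 0-based slices.
# These positions are illustrative only, not the real CPS record layout.
LAYOUT = [
    ("age", 0, 2),
    ("sex", 2, 3),
    ("education", 3, 5),
    ("income", 5, 12),
]

def parse_record(line, year):
    """Slice one fixed-width line into a dict and tag it with its survey year."""
    row = {"year": year}
    for name, start, end in LAYOUT:
        row[name] = int(line[start:end])
    return row

def assemble(files_by_year, out):
    """Write every record from every year's lines to one CSV stream."""
    writer = csv.DictWriter(out, fieldnames=["year"] + [f[0] for f in LAYOUT])
    writer.writeheader()
    count = 0
    for year, lines in sorted(files_by_year.items()):
        for line in lines:
            if line.strip():  # skip blank lines
                writer.writerow(parse_record(line, year))
                count += 1
    return count

# Made-up sample lines standing in for two yearly raw files.
sample = {
    2003: ["341130045000", "522160072300"],
    2004: ["271120031000"],
}
buffer = io.StringIO()
n_rows = assemble(sample, buffer)
```

In a real run, `sample` would be replaced by the lines read from each year's raw file and `buffer` by an open output file; the flat, one-row-per-person result is the shape that tools like R and Omniscope can load directly.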

After the presentation, one of my partners proposed that we consider using the CPS data as a basis for a formal OpenBI demo that would serve as training for new BI staff. But rather than Ruby to assemble the data, he mused, how about using Pentaho Data Integration (PDI)? Rather than a CSV file as the ultimate data store, why not use VectorWise or LucidDB as an analytical database? Rather than storing the data in a denormalized, flattened format that's needed for visualization and statistics, why not deploy a star schema design instead?
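The contrast between a flattened file and a star schema can be sketched in a few lines. The example below uses Python's built-in `sqlite3` module purely as a stand-in for an analytical database such as VectorWise or LucidDB. The table names, column names and sample rows are assumptions made for illustration, not the design of the actual demo.

```python
import sqlite3

# In-memory SQLite database standing in for an analytical database.
conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one fact table keyed to two small dimension tables.
conn.executescript("""
CREATE TABLE dim_education (education_key INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE dim_sex (sex_key INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE fact_person (
    year INTEGER,
    age INTEGER,
    education_key INTEGER REFERENCES dim_education(education_key),
    sex_key INTEGER REFERENCES dim_sex(sex_key),
    income REAL
);
""")

# Made-up dimension members and fact rows.
conn.executemany("INSERT INTO dim_education VALUES (?, ?)",
                 [(1, "High school"), (2, "Bachelor's")])
conn.executemany("INSERT INTO dim_sex VALUES (?, ?)",
                 [(1, "Male"), (2, "Female")])
conn.executemany("INSERT INTO fact_person VALUES (?, ?, ?, ?, ?)", [
    (2003, 34, 1, 1, 45000.0),
    (2003, 52, 2, 2, 72300.0),
    (2004, 27, 1, 1, 31000.0),
])

# Join the fact table to a dimension to report average income by education.
rows = conn.execute("""
SELECT e.label, AVG(f.income)
FROM fact_person AS f
JOIN dim_education AS e ON e.education_key = f.education_key
GROUP BY e.label
ORDER BY e.label
""").fetchall()
```

Joining the fact table back to its dimensions reproduces the denormalized rows that visualization and statistics tools expect, so the two storage styles are different views of the same data.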

While we're at it, why not create an OLAP cube from the database for drilling into, and slicing and dicing, average income by age, education, race and health status? And why not use the database and PDI to source Omniscope and R? Finally, why not deploy Revolution Analytics DeployR to integrate the R predictive models with Pentaho reports? In short, why not use BI foundational technologies to support the data science tasks?
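What an OLAP cube delivers here, average income for any chosen combination of dimensions, can be approximated in plain Python. The sketch below is only an illustration of that rollup idea, not an OLAP engine; the helper names `age_band` and `average_income`, the ten-year age bands and the sample records are all assumptions.

```python
from collections import defaultdict

def age_band(age):
    """Bucket an age into a ten-year band such as '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def average_income(records, dimensions):
    """Average income for each combination of the requested dimensions."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [income sum, row count]
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        totals[key][0] += rec["income"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Made-up person records.
records = [
    {"age": 34, "education": "High school", "income": 45000},
    {"age": 38, "education": "Bachelor's", "income": 60000},
    {"age": 52, "education": "Bachelor's", "income": 72000},
]
for rec in records:
    rec["age_band"] = age_band(rec["age"])

# "Slice" by one dimension, then "drill down" by adding a second.
by_education = average_income(records, ["education"])
by_band_and_education = average_income(records, ["age_band", "education"])
```

Adding or removing a name in the `dimensions` list is the rough equivalent of drilling down or rolling up in a cube; a real OLAP server would precompute or index these aggregates rather than rescanning the rows each time.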

My partner's suggestions made sense and got me thinking about the distinctions between data science and BI. Back in the spring, I wrote a series of articles on DS for Information Management. One delineation I noted then, attributable to statistician and R user group leader Mike Driscoll, argues that statistical science and data manipulation are central to the conduct of data science. As a third critical emphasis, he cites visualization. For Driscoll, it's statistics for studying data, data “munging” (hacking) for suffering with data, and visualization for storytelling with data.

Based on Driscoll's definition, I'd say my CPS demo clearly qualifies as DS – I munge the data using Ruby; I storytell with the data using Omniscope; and I study the data using R. And yet my partner sees the same set of tasks through a BI lens: ETL, relational database, OLAP and statistical models. I think both of us are right. The CPS work can simultaneously be seen as both BI and data science.

Since I posted those blogs six months ago, there's been an explosion of new articles purporting to define data science, several of which give me considerable heartburn. Next week I'll give my take on the similarities and differences between BI and data science. Warning: my point of departure is that the two are more similar than they are different – and that each can learn from the other to the ultimate benefit of “competing on analytics.”

Comments (4)
Steve, I totally agree with your point of departure for the future discussion on Data Science and Business Intelligence. As scientific analytical software and the tools that incorporate it become more prevalent within the enterprise, it is going to be very difficult to distinguish between the two ideologies. I believe that the drive over the next year to discover insight into larger and larger data sets is going to lead to a greater requirement for more scientific insight in the business intelligence arena. Whether that leads to a combination of the two roles I doubt, but it will certainly bring them closer together.
Posted by Peter E | Tuesday, December 13 2011 at 12:06PM ET
Steve, enjoyed your article. It helped me to clarify my definition of a data scientist as one skilled in statistics as well as data modeling. I think a data scientist understands the data on a level where they can influence the database design so that optimal queries can be performed. To tell the story, a good data visualization tool should be used (either by the data scientist or the Business Intelligence analyst), and I lean towards QlikView for visualizations. I also agree that a star schema works best. OLAP cubes make for faster queries and better slicing and dicing. Thanks again for the data scientist clarification and contrast with BI.
Posted by Jeff R | Tuesday, December 13 2011 at 4:07PM ET