Data analytics is a pretty mature discipline; we’ve been using analysis and statistics to make sense of data for a very long time. In recent times, SQL-based analytics—things like calculating sums and averages over different groupings of data—have been foundational for organizational strategy and competitive edge, powering dashboards and forming the basis of KPIs (Key Performance Indicators).
Data science, by contrast, is the newer kid on the block. What's the difference between data science and traditional SQL-based analytics? Who needs data science anyway?
Well, you do, probably—that is, if you want to analyze data at scale, if you have unstructured or complex data, if you want to ask open-ended questions of your data, if you want to develop predictive models that produce new insights from your data, or if you don’t know yet what your data has to tell you.
SQL-based analytics work very well so long as your data can be easily represented in a relational data model (typically tables with columns that can be linked) and you are interested in asking questions about statistics and properties of the data. The database scans the rows, picks out the ones that match your query, and returns them as your answer. That's extremely useful, provided you can represent your data in tables.
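As a concrete illustration of the grouped sums and averages described above, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are hypothetical, invented just for this example.

```python
# A toy SQL aggregation: sums and averages over groupings,
# the bread and butter of SQL-based analytics.
# The "sales" table and its columns are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 250.0), ("west", 75.0), ("west", 125.0)],
)

# The database scans the rows, groups them, and returns the aggregates.
rows = conn.execute(
    "SELECT region, SUM(amount), AVG(amount) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
# rows -> [('east', 350.0, 175.0), ('west', 200.0, 100.0)]
```

This works beautifully as long as the data fits into rows and columns, which is exactly the constraint the next paragraphs push against.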
But what if you want to work with the full, rich, messy glory of a huge variety of data, not easily represented as a relational model?
That was the challenge for the City of Chicago, which wanted to integrate geospatially tagged 311 reports, 911 calls, public tweets, emergency operations data, video feeds from surveillance cameras, and city bus location data, and to pair analytics on this data with maps to build a visual, intelligent, unified operational view of the city.
You can really only do this effectively with big data and a distributed programming framework like Apache Spark. Spark lets you write programs to flexibly transform your big data into structures compatible with doing SQL analytics—or to perform completely different analytics, like natural language processing, dimensionality reduction, or feature extraction, which are difficult or impossible to do with SQL queries.
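The kind of transformation described above can be sketched in plain Python. On real big data you would express the same map-and-filter pipeline with Spark (for example, PySpark operations distributed across a cluster), but the shape of the program is similar. The record format here is entirely hypothetical, a stand-in for messy, semi-structured input like 311 reports.

```python
# A toy sketch: turning messy, semi-structured records (fake 311-style
# report strings) into structured rows suitable for SQL-style analytics.
# Plain Python stands in for Spark here; on a cluster the same map/filter
# pipeline would run in parallel. The record format is invented.
from collections import Counter

raw_reports = [
    "2021-03-01|pothole|41.88,-87.63",
    "2021-03-01|streetlight|41.90,-87.65",
    "MALFORMED RECORD",
    "2021-03-02|pothole|41.85,-87.62",
]

def parse(record):
    """Map a raw string to a structured dict, or None if it can't be parsed."""
    parts = record.split("|")
    if len(parts) != 3:
        return None
    date, category, latlon = parts
    lat, lon = (float(x) for x in latlon.split(","))
    return {"date": date, "category": category, "lat": lat, "lon": lon}

# map + filter: the pattern a distributed framework parallelizes for you.
structured = [row for row in map(parse, raw_reports) if row is not None]

# Once structured, the data supports grouped counts, joins, and so on.
counts = Counter(row["category"] for row in structured)
# counts -> Counter({'pothole': 2, 'streetlight': 1})
```

The point is that the transformation itself is a program you write, not a fixed schema you must fit the data into up front.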
SQL-based systems also struggle with scale. A traditional data warehouse is really good at doing analytics on a month’s worth of data. But if you want to mine five years’ worth of data, or if you want to, say, pull specific details out of millions of electrocardiograms so that you can find patients with a specific condition—for that you’ll need big data and distributed computing.
Spark with MLlib will also let you use machine learning for predictive analysis or discovery—something that's very difficult to do with SQL analytics. When data is constrained by relational models, you can't easily, as the City of Houston did, build a model that predicts storm surge and hurricane damage and link it to sociodemographic data to discover which populations suffered more from flooding during Tropical Storm Allison, and why.
Or link years of public opinion surveys with student performance data and health data and connect the dots to discover how health challenges are impacting educational outcomes. (It turns out asthma is a problem.)
Data science models on top of big data are very fast to develop relative to SQL analytics. In SQL-based systems, it can take months to analyze the data, months to build the analytical models, and then months more to modify them if you need to change them down the road. With Spark, you can be exploring your data and deriving insight in a matter of hours. The result doesn’t make sense? Change the program in minutes rather than months.
All this is not to say you need to throw away all your old analytics models. SQL is not going anywhere soon, and your data solution does not have to be “one size fits all.” You can use big data in combination with incumbent data technologies to form a “data pipeline,” in which data can span a number of platforms and applications.
But the power and flexibility of something like Spark with MLlib, the scale at which you can analyze data, the types of questions you can ask as compared to a traditional BI tool, the difference in the types of deep understanding you can pull out, and the ease and speed with which you can do it—that’s disruptive.
And that’s why data science on top of big data is transforming the nature of analytics.
(About the author: Roy Wilds is chief data scientist at PHEMI Systems, a big data solutions company. Roy has led data science teams for multiple organizations and has advanced knowledge in machine learning theory, Python, R, and SQL, and substantial expertise using Hadoop's distributed technologies.)