You think you know what’s in your data. But there may be a lot more there than you realize.
The combination of big data and modern data science can empower you to ask questions in entirely new ways, and uncover answers locked away in your data to questions you hadn’t thought to ask. But how do you get at those insights? The answer has two parts.
First, you need to move beyond traditional relational or Excel-based analytics and embrace the power of big data systems to wring useful information from hard-to-analyze sources. Traditional analytics are great for gleaning insights from data that can be represented in simple data models. But if you want to work with complex or “messier” data (like text, audio files and social media), or huge data sources (like genomic data, clinical images and decades-long studies), you’re going to want to harness the power of big data systems and distributed computing.
Next, you’ll need to use data science to wrangle your data to get more meaning out of it. When you do, you can start to connect the dots in surprising ways. You can move from descriptive analytics to predictive and prescriptive data models that transform your organization.
We work with many customers who are new to big data. They don’t want to invest the time to become Hadoop experts, or learn all about Impala and Pig. But one thing they completely understand is how much information is locked up in text. Whether it’s transcripts of customer service calls, physician notes, or comments on Twitter and Facebook, there are vast quantities of relevant data in documents.
How do you do research on that information? How do you mine it for similarities? In many cases, you can’t, because you have to open those documents up one at a time. Big data systems with distributed programming frameworks like Apache Spark can help transform that mass of text (as well as all sorts of other complex data, like sensor readings and clinical images) into structures compatible with analytics.
Applying Data Science
Once you’ve got that complex data into a format you can use, what then? Continuing the documents example, a lot of our customers want to run that data through natural language processing (NLP) to get at the insights, but it’s not that simple. You’ll need a large helping of data cleansing and wrangling, and a pinch of modeling to make it useful.
Imagine a cloud hosting company. They want to analyze all of their customer service interactions—web, phone, social media—to see if there are specific areas where they keep making mistakes. They can start by running a frequency plot of word counts on anything labeled “complaint,” and lo and behold, they’ll find several spikes. But what are they?
Usually, just the most commonly used terms (in this case, “server,” “host,” etc.), which don’t correlate in any meaningful way with positive or negative customer experiences. So they filter those terms out. Then they apply NLP preprocessing, further cleansing the data with tokenization, stop-word removal, and lemmatization.
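The cleansing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the complaint texts, the stop-word list, the domain-term list and the tiny lemma lookup table are all made up for the example, and a real project would use a proper NLP library in place of the hand-written lookup.

```python
import re
from collections import Counter

# Hypothetical complaint texts; the article does not supply real data.
complaints = [
    "The server was down again and billing charged me twice.",
    "Host migration failed; my servers were unreachable for hours.",
    "Billing charges on my invoice were wrong after the server upgrade.",
]

# Assumed word lists, chosen by hand for this example only.
STOP_WORDS = {"the", "was", "and", "me", "my", "were", "for", "on", "after", "again"}
DOMAIN_TERMS = {"server", "host"}  # frequent terms that carry no signal here

# Hand-written lookup standing in for a real lemmatizer.
LEMMAS = {"servers": "server", "charged": "charge", "charges": "charge", "failed": "fail"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def clean(text):
    """Tokenize, map tokens to their lemma, then drop stop words and domain terms."""
    lemmas = [LEMMAS.get(token, token) for token in tokenize(text)]
    return [t for t in lemmas if t not in STOP_WORDS and t not in DOMAIN_TERMS]


# Word counts before and after cleansing.
raw_counts = Counter(t for c in complaints for t in tokenize(c))
clean_counts = Counter(t for c in complaints for t in clean(c))

print("raw:  ", raw_counts.most_common(5))
print("clean:", clean_counts.most_common(5))
```

In the raw counts, generic words such as "the" and "server" sit among the most frequent terms. After cleansing, the terms that remain at the top in this toy data relate to billing, which is the kind of signal the frequency plot was hiding.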
Now, they can attempt to cluster that data in meaningful ways, and get at the real source of repeat problems they were previously overlooking. They can also start using more advanced data science tools like machine learning to start turning those insights into action—for example, predicting the next best action to take in response to certain types of complaints.
Data Science in Action
That’s just one area where the combination of big data and data science yields results. Organizations in practically every industry are doing the same kinds of things. For example:
We have customers analyzing a variety of complex data sources, including vast quantities of gene sequence variation data, to compare medication effects based on genetic profile. They’re ingesting and visualizing complex sets of data for gene mutations, patient demographics, treatments given and patient response. And they can now identify large sets of personalized variables to predict which patients will respond best to a given medication.
The City of Houston is using data science to take models of predicted infrastructure damage from storm surges and hurricanes, and link them to the sociodemographic problems that this damage causes. They’re using this information to predict which populations will be most negatively affected by flooding, in order to guide policy.
In another project, Houston is linking previously disparate data sets: decades’ worth of public opinion surveys, student performance data, and Texas Medical Center health data. They’re starting to connect the dots between education, health and student performance, for example by identifying asthma as a much bigger problem than they had realized.
The Fire Department of New York (FDNY) is responsible for the safety of more than 330,000 buildings, and inspects around 25,000 each year. They used data science to build an algorithm that gives every single building a risk score to prioritize which ones get inspected first. In the past, the first 25% of inspections they conducted in a given year identified 21% of the severe violations. Using the new data science-driven algorithm, the first 25% of inspections now yield 71% of the city’s severe violations. That predictive capability translates to millions of dollars saved annually, and potentially thousands of lives.
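To make the "first 25% of inspections" measure concrete, here is a minimal sketch of how such a figure could be computed. It is not the FDNY's algorithm: the buildings, risk scores and violation flags below are invented for illustration, and the function name share_found is hypothetical.

```python
# Hypothetical records: (building_id, risk_score, has_severe_violation).
# The scores are made up; a real model would derive them from building data.
buildings = [
    ("B1", 0.91, True),
    ("B2", 0.15, False),
    ("B3", 0.78, True),
    ("B4", 0.40, False),
    ("B5", 0.66, True),
    ("B6", 0.22, False),
    ("B7", 0.05, False),
    ("B8", 0.35, True),
]


def share_found(inspection_order, fraction=0.25):
    """Share of all severe violations found in the first `fraction` of inspections."""
    cutoff = int(len(inspection_order) * fraction)
    total = sum(1 for b in inspection_order if b[2])
    found = sum(1 for b in inspection_order[:cutoff] if b[2])
    return found / total


# Inspect in list order versus highest risk score first.
by_risk = sorted(buildings, key=lambda b: b[1], reverse=True)

print("list order:", share_found(buildings))
print("risk order:", share_found(by_risk))
```

With this toy data, inspecting in risk order finds a larger share of the severe violations in the first quarter of inspections than inspecting in list order. How large that gain is in practice depends entirely on how well the risk scores track real violations.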
Some customers are aggregating data sets about judges, lawyers, parties and patents to predict the behaviors and outcomes that different legal strategies will produce. One law firm uses this technology to determine in 20 minutes if a case is worth taking, instead of the standard 20 days.
There is huge potential to wring value from data sources you weren’t able to tap before. But to capitalize, you first need the capacity to ingest all that data in a format you can work with, and the data science capability to wrangle it for meaningful results. That often involves some iterative work. And it requires substantive domain knowledge just as much as statistical and computational skills. There are technologies out there that can help enterprises get started with data science without having to learn all the moving parts of the Hadoop ecosystem and without digging themselves into a “DIY” hole.
And, when you can create a richer discovery environment for your data scientists, they’ll be able to identify patterns that no one recognized before, and identify a larger set of indicators you should be tracking. They’ll be able to create predictive and prescriptive models that take into account dozens of indicators, and turn vast amounts of data into proactive action.
(About the author: Roy Wilds is chief data scientist at PHEMI Systems, a big data solutions company. Roy has led data science teams for multiple organizations and has advanced knowledge in machine learning theory, Python, R, and SQL, and substantial expertise using Hadoop's distributed technologies.)