
Breaking the data prep barrier

Even as companies increasingly turn to data preparation to feed their analytics tools and create better data-driven intelligence, they are encountering a hard upper limit on how many data sources they can handle.

Useful data comes in all forms and from a wide range of sources. But many of these companies are experiencing a fundamental limitation with their traditional data preparation tools.

This problem is particularly acute for larger, mature organizations that have been accumulating data in separate systems for a number of years. They may have several ERP and CRM systems, or they may have acquired other companies with their own data silos. But regardless of the source of the incompatibilities, their traditional data management tools just can’t handle the scale of data needed to make full use of modern analytics.

Instead, businesses like these need data unification.

Unification versus preparation

Unlike data preparation, which hits a brick wall when combining data from anything more than about a dozen sources, data unification specifically addresses the challenge of bringing together data from numerous and disparate sources. Data unification technologies, like those created at MIT’s Computer Science and Artificial Intelligence Laboratory under the leadership of Turing Award winner Michael Stonebraker, apply human-guided machine learning to unearth the underlying structure in divergent data. These tools evaluate the metadata, offer suggestions for combining similar fields and query experts for guidance on possible matches to enhance their models. In this way, data unification quickly creates a single view of the relevant data, ready for analysis.
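
To make the workflow above concrete, here is a minimal Python sketch of the "suggest, then ask an expert" step. It is an illustration only, not how any particular product works: plain string similarity from the standard library stands in for a trained model, and the source names, column names and thresholds are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase a column name and drop punctuation and whitespace."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a: str, b: str) -> float:
    """Score two column names between 0.0 and 1.0 (stand-in for a model)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def suggest_matches(sources, auto_threshold=0.9, review_threshold=0.6):
    """Compare columns across every pair of sources.

    Returns two lists: pairs confident enough to accept automatically,
    and uncertain pairs that should be put in front of a human expert.
    Both thresholds are illustrative assumptions.
    """
    auto, review = [], []
    for (src_a, cols_a), (src_b, cols_b) in combinations(sources.items(), 2):
        for col_a in cols_a:
            for col_b in cols_b:
                score = similarity(col_a, col_b)
                pair = ((src_a, col_a), (src_b, col_b), round(score, 2))
                if score >= auto_threshold:
                    auto.append(pair)
                elif score >= review_threshold:
                    review.append(pair)
    return auto, review


# Hypothetical metadata from two silos.
sources = {
    "erp": ["Customer_ID", "cust_name", "postal_code"],
    "crm": ["customerid", "customer_name", "zip"],
}
auto, review = suggest_matches(sources)

for label, pairs in (("auto-accept", auto), ("ask an expert", review)):
    for left, right, score in pairs:
        print(f"{label}: {left} ~ {right} (score {score})")
```

Note that "postal_code" and "zip" never surface as a candidate pair here, because their names share almost nothing. That gap is one reason real systems look at the data values and expert feedback rather than field names alone.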

Human-guided machine learning is a key design principle for data unification. As data scientists and data engineers use the technology to integrate data, the system learns from the process to automate more of the data matching and better structure the final result. Unlike other approaches that send users back to square one for every project, data unification can utilize previous results, along with what it learned from generating those results, to provide faster and more accurate outcomes for each undertaking.
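
The idea of not starting from square one can be sketched in a few lines of Python. This is a deliberately simplified stand-in: a real system would retrain a model on expert feedback, whereas the hypothetical `MatchMemory` class below only remembers past decisions and replays them. All names and data are assumptions made for illustration.

```python
class MatchMemory:
    """Remembers expert decisions about field pairs across projects."""

    def __init__(self):
        # Maps an unordered, case-insensitive pair of field names
        # to the expert's verdict (True = same field, False = different).
        self._decisions = {}

    @staticmethod
    def _key(a: str, b: str) -> frozenset:
        return frozenset((a.strip().lower(), b.strip().lower()))

    def record(self, a: str, b: str, is_match: bool) -> None:
        """Store an expert's answer for later reuse."""
        self._decisions[self._key(a, b)] = is_match

    def triage(self, candidates):
        """Split candidate pairs into already-answered and still-open ones."""
        resolved, pending = [], []
        for a, b in candidates:
            decision = self._decisions.get(self._key(a, b))
            if decision is None:
                pending.append((a, b))
            else:
                resolved.append((a, b, decision))
        return resolved, pending


memory = MatchMemory()

# Project 1: an expert answers two questions.
memory.record("cust_name", "customer_name", True)
memory.record("Customer_ID", "customer_name", False)

# Project 2: the same pairs come up again, plus one new pair.
candidates = [
    ("CUST_NAME", "customer_name"),
    ("customer_name", "Customer_ID"),
    ("postal_code", "zip"),
]
resolved, pending = memory.triage(candidates)

print("answered from memory:", resolved)
print("still needs an expert:", pending)
```

Only the new pair reaches the expert in the second project, which is the effect the paragraph above describes: each round of human guidance shrinks the manual work in the next one.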

In this way, data unification enhances the efforts of self-service data preparation users within these larger enterprises. One large pharmaceutical company, for instance, is using the technique to curate thousands of clinical trial datasets. Data unification is used to get source datasets into the correct format for analysis, while a data prep system is used downstream for individual data “wrangling.”

Data unification is also helping companies address a host of challenges caused by dirty, fractured data. For example, computer solutions and services provider Hewlett Packard Enterprise created a data-driven customer journey that goes beyond individual activity to capture activity at the company level. Iana Dankova, Business Analytics Manager at HPE, says the process has “allowed us to get to views and insights we otherwise could never have reached, ultimately improving our win rate.”

Thomson Reuters is also using data unification to deliver better-connected content at a fraction of the time and cost of legacy approaches. Likewise, GE applied machine learning to its procurement data to uncover tens of millions of dollars in savings.

Indeed, data unification efforts now span industries from telecommunications to business consulting, replacing or working in conjunction with traditional ETL and MDM systems. And while the technology can be a boon for organizations of all sizes, enterprises with significant fracture and noise in their data are discovering that it allows them to transform their massive data stores from a liability into a significant decision-making advantage.
