Picking an effective approach to unifying data is, by and large, determined by the scale of the challenge. If your problem is unifying three data sources with 10 records each, it doesn't much matter what type of tool you use; a whiteboard or paper and pencil is probably the best approach.

If your problem is integrating five data sources with 100,000 records each, you can likely use traditional rules-based approaches and familiar technologies like ETL, although it may well be painful.

If, however, you are like the majority of large enterprises and need to combine tens or hundreds of separate data sources with perhaps millions of records each, neither approach will succeed. Fortunately, there are tactics you can use to perform unification at scale.

Data unification, as I’ve described at greater length elsewhere, is the process of ingesting, transforming, mapping, deduplicating and exporting data from multiple sources. Two types of products are routinely used to accomplish this task: Extract, Transform and Load (ETL) tools and Master Data Management (MDM) tools.

These processes require that a human construct a global schema upfront, discover and convert local schemas into that global schema, write cleaning and transformation routines, and write a collection of rules for matching and merging data. It routinely takes three to six months to do this for each data source.
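
To make the per-source effort concrete, here is a minimal, hypothetical Python sketch (with invented field names) of the kind of hand-written mapping, cleaning and matching code a single source demands under the traditional approach; none of it carries over to the next source.

```python
# Hypothetical per-source mapping; every source needs its own, hand-built
# by someone who has studied that source's local schema.
SOURCE_42_MAPPING = {
    "vendor_nm": "supplier_name",
    "vendor_addr1": "street_address",
    "tot_spend_usd": "annual_spend",
}

def transform(record):
    """Convert one source-42 record into the global schema, with hand-coded cleaning."""
    out = {glob: record.get(local) for local, glob in SOURCE_42_MAPPING.items()}
    out["supplier_name"] = (out["supplier_name"] or "").strip().upper()
    out["annual_spend"] = float(out["annual_spend"] or 0)
    return out

# A hand-written match rule; real data needs many more rules to cover
# abbreviations, punctuation, subsidiaries and typos.
def same_supplier(a, b):
    return a["supplier_name"] == b["supplier_name"]

print(transform({"vendor_nm": " Staples, Inc ",
                 "vendor_addr1": "500 Staples Dr",
                 "tot_spend_usd": "125000.50"}))
```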

At General Electric, with 80 procurement systems containing information about its global suppliers, this approach would therefore take 20 to 40 person-years; even with human parallelism, it would be a multi-year project costing millions of dollars. GE is certainly not alone in confronting tasks of this magnitude.

This raises two questions: 1) why does an enterprise have so many data sources, and 2) why would an enterprise want to unify them?

To answer the second question first, there is a huge upside for GE in performing data unification on its 80 supplier databases. A procurement officer purchasing paperclips from Staples can see only the information in her own database about her unit’s relationship with Staples. When the Staples contract comes up for renewal, she would love to know the terms and conditions negotiated with Staples by other business units, so that she can demand “most favored nation” status.

GE estimates that accomplishing this task across all of its vendors would save the company around $1B per year. Needless to say, GE would prefer to be on a single procurement system; but every time the corporation acquires a company, it also acquires that company’s procurement system. Precisely because of the limitations of traditional data integration systems, GE has historically been unable to create a single view of its supplier base. It simply requires too much human work.

Any reasonable shot at solving this problem must be largely automated, with humans reviewing only a small fraction of the unification operations. If GE can automate 95 percent of its unification operations, the remaining human effort shrinks from 20 to 40 person-years to roughly one to two.

This leads us to the first rule of scaling your data unification problem:

Rule I: Scalable data unification systems must be mostly automated

The next issue for scalable data unification is the possibility of large numbers of data sources. For example, Novartis has about 10,000 bench scientists, each recording data in a personal electronic lab notebook. Novartis would gain substantial productivity advantages from understanding which scientists are producing the same results using different reagents, or different results using the same reagents.

Since each scientist produces results independently, the number of distinct attributes across the company’s 10,000 sources is very large. Any attempt to define a global schema up front would be hopeless. Even in less extreme cases, up-front schema development is usually a fool’s errand. Enterprises tried constructing up-front, enterprise-wide schemas in the 1990s, and these projects all failed because the schemas were out of date on day one of the project, let alone at project completion.

The only feasible solution is to build a schema “bottom-up” from the local data sources by discovering a global schema from the source attributes. In other words, the global schema is produced “last.”
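
As a minimal sketch of what “schema last” can look like, assuming hypothetical sources and attribute names: collect the attributes each source actually exposes, then let similar ones cluster into candidate global attributes. Real systems use far richer signals (values, types, learned models and expert feedback), but the direction is the same: bottom-up.

```python
# "Schema last" sketch: the global schema is discovered from the source
# attributes rather than designed up front.  Sources and names are invented.
from difflib import SequenceMatcher

sources = {
    "source_a": ["supplier_name", "addr", "total_spend"],
    "source_b": ["SupplierName", "address", "spend_usd"],
    "source_c": ["vendor", "street_address", "amount"],
}

def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Greedily group attributes whose names look alike; each group becomes a
# candidate column in the emerging global schema.
groups = []
for attrs in sources.values():
    for attr in attrs:
        for group in groups:
            if any(similar(attr, member) >= 0.5 for member in group):
                group.append(attr)
                break
        else:
            groups.append([attr])

for group in groups:
    print("candidate global attribute:", group)

# Name similarity alone misses cases like "vendor" vs. "supplier_name",
# which is exactly where values, learned models and domain experts come in.
```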

Therefore, the second rule of scalable data unification is:

Rule II: Scalable data unification systems must be “schema last”

And because Rule I still applies, the majority of this schema building must itself be automated.

One of the most insidious problems in traditional data unification with ETL and/or MDM is the starkly perverse division of labor: all of the work is foisted onto professional computer scientists, apart from some collaboration with business experts to understand business requirements.

The professionals who are responsible for building data structures and pipelines cannot be expected to understand the nuances of the data itself. Consider for a moment two supplier names “Cessna Textron Av” and “Textron Aviation.” A computer scientist has no idea whether they are the same or different suppliers. However, a procurement officer in GE’s aerospace division almost certainly knows. Scalable data unification systems must resolve ambiguous cases and solicit information from domain experts in addition to interfacing with data architects and computer scientists.

This is called a collaborative system, and the third rule of scalable data unification systems is:

Rule III: When domain-specific information is required, only collaborative data unification systems will scale
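
One way to picture such a system, as a rough sketch with made-up thresholds and a string-similarity stand-in for a real learned matcher: confident decisions are made automatically (Rule I), and only the ambiguous pairs, like the Cessna/Textron example above, are queued for a domain expert, whose answers in turn become training data.

```python
# Sketch of a collaborative workflow.  The confidence function and thresholds
# are placeholders; a real system would use a trained matching model.
from difflib import SequenceMatcher

def match_confidence(a, b):
    # Stand-in for a learned model's probability that a and b are the same supplier.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

AUTO_MATCH, AUTO_REJECT = 0.9, 0.5   # illustrative thresholds

candidate_pairs = [
    ("Staples Inc.", "Staples, Inc"),
    ("Cessna Textron Av", "Textron Aviation"),
    ("Microsoft Corp", "Ingersoll Rand"),
]

expert_queue = []   # pairs a procurement officer must decide
for a, b in candidate_pairs:
    p = match_confidence(a, b)
    if p >= AUTO_MATCH:
        print(f"auto-merge:  {a!r} == {b!r}  ({p:.2f})")
    elif p <= AUTO_REJECT:
        print(f"auto-reject: {a!r} != {b!r}  ({p:.2f})")
    else:
        expert_queue.append((a, b, p))

# Expert verdicts resolve the ambiguous cases and double as labeled
# training data for improving the matcher.
for a, b, p in expert_queue:
    print(f"ask a domain expert: {a!r} vs {b!r}  ({p:.2f})")
```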

Traditional ETL and MDM systems rely on rules to match, merge and classify records. GE, for example, might specify that “any transaction with Microsoft is classified as a computer equipment/software purchase.” This one rule might classify a few thousand transactions. Classifying all of GE’s 80 million transactions would require thousands of rules, far beyond the number a human can comprehend.

Moreover, I have never seen an implementation with thousands of rules. In short, rule systems don’t scale. In contrast, matching, merging and classification can be made to scale with machine learning. Rules remain useful as one way of generating training data when it is not available some other way; that training data then feeds a machine learning system that can handle the problem at scale.

The final rule for scalable data unification is:

Rule IV: Scalable data unification must rely on machine learning, not rules
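
As a rough illustration of rules feeding a learning system, with invented rules, categories and transactions and using scikit-learn: the rules label only the transactions they recognize, and a simple text classifier trained on those labels handles the rest.

```python
# Sketch of Rule IV: hand-written rules generate training labels; a machine
# learning model generalizes to transactions no rule covers.
# Rules, categories and transactions are invented; requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def rule_label(description):
    d = description.lower()
    if "microsoft" in d or "software license" in d:
        return "computer equipment/software"
    if "office depot" in d or "paper " in d:
        return "office supplies"
    return None   # no rule fires; leave unlabeled

transactions = [
    "Microsoft enterprise software license renewal",
    "Microsoft Azure subscription",
    "Office Depot copier paper, 20 cartons",
    "Printer paper and toner cartridges",
    "Adobe Creative Cloud annual license",   # no rule covers this one
    "Staples paperclips and folders",        # or this one
]

labels = {t: rule_label(t) for t in transactions}
labeled = [(t, c) for t, c in labels.items() if c is not None]
unlabeled = [t for t, c in labels.items() if c is None]

# Train on the rule-labeled slice, then let the model classify the rest.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit([t for t, _ in labeled], [c for _, c in labeled])

for t in unlabeled:
    print(t, "->", model.predict([t])[0])
```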

Taken together, these four rules -- that scalable data unification systems must be mostly automated, schema-last, collaborative and reliant on machine learning -- point toward a path that can unify large numbers of data sources and avoid the scalability failures of ETL, MDM and whiteboards. An enterprise tasked with unifying tens or hundreds of data sources will have to follow these four rules to succeed.

Michael Stonebraker

Michael Stonebraker is an adjunct professor of computer science at the Massachusetts Institute of Technology. He is also the chief technology officer at StreamBase.