
'Making Databases Work' celebrates advances in database systems

(BOOK EXCERPT: The following is Chapter 21 reprinted with permission from Making Databases Work, a new book authored by Tamr chief technology officer and 2014 Turing Award winner Dr. Michael Stonebraker. ($119.95 hardback/$99.95 paperback/$79.96 e-book; Morgan & Claypool).

The book is a collection of 36 stories written by Stonebraker and 38 of his collaborators: 23 world-leading database researchers, 11 world-class systems engineers, and 4 business partners. The book celebrates Stonebraker’s accomplishments that led to his 2014 ACM A.M. Turing Award for fundamental contributions to the concepts and practices underlying modern database systems, and describes, for the broad computing community, the unique nature, significance, and impact of his achievements in advancing modern database systems over more than 40 years. For more information, visit http://www.morganclaypoolpublishers.com/stonebraker/.)

Chapter 21 - Data Unification at Scale: Data Tamer
By Ihab Ilyas

In this chapter, I describe Mike Stonebraker’s latest start-up, Tamr, which he and I co-founded in 2013 with Andy Palmer, George Beskales, Daniel Bruckner and Alexander Pagan. Tamr is the commercial realization of the academic prototype “Data Tamer” [Stonebraker et al. 2013b]. I describe how we started the academic project in 2012, why we did it, and how it evolved into one of the main commercial solution providers in data integration and unification at the time of writing this chapter.

Mike’s unique and bold vision targeted a problem that many academics had considered “solved,” and through Tamr he still provides leadership in this area.

How I Got Involved

In early 2012, Mike and I, with three graduate students (Mike’s students, Daniel Bruckner and Alex Pagan, and my student, George Beskales), started the Data Tamer project to tackle the infamous data-integration and unification problems, mainly record deduplication and schema mapping. At the time, I was on leave from the University of Waterloo, leading the data analytics group at the Qatar Computing Research Institute, and collaborating with Mike at MIT on another joint research project.


Encouraged by industry analysts, technology vendors, and the media, “big data” fever was reaching its peak. Enterprises were getting much better at ingesting massive amounts of data, with an urgent need to query and analyze more diverse datasets, and do it faster. However, these heterogeneous datasets were often accumulated in low-value “data lakes” with loads of dirty and disconnected datasets.

Somewhat lost in the fever was the fact that analyzing “bad” or “dirty” data (always a problem) was often worse than not analyzing data at all—a problem now multiplied by the variety of data that enterprises wanted to analyze. Traditional data-integration methods, such as ETL (extract, transform, load), were too manual and too slow, requiring lots of domain experts (people who knew the data and could make good integration decisions). As a result, enterprises were spending an estimated 80 percent of their time preparing to analyze data, and only 20 percent actually analyzing it. We really wanted to flip this ratio.

At the time, I was working on multiple data quality problems, including data repair and expressive quality constraints [Beskales et al. 2013, Chu et al. 2013a, Chu et al. 2013b, Dallachiesa et al. 2013]. Mike proposed the two fundamental unsolved data-integration problems: record linkage (which often refers to linking records across multiple sources that refer to the same real-world entity) and schema mapping (mapping columns and attributes of different datasets).

I still remember asking Mike: “Why deduplication and schema mapping?” Mike’s answer: “None of the papers have been applied in practice. . . . We need to build it right.”

Mike wanted to solve a real customer problem: integrating diverse datasets with higher accuracy and in a fraction of the time. As Mike describes in Chapter 7, this was the “Good Idea” that we needed! We were able to obtain and use data from Goby, a consumer web site that aggregated and integrated about 80,000 URLs, collecting information on “things to do” and events.

We later acquired two other real-life “use cases”: for schema integration (from pharmaceutical company Novartis, which shared its data structures with us) and for entity consolidation (from Verisk Health, which was integrating insurance claim data from 30-plus sources).

Data Tamer: The Idea and Prototype

At this point we had validated our good idea, and we were ready to move to Step Two: assembling the team and building the prototype. Mike had one constraint: “Whatever we do, it better scale!” In the next three months, we worked on integrating two solutions: (1) scalable schema mapping, led by Mike, Daniel, and Alex, and (2) record deduplication, led by George and me.

Building the prototype was a lot of fun and we continuously tested against the real datasets. I will briefly describe these two problems and highlight the main challenges we tackled.

Schema Mapping. Different data sources might describe the same entities (e.g., customers, parts, places, studies, transactions, or events) in different ways, using different vocabularies and schemas (a schema of a dataset is basically a formal description of its main attributes and the types of values they can take). For example, while one source might refer to a part of a product with two attributes (Part Description and Part Number), a second source might use the terms Item Descrip and Part #, and a third might use Desc. and PN to describe the same thing.

Establishing a mapping among these attributes is the main activity in schema mapping. In the general case, the problem can be more challenging and often involves different conceptualizations, for example when relationships in one source are represented as entities in another, but we will not go through these here.
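To make the attribute-matching idea concrete, here is a minimal sketch (illustrative only, not the Data Tamer or Tamr algorithm) that scores candidate attribute pairs purely by string similarity of their names, using the hypothetical part-related column names from the example above:

```python
# Minimal illustration of name-based attribute matching (not Tamr's algorithm):
# score every cross-source column pair by string similarity and keep the best match.
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized column names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

source_one = ["Part Description", "Part Number"]
other_sources = ["Item Descrip", "Part #", "Desc.", "PN"]

for col in source_one:
    best = max(other_sources, key=lambda cand: name_similarity(col, cand))
    print(f"{col!r} -> {best!r} (score={name_similarity(col, best):.2f})")
```

Name similarity alone is far too brittle in practice, which is one reason the matching algorithms also had to look at the data values themselves, as discussed below.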

Most commercial schema mapping solutions (usually part of an ETL suite) traditionally focused on mapping a small number of these schemas (usually fewer than ten), and on providing users with suggested mappings taking into account similarity among column names and their contents. However, as the big data stack has matured, enterprises can now easily acquire a large number of data sources and have applications that can ingest data sources as they are generated.

A perfect example is clinical studies in the pharmaceutical industry, where tens of thousands of studies and assays are conducted by scientists across the globe, often using different terminologies and a mix of standards and local schemas. Standardizing and cross-mapping the collected data is essential to the company's business, and is often mandated by laws and regulations. This broke the main assumption behind most schema mapping solutions: that suggested mappings would be curated by users in a primarily manual process.

Our main challenges were: (1) how to provide an automated solution that required reasonable interaction with the user, while being able to map thousands of schemas; and (2) how to design matching algorithms robust enough to accommodate different languages, formats, reference master data, and data units and granularity.
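As one illustration of why matching has to go beyond column names (again a generic sketch under simplified assumptions, not Tamr's matcher), an attribute pair can also be scored by the overlap of sampled values, which survives renamed or abbreviated headers:

```python
# Generic sketch: score attribute pairs by overlap of their sampled values,
# so that renamed or abbreviated columns with the same content still match.
def jaccard(values_a, values_b) -> float:
    """Jaccard overlap of two columns' distinct, normalized values."""
    a = {str(v).strip().lower() for v in values_a}
    b = {str(v).strip().lower() for v in values_b}
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical sampled values from two differently named columns.
part_number_values = ["A-100", "A-101", "B-200"]
pn_values = ["a-100", "b-200", "c-300"]

print(jaccard(part_number_values, pn_values))  # 0.5: shared content despite different names
```

Signals like these still have to be combined carefully to handle different languages, formats, units, and granularities, which is exactly challenge (2) above.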


Fname   | Lname        | Occupation        | Institution                           | Number Students
Michael | Stonebraker  | Professor         | UC Berkeley                           | 5
Mike    | Stonebraker  | PI                | MIT-CSAIL                             | 4+2 postdocs
M       | Stonebreaker | Adjunct Professor | MIT                                   | 16
Mike    | Stonebraker  | Faculty           | Massachusetts Institute of Technology | n/a

Figure 21.2 Many representations for the same Mike!


Record Deduplication. Record linkage, entity resolution, and record deduplication are a few terms that describe the need to unify multiple mentions or database records that describe the same real-world entity. For example, information about “Michael Stonebraker” can be represented in many different ways. Consider the example in Figure 21.2 (which shows a single schema for simplicity).

It’s easy to see that the four records are about Mike, but they look very different. In fact, except for the typo in Mike’s name in the third record, all these values are correct or were correct at some point in time. While it’s easy for humans to judge whether such a cluster refers to the same entity, it’s hard for a machine. Therefore, we needed to devise more robust algorithms that could find such matches in the presence of errors, different representations, and mismatches of granularity and time references.
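A minimal sketch of the kind of field-by-field comparison such an algorithm performs (purely illustrative, using two of the Figure 21.2 records; the production matchers are considerably more sophisticated):

```python
# Illustrative pairwise record comparison on two Figure 21.2 records
# (not Tamr's production matcher): average per-field string similarity.
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

FIELDS = ["fname", "lname", "occupation", "institution"]

r1 = {"fname": "Michael", "lname": "Stonebraker",
      "occupation": "Professor", "institution": "UC Berkeley"}
r3 = {"fname": "M", "lname": "Stonebreaker",
      "occupation": "Adjunct Professor", "institution": "MIT"}

score = sum(sim(r1[f], r3[f]) for f in FIELDS) / len(FIELDS)
print(f"match score: {score:.2f}")  # a middling score despite referring to the same person
```

A fixed threshold on such a score is brittle, which is why the community turned to learned classifiers, as the next paragraph notes.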

The problem is an old one. Over the last few decades, the research community came up with many similarity functions, supervised classifiers to distinguish matches from non-matches, and clustering algorithms to collect matching pairs in the same group.
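As a schematic example of that classifier-plus-clustering pipeline (a toy sketch with made-up training pairs and features, not Tamr's learning stack):

```python
# Toy sketch of a supervised match classifier over pairwise similarity features
# (hypothetical training data; not Tamr's learning pipeline).
from sklearn.linear_model import LogisticRegression

# Each example is a candidate record pair represented by per-field similarities
# [name_similarity, institution_similarity]; label 1 = same entity, 0 = different.
X = [[0.95, 0.90], [0.90, 0.40], [0.85, 0.95], [0.20, 0.30], [0.10, 0.80], [0.15, 0.10]]
y = [1, 1, 1, 0, 0, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.88, 0.35]])[0][1])  # estimated probability of a match
```

Predicted matches are then grouped by a clustering step so that all records for one real-world entity end up in the same cluster.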

Similar to schema mapping, current algorithms can deal with a few thousand records (or millions of records, but only when partitioned into disjoint groups of a few thousand records each!). However, given the massive amount of dirty data collected—and in the presence of the aforementioned schema-mapping problem—we now faced multiple challenges, including:

1. How to scale the quadratic problem (every record has to be compared to every other record, so computational complexity is quadratic in the number of records; one standard mitigation, blocking, is sketched after this list);

2. How to train and build machine learning classifiers that handle subtle similarities such as those in Figure 21.2;

3. How to involve humans and domain experts in providing training data, given that matches are often rare; and

4. How to leverage all domain knowledge and previously developed rules and matchers in one integrated tool.
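For challenge (1), a standard way to avoid the full quadratic comparison is blocking: group records by a cheap key and compare only within groups. The sketch below is a generic illustration of the idea, not a description of Tamr's internals.

```python
# Sketch of "blocking": group records by a cheap key so that expensive pairwise
# comparison happens only within blocks, not across all n*(n-1)/2 pairs.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "lname": "Stonebraker", "fname": "Michael"},
    {"id": 2, "lname": "Stonebraker", "fname": "Mike"},
    {"id": 3, "lname": "Stonebreaker", "fname": "M"},
    {"id": 4, "lname": "Smith", "fname": "Alice"},
]

def blocking_key(rec) -> str:
    """Cheap key: the first three letters of the last name."""
    return rec["lname"][:3].lower()

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

candidate_pairs = [(a["id"], b["id"])
                   for block in blocks.values()
                   for a, b in combinations(block, 2)]
print(candidate_pairs)  # [(1, 2), (1, 3), (2, 3)]: only within-block pairs survive
```

The catch, of course, is choosing keys that do not split true matches across blocks (the three-letter prefix above happens to keep the misspelled last name in the same block).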

Mike, Daniel and Alex had started the project focusing on schema mapping, while George and I had focused on the deduplication problem. But it was easy to see how similar and correlated these two problems were. In terms of similarity, both problems are after finding matching pairs (attributes in the case of schema mapping, records in the case of deduplication).

We quickly discovered that most building blocks we created could be reused and leveraged for both problems. In terms of correlation, most record matchers depend on some known schema for the two records they compare (in order to compare apples to apples); however, unifying schemas requires some sort of schema mapping, even if not complete.

For this and many other reasons, Data Tamer was born as our vision for consolidating these activities and devising core matching and clustering building blocks for data unification that could: (1) be leveraged for different unification activities (to avoid piecemeal solutions); (2) scale to a massive number of sources and a massive amount of data; and (3) have a human in the loop as a driver to guide the machine in building classifiers and applying the unification at large scale, in a trusted and explainable way.

Meanwhile, Stan Zdonik (from Brown University) and Mitch Cherniack (from Brandeis University) were simultaneously working with Alex Pagan on expert sourcing: crowdsourcing, but applied inside the enterprise and assuming varying levels of expertise. The idea was to use a human in the loop to resolve ambiguities when the algorithm’s confidence in a match falls below a threshold. They agreed to apply their model to the Goby data to unify entertainment events for tourists.
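A schematic of that confidence-threshold hand-off (illustrative only; the thresholds and routing are hypothetical, not the published expert-sourcing model):

```python
# Illustrative sketch of expert sourcing: auto-accept or auto-reject confident
# predictions, and queue the ambiguous middle band for in-house domain experts.
AUTO_ACCEPT, AUTO_REJECT = 0.90, 0.10  # made-up thresholds

def route(pair_id: str, match_confidence: float) -> str:
    if match_confidence >= AUTO_ACCEPT:
        return f"{pair_id}: accept automatically"
    if match_confidence <= AUTO_REJECT:
        return f"{pair_id}: reject automatically"
    return f"{pair_id}: send to a domain expert"

for pid, conf in [("pair-17", 0.97), ("pair-42", 0.55), ("pair-99", 0.03)]:
    print(route(pid, conf))
```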

Our academic prototype worked better than the Goby handcrafted code and equaled the results from a professional service on Verisk Health data. And it appeared to offer a promising approach to curate and unify the Novartis data (as mentioned in Chapter 7).

The vision, prototype, and results were described in the paper “Data Curation at Scale: The Data Tamer System,” presented at CIDR 2013, the Sixth Biennial Conference on Innovative Data Systems Research, in California [Stonebraker et al. 2013b].

The Company: Tamr Inc.

Given Mike’s history with system-building and starting companies, it wasn’t hard to see where he was going with Data Tamer. While we were building the prototype, he clearly indicated that the only way to test “this” was to take it to market and to start a VC-backed company to do so. And Mike knew exactly who would run it as CEO: his long-term friend and business partner, Andy Palmer, who has been involved with multiple Stonebraker start-ups (see Chapter 8). Their most recent collaboration at the time was the database engine start-up Vertica (acquired in 2011 by Hewlett-Packard (HP) and now part of Micro Focus).

Tamr was founded in 2013 in Harvard Square in Cambridge, Massachusetts, with Andy and the original Data Tamer research team as co-founders. The year 2013 was also when I finished my leave and returned to the University of Waterloo, when George moved to Boston as the first full-time software developer building the commercial Tamr product, and when Daniel and Alex left grad school to join Tamr as full-time employees.

Over the years, I have been involved in a few start-ups. I witnessed all the hard work and the amount of anxiety and stress sometimes associated with raising seed money. But things were different at Tamr: The credibility of the two veterans, Mike and Andy, played a fundamental role in a fast, solid start, securing strong backing from Google Ventures and New Enterprise Associates (NEA). Hiring a world-class team to build the Tamr product was already under way.

True to Mike’s model, described in his chapter on how to build start-ups, our first customer soon followed. The problem Tamr tackled, data unification, was a real pain point for many large organizations, with most IT departments spending months trying to solve it for any given project. However, a fundamental problem with data integration and data quality is the non-trivial effort required to show a return on investment before starting large-scale projects like the ones Tamr takes on.

With Tamr living much further upstream (close to the siloed data sources scattered all over the enterprise), we worked hard to show the real benefit of unifying all the data on an enterprise’s final product or main line of business—unless the final product is the curated data itself, as in the case of one of Tamr’s early adopters, Thomson Reuters, which played a key role in the early stages of Tamr’s creation.

Thomson Reuters (TR), a company in which curated and high-quality business data is the business, was thus a natural early adopter of Tamr. The first deployment of Tamr software in TR focused on deduplicating records in multiple key datasets that drive multiple businesses. Compared to the customer’s in-house, rule-based record matchers, Tamr’s machine learning-based approach (which judiciously involves TR experts in labeling and verifying results) proved far superior. The quality of results matched those of human curators on a scale that would have taken humans literally years to finish [Collins 2016].

With the success of the first deployment, the first product release was shaping up nicely. Tamr officially launched in May 2014 with around 20 full-time employees (mostly engineers, of course) and a lineup of proofs of concept for multiple organizations.

As Mike describes in Chapter 8, with TR as the “Lighthouse Customer,” Andy Palmer the “adult supervisor,” and the strong support of Google Ventures and NEA, Steps 3, 4, and 5 of creating Tamr the company were complete.

More enterprises soon realized that they faced the same problem—and business opportunity—with their data as TR. As I write this, Tamr customers include GE, HP, Novartis, Merck, Toyota Motor Europe, Amgen, and Roche. Some customers—including GE, HP, Massachusetts Mutual Insurance, and TR—went on to invest in our company through their venture-capital arms, further validating the significance of our software for many different industries.

In February 2017, the United States Patent and Trademark Office issued Tamr a patent (US9,542,412) [Tamr 2017] covering the principles underlying its enterprise-scale data unification platform. The patent, entitled “Method and System for Large Scale Data Curation,” describes a comprehensive approach for integrating a large number of data sources by normalizing, cleaning, integrating, and deduplicating them using machine learning techniques supplemented by human expertise.

Tamr’s patent describes several features and advantages implemented in the software, including:

  • The techniques used to obtain training data for the machine learning algorithms.
  • A unified methodology for linking attributes and database records in a holistic fashion.
  • Multiple methods for pruning the large space of candidate matches for scalability and high data volume considerations.
  • Novel ways to generate highly relevant questions for experts across all stages of the data curation lifecycle.

With our technology, our brand-name customers, our management team, our investors, and our culture, we’ve been able to attract top talent from industry and universities. In November 2015, our company was named the #1 small company to work for by The Boston Globe.

Mike’s Influence: Three Lessons Learned

I learned a lot from Mike over the last five years of collaborating with him. Here are three important lessons, which summarize his impact on me and are indicative of how his influence and leadership have shaped Tamr’s success.

Lesson 1: Solve Real Problems with Systems

A distinctive difference of Tamr (as compared to Mike’s other start-ups) is how old and well-studied the problem was. This is still the biggest lesson I learned from Mike: It doesn’t really matter how much we think a problem is solved, how many papers have been published on the subject, or how “old” the subject is: if real-world applications cannot effectively and seamlessly use a system that solves it, then it is the problem to work on.

In fact, it is Mike’s favorite type of problem. Indeed, we’re proud that, by focusing on the challenge of scale and creating reusable building blocks, we were able to leverage and transfer the collective effort of the research community over the last few decades for practical adoption by industry—including a large number of mega enterprises.

Lesson 2: Focus, Relentlessly

Mike’s influence on the types of challenges Tamr would solve (and wouldn’t) was strong from Day One. In the early days of Tamr, a typical discussion often went as follows.

Team: “Mike, we have this great idea on how to enable Feature X using this clever algorithm Y.”

Mike (often impatiently): “Too complicated... Make it simpler... Great for Version 10... Can we get back to scale?”

I have often measured our progress in transferring ideas to product by the version number Mike assigns to an idea for implementation (lower being better, of course). Judging the practicality and the probability of customer adoption is one of Mike’s strongest skills, and it guides the construction of adoptable and truly useful products.

Lesson 3: Don’t Invent Problems. Ever

Mike simply hates inventing problems. If it isn’t somebody’s pain point, it is not important. This can be a controversial premise for many of us, especially in academia, where far too often the argument is about innovation and solutions to fundamental theoretical challenges that might open the door to new practical problems, and so on.

In identifying problems, my lesson from Mike was not to be convinced one way or another. Instead, simply take an extreme position and make the biggest tangible impact with it. Mike spends a lot of his time listening to customers, industry practitioners, field engineers, and product managers. These are Mike’s sources of challenges, and his little secret is to always look to deliver the biggest bang for the buck. As easy as it sounds, talking to this diverse set of talents, roles, and personalities is an art, requiring a good mix of experience and “soft” skills.

Watching Mike has greatly influenced the way I harvest, judge, and approach research problems, not only at Tamr but also in my research group at Waterloo. These lessons also help explain the long list of his own contributions to both academia and industry that earned him computing’s highest honor.
