
Customer Data Integration, Linkage Precision and Match Accuracy

  • David Loshin, Ed Allburn
  • November 01 2004, 1:00am EST

Evaluating and improving match accuracy can yield a competitive advantage by improving the effectiveness of all BI technologies that rely on the data.

As customer relationship management (CRM), personalization, data mining, one-to-one relationship marketing/database marketing and customer loyalty programs become de rigueur at many large (and some not so large) organizations, billions of dollars are being invested in sophisticated technology as a means to total customer data integration (CDI). The underlying technology for CDI evolved out of the data quality tools space, particularly from the concepts of record linkage and matching.

Record Linkage and Match Accuracy

Record matching is a sophisticated process referred to by a variety of different terms such as merge/purge, de-duping, householding, building a 360-degree single customer view, creating a marketing customer information file (MCIF) and others. Regardless of the term used, all perform a similar process of identifying and linking related records by parsing name, address and other text fields into separate components and then using advanced approximate string matching algorithms and sophisticated similarity scoring to compare sets of these components and identify pairs that are similar enough to isolate as referring to the same entity. There has been great success in deploying record linkage for the purpose of customer data integration. However, a key aspect of this process is often glossed over and ignored - the issue of linkage precision and record match accuracy.

Linkage precision reflects how well a set of record linkage applications is tuned. Consider this simple mechanism for tuning: match/not-a-match thresholds. As part of the matching process, two records are compared across multiple fields, and the similarity of the two records is evaluated by applying a set of business rules and corresponding weights associated with each field, resulting in the assignment of a similarity score. If that score is greater than the match threshold, then the pair is deemed a match. If the score is less than the not-a-match threshold, it is reported that the pair does not match. When the score falls between the two thresholds, the pair is shunted to a separate repository for subsequent manual review. Match accuracy is a measure of how well the assorted thresholds, business rules and weights are set to provide the most accurate match.
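
The two-threshold mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the field names, weights and threshold values are all hypothetical, and the score is assumed to be a simple weighted average of per-field similarities between 0 and 1.

```python
# Hypothetical field weights and thresholds, chosen only for illustration.
WEIGHTS = {"name": 4, "address": 3, "phone": 2, "email": 1}
MATCH_THRESHOLD = 0.85
NOT_A_MATCH_THRESHOLD = 0.55


def similarity_score(field_similarities, weights=WEIGHTS):
    """Weighted average of per-field similarities, each in [0, 1]."""
    total_weight = sum(weights.values())
    weighted = sum(w * field_similarities.get(f, 0.0) for f, w in weights.items())
    return weighted / total_weight


def classify(score, match=MATCH_THRESHOLD, not_a_match=NOT_A_MATCH_THRESHOLD):
    """Apply the two-threshold rule: match, non-match, or manual review."""
    if score > match:
        return "match"
    if score < not_a_match:
        return "non-match"
    return "review"  # falls between the thresholds: send to manual review


# Example pair: names nearly identical, same address and phone, different e-mail.
pair = {"name": 0.9, "address": 1.0, "phone": 1.0, "email": 0.0}
score = similarity_score(pair)
print(f"score={score:.2f} decision={classify(score)}")
```

Raising the match threshold or lowering the not-a-match threshold widens the review band, trading more manual effort for fewer automatic errors; that trade-off is exactly what tuning adjusts.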

When match accuracy is high, the results are excellent: better CDI, more aggressive personalization, reduced costs associated with customer interaction, etc. Low match accuracy, on the other hand, is likely to give the impression of much poorer customer relationship management, resulting in duplicate mailings, mixed-up credit profiles and repeated attempts at direct marketing, among other less heinous crimes. More seriously, businesses increasingly face major risks when linking records for applications such as health records and financial management, especially in the context of HIPAA privacy requirements, Sarbanes-Oxley compliance, the Anti-Kickback Statute and other regulatory constraints. As more businesses and more applications rely upon a single customer view, it becomes increasingly important to ensure that this single view is accurate.

Today's CDI systems have evolved into highly sophisticated applications incorporating leading-edge research and development advances in fields such as information theory, natural language processing, artificial intelligence and others. One major advancement has been the recognition of users' needs to be able to fine-tune the matching and householding behavior to create a single customer view that more directly fits with the business needs. CDI vendors no longer assume that they can dictate to businesses what the "correct" single customer view is. As businesses have become increasingly sophisticated with business intelligence (BI), CRM and one-to-one systems, they have demanded control of their customer definition.

This is typically effected via business rules that control how the single customer view is resolved by the CDI system. In general, a business rule is anything that controls or changes the CDI application's function, such as:

  • Parsing, standardizing and matching program parameters.
  • Control/configuration/job file settings.
  • Lookup table/dictionary entries.
  • Data partitioning logic (such as by geography).
  • Custom programming logic (exit functions, retry logic, etc.).
  • Individually enabled/disabled parsing and matching rules.

In recent years, CDI vendors have started competing on who has the most business rules, basically arguing that more business rules are better. Many vendors now claim to have more than 100,000 business rules, and one vendor at a major industry conference bragged that a large customer added more than 50,000 custom business rules (thus emphasizing how flexible their system was). However, CDI matching and householding accuracy requires precise refinement of all these business rules; otherwise it generates a less-than-effective data warehouse.

Improving Match Accuracy for Competitive Advantage

It is interesting to note that until recently, applying data quality technology for the purposes of CDI was considered a leading-edge application of technology. Today, it would be unusual for an organization not to be doing this. Years ago, businesses could gain a major competitive advantage by implementing basic data quality and BI technology. Today, however, this technology is no longer an optional luxury but a fundamental requirement just to be on a level playing field.

For example, a company might cancel a promotional campaign because too many consumers such as "Michael Jablowski" did not respond. However, more accurate record matching might reveal that "Mike Jadlowsky" is in fact the same person, and Mike Jadlowsky did respond (or worse, was already a customer - thus indicating wasted marketing).
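
To make the name example concrete, here is a rough sketch of approximate name matching using only Python's standard library. It is an assumption-laden illustration: the nickname table is hypothetical and tiny, and `difflib.SequenceMatcher` stands in for the more advanced approximate string matching algorithms that commercial systems use.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; real systems rely on much larger dictionaries.
NICKNAMES = {"mike": "michael", "bill": "william", "bob": "robert"}


def standardize(full_name):
    """Lowercase the name, split it into tokens and expand known nicknames."""
    tokens = full_name.lower().split()
    return [NICKNAMES.get(token, token) for token in tokens]


def name_similarity(a, b):
    """Average per-token similarity ratio between two names, from 0 to 1."""
    tokens_a, tokens_b = standardize(a), standardize(b)
    if not tokens_a or len(tokens_a) != len(tokens_b):
        return 0.0  # simplification: only compare names with equal token counts
    ratios = [SequenceMatcher(None, x, y).ratio() for x, y in zip(tokens_a, tokens_b)]
    return sum(ratios) / len(ratios)


print(name_similarity("Michael Jablowski", "Mike Jadlowsky"))
print(name_similarity("Michael Jablowski", "Susan Carter"))
```

The first pair scores far higher than the second because the nickname expansion makes the first names identical and the surnames differ in only two characters. Whether such a score counts as a match still depends on the thresholds and weights in use.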

It is very likely that your own company has already made major investments in record matching, and it is equally likely that all of your major competitors have also made similar investments. However, if this is true, and everyone is doing the same thing, then an enlightened manager should be looking for second-order opportunities for additional competitive advantages. One idea with potential is the evaluation and improvement of match accuracy, which, in turn, will deliver an ongoing competitive advantage by improving the accuracy and effectiveness of all the BI technologies that rely on the data.

The opportunity to gain a fresh competitive advantage in this area is very compelling because although most companies already have similar technology, the odds are that their technology is significantly underperforming. In fact, an individual's chances for winning the Powerball lottery are greater than having one's current matching system's complex business rules fully optimized to deliver maximum match accuracy.

The Complexity of Linkage Precision

More than 100 vendors offer record matching systems, many of which have evolved into highly sophisticated technologies that are often composed of more than 100,000 "business rules" that control their exact behavior. In theory, information workers can precisely fine-tune these business rules to improve match accuracy. In reality, very little, if any, significant time is spent attempting to do so. Because of the overwhelming complexity of changing all these close-knit, highly interdependent rules, many people simply rely on the business rules and settings right out of the box, or on some organizational adaptations based on recommendations from minimal vendor consulting.

As an example of a way to adjust similarity scoring, most matching systems provide a way to modify the weighting assigned to parsed components during matching, using qualifiers such as disabled, very low, low, medium, high, very high or required. These systems often parse name and address fields into multiple components, as well as other fields often used in matching such as phone, social security number, e-mail, city, state and ZIP code (often 12 to 20 total components). The potential for modifying settings for even a reasonable number of fields yields a staggering number of possible combinations. For example, when matching records by scoring 12 fields with seven different weighting settings, there are 7^12 (seven to the 12th power), or nearly 14 billion, possible combinations. Bump that up to 20 fields, and you have 7^20, or almost 80 quadrillion, combinations! In comparison, the odds of someone winning the Powerball lottery are one in 121 million.
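
The arithmetic behind these figures is easy to check. The short Python sketch below assumes, as the text does, that each field independently takes one of seven weighting settings; the function name is ours, and the Powerball figure is the one quoted above.

```python
WEIGHT_SETTINGS = 7           # disabled, very low, low, medium, high, very high, required
POWERBALL_ODDS = 121_000_000  # "one in 121 million", as quoted in the text


def weighting_combinations(fields, settings=WEIGHT_SETTINGS):
    """Number of distinct ways to assign one of `settings` weights to each field."""
    return settings ** fields


for fields in (12, 20):
    combos = weighting_combinations(fields)
    ratio = combos / POWERBALL_ODDS
    print(f"{fields} fields: {combos:,} combinations "
          f"({ratio:,.0f} times the Powerball odds)")
```

Even the 12-field case is more than a hundred times larger than the lottery odds, which is why exhaustively trying every weighting combination by hand is not realistic.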

In addition to the issue of complexity, the reality is that project time for fine-tuning is typically scheduled toward the end of the project. Yet when projects run over schedule and over budget, a common target for project elimination is the step for fine-tuning the matching business rules. The ramification of this is that most businesses are using these expensive, sophisticated matching engines with little or no changes to their default settings.

Approaches to Improvement

One of the most insidious aspects of match accuracy is that responsibility for it often falls through the cracks of the company organizational chart, typically defaulting to the IT staff who fine-tune the business rules. However, the exact goal of the matching behavior can often be a moving target or, equally often, a target that has either conflicting definitions or no definition at all. This creates a very strong temptation for the IT staff to simply accept the default business rules with little, if any, attempt to truly refine them.

More accurate matching results are achieved when business users actively collaborate with the IT staff to analyze and refine the business rules. Business users are the ones with the critical information about the intended business use of the data, which then drives the decisions on the matching business rules. For example, a newspaper company may place much higher priority on postal address matching criteria and may not want any records that have different addresses to be matched. A bank, on the other hand, may place a higher priority on the individual, regardless of how many different addresses their records may span (such as home and work addresses). Fraud detection applications may utilize even looser match rules to find all possible relationships between records.

The bottom line is that regardless of what business rules the IT staff defines, those rules will be wrong if they are defined in a vacuum without business user involvement. It follows that any initiative to find and fix matching errors must be driven by the business users (and executive sponsorship helps).

One interesting (one might even say, "procedurally fractal") aspect of linkage precision is that the quality of match accuracy can be analyzed and improved the same way other aspects of information quality are treated. We can take the same steps in evaluating the quality of the application's matching rules as a baseline measure and then identify potential areas for improvement.

One way to start this process is to ask some questions about how your record matching software is used, including:

  • How many business rules are in your current matching process?
  • How is matching accuracy measured within the organization? Is that accuracy estimated by scientific measures or gut instinct?
  • How accurate are the business rules (50%, 90%, 99%, mostly accurate)?
  • Is a formal methodology such as Six Sigma used to measure accuracy?
  • How confident are the business clients in the accuracy metrics and conclusion?
  • Of the total project time and effort, how much time was spent developing or modifying business rules?
  • What is the strategy for fine-tuning business rules for new data?
  • Is new data being regularly used to further refine the business rules?
  • Has consideration been given to making match rules tighter or looser?
  • How much involvement did business clients have in the rule refinement?
  • How difficult was it to get agreement on the matching business rules?
  • Can examples of bad record matching still be found?
  • What is preventing the match accuracy from being improved now?
  • Exactly how many customers do you really have?
  • Can your match accuracy still be improved?

Another key step is to review your current tools and techniques for fine-tuning the matching business rules. Surprisingly, many teams still use the same tools and techniques that were being used more than 20 years ago. For example:

  • A small sample of a few thousand records is selected as the basis for developing and evaluating rules and is subjected to the parsing, standardizing and matching steps.
  • IT staff visually review the matching results with limited spot-checks.
  • Business rules are adjusted, the sample data set is reprocessed and the cycle starts again.

Clearly, basing the rules used to aggregate large sets of disparate customer information on a small selected sample may not be the most effective way to develop business rules. To address this issue, automated tools are now being developed that can be used to adjust business rule settings, run the record matching applications and then automatically evaluate the results, providing reports that can be scanned to assess the differences between sets of business rules and corresponding thresholds and similarity scoring.

Finally, it is not uncommon to uncover conflicting requirements during this stage that may warrant creating multiple customer views instead of trying to force a single customer view on the entire organization. Therefore, a key step in match accuracy assessment and improvement is to ensure that business clients closely collaborate with IT staff to clarify matching requirements.
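
The automated adjust, rerun and evaluate cycle mentioned above might look something like the following Python sketch. Everything in it is an assumption made for illustration: the scored pairs and the reviewed "true" matches are made-up data, `run_matching` is a stand-in for a real matching run, and precision and recall are standard evaluation measures that we have chosen here, not metrics named by any particular tool.

```python
# Hypothetical similarity scores for pairs of record IDs.
SCORED_PAIRS = {(1, 2): 0.95, (3, 4): 0.80, (5, 6): 0.70, (7, 8): 0.40}
# Hypothetical ground truth: pairs confirmed as matches by business reviewers.
TRUE_MATCHES = {(1, 2), (3, 4), (7, 8)}


def run_matching(scored_pairs, threshold):
    """Stand-in for a matching run: keep pairs scoring above the threshold."""
    return {pair for pair, score in scored_pairs.items() if score > threshold}


def evaluate(predicted, truth):
    """Return (precision, recall) of predicted matches against reviewed truth."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


def compare_thresholds(thresholds):
    """Rerun matching for each candidate threshold and collect the results."""
    return {t: evaluate(run_matching(SCORED_PAIRS, t), TRUE_MATCHES) for t in thresholds}


for threshold, (precision, recall) in compare_thresholds([0.9, 0.75, 0.6]).items():
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

A report like this makes the effect of tighter or looser rules visible side by side, which gives business users and IT staff something concrete to discuss when deciding which settings fit the intended use of the data.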
