What do a consumer catalog in someone else’s name delivered to your door, a spelling error and mathematical set theory have in common? The answer is that all are factors affecting the quality of a customer master data management system. It should go without saying that a critical success factor of any business is to know its customer. The proliferation of enterprise software and the need to link data from different systems has placed a renewed spotlight on MDM solutions and, more specifically, customer MDM solutions.

The best customer MDM systems do not exist in a vacuum. They are continually updated with the latest and greatest data available, whether that be from a customer change request, an internal CRM system or a partner data feed. But in order for this data to be meaningful and accurate, it must be integrated with existing data so as not to create duplicates or apply updates to the wrong record. The challenge with customer MDM is that names are not unique. In addition, persons may change their name and customers may shift addresses. In the absence of a unique identifier (e.g., Social Security number), what mechanisms exist for matching this type of data? How can the savvy IT professional be certain that customer additions are genuine and not just a clever duplicate?

Fortunately, principles from the math and science fields have been applied to data matching. Data matching is one of those concepts the human brain finds astonishingly easy, but is actually quite difficult for a computer program to reproduce. Consider the records in Figure 1:

Are the values for name, address and ZIP Code exactly the same? Definitely not. But could these records be referring to the same customer? Most certainly. This is the essence of “fuzzy matching”: searching for record linkages when two values are not 100 percent identical.

Fuzzy Matching Toolbox

Many database systems and open source APIs offer a range of comparison tools to perform fuzzy matching. These tools compare two data elements and return a value (or score) that indicates the confidence or likelihood the two values are identical. The best fuzzy matching comparison functions account for variances in spelling or form such that minor variances are still rated as a likely match but major variances flag a low match probability.

Let us take a brief look at the major fuzzy matching algorithms:

Soundex / Metaphone — Soundex is a phonetic matching algorithm developed in the early 20th century, used in the United States Census to group together surnames that sound the same in English but are spelled differently. Soundex classifies words by the first letter of the word and assigns a three digit code for the remaining consonants. For instance, “Smith,” “Smithe” and “Smyth” are all represented by the code “S530.” Soundex was a good first step at fuzzy matching of names, but it had its weaknesses. Future variations of phonetic matching, most notably Metaphone, sought to extend the capability of Soundex by accounting for more variations in English spellings, pronunciations and foreign words. Later versions of Metaphone have been used in spell check programs and search engine tools.

Levenshtein — The Levenshtein distance algorithm, named after its founder Vladimir Levenshtein, seeks to provide a mathematical representation of the similarity (or lack thereof) between two string values by computing the minimum number single character changes required to change one value into another. For example, the distance between the words “cat” and “hats” is represented by one character substitution (c for h) and one character addition (s).

Jaccard — Often referred to as the Jaccard similarity coefficient, this mathematical equation computes the similarity of data sets by measuring the size of the intersection divided by the size of the union. A complementary measurement to the Jaccard similarity coefficient is the Jaccard distance, which computes the dissimilarity between two values (by subtracting the similarity coefficient from 1).

Longest Common Sequence — For a string value, a subsequence is a combination of characters (or character placeholders) that appears in left to right order. The characters do not necessarily have to appear in consecutive order. Some subsequences of the word DATA are DAT, ATA, A_T, __T. From these, the longest common subsequence is defined as the maximum length subsequence shared by two string values.

Trigram — The trigram model is another tool for the fuzzy matching of string values. Trigram is based on the n-gram principle (where n = 1,2,. . . .) of dividing string values into three character blocks and comparing the number of shared blocks between the values being compared. Trigram, unlike Soundex or Metaphone functions, is not phonetic and therefore is language agnostic.

Data Matching Process

Armed with the knowledge of these scientific algorithms, how does one go about implementing this into a coherent matching solution? A typical process flow is shown in Figure 2.

The first step is always to profile the data. Profiling will identify particular anomalies present in the data set as well as dictate the amount of data cleansing to be done. Data profiling will also determine which customer attributes will be used in the matching process. This is referred to as the selecting the match string. Typical candidates for a match string in a customer database include:

  • Name (first, middle, last, suffix and/or organization name)
  • Address (address, city, state, postal code)
  • Phone

The next step is to cleanse and standardize the data. Data cleansing is essential to eliminate as much “noise” as possible from the elements being matched. This noise can take the form of extra whitespace (De Bray versus DeBray), punctuation (O Connor versus O’Connor), or common abbreviations (Corp versus Corporation). Functions can be written to remove or correct name strings of these differences, thus increasing the likelihood of an accurate match.
Address information offers a great opportunity for data standardization. Cleansed addresses will provide an organization value in terms of better customer matching and increased operational efficiencies related to accurate customer visits and reduced mailing errors. Several data providers offer these services for relatively modest fees. Due to the possible variances present in address information, standardization is essential whenever location information is included as part of customer matching.

Once the match string has been determined and the data has been cleansed, the process of data matching can begin. Let us assume the match string elements are the following:

  • First name
  • Middle name
  • Last name
  • Suffix
  • Address line 1
  • ZIP Code

In order to begin matching, one must select the matching algorithm(s) to be used. The selection process should begin with a series of structured tests to determine which algorithms work best with the data at hand. The tests should be designed to verify the results of three groups of data samples: obvious matches, obvious non-matches, and data that could go either way. With enough samples and testing, it should become apparent which algorithm gives the best results for identifying positive matches and negative matches as such. Figure 3 offers simple examples of matching algorithm comparisons.
Click here to see a larger image of Figure 3.

In conjunction with selecting the matching algorithm, the data designer must also decide whether to assign weightings to the match string elements. Weightings are percentage values that give more or less emphasis to different elements of the match string; all must add up to 100 percent. Determining the weightings depends on the data. While it is typical to assign a higher weight to name attributes over address details, this may not be the best solution if the customers to be matched are concentrated in a particular geography. Weighting is particularly useful for data elements such as middle name or suffix. These values may not always be present, the accuracy of the match score is definitely improved when they are. Once weightings are set, they can be used as a multiplier of individual confidence scores in order to determine a final score.

The last step in the process is to apply data stewardship procedures to the customer matching process. Since the matching process deals with nuanced data, it makes sense to introduce a human element to serve as a check to the preprogrammed routines. Most of the matching algorithms discussed here output a confidence score, so it is simply a matter of choosing a middle ground of score ranges that should be reviewed by a data steward. For instance, let the system process match scores above 0.90 (high confidence) and below 0.70 (lower confidence); anything in the between should be reviewed to determine whether it is a genuine match or not. In this way, the data steward may gain insights into the data that can later be used to improve the matching process.

This is where the art of data matching comes into play. There are many mathematical tools and levers that one can pull when comparing two pieces of data, but understanding how and when to pull those levers is the art of the procedure. The data designer must have a feel for the data needing to be matched and understand how different weightings can affect the overall outcome. Should more weight be given to a certain element of a match string? Would a different matching algorithm give a better result? All of these are factors that the data designer must consider. In the end, the due diligence of profiling, cleansing, matching and stewardship will pay dividends in the form of a reliable and accurate customer MDM system.