The following column is excerpted from the white paper, "The Perfect Match: 7 Steps to a Match" by Cory Shouse. For a copy of the full paper, please visit www.csiwhq.com/news/whitepaper_requests.asp.

In part 1, we covered the three preparation steps for matching our data to third-party data. In part 2, we cover the steps required to actually perform the match.

Step 4: Consider Latency and Architecture Options

Once the data is ready to be matched, it is important to understand your latency requirements and physical architecture, and to have a well-defined workflow and a scoring system in place so that master data cleansing continues as an ongoing process.

Real Time versus Batch

Strongly consider your business requirements when assessing the latency requirements for a match. The first item for consideration is real time versus batch. For example, if the requirement is to set a credit limit, a real-time match may be critical to get a true assessment of the total customer spend; however, if the requirement is to generate a deduped mailing list for shipping a new catalog, then a one-time batch match may suffice. The cost for these two options can be substantially different when using a third party's product.

On Site versus Off Site

In addition to real time versus batch, on site versus off site must also be considered. Many vendors offer an updated reference file to keep on site at your location. You are able to use the file without ever having to send them a file or integrate the matching process directly with the vendor's database. However, that on-site file is only as good as the last update received from the vendor. You need to consider just how much decay you can afford to live with (i.e., should you receive monthly, quarterly or yearly updates of the file?).
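One rough way to reason about the update interval is to estimate how much of the on-site file has gone stale by the time the next update arrives. The sketch below assumes records decay independently at a constant monthly rate. Both that model and the 2 percent rate in the usage lines are hypothetical placeholders, not vendor figures; substitute a rate measured from your own data.

```python
def stale_fraction(monthly_decay_rate: float, months_between_updates: int) -> float:
    """Estimate the share of records that are out of date just before the next update.

    Assumes each record has the same independent chance of changing every month
    (a simplifying assumption, not a vendor-supplied model).
    """
    if not 0.0 <= monthly_decay_rate <= 1.0:
        raise ValueError("monthly_decay_rate must be between 0 and 1")
    if months_between_updates < 0:
        raise ValueError("months_between_updates must not be negative")
    return 1.0 - (1.0 - monthly_decay_rate) ** months_between_updates


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 0.02  # placeholder only: 2 percent of records change per month
    for label, months in (("monthly", 1), ("quarterly", 3), ("yearly", 12)):
        share = stale_fraction(HYPOTHETICAL_RATE, months)
        print(f"{label:>9} updates: about {share:.1%} stale before refresh")
```

Comparing the three results against the cost of each update frequency gives a starting point for deciding how much decay is tolerable.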

Manual versus Automated

Another element to consider is whether to match manually or automatically. Hiring temporary staff to "eyeball" records across different systems using some common elements is a cost-effective way of matching records; however, this introduces the human element, and the method is only as good as the staff performing the match. When you are dealing with millions of records (and new records are being added at a high rate), automation of the process may be the right answer. A hybrid approach is also a likely scenario. For example, an automated system will encounter cases in which it cannot find a match, and those records will require manual review before the appropriate path is determined.

Step 5: Score the Match

How do you know if a match exists? The best approach is to sample a set of records and evaluate the results. Start by taking the standardized elements of the two records and comparing them one by one. For example, if we match the address values from our system with those of a vendor, we may apply a grade such as that shown in Figure 1.

Figure 1: Example of a Match Grading
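As a rough illustration of element-level grading (not a reproduction of Figure 1), the sketch below compares one standardized element from our system with the vendor's value and returns a letter grade. The grade letters (A for same, B for similar, F for different, Z for missing), the function name `grade_element`, and the 0.8 similarity cutoff are all assumptions made for this example; a vendor's product will apply its own rules.

```python
from difflib import SequenceMatcher


def grade_element(ours: str, theirs: str, similar_at: float = 0.8) -> str:
    """Grade one standardized element from two records.

    Hypothetical grading scheme:
      A = values are the same, B = values are similar,
      F = values are different, Z = one or both values are missing.
    """
    a = ours.strip().upper()
    b = theirs.strip().upper()
    if not a or not b:
        return "Z"
    if a == b:
        return "A"
    # Ratio of matching characters to total characters, between 0.0 and 1.0.
    similarity = SequenceMatcher(None, a, b).ratio()
    return "B" if similarity >= similar_at else "F"


if __name__ == "__main__":
    pairs = [
        ("100 Main St", "100 MAIN ST"),
        ("100 MAIN ST", "100 MAIN STREET"),
        ("100 MAIN ST", "42 OAK AVE"),
        ("100 MAIN ST", ""),
    ]
    for ours, theirs in pairs:
        print(repr(ours), "vs", repr(theirs), "->", grade_element(ours, theirs))
```

Running the same function over name, city, postal code and phone gives one grade per element, which is the input to the record-level score discussed next.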


After doing this for each element, we need to score the record as a whole. D&B calls this score a "confidence code." The confidence code is a scale from one to 10 indicating the probability of a match. For example, after performing a match, we may see results as shown in Figure 2.

Figure 2: Example of Matching Results


Figure 2 shows that record 1 is a perfect match while record 5 is an absolute no match. The difficulty comes in determining what to do with the records in between. After performing this same analysis on a large sample, we may come to the conclusion that records with a confidence code greater than or equal to 8 will be flagged as a match, records with a confidence code less than 5 are an absolute no match, and those in between will be flagged for manual review.
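The thresholds described above can be sketched as a small routing function. The cutoffs (8 and 5) come from the example in the text; the function name and the returned labels are hypothetical, and your own sample analysis may lead to different cutoffs.

```python
MATCH_AT = 8        # confidence code >= 8: flag as a match
NO_MATCH_BELOW = 5  # confidence code < 5: treat as no match


def classify(confidence_code: int) -> str:
    """Route a record based on its confidence code (a scale from 1 to 10)."""
    if not 1 <= confidence_code <= 10:
        raise ValueError("confidence code must be between 1 and 10")
    if confidence_code >= MATCH_AT:
        return "match"
    if confidence_code < NO_MATCH_BELOW:
        return "no_match"
    # Codes 5, 6 and 7 fall in between and go to a person.
    return "manual_review"


if __name__ == "__main__":
    for code in range(10, 0, -1):
        print(code, classify(code))
```

Keeping the two cutoffs as named constants makes it easy to recalibrate them as larger samples are evaluated.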

Step 6: Define a Workflow

Taking into account all the factors discussed thus far, the pieces must now be put together to create a well-defined and repeatable process for performing the match. Get a good understanding of how quickly your files will decay and your latency for the match when defining your workflow. Figure 3 is one example of a workflow to support a match using an enterprise resource planning (ERP) system, a customer master file and a vendor's external reference file. This workflow takes the following into account:

  1. We will use our own in-house customer master to match against first (using reward number).
  2. We only go to the vendor's file if we can't find a match.
  3. We will use a mix of automated and manual matching.
  4. We will use a mix of on-site and off-site reference data.
  5. We will use a mix of real-time and batch processing.

Figure 3: Example of a Matching Workflow
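The first three points of this workflow can be sketched in code as follows. This is an illustration under stated assumptions rather than a rendering of Figure 3: the record layout, the `reward_number` key, the `vendor_lookup` callable (assumed to return a candidate record and a confidence code from 1 to 10), and the review queue are all hypothetical, and the cutoffs of 8 and 5 are borrowed from the scoring example in step 5.

```python
from typing import Callable, Optional, Tuple

Record = dict
VendorLookup = Callable[[Record], Tuple[Optional[Record], int]]


def match_record(
    record: Record,
    customer_master: dict,
    vendor_lookup: VendorLookup,
    review_queue: list,
) -> Tuple[str, Optional[Record]]:
    """Match one incoming record and report where the match came from."""
    # 1. Try the in-house customer master first, keyed on reward number.
    reward_number = record.get("reward_number")
    if reward_number and reward_number in customer_master:
        return ("in_house", customer_master[reward_number])

    # 2. Go to the vendor's file only when no in-house match was found.
    candidate, confidence_code = vendor_lookup(record)
    if candidate is not None and confidence_code >= 8:
        return ("vendor", candidate)

    # 3. Automated plus manual: in-between scores are queued for a person.
    if candidate is not None and confidence_code >= 5:
        review_queue.append((record, candidate, confidence_code))
        return ("manual_review", None)

    return ("no_match", None)


if __name__ == "__main__":
    master = {"R-1001": {"customer_id": 1, "name": "ACME CORP"}}
    queue: list = []

    def fake_vendor_lookup(rec: Record) -> Tuple[Optional[Record], int]:
        # Stand-in for a real vendor call; always returns a middling score.
        return ({"vendor_id": "V-77", "name": rec.get("name", "")}, 6)

    print(match_record({"reward_number": "R-1001"}, master, fake_vendor_lookup, queue))
    print(match_record({"name": "ACME CORPORATION"}, master, fake_vendor_lookup, queue))
    print(len(queue), "record(s) waiting for manual review")
```

Passing the vendor lookup in as a callable leaves the on-site versus off-site and real-time versus batch choices open: the same function can wrap a local reference file or a call to the vendor's service.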

Step 7: Perfect the Match

Perfecting the match takes time and recalibration of the process. Remember playing the game Memory as a kid? My four-year-old daughter loves playing it and enjoys winning by getting the most matched cards. The game itself demonstrates just how difficult matching can be. It is almost assured that a match will not be achieved when the initial two cards are turned over. However, as the players begin to uncover other cards and learn what to look for and where to look, the probability of a match increases until all cards are accounted for with a match. This same concept applies when we begin to match our customer, vendor and product master files across different systems. Continually reevaluate the status of your matching process and the results of your scoring system. Repeat steps 1 through 6 until the findings show that the matching process is working and results are positive.

One hundred percent match success is not guaranteed. There are numerous factors to consider and outline when designing your own process and achieving success in your master data match approach. However, for a match made in heaven, you must properly prepare the match, perform the match and perfect the match.

William wishes to thank Cory Shouse for his contribution to this month's column.


Cory Shouse is a senior architect with Conversion Services International. With more than 10 years of experience in business intelligence, Shouse specializes in helping companies establish, organize and deliver value to the business. He has helped a number of Fortune 500 companies define quality assurance programs, organizational and staffing plans, change control procedures, and appropriate information and technical architectures. He may be reached at cshouse@csiwhq.com or (469) 939-5385.

