Let's assume you've decided to invest some serious effort in choosing a customer matching system. How do you go about it?

You'll start with technical specifications such as hardware, operating systems and integration methods. You might try to narrow the field by considering just one of the three classes of matching systems described in my previous columns – string-based, pattern-based or reference-based.

While it's generally true that reference-based systems are the most accurate and string-based systems are the least, this does not mean one class of product is always more appropriate than another: the performance differences depend on the application and the specific data involved. For example, a power utility's list of current customers is likely to be quite accurate, while a list of inactive catalog buyers will contain many duplicate accounts and outdated addresses. If a file is highly accurate to begin with, moving to a more powerful system may not improve performance enough to justify the higher acquisition and operating costs.

Even if you did limit yourself to a single class of systems, there are still significant differences among the products within each group. In short, there is no way to make a truly sound decision without testing each product against your own data. The process has three main steps.

Assemble test data. This is often the most difficult part of the project because the data is not readily available and information technology (IT) resources to assemble it are scarce. Ideally, the test data would include complete files from each system that will eventually provide inputs. This would test the matching system's ability to handle data gathered through different processes and stored in different formats. It would also provide the highest possible number of duplicates to detect. In fact, the test data should really include several sets of input from each system taken at different dates to ensure the data contains old and new versions of customers who have moved, changed their names, opened or closed accounts and gone through other transformations the matching system may be intended to detect.

Alas, comprehensive data is rarely available. Even if it is, the volume is likely to be greater than the matching software vendors are willing to include in a test. Therefore, some form of sampling is usually necessary.

Constructing a sample for a matching test is unusually tricky. The statistician's usual instinct is to take a random or every-nth-record sample. However, this is the worst approach for matching tests: these methods tend to separate adjacent records, which are the most likely to be duplicates or members of the same household. A better approach is to select all names in a limited geographic area. More than one region should be chosen to get a mix of urban and rural areas and to capture any regional differences. This is particularly important in companies where different areas are served by different operational systems; for these companies, using multiple regions ensures that inputs from all systems are represented.
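To make the idea concrete, here is a rough Python sketch of geography-based sampling. The record layout (a "zip" field) and the chosen ZIP prefixes are assumptions for illustration, not a prescription for any particular matching product.

```python
# Sketch of geography-based sampling for a matching test. The "zip"
# field name and the prefix choices are assumptions for illustration.
def geographic_sample(records, zip_prefixes):
    """Keep every record whose ZIP code starts with one of the chosen
    prefixes, preserving adjacent records that may be duplicates."""
    return [r for r in records
            if str(r.get("zip", "")).startswith(tuple(zip_prefixes))]

customers = [
    {"name": "Ann Smith",  "zip": "10001"},   # urban region
    {"name": "Anne Smith", "zip": "10001"},   # likely duplicate, kept together
    {"name": "Bob Jones",  "zip": "59801"},   # rural region
    {"name": "Cal Reyes",  "zip": "94107"},   # outside the sampled regions
]

sample = geographic_sample(customers, ["100", "598"])
# Both Smith records survive together, unlike with random sampling.
```

The point of selecting whole regions rather than random rows is visible in the example: the two near-duplicate Smith records stay in the sample as a pair.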

If the volume remains too high even when the sample is limited to a handful of regions, it may be further reduced by selecting on last name (e.g., all last names beginning with the letters A through F). This will still retain most duplicates, although it will likely miss matches involving women who have changed their last names following marriage or divorce.
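A last-name cut like this is simple to express in code. In this sketch the "last_name" field is an assumption about the test-file layout:

```python
# Sketch of reducing test volume by last-name initial. The "last_name"
# field name is an assumption about the file layout.
def select_by_initial(records, first="A", last="F"):
    """Keep records whose last name starts with a letter in [first, last]."""
    kept = []
    for rec in records:
        initial = rec.get("last_name", "").strip()[:1].upper()
        if first <= initial <= last:
            kept.append(rec)
    return kept

people = [
    {"last_name": "Adams"}, {"last_name": "Franklin"},
    {"last_name": "garcia"}, {"last_name": "Ng"},
]
subset = select_by_initial(people)   # keeps Adams and Franklin
```

Note the case normalization: a cut like this should not silently drop "garcia" just because of capitalization, and the comparison above uppercases the initial before testing the range.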

It is also worth inserting records known to contain special situations, such as those with tricky parsing problems, name changes, frequent movers, household splits or multiple generations (i.e., Sr., Jr. and III). These can be fictional records to test string- and pattern-based matching, but should be real people when testing reference-based systems. To avoid having such records stand out during processing, they should be physically mixed with the other data in exactly the same format. This may require constructing plausible values for fields that are populated in other records in the same file, such as account IDs or telephone numbers. The number of such fields should be limited because data not used for matching should be removed from the test file to reduce security risks and processing costs. Any individual or household link that comes from a system the new matching software would replace should also be removed from the test file. Such links should not be discarded entirely, however, because they can later be compared with the links created by the new systems.
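The mixing-and-normalizing step might look something like this sketch. The field names, the filler rule for account IDs, and the shuffle are all assumptions for illustration; a real project would generate fillers that match the conventions of the actual file.

```python
import random

# Sketch: mix seed records (tricky parsing cases, name changes, etc.) into
# the real data in the same format. Field names and the account-ID filler
# rule are assumptions for illustration.
MATCH_FIELDS = ["name", "address", "zip", "account_id"]

def normalize(rec):
    """Keep only the fields used for matching; give seed records a
    plausible account ID so they don't stand out."""
    out = {f: rec.get(f, "") for f in MATCH_FIELDS}
    if not out["account_id"]:
        out["account_id"] = str(random.randint(10_000_000, 99_999_999))
    return out

def build_test_file(real_records, seed_records):
    combined = ([normalize(r) for r in real_records]
                + [normalize(r) for r in seed_records])
    random.shuffle(combined)   # physically mix seeds with the real data
    return combined
```

Two of the column's points are enforced here: fields not used for matching are dropped from every record, and seed records acquire plausible values for fields the real records populate.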

Each record should include a source system indicator and file date because the matching system might need different rules for records from different sources or from the same source at different times. Every record should also be assigned a unique identifier to simplify later analysis of how the matching systems performed.
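Tagging every record this way is mechanical, so a brief sketch suffices. The field names ("source_system", "file_date", "test_id") are assumptions; any consistent convention works.

```python
import itertools

# Sketch: tag every record with its source system, file date and a unique
# test ID for later analysis. Field names are assumptions about the layout.
def tag_records(records, source, file_date):
    counter = itertools.count(1)
    for rec in records:
        rec["source_system"] = source
        rec["file_date"] = file_date
        rec["test_id"] = f"{source}-{file_date}-{next(counter):06d}"
    return records

batch = tag_records([{"name": "Ann Smith"}], "CRM", "2007-03-01")
# batch[0]["test_id"] == "CRM-2007-03-01-000001"
```

Embedding the source and date in the identifier itself is one convenient choice: when the matching system later links two records, the IDs alone reveal whether the match crossed sources or file vintages.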

The final step in test file preparation is creation of record layouts and counts needed to help load the data into the matching system itself. Some users prepare two test files: one for initial system setup and tuning, and the other to generate test results. This is analogous to the standard approach of predictive modelers who build a model on one data sample and then validate it against a separate data set. In both modeling and matching, the purpose is to ensure the system is not generating unrealistic results by tuning itself to anomalies in the test data. Split test files are rarely used, however, because this is generally not an issue for matching systems.
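For the minority of projects that do use split files, the split should follow the same logic as the sampling: divide by region rather than at random, so potential duplicates stay together in one file. A sketch, with the "zip" field and prefix choices again assumed:

```python
# Sketch: split a test file into a setup/tuning set and a results set by
# region, so potential duplicates stay in the same file. The "zip" field
# and the prefix choices are assumptions.
def split_by_region(records, tuning_prefixes):
    tuning, scoring = [], []
    for rec in records:
        if str(rec.get("zip", "")).startswith(tuple(tuning_prefixes)):
            tuning.append(rec)
        else:
            scoring.append(rec)
    return tuning, scoring

tune, score = split_by_region(
    [{"zip": "10001"}, {"zip": "10002"}, {"zip": "94107"}], ["100"])
```

A random row-level split would scatter duplicate pairs across the two files, defeating the purpose of both.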
