The first column in this series described the most basic type of customer matching software, merge/purge systems. These systems parse incoming addresses into elements such as first name, last name, house number, street name, city, state and postal code. They then standardize these elements, correcting for variations such as misspellings, nicknames and alternate place names. Finally, they compare the elements in pairs of records, calculate a similarity score and flag as matches any pair scoring above a user-specified level.
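The standardize, compare and threshold steps can be sketched in a few lines of Python. This is only an illustration, not any vendor's method: it assumes the records are already parsed into elements, and the nickname table, the weights, the 0.85 threshold and the use of difflib for string similarity are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; real products ship far larger tables.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

# Assumed weights: the same formula is applied to every pair of records.
WEIGHTS = {"first": 0.2, "last": 0.3, "street": 0.3, "postal": 0.2}


def standardize(record):
    """Lower-case every element and map nicknames to a formal first name."""
    clean = {key: value.strip().lower() for key, value in record.items()}
    clean["first"] = NICKNAMES.get(clean["first"], clean["first"])
    return clean


def similarity(a, b):
    """Weighted average of per-element string similarity, from 0.0 to 1.0."""
    a, b = standardize(a), standardize(b)
    return sum(
        weight * SequenceMatcher(None, a[field], b[field]).ratio()
        for field, weight in WEIGHTS.items()
    )


def is_match(a, b, threshold=0.85):
    """Flag the pair as a match if it scores above the user-specified level."""
    return similarity(a, b) > threshold
```

With these assumptions, "Bob Smith" and "Robert Smith" at the same address are flagged as a match because the nickname is standardized before the comparison.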

Merge/purge systems are relatively fast, cheap and easy to set up. But applying the same scoring formula to all records inherently fails to take into account significant differences in particular situations. For example, matching an uncommon last name should count for more than matching a common one. The second class of customer matching software is able to take such differences into account.
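One way to make an uncommon last name count for more is to weight a match by how rare the name is in the file. The sketch below uses the logarithm of inverse frequency, an idea borrowed from text retrieval and not something the products discussed here are known to use; the name counts and file size are invented for illustration.

```python
import math

# Hypothetical last-name counts in a customer file of 100,000 records.
LAST_NAME_COUNTS = {"smith": 1000, "zabriskie": 2}
FILE_SIZE = 100_000


def last_name_weight(name):
    """Return a larger weight for rarer names (log of inverse frequency).

    Names missing from the table are assumed to occur once, so they are
    treated as rare.
    """
    count = LAST_NAME_COUNTS.get(name.lower(), 1)
    return math.log(FILE_SIZE / count)
```

Under these assumed counts, a match on "Zabriskie" carries more than twice the weight of a match on "Smith."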

This second class of customer matching software systems works by looking for patterns in the input records and applying different rules to different patterns. Patterns are applied at two levels: to identify data elements and to determine treatment of record pairs. Pattern-based element identification is particularly good at working with complex name lines, such as "John Smith and Jane Doe," "Jane Doe Smith" and "Mr. and Mrs. John Smith." A simple parsing routine would look at the first and last word on the line and come up with first and last names of "John Doe," "Jane Smith" and either "John Smith" or "Mr. Smith." That is, it would conclude each name is significantly different and miss the presence of two individuals altogether.

A pattern-based parser would recognize common first names, last names, titles and conjunctions, look at the patterns these are forming and apply rules to identify the elements correctly. Such a parser would also adjust for generational indicators such as Jr., Sr. and III; industry terms identifying relationships such as ITF (in trust for); and business aliases such as "John Smith d.b.a. (doing business as) Smith Supplies."
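A toy version of pattern-based element identification is sketched below. Each word on the name line is classified with a key-word table, the resulting sequence of types forms a pattern, and a rule for that pattern says which words are the first and last names of each person. The tables, the one-letter type codes and the four rules are hypothetical and far smaller than anything a real product would use.

```python
# Hypothetical key-word tables; commercial tables are far larger.
TITLES = {"mr", "mrs", "ms", "dr"}
CONJUNCTIONS = {"and", "&"}
FIRST_NAMES = {"john", "jane"}


def classify(token):
    """Map a word to an element type: T(itle), C(onjunction), F(irst) or W(ord)."""
    word = token.lower().rstrip(".")
    if word in TITLES:
        return "T"
    if word in CONJUNCTIONS:
        return "C"
    if word in FIRST_NAMES:
        return "F"
    return "W"  # unrecognized word, assumed here to be a last name


# Each rule lists (first-name position, last-name position) for every person
# on the line. A first-name position of None means it is not on the line.
RULES = {
    "FW": [(0, 1)],                # John Smith
    "FWCFW": [(0, 1), (3, 4)],     # John Smith and Jane Doe
    "FWW": [(0, 2)],               # Jane Doe Smith
    "TCTFW": [(3, 4), (None, 4)],  # Mr. and Mrs. John Smith
}


def parse_name_line(line):
    """Return a list of (first, last) tuples, or an empty list if no rule applies."""
    tokens = line.split()
    pattern = "".join(classify(token) for token in tokens)
    people = []
    for first_pos, last_pos in RULES.get(pattern, []):
        first = tokens[first_pos] if first_pos is not None else None
        people.append((first, tokens[last_pos]))
    return people
```

For "Mr. and Mrs. John Smith," this sketch reports two individuals named Smith, one of them with no first name on the line, which is exactly the case a first-word, last-word parser misses.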

As with most standardization and parsing processes, this approach relies heavily on key-word tables that identify how different words are commonly used in different contexts. The scope and variety of these tables are critical to the accuracy of the parsing process. Most pattern-based systems let users modify these tables to reflect conditions in their particular files, such as specialized industry terms, company-standard abbreviations or local geography. The pattern tables themselves can also be modified to accommodate known input peculiarities, such as a practice of flagging the last name with a special character (Henry @James). In effect, key-word and pattern tables provide the knowledge that a human reviewer would intuitively bring to the task. Because the tables have greater memory capacity and behave consistently regardless of personality or fatigue, they are in some ways superior to human reviewers, particularly on routine processes. (However, where accuracy is critical, most firms still rely on manual review and research to resolve ambiguous cases.)

Pattern-based matching rules rely on elements identified at the parsing step and apply different rules to different element patterns. These patterns may look at the sequence of element types. For example, a pattern that identified a female first name followed by two possible last names (Jane Doe Smith) might trigger a rule to treat the middle name as a potential last name for matching purposes. Or, rules might take into account which elements are present, for example by giving higher weight to a matching first name if there is also a matching middle initial.
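Both rules in the paragraph above can be sketched as follows. The point values are arbitrary, and the test for "middle name that may be a last name" is simplified to "any middle element longer than an initial," which is an assumption of this example and not how any particular product decides.

```python
def last_name_candidates(person):
    """Return every name that may serve as a last name for matching.

    Assumed rule: a middle element that is a full word, not an initial,
    is treated as a potential last name.
    """
    candidates = {person["last"].lower()}
    middle = person.get("middle", "").rstrip(".")
    if len(middle) > 1:
        candidates.add(middle.lower())
    return candidates


def name_score(a, b):
    """Score a pair of parsed names; the point values are illustrative only."""
    score = 0.0
    if last_name_candidates(a) & last_name_candidates(b):
        score += 0.5
    if a["first"].lower() == b["first"].lower():
        score += 0.3
        # A matching first name counts for more when the middle initials agree.
        mid_a, mid_b = a.get("middle", ""), b.get("middle", "")
        if mid_a and mid_b and mid_a[0].lower() == mid_b[0].lower():
            score += 0.2
    return score
```

Here "Jane Doe Smith" still matches "Jane Doe" on last name, because "Doe" is carried as a candidate, and it scores higher against "Jane D. Smith" because the middle initial agrees as well.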

Different systems take different approaches to rule and pattern definitions. Some are highly structured, offering fixed elements, match types (e.g., perfect, close or none) and outcome classes (e.g., accept, reject or ambiguous). In this case, the user need only determine how to classify each of the large but finite number of possible combinations of element match types. Other systems let users write rules in a scripting language that defines what to look for and how to react; this gives almost total flexibility. Whatever process a vendor applies, nearly all systems provide a default set of patterns and rules to help users get started. Because users can identify exactly which rule was applied to accept or reject a given match, it is relatively easy to modify the default rules by reviewing outcomes and making adjustments over time.
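The highly structured style can be pictured as a lookup table keyed by the match type of each element. In the sketch below, the three elements, the 0.8 cutoff for a "close" match, the handful of rules and the choice to treat unlisted combinations as ambiguous are all assumptions made for illustration. With three elements and three match types there are 27 possible combinations to classify.

```python
from difflib import SequenceMatcher

ELEMENTS = ("last", "first", "street")


def match_type(a, b, close_threshold=0.8):
    """Classify one pair of element values as perfect, close or none."""
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return "perfect"
    if SequenceMatcher(None, a, b).ratio() >= close_threshold:
        return "close"
    return "none"


# Hypothetical rule table: (last, first, street) match types -> outcome class.
RULES = {
    ("perfect", "perfect", "perfect"): "accept",
    ("perfect", "close", "perfect"): "accept",
    ("close", "perfect", "perfect"): "accept",
    ("perfect", "perfect", "none"): "ambiguous",
    ("none", "none", "none"): "reject",
}


def classify_pair(a, b):
    """Return the rule key that fired and its outcome, so users can audit it."""
    key = tuple(match_type(a[element], b[element]) for element in ELEMENTS)
    return key, RULES.get(key, "ambiguous")  # unlisted combinations need review
```

Returning the rule key along with the outcome is what makes it possible to see exactly which rule accepted or rejected a pair and to adjust that one rule later.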

The rules-based approach also lets users apply additional processing only to ambiguous matches, thus allowing a more detailed review of the available data when needed, without performing unnecessary processing on simple cases. One application of such processing is to resolve cases of "chaining," where record A matches record B and record B matches record C, but records A and C do not match each other. Users may define rules to determine when to accept such matches and when to reject them. This sort of incremental processing combined with the greater inherent accuracy of pattern-based matching lets pattern-based systems find 90 to 95 percent of possible matches, compared with rates of 50 to 70 percent for merge/purge systems. Of course, your results may vary.
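Chaining can be detected by grouping records that are connected through any series of matches and then listing the pairs inside each group that did not match directly. The sketch below does this with a small union-find structure; it assumes record identifiers are simple sortable labels and leaves the accept-or-reject decision for each chained pair to a user-defined rule.

```python
from itertools import combinations


def clusters(records, matched_pairs):
    """Group records into connected components using a simple union-find."""
    parent = {record: record for record in records}

    def find(record):
        while parent[record] != record:
            parent[record] = parent[parent[record]]  # path halving
            record = parent[record]
        return record

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for record in records:
        groups.setdefault(find(record), set()).add(record)
    return list(groups.values())


def chained_pairs(group, matched_pairs):
    """Return pairs in a group that are linked only through other records."""
    direct = {frozenset(pair) for pair in matched_pairs}
    return [
        pair
        for pair in combinations(sorted(group), 2)
        if frozenset(pair) not in direct
    ]
```

Given matches A-B and B-C, the records A, B and C fall into one group, and the pair A-C is reported as chained so that a further rule can decide whether to keep it.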

On the other hand, merge/purge systems run faster. Merge/purge systems run multiple millions of records per hour, compared with one million or fewer per hour for pattern-based matching. These figures are crude guidelines, as speed varies greatly for all types of matching software depending on the hardware and algorithms involved.

Pattern-based and merge/purge systems differ in ways other than matching techniques and speed. Because the pattern-based systems were designed primarily to match customer records, they maintain persistent customer identifiers from one update to the next. This is unnecessary in a merge/purge system, which is built largely to remove duplicates from a group of lists that are rented for one-time use. Maintaining a persistent customer ID is relatively straightforward because it largely involves appending the ID to the input records in each matching session and carrying it through to the output. However, it does involve some nuances, such as ensuring that the same ID is applied if a customer vanishes for a few cycles and then reappears or if the customer moves and a record later shows up at the old address. When IDs are applied to households as well as individuals, things get even more complicated. You need rules to handle household mergers such as weddings and household splits such as divorces or children leaving for college. In fact, household definition is often a very contentious part of the database development process because different users have different definitions that make sense for their own purposes. Multiple household definitions, each with its own set of IDs, are quite common in large consumer marketing databases.
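The two nuances mentioned for individual IDs can be illustrated with a small index that never forgets a key. In this sketch the class name, its methods and the idea of a string "match key" are all hypothetical; an exact key lookup stands in for the real matching step, and household IDs are not covered.

```python
import itertools


class CustomerIndex:
    """Keeps every match key ever seen, so a customer who vanishes for a
    few cycles and then reappears gets the same ID back."""

    def __init__(self):
        self._ids = {}
        self._next_id = itertools.count(1)

    def assign(self, match_key):
        """Return the existing ID for this key, or issue a new one."""
        if match_key not in self._ids:
            self._ids[match_key] = next(self._next_id)
        return self._ids[match_key]

    def link(self, old_key, new_key):
        """Record that new_key (for example, a new address) is the same
        customer as old_key, while keeping old_key on file."""
        self._ids[new_key] = self.assign(old_key)
```

Because the old key stays in the index after a move is linked, a record that later shows up at the old address still receives the original ID.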

The desire to build a permanent customer database also leads pattern-matching vendors to include extensive facilities for data consolidation. These range from simple functions to aggregate values such as purchases recorded in different billing systems to complex rules to select the "best" version of an element such as a Social Security number or primary address. Although this sort of consolidation does not rely directly on pattern-based matching, it may use the system's assessment of the quality of different input records to help determine which record to treat as most reliable.
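A minimal sketch of both kinds of consolidation follows: purchases are summed across source systems, and the "best" address is taken from the input record with the highest quality score. The field names and the numeric quality score are assumptions for this example; the source does not say how any product represents record quality.

```python
def consolidate(records):
    """Merge the matched records for one customer into a single view."""
    # Simple aggregation: add up purchases recorded in different systems.
    total_purchases = sum(record["purchases"] for record in records)

    # "Best" element selection: take the address from the most reliable
    # record that actually has one.
    with_address = [record for record in records if record.get("address")]
    best = max(with_address, key=lambda record: record["quality"], default=None)

    return {
        "purchases": total_purchases,
        "address": best["address"] if best else None,
    }
```

For example, if a billing record with quality 60 and a service record with quality 90 both carry an address, the service record's address is kept while the purchase totals from both are added together.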

Pattern-based systems are also much more likely than merge/purge software to provide an application programming interface (API) for real-time processing of individual records. The API is commonly used to integrate the matching process with operational systems, such as order entry or customer service, to quickly identify individuals as existing or new customers. Most operational systems provide their own simple matching routines, but it makes sense to leverage an advanced pattern matching system if the enterprise has already purchased one. This provides results that are both more accurate and more consistent than the operational system would provide by itself and ensures that searches are made against the entire customer universe rather than only the customer records residing in a particular operational silo.

Major vendors of pattern-based matching systems include Trillium Software, a division of Harte-Hanks; Innovative Systems Inc.; Group 1; and Firstlogic. The latter two vendors also sell merge/purge systems; however, their pattern-matching software uses different technology. Vality also sells pattern-based matching software, but relies on users to build their own key-word and pattern tables, a major undertaking that the other vendors avoid by providing users with prebuilt tables and rules. A newcomer to the market is DataMentors, which draws on its founder's experience building pattern-based matching systems at pioneering marketing database vendor OKRA Marketing.
