Last month's column began a discussion of reference-based systems, which continues here.

The mix of inputs and processing techniques varies from vendor to vendor, so results from different reference-based systems are not necessarily the same. However, all vendors report significant improvements – often nearly double the match rates of conventional merge/purge or pattern-based systems. On a reasonably well-maintained file, this might translate into two to eight additional duplicates per hundred input records. A reference-based system also eliminates the much smaller number of false duplicates that occur when two records are similar enough to match but actually refer to different individuals.

Why do reference-based systems find so many additional duplicates? There is more involved than greater precision in matching. Specifically, the reference tables can include a history of the same individual at different addresses or under different names (e.g., before and after a marriage). These connections, derived from change-of-address transactions, legal records, financial institutions and similar sources, cannot be made by comparing name and address records directly.

While some false connections are inevitable, each vendor has tuned its rules to keep errors at what it considers an acceptable minimum. Users with different preferences cannot change these rules directly, although most vendors allow clients to apply their own splitting or combination rules after the standard processing. This contrasts with merge/purge and pattern-based matching systems, which let clients tighten or loosen matching rules to meet their individual purposes. The reference-based matching vendors argue that such tuning is unnecessary because their standard processes yield such accurate results.

Clients can also propose corrections to the reference tables, although not all clients are willing to share such information and let the vendors decide whether to accept a proposed change. When corrections are made, vendors can notify clients by publishing the list of affected IDs. Because the vendors keep track of which IDs have matched to each client's input, they can send each client only the list of relevant IDs.
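
To make the mechanics concrete, here is a minimal Python sketch of how a reference table holding name and address history might resolve two dissimilar input records to one standard ID, and how a vendor might limit correction notices to the clients that have matched each ID. The table layout, the helper names (`resolve`, `find_duplicates`, `affected_ids_by_client`) and all sample data are hypothetical; actual vendor tables and matching rules are proprietary and far more elaborate.

```python
# Hypothetical reference table: every known (name, address) variant of an
# individual points to that individual's standard ID.
REFERENCE_TABLE = {
    ("mary jones", "12 oak st, dayton oh"): "ID-1001",  # before marriage and move
    ("mary smith", "48 elm ave, akron oh"): "ID-1001",  # after marriage and move
    ("john smith", "48 elm ave, akron oh"): "ID-1002",
}


def normalize(text):
    """Lowercase and collapse whitespace (a stand-in for real standardization)."""
    return " ".join(text.lower().split())


def resolve(name, address):
    """Return the standard ID for an input record, or None if it is unknown."""
    return REFERENCE_TABLE.get((normalize(name), normalize(address)))


def find_duplicates(records):
    """Group input records by standard ID; groups larger than one are duplicates."""
    groups = {}
    for record in records:
        std_id = resolve(*record)
        if std_id is not None:
            groups.setdefault(std_id, []).append(record)
    return {std_id: recs for std_id, recs in groups.items() if len(recs) > 1}


def affected_ids_by_client(corrected_ids, matched_ids_by_client):
    """For each client, keep only the corrected IDs that client has matched."""
    corrected = set(corrected_ids)
    return {
        client: sorted(corrected & ids)
        for client, ids in matched_ids_by_client.items()
        if corrected & ids
    }


client_file = [
    ("Mary Jones", "12 Oak St, Dayton OH"),
    ("Mary Smith", "48 Elm Ave, Akron OH"),
    ("John Smith", "48 Elm Ave, Akron OH"),
]
duplicates = find_duplicates(client_file)

notices = affected_ids_by_client(
    ["ID-1001", "ID-2000"],
    {"client_a": {"ID-1001", "ID-1002"}, "client_b": {"ID-1002"}},
)
```

In this toy file the two Mary records share neither a surname nor an address, so comparing them directly would not link them; only the history held in the table does. In the notification step, `client_b` receives nothing because it has never matched either corrected ID.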

In addition to providing greater accuracy and operational efficiency, reference-based systems hugely simplify the sharing of data among different companies. The standard ID is the key. When two list owners wish to combine information on common customers, they need only compare their lists of IDs – an easier and more accurate process than conventional matching and one that does not require sharing actual names and addresses. In practice, such comparisons would be done by the reference table vendor rather than the companies themselves because license agreements forbid sharing the standard IDs with outside firms.
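
The comparison itself amounts to a lookup on the shared key. The Python sketch below assumes the vendor holds, for each list owner, a mapping from that owner's own customer key to the standard ID, and returns only pairs of owner keys so that neither the IDs nor any names and addresses change hands. The `match_customers` function, the key formats and the choice of what the vendor returns are illustrative assumptions, not a description of any vendor's service.

```python
def match_customers(owner_a_keys, owner_b_keys):
    """Vendor-side comparison of two owners' lists.

    Each argument maps an owner's own customer key to a standard ID.
    Returns sorted (key_a, key_b) pairs that refer to the same individual.
    """
    # Index owner B's keys by standard ID for constant-time lookup.
    b_by_id = {}
    for key_b, std_id in owner_b_keys.items():
        b_by_id.setdefault(std_id, []).append(key_b)

    pairs = []
    for key_a, std_id in owner_a_keys.items():
        for key_b in b_by_id.get(std_id, []):
            pairs.append((key_a, key_b))
    return sorted(pairs)


owner_a = {"A-17": "ID-1002", "A-18": "ID-1007", "A-19": "ID-1001"}
owner_b = {"B-501": "ID-1007", "B-502": "ID-1010", "B-503": "ID-1002"}

common = match_customers(owner_a, owner_b)
```

Because the match is an exact comparison of IDs, there are no similarity thresholds to tune at this stage; any fuzziness was resolved earlier, when each list was coded against the reference table.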

Standard IDs provide similar efficiencies for appending data from third-party sources to in-house lists. Again, the third-party data list is coded with the standard IDs, and these are matched against the IDs provided by the list owner. This sort of matching could be conducted on a periodic basis, or list owners could be notified when any interesting data appeared about one of their customers. This opens up some intriguing, if Orwellian, marketing possibilities.
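
Both modes described above, a periodic append and an alert when new data appears, can be sketched in Python as operations on the shared ID. The field names (`std_id`, `cust_no`, `homeowner`, `event`) and both helper functions are hypothetical, chosen only for illustration.

```python
def append_attributes(house_list, third_party):
    """Periodic append: enrich each house-list row with third-party attributes.

    `third_party` maps standard ID -> dict of attributes. Rows with no
    third-party data are returned unchanged.
    """
    enriched = []
    for row in house_list:
        extra = third_party.get(row["std_id"], {})
        enriched.append({**extra, **row})  # house-list values win on conflict
    return enriched


def events_for_owner(events, owner_ids):
    """Alerting: keep only the new events that concern this owner's customers."""
    return [event for event in events if event["std_id"] in owner_ids]


house_list = [
    {"std_id": "ID-1001", "cust_no": 17},
    {"std_id": "ID-1002", "cust_no": 18},
]
third_party = {"ID-1001": {"homeowner": True}}

enriched = append_attributes(house_list, third_party)

new_events = [
    {"std_id": "ID-1002", "event": "change of address"},
    {"std_id": "ID-3000", "event": "change of address"},
]
alerts = events_for_owner(new_events, {row["std_id"] for row in house_list})
```

Here only customer 17 gains a `homeowner` attribute, and the owner is alerted about the one event that involves an ID on its own list.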

In fact, the privacy implications of reference-based matching have received relatively little public discussion. The vendors argue these systems enhance privacy because they yield more accurate matches and, by linking all related records, make it easier to comply with opt-out requests. However, widespread use of the same reference table also means that any errors in that table will be propagated widely rather than limited to a single company's internal systems. Easier and cheaper cross-company matching also encourages firms to share data more widely, leading to more comprehensive customer profiles that could easily be misused by the inept or abused by the malevolent. Because the reference-based systems are technically designed for matching rather than data sharing, they do not appear to be governed by existing privacy regulations. They are affected indirectly, however, as reduced access to data such as credit records makes the tables themselves potentially less accurate. As such systems are more widely understood, they may eventually be subject to the same rules as other lists for individual disclosure, review and opt-out; but, at least in the U.S., it's hard to imagine any regulations being passed that significantly diminish these systems' effectiveness.

In sum, reference-based matching is often more accurate, more efficient and easier to deploy than merge/purge or pattern-based matching systems. On the other hand, prices are higher than for other technologies, and some enterprises may balk at sending their customer list to an outside vendor. Where circumstances permit, reference-based matching is an option well worth exploring.
