While fictional, the previous example illustrates a potential risk of relying on inadequate data matching technologies for identification purposes. Though such technologies have improved substantially in recent years, many businesses across all industries still rely on inaccurate matching mechanisms, making their customer data integration (CDI) systems prone to serious error. In the case of healthcare, using the wrong technologies can hinder and delay treatment efforts and the ability to comply with state and federal privacy laws. Inadequate data matching methods also jeopardize companies, governments and organizations that must rapidly determine the identities of individuals. For example, the quality of the technology can mean the difference between preventing a terrorist from boarding a jet airliner and allowing one to pass undetected.
In addition, inaccurate data matching can cost businesses thousands or, in some cases, millions of dollars.
- A financial institution could be fined for dealing with a prohibited foreign company because its matching logic failed to recognize that one of its customers was on a government watch list.
- An insurance company with inaccurate matching systems could fail to identify a supposedly new customer as an existing customer with a delinquent account.
- A retail firm with systems that lack real-time accurate identification could lose a customer by not recognizing them when they place an order over the Web instead of their usual method of ordering by phone.
Therefore, it is essential for companies to take steps to ensure their CDI system leverages the most effective identity matching solution.
Choosing a Matching System
Though accuracy has long been viewed as the cornerstone of any successful CDI installation, deciding which method to use to ensure precise automated data matching can be difficult. In the CDI industry, "matching" refers to the process of determining when two records belong to the same customer.
Because different industries require different degrees of accuracy, and because the complexity and diversity of data sources vary by company, businesses need to decide which data matching strategy best suits their needs before implementation. Each method has its strengths and weaknesses, so the question is not which is better in an absolute sense, but which fits the practical application and the organization's tolerance for errors. No system is perfect, so understanding the possible matching errors and their frequency is key to finding the right solution.
In CDI lingo, inaccuracies are expressed as "false positives" and "false negatives." False positives occur when the system mistakenly links records that should not be matched (mismatches); false negatives result when the system fails to link two records that should have been matched (missed matches). The frequency of each kind of error can vary greatly depending on the matching method used.
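The two error types can be counted once a trusted set of correct links is available for comparison. The following is a minimal sketch in Python, not a description of any particular CDI product; the function name `count_matching_errors` and the record identifiers are hypothetical.

```python
def count_matching_errors(predicted_links, true_links):
    """Return (false_positives, false_negatives) for sets of record-pair links."""
    # Treat (a, b) and (b, a) as the same pair.
    predicted = {frozenset(pair) for pair in predicted_links}
    truth = {frozenset(pair) for pair in true_links}

    false_positives = predicted - truth  # linked, but should not be (mismatches)
    false_negatives = truth - predicted  # should be linked, but were missed
    return len(false_positives), len(false_negatives)


# Hypothetical record IDs: the system linked r1-r2 and r3-r4,
# while the correct links are r1-r2 and r5-r6.
predicted = [("r1", "r2"), ("r3", "r4")]
truth = [("r2", "r1"), ("r5", "r6")]
fp, fn = count_matching_errors(predicted, truth)
print(f"false positives: {fp}, false negatives: {fn}")
```

Tracking both counts separately matters because, as noted above, an organization's tolerance for mismatches may differ from its tolerance for missed matches.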
Understanding Matching Methods
Today there are essentially two methods available for matching and retrieving data in CDI installations: probabilistic and deterministic.
Deterministic Matching
Deterministic matching systems use a combination of algorithms and business rules to determine when two or more records match (the rule "determines" the result). In a deterministic matching system, for example, one rule might instruct the system to match two records with different names if the Social Security number and address fields coincide. Algorithms catch simple common errors such as typos, phonetic variations and transpositions. The result is an either/or outcome: Either records match the requirements of the business rule or they don't.
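The example rule described above (match when Social Security number and address coincide, even if the names differ) can be sketched in Python as follows. This is an illustration under stated assumptions: the `CustomerRecord` type, the `normalize` helper and the sample values are hypothetical, and the normalization is deliberately simpler than the typo and phonetic algorithms a real system would use.

```python
from dataclasses import dataclass


@dataclass
class CustomerRecord:
    name: str
    ssn: str
    address: str


def normalize(value: str) -> str:
    # Ignore case, spacing and punctuation so trivial formatting
    # differences do not block a match.
    return "".join(ch for ch in value.lower() if ch.isalnum())


def deterministic_match(a: CustomerRecord, b: CustomerRecord) -> bool:
    """Business rule: records match if SSN and address coincide; name is ignored."""
    ssn_a, ssn_b = normalize(a.ssn), normalize(b.ssn)
    same_ssn = bool(ssn_a) and ssn_a == ssn_b  # an empty SSN never matches
    same_address = normalize(a.address) == normalize(b.address)
    return same_ssn and same_address


first = CustomerRecord("Jon Smith", "000-00-0001", "12 Oak St.")
second = CustomerRecord("John Smyth", "000000001", "12 oak st")
print(deterministic_match(first, second))
```

The function returns only `True` or `False`, which reflects the either/or character of a rule-based outcome.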
Deterministic matching systems are generally less accurate than probabilistic matching systems. They are best suited for applications where the number of records is relatively small (fewer than two million), there are few data attributes and the consequences of an error are minor. One such application could be mailing list processing. If the system matches a name to an incorrect address, the mailing is sent to the wrong person and the sending company wastes the postage.
Deterministic systems do allow organizations to better leverage their in-house IT staff for system implementation and for developing matching rules. When the number of data attributes and rules required is small, this can make implementation shorter and less expensive. However, the more attributes involved and the larger the data sets, the more complex the rules-based matching routines become. Implementation can then involve many hours of development and testing and longer deployment times than probabilistic systems. Deterministic approaches also have no speed advantage over probabilistic methods, which can now perform lookups in real time.
In addition, deterministic systems lack scalability. When databases grow beyond a few hundred thousand records, companies with deterministic matching systems typically require expensive customization and business rule revision. If an attribute is added to a data set, the number of rules the system requires doubles, which can be very labor-intensive and can hurt system scalability and performance. Both factors push the maintenance costs and total cost of ownership of deterministic systems far higher than those of a probabilistic matching solution.
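One way to read the doubling claim, offered here as an assumption rather than the article's stated reasoning, is that each attribute can either agree or disagree between two records, so n attributes produce 2^n agreement patterns that the rules may need to address. A short Python sketch with hypothetical attribute names:

```python
from itertools import product


def agreement_patterns(attributes):
    """List every agree/disagree combination across the given attributes."""
    return [
        dict(zip(attributes, outcome))
        for outcome in product((True, False), repeat=len(attributes))
    ]


three = agreement_patterns(["name", "ssn", "address"])
four = agreement_patterns(["name", "ssn", "address", "phone"])
print(len(three), len(four))  # 8 16
```

Under this reading, each added attribute doubles the number of patterns to consider, which is why hand-maintained rule sets become costly as data sets gain attributes.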
Probabilistic Matching
Probabilistic matching uses likelihood ratio theory to assign comparison outcomes to the correct, or more likely, decision. This method leverages statistical theory and data analysis and, thus, can establish more accurate links between records with more complex typographical errors and error patterns than deterministic systems.
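A common formulation of likelihood-ratio matching is the Fellegi-Sunter model, in which each field contributes a weight based on how much more likely its agreement is among true matches than among non-matches. The article does not name a specific model, so the Python sketch below is an assumption; the per-field probabilities and the two thresholds are illustrative values, not measured figures.

```python
import math

# Hypothetical per-field probabilities (illustrative only):
#   m = P(field agrees | records truly match)
#   u = P(field agrees | records do not match)
M_U = {
    "name": (0.95, 0.01),
    "ssn": (0.99, 0.0001),
    "address": (0.90, 0.05),
}


def match_weight(agreements):
    """Sum the log2 likelihood ratios over all compared fields."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if agreements[field]:
            total += math.log2(m / u)              # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight, upper=10.0, lower=0.0):
    """Map a total weight to a decision using two assumed thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


# Names differ (for example, through a typo) but SSN and address agree.
weight = match_weight({"name": False, "ssn": True, "address": True})
print(round(weight, 2), classify(weight))
```

Unlike an either/or rule, the score is graded: a disagreement on one field can be outweighed by strong agreement on others, and borderline scores can be routed to manual review.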