
Fuzzy Data Integration, Part 1

Published August 1, 2005, 1:00am EDT

How do you deal with integrating data from heterogeneous sources that do not share common keys?

One of the greatest challenges and most time-consuming efforts in a data warehouse project is integrating data from disparate sources into a meaningful collection of information that users can rely on to make decisions. Significant resources must be applied not only to understanding the data and its structures, but also to gathering and analyzing the metadata of the source systems to fully appreciate the context of the information. Additional effort may be required if the data exists in unstructured or proprietary formats, exacerbating an already complex process. Once this basic step of understanding and accessing your data sources is under control, you need to determine the best approach to integrating the data. Depending on your business requirements and the state of the source system data, integration can be demanding even when common data columns exist between the applications. The task becomes even more difficult when common elements do not exist between source systems and new, inexact methods must be designed and developed to meet the business need to integrate this data.

This column is the first part of a two-part series examining a data integration strategy for matching and integrating data between heterogeneous source systems when no common elements exist to facilitate the process. In this installment, we will look at some typical data integration approaches and provide an example case. In the second part, we will look at the logic needed to implement this data matching method and some of the details that must be considered before attempting the process. We assume that the data quality or data integration product your firm uses does not already provide this type of capability.

Data integration into the data warehouse typically occurs through one of several methods, and your data integration process should support them individually or in combination to improve data quality in the data warehouse environment. Before any matching takes place, data validation should be performed on records originating from each source system to identify and flag poor-quality records. Validation can include lookup tables (e.g., ISO code standards) or data type checking (valid numeric, date and null values) to qualify source records for inclusion in further processing into the data warehouse. Records that do not qualify are excluded from the primary data integration workflow and rerouted to an exception queue for alternate processing or review. The first integration method consists of cross-matching records from several source systems by the same or similar key fields (e.g., employee ID). Similar-key matching may apply some type of masking to the comparison fields to eliminate character differences (e.g., SMITH100321 <-> SMITH, JOHN using SMITH*). The second method selects record values based on the most common values occurring across the various source systems (e.g., last name Smith exists in two of the four source systems). Third, new target column values can be selected based on a prioritized hierarchy of the source systems (for example, use call center customer address data over billing data unless call center data for customer X does not exist).
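To make these methods concrete, here is a minimal sketch in Python. The field names (country, balance, open_date), the lookup table contents and the source priority order are illustrative assumptions rather than part of any particular product; a real implementation would be driven by your own source system metadata.

from collections import Counter
from datetime import datetime

# Illustrative lookup table (e.g., ISO country codes) used to validate source records.
VALID_COUNTRY_CODES = {"US", "CA", "GB", "DE"}

def validate(record):
    """Return a list of data quality problems; an empty list means the record qualifies."""
    errors = []
    if record.get("country") not in VALID_COUNTRY_CODES:
        errors.append("unknown country code")
    try:
        float(record.get("balance", ""))  # data type check: valid numeric
    except (ValueError, TypeError):
        errors.append("balance is not numeric")
    try:
        datetime.strptime(record.get("open_date", ""), "%Y-%m-%d")  # data type check: valid date
    except (ValueError, TypeError):
        errors.append("open_date is not a valid date")
    return errors

def mask_key(value, length=5):
    """Method 1: mask a comparison field so SMITH100321 and 'SMITH, JOHN' both reduce to SMITH."""
    return "".join(ch for ch in value.upper() if ch.isalpha())[:length]

def most_common_value(values):
    """Method 2: choose the value occurring most often across the source systems."""
    values = [v for v in values if v]
    return Counter(values).most_common(1)[0][0] if values else None

def value_by_priority(values_by_source, priority=("call_center", "billing")):
    """Method 3: take the value from the highest-priority source that actually has one."""
    for source in priority:
        if values_by_source.get(source):
            return values_by_source[source]
    return None

In such a workflow, records failing validate() would be routed to the exception queue described above, and the matching and selection helpers would run only against the qualified records.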

Because most firms do not use the same keys across all of their systems, fuzzy matching can be applied to several similar columns in each source system, looking for closely matching patterns such as similar last names (e.g., Smith versus Smyth) or phone numbers (e.g., 312-555-1212 versus 312-555-1221). Alternatively, a pattern-matching process can be set up to reconcile the dissimilar column values and integrate the data. In next month's column, we will look at one approach to this logic.
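The sketch below, which relies only on Python's standard difflib module, shows the flavor of such a fuzzy comparison. The 0.8 threshold, the equal weighting of the two columns and the choice of last name and phone number as comparison fields are assumptions made for illustration; in practice they would be tuned against your own data.

from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-to-1 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def digits_only(phone):
    """Normalize a phone number to its digits so formatting differences are ignored."""
    return "".join(ch for ch in phone if ch.isdigit())

def fuzzy_match(rec_a, rec_b, threshold=0.8):
    """Declare two records a probable match if the compared columns are close enough on average."""
    scores = [
        similarity(rec_a["last_name"], rec_b["last_name"]),                      # Smith vs. Smyth
        similarity(digits_only(rec_a["phone"]), digits_only(rec_b["phone"])),    # 312-555-1212 vs. 312-555-1221
    ]
    return sum(scores) / len(scores) >= threshold

# Example: two records from different systems with no shared key.
a = {"last_name": "Smith", "phone": "312-555-1212"}
b = {"last_name": "Smyth", "phone": "(312) 555-1221"}
print(fuzzy_match(a, b))  # True under the assumed threshold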
