How do you deal with integrating data from heterogeneous sources that do not share common keys?
This column is the first part of a two-part series examining a data integration strategy for matching and integrating data between heterogeneous source systems when no common elements exist to facilitate the process. In this installment of the series, we will look at some typical data integration approaches and provide an example case. In the second part of this series, we will look at the logic needed to implement this data matching method and some of the details that need to be considered before attempting this process. We are making the assumption that the data quality and data integration product or method being used by your firm does not have this type of capability.
Data integration into the data warehouse typically occurs through one of several methods. Your data integration process should support these methods whether used individually or in combination to improve data quality in the data warehouse environment. Data validation should be able to be performed on data originating from a particular source system to identify and mark poor quality records. Validation can include use of lookup tables (e.g., ISO code standards) or data type checking (valid numeric, date and null) to qualify source records for inclusion in further processing to the data warehouse. Records that do not qualify are excluded from the primary data integration workflow and rerouted to an exception queue for alternate processing or review. The first data integration process method consists of cross matching records from several source systems by same or similar key fields (e.g., employee ID). Similar key field matching may consist of some type of masking of the comparison fields to eliminate character differences (e.g., SMITH100321 <-> SMITH, JOHN using SMITH*). The second method involves selection of record values based on the most common values occurring across the various source systems (e.g., last name Smith exists in two of the four source systems). Third, new target column values can be selected based on a prioritized hierarchy of the source systems (for example, use call center customer address data over billing unless call center data for customer X does not exist).
Because most firms do not use the same keys across all of their systems, fuzzy matching can be used on several similar columns on each source system looking for closely matching patterns such as similar last names (e.g., Smith versus Smyth) or phone numbers (e.g., 312-555-1212 versus 312-555-1221). Alternately, a pattern matching process can be set up to reconcile the dissimilar column values to integrate the data. In next month's column, we will look at one approach to this logic.
Michael Jennings is a recognized industry expert in enterprise information management. He has more than twenty years of information technology experience in the healthcare, telecommunications, insurance, banking, manufacturing and logistics, and human resources industries and has experience within governmental organizations. Jennings has published numerous industry columns/articles for EIM Institute, Information Management and Intelligent Enterprise magazines and others, and has been a judge for numerous awards. He speaks frequently on enterprise information management issues and practices at major industry conferences, and has been an instructor of information technology at the University of Chicago's Graham School. Jennings is a co-author of the book "Universal Meta Data Models" (2004) and a contributing author to the books "Building and Managing the Meta Data Repository" (2000) and "The DAMA Guide to the Data Management Body of Knowledge" (DAMA-DMBOK, 2009).










Be the first to comment on this post using the section below.