How do you deal with integrating data from heterogeneous sources that do not share common keys?
In the previous installment of this series (see my column in the August issue of DM Review), we examined some of the various data integration strategies for matching and integrating data between heterogeneous source systems when no common data elements exist to facilitate the process. In the second part of this series, we will look at the logic needed to implement a fuzzy logic data matching method and some of the details that need to be considered before attempting this process.
During the building of our data warehouse, we are faced with the challenge of trying to integrate heterogeneous client data from different source systems in the enterprise - some of which do not have a common key to match against. Because the source systems cannot be readily matched by common columns, the decision is made to apply fuzzy client name matching as an alternative data integration method. For some of the sources, the client name column is the only available method of integrating the data between sources. The task is made more different due to client names being entered in each system in dissimilar methods (e.g., Smith Bros, SMITH Brothers Inc., The Smith Bros. Inc). The extraction, transformation and load (ETL) logic will accept data from each source system that the business users indicate can be matched by client name.
The logic used:
- Retrieve all client information for the specific source system, which needs to be processed into the data warehouse since the last refresh cycle.
- Process the source client name through the fuzzy name match logic to determine the standardized client name of the specific record.
a) An enterprise word translation table is used by this process to convert source system client name words into an enterprise standard format for comparison purposes. The method to create this standardized enterprise client name is explained later.
b) The data warehouse cross-reference table maintains a cross reference between the various source system key fields and the one unique data warehouse client surrogate ID. The standardized enterprise name for the client is also maintained on this table.
- Determine if any of the records from the source system have a matching enterprise client name in the data warehouse surrogate key cross-reference table. If so, use the corresponding client surrogate ID and insert a record into the data warehouse surrogate key cross-reference table for future processing cycles. Update the warehouse client table with specific data from the source system using the selected update method (e.g., SCD 1, SCD 2).
- If no enterprise client name matches the source record, then insert a new warehouse surrogate key in the cross-reference table with the enterprise standard client name. Insert a new record into the warehouse client table with specific data from the source system.
Fuzzy Logic Name Match
This process uses a pattern matching method to reconcile client names from different source systems into an enterprise standard format so records from these disparate systems and the data warehouse can be compared. This process takes a client name (the legal description) as input and provides a standardized enterprise name as its output. A translate table is created and used by the process to take words found in the client name of sources and convert them into a standard enterprise format. For example, consider the word INTERNATIONAL. This word may arrive from the source in a variety of formats. Data profiling can be used on the different source systems to build the translate table for the fuzzy logic process. A company name such as Smith Bros International or Smith Brothers or Smith Brothers Int. found in three source systems would all be translated into an enterprise standard format of SMITH BROTHERS INTERNATIONAL.The process uses the following logic:
- Get rid of any character not in upper or lowercase (A-Z) and (0-9).
- Tokenize the remaining words using the space character as the separator.
- For each word, determine if a matching word exists on the word translation table. If so, replace the word with the standardized word version. For example, BROS is replaced by BROTHERS. Words such as COMPANY are regarded as noise words and are removed.
- Concatenate all the standardized words into a standard enterprise client name for comparison purposes with the data warehouse surrogate key cross-reference table.
Using this type of fuzzy logic name matching, you can integrate data from disparate source systems into your data warehouse. Data profiling and rigorous testing need to be performed to ensure that the standardized names are unique and that no collisions occur during your ETL process.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access