In the January 2002 DMReview.com column, we discussed some examples of data integration. Data integration is a term that is thrown around frequently and can mean a number of things. As pragmatic planners, managers or architects, however, we intuitively know that we need to see through such nebulous terms to provide the proper resources for our projects – time, people, skills, tools, data and so on. Last month we identified four models of data integration:

  • Data sharing or transport – typically characterized as the moving of data from one application to another as in enterprise application integration (EAI).
  • Data compilation – accumulating similar records or rows into a common file or table.
  • Data aggregation – summarizing data as it is brought to a higher level of granularity.
  • Data synthesis – constructing a row or record with columns or fields blended from multiple sources.

This month’s column briefly describes a framework for data integration, one that identifies the primary processes of each model. This framework, which we’ll call EMIL (for extract, move, integrate, load), will show which processes the models have in common, as well as those that are unique to each model.


Figure 1: The Data Integration Road Map

The EMIL framework shows that each model consists of a four-step process (see Figure 1). All four models begin with the extract step and proceed with a move step. As with other extract, transform, load (ETL) frameworks, the data is extracted from its source and moved, in most cases, to a staging area for additional processing. Our models also conclude with the load step, which takes the integrated data and loads it into its target database or file. What differs in each model, and what makes EMIL different from the ETL framework, is the third step of processing: the integration engine.
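To make the four steps concrete, here is a minimal sketch of an EMIL-style pipeline in Python. The file names, the staging directory and the pass-through integrate step are illustrative assumptions rather than part of any particular tool; the integrate step is where one of the engines described below would plug in.

```python
import csv
import shutil
from pathlib import Path

STAGING = Path("staging")  # assumed staging area for intermediate files

def extract(source_file: Path) -> list[dict]:
    """Extract: read raw rows from the source system's export file."""
    with open(source_file, newline="") as f:
        return list(csv.DictReader(f))

def move(source_file: Path) -> Path:
    """Move: copy the raw extract into the staging area for further processing."""
    STAGING.mkdir(exist_ok=True)
    staged = STAGING / source_file.name
    shutil.copy(source_file, staged)
    return staged

def integrate(rows: list[dict]) -> list[dict]:
    """Integrate: the model-specific engine (compile, aggregate, or synthesize)."""
    return rows  # placeholder; see the engine sketches later in this column

def load(rows: list[dict], target_file: Path) -> None:
    """Load: write the integrated rows to the target file or database."""
    if not rows:
        return
    with open(target_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

# E-M-I-L, end to end (assumes a customers_export.csv extract exists)
staged = move(Path("customers_export.csv"))
load(integrate(extract(staged)), Path("warehouse_customers.csv"))
```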

The Integration Engine

Except for the data transport model, each model implements a different type of integration engine. The integration engine is where the specialized services you require for your application will reside. It is the component that makes each of the data integration models distinct from the others. Each data integration service is described below.

Data Compilation

Special services are often needed when you compile a large number of similar records into a single data store – for example, when compiling a large number of customers or prospects into a single file for a mailing from your sales department. Special consideration must be given to deduping and sorting/filtering.

Often, large customer files will contain duplicate records for the same customer, frequently the result of misspellings in the name and address fields. For example, last week I received four different program guides for the same data management conference, each with slightly different spellings of my name or company (such data quality issues can easily result in thousands of dollars of excess printing and postage expenses). Name and address data cleansing software provides services for detecting and eliminating such redundancy (see Data e.Quality, March 2001 issue: A Behind-the-Scenes Perspective on Data Cleansing). Of course, for the brave of heart, you can always attempt to write your own comprehensive service.

Sorting and filtering also require special attention in order to eliminate unwanted records and to sort the file into a sequence that will optimize its subsequent load into the target file.
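Here is a minimal sketch of a compilation engine's dedupe and sort/filter services in Python, assuming a crude normalize-and-compare match rule; commercial name and address cleansing tools use far more sophisticated matching, and the field names are illustrative.

```python
def normalize(value: str) -> str:
    """Crude normalization so minor spelling and spacing differences still match."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def dedupe(rows: list[dict]) -> list[dict]:
    """Keep the first row seen for each normalized (name, address) pair."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for row in rows:
        key = (normalize(row["name"]), normalize(row["address"]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def sort_and_filter(rows: list[dict]) -> list[dict]:
    """Drop rows that cannot be mailed and sort by ZIP to optimize the target load."""
    mailable = [row for row in rows if row.get("zip")]
    return sorted(mailable, key=lambda row: row["zip"])

rows = [
    {"name": "Jane Smith",   "address": "123 Main St.", "zip": "90210"},
    {"name": "Jane  Smith",  "address": "123 Main St",  "zip": "90210"},  # near-duplicate
    {"name": "No Zip Given", "address": "456 Oak Ave",  "zip": ""},
]
print(sort_and_filter(dedupe(rows)))  # one mailable, deduplicated record remains
```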

Data Aggregation

Services to summarize or aggregate detail records into a single record are common, particularly within a data mart architecture. Sort and ETL tools are beginning to provide such functions within their basic tool sets, and even newer DBMS releases are beginning to embed such capabilities.
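A minimal sketch of an aggregation engine that rolls daily sales detail up to a monthly summary follows; the column names and the month grain are illustrative assumptions.

```python
from collections import defaultdict

def aggregate_monthly(detail: list[dict]) -> list[dict]:
    """Summarize detail rows to a coarser grain: one row per (month, product)."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for row in detail:
        month = row["sale_date"][:7]                 # "2002-01-15" -> "2002-01"
        totals[(month, row["product"])] += float(row["amount"])
    return [
        {"month": month, "product": product, "total_amount": round(amount, 2)}
        for (month, product), amount in sorted(totals.items())
    ]

detail = [
    {"sale_date": "2002-01-15", "product": "widget", "amount": "19.95"},
    {"sale_date": "2002-01-28", "product": "widget", "amount": "19.95"},
    {"sale_date": "2002-02-03", "product": "gadget", "amount": "4.50"},
]
print(aggregate_monthly(detail))  # two summary rows replace three detail rows
```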

Data Synthesis

Data synthesis is the most complex of the integration models and should never be underestimated in a project plan. Several underlying complexities await the team that takes on this task.

Customer relationship management (CRM) is a good example of data synthesis. Combining data from several enterprise departments into customer records in a data warehouse or operational data store creates a "panoramic" view of the customer. Creating this integrated customer record requires matching customer records from multiple sources, such as sales (all channels: call center, direct sales, Web, etc.), marketing, service center, billing and order processing. Additional complexity arises when the same field/column is available from multiple sources and a determination of the most accurate value must be made. Experienced project managers and architects know that careful analysis is needed to gather and store the business rules to effectively automate this process.
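Here is a minimal sketch of a synthesis engine that blends one customer row from several source systems. The source-precedence table is an assumed business rule (for example, billing is most trusted for the address and the service center for the phone number), used to decide which value wins when the same column arrives from more than one source.

```python
# Assumed business rule: which source wins for each column when values conflict
PRECEDENCE = {
    "address": ["billing", "sales", "service"],
    "phone":   ["service", "sales", "billing"],
    "email":   ["sales", "service", "billing"],
}

def synthesize(customer_id: str, sources: dict[str, dict]) -> dict:
    """Blend fields from multiple source records into one panoramic customer row."""
    merged = {"customer_id": customer_id}
    for column, ranked_sources in PRECEDENCE.items():
        for source in ranked_sources:
            value = sources.get(source, {}).get(column)
            if value:                      # take the first non-empty value by precedence
                merged[column] = value
                break
    return merged

sources = {
    "billing": {"address": "123 Main St, Springfield", "phone": ""},
    "sales":   {"address": "123 Main Street", "email": "jane@example.com"},
    "service": {"phone": "555-0100"},
}
print(synthesize("C-1001", sources))
```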

Optional Services

Finally, the data integration engine may implement a number of optional services. In reality, these services will be performed whether you put them into your design or not, so it is best to plan for them and design them properly. In most cases it makes sense to implement such functions within a staging area, prior to the data being loaded. Optional services include the following (a brief sketch combining them appears after the list):
  • Data transformation – the transforming of data from one value to another to be consistent with a predetermined set of rules. For example, changing all values of the State field to a two-character representation consistent with the values in your predefined state table.
  • Editing – Critical fields will require some type of sanity check. For example, is the customer’s age a negative number, or is it greater than 120?
  • Error handling – You and your users must decide in advance what to do when an error is detected in the data. In some cases, your rules may allow for the automated correction of the data. So the customer with a negative age will have the age turned positive or set to a default value. In some cases, however, your rules may state that you have encountered an irreconcilable error and the record must be omitted from the data load. In such cases the data can be saved to an error file where users can review the data and decide how it can be resolved. In most cases this means you must create a mechanism for the data to be fixed and loaded into the target database.
  • Data cleansing – Data cleansing often complements the edits being performed in your system. For example, if my edits detect a problem in the age field, I might define a rule that states: change any age found to be negative to a positive number, or if the age is greater than 120, change it to 100. Data cleansing may also involve the use of a name and address tool where name and address fields can be changed to comply with postal regulations.
  • Auditing – Auditing facilities must be designed into the process in order to monitor the data and assure that it is all accounted for, either reaching the target database or being captured in an error file to be reviewed by the users.
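As noted above, here is a minimal sketch of how these optional services might fit together in a single staging-area pass. The specific rules (a two-character state table, turning negative ages positive, capping ages above 120) and the error-file layout are assumptions taken from the examples in this column, not a prescribed design.

```python
import csv

STATE_CODES = {"california": "CA", "new york": "NY"}  # assumed predefined state table

def process(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Transform, edit, cleanse, and audit staged rows; return (loadable, errors)."""
    loadable, errors = [], []
    for row in rows:
        # Data transformation: force the State field to its two-character representation
        state = row.get("state", "").strip()
        row["state"] = STATE_CODES.get(state.lower(), state.upper()[:2])

        # Editing: sanity-check the age field
        try:
            age = int(row["age"])
        except (KeyError, ValueError):
            errors.append(row)     # error handling: irreconcilable, route to error file
            continue

        # Data cleansing: auto-correct where the rules allow
        if age < 0:
            age = abs(age)         # negative age turned positive
        if age > 120:
            age = 100              # implausible age capped
        row["age"] = str(age)
        loadable.append(row)

    # Auditing: every input row must land in exactly one of the two outputs
    assert len(loadable) + len(errors) == len(rows)
    return loadable, errors

def write_error_file(errors: list[dict], path: str = "errors.csv") -> None:
    """Save rejected rows so users can review them and decide how to resolve them."""
    if errors:
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=errors[0].keys())
            writer.writeheader()
            writer.writerows(errors)
```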

The next time your boss asks you if you can have a data warehouse or other data integration project completed in four weeks (or some other ridiculous time frame), remember the EMIL framework. What services are you willing to omit in order to meet the deadline? Perhaps it is a legitimate "throw-away" system, you might be thinking. Even so, you had better have answers for the issues and services that the EMIL framework is designed to address.
