Continue in 2 seconds

What are the risks involved when doing data integration?

By
  • Sid Adelman, Chuck Kelley, Evan Levy
Published
  • August 13 2004, 1:00am EDT

Q:  

What are the risks involved when doing data integration?

A:  

Chuck Kelley's Answer: I think a better question is what are the risks of not doing data integration. For example, can the measures be additive (centimeters vs. yards, dollars vs. euros), keeping large amounts of duplicate data, etc.

Sid Adelman's Answer: First a little on data integration. This is taken from Data Strategy by Sid Adelman, Larissa Moss and Majid Abai that should be published in early 2005.

If you ask 10 knowledgeable people in the IT world their definition of integration, you are likely to get 11 or more answers. Integration means to make entire or complete by bringing together the necessary parts. It means to use the required components to form a whole. The parts, the components we care about in data integration are the data elements that will form this whole image. For example, we will try to get a consistent view of a customer by bringing together data elements as they relate to a customer from multiple sources including what the customer filled out in a loan application, the porn Web site, the transactions this customer performs with our organization (captured as the customer swipes our infinity or loyalty card at the supermarket or the casino slot machine), the demographic data on the customer available from external vendors, data from sales force automation, order management, and the information we get when the customer calls our customer service or asks for information on our Web site. A great deal of this data is redundant and duplicated in multiple of these applications and is often inconsistent. Selected data - you have to choose the right ones - is integrated to form a complete picture of the customer so we can market to him or her in the way they want, to market the right products, and market using the right channels. This whole picture of the customer should minimize annoying them.

A primary reason for data integration is the need for the accuracy and the quality of key data. An example would be the need for consistency and accuracy in patient data. If there was some event such as lab results indicating a life-threatening situation such as the indication of a heart attack, that information consolidated with other diagnostic data should be reflected in the patient's integrated medical record.

Risks

There are a number of risks to integrate data and they should be considered before the integration process begins. For any significant data integration to be successful, senior management must know why they are spending the money and must be committed to a long-run project for the initial implementation and the continued sustenance of the project. If management is only short-term oriented or if they are always focusing on the sexy new project, they lack the ability to support a data integration project. Management must be aware that data integration provides little immediate apparent benefits so this is not appropriate for those immediate-gratification managers.

Risks Associated with Cost and Effort to Integrate

Integrating data is expensive. It will require an infrastructure including hardware and software. The software may already be in place but there is no shortage of vendors who are eager to sell you their integration software. For this integration project you will need dedicated, smart, and skilled internal personnel who know the internal systems, the important internal people, and are respected by those people. You may also want to consider bringing in consultants for the initial stages of the project but don't depend too much on them and do not put them in the role of the SMEs.

Risks Associated with Maintaining Integrated Data

The integrated data must be kept reasonably current. The level of currency should be based on how the data is to be used and the requirements for currency. For example, a requirement for near real-time integrated data means the data must be close to up-to-the minute. Near real-time and especially, real-time data requires a more demanding process for maintaining current data.

Source systems will change and so the integration process must be aware of the data sources and should employ impact analysis procedures to trigger activities that will incorporate the source system changes. Metadata should be used to understand the impact of these source system changes. If ETL software is used, the ETL meta data should know about the changes in the source data.

External Data Risks

You will, most likely be bringing data in from suppliers, distributors and trading partners. If you are integrating customer data, you will be purchasing data from companies that specialize in capturing customer demographic data. Most organizations underestimate the difficulty of incorporating this external data. It is often incomplete, less than clean, difficult to match to your data, out of date and often poorly documented. You will be bringing in the data periodically, perhaps once a month or once a quarter, and often the data supplier will have changed the format, the meaning or some other characteristic of the data and that will cause problems.

Validation and Conversion Risks

Validation has more to do with data quality validation is relevant in choosing what data to use when more than one source of data is available. The dirtier the data, the more difficult it will be to clean and then to integrate. To state the obvious, when more than one source is available, first profile (evaluate) the different sources and then, if all other factors are reasonably similar, select the integration source with the cleanest data.

Clay Rehm's Answer: The biggest risk is that the data does not integrate but it appears that it did, and the people who are using it are using inaccurate data. Having said this, sufficient time must be spent in the analysis and design phase to understand all of the pitfalls. Additionally, the test cases and test plan must be developed at the same time as the design document.

Meta data becomes very important during this time - each and every data element must be thoughtfully defined so there is no mistake or doubt to what it is and what it is not. Usually it is the incorrect or incomplete definition of data elements that lead to incorrect assumptions of how the data is captured and used.

Evan Levy's Answer: There are several areas to consider when integrating multiple data sources:

Source content - Is the data content between the two difference systems compatible? While the element names and values might appear similar on first glance, the actual meaning may be entirely different. (E.g., booked revenue and billed revenue are both references to financial value, but they have very different meanings). It's important that you identify a source system data steward or subject matter expert that can give you accurate details about the data from each source.

Timeliness - Does the availability of the data enable integration? We frequently see this as an obstacle when integrating data from different sources. Just because the data extract has been identified as a "daily extract" doesn't necessarily mean that it can be merged with another daily extract. We recently worked with a Bank where two different systems generated their extract data seven hours apart. The problem occurred when today's ATM transactions weren't available in today's extract. The balance information didn't match the transaction details.

Granularity - Is the data captured or detailed at the same level. While the obstacle in merging data at different levels of granularity is apparent (such as weekly with daily data), there are issues relating to data with similar granularity. We see this often in the retail industry. As with many businesses, they have different types of weeks: a financial week, a calendar week, and an advertising week. While each week contains seven days, they don't contain the same seven days. The advertising week is Thursday - Wednesday, the financial week is Monday - Sunday, and the calendar week is Sunday - Saturday. It's important that you identify the granularity and boundaries of the data.

Domain Representation - Is the value for common data represented in the same manner? I've seen social security number represented with and without dashes and in numeric and character fields. If the data is going to be merged, the values have to match. And don't forget about the integration issues relating to merging character and numeric data from different hardware platforms - such as Sun and an IBM mainframe - characters and numeric values are different between ASCII and EBCDIC.

The key to mitigating integration risk is data profiling and analysis. It's important to analyze the data before development, and it's even more important to continue profiling and analyzing the data after implementation. It's not uncommon for source system changes to change - and for those changes to go unnoticed. Integration isn't a one-time-only activity. It should be managed and monitored as a data maintenance activity.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access