Advancing the Art of Data Integration

  • March 06 2009, 5:54pm EST

Data integration is an essential and fundamental basis for data warehouses, business intelligence applications, master data management applications, data migration projects and scalable data service architectures. The DI software market has been estimated to have reached more than $1 billion in 2008.

So why are so many DI users unhappy, and why is the DI industry in what Gartner calls the “trough of disillusionment” that technologies fall into when they fail to meet expectations?
The first generation of DI solutions was too expensive, too complex for developers to use and maintain, and didn’t facilitate a collaborative development approach. Because these tools typically used a vocabulary foreign to business operations, they were also too difficult for end users to understand.

To solve these problems and put the DI industry on “the slope of enlightenment” where users actually start gaining real productivity, a new generation of solutions is required. These next-generation DI solutions will need to:

  • Base the fundamental data integration concept on an abstracted environment that is used to develop, maintain and run the application. Everything needs to be based on the usage of common business definitions, so the analyst, steward, developer, user or anyone discussing the application all are using common business definitions. To be effective, this cannot be an add-on reporting facility – it needs to be baked into the infrastructure to break out of the physical handcuffs that have traditionally shackled extract, transform and load.
  • Change the focus by organizing business definitions into subject areas. Subject areas can then be used to categorize definitions for reporting purposes while retaining the cross-subject-area relationships ignored by current approaches. Consider an employee table – some information is related to the individual, some is related to organizational structure and some is related to personnel. Next-generation DI solutions need to be sophisticated enough to maintain the fidelity of the metadata and not lump it into a single business category simply because it lives in a particular physical structure.
  • Assign an attribute called a “metatype” to business definitions. Metatypes define how the business definition is used: ZIP code, money, ID and first name are all examples of metatypes. Understanding how a data item is used is much more useful than knowing how it was stored.
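To make the three ideas above concrete, here is a minimal sketch of how business definitions, subject areas and metatypes might relate. Every class and field name is a hypothetical illustration for this article, not the API of any real DI product:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BusinessDefinition:
    name: str          # common business name, e.g. "account_balance"
    subject_area: str  # categorization for reporting, e.g. "account"
    metatype: str      # how the value is used: "money", "zip_code", "id", ...

# One physical employee table can map to definitions in several subject
# areas, preserving the cross-subject-area relationships described above.
employee_columns = {
    "EMP_FNAME": BusinessDefinition("first_name", "individual", "first_name"),
    "EMP_DEPT":  BusinessDefinition("department", "organization", "id"),
    "EMP_SAL":   BusinessDefinition("salary", "personnel", "money"),
}

# Reporting groups by subject area, not by the physical table:
by_area = {}
for col, definition in employee_columns.items():
    by_area.setdefault(definition.subject_area, []).append(definition.name)
```

The point of the sketch is that the physical column name (`EMP_SAL`) is just one attribute of the definition; categorization and usage travel with the business definition itself.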

Building Apps Faster

Delivering applications more quickly has been the holy grail of business support organizations back to the early days of code generation. Those of you familiar with parallel programming concepts know that removing blocking operations is a good starting point for improving overall throughput – and some of these same principles can be applied to application development within the context of data integration.

Think of the individual roles associated with the typical DI development process. Common blocking operations you may have experienced include waiting for the target schema to be completed, identifying the source systems of record to pull data from, finishing the source-to-target mapping spreadsheet and waiting for test data.

The first advantage that a semantically abstracted environment provides is that analysts and stewards can begin to construct this abstracted environment before sources, targets, or even a project has been created. This abstracted environment is built using either the natural column names in your environment, the logical data names from a logical model, or you can create your own ontology iteratively as you work. The abstracted environment doesn’t need to be exhaustive or complete – work can be done even with only a few of the business definitions identified.
After specifying the business definitions, the analyst can begin constructing rules that define the transformations needed to deliver the target data values. Because next-generation DI solutions are not tied to a physical environment, these rules can be built and tested as they are defined, without waiting for the complete set of rules to be developed. Any rule built is automatically related to its business definition – making it easy to reuse and manage the rules that produce an output datum.

Most first-generation DI tools do something similar, but their rules are associated with cryptic natural column names rather than business definitions or logical names. If you don’t see the difference, consider a large customer with 50 distinct customer files and almost as many different physical names for the same business definition. With a first-generation DI solution, they would need to aggregate the rules across all those systems to determine all the operations performed on a particular piece of data as it moves through the system. While this is possible, I know I have more interesting applications to build – so if I pull someone off another application to build this report, there is a significant opportunity cost to the business.
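The difference between keying rules to a business definition versus a physical column name can be sketched in a few lines. The registry, decorator and mapping table below are illustrative assumptions, not any vendor's API:

```python
# Rules keyed by business definition rather than physical column name.
rules = {}  # business definition -> list of transformation functions

def rule(definition):
    """Register a transformation against a business definition."""
    def register(fn):
        rules.setdefault(definition, []).append(fn)
        return fn
    return register

@rule("customer_name")
def trim_and_titlecase(value):
    return value.strip().title()

# Fifty customer files with fifty physical names all map to one
# definition, so one rule applies everywhere and lineage reporting
# becomes a single lookup instead of a cross-system aggregation.
physical_to_definition = {"CUST_NM": "customer_name", "C_NAME": "customer_name"}

def transform(column, value):
    result = value
    for fn in rules.get(physical_to_definition[column], []):
        result = fn(result)
    return result
```

Under this arrangement, asking "what happens to customer_name?" requires no report-building project at all; the answer is the rule list attached to the definition.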

Some first-generation DI vendors have identified this problem and are building components on top of their ETL tools to assist with collecting, organizing and reporting on this aspect of the application for compliance reasons. These tools are not always free, and if the extensions are not infused into the actual data movement components, there is still a disconnect when it comes to keeping things in sync. This synchronization problem has historically been the kiss of death for after-the-fact metadata solutions.

Working in Parallel

Let’s go back to building an application with a next-generation DI solution. The rules have been built and are associated with the business definition (remember, we don’t have sources or targets yet). The analyst can start building scenarios that test the edge cases for a particular rule. These test cases can be stored and used later in the process by the developer to validate the application and provide a solid regression testing facility.
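A sketch of what those stored scenarios might look like: the analyst records input/expected pairs against a rule before any source or target exists, and the developer replays them later as a regression suite. The rule and helper names here are hypothetical examples:

```python
def normalize_zip(value):
    """Example rule: keep only the 5-digit ZIP prefix."""
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    return digits[:5] if len(digits) >= 5 else None

# Edge-case scenarios captured by the analyst, stored with the rule.
scenarios = [
    ("12345-6789", "12345"),  # ZIP+4 form
    ("12345", "12345"),       # plain 5-digit form
    ("  1234", None),         # too short -> rejected
]

def run_regression(rule_fn, cases):
    """Return the scenarios the rule currently fails."""
    return [(inp, rule_fn(inp), expected)
            for inp, expected in cases
            if rule_fn(inp) != expected]

failures = run_regression(normalize_zip, scenarios)  # empty when the rule passes
```

Because the scenarios live with the business definition rather than a physical job, they survive source and target changes and keep paying off as a regression facility.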

Now that the business definitions have been created and some of the target-centric rules are completed and tested, the application developer can begin to prototype the logic using the business definitions, rules and test data created by the analyst. If new edge cases are discovered during testing, new rules can be created, which are then available to the analyst for additional testing. Random data values can also be used to test the application.
Recall that all this activity can be in progress even before the source and target data collections are identified. When they are decided upon, the architect uses a wizard to correlate the external natural column names to the business definitions. This activity isolates the team from the complexities of the external environment – we don’t care that the data is from z/OS and formatted as a “big endian, packed decimal 7,2” value or even what that looks like externally (is it 1234567C or 123456789C?). All the development team needs to know is that it is “account_balance, part of the account subject area, and is an instance of money.” Which description do you think will be more useful to the application developer and better understood by the business user?
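As an illustration of the physical detail the abstraction layer hides, here is one way the "big endian, packed decimal 7,2" value might be decoded. This is a generic sketch of the mainframe COMP-3 packed-decimal convention, not the code of any particular DI product:

```python
def unpack_comp3(raw: bytes, scale: int) -> float:
    """Decode a packed-decimal field: two BCD digits per byte,
    with the sign in the final low nibble (0xD negative,
    0xC/0xF positive)."""
    digits = ""
    for b in raw[:-1]:
        digits += f"{b >> 4}{b & 0x0F}"
    last = raw[-1]
    digits += str(last >> 4)          # last byte holds one digit + sign
    sign = -1 if (last & 0x0F) == 0x0D else 1
    return sign * int(digits) / (10 ** scale)

# account_balance arriving from z/OS as hex 1234567C with 2 decimal places:
value = unpack_comp3(bytes.fromhex("1234567C"), scale=2)  # 12345.67
```

The developer working against "account_balance, part of the account subject area, an instance of money" never needs to see any of this; it lives behind the wizard's correlation of natural column names to business definitions.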

From Abstraction to Results

If none of the other concepts I’ve discussed to this point have made an impression, consider the difference introduced by constructing this abstraction layer. Given this foundation, the business user, steward, analyst, architect and developer are all speaking the same language. The game of telephone, in which a phrase is mutilated as it passes between participants, may be amusing at a party, but it only adds impedance to application development and cannot be tolerated in the business world.

Ideas and software are never perfect, but innovation by approaching a problem from a different perspective is often required to create dramatic improvements in the way we do business. No one has all the answers to the problems created by the complex processes and systems required to deliver data integration applications. But the abstract approach described can solve the most pressing problems for DI today and has the potential to address the impending wave of complications posed by governance, compliance, off-shoring, and service-oriented architecture as they become more widespread.
