Once development starts, data integration projects are served well by metadata tools. Major ETL vendors like Informatica, IBM WebSphere DataStage and others include metadata solutions that document mappings and transformations, enabling impact analysis in the event of interface changes, database design changes and data quality problems. However, metadata associated with development tools only kicks in when development starts. A significant part of data integration effort happens before the virtual pen meets paper to build ETL maps. Development can only begin after the team defines requirements, designs the database and maps source data elements to their targets.
On a small integration project with just a few interfaces, this isn’t a problem. However, on a more complex project featuring scores of interfaces, it can be a major manual task to ensure the design meets all requirements, that the interfaces as designed will load the database properly, and to correctly identify the redesign needed in response to changes in the interface, data model or requirements.
This article describes a working SQL Server prototype (found here) that shows how a data integration team can build its own custom metadata database based on commonly used data integration design artifacts and enable the impact analysis and change management benefits that metadata provides well before ETL coding starts. Furthermore, metadata analysis can help improve the rigor and quality of requirements, mapping and database design artifacts, preventing defects early and, therefore, helping to improve quality and reduce cost.
The Data Integration Project Pattern
Data integration projects typically implement systems that take in interface data, store it in a database and deliver that data in the form of outbound interfaces (see figure). Interfaces may be in various formats but often can be thought of as files containing fields. Likewise, the database consists of tables and columns. Requirements, in our example, are affirmative statements of business objectives, which subdivide into “data requirements,” the logical data elements needed to fulfill each requirement.
The industry standard interface definition artifact is the “source to target document,” typically a spreadsheet relating (for incoming interfaces) each incoming field with its destination database column and describing any transformations that must occur in the process of delivering the incoming field.
On the typical integration project, activities like making sure different interfaces are mapped consistently, or making sure the design loads all required database columns, or determining the impact of a design change are a matter of manually comparing source-to-target mapping spreadsheets – not a problem on a small project with only a few interfaces.
However, Frederick Brooks’ Mythical Man-Month concept comes into play quickly as project size increases. Brooks’ concept is that project complexity increases exponentially as the size of the team increases. According to Brooks, project complexity is related more to the number of communication paths than the number of people. There are n(n-1)/2 communication paths among n people. So, a four-person project has six communication paths, an eight-person project has 28, and a 25-person project has 300. By this logic, the 25-person project is more than 10 times more complex than the eight-person project.
It is easy to see how this concept operates on a large integration project. Say there are 75 source-to-target mapping documents, each mapping about 100 interface fields, halfway into the design phase. If a major requirements shift results in a change to database design, then the team embarks on a major review and revision of all 75 mapping spreadsheets, spending tense hours in review meetings to ensure the changes are applied consistently and the project remains on schedule.
The Conceptual Integration Metadata Model
It is also easy to see how a database linking the field-to-column mappings, database tables and columns, and data requirements could quickly identify the mappings impacted by the database change and reduce the level of effort in identifying defects in the revised mappings. Fortunately, data integration teams tend to be rich in database skills, so development of a metadata database should be within the practical scope of the team’s abilities.
Again, the goal is to integrate requirements, interfaces and database tables, along with their detailed elements: data requirements, interface fields and database columns, as in the conceptual metadata model in the figure:
The model shows that each requirement implies many data requirements, each table consists of columns and each interface consists of fields.
At a detailed level, the relationships show that:
• Each data requirement may be fulfilled by many database columns,
• Each column can potentially fulfill many data requirements,
• Each column may be the destination for load of many fields (for example, if many interface fields combine in an equation to calculate a single column), and
• Each field may load many separate columns.










Be the first to comment on this post using the section below.