It is widely accepted that the extraction, transformation and load (ETL) process is one of the most complex and costly pieces of data integration during the development of any technology implementation. Today, the data integration marketplace is full of companies whose sole purpose in life is to provide technology to assist with this integration task. But, when people turn to the task of designing and deploying their ETL processes the propensity is to look only at the raw data. There is more to the effort that just the data.
It is generally accepted that there is a natural order for the ETL process flow consisting of several steps. First, the process must undergo a complete review of all information required by the new system. Secondly, a detailed mapping of the data requirements back to the various sources of information must be completed. And, as we pointed out last month these tasks should also include investigation of the unstructured data around the organization, not just the structured data stored in server databases. This is followed by a GAP analysis to identify any required data that is missing or incomplete in the existing source systems. Lastly, the ETL development process or tools need to be selected and deployed. So, what is left to do?
One of the most overlooked obvious answers to this question is, "What to do with the meta data." The whole issue of meta data tends to be more involved than is often considered. Many people shy away from the concept of meta data. Not because it is a difficult concept to understand, but rather, because previous attempts at deploying meta data repositories produced less than acceptable results. Constructing a true meta data repository is more than just building a data dictionary, a concept that has been around since the early days of computers and relational databases. A repository must contain base definitions and the technical meta data such as information about data types, lengths, source mappings, transformation formulae and the like. All of the database vendors have incorporated the ability to add this meta data into the table structures and most data modeling tools provide the database architects and modelers with the ability to define a robust set of technical meta data values about every element in the database when designing, deploying or maintaining the data model.
The technical meta data is seldom populated and rarely complete; yet, it can prove to be very useful for the database developer, the DBA and the IT programmer. But, there is more to meta data than the technical information. Business meta data is the information about the data that the end user would find useful such as English definition, data type, valid values, mandatory or optional indicator and identification of other systems or applications that use an element. The database vendors provide limited, if any, support this for this type of meta data. And, if available, there is usually no obvious method for an end user to retrieve this information. The data modeling tools allow for this information to be stored but, again, there is limited access to the information by the end user. These tools have been developed for technical usage and don't readily lend themselves to usage outside of the IT groups. The modeling tools can generate reports in hard copy, PDF or Web formats, but these are seldom done and even less frequently setup to allow the end user to view the meta data. Over the years we have seen several attempts to deploy user meta data with limited success.
Over time, ETL processing has evolved to require some additional needs that have to be addressed. Today's business needs typically require the use of multiple systems and applications to support all corporate functions. There may be several ERP solutions deployed with one supporting a set of corporate functions or legal entities and another that supports the remainder of the functions for operating entities. Also consider that corporate entities are often created by mergers and acquisitions. This creates a nightmare for unifying support. These ERP systems may need to interoperate with multiple customer service applications, call center tools and customer relationship management (CRM) systems. Additionally, each of these tools will be accessed via reporting tools and supporting analytical (OLAP) applications across the entire enterprise. All of these applications have one thing in common and, in order to maintain all of these functions, one needs to find the common thread. All of these functions use some form of hierarchies to logically organize their data. Hierarchies may include the corporate charts of accounts, product catalogs, legal organization structures and customer groupings. The challenge for the support organizations then becomes how to keep these hierarchies synchronized across organizational and application boundaries. This ensures everyone has the required views of the data while maintaining an environment that provides access to one common set of data across all of the disparate systems and functions.
No one ETL application currently supports the migration of all three of these types of information. The major ETL tools, such as Informatica and Ascential, excel at the migration and transformation of data elements as well as provide some support for the synchronization of the technical meta data. There are several dedicated meta data repositories, such as CA's Advantage Repository, SchemaLogic's SchemaServer or Ascential's MetaStage, each of which are quite capable of ensuring the corporate meta data is communicated among the various applications. Unfortunately, none of these products effectively deals with the problem of maintaining the business meta data which is the key to keeping the hierarchies used by all of the ancillary functions synchronized. When it comes to the management of hierarchies and dimensions there are few products to choose from. Most of the ERP products contain hierarchy management utilities, but these are for use by the ERP application itself. There are few mechanisms for exporting the hierarchies so that they may be used by any other application in your organization's infrastructure. Obviously, the OLAP vendors each provide mechanisms for building and maintaining the hierarchies as the dimensional hierarchy is the logical ordering mechanism for OLAP processing. But even within the same OLAP suite, there is less than an optimal way of maintaining hierarchies across multiple OLAP applications (cubes).
With each application maintaining the hierarchies independently, the various support staffs for the applications need to make changes manually to their respective applications to ensure all data is grouped, reported and analyzed in a consistent manner. And, when you consider the need to build centralized hierarchies for consolidation of information across a multisystem enterprise this is not a trivial task. The need for changes in multiple systems invites errors, and errors in regulatory reporting nowadays results in severe consequences to those responsible for validating and ensuring accuracy of information for regulatory compliance.
The best solution is for IT to seek products in the marketplace that have the ability to provide centralized management of a company's various logical hierarchies from a centralized location. Entities that have multiple systems, products and reporting functions have struggled to get a handle on the problem and to implement a satisfactory solution. Products for this arena have been slow to market and are long overdue. The process of hierarchy management is critical for complex infrastructures and for entities that have rapidly changing dynamic environments. To be effective at centralized hierarchy management, a product should: 1) allow multiple people to perform centralized maintenance activities on different parts of the hierarchies; 2) allow for security and audit logging; 3) provide the ability to capture historical hierarchies for point-in-time reporting; 4) provide mechanisms to push the hierarchies to the operational and analytical systems - the ERP systems, OLAP cubes, data warehouses and custom relational databases; and, just for good measure, 5) provide a level of business meta data that is accessible by the end users in a simple, graphical and useful format.
Today's corporate information architecture is built upon a wide range of diverse and complex databases, purchased applications and custom developed solutions. The data management professional is faced with the extremely difficult job of ensuring all these disparate repositories are up to date and synchronized in terms of the data, structures and meta data. Failure to maintain any one of these components across the enterprise may cause significant difficulties for senior management, both in their ability to drive the corporate vessel and in their ability to achieve compliance with today's corporate reporting laws. IT organizations can significantly enhance their capabilities and fulfill their responsibilities by seeking centralized management and providing management with the level of comfort required to achieve the new standards of corporate compliance.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access