Continue in 2 seconds

The Alphabet Soup of Data Integration

  • December 09 2004, 1:00am EST

Do you feel like you have fallen into a large bowl of alphabet soup when talking about data integration? Are you totally frustrated while trying to keep up with all the acronyms? You are not alone! The world of IT tools has been expanding to meet the ever-increasing needs of the global business environment. When business was simple there was little to worry about other than the operational systems. In the prehistoric world of IT, operational systems required to run the business were typically from the same vendor; and, in many cases, they were modules, such as G/L, payables, HR and sales, which allowed for data to be relatively easily tied together. In the many cases where standard modules were not available, the IT organization developed custom applications with data keys and structures that were based on their purchased applications. The demands for information were easily satisfied by canned reports.

Now, enter the modern world of conglomerates. They have sales tracking modules from one vendor, accounting modules from another and a separate HR system. And, just for good measure, throw in ERP and CRM systems. The new dashboards and scorecards the top executives are asking for require a broad cross-section of data, which is scattered throughout every system in the company. Explore the new world of extraction, transformation and load (ETL), enterprise application integration (EAI) and enterprise information integration (EII) solutions. Over the course of the next view months, this column will concentrate on deciphering the ins and outs of the current integration strategies focusing on theories and functionality rather than on specific products.

This month, the discussion will focus on the world of ETL. This is a term that we all use, even the business users. But, seldom does everyone use the term to mean the same processes, technologies or functions. In the traditional sense, ETL - extraction, transformation and load - is an acronym associated with a variety of processes. The ETL processes and technology arose from the business need to draw information from disparate data systems and either move it to a central repository or to create load files that feed into a reporting system. Figure 1 depicts the traditional ETL deployment as it was at the beginning.

Figure 1: Traditional ETL Process

During the extraction process, one or more files are created containing the extracted data values. The data needed is identified and mapped to existing sources of data, such as from the financials, CRM and ERP systems. The data is then extracted from the sources and placed into flat files or temporary database tables in a staging area. These files and/or tables are then fed into a set of processes that perform the transformation activities, such as data validation, creating aggregated or derived values, and other transformations, such as converting Y to 1 and N to 2, key replacement and address standardization. Load files containing valid information are created during the transformation process. This is true regardless of whether the transformation process uses flat files or temporary database tables. As the transformation process has already performed full validations, the receiving systems perform little, if any, validation and the target database's (databases') native bulk loader(s) is (are) the most often selected method of database insertion. All steps of the transformation and load processes are optimized to run concurrently, unless there are specific referential integrity or order issues. Also, it should be noted that, as a rule, data runs in one direction: sources to transform to targets. Figure 2 depicts a non-traditional or alternative methodology that has evolved over the last few years.

Figure 2: Non-Traditional ETL Process

Non-traditional ETL extends the ETL process by adding in business meta data and reference data management. The reference data includes hierarchy definition, business rule management and data flow control information which are used by the ETL process to optimize all steps of the process. The reference data manager allows for one set of master hierarchies and data mappings to be set up by the organization. These hierarchies and mappings are then validated by all relevant parties and may be pushed out to all data sources and targets. For example, a company may define the G/L and organization hierarchies in a reference data manager and push them back to their Oracle Financials application (source) and G/L OLAP cube (target) to ensure all applications are looking at the same data in the same structures. The hierarchies and mappings may also be fed into the ETL tool which uses them to optimize the data extract from the source and mapping to the target data repositories. In many cases, the transformation server performs the cleansing and transformations just as it does in the traditional ETL process. In other cases, the source and destination systems are fed the details of any required validations and/or transformations and the most appropriate system performs the functions prior to the data insertion. While it is possible that the transformation server could create load files, the more common practice is for the server to use the database server's bulk insertion utility and stream the load data directly to the database.

The decision to select the traditional or non-traditional methodology is a function of the environment and the level of complexity required by the ETL process. The traditional method has vendors that have created very sophisticated user interfaces and scripting languages that provide all the tools necessary to perform the extractions, edits and transformations that are most commonly needed. They work well when there is a 1:1 correspondence between the number of systems and the number of extractions. They begin to suffer performance degradation when extracts have to be performed from disparate systems and require heavy consistency and standardization processing and business rule enforcement. For example, to create a set of load files that contain information from several sources is significantly more involved than creating a load file based on information from only one source database. The information is extracted into temporary structures - files or tables - and the business rules are applied to combine the data into a consolidated output format, which may be comprised of multiple load files. When the time and effort required by the ETL process' business rules becomes overwhelming, alternatives, such as non-traditional ETL methods should be considered.

The non-traditional ETL was developed to reduce this level of complexity; however, it does create other problems. As long as referential integrity issues and complexity of required transformation is minimal from source(s) to target(s), this method will likely move data faster and will reduce the need for number of complex processes that must run to prepare the data for bulk inserts. However, since many of the products that use the non-traditional method do so by generating complex SQL, the performance of this method will degrade in proportion to the level of complexity of the resulting SQL based on the data.

It, therefore, becomes incumbent on the users to fully understand their ETL needs and to select the tools and methods most appropriate for their particular circumstances. While vendors of both methods claim high performance and product superiority, the appropriate toolset is the one that will actually work in a given situation. During the vendor selection process, ask plenty of questions and use proofs of concept from the vendors to identify the best fitting tool(s). It is safe to say that the only absolute when looking at data integration is that using a purchased ETL tool is almost always superior to writing and maintaining one in house. As to which method should be used, the best rule of thumb is to use whichever method is most appropriate for your environment.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access