I am frequently asked by vendors, clients or consulting partners to aid in the selection, testing and adoption of extraction, transformation and load (ETL) tools. I stopped providing product reviews and related work for vendors as of 12/31/2001 and am under no pressure to suggest anything other than my own preferences. Our team uses a variety of tools under diverse circumstances. When asked for my help, I try very hard to focus on technical and business criteria, not product preferences. I also try to shape customer expectations so that they can rapidly discern the best possible candidates from an overcrowded field. This article outlines categories of ETL tools and criteria to consider in selecting them.

ETL Tool Categories

There are many ways to categorize ETL tools: price, complexity, breadth of product line and so on. It is important to separate these tools based on their primary platform support and methodology, because specialized platforms (mainframe, application servers) require specialized products. The breadth of solutions provided by ETL products is a good way to discern basic product categories, which range from point solutions to enterprise-class applications. Point solutions are distinguished by how much complexity and volume they can support, not by their price. Enterprise-class applications are very broad in terms of both features and data volume capacity. The purpose, features and pricing of enterprise-class products separate discrete tools from overall solution sets.

First Generation ETL Tools

ETL has grown from a simple data-mart-in-a-box function to a robust data integration and information construction tool. First generation tools were originally focused on direct-to-mart strategies, which attempted to solve the problem of data integration for business intelligence (BI) delivery by way of the dedicated data mart. These were often called federated or so-called independent data marts: freestanding data stores built using dimensional models (or no models at all) with direct, hard-coded connections to the source applications. The emphasis was on extracting data into some (typically custom or proprietary) form of dimensional model, owing to the comparative ease of navigation provided by dimensional data models (which require a certain amount of selectivity and focus).

Second Generation ETL Tools

The introduction of data warehousing by Bill Inmon and others provided the path to the second generation of ETL tools. These are focused on a broad array of targets, including data warehouses, data marts, operational data stores, data mining and exploration warehouses, and analytic applications. The flexibility of these second generation tools provided far superior returns to their owners, since the rapid proliferation of federated data marts was already proving to be a massive failure. These independent data marts were impossible to conform to one another (or to anything else), so there was never a single version of the truth. The need for a base data warehouse, using a more relational model, to provide the foundation for all manner of dimensional analysis has since become the accepted method of data integration. Inconsistent, incomparable and incomprehensible results are hardly the stuff of ROI.

ETL Selection Criteria: Primary Processing Platforms

I start with this issue because many IT shops have specific processing platform requirements that must be respected. There are fewer and fewer specialized platforms running proprietary operating systems, but those that remain are important to their owners and users. The mainframe (IBM 390 and others) is the best known and most pervasive of these specialized platforms. Please remember that the tools in this category are intended for customers who require native mainframe execution. Code generators can produce the COBOL and JCL needed to complete ETL work entirely on the mainframe, or largely on the mainframe with an external load to non-mainframe systems.
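
To make the code-generation approach concrete, the following is a minimal sketch, written in Python purely for illustration, of how such a generator might render the JCL that drives a generated extract program. The job name, program name and data set names are hypothetical placeholders and do not represent the output of any particular product.

    # Minimal sketch of the code-generation style of mainframe ETL: a small
    # job specification is rendered into the JCL that would run a generated
    # COBOL extract program. Every name below is a hypothetical placeholder.

    JCL_TEMPLATE = """\
    //{job_name} JOB (ACCT),'ETL EXTRACT',CLASS=A,MSGCLASS=X
    //STEP010  EXEC PGM={program}
    //INFILE   DD DSN={source_dsn},DISP=SHR
    //OUTFILE  DD DSN={target_dsn},
    //            DISP=(NEW,CATLG,DELETE),UNIT=SYSDA,SPACE=(CYL,(50,10))
    //SYSOUT   DD SYSOUT=*
    """

    def generate_extract_jcl(job_name: str, program: str,
                             source_dsn: str, target_dsn: str) -> str:
        """Render the JCL for one generated extract step."""
        return JCL_TEMPLATE.format(job_name=job_name, program=program,
                                   source_dsn=source_dsn, target_dsn=target_dsn)

    if __name__ == "__main__":
        print(generate_extract_jcl(job_name="CUSTEXT", program="CUSTEXTR",
                                   source_dsn="PROD.CUSTOMER.MASTER",
                                   target_dsn="ETL.CUSTOMER.EXTRACT"))

A real generator would also emit the COBOL extract program itself; this sketch covers only the job control side of the work.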

The other major segment of the specialized computing market is the IBM eServer iSeries (formerly known as the AS/400 and, earlier, the System/3x line). More than 250,000 iSeries customers throughout the world rely on these application servers for some or all of their data processing requirements. IBM has continually developed this product line to provide massive scalability and a broad array of IBM and partner software applications.

Many iSeries customers rely heavily on this technology and have a vested interest in achieving data warehousing success without adding new platforms and new issues. IBM has provided a path to this extended success through partner products specifically designed to run on the iSeries and to support IBM DB2 for iSeries directly.

Key criteria for iSeries native ETL tools include:

  • Native access (R/W) to DB2 iSeries (bit-mapped indexes, etc.)
  • Complexity of transforms supported (relational and dimensional)
  • Ease of use – graphical, no-code design environment
  • Distributed architecture (thin client design)
  • Meta data repository (relational)
  • Scalability – utilizes iSeries scalability
  • Training and support services
  • Installed base – scope of usage, complexity, ROI
  • IBM partnership status

The bottom line for mainframe and iSeries customers is that their processing platforms can support data warehousing success if tools specific to that environment are employed. Customers that require complex sourcing from these platforms to more open targets (Wintel, UNIX) may still want to consider these native applications for iSeries and mainframe, since much of their application knowledge and expertise is tied to these platforms and their respective operational systems.

Selection Criteria: Scope of Usage

Point solutions abound in the world of ETL tools. Unfortunately, many of them come with an enterprise-class price tag. What is the distinction between point and enterprise solutions? Generally, the use of an ETL tool across projects and over time is what distinguishes an enterprise-class solution. Many ETL tools are used for a single project and maintained for many years without additional adoption within the enterprise. This is not always due to product limitations. The ETL product may be quite adequate for additional usage, but the tool vendor or system integrator fails to service the customer relationship. Many software vendors are known for their low-hanging-fruit approach to quick sales; they often abandon their best prospects in search of the next quick hit. Don’t hesitate to demand ongoing attention from your vendor of choice. They will come to understand your "lifetime" value, and they may actually learn a little about your use of their product and the improvements they should make. Your willingness to share a positive purchase and implementation experience will go a long way toward establishing an open conduit for ongoing communications.

Driving enterprise adoption requires a broader array of features and benefits than point solutions typically provide. There are two basic drivers for enterprise adoption of ETL tools. The first is the use of an independent engine to actually execute the ETL process. Many tools instead use the underlying relational database (RDBMS) engine of the target to complete their work, which is fine until you reach significant complexity and volume. The second is the variety and complexity of the source systems: more complexity requires a more robust solution. This is particularly true of complex sources from nonrelational environments. Mainframe-derived flat files, third-party data and other complex sources require much more transformation work to integrate into the data warehouse.
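
To illustrate the distinction, here is a small sketch in Python, using hypothetical table and column names, of the two execution styles: an independent engine that streams rows through its own process versus a generator that hands a SQL statement to the target RDBMS.

    # Sketch of the two execution styles discussed above, using hypothetical
    # customer tables and columns; illustrative only, not any vendor's design.

    from typing import Iterable, Iterator

    def engine_style_transform(rows: Iterable[dict]) -> Iterator[dict]:
        """Independent engine: rows stream through the tool's own process, so
        non-SQL logic (parsing, lookups, third-party data) is straightforward
        and the target database is not asked to do the transformation work."""
        for row in rows:
            yield {
                "customer_key": row["cust_id"],
                "full_name": f'{row["first_name"].strip()} {row["last_name"].strip()}',
                "region": row.get("region", "UNKNOWN").upper(),
            }

    def pushdown_style_sql(source_table: str, target_table: str) -> str:
        """Code generator relying on the RDBMS: the transform is a generated
        SQL statement executed by the target database engine, which is fine
        until complexity and volume grow."""
        return (
            f"INSERT INTO {target_table} (customer_key, full_name, region) "
            f"SELECT cust_id, TRIM(first_name) || ' ' || TRIM(last_name), "
            f"UPPER(COALESCE(region, 'UNKNOWN')) FROM {source_table}"
        )

    if __name__ == "__main__":
        sample = [{"cust_id": 1, "first_name": " Ada ", "last_name": "Lovelace "}]
        print(list(engine_style_transform(sample)))
        print(pushdown_style_sql("stage_customer", "dw_customer"))

The engine style keeps complex, non-SQL logic inside the tool and off the target database; the pushdown style is simpler to build but ties throughput to the target RDBMS.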

Enterprise utilization also requires features that support shared meta data, process management and control, and managed scalability. Enterprise use implies some level of distribution of sources, ETL processing and targets. It also requires support for the end users of integrated information via business intelligence tools. ETL becomes the clearinghouse for all manner of transactional data and must support some exposure of this process to distributed constituents. Managed scalability must provide an orderly, cost-effective means of increasing data sources, targets, volumes, varieties and frequencies without compelling customers to reengineer the library of ETL jobs they have amassed.

A note about the data warehouse: our approach is based upon Bill Inmon’s groundbreaking work in the area. We support an iterative methodology for the design and construction of the data warehouse. Industry logical data models provide the foundation for the enterprise logical and physical models. The data warehouse is built iteratively by subject area, and each new subject area requires some integration with the existing subject areas, hence the use of relational (ER) modeling techniques. Dimensional models (stars) also have a place in the warehouse. We do not support relational-only or dimensional-only modeling techniques, as both provide value and can be combined. The data warehouse is populated via time-based additions, or snapshots, that are moment-in-time updates. Each snapshot is serialized to provide longitudinal integrity. These are our guiding design principles; yours may be different. Our review of ETL tools and the criteria for their selection are the direct result of our approach. If you use a markedly different approach, adjust our criteria accordingly.
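
To make the snapshot idea concrete, here is a minimal sketch in Python, with hypothetical names, of serialized, time-based additions: each load carries the moment-in-time date it describes and a serial number, so longitudinal integrity can be verified as the warehouse grows.

    # Minimal sketch of serialized, time-based snapshots: each load is stamped
    # with the moment in time it describes and a monotonically increasing
    # serial number. Names are hypothetical and not tied to any tool or model.

    from dataclasses import dataclass, field
    from datetime import date
    from typing import List

    @dataclass
    class Snapshot:
        serial: int        # position in the load sequence
        as_of: date        # the moment in time the rows describe
        rows: List[dict]

    @dataclass
    class SubjectArea:
        name: str
        snapshots: List[Snapshot] = field(default_factory=list)

        def add_snapshot(self, as_of: date, rows: List[dict]) -> Snapshot:
            """Append a snapshot, enforcing longitudinal integrity: serials
            increase by one and snapshot dates never move backwards."""
            if self.snapshots and as_of <= self.snapshots[-1].as_of:
                raise ValueError("snapshot date must advance with each load")
            snap = Snapshot(serial=len(self.snapshots) + 1, as_of=as_of, rows=rows)
            self.snapshots.append(snap)
            return snap

    if __name__ == "__main__":
        customer = SubjectArea("customer")
        customer.add_snapshot(date(2003, 1, 31), [{"cust_id": 1, "status": "ACTIVE"}])
        customer.add_snapshot(date(2003, 2, 28), [{"cust_id": 1, "status": "CLOSED"}])
        print([(s.serial, s.as_of.isoformat()) for s in customer.snapshots])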

Issues for consideration in scope of usage:

  • Code generator for specialized processing platform
    1. Full procedural and job control generation and management
    2. Full utilization of processing environment resources
    3. Distributed design and meta data usage
  • Use of ETL engine or code generator relying on RDBMS
    1. Independent, server-based engine (best)
    2. SQL, VB or other code generator for execution by OS or RDBMS
  • Robust design interface that supports complex transformation design and testing
    1. Ease of use (apparent)
    2. Range of use (when does the GUI stop and the code start)
    3. Effect of design methodology (best)
  • No forced sub-setting of data with repeated execution to cycle through the data
  • Unlimited use of direct-to-memory, tokenization or hashing options
  • Unlimited portability of design elements (objects)
  • Meta data management and control
    1. Serialization and version control
    2. Exposure via relational models for users and other tools
    3. Some form of damage control to predict the impact of change (see the sketch following this list)
  • Process Management and Control
    1. Checkpointing of jobs in design AND execution
    2. Versioning of job modules or components (OO)
    3. Remote monitoring and control of processes
    4. Distributed process mechanics (servers and distance)
  • Managed Scalability
    1. Growth of sources, targets, volumes, and varieties at controlled cost
    2. Reasonable support for platform growth and additions (processors and boxes)
  • Enterprise Feature Sets
    1. Data quality – contents, standards, matching and beyond
    2. Real-time support – messaging, EAI and other transports
    3. MPP support – optional use of massively parallel environments
    4. Advanced source mapping and discovery – ERP, CRM, other
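
As a concrete illustration of the meta data "damage control" item above, here is a minimal sketch in Python, with hypothetical column names, of predicting the impact of a change by walking source-to-target lineage mappings. A real tool would read these mappings from its meta data repository rather than from a hard-coded table.

    # Minimal sketch of "damage control": given source-to-target mappings,
    # list everything downstream of a changed column so the impact of the
    # change can be predicted before any job is touched. In practice the
    # lineage edges would come from the ETL tool's meta data repository;
    # here they are hard coded with hypothetical names for illustration.

    from collections import deque
    from typing import Dict, List, Set

    LINEAGE: Dict[str, List[str]] = {
        "src.orders.order_amt":    ["dw.sales_fact.order_amt"],
        "dw.sales_fact.order_amt": ["mart.revenue.monthly_total",
                                    "mart.revenue.ytd_total"],
    }

    def impacted_objects(changed: str) -> Set[str]:
        """Walk the lineage graph breadth-first and return every downstream
        column that could be affected by a change to `changed`."""
        seen: Set[str] = set()
        queue = deque([changed])
        while queue:
            node = queue.popleft()
            for target in LINEAGE.get(node, []):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return seen

    if __name__ == "__main__":
        print(sorted(impacted_objects("src.orders.order_amt")))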
