MAR 1, 2005 1:00am ET

Related Links

When Fast is Not Enough
July 18, 2008
TopQuadrant Software Imports Email MetaData into Semantic Applications
March 26, 2008
An Open Challenge to the Open Source Community
November 30, 2007

Web Seminars

Go with the Flow – The Game-Changing Impact of Clickstream Analysis
Available On Demand

Maximizing Your ETL Tool Investment

Print
Reprints
Email

Is your organization struggling to justify an ETL (extract, transform and load) tool purchase? If the tool is an enterprise level solution, the entry price is likely well into the six digits. At this price, the justification process may appear daunting. Because these tools serve a purpose and are critical to serious data warehouse endeavors, the justification is well worth the effort.

Once the decision to proceed with an ETL tool is made and budget dollars released, the difficult part is complete. From this point, the data warehouse team initiates the tool selection process. They proceed to investigate options and leverage past experience to identify a short list of product vendors. Fortunately, in the ETL space, only a handful of vendors are capable of addressing enterprise needs. Once engaged, the vendors work diligently to understand requirements, mold requirements to fit their product and shorten the tool selection time frame.

The "short list" vendors create a small working sample, often referred to as a proof of concept (POC), to demonstrate tool capabilities. Once functional fit is established, the vendor moves aggressively toward closing the sale. Many in this situation understand the dynamics associated with procuring enterprise software. Less understood are the consequences for poor decision making. Was the tool purchased without a complete evaluation? Will this ETL tool implementation be successful?

Why Use ETL Tools?

Since Oracle (substitute any database) exists in house, writing stored procedures in PL/SQL is often the first thought for ETL development. With no additional licensing and existing in-house talent, the decision appears obvious. Less obvious is the lack of integration, lack of usable meta data and additional maintenance cost.

ETL tools serve two very specific purposes. First, they provide a development environment that is easy to manage. The scripting associated with ETL tools is embedded into the graphical interface, which allows for code visualization. The visualization achieved with ETL tools is priceless when attempting to reverse-engineer existing code. It also reduces knowledge transfer to production support.

Second, ETL tools provide increased throughput over scripting and database coding. The increased throughput is achieved by separating data management from data access. Data management is the coordination of data from source to target systems. Data management integrates meta data for the accurate cataloging and control of source code. Data access is the delivery of information from the database to the end user. Data access is most often achieved directly or semantically via SQL queries (relational databases) or proprietary access (cube technology).

Data Access vs. Data Management

What is the value of separating data access from data management? To answer that question, one must first understand the origin of data warehousing. Fifteen years ago, organizations were asking, "Why build a redundant database dedicated to reporting?" The answer is priority and performance. Transaction systems have historically held higher priority than reporting applications. Heavy query access impacts performance of transactional systems to unacceptable levels. In order to maintain high performance, a second repository is necessary.

The reason to separate data management from data access is also priority and performance. To meet current demands, access to analytical data must be near immediate. When users drill, data is immediately returned. Specifically, fast performing ETL does not justify slow running queries. In the practical world, both ETL performance and query performance must be fast.

To achieve adequate performance on the movement and query of data, the enterprise solution must optimize and separate each function. While this separation increases cost, this investment is essential to establishing large-scale solutions.

Successful Implementation

Successful ETL tool implementation is defined by several project team characteristics: 1) speed to market, 2) effective use of functional capabilities, 3) maintenance complexity and 4) financial ROI. Each organization weighs these measures according to internal drivers.

Speed to Market. Speed to market refers to a project team's ability to quickly deliver new applications or enhancement applications. From an ETL perspective, these new projects equate to the addition of new sources and/or new attributes from existing sources. Project teams delivering rapid solutions demonstrate the ability to accurately and precisely scope and plan new projects. Successful ETL tool implementations are directly correlated to the project team's ability to deliver according to schedule.

Use of Functional Capabilities. Functional capabilities are grouped into two categories: a) performance gains and b) processing flexibility. The leading enterprise ETL tools all claim (and most deliver) to increase processing throughput over the majority of custom coded solutions. The majority of custom code resides in database-specific language. Within a specific database, certain processing may perform comparably to a tool (e.g., aggregations), but broadly speaking, most tasks are outperformed through ETL tools. Each of the leading tools offers separate approaches to optimization. The optimization techniques developed in the tool represent many hours of R&D and implementation experience. The tools also provide options for performance tuning that work in concert with database-specific performance gains. Specifically, the parallel reading and writing of data from databases coordinates with the ETL tool processing to execute multiple instances of single code.

Processing flexibility refers to the simplification of coding business logic. Most third-generation coding languages accommodate data warehouse processing (integrity checks, domain validation, data partitioning, process partitioning, etc.). Additionally, many languages support parallel processing. The dilemma is the cost associated with the design/development/testing of custom complex and parallel code. Is the custom code parameterized and flexible to accommodate multiple similar requirements? Successful implementations understand tool capabilities and leverage them effectively.

Maintenance Complexity. Maintenance complexity refers to the ability of existing and new staff to interpret and maintain existing code. Poorly utilized tools contain confusing and often extensive code. Successful projects have efficiently coded applications that are easily maintained.

Case in point: While working on a high-volume data warehouse effort for a national retailer, a rogue developer produced a sophisticated scheduling application to manage the ETL tool processing. The developer likely invested 250 hours of effort into a custom-coded solution (Perl). This particular client lacked significant experience in Perl, and the source code resided in a private source code repository. Without consideration for the architecture, this developer implemented a significant component. The result was a production system that was dependent on a sophisticated scheduling application that was difficult and expensive to maintain.

Filed under:
ETL

Advertisement

Comments (0)

Be the first to comment on this post using the section below.

Add Your Comments:
You must be registered to post a comment.
Not Registered?
You must be registered to post a comment. Click here to register.
Already registered? Log in here
Please note you must now log in with your email address and password.
Twitter
Facebook
LinkedIn
Login  |  My Account  |  White Papers  |  Web Seminars  |  Events |  Newsletters |  eBooks
FOLLOW US
Please note you must now log in with your email address and password.