Your data warehouse is doubling in size every year, and you suspect that you'll be adding a clickstream data source from your corporate Web site before the end of the year. Your marketing group doesn't yet know precisely how they are going to leverage the new clickstream data for personalization and customer intimacy – but they do know that they don't want to be the last ones to figure it out. Too many signs suggest it will be of strategic competitive advantage.

Clickstream Volume + Demand for Access = Big Trouble

As data warehouse manager or ETL team leader, you know that what users learn from their clickstream data will stimulate additional requests for analysis. That means reprocessing and reloading the data many times before the data transformation and data hygiene requirements can be fixed in stone. This may not be new; after all, you've seen significant change in your ETL processes for years. However, this is likely to be as challenging a design-as-you-go effort as you've seen, because clickstream volumes may match or even exceed your current volumes – and ongoing support of the existing ETL is not getting any cheaper. What to do?

The answer: start building your new ETL job streams within a scalable ETL development environment. The initial clickstream ETL job stream will change many times before you're through, and there is no better time to put in place a scalable technology that delivers in very large data warehousing environments.

Defining the Environment

A scalable ETL development environment addresses continuous change in your existing ETL infrastructure with programmer productivity benefits that save time and money, while also reducing existing support and maintenance costs. The real clincher, however, is an extremely reliable, high-performance ETL processing environment that scales predictably to meet demands for faster response times and higher throughput. There you have it: extreme scalability, massive throughput, robust performance and low cost of ownership. These characterize the scalable ETL development environment.

While there are many ETL tools on the market, few deliver the scalability, throughput, performance and low cost of ownership you will need to support an environment that integrates clickstream data with your existing data management environment. Solutions are available, however, from parallel technology vendors such as Ab Initio and Torrent Systems, Inc. These solutions deliver three key benefits in the scalable ETL development environment for extract, transform and load jobs:

  • Programmer productivity gains
  • Reduced maintenance costs
  • Scalable high performance

Programmer Productivity Gains

Developers always seem to take offense when programmer productivity tools are suggested for ETL. How many times have you heard someone say, "Why do I need a tool? I can write code that runs faster, and for a simple transform I've already built in my class library, I can code it in 15 minutes." It rarely turns out that way, though.

Ask your favorite C/C++ developer if they have a routine to convert packed decimal to text. Nobody would actually code one until it was needed, so you don't often find one in your class library. Ask how long it takes to translate the COBOL copybook into the input interface specification for your transform. Probably an hour – and that is if you understand the copybook format or you've already processed similar feeds.
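To make that concrete, here is a minimal sketch of what such a routine involves, assuming the standard COMP-3 layout of two binary-coded decimal digits per byte with the sign in the final nibble. (The function name and the error handling are illustrative only.)

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    // Sketch: convert a COBOL COMP-3 (packed decimal) field to printable text.
    // Each byte holds two BCD digits; the low nibble of the last byte is the
    // sign (0xC or 0xF positive, 0xD negative).
    std::string packed_to_text(const uint8_t* field, std::size_t len) {
        std::string digits;
        for (std::size_t i = 0; i < len; ++i) {
            uint8_t hi = field[i] >> 4;
            uint8_t lo = field[i] & 0x0F;
            if (hi > 9) throw std::runtime_error("bad packed digit");
            digits += static_cast<char>('0' + hi);
            if (i + 1 < len) {                      // low nibble is another digit
                if (lo > 9) throw std::runtime_error("bad packed digit");
                digits += static_cast<char>('0' + lo);
            } else if (lo == 0x0D) {                // low nibble of last byte is the sign
                digits.insert(digits.begin(), '-');
            }
        }
        return digits;
    }

    int main() {
        const uint8_t sample[] = {0x12, 0x34, 0x5C};                  // +12345 in COMP-3
        std::cout << packed_to_text(sample, sizeof(sample)) << '\n';  // prints 12345
    }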

The point is this: you shouldn't have to burn valuable development resources coding interface specifications! Rather, you should have your ETL development team tackle the more elaborate ETL data and application concerns – issues that come up often and need a good strategy – such as synchronizing data between different sources, devising a refresh strategy, designing checkpoint and recovery logic, or building data quality and data audit checkpoints.

Consider some of the tasks that would take a day or more to code and test without a scalable ETL development environment. It is no accident that most of the components to build these steps are available from tools designed for scalable ETL:

  • Formatting packed decimal, binary and EBCDIC data into text strings for dimension tables.
  • Sorting and merging data files with logic handling for merge exceptions.
  • Calculating aggregates for a single file or across many sorted and merged files.
  • Selecting certain records from an ETL job stream for data quality or data auditing checkpoints.
  • Extracting a 10 percent random sample from your warehouse for an ad hoc data pull.
  • Removing duplicates from an ETL job stream (a hand-rolled version is sketched below).

These are the bread and butter of ETL coding that essentially come "free" with your scalable ETL development environment. That is, developers could be using prebuilt and tested components to solve these problems in hours, not days.
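To see the contrast, here is roughly what a hand-rolled version of just the deduplication step might look like – a minimal sketch that assumes pipe-delimited records keyed on the first field:

    #include <iostream>
    #include <string>
    #include <unordered_set>

    // Sketch: drop duplicate records from a delimited feed on standard input,
    // keyed on the first field. A prebuilt dedup component reduces this to a
    // few parameter settings; by hand it is one more program to write and test.
    int main() {
        std::unordered_set<std::string> seen;
        std::string record;
        while (std::getline(std::cin, record)) {
            std::string key = record.substr(0, record.find('|'));  // assumed '|' delimiter
            if (seen.insert(key).second)       // first occurrence of this key
                std::cout << record << '\n';   // keep it; later duplicates are dropped
        }
    }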

Reduced Maintenance Costs

Let's now look at maintenance costs, which consume considerable resources. A scalable ETL development environment tackles the need for custom-coded C/C++ extraction and transformation logic head-on. It must reduce the dependence on a large C/C++ development team – and it does.

Most of the transformation logic can be built by defining transformation rules in a simple procedural language instead of C/C++. As a result, programming support costs are much lower, and you don't have to worry about supporting unreadable code once a developer moves on. In most cases, scalable ETL development environments also provide many self-documenting features.

One of the other classic ETL support problems is the number of potential interfaces between sources and targets – databases, tables or flat files. They have a nasty habit of multiplying: there are potentially n x (n-1) possible interfaces among n sources and targets, so with just four you are already supporting up to 12. The best way to reduce the maintenance and support costs associated with this large number of interfaces is for the scalable ETL development environment to manage them through an ETL-specific meta data repository and version control.
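The arithmetic is worth seeing. A quick sketch of the n x (n-1) formula shows how fast the interface count climbs as sources and targets are added:

    #include <iostream>

    // Worked illustration of the n x (n-1) interface count: four sources and
    // targets already imply up to a dozen point-to-point interfaces, and the
    // number climbs quadratically from there.
    int main() {
        for (int n : {4, 8, 16, 32})
            std::cout << n << " sources/targets -> up to "
                      << n * (n - 1) << " interfaces\n";
    }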

Predictable Performance in an Unpredictable Environment

The third and final key benefit of scalable ETL development environments derives from the performance options embedded within the tool: it separates developers from the details of executing their code in a parallel or distributed system.

Let's say you have an ETL job stream in place that delivers data to the warehouse. How do you make it four times faster? Better yet, how would you process four times as much data in the same amount of time? The answer to both questions could be to buy more hardware, but let's assume the next fastest processor is only twice as fast. You're stuck with finding a software solution.

Forced to compromise, the development group starts planning code that is not only optimized to run on one CPU but also makes use of multiple CPUs and disks concurrently. This in itself is not too daunting a task for more advanced programmers, and your development team might suggest a multithreaded or multiprocess solution in which you partition the data into four pieces on four disks and run four copies of your program on four CPUs at the same time. This is a great solution – until you realize that only 10 percent of your development team can support the code.
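For illustration, here is roughly what that hand-rolled, four-way partitioned approach looks like – a minimal sketch in which the data has already been split across four disks; the file paths and the transform body are placeholders:

    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    // Sketch of the hand-rolled approach: the data has already been split into
    // four partition files on four disks, and one worker thread (or process)
    // runs the same transform over each partition.
    void transform_partition(const std::string& path) {
        // ... open the partition, apply the transform, write the output ...
        std::cout << "processed " << path << '\n';
    }

    int main() {
        const std::vector<std::string> partitions = {
            "/disk1/part0.dat", "/disk2/part1.dat",
            "/disk3/part2.dat", "/disk4/part3.dat"};

        std::vector<std::thread> workers;
        for (const auto& p : partitions)           // one worker per partition/CPU
            workers.emplace_back(transform_partition, p);
        for (auto& w : workers)                    // wait for all four streams
            w.join();
    }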

What if the problem changes again, and you are asked to add processing steps to the ETL job stream that cut its throughput in half? That's not good enough: your users are howling for the same run time. You could partition the problem into eight pieces using the same techniques as before. You would have to add extra file systems to keep each of the eight streams fed with independent I/O, but you did it before, so you can do it again – and the server has eight CPUs.

Now your clickstream data is feeding into your existing data stream, and all is well. But volumes are growing and so are user requests. What do you do when you're asked to process twice as much data again but you've run out of CPUs on your server? Do you start building programs that use message passing or remote procedure calls (RPC) to do distributed processing? This would allow you to make use of CPUs and disks on other servers. You could build scripts to move the data around to do the processing on the new server – but now only two to three percent of your development team has the expertise. What happens when they leave?

This unpleasant scenario is playing out more and more often. The key benefit of scalable ETL environments is that developers are largely shielded from the details of coordinating – or even thinking about – parallel and distributed programming. Using a visual programming environment, a developer can implement ETL job streams with many prebuilt components, without concern for whether a job stream will use 4-way or 16-way data parallelism. Unlike components in standard ETL tools, these prebuilt components were designed from the beginning to run across multiple disks, multiple CPUs or even multiple servers.

Developers are also shielded from the hardware specifics of allocating programs to servers/CPUs or moving data around for processing. The hardware allocation details are managed in a system configuration component that is updated as your system group adds more hardware in the form of new servers (CPUs) or more file systems (disks) – requiring no application development change to make use of the new hardware.
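Here is a minimal sketch of that idea, with a hypothetical one-entry-per-line layout file standing in for the system configuration component. The job stream discovers its degree of parallelism at run time, so adding hardware means editing the configuration rather than the code:

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // Sketch: the job stream never hard-codes its degree of parallelism. A
    // layout file (hypothetical format: one "host:datadir" entry per line)
    // lists the available partitions; when the system group adds a server or
    // a file system, only this file changes, not the ETL application.
    std::vector<std::string> load_layout(const std::string& config_path) {
        std::vector<std::string> partitions;
        std::ifstream cfg(config_path);
        for (std::string line; std::getline(cfg, line); )
            if (!line.empty())
                partitions.push_back(line);
        return partitions;
    }

    int main() {
        const auto layout = load_layout("etl_layout.cfg");   // hypothetical file name
        std::cout << "running the job stream " << layout.size() << "-way parallel\n";
        for (const auto& p : layout)
            std::cout << "  partition assigned to " << p << '\n';  // dispatch work here
    }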

Suddenly the growth path becomes predictable and clear for your ETL environment. As business requirements grow, you can scale the hardware to accommodate the processing requirement without requiring ETL programming/software changes because you've designed your ETL applications to be scalable from the beginning.

Real-World Demands Need Real-Time Data Warehousing

Real-time data warehousing is no longer a buzzword within the scalable ETL development environment. Your marketing group can extract data in near real time, perform analytics and deliver the personalization that differentiates your business. And you can make that happen.

On a final note, industry watcher META Group estimates that within the next year or two, data analysis efforts for e-commerce sites will need to deal with 10 times more data, analyzed 5 to 10 times more quickly than current best practices. That's 50 to 100 times more capacity. Are you ready for it? You can be – really.
