Faisal Shah would like to thank John Y. Miller for his contribution to this month’s column.

Can you imagine managing 500 terabytes of data? Or integrating billions of clicks from your Web site with data from multiple channels and business units every day, perhaps more than once a day? It may seem like an extreme scenario, but it is one that industry analysts uniformly predict organizations will confront within the next three to four years. The exact figures vary slightly, but the consensus is solid: enterprises are going to be swamped with data, and they will have to figure out how to manage it or risk being left behind.

A rapid increase in data volume is not the only challenge enterprises will face. End users want more information at more granular levels of detail, and they want flexible, integrated, timely access. The number of users and the number of queries are also growing dramatically. Additionally, organizations are placing more emphasis on high-value, data-intensive applications such as CRM. All of these developments pose problems for enterprise data management. Fortunately, there is an effective answer to these problems: scalable data solutions and, more specifically, scalable ETL environments.

Why You Need a Scalable ETL Environment

I define scalability as the retention or improvement of an application’s performance, availability and maintainability with increasing data volumes. In the case of performance, scalability implies the ability to increase performance practically without limit. The remainder of this column will focus exclusively on performance scalability. Future columns will discuss scalable availability and maintainability.

Scalability as a concept is not new, but achieving it is becoming much more challenging because of the dramatic increases in data volume and complexity that enterprises are experiencing on every front. How long will users tolerate growing latency in the reporting we provide them? How long can we keep adding hardware, installing new software, and tweaking applications or building new ones from scratch? How can the e-business channel keep carefully building and integrating CRM solutions at this pace? The fact is, when we look under the covers, yesterday’s scalable solutions aren’t working in our new environment.

The extract, transform and load (ETL) environment poses some especially difficult scalability challenges because it is constantly changing to meet new requirements. Enterprises must tackle the scalability problem in their ETL environments in order to successfully confront increasing refresh frequencies, shrinking batch windows, increasing processing complexity and growing data volumes. Without scalability in the ETL environment, scalability in the hardware or database layer becomes less effective.

In terms of implementing scalable data solutions, enterprises should adopt a "build it once and build it right" attitude. If an ETL environment is designed to be scalable from the start, the organization can avoid headaches later. Let’s consider a situation in which this is not the case, and the ETL environment is architected without consideration for scalability. The first generation of this solution will be fine until data volumes exceed capacity. At that point, the organization can make the fairly easy change to a second-generation environment by upgrading hardware and purchasing additional software licenses. Once this solution is no longer sufficient, however, the enterprise will find it more difficult and costly to evolve to a third-generation solution, which usually involves custom programming and buying point solutions. Finally, once the third-generation solution has reached its limits, the enterprise will need to rebuild its ETL environment entirely, this time using scalable and parallel technologies. Clearly, enterprises can save time and money by implementing a scalable ETL environment from the very beginning.

Techniques to Achieve Scalability in the ETL Environment

Although a thorough discussion of techniques to ensure performance scalability is beyond the scope of this column, following are some basic considerations for improving the performance of your ETL environment.

Reduce Data Flow

Nothing beats good application design for high performance, and reducing the data flow to improve scalability is common sense. Nevertheless, this technique is often overlooked in ETL applications; a short sketch after the following list shows the idea in practice.

  • Avoid scanning unnecessary data. Are there data sets or partitions that do not need to be scanned? Can I avoid the extra CPU/memory cost of passing unused rows or columns? Could I structure the data in columnar data sets and only read the columns needed?
  • Apply data reduction steps as early as possible in the ETL job flow. Could I filter data during extract processing instead of during transformation processing? Could I aggregate data sooner in the processing stream, reducing the rows sent on for subsequent processing?
  • Use tools to facilitate data reduction. Could I use a "change data capture" feature on a source database?
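
To make early reduction concrete, here is a minimal Python sketch; the file name, column names and filter condition are hypothetical stand-ins, not part of any particular tool. Rows are filtered and columns projected during extract, and aggregation runs before any downstream stage sees the data:

    import csv
    from collections import defaultdict

    def extract(path):
        # Filter during extract and project only the needed columns, so
        # later stages never see unused rows or columns.
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row["region"] == "EMEA":                   # hypothetical filter
                    yield row["product_id"], float(row["amount"])

    def aggregate(rows):
        # Aggregate as early as possible: downstream stages receive one
        # row per product instead of one row per transaction.
        totals = defaultdict(float)
        for product_id, amount in rows:
            totals[product_id] += amount
        return totals

    daily_totals = aggregate(extract("daily_sales.csv"))      # hypothetical file

Everything after the aggregate step now handles one row per product rather than one row per transaction, which is often orders of magnitude less data.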

Eliminate I/O

Whether implemented within the ETL application or, more likely, by the ETL tool, these techniques avoid I/Os and reduce run times. The first three in-memory techniques are generally implemented within the database and often come "free" with an ETL tool that processes in the database. Pipelining is generally supported through an ETL tool; a short sketch of caching and pipelining follows the list.

  • In-memory caching. Will the I/Os generated from caching a lookup file/table be less than a join? Will the lookup file/table fit in memory – either OS buffers, tool buffers or database buffers?
  • Hash join versus sort-merge join. Will the hash table built from the smaller join input fit in memory?
  • In-memory aggregates. Will the aggregation result fit in memory?
  • Pipelining. Can we stream data directly from one stage of ETL processing to the next? Or do we need to land data to disk for recoverability?
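
As a tool-agnostic illustration of the first and last techniques, here is a minimal Python sketch; the file and column names are hypothetical. A small dimension is cached in memory instead of being joined on disk, and generator stages pipeline rows from extract to load without landing intermediate files:

    import csv

    def extract(path):
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    # In-memory caching: build the small lookup once rather than joining
    # on disk (assumes the dimension fits in memory).
    customer_names = {row["customer_id"]: row["name"]
                      for row in extract("customers.csv")}

    def transform(rows):
        # Pipelining: generator stages stream rows from extract to load
        # with no intermediate data set landed to disk.
        for row in rows:
            row["customer_name"] = customer_names.get(row["customer_id"], "UNKNOWN")
            yield row

    def load(rows, path):
        writer = None
        with open(path, "w", newline="") as f:
            for row in rows:
                if writer is None:
                    writer = csv.DictWriter(f, fieldnames=row.keys())
                    writer.writeheader()
                writer.writerow(row)

    load(transform(extract("orders.csv")), "orders_enriched.csv")

The trade-off named in the pipelining bullet still applies: if a step fails, nothing has been landed to disk, so the job must restart from the source.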

Optimize Common ETL Processing

Certain ETL tools provide optimized implementations of common, processing-intensive operations; a sketch of one of them, incremental aggregation, follows the list.

  • Change data capture. Will the changed records alone (inserts and updates) be sufficient for all the ETL logic required before populating the target system?
  • Surrogate key optimization. Will the surrogate key logic supported within the tool allow the warehouse dimensions to change in the manner my users need? Will I have to write my own within the tool?
  • Incremental aggregates. Does the tool support optimizations for only applying deltas/changed data to aggregates?
  • Bulk loading. Does the bulk loading support the unit of work features required for recoverability?
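
The incremental-aggregates idea reduces to a small amount of logic; here is a sketch in plain Python with hypothetical keys and stored totals. Only the day’s deltas are folded into the stored aggregate instead of recomputing it from full history:

    def apply_deltas(aggregates, deltas):
        # Incremental aggregation: fold only the changed rows into the
        # stored totals instead of recomputing from all history.
        for key, amount in deltas:
            aggregates[key] = aggregates.get(key, 0.0) + amount
        return aggregates

    running_totals = {"P1": 1200.0, "P2": 450.0}    # loaded from the warehouse
    todays_deltas = [("P1", 35.0), ("P2", -10.0), ("P3", 99.0)]   # from change data capture
    running_totals = apply_deltas(running_totals, todays_deltas)

The run time of the refresh now grows with the size of the day’s changes, not with the size of the history.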

Eliminate Bottlenecks and Balance Resources

Hardware performance tuning is essential, and it depends heavily on the ETL application workload being run through the system or systems. There is no such thing as a perfectly tuned system, but several tuning techniques can get a system running close to optimal; the timing sketch after the list is a starting point.

  • Eliminating configuration bottlenecks. Have I verified that none of the following is a bottleneck: the disk farm (I/Os), the memory buffer allocation, the network, the CPU or operating system configuration, or the ETL tool configuration?
  • Balancing resource utilization within applications. Have I made sure that the hardware environment has the right mix of CPU/disk/memory/network for the ETL environment? Should I add or subtract hardware to balance the hardware to the application needs?
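
Before adjusting any of these resources, measure where the time actually goes. Here is a minimal, tool-agnostic Python sketch (the stand-in transformation is hypothetical) that times one pipeline stage at a time, so the dominant stage, whether CPU-, disk-, memory- or network-bound, stands out:

    import time

    def time_stage(name, stage, rows):
        # Run one pipeline stage over an in-memory input and report the
        # wall time it consumed; timing stages one at a time shows which
        # stage dominates the batch window.
        data = list(rows)          # materialize input so only this stage is timed
        start = time.perf_counter()
        result = list(stage(data))
        print(f"{name}: {time.perf_counter() - start:.3f}s, {len(result)} rows out")
        return result

    def transform(rows):           # hypothetical stand-in for a real transformation
        return [r * 2 for r in rows]

    out = time_stage("transform", transform, range(1_000_000))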

Provide Rules to Determine Environment Scalability

  • Linear speed-up. An environment exhibits linear speed-up if, given N times more resources, the run time of a fixed workload drops by a factor of N.
  • Linear scale-up. An environment exhibits linear scale-up if run time holds constant when both data volume and resources grow by a factor of N. The sketch below computes both from sample run times.
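
As a worked example with hypothetical run times, both rules reduce to simple ratios:

    # Hypothetical run times (in minutes) for the same ETL job.
    t_base = 120.0         # 1 unit of resource, 1x data
    t_more_cpu = 33.0      # 4 units of resource, 1x data
    t_more_both = 126.0    # 4 units of resource, 4x data

    speed_up = t_base / t_more_cpu     # linear speed-up would yield 4.0
    scale_up = t_base / t_more_both    # linear scale-up would yield 1.0
    print(f"speed-up: {speed_up:.2f} (linear = 4.00)")
    print(f"scale-up: {scale_up:.2f} (linear = 1.00)")

In this example the environment achieves a speed-up of about 3.6 of a possible 4.0, and run time grows only five percent as data and resources quadruple, which is close to linear on both measures.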

Use Data Parallelism

Parallel processing is only slowly being adopted by ETL tools, although many enterprises are already using it without the help of tools. Following are some of the considerations for using data parallelism; a small partition-process-gather sketch follows the list.

  • Application data parallelism. Is my ETL job stream too small to be partitioned to use parallelism? How will I change my application logic such as joins and aggregates to accommodate partitioned data? How many different partitioning schemes will I need? How will I efficiently repartition the data? How will I gather the data back into serial files?
  • Control and hardware parallelism. How will I control and manage jobs running on different servers (a distributed environment)? How will I move data between different servers? How do I recover from failure and provide monitoring and debugging?
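
Here is a minimal partition-process-gather sketch using Python’s multiprocessing module; the partition count, keys and aggregation logic are hypothetical. Each worker aggregates its own hash partition independently, and the results are gathered back into one serial structure:

    from multiprocessing import Pool

    N_PARTITIONS = 4                      # hypothetical partition count

    def partition(rows, n):
        # Hash-partition on the key so related rows land together and no
        # key spans two partitions.
        parts = [[] for _ in range(n)]
        for key, value in rows:
            parts[hash(key) % n].append((key, value))
        return parts

    def transform_partition(rows):
        # Each worker aggregates its own partition independently.
        totals = {}
        for key, value in rows:
            totals[key] = totals.get(key, 0.0) + value
        return totals

    if __name__ == "__main__":
        rows = [("P1", 10.0), ("P2", 5.0), ("P1", 2.5), ("P3", 7.0)]
        with Pool(N_PARTITIONS) as pool:
            results = pool.map(transform_partition, partition(rows, N_PARTITIONS))
        # Gather the per-partition results back into one serial structure.
        merged = {k: v for part in results for k, v in part.items()}
        print(merged)

Because the rows are partitioned on the aggregation key, the per-partition results can simply be merged at the gather step; a join or aggregate on a different key would first require repartitioning the data, which is exactly the cost the questions above ask you to weigh.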

Conclusion

Developing a scalable ETL environment is well worth the extra time and effort it requires. Enterprises can’t afford to ignore the massive data growth industry analysts are predicting for the near future, and those who implement parallel and scalable solutions from the start will gain a competitive edge by making the most of customer, partner and market information.
