Enormous amounts of data, disparate legacy systems and nonstandardized business terminology are just a few of the challenges an organization must overcome when building a data warehouse. In a typical data warehouse implementation, these challenges are addressed during extract, transform and load (ETL) development. ETL development is no small undertaking and is a sizeable portion of a data warehouse project. Companies often budget 40 to 70 percent of project development effort for their ETL tasks. Designing a robust and scalable ETL architecture is a critical factor for ensuring overall project success.
Because ETL components are the core to most, if not all, data warehouse implementations, it is not difficult to find companies whose greatest concern and greatest source of frustration is directly related to these tasks. Managers for data warehouse implementations often ask these questions:
- Will our ETL process complete in the allocated time frame and meet our service-level requirements with system users?
- Even if our processes can handle current data volumes, will they scale to handle the growing flow of data being generated by our transactional systems and Web sites?
- How can we streamline the ETL processes of our production data warehouse, which is failing to handle new performance requirements?
If you have ever asked yourself these questions, you are not alone. This article explores ETL development and the methods companies can utilize to meet the challenges of this critical success factor in data warehouse implementations.
Avoiding Data Overload
One of the first challenges encountered when beginning a data warehouse is selecting the data elements required to support business users' analysis, exploration and reporting processes. Although this sounds straightforward, it is often anything but! The difficulty lies in the fact that
system users may not be prepared to discuss much more than static reporting requirements. Initially, current operational system reports provide a good basis for understanding data requirements. Sifting through reports to discover data elements is easy, but that is only the beginning. In order to understand data requirements, it is necessary to ask questions such as:
- Does data exist within the company (or externally) which can enhance current reporting?
- Will incorporating additional data elements empower users to move beyond static reporting into the realm of dynamic analysis and exploration?
Invariably, the answer to these questions is a resounding and emphatic yes. After all, that is one of the key reasons a data warehouse is built in the first place.
Herein lies the first of many dangers which can bring ruin to an ETL architecture: data overload.
Instead of selecting only the essential data elements, data architects and business users instinctively begin making plans for an uncertain future and start including data elements for which need or use is not easily discerned. This tendency to inflate data requirements is accompanied by the phrase, "Although we are unsure if we can use this data or if we really need it, the data is available. Just in case, let's store it." Thus begins a potential collision course with ETL disaster. As more and more data elements are included in the requirements, the ETL process is deluged with additional validation and transformation rules which must be applied to each record. This increases development time, process run time, storage requirements and the potential points of failure throughout the entire ETL process. When it comes to evaluating return on investment, consider carefully the short-term and long-term costs of including nonessential data. Keep the following in mind when defining data requirements:
- Start small and keep it simple. If there is any doubt about the potential use of available data, then it should probably not be included in the first phase of the project. Approaching ETL development with a reasonable amount of data elements helps speed project implementation and increases the chances of successfully meeting load-window requirements when the system is migrated into production.
- Think flexibility. A great motto for data warehouse design is: "Things change, be prepared for it." Although you may not include data elements with unknown use now, be prepared to incorporate them, if necessary, in the future. As part of flexibility planning, build generic ETL processes, design data models that accommodate change, explore advanced software functionality and be prepared to deal with increased storage requirements.
- The future is just around the corner. If you can postpone dealing with data elements up front, it is likely that advances in data warehousing technology will simplify the process of incorporating new data into the warehouse in the future. Potential advances include ETL tool enhancements, better database systems and hardware performance gains.
Once data elements are chosen, an architecture must be built to support the extraction, transformation and loading of those elements. Partitioning is one of the foremost concepts to consider when building an ETL architecture. It is a very effective way to improve ETL performance. The first, and most common, use of partitioning consists of breaking large data sets into multiple subsets. Splitting data sets every 10,000 rows or applying rules to a data field such as "sales region" or "transaction date" are common partitioning methods. For example, a data set of monthly sales could be split into weekly data sets by partitioning on the "transaction date" field.
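The date-based split described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each sales record carries a `transaction_date` field (the field name and record layout are invented for the example):

```python
from datetime import date
from collections import defaultdict

def partition_by_week(records):
    """Split sales records into weekly subsets keyed by ISO year and week."""
    partitions = defaultdict(list)
    for rec in records:
        year, week, _ = rec["transaction_date"].isocalendar()
        partitions[(year, week)].append(rec)
    return dict(partitions)

monthly_sales = [
    {"transaction_date": date(2002, 6, 3), "amount": 100.0},
    {"transaction_date": date(2002, 6, 4), "amount": 250.0},
    {"transaction_date": date(2002, 6, 10), "amount": 75.0},
]

# The monthly data set becomes independent weekly data sets.
weekly_sets = partition_by_week(monthly_sales)
```

Each weekly subset can then be validated, transformed and loaded on its own, which is what makes the parallel and multi-device strategies discussed next possible.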
The divide-and-conquer style of partitioning can improve ETL performance in several ways. First, instead of using one process to validate and transform a single large data set, multiple processes can be run in parallel to manipulate the smaller partitioned data sets. Theoretically, if the ETL process is performed on hardware that can utilize parallel processing, the validation and transformation of two smaller data sets would take half as much time as one large data set. Second, the smaller data sets can be saved on separate storage devices such as disk drives, thereby increasing the overall performance of input and output (I/O) operations during the ETL process. Finally, partitioning can be used to split the data, supporting the creation of two or more smaller data marts instead of one large data warehouse. For example, if your company engages in international operations, the data could be partitioned by geographic locations such as North America, Europe and Asia, thereby reducing the amount of data which must undergo ETL processing for each system. This approach also ensures that business analysts have timely access to data regardless of their geographic region. The resulting data marts could then be used to feed a centralized data warehouse for company-wide analysis and reporting.
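The parallel half of divide-and-conquer can be sketched with a worker pool standing in for parallel hardware. The validation rule here is a placeholder, not a real ETL rule:

```python
from concurrent.futures import ThreadPoolExecutor

def validate_and_transform(partition):
    """Placeholder ETL step: drop rows with nonpositive amounts, normalize region."""
    return [
        {**row, "region": row["region"].upper()}
        for row in partition
        if row["amount"] > 0
    ]

partitions = [
    [{"region": "na", "amount": 10.0}, {"region": "na", "amount": -1.0}],
    [{"region": "eu", "amount": 5.0}],
]

# Each partition is handled by its own worker instead of one serial pass.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    results = list(pool.map(validate_and_transform, partitions))

cleaned = [row for part in results for row in part]
```

On hardware with genuine parallel capacity, a process pool (or separate ETL jobs per partition) would replace the thread pool, but the structure is the same.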
Task partitioning is another method for improving ETL performance. Partitioning by task is simply choosing to perform an ETL process at the time and location yielding the best overall performance.
For example, rather than transferring 10 million customer account records from a legacy mainframe to the ETL platform only to filter the data set based on customer status (which leaves 4 million active customers), filter the records on the mainframe before sending them to the ETL platform. Transferring 6 million useless customer records wastes time and network bandwidth. During ETL architecture design, take time to evaluate each platform and its associated tools. Choose the platforms and tools that are appropriate for each individual ETL task. For instance, it may make sense to skip the validation of fields in an ETL tool and harness the power of a database system to perform the validation, or you may decide to build summary tables using a sorting tool instead of building them in a database. Careful exploitation of the strengths and weaknesses of available systems and their tools can have enormous impact on ETL performance.
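The mainframe-filtering example amounts to pushing the predicate down to the source system. The sketch below uses `sqlite3` as a stand-in for the source database; the table, column and status values are hypothetical:

```python
import sqlite3

# sqlite3 stands in for the legacy source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "ACTIVE"), (2, "CLOSED"), (3, "ACTIVE"), (4, "CLOSED")],
)

# Filter on the source platform: only active customers cross the network
# to the ETL platform; inactive rows never leave the source system.
active = conn.execute(
    "SELECT id, status FROM customer WHERE status = 'ACTIVE'"
).fetchall()
```

The same principle applies whatever the source technology: apply the cheapest possible filter as early in the pipeline as possible.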
It's About the Data
The objective of ETL is to manage data. Data is first acquired, then moved from one location to another, validated and transformed, and eventually loaded into its final destination. Missing, invalid and misunderstood data complicate this process of data management. Just when the ETL process appears to be solid and capable of running without errors, surprise! Unexpected data values cause an error and require changes to the warehouse or the ETL process. The following methods can help alleviate this all too common scenario:
- Data Profiling. Data profiling is a strategy to help uncover issues the ETL processes must address by analyzing the range of values represented in each data element. Obtain data extracts early in the development process and analyze them thoroughly. Determine which data elements contain null values or have a tendency to contain erroneous values. If possible, use the same source and amount of data to be processed when the warehouse is implemented in production. Profiling only partial data sets increases the possibility that critical data issues may not be discovered.
- Monitor Source Systems. In many organizations, the level of communication between operational system development teams and data warehouse teams is so minimal that the first time data warehouse developers learn about changes to operational systems is when ETL processes fail. Avoid ETL problems caused by changes in source systems by proactively monitoring the development schedules associated with each system. When changes are developed, assess the impact they will have on the data warehouse. If feasible, process data from source system changes generated during operational system testing.
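A first pass at data profiling can be as simple as scanning an extract for null counts and value ranges per field. This is a minimal sketch; the field names and values are illustrative:

```python
def profile(records):
    """Report null count, minimum and maximum for every field in an extract."""
    stats = {}
    for rec in records:
        for field, value in rec.items():
            s = stats.setdefault(field, {"nulls": 0, "min": None, "max": None})
            if value is None:
                s["nulls"] += 1
            else:
                s["min"] = value if s["min"] is None else min(s["min"], value)
                s["max"] = value if s["max"] is None else max(s["max"], value)
    return stats

extract = [
    {"account_id": 101, "balance": 50.0},
    {"account_id": 102, "balance": None},
    {"account_id": 103, "balance": -10.0},
]

report = profile(extract)
# Unexpected nulls or a negative balance here flags a rule the ETL must handle.
```

Surprises found this way during development become validation rules instead of production failures.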
Sometimes it requires more than partitioning, data profiling and processing carefully chosen data elements to attain ETL success. If you need more options to conquer the challenges of developing and maintaining an ETL architecture, consider the following:
- Continuous ETL. In certain cases, data volumes are so large that it is not possible to process complete data sets at one time. In those cases, the best choice may be to process data continuously throughout the day as it becomes available. For example, suppose a retailer's daily sales data cannot easily be processed in the hours from store closing to promised system availability as set forth in the service level agreement with business analysts. This scenario is a good candidate for continuous ETL. Collect sales data as it occurs throughout the day, send it to the ETL platform, run it through the validation and transformation processes, and load the data into the warehouse or a holding area so it can be loaded into the warehouse later. Continuous ETL can be accomplished using technologies such as database replication and message queuing.
- Set Processing. Processing data in large sets is often more effective than performing ETL tasks on data records one at a time. Assume that two data elements in a customer data source must be transformed by performing a lookup against one of several reference tables, and that the reference table accessed during the transformation is dependent upon other data values in the customer record. Instead of using an ETL tool or database cursors to inspect each row, determine the correct reference table and perform the lookup one record at a time, it may be more efficient to execute the transformations using database functionality that joins the customer table to the reference tables in sets. Intermediate results can be stored in a temporary table before inserting the final results into the data warehouse.
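The set-based alternative to row-at-a-time lookups can be sketched with a single join. The example simplifies the scenario above to one reference table; `sqlite3` stands in for the warehouse database, and the tables, codes and names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER, region_code TEXT);
    CREATE TABLE region_ref (code TEXT, name TEXT);
    INSERT INTO customer VALUES (1, 'NA'), (2, 'EU');
    INSERT INTO region_ref VALUES ('NA', 'North America'), ('EU', 'Europe');
""")

# One set-based join replaces a per-row cursor lookup against the reference table.
transformed = conn.execute("""
    SELECT c.id, r.name
    FROM customer c
    JOIN region_ref r ON r.code = c.region_code
    ORDER BY c.id
""").fetchall()
```

With several candidate reference tables, the same idea applies: one join (or union of joins) per reference table, each restricted by the driving data values, still processes the rows as sets rather than one at a time.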
The Silver Bullet
Achieving ETL success is not unlike achieving success in other areas of data warehouse development. It takes a willingness to experiment and to approach development in new and sometimes unorthodox ways. Keep an open mind and remember there are no silver bullets in ETL architecture. Methods that yield promising results in some situations may perform poorly in others. The approaches to development and design discussed in this article are good starting points for managing ETL processes and ultimately reaching your goals in this critical area of data warehouse development.