ETL: The Best-Kept Secret of Success in Data Warehousing
While watching my kids work on their homework, I often wonder why they wait until late at night to begin tough math. Prime time is usually spent on art, history, social studies, biology and the like. As a know-it-all parent, I then start my midnight preaching of best time-management practices, recommending they start with the most difficult work. That way, I preach, you best utilize your resources (time), learn the most and score the highest marks; you meet the max of your objectives.
Yet I see this same mistake being repeated over and over again by the professionals in our trade data warehousing (or information delivery, if you like). A high cost to deliver solutions is the primary and measurable impact of this mistake.
It all starts with how our projects get sold showing nice, have-it-all CRM screens, impressive slice-and-dice reports, PowerPoint charts demonstrating successfully met business objectives and ROI. Expert talk is presented about the importance of business visions, sound project management and methodologies, system architecture and state-of-the-art database design. For data mining applications, the experts talk about sophisticated techniques, artificial intelligence, neural networks and rocket science.
Who would want or dare to offend executive business-oriented clients with discussion of something so low and ordinary as data integration and ETL (extract, transform and load) processes? Additionally, who would "waste" the most-intelligent consultants on back-office work, a place beyond any executive and business visibility, when it's so much more rewarding to focus on business analyses, project management and database architecture?
Maybe the client executives would be intrigued by ETL if they learned that this is arguably the most critical and most costly part of any information system designed for decision support.
More than 50 percent of all development work spent on any data warehousing project is ETL design and development. In data mining applications, more than 80 percent of all work is data preparation (i.e., ETL). The success of data mining is more dependent on solid data preparation than choosing a best-fitting algorithm. Poorly designed ETL processes are costly to maintain, change and update. In many cases, the ETL process eventually gets stiff and is unable to take another change. ETL arthritis caused by poor design not only slows system performance, but makes it impossible to build further on the system.
In data warehousing, the emphasis has traditionally been given to database design including normalization and denormalization, treatment of dimensions, restatements and aggregations. In the best case, ETL has been seen as little more than meta data-driven generation of stored procedures running off and supporting/feeding the data warehouse database.
In this article, I want to emphasize the role of ETL in a new era of data warehousing projects. Let me do so by reviewing some of the data warehousing paradigms.
Paradigm No. 1: Data warehouse is a term describing a whole (logically) integrated system environment. A data warehouse shouldn't be seen only as the one integrated database feeding some end-product data marts. Rather, it's a logical collection of highly interconnected systems consisting of multiple databases, staging areas, transformation processes and complex transformation engines.
We can no longer guess when an information derivation reaches a final stage. The data process is more complex than a simple stream of legacy to data warehouse to data mart. Feedback loops turn end-product data into inputs of new derivation processes. One data mart is likely to feed other data marts. The client credit risk indicator derived in a risk management system might be further used in a management information system in the form of different aggregations, in a customer relationship management system for derived client treatments or in an enterprise performance management system as one of the inputs for instrument-level profitability calculation. Some of this derived information may make it back to the data warehouse for further use or maybe to the exploratory data mart, where we collect data as input variables for building of new predictive models. All the interconnections, in fact, represent some ETL processes.
Paradigm No. 2: Thou shall have only one extract per every legacy system. Hardly anyone has, but we shall strive toward that goal. As new information-delivery systems are requested and built, overlapping extracts from legacy systems pile upon each other. Overlapping in requirements, but not unified in design, these multiple extracts contribute greatly to a new era of data quality problems.
One of the original drivers of data warehousing was the identified need of centralized data cleansing processes. At that time, we were worried about erroneous data entries about customers who, according to our records, were not born yet or lived at nonexistent addresses.
While partially fixing these easy-to-identify errors, new data quality problems were introduced through inconsistent ETL designs of data warehousing systems. Problems were caused by ambiguous or multiple transformation rules which were poorly documented and implemented several times without any consistency. For example, in retail banking, it is important to identify inactive accounts. An inactive account might be an account with less than $10 that has been dormant for four months or longer. Another definition of an inactive account could be similar: every account with less than $20 that has been dormant longer than 6 months.
Where is the problem? The problem is not in having two different rules, but in having them implemented inconsistently as part of two separately managed extracts/separately managed ETL processes. One extract and transformation might be happening off the legacy environment while the other might be happening off the UNIX or relational database management system (RDBMS) environment hosting the data warehouse. There is no one place of governance, no one system for ETL meta data. One extract per source resolves many such problems. Problems which could, in some cases, cause up to hundreds of thousands of dollars of business loss. This is an ETL design issue.
Paradigm No. 3: ETL is a multilayered process with a system of staging areas presenting an integral part of the logical data warehouse environment. For performance and other reasons, it is practical to execute heavy-duty data transformation outside of the relational database world. Over time, our experience in developing ETL has taught us that some of the transformation process can be reused for multiple final data/information products. You save on development time, maintenance, performance and space if you smartly leverage the reuse potential. The transformation process is multilayered. Not all that is produced must make it immediately to some relational database.
RDBMSs are great for querying, but what if we don't need to query on all the data elements we produce? We keep them in well designed and organized staging areas. Some of those elements might be needed for further transformation processing or serve as input into embedded predictive models (also a special type of transformation). The outputs are what make it through the load into some other system and database.
Smartly designed multilayered ETL can substantially reduce development costs and improve maintainability and information quality, bringing annual savings of hundreds of thousands to millions of dollars.
Paradigm No. 4: Pay attention to ETL meta data. Parameterize as much as you can. Poor ETL design is usually first detected by the next generation of development. Just about anybody can write the first release of an ETL process. The challenge comes with the maintenance and consecutive releases.
The ETL meta data is problematic enough even with solid parameterized design. Just consider the previous example with the inactive accounts. The rule is first introduced in a project scope document. Then it makes it to a mapping document used by the data architects and ETL designers. It is reflected in the database model and respective CASE tool meta data repository, and also in the ETL design and respective ETL tool repository. Because of the well-known problem of disintegrated local meta data repositories, all the different entries are made manually, a process prone to human error. It is unlikely that later interpretation of that rule will look into all these different repositories; later changes won't get recorded in all the places they should.
Poor ETL design, poor ETL programming habits and neglected ETL meta data management often magnify maintainability problems to the point where the entire data warehousing system is pronounced throwaway a costly result of many piled-on mistakes.
Paradigm No. 5: Quality assurance (QA) processes start with the planning of a new development and go all the way to the project end and system production delivery. Data warehousing projects are usually lengthy and costly undertakings. Traditional QA executed on the end of the development life cycle is usually too little, too late. Many problems go back all the way to project scope documentation and interpretation of transformation rules. Additionally, changes in ETL are costly, having multiple impacts on other processes and subsystems and requiring thorough retesting. Remember, ETL requires more than 50 percent of all development work.
Therefore, it pays greatly to engage in QA process early and to use a solid methodology that especially concentrates on the ETL process. Such early, ETL-focused QA might seem costly to many; but, in fact, this is one of the safest assurances to deliver your project on time and within budget.
Data warehousing is now mainstream. Once mainly a source for OLAP-smart MIS applications, today it serves many other business-critical missions from money laundering and fraud detection, CRM, performance and risk management to multiwave campaign management, clickstream analyses and Webhousing. Many businesses have missions and different project sponsors; but they all have one thing in common it is all about processing essentially the same data!
The most advanced organizations recognize the potential of leveraging the overlapping requirements of these systems, many of which are in almost perfect parallel development what costly and unnecessary duplication! These organizations nominate high-powered executives to oversee that these overlaps are detected and projects and developments synchronized. Not only can they achieve tremendous overall development cost reduction and thus improvement in ROI of individual project business cases; but with their focus on intelligent overall ETL design (because that's where the potential really lies), they achieve higher maintainability of their systems and, therefore, also lower costs and increase business value for the future.
Thus, ETL makes it into the forefront of business getting recognized as the most critical, most sophisticated, largest and integral part of modern data warehouse development, with the largest cost-cutting and business-impact potential to leverage synergies among many mission-critical projects.
The next time you ask the big consulting guys selling you a new development who will be doing the ETL and you hear that they happily leave this job to your internal IT department, be on the alert. If you connect your success with anything more than the low-hanging fruit of quick wins and if you expect to be around after the celebration of the first-release system delivery, you now know that your success depends on the attention you pay to the overall integrated ETL design. Don't miss the chance!