Many data warehouse efforts struggle with determining the best methodology and implementation strategy for performing extract, transform and load (ETL) processing for their particular enterprise. Once the business requirements for the warehouse project have been defined, the source systems analyzed and a physical data model has been developed, you are ready to begin designing your ETL processes. A systematic approach needs to be applied for designing your ETL processes. This column will describe the use of a staged approach for designing ETL process flows. Each process is directed toward loading of a particular target table or cluster of tables depending on the business requirements and technical design considerations.
A staged transformation approach is used to optimize data acquisition and transformation from source systems. This method provides increased flexibility and extensibility to ETL developers. The various stages provide a single processing method for initial and successive loads to the data warehouse. The five stages provide an adaptable transformation process for the target table(s) that can adapt easily to changes in the source systems or warehouse model design. The combined use of meta data tags, sometimes referred to as operational meta data, in the warehouse schema design and ETL transformation processes allows for improved capabilities in loading and maintenance designs. Not all stages will be used in every target table case depending on the business requirements and complexity of business rules needed in the transformation.
The first stage, source verification, performs the access and extraction of data from the source system. This stage builds a temporal view of the data at the time of extraction. The source extract built in this stage is included in backups of the entire batch cycle for reload purposes and for audit/reconciliation purposes during testing. If audit files are provided from the source system, these files are compared against the extract files to verify such items as row counts, byte counts, amount totals or hash sums. During this stage, both technical and business meta data can be captured and verified against the meta data repository (if available). Verification of business rules unique to the source system can also be applied during this stage using the repository to identify any exceptions or errors.
The second stage, source alteration, can perform a variety of transformations unique to the source depending on business requirements. These transformation options include integration of data from multiple source systems based on priority ranking or availability, integration of data from secondary sources, splitting of source system files into multiple work files for multiple target table loads (clusters), and application of business logic and conversions unique to the source systems. Meta data tags such as source system(s) identifiers, production key active in source system flags and confidence level indicators can be applied during this transformation stage.
The third stage, common interchange, applies business rules and/or transformation logic that are frequent across multiple target tables. Examples of transformation logic applied during this stage include referential integrity (e.g., population of fact table surrogate keys from dimension tables) and application of enterprise definitions and business rules from the meta data repository (e.g., code values to ISO standard formats).
Target Load Determination
The fourth stage, target load determination, performs final formatting of data to produce load-ready files for the target table, identifies and segregates rows to be inserted versus updated (if applicable), applies remaining technical meta data tagging and processes data into the relational database management system (RDBMS). Loading data into the RDBMS can be considered a separate stage depending on your preferences. Data rows processed from the source system up to this stage for the current batch cycle are compared against records loaded in previous cycles to determine if insertion or update is required. An example of this is a dimension table that uses slowly changing dimension (SCD) type 2 method for update. The load-ready files built in this stage are included in backups of the entire batch cycle for reload purposes and for audit/reconciliation purposes during testing.
Depending on the database management system (DBMS) used for the warehouse and the volume of records being processed, the high-speed parallel load process of the database engine may be used in this stage in place of the standard database load utility. Faster load performance can sometimes be achieved through truncating and reinsertion of all data rows versus updating due to optimizations made available through use of the DBMS's high-speed parallel load utility.
The fifth and final stage, aggregation, uses the load-ready files from the fourth stage to build aggregation tables needed to improve query performance against the warehouse. This stage is typically applied only against fact table target load processes. Care needs to be taken to ensure that aggregated records end up with the correct surrogate keys from the dimension tables for the rollup levels required in reports.
This five-stage approach provides a modular and adaptable means to efficiently load a data warehouse. ETL process designs can easily adapt to changes in source systems or the addition or removal of source systems without affecting the entire ETL workflow. Additionally, changes to the data warehouse schema will cause minimal impact to ETL processes through the insulation provided through the various transformation stages.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access