Requirements have been defined, the data model is complete, source systems have been identified, tool selections have been made, and now the only thing left to do is connect the dots. Sounds easy, right? Designated with creating the extract, transform and load (ETL) architecture to move the source data into the warehouse, you begin drawing boxes and lines depicting the individual ETL processes that must be created in order to reconcile the idiosyncratic source system data into its generic business view of the world. The solution seems obvious: create an ETL process for each source you must introduce.
And they're off! The ETL analysts charge forward, creating the detail design (source-to-target mappings) for each of the processes you identified. However, upon reviewing said designs, you begin to discover that many of them are repetitively performing many of the same actions (and not always consistently). Maybe there's a better way.
ETL architects and data warehouse designers are faced with the task of homogenizing data into a standard and consistent format. Whether by geography, business unit or anything (and everything!) else, data in the source systems feeding the data warehouse is inherently different. This article examines the conformed ETL architecture - an approach that eliminates proliferating, redundant ETL actions by distinguishing between where to apply source-specific and non source-specific business rules during the ETL load processes within the data warehouse.
Defining ETL Actions
Creating a stable, timely, maintainable and extensible framework for converging like, but disparate, data sources into homogenized entities is the goal of any successful ETL architecture. This meshing of source system-specific data into the business view represented in the warehouse requires two distinctly different types of ETL actions:
- Source-Specific Actions: Those actions which apply source-specific business rules in order to transform and decode the data from a format that is recognizable only by the source systems into a standardized format.
- Non Source-Specific Actions: Those actions performed on data, regardless of the source system of origination. Non source-specific actions include the application of business rules, as well as the ETL logic necessary to effect the change in the warehouse (insert/update).
A logical ETL design outlines the necessary actions and mapping of specific elements that must be performed in order to successfully integrate disparate data sources. This detailed abstraction provides insight into the transformations that the source data must undergo at an atomic level, regardless of the ETL tool used for the physical implementation. Accurately categorizing these actions into their appropriate designations (source- or non source-specific) can be a difficult undertaking. Consider the following scenario where the customer's status (either ACTIVE or INACTIVE) and the customer's billing status (either BILLED or UNBILLED) are being determined (see Figure 1).
Figure 1: Source-Specific and Non Source-Specific Business Rules
It would be easy to mistakenly determine that the assignment of the customer's billing status would be an action specific to the source. However, by isolating the individual transformations and business rules to apply to source data, we have better insight into the true scope of the business rule to apply, allowing us to create generic, reusable business rules and definitions.
Traditional Versus Conformed ETL Architecture
Following is a simple example as context for this discussion.
Company ABC has two sales systems (systems X and Y), which contain information surrounding the sale of widgets. Each sale of a widget needs to be captured as part of the data warehouse implementation. The sales are qualified by the date of the sale, the customer purchasing the widget, the type of widget purchased (commercial or residential) and the total dollar amount of the sale (see Figure 2).
Figure 2: Widget Sales Example - Sources and Targets
When faced with the issue of how to create the necessary ETL processes to converge like data sources into standardized entities within the warehouse, there are two choices:
Traditional ETL Architecture:
Create individual ETL processes for each source system (as shown in Figure 3).
Figure 3: Traditional ETL Architecture
A traditional ETL architecture would create one ETL process to perform all of the logic necessary to transform the source data into its target destination. The advantages to this approach are that there are fewer ETL processes to create and maintain. The hidden shortcoming of this approach is that each individual ETL process must redundantly perform many of the same actions (the non source-specific actions). This duplication of efforts results in additional efforts throughout the life cycle of the data warehouse:
- During development, this redundancy results in an increase in the amount and complexity of ETL code to create and test, thereby lengthening the development time line.
- Upon implementation, modifications require more effort as changes potentially must be made in multiple places.
Conformed ETL Architecture:
Create individual ETL processes for each source to apply source-specific business rules necessary to conform the source into a generic format (conformed widget sales) and create reusable, generic processes to execute non source-specific business rules and actions (see Figure 4).
Figure 4: Conformed ETL Architecture
The conformed ETL architecture involves creating an intermediary definition of the data entity (the conformed table). This operational data entity contains the standardized format of the converged source systems, clearly defines the boundary between source-specific and reusable generic actions and rules, and serves as the gateway for moving data into the warehouse. Obviously, the disadvantages of this approach are that there are more objects and processes that must be created and maintained. However, the underlying advantages of this architecture are plentiful:
- Modularization of ETL processes: The creation of smaller, less complex ETL processes makes troubleshooting problems and creating future enhancements easier.
- Reusability of post-conform processes: Often, the "simple" process of enforcing referential integrity (looking up foreign key assignments) and performing inserts and/or updates to the final target destination is "not so simple." The conform approach prevents having to duplicate the efforts and logic necessary to apply non source-specific actions to data. Eliminating redundant processing and ETL logic reduces the chances of error, ultimately improving data quality.
- Extensibility: The reusable aspects of the conformed ETL architecture allow for the rapid acquisition of additional source systems. When new source data is introduced to the warehouse, only the source-specific ETL processes to conform the data must be created.
To Conform or Not to Conform?
While the conform ETL architecture has many benefits, consideration should be made concerning where to apply the approach, as it is not applicable in all scenarios. Analysis should be performed to decide which sources to conform to ensure that the extra development overhead will have quantifiable benefits (increased performance, flexibility in ETL scheduling, modularization of processes, etc.). In making the decision to follow a conformed architecture or not, be sure to keep in mind the future requirements of the warehouse (additional sources of data, etc.), not just the current state.
Redundancy within a data warehouse is a given - where the data is concerned. By properly identifying and categorizing the business rules and logic to be applied to source data, many processing redundancies can be reduced as well. The conformed ETL architecture allows for the creation of a scalable, extensible framework, which helps to ensure that timely information is consistently delivered to the masses.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access