Extraction, transformation and loading (ETL) accounts for more than 70 percent of the overall execution effort in an enterprise data warehousing project. While typical projects spend considerable time and energy finalizing the overall data warehousing architecture, the detailed design of the ETL architecture has not received the attention and research it deserves. Several technologies and architectures exist in this area; however, the choice is usually driven by the architect's experience with prior projects and by what seems most practical in a given situation while satisfying the triple project management constraints of scope, time and cost. This article outlines a more objective, analytical approach to selecting the ETL and data flow architecture, based on a small set of variables, with the objective of enhancing the reliability of the overall data warehousing solution.

Data Flow Architecture


The reliability of the overall data flow is critical to ensuring that the data warehouse meets the service level agreement it was designed to satisfy. While data validation efforts are required to confirm that the data is correct, or at least within an acceptable quality threshold, it is equally important to consider the factors influencing the reliability of the data flow from the source systems all the way to the presentation layer, so that appropriate steps can be taken during the detailed ETL design phase.
Consider the typical data warehouse architecture shown in Figure 1. Data moves from the source systems to staging, to the operational data store, to the data warehouse and then to the data mart before reaching the presentation layer. There can be up to five stages of ETL before the data gets to the user. Reliability at each stage is important, since each stage is a link in a longer chain.

Reliability Revisited


The reliability (R) of a system with n components in series is

R = R1 * R2 * R3 * ... * Rn

where Ri = the reliability of individual component i.

The reliability (R) of a system with n components in parallel is

R = 1 / (((1/R1) + (1/R2) + ... + (1/Rn)) / n)

where Ri = the reliability of individual component i.

Applying the above principles to the data flow within a data warehousing framework, there are two basic variables that impact the overall reliability. They are:

n   = number of stages or data flow components from start to end, and

Ri  = reliability within a stage.
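
As a rough sketch, the two formulas can be expressed in a few lines of Python (the function names below are illustrative, not part of any library):

    def series_reliability(components):
        # Components in series: R = R1 * R2 * ... * Rn
        result = 1.0
        for r in components:
            result *= r
        return result

    def parallel_reliability(components):
        # Parallel formula used here: R = 1 / (((1/R1) + ... + (1/Rn)) / n),
        # i.e., the harmonic mean of the individual reliabilities
        n = len(components)
        return 1.0 / (sum(1.0 / r for r in components) / n)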

Number of Stages

The number of stages is very important, since the overall reliability falls as the number of stages increases. Think of a system with three stages in series, each with a reliability factor of 0.9. The overall reliability of the system equals 0.73 (0.9 * 0.9 * 0.9). Add another stage to the system, and the reliability drops to 0.66 (0.9 * 0.9 * 0.9 * 0.9).
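
The effect is easy to verify with a short loop (assuming, as above, a reliability of 0.9 per stage):

    # Overall reliability of a chain of identical stages, each with reliability 0.9
    for n in range(1, 6):
        print(n, "stages:", round(0.9 ** n, 2))
    # 3 stages -> 0.73, 4 stages -> 0.66, 5 stages -> 0.59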

Reliability of Individual Components

The following sections describe how to compute reliability at each stage of ETL within the data flow. In all cases, it is assumed that the programs (new and old) have been thoroughly tested and validated, so the program logic itself can be treated as 100 percent reliable.

Case 1

Data is extracted from the source system and stored in the form of flat files on a server/partition/folder; a second ETL program then uses these flat files to transform the data and load it into tables (refer to Figure 2).

Reliability of this type of ETL stage can be computed as:

Ri = r1 * r2 * r3 where

r1 = Reliability attached to process 1 in Figure 2, which for all practical purposes can be assumed to be the availability of the source system, based on past experience.

r2 = Reliability attached to process 2 in Figure 2, which could be availability of the server/partition/folder storing the flat file.

r3 = Reliability attached to process 3 in Figure 2. This is equal to the availability of the server/partition/folder on which the destination tables are located.
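
As a sketch of the Case 1 calculation, with purely illustrative availability figures (the 0.95, 0.98 and 0.97 values below are assumptions, not taken from the article's figures):

    # Case 1: source system -> flat file on a landing server -> destination tables
    r1 = 0.95  # assumed availability of the source system (process 1)
    r2 = 0.98  # assumed availability of the server/partition/folder holding the flat file (process 2)
    r3 = 0.97  # assumed availability of the destination server (process 3)
    stage_reliability = r1 * r2 * r3  # roughly 0.90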

Case 2

A single program performs the extraction, transformation and load (refer to Figure 3). The reliability of such a stage can be computed as

Ri = r1 * r2  where

r1 = reliability attached to process A in Figure 3, which is equal to the availability of the source system.

r2 = reliability of process B in Figure 3, which can be assumed to be the availability of the destination server for all practical purposes.

A quick comparison of the two cases reveals that case 1 will always yield lower reliability than case 2, because the intermediate flat-file landing area adds one more component, and therefore one more factor less than 1, to the series.
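
With the same illustrative availabilities as in the Case 1 sketch, the comparison can be checked directly:

    # Case 2: a single program reads from the source and writes straight to the destination
    case_2 = 0.95 * 0.97          # roughly 0.92
    case_1 = 0.95 * 0.98 * 0.97   # roughly 0.90
    # Because every additional factor is less than 1, case 1 can never exceed case 2
    assert case_1 <= case_2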

How to Compute: An Example

In Figure 4, an enterprise data warehouse has three stages of ETL: sources to staging, staging to data warehouse, and data warehouse to data mart. The presentation layer accesses the data mart for reporting and analytics. There are four source systems in this case, and the availability of each server is indicated in parentheses.

Reliability of E1 = 1/(((1/0.95) + (1/0.93) + (1/0.96) + (1/0.90)) / 4) * 0.98 = 0.92

(The data flows from sources to staging are in parallel.)

Reliability of E2 = 0.98 * 0.94 = 0.92

Reliability of E3 = 0.94 * 0.97 = 0.91

Reliability of the overall data flow for the EDW = 0.92 * 0.92 * 0.91 = 0.77

As you can see, the overall reliability of the data flow is less than 80 percent even though the availability of each individual component within the system is more than 90 percent.

If there were intermediate landing zones (as in Figure 2), the overall reliability would be lower still.
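
The Figure 4 computation can be reproduced in a few lines (the availability figures are those quoted in the example; the variable names are illustrative):

    sources = [0.95, 0.93, 0.96, 0.90]     # four source systems, extracted in parallel
    staging, warehouse, mart = 0.98, 0.94, 0.97

    # Parallel combination per the formula above (harmonic mean), times the staging server
    e1 = (len(sources) / sum(1.0 / r for r in sources)) * staging   # roughly 0.92
    e2 = staging * warehouse                                        # roughly 0.92
    e3 = warehouse * mart                                           # roughly 0.91
    overall = e1 * e2 * e3                                          # roughly 0.77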

Usefulness

This concept can be used to specify the service level the data warehouse can satisfy, based on the availability of the source systems and of the infrastructure that makes up the data warehousing environment itself. Where the latency requirement is very stringent, the data warehousing team needs to examine the overall reliability of the system before committing to a specific service level for the business.
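
As a simple illustration (the 0.95 target below is hypothetical), the achievable reliability can be compared against a proposed service level before any commitment is made:

    sla_target = 0.95   # hypothetical service level requested by the business
    achievable = 0.77   # overall data flow reliability from the example above
    if achievable < sla_target:
        print("Revisit the ETL architecture or component availability before committing.")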

On the other hand, if the business requires a certain level of service, it is recommended that the overall ETL architecture and data flow be reviewed for possible improvements in the reliability of the individual components and processes within the data warehousing framework.
