
We can have a staging area between extract and transform, or we can have a staging area between transform and load.

By Larissa Moss, Clay Rehm, Les Barbusinski, Scott Howard
Published March 04, 2004, 1:00am EST

Q:

We can have a staging area between extract and transform, or we can have a staging area between transform and load. I would like to know three to four advantages and disadvantages of each.

A:

Larissa Moss’ Answer: This is not an either/or situation. How you perform ETL staging depends on a number of factors:

  1. Your storage availability. If you have ample storage, you may want to "park" the data after each set of programs (extract and transform).
  2. Your architecture. Depending on whether you have a three-tier (persistent staging database, data warehouse, data marts), two-tier (persistent staging or data warehouse, data marts) or one-tier (data marts only) architecture, your ETL job stream would be different for each.
  3. Your data volumes. If your data volumes are small, running most or all ETL processes inline may be possible. But if your data volumes are large, you will probably want to have the flexibility to continue processing a set of data without having completed the previous step on all of your data.
  4. Your data sources. If your source data resides on heterogeneous source files/databases, you may be forced to finish at least some of your extracts before you can proceed with the consolidation, integration, and standardization steps.

  5. Your data quality. If your data is dirty, you may have to read additional source files/databases to perform your edit checks and your data correction process.
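The idea of "parking" data after each set of programs can be sketched in Python. This is a minimal illustration, not a production implementation; the `staging` directory, file names, and row shapes are all hypothetical, chosen only to show extract and transform each landing their output before the next step runs.

```python
import json
from pathlib import Path

STAGING = Path("staging")  # hypothetical staging directory
STAGING.mkdir(exist_ok=True)

def extract(source_rows):
    """Extract: pull raw rows from a source and park them in staging."""
    out = STAGING / "extracted.json"
    out.write_text(json.dumps(source_rows))
    return out

def transform(extract_file):
    """Transform: read the staged extract, clean it, park the result."""
    rows = json.loads(extract_file.read_text())
    cleaned = [{"id": int(r["id"]), "name": r["name"].strip().title()}
               for r in rows]
    out = STAGING / "transformed.json"
    out.write_text(json.dumps(cleaned))
    return out

def load(transform_file, warehouse):
    """Load: read the staged transform output into the target table."""
    warehouse.extend(json.loads(transform_file.read_text()))

warehouse = []
staged = extract([{"id": "1", "name": "  ada lovelace "}])
staged = transform(staged)
load(staged, warehouse)
print(warehouse)  # → [{'id': 1, 'name': 'Ada Lovelace'}]
```

Because each step reads from and writes to a staging file, a failed transform or load can be rerun from the parked data without going back to the source system.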

Les Barbusinski’s Answer: You need both. The first staging area (i.e., between the extract and transformation processes) provides the following benefits:

  • It minimizes the time ETL processes spend "inside" each source system … thereby minimizing any contention and/or negative performance impact on the source system’s resources.
  • It provides a snapshot of the source system data that was extracted for each ETL "run" (i.e., the staging area from the previous night’s processing can be archived at the beginning of tonight’s ETL run).
  • It provides a means for restarting failed/aborted transformation processes.

The second staging area (i.e., between the transformation and load processes) provides the following benefits:

  • It provides a secure area for storing load files and/or "net change" transactional files (for those tables that are too large for a nightly reload).
  • It provides a snapshot of what was loaded in each ETL "run."
  • It provides a means for restarting failed/aborted load processes.
  • It provides a means for "backing out" bad data from a DW table (e.g., removing the effects of errant double posting or a corrupted ETL run).
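The last two benefits, restartability and backing out bad data, both follow from snapshotting each run's load file. Here is a hedged sketch under assumed names (`LOAD_STAGE`, `run_id`, the `_run` tag column are all illustrative): each load is staged first and tagged with its run id, so an errant run can be removed cleanly.

```python
import json
from pathlib import Path

LOAD_STAGE = Path("load_staging")  # hypothetical second staging area
LOAD_STAGE.mkdir(exist_ok=True)

def stage_load_file(run_id, rows):
    """Snapshot tonight's load file before touching the warehouse."""
    f = LOAD_STAGE / f"run_{run_id}.json"
    f.write_text(json.dumps(rows))
    return f

def load(run_id, table):
    """Load from the staged snapshot; tag each row with its run id."""
    rows = json.loads((LOAD_STAGE / f"run_{run_id}.json").read_text())
    for r in rows:
        table[r["id"]] = {**r, "_run": run_id}

def back_out(run_id, table):
    """Remove every row that a bad run introduced."""
    for key in [k for k, v in table.items() if v["_run"] == run_id]:
        del table[key]

table = {}
stage_load_file(1, [{"id": 1, "v": "good"}])
load(1, table)
stage_load_file(2, [{"id": 2, "v": "bad"}])
load(2, table)
back_out(2, table)    # the effects of errant run 2 are removed
print(sorted(table))  # → [1]
```

A failed load is restarted simply by calling `load` again with the same run id, since the staged snapshot is unchanged.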

Hope this helps.

Clay Rehm’s Answer: My answer would be "it depends." I don’t think there is a right or wrong answer or approach; it depends on your organization’s business rules and practices, which dictate how data should be staged and how long it should be backed up/archived.

Scott Howard’s Answer: You should consider that a staging area exists between each layer of your ETL architecture. The output of one layer is input to the subsequent layer. For example, the output of transform is obviously the input to load. However, you’ll need to weigh the cost of physically instantiating that data against piping it directly from one process to the next. UNIX pipes and POSIX batch pipes come to mind and should be used where practical.
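The piping alternative Scott mentions can be sketched in Python with generators: rows stream from one stage to the next in memory, the analogue of `extract | transform | load` with UNIX pipes, so no intermediate data is physically instantiated. All names and sample rows here are illustrative.

```python
def extract():
    """Generator: rows stream to the next stage, no staging file."""
    for raw in ["1,alice", "2,bob"]:
        yield raw.split(",")

def transform(rows):
    """Consume upstream rows one at a time and emit cleaned rows."""
    for rid, name in rows:
        yield {"id": int(rid), "name": name.upper()}

def load(rows):
    """Drain the pipeline into the target."""
    return list(rows)

# Piped: each row flows extract -> transform -> load without landing
# on disk, trading restartability for speed and storage.
result = load(transform(extract()))
print(result)  # → [{'id': 1, 'name': 'ALICE'}, {'id': 2, 'name': 'BOB'}]
```

The trade-off is exactly the one in the answer above: piping saves storage and I/O, but a failure mid-stream means rerunning from the source, whereas a staged run can restart from the last parked file.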

Now don’t limit yourself to only the extraction (E), transformation (T) and load (L) layers. I run an ETL workshop that quickly establishes that ETL alone as an architecture is inadequate. You need to further identify unique process layers within the traditional ETL architecture that offer a much more detailed and fine-grained mapping of all the tasks that need to be performed in sequence to accomplish your ETL task. For example, one of my customers identified source extract, pre-formatting for consistency and data harmony, filtering, intelligent merge, delta detection, cleansing, standard transformation, build and finally load as unique service layers. Only after identifying these tasks can you effectively implement consistent and integrated ETL processes regardless of your ETL tool and tool sets. Consider now that you may need to manage seven staging areas if this highly recommended approach is embraced.
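A fine-grained layered pipeline like the one Scott's customer built can be sketched as an ordered list of named steps with a staging write after each one. This is only an illustration of the pattern: the three layers shown, their logic, and the `etl_staging` directory are assumptions, standing in for the nine service layers named above.

```python
import json
from pathlib import Path

STAGING = Path("etl_staging")  # hypothetical per-layer staging directory
STAGING.mkdir(exist_ok=True)

# Illustrative subset of the fine-grained layers; each step takes and
# returns a list of rows.
LAYERS = [
    ("source_extract", lambda rows: rows),
    ("filtering",      lambda rows: [r for r in rows if r["amount"] > 0]),
    ("cleansing",      lambda rows: [{**r, "name": r["name"].strip()}
                                     for r in rows]),
]

def run(rows):
    """Run each layer in sequence, staging its output for restartability."""
    for name, step in LAYERS:
        rows = step(rows)
        (STAGING / f"{name}.json").write_text(json.dumps(rows))
    return rows

out = run([{"name": " Ada ", "amount": 5}, {"name": "Bob", "amount": -1}])
print(out)  # → [{'name': 'Ada', 'amount': 5}]
```

With one staged file per layer boundary, a failure in any layer can be rerun from the previous layer's parked output, which is exactly why the approach multiplies the number of staging areas to manage.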
