Q: What are the different data staging techniques?

Tom Haughey's Answer:

Here is a general set of options for different forms of data staging and staging requirements:

Files versus tables. Should the staging data be stored in files or in relational tables? Either one is workable. Where there is a lot of sorting, I use files.

Store and forward vs. direct database insert/update. Should the ETL'd data be directly applied to the data warehouse, or stored and later bulk loaded? This is a question of direct load or store and forward. In many cases, whether you update the data warehouse directly with changed or new data, or stage it into a temporary file/table, is not an integrity issue. I find better performance by storing the data in a staging table/file and then loading it in bulk, rather than directly applying the change to the DW.

Aggregate creation. Should you aggregate in ETL, during load or in the DW? Either one is possible, but my experience is that during ETL or after load into the DW are better choices. I do not like to aggregate during load because of some bad experiences with performance.

Change data capture. There are about six different ways to detect change data. See my April Ask the Experts (ATE) column. You will have to stage some warehouse data to aid you in detecting changed data. This staged DW data is usually a vertical subset of the DW (only the few needed columns).

Key management. How do you maintain natural keys and coordinate and control surrogate keys? Where you use surrogate keys, you will have to stage a mapping table with surrogate key-natural key correspondence. As new facts come in, they are looked up by their natural key and a surrogate key is substituted.

What servers. Does ETL happen on a separate server or on the target server? Usually not on the DW server.

ETL tools versus roll-your-own code. Should you use ETL tools or write custom code? Several ATE columns have addressed this in the past few months. Please refer to them.

Sort utilities versus ETL tools. Where large amounts of sorting are involved, should you use an ETL tool to do this or use a sort utility such as Syncsort? The same is true for aggregation.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access