Can we and should we adopt different data modeling schemas during the extraction, transformation and staging method? What is the consistency that needs to be maintained during this stage?


Scott Howard’s Answer: The schema you choose during the ETL process should be independent of the schema of both the source and target objects. ETL intermediate or staging structures should be ones that best support the task at hand, that is the movement and transformation of data from source to target. I always recommend that you transform the data into a common format for ETL as soon after extract as possible. This can sometimes be done with an extract utility and sometimes not because the extract should move the data from your source to an external file in as nondisruptive way as possible so as not to compromise source system SLAs. The most common file formats are flat file and relational – both are well known and boast many utilities and well-know techniques used for subsequent steps of the ETL process. The more detailed schema model depends on whether you need to perform a one-to-one, one-to-many or many-to-many mapping. You don’t have to worry about the final schema for targets until the final stages of the ETL flow. Most often the final schema is never physically materialized in the ETL flow, or it’s supporting staging structures. I teach a four-day course on this topic; therefore, our editor will not allow me to fully elaborate on all of the issues related to your question.

Joe Oates’ Answer: The ETL process, which normally takes place in a staging area not the data warehouse, does not require data modeling schemas in the sense that a data warehouse or operational system requires data modeling.

The input to the ETL process is a set of legacy system files/tables or their flat-file equivalent. These will have been mapped to the target data warehouse tables and columns. The output of the ETL process is usually as set of flat files formatted exactly like the data warehouse tables. I say usually flat files, because these can be fed into a bulk loader to greatly reduce the time needed to populate the data warehouse tables.

The non-key fields from the legacy systems can generally be moved to the equivalent data warehouse table and column combination with little or no processing required. The legacy system keys must be translated to data warehouse keys, which are generally surrogate keys.

To describe this process, I will assume an ETL tool that uses a database system rather than writing the ETL code in a 3 rd generation language or PERL. There should be a separate cross-reference table for each data warehouse dimension table (customer, product, etc.). These cross- reference tables consist of columns that form a unique identifier for the legacy system file/table. The columns required for this are the legacy system name or mnemonic, the legacy system file/table name or mnemonic and the particular legacy key, which may be a single column or a concatenated key. The cross- reference is completed by having a single column for the data warehouse dimension table surrogate key. The legacy system unique identifier is indexed for fast lookup.

When legacy system data for the data warehouse dimension tables are populated, then take the transactions from the legacy systems and transform the legacy keys in the transaction (customer, product, etc.) to data warehouse keys using the same cross-reference tables.

Clay Rehm’s Answer: I would suggest keeping your data models as consistent as possible. It sounds like you want to have a ODS type model for your ETL process which makes sense in certain cases. You may want to question why would you want to adopt different data modeling schemas and what would the benefit be?

Chuck Kelley’s Answer: During the ETL process and the staging areas, I think whatever best suits the process is the right thing. Mostly, I think it is a relational (normalized) structure. However, if you are processing massive numbers of rows, maybe a flat file system would be more appropriate.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access