Data Wrangling
Kimball Perspectives
Information Management Magazine, January 2008
In the first two parts of this series, we got ready to start the design of an enterprise data warehouse. First we carefully took stock of all the simultaneous requirements and constraints facing the designer. Then we reminded ourselves of the appropriate boundaries between our responsibilities and those of the other business intelligence (BI) stakeholders.
Advertisement
In this third column, we are ready to design the first stage of the data pipeline leading from the operational sources to the final BI user interfaces. I call this stage data wrangling because we must lasso the data and get it under our control. Successful data wrangling includes change data capture, extraction, data staging, archiving and the first step of data warehouse compliance. Lets examine these narrow data wrangling responsibilities.
We should limit the amount of data we process in each data extract to the bare minimum. We try not to download complete copies of source tables, but sometimes we must.
Limiting the data extract to the bare minimum is a fascinating challenge and can be harder than it appears. The first architectural choice is whether to perform change data capture on the production source computer or after extraction to a machine owned by the data warehouse. From the data warehousing (DW) point of view, the more attractive alternative is finding the changes at the production source. For this you need cooperation from the production source database administrators (DBAs), adequate processing resources on the production machine and a very high quality scheme for identifying 100 percent of the changes that have occurred since the previous load.
To design the change data capture system on the production source, you need to have a very candid conversation with the production system DBAs. You need to identify every situation in which a change to source data could happen. These include normal applications posting transactions, special administrative overrides and emergency scenarios, such as restoration of a data set.
One popular way to look for source data changes is to query a Change_Date_Time field in the source table. This is a pretty strong approach if this field is populated by database triggers that are not circumvented by any process. But many production applications prohibit the use of triggers for performance reasons. Also, how does such an approach handle record deletes? If the record simply vanishes, then you wont find it by querying the Change_Date_Time field. But maybe you can collect the deletes in a separate feed.
Another approach is a special production system daemon that captures every input command by reading the production system transaction log or by intercepting message queue traffic. The daemon approach solves the delete problem but is still vulnerable to special administrative override commands performed manually by the DBAs. Some of you may think such overrides are crazy, but I have seen some very well-run shops resort to doing these overrides occasionally because of weird business rules that are simply too complicated to program into the normal transaction processing applications.
If you have figured out an acceptable scheme for isolating all data changes at the source, you still need to ask for one more favor, if you have any political capital left with the source system DBAs. You need to get a Reason Code for all changes to the major dimensional entities, such as Customer or Product. In dimensional modeling parlance, these Reason Codes will tell you whether the change to an individual dimensional attribute should be treated as Type 1, 2 or 3. These distinctions are a big deal. The extract, transform and load (ETL) pipelines required to process these three slowly changing dimension (SCD) choices are completely different. For more information, visit http://www.kimballgroup.com/ and search the article archive for SCD.
If your production system presents too many objections, consider doing change data capture after extraction. Now you must download much larger data sets, perhaps complete dimensions or even complete fact tables. But you are guaranteed to find every change in the source system, as long as you keep a complete prior copy of the source system tables against which to compare.
Page 1 of 2.






