In the first two parts of this series, we got ready to start the design of an enterprise data warehouse. First we carefully took stock of all the simultaneous requirements and constraints facing the designer. Then we reminded ourselves of the appropriate boundaries between our responsibilities and those of the other business intelligence (BI) stakeholders.
In this third column, we are ready to design the first stage of the data pipeline leading from the operational sources to the final BI user interfaces. I call this stage data wrangling because we must lasso the data and get it under our control. Successful data wrangling includes change data capture, extraction, data staging, archiving and the first step of data warehouse compliance. Lets examine these narrow data wrangling responsibilities.
We should limit the amount of data we process in each data extract to the bare minimum. We try not to download complete copies of source tables, but sometimes we must.
Limiting the data extract to the bare minimum is a fascinating challenge and can be harder than it appears. The first architectural choice is whether to perform change data capture on the production source computer or after extraction to a machine owned by the data warehouse. From the data warehousing (DW) point of view, the more attractive alternative is finding the changes at the production source. For this you need cooperation from the production source database administrators (DBAs), adequate processing resources on the production machine and a very high quality scheme for identifying 100 percent of the changes that have occurred since the previous load.
To design the change data capture system on the production source, you need to have a very candid conversation with the production system DBAs. You need to identify every situation in which a change to source data could happen. These include normal applications posting transactions, special administrative overrides and emergency scenarios, such as restoration of a data set.
One popular way to look for source data changes is to query a Change_Date_Time field in the source table. This is a pretty strong approach if this field is populated by database triggers that are not circumvented by any process. But many production applications prohibit the use of triggers for performance reasons. Also, how does such an approach handle record deletes? If the record simply vanishes, then you wont find it by querying the Change_Date_Time field. But maybe you can collect the deletes in a separate feed.
Another approach is a special production system daemon that captures every input command by reading the production system transaction log or by intercepting message queue traffic. The daemon approach solves the delete problem but is still vulnerable to special administrative override commands performed manually by the DBAs. Some of you may think such overrides are crazy, but I have seen some very well-run shops resort to doing these overrides occasionally because of weird business rules that are simply too complicated to program into the normal transaction processing applications.
If you have figured out an acceptable scheme for isolating all data changes at the source, you still need to ask for one more favor, if you have any political capital left with the source system DBAs. You need to get a Reason Code for all changes to the major dimensional entities, such as Customer or Product. In dimensional modeling parlance, these Reason Codes will tell you whether the change to an individual dimensional attribute should be treated as Type 1, 2 or 3. These distinctions are a big deal. The extract, transform and load (ETL) pipelines required to process these three slowly changing dimension (SCD) choices are completely different. For more information, visit http://www.kimballgroup.com/ and search the article archive for SCD.
If your production system presents too many objections, consider doing change data capture after extraction. Now you must download much larger data sets, perhaps complete dimensions or even complete fact tables. But you are guaranteed to find every change in the source system, as long as you keep a complete prior copy of the source system tables against which to compare.
If you download a complete source table today, you can find all the changes by performing a record by record and field by field comparison against a copy of yesterdays source table. You will indeed find every change, including the deletes. But in this case, you are probably missing Reason Codes for dimensional attribute changes. If so, you may need to impose unilaterally a Reason Code policy crafted for each attribute. In other words, if the Package Type of an existing product suddenly is changed, then you could always assume that Manufacturing is correcting a data error, and hence the change is always Type 1.
If the table you are comparing is very large, the brute force approach of comparing each field can take too long. You can often improve this comparison step by a factor of 10 using a special hash code, called a cyclic redundancy checksum (CRC). For a discussion of this advanced technique, see the Kimball Group article archive previously mentioned. Search for CRC.
Finally, even if you are sure you have accounted for 100 percent of the source system changes, you should periodically check the DW totals against totals computed directly on the source. This is like balancing your checkbook when you have to manually investigate a discrepancy between the two data sets.
Extraction, whether it occurs before or after change data capture, is the transfer of data from the source system into the DW/BI environment. Besides actually moving the data, you have two main responsibilities in this step. First, you need to rid yourself of all narrowly proprietary data formats during the transfer itself. Change EBCDIC character formatting to ASCII. Unravel all IBM mainframe data formats (e.g., packed decimals and OCCURS statements) into standard relational database management system table and column formats. I also suggest unraveling XML hierarchical structures at this stage, although perhaps at some point XML structures will be fully supported at a semantic level by relational databases.
Your second responsibility is to direct the flow of incoming data either into simple flat files or relational tables. Both choices are equally valid. You can process flat files very efficiently with sort utilities and sequential processing commands like grep and tr. Of course, you will eventually load everything into relational tables for joining and random access operations.
I recommend immediately staging all data received by the DW/BI system. In other words, save the data you just received in the original target format you have chosen before you do anything else to it. I am very conservative. Staging the extracted data means keeping it forever, either offline or online. Data staging is meant to support all the types of backup.
A special form of archiving serves as an important step when you are forced to deal with compliance-sensitive data: proving that the data you received has not been tampered with. In this case, the data staging is augmented with a strong hash code that you use to show that the data has not changed. You should also write this staged data and hash code to permanent media and store this media with a bonded third party who can verify that the data was sent to them on a certain date.
Now that you have wrangled the data into your DW/BI environment, it is time to tame the beast. In upcoming articles, well work through a number of techniques for making the data support decision-making by end users.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access