Traditionally, the staging area in your warehouse is pretty straightforward. A little bit of trimming and formatting, then it is off to the races on the real part of the load, right? During a recent conversation with a former client, we spoke of some considerations that went into architecting the staging area of their data warehouse project. These considerations were not technically complex, but they added significant value. We spoke of the real-world value provided by the choices we made for their application's staging area. Back before disk was so inexpensive, and before partition swapping and other data definition language commands made it fast and easy to move large amounts of data (provided there is little or no transformation), some of these things were not even considered because of the attendant cost. This column shares some of the features that have more lately added real-world value to staging exercises.
I've read arguments and had conversations with people who find value in a persistent staging area and others who think persistent staging is a waste of disk space and processing power. Those in favor talk about the ease of access to source data for reprocessing and data investigation. Those opposed generally have space concerns or small loading windows to contend with. In the latter case, anything that is not mission-critical gets left out.
I have found great value in a persistent staging area, and I think there are a few minor considerations that can make this choice very useful. First, determine if it makes sense to use a persistent staging area based on the data you have. If your system runs a complete refresh every night, there's probably little reason to continue to store each file in the database. If there is an issue with data that requires a reload, you would most likely need a new file anyway, and the last file would be in the regular staging area. If there was a logic issue that required a reload, the refresh would be accomplished the next time the extract, transform and load process is executed.
Persistent staging areas make sense when the data is transactional and incremental in nature. However, if a goal of your persistent staging area is to facilitate reloading of data from the staging area, then you need to consider the underlying database structure for the tables. If you're using a relational database that offers a partitioning and partition swapping option, it would make sense to partition at the grain of a single load or batch. Partition swapping can be used to quickly move data from the persistent stage area to the traditional staging area - just be careful not to lose the persisted data in the swap!
Many source systems lack the ability to track changes to source data. Using the staging area to identify deltas can be a great way to filter records early in the load and improve processing speeds by not dealing with unnecessary data. In doing this, you can gain control and add system-specific rules for your delta process, rather than find yourself subject to the sometimes generic delta rules applied in the source. A staging area can also act as your enforcer when source systems lack controls to enforce data integrity rules.
One particularly tricky problem solved by using the staging area had to do with a source system that passed future dated transactions. This source system let users enter events that had not yet happened and allowed those events to be changed before they had taken place. The users did not want to report on these events until they had actually happened, so the staging area served as a great gatekeeper.
Another useful aspect of the persistent staging area is the ability to quickly tie the warehouse or data mart back to the source. Often there are questions about how a record looked when it was passed to the warehouse. During loading to the persistent stage area, we can tag each record with a numeric value that can be carried to the target. This value provides an association from the target area back to the source record and makes it very easy to compare data values pre- and post-ETL load.
In addition to making comparisons easier, the numeric value can provide a simple link for use in data patching when a code problem has caused an issue. Sometimes reloading is the best option (which the persistent staging area facilitates with a partition swap); other times a data patch is more appropriate (and the numeric value makes it easy to find the corresponding row).
The business of staging data ETL jobs is typically very straightforward, but architecture and design can add lots of value when given the proper attention. The time spent on design and architecture from a production support standpoint and taking into consideration what happens when things don't run as planned can be important parts of creating an easily maintainable system with high data accuracy.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access