William wishes to thank Mike Cross (email@example.com), data warehouse director at Rent-A-Center, for his contribution to this month's column.
This is the fifth and final installment in a series of articles on data warehousing concepts and best practices. In this column, we will address slowly changing dimensions (SCDs) and an alternative to the standard styles and the use of a semaphore table.
Slowly Changing Dimensions
SCDs are one of the key benefits of doing data warehousing. The ability to selectively cast transactions perfectly into their historical context has provided extraordinary benefits. At Rent-A-Center (RAC), we use them for important columns of the customer, product, contract, store, brand and model tables, among others. Because we need to trace historical information about our dimensions (such as the catalog purchase price for models or customer address at time of purchase), simple replacement of historical values with current values was unacceptable for most columns. However, the same style does not have to apply to each column in each dimension table.
Figure 1: Example of Timestamp Key and a Surrogate ID
Styles that create a new surrogate key when values change would have created operational problems and challenges. This would mean that every time the purchase price of a model changed, that model would get a completely new surrogate key. This works fine for new models, but when we wanted to aggregate inventory data over time to the model level, we would need to resolve the model information back through the historical surrogate keys. For RAC, this defeated the purpose of having a surrogate key. Consider that we also want the same key for International Business Machines; IBM; IBM, Inc.; etc. as well as in situations where company names change in order to trace their evolution.
To resolve these issues, we retain the same surrogate identifier as a substitution for the natural key and treat SCDs like we treat states (see the March 2007 column on states and events). That is, we add a timestamp to the key in addition to the surrogate identifier (as shown in Figure 1). This allows us to maintain the history of an item without changing its surrogate identifier. Thus, a single surrogate identifier will always represent the natural key (though it may or may not be natural), but the dimension table will also contain a complete history of the natural key - the primary benefit of the SCD. Voila, the best of both worlds.
In the example, it is important for RAC to understand what John Doe's address was for any transactions between 2007-01-03 and 2007-02-09. However, as a derived value that we will never cast historically, we would simply update the customer lifetime value column in place for the current (i.e., null end timestamp) record for John Doe.
This type of SCD does not come without a cost, however. Because we do not use the timestamp as part of the foreign key in fact tables and other dimension tables, we must always be cognizant of the time element when joining tables or we will get a one-to-many result or many-to-many result when dimensions reference other dimensions. If you are only interested in the latest natural key, then joining to the dimension where the end date is null will suffice.
For those transaction queries that need customer address at the time of the transaction, we have to take the transaction timestamp and go between start and end timestamps for customer ID 123. For those queries that do not, we simply join the transaction table to the customer table on customer ID 123 where end timestamp in customer table is null. Another option is to include an active flag for the current version when you are mixing all the historical records in a single table. When the state of the dimension is important, then a between operation will be necessary to select the appropriate version of the surrogate key. These are small prices to pay, however, for a surrogate key that is valid over time and still maintains a complete history.
A semaphore table is one of those extract, transform and load (ETL) artifacts that few start out employing, but eventually everyone uses as their warehouse matures and the need for more data marts grows. It is part of operational metadata. A semaphore table is a database table that contains entries that signal the start and end of certain processes, such as the loading of a specific data type, the building of a data mart or the running of quality control scripts. A semaphore table is used to provide handshaking between and within processes without having to have all processes run sequentially. We started our data warehouse ETL processes sequentially. We would process and load raw data, do some end-load processing, build several data marts, run some quality control scripts on the load and on the marts and then wait for the next load. Many of these processes, however, can be run in parallel to more quickly provide data to end users and to more efficiently use systems and batch windows. Separate scheduling of processes also allows for the right-time building of data marts. Some marts may need to be refreshed hourly, while others may only be built weekly. A semaphore table is necessary when one independent process is dependent upon another process or when the same process may step on itself if an iteration begins before the prior one finishes.
The semaphore table should consist of a process ID, an active state flag, a start time and an end time. If you plan on employing incremental builds that are time dependent, data begin and end dates and times are also very useful so that dependent processes know where to begin and end their incremental builds. Within each process, create a new record in the semaphore table once it is determined that the process will alter the data. Update the same record to show completion as the last step in the process. Note that a single ETL job may contain multiple processes. The first step in using a semaphore table is to determine the dependencies between processes. Once this is determined, simple tests at the beginning of each process can determine the necessity of continuing. For example, there is no sense in updating a data mart if none of the source data changed or if the source data ETL process is only partially complete.
By using a simple semaphore table, you can have independently executed dependent ETL processes that intelligently run based on previous runs and the execution of other processes. In addition, you have a great tool to monitor the execution of processes over time.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access