What historical insights does your data warehouse provide to users?
One of the more challenging business requirements a data warehouse project will have to address is user navigation of historical information. In many cases, providing business users access to historical measurement information remapped to a current corporate viewpoint is an acceptable method. In other cases, the company needs to see measurement information from a historical point of view. For some firms, there is a need to provide users the ability to look at information from both perspectives.
The information delivery component of the data warehouse must allow the users to choose either a current (recasting of history) or historical view of dimension table information for all reports and queries. This business requirement has several consequences on the data model design as well as the extract, transform and load (ETL) processes. For example, when a human resources head count report is run for a two-year period, the results vary depending on the chosen historical perspective. If the current view is selected, all head count is shown for employees based on their position in the organizational structure at present (e.g., John Smith in department C3C). If a historical perspective of the same report is requested, head count is broken down over a two-year period denoting employee movement in the organization (e.g., John Smith in department A1A, then B2B, currently in C3C).
This column is the first of a two-part series examining various development techniques that can be used to provide alternate views of historical measurement information in the warehouse. Data modeling and ETL process methods for addressing this business requirement will be explored in part two.
The challenge becomes how to keep track of dimension table columns over the course of time and also provide direct access to the latest version of the data. Initial iterations of these dimensional models often use standard data warehouse modeling techniques, such as slowly changing dimensions (SCD) type 2, in an attempt to address this business requirement.
SCD 2 is used to model a dimension table when changes to relevant columns need to be captured over time. This technique relies on the use of production or natural key(s) in the dimension table not changing. New surrogate keys are assigned to the dimension table when changes to relevant columns are detected during a batch load cycle. Comparisons are made between the previously loaded production key(s) of the dimension table and the new load cycle data. Changes to relevant columns in the new load cycle data for matching production keys are loaded with new surrogate key assignments. If operational meta data columns are incorporated into the design of the table, newly assigned surrogate keys can be identified as current while records previously loaded for a matching production key can be flagged as historical.
This technique solves half of our new business requirement by segregating history in the dimension. Queries run against a dimension table using the SCD 2 method will automatically associate the proper view of history to the corresponding fact table row through use of the surrogate keys on both tables. Straightforward structured query language (SQL) queries would produce fact information grouped, over time, according to the number of iterations the production key of the dimensions had gone through. This technique is not as practical when the business requirement is to ignore the historical changes made to a dimension table and show all facts associated to the most current view of the production key. An operational meta data flag on the table can easily identify the current rows in the dimension and ignore earlier perspectives. However, the fact table rows are still associated to the historical surrogate keys on the dimension. Simply constraining on the most current dimension rows and retrieving fact table information would result in incomplete or, worse, incorrect reporting results. Complex SQL or multipass SQL methods available in some information delivery products can be used to retrieve the current row information from the dimension table while still gathering all fact table rows associated to the dimension's production key. The ability to successfully use SCD 2 to provide current and historical views of information will depend greatly on the information delivery product's capabilities and the use of operational meta data in the data model.
Alternate design approaches that do not rely as much on the capabilities of the information delivery product and provide some added versatility will be examined in the next part of this series.
What historical data method does your firm use, if any? Send me an e-mail with the subject line: "Our Historical Data Method." I will discuss the results of this unofficial survey in a future column.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access