Free Site RegistrationFree Site Registration

Sign up today and access Information Management on the web!
Your FREE registration entitles you to:

FREE email newsletters

FREE access to all Information Management content

FREE access to web seminars, resource portals, our white paper library and more!

Retyping Slowly Changing Dimensions

InfoManagement Direct, May 2000

Todd Schraml

A multidimensional model, or star schema, is the standard technique for delivering data to end users in a data warehouse environment. The design consists of a fact table containing numeric and additive measures as well as foreign keys, linking that fact row to a number of separate dimension tables. The dimension tables contain textual or other descriptive items, usually serving as constraints on queries. Items brought together in a single dimension are meant, as a group, to describe some relevant business category to end users. This grouping of items into business categories naturally contains hierarchies of information. Each row in a dimension table has a one-to-many relationship to the fact table. Relatively, the fact table contains fewer columns than the dimension table but, at the same time, it has many, many more rows than the dimension table. Figure 1 shows an example of a generic star schema.

Advertisement

There are many advantages in using the star schema for data delivery to end users. The design is much simpler than most relational/third normal form designs. Star schemas relate well to what users can conceptualize in terms of spreadsheets and even as regression analysis formulae. Typically, users want the data they access to be in a single universe table containing everything they desire; and the star schema serves as a slightly normalized compromise to accomplishing that wish.

The standard paradigm of maintenance for sets of star schema tables requires that new entries to the fact table be simple inserts of additional transactions at a chosen level of detail (properly called granularity). While changes for the dimension table may be directly applied, those changes must be handled in a particular manner. The compelling reason for this distinction in approach is that changes to the dimension’s data must maintain an unaltered relationship between the fact and any given dimension table row. If this relationship were altered, updates to the fact table would be necessary to correct these relationships. Due to the great number of rows that can exist in the fact table, applying updates to the fact table can be problematic.

In his seminal books on multidimension design, Ralph Kimball proffered the concept of "slowly changing dimensions" to handle change within a dimension table and defined three approaches to implementation:

TYPE 1: Overwrite the old values in the dimension record with the new values, thereby losing the ability to track the old history.
TYPE 2: Create an additional dimension record using a new value of the surrogate key, thereby segmenting history between the old value and the new value.
TYPE 3: Create an "old" field in the dimension record to store the immediate previous attribute value, thereby being able to describe history both forward and backward from the change.

A suggestion for slightly revising the perspective on slowly changing dimensions is offered here with a redefinition of sorts. The intention of this redefinition is to focus on the result rather than the specific implementation. This altered perspective is stated as follows:

TYPE 1: A fact is associated with only the current value of a dimension column.
TYPE 2: A fact is associated with only the original value of a dimension column.
TYPE 3: A fact is associated with both the original value and with the current value of a dimension column.

And this revised view of slowly changing dimensions promotes a fourth domain:

TYPE 4: A fact is associated with the current value and the values that were current at some critical points in time (i.e., end of year, each year) of a dimension column.

Under these altered definitions, how we accomplish the fact-to-dimension associations is no longer the sole focus of the various types. The reason for this altered view is that people are creative. If we leave the type tied to the specific implementation, we then need to expand our type list every time someone uses their creativity. Besides, under the scenario of that ever- expanding type list, who’s going to maintain the official list so we don’t end up with 10 variations that are implemented by different creative individuals around the world, all of them designating their own version "Type 8?"

For example, suppose a dimension table is refreshed every cycle. Our processes may be so defined that we retain the integrity of the sequence number keys (or cheat and use the operational data values themselves as the key, as we are not supposed to be doing). With sequence number integrity retained we basically drop, rebuild and reload our dimension table every cycle. We have not overwritten anything, but from time period to time period our dimension data is indeed changing. Our facts are always associated with the current values of dimension columns. Nominally, this variation is not any one of the original three slowly changing dimension types. The closest association under the original type definitions would be to say, "Conceptually, it falls under the Type 1 umbrella, even though we aren’t actually updating." Under the new definition, it is clearly, totally and simply a Type 1 slowly changing dimension.

More creativity exists. In some implementations, dimension tables have been established with two-part keys. A sequence number identifying an occurrence, and a secondary sequence number identifying a version. The fact table contains both key parts. Joins using both parts provide the original value. Joins using just the first part, and secondarily adding in a current version indicator on the dimension table, provide the original dimension value. In this instance, new values cause the creation of new rows. Under the original type definitions, it means we have a Type 2 change. However, for this example, processing does not allow for a new unintelligent sequence number key to be generated. Processing must determine which things really are new things, and which are new versions of old things, resulting in a rather complicated process which allows for far greater opportunities to make mistakes along the way. This may also require extra lookups, necessitating more computer resources to process a change. Even though user querying must decide on one of two forms for joining to a dimension, a fact can be associated with both a current and with an original value (as shown in Figure 2). Consequently, the example scenario in actuality describes a Type 3 slowly changing dimension in Type 2’s clothing. If we simply say the change is Type 2 under the original definition, the implication is that only the original value is available in fact analysis, which is not true. If we say it is Type 3 under the original definition, it implies that we have distinct original and current columns in our individual dimension table rows, which is also not true.

Type 4, as previously defined, is quite similar in spirit to Type 3. The sole distinction is that Type 3 only allows for a current and original value to be associated with a fact. Type 4 allows for an infinite number of additional time-valued slices of dimension values to be retained. Because of this additional multiplicity, it is certainly worthy of its own recognition. Recently, Kimball has acknowledged that such circumstances can arise and has referred to this as "many alternate realities" with the implementation being a hybrid Type 2/3. Hopefully, the business reasons for doing this will be rare, but if a true reason does exist, it is incumbent on a data warehouse designer to attempt to satisfy that need. Under the original definition of Type 3, Kimball made efforts to explain that indeed such a column addition, to hold both an original and current value for an item, could only be done once. In reality, there is no DBMS restriction from adding in additional columns (unless, of course, you have reached your DBMS’s physical column limit.) Certainly, the concern here is to do such a thing only under very limited and extreme circumstances. You do not want to have unbounded column proliferation across your star schema. For example, suppose we have a star schema containing 25 years of data (yes, some future time where things are really wonderful). For our business customer information, there is a critical need to retain their standard industrial classification (SIC) codes over time. The government reworks SIC codes every 10 years for classifying types of business. We have columns on our customer dimension table containing a 2000 SIC code, a 2010 SIC code and a 2020 SIC code. Maybe we are even gearing up to add in a 2030 SIC code. As long as this column proliferation is approached judiciously, this approach to retaining historical change can be functional.

Remember the new type definitions extrapolated in this discussion are concerned with the functionality, not the detailed implementation, of the change protocol. Let’s look back at our last Type 3 example. If the second part of our two-part dimension key were a date, instead of a version sequence number, it would be possible for our fact to join to a dimension using the first part of the key and applying a range test against the second part of the key to retrieve the dimension’s value at any given point in time (assuming, of course, a DBMS optimizer capable of effectively performing such a process). Likewise, imagine taking a snapshot of one’s dimension at a given point in time – for example, year end every year – and retaining that snapshot as a separate table. Depending on the users’ analysis needs, users could join to a specific year’s version of the dimension table to gather the dimension’s values at any of the times for which the snapshot versions exist. In each of these cases, a fact can be associated with dimension values from multiple points in time. These examples would both demonstrate implementations of a Type 4 slowly changing dimension.

It is a column-by-column decision within a dimension table as to which items are to be handled via a Type 1, 2, 3 or 4 approach. To make this work, the combining of the processing needs has to reach the point where complexity exists in maintaining your dimension tables. A star schema, or multidimensional design, has become an important concept in the data warehousing world. Much of the power of the star schema is in its consistency of form. In a unique way, the star schema is its own normalized design. Likewise, language is important. The words we use do much to color our perceptions. It can be argued that, logically speaking, the new wording offered here does not really change anything. The intention is not to change the basic premises; the intention is to offer a refinement in detail. There is no need to insert an additional semantic layer that designers need to interpret. (A layer requiring designers to identify their implemented procedures as one type of slowly changing dimension or another based on which type can be abstracted to match with what they are doing rather than which type matches what they are doing.) Perhaps this refinement can expand our perspectives on examining and implementing slowly changing dimensions. The level of standardization offered within a star schema still has much that can be written to formalize the design process beyond what exists today. Part of Kimball’s goal in framing slowly changing dimensions surely is to prevent a proliferation of "creativity" while guiding efforts to a best- practice approach. Although his concerns are laudable, implementations show that greater variations have been put into practice and do function for organizations. There is no need to exclude these variations from explicit and legitimate coverage under the umbrella of handling change and maintenance within a star schema via slowly changing dimensions.

Todd Schraml is senior data architect and manager of ETL for Innovative Health Strategies. He can be reached at tschraml@ihsiq.com.

For more information on related topics, visit the following channels: