A source system that keeps its own history presents interesting problems when used to load a type 2 history-tracking dimension table. The problem I am referring to arose in an actual project where we had a source system (SAP) which was keeping its own history of changes to an entity in a comprehensive, audit-trail sort of way. It appeared to have been tracking every change to the entity in question. But, in the data warehouse, for the corresponding dimension we only wanted to track history for a subset of attributes, treating the rest as type 1 attributes. This article focuses on the initial load, not the continuing maintenance, of a dimension from a set of history-tracking source records.
Lets say that, in the source system (an HR application, for instance), we have the records shown in Figure 1.
In this table, EMPKEY is the "person key" or "person ID, " which is a surrogate key in the source system itself. Social Security NUmber, which is often thought of as a key, is a simple attribute. The primary key of the table is EMPKEY + VALID_FROM. A real HR system would obviously have more attributes than this, but, for purposes of this article, all we need are a mix of type 1 and type 2 attributes.
The table tells a tale. Jo, a female, living in Michigan, gets hired on 12/3/1998. On 12/27/1998, the HR staffers discover they entered her SSN incorrectly and change it; the source system automatically creates a new record, with a new VALID_FROM date of 12/28/2008. Seven years go by until, in April of 2005, Jo, whos always felt uncomfortable as a female, goes into the online HR portal and changes her first name to "Joe"; the source system dutifully tracks the change. On August 8, Joe finally gets her operation, and an HR staff member changes her > his sex from "F" to "M," and her > his name to "Joseph." On February 13, Joseph decides he really prefers to go by "Joe" again, until July 5, when he flips his name back to "Joseph" and transfers to the companys Big Apple office. On Christmas Eve, weary of the brusque and chilly streets of Manhattan, Joseph returns to Michigan. Jim or James, the other person in this table, has a simpler history: he changes from "Jim" to "James" on 3/16/2004 and then moves to Indiana on 6/23/2007. Hes apparently let go on 8/31/2007. (In this HR system, if your latest record doesnt have a NULL in "VALID_TO," it means youre terminated as of that date.)
The business users have stated that they dont care about tracking or reporting on historical changes to first names or SSNs, but they do care about sex and state. In data warehousing terms, FIRSTNAME and SSN are type 1 attributes, and SEX and STATE are type 2.
We are doing an initial load of our PERSON dimension. Again, we do not cover, in this article, how to deal with comparisons between a history-tracking, type 2-ish source and an already-populated dimension table.
Ultimately, our PERSON dimension table should look like Figure 2.
DIM_K is just a standard dimension table surrogate key; only those records for the two people in question are displayed. Note that where we have a currently active record with no "natural" or source end date (the employee is still employed, the person is still alive, etc.), we enter a fake "high date" to make querying and joins easier. (This sort of thing will, doubtless, cause a massive "Year 9999 Problem.") Our PERSON dimension table tracks history for those attributes we care about, but washes out the history for the type 1 attributes. To get to this blissful state, we need to sift out the type 1 changes from the type 2 and only make new dimension records where we have a type 2 change.
In what follows, well stick to fairly simple set logic. Well use no cursors or other looping-type constructs. While an ETL tool like Informatica or DataStage would be handy, any of this could be done fairly easily in straight SQL. Well also keep all the steps simple, easy to understand and discrete. Its possible to create enormous heaps of nested SQL to do everything in one statement, but it's best to keep everything understandable. We create and label "tables," but whether you choose to actually create real database tables, or the tables are just record sets inside a flow, the process remains essentially unchanged.
Our first step is to order and number our records. Depending on what system youre using, there will be any number of ways to do this, but start by ordering your source records by EMPKEY, VALID_FROM and add a row numbering RANK column (see Figure 3).
Next, well need the earliest and latest records for each natural key. The earliest ones will form a sort of "base" for future joins; the latest ones carry the values well use for the type 1 attributes. (The business rule here is that the latest records hold the type 1 attributes to use; other rules would demand different coding.)
src_ranked s inner join (select key, min(rank) r from src_ranked) mins
on s.rank = mins.r
src_ranked s inner join (select key, max(rank) r from src_ranked) maxes
on s.rank = maxes.r
The existence of this NK_LATEST table will serve us well later on.
Next, lets make a table of the type 2 columns only. Within these well need to sift out the records for a given natural key where theres no change.
In the table in Figure 6, the columns values that will form the VALID_FROMs in our dimension are highlighted in red, and those that will supply the VALID_TO are blue.
For our next trick, inner join this table to itself, each record to the record one above it, in rank, where the key (natural key) is the same and where theres something different about the type 2 attributes:
type2_cols t2c1 inner join type2_cols t2c2
on t2c1.rank + 1 = t2c2.rank
and t2c1.key = t2c2.key
t2c1.sex <> t2c2.sex or t2c1.state <> t2c2.state
To view Figure 7 see PDF below.
We can see that were very close to what were looking for in the VALID_TO1 and VALID_FROM2 columns. But what we want (leaving out the irrelevant columns) is shown in Figure 8.
To get there, do a quasi-pivot of TYPE2_CHANGES, like this:
select * from
order by rank
Add the handy-dandy, presaved NK_EARLIEST and NK_LATEST records back in, and order by rank:
select * from
select * from type2_c_piv
order by rank, valid_date
Put an ordinal or row number on these rows again:
The odd-numbered records contain the VALID_FROM dates, and the next-higher-numbered row contains the VALID_TO dates. Join em, Dano. The following assumes the availability of a mod (modulus) operator to find the odd-numbered rows; there are certainly other ways to find odd numbers.
type2_c_all_ranked t1 inner join type2_c_all_ranked t2
on t1.new_rank + 1 = t2.new_rank
mod (t1.new_rank,2) = 1
The remainder is trivial. Join to the "latest" records on the natural key to pick up the type 1 dimension, and, while its handy, set the current flag:
Add_T1 ("add type 1")
pd.key as nat_k
,case when pd.valid_from = l.valid_from then Y else N end as current_flag
pre_dim pd inner join nk_latest l
on pd.key = l.key
Then, generate surrogate keys for your target dimension table and load it.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access