A source system that keeps its own history presents interesting problems when used to load a type 2 history-tracking dimension table. The problem I am referring to arose in an actual project where we had a source system (SAP) which was keeping its own history of changes to an entity in a comprehensive, audit-trail sort of way. It appeared to have been tracking every change to the entity in question. But, in the data warehouse, for the corresponding dimension we only wanted to track history for a subset of attributes, treating the rest as type 1 attributes. This article focuses on the initial load, not the continuing maintenance, of a dimension from a set of history-tracking source records.

Let’s say that, in the source system (an HR application, for instance), we have the records shown in Figure 1.

In this table, EMPKEY is the "person key" or "person ID, " which is a surrogate key in the source system itself. Social Security NUmber, which is often thought of as a key, is a simple attribute. The primary key of the table is EMPKEY + VALID_FROM. A real HR system would obviously have more attributes than this, but, for purposes of this article, all we need are a mix of type 1 and type 2 attributes.

The table tells a tale. Jo, a female, living in Michigan, gets hired on 12/3/1998. On 12/27/1998, the HR staffers discover they entered her SSN incorrectly and change it; the source system automatically creates a new record, with a new VALID_FROM date of 12/28/2008. Seven years go by until, in April of 2005, Jo, who’s always felt uncomfortable as a female, goes into the online HR portal and changes her first name to "Joe"; the source system dutifully tracks the change. On August 8, Joe finally gets her operation, and an HR staff member changes her > his sex from "F" to "M," and her > his name to "Joseph." On February 13, Joseph decides he really prefers to go by "Joe" again, until July 5, when he flips his name back to "Joseph" and transfers to the company’s Big Apple office. On Christmas Eve, weary of the brusque and chilly streets of Manhattan, Joseph returns to Michigan. Jim or James, the other person in this table, has a simpler history: he changes from "Jim" to "James" on 3/16/2004 and then moves to Indiana on 6/23/2007. He’s apparently let go on 8/31/2007. (In this HR system, if your latest record doesn’t have a NULL in "VALID_TO," it means you’re terminated as of that date.)

The business users have stated that they don’t care about tracking or reporting on historical changes to first names or SSNs, but they do care about sex and state. In data warehousing terms, FIRSTNAME and SSN are type 1 attributes, and SEX and STATE are type 2.

We are doing an initial load of our PERSON dimension. Again, we do not cover, in this article, how to deal with comparisons between a history-tracking, type 2-ish source and an already-populated dimension table.

Ultimately, our PERSON dimension table should look like Figure 2.

DIM_K is just a standard dimension table surrogate key; only those records for the two people in question are displayed. Note that where we have a currently active record with no "natural" or source end date (the employee is still employed, the person is still alive, etc.), we enter a fake "high date" to make querying and joins easier. (This sort of thing will, doubtless, cause a massive "Year 9999 Problem.") Our PERSON dimension table tracks history for those attributes we care about, but washes out the history for the type 1 attributes. To get to this blissful state, we need to sift out the type 1 changes from the type 2 and only make new dimension records where we have a type 2 change.

In what follows, we’ll stick to fairly simple set logic. We’ll use no cursors or other looping-type constructs. While an ETL tool like Informatica or DataStage would be handy, any of this could be done fairly easily in straight SQL. We’ll also keep all the steps simple, easy to understand and discrete. It’s possible to create enormous heaps of nested SQL to do everything in one statement, but it's best to keep everything understandable. We create and label "tables," but whether you choose to actually create real database tables, or the tables are just record sets inside a flow, the process remains essentially unchanged.

Our first step is to order and number our records. Depending on what system you’re using, there will be any number of ways to do this, but start by ordering your source records by EMPKEY, VALID_FROM and add a row numbering RANK column (see Figure 3).

Next, we’ll need the earliest and latest records for each natural key. The earliest ones will form a sort of "base" for future joins; the latest ones carry the values we’ll use for the type 1 attributes. (The business rule here is that the latest records hold the type 1 attributes to use; other rules would demand different coding.)

NK_EARLIEST

select

s.*

from

src_ranked s inner join (select key, min(rank) r from src_ranked) mins

on s.rank = mins.r

NK_LATEST

select

s.*

from

src_ranked s inner join (select key, max(rank) r from src_ranked) maxes

on s.rank = maxes.r

The existence of this NK_LATEST table will serve us well later on.

Next, let’s make a table of the type 2 columns only. Within these we’ll need to sift out the records for a given natural key where there’s no change.

TYPE2_COLS

select

rank

,key

,sex

,state

,valid_from

,valid_to

from

src_ranked

order by

rank

In the table in Figure 6, the columns values that will form the VALID_FROM’s in our dimension are highlighted in red, and those that will supply the VALID_TO are blue.

For our next trick, inner join this table to itself, each record to the record one above it, in rank, where the key (natural key) is the same and where there’s something different about the type 2 attributes:

TYPE2_CHANGES

select

t2c1.rank rank1

,t2c1.key key1

,t2c1.sex sex1

,t2c1.state state1

,t2c1.valid_from valid_from1

,t2c1.valid_to valid_to1

,t2c2.rank rank2

,t2c2.key key2

,t2c2.sex sex2

,t2c2.state state2

,t2c2.valid_from valid_from2

,t2c2.valid_to valid_to2

from

type2_cols t2c1 inner join type2_cols t2c2

on t2c1.rank + 1 = t2c2.rank

and t2c1.key = t2c2.key

where

t2c1.sex <> t2c2.sex or t2c1.state <> t2c2.state

To view Figure 7 see PDF below.

We can see that we’re very close to what we’re looking for in the VALID_TO1 and VALID_FROM2 columns. But what we want (leaving out the irrelevant columns) is shown in Figure 8.

To get there, do a quasi-pivot of TYPE2_CHANGES, like this:

TYPE2_C_PIV

select * from

(

select

rank1 rank

,key1 key

,sex1 sex

,state1 state

,valid_to1 valid_date

from

type2_changes

union

select

rank2 rank

,key2 key

,sex2 sex

,state2 state

,valid_from2 valid_date

from

type2_changes

) s

order by rank

Add the handy-dandy, presaved NK_EARLIEST and NK_LATEST records back in, and order by rank:

TYPE2_C_ALL

select * from

(

select * from type2_c_piv

union

select

rank

,key

,sex

,state

,valid_from valid_date

from nk_earliest

union

select

rank

,key

,sex

,state

,IsNull(valid_to,’12/31/9999’) valid_date

from nk_latest

)

order by rank, valid_date

Put an ordinal or row number on these rows again:

The odd-numbered records contain the VALID_FROM dates, and the next-higher-numbered row contains the VALID_TO dates. Join ‘em, Dano. The following assumes the availability of a mod (modulus) operator to find the odd-numbered rows; there are certainly other ways to find odd numbers.

PRE_DIM

select

t1.key

,t1.sex

,t1.state

,t1.valid_date ‘valid_from’

,t2.valid_date ‘valid_to’

from

type2_c_all_ranked t1 inner join type2_c_all_ranked t2

on t1.new_rank + 1 = t2.new_rank

where

mod (t1.new_rank,2) = 1

The remainder is trivial. Join to the "latest" records on the natural key to pick up the type 1 dimension, and, while it’s handy, set the current flag:

Add_T1 ("add type 1")

select

pd.key as ‘nat_k’

,l1.first_name

,l1.ssn

,pd.sex

,pd.state

,pd.valid_from

,pd.valid_to

,case when pd.valid_from = l.valid_from then ‘Y’ else ‘N’ end as ‘current_flag’

from

pre_dim pd inner join nk_latest l

on pd.key = l.key

Then, generate surrogate keys for your target dimension table and load it.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access