As applied to data warehousing, a bridge table relates a fact record to one or more dimension records.If a given fact only ever relates to one dimension record, you don’t need a bridge table.Where a fact has a one-to-many relationship with a dimension, or many-to-many, you may choose to handle that with a bridge table.

 

The bridge table approach described in this article identifies and numbers unique sets of dimensions and then reuses those sets, treating the set identifier as itself a sort of dimension key, which is put directly in the fact table.It’s elegant and is particularly applicable to situations where you don’t need a weighting factor but, rather, all the analysis is impact analysis (i.e., the full amount of a metric is counted against all the applicable dimensions). However, the extract, transform and load (ETL) is a bit complex.

 

An alternative approach that is better in most situations is to use a traditional associative table, just like you normally would in a transactional system.Assuming you have a fact table with a primary key “FK” and a dimension table with key “DK,” you simply create records with the pairs of keys as shown in Figure 1.

 

 

If the fact table’s primary key contains multiple columns, put all the columns in there.You can add a “weighting” column to this, as in Figure 1, to avoid double-counting (unless you always want to double-count); when joining the fact through the bridge to the dimension, duplexing your fact records, you generally multiply the metrics by the weighting factor.

 

The admittedly minor issues with the associative table approach are:

 

  1. It requires that the primary key of the fact be put in the associative table.Unless you put a surrogate key on your fact tables, like you do your dimension tables, a fact table primary key may be a large collection of columns, perhaps even the collection of all foreign keys.If your fact table primary key is large, then your associative table will be large.Again, this particular issue goes away with surrogate keys on your fact table.
  2. The associative table will contain at least one record for each fact.Specifically, it will hold a number of records equal to the number of fact records times the average number of dimensions per fact.If you have 10,000 facts each related to both “red” and “orange,” then you’ll have 20,000 associative table entries.If space is a problem, this might also be a problem.The approach described in this article reuses sets of dimensions and will result in much smaller bridge tables (at the cost of more complex ETL).
  3. The traditional associative table does not easily allow you to predefine acceptable combinations of dimensions.In a situation when you want to predefine such combinations, the model described in this article may be better.

Another way to handle a one-to-many relationship between a fact and a dimensional attribute is to make a delimited string of all the applicable dimensional attributes (assuming they are, or can be converted to, strings) in one fact table column.For instance, if one of your dimensional attributes is “color,” then you could make a “Colors” attribute as some longish VARCHAR() data type directly in your fact table, and populate it something like Figure 2:

 

 

Notice the need to delimit the beginning and end of the entire string with whatever delimiter character you like.The queries you write against this sort of structure are easy enough:

 

select *

from fact

where colors like ‘%/seafoam green/%’ or colors like ‘%/red/%’

 

to get the facts of either color, or

 

select *

from fact

where colors = ‘/seafoam green/’

 

to get all those where the only color is “seafoam green.”

 

But easy queries do not always a happy warehouse make.With a concatenated string approach:

  • You can’t bake data quality into the warehouse database itself. The domain of acceptable values (“red,” “seafoam green,” “violet,” etc.) has to be managed inside ETL, perhaps by checking incoming color values against a “Colors” lookup table inside the staging area, but certainly not as a foreign key constraint to a real “Colors” dimension nor as a check constraint on the column (unless you wanted to write a constraint of all the different permutations of the dimension values, which is only practical if you’ve got just a few values).
  • Dealing with situations where your delimiter of choice shows up inside a dimensional value can be a drag (e.g., “sort of red/orange” as an incoming color attribute.)
  • You can’t easily add attributes about the attributes.You can’t, for instance, say whether a color is warm, cool or cold, track color names in different languages or divide them into “standard” and “oddball” bins.
  • You can’t flag that a particular color is the primary color for a fact or assign weighting factors.
  • Compared with a true bridge table approach, it wastes database space.“Seafoam green” in UNICODE repeated a million times will burn a little volt 200MB.
  • How you actually group a metric by a given color, or all the various colors, isn’t obvious.How does one do a count of the number of records by color?Do I count the records with “/red/green/” and “/red/” as two for red and one for green?Or one for /red/green/ and one for /red/?

In short: I wouldn’t do this.I wouldn't make a concatenated string of dimension keys, either, for many of the same reasons.

 

Let's move on to our set-based approach. We have a many-to-many relationship between a fact and a dimension.Our dimension table, for as an example colors, would look something like Figure 3.

 

 

A set-based bridge table would look, then, like Figure 4.We’ll add a “NUMBER_IN_SET” column to track the number of different rows contained in a given set.

 

 

Color set 1 is for those facts which are red and white.Color set 2 is white only, and set 3 is for red, yellow and white.To relate a fact to a color set, you simply put COLOR_SET_K in the fact table as a foreign key to the bridge.The primary key of the bridge table is the foreign key to the dimension table plus the set key.

 

Note: Depending on your modeling tool and/or database management system (DBMS), you may not be able to model the oddball foreign key relationship from the fact to the bridge.Some systems and tools will insist that you make the set key alone (COLOR_SET_K in the above example) the primary key to be able to make a proper foreign key relationship.If so, you just need to work around it.If you’re typically keen on declaring referential integrity (RI) in your database and the nasty thing won’t let you, you can always implementRI with triggers.If you have a religious problem with triggers, you’ll need to trust to your ETL code for data integrity.

 

In our bridge table, we’ll create a unique “group” or “set” key for every distinct combination of colors.You could do this on the fly, only creating those sets which appear in your data, or you could prepopulate your bridge table with all the permutations of your dimensions.The latter approach is nice – all your bridge table sets are there as you load your fact table -but if you’ve got lots of dimensional values (for example, medical diagnoses attached to a “patient visit” fact record), the set of permutations can get out of hand for a given maximum number of dimensions.Plus, it's likely that many of the sets will never be used; it’s uncommon to see “nail through the head” and “strep throat” diagnosed at the same time. On the other hand, if you want to enforce that only certain dimension combinations are acceptable, then prepopulating your bridge table with those sets is a good idea. This article explains how to create bridge entries on the fly for only those combinations actually appearing in the incoming data.

 

Finding Unique Sets

 

Let’s create our bridge table entries.Assume that you are already to the point in your ETL of having identified for any given natural (source system) key the set of dimension records that apply, and you have the surrogate keys to the dimension table in hand.In Figure 5, NK = “natural key” and DK = “dimension key.”“Natural key” can be a source system or natural key, or a surrogate key on your fact table.In either case, it represents the handle to the fact which has a relationship to multiple dimensions.What you have, so far is shown in Figure 5.

 

 

In the formation of bridge records, we want to identify and number the unique groups of dimensions.Reordering Figure 5 by both columns makes it a little easier to see that we actually have only five distinct sets of DK values, as shown in Figure 6.

 

 

NK’s 2 and 98 share the DK setof [83].NK’s 4 and 53 share the set [12, 23].NK’s 17 and 102 use the set [3, 24, 192].NK 11 does not share its set of DK values with any other NK.NK 25 shares the first two of its set (3 and 24) with others (NK’s 17 and 102), but the third in its set is different, which means that 25 has its own unique set [3, 24, 988].

 

The question is how you get from the initial set of pairs of natural keys and dimension keys to the identification of unique sets of dimensions, the creation of bridge keys for those sets and then population of the bridge table.Brute force approaches involving cursors or the declaration of a maximum number of dimensions per natural key and reverse pivot operations are inflexible, crude and slow, and are not considered here.We'll use set logic. The approach accommodates any sort of natural key, does not use “smart” or concatenated keys, is not concerned with the number of dimensions that may ultimately appear for any given natural key, creates new sets as needed on the fly and does not duplicate sets.Again, environments with small numbers of dimensions, or where the number of dimensions attached to a given natural key remains reliably small, may choose to prepopulate the bridge table with all the permutations of the dimension keys or may choose to take the easier traditional associative table approach.

 

Again, we’ve identified the dimension keys for our source records.We’ll name the various work or staging tables we create with little mnemonic acronyms to try to make things easier to follow.

 

Step one is to sort your records by natural key and then dimension key, as shown in Figure 7.

 

 

Now we need to identify the unique sets.How can we tell that the combo of 3, 24 and 192 for NK 17 is “the same thing” as the set attached to NK 102?

 

First, order these rows in ascending order by NK and then DK and add a row numbering or identity column.Depending on your ETL tool and DBMS, you’ll have a number of choices for how to do this.(In Business Objects Data Integrator, for instance, you can use a “Key Generation” transform.In Microsoft SQL Server, you could add an identity column, either to a table variable in memory, a temp table or a “real” working table.)However you like, add a new column, which we'll call “K1,” to uniquely identify each row in order, as shown in Figure 8.

 

 

Next, make a table of the minimum K1 value and the count of records for each NK.Again, depending on your ETL tool (even if your ETL tool is just plain SQL), there will be various ways to do this.In SQL:

 

select

nk

,min(k1) as ‘min_k1’

,count(*) as ‘ct’

from

s2d

group by

nk

 

That will generate the table shown in Figure 9.This table might be a subquery in SQL code, an in-memory table in your ETL tool or a persistent table or file - logically, it doesn’t matter.

 

 

What we’d like is for each set of dimension keys to have similar attributes, so that we can do a SELECT DISTINCT to get the sets, apart from their appearance multiple times attached to various NK values.

 

Let’s join S2D to AG1 and do some math to get a set of attributes for any given set of dimension keys to be the same:

 

select

s2d.NK

,s2d.dk

,s2d.k1

,((s2d.k1 + 1) – ag1.min_k1) as ‘SET_ORD’

,ag1.min_k1

,ag1.ct

from

s2d inner join ag1

               on s2d.nk = ag1.nk

 

You get the following table shown in Figure 10.

 

 

SET_ORD is the ordinal ranking of the dimension value within a given set.It’ll be handy in a second.We can now pull out our unique sets of dimension keys by doing a self-join on the common attributes and making sure things match up (or don’t match up!) in a certain way.

 

select

t1.min_k1

from

S2DAG1 t1 left outer join s2dag1 t2

on t1.ct = t2.ct

and t1.set_ord = t2.set_ord

and t1.dk = t2.dk

and t1.min_k1 < t2.min_k2

group by

t1.min_k1

having

sum(IsNull(t2.dk,1) > 0

 

We want the “set markers” (the MIN_K1 values) for all the sets where when we try to join on a higher numbered set, matching everything up – dimension key, ordinal ranking and number in set - we fails to get a complete match.(If we do get a complete match, then by definition of “complete match,” it’s a duplicate set that we don’t want.)

 

The above query generates the table of unique set markers (USM) shown in Figure 11.

 

 

Joining USM back to S2DAG1 results in a nice table of unique dimension sets:

 

select

usm.min_k1

s2dag1.dk

,s2dag1.set_ord

,s2dag1.ct

from

s2dag1 inner join usm

on s2dag1.min_k1 = usm.min_k1

 

 

We see that we have five sets of dimension keys.These are the unique sets we see coming in, in our source data.

 

New versus Existing Sets

 

Next we need to figure out which of these unique sets we already have in our bridge table and which we need to add.Those we already have we’ll toss for purposes of this article.

 

To keep things simple, let’s say our existing bridge table consists entirely of shown in Figure 13.

 

 

So far, then, we have these sets:

 

  1. [83]
  2. [3, 24, 192]
  3. [3, 24]
  4. [3, 24, 889]

Step one: do a left outer join from the incoming distinct set table, UDS, to the existing, BRIDGE:

 

New Left Joined to Existing (NLJE)

 

select

uds.min_k1

,uds.dk ‘new_dk’

,uds.ct

,IsNull(bridge.bk,-100) ‘bk’

from

uds left outer join bridge

on uds.dk = bridge.dk and uds.ct = bridge.ct

 

You set the value of the bridge’s bridge key (BRIDGE.BK) to some “crazy” value that you know, whatever your processes and standards, won’t ever be a real group key value.You don’t want these to be NULL at this point, but you want some sort of placeholder for “no match.”The results are shown in Figure 14.

 

 

Next, let’s make a little table of the existing group keys, the keys of the incoming unique sets and the count of records by that level of aggregation:

 

Existing Count (EC)

 

select

bk

,min_k1

,count(*) ‘ct’

from

nlje

group by

bk

 

 

By joining the NLJE table back to EC, we can figure out the sets we already have versus the new ones we need to create:

 

Create versus Have (CVSH)

 

select distinct

nlje.min_k1

,nlje.new_dk

,nlje.ct

,ec.bk

from

nlje left outer join ec

on nlje.bk = ec.bk and nlje.ct = ec.ct and nlje.min_k1 = ec.min_k1

 

 

And here we’ve identified our new sets – they’re simply the records with NULL in the BK column.(As a nice side benefit, we’ve also found the existing set keys for our incoming records, which we can use as part of the fact table load.)

 

The final piece of the puzzle is numbering our new bridge records, and then they’re ready to insert into the bridge table.

 

Numbering Our New Sets

 

We pull out the MIN_K1 values from above for the new sets:

 

New sets, MIN_K1 (NSMIN_K1)

 

select distinct

min_k1

from

cvsh

where

bk is null

 

 

We need to number these with new surrogate keys for our bridge table.Again, depending on tools, there are different ways, but the basic idea is to find the maximum existing bridge key value, add 1, and then use that as the seed value to number the new sets.To find the new seed value:

 

select

max(bk) + 1 as ‘next_bk’

from

bridge

 

In our example, the next BK value is 5.Using that as the seed, we add a BK (“bridge key”) column to table NSMIN_K1:

 

 

To create the records to be inserted into the bridge table, just join CVSH to NEW_BK, as follows:

 

insert bridge (dk, bk, ct)

select

cvsh.new_dk dk

,new_bk.bk

,cvsh.ct

from

cvsh inner join new_bk

on cvsh.min_k1 = new_bk.min_k1

where

cvsh.bk is null

 

Now you have your new bridge table entries. As mentioned, you've also identified the bridge set key to use for the fact table load (for the existing sets, the newly inserted sets are identified in the same way).

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access