Q:

There are some "lively" discussions going on regarding the appropriate use of surrogate keys versus natural keys? What is your opinion? When is each of these most appropriate? What are the advantages and disadvantages of each?

A:

Les Barbusinski’s Answer: Surrogate keys (a.k.a. token IDs) are most often used when an entity has no natural key or when the natural key consists of five or more data attributes. The primary advantage of surrogate keys is that they facilitate n-way joins between fact and dimension tables. However, they can be problematic in that they can mask business rules and relationships, and make enforcing UNIQUE constraints difficult. Generally speaking, using a surrogate key is not recommended when a simple, natural key is available for an entity. Doug Hackney’s Answer: If your users are being asked to build ad hoc queries using the product IDs, customer IDs, etc. that they know, then you should stick with natural keys. Otherwise, it is not an issue for users and limited to a sometimes obscure and arcane debate among DBAs and data architects.

Scott Howard’s Answer: Lively discussion may be an understatement. I've seen this debate blossom into a philosophical war. Some claim that surrogate keys must always be used while other claim they never be used. I advocate their cautious use when appropriate.

First let's discuss why one may need them. Deriving a data warehouse occurrence's uniqueness solely relying on the natural keys can often lead to primary keys made up from the composite of every dimensional attribute. This is further exacerbated in the multidimensional data mart in deriving the primary key for a fact table. In its simplest form, a fact table's primary key is the composite of all dimensional keys. A star schema with time, geography, product, customer and business scenario dimensions would lead to a fact tables primary key being a concatenation of five attributes. If you have slow varying dimensions where time must be added to the dimension key, you wind up with a concatenation of 10 attributes. Take this to the physical implementation layer where some RDBMSs have problems joining tables based on concatenated keys, that is the number of concatenated attributes prevent appropriate index use, and you have a performance horror (we've seen this behavior with Oracle). These RDBMSs will perform admirably if you simplify the composite key with the assignment of a surrogate. Another reason for surrogate key assignment is the possibility of key reuse in the operational system.

Surrogate key assignment 101: I generally recommend surrogate keys for all slow varying dimensions. Also consider them when the dimension will be combined with many other dimensions in support of a single star schema – many dimension converging on a single fact table. Don't, however, assign surrogate keys for assignment sake. Surrogate key assignment and then resolution back to the natural key when needed is expensive in both storage costs and ETL processing. In other words, justify all surrogate key assignments by validating they save a problem that I just described from hindering your DW.

Michael Jenning’s Answer: As always the answer is, it depends. Surrogate key use offers several advantages from a dimensional modeling perspective. It provides a layer of insulation in environments where frequent changes to source systems or natural keys can occur. Since the dimensional model uses surrogate keys, only the mapping (natural keys=> surrogate key) is affected not the key structure of your database. A single fixed length surrogate key on a table offers database performance advantages in cases where numerous natural keys are needed to establish uniqueness in a dimension and/or where the natural key has a large or highly variable length (e.g., URs).

Why would you ever not want to use surrogate keys in a dimensional model?

  • Your particular data warehouse environment’s size and/or the number of sources is small and stable.
  • Your window of time for processing updates to your data warehouse is very small or near real time. The mapping of a natural key (s) to the surrogate key in the dimensional model requires additional processing and complexity to the ETL process.
  • Your particular dimensional database is very large and to increase the efficiency of ETL processing and access the RDBMSs database fragmentation method is used. Database fragmentation strategies are most efficient when segmented using natural keys versus surrogate keys.
  • Your particular reporting tool may not provide sufficient navigation of dimensional hierarchies, filtering, aggregation or other capabilities to produce the required output needs of your users when the dimensional model uses surrogate keys.

Chuck Kelley’s Answer: I always (well, 99.999999999999999 percent of the time) use surrogate keys because the allow the optimum flexibility without the loss of anything. I see no disadvantages to using them.

Joe Oates' Answer: I can only give some examples of what made me supporter of surrogate keys.

  1. Another operational system was brought into the warehouse and had duplicate keys.
  2. Joins are much faster on integer keys than on a long, mixed alpha and numeric, concatenated primary key. I was once forced to change from these kind of “natural” keys in order get query performance that the customer would accept. Most of the important queries improved at least an order of magnitude.
  3. New records came along with the same natural key as records that had been logically deleted.

Clay Rehm’s Answer: I prefer surrogate keys. However, whichever route you go, be consistent. The primary benefit of the surrogate key is that if your source system changes, your keys don’t have to change.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access