Free Site Registration

To Surrogate Key or Not

Design Challenge

Information Management Magazine, August 2006

Steve Hoberman

A great way to sharpen our analysis and modeling skills is to continuously address real-world scenarios. A modeling scenario along with suggested solutions appears each month in this Design Challenge column. The scenario is emailed to more than 1,000 designers up to the challenge. Many of the responses, including my own, are consolidated here. If you would like to become a Design Challenger and have the opportunity to submit modeling solutions, please add your email address at www.stevehoberman.com/designchallenge.htm. If you have a challenge you would like our group to tackle, please email me a description of the scenario at me@stevehoberman.com.


The Response

A surrogate key is a unique identifier for a table, a numeric counter introduced early in the physical design stage. No meaning is implied by a surrogate key value. For example, you can't look at a month identifier of 1 and assume that it represents January. A surrogate key allows for more efficient navigation across structures and facilitates integration across applications. However, before telling the manager to always use surrogates, we need to consider the drawbacks: it can take more time and effort to navigate structures as well as add more complexity to the development phase. Therefore, we need guidelines in determining where it makes sense to create a surrogate key.

Advertisement

Create a surrogate key if the entity has a natural key and any of the following conditions hold:

  • There are multiple sources;
  • The natural key is inefficient; or
  • The natural key is recycled.

Surrogate versus Virtual

A natural key is ideally how the business would identify an entity instance, and a surrogate key is a substitute for this natural key. For example, a natural key for an organization could be taxpayer identifier or DUNS (D&B) Number. If no natural key exists, we can create a virtual key. A virtual key is a unique numeric counter used when it is not possible to identify an entity instance. For example, a manufacturing company has the concept of an adjustment. An adjustment is when a pallet is moved between shelves in a warehouse. The business has no way of identifying this kind of transaction, so the virtual key adjustment identifier is created.

Create a Surrogate if Multiple Sources Exist

If the entity you are modeling contains data from multiple sources, there is usually value in creating a surrogate key. For example, assume you learn that both Henry Winkler the student and Hank Winkler the instructor are really the same person. You can add this person as a unique row to the person entity identified by a person identifier surrogate key. This person can then play the two roles of instructor and student.

Create a Surrogate if the Natural Key is Inefficient

If the natural key is inefficient, create a surrogate key. Inefficient means that the data elements which make up the natural key have performance or space issues when joining across tables. This surfaces most frequently when joining a reference to a transaction table, such as a dimension to a fact table. For example, the business might identify a promotion by a promotion code and promotion start date. Although the code and start date have little impact on a promotion table containing only a few thousand rows, there will be an impact on the multimillion row order and credit tables that reference promotion.

Create a Surrogate if the Natural Key Values are Recycled

While working with a manufacturing plant, I learned that one of their key business concepts is identified by a code whose values repeat every 18 months. We decided to use a surrogate key as a buffer layer and create an alternate key on this natural key code plus a date. As an aside, it is a very good practice to always define an alternate key on the natural key when using a surrogate.

Every set of guidelines has exceptions, and for surrogate keys a very real exception is the limitation of application development tools. There is technology in use that requires surrogates while other tools prohibit them.

Steve Hoberman is one of the world's most well-known data modeling gurus. He taught his first data modeling class in 1992 and has educated more than 10,000 people about data modeling and business intelligence techniques since then. Steve is known for his entertaining, interactive teaching and lecture style (watch out for flying candy!), and organizations around the globe have brought Steve in to teach his Data Modeling Master Class, which is recognized as the most comprehensive data modeling course in the industry. Steve is the author of "Data Modeling Made Simple," "Data Modeler’s Workbench" and "Data Modeling for the Business (Technics Publications). He is the founder of the Design Challenges group and inventor of the Data Model Scorecard.

For more information on related topics, visit the following channels:

Advertisement

Advertisement