In Part 14, we introduced the concept of surrogate keys and showed how our example database of clients and policies would look if we used surrogate keys with that database. We also introduced some terminology that will help us examine the arguments for and against their use, terms such as "business key," "natural key" and "mixed key." We then went on to define two sets of semantic constraints that give surrogate keys their meaning. The minimal set consisted of three rules for the Client table and a corresponding three rules for the Policy table. The full set consisted of an additional two rules, both of which applied to both tables.

 

We also designated the approach to surrogate keys that we will use in the remainder of these articles as the "strong approach." The approach we will contrast it to we will designate the "standard approach," for such it is. We begin, then, by describing the standard approach and our reasons for dismissing it.

 

Neither the chart of version patterns nor the chart of this series are included in this article. The reason is that this discussion of surrogate keys is relatively self-contained and makes few references to earlier articles in this series.

 

Surrogate and Other Kinds of Keys: Definitions

  • Primary Key. The column or columns of a table in a relational database that the database management system (DBMS) recognizes as the unique identifier column or columns for that table.
  • Foreign Key. One or more columns of a table in a relational database used by the DBMS to establish a relationship between that table and another (not necessarily distinct) table. At the instance level, foreign keys relate pairs of rows such that each foreign key value is identical to a primary key value in the referenced table.
  • Surrogate Key. A primary key whose values are system-generated, and which therefore carry no business meaning. Surrogate keys are usually single-column keys.
  • Mixed Key. A multi-column primary key, some of whose columns contain system-generated values and other of whose columns contain business-meaningful values.
  • Business Key. A unique identifier for a table, all of whose values carry business meaning.
  • Natural Key. A primary key that is a business key.

Surrogate Keys: The Standard Approach

 

When IT practitioners speak of a database that uses surrogate keys, they are usually referring to a database some of whose tables have single-column surrogate primary keys, but other of whose tables have multi-column primary keys whose values may or may not be system assigned. Thus, they might use a single-column surrogate key for a Purchase Order table but create a mixed primary key for a Purchase Order Line Item table, one consisting of a foreign key to the Purchase Order table together with a line number. For an associative table matching up purchase orders to receipts, they would typically follow the general practice for associative tables and create a primary key by concatenating the primary keys of the Purchase Order and Receipt tables.

 

Let's call this the standard approach to using surrogate keys. What might justify the standard approach as the best way to assign primary keys to the tables in a database? It would have to be something like this. Whether or not surrogate keys are preferable to business keys is something that should be determined for each table. Now consider a database all of whose tables have natural keys. Among those tables, there is a subset whose primary keys do not contain any foreign keys, i.e., which do not use "identifying relationships." Call them "kernel tables." Assign a single-column surrogate key to these kernel tables.

 

That surrogate key will then become part of an identifying relationship in all of the other tables to which each kernel key table is related. There are two types of such other tables: dependent tables and associative tables. Dependent tables have primary keys which contain one surrogate foreign key. Associative tables have primary keys that contain two surrogate foreign keys.

 

Our objection to this standard approach is that it makes the decision to use or not to use surrogate keys at the level of individual tables. The result is a database in which some tables use surrogate keys, other tables use mixed keys and yet other tables may use natural keys.

 

We have seen various rules of thumb offered for deciding whether to use a surrogate, mixed or natural key for a particular table. For example, some database modelers/designers will always use a natural key for reference/code tables. Their argument is that besides the value being referenced, all that such tables contain are things like a short and a long description. Since for most purposes, it's just the code that is wanted in a result set, using that code as the natural key of the reference table means that a join to that table is not required in order to get the code value. ICD-9 codes on a medical claim are a good example. If the ICD-9 code itself is the foreign key, the claim with its list of ICD-9 codes can be materialized without needing to join back to the ICD-9 reference table.

 

We prefer to make the decision to use or not use surrogate keys at the database level. For all the tables in the database, it's all or nothing. Either all of them use surrogate keys, or none of them do. If single-column surrogate keys are being used, then the primary keys of all tables contain exactly one column. This is true of associative and dependent tables, of core tables and code tables. All use surrogate keys.

 

Moreover, on our strong approach to surrogate keys, no surrogate key value will appear in more than one primary key instance anywhere in the database. A primary key is unique, in other words, across all tables in the database, not merely unique within one table.

 

Argument Against the Standard Approach to Surrogate Keys

 

Our argument against the standard approach to using surrogate keys is simplicity itself. The standard approach leaves us with a mixture of primary and foreign keys to manage that share no consistent syntax and across which duplicate values are possible.

 

The argument for using surrogate keys rather than natural keys is simple: natural keys can change, and that change can be expensive. As for change, the values in natural keys describe something about what the row they are part of represents; that follows from the definition of "natural key." Suppose one natural key is a two-byte column whose values represent geographical sales regions and are: NE, NW, SE and SW. Now suppose the VP of Sales, after extensive analysis and reflection, of course, decides to change sales regions as follows:

  • California is too big to be lumped in with the rest of the SW, so it will become its own sales region, with "CA" as its code.
  • The NE and SE regions are also getting too big, but we don't want to carve out one or more states to fix their problem. Instead, we will split them each into two regions, with designations as follows: NE1, NE2, SE1 and SE2.

We invite you to reflect on the cost of such changes. Next, think about the cost of these changes if IT had, instead of using natural keys, used a surrogate key for this Sales Region table.

 

Argument 1 Against Surrogate Keys: Users Don't Understand Them

 

Taken in one sense, this argument means that users don't know what surrogate key values stand for. But of course, that's the whole point. The system generates values for surrogate keys, and those values have no business meaning. No wonder users don't understand them.

 

Taken in another sense, this argument means that users don't understand the need for surrogate keys, that they are uncomfortable when they see surrogate keys in their result sets and so on.

 

The need for surrogate keys can be explained to business users in two steps. First of all, primary keys are needed to uniquely designate each row in each table. Secondly, the business may need to occasionally require IT to changes values and/or data types and lengths of primary keys in order to make the data being used as a unique identifier more accurately describe what it is about. And when those changes are required, the costs of implementing them can be enormous. We saw this above, in the argument against the standard approach to surrogate keys.

 

Argument 2 Against Surrogate Keys: Users Don't Use Them

 

This argument is that, because surrogate keys do not contain business data, users will never select rows from tables by supplying surrogate key values. This is certainly true, but it doesn't support the conclusion that users don't use them. On the contrary, surrogate keys, together with the foreign keys that replicate their values, are used to join rows to other rows.

 

If a user wants information about a particular Policy, for example, we may assume that he knows what the business key of that policy is. His query then begins with a Select operation on the Policy table, with policy-nbr as the target, and a specific policy number as the search argument. Since business keys should be unique, this process should select exactly one row (provided the supplied policy number is valid).

 

Next, logically although not temporally speaking, the query proceeds to the Join operation in which one or more other rows foreign key-related to the original row, are also selected.

 

Finally, again speaking logically and not temporally, the query proceeds to the Project operation in which some columns are designated to be included in the result set, while the other ones drop out.

 

The entire query, at this point, asks a question of the database. It says, "Do there exist one or more instances of data that satisfy this specification?" If so, the resulting instances are returned, each one as a row in a result set.

 

There is generally no good reason to include surrogate keys or their associated foreign keys in the result set. There is generally no good reason for users to ever see a surrogate key value. For even if the user writes his own SQL and uses surrogate keys and foreign keys in his joins, he does not supply a surrogate key value anywhere in the query, and isn't interested in a surrogate key result. Primary and surrogate key join pairs, like two adjacent links in a chain, are purely DBMS machinery for linking data together.

 

Argument 3 Against Surrogate Keys: They Take Up Extra Space

 

This argument points out that if we used a table's business key as its primary key, we wouldn't need an extra surrogate key column. Say that surrogate key is eight bytes long, and it occurs in a million-row Client table. That's eight million unnecessary bytes, approximately 7.5MB extra for the entire table.

 

While direct access storage device (DASD) space is relatively cheap, the argument will continue, response time to a user's query is not. And the eight extra bytes on each row also contributes to poorer response time, because fewer rows will fit into a single I/O block; fewer rows at a time can be transferred between secondary storage and main memory.

 

In response, we need to distinguish between the impact of a surrogate key on the table it is the primary key for and its impact across the entire database. In its native table, the Client table in this case, adding a surrogate key will indeed add extra space to the storage requirements for the table and will also add some amount of extra time when doing sequential retrievals, i.e., retrievals that require moving I/O block after I/O block from DASD to main storage.

 

But let's consider all the tables in the database and nonsequential retrievals. Across all the tables in a database, the use of surrogate keys may well reduce the total requirement for storage space. For consider our eight-byte, single-column surrogate key, and compare it to a natural key of one or more columns, which may well be longer than eight bytes.

 

Next, suppose that the Client table is an important table in the specific sense that many other tables have foreign keys to the Client table, and for some of them those foreign keys are part of their own primary keys, i.e., those foreign keys implement identifying relationships. Finally, suppose that for some of these other tables with identifying relationships to the Client table, there are further tables that have foreign keys to them.

 

Throughout this chain of direct and also indirect foreign keys to the Client table, and depending, of course, on the number of other rows which instantiate these foreign keys, we may have dozens or even hundreds of occurrences of Client table primary key values.

 

Let us assume that, at some point in time, there are 5,000 such Client foreign keys. And let us assume, further, that the business key for the Client table would have been two columns, totaling twelve bytes. The arithmetic here is simple. In the first case, the foreign keys total 60kb. In the latter case, they total 40kb.

 

Now consider a database in which important tables like our Client table contain millions of rows. And consider then the trade-off in space consumed between such a database in which surrogate keys were used for all tables versus a database in which business keys were used for all tables.

 

There is indeed a significant difference in storage requirements between using and not using surrogate keys. Surrogate keys are much more space-efficient.

 

Now let's briefly consider the argument that retrievals will take longer with surrogate keys because fewer rows will fit into an I/O block. Although there are several things to note, we will take the time to mention only one of them. This argument applies only when sequential retrievals are being considered. When a business key value is in hand, and about to be used to select a client, then after a lookup in the unique index defined on business key, only a single I/O block will be retrieved. As for joins, once a primary key value is in hand, and about to be used in a join, then after the lookup in the index, only a single I/O block will be retrieved. Any negative impact on retrieval time caused by using surrogate keys is negligible to the point of being undetectable.

 

Argument 4 Against Surrogate Keys: They Require Extra Joins

 

This argument points out that whenever a surrogate primary key is used, then all relationships to a row with such a key use foreign keys whose values are identical to that row's surrogate key value. So whenever a query has specified a row which has such a foreign key, and whenever that query wants to include all or part of the business key of the related row, a join is required to bring that business key into the result set.

 

This argument is correct. However, it is possible to avoid joins that the use of surrogate keys would otherwise make necessary. The solution is to denormalize and put duplicate columns on the referencing table, columns that duplicate the business key of the row that the surrogate foreign key is pointing to.

 

The Argument For Surrogate Keys

 

One of us (Johnston) will present a more complete case for using surrogate keys on all tables in a database, and will do so at the DAMA 2008 Conference in San Diego in March 2008. But in brief, the reasons we will proceed to discuss versioning in terms of sample tables that use surrogate keys are these:

  1. We want an enterprise solution to versioning, one applicable to any table and any database without modifying the code that implements it. That requires that all versioned tables have the same syntax for their primary keys, i.e., the same columns each with the same data types, lengths and domains.  
  2. We want to avoid the usually very high costs of changing primary keys by adopting keys that there will never be a business reason to change. Such keys are precisely those that carry no business meaning and whose only semantics are to link related rows together.

A Physical Description of Earlier Version Patterns

 

In Part 15, we begin to revisit our earlier version patterns and provide a physical description of them. This includes physical schemas and SQL, and shows how these version patterns work in a real database. For a longer-term look at where this series is going, consult Part 13.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access