Beginning with Version Pattern 6, we will use surrogate version keys in all versioned tables. But note that the phrase surrogate version keys should be read as surrogate keys for versions, and not as keys for surrogate versions. We are recommending the use of surrogate keys for versioned tables. We are not introducing a concept of surrogate versions.
Neither the chart of version patterns nor the chart of this series are included in this article. The reason is that this discussion of surrogate keys is relatively self-contained and makes few references to earlier articles in this series.
We begin by defining our terms.
Surrogate and Other Kinds of Keys: Definitions
Primary Key. The column or columns of a table in a relational database that the database management system (DBMS) recognizes as a unique identifier for that table, and as the column or columns whose values will be duplicated in foreign keys in order to create relationships.
Comment: The second half of the definition, the part following the "and," is necessary. The reason is that a DBMS is able to recognize and accommodate more than one unique identifier for a table. Any column or columns over which a unique index is defined is as much a unique identifier for a table as is the primary key. Once one of those columns or groups of columns is designated the primary key, the DBMS knows that their values will appear in foreign keys whenever a relationship to their rows is created.
Foreign Key. The column or columns of a table in a relational database that the DBMS uses to establish a relationship between that table and another (not necessarily distinct) table. At the instance level, foreign keys relate pairs of rows such that each foreign key value is identical to the primary key value of a row in the referenced table.
Surrogate Key. A primary key whose values are system-generated and are either hidden from business users or exposed with the understanding that no meaning other than their function as unique identifiers should be assumed true of them.
Comment: the second half of the definition, the part following the "and," is necessary. The reason is that system-generated primary key values may be exposed to business users. A typical example is system-generated document numbers for invoices, purchase orders, receipts, etc. The exposure does not mean that those primary keys are no longer surrogate keys. But, if exposed, such keys must be exposed to business users with the clear understanding that their only role is to identify, and that no other meaning can reliably be given to them.
Because this has turned out to be an excellent example with which to illustrate the concepts of meaning and semantics, the next section will be devoted to elaborating on it. This is an aside to our main theme of introducing the use of surrogate keys with versioned tables, but we think it is an important one.
Business Key. A unique identifier for a table, all of whose columns express business meaning.
Natural Key. A primary key which is a business key.
Semantics and Meaning, Illustrated With a Surrogate Key Example
Speaking of surrogate keys, we said above that " if exposed, such keys must be exposed to business users with the clear understanding that their only role is to identify, and that no other meaning can reliably be given to them."
The word "reliably" is essential here, because sometimes additional meaning is imputed to surrogate keys by business users, whether we IT folks like it or not. For example, perhaps many users of Acme's Policy table know that, following the "P" in its values, the three-digit number is a number that was sequentially assigned by the system. Knowing that, these business users may then use that knowledge to conclude, for example, that of two clients, the one with the higher numeric part in the client-nbr was added to the database later than the other one.
In this way, such business users are imputing meaning, over and above the meaning of the key as a unique identifier, to the values that key takes on. The additional meaning, or semantics, is that the surrogate key also chronologically sequences the events in which the clients or policies were added to Acme's tables. Those key values "tell us" the sequence in which rows of that table were created, and it is in their "telling us that" that their imputed semantics consists.
While IT cannot stop users from reading these semantics into these keys, we can and must explicitly state that such imputed semantics are not reliable. We must tell the users that if they continue to do this, they do so at their own risk. Let's see, now, what kind of trouble such users can get themselves into.
Acme's Policy Management System initially did start out by creating primary key values using a sequence number for the last three characters in the key. Although many business users knew this, what they did not know is that policy numbers, starting about a year ago, were being reused. Whenever an Acme policy was deleted, a six-month waiting period was observed (to allow all references to the deleted policy to be "flushed out" of the on-line database), after which the policy number was used as the unique identifier of a different policy. Obviously, the reuse of policy numbers makes them useless for telling which of two policies was created before the other one. This illustrates the risks of relying on undocumented semantics.
As semantically risky as such reuse of identifiers is, most of us who have been around for awhile have seen this kind of thing happen. In our case, Acme apparently thought that it would never have to manage more than a thousand policies. A very modest way to start out, but also a little shortsighted. And this shortsightedness is the cause of this semantically risky reuse of policy numbers. Acme's IT department is desperately trying to put off the evil day when they will have to change from four-byte to five-byte policy numbers.
It doesn't take much reflection to realize that the changeover will be very "evil," i.e., very costly and complicated. Existing policy numbers are everywhere in primary keys, in foreign keys, across several Acme databases, and in the databases of customers and partners. Those numbers are "in flight" on transactions created but yet to be applied to Acme's databases. This can't be cleaned up in one fell swoop, on one night between midnight and 6a.m. while all the data is unavailable for access. It can't be cleaned up, in other words, in one multi-transaction atomic unit of work. While the changeover from a four-byte to a five-byte client number is taking place, semantic disintegrity will be exposed to business users. Joins will return only partial result sets. Code and SQL will have to be rewritten to use "link tables" and, after the changeover is complete, rewritten again to not use those link tables.
It reminds me of what Everett Dirksen, the senior senator from Illinois who almost helped to write the Civil Rights Act of 1964, said about how the government spends money: "A billion here, a billion there, and pretty soon you're talking about real money." Well, when it comes to changing the syntax of the primary keys of critical enterprise tables, it may not be a billion here and a billion there. But it can often be "a million here and a million there."
The Client Version and Policy Version Tables Without Surrogate Keys
Figure 1 shows our Client Version and Policy Version tables, as we have seen them through the first fourteen articles in this series. In these tables, the primary key is a natural key, not a surrogate key, because the same columns that play the role of primary key also play the role of business key.
Client-nbr and Policy-nbr. Although we have identified client-nbr and policy-nbr as surrogate keys, each is really a hybrid. We said that the numeric part was surrogate because, even if it had the semantics of chronological sequence, semantics were an accident of how the numbers were created and could not be relied on. But we tacitly accepted the semantics of the alphabetic portion, in which "C" designated a client key and "P" designated a policy key.
So because one part of those keys has business meaning, let's think of them as natural keys. In other words, let's think of them as not being surrogate keys. (This notion of a primary key column having "parts" brings up the old issue of "intelligent keys. But intelligent keys have no relevance to the issues being discussed here. However, for an extensive discussion of intelligent keys, see Johnston's two articles on "Primary Key Re-Engineering Projects"(Part 1, Part 2).
Client-ver-beg-dt and Policy-ver-beg-dt. These primary key columns have business meaning, because they designate business effective begin dates in both cases. Therefore, [client-nbr + client-ver-beg-dt] and [policy-nbr + policy-ver-beg-dt] are business keys for versions of clients and policies, respectively.
The Client Version and Policy Version Tables With Surrogate Keys
Figure 2 shows our examples of a Client Version and Policy Version table, with surrogates used as primary keys. The first thing to notice is that these tables still contain their original business keys. Business keys are what provide semantic entity integrity. Primary keys provide physical entity integrity. The distinction between semantic and physical entity integrity arises, of course, only when we substitute surrogate for natural keys, and thus separate what the DBMS recognizes as the primary key from what the business recognizes as the combination of data elements that properly play the role of distinguishing each instance of the entity from every other instance.
This is a complex topic, and we will not say anything more about it in these articles. However, at the DAMA 2008 Conference, Johnston's presentation will discuss this issue in greater detail.
Client-nbr, client-ver-beg-dt, policy-nbr and policy-ver-beg-dt have the same semantics they have had all along, except that their role as primary keys has been removed. In the primary key roles, we have provided two new columns for each table, an object identifier ("oid") and a version identifier ("vid").
We have also "stacked" a few more columns, again so the picture of these tables could fit comfortably on a page. So although the picture of each table suggests that there are six columns apiece, in reality the Client Version table has nine columns and the Policy Version table has ten columns. Or, if you prefer, the graphic columns are six and six, respectively, while the database columns are nine and ten, respectively.
The Client Version and Policy Version tables, with surrogate keys, are shown in Figure 2.
For both oids and vids, we have used five-digit numbers. But the important thing is that a) these numbers carry no business meaning whatsoever, and b) they play their roles as unique identifiers. Their semantics as unique identifiers are, at a minimum:
- No two policies may have the same policy-oid.
- All versions of the same policy must have the same policy-oid.
- No two versions of the same policy may have the same policy-vid.
- And the same for clients.
For reasons described in Johnston's "Primary Key Re-Engineering Project" articles, we would prefer additional semantics for oids and vids, namely:
- No two objects, of whatever type (policies and clients, here) may have the same oid.
- No two versions, of the same or of different objects, may have the same vid.
One final point before we conclude Part 14. Surrogate keys are usually single-column primary keys. But our versioned tables are two-column primary keys. It isn't necessary to use two columns an oid and a vid as the primary key of versioned tables. But it is very useful to have a system-generated key for an object, with an additional distinct system-generated key for a version of an object. What isn't necessary is that these two system-generated columns be used as the primary key of a versioned table. Instead, we could have a typical surrogate key, i.e., one consisting of a single column.
Why do we need a system-generated oid and also a system-generated vid? It is not for the sake of semantics. As far as semantics is concerned, we could use a single-column surrogate key for versioned tables, while retaining the client-nbr (or policy-nbr) and associated version begin date. No oid, no vid.
Recall that client-nbr and policy-nbr were our somewhat simplified examples of business keys. So it could just as easily have been that policies had a four-column business key while the client had a three-column business key. Quite possible no two or more of those six columns would have the same data type and length.
Why is that significant? It is significant because, all along, we have had a single, enterprise-wide solution to versioning as our objective. Not one solution for one database and a slightly different solution for another database. While we may be able to retrieve data from versioned tables with SQL alone, it is not possible, as we will see, to update data in versioned tables with SQL alone. For updating versioned tables, therefore, there is a codebase. Our enterprise objective, if we are to take it seriously, requires that one and the same codebase be deployed for all versioning solutions.
If an enterprise versioning solution is used, let us say, in six databases across the enterprise, then there may be six physical copies of that codebase. An enterprise solution does not require federated access to a single physical codebase instance. But it does require that all instances of that codebase be line-for-line, byte-for-byte identical. The only alternative is to develop a custom versioning solution for each database that requires versioned tables. While that may be the most common way of implementing versioning in an enterprise, it is not our objective.
The Versioned Table Surrogate Key Position, Going Forward
The concept of surrogate keys is a contentious one. Publications arguing both for and against their use go back at least twenty years. Arguments based on both theoretical and practical concerns can be made for using surrogate keys and also for not using surrogate keys.
One of us (Johnston) has been an advocate of using surrogate keys for all kinds of tables, in all kinds of databases. This means for dependent and associative tables, not just for the kernel tables which those tables are dependent on, or relate. This means for tables in OLTP and data mart databases, not just in ODS and data warehouse databases. We will refer to this as the strong surrogate key position to distinguish it from positions which recommend surrogate keys for some kinds of databases but not for others, or for some kinds of tables but not for others.
In Part 15, we will provide an outline of the major arguments for and against the strong surrogate key position. This will not be an impartial review of arguments on both sides. It will be a distillation of the arguments Johnston has made in half a dozen articles going back to 2000, and also a preview of the presentation he will be making at the DAMA International Conference next March in San Diego.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access