Primary Key Reengineering Projects
The Problem
Information Management Magazine, February 2000
Many enterprises are undertaking primary key reengineering projects for their core business tables. These tables keep information on an enterprise's suppliers, partners, customers, orders, locations, employees, and so forth. The purpose of these projects is to replace the "intelligent keys" of these tables with "unintelligent" keys.
Advertisement
Referential integrity is usually enforced only within the physical database that contains the core table itself. But the instances in tables in other databases are also foreign keys, and they must be changed, too. Moreover, foreign keys exist not just in these online databases, but also in off-line electronic archives not to mention microfiche and hard copy where, of course, they cannot be changed.
Consider the effort to propagate changes in just one table's primary keys to their foreign keys in that table's own database and to their foreign keys in physically distinct databases. Ideally, a change to a primary key value should be done as one atomic transaction; but this usually isn't possible when large, physically distinct databases are involved. So there is a period of time often weeks and months, if many databases are involved over which the changes take place. During this time, joins will fail for the foreign key instances which haven't yet been changed. Consequently, the databases will report incorrect information when queried.
If we want to avoid reporting incorrect information on important business objects, we must use some kind of cross-reference table until all foreign key instances have been changed to the new values. Cross-reference tables track which foreign keys have been changed to the new values and which ones have not. They are useful only if all queries that contain joins based on the foreign keys are altered to add an intermediate join to the cross-reference table; and this can be a very expensive process. What's more, after all of the foreign key instances have been changed, the cross-reference table must be discarded. Consequently, the queries have to be changed back to make the direct join and not use the cross- reference intermediary. Again and again, with one database after another, this lengthy and expensive process must take place until all the foreign key instances of all the primary keys of that core table have been changed.
Thus, reengineering the primary keys of just one core table, for a large enterprise, can take a year or more and millions of dollars. I'll call the factor that accounts for this tremendous time and cost "the foreign key ripple effect." On the bottom line, it outweighs all other costs of reengineering primary keys by perhaps an order of magnitude.
With this kind of cost for these reengineering efforts, we'd better do it right the first time! However, every primary key reengineering project that I have seen or heard of is doing it the wrong way, by reengineering to unintelligent keys only. These reengineering projects will only solve part of the problem because they are based on an incomplete understanding of the semantic requirements for unique identifiers in databases. If these projects, whose objective is to reengineer from intelligent keys to unintelligent keys, are allowed to continue, they will have to be followed by a second round of equally costly projects. The second time, the objective will be to reengineer from unintelligent keys to what I call "enterprise keys."
The good news is that we can reengineer from intelligent keys directly to enterprise keys. We don't have to first reengineer to unintelligent keys and then reengineer a second time to enterprise keys. In this series, I will show that the problem is indeed as serious as I have said and that it is indeed avoidable. I will also show how to avoid it.
Entity and Object Integrity
Basically, the problem is that both intelligent and unintelligent keys satisfy the entity integrity constraint, but neither satisfy the constraint of "object integrity." Entity integrity is the constraint that the primary key values in a relational table must be unique across all of its rows. Object integrity, as this article will explain, subsumes entity integrity under a stronger requirement. It requires that every primary key instance must be different from the other primary key instances of its own table and also from all primary key instances of an entire set of other tables.
That entire set of tables, across which a primary key value must be unique, forms a "namespace." For our purposes, a namespace is simply a set of tables across which the identifier for every row must be unique. This namespace might be a single physical database; but the larger the namespace, the greater the benefits of object integrity. So an enterprise would do well to include all of its principal databases in its initial namespace. Later, a namespace shared with the enterprise's principal suppliers, partners and customers could be created at little incremental cost.
Today's RDBMSs do not enforce object integrity. Consequently, a database can contain two or more rows (each in a different table) which happen to have the same value for their primary keys. It happens this way. Sometimes, the primary keys of two or more tables in a database have the same "syntax" (i.e., the same data type and length) and also identical or overlapping domains (permissible values). When we have these two conditions, the primary key of a row from any of these tables may have the same value as a primary key from a row of any of the other tables. For example, if a purchase order table and an invoice table both have primary keys consisting of a date plus a sequence number, then every day there will be n pairs of orders and invoices with identical primary keys, where n is the smaller of the number of orders or of the number of invoices produced that day.
Page 1 of 3.






