In part 1 of the series, "Primary Key Reengineering Projects: The Problem," which appeared in DM Review in February 2000, we saw that unintelligent keys are an improvement over intelligent keys. We also saw that there's nothing wrong with using intelligent keys as unique identifiers, as long as they aren't used as primary keys. In fact, each of the core tables of an enterprise needs a unique identifier - one that the user can see, manipulate and specify as a selection criterion in his queries. In this role, these unique identifiers are often called "business keys," and they can be as intelligent as the user wants them to be. The important thing is that a table's business key must not also serve as its primary key.
But unintelligent keys lack two essential semantic features: a) each primary key value must be unique across all the tables in an entire name space, and b) primary key values must not "type" the instances they identify. These requirements, together with the requirement that the primary key contain no descriptive information, comprise what I call "object integrity," and they are a superset of the requirements that comprise entity integrity. In this article, we'll see what object integrity means, why it is so important and how to satisfy it in today's relational databases.
Why We Need Enterprise Keys as Primary Keys
Suppose our company has just completed two intelligent-key-to-unintelligent-key reengineering projects, one for customers and their households and another for the company's own organizational structures and its employees. There are now four unintelligent keys, one for each of these four tables.
There are two (highly simplified) data models - one for households and customers (see Figure 1) and the other for the company's organization structure and its employees (see Figure 2).
Figure 1: Households and Customers
Figure 2: Organization and Employees
In these two models, the original primary keys have been retained but are now used as business keys and not as primary keys. There is a recursive relationship on the company-organization-structure table to support a hierarchical structure of divisions, departments, groups, etc. Employees must directly belong to one and only one organization structure. Customers don't have to belong to households, but a household must have at least one customer in it.
The four unintelligent primary keys of these tables may or may not have the same syntax, i.e., the same data type, length and domain. There are problems with these keys in either case, so let's start by assuming that they do have the same syntax and that a sequence number is used for all of them.
The problem with unintelligent keys like these is that the value for any instance is unique only within its own table. It is not unique across all tables in its database and certainly not unique across all tables in all databases in the enterprise. And, of course, this is what we relational modelers are accustomed to. Indeed, we are so accustomed to it that we can hardly see why it is a problem.
But without object integrity, it is very difficult to create supertypes of two or more tables at a later point in time, because doing so almost always incurs a substantial and often prohibitive foreign key ripple cost.1 Let's see how.
Consider our customer and household data model, and our organization structure and employee data model, neither of which has any supertypes. Now suppose that we begin to look at these models from an enterprise perspective. In doing so, we will quickly realize that households and organizational units are similar and that customers and employees are similar, too.
In integrating these two models into an enterprise model, we will explicitly recognize these similarities by creating two supertypes. One is a "related-organization" supertype, whose two subtypes are household and company-organization-structure. The other is a "person" supertype, whose two subtypes are customer and employee.2 In each case, we will then move attributes and relationships that are common to the subtypes into their supertypes. Only the attributes and relationships that belong to one subtype only remain with that subtype. Any particular customer, for example, will then be characterized by both the attributes found in his row of the customer table and also the attributes "inherited" from his row in the person table. Of course, since RDBMSs do not directly support inheritance, we must implement it with code, i.e., with a join between the supertype and subtype tables.
In Figure 3, we have gone one step further and recognized that every table in our model represents an "object of interest" to us. This object-of-interest table is a supertype to all tables in the combined model.
Figure 3: An Enterprise Perspective on Persons and Organizations
In a relational model, the primary keys of supertype tables are also the primary keys in each subtype table. In addition, these subtype table primary keys are also foreign keys, pointing back to the supertype. The relationship is one to one, required from the subtype table and optional from the supertype table. Because all other tables are either directly or indirectly a subtype of the object-of-interest table, their primary keys are also foreign keys of the object-of-interest table.
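This primary-key-as-foreign-key pattern can be sketched concretely. The following is a minimal illustration, using Python's sqlite3 module and hypothetical table and column names (object_of_interest, person, customer, and an "oid" key column are my own naming, not part of the article's models): each subtype's primary key is simultaneously a foreign key back to its supertype, so a subtype row cannot exist without its supertype row.

```python
import sqlite3

# Hypothetical supertype/subtype hierarchy: customer is a subtype of
# person, which is a subtype of object_of_interest. In each subtype,
# the primary key doubles as a foreign key to the supertype.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this opt-in
conn.executescript("""
    CREATE TABLE object_of_interest (
        oid  INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE person (
        oid INTEGER PRIMARY KEY REFERENCES object_of_interest(oid)
    );
    CREATE TABLE customer (
        oid            INTEGER PRIMARY KEY REFERENCES person(oid),
        customer_since TEXT
    );
""")
# The relationship is required from the subtype side: the supertype
# rows must exist before the customer row can be created.
conn.execute("INSERT INTO object_of_interest (oid, name) VALUES (1, 'Jane Doe')")
conn.execute("INSERT INTO person (oid) VALUES (1)")
conn.execute("INSERT INTO customer (oid, customer_since) VALUES (1, '2000-02-01')")
conn.commit()
```

An attempt to insert a customer row whose key has no matching person row fails with a foreign key violation, which is exactly the one-to-one, required-from-the-subtype relationship described above.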
Now we are approaching the heart of the matter. We have used sequence numbers to generate primary key values for each of our four original tables. Now, when we try to bring employees and customers together under a person supertype, n of their keys will collide (where n is the number of employees or customers, whichever is less). And the same problem exists for households and organization structures.
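The collision is easy to see in miniature. In this sketch (the row counts are invented for illustration), two independent sequence-number generators have each started at 1, so when the two tables are brought under one supertype, the overlapping values collide:

```python
# Two tables whose keys were generated by independent sequence numbers.
# Five employees and three customers, each numbered from 1.
employee_keys = {1, 2, 3, 4, 5}
customer_keys = {1, 2, 3}

# Under a shared person supertype, these key values would collide.
collisions = employee_keys & customer_keys

# n keys collide, where n is the smaller of the two row counts.
n = min(len(employee_keys), len(customer_keys))
print(len(collisions) == n)  # True
```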
But suppose that each of our two original models had included a supertype right from the beginning - person for one model and related-organization for the other. Then there would have been no key collisions in either model. However, even if we had been foresighted enough to have included a supertype in each of our two models, the key collision problem would have arisen when we attempted to bring person and related-organization together under an object-of-interest supertype.
Is this a contrived example which looks bad only because we used sequence numbers as an example? Not at all. Suppose, instead, that the keys were date timestamps. If we did not include supertypes in either model, it would be possible, on a fast machine with a date timestamp precision to the second only, that some pairs of households and company organization structures would get the same key value and similarly for pairs of employees and customers. The key conflicts would be much less frequent in this case, but they could happen and in some environments almost certainly would happen.
An Easy Way Out?
Earlier, we said that there were two options. One was for multiple tables to have keys with the same syntax, i.e., the same data type and length, and overlapping or identical domains. The other was for the tables to have keys with different syntax. If two tables which we are about to put together under a single supertype have primary keys with different syntax, then a key value conflict cannot occur. So does this solve the problem?
It does, but it raises an equally fatal problem. Suppose that the primary key of employee is a two-attribute key consisting of: a) date of hire and b) sequence number. And suppose that the primary key of customer is a date timestamp. What, then, is the syntax of the primary key of the supertype table for these two tables? At best, it can be the syntax for the key of one of the subtype tables. And if it is, then the other table cannot be made a subtype at all, because the primary keys of its rows could not match the primary keys of any rows in the supertype table and, therefore, could not be foreign keys of the supertype table.
We've already seen how enterprise keys solve these problems. Now let's see how to create enterprise keys and how to enforce the object integrity constraint.
Two Ways to Guarantee Object Integrity in Enterprise Keys
One technique for creating enterprise keys is to always use a "root class" table in every database. The object-of-interest table in Figure 3 is a root class table. Attributes of this table, besides its primary key, are attributes that all its subtypes need such as name, abbreviation, description, creation-date, last update date, and so on.
In a database with a root table, no row can be created in any table until a row is first created in the root table. Since the primary keys of all its subtypes - in our examples, person and its two subtypes and related-organization and its two subtypes - must be foreign keys back to this root class table, all primary keys in all tables are necessarily unique across all those tables. Since they all have the same syntax, they are also type neutral. Type neutrality is important because occasionally a data model will evolve by eliminating a table and replacing it with two or more tables. With type-neutral primary keys, rows of the eliminated table can migrate to any existing or newly created table without altering their primary keys. (Of course, some non-primary key columns would probably have to be dropped and others added; but the cost of doing this is trivial compared to the foreign key ripple cost.)
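Type neutrality can be sketched as well. In the following illustration (the prospect table and the specific key value are hypothetical, invented for the example), a row migrates from one subtype table to another while its enterprise key, and therefore every foreign key pointing at it, stays unchanged:

```python
import sqlite3

# A root table plus two sibling subtype tables. Because the key carries
# no type information, a row can move between subtypes without its
# primary key changing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE object_of_interest (oid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prospect (oid INTEGER PRIMARY KEY);
    CREATE TABLE customer (oid INTEGER PRIMARY KEY);
""")
conn.execute("INSERT INTO object_of_interest VALUES (42, 'Acme Corp')")
conn.execute("INSERT INTO prospect VALUES (42)")

# The prospect becomes a customer: delete from one subtype table,
# insert into the other. The enterprise key 42 never changes, so no
# foreign key ripple occurs anywhere else in the database.
conn.execute("DELETE FROM prospect WHERE oid = 42")
conn.execute("INSERT INTO customer VALUES (42)")
conn.commit()
```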
A second technique is to construct primary key values according to rules defined for a name space. This is done by prefixing one or two bytes which uniquely identify a table to however many bytes are needed to create a value that is unique for each row in that table. The complete key is then unique across all tables in that database. To make a key unique across multiple databases, prefix one or two more bytes to uniquely designate the database within the enterprise; now each key is unique across all tables in the enterprise. To anticipate making the key unique across all the databases of an entire supply chain consortium of related enterprises, leave one or two additional bytes at the front of the key for this future expansion. Standards such as X.500 formalize this process and make it possible to create keys which are unique within a universal name space.
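A minimal sketch of this construction follows. The field widths, table identifiers and database identifier are all assumptions chosen for illustration; any concrete implementation would fix them by enterprise-wide agreement.

```python
# Hypothetical prefix widths: two digits reserved for a future
# consortium id, two for the database, two for the table, and the
# remainder for a per-table sequence number.
TABLE_IDS = {"household": "01", "customer": "02",
             "org_structure": "03", "employee": "04"}

def enterprise_key(database_id: str, table: str, sequence: int) -> str:
    """Build a key that is unique across every table in the name space."""
    consortium_id = "00"  # reserved for future supply chain expansion
    return f"{consortium_id}{database_id}{TABLE_IDS[table]}{sequence:010d}"

k1 = enterprise_key("07", "customer", 123)
k2 = enterprise_key("07", "employee", 123)
# The same sequence number in two tables, but the table prefix keeps
# the complete keys distinct across the whole name space.
print(k1 != k2)  # True
```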
The first approach lets the DBMS do the work of enforcing uniqueness. The second approach requires code to create the unique primary key values. However, the first approach has a drawback of its own. It won't work across physically separate databases unless a two-phase commit transaction is used each time a key value is created. These transactions are needed to ensure that no other database with the authority to create an instance of that type can do so concurrently and, in the process, create a duplicate key value.
With the second approach, all we need is an agreement on a prefix which uniquely identifies each database, and each table within a database, across all the participating databases. Then keys can be created in each database with the assurance that they are unique across all instances of all tables of all participating databases. Moreover, guaranteeing this uniqueness does not require a two-phase commit transaction every time a new primary key instance is created.
My recommendation is to use both approaches. The first approach lets the DBMS do the work of guaranteeing object integrity within a single physical database. The second approach guarantees primary key uniqueness within a multi-database name space, without the need for two-phase commit transactions. Thus, for example, an enterprise might have a dozen databases scattered around the world, each one with the authority to create new customers. With the second approach, they can do so without coordinating with the other 11 databases that might be doing the same thing simultaneously and can do so without the possibility of ever creating duplicate keys.
Objections to Enterprise Keys
I am aware of no objections to enterprise keys on the semantic level, nor can I conceive of any. Object integrity is a stronger constraint than entity integrity and brings clear bottom-line benefits. The objections to the use of enterprise keys that I am aware of all concern their additional costs. So let's consider each of those costs. In the majority of cases, as we shall see, the additional cost is a cost of using supertypes and subtypes in a non-SQL3 relational database and is, therefore, not an additional cost of using enterprise keys.
Development Cost. There is an incremental development cost, but it is a cost associated with using supertypes and subtypes. Therefore, it is not a cost of using enterprise keys. It is a cost that applies whether we use intelligent keys, unintelligent keys or enterprise keys.
This cost is the cost of more complex SQL. For updates, we will have to use multi-table update transactions instead of issuing a single SQL statement against a single table. If customer attributes are spread across an object-of-interest table, a person table and a customer table, for example, then the rows in all three tables for one customer must be updated as an atomic unit. Similarly, queries to retrieve customers must join across the three tables instead of specifying just a single table.
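The extra complexity can be sketched as follows, again with hypothetical tables and columns. One customer's update touches three tables inside a single transaction, and retrieval requires a three-way join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE object_of_interest (oid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person (oid INTEGER PRIMARY KEY, birth_date TEXT);
    CREATE TABLE customer (oid INTEGER PRIMARY KEY, credit_limit INTEGER);
    INSERT INTO object_of_interest VALUES (1, 'Jane Doe');
    INSERT INTO person VALUES (1, '1960-01-01');
    INSERT INTO customer VALUES (1, 5000);
""")

# One logical update to a customer is a three-table transaction:
# all three statements commit together or roll back together.
with conn:
    conn.execute("UPDATE object_of_interest SET name = 'Jane Smith' WHERE oid = 1")
    conn.execute("UPDATE person SET birth_date = '1961-01-01' WHERE oid = 1")
    conn.execute("UPDATE customer SET credit_limit = 7500 WHERE oid = 1")

# Retrieval must join the supertype chain instead of naming one table.
row = conn.execute("""
    SELECT o.name, p.birth_date, c.credit_limit
      FROM object_of_interest o
      JOIN person   p ON p.oid = o.oid
      JOIN customer c ON c.oid = p.oid
     WHERE o.oid = 1
""").fetchone()
```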
This development cost will eventually be eliminated by DBMS vendors. That will be achieved by changing SQL to reference classes and objects instead of tables and rows. The structural changes which will support redirecting the references of SQL statements to objects instead of to rows, i.e., instances of classes instead of instances of tables, will result in an ANSI-SPARC three-tiered structure in which the tiers are as follows:
- A physical layer of normalized or denormalized physical tables along with their indexes;
- A logical layer of normalized relational tables with their firm foundation in mathematics; and
- A presentation and manipulation layer of object classes and their objects.
SQL will evolve to reference these object classes and objects, including invoking class-specific methods. Select, project, join and the set-theoretic operators, which are currently applied to tables and rows, will be replaced by corresponding operators that apply to classes and objects. In other words, SQL will evolve to never reference relational constructs, just as it never references indexes today.3 The physical and logical layers of the architecture will be kept "under wraps" and never seen by the user. Tables will become as invisible to those who write SQL as indexes are today.
One-Time Conversion Cost. As for one-time conversion costs, there is no incremental cost to converting to enterprise keys. The cost is the cost of replacing primary and foreign key values in all affected data stores. This cost is the same regardless of the kind of keys we choose to migrate to.
In considering costs, it is important to remember that this one cost will far outweigh all the other costs combined. This is because of the high cost of the foreign key ripple effect. We can incur this cost twice, by first reengineering to unintelligent keys and then reengineering a second time to enterprise keys; or we can incur it once, by reengineering directly to enterprise keys.
Performance Cost. As for performance costs, the incremental cost is the cost of accessing multiple tables instead of a single table. If the database is update-intensive, the cost occurs in the update transactions and the code to guarantee their atomicity. If it is retrieval-intensive, the cost occurs in the joins. But again, this cost is a function of using supertypes and subtypes; it is not a function of using enterprise keys. If supertypes and subtypes are used, the cost will be incurred whether intelligent keys, unintelligent keys or enterprise keys are used.
Is the cost worth it? Well, we think nothing of using the next big boost of CPU speed and of main storage to improve the look and feel of our user interfaces. And clearly, object integrity is a far more important benefit than the cosmetics of user interfaces. So no matter what the incremental performance costs of object integrity (and they do not seem to be excessive), I suggest that it is worth that cost to avoid the foreign key ripple effect cost every time we find that we need to add a supertype to our models.
Security Enforcement Cost. Another cost is security enforcement. If data on a customer, for example, is no longer contained in a single table, then DBMS-enforced, table-level or row-level security cannot be used to control access to customer information. Instead, the application must supply the security mechanism to let the user specify security on what are effectively object classes - in this case, customers. Building a new security enforcement mechanism is not an insignificant cost. But it is not a cost of using enterprise keys; it is a cost of the object/relational impedance mismatch and will be eventually solved by the DBMS vendors.
Storage Cost. Another cost is storage space and the consequent additional time that table scans will take against larger tables. Storage space increases when the original intelligent key is retained as a business key, because an additional column - the column for the new primary key - is then added to the reengineered table.
But once again, this is not a cost of using enterprise keys. It is a cost of keeping an original key as a business key and adding a new primary key, and it applies regardless of the kind of primary key we add. In addition, we should remember that keeping the original business keys and adding enterprise keys (or unintelligent keys) does not necessarily increase storage requirements. If some of the original tables used two or more attributes as their primary keys, then all foreign keys defined on those primary keys are going to be pretty big. A single-attribute enterprise key will probably be shorter than the original multi-attribute key. If it is, then although it will increase storage requirements for its own table, it will decrease the storage required for all foreign-key-related tables. This can easily net out to zero or even to an advantage for enterprise keys.
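A back-of-the-envelope calculation shows how this netting-out works. The column widths and row counts below are invented for illustration; the point is only that shrinking every foreign key can more than pay for the new primary key column:

```python
# Assumed widths: a three-attribute business key vs. a single
# enterprise key, with invented row counts.
old_key_bytes = 10 + 8 + 4   # e.g. region code + date + sequence number
new_key_bytes = 8            # single-attribute enterprise key
pk_rows = 1_000_000          # rows in the reengineered table itself
fk_rows = 5_000_000          # rows in all foreign-key-related tables

# Adding the new key column costs space in its own table...
added = pk_rows * new_key_bytes
# ...but every foreign key shrinks from 22 bytes to 8.
saved = fk_rows * (old_key_bytes - new_key_bytes)
net_saving = saved - added
print(net_saving > 0)  # True: the change nets out to an advantage here
```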
Assessment. All these costs must, of course, be weighed against the benefits.4 The benefits are all in the additional semantics gained by using enterprise keys. Because enterprise keys are semantically equivalent to OIDs (object identifiers), their benefits are more likely to be fully appreciated by those familiar with object orientation.
For example, the supertyping described in these articles is known as inheritance in object-oriented contexts. This is because the inheritance of attributes, relationships and table-specific procedures (methods) is the principal benefit which object languages and databases provide to developers, and this benefit derives from supertype/subtype hierarchies.
In addition, if only entity integrity is enforced, then heterogeneous collections will be difficult to create, manage and persist. An example of a heterogeneous collection is a collection of salespersons, their top ten customers and a year's worth of purchase orders from each customer together with line item information for those orders. If all identifiers (primary keys) are guaranteed unique across all these different kinds of things and all have the same syntax, then only one list is needed and only one piece of code is needed to traverse the list. Otherwise, multiple pieces of code will be needed - one for salespeople, one for customers, one for purchase orders and one for order line items.
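The single-traversal benefit looks like this in miniature. The registry, the key values and the table identifiers embedded in them are all hypothetical; the point is that one routine handles every kind of thing because every key is unique and type-neutral:

```python
# One collection keyed by enterprise keys can hold every kind of thing.
# Without enterprise-wide key uniqueness, each kind would need its own
# list and its own traversal code.
registry = {
    "0007010000000001": {"kind": "salesperson", "name": "Pat"},
    "0007020000000042": {"kind": "customer", "name": "Acme"},
    "0007050000000007": {"kind": "purchase_order", "total": 120},
    "0007060000000009": {"kind": "order_line", "qty": 3},
}

def kinds_of(keys, registry):
    """One piece of code traverses the whole heterogeneous collection."""
    return [registry[k]["kind"] for k in keys]

kinds = kinds_of(sorted(registry), registry)
print(len(kinds))  # 4
```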
Eventually, SQL3-compliant, object/relational DBMSs will provide many of these benefits for us. I have already briefly described how I believe they will do so. When they do, joins will no longer be required in SQL statements that retrieve and update objects whose attributes and relationships are stored in multiple physical tables. Also, DBMS-provided security will change from table-and-row-level security to class-and-object-level security.
To wait or not to wait: that is the question. By waiting, some of the costs of meeting these objections can be shifted from the enterprise to the hardware vendors and the DBMS vendors. But while we are waiting, the pressure to replace intelligent keys that are no longer big enough to handle all values in accordance with the rules for forming them will in many cases be irresistible. And if we do somehow resist that pressure until vendors come to the rescue with a reliable implementation of object integrity, we are likely to incur a significant amount of lost opportunity cost.
Resolving a Dilemma: A Warehouse Model or Data Mart Models?
Unintelligent keys are an improvement over intelligent keys. But their drawback is that they make it very difficult to add supertypes to relational models, because: a) primary key values of previously unrelated tables may collide when we try to bring those tables together under a supertype, or b) they cannot be brought together under a supertype at all because their primary keys do not have the same syntax. The only solution is to change one of the colliding primary keys (in the first case) or change the syntax and value of all of the primary keys (in the second case). But either of these changes is terribly expensive because of the foreign key ripple effect.
Because most relational modelers are not aware of the need for enterprise keys, they continue to reengineer to unintelligent keys, thereby constructing data models and databases that cannot evolve through supertyping. But supertyping is unavoidable in enterprise models because the enterprise perspective sees the basic tables of individual data models from a more generalized perspective. In other words, it sees those tables as subtypes of more general tables.
This has forced a dilemma on data modelers. On the one hand, some enterprises have opted for a top-down approach. They fund very lengthy and expensive enterprise modeling projects. The advantage is that all the supertyping will (it is mistakenly supposed) be identified once and for all. Then, once the primary keys of basic tables in individual databases are changed to create a consistent syntax and unique values across all tables in a supertype hierarchy, it will never have to be done again. Advocates of this perspective argue that an enterprise model cannot evolve from individual data mart or ODS models. It has to be created top down, not bottom up.
The disadvantage of the top-down approach is that in large companies, it is a very lengthy and expensive process. Its value is never fully understood outside the IT department, and often not well understood outside the data administration group. As the company funds the project quarter after quarter, often year after year, and the users of IT services never see any tangible benefit, senior management support for the project wavers. The project becomes the prime IT candidate for cost-cutting measures when they inevitably come.
Another disadvantage is that the top-down approach doesn't work, i.e., it doesn't get all the tables right and it doesn't get all the type hierarchies right. Our understanding of our enterprise's business evolves over time; and as it evolves, the way we represent that business in our models will evolve also. Taking two or three years to build a top-down enterprise model will (we hope!) produce a better model than taking two or three months. By these very lengths of time, it is clear that the business is complex and so, therefore, is the model. Consequently, we can be nearly certain that the model will need to continue to evolve.
The alternative approach is the bottom-up approach. In a data warehousing context, it is the approach that says that ODS models and data mart models can continue to be created, even though there is no enterprise model to guide them. The enterprise model, on this view of things, can be created through an evolutionary process. As each ODS and data mart model is created, it is merged into an evolving enterprise model. Eventually, the enterprise model will be complete in scope, and its further evolution will be simply a matter of making the model increasingly more detailed.
The advantage of this approach is that CIOs aren't criticized (or fired!) for funding an immensely expensive project that has no direct, user-visible benefits. The CIO can tell the CEO that the strategic work of rebuilding the enterprise's data infrastructure is being addressed, but that this work is paying for itself as it goes along, in increasingly better individual data models - models that have a direct impact on what is delivered to the users.
The disadvantage of the bottom-up approach stems from the fact that while logical data models can be reworked time and time again at little cost, once they are deployed - i.e., once they are turned into physical data models and those databases are populated - it is very difficult to change them. The reason for this is that the changes required as individual data models are merged nearly always involve supertyping. Supertypes can be added to logical models in a matter of days. But the primary and foreign key changes required to introduce supertypes into physical models and their databases are very expensive. Once again, the greatest obstacle to such changes is the foreign key ripple effect.
The top-down vs. bottom-up debate has continued for so long because there are good arguments for and against each approach. But using enterprise keys avoids the disadvantages of each approach and, therefore, ends the argument. This is because enterprise keys make it possible to add supertype structures to any model and any physical database, at any point in time, without incurring the foreign key ripple cost.
Now we can have the best of both worlds. On the one hand, the enterprise perspective provides breadth of understanding of what an enterprise's data really means. Data models for marts and ODSs will always benefit from this knowledge. On the other hand, the mart/ODS perspective provides depth of understanding. Data models for the enterprise are less likely to follow unfortunate flights of fancy when enterprise modeling is grounded in a detailed understanding of specific subject areas being modeled.
The correct balance between top-down and bottom-up modeling, then, is no longer a technical issue. Instead, it becomes a political and cultural issue. In a corporate culture that doesn't like to make a move until every detail is nailed down, an enterprise modeling group within IT can be a highly visible cost item on the corporate budget. In the opposite kind of corporate culture, there will be no enterprise modeling group, but modeling for each project will be done with the enterprise perspective always in mind and will result in incremental improvement to that model.
Enterprise keys make all this possible. Unintelligent keys which are not enterprise keys make model evolution impossible, because adding supertypes to models built on unintelligent keys will trigger the foreign key ripple effect. In doing so, unintelligent keys ensure that the enterprise perspective on data will never make it from data administration's beautiful logical model into the physical models of the enterprise's working databases.
1 Johnston, Tom. "Primary Key Reengineering Projects: The Problem." DM Review. February 2000. P.60.
2 An increasingly popular alternative to modeling things like customers and employees is to consider each to be a person in a role. But the use of roles in modeling is a topic for another time and does not affect the argument here.
3 I am referring here, of course, to SQL DML (the data manipulation language part of SQL), not SQL DDL (the data definition part).
4 Many of these cost objections to using enterprise keys were clarified in discussions with Tom Cummings, of MCI.