The advent of data models has been the impetus for enormous progress in data management. Conceptual advances have provided a framework for elaborating database design, and wonderful tools have enabled data professionals to design databases in practice. Today, data modeling is viewed as a necessary skill in data management, and rightly so. Underlying this, there is the assumption that a data model can capture all the information about the design of a database. This assumption is rarely questioned, but is it true? This is not just a question of whether different data modeling approaches yield different levels of accuracy about how an enterprise sees the information in a particular subject area. Rather, it is about whether any data model can truly specify all the design information for a database. I would submit that there are real limits to what data models can do, and failure to understand these limitations can result in data management problems at a numbers of levels.

Data modeling and data administration in general are usually focused on the "logical" level. In many ways, this is a good thing. We want a logical data model to represent how the business truly views its data. There can be arguments for why we have to denormalize a data model to implement it as a particular database. Typically, such arguments are based on the need for system performance, and we can argue about how valid they are. However, a logical data model is still required to understand how the business sees the data.

Figure 1: Data Model Fragment for a Hospital Admissions Database

Whatever kind of data model we end up with, most data modeling tools have the functionality to instantiate a physical database in any of the common platforms. In this very strict sense, therefore, a data model does contain enough design information to create tables, columns, keys, constraints, etc.

Once a database is physically implemented, data modelers tend to think that their work is over. They may have a continuing responsibility for the structure of the database, but the content of the database is nearly always seen as solely the responsibility of the business users. This divide between the logical and the physical may appear to be quite justified, but what would happen if some of the data values the users populated into the database acted to specify a layer of design? Such an idea seems silly; yet, there is a category of data that exists in every database and acts to specify design elements that organize and constrain the other information in the database. It is called reference data.

Reference data is a class of data that has many synonyms. It is commonly referred to as code tables, lookup tables or domain values. In a data model, these entities look unimpressive. They typically contain two attributes - a "code" that is a primary key and a description. In a physically implemented database, they seem equally simple. There are usually just a few rows in these tables, and these contain data that typically changes infrequently. In fact, it is often referred to as static or slowly changing.

Figure 2: Revised Data Model Fragment for a Hospital Admissions Database

However, reference data has some important properties that are not shared by other categories of data. One of them is that the code values often have definitions. For instance, the code "Gold" in a record in a Customer Credit Type table may be used to identify a customer who can be given 90 days to pay their bill. Even in a Country table, it may be necessary to have definitions to tell us, for instance, whether U.S. includes Puerto Rico, or whether People's Republic of China includes Hong Kong. Only metadata - at least the metadata for entities and attributes in a data model - shares this need for semantics. All other categories of data lack the need to assign meanings to physical data values. The Customer Jane Doe has no semantic meaning, nor does the Product Widget A.

In my book Managing Reference Data in Enterprise Databases, I defined reference data as any kind of data that is used solely to categorize other data found in a database or solely for relating data in a database to information beyond the boundaries of the enterprise. This definition is based on the role that reference data plays at a high level in a database. It is within the general role of categorizing other data that reference data is used to specify design in a physically implemented database.

How do reference data values specify database design? One of the ways is by effectively replacing attributes in entities. Consider the data model fragment in Figure 1. The entity Person records all persons of interest to the hospital, and the entity Hospital Admission tracks a patient's stay in the hospital. Within the Hospital Admission entity, there are three attributes that represent different roles for a person: Patient ID, Primary Care Physician ID and Assigned Nurse ID. A Patient ID and an Assigned Nurse ID must always be specified, while Primary Care Physician ID is optional.

A data modeler might wonder if there could be additional roles for Person in the future. This could lead to a more generalized design as shown in Figure  2, which introduces the reference data entity Person Role. When implemented as a database table, it will contain one record for each of the possible roles that a person can play: patient; primary care physician and assigned nurse. In Figure 1, the attributes Patient ID, Primary Care Physician ID and Assigned Nurse ID were explicitly identified and (hopefully) had complete and accurate definitions. In Figure 2, they have been removed. We can tell that there are person roles from Figure 2, but we can only know what they are by looking at the data values in records in the physically implemented Person Role database table. Thus, design information has been moved from the logical level to reference data values at the physical level.

It is not only attribute metadata that has been removed from Figure 1 when we look at Figure 2. Each person role attribute in the Hospital Admission entity in Figure 1 has a relationship with the Person entity. Each of these relationships has its own optionality and cardinality. In Figure 2, all the relationships are collapsed into the three identifying relationships between Person Role in Hospital Admission and its parent entities. In Figure 1, we can model the facts that a Hospital Admission must specify patient and assigned nurse, but not necessarily primary care physician. This cannot be done in the data model in Figure 2.

One convenience of this design is that it is easier to create another person role, e.g., Assigned Clergyman. In Figure 1, this would involve the addition of a new attribute in the data model and a new column in the physical database. In Figure 2, we simply need to add a new record into the table Person Role. This is only easier in the sense that nobody is forced to go through the steps to capture the relevant metadata and implement the requirement as a design change in the database. In other words, we are bypassing knowledge management for our database design.

It might be argued that the example given is an exception. Yet, reference data is pervasive in implemented databases. Anywhere from 20 percent to 50 percent of the database tables in an operational database will be reference data tables. Until proven otherwise, every record in a reference data table should be assumed to be the equivalent of an attribute that has been moved from the logical to the physical level. This may seem to be an odd assertion. Yet, records in reference data tables have meanings, and this is a clue that they really are the equivalent of attributes.

One of the great problems of reference data management is the failure to assign definitions to reference data values. We nearly always see simply a code and a description. No data modeler would add an attribute to a data model without providing a definition. This is typically what happens to the equivalent of attributes in reference data tables.

The thought that a data model can be used to understand the structure of a database breaks down in other ways when reference data is involved. Consider how a user would extract a list of primary care physicians from a database designed using the model in Figure 1. The simplest SQL that could be written would be:

SELECT Primary_Care_Physician_ID FROM Hospital_Admission

All of this can be understood from the data model. In Figure 2, the simplest SQL would be:

SELECT Person_ID FROM Person_Role_in_Hospital_Admission

WHERE Person_Role_Code = "Primary Care Physician"

Clearly, the user has to know an additional fact in Figure 2 - what the value of Person Role Code is that identifies a primary care physician. Unfortunately, the paradigm of data models and data administrators only concerning themselves with the logical level, leaving the business users to deal with data values is very unhelpful here. The business user is not going to find out the facts about Person Role Code from the data model, but needs this information to understand the database. Data models and databases are full of reference data tables, so this problem is pervasive. It should come as no surprise that enterprises cannot understand their data. They never will simply by aggregating logical data models.

The artificial divide between the logical level of design metadata and the physical world of data values represents a huge problem for data administration because reference data values embody database design. The more reference data tables a data model contains, the more generalized its design, and the more it depends on the physical data to specify its implemented design. The extreme of this can be seen in certain commercial products that may have many reference data tables. These products are "configured" when they are installed for individual clients partly by adding records to and selecting records from reference data tables. The client thinks they are getting a uniform product that is the same as is installed for other clients. In theory, they are - there is just one data model. In reality, the design is being controlled at the physical level, and the installation is unique.

Emerging trends in master data management and enterprise information architecture are beginning to try to bridge the logical/physical divide. We desperately need better tools and techniques to deal with it at the level of database design.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access