The advent of data models has been the impetus for enormous progress in data management. Conceptual advances have provided a framework for elaborating database design, and wonderful tools have enabled data professionals to design databases in practice. Today, data modeling is viewed as a necessary skill in data management, and rightly so. Underlying this, there is the assumption that a data model can capture all the information about the design of a database. This assumption is rarely questioned, but is it true? This is not just a question of whether different data modeling approaches yield different levels of accuracy about how an enterprise sees the information in a particular subject area. Rather, it is about whether any data model can truly specify all the design information for a database. I would submit that there are real limits to what data models can do, and failure to understand these limitations can result in data management problems at a numbers of levels.
Data modeling and data administration in general are usually focused on the "logical" level. In many ways, this is a good thing. We want a logical data model to represent how the business truly views its data. There can be arguments for why we have to denormalize a data model to implement it as a particular database. Typically, such arguments are based on the need for system performance, and we can argue about how valid they are. However, a logical data model is still required to understand how the business sees the data.
Figure 1: Data Model Fragment for a Hospital Admissions Database
Whatever kind of data model we end up with, most data modeling tools have the functionality to instantiate a physical database in any of the common platforms. In this very strict sense, therefore, a data model does contain enough design information to create tables, columns, keys, constraints, etc.
Once a database is physically implemented, data modelers tend to think that their work is over. They may have a continuing responsibility for the structure of the database, but the content of the database is nearly always seen as solely the responsibility of the business users. This divide between the logical and the physical may appear to be quite justified, but what would happen if some of the data values the users populated into the database acted to specify a layer of design? Such an idea seems silly; yet, there is a category of data that exists in every database and acts to specify design elements that organize and constrain the other information in the database. It is called reference data.
Reference data is a class of data that has many synonyms. It is commonly referred to as code tables, lookup tables or domain values. In a data model, these entities look unimpressive. They typically contain two attributes - a "code" that is a primary key and a description. In a physically implemented database, they seem equally simple. There are usually just a few rows in these tables, and these contain data that typically changes infrequently. In fact, it is often referred to as static or slowly changing.
Figure 2: Revised Data Model Fragment for a Hospital Admissions Database
However, reference data has some important properties that are not shared by other categories of data. One of them is that the code values often have definitions. For instance, the code "Gold" in a record in a Customer Credit Type table may be used to identify a customer who can be given 90 days to pay their bill. Even in a Country table, it may be necessary to have definitions to tell us, for instance, whether U.S. includes Puerto Rico, or whether People's Republic of China includes Hong Kong. Only metadata - at least the metadata for entities and attributes in a data model - shares this need for semantics. All other categories of data lack the need to assign meanings to physical data values. The Customer Jane Doe has no semantic meaning, nor does the Product Widget A.
In my book Managing Reference Data in Enterprise Databases, I defined reference data as any kind of data that is used solely to categorize other data found in a database or solely for relating data in a database to information beyond the boundaries of the enterprise. This definition is based on the role that reference data plays at a high level in a database. It is within the general role of categorizing other data that reference data is used to specify design in a physically implemented database.
How do reference data values specify database design? One of the ways is by effectively replacing attributes in entities. Consider the data model fragment in Figure 1. The entity Person records all persons of interest to the hospital, and the entity Hospital Admission tracks a patient's stay in the hospital. Within the Hospital Admission entity, there are three attributes that represent different roles for a person: Patient ID, Primary Care Physician ID and Assigned Nurse ID. A Patient ID and an Assigned Nurse ID must always be specified, while Primary Care Physician ID is optional.
A data modeler might wonder if there could be additional roles for Person in the future. This could lead to a more generalized design as shown in Figure 2, which introduces the reference data entity Person Role. When implemented as a database table, it will contain one record for each of the possible roles that a person can play: patient; primary care physician and assigned nurse. In Figure 1, the attributes Patient ID, Primary Care Physician ID and Assigned Nurse ID were explicitly identified and (hopefully) had complete and accurate definitions. In Figure 2, they have been removed. We can tell that there are person roles from Figure 2, but we can only know what they are by looking at the data values in records in the physically implemented Person Role database table. Thus, design information has been moved from the logical level to reference data values at the physical level.