In the world of data management, reference data remains largely unexplored terrain. Its existence is known to IT professionals, and it is a territory all must pass through on their journey to successful database design. Yet most travelers hurry along to the more interesting places where they can find the data that has greater significance to business users – data on parties, roles and activities that combine to generate the transactions of the enterprise. However, it is worthwhile to survey reference data and push beyond the limited horizons that constrain our understanding to gain a better appreciation of this long-neglected area.
One of the most striking things that appears when reference data is examined is its structural regularity. In most databases, reference data consists of many small tables each containing a single primary key column, usually a code or acronym. (See Figure 1.) There are a few nonkey columns, which nearly always include a long text column that is a name. Countries, currencies, industry classifications and credit ratings are all examples of reference data that fit this pattern. These tables tend to have relatively few rows and change infrequently, at least when compared to the rest of the database. This structural simplicity, low data volume and slow rate of change seem to be the reasons why reference data is generally regarded as unexciting. However, if we look at the behavior and usage of reference data, patterns emerge that suggest a distinct class of data that has unique features and requires special management.
One of the most common functions of reference data is to describe things that do not really interact with the enterprise, but which are needed to put the enterprise's data in a broader context. Customers, products and employees are examples of data that is used to structure the transactions of the enterprise. They are parties to these transactions. Transaction activity of the enterprise is recorded in entities such as orders, shipments and payments. Reference data can be used to relate the information captured in transaction structure and activity to the wider world. Country, state, time zone and currency are examples of things that do not belong to the enterprise and are not parties to its transactions, yet are used as dimensions for reporting and for creating business rules. This kind of reference data is found over and over again in different databases across enterprises. Indeed there are international standards for some of this data, such as the International Organization for Standardization's ISO 3166 for country codes and ISO 4217 for currency codes. The concept of international standards defining values simply does not apply to customer, payment, employee and personnel action, providing strong evidence that reference data really is different than other kinds of data found in databases.
Not all reference data describes things that are truly external to the enterprise. One of the more important classes of reference data is meta data implemented as data. For instance, entity subtypes are usually associated with a category discriminator that has values held in a reference data table. These are often called type codes. Figure 2 shows a design to hold information for persons who may be male or female. Characteristics common to all persons occur in the Person table, while uniquely male characteristics occur in the Male Person table and uniquely female characteristics are found in the Female Person table. More important for our discussion is the Gender table (some would call it "Person Type") that provides the category discriminator and contains two records – one with the code "M" and description "male" and another with the code "F" and description "female." How strange it is to have data values that are known when a database is being designed, months or perhaps years before it is implemented.
Figure 2: Subtyping and Use of a Category Discriminator
Another important use of reference data is to hold classification schemes. These are ways of categorizing information held in a database. Suppose a zoo creates a database holding an inventory of its animal collection. It may wish to classify the animals according to their taxonomic groupings (mammals, reptiles, etc.), how endangered they are, whether trade in them is permitted, whether they are considered dangerous, and so on. There is no end to these classification schemes; and every person in the world could construct their own listings of animals they like a lot, animals they dislike a little and animals they do not like at all. A visitor from another planet, knowing nothing of this, would only be able to see observable facts – that a lion has fur, teeth, whiskers and claws – and would be unaware of the infinite number of classification schemes which people could construct on these facts. This boundlessness gives rise to one of the few areas in which reference data does become volatile. Over time, notions of what is important about data change and so do the reference data tables that hold the corresponding classifications. For instance, during the oil crisis of the 1970s many economic activities were classified according to the amount of energy they consumed. As these problems receded, others emerged; and schemes were introduced to classify economic activities in terms of their harm to the environment. None of this is to say that all classification schemes are personal and transitory. Some are very stable and widely used, such as Standard Industry Classification (SIC) codes. However, the characteristics of classification schemes show how different they are to other types of data found in a database and add to the understanding of reference data as a very diverse grouping.
So far we have been discussing reference data in terms of identifiable things; but there is a type of reference data, called constant values, that exists only as nonprimary key attributes. Constant values are tax rates and ends of financial years which are given to an enterprise, often by some regulatory authority, and whose values the enterprise cannot change. The enterprise typically employs these values in business rules that perform calculations and trigger events. They are prone to change over time, albeit slowly. In a database design, they tend to be housed within reference data tables that describe the authority that controls them. Just as commonly, they are placed in a single table of global "default values" for the entire database. It is not unusual for database designers to forget that these values can change and omit an effective date from the primary keys of the tables that hold them. Thus, only current values can be stored, which causes problems regarding audits and the reconstruction of historical data.
This brief tour of the landscape of reference data shows that although the database constructs that house it may be small and simple, it really is different than the other data found in a database. We can attempt to define it as follows: Reference data is any kind of data that is used solely to categorize other data found in a database or solely for relating data in a database to information beyond the boundaries of the enterprise.
This definition differs from some other common usages of the term reference data. It is sometimes used to describe any data that a system "looks up" in some external database. For example, an order entry system may have no mechanism to record customer data, all of which must be on a remote customer database. However, customers are parties to the transactions of an enterprise and thus bound to it in a way which true reference data is not. Another common usage of the term reference data is to describe any data that is acquired from sources external to the enterprise. For example, data on consumer credit history may be purchased by a company. Yet this is often detailed transaction activity data and, again, quite different from the definition of reference data used here.
Having seen something of the diversity of reference data, it is now time to look at the important characteristics that unify it as a class of data. Chief among these is the scope of its usage. Reference data commonly appears as foreign keys in many database tables. Individual values are also found in many rows of these tables. At a higher level, the same reference database tables are found over and over again in different databases of an enterprise and even across different enterprises. Thus at the levels of architecture, design and implementation, reference data has a wide scope. This is why it is feared by many IT professionals. With such wide scope, there can be zero tolerance for defects. Infor-mation quality must be 100 percent.
Another strange characteristic of reference data is that its values can have meaning. The value of bank account #123245 has little semantic content beyond the attribute definition of bank account number. Yet every entry in a classification scheme must have a definition. If a customer's credit is to be classified as good, bad or indifferent, how will people know how to assign these values? Even beyond classification schemes, meaning of values is important. For instance, does China's ISO country code include Hong Kong? If we have a category discriminator for employee position that distinguishes administrative employee positions from professional employee positions, what do we do with long-term computer consultants or short-term secretarial help? The lesson here is that there is additional meta data that must be attached to individual reference data values, something that is not the same with other data in a database.
An extension of the high degree of meaning embedded in reference data values is the fact that they often turn up as values in business rules, e.g., "if the customer's credit rating is bad, then decline their credit application." This observation can also be made from the perspective of the feared scope problems of reference data. IT professionals recognize that individual reference data values are often typed into program code, since they play a role in the logic. Who knows where to find them or can understand what they are doing there?
Just as there are more varieties of reference data than have been discussed here, there are additional unifying characteristics. These include the fact that reference data comprises nearly all the text that needs to be translated in multilingual applications; the fact that reference data values undergo life cycles where one value is succeeded by another; and the need to map between different schemes that represent the same reference data. There are more, but what is really important to consider is that there is a strong case for managing reference data as a distinct component of an enterprise's information architecture. There are already a few enterprises that are doing this, probably more as a response to data quality problems in reference data than anything else. However, such efforts are likely to be sporadic until there is a better conceptual understanding of reference data and some real vendor support for its management.
This has been a brief tour of reference data. Important places of interest have been visited, but many more need to be explored. The ground covered lies somewhere between meta data and the database components that hold information on transaction structure and activity. Hopefully, the topography is now a little more clearly delineated and what has been revealed is worthy of greater attention.
Malcolm Chisholm, Ph.D. has over 25 years of experience in enterprise information management and data management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management, and business rules. His experience includes the financial, manufacturing, government, and pharmaceutical industries. He is the author of How to Build a Business Rules Engine and Managing Reference Data in Enterprise Databases and Definition in Information Management. He writes numerous articles and is a frequent presenter on these topics at industry events. Chisholm runs the websites http://www.bizrulesengine.com, http://www.refdataportal.com and http://www.data-definition.com. Chisholm is the winner of the 2011 DAMA International Achievement Award.