The term reference data means different things to different people. The definition used here is: Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.
This definition means that reference data tables are what are commonly called "lookup tables" or "domain values." These are typically tables with a key column containing a code and another column containing a description. Sometimes reference data tables can have additional columns, and these may include foreign keys from other reference data tables. Examples of reference data tables include country, currency, customer type and product category. Reference data tables rarely contain more than 200 records and often contain fewer than 20.
Reference data has an importance that approaches meta data in terms of the significance of individual data values. In part, this is because reference data is used to control business rules. If a business rule includes an actual data value, this will nearly always be reference data (e.g., a code). Another reason that reference data is important is that it defines both formal subtypes and hidden subtypes. Formal subtypes use category discriminators to identify the subtype to which records in the supertype belong. The category discriminators' values are housed in reference data tables. However, hidden subtypes are much more frequent in databases. These are subsets of records in a database table that are processed in ways that are different from the other records in a table. Perhaps a unique set of business rules applies to them or the records are reported separately from the other records in the tables. Hidden subtypes are identified by reference data values. Many important tables, such as customer or product, are surrounded by clusters of many reference data tables so that the hidden subtypes in product or customer can be identified and processed.
Reference data, therefore, acts like meta data in certain respects, but it is not treated like meta data. IT professionals seem to view defining a database as an IT task, but populating it with data is the job of the business users. From this viewpoint, reference data should be the responsibility of the users. In reality, however, reference data tables must be populated long before an application goes live - even prior to testing; therefore, this task is often performed by IT staff. Unfortunately, there is sufficient ambiguity and such a desire to differentiate between the roles of IT and the business users that reference data tables are often populated with little thought. The result is that the reference data is rarely "normalized," which is to say that the reference data values are not standardized in terms of domains they are supposed to describe. IT professionals pay great attention to the normalization of database design, but not to normalization of the contents of databases. Reference data tables typically constitute anywhere from 20 to 50 percent of a database in terms of numbers of tables. Therefore, a database whose design is in third normal form can still have up to half of its tables populated in such a chaotic manner that it is difficult to build an application on it or to utilize the data that it contains. If at least part of the effort that is typically spent on normalizing the database design were spent on normalizing the reference data, such situations could be avoided.
There are a number of ways in which reference data can be normalized. If these rules are applied, reference data tables can be populated in a clean and consistent manner that parallels and supports the normalization of the database design. Let us look at these rules one by one.
Definition of All Required Reference Data Tables
It is not uncommon to find database tables such as product or customer that contain code columns which should be foreign keys but for which no parent reference data table exists. The code values, therefore, exist only in the database table where they are found. If a new code must be added, it can only be added as part of a new record. If a list of the existing set of codes is required, it can only be made by extracting the codes from the table where they are found. Because no parent table exists, no descriptions are available for the codes, and no meta data is available for them (e.g., who added or updated the codes and when this was done).
Lack of parent reference data tables also means that a reporting dimension is missing from the database. This may not be a problem for transaction processing, but it is a problem for any reporting - especially if data must be reused in a data mart or warehouse.
Lack of defined reference tables appears to occur mostly in operational systems where programmers add a code column to an existing table for some newly discovered requirement. Adding the corresponding reference table is typically unnecessary for the functionality that the programmers are creating at this point in time. Unless the codes are directly meaningful to users, a major consequence is that the programmers must put the corresponding descriptions in program code. Defining data values in program source code is an unacceptable practice but, unfortunately, one that is all too common.