APR 1, 2004 1:00am ET

Related Links

When Fast is Not Enough
July 18, 2008
TopQuadrant Software Imports Email MetaData into Semantic Applications
March 26, 2008
An Open Challenge to the Open Source Community
November 30, 2007

Web Seminars

How to Run a Successful Bring Your Own Device (BYOD) Scheme
June 18, 2013
IBM MobileFirst Management: Empower Your Mobile Workforce
June 25, 2013
Hybrid Cloud Storage: Getting the Best of Two Worlds
June 26, 2013

Normalizing Reference Data

Print
Reprints
Email

For many years, the principles of normalizing database design have been well understood and are increasingly practiced. Yet many databases are created in third normal form, or even higher normal forms, only to be populated with reference data in a chaotic and nonstandard manner. This can have a negative impact that offsets the main advantage of a normalized database design, which is that the database design only needs to change if there are changes in the subject area that it supports. A strong case can be made that reference data needs to be normalized in its own special way if a database is to provide the best possible return on the investment in it.

The term reference data means different things to different people. The definition used here is: Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.

This definition means that reference data tables are what are commonly called "lookup tables" or "domain values." These are typically tables with a key column containing a code and another column containing a description. Sometimes reference data tables can have additional columns, and these may include foreign keys from other reference data tables. Examples of reference data tables include country, currency, customer type and product category. Reference data tables rarely contain more than 200 records and often contain fewer than 20.

Reference data has an importance that approaches meta data in terms of the significance of individual data values. In part, this is because reference data is used to control business rules. If a business rule includes an actual data value, this will nearly always be reference data (e.g., a code). Another reason that reference data is important is that it defines both formal subtypes and hidden subtypes. Formal subtypes use category discriminators to identify the subtype to which records in the supertype belong. The category discriminators' values are housed in reference data tables. However, hidden subtypes are much more frequent in databases. These are subsets of records in a database table that are processed in ways that are different from the other records in a table. Perhaps a unique set of business rules applies to them or the records are reported separately from the other records in the tables. Hidden subtypes are identified by reference data values. Many important tables, such as customer or product, are surrounded by clusters of many reference data tables so that the hidden subtypes in product or customer can be identified and processed.

Reference data, therefore, acts like meta data in certain respects, but it is not treated like meta data. IT professionals seem to view defining a database as an IT task, but populating it with data is the job of the business users. From this viewpoint, reference data should be the responsibility of the users. In reality, however, reference data tables must be populated long before an application goes live - even prior to testing; therefore, this task is often performed by IT staff. Unfortunately, there is sufficient ambiguity and such a desire to differentiate between the roles of IT and the business users that reference data tables are often populated with little thought. The result is that the reference data is rarely "normalized," which is to say that the reference data values are not standardized in terms of domains they are supposed to describe. IT professionals pay great attention to the normalization of database design, but not to normalization of the contents of databases. Reference data tables typically constitute anywhere from 20 to 50 percent of a database in terms of numbers of tables. Therefore, a database whose design is in third normal form can still have up to half of its tables populated in such a chaotic manner that it is difficult to build an application on it or to utilize the data that it contains. If at least part of the effort that is typically spent on normalizing the database design were spent on normalizing the reference data, such situations could be avoided.

There are a number of ways in which reference data can be normalized. If these rules are applied, reference data tables can be populated in a clean and consistent manner that parallels and supports the normalization of the database design. Let us look at these rules one by one.

Definition of All Required Reference Data Tables

It is not uncommon to find database tables such as product or customer that contain code columns which should be foreign keys but for which no parent reference data table exists. The code values, therefore, exist only in the database table where they are found. If a new code must be added, it can only be added as part of a new record. If a list of the existing set of codes is required, it can only be made by extracting the codes from the table where they are found. Because no parent table exists, no descriptions are available for the codes, and no meta data is available for them (e.g., who added or updated the codes and when this was done).

Lack of parent reference data tables also means that a reporting dimension is missing from the database. This may not be a problem for transaction processing, but it is a problem for any reporting - especially if data must be reused in a data mart or warehouse.

Lack of defined reference tables appears to occur mostly in operational systems where programmers add a code column to an existing table for some newly discovered requirement. Adding the corresponding reference table is typically unnecessary for the functionality that the programmers are creating at this point in time. Unless the codes are directly meaningful to users, a major consequence is that the programmers must put the corresponding descriptions in program code. Defining data values in program source code is an unacceptable practice but, unfortunately, one that is all too common.

Filed under:

Advertisement

Comments (0)

Be the first to comment on this post using the section below.

Add Your Comments:
You must be registered to post a comment.
Not Registered?
You must be registered to post a comment. Click here to register.
Already registered? Log in here
Please note you must now log in with your email address and password.

Where do young IT professionals (30 and under) obtain information to aid with daily role responsibilities and career development?

Trade publication websites 14%
Social media 23%
Vendor websites 4%
Vendor/community forums 7%
Newsletters 1%
Trade conferences/meetups 2%
RSS feeds 6%
Web search 44%

 

Twitter
Facebook
LinkedIn
Login  |  My Account  |  White Papers  |  Web Seminars  |  Events |  Newsletters |  eBooks
FOLLOW US
Please note you must now log in with your email address and password.