We are having a little debate here regarding the use of codes along with their related descriptions in our the relational dimensions (not the OLAP dimensions). For example, we have an office dimension which consists of Miami, New York and Chicago. There are some that believe we should also bring in the OLTP codes for those three offices, which are 1, 2, and 3 respectively. They saying it increases performance in their extracts and when joining back to the source table to populate the fact table. We are trying to standardize the approach and process.

In all the documentation and classes we have gone to – they have never used codes. Have you seen them used and if so, was it for performance reasons (performance in building the fact table, performance in the extract)? We realize that sometimes in class they don't teach you "real world" scenarios. Any help would be appreciated. Any links to any articles would also be appreciated.


Les Barbusinski’s Answer: In the real world all kinds of things go into dimension tables. Codes to create uniqueness, "flags" to provide visual cues for power users and developers, "tokenized" numeric IDs to improve the performance of BI queries and/or ETL scripts, etc. For example, the Customer dimension in most data marts is a composite of data gleaned from several source systems. In such cases, I always embed the Customer ID from each of the source systems in the dimension table. Although the BI reports that use the Customer dimension never use these IDs, having them there markedly improves the performance of the matching logic in the ETL scripts that maintain this dimension.

Codes (especially standardized, abbreviated or widely known codes) are often more recognizable on reports than the corresponding descriptions. Also, codes most definitely improve the performance of joins and are more conducive to enforcing uniqueness. It’s all just common sense. Good luck.

Chuck Kelley’s Answer: I use both of them (codes and descriptions/names) in my dimensions. When building the fact, I continue to use both of them for the lookups to the dimension. Why? Because, what if 3 changed to be Phoenix and 4 was made to be Chicago. If you just used the numbers, you would begin associating Phoenix data with the Chicago tag. Since it is always (well at least 99.9999 percent of the time) better to query with names, then the codes won’t matter and you will always get the correct Chicago data if you use both code and descriptions. You will get mixed data if you don’t. While you may not think that codes and descriptions *ever* change, you would be living dangerously not to use both. Remember, your organization may merger with another organization and you will have to somehow merge data. Having both there will make it easier.

Joe Oates’ Answer: I gather from the question that the codes are used as "natural keys" to access the descriptive information. The discussion of whether to use natural keys or surrogate keys in a data warehouse has already been covered in this column. Please refer to http://www.dmreview.com/master.cfm?NavID=55&EdID=4892. Here are two other links on the Web that you should look at: http://www.rkimball.com/html/articles.html, http://www.sapior.com/Surrogate_Keys/surrogate_keys.html.

Clay Rehm’s Answer: I believe in having as many identifying attributes as possible, and ways to tie back to source systems are an added benefit. I would store the OLTP codes in the dimension as non-key attributes.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access