In the past several months, I introduced the concepts of the business dimensional model (BDM), which represents the business view of the data and how users will interact with it. This month, I will begin discussing how to move that business view forward on a project. If you are using a multidimensional database management system (MDBMS), then you already have what you need. You will need to enter your BDM constructs into the MDBMS meta data. However, if you are implementing your dimensional model in a relational database engine, you have more design work to do.
You need to translate the BDM into logical table structures. This is a relatively straightforward process. In its simplest form, each business dimension translates to a single dimension table, and each fact group translates to a fact table. This month we will look at translating the dimensions. These are the basics, so keep in mind that reality is rarely this simple.
The first thing you need to do when you translate the business dimension into a logical table is include a surrogate key. This surrogate key is a unique non-meaningful identifier that is assigned to each row in that dimension. The reasons to use a surrogate key include the following:
You need to isolate your data warehousing environment from fluctuations in your operational systems. Over time, production key values can change within an operational system. Production keys are typically unique at any given point in time. However, if you look across history, they may be reused. Universal product codes (UPCs) are a good example of this. They are unique at any point in time; however, they are reused after several years. Therefore, you do not have uniqueness across time in the data warehouse environment. The UPC should be treated as a descriptive attribute, not a key.
As you are preparing your data, you may identify special cases. You may receive a production key that has no associated reference data. Sometimes, you obtain the production key and several descriptive fields on the incoming data. Perhaps the descriptive data does not map to the reference data that you have previously mapped to that production key. Or, perhaps the production key is missing completely. Some organizations prefer to place data with these error conditions into suspense and not include it in the data mart. Other organizations want all data loaded, regardless of its condition. If you need to load the data, you must include a key from every dimension. This means that you must build an appropriate row in the dimension to represent each of these special conditions. This can be an initial step to increase awareness of data quality issues.
These cases are identified in the extract, transform and load (ETL) process and do not exist in the underlying source systems. In order to include these special cases in your dimension, you need an identifier that is different from any of the production keys. Your first instinct may be to ask the source system team to reserve several key values for your use. My experience has been that it is only a matter of time before those reserved key values are used for some clever purpose for the operational system itself, thus creating a problem because real data begins to flow into the data warehouse environment with a production key you are already using for a different purpose.
The most compelling reason to use surrogate keys is to support changes in the dimension data over time using a technique referred to as a type 2 slowly changing dimension. The essence of this technique is to generate a new version of the dimension row when the descriptive data changes. I encourage you to read Ralph Kimball's work in which he discusses slowly changing dimensions (The Data Warehouse Lifecycle Toolkit or The Data Warehouse Toolkit 2nd Edition).
Use of surrogate keys in the data warehouse environment provides a layer of isolation between you and the operational systems. The operational system team does not impose restrictions on your key structure; likewise, the data warehouse team does not impose artificial restrictions on handling production keys.
Each attribute in your business dimension will translate to one or more logical columns; usually, this is to accommodate variations of a description. For example, a product may be represented with a short and a long product description. The long description provides a complete description without the abbreviations used for the short description. This short description is much more useful for row and column headings in reports. If there is also a need for a product code, that would be included as a separate attribute in the business dimension and then stored as a separate column.
Figures 1 and 2 show an example of this translation for a simple customer dimension. Figure 1 shows the BDM notation. Figure 2 shows the logical table structure that would be a simple translation.
Figure 1: Customer Dimension BDM Notation
Figure 2: Customer Dimension Logical Table
Next month, I will extend this process to the fact groups.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access