What table key method do you use in your dimensional models?
One of the first data modeling decisions a data warehouse data architect and project manager will make is deciding what type of key fields to use in a dimensional model. This decision centers on whether to use the production key fields from the source systems, sometimes referred to as the natural keys, or to incorporate the use of surrogate keys as an alternative. While this decision may at first seem relatively trivial, it has a significant impact on the methods and tools that will be used for maintenance and information delivery with the data warehouse.
This column explores some of the factors a data warehousing project should consider when selecting a key field method for a dimensional data model. Technical and business considerations around ETL (extract, transform and load), databases, reporting and other areas concerning the selected table key field method are examined. It will be up to you to determine which key field method best integrates into your environment.
How do you decide between using a production or surrogate key in a data warehouse dimensional model? Your decision depends on a number of factors. Surrogate key use offers several advantages from a dimensional modeling perspective. It provides a layer of insulation in environments where frequent changes to source system production keys can occur. Because the dimensional model uses surrogate keys, only the mapping (production key(s) --> surrogate key) is affected and not the key structure of your dimension table and alteration to the fact table.
There are, however, a couple of cases where the use of surrogate keys should always be considered. Our first case is when a data warehouse environment has many source systems that are not integrated with each other, creating their own instance of core key items such as a customer, product or location. In these situations, the use of surrogate keys becomes a necessity in order to integrate the data into dimensional tables. Case two is when establishment of uniqueness for a dimension table from a source system(s) requires the use of many fields or very large fields (e.g., URLs). In this infrequent scenario, the sheer number and/or size of fields needed require/s too much maintenance or processing time. In this type of case, a single fixed length surrogate key on a dimension table offers database performance advantages in this type of case. Case three is when a primary source system can reuse production keys, such as in the retail industry with product or stock keeping unit (SKU) number. Figure 1 illustrates this type of scenario where the product number, the source system primary key, does not change due to either a description change or product reclassification.
Figure 1: Reuse of Production Keys
How are surrogate keys established in a dimensional model? This is typically handled through an identifying relationship between a dimension table and a fact table with a one-to-many parent-to-child cardinality. The dimension table uses an integer surrogate key field as its primary key. This surrogate key field becomes a foreign key in the fact table. Using slowly changing dimensions type 2 processing, during any subsequent load cycles to the data warehouse where a field value changes for the production key (product number in Figure 2), a new record is added to the customer dimension table to capture this information (see product key 10004 and 10005 in Figure 2). If no new fact-related measures are available for this product during this load cycle, the change in the product dimension can be skipped and picked up in subsequent load cycles (maintaining 1:M).
You would not want to use surrogate keys in a dimensional model when the data warehouse environment's size and/or the number of sources are small and deemed stable. Examples are companies that are relatively small or have transitioned to integrated enterprise resource systems for financial, human resource, supply chain, budget and customer relationship management. Your window of time for processing updates to your data warehouse is very small or near-real time. The mapping of natural key(s) to the surrogate key in the dimensional model requires additional processing and it adds complexity to the ETL process thus it is considered unnecessary. Your particular dimensional database is very large, and to increase the efficiency of ETL processing, production keys are used. You are using the data fragmentation strategy capabilities of your relational database product to increase overall data deployment and thus access efficiency. Database fragmentation methods allow data to be uniformly spread across multiple disks based on a predefined distribution method (expression, round robin, hash, range) thereby avoiding I/O bottlenecks. Some database products require table fragmentation to be established in relationship to table keys. Database fragmentation strategies are most efficient when segmented through the use of production keys versus surrogate keys. Additionally, it is typically very difficult to change the fragmentation strategy in a database once it has been adopted. Finally, your particular reporting tool may not be able to provide dimensional hierarchy (organizational structures) navigation, filtering, aggregation and other capabilities required for information delivery to your users when the dimensional model uses surrogate keys versus production keys.
Use of this criterion when evaluating your dimensional model table key method can help you avoid many negative consequences and issues.
What surrogate key method does your firm use, if any? Send me an e-mail with "Our Surrogate Key Method" as the subject line, and I will post the results of this unofficial survey in a future column.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access