Sometimes a data warehouse will include two dimension tables which are so large that when they are joined by the business intelligence reporting tool, the join of those tables consumes too many resources and time to return a result set in a satisfactory time duration. These dimensions are too big to join.
When this happens, the most common solution is to prejoin the two large dimension tables into one denormalized dimension table. How will that join happen? The extract, transform and load application will make it happen. The ETL application has a simple job – join two dimension tables that are too big to join, every single day, and do so before data warehouse customers arrive in the morning.

Traditional Changed Data Capture


In a traditional changed data capture an ETL application realizes the complete universe of values which are in effect in the system of record at that moment – set A. The universe of values previously in the data warehouse – set B – is compared to set A. The resulting set of changes are then used to update the data warehouse.
The advantage of this method is completeness. Because it includes the universe of values in set A and set B no values are excluded. The disadvantage is the consumption of resources necessary to create and manage two universes of values from a denormalized dimension which is built from two dimension tables which are too large to join.

Normalized Changed Data Capture


A possible solution to the conundrum caused by a denormalized dimension which is built from two dimension tables that are too large to join is a normalized CDC. The elements of a normalized CDC include the following:

  • Set A – The universe of values in effect in the system of record at that moment.
  • Set B – The universe of values previously in the data warehouse.
  • Dimension 1 – The first large dimension table denormalized into sets A and B.
  • Dimension 2 – The second large dimension table denormalized into sets A and B.

The steps in a normalized CDC of a denormalized dimension table are the following:
Step 1: Realize the universe of values for dimension 1 and dimension 2 from both the system of record (set A) and the data warehouse (set B).

Step 2: Perform CDC processes on dimension 1 and dimension 2, separate from each other.

Step 3: Realize the complete set of insert, update and delete records by joining the normalized insert, update and delete records back to the universe of values from which they came.

Step 3 may seem complex and cumbersome, because it is. However, when the alternative is an ETL process that cannot finish in time to meet its SLA – or cannot finish at all – complex and cumbersome become a bit more tolerable. 
The advantage to this method is in the join functions. The ETL application is able to join only those records that have changed, rather than the universe of values for a denormalized dimension. This renders the resources consumed in the joins of a normalized CDC more meaningful and removes wasted consumption of resources.
Traditional CDC which compares the universe of values is the best practice and should be the first choice for managing a dimension table, regardless of its normalization, in a data warehouse. When, however, the join which achieves a denormalized dimension table becomes so large that traditional CDC cannot complete in the required ETL time window, normalized CDC is a viable alternative. Although more complex than traditional CDC, normalized CDC can reduce the cardinality of data in the Joins necessary to realize the data in the denormalized dimension. Therefore, it may allow the ETL application to complete within the ETL time window without any loss in the completeness of the data in the denormalized dimension table.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access