William wishes to thank Mike Cross (firstname.lastname@example.org), data warehouse director at Rent-A-Center, for his contributions to this month's column.
The one overriding constant about data warehousing is that the data warehouse will change. Data designs that you spend months perfecting will become obsolete overnight, and unforeseen business requirements will require a different view of the data. If you want to be a successful data warehouse architect, you can either become very astute at accommodating change or you can design for the unknown.
One way to architect your data warehouse for the unknown is abstract design. Abstract design allows for and welcomes change without impacting the overall structure or design of your data warehouse. The primary benefit of abstract design is flexibility. This design technique can represent the data more naturally, is easily understood by end users, allows for unforeseen changes, requires less knowledge of the data and data relationships by the end user, and prevents the carrying forward of legacy data elements. Additionally, database indexing techniques can be fully exploited to make querying abstract designs much faster than straight normalized or dimensional designs.
As the name implies, abstract design removes most of the rigidity of traditional data design and replaces it with one or more levels of abstraction. Abstract design is characterized by heavy use of supertypes and subtypes, surrogate keys representing natural keys, very simple elements - such as amount, count and date - and lookup tables that define element types and relationships between element types.
As an example, consider a simplified daily income table for a convenience store. Traditionally, it may look something like Figure 1.
Figure 1: Simplified Daily Income Table
One of the problems with this design is that if additional income categories appear, as they inevitably will, you will need to add them to the table structure. After several iterations of adding, renaming and subtracting, the table transforms itself into a very inefficient structure. Other problems with this design are that developers need to understand the history of all changes to use it effectively, and you will need to include all income columns for every store, even if it does not apply to that store. An abstract design for this same structure would look like Figure 2.
Figure 2: Abstract Design Daily Income Table
This structure, combined with the lookup table, which is shown in Figure 3, allows for new income categories without changing the existing structure and provides the added benefit of defining categories and showing relationships between categories.
Figure 3: Lookup Table
An Example of Abstract Design
Consider the following. You have the new capability to break fuel sales into diesel and unleaded. Traditionally, you would need to add two additional columns to the income table and begin populating these. Developers would need to know when this change was made and query the table accordingly. (Because this change would probably not happen in all stores at the same time, more complicated logic will probably be necessary.) In the abstract design, you would simply add two income types whose parents are fuel sales. If developers were querying the data at the lowest possible level, they would not even have to be cognizant of this change and would automatically receive the lowest level of detail data available, regardless of the implementation schedule.
The use of surrogate keys means you are not tied to current names and allows for the renaming of types without changing the structure, breaking existing queries or needing to notify developers or end users. The use of recursive parent-child relationships allows for the simple aggregation of types without having to know the children or the number of generations below the parent. Again, developers simply need to write code once that anticipates these situations.
The heavy use of supertypes and subtypes aids in abstract data design. By combining multiple related entity types under a single supertype, you can exploit the abstract design to show relationships between items that are not within the same subtype without having to worry about tracking which subtype they belong to or having to perform "tricks" if an entity belongs to multiple subtypes. The fundamental principle that subtypes must be exclusive and exhaustive is great in theory, but in practice becomes very limiting, if not impossible to implement. In addition, there is no implied relationship or hierarchy between subtypes. This is defined elsewhere if necessary.
In an insurance example, there are agents, adjusters, agencies and companies. An abstract design would create a supertype of organization, containing the core information about each organization (such as name, address and tax ID) and the subtypes of agent, adjuster, agency and company, which have specific information pertinent to each subtype. With this design, a person would be defined once as an organization, possibly once as an agent and once as an adjuster; all three would have the same surrogate key. A relationship table could then be created that relates agents to agencies and agencies to companies. However, because there is not an implied relationship or order within the subtypes, an agent could be the child of a company and an agency could be the child of another agency. The only relationship not allowed is a child being its own parent.
Before tackling abstract design, a thorough understanding of the underlying data and its relationship with other data is necessary. Relational and dimensional tables, though inefficient at times, do not require as much understanding of the data. The elements are grouped according to a certain granularity level (i.e., the primary key), and all relationships between the elements below the key are left to the end user to understand and implement accordingly. Abstract design, however, requires the complete understanding of data for its full potential to be realized.
Abstract design is not the answer to all data modeling challenges and should not be taken to an extreme, but it is a very powerful technique that allows the data warehouse to accommodate unforeseen changes without having to be redesigned, because we all know that change happens.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access