Free Site Registration

Illusions, Allusions – Let’s Get Real about Database Design

InfoManagement Direct, October 2002

Bob Lambert

In a recent two-part DM Review article, "The Dangerous Illusion: Denormalization, Performance and Integrity" (June and July 2002) , Fabian Pascal detailed some of the dangers of denormalization, asserting "gains from denormalization can only be had at the expense of [data] integrity," and that "if the integrity cost of denormalization is taken into consideration, it will cancel out performance gains, if any."1

In the September 6, 2002 issue of DM Direct, Steve VanArsdale replied that Pascal is fundamentally mistaken, that the real problem is "slavish acceptance of the [first normal form] dogma" and the solution is a denormalized approach that permits database fields to have more than one value.2

Both of these authors make valuable contributions to thinking on database design. Mr. Pascal advocates normalization and reminds one of the importance of understanding the meanings embedded in our database designs and the difficulties of implementing a normalized design in mainstream DBMS products. Mr. VanArsdale promotes a "multivalued" database architecture, showing the potential of denormalization in reducing storage and increasing responsiveness of some database applications.

Advertisement

However, we in the IT group need to get it done for the business. Our databases must be usable when deployed and resilient to business changes over long term. And, we need to make them work in mainstream DBMS products.

In the real world, database efficiency and long-term usability are conflicting goals that successful database designers balance for the benefit of business people who use the database applications. Let’s review some of the successful designer’s strategies, using as a backdrop the arguments of Mr. Pascal and Mr. VanArsdale.

Strategy 1: Use a Normalized Data Model to Understand the Business

Other things being equal, a normalized data design is most business-usable and most maintainable over time because it: 1) matches the business view of data, 2) is generally optimized, and 3) tends to reduce the amount of code in the application.

In a normalized data model, each table represents a distinct business concept. Foreign key relationships between tables represent real-world relationships between business concepts. In reality, a data model is just a graphical restatement of business rules that may be expressed in plain English.3


Figure 1: Partial Retail Data Model

For instance, the data model in Figure 1 shows the relationships among entities involved in a retail sales transaction. This model could also be expressed in these simple sentences:

  • An inventory item is stocked in stores as a stock item.
  • A stock item can be purchased as a purchased item.
  • A retail transaction may include many purchased items.
  • A store is the location of a retail transaction.
  • A customer purchases items with a retail transaction.

Mr. VanArsdale uses a similar example to assert that "real data is not normalized…People buy more than one item at a time, sometimes with more than one price and quantity." However, in attempting to evaluate his examples, it is difficult to understand their meaning. Figure 2 quotes Mr. VanArsdale’s logical design for a multivalued record:


Figure 2: Multivalue Fact4

Figure 2 is understandable to some readers, but its meaning is unclear to those unfamiliar with multivalued databases and the particular system that Mr. VanArsdale describes.

Here's how a reader might interpret Figure 2: On Tuesday (^DATE) John Doe (^CUST. NO) purchased a WeedChopper Model 501 Chainsaw in Buena Vista, VA (^BRAND, ^MARKET, and ^STYLE) for $250 (^PRICE). He was so happy with his new chainsaw that he went back and bought two more for $300 each, then returned for five more at $250.

Reviewing the example against the Figure 1 data model, it seems that it might have been expressed like this:


Figure 3: Multivalue Fact, Corrected Based on Figure 1

The revised layout in Figure 3 shows a sales transaction involving three different items, each with distinct quantities, prices, brands and styles.

The discipline required in building a normalized data model helps identify business rules not otherwise apparent. Relational data modeling enables database designers to express in plain English business rules embedded in their database design. So expressed, these business rules are unambiguous and easily interpreted by business people. Analysts who skip this kind of data analysis often build systems that don’t work well with "real data" because they are incorrect from a business perspective.

Strategy 2: Understand the Costs of Denormalization

The denormalized table design of Figure 3 would be very efficient for aggregating sales data by date or customer but is inefficient for analyzing sales for specific SKUs. Each transaction record would have to be opened to see whether a given item was involved in the transaction. If the business requires only analysis by date and customer, this multivalued approach is more efficient. However, embedding items in records keyed by date and customer makes analysis of items, brands and styles less efficient.

While a denormalized structure may be more efficient for a particular query, the efficiency gains are always made at the cost of other queries. In fact, a normalized structure is the most efficient across the set of all possible queries. Good database designers understand that when they denormalize for efficiency they run the risk that business conditions will change and users in future will require non-optimized queries.

Another cost of denormalization is that it increases the work the application must do to maintain data. For example, what if, in the multi-valued structure in Figure 3, BRAND2 is misspelled in the item master table. To correct the problem, a program would have to open each transaction record and interrogate the list of items to find and correct occurrences of SKU2. In a normalized structure, all that is required is to correct BRAND2 in the item table.

Strategy 3: Denormalize Carefully

In spite of the costs, denormalization is part of most database design. It makes sense that a database optimized for all possible queries is probably not optimized for the specific queries involved in any particular application. And, as Mr. Pascal argues, a fully normalized design will generally not work well in mainstream DBMS products.

The art of database design is to deeply understand both business requirement and DBMS capabilities, and create a design that is efficient for the application, works well within the constraints of the DBMS and supports business changes.

Fortunately, designers of reporting databases can take advantage of dimensional data design, a well-developed, effective denormalization strategy.

When there are no such clearly defined paths, a useful guideline is E.F. Codd’s Rule of Reconstruction5, described by Dick Root in his recent TDAN article "Denormalization and the Rules of Reconstruction." Simply stated, the rule of reconstruction means that a valid denormalization can be renormalized with a single SQL statement. If this is not the case, then data integrity has been lost in the process of denormalization.

For example, Figure 3 may represent a valid denormalization of our Figure 1 data model, since we could write a query that draws data from the SKU repeating group and puts it into a separate Purchased Item table – if brand and style for a given item are always the same across different sales transactions. (A reasonable assumption if the table is in a reporting database.)

Strategy 4: Know the Data

These days, it is rare for a system to be developed in isolation from other systems. Most of today’s system development – and all data warehousing – involves extracting data from legacy systems, third- party software products and external data sources such as demographics or mail lists.

More often than not, there will be problems with data input from other systems, either due to data quality issues or the nature of the data integration itself. Here are some typical examples:

  • One leading ERP package uses the same table for both Sales Quotes and Sales Orders. Columns that mean one thing for Quotes mean quite something else Orders. One team extracting data from this ERP package continually mixed up, for example, Date Received on the Quote with Date Prepared for the Order. The designer who blindly copies data from input systems can propagate these issues. In this case, the correct solution is to separate the two documents into separate tables, making each column describe either a quote or an order, not both.
  • Marketing databases often store data purchased from several third parties on the same set of customers. These sources usually include overlapping columns with different values. For the same customer, different sources might store different values for the person’s address, credit scores or even name. It is sometimes important to preserve all of the columns from all of the sources and to maintain the information on where the data came from as well as what its value was. This can result in a messy database design, where columns carry both their value and their source.
  • Codes from legacy databases tend to evolve into complex forms, embedding more and more information into a single field. This is perhaps a natural reaction to the slow evolution of the system relative to changes in business, as users shoehorn information into the system that it was not designed to store. For instance, in a legacy system a one- character code might classify customers by "customer category," with values 1 for small business, 2 for mid-size, and 3 for Fortune 5000. Users might add codes 4, 5 and 6 for corresponding values for aerospace customers, then 7 for federal government, and so on. The database designer must know the data well to extract all of the different logical columns and correctly place them in the database.

These examples point to the fundamental problem in Mr. Pascal’s recommendations to practitioners. His analysis is on the money: other things being equal, the best design is a normalized design, and denormalized designs risk loss of business information. Unfortunately, other things aren’t equal on real projects.

Query requirements can be broad enough to warrant normalized design as a strategy but that strategy can rarely be fully pursued. Sometimes, as with multiple sources of similar demographic data, denormalization is built into the data. More often there are serious integrity issues inherited from source systems. Data designers are rarely given the time and resources needed to effect more than a cursory cleanup as data is input. In reporting databases it is common for users to reject significant changes in favor of making the new database as much like the old, familiar one as possible.

Strategy 5: Promote Quality while Getting the Job Done

The real world of system deployment is about getting business results. For an IT professional working on most projects, advocating Mr. Pascal’s principled stand purposely removes oneself from the project. Mr. VanArsdale seems to take the opposite course: just make the query run fast.

The astute database professional understands the value of a normalized data model for clarification of business rules. This professional combines an understanding of the normalized model, a deep knowledge of pre- existing data, and an understanding of how the system will be used to develop the right database design and deliver it on schedule.

Along the way, this person applies the incremental quality improvements that are possible within the project’s scope and resources, and quietly promotes the principles of information quality.

References:

  1. Pascal, Fabian. "The Dangerous Illusion: Denormalization, Performance and Integrity, Part 1." DM Review. June 2002. P. 52 (http://www.dmreview.com/master.cfm? NavID=198&EdID=5251), and "The Dangerous Illusion: Denormalization, Performance and Integrity, Part 2." DM Review. July 2002. P. 16. (http://www.dmreview.com/master.cfm? NavID=198&EdID=5337)
  2. VanArsdale, Steve. "Dangerous Allusions." DM Direct. September 6, 2002. (http://www.dmreview.com/master.cfm? NavID=198&EdID=5736)
  3. Fisher, Richard and Schmidt, Bob. "The Language of Data Modeling." DM Review. July 2002. (http://www.dmreview.com/master.cfm?NavID=198&EdID=653)
  4. This example is Figure 2 of Mr. VanArsdale’s September 6, 2002 article. (http://www.dmreview.com/master.cfm? NavID=198&EdID=5736)
  5. Root, Dick. "Denormalization and the Rules of Reconstruction." The Data Administration Newsletter. October 10, 2000. http://www.tdan.com/i014ht04.htm

Bob Lambert is a consultant with CapTech Ventures specializing in data management and business intelligence strategy, architecture and management services. You can reach him at blambert@captechventures.com or bob@robertlambert.net.

He also blogs on business-aligned IT at http://robertlambert.net.

For more information on related topics, visit the following channels:

Advertisement

Advertisement