"A traditional normalized structure cannot and will not outperform a denormalized star schema from a decision support system (DSS) perspective. These schemas are designed for speed, not your typical record style transaction ... [DSS] requests put way to [sic] much stress on a model that was originally developed to support the handling of single record transactions."
Practitioner with 20 years of experience
"Third normal form seems to be regarded by many as the point where your database will be most efficient ... If your database is overnormalized [sic] you run the risk of excessive table joins. So you denormalize and break theoretical rules for real world performance gains."
"Denormalization can be described as a process for reducing the degree of normalization with the aim of improving query processing performance...we present a practical view of denormalization, and provide fundamental guidelines for denormalization ... denormalization can enhance query performance when it is deployed with a complete understanding of application requirements."
Sanders & Shin, State University of New York at Buffalo
Those familiar with my work know I have deplored the lack of foundational knowledge and sound practices in the database trade for years. One of the most egregiously abused aspects is normalization, as reflected in the opening quotes. The prevailing argument is as follows:
1. The higher the level of normalization, the greater the number of tables in the database;
2. The greater the number of tables in the database, the more joins are necessary for data manipulation;
3. Joins slow performance;
4. Denormalization reduces the number of tables and, therefore, the reliance on joins, which speeds up performance.
The problem is that points 2 and 3 are not necessarily true, in which case point 4 does not hold. What is more, even if the claims were all true, denormalization would still be highly questionable because performance gains, if any, can only be had at the expense of integrity.
The purpose of this article is to explain just why the notion of "denormalization for performance" is a misconception and to expose its costly dangers, of which practitioners are blissfully unaware. If the integrity cost of denormalization is taken into consideration, it will cancel out performance gains, if any.
The Key, the Whole Key and Nothing But the Key
Informally, a fully normalized relational database consists of R-tables, each of which represents just one entity type: the attributes of that type are represented by columns, and the entities of the type are represented by rows. Because a table represents only one entity type and the entities of that type are identified by the key, non-key values, or sets thereof, are dependent on (associated with) key values. Stated another way, non-key columns are "directly about the key, the whole key and nothing but the key."
By definition, key columns do not have duplicate values; therefore, non-key columns dependent on the key won't have duplicate values either. That is why there is no redundancy due to such dependencies in fully normalized tables. Practically, what this means is that in a fully normalized database, it is sufficient for the database designer to declare key constraints to the RDBMS, and the integrity of the database insofar as column dependencies are concerned will be guaranteed.1
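As a minimal sketch of this point, using Python's built-in `sqlite3` and a hypothetical one-entity-type table (the schema and names are illustrative, not from the article): once the key is declared to the DBMS, each fact is stored exactly once, and the key constraint alone is enough to reject a conflicting second copy.

```python
import sqlite3

# Hypothetical fully normalized table: one entity type (employees),
# with the key declared to the DBMS.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,   -- the key
        name    TEXT NOT NULL,         -- attributes of the whole key
        salary  INTEGER NOT NULL
    )
""")
con.execute("INSERT INTO employee VALUES (1, 'Smith', 50000)")

# Every fact about employee 1 is stored exactly once, so the declared
# key constraint suffices: a second, contradictory row for the same
# key value is rejected by the DBMS itself.
try:
    con.execute("INSERT INTO employee VALUES (1, 'Smith', 60000)")
    outcome = "accepted"
except sqlite3.IntegrityError:
    outcome = "rejected"
print(outcome)  # rejected
```

No application code has to hunt for duplicate copies of the salary fact; the constraint declaration does all the integrity work.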
Denormalizing a database means, again informally, "bundling" multiple entity types in one table. In a table that represents multiple entity types, some non-key columns will not be directly about the key, the whole key and nothing but the key. The following kinds of dependency are possible in such tables, each of which is associated with a normal form lower than full normalization:
- Dependency on part of a (composite) key; table not in 2NF.
- Indirect dependency on the key; dependency on another non-key column that is dependent on the key; table not in 3NF.
- Intra-key multivalued dependency; table not in 4NF.
- Intra-key join dependency; table not in 5NF.2
Non-key columns and columns that are part of a (composite) key are not guaranteed to have unique values. If they have duplicates in some rows in the table, other columns in the table dependent on such columns (rather than directly on the key, the whole key and nothing but the key) will also have duplicate values in those same rows, as highlighted in table A in Figure 1, because of the association with those non-key values. This causes redundancy in denormalized tables. For such tables, key constraints are no longer sufficient to guarantee integrity. The DBMS must also control the redundancy, so that when a table such as A is updated, all stored instances of the same data are updated. If this is not done, the database can be corrupted and become inconsistent. This complication is not necessary for fully normalized tables (such as B1 and B2 in Figure 1) because they don't have redundancy due to undernormalization.
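The update anomaly described here can be sketched concretely. The following is a hypothetical example (the tables A, B1 and B2 of Figure 1 are not reproduced here): a denormalized table bundling two entity types, employees and departments, where `dept_name` depends on the non-key column `dept_id` rather than on the key, and therefore is stored redundantly.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical denormalized table bundling two entity types:
# dept_name depends on dept_id, a non-key column, not directly on
# the key emp_id, so the same departmental fact repeats per employee.
con.execute("""
    CREATE TABLE emp_dept (
        emp_id    INTEGER PRIMARY KEY,
        dept_id   INTEGER,
        dept_name TEXT
    )
""")
con.executemany("INSERT INTO emp_dept VALUES (?, ?, ?)",
                [(1, 10, 'Sales'), (2, 10, 'Sales'), (3, 20, 'R&D')])

# An update that renames department 10 via one employee row corrupts
# the database: the other stored copy of the same fact is left stale,
# and the key constraint alone cannot catch it.
con.execute("UPDATE emp_dept SET dept_name = 'Marketing' WHERE emp_id = 1")
names = {row[0] for row in con.execute(
    "SELECT DISTINCT dept_name FROM emp_dept WHERE dept_id = 10")}
print(sorted(names))  # ['Marketing', 'Sales'] -- one department, two names
```

In the fully normalized equivalent, the department name would be stored once in a separate `dept` table keyed on `dept_id`, so a single-row update could never produce two conflicting names for the same department.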
Figure 1: Denormalized versus Normalized Tables
Thus, from an integrity perspective, there are two design options:
- Fully normalize the database and maximize the simplicity of integrity enforcement;
- Denormalize the database and enormously complicate integrity enforcement (prohibitively so, as we shall see shortly).
Obviously the first choice is the better option. Why, then, the prevailing insistence on the second choice? The argument for denormalization is, of course, based on performance considerations; but does it make sense?
The Logical-Physical Confusion
Suppose I ask you to retrieve two sets of books from a library: one set of five books and one of ten. If I ask you which set will take you longer to retrieve, would you know? When I ask practitioners this question, many realize that the time required for set retrieval depends not on the number of books per se, but on the characteristics of the library: how the books are physically stored and retrieved. It is entirely possible for 10 books to take less time if, for example, they are all stored together while the five are scattered across the library; or if the 10 are on the first floor and the five on the sixth; or if the 10 are indexed in a catalog, but the five are not. (If any of the books were being used or borrowed by others, retrieval could take a very long time indeed.)
By definition, normalization increases the number of logical tables (the books in the analogy) in the database. If the number of books says nothing about library retrieval performance, how can denormalization (decreasing the number of logical tables) say anything about database performance? As in the library case, performance is determined entirely at the physical database level (storage and access methods, hardware, physical design, DBMS implementation details, degree of concurrent access, etc.). What, then, does "denormalization for performance" mean? When I ask this question, I generally get answers along the following lines: "Well, it's all fine in theory, but we live in the real world. The reality is that I cannot get acceptable performance unless I denormalize my databases."
Let us assume, for the sake of argument, that this is always so (even though it's not). The only logical conclusion possible is not that normalization slows performance and denormalization speeds it up. To the extent that performance with fully normalized databases is slow and it improves with denormalization, this can only be due to the physical implementations of the database and DBMS. Those implementations are simply not good enough for correct logical database design, forcing you to introduce redundancy in the database and, as we shall see, trade integrity for performance.
Okay, you say, but I must work with existing DBMS software anyway, so what difference does your argument make? First of all, it is very important to realize that it is not normalization, but physical implementation, that determines performance (more on this in Part 2). Secondly, the only reason you may sometimes achieve better performance when you denormalize your database is that you ignore the integrity implications. If and when those are taken into account, you will lose the performance gain, if any, from denormalization and may even end up with a net performance loss relative to the fully normalized alternative.
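The point that performance lives at the physical level can be sketched with SQLite (a hypothetical setup chosen only for illustration): the logical query below never changes, yet the access path the DBMS chooses, reported by `EXPLAIN QUERY PLAN`, changes completely once a purely physical structure (an index) is added. The same reasoning applies to join columns: whether a join is fast or slow depends on such physical access paths, not on the logical table count.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        cust_id  INTEGER,
        total    INTEGER
    )
""")

query = "SELECT total FROM orders WHERE cust_id = 42"

def access_path(sql):
    # EXPLAIN QUERY PLAN reports the physical access path SQLite chose;
    # the detail string is the last column of each plan row.
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

before = access_path(query)   # full table scan: no structure to exploit
con.execute("CREATE INDEX idx_orders_cust ON orders(cust_id)")
after = access_path(query)    # same logical query, now an index search

print(before)
print(after)
```

Nothing at the logical level changed between the two plans; only the physical design did. That is where "performance" actually resides.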
I will demonstrate this next month in Part 2. Stay tuned.
- 1. There may be redundancies in the database for reasons other than such column dependencies (e.g., derived columns, cross-table duplicates), which must be resolved separately, by means other than full normalization (see Chapters 5 and 8 in Practical Issues in Database Management, Addison-Wesley, 2000).
- 2. All R-tables are by definition in first normal form (1NF), or normalized: they don't have multivalued columns (known in older databases as repeating groups). Normalized databases can be further normalized to the higher normal forms (2NF through 5NF). If there are no composite keys in the database, a normalized database is automatically also in 2NF, and a database in 3NF is automatically also in 5NF, or fully normalized. In other words, you need to worry about 4NF and 5NF only if composite keys are present (see Chapter 5 in Practical Issues in Database Management, Addison-Wesley, 2000).