Part 1 of this article (June 2002 DM Review) described the advantage of a fully normalized database and debunked the fallacy prevailing in the industry that it is the level of full (logical) normalization that determines performance. Part 1 concluded as follows: The only reason you may sometimes achieve better performance when you denormalize your database is because you ignore the integrity implications. If and when those are taken into account, you will lose the performance gain, if any, from denormalization and may even end up with a net performance loss relative to the fully normalized alternative. This article will demonstrate why the notion of "denormalization for performance" is a dangerous illusion.
Consider the assignment of employees to projects and (within-project) to activities in a certain company. The following business rules are in effect:
- An employee can be assigned to any specified project and to all of the specified activities;
- An employee can be assigned to any number of projects and any number of activities;
- Regardless of the project to which an employee is assigned, the activity assignments are the same;
- A given project or activity can have any number of employees assigned to it.
Simply put, these rules mean that this particular company chooses to assign activities independently of assignments to projects. Given these rules, the fully normalized representation consists of tables B1 and B2 in Figure 1. Each table represents one entity type: project assignment and activity assignment, respectively. Consequently, all non-key columns in each are directly about the key, the whole key and nothing but the key. Therefore, there is no redundancy due to violation of this principle.
Let's now assume that this design yields poor performance with your DBMS, so you denormalize and bundle both entity types into table A, which is in 3NF, but not in 4NF (case 3 in Part 1). The highlighted redundancy is due to dependencies of some key columns on other key columns: intra-key dependencies.
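To make the redundancy concrete, here is a minimal sketch in Python, modeling tables as sets of tuples. The employee, project and activity identifiers are invented for illustration, not taken from the article. Because activity assignments are independent of project assignments, each employee's rows in A form the cross-product of his projects and activities, so every assignment fact is recorded more than once:

```python
# Fully normalized design: one table (set of tuples) per entity type.
# All data values are hypothetical.
B1 = {("E1", "P1"), ("E1", "P2"),   # project assignments
      ("E2", "P1")}
B2 = {("E1", "A1"), ("E1", "A2"),   # activity assignments
      ("E2", "A1")}

# Denormalized table A bundles both entity types. For each employee,
# its rows are the cross-product of his projects and activities.
A = {("E1", "P1", "A1"), ("E1", "P1", "A2"),
     ("E1", "P2", "A1"), ("E1", "P2", "A2"),
     ("E2", "P1", "A1")}

# E1's single fact "assigned to activity A1" appears twice in A:
print(sum(1 for (e, p, a) in A if e == "E1" and a == "A1"))  # -> 2
```

In B1 and B2 each fact appears exactly once; in A it is repeated once per combination, and every repetition is an opportunity for inconsistency.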
With the fully normalized tables B1 and B2, you would declare the keys to the DBMS, and you'd be done. With A, however, the redundancy must be controlled by defining an additional integrity constraint to ensure that all instances of the same data are updated. I have yet to encounter one practitioner who can figure out what that constraint should be; can you? Because denormalization is quite common, this can only mean that databases are denormalized without controlling redundancy, which leaves them open to corruption. In fact, this is why denormalization sometimes yields better performance: the cost of controlling the redundancy is simply never paid.
To determine redundancy control constraints, it is necessary to know and understand relational fundamentals. The reason denormalization is at all possible in relational databases is that, courtesy of the relational model, no information is lost in the process: table A is the R-join of tables B1 and B2, which are R-projections of A. This means that, at all times, the following condition must hold for a consistent database: A = B1 JOIN B2.
Any violation of this condition at any time can only mean that the database was corrupted and rendered inconsistent.
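The lossless-decomposition property behind this condition can be verified with a few lines of Python over sets of tuples; the sample data are invented for illustration:

```python
# A minimal sketch of the lossless-join property: the denormalized
# table A equals the R-join of its two R-projections.
# All data values are hypothetical.
A = {("E1", "P1", "A1"), ("E1", "P1", "A2"),
     ("E1", "P2", "A1"), ("E1", "P2", "A2"),
     ("E2", "P1", "A1")}

# R-projections of A on (employee, project) and (employee, activity)
B1 = {(e, p) for (e, p, a) in A}
B2 = {(e, a) for (e, p, a) in A}

# R-join of B1 and B2 over the common employee column
joined = {(e, p, a) for (e, p) in B1 for (e2, a) in B2 if e == e2}

print(joined == A)  # -> True: no information was lost
```

Projection and join are exact inverses here only because the business rules make activities independent of projects; that is precisely what makes the decomposition into B1 and B2, and hence full normalization, possible.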
In order to prevent such a possibility, the DBMS must enforce this equality condition for all updates, which is exactly the integrity constraint that would control the redundancy. Assuming the DBMS speaks an SQL-like language, it would look something like:
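One hypothetical formulation, assuming columns EMP#, PROJ# and ACT# in table A (the article's actual column names are not reproduced here), is:

```
-- Hypothetical SQL-like assertion, not valid in any current SQL dialect:
-- CREATE ASSERTION is standard SQL, but the table comparison (=) between
-- table expressions is not expressible in SQL.
CREATE ASSERTION A_EQUALS_JOIN_OF_PROJECTIONS CHECK (
    A = (SELECT DISTINCT EMP#, PROJ# FROM A)
        JOIN
        (SELECT DISTINCT EMP#, ACT# FROM A)
);
```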
However, because you opted for A, the DBMS does not have tables B1 and B2, so how would it enforce this constraint?
Conceptually, the DBMS can use the after-update copy of table A and:
- Take the two R-projections (the two SELECT expressions are R-projections B1 and B2);
- R-join the two R-projections;
- Compare the result to the same copy of table A.
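These three steps can be sketched as a consistency check in Python; the function name and sample data are invented for illustration:

```python
# Sketch of the conceptual enforcement procedure: after an update to
# the denormalized table A, re-project, re-join and compare.
# All names and data values are hypothetical.

def is_consistent(A):
    """Return True if A equals the R-join of its two R-projections."""
    B1 = {(e, p) for (e, p, a) in A}            # projection on (emp, proj)
    B2 = {(e, a) for (e, p, a) in A}            # projection on (emp, act)
    joined = {(e, p, a)                          # join on the emp column
              for (e, p) in B1
              for (e2, a) in B2 if e == e2}
    return joined == A

A = {("E1", "P1", "A1"), ("E1", "P1", "A2"),
     ("E1", "P2", "A1"), ("E1", "P2", "A2")}
print(is_consistent(A))   # -> True

# Inserting ("E1", "P3", "A1") without the matching ("E1", "P3", "A2")
# row contradicts the rule that activities are project-independent:
A.add(("E1", "P3", "A1"))
print(is_consistent(A))   # -> False: the update must be rejected
```

Note that the check performs exactly the join that denormalization was supposed to avoid, which is the point of the argument that follows.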
Note: I say conceptually, because a well-implemented DBMS can optimize joins (e.g., by comparing only the new row being inserted to the existing rows in the table). However, because this can be done in both the fully normalized and the denormalized case, it does not affect the argument. In fact, it reinforces it: performance should be maximized via optimization, not denormalization.
Figure 1: Less than 4NF (A) and Fully Normalized (B1,B2) Representations
If the two tables (table A and the R-join) are equal after the update, the database is consistent and the update is accepted; if they are not, the update would corrupt the database and is rejected.
Note very carefully, however, that to enforce this integrity constraint, the DBMS must execute, conceptually, the very same join that denormalization was supposed to avoid in the first place! Not only that, but it must also perform the projections and the comparison, which would not be necessary with full normalization. And such constraints must be formulated for each and every denormalized table in the database! I leave it to you to judge whether undertaking such a burden is better than full normalization.
In fact, even if you decide to add such constraints, current SQL DBMSs won't let you. SQL does not support table comparison, the basic relational operation that underlies them. Why? Because the IBM designers of SQL, like most vendors and practitioners today, did not want to bother with "all that theoretical stuff." The result is, of course, not more, but less practical for users.
Incidentally, avoiding redundancy is not the only justification for full normalization. Denormalized databases also suffer from update anomalies: certain table operations that might be required for business reasons cannot be performed. Such designs also bias databases in favor of some applications and against others. Furthermore, bundling entity types makes the database more complex and less obvious to users, which makes queries harder to formulate and increases the risk of obtaining correct answers to unintended queries (the right answer to the wrong question). There is no solution to these problems except to fully normalize databases (see chapters 5 and 8 in Practical Issues in Database Management).
A major problem with all SQL implementations is that they maintain a one-to-one relationship between logical tables and physical files: for each logical row there is a physical record stored on disk as such.
Consequently, SQL tables have an inherent ordering of rows and columns, which, among other things, limits optimization. This means that when full normalization increases the number of logical tables, the number of stored files also increases; and it is that latter number that affects performance, not normalization.
Truly relational databases, on the other hand, consist of sets which don't have such ordering. An implementation truer to the relational concept would provide a more complete separation between the logical and physical levels. In fact, technology that facilitates just such implementations has recently been developed, but whether it will be used for that purpose by the industry is another matter altogether.
In my book, Web site, column, lectures and seminars, I expose many problems in database practice, all of which have the same underlying cause: poor understanding, by practitioners, vendors and users alike, of fundamental, sound database principles. As long as users are willing to dismiss theory and denormalize databases to obtain acceptable performance, oblivious to the integrity risks they undertake, there will be no incentive for any vendor to produce a true RDBMS that will perform best with fully normalized databases and, thus, minimize rather than increase the integrity risk.