We model a database for the same reasons that Boeing builds an aircraft model to test flight characteristics in a wind tunnel. It's simpler and cheaper to do this than reconstruct the plane until you get it right.
Likewise, a proper data model should be designed to reflect the business components and their possible relationships. You've probably read or heard the debate going on today about whether to model your data warehouse using third normal form or a star-schema model.
Third normal form is a model in which each attribute (column) must be a fact about the primary key--the whole key and nothing but the key. Data is placed in tables where it makes the most sense with no repeating groups, derived data or optional columns. This allows users to ask any question at any time on all data in the enterprise.
A star-schema model is comprised of a fact table and a number of dimension tables. The fact table is a table with a multi-part key. Each element of the key is itself a foreign key to a single dimension table. The remaining fields in the fact table are known as facts; and they are numeric, continuously valued and additive. Facts can be thought of as measurements taken at the intersection of all of the dimensions. Dimension attributes are mostly textual and are almost always the source of constraints and report breaks. This model enhances performance on known questions or queries.
Most database modelers create a logical model in third normal form; but because many database engines are overcome by physical limitations, they must compromise the model.
The four hardest things for a database to do are: 1) join tables; 2) aggregate data; 3) sort data; and 4) scan large volumes of data. In order to get around these system limitations, some vendors will suggest models to avoid joins, use summarized data to avoid aggregation, store data in sorted order and overuse indexes to avoid large scans.
One of the reasons for the debate is because there is a lack of database products available with the power and maturity to utilize a third normal form physical model on data of any size, especially those approaching, and exceeding, a terabyte. In my opinion, NCR's Teradata data warehouse engine is the only one on the market that can do so. Because of such physical limitations, other databases have had to use a star-schema model to enhance performance, but by doing so have given up the ability to perform ad hoc queries and data mining.
Based on more than 15 years of real-world experience, NCR has found that third normal form is the model that should be used for the central data warehouse. This allows users the ability to ask any question at any time on information anywhere in the enterprise. This is the central philosophy of a data warehouse, which also leads to the power of ad hoc queries and data mining. Using such techniques, advanced tools can discover relationships that are not easily detected but do exist naturally in the business environment.
In contrast, a star-schema model enhances performance on known queries because we build our assumptions into the model. While these assumptions may be correct for the first application, they may not be correct for others.
If you want to build a world-class data warehouse, start by building your central data warehouse around detail data using a third normal form model. Then, as query patterns emerge and performance for well-known queries becomes a priority, star-schema data marts can be created by extracting summarized or departmental data from the centralized data warehouse. The user then has access to both the data marts for repetitive queries and the central warehouse for all other queries.
An enterprise data warehouse should be designed so that everyone in the enterprise can, in a timely manner, get the information needed to do their jobs. This is difficult because there are generally two completely different data warehouse environments that exist--repetitive and iterative. A repetitive environment has many of the same functions that are executed repeatedly, while an iterative environment tends to be unpredictable.
Eighty percent of data warehouse queries are repetitive, but 80 percent of the return on investment (ROI) generally comes from the 20 percent of the queries that go against detailed data in an iterative environment. By using a third normal form as the data model for your central data warehouse and star-schema indexing for data marts, you can enhance your company's ability to gain 80 percent of the return on investment and still enhance the performance on 80 percent of your queries. Now, that's a data warehouse that will satisfy everyone in the enterprise.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access