Q: Should the operational data store be designed using 3NF approach or a dimensional approach? What is the best design approach if I am planning for staging -> ODS -> DW -> DM approach?
Chuck Kelley's Answer:
I believe the ODS is a 3NF... It is an operational system. I also think that you have your approach a little backward. I think it is ODS -> staging -> DW -> DM. I also believe source systems feed both ODS and staging. In my opinion - ODS, staging, and DW are 3NF, and DM are based on the tools the user community is using. They would either be star, snowflake, 3NF, multidimensional, etc.
Larissa Moss' Answer:
The "best" design approach is always to use the most appropriate schema for the specific usage of a database and not to follow anyone's religion on how a database "should" be designed. Having said that, let's examine the most common usage of those four databases:
1. Staging area. The purpose for a staging area is to receive data from the source files and prepare the data for loading the target end-user databases (ODS, DW, DM). Therefore, any files or tables in this area are designed to support the ETL process in the most useful and appropriate way. Since many companies have a variety of different types of end-user databases (ODS with DM, or DW with DM, or ODS with DW and DM), they usually design the staging area as close to 3NF as possible for maximum flexibility (i.e., so that they can create load files for any current and future relational or dimensional end-user target database). In addition, they also utilize plenty of flat files, summary tables, and even dimensional schemas to support very specific ETL processes.
2. ODS. The reason why the ODS was invented is to provide ad-hoc querying capabilities as well as canned or parameterized reporting capabilities to business managers and knowledge workers to assist them with their tactical management needs as well as their operational activities. This usually involves the need to be able to access the data in the same way and through the same data relationships as the data would be accessible on the operational systems. Since operational systems are designed using the 3NF approach (for the purpose of minimizing update anomalies), the most appropriate design for ODS is relational - not necessarily 3NF but denormalized to the degree necessary to achieve the desired query and report performance. (Example: If two tables that have a One-to-Many relationship are always accessed together, they would be denormalized [collapsed], which would violate 1NF.) Frequently run reports that contain operational metrics would best be split off into "Oper Marts" (OM), which are designed using a dimensional schema, just like most other DM.
3. DW. Enterprise data warehouses (EDW or DW) mostly exist in two different scenarios: a) in a multi-tier hub-and-spoke architecture, where many additional DM exist and b) on Teradata platforms, where separate DM are discouraged because the desired performance and ease-of-use can be achieved through special Teradata features that mimic the dimensional views of DM. That means that metric-based reporting, which is most appropriately delivered through dimensional schemas is handled in both cases either through additional DM (first scenario) or through special product features (second scenario). That allows the EDW to remain in a relational schema - again, not necessarily 3NF but denormalized to the degree necessary to achieve the desired query and report performance - and allows it to support ad-hoc query needs in addition to canned reporting. The difference between ODS and EDW is that an ODS only contains current data (no history) and is volatile (can be updated), whereas an EDW contains historical data and is non-volatile (read only).
4. DM. DM, specifically dimensional DM, were invented to solve severe performance problems when companies ran trend analysis reports against their relational DW that required multi-table joins and table scans. In addition to the performance issues, end users found it difficult to deal with the multitude of tables and data relationships, when all they needed were a few of them to run their reports. Since most reports are metric-based facts delimited by a handful (or two) of dimensions, most DM are designed using the dimensional approach. However, if a group of end users needs extensive ad-hoc query capabilities against a specific subset of data for a specific purpose that requires the data to be navigable the way it is in the real world (i.e., the way it is related in the operational systems), then the most appropriate design for their DM would be relational - again, not necessarily 3NF but denormalized to the degree necessary to achieve the desired query and report performance. The most important distinction of DM is that they are tailored reporting solutions, which means that each solution must be designed (tailored) for its specific intended usage and access pattern.
Tom Haughey's Answer:
The operational data store (ODS) is a tactical environment which stores detailed, near-real time results of committed transactions for a certain period of time for immediate reporting needs and which can sometimes be updated by users. An ODS is often created for one of three purposes:
- Integrating data from multiple sources (because modifying the source systems would be too costly)
- Tactical reporting
- Providing consolidated update processing
While the ODS can be used to stage data for the DW, that is not its primary purpose. An ODS needs to have an operational purpose, reflecting the three reasons just mentioned.
Figure 1 summarizes the differences across the system types you refer to, except for staging.
Given this, let us discuss how normalization fits in.
First and foremost, the concept of a normalized structure is not well understood in data warehousing. Kimball's original Dimensional Modeling Manifesto is predicated on the completely false assumption that the designer has the choice of using a dimensional design of informational data or a normalized design of operational data. His examples all emphasize this. This assumption is completely wrong. In warehousing the second choice is for a normalized model of informational data. Actually, if you take most of Kimball's examples and provide their normalized equivalent, the primary difference would be that the normalized model would be a snowflake. Another difference in some examples, such as his Orders examples, would be the separation of Order Header and Order Item rather than just the fact table Order Item as in his example. More about that later.
My colleague Joe Oates published an interesting paper in which he described data warehousing normalization. Warehouse normalization is different than operational normalization, he contends. The reality is that there is no such thing as DW normalization, but there is the application of normalization to informational (or DW) data. The rules or normalization are the same. The difference is in the data. It just comes out different than if you normalize informational data.
Margy Ross (and Kimball along with her) are jousting at windmills when they say that the main reason proposed for normalizing DW data is to save space. This is naïve. First, the main reason for normalizing the data in an operational environment is to prevent update anomalies. An update anomaly occurs when you need to update in several redundant places - and don't. Then you have an anomaly. In the data warehouse the main purpose of normalization is to produce the most flexible model possible. The normalized model is better suited to satisfy any ad hoc query. Any time you collapse data, and that is what happens in a star schema, you reduce the flexibility of the data at least to some degree. How much depends on the degree of flattening. Then you have to do other things to circumvent this, such as introduce a greater amount of redundancy to cover the variations you just flattened out. Any time you flatten the data into a structure that violates the functional dependencies of the data, you constrict its flexibility because the functionally dependent model is the most open. It shows all the granular possibilities.
In addition, when one talks about implementing a normalized model, that does not necessarily imply that the implemented model is totally normalized. Most fully normalized logical models get physically denormalized to some degree to accommodate performance and other purposes.
Back to the ODS. The ODS model is often a synthesis of different operational models. Sometimes, but rarely does it look like any one individual operational model. Above we talked about the three main reasons for building an ODS. Notice that some ODSs are updatable. For this reason and for the reason of greater flexibility, the ODS model is better off more normalized. If a lot of reporting is done off the ODS, some aggregate structures could be used as well to supplement the base data. Remember the state of being normalized and the introduction of some denormalization and even some supplementary summary (and denormalized) structures are not incompatible.
An ODS can be used for staging, but that is not one of its justifications. An ODS has to have some operational purpose as described above. Staging data can be kept in files or on relational structures, normalized or not as best suits the needs of pure staging.
The base data in the central data warehouse (that is, the main database in the data warehouse) should also be more normalized. Again, one of the main purposes of the CDW is to answer any query or extract needs, especially ad hoc queries. The normalized model is the best model for providing such flexibility. Again, some denormalization and some stored summarization are normal.
As data gets pushed out to data marts, it can (but doesn't have to) get more dimensional. There are different types of data marts, namely:
- summary marts (where the data is aggregated),
- subset marts (where the data is as detailed as the DW but with either fewer rows or columns), and
- string marts (files extracts that go to data mining).
Summary marts are naturally dimensional, as is all aggregated data.
Finally, remember that the level of normalization required for efficiency is significantly influenced by the platform. If you a running on a robust parallel platform, such as Teradata, it would squander the resources of that system to reduce everything to a purely dimensional structure. On the other hand, if you are running on hand-me-down server and a left-over DBMS license, you may have to do everything you can to optimize the data.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access