Q: Why can't you experts agree on terminology? We may need a new term or could you be more carefully qualifying (Centralized ODS vs Distributed ODS) (DW as 3rd NF? How big a snowflake is that?) (EDW is not 3rd NF but maybe distributed DW could be) (DM centralized or distributed could be 3rd NF or not right?) Folks here are saying DW is DM. I am looking for support that says they are not and could you experts come to some agreements?
Sid Adelman's Answer:
I'm afraid we don't always agree. As a matter of fact, I often disagree with my colleagues on some primary issues and getting us to agree on terminology is just as difficult. That said, it's fair to ask for definitions so we don't have situations where we agree but kill each other because the terminology hasn't been defined and so miscommunication is the result.
Clay Rehm's Answer:
I agree with you - there is no reason why our industry cannot come to agreement on terminology. With this industry still being fairly new compared to other industries, I feel this disagreement will continue.
I would not get hung up on the terminology based on how others describe something - however I would make sure that within your organization the terminology is standard and agreed upon even if it differs with what experts say.
Adrienne Tannenbaum's Answer:
Correct me if I am wrong, group, but the "Ask the Experts" purpose is to get all of these different perspectives. Perhaps what we can do for you is put a brief definition in parentheses next to industry terminology. So here is a question for you - who creates the terminology to begin with?
Tom Haughey's Answer:
First, you have a point. Admittedly. So let's step back and sort through this. I would like to point out that your expression of this question itself exemplifies (and even exacerbates) much of the confusion you are identifying.
Second, the experts have a reasonably common understanding of what the terms mean . Not universal, but fairly common. The disagreement is on which is better. Specifically, there are two things about which there is most disagreement: 1) Which DW topology is the best? Should the DW be centralized or distributed? How centralized and how distributed? 2) What is the best way of delivering a data warehouse. Another way of saying this is it delivered all at once or in increments?
Data Warehousing Terms
The following terms are fairly well defined: ODS. DW, star schema, snowflake schema, EDW, data mart, 3NF. Even the topology terms such as Centralized and Distributed are fairly well defined, though there is greater disparity among them. Again, the issue is not what the terms mean, but which is best. In the next paragraphs, I will briefly define each of these terms and then propose some topology recommendations
An ODS is a tactical environment which stores detailed, near-real time results of committed transactions for a certain period of time for immediate reporting needs and which can sometimes be updated by users. It is probably difficult to have one centralized ODS due to the different processing requirements of the OLTP environment. After all, an ODS is an operational environment. See Ask The Experts column on April 10 and 21, 2006. A DW is a strategic environment and system for gathering informational data from multiple sources, storing it in a central and reusable database, and delivering it as information to business and system users. A data mart, on the other hand, is an environment containing a specialized collection of related data, customized for a specific community of knowledge workers, analysts or planners, to support their reporting and analysis needs. Though the words different that experts use to describe these may differ, there is not much substantial disagreement here.
A dimensional model is a model in which data is classified as a fact or dimension. The dimensional model defines two major patterns. A Star is a fact surrounded by a single circle of dimensions, in which, if any of those dimensions are multi-leveled, it is are flattened out into a single denormalized dimension. Similarly, a snowflake is a single fact, surrounded by a circle of dimensions in which at least one of those dimensions is multi-leveled and the multiple levels are kept separate. The question is which is better. Dimensional modelers strongly propose that the star is the best solution. See the Ask The Experts column April 10, 2006, " How do star schemas compare to snowflake schemas? " for a full description of the trade-offs between stars and snowflakes. Finally, let me make the point that aggregated data is inherently dimensional, meaning it will naturally consist of a fact and its corresponding dimensions. No specific techniques are required to achieve this.
A normalized model is one that honors the functional dependencies. An operational model and an analytical model can both be normalized, but an operational and an analytical model of even the same kind of data within the same business (such as Orders data) will not look alike. The analytical model will more naturally fit the pattern of facts and dimensions.
Two concepts in particular have caused controversy: can the DW (or ODS for that matter) be centralized or must it be distributed; and is the dimensional structure the only viable data design for a DW? An ODS will not be limited to a dimensional structure. It is likely that an ODS will be distributed and quite normalized.
One term that has generated a great deal of controversy about centralization is that of the enterprise data warehouse (EDW). Let us discuss this more fully.
EDW is one among several DW topologies, two others being federated and functional. A federated data warehouse is a data warehouse in which several existing data warehouses or data marts are actively interrelated either by exchanging (or joining) data with one another (peer levels) or by passing data to a higher level data warehouse (superior level). A functional warehouse architecture consists of multiple separate data warehouses each supporting different business functions. Each DW is separate. Essentially, the functional topology consists of large scale independent data marts. See my article on "Data Warehouse Architectures in an Internet Age" published in DM Direct, Feb 13, 2004.
The EDW is a centralized DW which covers a major cross-section of the business or a logical or legal subdivision of the business. Here is where the biggest disagreement exists (and where I suggest the most misinformation exists as well). Indirectly, the term "enterprise" contributes to the problem. There are two gross misconceptions about the EDW: 1) how much is centralized; and 2) how you deliver one. Let us address each point in particular.
Essentially, an EDW is a centralized solution, but not fully centralized. It is centralized at its core but is distributed at its tips. At the core of the EDW is a centralized DB or what I call the central data warehouse (CDW). The CDW will have a main database that is centralized and that will be detailed, atomic, general purpose, flexible and open ended. This model should be quite normalized, though it will contain different levels of data. As the data gets more aggregated within the CDW, it will take a more purely dimensional shape. The delivery end of the EDW, which will be distributed and non-centralized, will include data marts and other BI application that may reside on the same or different platforms. These also tend to have a more pure dimensional structure.
Delivering the EDW
The biggest misconception with this EDW concept is how you go about building one. No authors have more grossly misunderstood (and misrepresented) this concept than Ralph Kimball and Margy Ross. I agree with everything Kimball and Ross have written about the EDW except one thing: that anything they have said is a true characteristic of an EDW. Read what they say of it. They would give you the impression that an EDW is built in advance by a team of omniscient analysts, who are required to understand all requirements in advance, implement them all at once, deliver only with long delivery lead times, and do so on a platform costing a king's ransom. You'd have to have corners on your head (and rocks in it) to do it that way. Here's the real deal.
From a development point of view, an EDW should be planned several years in advance (as should any DW for that matter), but is built in increments. In other words, plan top-down but build bottom-up. These increments should be three month intervals. The first increment from start should be delivered in 6-9 months, to allow for establishment of the infrastructure. Subsequent increments should be quarterly releases. It is useful to have a high level enterprise-wide data model in place for planning, presentation and budgetary purposes. A third of the way through each release a prototype database should be delivered for testing and should be abundantly populated. In essence, this is saying that a DW is build using a RAD (rapid application development) approach, not the typical monolithic waterfall approach.
Look at this matter of distribution another way. It is a complete fallacy that it is easier to take a given set of functionality and split it into pieces and then distribute those pieces. If you do this, it will eventually cost more and take longer to deliver these capabilities. Why? Because now you have to interface these separate pieces. It is proposed that conformed dimensions is the solution to this in a distributed solution. Indeed, the use of conformed dimensions is a solution to this. But in a real sense it is a solution to a self-created problem. If you did not distribute the solution, you would not have to re-assemble the pieces. Conformed dimensions are an important concept to be sure. But an EDW does not need conformed dimensions at its core because there will be only one copy of a given dimension. It will need conformed dimensions at its BI tips. Dimensions must conform when there are multiple copies of the same dimension. Conformed dimensions are required wherever the solution is distributed and the same dimensions are used, such as where there are distributed BI solutions.
The distributed DW is a viable solution. However, a DW does not have to be viewed as the sum of its data marts. There are big hidden costs in this approach. It can be viewed this but there is a better solution. Remember, EDW's exist and plenty of them and plenty of them have huge ROI's.
Next, let us discuss the terms 3NF and dimensional. Many people present the alternatives as relational vs. dimensional. I am completely opposed to this comparison. First, most dimensional models are implemented on relational technology, and second, a relational model can be highly denormalized. This means that the comparison of relational vs. dimensional is irrelevant. Dimensional models contain some level of denormalization. The denormalization follows a pattern, namely, star or snowflake. The real distinction is not relational vs. dimensional but normalized vs. dimensional (which is usually denormalized). In the CDW, the data should contain more normalized analytical data. I say "more normalized" because, as we all know, most databases are implemented with some denormalization. The reason for normalizing the CDW has nothing to do with saving storage, as dimensional modelers suggest. It has to do with the characteristics of the CDW: general purpose, flexible and open-ended. Remember, we are talking about normalizing the warehouse data, not normalizing back to operational data (a confusion Kimball's original Dimensional Modeling Manifesto clearly perpetuates). A model based on functional dependency, which is what all normalized models are, is the most flexible model conceivable with in the parameters of the business. That is why the CDW is normalized, to support the flexibility required to answer all queries and feed all extracts.
However, in terms used in your question, the CDW will be essentially snowflaked. It will contain multiple snowflakes. However, it also contains different levels of data, such as base data and reporting data. The same structure might be fully normalized in base data and denormalized in reporting tables. Take the example of a ragged hierarchy. The ragged hierarchy might be stored in a recursive structure in base data and flattened in reporting tables.
In summary, the issue is not what these terms mean but what is the best solution. It is my view that the centralized DW solution is the most powerful overall solution in most cases. But it must be built sensibly using RAD principles. It is not fully centralized. It will be centralized at its core but distributed at its BI tips. There are other viable topologies, such as a distributed or federated one. While a DW can be viewed as the sum of its data marts, it does not have to be and there are huge hidden costs in this. However, many factors play into this topology decision, some technical, some business, some political. The centralized topology is not ideal for every organization. An organization should choose the topology that best suits its requirements. While different topologies are viable, they each have their advantages and corresponding disadvantages. Consider this: the number one trend today in data warehousing is data mart consolidation. Don't believe it? Go ask Gartner.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access