Continue in 2 seconds

What a Data Warehouse Is Not

  • July 01 2001, 1:00am EDT

From the humble origins of a concept derided by database theorists, data warehousing has come a long way in a short time. Today data warehousing is conventional wisdom. There are books, articles and conferences on data warehouses, companies whose products are dedicated to data warehouses, magazines dedicated to data warehousing, and so forth. This tacit marketplace acceptance of data warehousing has come in a short decade, without the support of the established database theorist community.

In fact, there is so much enthusiasm for data warehousing that now people are calling things that are not, and never will be, data warehouses. It is interesting to review what is considered data warehousing in some circles.

A collection of data randomly thrown together. In the early days of data warehousing, I saw this approach. An oil company had taken all of its transactions and collected them into one large machine. The users of the system referred to the large machine as the world's most expensive plant holder. The data was collected from the transactions the corporation had run and was simply thrown into the machine. There was no integration, cleansing, key reconciliation or time stamping of data. It was just one big garbage dump of transactions. It was no wonder that the end user could get nothing out of the system. In order to make sense of the garbage dump of data, the end user had to integrate and cleanse the data, and this simply was too much to ask. The end user was interested in using the data, not in becoming a data administrator. Furthermore, if one user went to the trouble of cleansing and integrating the data, there was no guarantee that the next user would do the integration and cleansing in the same way.

Why was this garbage considered a data warehouse? It was created like this because a hardware vendor wanted to make a quick sale without having the faintest idea what a data warehouse was. When the whole scheme collapsed, what was blamed? Data warehousing, of course.

A data mart. A data mart is a collection of data specifically designed for the use of a department. The departments that typically have data marts are sales, marketing, accounting and finance. One department looks at data summarized by the week; another looks at data summarized by the month. Some departments look at customers along the lines of geographical regions, others look at commercial and retail customers, and so forth. There are fundamentally different ways of looking at customers, products and all the other aspects of the corporation for each department.

The data mart reflects the peculiar needs of the department. One approach to data warehousing holds that one should build a data mart; then, as the data mart grows, it will turn into a data warehouse. However, there is no more sense to this than planting a tumbleweed, watering it and expecting it to turn into an elm tree. The genetics of the tumbleweed are such that it will never turn into an elm tree, regardless of the amount of water it receives. The same is true of a data warehouse and a data mart. A data warehouse contains granular integrated corporate data. The data structure of a data warehouse is such that it is fundamentally and essentially different from a data mart. A data mart will never grow into a data warehouse.

Who is behind the notion that a data mart can magically turn into a data warehouse? The vendors of data marts who are hungry to cash in on the data warehouse bonanza! When a data mart never turns into a data warehouse, what receives the blame? Naturally, the data warehouse.

A collection of data marts. If a data mart never turns into a data warehouse, maybe a collection of data marts will. When the data mart vendors failed to sell the notion that a data mart was really a data warehouse, the next step was to try to convince the world that a collection of data marts, glued together properly with conforming dimensions, was a data warehouse. This is akin to saying if you could cleverly glue together a tumbleweed, a marigold, a carrot and a rose bush, and if the gluing is done properly, then you will produce an elm tree. This approach had its own unique set of reasons for failure. The combination of different entities is unnatural, and, while entities (such as a tumbleweed and elm) may have similarities, they're only apparent during immature phases.

Who was behind the charade of using multiple data marts to create a data warehouse? Surprise, surprise, it was the data mart vendors. What was blamed when this failed? Data warehousing, of course.

Online databases. Online transaction processing (OLTP) databases have been around a long time. With an online database you get good response time – in the neighborhood of two to three seconds. Everyone likes good response time. In fact, in an OLTP environment, response time is so good and consistent that the computer can be used operationally, and information coming from the computer can be used as an essential component in the day-to-day operations of the company. Clerks interact with computers daily; and if the online system suffers, the business suffers. Banks and airlines, among others, are famous for their online processing.

If online response time is pleasing, then why not make online response time a part of the data warehouse environment? There are some pretty good reasons why online response time is not a part of the data warehouse milieu. Data warehouses contain a mammoth amount of data. Trying to write transactions that efficiently examine millions of rows of data is beyond challenging. For the general case of performance, such transactions cannot be created. A second reason why online processing does not fit with data warehouses is that the workload coming into the data warehouse is decidedly mixed. Unlike the workload that passes through an ATM system where each transaction is small and uniform in size, in a data warehouse workload, one transaction is large and the next transaction is small. The erose nature of the data warehouse workload simply does not lend itself to consistent response times.

However, good response time is very appealing. There is a hybrid form of data warehousing where online response time is the norm. That companion structure is called an operational data store (ODS). The ODS is the place where a subset of data is kept in order to satisfy the response time needs of the organization. The ODS is closely related to the data warehouse, but is not a data warehouse.

Who tries to make high-performance response time a feature of the data warehouse? It's the vendors who are making this claim. After creating a huge expensive infrastructure and discovering that a data warehouse is incompatible with OLTP response time, what will be blamed? The data warehouse, of course.

An exploration warehouse. An exploration warehouse is a facility where exploration and data mining can be performed. As long as a data warehouse is small and not much activity is occurring there, exploration and data mining can be conducted in the data warehouse. However, as a data warehouse grows in size and begins to house a significant amount of processing, the systems administrator can no longer afford to conduct exploration processing and data mining in the data warehouse. Explor-ation processing and data mining entail the creation and execution of very large transactions. While those transactions are running, other users of the system are precluded from conducting any significant amount of processing. Furthermore, when a warehouse is built that is optimal for exploration processing and data mining, the structure of that warehouse is fundamentally different from the structure of the data warehouse. In a mature state, exploration processing and data mining need to be separated from normal data warehouse processing. A mature data warehouse is not the same as an exploration warehouse.

Who says that an exploration warehouse and a data warehouse are the same? The vendors who are hungry to sell hardware are the ones pushing this idea. What takes the blame when this approach fails? Data warehousing, of course.

The Internet. Amazingly some people have claimed that the world's biggest data warehouse is the Internet. There is no doubt that the Internet is immensely useful. However, claiming the Internet is a big data warehouse shows real ignorance about data warehousing. Internet data is not integrated, nor is it high-quality cleansed data. Just try to separate summary data from detailed data, and you will realize that while the Internet contains a wealth of useful information, it does not resemble a data warehouse.

Who says the Internet is a large data warehouse? Not surprisingly, it is vendors closely associated with the Internet. What happens when people use the Internet as if it were a large warehouse? They will be disappointed. Where will the blame be laid? You've got it – on data warehousing.

Why is it that people try to position whatever they have as a data warehouse when it is not a data warehouse? For the most part, they do it for commercial gain. When things don't work out, where are they? Jumping on the next bandwagon trying to fool the same people all over again.

Why is data warehousing so vulnerable to misinterpretation? The answer is that data warehouse is an architecture, not a technology. Because it is an architecture, it is a bit more abstract than what people are accustomed to. Most people are accustomed to very finite, concrete, well-defined things. Unfortunately, a data warehouse does not satisfy any of those criteria.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access