Every now and then, an old idea resurfaces. In this case, the idea is that there is something real known as a virtual data warehouse. Why this idea did not die a natural death long ago is a mystery. With a virtual data warehouse, you don't have to go through all the hard work of integrating data, rummaging around old files and trying to understand what some programmer back in the 1960s had in mind. You don't need to take up space with replicated data. What a lovely idea. You just string some files together and access them as if they were a data warehouse. There is no need for the hard stuff that goes with a real data warehouse. However, there also is an old saying: if it seems too good to be true, it probably is.

In truth, there are many very fundamental reasons why a virtual data warehouse is a myth, not a reality.

If it was only about the mere access of data, there might be a real case for building a virtual data warehouse. If we could wave our hands and magically the data would be integrated, then accessing disparate data sources would be as easy as worrying about the access of data. However, most organizations have never built integrated applications. One of the techniques for coping with complexity is to reduce the size and scope of the problem. That is exactly what applications programmers and designers have done.

The problem is that when the business problem is addressed by multiple development organizations, each developer reduces the size and complexity of his/her problem. The result is that one application has only a faint resemblance or connection to the other applications. There is no integration between applications. Therefore, merely accessing data from multiple sources does not solve the much more profound problem of integrating data before the data is accessed. That is the primary flaw of the notion of a virtual data warehouse.

Let's suppose that I think that access of data is the main issue addressed by data warehousing. I go to application A to find that $5,000 dollars was spent on inventory. With application B, I find that $10,000 was spent on inventory, and application C shows that $15,000 was spent. I add the dollar amounts and determine that $30,000 was spent on inventory. However, what I don't account for is the fact that application A uses U.S. dollars, application B uses Australian dollars and application C uses Canadian dollars. Additionally, application A is for the first quarter of the year, application B is for the whole year and application C is only for the month of December. To blindly add these numbers creates fiction. Before the numbers can be added meaningfully, there needs to be integration, and that is one of the real problems that data warehousing solves. Merely accessing data blindly does nothing about the issue of integration.

Now, suppose that it were possible for a virtual data warehouse engine to address this integration problem. Analyst A goes into the database and converts the files in applications A, B and C. Upon completion of the conversion, the analyst adds the numbers together.

Now, a month later, analyst B converts the same data. The only problem is that analyst B uses a different algorithm for conversion than was used by analyst A. Analyst B adds the numbers.

When analyst A and analyst B compare their numbers, they are fundamentally different even though they appear to be the same. It is not enough to be able to convert the data. It is necessary to convert the data in a way that is consistent with its usage across the entire corporation. Data warehousing solves this messy problem. Virtual data warehousing is incapable of solving this problem.

However, let's suppose that somehow the issues of conversion and consistency of conversion can be solved. What about performance? Analyst A issues a query to look at some data. The query must go to numerous sources of data, pull the data from its old technology, convert the data, consolidate the data and then perform the calculation. Now, analyst B has a query. Analyst B's query must go to numerous sources, pull data from old file formats and technologies, convert the data, merge the data and then satisfy the query. What about analyst C? The process is repeated once again. These queries consume huge amounts of machine resources. Performance suffers and gets worse as the data becomes more popular. That is what you get with a virtual data warehouse.

There is yet another powerful reason why a virtual data warehouse is a fantasy –­ historical data. Applications are notorious for getting rid of historical data as quickly as they can in order to perform efficiently. What does this do for the virtual data warehouse that operates on top of the application data? The answer is that a virtual data warehouse that operates on top of applications is limited to the historical data that resides in the applications, which isn't much.

However, proper data warehouses are natural candidates for holding historical data. Within the boundaries of reality, a data warehouse can hold as much historical data as you want and can afford. Virtual data warehouses can't do that.

Virtual data warehousing is certainly a transparent and superficial answer.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access