Ask any data warehouse developer what medium the data will reside on, and the automatic answer is high-performance disk storage. Most data warehouse developers have never built a system on anything but high-performance disk storage in their entire careers. Indeed, many data warehouse developers are not even aware that there are alternatives to high-performance disk storage.
There are many reasons why the volume of data in the warehouse is exploding: data warehouses carry historical data, detailed data, data for which there is no known need and e-commerce data.
The volumes of data found in the data warehouse surpass anything ever seen before.
Surprisingly, the future of data warehousing storage is not high-performance disk storage, despite the strong track record of disk storage over the past 20 years and the protestations of the storage vendors. Instead, high-performance disk storage plays only a secondary role in the future of data warehousing. The real future of data warehousing is in storage media collectively known as alternative storage.
Alternative storage consists of two forms of storage: near-line storage and secondary storage. Near-line storage is siloed tape storage in which cartridges of tape are managed robotically. The technology for siloed tape storage has been around for a long time and is certainly proven and mature.
Secondary storage is a form of disk storage where the disk is slower, significantly less expensive and less cached than high-performance storage. Figure 1 illustrates the two basic forms of alternative storage.
Figure 1: Technology for the Future of Data Warehousing
There are many reasons why alternative storage fits well with the data warehouse environment. Perhaps the most fundamental reason is that data warehouse data is very stable. The nature of data in a warehouse is that the data is put into the warehouse in a time-stamped, snapshot mode. If there is a change in the data that the warehouse needs to be aware of, a new snapshot is made. The old snapshot of data remains undisturbed. Because of this mode of storing data, no updates are made to the data warehouse. Ultimately, this style of storage and processing results in very stable data. The stability of the data fits very nicely with the "write once" data found in near-line storage.
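The snapshot mode described above can be sketched as an append-only store. This is a minimal illustration only, not the interface of any real warehouse product; the names `Snapshot` and `SnapshotStore`, and the sample customer data, are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a snapshot cannot be modified once written
class Snapshot:
    key: str
    as_of: date
    value: dict


class SnapshotStore:
    """Hypothetical append-only store: changes add snapshots, never update them."""

    def __init__(self):
        self._rows = []

    def record(self, key, as_of, value):
        # A change in the source data produces a NEW time-stamped row.
        self._rows.append(Snapshot(key, as_of, dict(value)))

    def history(self, key):
        # Every snapshot ever written for the key, oldest first.
        return sorted((r for r in self._rows if r.key == key), key=lambda r: r.as_of)

    def latest(self, key):
        rows = self.history(key)
        return rows[-1] if rows else None


store = SnapshotStore()
store.record("cust-1", date(2000, 1, 1), {"city": "Denver"})
store.record("cust-1", date(2000, 6, 1), {"city": "Boulder"})  # customer moved
```

Because `record` only ever appends, the earlier Denver snapshot is still present after the customer moves, which is the property that makes the data a good match for "write once" media.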
Another reason why data warehouse data fits nicely on alternative storage is that the queries that operate on warehouse data need long streams of data, and that data is often stored sequentially. In a job stream for online processing, there is constant demand for different units of data from different parts of the disk device. Data warehouse processing is fundamentally different: a query typically reads long, sequential runs of data. Both near-line storage and secondary storage fit this model of a job stream very nicely.
Another very important reason for alternative storage is the need to store many, many records in the data warehouse. Because data warehouses store detailed and historical data, they contain far more data than their online, OLTP brethren. The ability to store far more data on near-line and/or secondary storage is a very important reason why high-performance disk storage is not the future of data warehousing.
Not only can much greater volumes of data be stored in alternative storage, but those massive volumes can be stored much less expensively than on high-performance disk storage. How much cheaper? About an order of magnitude less expensive.
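The arithmetic behind that claim can be sketched as follows. The relative unit costs of 10 and 1 simply encode the "order of magnitude" ratio stated above; the 5 percent active fraction and the 1,000 units of data are illustrative assumptions, not measured figures.

```python
def blended_cost(total_units, active_fraction, fast_cost=10.0, alt_cost=1.0):
    """Relative cost when only the active fraction stays on high-performance disk.

    fast_cost and alt_cost are relative unit costs (assumed 10:1 ratio).
    """
    active = total_units * active_fraction
    dormant = total_units - active
    return active * fast_cost + dormant * alt_cost


all_fast = blended_cost(1000, active_fraction=1.0)  # everything on fast disk
tiered = blended_cost(1000, active_fraction=0.05)   # assumed 5% actively used
print(all_fast, tiered, all_fast / tiered)
```

Under these assumptions, the smaller the actively used fraction, the closer the total cost comes to the full tenfold saving.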
One can hear the high-performance disk vendor proclaim, "But hardware is getting cheaper all the time." The hardware vendors who wish to maintain the status quo have been saying this for as long as there has been a computer industry. Hardware is indeed getting cheaper, but secondary storage and near-line storage are getting cheaper at a faster rate than high-performance storage.
Another powerful reason why high-performance disk storage is not the future of data warehousing is that, ironically and much to the chagrin of the high-performance vendors, performance gets better, not worse, when you move your data to near-line storage or secondary storage. Performance gets better because of a phenomenon in data warehousing called "dormant data." Dormant data is data that is seldom or never used. In the early days of a data warehouse, when the warehouse is new and small, there is little or no dormant data; but as the warehouse matures, the volume of data rises and the patterns of usage stabilize. Soon only a fraction of the data in the warehouse is being used. At this point, the dormant data is moved to alternative storage, and performance for the remaining, actively used data picks up dramatically. If dormant data is left on high-performance disk storage, it "gets in the way" of query processing: data that is needed for the query is hidden among the masses of data not regularly needed. Moving dormant data to alternative storage greatly enhances performance.
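One way to picture the separation of dormant data is a partition by date of last access. The sketch below is an assumption-laden illustration: the function name `split_dormant`, the one-year threshold, and the sample rows are all hypothetical, and a real site would choose its own definition of "seldom used."

```python
from datetime import date, timedelta


def split_dormant(rows, last_access, today, dormant_after_days=365):
    """Partition rows into (active, dormant) by date of last access.

    rows:        dict of row_id -> record
    last_access: dict of row_id -> date last read by any query (may be missing)
    """
    cutoff = today - timedelta(days=dormant_after_days)
    active, dormant = {}, {}
    for row_id, record in rows.items():
        seen = last_access.get(row_id)
        # Never-read rows and rows not read since the cutoff count as dormant.
        if seen is None or seen < cutoff:
            dormant[row_id] = record
        else:
            active[row_id] = record
    return active, dormant


rows = {1: "jan sale", 2: "feb sale", 3: "mar sale"}
last_access = {1: date(1999, 1, 5), 3: date(2000, 11, 20)}  # row 2 never read
active, dormant = split_dormant(rows, last_access, today=date(2000, 12, 1))
```

After the split, queries against `active` scan only the rows that are actually in use, which is the effect the paragraph above attributes to removing dormant data.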
The greatest advantage of selecting alternative storage as the basis for the data in the data warehouse environment is that the designer can choose the lowest level of granularity desired for the data warehouse. When high-performance disk storage is used as the only medium on which data is stored, the designer ends up being restricted as to how much detailed data can be placed in the data warehouse. The telecommunications designer must aggregate or summarize call-level detail. The bank designer must roll checking and ATM activity into a monthly aggregate record. The retailing executive must summarize POS data to the store level and/or to the daily level. In short, placing the data warehouse on disk storage forces a compromise. But when the bulk of the data in the warehouse is stored on alternative storage, the designer can afford to store data at the lowest level of detail that exists. In doing so, the data warehouse ends up with a great deal more functionality than if the warehouse were stored on high-performance disk storage.
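The loss of functionality caused by summarization can be shown with a small example. The call records, customer names, and the `long_calls` helper below are hypothetical and exist only to illustrate the telecommunications case mentioned above.

```python
from collections import defaultdict

# Hypothetical call-level detail: (customer, month, minutes)
calls = [
    ("alice", "2000-01", 2),
    ("alice", "2000-01", 45),
    ("alice", "2000-02", 3),
    ("bob", "2000-01", 10),
]

# The compromise: summarize to one record per customer per month.
monthly = defaultdict(int)
for customer, month, minutes in calls:
    monthly[(customer, month)] += minutes


def long_calls(detail, min_minutes):
    """Answerable only from call-level detail, not from monthly totals."""
    return [c for c in detail if c[2] >= min_minutes]
```

The monthly totals can still answer "how many minutes did each customer use," but a question such as `long_calls(calls, 30)` has no answer once only `monthly` is kept, because the individual call durations are gone.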
There are some very powerful reasons why the medium of storage for the data warehouse should be alternative storage. Admittedly, some of the data warehouse data (the actively used component of the warehouse) will be stored on high-performance disk storage. But the vast majority of the data stored in the warehouse will reside on slower, less expensive alternative storage.
The notion that data should be stored on different media based on the volume and usage characteristics of the data is not a new idea. Years ago there was a technology called hierarchical storage management (HSM). HSM was the intellectual predecessor of alternative storage. The primary difference between HSM and alternative storage is that alternative storage operates at the row or record level while HSM operates at the table or data set level. Management of storage at the table or data set level is simply unthinkable for the volumes of data and the kind of processing that occurs in the data warehouse.
In order to make the alternative storage architecture perform at the optimal level, two types of software are needed. The first type is the activity monitor. The activity monitor sits between the data warehouse DBMS server and the users and collects information about the activity that is occurring inside the data warehouse. Once that information is collected, the data warehouse administrator is in a position to know what data is and is not being used in the actively used portion of the warehouse. With that knowledge, the data warehouse administrator is able to determine precisely what data belongs in actively used storage and what data belongs in alternative storage.
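The role of the activity monitor can be sketched as a thin layer that passes queries through and records which rows come back. The class `ActivityMonitor`, the `run_query` stand-in for the DBMS, and the sample table are hypothetical; commercial monitors work against real query traffic rather than Python predicates.

```python
from collections import Counter


class ActivityMonitor:
    """Hypothetical monitor that sits between users and the DBMS."""

    def __init__(self, run_query):
        self._run_query = run_query      # callable that actually executes a query
        self.access_counts = Counter()   # row_id -> times returned to a user

    def query(self, predicate):
        result = self._run_query(predicate)
        self.access_counts.update(result.keys())  # record which rows were used
        return result

    def unused(self, all_row_ids):
        # Rows no query has touched: candidates for alternative storage.
        return sorted(r for r in all_row_ids if self.access_counts[r] == 0)


table = {1: {"region": "west"}, 2: {"region": "east"}, 3: {"region": "west"}}


def run_query(predicate):
    # Stand-in for the DBMS: returns the rows matching the predicate.
    return {rid: row for rid, row in table.items() if predicate(row)}


monitor = ActivityMonitor(run_query)
monitor.query(lambda row: row["region"] == "west")
```

Calling `monitor.unused(table)` after a representative period of activity gives the administrator the list of rows that no query has used.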
The second type of software needed for a data warehouse environment that operates on alternative storage is a cross-media storage manager. The job of the cross-media storage manager is to manage the traffic between actively used storage and alternative storage. It can do so by physically moving data from one component to the other, or by satisfying a query wherever the data resides, whether in actively used storage or in alternative storage.
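Both modes of operation, moving data between tiers and answering a query across tiers, can be sketched in a few lines. The class `CrossMediaManager` and its method names are hypothetical, and each storage tier is modeled as a plain dictionary purely for illustration.

```python
class CrossMediaManager:
    """Hypothetical manager over two tiers, each modeled as a dict of rows."""

    def __init__(self, active, alternative):
        self.active = active
        self.alternative = alternative

    def demote(self, row_id):
        # Move a row from actively used storage to alternative storage.
        self.alternative[row_id] = self.active.pop(row_id)

    def promote(self, row_id):
        # Move a row back to actively used storage.
        self.active[row_id] = self.alternative.pop(row_id)

    def query(self, predicate):
        # Satisfy a query wherever the data resides, without moving it.
        hits = {}
        for tier in (self.active, self.alternative):
            hits.update({rid: row for rid, row in tier.items() if predicate(row)})
        return hits


mgr = CrossMediaManager(active={1: {"year": 2000}, 2: {"year": 1995}}, alternative={})
mgr.demote(2)                                    # row-level move, not table-level
old = mgr.query(lambda row: row["year"] < 1998)  # answered from alternative storage
```

Note that `demote` and `promote` act on individual rows, in keeping with the row-level management that distinguishes alternative storage from HSM.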
Both types of software are needed in order for alternative storage to operate effectively. As a rule, the activity monitor is first used to determine how much data needs to be placed in alternative storage. After the decision is made, the cross-media storage manager and alternative storage are purchased and installed.
The alternative storage solution for data warehousing is a compelling story. For warehouses that will grow to any size at all, alternative storage is not an option; it is plainly mandatory. What are the obstacles to the success and adoption of alternative storage? The primary obstacle is a familiar one to those who have been around the information processing community for a while. The "we didn't do it that way before" attitude is the primary reason why people do not immediately adopt alternative storage. The very success of the high-performance disk vendors has trapped them into thinking that their world would remain static forever.
But the cat is out of the bag and won't be going back again.