Very large data-intensive applications such as data warehouses are excessively complex to design and deploy, and almost all architectural decisions have a lasting impact on longevity, performance and return on investment (ROI). In this article, we will examine one such important architectural decision the data storage architecture.
Quantitatively speaking, the largest resource you need to build any large data warehouse is storage. Even though storage is the largest resource, corporations often do not spend enough time and energy on data storage architecture design. Most often, they adopt storage architectures which are often too conservative, too expensive, too simplistic or too risky. The most predominant reasons for these mistakes are the obsolete presumptions and prejudices that are carried forward by IT personnel. Let us examine a few of these prejudices.
Prejudice #1: Storage devices of variable speeds (heterogeneous storage devices) shouldn't be attached to the same server because when a particular I/O operation spans disks with variable speed, performance of this I/O operation is bound to the slowest performing disk. For this reason, databases are often designed to use similar storage devices.
Reality #1: Based on this prejudice, if a data-intensive application was designed to use expensive, high-performance storage devices, you are wasting a lot of money. If you are using cheaper devices, you are most likely sacrificing performance and reliability. The solution to this problem is an optimum data storage architecture which will be explained later in this article.
Prejudice #2: Storage is cheap.
Reality #2: It is true that acquisition costs of some of the storage devices are becoming cheaper. However, high-end storage maintenance and ownership cost still remains very high. If you are planning to house terabytes of data, the storage cost itself could quickly become prohibitively expensive.
Prejudice #3: Corporate standards require standard software and hardware for developing all applications. With any deviation from accepted standard storage solutions, it will cost more to recover data. By using the corporate standard storage solution, more money could be saved by employing fewer personnel to maintain them.
Reality #3: Corporate standards are a good thing. We are not suggesting that you should use different hardware and software for different applications. However, if the technological requirements are significantly different, don't make your decisions based solely upon your corporate standards. Remember, corporate standards are there to serve the corporation, not the other way around.
For example, we were building a large data warehouse for a prestigious and well-run U.S. corporation. Corporate rule stated that all data storage had to be on high-end storage devices with RAID 1 configuration, and all data had to be replicated to a disaster recovery site. Thus, to build a terabyte data warehouse, we had to use 4TB of storage devices. Fortunately, the data warehouse was capturing data from one of its most important and profitable lines of business, and no one complained about ROI of the warehouse. However, the real problem arose when business users wanted to warehouse 19 other lines of business, which were not as profitable.
Typically corporations do one of the two things in this situation: 1) build the warehouse using same high-end storage devices and accept a negative ROI, or 2) abandon warehousing the data altogether. Unfortunately, most often these not-so-profitable lines of business are the ones that desperately need the business intelligence that could be obtained from data warehousing.
Data storage is by no means the only cost of building and maintaining a data warehouse. However, it is a significant one. As data warehouse professionals, if we could minimize this cost, there would be a whole range of new applications that could become economically viable and immensely benefit the enterprise.
Figure 1 illustrates the storage requirements of a typical data warehouse.
|Figure 1: Storage Requirements of a Typical Data Warehouse|
Suppose you are deploying the data warehouse on expensive, high-performance storage devices, with one of the devices at the primary site and the other at a disaster recovery site. The acquisition cost of such a system at the rate of $1,000/GB would be $1 million. If you add other deployment, maintenance and management costs at $500/GB/year, total cost of ownership for five years would be $3.5 million.
In a recessionary environment, how many companies can afford to build such systems? Even if your company can afford such systems, how many projects could justify the ROI of building such systems? Obviously, not many.
The solution we are proposing to this problem hinges on two important principles. Let us examine these principles before we explain how to implement such a solution.
Principle 1. Use heterogeneous storage devices for a single application. This is acceptable, provided a single query doesn't employ storage devices that rotate at different speeds. By carefully segmenting and placing data on appropriate disks, it is possible to eliminate or minimize such queries, thereby reducing the overall storage cost.
Principle 2. Not all data is created equally; therefore, don't treat it equally. As shown in Figure 1, a typical data warehouse contains distinctly different kinds of data. From business and technical perspectives, not all the data is of equal importance and not all data is accessed in a similar fashion and frequency. Some data is not easily reproducible; however, a large portion of data warehouse data is easily reproducible.
Once you agree with these principles, our proposed solution is fairly simple. Categorize data into different "classes of data," categorize your storage devices and store each "class of data" on an appropriate storage device. See Figure 2.
Figure 2: Storage Device/Data MatrixIn a 1TB warehouse, 0.05 percent (5GB) of data typically services 50 percent of the requests. That .05 percent is a candidate for Category A type of storage.
Summary tables/summarized views typically occupy less than 20 percent of the space and service 80 percent of the queries. These types of data objects are typical candidates for Category B type of storage.
The remainder of the elemental data should be stored in Category C storage.
Category D storage is ideal for storing raw input data. Typically a 1TB data warehouse is built with 150-200GB of raw input data. This input data is in ASCII format, which is amenable to very high data compression. To compress and store this raw data, you don't need more than 20- 25GB of optical space. Many large corporations already employ such devices for document retention and archiving. You should use such devices only if you already have such devices in house or if you are building several data warehouses which could justify their acquisition and use.
Let us recalculate our storage cost estimates by conservatively assuming Category C devices are five times cheaper than Category B devices. The cost of the Category B storage would not change.
Acquisition of Category B storage (20% of $1 million) = $200,000
Acquisition of Category C storage (80% of $1 million divided by 5) = $160,000
Total Acquisition Cost = $360,000
Maintenance cost (5 years) = .5 x $360,000 x 5 years = $900,000
Total acquisition and maintenance cost for five years = $1,260,000
Storing data in this configuration substantially reduces the total cost of ownership; at the same time, 90 percent of your queries should perform at par with the expensive storage disk solution.
If you would like to further reduce your cost at the expense of performance, you have a few more options. For example, all historical elemental data could be stored on read-only tablespaces on optical devices. Optical devices are admittedly slow when compared to regular hard disks: however, by strategic allocation of data and other common-sense measures, you could still avoid hitting faster disks and slower (optical) disks in the same query 90 percent of the time. Remember, with the advent of the newer generation of databases such as Oracle 9i and DB2 UDB 7.x, migrating tablespaces across databases is very easy. If you really encounter a performance bottleneck, the problem tablespace could be moved to a higher-performing storage device. This kind of solution is ideal for disaster recovery site databases and projects that need to lower costs to justify the ROI.
The optical disk output rate is still painfully slow.
Complexity in hardware configuration almost always leads to more human intervention.
If you have only one data warehouse or a smaller data warehouse, the configuration suggested may not be for you. However, if you have or are planning to have a multiterabyte data warehouse, this configuration is something that you should definitely consider. Follow these suggestions, and you will have an optimal data warehouse that doesn't break your bank.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access