If you attended any data warehouse conferences this year, you undoubtedly heard data warehouse architects expressing the following concern: "I'm building a terabyte-size data warehouse that could easily grow to 20 terabytes in a year, but I don't know how I can afford to store all that data." These data warehouse architects recognize that disks can never be inexpensive enough to satisfy their storage demands. Not surprisingly, many are starting to look at alternative storage for their large data warehouse needs. Recommended by Bill Inmon as mandatory for warehouses that will grow to any size at all, alternative storage, particularly near-line storage, holds the promise of making growing data warehouses affordable.
Even though the disk drive market keeps turning in lower and lower cost-per-megabyte numbers, high-performance RAID boxes, the mainstay of typical data warehousing environments, don't necessarily follow the downward pricing trend of the overall disk market. When warehouse requirements reach into tens and hundreds of terabytes or even into petabytes, only the well-funded few can proceed with their plans. With storage management costs at eight times the initial hardware cost, the total price tag can easily upset the cost/benefit balance of such a warehouse.
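To make the cost arithmetic concrete, here is a minimal sketch that applies the eight-to-one management ratio quoted above. The per-terabyte hardware price is a hypothetical placeholder, not a figure from any vendor, and the function name is invented for illustration.

```python
def total_storage_cost(capacity_tb, hardware_cost_per_tb, management_multiplier=8.0):
    """Return hardware cost plus management cost for a given capacity.

    management_multiplier=8.0 reflects the "eight times initial hardware"
    ratio quoted in the article; hardware_cost_per_tb is an assumed input.
    """
    hardware = capacity_tb * hardware_cost_per_tb
    management = hardware * management_multiplier
    return hardware + management


# Hypothetical price of $100,000 per terabyte of high-performance RAID.
ASSUMED_RAID_COST_PER_TB = 100_000

one_tb = total_storage_cost(1, ASSUMED_RAID_COST_PER_TB)
twenty_tb = total_storage_cost(20, ASSUMED_RAID_COST_PER_TB)
print(f"1 TB:  ${one_tb:,.0f}")
print(f"20 TB: ${twenty_tb:,.0f}")
```

Whatever the actual hardware price, the total comes to nine times the hardware outlay under this ratio, which is why lowering the media cost alone does not settle the question.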
Because database performance is measured in terms of transaction speed, most databases are oriented toward disk-based storage. The prospects of databases providing direct support for near-line storage are not good, but this isn't to say there are no efforts under way to make these databases more warehouse-friendly. Oracle8i continues to enhance its table-partitioning feature, which is especially useful for accumulating historical data. Less advertised but critical for transparent access to near-line storage is Oracle's relaxation of timeout values to tolerate delays in getting data back from tape or optical storage. In future releases, we can expect other such concessions to the importance of data warehousing.
Wild, Wild West
What types of near-line storage solutions are there today for large data warehouses? At one end of the spectrum is FileTek with its own relational-like database that has near-line capabilities built into it. At the other end is UniTree Software with a near-line solution designed to work transparently in most database environments. There are not many players in this emerging field. In fact, at data warehousing conferences this year, only one vendor was showing a solution based around a tape library. Although hierarchical storage management (HSM) solutions would appear to be ideally suited to solving this problem, most are geared toward general-purpose file systems and are, therefore, of limited use in data warehousing. One particular problem is that most HSMs handle migration at a file-atomic level, which is fine for small files but poorly matched to database tables, which can be very large. If a query needs only a handful of rows from a 100GB table, staging in the entire table makes little sense.
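The gap between file-atomic and finer-grained staging can be sketched in a few lines. This is an illustration only: it assumes the byte offsets of the needed rows are known in advance and that the near-line layer stages fixed-size blocks; the 64MB block size and the function names are hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def bytes_staged_file_atomic(table_bytes):
    """A file-atomic HSM recalls the whole file, whatever the query needs."""
    return table_bytes


def bytes_staged_block_level(row_offsets, block_bytes):
    """Stage only the distinct blocks that contain the requested rows."""
    blocks = {offset // block_bytes for offset in row_offsets}
    return len(blocks) * block_bytes


table_bytes = 100 * GB
# Four rows, two of which happen to fall in the same 64MB block.
needed_rows = [0, 10, 70 * MB, 50 * GB]

whole = bytes_staged_file_atomic(table_bytes)
partial = bytes_staged_block_level(needed_rows, 64 * MB)
print(f"file-atomic: {whole / GB:.0f} GB, block-level: {partial / MB:.0f} MB")
```

In this made-up case the block-level approach moves three blocks instead of the full table, which is the saving the paragraph above has in mind.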
Leading Edge or Bleeding Edge
Any early adopter in the near-line storage area will have to carefully balance standards adherence with leading-edge technology, both in hardware and software. Among tape drive offerings, there are the usual DLT, 8mm and half-inch cartridges to choose from, with LTO soon to be available. However, each has its own access characteristics and price point. The StorageTek 9840 drive, with its high-speed access, is thought by many to be the leader in near-line storage applications. At close to $20K each, it is geared toward high-end use. However, since the primary objective is cost savings, several cheaper but slower-access drives may work just as well if used in parallel. Although DLT is less than stellar from an access perspective, it can work, especially for a pilot project where proof of concept is more important than final benchmark numbers. AIT is tainted with the memory of 8mm helical scan's bad old days. However, with drives priced at $5K each, many AIT-2 based tape libraries have a compelling price/performance story. In software, since there is not yet anything resembling a standard for this solution, it is more important to find a vendor who is willing to explore the possibilities and understand all the nuances of the warehousing requirements. Also, vendor experience in effectively managing very large quantities of data, into the hundreds of terabytes, will be key since that is where the bulk of the work will be when implementing a data warehouse with near-line storage.
Will Everything Change?
The impact of the near-line storage solution on the warehouse developers will depend largely on the software approach. A special purpose database will require bulk loading of data from production to the warehouse database. It may also require retraining and reprogramming of queries. On the other hand, a solution that can work with existing databases may not require anything new of the developers. This would allow them to concentrate on query development rather than storage issues.
To minimize the inevitable delays in retrieving data from near-line storage, a certain amount of time-versus-space balancing is required. The cost savings from keeping data on lower-cost media are clear. However, if query processing times become intolerably long, the whole exercise is futile. A rule-of-thumb expectation is a five- to sevenfold increase in query time when comparing near-line to online-based data warehousing.
Near-line access latency consists of three components: load time, access time and transfer time. Since tape transfer speed is fairly comparable to that of disk drives, it is in the first two areas that most of the time savings will be realized. This may take the shape of initiating parallel stages from many drives when a multitable query is detected. This assumes the near-line solution is able to organize the table data on separate tape families to facilitate parallel access. Full table read-ahead for a table scan would help reduce the time-consuming tape start/stop operations, freeing up valuable and scarce drive resources for other staging requests. The ability to peek at the data set expected to be used by a query, such as through a query plan examination, would allow the near-line solution to retrieve all required data from tape simultaneously. If this information is available down to actual data blocks, near-line retrieval can take place with surgical precision.
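The three latency components and the effect of parallel staging can be modeled roughly as follows. This is a back-of-the-envelope sketch, not a description of any vendor's scheduler: it assumes each table sits on its own tape, that every stage pays the full load and positioning cost, and that the longest stages are handed to drives first. All timing figures and names are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class StageRequest:
    table: str
    size_mb: float


def stage_seconds(req, load_s, position_s, rate_mb_s):
    """Load time + access (positioning) time + transfer time for one table."""
    return load_s + position_s + req.size_mb / rate_mb_s


def serial_seconds(reqs, load_s, position_s, rate_mb_s):
    """Total time when a single drive stages the tables one after another."""
    return sum(stage_seconds(r, load_s, position_s, rate_mb_s) for r in reqs)


def parallel_seconds(reqs, drives, load_s, position_s, rate_mb_s):
    """Elapsed time when stages are spread over several drives.

    Longest stages are assigned first, each to the drive that frees up
    earliest; the result is the time at which the last drive finishes.
    """
    if drives < 1:
        raise ValueError("at least one drive is required")
    free_at = [0.0] * drives
    heapq.heapify(free_at)
    durations = sorted(
        (stage_seconds(r, load_s, position_s, rate_mb_s) for r in reqs),
        reverse=True,
    )
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# A three-table query with assumed drive characteristics.
query = [
    StageRequest("orders", 1000),
    StageRequest("customers", 2000),
    StageRequest("line_items", 3000),
]
timing = dict(load_s=20, position_s=40, rate_mb_s=10)

print(serial_seconds(query, **timing))       # one drive
print(parallel_seconds(query, 3, **timing))  # one drive per table
```

With enough drives the elapsed time collapses to that of the slowest single stage, which is why laying tables out on separate tape families matters.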
Disk space management is vitally important to ensuring acceptable query times. Cache starvation can lead to thrashing, where table data gets purged before it is actually used by the query. Staging in only the needed rows of data can dramatically reduce the query footprint on the disk cache; however, this granularity needs to be balanced against ideal block sizes for near-line transfer. If a tape requires 20 seconds to load and position for a read, then it makes sense to retrieve more than a few rows of data. If query progress information is available, then data already staged in and scanned by the query engine can be proactively purged to make disk space available for other data.
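One way to reason about the granularity trade-off is to ask how much data a single tape mount must deliver before the fixed load-and-position delay stops dominating. The sketch below uses the 20-second figure from the text; the 10MB/s transfer rate and the notion of a "target efficiency" are assumptions introduced for illustration.

```python
def drive_efficiency(chunk_mb, overhead_s, rate_mb_s):
    """Fraction of drive time spent transferring data rather than
    loading and positioning the tape."""
    transfer_s = chunk_mb / rate_mb_s
    return transfer_s / (overhead_s + transfer_s)


def min_chunk_mb(target_efficiency, overhead_s, rate_mb_s):
    """Smallest chunk size that reaches the target efficiency.

    Solves transfer / (overhead + transfer) = target for the chunk size.
    """
    if not 0 < target_efficiency < 1:
        raise ValueError("target_efficiency must be strictly between 0 and 1")
    return rate_mb_s * overhead_s * target_efficiency / (1 - target_efficiency)


OVERHEAD_S = 20        # load and position time, from the article
ASSUMED_RATE_MB_S = 10  # hypothetical sustained transfer rate

for target in (0.5, 0.8, 0.9):
    size = min_chunk_mb(target, OVERHEAD_S, ASSUMED_RATE_MB_S)
    print(f"{target:.0%} efficiency needs about {size:.0f} MB per mount")
```

Under these assumptions, anything much smaller than a couple of hundred megabytes per mount leaves the drive spending most of its time loading and seeking, so row-level precision has to be weighed against the disk cache it saves.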
Playing for Keeps
All components of near-line storage solutions are in a state of flux. Tape storage is about to undergo a leap in density, capacity and access times. SuperDLT's starting capacity of 80GB native is nearly twice that of its predecessor. STK's 9840 will double its transfer speed in the near future, and LTO's road map calls for several generations of capacity and speed doubling. A near-line solution must be equipped to handle the graceful transition of data from one generation of storage media to the next. It must also be flexible enough to work with new types of drives without major driver rewrites.
Keeping Data Safe
Warehouse data, almost by definition, does not change. Except for purging of old records and data cleanup, records do not get updated. This is the main reason near-line storage has a place in the data warehouse. What about backup? Since this is unchanging data, there is no need to include this data in the normal database backup cycle. However, for the same reason, near-line storage must ensure there is adequate protection of this valuable data through multiple tape copies, remotely generated and maintained if possible. Moreover, there must be data verification, such as a header index, built into the tape format to guard against incorrect tape positioning, which is all too frequent in even the most reliable tape drives in the industry.
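As an illustration of what a header index might look like, the following sketch prefixes each tape block with its own block number and a checksum, then checks both on read. The layout, the magic value and the function names are invented for this example and do not describe any actual tape format.

```python
import struct
import zlib

# Hypothetical header: 4-byte magic, 8-byte block index,
# 8-byte payload length, 4-byte CRC-32 of the payload (big-endian).
HEADER = struct.Struct(">4sQQI")
MAGIC = b"NLB1"


def pack_block(index, payload):
    """Prefix the payload with a self-describing header."""
    header = HEADER.pack(MAGIC, index, len(payload), zlib.crc32(payload))
    return header + payload


def read_block(raw, expected_index):
    """Return the payload, or raise if the drive landed on the wrong
    block or the data does not match its checksum."""
    magic, index, length, crc = HEADER.unpack_from(raw)
    if magic != MAGIC:
        raise ValueError("not a recognized block header")
    if index != expected_index:
        raise ValueError(
            f"positioning error: wanted block {expected_index}, found {index}"
        )
    payload = raw[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        raise ValueError("payload is truncated or corrupt")
    return payload


block = pack_block(7, b"1999-Q4 sales rows")
print(read_block(block, expected_index=7))

try:
    read_block(block, expected_index=8)  # simulate a mispositioned tape
except ValueError as err:
    print(err)
```

Because the block identifies itself, a mispositioned read is caught before wrong rows ever reach the query engine, and the same check can be run against each of the multiple tape copies.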
Proceed Step by Step
Growing data warehouses need an affordable solution, and alternative storage is a good candidate. While the field is still young, there are some vendors proposing solutions that have proven to work. As with previous iterations of your data warehouse implementation, this phase will certainly undergo evolutionary stages as data volumes grow beyond thresholds previously thought cost-prohibitive to cross. Start by building a modest near-line data warehouse of, for example, five terabytes. With the experience gained, move on to larger and more ambitious warehouses. You don't have to spend a fortune doing this. All that is required is some careful planning and evaluation of solutions already available.