But Backup Is Still Required!
Getting started with a data warehousing backup strategy means managing how:
- Different parts of the warehouse change at a different rate.
- Backup frequency slides in the direction of the transactional.
- "Read mostly" means "read and write."
Optimize the Backup!Even if the data warehouse is of a heroic size -- that is, in the multiterabyte range -- several tactics can increase the likelihood of handling the backup operation using standard database utilities. The options are straightforward and, in part, depend on the capabilities of the underlying database (such as IBM DB2, Microsoft SQL Server, Oracle10g, Sybase or Teradata) that is in operation. Use these approaches:
Partitioning to increase operational flexibility. Partition the database in such a way that an individual partition can be backed up in a reasonable window of time, even if the entire data structure is too large to backup all at once. For example, 12 partitions of 250GB each could be backed up periodically (daily, weekly or monthly, depending on the requirements). The schedule handles one partition at a time, not all three terabytes at once.
Incremental backup to reduce elapsed time to complete an operation. Use a database whose backup and recovery utility allows for incremental backup and restore. Both Oracle and DB2 provide for incremental backup. Because the backup copy just contains the deltas, the recovery operation requires less time than if the entire data structure has to be restored. However, a complete copy of the data to which the incremental copies can be applied must first be made.
Parallel processing to divide and conquer large data volumes. The operations environment can increase the bandwidth of the backup or restore activity by increasing the number of tape devices used and executing the operation in parallel. For example, if the operation calls for 24 data structures to be backed up and each requires one hour to complete, and 24 tape devices are available, then an exclusive backup window of only one hour is needed. In contrast, if only one tape drive is available, then the entire process will require 24 hours.
In this example, the tradeoff is between investing up front in the additional tape drive infrastructure and the inefficiency of the process on a daily basis, day after day. Such a parallel architecture also applies to massively parallel databases, such as Teradata or the IBM Scalable Parallel framework, where each node can be mapped to a tape drive.
Concurrent backup to extend availability. The major database vendors now allow concurrent backup - the table or table space can be backed up while the data structure is still online and in use. Log changes must also be applied if there are concurrent updates, but that is not the case for an inquiry-only data warehouse. This concurrent backup greatly relieves the problem of having to fit the backup into an exclusive batch window.
Of course, in the case of a data warehouse that is inquiry (i.e., read) only, the problem is much less severe. Once an initial backup is taken, the database does not change. The data management function must still provide for a backup of the additions to the tables, but these are usually much less than a full backup and can be handled by incremental backup of just the deltas (what has changed).
Redundant arrays of independent disks (RAID) to recover from media failure. If the installation is dealing with a data structure that is truly extreme in size -- such as the hundreds of terabytes that reportedly occur at Wal-Mart - then use a storage technology mechanism such as RAID 1 or RAID 5. In case of disk failure, the disk's contents can be reconstructed using redundant storage technology. However, note that this only addresses availability and data loss due to media (i.e., disk drive) failure. Other provisions must be made to address damage or loss of data integrity due to a rogue application or incorrect update. Design and implement a custom application to rebuild the data structure in case of loss of data integrity.
A combination of tactics to satisfy complex, real-world data center scenarios. In the real world, all five of these methods are combined as applicable. Partitioned tables are backed up in rotation, incremental backup is used against concurrent processes, and, for those structures that are too big to fit in the batch window, RAID 1 or RAID 5 is used. It is always necessary to have a backup. In some instances where the table or data structure does not ever change, the installation can save the input file so that the table can be restored in case of damage. Such a restore will be by means of an application rather than a formal database utility backup. However, the function is the same. The input file functions as a backup in this case. The cost and complexity are greater than if a database utility could be used, but still less than going out of business!
- 21 percent of respondents to a 2001 data warehousing questionnaire reported having a data warehouse more than one terabyte in size. In contrast, this data point was up and stood at 43 percent in early 2004, according to the February 2004 Data Warehousing Institute-Forrester Quarterly Technology Survey.
Lou Agosta is an independent industry analyst in data warehousing. A former industry analyst at Giga Information Group, Agosta has published extensively on industry trends in data warehousing, data mining and data quality. He can be reached at LAgosta@acm.org.