Q:

What are the recent trends in data warehouse backups when the warehouse is measured in terabytes? Are full online backups feasible? Should batch processing and backups be interdependent?

A:

Les Barbusinski’s Answer: Generally speaking, when a data warehouse is measured in terabytes, full database backups of any kind (i.e., "hot" or "cold") become problematic. The sheer volume overwhelms the infrastructure, and the backups simply take too long. Furthermore, backing up to tape always carries the risk of corruption (i.e., the tape becomes unreadable). That risk can be mitigated by taking daily backups: if yesterday’s backup tape is no good, you recover from the prior day’s tape and absorb the loss of one day’s worth of data. With terabyte implementations, however, backups tend to be less frequent, and the loss of a single backup could spell D-I-S-A-S-T-E-R.

If your RDBMS supports full and incremental backups of individual tables (as DB2 does), you can achieve a comfortable level of recoverability with a rotating schedule of backups (i.e., performing full backups of a few tables or table partitions each day) and by archiving the log files. Even then, it’s best to back up the tables to disk first (for speed), then spin the backup off to tape at your leisure.
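
To make the rotation concrete, here is a minimal Python sketch of the round-robin scheduling idea. The table list, the three-day cycle, and the handoff at the end are all illustrative assumptions, not part of any vendor’s tooling:

    from datetime import date

    # Hypothetical table list; in practice this would come from the RDBMS
    # catalog. CYCLE_DAYS controls how often each table gets a full backup.
    TABLES = ["fact_sales", "fact_shipments", "dim_customer",
              "dim_product", "dim_store", "dim_time"]
    CYCLE_DAYS = 3
    PER_DAY = -(-len(TABLES) // CYCLE_DAYS)   # ceiling division: tables per day

    def tables_due_today(today: date) -> list[str]:
        """Return the tables whose turn for a full backup falls on 'today'."""
        slot = today.toordinal() % CYCLE_DAYS
        return TABLES[slot * PER_DAY:(slot + 1) * PER_DAY]

    for table in tables_due_today(date.today()):
        print("full backup due:", table)   # hand this off to your backup utility

Over any three consecutive days, every table gets one full backup, so the daily backup window stays roughly one-third of a full-warehouse pass.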

If, however, your RDBMS doesn’t support individual table backups (a la Oracle or Sybase), the best solution (albeit a very expensive one) is hardware mirroring. This is a solution offered by SAN vendors such as EMC and others that allows you to create a channel-attached mirror of your data warehouse. It can be detached at will (which effectively makes it a disk-based backup), then reattached and resynchronized in as little as two to four hours. Detached, it can even be brought online as a separate database "instance" on a different server. This feature provides you with several benefits:

  • The new "instance" can be used as the target database for your OLAP and reporting engines while your main "instance" is busy performing other functions (such as ETL, data distribution, etc.).
  • The new "instance" can be used to restore an individual table in the main "instance" to the previous day’s state using a simple COPY function.
  • The new "instance" can be used as the source of a full database tape backup while the main "instance" continues to provide regular data warehouse services to the user community.

The disadvantages of this approach are a) its exorbitant expense and b) the fact that the RDBMS has to be offline during the resynchronization operation.
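
As an illustration of the second bullet above, the single-table restore from a detached mirror amounts to a plain cross-instance copy. This is a hedged sketch using generic Python DB-API connections; the connection objects, the "?" parameter style, and the batch size are assumptions, not any SAN vendor’s actual interface:

    BATCH = 10_000   # rows copied per round trip; an arbitrary assumption

    def restore_table_from_mirror(main_conn, mirror_conn, table):
        """Copy 'table' from the detached mirror instance back into the
        main instance, replacing the damaged contents wholesale."""
        src = mirror_conn.cursor()
        dst = main_conn.cursor()
        dst.execute("DELETE FROM " + table)        # clear the damaged copy
        src.execute("SELECT * FROM " + table)
        while True:
            rows = src.fetchmany(BATCH)
            if not rows:
                break
            marks = ", ".join(["?"] * len(rows[0]))
            dst.executemany(
                "INSERT INTO " + table + " VALUES (" + marks + ")", rows)
        main_conn.commit()                         # one unit of recovery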

Scott Howard’s Answer: Many disk systems today have their own way of providing maximum system (database) availability while performing required backups. These systems can "copy" terabytes very quickly, but the copy is made independently of the RDBMS’s scope of recovery and must be manually reconciled with it. The RDBMS vendors also offer newer backup features and techniques. Capabilities such as business-continuance volumes and split-mirror log mirroring can help as well, but they are expensive and therefore suited only to critical availability needs. They are still worth researching with your RDBMS vendor.

If all else fails, or proves too expensive, you can employ other techniques such as manual range partitioning: partition the large fact tables by date (or another appropriate range), then UNION them back together for materialization purposes. The advantage is that you only need to back up the current partition until it is full. For example, say you need to keep two years’ worth of history in the fact table. Create 24 partitions, or 24 cloned tables (depending on your RDBMS), each representing one month’s worth of facts. You then only need to back up the current month periodically until time marches on to the subsequent month, at which point the current month becomes the prior month. Make one final backup of the new prior month and keep it for two years, as it will never change again. This technique does require granular backup and recovery schemes, but it reduces backup volumes, in this case by a factor of 24. There are also issues in UNIONing the partitions back together that vary by RDBMS and must be explored and addressed. You don’t want to improve availability by sacrificing DW performance!
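
A minimal Python sketch of the cloned-table variant, generating the monthly table names and the UNION ALL view that stitches them back together. The fact_sales naming and the SELECT * view are illustrative assumptions; real DDL will vary by RDBMS:

    from datetime import date

    def month_tables(base, start, months):
        """Name one clone table per month: fact_sales_200101, _200102, ..."""
        names, y, m = [], start.year, start.month
        for _ in range(months):
            names.append("%s_%04d%02d" % (base, y, m))
            y, m = (y + 1, 1) if m == 12 else (y, m + 1)
        return names

    def union_view_ddl(view, tables):
        """Stitch the clones back together behind a single view."""
        selects = "\nUNION ALL\n".join("SELECT * FROM " + t for t in tables)
        return "CREATE VIEW " + view + " AS\n" + selects

    parts = month_tables("fact_sales", date(2001, 1, 1), 24)
    print(union_view_ddl("fact_sales_v", parts))
    # Only the current month's clone needs routine backups; each closed
    # month gets one final backup that stays valid for the two-year window.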

Chuck Kelley’s Answer: This depends on your refresh strategy. If your refresh strategy writes the output to flat files and then bulk loads them, your backup strategy will differ from one where the ETL process loads the DW directly. If you do bulk loads, I would take full backups on a quarterly or monthly basis and keep the ETL load files as backup. If you need to recover, go back to the last backup and then bulk load the data that arrived since. Most DBMSs deal well with this scheme. If you are doing direct loading, I would take weekly full or incremental backups and keep the log (or journal) files available for recovery. Be sure to TEST your choice!
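
A hedged sketch of the restore-then-replay path for the bulk-load case. Here restore_full_backup() and bulk_load() are hypothetical stand-ins for your RDBMS’s restore utility and bulk loader, and the date-stamped file naming is an assumption:

    from pathlib import Path

    def restore_full_backup(backup_id):
        """Placeholder for the vendor's RESTORE utility."""
        print("restoring full backup", backup_id)

    def bulk_load(table, load_file):
        """Placeholder for the vendor's bulk loader (LOAD, IMPORT, etc.)."""
        print("bulk loading", load_file, "into", table)

    def recover(backup_id, load_dir, backup_stamp):
        """Restore the last full backup, then replay every retained ETL
        load file stamped after that backup, oldest first."""
        restore_full_backup(backup_id)
        for load_file in sorted(Path(load_dir).glob("*.dat")):
            if load_file.stem > backup_stamp:     # names like 20020331.dat
                bulk_load("fact_sales", load_file)

    recover("2002Q1_full", "/etl/loads", "20020331")

The key discipline is retaining every load file produced since the last full backup; the files are, in effect, your incremental backups.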
