What are the recent trends in data warehouse backups where the size of the warehouse is in terabytes? Are full online backups feasible? Should batch and backups be interdependent?
Les Barbusinski’s Answer: Generally speaking, when a data warehouse is measured in terabytes, full database backups of any kind (i.e., "hot" or "cold") become problematic. The volume just overwhelms everything, and backups just take too much time. Furthermore, backing up to tape always presents the potential of corruption (i.e., the tape becomes unreadable). This can be mitigated if you take daily backups (i.e., if yesterday’s backup tape is no good, you can recover from the prior backup tape and deal with the loss of a day’s worth of data). However, with terabyte implementations backups tend to be less frequent, and the loss of a single backup could spell D-I-S-A-S-T-E-R.
If your RDBMS supports full and incremental backups of individual tables (such as DB2), you can achieve a pretty comfortable level of recoverability by performing a rotating schedule of backups (i.e., performing full backups of a few tables or table partitions each day) and archiving the log files. Even then, it’s best to backup the tables to disk (for speed), then spin off the backup to tape at your leisure.
If, however, your RDBMS doesn’t support individual table backups (a la Oracle or Sybase), the best solution (albeit a very expensive one) is hardware mirroring. This is a solution offered by SAN vendors such as EMC and others that allows you to create a channel-attached mirror of your data warehouse. It can be detached at will (which effectively makes it a disk- based backup), then reattached and resynchronized in as little as two to four hours. Detached, it can even be brought online as a separate database "instance" on a different server. This feature provides you with several benefits:
The disadvantages to this approach are a) its exorbitant expense and b) the fact that the RDBMS has to be offline during the resynchronizing operation.
Chuck Kelley’s Answer: This depends on your refreshment strategy. If your refreshment strategy has the output into flat files and then bulk load, then the backup strategy will be different than if you are loading the DW directly from the ETL process. If you do bulk loads, then I would do full backups on a quarterly or monthly basis and keep the ETL load files as backup. If you need to recover, go back to the last backup and then bulk load the data. Most DBMSs deal well with this scheme. If you are doing direct loading, then I would do weekly or incremental backups and keep the log (or journal) files available for recover. Be sure to TEST your choice!
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access