Data warehousing without archiving is like a garage attached to a residence that a family has lived in for several generations. It becomes the container for a lot of “valuable” stuff, but soon there is no room for the family car. If that stuff were just pure junk, the family could simply throw it out. Yet everyone knows that as soon as anything has been thrown out, there will be a request for it, so the family keeps it just in case. The same can be said for data warehouses.

 

Data warehouses often start big and get even bigger. In the October 2007 “Data Warehousing Satisfaction Survey,” 78 percent of data warehouses surveyed were 1TB or larger. Data warehouses over 100TB in size are growing at a rate of 35 percent, albeit from a small base.1 Managing and controlling these exploding volumes of data, while reducing latency and doing more with less, is a challenge facing businesses across all industries.

 

Managing Your Data Growth

 

Best practices in data governance indicate that information has a lifecycle. It is born as a customer calls up, orders a product and gets identification. It goes through various transformations as it is related to financial, marketing, demand planning or predictive uses, in order to answer questions crucial to operating and optimizing business decisions. Finally, information has an end game. The customer moves away or the product is discontinued. The data is no longer updated, remains unused by the business and eventually becomes irrelevant both to the enterprise and to the society in which the enterprise does business. As the volumes of data accumulate, the data warehouse becomes “obese.” Meanwhile, the data warehouse becomes entwined with mission-critical systems, impacting the performance of both transactional and decision support systems. The lifecycle of the data warehouse and the requirement to perform archiving shift into the foreground.

 

Understanding information lifecycle management (ILM) for data warehousing and how it relates to data archiving requires following the information supply chain. Every business exchanges products or services for payments. This basic set of transactions forms the lifeblood of the enterprise. Enterprises engage in data archiving as part of an approach to information lifecycle management, of which data warehousing is an essential part. Archiving is the best way both to improve the performance of the data warehouse (or transactional system) and to satisfy the requirements for data retention and security. Industry analysts estimate that up to 85 percent of production data is inactive.2 While this number surely differs from one solution to another, a system that has been in production for several years is likely to contain a significant volume of data that is used infrequently or not at all. It is just common sense - at least to a database administrator. The more data that needs to be scanned, the longer the response time. The more data to be processed, the deeper the index hierarchy and the longer the response time. The longer the response time, the more likely that pressure will build from the business to add processors, database licenses and staff. In short, unused data is costly. It does not just sit around quietly - it consumes significant resources.

 

Building the Business Case

 

The business case for data archiving includes an analysis of return on investment (ROI). If an enterprise has incurred a substantial financial penalty from losing a legal case due to failure to produce electronic documents (e-discovery), then such hard data can form the business case in itself. Such enterprises have learned the value of archiving - unfortunately, in the “college of hard knocks.” However, most firms are more fortunate - or lucky. If you are one of those, then look first for savings in storage technology, administrative costs, backup profile and performance, and (deferred) system upgrades. Boeing estimates that it costs the company approximately $67,000 for every document it cannot find.3 At $1 a gigabyte, reducing the amount of disk needed by a terabyte saves $1,000. This also translates into less disk to administer and less to back up, which means a tighter batch window and the ability to do more with less. Because the proportion of data in any given system tends to favor cold data over warm or hot data, the cost benefit of archiving grows proportionately. Cold data is inactive and unused: it is not touched by inquiries or updates. Since data warehousing storage costs rise disproportionately as the size of the warehouse grows, eventually coming to dominate the effort to administer the system, the data warehouse deserves attention as the target for archiving activity. As data warehouses grow into the multiterabyte range, policy-driven data archiving is an especially effective way of reducing data warehousing obesity.
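
The storage arithmetic in this paragraph can be sketched in a few lines of Python. This is only a rough illustration: the function names are hypothetical, the terabyte is taken as 1,000 gigabytes to match the $1,000 figure, and the result is a gross number that ignores the cost of the archive media itself.

```python
GB_PER_TB = 1_000  # decimal terabyte, so that $1 per GB gives $1,000 per TB


def archivable_tb(warehouse_tb: float, inactive_fraction: float) -> float:
    """Estimate how many terabytes of a warehouse are cold data (assumed input)."""
    if not 0.0 <= inactive_fraction <= 1.0:
        raise ValueError("inactive_fraction must be between 0 and 1")
    return warehouse_tb * inactive_fraction


def storage_savings(archived_tb: float, cost_per_gb: float = 1.0) -> float:
    """Return the gross disk cost avoided by moving archived_tb off production."""
    return archived_tb * GB_PER_TB * cost_per_gb


if __name__ == "__main__":
    cold = archivable_tb(warehouse_tb=10, inactive_fraction=0.5)
    print(f"{cold} TB archivable, ${storage_savings(cold):,.0f} in disk avoided")
```

The inactive fraction has to be measured for each system; the 10TB warehouse and 50 percent figure in the usage lines are placeholders, not survey results.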

 

Data becomes information when it represents business relationships. For example, in retail, an order has line items - a lady’s handbag, shoes and a scarf make up an order. In health care, a visit to the doctor by a particular patient on a given day includes multiple services - exam, tests, shots, etc. These enter the insurance company as a related claim, possibly pointing to multiple providers for lab services. These relationships - also called parent-child, since there is a hierarchy - should be preserved in the process of moving information through the information supply chain. Data warehouses are required to represent these relationships accurately, whether in a star schema, snowflake schema or hybrid data model. When the high-level entity, such as an order, is archived, the corresponding line items should be archived with it. If this does not happen, then data integrity is lost. Such connections form a complete business object, and enterprises should look for archiving software that represents and preserves such complex entities in a simple, easy-to-manage and high-performing way.
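
A minimal sketch of archiving a complete business object, using Python's built-in sqlite3 module, might look like the following. The table and column names (orders, line_items and their archive twins) are hypothetical and chosen only to mirror the retail example; a real archiving product would work against its own catalog. The point is that parent and children move in a single transaction, so a failure cannot leave orphaned line items behind.

```python
import sqlite3


def archive_order(conn: sqlite3.Connection, order_id: int) -> int:
    """Move one order and all of its line items to archive tables atomically.

    Returns the number of line items removed from production.
    """
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE order_id = ?",
            (order_id,),
        )
        conn.execute(
            "INSERT INTO line_items_archive "
            "SELECT * FROM line_items WHERE order_id = ?",
            (order_id,),
        )
        # Delete children before the parent so no orphans are ever visible.
        moved = conn.execute(
            "DELETE FROM line_items WHERE order_id = ?", (order_id,)
        ).rowcount
        conn.execute("DELETE FROM orders WHERE order_id = ?", (order_id,))
    return moved


def demo():
    """Build a tiny in-memory schema, archive order 1, and report row counts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
        CREATE TABLE line_items (item_id INTEGER PRIMARY KEY,
                                 order_id INTEGER, product TEXT);
        CREATE TABLE orders_archive (order_id INTEGER PRIMARY KEY, customer TEXT);
        CREATE TABLE line_items_archive (item_id INTEGER PRIMARY KEY,
                                         order_id INTEGER, product TEXT);
        INSERT INTO orders VALUES (1, 'example customer');
        INSERT INTO line_items VALUES (1, 1, 'handbag');
        INSERT INTO line_items VALUES (2, 1, 'shoes');
        INSERT INTO line_items VALUES (3, 1, 'scarf');
        """
    )
    moved = archive_order(conn, 1)
    left = conn.execute("SELECT COUNT(*) FROM line_items").fetchone()[0]
    kept = conn.execute("SELECT COUNT(*) FROM line_items_archive").fetchone()[0]
    conn.close()
    return moved, left, kept
```

Calling demo() archives the order together with its handbag, shoes and scarf; the same pattern extends to deeper hierarchies by adding one insert and one delete per child table.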

 

An archive of data warehousing information is significantly different from a backup of the database or of particular data structures. A backup of a data warehouse duplicates either all the data or, for an incremental backup, all the data that was touched since a given point in time. In contrast, an archive applies policy-based selection criteria, such as the period since the data was last referenced, the class of data (e.g., customer, product, claim) or a combination of criteria, to remove the item from the production database and move it to a different medium, typically a less expensive one with a less rigorous performance profile. The archive process then leaves a metadata-like trace in the catalog or database index that tells the system where to look for the data in the event that it is requested. The data still exists, but it has been archived in such a way that it can be retrieved if it is needed (though with a lower priority of service than if it were online).
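
The difference from a backup can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the Record and ArchivePolicy classes, the plain dictionaries standing in for the production store, the archive medium and the catalog, and the idle-days threshold. It shows selection by data class and time since last reference, removal from production, and the trace left behind in the catalog.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Record:
    key: str
    data_class: str          # e.g. "customer", "product", "claim"
    last_referenced: date


@dataclass
class ArchivePolicy:
    data_class: str
    max_idle_days: int       # archive once unreferenced for longer than this

    def selects(self, record: Record, today: date) -> bool:
        idle = today - record.last_referenced
        return (record.data_class == self.data_class
                and idle > timedelta(days=self.max_idle_days))


def run_archive(production: dict, archive: dict, catalog: dict,
                policy: ArchivePolicy, today: date) -> list:
    """Move records selected by policy out of production.

    Unlike a backup, the record is removed from production; only a
    catalog entry remains to say where the data now lives.
    """
    moved = [key for key, rec in production.items()
             if policy.selects(rec, today)]
    for key in moved:
        archive[key] = production.pop(key)
        catalog[key] = "archive"   # trace telling the system where to look
    return moved
```

A retrieval path would consult the catalog first and, on finding an "archive" entry, schedule a lower-priority fetch from the archive medium rather than reading production.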

 

ILM logically leads to the technology implemented by hierarchical storage management (HSM). HSM distinguishes different classes of storage - online, nearline, offline (or on the shelf) - that provide different performance and cost profiles. Online storage is on low-latency disk and is readily accessible to inquiries without any significant mechanical delay. Nearline storage refers to an optical jukebox, such as a write-once-read-many (WORM) device, that contains dozens of optical disks that can be mounted for read access by means of a robot arm. A similar idea occurs with the use of a robotic tape library, though the performance is lower since the tape has to be read sequentially. Finally, offline storage allows for popping (“ejecting”) the WORM disks or tape cartridges out of the robotic silo and putting them literally on the shelf, in sequential order, each with a bar code whose identifier is known to the system. Manual intervention is required, typically by a human on roller blades (sorry, not included with the jukebox or tape silo), to read the request off the system console, fetch the disk and insert it back into the jukebox or silo.

 

Compliance Puts Many Data Warehouses on the Critical Path

 

Archiving replaces purging for data that is still subject to a retention policy. You cannot simply throw the data away if it falls within the statutory retention requirements of such legislation as Sarbanes-Oxley (SOX), SEC-17A or the Health Insurance Portability and Accountability Act (HIPAA). A side benefit is the simple fact that moving closed accounts, inactive customers and unused data to relatively less-accessible storage, possibly with encryption, means there is less data in the line of fire to be stolen or inadvertently lost.

 

Many enterprises argue that the data warehouse is off the critical path and therefore is not subject to regulatory review under Sarbanes-Oxley or related legislation. That is especially true with first generation data warehousing, where the data is used for aggregation and reporting and is arguably just another way of representing the same data, so that the information is supposedly the same as in the transactional system, just more accessible. However, the criticality changes with second generation data warehouses, which support advanced applications such as fraud detection or predictive analytics, and with third generation data warehouses, in which the loop is closed back to the transactional system and used to optimize operations. The data warehousing data is no longer disposable, in any sense of the term, but is part of the mission-critical set of solutions. It becomes subject to the same service level agreements (SLAs), including those for data retention and security, as the transactional system itself and must be covered by information lifecycle policies for standard corporate data. If the data warehouse was used to make business decisions, it may be the target of legal disclosure under e-discovery. Such solutions raise the bar on data warehousing performance. As indicated, archiving is needed to support rigorous data warehousing SLAs. But that is not all.

 

Just when enterprises thought they had email archiving handled, new federal legislation raised the bar on the retention and retrieval of all electronic records, so-called e-discovery, which, under one interpretation, includes the data used to make business decisions.4 While it is unlikely that a federal judge would tell the CIO to “bring the data or bring your toothbrush” (because he will be staying overnight), substantial financial penalties will accrue if a company cannot produce the data impacted by civil litigation (leaving aside that it will then be at risk of losing the case on procedural grounds). The key takeaway is the requirement for archiving technology and solutions. An archiving system should be able to be implemented top down by specifying policies for handling the data according to its lifecycle.

 

For accountability to be traceable from the executive (decision-making) function to the managerial (administrative) perspective to the implementation in the IT department, any archiving technology must support policy-based rules. For example, medical claims should be retained online in a work-in-progress database for three months or until closed, whichever occurs later. Once closed, claims should be migrated to a readily accessible decision support database against which inquiries can be executed with response times ranging from fifteen seconds for short, tactical queries to fifteen minutes for longer aggregates. After 18 months, the claims should be migrated to nearline storage, such as optical disk or WORM media, that provides access within twenty minutes, depending on how busy the system is, by means of an asynchronous request and notification. After three years, the claims should be migrated to tape or popped out of the optical device (often called a jukebox because it has a robot arm to swap platters in and out) and placed on the shelf with a readable bar code known to the system. Therefore, if the information is needed - whether due to a legal summons, financial audit or any business reason - it can be requested, mounted by an operator and read.
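
The claims example can be written down as an executable rule, which is roughly what a policy-based archiving tool has to evaluate. The sketch below is illustrative only. It assumes that age is counted in whole months from the day the claim was opened and that each threshold takes effect once the stated age is reached; the article does not fix either detail, and the Tier and claim_tier names are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    WORK_IN_PROGRESS = "online work-in-progress database"
    DECISION_SUPPORT = "online decision support database"
    NEARLINE = "nearline optical or WORM storage"
    OFFLINE = "offline tape or shelved media"


def claim_tier(age_months: int, is_closed: bool) -> Tier:
    """Return the storage tier for a medical claim under the example policy.

    Assumption: age_months is measured from the date the claim was opened.
    """
    if age_months < 0:
        raise ValueError("age_months must be non-negative")
    # Three months or until closed, whichever is later: open claims and
    # any claim younger than three months stay in work-in-progress.
    if not is_closed or age_months < 3:
        return Tier.WORK_IN_PROGRESS
    if age_months < 18:
        return Tier.DECISION_SUPPORT
    if age_months < 36:
        return Tier.NEARLINE
    return Tier.OFFLINE


if __name__ == "__main__":
    for age, closed in [(1, True), (6, True), (24, True), (40, True), (40, False)]:
        print(age, closed, claim_tier(age, closed).value)
```

Keeping the rule in one small function makes the policy reviewable by the people accountable for it, and a scheduled job can compare each claim's current location with the tier the rule returns to decide what to migrate.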

 

In conclusion, data warehousing supports business intelligence solutions, and both are an essential part of information lifecycle management (ILM). The benefits of data archiving, which occurs toward the back end of the ILM process, include improved system performance, reduced data warehousing obesity through reduced storage size and costs, and correspondingly fewer servers to administer. The enterprise also reduces its risk of noncompliance by conforming to policies supporting legal e-discovery. The implementation of data archiving for data warehousing - and other enterprise systems - tends to raise the level of an enterprise’s information agility, especially when accomplished in the context of a coherent approach to ILM.

References:

 

  1. Lou Agosta and Kevin Modreski. “The Data Warehousing Satisfaction Survey, Part 3: A Single Fact is Worth a Thousand Opinions.” DM Review Special Report, October 2007.
  2. Noel Yuhanna. “Database Archiving Remains an Important Part of Enterprise DBMS Strategy.” Forrester.com, August 13, 2007.
  3. Byte and Switch. “Archiving: A Plan of Attack.” Byte and Switch Insider, January 2005.
  4. The Committee on the Judiciary, House of Representatives. “Federal Rules of Civil Procedure (Rules 16, 26, 33, 34 and 36).” U.S. Government Printing Office, December 31, 2004.
