At a recent Gartner event, I spoke with an analyst - he shall remain nameless - who feigned surprise that anyone from the data warehousing world would want to talk to him about information life cycle management (ILM): "The content management people have adopted an ILM discipline to control growth, the application archiving business is growing at double digits, and yet every time I speak to someone from data warehousing, they want to brag about how big their warehouse is. You guys seem to be the last holdout of the 'size matters' camp!"

In a way, he was right. While rampant data warehouse growth is increasing both storage and storage management costs and (probably more importantly) reducing the performance of the applications that depend on the warehouse, I hear most warehouse owners talking about moving from five terabytes to ten as if conceding to "tera-flation" were their only option.

Although storage costs are decreasing by 15 percent every quarter and hardware performance is doubling every 18 months, we in the data warehousing market are not getting ahead of the growth curve - and that accounts only for operational growth and the continued demand for business intelligence within organizations. When we look at the potential impact of new regulatory requirements to keep data for longer periods, the challenge becomes much more formidable.

Today's standard "best practices" call for constant budget growth in order to permit upgrades to underlying warehouse infrastructures and, when the problem becomes insurmountable, to enter the expensive world of specialized hardware-based data warehousing environments.

Surely there must be a more manageable and economical way to deal with warehouse explosion?

It turns out that there is another option - if we simply steal some secrets from the content management and application management arenas. After all, the data kept for business intelligence purposes represents only about 20 percent of the data an organization has to manage. The other 80 percent is either unstructured or semi-structured data (such as e-mail), or in-production structured data associated with transaction applications. In these areas, ILM has already been embraced as an approach to managing storage costs and enhancing performance for users.

The basic principle of an ILM strategy is to match the availability of data - and therefore the resources it is assigned - to its current value to the organization. High-usage, high-value data is assigned the best "online" infrastructure to ensure high levels of user satisfaction. Data of potential value is kept nearline so that it can be restored online if and when it is required. Finally, data that is no longer an active part of the application, but which will be useful in the future - for compliance purposes, for example - is kept offline in an environment where it can still service users, but where its availability does not negatively impact the performance of the primary application.

Why should this discipline be applied to data warehousing?

Studies have repeatedly indicated that 25 percent or less of the data in a traditional data warehouse is used regularly, while the other 75 percent is there "just in case." Because the data warehouse as we know it takes a "one-size-fits-all" approach, this just-in-case data is as costly to the organization as the data that is actively used. Keeping it available in the warehouse consumes resources and, more importantly, compromises the warehouse's ability to serve users of the high-value, high-use data expediently. A big warehouse equals a slow warehouse - which, in turn, equals unhappy users.

Applying an ILM discipline to the data warehouse would involve the following (a simple sketch of the tiering logic appears after the list):

  • All the current, actively used data would be kept online in the warehouse with the highest possible level of availability, in order to expedite operational business intelligence.
  • The just-in-case data, or data that has a high value but is associated with specific BI applications that run only occasionally, would be kept nearline with the possibility of being restored to the online warehouse as and when it is required.
  • Data that is past its usefulness to the BI applications, but which has historic or audit-related value, would be kept "offline" in an accessible repository, where it would be available to a new class of users using simple query applications.
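To make the tiering concrete, here is a minimal sketch of what such a classification policy might look like if access recency were used as the yardstick of value. The thresholds, partition names and data structures are purely illustrative assumptions; a real policy would draw on the warehouse's own usage statistics and business rules.

    from datetime import date, timedelta
    from dataclasses import dataclass

    # Hypothetical thresholds: partitions touched in the last 90 days stay online,
    # partitions untouched for up to three years go nearline, anything older goes offline.
    ONLINE_WINDOW = timedelta(days=90)
    NEARLINE_WINDOW = timedelta(days=3 * 365)

    @dataclass
    class Partition:
        name: str
        last_accessed: date

    def assign_tier(partition: Partition, today: date) -> str:
        """Map a warehouse partition to an ILM tier based on how recently it was used."""
        age = today - partition.last_accessed
        if age <= ONLINE_WINDOW:
            return "online"    # high-value, high-usage data on the primary infrastructure
        if age <= NEARLINE_WINDOW:
            return "nearline"  # just-in-case data, restorable on demand
        return "offline"       # historical or audit data in the read-only archive

    if __name__ == "__main__":
        today = date(2006, 1, 1)
        sample = [
            Partition("sales_2005_q4", date(2005, 12, 15)),
            Partition("sales_2004_q1", date(2005, 2, 1)),
            Partition("sales_2001_q1", date(2002, 3, 1)),
        ]
        for p in sample:
            print(p.name, "->", assign_tier(p, today))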

The online warehouse would be assigned the most powerful infrastructure, in order to ensure high service levels to regular users. It would be indexed to support the persistent applications and would probably also support dependent data marts for complex analytical applications.

The ideal nearline warehouse would be able to store data very efficiently and, without any additional indexing, allow rapid restoration of data to the online environment, either on demand or according to a business schedule. The restored data would be used as required and then purged, so as not to bloat the online warehouse. Copies of this data would be maintained at all times in the efficient nearline archive.
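A minimal sketch of that restore-and-purge cycle might look as follows, using in-memory stand-ins for the online warehouse and the nearline archive. Every class, method and partition name here is a hypothetical placeholder for whatever load, query and drop mechanisms a given warehouse actually provides.

    class NearlineArchive:
        """Keeps a permanent, cheaply stored copy of every archived partition."""
        def __init__(self, partitions):
            self._store = dict(partitions)

        def read(self, name):
            return self._store[name]

    class OnlineWarehouse:
        """Holds only the actively used partitions on the primary infrastructure."""
        def __init__(self):
            self._partitions = {}

        def load_partition(self, name, rows):
            self._partitions[name] = rows

        def drop_partition(self, name):
            self._partitions.pop(name, None)

        def total_rows(self):
            return sum(len(rows) for rows in self._partitions.values())

    def run_occasional_report(warehouse, archive, needed_partitions):
        """Restore archived partitions for one BI run, then purge them again."""
        for name in needed_partitions:
            warehouse.load_partition(name, archive.read(name))  # a copy remains in the archive
        try:
            return warehouse.total_rows()  # stand-in for the occasional BI query
        finally:
            for name in needed_partitions:
                warehouse.drop_partition(name)  # purge so the online warehouse stays lean

    archive = NearlineArchive({"sales_2003": [1, 2, 3], "sales_2002": [4, 5]})
    warehouse = OnlineWarehouse()
    print(run_occasional_report(warehouse, archive, ["sales_2003", "sales_2002"]))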

The offline archive would also offer efficient storage for keeping "audit quality" read-only records, and be accessible to users using standard business intelligence tools and methods.

The advantages of this type of architecture are fourfold:

  1. Users of the primary data in the online warehouse can rely on consistently high business intelligence application performance.
  2. The organization is able to practice better business intelligence because it has an effective way of storing historical and other secondary data, thus making available a broader and deeper view of the business than might currently be possible.
  3. The data warehouse remains a manageable size and user satisfaction is maintained without moving to specialized hardware platforms, thereby containing costs.
  4. The regulatory and compliance burden is effectively handled by a specialized "archive" without burdening the operational warehouse.

Given these advantages, and all the negatives associated with ignoring the solution and acquiescing to unrelenting growth, this approach has real merit. The industry is beginning to recognize this: SAP's recent announcements in this area, for example, indicate that it sees a place for ILM in its future SAP BI strategy.

ILM is already the "gospel" of other data-intensive areas, and META Group believes that the database archiving market will be worth $2.7 billion in 2007. Perhaps it's time for the data warehousing industry as a whole to acknowledge that bigger is not necessarily better and to take a closer look at this efficient and economical approach to managing warehouse growth.
