Information Life Cycle Management (ILM) for Data Warehousing
InfoManagement Direct, July 2005
Record data warehouse growth continues. Some estimates project in excess of 100 percent per year - and nearly a third of data warehouses in North America exceed a terabyte in size already. DW managers are inevitably investigating information life cycle management (ILM) strategies and online/near-line partitioning to manage growth pains.
While storage costs are a factor, the key driver in the ILM decision is maintaining service levels to warehouse users without immense infrastructure cost overruns to handle the volume. The equation is clear - for any given infrastructure, a larger warehouse means slower query performance. So, while the ILM approach can be adopted as a "storage management" solution by looking to export data to a cheaper storage tier, ILM for performance reasons is primarily a database partitioning issue. DW managers are asking, "What data best serves the company's needs in an active online database and what data can reside in a near-line archive?"
Advertisement
How does adopting an ILM strategy maximize warehouse performance?
Studies and anecdotal evidence indicate that typically about 20 percent or less of the data in a warehouse is used regularly and by a large number of users. The other data is there "just in case" or is historical in nature. By implication, 20 percent of the data is of high present value to the business and 80 percent is of potentially high "latent" value. ILM best practice is that resources be allocated by a formula that matches availability with present business value.
When the "online" warehouse is limited to the 20 percent of data that is actively demonstrating high value, it is smaller and offers better performance for the mass of users and standard reporting and analysis, yielding the best-bang-for-the-buck performance to the business. The 80 percent with latent value can be stored near line where it is still accessible to users as required but is no longer encumbering the queries against the primary use data.
A properly partitioned warehouse - combining active "online" data with a readily accessible near-line archive can result in higher overall performance without the loss of data access when it is needed by the business. In effect, the near-line data is moved from being kept "just in case" to being delivered to users on demand and "just in time."
The ILM for Data Warehousing Rules
In the absence of any formal standards, organizations considering an ILM approach to improving their warehouse performance should take into account a number of key best-practice considerations:
1) Understand the real value of the data to the business.
- What data is actively involved in regular business versus what data is of historical interest or being kept "just in case." Users often specify "just in case" data because they are afraid of losing access, so their understanding of near-line retention will help them facilitate this business decision.
- Since the application will evolve over time, ensure that the near-line solution includes a meta data abstraction layer to enable application change - adding/removing/modifying data elements - without any major management task for the near line data.
2) Standardize and simplify partitioning criteria.
- First consider the time dimension when setting archiving rules. Most businesses find this the easiest to understand and implement since so many business processes are themselves time-based. This strategy alone will likely result in a 50 percent or more reduction in the size of the online portion of the warehouse
- Once you have gained experience with a simple time-based methodology, and then consider other query-log analysis based rules such as "user class" or "not accessed since."
- Another logical growth-management partitioning strategy can be to keep the detail data near-line and the aggregate data online with the ability to "drill to detail."
3) Maintain the primary application architecture.
- The primary architecture should be an archive-and-restore relationship between the online and near-line partition. This maintains the established BI application framework that is already in place and simplifies management. Typically the "restore" process can be automated for standardized tasks or "on demand." The drill-to-detail model described above also maintains the existing application integrity.
- Consider "federated query" across the partitions as a secondary strategy for those more ad hoc requests that do not merit a full and active restore of near-line data to the online environment.
4) Consider an offline archiving tier if you want to keep extended history for applications other than those primarily server by the warehouse.
- If history is to be kept for extended periods and accessed by users independently of the primary warehouse applications (for example audit or other special projects.
- Insure that the archive files contain meta data as well so that the data has context independent of the original application
- Establish that there is an industry standards-based offline access layer to the archive such as ODBC or XML.
Robert Thompson is vice president of Marketing for PARACCEL Inc., an emerging leader in parallel processing business intelligence acceleration solutions. Formerly vice president of Marketing for SAND Technology and CEO of QueryObject Systems, he has spent the last 10 years focused on the problems of business intelligence performance against high-volume data. He can be reached at robert.thompson@paraccel.com.
For more information on related topics, visit the following channels:






