Because of this explosive growth of data, enterprises are facing high primary storage costs and an increasingly paradoxical dilemma: while business, regulatory and compliance requirements demand more complex and increasingly rapid analysis of this growing data hoard, the access problems and costs of storage and retrieval have made it necessary to offload more of the data burden to an archive, particularly in data warehousing environments. Analysis and reporting then become functions of how fast and how accurately archival data can be retrieved and subjected to analysis. Unfortunately, the state of the art in archival storage and retrieval mandates that both accuracy and speed be sacrificed as the use of archival database alternatives grows.
Many organizations are reviewing tiered storage strategies to migrate less frequently accessed data to the lowest-cost storage devices using policy-based automated storage migration, commonly referred to as information lifecycle management (ILM). Advances in database technologies have allowed critical database information to remain on fast-access primary storage while less frequently accessed database information is migrated to an archive on near-line storage systems. However, what is often not thought through is the impact of searching these near-line stores when data is required, either in response to an unexpected question or as part of a less frequent but nevertheless critical business cycle. One of the challenges ILM presents is providing convenient access to information in the database archive after the information has been compressed and archived.
IT staffs and business intelligence (BI) users are just now beginning to recognize the challenges posed by this ILM database dilemma. The growth of data warehouses has reached a critical juncture: for many users, multi-terabyte data warehouses are creating a barrier to effective analysis and business intelligence, as throughput issues, data access, and hardware and administrative costs begin to challenge users and their IT managers.
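As a rough illustration of policy-based migration, the sketch below assigns each table partition to a storage tier according to how long it has gone unaccessed. The `Partition` record, the tier names, and the 90-day threshold are assumptions made for this example, not values taken from any particular ILM product.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical threshold: data idle longer than this leaves primary storage.
PRIMARY_MAX_IDLE = timedelta(days=90)


@dataclass(frozen=True)
class Partition:
    """A unit of database data that can be migrated as a whole (assumed)."""
    name: str
    last_accessed: date


def assign_tier(partition: Partition, today: date) -> str:
    """Return the storage tier the policy selects for this partition."""
    idle = today - partition.last_accessed
    return "primary" if idle <= PRIMARY_MAX_IDLE else "near-line archive"


def plan_migration(partitions, today):
    """Map each partition name to its target tier."""
    return {p.name: assign_tier(p, today) for p in partitions}


if __name__ == "__main__":
    today = date(2024, 6, 30)
    partitions = [
        Partition("orders_2024_q2", date(2024, 6, 1)),
        Partition("orders_2021_q1", date(2022, 1, 15)),
    ]
    for name, tier in plan_migration(partitions, today).items():
        print(f"{name}: {tier}")
```

A policy like this only decides where data lives. It says nothing about how quickly the near-line tier can be searched, which is the access problem described above.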
Accessing the Archived Data Warehouse Across Storage Tiers
The data at the heart of these myriad business uses includes not only structured transaction data from ERP and back office systems, but also unstructured data from a host of sources that were largely nonexistent even a decade ago. Email and Internet transaction logs, voicemail databases, contracts, medical records, point-of-sale systems data and other data sources have been added to the ocean of data that companies must now swim through in the course of their day-to-day operations.
These transaction systems leave a data trail that is piling up at an astonishing rate: it's not uncommon for active transaction systems to contain many terabytes of data. The data warehouses that are fed by these voracious transaction systems are becoming larger than anyone had ever imagined.
Deriving Business Value from the Data Archive
Data archives have been a traditional solution for addressing usability and cost, particularly when it comes to offloading historical or infrequently used data. Indeed, the archive's main contribution has been to improve the usability of the remaining online data. Even so, archiving has traditionally been a less-than-perfect solution to the problems of too much data and not enough throughput, because most archiving solutions rely on tape-based systems that are both costly and hard to use. The result is that while archiving solves the problem of throughput and cost for the online portion of the data, it fails to provide a solution for archived data that is cost-effective and supports relatively rapid data access.
Thus, from a business standpoint, archiving is a problematic solution for most users. Archives cannot support timely data analysis, despite the fact that for many business uses - particularly those relating to regulations, compliance and legal action - timeliness is a major criterion for action. The current state of the art in archiving is thus too cumbersome and costly to keep pace with the growth of transaction databases and data warehouses and with the analytical demands placed on them. For most companies and most use cases, archiving represents an imperfect solution.
A New Approach to Tiered Data Archiving
Clearly, a new archiving solution is required to enable databases to operate at maximum efficiency. One potential solution provides four key features:
- Data compression,
- Online query access,
- Maintenance exposure, and
- Enterprise scalability.
Column-based data compression technology stores relational data in what is essentially a pre-indexed format, removing the need to store or build indexes at restore time. This design significantly reduces the overall storage needed for the database. Column-based storage also significantly improves data compression: because each column holds a single data type, it can be compressed much more efficiently than rows of data, which by definition include many different data types. This technology can further reduce the data footprint by selecting the most suitable compression strategy for each data type.
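The idea of choosing a compression strategy per column can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the function names and the choice of delta encoding for integers and dictionary encoding for everything else are assumptions made for the example.

```python
def to_columns(rows, names):
    """Pivot row tuples into a dict of column name -> list of values."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def delta_encode(values):
    """Store the first integer, then each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def dictionary_encode(values):
    """Replace each value with a small integer code into a sorted dictionary."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


def encode_column(values):
    """Pick an encoding from the column's type (an assumed, simplistic rule)."""
    if values and all(isinstance(v, int) for v in values):
        return ("delta", delta_encode(values))
    return ("dictionary", dictionary_encode(values))


if __name__ == "__main__":
    rows = [
        (1001, "EMEA", "OPEN"),
        (1002, "EMEA", "CLOSED"),
        (1003, "APAC", "OPEN"),
        (1004, "EMEA", "OPEN"),
    ]
    columns = to_columns(rows, ["order_id", "region", "status"])
    for name, values in columns.items():
        print(name, encode_column(values))
```

In this sketch the sorted dictionary doubles as a lookup structure, which is one way to read the "pre-indexed" description: finding every row that holds a given value means looking up its code and scanning a list of small integers. A row-oriented layout offers no such shortcut, because each row interleaves values of different types.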
Column-based storage also allows more rapid processing of archival queries: reporting tools can either query the repository directly using the subset of the ANSI SQL language currently supported, or the necessary data can be rapidly restored to an operational data store and queried using the full complement of SQL commands. This accessibility contrasts with the majority of archiving systems, which limit access to summary data unless a full database restoration has been undertaken.
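To show why a column layout suits direct archive queries, the sketch below answers the equivalent of `SELECT order_id FROM orders WHERE region = 'EMEA'` by reading only the two columns involved. The `select_where` helper and the sample data are hypothetical; a real repository would expose this through its SQL interface.

```python
def select_where(columns, wanted, filter_column, equals):
    """Return tuples of the wanted columns for rows where filter_column == equals.

    Only the filter column and the wanted columns are read; every other
    column in the archive is left untouched.
    """
    positions = [
        i for i, value in enumerate(columns[filter_column]) if value == equals
    ]
    return [tuple(columns[name][i] for name in wanted) for i in positions]


if __name__ == "__main__":
    orders = {
        "order_id": [1001, 1002, 1003, 1004],
        "region": ["EMEA", "EMEA", "APAC", "EMEA"],
        "status": ["OPEN", "CLOSED", "OPEN", "OPEN"],
    }
    print(select_where(orders, ["order_id"], "region", "EMEA"))
```

Here the `status` column is never touched, so it would not need to be read or decompressed to answer the query.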









