The end of the one-size-fits-all database is nigh. Todays 30-year-old online transaction processing (OLTP) databases werent designed to handle ad hoc queries against terabytes of data. That means companies today are facing conflicting data storage challenges as their databases are growing rapidly along with their need to conduct complex ad hoc queries on the data. Companies that spend millions of dollars on stopgap measures typically see limited benefits and, ultimately, have to rethink their strategies if they want to compete effectively.
This article presents an overview of the evolution of relational database management systems (RDBMSs) and how the gap between data warehouse requirements and data warehouse database capabilities has come about. I will also explain seven database innovations that will revolutionize data warehousing. New developments such as a column-oriented architecture, aggressive data compression and the ability for RDBMSs to run natively on large grids/clusters of industry-standard servers make todays business queries - regardless of complexity or level of detail - fast, efficient and affordable. A rearchitected database management system (DBMS) featuring these innovations means business data analysis becomes a competitive advantage rather than an obstacle. CIOs will be able to analyze more data faster so their organizations can succeed.
RDBMS Comes of Age
Relational databases emerged in the 1970s as a way to improve business data processing. The idea was that storing data in row-oriented tables that group related information was better than using hierarchical data models, which incorporate parent/child relationships to store data in tree-like structures. Relational databases such as Ingres were much better at allowing users to store and access information involving one-to-many relationships rather than hierarchies, which are limited because every child can only have one parent.
Ingres, Postgres and other early-stage RDBMS technologies shared a common architecture that became the blueprint for most RDBMS products still in use today. These row-oriented architectures are optimized for write-intensive OLTP, which enables simple data storage and retrieval. OLTP makes it easy to find complete records in one place, streamlining data updates.
While these RDBMS platforms made it easier and more efficient to store and retrieve data at the time, they were not designed to handle read-intensive analytical workloads. As demand increased for more complex business data analytics, organizations had to work harder to get the performance they needed from their queries, leading to rising hardware and operating costs.
Today, this means that many organizations are not achieving all of their business goals because their databases are too big to support their increasing need for complex, real-time data analysis. This problem will only worsen with time, as half of all data warehouses will exceed 10 terabytes of data by 2010, according to The Data Warehouse Institute.1 Moreover, query complexity often increases as the volume of data rises.
The reason so many organizations are struggling with this problem is that until recently, there had been little innovation in the 30-year-old DBMSs these companies are running to match the rapid growth in database size. Granted, large RDBMS vendors developed some enhancements to improve overall data warehouse query performance by adding features such as bitmap indexes, online analytical processing (OLAP) cubes, materialized views, index-only tables and join indexes. Theyve also extended their RDBMSs to support shared disk systems (disk clusters), shared-nothing systems (blades) or both as well as SQL or XQuery on either tables or data represented in XML schema.
These improvements were established in large part to ensure that a single line of code would continue to meet all DBMS needs. Vendors created a template for the one-size-fits-all database that they try to force fit into a broad range of markets. They do this because adapting databases to the different business markets would require writing other code lines, which is astronomically expensive, could take years to develop and would take more time and money to train sales staff on how to sell it.
Organizations in data-heavy industries using existing DBMS technologies have been faced with some unpleasant choices. Typical solutions involve hiring additional database administrators (DBAs), creating and maintaining OLAP cubes (which is time-consuming, costly and slows down load performance) or replacing legacy DBMSs with expensive, proprietary data warehouse appliance hardware. Companies spend millions of dollars annually on these fixes, which ultimately prove to be stopgap measures that dont really solve the problem.
Seven Innovations that Make Data Warehousing Work
Relational database technology has evolved dramatically in the past several years in an effort to address these problems.
Data warehouse size and query complexity have grown inexorably, forcing changes in the underlying technology to help stock brokerages, telcos, retailers and other organizations wring the most value from their data. Here are seven of these recent innovations that can make storing, querying and analyzing data faster, more efficient and more affordable. These innovations are the foundation for a new DBMS architecture that solves the one-size-fits-all problem.
- Column store architecture. This architecture organizes data on a disk as columns of values from the same attribute, rather than storing it as rows of tabular records (as is done in most current RDBMSs). This vertical approach means that when a query needs to access only a few columns of a particular table, only those columns will be read from the disk. Row-oriented databases search every row and every column (even columns not specified in the query), which wastes I/O bandwidth. While bitmap indices, data cubes, materialized views and other database enhancements do help some, managing data structures to optimize performance for every query and use in this manner is prohibitively complex and time-consuming. Moreover, these data structures dramatically impose storage space and update performance overhead, while column-oriented architectures do not.
- Aggressive compression. Data compression becomes particularly effective in column-oriented architectures because the data within each column tends to be very similar, making compression easy. In a typical row-oriented database, values within a row of a table are not likely to be very similar, making compression impractical. The increasing gap between processor performance and disk speed in modern hardware means that more cycles are available to encode or decode data as it goes to or from disk. Innovations in compression algorithms further optimize performance by allowing the database to operate directly on the compressed data in memory, rather than always decoding it first.
- Multiple sort orders. When data is compressed, the DBMS has additional storage space available, allowing it to store each table on disk in multiple sort orders. In such a configuration, data is stored as projections - collections of sorted columns (similar to materialized views) - spread across a collection of networked, shared-nothing machines. Multiple projections can contain overlapping subsets of columns with different sort orders to ensure high availability and enhance performance by executing queries against the projection(s) with the most appropriate columns and sort orders.
- Automatic database design. Automating database design helps organizations streamline DBMS deployment and ensure downstream queries are more efficient. New tools use logical schema definitions from database administrators and SQL queries to automatically determine what projections to construct and where to store them to optimize query database performance and high availability. Not only do these tools reduce design costs, but they also provide intelligent, real-time adaptation to design changes based on the addition or removal of database nodes.
- Recovery by query, not logs. The redundancy built into database projections helps companies avoid the expensive, database performance-draining problems of logging and two-phase commit operations. This architecture enables DBMS solutions to query projections not only to handle user requests, but also to rebuild the data in a recently restored projection or node. The DBMS can build the necessary redundancy into the projections it creates so that a number of node failures (predetermined by a DBA) can occur without compromising the system.
- Concurrent data loading and querying. The huge data warehouses that organizations operate today mean that the standard bulk loading done at night extends into daytime when business managers need to query the data, causing the system to slow down. A hybrid architecture that allows querying and loading to occur in parallel across multiple projections means the system constantly refreshes data while still providing fast querying. Write-optimized stores (WOS) that record inserts, updates and deletes continuously move data into associated read-optimized stores (ROS) asynchronously throughout the day. Lightweight transaction management prevents database reads and writes from conflicting so queries can run against data in the ROS, WOS or both.
- Standards-based appliances. As mentioned, proprietary data warehouse hardware is expensive and difficult to tune and scale to individual business needs. DBMS software that runs on blades and servers from a variety of popular brands or even basic Linux boxes can help organizations save millions of dollars in IT costs annually. Additionally, these companies can repurpose existing hardware investments for other critical business operations.
That many companies today compete on the quality of their data is a given. Look no further than the massive call detail records telcos must monitor for pricing or the 10-year tick store records investment firms review to make split-second decisions on watched stocks. How data is stored and queried can have a huge impact on query latency and throughput. Throwing millions of dollars at stopgap measures only delays the inevitable time when existing RDBMS tools will not be able to give business managers access to the information they need. Instead, by adopting new approaches to RDBMS design that allow companies to conduct detailed searches 50 or 100 times faster and at a fraction of the cost of one-size-fits-all databases, companies can be assured that their databases will continue to allow them to focus on the possible business opportunities their data holds rather than the mounting costs of simply holding their data.
- Philip Russom. A Slew of Data Warehouses Enter the Multi-Terabyte Club. TDWI, June 14, 2006.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access