What’s Next for DBMS?
Once upon a time, when client/server ruled, applications all used database management systems in much the same way. Graphical user interface clients typically running on Windows would issue SQL directly against a shared back-end relational database. The emergence of three-tier Web applications in which business logic moved to a middle tier and presentation moved to the browser resulted in only minor changes in the role of RDBMS. A single monolithic RDMBS continued to support multiple clients, which just happened to be Web or application servers rather than Windows client programs.
Over the past 25 years, there have been no successful challenges to the dominance of the RDBMS. The most successful upstarts object-oriented, XML and document-oriented databases account for a minute fraction of the overall database market. Until relatively recently, the RDBMS seemed to have a complete and permanent hold on the data management tier. However, the emergence of new computing models most importantly, the convergence of grid and utility computing models in the guise of cloud computing suggest that the times might be changing for DBMS.
2008: The Year of the Cloud
Arguably the most exciting development in software application architecture in 2008 was the emergence of viable cloud computing environments. The term cloud computing is seriously overhyped, but might be summed up for our purposes as the provision of virtualized application software, platforms or infrastructure across the network, in particular the Internet. The concepts of utility computing, where computer resources are allocated on demand, and grid computing, the ability to use large numbers of homogenous commodity computers to support larger applications or tasks, became a practical reality for the first time in 2008, and became so in environments made available across the Internet e.g., in the cloud.
Amazon arguably pioneered the public cloud with its Amazon Web Services Infrastructure as a Service cloud. AWS users can buy virtual computing resources storage, memory, I/O and CPU by the GB or CPU hour, and can rapidly scale their resource utilization up or down as demand varies.
Google provided an even more abstracted Platform as a Service cloud with its Google App Engine offering. With GAE, users do not even need to explicitly request increases in CPU or memory as their application demand increases. They merely need to deploy their application into GAE, and GAE will manage these resources.
Microsofts Azure cloud by far the most ambitious of the cloud computing architectures - appears to be a PaaS cloud primarily oriented toward .NET applications, but with more extensive application support services.
Most of the cloud architectural innovation is occurring in these public clouds. However, the architectures pioneered in the public clouds are expected to be mirrored within tomorrows enterprise data centers, and many vendors are emerging with private cloud infrastructure products.
RDBMS Meets the Cloud
The traditional relational database can be a poor fit in these cloud environments. The promise of the cloud is that you can increase computing resources by allocating additional virtual machines. Unfortunately, most relational databases cannot smoothly scale up in this manner. Oracles RAC cluster database provides a database scale-out capability, but licensing and administrative overhead currently make it an unattractive fit for zero-maintenance, pay-as-you-go cloud computing. The common MySQL sharding model and the use of distributed object caches such as memcached can be used to establish a large virtual database from many small database servers, but this solution is insufficiently elastic and automatic when demand scales rapidly.
As a result, cloud providers offer simpler, nonrelational databases that can provide elastic on-demand scalability. The major offerings are:
- SimpleDB within Amazon AWS
- BigTable within Google App Engine
- Microsofts Azure Table Services and SQL Data Services
These cloud databases differ in implementation detail, but share common core features:
- A hierarchical structure resembling a B-tree index or hash table. Rapid lookup via a single key value is provided.
- Flexible or un-types attributes. Each row often can contain different columns, and columns may have multiple values or include a more complex embedded structure.
- Automatic geo-redundancy. Elements stored in the database are guaranteed to be replicated across multiple data centers.
- Automatic partitioning across multiple hosts and automatic scale-out as the size of or demand on the data store exceeds the capability of a single host.
However, these databases are missing many of the features weve come to expect in a relational solution:
- Joins and complex queries must be implemented in procedural code.
- Transactions are not supported. You cant create a set of changes that must succeed or fail as a single unit.
- Eventual consistency. When you make a change to a data item, it will be visible eventually, but not immediately, in all locations where a copy of that data is maintained.
These limitations seem severe, and there is a wide range of application types that probably cannot tolerate the relaxation of transactional integrity. But, for Web 2.0 style applications, these limitations are no big deal does anyone care if your latest Facebook update takes a few minutes to be visible to all your friends around the world? Will cloud databases disrupt the RDBMS market? Probably not. But the cloud environments available from Amazon, Google and Microsoft offer significant advantages over traditional hosting or in-house hardware models. Applications that exploit the cloud will find the scalability and zero-maintenance models of these cloud databases appealing. As these cloud applications emerge into the mainstream, the cloud database probably will establish a permanent and successful segment of the database taxonomy.
Todays relational databases are generally built around a one-size-fits-all architecture. The architecture indeed the actual vendor/version of an RDBMS used for data warehousing is generally the same as that used for an online transaction processing application. The RDBMS model has certainly proven to be a flexible one, and its ability to support both transaction processing and decision support operations is one of the reasons the RDBMS has come to dominate the database world.
However, as the size and complexity of data warehousing databases has increased, its been increasingly apparent that databases designed explicitly to support data warehousing could have a significant cost/performance advantage.
In particular, the common RDBMS implementation in which all data for a given row is stored as a block might be optimal for OLTP operations create an entry, query an entry, change an entry but is not optimal for many data warehousing and BI query scenarios. In these scenarios, storing the data for a specific column in the same block is far more effective, since data warehousing queries usually aggregate within columns and across rows.
In 2005, Mike Stonebraker, pioneer of relational databases and one of the creators of Postgress, and his team proposed a new model for column-centric databases called C-Store. In a C-Store database, data is physically oriented around projections, which represent sorted contents of one or more columns. Queries that perform typical data warehousing aggregate queries will typically find the data they need stored in one of these projections.
A further advantage of column orientation is that very high compression rates can be achieved. Compression works best when repeated patterns are present. The repetition within columns is almost far greater than the repetition within rows. For instance, a SURNAME column can be expected to have many repeated entries, whereas there will be little or no repetition between a SURNAME and a DATE_OF_BIRTH column. Column orientation, therefore, allows for very high compression ratios with only modest CPU requirements.
Of course, not everything about C-Store is faster. C-Store is a read-optimized database design, and the architecture exacts a heavy penalty when data is added or modified. In a traditional RDBMS, a new row requires the addition of just a single entry. In C-Store, entries must be added to every relevant projection. Further, since projections are ordered, new rows may require some reorganization. Consequently, the C-Store proposal includes a writable store, a sort of staging area, for inserts and updates. Entries to the writable store are periodically merged in the read-optimized, but, in the meantime, queries read from both stores to get the up-to-date view.
Despite the significant changes to the underlying architecture, C-Store exposes a familiar SQL interface and transactional model to the application, making it immediately compatible with existing business intelligence tools.
The C-Store proposal has been commercialized in the Vertica product, and an open source version called MonetDB is also available. Some elements of column-oriented compression also are becoming available in commercial databases such as Oracle.
Memory Only Databases and H-Store
Theres also been innovation in the world of online transaction processing databases. Moores law, the doubling of memory and CPU capacity roughly every 18 months, has been driving exponential increases in the processing capacity of desktop and middleware computers. Unfortunately, Moores law does not apply to I/O operations. Disk drives have gotten gradually faster, but storage capacity has been growing at a faster rate. As a result, the I/O per GB rate actually has been dropping over the past few decades.
All database systems make heroic attempts to avoid this disk I/O bottleneck by employing increasingly sophisticated caching mechanisms. In some cases, there may be four separate caches within the application server, a result set and block cache within the database, and separate caching within dedicated storage devices. Despite all this, databases remain the slowest component in the modern application stack.
The imminent end of the spinning disk device and its replacement by solid state disk technology will relieve the bottleneck somewhat, but not enough to prevent the database from remaining the ultimate limit on transaction processing.
The disk I/O conundrum has lead to a fairly healthy niche of in-memory databases. In addition, there are calls for a complete rewrite of the disk-based RDBMS in favor of a design that more directly meets OLTP requirements.
Most notably, Mike Stonebraker and colleagues have proposed a complete do-over of the RDBMS, called H-Store. H-Store is an in-memory database in which transactional integrity is guaranteed not by a write-to disk, but by replication to other databases in the cluster. The clustering is achieved by requiring a hierarchical data model. Elements in the hierarchy are distributed across the cluster using a mechanism similar to that of the cloud databases, such as SimpleDB. This allows H-Store to provide scalable shared nothing clustering across large numbers of machines.
The architecture is single threaded, removing much of the complexity that arises in RDBMS when multiple concurrent sessions want to update the same piece of data. To take advantage of multi-core machines, one would run multiple H-Store instances clustered on the one machine.
Most shockingly to some, H-Store rejects SQL as the transactional language, and, instead, will require that transactions be written as stored procedures using an object-oriented syntax similar to Ruby on Rails ActiveRecord.
Of course, main memory is not yet abundant enough to allow the worlds larger databases to be moved to an H-Store model. One would imagine that short-term transactional data would reside in a high performance H-Store and longer-term data for analysis purposes would be shipped to C-Store.
H-Store exists only in prototype form today, but expect to see a commercial database with at least some of these features within the next few years. I also wouldnt be surprised to see cloud databases adopt some of these ideas in the future.
The relational database model has been incredibly successful; theres no real doubt that it will continue to be the dominant model for data storage for the foreseeable future. However, the emergence of cloud databases, column-oriented data warehouses and high performance in-memory OLTP databases suggest it might not be the only model for data storage in the future.