for Information Management Blogs
MAY 28, 2009 3:49am ET

Database Religions Dissolve Into The Big Billowing Virtual Data Cloud

Virtualization is a venerable old computing concept that has achieved new life in recent years.

Virtualization brings to life a new world of more flexible service provisioning while cleverly emulating the old world that is being replaced. Virtualization refers to any approach that abstracts the external interface from the internal implementation of some service, functionality, or other resource.

The promise of virtualization is that, no matter how scattered and diverse, all pooled resources behave as if they were a single unified resource, both for usage and administration. In a sense, this is the practical magic that Arthur C. Clarke identified with advanced technology. The external interface may conceal various facts about the implementations of the underlying resources. The virtualized resources may:

  • Run on diverse operating and application platforms;
  • Have been deployed on nodes in diverse locations;
  • Have been aggregated across diverse hosting platforms (or partitioned within a single hosting platform, either through virtual machine software, separate CPUs, or separate blade servers); and
  • Have been provisioned dynamically in response to a client request.
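
To make the interface-versus-implementation idea concrete, here is a minimal Python sketch. All of the names in it (KeyValueStore, MemoryStore, SqliteStore, VirtualStore) are hypothetical and invented for illustration; nothing here comes from a particular product. Two very different backends are pooled behind one external interface, and the caller never learns which one holds a given key.

```python
import sqlite3
import zlib
from typing import Optional, Protocol


class KeyValueStore(Protocol):
    """The external interface: all a client ever sees (hypothetical)."""

    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class MemoryStore:
    """One internal implementation: a plain in-process dictionary."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class SqliteStore:
    """A second, quite different implementation: a SQLite table."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

    def put(self, key: str, value: str) -> None:
        self._conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None


class VirtualStore:
    """Pools several backends so they behave like a single store."""

    def __init__(self, backends) -> None:
        self._backends = list(backends)

    def _pick(self, key: str) -> KeyValueStore:
        # A stable checksum decides which backend owns the key.
        index = zlib.crc32(key.encode("utf-8")) % len(self._backends)
        return self._backends[index]

    def put(self, key: str, value: str) -> None:
        self._pick(key).put(key, value)

    def get(self, key: str) -> Optional[str]:
        return self._pick(key).get(key)


store = VirtualStore([MemoryStore(), SqliteStore()])
store.put("customer:42", "Acme Corp")
print(store.get("customer:42"))
```

Routing by checksum is only one possible placement policy, chosen here for brevity; a real virtualization layer would also have to deal with location, failover, and dynamic provisioning.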

When Noel Yuhanna and I presented on enterprise database virtualization last week at Forrester IT Forum, we took pains to point out that it is not a radically new paradigm. In fact, database administrators (DBAs) have been doing virtualization for a long time without realizing it. We’re all familiar with such database virtualization approaches as policy-based server clustering, massively parallel processing database grids, and enterprise information integration. In these environments, you can identify the virtualization layer as “single system image,” “semantic abstraction,” or some other approach.

What all these approaches share is that they make two or more repositories behave as if they were a single database for unified access, query, reporting, predictive analytics, and other applications. If you wish, I could drill down further into the layers of database virtualization—data virtualization, transaction virtualization, and platform virtualization—but that would be too much for a mere blog post.
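
As a rough illustration of two repositories behaving like one database, the Python sketch below runs the same query against two independent SQLite databases and merges the results. The FederatedOrders class, the make_repo helper, and the orders table are all assumptions made up for this example; real federation products do far more (query planning, pushdown, security), so treat this as a toy.

```python
import sqlite3


def make_repo(rows):
    """Build one independent repository (hypothetical 'orders' schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


class FederatedOrders:
    """Presents several repositories as a single queryable source."""

    def __init__(self, repos):
        self.repos = list(repos)

    def query(self, sql, params=()):
        # Send the same statement to every repository and concatenate rows.
        rows = []
        for conn in self.repos:
            rows.extend(conn.execute(sql, params).fetchall())
        return rows

    def total_amount(self):
        # Each repository computes a partial sum; the layer combines them.
        partials = self.query("SELECT COALESCE(SUM(amount), 0) FROM orders")
        return sum(partial[0] for partial in partials)


east = make_repo([(1, 10.0), (2, 20.0)])
west = make_repo([(3, 5.0)])
orders = FederatedOrders([east, west])

print(orders.total_amount())
print(sorted(orders.query("SELECT id FROM orders WHERE amount > ?", (8,))))
```

The caller asks one question and gets one answer; where the rows physically live is a detail of the layer underneath.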

One twist that I didn’t have time to explore in depth last week is the notion that the traditional hub-and-spoke enterprise data warehousing (EDW) architecture is itself a form of database virtualization. The hub-and-spoke model transforms analytic data to a common “spoke-side” semantic access model, such as star schema or columnar. As such, this approach abstracts from the data models (usually 3NF relational) implemented at the EDW hub tier, the staging tier (perhaps file-based), and OLTP sources (perhaps hierarchical, XML, or what have you).
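
Here is a small sketch of that abstraction, assuming a made-up retail schema in SQLite. The hub holds normalized (3NF-style) tables, while two views, dim_customer and fact_sales, present a star-style model on the spoke side. Every table, view, and function name is hypothetical, and a real EDW would typically materialize the spoke-side marts rather than rely on views.

```python
import sqlite3

hub = sqlite3.connect(":memory:")
hub.executescript("""
-- Hub tier: normalized tables (hypothetical schema).
CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT,
                        city_id INTEGER REFERENCES cities(id));
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(id),
                     order_date TEXT);
CREATE TABLE order_lines (order_id INTEGER REFERENCES orders(id),
                          product TEXT, qty INTEGER, unit_price REAL);

INSERT INTO cities VALUES (1, 'Boston', 'US'), (2, 'Lyon', 'FR');
INSERT INTO customers VALUES (1, 'Acme', 1), (2, 'Globex', 2);
INSERT INTO orders VALUES (100, 1, '2009-05-01'), (101, 2, '2009-05-02');
INSERT INTO order_lines VALUES (100, 'widget', 2, 5.0),
                               (100, 'gadget', 1, 20.0),
                               (101, 'widget', 4, 5.0);

-- Spoke side: a star-style access model layered over the hub.
CREATE VIEW dim_customer AS
  SELECT c.id AS customer_key, c.name AS customer_name,
         ci.name AS city, ci.country AS country
  FROM customers c JOIN cities ci ON ci.id = c.city_id;

CREATE VIEW fact_sales AS
  SELECT o.customer_id AS customer_key, o.order_date AS order_date,
         l.product AS product, l.qty * l.unit_price AS amount
  FROM order_lines l JOIN orders o ON o.id = l.order_id;
""")


def sales_by_country(conn):
    """Query only the star-style views; the hub tables stay hidden."""
    return conn.execute("""
        SELECT d.country, SUM(f.amount)
        FROM fact_sales f JOIN dim_customer d USING (customer_key)
        GROUP BY d.country
        ORDER BY d.country
    """).fetchall()


print(sales_by_country(hub))
```

The analytic query touches only the fact and dimension views, so the hub model could be reorganized without the consuming application noticing, as long as the views are kept in step.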

When you realize that each data-persistence approach has its optimal deployment sphere, you’re thinking database virtualization. At that point, you start to realize that the various database religions—relational is supreme, columnar is king, and so forth—are not absolute truths. They’re simply sectarian texts in a tradition of longer vintage: the evolution of truly all-encompassing data virtualization clouds.

Yes, I’m using “cloud” in this context because it best describes this new paradigm. Cloud-based virtualization is beginning to seep into analytic infrastructures. To support flexible mixed-workload analytics, the EDW, over the coming five to 10 years, will evolve into a virtualized, cloud-based, and supremely scalable distributed platform.

What are the outlines of this new paradigm? The virtualized EDW will allow data to be transparently persisted in diverse physical and logical formats to an abstract, seamless grid of interconnected memory and disk resources and to be delivered with sub-second delay to consuming applications. EDW application service levels will be ensured through an end-to-end, policy-driven, latency-agile, distributed-caching and dynamic query-optimization memory grid, within an information-as-a-service (IaaS) environment. Analytic applications will migrate to the EDW platform and leverage its full parallel-processing, partitioning, scalability, and optimization functionality. At the same time, DBAs will need to make sure that cloud-based DW offerings meet their organizations’ most stringent security, performance, availability, and other service-level requirements.

I won’t opine here and now on how much enterprise data will be persisted in public clouds vs. private environments that incorporate many of the same platform virtualization technologies. I’ll save that discussion for the upcoming Forrester reports that Noel and I are developing on virtualization of transactional and analytic databases, respectively.

Expect those in Q3 or thereabouts. Thanks to everybody who attended our preso last week in Vegas!

Comments (2)
Jim-

Noel and you are on to something significant here.

I really like how you have identified both database access and database management as key capabilities within database virtualization.

In your presentation at the Forrester IT Forum I noted that a common approach to query was a standard element.

And in your stack of multiple virtualizations (storage, servers, databases, information, etc.), I like the database virtualization positioning relative to the information virtualization layer above it.

It seems to me, the information virtualization layer can call up the common query capabilities of the database virtualization layer.

And we all know who is the best of breed supplier of the virtualized, federated query services that support both the database and information virtualization layers..... Composite Software!!!

I look forward to your future research on this topic.

- Bob Eve, EVP Composite Software

Posted by Robert E | Tuesday, June 02 2009 at 2:45PM ET
It's an interesting vision of the future, but distributed query optimization is fantastically complex, requiring, at a minimum, comprehensive and accurate technical meta-data that describes how the data in the different databases are related to one another and a query optimizer that sits above the various repositories and that understands (a) the data distribution and demographics of the data on each platform; (b) the computing power of each available platform and the available bandwidth / latency of the networks connecting them; and (c) the existing workloads that each platform is supporting at that point in time. And that's assuming that the performance characteristics of the distributed database platforms are similar for a given workload, which they certainly won't be if the cloud includes the variety of databases (relational, c-store, etc.) that you envisage. Plenty of today's RDBMSs do a very average job of cost-based query optimization - requiring hints, etc. - when all they have to contend with is data maintained in a single platform; and plenty of end-user organizations also do a mediocre job of meta-data management.

Plus assuming that data is stored redundantly in these multiple repositories and that we want the cloud to support "operational analytics" in near real-time we now have to figure out not only which of the distributed platforms can serve the request most quickly / efficiently, but whether it has an up-to-date copy of the data requested. And if it doesn't, can we serve the request most quickly / efficiently by queuing the request until the data is available on that platform or by re-directing it to another platform?

So I stand by "complex". And, at least with current distributed database / query federation technology and business processes, I think "fantastic" is also a fair assessment!

Posted by Martin W | Tuesday, June 02 2009 at 5:00PM ET