In the 45 years since the beginning of widespread use of information technology, we have seen a swing from a focus on automation to an emphasis on data. Could we now be on the cusp of a new paradigm that will push data management to an even higher level, perhaps a model that goes beyond what we can do with traditional data stores? There is reason to think we are at a point where this is a distinct possibility.

Into the Cloud

The new paradigm has its origins in cloud computing. Like so many developments in IT, cloud means different things to different people. The notion of cloud computing originated with the idea that existing IT infrastructure not working to capacity could be rented out, or even made available for free, to users unwilling to invest in additional dedicated resources to run their applications. Amazon is a good commercial example of this. Academic institutions have pioneered similar innovations in grid computing, which typically meets the needs of research programs by tapping unused capacity across networks. The widely documented SETI@home project is one example in this area. Nor is the underlying model new: those of us old enough to remember the heyday of mainframes can recall time-sharing, whereby spare capacity was rented out to enterprises that did not have enough of their own.


As the cost of hardware declined over a period of years, the idea that commodity servers could be bundled together began to take shape. Among other things, this could provide fault tolerance so that large arrays of cheap devices could be configured in such a way that some would back up others that might fail. Thus, a lot of cheap components could rival much more expensive architectures. Today, this has become a reality, and arrays of cheap servers can provide ultra large scale (ULS) computing environments.
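A little arithmetic shows why cheap components can rival expensive ones. The sketch below assumes, purely for illustration, that servers fail independently and that data stays reachable as long as at least one replica is up; the function name and the availability figures are hypothetical, not measurements from any real system.

```python
def replicated_availability(server_availability: float, replicas: int) -> float:
    """Chance that at least one of `replicas` independent servers is up.

    Assumes failures are independent, which real data centers only approximate.
    """
    if not 0.0 <= server_availability <= 1.0:
        raise ValueError("server_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All replicas must be down at the same time for the data to be unreachable.
    chance_all_down = (1.0 - server_availability) ** replicas
    return 1.0 - chance_all_down


if __name__ == "__main__":
    # Hypothetical figure: each commodity server is up 90 percent of the time.
    for n in (1, 2, 3, 5):
        print(n, round(replicated_availability(0.9, n), 6))
```

Under these assumptions, each extra replica multiplies the chance of total failure by the failure rate of a single server, which is why a handful of unreliable machines can add up to a dependable whole.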

A hardware foundation built of arrays of commodity servers makes cloud computing possible. The ULS architecture provides the means to host many applications, but it has still broader appeal. ULS might mean that we will be able to carry out computing tasks that were impossible for previous generations of hardware and software. One tempting possibility is the ability to process hitherto unimaginably large volumes of data - big data.

Emulating Google

One company that has blazed a trail in this area is Google, whose stated mission is to "organize the world's information and make it universally accessible and useful." Google is already recognized as highly successful and a model many wish to emulate. If Google can monetize the vast sets of data that exist on the Web, why shouldn't other enterprises be able to get value from similar sets of data, wherever they originate?

The hardware deployed for cloud computing is part of the answer to this question. But there is another, perhaps more controversial, component that revolves around how data will be held and processed in the ULS environments. Google put an early stake in the ground with its famous 2006 paper on BigTable. This describes a data storage approach that uses thousands of commodity servers to manage data volumes in the petabytes. It is flexible and capable of high performance, but it is not relational.

This presents a problem of its own: since the 1970s, the relational model has come to dominate the way we think about databases. Data management orthodoxy teaches that the relational model is the one and only right way to do things. Yet, just as cloud computing echoes mainframe time-sharing, so too does BigTable contain hints of long-forgotten days. Those who remember variable-length ISAM files with multiple record types might be forgiven a wry smile as they contemplate some of the characteristics of BigTable.
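To make the contrast with relational tables concrete, here is a minimal in-memory sketch of the kind of structure the BigTable paper describes: a sparse map in which each cell is addressed by a row key, a column key and a timestamp. The class name, method names and sample keys are hypothetical, and the sketch ignores distribution, persistence and sorted storage entirely.

```python
from collections import defaultdict


class SparseTable:
    """Toy sparse table: row key -> column key -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        # Rows need not share columns; a cell exists only once it is written.
        self._rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the most recent version of a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def columns(self, row):
        """List the columns that actually hold data for this row."""
        return sorted(self._rows.get(row, {}))


if __name__ == "__main__":
    table = SparseTable()
    table.put("com.example", "contents:html", "<html>v1", timestamp=1)
    table.put("com.example", "contents:html", "<html>v2", timestamp=2)
    table.put("com.example", "anchor:home", "Home", timestamp=1)
    table.put("org.sample", "contents:html", "<html>x", timestamp=1)
    print(table.get("com.example", "contents:html"))
    print(table.columns("com.example"), table.columns("org.sample"))
```

There is no fixed schema here: each row carries only the columns it needs, much as a variable-length file could hold records of different shapes.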


BigTable is beginning to get some competition from open source projects such as Apache Hadoop and HBase. The architecture and design of these systems are similar to BigTable's, and decidedly non-relational. A big part of the secret sauce is what is called MapReduce. This is a way of splitting the processing of a large set of data into smaller pieces that can be handled independently and in parallel, thus providing high performance.
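The idea is easier to see in miniature. The sketch below counts words across several chunks of text on a single machine, using a thread pool to stand in for a cluster; it is an illustration of the map, group and reduce steps only, not the Hadoop or Google API, and every function name in it is hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Each chunk is mapped independently, so the work can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    groups = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in groups.items())


if __name__ == "__main__":
    chunks = ["big data", "Big table", "data data"]
    print(word_count(chunks))
```

Because no map call depends on any other, adding machines is, in principle, a matter of handing out more chunks. A real framework also has to handle scheduling, data movement and failed workers, none of which appear in this sketch.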

This approach has its critics. David DeWitt and Michael Stonebraker have called MapReduce a major step backward. These are serious authors who deserve attention, but given that the relational paradigm has ruled unchallenged for decades, any break with it could be expected to generate controversy. Only time will tell in this debate.

Finally, Big Data

Assuming that the hardware and software will work, what will they work on? The answer to that is becoming increasingly obvious. Vast amounts of structured, semistructured and unstructured data are now commonplace. How big is all this? It is difficult to say, but reliable reports have found organizations already processing at the petabyte level. Rates of data production will probably rise, potentially to staggering volumes, in the years to come.

It is likely that this data will not just come from the Web, but will also be found in servers within enterprises. Some of it will be publicly available, some will be proprietary. It will be quite heterogeneous, and any use of it will present an integration challenge considerably greater than anything we see with enterprise data warehouses.

What will it look like? Personally, I think big data environments will always be focused on taking in existing data rather than producing it. The relational paradigm of tuples of related data will probably not fit because of these heterogeneous origins. Arrays of ontologies, rather than single data models, will describe it. There will be an immense data management problem in understanding the sources that are brought into such environments.

What will it be used for? Probably to search and make associations that will involve the production of enriching metadata and data. Meta-analysis and discovery of individual facts will likely be important, too.

Of course, these are only guesses, but big data will almost certainly be different. We live in exciting times.