Transaction-oriented client/server systems built in recent decades have resulted in mountains of data. Every credit card purchase, even the history of a customer's wanderings from register to register in a department store, are kept in the computer. The data warehousing concept, introduced earlier this decade, views this legacy of transaction records as crude, raw material. To be used as a corporate resource from which important knowledge can be gained or which can support more global business processes, legacy data must first be subjected to cleansing and conversion processes and then loaded into a single, central repository or enterprise warehouse. Sometimes, new business applications or analyses can be run against the enterprise warehouse. Often, warehouse data is selectively moved down to smaller, special-purpose warehouses called data marts. These smaller warehouses keep data in a form accessible to OLAP, data mining and other tools designed to extract particular business value from the data. In many ways, the operation of an enterprise data warehousing system is similar to the operation of an oil refinery or chemical plant. Where big oil is measured in billions of barrels, big data comes in terabytes! Just to help visualize the magnitude of such data, consider the enterprise warehouse of a major credit card company. We know of one building a 12- terabyte warehouse. If each byte were a single character on a page, such data would fill over 300 million pages!
Such big numbers make it easy to imagine the torrents of data streaming from legacy systems as tankers bringing crude oil to a huge refinery. There, massive tanks, pipes and processors reduce the crude oil to various intermediate forms in various stages of refinement for distribution to more specialized processing plants that we might liken to the data marts of our data processing world. Refined petroleum products then find there way to gas stations or manufacturers, just as users of our data marts will apply any of hundreds of OLAP, query tools and other applications to create the information end products of the data refinery. Like the oil refinery, today's data refineries operate around the clock, moving batches of data through the system on a cyclical basis.
Building Data Refineries
While technology for building oil refineries has been evolving for decades, technology for building data refineries is just beginning to emerge. Most data refineries today are built on an ad hoc basis, from the bits and pieces of software infrastructure and applications that happen to be around. They would not be possible at all were it not for two fundamental building blocks provided today by mainstream vendors. First, only high performance parallel processors such as the RS/6000 SP2 (yes, the one that checkmated Kasparov!) or the recently announced Starfire from SUN Microsystems, equipped with volumes of today's cheap memory and disk, can provide the raw power needed to move, store and manipulate terabytes of data. Remember that at disk speeds of megabytes per seconds, it would still take months for a single computer to make a single pass through one terabyte of data. Thus, the only way to get practical results is to divide and conquer, spreading the data across hundreds of computers to reduce elapsed time for data processing to reasonable levels. Unfortunately, orchestrating the operation of hundreds of coupled processors to cooperatively perform data processing tasks can present formidable programming challenges.
The second building block is the parallel RDBMS. The major database vendors, including IBM, NCR, Oracle, Informix and Sybase, all have undertaken the complex programming task of creating fully parallel RDBMS implementations hosted on these super servers. Databases such as IBM's DB2 Parallel Edition or Informix Extended Parallel Server are capable of distributing the workload of big and complex queries across hundreds of disks and CPUs.
Scalable Data Flow
Most of today's data refineries are built with the help of small systems integrators (such as Emergent Corporation and Knightsbridge Solutions) that have concentrated expertise in building data processing systems with parallel computers and databases. Unfortunately, even though parallel computers and databases are available off the shelf, very few applications or utilities exist that can directly exploit them. For example, let's imagine that a company wants to perform a sort-merge kind of process on terabytes of data in a parallel RDBMS. A call to the vendors of the major sort utilities quickly reveals that no truly parallel versions of their sort packages exist. The customer must resort to writing complex scripts or programs that launch standard sort utilities on multiple processors in an attempt to divide and conquer the sorting task. While the hardware and RDBMS vendors have provided us with the platform and storage tanks for our data refineries, it's the huge pipes and processing elements needed in order to more easily build the key elements of data refineries outside the database that are nowhere to be found. A software infrastructure that facilitates orchestration of applications on parallel computers and provides the high-speed parallel data interconnections between processing steps is what's needed.
What are the characteristics of this missing infrastructure? The data processing systems we have been talking about involve big data flows moving to and from parallel databases through applications for cleansing, converting, importing, exporting or otherwise processing the data. In order to be as efficient as the parallel database, data must issue from the RDBMS over many simultaneous parallel connections. In turn, these many separate streams of data must be brought to the inputs of processing programs able to operate on many separate processors. On mainframes and workstations, it is common to store the results of one data processing step in temporary files or tables, only to be read as input to the next data processing step. In a data refinery, where data is big, this procedure can be a major factor in wasting both time and resources. Ideally, parallel streams of data should flow directly from the database, through one process after another using the high-speed data connections of the underlying parallel computer. In the 1970s IBM Fellow Dr. John Paul Morris wrote a book called Flow Based Programming: A New Approach to Application Development (copyright 1994 by Van Nostrand Reinhold). In it, he outlines a method for viewing data processing in terms of data flowing along a flow graph from one processing element to the next. While useful in any computing context, data flow programming is an ideal metaphor for programming the data refinery. Let's examine the need for such an infrastructure more closely.
Parallel UNIX Data Processing: Two Problems
There are two major stumbling blocks facing the applications developer attempting to construct dataflow systems on their parallel servers. Every UNIX programmer faced with writing data processing programs in C or C++ on any kind of computer encounters the first one. The second is present only when the computer is parallel. The first, we call the "data destructuring problem." Data in your RDBMS is represented in neat tables with named fields. SQL queries operate directly on these fields, with lots of error checking to make sure all formats are consistent before a query is run. On the other hand, a C or C++ program reading data in from the RDBMS receives this data basically as a stream of bytes. It becomes the programmer's responsibility to interpret these bytes as the records and fields they once were inside the RDBMS, with no guarantee of accuracy. Furthermore, when the schema of the data in the RDBMS changes, corresponding changes must be rippled through all C or C++ programs that read or write this data. We have seen up to 70 percent of an application's code be dedicated to data interpretation, with only 30 percent actually representing the business logic of the application!
The second stumbling block is "low level parallelism." This means simply that programmers are forced to deal with the underlying parallel computer at a very low level. They need to keep track of which subset of data resides on which processor. They need to manage the process of starting and stopping an application running on many processors simultaneously. This makes their programs much more difficult to write and maintain than they were on non-parallel machines. It also makes them dependent on the particular computer architecture on which the applications were developed. Not good in today's rapidly changing hardware landscape.
Raising the Level
It's clear that we need to raise the level of abstraction in which applications developers operate on several counts. First, we must maintain the record structure of the data as it was inside the RDBMS. Second, we must hide the details of how data gets distributed over the processors and disks of the parallel machine. Finally, we must allow processing to be distributed over the machine at a low level, where the application developer doesn't need to know about it. Furthermore, we must provide APIs that permit ISVs to build parallel data processing components that can easily snap into a user's data flow, again without detailed knowledge of parallelism. Products that provide these abstractions will finally provide the missing building block which together with the scalable computer and RDBMS will allow the modern data refinery to become the dominant paradigm for information architectures in the next century.
Allen Razdow is CEO and president of Torrent Systems in Cambridge, Massachusetts. He manages Torrent Systems corporate direction for the company's flagship product, Orchestrate, which enables data-driven industries to develop high-performance business intelligence applications. He can be reached at firstname.lastname@example.org m.