Degrees of Massively Parallel Processing

InfoManagement Direct, February 26, 2009

John O'Brien

The concept of linear growth is obsolete. In the closing decades of the 20th century, we got used to a rapid pace of change, but the shape of that change was still one of incremental growth. Now we’re contending with a breakneck speed of change and exponential growth almost everywhere we look, especially in the information we generate. The very largest databases in the world documented in “Richard Winter’s Top Ten” report from 2005 are dwarfed by today’s databases.

The fact that the entire Library of Congress’s holdings comprised 20 terabytes of data was once breathtaking. Today, some telecommunications, energy and financial companies can generate that much data in a month. Even midsized organizations are coping with data sets that will soon outgrow the Library of Congress.

How do we manage this growth? How do we get to the point where we can focus on using our data assets instead of just hoarding them for regulation’s sake? What’s our strategy for assembling the infrastructure for handling these large quantities of data? It’s tempting to just keep adding on to the current infrastructure. Buy more of the same - at a faster pace.

The problem with that approach is that just as a large collection of information needs a specialized infrastructure to house, classify and maintain its accessibility, a very large database has unique requirements. The complexity introduced by the size factor is best addressed by a technology known as massively parallel processing.

Parallel processing is not a new concept - it’s been used to radically expand the computational capacity of supercomputers. Massively parallel processing (MPP) applies the same principles to database architecture.

MPP is a class of architectures aimed specifically at addressing the processing requirements of very large databases. MPP architecture has been accepted as the only way to go at the high end of the data warehousing world. If it’s so well-suited to very large data warehouses, why hasn’t everyone adopted it? The answer lies in its complexity. Engineering an MPP system is difficult and remains the purview of organizations and specialized vendors that have a deep layer of dedicated R&D resources. These specialized vendors are now bringing solutions to the market that shield users from the complexity of implementing their own MPP systems. These solutions take a variety of forms, such as custom-built deployments, software/hardware configurations and all-in-one appliances.

If the amount of data your organization handles has outgrown the capacity of your data management infrastructure, you’re probably evaluating one or more MPP solutions. Which one is the best? Which one solves the problem you are trying to address? Which is least disruptive to your environment?

To answer these questions, it’s helpful to understand the specific challenges on the table and the context in which the MPP solutions were designed. The challenges involved in processing large amounts of data are similar to those of any large-scale project. What is the best method for accomplishing a task that is too large for any one person, piece of equipment or facility to handle? The answer is simple - split it up. The hard part is what comes next. How do you break it down, and, more importantly, how do you bring it all back together?

In the specific context of working with very large data, the approach to the first challenge is to do as much of the work in parallel as possible. All MPP architectures share this approach - the difference lies in how successful they are at accomplishing the goal.

The next step is to assemble the results of work done in parallel. Here the various architectures differ even more. This complex orchestration of work has one overarching challenge - to avoid bottlenecks that will bring the whole system to a grinding halt. And if you can’t avoid the bottleneck entirely, ensure that it is wide enough for work to flow through at an efficient pace.
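As a rough illustration of the split-and-reassemble pattern described above, the following Python sketch divides a set of rows among workers, lets each worker total its own slice, and then merges the partial results. The function names, the round-robin split and the use of threads as stand-ins for independent nodes are all assumptions made for the example; they are not drawn from any particular MPP product.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def local_totals(rows):
    """Work done by one worker: total the amounts in its own slice only."""
    totals = Counter()
    for key, amount in rows:
        totals[key] += amount
    return totals


def parallel_totals(rows, n_workers=4):
    """Split the rows, total each slice in parallel, then merge the partials."""
    # Break it down: a simple round-robin split (hypothetical scheme).
    slices = [rows[i::n_workers] for i in range(n_workers)]

    # Do the work in parallel; threads stand in for separate nodes here.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(local_totals, slices))

    # Bring it back together: this single merge step is the potential
    # bottleneck, so it only sees small partial results, not raw rows.
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)


if __name__ == "__main__":
    sales = [("east", 10), ("west", 5), ("east", 7), ("north", 1), ("west", 3)]
    print(parallel_totals(sales, n_workers=3))
```

The merge step receives one small summary per worker rather than every row, which is one way of keeping the unavoidable bottleneck wide enough for work to flow through.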

The technical underpinning of all MPP approaches is a shared-nothing environment. The less that is shared, the more work can be parallelized. Shared-something environments allow for some parallel processing, but not enough to accommodate large data sets. The essential difference between MPP and other classes of architecture is the decision to implement parallelization at every level, hence the moniker “massively parallel.” If an architecture is anything but shared-nothing, there is a built-in limitation to the amount of work it can perform in parallel. In a shared-something environment, processes can be in contention for resources or shared information and spend time in queues when they could be continuing with their primary task. On the hardware side, CPU cycles spent on maintaining state or orchestrating resources could be spent on the actual data processing work.
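The contention described above can be sketched in a few lines of Python. In this hypothetical example, the shared-something version has every worker update one shared total behind a lock, while the shared-nothing version lets each worker keep a private total that is combined only once at the end. The function names and the count of lock acquisitions are illustrative assumptions, and threads again stand in for independent nodes.

```python
import threading


def shared_something(chunks):
    """Every worker updates one shared total, queuing on a lock each time."""
    total = 0
    lock_acquisitions = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total, lock_acquisitions
        for value in chunk:
            with lock:  # workers contend here for the shared state
                total += value
                lock_acquisitions += 1

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total, lock_acquisitions


def shared_nothing(chunks):
    """Each worker totals its own chunk privately; results meet only at the end."""
    results = [0] * len(chunks)

    def worker(index, chunk):
        local = 0
        for value in chunk:
            local += value  # no lock: nothing is shared while working
        results[index] = local  # each worker writes only its own slot

    threads = [
        threading.Thread(target=worker, args=(i, c)) for i, c in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results), 0  # zero lock acquisitions during the work


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5], [6]]
    print(shared_something(data))
    print(shared_nothing(data))
```

Both versions produce the same total, but the shared-something version pays one lock acquisition per value processed, a cost that grows with the data, while the shared-nothing version coordinates only when the results are combined.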

Three Blueprints for Massively Parallel Processing

There are three major approaches to breaking down a massive unit of work into multiple subunits and processing these in parallel. However, not all MPP architectures are created equal. The earliest MPP architectures parallelized only one or two tasks or applied simple algorithms to sharing and reassembling work.

MPP architectures vary the most in the way they handle three different aspects of parallel processing - optimization, assembly of results and system-wide interactions. How each architecture handles these operations reveals the degree of parallelization it has achieved, as well as its ability to scale and its flexibility in handling different types of requests. Because big data presents a very complex challenge, even some advanced MPP architectures take shortcuts or focus on one aspect of query processing to the detriment of others.

In terms of hardware, this translates into how the flow and control among CPU, I/O, disk and interconnectivity are handled. How work is distributed among these components characterizes each of the three models - bounded parallelization, hierarchical MPP and pervasive MPP. Each of these models represents a progression toward parallelizing and balancing as many operations as feasible.

Optimization
