The concept of linear growth is obsolete. In the closing decades of the 20th century, we got used to a rapid pace of change, but the shape of that change was still incremental. Now we're contending with a breakneck speed of change and exponential growth almost everywhere we look, especially in the information we generate. As documented in Richard Winter's Top Ten report from 2005, the largest databases in the world at that time are dwarfed by today's databases.
The fact that the entire Library of Congress's holdings comprised 20 terabytes of data was breathtaking. Today, some telecommunications, energy and financial companies can generate that much data in a month. Even midsized organizations are coping with data sets that will soon outgrow the Library of Congress.
How do we manage this growth? How do we get to the point where we can focus on using our data assets instead of just hoarding them for regulation's sake? What's our strategy for assembling the infrastructure to handle these large quantities of data? It's tempting to just keep adding on to the current infrastructure: buy more of the same, at a faster pace.
The problem with that approach is that just as a large collection of information needs a specialized infrastructure to house it, classify it and maintain its accessibility, a very large database has unique requirements. The complexity introduced by the size factor is best addressed by a technology known as massively parallel processing (MPP).
Parallel processing is not a new concept - it has long been used to radically expand the computational capacity of supercomputers. MPP applies the same principles to database architecture.
MPP is a class of architectures aimed specifically at addressing the processing requirements of very large databases. MPP architecture has been accepted as the only way to go at the high end of the data warehousing world. If it's so well suited to very large data warehouses, why hasn't everyone adopted it? The answer lies in its complexity. Engineering an MPP system is difficult and remains the purview of organizations and specialized vendors that have a deep layer of dedicated R&D resources. These specialized vendors are bringing solutions to market that shield the user from the complexity of implementing an MPP system. These solutions take a variety of forms, such as custom-built deployments, software/hardware configurations and all-in-one appliances.
If the amount of data your organization handles has outgrown the capacity of your data management infrastructure, you're probably evaluating one or more MPP solutions. Which one is the best? Which one solves the problem you are trying to address? Which is least disruptive to your environment?
To answer these questions, it's helpful to understand the specific challenges on the table and the context in which the MPP solutions were designed. The challenges involved in processing large amounts of data are similar to those of any large-scale project. What is the best method for accomplishing a task that is too large for any one person, piece of equipment or facility to handle? The answer is simple - split it up. The hard part is what comes next. How do you break it down, and, more importantly, how do you bring it all back together?
In the specific context of working with very large data, the approach to the first challenge is to do as much of the work in parallel as possible. All MPP architectures share this approach - the difference lies in how successful they are at accomplishing the goal.
The next step is to assemble the results of work done in parallel. Here the various architectures differ even more. This complex orchestration of work has one overarching challenge - to avoid bottlenecks that will bring the whole system to a grinding halt. And if you cant avoid the bottleneck entirely, ensure that it is wide enough for work to flow through at an efficient pace.
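As an illustration only (real MPP engines implement this inside the database, not in application code), the split-then-reassemble pattern can be sketched in a few lines of Python:

```python
from multiprocessing.dummy import Pool  # a thread pool stands in for worker nodes

def scatter_gather(data, n_workers, work, combine):
    """Split `data` into n_workers chunks, process each in parallel,
    then reassemble the partial results at one collection point."""
    chunks = [data[i::n_workers] for i in range(n_workers)]  # round-robin split
    with Pool(n_workers) as pool:
        partials = pool.map(work, chunks)   # the parallel phase
    return combine(partials)                # the assembly phase

# Toy example: sum one million integers across four "nodes".
total = scatter_gather(range(1_000_000), 4, sum, sum)
```

However many workers are added, `combine` remains a single collection point - exactly the bottleneck the assembly step must keep wide enough.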
The technical underpinning of all MPP approaches is a shared-nothing environment. The less that is shared, the more work can be parallelized. Shared-something environments allow for some parallel processing, but not enough to accommodate large data sets. The essential difference between MPP and other classes of architecture is the decision to implement parallelization at every level, hence the moniker massively parallel. If an architecture is anything but shared-nothing, there is a built-in limitation to the amount of work it can perform in parallel. In a shared-something environment, processes can contend for resources or shared information and spend time in queues when they could be continuing with their primary task. On the hardware side, CPU cycles spent on maintaining state or orchestrating resources could be spent on the actual data processing work.
Three Blueprints for Massively Parallel Processing
There are three major approaches to breaking down a massive unit of work into multiple subunits and processing these in parallel. However, not all MPP architectures are created equal. The earliest MPP architectures parallelized only one or two tasks or applied simple algorithms to sharing and reassembling work.
MPP architectures vary the most in the way they handle three different aspects of parallel processing - optimization, assembly of results and system-wide interactions. How each architecture handles these operations reveals the degree of parallelization it has achieved, as well as its ability to scale and flexibility in handling different types of requests. Because big data presents a very complex challenge, even some advanced MPP architectures take shortcuts or focus on one aspect of query processing to the detriment of others.
In terms of hardware, this translates into how the flow and control among CPU, I/O, disk and interconnectivity are handled. How work is distributed among these components characterizes each of the three models - bounded parallelization, hierarchical MPP and pervasive MPP. Each of these models represents a progression toward parallelizing and balancing as many operations as feasible.
Optimization

Query optimization is part of any database management system; however, all the rules for optimization change when you start dealing with big data. An optimizer built for an online transaction processing system simply doesn't have the functionality or the decision tree to support manipulating big data.
That being said, there are as many strategies for optimization of big data as there are engineers specializing in this area. Optimizing for the sheer size of data, however, is just part of the picture. When designing an optimizer, certain assumptions are made about leveraging various characteristics of the data. Generally, MPP optimizers rely on statistics that can be collected on the data. Typical areas of interest are the data's cardinality, its sparsity or density, and table size. The optimizer then takes those statistics and uses them to build a better execution plan tailored to the data and the specific type of processing that's being requested.
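Since vendors' actual optimizers are proprietary, the following is only a hypothetical sketch of the idea: a planner that reads per-table statistics and picks an execution strategy. The threshold, function names and strategy labels are all invented for illustration:

```python
# Hypothetical sketch of a statistics-driven planner. Row counts stand in
# for richer statistics such as cardinality and sparsity; the threshold
# below is invented, not taken from any real product.
BROADCAST_LIMIT = 10_000  # rows small enough to copy to every node

def choose_join_plan(stats_a, stats_b):
    """Pick a join strategy from per-table row-count statistics."""
    smaller = min(stats_a["rows"], stats_b["rows"])
    if smaller <= BROADCAST_LIMIT:
        return "broadcast"   # replicate the small table to all nodes
    return "shuffle"         # hash-redistribute both tables on the join key

plan = choose_join_plan({"rows": 2_000}, {"rows": 500_000_000})
```

A real cost-based optimizer weighs many more factors (data placement, interconnect cost, available memory), but the shape - statistics in, plan out - is the same.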
Optimization strategies are usually closely held intellectual property. Just as with many other technology products, you are paying both for the expertise and to avoid replicating the work involved in building the solution. Since you probably won't be allowed to look behind these curtains, the best way to evaluate a solution is to see if it gives you the results you need.
Assembly of Results
After the first challenge of successfully executing an operation against a large data set by breaking it up into parallel processes has been met, there remains the second challenge of assembling the multiple results into one outcome. This assembling process is the bottleneck for any MPP architecture and fundamentally limits the number of requests a system can handle. To increase the amount of work or speed a system can deliver, you can add more hardware, that is, more parallel nodes, but it always comes down to having one point of collection.
Just as there are many optimization strategies, there are different ways to implement the aggregation process. It can be a standalone component on its own machine, it can be embedded within a master node or it can be distributed across all nodes within a system.
Problem solved - or it would be if we were not working with big data in an active environment. Two things can overwhelm any of the assembly approaches: the sheer size of a result set, or the number of users requesting data. Any MPP system that does not make provisions for scaling the assembly of results dynamically has a hard limit to its capacity.
System-Wide Interactions

The previous discussion of how the results of parallel operations are assembled touches on another difference among the various MPP architectures - how the coordination of the various hardware components and software processes is handled. The way in which joins are handled is an example of how the interactions within the system can be orchestrated.
Every database architect understands that the physical location of commonly accessed data will have significant impact on performance. Some of that thinking applies to MPP architectures as well. For example, in many cases, the cost of the space wasted by making physical copies of small lookup tables available in all partitions or on all servers is justified by the time saved when accessing those tables. Each node can do its own local join and return results to the designated assembly location.
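A toy sketch of this replicated-lookup pattern (table names and contents are invented; a real system does this in its storage and execution layers, not in application code):

```python
# A small lookup table is physically copied to every partition, so each
# node can join its own stripe of the big table locally, with no data
# shipped between nodes.
lookup = {1: "hardware", 2: "software"}   # small table, replicated everywhere

partitions = [                            # big table, striped across nodes
    [(101, 1), (102, 2)],                 # node 0's rows: (order_id, category_id)
    [(103, 1)],                           # node 1's rows
]

def local_join(rows, local_lookup):
    """Each node joins its own rows against its local copy of the lookup."""
    return [(order_id, local_lookup[cat_id]) for order_id, cat_id in rows]

# The assembly point only concatenates already-joined partial results.
results = [row for part in partitions for row in local_join(part, lookup)]
```

The space cost of the copies is traded for the elimination of cross-node traffic during the join.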
The reality is that in a big data environment, tables can be several terabytes in size. MPP architectures allow you to stripe large tables to all the nodes in the system - a best practice that takes advantage of the highest degree of parallelization. What happens when you want to perform a join on two very large tables? For example, you have product IDs in two tables, and you want to return where TableA_ID = TableB_ID.
No single node has all the information it needs to process even a portion of the join. Interim results need to be exchanged, so the system has to start shipping data around among its components. Now the importance of a fast, high-bandwidth interconnect among the nodes or servers becomes clear. Components like InfiniBand and 4 Gb network cards will buy a certain degree of scalability; however, they only ease, not solve, the bottleneck problem.
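One common redistribution scheme - hash-partitioning both tables on the join key so that matching rows land on the same node - can be sketched as follows (a deliberate simplification; real systems overlap data shipping with execution and never materialize buckets this naively):

```python
# Sketch of a shuffle join: both large tables are striped across nodes,
# so rows are re-shipped by hashing the join key, guaranteeing that
# matching keys end up on the same node before a local join runs.
def repartition(rows, key_index, n_nodes):
    """Ship each row to the node chosen by hashing its join key."""
    buckets = [[] for _ in range(n_nodes)]
    for row in rows:
        buckets[hash(row[key_index]) % n_nodes].append(row)
    return buckets

def shuffle_join(table_a, table_b, n_nodes):
    parts_a = repartition(table_a, 0, n_nodes)   # join key is column 0
    parts_b = repartition(table_b, 0, n_nodes)
    out = []
    for a_rows, b_rows in zip(parts_a, parts_b):  # each node joins its bucket
        out += [(k, va, vb) for k, va in a_rows for k2, vb in b_rows if k == k2]
    return sorted(out)

joined = shuffle_join([(1, "a"), (2, "b")], [(2, "x"), (3, "y")], n_nodes=4)
```

The `repartition` step is exactly the data shipping the paragraph above describes, which is why interconnect bandwidth becomes the limiting factor.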
In addition to shoring up the network connectivity among components, there are other ways to manage the dataflow within a system. MPP architectures look to be more intelligent about the data that gets shipped around the system, such as developing smarter algorithms or doing smarter, that is, more economical disk scans. The end result is that less data is moved around, fewer CPU cycles are consumed and less time is spent fulfilling the request.
The first approach to addressing big data processing is to add more servers to operations that involve more than one node. Although common, this approach is not an MPP architecture. Architectures that locate the various data processing steps onto distinct hardware components can be referred to as bounded parallelization. Some nodes are dedicated to the multinode, system-wide tasks of assembling results and second-pass processing. The other nodes are used only to filter the data to which they have access.
Because more of its nodes are capable of performing all tasks, such a system can perform true joins between tables across nodes. However, it still has to ship quite a bit of data around, which slows overall performance. It is because of this that the approach isn't considered truly MPP in nature.
This approach involves more pieces of hardware that are specialized for the role of either master or slave, which limits the options for distributing work. It also has the added burden of supporting the network communications between components. The more components a system has, the more complex and resource-intensive it is to manage the dataflow within the system. This architecture is scalable to the extent that the interconnects among the components can sustain the amount of data flowing through them.
In contrast to hierarchical MPP, this architecture overcomes the potential bottleneck of a single master node by parallelizing cross-blade and system-wide operations. The slave nodes are still restricted to processing only the data on their assigned storage. Note that the data storage nodes are external to the processing capacity; in many MPP implementations, these nodes are connected through high-speed, high-capacity links.
While more scalable than bounded parallelization, hierarchical MPP has limitations on user and workload scalability. A hierarchical MPP architecture has a single master node and any number of slave nodes. In addition to assembling the results passed up from the slave nodes, the master node is also responsible for handling requests, developing query execution plans, collecting statistics and coordinating the parallel work (see Figure 2). One can see how easily resource contention could occur with all these tasks assigned to one node. Depending on the size of the data, the master node might need the capacity of a supercomputer. This is not a viable approach when you have a wide range of workloads and hundreds or thousands of concurrent users, because the model places limits on working with data that resides on different nodes. For example, performing joins on tables that reside on different nodes requires moving the relevant data up to the master node and then performing the join there. You could not join two tables whose combined size exceeds the master node's capacity. A similar situation arises for operations that require a full table scan. If the table has been striped across nodes because of its size, completing a full scan might overwhelm the master node.
A master node functions as the central controlling agent for the system. The slave nodes combine storage disk space and CPU capacity in a single unit, but the slave nodes are restricted to processing only the data on their drives.
Pervasive MPP provides each node with complete processing and assembly capabilities, making it self-sufficient. This allows each node to serve as a collection point for results of parallel processing so that incoming queries can always be assured of adequate aggregation services. You can think of it as round-robin aggregation. As requests come in, available nodes become the marshalling points.
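The round-robin assignment of the marshalling role can be sketched as follows (node and query names are illustrative; a real system would also weigh current load, not just rotation order):

```python
from itertools import cycle

# In a pervasive-MPP style system every node can act as the collection
# point, so incoming queries are handed the marshalling role in rotation.
nodes = ["node0", "node1", "node2"]
assign = cycle(nodes)  # rotate the marshalling role across equivalent nodes

queries = ["q1", "q2", "q3", "q4"]
marshals = {q: next(assign) for q in queries}
```

Because no single node owns aggregation, adding nodes adds assembly capacity as well as scan capacity - the property the bounded and hierarchical models lack.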
By adding layers of asynchronous multiprocessing on top of the MPP layer, this architecture supports cross-node joins and working with tables striped across multiple nodes. This level of cross-node support is imperative for accommodating very large tables or running many queries simultaneously.
In this architecture, all nodes are equivalent. Each has the capacity to perform the complete range of cross-node and single-node operations. This approach requires the highest degree of coordination within the system, but it also allows for the most flexibility to handle blends of operations. Note that, as in hierarchical MPP, storage is contained within each node; however, each node in this architecture has the ability to act as a master node and call upon other nodes to coexecute its query.
The design decisions around optimizing, collecting results and coordinating cross-node operations are not just arcane preferences or disagreements within the theoretical community. They have very real implications for a solution's price point, its administrative complexity, its scalability and the range of workloads for which the solution is best suited.
The optimizer's design will work better for certain types of data than others. The same goes for queries and operations. An optimizer, by its very definition, cannot optimize equally for all situations. Although vendors will justifiably not share their optimizer's functional specifications with you, they should be able to give you enough guidance on what kinds of queries against which kinds of data their solution is designed to master.
Decisions about how to accommodate aggregation will play out in the amount of hardware required, the degree of scalability and the level of concurrency the system can support. How the interactions among components within a system have been resolved will impact network and infrastructure costs and scalability. Depending on how much of the networking is exposed, that is, how much the customer must provide and configure, the administrative overhead will increase.
All MPP approaches improve upon the scalability offered by nonparallel systems, especially in terms of amounts of data they can handle. Where they differ most is in the degree of flexibility they offer in numbers of queries they can process in parallel and the types of queries (cross node or single node) they can process.
Bounded parallel architectures perform well in environments with static data supporting a limited number of users with predictable patterns of activity in terms of frequency and type of query. Hierarchical MPP can stretch further to accommodate evenly paced data growth and several distinct types of queries, as long as they can be scheduled. The most scalable form of MPP, the pervasive MPP architecture, has the most flexibility in how it distributes data, processes queries and retrieves results across nodes, making it highly adaptable to environments best described as mixed or ad hoc.