If you have been reading business or technical journals recently, you have probably read about data warehousing, or about using very large databases for relationship marketing or decision support systems. One of the terms you have probably seen is Massively Parallel Processing (MPP). In this article we will look at this technology to understand how it works and why it is becoming a strategic weapon for many companies' data warehouses.
MPP machines are designed specifically to handle very large data warehouses or very large databases (VLDBs). We are talking about hundreds of gigabytes or many terabytes of data. Several of the largest retailers, railroads and airlines in the world use MPP technology to store several years' worth of detailed sales data. The largest single table on a relational database is on an MPP machine, with over 4 BILLION rows! MPP machines use hundreds or thousands of processors instead of one or a few processors to do their work. Each of these many processors has its own set of disk storage units (DSUs), a dedicated storage area to hold the data. To access these massive data stores, MPP machines use shared nothing or "shared almost nothing" designs. In a true shared nothing design, each processing engine has its own copy of the operating system, its own memory and its own DSUs. The engines communicate with each other over an I/O bus; they do not share memory or DSUs the way SMP (symmetric multiprocessing) machines do. Some of the newest releases of MPP DBMSs have moved to a shared almost nothing architecture, in which clustered SMP nodes are used. A node in the clustered design looks very much like a standard SMP node. The major difference is that these nodes communicate with each other over an I/O bus and appear as one large logical machine.
A very large table (several hundred million rows) on a mainframe system would be stored on one or a few very large disk packs. On an MPP machine, the same table would be split into pieces and distributed evenly across all the processing units. The approach is divide and conquer. When you query a mainframe system, you have one query running against one very large set of data. When you query an MPP machine, the machine broadcasts the query over the I/O bus, and each processing engine (PE) then scans its own DSUs for the answer. It is easy to see how an MPP machine with 100 PEs can find an answer in a very large table faster than a single query running against the whole data set. A good analogy is taking five decks of playing cards and dealing them out evenly to 100 people, then asking everyone to find their kings and pass them to you. Now think how long it would take you to shuffle through all the cards to find the kings by yourself.
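The divide-and-conquer scan can be sketched in a few lines of Python. This is a toy model, not a real MPP engine: the "DSUs" are just list slices, the "PEs" are threads, and the query is the card-deck example of finding the kings.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PES = 100  # hypothetical number of processing engines

# Deal the "cards" (rows) out evenly across the PEs' dedicated storage.
rows = [{"id": i, "rank": ["ace", "king", "queen"][i % 3]} for i in range(5200)]
dsus = [rows[pe::NUM_PES] for pe in range(NUM_PES)]  # PE n owns every 100th row

def scan(dsu):
    # Each PE scans only its own slice of the data.
    return [row for row in dsu if row["rank"] == "king"]

# "Broadcast" the query: all PEs scan their DSUs in parallel; the
# partial answers are then merged into one result.
with ThreadPoolExecutor(max_workers=NUM_PES) as pool:
    kings = [row for partial in pool.map(scan, dsus) for row in partial]

print(len(kings))  # each PE contributed only its local matches
```

Each worker touched only one hundredth of the rows, which is the whole point of the architecture: the elapsed time is governed by the largest slice, not by the full table.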
MPP machines also have the ability to be fault tolerant, meaning they can continue to operate after a piece of the hardware has failed. This feature works in conjunction with the DBMS, which can carry a duplicate copy of each row of a table on a second PE. For example, row 1 of table 1 would exist on two different PEs. In the shared nothing designs, all the PEs have their own copy of the operating system and communicate with each other over the I/O bus. If a PE were to stop communicating or fail, the other PEs would detect this and take over for the down PE. In the clustered SMP design, the processes from the failed node migrate to the surviving nodes. This can be one of the most important and powerful features when dealing with critical data and systems you cannot do without. I have seen a system with these features lose a DSU (disk drive), restart itself, and be back up and ready for work in a couple of minutes. What we are talking about is availability. As these systems become critical to the organization, they must be operational close to 24 hours a day, seven days a week. You cannot have them down because of hardware failures, to load the data from the mainframe or to reorganize the database.
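A minimal sketch of the duplicate-row idea, in Python. The placement rule here (primary PE chosen by key, fallback copy on the next PE over) is an illustrative assumption, not any particular vendor's scheme; the point is that a read can be satisfied even when the primary PE is down.

```python
NUM_PES = 4

# Each row lives on a primary PE and, as a duplicate copy, on a second PE.
primary = {pe: {} for pe in range(NUM_PES)}
fallback = {pe: {} for pe in range(NUM_PES)}
alive = {pe: True for pe in range(NUM_PES)}

def insert(key, row):
    pe = key % NUM_PES
    primary[pe][key] = row
    fallback[(pe + 1) % NUM_PES][key] = row  # duplicate copy on the next PE

def read(key):
    pe = key % NUM_PES
    if alive[pe]:
        return primary[pe][key]
    # The surviving PE takes over for the down PE using its fallback copy.
    return fallback[(pe + 1) % NUM_PES][key]

insert(7, "row seven")
alive[7 % NUM_PES] = False   # the row's primary PE fails
print(read(7))               # prints "row seven", served from the fallback copy
```

A real MPP DBMS does this at the storage level and also migrates the failed PE's workload, but the availability argument is the same: no single hardware failure makes any row unreachable.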
MPP machines can also be very modular and upgradeable, which means the system has the scalability to grow to the size you need. This can be of great importance: after you complete a successful project with these VLDBs, they are going to grow! It is not unusual to see these machines double in size every two years. It is fairly easy to add additional PEs or nodes to the system. This depends on two things: the DBMS must use a hash algorithm to do its data partitioning, and the system must use a shared nothing design or the shared almost nothing clustered SMP nodes. I have been through several of these upgrades with MPP machines, and the steps required are almost too good to be true. You bring down the system, the field engineer attaches the new hardware, and you tell the system to reconfigure itself. When you tell the system to reconfigure, it knows this means it needs to spread the data across the new processors. All the rows are redistributed to the new hardware, and when it is finished the system comes up and says, "I am ready to go to work."

There are a number of things that an MPP system must have. One of the most important is the ability to be "channel connected" to a host computer system. The vast majority of data that gets into a data warehouse comes from a legacy operational system. There are many companies claiming they can build a data warehouse, but the real test is maintaining it on a day-to-day basis. It is not unusual to need to delete and then load several million rows in large existing tables each night in the production cycle. A data warehouse will often contain time-stamped data, so you will need a way to delete the old data and load the new data very quickly.
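The reconfigure step described above can be sketched as rehashing every row over the new PE count. The simple modulo placement below stands in for the DBMS's real hash algorithm; the point is that row placement is recomputed from the keys, never planned by hand.

```python
def place(key, num_pes):
    # Stand-in for the DBMS's hash algorithm: any stable function of the
    # key works, because placement is computed, not manually assigned.
    return key % num_pes

def reconfigure(dsus, new_num_pes):
    # After the field engineer attaches new PEs, every row is rehashed
    # over the larger PE count and moved to wherever it now belongs.
    new_dsus = [dict() for _ in range(new_num_pes)]
    for dsu in dsus:
        for key, row in dsu.items():
            new_dsus[place(key, new_num_pes)][key] = row
    return new_dsus

# Four PEs holding 1,000 rows, then an upgrade to six PEs.
dsus = [dict() for _ in range(4)]
for key in range(1000):
    dsus[place(key, 4)][key] = f"row-{key}"

dsus = reconfigure(dsus, 6)
print([len(d) for d in dsus])  # all six PEs now hold a share of the data
```

Real systems move only the rows whose placement actually changes, but the administrator-visible behavior is what the sketch shows: attach hardware, reconfigure, and the data spreads itself.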
Another important requirement for the data warehouse is a very mature set of database utilities. These utilities must allow for the fast loading and unloading of large amounts of data. An MPP system should be able to connect not only to a host computer system over a channel, but to a PC-based LAN as well, and these utilities should be able to run as host production jobs or from the LAN. The point is that these utilities or tools have an immediate impact on the bottom line by reducing expensive database administrator (DBA) and programmer time.
I talked earlier about the importance of choosing a DBMS that uses a hash algorithm to do its data partitioning. This means the placement of a row in the database is based on a value calculated by the algorithm, so you NEVER have to reorganize a database that uses this type of data partitioning. A reorganization is required when a DBMS must reorder the rows it carries because it uses sorted indexes or sorted rows. The other major advantage of using a hash value is that the data will be evenly distributed across all the processors if you choose a good, reasonably unique key. No manual intervention is required to distribute the data across the processors. If the data is evenly distributed, with no skew, queries will run faster, because all the processors have an equal amount of data against which to query. What this all means is a lower maintenance cost for this type of DBMS.
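To see why a good hash gives even distribution with no manual intervention, here is a small Python experiment: one million hypothetical account numbers hashed onto 100 PEs. MD5 stands in for whatever hash function the DBMS actually uses.

```python
import hashlib

NUM_PES = 100

def pe_for(key):
    # Row placement is computed from the key, never from sorted position,
    # so inserting new rows never forces a reorganization.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_PES

# Distribute a million hypothetical account numbers and measure the skew.
counts = [0] * NUM_PES
for account in range(1_000_000):
    counts[pe_for(account)] += 1

print(min(counts), max(counts))  # both close to 10,000: even and automatic
```

Even though the account numbers themselves are sequential, the hash scatters them so that every PE ends up with roughly 10,000 rows. That even split is exactly what lets all the processors finish a query at about the same time.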
In order for an MPP machine to fully utilize the massive parallel processing concept, it must use a DBMS that was built to perform every database function in parallel. This means you want it to be able to insert, delete, update, select and do all of its operations in parallel. This will be reflected in the performance the system can deliver, in the speed of queries, number of concurrent users and the size of the tables it can handle. If the DBMS is built to be parallel, then it should be easy to double the size of your system when necessary.
Data warehousing, and what an MPP system can do with all that data, is an exciting and developing field. No other technology can handle terabytes of data in a cost-effective manner. When you are trying to find a system to do all the things you want to do with that data, look carefully at the following: scalability and modular upgradeability, connections to legacy systems, mature database tools, cost of maintenance, a DBMS built to be parallel, fault tolerance, and performance and size that can be referenced by other users of the MPP hardware and DBMS you choose.