Data management software isn't built for modern processors and can't leverage the technology it needs
Robin Bloor, Chief analyst and president, The Bloor Group, founder, Bloor Research
You always have your ear to the ground. What's got you interested right now?
I've been doing a fair amount of work on parallelism recently, which I think is kind of interesting because the data industry is going to have to deal with a real fundamental problem there.
You mean in terms of processing data?
Yes. It was around 2005 that chip manufacturers reached the point where they couldn't scale up chips any further simply by increasing the clock speed. The features etched into the silicon had gotten so small that most of the power was being dissipated as heat, so much so that it wasn't practical to run anything much faster than about 4 gigahertz. The effect of that was that the silicon boys had to find another way of increasing the power of chips or they would have lost their obsolescence model, and that would have ruined their whole industry.
And then we got multi-core processors, right?
Right, they started to put multiple cores on chips, and even though clock speeds had hit a wall, miniaturization let them keep progressing for a couple more generations. More cores per chip is great if you've got software that can use them. But right now, there's very little software that can make use of multi-core. All people really need to know is that chips now have multiple cores and that software needs to be written for multiple cores. If you've got software that can only run on a single core on a single server, why would you buy multi-core chips to run it? That would be as inefficient as adding another server for every application when you should be trying to use the power that's already there.
What about databases?
Pretty much all the large databases were written for parallel operation. Databases can have a lot of concurrent users, where the issue is more breadth than speed, and that's one reason they're good candidates for parallelism. That parallelism was aimed at SMP, symmetric multiprocessing, which often ran on a cluster of boxes. The difference between a cluster and a multi-core CPU is that a cluster needs a fiber interconnect between the memories of the various machines, whereas multi-core doesn't. Oracle and DB2 were written for parallel operation in the 1990s, so they can use this stuff. It doesn't make a big difference to the many people already on SMP, though over time they could move the databases they've got onto multi-core chips. That's a kind of upgrade, but we're not talking about new applications. The opportunity in applications is in ETL, in data cleansing and data integration, areas where multi-core parallel operation needs to come into play and hasn't yet.
What's a simple way for us to understand or explain the difference between serial and parallel processing?
Most people know how to write a set of instructions, and that's all programming really is. And most people are well aware of how to serially express something like doing the laundry. You gather the wash, separate the whites from the colors, put the whites in, set the machine, add the soap, turn it on. You can write that just like a program. Now try writing that in parallel. Most people wouldn't have a clue where to start. To do the wash in parallel, you'd start out with many washing machines and many dryers, and your parallel operation would be to take an item and stick it in one machine, take another and stick it in another machine. It doesn't make sense all the time, but in computing it makes tons of sense to split up the common work, because that makes everything go a lot faster. The trouble is that most people simply don't know how to write the instructions.
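The laundry analogy can be sketched in a few lines of code. This is a minimal illustration, not anything from the interview: the hypothetical wash() function stands in for one machine load, the serial version handles items one at a time, and the parallel version hands each item to its own worker, like loading several washing machines at once.

```python
from concurrent.futures import ThreadPoolExecutor

def wash(item):
    # Stand-in for one unit of work (one "washing machine" load).
    return item.upper()

laundry = ["socks", "shirts", "towels", "sheets"]

# Serial: one instruction after another, a single machine.
serial_result = [wash(item) for item in laundry]

# Parallel: each item goes to its own worker ("machine") at once.
# Executor.map preserves input order, so the results line up.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_result = list(pool.map(wash, laundry))
```

Both versions produce the same answer; the only difference is how the work is scheduled, which is exactly the shift in thinking he describes.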
Isn't virtualization supposed to handle this, treating cores as a pooled resource that isn't tied to any particular piece of software?
First of all, virtualization didn't need parallelization. The original virtualization by IBM was so long ago that a lot of us were still in short pants. But virtualization of operating systems was done because of the richness of power in a given CPU chip, not multi-core chips. It actually overdelivered for as long as chip power was doubling every 18 months. Windows NT and Linux came along, and people were buying commodity servers, basically one application per server. You don't want all the work of monitoring how resources should be shared between one application and another, so you just put in a new server with every application. But people came to see in data centers that CPU utilization for those servers was something like 6 to 10 percent, and this was on single core. So you can put virtual machines in place and you've got lots of headroom without even thinking about multiple cores.
Right, and that's a good thing.
But with every virtual machine, you're putting in a new operating system, and that footprint creates a lot of management problems. If you've got four virtual machines sharing a single CPU or a single server, peak loads on all of them can knock each other out. Also, if that server fails, you have four recoveries to make rather than one. So managing virtual machines is not trivial. In information management, virtualization mostly picks low-hanging fruit like development and testing, but once you get to about 30 percent of applications, you start to run out of candidates that don't have too many dependencies to manage.
Doesn't multi-core technology really speed up things like image and video processing?
Yes, and this is like a dam ready to break. Intel is on a campaign to make people multi-core aware and is encouraging vendors to write apps that leverage multi-core. If you're using something like Photoshop, the act of making an image a little bit bluer affects everything on the screen and is a huge load on the CPU. Doing that in parallel is an awful lot faster because with eight cores on a chip, you can divide the screen into eight parts and do the job roughly eight times faster. It's really as simple as that, and that's the way to use multi-core now. Where the problem arises is back in the server. To make stuff work in parallel on a server, you actually have to write it in parallel. This really is going to be a problem, because most programmers don't know how to write software in parallel.
Why aren't new data management apps being written to run on multi-core?
That starts to become the point, and there's a second problem: if the industry can't use multi-core, then the chip vendors' business is in trouble. So the chip vendors are of course pushing the advantages of multi-core, telling you it gives you more than you had before and that this new generation of chips is [a worthy] increase in performance over the last.
Is the problem that application vendors don't see the demand sufficient to justify reengineering?
That's another issue. This would require reengineering, and the motivation would be 'Why does it matter?' Well, it matters if you're running queries that used to take three hours to resolve and you can bring that down to three minutes. That's a huge business benefit. And you'll be able to do that on multi-core chips if you get the parallelism working, because you get the kind of multiple that makes it possible. The commodity multi-core chip right now has eight cores. But there are also esoteric products with 60 or 70 or 100 cores on a chip, used in specialized applications. They're not computer chips per se; some of them are used in backup systems for switches, in places where part of the chip can fail and it doesn't matter. But from that, you can see that 100 cores on a chip is possible, a stunning level of power. But you've got to be able to use it.
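The three-hours-to-three-minutes query speedup he mentions typically comes from partitioning the scan: each core scans its own slice of the table and the partial results are combined at the end. A rough sketch of that pattern, assuming a simple in-memory table rather than any real database engine:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker scans only its own partition of the "table".
    return sum(row["amount"] for row in chunk)

# A toy table of 1,000 rows.
table = [{"amount": i} for i in range(1000)]

# Stripe the rows across 8 partitions, one per core.
n_workers = 8
chunks = [table[i::n_workers] for i in range(n_workers)]

# Scan all partitions concurrently, then combine the partial sums.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    total = sum(pool.map(partial_sum, chunks))
```

The combining step is cheap relative to the scan, so with enough independent partitions the query time shrinks roughly in proportion to the number of cores, which is the multiple he is pointing at.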
Where will we start to see the impact?
You talk to someone with a huge heap of data, say, everybody who visited Amazon or eBay in a single day, and the whole Web log for that. It takes an entire day just to load it to disk or into a database. And they want to trace the paths of all these people, but every time they query this huge database it takes two hours to go through multiple terabytes. You know you can't do the analysis this way.
That's a different time scale, but it sounds like a traditional BI problem too.
In BI we've hit this problem many times, but we just don't think of it like that. We talked about data warehouses, and then people came up with data marts. The reason you had a data mart was that you simply could not run the workload for the whole data warehouse on one machine. So you pulled some of it off, and you even had specialized file structures; soon you might have many data marts, some of them OLAP, some operational data stores or data mining and so forth. If you'd had a parallel architecture, some of this would have been simpler. Maybe we would still have to siphon off data some of the time, but in parallel you can squash a lot of that work down.
What happens in the next 18 months?
A number of things are happening already. Most of Google already runs in parallel, so let's not say nothing has been done. Google created a programming model called MapReduce, which has given rise to Hadoop, a parallel framework for very, very big heaps of data, petabytes. IBM's doing work with that, and Pervasive Software has a product called Pervasive DataRush, a parallel framework for Java, so that's there too. If you go down into the guts of the machine, Intel is doing things at the instruction level, and other things are happening. Everybody is looking at the opportunity, but at the moment it's almost a virgin market because there's just so little there.
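To give a flavor of the MapReduce model that Hadoop implements: a map phase turns each record into key-value pairs, and a reduce phase combines the values for each key. This is a minimal single-machine word-count sketch of the pattern, not Hadoop's actual API; the real frameworks distribute both phases across many cores and machines.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts emitted for each distinct key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Toy "Web log" records, like the clickstream example above.
logs = ["user visited home", "user visited cart", "user checked out"]
counts = reduce_phase(map_phase(logs))
```

The appeal of the model is that the map calls are independent and the reduce groups by key, so both phases can be spread over hundreds of cores or machines without the programmer writing any explicit threading code.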
What about enterprise software?
In the next 18 months, what will probably happen more than anything else is that the crisis becomes more apparent. Big data vendors like Vertica, Greenplum, etc. are making use of parallelism, and the streaming data people [Apama, StreamBase, etc.] are also aware of this. At this point in time, you've got various pools of parallel work, but you don't have a parallel architecture, and I don't think that's going to arrive within 18 months. Also, we're in an eight-core generation of products, and I think it won't be until the next standard product generation that this will impact enterprise software much. Right now you can buy 32 processors on a board, but nobody is going to be able to use that unless they have really obvious parallel applications. Intel selling boards with multiple chips will not be the beginning. It will begin with 16-core chips, when it starts to be embarrassing if you can't run anything much faster with 16 cores. So I think in about 18 months this starts to emerge as a point of debate. Right now, guys like me who try to keep on the edge notice it, but give it 18 months and you'll be publishing articles, because people will be talking about it.