Facebook has open sourced Presto, a “Distributed SQL Query Engine for Big Data.” It’s of course nothing new to see the Web heavyweights open source technologies they have developed to meet their own requirements - after all, Hadoop itself was born at Yahoo!, the Storm real-time framework comes to us courtesy of Twitter, and many more including eBay, LinkedIn, and Facebook have all made significant contributions to the big data cause.

However, in this case, the situation is different. The market is already crowded with alternative technologies for SQL-on-Hadoop:

  • Hortonworks runs the Stinger project, aimed at making Hive run “100 times faster.”
  • Cloudera offers Impala, a query engine that exposes HDFS data through a low-latency SQL layer.
  • MapR is leading the charge on the Apache Drill project, a copycat of Google Dremel.
  • Even before Pivotal was formally founded, it had already announced the fusion of its HAWQ SQL engine with Hadoop.

However, most of these developments are fairly recent. If memory serves me right, the first announcement of improving SQL query performance on Hadoop was made by Cloudera at Strata East 2012, just a year ago, when they pre-announced Impala. All the other projects listed above were born, or at least announced publicly, in the first half of this year.
Nevertheless, the need for interactive SQL query performance on big data is not new. Many organizations have been facing the conundrum of taking big data beyond batch processing and delivering analysis and reporting. And clearly the performance of Hive couldn’t allow this - hence all these competing projects. In all likelihood, the Facebook engineers, not finding what they needed on the market, simply went off to build it and, when all these announcements took place, faced the dilemma of whether to carry on or not.

Players in the Hadoop market are clearly struggling to differentiate themselves. Beyond a common foundational layer they all support (a layer that was just augmented by YARN), they aim to offer differentiated capabilities for business applications - where the money is. But by all offering the “same differentiation” (interactive SQL), they are losing an opportunity to take the battle to other grounds. Nobody is going to gain market share because their query engine is 3 percent faster than the others!

Interactive SQL-on-Hadoop is now a “basic” capability of the platform. Choice is good, but complexity and fragmentation are not, and N SQL engines on top of Hadoop is simply (N-1) too many.
