
Opinion: Real-Time Data or Batch? How About Both?

Published February 25, 2016, 6:30am EST

Listen to the Silicon Valley buzz, and what do you hear? Among other things, lots of busy bees making noise about “real-time data.”

Revolutionary! Disruptive! The future! The excited hive tends to overlook one important point: Real-time processing has been around for a long time, only we called it something else — Complex Event Processing (CEP).

Academia and industry have spent a lot of time researching CEP since the 1990s. Real-time's long history does not detract from its vital importance to the future of computing. Indeed, advances in real-time processing have contributed to technology movements that are both revolutionary and disruptive.

Regardless of how long it's been around, real-time is indispensable and beats at the heart of our collective computing future. Just as real-time, or CEP, has been part of the computing universe for quite a while, so has batch processing. And like real-time, batch has served as a core function of another decades-old technology: data warehouses.

Batch engineering and data warehousing are just as urgent as real-time. Both categories of data processing have undergone a transformation over the past 15 years, driven by a Cambrian explosion of data.

Two insurgent developments, in particular, precipitated this data-processing evolution: the rise of distributed computing, and the demand for open source. Computing today is exponentially more distributed than it was even five years ago.

In computing today, anything you build must be architected to run across many machines. We are no longer talking about five machines, each with a unique hostname, handling all of our tasks. This demands new ways of thinking about the movement of data.

If we think of servers as a herd of cattle, we don't care if a yearling breaks a leg and gets left behind. We just want the herd running in the right direction and the herd as a whole to be healthy. You keep an approximate count of how many cows make up the herd and which breeds they are, and with a single command you should be able to change direction.

This grand blossoming of data — extremely useful, invaluable data — has helped raise the profile of open source. Small startups can now produce enormous floods of data thanks to the swift and powerful rise of the Internet and mobile. In 2014, for example, 2.8 billion people were online, compared to 35 million in 1995.

Mobile's ascent is key: it not only increased the overall Internet population, it dramatically boosted the amount of time people spend online. And mobile's reliance upon real-time is profound. Because people are now connected almost constantly through their phones, data about their behavior, location and more streams in continuously. That data has become gold for businesses, which process it in real time to communicate with customers.

Startups, which increasingly rely upon data dominion for marketplace triumph, can't afford the expensive enterprise solutions required to handle and understand all of that data, so they build their own tools or adopt open-source alternatives.

Stephen O’Grady’s observation in The Software Paradox, “software, once an enabler rather than a product, is headed back in that direction,” is perfectly apt. Software used to be the lure to persuade people to buy hardware.

Then Windows arrived, and things changed rather dramatically. Software became the star, and licensing it for individual machines became a large expense for companies. But now, with data-related demands for software so immense (and rising every day), and with so many companies unable to budget software licenses for all of their needs, open source is gaining muscle.

It's not about to pin Oracle to the ground any time soon, but it is no longer a harmless waif, a weak sideshow phenomenon of no concern to the software giants.

The competitive edge today comes not from building your own software, but from turning software into a useful service. This is an industry-wide trend — everything is a service. Uber is valued at more than $60 billion not because it has the world's best software, but because talented engineers assembled open-source software to deliver a profound, consequential transformation of a legacy business model, and did so quickly.

Today's computing ecosystems increasingly rely upon, and demand, a stronger marriage between the two categories of data processing we highlighted at the start of this essay: real-time and batch. The way forward in this world of distributed computing and open source does not lie in merely parallel but disconnected advances in real-time and batch. Instead, it lies in making this sometimes shaky marriage as vital as possible.

Today, the best-performing digital properties use data as a strategic asset — Facebook, Amazon, Tesla, etc. To do this, they need to learn from data (batch/historical analysis) and act on it (real-time). At the same time, Internet of Things (IoT) use cases require real-time action based on data: For example, giving a driver a discount as she passes a gas station, or pushing a personalized notification for a discount at Macy’s to a customer’s mobile as he passes by a brick-and-mortar store, based on behavior analysis at macys.com.
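As a toy illustration of that pattern, the Python sketch below consults an offline, batch-computed propensity score when a live location event arrives. Every name, coordinate and threshold here is invented for illustration, and distance is simplified to planar geometry rather than proper geo math.

```python
from dataclasses import dataclass
from math import hypot

# Batch side: propensity scores precomputed offline from site behavior.
# User IDs and values are purely illustrative.
discount_propensity = {"user-123": 0.87, "user-456": 0.12}

@dataclass
class LocationEvent:
    user_id: str
    x: float   # simplified planar coordinates, not real geo math
    y: float

STORE = (0.0, 0.0)   # hypothetical store location
RADIUS = 0.5         # "passing by" distance threshold, same units as x/y
THRESHOLD = 0.8      # minimum batch-derived score that warrants a push

def on_location(event: LocationEvent) -> None:
    """Real-time side: act on a live event using the batch-computed score."""
    near_store = hypot(event.x - STORE[0], event.y - STORE[1]) <= RADIUS
    if near_store and discount_propensity.get(event.user_id, 0.0) >= THRESHOLD:
        print(f"push discount offer to {event.user_id}")  # stand-in for a real push

on_location(LocationEvent("user-123", 0.1, -0.2))  # near and high score: push fires
on_location(LocationEvent("user-456", 0.1, 0.1))   # near but low score: no push
```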

Batch/historical data should always be used to inform real-time processing. But which tool, and how?

One approach, trumpeted by Lambda Architecture adherents, leverages two separate tools, one designed for real-time and the other for batch. The offline batch layer performs heavy computations to deliver computed results to the real-time layer. For example, the batch layer is used to build a recommendation model, and the real-time layer turns to that model to make actual recommendations.
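To make that division of labor concrete, here is a minimal, framework-agnostic Python sketch of the recommendation example. The data, function names and co-occurrence approach are illustrative stand-ins, not a prescription for any particular Lambda implementation.

```python
from collections import defaultdict
from itertools import permutations

# --- Batch layer: heavy offline computation over historical events ---
def build_cooccurrence_model(purchase_histories):
    """Count how often each pair of items appears in the same basket,
    then precompute a ranked list of co-purchased items per item."""
    counts = defaultdict(lambda: defaultdict(int))
    for basket in purchase_histories:
        for a, b in permutations(set(basket), 2):
            counts[a][b] += 1
    return {item: sorted(others, key=others.get, reverse=True)
            for item, others in counts.items()}

# --- Real-time layer: cheap lookups against the precomputed model ---
def recommend(model, item, k=3):
    """Answer a live request instantly using the batch-built model."""
    return model.get(item, [])[:k]

# Illustrative history; in practice this would come from a data warehouse
history = [["milk", "bread", "eggs"], ["milk", "bread"], ["bread", "jam"]]
model = build_cooccurrence_model(history)   # runs offline, e.g. nightly
print(recommend(model, "bread"))            # serves each event as it arrives
```

The point of the split is that the expensive counting never happens on the request path; the real-time layer only performs a dictionary lookup against whatever the batch layer last published.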

Lambda Architecture sounds great. But as Jay Kreps points out, it translates into a lot of engineering operations, because two data pipelines must now be maintained and kept in sync. Another approach champions a single, unified system, often hooked up to a distributed message queue like Kafka. This is the inspiration behind Apache Spark and Apache Flink.
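The appeal of the unified approach is that one body of transformation code can serve both modes. As a rough sketch, here is PySpark code (written against the Spark 1.x-era streaming API) that applies the same function to a historical file and to a live socket stream; the paths, host and port are placeholders, and a production pipeline would more likely consume a Kafka topic.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def count_by_key(rdd):
    # One transformation, written once, reused for batch and streaming
    return (rdd.filter(lambda line: line.strip())
               .map(lambda line: (line.split()[0], 1))
               .reduceByKey(lambda a, b: a + b))

sc = SparkContext(appName="unified-batch-and-streaming")

# Batch: run the logic over historical data (path is illustrative)
historical = count_by_key(sc.textFile("hdfs:///logs/history"))
historical.saveAsTextFile("hdfs:///reports/history")

# Streaming: the identical function runs on every 5-second micro-batch
ssc = StreamingContext(sc, 5)
live = ssc.socketTextStream("localhost", 9999)
live.transform(count_by_key).pprint()

ssc.start()
ssc.awaitTermination()
```

Spark's micro-batching is what makes this sharing possible: a stream is treated as a sequence of small batches. Flink inverts the relationship, treating batch as a special case of streaming over bounded data.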

Much more work needs to be done in this promising area. For example, last December Yahoo Engineering published thorough benchmarks showing that Spark Streaming still has room for improvement.

In both cases, one thing to avoid is raising the power of one partner in the real-time/batch marriage at the expense of the other. For example, one downside of the unified model of computation so far is that while its fluency with streaming keeps improving, it still falls short on truly scalable batch processing. Ideally, streaming and batch will benefit from the same level of engineering excellence.

Real-time or Batch? We know the answer is both.

We don't yet fully understand how, or at least how best, to help this wild ecosystem flourish with the vigor it demands. Both the Lambda Architecture and the single unified-tool approach have pros and cons, and both require engineering time, expertise and a willingness to work with moving targets.

(About the author: Kazuki Ohta is chief technology officer at Treasure Data, Inc.)
