A couple of months ago, one of my partners sent me a link to an interesting blog post by Cloudera Board Chairman and Chief Strategy Officer, Mike Olson. Olson's article had to do with the emergence of Apache Spark as the preferred analytics development platform for Hadoop.

Olson opines, “What the Hadoop ecosystem needs is a successor system that is more powerful, more flexible and more real-time than MapReduce. While not every current (or maybe even future) application will abandon the MapReduce framework of today, new applications could use such a general-purpose engine to be faster and to do more than is possible with MapReduce. Enter Spark.”

I couldn't agree more. I had discovered Spark at Strata eighteen months ago and was quite impressed. That Spark had a University of California, Berkeley pedigree, born in the AMPLab as part of BDAS, the Berkeley Data Analytics Stack, was a good start.

While I understand basic MapReduce concepts, the prospect of combining dreaded Java with extended sequences of map, shuffle and reduce steps to build analytics streams appealed to me about as much as a trip to the dentist. In fact, it reminded me of repressed college exposure to assembler programming back in the day: nice, now on to PL/I, Fortran or something else reasonable – quickly.

Spark/BDAS, in contrast, makes top-down sense to me. “Starting from a big data platform built on infrastructure, storage, data processing and applications, the goals of BDAS are:

  1. To combine the now-disparate handling of batch, interactive and streaming data into a single execution engine,
  2. To readily accommodate sophisticated machine learning algorithms, and
  3. To be compatible with the existing Hadoop ecosystem.”

Olson adds:

“Spark uses memory wisely so they say it’ll run very quickly, but it also doesn’t force you to put that synchronization step in between every useful operation, so it runs much better. What’s another good thing is it began at Berkeley but now it’s got a huge ecosystem of contributors and developers, like 500 people are writing and committing code into the project. It’s very widely embraced by much of the industry.

“I believe, we believe at Cloudera, that it’s the likeliest successor to MapReduce. That doesn’t say that MapReduce, the original Google concept, ever leaves the platform. That engine will be part of CDH forever, our distro. But we see more new workloads launching on Spark than on MapReduce these days. Easier to program. Addresses a bunch of those latency and other issues.”

If OpenBI's experiences are a guide, the big data market's listening to Olson. Spark/BDAS was again front and center at Strata 2014. The most recent San Francisco Spark Summits in December 2013 and July 2014 attracted 450 and 1,000 developers respectively. Spark is now the single biggest Hadoop project. And there are 2,500 members in the Bay Area Spark Meetup Group. The trajectory of these figures would be the envy of any new software company or open source initiative.

Perhaps the Spark experience of OpenBI data scientist Bryce Lobdell can provide further illumination. Bryce develops big data/analytic apps in all flavors of Hadoop and has this to say about the evolution from MapReduce to Spark …

“Spark is interesting because it creates new possibilities for systems which never existed before. OpenBI's performed some experiments to explore those possibilities, and understand the issues and maturity of the platform. Below are some observations.

Spark implements a generalization of Hadoop’s MapReduce structure for computation, allowing the execution of workflows structured as directed acyclic graphs. Any algorithm designed as successive phases of MapReduce can be trivially redesigned in Spark using its map() and reduce() methods. In Spark, though, map() and reduce() can be arranged in a variety of ways both more convenient and efficient than MapReduce, and offer further possibilities for optimization.
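As a concrete illustration, here is a toy word count in plain, single-node Python that mimics the map-shuffle-reduce pattern just described (in PySpark the same pipeline would be a short chain of flatMap(), map() and reduceByKey() calls); this is a conceptual sketch, not Spark code:

```python
from functools import reduce
from itertools import groupby

lines = ["spark beats mapreduce", "spark scales"]

# "map" phase: emit a (word, 1) pair for every word
pairs = [(word, 1) for line in lines for word in line.split()]

# "shuffle" phase: bring equal keys together
pairs.sort(key=lambda kv: kv[0])

# "reduce" phase: sum the counts for each word
counts = {
    word: reduce(lambda a, b: a + b, (v for _, v in group))
    for word, group in groupby(pairs, key=lambda kv: kv[0])
}
# counts: {"beats": 1, "mapreduce": 1, "scales": 1, "spark": 2}
```

In MapReduce each such phase is a separate job with its own disk round-trip; in Spark the whole chain is one program over one execution engine.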

As an additional improvement over MapReduce, Spark has a data storage abstraction which allows intermediate results to be stored in-memory, spilling to disk as necessary. This is important because it eliminates Hadoop's need to write the data set to (slow) disk and read it back for each round of MapReduce. Some algorithms require many phases of MapReduce (i.e., map-reduce-map-reduce-map-reduce…) and quickly become intractable for this reason. Spark, however, has a short enough cycle time, thanks to the storage abstraction, to eliminate this issue. Lacking sufficient RAM, it too must write intermediate data sets to disk and read them again in the next phase. But Spark can benefit from even small increases in available RAM: expanding available memory will enhance the execution speed of a process up to the point where all intermediate data is stored in RAM.
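The keep-in-memory-until-forced-to-spill behavior can be sketched in a few lines of plain Python. This is a toy model, not Spark's actual storage layer; MEMORY_LIMIT and the JSON spill file are illustrative inventions:

```python
import json
import os
import tempfile

MEMORY_LIMIT = 5  # pretend we can only hold 5 records in RAM

def run_phase(data, fn):
    """Apply fn to every record; keep the result in memory unless it is too big."""
    out = [fn(x) for x in data]
    if len(out) > MEMORY_LIMIT:  # "spill": write the intermediate result to disk
        with tempfile.NamedTemporaryFile("w", delete=False, suffix=".json") as f:
            json.dump(out, f)
            return ("disk", f.name)
    return ("memory", out)

def load(handle):
    """Materialize an intermediate result, reading it back from disk if it spilled."""
    kind, payload = handle
    if kind == "memory":
        return payload
    with open(payload) as f:
        out = json.load(f)
    os.unlink(payload)
    return out

h1 = run_phase(range(8), lambda x: x * 2)   # 8 records > limit, so it spills
h2 = run_phase(load(h1), lambda x: x + 1)   # next phase reads it back
result = load(h2)                           # [1, 3, 5, 7, 9, 11, 13, 15]
```

With a larger MEMORY_LIMIT neither phase would touch disk at all, which is the sense in which adding RAM speeds a Spark job up until all intermediates fit in memory.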

Running a data loader, we scaled a Spark cluster operating on a large data set by a factor of four (i.e., 2 cores, 4 cores, and 8 cores). The compute time fell nearly linearly, a desirable behavior.

Spark has faster task startup and shutdown than Hadoop, which is salutary for online, interactive analysis or real-time queries supporting a web application. Our understanding is that this is accomplished by pre-starting the executors (i.e., those revealed by “jps | grep Worker”). The downside is that resource sharing is less elegant than with Hadoop: a running Spark job or application holds its pool of resources for the duration of its existence, and cannot shrink or grow. This limitation may be obviated when Spark is used with the Mesos or YARN cluster managers.

Spark is integrated with two interactive languages: Scala and Python, which are appealing for interactive data exploration and analysis. A common approach in Hadoop is to write a Hive (SQL on Hadoop) query which condenses a big data set to one which can be loaded into an interactive language on a single node. This has the significant disadvantage that the code running in-cluster is limited to Hive/SQL. Spark allows arbitrary Python and Scala code to be written on-the-fly and sent down to the cluster – including disparate Python libraries (though such libraries must be installed on all nodes). This is a big deal for analytics app/dev.
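The pattern of sending an arbitrary Python function down to run against distributed partitions can be sketched in plain, single-node Python. Here map_partitions is a hypothetical stand-in for what Spark's mapPartitions() does across a cluster, and statistics.mean plays the role of an arbitrary library function the cluster nodes must have installed:

```python
import statistics

def map_partitions(partitions, fn):
    # Apply an arbitrary Python function to each partition of the data set,
    # the way Spark ships a closure out to its executors.
    return [fn(part) for part in partitions]

# Two "partitions" of a distributed data set
partitions = [[1.0, 2.0, 3.0], [10.0, 20.0]]
means = map_partitions(partitions, statistics.mean)  # [2.0, 15.0]
```

The point is that fn can be any Python code at all, not just what a SQL dialect like Hive can express.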

The Python API is especially appealing, because Python is powerful, widely known, and has scores of wonderful libraries written by its open source ecosystem. The Python API, however, is somewhat slower than the Scala analog, supports fewer methods of the Spark API, and cannot be resource-boxed in the manner of Scala.

Spark is not as mature as Hadoop, though we've encountered no problems which could not immediately be redressed. For the analytics world, the most important issues relate to performing joins, sorts, distincts, or other operations which can end up pushing large amounts of data into a single executor. This is analogous to a problem experienced with Hadoop, when too much data is pushed into a single reducer process. The solution is also similar: partition data differently, increase the number of partitions, and tweak a few parameters. The problem is manageable, but requires attention to detail and understanding of the inner workings of Spark.
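The single-overloaded-executor problem and one common remedy can be seen in a toy model. This is plain Python, not Spark; "salting" the hot key with a random suffix is one standard way of repartitioning skewed data:

```python
import random
from collections import Counter

NUM_PARTITIONS = 8

# A skewed data set: one "hot" key carries almost all the records.
records = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]

def partition_of(key):
    return hash(key) % NUM_PARTITIONS

# Naive hash partitioning: every "hot" record lands on the same partition,
# which is the overloaded-executor problem described above.
naive = Counter(partition_of(k) for k, _ in records)

# Salted partitioning: add a random offset so the hot key fans out evenly.
salted = Counter(
    (partition_of(k) + random.randrange(NUM_PARTITIONS)) % NUM_PARTITIONS
    for k, _ in records
)
```

Under the naive scheme one partition holds at least 1,000 records; after salting, no partition holds more than a fraction of that, at the cost of an extra combine step to merge the salted sub-groups.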

The learning curve for Spark is not overly steep, save for the issue of patterns. For many operations, use of the API is straightforward but the arrangement of data structures passed to Spark methods is not. The selection of appropriate data structures can affect the execution speed of a Spark process by a factor of four or more. Once again, considerable attention to detail is needed to reap performance benefits from Spark.
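A small plain-Python analogue of how the arrangement of data matters: the same per-key aggregation can materialize every group in full before reducing, or combine values as they arrive (in Spark terms, roughly the difference between grouping a pair RDD before reducing and reducing by key directly). Both are sketches, not Spark code:

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Materialize every group first: holds all values per key in memory at once.
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
grouped_sums = {k: sum(vs) for k, vs in groups.items()}

# Combine on the fly: holds only one running total per key.
running = defaultdict(int)
for k, v in pairs:
    running[k] += v
# Both yield {"a": 9, "b": 6}, but with very different memory footprints.
```

At cluster scale the first arrangement can also force far more data across the network, which is where the multiple-of-four speed differences come from.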

In sum, even though Spark's just progressing to adolescence, we see it as a platform of choice for big data and analytics. And we're excited to help introduce its considerable capabilities to our analytics customers.”

I'd like to thank colleague Bryce Lobdell for his articulate contribution to this article.