I was a bit apprehensive about my decision to attend Tuesday’s Strata tutorial “An Introduction to the Berkeley Data Analytics Stack (BDAS) Featuring Spark, Spark Streaming, and Shark.” I’m more a stats guy than a computer scientist, and I generally shy away from conference presentations by “vendors.”
But I’ve had great experience with U.C. Berkeley open source software over the years, working extensively with Berkeley Unix, Ingres and PostgreSQL. And I’m glad I participated. Berkeley professor and AMP (Algorithms, Machines, People) Lab co-director Ion Stoica and grad students Matei Zaharia, Reynold Xin, Shivaram Venkataraman and Tathagata Das presented a lucid introduction to what might well become the foundation for a next-generation big data platform that extends the current Hadoop ecosystem.
I’m the first to admit I find the Hadoop lexicon – MapReduce, HDFS, HBase, Hive, Pig, NoSQL, et al. – confusing. The BDAS presentation clarified a lot for me. Starting from a big data platform built on infrastructure, storage, data processing and applications, the goals of BDAS are:
- To combine the now-disparate handling of batch, interactive and streaming data into a single execution engine
- To readily accommodate sophisticated machine learning algorithms, and
- To be compatible with the existing Hadoop ecosystem.
And if all of this can be delivered with a performance boost of half an order of magnitude or more, so much the better.
The BDAS design approach includes the aggressive use of memory, enhanced parallelism and the increasingly popular trade-off of a bit of accuracy for a lot of performance. The team cited research showing that even at big data shops like Microsoft and Facebook, 90% of current jobs process 1 TB or less of data, with an average of 15 GB.
The workhorse components of BDAS include cluster computing runtime Spark, cluster resource manager Mesos, and SQL API Shark. There’s also a variant of Spark for the unique demands of streaming data.
What separates Spark immediately from MapReduce is its integration and high-level accessibility. “Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce. To make programming faster, Spark provides clean, concise APIs in both Scala and Java. You can also use Spark interactively from the Scala console to rapidly query big datasets.”
Though Spark is a new engine, it’s fully compatible with data sources supported by Hadoop, so it can run against existing data. Spark is optimized for the special needs of machine learning and data mining, and “is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.” Finally, programs written in Spark with the functional language Scala (or Python) are much easier to read and maintain than their Java MapReduce forebears. Sounds too good to be true!
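To make that concision concrete, here is a minimal sketch of the classic word-count pipeline in plain Python — not Spark itself, but the same functional map/reduce style; the roughly equivalent Spark Scala calls (flatMap, map, reduceByKey, as described in Spark’s documentation) are noted in the comments, and in real Spark they would run distributed across a cluster with intermediate data cached in memory:

```python
from collections import Counter

# A couple of sample "lines" standing in for an input file.
lines = [
    "to be or not to be",
    "that is the question",
]

# Map step: split each line into words.
# Spark Scala equivalent: lines.flatMap(line => line.split(" "))
words = [w for line in lines for w in line.split()]

# Reduce step: tally occurrences of each word.
# Spark Scala equivalent: words.map(word => (word, 1)).reduceByKey(_ + _)
counts = Counter(words)

print(counts["to"])  # 2
```

The contrast the presenters drew is that the hand-written Java MapReduce version of this same job requires separate mapper and reducer classes plus driver boilerplate, while the functional version reads as a short chain of collection operations.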
Spark is open source under a BSD license. You can download it and get started locally, or deploy it on a private cluster or Amazon EMR. Training videos and materials from an AMP Camp held at Berkeley last summer are available. There are also Spark/Shark meetup groups.
I liken the difference I see between the Hadoop of 2013 and BDAS to my education in programming many years ago. My first college exposure was assembler, which I found pretty inaccessible. Once introduced to the high-level languages Fortran and PL/I, though, I was in business.