Apache Spark has quickly risen to prominence in the big data world. With more than 750 contributors from more than 200 organizations, Spark has become the largest open source community in big data.
A number of large enterprises, including Amazon, Groupon, Baidu, and Yahoo, have deployed Spark, and it’s been embraced by enterprise vendors such as Cloudera and IBM. And while Spark adoption is growing, plenty of business executives are left wondering, “What does Spark mean for my business?”
Spark is a powerful technology, but to judge whether it’s a good fit for your organization, you first need to understand what Spark is and what problems it solves.
To understand Spark, it helps to have context for how it came to be. The world of big data arrived in the form of Hadoop, which allows tremendous volumes and varying types of data to be stored and processed at low cost. Hadoop’s scalable architecture has powered significant change in business. However, a gap remained between the parallel processing of data in Hadoop and the ability to use analytical insights in business processes. Enter Spark.
Spark is an open-source processing engine built to handle the needs of big data applications. It was designed for speed, ease of use, and, most importantly, for analytics. Spark abstracts distributed data into a construct called the Resilient Distributed Dataset (RDD): it loads the data you want to process into RDDs, and you then write programs on top of them. Around this core, Spark has adopted good ideas from other data processing systems, such as data frames, which come from R.
The beauty of Spark is that it allows you to process data in a programming environment without having to move the data, and data can be processed using SQL, streaming, machine learning libraries, or graph libraries. The core API is available in R, SQL, Python, Scala, and Java.
Ideal Use Cases for Spark
To begin thinking about how Spark may fit into your business, we suggest first considering some of the use cases outlined below.
The first use case is ETL. If you want to perform ETL faster or need the ability to do ETL based on Python, Java or R, then Spark is a good option. With Spark, you can take in data and perform simple transformations through SQL commands, use any of the supported languages mentioned above to process the data, and then send the data wherever you need it. The same is true if you want to process streaming data. You can have streams of data brought into Spark and then write applications that process those streams of data.
The second use case is data science and analytics. Spark ships with machine learning and graph analytics libraries, so you can take the data you’ve brought in and apply those libraries against it in a single programming environment. Spark is also excellent if you want to write applications that process big data using more than one technique: whether in streaming or batch mode, you can absorb data from different sources into one programming model and build applications against it using machine learning, graph capabilities, or custom algorithms.
Considerations for Using Spark
There are several ways you can use Spark. In addition to using it directly, you can use Spark through other products that use Spark as an infrastructure, like H2O. In addition, most vendors have integrations with Spark, and consultancies like Think Big are experts at integrating Spark into an existing data infrastructure.
Of course, using Spark requires a number of skill sets. You need to know how to write code in the languages you plan on using, how to use Spark itself, how to use Spark for SQL and streaming, and how to put Spark’s machine learning and graph libraries to work. It’s important to understand that getting a firm grasp on Spark’s capabilities and envisioning how they can be used by your organization takes time and effort—it’s not going to happen overnight.
Another important consideration is how you’ll keep up with the tool’s evolution. Much like Hadoop, Spark has attracted a huge amount of attention. In addition to Databricks, the company that is commercializing Spark, a variety of other vendors, including Cloudera, MapR, and IBM, have embraced the Spark ecosystem. As a result, Spark is evolving very quickly, and organizations should have a plan for dealing with this rapid evolution.
As often happens when competing vendors commercialize a single open source project, it’s likely we’ll have some of the same conflict and fragmentation that we do in the Hadoop world. This usually occurs when any one vendor moves the project in a direction that’s not immediately embraced by another vendor.
If you’re using a Spark distribution supported by a vendor that wants to move slowly, you’ll have to wait until they decide to move, or switch to a vendor that’s moving faster. This dilemma is simply a fact of life when you rely on open source software.
The bottom line: Spark brings some exciting capabilities to the world of big data, but as with any new tool, there are startup and maintenance costs. Before you invest in the skills and infrastructure to use Spark, be sure you understand the type of work you want to do and how you’ll manage this fast-moving project.
(About the author: Ron Bodkin is founder and president of Think Big, A Teradata Company)