Are You Asking All the Wrong Questions About Apache Spark?
Apache Spark has seen explosive growth in recent months. But as with any new technology, there are important questions that IT leaders should ask before starting an implementation.
Information Management spoke with Sean Suchter, CEO and co-founder of Pepperdata, about the questions he feels everyone should be asking about Spark but isn’t. In some cases, CIOs and CDOs are simply asking the wrong ones. Below, Suchter explains what he means by that.
Information Management: What are the “wrong questions” being asked about Spark and why are they the wrong ones?
Sean Suchter: Many of the wrong questions about Spark focus on whether it is a replacement for other technologies or engines, particularly MapReduce and Hadoop. Asking whether or not Spark will replace MapReduce is a meaningful and valid question, but it’s really not the most important one. On the other hand, questions about whether or not Spark will replace Hadoop are pretty meaningless. It’s like asking, will diesel locomotives replace train tracks? It doesn’t really make any sense.
IM: What questions should IT execs and data managers be asking about Spark?
SS: IT execs and data managers should be thinking about what they can accomplish more effectively with Spark. They should be asking, what can I do more of? What jobs can I now run more effectively? Thinking about what is now possible with Spark that hasn’t been tried before is much more interesting than wondering what it could replace.
Secondly, they should be asking themselves, how do I get as many people as possible in the organization to take advantage of Spark? How do I make it available to them? At this level, they need to be thinking about ensuring that Spark correctly fits into their multitenant clusters from a performance perspective. Spark has a lot of advantages because it uses hardware in new and unique ways, but that means there are opportunities to get performance right and opportunities to get it really wrong.
IM: How do these questions differ from what you think are good or bad concerns about other big data tools?
SS: Spark, for whatever reason, engenders this conversation about replacement that we don’t really see with other big data tools, like BI tools or workload management tools. The concern with every new big data tool should always be about fitting systems into existing infrastructure–performance has to be a top priority.
IM: What are the key considerations to think about when considering a Spark implementation?
SS: The most important consideration that execs and managers need to think about as they consider a Spark implementation is that it doesn’t have to be a huge project. It’s just one more thing that you put in a cluster, and it should be treated as such. You can successfully fit Spark into current systems. That’s sometimes not obvious because of all the talk about standalone Spark or replacing MapReduce, but it doesn’t have to be hard.
There are tools to help with important components like security models and multi-tenancy performance. Start by figuring out the easiest way to implement Spark into an existing cluster and you’re already off to a good start.
IM: What are the least obvious issues that data managers and IT leaders should be aware of?
SS: Spark, as a relatively new engine, is still adding functionality with every version, and it’s adding that functionality really quickly. Data managers and IT leaders should plan to be aggressive with Spark expansion and updates, at least in the next couple of years. New versions of other tools like MapReduce, Hive or YARN don’t require the same kind of attention that should be applied to Spark updates right now.
IM: How should a chief data officer or CIO best measure success with a Spark implementation?
SS: A successful Spark implementation first and foremost means you shouldn’t have to compromise. You don’t have to use features that get in your way, and you probably don’t have to give up on things to reap benefits. Success means faster analytics and a noticeable increase in productivity, and you should be able to meet all your data management and security goals. Most importantly, you should gain all of this without having to take something else away.
IM: What message would you most like IT execs and data execs to receive when it comes to achieving the most success with Spark?
SS: Don’t overthink it. You can implement Spark and it doesn’t have to be a major upheaval or huge organizational shift–it’s not like going from a SQL database to NoSQL or moving to VMware. If you think you have to change something, think again, because it might not be the case. Spark doesn’t have to mean starting from scratch.
In most implementations, Spark is really just another tenant in a multitenant system. As always, the real challenge with tenants is getting both the security model right and the performance right. Because Spark uses hardware in new ways, it complicates things.