Enterprises are increasingly turning to Hadoop to get business value from their data. Over the years, we’ve seen the use of Hadoop evolve from a simple data repository and into an engine that fuels smart, data-driven decision-making.
While organizations become increasingly sophisticated as their use of Hadoop matures, it’s hard to ignore that implementing and scaling Hadoop is enormously complicated. Unforeseen challenges can quickly arise as Hadoop projects grow. Solving these challenges can help organizations derive even greater value from their data, possibly leading to new markets and customers.
As Hadoop projects scale, here are five signs that you’re going through growing pains and how to take the sting out of them.
You Never Get to Production
Moving from proof of concept (POC) to production is a significant step for big data workloads. What works well for your POC doesn’t always work when you get to production. Scaling Hadoop jobs is fraught with challenges. Sometimes large MapReduce jobs just won’t finish. A job that ran in testing might not run at production scale. The job hits a certain weak point, a funny bone in Hadoop, that wasn’t visible during the POC.
Before you go into production, perform realistic scale and stress testing. Through that testing, you’ll better exercise the scalability and fault-tolerance of your applications, which is critical in the context of big data. More rigorous testing will also help you develop a model for capacity planning. You can apply that model against your projected growth plans to ensure that you’ll stay ahead of the curve.
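A capacity-planning model doesn’t have to be elaborate to be useful. As a minimal sketch (all figures here are hypothetical, not from the article), you can project compound data growth against your cluster’s usable capacity to estimate how many months of headroom remain:

```python
# Illustrative capacity-planning sketch; the growth rate and capacity
# figures below are assumptions for demonstration only.

def months_of_headroom(current_tb, monthly_growth_rate, capacity_tb):
    """Count months until projected data volume exceeds usable capacity."""
    months = 0
    volume = current_tb
    while volume <= capacity_tb:
        volume *= 1 + monthly_growth_rate  # compound monthly growth
        months += 1
    return months

# Example: 200 TB today, growing 8% per month, 500 TB usable capacity.
print(months_of_headroom(200, 0.08, 500))  # → 12
```

Running this projection against your growth plans each quarter is a simple way to “stay ahead of the curve” rather than discovering the limit in production.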
You Start Missing Deadlines
As the use of the Hadoop cluster grows, the time it takes jobs to run becomes unpredictable, and you start to miss deadlines. This problem develops slowly and is often ignored until it’s too late. After a successful launch, applications typically finish comfortably within their deadlines. Over time, as data volumes grow, and as other applications are deployed, the comfortable margin starts to shrink. At first deadlines are missed sporadically, and then chronically.
Don’t wait for a crisis to take action. As comfortable margins start to erode, add capacity or optimize your applications to keep pace. Adjust your capacity-planning model, paying particular attention to worst-case performance, so that it matches what you’re seeing in production.
You Start Telling People They Can’t Keep All That Data
Another growing pain is shrinking data retention windows. Initially, you hoped to keep 13 months of data to do year-over-year analysis. However, space constraints force you to cut that number. Shrinking retention windows are the storage equivalent of missed deadlines. The dynamic is also the same: a margin that initially seemed comfortable becomes “just enough” and then “not enough,” forcing you to cut back.
Act early. As margins start to erode, revisit your capacity planning models to see why your predictions didn’t hold and adjust the models — and capacity — to better track what’s happening.
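The retention arithmetic is easy to get wrong because raw disk is not usable capacity. As a hedged sketch (the ingest and capacity numbers are hypothetical), this accounts for HDFS’s default replication factor of 3 and asks how many days of data actually fit:

```python
# Hypothetical retention-window check. Ignores compression and
# operational overhead; assumes HDFS's default replication factor of 3.

def retention_days(raw_capacity_tb, daily_ingest_tb, replication=3):
    """Days of data that fit in the cluster at a steady ingest rate."""
    usable_tb = raw_capacity_tb / replication
    return int(usable_tb // daily_ingest_tb)

# Example: 900 TB raw capacity, 0.75 TB ingested per day.
print(retention_days(900, 0.75))  # → 400
```

In this example, 400 days comfortably covers the 13 months (roughly 396 days) needed for year-over-year analysis, but only barely; a modest rise in daily ingest would shrink the window below the goal, which is exactly the erosion the article describes.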
Your Data Scientists Are Starved
An over-utilized Hadoop cluster stifles innovation. Perhaps there’s not enough compute capacity for data scientists to launch large jobs, or maybe your data scientists need to generate intermediate results that are too large to store. Capacity planning routinely omits or underestimates the needs of data scientists. This omission, compounded with inadequate planning for either compute (missed deadlines) or storage (shrinking retention windows), means the needs of data scientists will become marginalized.
Be sure your capacity planning models include the requirements of data scientists, and act early when you see signs that you’re heading for a capacity growing pain.
You Get Sticker Shock
The number one “success disaster” with Infrastructure-as-a-Service (IaaS)-based deployments of Hadoop is out-of-control spending. You suddenly find yourself slapped with a bill that is three times last month’s cost.
Capacity planning is as important for IaaS-based Hadoop implementations as it is for on-premises ones — not for managing capacity, but for managing costs. But good capacity planning is just the start. If you plan on growing an IaaS-based Hadoop implementation to even modest levels of scale, expect to invest heavily in systems to help you track and optimize costs, as Netflix has done.
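One lightweight cost-management tactic is to project spend forward and flag the first month a budget ceiling will be breached, so the alarm fires before the bill arrives. The sketch below uses hypothetical rates and growth figures, not real IaaS pricing:

```python
# Hypothetical IaaS cost projection. Spend and growth figures are
# assumptions for illustration, not actual cloud pricing.

def first_over_budget(monthly_cost, growth_rate, budget, horizon=24):
    """Return the first month (1-based) projected spend exceeds the
    budget ceiling, or None if it stays under within the horizon."""
    cost = monthly_cost
    for month in range(1, horizon + 1):
        cost *= 1 + growth_rate  # compound monthly growth in spend
        if cost > budget:
            return month
    return None

# Example: $20k/month today, growing 15% per month, $60k ceiling.
print(first_over_budget(20_000, 0.15, 60_000))  # → 8
```

Even a toy projection like this makes the compounding visible: at 15% monthly growth, spend triples in well under a year, which is how a bill ends up at three times last month’s cost before anyone notices.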
Hadoop implementation plans typically underestimate the effort required to keep a Hadoop cluster running smoothly and to keep users productive. It’s an understandable miscalculation. If you feel you’re starting to experience these tough growing pains, then take a step back and see how you can ease the transitions.
About the author: Mike Maciag is chief operations officer at Altiscale.