Slideshow 7 mistakes that can doom any Hadoop project

Published
  • March 01 2017, 6:05am EST

Avoiding the most common mistakes with Hadoop

It’s no secret that Hadoop comes with inherent challenges. Business needs, specialized skills, data integration, and budget are just a few of the things that factor into planning and implementation. With the goal of helping organizations achieve business value from Hadoop, Pentaho has identified the most common mistakes made by executives and IT teams as they go through the planning and implementation process and, most importantly, how to avoid them.

Mistake #1: Migrate everything before devising a plan…

Let’s say that you’ve determined that your current architecture is not equipped to process big data effectively, management is open to adopting Hadoop, and you’re excited to get started. Don’t just dive in without a plan. Migrating everything without a clear strategy will only create long-term issues resulting in expensive ongoing maintenance.

…Know the business reason and potential value of a project

With first-time Hadoop implementations, you can expect a lot of error messages and a steep learning curve. Dysfunction, unfortunately, is a natural byproduct of the Hadoop ecosystem... unless you have expert guidance. Successful implementation starts with identifying a business use case, considering every phase of the process, and clearly determining how Hadoop and big data will create value for your business. Taking an end-to-end, holistic view of the data pipeline prior to implementation will help promote project success and better collaboration between IT and the business.

Mistake #2: Treating a data lake on Hadoop like a regular database…

A major misconception is that you can treat a data lake on Hadoop just like a regular database. While Hadoop is powerful, it’s not structured the same way as an Oracle, HP Vertica, or Teradata database, for example. Similarly, it was not designed to store the kind of files you’d normally put on Dropbox or Google Drive. A good rule of thumb for this scenario is: if it can fit on your desktop or laptop, it probably doesn’t belong on Hadoop.

…Ensure that you don’t end up with a data swamp

As your organization scales up data onboarding from just a few sources going into Hadoop to hundreds or more, IT time and resources can be monopolized, creating hundreds of hard-coded data movement procedures – and the process is often highly manual and error-prone. Take the proper steps up front, in order to understand how best to use the Hadoop ecosystem to derive business value. Otherwise, you’ll end up with a data lake that’s more of a data swamp. Everything will be there, but you won’t be able to derive any value from it.

Mistake #3: Assume the same skillsets for managing a traditional relational database are transferable to Hadoop…

Believing you can do everything with Hadoop the way you do things with relational databases is a common mistake made by business people who are implementing Hadoop for the first time. Like taking the “red pill” in the movie The Matrix, once you enter the new world, you can’t do things the same way.

…New skills and maybe new developers will be needed

Because Hadoop doesn’t function in the same way as a relational database, you cannot expect to simply migrate all your data and manage it in the same way, nor can you expect skillsets to be transferable between the two. Ensure a smooth transition to Hadoop by taking the time to learn how it will best serve your business, and how it may impact your organization. At a minimum, you will likely have to acquire new technology skills or developers and figure out how to effectively integrate Hadoop with existing operational systems and data warehouses.

Mistake #4: ‘I can figure out security later’…

For most enterprises, protecting sensitive data is top-of-mind, especially after recent headlines about high-profile data breaches. And if you’re considering using any sort of big data solution in your enterprise, keep in mind that you’ll be processing data that’s sensitive to your business, your customers, and your partners. You know security is important in the long run, but is it important to consider it before you deploy? Absolutely!

…Do these steps before deploying any project

Address each of the following security solutions before you deploy a big data project:

•Authentication: Verify the identity of everyone who accesses clusters, so that only known users can reach the data

•Authorization: Control what actions users can take once they’re in a cluster

•Audit and tracking: Track and log all actions by each user as a matter of record

•Compliant data protection: Utilize industry standard data encryption methods in compliance with applicable regulations

•Automation: Prepare, blend, report and send alerts based on a variety of data in Hadoop

•Predictive analytics: Integrate predictive analytics for near real-time behavioral analytics

•Best practices: Blend data from applications, networks, and servers as well as mobile, cloud, and IoT data

Mistake #5: The HiPPO knows best. No strategic inquiry necessary…

HiPPO is an acronym for the "highest paid person's opinion" or the "highest paid person in the office." The idea is that HiPPOs are so self-assured that they tend to dismiss any data, or any input from lower-paid employees, that contradicts their intuitions. Trusting one’s gut rather than data may work occasionally, but Hadoop is complex and requires strategic inquiry to fully understand the nuances of when, where, and why to use it.

…Listen to the ideas of other people involved

The true business value of Hadoop is determined by the nature of your data problem. Once a data problem has been established, the next step is to determine whether or not your current architecture will help you achieve your big data goals. You hired talented people for a reason; listen to them. Once a business need for big data has been established, determine who will benefit from the investment, how it will impact your infrastructure, and how spending will be justified. Also, try to avoid science projects; they tend to become technical exercises with limited business value.

Mistake #6: Bridge the skills gap with traditional ETL processes…

The truth is, the skills gap is a major stumbling block for most businesses. New big data technologies are designed to address the skills gap, but they tend to support experienced users rather than elevate the skills of those who need it most. And unfortunately, what works for regular ETL doesn’t translate to a Hadoop ecosystem, and the Hadoop learning curve is very steep. Basically, you have two options: 1) Hire people who’ve had the proper training, or 2) Work with experts to train and guide your staff through implementation.

…Experience and best practices will drive success

Technology only gets you so far. People, experience, and best practices are the most important drivers for project success with Hadoop. Regular ETL processes aren’t generally transferable to the Hadoop ecosystem, so be prepared to either hire talent with the proper training, or work with experts who can train and guide your staff through implementation.

Mistake #7: I can have a small budget and get enterprise-level value…

The low-cost scalability of Hadoop is one reason why organizations decide to use it. But many organizations fail to factor in data replication and compression (storage space), skilled resources, and the overall management of integrating big data with their existing ecosystem. Remember, Hadoop was built to process a wide variety of enormous data files that continue to grow quickly. And once data is ingested, it gets replicated! So, it’s absolutely essential to do proper sizing up front. This includes having the skills on hand to leverage SQL and BI against data in Hadoop and to compress data at the most granular levels.
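To make the sizing point concrete, here is a rough back-of-the-envelope sketch in Python. It is an illustration, not a sizing method from the original slideshow. The function names, the 25% headroom, the 2:1 compression ratio, and the 5% monthly growth rate are all assumptions chosen for the example; the only Hadoop-specific figure is the replication factor of 3, which is the HDFS default.

```python
def projected_logical_tb(current_tb, monthly_growth_rate, months):
    """Project logical (pre-replication) data volume with compound monthly growth."""
    return current_tb * (1 + monthly_growth_rate) ** months


def raw_storage_tb(logical_tb, replication_factor=3, compression_ratio=1.0, headroom=0.25):
    """Estimate the raw disk capacity a cluster needs for a given logical data volume.

    replication_factor: copies kept of each block (3 is the HDFS default).
    compression_ratio:  assumed ratio of uncompressed to compressed size (2.0 means 2:1).
    headroom:           assumed extra fraction reserved for temp files and free space.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    stored = logical_tb / compression_ratio      # size on disk after compression
    replicated = stored * replication_factor     # every block is stored several times
    return replicated * (1 + headroom)           # leave working room on the disks


if __name__ == "__main__":
    today = 100.0                                        # hypothetical: 100 TB of source data
    in_a_year = projected_logical_tb(today, 0.05, 12)    # assumed 5% growth per month
    print(f"Raw capacity needed today:     {raw_storage_tb(today, compression_ratio=2.0):.1f} TB")
    print(f"Raw capacity needed in a year: {raw_storage_tb(in_a_year, compression_ratio=2.0):.1f} TB")
```

Even with 2:1 compression, the hypothetical 100 TB of source data calls for 187.5 TB of raw disk under these assumptions, and the requirement keeps climbing as the data grows. Running the numbers with your own growth rate and compression ratio before buying hardware is the kind of up-front sizing this mistake is about.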

…Know how big data will impact your systems

Understand how storage, resources, growth rates, and management of big data will factor into your existing ecosystem before you implement.
