To capitalize on the increasing flows and volumes of data, data-driven organizations have embraced Apache Hadoop as their core Big Data technology. This data represents a tremendous business opportunity, and early adopters of Hadoop are seeing real business value.

However, challenges to realizing that value remain. Hadoop is an ecosystem of rapidly evolving open-source software projects, and the growing demand for Hadoop has made it hard to find and keep the technical and operational experts who know how to implement and run it well.

To avoid getting stuck in a recurring bad dream of clusters, nodes, failed jobs, vanishing Hadoop admins, and more, here are some pieces of advice from grizzled veterans:

Enterprise-Ready Hadoop

Today, organizations manage multi-layered data environments involving a wide array of data sources, data management processes, and data marts and warehouses. It is a challenge for enterprise IT teams to keep up with existing systems, let alone introduce new ones. Once Hadoop moves beyond the experimental stage, introducing it into the enterprise data architecture without interrupting established business processes, data access, and user data flows requires careful thought.

Ensure that your platform provider has an architecture that allows Big Data to be incorporated easily into your current ecosystem, so that you maintain existing investments in environments, processes, and people, and also have a way to spread Big Data insights broadly throughout the organization.

For many enterprises, a cloud implementation of Hadoop might be the fastest and easiest path to enterprise-ready Hadoop. For others, who prefer the direct control of an on-premises deployment, have adequate staff to run it effectively in-house, and have the time for a full onsite deployment schedule, an on-premises provider with a strong professional services arm could be the better fit.

The Right Set of Tools for Your Jobs

Hadoop is a full ecosystem of solutions, not a single product. Many processing engines run on top of Hadoop to fulfill a variety of analytical requirements. Iterative analytics for machine learning, for example, are best served by fast-turn, in-memory engines like Spark, while MapReduce, although older, is still the leader for massive batch processing jobs. Hive, with its SQL-like query language, is a good choice for the middle ground of ad hoc reporting and analysis.
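
To make the contrast concrete, here is a minimal PySpark sketch of both styles of work running against the same cluster. The table and column names (clickstream, user_id, score) and the update rule are illustrative assumptions, not a prescribed schema or algorithm:

    # A minimal sketch, assuming a Hive-enabled Spark cluster with a
    # hypothetical "clickstream" table containing user_id and score columns.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("engine-comparison")
             .enableHiveSupport()
             .getOrCreate())

    # Iterative analytics: cache the working set once, then loop over it
    # in memory, the access pattern where Spark beats disk-bound MapReduce.
    events = spark.table("clickstream").select("user_id", "score").cache()
    estimate = 1.0
    for _ in range(10):  # toy refinement loop standing in for ML iterations
        avg_score = events.groupBy().avg("score").first()[0]
        estimate = 0.9 * estimate + 0.1 * avg_score

    # The middle ground: a one-off SQL question, expressed through Hive.
    spark.sql("""
        SELECT user_id, COUNT(*) AS visits
        FROM clickstream
        GROUP BY user_id
        ORDER BY visits DESC
        LIMIT 10
    """).show()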

To have the right array of tools for the job, make sure your platform provider offers a full complement of solutions and battle-tests those offerings, so that they are truly enterprise-ready.

Make Peace with Dirty Data

No one wants to work with dirty data, but do you really need all of your data to be pristine?

As it happens, business intelligence and Big Data repositories don’t necessarily have the same level of data quality as your operational systems. And that’s OK. If you’re running sentiment analysis on social media chatter about your products by geography, it’s fine if some of the data is wrong; in aggregate, the analysis will still give you the direction you need to make marketing or product decisions.

For example, your new barbecue-flavored potato chips are getting raves on the West Coast, but the East Coast finds them far too sweet. It’s time to change your production and distribution plans: capitalize on West Coast tastes, and swap in a product the East Coast prefers for your East Coast distributors before revenues tank.

Analyses run the gamut in terms of data purity and accuracy requirements, and trying to ensure that all data is “clean” can take too long and cost too much. Improving the quality of one more datum always carries a marginal cost, but there is no single marginal value to a datum or set of data: the value of data is a function of its use. So rather than have data stewards make a one-time decision about data quality when data is loaded into your Big Data repository, let data consumers make their own decisions about data quality when they use it.
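
As a sketch of what consumer-side quality decisions can look like in practice, consider the PySpark snippet below. The quality_score and sentiment columns, the storage path, and the thresholds are all hypothetical illustrations: the point is that each consumer applies its own quality bar at read time, rather than one steward-set bar applying to everyone at load time.

    # A minimal sketch, assuming records were landed with a hypothetical
    # per-record quality_score column attached at ingest time.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("consumer-quality").getOrCreate()
    mentions = spark.read.parquet("/data/social_mentions")  # illustrative path

    # A marketing sentiment dashboard tolerates noise: keep most records.
    marketing_view = mentions.filter(F.col("quality_score") >= 0.3)

    # A finance-facing report needs high confidence: filter aggressively.
    finance_view = mentions.filter(F.col("quality_score") >= 0.9)

    # Same raw data, two different quality decisions made at the point of use.
    (marketing_view.groupBy("region")
        .agg(F.avg("sentiment").alias("avg_sentiment"))
        .show())
    print(finance_view.count(), "records meet the finance bar")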

To tap the full potential of Big Data, enterprises need to understand how to run the Hadoop ecosystem to their advantage. By choosing an enterprise-ready Hadoop platform, outsourcing operations where that makes sense, assembling the right set of tools, and making peace with dirty data, companies can begin to extract business value from their Hadoop deployments.

(About the author: Mike Maciag is chief operating officer at Altiscale)
