How to Overcome Hadoop Growing Pains
The mountain of data pouring in from websites, smart devices, social networks and other sources keeps growing, but for most companies, it’s still very challenging to store, process and glean insights from that data. Even Hadoop, the open source data platform that has given rise to data lakes’ due to its affordability of storing and processing genuinely big data sets, has its problems.
As Hadoop continues its advance into the enterprise and powers an increasing amount of the storage and processing capabilities of many large businesses, companies must take an even closer look at the strategies that can deliver on the Hadoop promise. While Forrester predicts enterprise adoption of Hadoop is now mandatory, meaningful success with Hadoop is less about which distribution you are using and more about what’s on top of it. High-value, high-performance Hadoop is contingent on the tools that sit on top of the new data stack, and make Hadoop more enterprise-grade.
It’s important to take a holistic look at the Hadoop data stack to show how each level builds on the next in providing the modern data architecture that will increasingly power business applications of the future. A new generation of Hadoop-specific tools are emerging to overcome many of the early obstacles companies have faced in large-scale Hadoop deployments. The barriers to entry will continue to dissolve as pre-packaged, solution sets come to market over the next few years. By understanding and optimizing for each layer, businesses have the opportunity to realize the true power of Hadoop.
Commodity Hardware Is the Foundation of the Hadoop Data Lake
At the base of the Hadoop data stack is the availability and reliability of commodity hardware; modern, scale-up and scale-out hardware that is affordable and effective. A majority of the software industry has created solutions that are built to take advantage of these hardware capabilities, however some vendor options still require expensive hardware specifically customized for a specific use case. Hadoop allows organizations to fully optimize their commodity hardware, and makes possible the agile storing and processing of massive troves of data, enabling enterprises to reduce IT costs while also benefiting from the increased processing power of applications leveraging Hadoop-based data.
HDFS & YARN Power the Enterprise-grade Hadoop Core
On commodity hardware runs the Hadoop Distributed File System (HDFS), a Java-based file system designed to provide highly scalable, reliable data storage across large Hadoop clusters. HDFS is highly fault-tolerant and provides high throughput access to application data, making it ideal for applications that have large data sets. YARN (Yet Another Resource Negotiator) is the operating system of Hadoop 2.0 that allows multiple data processing engines such as interactive SQL, real-time streaming and batch processing to run on data stored in a single platform. YARN also provides a framework that enables applications to run in Hadoop.
Analytic Tooling Makes High-Value, High-Performance Hadoop a Reality
When it comes to industrial-grade, high-performance Hadoop, the analytic tooling layer is vital to the new data stack. This is where Hadoop gets interesting. Many customers make the mistake of sticking within the standard confines and analytic capabilities of their Hadoop distributions. But when you are ready to bring your Hadoop project into production mode, you need more sophisticated and powerful capabilities than the default analytic tools that come with the various Hadoop distributions. Organizations need full-featured analytics capabilities that cover all the basics of any enterprise application:
- Full ANSI SQL support
- ACID compliance
- Hadoop distribution agnostic
- Update capability
- Highest concurrency
- High-performance queries (at least 16x faster)
- Native DBMS security
- Mature and fast optimizer
- Native in-Hadoop, via YARN
- Collaborative architecture
- Open APIs
There are a variety of analytic tools available today that developers can leverage to become true big data rock stars, capable of building applications that glean fresh business insights and power new business models. Best of all, these solutions can integrate with existing hardware investments and leverage familiar tools such as SQL already in widespread use across the enterprise. By using analytic tooling to optimize the Hadoop data stack, developers can supercharge their deployment to provide infinitely scalable, high-performance, enterprise-grade analytic applications. It’s the best way to leverage your big data investment and make high-value, high-performance Hadoop a reality.
Solutions Templates That Offer an Optimized Experience Are Coming to Market
Additionally, customers will soon have a variety of solution templates powered by high productivity development tools like SQL and visual dataflows that provide a clear path for their Hadoop implementation. I’ve been working with enterprise customers across retail, healthcare, financial services and a wide range of other sectors to develop big data analytics blueprints based on real-world use cases. The blueprints connect the dots between specific business challenges and the solution sets that provide optimal performance and scalability in multi-threaded environments. They do so by laying out the analytics path for organizations, including validated approaches to foundational issues, stepwise methodology across analytic workflows, advanced analytics across digital and traditional data, and starting points for deeper analysis and better accuracy.
As the improved solutions come to market, organizations will look to purchase templates and workflows that can be customized for their unique data architecture and business challenges. This will offer an optimized experience that finally allows for the best utilization of Hadoop.
The Hadoop End Game: Analytic Applications
The steady progression of Hadoop and the flood of new Hadoop-specific tools is a clear signal that the tolerance for mediocre performance will soon end. Organizations will seek solutions that do most of the work for them, at the speed of business, so they are not required to build an application or write code themselves. It’s expensive and exhausting to hire developers and data scientists exclusively to reconcile Hadoop with other long-standing enterprise architectures only to produce less than prime results. My hope is that within the next several years, organizations will expect to be able to sign on to an app, point to a data source and get an answer. But to have this kind of capability, organizations must have a robust big data infrastructure that lends them the ability to easily call on the data that will be required to make the apps sing. Fortunately, there’s a race among open source projects and proprietary software to solve this problem. It will be interesting to see who takes the crown.