What the cloud can do for data that Hadoop couldn't
When I was at Yahoo! almost 10 years ago now, I remember the first time I heard the term “Hadoop.” Like everyone else, I didn’t even know how to pronounce the word. Besides the weird name, though, it was obvious that this was something radically different.
Before Hadoop, we had databases that we needed to load with flat files. These databases ran on top of expensive file systems that had proprietary and exotic connecting “fabric.” With Hadoop, we got a distributed file system built on commodity, off-the-shelf hardware. It was easy to scale by just adding more nodes to the cluster, and its “shared nothing” architecture meant true horizontal scalability for compute, memory and disk.
The resulting cost savings associated with this new architecture meant that we no longer had to pick and choose what data was worth storing - we could store everything. More importantly, we could store data “as is” without needing to transform the data into the rigid rows and columns that standard databases require.
Over time, Hadoop’s open source heritage and flexible architecture drove new innovations beyond just its cheaper data storage beginnings. With the addition of a schema catalog (HCatalog) and a SQL query processor (Hive), we now had the ability to directly query the raw data files and skip the database load step altogether.
This “schema on read” architecture meant that we could choose to hydrate data only when we had questions to ask. In my opinion, this was Hadoop’s true innovation, and it made the “data lake” an alternative to the age-old “schema on write” architectures popularized by SQL databases.
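To make “schema on read” concrete, here is a minimal Python sketch. It is not how Hive or HCatalog work internally. The `CATALOG` dictionary, the `page_views` table and the sample rows are all made up for illustration. The point is only that the raw text is stored untouched and the column names and types are applied at the moment a question is asked.

```python
import csv
import io

# Raw data landed "as is": no schema was applied when it was stored.
# (Hypothetical sample rows, invented for this sketch.)
RAW_FILE = "2017-03-01,home,120\n2017-03-01,search,340\n2017-03-02,home,95\n"

# Hypothetical stand-in for a schema catalog: column names and types
# live apart from the data files they describe.
CATALOG = {
    "page_views": [("day", str), ("page", str), ("views", int)],
}


def read_with_schema(raw_text, table):
    """Apply the catalog schema to the raw rows at query time."""
    schema = CATALOG[table]
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def total_views(raw_text, page):
    """Roughly: SELECT SUM(views) FROM page_views WHERE page = <page>."""
    rows = read_with_schema(raw_text, "page_views")
    return sum(row["views"] for row in rows if row["page"] == page)
```

Calling `total_views(RAW_FILE, "home")` returns 215. Changing the schema means editing the catalog entry, not reloading the data.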
With Hadoop and its distributed file system, schema catalog and query processing engine, we can now build “just in time” information pipelines that are drastically more flexible and easier to manage.
There’s got to be a catch, right? Sort of.
Unfortunately, the best qualities of Hadoop’s architecture, its flexibility and extensibility, have a dark side too. While its open source beginnings translated into rapid innovation, the pace of change and the underlying complexity that accompanied that change have made it a bear to manage.
The people needed to stand up and manage a Hadoop cluster are hard to find, and the work requires a wide range of skills. On top of the scarcity of people to operate the software, racking your own hardware adds another layer of complexity and friction.
Enter the Cloud
Luckily, the cloud has largely solved the problem of hardware provisioning and management. But it gets better than that. In parallel, cloud-based distributed file systems like S3 have matured to the point where they are now an alternative to Hadoop’s distributed file system (HDFS).
Moreover, just as the Hadoop ecosystem introduced a schema catalog and SQL engine to provide direct file access, so has the cloud. For example, AWS provides a schema catalog with Glue and has introduced SQL engines that query raw files with Redshift Spectrum and Athena.
In fact, the cloud is beginning to look a lot like a Hadoop data lake, just cheaper and without the overhead of managing clusters.
Advantages of a Data Lake in the Cloud
While the cloud is adopting many of the best features of Hadoop, there are key advantages to the cloud-based data lake.
Unlike Hadoop’s HDFS, where storage and compute are tightly coupled, cloud file systems like S3 separate storage from compute. This means that users can elastically add compute when needed and turn it off when it’s not needed. Besides minimizing costs, the flexibility this architecture affords means that no job is too large or small for your infrastructure.
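A back-of-the-envelope sketch of why this matters for cost follows. Every number in it is made up for illustration: the node counts, the job durations and the price of 1.0 per node-hour are assumptions, not vendor pricing, and storage cost is left out entirely. It compares a cluster sized for peak load and left running all day with compute that is paid for only while jobs run.

```python
def coupled_cluster_cost(peak_nodes, hours_in_period, price_per_node_hour):
    """Fixed cluster sized for the peak job and left running the whole period."""
    return peak_nodes * hours_in_period * price_per_node_hour


def elastic_cost(jobs, price_per_node_hour):
    """Compute paid for only while each job runs; jobs is a list of (nodes, hours)."""
    return sum(nodes * hours * price_per_node_hour for nodes, hours in jobs)


# Made-up workload for one day: (nodes, hours) per job. Illustration only.
JOBS = [(10, 2), (2, 5), (1, 8)]
PRICE = 1.0  # hypothetical price per node-hour, not a real vendor rate

always_on = coupled_cluster_cost(peak_nodes=10, hours_in_period=24, price_per_node_hour=PRICE)
on_demand = elastic_cost(JOBS, PRICE)
```

With these assumed inputs, `always_on` is 240.0 and `on_demand` is 38.0. The gap shrinks as the cluster gets busier, so the saving depends on how bursty the workload is.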
Unlike Hadoop, where you need to manage physical infrastructure and clusters, the cloud-based data lake is “serverless”: there are no hosts or clusters to manage and scale. Land your data in the cloud and pay only for what you use; the cloud vendor will scale the infrastructure behind the scenes.
The cloud vendors offer a variety of tools and processing engines for your data that are compatible with their respective distributed file systems like S3. By landing your data there, you can leverage these tools with little or no data movement. If you choose a columnar data format like Parquet, you can achieve respectable raw file query performance and avoid loading data into an analytics data store altogether.
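The sketch below hints at why a columnar layout helps raw file queries. It is a toy model in plain Python, not the Parquet format itself, which also adds row groups, encoding and compression. The records and the helper names are made up for illustration. It counts how many stored values must be read to sum one column under each layout.

```python
# Three hypothetical records, each with three fields.
ROWS = [
    {"day": "2017-03-01", "page": "home", "views": 120},
    {"day": "2017-03-01", "page": "search", "views": 340},
    {"day": "2017-03-02", "page": "home", "views": 95},
]


def to_columnar(rows):
    """Pivot row records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_layout(rows, column):
    """Row layout: every field of every record is read to reach one column."""
    total = 0
    values_read = 0
    for row in rows:
        values_read += len(row)
        total += row[column]
    return total, values_read


def sum_columnar_layout(columns, column):
    """Columnar layout: only the requested column is read."""
    values = columns[column]
    return sum(values), len(values)
```

Here `sum_row_layout(ROWS, "views")` returns `(555, 9)` while `sum_columnar_layout(to_columnar(ROWS), "views")` returns `(555, 3)`: the same answer from a third of the values. On wide tables where a query touches only a few columns, that difference is where much of the performance comes from.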
Where We Go from Here
At the time of this writing, Amazon has a clear lead in delivering the cloud data lake vision. I’m sure the other cloud vendors will catch up and provide schema catalogs and file-based SQL query engines to make their data lakes a reality.