Tips for creating a successful big data lake
Excerpt from "The Enterprise Big Data Lake," by Alex Gorelik. Published by O'Reilly Media, Inc. Copyright © 2019 Alex Gorelik. All rights reserved. Used with permission.
Creating a successful data lake
So what does it take to have a successful data lake? As with any project, aligning it with the company’s business strategy and having executive sponsorship and broad buy-in are a must. In addition, based on discussions with dozens of companies deploying data lakes with varying levels of success, three key prerequisites can be identified:
• The right platform
• The right data
• The right interfaces
The Right Platform
Big data technologies like Hadoop and cloud solutions like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform are the most popular platforms for a data lake. These technologies share several important advantages:
These platforms were designed to scale out—in other words, to scale indefinitely without any significant degradation in performance.
We have always had the capacity to store a lot of data on fairly inexpensive storage, like tapes, WORM disks, and hard drives. But not until big data technologies did we have the ability to both store and process huge volumes of data so inexpensively—usually at one-tenth to one-hundredth the cost of a commercial relational database.
These platforms use file systems or object stores that allow them to store all sorts of files: Hadoop HDFS, MapR FS, AWS’s Simple Storage Service (S3), and so on. Unlike a relational database that requires the data structure to be predefined (schema on write), a file system or an object store does not really care what you write.
Of course, to meaningfully process the data you need to know its schema, but that’s only when you use the data. This approach is called schema on read and it’s one of the important advantages of big data platforms, enabling what’s called “frictionless ingestion.” In other words, data can be loaded with absolutely no processing, unlike in a relational database, where data cannot be loaded until it is converted to the schema and format expected by the database.
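As a minimal sketch of schema on read, assume a hypothetical feed of JSON lines with made-up field names (sku, qty, note); none of these come from the book. Ingestion stores the bytes exactly as received, and a schema is applied only when a consumer reads the file.

```python
import json
import tempfile
from pathlib import Path


def ingest(raw_bytes: bytes, lake_dir: Path, name: str) -> Path:
    """Frictionless ingestion: write the bytes as received, with no parsing."""
    target = lake_dir / name
    target.write_bytes(raw_bytes)
    return target


def read_with_schema(path: Path, schema: dict) -> list:
    """Schema on read: select and cast fields only at the time of use."""
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# Hypothetical raw feed; the two records do not even share the same fields.
raw = b'{"sku": "A1", "qty": "3", "note": "gift"}\n{"sku": "B2", "qty": "1"}\n'

with tempfile.TemporaryDirectory() as tmp:
    stored = ingest(raw, Path(tmp), "sales.jsonl")
    stored_unchanged = stored.read_bytes() == raw
    # The consumer decides which fields matter and how to type them.
    sales = read_with_schema(stored, {"sku": str, "qty": int})
```

A different consumer could read the same file with a different schema (for example, one that includes note) without anything being reloaded.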
Because our requirements and the world we live in are in flux, it is critical to make sure that the data we have can be used to help with our future needs. Today, if data is stored in a relational database, it can be accessed only by that relational database.
Hadoop and other big data platforms, on the other hand, are very modular. The same file can be used by various processing engines and programs. From Hive queries (Hive provides a SQL interface to Hadoop files) to Pig scripts to Spark and custom MapReduce jobs, all sorts of different tools and systems can access and use the same files. Because big data technology is evolving rapidly, this gives people confidence that any future projects will still be able to access the data in the data lake.
The Right Data
Most data collected by enterprises today is thrown away. Some small percentage is aggregated and kept in a data warehouse for a few years, but most detailed operational data, machine-generated data, and old historical data is either aggregated or thrown away altogether. That makes it difficult to do analytics.
For example, if an analyst recognizes the value of some data that was traditionally thrown away, it may take months or even years to accumulate enough history of that data to do meaningful analytics. The promise of the data lake, therefore, is to be able to store as much data as possible for future use.
So, the data lake is sort of like a piggy bank (Figure 1-4)—you often don’t know what you are saving the data for, but you want it in case you need it one day. Moreover, because you don’t know how you will use the data, it doesn’t make sense to convert or treat it prematurely.
You can think of it like traveling with your piggy bank through different countries, adding money in the currency of the country you happen to be in at the time and keeping the contents in their native currencies until you decide what country you want to spend the money in; you can then convert it all to that currency, instead of needlessly converting your funds (and paying conversion fees) every time you cross a border. To summarize, the goal is to save as much data as possible in its native format.
Figure 1-4. A data lake is like a piggy bank, allowing you to keep the data in its native or raw format
Another challenge with getting the right data is data silos. Different departments might hoard their data, both because it is difficult and expensive to provide and because there is often a political and organizational reluctance to share.
In a typical enterprise, if one group needs data from another group, it has to explain what data it needs and then the group that owns the data has to implement ETL jobs that extract and package the required data. This is expensive, difficult, and time-consuming, so teams may push back on data requests as much as possible and then take as long as they can get away with to provide the data. This extra work is often used as an excuse to not share data.
With a data lake, because the lake consumes raw data through frictionless ingestion (basically, it’s ingested as is without any processing), that challenge (and excuse) goes away. A well-governed data lake is also centralized and offers a transparent process to people throughout the organization about how to obtain data, so ownership becomes much less of a barrier.
The Right Interface
Once we have the right platform and we’ve loaded the data, we get to the more difficult aspects of the data lake, where most companies fail: choosing the right interface. To gain wide adoption and reap the benefits of helping business users make data-driven decisions, the solutions companies provide must be self-service, so their users can find, understand, and use the data without needing help from IT. IT will simply not be able to scale to support such a large user community and such a large variety of data.
There are two aspects to enabling self-service: providing data at the right level of expertise for the users, and ensuring the users are able to find the right data.
Providing data at the right level of expertise
To get broad adoption for the data lake, we want everyone from data scientists to business analysts to use it. However, when considering such divergent audiences with different needs and skill levels, we have to be careful to make the right data available to the right user populations.
For example, analysts often don’t have the skills to use raw data. Raw data usually has too much detail, is too granular, and frequently has too many quality issues to be easily used. For instance, if we collect sales data from different countries that use different applications, that data will come in different formats with different fields (e.g., one country may have sales tax whereas another doesn’t) and different units of measure (e.g., lb versus kg, $ versus €).
In order for the analysts to use this data, it has to be harmonized—put into the same schema with the same field names and units of measure—and frequently also aggregated to daily sales per product or per customer. In other words, analysts want “cooked” prepared meals, not raw data.
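The harmonization step might look like the following sketch. The two country feeds, their field names, and the exchange rate are all made up for illustration; a real pipeline would take the rate from a reference source.

```python
from collections import defaultdict

LB_TO_KG = 0.45359237  # definition of the international pound in kilograms

# Hypothetical raw feeds with different field names, units, and currencies.
us_sales = [
    {"date": "2019-03-01", "item": "flour", "weight_lb": 10.0,
     "amount_usd": 8.0, "sales_tax_usd": 0.5},
    {"date": "2019-03-01", "item": "flour", "weight_lb": 5.0,
     "amount_usd": 4.0, "sales_tax_usd": 0.25},
]
de_sales = [
    {"datum": "2019-03-01", "artikel": "flour", "gewicht_kg": 2.0,
     "betrag_eur": 4.0},
]


def harmonize_us(rec):
    """Map the US feed onto the common schema, converting lb to kg."""
    return {"date": rec["date"], "product": rec["item"],
            "weight_kg": rec["weight_lb"] * LB_TO_KG,
            "amount_usd": rec["amount_usd"],
            "sales_tax_usd": rec["sales_tax_usd"]}


def harmonize_de(rec, eur_to_usd):
    """Map the German feed onto the common schema, converting EUR to USD."""
    return {"date": rec["datum"], "product": rec["artikel"],
            "weight_kg": rec["gewicht_kg"],
            "amount_usd": rec["betrag_eur"] * eur_to_usd,
            "sales_tax_usd": None}  # this feed carries no sales tax field


def daily_sales_per_product(records):
    """Aggregate harmonized records to one total per (date, product)."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec["date"], rec["product"])] += rec["amount_usd"]
    return dict(totals)


ASSUMED_EUR_TO_USD = 1.25  # illustrative rate only
harmonized = [harmonize_us(r) for r in us_sales]
harmonized += [harmonize_de(r, ASSUMED_EUR_TO_USD) for r in de_sales]
daily = daily_sales_per_product(harmonized)
```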
Data scientists, on the other hand, are the complete opposite. For them, cooked data often loses the golden nuggets that they are looking for. For example, if they want to see how often two products are bought together, but the only information they can get is daily totals by product, data scientists will be stuck. They are like chefs who need raw ingredients to create their culinary or analytic masterpieces.
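To illustrate why aggregation loses this information, here is a small sketch with hypothetical order records. Counting product pairs needs the individual baskets; the totals by product computed from the same data cannot answer the question.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw order data for a single day.
raw_baskets = [
    {"order": 1, "items": ["bread", "butter"]},
    {"order": 2, "items": ["bread", "butter", "jam"]},
    {"order": 3, "items": ["jam"]},
]


def pair_counts(baskets):
    """Count how often each pair of products appears in the same order."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket["items"])), 2):
            counts[pair] += 1
    return counts


def totals_by_product(baskets):
    """The 'cooked' view: units sold per product, with order detail dropped."""
    totals = Counter()
    for basket in baskets:
        totals.update(basket["items"])
    return totals


pairs = pair_counts(raw_baskets)
totals = totals_by_product(raw_baskets)
```

In this made-up data every product has the same total, yet bread and butter are bought together twice as often as any other pair, which only the raw baskets reveal.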
We’ll see in this book how to satisfy divergent needs by setting up multiple zones, or areas that contain data that meets particular requirements. For example, the raw or landing zone contains the original data ingested into the lake, whereas the production or gold zone contains high-quality, governed data.
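One way to picture zones is the sketch below, which models the lake as an in-memory dictionary and uses a hypothetical quality rule. Real lakes would typically use separate directories or buckets and far richer governance checks; this only shows that the raw zone stays untouched while vetted copies land in the gold zone.

```python
def is_valid(rec):
    """Hypothetical quality rule: a customer ID and a non-negative numeric amount."""
    amount = rec.get("amount")
    return (bool(rec.get("customer_id"))
            and isinstance(amount, (int, float))
            and amount >= 0)


def promote(lake, check):
    """Copy records that pass the check from the raw zone to the gold zone.

    Returns the number of raw records that were not promoted.
    """
    lake["gold"] = [dict(rec) for rec in lake["raw"] if check(rec)]
    return len(lake["raw"]) - len(lake["gold"])


lake = {
    "raw": [  # original data, kept exactly as ingested
        {"customer_id": "C1", "amount": 20.0},
        {"customer_id": "", "amount": 5.0},
        {"customer_id": "C2", "amount": "n/a"},
    ],
    "gold": [],
}

rejected = promote(lake, is_valid)
```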
Getting to the data
Most companies that I have spoken with are settling on the “shopping for data” paradigm, where analysts use an Amazon.com-style interface to find, understand, rate, annotate, and consume data. The advantages of this approach are manifold, including:
A familiar interface
Most people are familiar with online shopping and feel comfortable searching with keywords and using facets, ratings, and comments, so they require no or minimal training.
Faceted search
Search engines are optimized for faceted search. Faceted search is very helpful when the number of possible search results is large and the user is trying to zero in on the right result. For example, if you were to search Amazon for toasters (Figure 1-5), facets would list manufacturers, whether the toaster should accept bagels, how many slices it needs to toast, and so forth.
Similarly, when users are searching for the right data sets, facets can help them specify what attributes they would like in the data set, the type and format of the data set, the system that holds it, the size and freshness of the data set, the department that owns it, what entitlements it has, and any number of other useful characteristics.
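A faceted search over a data catalog could be sketched as follows. The catalog entries and facet names are hypothetical, and a production catalog would delegate this to a search engine rather than filter in memory.

```python
from collections import Counter

# Hypothetical catalog entries describing data sets in the lake.
catalog = [
    {"name": "orders_2019", "format": "parquet", "department": "sales", "system": "s3"},
    {"name": "web_clicks", "format": "json", "department": "marketing", "system": "hdfs"},
    {"name": "returns", "format": "parquet", "department": "sales", "system": "hdfs"},
]


def facet_counts(entries, facet):
    """Count entries per value of one facet, as shown next to each filter."""
    return Counter(entry[facet] for entry in entries)


def filter_by(entries, **selected):
    """Keep only the entries that match every selected facet value."""
    return [entry for entry in entries
            if all(entry.get(key) == value for key, value in selected.items())]


formats = facet_counts(catalog, "format")
matches = filter_by(catalog, format="parquet", system="hdfs")
```

Each additional facet the user selects narrows the result list, and the counts can be recomputed on the narrowed list to show what remains.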
Ranking and sorting
The ability to present and sort data assets, widely supported by search engines, is important for choosing the right asset based on specific criteria.
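As a small illustration, the following sketch ranks hypothetical data assets by user rating and breaks ties by freshness. The asset names, ratings, and the choice of criteria are assumptions made for the example.

```python
from datetime import date

# Hypothetical search results for the keyword "customers".
assets = [
    {"name": "customers_v1", "rating": 4.5, "updated": date(2018, 6, 1)},
    {"name": "customers_tmp", "rating": 2.0, "updated": date(2019, 2, 1)},
    {"name": "customers_v2", "rating": 4.5, "updated": date(2019, 1, 15)},
]


def rank(results):
    """Sort by rating, highest first, then by most recent update."""
    return sorted(results, key=lambda a: (a["rating"], a["updated"]), reverse=True)


ranked_names = [asset["name"] for asset in rank(assets)]
```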
As catalogs get smarter, the ability to find data assets using a semantic understanding of what analysts are looking for will become more important. For example, a salesperson looking for customers may really be looking for prospects, while a technical support person looking for customers may really be looking for existing customers.
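A very simplified sketch of this idea is a lookup keyed by the searcher's role. The role names and mappings below are hypothetical, and an actual catalog would likely rely on richer semantic models than a fixed table.

```python
# Hypothetical mapping from (role, search term) to the intended concept.
ROLE_MEANINGS = {
    ("sales", "customers"): "prospects",
    ("support", "customers"): "existing customers",
}


def interpret(role, term):
    """Return the concept a user in this role most likely means by the term."""
    return ROLE_MEANINGS.get((role, term.lower()), term)


sales_meaning = interpret("sales", "Customers")
support_meaning = interpret("support", "customers")
default_meaning = interpret("finance", "customers")
```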
Figure 1-5. An online shopping interface