Data Lakes vs. Data Warehousing in the Digital Transformation Boom

We are living in a digital era in which the number of channels we use to interact and engage keeps increasing.

Businesses must transform to keep pace with when, how, and where the digital customer wants to engage. They need quick insight from data to make rapid, proactive strategic decisions. But where is all of that data being stored?

Companies have traditionally relied on data warehouses as their data repository, but data lakes have recently emerged as a new approach to data storage, with technology companies among the early adopters.

In addition, as organizations undergo digital transformation, it’s crucial to understand how and where to aggregate the influx of digital data. Data lakes may well be better suited for this digital explosion, but before making the shift, organizations need to evaluate their options.

So what do data lakes and data warehouses entail, what are the advantages and disadvantages of each, and what drives the data lake approach?

The Ins and Outs of Data Lakes

For many organizations, the question is what data should be kept, what should be discarded, and what should be done with it all.

Erring on the side of caution, many businesses would rather hold onto all data until they determine how it can play a role in future business strategy. This is where data lakes come into the picture. To address data storage issues, data lakes provide a place for data to be stored in its original form until a purpose is found for it.

The advantages of data lakes are many, which is why their adoption is rising and why they have become a topic of conversation in the data community. While data lakes can serve as a staging ground for data warehouses, they are themselves massively scalable and provide low-cost storage of data files in any format. This is of particular relevance for the many new digital businesses, which generate new shapes and sizes of data, often hosted in the cloud.

Data scientists have begun using data lakes for discovery and ideation. Key trends to keep in mind around this approach include the explosion in new forms of data, which data lakes are able to handle, along with the drop in hardware cost, an increase in digital information and a push to analyze more data.

With a data lake, streams of information flow in from different repositories but are collected in one location. Because they can accept input from various sources, data lakes preserve both the original data fidelity and the lineage of data transformations. As such, data models emerge with usage over time rather than being imposed up front.
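To make the idea of models emerging with usage concrete, here is a minimal Python sketch, not a definitive implementation. It assumes an in-memory list standing in for lake storage, and the function names (ingest, read_with_schema), source labels and sample records are all hypothetical. Raw payloads are stored untouched alongside lineage metadata, and a structure is applied only when the data is read.

```python
import json
from datetime import datetime, timezone


def ingest(lake, source, raw_payload):
    """Append the raw payload untouched, plus lineage metadata."""
    lake.append({
        "source": source,  # where the data came from (lineage)
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "raw": raw_payload,  # original string, not parsed or cleaned
    })


def read_with_schema(lake, source, parser):
    """Apply a structure only at read time (schema-on-read)."""
    return [parser(entry["raw"]) for entry in lake if entry["source"] == source]


# Hypothetical in-memory stand-in for low-cost file or object storage.
lake = []
ingest(lake, "web_clicks", '{"user": "u1", "page": "/pricing"}')
ingest(lake, "crm_export", "u1,Acme Corp,2016-03-01")


def parse_click(raw):
    # The JSON source gets its model when someone first needs it.
    record = json.loads(raw)
    return {"user": record["user"], "page": record["page"]}


def parse_crm(raw):
    # The CSV source gets a different model, defined independently.
    user, company, signup = raw.split(",")
    return {"user": user, "company": company, "signup": signup}


clicks = read_with_schema(lake, "web_clicks", parse_click)
accounts = read_with_schema(lake, "crm_export", parse_crm)
```

Because the raw strings are never modified, a different parser can be written later for the same entries without re-ingesting anything.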

However, as with any new trend or data storage solution, there are some noted disadvantages. Because a data lake is an unstructured environment, companies are struggling with the hardware and software needed to process the data so that it works across systems, apps and infrastructures.

Many organizations are also short of the talent needed to mine the data and analyze it to generate valuable business insight. Because of this, many have concluded that data lakes are still years away from being a practical reality.

But the primary challenge has been identified as basic data integration across multiple disparate stores of data. While a data lake is certainly worth testing in smaller environments, it may not be possible to get a full 360-degree view of its real advantages and applications with the technologies organizations have in place today.

In these smaller environments, and with the appropriate access management, governance and data consumer skills, data lakes may provide access to unique data elements. For those thinking of throwing in the towel, then, it may not be time yet. Instead, this is a chance to evaluate further and place greater emphasis on research and discovery.

The Ins and Outs of Data Warehouses

As organizations evaluate data storage options, it’s important to also consider the data warehouse, which is the established method of aggregating data from multiple sources for business intelligence and analytics. Data warehouses are used for storing clean data that organizations don’t want to mix with anything else. This is one of the big data processes companies are pursuing in order to pull and work with the data that is most valuable to the business.

Data warehouses provide a means for reporting, duplication and archival, analytics and greater operational responsiveness. Traditionally there has been no standards-based way of moving large amounts of data between systems, resulting in poorly architected solutions that suffer when additional systems come online or additional databases are added to the mix. In addition, compared to data lakes, data warehouses are often used for storing more aggregated versions of the same data in the form of structured reports.
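As a rough illustration of the contrast, the following Python sketch uses the standard library's sqlite3 module as a stand-in for a warehouse. The table, columns and sample rows are hypothetical, and a real warehouse would enforce far richer rules. The structure is declared before any data is loaded, rows that do not fit are rejected at load time, and what analysts consume is an aggregated report.

```python
import sqlite3

# In-memory SQLite database as a hypothetical stand-in for a warehouse.
conn = sqlite3.connect(":memory:")

# The structure is imposed up front, before any data arrives.
conn.execute("""
    CREATE TABLE sales (
        region TEXT NOT NULL,
        amount REAL NOT NULL CHECK (amount >= 0)
    )
""")

# Sample rows are made up; the last one is deliberately incomplete.
rows = [("east", 120.0), ("west", 80.0), ("east", 30.0), ("west", None)]
rejected = []
for row in rows:
    try:
        conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        # Rows that violate the declared constraints never enter the table.
        rejected.append(row)

# The structured report: one aggregated total per region.
report = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

The incomplete row is kept out of the table, which is the kind of clean, pre-modeled storage described above, at the cost of deciding the model before the data is used.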

With the volume of data today, the concern is that data warehouses are not scalable or agile enough to keep up. As enterprises embrace digital at a more rapid pace, data warehouses may not be able to keep up with a company’s growing storage needs and its need to quickly access and analyze the collected data.

Ultimately, whether an organization chooses data lakes or data warehousing depends on its use of the data, its accessibility needs and its organizational structure. Although both options have their place within organizations, the shifting digital landscape is starting to reveal that data lakes are better suited for organizations taking the leap into digital transformation.

(About the author: Sumit Sarkar is chief data evangelist at Progress)
