Strong governance programs separate data lakes from swamps


A recent Forrester report finds that between 60 percent and 73 percent of all enterprise data goes unused for analytics. This statistic highlights one of the biggest challenges facing data scientists and business users hoping to gain insight from their data.

As the volume of data increases, tapping its value and generating accurate reports has become a Herculean effort. Considering the many data initiatives businesses have in place, and the significant investments made, coming up short in data discovery and analytics represents a huge missed opportunity.

Familiar hurdles organizations face when using data for analytics include:

  • Data that can’t be found
  • Data that, once found, makes no sense or isn’t trusted
  • Conflicting definitions of data that make finding the “right” data impossible

For organizations to effectively leverage data to differentiate products and services, improve decision-making and maintain competitive advantage, they need a comprehensive, enterprise-wide data strategy that ensures data becomes a valuable business asset.

Data Lake or Data Dump?

In recent years, data lakes have emerged as a viable solution for storing massive amounts of data cost-effectively. A data lake is a centralized repository that can store an enormous amount of raw data, allowing different users to analyze it and gain actionable insight. However, despite their promise, many lakes are overflowing and organizations are struggling to operationalize this data.

Data lakes have massive scale and tremendous flexibility. They accommodate vast amounts of structured and unstructured data. And getting data into a lake is simple.

These very attributes, however, make it easy to lose track of what’s in the lake. In our rush to aggregate data somewhere, our lakes often serve as data junk drawers: a place where we dump our data for the moment with the best intentions of putting it in its proper context later.

This isn’t surprising. In 2014, Gartner warned that data lakes without the right level of governance would be nothing more than disconnected data pools. A data lake requires a set of processes and policies around how data is collected, defined and secured. Without this kind of framework, it’s impossible to know what data is in the lake, where it came from, who owns it, and its overall value to business users.
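As a rough illustration of the framework described above, the sketch below audits a list of lake assets for missing origin, ownership and description metadata. The `LakeAsset` record, the `REQUIRED_FIELDS` list and the `audit` function are hypothetical names invented for this example, not part of any specific governance product.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


# Hypothetical governance record for one data set in the lake.
@dataclass
class LakeAsset:
    path: str
    source: Optional[str] = None       # where the data came from
    owner: Optional[str] = None        # who is accountable for it
    description: Optional[str] = None  # what the data means


# Assumed policy: every asset must carry these three fields.
REQUIRED_FIELDS = ("source", "owner", "description")


def missing_governance_fields(asset: LakeAsset) -> List[str]:
    """Return the names of required fields that are empty for this asset."""
    return [name for name in REQUIRED_FIELDS if not getattr(asset, name)]


def audit(assets: List[LakeAsset]) -> Dict[str, List[str]]:
    """Map each under-documented asset's path to the fields it is missing."""
    report: Dict[str, List[str]] = {}
    for asset in assets:
        missing = missing_governance_fields(asset)
        if missing:
            report[asset.path] = missing
    return report
```

Run against an inventory of the lake, a report like this would list the data sets that were dumped without context, which is one possible starting point for a cleanup effort.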

Bringing Order to the Chaos

Governance creates transparency across the organization, answering critical questions regarding the data lake, such as:

  • What’s in your data lake – and what should be in your data lake
  • Where your data comes from and where it’s been
  • Who has access to your data
  • Who’s using your data and how

A good data governance framework combined with a data catalog can keep a data lake pristine by cleaning up the disorderly swamp of data. A data catalog offers a single source of intelligence for data experts and other data users who need quick access to their data. Users can tag, document, and annotate data sets in the data catalog, continuously enriching the data and increasing the value of existing data assets while also eliminating data silos.

A data catalog enables users to collaborate to understand the data’s meaning and use, to determine which data is fit for what purpose, and which is unusable, incomplete, or irrelevant. It provides a way for every user to find data, understand what it means, and trust that it’s correct.
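To make the tagging, annotation and search ideas above concrete, here is a minimal in-memory sketch of a catalog. The `DataCatalog` and `CatalogEntry` classes and their methods are assumptions made for illustration; commercial and open source catalogs expose far richer interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    """Hypothetical catalog record describing one data set."""
    name: str
    description: str = ""
    tags: Set[str] = field(default_factory=set)
    notes: List[str] = field(default_factory=list)


class DataCatalog:
    """Toy catalog: register data sets, enrich them, and search them."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, name: str, description: str = "") -> CatalogEntry:
        entry = CatalogEntry(name=name, description=description)
        self._entries[name] = entry
        return entry

    def tag(self, name: str, *tags: str) -> None:
        # Tags are stored lowercase so searches are case-insensitive.
        self._entries[name].tags.update(t.lower() for t in tags)

    def annotate(self, name: str, note: str) -> None:
        # Notes capture what users learn about the data over time.
        self._entries[name].notes.append(note)

    def search(self, term: str) -> List[str]:
        """Return names whose name or description contains the term,
        or that carry a tag exactly equal to it."""
        term = term.lower()
        return sorted(
            entry.name
            for entry in self._entries.values()
            if term in entry.name.lower()
            or term in entry.description.lower()
            or term in entry.tags
        )
```

For example, after registering a data set named `orders_2023` and tagging it `Sales`, a call to `search("sales")` would return that data set even though the word appears in neither its name nor its description.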

Businesses today are either building a brand-new data lake or cleaning up an existing one. Whether you’ve inherited a swamp, or are just starting out and want to keep your data lake pristine, establishing a set of policy-driven processes can help you avoid these four common data lake problems:

Data without context - A data catalog helps users understand the data they find by providing information about that data, including its origin, format and use as well as its relationship to other data.

Data that can’t be found - A data catalog organizes and structures data to help people find the information they need to solve business problems.

Data that can’t be trusted - A data catalog can help data users find the best data for their purposes, understand the quality of that data, and know whether it’s appropriate to join data from disparate sources.

Data that can’t be shared - A data catalog makes it easier for people to work collaboratively with transparency and trust, enriching data sets and driving value across – and beyond – the enterprise.

Without question, big data is big business. What matters, though, is not how much data you have in your lake, but how your organization uses that data.

To realize the potential of data lakes, businesses must take appropriate steps to ensure these lakes don’t turn into swamps. This requires a governance framework that allows organizations to establish control over the data dumped into their lakes.

Governance empowers data users, helping them to find, understand and trust their data to improve decision-making and drive innovation.
