Data Lakes Guide: 10 Requirements for Success

Published September 24, 2015, 6:30am EDT

Data Lakes Guide: 10 Requirements for Success

Data lakes are gaining popularity as a way to store massive amounts of information for big data and analytics applications. But how are data lakes built? Here are 10 requirements for success.

What Is A Data Lake?

Before we explore how data lakes are built, it’s important to understand the purpose of a data lake. Pentaho CTO James Dixon is widely credited with coining the term data lake and describing it in 2010, stating: "If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."

Existing Data Management Challenges

In a white paper, Knowledgent describes four challenges with current enterprise data warehouses:
1. Timeliness: Adding new content to a data warehouse can be time-consuming and cumbersome.
2. Flexibility: Users often lack on-demand access to data. Plus, they often can’t use the tools of their choice to analyze the data.
3. Quality: If it’s unclear where the data originated and how it has been acted on, users may not trust the data.
4. Findability: In many data warehouses, it can be difficult to find the data you need when you need it.

Solution: Data Lakes

In response to existing data warehouse challenges, data lakes must be “designed to support multiple reporting tools in a self-serve capacity, to allow rapid ingestion of new datasets without extensive modeling,” Knowledgent asserts. Moreover, the company says, data lakes should “support advanced analytics, like machine learning and text analytics, and allow users to cleanse and process the data iteratively and to track lineage of data for compliance. Users should be able to easily search and explore structured, unstructured, internal, and external data from multiple sources in one secure place.”

Requirement 1: Multi-tool Support

Make sure your tool sets support multiple technology stacks that natively support structured, semi-structured and unstructured data types, Knowledgent recommends.

Requirement 2: Domain Specification

Data lakes must be designed for the vertical markets they serve. For instance, a data lake customized for biomedical research would be significantly different from one tailored to financial services, Knowledgent notes. Be sure the self-service search capabilities include keyword, faceted and graphical search, the company adds.
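
To make the difference between keyword and faceted search concrete, here is a minimal Python sketch over a tiny in-memory catalog. The catalog entries, field names and function names are all made up for illustration; they do not come from Knowledgent's white paper, and a real data lake would put a search engine behind this rather than a list scan.

```python
from collections import Counter

# Hypothetical catalog entries; every name and value here is illustrative.
CATALOG = [
    {"name": "clinical_trials_2014", "domain": "biomedical", "format": "csv",
     "description": "Patient outcomes from clinical trials"},
    {"name": "gene_expression", "domain": "biomedical", "format": "parquet",
     "description": "Gene expression measurements"},
    {"name": "equity_trades", "domain": "finance", "format": "csv",
     "description": "Daily equity trades and prices"},
]


def search(catalog, keyword="", **facets):
    """Return entries whose name or description contains the keyword
    and whose fields equal every requested facet value."""
    keyword = keyword.lower()
    return [
        entry for entry in catalog
        if keyword in (entry["name"] + " " + entry["description"]).lower()
        and all(entry.get(key) == value for key, value in facets.items())
    ]


def facet_counts(entries, facet):
    """Count how many entries fall under each value of one facet."""
    return Counter(entry[facet] for entry in entries)
```

A keyword query such as search(CATALOG, "trials") matches on text, while search(CATALOG, format="csv") narrows by attribute. facet_counts supplies the per-value totals that a faceted search screen typically shows next to each filter.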

Requirement 3: Automated Meta Data Management

“Attributes like data lineage, data quality, and usage history are vital to usability,” Knowledgent states. “Maintaining this meta data requires a highly-automated meta data extraction, capture, and tracking facility. Without a high-degree of automated and mandatory meta data management, a data lake will rapidly become a data swamp.”
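
As a rough illustration of what automated meta data capture could look like, the Python sketch below builds a record covering the three attributes named above: lineage, quality and usage history. The class, its fields and both functions are hypothetical and are not taken from Knowledgent's white paper. The quality signals are deliberately simple (row count and blank-row count).

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetMetadata:
    """Hypothetical meta data record; all field names are illustrative."""
    name: str
    source: str            # lineage: where the data came from
    checksum: str          # fingerprint of the content at registration time
    row_count: int         # basic quality signal
    empty_rows: int        # basic quality signal
    registered_at: str
    parents: list = field(default_factory=list)     # lineage: upstream datasets
    access_log: list = field(default_factory=list)  # usage history


def extract_metadata(name, source, lines, parents=None):
    """Derive the record from the data itself, so nobody fills it in by hand."""
    rows = list(lines)
    digest = hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()
    return DatasetMetadata(
        name=name,
        source=source,
        checksum=digest,
        row_count=len(rows),
        empty_rows=sum(1 for row in rows if not row.strip()),
        registered_at=datetime.now(timezone.utc).isoformat(),
        parents=list(parents or []),
    )


def record_access(meta, user):
    """Append one (user, timestamp) entry to the usage history."""
    meta.access_log.append((user, datetime.now(timezone.utc).isoformat()))
```

Calling extract_metadata on every load, rather than leaving it optional, is one way to approximate the "mandatory" part of the recommendation.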

Requirement 4: Configurable Ingestion Workflows

Make sure external information can be rapidly added to the data lake. Known as an ingestion workflow mechanism, the data addition process should be easy, secure and trackable, Knowledgent recommends.
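
One way to picture a configurable ingestion workflow is as a list of named steps driven by a config, with a permission check up front and an audit entry per step. The Python sketch below is illustrative only: the step names, config keys and the run_ingestion function are assumptions, not part of any product or of Knowledgent's white paper.

```python
from datetime import datetime, timezone


def validate_not_empty(records):
    """Reject an empty batch before any further processing."""
    if not records:
        raise ValueError("no records to ingest")
    return records


def drop_blank_records(records):
    """Remove records that contain only whitespace."""
    return [record for record in records if record.strip()]


# Hypothetical registry: the config refers to steps by these names.
STEP_REGISTRY = {
    "validate_not_empty": validate_not_empty,
    "drop_blank_records": drop_blank_records,
}


def run_ingestion(config, records, user, audit_log):
    """Run the configured steps in order, logging each one."""
    # Secure: only users listed in the config may ingest from this source.
    if user not in config["allowed_users"]:
        raise PermissionError(f"{user} may not ingest from {config['source']}")
    for step_name in config["steps"]:
        records_in = len(records)
        records = STEP_REGISTRY[step_name](records)
        # Trackable: one audit entry per step, with record counts.
        audit_log.append({
            "source": config["source"],
            "user": user,
            "step": step_name,
            "records_in": records_in,
            "records_out": len(records),
            "at": datetime.now(timezone.utc).isoformat(),
        })
    return records
```

Adding a new source then means writing a new config rather than new code, which is what keeps the process easy; the audit log gives the tracking, and the user check stands in for whatever access control the enterprise already uses.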

Requirement 5: Integrate With Existing Environments

Instead of ripping and replacing your enterprise environment, a new data lake needs to work with existing enterprise data management paradigms, tools and methods, Knowledgent recommends.

Requirement 6: Define Your Service

Work with experts to define the catalog of services your data lake will offer. Be sure to consider data onboarding, data cleansing, data transformation, analytic tool libraries, and other requirements for the system, Knowledgent suggests.

Requirement 7: Figure Out Your Architecture

Work with experts to “architect the environment, select components, define engineering processes and design user interfaces,” Knowledgent says.

Requirement 8: Develop Your Proof of Concept

At this point, you’re ready to work with experts to launch a proof of concept to demonstrate the data lake’s intended capabilities.

Requirement 9: Design and Roll Out An Operating Model

If working with an expert partner, make sure the operating model fits your company’s processes, organizational structure, rules and governance, including such capabilities as chargeback models, consumption tracking and reporting mechanisms, Knowledgent recommends.

Requirement 10: Build Out the Platform

The build-out stages can include design, development and integration, testing, data loading, meta data and catalog population, and rollout.

Thanks and More

Special thanks to Knowledgent for the data lake overview, from which Information Management built the requirements list in this slide show.