5 steps to making your data lake an organizational lifestyle
Over the past decade, the world of big data and Hadoop has grown into a $150B market, and organizations are rapidly adopting related technologies, whether on-premises or in the cloud, to augment and modernize their existing data architectures.
While the growth rate and size of this market are astonishing in themselves, especially considering Apache Hadoop is open source, there is a cloud hanging over the big data lake movement. At the heart of the issue is whether organizations actually get business value from these projects. According to Gartner, through 2018, 90 percent of deployed data lakes will be rendered useless as they’re overwhelmed with information assets captured for uncertain use.
Today, the data lake has become the operating term for a new data store built on big data technologies, typically implemented using Hadoop. In this concept, organizations gather all the data they can collect, regardless of structure or format, and place it in the data store so that it can be used for analytical purposes.
While the realities around the data lake might be different from its initial intent of establishing that single data source, it remains a crucial element to modernizing the data architecture and delivering critical insights for data-driven initiatives. To ensure success, organizations should plan and design a data lake around five key capabilities:
1. Connect the data lake to all other data sources
As we suggested, there is never just the data lake. More often than not, the first data lake is created to augment and live right next to the enterprise data warehouse (EDW). Usually it will contain archived data for analysis from various times or data that did not fit into the original EDW design (e.g., streams of data from smart meters and devices). Therefore, it is important to build the analytics strategy and technology decisions around the notion that any analysis will include data from both the EDW and the data lake.
Many technologies for analytics will connect to the EDW or data lake as siloed data sources, resulting in IT developers being called in to help bring data together via a traditional waterfall analytics approach.
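To make the point concrete, here is a minimal sketch of blending both sources in the analysis layer rather than treating them as silos. An in-memory SQLite database stands in for the EDW, and a list of raw records stands in for smart-meter readings pulled from the lake; all table, column, and field names are illustrative assumptions.

```python
import sqlite3

# Stand-in EDW: an in-memory SQLite database (in practice, the warehouse's
# own connector). Table and column names here are illustrative only.
edw = sqlite3.connect(":memory:")
edw.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
edw.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme Corp"), (2, "Globex")])

# Stand-in data lake extract: raw smart-meter readings that never fit the
# original EDW design (in practice, files read from HDFS or object storage).
lake_readings = [
    {"customer_id": 1, "kwh": 42.0},
    {"customer_id": 1, "kwh": 38.5},
    {"customer_id": 2, "kwh": 12.3},
]

# Join the two sources in the analysis layer: total consumption per customer.
names = dict(edw.execute("SELECT id, name FROM customers"))
usage = {}
for r in lake_readings:
    customer = names[r["customer_id"]]
    usage[customer] = usage.get(customer, 0) + r["kwh"]
```

The same pattern applies regardless of tooling: the analysis pulls from both stores and combines them, instead of IT hand-building a separate pipeline per silo.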
2. Democratize user access and analytics at scale
While on the surface this sounds very easy, the challenge is to achieve it without putting undue stress on already scarce IT resources or creating compliance issues. It is frequently said that 80 percent of analytical effort is spent preparing data. This is certainly true for the data lake, since it is essentially a minimally defined data store holding data in many shapes and formats.
In addition, the data lake is a technology framework, and the associated toolsets are still immature and require deep technical know-how to use. Worse, very few organizations have enough Hadoop programming skills available to support all these initiatives. Investing in ways to empower business users, analysts, and data scientists to easily find, ingest, discover, cleanse, enrich, govern, and collaborate on data is vital to ensuring analytical processes are not hampered by resource scarcity in IT.
3. Data quality matters
Because a data lake can become a repository of all kinds of data from a wide range of sources, organizations often end up with many forms of data standardization issues. According to a recent study by Corinium Digital, nearly 40 percent of organizations feel that they have a data quality problem. For instance, one system could send the value “CA” for the state of California while another could call it “California.” In fact, for many organizations, data quality issues have led to a new term for the data lake: the data swamp.
Users need to be able to quickly resolve these kinds of quality concerns and standardize data rapidly. More importantly, this needs to be done visually and interactively, for the most part by the business users and analysts who have been empowered. While Excel is a great tool for visualizing and sharing data, it is not designed for enterprise data preparation activities such as data quality processes or enrichment.
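The “CA” versus “California” problem above can be sketched as a simple value-standardization rule: map the many spellings source systems emit to one canonical form. The alias table below is a hypothetical example, not a complete mapping.

```python
# Illustrative alias table: many source spellings, one canonical code.
STATE_ALIASES = {
    "ca": "CA", "calif.": "CA", "california": "CA",
    "ny": "NY", "new york": "NY",
}

def standardize_state(raw: str) -> str:
    """Return the canonical two-letter code, or the trimmed input if unknown."""
    return STATE_ALIASES.get(raw.strip().lower(), raw.strip())

# Records arriving from different source systems in different shapes.
records = [{"state": "California"}, {"state": "CA"}, {"state": " ca "}]
cleaned = [standardize_state(r["state"]) for r in records]
```

In a self-service tool this mapping would be built visually by an analyst rather than coded, but the underlying operation is the same.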
4. Build an analytical sandbox capability
Many analytical initiatives require multiple iterations to achieve a meaningful insight. Quite often the need is to curate data from a variety of sources such as the EDW, the data lake, and third-party data. In order to support specific analytical processing, or even data science models, curated data frequently needs to be moved into a separate sandbox for further analysis.
In line with the drive towards self-service, this is a critical part of self-service data preparation. Users need to be able to discover, profile, prepare, and then provision data to the desired location. Often, this could mean data being curated from various sources on-premises and then the resulting dataset could be provisioned in the cloud on Microsoft Azure or Amazon Web Services (AWS) for the data scientist or analyst to use.
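A minimal sketch of this “curate, then provision” flow: combine records from two sources into a result set, then write it to a sandbox location. A local temporary directory stands in for an Azure or AWS bucket, and all field names and the output filename are illustrative assumptions.

```python
import csv
import tempfile
from pathlib import Path

# Stand-in source data: customer attributes from the EDW, activity from the lake.
edw_rows = [{"id": 1, "segment": "enterprise"}, {"id": 2, "segment": "smb"}]
lake_rows = [{"id": 1, "clicks": 17}, {"id": 2, "clicks": 4}]

# Curate: join the two sources into one analysis-ready dataset.
curated = []
for e in edw_rows:
    clicks = next(l["clicks"] for l in lake_rows if l["id"] == e["id"])
    curated.append({"id": e["id"], "segment": e["segment"], "clicks": clicks})

# Provision: write the curated dataset to the sandbox (here, a temp directory
# standing in for cloud object storage such as Azure Blob Storage or S3).
sandbox = Path(tempfile.mkdtemp()) / "campaign_analysis.csv"
with open(sandbox, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "segment", "clicks"])
    writer.writeheader()
    writer.writerows(curated)
```

The data scientist or analyst then works against the provisioned copy, leaving the source systems untouched.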
5. Enterprise scale, security, and governance
All too often, big data projects stall in the prototyping phase largely due to a lack of enterprise security and governance. And while big parts of Hadoop can be properly secured, companies need to consider how they will manage security and governance when expanding to self-service.
Matters are further complicated when users curate and then create new datasets in their own sandboxes. Emerging regulations like the General Data Protection Regulation (GDPR) demand strict treatment and lineage tracking when dealing with customer data. Once again, personal productivity tools like Excel do not provide this level of security or governance, leaving organizations exposed.
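As a rough illustration of what lineage tracking means in practice, here is a hypothetical minimal lineage record capturing what was derived, from which sources, by whom, and when. The field names are assumptions for illustration, not any standard schema; a governed platform would record this automatically.

```python
import datetime

def record_lineage(output_name, source_names, created_by):
    """Capture minimal lineage metadata for a derived dataset (illustrative schema)."""
    return {
        "dataset": output_name,
        "sources": list(source_names),           # upstream datasets it was built from
        "created_by": created_by,                # accountable user
        "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "contains_personal_data": True,          # flag drives retention/erasure handling
    }

# Example: an analyst derives a new sandbox dataset from EDW and lake sources.
entry = record_lineage(
    "churn_features_v1",
    ["edw.customers", "lake.web_events"],
    "analyst_42",
)
```

With records like this, an organization can answer the GDPR-style question of where a given customer’s data ended up, and which derived datasets must be updated or erased.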
More than just storage: Consider data lakes as a lifestyle
Organizations need to stretch beyond these five pillars of data lake success and change their traditional thinking and technology approach; the pillars won’t be of much use unless businesses start approaching the data lake as a “lifestyle.” This lifestyle requires different thinking, different tools, and in some cases, different people. In it, self-service and empowerment of data analysts, data scientists, and power users is a must.
The right tools should support on-demand and ad-hoc business data needs for analytics, operations, and regulatory requirements. Business analysts need to work intuitively within an intelligent and dynamic application to visually explore, transform, and publish contextual, clean data for analysis anywhere with clicks, not code.
Teams need to be empowered to work collaboratively and securely, with the complete and trusted governance of a modern platform. If IT can support the scale of data volumes across the full spectrum of enterprise and cloud data sources, the business can yield productive and successful outcomes for ad-hoc and repeatable data service needs.