Seeking the Power of Predictability In the Data Lake
"Nobody has a crystal ball, and part of evolving a business plan is to say, ‘I might have said we’re going left, but I see the opportunity and we’re going right.’” – Ryan Kavanaugh
Today, forward-thinking businesses are making incremental strides toward deploying big-data-centric architectures. These complex systems are positioned around the commodity computing power found in Hadoop or Hadoop-like distributed big data storage systems, also known as the data lake.
The hope is that analytical experimentation – kind of like dipping an algorithmic toe into the data lake – will function as a prediction engine for the enterprise. And as business leaders stare into the data lake crystal ball, a data-centric structure emerges, and they profit from a clearer vision of the future – one that reveals entire realms of untapped opportunity.
The patterns uncovered by analytics enable data-driven insights that give rise to predictability. By optimizing data lake-driven analytical discoveries, the underlying genetic code of business can be rewritten. For example, predetermined operational and digital processes can be fully integrated and streamlined, eliminating wasted time and misaligned efforts.
Human connections to the business will deepen, moving corporations closer to the customer. Businesses will have the data needed to develop dynamic pricing models and precision-targeted advertising to increase revenue streams and deliver better value to the end customer. And perhaps foremost to the business, key stakeholders and decision makers can act sooner with data-driven intelligence, increasing operational speed and agility.
These are some of the promises of big data – nothing short of a revolution, conceived by the business and attempted through the principles of data science. However, realizing the advantage of big data starts on the shores of the data lake.
“Man marks the earth with ruin, but his control stops with the shore.” – Lord Byron
Two terms matter most to both the nature and the success of a data lake: format and scale. Together they determine how the enterprise handles the enormity and convoluted composition of big data.
The data lake only comes at a big data scale
Big data is really big. In fact, by 2020, an estimated 1.7 megabytes of data will be created every second for every human on the planet. This presents the near-impossible challenge of connecting source technologies and of gathering and managing the multifaceted flow of voluminous data that forms the data lake, all in a highly efficient and time-sensitive manner.
In fashioning the data lake, the enterprise must ensure the web-scale systemic carrying capacity to load enormous amounts of data quickly. At big data scale, the persistent data-centric challenges of variety, volume, and velocity that organizations face don’t go away. If anything, they escalate.
The data lake is a format-agnostic destination – as it should be
The data lake is differentiated from traditional enterprise storage applications like a data warehouse in that the files, messages, events, and raw data are captured and reside in distributed data systems, like Hadoop, while retaining the native, point-of-source information formats. The composition of the data lake includes web data, clickstream data, server logs, social media data, geographic information system (GIS) data including geolocation, GPS streams, massive meteorological data sets, satellite imagery, RFID and machine-generated data, and media such as images, audio, and video, all in their raw states.
On the surface, the coexistence of such differentiated raw data without the structured order of a hierarchical system can be difficult to manage, but the key advantage of the data lake is that nothing is lost on ingestion. The data is preserved in the data lake in its source state.
Labeling the big data jars
The traditional ordering of process-layer actions such as extract, transform, and load (ETL) for integration can go away with a data lake. Instead, structure can be applied to the data through schema-on-read processes: the lake is queried for the right data, which is then gathered, transformed, and provisioned to the destination application, platform, or enterprise data warehouse for analysis.
However, for this to work, there needs to be comprehensive metadata tagging during the initial ingestion. Metadata tagging is like putting a label on a jar of preserves before it goes into a gigantic larder (with billions of other jars and cans). At any time, the contents can be quickly and accurately identified without cracking the seal and possibly spoiling what’s inside.
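To make schema-on-read concrete, here is a minimal sketch in Python. It assumes the lake holds raw JSON lines exactly as they arrived. The sample records, the PURCHASE_SCHEMA mapping, and the read_with_schema function are hypothetical names invented for illustration, not part of any particular product.

```python
import json

# Raw events as they might sit in the lake: untouched, one document per line.
# (Hypothetical sample data for illustration only.)
RAW_LINES = [
    '{"user": "a17", "ts": "2016-03-01T10:15:00", "amount": "19.99", "page": "/cart"}',
    '{"user": "b42", "ts": "2016-03-01T10:16:30", "page": "/home"}',
    'not valid json',
]

# The schema is chosen by the consumer at read time: field name -> converter.
PURCHASE_SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Yield only the records that fit the schema, cast to the requested types."""
    for line in lines:
        try:
            doc = json.loads(line)
            yield {field: cast(doc[field]) for field, cast in schema.items()}
        except (ValueError, KeyError, TypeError):
            # The raw line stays in the lake unchanged; it simply does not
            # match this particular read.
            continue


purchases = list(read_with_schema(RAW_LINES, PURCHASE_SCHEMA))
print(purchases)
```

Nothing is rejected or reshaped when the data is stored. A different consumer could later read the same lines with a different schema, for example one that only cares about the page field.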
“If you are puzzled by what dark energy is, you’re in good company.” – Saul Perlmutter
Comprehensive tagging eliminates dark data from the data lake – information that is unscannable and unqueryable is essentially useless to the business goal of driving organizational predictability.
Synchronous metadata tagging for multilayer enrichment should be performed within the initial ingestion stream. Metadata tagging also extends the capabilities of the data lake to cover critical enterprise concerns: data governance, security, and compliance.
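As a rough illustration of tagging at ingestion, the Python sketch below wraps each raw payload in a metadata envelope without touching the payload itself. The field names (source, classification, and so on), the tag_on_ingest and find_by_tag functions, and the sample payloads are all assumptions made for this example. A real deployment would likely use a dedicated metadata catalog rather than an in-memory list.

```python
import hashlib
from datetime import datetime, timezone


def tag_on_ingest(payload: bytes, source: str, classification: str) -> dict:
    """Attach a metadata label to a raw payload; the payload is stored as received."""
    return {
        "payload": payload,
        "metadata": {
            "source": source,                                   # where the data came from
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "size_bytes": len(payload),
            "sha256": hashlib.sha256(payload).hexdigest(),      # integrity check
            "classification": classification,                   # governance label
        },
    }


def find_by_tag(catalog, key, value):
    """Select entries by reading the label only, never the payload."""
    return [entry for entry in catalog if entry["metadata"].get(key) == value]


# Hypothetical payloads: a partner order file and a web server log line.
catalog = [
    tag_on_ingest(b"order-1001,widget,3", source="partner-files", classification="restricted"),
    tag_on_ingest(b"GET /index.html 200", source="web-logs", classification="internal"),
]

restricted = find_by_tag(catalog, "classification", "restricted")
print([entry["metadata"]["source"] for entry in restricted])
```

The lookup never opens the jar: it answers governance and provenance questions from the label alone, while the checksum lets a later reader confirm that the contents have not changed since ingestion.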
An unfixed and indeterminate informational value
To put this in context, operational B2B data (files, for instance) is not refined or altered on ingestion to the data lake. Because format requirements do not exist, no data is lost. This is an essential component of the big data value chain and a reason for the data lake in the first place. The business will not know beforehand which information embedded in the files will eventually prove valuable. And as metadata tags enrich the ingestion stream at the outset, the data gains provenance.
The enterprise now has data that can be accurately and compliantly provisioned to a variety of analytics applications downstream, even with algorithmic queries and data selection parameters that have yet to be conceived.
Predictive not predictable
That only 1.5 percent of data is ever aggregated and analyzed does not detract from the fact that the volume and variety of this observable portion is staggering. Much like the composition of the universe, where dark matter and dark energy make up an overwhelming majority of stuff, there is still much wonder to behold in what the data lake allows us to perceive and thereby predict.
Ultimately the predictive organization is naturally proactive. Faster insight leads to clearer foresight. Actionable intelligence provides a clear and decisive competitive edge. And in this light, IT leaders are finding ways to leverage big data and harness analytics investments to turn up the speed and accuracy of forecasting with predictive insight made possible by the data lake.
(About the author: John Thielens is chief technology officer at Cleo)