Data readiness is much more than mere cleanliness

In baseball, a starting pitcher goes through an entire routine to get ready for the day’s game. Stretching. Light cardio. Pushups. Long toss. Short toss.

A relief pitcher also warms up his arm and body in the bullpen before his turn at the mound. And each always gets in a few pitches before starting a new inning.

But no one keeps a tally of the practice pitches, because they don’t mean anything to the team. It’s all to ensure these critical cogs in the lineup are ready for the work ahead, the work that means something.

Your business data must similarly be warmed up – scrubbed, consistent and de-duplicated. But it’s also up to the enterprise to prepare a mechanism to ingest and operationalize that data for analytics.

Machine learning and artificial intelligence (AI)-based analytics platforms are replacing traditional business intelligence tools to deliver significantly higher-level business insights, and the data collection fueling these platforms has to be so much more accurate and timely. It also must evoke readiness.

But data readiness too often gets reduced to data cleanliness, when it’s so much more than having clean data to analyze for strategic decision-making. Data readiness is about orchestrating, at scale, the intra- and inter-enterprise data flows that create business value, from systems and applications throughout your enterprise and across the world.

It’s why big data integration, persona-based data consumption and multi-lateral governance are increasingly valuable to today’s powerful analytics capabilities.

What is Data Readiness?

Today, we are dealing with fundamentally distributed architectures and must move data across multi-enterprise boundaries, geographic regions, and intra-enterprise “silos.” It’s critical to understand just how such data sets fit into a business’s analytics plan and then apply the appropriate layers of security, control, governance, and provenance via a scalable data ingestion platform.

But consider all that enterprises must connect and integrate:

  • The geo-dispersed network of applications, file servers, data lakes and clouds.
  • The files, messages, events and database rows.
  • The operational data from external partners.
  • The organizational data stuck in siloed systems, mainframes and legacy applications.
  • The pricing, product, location and positioning data from a network of franchises and chains.
  • The data obtained from an acquired entity.

Data readiness, at the base level, comprises the processes and activities related to the organization’s integration and governance of data from such varying sources, the annotation of that data, and its publication and presentation, such that its value is maintained over time and the data remains available for reuse and preservation.

When you then consider how enterprises must secure, move and govern all the data above, data readiness becomes a multi-dimensional process encompassing previously neglected integration pillars. The enterprises making truly impactful analytics gains, then, are the ones that treat this as a strategy and adopt the right approach.

Static vs. Dynamic Analytics Approaches

In some sense, there are two different approaches to analytics.

In a static approach, analytics is more like data forensics, where you assemble a digital data set and pore over it looking for insights. But the real benefit of analytics stems from another, more dynamic approach, in which organizations assemble a data pipeline that delivers a series of insights in real time as data flows through the network.

Often a dynamic approach starts with a static method, where data scientists or domain experts examine a data snapshot to develop meaningful models. But those models are then operationalized into a dynamic data analytics framework. Attaining and fully committing to such a framework requires both payload and transport preparations that the enterprise hasn’t had the technology (or the wherewithal) to undertake.
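The move from a static snapshot to a dynamic pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "model" is a simple mean-plus-deviation threshold, and the names `fit_threshold` and `flag_stream` are hypothetical, not part of any particular analytics product.

```python
from statistics import mean, stdev
from typing import Iterable, Iterator

def fit_threshold(snapshot: list[float], k: float = 2.0) -> float:
    """Static step: derive a simple model (a cutoff) from a data snapshot."""
    return mean(snapshot) + k * stdev(snapshot)

def flag_stream(events: Iterable[float], threshold: float) -> Iterator[tuple[float, bool]]:
    """Dynamic step: apply the fitted model to each value as it arrives."""
    for value in events:
        yield value, value > threshold

# Hypothetical snapshot examined offline by a domain expert.
snapshot = [10.0, 12.0, 11.0, 9.0, 13.0]
threshold = fit_threshold(snapshot)

# Hypothetical live values flowing through the pipeline.
flags = list(flag_stream(iter([11.0, 30.0, 12.5]), threshold))
```

The generator processes one value at a time, so the same fitted model could in principle sit behind a message queue or event stream rather than an in-memory list.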

So, what no one tells you about data readiness is that it’s an amalgamation of several kinds of business preparedness, including (but not limited to) data set readiness and data pipeline readiness.

Data Set and Data Pipeline Readiness

Data set readiness comprises traditional data preparation ideas: data cleanliness and consistency, de-duplication and the management of unstructured data. (The seemingly simple task of mailing address normalization is a data preparation discipline in its own right.)
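As a rough illustration of why address normalization and de-duplication go together, here is a minimal Python sketch. The abbreviation table, the record fields (`name`, `address`) and the function names are all assumptions made for the example; production-grade address normalization is considerably more involved.

```python
import re

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "suite": "ste"}

def normalize_address(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace, abbreviate."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)

def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each (name, normalized address) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["name"].strip().lower(), normalize_address(rec["address"]))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical customer records: the first two describe the same customer.
customers = [
    {"name": "Acme Corp", "address": "12 Main Street, Suite 4"},
    {"name": "acme corp ", "address": "12 main st. ste 4"},
    {"name": "Bolt Ltd", "address": "9 Oak Avenue"},
]
unique_customers = deduplicate(customers)
```

Without the normalization step, the two Acme records would look different to a naive equality check and both would survive de-duplication.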

In the world of the V’s – variety, volume, velocity, veracity, and even validity and volatility – the biggest challenge here is variety.

Since data sets evolve over time as domain experts look for new insights and correlation with new data sources, some agility in the ability to acquire and integrate new data sets is a part of data set readiness, albeit in the “meta” sort of way where being ready to get more data ready is a prerequisite.

Data pipeline readiness addresses some of the larger big data V’s: volume and velocity. Once you have models to execute, operationalizing them to operate reliably at scale and at business speed brings an entirely new set of challenges. Can your business handle the massive data flows? Can it handle them in an increasingly expeditious way?

This is where some of the infrastructure technology innovation has become a blessing and a curse.

Just a few years ago, Hadoop was the answer, whether in its MapReduce form or its later YARN incarnation. Now, owners of operationalized Hadoop infrastructures exude a kind of buyer’s remorse as storage and analytics tools have moved rapidly to the cloud (primarily Google Cloud, AWS, and Microsoft Azure, in no particular order) and they are scrambling to accommodate inter-product dependencies.

Because if the business end of the pipeline is analysis, then the data pipeline must terminate in the right place at the right time in the right cloud for the right tool. An owner of an Amazon-backed data lake, for example, who decides to use a Google-based analytical tool, would require a cross-border extension to the pipeline to make it all work.

So, as with the static side, there is now a “meta” problem on top of the ability to read and understand the data. Readiness, in some sense, also includes the readiness to change.

Handle Data with Care

Additionally, with the General Data Protection Regulation (GDPR) mandate taking effect, another operational challenge of gathering information is the audit aspect of data collection.

As the pipeline collects and aggregates data, it must maintain some record of its provenance – knowing where that data came from, how it was collected, etc. – which may then have further compliance and licensing implications. The data pipeline is responsible for aggregating and analyzing not only the data, but also the rights and obligations that come with it.
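One way to picture this is a pipeline that attaches a provenance record to every payload and carries the sources and license terms through aggregation. The sketch below is illustrative only: the field names (`source`, `collected_via`, `license_terms`), the sample partners and the `amount` payload are assumptions, not a compliance standard or a GDPR-mandated format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Provenance:
    source: str          # where the data came from
    collected_via: str   # how it was collected
    license_terms: str   # rights and obligations attached to it
    collected_at: str    # UTC timestamp of ingestion

@dataclass
class Record:
    payload: dict
    provenance: Provenance

def ingest(payload: dict, source: str, collected_via: str, license_terms: str) -> Record:
    """Stamp each incoming payload with its provenance at the point of entry."""
    stamp = datetime.now(timezone.utc).isoformat()
    return Record(payload, Provenance(source, collected_via, license_terms, stamp))

def aggregate(records: list[Record]) -> dict:
    """Aggregate the data and the sources and license terms that come with it."""
    return {
        "total": sum(r.payload["amount"] for r in records),
        "sources": sorted({r.provenance.source for r in records}),
        "license_terms": sorted({r.provenance.license_terms for r in records}),
    }

# Hypothetical inputs from two partners with different terms.
records = [
    ingest({"amount": 100}, "partner-a", "sftp-drop", "internal-use-only"),
    ingest({"amount": 250}, "partner-b", "rest-api", "resale-permitted"),
]
summary = aggregate(records)
```

Because the aggregate keeps the union of license terms, a downstream consumer can see that the combined figure inherits the most restrictive obligation among its inputs.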

For those who have ever been through a compliance implementation, the experience highlights one of the biggest complications in the marketplace right now.

Data analytics projects are still relatively new, and there is a lot of exploration, experimentation and tinkering going on. This certainly is necessary in the world of data science, as discovery is an incremental and experimental process.

But the data, like evidence checked out of a police locker, must be handled with care, even in the experimental phase.

Operationalized data pipelines connecting customers, partners, third-party data providers, etc., will need to be constructed with the same disciplines as other mature software systems, including the security controls that prevent the introduction of unauthorized backdoors and side channels.

So, the biggest concerns right now are not just readiness, but “meta” readiness – the ability to gather the overarching data sets about the data – and of course, provenance and compliance.

Readiness, Set, Go

Traditional tools for moving and integrating data, however, were not built with this readiness in mind, nor to interoperate with big data technologies in a dynamic way. They were not engineered with a rich set of connectors to connect any data sources to any storage frameworks supporting any analytics frameworks, and they rarely support critical monitoring, visibility and governance capabilities.

The need for data readiness, then, traditionally has slowed projects operating at “business speed” to a less rapid pace to properly handle the data, delaying any potential gains from business analytics initiatives.

In keeping with the baseball analogy, this step would be the manager or pitching coach’s manual call to the bullpen, speaking with the liaison there and then relaying to a couple of guys the message that it’s time to start throwing. A more optimized scenario would entail bringing on a relief pitcher without ever having to call down to the bullpen; the optimal asset would be ready to go on demand, as the team needed him.

While it can be frustrating to have to slow down to “IT speed,” such processes are driving migration to agile cloud platforms and adoption of other productized toolsets, away from IT scripting and manual tinkering. With necessary “readiness” controls built into the data pipeline and integration tools, the data explorers are free to work at business speed without putting the business at risk.
