Avoiding the perils of 'dirty' data

Dirty data has turned data professionals into data janitors who spend hours cleaning data instead of analyzing it for strategy and business insights. Today’s data scale and velocity are simply beyond human capacity.

It’s clear that dirty data can have a seriously negative impact on business. The numbers speak for themselves.

According to Gartner, Inc., poor data quality is a key cause behind 40 percent of business initiatives that fail to achieve their targets, costing organizations some $14.2 million annually. What’s more, the costs of working with dirty data are estimated to top $600 billion a year for US companies alone, according to BI industry organization TDWI.

Yet beyond the numbers, dirty data translates into lower efficiency in a hyper-competitive global business ecosystem. It undermines confidence in decisions, management, marketing, products and services, both internally and externally. It spells missed opportunity and reputational damage. And we’re not the only ones concerned: a recent Forbes/KPMG International survey found that 84 percent of CEOs were concerned about the quality of data used to make strategic decisions.

Velocity – A Bigger Challenge

Velocity is one of the famous “three Vs” of big data, and the rate at which data pours into organizations today was unthinkable just a few years ago. This extreme velocity means that CDOs and their teams are called upon to ensure faster data processing and, critically, faster access to usable business insights based on this data. Indeed, many organizations now have a true business mandate to assess and act on business data risks in real time.

This is a formidable challenge. By way of example, a large e-commerce company I worked with launched a mobile app that generated massive amounts of event data. Data was collected from hundreds of events – clicks, products added to a cart, products removed from a cart, searches and more. This data was sent to a central data repository and drove data analytics and customer strategy.

A bug in a new version of the app prevented collection of specific event data for certain iOS versions. That fact got lost among the hundreds of other reported events and went undiscovered for weeks. During this period, the business team noticed what looked like a drop in purchases and increased the marketing budget for the specific products affected, a costly decision based on incorrect data.

This wasn’t an isolated incident. A few months ago, a glitch at UK grocer Sainsbury's allowed customers to buy as much ice cream as they wanted for around $3. One customer left the store with 20 tubs! Even the computer giants aren’t immune. Apple recently suffered an iMessage bug that randomly rearranged the order of conversation threads, causing confusion for users.

Here’s a hard truth: you may not know how dirty your data is. And the faster you find out, the better for your business.

Recent advances in machine learning and AI for big data have enabled the creation of new tools. These autonomous analytics solutions detect outliers in time series data and automatically correlate related anomalies. This allows them, in near real time, to flag emerging business issues and to pinpoint the root cause quickly, so that it can be fixed before it has a serious effect on the business.
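
To make the idea concrete, here is a minimal sketch of outlier detection on time series followed by a simple correlation of anomalies across metrics. It is an illustration only, not how any particular product works. The metric names and daily counts are invented, and the detection rule is a trailing-window modified z-score (median and median absolute deviation), chosen here for simplicity in place of a trained machine learning model.

```python
from statistics import median


def find_anomalies(series, window=7, threshold=3.5):
    """Return the indices whose value deviates strongly from the trailing window.

    Uses the modified z-score: 0.6745 * |x - median| / MAD. A cutoff of 3.5
    is a commonly cited rule of thumb; both defaults are assumptions to tune.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        med = median(history)
        mad = median(abs(x - med) for x in history)
        if mad == 0:
            # Perfectly flat history: treat any change as a deviation.
            if series[i] != med:
                anomalies.append(i)
            continue
        score = 0.6745 * abs(series[i] - med) / mad
        if score > threshold:
            anomalies.append(i)
    return anomalies


def correlate(anomalies_by_metric):
    """Group metrics that were flagged at the same time index.

    Only indices where more than one metric was flagged are returned.
    """
    by_index = {}
    for metric, indices in anomalies_by_metric.items():
        for i in indices:
            by_index.setdefault(i, []).append(metric)
    return {i: sorted(names) for i, names in by_index.items() if len(names) > 1}


# Hypothetical daily counts; the last two days mimic a broken event collector.
metrics = {
    "add_to_cart_ios": [980, 1010, 995, 1005, 990, 1000, 1015, 1002, 310, 295],
    "purchases": [200, 205, 198, 202, 199, 201, 204, 200, 120, 115],
    "searches": [5000, 5100, 4950, 5050, 4990, 5020, 5080, 5010, 5040, 4970],
}

flagged = {name: find_anomalies(values) for name, values in metrics.items()}
related = correlate(flagged)
print(related)
```

With this sample data, the cart and purchase metrics are both flagged on the last two days while searches stay within their normal range. Seeing the two drops appear together is the kind of signal that points an analyst toward a data collection fault and away from a genuine fall in demand.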

The time has passed when humans and first-generation BI dashboards could handle the volume and velocity of data generated and consumed by today’s digital enterprises. Today, machine learning algorithms can be used to analyze and clean data automatically.

We have been talking about the problem of dirty data for a long time; it’s time we also talked about the solutions.
