Over $100 billion in fines has been paid in the US for non-compliance since 2007! More than $2.5 billion in 2015 alone was the result of incomplete and inaccurate data used to comply with Anti-Money Laundering (AML) regulations. And it's not just financial institutions; most firms are sitting on a ticking time bomb. How?

Data is becoming untrustworthy if not verified.

A 2013 survey of data management professionals revealed that data quality, accuracy, and reconciliation are significant problems in big data projects. With the increase in data volume, variety, and velocity, Data Quality (DQ) has become more important than ever. As more business processes become automated, data quality becomes the rate-limiting factor for overall process quality. The trustworthiness of Big Data remains questionable at best, and as a result Big Data projects across the industry are failing to deliver their intended returns.

Untrustworthy data is very expensive.

Gartner reports that 40% of data initiatives fail due to poor data quality, which also reduces overall labor productivity by roughly 20%. That is a huge loss that is hard to even put a cost figure on! Forbes and PwC report that poor DQ was a critical factor leading to regulatory non-compliance. Poor quality of Big Data is costing companies not only in fines, manual rework to fix errors, inaccurate data for insights, failed initiatives, and longer turnaround times, but also in lost opportunity. Operationally, most organizations fail to unlock the value of their marketing campaigns due to Data Quality issues.

Why is the current approach to Data Quality (DQ) inadequate?

When data flows at high volume, in different formats, from multiple sources, validating it is a nightmare. Big-data teams fall back on the ad hoc methods of the regular-data world, such as writing one-off validation scripts on the big-data platform, illustrated in the sketch below. They run into three major problems: (1) such scripts are highly susceptible to human error and to errors introduced by system changes, (2) retrofitting regular data validation tools for Big Data requires sampling, so 100% of the data is never checked, and (3) architectural limitations of existing data validation tools make these approaches non-scalable, unsustainable, and unsuitable for Big Data.
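To make the point concrete, here is a minimal sketch of the kind of ad hoc PySpark validation script described above. The table name, column names, and rules are hypothetical, not drawn from any specific project; the point is that the checks are hand-coded and run only against a sample, which is exactly the fragility and coverage gap at issue.

# Hypothetical ad hoc validation script; names and rules are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adhoc_dq_check").enableHiveSupport().getOrCreate()

# Transactions landed in the Hadoop repository (hypothetical table)
df = spark.table("landing.transactions")

# Retrofitted "regular data" approach: validate only a 1% sample
sample = df.sample(fraction=0.01, seed=42)

checks = {
    "null_account_ids": sample.filter(F.col("account_id").isNull()).count(),
    "negative_amounts": sample.filter(F.col("amount") < 0).count(),
    "bad_currency_codes": sample.filter(~F.col("currency").rlike("^[A-Z]{3}$")).count(),
}

# Failures are only printed; there is no lineage and no cross-platform view
for name, violations in checks.items():
    print(f"{name}: {violations} violations found in sample")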

The above problem is exacerbated when multiple big-data platforms are thrown into the mix. For example, transactions may flow into an operational "NoSQL" database (MongoDB, Datastax, etc.) and then into a Hadoop data repository, which may even be on the Cloud. Interactions with a traditional Data Warehouse are guaranteed along the way as well. In such scenarios, script-based solutions do not work efficiently and do not provide an end-to-end perspective on Data Quality.
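A minimal sketch of why this gets unwieldy, assuming a hypothetical MongoDB collection feeding a Hive table on Hadoop (connection string, database, and table names are illustrative): even a simple record-count reconciliation must be rewritten for every hop between platforms, which is why script-based approaches rarely yield an end-to-end picture.

# Hypothetical cross-platform count reconciliation between MongoDB and Hadoop (Hive).
from pymongo import MongoClient
from pyspark.sql import SparkSession

# Source: operational NoSQL store (hypothetical host and collection)
mongo = MongoClient("mongodb://mongo-host:27017")
source_count = mongo["payments"]["transactions"].count_documents({})

# Target: Hadoop repository exposed as a Hive table (hypothetical name)
spark = SparkSession.builder.appName("recon").enableHiveSupport().getOrCreate()
target_count = spark.table("lake.transactions").count()

if source_count != target_count:
    print(f"Reconciliation gap: MongoDB={source_count}, Hadoop={target_count}")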

This translates into Big Data projects spending 50-60% of their time and money detecting and fixing quality issues. Despite significant effort and investment in ensuring the quality of Big Data, it remains questionable at best.

A Solution

Organizations should only consider Big Data validation solutions that are equipped to ingest data at high velocity across multiple platforms (both regular- and big-data platforms), parse a variety of data formats without transformations, and scale with the underlying big-data platform. They must support Cross-Platform Data Profiling, Cross-Platform Data Quality tests, Cross-Platform Reconciliation, and Anomaly Detection (see the sketch below). Seamless integration with existing enterprise infrastructure, such as scheduling systems, is needed to operationalize data quality end to end.
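The article names no specific tool, but the anomaly detection idea can be illustrated with a minimal sketch: flag a day's load whose record count deviates sharply from recent history, so a silently broken feed surfaces automatically instead of through manual inspection. The daily counts and the 3-sigma threshold below are hypothetical.

# Hypothetical anomaly check on daily record counts using a simple z-score.
from statistics import mean, stdev

daily_counts = [1_020_341, 998_774, 1_011_205, 1_005_932, 612_408]  # last load looks suspect
history, latest = daily_counts[:-1], daily_counts[-1]

mu, sigma = mean(history), stdev(history)
z = (latest - mu) / sigma if sigma else 0.0

if abs(z) > 3:
    print(f"Anomaly: today's load of {latest} rows is {z:.1f} sigma away from the recent norm")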

(About the author: Seth Rao is CEO at FirstEigen, a Greater Chicago-based Big Data validation and predictive analytics company)
