How often does the unsexy, critical piece fail you because nobody was paying attention to it? Without basic blocking and tackling, you don’t have a prayer of winning. What is so fundamental in the Big-Data world, and yet so overlooked?
Boston Consulting Group recently identified poor Big Data quality as the “horseshoe nail” that could lose the war. It can erode as much as 25% of the potential value of decisions in marketing, bad-debt reduction, pricing and more. Paying attention to that little thing can literally make you millions.
High-speed, high-volume, complex data in a variety of formats increasingly supports cross-functional operational processes such as marketing, compliance initiatives, analytics initiatives, and customer and product management. That makes Data Quality (DQ) more important than ever in the age of Big Data: a lot of data arriving quickly, in different volumes and formats, is worthless if that data is incorrect.
Cost of incorrect data
Poor Big Data quality results in compliance failures, manual rework to fix errors, inaccurate insights, failed initiatives and lost opportunity. The current focus in most big-data projects is on ingesting, processing and analyzing large volumes of data; data quality issues start surfacing only during the analysis and operations phases.
Our research estimates that an average of 25-30% of any big-data project is spent identifying and fixing data quality issues. In extreme scenarios, where the issues are significant, projects get abandoned entirely. That is a very expensive loss of capability!
Fighting today’s DQ battles with yesterday’s DQ tools
In the “regular-data” world, where data volume and velocity are small and manageable, DQ validation is either automated or manual. But when data flows at high volume and high speed, in different formats, from multiple sources and across multiple platforms, validating it with conventional approaches is a nightmare. Conventional data validation tools and approaches are architecturally limited: they can handle neither the massive scale of Big Data nor its processing-speed requirements.
Big-Data teams in organizations often rely on a number of methods to validate data quality:
- Profiling the source system data prior to the ingestion
- Matching the record count pre and post data ingestion
- Sampling the big-data to detect data quality issues
Because of the architectural limitations of existing tools, teams hand-code scripts (e.g., in Pig or Spark SQL) to perform some of these quality checks, executing them ad hoc during the development cycle. While these methods are somewhat effective at detecting errors, the scripts themselves are susceptible to human error and to breakage when systems change.
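To make the first two methods concrete, here is a minimal Python sketch of the kind of deterministic checks such hand-coded scripts perform: matching record counts pre- and post-ingestion, and profiling a sample for null or empty fields. The function name, thresholds and record layout are illustrative assumptions, not any team's actual script (which would typically be written in Pig or Spark SQL against the cluster).

```python
import random

def validate_ingestion(source_count, ingested_records, sample_size=100, max_null_rate=0.1):
    """Hypothetical post-ingestion checks of the kind teams hand-code.

    Returns a list of human-readable issue descriptions (empty = all checks pass).
    """
    issues = []

    # Check 1: record counts must match pre- and post-ingestion.
    if source_count != len(ingested_records):
        issues.append(
            f"count mismatch: source={source_count}, ingested={len(ingested_records)}"
        )

    # Check 2: profile a random sample for null/empty field values.
    sample = random.sample(ingested_records, min(sample_size, len(ingested_records)))
    null_recs = sum(
        1 for rec in sample if any(v in (None, "") for v in rec.values())
    )
    if sample and null_recs / len(sample) > max_null_rate:
        issues.append(f"null rate {null_recs}/{len(sample)} exceeds threshold")

    return issues
```

Note that both checks are deterministic pass/fail rules; as the next paragraph explains, that is exactly their limitation.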
More importantly, these approaches are not effective during the operational phase, and they are not designed to detect hidden data quality issues such as transaction outliers. A transaction outlier is a transaction that is statistically different from the rest of the transaction set yet passes all deterministic data quality tests. Such scenarios require advanced statistical logic to identify the outlier transactions.
Icing on the cake
The problem is exacerbated when multiple big-data platforms are involved. For example, transactions from source systems may be written both to an operational NoSQL database and to an HDFS-based storage repository for reporting and analytics. In such a scenario, script-based solutions cannot work cohesively to provide an end-to-end view. You are doomed from the beginning!
What you need
In my view, and that of BCG and others, you belong to an exclusive group of wise executives if you recognize the importance of Big Data Quality from the very beginning. Current approaches are not scalable, not sustainable, and definitely not suitable for big-data initiatives. Without a scalable, cross-platform, comprehensive and automated solution for detecting data quality issues, organizations risk losing any return on their big-data initiatives.
(About the author: Seth Rao is CEO of FirstEigen, a Chicago-based Big Data validation and predictive analytics company)