The Forrester Muse
for Information Management Blogs
SEP 7, 2012 4:25pm ET

Big Data Quality: Persistence vs. Disposability


We last spoke about how to reboot our thinking on master data to provide a more flexible and useful structure when working with big data. In the structured data world, having a model to work from provides comfort.

However, there is an element of comfort and control that has to be given up with big data: our definition of data quality and the premise underlying it.

Current thinking: Persistence of cleansed data. For years, data quality efforts have focused on finding and correcting bad data. We used the word “cleansing” to represent the removal of what we didn’t want, exterminating it like an infestation of bugs or rats. Knowing what your data is, what it should look like, and how to transform it into submission defined the data quality handbook. Whole practices were stood up to track data quality issues, establish workflows and teams to clean the data, and produce reports showing what was done. Accomplishment was measured by progress on, and maintenance of, metrics such as the number of duplicates, record completeness, time since last update, and conformance to standards. Those reports may also have been tied to our personal goals. Now comes big data. How do we cleanse and tame that beast?

Reboot: Disposability of data quality transformation. The answer to the above question is, maybe you don’t. The nature of big data doesn’t lend itself to traditional data quality practices. The volume may be too large to process. The data may be too volatile, and arrive too fast, to manage. The variety of data, both in scale and visibility, is ambiguous.

Your data quality efforts need to be defined more as profiling and standards than as cleansing. This is better aligned with how big data is managed and processed. Because big data processing is, on the surface, batch in nature, it would seem obvious to institute data quality rules the way they have always been done. But the answer is to be more service-oriented: invoke data quality rules that improve standardization and sourcing during processing, rather than fundamentally changing the data. In addition, data quality rules are invoked in a customized fashion, based on custom service calls from big data processing.
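To make the service-oriented idea concrete, here is a minimal Python sketch. It is an illustration only, not a method from the original post: the record layout, the `standardize_country` rule, and the alias table are all assumptions invented for the example. The point it tries to show is that a rule is invoked while records stream through processing, it returns a standardized view, and the sourced data itself is never rewritten. A simple profile is then taken over the standardized view.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]
Rule = Callable[[Record], Record]


def standardize_country(record: Record) -> Record:
    """Hypothetical rule: return a copy with a standardized country code.

    The input record is left untouched.
    """
    # Assumed alias table, made up for this example.
    aliases = {"usa": "US", "united states": "US", "u.s.": "US", "uk": "GB"}
    raw = record.get("country", "").strip()
    out = dict(record)
    out["country"] = aliases.get(raw.lower(), raw.upper())
    return out


def apply_rules(records: Iterable[Record], rules: Iterable[Rule]) -> Iterator[Record]:
    """Invoke each rule on each record as it streams past, yielding standardized views."""
    rules = list(rules)
    for record in records:
        view = record
        for rule in rules:
            view = rule(view)
        yield view


def profile(records: Iterable[Record], field: str) -> Counter:
    """Count the values seen in one field, with empty values reported as missing."""
    return Counter(r.get(field, "") or "<missing>" for r in records)


# Made-up sample data standing in for a sourced big data feed.
raw_records = [
    {"id": "1", "country": "usa"},
    {"id": "2", "country": "United States"},
    {"id": "3", "country": "uk"},
    {"id": "4", "country": ""},
]

standardized = list(apply_rules(raw_records, [standardize_country]))
country_profile = profile(standardized, "country")
```

Because each rule returns a copy, the same feed can be processed later with a different set of rules, which is the "disposable" part of the transformation: nothing cleansed has to be stored unless you choose to store it.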

This also makes sense because, when you do decide to persist sourced big data into your internal infrastructure, you have already aligned the data to existing integration policies and business rules, which improves the mapping and cleansing that does need to persist. In essence, you treat big data as a reference source, not a primary source. When have you ever looked to persist your data quality rules on reference data from a third party?
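As a rough sketch of that persistence decision, the Python below checks sourced records against internal standards and selects only the conforming ones for persistence, leaving everything else where it is. The required fields, the email pattern, and the function names are hypothetical, chosen only to illustrate the idea; real policies would come from your own integration and business rules.

```python
import re
from typing import Dict, List, Tuple

Record = Dict[str, str]

# Hypothetical internal standards, invented for this illustration.
REQUIRED_FIELDS = ("customer_id", "email")
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def conforms(record: Record) -> bool:
    """True when the record meets the (assumed) internal standards."""
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return False
    return EMAIL_PATTERN.match(record["email"]) is not None


def select_for_persistence(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split records into those ready to persist and those left at the source."""
    ready, left_at_source = [], []
    for record in records:
        (ready if conforms(record) else left_at_source).append(record)
    return ready, left_at_source


# Made-up sample of sourced records.
sourced = [
    {"customer_id": "c-1", "email": "ana@example.com"},
    {"customer_id": "", "email": "bob@example.com"},
    {"customer_id": "c-3", "email": "not-an-email"},
]

ready, left_at_source = select_for_persistence(sourced)
```

Nothing in the source is corrected or deleted here. Records that do not conform simply stay in the reference source, and only the ones already aligned to internal standards move on to the persisted, governed side.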

So, think about data quality in the context of supporting preprocessing with Hadoop and MapR through profiling and standards, not cleansing.

This blog, by Michele Goetz, originally appeared at Forrester Research.

Comments (3)
Nice article Michele. With the explosion of big data, companies are faced with data challenges in three different areas. First, you know the type of results you want from your data but they are computationally difficult to obtain. Second, you know the questions to ask but struggle with the answers and need to do data mining to help find those answers. And third is the area of data exploration, where you need to reveal the unknowns and look through the data for patterns and hidden relationships. The open source HPCC Systems big data processing platform can help companies with these challenges by deriving insights from massive data sets quickly and simply. Designed by data scientists, it is a complete integrated solution from data ingestion and data processing to data delivery. More info at http://hpccsystems.com
Posted by HAANA M | Tuesday, September 11 2012 at 6:16PM ET
Bravo Michele! One of the best articles I have seen on this topic. Instead of just saying 'Big Data is different' and stopping there, like many of the articles and papers I have read, you actually explain why and offer suggestions on how to adapt governance to it. When we look at the value of an organization's Big Data and how it meshes with existing corporate information, coupled with sophisticated profiling tools, we can make useful decisions on the value and usage of the data, instead of wasting time, energy, and resources on cleaning it. Does anyone go out and clean the leaves off the forest floor? They look just fine there and they serve a useful purpose. At the same time, an artist might walk that same forest and collect only the red leaves for use in a leaf painting. It's all about perception of value and risk. On the other hand, data cleansing/standardization could be key in trend analysis, householding, consumer sentiment, etc., some of the many use cases we've seen with Big Data. Tools must also evolve to meet the scalability requirements that Big Data presents, as well as to incorporate unstructured data, whether it's data profiling, integration, cleansing, metadata management, building a business glossary, or security. That means as vendors, we need to do more than opportunistically just slap 'Big Data Enabled' in front of our tools offerings. It's time to "Walk the Walk"... or run the risk of being lost in the woods, aimlessly picking up leaves.
Posted by Cindy C | Wednesday, September 12 2012 at 2:33PM ET

