There has been a good deal of discussion recently in the information management industry about the dimensions of data quality, what they really mean, and the implications of these definitions in operational settings. However, not much has been discussed about the possibility that we have not identified all the dimensions of data quality, and that new dimensions may emerge over time. I think we are now seeing the recognition of one such new dimension, data credibility, which has incredibly important consequences for social media in particular and big data in general.

What Is Data Credibility?

I first became aware of the concept behind data credibility when reading about Solvency II a couple of years ago. Solvency II is a regulatory standard for insurance companies doing business in Europe, described in a shorthand way as what is needed to get “a pan-European insurance license.” The regulators need reports that prove an insurance company is able to withstand potential losses on the policies it has written. To be credible to the regulators, the reports that are sent to them must be generated using actual data from operational systems. Apparently the regulators had grown distrustful of numbers simply typed into spreadsheets which were then reformatted and sent out as regulatory reports. Thus, insurance companies must not only produce reports using the data they actually use to run their businesses, but they must provide evidence that the reports are really based on this data.
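To make the idea of "evidence" concrete, here is a minimal sketch of one way a report could carry proof of the operational data it was built from. It is an illustration only, not something Solvency II prescribes: the record layout, the field names and the functions fingerprint, build_report and verify_report are all assumptions made for this example.

```python
import hashlib
import json

def fingerprint(records):
    """Return a SHA-256 hex digest over the records, independent of their order."""
    # Serialize each record with sorted keys, then sort the lines, so the same
    # set of records always yields the same digest.
    canonical = sorted(json.dumps(r, sort_keys=True) for r in records)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

def build_report(records):
    """Aggregate hypothetical policy records and attach a source fingerprint."""
    return {
        "total_exposure": sum(r["exposure"] for r in records),
        "record_count": len(records),
        "source_fingerprint": fingerprint(records),
    }

def verify_report(report, records):
    """Rebuild the report from the source records and compare it to the one received."""
    return report == build_report(records)

# Hypothetical operational records, invented for illustration.
policies = [
    {"policy_id": "P-001", "exposure": 250000},
    {"policy_id": "P-002", "exposure": 100000},
]
report = build_report(policies)
```

A reviewer with access to the same operational records could call verify_report(report, policies) and see it fail if a figure had simply been typed over. A real lineage scheme would need far more than this, such as access controls and audit trails, but the principle is the same.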

So how can we define data credibility? Here is my attempt:

Data credibility is the extent to which the good faith of a provider of data or source of data can be relied upon to ensure that the data really represents what it is supposed to represent, and that there is no intent to misrepresent what the data is supposed to represent.

We can immediately see that, unlike other dimensions of data quality, we are getting into the difficult area of ethics about data. This is probably unusual or unpalatable for data management professionals who already have myriad purely technical problems to wrestle with, but I think it is an area that cannot be ignored.

Intent or Neglect?

It might be argued that credibility issues can arise purely from traditional data quality concerns, and thus never get into the realm of misrepresentation. For instance, an insurance company might have really sloppy data management practices and might have assigned junior resources to develop the spreadsheets it used for regulatory submissions. Knowing these facts, the regulators would be quite right to distrust the information they were getting. Confronted with this, the insurance company might respond that there was no deliberate intent to deceive.

This argument will not wash. In all aspects of data management, and especially in reporting to outside entities (be they regulators, customers, partners or whatever), there is a duty of care. Deliberate negligence cannot be used as a foundation for plausible denial. Data management practices are always proactive and based on management decisions. Omitting checks and balances, failing to stand up robust processes, under-resourcing and other decisions made in the name of “efficiency” that generate data which actually or potentially harms entities outside the enterprise are decidedly acts of bad faith, and hence I would argue that the definition of data credibility given above still holds.

Social Media, Big Data and Data Credibility

The insurance business is perhaps not the most sexy of industries, but data credibility is not just a concern for insurance, as the above example might suggest. Social media and big data have data credibility problems on steroids.

How much do you think it costs to get 1 million fake followers on your Twitter account? Apparently the answer is $600. According to a recent USA Today article, there is at least one entrepreneur, based in Indonesia, offering this service:

"Ali Hanafiah, 40, offers 1,000 Twitter followers for $10 and 1 million for $600. He owns his own server, and pays $1 per month per Internet Protocol address, which he uses to generate thousands of social media accounts. Those accounts, he said, 'enable us to create many fake followers.' During an interview at a downtown Jakarta cafe, Hanafiah — wearing a Nike cap, blue jeans and a white T-shirt — said large social networks can boost a business' public profile. 'Today, we are living in a tight competition world that is forcing people to compete with many tricks,' he said."

This kind of business is called click farming, and it is not confined to Indonesia. The USA Today article identifies Dhaka, the capital of Bangladesh, as the world capital for the industry. We are not talking about some kind of automated software application to create fake followers; much of the activity appears to be performed by actual human beings.

Nor is this just a problem for Twitter. Facebook, SoundCloud and other sites have similar issues. Business Insider featured an article on Fetopolis CEO Raaj Kapur Brar, who runs a number of niche fashion magazines. He advertised on Facebook, but the results were not as expected:

"Recently, however, Brar has fallen out of love with Facebook. He discovered — as Business Insider reported recently — that his Facebook fanbase was becoming polluted with thousands of fake likes from bogus accounts. He can no longer tell the difference between his real fans and the fake ones. Many appear fake because the users have so few friends, are based in developing countries, or have generic profile pictures. At one point, he had a budget of more than $600,000 for Facebook ad campaigns, he tells us. Now he believes those ads were a waste of time."

As always, there are two sides to this story, and the article presents Facebook's response. It should be remembered that all media offer exposure to their advertisers but cannot guarantee results. Furthermore, Facebook may also be a victim in this because click farms operate by camouflaging themselves with clicks on pages and ads that are not their intended targets. That said, it is also worth noting that, according to Business Insider, Facebook's terms of service prohibit third-party verification of clicks.
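For readers who want to see what screening for such accounts might look like, here is a rough sketch based on two of the signals mentioned in the Business Insider quote: very few friends and a generic profile picture. Everything in it is an assumption made for illustration, including the Account fields, the thresholds and the function names. It does not describe how Facebook or anyone else actually detects fake accounts, and signals this weak could only ever justify a closer look, not a verdict.

```python
from dataclasses import dataclass

@dataclass
class Account:
    """Hypothetical profile summary; field names are invented for this sketch."""
    name: str
    friend_count: int
    has_generic_picture: bool

def suspicion_score(account, min_friends=10):
    """Count how many weak warning signs an account shows (0, 1 or 2)."""
    score = 0
    if account.friend_count < min_friends:
        score += 1
    if account.has_generic_picture:
        score += 1
    return score

def flag_for_review(accounts, threshold=2):
    """Return the names of accounts whose score reaches the threshold."""
    return [a.name for a in accounts if suspicion_score(a) >= threshold]

# Invented sample data.
fans = [
    Account("fan_a", 340, False),
    Account("fan_b", 3, True),
    Account("fan_c", 5, False),
]
flagged = flag_for_review(fans)
```

Requiring both signals before flagging an account is a deliberate choice here: a genuine new user may well have few friends, so one signal on its own says very little.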

A Data Concern?

Data management professionals may think these issues are not so much data issues as core business issues, and may be inclined to dismiss them. However, today we live in the Golden Age of Data. Social media is all about data. Clicks get transformed into revenue directly or indirectly, via means such as marketing advantage. All of this business activity is mediated through data. I would agree that information fraud can include areas such as accounting fraud that have not traditionally been part of the realm of data management. However, social media, which is a huge driver of big data projects, really is all about data management. What we must do is recognize the issue of data credibility as yet another dimension of data quality and take steps to address it.