Wouldn’t an issue like data quality raise a similar conflict?
When you get into data quality, now you’re looking at different things, like how to deal with streaming data that’s flowing in, and that’s a different kind of data quality than most people in that field have dealt with. In some examples you are trying to match multiple feeds from different sensors, maybe a temperature sensor and a motion sensor. You might expect the temperature sensor to report 10 times a second. For some reason you lose three seconds, and that’s potentially a data quality issue. In the book I talk about temporal alignment and the rate of arrival. It’s a different implementation of data quality, though things like metadata still apply. If you’re thinking about clickstream analytics, which is big data, how do you define a unique visitor to a website? How do you define a session, one that is closed or one that is returned to while open? I found many governance issues in that vein that may not have been considered.
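To make the rate-of-arrival point concrete, here is a minimal sketch of how a dropped stretch in a sensor feed might be detected. It assumes readings arrive as timestamps in seconds and that the expected rate is 10 per second, as in the temperature sensor example. The function name `find_gaps` and the `tolerance` parameter are hypothetical, chosen only for illustration, and are not taken from the book.

```python
from typing import List, Tuple


def find_gaps(
    timestamps: List[float],
    expected_hz: float = 10.0,   # assumed rate: 10 readings per second
    tolerance: float = 1.5,      # hypothetical slack factor before flagging a gap
) -> List[Tuple[float, float, int]]:
    """Return (gap_start, gap_end, estimated_missing_readings) for each gap.

    A gap is any pair of consecutive readings spaced further apart than
    the expected interval times the tolerance factor.
    """
    expected_interval = 1.0 / expected_hz
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        if delta > expected_interval * tolerance:
            # Estimate how many readings should have arrived in between.
            missing = round(delta / expected_interval) - 1
            gaps.append((prev, curr, missing))
    return gaps


# Example feed: steady readings for two seconds, then three seconds lost.
feed = [i / 10 for i in range(0, 20)] + [i / 10 for i in range(50, 60)]
for start, end, missing in find_gaps(feed):
    print(f"gap from {start}s to {end}s, about {missing} readings missing")
```

At 10 readings a second, three lost seconds corresponds to roughly 30 missing readings. Whether a gap of that size counts as a data quality failure is a policy decision, which is where the governance question comes in.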
In your book you seem to use dictionaries and metadata as the connecting point of where these things can be aligned. Is that a kind of overlay or abstraction as opposed to an attempt to conform the data?
Yes, exactly, and if you want to align a customer’s Twitter feeds with their master record, you still have to define what a customer is. You think about whether customers are prospects or active clients just like in any other system.
What are some of the unknowns in big data governance that companies need to manage before they take their experiments out of quarantine and into production?
First, you are right, I haven’t seen a lot of companies ready to integrate their big data governance policies with the rest. There are just so many things that need to be understood first, which is why governance is there in the first place. If I work in credit, can I use your Twitter account to make a loan decision? If I am in collections, can I use Facebook info under the Fair Debt Collection Practices Act? You definitely have to start writing policies by jurisdiction. The state of Maryland and others now have policies that don’t allow employers to use social data to pre-screen candidates. There are concerns that a lot of social media contains protected information like age, race, gender or sexual orientation. You cannot consult social media and later claim you didn’t discriminate with that knowledge.
It reminds me of some of the unintended consequences marketers have experienced using analytics against customer records that backfired after they dug too deeply into a person’s history.
I think that’s a similar challenge for big data because so many regulations are evolving. There’s also reason to worry about reputational backlash if you cross a line with social information that is also deemed personal. I advise clients to be conscious in both the regulatory and reputational areas, but I remind them that big data comes in many types they can take advantage of. That can also be a problem when you start integrating multiple types of data and focus a lot of analytical power in ways that can push the edges of privacy, but again, that’s where governance comes in. It will be interesting to follow how people in privacy and legal departments will have their own take on data governance and risk.