Sunil Soares is a widely known figure and authority in the field of data governance, the author of two books on the topic and the former director of IBM’s Information Governance consulting practice. He’s been a contributor to Information Management and recently spun himself off as founder and managing partner at Information Asset LLC, where he continues to help organizations build data governance practices. His third book, “Big Data Governance: An Emerging Imperative,” is the first comprehensive volume on the new data phenomenon from a governance perspective. He recently spoke about the purpose and timing of this work with Information Management Editorial Director Jim Ericson.

Information Management: Your new book subtitles big data governance as “emergent.” How emergent is it in reality?

Sunil Soares: Candidly, when I talk to CIO and CTO clients about big data governance, they say, “great, but I am focused on governance of small data.” So it’s early but approaching quickly. When you delve a little bit into what they are doing you find they are all dealing with big data even if they don’t think of it that way. I spoke to an insurance company looking at telematics data where the policyholder agrees to place a sensor on the car that monitors their driving.

I’ve seen the Progressive Insurance ads and their Snapshot device that works like that.

Many insurance companies, not just Progressive, are experimenting with that kind of device now. But my point with this example is that it raises tremendous governance implications. There are privacy questions because now the insurance guy knows where the car has been. There is also a potential issue around data quality. When you research some of these sensors you find they can produce defective or duplicate readings that have to be weeded out. And then there is the information management lifecycle perspective, which is how long do I need to keep the sensor data?

That’s a good example of sensor data governance. What about something more human?

Okay, take the example of a retailer that wants to integrate social media data with master data. I spoke to a retailer in the process of doing that and told him it sounded like a great idea, right? He clued me in to some implications, though, like the fact that Facebook, for example, has very specific platform policies, listed right on its site, around what you can and cannot do with its data. For example, if I’ve got your name in my master data management system and you decide to ‘like’ my company on Facebook and have a phone number there, I can’t just take your phone number into my master record. Facebook’s written platform policy says if you decide to “unfriend” my business I have to delete all of your data. If I’m holding a golden record with that information, I am in trouble. So there are very specific big data governance issues. I call them emergent: not far along, but not always a very thoughtful process either.

I would expect most of this experimenting is quarantined from things like master records anyway. Are your clients still mostly experimenting?

Yes, that is the case. And as governance goes, it’s really about the data, not the technology of big data. In my book I created a three-axis chart. The x-axis accounts for the big data types like social, machine, sensor, biometric or big transaction. The y-axis is all the different industries and the z-axis really looks at the governance implications. When you put that in front of organizations and ask, “Given these big data types, are you doing some or all of them today?” the answer is they probably are. They didn’t necessarily think of that as big data but they suddenly see they need governance.

Your book talks about big data governance as part of a broader data governance plan. You mention issues like politics and stakeholders, but how else is it similar or different?

I think the disciplines of traditional data governance apply to big data. You’ve got to think about data quality, metadata, privacy, managing the information lifecycle and people who are stewards. But I think you differ first in the implementation. If you are thinking about the example I gave you of MDM and social media, you’ve got to ask yourself, “Do I need a customer steward who understands the ins and outs of privacy laws and regulations in social media?” Or, instead, “Do I need a dedicated social media steward who can negotiate with legal and privacy on what we can and cannot do?”

That’s a big leap to make in terms of commitment and maybe funding.

Yes it is. Some clients I talked to said they started out having the customer steward deal with all the social media and they very quickly ran out of steam. They couldn’t do the governance and their day job or even gather all the expertise around regulations. So instead, in that example, they picked the people who were social media stewards.

Wouldn’t an issue like data quality raise a similar conflict?

When you get into data quality now you’re looking at different things like how to deal with streaming data that’s flowing in, and that’s a different kind of data quality than most in that field have dealt with. In some examples you are trying to match multiple feeds from different sensors, maybe a temperature sensor and a motion sensor. You might expect the temperature sensor to respond 10 times a second. For some reason you lose three seconds and that’s potentially a data quality issue. In the book I talk about temporal alignment and the rate of arrival. It’s a different implementation of data quality, though things like metadata still apply. If you’re thinking about clickstream analytics, which is big data, how do you define a unique visitor to a website? How do you define a session, one that is closed or one that is returned to while open? I found many governance issues in that vein that may not have been considered.
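
[Editor’s illustration, not from the interview or the book: a minimal Python sketch of the rate-of-arrival check described above. It assumes readings arrive with integer millisecond timestamps and that the sensor is expected to report every 100 ms (10 times a second). The function name `find_gaps` and its parameters are hypothetical and were chosen only for this example.]

```python
from typing import List, Tuple


def find_gaps(
    timestamps_ms: List[int],
    expected_interval_ms: int = 100,  # assumed: 10 readings per second
    tolerance: float = 1.5,           # assumed: allow 50% jitter before flagging
) -> List[Tuple[int, int]]:
    """Return (last_seen, next_seen) pairs where the feed went quiet.

    Exact duplicate timestamps are dropped first, since duplicate
    readings are a separate data quality issue from missing ones.
    """
    ordered = sorted(set(timestamps_ms))
    threshold = tolerance * expected_interval_ms
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > threshold:
            gaps.append((prev, curr))
    return gaps


# A feed that reports normally, repeats one reading, then loses three seconds.
readings = [0, 100, 100, 200, 3200, 3300]
print(find_gaps(readings))
```

[A data steward would still have to decide what tolerance counts as a real gap and what to do with the flagged interval; the sketch only surfaces it.]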

In your book you seem to use dictionaries and metadata as the connecting point of where these things can be aligned. Is that a kind of overlay or abstraction as opposed to an attempt to conform the data?

Yes, exactly, and if you want to align a customer’s Twitter feeds with their master record, you still have to define what a customer is. You think about whether customers are prospects or active clients just like in any other system.

What are some of the unknowns in big data governance companies need to manage before they take their experiments out of quarantine and into production?

First, you are right, I haven’t seen a lot of companies ready to integrate their big data governance policies with the rest. There are just so many things that need to be understood first, which is why governance is there in the first place. If I work in credit, can I use your Twitter account to make a loan decision? If I am in collections, can I use Facebook info under the Fair Debt Collection Practices Act? You definitely have to start writing policies by jurisdiction. The state of Maryland and others now have policies that don’t allow employers to use social data to pre-screen candidates. There are concerns that a lot of social media contains protected information like age, race, gender or sexual orientation. You cannot consult social media and later claim you didn’t discriminate with that knowledge.

It reminds me of some of the unintended consequences marketers have experienced using analytics against customer records that backfired after they dug too deeply into a person’s history.

I think that’s a similar challenge for big data because so many regulations are evolving. There’s also reason to worry about reputational backlash if you cross a line with social information that is also deemed personal. I advise clients to be conscious in both the regulatory and reputational areas but remind them big data has many types to take advantage of. That can also be a problem when you start integrating multiple types of data and focus a lot of analytical power that can push the edges of privacy, but again, that’s where governance comes in. It will be interesting to follow how people in privacy and legal departments will have their own take on data governance and risk.