Data Governance and Big Data
A data governance program is one important pillar in a company’s overall information strategy. A data governance program can govern one or more business and/or IT data-intensive projects.
Using a structured, top-down governance program to guide the data acquisition, integration, processes, policies, standards and operations of these data-intensive initiatives will positively impact the business return of those initiatives. Conversely, if a governance program is not in place, the risk of the overall initiative not achieving its business benefits increases.
Today, projects like master data management, the enterprise data warehouse and business analytics are a few examples of data-intensive initiatives that fit these criteria. Big data projects are also quickly emerging. While much of what we’ve learned about governing structured, in-house data will be applicable to big data environments, big data also causes us to rethink some of our established data governance practices.
What Is Big Data?
Before we can “govern” big data initiatives, we first have to understand more about them. Big data typically refers to the following types of data:
- Machine-generated/sensor data includes call detail records, weblogs, smart meters, manufacturing sensors, equipment logs, GPS signals from cell phones and trading systems data.
- Social data includes customer feedback streams, micro-blogging sites, like Twitter, and social media platforms, like Facebook.
- Traditional enterprise data includes data from a CRM or ERP system, usually transactional sales data or ledger data, that can be used in combination with external big data.
Big data is also described by four parameters. The most obvious of these is volume, but it is not the only parameter that distinguishes big data.
- Volume: Machine-generated data is produced in much larger quantities than traditional data. As of 2012, about 2.5 exabytes of data were being created each day, and that number was doubling every 40 months. For instance, a single jet engine can generate 10TB of data in 30 minutes. Smart meters and heavy industrial equipment, like oil refineries and drilling rigs, generate similar data volumes.
- Velocity: Time-sensitive, streaming data (such as social media data streams), while not as massive as machine-generated data, produces a large influx of opinions and relationships valuable to customer relationship management. Even at 140 characters per tweet, the high velocity (or frequency) of Twitter data ensures large volumes.
- Variety: Traditional data formats tend to be relatively well-described and change slowly. In contrast, nontraditional data formats exhibit a rapid rate of change. As new services are added, new sensors deployed or new marketing campaigns executed, new data types are needed to capture the resultant information.
- Value: The economic value of different data varies significantly. Typically, good information is hidden amongst a larger body of nontraditional data; the challenge for companies is identifying what is valuable and then transforming and extracting that data for analysis.
The McKinsey Global Institute estimates that data volume is growing 40 percent per year and will grow 44 times between 2009 and 2020. Yet one study claims that a significant majority of this data is unmanaged. This poses new challenges for a company’s enterprise data governance program, but the data cannot be ignored, because the business benefits of using big data can be quite substantial.
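As a quick sanity check, the two growth figures quoted above can be compared with a few lines of arithmetic. The sketch below assumes simple annual compounding over the 11 years from 2009 to 2020, which is our reading of the estimate rather than something the source states.

```python
# Figures quoted in the text; annual compounding is an assumption.
ANNUAL_GROWTH = 0.40            # 40 percent per year
START_YEAR, END_YEAR = 2009, 2020
QUOTED_MULTIPLE = 44            # "44 times between 2009 and 2020"

years = END_YEAR - START_YEAR   # 11 compounding periods

# Multiple implied by compounding 40 percent a year for 11 years.
implied_multiple = (1 + ANNUAL_GROWTH) ** years

# Annual rate implied by a 44-fold increase over the same period.
implied_rate = QUOTED_MULTIPLE ** (1 / years) - 1

print(f"{ANNUAL_GROWTH:.0%} a year for {years} years -> {implied_multiple:.1f}x")
print(f"{QUOTED_MULTIPLE}x over {years} years -> {implied_rate:.1%} a year")
```

Under that assumption the two figures are broadly consistent: 40 percent a year compounds to roughly 40 times, and a 44-fold increase corresponds to roughly 41 percent a year.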
Use Cases for Big Data
Because data governance goals should be aligned to business goals, understanding the intended use cases and benefits for the company’s big data initiatives will provide the data governance team with the business insight to establish the appropriate data standards and policies to meet the business goals.
Customer Analysis Use Case: A very popular use case for big data is customer analysis. Customer analysis combines social media data available from blogs, Twitter and other external sources with internal customer and product data to provide insight about what potential and current customers are saying about the company and its products. These insights can then be used to adjust internal business processes or make product changes. For example, feedback about a recently launched product can be used to understand customer reactions or discover possible product defects or features customers don’t like. This information can be used to launch or change a marketing campaign in near real time. Sentiment analysis can also be used in customer support processes to address product defects or errors more quickly. Smart phones and other GPS devices offer advertisers an opportunity to target consumers when they are in close proximity to a store, coffee shop or restaurant. This opens new revenue for service providers and offers many businesses a chance to target new customers.
Sensor Analysis Use Cases: In manufacturing companies, sensors, telemetry and/or barcodes have been implemented in products and/or processes. This form of telemetry identifies usage patterns, failure rates and other opportunities for product improvement that can reduce development time and assembly. Also, barcodes provide information that can track a process or a customer order and can be shared with customers. Health care companies are deploying in-home monitoring to measure vital signs and monitor progress to improve patient health and reduce office visits. In sensor analysis use cases, big data is being generated from devices and processes the company can manage directly versus external social data, which the company cannot control. This provides the governance team more opportunity to govern the processes that create and update big data sources.
Risk Management Use Cases: External data about individuals can be combined with internal fraud detection algorithms to detect a credit risk or fraudulent usage of a customer account. In this scenario, quickly processing the information is a key requirement for governance. Another use case for financial institutions involves bringing real-time external price information about stocks and other financial instruments, in combination with internal buying algorithms, in order to make a purchasing or selling decision.
In general, when big data is distilled and analyzed in combination with traditional enterprise data, companies can develop a more thorough and insightful understanding of their business, which can have a significant impact on the bottom line.
Data Governance Considerations
Managing all this new data certainly will be challenging. It is necessary to decide what big data should be managed and what aspect of that data must be governed in order to achieve the business results in the intended use cases. The data governance team should then drive a collaborative process, with the various IT and business stakeholders, to identify the most critical, strategic, shared big data sources and specific data fields that are the most important to govern. But don’t try to manage it all.
Once the most important data fields and data sources are defined, the data governance program should consider managing the following aspects with policies, standards and accountable parties:
- Selected big data should have clear business owners or data stewards assigned, especially for data that is externally sourced. The business owners should understand and accept their data management responsibilities and be staffed for those roles.
- In many cases, the data will be created outside the company and, therefore, little can be done to control the creation process. Certainly if the creation process can be governed, such as for internal sensor data, you can establish the acceptable data management standards. Once inside the company, how that information is stored, updated and maintained should be governed with policies that prevent misuse and comply with the company’s overall data management policies.
- The data governance program should establish metadata requirements for these new sources. Whether the metadata is combined with other internal metadata or kept in a special big data repository is a decision to be made jointly with IT. Certainly for the critical big data fields, the data definitions, business owners, time-sensitive dimensions and even lineage should be stored and available for all users. Also, classify the big data as public, private, confidential or sensitive – with input from the company’s security officer – to ensure that the proper controls are in place.
- It is critically important to develop a lifecycle management strategy that includes archiving and deletion policies, business rules and IT automation. The company will not be able to store all this incoming data forever.
- Assess the data quality of the big data sources, establish acceptable criteria and monitor adherence. Accept that big data quality will not be the same as that of internal data.
- The architecture and IT infrastructure for big data will be very different. A knowledgeable IT data architect will need to be assigned to define the new landscape and work with the governance team to implement the governance requirements in areas such as metadata, operational reporting and monitoring, matching and best record.
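To make the metadata, ownership, classification and lifecycle points above more concrete, here is a minimal sketch of what a metadata record for one governed field might look like. The class and attribute names (GovernedField, retention_days and so on) are hypothetical, and the four classification levels are simply the ones named above; a real program would use the company’s own scheme and repository.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Classification(Enum):
    # The four levels named in the text; real schemes vary by company.
    PUBLIC = "public"
    PRIVATE = "private"
    CONFIDENTIAL = "confidential"
    SENSITIVE = "sensitive"


@dataclass
class GovernedField:
    """Hypothetical metadata record for one governed big data field."""
    name: str
    definition: str
    source: str
    business_owner: str
    classification: Classification
    retention_days: int                               # archive or delete after this
    lineage: List[str] = field(default_factory=list)  # upstream systems, in order

    def missing_metadata(self) -> List[str]:
        """Return the names of required attributes that are still empty."""
        required = {
            "definition": self.definition,
            "business_owner": self.business_owner,
            "lineage": self.lineage,
        }
        return [key for key, value in required.items() if not value]


# Illustrative record: an externally sourced field with no owner assigned yet.
tweet_text = GovernedField(
    name="tweet_text",
    definition="Raw text of a public post mentioning the brand",
    source="social_feed",
    business_owner="",
    classification=Classification.PUBLIC,
    retention_days=90,
)
print(tweet_text.missing_metadata())
```

In this illustration the record would be flagged as lacking a business owner and lineage, which is the kind of gap a metadata completeness check is meant to surface.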
Big Data and Data Quality
Big data will bring some interesting challenges in data quality and changes to some of the traditional assumptions of governing data quality. First, it is important to go back to the basic definition of data quality as “fit for purpose.” In this case, fit for purpose means evaluating how this data will be used in the intended business use cases and just how “good” the data needs to be in order to provide meaningful value to the business goal.
The assumption that big data has to have the same level of data quality, especially in terms of data accuracy, as the company’s traditional internal data is not valid and is even sometimes unreasonable. In some cases, having more “fuzzy” data about a topic will increase the overall real-world representation of the topic.
Defining the most important dimensions of quality and the acceptable standards of quality falls on the data governance program, in collaboration with the business owners of the big data sources and the users of the big data across the company.
For example, let’s consider the case of using social media data to assess a current marketing campaign. The timeliness dimension will be a very important quality criterion, because the company will want to make changes to its campaign as a result of what it’s seeing on social media. Timely input from various sources will determine trends for taking action. Therefore, the data quality criteria and subsequent metrics should be based on timeliness dimensions, such as:
- How often do we need to get this data into the internal systems (every 15 minutes, hourly, daily), and is that happening?
- How often does it need to change to be useful, and is that happening?
- When is data too “old” to achieve the needed business value? Subsequently, is there an ongoing archiving maintenance program in place to remove the data?
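The first and third questions above lend themselves to simple automated checks. The sketch below is one possible approach, not a prescribed method: the thresholds (15 minutes between loads, seven days of useful life) and the function name timeliness_report are assumptions chosen for illustration, and the business owner of the source would set the real values.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds for one social media source.
MAX_LOAD_INTERVAL = timedelta(minutes=15)   # how often data must arrive
MAX_USEFUL_AGE = timedelta(days=7)          # older records no longer add value


def timeliness_report(load_times, record_times, now):
    """Summarize load frequency and staleness for one data source.

    load_times:   datetimes at which batches reached internal systems
    record_times: creation datetimes of the records currently stored
    """
    loads = sorted(load_times)
    gaps = [later - earlier for earlier, later in zip(loads, loads[1:])]
    late_gaps = [gap for gap in gaps if gap > MAX_LOAD_INTERVAL]
    stale = [t for t in record_times if now - t > MAX_USEFUL_AGE]
    return {
        "loads_on_schedule": not late_gaps,
        "late_load_count": len(late_gaps),
        "time_since_last_load": now - loads[-1] if loads else None,
        "stale_share": len(stale) / len(record_times) if record_times else 0.0,
    }


# Illustrative data: four loads and four stored records.
now = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
loads = [now - timedelta(minutes=m) for m in (50, 35, 20, 0)]
records = [now - timedelta(days=d) for d in (1, 3, 10, 30)]
report = timeliness_report(loads, records, now)
print(report)
```

With these illustrative inputs, one load interval exceeds the threshold and half of the stored records are older than the useful age, so both the ingestion schedule and the archiving process would warrant a closer look.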
In fraud detection and credit analysis scenarios, certain big data fields warrant having higher data accuracy because those fields would be used to establish creditworthiness or potential fraud in a transaction. In this case, the governance program should select the big data sources that provide some guarantee of data accuracy. At a minimum, the confidence level of that data or data accuracy ranges should be measured and communicated to the business users. Another data quality criterion would be data consistency with internal data, because matching big data with internal data would be necessary to complete the transaction.
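One way to express a data accuracy range is to verify a random sample of values by hand and report an interval rather than a single percentage. The sketch below uses the Wilson score interval, which is our choice for illustration and not a method the text prescribes; the sample numbers (188 accurate values out of 200 checked) are invented.

```python
import math


def accuracy_range(correct, sampled, z=1.96):
    """Wilson score interval for the share of accurate values in a sample.

    z=1.96 corresponds to an approximately 95 percent confidence level.
    """
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    p = correct / sampled
    denom = 1 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z * z / (4 * sampled ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Invented example: 188 of 200 sampled values were verified as accurate.
low, high = accuracy_range(correct=188, sampled=200)
print(f"Estimated accuracy between {low:.1%} and {high:.1%}")
```

For this sample the range is roughly 90 to 97 percent, which tells business users more than the bare 94 percent observed in the sample would.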
In all business scenarios, the governance program would establish criteria for how long this data is stored in the internal systems and the corresponding archiving and deletion rules. Keeping old data around unnecessarily will drive up total cost of ownership and also degrade analytic results.
Big Data Governance Metrics
As management expert Peter Drucker said, “You cannot manage what you cannot measure.” This certainly applies to data governance in general and to big data metrics specifically. As this is still an emerging area of governance, more will be learned and shared about metrics and monitoring as more companies embark on big data governance initiatives.
As a starting point, data governance metrics should be established to monitor the effectiveness of the big data sources and solutions in meeting the business need for which they were designed. The types of metric categories that can be considered include:
- Data quality metrics, to measure the dimensions that are critical for the business value.
- Metadata metrics, to ensure the accuracy and completeness of the big data metadata.
- Infrastructure performance metrics, such as transfer rates, processing times and query processing. Also, as part of the IT metrics, monitor the archiving schedule and volumes archived, as we have stressed the importance of not letting big data take over your internal storage systems.
- Matching algorithm success rates. The matching of multiple sources of external information will place a high demand on effective matching and integration algorithms, yet these technologies are not perfect.
- Metrics to track the promised business process efficiencies or other business benefits (e.g., cost, customer satisfaction, campaign effectiveness).
As with traditional governance metrics, first establish reliable baseline numbers. If possible, establish these baselines as part of the evaluation of the adequacy of the external data for the company’s big data solutions. During ongoing operations, the metric gathering and calculation process should be automated, because analyzing large amounts of data manually is neither practical nor timely. Also consider the way the metrics should be displayed, especially for real-time data analysis. Data visualization software may be required to display these metrics using time series charts, heat maps, dials and dashboards.
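For the matching algorithm success rates mentioned in the list above, one common way to quantify success is to compare the algorithm’s output against a hand-verified sample and compute precision and recall. The sketch below assumes matches are represented as pairs of external and internal identifiers; the function name and the sample identifiers are hypothetical.

```python
def match_rates(predicted_pairs, true_pairs):
    """Precision and recall of a matching run against a hand-verified sample.

    precision: share of proposed matches that are correct
    recall:    share of true matches that the algorithm found
    """
    predicted, truth = set(predicted_pairs), set(true_pairs)
    hits = predicted & truth
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(truth) if truth else 0.0
    return precision, recall


# Illustrative pairs of (external_id, internal_customer_id).
verified = {("ext-1", "C100"), ("ext-2", "C200"), ("ext-3", "C300"), ("ext-4", "C400")}
matched = {("ext-1", "C100"), ("ext-2", "C200"), ("ext-3", "C999")}

precision, recall = match_rates(matched, verified)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tracking both numbers over time matters because they fail differently: low precision means external records are being attached to the wrong customers, while low recall means usable external data is being left unmatched.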
Regulatory and Legal Considerations
The need to comply with privacy, security, financial standards and legal requirements does not change with the introduction of big data. In fact, the introduction of new sources of external personal data can increase the potential privacy risk to the company, increase the possibility for a security breach due to malware in these external sources and possibly introduce new e-discovery and evidence obligations.
In addition to privacy risks, the use of information regarding individuals’ health, location and online activity can also raise legal and reputational concerns about profiling and discrimination. Adding to the compliance complexities, big data challenges some of the most fundamental principles of the existing privacy framework, including the definition of “personal identifiable data” and the concept of consent. Privacy laws vary greatly by industry, country and state. Policy makers in the EU and the U.S. are studying this area, and we can expect to see more guidelines and standards in the coming months and years. Therefore, it is extremely important that any attempt by a data governance team to establish compliance governance guidelines and metrics include the chief privacy officer, the head of security, the legal department, IT and maybe even the chief risk officer.
Conversely, there are some big data use cases, specifically those for risk management, that can help companies comply with standards and lower their risk exposure.
The data governance program should facilitate bringing all relevant internal stakeholders into an enterprise-level assessment of whether the business advantages of these big data solutions outweigh the increased regulatory, reputational and legal risks. Also, the governance program should involve the head of security and privacy in the initial assessment of external sources to determine the risk and the laws that may apply to these information sources.
Finally, classify the new data source as you would other data sources. Use the company’s classification system to determine if this data is confidential, non-public, public or sensitive. Guard internal access and controls using existing company classification policies.
In general, big data initiatives offer significant opportunities for companies to develop a more thorough and insightful understanding of their business, which can lead to enhanced productivity, more efficient processes, a stronger competitive position and greater innovation – all of which can have a significant impact on the bottom line. That is why big data projects are growing so quickly.
However, big data initiatives bring some incremental risks and costs. One of the key value-added projects a data governance program can lead is an independent, yet collaborative, assessment of the business benefits the company can hope to achieve from the big data sources versus the additional costs and risks of these new sources.
The data governance team must also be willing to challenge many of the traditional approaches to governance, making it even more important to understand the business goals and usage of this new information. Traditional governance approaches to data quality criteria, lifecycle management and compliance, to name a few examples, need to be re-evaluated in the context of the business goals and of what is “acceptable,” which means living with some risk. Big data will also bring some new roles. Business owners or data stewards should be assigned to critical big data sources. Additionally, data scientists will be new power users who should be staffed and become new members of the data governance teams.
The value and benefits to governing big data solutions are clear – get started now!