Data Profiling and Data Governance: How Good is This Data?
How good is this data? Have you asked yourself this question? When you are consuming data for analysis, reporting, operations and decision-making in general, it’s a natural question. And at a time when most organizations are striving to treat data as an asset, a lack of confidence in the quality of the data will quickly turn that data into a depreciating asset.
There’s no argument that data quality is important. You don’t have to look very far to find stories of data quality gone bad. For instance, a man once received a $44.8 million bill when an invoice number was mistakenly typed into the invoice amount. Uncovering an issue before it shows up on an invoice or report is mission critical to protecting confidence in the data. This is where data profiling and data governance enter. Data profiling is a data analysis technique that, when overlaid with a data governance structure, creates a powerful data quality process.
Data is often profiled as part of a data warehousing project. When dealing with large volumes of data coming in from a variety of sources, in different formats, and delivered by different methods, a data profiling automation solution becomes a necessity for analyzing data efficiently. Data profiling functionality can often be found as part of a larger data quality technology suite. Profiling tools can quickly process and analyze large data sets and automatically produce a baseline profile, replacing the need to run manual queries. That baseline profile may consist of statistics at the data element level, such as top and bottom values, patterns, uniqueness, completeness and data types. These statistics assist in uncovering potential data quality issues, such as:
- Required fields missing data
- Default values used in place of real data
- Data outside an acceptable list of values
- Data not in the expected format
- Duplicate values
- Faulty logic and chronology
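To make the idea of a baseline profile concrete, here is a minimal sketch in Python of the kind of element-level statistics described above. It is an illustration only, not how any particular profiling tool works: the function names (`pattern_of`, `profile_column`) and the sample ZIP code values are hypothetical.

```python
import re
from collections import Counter


def pattern_of(value: str) -> str:
    """Reduce a value to its format: letters become 'A', digits become '9'."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", value))


def profile_column(values):
    """Return a small baseline profile for one data element (column)."""
    total = len(values)
    # Treat None and empty strings as missing data.
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        # Share of rows that actually carry a value.
        "completeness": len(present) / total if total else 0.0,
        # Share of populated rows that hold a distinct value.
        "uniqueness": len(counts) / len(present) if present else 0.0,
        # Most frequent values; a dominant default such as "99999" stands out here.
        "top_values": counts.most_common(3),
        # Most frequent formats; unexpected formats show up as rare patterns.
        "patterns": Counter(pattern_of(str(v)) for v in present).most_common(3),
    }


# Hypothetical sample: a ZIP code column with gaps, a duplicate and a bad format.
zip_codes = ["30301", "30301", "", "3030A", None, "99999"]
profile = profile_column(zip_codes)
print(profile)
```

In this sample, two of the six rows are missing, one value repeats, and one value does not match the five-digit pattern shared by the rest. Each of those findings maps onto one of the issue types listed above, and each is a question for the business rather than an automatic verdict.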
Spotting and fixing data issues in the originating source before the data reaches the production environment, and ultimately the end users of the data, is crucial to maintaining trust in the data. It’s no secret that poor data quality can be a contributing factor to the failure of a data warehouse initiative.
Why You Need Data Governance
There’s a data issue, now what? Someone needs to answer questions such as:
- Should the data even be in use?
- Is this truly an issue?
- What standards should be applied to the data to determine a “quality” piece of data?
- What’s the impact of this issue on the business?
- Who will remediate the data issue?
Who decides? The business does! It’s important that the business is engaged upfront in making decisions about its data. Traditionally, people contact whoever they think owns the data, which could be someone in the business or in IT. Data governance’s mission is to provide a consistent and transparent structure around who is formally responsible and accountable for a domain of data or an application, thereby creating a community of data governance members who carry out the day-to-day data quality work and coordinate data quality projects. Ongoing data quality improvement depends on data governance to find opportunities to improve overall data quality, to protect the status of the data as an asset, and to share recommendations on how to keep moving the data closer to its ideal state.
Data governance doesn’t just entail naming the stewards and owners; it is also necessary to engage and build long-term relationships with the data governance community members. When the data profiling exercise is complete, data governance liaises with the stewards and owners to share the results of the profiling effort. This information about their data is a conversation starter about data quality. Together they determine the business rules that will overlay the data. These business rules are vital to answering the question, “How good is this data?” A business rule is used to assess whether a piece of data is considered valid or invalid. It’s data governance’s job to expose information about data quality to the end users of the data. But the answer is not a simple yes or no. Data that is 75 percent valid may be good enough for one end user, while another requires 100 percent for their purpose. Ultimately, the end users are empowered to decide whether the data is fit for their intended use.
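As a rough illustration of how a business rule turns into an answer to “How good is this data?”, the sketch below scores a column against one rule and compares the score with what different consumers need. Everything in it is assumed for the example: the rule (`is_valid_status`), the list of acceptable values, the sample data and the consumer names and thresholds are all hypothetical.

```python
def percent_valid(values, rule):
    """Return the share of values (0 to 100) that pass a business rule."""
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if rule(v)) / len(values)


def is_valid_status(value):
    """Hypothetical business rule: the status must come from an agreed list."""
    return value in {"ACTIVE", "INACTIVE", "PENDING"}


# Hypothetical sample data: three of the four values pass the rule.
statuses = ["ACTIVE", "INACTIVE", "UNKNOWN", "ACTIVE"]
score = percent_valid(statuses, is_valid_status)

# Hypothetical consumers, each with their own minimum acceptable validity.
thresholds = {"trend_report": 75.0, "regulatory_filing": 100.0}
fit_for_use = {name: score >= minimum for name, minimum in thresholds.items()}

print(score)        # 75.0
print(fit_for_use)  # {'trend_report': True, 'regulatory_filing': False}
```

The same score of 75 percent is acceptable for one consumer and not for the other. The published score informs the decision; the end user still makes it.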