What do you include in a data quality issue log?
Whenever I am helping clients implement a data governance framework, a data quality issue resolution process is top of my list of the processes to implement. After all, if you are implementing data governance because you want to improve the quality of your data, it makes sense to have a central process to enable people to flag known issues, and to have a consistent approach for investigating and resolving them.
At the heart of such a process is the log you keep of the issues. The log is what the data governance team will be using while they help investigate and resolve data quality issues, as well as for monitoring and reporting on progress. So, it is no surprise that I am often asked what should be included in this log.
For each client, I design a data quality issue resolution process that meets their needs while being as simple as possible (why create an overly complex process which only adds bureaucracy?). Then, I create a data quality issue log to support that process. Each log I design is, therefore, unique to that client. That said, there are some column headings that I typically include on all logs.
Let’s have a look at each of these and consider why you might want to include them in your data quality issue log:
ID
Typically, I just use sequential numbers for an identifier (001, 002, 003, etc.). This has the advantage of being simple and of giving you an instant answer to how many issues have been identified since you introduced the process (a question that your senior stakeholders will ask you sooner or later).
If you are creating your log on an Excel spreadsheet, then it is up to you to decide how you record ID numbers or letters. If, however, you are recording your issues on an existing system (e.g. an operational risk system or helpdesk system), you will need to follow its existing protocols.
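If the log lives somewhere you can script against, the sequential numbering described above can be sketched in a few lines of Python. This is only an illustration: the function name `next_issue_id` and the three-digit width are assumptions, and an existing risk or helpdesk system will have its own numbering rules.

```python
def next_issue_id(existing_ids, width=3):
    """Return the next sequential ID as a zero-padded string.

    `existing_ids` is any iterable of IDs already in the log, e.g. ["001", "002"].
    The name and the default width are illustrative assumptions.
    """
    highest = max((int(issue_id) for issue_id in existing_ids), default=0)
    return str(highest + 1).zfill(width)


log_ids = ["001", "002", "003"]
new_id = next_issue_id(log_ids)   # next ID after the three already logged
total_raised = int(new_id) - 1    # how many issues have been raised so far
```

Because the IDs are sequential, the highest ID doubles as the running count of issues raised since the process began.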
DATE RAISED
This is important for tracking how long an issue has been open and for monitoring average resolution times. Just one small reminder: be sure to decide on and stick to a standard date format – it doesn’t look good for dates to have inconsistent formats in your data quality issue log!
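As a rough sketch of why a single date format pays off, the example below assumes dates are stored as ISO 8601 strings (YYYY-MM-DD), which is one reasonable choice rather than the only one. The function names `days_open` and `average_resolution_days` are made up for illustration.

```python
from datetime import date


def days_open(date_raised, today):
    """Number of days an issue has been open as of `today`."""
    return (today - date_raised).days


def average_resolution_days(issues):
    """Average days from raised to closed, ignoring issues not yet closed.

    `issues` is a list of (date_raised, date_closed) pairs; date_closed is
    None while an issue is still open. Returns None if nothing is closed yet.
    """
    durations = [
        (closed - raised).days for raised, closed in issues if closed is not None
    ]
    return sum(durations) / len(durations) if durations else None


# With one agreed format, every date in the log parses the same way.
raised = date.fromisoformat("2024-01-15")
age = days_open(raised, today=date.fromisoformat("2024-02-14"))
```

If dates arrive in mixed formats, the parsing step fails or, worse, silently swaps day and month, which is a practical reason to agree the format up front.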
RAISED BY (NAME AND DEPARTMENT)
This is a good way to start to identify your key data consumers (it is usually the people using the data who notify you when there are issues with it) for each data set. This is something you should also log in your data glossary for future reference (if you have one). More importantly, you need to know who to report progress to and agree on remedial action plans with.
SHORT NAME OF ISSUE
This is not essential and some of my clients prefer not to have it, but I do like to include this one. It makes referring to the data quality issue easy and understandable.
If you are presenting a report to your data governance committee or chasing data owners for a progress update, everyone will know what you mean if you refer to the “duplicate customer issue”. They may not remember what “data quality issue 067” is about, and “system x has an issue whereby duplicate customers are created if a field on a record is changed after the initial creation date of a record” is a bit wordy (this is the detail that can be supplied when it is needed).
DETAILED DESCRIPTION
As I mentioned above, the detailed description is too wordy to use as the label for an issue, but it is still needed. This is the full detail of the issue as supplied by the person who raised it, and it drives the investigation and remedial activities.
IMPACT
Again, this is supplied by the person who identified the issue. This field is useful in prioritizing your efforts when investigating and resolving issues. It is unlikely that your team will have unlimited resources and be able to action every single issue as soon as you are aware of it. Therefore, you need a way to prioritize which issues you investigate first. Understanding the impact of an issue means that you can focus on resolving those issues that have the biggest impact on your organization.
I like to have defined classifications for this field. Something simple like High, Medium and Low is fine; just make sure that you define what these mean in business terms.
I was once told about a ‘High’ impact issue and spent a fair amount of time on it before I discovered that in fact just a handful of records had the wrong geocode. The small percentage of incorrect records made it seem more likely that human error was to blame, rather than there being some major systemic issue that needed to be fixed! These incorrect codes were indeed causing a problem for the team who reported them: they had to stop time-critical month-end processes to fix them. But the impact category they chose had more to do with their level of frustration at the time they reported it than with the true impact of the issue.
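One way to picture how a defined impact classification feeds prioritization is the short Python sketch below. The `Impact` class, the `prioritise` function and the sample issues are all hypothetical; the business definitions of High, Medium and Low would need to come from your own organization.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Agreed impact classifications; a higher number means a bigger impact."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def prioritise(issues):
    """Order issues by impact (highest first), then by date raised (oldest first).

    Dates are assumed to be ISO 8601 strings, which sort correctly as text.
    """
    return sorted(issues, key=lambda issue: (-issue["impact"], issue["date_raised"]))


# Hypothetical log entries, for illustration only.
issues = [
    {"id": "001", "impact": Impact.LOW, "date_raised": "2024-03-01"},
    {"id": "002", "impact": Impact.HIGH, "date_raised": "2024-03-05"},
    {"id": "003", "impact": Impact.MEDIUM, "date_raised": "2024-03-02"},
    {"id": "004", "impact": Impact.HIGH, "date_raised": "2024-03-03"},
]
work_order = [issue["id"] for issue in prioritise(issues)]
```

A fixed set of values also stops free-text entries such as "urgent!!" from creeping into the column, which makes it easier to challenge a classification that reflects frustration rather than business impact.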
DATA OWNER
With all things (not just data), I find that activities don’t tend to happen unless it is very clear who is responsible for doing them. One of the first things I do after being notified of a data quality issue is to find out who the Data Owner for the affected data is and agree with them that they are responsible for investigating and fixing the issue (with support from the data governance team of course).
STATUS
Status is another good field to use when monitoring and reporting on data quality issues. You may want to consider using more than just the obvious “open” and “closed” statuses.
From time to time, you will come across issues that you either cannot fix, or that would be too costly to fix. In these situations, a business decision has to be made to accept the situation. You do not want to lose sight of these, but neither do you want to skew your numbers of ‘open’ issues by leaving them open indefinitely. I like to use ‘accepted’ as a status for these and have a regular review to see if solutions are possible at a later date. For example, the replacement of an old system can provide the answer to some outstanding issues.
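The three statuses discussed here can be sketched as a small Python enumeration, with a summary that reports ‘accepted’ issues separately so they do not inflate the ‘open’ count. The names `Status` and `status_summary` are assumptions for illustration, and your own process may need additional statuses.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    OPEN = "open"
    ACCEPTED = "accepted"  # business decision to live with the issue for now
    CLOSED = "closed"


def status_summary(statuses):
    """Count issues per status, including statuses with no issues."""
    counts = Counter(statuses)
    return {status.value: counts.get(status, 0) for status in Status}


# Hypothetical log contents, for illustration only.
log_statuses = [Status.OPEN, Status.CLOSED, Status.ACCEPTED, Status.OPEN, Status.CLOSED]
summary = status_summary(log_statuses)
```

Filtering the log on the accepted status then gives you the list to take to the regular review.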
UPDATE
This is where you keep notes on progress to date and details of the next steps to be taken (and by whom).
TARGET RESOLUTION DATE
Finally, I like to keep a note of when we expect (and/or wish) the issue to be fixed by. This is a useful field for reporting and monitoring purposes. It also means that you don’t waste effort chasing for updates when issues won’t be fixed until a project delivers next year.
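Pulling the column headings together, one row of the log could be modelled roughly as the Python record below. This is a sketch under stated assumptions rather than a definitive design: the class name, field names, the `is_overdue` helper and the sample values (including the person and department) are all hypothetical, and your own log may need more or fewer fields.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DataQualityIssue:
    issue_id: str
    date_raised: date
    raised_by: str
    department: str
    short_name: str
    detailed_description: str
    impact: str            # e.g. "High", "Medium" or "Low"
    data_owner: str
    status: str = "open"   # e.g. "open", "accepted" or "closed"
    update: str = ""       # progress notes and next steps
    target_resolution_date: Optional[date] = None

    def is_overdue(self, today):
        """True only for open issues whose target date has already passed."""
        return (
            self.status == "open"
            and self.target_resolution_date is not None
            and today > self.target_resolution_date
        )


# Hypothetical entry, for illustration only.
issue = DataQualityIssue(
    issue_id="067",
    date_raised=date(2024, 3, 4),
    raised_by="A. Example",
    department="Finance",
    short_name="Duplicate customer issue",
    detailed_description="Duplicate customers are created if a field on a "
                         "record is changed after the record is first created.",
    impact="Medium",
    data_owner="Head of Customer Operations",
    target_resolution_date=date(2024, 6, 30),
)
needs_chasing = issue.is_overdue(today=date(2024, 7, 1))
```

Checking the target date before chasing means an issue that is waiting on next year's project is left alone until its date has actually passed.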
I hope this has given you a useful insight into the items you might want to include in your Data Quality Issue Log. You can download a template with these fields for free by clicking here.
Running and managing a data quality log using Excel and email is an easy place to start, but it can get time-consuming once volumes increase – especially when it comes to chasing those responsible! That’s why I was delighted to be involved recently with helping Atticus Associates create their latest product in this space, DQLog.
The Atticus team are launching their beta version in spring this year and they are keen to hear from anyone interested in trying it and giving feedback. If you are interested in testing the beta, please email me and I can put you in touch.
(This post originally appeared on Nicola Askham's blog, which can be viewed here).