Data Quality for Operational BI
Operational business intelligence shares many characteristics with traditional BI, but it also differs in many ways, the most dramatic of which is the timeliness of the data acquisition and integration process. Traditional BI can often rely on overnight or intraday batch processing for collecting and processing the data. To meet operational BI needs, the update cycles typically require more frequent processing of the data and do not allow for a batch processing cycle. This has several implications for ensuring data quality, two of which involve governance/data stewardship and source data quality.
Governance and Data Stewardship
Best practices for a BI project dictate effective governance structures as well as a robust data stewardship program. While this may be the best practice, many companies have BI programs that deliver value yet do not have adequate governance or stewardship. (I don't condone that approach.) To understand how this can happen, we need to examine the impact of governance and stewardship on both the project and the result. The first of these two impacts applies equally to both traditional and operational BI initiatives. The second is more problematic for operational BI projects.
- Absence of effective leadership impacts the project by lengthening the time required to reach an understanding of the data definitions, business rules and quality expectations. While this is painful to the project team, once the agreement is reached, appropriate logic can be developed to correctly bring the data into the data warehouse.
- Even without effective governance and stewardship, once the business rules for migrating data are established, (batch) extract, transform and load processes can be developed to address data quality deficiencies. This is not always the case for operational BI. If the data needs to be loaded on a near real-time basis, error correction logic often cannot be incorporated into the data movement code. There simply isn't time to do the error correction, and often the data required to perform the correction (e.g., reference data) is not available at the same time that the transactions are being processed. To alleviate this problem, the source systems and business processes must be adjusted to prevent the errors from occurring within the data. Such changes are well beyond the scope and authority of the data warehouse team. Strong leadership (i.e., governance and stewardship) is required to determine, implement and enforce whatever changes are needed. Without strong support, the data sources will not be adjusted and the data quality deficiencies will propagate into the operational BI environment.
Source Data Quality
As previously explained, errors in the source data must be addressed at the source for operational BI to succeed. But how do we know the condition of the data?
The condition of the source data is analyzed using data profiling (a.k.a. source data analysis). Data profiling provides a systematic way of examining the source data to identify quality deficiencies, which would either impede the data acquisition and aggregation processes or generate erroneous or misleading BI results. Both strategic and operational BI development methodologies include data profiling. The difference lies in the options that can be pursued.
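To make the idea of data profiling concrete, here is a minimal sketch in Python. It is not a profiling tool, only an illustration of the kind of summary such a tool produces for a single column. The function name profile_column, the sample orders records and the allowed set of state codes are all hypothetical.

```python
from collections import Counter

def profile_column(rows, column, allowed=None):
    """Summarize one column: row count, missing values, distinct values,
    and values that fall outside an optional set of allowed values."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    invalid = sorted(v for v in counts if allowed is not None and v not in allowed)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "invalid_values": invalid,
    }

# Hypothetical sample of source records (assumption, for illustration only)
orders = [
    {"order_id": 1, "state": "NY"},
    {"order_id": 2, "state": "ny"},
    {"order_id": 3, "state": ""},
    {"order_id": 4, "state": "CA"},
    {"order_id": 5, "state": None},
]

report = profile_column(orders, "state", allowed={"NY", "CA"})
print(report)
```

In this sample, the report flags two missing values and one value ("ny") that does not match the allowed codes. Each finding then becomes a question for the data stewards: is it an error, and where should it be fixed?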
With traditional BI, errors found in the data can be corrected as part of the ETL process. This is possible because of the nature of the ETL jobs (batch) and their frequency (often daily). For any error detected during data profiling, the project team can opt to correct the data within the ETL process.
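A batch correction step of this kind might look like the following sketch. It assumes the reference data (here a hypothetical STATE_LOOKUP table) is available when the batch runs, which is exactly the assumption that often fails in a near real-time load. The names cleanse_batch and STATE_LOOKUP and the sample records are illustrative only.

```python
# Hypothetical reference data available to the nightly batch (assumption)
STATE_LOOKUP = {"ny": "NY", "new york": "NY", "ca": "CA", "calif.": "CA"}

def cleanse_batch(rows):
    """Standardize the state code on each row using the reference data.
    Returns (clean_rows, rejected_rows); rejected rows need manual review."""
    clean, rejected = [], []
    for row in rows:
        key = (row.get("state") or "").strip().lower()
        if key in STATE_LOOKUP:
            # Copy the row so the original extract is left untouched
            clean.append({**row, "state": STATE_LOOKUP[key]})
        else:
            rejected.append(row)
    return clean, rejected

batch = [
    {"order_id": 1, "state": "New York"},
    {"order_id": 2, "state": " ca "},
    {"order_id": 3, "state": "??"},
]

clean, rejected = cleanse_batch(batch)
print(clean)
print(rejected)
```

The first two rows are corrected and loaded; the third is set aside. A batch window leaves time for this kind of lookup and for someone to review the rejects before the next run.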
With operational BI, there may not be an ETL process. Depending on the desired data latency, the amount of data cleansing logic that can be included within the data capture and integration process is limited. For these applications, at least some of the errors detected during data profiling need to be addressed within the source system environment, and the source system may need to be enhanced to prevent erroneous data from being stored. This requires the data profiling process to include thorough root-cause analysis.
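As an illustration of preventing erroneous data from being stored, the sketch below validates a record at the point of capture and refuses to store it if any rule is violated. The rules, the field names and the functions validate_order and store_order are hypothetical; in practice the rules would come out of the root-cause analysis and be agreed through the governance process.

```python
VALID_STATES = {"NY", "CA"}  # hypothetical business rule (assumption)

def validate_order(record):
    """Return a list of rule violations; an empty list means the record may be stored."""
    errors = []
    if not record.get("customer_id"):
        errors.append("customer_id is required")
    if record.get("state") not in VALID_STATES:
        errors.append("state must be a valid code")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        errors.append("amount must be a positive number")
    return errors

def store_order(record, table):
    """Store the record only if it passes every rule; otherwise raise ValueError."""
    errors = validate_order(record)
    if errors:
        raise ValueError("; ".join(errors))
    table.append(record)

orders_table = []  # stands in for the source system's storage
store_order({"customer_id": "C1", "state": "NY", "amount": 25.0}, orders_table)

try:
    store_order({"customer_id": "", "state": "ny", "amount": 25.0}, orders_table)
except ValueError as exc:
    print("rejected:", exc)
```

Because the bad record never reaches storage, nothing downstream has to correct it, which is the property a low-latency data feed depends on.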
Success in an operational BI environment requires people to trust the results they receive, and that is only accomplished if the data meets quality expectations and is understood by the business community. Two of the ways in which the operational BI environment differs from the traditional BI environment are the greater criticality of effective governance and data stewardship and of the data profiling work.
In my next column, I will describe additional ways in which the data quality requirements of the operational BI environment will dictate when special actions need to be taken.