My online series of articles has been focused on the need for businesses to get serious about their approach to developing enterprise business intelligence (BI) and data warehouse (DW) capability. When pursuing this capability, it is important to adopt a holistic view, followed by disciplined investment and execution. To develop the future vision for this capability, one should consider seven inter-related areas.
This is the fourth column exploring the key considerations of data.
In my earlier columns on data, I discussed sourcing, integration, master data management and data security considerations, but there is another important data-related topic to consider: data and information quality management.
From the point of view of a BI application end user, quality means that the available data and information enable a true understanding of what is going on in the business and support effective decisions that positively influence future outcomes.
To provide this quality to their end users, BI professionals should consider a range of methods. Techniques that enable quality information include:
- Ensuring master data and dimensions are commonly defined across BI applications and make sense to end users,
- Ensuring metrics are defined properly to provide the information users need to make effective decisions about their business, and
- Ensuring data governance processes are in place to oversee data definitions and enforce quality.
BI development teams use data profiling to discover information about the contents of a source system's database tables. Profiling is typically done at the beginning of a BI development project as part of a requirements and/or detailed design phase.
To learn about the data structure, you must analyze the data model. A data model can tell you a lot about the data in a source system, but it's similar to looking at a trail map when you go hiking. The trail map is a guide, but it doesn't give you the kind of details about a trail that you see when you actually hike it, such as where a fallen tree is blocking your way.
Profiling is about more than just looking at the data structure of a source system; it is about determining the content that actually exists within the database. For example, profiling can tell you:
- How many rows are in each table,
- The date and time ranges found in all date and time columns,
- The numeric value ranges found in all numeric columns, and how often each occurs,
- The string values found in all character columns, as well as how often each occurs, and
- What column values match column values in other tables; in other words, it can point out relationships in the data whether the data model intended them or not.
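As a minimal sketch of the idea, several of the profiling facts above (row counts, value ranges, value frequencies) can be gathered with simple queries. The table and column names below are hypothetical, using an in-memory SQLite database to stand in for a source system:

```python
import sqlite3

# Build a tiny in-memory "source system" to profile (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, status TEXT, amount REAL, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "X", 120.0, "2008-01-15"),
     (2, "Y", 75.5,  "2008-02-03"),
     (3, "Z", 310.2, "2008-02-20"),
     (4, "A", 99.9,  "1999-11-30")],  # a legacy code value and an old date
)

def profile_table(conn, table):
    """Print simple profiling facts: row count, per-column ranges and distinct counts."""
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table}: {rows} rows")
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    for col in columns:
        lo, hi = conn.execute(f"SELECT MIN({col}), MAX({col}) FROM {table}").fetchone()
        freqs = conn.execute(
            f"SELECT {col}, COUNT(*) FROM {table} GROUP BY {col}"
        ).fetchall()
        print(f"  {col}: range [{lo} .. {hi}], {len(freqs)} distinct values")

profile_table(conn, "orders")
```

Even this toy profile surfaces the kind of surprises profiling is good at: an unexpected status code and a date range reaching much further back than expected.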
Profiling is an interesting exercise because it often tells you things the source system experts didn't know or may have forgotten. For example, a system expert may tell you that the data values expected in a code column are X, Y and Z, but upon profiling you find these values plus A, B and C. When you ask the system expert about this, they will usually say something like "oh, yeah," smile, and explain why those values are there and what they are used for. Usually, the unexpected data values you have discovered were used in the past. It's important to keep in mind that when building an enterprise data warehouse (EDW) and BI applications with historical information, you must understand all the historical nuances.
Profiling is an important activity contributing to a quality solution. It allows you to understand what you are actually working with and, therefore, gives you the information needed to develop the appropriate business and transformation logic in your extract, transform and load (ETL) routines to correct any quality issues. It can also highlight quality fixes that can be implemented in the source system itself. Profiling can be done by executing simple queries against source data, but it can be done much more quickly and thoroughly using data profiling tools.
The major benefit of profiling for the BI development process itself is reducing the time and effort required to develop, test and fix ETL routines. The benefit to end users is quicker turnaround on BI application development, with fewer data quality issues.
Data cleansing is a technique typically implemented as part of an ETL process. It can refer to simple ETL transformations, such as mapping values to enterprise standards or filling in missing values, but it most commonly refers to name and address cleansing. This is where information about people and their addresses is run through special-purpose transformation software that standardizes addresses (ensuring street and city names are spelled correctly, postal codes are correct and complete, and state names, country names and postal codes correspond to one another). Cleansing and subsequent matching processes enable companies to create an integrated customer database (a single list of customers), which can be very useful for BI as well as other purposes such as customer marketing.
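A minimal sketch of the simpler, rule-based side of cleansing, using a hypothetical state-abbreviation map and a default country fill. Real name and address cleansing relies on dedicated software with reference data; this only illustrates the mapping-and-filling idea:

```python
# Hypothetical cleansing rules: map free-form state entries to postal
# abbreviations and fill missing country values with an enterprise default.
STATE_MAP = {
    "california": "CA", "calif.": "CA", "ca": "CA",
    "new york": "NY", "n.y.": "NY",
}

def cleanse(record):
    """Return a cleansed copy of one customer record (dict)."""
    out = dict(record)
    raw_state = (out.get("state") or "").strip().lower()
    # Standardize the state if we recognize it; otherwise leave it as-is.
    out["state"] = STATE_MAP.get(raw_state, out.get("state"))
    # Fill a missing value with the enterprise default.
    out["country"] = out.get("country") or "US"
    return out

dirty = {"name": "Acme Corp", "state": "Calif.", "country": None}
print(cleanse(dirty))  # {'name': 'Acme Corp', 'state': 'CA', 'country': 'US'}
```

Standardizing values like this before matching is what makes it possible to recognize that two differently keyed records describe the same customer.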
Quality monitoring is another technique that can be used to improve overall information quality. Monitoring focuses on running data through a series of quality checks as it moves through an ETL process, recording information about what is found. For example, you may want to keep track of how often, and for which records, a quality transformation was applied as the data was loaded into the BI/DW databases. Additionally, if some records fail certain checks, you may want to reject them automatically and not load them, or tag them as being of suspect quality.
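The monitoring idea can be sketched as a set of checks that tally outcomes and either reject or tag records as they are loaded. The check names and record fields here are hypothetical:

```python
from collections import Counter

# Running tally of check outcomes, to be reported on later.
stats = Counter()

def check_record(rec):
    """Apply quality checks to one record; return 'ok', 'suspect', or 'reject'."""
    if rec.get("customer_id") is None:
        stats["reject:missing_customer_id"] += 1
        return "reject"                      # do not load this record
    if rec.get("amount", 0) < 0:
        stats["suspect:negative_amount"] += 1
        rec["quality_flag"] = "suspect"      # load it, but tag it as suspect
        return "suspect"
    stats["ok"] += 1
    return "ok"

batch = [
    {"customer_id": 7, "amount": 50},
    {"customer_id": None, "amount": 10},
    {"customer_id": 9, "amount": -5},
]
loaded = [r for r in batch if check_record(r) != "reject"]
print(dict(stats))
```

The accumulated counters are exactly the raw material for the quality-issue reports discussed next.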
In all of these cases, you can then run reports on the information collected to see what types of quality issues you are experiencing most often and work to proactively devise the appropriate remedies. For example, if you find certain data quality issues occurring frequently that result in poor information being made available in a BI application, it may be time to examine the source systems and the business processes driving the data collection to see if it can be improved.