The biggest challenge in writing this column has been deciding what to discuss first. Rather than following a life cycle approach, I plan to focus immediately on preparing for and developing dimensional models. Later, I will discuss other topics that are just as critical for your success including data warehouse architectures, gathering business requirements and achieving enterprise integration.
Before you start, you need to decide what you are going to model. A common approach is to ask the business users to compile a list of data elements that they need. This may be compiled in group brainstorming or joint application development (JAD) sessions, or by having each user submit his/her own list of data elements. The project team diligently compiles this list and presents it back to the business community for approval. Unfortunately, I have found this to be one of the least successful methods to determine what is to be included in your dimensional model.
This user-defined list of elements is based upon what the users really believe they need. Often this includes key performance indicators (KPIs) that are being used to judge their personal or group performance. Indeed, the business community will need to have access to these fields in their data mart. Therefore, why doesn't this work?
The business community has learned what data they can and can't get in a timely manner. Over time, they stop asking for data or reports that they know will be too costly or take too long to help them make a decision. Then, as more time passes, they stop even thinking about reports that they know they cannot get. Their perception of what they want is often limited to what they have been able to get recently.
On the other extreme, many business users take this as the opportunity to spell out the top data elements that they believe are the key to their future success. However, this may not be grounded in the reality of what data is actually captured.
I must find these data elements in order to understand them and their relationships. This is where we start the great data scavenger hunt. Some of the data may be easy to locate, while some may be obscure and take time to ferret out. We may spend many hours searching for some data elements that may not exist anywhere. Unfortunately, we may not come to that conclusion until we have finished an exhaustive (both thorough and tiring) search.
What is a better approach? Start with what you learned gathering business requirements. One output of gathering business requirements is a list of "business data sources" that the users need to support their analysis. This is not a list of file names, but a list of types of data they need, such as customer sales, financial budgets, sales forecasts, product details and customer demographics. This is a small sample; I often identify 30-plus different kinds of data that the business might want.
Using that list, identify candidate source systems where that type of information can be found. Sometimes there is only one source, but there may be multiple sources. If so, you will need to determine the accessibility, accuracy and timeliness of each to determine the most appropriate place to get that data. Work with the business community to prioritize the business data based upon how effectively the data supports their top business problems. Evaluate the technical feasibility as well. The optimal solution is to identify two or three sources that support the business and are highly feasible.
Now you have a reasonable scope of data to analyze. You will need to understand each of these data sources in detail. Understanding a data source typically requires sitting side by side with the data-source expert and walking through each of the tables and columns in the database. This may take several hours or several days depending upon the quality or lack of documentation.
Write down what you learn! It all makes sense now, but will it make sense next Monday? You must use the data source expert's time wisely. I like having hard copies of the data models and a simple list of all the tables. Quickly eliminate tables that are used to drive production applications. You may be surprised at what is lingering in your production database. For tables of interest, print a spreadsheet listing each column by table and use this for taking notes.
Figure 1: Excerpt from Column Spreadsheet for Note Taking
Your charter is not to document all the incoming source systems, but this could indeed be a by-product of your efforts. Consider including a representative from the source team to formally document data element descriptions. Just ensure that this does not sidetrack your project.
I find color-coding data models and table listings to be extremely helpful. Buy yourself a rainbow of highlighters, and use a different color for each major type of data. For a recent insurance application, I color-coded all policy elements pink, claims purple, reinsurance orange and financials green (of course). This may seem silly, but we are trying to understand, retain and use later what we learn in these sessions. This is most useful if you are wading through hundreds of tables.
Taking the time to understand the current data sources is critical to prepare for modeling. Upon completion of the tasks outlined in this column, you will be ready to begin developing your dimensional model.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access