Why Does Data Modeling Take So Long?
At a recent conference, one of the attendees asked me why it takes so long to develop a data model that can be used for designing the data structures for a data warehouse or data mart. To answer this question, we should consider the major steps involved in developing the data model and how long each step should take. For simplicity, we'll use the following steps, recognizing that these are high-level groupings and that the specific steps may vary based on the governing methodology and architecture:
- Requirements and business rules gathering;
- Source analysis;
- Logical model development; and
- Physical model development.
In my experience, the first two steps contribute the most to the time it takes to develop the data model.
Requirements and business rules gathering: Before data modeling work can begin, the data modeler needs to understand requirements from three different aspects. On the surface, the primary driver may appear to be a determination of the data that is needed, but that only scratches the surface. The needed data elements determine the attributes (columns) that must be included in the data model and their placement within a normalized model. They do not, however, contribute to the overall structure.
The dimensional model is structured to respond to questions for various business subjects and provide a means of navigating through the data for the desired level of summary and filtration. A key factor in designing these models is understanding the requirements in terms of the business questions that need to be answered. Those questions can be captured using various techniques (e.g., fact-qualifier matrix) to help determine the number of star schemas that are needed and the questions to be answered by each.
The third requirement that helps the data modeler is the reason the data is needed. In other words, 'Why do these business questions need to be answered?' That requirement provides the modeler with information to anticipate future needs so that a more extensible model can be built.
If a normalized model is built for a data warehouse hub (or if an increment of the enterprise data model is built), the business rules need to be understood.
Source analysis: Another time-consuming model-related activity is analyzing the data sources to understand the data that actually exists and its condition. This step is needed for several reasons. The structure of the existing sources and the supporting documentation provide the data modeler with information about data that is potentially available. This knowledge contributes to requirements gathering as well.
Data profiling, which examines data content, provides information about the true meaning of the data as well as the business rules represented by the data. This information is important because the data structures being developed need to be able to capture the existing sources. If, for example, a supposedly mandatory field is often blank, the model must allow for null values (or the ETL needs to populate fields that are null based on business rules).
Finally, an understanding of the challenges that need to be addressed by the integration and transformation steps also has an impact on the data model.
Logical model development: Once the requirements, business rules and sources are understood, the data modeler can develop the logical model very quickly. Factors that slow down this step typically are ones that clarify ambiguities in requirements, definitions, etc. While this time is attributed to the model development, it is, in fact, an extension of the previous development steps.
Physical model development: The base physical model builds on the logical model and characteristics of the existing physical structures. Again, this is a pretty quick step, and modeling tool capabilities enable automation of some of the tasks. Once the base structure is created, DBAs need to review and adjust for performance and security. In some cases, a physical model may not even need to be created.
Data modeling duration is often perceived to be too long, but that is because people don't separate analysis work (which is needed for other reasons) from actual data modeling. Most of the time spent developing data models is consumed developing or clarifying the requirements and business rules and ensuring that the data structure can be populated by the existing data sources. While time must be spent developing the models, that time is overshadowed by the activities on which this work depends.
If you are asked why it takes so long to develop a data model, respond by saying that the model depends on a foundation of accurate business rules and requirements, and point to some areas (e.g., inconsistencies in specific business terms and definitions) as examples of the factors that contribute to the time. If you have additional ideas or have answered this question in the past, I welcome your input - please send me your thoughts at firstname.lastname@example.org.