This month we’ll discuss one of the most difficult activities in our data warehouse design process source system assessment. Often this activity is called "system of record" or "best source identification." In conjunction with requirements gathering and data quality assessment (the subject of next month’s column), we must identify not what source systems to draw data from, but how this is to be done. Fundamental activities to be completed are:
- Determining timeliness, accuracy and granularity of the source data to be extracted.
- Determining how derived or consolidated data will be created from this base or atomic information (and who is going to verify it’s accuracy).
- Determining where the best possible source for each data element (not subject) is located, as each dimension or fact table can be created by either:
- A one-to-one mapping from source systems to fields in our star schema
- A many-to-one mapping from source systems data elements to star schema data fields
- A many-to-many mapping from source systems
- Determining how the data is to be extracted. A number of different methods can be investigated based on the source system database type, size, frequency of extracts and sensitivity of the source database. Options include:
- The wholesale movement of data from operational DBMS and comparison against the last set of load files.
- The usage of application created delta file of changes.
- The alteration of application code to create a delta file.
- The trapping of changes at the DBMS call level.
- The trapping of changes on a log file (reused from DBMS or separately created).
- Comparison of before-and- after record images from journal files.
- Access to archive files for loading of history into the warehouse and the establishment of the appropriate context (dimension) information.
All but the first item necessitate changes to either the source system application code, supporting DBMS or file structure. If changes are made or required to the code or DBMS, you may end up impacting either the performance or availability of these systems. Therefore, the only practical solution (in most cases) is to access the entire source system’s DBMS through its mirror image copy, during off-line hours, or conduct the necessary extracts for comparison against the last load. We must also consider the capabilities and capacity of the DBMS and its database server in defining our data staging requirements for source system loading. We need to consider:
- How much data the DBMS can realistically handle.
- How is the data stored, compressed, encoded and indexed and how null values are to be managed.
- If locking can be suppressed during the load.
- If the DBMS allows for parallel data loading, and if this load process can be customized by the DBA and tuned for optimum performance.
- How the technical meta data will be loaded into the data warehouse.
The other aspect to consider is data security and who has access to verify what goes into the warehouse. Tips to consider in dealing with data security as part of the source system data loading analysis process include deciding on a strategy for encoding highly summarized global data, reducing data security constraints on load data as levels become more granular and layering security constraints against load data by establishing criteria such as secured to secret to top secret (highly summarized CEO or CFO data).
The major deliverable for this activity is a source system assessment which sets the "scope" for data staging design and implementation determining what’s in and out, timing, resources, methods and tool requirements and, most importantly, budget. Figure 1 illustrates a sample table of contents for a source system assessment report.
The following process can be used to extend your project plan for source system assessment.
1. Develop Data Staging Strategy
1.1. Establish data staging approach (strategy) and document (OLTP or OLAP).
1.2. Develop initial list of source systems for consideration as the system of record for the data warehouse or operational data store.
1.3. Define data currency, context and accuracy thresholds to be maintained.
1.4. Define data audit/verification approach.
1.5. Define data cleansing/transformation guidelines and recommendations.
1.6. Define organizational infrastructure and recommended implementation approach (for the training, development and support of the data staging environment).
1.7. Define data staging technical infrastructure, functional architecture and recommended implementation approach (hardware, software and network).
1.8. Determine resource costs and implementation time frames.
1.9. Develop and approve a data staging mandate or charter based on the information collected.
2. Refine a List of Existing Internal, External and Derived Data Sources
2.1. Identify all existing systems, tables, files, spreadsheets, etc. within scope.
2.2. Determine volatility and data access strategy for each source.
2.3. Identify data repair, consistency, accuracy recommendations and collection/auditing procedures for each source.
2.4. Develop and approve a source system assessment document and set of related procedures.
Resource requirements for this process will vary based on number and type of source applications, but they will primarily revolve around a key set of roles which include:
- Source system project leader.
- Participation from the DW lead data analyst.
- Subject matter expert participation (one per user area or major system).
- Legacy database expert (one per major source system and or type of DBMS).
- Corporate infrastructure services participation for setting access and security thresholds and constraints for the source systems.
- DW team infrastructure support for investigation/determination of a data staging approach based on a selection of methods and tools.
- End user participation for source system selection confirmation and verification of approach for data selection and verification.
This process, which leads to data staging design and implementation, typically consumes the bulk of resources and time frame in the design phase. Plan accordingly and do not be surprised if the subsequent development of your data staging environment consumes upwards of 70 percent or more of your project resources!
Next month we will continue this discussion by reviewing the data quality assessment process, the third of five major activities we need to accomplish as part of data warehouse design. For a more complete description of this process and deliverables, please contact the author at email@example.com.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access