Most organizations, both public and private sector, are faced with managing large quantities of disparate data. Disparate data is characterized by four basic problems.

  • The organization does not have one, complete, integrated inventory of all its data.
  • The true content and meaning of all the data in an organization's data resource is not readily known.
  • The data is highly redundant throughout the organization.
  • The data is highly variable in its format and content. Some data, such as names and dates, often have as many as 10 to 20 different formats.

This disparate data severely impacts both the organization's ability to perform its business activities and the quality of the data resource.
Data disparity is continuing to increase in most organizations. The millennium date problem consumes resources that could be used to resolve or prevent data disparity. The shortage of skilled personnel causes priorities to be shifted to tasks other than resolving data disparity. The availability of client-friendly products allows people throughout the organization to informally develop undocumented data that adds to data disparity. The proliferation of data warehouses and the increased use of other data megatypes (such as spatial, video, audio, textual and image data) result in additional disparate data.

Meta data is following the same pattern. In the past, minimal meta data was available. What meta data does exist is largely vested in people and is seldom formally documented. As the workforce becomes more dynamic and people leave the organization for retirement or better jobs, this meta data is lost forever, a permanent loss of institutional memory.

Many new products capture and maintain their own meta data. This is, however, a good news/bad news situation. The good news is that meta data is finally being captured; the bad news is that each product captures and stores its meta data independently, so it cannot be readily integrated across the organization. Most organizations are moving from minimal meta data to disparate meta data.

Numerous products are presented as solutions to the disparate data and meta data problems. While most of these products solve some aspect of data disparity, none resolves the entire problem. Data disparity can only be solved by formally transforming disparate data into an integrated data resource.

Common Data Architecture

Transforming disparate data to an integrated data resource requires a formal construct for understanding and resolving data disparity. The common data architecture1 is the common context within which all data is formally managed. It encompasses all manual and automated data that is in, or available to, an organization. The common data architecture consists of four components.

  • Data description is the formal naming and comprehensive definition of data. All data is formally named according to a data naming taxonomy and is comprehensively defined.
  • Data structure is the arrangement and relationships of data in the data resource. It generally consists of a formal logical data model for the business and a formal physical data model for implementation.
  • Data fidelity is the integrity, accuracy and completeness of the data. Data integrity is how well the data is maintained according to formal data integrity rules. Data accuracy is how well the data represents the real world. Data completeness is how well the data resource supports all business activities.
  • Data documentation is the complete, current, readily available documentation about the data resource. It is available to anyone in the organization interested in understanding and using the data resource.

New terms were defined to support the understanding and resolution of disparate data within the common data architecture. A data subject represents an object or event in the real world, such as customer, vehicle, river, transaction, accident or flood. It is the primary component for developing a subject-oriented data resource. A data characteristic represents a feature of an object or event, such as customer name, vehicle identification, river name, transaction amount, accident date or flood duration. A data characteristic variation is any variation in the format or content of a data characteristic, such as a date in MDY, M/D/Y and CYMD format, or customer name that is complete or abbreviated and in normal or inverted sequence.
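
To make these terms concrete, the brief sketch below models the three concepts in Python; the classes and example values are illustrative only, not part of the common data architecture itself:

    from dataclasses import dataclass

    # Illustrative sketch only: the three naming concepts as simple classes.
    @dataclass
    class DataSubject:
        name: str                      # e.g., "Customer", "Vehicle", "River"

    @dataclass
    class DataCharacteristic:
        subject: DataSubject           # the subject this feature describes
        name: str                      # e.g., "Name", "Birth Date"

    @dataclass
    class DataCharacteristicVariation:
        characteristic: DataCharacteristic
        variation: str                 # e.g., "CYMD", "M/D/Y", "Inverted"

    # The M/D/Y variation of the customer birth date characteristic:
    customer = DataSubject("Customer")
    birth_date = DataCharacteristic(customer, "Birth Date")
    birth_date_mdy = DataCharacteristicVariation(birth_date, "M/D/Y")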

Formal data names are developed from a formal data naming taxonomy2 that consists of the 12 components shown in Figure 1. Not all components of the taxonomy are used in every data name, but the proper combination of components uniquely names all data in an organization's data resource. A formal data naming vocabulary (an expansion of the traditional class word) contains common words that support the data naming taxonomy.

  Data Site:
  [Data Occurrence Selection]
  Data Subject.
  Data Subject Hierarchy#
  Data Subject Hierarchy Aggregation^
  Data Code Set;
  Data Characteristic,
  Data Characteristic Variation-
  (Data Characteristic Substitution)
  'Data Value'
  Data Rule!

Figure 1: Data Naming Taxonomy Components
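
As an informal illustration only, the punctuation in Figure 1 might be applied in code roughly as follows; the function, its arguments and the subset of components handled are all invented for this sketch:

    # Hypothetical sketch: assembling a formal data name from a subset of
    # the taxonomy components, using the punctuation shown in Figure 1.
    def build_data_name(subject, characteristic=None, variation=None,
                        site=None, value=None, rule=None):
        parts = []
        if site:
            parts.append(site + ":")            # Data Site:
        parts.append(subject + ".")             # Data Subject.
        if characteristic:
            parts.append(characteristic + ",")  # Data Characteristic,
        if variation:
            parts.append(variation + "-")       # Data Characteristic Variation-
        if value:
            parts.append("'" + value + "'")     # 'Data Value'
        if rule:
            parts.append(rule + "!")            # Data Rule!
        return " ".join(parts)

    # Matches the data rule name used later in this article:
    assert build_data_name("Employee", "Name",
                           rule="Value Change") == "Employee. Name, Value Change!"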

Formal data names are fully spelled out, untruncated, unabbreviated names that are readily understood by business clients, such as customer, employee or birth date. Data names can be formally abbreviated only for physical implementation to meet length restrictions. All data in the data resource is also comprehensively defined based on its content and meaning in the real world. A typical comprehensive data definition consists of one or two paragraphs of two to three sentences each.

Data rules3 are a subset of business rules that define the integrity of the data resource. Each data integrity rule is formally named, has a formal notation and an associated explanation. For example, Employee. Name, Value Change! is a data integrity rule defining the procedure for changing an employee's name. The data integrity rule for a domain might be Gender. Code, Domain!={'M' | 'F'}. The data integrity rule for deleting a data occurrence might be Employee. Deletion! with the explanation describing the procedure when an occurrence is deleted.
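
A minimal sketch, assuming a hypothetical check function, of how the Gender. Code, Domain! rule could be enforced:

    # Hypothetical sketch: enforcing the domain rule
    # Gender. Code, Domain! = {'M' | 'F'}
    GENDER_CODE_DOMAIN = {"M", "F"}

    def satisfies_gender_code_domain(value):
        """True when a data value satisfies the Gender. Code, Domain! rule."""
        return value in GENDER_CODE_DOMAIN

    assert satisfies_gender_code_domain("M")
    assert not satisfies_gender_code_domain("X")   # violates the domain rule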

Figure 2: Transition Phases

Data resource transition is the process of developing an integrated data resource from disparate data. It consists of the three data resource transition phases shown in Figure 2. The disparate data is formalized by developing a data inventory and then cross-referencing each item of disparate data to a data characteristic variation in the common data architecture. This non-destructive approach to understanding disparate data requires careful analysis to identify or create the data characteristic variation that exactly represents each item of disparate data. Usually 70 to 80 percent of the cross-referencing is relatively easy, 10 to 15 percent requires some degree of analysis, and 5 to 10 percent is quite difficult to identify.
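
For illustration, a fragment of such a cross-reference might look like the following sketch; the data sites, column names and variations are invented:

    # Hypothetical cross-reference: each disparate data item (by physical
    # location and column name) maps to exactly one data characteristic
    # variation in the common data architecture.
    CROSS_REFERENCE = {
        ("payroll_db", "EMP_NM"):    "Employee. Name, Inverted-",
        ("hr_system",  "emp_name"):  "Employee. Name, Normal-",
        ("payroll_db", "HIRE_DT"):   "Employee. Hire Date, MDY-",
        ("hr_system",  "hire_date"): "Employee. Hire Date, CYMD-",
    }

    def variation_for(site, column):
        """Look up the data characteristic variation for a disparate item."""
        return CROSS_REFERENCE[(site, column)]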

Once the cross-referencing is completed, official data sources are designated based on the most current and accurate data values. An official data source is the location where data is extracted for data sharing and developing the integrated data resource. Official data characteristic variations are then designated based on the desired form for data sharing and developing an integrated data resource. Data translation schemes are developed between official and non-official data characteristic variations, such as feet to meters or a name from normal sequence to inverted sequence, to support the translation of data values.
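
Two such translation schemes might be sketched as follows; the function names are hypothetical:

    # Hypothetical data translation schemes between non-official and
    # official data characteristic variations.
    def feet_to_meters(feet):
        """Translate a length from the feet variation to the official meters variation."""
        return feet * 0.3048

    def invert_name(normal_name):
        """Translate a name from normal sequence ('First Last') to inverted ('Last, First')."""
        first, _, last = normal_name.rpartition(" ")
        return last + ", " + first

    assert abs(feet_to_meters(10.0) - 3.048) < 1e-9
    assert invert_name("Pat Smith") == "Smith, Pat"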

Formalizing the data resource resolves the four basic problems with disparate data. Inventorying the data raises the awareness of the data that exists. Cross-referencing the data to the common data architecture increases the understanding of that data. Designating official data sources identifies the resolution to data redundancy, and designating official data characteristic variations identifies the resolution to data variability. An integrated data resource is developed by the permanent transformation of disparate data based on the official data sources and official data characteristic variations.

Data Transformation

Data transformation is the process of transforming disparate data to the integrated data resource. It is not a trivial process and requires the following precise steps:

Data identification identifies the data needed at the target location, such as an operational data store or a data warehouse, and the source data needed to produce the target data. The source data is the official data source designated during the formalization of disparate data.

Data extraction takes the desired data from the official data sources and places it in a data depot for data refining. A data depot is a working or staging area for refining disparate data before it is actually loaded into the target database. Data extraction includes any conversion between database management systems.
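
A minimal sketch of data extraction into a data depot, using in-memory SQLite databases and invented table and column names to stand in for the official source and the depot:

    import sqlite3

    # Hypothetical sketch: extract designated official-source data into a
    # data depot (staging) table for refining.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE employee (emp_id TEXT, emp_name TEXT, hire_date TEXT)")
    source.execute("INSERT INTO employee VALUES ('101', 'Pat Smith', '7/4/1998')")

    depot = sqlite3.connect(":memory:")
    depot.execute("CREATE TABLE employee_stage (emp_id TEXT, emp_name TEXT, hire_date TEXT)")

    # Take the desired data from the official source; place it in the depot.
    rows = source.execute("SELECT emp_id, emp_name, hire_date FROM employee").fetchall()
    depot.executemany("INSERT INTO employee_stage VALUES (?, ?, ?)", rows)
    depot.commit()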

Data refining is where the real work of changing the disparate data to integrated data within the common data architecture occurs. It consists of six steps; a brief code sketch of the translation and validation steps follows the list.

  1. Data translation changes the data values to the official data characteristic variations using the data translation schemes developed during the formalization of disparate data, such as changing the formats of dates or sequence of a person's name. It takes the data characteristic variation representing the disparate data and, using the appropriate data translation scheme, creates the official data characteristic variation for the integrated data resource.
  2. Data reconstruction builds complete historical records from archive and audit data. It provides the time-variant data needed in data warehouses by working backward from current data values to create full data records based on archived or audit data.
  3. Data recasting alters the structure of the data for historical continuity. The structure of data often changes over long periods of time. These varying data structures need to be adjusted so that they are as consistent as possible for the period of analysis. This is particularly important in data warehouses.
  4. Data restructuring unnormalizes the data for data warehouses. Note that this is not data denormalization; it is unnormalizing the operational data for a data warehouse. It is one step in the process of developing a formal logical data model for a data warehouse. Data denormalization occurs when the physical design is developed from the logical design for a specific operating environment.
  5. Data derivation creates new data for processing, such as aggregating operational data, weighting, scaling and stratification for data warehouses. It also includes creating new operational data, such as customer profiles.
  6. Data validation ensures that the data produced by the data refining process meets all the established data integrity rules. Any errors require an adjustment to the data refining process before the data can be loaded into the target environment.
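
The sketch promised above ties the translation and validation steps together for a single depot record; the record layout, rules and function names are illustrative only:

    # Hypothetical sketch: data translation (step 1) feeding data
    # validation (step 6) for one record in the data depot.
    OFFICIAL_GENDER_DOMAIN = {"M", "F"}

    def translate_date_mdy_to_cymd(mdy):
        """Translate a date from the M/D/Y variation to the official CYMD variation."""
        month, day, year = mdy.split("/")
        return year + month.zfill(2) + day.zfill(2)

    def validate(record):
        """Apply the data integrity rules; an empty list means the record passes."""
        errors = []
        if record["gender_code"] not in OFFICIAL_GENDER_DOMAIN:
            errors.append("Gender. Code, Domain! violated")
        if len(record["hire_date"]) != 8 or not record["hire_date"].isdigit():
            errors.append("Hire Date not in the official CYMD variation")
        return errors

    record = {"gender_code": "F",
              "hire_date": translate_date_mdy_to_cymd("7/4/1998")}
    assert validate(record) == []   # the refined record passes all data rules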

Data loading places the refined data from the data depot into the target environment. It includes any conversion between database management systems.

Data review ensures that the data loaded into the target environment is correct and ready for operational or analytical processing. It is similar to the parallel testing commonly done for enhancements to information systems. Any errors or discrepancies found during the review require an adjustment to the data refining process.
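
A minimal sketch of such a review check, comparing depot rows against loaded target rows; the rows shown are invented:

    # Hypothetical sketch of a data review check: refined depot rows and
    # loaded target rows must match exactly.
    def review(depot_rows, target_rows):
        """Return discrepancies between depot and target; empty sets mean success."""
        depot_set, target_set = set(depot_rows), set(target_rows)
        return depot_set - target_set, target_set - depot_set

    missing, unexpected = review([("101", "19980704")], [("101", "19980704")])
    assert not missing and not unexpected   # the load is correct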

A data resource is the heart of an intelligent, learning, information-driven public or private sector organization. Operational data, historical data, analytical data, predictive data and meta data are all part of that data resource and must be formally managed and integrated within a common data architecture to provide high-quality, meaningful support to the business. Only through formal management of the data resource can an organization ever hope to stop the rampant creation of disparate data, clean up the existing disparate data and develop an integrated data resource that provides high-quality data to its business activities.

References

1 The common data architecture is explained in:

Brackett, Michael H. Data Sharing Using a Common Data Architecture. John Wiley & Sons. 1994.

Brackett, Michael H. The Data Warehouse Challenge: Taming Data Chaos. John Wiley & Sons. 1996.

2 A detailed explanation of the data naming taxonomy can be found in the two books cited above.

3 A detailed explanation of data rules and other new concepts and techniques is contained in a forthcoming book by the author about data quality and meta data.
