When is a horse a houseboat? Whenever you need it to be, of course - especially when you do not have a field defined for horse in your database, but you need to know about it.

Gender code, for example, identifies the sex of a person (M or F), except when the value is C, which means that the person is a dependent child of an insurance policy holder.

Country code always means a code identifying a specific political state or nation or its territory, except when it identifies an American Indian reservation in one U.S. federal agency's database.

What is illustrated here is that knowledge workers need to know certain facts about things. If a database has not been defined with all knowledge workers' information requirements, and that database is not easily extendable, knowledge workers will often use an existing field for multiple purposes. (Please do not be critical of businesspeople who apply such techniques. They truly need the information to perform their work. If it takes the information systems group too long to modify the database, knowledge workers must create their own workarounds to be able to capture and maintain those facts.)

Several information quality principles are illustrated in this common problem of data overloading or domain chaos. This problem creates chaos in the sense of a confused mass or mixture of different types of fact in one data element.

Problems Caused by Data Overloading

  • Confusion in interpreting data because of multiple facts being represented by one data element.
  • Knowledge workers must know the second and subsequent meanings embedded in the overloaded data element's values in order to interpret the data correctly.
  • Any statistical analysis made on the data will be skewed by the overloaded values.
  • Naive queries that may be automatically generated will provide wrong answers.
  • Knowledge workers must remember to filter out all of the overloaded data element values when constructing searches or queries.
  • Combining this data with other data to create derived or calculated data can produce invalid results.
  • If the data element is a source of data that is extracted to be propagated to a downstream database, it can cause the ETL process or the applications using the data from either the source or the downstream database to fail due to an "unexpected value."
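The skew described in this list is easy to reproduce. The following is a minimal sketch in Python, assuming a small, made-up sample of the gender code field from the insurance example; the sample values and the function names are hypothetical and serve only to illustrate the point.

```python
from collections import Counter

# Hypothetical sample of an overloaded gender code field:
# "M" and "F" record a person's sex; "C" records a different fact
# (the person is a dependent child of a policy holder).
gender_codes = ["M", "F", "F", "C", "M", "C", "F", "C"]

TRUE_DOMAIN = ("M", "F")  # the values the field name promises


def naive_share(codes, value):
    """Share of all rows holding `value`, trusting the field name."""
    return Counter(codes)[value] / len(codes)


def filtered_share(codes, value, valid_domain=TRUE_DOMAIN):
    """Share of rows holding `value` among rows inside the true domain."""
    in_domain = [code for code in codes if code in valid_domain]
    return Counter(in_domain)[value] / len(in_domain)


naive_female = naive_share(gender_codes, "F")        # 3 of 8 rows
filtered_female = filtered_share(gender_codes, "F")  # 3 of 5 rows

# Rows where the sex of the person is simply not recorded at all.
unknown_sex_rows = sum(1 for code in gender_codes if code not in TRUE_DOMAIN)
```

The naive calculation understates the share of females because the "C" rows inflate the denominator. The filtered version is correct only for the people whose sex was recorded; the sex of the three dependent children is lost, which no filter can recover.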

Again, the reason for such data overloading is that knowledge workers need to know facts beyond the ability of the database to house them. So what is the real cause?

Causes and Root Causes

  • Cause: Limitation of the database design to house all the facts required by all knowledge workers.
  • Cause: Old, obsolete database technology that is not easily modifiable or extendable.
  • Cause: Lack of awareness by knowledge workers that others use information from this database, which leads workers to conclude that it must be okay to create new codes for data elements. When we explore these anomalies with the source areas, we often hear the response, "Oh, I didn't know anybody else used this database. This is our database that we use for ..."
  • Root Cause: Poor information requirements gathering, data modeling and database design processes that fail to capture all information requirements from all stakeholder areas across the enterprise, including the "should know" attributes not required for the immediate application being built.
  • Root Cause: Lack of education of knowledge workers and lack of a controlled data definition change process, which allows chaotic data definition changes.
  • Ultimate Root Cause: Failure to treat information as a true resource of the enterprise and to design singular, subject-oriented, enterprise-strength databases.
  • Ultimate Root Cause: Failure to treat information as a product of the business processes and failure to treat data definition as information "product specification."

What Improvements Are Needed

Both the data development process and the data definition change process require improvement.

Data Development Process: When developing an information model, knowledge workers representing all the involved business areas must identify their information needs on a subject-by-subject basis, such as customer information or product information.

Today's state-of-the-art information models require entity type-subtype models for specializations, generic entity types for classifications with similar attributes, and entity life cycles for clear specification of business rules associated with unique occurrences of important information types. Examples of life cycles include a manufactured product moving from concept to engineered to manufactured to offered to retired; a customer moving from person to prospect to active customer to former customer; and an order moving from received to approved to picked to filled to shipped to invoiced to paid and completed.
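As one illustration of these modeling ideas, the sketch below uses Python to give each fact its own attribute rather than overloading one code. It assumes a Person supertype, a Dependent subtype that carries the "dependent child" fact from the gender code example, and the customer life cycle named above. All class names, field names and the forward-only transition rule are assumptions made for illustration, not a prescribed design.

```python
from dataclasses import dataclass
from enum import Enum


class Sex(Enum):
    MALE = "M"
    FEMALE = "F"


class CustomerStage(Enum):
    PERSON = "person"
    PROSPECT = "prospect"
    ACTIVE = "active customer"
    FORMER = "former customer"


# Assumed life cycle rule: a customer moves forward one stage at a time.
NEXT_STAGE = {
    CustomerStage.PERSON: CustomerStage.PROSPECT,
    CustomerStage.PROSPECT: CustomerStage.ACTIVE,
    CustomerStage.ACTIVE: CustomerStage.FORMER,
}


@dataclass
class Person:
    """Supertype: facts that are true of every person."""
    name: str
    sex: Sex
    stage: CustomerStage = CustomerStage.PERSON

    def advance(self) -> None:
        """Move to the next life cycle stage, or fail at the final one."""
        if self.stage not in NEXT_STAGE:
            raise ValueError(f"'{self.stage.value}' is the final stage")
        self.stage = NEXT_STAGE[self.stage]


@dataclass
class Dependent(Person):
    """Subtype: adds the fact that was overloaded into gender code."""
    policy_holder_name: str = ""


# A dependent child keeps a real sex value AND the dependency fact.
child = Dependent(name="Sam", sex=Sex.FEMALE, policy_holder_name="Pat")
child.advance()  # person -> prospect
```

Because the dependency is expressed by the subtype, the sex attribute never needs a third value, and the life cycle rule rejects a transition that the business has not defined.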

Data Definition Change Process: The data definition change process must provide for impact analysis of requested changes to definitions, valid value sets and business rules in order to identify who must be involved in approving them. All data definition changes must be approved by both a business steward and an information resource management steward, who understand the business and technical ramifications of the changes, respectively.
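A controlled change process of this kind can be sketched in a few lines of Python. The example assumes a simple in-memory record of each data element, its two stewards and the areas that consume it; the class names, the steward labels, the consumer areas and the proposed "U" code are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataElement:
    name: str
    valid_values: set
    business_steward: str
    irm_steward: str  # information resource management steward
    consumers: set = field(default_factory=set)


@dataclass
class ChangeRequest:
    element: DataElement
    new_values: set
    approvals: set = field(default_factory=set)

    def impacted_consumers(self) -> set:
        """Impact analysis: every area that uses this data element."""
        return set(self.element.consumers)

    def required_approvers(self) -> set:
        """Both stewards must sign off on any definition change."""
        return {self.element.business_steward, self.element.irm_steward}

    def approve(self, steward: str) -> None:
        if steward not in self.required_approvers():
            raise ValueError(f"{steward} is not a steward of {self.element.name}")
        self.approvals.add(steward)

    def apply(self) -> None:
        """Extend the valid value set only after all approvals are in."""
        missing = self.required_approvers() - self.approvals
        if missing:
            raise PermissionError(f"missing approval from: {sorted(missing)}")
        self.element.valid_values |= self.new_values


# Hypothetical element and change: add "U" (unknown) to a sex code.
sex_code = DataElement(
    name="sex_code",
    valid_values={"M", "F"},
    business_steward="business steward",
    irm_steward="IRM steward",
    consumers={"claims", "marketing", "actuarial"},
)
request = ChangeRequest(sex_code, {"U"})
request.approve("business steward")
request.approve("IRM steward")
request.apply()
```

The point of the sketch is the ordering: the list of impacted consumers is available before anyone approves, and a new code cannot enter the valid value set until both stewards have agreed to it.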

Data overloading is a catch-22 problem. The intention is good: to capture a fact that someone needs to know. However, this "solution" produces the side effects described previously.

What do you think? Let me know at Larry.English@infoimpact.com.
