
The Next Data Management Frontier: Unstructured Data

  • June 22 2006, 1:00am EDT

This month's column is contributed by Patricia Cupoli.

At the April 2006 Wilshire Meta Data Conference/DAMA International Symposium, a number of presentations dealt with metadata, ontologies (organization of knowledge and terms), semantics, controlled vocabularies and taxonomy/classification. You may ask why these topics, typically associated with library and information science, document management, content management and knowledge management, were presented, since they do not seem typical for the data management professional. However, presentations of this type have been appearing more and more often over the last several years.

Data management professionals are becoming increasingly involved with an area called unstructured data. The term covers objects in both hard and soft media, such as emails, all types of text documents, graphic images, videos and Internet Web pages. These items do not fit naturally into the rows and columns of a database table or spreadsheet, but they can be stored in a relational DBMS as a BLOB (binary large object) or in XML files. Yet most unstructured data has some type of structure (and is then also known as semistructured data), which could provide metadata in adherence to a standard such as the Dublin Core (15 metadata elements in total, including title, creator and description). This metadata could be stored in a relational database even if the object content itself is not in electronic format.
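
As a rough illustration of that last point, the sketch below stores Dublin Core metadata in a relational table using Python's built-in sqlite3 module. The 15 element names come from the Dublin Core standard; the table name `object_metadata`, the `storage_location` column and the helper functions are assumptions made up for this example, not part of any standard or product.

```python
import sqlite3

# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def create_catalog(conn):
    """Create a (hypothetical) table with one text column per element."""
    columns = ", ".join(f'"{name}" TEXT' for name in DUBLIN_CORE_ELEMENTS)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS object_metadata ("
        "object_id INTEGER PRIMARY KEY, "
        f"{columns}, "
        "storage_location TEXT)"  # assumed column: where the object lives
    )


def register_object(conn, storage_location, **elements):
    """Record metadata for one object and return its generated id."""
    unknown = set(elements) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    names = ["storage_location"] + list(elements)
    values = [storage_location] + list(elements.values())
    column_list = ", ".join(f'"{name}"' for name in names)
    placeholders = ", ".join("?" for _ in names)
    cursor = conn.execute(
        f"INSERT INTO object_metadata ({column_list}) VALUES ({placeholders})",
        values,
    )
    return cursor.lastrowid


def find_by_creator(conn, creator):
    """Return (title, storage_location) pairs for one creator."""
    rows = conn.execute(
        "SELECT title, storage_location FROM object_metadata "
        "WHERE creator = ? ORDER BY title",
        (creator,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_catalog(conn)
    # A paper document: the content is not electronic, but its metadata is.
    register_object(
        conn,
        "Records room, cabinet 12",  # invented location for the example
        title="Supplier contract",
        creator="Legal department",
        format="paper",
    )
    print(find_by_creator(conn, "Legal department"))
```

The object itself never enters the database here; only its description and location do, which is what allows paper records and electronic files to be searched through the same catalog.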

Why is unstructured data important to a company? It has been estimated that at least 80 percent of a company's data is unstructured and not easily accessible or found. In this age of Sarbanes-Oxley and other regulations, the overwhelming amount of unmanaged, unstructured data could increase a company's legal and regulatory exposure. Business users want to browse and search across all types of data, for example to understand customer issues. If unstructured data is not integrated into a data warehouse/business intelligence environment, management often cannot base decisions on analysis of both structured and unstructured data.

This growing area of data needs to be managed as a corporate asset to provide value. It has to be identified, captured, organized, and made accessible and sharable. These management processes should sound familiar to a data management organization. Such an organization deals with the structured data world by developing and maintaining data models and the associated metadata that give the data meaning and vocabulary, and it follows best practices of data standards and a governance process with data stewards. In the structured world, one concept (e.g., the employee entity) can have many expressions or types (e.g., management or staff, active or retired) that describe it.

Unstructured data, by contrast, deals with content semantics, where one expression (e.g., foot) can have many different concepts associated with it (e.g., a unit of measurement, the part of a human or animal leg below the ankle joint, or the lower part of anything). A controlled vocabulary organizes content through a selected list of words and phrases used to tag units of information (either automatically or manually) so that they may be more easily retrieved by a search. There is usually a governance structure to keep the various types of controlled vocabularies current. The different types include the following:

  • list of equivalence relationships or synonyms (e.g., cat and feline, baby and infant, student and pupil);
  • taxonomy that shows hierarchical relationships of subject and topic metadata;
  • thesaurus that shows equivalence (synonym list), hierarchical (taxonomy), and associative (related terms) relationships; and
  • ontology that represents a collection of taxonomies and thesauri for knowledge representation.
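
The first three vocabulary types in the list above can be sketched in a few lines of Python. The terms, the dictionary names (`SYNONYMS`, `BROADER`, `RELATED`) and the helper functions are all invented for illustration. Matching is done on exact lowercase words only, so plurals and phrases are not handled, and an ontology is not modeled.

```python
import re

# Equivalence relationships: each variant maps to one preferred term.
SYNONYMS = {
    "feline": "cat",
    "infant": "baby",
    "pupil": "student",
}

# Taxonomy: each term maps to its broader (parent) term.
BROADER = {
    "cat": "mammal",
    "mammal": "animal",
    "baby": "person",
    "student": "person",
}

# Associative relationships: related terms that are neither
# synonyms nor parents (made-up examples).
RELATED = {
    "cat": {"veterinarian"},
    "student": {"school"},
}


def preferred_term(word):
    """Map a word to its preferred term, or return it unchanged."""
    word = word.lower()
    return SYNONYMS.get(word, word)


def tag_text(text):
    """Return the set of vocabulary terms that tag a unit of text."""
    vocabulary = set(SYNONYMS.values()) | set(BROADER) | set(BROADER.values())
    words = re.findall(r"[a-z]+", text.lower())
    return {preferred_term(word) for word in words} & vocabulary


def broader_terms(term):
    """Walk up the taxonomy from a term to the top of its hierarchy."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain


def thesaurus_entry(word):
    """Combine all three relationship types for one word."""
    term = preferred_term(word)
    return {
        "preferred": term,
        "synonyms": sorted(v for v, p in SYNONYMS.items() if p == term),
        "broader": broader_terms(term),
        "related": sorted(RELATED.get(term, set())),
    }
```

With these definitions, `tag_text("The pupil adopted a feline.")` tags the sentence with the preferred terms `student` and `cat`, so a search on either preferred term would retrieve it even though neither word appears in the text.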

Where should data management start with unstructured data? Most likely, there are other organizational groups in your company, such as content or knowledge management, libraries, records management, or document management, with which a data management organization could collaborate to raise awareness of how critical it is to manage and integrate unstructured data for accessibility. There can be synergy between data management and these other organizations with regard to values for reference data and data architectures, metadata creation and definition, metadata topics for taxonomies, use of newer technologies that can handle all types of data, and governance (it may involve the same subject matter experts) at both the enterprise and project (requirements gathering) levels. The integration of structured and unstructured data is the challenge, especially if the unstructured data is on paper or other media. Eventually, the techniques of structured data management and data integration will converge with the techniques of the unstructured data world to help businesses overcome this challenge.

Patricia Cupoli, CCP, CDMP, CBIP, is the DAMA International ICCP Liaison, the DAMAi Project Manager for the Data Exam Development, ICCP Board President, and a past president of DAMA International, DAMA Chicago, and DAMA Philadelphia / Delaware Valley. She is the recipient of the 2006 DAMA International Professional Award. She may be reached at
