Sometimes we are so overwhelmed by the masses of data that must be managed that it is difficult to see the many ways that we can learn from information. However, the ability to link individual objects together into patterns and networks is a valuable enhancement to many knowledge-discovery applications.

The key to this kind of analysis is the ability to look at lots of things, figure out the sets of attributes that distinguish them, and then find the small collections of things that can be linked together based on attribute similarities. The operative word here is similarity. While it is relatively easy to link two records when most of the values are identical, it is much more of a challenge to connect two data instances when their values don't match exactly, but upon inspection, clearly refer to the same thing.

Similarity analysis gets more complex as the amount of information carried within a data attribute increases, because similarity scoring depends on both the values being compared and the context in which those values exist. For example, comparing the closeness of two integers or two floating-point numbers assumed to be drawn from a continuous range is straightforward: the absolute difference between them serves as a distance. Yet if the compared integers are actually code values mapped to more complex data domains, that numeric distance is meaningless, because the gap between two codes says nothing about how alike the things they stand for are.
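As a minimal sketch of that contrast, the Python fragment below compares a numeric distance with an equality-only comparison for coded values. The marital-status code table and both function names are invented for illustration; they are assumptions, not part of any particular tool.

```python
def numeric_distance(a, b):
    """Absolute difference: meaningful only for values on a continuous scale."""
    return abs(a - b)


# Hypothetical code table, invented for this illustration.
MARITAL_STATUS = {1: "single", 2: "married", 3: "divorced", 4: "widowed"}


def code_distance(a, b):
    """For coded values only equality is meaningful: 0.0 if equal, else 1.0."""
    return 0.0 if a == b else 1.0


# Two temperature readings: the numeric gap is a sensible measure of closeness.
print(numeric_distance(98.6, 99.1))

# The same arithmetic on codes yields 3, but "single" is not "three away"
# from "widowed"; the number is an artifact of how the codes were assigned.
print(numeric_distance(1, 4), MARITAL_STATUS[1], MARITAL_STATUS[4])

# Treating the codes as labels gives an honest answer: they simply differ.
print(code_distance(1, 4))
```

A richer scheme could look up a distance defined over the underlying domain itself, but that requires knowledge about the domain that the codes alone do not carry.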

Things get more interesting when we look at character strings, such as person names, business names or addresses. Without a set of qualitative rules, it is tough to automatically determine that "IBM," "Intl Bus Mach" and "International Business Machines" all refer to the same entity. Knowing the rules for establishing closeness is the voodoo that makes similarity scoring more of an art than a technology. In fact, similarity scoring schemes may change from data set to data set, with different emphases depending on the source of the data, its completeness and other meta-attributes.

This concept of "linkage" permeates the information world. Here are three examples:

Merge/Purge: Merge/purge is a process of comparing one data set against itself or combining two or more data sets to identify duplicate records in the merged set. Typically, a number of attributes from each set are assumed to contain overlapping information. Naively, the attribute values from each record of the first set are compared against the attribute values from each record of the second set. If the similarity of the compared records is above a certain threshold, it is assumed that the records may be linked and possibly consolidated into a single record.
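The naive comparison described above, combined with the kind of name rules mentioned earlier, might be sketched as follows. This is only an illustration under stated assumptions: the abbreviation table, the acronym rule, the equal weighting of fields, the 0.85 threshold and the sample records are all invented here, and the character-level score comes from Python's standard difflib module rather than from any method the merge/purge process prescribes.

```python
import re
from difflib import SequenceMatcher
from itertools import product

# Hypothetical abbreviation rules, invented for this illustration.
ABBREVIATIONS = {"intl": "international", "bus": "business", "mach": "machines"}


def normalize(name):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return [ABBREVIATIONS.get(t, t) for t in tokens]


def name_similarity(a, b):
    """Score two names in [0, 1] after applying the rules above."""
    ta, tb = normalize(a), normalize(b)
    # Assumed rule: a one-word name equal to the other name's initials matches.
    for short, full in ((ta, tb), (tb, ta)):
        if len(short) == 1 and len(full) > 1:
            if short[0] == "".join(t[0] for t in full):
                return 1.0
    return SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()


def record_similarity(r1, r2):
    """Average the name score and a plain character score on the city."""
    city = SequenceMatcher(None, r1["city"].lower(), r2["city"].lower()).ratio()
    return (name_similarity(r1["name"], r2["name"]) + city) / 2


def candidate_links(set_a, set_b, threshold=0.85):
    """Compare every record in set_a with every record in set_b."""
    links = []
    for (i, r1), (j, r2) in product(enumerate(set_a), enumerate(set_b)):
        score = record_similarity(r1, r2)
        if score >= threshold:
            links.append((i, j, round(score, 3)))
    return links


# Illustrative records only.
set_a = [{"name": "IBM", "city": "Armonk"},
         {"name": "Acme Corp", "city": "Boston"}]
set_b = [{"name": "Intl Bus Mach", "city": "Armonk"},
         {"name": "Apex Ltd", "city": "Denver"}]

links = candidate_links(set_a, set_b)
print(links)
```

Because every record in one set is compared against every record in the other, the work grows with the product of the two set sizes, which is why this approach is called naive. The links it reports are candidates for consolidation, not confirmed duplicates.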

Householding: Householding originally referred to identifying multiple individuals that live in the same household. Today, householding can be used to describe any consolidation of individual entities based on some determinable criteria. One example is the determination of corporate ownership structure, in which the hierarchy of company/subsidiary relationships and the corresponding owners is explored. Another example is direct sales enhancement through collaborative filtering, in which individuals are encouraged to make transactions that mimic those made by other individuals with similar characteristics.
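In its original sense, householding amounts to grouping individuals under a shared key. The sketch below assumes one possible criterion, the same surname at the same normalized address; the field names, the normalization and the sample people are hypothetical, and any other determinable criterion could be supplied as the key function instead.

```python
from collections import defaultdict


def household_key(person):
    """Assumed criterion: same surname and same normalized street address."""
    address = " ".join(person["address"].lower().replace(".", "").split())
    return (person["surname"].lower(), address)


def households(people, key=household_key):
    """Group first names under whatever key the chosen criterion produces."""
    groups = defaultdict(list)
    for person in people:
        groups[key(person)].append(person["first"])
    return dict(groups)


# Illustrative records only.
people = [
    {"first": "Ann", "surname": "Lee", "address": "12 Oak St."},
    {"first": "Bo", "surname": "LEE", "address": "12  oak st"},
    {"first": "Cy", "surname": "Diaz", "address": "9 Elm Ave"},
]

result = households(people)
print(result)
```

Exact keys like this one only catch variations the normalization anticipates; a similarity score over the key fields would be needed to link addresses that differ in less predictable ways.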

Social Networks: I recall an early social network analysis that exposed networks of common interest by evaluating the "sent-to," "cc" and "from" lists of e-mails and messages posted to Usenet newsgroups. The results showed that the correspondents eventually "self-organized" into much smaller groups that discussed topics of common interest. One critical value of a process like this is that it not only exposes the existence of a connection between individuals but, more importantly, makes it possible to draw "meta data inferences" about those individuals.

What I mean by meta data inference is that, aside from classifying the individuals into special interest groups, I can infer other valuable information, such as the likelihood that the individuals within a group know each other, the extent to which specific individuals are connected based on participation and depth of thread topics, and the influence of specific parties.
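A rough sketch of how such groups and connection strengths could be derived from message headers follows. It is not a reconstruction of the analysis I recall: the message records are invented, tie strength is assumed to be a simple count of messages two people appear on together, and groups are taken to be the connected components of the resulting graph, which is one simple choice among many.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical message headers, invented for this illustration.
messages = [
    {"from": "ann", "to": ["bob"], "cc": ["cat"]},
    {"from": "bob", "to": ["ann"], "cc": []},
    {"from": "dan", "to": ["eve"], "cc": []},
]


def tie_strengths(msgs):
    """Count how many messages each pair of people appears on together."""
    ties = Counter()
    for m in msgs:
        people = sorted({m["from"], *m["to"], *m["cc"]})
        for a, b in combinations(people, 2):
            ties[(a, b)] += 1
    return ties


def interest_groups(ties):
    """Return the connected components of the correspondence graph."""
    neighbors = defaultdict(set)
    for a, b in ties:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk from an unvisited person
            person = stack.pop()
            if person not in group:
                group.add(person)
                stack.extend(neighbors[person] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


ties = tie_strengths(messages)
groups = interest_groups(ties)
print(ties)
print(groups)
```

Under these assumptions, the pair counts stand in for the extent to which individuals are connected, and the number of distinct correspondents a person has could serve as a crude proxy for influence.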

This list is by no means exhaustive, and it only goes to show how much information is embedded within certain kinds of data. The key is to recognize the power of connectivity as a source of knowledge and to understand the role that similarity plays in establishing connectivity.
