For years, two environments have grown up side by side - the unstructured environment and the structured environment. The unstructured environment is filled with informal systems built on e-mail, spreadsheets, texts and reports. The structured environment is formal and is filled with transactions, databases and operating systems. Important business of the corporation occurs in both places. However, these worlds might as well be as far apart as Beijing, China, and Rio de Janeiro, Brazil.

In years past, the practitioners of the structured world learned about the evils of stovepipe systems. With stovepipe systems, there was no integration of data across the corporation. There was no foundation of reusable data. There was no historical data, to any great extent. In short, stovepipe systems caused more long-term grief to the IT department than the Y2K problem ever did.

Stovepipe systems originated because of the inability of application developers and systems personnel to look beyond their immediate surroundings and see the larger picture. The result was a long-term, architectural nightmare from which some organizations, including the IT departments of the government, are still wondering how to extract themselves.

Does any of this sound familiar? Are the structured people and the unstructured people merely building today's silos of unintegrated information? How often does the user of unstructured systems stop to wonder how e-mail messages will integrate with structured systems? The answer is either almost never or never. When we step back and look at the larger picture, it is clear that we are busy building stovepipe systems once again in the worlds of structured systems and unstructured systems. Didn't we learn anything the first time around?

So you ask: How do I integrate these two very different environments? There is so much that is different about them - is it even possible to achieve integration from one environment to the other? The answer is yes. There are challenges; however, integration between the two worlds is absolutely a possibility.

In order to achieve integration between the two worlds, it is necessary to contemplate "unstructured ETL" (extract, transform and load) processing. ETL processing has been around for a long time. However, in the early renditions of ETL, the transformation was always from legacy structured applications into a decision support system (DSS) data warehouse environment, which is also structured.

In order to integrate the structured environment and the unstructured environment, it is necessary to create a completely different form of ETL - unstructured ETL.

In order to build an unstructured ETL environment, it is necessary to accomplish three tasks:

  • The access and selection of unstructured data,
  • The editing and manipulation of unstructured data, and
  • The integration of unstructured data into the structured environment.

Access of unstructured data: The first challenge is access. Unstructured data must be read in its native format, which means being able to read e-mail, .txt, .doc and PDF files, among many others. However, reading the files is only the first step. The next step is selecting the important text from the unstructured data. One of the features of the unstructured environment is that it contains a lot of blather that is not germane to business. Part of the access process is separating the blather from the business.
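As a rough illustration of access and selection, the sketch below reads a plain-text e-mail with Python's standard email package and keeps only the sentences that mention a business term. The BUSINESS_TERMS vocabulary, the function names and the sample message are assumptions made for the example. Real selection logic would be far richer than a keyword test.

```python
import email
import re
from email import policy

# Hypothetical vocabulary; a real list would come from the business itself.
BUSINESS_TERMS = {"invoice", "contract", "order", "payment"}


def extract_text(raw_message: str) -> str:
    """Read an e-mail in its native format and return its plain-text body."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content() if body is not None else ""


def select_business_sentences(text: str) -> list[str]:
    """Keep only the sentences that contain at least one business term."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    selected = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words & BUSINESS_TERMS:
            selected.append(sentence)
    return selected


# Made-up sample message mixing blather and business.
RAW = (
    "From: pat@example.com\n"
    "To: lee@example.com\n"
    "Subject: lunch and invoice\n"
    "\n"
    "Great game last night! The invoice for order 1182 is still unpaid. "
    "See you at lunch.\n"
)

for sentence in select_business_sentences(extract_text(RAW)):
    print(sentence)
```

Only the sentence about the invoice survives. The remarks about the game and lunch are the blather that the access step is meant to filter out.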

Editing/manipulation of data: After the unstructured data has been read, the next step is to edit and manipulate that data. Some forms of simple editing include "stop word" analysis, where common words (such as a, an, the, of, which, when, and that) are removed. After the stop words are removed, the remaining words are edited, reducing them to their stems. In doing so, it can be recognized that the words move, moving, moved and moves are all branches of the same stem. A much more meaningful understanding of words results from working with word stems. Relationships between words can then be established, and so forth. Many forms of editing are possible, depending on what the ETL tool is meant to do.
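The two editing steps just described can be sketched in a few lines. The stop-word list is the one given above. The stemmer is a deliberately naive suffix stripper written for this example, not an established algorithm such as Porter's, and the function names are assumptions.

```python
import re

# Stop words taken from the examples in the text.
STOP_WORDS = {"a", "an", "the", "of", "which", "when", "that"}

# Naive suffix list, tried in order; an assumption made for illustration only.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def edit_text(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, then reduce each word to its stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


print([stem(w) for w in ("move", "moving", "moved", "moves")])
print(edit_text("Moving the boxes that moved"))
```

With this crude rule set, move, moving, moved and moves all collapse to the same stem, which is the behavior the paragraph describes. A production tool would use a proper stemming or lemmatization library.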

Integration into the structured environment: After access and editing occur, the next step is integration into the structured environment. If it is desired to have a firm integration into the structured environment, there must be an assurance that one piece of data is the same from one environment to the next. For example, suppose there is a "Dan Meers" in the structured environment and a "Dan Meers" in the unstructured environment. Certainly the names match, but that does not necessarily mean that both references are to the same person. In order to integrate the information, it is necessary to go to a much deeper level of matching.
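One way to picture a deeper level of matching is to require corroborating evidence beyond the name. In the sketch below, two references count as the same person only if the names agree and at least one other identifier, such as an e-mail address or phone number, also agrees. The PersonRef type, its fields, the sample addresses and the matching rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PersonRef:
    """A reference to a person found in either environment (hypothetical)."""
    name: str
    email: Optional[str] = None
    phone: Optional[str] = None


def normalize(value: Optional[str]) -> Optional[str]:
    """Lowercase and remove whitespace so trivial differences do not matter."""
    return "".join(value.lower().split()) if value else None


def same_person(a: PersonRef, b: PersonRef) -> bool:
    """Names must match AND at least one other identifier must match."""
    if normalize(a.name) != normalize(b.name):
        return False
    for field in ("email", "phone"):
        left = normalize(getattr(a, field))
        right = normalize(getattr(b, field))
        if left is not None and left == right:
            return True
    return False


# Structured record versus mentions pulled from unstructured text (made-up data).
customer = PersonRef("Dan Meers", email="dan.meers@example.com")
mention_same = PersonRef("dan meers", email="Dan.Meers@Example.com")
mention_other = PersonRef("Dan Meers", email="dmeers@other.example")
mention_bare = PersonRef("Dan Meers")

print(same_person(customer, mention_same))   # name and e-mail agree
print(same_person(customer, mention_other))  # name agrees, e-mail differs
print(same_person(customer, mention_bare))   # name alone is not enough
```

The design choice here is conservative: a name match with nothing to corroborate it is treated as unresolved rather than as a match, since a false link between the two environments is usually worse than a missed one.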

Unstructured ETL

These are just a few thoughts about what is meant by unstructured ETL. Of course, these thoughts require an extension from theory into reality; however, given the tremendous push to integrate, it is predictable that unstructured ETL is right around the corner.
