
Structured + Unstructured Data = N-Structured Data

  • December 16 2004, 1:00am EST

For decades, vendors and enterprises have treated applications that mine structured data (databases) and unstructured data (text) as completely separate. An analyst drilling into an OLAP cube navigates through numbers, not memos; a person submitting a search query views text results, not report totals. However, smart companies are starting to recognize that this application-centric separation of data types is a time-waster - after all, users just want information, no matter what application it resides in - and these application boundaries are starting to blur.

Users Want to Navigate Across All Data Structures

The result is that BI vendors are starting to investigate search technology; text miners are getting better at generating metrics; and forward-thinking vendors are investigating ways to connect structured and unstructured data (often via XML) so that users can move seamlessly between the two worlds. Put simply, the design problem is being turned on its head. Rather than assuming a data structure and then designing an application so the user can navigate through it, designers assume a curious user, and a morphing data structure (n-structure) handles the user's navigation across all types of data.

Options 1 and 2: Structured Informs Unstructured and Vice Versa

For many years, BI vendors have depended on their users' familiarity with their software for navigation: "Sales by region? Oh, that's in report 62." However, as large corporate rollouts increase the BI user population, many of the users arrive untutored and don't use the software enough to become experts. In this case, using search to help users find the appropriate report makes a lot of sense.
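As a rough illustration of search helping an untutored user locate a report, the sketch below ranks entries in a small report catalog by how many query words appear in each title. The catalog contents, the stopword list and the find_reports function are all assumptions made for this example; a real BI product would index far richer metadata.

```python
# Hypothetical report catalog; the IDs and titles are illustrative only.
REPORTS = {
    62: "Sales by region",
    17: "Inventory turnover by warehouse",
    48: "Customer profitability by segment",
}

# Common words that should not count as a match (an assumed, minimal list).
STOPWORDS = {"by", "the", "of", "for"}


def keywords(text):
    """Lowercase the text and return its words, minus stopwords, as a set."""
    return set(text.lower().split()) - STOPWORDS


def find_reports(query, reports=REPORTS):
    """Return (report_id, title) pairs ranked by the number of matching words."""
    terms = keywords(query)
    scored = []
    for report_id, title in reports.items():
        overlap = len(terms & keywords(title))
        if overlap:
            scored.append((overlap, report_id, title))
    # Best match first; ties are broken by report ID so the order is stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(report_id, title) for _, report_id, title in scored]


print(find_reports("sales by region"))
```

The user no longer needs to remember that the answer lives in report 62; typing the question is enough to surface it.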

Unstructured data technology can help its counterpart; the reverse is true as well. Nowadays, text mining and categorization software can help characterize text, whether it be in memos, e-mails or other forms. For example, text mining, by trawling through e-mails and customer support call logs, can score a customer's propensity to buy or default on his bills. All of a sudden, unstructured text can supply metrics that can be crunched by BI applications.
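To make the idea of text supplying metrics concrete, here is a minimal sketch that turns support messages into a per-customer number. The signal phrases, their weights and the default_risk_scores function are hypothetical; commercial text mining software would typically learn such signals from data rather than rely on a hand-built list.

```python
from collections import defaultdict

# Hypothetical signal phrases and weights, hard-coded purely for illustration.
RISK_SIGNALS = {
    "late payment": 2.0,
    "cancel": 1.5,
    "refund": 1.0,
}


def default_risk_scores(messages):
    """Sum the signal weights found in each customer's messages.

    messages is an iterable of (customer, text) pairs; the result maps each
    customer to a numeric score that a BI tool could load as a measure.
    """
    scores = defaultdict(float)
    for customer, text in messages:
        lowered = text.lower()
        for phrase, weight in RISK_SIGNALS.items():
            scores[customer] += weight * lowered.count(phrase)
    return dict(scores)


# Illustrative call log entries, not real data.
call_log = [
    ("Acme", "Second late payment notice; customer wants to cancel."),
    ("Acme", "Asked about a refund."),
    ("Globex", "Happy with delivery."),
]
print(default_risk_scores(call_log))
```

Once the text has been reduced to a score per customer, it can sit in a table next to sales and profitability and be sliced like any other measure.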

Option 3: Connecting the Structured and Unstructured Worlds

The third option is admittedly the least mature, but also the one with the greatest promise. This is the ability to identify common entities (e.g., names of customers, products and companies) as a way to tie the structured and unstructured data together. The arrival of XML and its support of semi-structured data is integral to this form of integration.

Today, an analyst at a component manufacturer may drill into an OLAP cube to discover that sales to HP are down, compared to previous quarters. At that point the analyst is at an analytical dead-end - time to pick up the phone, call some fellow worker and ask, "What's up?" If, on the other hand, the analyst were able to drill "sideways" into a repository of memos and e-mails discussing HP, he might be able to quickly understand the history and cause of the sales decline.

The reverse would apply as well - a vice president receiving a memo discussing sales to HP could click on the term "HP" and instantly receive internal corporate metrics about the HP account, including total sales and profitability. Such an auto-hyperlinking capability is not science fiction - Microsoft ships a form of it in its smart tags technology within Office 2003.
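Both directions of this navigation can be sketched with a shared entity name acting as the join key. Everything below is an assumption for illustration: the account figures, the memo text and the function names are invented, and matching on an exact name is far simpler than the entity extraction a production system would need.

```python
import re

# Hypothetical structured data: account metrics keyed by entity name.
ACCOUNT_METRICS = {
    "HP": {"total_sales": 1200000, "profit_margin": 0.18},
    "Dell": {"total_sales": 950000, "profit_margin": 0.21},
}

# Hypothetical unstructured data: a small repository of memos.
MEMOS = [
    "HP has delayed its printer refresh until next quarter.",
    "Dell asked for revised pricing on connectors.",
    "Lunch menu for Friday.",
]


def mentions(entity, text):
    """Return True if the entity appears in the text as a whole word."""
    return re.search(r"\b" + re.escape(entity) + r"\b", text) is not None


def drill_sideways(entity, memos=MEMOS):
    """Structured to unstructured: find the memos that discuss an entity."""
    return [memo for memo in memos if mentions(entity, memo)]


def metrics_for_text(text, metrics=ACCOUNT_METRICS):
    """Unstructured to structured: look up metrics for entities named in text."""
    return {name: values for name, values in metrics.items() if mentions(name, text)}


print(drill_sideways("HP"))
print(metrics_for_text("Memo: sales to HP slipped again this month."))
```

The first call stands in for the analyst leaving the OLAP cube to read about HP; the second stands in for the vice president clicking "HP" in a memo to see the account's numbers.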

The Drivers: Ubiquitous PCs, Inexpensive Storage and the Web Browser

It is worth noting that this level of integration will not go away, but rather intensify, due to ongoing changes in information infrastructure. Years ago, structured and unstructured data were kept separate because they were stored differently - structured data lived in online databases, while unstructured data resided in printed memos. However, three drivers have made the two equally accessible online: ubiquitous PCs, inexpensive storage and the Web browser.

Twenty-five years ago, the PC was just being invented; computer terminals were for data entry, not data browsing, and most of a corporation's data resided in file cabinets. Today, with a full-featured PC costing less than $1,000, virtually every office worker has a terminal on their desk. Furthermore, almost all corporate data is created digitally - via Microsoft Word or Excel or in e-mail packages. Corporate data being online is now the rule, rather than the exception.

Echoing the price drop of PCs, disk drives and memory chips have also become inexpensive. Data warehouses of 150+ terabytes are no longer uncommon; 80GB disk drives now go for under $100; 1GB secure digital cards are available for PDAs. The upshot of all this inexpensive storage is that all of a corporation's transactions and musings can be accessed at the touch of a button - it is now cheaper to store everything than to spend time deciding what to keep and what to archive.

Finally, the Web browser has melded the two data types into a single viewer. In the past, users viewed structured data within a BI application or Microsoft Excel and perused unstructured data with Microsoft Word or Adobe Acrobat. Today, with a little programming wizardry behind the scenes, a Web browser displays both types of data equally effectively; in fact, when browsing a Web page, it's hard to discern which section of the page is generated from a database and which is free-standing text.

Use the N-Structured Data Viewpoint to Attack Switching Time Loss

Due to this sea change in infrastructure, both vendors and enterprises need to rethink how users discover and access information. Both BI and search vendors have done a wonderful job of improving usability and query time within their solutions; productivity within standalone applications is now quite high. The main productivity drag is now "switching" time - that is, the time users spend bopping back and forth between applications searching for answers. Software developers must stop thinking, "We handle only databases" or "We handle only text." Only by thinking in terms of n-structured data will application builders free themselves of past viewpoints and start attacking the next hurdle in user productivity.
