Every few years, the corporate information factory (CIF) is extended as architectural and technological advances occur in the industry. The highlights of the additions to the 2004/2005 CIF are the inclusion of:

  • Unstructured data,
  • Unstructured ETL,
  • Unstructured visualization, and
  • The virtual operational data store.

Unstructured data has been around for a long time. It includes e-mail, spreadsheets, text files, Word documents and more; typically, it is what you find on the desktop. Interestingly, there is a large world of structured data and a large world of unstructured data, but there has been very little intersection between the two all these years. That is a shame, because there is a lot of very valuable data in the unstructured world. Now that unstructured ETL technology exists, there is at last the potential for the two worlds to intersect.

One of the most intriguing new possibilities is unstructured visualization. Visualization today is really visualization of numbers and quantities: summarizations, drill down, drill across, detailed analysis and KPIs. All of this manipulation and visualization is based on the properties of numerical data. However, the fiber of unstructured data is made up of text, not numbers. Unstructured visualization is based on text, and it is the business intelligence of the unstructured world.

Perhaps the most interesting new addition to the CIF is that of the virtual operational data store (VODS). For a long time there has been talk of the virtual data warehouse. There have been many manifestations of the virtual data warehouse, the most prominent of which is the federated data warehouse. However, anything virtual in the world of data warehousing is pie in the sky. Because data warehousing requires a tangible, real foundation if it is going to do what it needs to do, "virtual" and "data warehousing" do not mix at all.

However, operational data stores (ODSs) are fundamentally different from data warehouses. The ODS reflects information only as of a single moment in time. The ODS reflects transitory data, not permanent data. Because of this fundamental difference, the ODS is architecturally different from the data warehouse, and having a virtual ODS is an entirely acceptable thing to do.

To highlight this difference, consider an example. You run a query against a data warehouse at 10:32 a.m. and get an answer of $4,981.07. Then you run an identical query against the same data warehouse at 7:18 p.m. What result should you get? It should be $4,981.07, not a penny more or less. Now consider a query against an ODS. You run a query at 11:15 a.m. and get an answer of $5,119.06. The same query at 4:13 p.m. yields an answer of $6,510.74. Is this a problem? Not at all. In the ODS environment, the data underlying the query can change from one instant to the next. Therefore, as time passes, the values the query is based on can change as well.
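The contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real warehouse or ODS: the class names `Warehouse` and `OperationalStore`, the account key `acct-1` and the snapshot label `2004-12-31` are all hypothetical. Only the dollar amounts and query times come from the example above.

```python
from decimal import Decimal


class Warehouse:
    """Hypothetical time-variant store: snapshots are appended, never updated."""

    def __init__(self):
        self._rows = {}  # (key, snapshot) -> value

    def load(self, key, snapshot, value):
        # A loaded snapshot is permanent, so refuse to overwrite it.
        if (key, snapshot) in self._rows:
            raise ValueError("snapshot already loaded")
        self._rows[(key, snapshot)] = value

    def query(self, key, snapshot):
        return self._rows[(key, snapshot)]


class OperationalStore:
    """Hypothetical ODS: holds only the current value, updated in place."""

    def __init__(self):
        self._current = {}

    def update(self, key, value):
        self._current[key] = value

    def query(self, key):
        return self._current[key]


warehouse = Warehouse()
warehouse.load("acct-1", "2004-12-31", Decimal("4981.07"))
morning_dw = warehouse.query("acct-1", "2004-12-31")  # run at 10:32 a.m.
evening_dw = warehouse.query("acct-1", "2004-12-31")  # run at 7:18 p.m.

ods = OperationalStore()
ods.update("acct-1", Decimal("5119.06"))
morning_ods = ods.query("acct-1")         # run at 11:15 a.m.
ods.update("acct-1", Decimal("6510.74"))  # operational activity during the day
afternoon_ods = ods.query("acct-1")       # run at 4:13 p.m.

print(morning_dw == evening_dw)    # True
print(morning_ods, afternoon_ods)  # 5119.06 6510.74
```

The warehouse query is repeatable because it names a snapshot that never changes. The ODS query names only a key, so its answer follows whatever the operational systems last wrote.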

Because of this transitory nature of ODS data, it is possible to have a virtual ODS. In a virtual ODS, the data needed for the query is gathered at the time the query is made. In a standard physical ODS, the data is gathered into the physical structure known as the ODS. In a virtual ODS, there is no physical infrastructure. This means that a virtual ODS is fast to build and is highly flexible.

The primary difference between a physical ODS and a virtual ODS is where resources are spent. In a physical ODS, resources are spent in building an infrastructure. When it comes time to make a query against a physical ODS, the query consumes very few resources. In a virtual ODS, no time is spent building an infrastructure, but a query against a virtual ODS consumes many more resources than a query against a physical ODS. In essence, the virtual ODS has to be rebuilt every time a query is made.
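A minimal sketch of where the resources go, assuming two in-memory "source systems" that count how often they are read. The names `Source`, `PhysicalODS` and `VirtualODS`, the customer keys and the amounts are hypothetical and exist only for illustration; a real ODS would sit on top of databases, not dictionaries.

```python
class Source:
    """Hypothetical operational system that counts how often it is read."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.reads = 0

    def read_all(self):
        self.reads += 1
        return dict(self.balances)


class PhysicalODS:
    """Pays up front: copies source data into its own structure on refresh."""

    def __init__(self, sources):
        self.sources = sources
        self.store = {}

    def refresh(self):
        self.store = {}
        for source in self.sources:
            for key, amount in source.read_all().items():
                self.store[key] = self.store.get(key, 0) + amount

    def query(self, key):
        # Cheap: a lookup in the prebuilt structure, with no source access.
        return self.store.get(key, 0)


class VirtualODS:
    """Pays per query: holds no data and gathers from every source each time."""

    def __init__(self, sources):
        self.sources = sources

    def query(self, key):
        return sum(source.read_all().get(key, 0) for source in self.sources)


billing = Source({"cust-1": 300, "cust-2": 125})
orders = Source({"cust-1": 200})

physical = PhysicalODS([billing, orders])
physical.refresh()
reads_after_build = billing.reads + orders.reads

physical_answers = [physical.query("cust-1") for _ in range(3)]
reads_after_physical_queries = billing.reads + orders.reads

virtual = VirtualODS([billing, orders])
virtual_answers = [virtual.query("cust-1") for _ in range(3)]
reads_after_virtual_queries = billing.reads + orders.reads

print(reads_after_build, reads_after_physical_queries, reads_after_virtual_queries)
```

In this sketch the three physical queries add no source reads beyond the two made at build time, while each virtual query touches both sources again. Both return the same answer; they differ only in when the work is done.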

There are other important differences between a virtual ODS and a physical ODS as well. The physical ODS is much less versatile than the virtual ODS. However, the virtual ODS is subject to a series of limitations:

  • What if the data underlying the virtual query is not integrated? With a virtual ODS, there is the possibility of getting really strange results.
  • What if the underlying data is being accessed by a process that will not share the resources, such as a reorganization? It is possible that a query can take a very long time.
  • What if a vendor of an underlying resource one day decides to not cooperate with the other technologies? Anyone thinking that Larry Ellison is going to happily coordinate with IBM or Microsoft hasn't been paying a lot of attention.
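The first limitation, unintegrated data, can be illustrated with a small sketch. Everything here is an assumption made for the example: two hypothetical source extracts that describe the same customer but disagree on the key format and on the unit of the amount field, plus the helper names `naive_virtual_total` and `normalize`.

```python
from decimal import Decimal

# Hypothetical extracts: same customer, different key formats and units.
billing_rows = [{"customer": "C-0001", "amount": Decimal("120.50")}]  # dollars
orders_rows = [{"customer": "c0001", "amount": Decimal("8025")}]      # cents


def naive_virtual_total(customer, *sources):
    """Combine rows at query time with no integration step."""
    return sum(
        row["amount"]
        for rows in sources
        for row in rows
        if row["customer"] == customer
    )


def normalize(row, dollars_per_unit):
    """Map a row onto one key format (C-0001) and one unit (dollars)."""
    digits = "".join(ch for ch in row["customer"] if ch.isdigit())
    return {
        "customer": "C-" + digits.zfill(4),
        "amount": row["amount"] * dollars_per_unit,
    }


# Without integration the orders row never matches the key, so it is
# silently left out of the total.
naive = naive_virtual_total("C-0001", billing_rows, orders_rows)

# With an integration step, both rows share a key and a unit.
integrated_rows = (
    [normalize(row, Decimal("1")) for row in billing_rows]
    + [normalize(row, Decimal("0.01")) for row in orders_rows]
)
integrated = naive_virtual_total("C-0001", integrated_rows)

print(naive)       # 120.50
print(integrated)  # 200.75
```

The naive total is not an error message; it is a plausible-looking number that happens to be incomplete, which is what makes results from unintegrated sources so strange. A physical ODS does this normalization once when it is loaded, whereas a virtual ODS has to do it, or skip it, on every query.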
