Photographed by Ray Ng
Bill Inmon's Data Warehouse 2.0 tackles industry trends, unstructured data and the data lifecycle
Of all the mainstream business technologies, classic data warehousing might well be considered the least evolved in terms of practice and approach. And while data warehousing continues to spread as a foundational technology, the availability of information and speed of change have led businesses to add operational data strategies to the data warehouse mix - something the textbook writers didn't originally envision. So it might well be left to one of the fathers of the industry to update the definition of the data warehouse. That person is Bill Inmon, president of Inmon Data Systems. His model - called Data Warehouse 2.0 - is delivered with a complete architecture, outright enthusiasm - and a little ambivalence. You'll find the technical details in Inmon's recent articles in DM Review and at his Web site, www.inmoncif.com; more recently, DM Review Editorial Director Jim Ericson spoke with Bill Inmon for a philosophical take on DW 2.0.
DMR: Why do we need Data Warehouse 2.0?
Bill Inmon: There are two reasons for DW 2.0. The first is the integrity of the definition, because I feel there are too many definitions floating around. The second is the need for a vision for the future of data warehousing, which I believe a lot of people in the industry have wrong. The competing definitions came from confusion and from vendors trying to sell products. There were people building transactional systems they were calling a data warehouse; people building federated versions of a data warehouse; people building data marts that they were calling a data warehouse. Those are just some of the renditions.
DMR: What are the main distinctions between DW 2.0 and DW 1.0?
BI: The first major distinction is that DW 1.0 never recognized the lifecycle of data within the corporation. DW 1.0 said, "Here's some data." DW 2.0 says, "Here's the data; it has a lifecycle, and each of the different portions of the lifecycle has unique characteristics." The second major difference between DW 1.0 and 2.0 is the recognition that unstructured data and structured data should both contribute to the data warehouse. There is a wealth of information in the world of unstructured technology, but it has to be built properly for the data warehouse.
DMR: We'll get to unstructured data in a moment. First, your DW 2.0 model adds an "interactive" zone to address systems that don't meet the definition of a data warehouse. Is this a concession to the need to leverage operational data?
BI: Well, DW 1.0 was never meant to do transactional processing. Yet certain vendors have something called an "active data warehouse," and they insist on doing transaction processing in the data warehouse. So, if we're going to be doing transaction processing in the data warehouse, let's at least do it with recognition of the architectural principles that are needed for both data warehousing and transaction processing.
DMR: But you're not sold on the idea of collecting near real-time operational information and aligning it with historical information?
BI: This was never the intent of a data warehouse. We've always recognized the need for operational reporting and operational analytical processing; it's just that it was in a different bucket. Some vendors are trying to say there should be one source of data for all reporting. That has never been true. But, if you're going to do transaction processing in the data warehouse, do it in the interactive sector where you can meet the architectural necessities.
DMR: How does this new interactive zone work in the context of the classic "integrated" data warehouse?
BI: It eliminates the confusion of technology. In the interactive sector, there is a whole art to getting good, high-performance processing in a transaction environment. You've got to mold transactions a certain way. You need transaction and data integrity; you need queue management. In the integrated technology, you're able to store a lot more data. You don't have to worry about transaction workload in terms of uniformity; you don't have to worry as much about queue management. By having the sectors separate, you're able to apply different technologies that are optimized on different things. The notion of the system of record in DW 2.0 is of data that is spread over different sectors. Part of the system of record is in one sector, another part is in another sector and so on. It's important, but in the grand scheme of things, it's rather minor; for all practical purposes the integrated sector is the old classical data warehouse.
DMR: Getting to unstructured data, the holy grail of DW 2.0 seems to be the idea of "structuring" text and making it available in the interactive zone of the data warehouse.
BI: That is correct, and we have been working on this for about three years. Right now, technology is divided into camps. In structured technology, you've got products like Business Objects, Oracle and DB2; in the unstructured world you have products like ClearForest, Convera and Documentum. The idea is to bring unstructured data to where you can leverage an analytical technology that's already in place. But, if you just proceed with the idea of bringing unstructured data to the structured world, you're building a data junkyard. You need to integrate textual data before you bring it into the structured environment. So we're definitely not talking about a search engine; we're talking about a textual integration engine.
DMR: Can you give us an example of a "textual integration engine"?
BI: Sure. We have been working with a large health care institution. As we brought information to the structured environment, we kept seeing the term "HA" appear. In medical circles, it turns out that if you are a cardiologist, HA means heart attack. If you're an endocrinologist, HA means hepatitis A. If you're any other kind of doctor, HA means headache. So if you want to have proper meaning in the structured environment, you've got to condition the data before it gets there. You've got to ask the question, "Is the source of this data a cardiologist?" If so, convert it to "heart attack." If the source of this data is an endocrinologist, convert it to "hepatitis A" and so on. By the time the textual data arrives in the structured environment, it is no longer "HA"; it is specifically headache, heart attack or hepatitis A. It gets much more complicated than this, of course. I'm told there are 20 different ways a doctor can describe a broken bone. But the point is, you don't simply take a search engine and a bunch of data and throw it into your data warehouse because that's not meaningful.
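The "HA" example can be illustrated with a minimal sketch. It is not Inmon's engine, only one way the conditioning rule he describes might look, assuming each note arrives tagged with the specialty of the doctor who wrote it. The names SPECIALTY_EXPANSIONS, DEFAULT_EXPANSIONS and condition_text are hypothetical, and the lookup holds only the three meanings mentioned in the interview.

```python
import re

# Hypothetical lookup, limited to the meanings given in the interview:
# (abbreviation, source specialty) -> expanded term.
SPECIALTY_EXPANSIONS = {
    ("HA", "cardiologist"): "heart attack",
    ("HA", "endocrinologist"): "hepatitis A",
}

# Fallback meaning used for any other kind of doctor.
DEFAULT_EXPANSIONS = {"HA": "headache"}

# Match known abbreviations only as whole, case-sensitive words.
_ABBREVIATION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in DEFAULT_EXPANSIONS) + r")\b"
)


def condition_text(text: str, specialty: str) -> str:
    """Expand ambiguous abbreviations according to the source specialty."""
    key = specialty.strip().lower()

    def expand(match: re.Match) -> str:
        token = match.group(0)
        # Prefer the specialty-specific meaning, else the default one.
        return SPECIALTY_EXPANSIONS.get((token, key), DEFAULT_EXPANSIONS[token])

    return _ABBREVIATION_PATTERN.sub(expand, text)


if __name__ == "__main__":
    note = "Patient reports HA since Monday"
    for source in ("cardiologist", "endocrinologist", "pediatrician"):
        print(source, "->", condition_text(note, source))
```

The point of the sketch is that the same source text yields three different values depending on who wrote it, which is why the expansion has to happen before the data reaches the structured environment.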
DMR: Once you've conditioned and imported the data, does it appear in a table?
BI: Yes, it ends up in a relational format. The first nice thing that happens is that you can now put your unstructured data in DB2, Oracle, Teradata, NT or SQL Server, where standard technology is able to work on the unstructured data you have treated. You can start to use standard business intelligence tools on the unstructured data itself. You've crossed the bridge, and now you're taking advantage of the infrastructure organizations have paid so much money for.
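As a rough illustration of what "relational format" could mean here, the sketch below loads already-conditioned terms into a table and runs an ordinary SQL aggregate over them. SQLite stands in for the commercial databases Inmon names, and the table name note_terms, its columns and the sample rows are all invented for the example.

```python
import sqlite3

# Invented sample rows: (patient_id, source specialty, conditioned term).
SAMPLE_ROWS = [
    (1, "cardiologist", "heart attack"),
    (2, "endocrinologist", "hepatitis A"),
    (3, "pediatrician", "headache"),
    (4, "cardiologist", "heart attack"),
]


def load_terms(rows):
    """Create an in-memory table and load the conditioned terms into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE note_terms (patient_id INTEGER, specialty TEXT, term TEXT)"
    )
    conn.executemany("INSERT INTO note_terms VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def term_counts(conn):
    """Count how often each conditioned term occurs, using plain SQL."""
    cursor = conn.execute(
        "SELECT term, COUNT(*) FROM note_terms GROUP BY term ORDER BY term"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = load_terms(SAMPLE_ROWS)
    for term, count in term_counts(connection):
        print(term, count)
```

Once the text is in rows like these, any tool that speaks SQL can group, join or filter it alongside the structured data, which is the bridge Inmon refers to.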
DMR: Most unstructured data strategies we've seen are merged at a higher layer, partly because content management systems are associated with current workflows and form-based approvals, such as check-in and prescriptions. Do those things continue to run separately?
BI: Yes, but we're not just extracting data from form fields. In our health care example, we are taking tens of thousands of patient records at a time, reading and ingesting them whole and coming up with patterns. In one study, doctors knew there was a relationship between different types of cancer, but in those 10,000 records are all kinds of other correlated factors that had gone unnoticed. The first thing that they said was, "Wow, this ability to look at 10,000 patient records all at once and produce meaningful results is one very powerful usage of unstructured technology."
DMR: How do you walk into the CEO's office and make a case for DW 2.0?
BI: I go back to how you would make a case for data warehousing in the first place. The whole subject of return on investment has vexed me. I have seen people take a macro approach. They say, "Okay, the data warehouse was installed, and the stock price of the corporation began to rise." I don't think data warehousing is particularly relevant to a measure like that. I started looking at a micro level and said, "Consider two companies. One company has a data warehouse, one company doesn't. What are the different information capabilities of these two companies?"
The second major case is speed of information. Once the data warehouse is built, the ability to get the information quickly in the hands of the right person in the corporation is greatly facilitated. With DW 2.0, there's also the issue of the data lifecycle, and by recognizing that, you can cut the cost of the data warehouse dramatically.
DMR: Will this lead business executives to believe they'll get an answer to a specific problem in short order?
BI: One of the problems of starting with a business problem is that inevitably, your data warehouse becomes very biased toward the solution being built. The truth is, to build a data warehouse successfully, you need to free it from the boundaries of any one given application.
DMR: Yet many executives and technologists are short-term thinkers; they want to be responsive and flexible.
BI: I work with some bright, long-term thinking people, but for others, vision and architecture are simply not on the agenda and you go through the same thing every two years. That's one of the frustrations of our industry. You need vision because what we're talking about with DW 2.0 is brand new; the technology is just starting to appear and needs to mature. When it does, we'll have a whole new class of systems that will allow you to merge two worlds.