Meta data has long been the Wednesday's child of information processing systems. From corporate failures with data dictionary in the 1970s to IBM's legendary failure with repository in the 1980s, meta data has presented the data processing community with a seemingly intractable problem. And as organizations grow large, the problems with meta data multiply. However, the problems associated with meta data present concomitant opportunities. Nowhere are the problems and opportunities with meta data more apparent than in the large-scale enterprise data warehouse environment that is centered around a terabyte-size data warehouse.
Why Does Meta Data Have Problems?
Why is it that meta data carries with it such burdensome problems? There are a multitude of reasons for the travails of meta data.
Stretching Meta Data
The first and probably most profound reason why meta data poses such a problem is that each unit of meta data is being stretched in opposing directions by two very strong forces. In any environment where there is a very hard and constant pull, there is inherent instability. In order to understand this dramatic pull of diametrically opposing forces against meta data, consider the diagram shown in Figure 1.
Figure 1 shows that each unit of meta data is attracted to the notion that there should be consistency and uniformity of meta data across the organization. If there is ever to be integration and uniformity of language and meaning of data across the corporation, meta data must be at the core of that effort. One department calls revenue one thing, and another department calls revenue something else. There needs to be some common meeting ground where each department can describe what it means by its own terms.
Not surprisingly, meta data is the logical place for the creation of the basis of shareability and a corporate uniform definition of data. There is then a valid and strong pull against meta data toward a centralized, very disciplined approach.
In years past, data dictionaries and centralized repositories have espoused this viewpoint of shareability of data and meta data. This perspective is undoubtedly valid. Unfortunately it is not a complete perspective, because there is another strong and opposite force operating against meta data at the same time.
Autonomy of End-User Processing
There is the notion that the end user should have autonomy of processing. The end user has powerful tools such as spreadsheets and a wide variety of other tools designed to provide immediate control of his/her analytical processing. When the end user has his/her own personal computer, his/her own software and his/her own data, the end user is free to do analysis unfettered by other concerns and constraints. An environment of creativity and spontaneity is fostered by end-user tools.
In order to understand the autonomy of end-user analysis, consider an analyst sitting in an office building at 10:30 p.m. doing an analysis for an important meeting at 8:00 a.m. the next morning. The analyst is using a spreadsheet. At first the analyst has defined a variable AMT to be equal to WEEKLY_REV - WEEKLY_EXP. But the analyst decides a better way to express the analysis is to create two variables: AMT and AMTDEPT. AMT is now defined to be WEEKLY_REV - (WEEKLY_EXP + WEEKLY_OHVD). AMTDEPT is taken to mean WEEKLY_REV BY DEPARTMENT. The analyst has just created two new variables and destroyed another. Does the analyst need the permission of a central meta data management group to make this change?
What is a spreadsheet (and for that matter, most end-user analysis) but a collection of meta data? As the end user goes about creating analysis, the end user freely manipulates--creating, changing, destroying--meta data in the blink of an eye. An end user is not about to stand for some centralized organization dictating what meta data can be used, created and altered. The end-user environment is one that is wholly self contained, an environment that operates entirely free from centralized control. The spontaneity of this environment is in marked contrast to the discipline of the centralized approach to meta data management.
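The late-night redefinition described above can be sketched in a few lines of Python. This is only an illustration: the figures, the department names and the dictionary layout are invented here, and only the variable names and formulas come from the example.

```python
# Hypothetical weekly figures; the numbers are invented for illustration.
figures = {"WEEKLY_REV": 1200.0, "WEEKLY_EXP": 700.0, "WEEKLY_OHVD": 150.0}
# Hypothetical revenue split by department (department names are assumptions).
rev_by_dept = {"east": 500.0, "west": 700.0}

# The analyst's meta data: each name maps to the rule that derives it.
definitions = {"AMT": lambda f: f["WEEKLY_REV"] - f["WEEKLY_EXP"]}
first_amt = definitions["AMT"](figures)

# The rethink: the old AMT is destroyed, a new AMT and AMTDEPT are created.
definitions["AMT"] = lambda f: f["WEEKLY_REV"] - (f["WEEKLY_EXP"] + f["WEEKLY_OHVD"])
definitions["AMTDEPT"] = lambda f: dict(rev_by_dept)
second_amt = definitions["AMT"](figures)
amt_dept = definitions["AMTDEPT"](figures)

print(first_amt, second_amt, amt_dept)
```

Nothing in this sketch asks anyone's permission. The name AMT silently changes meaning between the two calculations, which is exactly the freedom the end user expects and exactly what a centralized approach would try to prevent.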
Heretofore, there have been tools that have addressed the centralized approach to meta data management and tools that have addressed end-user autonomy of meta data processing. But no tools have recognized that BOTH aspects of meta data are valid, important and need to be addressed at the same time with the same tool.
There are other problems with meta data. In years past, meta data has been implemented long after the development has been done. In the 1960s and 1970s, massive applications were developed. In the mid 1980s we found that we needed to integrate these massive legacy applications. Meta data was a natural place to start the integration process. However, several problems appeared:
- The budget for development had run out long ago. No one was anxious to spend a lot of money on systems that were already aging.
- Populating the meta data dictionaries and repositories was difficult. The people who knew the old systems were long gone. In some cases, source code did not even exist.
- The benefits of a repository to the business were difficult to articulate and demonstrate. Management just did not see the benefits.
- Even if the repository could be populated, keeping it up to date was impossible. Real programmers don't do documentation and certainly do not stoop to do meta data.
What's Different Today?
Given the torturous past of meta data, why is meta data in today's environment even worthy of discussion? In today's world of enterprise decision support, there are good reasons why meta data has surfaced as a very important topic. In order to understand some of these reasons, consider the metaphor: meta data is like a road sign.
How much attention do you pay to road signs as you drive to and from work every day? You have been over the route a hundred times. You don't need to look at the road sign to know that you are on Interstate 25. In short, when you are traversing territory that is very familiar to you, you don't pay much, if any, attention to road signs.
Now suppose one day you are on a motoring vacation going from Chicago to Phoenix. You are near the four corners area of the U.S., and you find yourself in a town called Gallup, New Mexico. You have never been to Gallup before in your life. How much attention do you pay to road signs in Gallup? Plenty. Because if you don't, you may end up in Albuquerque, El Paso or Las Vegas. Anywhere but Phoenix. So when you are traveling on a route with which you are unfamiliar, you pay close attention to road signs.
In years past when we were doing transaction operational systems, operators were executing the same transaction many times a day. In those systems, the operators did not need a road sign (meta data) telling them what to do. But in a world of decision support where the end user is doing analytical processing, many times the end user is performing activities that he/she has never done before. A road sign (meta data) becomes invaluable to the end user in this case.
The first reason then why meta data is so important in the enterprise decision support environment is that the nature of the processing done in the data warehouse environment is fundamentally different from the processing done in the operational environment. Meta data plays a very important role in the ability to do analytical processing effectively.
Historical Data in the Data Warehouse
There is a second reason why meta data is so important in the data warehouse environment. Data warehouses contain a massive amount of historical data. Operational systems never did house very much historical data. It was rare for an operational system to house more than 60 to 90 days of data. But it is normal for a data warehouse to house five to 10 years of data. This historical factor makes a big difference in the importance of meta data.
In order to illustrate that difference, consider a manager who asks for a report from the data warehouse using 1998 data. The report is quickly prepared, and the manager is impressed with how quickly the data can be accessed. In fact, the manager is so impressed that the manager asks for a similar report for 1993. The data warehouse also contains 1993 data, and the report is prepared and sent to the manager. When the two reports are held side by side, the data warehouse analyst expects more praise but is disappointed to hear the manager complain that data processing people just don't understand business.
The data warehouse analyst asks the manager to explain the problem, and the manager points out that the reports show that 1998 revenue is $5,000,000 and 1993 revenue is $100,000. The manager states that data processing people just can't get business values straight and that there cannot possibly be that much of an increase in revenue in five years' time.
Before the data warehouse analyst retires from the fight and accepts the criticism of the manager, the analyst points out to the manager that:
1. There were different sources of data in 1993 and 1998.
2. There was a difference in the definition of what a product was in 1993 versus what a product is in 1998.
3. In 1998 Canada was considered to be part of North America, and Canadian revenues were added to American revenues. In 1993 Canadian revenues were accounted for separately.
4. The ratio of the dollar to the pound/franc/peso/etc. was very different in 1998 than it was in 1993.
5. Inflation was different in 1998 than it was in 1993.
6. Taxation rates were different.
7. Depreciation--by mandate of the government--was calculated differently in 1998 than it was in 1993.
8. The business climate was different in 1998 than it was in 1993, and so forth.
When the data warehouse analyst points out these differences to the business manager, the business manager now says, "I can easily see how $100,000 grew to $5,000,000 given the other information."
Meta data provides context for understanding data over time. Stated differently, you can have perfectly preserved data content over time; and if you don't have context to explain how data content came to be, then you have nothing. In an operational world where the focus was on very current data, there was not a great need to understand data over time. But in a data warehouse environment where there is an abundance of historical data, contextual information is absolutely essential. And it is meta data that holds the key to the context of data.
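One way to picture meta data as context is to carry it alongside each historical figure, so that two years can be compared on their context before they are compared on their values. The following Python is a minimal sketch under that assumption. The class, the function and the context keys are hypothetical names chosen for illustration; only the two revenue figures and the Canada and product-definition differences come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class YearlyFigure:
    year: int
    revenue: int
    # Context (meta data) describing how the figure came to be.
    context: dict = field(default_factory=dict)


def context_differences(a, b):
    """Return the context keys whose values differ between two figures."""
    keys = sorted(set(a.context) | set(b.context))
    return {
        k: (a.context.get(k), b.context.get(k))
        for k in keys
        if a.context.get(k) != b.context.get(k)
    }


# Context values are illustrative; "reporting_currency" is an assumption.
rev_1993 = YearlyFigure(1993, 100_000, {
    "canada_included": False,
    "product_definition": "1993 definition",
    "reporting_currency": "USD",
})
rev_1998 = YearlyFigure(1998, 5_000_000, {
    "canada_included": True,
    "product_definition": "1998 definition",
    "reporting_currency": "USD",
})

diffs = context_differences(rev_1993, rev_1998)
for key, (old, new) in diffs.items():
    print(f"{key}: 1993={old!r}, 1998={new!r}")
```

A report built this way would show the manager the Canada and product-definition differences next to the two revenue numbers, rather than leaving the analyst to explain them afterward.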
So there is another very valid reason why meta data plays a very different role in the world of the enterprise data warehouse.
But there is yet one more obstacle to a satisfactory data warehouse implementation. The origins of meta data lie in the mainframe environment. The mainframe environment was/is a centralized environment. Not surprisingly, all old meta data products--data dictionaries, repositories, etc.--are centralized since those products sprang up from mainframe technology. The notion with the old mainframe products is that meta data exists to solve the problem of shareability of data in a centralized environment. No attention is paid to the need for end-user autonomy of meta data in the mainframe-oriented repository products of yesterday.
The Enterprise Data Warehouse Environment
The problem is that the enterprise data warehouse environment is distributed. Very distributed. In the enterprise data warehouse environment there are:
- Data marts--many of them,
- Enterprise data warehouses,
- Operational data stores,
- Exploration warehouses,
- Near-line stores of data, and so forth.
Trying to implement a centralized meta data manager in the middle of a distributed, technologically heterogeneous DSS environment is a gross misfit. If ever there was a case of trying to fit a square peg in a round hole, it is this case.
The Distributed Meta Data Approach
A much more natural approach to managing the meta data problem in the distributed DSS environment is to employ a distributed meta data approach. Figure 2 outlines some of the important aspects of a distributed meta data approach for the data warehouse, DSS environment.
Figure 2 shows that one feature of distributed meta data is that it resides at each of the nodes in the distributed DSS environment. There is a separate physical instance of meta data at each data mart, each enterprise data warehouse, at each operational data store and so forth. The implication is that each architectural entity has its own meta data. In doing so, the need for local autonomy of processing is satisfied. Some meta data resides in DB2, some in Teradata, some in Oracle and so on.
By having local control of the meta data, each community of users can do such things as create definitions of data meaningful to the community, decide what meta data will be open to the world and which can be shared locally, etc.
By locally controlling the meta data, the end users can control the access, movement, backup and storage of their own meta data. Furthermore, the end users can fit the meta data infrastructure into their budget so that they are the true owners/stewards of the meta data.
But having local autonomy of meta data does nothing to address the issue of consistency and uniformity of meta data. In order to achieve an enterprise-wide understanding of data, meta objects must be shared. This means that the finance data mart can look at the meta data gathered and managed by the sales data mart. And the sales data mart can look at the meta data found in the ODS. And the ODS operator can look at the meta data found in the enterprise data warehouse. And as data passes from one server to another, it passes from Teradata to Oracle to DB2 and so forth.
But the movement of meta objects across the network of the enterprise data warehouse is only the first step toward uniformity and consistency.
The next step to uniformity is to implement enterprise meta data integrity. With enterprise-wide meta data integrity, a "system of record" for meta data objects is implemented. With a meta data system of record, there is one and only one owner and manager of any given meta object. When meta data becomes part of a node's system of record, the manager of the node has the responsibility of creating, changing and deleting the meta data. In addition, the manager has the right to determine where the meta data can be shared and where it cannot be shared. Meta objects can be shared all around the environment, but careful note is made of the status of their ownership.
If an object is shared, note is taken as to the owner of the meta data. Once shared, an object cannot be altered by the sharing node. If a change needs to be made to the meta data, the change must be made by the owning node.
By implementing a system of record for meta data, integrity of definition, shareability and ownership and stewardship of data can be accomplished across the enterprise. In doing so, there is integrity of meta data across the enterprise.
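The system-of-record rules can be made concrete with a small Python sketch. It is an illustration only, not a description of any product: the class names, method names, node names and definitions are all hypothetical. It assumes that each meta object records its single owning node, that only the owner may share or change it, and that sharing nodes see the owner's current definition.

```python
from dataclasses import dataclass, field


@dataclass
class MetaObject:
    name: str
    definition: str
    owner: str  # name of the node that is the system of record


@dataclass
class Node:
    name: str
    catalog: dict = field(default_factory=dict)  # meta objects visible here

    def create(self, name, definition):
        """Create a meta object; this node becomes its one and only owner."""
        obj = MetaObject(name, definition, owner=self.name)
        self.catalog[name] = obj
        return obj

    def share(self, name, other):
        """Only the owner decides where a meta object may be shared."""
        obj = self.catalog[name]
        if obj.owner != self.name:
            raise PermissionError(f"{self.name} does not own {name}")
        other.catalog[name] = obj  # the sharing node sees the owner's object

    def change(self, name, definition):
        """Only the owning node may alter a meta object."""
        obj = self.catalog[name]
        if obj.owner != self.name:
            raise PermissionError(
                f"{self.name} cannot change {name}; owner is {obj.owner}"
            )
        obj.definition = definition


sales = Node("sales_mart")
finance = Node("finance_mart")
sales.create("revenue", "booked orders net of returns")
sales.share("revenue", finance)

try:
    finance.change("revenue", "cash received")
    rejected = False
except PermissionError:
    rejected = True

sales.change("revenue", "invoiced orders net of returns")
print(rejected, finance.catalog["revenue"].definition)
```

In this sketch the finance node can read the shared definition but cannot alter it, while a change made by the owning sales node is immediately what every sharing node sees. A real implementation spanning Teradata, Oracle and DB2 would need to propagate changes across servers rather than share an in-memory object.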
Interfaces to a Node
Another important aspect of distributed meta data is the ability to easily and automatically capture meta data at any node in the network and to make that meta data easily available to the end user.
Figure 3 shows the interface to the likely sources for meta data and some of the target destinations. This illustration shows that there are a wide variety of sources from which meta data can be gathered. Some of the likely sources are:
- DBMS catalogs,
- tools of automation,
- standard interfaces,
- the CASE/data modeling environment, and
- free form text.
The data needs to be able to be entered into any given node in three manners: direct manual entry, manually directed automated entry and automated entry.
In addition to automatically adding the data to the data warehouse, there is the need to make the data available to the end user. There are several ways the end user can get to the meta data that resides in his/her node:
- accessing the meta data as a relational table,
- direct entry into the tool of access, or
- a customized interface, such as a Web-based interface.
Of course, when data is entered into a node, it becomes a part of the system of record of that node.
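As an illustration of the first access path, meta data exposed as a relational table, here is a minimal Python sketch using the standard sqlite3 module with an in-memory database. The table name, column names and rows are assumptions made for this example; the three entry modes are the ones named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical layout for a node's meta data table.
conn.execute(
    """
    CREATE TABLE meta_data (
        object_name TEXT PRIMARY KEY,
        definition  TEXT NOT NULL,
        source      TEXT NOT NULL,
        entry_mode  TEXT NOT NULL CHECK (
            entry_mode IN ('manual', 'manually directed', 'automated')
        )
    )
    """
)

# Illustrative rows; the names and definitions are invented.
conn.executemany(
    "INSERT INTO meta_data VALUES (?, ?, ?, ?)",
    [
        ("customer_id", "unique customer key", "DBMS catalog", "automated"),
        ("revenue", "booked orders net of returns", "free form text", "manual"),
        ("region", "sales territory code", "CASE/data modeling", "automated"),
    ],
)

# The end user reads the node's meta data with ordinary SQL.
rows = conn.execute(
    "SELECT object_name, source FROM meta_data "
    "WHERE entry_mode = 'automated' ORDER BY object_name"
).fetchall()
print(rows)
conn.close()
```

Because the meta data is just another table, any query tool the end user already has can read it, which is the appeal of this access path.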
Technical Meta Data/Business Meta Data
There are two important facets to distributed meta data in the DSS data warehouse environment. Those facets are the technical face of meta data and the business face of meta data. In the operational environment, meta data served the administrator and the technician. In the case of operational systems, meta data really had only one face--a technical one. But in the world of DSS data warehousing, there is another very important side of meta data--the business side. If meta data is to achieve its true potential in the DSS data warehouse environment, meta data must also serve the business community. Business meta data is similar to, but essentially different from, technical meta data. Business meta data is geared to the day-to-day business of the corporation, not the day-to-day technology of the corporation. In order to be successful, business meta data must be factored into the equation.
The distributed meta data network that has been described here is one that satisfies several important objectives. The end user is able to own and control his/her own meta data; the distributed meta data is able to be distributed in units of meta objects; meta objects are able to be passed from one technology to another with no technological constraints; enterprise integrity is enforced by the establishment of a system of record, and there is a wide variety of interfaces for the capturing and the dissemination of distributed meta data.
Both the autonomy of the end user and the need for the integrity of meta data are satisfied by the distributed meta data architecture. In addition, both business and technical meta data belong to the meta data infrastructure.