
Meta Data Architecture for Data Warehousing

Published
  • April 01 1998, 1:00am EST

Over the years a lot has been written about the importance of meta data for data warehousing. In simple terms, meta data is information about data and is critical for not only the business user but also data warehouse administrators and developers. Without meta data, business users will be like tourists left in a new city without any information about the city, and data warehouse administrators will be like the town administrators who have no idea about the size of the city or how fast it is growing.

Despite its criticality, meta data continues to remain the most neglected part of many data warehousing projects. "We shall worry about it later" is usually the approach.

Data extraction/transformation tools have done the most in terms of capturing part of the meta data. Some of the front-end query tools are also able to read meta data from these tools and the database engines. But there is a lot of critical meta data locked away in CASE tools, legacy data, front-end tools, database engines and data warehouse administration tools waiting to be integrated for seamless access by the users. Most of the meta data in these disparate sources is also out of date and inconsistent.

Lack of an integrated approach to consolidate, manage and keep meta data up to date can result in several risks to the credibility and success of a data warehousing project:

  • Misleading information leading to erosion of credibility of the warehouse. A subtle example of this is the case of a corporate DSS analyst who was studying the pattern of inventory held by various business units of a manufacturing company. He noticed a sudden, unexplained increase in the inventory held by the heavy equipment unit. He talked to the business unit manager of the heavy equipment unit who, in turn, had to ask his staff for an explanation. After several days of research and lost productivity, they found that during the period in question "heavy equipment" had, for a brief time during a reorganization of the company, also included distribution transformers. The episode caused a lot of heartburn between the corporate DSS analyst, the business unit staff concerned and the data warehousing group. It could have been avoided if there had been contextual meta data about the definition of heavy equipment during the period in question. This example highlights another important aspect of meta data: it should not only be accurate and up to date, but also have the right version and be contextual.

    CONTEXT: In the above example, if the query tool had carried the right definition of heavy equipment and what it included during the period in question, the unpleasant situation could easily have been avoided. The current state of meta data technology does not provide contextual meta data when a user is looking at the results of a query. Contextual meta data should accompany query results with all relevant context, including the data relationships that existed at the time the data was created. This requires tight integration between the meta data management tools, the meta data repository and the front-end tools.

    VERSIONING: It will be impossible to provide the right contextual data without maintaining past versions of meta data. In the previously mentioned example, having up-to-date and accurate meta data would have been of no use if the appropriate version of meta data for the period in question had not been maintained. As organizations go through different business changes due to mergers, spin-offs, acquisitions and internal reorganizations, the basic definitions of business entities undergo changes. These changes should be meticulously maintained as different versions of the meta data repository.

    META DATA ABOUT DATA QUALITY: Most organizations have to live with the reality of legacy and external data quality problems caused by poor validation rules, overuse of fields or simple misuse of certain fields. There are plenty of examples in the industry for each of these cases. During the early phases or iterations of a data warehousing project, it is very important that data quality be analyzed and the results documented as associated meta data.

  • Inability to get the right information. Without proper meta data, users may not even know that certain information already exists in a data mart or a data warehouse. Even though this may sound more serious than the first problem discussed, I consider it less serious in nature: misleading information is worse than getting no information at all.

  • Productivity loss. Users who cannot get to the right information in the right way quickly suffer a loss of productivity. For example, say the VP of marketing for a company wants to know the average profitability of the customers who have joined since the last campaign. If there is no meta data defining average profitability, how it is calculated and which campaigns to take into account, he might end up spending a lot of time researching the definitions, resulting in productivity loss.

  • Unpredictable performance. Lack of information on query estimates impairs the ability to schedule long-running queries for off-peak hours, which can result in unpredictable performance. To avoid this problem, meta data should contain the estimated time and expected number of resultant rows for at least the seemingly complex and long queries. It is understandable that in a DW environment it will be almost impossible to estimate all ad hoc queries; however, building a simple historical file about various queries can go a long way toward alleviating this problem.

  • ROI. Without meta data about usage of a data mart or a data warehouse, it may be difficult to determine ROI and sustain the project.

  • Difficult to upgrade and enhance. Data warehouse developers will not be able to maintain, enhance or upgrade a data mart or a data warehouse without a good knowledge of what is in the current version of the warehouse. This is especially true with current trends toward implementing quick data marts. If done without proper meta data architecture, users will get inconsistent and conflicting information from data marts.
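The versioning requirement described above can be made concrete with a small sketch. The following Python fragment is purely illustrative (no such tool existed in the products discussed here): it keeps each business definition with an effective date range, so a query tool could look up the definition that was actually in force when the data was created. All names, dates and definitions are hypothetical, loosely modeled on the heavy equipment example.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sketch: a business definition held as a versioned
# meta data entry, keyed by an effective date range.
@dataclass
class MetaDataVersion:
    entity: str          # business entity name, e.g., "heavy equipment"
    definition: str      # definition in force during the date range
    effective_from: date # inclusive lower bound
    effective_to: date   # exclusive upper bound

class VersionedRepository:
    def __init__(self):
        self.versions = []

    def add(self, version):
        self.versions.append(version)

    def definition_as_of(self, entity, as_of):
        """Return the definition of an entity at a given point in time."""
        for v in self.versions:
            if v.entity == entity and v.effective_from <= as_of < v.effective_to:
                return v.definition
        raise KeyError(f"No definition of {entity!r} as of {as_of}")

repo = VersionedRepository()
repo.add(MetaDataVersion("heavy equipment",
                         "excavators, cranes, bulldozers",
                         date(1996, 1, 1), date(1997, 6, 1)))
repo.add(MetaDataVersion("heavy equipment",
                         "excavators, cranes, bulldozers, distribution transformers",
                         date(1997, 6, 1), date(1997, 9, 1)))

# During the reorganization window, the definition includes transformers;
# before it, the same query returns the older definition.
print(repo.definition_as_of("heavy equipment", date(1997, 7, 15)))
```

With such a lookup in place, a front-end tool could attach the period-correct definition to query results, which is exactly the contextual meta data the inventory example called for.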

Meta Data Contents

The Entity Relationship Diagram in Figure 1 shows the contents of a meta data repository for a data warehouse. There are three broad categories of meta data:

1. Meta data for the business users. Meta data is like a complete itinerary from the AAA (American Automobile Association), showing business users where they can find what information, how they can access it, how long it will take to access and what quality they can expect when they finally get it. In Figure 1, entities marked with a "U" are of great importance for the business user.

2. Meta data for the data warehouse administrator. The data warehouse administrator (responsible for populating, maintaining and ensuring availability of the data warehouse) can make his tedious tasks simpler through his own special view of meta data, which includes profile and growth metrics in addition to the other entities marked with an "A" in Figure 1.

3. Meta data for the data warehouse developer. Meta data for developers affects their ability to maintain and enhance data marts; without up-to-date meta data, data marts can easily grow into conflicting islands of information. Meta data entities marked with a "D" in Figure 1 are of special interest to developers.

FIGURE 1: Entity Relationship Diagram

Architecting for Meta Data

Whether you are building a single data mart or a giant enterprise-wide data warehouse, architecture should be an integral part of the planning and design process. Developing a long-term architecture early in the project helps set the vision for the future, guiding the data warehousing team through the phases. Meta data architecture should have the following characteristics:

  • Mandatory. Meta data has been very important even for OLTP systems. However, for OLTP systems, most of the meta data was required by the IT community consisting of programmers and analysts. It never acquired the importance it has with the onset of data warehousing, where most of the meta data is required by end users. They would be totally lost if they didn't know what was available in the warehouse, what it means and how to access it. Data warehouse users should be provided not just accurate meta data but accurate contextual meta data. As described earlier, without accurate and up-to-date contextual meta data, they can get misleading and ambiguous information, which can lead to wrong decisions. Thus meta data architecture for a data warehousing project should not be an afterthought, but a mandatory and well-planned part of the overall architecture.

  • Open (ability to work with and get information from various categories of tools). It is a key requirement for the meta data to be up to date. If the meta data repository is proprietary and does not interchange information with other components of the data warehouse architecture, the meta data update process will be tedious, resulting in out-of-date meta data. As shown in Figure 2, the meta data tool should be able to work seamlessly with various sources of meta data:

    * CASE tools;

    * Applications for the source data, both internal and external;

    * Database engines, whether relational or multidimensional;

    * Data extraction, transformation and cleansing tools;

    * DW monitors; and

    * Front-end tools.

    FIGURE 2: Tool Relationships


    Two parallel efforts in the industry have led to some development of common meta data standards. The first effort, by the Meta Data Coalition (established in 1995), has produced the Meta Data Interchange Specification (MDIS). Several vendors have announced compliance with MDIS 1.0, including Viasoft's Rochade repository, the Intellidex Warehouse Control Center and ETI's ETI*EXTRACT Tool Suite. Other vendors who have committed to offering MDIS-compliant products include IBM, Apertus Carleton, Brio, Platinum and SAS.

    The second effort, by Microsoft Corp., is based on the Object Management Group's Unified Modeling Language (UML) standard. The Microsoft Repository's Open Information Model (OIM) provides a common format for tools to share information describing objects, components and modules across the application development life cycle. Microsoft has also announced plans to extend OIM to support all forms of data warehousing meta data: extraction and transformation rules, data mapping and data model information. It plans to implement these extensions to OIM in collaboration with its data warehousing partners. Several vendors have announced support for building extensions to OIM, among them Apertus Carleton Corporation; Business Objects; Cognos Corporation; Evolutionary Technologies International; Informatica Corp.; Logic Works, Inc.; Platinum technology, inc.; Powersoft (a subsidiary of Sybase Inc.); Prism Solutions, Inc.; and Sterling Software.

  • Easy to synchronize and update. If the meta data manager requires jumping through a lot of hoops to keep meta data up to date, that is a sure recipe for failure. The architecture should facilitate automatic updates from the sources of meta data wherever possible, and there should be clear-cut standard procedures for updating and synchronizing meta data. Three categories of process are needed to synchronize meta data and keep it up to date:

    INITIAL CAPTURE. This process should be designed for the interchange of meta data from various categories of tools into a logically central meta data repository. Initial capture usually takes place during the development life cycle of the project and should be used to develop the process required for the ongoing update of the meta data repository.

    ONGOING UPDATES. After the data warehouse is up and running, automated processes should be developed for the ongoing update of the meta data.

    CONSISTENCY CHECK. This process should be used to make sure that the meta data is consistent and accurate based on date and time of update of various sources of meta data as described earlier.

In cases where one needs to manually update meta data, there should be enough checks and incentives to make sure the process is adhered to. Besides adhering to the standards, there should be a periodic process to check for the consistency and completeness of meta data. Figure 3 shows the sources of meta data that are important during various phases of a data warehousing project.
FIGURE 3: Meta Data Sources
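The consistency check described above can be sketched in a few lines. The fragment below is a hypothetical illustration, not part of any tool named in this article: it compares the last-update timestamp of each repository entry against the timestamp of its originating source (CASE tool, extraction tool, etc.) and flags entries whose repository copy has gone stale. The entry names and dates are invented for the example.

```python
from datetime import datetime

# Hypothetical sketch of a periodic consistency check: flag meta data
# entries whose source has been updated more recently than the
# repository's copy, based on date and time of update.
def find_stale_entries(repository, sources):
    """Both arguments map entry name -> last-update timestamp."""
    stale = []
    for name, repo_time in repository.items():
        source_time = sources.get(name)
        if source_time is not None and source_time > repo_time:
            stale.append(name)  # repository copy predates the source
    return stale

repository = {
    "customer_dim": datetime(1998, 3, 1),
    "sales_fact":   datetime(1998, 3, 15),
}
sources = {
    "customer_dim": datetime(1998, 3, 20),  # mapping changed after capture
    "sales_fact":   datetime(1998, 3, 10),
}

print(find_stale_entries(repository, sources))
```

Entries flagged this way would then be refreshed by the ongoing-update process rather than by manual intervention.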

  • Extensible. It is very important that the meta data architecture can be easily extended, especially as data mart approaches become more and more popular with the notion of "Think globally, act locally." As more data marts are added, the meta data architecture should be easily extensible, whether in a distributed or a centralized mode.

The Future

Meta data will acquire much more importance with the marriage of Web technology and data warehousing. This marriage will result in a meta data browser being the single point of access for the business information, whether it is on an intranet, an operational system or a data warehouse. Meta data will become a critical component of any overall architecture. The diagram in Figure 4 illustrates a logical representation of a future meta data architecture.

The logical meta data repository can be centralized or distributed depending on the business needs and organizational requirements. It would most likely be a group of data stores consisting of objects and relational data.

The active meta data manager would be the core component of the meta data architecture and would ideally consist of the following components:

  • Meta Data Capture: Initial capture of meta data from a variety of sources.
  • Meta Data Synchronizer: Processes to keep meta data up to date.
  • Meta Data Search Engine: This will be the front end for the users to search and access meta data.
  • Meta Data Results Manager: This will process the results of the meta data search and allow the user to make an appropriate selection.
  • Meta Data Alerter: As part of a push technology, this will notify the subscribers about any new changes to the meta data contents depending on the user profile.
  • Meta Data Query Trigger: Will trigger an appropriate query tool to get data from a data warehouse or any other source based on the selection made by the user in the meta data results manager.

FIGURE 4: Meta Data Architecture
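The components listed above can be tied together in a minimal sketch. The class below is entirely hypothetical (the active meta data manager is a future architecture, not an existing product): it shows how capture, search and a push-style alerter might interact, with every name invented for illustration.

```python
# Hypothetical sketch of an active meta data manager combining
# capture/synchronization, search and subscriber alerts. Names are
# illustrative only, not a real product API.
class ActiveMetaDataManager:
    def __init__(self):
        self.entries = {}        # entry name -> description
        self.subscribers = []    # callbacks notified on changes

    def capture(self, name, description):
        """Meta Data Capture / Synchronizer: add or refresh an entry."""
        changed = self.entries.get(name) != description
        self.entries[name] = description
        if changed:
            self._alert(name)

    def search(self, keyword):
        """Meta Data Search Engine: find entries mentioning a keyword."""
        return [name for name, desc in self.entries.items()
                if keyword.lower() in desc.lower()]

    def subscribe(self, callback):
        """Meta Data Alerter: register a subscriber for change notices."""
        self.subscribers.append(callback)

    def _alert(self, name):
        # Push notification to all subscribers about the changed entry.
        for notify in self.subscribers:
            notify(name)

mgr = ActiveMetaDataManager()
alerts = []
mgr.subscribe(alerts.append)
mgr.capture("sales_fact", "Daily sales by product and region")
print(mgr.search("region"), alerts)
```

A results manager and query trigger would sit on top of `search`, turning a selected entry into an actual query against the warehouse; those steps are omitted here since they depend on the front-end tool.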

The architecture will ensure that users know what they are getting when they finally get their hands on the information they desperately need. With the speed at which the data warehousing market is evolving, it will be sooner rather than later that such architecture will become a reality.

Recently, the Meta Data Coalition has built an alpha version of a bridge between the Meta Data Coalition's Meta Data Interchange Specification (MDIS) and Microsoft's OIM-based repository described earlier. The bridge will allow a two-way exchange of information between MDIS compliant files and Microsoft's repository. This is a giant first step toward achieving an ideal meta data architecture based on dynamic updates to meta data.
