The terms "meta data" and "repository" have been with us for many years now, but there always seem to be questions about their definitions and many divergent opinions as to what they represent. Broadly speaking, meta data is data about data and a repository is something that stores meta data. Many people provide good reasons to object to this rather simple definition of meta data, and there are also many views on what a repository is. Underlying these difficulties is a notion that meta data is a fixed set of knowable facts about the structure, organization and behavior of any set of enterprise data and that if we design a good repository, we will be able to capture and manage the meta data. However, this article will argue that the meta data that can pertain to any database is potentially unknowable and infinite and that attempts to build a single general-purpose repository to house it all are unlikely to be successful.
There is little argument that data can be defined as the stored representation of a fact. Computerized information systems work by converting facts to an encoded form and placing them in a medium from which the facts can later be decoded and communicated to an intelligence capable of understanding them. For instance, an account number is entered into a computerized system by a human being and is converted into binary digits that are then encoded again when they are placed in a magnetic medium on a hard drive as part of an organized collection of similar facts called a database. The means exist to get the account number out of this storage, return it to its original format and present it on screens and reports to other human beings.
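The encode-and-decode cycle described above can be sketched in a few lines. This is only an illustration: the account number value and the choice of UTF-8 as the encoding are assumptions, not details of any particular system.

```python
# A hypothetical account number; the value is made up for illustration.
account_number = "4417-1234"

# Encode: the fact is converted to bytes, i.e. groups of binary digits.
# UTF-8 is an assumed encoding; real systems may use others.
stored_bytes = account_number.encode("utf-8")
bits = " ".join(f"{byte:08b}" for byte in stored_bytes)

# Decode: the stored representation is returned to its original format
# so it can be presented to a human being again.
recovered = stored_bytes.decode("utf-8")

print(bits)
print(recovered)
```

The bytes are the stored representation; the fact itself only reappears once something decodes them and a person reads the result.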
If this is what data is, then meta data is not really different: it too is the stored representation of facts. However, the facts being represented are about the stored representations of other facts. In other words, meta data is facts about other facts that happen to be stored in databases. Of course, there is an important distinction between data and meta data. As long ago as the 18th century, Dr. Samuel Johnson, who compiled one of the most influential dictionaries of the English language, observed: "Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it." However, it is a big leap from this distinction to the often unstated but seemingly pervasive idea that there is a finite set of knowable meta data about any database. It is this idea that allows us to think that we can design a general-purpose repository to house any meta data or create an XML standard for exchanging all meta data.
More and More Meta Data
There is a strong case to be made that the concept of meta data has been influenced by the fact that data administration is still a young discipline that for much of its lifetime has been dealing with relatively few concepts and a few technological platforms. This is not to underestimate the complexity and difficulty of data administration. The word "few" is relative to a situation that is rapidly becoming a reality: an explosive growth in the complexity of information management throughout our society. This complexity is now becoming more and more apparent, and as it does, the goal of being able to completely define the meta data for a given database is rapidly receding.
Since the early days of entity-relationship modeling, it has been recognized that things have characteristics that describe their structure and the way they behave. In entity-relationship modeling, we describe these things (entities) and their characteristics (attributes). Normally we do this for things that have a physical existence, such as customer and product, but it has always been understood that concepts and ideas can also be represented in this way. The goal of a data modeling project is ultimately to create a database where data about these things is stored. All this is rather abstract and complex, and it is necessary for data modelers to separate the concepts that pertain to their tools and methodologies from the subject matter of any data modeling exercise (i.e., the data being modeled). This has led to certain pieces of information being recognized as meta data. For instance, in a given project, it may be necessary to decide if data types will be assigned to attributes. Data types are considered to be meta data, while attributes are usually referred to as data, although anything in a data model is, strictly speaking, meta data! Data modelers also have to manage their work and keep track of what they are doing. For instance, in a given project it may be necessary to maintain a list of entities that have not yet been fully modeled. Again, the fact of whether an entity has been fully modeled or not is considered to be meta data, but the entities themselves are usually spoken of as data rather than as the representation of data.
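The two modeling examples above (data types assigned to attributes, and a list of entities not yet fully modeled) can be sketched as a small structure. The class names, field names and sample entities here are hypothetical and chosen only for illustration; real modeling tools record far more than this.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    name: str
    data_type: Optional[str] = None  # meta data; None means not yet assigned


@dataclass
class Entity:
    name: str
    attributes: List[Attribute] = field(default_factory=list)
    fully_modeled: bool = False  # meta data about the modeling work itself


# Hypothetical model contents.
model = [
    Entity(
        "Customer",
        [Attribute("customer_id", "integer"), Attribute("name", "varchar(100)")],
        fully_modeled=True,
    ),
    Entity("Product", [Attribute("sku"), Attribute("price", "decimal(10,2)")]),
]


def not_fully_modeled(entities):
    """Names of the entities the modeler still has to finish."""
    return [e.name for e in entities if not e.fully_modeled]


def attributes_missing_types(entities):
    """(entity, attribute) pairs that have no data type assigned yet."""
    return [
        (e.name, a.name)
        for e in entities
        for a in e.attributes
        if a.data_type is None
    ]


print(not_fully_modeled(model))
print(attributes_missing_types(model))
```

Note that everything in this structure, including the entity and attribute names, describes data rather than being business data itself, which is the point made in the paragraph above.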
Meta Data for Users Too
Few would doubt the usefulness, or even necessity, of data administrators separating data and meta data in this way, even though the ways in which these terms are used can be fuzzy. Yet there is also a growing relationship between knowledge workers (a more accurate description of users) and meta data. Consider the following hypothetical example:
- An author writes a book on orchids. To the author, the content of the book is the data. However, upon publication some meta data gets assigned, such as the ISBN and the price of the book.
- The publisher now creates a Web page for the book to display information that was meta data for the author, such as ISBN and price, but which for the publisher is data. For the publisher, the URL of this Web page is meta data.
- A search engine finds the publisher's Web page and records the URL in its database together with information on the content of the page. The URL, which was meta data for the publisher, is data for the search engine.
- A journalist working for a magazine for orchid collectors decides to review all orchid books. He uses the search engine to find URLs created by publishers for such books and finds the URL of the publisher in our example. The journalist creates his own Web page with links to Web pages that have more information about orchid books. Again, the publisher's URL is used as data.
- The author of our book finds out what the journalist has done and e-mails him to ask what the click-through rate for the link to the author's book is relative to the click-through rates for other authors. This is meta data from the perspective of the journalist, but is valuable marketing data from the perspective of the author.
This is an illustration of an information chain reaction, and it seems that information chain reactions are becoming more important in our economy. Modern technology, particularly the Internet, is creating an environment where information can be productively used in many ways, some of them quite unexpected. In this environment, what is meta data from one business perspective may be data from another business perspective. While it may be possible to predict some of these perspectives, it is not possible to anticipate all of them, particularly as new enabling technologies and new business models appear.
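The orchid book example can be restated as a small lookup in which the same fact is classified differently depending on who is looking at it. This is a sketch only: the table is a simplified reading of the example, and the names used in it are invented for illustration.

```python
# For each perspective, the facts that party treats as its own data.
# Anything else it handles is meta data from its point of view.
# The assignments follow the orchid book example; the names are hypothetical.
PRIMARY_DATA = {
    "author": {"book_content", "click_through_rate"},
    "publisher": {"isbn", "price"},
    "search_engine": {"url"},
    "journalist": {"url"},
}


def classify(fact, perspective):
    """Return 'data' or 'meta data' for a fact as seen from one perspective."""
    return "data" if fact in PRIMARY_DATA[perspective] else "meta data"


for fact, perspective in [
    ("isbn", "author"),
    ("isbn", "publisher"),
    ("url", "publisher"),
    ("url", "search_engine"),
    ("click_through_rate", "journalist"),
    ("click_through_rate", "author"),
]:
    print(f"{fact} for {perspective}: {classify(fact, perspective)}")
```

A fixed table like this is exactly what cannot be completed in practice: each new business model or technology adds a perspective that nobody anticipated.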
The lesson for the data administrator is that there are many kinds of meta data that can be very relevant to knowledge workers. Despite this, meta data is commonly thought of as something that is more relevant to IT staff.
Meta Data Without Limit
With information now being treated as a valuable corporate resource, the demands on data administration to manage it are increasing. This is also fueling the need to manage quite disparate kinds of meta data, making it increasingly apparent that there is no predictable limit to the meta data that can be associated with any database.
Here are some examples of the kinds of meta data that are being managed by data administration personnel:
- The data administration unit in an organization may introduce a process for gathering, agreeing upon and publishing entity and attribute definitions to answer questions such as "What is a customer?" This process involves the submission of proposed definitions, flagging and resolving of issues, documentation of agreements and publication of results, all of which is meta data.
- Building data marts and warehouses is increasingly popular. It involves moving data from source to target databases. There is a great volume and variety of meta data involved in defining the mapping of the data between source and target. Once the data warehouse is completed, another set of meta data comes into play. That meta data records facts about the movement of the data from the source to the target, including load statistics.
- Usage of data in a database sometimes needs to be recorded. This may not simply be to devise a strategy for enhancing performance, but also to provide feedback to business users so they can assess the importance of different kinds of information to the enterprise. These usage statistics may be quite complex to define and are meta data.
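Two of the kinds of meta data listed above, the source-to-target mapping and the load statistics, can be sketched as follows. The column names, the rejection rule and the statistics recorded are all assumptions made for illustration; an actual warehouse load would track much more.

```python
from datetime import datetime, timezone

# Mapping meta data: hypothetical source column -> target column names.
COLUMN_MAPPING = {"cust_no": "customer_id", "cust_nm": "customer_name"}


def load(source_rows, mapping):
    """Move rows from source to target and record facts about the movement."""
    target_rows = []
    rejected = 0
    for row in source_rows:
        # Assumed rule: reject any row that lacks a mapped source column.
        if any(column not in row for column in mapping):
            rejected += 1
            continue
        target_rows.append(
            {target: row[source] for source, target in mapping.items()}
        )
    # Load statistics: meta data that only exists once the load has run.
    load_stats = {
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        "rows_read": len(source_rows),
        "rows_loaded": len(target_rows),
        "rows_rejected": rejected,
    }
    return target_rows, load_stats


source = [{"cust_no": 1, "cust_nm": "Acme"}, {"cust_no": 2}]
target, stats = load(source, COLUMN_MAPPING)
print(target)
print(stats)
```

The mapping exists before any data moves and the statistics only afterward, so even this small example involves two kinds of meta data with different structures and life cycles.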
In recent years there has been an increase in the types of meta data with which data administration has to deal. It is very unlikely that we have reached a stage where no new types of meta data will appear.
The definition of meta data as data about data is too limiting for most data administration professionals, yet the fact that meta data can be limitless in scope may seem equally difficult to accept. However, the growth in the quantity and variety of meta data inevitably leads to this conclusion. Meta data is still a useful concept, as it is necessary to distinguish the facts in a database from the facts about these facts. After all, data and meta data will have different structures, behaviors and audiences. These two classes of data will be collected and managed in different ways and have different life cycles.
If this conclusion is accepted, there are several implications. One is that projects to build "the repository" for an enterprise, or even a database, may not meet the expectations they generate. A more realistic approach is to state exactly what kind of meta data will be stored in a given repository and work toward that limited goal. Another implication is that meta data needs to be thought of in much the same terms as regular data. Thus, there will be a particular business problem that needs to be addressed by managing information that happens to be meta data, and a database can be built to house this meta data even though it is called a repository.
Some may wish to restrict meta data to the information necessary for the data administration function to manage a corporate data resource. Yet even these boundaries are prone to shift as new technologies emerge, as different methodologies are born, as businesses demand that we manage additional kinds of data and as new, productive uses are found for the data we already have.
It is true that there may be similarities in the meta data associated with different databases, just as there are general patterns for data models in a given industry, such as insurance. There are even patterns for data models across different businesses, such as when we abstract roles of customer, vendor and employee into a party supertype. However, the notion that there can be one repository to store all meta data cannot be supported. Meta data is just too diverse for this approach. It is more correct to visualize different databases containing qualitatively different meta data clustered around a database containing business facts, though in the modern economy there is a strong chance that these business facts describe some other kind of data from which the business makes a living. Is the meta data repository dead? The notion of the repository is going to have to be replaced by the idea of many repositories or databases containing meta data (though not necessarily under these names). This will give enterprises, including their data administration functions, the freedom to see meta data for what it truly is and fully exploit its enormous potential.
Malcolm Chisholm, Ph.D., has over 25 years of experience in enterprise information management and data management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management, and business rules. His experience includes the financial, manufacturing, government, and pharmaceutical industries. He is the author of How to Build a Business Rules Engine, Managing Reference Data in Enterprise Databases, and Definition in Information Management. He writes numerous articles and is a frequent presenter on these topics at industry events. Chisholm runs the websites http://www.bizrulesengine.com, http://www.refdataportal.com and http://www.data-definition.com. Chisholm is the winner of the 2011 DAMA International Achievement Award.