Metadata has always been around. There used to be a theory that organizations needed to devise one and only one way of defining a concept, making it official, and then keeping it in one place for everyone to access and use. That goal was never achieved in most organizations, and thus the information lifecycle continues to evolve. We create reference data, master data, metadata, operational metadata, business metadata and process metadata, and I guess it is all really data. Or is it?
The elusive concept here is the connection between all of these new and often repetitive ways of revisiting our information. Organizations continue to struggle with the need to map, the need to interpret and, ultimately, the need to identify. We’ve reached the world of “Metadata 2015.” Because our data will always exist in more than one place, it is separated in some way, yet similar or absolutely the same in another. Most of us may think we are seeing things differently when we look at these fragmented pieces, whether in the cloud, in stored and downloaded segments, or on our devices. Often this is the case, but sometimes it really is not.
Before bringing in our cultural mindset, let’s look at a brief historical analogy from an information perspective. Information is always the result of interpreting data. When cars were first invented and driving became popular, driving had several “data” components: a driver, a vehicle and a destination, each a key data piece. But in the days of the horse and buggy, knowledge of the roads was a required set of information, learned by doing but not stored anywhere per se. Early drivers did not realize that their data was being clarified by “metadata”; they were keeping this organized metadata, called a map, in their heads. As time went on, individuals could not keep up with the different types of vehicles and supporting transportation infrastructures that were devised, or with the volumes of passengers, and they needed help with navigating and tracking. Hence the need for devices and more metadata, more data, or both.
Metadata Goes Mainstream
The world of metadata has evolved. Its beginnings were technical, and people in the business world didn’t really use the word or care about it. Eventually we reached today’s state, but that is not to say we’ve all reached the peak of maturity. Organizations are at various levels of metadata maturity, and even today not all organizations have come to agreement as to what metadata really is. (See Figure 1 In Popup Window: Reaching Maturity.)
What is important to realize is that there now is a common thread in all of our information management environments, and this thread is the result of a phased evolution. Let’s look more closely:
During the 1960s and 1970s, the idea that data needed to become consistent (to at least some degree) became prevalent in organizations, but this occurred without the foresight to set up strict rules and edit checks. We did not have nearly the information challenges we have today, because most companies had only one mainframe computer, one billing system, one HR system, and so on. Most people had accounts at only one bank; there was only one phone company. But even in this simpler environment, problems arose. For example, adding a new office location and associating it with any number of existing employees required a new program to be written. Files were flat, and it was very time-consuming to test the program and, ultimately, to run it. Hierarchical database structures added some flexibility but still had restrictions: Analysis of the data was permitted only along the design of the hierarchy. So, using our office example, if we wanted to know how many people with a certain title had moved to the new location, another program would have to be written, yet another database would typically be created, the same data would be used to populate it, and the world of data redundancy began. HR had its view of the employees (the “title” view), Operations had its view of the employees (the “office” view), and each business area could change data as it pleased. My address in one database could be 15 Smith Street; my address in the other could be 15 Smith St., but it didn’t really matter. The problem would come to light if I (hypothetically) resigned: What if I was deleted from only one of the databases? This exemplifies Phase 1: Redundant and Disconnected Data. The evolution from this phase began as people recognized the need for a better way of standardizing how and where we stored and related our data.
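The Phase 1 problem can be sketched in a few lines. This is a hypothetical illustration (the employee ID, names and departments are invented), showing two departmental stores holding copies of the same employee record that drift apart and then fall out of sync entirely:

```python
# A minimal sketch of Phase 1, "redundant and disconnected data":
# two departments each keep their own copy of the same employee record.
hr_db = {
    "E100": {"name": "A. Smith", "title": "Analyst", "address": "15 Smith Street"},
}
ops_db = {
    "E100": {"name": "A. Smith", "office": "Newark", "address": "15 Smith St."},
}

# The two copies already disagree on the address format...
print(hr_db["E100"]["address"] == ops_db["E100"]["address"])  # False

# ...and a resignation processed by only one department leaves an orphan.
del hr_db["E100"]
orphans = [emp_id for emp_id in ops_db if emp_id not in hr_db]
print(orphans)  # ['E100']
```

Nothing connects the two copies, so nothing forces them to agree; that is the condition Phase 2 set out to fix.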
Phase 2 of the metadata evolution began with the implementation of relational databases and the concept of third normal form (3NF). The introduction of data modeling started a new way of designing databases based upon business needs as opposed to performance needs. Flexibility, analysis and how data would be uniquely identified were all key aspects that determined how the database would be designed. Data modeling was considered an art of sorts, and some people said that two modelers would never come up with the same model if they were given the same set of requirements. In reality, requirements tended not to be fully defined, which generally led to assumptions that were modeled, whether directly or indirectly, with or without proper validation from the business. But good modelers knew how to add the flexibility that was necessary. The idea that products could consist of other products originated with the relational data model. The idea that products have lifecycles that begin with a compound which needs to be related to its marketed product began with the relational model. But not everyone understood this concept, and other modeling methodologies evolved. We have process models and object models, and because there is still no standard way of representing our models, we have lots and lots of models. Everything in a model was immediately considered metadata, which brings us to the end of Phase 2: Overlapping, Conflicting, Similar, Complementary Data Models.
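The relational idea that "products can consist of other products" can be shown with a small sketch. This is not any particular organization's model; the table and product names are invented, and the structure is one common way (a self-referencing component table) to express the concept:

```python
import sqlite3

# A 3NF-style sketch of "products consisting of other products":
# one product table, plus a component table that relates products to products.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE product_component (
        parent_id    INTEGER REFERENCES product(product_id),
        component_id INTEGER REFERENCES product(product_id),
        PRIMARY KEY (parent_id, component_id)
    );
""")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "Kit"), (2, "Bottle"), (3, "Label")])
conn.executemany("INSERT INTO product_component VALUES (?, ?)",
                 [(1, 2), (1, 3)])  # a Kit consists of a Bottle and a Label

# One flexible structure answers questions no flat file anticipated.
rows = conn.execute("""
    SELECT c.name FROM product_component pc
    JOIN product c ON c.product_id = pc.component_id
    WHERE pc.parent_id = 1
    ORDER BY c.name
""").fetchall()
print([r[0] for r in rows])  # ['Bottle', 'Label']
```

Unlike the hierarchical designs of Phase 1, the same two tables can be queried from either direction (what a product contains, or what contains it) without writing a new program.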
Most organizations have graduated beyond Phase 2. In order to move beyond this point, an organization would have experienced the following:
- Redundant data, sometimes (more often than not) conflicting
- Varied database designs, often lacking the required flexibility to meet reporting/analysis criteria
- Inaccurate and redundant data models
- Redundant and disconnected metadata
- Minimal data standard enforcement
Where do we put all of this metadata? Is it even worth saving? If so, should we treat it the same way we treat data? Should we model it, relate it and analyze requirements? With the existence of so many nonstandard models, it made sense to come up with a standard way of modeling the models, so to speak. In Figure 2 (Popup Window: The Metamodel "Layers"), which is taken from my book, “Metadata Solutions: Using Metamodels, Repositories, XML, and Enterprise Portals to Generate Information on Demand,” the repository starts to organize the metadata.
Now the software looks for and accesses the data being modeled, but it gets to the data via database, table and column names. We see in the “metamodel” the various layers that relate our data, metadata, metamodel and, ultimately, meta-metamodel layers. To help conceptualize this, think of each layer as painting the picture of the layer below it, showing how that layer is viewed by the software that needs to access it. Notice that some pieces of metamodels (like Database Name) are used by many perspectives. The same metadata can appear in multiple metamodels, just as the same data can appear in multiple data models. Metadata was originally stored in a centralized, organized and managed repository. The metadata came from many sources and was organized via a metamodel structure similar to the one shown in Figure 2. Managing metadata in one place requires amazing amounts of discipline and automation, and many organizations realized this only after implementing their centralized metadata repository. For example, database structures were changed but the associated metadata in the repository was not, or data models were updated in modeling tools but never loaded into the repository. Many organizations are still struggling with Phase 3. I call it the “too many repositories” phase.
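The layering idea can be made concrete with a toy sketch. All names here are invented; the point is only that each layer describes the structure of the layer below it, so a repository can validate incoming metadata against its metamodel the same way a DBMS validates data against a schema:

```python
# A toy illustration of the metamodel "layers": each layer describes
# the structure of the layer below it.
data = {"cust_name": "Acme Corp"}                      # the data itself
metadata = {                                           # describes the data
    "column": "cust_name", "table": "CUSTOMER",
    "database": "SALES_DB", "type": "VARCHAR(50)",
}
metamodel = ["column", "table", "database", "type"]    # describes the metadata

def conforms(record, model):
    # The repository accepts metadata only if it fits the metamodel,
    # just as a DBMS accepts data only if it fits the schema.
    return sorted(record) == sorted(model)

print(conforms(metadata, metamodel))  # True
```

The discipline problem described above is exactly this check failing silently: the database changes, the `metadata` record does not, and the repository keeps serving a stale picture.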
Too Many Repositories
From the year 2000 onward, many solutions were introduced, each trying to integrate disparate sources of data, metadata, or some combination of the two. But each solution had one common characteristic: It created yet another source of data, metadata, or both. Simple examples include data warehouses, business intelligence tools, data dictionaries and, of course, repositories. Many full BI platforms included all of these as full product sets. In Figure 3 (Popup Window: Lots of Metadata - Lots of Data) we see the result. How do we keep track of all of our data? Each new data store needed its own metadata. (Notice the famous spreadsheet and ETL process.) How do we integrate our repositories? How do we share our metadata? Should we centralize anything? Is it even possible?
Many organizations continue working on Phase 3. The number of conflicting repositories makes standardizing data a means of survival, and now metadata needs to be standardized as well. It is not just how we name our data, how we define it, where it comes from, and where it is going, and the other basic metadata rules that became data governance. It is also the format, the timing, the reason, and how data is going to be shared. For the most part, “hows” can be standardized (technical innovators tend to dictate the first “how”), but ultimately regulation ensues, and “metadata standards” evolve.
Phase 3 draws attention to the fact that metadata and data have the same symptoms and same problems and, therefore, need to be treated the same way. Metadata standards extended beyond organizational parameters when the semantic Web became the place to publish, creating further challenges.
This highlights the need for:
- Standard Terms
- A standard way of using the Standard Terms
- A standard way of exchanging the Standard Terms
- A standard format for all of the above
- And more
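The list above can be sketched as a sender-side check. The terms, code list and format here are all invented for illustration; the point is that an exchange only works when both parties agree on the standard terms, the allowed values and the wire format:

```python
import json

# A sketch of standardized exchange: sender and receiver must agree on
# terms, allowed values, and format. All names below are hypothetical.
STANDARD_TERMS = {"subject_id", "visit_date", "country_code"}
ALLOWED_COUNTRY_CODES = {"US", "FR", "DE"}

def validate_for_exchange(record):
    unknown = set(record) - STANDARD_TERMS
    if unknown:
        raise ValueError(f"non-standard terms: {sorted(unknown)}")
    if record.get("country_code") not in ALLOWED_COUNTRY_CODES:
        raise ValueError("country_code outside the agreed code list")
    return json.dumps(record, sort_keys=True)  # the agreed wire format

print(validate_for_exchange(
    {"subject_id": "S-001", "visit_date": "2015-03-01", "country_code": "US"}))
```

Note that everything enforced here (the term list, the code list, the serialization) is metadata, not data, which is precisely why metadata now needs standards of its own.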
The case for standardization is reinforced by the fact that internal data is no longer private: data moves beyond organizational boundaries. Causing further complications, the issue is not only what data is being sent, but how it is being sent. And when we send it, we need to explain exactly what is being sent and make sure it meets the requirements we are given, typically by regulatory agencies. Which brings us to Phase 4: Today (or tomorrow, if you are not quite there yet).
We now have too much data, too much metadata, and too many containers. The containers are inside, outside and in transit. They are in the “cloud” and perhaps not really “contained”. We have too many standards. Some are of our own creation; some are forced upon us. Some are metadata standards, some are data standards; today, there really is no difference anymore.
When looking for direction and advice for your metadata strategy, follow the leaders. In the world of metadata, quite a few industries survive and thrive based solely upon the automated exchange of information. Information may actually be their product/service offering, but regardless, the similarity across successful industries is that the exchange is and has been a necessity. Without timely and accurate sharing of business information across organizations, including regulatory organizations, there is no functioning business. Phase 4 is metadata exchange.
Consider the publishing industry: How could we ever find a piece of published material if it was not identified in a uniform way? Automating this metadata and its usage began with the Dublin Core Metadata Initiative, expanded to Microsoft’s MMS for SharePoint, and so on. Media is now rife with standards, but the different standards all come together with standardized tags. This is known as metadata categorization. Figure 4 (Popup Window) shows a simple example from part of the Publishing Requirement for Industry Standard Metadata (PRISM) initiative, a specification of IDEAlliance.
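A tiny sketch shows why uniform tags matter. The element names below (`dc:title`, `dc:creator`, `dc:subject`, `dc:date`) come from the Dublin Core element set; the catalog records themselves are illustrative, not real:

```python
# Uniform tags make published material findable: because every record
# uses the same Dublin Core element names, one query works for all of them.
catalog = [
    {"dc:title": "Metadata Solutions", "dc:creator": "Tannenbaum, A.",
     "dc:date": "2001", "dc:subject": "metadata"},
    {"dc:title": "Data Modeling Basics", "dc:creator": "Doe, J.",
     "dc:date": "2010", "dc:subject": "data modeling"},
]

def find_by(tag, value):
    return [item["dc:title"] for item in catalog if value in item.get(tag, "")]

print(find_by("dc:subject", "metadata"))  # ['Metadata Solutions']
```

Without the shared tags, each publisher's catalog would need its own search logic, which is exactly the pre-standardization state the Dublin Core initiative addressed.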
Another leader is the geospatial industry, which deals with many ISO standards. Maps, dimensions, land, water, air: just think of how hard it would be to locate a sunken ship if we could not communicate the exact location of a sound in a standard way. Why was it so easy to introduce the GPS? Maps and the identification of locations have been standardized for ages, as have geography, space and dimensions.
While publishing and geospatial may be industries leading the way in standardization, some industries are not quite there yet. Perhaps they have been exchanging data but are not necessarily required to distribute it cross-industry or cross-supplier. But health care, to consider one area, is changing. The world of metadata standardization is now illustrating how a person can be a provider (a supplier of health care services), a member (a subscriber to or receiver of health care benefits), a patient (e.g., a participant in a clinical trial), a study director and so on. A diagnosis can be considered a discovered condition, a professional specialty within a therapeutic area, an actual disease with medical history across many patient populations, or a historically identified pre-existing health problem, to name just a few perspectives. Practice specialties are required for licensing and validation of claims submissions. All health care information is intertwined.
In health care, metadata also contains data. Required code lists are considered metadata in the medical world. Figure 5 (Popup Window) shows a sample of this: ICD-10 codes are a way of standardizing how we are diagnosed, how our “adverse events” are tracked, which of our claims are paid, how our preexisting conditions are logged, etc. Further, BRIDG, a collaborative effort among several health care standards communities, is bringing health care together through an integrated metamodel.
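The "code list as metadata" idea can be sketched as a lookup. Only two illustrative codes are shown (a real ICD-10 list has tens of thousands of entries), and the claim-checking function is a hypothetical simplification, not any payer's actual adjudication logic:

```python
# A required code list is metadata that the data (a claim) must conform to.
# Tiny excerpt-style list; real ICD-10 has tens of thousands of codes.
ICD10 = {
    "I10":   "Essential (primary) hypertension",
    "E11.9": "Type 2 diabetes mellitus without complications",
}

def check_claim(diagnosis_codes):
    # A claim carrying a code outside the standard list cannot be
    # adjudicated consistently across payers and regulators.
    return {code: ICD10.get(code, "UNKNOWN CODE") for code in diagnosis_codes}

print(check_claim(["I10", "ABC"]))
# {'I10': 'Essential (primary) hypertension', 'ABC': 'UNKNOWN CODE'}
```

The code list behaves like metadata (it defines allowed values) yet is maintained and versioned like data, which is the point of the paragraph above: in health care the two have merged.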
The benefits of metadata are now clear: We have moved from reactive information management to proactive information management. See Figure 6 (Popup Window: Reactive vs. Proactive).
Consider the examples from industry leaders as you move forward with your metadata strategies. Are you standardizing metadata in one sector of your organization, or are you looking across the board? Are there metadata standards you must already follow when your data leaves the door? Or when it enters your door? Perhaps you can take advantage of some synergy.
Author note (1): Although not addressed in this article, minimal data standards refer to naming conventions and required data formats, along with “allowed values”, for example. They were usually enforced by a “data administration” group, but are now enforced through automated data quality/edit checks.
Adrienne Tannenbaum is a recognized specialist in the world of “metadata". She is currently employed by Sanofi and is a member of the Clinical Information Management and Analytics organization.