Why is there so much metadata?

Maybe you don’t think there is, but there is. Metadata is the data that defines the structure of data records in files and databases. For some files, such definition data is buried in program code somewhere and only the program knows for sure what the data structure is. That’s a problem if you want to share the data – and that’s why we have databases: so we can.

So let’s ignore the data we don’t care to share and think only of shareable data – which is probably only about 5 percent of all data (estimates vary) at the moment. A rough estimate of the amount of data in the world at the end of 2014 puts it at about 4 zettabytes, which would suggest that there are about 200 exabytes of shareable data.
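For anyone who wants to check that arithmetic, here is a minimal sketch. The two inputs are the rough estimates quoted above, and the conversion assumes decimal units (1 zettabyte = 1,000 exabytes); the function name is purely illustrative.

```python
# Rough estimates quoted in the text, not measured values.
TOTAL_DATA_ZB = 4        # data in the world at the end of 2014, in zettabytes
SHAREABLE_PERCENT = 5    # share of data that is shareable (estimates vary)
EB_PER_ZB = 1000         # decimal units: 1 zettabyte = 1,000 exabytes


def shareable_exabytes(total_zb, percent):
    """Convert a total in zettabytes to the shareable portion in exabytes."""
    return total_zb * EB_PER_ZB * percent / 100


print(shareable_exabytes(TOTAL_DATA_ZB, SHAREABLE_PERCENT), "exabytes")
```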

Big Data is Shareable Data

Yes, big data is shareable. Common sense tells us so. Nobody is going to collect terabytes or even petabytes of data just for fun. Such data gets collected so it can be reported on or delved into or mined. It will be used at some point by a business intelligence or analytics program that did not create it.

The main sources of this data are:

  • Business data. Think of this as traditional data from business applications that we tend to collect in traditional data warehouses.
  • Log file data. Think of this as operational data (database logs, network logs, OS logs, etc.) that we generated in the past but rarely looked at – except, of course, for web logs that some companies did look at.
  • Mobile data. Think of this as location data. Many companies didn’t collect much of this, but some did. In the future we will collect more of it.
  • Social network data. This data is available, usable and frequently accessed.
  • Public data. There are many sources of publicly available data, such as census data.
  • Commercial databases. There is now a burgeoning market in companies selling data.
  • Streaming data. This is commercial data sold as a continuous stream.
  • IoT data. The Internet of Things – sensor data of various kinds. This is currently relatively small as a source of external data, but is predicted to grow as a major source, and it probably will.

In summary, there are many more potential sources of data than there used to be, and we have reasonably inexpensive computer power to process such data, in the cloud if not on premises. And conveniently, in every case the metadata is available in one way or another. If we have reason to believe there are diamonds in such data, we should probably dig deep and find them.

The New World of Metadata and the Data Reservoir

The old world of data was reasonably simple. You had transactional systems that fed data into a data warehouse, and that data was sprayed out into data marts or even desktop databases and analyzed or reported on. It could get complicated at the metadata level because data sources might not agree completely on the definition of customer, supplier, product and so on.

The new world of metadata is like the old world of metadata except that it has many more data sources – and just like data from packaged software, the business has no control over the data definitions of the new sources. We can think of this as big metadata. If big data means lots more data than we had before, then big metadata is lots more data sources.

The old world was characterized by the term data warehouse; the new world will most likely be characterized by the idea of a data reservoir, also sometimes referred to as a data lake. This is similar to a data warehouse, being a big heap of data, but it will most likely be a big Hadoop heap of data, where the business stores data awaiting its use. The advantage of Hadoop for this, aside from its scalability, is that you do not have to specify the metadata when you capture the data; you only need to define a unique key so you can get to the data when needed. You will have to define the metadata eventually and, conveniently, Hadoop has HCatalog for that very purpose. The Hadoop data reservoir is the source of data for data marts and databases of various kinds, supplementing or superseding the data warehouse of old.
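The idea of capturing data first and defining its metadata later can be sketched in a few lines. This is only an illustration of the principle: it uses a plain in-memory dictionary rather than Hadoop or HCatalog, and the key, field names and types are all hypothetical.

```python
import json

# Hypothetical stand-in for a data reservoir: raw records are stored as
# opaque text under a unique key, with no schema declared at capture time.
reservoir = {}


def capture(key, raw_text):
    """Store raw data as-is; only the unique key is needed up front."""
    reservoir[key] = raw_text


# The metadata is defined later, when someone actually wants to use the data.
# Field names and types here are illustrative assumptions.
CLICK_SCHEMA = {"user_id": int, "page": str, "duration_s": float}


def read_with_schema(key, schema):
    """Apply the schema at read time, converting each field to its declared type."""
    record = json.loads(reservoir[key])
    return {name: cast(record[name]) for name, cast in schema.items()}


capture("click-0001", '{"user_id": "42", "page": "/home", "duration_s": "3.5"}')
row = read_with_schema("click-0001", CLICK_SCHEMA)
print(row)
```

Nothing about the record's structure had to be known when it was captured; the schema only matters at the point where a program needs typed values.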

Bring the Metadata Under Control

You cannot share data effectively without managing the metadata. And this is not as simple as it might seem.

Metadata as most of us usually understand it – data names and data type descriptions – is not as meaningful as it needs to be. There is a reason for this. Technically, metadata only needs to provide enough information for a program to use the data. So if the metadata for the Person table says: Person-Code, Title, First Name, Last Name, Job-Title, along with some data type information, a program is good with that. A human being will not be, because the metadata does not provide any context. Even if you are told that the data is from a database used by the HR application, you still may not be sure what the data refers to. Probably it refers to a member of staff, but maybe it refers to a contractor. And if you encounter another data record that mentions Person-Code from some other data source, you may not be sure if it refers to the same set of persons.
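To make the gap concrete, here is a small sketch of the Person table's metadata as a program sees it, with room for the business context a human needs. The column names come from the example above; the data types, the class names and the wording of the description are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    data_type: str          # enough for a program to read the value
    description: str = ""   # business context for a human; often left empty


@dataclass
class TableMetadata:
    name: str
    columns: list
    source_system: str = ""

    def missing_context(self):
        """Return the names of columns that have no business description."""
        return [c.name for c in self.columns if not c.description]


# Technical metadata only: the data types are assumed for this example.
person = TableMetadata(
    name="Person",
    source_system="HR application",
    columns=[
        Column("Person-Code", "integer"),
        Column("Title", "string"),
        Column("First Name", "string"),
        Column("Last Name", "string"),
        Column("Job-Title", "string"),
    ],
)

# A hypothetical enrichment that answers the staff-or-contractor question.
person.columns[0].description = "Identifier for a member of staff; contractors are excluded."

print(person.missing_context())
```

A program could use every column before and after the enrichment; only the added description tells a human which set of persons Person-Code refers to.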

In essence the main problem with metadata is that it’s not as meaningful as it needs to be for users to know exactly what the data is.

There are basically three strategies that you can adopt with metadata:

  • Ignore it.
  • Try to standardize it across the enterprise.
  • Catalog and enrich it to enable sharing.

The first of these strategies only works well if there’s little need to organize the sharing of data. Small to medium-sized businesses that run their businesses entirely on a suite of software packages may be able to ignore the issue – primarily because their BI applications are organized for them. Once an organization strays outside the borders of its software suite, the metadata starts to grow wild and data sharing gets more complicated.

The second of these strategies is a full-blown long-running project that goes by the name of master data management. The objective is to have a master model of corporate data, which is very well-defined and consistent across all systems, with any variances from the master model being known and documented. If this can be established then the activities of collecting, matching, consolidating, cleansing and distributing data can be carried out with guaranteed consistency of data and the full-blown governance of data becomes possible. There are software tools to assist in such efforts.

The main problem with MDM projects is not the software technology needed to manage the definition and maintenance of master data, but persuading individuals throughout the organization to agree on common definitions of data entities and data items. Data and “data ownership” can be intensely political. Naturally, the MDM project becomes even more challenging as the number of data sources an organization cares about expands.

The last of these strategies is the pragmatic one. It can be a sensible precursor to embarking on an MDM project, if that’s the direction you wish to head in, but that’s not a necessity. The idea is to create a registry of all corporate data sources (a kind of map) and to allow users to add business information to the metadata that can be gathered automatically from databases and other data sources. Users can focus on the business-critical data elements, identifying their sources and noting interactions between applications that use the data. In effect, those who use the data become responsible for providing useful business definitions of what the data is and does.

Once a comprehensible and searchable registry is established, data can be shared far more effectively. This approach, incidentally, will also work for defining new external data sources if you enforce some simple organizational rules, such as a rule that data can only be shared if it is defined in the registry. Again, there are vendors with appropriate software tools to help harvest the metadata in this way and allow users to enrich it.
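As a rough sketch of what such a registry does, the following toy version harvests field names from sources, lets users attach business notes, supports a simple search, and enforces the rule that only registered data can be shared. It is not modeled on any vendor's product; the class, method and source names are all hypothetical.

```python
class MetadataRegistry:
    """Toy registry: harvested field names plus user-supplied business notes."""

    def __init__(self):
        self._entries = {}  # source name -> {"fields": [...], "notes": {...}}

    def harvest(self, source, fields):
        """Record technical metadata gathered automatically from a source."""
        self._entries[source] = {"fields": list(fields), "notes": {}}

    def enrich(self, source, field_name, note):
        """Let a user attach a business definition to a harvested field."""
        entry = self._entries[source]
        if field_name not in entry["fields"]:
            raise KeyError(f"{field_name!r} is not a field of {source!r}")
        entry["notes"][field_name] = note

    def search(self, term):
        """Find (source, field) pairs whose name or note mentions the term."""
        term = term.lower()
        hits = []
        for source, entry in self._entries.items():
            for name in entry["fields"]:
                text = f"{name} {entry['notes'].get(name, '')}".lower()
                if term in text:
                    hits.append((source, name))
        return hits

    def can_share(self, source):
        """Organizational rule: only data defined in the registry may be shared."""
        return source in self._entries


registry = MetadataRegistry()
registry.harvest("hr_db.person", ["Person-Code", "Job-Title"])
registry.harvest("crm.contact", ["Person-Code", "Email"])
registry.enrich("hr_db.person", "Person-Code",
                "Staff member identifier; excludes contractors.")

print(registry.search("person-code"))
print(registry.search("contractors"))
print(registry.can_share("weather_feed"))
```

The search on "person-code" finds the field in both sources, which is exactly the point at which a user needs the business notes to decide whether the two refer to the same set of persons.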

Metadata needs to be organized for data sharing to be productive, for the simple reason that users need to know what the data is. The need for this becomes increasingly pressing as the number of useful and potentially useful data sources increases – and it is now increasing quite dramatically.

The chief reason for sharing data within any business is to feed the wide variety of BI and analytics applications that are now becoming indispensable business drivers. As the number of data sources increases, so does the potential of the data to make a difference. But that potential will only be realized if the metadata is well-documented and well-managed.
