Does Big Data Mean Big Metadata?

Comments (8)

Why is there so much metadata?

Maybe you don’t think there is, but there is. Metadata is the data that defines the structure of data records in files and databases. For some files, such definition data is buried in program code somewhere and only the program knows for sure what the data structure is. That’s a problem if you want to share the data – and that’s why we have databases: so we can.

So let’s ignore the data we don’t care to share and think only of shareable data – which is probably only about 5 percent of data (estimates vary) at the moment. A rough estimate of the amount of data in the world at the end of 2014 puts it at about 4 zettabytes, which would suggest that there are about 200 exabytes of shareable data (5 percent of 4 zettabytes is 0.2 zettabytes).
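
For readers who want to check the arithmetic, here is a minimal sketch in Python. It assumes decimal (SI) units, where one zettabyte is 1,000 exabytes; the variable names are purely illustrative, and both inputs are the rough estimates quoted above rather than measured figures.

```python
# Back-of-the-envelope check of the estimate quoted in the article.
# Assumption: decimal (SI) units, so 1 zettabyte = 1,000 exabytes.
ZETTABYTE = 10**21  # bytes
EXABYTE = 10**18    # bytes

total_bytes = 4 * ZETTABYTE   # rough worldwide total at the end of 2014
shareable_percent = 5         # the "about 5 percent" estimate (estimates vary)

# Integer arithmetic keeps the result exact.
shareable_bytes = total_bytes * shareable_percent // 100
shareable_exabytes = shareable_bytes // EXABYTE

print(f"Shareable data: about {shareable_exabytes} exabytes")
```

Change either input and the result scales linearly, which is a reminder of how soft the 200 exabyte figure is.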

Big Data is Shareable Data

Yes, big data is shareable. Common sense tells us so. Nobody is going to collect terabytes or even petabytes of data just for fun. Such data gets collected so it can be reported on or delved into or mined. It will be used at some point by a business intelligence or analytics program that did not create it.

The main sources of this data are:

  • Business data. Think of this as traditional data from business applications that we tend to collect in traditional data warehouses.
  • Log file data. Think of this as operational data (database logs, network logs, OS logs, etc.) that we generated in the past but rarely looked at – except, of course, for web logs that some companies did look at.
  • Mobile data. Think of this as location data. Many companies didn’t collect much of this, but some did. In the future we will collect more of it.
  • Social network data. This data is available, usable and frequently accessed.
  • Public data. There are many sources of publicly available data, such as census data.
  • Commercial databases. There is now a burgeoning market in companies selling data.
  • Streaming data. This is commercial data sold as a continuous stream.
  • IoT data. The Internet of Things – sensor data of various kinds. This is currently relatively small as a source of external data, but is predicted to grow as a major source, and it probably will.

In summary, there are many more potential sources of data than there used to be, and we have reasonably inexpensive computer power to process such data, in the cloud if not on premises. And conveniently, in every case the metadata is available in one way or another. If we have reason to believe there are diamonds in such data, we should probably dig deep and find them.

The New World of Metadata and the Data Reservoir

The old world of data was reasonably simple. You had transactional systems that fed data into a data warehouse, and that data was sprayed out into data marts or even desktop databases to be analyzed or reported on. It could get complicated at the metadata level because data sources might not agree completely on the definition of customer, supplier, product and so on.

The new world of metadata is like the old world of metadata except that it has many more data sources – and just like data from packaged software, the business has no control over the data definitions of the new sources. We can think of this as big metadata. If big data means lots more data than we had before, then big metadata is lots more data sources.

The old world was characterized by the term data warehouse; the new world will most likely be characterized by the idea of a data reservoir, also sometimes referred to as a data lake. This is similar to a data warehouse, being a big heap of data, but it will most likely be a big Hadoop heap of data, where the business stores data awaiting its use. The advantage of Hadoop for this, aside from its scalability, is that you do not have to specify the metadata when you capture the data; you only need to define a unique key so you can get to the data when needed. You will have to define the metadata eventually and, conveniently, Hadoop has HCatalog for that very purpose. The Hadoop data reservoir is the source of data for data marts and databases of various kinds, supplementing or superseding the data warehouse of old.
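
The idea of capturing data first and defining its metadata later can be sketched in a few lines of Python. This is only an illustration of the principle, not of Hadoop or HCatalog: the in-memory dictionary, the function names, the pipe-delimited record and the field names are all hypothetical.

```python
# Hypothetical in-memory stand-in for a data reservoir:
# raw payloads are stored under a unique key and nothing else.
reservoir = {}

def capture(key, raw_line):
    """Store the record as-is; no field names or types are declared here."""
    reservoir[key] = raw_line

def read_with_schema(key, schema):
    """Apply the metadata at read time.

    schema is a list of (field_name, converter) pairs, defined
    only when somebody actually needs to use the data.
    """
    values = reservoir[key].split("|")
    if len(values) != len(schema):
        raise ValueError(f"record {key!r} does not match the schema")
    return {name: convert(value)
            for (name, convert), value in zip(schema, values)}

# Capture time: only the key is known.
capture("evt-0001", "2014-06-18|store-17|42.50")

# Later: the metadata is finally defined and applied on read.
sale_schema = [("sale_date", str), ("store_id", str), ("amount", float)]
record = read_with_schema("evt-0001", sale_schema)
print(record)
```

The trade-off is visible even at this scale: capture is trivially cheap, but nothing can be done with the record until someone supplies the schema.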

Bring the Metadata Under Control

You cannot share data effectively without managing the metadata. And this is not as simple as it might seem.

Metadata as most of us usually understand it – data names and data type descriptions – is not as meaningful as it needs to be. There is a reason for this. Technically, metadata only needs to provide enough information for a program to use the data. So if the metadata for the Person table says: Person-Code, Title, First Name, Last Name, Job-Title, along with some data type information, a program is good with that. A human being will not be, because the metadata does not provide any context. Even if you are told that the data is from a database used by the HR application, you still may not be sure what the data refers to. Probably it refers to a member of staff, but maybe it refers to a contractor. And if you encounter another data record that mentions Person-Code from some other data source, you may not be sure if it refers to the same set of persons.

In essence the main problem with metadata is that it’s not as meaningful as it needs to be for users to know exactly what the data is.
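
To make the gap concrete, here is a rough sketch in Python of the Person example, first with the bare metadata a program needs and then with the context a human needs. The class names, the data types and the wording of the descriptions are assumptions made for illustration; the article specifies only the column names.

```python
from dataclasses import dataclass

@dataclass
class ColumnMeta:
    name: str
    data_type: str          # assumed types; the article does not give them
    description: str = ""   # business meaning, often left blank

@dataclass
class TableMeta:
    name: str
    columns: list
    source_system: str = ""  # e.g. which application owns the data
    population: str = ""     # who or what the rows actually represent

    def missing_context(self):
        """List the context a human reader would still have to guess."""
        gaps = []
        if not self.source_system:
            gaps.append("source_system")
        if not self.population:
            gaps.append("population")
        gaps.extend(f"{c.name}.description"
                    for c in self.columns if not c.description)
        return gaps

names = ["Person-Code", "Title", "First Name", "Last Name", "Job-Title"]

# Enough for a program: names and types only.
bare = TableMeta("Person", [ColumnMeta(n, "string") for n in names])

# Enough for a person: the same columns plus context (hypothetical wording).
enriched = TableMeta(
    "Person",
    [ColumnMeta(n, "string", f"{n} as recorded by HR") for n in names],
    source_system="HR application",
    population="Members of staff only; contractors are excluded",
)

print(bare.missing_context())
print(enriched.missing_context())
```

The first table is perfectly usable by code, yet every question raised above (staff or contractor, same Person-Code or not) shows up as a gap; the second answers them in the metadata itself.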

There are basically three strategies that you can adopt with metadata:

  • Ignore it.
  • Try to standardize it across the enterprise.
  • Catalog and enrich it to enable sharing.


Comments (8)
Great article Robin. Can you recommend any tools that could help organizations collect, standardize and manage their metadata?
Posted by Rustam A | Wednesday, June 18 2014 at 11:39AM ET
There's a White Paper here that has some info on one solution:
Posted by Joy R | Wednesday, June 18 2014 at 5:35PM ET
Check out this product - it's the best I have come across
Posted by Trish W | Thursday, June 19 2014 at 1:31AM ET
Robin, thanks. It's great to see you opening up a topic that isn't getting enough attention yet! As you know this whole issue of metadata management for Big Data is "Big" on my radar just now. You're right that "the new world of metadata is like the old world of metadata..." But I do think those extra data sources are going to drive more use cases. The whole issue reminds me of the early days of data warehouse. I think there are governance bumps in Big Data's future! Here's another reference to a potential solution:
Posted by Ian R | Tuesday, June 24 2014 at 1:20PM ET
Thought-provoking article, with a very different perspective on big data. Data reservoir and data lake are new terms thrown into the market for big data. Even though tools are available for MDM projects, implementation is very tricky (in fact rare). The banking system is really struggling to get metadata right to enable data governance with the help of MDM or any other tools. mahender.
Posted by Mahender C | Tuesday, July 01 2014 at 2:11AM ET