Maybe you don’t think there is, but there is. Metadata is the data that defines the structure of data records in files and databases. For some files, such definition data is buried in program code somewhere and only the program knows for sure what the data structure is. That’s a problem if you want to share the data – and that’s why we have databases: so we can.
So let’s ignore the data we don’t care to share and think only of shareable data – which is probably only about 5 percent of data (estimates vary) at the moment. A rough estimate of the amount of data in the world at the end of 2014 puts it at about 4 zettabytes, which would suggest that there is about 200 exabytes of shareable data.
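The arithmetic behind that figure can be checked in a few lines. This is only a sketch of the estimate above: it assumes decimal (SI) units, and the 5 percent share and 4 zettabyte total are the rough figures quoted in the text, not measured values.

```python
# Decimal (SI) units are an assumption; the text does not say which it uses.
ZETTABYTE = 10**21
EXABYTE = 10**18

total_bytes = 4 * ZETTABYTE   # rough worldwide estimate quoted for end of 2014
shareable_percent = 5         # "about 5 percent", estimates vary

# Integer arithmetic avoids floating-point rounding in the result.
shareable_bytes = total_bytes * shareable_percent // 100
shareable_exabytes = shareable_bytes // EXABYTE

print(f"Shareable data: about {shareable_exabytes} exabytes")
```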
Big Data is Shareable Data
Yes, big data is shareable. Common sense tells us so. Nobody is going to collect terabytes or even petabytes of data just for fun. Such data gets collected so it can be reported on or delved into or mined. It will be used at some point by a business intelligence or analytics program that did not create it.
The main sources of this data are:
- Business data. Think of this as traditional data from business applications that we tend to collect in traditional data warehouses.
- Log file data. Think of this as operational data (database logs, network logs, OS logs, etc.) that we generated in the past but rarely looked at – except, of course, for web logs that some companies did look at.
- Mobile data. Think of this as location data. Many companies didn’t collect much of this, but some did. In the future we will collect more of it.
- Social network data. This data is available, usable and frequently accessed.
- Public data. There are many sources of publicly available data, such as census data.
- Commercial databases. There is now a burgeoning market in companies selling data.
- Streaming data. This is commercial data sold as a continuous stream.
- IoT data. The Internet of Things – sensor data of various kinds. This is currently relatively small as a source of external data, but is predicted to grow into a major source, and it probably will.
In summary, there are many more potential sources of data than there used to be, and we have reasonably inexpensive computer power to process such data, in the cloud if not on premises. And conveniently, in every case the metadata is available in one way or another. If we have reason to believe there are diamonds in such data, we should probably dig deep and find them.
The New World of Metadata and the Data Reservoir
The old world of data was reasonably simple. You had transactional systems, they fed data into a data warehouse and that data was sprayed out into data marts or even desktop databases and analyzed or reported on. It could get complicated at the metadata level because data sources might not agree completely on the definition of customer, supplier, product and so on.
The new world of metadata is like the old world of metadata except that it has many more data sources – and just like data from packaged software, the business has no control over the data definitions of the new sources. We can think of this as big metadata. If big data means lots more data than we had before, then big metadata is lots more data sources.
The old world was characterized by the term data warehouse; the new world will most likely be characterized by the idea of a data reservoir, also sometimes referred to as a data lake. This is similar to a data warehouse, being a big heap of data, but it will most likely be a big Hadoop heap of data, where the business stores data awaiting its use. The advantage of Hadoop for this, aside from its scalability, is that you do not have to specify the metadata when you capture the data; you only need to define a unique key so you can get to the data when needed. You will have to define the metadata eventually and, conveniently, Hadoop has HCatalog for that very purpose. The Hadoop data reservoir is the source of data for data marts and databases of various kinds, supplementing or superseding the data warehouse of old.
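The idea of capturing data first and defining its metadata later can be illustrated with a small sketch. This is not Hadoop or HCatalog code. It assumes a plain Python dictionary standing in for the reservoir, and the key format, field names and types are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical in-memory stand-in for the reservoir: unique key -> raw bytes.
reservoir = {}

def capture(key, raw):
    """Store a record as-is. No schema is needed at capture time, only a key."""
    reservoir[key] = raw

# Metadata defined later, when someone actually wants to use the data.
# These field names and types are illustrative assumptions.
WEB_LOG_SCHEMA = {"ts": str, "user_id": int, "url": str}

def read_with_schema(key, schema):
    """Apply the schema at read time: keep declared fields and coerce their types."""
    record = json.loads(reservoir[key])
    return {name: cast(record[name]) for name, cast in schema.items()}

# Capture a raw web log record without describing it.
capture(
    "weblog/0001",
    b'{"ts": "2014-12-31T23:59:59", "user_id": "42", "url": "/home", "extra": "x"}',
)

# Only now is the structure imposed on the stored bytes.
row = read_with_schema("weblog/0001", WEB_LOG_SCHEMA)
print(row)
```

The record is stored untouched, and fields the schema does not mention are simply ignored on read, so the same raw data could later be read with a different or richer definition.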
Bring the Metadata Under Control
You cannot share data effectively without managing the metadata. And this is not as simple as it might seem.
Metadata as most of us usually understand it – data names and data type descriptions – is not as meaningful as it needs to be. There is a reason for this. Technically, metadata only needs to provide enough information for a program to use the data. So if the metadata for the Person table says: Person-Code, Title, First Name, Last Name, Job-Title, along with some data type information, a program is good with that. A human being will not be, because the metadata does not provide any context. Even if you are told that the data is from a database used by the HR application, you still may not be sure what the data refers to. Probably it refers to a member of staff, but maybe it refers to a contractor. And if you encounter another data record that mentions Person-Code from some other data source, you may not be sure if it refers to the same set of persons.
In essence the main problem with metadata is that it’s not as meaningful as it needs to be for users to know exactly what the data is.
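To make the gap concrete, here is a minimal sketch of the Person-Code example, contrasting metadata that is enough for a program with metadata that carries context for a person. The class, its attribute names, the "string" type, and the second source system are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FieldMetadata:
    # What a program needs: a name and a data type.
    name: str
    data_type: str
    # Context a human needs. These attributes are illustrative assumptions.
    description: Optional[str] = None
    source_system: Optional[str] = None
    population: Optional[str] = None  # which set of persons the values identify

def same_population(a: FieldMetadata, b: FieldMetadata) -> Optional[bool]:
    """True or False when both fields declare a population, None when unknown."""
    if a.population is None or b.population is None:
        return None
    return a.population == b.population

# Bare metadata: usable by a program, ambiguous to a reader.
bare = FieldMetadata("Person-Code", "string")

# Enriched metadata from two hypothetical sources.
hr = FieldMetadata(
    "Person-Code", "string",
    description="Identifier of a member of staff",
    source_system="HR application",
    population="staff",
)
vendor = FieldMetadata(
    "Person-Code", "string",
    description="Identifier of a contractor",
    source_system="contractor management system",
    population="contractors",
)

print(same_population(bare, hr))    # context missing, so the answer is unknown
print(same_population(hr, vendor))  # same field name, different sets of persons
```

With only names and types, the question of whether two Person-Code fields can be joined has no answer. The extra attributes are what let a person, or a catalog tool, decide.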
There are basically three strategies that you can adopt with metadata:
- Ignore it.
- Try to standardize it across the enterprise.
- Catalog and enrich it to enable sharing.