The big data phenomenon has given rise to a lot of discussion. I think there is a general consensus that big data is something more than just ultra-large-scale data at the petabyte level. It is somehow different.
But different from what? Well, presumably it is different from what we have been taking for granted as data during the history of data management to date. This legacy data seems to now be termed "structured" data. I suspect the term comes from the relational revolution, which is now so embedded in data management that it is seen as constituting the status quo -- at least before big data came along. Yet there is nothing inherent about structured data that prevents it from being large-scale and nothing inherent in big data that prevents it from being structured. So are we looking at some kind of continuum here or are there real differences? I believe there are real differences and that big data is unique in a way that will inspire important innovations in data management.
The Data of Representation
I work mainly in financial services, so my view is somewhat biased toward that sector. In financial services, we deal almost exclusively with things that have no material existence. Examples would be mortgages, insurance policies, stocks, bonds, options, credit ratings, bank accounts, overdrafts and even money. My bank account, for instance, is not a physical thing at all; it is essentially an agreed-upon idea shared among me, the bank, the legal system and the regulatory authorities. It only exists insofar as it is represented, and it is represented in data. My bank account does not really exist in the data, but it is represented in it. If the disk drive with my bank account data on it were to fail, a backup copy would quickly restore it - or at least that is what I earnestly hope. That is not true of a material object, such as my cat, for which I have no equivalent backup. The idea of the bank account is what really exists, and its representation in data, along with information technology, allows this idea to be managed very efficiently.
A characteristic of my bank account, and similar non-material entities in financial services, is that its representation is exact. The balance in my bank account is not some estimate with a positive and negative tolerance; it is truly exact. The same is true of all the transactions that flow through the financial system every day. They may add up to trillions of dollars, but each transaction is exact.
The non-material entities of the financial sector are orderly human constructs. Because they are orderly, we can more easily manage them in computerized environments. Things would be a lot more difficult if, for instance, every individual bank account had attributes and governing business rules different from those of every other bank account. But this is not the way we do things. Our creations are like species, where every individual shares the same characteristics and behavior, and this is reflected in the data.
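The exactness and uniformity described above can be sketched in code. The following is a minimal illustration, not a description of any real banking system: the `BankAccount` class, its fields and the overdraft rule in `post` are hypothetical names and assumptions chosen for the example, and Python's `decimal.Decimal` stands in for whatever exact numeric type a real system might use.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class BankAccount:
    """Hypothetical representation of an account: every instance shares
    the same attributes and the same governing rule."""
    account_id: str
    balance: Decimal = Decimal("0.00")

    def post(self, amount: Decimal) -> None:
        """Apply one exact transaction; reject overdrafts (an assumed rule)."""
        new_balance = self.balance + amount
        if new_balance < 0:
            raise ValueError("insufficient funds")
        self.balance = new_balance


account = BankAccount("ACC-001")
for amount in (Decimal("0.10"), Decimal("0.20"), Decimal("-0.30")):
    account.post(amount)

# The represented balance is exact: no tolerance, no rounding drift.
balance_is_exact = account.balance == Decimal("0.00")

# The same arithmetic in binary floating point is only approximate.
float_is_exact = (0.1 + 0.2 - 0.3) == 0.0
```

Every account built from this class carries the same attributes and obeys the same rule, which is the "species" quality of represented entities; the comparison with floating point shows why an exact numeric type suits data whose values are exact by construction.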
The Data of Observation
Having spent my professional career in data management, I was always puzzled when I read publications by the great minds of quality management, such as Shewhart and Deming. Deming said things like:
"In God we trust; all others bring data."
and, more disturbingly,
"There is no true value of anything. There is instead a figure that is produced by application of a master or ideal method of counting or measurement ..."
These quotes never rang true with my experience of data. I could not connect them with financial entities. Apparently, the balance in my bank account, which is exact, is infinitely more accurate than our best estimate of the speed of light, which will forever remain an approximation. I eventually realized that what Shewhart and Deming were talking about was measurement.
A measurement is usually one of three things: a comparison of a characteristic against some criterion (usually a known standard), a count of certain instances, or a direct comparison of two characteristics. A measurement can generally be quantified, although sometimes it is expressed in a qualitative manner. However, I think that big data goes beyond mere measurement to observation. Let me paraphrase a quote from the British philosopher R. G. Collingwood, taken from his book “The Idea of Nature”:
"A scientific fact is an event in the world of nature ... An event in the world of nature becomes important for the natural scientist only on condition that it is observed ... the observer must be a trustworthy observer and the conditions must be of such a kind as to permit trustworthy observations to be made. And lastly, but not least, the observer must have recorded his observation ... The scientist who wishes to know that such an event has taken place in the world of nature can know this only by consulting the record left by the observer and interpreting it ... The consultation and interpretation of records is the characteristic of historical work."
I think that Collingwood establishes observations as a particular kind of record, and hence as a particular kind of data. I would propose that the concept of data Shewhart and Deming suggest is a subset of the data of observation. It is also interesting that Collingwood sees scientists as producing observations, but the work that is done on the records thereafter - the data - is not truly a natural science, but some other discipline. Collingwood identifies this other discipline as history, which may seem odd to us; however, he had constructed an advanced theory of history that is quite different from the widely accepted definition of history.
Therefore, we can see that there are two datas. One type of data represents non-material entities in vast computerized ecosystems that humans create and manage. The other data consists of observations of events, which may concern material or non-material entities. The data of representation tends to be structured, in the relational sense, but doesn’t need to be, as graph databases show. Likewise, observations tend to be unstructured, but can be structured, as I have personally seen in the petroleum industry.
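The data of observation can be sketched in the same spirit. This is only an illustration under stated assumptions: the `Observation` class and its fields are hypothetical, chosen to mirror Collingwood's conditions (an observer, the conditions of observation and a record) and Deming's point that a figure is produced by a method of measurement. The readings are invented sample values, not real data.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class Observation:
    """Hypothetical record of one observed event."""
    observer: str      # who or what made the observation
    conditions: str    # circumstances under which it was made
    method: str        # measurement method applied
    value: float       # the figure the method produced
    tolerance: float   # plus-or-minus uncertainty of that figure


def summarize(observations: list[Observation]) -> tuple[float, float]:
    """Interpret a set of records: mean and sample standard deviation."""
    values = [o.value for o in observations]
    return mean(values), stdev(values)


# Invented sample readings, for illustration only.
readings = [
    Observation("sensor-A", "lab, 20 C", "thermocouple", 99.8, 0.5),
    Observation("sensor-B", "lab, 20 C", "thermocouple", 100.1, 0.5),
    Observation("sensor-A", "lab, 21 C", "thermocouple", 100.4, 0.5),
]

estimate, spread = summarize(readings)
```

Unlike an account balance, the estimate here is never exact. It is an interpretation of the records, and it shifts as further records are consulted.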
"Structured" and "unstructured" describe form, not essence, and I suggest that "representation" and "observation" describe the essences of the two datas. I would also submit that the two datas need different data management approaches. We have a good idea of what these are for the data of representation, but much less so for the data of observation.