But different from what? Well, presumably it is different from what we have been taking for granted as data during the history of data management to date. This legacy data seems to now be termed "structured" data. I suspect the term comes from the relational revolution, which is now so embedded in data management that it is seen as constituting the status quo -- at least before big data came along. Yet there is nothing inherent about structured data that prevents it from being large-scale and nothing inherent in big data that prevents it from being structured. So are we looking at some kind of continuum here or are there real differences? I believe there are real differences and that big data is unique in a way that will inspire important innovations in data management.
The Data of Representation
I work mainly in financial services, so my view is somewhat biased toward that sector. In financial services, we deal almost exclusively with things that have no material existence. Examples would be mortgages, insurance policies, stocks, bonds, options, credit ratings, bank accounts, overdrafts and even money. My bank account, for instance, is not a physical thing at all; it is essentially an agreed-upon idea shared by me, the bank, the legal system and the regulatory authorities. It only exists insofar as it is represented, and it is represented in data. My bank account does not really exist in the data, but it is represented in it. If the disk drive with my bank account data on it were to fail, a backup copy would quickly restore it - or at least that is what I earnestly hope. That is not true of a material object, such as my cat, for which I have no equivalent backup. The idea of the bank account is what really exists, and its representation in data, along with information technology, allows this idea to be managed very efficiently.
A characteristic of my bank account, and similar non-material entities in financial services, is that its representation is exact. The balance in my bank account is not some estimate with a positive and negative tolerance; it is truly exact. The same is true of all the transactions that flow through the financial system every day. They may add up to trillions of dollars, but each transaction is exact.
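To make the contrast between an exact representation and an estimate with a tolerance concrete, here is a minimal Python sketch. The names `apply_transactions` and `Estimate`, and all the figures, are illustrative assumptions and do not come from any real banking system.

```python
from dataclasses import dataclass
from decimal import Decimal


def apply_transactions(opening: Decimal, transactions: list[Decimal]) -> Decimal:
    """Add signed postings to an opening balance using exact decimal arithmetic."""
    return opening + sum(transactions, Decimal("0"))


@dataclass(frozen=True)
class Estimate:
    """A value known only to within plus or minus a tolerance (hypothetical type)."""
    value: float
    tolerance: float

    def agrees_with(self, other: float) -> bool:
        # Two readings "agree" if they differ by no more than the tolerance.
        return abs(self.value - other) <= self.tolerance


# A represented entity: the balance is exactly what the postings say it is.
balance = apply_transactions(Decimal("100.00"), [Decimal("-19.99"), Decimal("250.10")])
print(balance)  # 330.11

# An estimate: we can only ask whether another reading falls within tolerance.
estimate = Estimate(value=2.0, tolerance=0.05)
print(estimate.agrees_with(2.03))  # True
print(estimate.agrees_with(2.2))   # False
```

The balance supports a strict equality test, whereas the estimate only supports a question about agreement within a tolerance.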
The non-material entities of the financial sector are orderly human constructs. Because they are orderly, we can more easily manage them in computerized environments. Things would be a lot more difficult if, for instance, every individual bank account had different attributes and governing business rules than every other bank account. But this is not the way we do things. Our creations are like species, where every individual shares the same characteristics and behavior, and this is reflected in the data.
The Data of Observation
Having spent my professional career in data management, I was always puzzled when I read publications by the great minds of quality management, such as Shewhart and Deming. Deming said things like:
"In God we trust; all others bring data."
and, more disturbingly,
"There is no true value of anything. There is instead a figure that is produced by application of a master or ideal method of counting or measurement ..."
These quotes never rang true with my experience of data. To me, they could not connect with financial entities. Apparently, the balance in my bank account, which is exact, is infinitely more accurate than our best estimate of the speed of light, which will forever remain an approximation. I eventually realized that what Shewhart and Deming were talking about were measurements.
A measurement is usually one of three things: a comparison of a characteristic against some criterion (usually a known standard), a count of certain instances, or the direct comparison of two characteristics. A measurement can generally be quantified, although sometimes it is expressed in a qualitative manner. However, I think that big data goes beyond mere measurement, to observations. Let me paraphrase a quote from the British philosopher R. G. Collingwood, taken from his book “The Idea of Nature”:
"A scientific fact is an event in the world of nature ... An event in the world of nature becomes important for the natural scientist only on condition that it is observed ... the observer must be a trustworthy observer and the conditions must be of such a kind as to permit trustworthy observations to be made. And lastly, but not least, the observer must have recorded his observation ...The scientist who wishes to know that such an event has taken place in the world of nature can know this only by consulting the record left by the observer and interpreting it ... The consultation and interpretation of records is the characteristic of historical work."
I think that Collingwood establishes observations as a particular kind of record, and hence as a particular kind of data. I would propose that the concept of data Shewhart and Deming suggest is a subset of the data of observation. It is also interesting that Collingwood sees scientists as producing observations, but the work that is done on the records thereafter - the data - is not truly a natural science, but some other discipline. Collingwood identifies this other discipline as history, which may seem odd to us. However, he had constructed an advanced theory of history that is quite different from its widely accepted definition.
Two Datas
Therefore, we can see that there are two datas. One type of data represents non-material entities in vast computerized ecosystems that humans create and manage. The other data consists of observations of events, which may concern material or non-material entities. The data of representation tends to be structured, in the relational sense, but doesn’t need to be, as graph databases show. Likewise, observations tend to be unstructured, but can be structured, as I have personally seen in the petroleum industry.
However, even on the "observational" side of data, there are differences of essence. The notion of scientific observation assumes many things, including that the observer is careful, competent and trustworthy. However, much of what makes up the bulk of Big Data in the corporate marketing space, which is to say social media (cf. avionics and other sensor and control data), is pure opinion. Or sometimes even less than that: raw emotion. Sometimes it is intentionally defamatory, or even downright libelous. But it is still valuable, and companies and organizations are trying to understand it. Generally each datum is considered untrustworthy, and only the statistically significant trends are considered valuable. In reality, though, this is not that different from many scientific endeavors, such as clinical drug trials, where it is likewise the trends within many observations that are most important.
So, is there a third category, the data of Opinion, Conjecture and/or Ulterior Motive?
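As a rough illustration of the point that no single post is trusted while the aggregate trend may still carry value, here is a minimal Python sketch. It assumes each post has already been scored as +1 (favorable) or -1 (unfavorable) by some upstream step that is not shown. The function name `trend_is_significant`, the 1.96 threshold and the sample scores are all hypothetical choices for illustration, not real data or an established method.

```python
import math
import statistics


def trend_is_significant(scores: list[float], z_threshold: float = 1.96) -> bool:
    """Return True if the mean score differs from zero by at least
    z_threshold standard errors (a simple z-style check, assumed here)."""
    n = len(scores)
    if n < 2:
        return False  # one opinion is not a trend
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(n)
    if stderr == 0:
        return mean != 0  # every score identical
    return abs(mean / stderr) >= z_threshold


# Made-up scores: 80 favorable and 20 unfavorable posts.
mostly_positive = [1, -1, 1, 1, 1, -1, 1, 1, 1, 1] * 10
# Made-up scores: evenly split opinion.
evenly_split = [1, -1] * 50

print(trend_is_significant(mostly_positive))  # True
print(trend_is_significant(evenly_split))     # False
```

Any individual score in either list could be wrong, spiteful or sarcastic; only the aggregate is examined.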
Of course, when it comes to form, social media content, along with anything in textual form, is totally different from database data. For one thing, it is completely free of a formal schema, apart from the "rules" of syntax of the language in which it's written. For example, "The length of the bolt is 2 inches, while it is a half inch in diameter." expresses a couple of concepts that are very precise and free from opinion and bias (though they could still be factually incorrect...) and that are easily structured:
length: 2.0", diameter: 0.5"
However, the structure of the original is embedded in the syntactic and semantic content of the sentence.
Furthermore, the "rules" of language are often broken, which is part of what makes something like a tweet so much harder to parse and model than a WSJ article.
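To illustrate how the structure has to be dug out of the sentence, and how quickly that breaks down when the "rules" are ignored, here is a minimal Python sketch using regular expressions. The patterns, the `NUMBER_WORDS` table, the `extract_bolt` function and the sloppy example sentence are all assumptions made up for this illustration; real text analytics needs far more than this.

```python
import re

# Hypothetical lookup for quantities spelled out in words.
NUMBER_WORDS = {"a half": 0.5}

# Patterns tailored to this one phrasing; they encode the sentence's syntax.
QUANTITY = r"(a half|\d+(?:\.\d+)?)"
LENGTH_RE = re.compile(r"length of the \w+ is " + QUANTITY + r" inch(?:es)?")
DIAMETER_RE = re.compile(r"is " + QUANTITY + r" inch(?:es)? in diameter")


def parse_quantity(token: str) -> float:
    """Turn '2' or 'a half' into a number of inches."""
    return NUMBER_WORDS.get(token, None) or float(token)


def extract_bolt(sentence: str) -> dict[str, float]:
    """Pull length and diameter (in inches) out of a sentence, if present."""
    record = {}
    length = LENGTH_RE.search(sentence)
    if length:
        record["length"] = parse_quantity(length.group(1))
    diameter = DIAMETER_RE.search(sentence)
    if diameter:
        record["diameter"] = parse_quantity(diameter.group(1))
    return record


well_formed = "The length of the bolt is 2 inches, while it is a half inch in diameter."
sloppy = "bolt 2in long, 1/2 in across lol"

print(extract_bolt(well_formed))  # {'length': 2.0, 'diameter': 0.5}
print(extract_bolt(sloppy))       # {}
```

The second sentence conveys the same two facts to a human reader, yet the patterns recover nothing from it.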
Anyway, it should be clear from this (to a human reader anyway) that my opinion is that Text Analytics is one of the biggest challenges within the Big Data sphere. But try getting a computer program to come to that conclusion...