Data Management is Based on Philosophy, Not Science


There's a joke running around on Twitter that the definition of a data scientist is “a data analyst who lives in California.” I'm sure the good-natured folks of the Golden State will not object to me bringing this up to make a point. The point is: Thinking purely in terms of marketing, which is a better title -- data scientist or data philosopher?

My instincts tell me there is no contest. The term data scientist conjures up an image of a tense, driven individual, surrounded by complex technology in a laboratory somewhere, wrestling valuable secrets out of the strange substance called data. By contrast, the term data philosopher brings to mind a pipe-smoking elderly gentleman sitting in a winged chair in some dusty recess of academia where he occasionally engages in meaningless word games with like-minded individuals. 

These stereotypes are obviously crude, but they are probably what would come into the minds of most executive managers. Yet how true are they? I submit that there is a strong case that data management is much more like applied philosophy than it is like applied science.

Making Distinctions

I have argued before that a science of data is possible, but by “science” I mean an organized body of knowledge that addresses a particular set of problems using a particular set of methods. Today, science is really a shorthand reference to “natural science,” which is the set of sciences that investigate the material world (meaning physics, chemistry, biology, etc.). The success of natural science over the past five centuries has been indisputably revolutionary.

However, various authors, such as F.A. Hayek and R.G. Collingwood, have argued that this has given rise to scientism (this is Hayek's word - Collingwood's term was pseudo-science). Scientism is the misapplication of the language and methods of natural science to departments of human experience which are utterly unlike the material world that natural science studies. The reason this is done is to pretend that the proven success of natural science can be transferred to these otherwise difficult areas. According to Hayek, this has had a negative impact in the so-called social sciences as a result of scientistic innovations such as Keynesianism and Socialism. According to Collingwood, uncritical acceptance of natural science will ultimately destroy the foundations of Western civilization and usher in a new era of barbarism.

So what does this have to do with data management? Well, as a trained scientist myself (a biologist), I have often reflected that with respect to what I do in data management, I have learned nothing from science. It is true that the technology that makes data management possible is based on engineering, which is based on science, but data management is not about this technology any more than writing is about paper and ink. There simply seem to be no lessons learned from natural science that can be directly transferred to data management.

The Role of Philosophy

By contrast, there are lessons that are derived from philosophy that can be applied to data management.  Here are a few:

1. The theory and practice of definitions, which are a very old part of logic and are the basis of semantics.

2. The rules of normalization, which are derived from logic.   

3. The differences between generic (supertype-subtype) and partitive (part-whole) types of conceptual systems, which are yet another set of lessons from philosophy.

4. The principles of logical division and classification, which are used in constructing taxonomies, and go back to Aristotle.

5. The approach of structural decomposition in business analysis, which can be found in Descartes' Method.  

6. The basic vocabulary of data management (e.g., entity, attribute, relationship), which goes back more than two millennia in philosophy.

Philosophy versus Science

Recently, the physicist Stephen Hawking announced that “Philosophy is dead.” He claims that it has not kept up with modern developments in science, particularly physics. By contrast, R.G. Collingwood, writing in the mid-twentieth century, proposed that philosophy provides a framework within which all natural science is possible - it is a kind of “master science.” It seems very likely to me that data management is part of this master science, and if this is true it will position data management in philosophy, and not in natural science.

Let us consider definitions as an example. The International Astronomical Union caused outrage a few years ago when it redefined “planet” so as to exclude Pluto. The new IAU definition states that a planet:

is a celestial body that (a) is in orbit around the Sun, (b) has sufficient mass for its self-gravity to overcome rigid body forces so that it assumes a hydrostatic equilibrium (nearly round) shape, and (c) has cleared the neighbourhood around its orbit.

There are some problems here. The first is that the IAU is actually defining “planets and other bodies in our Solar System, except satellites.” It is not strictly defining planet. Does the IAU maintain there are only planets in “our solar system”? What exactly are all the extrasolar planets we have been hearing about for the past few years? Why is the term “nearly round” used in the definition? A circle can be round, and only exists in two dimensions. Surely “nearly spherical” would have been more accurate. And what exactly is the “neighbourhood” referred to in the definition? This point is very important because it is the only point by which Pluto is held to differ from true “planets.” The IAU saw fit to define “planet” and “dwarf planet.” If they had followed classical logic they would have realized they needed either a superordinate class or a coordinate class, and they would have given us something like “planet” (the genus or supertype) and “dwarf planet” and “un-dwarf planet” (the species or subtypes).
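
For readers who think in data models, the genus-and-species arrangement suggested above can be sketched roughly as follows. This is only an illustration in Python: the class names, attribute names, and default values are my own assumptions, not anything the IAU has published, and the attributes simply mirror clauses (a) to (c) of the quoted definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Planet:
    """Genus (supertype): characteristics shared by every species."""
    name: str
    orbits_sun: bool = True        # clause (a), hypothetical attribute name
    nearly_spherical: bool = True  # clause (b), hypothetical attribute name


@dataclass(frozen=True)
class DwarfPlanet(Planet):
    """Species (subtype): has NOT cleared the neighbourhood of its orbit."""
    cleared_neighbourhood: bool = False  # clause (c) is the differentia


@dataclass(frozen=True)
class UnDwarfPlanet(Planet):
    """Coordinate species (subtype): HAS cleared the neighbourhood."""
    cleared_neighbourhood: bool = True


pluto = DwarfPlanet("Pluto")
earth = UnDwarfPlanet("Earth")

# Both species belong to the genus, but neither belongs to the other.
print(isinstance(pluto, Planet), isinstance(earth, Planet))
print(isinstance(pluto, UnDwarfPlanet))
```

The design choice worth noticing is that the shared characteristics live in the genus, and the single differentiating characteristic is the only thing the two species declare for themselves.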

The point here is that the theory and practice of definitions does not originate from natural science, but philosophy, and scientists can get themselves into trouble by having a poor understanding of it. This would seem to justify Collingwood's view more than Hawking's.

If we use the term “data scientist,” we imply that there is “data science.” There is a sense in which this can be true, but it is not any sense in which data science is derived from natural sciences such as physics. In fact, all natural sciences rely on data management because all natural sciences are concerned with cataloging observations and recording the results of experiments. Everything goes into data. Data management, in turn, has its pedigree in philosophy. However, I still don't think it is a career boost for anyone to advertise themselves as a data philosopher.
