We have heard a lot about data scientists recently. Quite often, the term seems to mean little more than a data analyst or a quantitative analyst (a "quant"). Yet the notion of the data scientist seems to be part of a large paradigm shift.

There is now a growing feeling that data is a fundamental resource for enterprises that can be used to do things that affect the core of what enterprises are about. Long ago, data was seen simply as a byproduct of automation that needed to be organized and administered. Then data was found to be useful for the management of enterprises. Now, data is increasingly regarded as a vital input to the business models of enterprises. We are moving beyond the data-centricity of projects or environments, like data warehouses, to data-centric enterprises.

If enterprises are serious about harnessing the value of data, the way data is handled will have to be revolutionized. Dealing with data requires technology, and IT controls technology. Yet IT is typically very narrow-minded and can only envisage working in projects using the methodology of the systems development life cycle (SDLC). I submit that neither the project approach nor the SDLC will ever get enterprises to be truly data-centric. In fact, IT’s insistence on working in projects and using the SDLC will prevent enterprises from ever becoming data-centric.

It is not possible to go into all the aspects of what is required for a data-centric enterprise, but after studying and thinking about the current situation for some time, I now firmly believe that data-centric enterprises will need to set up data laboratories. Real scientists do real science in laboratories, and I think that real data scientists will do the same in the future – as opposed to the retitled analysts and quants that we have today.

The Purpose of the Data Laboratory

What would the purpose of a data laboratory be? I think that the following ideas are strong candidates:

  • Identify what value can be extracted from existing enterprise data.
  • Identify value that may be present for the enterprise in new data sources that can be acquired externally or internally.
  • Synthesize new "data compounds" through the "chemistry" of data integration and assess them for business value.
  • Distill new information products from the raw data resources.
  • Figure out how to work around existing data deficiencies.

The obvious objection to having a data laboratory to perform these tasks is that these ideas can all be tackled today in the context of individual projects. I contend that this is not so.

Suppose a marketing executive has an idea for a six-week campaign that requires a new configuration of data as an input. The first problem is getting IT to procure the infrastructure to set up the development, quality assurance and production environments that IT sees as needed for this "project." Doing this repeatedly, every time a new data research idea comes up, is so colossally expensive that it prohibits enterprise data-centricity. Astronomers do not build a new telescope for every research initiative. Molecular biologists do not build a new laboratory every time they sequence the DNA of a particular species. Nor should enterprises set up completely new environments when dealing with new data initiatives. Rather, there should be an existing environment in which data research can be done. It is true that the environment might need enhancing for a particular initiative, but that is no justification for what we see today.

Even if the infrastructure can be procured, another set of problems typically follows. The individuals who have the ideas about the data need to experiment with it: to observe the data, subject it to tests and see how it behaves, assess it against various business criteria, mix it together, synthesize new data structures and so on. Resources, like ETL developers, will be needed to do this, so IT once again enters the picture. And IT always begins with "What are your requirements?" or "Tell me what to do and I will do it." Well, if a real data scientist knew what the "requirements" were to the level needed by traditional IT, all the answers to the research questions would already be known and there would be no need for a data laboratory. Telling IT what to do in excruciating detail is frankly beyond the capacity of a real data scientist.

What a real data scientist requires is the equivalent of technicians, research assistants and other support staff we find in real laboratories. Such staff, as I found when I did scientific research, are invaluable. Traditional IT, by contrast, works in an "order-taker" mode that is far from the true partnership required. The effort the real data scientist has to put into organizing IT is equivalent to doing nearly all the work themselves. This is hardly a scalable model.

The Shape of the Data Laboratory

The notion of projects and the SDLC has everything beginning in development. Yet a data laboratory, by definition, must start with real data – that is, production data. The data laboratory is a place where we search for value in data. That cannot be done with the tiny, made-up data sets that are used for the unit testing required as the first step of testing in the SDLC. In fact, there is no concept of development, quality assurance or production in a data laboratory; it is all production data, but it is used in a research mode.

Therefore, we have to treat the data laboratory as a production environment, in the sense that it contains production data. The appropriate security must be built in and staff who work in the data laboratory must be qualified to work with the data and trained appropriately.

It will also be very important to maintain data isolation. The data laboratory can never be part of the true "production" environments of an enterprise, even though it contains only production data. No data can ever leak back from the data laboratory into any other environment in the enterprise. The temptations for shortcuts and workarounds involving the data laboratory must be completely resisted.
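The isolation rule can be pictured as a one-way valve: data may flow into the data laboratory but never out of it into another environment. What follows is only a minimal sketch of that idea, not a description of any real tool. The environment names, the `Transfer` record and the `is_allowed` check are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical name for the laboratory environment (an assumption for this sketch).
LAB = "data_lab"


@dataclass(frozen=True)
class Transfer:
    """A requested movement of a data set between two environments."""
    source: str
    target: str
    dataset: str


def is_allowed(transfer: Transfer) -> bool:
    """Allow data to flow into the laboratory, never out of it."""
    if transfer.source == LAB and transfer.target != LAB:
        return False  # would leak laboratory data into another environment
    return True


# Hypothetical requests: into the lab, out of the lab, and within the lab.
requests = [
    Transfer("crm_production", LAB, "customers"),
    Transfer(LAB, "crm_production", "customers_enriched"),
    Transfer(LAB, LAB, "customers_sample"),
]
decisions = [is_allowed(t) for t in requests]
```

In this sketch only the second request is refused, because it is the only one that would move data from the laboratory into another environment. A real enterprise would enforce the rule through its security and network controls rather than through application code alone.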

Then we come to how the data laboratory operates, how the real data scientists are organized to carry out their work and how the products of the data laboratory are taken to market. These are very large topics and are beyond the scope of this article. However, I hope I have sketched why enterprises need data laboratories and what shape a data laboratory would take.