For decades, the purpose of information technology departments has been to acquire or build applications and implement them. Perhaps we do not stop long enough to think about what we are doing, but it would appear that we think of applications as machines. Once the machine has been built, it is turned over to its business owners and is expected to function, although it may require some maintenance. I think this analogy is only partially accurate, because it ignores the role of data. Furthermore, I think such a perspective is fundamentally dangerous.

Viewing an application as a machine makes some sense when the intent is to automate a manual process. For the first few decades of the Information Age, this is what applications did. More recently, there has been a shift in applications from a process-centric to a data-centric orientation. There is widespread recognition that data is a valuable resource and that the value needs to be unlocked from the data. Data integration and BI environments are being built to unlock this value. However, these data-centric applications are not like their process-centric forerunners. In particular, the data, which is the raw material the "machine" processes, needs as much attention as the machine itself. This was never really the case in process-centric applications.

A data-centric application is like an oil refinery. An oil refinery is a complex that turns crude petroleum into a diverse range of valuable end products, such as gasoline, kerosene, and even fertilizer and energy. Analogously, in a data-centric application, raw data is taken as an input and distilled and processed into valuable information products. No company would be allowed to operate an oil refinery without monitoring what is flowing through the plumbing. Fluid level gauges, heat sensors, pressure monitors, and so on provide an array of monitoring equipment that usually feeds a central control function. Yet it is very rare to find anything analogous in a data-centric environment. It is as if there were an expectation that if the processes run correctly, the production data must be correct, too.

What Can Go Wrong?

The need to monitor and meter production data is not really a "nice to have" feature for a modern enterprise. It is essential. Data flows can go wrong in all kinds of ways. Consider the orchestration of data movement. We may have a nightly flow from a table in Transaction Application A to Staging Table B in Data Warehouse C, and a second flow from Staging Table B to Fact Table D in the warehouse. Suppose that the flow from A to B is scheduled to run at 2:15 a.m. every day, and the flow from B to D at 5:30 a.m. every day. Now suppose that the first flow is delayed and does not happen until after the B to D flow has completed. We will obviously have a problem: Fact Table D will have been loaded from the previous day's staging data.

A single isolated example like this is easy to comprehend and might not seem to require a monitoring function to detect exceptions. Perhaps it could be managed within the process itself. But when we have hundreds or thousands of instances of data flows per day (many individual flow processes are executed multiple times per day), figuring out everything that could go wrong and specifically coding it into the data movement processes is not scalable. Also, what happens if a data movement process, for whatever reason, simply is not run? It cannot detect its own failure. We are back to the need for independent verification - for monitoring.
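The independent check described above can be sketched in a few lines. This is a minimal illustration, not a real scheduler integration: the flow names and the `run_log` of completion times are hypothetical stand-ins for what a monitor might read from a scheduler's audit table.

```python
from datetime import datetime

# Hypothetical run log a monitor might collect from a scheduler's audit
# table: flow name -> completion time. Here the A-to-B flow has slipped
# well past its 2:15 a.m. slot, finishing after B-to-D already ran.
run_log = {
    "A_to_B": datetime(2024, 1, 15, 6, 10),
    "B_to_D": datetime(2024, 1, 15, 5, 45),
}

def check_dependency(run_log, upstream, downstream):
    """Return a problem description, or None if the ordering held.

    Covers both failure modes from the text: the upstream flow never
    ran at all, or it completed after the downstream flow that
    consumes its output.
    """
    up = run_log.get(upstream)
    down = run_log.get(downstream)
    if up is None:
        return f"{upstream} did not run"
    if down is not None and up > down:
        return f"{upstream} finished after {downstream} had already run"
    return None
```

The point of the sketch is that the check lives outside the flows themselves: because the monitor inspects an independent record of what ran and when, it can catch a flow that never executed, which the flow itself cannot.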

What Is Monitoring and Metering for Data?

We undoubtedly still have a lot of theoretical and practical work to do in the realm of data monitoring and metering, but it is possible to see the outlines of what it should consist of.

A monitoring and metering tool should allow us to identify a source and a target. We should then be able to identify the records expected to have moved from the source and the records expected to have arrived from the source in the target. This could be simple or complex: it may well involve identifying subsets of records in the source and the target, in which case we will inevitably need a business rules approach, with logic, perhaps expressed as SQL queries, to define those subsets. This logic will require metadata, such as a description of what the logic is trying to do, who set it up, and how it corresponds to some kind of business reality.
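One way such a rule might be declared is as a small structure that pairs the subset-defining SQL with the metadata the rule requires. Everything here is illustrative: the table names, the queries, and the owner are assumptions, not a prescribed schema.

```python
# A minimal sketch of one source-to-target reconciliation rule.
# The SQL defines the subsets on each side; the description and
# owner supply the metadata the business rules approach calls for.
reconciliation_rule = {
    "name": "posted_orders_to_staging",
    "source_query": "SELECT order_id FROM app_a.orders WHERE status = 'POSTED'",
    "target_query": "SELECT order_id FROM warehouse.stg_orders",
    "description": (
        "Posted orders extracted nightly from Transaction Application A "
        "should all arrive in Staging Table B."
    ),
    "created_by": "data_governance_team",  # illustrative owner
}

def missing_in_target(source_ids, target_ids):
    """Records the rule expected from the source that never arrived."""
    return sorted(set(source_ids) - set(target_ids))
```

In practice the two queries would be executed against the source and target systems and their results fed to a comparison such as `missing_in_target`; the metadata travels with the rule so a reviewer can tell what it means and who is accountable for it.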

Governance processes will need to be overlaid on all of this. Thus, we can quickly appreciate that a simplistic programming approach will not be sufficient.

The monitoring part of such a tool will have to detect the presence or absence of conditions. This could be as simple as whether the flow has occurred. The metering portion will need to make measurements - gather the metrics. These metrics will have to be compared to known parameters to determine if there is a problem. For instance, if the records in a source data set have the same effective date-time stamps as the last time an ETL job was run, it is likely that we have stale source data.
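The stale-data metric just described reduces to a single comparison. This is a sketch under one assumption: that the monitor records the newest effective date-time stamp it saw on the previous ETL run and can compare it to the newest stamp in the source now.

```python
from datetime import datetime

def source_is_stale(latest_effective_ts, previous_run_latest_ts):
    """Flag stale source data.

    If the newest effective date-time stamp in the source is no newer
    than the one observed on the previous ETL run, nothing new has
    arrived and the feed has likely stalled.
    """
    return latest_effective_ts <= previous_run_latest_ts
```

The metering step gathers the two timestamps; the monitoring step applies the comparison and raises an exception when it holds.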

Another example might be comparing record counts between a source and target, where any difference beyond a small tolerance is unacceptable.
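That count comparison can likewise be sketched directly. The 0.1% relative tolerance below is an illustrative default, not a recommendation; an acceptable tolerance is itself a parameter the governance process would have to set.

```python
def counts_reconcile(source_count, target_count, tolerance=0.001):
    """Compare source and target record counts.

    Returns True when the relative difference is within the tolerance
    (0.001 = 0.1%, an illustrative default), False when the discrepancy
    exceeds it and should be flagged.
    """
    if source_count == 0:
        return target_count == 0
    return abs(source_count - target_count) / source_count <= tolerance
```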

To be fair, this does not really match anything in our oil refinery analogy, which serves to highlight how often the world of data is unique and how rarely we can fully reuse concepts from other domains of experience. Data is different and comes with its own unique set of problems that must be solved by a unique set of methods. Simply stated, the problems are hard to understand and the methods require innovation.