Not so long ago, data was just data. It was viewed as a homogeneous resource within which nothing was differentiated in any way. But very gradually, the data management profession is accepting that there are different kinds of data with different properties, behaviors and management needs.
The most common distinction now drawn is between master data and event data, and I would add reference data to this mix. So how do these three major classes of data differ? Let us briefly survey them.
Reference data is basically code tables. These are the pesky tables that generally have a code column, a description column and a few rows. Examples are transaction status and customer credit status. With reference data, the actual values in the tables are terms used to describe other data in the database. Since they are terms, they require definitions. This is a huge problem because data analysts are trained to think that only entities and attributes have definitions - not data values. However, when we state that "Jane Doe has a customer credit status of bronze," the term "bronze" needs to mean something and have a definition. These definitions are rarely stated anywhere.
Another odd characteristic of reference data is that it transcends time in ways that other classes of data do not. Suppose I define a customer credit status of bronze as the extension of credit up to $1,000 and a grace period for payment of 45 days. First of all, this illustrates that reference data is purely a product of the human mind and does not exist anywhere else. It stays the same until a human mind decides to change it. Maybe after a few years, the credit limit in this example will become $1,200, but nothing about reference data is necessarily dependent on time.
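To make the point about definitions concrete, here is a minimal sketch of a code table that stores a definition alongside the usual code and description. The class name, the field names and the "BRZ" code are assumptions made for illustration; only the bronze definition itself comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreditStatus:
    """One row of a hypothetical customer credit status code table."""
    code: str          # short code stored on customer records (assumed value)
    description: str   # label shown to users
    definition: str    # what the term actually means


# The table itself: a handful of rows keyed by code.
CREDIT_STATUSES = {
    "BRZ": CreditStatus(
        code="BRZ",
        description="Bronze",
        definition="Credit extended up to $1,000 with a 45-day grace period for payment.",
    ),
}


def definition_of(code: str) -> str:
    """Look up the meaning of a code, not just its label."""
    return CREDIT_STATUSES[code].definition


bronze_meaning = definition_of("BRZ")
```

Keeping the definition in the same row as the code means that anyone who reads "bronze" on a customer record can find out what it commits the enterprise to.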
Master data is the widely used term for the next class, although I prefer to call it transaction structure data. This data is about things - not just any things, but things that are parties to the processes of the enterprise. In other words, these things interact in transactions that the enterprise processes.
Master data is essentially catalogs of these things. The usual examples are product and customer. Master data tables nearly always have a lot of columns, often in the hundreds, and sometimes in the thousands. Managing so many columns is not easy, and there is a propensity for users to overload them or otherwise abuse them.
Because master data is about things, the records in master data tables describe individual instances of the entity. These master data instances always have a greater feel of reality about them than instances of reference data or instances of event data (see next section). This leads us to the great problem of master data, which is identification. Master data tables are notorious for having multiple and unreliable ways of identifying entity instances. A record in a master data table can easily be fitted with a surrogate key, but that only really identifies the record. What identifies the entity instance - the thing in the real world - is a whole different matter.
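The gap between identifying a record and identifying the real-world thing can be sketched in a few lines. In this illustration, two records carry different surrogate keys yet plausibly describe the same customer. The record layout, the sample data and the matching rule (normalized name plus email address) are all assumptions, and a real matching rule would need to be far more careful.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerRecord:
    surrogate_key: int  # identifies the record, not the person
    name: str
    email: str


def match_key(rec: CustomerRecord) -> tuple:
    """Hypothetical matching rule: normalized name plus normalized email."""
    normalized_name = " ".join(rec.name.lower().split())
    normalized_email = rec.email.strip().lower()
    return (normalized_name, normalized_email)


def find_possible_duplicates(records):
    """Group surrogate keys whose records appear to describe the same entity."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec.surrogate_key)
    return [keys for keys in groups.values() if len(keys) > 1]


# Invented sample data: records 101 and 102 differ only in formatting.
records = [
    CustomerRecord(101, "Jane Doe", "jane.doe@example.com"),
    CustomerRecord(102, "jane  doe", "Jane.Doe@example.com "),
    CustomerRecord(103, "John Roe", "john.roe@example.com"),
]

duplicates = find_possible_duplicates(records)
```

Each surrogate key is perfectly unique, yet the table still cannot say how many customers it holds until something outside the key decides which records refer to the same person.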
Why should the difficulty of identifying an entity instance arise for this class of data but not other classes? I think the reason is that master data is concerned with singular, individual things, and individual things cannot have a definition. The customer Jane Doe cannot have a definition. She must have an identity to distinguish her from other instances, and she can have a description. A customer credit status of bronze will have a definition, but never a description, because it is universal (something that can be said of many) and not singular.
Master data exists mostly in the present. Most enterprises hope that the things that participate in their transactions do not disappear very quickly afterward. This means that we can usually revisit entity instances of master data and interact with them again.
Event data is produced during an instance of execution of a business process. Generally, the instance of execution is over pretty quickly. When that has happened, the instance can never be observed again, and we are in the true realm of historical facts. We tend to use the term "process" rather loosely. Sometimes we mean the specification or design of the process. This is usually a set of rules. Sometimes we mean the system or mechanism that implements the process, the part that we can see working. Here, however, I mean the actual execution of the process from beginning to end - an instance of the execution of the process. In this sense, a process is not complete until it has ceased to exist.
With event data we cannot consult a definition and understand a fact as we can with reference data. Nor can we get in touch with an entity instance in the present as we (hopefully) can with master data. Event data is made of factual relics of things that no longer exist. It is a residue of the past embedded in the present. If we lose this data, we cannot recreate it, and the history that it represents is beyond our reach forever. There are many implications of this, not least that our capacity to correct data quality problems in event data is likely to be very limited.
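One common way to respect the historical nature of event data is to treat each event as an immutable record in an append-only log. The sketch below is only an illustration of that idea, not a prescribed design. The event type, its fields and the sample values are all assumptions.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)
class PaymentEvent:
    """A hypothetical fact recorded when one process execution finished."""
    event_id: int
    customer_key: int
    amount_cents: int
    occurred_at: datetime


class EventLog:
    """Append-only store: events can be added and read, never edited."""

    def __init__(self):
        self._events = []

    def append(self, event: PaymentEvent) -> None:
        self._events.append(event)

    def all(self) -> tuple:
        # Return a tuple so callers cannot reorder or delete entries.
        return tuple(self._events)


log = EventLog()
log.append(PaymentEvent(1, 101, 2500, datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)))

# Attempting to rewrite history fails because the record is frozen.
try:
    log.all()[0].amount_cents = 9999
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Freezing the record does not make the data correct. It only ensures that whatever was captured at the time is preserved as it was, which matters because the original execution can no longer be observed.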
This brief sketch of a few of the differences between these three classes of data illustrates that there really are important distinctions between them. Hopefully, work on understanding the various kinds of data will continue, and we will gradually get a much more detailed insight into what data really is.