Timeliness is the Most Important Data Quality Dimension
In his book The Most Human Human: What Artificial Intelligence Teaches Us About Being Alive, Brian Christian explained that “the first branch of computer science was what’s come to be known as computability theory, which doesn’t care how long a computation would take, only whether it’s possible or not. Take a millisecond or take a millennium, it’s all the same to computability theory.” However, as Christian also noted, “computer scientists refer to certain problems as intractable meaning the correct answer can be computed, but not quickly enough to be of use.”
Intractability does have some practical applications for business-enabling technology. For example, as Christian noted, “computer data encryption hinges on the fact that prime numbers can be multiplied into large composite numbers faster than composite numbers can be factored back into their primes. The two operations are perfectly computable, but the second happens to be exponentially slower making it intractable. This is what makes online security, and online commerce, possible.”
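To make that asymmetry concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the factoring routine is naive trial division (real attacks use far more sophisticated algorithms), and real encryption relies on primes that are hundreds of digits long rather than the tiny ones used here.

```python
import time


def multiply(p: int, q: int) -> int:
    """The 'easy' direction: a single multiplication."""
    return p * q


def factor_semiprime(n: int) -> tuple[int, int]:
    """The 'hard' direction, done naively by trial division.

    Assumes n is the product of two primes. Returns (p, q) with p <= q.
    The loop runs up to sqrt(n) times, so the work grows exponentially
    with the number of digits in n.
    """
    if n % 2 == 0:
        return 2, n // 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 2
    raise ValueError("n has no non-trivial factors")


def compare_timings(p: int, q: int) -> tuple[float, float]:
    """Return (seconds to multiply, seconds to factor) for one pair of primes."""
    start = time.perf_counter()
    n = multiply(p, q)
    multiply_seconds = time.perf_counter() - start

    start = time.perf_counter()
    factor_semiprime(n)
    factor_seconds = time.perf_counter() - start
    return multiply_seconds, factor_seconds
```

Calling `compare_timings(10007, 10009)` and then repeating the call with larger primes shows the factoring time growing much faster than the multiplication time, which is the gap that encryption depends on.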
As computer science advanced, Christian continued, complexity theory was developed, which took time constraints into account, recognizing that “even when a problem is decidable and thus computationally solvable in principle, it may not be solvable in practice if the solution requires an inordinate amount of time.”
Correct versus Timely
Data-driven decision making exists at the intersection of data quality and business intelligence, and has to contend with the practical trade-offs between computability theory and complexity theory, which Brian Christian summarized as:
- Computability Theory: “Produce correct answers — quickly if possible.”
- Complexity Theory: “Produce timely answers — correctly if possible.”
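One way to picture the second bullet is a calculation that returns the best answer it has when the time budget runs out. The sketch below is an assumption-laden illustration, not something from Christian's book: the function name, the chunk size, and the choice of averaging a list of numbers are all hypothetical.

```python
import time


def mean_within_budget(values, budget_seconds, chunk_size=1000):
    """Average as many values as the time budget allows.

    Returns (estimate, complete). `complete` is True only when every value
    was processed, in which case the estimate is the exact mean.
    At least one chunk is always processed, so a non-empty input
    always yields some answer.
    """
    deadline = time.monotonic() + budget_seconds
    total, count = 0.0, 0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        total += sum(chunk)
        count += len(chunk)
        if time.monotonic() >= deadline:
            break  # timely first, correct if possible
    complete = count == len(values)
    estimate = total / count if count else None
    return estimate, complete
```

With a generous budget the function behaves like the computability-theory bullet and returns the exact mean. With a tight budget it behaves like the complexity-theory bullet and returns a partial estimate, flagged as incomplete so the decision maker knows what they are getting.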
Computer advancements have followed the oft-cited Moore’s Law, a trend first described by Intel co-founder Gordon Moore in 1965 and revised by him in 1975, which states that the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years, thereby increasing processing speed and memory capacity. However, as Christian explained, for a while in the computer industry, “an arms race between hardware and software created the odd situation that computers were getting exponentially faster but not faster at all to use, as software made ever-larger demands on systems resources, at a rate that matched and sometimes outpaced hardware improvements.” This was sometimes called “Andy and Bill’s Law,” referring to Andy Grove of Intel and Bill Gates of Microsoft. “What Andy giveth, Bill taketh away.”
Continued advancements in computational power, along with increased network bandwidth, parallel processing frameworks (e.g., MapReduce), scalable and distributed models (e.g., cloud computing), and other techniques (e.g., in-memory computing), have made real-time data-driven decisions more technologically possible in the era of big data than ever before.
Currency versus Timeliness
In his book The Practitioner’s Guide to Data Quality Improvement, David Loshin explained the important distinction between two time-related data quality dimensions: currency and timeliness. “Currency,” Loshin explained, “refers to the degree to which data is current with the world that it models. Currency can measure how up-to-date data is, and whether it is correct despite the possibility of modifications or changes that impact time and date values. Currency rules may be defined to assert limits to the lifetime of a data value, indicating that it needs to be checked and possibly refreshed.”
Meanwhile, on the other hand of data’s ticking clock, Loshin noted, “timeliness refers to the time expectation for the accessibility of data. Timeliness can be measured as the time between when data is expected and when it is readily available for use.”
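Both definitions can be turned into simple measurements. The Python sketch below is one possible reading of Loshin's wording, not his implementation: the function names, the idea of a `max_lifetime` per data value, and the choice to treat early delivery as zero delay are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def needs_refresh(last_verified: datetime, max_lifetime: timedelta,
                  now: datetime) -> bool:
    """Currency rule: has the value outlived its allowed lifetime?

    True means the value should be checked and possibly refreshed.
    """
    return now - last_verified > max_lifetime


def timeliness_delay(expected_at: datetime, available_at: datetime) -> timedelta:
    """Timeliness: time between when data was expected and when it was
    readily available for use. Data that arrives early counts as zero delay
    (an assumption of this sketch).
    """
    return max(available_at - expected_at, timedelta(0))
```

For example, a customer address last verified on January 1 with a 90-day lifetime would be flagged for refresh on June 1, while a daily sales file expected at 6:00 a.m. and available at 6:45 a.m. has a timeliness delay of 45 minutes. The first measures the data against the world it models; the second measures it against the clock of whoever needs it.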
It’s about Time for Data Quality
Although new prefixes for bytes (giga, tera, peta, exa, zetta, yotta) measure an increase in space, new prefixes for seconds (milli, micro, nano, pico, femto, atto) measure a decrease in time. More space is being created to deliver more data within the same, or smaller, time frames. Space isn’t the final frontier; time is. Due to the increasing demand for real-time data-driven decisions, timeliness is the most important dimension of data quality.
This blog was originally posted at OCDQblog.com. Published with permission.