When it comes to data, why the 'garbage in, garbage out' doctrine is all wrong

Garbage in. Garbage out.

It’s one of the great truisms of technology. Sure, it’s been eclipsed by “software will eat the world” as something to say in a meeting when you really don’t know what’s going on, but it’s still probably uttered a few thousand times a month to explain the failure of a recent technology initiative.

But like its sister expressions, such as “you can’t get fired for hiring IBM,” we’re learning that the “garbage doctrine” is, well, garbage.

The problem isn’t that corrupted or incorrect data can lead to bad results. It can. Unfortunately, “bad” data isn’t the most frequent problem. The raw data itself is probably valid.

The problem is that there’s way too much of it and it’s not organized in a way that makes it easy to understand. It doesn’t form beautiful crystalline patterns like salt: it’s more like a huge pile of gravel. And the hours required to turn that gravel into a mosaic don’t seem worth it—particularly at the start of a project—so people claim it’s garbage and declare defeat.

It’s an understandable reaction. If you collect data from U.S. smart meters every 15 minutes, you get a rough sense of power consumption. Collect it every few minutes or seconds, and you can begin to save power unobtrusively by delaying refrigerator defrost cycles and dimming lights. Unfortunately, that also means juggling exabytes of data.

Health data and personalized medicine? The world’s total mass of genomic data is doubling every seven months. By 2025, genomic data will dwarf the size of YouTube.

A Cacophony of Sensors

Worse, the data often comes in incompatible formats that measure distinctly different things. Take a simple device like a pump. To conduct predictive maintenance, you might want to track power consumption, water flow, equipment temperature, rotational speed and other phenomena. That means you’ll be collecting data measured in kilowatt-hours, liters, degrees, RPMs and other units, with some data refreshing every 15 minutes and other signals, like vibrations, emitting new information hundreds of thousands of times a second.
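One common way to tame streams that refresh at different rates is to average each one into the same fixed time window before comparing them. The sketch below is a minimal illustration under stated assumptions, not a description of any vendor's pipeline: the stream names, the sample values and the 15-minute window are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

WINDOW_SECONDS = 15 * 60  # assumed common interval, matching the slowest signal


def bucket_mean(samples, window=WINDOW_SECONDS):
    """Average (timestamp_seconds, value) samples into fixed windows."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Round each timestamp down to the start of its window.
        buckets[int(ts // window) * window].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


def align(streams, window=WINDOW_SECONDS):
    """Merge named streams into one row per window; missing readings stay None."""
    averaged = {name: bucket_mean(s, window) for name, s in streams.items()}
    starts = sorted({t for series in averaged.values() for t in series})
    return [
        {"window_start": t, **{name: series.get(t) for name, series in averaged.items()}}
        for t in starts
    ]


# Hypothetical pump readings: (seconds since start, value)
streams = {
    "power_kwh": [(0, 1.2), (900, 1.4)],
    "temperature_c": [(0, 40.0), (300, 42.0), (600, 44.0), (900, 50.0)],
    "vibration_mm_s": [(1, 0.25), (2, 0.75), (901, 1.0), (902, 2.0)],
}
rows = align(streams)
```

Each row then holds one value per signal for the same 15-minute slice, so kilowatt-hours, degrees and vibration readings can sit side by side. Averaging is only one choice; for vibration, a peak or a spectral summary may preserve more of what matters.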

McKinsey & Co., for instance, estimates that only 1% of the data from the roughly 30,000 sensors on offshore oil rigs gets used for decision making because of the challenge of getting at it.

To get around the problem, analysts and others suggest collecting less data. Unfortunately, the obscure bits often prove to be the solution to the puzzle. In 2015, researchers at Lawrence Livermore National Laboratory were experiencing rapid and unexpected variations in electrical load for Sequoia, one of the world’s most powerful supercomputers. The swings were large, with power dropping from 9 megawatts to a few hundred kilowatts, and they created substantial management problems for the lab’s local utilities.

By cross-checking different data streams, the source of the problem emerged: the droop coincided with scheduled maintenance for the massive chilling plant. LLNL was able to smooth out its power ramp (and help its local utility), but think about it for a moment. The answer was only discovered after some of the leading computing minds in the nation checked on what their co-workers in the facilities department were doing.
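The kind of cross-check described here can be sketched in a few lines: flag the moments when load drops sharply, then see whether each one falls inside a known maintenance window. This is an illustration only, not LLNL's actual method; the readings, the threshold and the maintenance schedule below are all made up.

```python
def find_drops(readings, threshold_mw):
    """Return timestamps where load falls below threshold_mw."""
    return [ts for ts, mw in readings if mw < threshold_mw]


def explained_by(drops, windows):
    """Map each drop timestamp to the first window containing it, or None."""
    result = {}
    for ts in drops:
        result[ts] = next((w for w in windows if w[0] <= ts <= w[1]), None)
    return result


# Hypothetical data: hourly load in megawatts, and maintenance
# windows expressed as (start_hour, end_hour).
load_mw = [(0, 9.0), (1, 8.8), (2, 0.4), (3, 0.3), (4, 9.1), (5, 0.5)]
chiller_maintenance = [(2, 3)]

matches = explained_by(find_drops(load_mw, threshold_mw=1.0), chiller_maintenance)
```

Drops that map to a window have a likely explanation; drops that map to None are the ones still worth investigating. The check only works if someone kept the facilities schedule in the first place.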

Once, I interviewed a company trying to curb energy use at office buildings and gyms by analyzing interior traffic and other parameters. What did they discover? CO2 sensors were by far the best occupancy sensors. You could quickly tell how many people were in the room or whether they were exercising or resting. But at the outset, CO2 readings weren’t high on the list of leading indicators. In a “save only the important stuff” regime, they would have been chucked.

The Hoarder’s Dilemma

Let’s say you save all of your data. Now your highly paid data scientists are bogged down serving as data janitors, which 76% say is the least attractive part of their day.

Luckily, automation in software development and IT management is coming to the fore. A growing number of startups are focused on automatically generating digital twins and turning sensor data streams into screens and consoles that make sense to ordinary humans. The movement toward smart edge architectures, where substantial amounts of data processing and analytics happen locally rather than in the cloud to reduce latency and bandwidth costs, will also help by reducing the time and overhead of juggling massive data sets.

AI will help as well. Video and images until recently were considered “dark” data because they couldn’t be easily searched. Neural networks have turned this around, leading to things like facial recognition and photo search. Before these developments, video and images often fell into that category of data that was perennially on the “do we really need to keep all of this?” chopping block.

“Running scared from the big data monster is a cop out,” says Neil Strother, principal research analyst at Navigant Research. “The tools now available for collecting, organizing and analyzing large and growing datasets are here and affordable. I’m not saying this kind of effort is trivial, but it’s not beyond their reach either.”

So quit thinking of data you can’t figure out as garbage. It’s just a gem that hasn’t been polished.
