CERN’s Experiment on Data Quality
September 26, 2011 – The search for a particle that may unlock the secrets of the universe's origin has been sharpened by data quality and software testing tools more commonly used in everyday business.
Every year, the Large Hadron Collider, possibly the world's largest scientific instrument, churns through more than 15 petabytes of information for physicists with the European Organization for Nuclear Research (CERN) and other particle physics organizations. That data is compiled from 600 million proton collisions per second, with an immediate goal of proving the existence of the Higgs boson, the most elusive elementary particle in quantum physics and one that may point to the dawn of the universe.
To analyze information in that search, about 10,000 scientists worldwide have, since 2000, accessed ROOT, the homegrown software for particle analysis that CERN scientist Axel Naumann recently likened to "Excel for big data." ROOT's framework is not unlike BI data shuffled along to various users in an enterprise, only specific to the functions of registering, compiling and visualizing the Large Hadron Collider's findings, Naumann says. But because much of the particle information compiled came from huge data pools bearing on such finite results, deeper problems did not simply pop up to everyday users or to ROOT's direct team of about six.
“Whatever information we get wrong will be multiplied through the experiments,” Naumann says. “If you find fundamental bugs, then we can get more precise results. We can say, ‘We found this particle,’ or ‘We found this particle doesn’t exist.’”
With that in mind, ROOT's software called for deeper data troubleshooting and additional testing. One of those data quality and integrity solutions, Coverity Static Analysis, is used primarily in commercial applications; ROOT's developers chose it after reading about its big data success with Linux and SQL open source projects.
There, buried like the elusive particles whose data they mine, flaws awaited: Coverity's tool scanned ROOT's source code and spotted 40,000 software defects in the first week, a discovery that Naumann, with a grim laugh, called "horrible." Some of those errors stemmed from server downtime, during which information was deleted; from network disruptions that were difficult to replicate; or from data buffer overflows and memory leaks, Naumann says.
After six weeks spent fixing those source code errors, the CERN software team also put Coverity's solution into daily use for many of its users, including nondevelopers who, Naumann said, have been able to detect and report minor errors. Overseeing ROOT's data processes with the Coverity tool keeps users from moving forward on results mired in errors, and nightly checks run against the entire system.
“We still deal with maintenance issues like anyone else with software, but I would say CERN is more impressive with its results now and more clear with its experiments” from the Coverity application, Naumann says.
While the cumulative data racked up by the Large Hadron Collider is not the largest among Coverity's clients, CERN's work in physics is a unique application requiring particular accuracy, says Jennifer Johnson, global VP of marketing at Coverity. But at its root, the software and data quality engagement with CERN is not dissimilar from many other business applications for "better governance of its entire software development process," she says.
For commentary on how big data at CERN and other organizations is scaling BI and analytics, see Shawn Rogers' cover story from the September/October issue of Information Management magazine.