I had an interesting conversation with a young participant at the February 2011 Strata Conference on “data science” during a break in the first day's sessions.
He had been tasked by his company with sifting through quite a bit of messy data and making sense of it quickly. When I sympathized with his plight, he responded that dirty data wasn't a big problem. His concern was more the tight time frame for getting the work done. Management seemed tolerant of the dirty data, opining that a quick, approximate answer was more important than a belated, precise one.
Besides, he said, there was so much data that he could kill the problem with quantity if not quality. A bit befuddled, I asked myself at the time whether this might be a difference between the thinking of the new “data science” and traditional business intelligence.
I was reminded of this “approximate answer” thinking a few weeks ago when I came across an article in MIT News about the computing work of Carnegie Mellon professor Joseph Bates. Bates has spearheaded research on new chips that return imprecise or fuzzy answers – 100 + 100 might sometimes yield 202, other times 197 – but have the advantages of “arithmetic circuits … that would be much smaller than those in today's computers. They would consume less power and many more of them could fit on a single chip, greatly increasing the number of calculations it could perform at once.”
Could fast, fuzzy answers be useful to BI? In many cases, yes. A simulation of an object-recognition algorithm showed that when the results of calculations were “either raised or lowered by a randomly generated factor between 0 and 1 percent … The difference between low-precision and standard arithmetic was trivial.”
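To get a feel for why such small errors can wash out in aggregate, here is a minimal Python sketch of the idea. It is not the simulation described in the article. The `fuzzy` helper, the made-up prices and quantities, and the choice to perturb only the per-order multiplication while summing exactly are all my own assumptions for illustration.

```python
import random

def fuzzy(value, rng, max_error=0.01):
    """Raise or lower value by a random factor of up to max_error (1 percent)."""
    return value * (1.0 + rng.uniform(-max_error, max_error))

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical order data: (unit price, quantity) pairs.
orders = [(rng.uniform(5.0, 50.0), rng.randint(1, 20)) for _ in range(10_000)]

# Exact per-order revenue versus revenue with each product perturbed.
exact_revenue = [price * qty for price, qty in orders]
fuzzy_revenue = [fuzzy(price * qty, rng) for price, qty in orders]

# Assumption: only the multiplications are imprecise; the sums are exact.
exact_mean = sum(exact_revenue) / len(exact_revenue)
fuzzy_mean = sum(fuzzy_revenue) / len(fuzzy_revenue)
relative_error = abs(fuzzy_mean - exact_mean) / exact_mean

print(f"exact mean revenue: {exact_mean:.4f}")
print(f"fuzzy mean revenue: {fuzzy_mean:.4f}")
print(f"relative error:     {relative_error:.6%}")
```

Each individual result can be off by as much as 1 percent, but because the errors are as likely to be high as low, they largely cancel when averaged over many orders. The averaged figure can never be off by more than 1 percent here and is typically off by far less.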
One problem the chip handles well: nearest-neighbor search “in which you have a set of objects that can be described by hundreds or thousands of criteria, and you want to find the one that best matches some sample.” Predictive modelers are well aware of this technique, often finding success with the “k-nearest-neighbor” algorithm for their classification problems.
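A nearest-neighbor search of this kind can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions: the objects are random points, the sample is a slightly jittered copy of one of them, and the "chip" is imitated by perturbing every squared difference by up to 1 percent. The function names are hypothetical.

```python
import random

def exact_distance(a, b):
    """Squared Euclidean distance computed with standard arithmetic."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def make_fuzzy_distance(rng, max_error=0.01):
    """Build a distance function whose per-criterion terms are each off by up to 1%."""
    def fuzzy_distance(a, b):
        return sum(
            ((x - y) ** 2) * (1.0 + rng.uniform(-max_error, max_error))
            for x, y in zip(a, b)
        )
    return fuzzy_distance

def best_match(sample, objects, distance):
    """Return the index of the object closest to the sample."""
    return min(range(len(objects)), key=lambda i: distance(sample, objects[i]))

rng = random.Random(7)  # fixed seed so the run is repeatable
n_objects, n_criteria = 500, 200

# Hypothetical data: 500 objects, each described by 200 criteria.
objects = [[rng.random() for _ in range(n_criteria)] for _ in range(n_objects)]

# The sample is a lightly jittered copy of one known object.
target = 123
sample = [x + rng.uniform(-0.01, 0.01) for x in objects[target]]

exact_pick = best_match(sample, objects, exact_distance)
fuzzy_pick = best_match(sample, objects, make_fuzzy_distance(rng))

print("exact arithmetic picks object", exact_pick)
print("fuzzy arithmetic picks object", fuzzy_pick)
```

Because the search only needs to know which distance is smallest, not what the distances are to many decimal places, a 1 percent wobble in each term does not change the winner when the best match is reasonably clear. Where two candidates are nearly tied, the fuzzy version could pick either one, which is the kind of case where precision still matters.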
Several articles in a recent edition of the MIT Sloan Management Review elaborate on approximate answers for business. Two Attivio executives argue in “Why Companies Have to Trade ‘Perfect Data’ for ‘Fast Info’” that the investment in perfect data often comes at the expense of timely data delivery.
“One of the most important questions is whether we should even worry about whether this report is exactly right or not. There's a term called 'eventually consistent' that grew up around a whole fleet of open-source-type [tools] for crunching the huge amounts of data generated by website click-throughs … (Amazon) is good at this because they don't worry about everybody. They develop a model where they're eventually going to get a consistent model of the world, but at the moment they need to do it, they don't care that they can't roll it out for everyone … The key thing is to do it quickly and to make sure that whatever we conclude, there are many observations for it.” Hmmm, I wonder if my Strata acquaintance works for Amazon?
“Competing on Analytics” author Tom Davenport argues from survey results in his article, “How Fast and Flexible Do You Want Your Information, Really?”: “the real aim should be not faster information but faster decision making.” One of Davenport's areas of survey inquiry involved the tradeoff between accuracy and speed. “Several executives we interviewed said that they were often willing to sacrifice some accuracy or granularity for information that they received on a real-time or daily basis. One prominent former CEO noted that 'Speed is often more important than accuracy.'”
Of course the speed-accuracy tug of war is also driven by the content of the information. Financial reports should be unassailable, while there's a lot of leeway with, say, competitive intelligence. “For information such as cash and cash flow, receivables and payables, and budgets, spending and costs …,” there's little give on accuracy. For information on “customers, suppliers and partners, and competitors – the survey respondents were willing to accept a much lower level of confirmation.”
I'll contrast business intelligence and data science in an upcoming blog post. On the issue of precise versus approximate answers, though, data science evangelist Mike Loukides is certainly not “fuzzy”: “Do you really care if you have 1,010 or 1,012 Twitter followers? Precision has an allure, but in most data-driven applications outside of finance, that allure is deceptive. Most data analysis is comparative: if you're asking whether sales to Northern Europe are increasing faster than sales to Southern Europe, you aren't concerned about the difference between 5.92 percent annual growth and 5.93 percent.”
Reactions from readers? Are you seeing more tolerance/push for “approximate BI” in your current roles? If yes, are the pressures of time and the luxuries of big data behind this evolution?