A few years ago, a vendor I'm close to conducted a “research” study on the BI and analytics software market, using a web-based survey it had assembled. A major finding was that its own predictive analytics platform ranked as the second most popular in the market, outgunning the likes of industry leaders SAS and SPSS.
Having worked in the PM market for many years, I was more than a little suspicious, and set out to get to the bottom of the findings. It didn't take long: the survey driving the compilations was accessible only from the vendor's web site. The results were thus heavily selection-biased, since respondents were much more likely to be users of the vendor's software than they were to be “representative” of the larger analytics population. Put another way, not every analytics practitioner had a known, non-zero probability of being sampled. Of course, the bias was in the direction that gratified my friend's company. Go figure.
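To make the mechanism concrete, here is a minimal simulation sketch. Every number in it is hypothetical: a made-up population in which 5% of practitioners use the vendor's product, and an assumed "boost" that makes those users ten times more likely to answer a survey hosted on the vendor's site. It is not a reconstruction of the actual study, only an illustration of how unequal response probabilities inflate an estimate.

```python
import random


def simulate_share(n_respondents, true_share, vendor_user_boost, seed=0):
    """Estimate a product's share from a survey sample.

    vendor_user_boost is a hypothetical factor: how many times more likely
    a user of the product is to respond than a non-user. A boost of 1.0
    corresponds to simple random sampling (with replacement).
    """
    rng = random.Random(seed)
    # Synthetic population: True means the practitioner uses the product.
    population = [rng.random() < true_share for _ in range(100_000)]
    # Unequal response propensities are what create the selection bias.
    weights = [vendor_user_boost if uses else 1.0 for uses in population]
    sample = rng.choices(population, weights=weights, k=n_respondents)
    return sum(sample) / n_respondents


random_estimate = simulate_share(2000, true_share=0.05, vendor_user_boost=1.0)
biased_estimate = simulate_share(2000, true_share=0.05, vendor_user_boost=10.0)
print(f"random sample estimate:    {random_estimate:.3f}")
print(f"vendor-site sample estimate: {biased_estimate:.3f}")
```

With a boost of 10 and a true share of 5%, the expected biased estimate is 0.5 / (0.5 + 0.95), or roughly 34%, which is how a niche product can look like a market leader.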
The problems with online surveys in business are not unlike those with web-based polling in politics. Even as the percentage of the population accessing the Internet increases, “...big challenges remain. Random sampling is at the heart of scientific polling, and there’s no way to randomly contact people on the Internet in the same way that telephone polls can randomly dial telephone numbers. Internet pollsters obtain their samples through other means, without the theoretical benefits of random sampling. To compensate, they sometimes rely on extensive statistical modeling.”
And the jury's out on the extent to which the newer methodologies can compete with tried-and-true randomization and probability sampling to produce a “representative” sample of the population – or whether the selection is biased to an extent that threatens the validity of results.
Selection bias isn't the only risk of online polling. An interesting experiment by political scientist Kyle Dropp shows Republican candidate Donald Trump performing better in online polls than he does with traditional phone surveys. 'Dropp thinks “social desirability bias” is at play. In other words, people are afraid to tell another human being that they support the Republican because, even though they like him, they know about his controversial statements and do not want to be judged negatively.'
Alas, online surveys are now the rule in our industry, so the potential biases are a constant risk. Perhaps an even more significant problem than web-based surveys in the BI and analytics world is that research often serves a marketing master, so there's an additional “business desirability” bias toward the surveyor's products. That a company's research efforts “demonstrate” a large market and solid support for its products? Not a surprise.
Despite questions about their validity, web-based marketing surveys are here to stay and will increasingly be used to frame BI and analytics thinking. They'll also continue to drive marketing initiatives. What's a savvy consumer to do?
Her first task is to understand the nature of the business desirability bias in the research. Is the vendor doing the survey to better “understand” the market and the demand for its products? If so, don't be surprised by flattering results.
Or is the survey being conducted by a consulting company that contracts with the vendor to do research on its behalf? If so, how is the researcher compensated – and how might that influence findings? Or maybe the consultancy sells its “objective” findings to interested parties after the fact. How does that impact its methodologies? In the end, understanding how the research is funded – and calibrating accordingly – is a sine qua non for research consumers.
That web-based surveys do not rely on probability sampling is a difficult, though not intractable, problem. Methods are under development that can use statistical techniques like matching and weighting to adjust for non-probability sampling. “One example of this is sample matching, where a non-probability sample is drawn with similar characteristics to a target probability-based sample and the former uses the selection probabilities of the latter to weight the final data.” I suspect that non-probability sampling schemes will soon have the potential to be more than adequate.
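As a rough illustration of the weighting idea, the sketch below uses post-stratification, a simpler cousin of the sample matching described in the quote rather than that method itself. Respondents are reweighted so that each stratum counts in proportion to its known population share. The strata, the population shares, and the responses are all invented for the example.

```python
from collections import Counter


def poststratified_mean(responses, population_shares):
    """Weighted mean of a 0/1 (or numeric) answer.

    responses: list of (stratum, value) pairs from the survey.
    population_shares: stratum -> share of the target population (sums to 1).
    Each respondent is weighted by population share / sample share of
    their stratum, so over-represented strata are scaled down.
    """
    counts = Counter(stratum for stratum, _ in responses)
    n = len(responses)
    weighted_sum = 0.0
    weight_total = 0.0
    for stratum, value in responses:
        weight = population_shares[stratum] / (counts[stratum] / n)
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical survey: large firms are 80% of respondents but only 30%
# of the population. 1 = "uses the tool", 0 = "does not".
responses = (
    [("large", 1)] * 40 + [("large", 0)] * 40
    + [("small", 1)] * 2 + [("small", 0)] * 18
)
population_shares = {"large": 0.30, "small": 0.70}

raw_mean = sum(value for _, value in responses) / len(responses)
adjusted_mean = poststratified_mean(responses, population_shares)
print(f"raw: {raw_mean:.2f}  post-stratified: {adjusted_mean:.2f}")
```

In this made-up case the raw adoption rate of 42% drops to about 22% once large firms are weighted down to their population share. The adjustment only helps to the extent that the strata capture what drives both participation and the answer, which is the open question with these methods.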
Right now, at a minimum, online survey researchers should publish characteristics of their sample and detail how those statistics match up to comparables from the known population. For example, it'd be helpful to cross-classify survey businesses by, say, size, industry, and geography – and then to determine how those crosstabs track with the population figures. The closer the fit, the more evidence that the sample's in sync with the population – and hence the more confidence the findings deserve.
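One simple way to summarize such a comparison is sketched here, with invented figures. It computes the gap between sample and population share for each crosstab cell, plus a single dissimilarity index (half the sum of the absolute gaps, so 0 means a perfect match and 1 means no overlap). The cell definitions and all counts are hypothetical, and the index is just one reasonable choice of fit measure.

```python
def crosstab_fit(sample_counts, population_shares):
    """Compare sample composition to known population shares.

    sample_counts: cell -> number of respondents in that cell.
    population_shares: cell -> share of the population (sums to 1).
    Returns (gaps, dissimilarity), where gaps maps each cell to
    sample share minus population share.
    """
    n = sum(sample_counts.values())
    cells = set(sample_counts) | set(population_shares)
    gaps = {
        cell: sample_counts.get(cell, 0) / n - population_shares.get(cell, 0.0)
        for cell in cells
    }
    dissimilarity = 0.5 * sum(abs(gap) for gap in gaps.values())
    return gaps, dissimilarity


# Hypothetical cells keyed by (company size, industry).
sample_counts = {
    ("small", "finance"): 10,
    ("small", "retail"): 10,
    ("large", "finance"): 50,
    ("large", "retail"): 30,
}
population_shares = {
    ("small", "finance"): 0.30,
    ("small", "retail"): 0.40,
    ("large", "finance"): 0.15,
    ("large", "retail"): 0.15,
}

gaps, dissimilarity = crosstab_fit(sample_counts, population_shares)
for cell, gap in sorted(gaps.items()):
    print(f"{cell}: {gap:+.2f}")
print(f"dissimilarity index: {dissimilarity:.2f}")
```

Here half the sample would have to be moved to other cells to match the population, a strong hint that large firms are over-represented and that the headline numbers should be read with caution.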