Six months ago I purchased a New York Times Digital subscription, which provides access to the online newspaper at NYTimes.com and the Times app for my iPad. I like it a lot, experiencing much less paper withdrawal than anticipated, and no doubt I now spend more time reading articles than before.
A collateral benefit to being part of the “Times ecosystem” is exposure to other solid news sources. One illustration is The Upshot, the new policy and politics website launched several months ago by NYTimes.com. The Upshot's mission is to build on the Times' foundation “by helping readers make connections among different stories and understand how those stories fit together .... [the columns will] feature a rich stream of graphics and interactives, one of The Times’s great strengths .... Data will also be at the heart of what we do ... there is a large audience for [this] kind of plain-spoken, analytical journalism.” I like.
A recent Upshot column that nicely demonstrates that ideal is “How Not to Be Misled by the Jobs Report”. The article's point is that the observed “decline” in new jobs to 64,000 in April from the “robust” 220,000 in March must be taken with a grain of statistical salt. The authors correctly note that the weeping and gnashing of market teeth that transpired on May 1 may have been due to little more than chance variation or sampling noise: “What if the apparent decline in job growth came from the inherent volatility of surveys that rely on samples, like the survey that produces the Labor Department’s monthly employment estimate?”
Now, caveats about the sampling variation in aggregate estimates like these are nothing new. Historically, such figures have generally been accompanied by standard errors that put, say, a 95 percent confidence interval around the point estimate. For me, though, this old-fashioned approach is just not compelling. Apparently, neither was it for the authors.
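To make the old-fashioned approach concrete, here is a minimal sketch of that confidence-interval arithmetic. The 150,000 point estimate and the 55,000 standard error are illustrative numbers of my own choosing, not official figures:

```python
def confidence_interval(point_estimate, standard_error, z=1.96):
    """95% confidence interval (z = 1.96) around a survey point estimate."""
    half_width = z * standard_error
    return (point_estimate - half_width, point_estimate + half_width)

# Illustrative numbers only: a 150,000-job point estimate with an
# assumed standard error of 55,000 (not official BLS figures).
lo, hi = confidence_interval(150_000, 55_000)
print(f"95% CI: {lo:,.0f} to {hi:,.0f} jobs")
```

The interval spans well over 200,000 jobs, which is exactly why a bare point estimate invites over-interpretation.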
Rather, they use statistical simulation to demonstrate the range of random possibilities. Assuming an actual or “population” value of 150,000 new jobs, measured with sampling variation similar to that of the reported jobs estimate, the authors show that the results could indeed be all over the statistical map: a 23 percent chance of a figure larger than 190,000, which would cause market euphoria, and an equal 23 percent likelihood of a number less than 110,000, which would bring angst to investors. The problem, of course, is the high sampling variation surrounding the calculated statistic.
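That simulation is easy to reproduce in spirit. The sketch below draws repeated survey estimates around a true value of 150,000; the sampling standard deviation of 55,000 is my own assumption, chosen because it roughly reproduces the 23 percent tail probabilities the authors quote:

```python
import random

random.seed(42)

TRUE_JOBS = 150_000    # assumed "population" monthly job growth
SAMPLING_SD = 55_000   # assumed sampling standard deviation (illustrative)
N_SIMS = 100_000

# Each draw is one hypothetical month's survey estimate of job growth.
draws = [random.gauss(TRUE_JOBS, SAMPLING_SD) for _ in range(N_SIMS)]

p_euphoria = sum(d > 190_000 for d in draws) / N_SIMS
p_angst = sum(d < 110_000 for d in draws) / N_SIMS

print(f"P(estimate > 190,000): {p_euphoria:.2f}")
print(f"P(estimate < 110,000): {p_angst:.2f}")
```

Both tail probabilities come out near 0.23, even though the "true" job growth never changes.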
The article then showcases two dynamic visuals: the first illustrating what a year of monthly job-growth samples might look like given the existing sampling variation, with each month's actual figure held constant at 150,000; and the second, a similar graphic with the population figure at 150,000 in January, increasing 15,000/month thereafter. In both cases, “noise” from the survey sampling distorts the jobs estimates, potentially leading to off-the-mark interpretations. The graphs communicate the impact of this randomness well.
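The two scenarios behind those visuals can be sketched in a few lines. Again, the 55,000 sampling standard deviation is my illustrative assumption, not a figure from the article:

```python
import random

random.seed(7)
SAMPLING_SD = 55_000  # assumed sampling standard deviation (illustrative)

# Scenario 1: true job growth flat at 150,000 every month.
flat_truth = [150_000] * 12
# Scenario 2: 150,000 in January, growing by 15,000 each month.
rising_truth = [150_000 + 15_000 * m for m in range(12)]

def simulate_year(truth):
    """One simulated year of noisy survey estimates around the true values."""
    return [round(random.gauss(t, SAMPLING_SD)) for t in truth]

for name, truth in [("flat", flat_truth), ("rising", rising_truth)]:
    estimates = simulate_year(truth)
    print(name, [f"{e:,}" for e in estimates])
```

Run it a few times with different seeds: the flat scenario routinely produces what looks like a trend, and the rising scenario often looks flat or erratic, which is precisely the article's warning.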
There are a couple of takeaways for BI and analytics from this article. The first is a reminder that the null hypothesis should always be one of randomness -- no relationship and no change until unequivocally demonstrated otherwise by the data. The second is that re-sampling and simulation methods, fueled by computation and showcased with simple visuals, are the best way to tell the survey sampling variation story in 2014. These methods should be standard components of the analytics worker's tool chest.
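As a final illustration of the re-sampling side of that tool chest, here is a minimal bootstrap sketch: estimating the sampling variability of a survey mean by resampling the data with replacement. The data are entirely made up for illustration:

```python
import random
import statistics

random.seed(0)

# Hypothetical survey data: 500 simulated per-establishment job changes.
sample = [random.gauss(30, 200) for _ in range(500)]

def bootstrap_se(data, n_boot=2000):
    """Bootstrap standard error of the sample mean: resample with
    replacement, recompute the mean, and take the spread of those means."""
    means = []
    for _ in range(n_boot):
        resample = random.choices(data, k=len(data))
        means.append(statistics.fmean(resample))
    return statistics.stdev(means)

print(f"sample mean: {statistics.fmean(sample):.1f}")
print(f"bootstrap SE of mean: {bootstrap_se(sample):.1f}")
```

No formulas required: the computer does the distribution theory, and the bootstrap standard error lands close to the textbook value of the population standard deviation divided by the square root of the sample size.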