As we moved her into the temporary, pre-semester athletic housing, my wife and I had the opportunity to meet some of the other players and their parents. The 14 members of the team are from all over the country – East, Midwest and West. Four are from California.
I’ve always associated volleyball with California. The beach, Karch Kiraly, USC, UCLA, Stanford, Long Beach State …. It seems California’s always at the top of the sport, both men’s and women’s. This year, 12 of the 32 qualifiers for travel team, open division 18-year-old Nationals were from California.
As my wife and I made the twelve-hour drive back from Winston-Salem to Chicago, we entertained ourselves by researching volleyball trivia. One question we started to investigate was how many women’s D1 players come from California. I’d name a school and she’d Google the volleyball team roster, counting the number of Californians. We looked up schools like Penn State, Virginia, Georgetown, Boston College, Notre Dame, North Carolina, Nebraska, Duke, Cal, Stanford, several Ivies, and others. Of the “sampled” schools, we found an average of almost 5 California residents per team, about a third of the players. Acknowledging the unscientific sampling, I thought 25% might be a better guess.
A week or so after our trip, we met up with some friends, another Chicago-area volleyball family, and I mentioned my California calculations. They immediately logged on to a volleyball recruiting website, issued a few queries, and countered that the “real” number was less than half my estimate. That seemed awfully low to me, just as my 25% figure was starting to feel high. I had to get to the bottom of the discrepancy. Time for a little data science.
It was easy to assemble a compendium of all women’s D1 volleyball schools. After scraping that info into R, I randomly sampled 20% of the list, 66 of the 330 total schools. I then found the selected schools’ 2012 volleyball rosters online and tallied the number of California players on each, updating my data set.
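For readers who want to try the sampling step themselves, here is a minimal sketch in Python (the work described above was done in R, and that code isn't shown). The school names are placeholders, and the `sample_schools` helper and its seed are illustrative assumptions, not the original procedure.

```python
import random

def sample_schools(schools, fraction=0.20, seed=2012):
    """Draw a simple random sample, without replacement, of the given fraction."""
    rng = random.Random(seed)            # fixed seed so the draw is repeatable
    k = round(len(schools) * fraction)   # 20% of 330 schools -> 66
    return rng.sample(schools, k)

# Placeholder names standing in for the real list of 330 D1 programs.
schools = [f"School {i:03d}" for i in range(1, 331)]
selected = sample_schools(schools)
print(len(selected), "schools selected")
```

Because every school has the same chance of being drawn, the big-name programs get no extra weight, which is the whole point of the exercise.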
For my sample of 66, the mean number of California players per school is 2.7, though the distribution is quite skewed. The most common count is 0, the median is 1, and 47 of the 66 schools have 2 or fewer Californians. On the other end, 10 schools roster 8 or more Golden Staters, one reporting 18! Figure 1’s violin plot details the quirky-shaped frequencies.
To get a sense of the sampling variation surrounding the mean estimate, I ran 100,000 bootstrap iterations of the original 66, resampling with replacement and computing and storing the mean each time. A plot of the resulting density in Figure 2 shows the variation of this calculation. Between the left vertical gray line at 1.85 and the right at 3.64 lie 95% of the bootstrap sample means. Alas, both my neighbors’ figure and my own 25% number, which work out to roughly 1.7 and 3.7 California players per school respectively, were outside this range.
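The bootstrap idea is simple enough to sketch in a few lines. The version below is Python rather than the R used for the analysis, and the `counts` list is made-up, skewed data for illustration only; it is not the 66 roster counts. The helper name `bootstrap_mean_interval` and the default of 10,000 resamples (instead of 100,000) are likewise assumptions chosen to keep the example quick.

```python
import random

def bootstrap_mean_interval(values, n_boot=10_000, level=0.95, seed=1):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample n values with replacement, n_boot times, keeping each mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    tail = (1.0 - level) / 2.0
    lo_idx = max(0, round(tail * n_boot))
    hi_idx = min(n_boot - 1, round((1.0 - tail) * n_boot) - 1)
    return means[lo_idx], means[hi_idx]

# Invented, right-skewed counts (NOT the real roster data).
counts = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 8, 9, 18]
low, high = bootstrap_mean_interval(counts)
print(f"sample mean {sum(counts) / len(counts):.2f}, "
      f"95% of bootstrap means between {low:.2f} and {high:.2f}")
```

With skewed counts like these, the interval is typically not symmetric around the sample mean, which is one reason to prefer the percentile approach over a quick plus-or-minus rule of thumb.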
No doubt both my neighbors and I were off with our calculations. My initial estimate of 25%, given 330 schools and roughly 15 players per school, yields about 1,220 California players, a contrast to the neighbors’ 575. The random sample mean estimate of 2.7 translates to a total of about 890, almost exactly halfway between our extremes. Interestingly, a “wisdom of crowds” average of the two estimates lands close to the target.
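The arithmetic behind that comparison can be checked in a few lines. This Python sketch uses only the figures quoted in this post (330 schools, totals of 575 and 1,220, a sample mean of 2.7, and the 1.85 to 3.64 bootstrap range); the variable and function names are my own. One side note: at exactly 15 players per roster, 25% would come to about 1,238, so the 1,220 figure implies rosters averaging slightly under 15.

```python
SCHOOLS = 330
CI_LOW, CI_HIGH = 1.85, 3.64   # bootstrap range for the per-school mean

def per_school(total_players):
    """Convert a national total into California players per school."""
    return total_players / SCHOOLS

def in_interval(rate):
    return CI_LOW <= rate <= CI_HIGH

totals = {
    "neighbors": 575,
    "my 25% guess": 1220,
    "random sample": 2.7 * SCHOOLS,   # about 891
}

for name, total in totals.items():
    rate = per_school(total)
    print(f"{name}: {total:.0f} players, {rate:.2f} per school, "
          f"inside range: {in_interval(rate)}")

# "Wisdom of crowds": the simple average of the two off-target guesses.
crowd_average = (totals["neighbors"] + totals["my 25% guess"]) / 2
print(f"average of the two guesses: {crowd_average:.1f}")
```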
In retrospect, the “sample” my wife and I investigated was biased, pushing my estimate unreasonably high. The “name” East Coast schools we tested were more likely to recruit California players than lesser-known state schools, which tend to look locally. For example, in my sample, Princeton and Boston College have 8 and 9 California players respectively, while Southern Illinois and Missouri State have none. Not surprisingly, California schools are heavy native recruiters.
I think there’s a lesson here for BI and data science, which often rely on Web-based surveys to gauge the marketplace. Like our initial choice of schools to investigate, the self-selecting respondents to surveys may lead to biased findings. Indeed, in the absence of a method such as random sampling to assure representation of the target population, analysts must take special precautions to “prove” their data isn’t biased, and thus at risk of invalid interpretation. My sense is that this invalidity is all too common at present in BI.