Thank goodness the NCAA Men’s Basketball Tournament has finally started. I’m not sure I could take much more of the “bracketology” prognostication that’s been consuming sports TV since pairings were announced last Sunday. It’s experts and analytics to the max – Dick Vitale versus Nate Silver.
For followers of college hoops like me, there’s no shortage of team performance metrics to digest, what with strength of schedule (SOS), ratings percentage index (RPI), basketball power index (BPI), Ken Pomeroy’s college basketball ratings, the Jeff Sagarin ratings – and many more. Analytics gurus run simulations of the tournament based on predictions derived from the ratings to determine their brackets.
Two months ago, I watched a game where one of the announcers opined that “no question the Big 10 is this season’s top conference top to bottom.” His partner wholeheartedly agreed. Me? I wondered whether the power rankings would concur. So I decided to put together a little data set with a snapshot of RPI, BPI and Pomeroy metrics by school and conference to test their “hypothesis”.
It turns out making the data set was a bit more work than I signed up for. The ESPN website offers both RPI and BPI, but in different tables. So I scraped each into a text file, loaded them into individual R data frames and attempted to join them by school name – only to discover many of the names didn’t match. Michigan State in one, Michigan St. in the other; Southern Methodist in one, SMU in the other. And just what is SDSU – San Diego State University, as I initially thought, or South Dakota State University, as intended?
Only two-thirds of the 347 school names matched. I had to connect the remaining third by hand to create a bridge table linking BPI and RPI. I then scraped the Pomeroy data and similarly bridged it to the BPI-RPI data frame for a consolidated, all-in finale. Somewhat painful.
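For anyone facing the same chore, here is a minimal Python sketch of the bridge-table idea (my own join was done in R, and this is not that code). The rating values, the `join_ratings` helper and the three-entry bridge are all made up for illustration; only the mismatched school names come from the story above. Names that match exactly join directly, names in the bridge are translated first, and anything still unresolved is reported for a manual look.

```python
# Hypothetical sample rows; the values are invented for illustration.
rpi = {"Michigan State": 0.61, "SMU": 0.52, "Duke": 0.66}
bpi = {
    "Michigan St.": 85.1,
    "Southern Methodist": 70.3,
    "Duke": 90.2,
    "SDSU": 60.0,  # ambiguous abbreviation, deliberately left unbridged
}

# Hand-built bridge: BPI spelling -> RPI spelling, only for names that differ.
bridge = {
    "Michigan St.": "Michigan State",
    "Southern Methodist": "SMU",
}


def join_ratings(rpi, bpi, bridge):
    """Join two name-keyed rating tables, translating names via the bridge.

    Returns (joined, unmatched): joined is keyed by the RPI spelling,
    unmatched lists BPI names that could not be linked.
    """
    joined = {}
    unmatched = []
    for bpi_name, bpi_value in bpi.items():
        # Fall back to the name itself when no bridge entry exists.
        rpi_name = bridge.get(bpi_name, bpi_name)
        if rpi_name in rpi:
            joined[rpi_name] = {"rpi": rpi[rpi_name], "bpi": bpi_value}
        else:
            unmatched.append(bpi_name)
    return joined, unmatched


joined, unmatched = join_ratings(rpi, bpi, bridge)
print(sorted(joined))
print(unmatched)
```

Returning the unmatched names rather than silently dropping them is the point: an abbreviation like SDSU should be resolved by a person, not guessed by the script.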
I sent the data to two colleagues, one of whom isn’t a hoops fan at all. I used R, one used Pentaho and the other deployed Tableau to assess conference performance. Our findings? All three of us rated the Big 10 and Big East above the other conferences, but none of us would anoint the Big 10 tops. Indeed, my colleagues both gave a slight advantage to the Big East. Perhaps the selection committee agreed with us: 8 of the 15 Big East teams were invited, as were 7 of 12 from the Big 10. My take? Even as they profess they’re evaluating a conference “top to bottom”, experts are actually over-weighting the highest, say, three quarters of teams in their assessment. Having many really good teams helps more than having a few really bad ones hurts.
For the last eight weeks I’ve looked at even more of the seemingly limitless supply of hoops ratings. My task of ingesting new web sources has been simplified by Microsoft’s nifty Data Explorer Excel add-in. For data that isn’t in HTML tables, I’m pretty handy with Python. So the Sagarin data’s now in, as is the Nolan power index.
Alas, I think I’ve met my match with what looked at first to be a treasure trove of rankings by Ken Massey. I was feeling pretty good after first securing the 50-odd metrics with a Python program. When the page was updated and I re-ran the program the following week, however, I got no output: the headings I’d coded to select lines of text had been re-ordered, so I had to adjust my script. The following week, another change in page layout sank the program again. It seems I have to make weekly code accommodations to update the data.
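One partial defense against re-ordered headings is to look columns up by name rather than by position. The Python sketch below assumes a simple comma-delimited page with a single heading line, which is not necessarily how the Massey page is laid out; `parse_ratings` and the sample text are hypothetical. It would survive a column shuffle, though not a renamed heading or a wholesale layout change, in which case it fails loudly instead of producing no output.

```python
def parse_ratings(text, wanted):
    """Extract the wanted columns from delimited text, locating them by name.

    Assumes the first non-blank line holds comma-separated headings.
    Raises ValueError if any wanted heading is absent.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [h.strip() for h in lines[0].split(",")]

    missing = [w for w in wanted if w not in header]
    if missing:
        raise ValueError(f"columns not found: {missing}")

    # Map each wanted heading to wherever it sits this week.
    index = {name: header.index(name) for name in wanted}

    rows = []
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split(",")]
        rows.append({name: cells[i] for name, i in index.items()})
    return rows


# Hypothetical snapshots: same data, columns shuffled between weeks.
week1 = "Team,RPI,BPI\nDuke,1,2\nKansas,3,4\n"
week2 = "Team,BPI,RPI\nDuke,2,1\nKansas,4,3\n"

rows1 = parse_ratings(week1, ["Team", "RPI"])
rows2 = parse_ratings(week2, ["Team", "RPI"])
print(rows1 == rows2)
```

Raising an error on a missing heading is a deliberate choice: an empty result is easy to overlook, while an exception naming the missing column says exactly what changed on the page.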
These minor frustrations brought to mind the recent Strata talk: “Broad Data: What Happens When the Web of Data Becomes Real,” by James Hendler. Hendler’s obsession is integrating small and broadly-available web data into usable formats. His solution: a little semantics – “meta-data” – for discovery, integration, visualization and extraction. Standards, consistency, organization, and conventions are friends. Amen.