I ran across an interesting column in the WSJ the other day entitled "Teachers Are Put to the Test." The article revolves around developments in the growing movement to evaluate teachers based, in large part, on their students' performance on standardized tests.

“The metric created by the Value-Added Research Center, a non-profit housed at the University of Wisconsin's education department, is a new kind of report card that attempts to gauge how much of students' growth on tests is attributable to the teacher … The teacher's overall effectiveness with every student in the classroom is boiled down to one number to rate them from least effective to most effective.”

Not surprisingly, many teachers are less than enthusiastic about being assessed this way. Critics note that the gold-standard evaluation framework, the randomized experiment, is impossible to deploy in value-added research, leaving results potentially badly biased by selection, measurement error and other factors that differ markedly from school to school and class to class. And teachers fret over having their livelihoods hinge on such unreliable measurement.

VARC counters that its director, Rob Meyer, has developed statistical methods to mitigate the biases, using “methods (that) are closely related to the statistical models traditionally used in formal quasi-experimental evaluation studies.” Meyer’s unique method explicitly “addresses student level selection bias and provides solutions for eliminating it ... allows for possible decay over time in the effects of interventions ... tracks student achievement across time, (and) explicitly addresses measurement error in student achievement.”
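The article doesn't detail VARC's models, but the core of value-added estimation can be illustrated with a deliberately simplified sketch: regress post-test scores on prior achievement plus teacher indicators, and read the teacher coefficients as value-added effects. Everything below, the data, the effect sizes, the model form, is simulated and hypothetical; VARC's actual methods treat selection, decay and measurement error far more elaborately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 students split among 4 teachers (all values hypothetical).
n_students, n_teachers = 200, 4
teacher = rng.integers(0, n_teachers, n_students)   # teacher assignment
pretest = rng.normal(50, 10, n_students)            # prior achievement
true_effect = np.array([-2.0, 0.0, 1.0, 3.0])       # simulated teacher effects
posttest = (5 + 0.9 * pretest + true_effect[teacher]
            + rng.normal(0, 5, n_students))

# Design matrix: intercept, pretest, and dummies for teachers 1..3
# (teacher 0 is the baseline, so estimates are relative to that teacher).
X = np.column_stack([
    np.ones(n_students),
    pretest,
    *(np.where(teacher == t, 1.0, 0.0) for t in range(1, n_teachers)),
])
coef, *_ = np.linalg.lstsq(X, posttest, rcond=None)

print("Estimated effects vs. teacher 0:", coef[2:].round(2))
print("True effects vs. teacher 0:     ", true_effect[1:] - true_effect[0])
```

Even in this benign simulation the estimates carry noise; when students aren't randomly assigned to teachers, the teacher dummies soak up whatever selection the pretest fails to capture, which is precisely the critics' worry.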

A few months ago, I wrote a blog post on a sophisticated methodology developed by several financial statisticians to support performance measurement (PM) in the investment science/portfolio management discipline. The investment community defines PM as “The calculation of return, risk and derived statistics stemming from the periodic change in market value of portfolio positions and transactions made into and within a portfolio, for use in the evaluation of historical fund or manager performance.” The holy grail of PM is the statistical separation of beta, which summarizes market forces outside the influence of individual investors, from alpha, a measure of individual manager skill and performance. Investment management companies often compensate their managers based on computed portfolio alphas.
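For concreteness, here's a minimal sketch of the standard single-factor market-model regression that underlies the alpha/beta separation. This is not the sophisticated methodology from that earlier post, and all the returns below are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical monthly excess returns (over the risk-free rate), in percent.
market = rng.normal(0.6, 4.0, 60)   # five years of market returns
# Simulated manager: true alpha = 0.2%/month, true beta = 1.1.
portfolio = 0.2 + 1.1 * market + rng.normal(0, 2.0, 60)

# Classic market model: r_p = alpha + beta * r_m + error, fit by ordinary
# least squares. Beta captures market exposure; alpha is the residual
# return conventionally attributed to manager skill.
X = np.column_stack([np.ones_like(market), market])
(alpha, beta), *_ = np.linalg.lstsq(X, portfolio, rcond=None)

print(f"estimated alpha = {alpha:.3f}% per month, beta = {beta:.2f}")
```

With only 60 noisy observations, the fitted alpha can wander well away from the true 0.2%, one reason skeptics doubt that computed alphas cleanly measure skill.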

Though portfolio returns are input to sophisticated statistical methods to produce estimates of alpha and beta, the techniques remain controversial, with many arguing it's impossible to produce unbiased estimates of manager skill. My own case of outperforming the “market” with meager investments from 2002-2005 is best explained by the risk I assumed in over-allocating to small and value equities, rather than by any stock-picking skill. In fact, my good fortune dissipated in 2006, when value fell out of vogue. Nonetheless, I wish I had been a portfolio manager over the earlier period. Had I enjoyed my success with a sizable dollar amount under management, I'd have been well compensated indeed.

Last week I picked up the just-released U.S. News Best Colleges, 2012 Edition. Best Colleges rank-orders different classes of schools, gathering 16 indicators of academic excellence for each college and using a secret formula to produce a composite score. Harvard and Princeton tied for the top of the national universities category in 2012, their scores normalized at 100; all other schools are curved down from there.
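U.S. News doesn't publish its weights, so the sketch below illustrates only the mechanics: scale each indicator, take a weighted sum, then curve so the top school scores 100. The schools, indicator values and weights are all made up.

```python
# Hypothetical inputs: school -> (peer reputation, freshman retention,
# selectivity). Neither the schools nor the weights are U.S. News's.
schools = {
    "School A": (4.9, 0.99, 0.93),
    "School B": (4.8, 0.99, 0.92),
    "School C": (4.2, 0.96, 0.70),
}
weights = (0.5, 0.3, 0.2)   # made-up weights summing to 1

cols = list(zip(*schools.values()))
lo, hi = [min(c) for c in cols], [max(c) for c in cols]

def rescale(v, i):
    """Min-max scale indicator i to the 0..1 range."""
    return (v - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 1.0

raw = {s: sum(w * rescale(v, i)
              for i, (w, v) in enumerate(zip(weights, vals)))
       for s, vals in schools.items()}
top = max(raw.values())
print({s: round(100 * r / top) for s, r in raw.items()})  # top school = 100
```

Small changes to the made-up weights reshuffle the made-up rankings, a point I return to below.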

Tellingly, this year's top 20 consists of private schools only. As much as U.S. News argues its measurement is outcome-based, the most differentiating factors in its scoring are school academic reputation/prestige and quality of student body as measured by rank in high school graduating class, test scores and admission percentage – more inputs than outputs. And the “outcomes” it does use, alumni giving rate, for example, are either flimsy or, in the case of graduation percentage, undifferentiated at the highest level. Is Yale's 99% freshman retention meaningfully different from the University of Chicago's 98%?

That schools select their students makes comparative evaluation of college performance problematic. Is Harvard a great school because of its brilliant professors, classes and learning environment – or because it's able to attract the most august student body? Billionaire Harvard dropouts Bill Gates and Mark Zuckerberg are evidence for the latter. Call me jaundiced, but after following Best Colleges for 20 years, I'm convinced the two biggest factors in the annual shuffling of rankings are year-to-year changes in scoring procedures and schools' success in gaming the ratings formula.

As major league baseball's playoffs come into focus, and the movie “Moneyball” starring Brad Pitt fills theaters, we are reminded once again of the emphasis on statistical analysis in sports, especially baseball. I've been a big fan of pioneering sabermetrician Bill James forever, and now periodically enjoy reading articles from the baseball quants of Baseball Prospectus.

Among the in-vogue statistics now used in the baseball world are VORP, value over replacement player, measured in runs, and WARP, wins above replacement player. VORP and WARP measure how much a player contributes in comparison to a fictitious replacement player – the level of performance an average team can expect when trying to replace a player at minimal cost. Irreplaceable superstars like Albert Pujols of the St. Louis Cardinals have high VORP and WARP; itinerant journeymen have low or negative VORPs. I'd love to see the 2011 figures for White Sox free agent disaster Adam Dunn.
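The real VORP calculation layers on park, position and league adjustments, but the core arithmetic is simple: a player's run production minus what a replacement-level player would produce in the same playing time. The replacement fraction and stat lines below are hypothetical, not Baseball Prospectus's actual parameters.

```python
# Replacement level is often taken as some fraction of league-average
# production; 0.80 here is an illustrative guess, not the official figure.
REPLACEMENT_FRACTION = 0.80

def vorp(runs_created, plate_appearances, league_runs_per_pa):
    """Runs above what a replacement-level hitter would supply in the
    same number of plate appearances."""
    replacement_runs = (REPLACEMENT_FRACTION * league_runs_per_pa
                        * plate_appearances)
    return runs_created - replacement_runs

league_rate = 0.115   # made-up league runs created per plate appearance
print(vorp(120, 650, league_rate))   # star hitter: roughly +60 runs
print(vorp(55, 600, league_rate))    # journeyman: near zero or negative
```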

Even with all the analytic sophistication of these measurements, I find it hard to believe they can be free of confounding with team effects. There must be at least some individual performance benefit to playing on a great team that's difficult to isolate statistically.

Consider Yankee center fielder Curtis Granderson, who's enjoying a breakout season in New York. Through 9/24, with five games remaining, Granderson has hit 41 home runs, scored 134 runs and driven in 119 – each figure already well above his career highs entering 2011. But the American League-best Yankees have another player, Robinson Cano, also enjoying an MVP-like season. And up and down their lineup, players are having productive years. Indeed, three Yankees are among the top six RBI leaders in the American League. Think being surrounded by high-performing teammates hasn't helped Granderson? I wonder what his statistics would look like if his 2011 team were the woeful Astros.

So what are the defining characteristics of performance measurement 2011? I can count at least five. First is an obsession with outcomes – test scores, portfolio returns, graduation rates, runs and wins – as a basis for the assessment of performance. Second is measurement at both individual and collective levels to serve both personal and organizational evaluation. Schools aggregate the measurements of teachers and students. Baseball teams are the sum of individual player performance. Hospitals are a roll-up of their providers. Prestigious universities are the sum of outstanding students and faculty. And investment companies aggregate the returns of individual managers.
 
Third is forced ratings for all – grading on a curve, as sketched below – where there's a price to pay for landing in the lower tail. Fourth is the use of advanced analytics and complex statistical models as the special sauce to focus and “purify” measurements against uncontrollable, biasing outside factors that aren't distributed uniformly among measurement units. And finally, there are the inevitable disagreements over how the final measure is calculated, along with the remonstrations of those who feel such measurement is unfair because, in non-randomized, observational analyses, “other things aren't equal” enough to support valid between-group interpretation.
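Here's a minimal sketch of that forced-curve idea: rank every unit on its composite score and assign grades by fixed percentile cutoffs, so some fraction always lands in the lower tail. The scores, cutoffs and grade labels are invented.

```python
def curve(scores, cutoffs=(0.10, 0.30, 0.70, 0.90), grades="EDCBA"):
    """Assign a letter grade to each score by percentile rank: the bottom
    10% get 'E', the next 20% 'D', and so on up to the top 10% 'A'."""
    ranked = sorted(scores)                      # ascending: worst first
    out = {}
    for s in scores:
        pct = ranked.index(s) / len(ranked)      # percentile rank in [0, 1)
        out[s] = grades[sum(pct >= c for c in cutoffs)]
    return out

# Ten hypothetical composite scores; someone must land in the lower tail.
print(curve([52, 61, 70, 74, 78, 81, 85, 88, 93, 99]))
```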

One thing's certain given the tsunami of evaluation momentum in business, education, health care and government: performance measurement 2011 and its descendants are here to stay.