As noted in my last blog, I was quite taken by Clifford Lyon’s presentation, “Why We Test: Informed Design at CBS Interactive”, at the recent IE Predictive Analytics Summit in Chicago. A collateral benefit of Lyon’s informative discussion of using experiments or A/B testing to improve user website experience was exposure to the research of Microsoft’s Ronny Kohavi.

The point of departure for the accessible paper “Online Experimentation at Microsoft,” co-authored by Kohavi, was a statement from a 2005 memo by Microsoft’s then-CTO, Ray Ozzie, that the “web is fundamentally a self-service environment, and it is critical to design websites and product ‘landing pages’ with sophisticated closed-loop measurement and feedback systems.” On the heels of that dictum, Kohavi and colleagues conceptualized the Experimentation Platform (ExP) at Microsoft, which would enable product teams to run controlled experiments.

The paper’s most fundamental experiment is the two-group A/B test, in which 50% of users are assigned randomly to a treatment condition and 50% to a control. The measurement of interest, called a response or dependent variable in the scientific world, is often referred to as a metric, key performance indicator, or overall evaluation criterion (OEC) in business. Randomization assures that, within the limits of probability, “the only thing consistently different between the two variants is the change between the Control and Treatment, so any statistically significant differences in the OEC are the result of the specific change, establishing causality.” Experiments can, of course, be more complicated than the simple A/B. Multiple treatments with a control along a single axis are referred to as A/B/C/D…, while multivariable tests that involve multiple axes are common as well.
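For readers curious about the mechanics, here is a minimal sketch of how the results of such a two-group test on a click-through metric might be analyzed with a standard two-proportion z-test. The function name and the traffic and click counts are hypothetical illustrations, not anything from the ExP platform itself:

```python
import math

def ab_test(n_control, clicks_control, n_treatment, clicks_treatment):
    """Two-proportion z-test comparing click-through rates of two variants."""
    p_c = clicks_control / n_control
    p_t = clicks_treatment / n_treatment
    # Pooled proportion under the null hypothesis of no difference
    p_pool = (clicks_control + clicks_treatment) / (n_control + n_treatment)
    # Standard error of the difference between the two proportions
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
    z = (p_t - p_c) / se
    return p_c, p_t, z

# Hypothetical 50/50 random split of 100,000 users
p_c, p_t, z = ab_test(50_000, 2_500, 50_000, 2_750)
print(f"control CTR={p_c:.3f}, treatment CTR={p_t:.3f}, z={z:.2f}")
```

With these made-up numbers, |z| exceeds the conventional 1.96 cutoff, so the treatment’s lift would be declared statistically significant at the 5% level; with a smaller difference or less traffic, the same change could easily fail to reach significance, which is why sample size matters so much in these tests.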

ExP’s mission objectives, solidified in 2006, were both to build a testing platform that’s easy to integrate and to change Microsoft’s web development culture toward more data-driven decisions. Yet, tellingly, it wasn’t until two years later that adoption of ExP grew significantly. Though technical challenges were many, it was the cultural barrier – “getting groups to see experimentation as part of the development lifecycle” – that was most nettlesome. And perhaps the single biggest early impediment? A history of deferring to the Highest-Paid-Person’s Opinion – HiPPO – rather than resolving debates with data. Upton Sinclair’s quote is apt: “It is difficult to get a man to understand something when his salary depends upon his not understanding it.”

Though change has been difficult, the platform has nonetheless been very rewarding to Microsoft. Experimentation is now pervasive across Microsoft’s web properties, with many demonstrated successes. Testing on the look and feel of a new widget for the MSN Real Estate site led to a lift in referral revenues of almost 10% from an increase in click-through rate (CTR). An experiment on the Microsoft support page contrasting a section that answers the most common technical questions with one that personalizes to the user’s browser and operating system demonstrated a 50% higher CTR for the latter. And an MSN Homepage Header Experiment that compared a magnifying glass with actionable keywords such as Search, Go, and Explore revealed a statistically significant 1.23% increase in usage with Search. As one pleased early user enthused, “The results of the experiment were in some respects counterintuitive. They completely changed our feature prioritization. It dispelled long-held assumptions about video advertising.”

Several important takeaways from the Microsoft experience? 1) First and foremost, “Avoid the temptation to try and build optimal features through extensive planning without early testing of ideas.” Test early and often. 2) Those introducing experimental or “science of business” approaches should be cautioned that there’s much more immediate failure than success. Indeed, failures consistently outnumber successes two to one. “A failure of an experiment is not a mistake: learn from it.” 3) Take fliers and test radical and controversial ideas. Amazon’s personalized recommendations based on items in a shopping cart were a wildly successful idea that a senior executive at first vehemently opposed.