I recently received an email from a colleague following the posting of my article on business experiments. Tongue firmly in cheek, he asked in exasperation how long I could continue to milk the use of randomized business experiments in my blog. Enough already, he implored, write on something else! A bit more seriously, he then asked if it were the experimental way or the highway for me.

My first reaction was to jump to the defense of randomization and experiments as the proper foundation of the scientific method for business, blah, blah, blah. But as I started to draft my response, I realized there was merit to his ribbing. I've been writing as if randomized experiments were themselves a desired end state when in fact they're just methodological tools to help scientists get to their goal of reproducible discovery.

Researchers often talk of the internal and external validity of their inquiries. Shadish, Cook and Campbell define internal validity as “inferences about whether an observed covariation between A and B reflects a causal relationship from A to B … the researcher must show that A preceded B in time, that A covaries with B and that no other explanations for the relationship are plausible.”

External validity has to do with generalizing from the sample to the population – from the experimental setting to the real world. “External validity concerns inferences about the extent to which a causal relationship holds over variations in persons, settings, treatments, and outcomes.” In other words, external validity addresses the extent to which a sample adequately represents the population.

Random sampling and experimentation are perhaps the sharpest instruments in the research toolchest for minimizing these threats to validity. A properly designed/executed experiment using random assignment to the various conditions can, constrained by the mathematics of probability, assure that treatment/control conditions are allocated to subjects in such a way that differences in outcome measures are not due to extraneous, confounding factors. In other words, if there is a difference in measurement between treatment and control, it's likely due to the treatment itself and not to outside variables. Randomization essentially “equalizes” these potentially contaminating factors between treatment and control.
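To make the “equalizing” claim concrete, here is a minimal simulation sketch in Python. The covariate (think of it as prior customer spend), its distribution, and the function name `mean_covariate_gap` are all hypothetical choices for illustration. Subjects are assigned by coin flip, independently of the covariate, and the sketch reports how far apart the two group means end up.

```python
import random
import statistics

def mean_covariate_gap(n_subjects=10_000, seed=0):
    """Randomly assign subjects and return the treatment-minus-control
    gap in the mean of a covariate that assignment never looked at."""
    rng = random.Random(seed)
    # Hypothetical confounder, e.g. prior annual spend per customer.
    covariate = [rng.gauss(100.0, 20.0) for _ in range(n_subjects)]
    # Coin-flip assignment, independent of the covariate.
    treated = [rng.random() < 0.5 for _ in range(n_subjects)]
    treat_vals = [x for x, t in zip(covariate, treated) if t]
    control_vals = [x for x, t in zip(covariate, treated) if not t]
    return statistics.mean(treat_vals) - statistics.mean(control_vals)

gap = mean_covariate_gap()
print(f"Gap in covariate means after random assignment: {gap:.3f}")
```

With thousands of subjects the gap should be small relative to the covariate's standard deviation of 20, and it tends to shrink further as the sample grows. That is the “constrained by the mathematics of probability” caveat in action: balance holds on average, not exactly in any single draw.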

Similarly, a random sample from a population can assure, limited by the laws of probability, that the selected units “represent” the population reasonably well – that there's no systematic difference or bias between the sample and the population on unmeasured dimensions. So random sampling and randomized experiments are designs that, used judiciously, can go a long way to assure both the internal and external validity of our business intelligence inquiries.

But they aren't the only ways to demonstrate that “other things are equal” for dispelling alternative explanations to BI findings. One method in particular that seems to be gaining in popularity is the use of “matching” to equate the treatment and control groups of non-experimental investigations on important factors outside the intervention.

A pure matching scenario starts with a non-randomized treatment/control “study.” It also identifies important variables beyond the intervention that might be expected to influence the outcome. For each member of the treatment group, an attempt is made to locate a control that has similar scores on the matching variables. The “matched” treatment and controls help to assure that “other things are equal” with the potentially confounding extraneous variables. The same logic holds for selecting a sample from the population. If the important matching variables of the population are known, choose a sample to assure it mimics the population on these attributes.
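The matching step described above can be sketched as a greedy one-to-one nearest-neighbor search. This is only one of several ways to match, and everything named here (`match_controls`, the optional `caliper`, the toy customer IDs and covariate values) is a hypothetical illustration. It assumes the matching variables have already been put on comparable scales.

```python
import math

def match_controls(treated, controls, caliper=None):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping a unit id to a tuple of matching
    variables (assumed to be on comparable scales already).
    Returns a dict mapping each matched treated id to a control id.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs = {}
    for t_id, t_x in treated.items():
        if not available:
            break
        # Closest remaining control by Euclidean distance.
        c_id = min(available, key=lambda c: math.dist(t_x, available[c]))
        if caliper is None or math.dist(t_x, available[c_id]) <= caliper:
            pairs[t_id] = c_id
            del available[c_id]  # each control is used at most once
    return pairs

# Toy data: two scaled matching variables per unit.
treated = {"T1": (0.2, 1.0), "T2": (0.9, 0.1)}
controls = {"C1": (0.85, 0.15), "C2": (0.25, 0.9), "C3": (0.5, 0.5)}

pairs = match_controls(treated, controls)
print(pairs)
```

Greedy matching depends on the order in which treated units are processed, and an unmatched control such as C3 is simply dropped from the comparison. Setting a caliper leaves a treated unit unmatched rather than pairing it with a poor fit.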

What happens, though, when there are more outside variables of concern than can be accommodated by individually matching treatment to control? One answer is the use of “propensity scores” that summarize a number of potentially confounding variables into a single measure that can be contrasted between treatment and control, sample and population.

A propensity score is typically defined as the probability of receiving a treatment versus a control assignment, given a set of observed baseline characteristics. It can also represent the probability of being included in a sample of the population, given the measured covariates. Once the propensities are calculated, the scores are compared, either treatment to control or sample to population. Between-group similarity in the distributions of propensity scores inspires confidence that “other things are equal.” Matching is then done using proximate propensity scores.
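As a rough sketch of those steps, the Python below estimates propensities with a hand-rolled logistic regression, a common but not the only choice of model, and then pairs treated and control units with proximate scores. The simulated data, the coefficients that generate it, the 0.05 caliper, and all function names are assumptions made purely for illustration. This is not how any particular package implements the technique.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_propensity_model(X, treated, lr=0.5, epochs=300):
    """Fit P(treated = 1 | x) by logistic regression using batch
    gradient ascent. Returns [intercept, coef_1, ..., coef_k]."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, t in zip(X, treated):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
            grad[0] += t - p
            for j, xj in enumerate(x):
                grad[j + 1] += (t - p) * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

def propensity_scores(X, w):
    """Estimated probability of treatment for every unit."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))) for x in X]

def match_on_score(scores, treated, caliper=0.05):
    """Greedy 1:1 matching: pair each treated unit with the closest
    unused control whose score lies within the caliper."""
    controls = {i for i, t in enumerate(treated) if t == 0}
    pairs = []
    for i, t in enumerate(treated):
        if t == 1 and controls:
            j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
            if abs(scores[j] - scores[i]) <= caliper:
                pairs.append((i, j))
                controls.remove(j)
    return pairs

# Simulated observational data: treatment is NOT randomized, it
# depends on two standardized baseline covariates (hypothetical).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
treated = [1 if rng.random() < sigmoid(1.0 * x[0] - 0.5 * x[1]) else 0 for x in X]

w = fit_propensity_model(X, treated)
scores = propensity_scores(X, w)
pairs = match_on_score(scores, treated)

def mean(values):
    return sum(values) / len(values)

mean_treated = mean([s for s, t in zip(scores, treated) if t == 1])
mean_control = mean([s for s, t in zip(scores, treated) if t == 0])
print(f"Mean propensity, treated: {mean_treated:.3f}; control: {mean_control:.3f}")
print(f"Matched pairs within caliper: {len(pairs)}")
```

Because treatment here depends on the covariates, the unmatched groups differ in their average propensity. Restricting the comparison to the matched pairs keeps only treated and control units whose scores are close, which is the “other things are equal” condition the scores are meant to approximate.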

There's quite a bit of academic research on the relative effectiveness of matching and propensity techniques as proxies for randomized experiments where the latter aren't feasible, but where attention to design remains critical. The results are mixed but with enough positives to fuel continued investigation.

Expect the use of matching and propensity score “quasi-experimental” techniques to mushroom in business analytics where pure experiments and randomization are not practical. For those interested in learning more about non-experimental designs for BI, I'd recommend the book cited above by Shadish, Cook and Campbell, “Experimental and Quasi-Experimental Designs for Generalized Causal Inference.” For those who'd like to learn about matching and propensity score statistical adjustments in observational studies, I'd suggest the approachable papers of Johns Hopkins statistician Elizabeth Stuart. And for those ready to take the propensity plunge, there are a host of freely available R packages, including MatchIt and twang.
