Not everyone shares that enthusiasm, though, least of all threatened public school teacher unions. And debates rage in the educational world over the relative “success” of charter schools in improving student performance compared with traditional public schools.
Consider the case of a set of students who are admitted to study at a charter school. To assess the effectiveness of that school, the administration would like to compare the performance of these students (the factual) with the performance the same group of students would have shown in a traditional public school environment (the counterfactual). The difference between the first and second measures would be the effect of charter school education for that group of students.
Alas, the counterfactual is unobservable: students cannot simultaneously be in both charter and traditional public schools. “So a central task for all cause-probing research is to create reasonable approximations to this physically impossible counterfactual.” This is where methods and designs derived from the scientific method come in.
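The comparison can be written compactly in potential-outcomes notation, which is common in the causal inference literature but is an assumption here rather than the author's own notation. Let $Y_i(1)$ be admitted student $i$'s score after charter schooling and $Y_i(0)$ the score the same student would have earned in a traditional public school. For the $n$ admitted students, the effect described above is:

```latex
\tau = \frac{1}{n}\sum_{i=1}^{n}\bigl[\,Y_i(1) - Y_i(0)\,\bigr]
```

Only $Y_i(1)$ is ever observed for these students; every $Y_i(0)$ is the physically impossible counterfactual that a research design has to approximate.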
The platinum design for educational skeptics is the randomized experiment, in which prospective students are assigned at random to treatment or control: charter or traditional public school. And it turns out it’s often natural to allocate prospects this way. Since demand for charter slots exceeds supply, school districts can resort to lotteries to assign students. The lotteries behave much like random assignment, creating groups that, apart from chance differences, are equivalent on factors other than the method of schooling. In theory, then, differences between charter and traditional public school performance would be “pure,” or “unbiased.” Skeptics would be pleased.
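To see why a lottery pleases skeptics, here is a small simulation on entirely synthetic data. It assumes a hypothetical unobserved confounder ("family motivation") that raises test scores, and compares the naive charter-minus-public score gap when enrollment is decided by lottery versus when motivated families self-select. The effect size, coefficients and sample size are illustrative assumptions, not estimates from any real district.

```python
import numpy as np

def simulate(n=20_000, true_effect=5.0, seed=0):
    """Charter-vs-public score gaps under lottery vs self-selected enrollment.

    All data are synthetic; 'motivation' is a hypothetical unobserved confounder.
    """
    rng = np.random.default_rng(seed)
    motivation = rng.normal(0.0, 1.0, n)   # unobserved confounder
    noise = rng.normal(0.0, 10.0, n)       # everything else that moves scores

    # Lottery: enrollment is a coin flip, independent of motivation.
    lottery = rng.random(n) < 0.5
    # Self-selection: more motivated families are more likely to enroll.
    selected = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * motivation))

    def gap(charter):
        # Scores depend on motivation plus the true charter effect for enrollees.
        scores = 70.0 + 8.0 * motivation + true_effect * charter + noise
        return scores[charter].mean() - scores[~charter].mean()

    return gap(lottery), gap(selected)

lottery_est, selected_est = simulate()
print(f"lottery estimate:        {lottery_est:.2f}")
print(f"self-selection estimate: {selected_est:.2f}")
```

With the lottery, the simple difference in means lands close to the true effect of 5; with self-selection it is inflated well above it, because motivated students are over-represented in the charter group.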
Even without a randomized lottery, though, evaluators can learn from their data. Much insight, for example, can be gleaned from comparing the performance of neighboring charter and traditional schools within socioeconomically homogeneous urban communities over time. Such a pre-post “quasi-experimental” design with a natural control group can help mitigate bias, even if it’s not as powerful as its randomized cousin. And if multiple pretest and posttest scores are available for each school, so much the better for quieting the skeptics.
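One common way to analyze such a pre-post design with a natural control group is a difference-in-differences estimate, sketched below. The scenario is hypothetical: a school that converts to charter status after year 3, compared with a neighboring traditional school, with made-up yearly mean scores. Averaging several pretest and posttest years per school is one simple way to use the multiple observations mentioned above; the estimate is only credible if both schools would have followed parallel trends without the conversion.

```python
import numpy as np

def diff_in_diff(treated, control, n_pre):
    """Difference-in-differences on yearly mean scores.

    treated, control: yearly mean scores for the converted (charter) school and a
    neighboring traditional school; the first n_pre entries are pre-conversion years.
    """
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    treated_change = treated[n_pre:].mean() - treated[:n_pre].mean()
    control_change = control[n_pre:].mean() - control[:n_pre].mean()
    # The control school's change stands in for what would have happened anyway.
    return treated_change - control_change

# Hypothetical scores: both schools improve by 1 point a year; the charter
# conversion after year 3 adds a further jump to the converted school only.
charter = [60, 61, 62, 67, 68, 69]
traditional = [58, 59, 60, 61, 62, 63]
estimate = diff_in_diff(charter, traditional, n_pre=3)
print(estimate)  # 4.0
```

The converted school gained 7 points on average, but the neighboring school gained 3 without converting, so only 4 points are attributed to charter status.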
Skepticism, then, is an obsession with the scientific method and with tools to formulate, test and defend hypotheses. It’s not enough to observe a strong relationship between factors of interest; potential alternative, or “confounding,” explanations for the association must also be ruled out. Scientists rely on rigorous designs such as randomized experiments to eliminate potential confounding biases. And when randomization’s impractical, clever practitioners must be inventive with quasi-experimental alternatives.
Data science faces challenges in its work similar to those faced by scientists and charter school evaluators. Indeed, as I noted in a blog post earlier this year: “data scientists generally work with messy observational data from which it can be difficult to prove that factor A caused outcome B. Does a high correlation between A and B indicate that A caused B? Or maybe that both A and B are caused by a confounding factor C? Or perhaps that A and B are spuriously related? In the absence of random sampling or random assignment to experimental groups, these questions can be nearly impossible to answer with certainty – hence the skepticism of good data scientists.”
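The "confounding factor C" scenario in that quote is easy to reproduce with synthetic data. In the sketch below, C drives both A and B while A has no effect on B at all, yet A and B are strongly correlated. Removing C's linear influence from both (a partial correlation) makes the association essentially vanish. This assumes C is measured and its effect is linear, which real observational data rarely guarantee; that is exactly why the skeptic worries about confounders nobody recorded.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
c = rng.normal(size=n)             # confounder C
a = 2.0 * c + rng.normal(size=n)   # A depends only on C
b = 3.0 * c + rng.normal(size=n)   # B depends only on C, not on A

def residualize(y, x):
    """Remove the least-squares linear effect of x (plus an intercept) from y."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

raw_corr = np.corrcoef(a, b)[0, 1]
partial_corr = np.corrcoef(residualize(a, c), residualize(b, c))[0, 1]
print(f"corr(A, B)     = {raw_corr:.3f}")
print(f"corr(A, B | C) = {partial_corr:.3f}")
```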
The designs that skeptics rely on to eliminate bias are also important for business and data science. It’s well documented how Internet businesses such as Amazon, Yahoo, Google and eBay use experiments to learn from their interactions with customers. Randomized marketing campaigns, such as those Capital One and Harrah’s use to allocate offers optimally, are pervasive. And customer loyalty programs are generally built on value cutoffs whose effects can be assessed with regression discontinuity analysis.
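A minimal sketch of a sharp regression discontinuity for a loyalty cutoff follows. The scenario is hypothetical: customers spending at least 1,000 in a year earn a loyalty tier, and we ask how much the tier changes next year's spend. Customers just above and just below the cutoff should be similar, so fitting a separate straight line on each side, within a bandwidth, and comparing the two fitted values at the cutoff estimates the tier's effect. The cutoff, bandwidth, local-linear specification and synthetic data are all illustrative assumptions.

```python
import numpy as np

def rd_estimate(running, outcome, cutoff, bandwidth):
    """Sharp regression discontinuity: jump in outcome at the cutoff.

    Fits a separate straight line on each side of the cutoff using only
    observations within `bandwidth`, then differences the fitted values there.
    """
    running = np.asarray(running, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    near = np.abs(running - cutoff) <= bandwidth
    above = running >= cutoff

    def value_at_cutoff(mask):
        # Centering on the cutoff makes the fitted intercept the value at the cutoff;
        # np.polyfit with deg=1 returns [slope, intercept].
        slope, intercept = np.polyfit(running[mask] - cutoff, outcome[mask], deg=1)
        return intercept

    return value_at_cutoff(near & above) - value_at_cutoff(near & ~above)

# Synthetic example: spend >= 1000 earns a hypothetical tier worth +50 next year.
rng = np.random.default_rng(2)
spend = rng.uniform(500, 1500, 20_000)
tier = spend >= 1000
next_year = 200 + 0.5 * spend + 50 * tier + rng.normal(0, 30, spend.size)
est = rd_estimate(spend, next_year, cutoff=1000, bandwidth=200)
print(f"estimated tier effect: {est:.1f}")
```

Comparing all tier members with all non-members instead would mix the tier's effect with the fact that big spenders simply spend more; the discontinuity design isolates the jump at the threshold.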
To ensure I consider all design possibilities when conducting data science investigations, I consult my trusty skeptics’ guide: Experimental and Quasi-Experimental Designs for Generalized Causal Inference.
The quasi-experimental designs from the book that I consider when randomized experiments aren’t feasible include variants of:
- Untreated, Non-Randomized Control Group with Pretest and Posttest
- Cohort Control Matching/Propensity Score Matching
- Interrupted Time Series with Non-Random Control Groups
- Regression Discontinuity
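As one illustration of the matching family in that list, here is a propensity score matching sketch on synthetic data. It assumes two hypothetical standardized covariates (say, prior test score and family income) that influence both enrollment and outcomes. It fits a logistic regression for the probability of treatment by Newton-Raphson, then matches each treated unit to the control with the nearest estimated score (1-nearest-neighbor, with replacement) and averages the outcome differences, giving an estimate of the effect on the treated. Real analyses would also check covariate balance and overlap; this only shows the mechanics.

```python
import numpy as np

def fit_logistic(X, t, iters=25):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (t - p)                          # score of the log-likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])   # observed information
        beta += np.linalg.solve(hess, grad)
    return beta

def psm_att(covariates, treated, outcome):
    """Effect on the treated via 1-nearest-neighbor propensity matching (with replacement)."""
    X = np.column_stack([np.ones(len(treated)), covariates])
    score = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, treated.astype(float))))
    treated_idx = np.flatnonzero(treated)
    control_idx = np.flatnonzero(~treated)
    # For each treated unit, pick the control unit with the closest propensity score.
    dist = np.abs(score[treated_idx][:, None] - score[control_idx][None, :])
    matches = control_idx[dist.argmin(axis=1)]
    return (outcome[treated_idx] - outcome[matches]).mean()

# Synthetic data: both covariates raise enrollment odds and outcomes; true effect is 5.
rng = np.random.default_rng(4)
n = 3000
x = rng.normal(size=(n, 2))
treated = rng.random(n) < 1.0 / (1.0 + np.exp(-(x[:, 0] + 0.8 * x[:, 1])))
outcome = 50 + 10 * x[:, 0] + 4 * x[:, 1] + 5 * treated + rng.normal(0, 5, n)

naive = outcome[treated].mean() - outcome[~treated].mean()
att = psm_att(x, treated, outcome)
print(f"naive difference: {naive:.2f}, matched estimate: {att:.2f}")
```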
I’d urge data scientists (and BI analysts) to become familiar with these design techniques, especially interrupted time series with non-random control groups, when randomization’s impractical but skepticism’s pervasive. It’s often quite easy, for example, to turn a simple pretest/posttest on a single treatment group into an interrupted time series with multiple pretest/posttest observations on both treated and non-randomized controls. The latter design is quite a bit more skeptically informative than the former.
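A sketch of that upgraded design, a comparative interrupted time series estimated by segmented least squares, is below. The data are synthetic and hypothetical: 20 periods for a treated unit and a non-randomized control, with an intervention at period 10, a true treated-only level jump of 4, and a statewide shock of +2 that hits both series at the same moment. The model gives each series its own intercept, trend, level shift and slope change, and reports the treated series' shifts net of the control's. A single-group analysis of the treated series alone would credit the intervention with the statewide shock as well.

```python
import numpy as np

def comparative_its(y_treated, y_control, start):
    """Comparative interrupted time series via segmented OLS.

    Returns (level_shift, slope_change): the treated series' change in level and
    trend at `start`, net of the change seen in the control series.
    """
    y_treated = np.asarray(y_treated, dtype=float)
    y_control = np.asarray(y_control, dtype=float)
    n = len(y_treated)
    t = np.arange(n, dtype=float)
    post = (t >= start).astype(float)
    since = (t - start) * post                  # periods elapsed since the intervention
    base = np.column_stack([np.ones(n), t, post, since])

    # Stack both series; treated rows repeat the base columns as interaction terms.
    X = np.vstack([np.hstack([base, base]),
                   np.hstack([base, np.zeros_like(base)])])
    y = np.concatenate([y_treated, y_control])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Columns 4-7: treated-minus-control intercept, trend, level shift, slope change.
    return coef[6], coef[7]

rng = np.random.default_rng(3)
periods = np.arange(20)
shock = 2.0 * (periods >= 10)                   # hypothetical shock hitting both series
control = 45 + 1.0 * periods + shock + rng.normal(0, 0.3, 20)
treated = 50 + 1.0 * periods + shock + 4.0 * (periods >= 10) + rng.normal(0, 0.3, 20)
level, slope = comparative_its(treated, control, start=10)
print(f"net level shift: {level:.2f}, net slope change: {slope:.3f}")
```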