The progress of a business enterprise is often nonlinear. You strategize, plan, execute, evaluate and adapt - sometimes frustrated by the absence of consistent movement. Then, when you least expect it, an opportunity appears from seemingly nowhere, almost by happenstance. OpenBI had such good fortune recently. It wasn't a new customer, nor a vendor partner or new employee. In early August, I serendipitously met a gentleman from the academic world who has helped broaden my view of business intelligence (BI), placing it in a wider context of social science and quantitative/statistical analysis - historically critical incubators of BI methods. Indeed, it is possible to envision a bit of BI's future by looking at what is prominent in quantitative social science today, with an eye toward new analytic tools for the evolving BI portfolio.
Gary King is David Florence Professor of Government and Director of the Institute for Quantitative Social Science (IQSS) at Harvard University. In the course of introductory discussions relating his work to ours, Professor King generously agreed to be interviewed for OpenBI Forum. My October and November columns will be Q&As with him surrounding quantitative social science and BI. In a later column, I'll focus on exciting technology developments for a world-wide launch at IQSS of what will be called the Dataverse Network. I trust you'll find Professor King's insights and IQSS's analytical leadership as informative and BI-thought-provoking as I do. I hope also that you'll look to start adapting these methods to solve real business problems as opportunities present themselves.
SM: You're a political scientist by training. How did you end up as the director of the Institute for Quantitative Social Science at Harvard?
GK: You may remember "polisci" from qualitative courses you took in college. But a growing fraction of research in political science is heavily quantitative and statistical. Political science is also one of the most interdisciplinary fields in academia, and indeed the training of at least some of my colleagues in the Government department at Harvard spans political science, business, sociology, economics, statistics, law, anthropology and other areas. My focus within political science is on empirical methods, a subfield known as "political methodology," and because of the interdisciplinary nature of political science we have the most diverse types of data and methodological problems of any of the methodological fields within the traditional academic disciplines, such as econometrics within economics, psychometrics within psychology, biostatistics and epidemiology within medicine, etc. In any event, I think you can begin to get a sense why a political scientist might lead an interdisciplinary institute devoted to quantitative analysis of social science problems.
SM: Could you tell us a bit about the Institute, its charter and some early accomplishments?
GK: The scientific mission of IQSS is to create, and make widely accessible, statistical and analytical tools for the social sciences and related areas and to use these tools for understanding and solving major problems that affect society and the well-being of human populations. We foster interdisciplinary, larger-scale and highly collaborative projects that cannot readily be accomplished within the traditional setting of individual departments. We are also building a scientific culture where faculty, students and staff work side by side, not only to solve their own disciplinary problems, but also to seek out problems in unrelated or applied areas amenable to the same approach.
The tools that scholars at IQSS have created have been used in many fields of academia and beyond. For example, most states, legislatures, courts and partisans use the methods we developed to evaluate the fairness of legislative redistricting plans and the existence of partisan and racial gerrymandering. The U.S. Supreme Court discussed our proposals favorably in three of its opinions in the recent Texas redistricting case.
We've created methods of survey research, now implemented in surveys in over 80 countries and many fields, that avoid the problem of survey respondents thinking you mean something different than they do, or different respondents interpreting the same question in different ways.
We've developed statistical and other methods for a wide range of other problems too. Many of these are implemented in software we make available.
SM: Are HBS faculty affiliated with the Institute? How much, if any, of a business focus does IQSS have?
GK: HBS's strength historically has been in qualitative research, and the school is only just beginning to develop strengths in quantitative research. Nevertheless, we do have some of their faculty regularly attending our seminars and workshops and hope to be able to help them gear up when interested.
At Harvard, most centers and institutes are located within one of our 12 schools. IQSS has been located within the Faculty of Arts and Sciences, which is where the undergraduate college and graduate school reside, but the University recently decided - and will announce this academic year - that IQSS will be transformed into a university-wide Institute, spanning all of Harvard's schools. Of course, this also includes HBS.
SM: The publishing of Freakonomics, with its emphasis on using statistical techniques to address rather mundane economic problems, has started to educate the public on the evolving role of data and quantitative social science and has served to bridge the gap between the approaches of statistics and the softer business intelligence. Could you tell us how the quantitative social science discipline has changed in the last 20 years? Has the domain of social statistics expanded? Has it become more applied - more focused on solving real world problems?
GK: Steve Levitt, author of Freakonomics, was a graduate of a program we ran that eventually became IQSS, so along with many others we're happy to claim him as our product!
Here's one way to think about this. All research and almost all of life includes qualitative features, intuitions and knowledge. When you walk into a room and meet someone for the first time, you instantly decide that the person is not going to kill you and you act appropriately. No one would suggest quantifying that decision, entering the data in a computer, running the best statistical methods on it, interpreting the results and then making a decision.
Some fraction of research is also quantitative, where judgments, decisions, actions and behaviors are quantified, measured and analyzed more or less formally. Over the last 20 years, the fraction of research that is quantitative has grown fast. Research will always include qualitative information, but the best decisions are now being informed more and more by systematic statistical analyses.
SM: Why the changes over time?
GK: The central conclusion of research in hundreds of fields and most of the hundreds of thousands of applications is the same: Whenever a sufficiently important fraction of information can be quantified, statistical analysis beats qualitative human judgment. There is just no contest.
SM: Can you give some examples?
GK: 1. An election is coming up - an election is always coming up! - and you will hear dozens of Sunday morning pundits giving their opinions about which candidate is likely to win and by how much. We know from systematic study that these prognostications are little better than random chance. They're still loads of fun to watch, but their specific numerical predictions about who will win which elections are largely useless. At the same time, using modern statistical methods a number of us have developed, it is relatively easy to forecast election outcomes within known margins of error - to know who will win, what their vote percentages will be and even what the effects of legislative redistricting plans are likely to be. These statistical forecasts are not perfect either, but they are far better than chance, considerably better than the pundits, and the degree to which they will be wrong is also estimable.
2. Kevin Quinn, a faculty member at IQSS, and some coauthors came up with a statistical method of forecasting Supreme Court decisions. As you might imagine, law professors - who know a great deal more about jurisprudence, judicial precedent, and the personalities and preferences of the justices than any statistical algorithm - predict the outcome of Supreme Court cases all the time. So Quinn set up a contest: his automated prediction method based on a small amount of quantitative information versus a large number of law professors. They all made predictions for a year of Court decisions. The outcome? The statistical algorithm beat the qualitative experts. If better quantitative information were collected, the algorithm would presumably do even better.
3. In countries without autopsies and death certificates, researchers and governments still need to estimate how many people die of different causes so they can direct public health dollars and so pharmaceutical companies and others can develop appropriate treatments. The way this information was estimated was by finding a sample of deaths and asking bereaved relatives about the symptoms they remember the deceased had before their demise. A colleague and I developed a statistical method recently that takes these data on symptoms and classifies the cause of death. Our statistical approach did considerably better than three physicians operating qualitatively by looking at the list of symptoms.
There are lots of other examples, and there will likely be lots more.
SM: How would you characterize this evolution?
GK: The quantitative social sciences are in the midst of a revolution in understanding the world and solving real problems. A dramatic increase in progress is now achievable because: 1) changes in technology enable us to collect and store unprecedented amounts of far more informative data about human populations and institutions; 2) new policies encourage the collection of data and its provision to researchers, including the computerization and automation of many government services, new data collection requirements, e.g., the No Child Left Behind Act, and the growing movement in science to make data publicly available; and 3) the development of novel methods of data collection and analysis that make it possible for scholars to extract information from new data, such as from the rise of social experiments that enable reliable causal inferences in major issues of public policy, natural language processing that enables scientists to extract information from millions of Web sites, newspapers, emails or other textual sources, and new informatics techniques that provide instant and reliably persistent access to the world's data.
In the previous half-century, social scientists have learned about human populations primarily from cross sections of individuals such as sample surveys every few years, end-of-period government statistics, and in-depth studies of particular places or people. These data collection mechanisms will obviously continue to be used, but they now require substantial methodological innovation as the scientific basis of sampling is becoming increasingly undermined by massive increases in survey non-response and the rise in cell phone usage. In addition, the proliferation of digital information arising from the pervasive spread of computers will likely make the future information base quite different. If we can tackle the substantial privacy issues, the increasing storage and computing needs, and the markedly new statistical methods required, we have the opportunity to move from occasional cross-sectional studies of a small number of randomly selected individuals to continuous-time information from much larger numbers. Even today, people are tracked in continuous time for non-research purposes to learn about our commercial activity through every credit card transaction, geographic location every time we pass through a toll booth with a Fastlane transponder and every moment we carry a cell phone, health through digital medical records and hospital admittances, and other areas. The challenge before us is to lead the collection and analysis of these new data, and to unlock the secrets they hold.
SM: Can we anticipate increasing relevance of developments in social statistics to business intelligence?
GK: "Social science" is a very broad term applying generally to any study with individual people or groups of people as the unit of analysis. Business intelligence seems to fall squarely within this framework. Although all the developments occurring in quantitative social science will not be immediately relevant to BI, there ought to be a lot that can be harvested from quantitative social science methods and techniques, if not applications.
SM: A recent Forbes article detailed a new breed of intervention economists and the work they do determining the efficacy of social programs. Their use of randomized field experiments to evaluate programs is recognized as an optimal learning method for both the social sciences and business. Could you comment? Should businesses use similar approaches, such as randomized marketing campaigns, to evaluate their initiatives and strategies? Could you cite other pertinent experimental examples from your experience in the social science and business world? Is social/business experimentation on the rise?
GK: Randomized experiments would certainly make the short list of the most important and powerful methodological ideas in science in the last century. The development you are referring to involves the application of this idea to study human behavior, the success of public policies and business efforts. The idea of randomness is counterintuitive since it discards information but uniquely useful in helping to produce valid causal inferences.
As one example, Roland Fryer, a faculty member at IQSS, has developed a series of experiments he is now running in 80 schools, in poor, inner-city areas. His theory is that if you give children in these neighborhoods a financial incentive to read books and study, they will respond. The payments, on the order of $5 per book, are trivially small compared to most budget lines in school budgets, but large for individual elementary school children in these areas with no other legal way to make money. If the experiments work as his pilot projects suggest they will, test scores will increase by a substantial amount, and far more than if we spent the same funds in more traditional ways. This kind of work involves cleverly developed research designs, large data collection projects and sophisticated analyses. The result, however, may well be a massively cost-effective way to improve schools.
As another example, at IQSS, we are leading an evaluation of the Mexican universal health insurance program, which is intended to provide health care and financial health protection to the 50 million Mexicans without present access. This is the largest health reform of any country in the world in the last two decades. The evaluation is also large scale and constitutes one of the largest randomized policy experiments to date and what seems to be the largest randomized health policy experiment ever. We are analyzing the data from the first cohort of the experiment now.
Business and other experiments need not be anywhere near this scale of course. Any time you roll out some change in your business procedures that might affect people, you can learn a great deal more by the proper randomized design coupled with proper statistical analysis of the results.
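To make the idea of a small business experiment concrete, here is a minimal sketch in Python. It is an illustration only, not a method described in the interview: the function names (`difference_in_means`, `run_experiment`), the outcome function and the numbers are all hypothetical. It randomly assigns half of the units to a change and reports the difference in average outcomes with a standard error, which is the simplest analysis a randomized design supports.

```python
import math
import random
import statistics


def difference_in_means(treated, control):
    """Return (estimate, standard_error) for mean(treated) - mean(control)."""
    estimate = statistics.mean(treated) - statistics.mean(control)
    # Unequal-variance standard error of a difference between two sample means.
    se = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return estimate, se


def run_experiment(units, outcome_fn, seed=0):
    """Randomly assign half of `units` to treatment and compare outcomes.

    `outcome_fn(unit, is_treated)` is a hypothetical stand-in for whatever
    measurement the business collects after the rollout.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # the random assignment step
    half = len(shuffled) // 2
    treated = [outcome_fn(u, True) for u in shuffled[:half]]
    control = [outcome_fn(u, False) for u in shuffled[half:]]
    return difference_in_means(treated, control)


def simulated_outcome(unit, is_treated):
    """Made-up outcome: a unit-specific baseline plus a true effect of 2.0."""
    return (unit % 10) + (2.0 if is_treated else 0.0)


estimate, se = run_experiment(range(1000), simulated_outcome)
print(f"estimated effect: {estimate:.2f} (+/- {1.96 * se:.2f})")
```

Because assignment is random, the baseline differences between units average out across the two groups, so the estimate should land close to the true effect built into the simulation. Real rollouts bring complications that this simple comparison ignores.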
SM: It seems now that the use of evidence-based methods is de rigueur in the business world as in health care. Because business is the consummate social science, could you articulate a few examples where business can adopt approaches you champion?
GK: This is the key of course. Any time you make a decision that affects a large number of "units," which may include people, groups, clients, sales, items, etc., and there is a relevant and measurable outcome, some statistical procedure can help you learn about how to do it better. You can get a big edge up on the competition by finding areas to quantify that haven't been quantified before, by bringing the best statistical methods to bear on the question and by designing data collection or experiments to make the data most informative.
A fundamental principle is that it is preferable to conquer problems by better data collection than by better statistical methods. In some situations it is possible to design data collection so that some statistical problems do not arise. It is rare, especially when studying human behavior, for all statistical problems to vanish by better experimental design - and indeed the best analyses of the best social experiments tend to require quite sophisticated statistical analyses - but the assumptions necessary to make these work are often far less onerous when the research was designed properly from the outset.
SM: Can you give some examples? Can't we avoid complicated statistics when we control data collection?
GK: Let's take the Mexico evaluation project I mentioned earlier. To simplify, we were able to randomly assign to different areas access to a major health insurance program, as well as money for drugs, health clinics and medical staff. We then measured various outcome variables, such as health status of individuals and whether the money allocated by the government actually made it to the people. So we have a treatment group who got access to the program and a control group which didn't. And thus, you might ask, why can't we estimate the effect of the program by merely taking the average health of those in the treated group and subtracting the average health of those in the control group?
The problem is the real world. So, for example, some people we randomly selected decided not to participate. If these people differed from those who participated, and we didn't adjust for the problem, we could seriously bias our answers. This is the problem of missing data, and sophisticated techniques developed at IQSS and elsewhere exist for correcting problems such missingness can cause if you merely discard those who don't participate.
And what about those who participated in the baseline survey but did not participate in the follow-up? If they didn't participate because they died or were too ill, and more participated in the treatment group because the better stocked health clinics had longer hours and more staff, the bias could be massive. So we corrected for these problems too, each of which involved sophisticated calculations.
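The bias from discarding nonparticipants can be shown with a toy calculation. The sketch below is a deliberately simple stand-in, not the techniques used in the evaluation: the data are made up, the function names are hypothetical, and the correction assumes that within each observed group the people who respond resemble those who do not. That assumption is strong and has to be argued for in any real study.

```python
from collections import defaultdict


def complete_case_mean(records):
    """Average the outcome over respondents only, discarding nonrespondents."""
    observed = [r["outcome"] for r in records if r["outcome"] is not None]
    return sum(observed) / len(observed)


def stratum_weighted_mean(records, stratum_key):
    """Reweight respondents so each stratum counts by its share of the sample.

    Assumes respondents and nonrespondents are alike within a stratum.
    """
    by_stratum = defaultdict(list)
    for r in records:
        by_stratum[r[stratum_key]].append(r)
    total = len(records)
    estimate = 0.0
    for members in by_stratum.values():
        observed = [m["outcome"] for m in members if m["outcome"] is not None]
        if not observed:
            raise ValueError("a stratum has no respondents; cannot reweight")
        estimate += (len(members) / total) * (sum(observed) / len(observed))
    return estimate


# Hypothetical follow-up survey: healthy people all respond, but most ill
# people drop out, so their outcomes are missing (None).
records = (
    [{"group": "healthy", "outcome": 8.0}] * 5
    + [{"group": "ill", "outcome": 4.0}] * 1
    + [{"group": "ill", "outcome": None}] * 4
)

print(complete_case_mean(records))              # overstates average health
print(stratum_weighted_mean(records, "group"))  # restores the ill group's weight
```

In this made-up sample half the people are ill, but they supply only one of the six observed outcomes, so the complete-case average is pulled toward the healthy group. Reweighting gives the ill group back its half share.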
Okay, so suppose we fix those problems. Is that all? Well, what about the rich people in the treated areas who had their own health insurance and weren't interested in a public program, or those who merely chose not to comply with the experimental incentives and so did not affiliate for some other reason? If we ignore this problem, then we could seriously underestimate the effect of the program, since mixed in with those who take our advice and sign up for the insurance program are those who cannot be affected by it because they do not comply with our experiment. So we then need to estimate who are experimental "compliers," and cull out the others, which involves another set of sophisticated procedures.
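A standard textbook way to see why noncompliance dilutes the estimate is the instrumental-variable (Wald) ratio: divide the effect of being offered the program by the difference in take-up between the two groups. The sketch below uses made-up numbers and hypothetical names, and it is not necessarily the procedure used in the Mexico evaluation. It also relies on assumptions that must be defended in practice, chiefly that the offer affects outcomes only through enrollment and that nobody enrolls only when not offered.

```python
def mean(values):
    return sum(values) / len(values)


def complier_average_effect(assigned, enrolled, outcome):
    """Return (intention_to_treat, complier_effect) using the Wald ratio.

    assigned: 1 if the unit was randomly offered the program, else 0
    enrolled: 1 if the unit actually took up the program, else 0
    outcome:  measured outcome for each unit
    """
    y_offered = mean([y for z, y in zip(assigned, outcome) if z == 1])
    y_control = mean([y for z, y in zip(assigned, outcome) if z == 0])
    d_offered = mean([d for z, d in zip(assigned, enrolled) if z == 1])
    d_control = mean([d for z, d in zip(assigned, enrolled) if z == 0])

    itt = y_offered - y_control     # effect of the offer, compliers or not
    uptake = d_offered - d_control  # estimated share of compliers
    if uptake == 0:
        raise ZeroDivisionError("the offer did not change enrollment")
    return itt, itt / uptake


# Hypothetical data: half of those offered the program enroll and gain 2 points;
# the other half already have private insurance and are unaffected.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
enrolled = [1, 1, 0, 0, 0, 0, 0, 0]
outcome = [7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]

itt, complier_effect = complier_average_effect(assigned, enrolled, outcome)
print(itt, complier_effect)
```

With these numbers the raw treated-minus-control comparison is half the size of the effect among those who actually enroll, which is the underestimation described above.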
Okay, so suppose we fix those problems. Is that all? No. One set of outcomes the Mexican government is interested in is what patients in hospitals think about how they were treated. So, you might ask, why can't we merely compare the views of those in the hospital in the treated and control groups? The problem is that the point of the program is to provide access to better health care facilities, and so the types of people who choose to become patients when they are sick differ in the treated and control groups, and we need to control for that. In addition, the only people for whom we can legitimately even ask about the effect of the program are those who would go to the hospital whether they were in our treated or control group. So in this case, we need to estimate the effect only for those who do not change their hospital-going behavior based on the experimental treatment, but how do we know who those people are? The only valid way is proper statistical estimation.
And these are only a few of the many statistical issues. It is true that avoiding statistical problems via better data collection is always better than after-the-fact statistical corrections, but even the best data collection efforts regularly generate problems because of the predictable uncertainties of dealing with human beings.
SM: It sounds hard!
GK: Well, it's not rocket science, but then again even rocket science is not rocket science! What is important to remember is that the benefits of getting the analysis right can often make a huge difference to what you learn and, ultimately, to the academic, public policy or commercial bottom line.