In the October OpenBI Forum, we had the good fortune of introducing our DM Review audience to Gary King, David Florence Professor of Government and Director of the Institute for Quantitative Social Science (IQSS) at Harvard University, for the first in a series of three interview columns. Gary's charter was to begin educating the readership in the area of quantitative social science (QSS) methods and how they relate to BI.

This column continues to highlight methods developed in the social sciences that, I believe, are very pertinent to day-to-day BI concerns, with an emphasis on Gary's own research. The final column in this series, to come later this winter, will focus on exciting technology developments at IQSS leading to the worldwide launch of what will be called the Dataverse Network.

Enjoy and Happy Thanksgiving!

Steve Miller: Missing data is often a thorny problem in BI. Many times, the "sample size" of business data is quite large, but the amount of missing data for pertinent factors is significant as well. You have developed methods for systematically handling missing data. Was there an "aha" moment that caused you to start obsessing over this problem?

Gary King: I remember vividly the first time I stumbled into the problem of missing data; perhaps some of your readers have had a similar experience. I was taking my first statistics course in graduate school. I had learned some techniques for analyzing data from the textbooks, and the contrived data sets that came with them - for which the only problems you ever saw were the ones the text said would be there. Of course, the real world is not especially textbook-compliant.

I had set out to understand voter decisions in the 1980 presidential election from a public opinion survey. I got out my trusty copy of SPSS (created by political scientists, by the way), ran some analyses and the results made no sense. No wonder. When I looked closer, I discovered that the "income" of people who did not wish to reveal their income was recorded as -99999. So I went back to the text and looked for "missing data" and related words in the index. The only thing missing from the book was a discussion of missing data. So I looked at the SPSS manual and it had a procedure called "listwise deletion," which deleted the entire record for any person with any missing data. But when I tried this, I had no data left at all! So what's an aspiring statistical analyst to do? I changed "no answer" on opinion questions to "neutral," missing data on income to the median value, etc. But what exactly is the justification for a data analyst claiming he knows the answers of the survey respondents better than they did? Later on, I and others developed more systematic methods for dealing with missing data.
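To make the situation Gary describes concrete, here is a minimal base-R sketch. The tiny data frame, its column names and the -99999 sentinel are invented for illustration; the point is only to show what recoding, listwise deletion and the ad hoc median fix actually do to a data set.

```r
# Hypothetical survey data; variable names and values are assumptions.
survey <- data.frame(
  vote   = c(1, 0, 1, NA, 0, 1),
  income = c(42000, -99999, 57000, 31000, -99999, 88000),
  party  = c("D", "R", NA, "D", "R", "R")
)

# Step 1: the sentinel code for "refused" must be recoded as missing,
# or it silently distorts every analysis that touches income.
survey$income[survey$income == -99999] <- NA

# Listwise deletion: keep only respondents with no missing values at all.
complete <- survey[complete.cases(survey), ]
nrow(survey)    # respondents collected
nrow(complete)  # far fewer survive once every variable must be observed

# The ad hoc fix Gary tried as a student: fill income with the median.
# Shown only to illustrate the temptation, not as a recommendation.
imputed <- survey
imputed$income[is.na(imputed$income)] <- median(survey$income, na.rm = TRUE)
```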

SM: Why is missing data so important? Is the juice of obsessing over missing data worth the squeeze for business?

GK: Missing data affects almost every real data analysis project, and all the simple fixes you might think of to deal with it, many of which I tried originally, can horribly bias your results. Suppose you want to know the average income of your clients and upper income people are less likely to tell you their income. If you don't know this, and don't deal with it properly, you can massively underestimate your clients' average income.

In fact, even if missing values are sprinkled in an unbiased way throughout your data, just deleting those observations is enormously wasteful of expensive information. We calculated that, for scholarly research articles, the common practice of "listwise deletion" is equivalent to discarding about half the available information. In fact, we found that the best analysts, who were most worried about other problems in data analysis, such as the bias due to omitted variables, controlled for more variables and wound up with even less information. The fact is that researchers just did not have proper methods of dealing with missing data.

If this is the same in your business - and the situation is similar in most fields - using better methods will produce as much new information as would doubling your entire data collection budget!

SM: Could you summarize some of those methods for us? Do you feel your findings are appropriate for BI [business intelligence]?

GK: The problem with missing data is that dealing with it properly has in the past involved highly specialized and technical methods that are theoretically appropriate but that few could use and fewer did. So my students and I started from a sophisticated method that is easy to use in theory but had been difficult to use in practice. We then derived a new algorithm that made it easy and fast to use in practice. The core of the idea is to preprocess your data in just the right way, so that after preprocessing you can use whatever statistical method you would have used if you had no missing data. No information is lost, no bias is introduced, and no data are made up. And, with our algorithm, it is easy to do this preprocessing, which means that many more people can avoid the biases of missing data and harvest a considerable quantity of information they were previously discarding.

In fact, there's a free, open source software package we developed to implement this preprocessing for missing data. It's called Amelia (named after that famous missing person) and is available at my Web site. If you're interested in this topic, I did an interview about the first article we wrote on the subject that's still available (see Emerging Research Fronts). Since that first article, we have worked to develop faster, easier and more powerful approaches, which now cover a wider range of data types and sizes. We've updated the same software to implement all these new methods too.
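For readers who want to try this, here is a hedged sketch of the preprocess-then-analyze workflow using the Amelia package Gary mentions. The "clients" data frame and the regression are invented for illustration; the call pattern is offered as an example of the idea, not as a definitive guide to the package's interface.

```r
library(Amelia)

# Simulated client data with 30 percent of incomes missing (all made up).
set.seed(10)
clients <- data.frame(
  spend  = rnorm(200, 500, 100),
  income = rnorm(200, 60000, 15000),
  age    = round(runif(200, 25, 70))
)
clients$income[sample(200, 60)] <- NA

# Step 1: preprocess. Amelia fills in each missing cell with plausible
# values, producing m completed copies of the data set.
a.out <- amelia(clients, m = 5)

# Step 2: run the analysis you would have run anyway, once per copy.
fits <- lapply(a.out$imputations,
               function(d) lm(spend ~ income + age, data = d))

# Step 3: combine. Point estimates are averaged across the m analyses;
# a full treatment would also combine the standard errors (Rubin's rules).
rowMeans(sapply(fits, coef))
```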

SM: Ecological inference has a storied history in the social sciences. Could you briefly describe ecological inference/correlation and the problems it causes for behavioral research? You won a prestigious award for methods you developed to untangle ecological inference. Could you give us a layman's description of your approach and how it might be used in marketing or other business areas?

GK: Ecological inference is the problem of learning about individual behavior from information about groups. As it turns out, this problem affects research in numerous fields of inquiry and many areas outside of academia. For example, suppose you want to know who bought refrigerators from your company so you can design a marketing campaign. However, the only information you have is how many of your company's refrigerators were sold in each ZIP code +4 area.

So you gather information on average income in these areas and find that in areas with more income, more people bought your refrigerators. It seems obvious then that rich people are doing the buying and you should target high-end magazines and pay for direct mail and expensive catalogs to the largest residences in each area. However, suppose in fact it's the poor people who happen to live in areas with many rich people who buy most of the refrigerators. For example, you could easily imagine that when the economy does better, wealthy people who already have refrigerators buy other things, whereas poor people show up at your store.

So a naive analysis of ZIP code level information suggests you target rich people, whereas the individuals are doing exactly the opposite. This is what is known as the "ecological fallacy," and it is what the methods I developed attempt to get around. Since information is lost in the process of aggregating individuals into groups, collecting individual-level information is still far better, but it is often the case that group-level information, when analyzed with these methods, can extract useful individual-level information.
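The reversal Gary describes can be seen in a few lines of simulation. Everything below - the number of areas, the purchase probabilities, the assumption that poor residents of wealthier areas buy more - is invented purely to show how the aggregate pattern can point in the opposite direction from the individual one.

```r
set.seed(1)
n_areas  <- 200
n_people <- 500                               # people per area
w <- runif(n_areas, 0.2, 0.5)                 # fraction wealthy in each area

area_rate <- numeric(n_areas)
buy_rich <- buy_poor <- 0
n_rich <- n_poor <- 0
for (a in seq_len(n_areas)) {
  rich <- rbinom(n_people, 1, w[a])           # who is wealthy in this area
  # Wealthy people rarely buy; poor people buy more often, and more so
  # in wealthier areas (where, say, your stores happen to be).
  p_buy <- ifelse(rich == 1, 0.02, 0.15 * w[a])
  buy <- rbinom(n_people, 1, p_buy)
  area_rate[a] <- mean(buy)
  buy_rich <- buy_rich + sum(buy[rich == 1]); n_rich <- n_rich + sum(rich == 1)
  buy_poor <- buy_poor + sum(buy[rich == 0]); n_poor <- n_poor + sum(rich == 0)
}

cor(w, area_rate)        # positive: richer areas buy more refrigerators
buy_rich / n_rich        # yet individual wealthy buyers are rare
buy_poor / n_poor        # and individual poor buyers are more common
```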

SM: But how can you learn about individuals from groups? What information is left in the group-level data?

GK: Some of the information is indeed obliterated, but some still exists. For example, if you find a geographic area that has no poor people, and you have data on refrigerator purchases, then you have exact information on at least this one group of wealthy people. Most areas are mixed of course, but this simple example gives you the sense that some information remains in the aggregate data. So if 90 percent of the people in the area are wealthy, you can't pin down the exact number of units purchased by the wealthy, but you can narrow the possibilities. This information, combined with a more general statistical approach across the different areas, provides more informative hints about individual behavior than either would alone.
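Here is that bounds logic in a few lines of base R, with made-up numbers. It is only the deterministic first step; the statistical machinery that combines such bounds across many areas is where the method does the rest of the work.

```r
n_people <- 1000        # people in the area (illustrative)
p_rich   <- 0.90        # 90 percent wealthy
sold     <- 120         # refrigerators sold in the area (at most one each)

n_rich <- p_rich * n_people          # 900 wealthy residents
n_poor <- n_people - n_rich          # 100 poor residents

# Purchases by the wealthy cannot exceed total sales or the wealthy
# population, and cannot be fewer than the sales the poor could not
# possibly account for.
upper <- min(sold, n_rich)           # 120
lower <- max(0, sold - n_poor)       # 20
c(lower = lower, upper = upper)
```

So even without individual data, between 20 and 120 of the 120 units in this area went to wealthy households; combining such intervals across many areas with a statistical model narrows things further.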

SM: The business world distinguishes between exploratory and confirmatory analytics. Is that an important distinction for you? Are approaches for either progressing more rapidly?

GK: They go hand in hand, in BI and outside the business world. If you dropped the word "business" from your first sentence, it would still hold. We need techniques that help us ask questions and others that help us validate the answers to questions we feel like asking. These exist in numerous areas of statistical research.

SM: Can you give an example of each?

GK: I'll give you an example of each from two related but separate research projects in progress at the Institute for Quantitative Social Science. First exploratory, then confirmatory.

A faculty associate of IQSS, Kevin Quinn, and some colleagues elsewhere, developed a way of taking the texts of congressional speeches and automatically classifying them by subject area. This was exploratory because their method did not specify ahead of time what the subject categories were. Another way of putting it is that the method simultaneously classified speeches into subject matter categories and came up with the categories.

Now the tremendous advantage of this method is that you pop the text in and out comes the answer. And in this case, it was quite interesting. The categories in many cases were those you might have chosen ahead of time: civil rights, foreign policy, immigration, social security, etc. Of course this advantage is also a disadvantage, since if you had a different categorization scheme that you wanted to learn about - length of speeches, eloquence, whether the speech is favorable to the president's policies, etc. - you're out of luck.

You could also use exactly the same method for an application that might be closer to home: suppose you had 50,000 comments from customers on your products, say from your Web site. Reading these would not be fun! But if you could distill them down into a reasonable set of categories and then see how the frequency of concerns in each of these categories varied over time, you might learn a lot, in considerably less time.
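To give a flavor of the exploratory idea in code: the toy sketch below clusters a handful of invented customer comments using a crude bag-of-words representation and k-means. It is not Quinn and colleagues' model - their approach is far more sophisticated - but it shows how categories can emerge from the text rather than being specified in advance.

```r
comments <- c("the door handle broke after a week",
              "handle snapped off the door",
              "delivery arrived two days late",
              "late delivery and no phone call",
              "great price and quiet motor",
              "very quiet and a good price")

# Build a simple document-term matrix with base R.
tokens <- strsplit(tolower(comments), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(w) as.numeric(vocab %in% w)))
colnames(dtm) <- vocab

# Ask for three clusters; the analyst names the categories only afterward.
set.seed(2)
groups <- kmeans(dtm, centers = 3)$cluster
split(comments, groups)
```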

SM: And how about an example of confirmatory research?

GK: At the other end of the continuum is some ongoing research I'm involved in with IQSS graduate student associate Dan Hopkins and a team of undergraduates. We are also working on automatically classifying text by computer, but in a confirmatory rather than exploratory way. Instead of congressional speeches, or email comments on your Web site, our application is classifying all English language blogs according to each blogger's opinion about a political figure, commercial product, drug, or any other topic we might choose. Currently, we classify the opinion a blogger expresses in a blog post on a scale from -2 (strongly negative) to +2 (strongly positive). Our results thus far indicate that we can estimate the distribution of all blogs across these categories very accurately (to within sampling error).

The great advantage of our method is that the subject of the blogs and the categorization scheme we use are entirely up to us. So we don't have to worry about an exploratory method choosing a categorization scheme we don't care about. We ask the questions, and our method and software give the answers to our questions. But this advantage also comes with its own cost: for our method to work, we need to classify some small number of blogs by hand into the given categories. These examples are then used by our statistical method to classify the many others that come along.
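The workflow can be sketched in a few lines, with the strong caveat that the toy nearest-centroid classifier below merely stands in for the Hopkins-King method, which is designed to estimate the distribution of documents across categories directly rather than to label each one. The example documents and the -2/0/+2 coding are invented.

```r
docs <- c("this product is terrible and broke",          # hand-coded: -2
          "awful, terrible experience, broke twice",     # hand-coded: -2
          "works fine, nothing special",                 # hand-coded:  0
          "it works and does the job",                   # hand-coded:  0
          "love it, excellent and works great",          # hand-coded: +2
          "excellent product, love the design",          # hand-coded: +2
          "broke on arrival, terrible",                  # unlabeled
          "excellent, works great, love it")             # unlabeled
labels <- c(-2, -2, 0, 0, 2, 2, NA, NA)

tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))
dtm    <- t(sapply(tokens, function(w) as.numeric(vocab %in% w)))

# Average word profile of each hand-coded category.
train <- !is.na(labels)
centroids <- sapply(split(seq_len(nrow(dtm))[train], labels[train]),
                    function(i) colMeans(dtm[i, , drop = FALSE]))

# Assign each unlabeled document to the nearest hand-coded category.
dist_to <- function(x) colSums((centroids - x)^2)
apply(dtm[!train, ], 1, function(x) names(which.min(dist_to(x))))
```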

If we were willing to give up the choice of which question to ask, the exploratory method would be valuable. And if a user of the exploratory method were willing to hand code their text documents, they could use our confirmatory method to answer their chosen questions. In practice, of course, having both methods to apply to a problem can be more valuable than either separately.

SM: A fascinating area of your research has focused on the statistical handling of rare events. Might rare event handling be pertinent to business in areas such as risk management?

GK: Most decision-makers, like most people, plan first for ordinary times, standard processes, and common patterns. And no wonder. Given how stable most patterns are that we see in the world, we can forecast many phenomena by assuming no change at all. Look out the window right now and you have a reasonably good forecast of tomorrow's weather. Beating the same "no change" forecast for stock prices is not easy either.

Extraordinary events are a different story. Their very rareness means that, by definition, we have much less information about them: what causes them, how to forecast them and what might affect them. Planning only for the ordinary is, of course, a mistake, and good management practice is to have regular disaster planning, fault tolerance and fail-safe procedures. Creating these procedures is difficult, but we know that we should prepare for rare events and treat them differently.

Almost the same logic holds for statistical analyses: Rare events are different and so need to be analyzed differently. This is true whether we are finding out what causes them, predicting them, or asking what the world would be like if they occurred more or less frequently.

SM: Could you tell us a bit about your methods for analyzing rare events? How much information can they help us uncover?

GK: The benefits of using methods tuned to the problems of rare events can be extraordinary. For example, suppose only about one of every 10,000 of the engine parts you make is faulty, but the faulty one costs you a great deal. So how do we learn what factors about your manufacturing process might reduce the occurrence of these events?

One possibility is to do a classical experiment, where you assign one of two manufacturing processes to each engine part you make. Then you run the experiment for a long time and see which one produces fewer flawed parts. This is a perfectly reasonable procedure of course, but it will take a considerable amount of time (and faulty parts) until you can ascertain which procedure was better. If you make 20,000 parts, you'll likely only see two with a problem, which is far too few to ascertain which manufacturing process is better.

So, as an alternative, you could do what is known as a "case-control study," which involves collecting information on a sample of available faulty parts and a specially matched sample of nonfaulty parts. Then, if you correct the results using statistical procedures designed for this task, you can get answers as accurate as if you had collected all the data, while saving as much as 99 percent or more of your data collection costs. And of course you can also learn the answer in considerably less time and make fewer flawed parts.
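Here is a hedged simulation of that design for the faulty-parts example. The data, the process variable and the effect size are all invented; the intercept adjustment shown is the standard prior correction for case-control sampling, and the relogit software Gary mentions below is built to handle such corrections properly.

```r
set.seed(3)
N   <- 1e6                                   # parts produced
tau <- 1/10000                               # true population fault rate
temp  <- rnorm(N)                            # a hypothetical process variable
fault <- rbinom(N, 1, plogis(qlogis(tau) + 0.8 * temp))

# Case-control sample: keep every faulty part plus a similar number of
# randomly chosen good parts, instead of inspecting all one million.
cases    <- which(fault == 1)
controls <- sample(which(fault == 0), length(cases))
samp     <- c(cases, controls)

fit  <- glm(fault[samp] ~ temp[samp], family = binomial)
ybar <- mean(fault[samp])                    # fraction of faults in the sample

# The slope is fine as-is; the intercept must be shifted back toward the
# population fault rate using the known sampling fractions.
b0_corrected <- coef(fit)[1] - log(((1 - tau) / tau) * (ybar / (1 - ybar)))
c(slope = unname(coef(fit)[2]), intercept = unname(b0_corrected))
```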

And that's a rare events data collection procedure. There are also rare events data analysis procedures that work like regular data analysis procedures, only far better with these types of data. A colleague and I developed several of these procedures, and they have been widely used in many fields. They enable you to do the same type of analysis as you would have without rare events, only the estimates can be considerably more accurate - less biased and equivalent to having much more data available.

We have implemented the results from our research articles in a software package available at my Web site: relogit, for rare events logit analysis. Relogit is also available in a general purpose statistics package we also distribute called Zelig.
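For completeness, a minimal usage sketch of that software: the classic Zelig call pattern for a rare events logit looks roughly like the following. The "parts" data frame is simulated purely for illustration, and argument details may differ across Zelig versions.

```r
library(Zelig)

# Simulated manufacturing data with a very low fault rate (all made up).
set.seed(4)
parts <- data.frame(temp = rnorm(5000), pressure = rnorm(5000))
parts$fault <- rbinom(5000, 1, plogis(-6 + 0.5 * parts$temp))

# Rare events logit via the zelig() interface.
z.out <- zelig(fault ~ temp + pressure, model = "relogit", data = parts)
summary(z.out)
```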

SM: You mentioned a development in survey research in our previous interview. Businesses use surveys a great deal. Can you tell us more?

GK: Surveys seem easy to do. You write some questions, knock on doors or call people on the phone, ask the questions, write down the answers, and you're done. Right? Not if you want accurate answers. Surveys seem easy, but doing them right, in ways that do not mislead, is very, very difficult.

Think about it this way. How often have you had massive miscommunications with your spouse? Did those miscommunications go away after a year? After 20 years? Do they ever go away? And yet somehow we all live under the illusion that when we go to work and do surveys, we can walk into the house of someone we have never met before and in 20 minutes ask a few innocent questions and expect to be understood exactly.

In the project you're referring to, my colleagues and I tackled one aspect of this problem, known as "differential item functioning" or DIF - the problem of different people understanding the same question in different ways.

Suppose I asked you "How healthy are you? Excellent, good, fair or poor." If you're a 40-year-old in excellent health, but I happened to ask you on a day you had a backache and a bad cold, you might say "fair." But if I visited a frail, 90-year-old man by his bedside and asked him the same question, he might say "excellent" (as all his friends are dead). And I didn't just make up this survey question: it is one of the primary ways public health scholars measure the health of populations around the world.

This is the problem of DIF. It's the same problem that occurs whenever people use the response categories in different ways: so when your dour friend says "fair," she may mean the same thing as when your cheery, optimistic friend says "good."

We tackled this problem by describing a hypothetical person to the survey respondents and asking what the respondent thought of them. So if we wanted to measure mobility we would describe (say) Allison: "Allison can walk up a flight of stairs, but when she gets to the top, she is breathless and has to sit down." Then we ask respondents "How much trouble do you have getting around your house?" and also "How much trouble does Allison have getting around her house?" The advantage of asking questions about hypothetical people is that the only reason for systematic variability in answers to questions about Allison is DIF, since Allison has the identical level of mobility no matter who we ask (we know, because we created Allison!).

Then, we can compare answers of different survey respondents, not to each other, but to the fixed standard represented by Allison. If someone asks us about our fitness and you say "good" and I say "fair," but I rate myself higher than Allison and you rate yourself lower than Allison, then despite our uncorrected answers I'm probably more fit than you are. This simple idea can fix a lot of the DIF problem.
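The nonparametric version of that comparison is simple enough to show in a few lines of base R. The responses and the 1-to-5 coding (1 = no trouble, 5 = extreme trouble) are invented for illustration.

```r
self    <- c(2, 3, 4, 2)   # "How much trouble do YOU have getting around?"
allison <- c(3, 3, 2, 4)   # "How much trouble does ALLISON have?"

# Recode each self-assessment relative to the respondent's own rating of
# Allison: 1 = reports less trouble than Allison, 2 = same, 3 = more.
# Because Allison's true mobility is fixed, this relative scale is
# comparable across respondents even when their raw categories are not.
relative <- ifelse(self < allison, 1, ifelse(self == allison, 2, 3))
relative
```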

Writing these questions requires a considerable amount of expertise, but even leaving aside this issue, these "anchoring vignettes," as they are called, add expense to surveys since, for every self-assessment, we have to add several questions about hypothetical individuals. So we developed a sophisticated statistical method that enabled us to make the same comparisons, and extract the same information, even if the vignettes were asked of only a small subsample of respondents. (See our anchoring vignettes Web site for more information.)

SM: You've mentioned several open source [OS] software projects of yours. The open source movement is gaining increasing prominence in the technology world. What are your thoughts? Is this something new? Is OS important to quantitative analysis?

GK: The open source movement is modeled on how science works. It is perhaps not widely appreciated, but science is not only about being scientific. A few hundred years ago, when scholars lived in monasteries working alone, many fooled themselves for their entire lives into thinking they accomplished great things. Science requires a community of people competing and cooperating to pursue common goals. With this community checking up on each other, dramatic progress is possible. The tremendous advances across the fields of science and social science have come from intense interactions, from checking each other's work, and from individuals building on the work of others.

The open source movement similarly helps foster these communities of individuals competing and yet pursuing common goals. And when the community is sufficiently on the same page, that progress can be dramatic.

For example, statistical researchers have benefited tremendously from the developments around the open source R statistics package. It enables people in a large variety of diverse fields to communicate, compete and cooperate in ways they were only rarely able to before. It has created a common market of statistical analysis, where communication is much easier and getting it right happens much faster. I expect this area to experience dramatic growth as it finds its way into new areas of social science and beyond.

SM: The use of stats and analytics is pervasive in sports today. Baseball's Boston Red Sox and Oakland A's have developed reputations for "quants" leadership. Indeed the field of sabermetrics, defined loosely as the systematic analysis of baseball data, has many quantitative social scientists as active leaders. What are your thoughts on this? Since sabermetrics is used as a foundation for evidence-based decision-making in baseball, are there lessons to be learned for other types of businesses?

GK: This is related to a conversation we had during the last interview. Any time it is possible to quantify the important features of a decision problem, a good statistical analyst can typically help you make far better decisions than coaches, pundits, experts, fans or color commentators ever could. Baseball is a great example of this phenomenon because it seems so surprising and so far from statistics. After all, we all watch sporting events, we all have opinions about what goes on, and it often seems completely obvious what the right decision is. But even experts like us can be beaten by good statistical algorithms.

So what I'd do if I were you is to think about what decisions you make, or your people make, on a regular basis. I'd identify which of those decision processes are based on available quantitative data, or for which more data could be useful.

Then, ask how these decisions are being made. In the past, you might have improved the decisions by putting a better person in charge, getting a committee to agree or making the decisions yourself.

These are obviously still good practices, but we know how to do even better today. Just collect the best data, invest some time and effort into getting some good quantitative analysts, and you ought to be able to outperform even the best experts you have. If you're not sure, set up a head-to-head contest with a known outcome, such as forecasting some result, and see for yourself. If you have enough data, and you have quantified enough of the information in the problem, you are likely to do better than your experts did previously.

SM: There seems to be a flurry of hiring of high-powered quantitative social scientists by Google, Yahoo! and Microsoft to help develop new auction methods and approaches for maximizing profits as the companies compete for Web advertising dollars. Is that a good example of companies taking advantage of what IQSS has to offer?

GK: Yes, this is a well-defined example, with a profit potential that has become obvious, and with some remarkable demonstrated advances. New auction procedures developed by economists and others have produced new markets, more efficiency overall and more profits for those involved.

This is one powerful example, but its power does not only come from special features of auctions or advertising. It comes from the fact that the idea had not been implemented successfully before or, in some cases, had not existed at all. So sure, these companies are trying to hire people to improve their auction methods even further, but I bet they are also building their general human capital in expertise in quantitative analysis, so that when the next idea comes along, perhaps entirely unrelated to auctions or advertising, they will be ready.

SM: A good BI analyst brings a variety of skills to the job, including: strong analytical capabilities; business, economic, organizational and behavioral knowledge; facility with data management, statistical analysis and visualization of information; programming, database and technical computer package usage skills; collaborative ability to bridge business and technology; and strong communication skills - verbal, written and presentation. Could you sell us on why we would want to recruit Harvard's IQSS or other quantitative social science graduates to work in business intelligence?

GK: Harvard IQSS students, both undergraduate and graduate, would appear from your criteria to be absolutely ideal candidates for BI analysts. They have strong, broad-based social science backgrounds that are pertinent to business (after all, business as you say is the ultimate social science), have outstanding communication skills, are adept with data and statistical analysis, and have strong skills with both canned packages and original software programming.

The data and methodological problems that BI analysts face, so far as I understand them, are highly diverse and would benefit from methodologists who operate in many areas but specialize in particular types of data or data collection processes. For example, there is a large group of biostatisticians who do nothing but develop methods for randomized clinical trials of drugs. There are financial economists who focus entirely on stock returns (and, as they say, "spin the CRSP tapes").

However, quantitative social scientists in general, and especially those in some subfields such as political methodology, tend to have to deal with a much broader range of methodological problems and data types. This diversity in methodological approach would seem ideal for the ever-changing challenges of BI. You really want someone to be ready for the next challenge, to hunt around your business processes, find new forms of quantifiable data, and to systematize and analyze them as fast as possible.

SM: Could we just do more analysis ourselves?

GK: You could, and perhaps you should. It is a strange thing about areas like quantitative analysis that it is possible to do it a little and make some progress on your own, without much expertise. That sometimes makes nonspecialists think they have little need for specialists. In contrast, heart surgeons don't seem to find many people with chest pain unbuttoning their shirts and asking someone to pass a dinner knife. Although a good quantitative analyst ought to be able to explain what he or she is doing in intuitive terms, quantitative analysis is a highly specialized, extremely technical discipline, with many separate but interconnected collections of the best researchers, all advancing methods and producing software that affects us all.

One way to think about it is in terms of productivity. One secretary can be maybe two or three times as productive as another. A statistical programmer can easily be 50 times more productive than another: you would literally rather have that one first-rate statistical programmer than 50 average ones. But one highly trained statistical methodologist could plausibly be more productive than an infinite number of nonspecialists, since that person can do things that nonspecialists could never accomplish. Moreover, however well you pay new analysts, their cost compared to new data systems would be trivial.