OpenBI Forum Goes to Harvard, Part 2
OpenBI Forum
Information Management Online, November 23, 2006
In the October OpenBI Forum, we had the good fortune of introducing our DM Review audience to Gary King, David Florence Professor of Government and Director of the Institute for Quantitative Social Science (IQSS) at Harvard University, for the first in a series of three interview columns. Gary's charter was to begin educating the readership in the area of quantitative social science (QSS) methods and how they relate to BI.
This column continues to highlight methods developed in the social sciences which, I believe, are very pertinent to day-to-day BI concerns, with an emphasis on Gary's own research. The final column in this series, to come in a winter month, will focus on exciting technology developments at IQSS for the worldwide launch of what will be called the Dataverse Network.
Enjoy and Happy Thanksgiving!
Steve Miller: Missing data is often a thorny problem in BI. Many times, the "sample size" of business data is quite large, but the amount of missing data for pertinent factors is significant as well. You have developed methods for systematically handling missing data. Was there an "aha" moment that caused you to start obsessing over this problem?
Gary King: I remember vividly the first time I stumbled into the problem of missing data; perhaps some of your readers have had a similar experience. I was taking my first statistics course in graduate school. I had learned some techniques for analyzing data from the textbooks and the contrived data sets that came with them, in which the only problems you ever saw were the ones the text said would be there. Of course, the real world is not especially textbook-compliant.
I had set out to understand voter decisions in the 1980 presidential election from a public opinion survey. I got out my trusty copy of SPSS (created by political scientists, by the way), ran some analyses, and the results made no sense. No wonder. When I looked closer, I discovered that the "income" of people who did not wish to reveal their income was recorded as -99999. So I went back to the text and looked for "missing data" and related words in the index. The only thing missing from the book was a discussion of missing data. So I looked at the SPSS manual and it had a procedure called "listwise deletion," which deleted the record of any person who had any missing data. But when I tried this, I had no data left at all! So what's an aspiring statistical analyst to do? I changed "no answer" on opinion questions to "neutral," missing data on income to the median value, etc. But what exactly is the justification for a data analyst claiming he knows the answers of the survey respondents better than they did? Later on, others and I developed more systematic methods for dealing with missing data.
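As a small illustration of the three traps in this story (the sentinel code, listwise deletion, and median fill-in), here is a minimal Python sketch. The five respondents and their field names are invented for the example; only the -99999 code comes from the account above.

```python
import statistics

MISSING = -99999  # sentinel code for "refused to answer", as in the story above

# Hypothetical respondents, invented for illustration only.
respondents = [
    {"income": 42000, "vote": "Reagan", "party": None},
    {"income": MISSING, "vote": "Carter", "party": "D"},
    {"income": 58000, "vote": None, "party": "R"},
    {"income": 31000, "vote": "Anderson", "party": None},
    {"income": MISSING, "vote": "Reagan", "party": "R"},
]

# Trap 1: averaging the raw column treats -99999 as a real income.
naive_mean = statistics.fmean(r["income"] for r in respondents)

# Recode the sentinel as None so that missing values are explicit.
cleaned = [
    {key: (None if value == MISSING else value) for key, value in r.items()}
    for r in respondents
]
known_incomes = [r["income"] for r in cleaned if r["income"] is not None]
observed_mean = statistics.fmean(known_incomes)

# Trap 2: listwise deletion keeps only rows with no missing field at all.
complete_cases = [r for r in cleaned if all(v is not None for v in r.values())]

# Trap 3: filling every gap with the median makes incomes look less varied.
median_income = statistics.median(known_incomes)
median_filled = [
    median_income if r["income"] is None else r["income"] for r in cleaned
]
spread_known = statistics.pstdev(known_incomes)
spread_filled = statistics.pstdev(median_filled)

print(f"mean with sentinel left in: {naive_mean:.1f}")
print(f"mean of reported incomes:   {observed_mean:.1f}")
print(f"rows left after listwise deletion: {len(complete_cases)}")
print(f"income spread, reported vs. median-filled: {spread_known:.0f} vs. {spread_filled:.0f}")
```

In this toy file every respondent skipped at least one question, so listwise deletion leaves nothing to analyze, and the median fill-in quietly shrinks the spread of incomes.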
SM: Why is missing data so important? Is the juice of obsessing over missing data worth the squeeze for business?
GK: Missing data affects almost every real data analysis project, and all the simple fixes you might think of to deal with it, many of which I tried originally, can horribly bias your results. Suppose you want to know the average income of your clients, and upper-income people are less likely to tell you their income. If you don't know this and don't deal with it properly, you can massively underestimate average income.
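The income example can be checked with a short simulation. This is a sketch under invented assumptions: the income distribution, the sample size, and the nonresponse rates (60 percent above the median, 10 percent below) are all made up to show the direction of the bias, not its real-world size.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical client incomes drawn from a right-skewed distribution.
true_incomes = [rng.lognormvariate(10.8, 0.6) for _ in range(10_000)]
cutoff = statistics.median(true_incomes)

def reports_income(income: float) -> bool:
    """Assumed behavior: upper-income clients withhold income more often."""
    p_missing = 0.6 if income > cutoff else 0.1
    return rng.random() >= p_missing

reported = [income for income in true_incomes if reports_income(income)]

true_mean = statistics.fmean(true_incomes)
reported_mean = statistics.fmean(reported)

print(f"clients who reported income: {len(reported)} of {len(true_incomes)}")
print(f"true mean income:     {true_mean:,.0f}")
print(f"mean of reported:     {reported_mean:,.0f}")
```

Averaging only the reported values gives a mean below the true one, because the clients most likely to be missing are the ones with the highest incomes.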
Even if missing values are sprinkled in an unbiased way throughout your data, just deleting those observations is enormously wasteful of expensive information. We calculated that, for scholarly research articles, the common practice of "listwise deletion" is equivalent to discarding about half the available information. In fact, we found that the best analysts, who were most worried about other problems in data analysis, such as the bias due to omitted variables, controlled for more variables and wound up with even less information. The fact is that researchers just did not have proper methods of dealing with missing data.
If this is the same in your business - and the situation is similar in most fields - using better methods will produce as much new information as would doubling your entire data collection budget!
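A bit of arithmetic shows why controlling for more variables costs more data under listwise deletion. The sketch below assumes, purely for illustration, that each variable is missing independently for 10 percent of records; it is not the calculation behind the "about half" figure quoted above.

```python
def complete_case_fraction(missing_rate: float, n_variables: int) -> float:
    """Expected share of rows with no missing value, assuming each of
    n_variables is missing independently with probability missing_rate."""
    return (1.0 - missing_rate) ** n_variables

# Assumed 10% missingness per variable; vary how many variables the model uses.
for k in (1, 3, 5, 7, 10):
    share = complete_case_fraction(0.10, k)
    print(f"{k:2d} variables -> {share:.0%} of rows survive listwise deletion")
```

Under these assumptions a model with seven variables already discards more than half of the rows, and each added control variable discards more.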
SM: Could you summarize some of those methods for us? Do you feel your findings are appropriate for BI [business intelligence]?
GK: The problem with missing data is that dealing with it properly has in the past involved highly specialized and technical methods that are theoretically appropriate, but few could use them and fewer did. So my students and I found a sophisticated method that is easy to use in theory but difficult to use in practice. We then derived a new algorithm that made it easy and fast to use in practice. The core of the idea is to preprocess your data in just the right way, so that after preprocessing you can use whatever statistical method you would have used if you had no missing data. No information is lost, no bias is introduced, and no data are made up. And, with our algorithm, it is easy to do this preprocessing, which means that many more people can avoid the biases of missing data and can harvest a considerable quantity of information they were previously discarding.
In fact, there's a free, open source software package we developed to implement this preprocessing for missing data. It's called Amelia (named after that famous missing person) and is available at my Web site. If you're interested in this topic, I did an interview about the first article we wrote on the subject that's still available (see Emerging Research Fronts). Since the first article, we have worked to develop faster, easier, and more powerful approaches, which now cover a wider range of data types and sizes. We've updated the same software to implement all these new methods too.
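The "preprocess, then analyze as usual" idea is generally known as multiple imputation: create several completed copies of the data, run the ordinary analysis on each, and combine the results. The Python sketch below shows that workflow on invented data. It is not Amelia's algorithm and does not use Amelia's interface; the regression-with-bootstrap imputation step, the function names, and the simulated records are all assumptions chosen to keep the example short.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual standard deviation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5

def impute_once(rows, rng):
    """Return one completed copy of rows; each missing y is a random draw."""
    observed = [(x, y) for x, y in rows if y is not None]
    # Refit on a bootstrap resample so each copy reflects model uncertainty.
    boot = [rng.choice(observed) for _ in observed]
    a, b, sd = fit_line([x for x, _ in boot], [y for _, y in boot])
    return [(x, y if y is not None else a + b * x + rng.gauss(0, sd))
            for x, y in rows]

def pool(estimates, variances):
    """Rubin's rules: combine m point estimates and their variances."""
    m = len(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    return statistics.fmean(estimates), within + (1 + 1 / m) * between

# Hypothetical data: y depends on x, and y is missing more often when x is high.
rng = random.Random(1)
full = []
for _ in range(2000):
    x = rng.gauss(50, 10)
    full.append((x, 2 * x + rng.gauss(0, 5)))
rows = [(x, None if rng.random() < (0.7 if x > 55 else 0.1) else y)
        for x, y in full]

true_mean = statistics.fmean(y for _, y in full)
complete_case_mean = statistics.fmean(y for _, y in rows if y is not None)

# Analyze each of m = 5 completed data sets with the ordinary method (a mean).
estimates, variances = [], []
for _ in range(5):
    ys = [y for _, y in impute_once(rows, rng)]
    estimates.append(statistics.fmean(ys))
    variances.append(statistics.variance(ys) / len(ys))
pooled_mean, pooled_var = pool(estimates, variances)

print(f"true mean of y:         {true_mean:.2f}")
print(f"complete-case mean:     {complete_case_mean:.2f}")
print(f"pooled imputation mean: {pooled_mean:.2f} (std. error {pooled_var ** 0.5:.2f})")
```

The design point is that the analysis step stays whatever you would have run on complete data; only the preprocessing and the final pooling are new, and the pooled variance carries the extra uncertainty due to the missing values.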