Here are the basics. QDA is the process of coding segments of free-form text with predefined categories. The segments can be single words, phrases, sentences or entire paragraphs. Coded segments can overlap as well. For example, consider a customer comment on a digital camera such as, "To avoid red eye, remove the flash - but this may cause your subject's eyes to dilate." The phrase "to avoid red eye, remove the flash" in this sentence may be coded as "flash problem," while the entire sentence may be coded as "red eye."
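To make the idea of overlapping coded segments concrete, here is a minimal sketch in Python. It assumes segments are stored as character offsets into the comment; the `CodedSegment` class and the `overlaps` and `codes_at` helpers are hypothetical names used for illustration, not part of any QDA product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedSegment:
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset into the text
    code: str   # category assigned by the analyst


def overlaps(a, b):
    """Return True when two segments share at least one character."""
    return a.start < b.end and b.start < a.end


def codes_at(offset, segments):
    """Return the sorted codes of every segment covering a character offset."""
    return sorted(s.code for s in segments if s.start <= offset < s.end)


comment = ("To avoid red eye, remove the flash - but this may cause "
           "your subject's eyes to dilate.")
phrase = "To avoid red eye, remove the flash"

segments = [
    CodedSegment(0, len(phrase), "flash problem"),  # the opening phrase
    CodedSegment(0, len(comment), "red eye"),       # the entire sentence
]

print(overlaps(segments[0], segments[1]))
print(codes_at(5, segments))
```

Storing offsets rather than copies of the text lets one stretch of a comment carry several codes at once, which is what allows overlapping segments.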
Once segments are coded, they can be analyzed in a variety of ways, using clustering, dendrograms, thematic maps and proximity plots. Analysis can include metadata and related structured attributes as well as coded segments. Clustering groups similar documents and segments. Dendrograms are tree-like visual displays that depict the relative order in which segments are grouped. Thematic maps are 2-D and 3-D representations of the co-occurrence of terms. (Caveat: unstructured texts are represented in many dimensions. Projecting that representation into 2-D or 3-D space can introduce distortions, just as mapping a three-dimensional globe onto a two-dimensional map distorts some surface features. Because of those distortions, some terms may appear closer to or farther from others than they actually are.)
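The relationship between clustering and a dendrogram can be sketched in a few lines of Python. This example assumes each comment is reduced to the set of codes assigned to it, measures dissimilarity with Jaccard distance and merges clusters by single linkage. Those choices, the `agglomerate` function and the sample comments are illustrative assumptions, not the method of any particular QDA tool. The list of merges it returns is the information a dendrogram draws: which groups were joined, in what order and at what distance.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the share of codes two sets have in common."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def agglomerate(items):
    """Single-linkage agglomerative clustering.

    items maps a document name to its set of codes. Returns the merges
    in the order they happen as (cluster_a, cluster_b, distance).
    """
    clusters = [frozenset([name]) for name in sorted(items)]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            # Single linkage: distance between the closest pair of members.
            d = min(jaccard_distance(items[x], items[y])
                    for x in c1 for y in c2)
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((sorted(c1), sorted(c2), d))
    return merges


# Hypothetical coded comments about a digital camera.
docs = {
    "comment_1": {"red eye", "flash problem"},
    "comment_2": {"red eye", "flash problem", "battery life"},
    "comment_3": {"battery life"},
    "comment_4": {"lens", "zoom"},
}

merges = agglomerate(docs)
for left, right, distance in merges:
    print(left, "+", right, "at", round(distance, 3))
```

Comments that share the most codes merge first and at the smallest distance, so they sit lowest in the tree; a comment with no codes in common with the rest joins last.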
Why is this useful in the world of data warehouses and online analytical processing (OLAP) cubes? It is useful for the same reason QDA is useful in social sciences: numbers do not tell the whole story. Customer comments recorded by call center attendants, claims adjuster notes and search logs from customer self-service sites are primarily textual. Analysis of structured attributes can indicate the length of time attendants spend on customer support calls or identify the top ten claim codes. Analysis of structured data, however, cannot describe characteristics of problems that do not have associated codes and application variables to track them. Business intelligence is about analyzing data to answer questions relevant to organizational operations; it's not just about answering questions readily framed around a fixed set of easily encoded attributes.
QDA is one way to approach text mining. One of the earliest steps in text mining is reviewing sample texts to gain some insight into the breadth of topics, variations in terms, complexity of grammatical structures, use of domain-specific terminology and abbreviations, and amount of noise, or dirty data, in the text. QDA can make this somewhat ad hoc process more systematic. Text miners can tag content and then methodically review clusters, correlations among terms, variations in spelling and so on.
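Part of that initial review can be automated. The following Python sketch, which uses only the standard library, counts term frequencies in a handful of sample texts and flags pairs of terms that look like spelling variants. The sample comments, the function names and the 0.85 similarity cutoff are assumptions chosen for illustration; a real project would tune the cutoff and still review the candidates by hand.

```python
import re
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(texts):
    """Count how often each term occurs across all sample texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def spelling_variants(vocabulary, cutoff=0.85):
    """Return pairs of terms whose similarity ratio meets the cutoff."""
    pairs = []
    for a, b in combinations(sorted(vocabulary), 2):
        if SequenceMatcher(None, a, b).ratio() >= cutoff:
            pairs.append((a, b))
    return pairs


# Hypothetical customer comments, including some dirty data.
samples = [
    "The flash causes red eye in every photo",
    "Flash colour is off and the battery dies fast",
    "Color looks washed out when the flash fires",
    "Battry life is poor",
]

term_counts = term_frequencies(samples)
variants = spelling_variants(term_counts.keys())

print(term_counts.most_common(3))
print(variants)
```

Pairwise comparison is quadratic in the size of the vocabulary, which is acceptable for a sample review but would need a smarter approach for a full corpus.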
There are a number of QDA tools available on the market, but their suitability to text mining depends on several factors. First, look for a tool that integrates clustering algorithms and visualization tools, such as thematic maps and dendrograms. QDA tools should also export tagged content to XML so that it can be analyzed by other tools. Because text mining is the ultimate objective here, we want tools that include capabilities for developing and maintaining dictionaries and synonym sets. Clustering tools should work with both tagged content and the attributes that accompany it, whether metadata assigned to the text or structured attributes associated with it, such as coded fields in database records.
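As a rough illustration of two of these requirements, XML export and synonym handling, consider the Python sketch below. The element and attribute names (`document`, `text`, `segments`, `segment`, `start`, `end`, `code`) and the small synonym dictionary are hypothetical; actual QDA tools define their own export formats, so treat this only as an example of the kind of output a downstream text mining tool could consume.

```python
import xml.etree.ElementTree as ET

# Hypothetical synonym set mapping variant terms to a preferred form.
SYNONYMS = {
    "colour": "color",
    "battry": "battery",
    "redeye": "red eye",
}


def normalize(term):
    """Lowercase a term and replace it with its preferred synonym, if any."""
    term = term.lower()
    return SYNONYMS.get(term, term)


def export_segments(doc_id, text, segments):
    """Serialize a text and its coded segments to an XML string.

    segments is a list of (start, end, code) tuples using character offsets.
    """
    root = ET.Element("document", id=doc_id)
    ET.SubElement(root, "text").text = text
    container = ET.SubElement(root, "segments")
    for start, end, code in segments:
        ET.SubElement(container, "segment",
                      start=str(start), end=str(end), code=normalize(code))
    return ET.tostring(root, encoding="unicode")


comment = ("To avoid red eye, remove the flash - but this may cause "
           "your subject's eyes to dilate.")

xml_out = export_segments(
    "comment-1",
    comment,
    [(0, 34, "flash problem"), (0, len(comment), "Redeye")],
)
print(xml_out)
```

Normalizing codes through the synonym dictionary at export time means that variant labels such as "Redeye" and "red eye" reach the clustering step as a single category.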
Some readily available QDA tools are: QDA Miner by Provalis Research, NUD*IST/N6 and NVivo from QSR International, and ATLAS.ti by Thomas Muhr. Provalis Research integrates QDA Miner with its text mining tool, WordStat, and its statistical analysis package, SimStat, making it especially useful for text mining projects. These are desktop applications, not enterprise-scale systems; therefore, costs are reasonable. The U.S. Centers for Disease Control and Prevention offers a free version of AnSWR, another free-form text QDA tool.
Qualitative data analysis is a broad discipline, and we will probably continue to draw on it for practical techniques supporting text mining. For more information on QDA, see William Evans' Content Analysis Resources site (www.car.ua.edu/) and Eugene Horber's Qualitative Data Analysis site (www.unige.ch/ses/sococ/qual/qual.html).
Dan Sullivan is president of the Ballston Group and author of Proven Portals: Best Practices in Enterprise Portals (Addison Wesley, 2003). Sullivan may be reached at firstname.lastname@example.org.