There's a great "Cathy" cartoon in which Cathy's boyfriend Irving examines a list of Web sites she's recently visited. "The next time you log on," he remarks, "you should see an ad for singles weight-loss spas in Italy that allow dogs." Cathy runs off in distress as Irving reflects, "Everybody wants to be understood. No one wants to be known."

This cartoon captures a common worry: What will my data and transactions reveal about me to strangers? People are concerned about the privacy of their medical, financial, personal and professional information. They're uneasy about others knowing what books they read and what movies they see, what clubs and political parties they belong to, when they are traveling and where. They fear that dissemination of private data could lead to identity theft, increased telemarketing calls and spam, larger debt (from responding to those personally targeted marketing pitches) and unwanted attention from the government.

We generate an enormous amount of data as a by-product of our everyday transactions (purchasing goods, enrolling in courses, etc.), visits to Web sites and interactions with government (taxes, census, car registration, voter registration, etc.). Not only is the number of records we generate increasing, but so is the amount of data gathered for each type of record. Latanya Sweeney, assistant professor of computer science and public policy and director of the Laboratory for International Data Privacy at Carnegie Mellon University, has developed a rough measure of the growth in personal data, which she calls disk storage per person (DSP). The DSP is simply the amount of hard disk storage sold each year divided by the world population. This number has grown from 20KB of data in 1983 to 28MB in 1996 and then to 472MB in 2000.
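The DSP metric itself is just a ratio. A minimal sketch of the calculation, using hypothetical sales and population figures (these are illustrative assumptions, not Sweeney's actual inputs):

```python
# Disk storage per person (DSP): hard disk storage sold in a year
# divided by world population. The inputs below are hypothetical,
# chosen only to land in the neighborhood of the year-2000 figure.
def dsp_mb(storage_sold_mb: float, world_population: float) -> float:
    """Return megabytes of newly sold disk storage per person."""
    return storage_sold_mb / world_population

# Assumed inputs: ~2.9 billion GB of disk sold, ~6.1 billion people.
print(round(dsp_mb(2.9e9 * 1024, 6.1e9)))  # -> 487 (MB per person)
```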

As data miners, our tasks are colliding with these concerns. In analytic customer relationship management (CRM), we often analyze customer data with the specific intent of understanding individual behavior and instituting sales campaigns based on this understanding. Researchers in economics, demographics, medicine and social sciences are trying to understand the relationships between behaviors and outcomes.

How can we reconcile the legitimate needs of business and research with the equally legitimate desire of people to maintain their privacy? A total prohibition on collecting or retaining data is not really in anyone's interest.

We could solicit people's cooperation. Every organization gathering data can ask people to sign a form granting permission to use the data (known as opt-in) or acquire their permission implicitly when they do not revoke it (opt-out).

We could also respond with regulations about what data may be collected and how it can be used. In some countries, there are already strict laws that prohibit the use of personal data without the individual's explicit opt-in.

In the U.S., health-related companies and researchers are constrained by a complex 1996 law called HIPAA (Health Insurance Portability and Accountability Act), which provides a national standard for the protection of information relating to an individual's health. HIPAA provides for some limited use of the data collected for marketing purposes. For many purposes, however, the data must be stripped of all fields that would enable an individual to be identified, such as name, address, date of birth and Social Security number.

However, the growth and networking of computerized databases have made it possible to identify the "de-identified" people with surprising accuracy. Thus, your anonymity isn't guaranteed even if a database doesn't contain information that easily identifies you. Sweeney conducted an experiment in which, merely by knowing an individual's postal code and birth date, she could identify an individual's personal information in a supposedly anonymous public database with 69-percent accuracy. Knowing gender as well raised the accuracy to 87 percent!

To make such re-identification more difficult, field values can be aggregated in a way that preserves many of their analytic properties. For example, a field can be replaced with the average value for five records. Alternatively, a certain amount of random noise can be added to values. While these changes can be applied to numeric data, they don't work with nominal data. An approach for nominal data is to swap certain values among database records in an intelligent fashion. The challenge in all of these methods is to maintain privacy while still allowing for analysis of subgroups.
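The three techniques above can be sketched in a few lines each; the records and values below are invented for illustration:

```python
import random

def add_noise(values, scale=1.0, seed=0):
    """Perturb numeric values with zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, scale) for v in values]

def aggregate(values, k=5):
    """Replace each run of k numeric values with the run's mean,
    preserving group-level averages while blurring individuals."""
    out = []
    for i in range(0, len(values), k):
        group = values[i:i + k]
        out.extend([sum(group) / len(group)] * len(group))
    return out

def swap_field(records, field, seed=0):
    """Swap a nominal field's values among records; the overall
    distribution of the field survives, but the linkage to any
    one record is broken."""
    rng = random.Random(seed)
    vals = [r[field] for r in records]
    rng.shuffle(vals)
    return [{**r, field: v} for r, v in zip(records, vals)]

incomes = [31000, 54000, 47000, 62000, 38000]
print(aggregate(incomes, k=5))  # every record carries the mean, 46400.0
```

Note the trade-off the paragraph describes: the coarser the aggregation or the larger the noise, the harder re-identification becomes, but the less faithful any subgroup analysis will be.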

As data miners, we must be sensitive to these worries when collecting or using data, or else we risk burdensome and counter-productive regulation. Some organizations have created the role of chief privacy officer to oversee the protection and use of data. However, this charter should be extended to explain how the data will be used to the ultimate advantage of the people whose personal information is captured in the database. The solution, I think, is to be more open about why we need to know so much.
