My last column addressed some of the fallacies about using data mining to find terrorists. This column will look further at certain misconceptions about data analysis and data mining, and how those technologies can be effective tools for investigators.
It was recently reported that a few days after the September 11 attacks, FBI agents visited one of the largest providers of consumer data to see whether the 9/11 terrorists were in its database. They quickly found five of them. One of the terrorists had been in the country for less than two years, had 30 credit cards and a quarter of a million dollars in debt with a payment schedule of $9,800 per month. Mohammed Atta, the ringleader, had also been here less than two years and had 12 addresses under the names Mohammed Atta, Mohammed J. Atta, J. Atta and others. Surely, the report speculated, with patterns like this, we can use the databases we presently have to ferret out terrorists in our midst. Unfortunately, the answer is, "It depends."
There are limits to how far the so-called patterns the agents observed can be used. We need to ask, first, how the records were found and, second, whether the observed characteristics are indeed repeated patterns or merely isolated instances. Because I am not privy to any knowledge other than what was published in the report, my analysis is based on surmise.
More than likely, the FBI started their search with database queries using the suspected terrorists' names and likely variants. They found the terrorists' records and then noticed the number of credit cards, addresses and the amount of debt. However, they probably would not have known in advance to look for these attributes. Furthermore, the terrorists' records probably didn't show that they had been in the country for only two years; that is knowledge the FBI brought to the search.
We also don't know how easily the observations generalize to other terrorists or how many non-terrorists have these same attributes. Combing the database for people who have a number of credit cards, big debts or multiple addresses would undoubtedly yield both criminals (most of whom aren't terrorists) and perfectly innocent folks.
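To see why the number of innocent people sharing an attribute matters so much, a back-of-the-envelope calculation helps. The sketch below is illustrative only: the function name, the population size, the number of targets and both rates are hypothetical figures chosen for the arithmetic, not numbers from the report or from any real database.

```python
def screening_outcome(population, true_targets, hit_rate, false_alarm_rate):
    """Estimate what a screening rule returns when targets are very rare.

    hit_rate: share of real targets the rule flags.
    false_alarm_rate: share of everyone else the rule also flags.
    """
    true_hits = true_targets * hit_rate
    false_alarms = (population - true_targets) * false_alarm_rate
    # Share of flagged records that actually belong to a target.
    precision = true_hits / (true_hits + false_alarms)
    return true_hits, false_alarms, precision


# Hypothetical inputs: 200 million records, 1,000 targets, a rule that
# catches 99% of targets and wrongly flags 1% of everyone else.
true_hits, false_alarms, precision = screening_outcome(
    population=200_000_000,
    true_targets=1_000,
    hit_rate=0.99,
    false_alarm_rate=0.01,
)
print(f"targets flagged:   {true_hits:,.0f}")
print(f"innocents flagged: {false_alarms:,.0f}")
print(f"share of flagged records that are targets: {precision:.4%}")
```

Even under these generous assumptions, the flagged list consists almost entirely of people who are not targets, which is the practical difficulty with screening on credit cards, debt or addresses alone.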
The large number of addresses for Atta may be an even more difficult screening criterion to use, considering that we don't know the names of unknown terrorists, let alone their aliases. It would be nearly impossible to conduct an aggregation across the hundreds of millions of individuals in this database to calculate the number of addresses, especially because all a terrorist would have to do to defeat such a search is use different aliases.
As I indicated last month, we don't have enough known terrorists or a consistent set of behaviors to use data mining to build predictive models. Thus, it would not be particularly productive to search for a signature.
If we can't inductively find a pattern from the data, perhaps we can just find exceptional behaviors sufficiently far from the norm to be worth investigating, such as 30 credit cards or 12 addresses. This problem (called outlier detection) is easy if you're simply searching for something very different on one dimension. It's much more difficult when you're looking for combinations of attributes whose individual values are typical, but which taken together are unusual. For example, being male or pregnant is not unusual, but pregnant males are rather uncommon! It's even more difficult to find outliers in categorical variables (data that fits in discrete classes) because the way to measure differences is not obvious. For example, what is the measure of the difference between a Ford and a Chevy?
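The pregnant-male example can be made concrete with a small sketch. It assumes records with just two categorical attributes and flags value pairs that are each reasonably common on their own but almost never occur together. The function name, the thresholds and the records themselves are made up for illustration; real outlier detection over many attributes is considerably harder than this.

```python
from collections import Counter


def rare_combinations(records, max_joint_share=0.01, min_marginal_share=0.05):
    """Return (a, b) pairs whose values are each common but rarely co-occur.

    records: list of (first_attribute, second_attribute) tuples.
    The two thresholds are arbitrary choices for this sketch.
    """
    n = len(records)
    first = Counter(a for a, _ in records)   # how often each first value occurs
    second = Counter(b for _, b in records)  # how often each second value occurs
    joint = Counter(records)                 # how often each pair occurs

    flagged = []
    for (a, b), count in joint.items():
        common_alone = (first[a] / n >= min_marginal_share
                        and second[b] / n >= min_marginal_share)
        rare_together = count / n <= max_joint_share
        if common_alone and rare_together:
            flagged.append((a, b))
    return sorted(flagged)


# Made-up records, including one data-entry oddity.
records = (
    [("female", "not pregnant")] * 450
    + [("female", "pregnant")] * 50
    + [("male", "not pregnant")] * 499
    + [("male", "pregnant")] * 1
)
flagged = rare_combinations(records)
print(flagged)
```

Note that this only works because every pair that occurs can be counted. With dozens of attributes the number of combinations explodes, and for categories such as car makes there is still no natural notion of how "far apart" two values are.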
Another trap is that if you look at enough variables, sooner or later you'll find at least one that correlates well with what you are trying to predict. This is called a specification search. When you are searching through large databases with many attributes, it is easy to find such false predictors. The problem of relying on data mining or query software as a primary line of defense is that it produces too many false positives.
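A small simulation shows how easily a specification search turns up a false predictor. In this sketch every attribute is pure random noise with no relationship at all to the target, yet the best of several hundred attributes still correlates noticeably with it. The row count, attribute count, seed and function names are arbitrary assumptions for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_spurious_correlation(n_rows=20, n_attributes=500, seed=0):
    """Correlate many random attributes with a random target.

    Returns the absolute correlation of the first attribute tried and
    the largest absolute correlation found across all attributes.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    correlations = []
    for _ in range(n_attributes):
        attribute = [rng.gauss(0, 1) for _ in range(n_rows)]
        correlations.append(abs(pearson(attribute, target)))
    return correlations[0], max(correlations)


first, best = best_spurious_correlation()
print(f"first attribute tried: |r| = {first:.2f}")
print(f"best of 500 attributes: |r| = {best:.2f}")
```

The "best" attribute looks like a predictor only because so many were tried. Checking it against fresh data it was not selected on is the usual way to expose it.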
What is the best way to use databases, search technology and data mining? First, recognize that "data" is more important than "mining." Resources should be spent working with the existing databases and setting up new ones that allow investigators to easily share information. Second, humans are more important than computers. Once trained investigators have generated lists of suspects, it's time to follow their tracks through the databases to verify information and check whether apparent anomalies are genuinely unusual and suspicious. Third, while the profiling and prediction aspects of data mining will be of limited use, other techniques, such as those used for finding fraud, can help investigators spread their nets beyond the original suspects. For example, visualizations and algorithms have been used to locate doctors and lawyers who work together to defraud insurance providers. As investigations uncover behaviors that differentiate terrorists from the rest of us, profiles that trigger further investigations will emerge.
Thus, we cannot rely on the magic of data mining to find terrorists or protect us from attack. No shortcuts can substitute for careful investigative work supported by good databases and a management structure that listens to and supports its investigators.