Support vector machines (SVMs) have become a hot, but unfortunately confusing, topic. I had an interesting and illuminating conversation recently with Rob Cooley, the U.S. technical director for KXEN (Knowledge Extraction Engines), who helped elucidate SVMs while also explaining what KXEN is doing. At this point, KXEN is the only data mining company I know of to base its product entirely on the work of Vladimir Vapnik, who is best known for developing SVMs. Other companies, including Statistica and SAS, have added SVMs to their collections of algorithms. I want to share with you what I learned about KXEN's implementation.

The goal of SVMs is to address the problem of generalization; that is, how to build a model that applies to a wide range of data beyond the data used to create it. Typically in data mining, we separate our data into a training set and a test set, train on the former, and monitor error on the latter. We stop training when the test-set error reaches its minimum, in the belief that the model built at that point will generalize best. If we continue training past that point, the error on the training set will keep going down, but the error on the test set will go up. This is called overfitting. Conversely, if we stop training too early, we fit neither the training nor the test set particularly well. This is called underfitting. The more the real-world data differs from the training and test data, the more accuracy will suffer.
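To make the idea concrete, here is a minimal sketch of that early-stopping loop in Python. It is not KXEN's implementation and has nothing SVM-specific in it; it trains an ordinary logistic-regression model on synthetic data and halts once the test-set error stops improving. The learning rate, the patience window, and all other parameters are illustrative assumptions.

```python
# A minimal early-stopping sketch (not KXEN's implementation): train a simple
# logistic-regression model by gradient descent, watch the test-set error, and
# stop once it stops improving -- the point where overfitting begins.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: 200 noisy points in 20 dimensions.
X = rng.normal(size=(200, 20))
true_w = rng.normal(size=20)
y = (X @ true_w + rng.normal(scale=2.0, size=200) > 0).astype(float)

# Hold out half the data as a test set.
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

def error_rate(w, X, y):
    """Fraction of misclassified points under weights w."""
    preds = (X @ w > 0).astype(float)
    return np.mean(preds != y)

w = np.zeros(20)
best_w, best_test_err = w.copy(), 1.0
patience, bad_epochs = 10, 0   # assumed patience window

for epoch in range(1000):
    # One gradient step on the logistic loss over the training set.
    p = 1.0 / (1.0 + np.exp(-(X_train @ w)))
    w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

    test_err = error_rate(w, X_test, y_test)
    if test_err < best_test_err:
        best_w, best_test_err = w.copy(), test_err
        bad_epochs = 0
    else:
        # Training error keeps falling, but test error is no longer
        # improving: from here on we would only be fitting noise.
        bad_epochs += 1
        if bad_epochs >= patience:
            break

print(f"stopped at epoch {epoch}, best test error {best_test_err:.2%}")
```

Note that the model kept at the end is the one with the lowest test-set error (best_w), not the last one trained; continuing past that minimum is exactly the overfitting described above.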
