Support vector machines (SVMs) have become a hot, but unfortunately confusing, topic. I had an interesting and illuminating conversation recently with Rob Cooley, the U.S. technical director for KXEN (Knowledge Extraction Engines), who helped elucidate SVMs while also explaining what KXEN is doing. At this point, KXEN is the only data mining company I know of to base their product entirely on the work of Vladimir Vapnik, who is best known for developing SVMs. Other companies, including Statistica and SAS, have added SVMs to their collection of algorithms. I want to share with you what I learned about KXEN's implementation.
The goal of SVMs is to address the problem of generalization; that is, how to build a model that is applicable to a wide range of data beyond that which was used to create the model. Typically in data mining, we separate our data into a training data set and a test data set. When we have minimized the error on the test data set, we stop the training in the belief that the model thus built will be most generalizable. If we continue training, the error on the training data set will go down, but the error on the test data set will go up. This is called overfitting. Conversely, if we stop training too early, we fit neither the training nor the test data set particularly well. This is called underfitting. The more the real data differs from the training and test data, the more accuracy will suffer.
SVMs are based on a general principle Vapnik developed called structured risk minimization (SRM) to build the most generalizable model. SRM requires minimizing both a number called the VC (Vapnik-Chervonenkis) dimension and the error on the training data set, resulting in a very generalizable model.
Unfortunately, calculating the VC dimension precisely for all situations is difficult. It belongs to a class of problems for which the time it takes any algorithm to solve the problem grows very fast with the complexity. In practice, this means you must approximate the VC dimension. According to Cooley, one of KXEN's competitive advantages is the speed and accuracy of their algorithm for approximating the VC dimension.
SVMs are used primarily in classification problems (although they can be used in regression) and seem to work particularly well when the number of columns greatly exceeds the number of rows. This occurs, for example, in text mining; microarray analysis; and image, speech and handwriting recognition.
The word "machine" in SVM refers to the classifier that is found to best "shatter" or classify the data. SVMs work by taking the data and transforming it into a higher dimensional space (this is similar to adding variables) using something known as a kernel transformation. SVMs then find in this higher dimensional space a hyperplane (the equivalent of a straight line in two-dimensional space) that maximizes its distance from each collection of data points in the same class. SVMs pick the best hyperplane using SRM; in other words, they do so by minimizing both the error and a penalty term for complexity that is based on the VC dimension. Although there are many possible separators, this one will generalize best because it is furthest from each class. New data that might cross over a different separator often will not be able to cross over this hyperplane. The SVM then translates the hyperplane back into the dimensionality of the problem space.
There is a lot of activity going on in SVM research and development to make SVMs as easy to use effectively as more commonly employed data mining algorithms, as well as addressing such questions as scalability and interpretation of results.
While KXEN includes an SVM algorithm, the main thrust of the product is to use SRM to optimize a linear regression that starts with all the variables. The regression coefficients are chosen using the VC dimension in a penalty term for complexity, thus effectively eliminating excess variables. The goal is to get enough variables to avoid underfitting, while avoiding overfitting by not including too many. Thus the regression with the VC term results in a good model that generalizes well.
Like any linear regression, all variables must be numeric, and this is another area in which KXEN has developed interesting technology. Not only do they convert nominal fields (short text data) into numbers, but they encode all the data in a way that simplifies the calculation, resulting in better speed. Experimentation has shown that KXEN's algorithm (unlike some others) gives consistent answers and is not sensitive to different encodings.
The result is that KXEN does a nice job of automated variable selection and model building. Is it as good as a hand-tuned model with appropriately transformed variables? No, and it is not intended to be. KXEN's goal is to allow an unsophisticated user to easily come up with a reasonable answer or to provide the expert data miner with an exploratory tool that can help select variables. The product itself is designed to be embedded within an application such as a task-specific analytic application or a general-purpose analysis tool.
By automating variable selection along with model building, KXEN is making it possible for a broader range of applications to incorporate data mining effectively.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access