next up previous contents index
Next: Experimental results Up: Extensions to the SVM Previous: Multiclass SVMs   Contents   Index

Nonlinear SVMs

Figure 15.6: Projecting data that is not linearly separable into a higher dimensional space can make it linearly separable.

With what we have presented so far, data sets that are linearly separable (perhaps with a few exceptions or some noise) are well-handled. But what are we going to do if the data set just doesn't allow classification by a linear classifier? Let us look at a one-dimensional case. The top data set in Figure 15.6 is straightforwardly classified by a linear classifier but the middle data set is not. We instead need to be able to pick out an interval. One way to solve this problem is to map the data on to a higher dimensional space and then to use a linear classifier in the higher dimensional space. For example, the bottom part of the figure shows that a linear separator can easily classify the data if we use a quadratic function to map the data into two dimensions (a polar coordinates projection would be another possibility). The general idea is to map the original feature space to some higher-dimensional feature space where the training set is separable. Of course, we would want to do so in ways that preserve relevant dimensions of relatedness between data points, so that the resultant classifier should still generalize well.

SVMs, and also a number of other linear classifiers, provide an easy and efficient way of doing this mapping to a higher dimensional space, which is referred to as ``the kernel trick ''. It's not really a trick: it just exploits the math that we have seen. The SVM linear classifier relies on a dot product between data point vectors. Let $K(\vec{x}_i, \vec{x}_j) = \vec{x}_i^{T}\vec{x}_j$. Then the classifier we have seen so far is:

f(\vec{x}) = \mbox{sign}(\sum_i \alpha_i y_i K(\vec{x}_i, \vec{x}) + b)
\end{displaymath} (172)

Now suppose we decide to map every data point into a higher dimensional space via some transformation $\Phi\colon \vec{x} \mapsto
\phi(\vec{x})$. Then the dot product becomes $\phi(\vec{x}_i)^{T}\phi(\vec{x}_j)$. If it turned out that this dot product (which is just a real number) could be computed simply and efficiently in terms of the original data points, then we wouldn't have to actually map from $\vec{x} \mapsto \phi(\vec{x})$. Rather, we could simply compute the quantity $K(\vec{x}_i, \vec{x}_j) =
\phi(\vec{x}_i)^{T}\phi(\vec{x}_j)$, and then use the function's value in Equation 172. A kernel function $K$ is such a function that corresponds to a dot product in some expanded feature space.

Worked example. The quadratic kernel in two dimensions.quad-kernel For 2-dimensional vectors $\vec{u}=(u_1\;\;u_2)$, $\vec{v} = (v_1\;\;v_2)$, consider $K(\vec{u}, \vec{v}) = (1 + \vec{u}^{T}\vec{v})^2$. We wish to show that this is a kernel, i.e., that $K(\vec{u}, \vec{v}) = \phi(\vec{u})^{T}\phi(\vec{v})$ for some $\phi$. Consider $\phi(\vec{u}) = (1\;\; u_1^2\;\; \sqrt{2}u_1u_2\;\; u_2^2\;\; \sqrt{2}
u_1\;\; \sqrt{2}u_2)$. Then:

$\displaystyle K(\vec{u}, \vec{v})$ $\textstyle =$ $\displaystyle (1+\vec{u}^{T}\vec{v})^2$ (173)
  $\textstyle =$ $\displaystyle 1+u_{1}^2v_{1}^2 + 2 u_{1}v_{1}u_{2}v_{2}+ u_{2}^2v_{2}^2 +
2u_{1}v_{1} + 2u_{2}v_{2}$ (174)
  $\textstyle =$ $\displaystyle (1\;\; u_{1}^2\;\; \sqrt2u_{1}u_{2}\;\; u_{2}^2\;\; \sqrt{2}u_{1}...
..._{1}^2\;\; \sqrt{2}v_{1}v_{2}\;\; v_{2}^2\;\; \sqrt{2}v_{1} \;\;
\sqrt{2}v_{2})$ (175)
  $\textstyle =$ $\displaystyle \phi(\vec{u})^{T}\phi(\vec{v})$ (176)

End worked example.

In the language of functional analysis, what kinds of functions are valid kernel functions ? Kernel functions are sometimes more precisely referred to as Mercer kernels , because they must satisfy Mercer's condition: for any $g(\vec{x})$ such that $\int g(\vec{x})^2 d\vec{x}$ is finite, we must have that:

\int K(\vec{x}, \vec{z})g(\vec{x})g(\vec{z})d\vec{x}d\vec{z} \ge 0\thinspace.
\end{displaymath} (177)

A kernel function $K$ must be continuous, symmetric, and have a positive definite gram matrix. Such a $K$ means that there exists a mapping to a reproducing kernel Hilbert space (a Hilbert space is a vector space closed under dot products) such that the dot product there gives the same value as the function $K$. If a kernel does not satisfy Mercer's condition, then the corresponding QP may have no solution. If you would like to better understand these issues, you should consult the books on SVMs mentioned in Section 15.5 . Otherwise, you can content yourself with knowing that 90% of work with kernels uses one of two straightforward families of functions of two vectors, which we define below, and which define valid kernels.

The two commonly used families of kernels are polynomial kernels and radial basis functions. Polynomial kernels are of the form $K(\vec{x},\vec{z}) =
(1+\vec{x}^{T}\vec{z})^d$. The case of $d = 1$ is a linear kernel, which is what we had before the start of this section (the constant 1 just changing the threshold). The case of $d = 2$ gives a quadratic kernel, and is very commonly used. We illustrated the quadratic kernel in quad-kernel.

The most common form of radial basis function is a Gaussian distribution, calculated as:

K(\vec{x},\vec{z}) = e^{-(\vec{x}-\vec{z})^2/(2\sigma^2)}
\end{displaymath} (178)

A radial basis function (rbf) is equivalent to mapping the data into an infinite dimensional Hilbert space, and so we cannot illustrate the radial basis function concretely, as we did a quadratic kernel. Beyond these two families, there has been interesting work developing other kernels, some of which is promising for text applications. In particular, there has been investigation of string kernels (see Section 15.5 ).

The world of SVMs comes with its own language, which is rather different from the language otherwise used in machine learning. The terminology does have deep roots in mathematics, but it's important not to be too awed by that terminology. Really, we are talking about some quite simple things. A polynomial kernel allows us to model feature conjunctions (up to the order of the polynomial). That is, if we want to be able to model occurrences of pairs of words, which give distinctive information about topic classification, not given by the individual words alone, like perhaps operating and system or ethnic and cleansing, then we need to use a quadratic kernel. If occurrences of triples of words give distinctive information, then we need to use a cubic kernel. Simultaneously you also get the powers of the basic features - for most text applications, that probably isn't useful, but just comes along with the math and hopefully doesn't do harm. A radial basis function allows you to have features that pick out circles (hyperspheres) - although the decision boundaries become much more complex as multiple such features interact. A string kernel lets you have features that are character subsequences of terms. All of these are straightforward notions which have also been used in many other places under different names.

next up previous contents index
Next: Experimental results Up: Extensions to the SVM Previous: Multiclass SVMs   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.