
1 Introduction

The desire to learn from examples is as old as mankind but has reached a new dimension with the invention of computers. This paper concentrates on learning by support vector machines (SVMs) which meanwhile deliver state-of-the-art performance in many real-world applications. However, it should be mentioned at the beginning that there exist many alternatives to SVMs ranging from classical k-nearest neighbor methods over trees and neural networks to other kernel-based methods. Overviews can be found, e.g., in the books of Mitchell [73], Hastie et al. [47], Duda et al. [34], and Bishop [9].

SVMs are a new-generation learning system based on various components including:

  • Statistical learning theory,

  • Optimization theory (duality concept),

  • Reproducing kernel Hilbert spaces (RKHSs),

  • Efficient numerical algorithms.

This synthesis and their excellent performance in practice make SVM-like learning attractive for researchers from various fields. A non-exhaustive list of SVM applications includes text categorization (see Joachims [53] and Leopold and Kinderman [64]), handwritten character recognition (see LeCun et al. [62]), texture and image classification (see Chapelle et al. [23]), protein homology detection (see Jaakkola and Haussler [51]), gene expression (see Brown et al. [17]), medical diagnostics (see Strauss et al. [96]), and pedestrian and face detection (see Osuna et al. [77] and Viola and Jones [110]). There exist various benchmark data sets for testing and comparing new learning algorithms as well as a good collection of books and tutorials on SVMs, such as those of Vapnik [105], Burges [19], Cristianini and Shawe-Taylor [27], Herbrich [48], Schölkopf and Smola [87], and Steinwart and Christmann [93]. The first and the last of these contain a mathematically more rigorous treatment of statistical learning aspects. So-called least squares SVMs are handled in the book of Suykens et al. [98], and SVMs from the approximation theoretic point of view are considered in the book of Cucker and Zhou [29].

Let \(\mathcal{X} \subset \mathbb{R}^{d}\), \(\mathcal{Y}\subset \mathbb{R}^{\tilde{d}}\), where for simplicity only \(\tilde{d} = 1\) is considered, and \(\mathcal{Z}:= \mathcal{X}\times \mathcal{Y}\). The aim of the following sections is to learn a target function \(\mathcal{X} \rightarrow \mathcal{Y}\) from given training samples \(Z:=\{ (x_{1},y_{1}),\ldots,(x_{m},y_{m})\} \subset \mathcal{Z}\). A distinction is made between classification and regression tasks. In classification, \(\mathcal{Y}\) is a discrete set, in general as large as the number of classes the samples belong to. Binary classification, with just two labels in \(\mathcal{Y}\), has been studied most extensively. An example where binary classification is useful is SPAM detection. Another example from medical diagnostics is given in Fig. 1. It should be mentioned that in many practical applications, the original input variables are pre-processed to transform them into a new, useful space which is often easier to handle but preserves the necessary discriminatory information. This process is also known as feature extraction.

Fig. 1  Examples of physiological (SR) and pathological (VT) electrical heart activity curves measured by an implanted cardioverter-defibrillator. For the classification of such signals, see Strauss and Steidl [95]

In contrast to classification, regression aims at approximating the “whole” real-valued function from some function values, so that \(\mathcal{Y}\) is not countable here. The above examples, as all problems considered in this paper, are from the area of supervised learning. This means that all input vectors come along with their corresponding target function values (labeled data). In contrast, semi-supervised learning makes use of labeled and unlabeled data, and in unsupervised learning labeled data are not available, so that one can only exploit the input vectors \(x_{i}\). The latter methods can be applied, for example, to discover groups of similar exemplars within the data (clustering), to determine the distribution of the data within the input space (density estimation), or to perform projections of data from high-dimensional spaces to lower-dimensional spaces. There are also learning models which involve more complex interactions between the learner and the environment. An example is reinforcement learning, which is concerned with the problem of finding suitable actions in a given situation in order to maximize the reward. In contrast to supervised learning, reinforcement learning does not start from given optimal (labeled) outputs but must instead find them by a process of trial and error. For reinforcement learning, the reader may consult the book of Sutton and Barto [97].

Learning models can also differ in the way in which the training data are generated and presented to the learner. For example, a distinction can be made between batch learning, where all the data are given at the start of the training phase, and online learning, where the learner receives one example at a time and updates the hypothesis function in response to each new example.

This paper is organized as follows: An overview of the historical background is given in Sect. 2. Section 3 contains an introduction to classical SVM methods. It starts with linear methods for (binary) support vector classification and regression and also considers linear least squares classification/regression. Then the kernel trick is explained and used to transfer the linear models into so-called feature spaces, which results in nonlinear learning methods. Some other models related to SVMs as well as multi-class classification and multitask learning are addressed at the end of the section. Section 4 provides some mathematical background concerning RKHSs and quadratic optimization. The last subsection sketches very briefly some results in statistical learning theory. Numerical methods that make the classical SVMs efficient, in particular for large data sets, are presented in Sect. 5. The paper ends with some conclusions in Sect. 6.

2 Historical Background

Modern learning theory has a long and interesting history going back as far as Gauss and Legendre but got its enormous impetus from the advent of computing machines. In the 1930s revolutionary changes took place in understanding the principles of inductive inference from a philosophical perspective, e.g., by Popper, and from the point of view of statistical theory, e.g., by Kolmogorov, Glivenko, and Cantelli, and applied statistics, e.g., by Fisher. A good overview over the leading ideas and developments in this time can be found in the comments and bibliographical remarks of Vapnik’s book, Vapnik [105]. The starting point of statistical learning theory which considers the task of minimizing a risk functional based on empirical data dates back to the 1960s. Support vector machines, including their RKHS interpretation, were only discovered in the 1990s and led to an explosion in applications and theoretical analysis.

Let us start with the problem of linear regression which is much older than linear classification. The method of least squares was first published by Legendre [63]. It was considered as a statistical procedure by Gauss [43], who claimed, to the annoyance of Legendre but in accordance with most historians, to have applied this method since 1795. The original least squares approach finds for given points \(x_{i} \in \mathbb{R}^{d}\) and corresponding \(y_{i} \in \mathbb{R}\), \({i = 1, \ldots, m}\) a hyperplane \(f(x) =\langle w,x\rangle + b\) having minimal least squares distance from the points \((x_{i},y_{i})\):

$$\displaystyle{ \sum _{i=1}^{m}(\langle w,x_{ i}\rangle + b - y_{i})^{2}\; \rightarrow \;\min _{ w,b}. }$$
(1)

This leads to the solution of a linear system of equations which can be ill-conditioned or possess several solutions. Therefore, regularized versions were introduced later. The linear least squares approach is optimal in the case of linear targets corrupted by Gaussian noise. Sometimes it is useful to find a linear function which does not minimize the least squares error, but, for example, the \(\ell_{1}\)-error

$$\displaystyle{\sum _{i=1}^{m}\vert \langle w,x_{ i}\rangle + b - y_{i}\vert \;\rightarrow \min _{w,b}}$$

which is more robust against outliers. This model with the constraint that the sum of the errors is equal to zero was already studied by Laplace in 1799; see Laplace [61]. Another popular choice is the \(\ell_{\infty}\)-error

$$\displaystyle{\max _{i=1,\ldots,m}\vert \langle w,x_{i}\rangle + b - y_{i}\vert \;\rightarrow \min _{w,b}}$$

which better incorporates outliers. In contrast to the least squares method, the solutions of the \(\ell_{1}\)- and \(\ell_{\infty}\)-problems cannot be computed via linear systems of equations but require the solution of linear optimization problems. Figure 2 shows a one-dimensional example, where data are approximated by lines with minimal \(\ell_{2}\)-, \(\ell_{1}\)-, and \(\ell_{\infty}\)-error norm, respectively. For more information on (regularized) least squares problems, the reader may consult, e.g., the books of Golub and Van Loan [45] and of Björck [10].
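To make the three fitting criteria concrete, the following sketch (with purely illustrative toy data and variable names) computes the least squares fit of (1) with a standard solver and poses the \(\ell_{\infty}\)- and \(\ell_{1}\)-fits as linear programs, as described above.

```python
import numpy as np
from scipy.optimize import linprog

# Toy 1D data with one outlier (illustrative values only).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.1, 0.9, 2.2, 2.8, 9.0])          # the last value is an outlier
m, d = X.shape
A = np.hstack([X, np.ones((m, 1))])               # rows [x_i, 1], unknowns z = [w, b]

# Least squares fit (1): minimize the sum of squared residuals.
z_ls = np.linalg.lstsq(A, y, rcond=None)[0]

# ell_infinity fit: minimize t subject to |<w,x_i> + b - y_i| <= t; unknowns [w, b, t].
c_inf = np.r_[np.zeros(d + 1), 1.0]
A_inf = np.vstack([np.hstack([A, -np.ones((m, 1))]),
                   np.hstack([-A, -np.ones((m, 1))])])
res_inf = linprog(c_inf, A_ub=A_inf, b_ub=np.r_[y, -y],
                  bounds=[(None, None)] * (d + 2))

# ell_1 fit: minimize sum_i t_i subject to |<w,x_i> + b - y_i| <= t_i; unknowns [w, b, t_1..t_m].
c_1 = np.r_[np.zeros(d + 1), np.ones(m)]
A_1 = np.vstack([np.hstack([A, -np.eye(m)]),
                 np.hstack([-A, -np.eye(m)])])
res_1 = linprog(c_1, A_ub=A_1, b_ub=np.r_[y, -y],
                bounds=[(None, None)] * (d + 1 + m))

print("least squares (w, b):", z_ls)
print("ell_inf       (w, b):", res_inf.x[:d + 1])
print("ell_1         (w, b):", res_1.x[:d + 1])
```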

Fig. 2  Linear approximation with respect to the \(\ell_{2}\)-, \(\ell_{1}\)-, and \(\ell_{\infty}\)-norm of the error (left to right). The \(\ell_{1}\) approximation is more robust against outliers, while the \(\ell_{\infty}\)-norm takes them better into account

Regularized least squares methods which penalize the quadratic weight \(\|w\|^{2}\) as in section “Linear Least Squares Classification and Regression” were examined under the name ridge regression by Hoerl and Kennard [49]. This method can be considered as a special case of the regularization theory for ill-posed problems developed by Tikhonov and Arsenin [102]. Loss functions other than the least squares loss, like the ε-insensitive loss, were brought into play by Vapnik [105]. This loss function enforces a sparse representation of the weights in terms of so-called support vectors, which are (small) subsets of the training samples \(\{x_{i}: i = 1,\ldots,m\}\).

The simplest form of classification is binary classification, where one has just to separate between two classes. Linear hyperplanes H(w, b) separating points, also called linear discriminants or perceptrons, were already studied by Fisher [41] and became interesting for neural network researchers in the early 1960s. One of the first algorithms that constructs a separating hyperplane for linearly separable points was Rosenblatt’s perceptron; see Rosenblatt [84]. It is an iterative, online, and mistake-driven algorithm which starts with an initial weight guess w for the hyperplane and adapts the weight each time a training point is misclassified by the current weight. If the data are linearly separable, the procedure converges, and the number of mistakes (number of updates of w) does not exceed \((2R/\gamma)^{2}\), where \(R:=\max _{i=1,\ldots,m}\|x_{i}\|\) and γ is the smallest distance between a training point and the hyperplane. For linearly separable points, there exist various hyperplanes separating them.
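A minimal sketch of the mistake-driven perceptron update just described (data and names are illustrative):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt-type perceptron: X has shape (m, d), y contains labels in {-1, +1}."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # point misclassified (or on the hyperplane)
                w, b = w + yi * xi, b + yi      # mistake-driven update of the weight guess
                mistakes += 1
        if mistakes == 0:                       # every training point is classified correctly
            return w, b
    return w, b                                 # no guarantee if the data are not separable

# Two linearly separable point clouds (illustrative).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```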

An optimal hyperplane for linearly separable points, in the sense that the minimal distance of the points from the plane becomes maximal, was constructed by the so-called generalized portrait algorithm of Vapnik and Lerner [108]. This learning method is also known as the linear hard margin support vector classifier. The method was generalized to nonseparable points by Cortes and Vapnik [26], which leads to soft margin classifiers. Finally, the step from linear to nonlinear classifiers via feature maps was taken by Boser et al. [12]. Their idea to combine a linear algorithm with a kernel approach inspired the further examination of specific kernels for applications.

However, the theory of kernels and their applications is older than SVMs. Aronszajn [5] systematically developed the theory of RKHSs in the 1940s, although it was later discovered that many results had been obtained independently by Povzner [83]. The work of Parzen [78] brought the RKHS to the fore in statistical problems; see also Kailath [54]. Kernels in pattern recognition were already applied by Aizerman et al. [1]. Empirical risk minimization over RKHSs was considered by Wahba [112] in connection with splines and by Poggio and Girosi [81] in relation to neural networks. Schölkopf et al. [89] realized that the kernel trick works not only for SVMs but also for many other methods, such as principal component analysis in unsupervised learning.

The invention of SVMs has led to a gigantic amount of developments in learning theory and practice. The size of this paper would not suffice to list the references on this topic. Beyond various applications, also advanced generalization results, suitable choices of kernels, efficient numerical methods in particular for large data sets, relations to other sparse representation methods, multi-class classification, and multitask learning were addressed. The reader will find some references in the corresponding sections.

3 Mathematical Modeling and Applications

3.1 Linear Learning

This section starts with linear classification and regression which provide the easiest algorithms to understand some of the main building blocks that appear also in the more sophisticated nonlinear support vector machines. Moreover, concerning the classification task, this seems to be the best approach to explain its geometrical background. The simplest function to feed a classifier with or to use as an approximation of some unknown function in regression tasks is a linear (multivariate) function

$$\displaystyle{ f(x) = f_{w}(x):=\langle w,x\rangle,\quad x \in \mathcal{X} \subset \mathbb{R}^{d}. }$$
(2)

Often it is combined with some appropriate real number b, i.e., one considers the linear polynomial \(f(x) + b =\langle w,x\rangle + b\). In the context of learning, w is called the weight vector and b the offset, intercept, or bias.

3.1.1 Linear Support Vector Classification

Let us consider binary classification first and postpone multi-class classification to section “Multi-class Classification and Multitask Learning.” As binary classifier \(F = F_{w,b}: \mathcal{X} \rightarrow \{-1,1\}\), one can use

$$\displaystyle{F(x):= {\mathrm{sgn}}(f_{w}(x) + b) = {\mathrm{sgn}}(\langle w,x\rangle + b)}$$

with the agreement that sgn(0): = 0. The hyperplane

$$\displaystyle{H(w,b):=\{ x:\langle w,x\rangle + b = 0\}}$$

has the normal vector \(w/\|w\|\), and the distance of a point \(\tilde{x} \in \mathbb{R}^{d}\) to the hyperplane is given by

$$\displaystyle{\bigg\vert \bigg\langle \frac{w} {\|w\|},\tilde{x}\bigg\rangle + \frac{b} {\|w\|}\bigg\vert }$$

see Fig. 3 left. In particular, \(\vert b\vert /\|w\|\) is the distance of the hyperplane from the origin.

Fig. 3  Left: Hyperplane H with normal \(w/\|w\|\) and distance \(\vert b\vert /\|w\|\) from the origin. The distance of the point \(\tilde{x}\) from the hyperplane is the value \(\lambda\) fulfilling \(\langle w,\tilde{x} -\lambda w/\|w\|\rangle + b = 0\), i.e., \(\lambda = (\langle w,\tilde{x}\rangle + b)/\|w\|\). Right: Linearly separable training set together with a separating hyperplane and the corresponding margin γ

The training set Z consists of two classes labeled by ± 1 with indices \(I_{+}:=\{ i: y_{i} = 1\}\) and \(I_{-}:=\{ i: y_{i} = -1\}\). The training set is said to be separable by the hyperplane H(w, b) if \(\langle w,x_{i}\rangle + b > 0\) for \(i \in I_{+}\) and \(\langle w,x_{i}\rangle + b < 0\) for \(i \in I_{-}\), i.e.,

$$\displaystyle{y_{i}(\langle w,x_{i}\rangle + b) > 0.}$$

The points in Z are called (linearly) separable if there exists a hyperplane separating them. In this case, their distance from a separating hyperplane is given by

$$\displaystyle{y_{i}\left (\bigg\langle \frac{w} {\|w\|},x_{i}\bigg\rangle + \frac{b} {\|w\|}\right ),\quad i = 1,\ldots,m.}$$

The smallest distance of a point from the hyperplane

$$\displaystyle{ \gamma:=\min _{i=1,\ldots,m}y_{i}\left (\bigg\langle \frac{w} {\|w\|},x_{i}\bigg\rangle + \frac{b} {\|w\|}\right ) }$$
(3)

is called the margin. Figure 3, right, shows a separating hyperplane of two classes together with its margin. Of course, for a separable training set there may exist various separating hyperplanes. One way to ensure a unique solution is to pose additional requirements on the hyperplane in the form of minimizing a cost functional.

Hard Margin Classifier

One obvious way is to choose the separating hyperplane which has maximal distance from the data, i.e., a maximal margin. The corresponding classifiers are called maximal margin classifiers or hard margin classifiers. The hyperplane and the corresponding half-spaces do not change if the defining vector (w, b) is rescaled to (cw, cb), c > 0. The so-called generalized portrait algorithm of Vapnik and Lerner [108] constructs a hyperplane that maximizes γ under the constraint \(\|w\| = 1\). The same hyperplane can be obtained as follows: by (3), it holds

$$\displaystyle{\gamma \,\|w\| =\min _{i=1,\ldots,m}y_{i}\left (\langle w,x_{i}\rangle + b\right )}$$

so that one can use the scaling

$$\displaystyle{\gamma \,\|w\| = 1\quad \Leftrightarrow \quad \gamma = \frac{1} {\|w\|}.}$$

Now γ becomes maximal if and only if \(\|w\|\) becomes minimal, and the scaling means that \(y_{i}\left (\langle w,x_{i}\rangle + b\right ) \geq 1\) for all \(i = 1,\ldots,m\). Therefore, the hard margin classifier aims to find parameters \(w\) and b solving the following quadratic optimization problem with linear constraints:

$$\displaystyle{\begin{array}{c} \mbox{ Linear SV hard margin classification (Primal problem)} \\ \frac{1} {2}\|w\|^{2}\; \rightarrow \;\min \limits _{ w,b}\quad \mbox{ subject to}\quad y_{i}\left (\langle w,x_{i}\rangle + b\right ) \geq 1,\;i = 1,\ldots,m. \end{array} }$$

If the training data are linearly separable, the problem has a unique solution. A brief introduction into quadratic programming methods is given in section “Quadratic Optimization.” To transform the problem into its dual form, consider the Lagrangian

$$\displaystyle{L(w,b,\alpha ) = \frac{1} {2}\|w\|^{2} +\sum _{ i=1}^{m}\alpha _{ i}\left (1 - y_{i}\left (\langle w,x_{i}\rangle + b\right )\right ),\quad \alpha _{i} \geq 0.}$$

Since

$$\displaystyle\begin{array}{rcl} \frac{\partial L} {\partial w}& =& w -\sum _{i=1}^{m}\alpha _{ i}y_{i}x_{i}\; =\; 0\quad \Leftrightarrow \quad w\; =\;\sum _{ i=1}^{m}\alpha _{ i}y_{i}x_{i},{}\end{array}$$
(4)
$$\displaystyle\begin{array}{rcl} \frac{\partial L} {\partial b} & =& \sum _{i=1}^{m}\alpha _{ i}y_{i}\; =\; 0{}\end{array}$$
(5)

the Lagrangian can be rewritten as

$$\displaystyle{ L(w,b,\alpha ) = -\frac{1} {2}\sum _{i=1}^{m}\sum _{ j=1}^{m}y_{ i}\alpha _{i}y_{j}\alpha _{j}\langle x_{i},x_{j}\rangle +\sum _{ i=1}^{m}\alpha _{ i} }$$
(6)

and the dual problem becomes

$$\displaystyle{\begin{array}{c} \mbox{ Linear SV hard margin classification (Dual problem)} \\ \frac{1} {2}\sum \limits _{i=1}^{m}\sum \limits _{ j=1}^{m}y_{ i}\alpha _{i}y_{j}\alpha _{j}\langle x_{i},x_{j}\rangle -\sum \limits _{i=1}^{m}\alpha _{ i} \rightarrow \min \limits _{\alpha }\quad \mbox{ subject to}\quad \sum _{i=1}^{m}y_{ i}\alpha _{i} = 0, \\ \phantom{\frac{1} {2}\sum \limits _{i=1}^{m}\sum \limits _{ j=1}^{m}y_{ i}\alpha _{i}y_{j}\alpha _{j}\langle x_{i},x_{j}\rangle -\sum \limits _{i=1}^{m}\alpha _{ i} \rightarrow \min \limits _{\alpha }\quad \mbox{ subject to}\quad }\quad \quad \alpha _{i} \geq 0,\;i = 1,\ldots,m. \end{array} }$$

Note that the dual maximization problem has been rewritten as a minimization problem by using that maximizing \(\phi\) is equivalent to minimizing \(-\phi\). Let \(1_{m}\) denote the vector with m coefficients 1, \(\alpha:= (\alpha _{i})_{i=1}^{m}\), \(y:= (y_{i})_{i=1}^{m}\), \(\mathbf{Y}:= {\mathrm{diag}}(y_{i})_{i=1}^{m}\) and

$$\displaystyle{ \mathbf{K}:= \left (\langle x_{i},x_{j}\rangle \right )_{i,j=1}^{m}. }$$
(7)

Note that K is symmetric and positive semi-definite. The dual problem can be rewritten in matrix-vector form as

$$\displaystyle{\begin{array}{c} \mbox{ Linear SV hard margin classification (Dual problem)} \\ \frac{1} {2}\alpha ^{\mbox{ T}}\mathbf{Y}\mathbf{K}\mathbf{Y}\alpha -\langle 1_{ m},\alpha \rangle \;\rightarrow \;\min \limits _{\alpha }\quad \mbox{ subject to}\quad \langle y,\alpha \rangle = 0,\;\alpha \geq 0. \end{array} }$$

Let \(\alpha ^{{\ast}}\) be the minimizer of this dual problem. The intercept b does not appear in the dual problem, and one has to determine its optimal value in another way. By the Kuhn-Tucker conditions, the equations

$$\displaystyle{\alpha _{i}^{{\ast}}\left (y_{ i}(\langle w^{{\ast}},x_{ i}\rangle + b^{{\ast}}) - 1\right ) = 0,\quad i = 1,\ldots,m}$$

hold true, so that \(\alpha _{i}^{{\ast}} > 0\) is only possible for those training data with \(y_{i}(\langle w^{{\ast}},x_{i}\rangle + b^{{\ast}}) = 1\). These are exactly the (few) points having margin distance γ from the hyperplane \(H(w^{{\ast}},b^{{\ast}})\). Define

$$\displaystyle{ I_{S}:=\{ i:\alpha _{ i}^{{\ast}} > 0\},\quad S:=\{ x_{ i}: i \in I_{S}\}. }$$
(8)

The vectors from S are called support vectors. In general, \(\vert S\vert \ll m\), and by (4) the optimal weight \(w^{{\ast}}\) and the optimal function \(f_{w^{{\ast}}}\) have a sparse representation in terms of the support vectors

$$\displaystyle{ w^{{\ast}} =\sum _{ i\in I_{S}}\alpha _{i}^{{\ast}}y_{ i}x_{i},\quad f_{w^{{\ast}}}(x) =\sum _{i\in I_{S}}\alpha _{i}^{{\ast}}y_{ i}\langle x_{i},x\rangle. }$$
(9)

Moreover,

$$\displaystyle{ b^{{\ast}} = y_{ i} -\langle w^{{\ast}},x_{ i}\rangle = y_{i} - f_{w^{{\ast}}}(x_{i}),\quad i \in I_{s} }$$
(10)

and hence, using (5),

$$\displaystyle\begin{array}{rcl} \|w^{{\ast}}\|^{2}& =& \sum _{ i\in I_{S}}\alpha _{i}^{{\ast}}y_{ i}\sum _{j\in I_{S}}\alpha _{j}^{{\ast}}y_{ j}\langle x_{i},x_{j}\rangle =\sum _{i\in I_{S}}\alpha _{i}^{{\ast}}y_{ i}f_{w^{{\ast}}}(x_{i}) {}\\ & =& \sum _{i\in I_{S}}\alpha _{i}^{{\ast}}(1 - y_{ i}b^{{\ast}}) =\sum _{ i\in I_{S}}\alpha _{i}^{{\ast}} {}\\ \end{array}$$

so that

$$\displaystyle{\gamma = 1/\|w\| =\big (\sum _{i\in I_{S}}\alpha _{i}^{{\ast}}\big)^{-1/2}.}$$
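The derivation above translates directly into a few lines of code. The sketch below solves the dual in its matrix-vector form with a general-purpose solver (SciPy's SLSQP, chosen here only for illustration; dedicated QP solvers are preferable in practice) and recovers \(w^{\ast}\), \(b^{\ast}\), and the margin via (9) and (10); the data are toy values.

```python
import numpy as np
from scipy.optimize import minimize

# Linearly separable toy data (illustrative).
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

K = X @ X.T                                     # Gram matrix (7)
YKY = (y[:, None] * K) * y[None, :]             # Y K Y

def dual(alpha):                                # (1/2) a^T Y K Y a - <1_m, a>
    return 0.5 * alpha @ YKY @ alpha - alpha.sum()

res = minimize(dual, x0=np.zeros(m), method="SLSQP",
               bounds=[(0.0, None)] * m,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

sv = alpha > 1e-8                               # support vectors (8)
w = (alpha[sv] * y[sv]) @ X[sv]                 # (9)
b = np.mean(y[sv] - X[sv] @ w)                  # (10), averaged over the support vectors
print("w* =", w, "b* =", b, "margin =", 1.0 / np.linalg.norm(w))
```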

Soft Margin Classifier

If the training data are not linearly separable, which is the case in most applications, the hard margin classifier is not applicable. The extension of hard margin classifiers to the nonseparable case was done by Cortes and Vapnik [26] by bringing additional slack variables and a parameter C > 0 into the constrained model:

$$\displaystyle{\begin{array}{lll} \mbox{ Linear SV soft margin classification (Primal problem)} \\ \frac{1} {2}\|w\|^{2} + C\sum \limits _{ i=1}^{m}\xi _{ i}\; \rightarrow \;\min \limits _{w,b,\xi }\quad \mbox{ subject to}\quad y_{i}\left (\langle w,x_{i}\rangle + b\right ) \geq 1 -\xi _{i}, \\ \phantom{\frac{1} {2}\|w\|^{2} + C\sum \limits _{ i=1}^{m}\xi _{ i}\; \rightarrow \;\min \limits _{w,b,\xi }\quad \mbox{ subject to}\quad }\xi _{i} \geq 0,\;i = 1,\ldots,m. \end{array} }$$

For \(C = \infty\), this is again the hard margin classifier model. As before, the margin is defined as \(\gamma = 1/\|w^{{\ast}}\|\), where \(w^{{\ast}}\) is the solution of the above problem. If the slack variable fulfills \(0 \leq \xi _{i}^{{\ast}} < 1\), then \(x_{i}\) is correctly classified, and in the case \(y_{i}\left (\langle w^{{\ast}},x_{i}\rangle + b^{{\ast}}\right ) = 1 -\xi _{i}^{{\ast}}\) the distance of \(x_{i}\) from the hyperplane is \(\gamma -\xi _{i}^{{\ast}}/\|w^{{\ast}}\|\). If \(\xi _{i}^{{\ast}} > 1\), then one has a misclassification. By penalizing the sum of the slack variables, one tries to keep them small.

The above constrained minimization model can be rewritten as an unconstrained one by using a margin-based loss function. A function \(L:\{ -1,1\} \times \mathbb{R} \rightarrow [0,\infty )\) is called margin based if there exists a representing function \(l: \mathbb{R} \rightarrow [0,\infty )\) such that

$$\displaystyle{L(y,t) = l(yt).}$$

In soft margin classification the appropriate choice of a loss function is the hinge loss function \(l_{h}\) determined by

$$\displaystyle{l_{h}(x):=\max \{ 0,1 - x\}.}$$

Then the unconstrained primal problem reads

$$\displaystyle{\begin{array}{c} \mbox{ Linear SV soft margin classification (Primal problem)} \\ \frac{1} {2}\|w\|^{2} + C\sum \limits _{ i=1}^{m}L_{ h}\left (y_{i},\left (\langle w,x_{i}\rangle + b\right )\right )\; \rightarrow \;\min \limits _{w,b}. \end{array} }$$

The Lagrangian of the linear constrained problem has the form

$$\displaystyle{L(w,b,\xi,\alpha ) = \frac{1} {2}\|w\|^{2} + C\sum _{ i=1}^{m}\xi _{ i} +\sum _{ i=1}^{m}\alpha _{ i}\left (1 -\xi _{i} - y_{i}\left (\langle w,x_{i}\rangle + b\right )\right ) -\sum _{i=1}^{m}\beta _{ i}\xi _{i},}$$

where \(\alpha _{i},\beta _{i} \geq 0\). Partial differentiation of the Lagrangian with respect to w and b results in (4), (5) and with respect to ξ in

$$\displaystyle{\frac{\partial L} {\partial \xi } = C\,1_{m} -\alpha -\beta \; =\; 0.}$$

Using these relations, the Lagrangian can be rewritten in the same form as in (6), and the dual problem becomes in matrix-vector form

$$\displaystyle{\begin{array}{c} \mbox{ Linear SV soft margin classification (Dual problem)} \\ \frac{1} {2}\alpha ^{\mbox{ T}}\mathbf{Y}\mathbf{K}\mathbf{Y}\alpha -\langle 1_{ m},\alpha \rangle \;\rightarrow \;\min \limits _{\alpha }\quad \mbox{ subject to}\quad \langle y,\alpha \rangle = 0,\;0 \leq \alpha \leq C. \end{array} }$$

Let \(\alpha ^{{\ast}}\) be the minimizer of the dual problem. Then the optimal weight \(w^{{\ast}}\) and \(f_{w^{{\ast}}}\) are again given by (9) and depend only on the few support vectors defined by (8). By the Kuhn-Tucker conditions, the equations

$$\displaystyle\begin{array}{rcl} & & \alpha _{i}^{{\ast}}\left (y_{ i}(\langle w^{{\ast}},x_{ i}\rangle + b^{{\ast}}) - 1 +\xi _{ i}^{{\ast}}\right ) = 0\quad {\mathrm{and}} {}\\ & & \quad \beta _{i}^{{\ast}}\xi _{ i}^{{\ast}} = (C -\alpha _{ i}^{{\ast}})\xi _{ i}^{{\ast}} = 0,\quad i = 1,\ldots,m {}\\ \end{array}$$

hold true. For \(0 <\alpha _{i}^{{\ast}} < C\), it follows that \(\xi _{i}^{{\ast}} = 0\) and \(y_{i}(\langle w^{{\ast}},x_{i}\rangle + b^{{\ast}}) = 1\), i.e., the points \(x_{i}\) have margin distance \(\gamma = 1/\|w^{{\ast}}\|\) from \(H(w^{{\ast}},b^{{\ast}})\). Moreover,

$$\displaystyle{ b^{{\ast}} = y_{ i} -\langle w^{{\ast}},x_{ i}\rangle,\quad i \in \tilde{ I}_{S}:=\{ i: 0<\alpha _{i}^{{\ast}}<C\}. }$$
(11)

For \(\alpha _{i}^{{\ast}} = C\), one concludes that \(y_{i}(\langle w^{{\ast}},x_{i}\rangle + b^{{\ast}}) = 1 -\xi _{i}^{{\ast}}\), i.e., \(x_{i}\) has distance \(\gamma -\xi _{i}^{{\ast}}/\|w^{{\ast}}\|\) from the optimal hyperplane.
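Compared with the hard margin case, only the box constraint \(0 \leq \alpha \leq C\) and the rule (11) for the intercept change. A sketch along the same lines (solver choice and toy data are again illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def soft_margin_svc(X, y, C):
    """Solve the linear SV soft margin dual with a general-purpose solver (sketch only)."""
    m = len(y)
    YKY = (y[:, None] * (X @ X.T)) * y[None, :]
    res = minimize(lambda a: 0.5 * a @ YKY @ a - a.sum(),
                   x0=np.zeros(m), method="SLSQP",
                   bounds=[(0.0, C)] * m,                       # box constraint 0 <= alpha <= C
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    alpha = res.x
    w = (alpha * y) @ X
    inner = (alpha > 1e-8) & (alpha < C - 1e-8)                 # 0 < alpha_i^* < C, see (11)
    if not inner.any():                                         # fallback if all alphas sit at a bound
        inner = alpha > 1e-8
    b = np.mean(y[inner] - X[inner] @ w)
    return w, b, alpha

# Overlapping classes (illustrative), so some slack variables are active.
X = np.array([[1.0, 1.0], [2.0, 0.5], [0.2, 0.1], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
print(soft_margin_svc(X, y, C=1.0))
```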

3.1.2 Linear Support Vector Regression

Of course one can also approximate unknown functions by linear (multivariate) polynomials of the form (2).

Hard Margin Regression

The model for linear hard margin regression is given by

$$\displaystyle{\begin{array}{c} \mbox{ Linear SV hard margin regression (Primal problem)} \\ \frac{1} {2}\|w\|^{2}\; \rightarrow \;\min _{ w,b}\quad \mbox{ subject to}\quad \langle w,x_{i}\rangle + b - y_{i}\quad \leq \quad \epsilon, \\ \phantom{\frac{1} {2}\|w\|^{2}\; \rightarrow \;\min _{ w,b}\quad \mbox{ subject to}\quad }\quad \quad \quad \quad -\langle w,x_{i}\rangle - b + y_{i}\quad \leq \quad \epsilon,\quad i = 1,\ldots,m. \end{array} }$$

The constraints make sure that the data \(y_{i}\) lie within an ε-distance from the value \(f(x_{i}) + b\) of the approximating linear polynomial. The Lagrangian reads

$$\displaystyle\begin{array}{ll} L(w,b,\xi ^{\pm },\alpha ^{\pm },\beta ^{\pm })& = \frac{1} {2}\|w\|^{2} +\sum _{ i=1}^{m}\alpha _{ i}^{-}\left (\langle w,x_{ i}\rangle + b - y_{i}-\epsilon \right ) \\ & +\sum _{i=1}^{m}\alpha _{ i}^{+}\left (-\langle w,x_{ i}\rangle - b + y_{i}-\epsilon \right ), \end{array}$$

where \(\alpha _{i}^{\pm } \geq 0\). Setting partial derivatives to zero leads to

$$\displaystyle\begin{array}{rcl} \frac{\partial L} {\partial w}& =& w +\sum _{ i=1}^{m}\left (\alpha _{ i}^{-}-\alpha _{ i}^{+}\right )x_{ i}\; =\; 0\quad \Leftrightarrow \quad w\; =\;\sum _{ i=1}^{m}\left (\alpha _{ i}^{+} -\alpha _{ i}^{-}\right )x_{ i},{}\end{array}$$
(12)
$$\displaystyle\begin{array}{rcl} \frac{\partial L} {\partial b} & =& \sum _{i=1}^{m}\left (\alpha _{ i}^{+} -\alpha _{ i}^{-}\right )\; =\; 0.{}\end{array}$$
(13)

Using these relations and setting

$$\displaystyle{\alpha:=\alpha ^{+} -\alpha ^{-},}$$

the Lagrangian can be written as

$$\displaystyle{L(w,b,\xi ^{\pm },\alpha ^{\pm },\beta ^{\pm }) = -\frac{1} {2}\sum _{i=1}^{m}\sum _{ j=1}^{m}\alpha _{ i}\alpha _{j}\langle x_{i},x_{j}\rangle -\epsilon \sum _{i=1}^{m}\left (\alpha _{ i}^{+} +\alpha _{ i}^{-}\right ) +\sum _{ i=1}^{m}y_{ i}\alpha _{i}}$$

and the dual problem becomes in matrix-vector form

$$\displaystyle{\begin{array}{c} \mbox{ Linear SV hard margin regression (Dual problem)} \\ \frac{1} {2}(\alpha ^{+} -\alpha ^{-})^{\mbox{ T}}\mathbf{K}(\alpha ^{+} -\alpha ^{-}) +\epsilon \langle 1_{ m},\alpha ^{+} +\alpha ^{-}\rangle -\langle y,\alpha ^{+} -\alpha ^{-}\rangle \;\rightarrow \min \limits _{\alpha ^{+ },\alpha ^{-}} \\ \mbox{ subject to}\quad \langle 1_{m},\alpha ^{+} -\alpha ^{-}\rangle = 0,\quad \alpha ^{\pm }\geq 0. \end{array} }$$

This is a quadratic optimization problem with linear constraints. Let \((\alpha ^{+})^{{\ast}},(\alpha ^{-})^{{\ast}}\) be the solution of this problem and \(\alpha ^{{\ast}} = (\alpha ^{+})^{{\ast}}- (\alpha ^{-})^{{\ast}}\). Then, by (12), the optimal weight and the optimal function have in general a sparse representation in terms of the support vectors x i , iI S , namely,

$$\displaystyle{ w^{{\ast}} =\sum _{ i\in I_{S}}\alpha _{i}^{{\ast}}x_{ i},\quad f_{w^{{\ast}}}(x) =\sum _{i\in I_{rS}}\alpha _{i}^{{\ast}}\langle x_{ i},x\rangle,\quad I_{rS}:=\{ i:\alpha _{ i}^{{\ast}}\not =0\}. }$$
(14)

The corresponding Kuhn-Tucker conditions are

$$\displaystyle\begin{array}{rcl} (\alpha _{i}^{-})^{{\ast}}\left (\epsilon -\langle w^{{\ast}},x_{ i}\rangle - b^{{\ast}} + y_{ i}\right )& =& 0,{} \\ (\alpha _{i}^{+})^{{\ast}}\left (\epsilon +\langle w^{{\ast}},x_{ i}\rangle + b^{{\ast}}- y_{ i}\right )& =& 0. \\ \end{array}$$
(15)

If \((\alpha _{i}^{-})^{{\ast}} > 0\) or \((\alpha _{i}^{+})^{{\ast}} > 0\), then

$$\displaystyle{b^{{\ast}} = y_{ i} -\langle w^{{\ast}},x_{ i}\rangle +\epsilon,\quad b^{{\ast}} = y_{ i} -\langle w^{{\ast}},x_{ i}\rangle -\epsilon,}$$

respectively. Since both conditions cannot be fulfilled for the same index, it follows that \((\alpha _{i}^{-})^{{\ast}}(\alpha _{i}^{+})^{{\ast}} = 0\) and consequently, either \(\alpha _{i}^{{\ast}} = (\alpha _{i}^{+})^{{\ast}}\geq 0\) or \(\alpha _{i}^{{\ast}} = -(\alpha _{i}^{-})^{{\ast}}\leq 0\). Thus, one can obtain the intercept by

$$\displaystyle{ b^{{\ast}} = y_{ i} -\langle w^{{\ast}},x_{ i}\rangle -\epsilon,\quad i \in I_{S}. }$$
(16)

Soft Margin Regression

Relaxing the constraints in the hard margin model leads to the following linear soft margin regression problem with C > 0:

$$\displaystyle{\begin{array}{c} \mbox{ Linear SV soft margin regression (Primal problem)} \\ \frac{1} {2}\|w\|^{2} + C\sum _{ i=1}^{m}\left (\xi _{ i}^{+} +\xi _{ i}^{-}\right ) \rightarrow \min _{ w,b,\xi _{i}^{\pm }}\quad \mbox{ subject to}\quad \langle w,x_{i}\rangle + b - y_{i}\quad \leq \quad \epsilon +\xi _{i}^{-}, \\ \phantom{\frac{1} {2}\|w\|^{2} + C\sum _{ i=1}^{m}\left (\xi _{ i}^{+} +\xi _{ i}^{-}\right ) \rightarrow \min _{ w,b,\xi _{i}^{\pm }}\quad \mbox{ subject to}} -\langle w,x_{i}\rangle - b + y_{i} \leq \epsilon +\xi _{i}^{+}, \\ \phantom{\frac{1} {2}\|w\|^{2} + C\sum _{ i=1}^{m}\left (\xi _{ i}^{+} +\xi _{ i}^{-}\right ) \rightarrow \min _{ w,b,\xi _{i}^{\pm }}\mbox{ subject to}}\xi _{i}^{+},\xi _{ i}^{-}\phantom{i\rangle - b + y_{ i}} \geq 0. \end{array} }$$

For \(C = \infty\), this recovers the linear hard margin regression problem. The above constrained model can be rewritten as an unconstrained one by using a distance-based loss function. A function \(L: \mathcal{Y}\times \mathbb{R} \rightarrow [0,\infty )\) is called distance based if there exists a representing function \(l: \mathbb{R} \rightarrow [0,\infty )\) such that

$$\displaystyle{L(y,t) = l(y - t).}$$

In soft margin regression, the appropriate choice is Vapnik’s ε-insensitive loss function defined by

$$\displaystyle{l_{\epsilon }(x):=\max \{ 0,\vert x\vert -\epsilon \}.}$$

The function l ε is depicted in Fig. 4, left. Then the unconstrained primal model reads

Fig. 4  Vapnik’s ε-insensitive loss function for ε = 2 (left). Example of linear SV soft margin regression (right)

$$\displaystyle{\begin{array}{c} \mbox{ Linear SV soft margin regression (Primal problem)} \\ \frac{1} {2}\|w\|^{2} + C\sum \limits _{ i=1}^{m}L_{\epsilon }(y_{ i},\langle w,x_{i}\rangle + b)\; \rightarrow \;\min \limits _{w,b}. \end{array} }$$

The Lagrangian of the constrained problem is given by

$$\begin{array}{ll} L(w,b,\xi ^{\pm },\alpha ^{\pm },\beta ^{\pm }) = & \frac{1} {2}\|w\|^{2} + C\sum _{ i=1}^{m}(\xi _{ i}^{+} +\xi _{ i}^{-}) \\ & +\sum _{i=1}^{m}\alpha _{ i}^{-}\left (\langle w,x_{ i}\rangle + b - y_{i} -\epsilon -\xi _{i}^{-}\right ) \\ & +\sum _{i=1}^{m}\alpha _{ i}^{+}\left (-\langle w,x_{ i}\rangle - b + y_{i} -\epsilon -\xi _{i}^{+}\right ) \\ & -\sum _{i=1}^{m}\beta _{ i}^{+}\xi _{ i}^{+} -\sum _{ i=1}^{m}\beta _{ i}^{-}\xi _{ i}^{-}, {}\\ \end{array}$$

where \(\alpha _{i}^{\pm } \geq 0\) and \(\beta _{i}^{\pm } \geq 0\), \(i = 1,\ldots,m\). Setting the partial derivatives to zero leads to (12), (13) and

$$\displaystyle{\frac{\partial L} {\partial \xi ^{+}} = C\,1_{m} -\alpha ^{+} -\beta ^{+} = 0,\quad \frac{\partial L} {\partial \xi ^{-}} = C\,1_{m} -\alpha ^{-}-\beta ^{-} = 0.}$$

Using these relations the Lagrangian can be written exactly as in the hard margin problem, and the dual problem becomes in matrix-vector form

$$\displaystyle{\begin{array}{c} \mbox{ Linear SV soft margin regression (Dual problem)} \\ \frac{1} {2}(\alpha ^{+} -\alpha ^{-})^{\mbox{ T}}\mathbf{K}(\alpha ^{+} -\alpha ^{-}) +\epsilon \langle 1_{ m},\alpha ^{+} +\alpha ^{-}\rangle -\langle y,\alpha ^{+} -\alpha ^{-}\rangle \;\rightarrow \min \limits _{\alpha ^{+ },\alpha ^{-}} \\ \phantom{\frac{1} {2}(\alpha ^{+} -\alpha ^{-}}\mbox{ subject to}\quad \langle 1_{ m},\alpha ^{+} -\alpha ^{-}\rangle = 0,\quad 0 \leq \alpha ^{+},\alpha ^{-}\leq C \end{array} }$$

If \((\alpha ^{+})^{{\ast}},(\alpha ^{-})^{{\ast}}\) are the solution of this problem and \(\alpha ^{{\ast}} = (\alpha ^{+})^{{\ast}}- (\alpha ^{-})^{{\ast}}\), then the optimal weight w and the optimal function \(f_{w^{{\ast}}}\) are given by (14). The corresponding Kuhn-Tucker conditions are

$$\displaystyle\begin{array}{rcl} (\alpha _{i}^{-})^{{\ast}}\left (\epsilon +(\xi _{ i}^{-})^{{\ast}}-\langle w^{{\ast}},x_{ i}\rangle - b^{{\ast}} + y_{ i}\right )& =& 0, {}\\ (\alpha _{i}^{+})^{{\ast}}\left (\epsilon +(\xi _{ i}^{+})^{{\ast}} +\langle w^{{\ast}},x_{ i}\rangle + b^{{\ast}}- y_{ i}\right )& =& 0, {}\\ (C - (\alpha _{i}^{+})^{{\ast}})(\xi _{ i}^{+})^{{\ast}}\; =\; 0,\quad (C - (\alpha _{ i}^{-})^{{\ast}})(\xi _{ i}^{-})^{{\ast}}& =& 0,\quad i = 1,\ldots,m. {}\\ \end{array}$$

If \((\alpha _{i}^{-})^{{\ast}} > 0\) or \((\alpha _{i}^{+})^{{\ast}} > 0\), then

$$\displaystyle{b^{{\ast}} = y_{ i} -\langle w^{{\ast}},x_{ i}\rangle +\epsilon +\xi _{i}^{-},\quad b^{{\ast}} = y_{ i} -\langle w^{{\ast}},x_{ i}\rangle -\epsilon -\xi _{i}^{+},}$$

respectively. Both equations cannot be fulfilled at the same time so that one can conclude that either \(\alpha _{i}^{{\ast}} = (\alpha _{i}^{+})^{{\ast}}\geq 0\) or \(\alpha _{i}^{{\ast}} = -(\alpha _{i}^{-})^{{\ast}}\leq 0\). In case \(\alpha _{i}^{{\ast}} = (\alpha _{i}^{+})^{{\ast}} < C\), this results in the intercept

$$\displaystyle{ b^{{\ast}} = y_{ i} -\langle w^{{\ast}},x_{ i}\rangle -\epsilon,\quad i \in \tilde{ I}_{S}. }$$
(17)
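A corresponding sketch for soft margin regression stacks \(\alpha^{+}\) and \(\alpha^{-}\) into one vector, solves the dual with box constraints, and recovers \(w^{\ast}\) via (12) and \(b^{\ast}\) via (17); the solver and the toy data are again only illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def soft_margin_svr(X, y, C, eps):
    """Sketch of the linear SV soft margin regression dual with stacked variables [alpha+; alpha-]."""
    m = len(y)
    K = X @ X.T

    def dual(z):
        ap, am = z[:m], z[m:]
        d = ap - am
        return 0.5 * d @ K @ d + eps * (ap + am).sum() - y @ d

    res = minimize(dual, x0=np.zeros(2 * m), method="SLSQP",
                   bounds=[(0.0, C)] * (2 * m),
                   constraints=[{"type": "eq",
                                 "fun": lambda z: z[:m].sum() - z[m:].sum()}])
    ap, am = res.x[:m], res.x[m:]
    alpha = ap - am
    w = alpha @ X                                        # (12)
    idx = np.where((ap > 1e-8) & (ap < C - 1e-8))[0]     # points with 0 < (alpha_i^+)^* < C
    b = np.mean(y[idx] - X[idx] @ w - eps) if len(idx) else np.mean(y - X @ w)
    return w, b

# Toy 1D data (illustrative).
X = np.linspace(0.0, 4.0, 9).reshape(-1, 1)
y = 0.8 * X.ravel() + 0.3 + 0.05 * np.sin(5.0 * X.ravel())
print(soft_margin_svr(X, y, C=10.0, eps=0.1))
```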

3.1.3 Linear Least Squares Classification and Regression

Instead of the hinge loss function for classification and the ε-insensitive loss function for regression, other loss functions can be used. Popular margin-based and distance-based loss functions are the logistic loss for classification and regression

$$\displaystyle{l(yt):=\ln (1 + e^{-yt})\quad {\mathrm{and}}\quad l(y - t):= -\ln \frac{4e^{y-t}} {(1 + e^{y-t})^{2}},}$$

respectively. In contrast to the loss functions in the previous subsections, the logistic loss functions are differentiable in t, so that standard methods such as gradient descent or Newton(-like) methods can often be applied for computing the minimizers of the corresponding problems. For details, see, e.g., the books of Mitchell [73] or of Hastie et al. [47]. Other loss functions for regression are the pinball loss, the Huber function, and the p-th power absolute distance loss \(\vert y - t\vert ^{p}\), p > 0. For p = 2, the latter is the least squares loss

$$\displaystyle{l_{{\mathrm{lsq}}}(y - t) = (y - t)^{2}.}$$

Since \((y - t)^{2} = (1 - yt)^{2}\) for y ∈ {−1, 1}, the least squares loss is also margin based, and one can handle least squares classification and regression with just the same model, using \(y \in \{-1,1\}\) for classification and \(y \in \mathbb{R}\) for regression. In the unconstrained form, one has to minimize

$$\displaystyle{\begin{array}{c} \mbox{ Linear LS classification/regression (Primal problem)} \\ \frac{1} {2}\|w\|^{2} + \frac{C} {2} \sum \limits _{i=1}^{m}\mathop{\underbrace{ (\langle w,x_{ i}\rangle + b - y_{i})^{2}}}\limits _{ L_{{\mathrm{lsq}}}(y_{i},\langle w,x_{i}\rangle +b)}\; \rightarrow \;\min \limits _{w,b}. \end{array} }$$

This model was published as ridge regression by Hoerl and Kennard [49] and is a regularized version of the Gaussian model (1). Therefore, it can be considered as a special case of regularization theory introduced by Tikhonov and Arsenin [102]. The minimizer can be computed via a linear system of equations. To this end, rewrite the unconstrained problem in matrix-vector form

$$\displaystyle{\frac{1} {2}\|w\|^{2} + \frac{C} {2} \|{\mathbf{x}}^{\mbox{ T} }w + b1_{m} - y\|^{2}\; \rightarrow \;\min \limits _{ w,b},}$$

where

$$\displaystyle{{\mathbf{x}}:= (x_{1}\,\ldots \,x_{m}) = \left (\begin{array}{lll} x_{1,1} & \ldots & x_{m,1}\\ & \vdots & \\ x_{1,d}&\ldots &x_{m,d} \end{array} \right ).}$$

Setting the gradient (with respect to w and b) to zero, one obtains

$$\displaystyle\begin{array}{rcl} 0& =& \frac{1} {C}\,w + {\mathbf{x}}{\mathbf{x}}^{\mbox{ T} }w + b{\mathbf{x}}1_{m} -{\mathbf{x}}y, \\ 0& =& 1_{m}^{\mbox{ T} }{\mathbf{x}}^{\mbox{ T} }w - 1_{m}^{\mbox{ T} }y + mb\quad \Leftrightarrow \quad b =\bar{ y} -\langle w,\bar{x}\rangle,{}\end{array}$$
(18)

where \(\bar{y}:= \frac{1} {m}\sum _{i=1}^{m}y_{ i}\) and \(\bar{x}:= \frac{1} {m}\sum _{i=1}^{m}x_{ i}\). Hence b and w can be obtained by solving the linear system of equations

$$\displaystyle{ \left (\begin{array}{l|l} 1 &\bar{x}^{\mbox{ T}} \\ \bar{x}& \frac{1} {m}{\mathbf{x}}{\mathbf{x}}^{\mbox{ T}} + \frac{1} {mC}I \end{array} \right )\left (\begin{array}{*{10}c} b\\ w \end{array} \right ) = \left (\begin{array}{*{10}c} \bar{y} \\ \frac{1} {m}{\mathbf{x}}y \end{array} \right ). }$$
(19)

Instead of the above problem, one solves in general the “centered” problem

$$\displaystyle{\begin{array}{c} \mbox{ Linear LS classification/regression ${\it in centered version}$ (Primal problem)} \\ \frac{1} {2}\|\tilde{w}\|^{2} + \frac{C} {2} \sum \limits _{i=1}^{m}(\langle \tilde{w},\tilde{x}_{ i}\rangle +\tilde{ b} - y_{i})^{2}\; \rightarrow \;\min \limits _{\tilde{ w},\tilde{b}}, \end{array} }$$

where \(\tilde{x}_{i}:= x_{i} -\bar{ x}\), i = 1, , m. The advantage is that \(\bar{\tilde{x}} = 0_{m}\), where 0 m is the vector consisting of m zeros. Thus, (19) with \(\tilde{x}_{i}\) instead of x i becomes a separable system, and one obtains immediately that \(\tilde{b}^{{\ast}} =\bar{ y}\) and that \(\tilde{w}^{{\ast}}\) follows by solving the linear system with positive definite, symmetric coefficient matrix

$$\displaystyle{\left (\tilde{{\mathbf{x}}}\tilde{{\mathbf{x}}}^{\mbox{ T} } + \frac{1} {C}I\right )w =\tilde{ {\mathbf{x}}}y.}$$

This means that \(\tilde{w}\) is just the solution of the centered primal problem without intercept. Finally, one can check by the following argument that indeed \(w^{{\ast}} = \tilde{w}^{{\ast}}\):

$$\begin{array}{ll} (\tilde{w}^{{\ast}},\tilde{b}^{{\ast}}): & = \mathop{{\mathrm{argmin}}}_{\tilde{ w},\tilde{b}}\left \{\frac{1} {2}\|\tilde{w}\|^{2} + \frac{C} {2} \sum \limits _{i=1}^{m}(\langle \tilde{w},\tilde{x}_{ i}\rangle +\tilde{ b} - y_{i})^{2}\right \}, \\ \tilde{w}^{{\ast}}& = \mathop{{\mathrm{argmin}}}_{\tilde{ w}}\left \{\frac{1} {2}\|\tilde{w}\|^{2} + \frac{C} {2} \sum \limits _{i=1}^{m}(\langle \tilde{w},\tilde{x}_{ i}\rangle +\bar{ y} - y_{i})^{2}\right \} \\ & = \mathop{{\mathrm{argmin}}}_{\tilde{w}}\bigg\{\frac{1} {2}\|\tilde{w}\|^{2} + \frac{C} {2} \sum \limits _{i=1}^{m}(\langle \tilde{w},x_{ i}\rangle +\bar{ y} -\langle \tilde{ w},\bar{x}\rangle - y_{i})^{2}\bigg\} \\ \end{array}$$

and with (18) on the other hand

$$\displaystyle\begin{array}{rcl} (w^{{\ast}},b^{{\ast}})&:=& \mathop{{\mathrm{argmin}}}_{ w,b}\left \{\frac{1} {2}\|w\|^{2} + \frac{C} {2} \sum \limits _{i=1}^{m}(\langle w,x_{ i}\rangle + b - y_{i})^{2}\right \}, {}\\ w^{{\ast}}& =& \mathop{{\mathrm{argmin}}}_{ w}\left \{\frac{1} {2}\|w\|^{2} + \frac{C} {2} \sum \limits _{i=1}^{m}(\langle w,x_{ i}\rangle +\bar{ y} -\langle w,\bar{x}\rangle - y_{i})^{2}\right \}. {}\\ \end{array}$$

Note that \({\mathbf{x}}^{\mbox{ T}}{\mathbf{x}} = \mathbf{K} \in \mathbb{R}^{m,m}\), but this is not the coefficient matrix in (19). When turning to nonlinear methods in section “Nonlinear Learning,” it will be essential to work with \(\mathbf{K} = {\mathbf{x}}^{\mbox{ T}}{\mathbf{x}}\) instead of \({\mathbf{x}}{\mathbf{x}}^{\mbox{ T}}\). This can be achieved by switching to the dual setting. In the following, this dual approach is shown although it often makes no sense for the linear setting since the size of the matrix \(\mathbf{K}\) is in general larger than that of \({\mathbf{x}}{\mathbf{x}}^{\mbox{ T}} \in \mathbb{R}^{d,d}\). First, one reformulates the primal problem into a constrained one:

$$\displaystyle{\begin{array}{c} \mbox{ Linear LS classification/regression (Primal problem)} \\ \frac{1} {2}\|w\|^{2} + \frac{C} {2} \sum \limits _{i=1}^{m}\xi _{ i}^{2}\; \rightarrow \;\min \limits _{ w,b,\xi }\quad \mbox{ subject to}\quad \langle w,x_{i}\rangle + b - y_{i} =\xi _{i},\;i = 1,\ldots,m. \end{array} }$$

The Lagrangian reads

$$\displaystyle{L(w,b,\xi,\alpha ) = \frac{1} {2}\|w\|^{2} + \frac{C} {2} \sum _{i=1}^{m}\xi _{ i}^{2} -\sum _{ i=1}^{m}\alpha _{ i}\left (\langle w,x_{i}\rangle + b - y_{i} -\xi _{i}\right )}$$

and

$$\displaystyle\begin{array}{rcl} \frac{\partial L} {\partial w}& =& w -\sum _{i=1}^{m}\alpha _{ i}x_{i}\; =\; 0\quad \Leftrightarrow \quad w\; =\;\sum _{ i=1}^{m}\alpha _{ i}x_{i}, {}\\ \frac{\partial L} {\partial b} & =& \sum _{i=1}^{m}\alpha _{ i}\; =\; 0, {}\\ \frac{\partial L} {\partial \xi } & =& C\,\xi +\alpha \; =\; 0. {}\\ \end{array}$$

The equality constraint in the primal problem together with the above equalities leads to the following linear system of equations to determine the optimal α and b :

$$\displaystyle{ \left (\begin{array}{c|c} 0 & 1_{m}^{\mbox{ T}} \\ 1_{m}&\mathbf{K} + \frac{1} {C}I\\ \end{array} \right )\left (\begin{array}{*{10}c} b\\ \alpha \end{array} \right ) = \left (\begin{array}{*{10}c} 0\\ y \end{array} \right ). }$$
(20)

The optimal weight and the corresponding optimal function read

$$\displaystyle{ w^{{\ast}} =\sum _{ i=1}^{m}\alpha _{ i}^{{\ast}}x_{ i},\quad f_{w^{{\ast}}}(x) =\sum _{ i=1}^{m}\alpha _{ i}^{{\ast}}\langle x_{ i},x\rangle. }$$
(21)

In general, there is no sparse representation with respect to the vectors \(x_{i}\); see also Theorem 8.36 in the book of Steinwart and Christmann [93]. Therefore, this method is not called a support vector method in this paper. Finally, note that the centered approach also helps to avoid the intercept in the dual approach. Since this is no longer true when turning to the nonlinear setting, the intercept is kept here.
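As a small illustration of the dual least squares approach, the following sketch assembles and solves the linear system (20) and evaluates the predictor (21) plus intercept; replacing K by a kernel matrix gives the nonlinear variant discussed in the next section (data and the choice of C are illustrative).

```python
import numpy as np

# Toy data and regularization parameter (illustrative).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.2, 2.8])
m, C = len(y), 10.0

K = X @ X.T                                   # replace by a kernel matrix for the nonlinear case

# Assemble the bordered system (20) for (b, alpha).
A = np.zeros((m + 1, m + 1))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
A[1:, 1:] = K + np.eye(m) / C
sol = np.linalg.solve(A, np.r_[0.0, y])
b, alpha = sol[0], sol[1:]

# Predictor (21) plus the intercept b*.
f = lambda x: alpha @ (X @ np.asarray(x, dtype=float)) + b
print("b* =", b, " f(1.5) =", f([1.5]))
```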

3.2 Nonlinear Learning

A linear form of a decision or regression function may not be suitable for the task at hand. Figure 5 shows two examples, where a linear classifier is not appropriate.

Fig. 5  Two sets, where linear classification is not appropriate

A basic idea to handle such problems was proposed by Boser et al. [12] and consists of the following two steps which will be further explained in the rest of this subsection:

  1.

    Mapping of the input data \(X \subset \mathcal{X}\) into a feature space \(\Phi (\mathcal{X}) \subset \ell_{2}(I)\), where I is a countable (possibly finite) index set, by a nonlinear feature map

    $$\displaystyle{\Phi: \mathcal{X} \rightarrow \ell_{2}(I).}$$
  2.

    Application of the linear classification/regression model to the feature set

    $$\displaystyle{\{(\Phi (x_{1}),y_{1}),\ldots,(\Phi (x_{m}),y_{m})\}.}$$

    This means that instead of a linear function (2), we are searching for a function of the form

    $$\displaystyle{ f(x) = f_{w}(x):=\langle w,\Phi (x)\rangle _{\ell_{2}(I)} }$$
    (22)

    now. This nonlinear function on \(\mathcal{X}\) becomes linear in the feature space \(\Phi (\mathcal{X})\).

Figure 6 shows an example of a feature map. In this example, the set \(\{(x_{i},y_{i}): i = 1,\ldots,5\}\) is not linearly separable while \(\{(\Phi (x_{i}),y_{i}): i = 1,\ldots,5\}\) is linearly separable. In practical applications, in contrast to this example, the feature map often maps into a higher-dimensional, possibly even infinite-dimensional, space.

Fig. 6  Linearly nonseparable training data in the original space \(\mathcal{X} \subset \mathbb{R}^{2}\) (left) become separable in the feature space \(\Phi (\mathcal{X})\), where \(\Phi (x_{1},x_{2}):= (x_{1}^{2},x_{2}^{2})\) (right)

Together with the so-called kernel trick to avoid the direct work with the feature map \(\Phi \), this approach results in the successful support vector machine (SVM).
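A tiny sketch of the two-step idea with the explicit quadratic feature map of Fig. 6 (the points and the chosen linear rule in feature space are illustrative):

```python
import numpy as np

# Points inside/outside the unit circle are not linearly separable in the input space,
# but their images under Phi(x1, x2) = (x1^2, x2^2) are (illustrative data).
X = np.array([[0.3, 0.2], [-0.2, 0.4], [1.5, 1.0], [-1.2, -1.4], [0.1, -1.6]])
y = np.array([1, 1, -1, -1, -1])               # +1: inside the circle, -1: outside

Phi = X**2                                     # explicit feature map, applied row-wise

# In feature space the linear rule x1^2 + x2^2 = 1 separates the two classes.
w, b = np.array([-1.0, -1.0]), 1.0
print(np.sign(Phi @ w + b))                    # reproduces the labels y
```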

3.2.1 Kernel Trick

In general, one avoids working directly with the feature map by dealing with the dual problem and applying the so-called kernel trick. For a feature map \(\Phi \), define a kernel \(K: \mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}\) associated with \(\Phi \) by

$$\displaystyle{ K(x,t):=\langle \Phi (x),\Phi (t)\rangle _{l_{2}(I)}. }$$
(23)

More precisely, in practice one often follows the opposite way, namely, one starts with a suitable kernel which is known to have a representation of the form (23) without knowing \(\Phi \) explicitly.

A frequently applied group of kernels are continuous, symmetric, positive (semi-) definite kernels like the Gaussian

$$\displaystyle{K(x,t) = e^{-\|x-t\|^{2}/c^{2} }.}$$

These kernels, which are also called Mercer kernels , will be considered in detail in section “Reproducing Kernel Hilbert Spaces.” By Mercer’s theorem, it can be shown that a Mercer kernel possesses a representation

$$\displaystyle{K(x,t) =\sum _{j\in I}\sqrt{\lambda _{j}}\psi _{j}(x)\sqrt{\lambda _{j}}\psi _{j}(t),\quad x,t \in \mathcal{X}}$$

with \(L_{2}\)-orthonormal functions \(\psi _{j}\) and positive values \(\lambda _{j}\), where the right-hand side converges uniformly. Note that such a representation is guaranteed to exist, but in general the functions \(\psi _{j}\) are not known explicitly. The set \(\{\varphi _{j}:= \sqrt{\lambda _{j}}\psi _{j}: j \in I\}\) forms an orthonormal basis of a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}_{K}\). These spaces are considered in more detail in section “Reproducing Kernel Hilbert Spaces.” Then the feature map is defined by

$$\displaystyle{\Phi (x):= \left (\varphi _{j}(x)\right )_{j\in I} = \left (\sqrt{\lambda _{j}}\psi _{j}(x)\right )_{j\in I}.}$$

Using the orthonormal basis, one knows that for any \(f \in \mathcal{H}_{K}\) there exists a unique sequence \(w = {w_{f}}:= {({w_j})_{j\in I}} \in {\ell_{2}}(I)\) such that

$$\displaystyle{ f(x) =\sum _{j\in I}w_{j}\varphi _{j}(x) =\langle w,\Phi (x)\rangle,\quad {\mathrm{and}}\quad w_{j} =\langle f,\varphi _{j}\rangle _{\mathcal{H}_{K}}, }$$
(24)

where the convergence is uniform. Conversely, every sequence \(w \in \ell_{2}(I)\) defines a function \(f_{w}\) lying in \(\mathcal{H}_{K}\) by (24). Moreover, Parseval’s equality says that

$$\displaystyle{ \|f_{w}\|_{\mathcal{H}_{K}}:=\| w\|_{\ell_{2}(I)}. }$$
(25)

For nonlinear classification and regression purposes, one can follow exactly the lines of section “Linear Learning” except that one has to work in \(\Phi (\mathcal{X})\) instead of \(\mathcal{X}\). Using (22) instead of (2) and

$$\displaystyle{ \mathbf{K}:= \left (\langle \Phi (x_{i}),\Phi (x_{j})\rangle _{\ell_{2}(I)}\right )_{i,j=1}^{m} = \left (K(x_{ i},x_{j})\right )_{i,j=1}^{m} }$$
(26)

instead of the kernel matrix \(\mathbf{K}:= \left (\langle x_{i},x_{j}\rangle \right )_{i,j=1}^{m}\) in (7), the linear models from section “Linear Learning” can be rewritten as in the following subsections. Note again that K is positive semi-definite.
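For instance, the kernel matrix (26) for the Gaussian kernel can be assembled directly from pairwise squared distances; a short sketch (parameter and variable names are illustrative):

```python
import numpy as np

def gaussian_kernel_matrix(X, c):
    """Kernel matrix (26) for the Gaussian kernel K(x, t) = exp(-||x - t||^2 / c^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return np.exp(-sq_dists / c**2)

X = np.random.default_rng(0).normal(size=(5, 2))    # toy inputs (illustrative)
K = gaussian_kernel_matrix(X, c=1.0)
print(K.shape, np.allclose(K, K.T))                 # K is symmetric (and positive semi-definite)
```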

3.2.2 Support Vector Classification

In the following, the linear classifiers are generalized to feature spaces.

Hard Margin Classifier

The hard margin classification model is

$$\displaystyle{\begin{array}{c} \mbox{ SVM hard margin classification (Primal problem)} \\ \frac{1} {2}\|w\|_{\ell_{2}(I)}^{2}\; \rightarrow \;\min \limits _{ w,b}\quad \mbox{ subject to}\quad y_{i}\left (\langle w,\Phi (x_{i})\rangle _{\ell_{2}(I)} + b\right ) \geq 1,\;i = 1,\ldots,m. \end{array} }$$

Interestingly, if \(\Phi \) is associated with a Mercer kernel K, then \(f(x) =\langle w,\Phi (x)\rangle _{\ell_{2}(I)}\) lies in the RKHS \(\mathcal{H}_{K}\), and the model can be rewritten using (25) from the point of view of RKHS as

$$\displaystyle{\begin{array}{c} \mbox{ SVM hard margin classification ${\it in RKHS}$ (Primal problem)} \\ \frac{1} {2}\|f\|_{\mathcal{H}_{K}}^{2}\; \rightarrow \;\min \limits _{f\in \mathcal{H}_{K}}\quad \mbox{ subject to}\quad y_{i}\left (f(x_{i}) + b\right ) \geq 1,\;i = 1,\ldots,m. \end{array} }$$

The dual problem reads

$$\displaystyle{\begin{array}{c} \mbox{ SVM hard margin classification (Dual problem)} \\ \frac{1} {2}\sum \limits _{i=1}^{m}\sum \limits _{ j=1}^{m}y_{ i}\alpha _{i}y_{j}\alpha _{j}\langle \Phi (x_{i}),\Phi (x_{j})\rangle _{\ell_{2}(I)} -\sum \limits _{i=1}^{m}\alpha _{ i}\; \rightarrow \;\min \limits _{\alpha }\quad \mbox{ subject to}\quad \sum \limits _{i=1}^{m}y_{ i}\alpha _{i} = 0, \\ \phantom{\frac{1} {2}\sum \limits _{i=1}^{m}\sum \limits _{ j=1}^{m}y_{ i}\alpha _{i}y_{j}\alpha _{j}\langle \Phi (x_{i}),\Phi (x_{j})\rangle _{\ell_{2}(I)} -\sum \limits _{i=1}^{m}\alpha _{ i}\; \rightarrow \;\min \limits _{\alpha }\quad \mbox{ subject to}\quad }\alpha _{i} \geq 0,\;i = 1,\ldots,m. \end{array} }$$

and with \(\mathbf{K}\) defined by (26) the matrix-vector form of the dual problem looks like that for the linear hard margin classifier.

Let \(\alpha ^{{\ast}}\) be the minimizer of the dual problem. Then, by (9) together with the feature space modification, the optimal weight and the function \(f_{w^{{\ast}}}\) become

$$\displaystyle{ w^{{\ast}} =\sum _{ i\in I_{S}}\alpha _{i}^{{\ast}}y_{ i}\Phi (x_{i}),\ \ f_{w^{{\ast}}}(x) =\sum _{i\in I_{S}}\alpha _{i}^{{\ast}}y_{ i}\langle \Phi (x_{i}),\Phi (x)\rangle _{\ell_{2}(I)} =\sum _{i\in I_{S}}\alpha _{i}^{{\ast}}y_{ i}K(x_{i},x). }$$
(27)

Thus, one can compute \(f_{w^{{\ast}}}\) knowing only the kernel and not the feature map itself. One property required of a Mercer kernel used for learning purposes is that it can be evaluated easily at points of \(\mathcal{X}\times \mathcal{X}\). This is, for example, the case for the Gaussian. Finally, using (10) in the feature space, the intercept can be computed by

$$\displaystyle{b^{{\ast}} = y_{ i} -\langle w^{{\ast}},\Phi (x_{ i})\rangle _{\ell_{2}(I)} = y_{i} -\sum _{j\in I_{S}}\alpha _{j}^{{\ast}}y_{ j}K(x_{j},x_{i}),\quad i \in I_{S}}$$

and the margin \(\gamma = 1/\|w^{{\ast}}\|_{\ell_{2}(I)}\) by using \(\|w^{{\ast}}\|_{\ell_{2}(I)}^{2} = (\alpha ^{{\ast}})^{\mbox{ T}}\mathbf{YKY}\alpha ^{{\ast}}\).

Soft Margin Classifier

The soft margin classification model in the feature space is

$$\displaystyle{\begin{array}{c} \mbox{ SVM soft margin classification (Primal problem)} \\ \frac{1} {2}\|w\|_{\ell_{2}(I)}^{2} + C\sum \limits _{ i=1}^{m}\xi _{ i} \rightarrow \min \limits _{w,b,\xi }\quad \mbox{ subject to}\quad y_{i}\left (\langle w,\Phi (x_{i})\rangle _{\ell_{2}(I)} + b\right ) \geq 1 -\xi _{i}, \\ \phantom{\frac{1} {2}\|w\|_{\ell_{2}(I)}^{2} + C\sum \limits _{ i=1}^{m}\xi _{ i} \rightarrow } i = 1,\ldots,m \\ \phantom{\frac{1} {2}\|w\|_{\ell_{2}(I)}^{2} + C\sum \limits _{ i=1}^{m}\xi _{ i} \rightarrow \min \limits _{w,b,\xi }}\xi _{i} \geq 0,\;i = 1,\ldots,m. \end{array} }$$

If \(\Phi \) is associated with a Mercer kernel K, the corresponding unconstrained version reads in the RKHS

$$\displaystyle{\begin{array}{c} \mbox{ SVM soft margin classification ${\it in RKHS}$ (Primal problem) } \\ \frac{1} {2}\|f\|_{\mathcal{H}_{K}}^{2} + C\sum \limits _{i=1}^{m}L_{h}\left (y_{i},f(x_{i}) + b\right )\; \rightarrow \;\min \limits _{f\in \mathcal{H}_{K}}. \end{array} }$$

With \(\mathbf{K}\) defined by (26), the matrix-vector form of the dual problem looks like that for the linear soft margin classifier. The function \(f_{w^{{\ast}}}\) reads as in (27), and, using (11), the intercept can be computed by

$$\displaystyle{b^{{\ast}} = y_{ i} -\langle w^{{\ast}},\Phi (x_{ i})\rangle _{\ell_{2}(I)} = y_{i} -\sum _{j\in \tilde{I}_{S}}\alpha _{j}^{{\ast}}y_{ j}K(x_{j},x_{i}),\quad i \in \tilde{ I}_{S}.}$$
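In practice one rarely implements these duals by hand. As an illustration (assuming scikit-learn is available), the following sketch trains a soft margin classifier with the Gaussian kernel; the library parameter gamma plays the role of \(1/c^{2}\), and the data are toy values.

```python
import numpy as np
from sklearn.svm import SVC    # assumes scikit-learn is installed

# Circle-like toy labels (illustrative).
X = np.array([[0.3, 0.2], [-0.2, 0.4], [0.1, -0.3],
              [1.5, 1.0], [-1.2, -1.4], [0.1, -1.6]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)   # solves the soft margin dual
print(clf.support_)                                   # indices of the support vectors
print(clf.predict([[0.0, 0.1], [2.0, 2.0]]))          # class labels, i.e., sgn(f_{w*}(x) + b*)
```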

3.2.3 Support Vector Regression

In the following, the linear regression models are generalized to feature spaces.

Hard Margin Regression

One obtains

$$\displaystyle{\begin{array}{c} \mbox{ SVM hard margin regression (Primal problem)} \\ \frac{1} {2}\|w\|_{\ell_{2}(I)}^{2}\; \rightarrow \;\min _{ w,b}\quad \mbox{ subject to}\quad \langle w,\Phi (x_{i})\rangle _{\ell_{2}(I)} + b - y_{i}\quad \leq \quad \epsilon, \\ \phantom{\frac{1} {2}\|w\|_{\ell_{2}(I)}^{2}\; \rightarrow \;\min _{ w,b}\quad \mbox{ subject to}\quad \quad \quad \quad } -\langle w,\Phi (x_{i})\rangle _{\ell_{2}(I)} - b + y_{i} \leq \epsilon,\quad i = 1,\ldots,m. \end{array} }$$

If \(\Phi \) is associated with a Mercer kernel K, then \(f(x) =\langle w,\Phi (x)\rangle _{\ell_{2}(I)}\) lies in the RKHS \(\mathcal{H}_{K}\), and the model can be rewritten using (25) from the point of view of RKHS as

$$\displaystyle{\begin{array}{c} \mbox{ SVM hard margin regression ${\it in RKHS}$ (Primal problem)} \\ \frac{1} {2}\|f\|_{\mathcal{H}_{K}}^{2}\; \rightarrow \;\min _{f\in \mathcal{H}_{K}}\quad \mbox{ subject to}\quad f(x_{i}) + b - y_{i}\quad \leq \epsilon, \\ \phantom{\frac{1} {2}\|f\|_{\mathcal{H}_{K}}^{2}\; \rightarrow \;\min _{f\in \mathcal{H}_{K}}\quad \mbox{ subject to}\quad \quad \quad \quad } - f(x_{i}) - b + y_{i} \leq \epsilon,\quad i = 1,\ldots,m. \end{array} }$$

The dual problem reads in matrix-vector form like the dual problem for linear SV hard margin regression, except that the kernel matrix \(\mathbf{K}\) defined by (26) has to be used. Let \((\alpha ^{+})^{{\ast}},(\alpha ^{-})^{{\ast}}\) be the solution of this dual problem and \(\alpha ^{{\ast}} = (\alpha ^{+})^{{\ast}}- (\alpha ^{-})^{{\ast}}\). Then one can compute the optimal function \(f_{w^{{\ast}}}\) using (14) with the corresponding feature space modification as

$$\displaystyle{ f_{w^{{\ast}}}(x) =\sum _{i\in I_{rS}}\alpha _{i}^{{\ast}}K(x_{ i},x). }$$
(28)

One obtains sparse representations in terms of the support vectors. By (16), the intercept can be computed by

$$\displaystyle{b^{{\ast}} = y_{ i} -\sum _{j\in I_{rS}}\alpha _{j}^{{\ast}}K(x_{ j},x_{i})-\epsilon,\quad i \in I_{S}.}$$

Soft Margin Regression

In the feature space, the soft margin regression model is

$$\displaystyle{\begin{array}{c} \mbox{ SVM soft margin regression (Primal problem)} \\ \frac{1} {2}\|w\|_{\ell_{2}(I)}^{2} + C\sum _{ i=1}^{m}(\xi _{ i}^{+} +\xi _{ i}^{-})\; \rightarrow \;\min _{ w,b,\xi _{i}^{\pm }}\mbox{ s.t.}\quad \langle w,\Phi (x_{i})\rangle _{\ell_{2}(I)} + b - y_{i} \leq \epsilon +\xi _{i}^{-}, \\ \phantom{\frac{1} {2}\|w\|_{\ell_{2}(I)}^{2} + C\sum _{ i=1}^{m}(\xi _{ i}^{+} +\xi _{ i}^{-})\; \rightarrow \;\min _{ w,b,\xi _{i}^{\pm }}\mbox{ s.t.}\quad } -\langle w,\Phi (x_{i})\rangle _{\ell_{2}(I)} - b + y_{i} \leq \epsilon +\xi _{i}^{+}, \\ \phantom{\frac{1} {2}\|w\|_{\ell_{2}(I)}^{2} + C\sum _{ i=1}^{m}(\xi _{ i}^{+} +\xi _{ i}^{-})\; \rightarrow \;\min _{ w,b,\xi _{i}^{\pm }}\mbox{ s.t.}\quad -\langle w,\Phi (x_{i})\rangle _{\ell_{2}(I)} -}\xi _{i}^{+},\xi _{ i}^{-}\geq 0. \end{array} }$$

Having a feature map associated with a Mercer kernel K, the corresponding unconstrained problem can be written as the following minimization problem in the RKHS \(\mathcal{H}_{K}\):

$$\displaystyle{\begin{array}{c} \mbox{ SVM soft margin regression ${\it in RKHS}$ (Primal problem)} \\ \frac{1} {2}\|f\|_{\mathcal{H}_{K}}^{2} + C\sum \limits _{i=1}^{m}L_{\epsilon }(y_{i},f(x_{i}) + b)\; \rightarrow \;\min \limits _{f\in \mathcal{H}_{K}}. \end{array} }$$

The dual problem looks like the dual problem for linear SV soft margin regression but with the kernel matrix (26). From the solution of the dual problem \((\alpha ^{+})^{{\ast}},(\alpha ^{-})^{{\ast}}\), one can compute the optimal function \(f_{w^{{\ast}}}\) as in (28) and the optimal intercept using (17) as

$$\displaystyle{b^{{\ast}} = y_{ i} -\sum _{j\in I_{rS}}\alpha _{j}^{{\ast}}K(x_{ j},x_{i})-\epsilon,\quad i \in \tilde{ I}_{S}.}$$

Figure 7, left, shows an SVM soft margin regression function for the data in Fig. 2.

Fig. 7

SVM regression using the Gaussian with c = 8 for the data in Fig. 2. SVM soft margin regression curve with C = 0.5 and ε = 0.2 (left). Least squares SVM regression curve with C = 40 (right)

3.2.4 Relations to Sparse Approximation in RKHSs, Interpolation by Radial Basis Functions, and Kriging

SVM regression is related to various tasks in approximation theory. Some of them will be sketched in the following.

Relation to Sparse Approximation in RKHSs Let \(\mathcal{H}_{K}\) be a RKHS with kernel K. Consider the problem of finding for an unknown function \(\tilde{f} \in \mathcal{H}_{K}\) with given values \(\tilde{f}(x_{i}) = y_{i}\), \(i = 1,\ldots,m\), an approximating function of the form

$$\displaystyle{ f(x):=\sum _{ i=1}^{m}\alpha _{ i}K(x,x_{i}) \in \mathcal{H}_{K} }$$
(29)

with only a few summands. A starting point would be to minimize the \(\mathcal{H}_{K}\)-norm of the error and to penalize the so-called \(\ell_{0}\)-norm of α given by \(\|\alpha \|_{0}:= \vert \{i:\alpha _{i}\not =0\}\vert \) to enforce sparsity. Unfortunately, the complexity of solving such \(\ell_{0}\)-penalized problems increases exponentially with m. One remedy is to replace the \(\ell_{0}\)-norm by the \(\ell_{1}\)-norm, i.e., to deal with

$$\displaystyle{ \frac{1} {2}\|\tilde{f}(x) -\sum _{i=1}^{m}\alpha _{ i}K(x,x_{i})\|_{\mathcal{H}_{K}}^{2} +\epsilon \|\alpha \| _{ 1}\; \rightarrow \;\min _{\alpha }, }$$
(30)

where ε > 0. This problem and its relation to support vector regression were considered by Girosi [44] and Evgeniou et al. [37]. Using the relations in RKHS from section “Reproducing Kernel Hilbert Spaces,” in particular the reproducing property (H2), this problem becomes

$$\displaystyle{ \frac{1} {2}\alpha ^{\mbox{ T} }\mathbf{K}\alpha -\sum _{i=1}^{m}\alpha _{ i}y_{i} + \frac{1} {2}\|\tilde{f}\|_{\mathcal{H}_{K}}^{2} +\epsilon \|\alpha \| _{ 1}\; \rightarrow \;\min _{\alpha }, }$$
(31)

where \(\mathbf{K}\) is defined by (26). With the splitting

$$\displaystyle{\alpha _{i} =\alpha _{ i}^{+} -\alpha _{ i}^{-},\;\alpha _{ i}^{\pm }\geq 0,\;\alpha _{ i}^{+}\alpha _{ i}^{-} = 0,\;i = 1,\ldots,m}$$

and consequently \(\vert \alpha _{i}\vert =\alpha _{ i}^{+} +\alpha _{ i}^{-}\), the sparse approximation model (30) finally takes the form of the dual problem of the SVM hard margin regression without intercept:

$$\displaystyle{\begin{array}{c} \mbox{ SVM hard margin regression ${\it without intercept}$ (Dual problem)} \\ \frac{1} {2}(\alpha ^{+} -\alpha ^{-})^{\mbox{ T}}\mathbf{K}(\alpha ^{+} -\alpha ^{-}) +\epsilon \langle 1_{ m},\alpha ^{+} +\alpha ^{-}\rangle -\langle y,\alpha ^{+} -\alpha ^{-}\rangle \;\rightarrow \min \limits _{\alpha ^{+ },\alpha ^{-}} \\ \mbox{ subject to}\quad \alpha ^{\pm }\geq 0. \end{array} }$$

Note that for ε > 0 the additional constraints \(\alpha _{i}^{+}\alpha _{i}^{-} = 0\), \(i = 1,\ldots,m\), are automatically fulfilled by the minimizer since otherwise the Kuhn-Tucker conditions (15) without intercept would imply the contradiction \(f(x_{i}) = y_{i}+\epsilon = y_{i}-\epsilon\).

Relation to the Interpolation by Radial Basis Functions For ε = 0, problem (30), resp. (31) becomes

$$\displaystyle{F(\alpha ):= \frac{1} {2}\alpha ^{\mbox{ T} }\mathbf{K}\alpha -\alpha ^{\mbox{ T} }y\; \rightarrow \;\min _{\alpha }.}$$

If K is positive definite, the solution of this problem is given by the solution of

$$\displaystyle{\nabla F(\alpha ) = \mathbf{K}\alpha - y = 0,\quad \Leftrightarrow \quad \mathbf{K}\alpha = y}$$

and the approximating function f reads

$$\displaystyle{ f(x) =\langle \mathbf{K}^{-1}y,\left (K(x,x_{ i})\right )_{i=1}^{m}\rangle. }$$
(32)

This is just the solution of the interpolation problem of finding f of the form (29) such that \(f(x_{i}) = y_{i}\) for all \(i = 1,\ldots,m\). If the positive definite kernel K arises from a radial function \(\kappa (x) = k(\|x\|^{2})\), i.e., \(K(x,t) =\kappa (x - t) = k(\|x - t\|^{2})\), as, e.g., for the Gaussian or the inverse multiquadric described in section “Reproducing Kernel Hilbert Spaces,” this interpolation problem is called interpolation by radial basis functions.

If the kernel K arises from a conditionally positive definite radial function κ of order ν, e.g., from a thin plate spline, the matrix K is in general not positive semi-definite. In this case, it is useful to replace the function f in (29) by

$$\displaystyle{f(x):=\sum _{ i=1}^{m}\alpha _{ i}K(x,x_{i}) +\sum _{ k=1}^{n}\beta _{ k}p_{k}(x),}$$

where n is the dimension of the polynomial space \(\Pi _{\nu -1}(\mathbb{R}^{d})\) and \(\{p_{k}: k = 1,\ldots,n\}\) is a basis of \(\Pi _{\nu -1}(\mathbb{R}^{d})\). The additional degrees of freedom in the interpolation problem \(f(x_{i}) = y_{i}\), \(i = 1,\ldots,m\), are compensated by adding the new conditions

$$\displaystyle{\sum _{i=1}^{m}\alpha _{ i}p_{k}(x_{i}) = 0,\quad k = 1,\ldots,n.}$$

This leads to the final problem of finding \(\alpha:= (\alpha _{i})_{i=1}^{m}\) and \(\beta:= (\beta _{k})_{k=1}^{n}\) such that

$$\displaystyle{ \left (\begin{array}{*{10}c} \mathbf{K} &\mathbf{P}\\ \mathbf{P} ^{\mbox{ T} } & \mathbf{0} \end{array} \right )\left (\begin{array}{*{10}c} \alpha \\ \beta \end{array} \right ) = \left (\begin{array}{*{10}c} y\\ 0 \end{array} \right ),\qquad \mathbf{P}:= \left (p_{k}(x_{i})\right )_{i,k=1}^{m,n}. }$$
(33)

If the points \(\{x_{i}: i = 1,\ldots,m\}\), \(m \geq {\mathrm{dim}}\left (\Pi _{\nu -1}(\mathbb{R}^{d})\right )\), are \(\Pi _{\nu -1}(\mathbb{R}^{d})\)-unisolvent, i.e., the zero polynomial is the only polynomial from \(\Pi _{\nu -1}(\mathbb{R}^{d})\) that vanishes on all of them, then the linear system of equations (33) has a unique solution. To verify that the coefficient matrix in (33) is indeed invertible, consider the corresponding homogeneous system. The second block of equations \(\mathbf{P}^{\mbox{ T}}\alpha = 0\) means that α satisfies (39). Multiplying the first block of equations by \(\alpha ^{\mbox{ T}}\) gives \(0 =\alpha ^{\mbox{ T}}\mathbf{K}\alpha + (\mathbf{P}^{\mbox{ T}}\alpha )^{\mbox{ T}}\beta =\alpha ^{\mbox{ T}}\mathbf{K}\alpha\). By the definition of conditionally positive definite functions of order ν, this is only possible if α = 0. But then \(\mathbf{P}\beta = 0\), and since the points \(\{x_{i}: i = 1,\ldots,m\}\) are \(\Pi _{\nu -1}(\mathbb{R}^{d})\)-unisolvent, this implies \(\beta = 0\).

The interpolation by radial basis functions with (conditionally) positive definite kernels, including fast evaluation techniques for the interpolating function f, was examined by many authors; for an overview see, e.g., the books of Buhmann [18], Wendland [114], and Fasshauer [39].

Relation to Kriging The interpolation results can be derived in another way by so-called kriging. Kriging is a group of geostatistical techniques to interpolate the unknown value of a random field from observations of its value at nearby locations. Based on the pioneering work of Krige [59] on the plotting of distance-weighted average gold grades at the Witwatersrand reef in South Africa, the French mathematician Matheron [70] developed its theoretical foundations. Let S(x) denote a random field whose expectation fulfills \(E(S(x)) = 0\), which is the setting of so-called simple kriging. Let \(K(x_{i},x_{j}):= {\mathrm{Cov}}(S(x_{i}),S(x_{j}))\) and \(\mathbf{K}:= \left (K(x_{i},x_{j})\right )_{i,j=1}^{m}\). The aim is to approximate the value \(S(x_{0})\) from observations \(S(x_{i}) = y_{i}\), \(i = 1,\ldots,m\), by the kriging estimator

$$\displaystyle{\hat{S}(x_{0}) =\sum _{ i=1}^{m}\omega _{ i}(x_{0})S(x_{i})}$$

in such a way that the variance of the error is minimal, i.e.,

$$\displaystyle\begin{array}{ll} {\mathrm{Var}}(\hat{S} - S) =& {\mathrm{Var}}(\hat{S}) + {\mathrm{Var}}(S) - 2{\mathrm{Cov}}(S,\hat{S}) \\ =& \sum _{i=1}^{m}\sum _{ j=1}^{m}\omega _{ i}(x_{0})\omega _{j}(x_{0})K(x_{i},x_{j}) \\ & -2\sum _{i=1}^{m}\omega _{ i}(x_{0})K(x_{0},x_{i}) + {\mathrm{Var}}(S)\; \rightarrow \;\min _{\omega (x_{0})}. \end{array}$$

Setting the gradient to zero, the minimizer \(\omega ^{{\ast}} = \left (\omega _{1}^{{\ast}}(x_{0}),\ldots,\omega _{m}^{{\ast}}(x_{0})\right )^{\mbox{ T}}\) is given by the solution of the following linear system of equations:

$$\displaystyle{\mathbf{K}\omega = \left (K(x_{0},x_{i})\right )_{i=1}^{m}.}$$

In case K is invertible, we get

$$\displaystyle{\hat{S}(x_{0}) =\langle y,\mathbf{K}^{-1}\left (K(x_{ 0},x_{i})\right )_{i=1}^{m}\rangle =\langle \mathbf{K}^{-1}y,\left (K(x_{ 0},x_{i})\right )_{i=1}^{m}\rangle.}$$

Supposing the same values \(K(x_{i},x_{j})\) as in the interpolation task, this is exactly the same value as \(f(x_{0})\) from the radial basis interpolation problem (32).

3.2.5 Least Squares Classification and Regression

Also in the feature space, least squares classification and regression can be treated by the same model. So-called Least Squares Support Vector Classifiers were introduced by Suykens and Vandewalle [99], while least squares regression was also considered within regularization network approaches, e.g., by Evgeniou et al. [37] and Wahba [112]. The least squares model in the feature space is

$$\displaystyle{\begin{array}{c} \mbox{ LS classification/regression in feature space (Primal problem)} \\ \frac{1} {2}\|w\|_{\ell_{2}(I)}^{2} + \frac{C} {2} \sum \limits _{i=1}^{m}(\langle w,\Phi (x_{ i})\rangle _{\ell_{2}(I)} + b - y_{i})^{2}\; \rightarrow \;\min \limits _{ w,b}. \end{array} }$$

If the feature map is associated with a Mercer kernel K, this becomes a minimization problem in the RKHS \(\mathcal{H}_{K}\):

$$\displaystyle{\begin{array}{c} \mbox{ LS classification/regression ${\it in RKHS}$ (Primal problem)} \\ \frac{1} {2}\|f\|_{\mathcal{H}_{K}}^{2} + \frac{C} {2} \sum \limits _{i=1}^{m}(f(x_{ i}) + b - y_{i})^{2}\; \rightarrow \;\min \limits _{f\in \mathcal{H}_{K}}. \end{array} }$$

Setting gradients to zero, one can try to solve this primal problem via the linear system of equations (19) with \(\mathbf{X}:= \left (\Phi (x_{1})\ldots \Phi (x_{m})\right )\). However, one then has to work with \(\mathbf{X}\mathbf{X}^{\mbox{ T}}\), which is only possible if the feature space is finite dimensional. In contrast, the dual approach leads to the linear system of equations (20) which involves only the kernel matrix \(\mathbf{K} = \mathbf{X}^{\mbox{ T}}\mathbf{X}\) from (26). Knowing the dual variable \(\alpha ^{{\ast}}\), the optimal function \(f_{w^{{\ast}}}\) can be computed using (21) with the feature space modification as

$$\displaystyle{f_{w^{{\ast}}}(x) =\sum _{ i=1}^{m}\alpha _{ i}^{{\ast}}\langle \Phi (x_{ i}),\Phi (x)\rangle =\sum _{i=1}^{m}\alpha _{ i}^{{\ast}}K(x_{ i},x).}$$

In general there is no sparse representation with respect to the vectors \(x_{i}\). For more information on least squares kernel methods, the reader may consult the book of Suykens et al. [98]. Figure 7, right, shows a least squares SVM function for the data in Fig. 2.

3.2.6 Other Models

There are numerous models related to the classification and regression models of the previous subsections. A simple classification model which uses only the hinge loss function without penalizing the weight was proposed by Bennett and Mangasarian [7]:

$$\displaystyle{{\sum \limits _{i=1}^{m}}{L_{h}}\left ({y_{i}},\langle w,{x_{i}}\rangle + b\right )\; \rightarrow \;\min \limits _{w,b}.}$$

This approach is called robust linear programming (RLP) and requires only linear programming methods. Note that the authors weighted the training errors by \(1/n_{\pm }\), where \(n_{\pm }:= \vert \{i: y_{i} = \pm 1\}\vert \). The linear SV soft margin classifier adds just the penalizer \(\frac{\lambda }{2}\|w\|_{2}^{2}\) with \(\lambda = 1/C\), \(0 < C < \infty \), to the RLP term, which leads to quadratic programming. Alternatively, one can add the \(\ell_{1}\)-norm of the weight instead of its \(\ell_{2}\)-norm, as was done by Bradley and Mangasarian [16]:

$$\displaystyle{\sum \limits _{i=1}^{m}L_{ h}\left (y_{i},\langle w,x_{i}\rangle + b\right ) +\lambda \| w\|_{1}\; \rightarrow \;\min \limits _{w,b}.}$$

As in (30), the \(\ell_{1}\) penalizer enforces sparsity of the solution vector w. Note that it is the sparsity of w itself which is addressed here, not a sparse representation of w as a linear combination of the support vectors \(x_{i}\). The \(\ell_{1}\)-penalizer was introduced in the statistical context of linear regression in conjunction with the least squares loss function by Tibshirani [101] and is called “LASSO” (Least Absolute Shrinkage and Selection Operator):

$$\displaystyle{\sum \limits _{i=1}^{m}L_{\mathrm{ lsq}}\left (y_{i},\langle w,x_{i}\rangle + b\right ) +\lambda \| w\|_{1}\; \rightarrow \;\min \limits _{w,b}}$$

As emphasized in section “Support Vector Regression,” the \(\ell_{1}\)-norm is more or less a replacement for the \(\ell_{0}\)-norm to make problems computationally tractable. Other substitutes of the \(\ell_{0}\)-norm are possible, e.g., \(\|w\|_{\ell_{0}} \approx \sum _{j=1}^{d}(1 - e^{-\nu \vert w_{j}\vert })\), ν > 0, which gives

$$\displaystyle{\sum \limits _{i=1}^{m}L_{ h}\left (y_{i},\langle w,x_{i}\rangle + b\right ) +\lambda \sum _{ j=1}^{d}(1 - e^{-\nu \vert w_{j}\vert })\; \rightarrow \;\min \limits _{ w,b}.}$$

This is a nonconvex model and was proposed by Bradley and Mangasarian [16] and by Weston et al. [115] as FSV (Feature Selection concaVe). Numerical solution methods via successive linearization algorithms and via difference-of-convex-functions algorithms in the sense of Tao and An [100] were applied.

Further, one can couple several penalizers and generalize the models to the feature space to obtain nonlinear classifiers, as was done, e.g., by Neumann et al. [75]. Figure 8 shows an example for the binary classification of specific organs in CT scans with a so-called “\(\ell_{2}\)-\(\ell_{0}\)-SV” machine taken from the above paper, where more information can be found. In particular, a \(\chi ^{2}\) kernel was used here.

Fig. 8

From left to right: (i) Sample CT slice from a three-dimensional scan of the data set with contours of bladder and prostate. (ii) A zoom within the region of interest shows that the organs are very difficult to distinguish visually. (iii) Manual classification by an expert as “accepted truth.” (iv) A classification result: The images are filtered by a three-dimensional steerable pyramid filter bank with 16 angular orientations and four decomposition levels. Then local histograms are built for the filter responses with ten bins per channel. Including the original grey values, this results in 650 features per image voxel which are used for classification by the “\(\ell_{2}\)-\(\ell_{0}\)-SV” machine

3.2.7 Multi-class Classification and Multitask Learning

So far only binary classification was considered. Assume now that one wants to learn K > 2 classes. Figure 9 shows a typical example with 12 classes for the classification of mammals; see Amit et al. [2].

Fig. 9

Classification of mammal images: 12 of the 72 classes of animals which were used for classification in Amit et al. [2]. Typically these classes share common characteristics within the different rows above (deer, canines, felines, rodents), e.g., in texture or shape

Some attempts to extend the binary case to multiple classes were made by adding constraints for every class; see Weston and Watkins [116] and Vapnik [105]. In the case of many classes, this approach often results in quadratic problems which are hard to solve and difficult to store. In the following, two general approaches to handle multiple classes are presented, namely:

  1. (i)

    Vector-valued binary class labeling,

  2. (ii)

    K class labeling.

The dominating approach for solving multi-class problems using SVMs is based on reducing a single multi-class problem to multiple binary problems. For instance, a common method is to build a set of binary classifiers, where each classifier distinguishes between one of the labels and the rest. This so-called one-versus-all classifier cannot capture correlations between different classes since it breaks the multi-class problem into independent binary problems. More generally, one can assign to each class a vector-valued binary class label \((y^{(1)},\ldots,y^{(\kappa )})^{\mbox{ T}} \in \{-1,1\}^{\kappa }\) and use a classifier based on

$$\displaystyle{F(x):= \left ({\mathrm{sgn}}(\langle w^{(k)},x\rangle + b^{(k)})\right )_{ k=1}^{\kappa }.}$$

For example, in the one-versus-all method with κ = K, the classes can be labeled by \(\{(-1 + 2\delta _{r,k})_{k=1}^{K}: r = 1,\ldots,K\}\), and the assignment of x to a class can be made according to the shortest Hamming distance of F(x) from these class labels. More sophisticated methods use values κ > K and error-correcting output codes as proposed by Dietterich and Bakiri [32]. Note that \(2^{\kappa }\) different labels are in general possible with binary vectors of length κ, which is an upper bound for the number of classes that could be learned. In the learning process, one can obtain \(w^{(k)} \in \mathbb{R}^{d}\) and \(b^{(k)} \in \mathbb{R}\) by solving, e.g.,

$$\displaystyle{ \frac{1} {2}\sum _{k=1}^{\kappa }\|w^{(k)}\|^{2} +\sum _{ k=1}^{\kappa }C_{ k}\sum _{i=1}^{m}L\left (y_{ i},\langle w^{(k)},x_{ i}\rangle + b^{(k)}\right )\; \rightarrow \;\min _{ w^{(k)},b^{(k)}}, }$$
(34)

where L is some loss function. Note that this problem can be decoupled with respect to k. Let \(W:= (w^{(1)}\ldots w^{(\kappa )}) \in \mathbb{R}^{d,\kappa }\) be the weight matrix. Then the first sum in (34) coincides with the squared Frobenius norm of W defined by

$$\displaystyle{\|W\|_{F}^{2}:=\sum _{ k=1}^{\kappa }\sum _{ i=1}^{d}(w_{ i}^{(k)})^{2}.}$$

Let us consider the second labeling approach. Here one assumes that each class label is an integer from \(\mathcal{Y}:=\{ 1,\ldots,K\}\). As before, one aims to learn weight vectors \(w^{(k)}\), \(k = 1,\ldots,K\) (the intercept is neglected for simplicity here). The classifier is given by

$$\displaystyle{F_{W}(x):=\mathop{ {\mathrm{argmax}}}_{k=1,\ldots,K}\,\langle w^{(k)},x\rangle.}$$

A training sample \((x_{i},y_{i})\) is correctly classified by this classifier if

$$\displaystyle{\langle w^{(y_{i})},x_{ i}\rangle \geq \langle w^{(k)},x_{ i}\rangle + 1,\quad \forall k = 1,\ldots,K,\;k\not =y_{i}.}$$

Without adding 1 on the left-hand side of the inequality, correct classification is already attained if the inequality holds strictly for all \(k\not =y_{i}\). This motivates learning the weight vectors by solving the minimization problem

$$\displaystyle{\begin{array}{lll} \frac{1} {2}\|W\|_{F}^{2} \rightarrow \;\min \limits _{ W}&\mathrm{subject\ to}&\langle w^{(y_{i})},x_{ i}\rangle +\delta _{y_{i},k} -\langle w^{(k)},x_{ i}\rangle \geq 1, \\ & &\forall k = 1,\ldots,K\mbox{ and }i = 1,\ldots,m.\end{array} }$$

After introducing slack variables to relax the constraints, one gets

$$\displaystyle{\begin{array}{c} \frac{1} {2}\|W\|_{F}^{2} + C\sum _{ i=1}^{m}\xi _{ i} \rightarrow \;\min \limits _{W,\xi _{i}}\quad \mathrm{subject\ to}\quad \langle w^{(y_{i})},x_{ i}\rangle +\delta _{y_{i},k} -\langle w^{(k)},x_{ i}\rangle \geq 1 -\xi _{i}, \\ \forall k = 1,\ldots,K\mbox{ and }i = 1,\ldots,m.\end{array} }$$

This can be rewritten as the following unconstrained problem:

$$\displaystyle{\frac{1} {2}\|W\|_{F}^{2} + C\,\sum _{ i=1}^{m}l_{ h}\left (\langle w^{(y_{i})},x_{ i}\rangle +\max _{k}(\langle w^{(k)},x_{ i}\rangle -\delta _{y_{i},k})\right )\; \rightarrow \;\min _{W}.}$$

In this functional the learning tasks are coupled in the loss function.

In general the aim of multitask learning is to learn data that are common across multiple related supervised learning tasks, i.e., to facilitate “cooperative” learning. Recently, multitask learning has received attention in various applications; see the paper of Caruana [21]. Learning of vector-valued functions in RKHSs was considered, e.g., by Micchelli and Pontil [72]. Inspired by the “sparseness” models in section “Other Models” which focus on the sparsity of the weight vector w, one can ask for similar approaches for a weight matrix W. A low rank of the weight matrix can serve as a counterpart of the \(\ell_{0}\)-norm of a weight vector. But as with \(\ell_{0}\)-penalized minimization problems, such problems are computationally not manageable. A remedy is to replace the low-rank condition by demanding a small trace norm or nuclear norm of W defined by

$$\displaystyle{\|W\|_{{\ast}}:=\sum _{j}\sigma _{j},}$$

where σ j are the singular values of W. Then a minimization problem to learn the weight matrix reads, for example, as

$$\displaystyle{\frac{1} {2}\|W\|_{{\ast}} + C\,\sum _{k=1}^{K}\sum _{ i=1}^{m}L\left (y_{ i},\left \langle w^{(k)},x_{ i}\right \rangle \right )\; \rightarrow \;\min \limits _{W},}$$

where L is typically the least squares loss function. Such models were considered by Amit et al. [2], Obozinski et al. [76], and Pong et al. [82]. Other approaches use the norm

$$\displaystyle{\|W\|_{2,1}:=\sum _{ j=1}^{d}\left \|\left (w_{ j}^{(k)}\right )_{ k=1}^{K}\right \|}$$

which, instead of the trace norm, favors a small number of nonzero rows in W; see Argyriou et al. [4] and Obozinski et al. [76]. Another interesting model, proposed by Argyriou et al. [4], learns, in addition to the weight matrix, an orthogonal matrix \(U \in \mathcal{O}\) by minimizing

$$\displaystyle{\|W\|_{2,1} + C\,\sum _{k=1}^{K}\sum _{ i=1}^{m}L\left (y_{ i},\left \langle w^{(k)},Ux_{ i}\right \rangle \right )\; \rightarrow \;\min \limits _{W,U\in \mathcal{O}}.}$$

The numerical solution of multitask problems which are convex but non-smooth requires sophisticated techniques. The trace norm minimization problem can be, for example, reformulated as a semi-definite program (SDP), and then existing SDP solvers can be used as long as the size of the problem is moderate; see the papers of Fazel et al. [40] and Srebro et al. [91]. A smooth, but nonconvex reformulation of the problem and a subsequent solution by a conjugate gradient or alternating minimization method was proposed, e.g., by Weimer et al. [113]. Accelerated proximal gradient methods (multistep methods) and Bregman iterative methods were applied in the papers of Lu et al. [66], Ma et al. [67], Cai et al. [20], and Toh and Yun [103]. A new primal-dual reformulation of the problem in conjunction with a gradient projection method to solve the reduced dual problem was given by Pong et al. [82].

3.2.8 Applications of SVMs

SVMs have been applied to many real-world problems. Some applications were already sketched in the previous subsections. Very often SVMs are used in connection with other techniques, in particular feature extraction/selection methods to specify the input domain.

A non-exhaustive list of SVM applications includes text categorization (see Joachims [53] and Leopold and Kinderman [64]), hand-written character recognition (see LeCun et al. [62]), texture and image classification (see Chapelle et al. [23]), protein homology detection (see Jaakkola and Haussler [51]), gene expression (see Brown et al. [17]), medical diagnostics (see Strauss et al. [96]), and pedestrian and face detection (see Osuna et al. [77], Viola and Jones [110]).

This subsection describes only two applications of SVM classification and shows how the necessary design choices can be made. In particular, one has to choose an appropriate SVM kernel for the given application. Default options are Gaussians or polynomial kernels, and the corresponding SVMs often already outperform other classification methods. Even for such parameterized families of kernels, one has to specify parameters like the standard deviation of the Gaussian or the degree of the polynomial. In the Gaussian case, a good choice of the standard deviation for a classification problem is the distance between closest points in different classes. In the absence of reliable criteria, one could use a validation set or cross-validation to determine useful parameters. Various applications require more elaborate kernels which implicitly describe the feature space.

Handwritten Digit Recognition The problem of handwritten digit recognition was the first real-world task on which SVMs were successfully tested. The results are reported in detail in the book of Vapnik [105]. This SVM application is particularly interesting because other algorithms incorporating prior knowledge about the USPS database had been designed. The fact that SVMs perform better than these specialized systems without using such detailed prior information is remarkable; see [62].

Different SVM models have been tested on two databases:

  • United States Postal Service (USPS): 7,291 training and 2,007 test patterns of the numbers 0, …, 9, represented by 16 × 16 gray level matrices; see Fig. 10.

    Fig. 10

    Examples of patterns from the USPS database; see [105]

  • National Institute of Standards and Technology (NIST): 60,000 training and 10,000 test patterns, represented by 20 × 20 gray-level matrices.

In the following, the results for the USPS database are considered.

For constructing the decision rules, SVMs with polynomial and Gaussian kernels were used:

$$\displaystyle{K(x,t):= \left (\langle x,t\rangle /256\right )^{n},\quad K(x,t):= e^{-\|x-t\|^{2}/(256\sigma ^{2}) }.}$$

The overall machine consists of 10 classifiers, each one separating one class from the rest (one-versus-all classification). The 10-class classification was then done by choosing the class with the largest output value.

All types of SVMs demonstrated approximately the same performance, shown in the following tables; cf. [105]. The tables contain the parameters of the hard margin machines, the corresponding error rates, and the average number of support vectors per classifier. Moreover, it was observed that the different types of SVMs use approximately the same set of support vectors.

Degree n        1      2      3      4      5      6
Error           8.9    4.7    4.0    4.2    4.5    4.5
Number of SV    282    237    274    321    374    422

Results for SVM classification with polynomial kernels

σ               4.0    1.5    0.3    0.25   0.2    0.1
Error           5.3    4.9    4.2    4.3    4.5    4.6
Number of SV    266    237    274    321    374    422

Results for SVM classification with Gaussian kernels

Finally, it is worth mentioning that the training data are not linearly separable; ≈ 5 % of these data were misclassified by a linear learning machine. For a degree 2 polynomial kernel, only the 4 examples in Fig. 11 were misclassified. For polynomials of degree 3, the training data are separable. The number of support vectors increases only slowly with the degree of the polynomial.

Fig. 11

Labeled USPS examples of training errors for the SVM with second degree polynomial kernel; see [105]

Color Image Recognition Image recognition is another area where SVMs were successfully applied. Chapelle et al. [23] have reported their SVM classification results for color image recognition. The database was a subset (Corel14) of the Corel Stock Photo Collection consisting of 1,400 photos associated with 14 categories. Each category was split into 2/3 for training and 1/3 for testing. Again the one-versus-all classifier was applied.

The images themselves were not used as inputs; instead, each image was associated with its color histogram. Since each color is a point in a three-dimensional vector space and the number of bins per color was fixed at 16, the dimension of such a histogram (and thus of the feature space) is \(d = 16^{3} = 4096\). Note that low-level features like histograms have the advantage that they are invariant with respect to many operations and allow the comparison of images of different sizes. Of course, local high-level image features like edges are not captured by low-level features. Chapelle and co-workers [23] have used both the RGB (red, green, blue) and the HSV/HSB (hue, saturation, value/hue, saturation, brightness) histogram representations. Note that HSV arranges the geometry of RGB in an attempt to be more perceptually relevant. As kernels they have used

$$\displaystyle{K(x,t):= e^{-{\mathrm{dist}}(x,t)/\sigma ^{2} },}$$

where dist denotes a distance measure in the feature space which has to be determined. For histograms, the \(\chi ^{2}\) function

$$\displaystyle{{\mathrm{dist}}(x,t):=\sum _{ i=1}^{d}\frac{(x_{i} - t_{i})^{2}} {x_{i} + t_{i}} }$$

is accepted as an appropriate distance measure. It is not clear if the corresponding kernel is a Mercer kernel. For the distances \({\mathrm{dist}}_{p}(x,t):=\| x - t\|_{p}^{p}\), p = 1, 2 this is the case.

As can be seen in the following table, the SVMs with the \(\chi ^{2}\) and the \(\ell_{1}\) distance perform similarly and significantly better than the SVM with the squared \(\ell_{2}\) distance. Therefore, the Gaussian kernel is not the best choice here. RGB- and HSV-based methods perform similarly.

 

        Linear    Degree 2 poly    χ²      ℓ1      Gaussian
RGB     42.1      33.6             28.8    14.7    14.7
HSV     36.3      35.3             30.5    14.5    14.7

Error rates (percent) using different SVM kernels

For comparison, Chapelle and co-workers conducted some experiments on color image histogram (HSV-based) classification with the k-nearest neighbor algorithm using the χ² and ℓ2 distances. Here k = 1 gave the best results, presented in the following table:

χ²      ℓ2
26.5    47.7

Error rates (percent) with the k-nearest neighbor algorithm

The χ²-based SVM is roughly twice as good as the χ²-based k-nearest neighbor technique.

4 Survey of Mathematical Analysis of Methods

4.1 Reproducing Kernel Hilbert Spaces

General theory For simplicity, let \(\mathcal{X} \subset \mathbb{R}^{d}\) be a compact set throughout this subsection. Moreover, only spaces of real-valued functions are considered. Let \(C(\mathcal{X})\) denote the set of continuous functions on \(\mathcal{X}\). Together with the norm

$$\displaystyle{\|f\|_{C(\mathcal{X})} = {\mathrm{sup}}\{\vert f(x)\vert: x \in \mathcal{X}\}}$$

this becomes a Banach space. Further, let \(L_{2}(\mathcal{X})\) be the Hilbert space of (equivalence classes) quadratic integrable, real-valued functions on \(\mathcal{X}\) with inner product and norm given by

$$\displaystyle{\langle f,g\rangle _{L_{2}}:=\int _{\mathcal{X}}f(x)g(x)\,dx,\quad \|f\|_{L_{2}} = \left (\int _{\mathcal{X}}f(x)^{2}\,dx\right )^{1/2}.}$$

Since \(\mathcal{X}\) is compact, the space \(C(\mathcal{X})\) is continuously embedded into \(L_{2}(\mathcal{X})\) which means that \(\|f\|_{L_{2}(\mathcal{X})} \leq C\|f\|_{C(\mathcal{X})}\) for all \(f \in C(\mathcal{X})\). A function \(K: \mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}\) is symmetric, if K(x, t) = K(t, x) for all \(x,t \in \mathcal{X}\). With a symmetric function \(K \in L_{2}(\mathcal{X}\times \mathcal{X})\), one can associate an integral operator \(T_{K}: L_{2}(\mathcal{X}) \rightarrow L_{2}(\mathcal{X})\) by

$$\displaystyle{T_{K}f(t):=\int _{\mathcal{X}}K(x,t)f(x)\,dx.}$$

This is a compact and self-adjoint operator, and K is called its kernel. The following spectral theorem holds true for compact, self-adjoint operators, in particular for \(T_{K}\).

Theorem 1 (Spectral theorem for compact, self-adjoint operators).

Let T be a compact, self-adjoint operator on the Hilbert space \(\mathcal{H}\). Then there exists a countable (possibly finite) orthonormal system \(\{\psi _{i}: i \in I\}\) and a zero sequence \((\lambda _{i})_{i\in I}\), \(\lambda _{i} \in \mathbb{R}\setminus \{0\}\), such that

$$\displaystyle{\mathcal{H} = {\mathrm{ker}}T \oplus \overline{{\mathrm{span}}\{\psi _{i}: i \in I\}}}$$

and

$$\displaystyle{ Tf =\sum _{j\in I}\lambda _{j}\langle f,\psi _{j}\rangle _{\mathcal{H}}\,\psi _{j}\quad \forall f \in \mathcal{H}. }$$
(35)

The numbers \(\lambda _{j}\) are the nonzero eigenvalues of T, and the \(\psi _{j}\) are the corresponding eigenfunctions. If T is a positive operator, i.e.,

$$\displaystyle{\langle Tf,f\rangle _{\mathcal{H}} =\int _{\mathcal{X}}\int _{\mathcal{X}}K(x,t)f(x)f(t)\,dxdt \geq 0\quad \forall f \in \mathcal{H},}$$

then the values \(\lambda _{j}\) are positive.

Consider the special operator \(T_{K}\) for a symmetric kernel \(K \in L_{2}(\mathcal{X}\times \mathcal{X})\). Using the \(L_{2}\)-orthonormal eigenfunctions \(\{\psi _{i}: i \in I\}\) of \(T_{K}\), one can also expand the kernel itself as

$$\displaystyle{K(x,t) =\sum _{j\in I}\lambda _{j}\psi _{j}(x)\psi _{j}(t),}$$

where the sum converges, like the one in (35), in general only in \(L_{2}(\mathcal{X}\times \mathcal{X})\). One can tighten the statement if K is continuous and symmetric. Then \(T_{K}: C(\mathcal{X}) \rightarrow C(\mathcal{X})\) is a compact operator from the pre-Hilbert space \(C(\mathcal{X})\), equipped with the \(L_{2}\)-norm, into itself, and the functions \(\psi _{j}\) are continuous. If \(f \in C(\mathcal{X})\), then the right-hand side of (35) converges absolutely and uniformly. To prove such a convergence result also for the kernel expansion, one moreover needs that the operator \(T_{K}\) is positive. Unfortunately, it is not true that a positive kernel K implies a positive operator \(T_{K}\). There is another criterion, which will be introduced in the following. A matrix \(\mathbf{K} \in \mathbb{R}^{m,m}\) is called positive semi-definite if

$$\displaystyle{\alpha ^{\mbox{ T} }\mathbf{K}\,\alpha \geq 0\quad \forall \alpha \in \mathbb{R}^{m}}$$

and positive definite if strict inequality holds true for all α ≠ 0. A symmetric kernel \(K: \mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}\) is positive (semi)-definite if the matrix \(\mathbf{K}:= \left (K(x_{i},x_{j})\right )_{i,j=1}^{m}\) is positive (semi)-definite for all finite sets \(\{x_{1},\ldots,x_{m}\} \subset \mathcal{X}\). Now a symmetric kernel \(K \in C(\mathcal{X}\times \mathcal{X})\) is positive semi-definite if and only if the corresponding integral operator \(T_{K}\) is positive.

Theorem 2 (Mercer’s theorem).

Let \(K \in C(\mathcal{X}\times \mathcal{X})\) be a continuous, symmetric, and positive semi-definite function with corresponding integral operator \(T_{K}\). Then K can be expanded into an absolutely and uniformly convergent series in terms of \(T_{K}\)’s orthonormal eigenfunctions \(\psi _{j}\) and the associated eigenvalues \(\lambda _{j} > 0\) as follows:

$$\displaystyle{ K(x,t) =\sum _{j\in I}\lambda _{j}\psi _{j}(x)\psi _{j}(t). }$$
(36)

Moreover, if K is positive definite, then \(\{\psi _{i}: i \in I\}\) forms an orthonormal basis of \(L_{2}(\mathcal{X})\).

A continuous, symmetric, positive semi-definite kernel is called a Mercer kernel. Mercer kernels are closely related to so-called reproducing kernel Hilbert spaces.

Let \(\mathcal{H}\) be a real Hilbert space of functions \(f: \mathcal{X} \rightarrow \mathbb{R}\). A function \(K: \mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}\) is called a reproducing kernel of \(\mathcal{H}\) if

  1. (H1)

    \(K_{t}:= K(\cdot,t) \in \mathcal{H}\quad \forall t \in \mathcal{X}\),

  2. (H2)

    \(f(t) =\langle f,K_{t}\rangle _{\mathcal{H}}\quad \forall f \in \mathcal{H}\;{\mathrm{and}}\;\forall t \in \mathcal{X}\quad \mbox{ $(\mathit{ReproducingProperty})$}.\)

In particular, property (H2) implies for \(f:=\sum _{ i=1}^{m}\alpha _{i}K_{x_{i}}\) and \(g:=\sum _{ j=1}^{n}\beta _{j}K_{t_{j}}\) that

$$\displaystyle{ \langle f,g\rangle _{\mathcal{H}}:=\sum _{ i=1}^{m}\sum _{ j=1}^{n}\alpha _{ i}\beta _{j}K(x_{i},t_{j}),\quad \|f\|_{\mathcal{H}}^{2} =\sum _{ i,j=1}^{m}\alpha _{ i}\alpha _{j}K(x_{i},x_{j}) =\alpha ^{\mbox{ T} }\mathbf{K}\,\alpha, }$$
(37)

where \(\alpha = (\alpha _{1},\ldots,\alpha _{m})^{\mbox{ T}}\) and \(\mathbf{K}:= \left (K(x_{i},x_{j})\right )_{i,j=1}^{m}\). If such a kernel exists for \(\mathcal{H}\), then it is uniquely determined. A Hilbert space which exhibits a reproducing kernel is called a reproducing kernel Hilbert space (RKHS); one writes \(\mathcal{H} = \mathcal{H}_{K}\) to emphasize the relation with the kernel. In \(\mathcal{H}_{K}\), the set of all finite linear combinations of \(K_{t}\), \(t \in \mathcal{X}\), is dense, i.e.,

$$\displaystyle{ \mathcal{H}_{K} = \overline{{\mathrm{span}}\,\{K_{t}: t \in \mathcal{X}\}}. }$$
(38)

Moreover, the kernel K of a RKHS must be a symmetric, positive semi-definite function; see Wendland [114]. Finally, based on the Riesz representation theorem, another characterization of RKHSs can be given. It can be shown that a Hilbert space \(\mathcal{H}\) is a RKHS if and only if the point evaluation functionals \(\delta _{x}: \mathcal{H}\rightarrow \mathbb{R}\) determined by δ x f: = f(x) are continuous on \(\mathcal{H}\), i.e.,

$$\displaystyle{\vert f(x)\vert \leq C\|f\|_{\mathcal{H}}\quad \forall f \in \mathcal{H}.}$$

Conversely, by the following theorem, any Mercer kernel gives rise to a RKHS.

Theorem 3.

Let \(K \in C(\mathcal{X}\times \mathcal{X})\) be a continuous, symmetric, and positive semi-definite function. Then there exists a unique Hilbert space \(\mathcal{H}_{K}\) of functions on \(\mathcal{X}\) which is a RKHS with kernel K. The space \(\mathcal{H}_{K}\) consists of continuous functions on \(\mathcal{X}\), and the embedding operator \(\iota _{K}: \mathcal{H}_{K}(\mathcal{X}) \rightarrow C(\mathcal{X})\big(\rightarrow L_{2}(\mathcal{X})\big)\) is continuous.

Proof.

  1. 1.

    First, one constructs a Hilbert space which fulfills (H1) and (H2). By (H1), the space \(\mathcal{H}_{K}\) has to contain all functions \(K_{t}\), \(t \in \mathcal{X}\), and since the space is linear, also all their finite linear combinations. Therefore, define

    $$\displaystyle{\mathcal{H}_{0}:= {\mathrm{span}}\{K_{t}: t \in \mathcal{X}\}.}$$

    According to (37), this space can be equipped with the inner product and corresponding norm

    $$\displaystyle{\langle f,g\rangle _{\mathcal{H}_{0}}:=\sum _{ i=1}^{m}\sum _{ j=1}^{n}\alpha _{ i}\beta _{j}K(x_{i},t_{j}),\quad \|f\|_{\mathcal{H}_{0}}^{2} =\alpha ^{\mbox{ T} }\mathbf{K}\,\alpha.}$$

    It can easily be checked that this is indeed an inner product. In particular, \(\|f\|_{\mathcal{H}_{0}} = 0\) for some \(f =\sum _{ i=1}^{m}\alpha _{i}K_{x_{i}}\) implies that \(f(t) =\sum _{ i=1}^{m}\alpha _{i}K(t,x_{i}) = 0\) for all \(t \in \mathcal{X}\) by the following argument: set \(x_{m+1}:= t\). By the positive semi-definiteness of K, it follows for any \(\epsilon \in \mathbb{R}\) that

    $$\displaystyle{(\alpha ^{\mbox{ T} },\epsilon )\left (K(x_{i},x_{j})\right )_{i,j=1}^{m+1}\left (\begin{array}{*{10}c} \alpha \\ \epsilon \end{array} \right ) =\alpha ^{\mbox{ T} }\mathbf{K}\,\alpha +2\epsilon \sum _{i=1}^{m}\alpha _{ i}K(x_{i},t)+\epsilon ^{2}K(t,t) \geq 0.}$$

    With \(\alpha ^{\mbox{ T}}\mathbf{K}\,\alpha =\| f\|_{ \mathcal{H}_{0}} = 0\), this can be rewritten as

    $$\displaystyle{\epsilon \left (2f(t) +\epsilon K(t,t)\right ) \geq 0.}$$

    Since K is positive semi-definite, it holds K(t, t) ≥ 0. Assume that f(t) < 0. Then choosing \(0 <\epsilon < -2f(t)/K(t,t)\) if K(t, t) > 0 and 0 < ε if K(t, t) = 0 leads to a contradiction. Similarly, assuming that f(t) > 0 and choosing \(-2f(t)/K(t,t) <\epsilon < 0\) if K(t, t) > 0 and ε < 0 if K(t, t) = 0 gives a contradiction. Thus f(t) = 0.

    Now one defines \({\mathcal{H}_{K}}\) to be the completion of \({\mathcal{H}_{0}}\) with the associated norm. This space has the reproducing property (H2) and is therefore a RKHS with kernel K.

  2. 2.

    To prove that \(\mathcal{H}_{K}\) is unique, assume that there exists another Hilbert space \(\mathcal{H}\) of functions on \(\mathcal{X}\) with kernel K. By (H1) and (38), it is clear that \(\mathcal{H}_{0}\) is a dense subset of \(\mathcal{H}\). By (H2) it follows that \(\langle f,g\rangle _{\mathcal{H}} =\langle f,g\rangle _{\mathcal{H}_{K}}\) for all \(f,g \in \mathcal{H}_{0}\). Since both \(\mathcal{H}\) and \(\mathcal{H}_{K}\) are completions of \(\mathcal{H}_{0}\), the uniqueness follows.

  3. 3.

    Finally, one concludes by the Schwarz inequality that

    $$\displaystyle{\vert f(t)\vert = \vert \langle f,K_{t}\rangle _{\mathcal{H}_{K}}\vert \leq \| f\|_{\mathcal{H}_{K}}\|K_{t}\|_{\mathcal{H}_{K}} =\| f\|_{\mathcal{H}_{K}}\sqrt{K(t, t)}}$$

    so that f is continuous since K is continuous. Moreover, \(\|f\|_{C(\mathcal{X})} \leq C\|f\|_{\mathcal{H}_{K}}\) with \(C:=\max _{t\in \mathcal{X}}\sqrt{K(t, t)}\) which means that the embedding ι K is continuous. □

Since the completion of \(\mathcal{H}_{0}\) is rather abstract, another characterization of \(\mathcal{H}_{K}\) based on Mercer’s theorem is useful. Let \(\{\psi _{i}: i \in I\}\) be the \(L_{2}\)-orthonormal eigenfunctions of \(T_{K}\) with corresponding eigenvalues \(\lambda _{i} > 0\) from Mercer’s theorem. Then it follows by Schwarz’s inequality and Mercer’s theorem for \(w:= (w_{i})_{i\in I} \in \ell_{2}(I)\) that

$$\displaystyle{\sum _{i\in I}\vert w_{i}\sqrt{\lambda _{i}}\psi _{i}(x)\vert \leq \| w\|_{\ell_{2}}\left (\sum _{i\in I}\lambda _{i}\psi _{i}^{2}(x)\right )^{1/2} =\| w\|_{\ell_{ 2}}\sqrt{K(x, x)}}$$

so that the series \(\sum _{i\in I}w_{i}\sqrt{\lambda _{i}}\psi _{i}(x)\) converges absolutely and uniformly for all \((w_{i})_{i\in I} \in \ell_{2}(I)\). Now another characterization of \(\mathcal{H}_{K}\) can be given.

Corollary 1.

Let \(K \in C(\mathcal{X}\times \mathcal{X})\) be a continuous, symmetric, and positive semi-definite kernel with expansion (36). Then the Hilbert space

$$\displaystyle{\mathcal{H}:= \left \{\sum _{i\in I}w_{i}\sqrt{\lambda _{i}}\psi _{i}: (w_{i})_{i\in I} \in \ell_{2}(I)\right \}}$$

with inner product

$$\displaystyle{\langle f,g\rangle _{\mathcal{H}}:=\sum _{i\in I}w_{i}\omega _{i} =\sum _{ j=1}^{\infty }\frac{1} {\lambda _{j}}\,\langle f,\psi _{j}\rangle _{L_{2}}\langle g,\psi _{j}\rangle _{L_{2}}}$$

for \(f:=\sum _{i\in I}w_{i}\sqrt{\lambda _{i}}\psi _{i}\) and \(g:=\sum _{j\in I}\omega _{j}\sqrt{\lambda _{j}}\psi _{j}\) is the RKHS with kernel K, i.e., \(\mathcal{H} = \mathcal{H}_{K}\). The system \(\{\varphi _{i}:= \sqrt{\lambda _{i}}\psi _{i}: i \in I\}\) is an orthonormal basis of \(\mathcal{H}\).

If K is positive definite, then \(\mathcal{H}\) can be also characterized by

$$\displaystyle{\mathcal{H} =\big\{ f \in L_{2}(\mathcal{X}):\sum _{ j=1}^{\infty }\frac{1} {\lambda _{j}}\,\vert \langle f,\psi _{j}\rangle _{L_{2}}\vert ^{2} < \infty \big\}.}$$

Proof.

By the above definition of the inner product, the set \(\{\sqrt{\lambda _{i}}\psi _{i}: i \in I\}\) is an orthonormal basis of \(\mathcal{H}\). The second equality in the definition of the inner product follows from the orthonormality of the \(\psi _{i}\) in \(L_{2}\).

It remains to show that K fulfills (H1) and (H2). Concerning (H1), it holds \(K_{t} =\sum _{i\in I}\sqrt{\lambda _{i}}\psi _{i}(t)\sqrt{\lambda _{i}}\psi _{i}\), and since

$$\displaystyle{\sum _{i\in I}(\sqrt{\lambda _{i}}\psi _{i}(t))^{2} = K(t,t) < \infty,}$$

it follows that \(K_{t} \in \mathcal{H}\). Using the orthonormal basis property, one can conclude with respect to (H2) that

$$\displaystyle{\langle f,K_{t}\rangle _{\mathcal{H}} =\langle \sum _{j\in I}w_{j}\sqrt{\lambda _{j}}\psi _{j},\sum _{i\in I}\sqrt{\lambda _{i}}\psi _{i}(t)\sqrt{\lambda _{i}}\psi _{i}\rangle _{\mathcal{H}} =\sum _{i\in I}w_{i}\sqrt{\lambda _{i}}\psi _{i}(t) = f(t).}$$

For more information on RKHS, see the book of Berlinet and Thomas-Agnan [8].

Kernels The choice of appropriate kernels for SVMs depends on the application. Default options are Gaussians or polynomial kernels, which are described together with some more examples of Mercer kernels below:

  1. 1.

    Let \(\mathcal{X}:=\{ x \in \mathbb{R}^{d}:\| x\| \leq R\}\) with radius R > 0. Then the dot product kernels

    $$\displaystyle{K(x,t):= {\sum _{ j=0}^{\infty }} {a_ j}{(x \cdot t)^{j}},\quad {a_j} \geq 0,\; {\sum _{j=1}^{\infty}} {a_{ j}}{R^{2j}} < \infty }$$

    are Mercer kernels on \(\mathcal{X}\). A proof that these kernels are indeed positive semi-definite is given in the book of Cucker and Zhou [29]. A special case appears if \(\mathcal{X}\) contains the coordinate vectors e j , j = 1, , d and the kernel is \(K(x,t) = 1 + x \cdot t\). Note that even in one dimension d = 1, this kernel is not positive definite. Here the corresponding RKHS \(\mathcal{H}_{K}\) is the space of linear functions and {1, x 1, , x d } forms an orthonormal basis of \(\mathcal{H}_{K}\).

    The special dot product \(K(x,t):= (c + x \cdot t)^{n}\), c ≥ 0, \(n \in \mathbb{N}\), also known as the polynomial kernel, was introduced in statistical learning theory by Vapnik [105]. More general dot product kernels were described, e.g., by Smola, Schölkopf, and Müller [89]. See also all-subset kernels and ANOVA kernels in [88].

  2. 2.

    Next, consider translation invariant kernels

    $$\displaystyle{K(x,t):=\kappa (x - t),}$$

    where \(\kappa: \mathbb{R}^{d} \rightarrow \mathbb{R}\) is a continuous function which has to be even, that is, \(\kappa (-x) =\kappa (x)\) for all \(x \in \mathbb{R}^{d}\), to ensure that K is symmetric. Let us see when K is a Mercer kernel on \(\mathbb{R}^{d}\) and hence on any subset \(\mathcal{X}\) of \(\mathbb{R}^{d}\). First, one knows from Bochner’s theorem that K is positive semi-definite if and only if κ is the Fourier transform of a finite nonnegative Borel measure on \(\mathbb{R}^{d}\). Let \(\kappa \in L_{1}(\mathbb{R}^{d})\). Then K is positive definite if and only if κ is bounded and its Fourier transform is nonnegative and non-vanishing.

    A special example on \({\mathbb{R}}\) (d = 1) is the spline kernel K generated by the “hat function” \(\kappa (x):=\max \{ 0,1 -\vert x\vert /2\}\). Its Fourier transform is \(\hat{\kappa }(\omega ) = 2{{(\sin \omega /\omega )}^{2}} \geq 0\). Multivariate examples of this form can be constructed by using, e.g., box splines. Spline kernels and corresponding RKHSs were discussed, e.g., by Wahba [112].

  3. 3.

    A widely used class of translation invariant kernels are kernels associated with radial functions. A function \(\kappa: \mathbb{R}^{d} \rightarrow \mathbb{R}\) is said to be radial if there exists a function \(k: [0,\infty ) \rightarrow \mathbb{R}\) such that \(\kappa (x) = k(\|x\|^{2})\) for all \(x \in \mathbb{R}^{d}\). For radial kernels define

    $$\displaystyle{K(x,t):= k(\|x - t\|^{2}).}$$

    A result of Schoenberg [85] says that K is positive semi-definite on \(\mathbb{R}^{d}\) if and only if the function k is completely monotone on [0, ∞). Recall that k is completely monotone on (0, ∞) if \(k \in C^{\infty }(0,\infty )\) and

    $$\displaystyle{(-1)^{l}k^{(l)}(r) \geq 0\quad \forall l \in \mathbb{N}_{ 0}\;{\mathrm{and}}\;\forall r > 0.}$$

    The function k is called completely monotone on [0, ∞) if it is in addition in C[0, ∞).

It holds that K is positive definite if and only if one of the following conditions is fulfilled:

  1. (i)

    \(k(\sqrt{\cdot })\) is completely monotone on [0, ∞) and not constant,

  2. (ii)

    There exists a finite nonnegative Borel measure ν on [0, ∞) that is not concentrated at zero such that

    $$\displaystyle{k(r) =\int _{ 0}^{\infty }e^{-r^{2}t }\,d\nu (t).}$$

The proofs of these results on radial kernels are contained, e.g., in the book of Wendland [114].

For c > 0, the kernels K arising from the following radial functions κ are positive definite:

$$\displaystyle\begin{array}{rcl} \kappa (x)&:=& e^{-\|x\|^{2}/c^{2} }\quad {\mathrm{Gaussian}}, {}\\ \kappa (x)&:=& (c^{2} +\| x\|^{2})^{-s},\;\;s > 0\quad \text{inverse multiquadric}, {}\\ \end{array}$$

where the positive definiteness of the Gaussian follows from (i) and that of the inverse multiquadric from (ii) with

$$\displaystyle{k(r) = \frac{1} {\Gamma (s)}\int _{0}^{\infty }t^{s-1}e^{-c^{2}t }e^{-r^{2}t }\,dt.}$$

Positive definite kernels arising from Wendland’s radial basis functions with compact support (see Wendland [114]) were applied in SVM classification by Strauss and Steidl [95].

Finally, the following techniques for creating Mercer kernels are remarkable.

Theorem 4.

Let \(K_{j} \in C(\mathcal{X}\times \mathcal{X})\), j = 1,2 be Mercer kernels and p a polynomial with positive coefficients. Then the following functions are also Mercer kernels:

  1. (i)

    \(K(x,t):= K_{1}(x,t) + K_{2}(x,t)\),

  2. (ii)

    K(x,t):= K 1 (x,t)K 2 (x,t),

  3. (iii)

    K(x,t):= p(K 1 (x,t)),

  4. (iv)

    \(K(x,t):= e^{K_{1}(x,t)}\).

Beyond the above Mercer kernels, other kernels like kernels for text and structured data (strings, trees), diffusion kernels on graphs, or kernels incorporating generative information were used in practice; see the book of Shawe-Taylor and Cristianini [88].

Conditionally positive semi-definite radial functions. In connection with radial basis functions, so-called conditionally positive semi-definite functions \(\kappa (x):= k(\|x\|^{2})\) were applied for approximation tasks. Let \(\Pi _{\nu -1}(\mathbb{R}^{d})\) denote the space of polynomials on \(\mathbb{R}^{d}\) of degree < ν. This space has dimension \({\mathrm{dim}}(\Pi _{\nu -1}(\mathbb{R}^{d})) ={ d +\nu -1\choose d}\). A continuous radial function \(\kappa: \mathbb{R}^{d} \rightarrow \mathbb{R}\) is conditionally positive semi-definite of order ν if for all \(m \in \mathbb{N}\), all pairwise distinct points \(x_{1},\ldots,x_{m} \in \mathbb{R}^{d}\), and all \(\alpha \in \mathbb{R}^{m}\setminus \{0\}\) satisfying

$$\displaystyle{ \sum _{i=1}^{m}\alpha _{ i}p(x_{i}) = 0\quad \forall p \in \Pi _{\nu -1}(\mathbb{R}^{d}) }$$
(39)

the relation

$$\displaystyle{\alpha ^{\mbox{ T} }\mathbf{K}\alpha \geq 0,\qquad \mathbf{K}:= \left (\kappa (x_{i} - x_{j})\right )_{i,j=1}^{m}}$$

holds true. If equality is attained only for α = 0, the function κ is said to be conditionally positive definite of order ν.

The following result is due to Micchelli [71]: For \(k \in C[0,\infty ) \cap C^{\infty }(0,\infty )\), the function \(\kappa (x):= k(\|x\|^{2})\) is conditionally positive semi-definite of order ν if and only if \((-1)^{\nu }k^{(\nu )}\) is completely monotone on (0, ∞). If k is not a polynomial of degree at most ν, then κ is conditionally positive definite of order ν.

Using this result, one can show that the following functions are conditionally positive definite of order ν:

$$\displaystyle\begin{array}{rcl} \kappa (x)&:=& (-1)^{\lceil s\rceil }(c^{2} +\| x\|^{2})^{s},\;\;s > 0,s\not\in \mathbb{N},\;\nu = \lceil s\rceil \quad \text{multiquadric}, {}\\ \kappa (x)&:=& (-1)^{\lceil s/2\rceil }\|x\|^{s},\;\;s > 0,s\not\in 2\mathbb{N},\;\nu = \lceil s/2\rceil, {}\\ \kappa (x)&:=& (-1)^{k+1}\|x\|^{2k}\log \|x\|,\;\;k \in \mathbb{N},\;\nu = k + 1\quad \text{thin plate spline}. {}\\ \end{array}$$

A relation of a combination of thin plate splines and polynomials to the reproducing kernels of certain RKHSs can be found in Wahba [112].

4.2 Quadratic Optimization

This subsection collects the basic material from optimization theory to understand the related parts in the previous Sect. 3, in particular the relation between primal and dual problems in quadratic programming. More on this topic can be found in any book on optimization theory, e.g., in the books of Mangasarian [68] or Spellucci [90].

A (nonlinear) optimization problem in \(\mathbb{R}^{d}\) has the general form

$$\displaystyle{\begin{array}{c} \mbox{ Primal problem (P)}\\ \theta (x)\; \rightarrow \;\min \limits _{ x}\quad \mbox{ subject to}\quad g(x) \leq 0,\;h(x) = 0 \end{array} }$$

where \(\theta: \mathbb{R}^{d} \rightarrow \mathbb{R}\) is a real-valued function and \(g: \mathbb{R}^{d} \rightarrow \mathbb{R}^{m}\), \(h: \mathbb{R}^{d} \rightarrow \mathbb{R}^{p}\) are vector-valued functions. In general, only the case p < d is of interest since otherwise one is confronted with the solution of a (nonlinear) system of equations. The region

$$\displaystyle{\mathcal{G}:=\{ x \in \mathbb{R}^{d}: g(x) \leq 0,h(x) = 0\},}$$

where the objective function θ is defined and where all constraints are satisfied, is called the feasible region. Well-examined classes of problems (P) are convex optimization problems and, in particular, special classes of convex problems, namely, linear and quadratic problems. Problem (P) is called convex if θ is a convex function and \(\mathcal{G}\) is a convex region. Recall that \(x^{{\ast}}\in \mathcal{G}\) is a local minimizer of θ in \(\mathcal{G}\) if there exists a neighborhood \(\mathcal{U}(x^{{\ast}})\) of \(x^{{\ast}}\) such that \(\theta (x^{{\ast}}) \leq \theta (x)\) for all \(x \in \mathcal{U}(x^{{\ast}}) \cap \mathcal{G}\). For convex problems, any local minimizer \(x^{{\ast}}\) of θ in \(\mathcal{G}\) is also a global minimizer of θ in \(\mathcal{G}\) and therefore a solution of the minimization problem. This subsection deals mainly with the following setting, which gives rise to a convex optimization problem:

  1. (C1)

    θ convex and differentiable,

  2. (C2)

    g i , i = 1, , m convex and differentiable,

  3. (C3)

    h j , j = 1, , p affine linear.

Important classes of problems fulfilling (C1)–(C3) are quadratic programs, where the objective function is quadratic and the constraints are (affine) linear, and linear programs, where the objective function is also linear. The constrained optimization problems considered in Sect. 3 are of this kind.

The function \(L: {\mathbb{R}^{d}} \times {\mathbb{R}_{+}^{m}} \times {\mathbb{R}^{p}} \rightarrow {\mathbb{R}}\) defined by

$$\displaystyle{L(x,\alpha,\beta ):=\theta (x) +\sum _{ i=1}^{m}\alpha _{ i}g_{i}(x) +\sum _{ j=1}^{p}\beta _{ j}h_{j}(x)}$$

is called the Lagrangian associated with (P), and the coefficients \(\alpha _{i}\) and \(\beta _{j}\) are called Lagrange multipliers. Recall that \((x^{{\ast}},\lambda ^{{\ast}}) \in \Omega \times \Xi \), \(\Omega \subset \mathbb{R}^{d}\), \(\Xi \subset \mathbb{R}^{n}\), is called a saddle point of a function \(\Phi: \Omega \times \Xi \rightarrow \mathbb{R}\) if

$$\displaystyle{\Phi (x^{{\ast}},\lambda ) \leq \Phi (x^{{\ast}},\lambda ^{{\ast}}) \leq \Phi (x,\lambda ^{{\ast}})\quad \forall (x,\lambda ) \in \Omega \times \Xi.}$$

There is the following close relation between saddle point problems and min-max problems:

Lemma 1.

Let \(\Phi: \Omega \times \Xi \rightarrow \mathbb{R}\). Then the inequality

$$\displaystyle{\max _{\lambda \in \Xi }\min _{x\in \Omega }\Phi (x,\lambda ) \leq \min _{x\in \Omega }\max _{\lambda \in \Xi }\Phi (x,\lambda )}$$

holds true supposed that all extreme points exist. Moreover, in this case, the equality

$$\displaystyle{\max _{\lambda \in \Xi }\min _{x\in \Omega }\Phi (x,\lambda ) = \Phi (x^{{\ast}},\lambda ^{{\ast}}) =\min _{ x\in \Omega }\max _{\lambda \in \Xi }\Phi (x,\lambda )}$$

is fulfilled if and only if (x , λ ) is a saddle point of \(\Phi \).

The solution of (P) is related to the saddle points of its associated Lagrangian as detailed in the following theorem.

Theorem 5.

If \((x^{{\ast}},(\alpha ^{{\ast}},\beta ^{{\ast}})) \in \mathbb{R}^{d} \times (\mathbb{R}_{+}^{m} \times \mathbb{R}^{p})\) is a saddle point of the Lagrangian associated with the minimization problem (P), i.e.,

$$\displaystyle{L(x^{{\ast}},\alpha,\beta ) \leq L(x^{{\ast}},\alpha ^{{\ast}},\beta ^{{\ast}}) \leq L(x,\alpha ^{{\ast}},\beta ^{{\ast}})\quad \forall x \in \mathbb{R}^{d},\forall (\alpha,\beta ) \in \mathbb{R}_{ +}^{m} \times \mathbb{R}^{p},}$$

then \(x^{{\ast}}\) is a solution of (P). Assume that the functions θ, g, h satisfy the conditions (C1)–(C3) and that g fulfills in addition the following Slater condition:

$$\displaystyle{\text{there}\ \text{exists}\ x_{0} \in \Omega \ \text{such}\ \text{that}\ g(x_{0}) < 0\ \text{and}\ h(x_{0}) = 0.}$$

Then, if \(x^{{\ast}}\) is a solution of (P), there exist \(\alpha ^{{\ast}} \in \mathbb{R}_{+}^{m}\) and \(\beta ^{{\ast}} \in \mathbb{R}^{p}\) such that \((x^{{\ast}},(\alpha ^{{\ast}},\beta ^{{\ast}}))\) is a saddle point of the associated Lagrangian.

By the next theorem, the minimizers of (P) can also be described via the following conditions on the Lagrangian: there exist \(x^{{\ast}}\in \mathcal{G}\), \(\alpha ^{{\ast}} \in \mathbb{R}_{+}^{m}\), and \(\beta ^{{\ast}} \in \mathbb{R}^{p}\) such that

$$\displaystyle\begin{array}{rcl} \mathrm{(KKT1)}& & \nabla _{x}L(x^{{\ast}},\alpha ^{{\ast}},\beta ^{{\ast}}) = 0, {}\\ \mathrm{(KKT2)}& & (\alpha ^{{\ast}})^{\mbox{ T} }g(x^{{\ast}}) = 0,\quad \alpha ^{{\ast}}\geq 0. {}\\ \end{array}$$

These conditions were established independently by Karush and by Kuhn and Tucker [60] and are usually called Kuhn-Tucker or Karush-Kuhn-Tucker (KKT) conditions.

Theorem 6.

Let θ, g, and h fulfill (C1)–(C3). If \(x^{{\ast}}\) satisfies (KKT1)–(KKT2), then \(x^{{\ast}}\) is a solution of (P). Assume that g fulfills in addition the Slater condition. Then, if \(x^{{\ast}}\) is a solution of (P), it also fulfills (KKT1)–(KKT2).

If there are only equality constraints in (P), then a solution is determined by

$$\displaystyle{\nabla _{x}L(x^{{\ast}},\beta ^{{\ast}}) = 0,\quad h(x^{{\ast}}) = 0.}$$

For the rest of this subsection, assume that (C1)–(C3) and the Slater condition hold true. Let a solution \(x^{{\ast}}\) of (P) exist. Then, by Lemma 1 and Theorem 5, there exist \(\alpha ^{{\ast}}\) and \(\beta ^{{\ast}}\) such that

$$\displaystyle{L(x^{{\ast}},\alpha ^{{\ast}},\beta ^{{\ast}}) =\max _{\alpha \in \mathbb{R}_{+}^{m},\beta }\min _{x}L(x,\alpha,\beta ).}$$

Therefore, one can try to find \(x^{{\ast}}\) as follows: for any fixed \((\alpha,\beta ) \in \mathbb{R}_{+}^{m} \times \mathbb{R}^{p}\), compute

$$\displaystyle{ \hat{x}(\alpha,\beta ):= \mathop{{\mathrm{argmin}}}_{x}L(x,\alpha,\beta ). }$$
(40)

If θ is uniformly convex, i.e., there exists γ > 0 such that

$$\displaystyle{\mu \theta (x) + (1-\mu )\theta (y) \geq \theta (\mu x + (1-\mu )y) +\mu (1-\mu )\gamma \|x - y\|^{2}\ \forall x,y \in \mathbb{R}^{d},\mu \in \left [0,1\right ],}$$

then \(\hat{x}(\alpha,\beta )\) can be obtained as the unique solution of

$$\displaystyle{\nabla _{x}L(x,\alpha,\beta ) = 0.}$$

This can be substituted into L, which results in \(\psi (\alpha,\beta ):= L(\hat{x}(\alpha,\beta ),\alpha,\beta )\), and \(\alpha ^{{\ast}}\) and \(\beta ^{{\ast}}\) are the solution of

$$\displaystyle{\psi (\alpha,\beta )\; \rightarrow \;\max _{\alpha,\beta }\quad \mbox{ subject to}\quad \alpha \geq 0.}$$

This problem, which is called the dual problem of (P), can often be solved more easily than the original problem since it has only simple inequality constraints. However, this approach is only possible if (40) can easily be solved. Finally, \(x^{{\ast}} =\hat{ x}(\alpha ^{{\ast}},\beta ^{{\ast}})\).
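As a small worked example of this elimination (added here for illustration and using the convention \(L(x,\beta ) =\theta (x) +\langle \beta,h(x)\rangle\) for a problem with equality constraints only), consider \(\theta (x) = \frac{1}{2}\|x\|^{2}\) and \(h(x) = Ax - b\) with A of full row rank:

$$\displaystyle\begin{array}{rcl} \hat{x}(\beta )& =& -A^{\mbox{ T}}\beta \quad \mbox{ (from }\nabla _{x}L(x,\beta ) = x + A^{\mbox{ T}}\beta = 0\mbox{)}, {}\\ \psi (\beta )& =& L(\hat{x}(\beta ),\beta ) = -\tfrac{1}{2}\|A^{\mbox{ T}}\beta \|^{2} -\langle \beta,b\rangle, {}\\ \beta ^{{\ast}}& =& -(AA^{\mbox{ T}})^{-1}b,\qquad x^{{\ast}} =\hat{ x}(\beta ^{{\ast}}) = A^{\mbox{ T}}(AA^{\mbox{ T}})^{-1}b, {}\\ \end{array}$$

so that \(x^{{\ast}}\) is the minimum norm solution of Ax = b, as expected.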

The objective functions in the primal problems in Sect. 3 are not strictly convex (and consequently not uniformly convex) since the intercept b does not appear in them. So let us formulate the dual problem with \(\psi (x,\alpha,\beta ):= L(x,\alpha,\beta )\) as follows:

$$\displaystyle{\begin{array}{c} \mbox{ Dual problem (D)}\\ \psi (x,\alpha,\beta )\; \rightarrow \;\max \limits _{ x,\alpha,\beta }\quad \mbox{ subject to}\quad \nabla _{x}L(x,\alpha,\beta ) = 0,\;\alpha \geq 0. \end{array} }$$

The solutions of the primal and dual problem, i.e., their minimum and maximum, respectively, coincide according to the following theorem of Wolfe [117].

Theorem 7.

Let θ, g, and h fulfill (C1)–(C3) and the Slater condition. Let \(x^{{\ast}}\) be a minimizer of (P). Then there exist \(\alpha ^{{\ast}},\beta ^{{\ast}}\) such that \((x^{{\ast}},\alpha ^{{\ast}},\beta ^{{\ast}})\) solves the dual problem and

$$\displaystyle{\theta (x^{{\ast}}) =\psi (x^{{\ast}},\alpha ^{{\ast}},\beta ^{{\ast}}).}$$

Duality theory can be handled in a more sophisticated way using tools from perturbation theory in convex analysis; see, e.g., the book of Bonnans and Shapiro [11]. Let us briefly mention the general idea. Let \(v: \mathbb{R}^{m} \rightarrow (-\infty,\infty ]\) be an extended real-valued function, where only functions \(\not\equiv \infty \) are considered in the following. The Fenchel conjugate of v is defined by

$$\displaystyle{v^{{\ast}}(\alpha ):=\sup _{x\in \mathbb{R}^{m}}\{\langle \alpha,x\rangle - v(x)\}}$$

and the biconjugate of v by \(v^{{\ast}{\ast}}:= (v^{{\ast}})^{{\ast}}\). In general, the inequality \(v^{{\ast}{\ast}}(x) \leq v(x)\) holds true and becomes an equality if and only if v is convex and lower semicontinuous. (Later this inequality is reflected in the fact that one minimizes the primal and maximizes the dual problem.) For convex, lower semicontinuous functions \(\theta: \mathbb{R}^{d} \rightarrow (-\infty,\infty ]\), \(\gamma: \mathbb{R}^{m} \rightarrow (-\infty,\infty ]\) and \(g: \mathbb{R}^{d} \rightarrow \mathbb{R}^{m}\) one considers the primal problems

$$\displaystyle\begin{array}{rcl} \mathrm{(P_{\mathit{u}})}\quad v(u)& =& \inf _{x\in \mathbb{R}^{d}}\{\theta (x) +\gamma (g(x) + u)\}, {}\\ \mathrm{(P)}\quad \;v(0)& =& \inf _{x\in \mathbb{R}^{d}}\{\theta (x) +\gamma (g(x))\} {}\\ \end{array}$$

where \(u \in \mathbb{R}^{m}\) is the “perturbation.” With \(L(x,\alpha ):=\theta (x) +\langle g(x),\alpha \rangle\), the dual problem reads

$$\displaystyle\begin{array}{rcl} \mathrm{(D_{\mathit{u}})}\quad v^{{\ast}{\ast}}(u)& =& \sup _{\alpha \in \mathbb{R}^{m}}\{\langle \alpha,u\rangle -\gamma ^{{\ast}}(\alpha ) +\inf _{ x\in \mathbb{R}^{d}}L(x,\alpha )\}, {}\\ \mathrm{(D)}\quad v^{{\ast}{\ast}}(0)& =& \sup _{\alpha \in \mathbb{R}^{m}}\{ -\gamma ^{{\ast}}(\alpha ) +\inf _{ x\in \mathbb{R}^{d}}L(x,\alpha )\}. {}\\ \end{array}$$

For the special setting with the indicator function

$$\displaystyle{\gamma (y) =\iota _{\mathbb{R}_{-}^{m}}(y):= \left \{\begin{array}{ll} 0 &{\mathrm{if}}\;y \leq 0,\\ \infty &{\mathrm{otherwise}} \end{array} \right.}$$

the primal problem (P) is equivalent to

$$\displaystyle{\theta (x)\; \rightarrow \;\min \limits _{x}\quad \mbox{ subject to}\quad g(x) \leq 0}$$

and since \(\gamma ^{{\ast}} =\iota _{\mathbb{R}_{+}^{m}}\) the dual problem (D) becomes

$$\displaystyle{\sup _{\alpha \in \mathbb{R}^{m}}\inf _{x\in \mathbb{R}^{d}}L(x,\alpha )\quad \mbox{ subject to}\quad \alpha \geq 0.}$$

Again, if θ and g are convex and differentiable and θ is uniformly convex, then the unique solution \(\hat{x}(\alpha )\) of ∇ x L(x, α) = 0 is the solution of the infimum problem, and the dual problem becomes \(\sup _{\alpha \in \mathbb{R}^{m}}L(\hat{x}(\alpha ),\alpha )\) subject to α ≥ 0.
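As a quick check of the conjugate used above (an added remark, not part of the original text), the claim \(\gamma ^{{\ast}} =\iota _{\mathbb{R}_{+}^{m}}\) follows directly from the definition:

$$\displaystyle{\iota _{\mathbb{R}_{-}^{m}}^{{\ast}}(\alpha ) =\sup _{y\leq 0}\langle \alpha,y\rangle = \left \{\begin{array}{ll} 0 &{\mathrm{if}}\;\alpha \geq 0,\\ \infty &{\mathrm{otherwise}}, \end{array} \right.}$$

since for \(\alpha \geq 0\) the supremum is attained at y = 0, while a negative component \(\alpha _{k} < 0\) allows \(y_{k} \rightarrow -\infty \) to drive the inner product to \(+\infty \).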

4.3 Results from Generalization Theory

There exists a huge number of results on the generalization abilities of statistical learning methods and in particular of support vector machines. The following subsection can only give a rough impression of the general tasks considered in this field from a simplified mathematical point of view that ignores technicalities, e.g., the definition of the correct measure and function spaces and what measurability means in this context. Most of the material is borrowed from the book of Steinwart and Christmann [93], where the reader can find a sound mathematical treatment of the topic.

To start with, remember that the aim in Sect. 3 was to find a function \(f: \mathcal{X} \rightarrow \mathbb{R}\) from samples \(Z:=\{ (x_{i},y_{i}): i = 1,\ldots,m\}\) such that f(x) is a good prediction of y at x for \((x,y) \in \mathcal{X}\times \mathcal{Y}\). Let P denote an unknown probability distribution on \(\mathcal{X}\times \mathcal{Y}\). Then a general assumption is that the data used in training and testing are independent and identically distributed (iid) according to P. The loss function or cost function \(L: \mathcal{X}\times \mathcal{Y}\times \mathbb{R} \rightarrow [0,\infty )\) describes the cost of the discrepancy between the prediction f(x) and the observation y at x. The choice of the loss function depends on the specific learning goal. In the models of this paper, the loss functions depend on x only via f(x) such that one can simply write L(y, f(x)). In Sect. 3, the hinge loss function and the least squares loss function were used for classification tasks. Originally, one was interested in the 0∕1 classification loss \(L_{0/1}: \mathcal{Y}\times \mathbb{R} \rightarrow \{ 0,1\}\) defined by

$$\displaystyle{L_{0/1}(y,t):= \left \{\begin{array}{ll} 0&{\mathrm{if}}\;y = {\mathrm{sgn}}(t),\\ 1 &\mathrm{otherwise }. \end{array} \right.}$$

To the loss function, there is associated a risk which is the expected loss of f:

$$\displaystyle{R_{L,P}(f):=\int _{\mathcal{X}\times \mathcal{Y}}L(x,y,f(x))dP(x,y) =\int _{\mathcal{X}}\int _{\mathcal{Y}}L(x,y,f(x))dP(y\vert x)dP_{\mathcal{X}}.}$$

For example, the 0∕1 loss function has the risk

$$\displaystyle{R_{L_{0/1},P}(f) = P\left ((x,y) \in \mathcal{X}\times \mathcal{Y}: {\mathrm{sgn}}f(x)\not =y\right ).}$$
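To make the risk concrete, the following small Python sketch (added for illustration; the toy distribution, the decision function f, and the sample size are arbitrary choices) estimates the 0∕1 risk of a fixed f by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution P on X x Y with X = R and Y = {-1, +1}:
# y is +1 or -1 with equal probability, and x ~ N(y, 1).
def sample(n):
    y = rng.choice([-1.0, 1.0], size=n)
    x = y + rng.normal(size=n)
    return x, y

# A fixed decision function f(x) = x, so the classifier is sgn(f(x)) = sgn(x).
f = lambda x: x

def zero_one_loss(y, t):
    # L_{0/1}(y, t) = 0 if y = sgn(t), 1 otherwise
    return (np.sign(t) != y).astype(float)

# Monte Carlo estimate of the risk R_{L_{0/1},P}(f) = P(sgn f(x) != y)
x, y = sample(1_000_000)
risk_estimate = zero_one_loss(y, f(x)).mean()
print("estimated 0/1 risk:", risk_estimate)
```

For this particular toy distribution the true risk equals \(\Phi (-1) \approx 0.159\), so the Monte Carlo estimate can be checked against a closed-form value.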

A function f is considered to be “better” the smaller the risk is. Therefore, one is interested in the minimal risk or Bayes risk defined by

$$\displaystyle{ R_{L,P}^{{\ast}}:=\inf _{ f:\mathcal{X}\rightarrow \mathbb{R}}R_{L,P}(f), }$$
(41)

where the infimum is taken over all possible (measurable) functions. However, since the distribution P is unknown, it is impossible to find a minimizer of \(R_{L,P}\). In learning tasks one can exploit finite training sets Z of iid data. A learning method on \(\mathcal{X}\times \mathcal{Y}\) maps every data set \(Z \in (\mathcal{X}\times \mathcal{Y})^{m}\) to a function \(f_{Z}: \mathcal{X} \rightarrow \mathbb{R}\). A learning method should produce, for sufficiently large training sets Z, nearly optimal decision functions \(f_{Z}\) with high probability. A measurable learning method is called L-risk consistent for P if

$$\displaystyle{\lim _{m\rightarrow \infty }P^{m}\left (Z \in (\mathcal{X}\times \mathcal{Y})^{m}: R_{ L,P}(f_{Z}) \leq R_{L,P}^{{\ast}}+\varepsilon \right ) = 1\quad \forall \varepsilon > 0}$$

and universally L-risk consistent if it is L-risk consistent for all distributions P on \(\mathcal{X}\times \mathcal{Y}\). The first learning method that was proved to be universally consistent was the so-called nearest neighbor method; see Stone [94]. Many universally consistent classification and regression methods are presented in the books of Devroye et al. [30] and Györfi [46]. Consistency does not address the speed of convergence, i.e., convergence rates. Unfortunately, the no-free-lunch theorem of Devroye [31] says that for every learning method there exists a distribution P for which the learning method cannot produce a "good" decision function in the above sense with an a priori fixed speed of convergence. To obtain uniform convergence rates, one has to pose additional requirements on P.

Instead of the risk one can deal with the empirical risk defined by

$$\displaystyle{R_{L,Z}(f):= \frac{1} {m}\sum _{i=1}^{m}L\left (x_{ i},y_{i},f(x_{i})\right ).}$$

Then the law of large numbers shows that \(R_{L,Z}(f)\) becomes a "good" approximation of \(R_{L,P}(f)\) for a fixed f if m is "large enough." However, finding a minimizer of

$$\displaystyle{ \inf _{f:\mathcal{X}\rightarrow \mathbb{R}}R_{L,Z}(f) }$$
(42)

does in general not lead to a good approximation of the minimal risk \(R_{L,P}^{{\ast}}\). For example, the function which classifies all training points \(x_{i}\) correctly and is zero elsewhere is a minimizer of the above functional (42) but gives in general a poor approximation of the optimal decision function according to (41). This is an example of so-called overfitting, where the learning method approximates the training data too closely and has poor generalization/prediction properties. One common way to cope with this phenomenon is to choose a smaller set \(\mathcal{F}\) of functions, e.g., subsets of continuous functions, which should have good approximation properties. In the SVMs treated in Sect. 3, this set \(\mathcal{F}\) was a RKHS \(\mathcal{H}_{K}\). Then one considers the empirical risk minimization (ERM)

$$\displaystyle{ \inf _{f\in \mathcal{F}}R_{L,Z}(f). }$$
(43)

Let a minimizer \(f_{Z}\) of (43) be somehow "computed." (In this subsection, the question of the existence and uniqueness of a minimizer of the various functionals is not addressed.) Then one is of course interested in the error \(R_{L,P}(f_{Z}) - R_{L,P}^{{\ast}}\). Using the infinite-sample counterpart of the ERM

$$\displaystyle{R_{L,P,\mathcal{F}}^{{\ast}}:=\inf _{ f\in \mathcal{F}}R_{L,P}(f)}$$

this error can be split as

$$\displaystyle{R_{L,P}(f_{Z})-R_{L,P}^{{\ast}} =\mathop{\underbrace{ R_{ L,P}(f_{Z}) - R_{L,P,\mathcal{F}}^{{\ast}}}}\limits _{ \mbox{ sample error}}+\mathop{\underbrace{R_{L,P,\mathcal{F}}^{{\ast}}- R_{ L,P}^{{\ast}}}}\limits _{ \mbox{ approximation error}}.}$$

The first error, called sample error, is a probabilistic one since it depends on the random samples, while the second error, called approximation error, is a deterministic one. Finding a good balance between both errors is sometimes called the bias-variance problem, where the bias is related to the approximation error and the variance to the sample error.

Concerning the approximation error, it turns out that for an RKHS \(\mathcal{F} = \mathcal{H}_{K}\) on a compact metric space \(\mathcal{X}\) which is dense in \(C(\mathcal{X})\), and for continuous, P-integrable, so-called Nemitski losses, this error becomes zero; see Corollary 5.29 in Steinwart and Christmann [93]. In particular, this is true for the RKHS with the Gaussian kernel and the loss functions considered in Sect. 3. For relations between the approximation error, interpolation spaces, and K-functionals, see the book of Cucker and Zhou [29] and the references therein.

Concerning the sample error, there is a huge number of results, and this paper can only cover some basic directions. For a survey on recent developments in the statistical analysis of classification methods, see Boucheron et al. [14]. Based on Hoeffding's inequality, the first of such relations goes back to Vapnik and Chervonenkis [107]. See also the books of Tsypkin [104], Vapnik [105], Anthony and Bartlett [3], and Vidyasagar [109]. To get an impression of how such estimates look, two of them, from Propositions 6.18 and 6.22 of the book of Steinwart and Christmann [93], are presented in the following. If \(\mathcal{F}\) is finite and \(L(x,y,f(x)) \leq B\), then it holds for all m ≥ 1 that

$$\displaystyle{P^{m}\left (Z \in (\mathcal{X}\times \mathcal{Y})^{m}: R_{ L,P}(f_{Z}) - R_{L,P,\mathcal{F}}^{{\ast}}\geq B\sqrt{\frac{2\tau + 2\ln (2\vert \mathcal{F}\vert )} {m}} \right ) \leq e^{-\tau }\quad \forall \tau > 0.}$$

If the function class \(\mathcal{F}\) is infinite, in particular uncountable, one needs some bound on the "complexity" of \(\mathcal{F}\). The most classical such "complexity" measure is the VC dimension (see Vapnik and Chervonenkis [107]), applied in connection with the 0∕1 loss function. Another possibility is the use of covering numbers or their counterpart, entropy numbers, going back to Kolmogorov and Tikhomirov [57]. The ɛ-covering number of a metric space (T,d) is the size of the smallest ɛ-net of T, i.e.,

$$\displaystyle{N(T,d,\varepsilon ):=\inf \left \{n \geq 1: \exists s_{1},\ldots,s_{n} \in T\;\mbox{ such that}\;T \subset \bigcup \limits _{i=1}^{n}B_{ d}(s_{i},\varepsilon )\right \},}$$

where \(B_{d}(s,\varepsilon )\) is the closed ball with center s and radius ɛ. For the estimation of covering numbers, see Edmunds and Triebel [35] and Pinkus [79]. Then, for compact \(\mathcal{F}\subset L_{\infty }(X)\), one basically has to replace \(\vert \mathcal{F}\vert \) in the above relation by the covering number of \(\mathcal{F}\):

$$\displaystyle\begin{array}{rcl} & & P^{m}\Bigg(Z \in (\mathcal{X}\times \mathcal{Y})^{m}: R_{ L,P}(f_{Z}) - R_{L,P,\mathcal{F}}^{{\ast}} {}\\ & &\qquad \qquad \quad \geq B\sqrt{\frac{2\tau + 2\ln (2N(\mathcal{F},\| \cdot \|_{\infty },\varepsilon ))} {m}} + 4\varepsilon \vert L\vert _{M,1}\Bigg) \leq e^{-\tau } {}\\ \end{array}$$

for all τ > 0 and all ɛ > 0, where one assumes in addition that \(\|f\|_{\infty }\leq M\) for all \(f \in \mathcal{F}\) and that L is locally Lipschitz continuous with constant \(\vert L\vert _{M,1}\).
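As an illustration of covering numbers (an added sketch; the greedy construction only yields an upper bound on \(N(T,d,\varepsilon )\), not its exact value), the following Python fragment computes an ɛ-net of a finite point cloud:

```python
import numpy as np

rng = np.random.default_rng(1)

def greedy_epsilon_net(points, eps):
    """Greedily pick centers until every point lies in a closed ball of
    radius eps around some center; len(net) upper-bounds N(T, d, eps)."""
    net = []
    uncovered = np.ones(len(points), dtype=bool)
    while uncovered.any():
        s = points[np.argmax(uncovered)]        # first uncovered point as new center
        net.append(s)
        dist = np.linalg.norm(points - s, axis=1)
        uncovered &= dist > eps                 # points within eps are now covered
    return np.array(net)

T = rng.random((500, 2))                        # 500 points in the unit square
for eps in (0.3, 0.2, 0.1):
    print(eps, "->", len(greedy_epsilon_net(T, eps)), "centers")
```

The logarithm \(\ln N(T,d,\varepsilon )\) is often called the metric entropy of T.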

Next let us turn to the SVM setting, where an additional term comes along with the loss function, namely, one is interested in minimizers of

$$\displaystyle{\inf _{f\in \mathcal{H}_{K}}\left \{R_{L,Z}(f) +\lambda \| f\|_{\mathcal{H}_{K}}^{2}\right \},\quad \lambda > 0}$$

with a regularization term \(\lambda \|f\|_{\mathcal{H}_{K}}^{2}\) that penalizes functions f with large RKHS norms. The techniques developed for ERM analysis can be extended to the SVM setting.
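For the least squares loss, this regularized problem can be solved in closed form once the finite representation of the minimizer (the representer theorem discussed next) is used. The following Python sketch is an added illustration; the Gaussian kernel width, λ, and the toy data are arbitrary choices, and the linear system \((\mathbf{K} +\lambda mI)\alpha = y\) is derived exactly for the objective \(\frac{1}{m}\sum _{i}(y_{i} - f(x_{i}))^{2} +\lambda \| f\|_{\mathcal{H}_{K}}^{2}\):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression data
m = 50
x = rng.uniform(-3, 3, size=m)
y = np.sin(x) + 0.1 * rng.normal(size=m)

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

lam = 1e-2
K = gaussian_kernel(x, x)

# Plugging f = sum_j alpha_j K(x_j, .) into (1/m)*||y - K alpha||^2 + lam*alpha^T K alpha
# and setting the gradient to zero yields (K + lam*m*I) alpha = y.
alpha = np.linalg.solve(K + lam * m * np.eye(m), y)

f = lambda t: gaussian_kernel(np.atleast_1d(t), x) @ alpha
print("fit at x[0]:", f(x[0])[0], "vs. observation y[0]:", y[0])
```

For the hinge loss no such closed form exists, which is one reason why the QP algorithms of Sect. 5 are needed.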

First let us mention that under some assumptions on the loss function, which are fulfilled for the setting in Sect. 3, a unique minimizer \(f_{Z,\lambda }\) exists and has the form

$$\displaystyle{f_{Z,\lambda } =\sum _{ i=1}^{m}\alpha _{ i}K(x_{i},\cdot ).}$$

This was established in the representer theorem by Kimeldorf and Wahba [56] for special continuous loss functions and generalized, e.g., in Schölkopf et al. [86]. There also exist representer-like theorems for the minimizer \(f_{P,\lambda }\) of the infinite-sample setting

$$\displaystyle{\inf _{f\in \mathcal{H}_{K}}\left \{R_{L,P}(f) +\lambda \| f\|_{\mathcal{H}_{K}}^{2}\right \}}$$

(see Steinwart [92], de Vito [111], and Dinuzzo et al. [33]). One can show for the infinite-sample setting that the error

$$\displaystyle{A(\lambda ):=\inf _{f\in \mathcal{H}_{K}}\left \{R_{L,P}(f) +\lambda \| f\|_{\mathcal{H}_{K}}^{2}\right \} - R_{ L,P,\mathcal{H}_{K}}^{{\ast}}}$$

tends to zero as λ goes to zero and that \(\lim _{\lambda \rightarrow 0}R_{L,P}(f_{P,\lambda }) = R_{L,P,\mathcal{H}_{K}}^{{\ast}}\). Let us come to the essential question of how close \(R_{L,P}(f_{Z,\lambda })\) is to \(R_{L,P}^{{\ast}}\). Recall that \(R_{L,P}^{{\ast}} = R_{L,P,\mathcal{H}_{K}}^{{\ast}}\) for the above-mentioned RKHS. An estimate in the spirit of the ERM analysis has, for example, the form

$$\displaystyle\begin{array}{rcl} & & P^{m}\Bigg(Z \in (\mathcal{X}\times \mathcal{Y})^{m}: R_{ L,P}(f_{Z,\lambda }) +\lambda \| f_{Z,\lambda }\|_{\mathcal{H}_{K}}^{2} - R_{ L,P,\mathcal{H}_{K}}^{{\ast}} {}\\ & &\qquad \geq A(\lambda ) + (\lambda ^{-1/2}\vert L\vert _{\lambda ^{ -1/2},1} + 1)\sqrt{\frac{2\tau + 2\ln (2N(B_{\mathcal{H}_{K } },\| \cdot \|_{\infty },\lambda ^{1/2 } \varepsilon ))} {m}} + 4\varepsilon \vert L\vert _{\lambda ^{-1/2},1}\Bigg) \leq e^{-\tau }, {}\\ \end{array}$$

for τ > 0, where one assumes that the continuous kernel fulfills \(\|K\|_{\infty }\leq 1\), L(x, y, 0) ≤ 1 and \(B_{\mathcal{H}}\) is the closed unit ball in \(\mathcal{H}\) (see Theorem 6.25 in Steinwart and Christmann [93] and also Cucker and Smale [28] and Bousquet and Elisseeff [15]). For a certain decay of the covering number \(\ln (2N(B_{\mathcal{H}_{K}},\|\cdot \|_{\infty },\varepsilon ))\) in ɛ and a RKHS for which the approximation error becomes zero, one can then conclude that, for zero sequences \((\lambda _{m})_{m\geq 1}\) with a suitable additional decay property related to the decay of the covering number, the relation \(R_{L,P}(f_{Z,\lambda _{m}}) \rightarrow R_{L,P}^{{\ast}}\) holds true in probability.

The above relations can be further specified for classification and regression tasks with special loss functions. With respect to classification, one can find, for example, upper bounds for the risk in terms of the margin or the number of support vectors. For the 0∕1 loss function, the reader may consult, e.g., the book of Cristianini and Shawe-Taylor [27]. For the hinge loss function and the soft margin SVM with \(C = 1/(2\lambda m)\), it holds, for example, that

$$\displaystyle{\frac{\vert I_{S}\vert } {m} \geq 2\lambda \|f_{Z,\lambda }\|_{\mathcal{H}_{K}}^{2} + R_{ L,Z}(f_{Z,\lambda })}$$

(see Proposition 8.27 in Steinwart and Christmann [93]). For a suitable zero sequence \((\lambda _{m})_{m\geq 1}\) and a RKHS with zero approximation error, the following relation is satisfied:

$$\displaystyle{\lim _{m\rightarrow \infty }P^{m}\left (Z \in (\mathcal{X}\times \mathcal{Y})^{m}: \frac{\vert \{i:\alpha _{ i}^{{\ast}}(Z) > 0\}\vert } {m} \geq R_{L,P}^{{\ast}}-\varepsilon \right ) = 1,\quad \varepsilon > 0.}$$

Finally, let us address the setting where the risk function defining the learning task is hard to handle numerically. One example is the risk function associated with the 0∕1 loss function, which is neither continuous nor convex. One remedy is to replace such an unpleasant loss function L by a convex surrogate \(L_{\mathrm{sur}}\), where one has to ensure that the minimizer \(f_{Z}\) in (43) for the surrogate loss fulfills \(R_{L,P}(f_{Z}) \approx R_{L,P}^{{\ast}}\). For the hinge loss as surrogate of the 0∕1 loss function, Zhang [119] has proved that

$$\displaystyle{R_{L_{0/1},P}(f) - R_{L_{0/1},P}^{{\ast}}\leq R_{ L_{h},P}(f) - R_{L_{h},P}^{{\ast}}}$$

for all measurable functions f. Thus, if \(R_{L_{h},P}(f_{Z}) - R_{L_{h},P}^{{\ast}}\) is small, the same follows for the original excess risk. For a systematic treatment of surrogate loss functions, the reader may consult Chapter 3 in the book of Steinwart and Christmann [93], which was partially inspired by the work of Bartlett et al. [6].
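The elementary fact behind such surrogate bounds is that the hinge loss dominates the 0∕1 loss pointwise, \(L_{0/1}(y,t) \leq L_{h}(y,t) =\max \{ 0,1 - yt\}\) for \(y \in \{-1,1\}\). A minimal numeric check of this domination (added for illustration only, and of course weaker than Zhang's excess-risk inequality above) is:

```python
import numpy as np

rng = np.random.default_rng(3)

def loss_01(y, t):
    # 0/1 loss: 0 if y = sgn(t), 1 otherwise (with sgn(0) := 0, as in numpy)
    return (np.sign(t) != y).astype(float)

def loss_hinge(y, t):
    return np.maximum(0.0, 1.0 - y * t)

y = rng.choice([-1.0, 1.0], size=10_000)
t = rng.normal(scale=3.0, size=10_000)

assert np.all(loss_01(y, t) <= loss_hinge(y, t))
print("hinge loss dominates the 0/1 loss on all sampled pairs")
```

Taking expectations of this pointwise bound already gives \(R_{L_{0/1},P}(f) \leq R_{L_{h},P}(f)\); Zhang's inequality above is the sharper statement about excess risks.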

5 Numerical Methods

This section concentrates on the support vector machines of Sect. 3; numerical methods for the other models were sketched when they were introduced. Support vector machines ultimately require the minimization of a quadratic functional subject to linear constraints (QP). These minimization problems involve a symmetric, fully populated kernel matrix whose size is the number m of training samples. Hence, this matrix has in general \(m(m + 1)/2\) distinct nonzero coefficients one has to work with. Therefore, one has to distinguish between small- to moderate-sized problems, where such a matrix can be stored in the RAM of the computer, and large-sized problems, say, with more than a million training samples.

For quadratic programming with small to moderate data sizes, various by now standard algorithms exist. They are implemented in commercial software packages like CPLEX or MOSEK (see also the MATLAB optimization toolbox) and in freeware packages like MINOS and LOQO. Among them, primal-dual interior point algorithms belong to the most reliable and accurate techniques. The main idea of interior point methods is to solve the primal and dual problems simultaneously by enforcing the Kuhn-Tucker conditions to iteratively find a feasible solution. The duality gap, i.e., the difference between the minimum of the primal problem and the maximum of the dual problem, is used to measure the quality of the current set of variables and to check whether the stopping criteria are fulfilled. For QP algorithms, including recent algorithms for solving large QPs, the reader may consult the book of Zdenek [118].
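As a toy illustration of using such a general-purpose solver (an added sketch; it uses the standard form of the soft margin dual, \(\max _{\alpha }\sum _{i}\alpha _{i} -\frac{1}{2}\sum _{i,j}\alpha _{i}\alpha _{j}y_{i}y_{j}K(x_{i},x_{j})\) subject to \(0 \leq \alpha _{i} \leq C\) and \(\sum _{i}y_{i}\alpha _{i} = 0\), rather than the transformed form written below, and an SQP routine rather than an interior point method), one may write:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

# Tiny two-class toy problem
m = 40
X = np.vstack([rng.normal(loc=[-1, -1], size=(m // 2, 2)),
               rng.normal(loc=[+1, +1], size=(m // 2, 2))])
y = np.concatenate([-np.ones(m // 2), np.ones(m // 2)])
C = 1.0

K = X @ X.T                           # linear kernel
Q = (y[:, None] * y[None, :]) * K

def neg_dual(alpha):                  # minimize the negative dual objective
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def neg_dual_grad(alpha):
    return Q @ alpha - np.ones(m)

res = minimize(neg_dual, np.zeros(m), jac=neg_dual_grad, method="SLSQP",
               bounds=[(0.0, C)] * m,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                   # primal weight vector for the linear kernel
print("number of support vectors:", int(np.sum(alpha > 1e-6)))
```

For problems of realistic size one would of course use a dedicated QP or SVM solver as discussed in the following.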

The problem of learning from large data sets has mainly been addressed by so-called "working set" methods. The idea is the following: if one knew in advance which constraints were active, one could discard all of the inactive constraints, which simplifies the problem.

The simplest method in this direction is known as chunking. It starts with an arbitrary subset ("chunk" = working set) of the data and trains the SVM using an optimizer on this subset. The algorithm then keeps the support vectors and deletes the others. Next, the M points (M an algorithm parameter) of the remaining data on which the "current SVM" makes the largest errors are added to these support vectors to form a new chunk. This procedure is iterated. In general, the working set grows until in the last iteration the machine is trained on the set of support vectors representing the active constraints. Chunking techniques in SVMs were already used by Vapnik [106] and were improved and generalized in various papers.

Currently, more advanced "working set" methods, namely, decomposition algorithms, are one of the major tools to train SVMs. These methods select in each iteration a small, fixed-size subset of variables as the working set, and a QP problem is solved with respect to this set (see, e.g., Osuna et al. [77]). A special type of decomposition method is the sequential minimal optimization (SMO), which uses working sets of only two variables. This method was introduced by Platt [80] for classification; see Flake and Lawrence [42] for regression. The main advantage of these extremely small working sets is that the partial QP problems can be solved analytically. For the soft margin SVM in the dual form from Sect. 3 (with the change of variables \(\alpha \mapsto Y\alpha \)),

$$\displaystyle{\frac{1} {2}\,\alpha ^{\mbox{ T} }\mathbf{K}\alpha -\langle y,\alpha \rangle \; \rightarrow \;\min _{\alpha }\quad \mbox{ subject to}\quad \langle 1_{m},\alpha \rangle = 0,\;0 \leq y\alpha \leq C,}$$

the SMO algorithm looks as follows:

SMO-type decomposition methods

1. Fix \(\alpha ^{(1)}\) as an initial feasible solution and set k := 1.

2. If \(\alpha ^{(k)}\) solves the dual problem up to the desired precision, stop. Otherwise, select a working set \(B:=\{ i,j\} \subset \{ 1,\ldots,m\}\), define \(N:=\{ 1,\ldots,m\}\setminus B\), and let \(\alpha _{B}^{(k)}\) and \(\alpha _{N}^{(k)}\) be the sub-vectors of \(\alpha ^{(k)}\) corresponding to B and N, respectively.

3. Solve the following subproblem with fixed \(\alpha _{N}^{(k)}\) for \(\alpha _{B}\):

\(\frac{1} {2}\left (\alpha _{B}^{\mbox{ T}}\,(\alpha _{N}^{(k)})^{\mbox{ T}}\right )\left (\begin{array}{*{10}c} K_{BB}&K_{BN} \\ K_{NB}&K_{NN}\end{array} \right )\left (\begin{array}{*{10}c} \alpha _{B} \\ \alpha _{N}^{(k)}\end{array} \right )-\left (y_{B}^{\mbox{ T}}\,y_{N}^{\mbox{ T}}\right )\left (\begin{array}{*{10}c} \alpha _{B} \\ \alpha _{N}^{(k)}\end{array} \right )\)

\(= \frac{1} {2}\left (\alpha _{i}\,\alpha _{j}\right )\left (\begin{array}{*{10}c} K_{ii}&K_{ij} \\ K_{ij}&K_{jj}\end{array} \right )\left (\begin{array}{*{10}c} \alpha _{i}\\ \alpha _{j}\end{array} \right )-\left (\alpha _{i}\,\alpha _{j}\right )\left (K_{BN}\alpha _{N}^{(k)} - y_{B}\right )+{\mathrm{constant}}\; \rightarrow \;\min _{\alpha _{B}}\)

subject to \(\alpha _{i} +\alpha _{j} = -1_{m-2}^{\mbox{ T}}\alpha _{N}^{(k)},\quad 0 \leq y_{i}\alpha _{i},y_{j}\alpha _{j} \leq C\).

Set \(\alpha _{B}^{(k+1)}\) to be the minimizer.

4. Set \(\alpha _{N}^{(k+1)}:=\alpha _{N}^{(k)}\), \(k \leftarrow k + 1\), and go to Step 2.

The analytical solution in Step 3 is given as follows: for simplicity, set \(\beta:= -1_{m-2}^{\mbox{ T}}\alpha _{ N}^{(k)}\) and \((c_{i}\,c_{j})^{\mbox{ T}}:= K_{ BN}\alpha _{N}^{(k)} - y_{ B}\). Substituting \(\alpha _{j} =\beta -\alpha _{i}\) from the first constraint into the objective function, one gets

$$\displaystyle{\frac{1} {2}\,\alpha _{i}^{2}(K_{ ii} - 2K_{ij} + K_{jj}) +\alpha _{i}(\beta K_{ij} -\beta K_{jj} - c_{i} + c_{j}) + {\mathrm{constant}}\; \rightarrow \;\min _{\alpha _{i}}.}$$

If K is positive definite, it holds that \(K_{ii} - 2K_{ij} + K_{jj} > 0\), and the above function has a unique finite global minimizer \(\alpha _{ig}\). One also has to take care of the second constraint, which requires that \(\alpha _{i} \in \left [L,U\right ]\), where L and U are defined by

$$\displaystyle{(L,U):= \left \{\begin{array}{lll} (\max (0,\beta -C),\min (C,\beta )) &{\mathrm{if}}&y_{i} = 1,y_{j} = 1, \\ (\max (0,\beta ),\min (C,\beta +C)) &{\mathrm{if}}&y_{i} = 1,y_{j} = -1, \\ (\max (-C,\beta -C),\min (0,\beta ))&{\mathrm{if}}&y_{i} = -1,y_{j} = 1, \\ (\max (-C,\beta ),\min (0,\beta +C))&{\mathrm{if}}&y_{i} = -1,y_{j} = -1. \end{array} \right.}$$

Hence the minimizer in Step 3 is given by \((\alpha _{i},\beta -\alpha _{i})\), where

$$\displaystyle{\alpha _{i}:= \left \{\begin{array}{lll} \alpha _{ig} &{\mathrm{if}}&\alpha _{ig} \in \left [L,U\right ], \\ L &{\mathrm{if}}&\alpha _{ig} < L, \\ U &{\mathrm{if}}&\alpha _{ig} > U. \end{array} \right.}$$
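A direct Python transcription of this analytic two-variable update (an added sketch; the variable names follow the formulas above, and a positive definite K with \(K_{ii} - 2K_{ij} + K_{jj} > 0\) is assumed):

```python
import numpy as np

def smo_pair_update(K_ii, K_ij, K_jj, c_i, c_j, beta, y_i, y_j, C):
    """Solve the two-variable subproblem of Step 3 analytically.

    beta = -1^T alpha_N^(k), (c_i, c_j)^T = K_BN alpha_N^(k) - y_B,
    and alpha_j = beta - alpha_i is eliminated via the equality constraint.
    """
    # Unconstrained minimizer of the one-dimensional quadratic in alpha_i
    denom = K_ii - 2.0 * K_ij + K_jj              # > 0 for positive definite K
    alpha_ig = -(beta * K_ij - beta * K_jj - c_i + c_j) / denom

    # Box [L, U] induced by 0 <= y_i alpha_i, y_j alpha_j <= C
    if y_i == 1 and y_j == 1:
        L, U = max(0.0, beta - C), min(C, beta)
    elif y_i == 1 and y_j == -1:
        L, U = max(0.0, beta), min(C, beta + C)
    elif y_i == -1 and y_j == 1:
        L, U = max(-C, beta - C), min(0.0, beta)
    else:
        L, U = max(-C, beta), min(0.0, beta + C)

    alpha_i = min(max(alpha_ig, L), U)            # clip to [L, U]
    return alpha_i, beta - alpha_i
```

In a full SMO implementation this update is embedded in the loop of Steps 2–4 together with a working set selection rule, which is discussed next.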

It remains to determine the selection of the working set. (The determination of stopping criteria is beyond the scope of this paper.) Indeed, current decomposition methods vary mainly in their working set selections. The SVMlight algorithm of Joachims [52] was originally based on a working set selection rule of Zoutendijk [120]. Moreover, this algorithm uses a shrinking technique to reduce the computational time. Shrinking is based on the idea that if a variable \(\alpha _{i}^{(k)}\) remains equal to zero or C for many iteration steps, then it will probably not change anymore. Such a variable can be removed from the optimization problem so that a more efficient overall optimization is obtained. Another shrinking implementation is used in the software package LIBSVM of Chang and Lin [22]. A modification of Joachims' algorithm for regression, called "SVMTorch," was given by Collobert and Bengio [25]. An often used working set selection, due to Keerthi et al. [55], is the so-called maximal violating pair strategy. A more general way of choosing the two-element working set, namely by choosing a constant factor violating pair, was given, including a convergence proof, by Chen et al. [24]. For convergence results, see also the paper of Lin [65]. The maximal violating pair strategy relies on first-order (i.e., gradient) information of the objective function. For a QP, however, second-order information directly relates to the decrease of the objective function. The paper of Fan et al. [38] proposes a promising working set selection based on second-order information.

For an overview of SVM solvers for large data sets, the reader may also consult the books of Huang et al. [50] and Bottou et al. [13] and the paper of Mangasarian and Musicant [69] with the references therein. An extensive list of SVM software, including implementations for logistic and least squares loss functions, can be found on the webpages www.kernel-machines.org and www.support-vector-machines.org.

6 Conclusion

The invention of SVMs in the 1990s led to an explosion of applications and theoretical results. This paper could only give a very basic introduction to the by now classical techniques in this field. It is restricted to supervised learning, although SVMs also have a large impact on semi-supervised and unsupervised learning.

Some new developments were sketched, such as multitask learning, where, in contrast to single-task learning, only limited work has been done so far and novel techniques from convex analysis come into play.

An issue that is not addressed in this paper is the robustness of SVMs. There is some ongoing research on connections between stability, learning, and prediction of ERM methods (see, e.g., the papers of Elisseeff et al. [36] and Mukherjee et al. [74]).

Another field that has recently attracted attention is the use of kernels as diffusion kernels on graphs (see Kondor and Lafferty [58] and also the book of Shawe-Taylor and Cristianini [88]).

7 Cross-References