
1 Introduction

The support vector machine (SVM) is a widely used machine learning model proposed by Vapnik et al. on the basis of statistical learning theory [1]. It handles nonlinear problems through kernel-function mapping, thereby improving the performance of the learning machine [2, 3], and it currently attracts more and more research attention internationally [4, 5]. Machine learning is a basic problem in the field of artificial intelligence. Many learning systems have only a few samples at the early stage of learning; as time goes by, samples accumulate ceaselessly [6, 7], but the performance of the system declines, so the new samples must be brought into training to improve the system’s performance [8]. If the new samples are processed together with all the old ones, however, the sample set becomes very large, learning becomes more difficult, and too much time and storage space are consumed. An effective approach is to add, modify or delete only the part of the original learning result that is relevant to the new samples, leaving the irrelevant parts untouched; this is incremental learning. The learning result of a support vector machine is its support vector set, which is usually only a small part of the whole sample set but can represent the features of the entire sample set. This suggests that an SVM incremental learning method based on the support vector set is feasible. By analyzing the support vectors, this paper proposes an incremental learning algorithm for support vector machines based on the minimum classification error criterion [9].

2 Problem Description

2.1 The Function of Support Vector Machine and Support Vector

The support vector machine originates from the classification problem in the linear case. The goal of classification with a support vector machine is to find the optimal hyperplane, i.e., the maximum-margin hyperplane: the hyperplane that classifies the two kinds of samples correctly while making the classification margin between them as large as possible, as shown in Fig. 83.1 [10].

Fig. 83.1 Optimal separating hyperplane diagram

As to the training sample set \( (y_{1} ,x_{1} ),\ldots,(y_{l} ,x_{l} ), \) \( x \in R^{n} , \) \( y \in \{ - 1,1\} \), the quadratic optimization problem for the classification hyperplane is

$$ \begin{gathered} \mathop {\hbox{min} }\limits_{\psi ,b} \frac{1}{2}\left\| \psi \right\|^{2} + C\sum\limits_{i = 1}^{l} {\xi_{i} } \hfill \\ s.t. \, y_{i} ((\varphi (x_{i} ) * \psi ) + b) \ge 1 - \xi_{i} , \hfill \\ \, \xi_{i} \ge 0,\quad i = 1,\ldots,l \hfill \\ \end{gathered} $$
(83.1)

Its Lagrange function is

$$ \begin{gathered} L(\psi ,b,\alpha ,\beta ) = \frac{1}{2}(\psi * \psi ) + C\sum\limits_{i = 1}^{l} {\xi_{i} } \hfill \\ - \sum\limits_{i = 1}^{l} {\alpha_{i} \left( {y_{i} \left[ {\left( {\varphi (x_{i} ) * \psi } \right) + b} \right] - 1 + \xi_{i} } \right)} - \sum\limits_{i = 1}^{l} {\beta_{i} \xi_{i} } \hfill \\ \end{gathered} $$
(83.2)

Its dual problem is

$$ \begin{gathered} \hbox{max}\, W(\alpha ) = \sum\limits_{i = 1}^{l} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i,j = 1}^{l} {y_{i} y_{j} \alpha_{i} \alpha_{j} \left( {\varphi (x_{i} ) * \varphi (x_{j} )} \right)} \hfill \\ = \sum\limits_{i = 1}^{l} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i,j = 1}^{l} {y_{i} y_{j} \alpha_{i} \alpha_{j} K(x_{i} ,x_{j} )} \hfill \\ s.t.{\text{ C}} \ge \alpha_{i} \ge 0 \hfill \\ \sum\limits_{i = 1}^{l} {y_{i} \alpha_{i} } = 0 \hfill \\ \end{gathered} $$
(83.3)

Here, \( K(x_{i} ,x_{j} ) = \varphi (x_{i} ) * \varphi (x_{j} ) \) is a kernel function.
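For concreteness, the experiments in Sect. 3.2 use a Gaussian kernel. The following is a minimal sketch of how such a kernel and the corresponding kernel (Gram) matrix can be evaluated; the bandwidth parameter gamma = 0.5 is an illustrative choice, not a value taken from the paper.

```python
import numpy as np

def gaussian_kernel(xi, xj, gamma=0.5):
    """Gaussian (RBF) kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).

    It plays the role of the inner product phi(x_i) * phi(x_j) in the
    dual problem (83.3); gamma = 0.5 is only an illustrative value.
    """
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def gram_matrix(X, gamma=0.5):
    """Kernel (Gram) matrix with entries K[i, j] = K(x_i, x_j)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * sq_dists)
```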

Finally, we obtain the decision function:

$$ f(x,\alpha ) = sign\left( {\sum\limits_{support \, vector} {y_{i} \alpha_{i}^{0} K(x_{i} ,x)} + b} \right) $$
(83.4)
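As an illustration of Eq. (83.4), the decision function can be reconstructed from the dual coefficients of a fitted kernel SVM. The sketch below uses scikit-learn, whose dual_coef_ attribute stores the products \( y_{i} \alpha_{i} \) of the support vectors; the two-class toy data and the parameter values are assumptions made only for demonstration.

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic classes, for demonstration only.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

def decision(x, clf, gamma=0.5):
    """Evaluate Eq. (83.4): sum over support vectors of y_i*alpha_i*K(x_i, x) + b."""
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return float(clf.dual_coef_[0] @ k + clf.intercept_[0])

x_new = np.array([0.3, -0.1])
print(np.sign(decision(x_new, clf)))               # f(x, alpha) assembled by hand
print(np.sign(clf.decision_function([x_new])[0]))  # the library's own value, same sign
```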

The learning machine that constructs this decision function is called the support vector machine (SVM). In classification learning, only the support vector samples contribute to the optimal hyperplane and the decision function. In general, a sample \( X_{i} \) whose Lagrange multiplier satisfies \( 0 < \alpha_{i} < C \) is called an ordinary support vector, while a sample \( X_{i} \) with \( \alpha_{i} = C \) is called a boundary support vector. The boundary support vectors include the samples that cannot be correctly classified, while the ordinary support vectors represent the classification features of most samples; together they determine the form of the final classifier. That is to say, the support vector set fully describes the features of the training sample set, and its division is equivalent to the division of the entire sample set. In most cases the support vector set accounts for only a small part of the training samples, so it can replace the training sample set for learning without affecting classification accuracy, while reducing training time and memory space.
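This replacement property can be checked empirically. The minimal sketch below (synthetic two-class data and scikit-learn’s SVC, not the paper’s own data or implementation) fits an SVM, retrains it on its support vectors alone, and compares the test accuracy of the two models:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Synthetic two-class data, for illustration only.
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(500, 2) + 1.5, rng.randn(500, 2) - 1.5])
y = np.hstack([np.ones(500), -np.ones(500)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

full = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_tr, y_tr)

# Keep only the support vectors and retrain on this much smaller set.
sv_idx = full.support_
reduced = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_tr[sv_idx], y_tr[sv_idx])

print("support vectors: %d of %d training samples" % (len(sv_idx), len(X_tr)))
print("accuracy trained on full set:        %.3f" % full.score(X_te, y_te))
print("accuracy trained on support vectors: %.3f" % reduced.score(X_te, y_te))
```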

2.2 The Criterion of Minimum Classification Error

The simplest classification error measure comes from the Bayes discriminant, which can be used for two-class problems. Its classification error measure can be defined as:

$$ d(x) = P(C_{2} |x) - P(C_{1} |x) $$
(83.5)

Here, we assume that the posterior probabilities \( P(C_{i} |x)\;(i = 1,2) \) are known. This formula gives the degree to which an observed sample of class 1 would be mistakenly classified into class 2, and its optimal decision boundary is the solution of \( d(x) = 0 \). For the many-class case (\( N > 2 \)) with unknown distributions, the two-class Bayes discriminant above no longer suffices to define a classification error measure. Amari defines another classification error measure:

$$ d_{k} (x) = \frac{1}{{m_{k} }}\sum\limits_{{i \in S_{k} }} {\left[ {g_{i} (x;\Uplambda ) - g_{k} (x;\Uplambda )} \right]} $$
(83.6)

Here, \( S_{k} = \{ i\,|\,g_{i} (x;\Uplambda ) > g_{k} (x;\Uplambda )\} \) is the confusion class set and \( m_{k} \) is the number of classes in \( S_{k} \). \( S_{k} \) is not fixed: it changes with the parameter set \( \Uplambda \) and the sample \( x \). Therefore this formula is discontinuous with respect to \( \Uplambda \); its derivatives cannot be computed, so it is not suitable for gradient operations. There are many ways to define a classification error measure that is continuous in \( \Uplambda \). One of them is:

$$ d_{k} (x) = - g_{k} (x;\Uplambda ) + \left[ {\frac{1}{N - 1}\sum\limits_{j \ne k} {g_{j} (x;\Uplambda )^{\eta } } } \right]^{{\frac{1}{\eta }}} $$
(83.7)

In this formula, the second term on the right is the generalized (power) mean of the discriminant functions of all the other competing classes. The parameter \( \eta \) can be regarded as a coefficient that adjusts the contribution of the other competing classes to the overall discriminant function. During the search for the classifier parameters, many potential classifications can be explored by changing \( \eta \). An extreme situation occurs when \( \eta \to \infty \): the discriminant function of the strongest competing class dominates the second term. That is:

When \( \eta \to \infty \), then

$$ \left[ {\frac{1}{N - 1}\sum\limits_{j \ne k} {g_{j} (x;\Uplambda )^{\eta } } } \right]^{{\frac{1}{\eta }}} = \mathop {\hbox{max} }\limits_{j \ne k} g_{j} (x;\Uplambda ) $$
(83.8)

The classification error measure then changes into:

$$ d_{k} (x) = - g_{k} (x;\Uplambda ) + g_{j} (x;\Uplambda ) $$
(83.9)

Here, \( C_{j} \) is the class, other than \( C_{k} \), with the largest discriminant value among all the competing classes; the averaging constant disappears because \( (N - 1)^{1/\infty } \cong 1 \). Obviously, under this circumstance, \( d_{k} (x) > 0 \) indicates a classification error and \( d_{k} (x) \le 0 \) indicates a correct classification. In this way, the decision rule reduces to judging the sign of a scalar value.
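A minimal sketch of these measures, using plain NumPy and purely illustrative discriminant values (not taken from the paper), computes the smoothed measure of Eq. (83.7) and shows that for large \( \eta \) it approaches the max form of Eq. (83.9):

```python
import numpy as np

def mce_measure(g, k, eta=10.0):
    """Smoothed misclassification measure d_k(x) of Eq. (83.7).

    g   : discriminant values g_j(x; Lambda) for all N classes
          (assumed positive so that the power mean is well defined)
    k   : index of the true class
    eta : smoothing exponent; eta -> infinity recovers Eq. (83.9)
    """
    g = np.asarray(g, dtype=float)
    others = np.delete(g, k)                       # competing classes j != k
    power_mean = np.mean(others ** eta) ** (1.0 / eta)
    return -g[k] + power_mean

g = np.array([0.9, 0.4, 0.7, 0.2])   # illustrative discriminants, true class k = 0
for eta in (1.0, 5.0, 50.0):
    print(eta, mce_measure(g, k=0, eta=eta))

# Limit form of Eq. (83.9): -g_k + max_{j != k} g_j; negative => correct classification
print(-g[0] + np.max(np.delete(g, 0)))
```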

3 Algorithm Design and Experiment Analysis

3.1 Algorithm Design

Assume the initial training sample set is \( A = \left\{ {x_{1} ,x_{2} ,\ldots,x_{M} } \right\} \), where every \( x_{m} \;(m = 1,2,\ldots,M) \) is a \( K \)-dimensional vector belonging to one of the categories \( C_{i} \;(i = 1,2,\ldots,N) \). A classifier usually consists of a set of parameters and a decision rule, so the task of minimum classification error classifier design is: given the initial training sample set \( A \), find the classifier parameter set \( \Uplambda \) and the associated decision rule that minimize the probability of misclassifying any sample \( x_{m} \;(m = 1,2,\ldots,M) \). In general, the probability of classification error is approximated by the error rate. If costs are attached to the different classification errors, the goal of the classifier design becomes: find the classifier parameter set \( \Uplambda \) and the related decision rule that achieve the minimum expected cost.

The algorithm is as follows:

  1. Initialize the parameter set \( \Uplambda \), set \( t = 1 \), and select the historical sample set \( A \) and the new sample set \( I_{t} \). Calculate the class centers and the sample forgetting factor \( d(x_{k} ) \) for the two kinds of samples in \( A \), and determine the minimum classification error criterion set \( B \). Train on \( B \) to obtain the initial support vector set \( SV^{0} \) and decision function \( f^{0} \).

  2. If \( I_{t} = \varnothing \), the algorithm terminates; otherwise, turn to step 3.

  3. Find the samples in \( I_{t} \) that violate the \( KKT \) conditions of the decision function \( f^{t - 1} \), and collect them into the set \( I^{v}_{t} \).

  4. If \( I_{t}^{v} = \varnothing \), set \( SV^{t} = SV^{t - 1} \), \( f^{t} = f^{t - 1} \), \( t = t + 1 \), and turn to step 2; otherwise turn to step 5.

  5. Through incremental learning on \( T = SV^{t - 1} \cup I^{v}_{t} \), obtain the new decision function \( f^{t} \) and support vector set \( SV^{t} \), set \( t = t + 1 \), and turn to step 2.
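A code-level sketch of this loop is given below. It is an illustration under simplifying assumptions rather than the paper’s exact implementation: scikit-learn’s SVC stands in for the underlying SVM trainer, the KKT check of step 3 is approximated by the common surrogate condition \( y_{i} f(x_{i} ) \ge 1 \), and the class-center/forgetting-factor selection of step 1 is not reproduced.

```python
import numpy as np
from sklearn.svm import SVC

def violates_kkt(clf, X_new, y_new):
    """Flag new samples that violate the KKT conditions of the current
    decision function f^{t-1}; approximated here by y_i * f(x_i) < 1."""
    margins = y_new * clf.decision_function(X_new)
    return margins < 1.0

def incremental_svm(X0, y0, increments, C=1.0, gamma=0.5):
    """Simplified sketch of the incremental SVM learning loop, steps 1-5 above."""
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X0, y0)   # step 1: f^0 and SV^0
    SV_X, SV_y = X0[clf.support_], y0[clf.support_]
    for X_new, y_new in increments:                         # step 2: next increment I_t
        mask = violates_kkt(clf, X_new, y_new)              # step 3: determine I_t^v
        if not np.any(mask):                                # step 4: nothing new to learn
            continue
        T_X = np.vstack([SV_X, X_new[mask]])                # step 5: T = SV^{t-1} U I_t^v
        T_y = np.hstack([SV_y, y_new[mask]])
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(T_X, T_y)
        SV_X, SV_y = T_X[clf.support_], T_y[clf.support_]
    return clf
```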

3.2 Experiment Analysis

The experimental data set contains 3678 calibrated samples. To verify the effectiveness and feasibility of the algorithm, we applied the incremental learning algorithm to the automatic classification of these samples on the basis of the above research. In the test, 643 samples were selected as the test sample set and 781 samples as the initial training set (H 1). The remaining samples were randomly divided into five groups (H 2, H 3, H 4, H 5, H 6), which form five incremental training sample sets. We conducted comparative experiments between the new SVM incremental learning algorithm and the traditional SVM learning method. The experimental results are shown in Table 83.1 (the SVM kernel function is a Gaussian function; the software environment is MATLAB 7.0; the hardware environment is a Pentium 4 with 256 MB memory) [4].
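For illustration only, the split described above (643 test samples, the 781-sample initial training set H 1, and the five increment groups H 2 to H 6) could be fed to the incremental_svm sketch from Sect. 3.1 roughly as follows; the data here are a synthetic stand-in, since the paper’s calibrated samples are not reproduced:

```python
import numpy as np

# Synthetic stand-in for the 3678 calibrated samples (the real data are not available here).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(1839, 4) + 1.0, rng.randn(1839, 4) - 1.0])
y = np.hstack([np.ones(1839), -np.ones(1839)])

order = rng.permutation(len(X))
test_idx = order[:643]                        # test sample set
init_idx = order[643:643 + 781]               # H 1, initial training set
rest_idx = order[643 + 781:]                  # remaining samples

increments = [(X[g], y[g]) for g in np.array_split(rest_idx, 5)]   # H 2 ... H 6
clf = incremental_svm(X[init_idx], y[init_idx], increments)        # sketch from Sect. 3.1
print("test accuracy: %.3f" % clf.score(X[test_idx], y[test_idx]))
```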

Table 83.1 The comparison between results of the new SVM incremental learning algorithm and the traditional SVM learning method

The results show that, compared with the traditional learning algorithm, the SVM incremental learning algorithm achieves higher classification accuracy and a markedly faster training speed.

4 Conclusion

Based on the characteristics of support vector machines, we develop an SVM incremental learning method and, furthermore, propose an incremental learning algorithm based on the minimum classification error criterion. In each round of incremental training, this algorithm effectively retains the previously trained samples that carry important information, simplifies the current incremental training sample set, and keeps the samples most likely to become support vectors for the next round of SVM incremental training. The experimental results show that this learning method is superior to the traditional incremental learning algorithm in both accuracy and time consumption.