1 Introduction

Support vector machine (SVM), proposed by Vapnik [1, 2], is a general learning method based on statistical learning theory (SLT). SVMs show many good properties on binary classification problems [3]. However, many problems in the real world are multi-class classification problems, such as text recognition, image classification, speech recognition and face recognition. Therefore, more and more researchers devote themselves to the study of multi-class classification. Data mining offers many classification algorithms, such as neural networks, k-nearest neighbor, decision trees and support vector machines. Although neural networks can achieve good classification accuracy, they have some shortcomings: they require a large number of parameters, such as the network topology, initial weights and thresholds; their internal learning process cannot be observed, and their outputs are difficult to interpret, which affects the credibility and acceptability of the results [16,17,18]. K-nearest neighbor is the simplest method; it is suitable for data with a large sample size, but it easily produces misclassifications when the sample size is small. Decision trees are easy to understand and their outputs are easy to interpret, but they are prone to over-fitting and ignore the correlation between attributes. The support vector machine is the most classical and representative method: it offers good generalization performance, handles high-dimensional and nonlinear problems, and avoids the structure-selection and local-minima problems of neural networks.

Most existing multi-class classification methods based on support vector machines adopt a decomposition-reconstruction strategy: the K-class problem is transformed into a series of binary classification problems, and the discriminant functions are obtained by solving these binary problems [4,5,6,7,8, 12]. There are three main "decomposition-reconstruction" strategies. The first is the "1-versus-rest" strategy [9], which constructs K classifiers. For the k-th problem, the examples belonging to the k-th class are viewed as positive examples and the examples of all other classes as negative examples; a binary SVM is then used to build the classifier. A new example is assigned a label according to the winner-takes-all scheme. Although the "1-versus-rest" strategy uses the information of all examples in the training set, almost all of the binary problems are unbalanced because the number of negative examples is much larger than the number of positive examples. The second is the "1-versus-1" strategy [10], which constructs K(K−1)/2 binary classifiers; each classifier involves only two classes and discards the remaining K−2 classes. For a new example, the label is decided by a voting scheme. In the "1-versus-1" strategy, the information of the remaining examples is omitted when constructing each binary classifier. The third is the "1-versus-1-versus-rest" strategy, an improvement of the "1-versus-1" strategy called K-SVCR [11]. K-SVCR constructs K(K−1)/2 sub-classifiers. When constructing each sub-classifier, the original training set is divided into three parts: positive examples (the examples of the i-th class), negative examples (the examples of the j-th class) and rest examples (all other examples). K-SVCR seeks a sub-classifier that classifies the positive and negative examples correctly while restricting the rest examples within an ε-insensitive band. For a new example, the label is also decided by the voting scheme. Compared with "1-versus-1" SVM, K-SVCR makes full use of the information of the training samples. However, although K-SVCR does not lose any example information, the scale of its optimization problems increases greatly and they become difficult to solve.

To overcome the shortcomings of "1-versus-1" SVM and K-SVCR, a new method that incorporates the idea of semi-supervised learning is proposed in this paper. Our method is based on the 1-versus-1-versus-rest framework, i.e. the K-class problem is transformed into K(K−1)/2 binary classification problems. When solving each binary problem, the training set is divided into three parts: positive examples (the examples of the i-th class), negative examples (the examples of the j-th class) and rest examples (all remaining examples). For each sub-classifier, the classification hyperplane is required to separate the positive and negative examples correctly, while the projections of the rest examples onto the normal vector of the hyperplane are concentrated around zero. To this end, a new SVM with a mixed regularization term is proposed. The new regularization term considers both the margin between positive and negative examples and the distribution of the rest examples. Compared with existing methods, the new method uses the information of all examples without discarding any of them.

This paper is organized as follows. In Sect. 2, we give a brief introduction to SVM, the minimization class variance SVM (MCVSVM) and K-SVCR. Our new method, MCVMSVM, is proposed in Sect. 3. Numerical experiments on several real datasets are reported in Sect. 4. We conclude the paper in Sect. 5.

2 Related Works

The works closely related to this paper, namely SVM, MCVSVM and K-SVCR, are introduced in this section.

2.1 Support Vector Machine

Consider the binary classification problem with the training set

$$ T = \{ (x_{1} ,y_{1} ),\ldots,(x_{l} ,y_{l} )\} $$
(1)

where \(x_{i} \in R^{n}\) is an example and \(y_{i} \in \{ + 1, - 1\}\) is the corresponding label.

SVM constructs a hyperplane \(w \cdot x + b = 0\) by maximizing the margin between the two support hyperplanes \(w \cdot x + b = - 1\) and \(w \cdot x + b = 1\). Furthermore, the positive examples should be located above the positive support hyperplane \(w \cdot x + b = 1\) and the negative examples below the negative support hyperplane \(w \cdot x + b = - 1\). SVM obtains the classification hyperplane by solving the following quadratic program:

$$ \begin{aligned} &\mathop {\min }\limits_{w,b,\xi } \frac{1}{2}||w||^{2} + C\sum\limits_{i = 1}^{l} {\xi_{i} } \hfill \\ & y_{i} (w \cdot x_{i} + b) \ge 1 - \xi_{i} ,i = 1,\ldots,l, \hfill \\ &\xi_{i} \ge 0,i = 1,\ldots,l, \hfill \\ \end{aligned} $$
(2)

where C > 0 is a parameter that balances the margin and the classification error, and \(\xi_{i} \ge 0\) are slack variables measuring the classification loss of the examples. More details on SVM can be found in [13].
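For concreteness, problem (2) can be solved directly with a generic convex-optimization tool. The following is a minimal sketch assuming the cvxpy package is available (dedicated SVM solvers based on SMO are normally used in practice); the data arrays are illustrative.

```python
# Minimal sketch: solve the soft-margin SVM primal (2) with cvxpy (assumed available).
# X is an (l, n) array of training examples, y an (l,) array of labels in {+1, -1}.
import numpy as np
import cvxpy as cp

def linear_svm(X, y, C=1.0):
    l, n = X.shape
    w = cp.Variable(n)
    b = cp.Variable()
    xi = cp.Variable(l, nonneg=True)                      # slack variables
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]   # y_i(w.x_i + b) >= 1 - xi_i
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# toy usage with two separable point clouds
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = linear_svm(X, y, C=10.0)
print(w, b)
```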

2.2 Minimization Class Variance Support Vector Machine

For the training set (1), the hyperplane generated by the traditional support vector machine is determined by a small number of support vectors and ignores the structural characteristics of the training examples. Therefore, SVM may not classify some data well, as shown in Fig. 1. Although SVM finds a hyperplane that separates the positive and negative training examples correctly, this hyperplane does not represent the distribution trend of the data well, so a new point "*" may be misclassified by SVM. For this reason, the minimization class variance support vector machine (MCVSVM) was proposed in [14]. MCVSVM adopts a within-class scatter matrix to capture the distribution of the data. Its optimization problem is as follows:

Fig. 1 The classification effect of SVM

$$ \begin{aligned} &\mathop {\min }\limits_{w,b} \frac{1}{2}w^{T} S_{w} w + C\sum\limits_{i = 1}^{l} {\xi_{i} } \hfill \\ &y_{i} (x_{i}^{T} w + b) \ge 1 - \xi_{i} , \hfill \\& \xi_{i} \ge 0, \, \hfill \\ \end{aligned} $$
(3)

where the matrix \(S_{w}\) is the sum of the scatter matrices of the positive and negative examples, \(C > 0\) is the penalty parameter, and \(\xi_{i}\) are slack variables measuring the classification loss of the examples.
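As a small illustration (plain NumPy; the array names are ours), the matrix \(S_{w}\) can be computed as follows:

```python
# Sketch: S_w as the sum of the within-class scatter matrices of the
# positive and negative classes, S_w = sum_c sum_{x in c} (x - m_c)(x - m_c)^T.
import numpy as np

def class_scatter(X):
    Xc = X - X.mean(axis=0)      # subtract the class mean
    return Xc.T @ Xc             # (n, n) scatter matrix

def within_class_scatter(X_pos, X_neg):
    return class_scatter(X_pos) + class_scatter(X_neg)
```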

Now let us give the geometric interpretation of optimization problem (3). Minimizing the first term of the objective function means that the projections of the intra-class examples onto the normal vector of the hyperplane should be as close to each other as possible; minimizing the second term minimizes the misclassification loss; the constraints mean that the positive and negative examples should be located on the two sides of the hyperplanes \((w \cdot x + b) = 1\) and \((w \cdot x + b) = - 1\), respectively. For a new example, its label is decided by the following discriminant function:

$$ f(x) = {\text{sgn}} (S_{w}^{ - 1} \sum\limits_{i = 1}^{l} {\alpha_{i} y_{i} \left\langle {x_{i} \cdot x} \right\rangle } + b). $$
(4)

Since MCVSVM considers the distribution of the examples, the hyperplane obtained by MCVSVM is better, as shown in Fig. 2. However, MCVSVM has a serious shortcoming: when the dimension of the examples is much larger than the number of examples, the inverse of the matrix \(S_{w}\) may not exist.

Fig. 2 The classification effect of MCVSVM

2.3 Support Vector Classification-Regression Machine for K-Class (K-SVCR)

Given the training set

$$ T = \{ (x_{1} ,y_{1} ),\ldots,(x_{l} ,y_{l} )\} $$
(5)

where \(x_{i} \in R^{n}\) is an example and \(y_{i} \in \left\{ {1,2, \ldots ,K} \right\}\) is the corresponding label of \(x_{i}\).

K-SVCR constructs K(K − 1)/2 sub-classifiers for the K-class classification problem. When constructing each sub-classifier, the training set is divided into three parts: the positive class, the negative class and the rest class, and a mixed classification-regression machine is formulated. First, the classification hyperplane \(w \cdot x + b = 0\) should separate the training examples of the positive and negative classes as correctly as possible. Second, the examples of the rest class should lie in the ε-insensitive band (\(0 < {\upvarepsilon } < 1\)) of the hyperplane \(w \cdot x + b = 0\). According to this idea, the optimization problem of K-SVCR is formulated as follows:

$$ \begin{aligned} &\mathop {\min }\limits_{{w,b,\xi ,\varphi_{i} ,\varphi_{i}^{*} }} \frac{1}{2}||w||^{2} + C\sum\limits_{i = 1}^{{l_{1} + l_{2} }} {\xi_{i} } + D\sum\limits_{{i = l_{1} + l_{2} + 1}}^{l} {(\varphi_{i} + \varphi_{i}^{*} )} \hfill \\ &s.t.y_{i} (w \cdot x_{i} + b) \ge 1 - \xi_{i} ,i = 1,...,l_{1} + l_{2} , \hfill \\ &- \varepsilon - \varphi_{i}^{*} \le w \cdot x_{i} + b \le \varepsilon + \varphi_{i} ,i = l_{1} + l_{2} + 1,...,l, \hfill \\ & \xi_{i} \ge 0,i = 1,...,l_{1} + l_{2} , \hfill \\ \end{aligned} $$
(6)

where \(\xi_{i} ,\varphi_{i} ,\varphi_{i}^{*}\) are slack variables measuring the losses, C > 0 and D > 0 are parameters that balance the margin, the classification error and the regression error, \(l_{1}\) is the number of positive examples, and \(l_{2}\) is the number of negative examples.

Now we explain the optimization problem in detail. Minimizing the first term in (6) maximizes the margin between the two support hyperplanes \(w \cdot x + b = - 1\) and \(w \cdot x + b = 1\). The second term minimizes the sum of classification errors of the examples belonging to the positive and negative classes, and the third term minimizes the sum of regression errors of the examples belonging to the rest class. The first set of constraints requires the positive examples to lie above the positive support hyperplane \(w \cdot x + b = 1\) and the negative examples to lie below the negative support hyperplane \(w \cdot x + b = - 1\). The second set of constraints requires the rest examples to lie within the ε-insensitive band of \(w \cdot x + b = 0\). Obviously, optimization problem (6) contains a large number of constraints, so it takes a long time to solve.
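To make the size of problem (6) concrete, the following is a minimal sketch of building and solving it for one class pair, again assuming cvxpy is available; X_pos, X_neg hold the two focused classes and X_rest the remaining classes (the names are illustrative).

```python
# Sketch: one K-SVCR sub-problem (6). Note that every rest example adds two
# band constraints and two slack variables, which is what makes (6) large.
import numpy as np
import cvxpy as cp

def ksvcr_pair(X_pos, X_neg, X_rest, C=1.0, D=1.0, eps=0.1):
    X12 = np.vstack([X_pos, X_neg])
    y12 = np.hstack([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    n = X12.shape[1]
    w, b = cp.Variable(n), cp.Variable()
    xi = cp.Variable(len(X12), nonneg=True)          # classification slacks
    phi = cp.Variable(len(X_rest), nonneg=True)      # upper band slacks
    phi_star = cp.Variable(len(X_rest), nonneg=True) # lower band slacks
    obj = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)
                      + D * cp.sum(phi + phi_star))
    cons = [cp.multiply(y12, X12 @ w + b) >= 1 - xi,
            X_rest @ w + b <= eps + phi,
            X_rest @ w + b >= -eps - phi_star]
    cp.Problem(obj, cons).solve()
    return w.value, b.value
```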

3 Minimization Class Variance SVM for Multi-class Classification (MCVMSVM)

Now we are in a position to present our MCVMSVM, including the linear MCVMSVM and the nonlinear MCVMSVM.

3.1 The linear MCVMSVM

Our new method adopts the "1-versus-1" strategy: a total of K(K − 1)/2 classifiers are constructed, and the final decision is made by the voting scheme. When constructing each classifier, the training set (5) is first divided into three parts: the examples of the i-th class are viewed as the positive class with label 1, the examples of the j-th class as the negative class with label −1, and the examples of all remaining classes as the zero class with label 0 (a small sketch of this splitting step is given after Fig. 3). For this re-divided training set, MCVMSVM seeks a hyperplane such that the positive and negative examples are located on the two sides of \(w \cdot x + b = 1\) and \(w \cdot x + b = - 1\), respectively, and the projections of the zero-class examples onto the normal vector w are concentrated as close to zero as possible. The geometric explanation can be seen clearly in Fig. 3: the negative examples lie below the plane \(w \cdot x + b = - 1\), the positive examples lie above the plane \(w \cdot x + b = 1\), and the projections of the zero-class examples onto w are close to zero. The hyperplane obtained by MCVMSVM should be the middle solid line H1, not H0, which is obtained by 1-versus-1 SVM.

Fig. 3 An example learned by the linear MCVMSVM
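A minimal sketch of this re-labelling step for one pair (i, j) (NumPy only; the variable names are illustrative):

```python
# Sketch: split the training set into positive (class i), negative (class j)
# and zero/rest (all remaining classes) parts for one sub-classifier.
import numpy as np

def split_pair(X, y, i, j):
    rest = (y != i) & (y != j)
    X_pos = X[y == i]          # relabelled +1
    X_neg = X[y == j]          # relabelled -1
    X_zero = X[rest]           # relabelled 0
    y_zero = y[rest]           # original labels, needed later for S_w in (10)
    return X_pos, X_neg, X_zero, y_zero
```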

According to the training set (5), the matrix composed of the examples of class \(i\) is constructed as the positive sample matrix \(A_{i}\):

$$ A_{i} = [x_{i1} ,x_{i2} ,\ldots,x_{{il_{i} }} ],i = 1,\ldots,K,A_{i} \in R^{{l_{i} \times n}} , $$
(7)

the matrix composed of the examples of class \(j\) is constructed as the negative sample matrix \(B_{j}\):

$$ B_{j} = [x_{j1} ,x_{j2} ,\ldots,x_{{jl_{j} }} ],j = 1,\ldots,K,B_{j} \in R^{{l_{j} \times n}} $$
(8)

The examples of the zero class are collected in \(C_{i,j}\). Define the set \(C_{i,j}^{k}\), composed of the examples belonging to the k-th class (\(k \ne i,j\)), as follows:

$$ C_{i,j}^{k} = \{ x_{k1} ,x_{k2} ,\ldots,x_{kl_{k}} \},\quad k = 1,\ldots,K,\; k \ne i,\; k \ne j $$
(9)

Let \({m}_{i,j}^{k}\) denote the center point of \(C_{i,j}^{k}\).

The intra class scatter matrix of the zero class is \(S_{w}\):

$$ S_{w} = \sum\limits_{k = 1}^{K - 2} {\sum\limits_{{x \in C_{i,j}^{k} }} {(x - m_{i,j}^{k} )(x - m_{i,j}^{k} )^{T} } } , $$
(10)
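The scatter matrix (10) only involves the zero-class examples and their original class labels; a minimal NumPy sketch:

```python
# Sketch: the zero-class scatter matrix S_w of Eq. (10), accumulated over the
# K-2 remaining classes. X_zero holds the rest examples, y_zero their labels.
import numpy as np

def zero_class_scatter(X_zero, y_zero):
    n = X_zero.shape[1]
    S_w = np.zeros((n, n))
    for k in np.unique(y_zero):              # loop over the K-2 remaining classes
        Xk = X_zero[y_zero == k]
        Xk_c = Xk - Xk.mean(axis=0)          # subtract the class center m_{i,j}^k
        S_w += Xk_c.T @ Xk_c
    return S_w
```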

Suppose the hyperplane for the pair \((i,j)\) is \((w_{i,j} \cdot x) + b_{i,j} = 0\); then the corresponding optimization problem is constructed as follows:

$$ \begin{aligned} &\mathop {\min }\limits_{{w_{i,j} ,b_{i,j} }} \frac{1}{2}||w_{i,j} ||^{2} + De_{i,j}^{T} \xi_{i,j} + \frac{\lambda }{2}w_{i,j}^{T} S_{w} w_{i,j} \hfill \\ &(A_{i} w_{i,j} + e_{i} b_{i,j} ) + \xi_{i,j} \ge e_{i} , \hfill \\ &- (B_{j} w_{i,j} + e_{j} b_{i,j} ) + \xi_{i,j} \ge e_{j} , \hfill \\ &\xi_{i,j} \ge 0, \, \hfill \\ \end{aligned} $$
(11)

where \(D > 0,\,\lambda > 0\) are parameters, \(e_{i,j} = [e_{i} ,e_{j} ]\), \(e_{i} ,e_{j}\) are vectors of ones with appropriate dimensions, and \(\xi_{i,j}\) are slack variables measuring the misclassification.

Minimizing the first term in the objective function of problem (11) maximizes the margin between the support hyperplanes. Minimizing the second term minimizes the sum of classification errors. Minimizing the last term minimizes the scatter of the projections of the zero-class examples onto the normal vector of the hyperplane. The constraints guarantee that the positive examples are located above \(w \cdot x + b = 1\) and the negative examples below \(w \cdot x + b = - 1\). Next, we construct the dual problem of (11). The Lagrangian of problem (11) is formulated first:

$$ \begin{gathered} L(w_{i,j} ,b_{i,j} ,\xi_{i,j} ,\alpha_{i,j} ,\mu_{i,j} ,\beta_{i,j} ,\eta_{i,j} ) \hfill \\ = De_{i,j}^{T} \xi_{i,j} + \frac{1}{2}||w_{i,j} ||^{2} + \frac{\lambda }{2}w_{i,j}^{T} S_{w} w_{i,j} \hfill \\ - \alpha_{i,j}^{T} [(A_{i} w_{i,j} + e_{i} b_{i,j} ) + \xi_{i,j} - e_{i} ] - \mu_{i,j}^{T} \xi_{i,j} \hfill \\ + \beta_{i,j}^{T} [(B_{j} w_{i,j} + e_{j} b_{i,j} ) - \xi_{i,j} + e_{j} ] - \eta_{i,j}^{T} \xi_{i,j} \hfill \\ \end{gathered} $$
(12)

Then, by the Karush-Kuhn-Tucker (KKT) conditions, we get

$$ \frac{\partial L}{{\partial w_{i,j} }} = w_{i,j} + \lambda S_{w} w_{i,j} - A_{i}^{T} \alpha_{i,j} + B_{j}^{T} \beta_{i,j} = 0 \Rightarrow w_{i,j} = (I + \lambda S_{w} )^{ - 1} (A_{i}^{T} \alpha_{i,j} - B_{j}^{T} \beta_{i,j} ), $$
(13)
$$ \frac{\partial L}{{\partial b_{i,j} }} = - e_{i}^{T} \alpha_{i,j} + e_{j}^{T} \beta_{i,j} = 0, $$
(14)
$$ \frac{\partial L}{{\partial \xi_{i,j} }} = \left[ {\begin{array}{*{20}c} {De{}_{i}} \\ {De_{j} } \\ \end{array} } \right] - \left[ {\begin{array}{*{20}c} {\alpha_{i,j}^{T} e{}_{i}} \\ {\beta_{i,j}^{T} e{}_{j}} \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} {\mu_{i,j}^{T} e{}_{i}} \\ {\eta_{i,j}^{T} e{}_{j}} \\ \end{array} } \right] = 0 \Rightarrow 0 \le \alpha_{i,j} ,\beta_{i,j} \le D, $$
(15)

Substituting Eqs. (13)-(15) into (12), the dual problem of (11) can be written as:

$$ \begin{aligned} & \mathop {\max }\limits_{{\alpha_{i,j} ,\beta_{i,j} }} [e_{i}^{T} ,e_{j}^{T} ][\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ]^{T} - \frac{1}{2}[\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ][A_{i}^{T} , - B_{j}^{T} ]^{T} (I + \lambda S_{w} )^{ - 1} [A_{i}^{T} , - B_{j}^{T} ][\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ]^{T} \hfill \\ &0 \le \alpha_{i,j} ,\beta_{i,j} \le D, \hfill \\ \end{aligned} $$
(16)

where \(\alpha_{i,j} ,\beta_{i,j}\) are the Lagrange multipliers. Solving the dual problem (16), we obtain the optimal solution \([\alpha_{i,j}^{*} ,\beta_{i,j}^{*} ]\); choosing a support vector with label \(y_{k}\), the optimal solution of the primal problem (11) is

$$ \begin{gathered} w_{i,j}^{ * } = (I + \lambda S_{w} )^{ - 1} [A_{i}^{T} \alpha_{i,j}^{ * } - B_{j}^{T} \beta_{i,j}^{ * } ], \hfill \\ b_{i,j}^{ * } = y_{k} - A_{i}^{k} (I + \lambda S_{w} )^{ - 1} [A_{i}^{T} \alpha_{i,j}^{ * } - B_{j}^{T} \beta_{i,j}^{ * } ], \hfill \\ \end{gathered} $$
(17)

So the hyperplane for the pair \((i,j)\) can be expressed as:

$$ f_{ij} (x) = [x,1][w_{i,j}^{*T} ,b_{i,j}^{*T} ]^{T} = 0 .$$
(18)
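Once the dual problem (16) is solved by any QP solver, (17) and (18) can be evaluated directly. A minimal NumPy sketch, where alpha_star and beta_star denote the dual solution and (x_sv, y_sv) a support vector with its label (these names are ours):

```python
# Sketch: recover the primal solution via Eq. (17) and evaluate the pairwise
# decision value of Eq. (18). All inputs are NumPy arrays.
import numpy as np

def recover_hyperplane(A_i, B_j, S_w, alpha_star, beta_star, lam, x_sv, y_sv):
    n = A_i.shape[1]
    M = np.linalg.inv(np.eye(n) + lam * S_w)           # (I + lambda * S_w)^{-1}
    w = M @ (A_i.T @ alpha_star - B_j.T @ beta_star)   # Eq. (17)
    b = y_sv - x_sv @ w                                # offset from a support vector
    return w, b

def f_ij(x, w, b):
    return x @ w + b                                   # Eq. (18); its sign gives the vote
```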

In total, \(K(K - 1)/2\) hyperplanes of the form (18) are constructed in our MCVMSVM, and the corresponding \(K(K - 1)/2\) decision functions are defined as follows:

$$ {\text{sgn}} (f_{i,j} (x)) = \left\{ \begin{gathered} + 1,[x,1][w_{i,j}^{*T} ,b_{i,j}^{*T} ]^{T} > 0 \, i,j = 1, \ldots ,K;i < j, \hfill \\ - 1,[x,1][w_{i,j}^{*T} ,b_{i,j}^{*T} ]^{T} < 0 \, i,j = 1, \ldots ,K;i < j, \hfill \\ \end{gathered} \right. $$
(19)

For a new example x, a vote is given to class i or class j depending on which condition is satisfied. Finally, the example x is assigned to the class label that receives the most votes.
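A minimal sketch of this voting step, assuming the K(K − 1)/2 trained sub-classifiers are stored in a dictionary classifiers mapping a pair (i, j) with i < j to its (w, b):

```python
# Sketch: max-wins voting over the pairwise decision functions (19).
import numpy as np

def predict(x, classifiers, K):
    votes = np.zeros(K, dtype=int)
    for (i, j), (w, b) in classifiers.items():   # i < j, classes indexed 0..K-1
        if x @ w + b > 0:
            votes[i] += 1                        # vote for class i
        else:
            votes[j] += 1                        # vote for class j
    return int(np.argmax(votes))
```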

3.2 The nonlinear MCVMSVM

In this section, the linear MCVMSVM is extended to the nonlinear case using the kernel trick. The training set (5) is mapped into a high-dimensional feature space by a nonlinear mapping \(\phi\):

$$ x \to \phi (x) $$
(20)

Suppose that the normal vector of the hyperplane in the feature space can be expressed as:

$$ w = \mathop \sum \limits_{i = 1}^{{l_{1} + l_{2} }} \mu_{i} \phi \left( {x_{i} } \right) $$
(21)

Then we define the intra-class divergence as:

$$ \begin{gathered} w^{T} S_{w} w = w^{T} \sum\limits_{k = 1}^{K - 2} {\sum\limits_{{x \in C_{i,j}^{k} }} {(\phi (x) - m_{i,j}^{k} )(\phi (x) - m_{i,j}^{k} )^{T} } } w \hfill \\ = \sum\limits_{k = 1}^{K - 2} {[w^{T} \sum\limits_{{x \in C_{i,j}^{k} }} {(\phi (x) - m_{i,j}^{k} )(\phi (x) - m_{i,j}^{k} )^{T} } } w]. \hfill \\ \end{gathered} $$
(22)

We rewrite the term in (22) as follows:

$$ \begin{aligned} & w^{T} \sum\limits_{{x \in C_{i,j}^{k} }} {(\varphi (x) - m_{i,j}^{k} )(\varphi (x) - m_{i,j}^{k} )^{T} w} \hfill \\ &= w^{T} \sum\limits_{{x \in C_{i,j}^{k} }} {\left[ {(\varphi (x)\varphi (x)^{T} - m_{i,j}^{k} \varphi (x)^{T} - \varphi (x)m_{i,j}^{kT} + m_{i,j}^{k} m_{i,j}^{kT} )^{T} } \right]} w \hfill \\ &= w^{T} \left[ {\sum\limits_{{x \in C_{i,j}^{k} }} {\varphi (x)\varphi (x)^{T} - m_{i,j}^{k} \sum\limits_{{x \in C_{i,j}^{k} }} {\varphi (x)^{T} } - \sum\limits_{{x \in C_{i,j}^{k} }} {(\varphi (x)m_{i,j}^{kT} } + l_{r}^{k} m_{i,j}^{k} m_{i,j}^{kT} )^{T} } } \right]w \hfill \\ &= w^{T} \left[ {\sum\limits_{{x \in C_{i,j}^{k} }} {\varphi (x)\varphi (x)^{T} } - l_{r}^{k} m_{i,j}^{k} m_{i,j}^{kT} } \right]w \hfill \\ &= \sum\limits_{{x \in C_{i,j}^{k} }} {w^{T} \varphi (x)\varphi (x)^{T} w} - w^{T} l_{r}^{k} m_{i,j}^{k} m_{i,j}^{kT} w \hfill \\ &= \mu^{T} K(C_{i,j}^{k} ,S)K(C_{i,j}^{k} ,S)^{T} \mu - \mu^{T} K(C_{i,j}^{k} ,S)L^{k}_{l_{r}} K(C_{i,j}^{k} ,S)^{T} \mu \hfill \\ &= \mu^{T} K(C_{i,j}^{k} ,S)(I_{{l_{r} }}^{k} - L_{{l_{r} }}^{k} )K(C_{i,j}^{k} ,S)^{T} \mu \hfill \\ \end{aligned} $$
(23)

where \(\mu = \left[ {\mu_{1} , \ldots ,\mu_{{l_{1} + l_{2} }} } \right]^{T}\), \(l_{r}^{k}\) is the number of examples in \(C_{i,j}^{k}\), \(I_{{l_{r} }}^{k}\) is the \(l_{r}^{k}\)-order identity matrix, \(L_{{l_{r} }}^{k}\) is the \(l_{r}^{k}\)-order matrix whose elements are all equal to the square of \(\frac{1}{{l_{r}^{k} }}\), and \(S = [A_{i}^{T} ,B_{j}^{T} ]\). The optimization problem of the nonlinear MCVMSVM for the pair \((i,j)\) is:

$$ \begin{aligned} &\mathop {\min }\limits_{{w,b,\xi_{i,j} }} \frac{1}{2}\mu^{T} K(S,S)\mu + De_{i,j}^{T} \xi_{i,j} + \frac{\lambda }{2}\sum\limits_{k = 1}^{K - 2} {\mu^{T} K(C_{i,j}^{k} ,S)(I_{i,j}^{k} - L_{i,j}^{k} )K(C_{i,j}^{k} ,S)^{T} \mu } \hfill \\ &(K(A_{i}^{T} ,S^{T} )\mu + e_{i} b_{i,j} ) + \xi_{i,j} \ge e_{i} , \hfill \\ & - (K(B_{j}^{T} ,S^{T} )\mu + e_{j} b_{i,j} ) + \xi_{i,j} \ge e_{j} , \hfill \\ &\xi_{i,j} \ge 0 \hfill \\ \end{aligned} $$
(24)

where \(D > 0,\lambda > 0\) are parameters, \(e_{i,j} = [e_{i} ,e_{j} ]\), \(e_{i} ,e_{j}\) are vectors of ones with appropriate dimensions, and \(\xi_{i,j}\) are slack variables measuring the misclassification. Similar to the linear case, we obtain the optimal solution of (24) by solving the following dual problem:

$$ \begin{aligned} & \mathop {\max }\limits_{{\alpha_{i,j} ,\beta_{i,j} }} [e_{i}^{T} ,e_{j}^{T} ][\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ]^{T} \hfill \\ &\quad - \frac{1}{2}[\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ][K_{i}^{T} , - K_{j}^{T} ]^{T} (G + \lambda \sum\limits_{k = 1}^{K - 2} {K_{i,j}^{k} L_{i,j}^{k} K_{i,j}^{kT} } )^{ - 1} [K_{i}^{T} , - K_{j}^{T} ][\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ]^{T} \hfill \\ & \quad 0 \le \alpha_{i,j} ,\beta_{i,j} \le D, \hfill \\ \end{aligned} $$
(25)

where \(G = K(S^{{\text{T}}} ,S^{{\text{T}}} )\), \(K_{i} = K(A_{i}^{T} ,S^{T} )\), \(K_{j} = K(B_{j}^{T} ,S^{T} )\), \(K_{ij}^{k} = K(C_{ij}^{kT} ,S^{T} )\) and \(L_{ij}^{k} = I_{{l_{r} }}^{k} - L_{{l_{r} }}^{k}\). Solving the dual problem (25), we obtain the optimal solution \([\alpha_{i,j}^{*} ,\beta_{i,j}^{*} ]\); let \(y_{k}\) be the label of a support vector \(x_{k}\). Then the optimal solution of the primal problem (24) is

$$ \mu^{*} = (G + \lambda \sum\limits_{k = 1}^{K - 2} {K_{i,j}^{k} L_{i,j}^{k} K_{i,j}^{kT} } )^{ - 1} (K_{i}^{T} \alpha_{i,j}^{*} - K_{j}^{T} \beta_{i,j}^{*} ) $$
(26)
$$ b^{ * } = y_{k}^{*} - \mu^{*} K(x_{k} ,S) $$
(27)

Then the hyperplane for the pair \((i,j)\) can be expressed as:

$$ f_{i,j} (x) = \mu^{ * } K(x^{T} ,S^{T} ){ + }b^{ * } = 0 $$
(28)

Just as in the linear case, the decision function that classifies class \(i\) and class \(j\) is defined as

$$ {\text{sgn}} (f_{i,j} (x))\,{ = }\,\left\{ \begin{gathered} { + }1,\mu^{ * } K(x^{T} ,S^{T} ){ + }b^{ * } > 0 \, i,j = 1,\ldots,K;i < j, \hfill \\ - 1,\mu^{ * } K(x^{T} ,S^{T} ){ + }b^{ * } < 0 \, i,j = 1,\ldots,K;i < j, \hfill \\ \end{gathered} \right. $$
(29)

For a new example \(x\), a vote is given to class \(i\) or class \(j\) depending on which condition is satisfied. Finally, the example \(x\) is assigned to the class label that receives the most votes.
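All the kernel matrices above (G, K_i, K_j and K(C_{i,j}^k, S)) are built from a single kernel function. A minimal NumPy sketch of the RBF kernel used later in the experiments:

```python
# Sketch: RBF kernel matrix K(U, V)[a, b] = exp(-gamma * ||u_a - v_b||^2).
# With S the stacked positive and negative examples, K_i = rbf_kernel(A_i, S),
# K_j = rbf_kernel(B_j, S) and K(C_{i,j}^k, S) = rbf_kernel(C_ijk, S).
import numpy as np

def rbf_kernel(U, V, gamma=0.5):
    sq_dist = (np.sum(U**2, axis=1)[:, None]
               + np.sum(V**2, axis=1)[None, :]
               - 2.0 * U @ V.T)
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))   # clip tiny negative round-off
```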

3.3 Comparing with other methods


Comparing with “1-versus-1” SVM. Our method constructs \(K(K-1)/2\) hyperplanes, exactly as “1-versus-1” SVM does, and its computational complexity is almost the same. However, the information of all examples is used in our method, whereas only the positive and negative examples are used in “1-versus-1” SVM. Clearly, our method makes fuller use of the information of the rest class.

Comparing with “1-versus-rest” SVM. Both “1-versus-rest” SVM and our MCVMSVM make full use of the information of all examples. However, MCVMSVM uses this information in a more detailed way, dividing the examples into three parts (positive, negative and rest), whereas “1-versus-rest” SVM uses it more coarsely, dividing the examples into only two parts (positive and negative).

Comparing with K-SVCR. The “1-versus-1-versus-rest” strategy is used in both our MCVMSVM and K-SVCR, but the rest class is handled quite differently: MCVMSVM introduces a quadratic term based on the within-class scatter of the rest class, whereas K-SVCR adds a large number of constraints involving the rest examples. Therefore, the computational complexity of MCVMSVM is much lower than that of K-SVCR and its training time is shorter; the gap becomes more obvious as the number of classes grows, as can be seen clearly in Fig. 4.
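To make the difference concrete, here is a small counting sketch (our own illustration) of the inequality constraints in one sub-problem, for l1 positive, l2 negative and l0 rest examples, ignoring the nonnegativity of the slacks:

```python
# Sketch: number of inequality constraints per sub-problem (slack nonnegativity excluded).
def ksvcr_constraints(l1, l2, l0):
    return (l1 + l2) + 2 * l0   # classification constraints + two-sided epsilon band

def mcvmsvm_constraints(l1, l2, l0):
    return l1 + l2              # rest examples enter only through the S_w term

print(ksvcr_constraints(50, 50, 400))    # 900
print(mcvmsvm_constraints(50, 50, 400))  # 100
```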

Fig. 4 The comparison of the training time of MCVMSVM and KSVCR on the nine experimental data sets

4 Numerical experiments

To evaluate the performance of the new algorithm, several UCI [14] datasets and large-scale NDC datasets are used in the numerical experiments (Table 3). All experiments were implemented in Matlab 2015b on a PC with an Intel Core i7 CPU at 3.5 GHz, 4 GB of RAM and the Windows 10 operating system. For the nonlinear case, the RBF kernel is used. The parameter D is selected from the set \(\{2^{ - 8} ,2^{ - 7} , \ldots ,2^{7} ,2^{8}\}\) and γ is selected from \(\{0.1,0.2, \ldots ,2\}\). Our MCVMSVM is compared with the popular multi-class classification methods 1-v-1 SVM, 1-v-r SVM and K-SVCR.
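The parameter-selection protocol can be reproduced as sketched below; note that the sketch uses scikit-learn's SVC (i.e. the "1-versus-1" RBF baseline) and the UCI 'wine' data only to illustrate the grid and the five-fold procedure, not the proposed MCVMSVM itself.

```python
# Sketch: grid search over D (SVC's C) and the RBF gamma with five-fold cross
# validation, illustrated on the 1-versus-1 baseline and the 'wine' dataset.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
D_grid = [2.0**p for p in range(-8, 9)]                 # 2^-8, ..., 2^8
gamma_grid = [round(0.1 * k, 1) for k in range(1, 21)]  # 0.1, 0.2, ..., 2.0

best = (None, None, -np.inf)
for D in D_grid:
    for g in gamma_grid:
        clf = SVC(C=D, gamma=g, kernel='rbf', decision_function_shape='ovo')
        acc = cross_val_score(clf, X, y, cv=5).mean()
        if acc > best[2]:
            best = (D, g, acc)
print('best D = %g, gamma = %.1f, five-fold accuracy = %.3f' % best)
```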

4.1 Benchmark datasets

In this section, we compare our MCVMSVM with the other three multi-class classification methods on UCI data sets. Five-fold cross validation is conducted on 9 UCI datasets, i.e. each data set is randomly split into five parts: four of them form the training set used to learn the optimal hyperplane, and the remaining one is the validation set used to test the hyperplane obtained from the training set. The five-fold validation results are shown in Table 1. We can see that our MCVMSVM performs better than the other methods on five data sets: ‘wine’, ‘tea’, ‘seeds’, ‘glass’ and ‘hayes’. Among these, the accuracies on ‘tea’ and ‘hayes’ are improved by up to 3.3% and 6.8%, respectively, by MCVMSVM. On the datasets ‘iris’, ‘balance’ and ‘ecoli’, the accuracy of MCVMSVM is similar to that of the other methods. Only on the data set ‘der’ is the accuracy of MCVMSVM 0.8% lower than that of the other methods. On the whole, Table 1 shows that MCVMSVM achieves the best classification results on most of the data sets. In addition, the last two rows of the table give the average classification accuracy and the mean standard deviation of the classification models over the nine datasets. Obviously, MCVMSVM has the highest accuracy and the lowest standard deviation, which indicates that MCVMSVM not only improves the classification accuracy but also gives a more stable classification performance. This is because MCVMSVM considers not only the samples of the positive and negative classes but also the structure information of the remaining samples, making full use of the information provided by the dataset.

Table 1 Five-fold cross validation comparison on UCI data sets

The classification performance is also compared on an independent test set. 30% of each dataset is randomly selected as the test set and the remaining 70% is used as the training set. We first select the optimal parameters by five-fold validation and train the classifiers on the training set, and then test the classifiers on the test set. Table 2 shows the comparison results. Clearly, MCVMSVM performs better than all the other methods. Figure 5 compares the accuracies of the five-fold cross validation and the independent test. The accuracies of the other three methods drop by more than 50%, while the accuracy obtained by MCVMSVM is almost unchanged. Therefore, the generalization ability of MCVMSVM is better.

Table 2 The comparison on the independent test set
Fig. 5 The comparison of generalization ability on UCI data sets. The blue columns represent the accuracies of five-fold validation and the red columns the accuracies of the independent test

We also compare the time consumed by K-SVCR and MCVMSVM on the 9 UCI data sets; the results are shown in Fig. 4. Clearly, MCVMSVM is much faster than K-SVCR. Because the MCVMSVM model has far fewer constraints than K-SVCR, its computational complexity is much lower.

4.2 Parameter Analysis

This section analyzes the influence of the model parameters on the classification accuracy through numerical experiments. MCVMSVM has three parameters: the penalty parameter D, the kernel parameter γ, and the parameter λ that balances the within-class scatter and the complexity of the classifier. When considering the influence of one parameter on the accuracy, the other two parameters are kept unchanged. The influence of the parameters D, λ and γ on the classification accuracy is shown in Figs. 6, 7 and 8.

As shown in Fig. 6, when the parameter D is small, the accuracy on the data sets is very low; the classification accuracy then increases significantly as D increases and finally levels off. From the point of view of the change in accuracy, the accuracy starts to increase significantly at \(2^{-5}\) and becomes stable after \(2^{5}\). Therefore, the selection range of parameter D can be reduced to between \(2^{-3}\) and \(2^{5}\).

Fig. 6 The influence of parameter D on classification accuracy

Fig. 7 The influence of parameter γ on classification accuracy

Fig. 8 The influence of parameter λ on classification accuracy

As shown in Fig. 7, the value of γ at which MCVMSVM achieves its best accuracy varies greatly across data sets. From the point of view of the change in accuracy, the accuracy starts to increase significantly at 0.3 and becomes stable after 1. Therefore, the selection range of parameter γ can be reduced to between 0.3 and 1.

In Fig. 8, we find that, except for the "iris" and "tea" datasets, whose best parameters are λ = 1 and λ = 4, the other seven datasets achieve their best accuracy at λ = 0.05. Since λ balances the within-class scatter and the complexity of the classifier, λ = 0.05 indicates that the intra-class structure of the rest-class points has an effect on the classification model, which confirms that the proposed method is meaningful.

4.3 NDC datasets

In order to verify the performance of MCVMSVM on large-scale data sets, NDC datasets are selected as the training samples, with the amount of data increasing from 500 to 50,000. The parameters are fixed as D = \(2^{8}\), P = 1 and λ = 0.0039. We compare the five-fold cross validation accuracies: each data set is randomly split into five parts, four of them form the training set used to learn the hyperplane, and the remaining one is the validation set used to test the hyperplane learned from the training set. Table 3 shows the five-fold validation results. With the increase of the size of the dataset, MCVMSVM shows an obvious advantage in classification accuracy; its average accuracy is 5% higher than that of the other methods. Therefore, the proposed algorithm is applicable to the classification of large-scale data sets.

Table 3 Five-fold cross validation comparison on NDC datasets

5 Conclusion

In this paper, a multi-class support vector machine based on minimization of the class variance (MCVMSVM) is proposed by introducing class structure information into the “1-versus-1” SVM. Compared with “1-versus-1” SVM, MCVMSVM makes full use of the information of the rest class; compared with K-SVCR, its computational complexity is much lower. Numerical experiments demonstrate the effectiveness of our method.