1 Introduction

Support vector machine (SVM), proposed by Vapnik [1, 2], is a general learning method based on statistical learning theory (SLT). SVMs show many good properties on binary classification problems [3]. However, many problems in the real world are multi-class classification problems, such as text recognition, image classification, speech recognition and face recognition. Therefore, more and more researchers devote themselves to the study of multi-class classification. Data mining offers many classification algorithms, such as neural networks, k-nearest neighbor, decision trees and support vector machines. Although neural networks can achieve good classification accuracy, they have some shortcomings: they require a large number of parameters, such as the network topology, initial weights and thresholds; their internal learning process cannot be observed, and their outputs are difficult to interpret, which affects the credibility and acceptability of the results [16,17,18]. K-nearest neighbor is the simplest method; it is suitable for data with a large sample size, but it easily produces misclassifications when the sample size is small. Decision trees are easy to understand and their outputs are easy to interpret, but they are prone to over-fitting and ignore the correlation between attributes. The support vector machine is the most classical and representative method: it offers good generalization performance, handles high-dimensional and nonlinear problems, and avoids the structure-selection and local-minima problems of neural networks.

Most existing multi-class classification methods based on support vector machines adopt a decomposition-reconstruction strategy: the K-class problem is transformed into a series of binary classification problems, and the discriminant functions are obtained by solving these binary problems [4,5,6,7,8, 12]. There are three main "decomposition-reconstruction" strategies. The first is the "1-versus-rest" strategy [9], which constructs K classifiers. For the k-th problem, the examples belonging to the k-th class are viewed as positive examples and the examples of all other classes as negative examples; a binary SVM is then used to build the classifier. A new example is assigned a label according to the winner-takes-all scheme. Although the "1-versus-rest" strategy uses the information of all examples in the training set, almost all of the binary problems are unbalanced because the number of negative examples is much larger than the number of positive examples. The second is the "1-versus-1" strategy [10], which constructs K(K−1)/2 binary classifiers; each classifier involves only two classes and discards the remaining K−2 classes. For a new example, the label is decided by a voting scheme. In the "1-versus-1" strategy, the information of the remaining examples is omitted when constructing each binary classifier. The third is the "1-versus-1-versus-rest" strategy, an improvement of the "1-versus-1" strategy called K-SVCR [11]. K-SVCR constructs K(K−1)/2 sub-classifiers. When constructing each sub-classifier, the original training set is divided into three parts: positive examples (the examples of the i-th class), negative examples (the examples of the j-th class) and rest examples (all other examples). K-SVCR seeks a sub-classifier that classifies the positive and negative examples correctly while restricting the rest examples within an ε-insensitive band. For a new example, the label is also decided by the voting scheme. Compared with "1-versus-1" SVM, K-SVCR makes full use of the information of the training samples. However, although K-SVCR does not lose any example information, the scale of its optimization problems increases greatly and they become difficult to solve.

To overcome the shortcomings of "1-versus-1" SVM and K-SVCR, a new method that incorporates the idea of semi-supervised learning is proposed in this paper. Our method is based on the 1-versus-1-versus-rest framework, i.e. the K-class problem is transformed into K(K−1)/2 binary classification problems. When solving each binary problem, the training set is divided into three parts: positive examples (the examples of the i-th class), negative examples (the examples of the j-th class) and rest examples (all remaining examples). For each sub-classifier, the classification hyperplane is required to separate the positive and negative examples correctly, while the projections of the rest examples onto the normal vector of the hyperplane are concentrated around zero. To this end, a new SVM with a mixed regularization term is proposed. The new regularization term considers both the margin between positive and negative examples and the distribution of the rest examples. Compared with existing methods, the new method uses the information of all examples without discarding any of them.

This paper is organized as follows. In Sect. 2, we give a brief introduction to SVM, the minimization class variance SVM (MCVSVM) and K-SVCR. Our new method, MCVMSVM, is proposed in Sect. 3. Numerical experiments on several real datasets are reported in Sect. 4. We conclude the paper in Sect. 5.

2 Related Works

The works closely related to this paper, namely SVM, MCVSVM and K-SVCR, are introduced in this section.

2.1 Support Vector Machine

Consider the binary classification problem with the training set

$$ T = \{ (x_{1} ,y_{1} ),\ldots,(x_{l} ,y_{l} )\} $$
(1)

where \(x_{i} \in R^{n}\) is an example and \(y_{i} \in \{ + 1, - 1\}\) is the corresponding label.

SVM constructs a hyperplane \(w \cdot x + b = 0\) by maximizing the margin between the two support hyperplanes \(w \cdot x + b = - 1\) and \(w \cdot x + b = 1\). Furthermore, the positive examples should be located above the positive support hyperplane \(w \cdot x + b = 1\) and the negative examples below the negative support hyperplane \(w \cdot x + b = - 1\). SVM obtains the classification hyperplane by solving the following quadratic program:

$$ \begin{aligned} &\mathop {\min }\limits_{w,b,\xi } \frac{1}{2}||w||^{2} + C\sum\limits_{i = 1}^{l} {\xi_{i} } \hfill \\ & y_{i} (w \cdot x_{i} + b) \ge 1 - \xi_{i} ,i = 1,\ldots,l, \hfill \\ &\xi_{i} \ge 0,i = 1,\ldots,l, \hfill \\ \end{aligned} $$
(2)

where C > 0 is a parameter that balances the margin and the classification error, and \(\xi_{i} \ge 0\) are slack variables measuring the classification loss of the examples. More details on SVM can be found in [13].
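For concreteness, problem (2) can be solved directly with a generic convex-optimization tool. The following is a minimal sketch assuming the cvxpy package is available (dedicated SVM solvers based on SMO are normally used in practice); the data arrays are illustrative.

```python
# Minimal sketch: solve the soft-margin SVM primal (2) with cvxpy (assumed available).
# X is an (l, n) array of training examples, y an (l,) array of labels in {+1, -1}.
import numpy as np
import cvxpy as cp

def linear_svm(X, y, C=1.0):
    l, n = X.shape
    w = cp.Variable(n)
    b = cp.Variable()
    xi = cp.Variable(l, nonneg=True)                      # slack variables
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]   # y_i(w.x_i + b) >= 1 - xi_i
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# toy usage with two separable point clouds
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = linear_svm(X, y, C=10.0)
print(w, b)
```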

2.2 Minimization Class Variance Support Vector Machine

For the training set (1), the hyperplane generated by the traditional support vector machine is determined by a small number of support vectors and ignores the structural characteristics of the training examples. Therefore, SVM may not classify some data well, as shown in Fig. 1. Although SVM finds a hyperplane that separates the positive and negative training examples correctly, this hyperplane does not represent the distribution trend of the data well, so a new point "*" may be misclassified by SVM. For this reason, the minimization class variance support vector machine (MCVSVM) was proposed in [14]. MCVSVM adopts a within-class scatter matrix to capture the distribution of the data. Its optimization problem is as follows:

Fig. 1 The classification effect of SVM

$$ \begin{aligned} &\mathop {\min }\limits_{w,b} \frac{1}{2}w^{T} S_{w} w + C\sum\limits_{i = 1}^{l} {\xi_{i} } \hfill \\ &y_{i} (x_{i}^{T} w + b) \ge 1 - \xi_{i} , \hfill \\& \xi_{i} \ge 0, \, \hfill \\ \end{aligned} $$
(3)

where the matrix \(S_{w}\) is the sum of the scatter matrices of the positive and negative examples, \(C > 0\) is the penalty parameter, and \(\xi_{i}\) are slack variables measuring the classification loss of the examples.
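As a small illustration (plain NumPy; the array names are ours), the matrix \(S_{w}\) can be computed as follows:

```python
# Sketch: S_w as the sum of the within-class scatter matrices of the
# positive and negative classes, S_w = sum_c sum_{x in c} (x - m_c)(x - m_c)^T.
import numpy as np

def class_scatter(X):
    Xc = X - X.mean(axis=0)      # subtract the class mean
    return Xc.T @ Xc             # (n, n) scatter matrix

def within_class_scatter(X_pos, X_neg):
    return class_scatter(X_pos) + class_scatter(X_neg)
```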

Now let us give the geometric interpretation of optimization problem (3). Minimizing the first term of the objective function means that the projections of the intra-class examples onto the normal vector of the hyperplane should be as close to each other as possible; minimizing the second term minimizes the misclassification loss; the constraints mean that the positive and negative examples should be located on the two sides of the hyperplanes \((w \cdot x + b) = 1\) and \((w \cdot x + b) = - 1\), respectively. For a new example, its label is decided by the following discriminant function:

$$ f(x) = {\text{sgn}} (S_{w}^{ - 1} \sum\limits_{i = 1}^{l} {\alpha_{i} y_{i} \left\langle {x_{i} \cdot x} \right\rangle } + b). $$
(4)

Since MCVSVM considers the distribution of the examples, the hyperplane obtained by MCVSVM is better, as shown in Fig. 2. However, MCVSVM has a serious shortcoming: when the dimension of the examples is much larger than the number of examples, the inverse of the matrix \(S_{w}\) may not exist.

Fig. 2 The classification effect of MCVSVM

2.3 Support Vector Classification-Regression Machine for K-Class (K-SVCR)

Given the training set

$$ T = \{ (x_{1} ,y_{1} ),\ldots,(x_{l} ,y_{l} )\} $$
(5)

where \(x_{i} \in R^{n}\) is an example and \(y_{i} \in \left\{ {1,2, \ldots ,K} \right\}\) is the corresponding label of \(x_{i}\).

K-SVCR constructs K(K − 1)/2 sub-classifiers for the K-class classification problem. When constructing each sub-classifier, the training set is divided into three parts: the positive class, the negative class and the rest class, and a mixed classification-regression machine is formulated. First, the classification hyperplane \(w \cdot x + b = 0\) should separate the training examples of the positive and negative classes as correctly as possible. Second, the examples of the rest class should lie in the ε-insensitive band (\(0 < {\upvarepsilon } < 1\)) of the hyperplane \(w \cdot x + b = 0\). According to this idea, the optimization problem of K-SVCR is formulated as follows:

$$ \begin{aligned} &\mathop {\min }\limits_{{w,b,\xi ,\varphi_{i} ,\varphi_{i}^{*} }} \frac{1}{2}||w||^{2} + C\sum\limits_{i = 1}^{{l_{1} + l_{2} }} {\xi_{i} } + D\sum\limits_{{i = l_{1} + l_{2} + 1}}^{l} {(\varphi_{i} + \varphi_{i}^{*} )} \hfill \\ &s.t.y_{i} (w \cdot x_{i} + b) \ge 1 - \xi_{i} ,i = 1,...,l_{1} + l_{2} , \hfill \\ &- \varepsilon - \varphi_{i}^{*} \le w \cdot x_{i} + b \le \varepsilon + \varphi_{i} ,i = l_{1} + l_{2} + 1,...,l, \hfill \\ & \xi_{i} \ge 0,i = 1,...,l_{1} + l_{2} , \hfill \\ \end{aligned} $$
(6)

where \(\xi_{i} ,\varphi_{i} ,\varphi_{i}^{*}\) are slack variables measuring the losses, C > 0 and D > 0 are parameters that balance the margin, the classification error and the regression error, \(l_{1}\) is the number of positive examples, and \(l_{2}\) is the number of negative examples.

Now we explain the optimization problem in detail. Minimizing the first term in (6) maximizes the margin between the two support hyperplanes \(w \cdot x + b = - 1\) and \(w \cdot x + b = 1\). The second term minimizes the sum of classification errors of the examples belonging to the positive and negative classes, and the third term minimizes the sum of regression errors of the examples belonging to the rest class. The first set of constraints requires the positive examples to lie above the positive support hyperplane \(w \cdot x + b = 1\) and the negative examples to lie below the negative support hyperplane \(w \cdot x + b = - 1\). The second set of constraints requires the rest examples to lie within the ε-insensitive band of \(w \cdot x + b = 0\). Obviously, optimization problem (6) contains a large number of constraints, so it takes a long time to solve.
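To make the size of problem (6) concrete, the following is a minimal sketch of building and solving it for one class pair, again assuming cvxpy is available; X_pos, X_neg hold the two focused classes and X_rest the remaining classes (the names are illustrative).

```python
# Sketch: one K-SVCR sub-problem (6). Note that every rest example adds two
# band constraints and two slack variables, which is what makes (6) large.
import numpy as np
import cvxpy as cp

def ksvcr_pair(X_pos, X_neg, X_rest, C=1.0, D=1.0, eps=0.1):
    X12 = np.vstack([X_pos, X_neg])
    y12 = np.hstack([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    n = X12.shape[1]
    w, b = cp.Variable(n), cp.Variable()
    xi = cp.Variable(len(X12), nonneg=True)          # classification slacks
    phi = cp.Variable(len(X_rest), nonneg=True)      # upper band slacks
    phi_star = cp.Variable(len(X_rest), nonneg=True) # lower band slacks
    obj = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)
                      + D * cp.sum(phi + phi_star))
    cons = [cp.multiply(y12, X12 @ w + b) >= 1 - xi,
            X_rest @ w + b <= eps + phi,
            X_rest @ w + b >= -eps - phi_star]
    cp.Problem(obj, cons).solve()
    return w.value, b.value
```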

3 Minimization Class Variance SVM for Multi-class Classification (MCVMSVM)

Now we are in a position to present our MCVMSVM, including the linear MCVMSVM and the nonlinear MCVMSVM.

3.1 The linear MCVMSVM

Our new method adopts the "1-versus-1" strategy: a total of K(K − 1)/2 classifiers are constructed, and the final decision is made by the voting scheme. When constructing each classifier, the training set (5) is first divided into three parts: the examples of the i-th class are viewed as the positive class with label 1, the examples of the j-th class as the negative class with label −1, and the examples of all remaining classes as the zero class with label 0 (a small sketch of this splitting step is given after Fig. 3). For this re-divided training set, MCVMSVM seeks a hyperplane such that the positive and negative examples are located on the two sides of \(w \cdot x + b = 1\) and \(w \cdot x + b = - 1\), respectively, and the projections of the zero-class examples onto the normal vector w are concentrated as close to zero as possible. The geometric explanation can be seen clearly in Fig. 3: the negative examples lie below the plane \(w \cdot x + b = - 1\), the positive examples lie above the plane \(w \cdot x + b = 1\), and the projections of the zero-class examples onto w are close to zero. The hyperplane obtained by MCVMSVM should be the middle solid line H1, not H0, which is obtained by 1-versus-1 SVM.

Fig. 3 An example learned by the linear MCVMSVM
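A minimal sketch of this re-labelling step for one pair (i, j) (NumPy only; the variable names are illustrative):

```python
# Sketch: split the training set into positive (class i), negative (class j)
# and zero/rest (all remaining classes) parts for one sub-classifier.
import numpy as np

def split_pair(X, y, i, j):
    rest = (y != i) & (y != j)
    X_pos = X[y == i]          # relabelled +1
    X_neg = X[y == j]          # relabelled -1
    X_zero = X[rest]           # relabelled 0
    y_zero = y[rest]           # original labels, needed later for S_w in (10)
    return X_pos, X_neg, X_zero, y_zero
```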

According to the training set (5), the matrix composed of the examples of class \(i\) is constructed as the positive sample matrix \(A_{i}\):

$$ A_{i} = [x_{i1} ,x_{i2} ,\ldots,x_{{il_{i} }} ],i = 1,\ldots,K,A_{i} \in R^{{l_{i} \times n}} , $$
(7)

the matrix composed of the examples of class \(j\) is constructed as the negative sample matrix \(B_{j}\):

$$ B_{j} = [x_{j1} ,x_{j2} ,\ldots,x_{{jl_{j} }} ],j = 1,\ldots,K,B_{j} \in R^{{l_{j} \times n}} $$
(8)

The examples of the zero class are collected in \(C_{i,j}\). Define the set \(C_{i,j}^{k}\), composed of the examples belonging to the k-th class (\(k \ne i,j\)), as follows:

$$ C_{i,j}^{k} = \{ x_{k1} ,x_{k2} ,\ldots,x_{kl_{k}} \},\quad k = 1,\ldots,K,\; k \ne i,\; k \ne j $$
(9)

Let \({m}_{i,j}^{k}\) denote the center point of \(C_{i,j}^{k}\).

The intra class scatter matrix of the zero class is \(S_{w}\):

$$ S_{w} = \sum\limits_{k = 1}^{K - 2} {\sum\limits_{{x \in C_{i,j}^{k} }} {(x - m_{i,j}^{k} )(x - m_{i,j}^{k} )^{T} } } , $$
(10)
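The scatter matrix (10) only involves the zero-class examples and their original class labels; a minimal NumPy sketch:

```python
# Sketch: the zero-class scatter matrix S_w of Eq. (10), accumulated over the
# K-2 remaining classes. X_zero holds the rest examples, y_zero their labels.
import numpy as np

def zero_class_scatter(X_zero, y_zero):
    n = X_zero.shape[1]
    S_w = np.zeros((n, n))
    for k in np.unique(y_zero):              # loop over the K-2 remaining classes
        Xk = X_zero[y_zero == k]
        Xk_c = Xk - Xk.mean(axis=0)          # subtract the class center m_{i,j}^k
        S_w += Xk_c.T @ Xk_c
    return S_w
```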

Suppose the hyperplane for the pair \((i,j)\) is \((w_{i,j} \cdot x) + b_{i,j} = 0\); then the corresponding optimization problem is constructed as follows:

$$ \begin{aligned} &\mathop {\min }\limits_{{w_{i,j} ,b_{i,j} }} \frac{1}{2}||w_{i,j} ||^{2} + De_{i,j}^{T} \xi_{i,j} + \frac{\lambda }{2}w_{i,j}^{T} S_{w} w_{i,j} \hfill \\ &(A_{i} w_{i,j} + e_{i} b_{i,j} ) + \xi_{i,j} \ge e_{i} , \hfill \\ &- (B_{j} w_{i,j} + e_{j} b_{i,j} ) + \xi_{i,j} \ge e_{j} , \hfill \\ &\xi_{i,j} \ge 0, \, \hfill \\ \end{aligned} $$
(11)

where \(D > 0,\,\lambda > 0\) are parameters, \(e_{i,j} = [e_{i} ,e_{j} ]\), \(e_{i} ,e_{j}\) are vectors of ones with appropriate dimensions, and \(\xi_{i,j}\) are slack variables measuring the misclassification.

Minimizing the first term in the objective function of problem (11) maximizes the margin between the support hyperplanes. Minimizing the second term minimizes the sum of classification errors. Minimizing the last term minimizes the scatter of the projections of the zero-class examples onto the normal vector of the hyperplane. The constraints guarantee that the positive examples are located above \(w \cdot x + b = 1\) and the negative examples below \(w \cdot x + b = - 1\). Next, we construct the dual problem of (11). The Lagrangian of problem (11) is formulated first:

$$ \begin{gathered} L(w_{i,j} ,b_{i,j} ,\xi_{i,j} ,\alpha_{i,j} ,\mu_{i,j} ,\beta_{i,j} ,\eta_{i,j} ) \hfill \\ = De_{i,j}^{T} \xi_{i,j} + \frac{1}{2}||w_{i,j} ||^{2} + \frac{\lambda }{2}w_{i,j}^{T} S_{w} w_{i,j} \hfill \\ - \alpha_{i,j}^{T} [(A_{i} w_{i,j} + e_{i} b_{i,j} ) + \xi_{i,j} - e_{i} ] - \mu_{i,j}^{T} \xi_{i,j} \hfill \\ + \beta_{i,j}^{T} [(B_{j} w_{i,j} + e_{j} b_{i,j} ) - \xi_{i,j} + e_{j} ] - \eta_{i,j}^{T} \xi_{i,j} \hfill \\ \end{gathered} $$
(12)

Then, by the Karush-Kuhn-Tucker (KKT) conditions, we get

$$ \frac{\partial L}{{\partial w_{i,j} }} = w_{i,j} + \lambda S_{w} w_{i,j} - A_{i}^{T} \alpha_{i,j} + B_{j}^{T} \beta_{i,j} = 0 \Rightarrow w_{i,j} = (I + \lambda S_{w} )^{ - 1} (A_{i}^{T} \alpha_{i,j} - B_{j}^{T} \beta_{i,j} ), $$
(13)
$$ \frac{\partial L}{{\partial b_{i,j} }} = - e_{i}^{T} \alpha_{i,j} + e_{j}^{T} \beta_{i,j} = 0, $$
(14)
$$ \frac{\partial L}{{\partial \xi_{i,j} }} = \left[ {\begin{array}{*{20}c} {De{}_{i}} \\ {De_{j} } \\ \end{array} } \right] - \left[ {\begin{array}{*{20}c} {\alpha_{i,j}^{T} e{}_{i}} \\ {\beta_{i,j}^{T} e{}_{j}} \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} {\mu_{i,j}^{T} e{}_{i}} \\ {\eta_{i,j}^{T} e{}_{j}} \\ \end{array} } \right] = 0 \Rightarrow 0 \le \alpha_{i,j} ,\beta_{i,j} \le D, $$
(15)

Substituting Eqs. (13)-(15) into (12), the dual problem of (11) can be written as:

$$ \begin{aligned} & \mathop {\max }\limits_{{\alpha_{i,j} ,\beta_{i,j} }} [e_{i}^{T} ,e_{j}^{T} ][\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ]^{T} - \frac{1}{2}[\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ][A_{i}^{T} , - B_{j}^{T} ]^{T} (I + \lambda S_{w} )^{ - 1} [A_{i}^{T} , - B_{j}^{T} ][\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ]^{T} \hfill \\ &0 \le \alpha_{i,j} ,\beta_{i,j} \le D, \hfill \\ \end{aligned} $$
(16)

where \(\alpha_{i,j} ,\beta_{i,j}\) are the Lagrange multipliers. Solving the dual problem (16), we obtain the optimal solution \([\alpha_{i,j}^{*} ,\beta_{i,j}^{*} ]\); choosing a support vector with label \(y_{k}\), the optimal solution of the primal problem (11) is

$$ \begin{gathered} w_{i,j}^{ * } = (I + \lambda S_{w} )^{ - 1} [A_{i}^{T} \alpha_{i,j}^{ * } - B_{j}^{T} \beta_{i,j}^{ * } ], \hfill \\ b_{i,j}^{ * } = y_{k} - A_{i}^{k} (I + \lambda S_{w} )^{ - 1} [A_{i}^{T} \alpha_{i,j}^{ * } - B_{j}^{T} \beta_{i,j}^{ * } ], \hfill \\ \end{gathered} $$
(17)

So the hyperplane for the pair \((i,j)\) can be expressed as:

$$ f_{ij} (x) = [x,1][w_{i,j}^{*T} ,b_{i,j}^{*T} ]^{T} = 0 .$$
(18)
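Once the dual problem (16) is solved by any QP solver, (17) and (18) can be evaluated directly. A minimal NumPy sketch, where alpha_star and beta_star denote the dual solution and (x_sv, y_sv) a support vector with its label (these names are ours):

```python
# Sketch: recover the primal solution via Eq. (17) and evaluate the pairwise
# decision value of Eq. (18). All inputs are NumPy arrays.
import numpy as np

def recover_hyperplane(A_i, B_j, S_w, alpha_star, beta_star, lam, x_sv, y_sv):
    n = A_i.shape[1]
    M = np.linalg.inv(np.eye(n) + lam * S_w)           # (I + lambda * S_w)^{-1}
    w = M @ (A_i.T @ alpha_star - B_j.T @ beta_star)   # Eq. (17)
    b = y_sv - x_sv @ w                                # offset from a support vector
    return w, b

def f_ij(x, w, b):
    return x @ w + b                                   # Eq. (18); its sign gives the vote
```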

In total, \(K(K - 1)/2\) hyperplanes of the form (18) are constructed in our MCVMSVM, and the corresponding \(K(K - 1)/2\) decision functions are defined as follows:

$$ {\text{sgn}} (f_{i,j} (x)) = \left\{ \begin{gathered} + 1,[x,1][w_{i,j}^{*T} ,b_{i,j}^{*T} ]^{T} > 0 \, i,j = 1, \ldots ,K;i < j, \hfill \\ - 1,[x,1][w_{i,j}^{*T} ,b_{i,j}^{*T} ]^{T} < 0 \, i,j = 1, \ldots ,K;i < j, \hfill \\ \end{gathered} \right. $$
(19)

For a new example x, a vote is given to class i or class j depending on which condition is satisfied. Finally, the example x is assigned to the class label that receives the most votes.
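A minimal sketch of this voting step, assuming the K(K − 1)/2 trained sub-classifiers are stored in a dictionary classifiers mapping a pair (i, j) with i < j to its (w, b):

```python
# Sketch: max-wins voting over the pairwise decision functions (19).
import numpy as np

def predict(x, classifiers, K):
    votes = np.zeros(K, dtype=int)
    for (i, j), (w, b) in classifiers.items():   # i < j, classes indexed 0..K-1
        if x @ w + b > 0:
            votes[i] += 1                        # vote for class i
        else:
            votes[j] += 1                        # vote for class j
    return int(np.argmax(votes))
```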

3.2 The nonlinear MCVMSVM

In this section, the linear MCVMSVM is extended to the nonlinear case using the kernel trick. The training set (5) is mapped into a high-dimensional feature space by a nonlinear mapping \(\phi\):

$$ x \to \phi (x) $$
(20)

Suppose that the normal vector of the hyperplane in the feature space can be expressed as:

$$ w = \mathop \sum \limits_{i = 1}^{{l_{1} + l_{2} }} \mu_{i} \phi \left( {x_{i} } \right) $$
(21)

Then we define the intra-class divergence as:

$$ \begin{gathered} w^{T} S_{w} w = w^{T} \sum\limits_{k = 1}^{K - 2} {\sum\limits_{{x \in C_{i,j}^{k} }} {(\phi (x) - m_{i,j}^{k} )(\phi (x) - m_{i,j}^{k} )^{T} } } w \hfill \\ = \sum\limits_{k = 1}^{K - 2} {[w^{T} \sum\limits_{{x \in C_{i,j}^{k} }} {(\phi (x) - m_{i,j}^{k} )(\phi (x) - m_{i,j}^{k} )^{T} } } w]. \hfill \\ \end{gathered} $$
(22)

We rewrite the term in (22) as follows:

$$ \begin{aligned} & w^{T} \sum\limits_{{x \in C_{i,j}^{k} }} {(\varphi (x) - m_{i,j}^{k} )(\varphi (x) - m_{i,j}^{k} )^{T} w} \hfill \\ &= w^{T} \sum\limits_{{x \in C_{i,j}^{k} }} {\left[ {(\varphi (x)\varphi (x)^{T} - m_{i,j}^{k} \varphi (x)^{T} - \varphi (x)m_{i,j}^{kT} + m_{i,j}^{k} m_{i,j}^{kT} )^{T} } \right]} w \hfill \\ &= w^{T} \left[ {\sum\limits_{{x \in C_{i,j}^{k} }} {\varphi (x)\varphi (x)^{T} - m_{i,j}^{k} \sum\limits_{{x \in C_{i,j}^{k} }} {\varphi (x)^{T} } - \sum\limits_{{x \in C_{i,j}^{k} }} {(\varphi (x)m_{i,j}^{kT} } + l_{r}^{k} m_{i,j}^{k} m_{i,j}^{kT} )^{T} } } \right]w \hfill \\ &= w^{T} \left[ {\sum\limits_{{x \in C_{i,j}^{k} }} {\varphi (x)\varphi (x)^{T} } - l_{r}^{k} m_{i,j}^{k} m_{i,j}^{kT} } \right]w \hfill \\ &= \sum\limits_{{x \in C_{i,j}^{k} }} {w^{T} \varphi (x)\varphi (x)^{T} w} - w^{T} l_{r}^{k} m_{i,j}^{k} m_{i,j}^{kT} w \hfill \\ &= \mu^{T} K(C_{i,j}^{k} ,S)K(C_{i,j}^{k} ,S)^{T} \mu - \mu^{T} K(C_{i,j}^{k} ,S)L^{k}_{l_{r}} K(C_{i,j}^{k} ,S)^{T} \mu \hfill \\ &= \mu^{T} K(C_{i,j}^{k} ,S)(I_{{l_{r} }}^{k} - L_{{l_{r} }}^{k} )K(C_{i,j}^{k} ,S)^{T} \mu \hfill \\ \end{aligned} $$
(23)

where \(\mu = \left[ {\mu_{1} , \ldots ,\mu_{{l_{1} + l_{2} }} } \right]^{T}\), \(l_{r}^{k}\) is the number of examples in \(C_{i,j}^{k}\), \(I_{{l_{r} }}^{k}\) is the \(l_{r}^{k}\)-order identity matrix, \(L_{{l_{r} }}^{k}\) is the \(l_{r}^{k}\)-order matrix whose elements are all equal to the square of \(\frac{1}{{l_{r}^{k} }}\), and \(S = [A_{i}^{T} ,B_{j}^{T} ]\). The optimization problem of the nonlinear MCVMSVM for the pair \((i,j)\) is:

$$ \begin{aligned} &\mathop {\min }\limits_{{w,b,\xi_{i,j} }} \frac{1}{2}\mu^{T} K(S,S)\mu + De_{i,j}^{T} \xi_{i,j} + \frac{\lambda }{2}\sum\limits_{k = 1}^{K - 2} {\mu^{T} K(C_{i,j}^{k} ,S)(I_{i,j}^{k} - L_{i,j}^{k} )K(C_{i,j}^{k} ,S)^{T} \mu } \hfill \\ &(K(A_{i}^{T} ,S^{T} )\mu + e_{i} b_{i,j} ) + \xi_{i,j} \ge e_{i} , \hfill \\ & - (K(B_{j}^{T} ,S^{T} )\mu + e_{j} b_{i,j} ) + \xi_{i,j} \ge e_{j} , \hfill \\ &\xi_{i,j} \ge 0 \hfill \\ \end{aligned} $$
(24)

where \(D > 0,\lambda > 0\) are parameters, \(e_{i,j} = [e_{i} ,e_{j} ]\), \(e_{i} ,e_{j}\) are vectors of ones with appropriate dimensions, and \(\xi_{i,j}\) are slack variables measuring the misclassification. Similar to the linear case, we obtain the optimal solution of (24) by solving the following dual problem:

$$ \begin{aligned} & \mathop {\max }\limits_{{\alpha_{i,j} ,\beta_{i,j} }} [e_{i}^{T} ,e_{j}^{T} ][\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ]^{T} \hfill \\ &\quad - \frac{1}{2}[\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ][K_{i}^{T} , - K_{j}^{T} ]^{T} (G + \lambda \sum\limits_{k = 1}^{K - 2} {K_{i,j}^{k} L_{i,j}^{k} K_{i,j}^{kT} } )^{ - 1} [K_{i}^{T} , - K_{j}^{T} ][\alpha_{i,j}^{T} ,\beta_{i,j}^{T} ]^{T} \hfill \\ & \quad 0 \le \alpha_{i,j} ,\beta_{i,j} \le D, \hfill \\ \end{aligned} $$
(25)

where \(G = K(S^{{\text{T}}} ,S^{{\text{T}}} )\), \(K_{i} = K(A_{i}^{T} ,S^{T} )\), \(K_{j} = K(B_{j}^{T} ,S^{T} )\), \(K_{ij}^{k} = K(C_{ij}^{kT} ,S^{T} )\) and \(L_{ij}^{k} = I_{{l_{r} }}^{k} - L_{{l_{r} }}^{k}\). Solving the dual problem (25), we obtain the optimal solution \([\alpha_{i,j}^{*} ,\beta_{i,j}^{*} ]\); let \(y_{k}\) be the label of a support vector \(x_{k}\). Then the optimal solution of the primal problem (24) is

$$ \mu^{*} = (G + \lambda \sum\limits_{k = 1}^{K - 2} {K_{i,j}^{k} L_{i,j}^{k} K_{i,j}^{kT} } )^{ - 1} (K_{i}^{T} \alpha_{i,j}^{*} - K_{j}^{T} \beta_{i,j}^{*} ) $$
(26)
$$ b^{ * } = y_{k}^{*} - \mu^{*} K(x_{k} ,S) $$
(27)

Then the hyperplane for the pair \((i,j)\) can be expressed as:

$$ f_{i,j} (x) = \mu^{ * } K(x^{T} ,S^{T} ){ + }b^{ * } = 0 $$
(28)

Just as in the linear case, the decision function that classifies class \(i\) and class \(j\) is defined as

$$ {\text{sgn}} (f_{i,j} (x))\,{ = }\,\left\{ \begin{gathered} { + }1,\mu^{ * } K(x^{T} ,S^{T} ){ + }b^{ * } > 0 \, i,j = 1,\ldots,K;i < j, \hfill \\ - 1,\mu^{ * } K(x^{T} ,S^{T} ){ + }b^{ * } < 0 \, i,j = 1,\ldots,K;i < j, \hfill \\ \end{gathered} \right. $$
(29)

For a new example \(x\), a vote is given to class \(i\) or class \(j\) depending on which condition is satisfied. Finally, the example \(x\) is assigned to the class label that receives the most votes.
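All the kernel matrices above (G, K_i, K_j and K(C_{i,j}^k, S)) are built from a single kernel function. A minimal NumPy sketch of the RBF kernel used later in the experiments:

```python
# Sketch: RBF kernel matrix K(U, V)[a, b] = exp(-gamma * ||u_a - v_b||^2).
# With S the stacked positive and negative examples, K_i = rbf_kernel(A_i, S),
# K_j = rbf_kernel(B_j, S) and K(C_{i,j}^k, S) = rbf_kernel(C_ijk, S).
import numpy as np

def rbf_kernel(U, V, gamma=0.5):
    sq_dist = (np.sum(U**2, axis=1)[:, None]
               + np.sum(V**2, axis=1)[None, :]
               - 2.0 * U @ V.T)
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))   # clip tiny negative round-off
```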

3.3 Comparing with other methods


Comparing with “1-versus-1” SVM. Our method constructs \(K(K-1)/2\) hyperplanes, exactly as “1-versus-1” SVM does, and its computational complexity is almost the same. However, the information of all examples is used in our method, whereas only the positive and negative examples are used in “1-versus-1” SVM. Clearly, our method makes fuller use of the information of the rest class.

Comparing with “1-versus-rest” SVM. Both “1-versus-rest” SVM and our MCVMSVM make full use of the information of all examples. However, MCVMSVM uses this information in a more detailed way, dividing the examples into three parts (positive, negative and rest), whereas “1-versus-rest” SVM uses it more coarsely, dividing the examples into only two parts (positive and negative).

Comparing with K-SVCR. The “1-versus-1-versus-rest” strategy is used in both our MCVMSVM and K-SVCR, but the rest class is handled quite differently: MCVMSVM introduces a quadratic term based on the within-class scatter of the rest class, whereas K-SVCR adds a large number of constraints involving the rest examples. Therefore, the computational complexity of MCVMSVM is much lower than that of K-SVCR and its training time is shorter; the gap becomes more obvious as the number of classes grows, as can be seen clearly in Fig. 4.
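To make the difference concrete, here is a small counting sketch (our own illustration) of the inequality constraints in one sub-problem, for l1 positive, l2 negative and l0 rest examples, ignoring the nonnegativity of the slacks:

```python
# Sketch: number of inequality constraints per sub-problem (slack nonnegativity excluded).
def ksvcr_constraints(l1, l2, l0):
    return (l1 + l2) + 2 * l0   # classification constraints + two-sided epsilon band

def mcvmsvm_constraints(l1, l2, l0):
    return l1 + l2              # rest examples enter only through the S_w term

print(ksvcr_constraints(50, 50, 400))    # 900
print(mcvmsvm_constraints(50, 50, 400))  # 100
```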

Fig. 4 The comparison of the training time of MCVMSVM and KSVCR on the nine experimental data sets

4 Numerical experiments

To evaluate the performance of the new algorithm, several UCI [14] datasets and large-scale NDC datasets are used in the numerical experiments (Table 3). All experiments were implemented in Matlab 2015b on a PC with an Intel Core i7 CPU at 3.5 GHz, 4 GB of RAM and the Windows 10 operating system. For the nonlinear case, the RBF kernel is used. The parameter D is selected from the set \(\{2^{ - 8} ,2^{ - 7} , \ldots ,2^{7} ,2^{8}\}\) and γ is selected from \(\{0.1,0.2, \ldots ,2\}\). Our MCVMSVM is compared with the popular multi-class classification methods 1-v-1 SVM, 1-v-r SVM and K-SVCR.
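The parameter-selection protocol can be reproduced as sketched below; note that the sketch uses scikit-learn's SVC (i.e. the "1-versus-1" RBF baseline) and the UCI 'wine' data only to illustrate the grid and the five-fold procedure, not the proposed MCVMSVM itself.

```python
# Sketch: grid search over D (SVC's C) and the RBF gamma with five-fold cross
# validation, illustrated on the 1-versus-1 baseline and the 'wine' dataset.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
D_grid = [2.0**p for p in range(-8, 9)]                 # 2^-8, ..., 2^8
gamma_grid = [round(0.1 * k, 1) for k in range(1, 21)]  # 0.1, 0.2, ..., 2.0

best = (None, None, -np.inf)
for D in D_grid:
    for g in gamma_grid:
        clf = SVC(C=D, gamma=g, kernel='rbf', decision_function_shape='ovo')
        acc = cross_val_score(clf, X, y, cv=5).mean()
        if acc > best[2]:
            best = (D, g, acc)
print('best D = %g, gamma = %.1f, five-fold accuracy = %.3f' % best)
```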

4.1 Benchmark datasets

In this section, we compare our MCVMSVM with the other three multi-class classification methods on UCI data sets. Five-fold cross validation is conducted on 9 UCI datasets, i.e. each data set is randomly split into five parts: four of them form the training set used to learn the optimal hyperplane, and the remaining one is the validation set used to test the hyperplane obtained from the training set. The five-fold validation results are shown in Table 1. We can see that our MCVMSVM performs better than the other methods on five data sets: ‘wine’, ‘tea’, ‘seeds’, ‘glass’ and ‘hayes’. Among these, the accuracies on ‘tea’ and ‘hayes’ are improved by up to 3.3% and 6.8%, respectively, by MCVMSVM. On the datasets ‘iris’, ‘balance’ and ‘ecoli’, the accuracy of MCVMSVM is similar to that of the other methods. Only on the data set ‘der’ is the accuracy of MCVMSVM 0.8% lower than that of the other methods. On the whole, Table 1 shows that MCVMSVM achieves the best classification results on most of the data sets. In addition, the last two rows of the table give the average classification accuracy and the mean standard deviation of the classification models over the nine datasets. Obviously, MCVMSVM has the highest accuracy and the lowest standard deviation, which indicates that MCVMSVM not only improves the classification accuracy but also gives a more stable classification performance. This is because MCVMSVM considers not only the samples of the positive and negative classes but also the structure information of the remaining samples, making full use of the information provided by the dataset.

Table 1 Five-fold cross validation comparison on UCI data sets

The classification performance is also compared on an independent test set. 30% of each dataset is randomly selected as the test set and the remaining 70% is used as the training set. We first select the optimal parameters by five-fold validation and train the classifiers on the training set, and then test the classifiers on the test set. Table 2 shows the comparison results. Clearly, MCVMSVM performs better than all the other methods. Figure 5 compares the accuracies of the five-fold cross validation and the independent test. The accuracies of the other three methods drop by more than 50%, while the accuracy obtained by MCVMSVM is almost unchanged. Therefore, the generalization ability of MCVMSVM is better.

Table 2 The comparison on the independent test set
Fig. 5 The comparison of generalization ability on UCI data sets. The blue columns represent the accuracies of five-fold validation and the red columns the accuracies of the independent test

We also compare the time consumed by K-SVCR and MCVMSVM on the 9 UCI data sets; the results are shown in Fig. 4. Clearly, MCVMSVM is much faster than K-SVCR. Because the MCVMSVM model has far fewer constraints than K-SVCR, its computational complexity is much lower.

4.2 Parameter Analysis

This section analyzes the influence of the model parameters on the classification accuracy through numerical experiments. MCVMSVM has three parameters: the penalty parameter D, the kernel parameter γ, and the parameter λ that balances the within-class scatter and the complexity of the classifier. When considering the influence of one parameter on the accuracy, the other two parameters are kept unchanged. The influence of the parameters D, λ and γ on the classification accuracy is shown in Figs. 6, 7 and 8.

As shown in Fig. 6, when the parameter D is small, the accuracy on the data sets is very low; the classification accuracy then increases significantly as D increases and finally levels off. From the point of view of the change in accuracy, the accuracy starts to increase significantly at \(2^{-5}\) and becomes stable after \(2^{5}\). Therefore, the selection range of parameter D can be reduced to between \(2^{-3}\) and \(2^{5}\).

Fig. 6 The influence of parameter D on classification accuracy

Fig. 7 The influence of parameter γ on classification accuracy

Fig. 8 The influence of parameter λ on classification accuracy

As shown in Fig. 7, the value of γ at which MCVMSVM achieves its best accuracy varies greatly across data sets. From the point of view of the change in accuracy, the accuracy starts to increase significantly at 0.3 and becomes stable after 1. Therefore, the selection range of parameter γ can be reduced to between 0.3 and 1.

In Fig. 8, we find that, except for the "iris" and "tea" datasets, whose best parameters are λ = 1 and λ = 4, the other seven datasets achieve their best accuracy at λ = 0.05. Since λ balances the within-class scatter and the complexity of the classifier, λ = 0.05 indicates that the intra-class structure of the rest-class points has an effect on the classification model, which confirms that the proposed method is meaningful.

4.3 NDC datasets

In order to verify the performance of MCVMSVM on large-scale data sets, NDC datasets are selected as the training samples, with the amount of data increasing from 500 to 50,000. The parameters are fixed as D = \(2^{8}\), P = 1 and λ = 0.0039. We compare the five-fold cross validation accuracies: each data set is randomly split into five parts, four of them form the training set used to learn the hyperplane, and the remaining one is the validation set used to test the hyperplane learned from the training set. Table 3 shows the five-fold validation results. With the increase of the size of the dataset, MCVMSVM shows an obvious advantage in classification accuracy; its average accuracy is 5% higher than that of the other methods. Therefore, the proposed algorithm is applicable to the classification of large-scale data sets.

Table 3 Five-fold cross validation comparison on NDC datasets

5 Conclusion

In this paper, a multi-class support vector machine based on minimization of the class variance (MCVMSVM) is proposed by introducing class structure information into the “1-versus-1” SVM. Compared with “1-versus-1” SVM, MCVMSVM makes full use of the information of the rest class; compared with K-SVCR, its computational complexity is much lower. Numerical experiments demonstrate the effectiveness of our method.