
1 Introduction

The binary support vector machine (SVM) [12] is one of the most popular machine learning methods; it finds a classifier with high classification ability by maximizing the geometric margin between the data and a separating hyperplane. In addition, various extensions of the binary SVM have been investigated for multiclass classification. In this paper, we focus on the all-together (AT) approach among these extensions, especially the multiobjective multiclass SVM (MMSVM) [7]. It was reported that, compared with the simplest AT method, which maximizes functional margins, MMSVM can obtain a classifier with a higher classification rate by exactly maximizing the geometric margin between each class pair. However, it requires a larger amount of computational resources than the simplest AT and other methods [7,8,9].

Therefore, in this paper, we propose a method of reducing the computational complexity without greatly decreasing the generalization ability. First, we derive an approximate convex multiobjective optimization problem (MOP) by linearizing some constraints of the original non-convex MOP used to find a classifier in MMSVM. Secondly, we restrict the normal vectors of classifier candidates to the space spanned by centroids obtained from a preliminary k-means clustering. Thirdly, we solve the derived MOP by applying the reference point method. Through numerical experiments on benchmark problems, we evaluate the performance of classifiers obtained by the proposed method and its computational complexity.

2 Multiclass Classification

Multiclass classification means discriminating data into more than two classes. We assume that a dataset \((x^i ,y_i ),\ i=1,\ldots ,l\) is generated from the same distribution \(P(x, y)\), where \(x^i \in \mathcal{R}^n\) denotes an n-dimensional input, and \(y_i \in M := \{1, \ldots , m\}\) denotes the label into which the corresponding \(x^i\) should be classified. The aim is to find a classifier f(x) which satisfies \( y_i =f(x^i ),\ i=1,\ldots ,l\) and which can correctly classify a new unknown input x from the same distribution. In this paper, we assume that there exists an appropriate feature space F and a corresponding function \(\phi : \mathcal{R}^n \rightarrow F\). Thus, we mainly discuss linear classification on F using the kernel method.

In representative SVMs for multiclass classification, such as the one-against-all (OAA) [2] and all-together (AT) [11, 12] methods, the following discriminant function is often used:

$$ f(x) = \mathop {\mathrm {argmax}}\limits _{p \in M} \; {w^p}^{\top }\phi (x) + b^p $$

where \(w^p\) and \(b^p\), \(p \in M\), denote a weight vector and a bias value, respectively. Thus, the aim is to find appropriate \((w^p, b^p)\), \(p \in M\).
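As a concrete illustration, the decision rule above can be sketched as follows in the linear case \(\phi(x) = x\); the function and array names are ours, the label convention \(1,\ldots,m\) is assumed, and the toy data is only for demonstration.

```python
import numpy as np

def predict(W, b, X):
    """f(x) = argmax_p w^p . x + b^p, returned as labels 1,...,m.
    W has one row per class (m, n); b has shape (m,); X is (n_samples, n)."""
    scores = X @ W.T + b                # entry (i, p) is w^p . x^i + b^p
    return np.argmax(scores, axis=1) + 1

# Toy usage with random parameters for m = 3 classes in R^2.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)
print(predict(W, b, rng.normal(size=(5, 2))))
```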

2.1 SVM Maximizing Functional Margins

As the simplest AT method, an SVM maximizing the sum of functional margins was proposed in [11, 12]; it can be straightforwardly derived from the binary SVM.

$$\begin{aligned} \begin{array}{rl} \text{(AT)}~~~~ \displaystyle \min &amp; \displaystyle \sum _{p \in M} \sum _{q \in M} \Vert w^p - w^q\Vert ^2 \\ \mathrm{s.t.} &amp; (w^p - w^q)^\top \phi (x^i) + (b^p - b^q) \ge 1, \quad i \in I_p,\; q > p,\; p,q \in M, \\ &amp; (w^q - w^p)^\top \phi (x^i) + (b^q - b^p) \ge 1, \quad i \in I_q,\; q > p,\; p,q \in M, \end{array} \end{aligned}$$

where \(I_p := \{ i \in \{1, \ldots ,l\} \mid y_i = p \}\), \(p \in M\). Note that maximizing the functional margin in the binary SVM guarantees exact maximization of the distance between the data and the separating hyperplane, called the geometric margin, which contributes to its high generalization ability. On the other hand, in the problem (AT) for multiclass classification, maximizing the functional margins \(1/\Vert w^p - w^q \Vert\) does not necessarily guarantee maximization of the geometric margins. Namely, the functional margin for a class pair p, q does not necessarily represent the distance between the corresponding separating hyperplane:

$$ (w^{p} - w^{q})^{\top } \phi (x) + (b^{p} - b^{q}) = 0 $$

and the closest data in classes \(\{ p, q \}\), as pointed out in [7]; this distance is given by

$$ d_{pq}(w,b) = \min _{i \in I_p \cup I_q} \frac{|(w^p - w^q)^\top \phi (x^i) + (b^p - b^q)|}{\Vert w^p - w^q\Vert }, \ q > p,\;\; p,q \in M. $$

Thus, it might be difficult to expect a generalization ability similar to that of the binary SVM. A method that exactly maximizes the geometric margins was proposed in [7]; we introduce it in the next section.
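The geometric margin \(d_{pq}\) above can be computed directly from a given \((w, b)\). The following sketch assumes the linear case \(\phi(x) = x\) and 0-based class indices; the names are ours, not from the paper.

```python
import numpy as np

def geometric_margins(W, b, X, y):
    """d_pq(w, b) = min over data of classes p and q of
    |(w^p - w^q) . x + (b^p - b^q)| / ||w^p - w^q||, for all pairs q > p."""
    m = W.shape[0]
    d = {}
    for p in range(m):
        for q in range(p + 1, m):
            Xpq = X[(y == p) | (y == q)]
            num = np.abs(Xpq @ (W[p] - W[q]) + (b[p] - b[q]))
            d[(p, q)] = num.min() / np.linalg.norm(W[p] - W[q])
    return d
```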

2.2 SVM Maximizing Geometric Margins

In order to exactly maximize the geometric margins, an AT method called MMSVM was proposed, which is formulated as the following multiobjective optimization problem (MOP) [7]:

$$\begin{aligned} \begin{array}{rl} \text{(M)}~~~~ \displaystyle \max _{w,b,\sigma } &amp; {\theta }_{12}(w,\sigma ),\ldots ,{\theta }_{m-1,m}(w,\sigma ), \\ \mathrm{s.t.} &amp; (w^p - w^q)^\top \phi (x^i) + (b^p - b^q) \ge \sigma _{pq}, \quad i \in I_p,\; q > p,\; p,q \in M, \\ &amp; (w^q - w^p)^\top \phi (x^i) + (b^q - b^p) \ge \sigma _{pq}, \quad i \in I_q,\; q > p,\; p,q \in M, \\ &amp; \sigma _{pq} \ge 1, \quad q > p,\; p,q \in M, \end{array} \end{aligned}$$

where we define \( {\theta }_{pq}(w,\sigma ) = {\sigma _{pq}} / {\Vert w^p - w^q\Vert },\;\;q > p,\;\;p,q \in M. \) Note that (M) has more than two objective functions; their number equals the number of all class pairs. In [7], it was shown that at any Pareto optimal solution \((w^*, b^*, \sigma ^*)\) of (M), each objective function value \({\theta }_{pq}(w^*,\sigma ^*)\) is equal to the geometric margin \( d_{pq}(w^*, b^*)\) of the corresponding class pair.

In general, the optimal solutions of an MOP are given as a set, called the Pareto optimal solutions, and, in addition, (M) is not convex. Hence, the problem is more difficult to solve than a single-objective optimization problem (SOP). However, a method of finding a Pareto optimal solution by solving a convex SOP was introduced, and the kernel method can be easily applied to (M).

Now, let us consider the kernel method for (M). The weight vector \(w^p\) of the separating hyperplane is represented as a weighted sum of \(\phi (x^i)\) by introducing new decision variables \(\alpha ^{p}_{i} \in \mathcal{R},\;\;i = 1,\ldots ,l,\;\;p \in M\):

$$\begin{aligned} w^{p} = \sum ^l_{i=1} \alpha ^{p}_i \phi (x^i),\;\;p \in M. \end{aligned}$$
(1)

Then, by defining \(K := (\phi (x^1),\ldots ,\phi (x^l))^\top (\phi (x^1),\ldots ,\phi (x^l))\), \(\alpha ^{p} := (\alpha ^{p}_{1},\, \ldots ,\, \alpha ^{p}_{l})^\top \), \(p \in M\), \( \bar{\theta }_{pq}(\alpha ,\sigma ) := {\sigma _{pq}}/{ \sqrt{(\alpha ^p - \alpha ^q)^{\top } K (\alpha ^p - \alpha ^q)}}\), \(q > p\), \(p,q \in M\), \(\kappa (x^i) := (k(x^1,x^i),\ldots ,k(x^l,x^i))^\top \), \(i = 1,\ldots ,l\), (M) can be rewritten as

$$\begin{aligned} \begin{array}{rl} \text{(M2)}~~~~ \displaystyle \max _{\alpha ,b,\sigma } &amp; \bar{\theta }_{12}(\alpha ,\sigma ),\ldots ,\bar{\theta }_{m-1,m}(\alpha ,\sigma ) \\ \mathrm{s.t.} &amp; (\alpha ^p - \alpha ^q)^\top \kappa (x^i) + (b^p - b^q) \ge \sigma _{pq}, \quad i \in I_p,\; q > p,\; p,q \in M, \\ &amp; (\alpha ^q - \alpha ^p)^\top \kappa (x^i) + (b^q - b^p) \ge \sigma _{pq}, \quad i \in I_q,\; q > p,\; p,q \in M, \\ &amp; \sigma _{pq} \ge 1, \quad q > p,\; p,q \in M. \end{array} \end{aligned}$$

In addition, the discriminant function can be represented by \( f(x) = \mathop {\mathrm {argmax}}\limits _{p \in M} \{ {{\alpha ^p}^*}^\top \kappa (x) + {b^p}^* \}\), where \((\alpha ^*,b^*,\sigma ^*)\) is a Pareto optimal solution of (M2).
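For reference, the kernel quantities K and \(\kappa(x)\) and the discriminant function of (M2) can be sketched as follows with an RBF kernel; here alpha is an \(m \times l\) array whose p-th row is \(\alpha^p\), labels are assumed to be \(1,\ldots,m\), and all names are ours.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma ||a - b||^2) for all pairs of rows of A and B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_predict(alpha, b, X_train, x, gamma=1.0):
    """f(x) = argmax_p alpha^p . kappa(x) + b^p, where kappa(x)_i = k(x^i, x)."""
    kappa = rbf_kernel(X_train, x[None, :], gamma).ravel()   # shape (l,)
    return int(np.argmax(alpha @ kappa + b)) + 1             # label in 1,...,m
```

The full kernel matrix used below is simply `K = rbf_kernel(X_train, X_train, gamma)`.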

To solve (M2), the \(\varepsilon \)-constraint method is used, which is one of the popular scalarization methods for MOPs [4]. In this method, an SOP is derived from (M2): one of the objective functions of (M2) is kept as the objective function of the new SOP, and the other objective functions are changed into constraints by using an appropriate constant vector \(\varepsilon \). In addition, the following transformation was used to solve the SOP [7, 8].

Now, we focus on all positive eigenvalues \({\lambda _1}, \ldots , \lambda _{\tau }\) of K and the corresponding eigenvectors \( t^1, \ldots , t^{\tau }\), where \(\tau >0\) denotes the number of positive eigenvalues. Then, we have

$$\begin{aligned} K = [ t^1, \ldots , t^{\tau }] \text{ diag }\{ {\lambda _1}, \ldots , \lambda _{\tau }\} [ t^1, \ldots , t^{\tau }]^\top =: T \varLambda T^{\top } \end{aligned}$$
(2)

Then, new decision variables \(z^p\) are defined as \( z^{p} := {\varLambda }^{\frac{1}{2}} {T}^\top \alpha ^p\), \(p \in M\), and the following convex SOP is obtained:

$$\begin{aligned} \begin{array}{rl} \mathrm{(\varepsilon M2)} ~~~~ \displaystyle \max _{z,b,\sigma } &amp; \displaystyle \frac{c_{rs}}{ \Vert z^r - z^s \Vert } \\ \mathrm{s.t.} &amp; \displaystyle \frac{\sigma _{pq}}{ \Vert z^p - z^q \Vert } \ge \varepsilon _{pq}, \quad q > p,\;\;(p,q) \ne (r,s),\;\;p,q \in M,\\ &amp; (z^p - z^q)^\top {\varLambda }^{\frac{1}{2}} \bar{t}^i + (b^p - b^q) \ge \sigma _{pq}, \quad i \in I_p, \; q > p, \; p,q \in M, \\ &amp; (z^q - z^p)^\top {\varLambda }^{\frac{1}{2}} \bar{t}^i + (b^q - b^p) \ge \sigma _{pq}, \quad i \in I_q, \; q > p, \; p,q \in M, \\ &amp; \sigma _{pq} \ge 1, \quad q > p,\;\;(p,q) \ne (r,s), \;\; p,q \in M, \\ &amp; \sigma _{rs} = c_{rs}, \end{array} \end{aligned}$$

where \(\bar{t}^i{}^\top \) denotes the i-th row vector of T, the constants \(\varepsilon _{pq}\), \(p, q \in M\), are appropriately selected for the feasibility of (\(\varepsilon \)M2), and the class pair (r, s) is appropriately selected. Note that the constraint \( \sigma _{rs} = c_{rs}\) with a sufficiently large constant \(c_{rs}\) is added so that (\(\varepsilon \)M2) is convex, and a large \(c_{rs}\) guarantees that the optimal solution of (\(\varepsilon \)M2) is Pareto optimal [7]. Moreover, (\(\varepsilon \)M2) is a second-order cone programming problem (SOCP), a convex problem with second-order cone and linear constraints, which can be solved effectively by a primal-dual interior-point method [1]. In addition, numerical experiments showed that the geometric margins of the separating hyperplanes constructed from the optimal solution of (\(\varepsilon \)M2) are larger than those obtained by the functional-margin method (AT), and that classifiers of (M2) obtained by (\(\varepsilon \)M2) have better classification ability than (AT) [7,8,9].
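The decomposition (2) and the change of variables \(z^p = \varLambda^{1/2} T^\top \alpha^p\) can be sketched as follows; the tolerance and the function names are our assumptions.

```python
import numpy as np

def positive_eig_factor(K, tol=1e-10):
    """K = T diag(lambda) T^T restricted to the positive eigenvalues, as in (2)."""
    lam, vec = np.linalg.eigh(K)        # eigendecomposition of the symmetric matrix K
    keep = lam > tol                    # tau = number of (numerically) positive eigenvalues
    return vec[:, keep], lam[keep]      # T with shape (l, tau), eigenvalues of shape (tau,)

def to_z(alpha, T, lam):
    """z^p = Lambda^{1/2} T^T alpha^p for every class; alpha is (m, l)."""
    return (np.sqrt(lam)[:, None] * T.T) @ alpha.T   # shape (tau, m); column p is z^p
```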

Next, let us evaluate the computational resources required to solve (\(\varepsilon \)M2) and (AT). Since the constant vector \(\varepsilon \) in (\(\varepsilon \)M2) is determined from the optimal solution of (AT), (\(\varepsilon \)M2) requires solving (AT) first. In addition, the CPU time for solving the SOCP (\(\varepsilon \)M2) is considerably larger than that for (AT) because of its many decision variables. Moreover, if the number l of all data is large, the diagonalization of the \(l \times l\) matrix K also requires a large amount of computational resources. Therefore, in [5], an approximation method for (M) was proposed, in which a single-objective SOCP is derived by introducing the sum of the objective functions of (M) and linearizing the right-hand sides of the first and second constraints of (M). The SOCP is easily solved due to its convexity, and a feasible solution of (M) can be easily obtained from the optimal solution of the SOCP. Numerical experiments showed that the generalization ability of classifiers obtained by the approximation method is better than that of (AT).

In this paper, we derive an approximate MOP by using the same approximation technique, and, furthermore, we restrict the normal vectors of classifier candidates to a space spanned by centroids obtained from a preliminary clustering in order to reduce its computational complexity.

3 Approximate MMSVM

In this section, we introduce the following problem by defining \( \delta _{pq} := \sigma _{pq}^2\) and putting a constant upper limit \(\rho \ge 1\) on \( \delta _{pq}\), \(q>p,\ p,q\in M\).

Here, \(\eta _{pq}\) is defined by \( \eta _{pq}(w, \delta ) = {\Vert w^p - w^q \Vert ^2}/{ 2 \delta _{pq}}\), \(q>p,\ p,q\in M\). If \(\rho \) is sufficiently large, (S1) can be considered equivalent to (M). Then, in order to approximate (S1), we replace the right-hand sides of the first and second constraint inequalities with an affine approximation determined by the constant \(\rho \), in the same way as [5]. Then, we obtain

Figure 1 shows the relation between \(\sqrt{\delta _{pq}}\) and the approximate affine function, and this relation holds for any \(\delta \in [1, \rho ]\). By making use of this property, for any Pareto optimal solution \((w, b, \delta )\) of (S2), we can obtain a feasible solution \((w, b, \delta ')\) of (S1) as follows:

Fig. 1. Approximate affine function

Since (S2) can be regarded as convex, it is easier to solve than (M2). In addition, we can show the following properties between solutions of (S1) and (S2): the relation of the objective function values at \((w, b, \delta )\) and \((w, b, \delta ')\) is given by

Thus, for a Pareto optimal solution or feasible solution \((\bar{w}, \bar{b}, \bar{\delta })\) of (S1) which dominates \((w, b, \delta ')\) in the sense that \(\eta _{pq}(\bar{w}, \bar{\delta }) \le \eta _{pq}(w, \delta ')\), \(q>p,\;p,q\in M,\) we have that

The approximate method for MMSVM proposed in this section is called AMMSVM.

4 AMMSVM Based on K-Means Clustering

Next, we introduce into AMMSVM a dimension reduction which restricts the weights (1) of the separating hyperplanes. This method is based on the assumption that representing appropriate weights of the separating hyperplanes does not require all data as in (1), and thus the weights can be represented by a weighted sum of a smaller number of points. Namely, instead of using all \(\phi (x^i), i = 1, \ldots , l\), the method selects representative points in F for the weight \(w^p\) of each class \(p \in M\). Since, by this restriction, the feasible region of the proposed problem is smaller than that of the original (M2), the solving time and memory usage can be expected to be greatly reduced. At the same time, the proposed method keeps all the constraints of (M2), which guarantee that all training data are correctly classified. Thus, the method is quite different from reduction methods that delete training data by some preliminary technique [3]. The proposed method applies the k-means clustering [6] with an appropriate number of clusters to the data \( \phi (x^{i}),\ i \in I_{p}\) of each class, and the centroids of the obtained clusters are used as representative points for that class.

4.1 k-Means Clustering

In the proposed method, the k-means clustering is applied individually to the dataset \(\phi (x^{i}),\ i \in I_{p}\) of each class p in order to obtain clusters \( \{ \phi (x^j) \}_{j \in I^k_p}\), \(k=1,\ldots , c_p\), such that \(I_p = \cup _{k=1}^{c_p} I^k_p\), which means minimizing the following function:

$$ E^p=\sum _{k=1}^{c_p}\displaystyle \sum _{i \in I^{k}_p}\Vert \phi (x^{i})- \psi ^{p,k}\Vert ^{2}, $$

where \(c_p\) denotes the number of clusters, which is appropriately selected for class p. In the numerical experiments in Sect. 5, we set \(c_p= \lfloor r |I_p| \rfloor \), where r is a small constant. The centroid of each cluster k is given by \(\psi ^{p,k} = \sum _{i \in I^{k}_p} \phi (x^{i})/{|I_p^k|} \). Here, note that the kernel method can be easily applied to the clustering method.

It is well known that k-means does not necessarily find the global minimum of \(E^p\). Thus, in the numerical experiments, we executed the k-means clustering 20 times and selected the centroids of the run in which the smallest \(E^p\) was obtained.
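A minimal sketch of this per-class clustering, assuming the linear kernel so that the clustering runs directly in the input space (the kernelized version is analogous), with scikit-learn's KMeans handling the 20 restarts; the function name and defaults are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def class_centroids(X, y, r=0.1, restarts=20, seed=0):
    """For each class p, run k-means with c_p = floor(r * |I_p|) clusters and
    keep the centroids of the run with the smallest within-cluster error E^p."""
    centroids = {}
    for p in np.unique(y):
        Xp = X[y == p]
        c_p = max(1, int(r * len(Xp)))
        km = KMeans(n_clusters=c_p, n_init=restarts, random_state=seed).fit(Xp)
        centroids[p] = km.cluster_centers_      # array of shape (c_p, n_features)
    return centroids
```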

4.2 Dimension Reduction Based on k-Means Clustering

The centroids obtained by the k-means clustering for the dataset of class p are denoted by \(\psi ^{p,k}\), \(k = 1,\ldots , c_p\). By introducing new decision variables \(\beta ^{p}_{q,k}\), \(k=1, \ldots , c_q\), \(p, q\in M\), the weight \(w^{p}\) of AMMSVM for class \(p \in M\) is given by

$$\begin{aligned} w^p = \sum _{q \in M} \sum ^{c_q}_{h=1} \beta ^p_{q,h} \psi ^{q,h} = \varPsi \beta ^p \end{aligned}$$
(3)

where the matrix \(\varPsi \) and the decision vector \(\beta ^p\) are defined as

$$\begin{aligned} \varPsi&amp;:= (\psi ^{1,1},\psi ^{1,2},\dots ,\psi ^{1,c_1},\psi ^{2,1},\dots ,\psi ^{m,c_m}),\\ \beta ^p&amp;:= (\beta ^p_{1,1},\beta ^p_{1,2},\dots , \beta ^p_{1,c_1},\beta ^p_{2,1},\dots ,\beta ^p_{m,c_m})^\top \in \mathcal{R}^{c_{all}}, \end{aligned}$$

and \(c_{all}\) is defined as \(\sum _{p\in M}c_p\). Then, the discriminant function is represented by

$$ f(x) = \mathop {\mathrm {argmax}}\limits _p \left\{ (\varPsi \beta ^p)^\top \phi (x) + b^p \right\} , $$

and the decision variables \((\beta ^p,b^p)\), \(p\in M\), are determined by solving the following MOP:

The geometric margins between all class pairs are maximized by solving (KMS) under the restriction (3). Moreover, since the centroid of each cluster is represented by a weighted sum of \(\phi (x^i)\), the kernel method can be applied to (KMS).
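Since each centroid is a cluster mean in the feature space, the Gram matrix \(\varPsi^\top \varPsi\) appearing in the next paragraph and the vector of cluster-mean kernel values \(\bar{\kappa}(x)\) defined below can be computed from kernel values alone. A sketch under our own naming, where `clusters` is the flat list of index sets \(I^k_p\) ordered as in the definition of \(\varPsi\):

```python
import numpy as np

def centroid_gram(K, clusters):
    """(Psi^T Psi)_{ab} = psi_a . psi_b = mean of k(x^i, x^j) over i in cluster a
    and j in cluster b, because every centroid is the mean of its cluster in F."""
    c_all = len(clusters)
    G = np.empty((c_all, c_all))
    for a, Ia in enumerate(clusters):
        for b, Ib in enumerate(clusters):
            G[a, b] = K[np.ix_(Ia, Ib)].mean()
    return G

def centroid_kernel_vector(k_x, clusters):
    """bar-kappa(x): for each cluster, the mean of k(x^j, x) over its members;
    k_x is the length-l vector (k(x^1, x), ..., k(x^l, x))."""
    return np.array([k_x[I].mean() for I in clusters])
```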

Now, similarly to (2), we have \( \varPsi ^\top \varPsi = \hat{T} \hat{\varLambda } \hat{T}^\top \), where \(\hat{\varLambda } \in \mathcal{R}^{\tau _c \times \tau _c}\) is a diagonal matrix whose diagonal components are the positive eigenvalues of \(\varPsi ^\top \varPsi \), \(\hat{T} \in \mathcal{R}^{c_{all} \times \tau _c}\) consists of the corresponding eigenvectors, and \(\tau _c\) denotes the number of positive eigenvalues. Then, by introducing decision variables:

$$\begin{aligned} z^p = \hat{\varLambda }^{\frac{1}{2}} \hat{T}^\top \beta ^p, \end{aligned}$$
(4)

and defining \( \bar{k}^p(x) := \left( \sum _{j\in I^1_p}k(x^j,x) /{|I^1_p|} \right. ,\, \dots ,\,\left. \sum _{j\in I^{c_p}_p}k(x^j,x) / {|I^{c_p}_p|} \right) ^{\top }\) and \(\bar{\kappa }(x) := \left( \bar{k}^1(x),\dots , \bar{k}^m(x) \right) ^\top \), we can transform (KMS) into the following MOP:

We can easily show that any Pareto optimal solution of (KMS) is obtained by solving (KMS2) as follows:

Theorem 1

For a Pareto optimal solution \((z^*,b^*,\delta ^*)\) of (KMS2), the solution \((\beta ^*,b^*,\delta ^*)\), in which \(\beta ^*\) is defined by \( \beta ^{p*} = \hat{T} \hat{\varLambda }^{-\frac{1}{2}}z^{p*}, \;\;p\in M\), is Pareto optimal for (KMS). Conversely, for a Pareto optimal solution \((\beta ^*,b^*,\delta ^*)\) of (KMS), the solution \((z^*,b^*,\delta ^*)\), in which \(z^*\) is defined by (4), is Pareto optimal for (KMS2).
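The change of variables in Theorem 1 is a single linear map; a one-function sketch (names ours) of recovering \(\beta^*\) from \(z^*\):

```python
import numpy as np

def recover_beta(z, T_hat, lam_hat):
    """beta^p = T_hat Lambda_hat^{-1/2} z^p for each class p; z is (tau_c, m)
    with column p equal to z^p, lam_hat holds the positive eigenvalues of
    Psi^T Psi, and T_hat (c_all, tau_c) the corresponding eigenvectors."""
    return T_hat @ (z / np.sqrt(lam_hat)[:, None])   # (c_all, m); column p is beta^p
```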

4.3 Solving Based on Reference Point Method

In this subsection, we apply the reference point method to solve (KMS2); this method finds a Pareto optimal solution by minimizing the distance between a given reference point and the Pareto optimal solutions in the objective space. The following SOP can be derived:

Here, \(r^*_{pq}\) is a given reference point which is used as the criterion for the minimization. In (KMS3), the distance between the reference point and the Pareto optimal solutions is measured by the augmented Tchebyshev function, in which \(\omega \) is a weight vector that determines the balance between the objective functions, and \(\mu \) determines the ratio of the second term to the first. The weight vector \(\omega \) and reference point \(r^*\) are selected as one of the following three pairs:

$$\begin{aligned} \begin{array}{lll} \mathrm{(R0)}~&amp;\omega _{pq} = 1, &amp;r^*_{pq} = 0, \\ \mathrm{(R1)}~&amp;\omega _{pq} = \Vert g^p - g^q \Vert ^2,~ &amp;r^*_{pq} = 0,\\ \mathrm{(R2)}~&amp;\omega _{pq} = \Vert g^p - g^q \Vert ^2,~ &amp;r^*_{pq} = {1}/{\Vert g^p - g^q \Vert ^2}, \end{array} \end{aligned}$$

where \(g^p\) denotes the center of gravity of the dataset of class p, namely, \(g^p = (1/|I_p|)\sum _{i\in I_p} \phi (x^i)\). This selection is based on the idea that an appropriate balance of margins can be roughly estimated from the balance of the distances between the centers of gravity. Then, the discriminant function is represented by

$$\begin{aligned} f(x) = \mathop {\mathrm {argmax}}\limits _p \left\{ z^{*p\top } \hat{\varLambda }^{-\frac{1}{2}} \hat{T}^\top \bar{\kappa }(x) + b^{*p} \right\} , \end{aligned}$$
(5)

where \((z^*,b^*,\delta ^*)\) denotes the optimal solution of (KMS3). Although (KMS3) is not smooth, an equivalent smooth SOCP can be easily derived, which means that a Pareto optimal solution of (KMS3) can be easily obtained by solving the SOCP. Moreover, in the numerical experiments, we solve the dual problem of the smooth SOCP because its computational time can be expected to be less than that of the original problem.
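The weights \(\omega_{pq}\) and reference points \(r^*_{pq}\) of (R1) and (R2) only need the squared distances \(\Vert g^p - g^q \Vert^2\) between the class centers of gravity in F, which can again be computed from the kernel matrix, since \(g^{p\top} g^q\) is the mean of \(k(x^i, x^j)\) over \(i \in I_p\), \(j \in I_q\). A sketch with our own naming and 0-based classes:

```python
import numpy as np

def pair_weights(K, index_sets):
    """omega_pq = ||g^p - g^q||^2 for all class pairs q > p; index_sets[p]
    holds the training indices I_p of class p."""
    m = len(index_sets)
    omega = {}
    for p in range(m):
        for q in range(p + 1, m):
            gpp = K[np.ix_(index_sets[p], index_sets[p])].mean()
            gqq = K[np.ix_(index_sets[q], index_sets[q])].mean()
            gpq = K[np.ix_(index_sets[p], index_sets[q])].mean()
            omega[(p, q)] = gpp + gqq - 2.0 * gpq
    return omega

# For (R2), the reference point would then be r*_pq = 1 / omega[(p, q)].
```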

4.4 Comparison of Computational Complexities

Now, let us compare the computational complexities of (\(\varepsilon \)M2) and (KMS3). The constant vector \(\varepsilon \) used in (\(\varepsilon \)M2) is determined by the solution of (AT), which is a large-scale quadratic optimization problem using all class datasets at once. On the other hand, the calculation of centroids for (KMS3) requires a considerably smaller amount of computational resources, even if it is executed 20 times, because the k-means clustering is applied individually to each class dataset.

Next, let us compare the diagonalization of the kernel matrices used in (\(\varepsilon \)M2) and in the dual problem of (KMS3), and the numbers of their decision variables and constraints. The sizes of the matrices diagonalized for (\(\varepsilon \)M2) and (KMS3) are \(l \times l\) and \( c_{all} \times c_{all}\), respectively. The numbers of decision variables of (\(\varepsilon \)M2) and the dual problem of (KMS3) are \( m(m+2\tau +1)/2 -1\) and \( m(m + \tau _c)\), respectively, and the numbers of their constraints are \( (m-1)(l + m/2) -1 \) and \((m-1)(l +3m/2)\), respectively. Since, in general, we have \( m \le \tau _c \le c_{all} \ll \tau \le l\), a greater reduction can be expected if \(c_{all}\) is small.
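As a quick illustration of the counts above, the following sketch evaluates them for example values of m, l, \(\tau\), and \(\tau_c\); the concrete numbers are ours and only meant to show the scale of the reduction.

```python
def problem_sizes(m, l, tau, tau_c):
    """Decision-variable and constraint counts quoted in the text."""
    vars_eps_m2 = m * (m + 2 * tau + 1) / 2 - 1
    vars_kms3   = m * (m + tau_c)
    cons_eps_m2 = (m - 1) * (l + m / 2) - 1
    cons_kms3   = (m - 1) * (l + 3 * m / 2)
    return vars_eps_m2, vars_kms3, cons_eps_m2, cons_kms3

# e.g. m = 3 classes, l = 500 data, tau close to l, and a much smaller tau_c
print(problem_sizes(m=3, l=500, tau=500, tau_c=50))
```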

Table 1. Mean correct classification rate of four methods for benchmark problems

5 Numerical Experiments

We applied the existing methods AT and MMSVM, as well as variations of the proposed methods, to seven benchmark problems [10], and compared the mean correct classification rate and the mean CPU time using 10-fold cross-validation, in which the hyperparameters were appropriately selected. To solve the optimization problems, we used the software package MOSEK. As variations of the proposed methods, we used AMMSVM, which does not use the dimension reduction based on the k-means clustering and is solved by the reference point method, and KMSs in which r was varied in \( \{ 0.05,\, 0.1,\, 0.15,\, 0.2 \}\); these are denoted by AMM, KMS5, KMS10, KMS15 and KMS20, respectively. The three kinds of reference points and weights, R0, R1 and R2, were used for AMMSVM and each KMS. We used the RBF kernel, namely, \(k(x,y) = \exp \left( - \gamma \Vert x - y \Vert ^2\right) \).

The results are shown in Tables 1 and 2. In Table 1, the numbers in parentheses denote the best hyperparameters of each method: \((\gamma )\) for AT and MMSVM, and \((\gamma , \rho , \mu )\) for AMMSVM and the KMSs. The italic and bold numbers denote the first and second best classification rates for each problem.

Table 1 shows that MMSVM obtained a high classification rate for many problems, while the classification rates of AMMSVM and the KMSs with a large r are equal to or slightly smaller than those of MMSVM. The classification rates of the KMSs mostly increase as r increases. In particular, although the rates for Ecoli are considerably low when r is small, the highest rate was obtained by the KMS with \(r=0.15\). In addition, AMMSVM and the KMSs achieved a higher rate than MMSVM for some problems. The superiority of AMMSVM and the KMSs is considered to be caused by the diversity of the obtained solutions: the reference point method used in AMMSVM and the KMSs finds a solution under various balances of the objective functions, whereas in MMSVM a single margin is maximized by the \(\varepsilon \)-constraint method using the optimal solution of (AT). From Table 2, we can see that AMMSVM required less CPU time than AT or MMSVM without a dimension reduction, and the KMSs reduced it even more greatly for large-scale problems, even when \(r = 0.15\) or 0.20. Comparing the performance of the KMSs using the three kinds of reference points and weights, namely R0, R1 and R2, each obtained a high rate for different problems, though no significant difference was observed.

Table 2. Mean CPU time (sec) of four methods for benchmark problems

6 Conclusion

In this paper, we have proposed an approximation method for MMSVM which approximates its non-convex MOP by linearizing constraints, and a reduction method which restricts the normal vectors of the separating hyperplanes by using centroids obtained from the k-means clustering. Through numerical experiments, we have observed that the proposed methods effectively reduce the computational resources without greatly decreasing the classification ability.