1 Introduction

In the support vector machine (SVM) [1, 2], training data are mapped into a high-dimensional feature space, and in that space the separating hyperplane is determined so that the nearest training data of both classes are maximally separated. Here, the distance between a data sample and the separating hyperplane is called the margin.

Motivated by the success of SVMs in real world applications, many SVM-like classifiers have been developed to improve the generalization ability. The underlying idea of these extensions is to incorporate the data distribution (or margin distribution) into the classifiers.

To this end, one approach proposes kernels based on the Mahalanobis distance [3, 4]. Another approach reformulates the SVM so that the margin is measured by the Mahalanobis distance [5, 6].

Yet another approach controls the overall margins instead of the minimum margin. In [7], a large margin distribution machine (LDM) is proposed, in which the average margin is maximized and the margin variance is minimized. Although the generalization ability is better than that of the SVM, the number of hyperparameters is larger than that of the SVM. To cope with this problem, the unconstrained LDM (ULDM) is proposed in [8], which has the same number of hyperparameters as the SVM and a generalization ability comparable to that of the LDM and the SVM.

The generalization ability of the SVM can be analyzed by the VC (Vapnik-Chervonenkis) dimension [1] and the maximum generalization ability is achieved by minimizing the radius-margin ratio, where the radius is the minimum radius of the hypersphere that encloses all the training data in the feature space.

If the center of the hypersphere is assumed to be at the origin, the radius of the hypersphere can be minimized for a given feature space, as discussed in [9]. The minimal complexity machine (MCM) is derived based on this assumption. In the MCM, the VC dimension is minimized by minimizing the upper bound of the soft-margin constraints for the decision function. Because the regularization term is not included, the MCM is trained by linear programming. The generalization performance of the MCM is shown to be superior to that of the SVM, but according to our analysis [10], the solution is non-unique and the generalization ability is not better than that of the SVM. The problem of non-uniqueness is shown to be solved by adding a regularization term to the objective function of the MCM; the resulting classifier, which is a fusion of the MCM and the linear programming SVM (LP SVM), is called the MLP SVM.

In this paper we propose fusing the MCM with the standard SVM, i.e., the L1 SVM, to improve the generalization ability of the L1 SVM. We call the fused architecture the minimal complexity L1 SVM (ML1 SVM). The ML1 SVM is obtained by adding, to the L1 SVM, the upper-bound constraints on the decision function and the upper-bound minimization term in the objective function. We derive the dual form of the ML1 SVM with one set of variables associated with the soft-margin constraints and the other set associated with the upper-bound constraints. We then decompose the dual ML1 SVM into two subproblems: one for the soft-margin constraints, which is similar to the dual L1 SVM, and the other for the upper-bound constraints. These subproblems include neither the bias term nor the upper bound. Thus, for a convergence check, we derive the exact KKT (Karush-Kuhn-Tucker) conditions that do not include the bias term and the upper bound. The second subproblem differs from the first in that it includes an inequality constraint on the sum of dual variables. To remove this, we change the inequality constraint into two equality constraints and call this architecture the ML1\(_\mathrm{v}\) SVM.

In Sect. 2, we summarize the architectures of L1 SVM and the MCM. In Sect. 3, we discuss the architectures of the ML1 SVM and ML1\(_\mathrm{v}\) SVM. In Sect. 4, we compare the generalization ability of the proposed classifiers with other SVM-like classifiers using two-class and multiclass problems.

2 L1 Support Vector Machines and Minimal Complexity Machines

 In this section, we briefly explain the architectures of the L1 SVM and the MCM [9]. Then we discuss the problem of non-unique solutions of the MCM and one approach to solving the problem [10].

2.1 L1 Support Vector Machines

Let the M training data and their labels be \(\{\mathbf{x}_i, y_i\} (i=1,\ldots ,M)\), where \(\mathbf{x}_i\) is an n-dimensional input vector and \(y_i=1\) for Class 1 and \(-1\) for Class 2. The input space is mapped into the l-dimensional feature space by the mapping function \({\varvec{\phi }}(\mathbf{x})\) and in the feature space the following separating hyperplane is constructed:

$$\begin{aligned} \mathbf{w}^\top \varvec{\phi }(\mathbf{x})+b = 0, \end{aligned}$$
(1)

where \(\mathbf{w}\) is the l-dimensional constant vector and b is the bias term.

The primal form of the L1 SVM is given by

$$\begin{aligned}&\text{ min }&Q(\mathbf{w}, b, {\varvec{\xi }}) = \frac{1}{2}\,\Vert \mathbf{w}\Vert ^2 +C \sum _{i \, = \, 1}^M \xi _i \end{aligned}$$
(2)
$$\begin{aligned}&\text{ s.t. }&y_i\, (\mathbf{w}^{\top }\,{\varvec{\phi }}(\mathbf{x}_i)+b)+ \xi _i\ge 1, \,\, \xi _i \ge 0, \, i=1,\ldots ,M, \end{aligned}$$
(3)

where \({\varvec{\xi }} =(\xi _1,\ldots ,\xi _M)^{\top }\), \(\xi _i\) is the slack variable for \(\mathbf{x}_i\), and \(C \,(> 0)\) is the margin parameter that determines the trade-off between the maximization of the margin and minimization of the classification error. Inequalities (3) are called soft-margin constraints.
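As a concrete illustration (not part of the original formulation), the soft-margin problem (2)-(3) is what standard SVM solvers implement; the following minimal sketch uses scikit-learn's SVC on toy data, where C is the margin parameter of (2). Note that scikit-learn's RBF kernel does not divide the exponent by the number of inputs, unlike the kernel used later in Sect. 4.

```python
# Minimal sketch: scikit-learn's SVC solves the soft-margin (L1) SVM (2)-(3).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                 # toy training inputs
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy labels in {+1, -1}

# kernel='rbf' with gamma uses K(x, x') = exp(-gamma ||x - x'||^2)
clf = SVC(C=10.0, kernel="rbf", gamma=0.5)
clf.fit(X, y)
print(clf.predict(X[:5]))
```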

2.2 Minimal Complexity Machines

The VC dimension is a measure for estimating the generalization ability of a classifier, and lowering the VC dimension leads to a higher generalization ability. For an SVM-like classifier with the minimum margin \(\delta _{\min }\), the VC dimension D is bounded by [1]

$$\begin{aligned} D \le 1+\min \left( {R^2}/{\delta _{\min }^2}, l \right) , \end{aligned}$$
(4)

where R is the radius of the smallest hypersphere that encloses all the training data.

In training the L1 SVM, neither R nor l is changed. In the LS SVM, where \(\xi _i\) in (2) are replaced with \(\xi _i^2\) and the inequality constraints in (3) are replaced with equality constraints, R and l are likewise unchanged by training, but the second term in the objective function works to minimize the square sums of \(y_i\,f(\mathbf{x}_i)-1\). Therefore, like the LDM and ULDM, this term works to condense the margin distribution in the direction orthogonal to the separating hyperplane.

The MCM, which minimizes the VC dimension, i.e., \({R}/\delta _{\min }\) in (4), is formulated as

$$\begin{aligned}&\text{ min }&Q(\varvec{\alpha },h,\varvec{\xi },b)=h+C\sum _{i=1}^M \xi _i \end{aligned}$$
(5)
$$\begin{aligned}&{\text{ s.t. }}&h \ge y_i \left( \sum _{j=1}^M \, \alpha _j\,K_{ij}+b \right) +\xi _i\ge 1,\,i=1,\ldots ,M, \end{aligned}$$
(6)

where h is the upper bound of the soft-margin constraints and \(K_{ij}=K(\mathbf{x}_i,\mathbf{x}_j)={\varvec{\phi }}^\top \,(\mathbf{x}_i) {\varvec{\phi }} (\mathbf{x}_j)\). Here, the mapping function \({\varvec{\phi }}(\mathbf{x})\) in (1) is [11]

$$\begin{aligned} {\varvec{\phi }}(\mathbf{x}) =(K(\mathbf{x},\mathbf{x}_1),\ldots ,K(\mathbf{x},\mathbf{x}_M))^\top , \end{aligned}$$
(7)

and \(\mathbf{w}= {\varvec{\alpha }}\). The MCM can be solved by linear programming.

Because the upper bound h in (6) is minimized in (5), the separating hyperplane is determined so that the maximum distance between the training data and the separating hyperplane is minimized.

The MCM, however, does not explicitly include the term related to the margin maximization. This makes the solution non-unique and unbounded.

To make the solution unique, in [10] the MCM and the LP SVM are fused and the resulting classifier is called minimal complexity LP SVM (MLP SVM):

$$\begin{aligned}&\text{ min }&Q(\varvec{\alpha },h,\varvec{\xi },b)=C_h \, h +\sum _{i=1}^M (C_{\alpha } \,|\alpha _i| +C\, \xi _i) \end{aligned}$$
(8)
$$\begin{aligned}&{\text{ s.t. }}&h \ge y_i \left( \sum _{j=1}^M \alpha _j\,K_{ij}+b\right) +\xi _i\ge 1,\,i=1,\ldots ,M, \end{aligned}$$
(9)

where \(C_h\) is a positive parameter and \(C_{\alpha }=1\). Deleting \(C_h \, h\) in (8) and the upper bound h in (9), we obtain the LP SVM; setting \(C_h=1\) and \(C_{\alpha }=0\) in (8), we obtain the MCM.
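For illustration, the linear program (8)-(9) can be set up directly with scipy.optimize.linprog; the following sketch (not the authors' implementation) handles \(|\alpha _i|\) by the standard splitting \(\alpha _i = \alpha _i^+ - \alpha _i^-\) with nonnegative parts. As stated above, setting \(C_{\alpha }=0\) and \(C_h=1\) recovers the MCM (5)-(6), whose solution may then be non-unique or unbounded.

```python
# Sketch of the MLP SVM linear program (8)-(9) via scipy.optimize.linprog.
# K: M x M kernel matrix, y: labels in {+1, -1}.
import numpy as np
from scipy.optimize import linprog

def mlp_svm_lp(K, y, C=10.0, C_h=0.1, C_alpha=1.0):
    M = len(y)
    # variable order: [alpha_plus (M), alpha_minus (M), b, h, xi (M)]
    c = np.concatenate([C_alpha * np.ones(2 * M), [0.0, C_h], C * np.ones(M)])

    YK = y[:, None] * K                     # rows y_i * K_i
    # right part of (9):  y_i f(x_i) + xi_i >= 1
    A1 = np.hstack([-YK, YK, -y[:, None], np.zeros((M, 1)), -np.eye(M)])
    b1 = -np.ones(M)
    # left part of (9):   y_i f(x_i) + xi_i <= h
    A2 = np.hstack([YK, -YK, y[:, None], -np.ones((M, 1)), np.eye(M)])
    b2 = np.zeros(M)

    bounds = ([(0, None)] * (2 * M)          # alpha_plus, alpha_minus >= 0
              + [(None, None), (None, None)] # b, h free (h >= 1 is implied)
              + [(0, None)] * M)             # xi >= 0
    res = linprog(c, A_ub=np.vstack([A1, A2]), b_ub=np.concatenate([b1, b2]),
                  bounds=bounds, method="highs")
    ap, an = res.x[:M], res.x[M:2 * M]
    return ap - an, res.x[2 * M], res.x[2 * M + 1]   # alpha, b, h
```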

3 Minimal Complexity L1 Support Vector Machines

In this section, we discuss the architecture and optimality conditions of the proposed classifiers.

3.1 Architecture

Similar to the MLP SVM, here we propose fusing the MCM given by (5) and (6) and the L1 SVM given by (2) and (3):

$$\begin{aligned}&\text{ min }&Q(\mathbf{w}, b, h,{\varvec{\xi }}) = {C_h}\,h+\frac{1}{2}\,\Vert \mathbf{w}\Vert ^2 +C \sum _{i \, = \, 1}^M \xi _i \end{aligned}$$
(10)
$$\begin{aligned}&\text{ s.t. }&y_i\, (\mathbf{w}^{\top }\,{\varvec{\phi }}(\mathbf{x}_i)+b)+\xi _i\ge 1, \,\, \xi _i \ge 0,\end{aligned}$$
(11)
$$\begin{aligned}&h \ge y_i\, (\mathbf{w}^{\top }\,{\varvec{\phi }}(\mathbf{x}_i)+b), \, i=1,\ldots ,M,\end{aligned}$$
(12)
$$\begin{aligned}&h \ge 1, \end{aligned}$$
(13)

where \({\varvec{\xi }} =(\xi _1,\ldots ,\xi _M)^{\top }\), \(C_h\) is a positive parameter that controls the volume that the training data occupy, and h is the upper bound of the constraints. The upper bound defined by (6) is redefined by (12) and (13), which exclude \(\xi _i\). This makes the KKT conditions for the upper bound simpler. We call (12) the upper-bound constraints and the above classifier the minimal complexity L1 SVM (ML1 SVM).
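To make the formulation concrete, the following sketch solves the primal (10)-(13) with cvxpy, assuming a linear feature map \({\varvec{\phi }}(\mathbf{x})=\mathbf{x}\) for brevity; it is only an illustration, not the training method used in this paper.

```python
# Sketch of the ML1 SVM primal (10)-(13), linear feature map assumed.
import numpy as np
import cvxpy as cp

def ml1_svm_primal(X, y, C=10.0, C_h=1.0):
    M, n = X.shape
    w, b, h = cp.Variable(n), cp.Variable(), cp.Variable()
    xi = cp.Variable(M)
    margins = cp.multiply(y, X @ w + b)            # y_i (w^T x_i + b)
    constraints = [margins + xi >= 1, xi >= 0,     # (11)
                   margins <= h,                   # (12)
                   h >= 1]                         # (13)
    obj = cp.Minimize(C_h * h + 0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    cp.Problem(obj, constraints).solve()
    return w.value, b.value, h.value
```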

In the following, we derive the dual problem of the ML1 SVM. Introducing the nonnegative Lagrange multipliers \(\alpha _i\), \(\beta _i\), and \(\eta \), we obtain

$$\begin{aligned}&Q(\mathbf{w},b,h,{\varvec{\xi }}, \varvec{\alpha }, \varvec{\beta },\eta )={C_h}\,h+\frac{1}{2}\,\Vert \mathbf{w}\Vert ^2 - \sum _{i \, = \, 1}^M \alpha _i \, \left( y_i\, (\mathbf{w}^{\top } \,{\varvec{\phi }}(\mathbf{x}_i)+b)-1+\xi _i \right) \nonumber \\&+\,C \sum _{i \, = \, 1}^M \xi _ i -\sum _{i \, = \, 1}^M \beta _i \,\xi _i - \sum _{i \, = \, 1}^{M} \alpha _{M+i} \, \left( h-y_i\, (\mathbf{w}^{\top } \,{\varvec{\phi }}(\mathbf{x}_i)+b) \right) - (h-1)\, \eta , \end{aligned}$$
(14)

where \({\varvec{\alpha }} = (\alpha _1,\ldots ,\alpha _{2M})^\top \), \({\varvec{\beta }} = (\beta _1,\ldots ,\beta _M)^{\top }\).

For the optimal solution, the following KKT conditions are satisfied:

$$\begin{aligned}&\frac{\partial Q(\mathbf{w},b,h,{\varvec{\xi }}, \varvec{\alpha }, \varvec{\beta },\eta )}{\partial \mathbf{w}}=\mathbf{0}, \quad \frac{\partial Q(\mathbf{w},b,h,{\varvec{\xi }}, \varvec{\alpha }, \varvec{\beta },\eta )}{\partial {h}}={ 0},\end{aligned}$$
(15)
$$\begin{aligned}&\frac{\partial Q(\mathbf{w},b,h,{\varvec{\xi }}, \varvec{\alpha }, \varvec{\beta },\eta )}{\partial b}=0, \quad \frac{\partial Q(\mathbf{w},b,h,{\varvec{\xi }}, \varvec{\alpha }, \varvec{\beta },\eta )}{\partial {\varvec{\xi }}}=\mathbf{0},\end{aligned}$$
(16)
$$\begin{aligned}&\alpha _i \, (y_i \, (\mathbf{w}^{\top }\, {\varvec{\phi }}(\mathbf{x}_i) + b)-1+\xi _i) =0, \,\, \alpha _i\ge 0, \end{aligned}$$
(17)
$$\begin{aligned}&\alpha _{M+i} \, \left( h-y_i\, (\mathbf{w}^{\top } \,{\varvec{\phi }}(\mathbf{x}_i)+b) \right) =0,\,\, \alpha _{M+i}\ge 0,\end{aligned}$$
(18)
$$\begin{aligned}&\beta _i \, \xi _i =0, \quad \beta _i\ge 0, \quad \xi _i \ge 0, \, i=1,\ldots ,M, \end{aligned}$$
(19)
$$\begin{aligned}&(h-1)\,\eta = 0, \,\, h\ge 1,\,\, \eta \ge 0, \end{aligned}$$
(20)

where \(\mathbf{0}\) is the zero vector. Equations (17) to (20) are called the KKT complementarity conditions.

Using (14), we reduce (15) and (16), respectively, to

$$\begin{aligned}&\mathbf{w} = \sum _{i \, = \, 1}^{M}(\alpha _i -\alpha _{M+i})\,y_i\,{\varvec{\phi }}(\mathbf{x}_i), \quad \sum _{i \, = \, 1}^{M}\alpha _{M+i} =C_h-\eta ,\end{aligned}$$
(21)
$$\begin{aligned}&\sum _{i \, = \, 1}^{M}(\alpha _i-\alpha _{M+i}) \,y_i =0, \quad \alpha _i +\beta _i=C,\quad \ i=1,\ldots ,M. \end{aligned}$$
(22)

Substituting (21) and (22) into (14), we obtain the following dual problem:

$$\begin{aligned}&\text{ max }\,\, Q({\varvec{\alpha }} )= \sum _{i \, = \, 1}^{M} (\alpha _i -\alpha _{M+i})-\frac{1}{2}\sum _{i,j=1}^{M}(\alpha _i -\alpha _{M+i})\nonumber \\&\quad \times (\alpha _j -\alpha _{M+j})\,y_i \,y_j \,K(\mathbf{x}_i, \mathbf{x}_j) \end{aligned}$$
(23)
$$\begin{aligned}&\text{ s.t. }\,\, \sum _{i \, = \, 1}^{M}y_i\,(\alpha _i-\alpha _{M+i}) =0, \end{aligned}$$
(24)
$$\begin{aligned}&\quad C_h \ge \sum _{i \, = \, 1}^{M}\alpha _{M+i},\end{aligned}$$
(25)
$$\begin{aligned}&\quad C \, \ge \alpha _i \ge 0, \,\, C_h \ge \alpha _{M+i} \ge 0, \, i = 1,\ldots ,M. \end{aligned}$$
(26)

If we delete variables \(\alpha _{M+i}\) and \(C_h\) from the above optimization problem, we obtain the dual problem of the original L1 SVM.

For the solution of (23) to (26), the training data associated with positive \(\alpha _i\) or \(\alpha _{M+i}\) are support vectors.
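For small problems, the dual (23) to (26) can be solved directly as a quadratic program, which makes its structure explicit. The following cvxpy sketch mirrors the objective and constraints above under the assumption that the kernel matrix K is (numerically) positive semidefinite; it is an illustration, not the authors' solver.

```python
# Sketch of the ML1 SVM dual (23)-(26) as a QP; K must be PSD.
import cvxpy as cp

def ml1_svm_dual(K, y, C=10.0, C_h=1.0):
    M = len(y)
    a = cp.Variable(M)       # alpha_1, ..., alpha_M
    aM = cp.Variable(M)      # alpha_{M+1}, ..., alpha_{2M}
    d = cp.multiply(y, a - aM)
    obj = cp.Maximize(cp.sum(a - aM) - 0.5 * cp.quad_form(d, K))   # (23)
    constraints = [y @ (a - aM) == 0,                              # (24)
                   cp.sum(aM) <= C_h,                              # (25)
                   a >= 0, a <= C, aM >= 0, aM <= C_h]             # (26)
    cp.Problem(obj, constraints).solve()
    return a.value, aM.value
```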

We consider decomposing the above problem into two subproblems: 1) optimizing \(\alpha _i \, (i=1,\ldots ,M)\) and 2) optimizing \(\alpha _{M+i} \, (i=1,\ldots ,M)\). To make this possible, we eliminate the interference between \(\alpha _i\) and \(\alpha _{M+i}\) in (24) by

$$\begin{aligned} \sum _{i \, = \, 1}^{M}y_i\,\alpha _i =0,\quad \sum _{i \, = \, 1}^{M}y_i\,\alpha _{M+i} =0. \end{aligned}$$
(27)

Then the optimization problem given by (23) to (26) is decomposed into the following two subproblems:

Subproblem 1: Optimization of \(\alpha _i\)

$$\begin{aligned}&\text{ max } \,\,Q(\varvec{\alpha }^0 )= \sum _{i \, = \, 1}^{M} (\alpha _i -\alpha _{M+i}) -\frac{1}{2}\sum _{i,j=1}^{M}(\alpha _i -\alpha _{M+i})\nonumber \\&\quad \times (\alpha _j -\alpha _{M+j})\,y_i \,y_j \,K(\mathbf{x}_i, \mathbf{x}_j) \end{aligned}$$
(28)
$$\begin{aligned}&\text{ s.t. } \sum _{i \, = \, 1}^{M}y_i\,\alpha _i =0, \end{aligned}$$
(29)
$$\begin{aligned}&\quad C \, \ge \alpha _i \ge 0 \quad \text{ for } \quad i = 1,\ldots ,M, \end{aligned}$$
(30)

where \(\varvec{\alpha }^0=(\alpha _1,\ldots ,\alpha _M)^\top \).

Subproblem 2: Optimization of \(\alpha _{M+i}\)

$$\begin{aligned}&\text{ max }\,\, Q({\varvec{\alpha }}^M )= \sum _{i \, = \, 1}^{M} (\alpha _i -\alpha _{M+i})-\frac{1}{2}\sum _{i,j=1}^{M}(\alpha _i -\alpha _{M+i})\nonumber \\&\quad \times (\alpha _j -\alpha _{M+j})\,y_i \,y_j \,K(\mathbf{x}_i, \mathbf{x}_j) \end{aligned}$$
(31)
$$\begin{aligned}&\text{ s.t. } \sum _{i \, = \, 1}^{M}y_i\,\alpha _{M+i}=0, \end{aligned}$$
(32)
$$\begin{aligned}&\quad C_h \ge \sum _{i \, = \, 1}^{M}\alpha _{M+i},\end{aligned}$$
(33)
$$\begin{aligned}&\quad C_h > \alpha _{M+i} \ge 0, \, i = 1,\ldots ,M, \end{aligned}$$
(34)

where \(\varvec{\alpha }^M=(\alpha _{M+1},\ldots ,\alpha _{2M})^\top .\) Here we must notice that \(\alpha _{M+i} \ne C_h\). If \(\alpha _{M+i} = C_h\) for some i, then from (32), at least

$$\begin{aligned} \sum _{j \, = \, 1,\ldots ,M,\, y_j \ne y_i}\alpha _{M+j} \ge C_h \end{aligned}$$
(35)

is satisfied, and thus \(\sum _{i \, = \, 1}^{M}\alpha _{M+i} \ge 2\,C_h\). This contradicts (33).

We solve Subproblems 1 and 2 alternately until the solution converges. Subproblem 1 is very similar to the dual problem of the L1 SVM and can be solved by SMO (sequential minimal optimization) combined with Newton's method [12]. Subproblem 2, which includes the constraint (33), can also be solved by a slight modification of SMO combined with Newton's method.
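The sketch below illustrates the alternating procedure. For readability, each subproblem is solved here with a generic QP solver (cvxpy) instead of the SMO-based procedure used in the paper; K is assumed to be a positive semidefinite kernel matrix and y the label vector. In practice, the loop should stop when the exact KKT conditions derived in Sect. 3.2 are satisfied within a tolerance.

```python
# Schematic sketch of alternating optimization of Subproblems 1 and 2.
import numpy as np
import cvxpy as cp

def solve_sub1(K, y, aM, C):                     # Subproblem 1, (28)-(30)
    a = cp.Variable(len(y))
    d = cp.multiply(y, a - aM)                   # aM is fixed here
    obj = cp.Maximize(cp.sum(a - aM) - 0.5 * cp.quad_form(d, K))
    cp.Problem(obj, [y @ a == 0, a >= 0, a <= C]).solve()
    return a.value

def solve_sub2(K, y, a, C_h):                    # Subproblem 2, (31)-(34)
    aM = cp.Variable(len(y))
    d = cp.multiply(y, a - aM)                   # a is fixed here
    obj = cp.Maximize(cp.sum(a - aM) - 0.5 * cp.quad_form(d, K))
    cp.Problem(obj, [y @ aM == 0,                # (32)
                     cp.sum(aM) <= C_h,          # (33)
                     aM >= 0, aM <= C_h]).solve()  # (34); C_h not attained
    return aM.value

def train_alternating(K, y, C=10.0, C_h=1.0, n_iter=10):
    a, aM = np.zeros(len(y)), np.zeros(len(y))
    for _ in range(n_iter):                      # replace the fixed count by
        a = solve_sub1(K, y, aM, C)              # the exact KKT test of
        aM = solve_sub2(K, y, a, C_h)            # Sect. 3.2 in practice
    return a, aM
```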

3.2 KKT Conditions

To check the convergence of Subproblems 1 and 2, we use the KKT complementarity conditions (17) to (20). However, variables h and b, which are included in the KKT conditions, are excluded from the dual problem. Therefore, as with the L1 SVM [13], to make an accurate convergence test, the exact KKT conditions that do not include h and b need to be derived.

We rewrite (17) as follows:

$$\begin{aligned} \alpha _i \, (y_i \, b -y_i \, F_i +\xi _i)=0, \, i = 1,\ldots , M, \end{aligned}$$
(36)

where

$$\begin{aligned} F_i=y_i-\sum _{j=1}^My_j\, (\alpha _j-\alpha _{M+j})\, K(\mathbf{x}_{i}, \mathbf{x}_j). \end{aligned}$$
(37)

We can classify the conditions of (36) into the following three cases:

  1. \( \alpha _i =0\). Because \(y_i \, b -y_i \, F_i +\xi _i \ge 0\) and \(\xi _i = 0\), \(y_i \, b \ge y_i \, F_i\), namely, \(b \ge F_i\) if \(y_i=1\); \(b \le F_i\) if \(y_i=-1\).

  2. \(C>\alpha _{i}>0.\) Because \(\beta _i >0\), \(\xi _i=0\) is satisfied. Therefore, \(b = F_i.\)

  3. \( \alpha _i=C\). Because \(\beta _i =0\), \(\xi _i \ge 0\) is satisfied. Therefore, \(y_i \, b \le y_i \, F_i\), namely, \(b \le F_i\) if \(y_i=1\); \(b \ge F_i\) if \(y_i=-1.\)

Then the KKT conditions for (36) are simplified as follows:

$$\begin{aligned} \bar{F_i} \ge b\ge \tilde{F_i}, \, i = 1,\ldots , M, \end{aligned}$$
(38)

where

$$\begin{aligned} \tilde{F_i}= & {} F_i \quad \text{ if }\quad (y_i =1, \alpha _i =0), \quad C> \alpha _i >0 \quad \text{ or }\quad (y_i =-1, \alpha _i=C),\end{aligned}$$
(39)
$$\begin{aligned} \bar{F_i}= & {} F_i \quad \text{ if }\quad (y_i =-1, \alpha _i =0), \quad C> \alpha _i>0 \quad \text{ or }\quad (y_i =1, \alpha _i =C). \end{aligned}$$
(40)

To detect the violating variables, we define \(b_\mathrm{low}\) and \(b_\mathrm{up}\) as follows:

$$\begin{aligned} \displaystyle b_\mathrm{low}=\max _{i} {\tilde{F_i} },\,\, \displaystyle b_\mathrm{up}~=\min _{i} {\bar{F_i} }. \end{aligned}$$
(41)

If the KKT conditions are satisfied,

$$\begin{aligned} b_\mathrm{up} \ge b_\mathrm{low}. \end{aligned}$$
(42)

The bias term is estimated to be

$$\begin{aligned} b_\mathrm {e}=\frac{1}{2}(b_\mathrm{up} + b_\mathrm{low}), \end{aligned}$$
(43)

where \(b_\mathrm {e}\) is the estimate of the bias term using (17).
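The convergence test and the bias estimate can be computed directly from (37)-(43); the following numpy sketch (a hypothetical helper, assuming the dual variables obtained from the subproblems and nonempty index sets) illustrates the computation.

```python
# Sketch of the convergence test (42) and bias estimate (43).
import numpy as np

def bias_estimate(K, y, a, aM, C, tol=1e-12):
    F = y - K @ (y * (a - aM))                            # (37)
    inner = (a > tol) & (a < C - tol)                     # C > alpha_i > 0
    lo = ((y == 1) & (a <= tol)) | inner | ((y == -1) & (a >= C - tol))  # (39)
    up = ((y == -1) & (a <= tol)) | inner | ((y == 1) & (a >= C - tol))  # (40)
    b_low, b_up = F[lo].max(), F[up].min()                # (41)
    converged = b_up >= b_low - tol                       # (42)
    return 0.5 * (b_up + b_low), converged                # (43)
```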

Likewise, using (37), (18) becomes

$$\begin{aligned} \alpha _{M+i} \left( h +y_i\,F_i-y_i\,b-1\right) =0, \, i = 1,\ldots , M. \end{aligned}$$
(44)

Then the conditions for (18) are rewritten as follows:

  1. \(\alpha _{M+i}=0\). From \(h +y_i\,F_i-y_i\,b-1\ge 0,\) we have \(y_i \, b-h \le y_i \, F_i -1\), namely, \(b- h \le \, F_i -1\) if \(y_i=1\); \(b+h \ge \, F_i +1\) if \(y_i=-1.\)

  2. \(C_h> \alpha _{M+i}>0\). \(y_i \, b-h =y_i \, F_i -1\), namely, \(b-h =F_i -1\) if \(y_i=1\); \(b+h = \, F_i +1\) if \(y_i=-1.\)

The KKT conditions for (18) are simplified as follows:

$$\begin{aligned}&\text{ if }\,\, y_i =-1, \,\, \bar{F_i}^{-} +1 \ge b^-\ge \tilde{F_i}^{-}+1, \nonumber \\&\text{ if }\,\, y_i =1, \,\, \bar{F_i}^{+}-1 \ge b^+\ge \tilde{F_i}^{+}-1, \, i = 1,\ldots , M, \end{aligned}$$
(45)

where \(b^-=b+h\), \(b^+=b-h\), and

$$\begin{aligned} \tilde{F_i}^{-}= & {} F_i +1 \quad \text{ if }\quad y_i =-1,\end{aligned}$$
(46)
$$\begin{aligned} \bar{F_i}^{-}= & {} F_i +1 \quad \text{ if }\quad y_i =-1,C_h> \alpha _{M+i}>0,\end{aligned}$$
(47)
$$\begin{aligned} \tilde{F_i}^{+}= & {} F_i -1 \quad \text{ if }\quad y_i =1,C_h> \alpha _{M+i}>0,\end{aligned}$$
(48)
$$\begin{aligned} \bar{F_i}^{+}= & {} F_i -1 \quad \text{ if }\quad y_i =1. \end{aligned}$$
(49)

To detect the violating variables, we define \(b^{-}_\mathrm{low}\), \(b^{+}_\mathrm{low}\), \(b^{-}_\mathrm{up}\), and \(b^{+}_\mathrm{up}\) as follows:

$$\begin{aligned} \begin{array}{l} \displaystyle b^{-}_\mathrm{low}=\max _{i} \tilde{F_i}^{-}, \quad b^{+}_\mathrm{low}=\max _{i} \tilde{F_i}^{+},\\ \displaystyle b^{-}_\mathrm{up}~=\min _{i} \bar{F_i}^{-},\quad b^{+}_\mathrm{up}~=\min _{i} \bar{F_i}^{+}. \end{array} \end{aligned}$$
(50)

In general, the distributions of the Class 1 and Class 2 data are different. Therefore, the upper bounds of h for Classes 1 and 2 are different. This means that either \(b^{-}_\mathrm{up}\) (\(\bar{F_i}^{-}\)) or \(b^{+}_\mathrm{low}\) \((\tilde{F_i}^{+})\) may not exist. However, because of (32), both classes have at least one positive \(\alpha _{M+i}\), and because of (44), the values of h for the two classes can be different. This happens because we separate (24) into the two equations in (27). Then, if the KKT conditions are satisfied, both of the following inequalities hold:

$$\begin{aligned} b^{-}_\mathrm{up} \ge b^{-}_\mathrm{low},\quad b^{+}_\mathrm{up} \ge b^{+}_\mathrm{low}. \end{aligned}$$
(51)

From the first inequality, the estimate of h, \(h_\mathrm{e}^-\) for Class 2, is given by

$$\begin{aligned} h_\mathrm{e}^-=-b_\mathrm{e}+\frac{1}{2}(b^{-}_\mathrm{up} + b^{-}_\mathrm{low}). \end{aligned}$$
(52)

From the second inequality, the estimate of h, \(h_\mathrm{e}^+\) for Class 1, is given by

$$\begin{aligned} h^+_\mathrm{e}=b_\mathrm{e}-\frac{1}{2}(b^{+}_\mathrm{up} + b^{+}_\mathrm{low}). \end{aligned}$$
(53)
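Similarly, the bounds in (50) and the class-wise estimates of h in (52) and (53) can be computed as follows. This is again an illustrative sketch; as noted above, \(b^{-}_\mathrm{up}\) or \(b^{+}_\mathrm{low}\) may be undefined when a class has no \(\alpha _{M+i}\) strictly between 0 and \(C_h\).

```python
# Sketch of the bounds (50) and the h estimates (52), (53).
# F is the vector of (37), b_e the bias estimate of (43).
import numpy as np

def h_estimates(F, y, aM, C_h, b_e, tol=1e-12):
    inner = (aM > tol) & (aM < C_h - tol)        # C_h > alpha_{M+i} > 0
    neg, pos = (y == -1), (y == 1)
    b_m_low = (F[neg] + 1).max()                 # (46), (50)
    b_m_up  = (F[neg & inner] + 1).min()         # (47), (50); may be undefined
    b_p_low = (F[pos & inner] - 1).max()         # (48), (50); may be undefined
    b_p_up  = (F[pos] - 1).min()                 # (49), (50)
    h_minus = -b_e + 0.5 * (b_m_up + b_m_low)    # (52)
    h_plus  =  b_e - 0.5 * (b_p_up + b_p_low)    # (53)
    return h_minus, h_plus
```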

3.3 Variant of Minimal Complexity Support Vector Machines

Subproblem 2 of the ML1 SVM differs from Subproblem 1 in that the former includes the inequality constraint given by (33). This makes the solution process more complicated. In this section, we consider making the solution process similar to that of Subproblem 1.

Solving Subproblem 2 results in obtaining \(h^+_\mathrm{e}\) and \(h^-_\mathrm{e}\). We consider assigning separate variables \(h^+\) and \(h^-\) for Classes 1 and 2 instead of a single variable h. Then the complementarity conditions for \(h^+\) and \(h^-\) are

$$\begin{aligned}&(h^+ -1)\,\eta ^+ = 0, \,\, h^+\ge 1,\,\, \eta ^+\ge 0, \,\, (h^- -1)\,\eta ^- = 0, \nonumber \\&h^-\ge 1,\,\, \eta ^-\ge 0, \end{aligned}$$
(54)

where \(\eta ^+\) and \(\eta ^-\) are the Lagrange multipliers associated with \(h^+\) and \(h^-\), respectively. To simplify Subproblem 2, we assume that \(\eta ^+ = \eta ^- = 0\). This makes the constraints corresponding to (33) equality constraints. Then the optimization problem given by (31) to (34) becomes

$$\begin{aligned}&\text{ max }\,\, Q({\varvec{\alpha }}^M )= \sum _{i \, = \, 1}^{M} \alpha _i -\frac{1}{2}\sum _{i,j=1}^{M}(\alpha _i -\alpha _{M+i})\,(\alpha _j -\alpha _{M+j})\,y_i \,y_j \,K(\mathbf{x}_i, \mathbf{x}_j)\,\end{aligned}$$
(55)
$$\begin{aligned}&\text{ s.t. } \sum _{y_i = 1, i \, = \, 1}^{M}\alpha _{M+i} = C_h,\,\,\sum _{y_i = -1, i \, = \, 1}^{M}\alpha _{M+i} = C_h,\end{aligned}$$
(56)
$$\begin{aligned}&\qquad C \, \ge \alpha _i \ge 0, \,\, C_h \ge \alpha _{M+i} \ge 0,\, i = 1,\ldots ,M. \end{aligned}$$
(57)

Here, (32) is not necessary because of (56). We call the above architecture ML1\(_\mathrm{v}\) SVM.
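In the alternating scheme, only the second subproblem changes: the inequality (33) is replaced by the two equality constraints (56). The following cvxpy sketch (an illustration, not the authors' SMO-based solver) shows the modified Subproblem 2 with the \(\alpha _i\) held fixed; it can replace solve_sub2 in the earlier alternating sketch.

```python
# Sketch of the modified Subproblem 2 of the ML1_v SVM, (55)-(57).
import numpy as np
import cvxpy as cp

def solve_sub2_v(K, y, a, C_h):
    aM = cp.Variable(len(y))
    d = cp.multiply(y, a - aM)                     # a is fixed here
    idx_p, idx_n = np.where(y == 1)[0], np.where(y == -1)[0]
    obj = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(d, K))   # (55)
    constraints = [cp.sum(aM[idx_p]) == C_h,                  # (56)
                   cp.sum(aM[idx_n]) == C_h,
                   aM >= 0, aM <= C_h]                        # (57)
    cp.Problem(obj, constraints).solve()
    return aM.value
```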

For a solution of the ML1 SVM, the same solution is obtained by the ML1\(_\mathrm{v}\) SVM if its \(C_h\) value is set to

$$\begin{aligned} C_h = \sum _{i=1,\ldots ,M, y_i =1}\alpha _{M+i}= \sum _{i=1,\ldots ,M, y_i =-1}\alpha _{M+i}. \end{aligned}$$
(58)

However, the converse is not true; namely, the solution of the ML1\(_\mathrm{v}\) SVM may not be obtainable by the ML1 SVM. As the \(C_h\) value becomes large, the value of \(\eta \) becomes positive for the ML1 SVM, whereas for the ML1\(_\mathrm{v}\) SVM the values of \(\alpha _{M+i}\) are forced to become larger. Nevertheless, as the following computer experiments show, the performance difference is small.

4 Computer Experiments

In this section, we compare the generalization performance of the ML1\(_\mathrm{v}\) SVM and ML1 SVM with the L1 SVM, MLP SVM [10], LS SVM, and ULDM [8] using two-class and multiclass problems.

4.1 Comparison Conditions

We determined the hyperparameter values from the training data by fivefold cross-validation, trained the classifier with the determined hyperparameter values, and evaluated the accuracy on the test data.

We trained the ML1\(_\mathrm{v}\) SVM, ML1 SVM, and L1 SVM by SMO combined with Newton’s method [12]. We trained the MLP SVM by the simplex method and the LS SVM and ULDM by matrix inversion.

Because RBF kernels perform well in most pattern classification applications, we used RBF kernels: \(K(\mathbf{x}, \mathbf{x}') = \exp (-\gamma \Vert \mathbf{x}-\mathbf{x}'\Vert ^2/m),\) where \(\gamma \) is a parameter that controls the spread of the kernel and m is the number of inputs.
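For clarity, note that the exponent is normalized by the number of inputs m, which differs from the common convention without the division; a short numpy sketch of this kernel follows.

```python
# RBF kernel with the exponent normalized by the number of inputs m.
import numpy as np

def rbf_kernel(X1, X2, gamma):
    m = X1.shape[1]                                         # number of inputs
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.exp(-gamma * sq / m)
```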

In cross-validation, we selected the \(\gamma \) value from \(\big \{0.01, 0.1, 0.5, 1, 5, 10, 15, 20, 50, 100, 200\big \}\) and the C and \(C_h\) values from \(\big \{0.1, 1, 10, 50, 100, 500, 1000, 2000\big \}\). For the ULDM, the C value was selected from \(\big \{10^{-12}, 10^{-10}, 10^{-8}, 10^{-6}, 10^{-4}, 10^{-3}, 10^{-2}, 0.1\big \}\).

For the L1 SVM, LS SVM, and ULDM, we determined the \(\gamma \) and C values by grid search. For the ML1\(_\mathrm{v}\) SVM, ML1 SVM, and MLP SVM, to shorten the computation time, we first determined the \(\gamma \) and C values with \(C_h=1\) \((C_h =0.1\) for the MLP SVM) by grid search and then determined the \(C_h\) value by line search, fixing \(\gamma \) and C at the determined values.
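The two-stage search can be summarized as in the sketch below, where cv_accuracy(gamma, C, C_h) is a hypothetical placeholder for the fivefold cross-validation accuracy and is not part of the original paper.

```python
# Schematic sketch of the two-stage model selection (grid search, then line
# search over C_h).  cv_accuracy is a hypothetical cross-validation routine.
gammas = [0.01, 0.1, 0.5, 1, 5, 10, 15, 20, 50, 100, 200]
Cs = [0.1, 1, 10, 50, 100, 500, 1000, 2000]
C_hs = [0.1, 1, 10, 50, 100, 500, 1000, 2000]

# stage 1: grid search over gamma and C with C_h fixed to 1
best_gamma, best_C = max(((g, c) for g in gammas for c in Cs),
                         key=lambda gc: cv_accuracy(gc[0], gc[1], 1.0))
# stage 2: line search over C_h with gamma and C fixed
best_C_h = max(C_hs, key=lambda ch: cv_accuracy(best_gamma, best_C, ch))
```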

After model selection, we trained the classifier with the determined hyperparameter values and calculated the accuracy for the test data. For the two-class problems we calculated the average accuracies and their standard deviations, and performed Welch's t test at the 5% significance level.

4.2 Two-Class Problems

Table 1 lists the accuracies for the two-class problems. In the first column, I/Tr/Te denotes the numbers of input variables, training data, and test data. Except for the image and splice problems, each problem has 100 pairs of training and test data sets; the image and splice problems have 20 pairs.

In the table, for each classifier and each classification problem, the average accuracy and the standard deviation are shown. For each problem, the best average accuracy is shown in bold and the worst is underlined. The “+” and “−” symbols attached to an accuracy indicate that the ML1\(_\mathrm{v}\) SVM is statistically better and worse, respectively, than the associated classifier. The “Average” row shows the average accuracy over the 13 problems for each classifier, and “B/S/W” denotes the numbers of times that the associated classifier showed the best, the second best, and the worst accuracy. The “W/T/L” row denotes the numbers of times that the ML1\(_\mathrm{v}\) SVM is statistically better than, comparable to, and worse than the associated classifier.

According to the “W/T/L” row, the ML1\(_\mathrm{v}\) SVM is statistically better than the MLP SVM but is comparable to other classifiers. From the “Average” measure, the ULDM is the best and the ML1\(_\mathrm{v}\) SVM, the second. However, the differences of the measures among the ML1\(_\mathrm{v}\) SVM, ML1 SVM, and L1 SVM are small. From the “B/S/W” measure, the ULDM is the best and the LS SVM is the second best.

Table 1. Accuracies of the test data for the two-class problems
Table 2. Accuracies of the test data for the multiclass problems

4.3 Multiclass Problems

Table 2 shows the accuracies for the ten multiclass problems. The symbol “C” in the first column denotes the number of classes. Unlike the two-class problems, each multiclass problem has only one training and test data pair.

We used fuzzy pairwise (one-vs-one) classification for multiclass problems [2]. In the table, for each problem, the best accuracy is shown in bold, and the worst, underlined. For the MLP SVM, the accuracies for the thyroid, MNIST, and letter problems were not available.

Among the ten problems, the accuracies of the ML1\(_\mathrm{v}\) SVM and the ML1 SVM were better than or equal to those of the L1 SVM for nine and seven problems, respectively. In addition, the best average accuracy was obtained by the ML1 SVM and the second best by the ML1\(_\mathrm{v}\) SVM. This is very different from the two-class problems, where the differences were very small.

5 Conclusions

In this paper, to solve the problem of the non-unique solution of the MCM, and to improve the generalization ability of the L1 SVM, we fused the MCM and the L1 SVM. We derived two dual subproblems: the first subproblem corresponds to the L1 SVM and the second subproblem corresponds to minimizing the upper bound. We further modified the second subproblem by converting the inequality constraint into two equality constraints. We call this architecture ML1\(_\mathrm{v}\) SVM and the original architecture, ML1 SVM.

According to computer experiments for two-class problems, the average accuracy of the ML1\(_\mathrm{v}\) SVM is statistically comparable to that of the ML1 SVM and L1 SVM. For multiclass problems, the ML1\(_\mathrm{v}\) SVM and ML1 SVM generalized better than the L1 SVM.