1 Introduction

A classifier is designed to achieve high generalization ability for unknown data by maximizing class separability. The support vector machine (SVM) [1, 2] realizes this by maximizing the minimum margin, where the margin of a data sample is defined as its distance from the separating hyperplane. Although the SVM works relatively well for a wide range of applications, there is still room for improvement. Therefore, in addition to maximizing the minimum margin, controlling the margin distribution has been considered. One approach controls low-order statistics of the margin distribution [3,4,5,6,7,8]. In [5], the large margin distribution machine (LDM) was proposed, in which the average margin is maximized and the margin variance is minimized. Because the LDM includes an additional hyperparameter compared to the SVM, the unconstrained LDM (ULDM), which has the same number of hyperparameters as the SVM, was proposed in [6, 7]. The least squares SVM (LS SVM) [3, 4] is also considered to be based on low-order statistics because it minimizes the sum of squared margin errors.

Another approach [9,10,11,12,13,14,15] minimizes the VC (Vapnik-Chervonenkis) dimension [1]. In [9], the minimal complexity machine (MCM), which minimizes the VC dimension, was proposed; its training reduces to minimizing the sum of margin errors and minimizing the maximum margin. According to the analysis in [10], however, the solution of the MCM is non-unique and unbounded. These disadvantages can be overcome by introducing a regularization term into the MCM, which yields a fusion of the LP (linear programming) SVM and the MCM called the MLP SVM. The soft upper-bound minimal complexity LP SVM (SLP SVM) [14] is a soft upper-bound version of the MLP SVM. The ML1 SVM [11, 12] is the fusion of the MCM and the standard SVM (L1 SVM), and the SL1 SVM [15] is a soft upper-bound version of the ML1 SVM. According to computer experiments, in general, these fusions, i.e., minimizing the maximum margin in the SVMs, improved the generalization ability of the base classifiers, and the ML1\(_\textrm{v}\) SVM, which is a variant of the ML1 SVM, performed best.

In this paper, we discuss whether the idea of minimizing the VC dimension, i.e., minimizing the maximum margin, also works for the LS SVM, which controls the margin distribution by second-order statistics. We formulate the minimal complexity LS SVM (MLS SVM) by minimizing the maximum margin as well as maximizing the minimum margin in the LS SVM framework. We derive the dual form of the MLS SVM and a training method that trains the MLS SVM alternately by matrix inversion and by SMO (Sequential Minimal Optimization) combined with Newton's method [16]. By computer experiments, we examine whether the MLS SVM performs better than the LS SVM.

In Sect. 2, we discuss the architecture of the MLS SVM, derive its dual problem, and then discuss the training method of the MLS SVM. In Sect. 3, we compare the generalization performance of the MLS SVM with that of the LS SVM and other SVM-based classifiers using two-class and multiclass problems.

2 Minimal Complexity Least Squares Support Vector Machines

In this section we discuss the architecture of the MLS SVM, the KKT conditions, and a training method.

2.1 Architecture

For a two-class problem, we consider the following decision function:

$$\begin{aligned} D(\textbf{x}) = \textbf{w}^{\top } {\boldsymbol{\phi }}(\textbf{x}) + b, \end{aligned}$$
(1)

where \(\textbf{w}\) is an l-dimensional weight vector, b is the bias term, and \({\boldsymbol{\phi }}(\textbf{x})\) is the l-dimensional vector obtained by mapping the m-dimensional input vector \(\textbf{x}\) into the feature space. If \(D(\textbf{x}) > 0\), \(\textbf{x}\) is classified into Class 1, and if \(D(\textbf{x}) < 0\), into Class 2.

We introduce the idea of minimizing the VC dimension into the LS SVM: we minimize the maximum margin as well as maximizing the minimum margin.

The minimal complexity LS SVM (MLS SVM) is formulated as follows:

$$\begin{aligned}&\text{ min }&\quad \frac{1}{2} \, \textbf{w}^{\top } \textbf{w} + \frac{C}{2}\sum _{i=1}^M\xi ^2_i+{C_h}\, (h^++h^-) \end{aligned}$$
(2)
$$\begin{aligned}&\text{ s.t. }&\quad y_i \, (\textbf{w}^{\top } {\boldsymbol{\phi }}(\textbf{x}_i) + b) = 1-\xi _i \quad \text{ for }\quad i = 1,\ldots ,M,\end{aligned}$$
(3)
$$\begin{aligned}{} & {} \quad h_i \ge y_i \, (\textbf{w}^{\top } {\boldsymbol{\phi }}(\textbf{x}_i) + b) \quad \text{ for }\quad i = 1,\ldots ,M,\end{aligned}$$
(4)
$$\begin{aligned}{} & {} \quad h^+ \ge 1,\quad h^- \ge 1, \end{aligned}$$
(5)

where \((\textbf{x}_i, y_i)\) \((i=1,\ldots , M)\) are the M training input-output pairs, \(y_i = 1\) if \(\textbf{x}_i\) belongs to Class 1 and \(y_i=-1\) if it belongs to Class 2, \(\xi _i\) are the slack variables for \(\textbf{x}_i\), C is the margin parameter, \(h^+\) and \(h^-\) are the upper bounds for Classes 1 and 2, respectively, and \(h_i=h^+\) for \(y_i=1\) and \(h_i = h^-\) for \(y_i=-1\). Here, if \(\xi _i \ge 1\), \(\textbf{x}_i\) is misclassified; otherwise, \(\textbf{x}_i\) is correctly classified. Unlike in the L1 or L2 SVM, \(\xi _i\) can be negative. The first term in the objective function is the reciprocal of the squared margin divided by 2, the second term controls the number of misclassifications, and C controls the tradeoff between the first and second terms. The third term works to minimize the maximum margin, and parameter \(C_h\) controls the upper bounds \(h^+\) and \(h^-\).

If we delete (4), (5), and the third term in (2), we obtain the LS SVM. If, in the LS SVM, we further replace the equality constraints in (3) with the inequality constraints \((\ge )\) and the squared sum of slack variables in (2) with the linear sum multiplied by 2, we obtain the L1 SVM, which is the standard SVM.
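To make the primal formulation (2)–(5) concrete, the following is a minimal sketch for a linear kernel, i.e., \({\boldsymbol{\phi }}(\textbf{x})=\textbf{x}\), written with the cvxpy modeling library. The toy data, the hyperparameter values, and the use of a generic QP solver are illustrative assumptions only; the method described in this paper instead works with the dual problem derived below.

```python
import numpy as np
import cvxpy as cp

# Toy data: X is (M, m), y in {+1, -1}; the values are illustrative only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (20, 2)), rng.normal(-1.0, 1.0, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
M, m = X.shape
C, C_h = 10.0, 1.0                     # margin parameter and upper-bound parameter (assumed values)

pos = (y == 1).astype(float)           # indicator of Class 1
neg = (y == -1).astype(float)          # indicator of Class 2

w = cp.Variable(m)
b = cp.Variable()
xi = cp.Variable(M)                    # slack variables; may be negative
hp = cp.Variable()                     # h^+: upper bound for Class 1
hm = cp.Variable()                     # h^-: upper bound for Class 2
h = hp * pos + hm * neg                # h_i = h^+ for y_i = 1, h^- for y_i = -1

objective = cp.Minimize(0.5 * cp.sum_squares(w)
                        + 0.5 * C * cp.sum_squares(xi)
                        + C_h * (hp + hm))                 # (2)
constraints = [cp.multiply(y, X @ w + b) == 1 - xi,        # (3)
               cp.multiply(y, X @ w + b) <= h,             # (4)
               hp >= 1, hm >= 1]                           # (5)
cp.Problem(objective, constraints).solve()
print(w.value, b.value, hp.value, hm.value)
```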

In the following, we derive the dual problem of the above optimization problem.

Introducing the Lagrange multipliers \({\alpha _i}\), \(\alpha _{M+i} \, (\ge 0)\), \(\eta ^+ \, (\ge 0)\), and \(\eta ^- \, (\ge 0)\) into (2) to (5), we obtain the unconstrained objective function:

$$\begin{aligned} Q(\textbf{w},b,{\boldsymbol{\alpha }},{\boldsymbol{\xi }},h^+,h^-,\eta ^+,\eta ^-) = \frac{1}{2} \,\textbf{w}^{\top } \textbf{w} + \frac{C}{2}\sum _{i=1}^M\xi ^2_i+{C_h}\,(h^++h^-) \nonumber \\ -\sum _{i=1}^M \alpha _i \,(y_i\,(\textbf{w}^{\top } {\boldsymbol{\phi }}(\textbf{x}_i) + b) - 1 +\xi _i) -\sum _{i=1}^M\alpha _{M+i} \,(h_i-y_i\, (\textbf{w}^{\top } {\boldsymbol{\phi }}(\textbf{x}_i) + b)) \nonumber \\ -\eta ^+ \, (h^+-1)-\eta ^- \, (h^- -1), \end{aligned}$$
(6)

where \({\boldsymbol{\alpha }} = (\alpha _1,\ldots ,\alpha _M, \alpha _{M+1},\ldots ,\alpha _{2M})^{\top }\), and \({\boldsymbol{ \xi }} = (\xi _1,\ldots ,\xi _M)^{\top }\).

Taking the partial derivatives of (6) with respect to \(\textbf{w}\), b, \({\boldsymbol{\xi }}\), \(h^+\), and \(h^-\) and equating them to zero, together with the equality constraints (3) and the complementarity conditions for (4) and (5), we obtain the following optimality conditions:

$$\begin{aligned}{} & {} \textbf{w} = \sum _{i=1}^M y_i \, (\alpha _i -\alpha _{M+i})\,{\boldsymbol{\phi }}(\textbf{x}_i),\end{aligned}$$
(7)
$$\begin{aligned}{} & {} \sum _{i=1}^M y_i \, (\alpha _i-\alpha _{M+i}) =0,\end{aligned}$$
(8)
$$\begin{aligned}{} & {} \alpha _i = C\,\xi _i \quad \text{ for }\quad i = 1,\ldots ,M, \end{aligned}$$
(9)
$$\begin{aligned}{} & {} C_h =\sum _{i=1,y_i=1}^M \alpha _{M+i} +\eta ^+, \quad C_h =\sum _{i=1,y_i=-1}^M \alpha _{M+i} +\eta ^-,\end{aligned}$$
(10)
$$\begin{aligned}{} & {} y_i \,( \textbf{w}^{\top } \,{\boldsymbol{\phi }}(\textbf{x}_i) +b )-1+\xi _i = 0 \quad \text{ for }\quad i = 1,\ldots ,M,\end{aligned}$$
(11)
$$\begin{aligned}{} & {} \alpha _{M+i} \,(h_i-y_i\, (\textbf{w}^{\top } {\boldsymbol{\phi }}(\textbf{x}_i) + b))=0,\quad \alpha _{M+i} \ge 0 \quad \text{ for }\quad i = 1,\ldots ,M,\end{aligned}$$
(12)
$$\begin{aligned}{} & {} \eta ^+ \, (h^+-1) = 0,\quad \eta ^+ \ge 0, \quad \eta ^- \, (h^- -1) = 0,\quad \eta ^- \ge 0. \end{aligned}$$
(13)

From (9), unlike L1 or L2 SVMs, \(\alpha _i\) can be negative.

Now, we derive the dual problem. Substituting (7) and (8) into (6), we obtain the objective function with respect to \(\boldsymbol{\alpha }\), \(\eta ^+\), and \(\eta ^-\). Thus, we obtain the following dual problem:

$$\begin{aligned} \text{ max }&\quad Q(\boldsymbol{\alpha },\eta ^+,\eta ^-) = \sum _{i=1}^M \alpha _i -\frac{1}{2} \, \sum _{i, j=1}^M y_i \,(\alpha _i- \alpha _{M+i})\nonumber \\&\times y_j \, (\alpha _j-\alpha _{M+j}) K(\textbf{x}_i, \textbf{x}_j) -\frac{1}{2C}\sum _{i=1}^M \alpha _i^2 +\eta ^++\eta ^-, \end{aligned}$$
(14)
$$\begin{aligned} \text{ s.t. }&\quad \sum _{i=1}^M y_i \,(\alpha _i -\alpha _{M+i})=0,\end{aligned}$$
(15)
$$\begin{aligned}&\quad C_h \ge \sum _{i=1,y_i=1}^M \alpha _{M+i}, \quad C_h \ge \sum _{i=1,y_i=-1}^M \alpha _{M+i}, \end{aligned}$$
(16)
$$\begin{aligned}&\quad \alpha _{M+i} \ge 0 \quad \text{ for }\quad i = 1,\ldots ,M, \end{aligned}$$
(17)

where \(K(\textbf{x}, \textbf{x}')\) is the kernel: \(K(\textbf{x}, \textbf{x}') = {\boldsymbol{\phi }}^{\top } (\textbf{x}) \,{\boldsymbol{\phi }}(\textbf{x}')\). As in the SVM, by defining \(K(\textbf{x}, \textbf{x}')\) directly, we can avoid the explicit treatment of variables in the feature space.
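As an illustration, combining (1) with (7) gives the decision function in kernel form, \(D(\textbf{x})=\sum _{i=1}^M y_i\,(\alpha _i-\alpha _{M+i})\,K(\textbf{x}_i,\textbf{x})+b\). The following sketch evaluates it for an RBF kernel; the kernel choice and the variable names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def decision_function(X_new, X_train, y, alpha0, alphaM, b, gamma):
    """D(x) = sum_i y_i (alpha_i - alpha_{M+i}) K(x_i, x) + b; see (1) and (7)."""
    K = rbf_kernel(X_train, X_new, gamma)          # shape (M, N_new)
    return (y * (alpha0 - alphaM)) @ K + b

# Class 1 if D(x) > 0, Class 2 if D(x) < 0.
```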

In the above optimization problem, if we delete \((\alpha _{M+1},\ldots ,\alpha _{2M})\), \(\eta ^+\), \(\eta ^-\), and their related terms, we obtain the LS SVM.

Similar to the ML1\(_\textrm{v}\) SVM [12], we assume that \(\eta ^+ = \eta ^- =0\). This means that \(h^+ \ge 1\) and \(h^- \ge 1\). Then the optimization problem reduces to

$$\begin{aligned} \text{ max }&\quad Q(\boldsymbol{\alpha }) = \sum _{i=1}^M \alpha _i -\frac{1}{2} \, \sum _{i, j=1}^M (\alpha _i- \alpha _{M+i})\, (\alpha _j- \alpha _{M+j}) \, y_i \, y_j \,K(\textbf{x}_i, \textbf{x}_j) \nonumber \\&-\frac{1}{2C}\sum _{i=1}^M \alpha _i^2, \end{aligned}$$
(18)
$$\begin{aligned} \text{ s.t. }&\quad \sum _{i=1}^M y_i \, \alpha _i =0,\end{aligned}$$
(19)
$$\begin{aligned}&\quad C_h = \sum _{i=1,y_i=1}^M \alpha _{M+i} = \sum _{i=1,y_i=-1}^M \alpha _{M+i}, \end{aligned}$$
(20)
$$\begin{aligned}&\quad \alpha _{M+i} \ge 0 \quad \text{ for }\quad i = 1,\ldots ,M. \end{aligned}$$
(21)

Notice that because of (20), (15) reduces to (19).

We decompose the above optimization problem into the following two subproblems:

  1. Subproblem 1   Solving the problem for \(\alpha _1,\ldots ,\alpha _M\) and b, fixing \(\alpha _{M+1},\ldots ,\alpha _{2M}\):

    $$\begin{aligned} \text{ max }&\quad Q(\boldsymbol{\alpha }^0) = \sum _{i=1}^M \alpha _i -\frac{1}{2} \, \sum _{i, j=1}^M (\alpha _i- \alpha _{M+i})\, (\alpha _j-\alpha _{M+j}) \, y_i \, y_j \, K(\textbf{x}_i, \textbf{x}_j) \nonumber \\&-\frac{1}{2C}\sum _{i=1}^M \alpha _i^2, \end{aligned}$$
    (22)
    $$\begin{aligned} \text{ s.t. }&\quad \sum _{i=1}^M y_i \,\alpha _i =0, \end{aligned}$$
    (23)

    where \(\boldsymbol{\alpha }^0=(\alpha _1,\ldots ,\alpha _M)^\top .\)

  2. Subproblem 2   Solving the problem for \(\alpha _{M+1},\ldots ,\alpha _{2M}\), fixing \(\boldsymbol{\alpha }^0\) and b:

    $$\begin{aligned} &\text{ max }&\quad Q(\boldsymbol{\alpha }^M) = -\frac{1}{2} \, \sum _{i, j=1}^M (\alpha _i- \alpha _{M+i})\, (\alpha _j- \alpha _{M+j}) \, y_i \, y_j \,K(\textbf{x}_i, \textbf{x}_j) \end{aligned}$$
    (24)
    $$\begin{aligned} &\text{ s.t. }&\quad C_h = \sum _{i=1,y_i=1}^M \alpha _{M+i} = \sum _{i=1,y_i=-1}^M \alpha _{M+i}, \end{aligned}$$
    (25)
    $$\begin{aligned} {} & {} \quad \alpha _{M+i} \ge 0 \quad \text{ for }\quad i = 1,\ldots ,M, \end{aligned}$$
    (26)

    where \(\boldsymbol{\alpha }^M=(\alpha _{M+1},\ldots ,\alpha _{2M})^\top .\)

Note that as the value of \(C_h\) approaches zero, the MLS SVM reduces to the LS SVM. Therefore, for a sufficiently small value of \(C_h\), the MLS SVM and the LS SVM behave similarly.

We solve the above subproblems alternately.

Here, because of (25), if we modify \(\alpha _{M+i}\), another \(\alpha _{M+j}\) belonging to the same class must be modified. Therefore, \(\boldsymbol{\alpha }^M\) can be updated per class.

2.2 Solving Subproblem 1

Variables \(\alpha _1,\ldots ,\alpha _M\) and b can be obtained from (7), (9), (11), and (23) by matrix inversion. Substituting (7) and (9) into (11) and expressing the result and (23) in matrix form, we obtain

$$\begin{aligned} \left( \begin{array}{cc} \varOmega &{} \textbf{1}\\ \textbf{1}^{\top } &{} 0 \end{array}\right) \left( \begin{array}{c} {\boldsymbol{\alpha }}'\\ b \end{array} \right) = \left( \begin{array}{c} \textbf{d}_1\\ 0 \end{array} \right) , \end{aligned}$$
(27)

or

$$\begin{aligned} \varOmega {\boldsymbol{\alpha }}' + \textbf{1} b= & {} \textbf{d}_1,\end{aligned}$$
(28)
$$\begin{aligned} \textbf{1}^{\top } {\boldsymbol{\alpha }}'= & {} 0, \end{aligned}$$
(29)

where \(\textbf{1}\) is the M-dimensional vector of ones and

$$\begin{aligned} \boldsymbol{\alpha }'= & {} (y_1 \, \alpha _1,\ldots ,y_M \, \alpha _M)^\top \end{aligned}$$
(30)
$$\begin{aligned} \varOmega _{ij}= & {} K(\textbf{x}_i, \textbf{x}_j)+\frac{\delta _{ij}}{C},\end{aligned}$$
(31)
$$\begin{aligned} \textbf{d}_1= & {} (d_{11},\ldots ,d_{1M})^{\top },\end{aligned}$$
(32)
$$\begin{aligned} d_{1i}= & {} y_i + \sum _{j=1}^M y_j \, \alpha _{M+j} \, K(\textbf{x}_i,\textbf{x}_j),\end{aligned}$$
(33)
$$\begin{aligned} \textbf{1}= & {} (1,\ldots ,1)^{\top }, \end{aligned}$$
(34)

where \(\delta _{ij} =1\) for \(i=j\), and \(\delta _{ij} =0\) for \(i\ne j\).

If \(\boldsymbol{\alpha }^M=\textbf{0}\), (27) reduces to solving the LS SVM.

Subproblem 1 is solved by solving (27) for \({\boldsymbol{\alpha }}'\), i.e., \({\boldsymbol{\alpha }}^0\), and b as follows. Because of the positive term \(1/C\) added to the diagonal elements of \(\varOmega \), \(\varOmega \) is positive definite. Therefore,

$$\begin{aligned} {\boldsymbol{\alpha }}'= \varOmega ^{-1}(\textbf{d}_1-\textbf{1}\, b). \end{aligned}$$
(35)

Substituting (35) into (29), we obtain

$$\begin{aligned} b = (\textbf{1}^{\top }\varOmega ^{-1}{} \textbf{1})^{-1}{} \textbf{1}^{\top }\varOmega ^{-1}{} \textbf{d}_1. \end{aligned}$$
(36)

Thus, substituting (36) into (35), we obtain \({\boldsymbol{\alpha }}'\).
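As a minimal sketch, the closed-form solution of Subproblem 1 via (31)–(36), with \(\boldsymbol{\alpha }^M\) fixed, can be computed as follows. The function name is an illustrative assumption, and a linear system solve is used in place of explicit matrix inversion.

```python
import numpy as np

def solve_subproblem1(K, y, alphaM, C):
    """Solve (27) for alpha' and b with alpha^M fixed; see (35) and (36).

    K: (M, M) kernel matrix, y: labels in {+1, -1}, alphaM: current alpha_{M+i}.
    Returns alpha0 = (alpha_1, ..., alpha_M) and b.
    """
    M = len(y)
    Omega = K + np.eye(M) / C                           # (31)
    d1 = y + K @ (y * alphaM)                           # (33)
    ones = np.ones(M)                                   # (34)
    Omega_inv_d1 = np.linalg.solve(Omega, d1)           # Omega^{-1} d_1
    Omega_inv_1 = np.linalg.solve(Omega, ones)          # Omega^{-1} 1
    b = (ones @ Omega_inv_d1) / (ones @ Omega_inv_1)    # (36)
    alpha_prime = Omega_inv_d1 - Omega_inv_1 * b        # (35)
    alpha0 = y * alpha_prime                            # alpha_i = y_i alpha'_i, from (30)
    return alpha0, b
```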

2.3 Solving Subproblem 2

Subproblem 2 needs to be solved iteratively. We derive the KKT (Karush-Kuhn-Tucker) conditions for Subproblem 2 for the convergence check. Because of space limitations, we omit the detailed training method, which is based on SMO combined with Newton's method [16].

For Subproblem 2, training has converged if the KKT optimality condition (12) is satisfied. Substituting (7) and (9) into (12), we obtain the following KKT conditions:

$$\begin{aligned} \alpha _{M+i} \, \left( h_i+y_i \,F_i -y_i \, b\right) =0 \quad \text{ for }\quad i=1,\ldots ,M, \end{aligned}$$
(37)

where

$$\begin{aligned} F_i=-\sum _{j=1}^M y_j (\alpha _j-\alpha _{M+j})K(\textbf{x}_i,\textbf{x}_j). \end{aligned}$$
(38)

Here, the value of b is the one determined in Subproblem 1.

2.3.1 KKT Conditions.

We can classify the conditions of (37) into the following two cases:

  1. \( \alpha _{M+i} =0\). From \(h_i \ge -y_i \, F_i +y_i \, b\),

    $$\begin{aligned} F_i \ge b- h^+\, \text{ for } \,y_i=1, \,\,\,b+h^- \ge F_i \, \text{ for } \, y_i=-1. \end{aligned}$$
    (39)
  2. \(C_h \ge \alpha _{M+i}>0\). From \(h_i = -y_i \, F_i +y_i \, b\),

    $$\begin{aligned} b-h^+ = F_i \, \text{ for } \,y_i=1, \,\,\,b+h^- = F_i \, \text{ for } \, y_i=-1. \end{aligned}$$
    (40)

Then the KKT conditions for (37) are simplified as follows:

$$\begin{aligned}{} & {} \bar{F_i}^+ \ge b-h^+ \ge \tilde{F_i}^+ \, \text{ for } \,y_i=1, \nonumber \\{} & {} \bar{F_i}^- \ge b+h^- \ge \tilde{F_i}^- \, \text{ for } \,y_i=-1, \quad \text{ for }\quad i=1,\ldots , M, \end{aligned}$$
(41)

where

$$\begin{aligned}{} & {} \bar{F_i}^+ =F_i \quad \text{ if } \quad \alpha _{M+i}\ge 0,\quad \tilde{F_i}^+ =F_i \quad \text{ if } \quad \alpha _{M+i} > 0,\end{aligned}$$
(42)
$$\begin{aligned}{} & {} \bar{F_i}^- =F_i \quad \text{ if } \quad \alpha _{M+i}> 0,\quad \tilde{F_i}^- =F_i \quad \text{ if } \quad \alpha _{M+i} \ge 0. \end{aligned}$$
(43)

To detect the violating variables, we define \(b_\textrm{up}^s\) and \(b_\textrm{low}^s\) as follows:

$$\begin{aligned} \begin{array}{l} \displaystyle b_\textrm{up}^s=\min _{i} {\bar{F_i} }^s,\quad \displaystyle b_\textrm{low}^s=\max _{i} {\tilde{F_i}^s}, \end{array} \end{aligned}$$
(44)

where \(s=+,-\), \(b^+=b-h^+\), and \(b^-=b+h^-\).

If the KKT conditions are satisfied,

$$\begin{aligned} b_\textrm{up}^s \ge b_\textrm{low}^s. \end{aligned}$$
(45)

To speed up training, we consider that training has converged if

$$\begin{aligned} \max _{s=+,-} \left( b^s_\textrm{low}-b^s_\textrm{up} \right) \le \tau , \end{aligned}$$
(46)

where \(\tau \) is a small positive parameter.

The upper bounds are then estimated as

$$\begin{aligned} h^+_\textrm{e}=b-\frac{1}{2}(b_\textrm{up}^+ + b_\textrm{low}^+),\,\,\,h^-_\textrm{e}=-b+\frac{1}{2}(b_\textrm{up}^- + b_\textrm{low}^-). \end{aligned}$$
(47)
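The following is a minimal sketch of the convergence check (46) and the upper-bound estimation (47), using (38) and (42)–(44). The function name, the tolerance value, and the dictionary bookkeeping are assumptions; it presumes at least one \(\alpha _{M+i}>0\) in each class, which the initialization in Sect. 2.4 guarantees.

```python
import numpy as np

def kkt_check(K, y, alpha0, alphaM, b, tau=1e-3):
    """Return (converged, h_plus_e, h_minus_e) for Subproblem 2; see (44)-(47)."""
    F = -K @ (y * (alpha0 - alphaM))                   # (38)
    converged = True
    h_est = {}
    for s, cls in (('+', 1), ('-', -1)):
        in_cls = (y == cls)
        active = in_cls & (alphaM > 0)                 # alpha_{M+i} > 0 (at least one per class assumed)
        # From (42)-(44): for s = '+', b_up uses all i with y_i = 1 and b_low only the active ones;
        # for s = '-', the roles are reversed.
        if s == '+':
            b_up = F[in_cls].min()
            b_low = F[active].max()
        else:
            b_up = F[active].min()
            b_low = F[in_cls].max()
        if b_low - b_up > tau:                         # violation of (45) beyond the tolerance in (46)
            converged = False
        h_est[s] = (b - 0.5 * (b_up + b_low) if s == '+'
                    else -b + 0.5 * (b_up + b_low))    # (47)
    return converged, h_est['+'], h_est['-']
```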

2.4 Training Procedure

In the following, we show the training procedure of the MLS SVM; a code sketch of the alternating loop is given after the procedure.

  1. (Solving Subproblem 1) Solve (27) for \(\boldsymbol{\alpha }^0\) and b, fixing \(\boldsymbol{\alpha }^M\) to the solution obtained in Step 2. Initially, we set \(\boldsymbol{\alpha }^M=\textbf{0}\).

  2. (Solving Subproblem 2) Solve (24)–(26) for \(\boldsymbol{\alpha }^M\), fixing \(\boldsymbol{\alpha }^0\) and b to the solution obtained in Step 1. Initially, we set one \(\alpha _{M+i}\) in each class to \(C_h\).

  3. (Convergence check) If (46) is satisfied, finish training. Otherwise, go to Step 1.
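Below is a minimal sketch of the alternating procedure above. It reuses `solve_subproblem1` and `kkt_check` from the sketches in Sects. 2.2 and 2.3; `solve_subproblem2`, whose SMO-based details are omitted in this paper, is passed in as a placeholder, and `max_iter` is a safeguard that is not part of the procedure itself.

```python
import numpy as np

def train_mls_svm(K, y, C, C_h, solve_subproblem2, tau=1e-3, max_iter=100):
    """Alternating training of the MLS SVM (Steps 1-3 in Sect. 2.4)."""
    M = len(y)
    # Step 1, first pass: alpha^M = 0, so this is the plain LS SVM solution (cf. the remark after (27)).
    alpha0, b = solve_subproblem1(K, y, np.zeros(M), C)
    # Initial feasible point for Subproblem 2: one alpha_{M+i} per class set to C_h.
    alphaM = np.zeros(M)
    alphaM[np.argmax(y == 1)] = C_h
    alphaM[np.argmax(y == -1)] = C_h
    for _ in range(max_iter):
        # Step 2: solve (24)-(26) for alpha^M with alpha^0 and b fixed (SMO + Newton, omitted here).
        alphaM = solve_subproblem2(K, y, alpha0, alphaM, b, C_h)
        # Step 3: convergence check by (46).
        converged, _, _ = kkt_check(K, y, alpha0, alphaM, b, tau)
        if converged:
            break
        # Step 1: solve (27) for alpha^0 and b with alpha^M fixed.
        alpha0, b = solve_subproblem1(K, y, alphaM, C)
    return alpha0, alphaM, b
```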

The objective function \(Q(\boldsymbol{\alpha })\) is non-decreasing during training: in Step 1, the objective function is maximized with \(\boldsymbol{\alpha }^M\) fixed; therefore, the objective function is non-decreasing after \(\boldsymbol{\alpha }^0\) and b are corrected. In Step 2, the objective function is maximized with \(\boldsymbol{\alpha }^0\) and b fixed; therefore, the objective function is also non-decreasing after \(\boldsymbol{\alpha }^M\) is corrected. Moreover, in Step 2, as long as (45) is not satisfied, the objective function is increased by correcting \(\boldsymbol{\alpha }^M\). Therefore, training stops within a finite number of steps.

The values of the hyperparameters \(\gamma \), C, and \(C_h\) are determined by cross-validation. To make the accuracy improvement over the LS SVM clear, in the following performance evaluation we first determined the values of \(\gamma \) and C with \(C_h=0\), i.e., using the LS SVM. After they were determined, we determined the \(C_h\) value of the MLS SVM. With this method, the cross-validation accuracy of the MLS SVM is not lower than that of the LS SVM, provided that the smallest candidate value of \(C_h\) in cross-validation is sufficiently small.
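A possible realization of this two-stage selection is sketched below. The helper `cv_accuracy(gamma, C, C_h)`, which is assumed to return the fivefold cross-validation accuracy for the given hyperparameter values (with \(C_h=0\) corresponding to the LS SVM), and the grid arguments are hypothetical; only the two-stage order reflects the procedure described above.

```python
import itertools

def select_hyperparameters(cv_accuracy, gamma_grid, C_grid, Ch_grid):
    """Two-stage selection: first (gamma, C) with C_h = 0 (i.e., the LS SVM),
    then C_h for the MLS SVM with (gamma, C) fixed.
    With grids sorted in increasing order, max() returns the first
    (i.e., smallest) maximizer, which breaks ties toward smaller values.
    """
    # Stage 1: gamma and C are determined with C_h = 0.
    best_gamma, best_C = max(
        itertools.product(gamma_grid, C_grid),
        key=lambda gc: cv_accuracy(gc[0], gc[1], 0.0))
    # Stage 2: C_h is determined with gamma and C fixed.
    best_Ch = max(Ch_grid, key=lambda ch: cv_accuracy(best_gamma, best_C, ch))
    return best_gamma, best_C, best_Ch
```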

3 Performance Evaluation

We evaluated whether the idea of minimizing the VC dimension, i.e., minimizing the maximum margin, works to improve the generalization ability of the LS SVM. As classifiers, we used the MLS SVM, the LS SVM, the ML1\(_\textrm{v}\) SVM, which is a variant of the ML1 SVM, the L1 SVM, and the ULDM. As a variant of the MLS SVM, we also used an early-stopping MLS SVM, the MLS\(_\textrm{e}\) SVM, which terminates training when Subproblem 2 converges after the matrix inversion for Subproblem 1 is carried out. This was to check whether early stopping improves the generalization ability when overfitting occurs.

To make the comparison fair, we determined the values of the hyperparameters by fivefold cross-validation on the training data, trained the classifiers with the selected hyperparameter values, and tested the accuracies on the test data. (Because of the computational cost, we did not use nested (double) cross-validation.) We used the two-class and multiclass problems used in [15]. The two-class problems have 100 or 20 pairs of training and test data sets, and the multiclass problems have one pair each. In cross-validation, the candidate values for \(\gamma \) and C were the same as those discussed in [15]. Those for \(C_h\) in the MLS SVM and MLS\(_\textrm{e}\) SVM were \(\{0.001, 0.01, 0.1, 1, 10, 50, 100, 500\}\) instead of \(\{0.1, 1, 10, 50, 100, 500, 1000, 2000\}\) used for the ML1\(_\textrm{v}\) SVM. This was to avoid deteriorating the cross-validation accuracy in determining the \(C_h\) value. In addition, a tie was broken by selecting the smallest value, except for the MLS\(_\textrm{e}\) SVM; for the MLS\(_\textrm{e}\) SVM, the largest value was selected so that minimizing the maximum margin worked.

Table 1 shows the average accuracies for the 13 two-class problems. In the first column, in I/Tr/Te, I shows the number of inputs; Tr, the number of training data; and Te, the number of test data. For each problem, the best average accuracy is shown in bold and the worst is underlined. Each average accuracy is accompanied by its standard deviation. A plus sign attached to an accuracy shows that the MLS SVM is statistically better than the associated classifier; likewise, a minus sign, that it is statistically worse. The “Average” row shows the average accuracy of the associated classifier over the 13 problems, and B/S/W denotes that the associated classifier shows the best accuracy B times, the second best S times, and the worst W times. The “W/T/L” row denotes that the MLS SVM is statistically better than the associated classifier W times, comparable T times, and worse L times.

From the Average measure, the ULDM performed best and the MLS\(_\textrm{e}\) SVM worst, and both the MLS SVM and the MLS\(_\textrm{e}\) SVM were inferior to the LS SVM. From the B/S/W measure, the ULDM was also the best and the LS SVM was the second best. From the W/T/L measure, the MLS SVM was better than the MLS\(_\textrm{e}\) SVM and comparable or almost comparable to the LS SVM, the ML1\(_\textrm{v}\) SVM, and the ULDM. Although the MLS SVM was statistically comparable to or better than the other classifiers, from the Average measure it was inferior to the LS SVM. To investigate why this happened, we compared the average accuracies obtained by cross-validation, which are shown in Table 2. From the table, the MLS SVM showed the best average accuracies for all the problems. This shows that the idea of minimizing the maximum margin worked for the MLS SVM, at least for the cross-validation accuracies. From Table 1, however, the MLS SVM was better than or equal to the LS SVM for only three problems: the diabetes, flare-solar, and splice problems. This shows that in most cases overfitting occurred for the MLS SVM. On the other hand, the MLS\(_\textrm{e}\) SVM was inferior to the LS SVM except for the test data accuracy of the titanic problem. Thus, in most cases, its inferior performance was caused by underfitting.

Table 1. Average accuracies of the test data for the two-class problems
Table 2. Average accuracies by cross-validation for the two-class problems
Table 3. Average accuracies by cross-validation for the multiclass problems

Table 4 shows the accuracies of the test data for the multiclass problems. The original MNIST data set has 6000 data points per class, which makes it difficult to train the low-order-statistics-based classifiers by matrix inversion. Therefore, to reduce the cross-validation time, we switched the roles of the training and test data sets for the MNIST problem and denote it as MNIST (r). From the Average measure, the ML1\(_\textrm{v}\) SVM performed best, the ULDM second best, and the MLS SVM and the MLS\(_\textrm{e}\) SVM worst. From the B/S/W measure, the MLS\(_\textrm{e}\) SVM was the best and the MLS SVM the worst. For the MLS\(_\textrm{e}\) SVM, the accuracy for the thyroid problem was the worst. Comparing the MLS\(_\textrm{e}\) SVM and the LS SVM, the MLS\(_\textrm{e}\) SVM performed better than the LS SVM six times, but the MLS SVM only once. Therefore, the MLS\(_\textrm{e}\) SVM performed better than the LS SVM, but the MLS SVM did not.

Now we examine the results using the cross-validation accuracies shown in Table 3. The accuracies of the MLS SVM were the same as those of the LS SVM except for the USPS problem. Therefore, from Table 4, the idea of minimizing the maximum margin did not contribute to improving the accuracies of the test data except for the blood cell problem. For the MLS\(_\textrm{e}\) SVM, the adverse effect of early stopping occurred for the thyroid problem: the worst accuracy of the test data in Table 4 was caused by underfitting, as seen from Table 3. For the remaining problems, the adverse effect was small or absent.

Table 4. Accuracies of the test data for the multiclass problems

For the experiments on the multiclass problems, we compared the accuracies of the classifiers directly because we had only one training data set and one test data set per problem. It would have been possible to generate multiple training and test data sets from the original data; however, we avoided this because of the long cross-validation time. To strengthen the performance comparison, in the future we would like to compare the classifiers statistically using multiple training and test data sets.

4 Conclusions

In this paper, we proposed the minimal complexity least squares support vector machine (MLS SVM), which is a fusion of the LS SVM and the minimal complexity machine (MCM). Unlike the ML1\(_\textrm{v}\) SVM, which is a fusion of the L1 SVM and the MCM, the MLS SVM did not show an improvement in the accuracy for the test data over the LS SVM, although it showed an improvement in the cross-validation accuracy. However, early stopping of the MLS SVM training sometimes showed an improvement over the LS SVM.

In the future, we would like to compare performance of classifiers statistically using multiple training and test data sets.