1 Introduction

In a classification problem, margins between the training data and the separating hyperplane play an important role. Here, the margin of a data point is defined as its distance to the separating hyperplane; it is nonnegative when the point is correctly classified and negative when it is misclassified. The support vector machine (SVM) [1, 2] maximizes the minimum margin.

Because the SVM does not assume a specific data distribution, the obtained separating hyperplane is optimal under the assumption that the data obey an unknown but fixed distribution. Therefore, if prior knowledge of the distribution is available, exploiting it can improve the generalization ability.

The central idea of the SVM, maximizing the minimum margin, has been applied to improving the generalization performance of other classifiers. For AdaBoost, however, directly controlling the margin distribution, rather than the minimum margin, has been shown to improve the generalization ability [3, 4].

Among the several classifiers that control the margin distribution [5,6,7,8,9,10,11,12], the one in [6] maximizes the margin mean for the training data without constraints. This approach is extended in [11]: the bias and slope of the separating hyperplane are optimized and then equality constraints are introduced, which results in the least squares SVM (LS SVM). According to the computer experiments in [11], without the equality constraints the generalization ability is inferior to that of the SVM. In [9, 10], in addition to maximizing the margin mean, the margin variance is minimized; the resulting classifier is called the large margin distribution machine (LDM). The advantage of the LDM is that its generalization ability is better than or comparable to that of the SVM, but one of its disadvantages is that two hyperparameters are added to the SVM, which lengthens model selection. To solve this problem, an unconstrained LDM (ULDM) is developed in [12], whose number of hyperparameters is the same as that of the SVM.

In this paper, we reformulate the ULDM as a special case of the least squares LDM (LS LDM). As in [12], we formulate the LS LDM so as to maximize the margin mean and minimize the margin variance, in addition to minimizing the square norm of the coefficient vector of the hyperplane and the square sum of the slack variables. As in the LS SVM, we impose equality constraints for the training data. Because hyperparameters are needed for both the square sum of the slack variables and the margin variance, one hyperparameter is added compared with the LS SVM. Eliminating the square sum of the slack variables in the objective function together with the equality constraints, we obtain the ULDM.

By computer experiments, we perform model selection of the LS LDM including the value of zero for the hyperparameter associated with the slack variables, which reduces the LS LDM to the ULDM. By counting how often the zero value is selected, we judge whether the equality constraints are necessary for improving the generalization ability.

In Sect. 2, we summarize the LS SVM. In Sect. 3, we explain the LDM and then discuss its variants: the LS LDM and the ULDM. In Sect. 4, we evaluate the effect of the equality constraints by comparing the ULDM with the LS LDM using two-class problems.

2 Least Squares Support Vector Machines

Let the decision function in the feature space be

$$\begin{aligned} f(\mathbf{x})= \mathbf{w}^{\top } {{\varvec{\phi }}}(\mathbf{x}) + b, \end{aligned}$$
(1)

where \({{\varvec{\phi }}}(\mathbf{x})\) maps the m-dimensional input vector \(\mathbf{x}\) into the l-dimensional feature space, \(\mathbf{w}\) is the l-dimensional coefficient vector, \({\top }\) denotes the transpose of a vector, and b is the bias term.

Let the M training input-output pairs be \(\{\mathbf{x}_i, y_i\}\) (\(i = 1, \ldots , M\)), where \(\mathbf{x}_i\) are training inputs and \(y_i\) are the associated labels and \(y_i = 1\) for Class 1 and \(-1\) for Class 2.

The margin of \(\mathbf{x}_i\), \(\delta _i\), is defined as the distance from the separating hyperplane \( f(\mathbf{x})=0\), and is given by

$$\begin{aligned} \delta _i = y_i \, f(\mathbf{x}_i)/\Vert \mathbf{w}\Vert . \end{aligned}$$
(2)

If \(\delta \, \Vert \mathbf{w}\Vert =1\), where \(\delta \) is the minimum margin among \(\delta _i\) \((i=1,\ldots ,M)\), maximizing \(\delta \) is equivalent to minimizing \(\Vert \mathbf{w}\Vert \). To make \({\delta }_i \, \Vert \mathbf{w}\Vert \) larger than or equal to 1, \(\mathbf{x}_i\) needs to satisfy \(y_i \, f(\mathbf{x}_i) \ge 1\). Then, allowing misclassification, the LS SVM is formulated in the primal form as follows:

$$\begin{aligned}&\text{ minimize }\quad Q(\mathbf{w}, b, {{\varvec{\xi }}}) = \frac{1}{2}\mathbf{w}^{\top } \mathbf{w} + \frac{C}{2}\sum _{i=1}^M\xi _i^2 \end{aligned}$$
(3)
$$\begin{aligned}&\text{ subject } \text{ to }\quad y_i \,f(\mathbf{x}_i) = 1-\xi _i \quad \text{ for }\quad i = 1,\ldots ,M, \end{aligned}$$
(4)

where \(Q(\mathbf{w}, b, {{\varvec{\xi }}})\) is the objective function, C is the margin parameter that controls the trade-off between the training error and the generalization ability, \(\xi _i \) are the slack variables for \(\mathbf{x}_i\), and \({{\varvec{\xi }}} = (\xi _1,\ldots ,\xi _M)^{\top }\). If we change \(\xi _i^2\) to \(\xi _i\) and C/2 to C in (3), and the equality constraints in (4) to inequality constraints, we obtain the L1 SVM.

Solving the equality constraint in (4) for \(\xi _i\) and substituting it into the objective function (3), we obtain the unconstrained optimization problem:

$$\begin{aligned} \mathrm{minimize}\quad Q(\mathbf{w},b)= & {} \frac{1}{2}{} \mathbf{w}^{\top }{} \mathbf{w}+ \frac{C}{2}\sum _{i=1}^M (1-y_i \,f(\mathbf{x}_i))^2. \end{aligned}$$
(5)

The solution of the LS SVM is obtained by solving a set of linear equations, and its generalization performance is known to be comparable to that of the L1 SVM [2]; unlike the L1 SVM, however, the solution is not sparse.
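
As an illustration, for a linear kernel the unconstrained problem (5) can be solved directly by setting its gradient with respect to \(\mathbf{w}\) and b to zero, which gives a set of linear equations. The following is a minimal numpy sketch under that assumption; the function name and variable names are ours, not the authors' code.

```python
import numpy as np

def train_ls_svm_linear(X, y, C):
    """Minimize (5) for a linear kernel, i.e. f(x) = w^T x + b.

    Setting the gradient of (5) to zero (and using y_i^2 = 1) gives
        (I/C + sum_i x_i x_i^T) w + (sum_i x_i) b = sum_i y_i x_i,
        (sum_i x_i)^T w + M b = sum_i y_i,
    which is solved here as one (m+1) x (m+1) linear system.
    """
    M, m = X.shape
    A = np.zeros((m + 1, m + 1))
    A[:m, :m] = np.eye(m) / C + X.T @ X
    A[:m, m] = X.sum(axis=0)
    A[m, :m] = X.sum(axis=0)
    A[m, m] = M
    rhs = np.concatenate([X.T @ y, [y.sum()]])
    sol = np.linalg.solve(A, rhs)
    return sol[:m], sol[m]          # w, b

# usage (labels must be +1/-1):
#   w, b = train_ls_svm_linear(X_train, y_train, C=10.0)
#   predictions = np.sign(X_test @ w + b)
```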

In the following, we use the LS SVM to derive the LS LDM, a variant of the LDM, and we also use it as a reference classifier in evaluating the performance of the ULDM.

3 Large Margin Distribution Machines and Their Variants

In this section, first we briefly summarize the LDM. Then, we define the LS LDM and ULDM in a way slightly different from [12].

3.1 Large Margin Distribution Machines

The LDM [9] maximizes the margin mean and minimizes the margin variance.

The margin mean \(\bar{\delta }\) and margin variance \(\hat{\delta }\) are given, respectively, by

$$\begin{aligned} \bar{\delta }= & {} \frac{1}{M}\sum _{i=1}^M \delta _i, \end{aligned}$$
(6)
$$\begin{aligned} \hat{\delta }= & {} \frac{1}{M}\sum _{i=1}^M \left( \delta _i - \bar{\delta } \right) ^2 = \frac{1}{M}\sum _{i=1}^M \delta _i^2 - \bar{\delta }^2. \end{aligned}$$
(7)

Here, instead of (2), we consider the margin as

$$\begin{aligned} \delta _i = y_i \, f(\mathbf{x}_i). \end{aligned}$$
(8)
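
As a quick numerical check of (6) and (7), the two expressions for the margin variance coincide; the decision-function values below are random placeholders used only to illustrate the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=200)   # labels
f_vals = rng.normal(size=200)           # placeholder values of f(x_i)

delta = y * f_vals                      # margins, as in (8)
delta_mean = delta.mean()               # margin mean (6)
var_centered = ((delta - delta_mean) ** 2).mean()
var_shortcut = (delta ** 2).mean() - delta_mean ** 2
assert np.isclose(var_centered, var_shortcut)   # the two forms of (7) agree
```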

Similar to the L1 SVM, the LDM is formulated as follows:

$$\begin{aligned}&\mathrm{minimize}\quad Q(\mathbf{w}, b, {{\varvec{\xi }}})=\frac{1}{2}{} \mathbf{w}^{\top }{} \mathbf{w}- \lambda _1 \, \bar{\delta }+\frac{1}{2} \, \lambda _2 \, \hat{\delta }+{C}\sum _{i=1}^M\xi _i \end{aligned}$$
(9)
$$\begin{aligned}&\mathrm{subject \, to}\quad y_i \, f(\mathbf{x}_i)\ge 1-\xi _i \qquad \mathrm{for}~~i=1,\ldots ,M, \end{aligned}$$
(10)

where \(\lambda _1\) and \(\lambda _2\) are parameters that control maximization of the margin mean and minimization of the margin variance, respectively. Compared with the L1 SVM, the second and third terms are added to the objective function.

Because the LDM uses all the training data to calculate the margin mean and the margin variance, the solution is dense. Furthermore, because four parameter values (including one kernel parameter value), instead of two, need to be determined, model selection requires more time than for the L1 SVM.

3.2 Least Squares Large Margin Distribution Machines

The LS LDM, which maximizes the margin mean and minimizes the margin variance, is obtained by replacing the slack-variable sum in (9) with the square sum and the inequality constraints in (10) with equality constraints as follows:

$$\begin{aligned}&\mathrm{minimize}\quad Q(\mathbf{w}, b, {{\varvec{\xi }}})=\frac{1}{2}\mathbf{w}^{\top }{} \mathbf{w}- \lambda _1 \, \bar{\delta }+\frac{1}{2} \, \lambda _2 \, \hat{\delta }+\frac{C}{2}\sum _{i=1}^M\xi _i^2 \end{aligned}$$
(11)
$$\begin{aligned}&\mathrm{subject \, to}\quad y_i \, f(\mathbf{x}_i)= 1-\xi _i \qquad \mathrm{for}~~i=1,\ldots ,M. \end{aligned}$$
(12)

Solving the equality constraint in (12) for \(\xi _i\) and substituting it into the objective function (11) yields

$$\begin{aligned} \mathrm{minimize}\quad Q(\mathbf{w},b)= & {} \frac{1}{2}{} \mathbf{w}^{\top }{} \mathbf{w}- \lambda _1 \, \bar{\delta }+\frac{1}{2} \, \lambda _2 \, \hat{\delta }+ \frac{C}{2}\sum _{i=1}^M (1-y_i \,f(\mathbf{x}_i))^2\nonumber \\= & {} \frac{1}{2}{} \mathbf{w}^{\top }{} \mathbf{w}- \lambda _1 \, \bar{\delta }+\frac{1}{2} \, \lambda _2 \, \hat{\delta }+ \frac{C}{2}\sum _{i=1}^M (\delta _i -1)^2. \end{aligned}$$
(13)

In the above objective function, the last term, which is the variance of the margins around the minimum margin (the value 1), works similarly to the third term, which contains the variance \(\hat{\delta }\) of the margins around the margin mean \(\bar{\delta }\).

Now substituting (6), (7), and (8) into the objective function of (13) and deleting the constant term, we obtain

$$\begin{aligned} Q(\mathbf{w},b)= & {} \frac{1}{2}{} \mathbf{w}^{\top }{} \mathbf{w} +\frac{\lambda _2}{2\, M}\left( 1+\frac{M \,C}{\lambda _2} \right) \sum _{i=1}^M f^2(\mathbf{x}_i)- \frac{\lambda _2}{2}\left( \frac{1}{M}\sum _{i=1}^M y_i \,f(\mathbf{x}_i)\right) ^2\nonumber \\&- \left( \frac{\lambda _1}{M}+C \right) \sum _{i=1}^M y_i \,f(\mathbf{x}_i). \end{aligned}$$
(14)
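
The substitution can be verified term by term: using \(\delta _i = y_i \, f(\mathbf{x}_i)\), \(y_i^2 = 1\), and (6) and (7),

$$\begin{aligned} \frac{C}{2}\sum _{i=1}^M \left( \delta _i -1\right) ^2= & {} \frac{C}{2}\sum _{i=1}^M f^2(\mathbf{x}_i)-C\sum _{i=1}^M y_i \,f(\mathbf{x}_i)+\frac{M\,C}{2},\\ \frac{\lambda _2}{2} \, \hat{\delta }= & {} \frac{\lambda _2}{2\,M}\sum _{i=1}^M f^2(\mathbf{x}_i)- \frac{\lambda _2}{2}\left( \frac{1}{M}\sum _{i=1}^M y_i \,f(\mathbf{x}_i)\right) ^2, \qquad \lambda _1 \, \bar{\delta }=\frac{\lambda _1}{M}\sum _{i=1}^M y_i \,f(\mathbf{x}_i). \end{aligned}$$

Collecting terms, the coefficient of \(\sum _{i=1}^M f^2(\mathbf{x}_i)\) is \(\lambda _2/(2M)+C/2=(\lambda _2/(2M))\,(1+M\,C/\lambda _2)\), the coefficient of the linear term is \(-(\lambda _1/M+C)\), and the constant \(M\,C/2\) is the term that is deleted.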

The first three terms in (14) are quadratic and the last term is linear with respect to \(\mathbf{w}\) and b. Therefore, the coefficient of the linear term only scales the decision function obtained by minimizing (14) with respect to \(\mathbf{w}\) and b, and it does not change the resulting classifier. Dividing (14) by \(\lambda _2\) and normalizing the coefficient of the last term to 1, we obtain

$$\begin{aligned} Q(\mathbf{w},b)= & {} \frac{1}{2\,C_\mathrm{m}}{} \mathbf{w}^{\top }{} \mathbf{w} +\frac{1+C_\mathrm{e}}{2\,M} \,\sum _{i=1}^M f^2(\mathbf{x}_i)\nonumber \\&- \frac{1}{2}\left( \frac{1}{M}\sum _{i=1}^M y_i \,f(\mathbf{x}_i)\right) ^2 - \sum _{i=1}^M y_i \,f(\mathbf{x}_i). \end{aligned}$$
(15)

Here, \(C_\mathrm{m} = \lambda _2\) and \(C_\mathrm{e} = M\,C/\lambda _2\).

According to the above formulation of the LS LDM, the parameter \(\lambda _1\) in (13) plays no independent role in controlling the margin mean; it is absorbed into the scaling of the decision function. Therefore, the three hyperparameters in (11) and (12) are reduced to two.

3.3 Unconstrained Large Margin Distribution Machines

Deleting the square sum of the slack variables in (11) and the equality constraints in (12), we obtain the unconstrained LDM (ULDM) as follows:

$$\begin{aligned} \mathrm{minimize}\quad Q(\mathbf{w}, b)= & {} \frac{1}{2\, C_\mathrm{m}}\mathbf{w}^{\top }{} \mathbf{w}- M\, \bar{\delta }+\frac{1}{2} \hat{\delta }\nonumber \\= & {} \frac{1}{2\, C_\mathrm{m}}{} \mathbf{w}^{\top }{} \mathbf{w} +\frac{1}{2\,M} \,\sum _{i=1}^M f^2(\mathbf{x}_i)- \frac{1}{2}\left( \frac{1}{M}\sum _{i=1}^M y_i \,f(\mathbf{x}_i)\right) ^2\nonumber \\&- \sum _{i=1}^M y_i \,f(\mathbf{x}_i). \end{aligned}$$
(16)

Here, we multiply \(\bar{\delta }\) by M so that the coefficient of the linear term becomes 1.

Comparing (15) and (16), we see that the ULDM is obtained from the LS LDM by setting \(C_\mathrm{e} =0\) (i.e., \(C=0\)).

Because the LS LDM includes the ULDM, we derive the optimality conditions for (15) in the empirical feature space [2]. Let \(\{\mathbf{z}_1,\ldots , \mathbf{z}_N\}\) be a subset of \(\{\mathbf{x}_1,\ldots , \mathbf{x}_M\}\), where \(N\le M\), and let \(\{{\varvec{\phi }}(\mathbf{z}_1),\ldots , {\varvec{\phi }}(\mathbf{z}_N)\}\) span the empirical feature space. Then the mapping function that maps the input space into the empirical feature space is expressed by

$$\begin{aligned} \mathbf{h}(\mathbf{x}) = (K(\mathbf{x},\mathbf{z}_1),\ldots ,K(\mathbf{x},\mathbf{z}_N))^{\top }, \end{aligned}$$
(17)

where \(K(\mathbf{x},\mathbf{z}_j)={{\varvec{\phi }}}^{\top }(\mathbf{x}) \, {\varvec{\phi }}(\mathbf{z}_j)\). Then the decision function (1) is expressed by

$$\begin{aligned} f(\mathbf{x})= \mathbf{w}^{\top } \mathbf{h}(\mathbf{x}) + b. \end{aligned}$$
(18)

For a linear kernel with \(m < N\), to improve sparsity, we use the Euclidean coordinates \(\mathbf{z}_1 = (1,0,\ldots ,0)^{\top }, \ldots , \mathbf{z}_m = (0,\ldots ,0,1)^{\top }\) and the identity mapping \(\mathbf{h}(\mathbf{x}) = \mathbf{x}\).
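
A minimal sketch of the mapping (17) and the decision function (18) is given below; the kernel is passed in as a function, and the names are ours, not from the paper.

```python
import numpy as np

def empirical_feature_map(X, Z, kernel):
    """Map each row of X into the empirical feature space spanned by
    phi(z_1), ..., phi(z_N), as in (17).

    X: (n, m) inputs, Z: (N, m) chosen centers, and kernel(x, Z) returns
    the N kernel values K(x, z_1), ..., K(x, z_N).
    """
    return np.array([kernel(x, Z) for x in X])      # shape (n, N)

def decision_values(X, Z, kernel, w, b, linear=False):
    """Evaluate f(x) = w^T h(x) + b of (18).  For a linear kernel with
    m < N, the text uses the identity mapping h(x) = x instead."""
    H = X if linear else empirical_feature_map(X, Z, kernel)
    return H @ w + b
```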

We derive the optimality conditions of the LS LDM given by (15), using (18):

$$\begin{aligned} \frac{\partial Q(\mathbf{w}, b)}{\partial \mathbf{w}}= & {} \left( \frac{1}{C_\mathrm{m}}I_N+(1+C_\mathrm{e})\overline{K^2}-\overline{K^y}^{\top } \overline{K^y}\right) \, \mathbf{w}\nonumber \\&+\left( (1+C_\mathrm{e})\bar{K}^{\top } -\bar{y} \, \overline{K^y}^{\top }\right) \,b -\overline{K^y}^{\top } =0, \end{aligned}$$
(19)
$$\begin{aligned} \frac{\partial Q(\mathbf{w}, b)}{\partial b}= & {} \left( (1+C_\mathrm{e})\,\bar{K} -\bar{y} \,\overline{K^y} \right) \,\mathbf{w}+\left( 1+C_\mathrm{e}-\bar{y}^2 \right) b - \bar{y}=0, \end{aligned}$$
(20)

where \(I_N\) is the \(N \times N\) unit matrix,

$$\begin{aligned}&\overline{K^2}=\frac{1}{M}\sum _{i=1}^M K_i^{\top } K_i, \qquad \bar{K}=\frac{1}{M}\sum _{i=1}^M K_i,\quad \overline{K^y}=\frac{1}{M}\sum _{i=1}^My_i K_i, \quad \bar{y}=\frac{1}{M}\sum _{i=1}^My_i,\nonumber \\&K_{i}=(K_{i1},\ldots ,K_{iN})=\mathbf{h}^{\top }(\mathbf{x}_i),\nonumber \\&K_{ij}=K(\mathbf{x}_i, \mathbf{z}_j) \quad \text{ for } \,\,i=1,\ldots ,M, \,\, j=1,\ldots ,N . \end{aligned}$$
(21)

In matrix form, (19) and (20) are given by

$$\begin{aligned}&\left( \begin{array}{c@{}c} \frac{1}{C_\mathrm{m}}I_N+(1+C_\mathrm{e})\,\overline{K^2}-\overline{K^y}^{\top } \overline{K^y}&{}(1+C_\mathrm{e})\, \bar{K}^{\top } -\bar{y} \, \overline{K^y}^{\top } \\ (1+C_\mathrm{e})\, \bar{K} -\bar{y} \,\overline{K^y} &{}1+C_\mathrm{e}-\bar{y}^2 \end{array} \right) \left( \mathbf{w} \atop b\right) \nonumber \\&=\left( \overline{K^y}^{\top } \atop \bar{y} \right) \end{aligned}$$
(22)

If \(C=0\), (22) reduces to the ULDM. The difference between (22) with \(C=0\) and the ULDM in [12] is that \(1/C_\mathrm{m}\) is used in (22) instead of \(C_\mathrm{m}\).

Because the coefficient matrix of (22) is positive definite, we can solve (22) for \(\mathbf{w}\) and b by the coordinate descent method [13] as well as by matrix inversion.
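
Because all quantities in (21) are simple averages over the training data, (22) can be set up and solved in a few lines. The following is a minimal numpy sketch (our own naming, not code from [12] or [13]): it builds the averages from the matrix H whose rows are \(\mathbf{h}^{\top }(\mathbf{x}_i)\) and solves (22) by matrix inversion; setting \(C_\mathrm{e}=0\) gives the ULDM.

```python
import numpy as np

def train_ls_ldm(H, y, C_m, C_e=0.0):
    """Solve the linear system (22) for (w, b) by matrix inversion.

    H: (M, N) matrix whose i-th row is h(x_i) = K_i of (21),
    y: (M,) labels in {+1, -1}.  C_e = 0 gives the ULDM.
    """
    M, N = H.shape
    K2_bar = H.T @ H / M                     # \overline{K^2}
    K_bar = H.mean(axis=0)                   # \bar{K}
    Ky_bar = (y[:, None] * H).mean(axis=0)   # \overline{K^y}
    y_bar = y.mean()                         # \bar{y}

    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = np.eye(N) / C_m + (1 + C_e) * K2_bar - np.outer(Ky_bar, Ky_bar)
    A[:N, N] = (1 + C_e) * K_bar - y_bar * Ky_bar
    A[N, :N] = (1 + C_e) * K_bar - y_bar * Ky_bar
    A[N, N] = 1 + C_e - y_bar ** 2
    rhs = np.concatenate([Ky_bar, [y_bar]])

    # the coefficient matrix is positive definite (see text), so the
    # system has a unique solution
    sol = np.linalg.solve(A, rhs)
    return sol[:N], sol[N]                   # w, b
```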

In model selection, we need to determine the values of \(C_\mathrm{m}\), C (through \(C_\mathrm{e}\)), and \(\gamma \) in the kernel. To speed up model selection, in addition to grid search over the three values, we consider line search: after determining the values of \(C_\mathrm{m}\) and \(\gamma \) with \(C=0\) by grid search, we determine the C value with \(C_\mathrm{m}\) and \(\gamma \) fixed at the determined values.

4 Performance Evaluation

We compare the performance of the ULDM with that of the LS LDM to clarify whether the equality constraints in the LS LDM are necessary. We also compare the ULDM with the LS SVM and the L1 SVM. Because of space limitations, we use only two-class problems.

4.1 Conditions for Experiment

Because the coefficient matrix of (22) is positive definite, (22) can be solved by the coordinate descent method [9]. However, to avoid inaccuracies caused by improper convergence, we train the ULDM and the LS LDM by matrix inversion. We also train the LS SVM given by (3) and (4) by matrix inversion. For the L1 SVM, we use SMO-NM [14], which fuses SMO (sequential minimal optimization) and NM (Newton's method).

We use radial basis function (RBF) kernels: \(K(\mathbf{x},\mathbf{x}') = \exp (-\gamma \, \Vert \mathbf{x}-\mathbf{x}'\Vert ^{2}/m)\), where m is the number of inputs, used for normalization, and \(\gamma \) controls the spread of the radius. We carry out model selection by fivefold cross-validation. To speed up cross-validation for the LS LDM, which has three hyperparameters including \(\gamma \) for the RBF kernel, we use line search in addition to grid search of the optimal values of C, \(C_\mathrm{m}\), and \(\gamma \). In line search, after determining the values of \(C_\mathrm{m}\) and \(\gamma \) by grid search with \(C=0\), we determine the optimal value of C by cross-validation. Therefore, the \(C_\mathrm{m}\) and \(\gamma \) values for the ULDM and for the LS LDM with line search are the same.
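
For reference, a minimal sketch of the normalized RBF kernel defined above (the function name is ours):

```python
import numpy as np

def rbf_kernel(x, Z, gamma, m):
    """K(x, z_j) = exp(-gamma * ||x - z_j||^2 / m) for every row z_j of Z,
    where m is the number of inputs used for normalization."""
    return np.exp(-gamma * np.sum((Z - x) ** 2, axis=1) / m)
```

In the sketch of (17) and (18) given earlier, one would pass, for example, kernel = lambda x, Z: rbf_kernel(x, Z, gamma=10, m=Z.shape[1]).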

We select the \(\gamma \) value from {0.01, 0.1, 0.5, 1, 5, 10, 15, 20, 50, 100, 200}, the C value from {0.1, 1, 10, 50, 100, 500, 1000, 2000}, and the \(C_\mathrm{m}\) value from \(\{0.1, 1, 10, 100, 1000, 10^4,10^6,10^8\}\). For the LS LDM, we also include 0 as a candidate C value. If 0 is selected, the LS LDM reduces to the ULDM.
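
Putting the above together, the line-search model selection can be sketched as follows. The helper `cross_val_accuracy`, which would return the fivefold cross-validation accuracy for given hyperparameter values, is hypothetical and stands in for the training and validation code.

```python
def line_search_model_selection(cross_val_accuracy):
    """Line search as described above; cross_val_accuracy(C_m, gamma, C) is a
    user-supplied (hypothetical) helper returning fivefold CV accuracy."""
    gamma_grid = [0.01, 0.1, 0.5, 1, 5, 10, 15, 20, 50, 100, 200]
    C_m_grid = [0.1, 1, 10, 100, 1000, 1e4, 1e6, 1e8]
    C_grid = [0, 0.1, 1, 10, 50, 100, 500, 1000, 2000]

    # step 1: grid search over (C_m, gamma) with C = 0, i.e. the ULDM
    best_cm, best_gamma = max(
        ((cm, g) for cm in C_m_grid for g in gamma_grid),
        key=lambda p: cross_val_accuracy(C_m=p[0], gamma=p[1], C=0),
    )
    # step 2: search C with (C_m, gamma) fixed; if 0 is selected,
    # the LS LDM reduces to the ULDM
    best_C = max(C_grid,
                 key=lambda c: cross_val_accuracy(C_m=best_cm,
                                                  gamma=best_gamma, C=c))
    return best_cm, best_gamma, best_C
```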

We measure the average CPU time per data set, including model selection by fivefold cross-validation, training of a classifier, and classification of the test data by the trained classifier. We used a personal computer with a 3.4 GHz CPU and 16 GB of memory.

4.2 Results for Two-Class Problems

Table 1 lists the numbers of inputs, training data, test data, and data set pairs of the two-class problems [15]. Each data set pair consists of a training data set and a test data set. Using the training data set, we determine the parameter values by cross-validation, train the classifiers with the determined parameter values, and evaluate the performance using the test data set. Then we calculate the average accuracy and the standard deviation over all the test data sets.

Table 1. Benchmark data for two-class problems

Table 2 lists the parameter values determined by cross-validation. In the first row, (l) and (g) indicate that the three hyperparameters of the LS LDM are determined by line search and grid search, respectively. Because each classification problem consists of 100 or 20 training and test data pairs, we show the most frequently selected parameter values. For the LS LDM, the most frequently selected value of C is 0. Thus, in the table, we show in parentheses the number of times that \(C\ne 0\) is selected.

As discussed above, the \(C_\mathrm{m}\) and \(\gamma \) values for the ULDM and the LS LDM (l) are the same. Therefore, if \(C\ne 0\) is never selected, the LS LDM (l) reduces to the ULDM for all the training data sets; this happens for seven problems. Except for the german problem, the value of zero is selected for C frequently. For the LS LDM (g), too, the value of zero is frequently selected for C, and thus the LS LDM (g) also frequently reduces to the ULDM. These results indicate that the equality constraints are not important in the LS LDM.

The \(\gamma \) values for the three classifiers are very similar and so are the C values for the LS and L1 SVMs.

Table 2. Most-frequently-selected parameter values for the two-class problems. The numeral in parentheses shows the number of times that \(C \ne 0\) is selected.

In the following, we show the distributions of the C, \(C_\mathrm{m}\), and \(\gamma \) values for the german data, for which \(C=0\) is least frequently selected for the LS LDM.

Table 3 shows the C value distributions for the german data. The distributions for the LS LDM with line search and with grid search are very similar. Values of C smaller than or equal to 1 are selected 93 times and 90 times for the LS LDM (l) and the LS LDM (g), respectively. Therefore, C does not affect the generalization ability much. The distributions for the LS SVM and the L1 SVM are also similar to each other; although the value of 1 is frequently selected, larger values are also selected. This means that the value of C directly affects the generalization ability of these classifiers.

Table 3. Distribution of C values for the german data

Table 4 shows the distributions of the \(C_\mathrm{m}\) values for the ULDM and the LS LDM (g). Both distributions are similar. The distribution for the LS LDM (l) is the same as that for the ULDM.

Table 4. Distribution of \(C_\mathrm{m}\) values for the german data

Table 5 lists the \(\gamma \) value distributions for the german data. Values of \(\gamma \) larger than 20 are not selected for any of the four classifiers. The distributions for the ULDM (LS LDM (l)) and the LS LDM (g) are similar, although smaller values are selected for the ULDM (LS LDM (l)). The distributions for the LS SVM and the L1 SVM are similar to each other and tend to concentrate on smaller values than those for the ULDM (LS LDM (l)) and the LS LDM (g).

Table 5. Distribution of \(\gamma \) values for the german data

Table 6 shows the average accuracies and their standard deviations for the five classifiers with RBF kernels. For each problem, the best average accuracy is shown in bold and the worst is underlined. The "Average" row shows the average of the 13 average accuracies, and the two numerals in parentheses show the numbers of times the best and worst accuracies are obtained, in that order. We performed Welch's t test at the 95% confidence level. The "W/T/L" row shows the results; W, T, and L denote the numbers of problems for which the ULDM is statistically better than, comparable to, and worse than the LS LDM (l), LS LDM (g), LS SVM, and L1 SVM, respectively. The symbols "\(+\)" and "−" in the L1 SVM column show that the ULDM is statistically better and worse than the L1 SVM, respectively.

If we ignore differences of 0.01 in the average accuracies and the standard deviations, the results of the ULDM and those of the LS LDM (l) differ only for the german problem, whereas for the ULDM and the LS LDM (g), only the ringnorm problem gives identical results.

From the standpoint of the average accuracy, the ULDM and the LS LDM (l) perform best and the LS SVM worst. From the standpoint of the statistical analysis, however, the ULDM is statistically comparable with the remaining four classifiers.

Therefore, because the LS LDM frequently reduces to the ULDM and the ULDM is comparable with the LS LDM, the LS LDM can be replaced with the ULDM.

Table 6. Accuracy comparison of the two-class problems for RBF kernels
Table 7. Execution time comparison of the two-class problems (in seconds)

Table 7 shows the average CPU time per data set for calculating the accuracies. The last row shows the number of times each classifier gives the best/worst execution time. On average, the LS SVM is the fastest and the LS LDM (g) the slowest, because of the slow model selection by grid search over three hyperparameters. Because the ULDM and the LS SVM are trained by solving sets of linear equations with the same number of variables, the slower training of the ULDM is due to the more complex calculation in setting up the coefficients of the linear equations. Because the matrix size is the number of training data plus one, and because the numbers of training data are smaller than 1000 except for the image and splice data sets, the execution time is relatively short.

The L1 SVM is trained by an iterative method. Therefore, the training speed depends on the parameter values, and for the titanic data, training of the L1 SVM is the slowest. For the ULDM, LS LDM, and LS SVM, the execution time depends on the number of training data, not on the parameter values.

5 Conclusions

The unconstrained large margin distribution machine (ULDM) maximizes the margin mean and minimizes the margin variance without constraints.

In this paper, we investigated the effect of the constraints on the ULDM. To do so, we derived the ULDM as a special case of the least squares (LS) LDM, which is the least squares version of the LDM. If the hyperparameter associated with the constraints is set to zero, the LS LDM reduces to the ULDM. In computer experiments, we carried out model selection of the LS LDM including the zero value of the hyperparameter as a candidate. For the two-class problems with 100 or 20 data set pairs, in most cases the LS LDM reduced to the ULDM, and when it did not, there was no statistical difference in generalization ability. According to these results, the effect of the equality constraints on the generalization ability of the LS LDM is considered to be small, and the ULDM can be used instead of the LS LDM.