1 Introduction

Support vector machines (SVMs) are powerful algorithms for classification and regression problems and have shown state-of-the-art performance in a large number of applications [4, 6, 15, 16]. SVM was first proposed by Vapnik et al. [3, 19] about twenty years ago and has remained an active research topic ever since, owing to its good generalization ability. Based on the margin maximization principle, SVM aims to find two support hyperplanes whose separation is made as large as possible. Later, the least squares SVM (LSSVM) [14] was presented, which converts the inequality constraints into equalities and thus leads to solving a system of linear equations. LSSVM seeks a proximal hyperplane for each class and makes the distances between points and their corresponding proximal hyperplanes as small as possible, while also maximizing the distance between the two hyperplanes. LSSVM is much faster to solve than SVM because it avoids a quadratic programming problem, but its classification performance is slightly worse than that of SVM [12, 13, 18].

As another popular research area, metric learning has attracted significant attention recently [1, 20, 24]. The aim of metric learning is to learn a data-dependent metric matrix M to redefine the distance between two points \(x_1\), \(x_2\) as

$$ \sqrt{(x_1-x_2)^{\rm T} M (x_1-x_2)}$$

In the Euclidean distance, M is an identity matrix. The goal of the learned matrix is to expand inter-class distances and shrink intra-class distances. Classification becomes easier after metric learning, so performance improves naturally [2, 5, 9, 10]. Much work related to metric learning has been done to improve the performance of SVM [25, 27] and k-NN [8, 23, 26] algorithms. SVM can be improved by learning a new RBF kernel [25], since the RBF kernel in nonlinear SVM is a function of distance. Large margin nearest neighbor (LMNN) classification [17, 21, 22] can be viewed as a metric learning-based counterpart to SVM; it replaces the linear classification in SVM with k-NN classification. MLSVM [11] constructs a metric learning problem with local neighborhood constraints based on the formulation of SVM. PCML and NCML [28] are formulated as kernel classification problems that can be solved by training SVMs iteratively.

In fact, SVM has a close relationship with metric learning. From the metric learning viewpoint, the margin maximization principle in SVM is equivalent to maximizing the inter-class distance, but SVM makes no attempt to minimize the intra-class distance, a limitation addressed in [7]. There, \(\varepsilon \)-SVM was introduced to improve SVM based on metric learning: it not only penalizes points violating the margin, but also punishes points far from the support hyperplanes. In this paper, we argue that LSSVM has a closer relation to metric learning than traditional SVM. LSSVM makes the two proximal hyperplanes close to their corresponding classes, which can be seen as minimizing the within-class distance from the metric learning perspective, and pulling the two proximal hyperplanes apart is equivalent to maximizing the between-class distance. The difference is that metric learning aims at learning a matrix M (full or sparse), while LSSVM only learns a vector w (or a diagonal matrix W with diag(W) = w). So LSSVM can be regarded as a relaxed version of a metric learning algorithm. LSSVM can be improved by strengthening its formulation of the between-class distance: the \(1/\Vert w\Vert ^2\) term cannot represent the inter-class distance well, since the two center hyperplanes have invaded the interior of each class, and it is merely a regularization term from the metric learning view. Traditional SVM implements the maximal margin principle: it seeks two boundary hyperplanes, one per class, and maximizes the distance between them, which can be regarded as expanding the inter-class distance as much as possible from the metric learning view. However, SVM makes no effort to shrink the intra-class distance, which supports the claim that LSSVM is closer to metric learning than SVM.

In this paper, we analyze the relation between LSSVM and metric learning in detail. The metric learning problem can be transformed into LSSVM by relaxing the pairwise constraints. In light of the advantages of metric learning in measuring between-class distance, we add constraints to the primal problem of LSSVM to control the inter-class distance. The ADMM algorithm is used to solve the newly proposed approach, ML-LSSVM; ADMM solves a convex programming problem by breaking it into many smaller pieces and obtains the solution of each subproblem in much less time. A discussion on the relation between ML-LSSVM and LMNN is presented to show that the formulation of ML-LSSVM is similar to local LMNN.

The rest of the paper is organized as follows. Background on metric learning and SVM is introduced in Sect. 2. The relations between LSSVM and metric learning, and between ML-LSSVM and LMNN, are discussed in Sect. 3, where the solution of ML-LSSVM by ADMM is also presented. In Sect. 4, extensive numerical experiments compare the new method with several algorithms and verify the advantages of ML-LSSVM. Conclusions are drawn in Sect. 5.

2 Background

For a training set with c classes

$$ T = \{(x_1, y_1), \ldots ,(x_m, y_m )\},$$
(2.1)

where \((x_i,y_i) \in R^n \times \{ 1, 2, \ldots , c\}, i=1, \ldots , m\), m is the total number of samples and n is the number of features. Define an index set \(I=\{1,\ldots ,m\}\) and the following pairwise data sets

$$S= \{(x_i,x_j) | y_i=y_j \} $$
(2.2)
$$ D= \{(x_i,x_j) | y_i \neq\,y_j \}$$
(2.3)

where S contains data pairs with the same label and D contains data pairs with different labels.

2.1 Metric learning (ML)

The purpose of metric learning is to find a proper distance metric, instead of the Euclidean metric, to measure the distance between data pairs. For a learned distance metric \(M \in R^{n \times n}\), the distance between two data points \(z_1\) and \( z_2 \) with respect to M is represented by

$$ d_M(z_1,z_2) = \sqrt{(z_1 -z_2)^{\rm T} M (z_1 -z_2)}$$
(2.4)

where M is a positive semi-definite matrix and satisfies the following properties

$$d_M(z_1,z_2) \ge 0 $$
(2.5)
$$d_M(z_1,z_2)=d_M(z_2,z_1) $$
(2.6)
$$d_M(z_1,z_2)+d_M(z_2,z_3 ) \ge d_M(z_1,z_3)$$
(2.7)
$$d_M(z_1,z_2)=0 \Leftrightarrow z_1=z_2$$
(2.8)

In supervised metric learning, one of the most representative works is that of Xing et al. [23], in which the method of pairwise constraints is proposed. The algorithm, formulated as a convex programming problem, aims to find a global data-dependent distance metric that minimizes the sum of distances between the data pairs in (2.2), subject to the constraint that the data pairs in (2.3) are well separated. The following convex optimization problem is constructed

$$ \min \limits _{M}\quad\sum \limits _{(x_i,x_j) \in S} d_M^2(x_i,x_j) $$
(2.9)
$$\hbox{s.t.}\quad\sum \limits _{(x_i,x_j) \in D} d_M^2(x_i,x_j) \ge 1,$$
(2.10)
$$M \succeq 0$$
(2.11)

The above problem can be divided into two cases, a diagonal matrix M and a full matrix M, which are solved by the Newton–Raphson method and by gradient ascent with iterative projection, respectively.
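For the diagonal case, both the objective (2.9) and the constraint (2.10) are linear in diag(M), so a very simple projected subgradient scheme already illustrates the structure of the problem. The sketch below is only illustrative and is not the Newton–Raphson or iterative-projection procedure of [23]; the learning rate, penalty weight and iteration count are arbitrary choices.

```python
import numpy as np

def learn_diag_metric(X, y, lr=0.01, rho=10.0, n_iter=2000):
    """Illustrative projected (sub)gradient for the diagonal case of
    (2.9)-(2.11): minimize the summed squared distances over S, handle the
    constraint over D by a hinge penalty, and enforce m >= 0 by projection.
    (Not the Newton-Raphson solver of [23].)"""
    n = X.shape[1]
    same = np.zeros(n)   # a_k = sum_{(i,j) in S} (x_ik - x_jk)^2
    diff = np.zeros(n)   # b_k = sum_{(i,j) in D} (x_ik - x_jk)^2
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d2 = (X[i] - X[j]) ** 2
            if y[i] == y[j]:
                same += d2
            else:
                diff += d2
    m = np.ones(n)                           # diag(M), initialised as Euclidean
    for _ in range(n_iter):
        grad = same.copy()
        if diff @ m < 1.0:                   # hinge penalty for constraint (2.10)
            grad -= rho * diff
        m = np.maximum(m - lr * grad, 0.0)   # projection onto m >= 0
    return np.diag(m)
```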

2.2 Large margin nearest neighbor (LMNN)

LMNN learns a Mahalanobis distance metric to improve the performance of k-NN classification. Taking a local view, LMNN defines a neighborhood for every point and seeks a metric that pulls the samples in the neighborhood with the same label closer and pushes the samples with different labels away by a large margin. The primal semi-definite programming problem is stated as follows

$$\min \limits _{M,\xi }\quad\sum \limits _{il} \eta _{il} d_M^2(x_i,x_l) + C \sum \limits _{ijl} \eta _{il} (1-y_{jl})\xi _{ijl},$$
(2.12)
$$\hbox{s.t.}\quad\,d_M^2(x_j,x_l)-d_M^2(x_i,x_l) \ge 1-\xi _{ijl},$$
(2.13)
$$\xi _{ijl} \ge 0,{\quad}M \succeq 0$$
(2.14)

where

$$y_{jl}=\left\{ \begin{array}{ll} 1, &\quad y_j=y_l,\\ 0, &\quad y_j\neq\,y_l\\ \end{array} \right. , \quad\eta _{il}=\left\{ \begin{array}{ll} 1, &\quad y_{il}=1, x_i \in t(x_l )\\ 0, &\quad \hbox{otherwise}\\ \end{array} \right.$$

and \(t(x_l)\) denotes the neighborhood of \(x_l\). LMNN makes significant improvements over k-NN classification with the Euclidean distance, which verifies that the algorithm obtains a better distance metric for measuring the similarity between any two points.
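For reference, the LMNN objective (2.12) can be evaluated for a given metric M by eliminating the slacks as \(\xi_{ijl}=\max(0,\,1 + d_M^2(x_i,x_l) - d_M^2(x_j,x_l))\), their smallest feasible values under (2.13)–(2.14). The sketch below assumes a hypothetical structure `neighbors[l]` listing the target neighbors \(t(x_l)\); it only evaluates the loss and does not carry out the semi-definite optimization.

```python
import numpy as np

def lmnn_loss(M, X, y, neighbors, C=1.0):
    """Evaluate the LMNN objective (2.12) with slacks at their minimal
    feasible values.  neighbors[l] holds the indices of the target
    neighbours t(x_l), i.e. same-label points in the neighbourhood of x_l."""
    def d2(a, b):
        diff = a - b
        return diff @ M @ diff

    pull, push = 0.0, 0.0
    for l in range(len(X)):
        for i in neighbors[l]:                    # eta_il = 1
            pull += d2(X[i], X[l])
            for j in range(len(X)):
                if y[j] != y[l]:                  # (1 - y_jl) = 1
                    push += max(0.0, 1.0 + d2(X[i], X[l]) - d2(X[j], X[l]))
    return pull + C * push
```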

2.3 Support vector machine (SVM)

SVM seeks the best hyperplane to classify the training set (2.1) by maximizing the margin between different classes. The primal problem of standard SVM is

$$\min \limits _{{w},{b},{\xi }}\quad\frac{1}{2} \Vert w\Vert ^2 + C \sum \limits _{i=1}^{m} \xi _i,$$
(2.15)
$$\hbox{s.t.}\quad\,y_i (w^{\rm T} x_i+b) \ge 1-\xi _i, i \in I $$
(2.16)
$$\xi _i \ge 0,{\quad} i \in I $$
(2.17)

where \(\xi =(\xi _1,\xi _2, \ldots ,\xi _m)^{\rm T}\) and \(C>0\) is a penalty parameter. From the metric learning viewpoint, SVM only makes an effort to separate different classes but ignores gathering the points within the same class. It is therefore expected that SVM can be boosted by adding constraints that minimize the within-class distance, and so \(\varepsilon \)-SVM [7] is constructed as follows

$$\min \limits _{{w},{b},{\xi }}\quad\frac{1}{2} \Vert w\Vert ^2 + C \sum \limits _{i=1}^{m} \xi _i + \lambda \sum \limits _{i=1}^{m} \varepsilon _i,$$
(2.18)
$$\hbox{s.t.}\quad1+\varepsilon _i \ge y_i (w^{\rm T} x_i+b) \ge 1-\xi _i,$$
(2.19)
$$\xi _i, \varepsilon _i \ge 0,{\quad} i \in I $$
(2.20)

which has twice as many constraints as SVM, resulting in even slower training than SVM.

2.4 Least squares support vector machine (LSSVM)

In light of the least squares idea, equality constraints are used in the SVM approach, which leads to LSSVM. The method avoids solving a quadratic program by solving a set of linear equations instead. LSSVM searches for two parallel hyperplanes \(w^{\rm T} x+b=1, w^{\rm T} x+b =-1 \) that minimize the distance between the data points and their corresponding hyperplane while maximizing the margin between the two hyperplanes. The formulation of the LSSVM algorithm is

$$ \min \limits _{{w},{b},{\xi }}\quad\frac{1}{2} \Vert w\Vert ^2 + C \sum \limits _{i=1}^{m} e_i^2 $$
(2.21)
$$ \hbox{s.t.}\quad\,y_i (w^{\rm T} x_i+b) = 1+ e_i,{\quad} i \in I $$
(2.22)

and the final decision hyperplane is \(w^{\rm T} x+b=0\). In fact, LSSVM works on both sides, adjusting the between-class and the within-class distance. However, its way of measuring the within-class and between-class distances differs from metric learning, as will be discussed in a later section.
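The linear system behind LSSVM can be made explicit. The following sketch is derived for the linear kernel with labels in {−1, +1} and the constraint written as in (2.22) (the sign convention of \(e_i\) only flips the sign of the slacks): eliminating w and e through the KKT conditions leaves a single linear system in b and the multipliers \(\alpha\).

```python
import numpy as np

def lssvm_train(X, y, C=1.0):
    """Solve the LSSVM problem (2.21)-(2.22) through its KKT linear system
    (a sketch for the linear kernel).  Eliminating w = sum_i alpha_i y_i x_i
    and e_i = -alpha_i / (2C) gives
        [ 0    y^T            ] [ b     ]   [ 0 ]
        [ y    Omega + I/(2C) ] [ alpha ] = [ 1 ],
    with Omega_ij = y_i y_j x_i^T x_j."""
    m = len(y)
    K = X @ X.T                                   # linear kernel matrix
    Omega = (y[:, None] * y[None, :]) * K
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(m) / (2 * C)
    rhs = np.concatenate(([0.0], np.ones(m)))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    w = X.T @ (alpha * y)
    return w, b

def lssvm_predict(X, w, b):
    return np.sign(X @ w + b)                     # decision hyperplane w^T x + b = 0
```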

The four methods represent the within-class and between-class distances in different forms [14]; the formulas are listed in Table 1. For the within-class distance, we argue that

$$ \sum \limits _{y_i=y_j} d_M(x_i,x_j) $$

is more strict than

$$ \sum \limits _{k=\pm }\sum \limits _{y_i=k} d(x_i,H_k) $$

in minimizing the within-class distance. In a fixed space, minimizing \(\sum \nolimits _{y_i=k} d(x_i,H_k)\) only makes the distance between two points short in the direction perpendicular to \(H_k\), without restricting the distance in the direction parallel to \(H_k\) (Fig. 1). For the between-class distance, \( \min \nolimits _{y_i \neq\,y_l} d(x_i,x_l) \) is considered more rigid than \(\sum \nolimits _{y_i \neq\,y_l} d_M(x_i,x_l) \): maximizing the former separates the two classes distinctly by a large margin, which contributes to better classification performance, whereas the latter cannot ensure this.

Table 1 Representations of within-class distance and between-class distance

3 Metric learning-based LSSVM

3.1 The relation between metric learning and LSSVM

In this subsection, the claim that LSSVM has a strong relation with metric learning is established as follows. For the metric learning problem (2.9)–(2.11), let \(M=W^{\rm T} W \), where W is a diagonal matrix with diag(W) = w, and define a linear transformation \(R^n \rightarrow H: \hat{x}=Wx\). Given two fixed hyperplanes \(H_+: 1^{\rm T} \hat{x}+b=\sqrt{n}/2\) and \(H_-: 1^{\rm T} \hat{x}+b=-\sqrt{n}/2\), we have

$$\begin{aligned}\sum \limits _{(x_i,x_j) \in S} d_M^2(x_i,x_j) &= \sum \limits _{(x_i,x_j) \in S} (x_i-x_j)^{\rm T} w w^{\rm T} (x_i-x_j) \\ &= \sum \limits _{(x_i,x_j) \in S} (w^{\rm T} x_i-w^{\rm T} x_j)^2 \\ &= \sum \limits _{y_i=y_j=1} (1^{\rm T} Wx_i+b-\sqrt{n}/2 \\&\quad-\,(1^{\rm T} Wx_j+b-\sqrt{n}/2))^2 \\& \quad+\, \sum \limits _{y_i=y_j=-1} (1^{\rm T} Wx_i+b+\sqrt{n}/2 \\&\quad-\,(1^{\rm T} Wx_j+b+\sqrt{n}/2))^2 \\ & \le n\sum \limits _{y_i=y_j=1} (d(\hat{x_i}, H_+)+d(\hat{x_j}, H_+))^2 \\ & \quad+\, n\sum \limits _{y_i=y_j=-1} (d(\hat{x_i}, H_-)+d(\hat{x_j}, H_-))^2 \\ & \le 2n(p-1)\sum \limits _{y_i=1} d^2(\hat{x_i},H_+) \\ &\quad + 2n(q-1)\sum \limits _{y_j=-1} d^2(\hat{x_j},H_-) \\ & \le t(\sum \limits _{y_i=1} d^2(\hat{x_i},H_+) + \sum \limits _{y_j=-1} d^2(\hat{x_j},H_-)) \end{aligned}$$
(3.1)

where \(t=2n \cdot \max(p-1,q-1)\) and p, q denote the numbers of positive and negative samples, respectively. In addition, we hope that the projected points of each class lie symmetrically on the two sides of the corresponding hyperplane. Then, we have

$$n\sum \limits _{y_i=1} d^2(\hat{x_i},H_+) + n\sum \limits _{y_j=-1} d^2(\hat{x_j},H_-) \le \sum \limits _{(x_i,x_j) \in S} d_M^2(x_i,x_j) $$
(3.2)
$$d^2(H_+,H_-) \le \sum \limits _{(x_i,x_j) \in D} d_M^2(x_i,x_j) $$
(3.3)

Equations (3.2) and (3.3) are trivial results, and we illustrate them in Fig. 1.

From Eqs. (3.1) and (3.2), it is notable that minimizing \(\sum \nolimits _{y_i=1} d^2(\hat{x_i},H_+) + \sum \nolimits _{y_j=-1} d^2(\hat{x_j},H_-)\) in the H space is a relaxed way of minimizing \(\sum \nolimits _{y_i=y_j} d_M^2(x_i,x_j)\) in the \(R^n\) space. So the following optimization problem in the H space is considered

$$ \min \limits _W\quad\sum \limits _{y_i=1} d^2(\hat{x_i},H_+) + \sum \limits _{y_j=-1} d^2(\hat{x_j},H_-), $$
(3.4)
$$ \hbox{s.t.}\quad\,d^2(H_+,H_-) = 1 $$
(3.5)

which can be rewritten as

$$ \min\quad \sum \limits _{i=1}^m e_i^2, $$
(3.6)
$$ \hbox{s.t.}\quad y_i(w^{\rm T} x_i +b)=\sqrt{n}/2+e_i,{\quad} i \in I $$
(3.7)

However, the problem is not invariant to the scaling of \(\Vert w\Vert \), so we add a regularization term to the objective function and normalize the constraints by dividing both sides by \( \sqrt{n}/2\). The LSSVM method is then obtained:

$$ \min \quad\Vert w\Vert _2^2+ C\sum \limits _{i=1}^m e_i^2, $$
(3.8)
$$ \hbox{s.t.}\quad y_i(w^{\rm T} x_i +b)= 1 +e_i,{\quad} i \in I $$
(3.9)
Fig. 1

The relation between LSSVM and metric learning. The blue circles belong to the positive class and the red squares belong to the negative class. The points are mapped from the primal space by the mapping \(\hat{x}=Wx\). The black line segments represent within-class distances and the green ones represent between-class distances. Among these distances, the solid line segments correspond to metric learning and the dotted line segments correspond to LSSVM. a LSSVM (color figure online)

The \(\Vert w\Vert \) term in the objective function of LSSVM is only a regularization term from the metric learning view and cannot maximize the between-class margin well. In Fig. 2, \(H_+, H_-\) are the two hyperplanes that LSSVM seeks and \(H_0\) is the final decision hyperplane. There are two drawbacks in LSSVM: (1) the decision hyperplane is sensitive to outliers; in Fig. 2, an outlier in the top right corner pulls \( H_+ \) further from \( H_- \), so the final decision hyperplane is improper; (2) the between-class distance is measured by the distance between \(H_+\) and \(H_- \), \( 1/\Vert w\Vert \), whereas from the metric learning view the between-class distance measured by \(\min \nolimits _{y_i \neq\,y_l}\, d_M^2(x_i,x_l)\) is more appropriate. So we can improve LSSVM by maximizing

$$ \min \limits _{y_i \neq\,y_l}\quad d_M^2(x_i,x_l) $$

which can be converted into maximizing the distance between two boundary lines (the two solid lines in Fig. 2).

Fig. 2

Explanation for LSSVM and its drawbacks

Then, we can construct the following problem, termed ML-LSSVM:

$$ \min \quad\frac{1}{2}\Vert w\Vert ^2+ C\sum \limits _{i=1}^m e_i^2 - \lambda t, $$
(3.10)
$$ \hbox{s.t.}\quad y_i(w^{\rm T} x_i +b- y_i )= e_i, i \in I $$
(3.11)
$$ y_i(w^{\rm T} x_i +b ) \ge t,{\quad} i \in I $$
(3.12)

In the above problem, maximizing t means that the points of the two classes lie on opposite sides of \( H_0 \) and are as far away from \( H_0 \) as possible, so the goal of maximizing the between-class distance is achieved.
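Before turning to the dedicated ADMM solver in Sect. 3.2, note that the primal problem (3.10)–(3.12) is a small convex QP and can be prototyped with a generic solver. The sketch below uses cvxpy and assumes labels in {−1, +1}; it is only a prototype under these assumptions, not the solver used in our experiments.

```python
import cvxpy as cp
import numpy as np

def ml_lssvm_primal(X, y, C=1.0, lam=2.0):
    """Prototype of the ML-LSSVM primal (3.10)-(3.12) as a convex QP."""
    m, n = X.shape
    w = cp.Variable(n)
    b = cp.Variable()
    e = cp.Variable(m)
    t = cp.Variable()
    margins = cp.multiply(y, X @ w + b)           # y_i (w^T x_i + b)
    objective = 0.5 * cp.sum_squares(w) + C * cp.sum_squares(e) - lam * t
    constraints = [margins - 1 == e,              # (3.11): y_i(w^T x_i + b - y_i) = e_i
                   margins >= t]                  # (3.12)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value, b.value, t.value
```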

3.2 Solving ML-LSSVM via ADMM

In this subsection, we solve the primal problem of ML-LSSVM in an effective way. ADMM (the alternating direction method of multipliers) solves a convex optimization problem by splitting it into smaller optimization problems, each of which can be handled more easily; it has been used in many applications recently. We first transform our method into the standard ADMM formulation and then describe the solving process.

First, the slack variables \( \eta =(\eta _1, \ldots , \eta _m)^{\rm T} \ge 0 \) are introduced to convert the inequality constraints into equalities, and we have

$$ \min \quad\frac{1}{2}\Vert w\Vert ^2+ C\sum \limits _{i=1}^m e_i^2 - \lambda t, $$
(3.13)
$$ \hbox{s.t.}\quad y_i(w^{\rm T} x_i +b- y_i )= e_i, i \in I $$
(3.14)
$$ y_i(w^{\rm T} x_i +b ) = t+ \eta _i,\quad\, i \in I $$
(3.15)
$$ \eta \ge 0 $$
(3.16)

Define the indicator function

$$ h(\eta )=\left\{ \begin{array}{ll} +\infty , &\quad \eta < 0,\\ 0, &\quad\eta \ge 0 \end{array} \right. $$
(3.17)

and let \( \pi =(w^{\rm T} , b, e^{\rm T} , t)^{\rm T} \), where \( e=(e_1,\ldots ,e_m)^{\rm T} \). We then have the following optimization problem in matrix form

$$ \min \quad\frac{1}{2} \pi ^{\rm T} Q \pi + q^{\rm T} \pi + h(\eta ), $$
(3.18)
$$ \hbox{s.t.}\quad A_1 \pi =1_m, $$
(3.19)
$$ A_2 \pi - \eta = 0_m $$
(3.20)

where

$$ Q= \left( \begin{array}{llll} E & &&\\ & 0 && \\ && 2CE & \\ && & 0 \end{array} \right) , $$
(3.21)
$$ A_1= \left( \begin{array}{llll} \hbox{diag}(y)X&\ y&\ -E&\ \varvec{0}_m \end{array} \right) , $$
(3.22)
$$ A_2= \left( \begin{array}{llll} \hbox{diag}(y)X&\ y&\ O_{m\times m}&\ -\varvec{1}_m \end{array} \right) , $$
(3.23)

\( q^{\rm T} =(\varvec{0}^{\rm T} _{m+n+1},\ -\lambda )\), E denotes an identity matrix of the appropriate size, and 1, 0 are vectors of ones and zeros, respectively. For \( X \in R^{m \times n} \), each row is a training instance. Then, we obtain the following standard optimization problem for ADMM

$$ \min \quad\frac{1}{2} \pi ^{\rm T} Q \pi + q^{\rm T} \pi + h(\eta ), $$
(3.24)
$$ \hbox{s.t.}\quad A \pi +B \eta = c, $$
(3.25)

where \( A=(A_1^{\rm T} \ A_2^{\rm T} )^{\rm T} ,{\quad} B=(O_{m \times m}^{\rm T} \ -E)^{\rm T} ,{\quad} c=(\varvec{1}^{\rm T} \ \varvec{0}^{\rm T} )^{\rm T} .\) The problem (3.24)–(3.25) can be solved by Algorithm 1.
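For concreteness, the scaled-form ADMM updates for (3.24)–(3.25) can be sketched as follows; the exact schedule and stopping rule of Algorithm 1 may differ, and the penalty \(\rho\) and iteration count here are arbitrary. The \(\pi\)-update is a linear system, the \(\eta\)-update is the projection onto \(\eta \ge 0\) (the proximal operator of h), and the dual variable is then updated.

```python
import numpy as np

def ml_lssvm_admm(X, y, C=1.0, lam=2.0, rho=1.0, n_iter=500):
    """Scaled-form ADMM sketch for problem (3.24)-(3.25)."""
    m, n = X.shape
    dim = n + 1 + m + 1                              # pi = (w, b, e, t)
    Q = np.zeros((dim, dim))
    Q[:n, :n] = np.eye(n)                            # (1/2)||w||^2 term
    Q[n + 1:n + 1 + m, n + 1:n + 1 + m] = 2 * C * np.eye(m)   # C * sum e_i^2
    q = np.zeros(dim)
    q[-1] = -lam                                     # -lambda * t
    Yx = y[:, None] * X                              # diag(y) X
    A1 = np.hstack([Yx, y[:, None], -np.eye(m), np.zeros((m, 1))])
    A2 = np.hstack([Yx, y[:, None], np.zeros((m, m)), -np.ones((m, 1))])
    A = np.vstack([A1, A2])
    c = np.concatenate([np.ones(m), np.zeros(m)])
    eta, u = np.zeros(m), np.zeros(2 * m)
    lhs = Q + rho * A.T @ A                          # fixed; factor once in practice
    for _ in range(n_iter):
        Beta = np.concatenate([np.zeros(m), -eta])   # B eta with B = (O; -E)
        pi = np.linalg.solve(lhs, -q - rho * A.T @ (Beta - c + u))
        eta = np.maximum(A2 @ pi + u[m:], 0.0)       # prox of the indicator h
        u = u + A @ pi + np.concatenate([np.zeros(m), -eta]) - c
    w, b, t = pi[:n], pi[n], pi[-1]
    return w, b, t
```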

To extend our method to the nonlinear case, the following kernel-based surface is considered

$$ K(x, X)u+b=0 $$
(3.26)

where \(K(x, X)=(\Phi (x) \cdot \Phi (X))\). We can substitute (3.26) into the primal problem (3.10)–(3.12) directly and obtain the standard ADMM formulation in the same way.

Algorithm 1 (ADMM for solving ML-LSSVM)

3.3 The relation between LMNN and ML-LSSVM

We explore the relation between LMNN and ML-LSSVM in this subsection. The problem (2.12)–(2.14) can be generalized as

$$ \min \quad\sum \limits _{il} \eta _{il} d_M^2(x_i,x_l) - \lambda \sum \limits _{l} \gamma _{l}, $$
(3.27)
$$ \hbox{s.t.}\quad d_M^2(x_j,x_l)-d_M^2(x_i,x_l) \ge \gamma _{l},$$
(3.28)
$$ M \succeq 0 $$
(3.29)

which can be rewritten as

$$ \min \quad\sum \limits _{il} d_M^2(x_i,x_l) - \lambda \sum \limits _{l} \gamma _{l},$$
(3.30)
$$ \hbox{s.t.}\quad d_M^2(x_i,x_l)=e_i^2,\quad \forall l, x_i \in t(x_l) $$
(3.31)
$$ d_M^2(x_j,x_l) \ge \gamma _{l} + \max \limits _{i} d_M^2(x_i,x_l), $$
(3.32)
$$ \forall x_l, x_i \in t(x_l),\quad\,y_j \neq\,y_l,\quad\, M \succeq 0 $$
(3.33)

where \(\lambda >0\) is an adaptive parameter.

For every \(x_l\), the above problem can be broken up into

$$ \min\quad \sum \limits _{i} d_M^2(x_i,x_l) - \lambda \gamma _{l}, $$
(3.34)
$$ \hbox{s.t.}\quad d_M^2(x_i,x_l)=e_i^2,\quad\,x_i \in t(x_l) $$
(3.35)
$$ d_M^2(x_j,x_l)\ge \gamma _{l} + \max \limits _{i} d_M^2(x_i,x_l), $$
(3.36)
$$ y_j \neq\,y_l, M \succeq 0 $$
(3.37)

As in [7], we introduce a nonlinear transformation

$$\tilde{x}=\Phi (x) = (x_1^2, \ldots , x_n^2,x_1 x_2,x_1x_3,\ldots ,x_{n-1}x_n, x_1,x_2, \ldots ,x_n) $$
(3.38)

then \( d_M^2 (x,x_l)= w_l^{\rm T} \Phi (x)+b_l \). The problem (3.34)–(3.37) can be transformed as

$$ \min\quad\sum \limits _{i} (e_i^2 -\gamma _l/2)- \lambda \gamma _{l}/2, $$
(3.39)
$$\hbox{s.t.}\quad\,w_l^{\rm T} \Phi (x_i)+b'_l+1=1-\max \limits _{i} d_M^2(x_i,x_l) + e_i^2- \gamma _{l}/2,$$
(3.40)
$$ w_l^{\rm T} \Phi (x_j)+b'_l \ge \gamma _{l}/2, $$
(3.41)
$$ y_j \neq\,y_l, x_i \in t(x_l) $$
(3.42)

where \(b'_l=b_l- \max \limits _{i} d_M^2(x_i,x_l) - \gamma _l/2\).
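The identity \(d_M^2(x,x_l)=w_l^{\rm T}\Phi(x)+b_l\) used above follows from expanding the quadratic form; the coefficients can be written down explicitly, as the following numerical check (an illustrative sketch) confirms.

```python
import numpy as np
from itertools import combinations

def phi(x):
    """The feature map (3.38): squares, pairwise products, then linear terms."""
    x = np.asarray(x)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([x ** 2, cross, x])

def wl_bl_from_metric(M, x_l):
    """Coefficients such that d_M^2(x, x_l) = w_l^T phi(x) + b_l."""
    n = len(x_l)
    sq = np.diag(M)                                   # weights of x_i^2
    cross = [2 * M[i, j] for i, j in combinations(range(n), 2)]
    lin = -2 * (M @ x_l)                              # weights of x_i
    return np.concatenate([sq, cross, lin]), x_l @ M @ x_l

# quick check that the expansion is exact
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)); M = A @ A.T
x, x_l = rng.standard_normal(4), rng.standard_normal(4)
w_l, b_l = wl_bl_from_metric(M, x_l)
assert np.isclose((x - x_l) @ M @ (x - x_l), w_l @ phi(x) + b_l)
```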

Fig. 3

The relation between local LMNN (a) and ML-LSSVM (b). a The points are in the original space. For every \( x_l \), the inner circle denotes the extent of its class and its radius is \(e_l=\max \limits _{i} d_M^2(x_i,x_l)\). The margin is \(m_l= \sqrt{\gamma _{l} + \max \limits _{i} d_M^2(x_i,x_l)}- e_l\). Local LMNN aims to minimize \( e_l \) and maximize \( \gamma _l \). b The points are in the mapped space \( \tilde{x} = \Phi (x) \). L\(_1: w_l^{\rm T} \Phi (x)+b_l=0\); L\(_2: w_l^{\rm T} \Phi (x)+b'_l=-1\); L\(_3: w_l^{\rm T} \Phi (x)+b'_l=0\). Minimizing \( \sum \nolimits _i e_i^2 \) in Eq. (3.39) is equivalent to making the distances between the blue points and L\(_2\) as small as possible, and maximizing \( \gamma _l\) extends the distance between the two boundary lines as much as possible (\(\gamma _l/2\) is equal to t in Eq. (3.12)) (color figure online)

The formulation of problem (3.39)–(3.41) is equivalent to ML-LSSVM, except for the different training sets. Given any \(x_l\), the training set contains two classes: one consists of the points located in the neighborhood of \(x_l\) that share the same label as \( x_l \), and the other consists of the points whose labels differ from that of \(x_l\). We denote the two classes by the sets \(TS_s\) and \( TS_d \). The problem minimizes the within-class distance by forcing the data in \(TS_s\) to be as close to \(x_l\) as possible, and maximizes the between-class distance by making the margin as large as possible. It can be seen that LMNN contains m local ML-LSSVM problems, with the constraint that each \(w_l\) depends on the corresponding \( x_l \). So local information is embedded in LMNN, whereas ML-LSSVM utilizes global information to obtain the best hyperplane. The relation between LMNN and ML-LSSVM is illustrated in Fig. 3.

4 Numerical experiments

In this section, numerical experiments are conducted from different aspects to evaluate ML-LSSVM. The experimental design, including the selected datasets, the compared algorithms and the parameters to be tuned, is specified in detail. We first introduce a toy example to show that our method can reduce the negative impact of outliers and balance the within-class and between-class distances properly. Then classification on binary-class and multi-class datasets is carried out and the CPU time is compared.

4.1 A toy example for ML-LSSVM

Fig. 4

A toy example for ML-LSSVM

Given an artificial training set: negative class

$$ N= \{(0.5,0.5), (0.7,0.3), (0.3,0.7), (1,1), (1.4, 0.6), (0.6,1.4)\} $$

and positive class

$$P= \{(2,2), (2.8,1.2), (1.2,2.8), (2.5,2.5), (3.5,1.5), (1.5,3.5)\} $$

The first three points in N lie on the line \(x_1 + x_2 = 1\) and the last three on \(x_1 + x_2 = 2\). The first three points in P lie on the line \(x_1 + x_2 = 4\) and the last three on \(x_1 + x_2 = 5\). So the best separating hyperplane should be \(x_1 + x_2 = 3\) and the two corresponding center lines are \(x_1 + x_2 = 1.5\) and \(x_1 + x_2 = 4.5\).

If an outlier (5, 5) is added to P, LSSVM obtains the two center lines \(x_1 + x_2 = 0.23\) and \(x_1 + x_2 = 6.38\) (the blue and red dotted lines in Fig. 4) and the decision line \(x_1 + x_2 = 3.3\) (the black dotted line in Fig. 4), whereas ML-LSSVM obtains \(x_1 + x_2 = 1.6\) and \(x_1 + x_2 = 4.4\) (the blue and red solid lines in Fig. 4) and the decision line \(x_1 + x_2 = 3\) (the green solid line in Fig. 4). In fact, the two lines \(x_1 + x_2 = 0.23\) and \(x_1 + x_2 = 6.38\) do not minimize the within-class distance owing to the outlier, and \(1/\Vert w\Vert ^2\) loses its meaning in representing the between-class distance. ML-LSSVM can eliminate the effect of the outlier by adjusting the parameter \(\lambda \). Besides, the two center lines of ML-LSSVM play the role of minimizing the within-class distance well.

4.2 Datasets information and experimental setups

Benchmark datasets were selected to evaluate the classification performance. We selected 20 binary-class datasets and 9 multi-class datasets from the UCI Machine Learning Repository and the LIBSVM datasets. The characteristics of the 29 datasets, including the numbers of instances, features and classes, are displayed in Table 2. All the datasets are scaled to the interval [0, 1]. The number of instances ranges from 62 to 1000 and the number of features from 4 to 7129.

Table 2 Characteristics of selected datasets

We compared our method with three methods, \(\varepsilon\)-SVM, LSSVM and LMNN, using the classification error to evaluate performance. The ADMM algorithm was also used to implement the former two methods. All the experiments were run in MATLAB 2015a (Intel Core i5, 4 GB RAM).

In binary classification, to test the performance of ML-LSSVM comprehensively, all algorithms except LMNN were implemented with four kernels: linear, polynomial, RBF and sigmoid. All the best parameters were selected with fivefold cross-validation. For all kernels, C is searched over the set \(\{10^{-3},\ldots ,10^3\}\). In the polynomial kernel \(K(u,v)=(u'v+1)^d \), the degree d is selected from \( \{2, 3, 4\} \). In the RBF kernel \( K(u,v)=\exp(-\gamma \Vert u-v\Vert ^2) \) and the sigmoid kernel \(K(u,v)=\tanh(ku'v+1)\), the parameters \( \gamma \) and k are both chosen from \(\{10^{-2},\ldots , 10^{2} \} \). For LMNN, only experiments in the linear case were performed, since the method cannot be extended to the nonlinear case directly. For \(\varepsilon \)-SVM, \(\lambda \) is set to C/3, and in ML-LSSVM, \( \lambda =2C \); both settings penalize points that violate the inter-class distance constraints more heavily than those violating the intra-class distance constraints.
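For clarity, the kernel functions and parameter grids described above can be summarized as follows (a sketch in Python; decade steps between the stated grid endpoints are assumed).

```python
import numpy as np

# Kernel functions and search grids as described in the experimental setup.
def linear_kernel(u, v):
    return u @ v

def poly_kernel(u, v, d=2):
    return (u @ v + 1) ** d

def rbf_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def sigmoid_kernel(u, v, k=1.0):
    return np.tanh(k * (u @ v) + 1)

C_grid     = [10.0 ** p for p in range(-3, 4)]   # {10^-3, ..., 10^3}
d_grid     = [2, 3, 4]
gamma_grid = [10.0 ** p for p in range(-2, 3)]   # {10^-2, ..., 10^2}
k_grid     = gamma_grid
# lambda settings from the text: C / 3 for eps-SVM, 2 * C for ML-LSSVM
```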

In multi-class classification, we compared the four methods in three cases: the linear kernel and polynomial kernels with d = 2 and d = 3. Similarly, LMNN was only evaluated with the linear kernel. The settings of \( C \) and \( \lambda \) are the same as in binary classification. The one-versus-one strategy was used for \( \varepsilon\)-SVM, LSSVM and ML-LSSVM.

4.3 Experimental results and analysis

The average error rates in binary classification are displayed in Table 3, with the best results in boldface. In the 20 comparisons for each kernel, ML-LSSVM achieved the best performance 13, 12, 14 and 12 times with the linear, polynomial, RBF and sigmoid kernels, respectively, which demonstrates that ML-LSSVM handles binary classification effectively and that the improvements over LSSVM are advantageous. LMNN performed worst in the linear case. In Fig. 5, CPU time comparisons are reported for \( \varepsilon\)-SVM, LSSVM and ML-LSSVM. The datasets are ranked in ascending order of the number of instances; for each subfigure in Fig. 5, the horizontal axis runs from 1 to 20 and denotes the rank of each dataset. It is notable that LSSVM is the fastest method, and ML-LSSVM is only slightly slower: with the linear, RBF and sigmoid kernels, LSSVM is about three times faster than ML-LSSVM, while with the polynomial kernel ML-LSSVM is as fast as LSSVM. However, ML-LSSVM performed much better than LSSVM. \(\varepsilon\)-SVM is much slower than LSSVM and ML-LSSVM, since its primal problem has four times as many constraints as that of LSSVM.

Figure 6 shows the average error rates in multi-class classification. In the linear case, ML-LSSVM performed best on 4 of the 9 datasets and LMNN took first place three times. With the two polynomial kernels, ML-LSSVM obtained the best result in 15 of 18 comparisons, and \(\varepsilon\)-SVM ranked second after ML-LSSVM.

The performance of ML-LSSVM verifies that the within-class and between-class distances are very important in classification tasks and that the way the two types of distance are measured can affect prediction results markedly. The relation between metric learning and LSSVM is helpful for making improvements on LSSVM.

Table 3 Error rate of binary classification in linear and nonlinear kernel
Fig. 5

Comparisons of CPU time. a Linear, b Poly, c Rbf, d Sigmoid

Fig. 6

Error rate of Multi-class classification. a Circle, b iris, c wine, d seeds, e thyroid, f libras, g gem, h vehicle, i vowel

5 Conclusions

In this paper, we explore the relation between metric learning and LSSVM. LSSVM can be regarded as a relaxed version of the method with pairwise constraints, one of the earliest works in metric learning. There, the within-class and between-class distances are defined by sums of pairwise distances with respect to a newly learned Mahalanobis matrix, whereas LSSVM seeks two parallel hyperplanes, each treated as a center mark for one class: the within-class distance is measured by the sum of distances between the points and their corresponding hyperplane, and the distance between the two center marks is the between-class distance. In fact, LSSVM essentially implements the idea of metric learning. Nevertheless, LSSVM can be improved by revising its measure of between-class distance. A novel method, called ML-LSSVM, is presented, which adds inter-class distance constraints to the primal LSSVM problem. ML-LSSVM can be solved effectively by the ADMM algorithm, which breaks a large convex problem into smaller ones. Furthermore, LMNN has an intrinsic relation with ML-LSSVM: its local version is equivalent to ML-LSSVM, differing only in the training sets. Numerical experiments show that the extra constraints in ML-LSSVM are advantageous for improving classification performance. In the future, we will investigate the relation of metric learning with more variants of SVM for improved performance.