1 Introduction

Metric learning, one of the core problems in pattern recognition, aims to measure the similarity between data when classification algorithms such as k-nearest neighbors (kNN) cannot produce meaningful results on their own. Many metric learning methods have therefore been proposed, for example Mahalanobis metric learning, whose goal is to narrow the distance between samples with the same label and to enlarge the distance between samples with different labels. In fact, traditional classification methods such as the Support Vector Machine (SVM) [1] and k-nearest neighbors (kNN) [2] also rely on a good metric to measure the dissimilarities between data, and their results can be greatly improved by combining them with metric learning. On this basis, more and more improved metric-learning-based models and algorithms have been proposed in recent years. Generally speaking, they can be divided into three categories according to the training samples: unsupervised metric learning [3,4,5,6,7,8,9,10], supervised metric learning [11,12,13,14,15,16,17,18], and semi-supervised metric learning [19,20,21,22].

Unsupervised metric learning methods mainly deal with data without label information and aim to find a latent data manifold embedded in a higher-dimensional space such that the local or global structure of the data is preserved. For example, Principal Component Analysis (PCA) [3], a well-known classical method, is usually regarded as a dimensionality reduction method, but in essence it can be viewed as an unsupervised metric learning method. Similarly, Multidimensional Scaling (MDS) [4] and Nonnegative Matrix Factorization (NMF) [5] can both be regarded as unsupervised metric learning methods. However, these methods do not work well on nonlinear data, and a number of nonlinear methods have therefore been proposed. For example, Laplacian Eigenmap (LE) [7] is a nonlinear method that uses the eigenfunctions of the graph Laplace–Beltrami operator to preserve the local neighborhood of each data point. Subsequently, Locality Preserving Projections (LPP) [8], a linear approximation of LE, was proposed: the nonlinear map required in LE is replaced by a linear map, which simplifies the model while still achieving good performance. Another classical nonlinear method is locally linear embedding (LLE) [9], which preserves the local order relation of the data in both the embedding space and the intrinsic space. LLE also has a linear approximation called neighborhood preserving embedding (NPE) [10], in which a linear projection replaces the desired map in LLE while remaining comparable in performance. However, these unsupervised methods do not take into account the label information of the data, which has a very positive impact on classification. Therefore, label-information-based metric learning methods have been proposed in recent years.

The main idea of supervised metric learning is to use label information to shorten the distance between similar samples while enlarging the distance between samples from different classes. Xing et al. [11] first proposed the Mahalanobis metric learning model by introducing "similar pair" and "dissimilar pair" constraints, which can separate similar and dissimilar data clearly. On this basis, a number of Mahalanobis metric learning methods were investigated successively. For example, Weinberger et al. [13] presented large margin nearest neighbors (LMNN), in which a hinge loss function is constructed with triplet constraints, so that a large distance margin can be maintained between inter-class data points. From an information-theoretic perspective, Davis et al. [14] developed information-theoretic metric learning (ITML), involving a natural entropy-based objective function under pairwise distance constraints, which can be solved as a low-rank kernel learning problem. In addition, Zuo et al. [18] developed positive-semidefinite constrained metric learning (PCML) and nonnegative-coefficient constrained metric learning (NCML), which are both formulated as kernel classification problems with a positive semi-definite constraint, so that they can be solved by iterated training of support vector machines (SVMs). Based on the Riemannian geometry framework, an early approach put forward geometric mean metric learning (GMML) [17], which efficiently solves the metric learning problem through the Riemannian geometry of positive definite matrices. Li et al. introduced a Kullback–Leibler Divergence (KLD) based model called Kullback–Leibler Divergence Metric Learning [23]; the KLD is extended by introducing a linear mapping so that it can well express the similarity of data distributions, and an intrinsic steepest descent method is used to solve the resulting optimization problem. Another Riemannian-based method, the graph embedding multi-kernel metric learning (GEMKML) algorithm, was proposed in [24], in which Grassmannian manifold-valued feature representations are produced as the features for metric learning, and the Grassmannian conjugate gradient method is used in the optimization process. However, the above supervised metric learning methods are too idealistic about the label information of the data, since in practical applications only partial label information may be available. Therefore, semi-supervised metric learning methods that use only partially labeled information have been developed.

Fig. 1 Illustration of semi-supervised metric learning

Semi-supervised learning integrates the advantages of supervised and unsupervised learning, aiming to overcome the limitations inherent in both approaches. On the one hand, unsupervised learning lacks real sample information to support its clustering results, which compromises accuracy and stability. On the other hand, supervised learning requires a large number of training samples, while labeling samples demands a significant amount of manpower and time, which often leads to a lack of training samples in practical applications. Consequently, semi-supervised learning has attracted increasing attention in recent years and has demonstrated remarkable application value and potential in fields such as medicine/healthcare [25], transportation [26], and manufacturing [27]. The goal of semi-supervised metric learning is to exploit unlabeled data, with the help of labeled data, so that an appropriate metric is learned that agrees with pairwise or triplet constraints. The main concept of semi-supervised metric learning is shown in Fig. 1. For example, Hoi et al. [19] first introduced semi-supervised learning into metric learning and put forward Laplacian Regularized Metric Learning (LRML), which combines labeled and unlabeled data through an effective graph regularization framework. Another semi-supervised method, regularized semi-supervised metric learning (RSSML) [20], considers the local topology and triplet constraints in a model regularized by the unlabeled samples, satisfying the three basic assumptions of semi-supervised learning, namely the smoothness, cluster and manifold assumptions. Based on these assumptions, semi-supervised metric learning for stratified spaces (S2MLS2) [28] was developed, in which unsuitable local constraints are eliminated to fit the multi-manifold smoothness assumption, and some non-local constraints are introduced to detect shared structures at different positions to compensate for the lack of supervised information. Moreover, Li et al. [29] put forward a semi-supervised method called regularized large margin distance metric learning (RLMM), which considers the triplet constraint and the pairwise constraint at the same time by using two hinge loss functions. Wang et al. [30] proposed a coefficient-based semi-supervised metric learning model, in which a linear combination of a set of base vectors is learned as the new metric instead of the traditional metric matrix, combined with the pairwise constraint and a sparse regularization. In addition, Ying et al. [22] proposed a metric learning algorithm based on Riemannian manifold optimization [31], in which a semi-supervised model is formulated by considering the intra-class and inter-class metric information, and an adaptive parameter is designed into the triplet constraint to balance the intra- and inter-class metrics by using the point-to-point structure of the data. Furthermore, semi-supervised subspace metric learning [32] learns a low-dimensional subspace via an inverse problem on a Grassmannian manifold and then learns several local positive definite metrics on this subspace.

Different from the above works, we design a semi-supervised metric learning method based on the point-to-class structure of the labeled data. Intuitively, under a good metric each point should be closer to its own class center than to any other class. Based on this assumption, the point-to-class structure is formulated into a new triplet constraint, which aims to narrow the distance of intra-class data and meanwhile to enlarge the distance of inter-class data. In the traditional point-to-point triplet constraint, the number of triples among all labeled samples is quite large. A common remedy is to use the distances from one point to its neighbors rather than to all other points [22], which, however, loses information in practice. It is therefore highly desirable to explore effective point-to-class metric learning, for example in computer vision tasks, which also reduces the computational cost accordingly. Meanwhile, different weights are introduced into the triplet constraint to measure the inter-class dissimilarity. A smaller weight corresponds to a larger distance between data in different classes, while a larger weight indicates that points in different classes are very close and would generally be prone to misclassification without metric learning. This work aims to enlarge the distances between points and the other classes, especially for points corresponding to large weight values. Note that the point-to-class distance is defined as the distance between a point and the class center. Moreover, two kinds of regularization terms are incorporated to mitigate the overfitting that occasionally occurs in LMNN due to its lack of regularization, especially in high dimensions. One is the spatial regularizer [33] for hyperspectral image data, which captures spatial information. For the other data sets, we use a more common regularizer that preserves the topological structure of the data [20]. In addition, we adopt a Riemannian gradient descent algorithm to solve the proposed model, which makes full use of the geometric structure of Riemannian manifolds and can be regarded as a generalization of unconstrained optimization in Euclidean space to a Riemannian manifold. To verify its accuracy, we test its classification performance on various data sets, including four UCI data sets, the USPS handwriting data set, and two face data sets, YaleB and ORL. In addition, hyperspectral data classification has attracted wide attention in recent years [34,35,36,37], so two hyperspectral image data sets (Indian Pines and KSC) are also used to test our method.

The rest of this work is organized as follows. Section 2 briefly reviews related works on metric learning and Riemannian manifold optimization. In Sect. 3, the improved metric learning model with the weighted triplet constraint is given, together with its Riemannian manifold algorithm. Numerical experiments demonstrating its accuracy are reported in Sect. 4, and Sect. 5 concludes this work.

2 Related works

2.1 Metric learning

For some classification or clustering problems, the traditional Euclidean distance sometimes cannot capture the distance or similarity between two samples well. The concept of metric learning was first introduced by Xing et al. [11] in 2002, aiming to accurately describe the relationship between training samples based on distance in classification problems. By employing metric learning, more suitable metrics can be discovered to better capture the intrinsic characteristics of the data distribution and consequently enhance the accuracy of classification tasks.

The goal of metric learning is to learn an effective distance metric, such that samples from the same class are brought closer together while samples from different classes are maximally separated under the new metric. Let \(A\in {\mathbb {R}}^{n\times n}\) be a positive semidefinite matrix, which defines the distance metric. Under the metric A, the distance between any two points x and y can be expressed by

$$\begin{aligned} d_A^2(x,y)=\Vert x-y\Vert _A^2=(x-y)^TA(x-y) \end{aligned}$$
(1)
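For concreteness, here is a minimal NumPy sketch of the distance (1); the function and variable names are ours, not from the paper.

```python
import numpy as np

def mahalanobis_sq(x, y, A):
    """Squared distance d_A^2(x, y) = (x - y)^T A (x - y), Eq. (1)."""
    diff = x - y
    return float(diff @ A @ diff)

# With A = I this reduces to the squared Euclidean distance.
x, y = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print(mahalanobis_sq(x, y, np.eye(2)))  # 8.0
```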

Given a data set \(X=\{x_1,x_2,\ldots ,x_n\}\) with \(x_i \in {\mathbb {R}}^n\), let S and D represent the sets of similar and dissimilar pairs, respectively, which characterize the similarity or dissimilarity relationships among sample points and can be defined by

$$\begin{aligned} S=\{(x_i,x_j)|\ x_i~and~x_j~are~similar\} \end{aligned}$$
(2)
$$\begin{aligned} D=\{(x_i,x_j)|\ x_i~and~x_j~are~dissimilar\} \end{aligned}$$
(3)

The pairwise constraints in metric learning models are commonly denoted as S and D. Therefore, based on the aforementioned pairwise constraints, our metric learning model can be formulated as follows

$$\begin{aligned} \begin{aligned} \min _{A}&\ g(A)=\sum _{(x_i,x_j)\in S} \Vert x_i-x_j\Vert _A^2 \\ s.t.&f(A)=\sum _{(x_i,x_j)\in D} \Vert x_i-x_j\Vert _A^2\geqslant 1 \\&A\succeq 0. \end{aligned} \end{aligned}$$
(4)

In this metric learning model, g(A) describes the distance between similar samples under the pairwise constraints, while f(A) describes the distance between dissimilar samples. The purpose of this model is to minimize the distance between similar samples while keeping the distance between dissimilar samples above a given threshold. According to [11], the optimization problem (4) can be solved by a projected gradient method.
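As a rough illustration only (not the exact procedure of [11]), one projected-gradient step can alternate a descent move on g(A) with a projection back onto the positive semidefinite cone; the helper names and the eigenvalue-clipping projection below are our assumptions.

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    A = (A + A.T) / 2.0
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

def projected_gradient_step(A, similar_pairs, lr=0.1):
    """One descent step on g(A) = sum_{(x_i, x_j) in S} ||x_i - x_j||_A^2 from (4)."""
    G = np.zeros_like(A)
    for xi, xj in similar_pairs:
        d = (xi - xj).reshape(-1, 1)
        G += d @ d.T   # gradient of (x_i - x_j)^T A (x_i - x_j) with respect to A
    return project_psd(A - lr * G)
```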

2.2 Riemannian manifold optimization

Riemannian manifold optimization is a special kind of optimization problem, which has a broad application in machine learning [38,39,40,41].

Referring to [42], the Riemannian optimization problem can be written as follows

$$\begin{aligned} \underset{x\in {\mathcal {M}}}{\min }f(x) \end{aligned}$$
(5)

where \({\mathcal {M}}\) is a Riemannian manifold, and \(f: {\mathcal {M}} \rightarrow {\mathbb {R}}\) is a smooth cost function or objective function. When \({\mathcal {M}}={\mathbb {R}}^n\) for some n, the optimization problem reduces to an unconstrained optimization problem in Euclidean space. The standard gradient descent algorithm in Euclidean space \({\mathbb {R}}^n\) iterates

$$\begin{aligned} x_{k+1}=x_k-\alpha _k \nabla f(x_k) \end{aligned}$$
(6)

where \(\alpha _k>0\) is the step size or learning rate, and \(\nabla f: {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\) is the gradient of f, given by

$$\begin{aligned} \nabla f=\bigg (\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2},\cdots ,\frac{\partial f}{\partial x_n}\bigg )^{T} \end{aligned}$$
(7)

However, when \({\mathcal {M}}\) is not a linear space but a general Riemannian manifold, the optimization problem becomes more difficult to solve with the gradient descent algorithm. Therefore, we need to define the Riemannian gradient and a way of moving on a Riemannian manifold.

For a Riemannian manifold \({\mathcal {M}}\), each tangent space \(T_x{\mathcal {M}}\) at a point \(x\in {\mathcal {M}}\) is equipped with an inner product \(<\cdot ,\cdot >_x\). Let f be a smooth function on the Riemannian manifold \({\mathcal {M}}\); the differential of f can be defined by

$$\begin{aligned} Df(x)[v]=\frac{d}{dt} f(\gamma (t))\bigg |_{t=0} \end{aligned}$$
(8)

where \(\gamma (t)\) is a smooth curve on \({\mathcal {M}}\) passing through the point x and satisfying \(\gamma (0)=x\), \(\gamma '(0)=v\) with \(v\in T_x{\mathcal {M}} \). The Riemannian gradient \(\textrm{grad}f(x)\) is the tangent vector in \(T_x{\mathcal {M}}\) that satisfies

$$\begin{aligned} Df(x)[v]=<v,\textrm{grad}f(x)>_x,\forall v\in T_x{\mathcal {M}}. \end{aligned}$$
(9)

According to [42], when \({\mathcal {M}}\) is a Riemannian submanifold of the Euclidean space \({\mathbb {R}}^n\) equipped with the metric \(<\cdot ,\cdot>\), and \(f:{\mathcal {M}}\rightarrow {\mathbb {R}}\) is a smooth function on \({\mathcal {M}}\), there exists a smooth extension \(\bar{f}:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) satisfying \(\bar{f}(x)=f(x)\) for \(x\in {\mathcal {M}}\). Then the Riemannian gradient can be calculated by

$$\begin{aligned} \textrm{grad}f(x)=Proj_x(\nabla \bar{f}(x)) \end{aligned}$$
(10)

where \(Proj_x:{\mathbb {R}}^n\rightarrow T_x{\mathcal {M}}\) is the projector from \({\mathbb {R}}^n\) onto \(T_x{\mathcal {M}}\), orthogonal with respect to \(<\cdot ,\cdot>\).

In addition, to update the iterates as in (6) on a Riemannian manifold, a retraction plays an important role in the iteration process. A retraction R at \(x\in {\mathcal {M}}\), denoted by \(R_x\), is a smooth map from \(T_x{\mathcal {M}}\) to \({\mathcal {M}}\) which satisfies

1. \(R_x(0_x)=x\), where \(0_x\) is the zero element in \(T_x{\mathcal {M}}\);

2. \(DR_x(0_x)=id_{T_x{\mathcal {M}}}\), where \(DR_x(0_x)\) is the differential map of \(R_x\) at \(0_x\) and \(id_{T_x{\mathcal {M}}}\) is the identity map on \(T_x{\mathcal {M}}\).

By introducing the retraction, we can restrict the variables to the same Riemannian manifold during each iteration step. Therefore, the Riemannian gradient descent algorithm can be written as follows.

Given \(x_0 \in {\mathcal {M}}\) and a retraction R, \(x_k\) can be iterated through

$$\begin{aligned} x_{k+1}=R_{x_k}(-\alpha _k\textrm{grad}f(x_k)) \end{aligned}$$
(11)

where \(\alpha _k\) is the step size in the kth iteration.
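The update (11) can be sketched generically as follows; the Riemannian gradient and the retraction are supplied as callables, and the fixed step size is an assumption made for brevity.

```python
def riemannian_gradient_descent(x0, rgrad, retract, step=1e-2, max_iter=100):
    """Iterate x_{k+1} = R_{x_k}(-alpha_k * grad f(x_k)) as in Eq. (11)."""
    x = x0
    for _ in range(max_iter):
        x = retract(x, -step * rgrad(x))
    return x
```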

3 The proposed method

3.1 Metric learning model

Given a data set \(X=\{x_1,x_2,\ldots ,x_n\}\) with \(x_i \in {\mathbb {R}}^n\). For semi-supervised learning, the data set is divided into two parts, \(X_L\) and \(X_U\). The samples in \(X_L=\{x_1, x_2, \ldots , x_l\}\) are all labeled, with label set \(Y=\{y_1, y_2, \ldots , y_l\}\), while \(X_U\) contains the remaining unlabeled samples. Our goal is to find the best metric of the form (1) under which the samples can be classified accurately.

Generally speaking, the goal of metric learning is to narrow the distance of intra-class data while enlarging the distance of inter-class data. To better illustrate our model, we rewrite the notation of the samples. Suppose that the samples in \(X_L\) can be divided into c subsets with the same labels, denoted by \(X_1, X_2, \ldots , X_c\), where \(X_i=\{x_1^{(i)}, x_2^{(i)}, \ldots , x_{n_i}^{(i)} \}\) and \(n_i\) is the number of samples in \(X_i\). For each class \(X_i\), we denote by \(\mu _i=\frac{1}{n_i}\sum _{k=1}^{n_i} x_k^{(i)} \) the center of all samples in \(X_i\). Analogous to (1), we define the distance between a single sample x and the class \(X_i\) by

$$\begin{aligned} d_A^2\left( x,X_i\right) =\left( x-\mu _i \right) ^TA\left( x-\mu _i \right) \end{aligned}$$
(12)
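A minimal sketch of the class centers \(\mu _i\) and the point-to-class distance (12); the representation of the labeled data as a list of per-class sample matrices is our assumption.

```python
import numpy as np

def class_centers(X_by_class):
    """mu_i: mean of the samples in class i; X_by_class[i] has shape (n_i, d)."""
    return [Xi.mean(axis=0) for Xi in X_by_class]

def point_to_class_sq(x, mu, A):
    """Squared point-to-class distance d_A^2(x, X_i) = (x - mu_i)^T A (x - mu_i), Eq. (12)."""
    diff = x - mu
    return float(diff @ A @ diff)
```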

Inspired by the triplet constraint in [22], we define the point-to-class triplet constraint denoted by

$$\begin{aligned} {\mathcal {T}}=\left\{ \left( x_i^{(k)}, X_k, X_l \right) :d_A \left( x_i^{(k)}, X_k \right) <d_A \left( x_i^{(k)}, X_l \right) \right\} \end{aligned}$$
(13)

Then our metric learning can be modeled as follows

$$\begin{aligned} \begin{aligned}&\min _{A} \sum _{{\mathcal {T}}} d_A\left( x_i^{(k)}, X_k\right) -d_A\left( x_i^{(k)}, X_l\right) \\&s.t. A\succ 0 \end{aligned} \end{aligned}$$
(14)

Then for each labeled sample, by the definition in (12) we rewrite the objective function as follows

$$\begin{aligned} \sum _{l \ne k}\left( d_A \left( x_i^{(k)}, \mu _k \right) -d_A\left( x_i^{(k)}, \mu _l\right) \right) \end{aligned}$$
(15)

In (15), the intra-class distance \(d_A \left( x_i^{(k)}, \mu _k \right) \) appears repeatedly and only needs to be counted once; therefore the redundant terms are removed and (15) turns into

$$\begin{aligned} d_A\left( x_i^{(k)}, \mu _k\right) -\sum _{l \ne k} d_A\left( x_i^{(k)}, \mu _l\right) \end{aligned}$$
(16)

In order to balance the terms that have been removed, and to let the model more accurately capture the distances that should be enlarged during metric learning, weights are added to the second term, as in many locality-preserving methods [16],

$$\begin{aligned} d_A\left( x_i^{(k)}, \mu _k\right) -\sum _{l \ne k}w_{i,k,l}d_A\left( x_i^{(k)}, \mu _l\right) \end{aligned}$$
(17)

where the specific weights are assigned as follows

$$\begin{aligned} w_{i,j,k}= {\left\{ \begin{array}{ll} \exp \left( \frac{-\Vert x_i^{(j)}-\mu _{k}\Vert ^2}{\sigma _{i,j}}\right) \quad &{}j \ne k \\ 0 \quad &{}j=k \end{array}\right. } \end{aligned}$$
(18)

In order to merge the two terms in Eq. (17), we redefine the weights as follows

$$\begin{aligned} W_{i,j,k}= {\left\{ \begin{array}{ll} -\exp \left( \frac{-\Vert x_i^{(j)}-\mu _{k}\Vert ^2}{\sigma _{i,j}}\right) \quad &{}j \ne k \\ 1 \quad &{}j=k \end{array}\right. } \end{aligned}$$
(19)
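A minimal sketch of the merged weight (19); the per-point scale \(\sigma _{i,j}\), whose choice is described in Sect. 4.1, is passed in here as an argument.

```python
import numpy as np

def merged_weight(x, mu_k, same_class, sigma):
    """W_{i,j,k} from Eq. (19): 1 for the own class, a negative Gaussian weight otherwise."""
    if same_class:
        return 1.0
    return -float(np.exp(-np.linalg.norm(x - mu_k) ** 2 / sigma))
```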

Then (17) can be written compactly as

$$\begin{aligned} \sum _{l=1}^{c}W_{i, k, l}d_A(x_i^{(k)}, \mu _l) \end{aligned}$$
(20)

Summing over each \(x_i^{(k)}\), we have

$$\begin{aligned} \sum _{k=1}^{c}\sum _{i=1}^{n_k}\sum _{l=1}^{c}W_{i,k,l}d_A\left( x_i^{(k)}, \mu _l\right) \end{aligned}$$
(21)

To sum up, the proposed Riemannian-based graph-regularized metric learning (RGML) model with the weighted triplet constraint can finally be expressed as follows

$$\begin{aligned} \begin{aligned}&\min _A \sum _{k=1}^{c}\sum _{i=1}^{n_k}\sum _{l=1}^{c}W_{i,k,l}d_A\left( x_i^{(k)},\mu _l\right) \\&s.t. A\succ 0 \end{aligned} \end{aligned}$$
(22)

Similar to [22], in order to prevent overfitting, a graph regularizer is added to our model, which preserves the three basic assumptions of semi-supervised learning and takes full advantage of the topology of the data. The semi-supervised metric learning model can then be written as follows

$$\begin{aligned} \begin{aligned}&\min _A \sum _{k=1}^{c}\sum _{i=1}^{n_k}\sum _{l=1}^{c}W_{i,k,l}d_A\left( x_i^{(k)},\mu _l\right) +\lambda Reg(A) \\&s.t. A\succ 0 \end{aligned} \end{aligned}$$
(23)

where Reg(A) is the graph regularizer, and \(\lambda \) is a parameter which controls the balance between the two terms.

Since the regularization term uses the unlabeled data, we do not consider the distance between points and classes in this term. Note that different graph regularization terms are chosen according to the characteristics of the data. The first regularization term is

$$\begin{aligned} \sum _{i=1}^{n}\beta _i\sum _{j \in {\mathcal {N}}(i)} S_{ij}D_{ij}^2 \end{aligned}$$
(24)

where \(\beta _i=f(p(x_i)) \in {\mathbb {R}}^+\), \(p(x_i)\) is the density of \(x_i\), \({\mathcal {N}}(i)\) is the set of nearest neighbors of \(x_i\), \(S_{ij}\) is the similarity between \(x_i\) and \(x_j\), which we want to preserve under the new metric, and \(D_{ij}\) is the distance between \(x_i\) and \(x_j\) under the metric A as in (1). For ease of calculation, we define \(\eta _{i,j}\) as follows

$$\begin{aligned} \eta _{i,j}= {\left\{ \begin{array}{ll} 1 \quad &{}j\in {\mathcal {N}}(i) \\ 0 \quad &{}otherwise \end{array}\right. } \end{aligned}$$
(25)

Then, (24) turns into

$$\begin{aligned} \sum _{i=1}^{n}\beta _i\sum _{j=1}^{n}\eta _{ij}S_{ij}D_{ij}^2=\sum _{i,j=1}^{n} W_{i,j}^{(s)}D_{ij}^2 \end{aligned}$$
(26)

where \(W_{i,j}^{(s)}=\beta _i\eta _{ij}S_{ij}\).

In addition, for hyperspectral images, the spatial regularizer [33] might be a better choice. To make it fit our metric learning model, the spatial regularizer can be reformulated as

$$\begin{aligned} \sum _{i,j=1}^{n} W_{i,j}^{(sp)}D_{ij}^2 \end{aligned}$$
(27)

where \(W_{i,j}^{(sp)}\) is the spatial weight, which reflects the spatial similarity, and can be expressed in the following

$$\begin{aligned} W_{i,j}^{(sp)}= {\left\{ \begin{array}{ll} \exp \left( \frac{-\Vert x_i-x_j\Vert ^2}{\sigma }\right) \quad &{}x_i\;and\;x_j\;are\;spatial\;neighbors \\ 0 \quad &{}otherwise \end{array}\right. } \end{aligned}$$
(28)
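A small sketch of the spatial weight (28) for one pair of pixels; whether two pixels are spatial neighbors is decided by the window size discussed in Sect. 4.4 and is passed in here as a flag.

```python
import numpy as np

def spatial_weight(xi, xj, are_neighbors, sigma):
    """W_{i,j}^{(sp)} from Eq. (28): Gaussian similarity for spatial neighbors, 0 otherwise."""
    if not are_neighbors:
        return 0.0
    return float(np.exp(-np.linalg.norm(xi - xj) ** 2 / sigma))
```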

Finally, both types of graph regularizers can be reduced to

$$\begin{aligned} \sum _{i,j=1}^{n} W_{i,j}D_{ij}^2 \end{aligned}$$
(29)

Note that the choice of \(W_{i,j}\) determines which regularizer is used in our method.

Therefore, the distance metric model eventually becomes

$$\begin{aligned} \begin{aligned}&\min _A \sum _{k=1}^{c}\sum _{i=1}^{n_k}\sum _{l=1}^{c}W_{i,k,l}d_A\left( x_i^{(k)},\mu _l\right) +\lambda \sum _{i,j=1}^{n} W_{i,j}D_{ij}^2 \\&s.t. A\succ 0 \end{aligned} \end{aligned}$$
(30)

To solve it, we first simplify the loss term in (30) by

$$\begin{aligned} \begin{aligned}&\sum _{k=1}^{c}\sum _{i=1}^{n_k}\sum _{l=1}^{c}W_{i,k,l}\left( x_i^{(k)}-\mu _l\right) ^{T}A\left( x_i^{(k)}-\mu _l\right) \\&\quad =Tr(\sum _{k=1}^{c}\sum _{i=1}^{n_k}\sum _{l=1}^{c}W_{i,k,l}\left( x_i^{(k)}-\mu _l\right) \left( x_i^{(k)}-\mu _l\right) ^{T}A) \\ \end{aligned} \end{aligned}$$
(31)

Let

$$\begin{aligned} M=\sum _{k=1}^{c}\sum _{i=1}^{n_k}\sum _{l=1}^{c}W_{i,k,l}\left( x_i^{(k)}-\mu _l\right) \left( x_i^{(k)}-\mu _l\right) ^{T} \end{aligned}$$
(32)

Then (31) turns into

$$\begin{aligned} Tr(MA) \end{aligned}$$
(33)
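The matrix M in (32) can be accumulated directly from the labeled samples, their class centers and the merged weights; a minimal sketch, where the nested layout of W, indexed as W[k][i][l] for \(W_{i,k,l}\), is our assumption.

```python
import numpy as np

def build_M(X_by_class, centers, W):
    """M = sum_{k,i,l} W_{i,k,l} (x_i^(k) - mu_l)(x_i^(k) - mu_l)^T, Eq. (32)."""
    d = centers[0].shape[0]
    M = np.zeros((d, d))
    for k, Xk in enumerate(X_by_class):
        for i, x in enumerate(Xk):
            for l, mu in enumerate(centers):
                diff = (x - mu).reshape(-1, 1)
                M += W[k][i][l] * (diff @ diff.T)
    return M
```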

For regularizer in (30), we have

$$\begin{aligned} \begin{aligned} \sum _{i,j=1}^{n} W_{i,j}D_{ij}^2&=\sum _{i,j=1}^{n} W_{i,j}\left( x_i-x_j\right) ^TA\left( x_i-x_j\right) \\&=Tr(\sum _{i,j=1}^{n} W_{i,j}\left( x_i-x_j\right) \left( x_i-x_j\right) ^TA)\\&=Tr\left( XLX^TA\right) \end{aligned} \end{aligned}$$
(34)

where \(L=D-W\) is the Laplacian matrix, \(W=(W_{i,j})_{n\times n}\) is the weight matrix, and D is a diagonal matrix defined by \(D_{i,i}=\sum _{j=1}^{n} W_{i,j}\). Let

$$\begin{aligned} R=XLX^T \end{aligned}$$
(35)

Then the regularizer turns into

$$\begin{aligned} Tr(RA) \end{aligned}$$
(36)
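A small sketch of how R in (35) can be built from the data matrix and the weight matrix; here X is assumed to store the samples as columns (a d × n array) and W is the n × n weight matrix of whichever regularizer is used.

```python
import numpy as np

def build_R(X, W):
    """R = X L X^T with L = D - W (Eq. (35)); X is d x n, W is n x n."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    return X @ L @ X.T
```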

Therefore, (30) can be rewritten as

$$\begin{aligned} \begin{aligned}&\min _A f(A)=Tr\left( MA\right) +\lambda Tr\left( RA\right) \\&s.t. A\succ 0 \end{aligned} \end{aligned}$$
(37)

3.2 Riemannian gradient descent algorithm

In our semi-supervised metric learning model, the objective function is a smooth function defined on the symmetric positive definite (SPD) manifold. According to the theory of Riemannian manifold optimization, and similar to many methods based on Riemannian optimization [22, 43,44,45,46,47], the problem can be solved using the Riemannian gradient descent (RGD) method [42]. Riemannian manifold optimization is designed for optimization problems with manifold constraints: in each iteration the variables are constrained to the same Riemannian manifold, which enables efficient and accurate solution of the model.

Recall the following optimization problem

$$\begin{aligned} \underset{x\in {\mathcal {M}}}{\min }f(x) \end{aligned}$$
(38)

where \({\mathcal {M}}\) is a Riemannian manifold, and f is a smooth function. Given \(x_0 \in {\mathcal {M}}\), \(x_k\) can be iterated by RGD through

$$\begin{aligned} x_{k+1}=R_{x_k}(-\alpha _k\textrm{grad}f(x_k)) \end{aligned}$$
(39)

For the proposed model, the SPD manifold is the set of all symmetric positive definite matrices of order n, denoted by \(S_n^{++}\), which is defined as follows

$$\begin{aligned} S_n^{++}=\left\{ S\in {\mathbb {R}}^{n\times n} \mid S^T=S, S\succ 0 \right\} . \end{aligned}$$
(40)

Equipped with the affine-invariant metric mentioned in [48], \(S_n^{++}\) becomes an \(n(n+1)/2\)-dimensional Riemannian manifold. Referring to [22], the Riemannian gradient of the smooth function f(A) in (37) can be calculated by

$$\begin{aligned} \textrm{grad}f(A)=sym\left( \nabla f(A)\right) =sym\left( M^T+\lambda R^T\right) \end{aligned}$$
(41)

where \(sym(X)=\frac{X^T+X}{2}\) is the symmetrization operator, and \(\nabla f \) is the Euclidean gradient of f, since f can also be regarded as a smooth function in Euclidean space \({\mathbb {R}}^{n\times n}\).
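Because f in (37) is linear in A, its Euclidean gradient is the constant matrix \(M^T+\lambda R^T\) and the Riemannian gradient (41) is simply its symmetrization; a minimal sketch.

```python
import numpy as np

def riemannian_grad(M, R, lam):
    """grad f(A) = sym(M^T + lambda * R^T), Eq. (41); independent of A since f is linear in A."""
    G = M.T + lam * R.T
    return (G + G.T) / 2.0
```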

Moreover, the exponential map \(\exp _{S}(L)\) is defined by

$$\begin{aligned} \exp _{S}(L)=S^{1/2}\exp {\left( S^{-1/2}LS^{-1/2}\right) }S^{1/2} \end{aligned}$$
(42)

where \(S\in S_n^{++}\), \(L\in T_SS_n^{++}\) is a tangent vector at S, and \(\exp (X)=\sum _{n=0}^{\infty }\frac{X^n}{n!}\) is the matrix exponential. Clearly, \(\exp _{S}(L)\) satisfies the two conditions in Sect. 2.2 and can therefore be used as the retraction in our algorithm.
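A small sketch of the retraction (42), using SciPy's matrix square root and matrix exponential; the use of SciPy here is our implementation choice.

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def exp_map_spd(S, L):
    """exp_S(L) = S^{1/2} expm(S^{-1/2} L S^{-1/2}) S^{1/2}, Eq. (42)."""
    S_half = np.real(sqrtm(S))          # S^{1/2}; real part guards against round-off
    S_half_inv = np.linalg.inv(S_half)  # S^{-1/2}
    return S_half @ expm(S_half_inv @ L @ S_half_inv) @ S_half
```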

Therefore, the iterative scheme for Riemannian gradient descent is given by

$$\begin{aligned} A_{k+1}=A_k^{1/2}\exp {\left( -\alpha A_k^{-1/2} \textrm{grad}f(A_k) A_k^{-1/2}\right) }A_k^{1/2} \end{aligned}$$
(43)

To sum up, the Riemannian gradient descent algorithm for the RGML model is shown in Algorithm 1.

Algorithm 1 Riemannian Gradient Descent Algorithm for Metric Learning
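As an illustration, a compact sketch of the overall procedure of Algorithm 1, under the assumptions that M (32) and R (35) have been precomputed, that a fixed step size \(\alpha \) is used, and that the iteration starts from the identity matrix (our choice).

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def rgml_riemannian_gd(M, R, lam, dim, alpha=1e-3, max_iter=200):
    """RGD for min_A Tr(MA) + lambda * Tr(RA) over SPD matrices, iteration (43)."""
    A = np.eye(dim)                                   # start from the Euclidean metric
    G = (M.T + lam * R.T + M + lam * R) / 2.0         # constant Riemannian gradient, Eq. (41)
    for _ in range(max_iter):
        A_half = np.real(sqrtm(A))
        A_half_inv = np.linalg.inv(A_half)
        A = A_half @ expm(-alpha * A_half_inv @ G @ A_half_inv) @ A_half
    return A
```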

4 Experimental results

In order to verify the performance of the proposed metric learning method, we conduct classification experiments on several data sets, including four UCI data sets (wine, iris, dermatology and balance), the USPS digit image data set, two face data sets (YaleB and ORL) and two hyperspectral image (HSI) data sets (Indian Pines and KSC). We compare our method with other algorithms, namely large margin nearest neighbors (LMNN) [13], information-theoretic metric learning (ITML) [14], Laplacian Regularized Metric Learning (LRML) [19], geometric mean metric learning (GMML) [17], positive-semidefinite constrained metric learning (PCML) [18], and kNN without any metric learning, i.e., under the Euclidean metric.

4.1 UCI data sets

Table 1 Details for the UCI datasets

The UCI machine learning repository [49] is a collection of databases, used by the machine learning community for the empirical analysis of machine learning algorithms. In the following experiments, four data sets are selected and their details are shown in Table 1.

Fig. 2 Test error rates for UCI data sets obtained by different methods

Note that each data set is split into two parts, a labeled set \(X_L\) and an unlabeled set \(X_U\). \(\mid X_L\mid \) and \(\mid X_U\mid \) are the numbers of labeled and unlabeled samples, so \(\frac{\mid X_L\mid }{\mid X_U\mid }\) is the labeled sample ratio for semi-supervised metric learning. Due to the small amount of data, a portion of the unlabeled data is also used for testing. All data used in the experiment are selected randomly: we first select part of the data as labeled data, take the rest as unlabeled data, and use part of the unlabeled data for testing. For each UCI data set, the test is repeated 30 times. The parameters of all algorithms are selected carefully to ensure their best performance, and the parameters of our method are set as follows. The scale \(\sigma _{i,j}\) in \(W_{i,j,k}\) (18) is calculated by \(\sigma _{i,j}=\Vert x_i^{(j)}-x\Vert \), where x is the kth nearest point in \(X_L\). The value of \(\lambda \) varies with the data set and is determined experimentally. The regularizer used for the UCI data is the similarity regularizer in (24). \(\beta _i\) is calculated by \(\beta _i=f(p(x_i))\), where f is a linear map and \(p(x_i)\) is given by \(p(x_i)=\frac{1}{\mid {\mathcal {N}}(i)\mid h^n}\sum _{j\in {\mathcal {N}}(i)}K_h\left( \frac{x_i-x_j}{h}\right) \), where \({\mathcal {N}}(i)\) is the set of all neighbors of \(x_i\), \(K_h\) is a Gaussian kernel and h is the bandwidth; \(p(x_i)\) is then normalized by \(\frac{p(x_i)}{\max {\{p(x)\}}}\). \(S_{ij}\) is obtained by a Gaussian kernel \(S_{ij}=\exp {\left( -d_{ij}^2/2\sigma \right) }\), where \(d_{ij}^2=\Vert x_i-x_j\Vert ^2\) is the squared Euclidean distance and \(\sigma =\min D+(1/v)(\max D-\min D)\), with \(\min D\) and \(\max D\) the minimum and maximum values of the distance between samples.
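As a small illustration of how the similarity weights for this regularizer might be computed, the sketch below builds \(S_{ij}\) with the Gaussian kernel described above; the default value of v and the use of plain Euclidean distances for \(\min D\) and \(\max D\) are our assumptions, and the density-based factor \(\beta _i\) is omitted for brevity.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def similarity_matrix(X, v=10.0):
    """S_ij = exp(-d_ij^2 / (2 sigma)) with sigma = min(D) + (1/v)(max(D) - min(D)).

    X is an (n, d) array of samples; min/max are taken over pairwise Euclidean distances."""
    d = pdist(X, metric="euclidean")                  # condensed pairwise distances
    sigma = d.min() + (1.0 / v) * (d.max() - d.min())
    D2 = squareform(d) ** 2                           # full matrix of squared distances
    return np.exp(-D2 / (2.0 * sigma))
```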

Fig. 3 Visualization results for UCI data sets ("wine", "iris", "dermatology", "balance" from top to bottom) by different methods

Figure 2 shows the classification results for the four UCI data sets obtained by the 1NN classifier under the metric A learned by the different methods. From this figure, we can see that, except for an individual case (the LRML algorithm on the "dermatology" data set), all metric learning algorithms improve on the classification results of the Euclidean metric. Moreover, the proposed method (RGML) achieves the best classification results on every UCI data set.

Fig. 4 The influence of \(\lambda \) on recognition error rates under different intervals for UCI data sets by RGML

To visualize the effect of the proposed metric learning, we project all results into a 2-dimensional space by t-SNE [50]. Visualization results for the four UCI data sets ("wine", "iris", "dermatology" and "balance" from top to bottom) under the metrics learned by different methods are shown in Fig. 3. From left to right, the columns correspond to the Euclidean metric without metric learning, GMML, PCML and RGML. Different colors represent samples from different classes, and we can clearly see that the classes are separated most clearly by our algorithm, which is superior to the other methods.

How to choose the trade-off parameter \(\lambda \) is also a key point for improving our method’s performance. Figure 4 shows its influence on recognition error rate under different intervals for four different UCI data sets. For "wine", "iris" and "dermatology", the range of \(\lambda \) in the experiment is \(0.1\sim 2\times 10^{-3}\), and for "balance", the range is \(0.1\sim 2\times 10^{-4}\). For "wine", "iris", "dermatology" and "balance", the best \(\lambda \) value is \(1.5\times 10^{-3}\), \(0.1\times 10^{-3}\), \(0.7\times 10^{-3}\), and \(1.1\times 10^{-4}\), respectively.

Fig. 5 Variation of the objective function of RGML along with the iterations for UCI data sets

Furthermore, Fig. 5 shows the variation of the objective function with the iterations of the Riemannian gradient descent (RGD) algorithm for the four UCI data sets. It can be seen that as the number of iterations increases, the value of the objective function keeps decreasing and gradually converges; the convergence is fastest for the "wine" data set.

4.2 USPS data set

The USPS data set contains 20,000 samples; all samples are \(16\times 16\) gray-level images of 10 classes of handwritten digits. In our experiment, 30% of the data are randomly selected as labeled data. The regularizer used for the USPS data set is the same as for the UCI data sets, with the same parameters.

Firstly, we conduct an experiment to select the parameter \(\lambda \). Figure 6 shows the recognition error rate under different \(\lambda \); the range of \(\lambda \) is \(0.5\sim 10\times 10^{-6}\) and the best value for the USPS data set is \(1.5\times 10^{-6}\), which is used in the following comparison experiments.

Figure 7 shows the recognition error rates for the USPS data set obtained by the 1NN classifier under the metric A learned by different methods. We can see that LRML performs badly on this data set; the reason may be that it only considers pairwise constraints for the labeled data and its regularization term only considers the relations among neighbors rather than the local topology. The performance of the proposed RGML method is comparable to that of ITML, GMML, PCML and LMNN, and clearly superior to LRML. Note that LMNN has the lowest error rate, since it enforces a large margin between points of different classes, which may be helpful in this classification task, whereas RGML has no such explicit margin.

Fig. 6 The influence of \(\lambda \) on recognition error rates under different intervals for USPS data set by RGML

Fig. 7 Recognition error rates by different methods for USPS data set

Furthermore, Fig. 8 gives the recognition results of the different methods. The first row shows the test samples, and the remaining rows display their nearest neighbours obtained under the different metrics. From this figure, we can see that the proposed method attains a very high classification accuracy compared with the other methods, e.g., LMNN, ITML, LRML and PCML, all of which misclassify some handwritten digits such as 3, 5 or 8.

4.3 Facial data sets

YaleB and ORL are two classic facial data sets. The YaleB data set contains 2414 frontal-face images of 38 subjects, about 64 images per subject, captured under different lighting conditions and with various facial expressions. The ORL database contains 400 images of 40 distinct subjects; for some subjects, the images were taken at different times with varying lighting, facial expressions and facial details. All images in both data sets are cropped and resized to \(32\times 32\). Similarly, 30% of the data are randomly selected as labeled data, and we use the same regularizer as for the UCI data sets.

Initially, we conduct experiments to determine the optimal value of the parameter \(\lambda \). Figure 9 shows the recognition error rates under different \(\lambda \) for the two facial data sets. The range of \(\lambda \) for the "yaleB" data set is \(0.5\sim 10\times 10^{-7}\) and the best value is \(1\times 10^{-7}\). For "ORL", the range of \(\lambda \) is \(0.5\sim 10\times 10^{-6}\) and the best value is \(5\times 10^{-7}\). As before, we use the optimal \(\lambda \) for RGML in the following comparisons with other methods.

Figure 10 shows the recognition error rates of the kNN classifier under the metrics learned by the different methods on the two data sets. For both face data sets, our method has the best performance among all metric learning methods for every k. Moreover, the classification performance of the proposed RGML algorithm is almost unaffected by the value of k, which shows its robustness to different k.

4.4 Hyperspectral image datasets

In this section, we perform classification experiments on two hyperspectral data sets, the Indian Pines data set and the Kennedy Space Center (KSC) data set. The Indian Pines data set consists of \(145\times 145\) pixels and 224 spectral reflectance bands in the wavelength range \(0.4\sim 2.5\times 10^{-6}\) meters. Here the number of bands is reduced to 200 by removing the bands covering the region of water absorption: [104–108], [150–163], 220. The KSC data, acquired from an altitude of approximately 20 km, have a spatial resolution of 18 m. After removing water-absorption and low-SNR bands, 176 bands are used for the analysis. For classification, 13 classes representing the various land cover types that occur in this environment are defined for the site. Detailed information on these two data sets is given in Tables 2 and 3.

In these HSI experiments, 30% of the data are randomly selected from each class as labeled data. The regularizer used here is the spatial regularizer in (27), in which the parameter \(\sigma \) has a great impact on the classification accuracy; Fig. 11 illustrates how \(\sigma \) is chosen. All experiments are repeated 10 times and we use three classification measures, i.e., the average classification accuracy over all classes (AA), the overall classification accuracy (OA) and the kappa coefficient (KC).

Figure 11 shows the OAs of the two data sets under different parameters of the regularizer. The window size determines the range of the spatial neighbors of each point. In our experiment, we use three window sizes, \(3\times 3\), \(5\times 5\) and \(7\times 7\). For each window size, we conduct classification experiments with \(\sigma \) taking the values 0.1, 0.25, 0.5, 0.75, 0.9 and 1. The results show that for the Indian Pines data set the best result is obtained with window size \(7\times 7\) and \(\sigma =0.25\); for the KSC data set, the best parameters are window size \(5\times 5\) and \(\sigma =0.1\).

Similarly, we carry out experiments on the trade-off parameter \(\lambda \). Figure 12 shows the OAs of the two data sets under different \(\lambda \). For the Indian Pines data set, \(\lambda \) ranges from 0.9 to 1.1 and the best value is 1.03. For the KSC data set, \(\lambda \) ranges from 0.08 to 0.1 and the best value is 0.081.

Fig. 8 Recognition results for the test samples under different metrics for USPS data set

Fig. 9 The influence of \(\lambda \) on recognition error rates under different intervals for yaleB and ORL data sets by RGML

Fig. 10 Recognition error rates for yaleB and ORL by different methods for k from 1 to 8

Based on these optimal parameters, we compare the proposed method with the other algorithms in Tables 4 and 5.

Tables 4 and 5 show all classification results for the Indian Pines and KSC data sets. The tables list the classification accuracy indices AA, OA and KC for all algorithms and each class, with the highest value of each index highlighted in bold. The proposed RGML method shows its superiority in most cases. Specifically, for the Indian Pines data set, the OA and KC of our method are the highest among all compared algorithms, and for the KSC data set all three indices are the highest.

Furthermore, Figs. 13 and 14 show the classification maps for the two HSI data sets. The ground truth is shown at the top left corner, where pixels in different colors represent different classes. The predicted results of the different algorithms for each pixel are presented in the classification maps in different colors. By comparing the classification maps of all algorithms with the ground truth, we can see that our algorithm performs best among all compared methods.

5 Conclusion

In this work, the point-to-class distance is used to investigate the semi-supervised metric learning problem; to some extent it describes the differences between samples better than the point-to-point structure. The goal of metric learning is to narrow the distance of intra-class data while enlarging the distance of inter-class data, which can be achieved by a triplet constraint. With the traditional point-to-point triplet constraint, considering the triplets among all labeled samples leads to a huge number of triplets in the model. A common remedy is to use only the distances from each point to its neighbors rather than to all other points, which, however, loses information in practice. An alternative triplet constraint formulated by the point-to-class structure is proposed in this work, in which we consider the distance from a point to the center of each class. In addition, to let the model more accurately capture the distances that should be enlarged during metric learning, we put different weights on the point-to-class triplet constraint to measure the inter-class dissimilarity. For each point, the class centers corresponding to larger weights are preferentially considered when enlarging the distances between the point and the other classes; by adding weights, we can automatically determine which class centers are close enough to deserve priority. A regularization term is also incorporated into the proposed model to mitigate overfitting and preserve the topological structure of the data. For different data types, two kinds of regularizers are utilized; for example, for hyperspectral image data, whose spatial neighbors usually belong to the same class, the spatial regularizer better preserves this structure during metric learning. Moreover, since the objective function of the proposed model is smooth on an SPD manifold, a Riemannian gradient descent algorithm is given, which has high computational efficiency and is an appropriate choice for solving the model. Finally, we use the learned metrics to classify different data sets and compare the classification performance with state-of-the-art methods. The experimental results show that the proposed method performs better on the classification of various kinds of data, such as hyperspectral image data.

Table 2 Number of samples of each class for the Indian Pines data set
Table 3 Number of samples of each class for the KSC data set
Fig. 11 Influence of different parameters in the regularizer for two HSI data sets

Fig. 12 Influence of different parameter \(\lambda \) for two HSI data sets

Table 4 Classification results for Indian Pines data set by 1NN under different metrics learned by different methods
Table 5 Classification results for KSC data set by 1NN under different metrics learned by different methods
Fig. 13 Classification maps for Indian Pines data set by different methods

Fig. 14 Classification maps for KSC data set by different methods

Future work will primarily focus on reducing the computational complexity of metric learning so as to lower the time cost of the model on large data sets. Moreover, for the Riemannian manifold optimization method in this work, we plan to introduce other Riemannian metrics, such as the log-Euclidean metric, in conjunction with other Riemannian optimization methods such as the Riemannian conjugate gradient method, for further research.