
1 Introduction

High-dimensional data are inconvenient to process, so dimensionality reduction seeks a suitable lower-dimensional space in which computation is cheaper while the information in the samples is retained. The motivation for metric learning is to find a suitable distance function, and it has been applied extensively in fields such as text classification and image recognition. Traditionally, the Euclidean distance is used; however, the Euclidean distance weights all features evenly and ignores the proportions of and relationships among features, which limits its ability to characterize sample structure.

Xing first proposed learning a Mahalanobis distance [1] and solved the resulting problem with a simple gradient algorithm. Schultz and Joachims proposed a model for determining the parameters of the metric matrix and a learned diagonal weight matrix [2]. Later, Weinberger and Saul designed the LMNN model, which incorporates the idea of nearest-neighbor prediction [3]. Principal component analysis (PCA) is a typical dimensionality reduction algorithm [4] and can be regarded as a special metric-learning algorithm. Another classic dimensionality reduction method is multidimensional scaling (MDS) [5], which converts distances into the inner-product form of a Gram matrix. Roweis and Saul proposed locally linear embedding, which finds low-dimensional manifolds that preserve high-dimensional neighborhood structure [6]; Tenenbaum proposed the ISOMAP algorithm, which replaces Euclidean distances with local geodesic distances [7]; Belkin and Niyogi proposed Laplacian eigenmaps [8]; Donoho and Grimes proposed the Hessian eigenmap algorithm [9].

In this paper, a new triplet constraint is designed, building on previous research on semi-supervised metric-learning models. The new triplet constraint combines the advantages of two earlier types of triplet constraints, so that the learned matrix measures distances more effectively. In order for the model to satisfy the three basic assumptions of semi-supervised learning, a regularization term built from unlabeled samples is also designed in this paper.

2 Semi-supervised Metric-Learning

2.1 Mahalanobis Distance

The Euclidean metric is defined as

$$d\left( {x_{i} ,x_{j} } \right) = \sqrt {\left( {x_{i} - x_{j} } \right)^{T} \left( {x_{i} - x_{j} } \right)}$$
(1)

In the Euclidean metric, different features of a sample are weighted evenly, so the coupling relationships between features are not considered. The Mahalanobis distance is used to overcome this limitation and is defined as follows.

$$d\left( {x_{i} ,x_{j} } \right) = \sqrt {\left( {x_{i} - x_{j} } \right)^{T} A\left( {x_{i} - x_{j} } \right)}$$
(2)

where \(A \in R^{n \times n}\) is the metric matrix. As a symmetric positive semi-definite matrix, \(A\) can be decomposed as \(A = Q^{T} Q\), and Eq. (2) can be rewritten as follows.

$$d_{A} \left( {x_{i} ,x_{j} } \right) = \sqrt {\left( {x_{i} - x_{j} } \right)^{T} A\left( {x_{i} - x_{j} } \right)} = \sqrt {\left( {x_{i} - x_{j} } \right)^{T} Q^{T} Q\left( {x_{i} - x_{j} } \right)}$$
(3)

It is naturally required that the distance between similar samples is as small as possible, and the distance between heterogeneous samples is as large as possible.

If \(S = \left\{ {\left( {x_{i} ,x_{j} } \right){:}\;x_{i} ,x_{j} \in R^{n} ,{\text{and of the same class}}} \right\}\) and \(D = \left\{ {\left( {x_{i} ,x_{j} } \right){:}\;x_{i} ,x_{j} \in R^{n} ,{\text{and of different classes}}} \right\}\), then a simple metric-learning model can be defined as

$$\mathop {{\min} }\limits_{A} \mathop \sum \limits_{{\left( {x_{i} ,x_{j} } \right) \in S}} d^{2} \left( {x_{i} ,x_{j} } \right), \quad {\text{s.t.}} \mathop \sum \limits_{{\left( {x_{i} ,x_{j} } \right) \in D}} d^{2} \left( {x_{i} ,x_{j} } \right) \ge C$$
(4)

where \(d\left( {x_{i} ,x_{j} } \right) = \sqrt {\left( {x_{i} - x_{j} } \right)^{T} A\left( {x_{i} - x_{j} } \right)}\), \(A\) is a symmetric positive semi-definite matrix, and \(C\) is a constant. Many later models were derived from this basic model.
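
For illustration, the following NumPy sketch (the helper name `mahalanobis` is ours, not part of the model) computes the distance of Eq. (2) and checks the factorized form of Eq. (3): with \(A = Q^{T}Q\), the Mahalanobis distance equals the Euclidean distance between the linearly mapped points \(Qx_i\) and \(Qx_j\).

```python
import numpy as np

def mahalanobis(x_i, x_j, A):
    """Mahalanobis distance d_A(x_i, x_j) of Eq. (2)."""
    diff = x_i - x_j
    return np.sqrt(diff @ A @ diff)

# A random symmetric positive semi-definite metric A = Q^T Q.
rng = np.random.default_rng(0)
m = 4
Q = rng.standard_normal((m, m))
A = Q.T @ Q

x_i, x_j = rng.standard_normal(m), rng.standard_normal(m)

# Eq. (3): d_A equals the Euclidean distance after the linear map x -> Qx.
assert np.isclose(mahalanobis(x_i, x_j, A), np.linalg.norm(Q @ x_i - Q @ x_j))
```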

2.2 Improved Triplet Constraint

In this paper, \(L = \left\{ {x_{1} , \ldots ,x_{l} } \right\}\) denotes the labeled samples, \(U = \left\{ {x_{l + 1} , \ldots ,x_{n} } \right\}\) denotes the unlabeled samples, and \(D = L \cup U = \left\{ {x_{1} , \ldots ,x_{n} } \right\}\) denotes the whole sample set, where \(x_{i} \in R^{m}\). \(X = \left[ {x_{1} ,x_{2} , \ldots ,x_{n} } \right]\) is the data matrix containing all input samples.

Since it is desirable to minimize the distance between similar samples while maximizing the distance between heterogeneous samples, the triplet constraints can be defined as

$${\mathcal{T}} = \left\{ {\left( {x_{i} ,x_{j} ,x_{k} } \right){:}\;D_{ij}^{2} < D_{ik}^{2} } \right\}$$
(5)

where \(D_{ij}^{2} = \left( {x_{i} - x_{j} } \right)^{T} A\left( {x_{i} - x_{j} } \right)\), \(x_{i}\) and \(x_{j}\) belong to the same class, and \(x_{i}\) and \(x_{k}\) belong to different classes. According to the definition of the triplet constraints, given the samples \(x_{i}\), \(x_{j}\), and \(x_{k}\), the purpose is to find a symmetric positive semi-definite matrix \(A\) and use it to compute the distances between these samples so that \(D_{ik}^{2} - D_{ij}^{2}\) is maximized, which is equivalent to minimizing \(D_{ij}^{2} - D_{ik}^{2}\). Based on this requirement, an initial model can be derived:

$$\mathop {{\min} }\limits_{A} \sum\limits_{T} {\left( {D_{ij}^{2} - D_{ik}^{2} } \right),\quad {\text{s.t.}}\;A \ge 0}$$
(6)
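
As a hedged illustration of how the triplet set \({\mathcal{T}}\) could be enumerated from class labels (the exhaustive enumeration and the helper name are our own; Eq. (15) later restricts triplets to neighbors):

```python
import numpy as np

def build_triplets(y):
    """All triplets (i, j, k) with y[i] == y[j] (same class) and y[i] != y[k]."""
    n = len(y)
    return [(i, j, k)
            for i in range(n)
            for j in range(n) if j != i and y[j] == y[i]
            for k in range(n) if y[k] != y[i]]

# Example: three samples from class 0 and two from class 1.
y = np.array([0, 0, 0, 1, 1])
triplets = build_triplets(y)
print(len(triplets))   # 18 triplets for this toy labeling
```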

In the regularized semi-supervised metric-learning (RSSML) [10], in order to introduce the sample distribution boundary, the RSSML model scales the triplet constraints and constructs the following form:

$${\mathcal{T}}_{2} = \left\{ {\left( {x_{i} ,x_{j} ,x_{k} } \right){:}\;D_{ij}^{2} + 1 < D_{ik}^{2} } \right\}$$
(7)

Then, the model can be converted to

$$\mathop {{\min} }\limits_{A} \mathop \sum \limits_{{\mathcal{T}}} \left[ {1 + D_{ij}^{2} - D_{ik}^{2} } \right]_{ + } , \quad {\text{s.t.}}\; A \ge 0$$
(8)

where \(\left[ z \right]_{ + }: = {\max} \left\{ {z,0} \right\}\). If \(D_{ij}^{2}\) plus a unit margin is still smaller than \(D_{ik}^{2}\), the hinge loss [11] is zero; otherwise the violation is penalized.
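
A minimal sketch of the hinge-loss objective of Eq. (8) (helper names are ours); for each triplet the summand vanishes exactly when the margin condition of Eq. (7) holds:

```python
import numpy as np

def sq_dist(X, A, i, j):
    """Squared Mahalanobis distance D_ij^2 = (x_i - x_j)^T A (x_i - x_j)."""
    d = X[i] - X[j]
    return d @ A @ d

def hinge_objective(X, A, triplets):
    """Objective of Eq. (8): sum over triplets of [1 + D_ij^2 - D_ik^2]_+."""
    return sum(max(0.0, 1.0 + sq_dist(X, A, i, j) - sq_dist(X, A, i, k))
               for i, j, k in triplets)
```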

The discriminative ability of this triplet constraint is influenced by the distribution of the sample points. When the within-class distances and the between-class distances are very close, judgment errors may occur [12, 13].

Therefore, the parameter γ is introduced to avoid such problems, and the triplet constraint is rewritten as follows

$${\mathcal{T}}_{3} = \left\{ {\left( {x_{i} ,x_{j} ,x_{k} } \right){:}\;D_{ij}^{2} < \gamma D_{ik}^{2} } \right\}$$
(9)

where \(\gamma (0 < \gamma \le 1)\) is used to balance the influence of same-class and different-class data. \(\gamma = 1/\left( {1 + \bar{D}^{ - 1} } \right)\), where \(\bar{D}\) is the average distance over the data set \(D\). When the gap between the homogeneous and heterogeneous samples is very large, \(\bar{D}\) is also very large, in which case γ tends to 1 and \({\mathcal{T}}_{3}\) is equivalent to \({\mathcal{T}}\). The revised model is:

$$\mathop {{\min} }\limits_{A} \mathop \sum \limits_{{{\mathcal{T}}_{3} }} \left( {D_{ij}^{2} - \gamma D_{ik}^{2} } \right),\quad {\text{s.t.}}\; A \ge 0$$
(10)

However, when the distance between similar samples and heterogeneous samples is very small, the model imposes too strict a restriction on \(D_{ij}^{2}\), which causes some similar samples to be excluded from the model. In this paper, another parameter \(\omega = 1 - \gamma\) is added to the model, and the triplet constraint becomes

$${\mathcal{T}}_{4} = \left\{ {\left( {x_{i} ,x_{j} ,x_{k} } \right){:}\;D_{ij}^{2} < \gamma D_{ik}^{2} + \omega } \right\}$$
(11)

Obviously, when \(\gamma\) is 1, \({\mathcal{T}}_{4}\) is equivalent to \({\mathcal{T}}\). When \(\gamma\) is very small because \(\bar{D}\) is small, \(\omega\) acts to relax the constraint, but overall \({\mathcal{T}}_{4}\) is still stricter than \({\mathcal{T}}_{2}\).

The model is revised to:

$$\mathop {{\min} }\limits_{A} \mathop \sum \limits_{{{\mathcal{T}}_{4} }} \left( {D_{ij}^{2} - \gamma D_{ik}^{2} - \omega } \right),\quad {\text{s.t.}}\;A \ge 0$$
(12)
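
The relaxed loss of Eq. (12), with \(\gamma = 1/\left( {1 + \bar{D}^{ - 1} } \right)\) computed from the average pairwise Euclidean distance and \(\omega = 1 - \gamma\), might be evaluated as in the sketch below (helper names and the use of all supplied triplets are our assumptions):

```python
import numpy as np

def relaxed_triplet_loss(X, A, triplets):
    """Loss of Eq. (12): sum over triplets of D_ij^2 - gamma * D_ik^2 - omega,
    with gamma = 1 / (1 + 1/D_bar) and omega = 1 - gamma."""
    n = len(X)
    # D_bar: average Euclidean distance over all sample pairs.
    pair_dists = [np.linalg.norm(X[i] - X[j])
                  for i in range(n) for j in range(i + 1, n)]
    d_bar = np.mean(pair_dists)
    gamma = 1.0 / (1.0 + 1.0 / d_bar)
    omega = 1.0 - gamma
    loss = 0.0
    for i, j, k in triplets:
        d_ij = (X[i] - X[j]) @ A @ (X[i] - X[j])
        d_ik = (X[i] - X[k]) @ A @ (X[i] - X[k])
        loss += d_ij - gamma * d_ik - omega
    return loss
```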

In the LMNN algorithm [3], the distances to be minimized are computed only over pairs of neighbors, because the KNN classifier does not need all samples of one class to be clustered into a single group; only the k nearest neighbors of each sample are needed.

Define two indicator functions here:

$$\eta_{ij} = \left\{ {\begin{array}{*{20}l} {1,} & {x_{i} \;{\text{and}}\;x_{j} \;{\text{are}}\;{\text{neighbours}}} \\ {0,} & {\text{other}} \\ \end{array} } \right.$$
(13)
$$y_{ij} = \left\{ {\begin{array}{*{20}c} {1,} & {x_{i} \;{\text{and}}\;x_{j} \;{\text{belong to the same class}}} \\ {0,} & {\text{other}} \\ \end{array} } \right.$$
(14)

The model can be rewritten as:

$$\mathop {{\min} }\limits_{A} \mathop \sum \limits_{i,j,k}^{l} \eta_{ij} \left( {1 - y_{ij} } \right)\left( {D_{ij}^{2} - \gamma D_{ik}^{2} - \omega } \right),\quad {\text{s.t.}}\;A \ge 0$$
(15)

Here \(l\) is the number of labeled samples.
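
The indicator functions of Eqs. (13) and (14) can be materialized as matrices, for example as in the following sketch (the k-nearest-neighbor rule with Euclidean distance and the helper name are our assumptions):

```python
import numpy as np

def build_indicators(X, y, k=4):
    """eta[i, j] = 1 iff x_j is among the k nearest neighbours of x_i (Eq. (13));
    Y[i, j]   = 1 iff x_i and x_j belong to the same class (Eq. (14))."""
    n = len(X)
    eta = np.zeros((n, n), dtype=int)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # a point is not its own neighbour
        eta[i, np.argsort(d)[:k]] = 1
    Y = (y[:, None] == y[None, :]).astype(int)
    return eta, Y
```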

2.3 Regularization Term

Based on neighbor relationships, the following regularization term is used in the LRML algorithm to incorporate unlabeled data.

$$rg\left( A \right) = \frac{1}{2}\mathop \sum \limits_{i,j = 1}^{n} S_{ij} D_{ij}^{2}$$
(16)

where

$$S_{ij} = \left\{ {\begin{array}{*{20}c} {1,} & {{\text{if}}\;i \in N\left( j \right)\;{\text{or}}\;j \in N\left( i \right),\;i \ne j} \\ {0,} & {\text{other}} \\ \end{array} } \right.$$

\(N\left( i \right) = \left\{ {x_{j} |x_{j} \;{\text{and}}\;x_{i} \;{\text{are neighbors}}} \right\}\), that is, \(N\left( i \right)\) denotes the neighborhood of \(x_{i}\) measured by the Euclidean distance.

According to the smoothness assumption, if two samples are neighbors, their distance in the new space should be minimized: \(\mathop \sum \nolimits_{i = 1}^{n} \mathop \sum \nolimits_{j \in N\left( i \right)} D_{ij}^{2}\). Introducing the similarity matrix \(\left[ {S_{ij} } \right]\) and applying the manifold assumption gives \(\mathop \sum \nolimits_{i = 1}^{n} \mathop \sum \nolimits_{j \in N\left( i \right)} S_{ij} D_{ij}^{2}\).

According to the clustering assumption, the distances of samples in high-density regions are weighted by the parameter \(\beta_{i} = f\left( {p\left( {x_{i} } \right)} \right) \in R^{ + }\), where \(p\left( {x_{i} } \right)\) is the density at \(x_{i}\) and \(f{:}\;R \to R\) is a non-negative monotonically increasing function, which gives \(\mathop \sum \nolimits_{i = 1}^{n} \beta_{i} \mathop \sum \nolimits_{j \in N\left( i \right)} S_{ij} D_{ij}^{2}\).

Finally, the regularization term is rewritten as:

$$rg\left( A \right) = \frac{1}{4}\mathop \sum \limits_{i = 1}^{n} \beta_{i} \mathop \sum \limits_{j \in N\left( i \right)} S_{ij} D_{ij}^{2}$$
(17)
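
A minimal sketch of the regularization term of Eq. (17) is given below; since the density-based weight \(f\left( {p\left( {x_{i} } \right)} \right)\) is left abstract above, the weights \(\beta_{i}\) are passed in as an argument (one crude choice, which is our assumption only, would be the neighbor count of each point):

```python
import numpy as np

def build_S(eta):
    """S_ij = 1 if i is a neighbour of j or j is a neighbour of i, i != j."""
    S = ((eta + eta.T) > 0).astype(int)
    np.fill_diagonal(S, 0)
    return S

def regularization_term(X, A, S, beta):
    """rg(A) of Eq. (17): 1/4 * sum_i beta_i * sum_{j in N(i)} S_ij * D_ij^2."""
    n = len(X)
    rg = 0.0
    for i in range(n):
        for j in range(n):
            if S[i, j]:
                d = X[i] - X[j]
                rg += beta[i] * (d @ A @ d)
    return 0.25 * rg
```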

Combined with the previous model, the following semi-supervised metric-learning model is obtained:

$$\mathop {{\min} }\limits_{A} \frac{1}{4}c\mathop \sum \limits_{i = 1}^{n} \beta_{i} \mathop \sum \limits_{j \in N\left( i \right)} S_{ij} D_{ij}^{2} + \left( {1 - c} \right)\mathop \sum \limits_{i,j,k}^{l} \eta_{ij} \left( {1 - y_{ij} } \right)\left( {D_{ij}^{2} - \gamma D_{ik}^{2} - \omega } \right)$$
(18)
$${\text{s.t.}}\;A \ge 0$$

Here \(0 \le c \le 1\) is a balance parameter used to control the weight between the regularization term and the loss term.
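
Putting the pieces together, the objective of Eq. (18) is a convex combination of the regularization term and the triplet loss; a compact sketch follows (names are ours; the triplets passed in are assumed to already be restricted to labeled neighbor pairs as selected by \(\eta_{ij} \left( {1 - y_{ij} } \right)\) in Eq. (15)):

```python
import numpy as np

def objective(X, A, S, beta, triplets, gamma, c):
    """Eq. (18): c * rg(A) + (1 - c) * sum over labeled triplets of
    (D_ij^2 - gamma * D_ik^2 - omega), with omega = 1 - gamma."""
    omega = 1.0 - gamma
    n = len(X)
    rg = 0.25 * sum(beta[i] * ((X[i] - X[j]) @ A @ (X[i] - X[j]))
                    for i in range(n) for j in range(n) if S[i, j])
    loss = sum((X[i] - X[j]) @ A @ (X[i] - X[j])
               - gamma * ((X[i] - X[k]) @ A @ (X[i] - X[k]))
               - omega
               for i, j, k in triplets)
    return c * rg + (1.0 - c) * loss
```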

3 Model Simplification and Algorithm

For convenience of computation, the model is simplified as follows

$$\mathop {{\min} }\limits_{A} \;c\,tr\left( {XLX^{T} A} \right) + \left( {1 - c} \right)\left( {tr\left( {MA} \right) - \mathop \sum \limits_{i,j}^{l} C_{ij}^{1} \omega } \right),\quad {\text{s.t.}}\;A \ge 0$$
(19)

where \(M = X_{L} \left( {D_{c} - C} \right)X_{L}^{T}\), \(C = C^{1} - C^{0}\), \(C^{1} = {\text{diag}}\left( {\left( {ee^{T} - Y} \right)e} \right)H\), \(C^{0} = \gamma {\text{diag}}\left( {He} \right)\left( {ee^{T} - Y} \right)\), \(D_{c} = {\text{diag}}\left( {Ce} \right)\), and \(L = \mathop \sum \nolimits_{i = 1}^{n} L^{\left( i \right)}\), where \(L^{\left( i \right)} = D_{w}^{\left( i \right)} - W^{\left( i \right)}\) is a Laplacian matrix and \(D_{w}^{\left( i \right)}\) is the diagonal matrix with entries \(D_{w}^{\left( i \right)} \left( {k,k} \right) = \mathop \sum \nolimits_{j} W_{kj}^{\left( i \right)}\).

The optimal solution of the model can be obtained by using the steepest descent algorithm over symmetric positive semi-definite matrices (Table 1).

Table 1 Algorithm steps
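
Table 1 is not reproduced here. A typical projected steepest-descent step of the kind described above alternates a gradient step on \(A\) with a projection back onto the positive semi-definite cone by eigenvalue clipping; the sketch below is our reading of such a scheme, not the paper's exact algorithm:

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone (A >= 0) by clipping
    negative eigenvalues to zero."""
    A = (A + A.T) / 2.0
    w, V = np.linalg.eigh(A)
    return (V * np.clip(w, 0.0, None)) @ V.T

def projected_steepest_descent(grad_fn, A0, lr=0.01, n_iter=100):
    """Generic projected steepest descent over symmetric PSD matrices."""
    A = project_psd(A0)
    for _ in range(n_iter):
        A = project_psd(A - lr * grad_fn(A))
    return A

# For the trace form of Eq. (19) the gradient with respect to A is the constant
# matrix G = c * X L X^T + (1 - c) * M, so grad_fn could simply be `lambda A: G`.
```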

4 Numerical Experiment

The experiment uses four data sets (iris, wine, balance, breast cancer) from the UCI repository. Table 2 gives a brief description of the four data sets. \(\left| L \right|\) denotes the number of labeled samples, \(\left| U \right|\) denotes the total number of samples, and \(\left| L \right|/\left| U \right|\) denotes the labeling rate.

Table 2 Descriptive statistics

During the experiment, the whole data set is randomly divided into two parts: a labeled set \(L\) and an unlabeled set \(U\). For each sample, its 4 nearest neighbors are considered, and 10 labeled samples are selected for each class. Ten samples of each class are randomly chosen as training samples and 5 samples of each class as the test set. The experiment is repeated 20 times for each data set, and the average over the 20 runs is reported as the final result.
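
The split described above might be sketched as follows (the helper name and the fixed random seed are our assumptions; the per-class counts follow the text):

```python
import numpy as np

def split_per_class(y, n_train=10, n_test=5, seed=0):
    """Per class: n_train labeled training indices and n_test test indices;
    the remaining samples are treated as unlabeled."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:n_train + n_test])
    unlabeled = sorted(set(range(len(y))) - set(train) - set(test))
    return np.array(train), np.array(test), np.array(unlabeled)
```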

As shown in Fig. 1, according to the average results of the 20 experiments, these metric-learning models all classify much better than the Euclidean distance. Moreover, the method proposed in this paper outperforms the compared methods on each data set, and its average error rate on each data set is the smallest.

Fig. 1 Clustering results

When testing the model's sensitivity to the two parameters \(c\) and \(v\), the values of the other parameters are fixed, and the step size of \(c\) is 0.01. For the dermatology and balance data sets, the range of \(c\) is [0, 0.2]; for the wine data set, the range of \(c\) is [0.8, 1]; for the iris data set, the range of \(c\) is [0.7, 0.9]. The step size of \(v\) is 1 and its range is [1, 15]. Figures 2 and 3 show the sensitivity to the two parameters; the error rate fluctuates only slightly as \(c\) and \(v\) change.

Fig. 2 Sensitivity for c

Fig. 3 Sensitivity for v

5 Conclusions

This paper proposes a new semi-supervised metric-learning model, which uses topological information to construct a regularization term that satisfies the three important assumptions of semi-supervised learning. To solve the model, a gradient descent method is used to obtain the metric matrix \(A\). Then, \(A\) is applied in the KNN classifier, replacing the commonly used Euclidean distance with the Mahalanobis distance induced by the learned matrix, to classify data from the UCI repository used as a test set. Finally, the results of our model are compared with several existing semi-supervised metric-learning methods. The numerical results show that the classification performance of our model is better and meets the original design goals.
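
For concreteness, applying the learned matrix \(A\) inside a KNN classifier can be done, for instance, with scikit-learn's Mahalanobis metric by passing \(A\) as the `VI` parameter (a usage sketch under the assumption that \(A\) is the learned metric matrix; the helper name is ours):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_with_learned_metric(X_train, y_train, X_test, A, k=4):
    """KNN classification with d_A(x, x') = sqrt((x - x')^T A (x - x'))."""
    clf = KNeighborsClassifier(
        n_neighbors=k,
        metric="mahalanobis",
        metric_params={"VI": A},   # scikit-learn treats VI as the metric matrix
        algorithm="brute",         # brute-force search works for any given VI
    )
    clf.fit(X_train, y_train)
    return clf.predict(X_test)
```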