
1 Introduction

High-dimensional data are inconvenient to process, so dimensionality reduction seeks a suitable lower-dimensional space in which computation is cheaper while the information in the samples is retained. The motivation for metric learning is to find a suitable distance function, and it has been applied extensively in fields such as text classification and image recognition. Traditionally, the Euclidean distance is used; however, the Euclidean distance weights all features evenly and ignores the proportions of and relationships among features, which limits its ability to characterize sample structure.

Xing first proposed learning a Mahalanobis distance [1] and solved the resulting problem with a simple gradient algorithm. Schultz and Joachims proposed a model for determining the parameters of the metric matrix and a learned diagonal weight matrix [2]. Later, Weinberger and Saul designed the LMNN model, which incorporates the idea of nearest-neighbor prediction [3]. Principal component analysis (PCA) is a typical dimensionality reduction algorithm [4] and can be regarded as a special metric-learning algorithm. Another classic dimensionality reduction method is multidimensional scaling (MDS) [5], which converts distances into the inner-product form of a Gram matrix. Roweis and Saul proposed locally linear embedding, which finds low-dimensional manifolds that preserve high-dimensional neighborhood structure [6]; Tenenbaum proposed the ISOMAP algorithm, which replaces Euclidean distances with local geodesic distances [7]; Belkin and Niyogi proposed Laplacian eigenmaps [8]; Donoho and Grimes proposed the Hessian eigenmap algorithm [9].

In this paper, a new triplet constraint is designed, building on previous research on semi-supervised metric-learning models. The new triplet constraint combines the advantages of two earlier types of triplet constraints, so that the learned matrix measures distances more effectively. In order for the model to satisfy the three basic assumptions of semi-supervised learning, a regularization term built from unlabeled samples is also designed in this paper.

2 Semi-supervised Metric-Learning

2.1 Mahalanobis Distance

The Euclidean metric is defined as

$$d\left( {x_{i} ,x_{j} } \right) = \sqrt {\left( {x_{i} - x_{j} } \right)^{T} \left( {x_{i} - x_{j} } \right)}$$
(1)

In the Euclidean metric, different features of a sample are weighted evenly, so the coupling relationships between features are not considered. The Mahalanobis distance is used to overcome this limitation and is defined as follows.

$$d\left( {x_{i} ,x_{j} } \right) = \sqrt {\left( {x_{i} - x_{j} } \right)^{T} A\left( {x_{i} - x_{j} } \right)}$$
(2)

where \(A \in R^{n \times n}\) is the metric matrix. As a symmetric positive semi-definite matrix, \(A\) can be decomposed as \(A = Q^{T} Q\), and Eq. (2) can be rewritten as follows.

$$d_{A} \left( {x_{i} ,x_{j} } \right) = \sqrt {\left( {x_{i} - x_{j} } \right)^{T} A\left( {x_{i} - x_{j} } \right)} = \sqrt {\left( {x_{i} - x_{j} } \right)^{T} Q^{T} Q\left( {x_{i} - x_{j} } \right)}$$
(3)

It is naturally required that the distance between similar samples is as small as possible, and the distance between heterogeneous samples is as large as possible.

If \(S = \left\{ {\left( {x_{i} ,x_{j} } \right){:}\;x_{i} ,x_{j} \in R^{n} ,{\text{and of the same class}}} \right\}\) and \(D = \left\{ {\left( {x_{i} ,x_{j} } \right){:}\;x_{i} ,x_{j} \in R^{n} ,{\text{and of different classes}}} \right\}\), then a simple metric-learning model can be defined as

$$\mathop {{\min} }\limits_{A} \mathop \sum \limits_{{\left( {x_{i} ,x_{j} } \right) \in S}} d^{2} \left( {x_{i} ,x_{j} } \right), \quad {\text{s.t.}} \mathop \sum \limits_{{\left( {x_{i} ,x_{j} } \right) \in D}} d^{2} \left( {x_{i} ,x_{j} } \right) \ge C$$
(4)

where \(d\left( {x_{i} ,x_{j} } \right) = \sqrt {\left( {x_{i} - x_{j} } \right)^{T} A\left( {x_{i} - x_{j} } \right)}\), \(A\) is a symmetric positive semi-definite matrix, and \(C\) is a constant. Many later models were derived from this basic model.
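
For illustration, the following NumPy sketch (the helper name `mahalanobis` is ours, not part of the model) computes the distance of Eq. (2) and checks the factorized form of Eq. (3): with \(A = Q^{T}Q\), the Mahalanobis distance equals the Euclidean distance between the linearly mapped points \(Qx_i\) and \(Qx_j\).

```python
import numpy as np

def mahalanobis(x_i, x_j, A):
    """Mahalanobis distance d_A(x_i, x_j) of Eq. (2)."""
    diff = x_i - x_j
    return np.sqrt(diff @ A @ diff)

# A random symmetric positive semi-definite metric A = Q^T Q.
rng = np.random.default_rng(0)
m = 4
Q = rng.standard_normal((m, m))
A = Q.T @ Q

x_i, x_j = rng.standard_normal(m), rng.standard_normal(m)

# Eq. (3): d_A equals the Euclidean distance after the linear map x -> Qx.
assert np.isclose(mahalanobis(x_i, x_j, A), np.linalg.norm(Q @ x_i - Q @ x_j))
```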

2.2 Improved Triplet Constraint

In this paper, \(L = \left\{ {x_{1} , \ldots ,x_{l} } \right\}\) denotes the labeled samples, \(U = \left\{ {x_{l + 1} , \ldots ,x_{n} } \right\}\) denotes the unlabeled samples, and \(D = L \cup U = \left\{ {x_{1} , \ldots ,x_{n} } \right\}\) denotes the whole sample set, where \(x_{i} \in R^{m}\). \(X = \left[ {x_{1} ,x_{2} , \ldots ,x_{n} } \right]\) is the data matrix containing all input samples.

Since it is desirable to minimize the distance between similar samples while maximizing the distance between heterogeneous samples, the triplet constraints can be defined as

$${\mathcal{T}} = \left\{ {\left( {x_{i} ,x_{j} ,x_{k} } \right){:}\;D_{ij}^{2} < D_{ik}^{2} } \right\}$$
(5)

where \(D_{ij}^{2} = \left( {x_{i} - x_{j} } \right)^{T} A\left( {x_{i} - x_{j} } \right)\), \(x_{i}\) and \(x_{j}\) belong to the same class, and \(x_{i}\) and \(x_{k}\) belong to different classes. According to the definition of the triplet constraints, given the samples \(x_{i}\), \(x_{j}\), and \(x_{k}\), the purpose is to find a symmetric positive semi-definite matrix \(A\) and use it to compute the distances between these samples so that \(D_{ik}^{2} - D_{ij}^{2}\) is maximized, which is equivalent to minimizing \(D_{ij}^{2} - D_{ik}^{2}\). Based on this requirement, an initial model can be derived:

$$\mathop {{\min} }\limits_{A} \sum\limits_{T} {\left( {D_{ij}^{2} - D_{ik}^{2} } \right),\quad {\text{s.t.}}\;A \ge 0}$$
(6)
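
As a hedged illustration of how the triplet set \({\mathcal{T}}\) could be enumerated from class labels (the exhaustive enumeration and the helper name are our own; Eq. (15) later restricts triplets to neighbors):

```python
import numpy as np

def build_triplets(y):
    """All triplets (i, j, k) with y[i] == y[j] (same class) and y[i] != y[k]."""
    n = len(y)
    return [(i, j, k)
            for i in range(n)
            for j in range(n) if j != i and y[j] == y[i]
            for k in range(n) if y[k] != y[i]]

# Example: three samples from class 0 and two from class 1.
y = np.array([0, 0, 0, 1, 1])
triplets = build_triplets(y)
print(len(triplets))   # 18 triplets for this toy labeling
```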

In the regularized semi-supervised metric-learning (RSSML) [10], in order to introduce the sample distribution boundary, the RSSML model scales the triplet constraints and constructs the following form:

$${\mathcal{T}}_{2} = \left\{ {\left( {x_{i} ,x_{j} ,x_{k} } \right){:}\;D_{ij}^{2} + 1 < D_{ik}^{2} } \right\}$$
(7)

Then, the model can be converted to

$$\mathop {{\min} }\limits_{A} \mathop \sum \limits_{{\mathcal{T}}} \left[ {1 + D_{ij}^{2} - D_{ik}^{2} } \right]_{ + } , \quad {\text{s.t.}}\; A \ge 0$$
(8)

where \(\left[ z \right]_{ + }: = {\max} \left\{ {z,0} \right\}\). If \(D_{ij}^{2}\) plus a unit margin is still smaller than \(D_{ik}^{2}\), the hinge loss [11] is zero; otherwise the violation is penalized.
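
A minimal sketch of the hinge-loss objective of Eq. (8) (helper names are ours); for each triplet the summand vanishes exactly when the margin condition of Eq. (7) holds:

```python
import numpy as np

def sq_dist(X, A, i, j):
    """Squared Mahalanobis distance D_ij^2 = (x_i - x_j)^T A (x_i - x_j)."""
    d = X[i] - X[j]
    return d @ A @ d

def hinge_objective(X, A, triplets):
    """Objective of Eq. (8): sum over triplets of [1 + D_ij^2 - D_ik^2]_+."""
    return sum(max(0.0, 1.0 + sq_dist(X, A, i, j) - sq_dist(X, A, i, k))
               for i, j, k in triplets)
```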

The discriminative ability of this triplet constraint is influenced by the distribution of the sample points. When the within-class distances and the between-class distances are very close, judgment errors may occur [12, 13].

Therefore, the parameter γ is introduced to avoid such problems, and the triplet constraint is rewritten as follows

$${\mathcal{T}}_{3} = \left\{ {\left( {x_{i} ,x_{j} ,x_{k} } \right){:}\;D_{ij}^{2} < \gamma D_{ik}^{2} } \right\}$$
(9)

where \(\gamma (0 < \gamma \le 1)\) is used to balance the influence of same-class and different-class data. \(\gamma = 1/\left( {1 + \bar{D}^{ - 1} } \right)\), where \(\bar{D}\) is the average distance over the data set \(D\). When the gap between the homogeneous and heterogeneous samples is very large, \(\bar{D}\) is also very large, in which case γ tends to 1 and \({\mathcal{T}}_{3}\) is equivalent to \({\mathcal{T}}\). The revised model is:

$$\mathop {{\min} }\limits_{A} \mathop \sum \limits_{{{\mathcal{T}}_{3} }} \left( {D_{ij}^{2} - \gamma D_{ik}^{2} } \right),\quad {\text{s.t.}}\; A \ge 0$$
(10)

However, when the distance between similar samples and heterogeneous samples is very small, the model imposes too strict a restriction on \(D_{ij}^{2}\), which causes some similar samples to be excluded from the model. In this paper, another parameter \(\omega = 1 - \gamma\) is added to the model, and the triplet constraint becomes

$${\mathcal{T}}_{4} = \left\{ {\left( {x_{i} ,x_{j} ,x_{k} } \right){:}\;D_{ij}^{2} < \gamma D_{ik}^{2} + \omega } \right\}$$
(11)

Obviously, when \(\gamma\) is 1, \({\mathcal{T}}_{4}\) is equivalent to \({\mathcal{T}}\). When \(\gamma\) is very small because \(\bar{D}\) is small, \(\omega\) acts to relax the constraint, but overall \({\mathcal{T}}_{4}\) is still stricter than \({\mathcal{T}}_{2}\).

The model is revised to:

$$\mathop {{\min} }\limits_{A} \mathop \sum \limits_{{{\mathcal{T}}_{4} }} \left( {D_{ij}^{2} - \gamma D_{ik}^{2} - \omega } \right),\quad {\text{s.t.}}\;A \ge 0$$
(12)
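
The relaxed loss of Eq. (12), with \(\gamma = 1/\left( {1 + \bar{D}^{ - 1} } \right)\) computed from the average pairwise Euclidean distance and \(\omega = 1 - \gamma\), might be evaluated as in the sketch below (helper names and the use of all supplied triplets are our assumptions):

```python
import numpy as np

def relaxed_triplet_loss(X, A, triplets):
    """Loss of Eq. (12): sum over triplets of D_ij^2 - gamma * D_ik^2 - omega,
    with gamma = 1 / (1 + 1/D_bar) and omega = 1 - gamma."""
    n = len(X)
    # D_bar: average Euclidean distance over all sample pairs.
    pair_dists = [np.linalg.norm(X[i] - X[j])
                  for i in range(n) for j in range(i + 1, n)]
    d_bar = np.mean(pair_dists)
    gamma = 1.0 / (1.0 + 1.0 / d_bar)
    omega = 1.0 - gamma
    loss = 0.0
    for i, j, k in triplets:
        d_ij = (X[i] - X[j]) @ A @ (X[i] - X[j])
        d_ik = (X[i] - X[k]) @ A @ (X[i] - X[k])
        loss += d_ij - gamma * d_ik - omega
    return loss
```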

In the LMNN algorithm [3], the distances to be minimized are computed only over pairs of neighbors, because the KNN classifier does not need all samples of one class to be clustered into a single group; only the k nearest neighbors of each sample are needed.

Define two indicator functions here:

$$\eta_{ij} = \left\{ {\begin{array}{*{20}l} {1,} & {x_{i} \;{\text{and}}\;x_{j} \;{\text{are}}\;{\text{neighbours}}} \\ {0,} & {\text{other}} \\ \end{array} } \right.$$
(13)
$$y_{ij} = \left\{ {\begin{array}{*{20}c} {1,} & {x_{i} \;{\text{and}}\;x_{j} \;{\text{belong to the same class}}} \\ {0,} & {\text{other}} \\ \end{array} } \right.$$
(14)

The model can be rewritten as:

$$\mathop {{\min} }\limits_{A} \mathop \sum \limits_{i,j,k}^{l} \eta_{ij} \left( {1 - y_{ij} } \right)\left( {D_{ij}^{2} - \gamma D_{ik}^{2} - \omega } \right),\quad {\text{s.t.}}\;A \ge 0$$
(15)

Here \(l\) is the number of labeled samples.
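
The indicator functions of Eqs. (13) and (14) can be materialized as matrices, for example as in the following sketch (the k-nearest-neighbor rule with Euclidean distance and the helper name are our assumptions):

```python
import numpy as np

def build_indicators(X, y, k=4):
    """eta[i, j] = 1 iff x_j is among the k nearest neighbours of x_i (Eq. (13));
    Y[i, j]   = 1 iff x_i and x_j belong to the same class (Eq. (14))."""
    n = len(X)
    eta = np.zeros((n, n), dtype=int)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # a point is not its own neighbour
        eta[i, np.argsort(d)[:k]] = 1
    Y = (y[:, None] == y[None, :]).astype(int)
    return eta, Y
```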

2.3 Regularization Term

Based on neighbor relationships, the following regularization term is used in the LRML algorithm to incorporate unlabeled data.

$$rg\left( A \right) = \frac{1}{2}\mathop \sum \limits_{i,j = 1}^{n} S_{ij} D_{ij}^{2}$$
(16)

where

$$S_{ij} = \left\{ {\begin{array}{*{20}c} {1,} & {{\text{if}}\;i \in N\left( j \right)\;{\text{or}}\;j \in N\left( i \right),\;i \ne j} \\ {0,} & {\text{other}} \\ \end{array} } \right.$$

\(N\left( i \right) = \left\{ {x_{j} |x_{j} \;{\text{and}}\;x_{i} \;{\text{are neighbors}}} \right\}\), that is, \(N\left( i \right)\) denotes the neighborhood of \(x_{i}\) measured by the Euclidean distance.

According to the smoothness assumption, if two samples are neighbors, their distance in the new space should be minimized: \(\mathop \sum \nolimits_{i = 1}^{n} \mathop \sum \nolimits_{j \in N\left( i \right)} D_{ij}^{2}\). Introducing the similarity matrix \(\left[ {S_{ij} } \right]\) and applying the manifold assumption gives \(\mathop \sum \nolimits_{i = 1}^{n} \mathop \sum \nolimits_{j \in N\left( i \right)} S_{ij} D_{ij}^{2}\).

According to the clustering assumption, the distances of samples in high-density regions are weighted by the parameter \(\beta_{i} = f\left( {p\left( {x_{i} } \right)} \right) \in R^{ + }\), where \(p\left( {x_{i} } \right)\) is the density at \(x_{i}\) and \(f{:}\;R \to R\) is a non-negative monotonically increasing function, which gives \(\mathop \sum \nolimits_{i = 1}^{n} \beta_{i} \mathop \sum \nolimits_{j \in N\left( i \right)} S_{ij} D_{ij}^{2}\).

Finally, the regularization term is rewritten as:

$$rg\left( A \right) = \frac{1}{4}\mathop \sum \limits_{i = 1}^{n} \beta_{i} \mathop \sum \limits_{j \in N\left( i \right)} S_{ij} D_{ij}^{2}$$
(17)
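
A minimal sketch of the regularization term of Eq. (17) is given below; since the density-based weight \(f\left( {p\left( {x_{i} } \right)} \right)\) is left abstract above, the weights \(\beta_{i}\) are passed in as an argument (one crude choice, which is our assumption only, would be the neighbor count of each point):

```python
import numpy as np

def build_S(eta):
    """S_ij = 1 if i is a neighbour of j or j is a neighbour of i, i != j."""
    S = ((eta + eta.T) > 0).astype(int)
    np.fill_diagonal(S, 0)
    return S

def regularization_term(X, A, S, beta):
    """rg(A) of Eq. (17): 1/4 * sum_i beta_i * sum_{j in N(i)} S_ij * D_ij^2."""
    n = len(X)
    rg = 0.0
    for i in range(n):
        for j in range(n):
            if S[i, j]:
                d = X[i] - X[j]
                rg += beta[i] * (d @ A @ d)
    return 0.25 * rg
```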

Combined with the previous model, the following semi-supervised metric-learning model is obtained:

$$\mathop {{\min} }\limits_{A} \frac{1}{4}c\mathop \sum \limits_{i = 1}^{n} \beta_{i} \mathop \sum \limits_{j \in N\left( i \right)} S_{ij} D_{ij}^{2} + \left( {1 - c} \right)\mathop \sum \limits_{i,j,k}^{l} \eta_{ij} \left( {1 - y_{ij} } \right)\left( {D_{ij}^{2} - \gamma D_{ik}^{2} - \omega } \right)$$
(18)
$${\text{s.t.}}\;A \ge 0$$

Here \(0 \le c \le 1\) is a balance parameter used to control the weight between the regularization term and the loss term.
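
Putting the pieces together, the objective of Eq. (18) is a convex combination of the regularization term and the triplet loss; a compact sketch follows (names are ours; the triplets passed in are assumed to already be restricted to labeled neighbor pairs as selected by \(\eta_{ij} \left( {1 - y_{ij} } \right)\) in Eq. (15)):

```python
import numpy as np

def objective(X, A, S, beta, triplets, gamma, c):
    """Eq. (18): c * rg(A) + (1 - c) * sum over labeled triplets of
    (D_ij^2 - gamma * D_ik^2 - omega), with omega = 1 - gamma."""
    omega = 1.0 - gamma
    n = len(X)
    rg = 0.25 * sum(beta[i] * ((X[i] - X[j]) @ A @ (X[i] - X[j]))
                    for i in range(n) for j in range(n) if S[i, j])
    loss = sum((X[i] - X[j]) @ A @ (X[i] - X[j])
               - gamma * ((X[i] - X[k]) @ A @ (X[i] - X[k]))
               - omega
               for i, j, k in triplets)
    return c * rg + (1.0 - c) * loss
```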

3 Model Simplification and Algorithm

For convenience of computation, the model is simplified as follows

$$\mathop {{\min} }\limits_{A} \;c\,tr\left( {XLX^{T} A} \right) + \left( {1 - c} \right)\left( {tr\left( {MA} \right) - \mathop \sum \limits_{i,j}^{l} C_{ij}^{1} \omega } \right),\quad {\text{s.t.}}\;A \ge 0$$
(19)

where \(M = X_{L} \left( {D_{c} - C} \right)X_{L}^{T}\), \(C = C^{1} - C^{0}\), \(C^{1} = {\text{diag}}\left( {\left( {ee^{T} - Y} \right)e} \right)H\), \(C^{0} = \gamma {\text{diag}}\left( {He} \right)\left( {ee^{T} - Y} \right)\), \(D_{c} = {\text{diag}}\left( {Ce} \right)\), and \(L = \mathop \sum \nolimits_{i = 1}^{n} L^{\left( i \right)}\), where \(L^{\left( i \right)} = D_{w}^{\left( i \right)} - W^{\left( i \right)}\) is a Laplacian matrix and \(D_{w}^{\left( i \right)}\) is the diagonal matrix with entries \(D_{w}^{\left( i \right)} \left( {k,k} \right) = \mathop \sum \nolimits_{j} W_{kj}^{\left( i \right)}\).

The optimal solution of the model can be obtained by using the steepest descent algorithm over symmetric positive semi-definite matrices (Table 1).

Table 1 Algorithm steps
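
Table 1 is not reproduced here. A typical projected steepest-descent step of the kind described above alternates a gradient step on \(A\) with a projection back onto the positive semi-definite cone by eigenvalue clipping; the sketch below is our reading of such a scheme, not the paper's exact algorithm:

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone (A >= 0) by clipping
    negative eigenvalues to zero."""
    A = (A + A.T) / 2.0
    w, V = np.linalg.eigh(A)
    return (V * np.clip(w, 0.0, None)) @ V.T

def projected_steepest_descent(grad_fn, A0, lr=0.01, n_iter=100):
    """Generic projected steepest descent over symmetric PSD matrices."""
    A = project_psd(A0)
    for _ in range(n_iter):
        A = project_psd(A - lr * grad_fn(A))
    return A

# For the trace form of Eq. (19) the gradient with respect to A is the constant
# matrix G = c * X L X^T + (1 - c) * M, so grad_fn could simply be `lambda A: G`.
```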

4 Numerical Experiment

The experiment uses four data sets (iris, wine, balance, breast cancer) from the UCI repository. Table 2 gives a brief description of the four data sets. \(\left| L \right|\) denotes the number of labeled samples, \(\left| U \right|\) denotes the total number of samples, and \(\left| L \right|/\left| U \right|\) denotes the labeling rate.

Table 2 Descriptive statistics

During the experiment, the whole data set is randomly divided into two parts: a labeled set \(L\) and an unlabeled set \(U\). For each sample, its 4 nearest neighbors are considered, and 10 labeled samples are selected for each class. Ten samples of each class are randomly chosen as training samples and 5 samples of each class as the test set. The experiment is repeated 20 times for each data set, and the average over the 20 runs is reported as the final result.
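
The split described above might be sketched as follows (the helper name and the fixed random seed are our assumptions; the per-class counts follow the text):

```python
import numpy as np

def split_per_class(y, n_train=10, n_test=5, seed=0):
    """Per class: n_train labeled training indices and n_test test indices;
    the remaining samples are treated as unlabeled."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:n_train + n_test])
    unlabeled = sorted(set(range(len(y))) - set(train) - set(test))
    return np.array(train), np.array(test), np.array(unlabeled)
```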

As shown in Fig. 1, according to the average results of the 20 experiments, these metric-learning models all classify much better than the Euclidean distance. Moreover, the method proposed in this paper outperforms the compared methods on each data set, and its average error rate on each data set is the smallest.

Fig. 1 Clustering results

When testing the model's sensitivity to the two parameters \(c\) and \(v\), the values of the other parameters are fixed, and the step size of \(c\) is 0.01. For the dermatology and balance data sets, the range of \(c\) is [0, 0.2]; for the wine data set, the range of \(c\) is [0.8, 1]; for the iris data set, the range of \(c\) is [0.7, 0.9]. The step size of \(v\) is 1 and its range is [1, 15]. Figures 2 and 3 show the sensitivity to the two parameters; the error rate fluctuates only slightly as \(c\) and \(v\) change.

Fig. 2 Sensitivity for c

Fig. 3 Sensitivity for v

5 Conclusions

This paper proposes a new semi-supervised metric-learning model, which uses topological information to construct a regularization term that satisfies the three important assumptions of semi-supervised learning. To solve the model, a gradient descent method is used to obtain the metric matrix \(A\). Then, \(A\) is applied in the KNN classifier, replacing the commonly used Euclidean distance with the Mahalanobis distance induced by the learned matrix, to classify data from the UCI repository used as a test set. Finally, the results of our model are compared with several existing semi-supervised metric-learning methods. The numerical results show that the classification performance of our model is better and meets the original design goals.
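
For concreteness, applying the learned matrix \(A\) inside a KNN classifier can be done, for instance, with scikit-learn's Mahalanobis metric by passing \(A\) as the `VI` parameter (a usage sketch under the assumption that \(A\) is the learned metric matrix; the helper name is ours):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_with_learned_metric(X_train, y_train, X_test, A, k=4):
    """KNN classification with d_A(x, x') = sqrt((x - x')^T A (x - x'))."""
    clf = KNeighborsClassifier(
        n_neighbors=k,
        metric="mahalanobis",
        metric_params={"VI": A},   # scikit-learn treats VI as the metric matrix
        algorithm="brute",         # brute-force search works for any given VI
    )
    clf.fit(X_train, y_train)
    return clf.predict(X_test)
```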