1 INTRODUCTION

Many data analysis and machine learning problems require estimating the proximity of objects, for example, in search over images [1] or text documents [2], face recognition [3], and identification [4]. Modern methods for solving such problems use deep metric learning. By metric learning we mean learning an automatic way to compute distances (for example, as the output of a neural network) such that the distances between objects that should be similar (for example, belonging to the same class) are small, while the distances between objects that must differ are large.

We describe this idea more formally. Consider a set of objects X and a set of labels Y. Objects \(x \in X\) can be of any kind (images, texts, audio, tabular data, etc.), and each object can have its corresponding label \(y \in Y\). If objects are indexed, we assume that the object \({{x}_{t}}\) is matched to the label \({{y}_{t}}\). The general metric learning problem is to build a metric \({{D}_{\theta }}( \cdot , \cdot ):X \times X \to \mathbb{R}\) out of the parametric family \(\{ {{D}_{\theta }}\} \) so that it satisfies the property

$${{D}_{\theta }}\left( {{{x}_{1}},{{x}_{2}}} \right) \ll {{D}_{\theta }}\left( {{{x}_{1}},{{x}_{3}}} \right)$$

for \({{y}_{1}} = {{y}_{2}} \ne {{y}_{3}}\). Not all problems have object labels; however, there is often a way to construct a pair \(({{x}_{1}},{{x}_{2}})\) of objects that can be logically considered similar such as an object and its augmentation:

• a signal and its noisy copy;

• an image and its rotated copy;

• a text and the same text with some words replaced by synonyms.

Such pairs are called positive, while the other pairs, for example, a random pair of sample objects, are called negative. Formally, for the set of positive pairs P and the set of negative pairs \(N\), we can write the system of inequalities

$${{D}_{\theta }}\left( {{{x}_{i}},{{x}_{j}}} \right) < {{D}_{\theta }}\left( {{{x}_{t}},{{x}_{s}}} \right),\quad (i,j) \in P,\quad (t,s) \in N,$$

However, such a system contains a very large number of inequalities; moreover, the sign \( \ll \) in the property above expresses the desire for these distances to differ significantly, so in practice one uses the approaches described below.

The deep metric learning problem is to train a neural network, i.e., the mapping \({{f}_{\theta }}( \cdot ):X \to {{\mathbb{R}}^{n}}\), wherein

$${{D}_{\theta }}\left( {{{x}_{1}},{{x}_{2}}} \right) = D\left( {{{f}_{\theta }}({{x}_{1}}),{{f}_{\theta }}({{x}_{2}})} \right),$$

where D is a fixed simple metric in the finite-dimensional space \({{\mathbb{R}}^{n}}\) of vector representations (embeddings) of objects, the Euclidean distance being used most often. It is also common to restrict the parameterization to the unit sphere \(\left\| {{{f}_{\theta }}(x)} \right\| = 1\) and introduce the scalar product

$${{S}_{\theta }}({{x}_{1}},{{x}_{2}}) = \langle {{f}_{\theta }}({{x}_{1}}),{{f}_{\theta }}({{x}_{2}})\rangle ,$$

rather than \({{D}_{\theta }}({{x}_{1}},{{x}_{2}})\); here, one strives to make

$${{S}_{\theta }}\left( {{{x}_{1}},{{x}_{2}}} \right) \gg {{S}_{\theta }}\left( {{{x}_{1}},{{x}_{3}}} \right)$$

hold for \({{y}_{1}} = {{y}_{2}} \ne {{y}_{3}}\).

To find the optimal parameters \(\theta \), a loss function L is specified and optimized on the training set. Below, we consider particular loss functions.

Metric learning is useful in supervised learning problems where there are many classes and each class is poorly represented. It is also useful when not all classes are present in the training set and the algorithm must detect the appearance of objects from unknown classes (open-world classification).

In this work, we try to systematize neural network approaches to metric learning and perform a great number of experiments to compare different loss functions in different domains, viz. images and texts. This is the first time such a large-scale comparison has been made. Section 2 discusses the principal evaluation metrics, allowing us to compare methods with each other. Section 3 provides an overview of various methods, such as those based on comparing representations or on reducing metric learning to a classification problem. Section 4 describes experiments with these methods on various datasets, neural network architectures, and modalities.

2 EVALUATION METRICS

2.1 Search Related

An object, for which the nearest neighbors are searched for, is called a query, and the objects, over which the search is performed, are called documents; the terminology comes from information retrieval. Upon the query, the information retrieval algorithm yields an ordered list of documents; here one can use a trained metric and arrange the documents in order of increasing distance from the query. Moreover, among all documents there is a subset of relevant documents, viz. those that we would like to retrieve, and they should be at the top of our list. The following evaluation metrics assess how well this works.

Recall@K is one of the most commonly used evaluation metrics in metric learning and is defined as the proportion of queries for which a relevant document was found among \(K\) nearest neighbors. Note that this definition of Recall@K is used specifically in metric learning problems; in information retrieval, Recall@K is usually understood as the proportion of the found objects among the relevant ones.

MAP (mean average precision) is another frequently used evaluation metric. We define Precision@K as the proportion of the relevant objects among \(K\) found ones. Let the relevant objects occupy positions \({{K}_{1}}, \ldots ,{{K}_{R}}\) in the output; then we can calculate

$${\text{AP}} = \frac{1}{R}\sum\limits_{i = 1}^R {\text{Precision@}}{{K}_{i}}$$

and MAP is AP averaged over all queries.

R-Precision is also frequently calculated; it is equal to Precision@R, where R is the total number of relevant documents. In [5], a functional is proposed that combines the ideas of R-Precision and MAP. For a separate query with R relevant documents, it equals

$${\text{AP@R}} = \frac{1}{R}\sum\limits_{i = 1}^R {\text{Precision@i}},$$

the overall MAP@R is obtained for the entire dataset by averaging over all queries.

One also measures the mean reciprocal rank (MRR)

$$MRR = \frac{1}{m}\sum\limits_{i = 1}^m \frac{1}{{{\text{rank}}_{i}}},$$

where \({\text{rank}}_{i}\) is the position of the first relevant document for the ith query, and m is the total number of queries.
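For concreteness, the retrieval metrics above can be computed from the ranked output of a single query as in the following minimal NumPy sketch (the function names and the example ranking are illustrative, not from the paper); MAP, MAP@R, and MRR are then obtained by averaging the per-query values.

```python
import numpy as np

def recall_at_k(relevant: np.ndarray, k: int) -> float:
    """relevant[i] = 1 if the i-th retrieved document is relevant (sorted by distance)."""
    return float(relevant[:k].any())          # metric-learning convention: a "hit" within top-K

def precision_at_k(relevant: np.ndarray, k: int) -> float:
    return float(relevant[:k].mean())

def ap_at_r(relevant: np.ndarray) -> float:
    """AP@R: average Precision@i over the first R positions, R = number of relevant docs."""
    r = int(relevant.sum())
    return float(np.mean([precision_at_k(relevant, i) for i in range(1, r + 1)])) if r else 0.0

def reciprocal_rank(relevant: np.ndarray) -> float:
    hits = np.flatnonzero(relevant)
    return 1.0 / (hits[0] + 1) if hits.size else 0.0

# Example: 1 marks relevant documents in the ranked output of one query.
ranked = np.array([0, 1, 0, 1, 1, 0])
print(recall_at_k(ranked, 1), precision_at_k(ranked, 3), ap_at_r(ranked), reciprocal_rank(ranked))
```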

2.2 Cluster Related

In addition to search quality criteria, in metric learning one often evaluates the quality of clustering in the space \({{\mathbb{R}}^{n}}\) of representations \(\{ {{f}_{\theta }}(x){\kern 1pt} |{\kern 1pt} x \in X\} \), as representations of similar objects are expected to form clusters. The disadvantage of “cluster” evaluation metrics based on this idea is that they depend on the selected clustering algorithm.

Let \(U = \{ {{U}_{1}}, \ldots ,{{U}_{n}}\} \) be the true partition of X into groups of similar objects, and \(V = \{ {{V}_{1}}, \ldots ,{{V}_{n}}\} \) be the partition obtained by the clustering algorithm (in this work, the experiments use k-means [6]) on \({{f}_{\theta }}(X)\). Then mutual information is

$$\begin{gathered} MI(U,V) = \sum\limits_{i = 1}^{|U|} \sum\limits_{j = 1}^{|V|} {{P}_{{UV}}}(i,j)\log \frac{{{{P}_{{UV}}}(i,j)}}{{{{P}_{U}}(i){{P}_{V}}(j)}} \\ \, = KL({{P}_{{UV}}}\;{\text{||}}\;{{P}_{U}}{{P}_{V}}), \\ \end{gathered} $$

where \({{P}_{U}}(i) = \frac{{|{{U}_{i}}|}}{{|X|}}\) is the estimate of the probability of a random object falling into the ith cluster of the partition U; similarly, \({{P}_{V}}(j) = \frac{{|{{V}_{j}}|}}{{|X|}}\), and \({{P}_{{UV}}}(i,j) = \frac{{|{{U}_{i}} \cap {{V}_{j}}|}}{{|X|}}\) is the estimate of the probability of falling into \({{U}_{i}}\) and \({{V}_{j}}\) simultaneously.

The entropy of partition S is defined as follows

$$H(S) = - \sum\limits_{k = 1}^{|S|} {{P}_{S}}(k)\log {{P}_{S}}(k).$$

There exist several normalizations: NMI (normalized mutual information) and AMI (adjusted mutual information). NMI is calculated as follows

$$NMI(U,V) = \frac{{MI(U,V)}}{{\frac{1}{2}(H(U) + H(V))}},$$

while AMI

$$AMI(U,V) = \frac{{MI(U,V) - \mathbb{E}{\kern 1pt} MI(U,V)}}{{\frac{1}{2}(H(U) + H(V)) - \mathbb{E}{\kern 1pt} MI(U,V)}},$$

is used more often, where \(\mathbb{E}{\kern 1pt} MI(U,V)\) is the mathematical expectation of \(MI(U,V)\) over all possible partitions U and V.
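A minimal sketch of this cluster-based evaluation, assuming scikit-learn is used for k-means and for the NMI/AMI scores (the embeddings and labels below are random placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_mutual_info_score

embeddings = np.random.randn(1000, 128)       # f_theta(x) for the test objects
labels = np.random.randint(0, 20, size=1000)  # true classes

pred = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(embeddings)
print("NMI:", normalized_mutual_info_score(labels, pred))
print("AMI:", adjusted_mutual_info_score(labels, pred))
```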

3 METHODS OF DEEP METRIC LEARNING

3.1 Methods Based on Representation Comparison

We describe the methods that are used to train the metric and are based on comparing object representations in the value space \(\{ {{f}_{\theta }}(x){\kern 1pt} |{\kern 1pt} x \in X\} \). Each method is defined by a loss function used to optimize the neural network that defines the mapping \({{f}_{\theta }}(x)\).

3.1.1. Contrastive Loss. One of the first loss functions proposed for metric learning was Contrastive Loss [7], which is defined as follows

$$\begin{gathered} L = \mathbb{I}\left\{ {{{y}_{i}} = {{y}_{j}}} \right\}\left[ {{{D}_{\theta }}\left( {{{x}_{i}},{{x}_{j}}} \right) - {{m}_{p}}} \right]_{ + }^{2} \\ \, + \mathbb{I}\left\{ {{{y}_{i}} \ne {{y}_{j}}} \right\}\left[ {{{m}_{n}} - {{D}_{\theta }}\left( {{{x}_{i}},{{x}_{j}}} \right)} \right]_{ + }^{2}, \\ \end{gathered} $$
(1)

here and in what follows \(\mathbb{I}\{ A\} = 1\) if and only if the expression A is true, \({{\left[ z \right]}_{ + }} = \max \left\{ {z,0} \right\}\) is the cut-off function. Functional (1) penalizes positive pairs if the distance between them is greater than a certain threshold mp (most often assumed to equal zero) while it penalizes the negative pairs until the distance exceeds the threshold mn.
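A minimal PyTorch sketch of (1) for a batch of embedding pairs; the thresholds \({{m}_{p}}\), \({{m}_{n}}\) and the construction of the pairs themselves are assumed to be given elsewhere.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_class, m_p=0.0, m_n=1.0):
    d = F.pairwise_distance(emb1, emb2)           # Euclidean distance per pair
    pos = torch.clamp(d - m_p, min=0).pow(2)      # penalize distant positives
    neg = torch.clamp(m_n - d, min=0).pow(2)      # penalize close negatives
    return torch.where(same_class, pos, neg).mean()

# same_class is a boolean tensor marking positive pairs.
loss = contrastive_loss(torch.randn(8, 64), torch.randn(8, 64),
                        torch.tensor([True, False] * 4))
```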

3.1.2. Triplet Loss. In [8], it is proposed to compare triplets of objects \(({{x}_{a}},{{x}_{p}},{{x}_{n}})\), where \({{y}_{a}} = {{y}_{p}}\) yet \({{y}_{a}} \ne {{y}_{n}}\), rather than pairs of objects

$$L = {{[{{D}_{\theta }}{{({{x}_{a}},{{x}_{p}})}^{2}} - {{D}_{\theta }}{{({{x}_{a}},{{x}_{n}})}^{2}} + m]}_{ + }}.$$
(2)

The main difference between (1) and (2) is that Contrastive Loss penalizes the absolute distance within a single pair of objects while Triplet Loss limits the difference between positive and negative pairs in triplets. Triplet Loss is often used in the face recognition problem [8], with its modifications used in other domains, for example, for training text representations [2].
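A sketch of (2) in the same style; PyTorch also provides nn.TripletMarginLoss, which uses non-squared distances, so the squared form of (2) is written out explicitly here.

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # squared distance to the positive
    d_an = (anchor - negative).pow(2).sum(dim=1)   # squared distance to the negative
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

loss = triplet_loss(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64))
```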

3.1.3. Fast AP. As noted in Section 2, MAP is one of the principal evaluation metrics in metric learning problems. In [9], it is proposed to optimize an approximation of AP. The principal issue with this functional from an optimization point of view is the sorting operation, to which gradient methods cannot be applied. The authors’ idea is to interpret AP as the area under the PR curve and to consider Precision and Recall as parametric functions of the distance between the query and the objects. The authors propose the following approximation

$${\text{FastAP}} = \frac{1}{{N_{q}^{ + }}}\sum\limits_{j = 1}^L \frac{{H_{j}^{ + }h_{j}^{ + }}}{{{{H}_{j}}}}.$$
(3)

• Representations are assumed to be \({{l}_{2}}\)-normalized, with \(\left\| {{{f}_{\theta }}(x)} \right\| = 1\)—in this case, the possible distances between the query and the retrieval candidates belong to the segment [0, 2].

• The segment [0, 2] is divided into bins \(\{ {{z}_{1}}, \ldots ,{{z}_{L}}\} \) (the number of bins L is a hyperparameter), and \({{h}_{j}}\) is the number of objects included in the jth bin; \({{H}_{j}} = \sum\nolimits_{k = 1}^j {{h}_{k}} \) is the cumulative sum.

• \(h_{j}^{ + }\) and \(H_{j}^{ + }\) are the same counters, but only for the objects relevant to the query; \(N_{q}^{ + }\) is the total number of relevant objects.

• In fact, in (3) histogram binning is used, which in this case approximates the true distribution (viz., the distribution density and function) of distances by piecewise constant functions

$$h(z) = \sum\limits_{j = 1}^L {{h}_{j}} \cdot \mathbb{I}\left\{ {z \in {{z}_{j}}} \right\},\quad H(z) = \sum\limits_{j = 1}^L {{H}_{j}} \cdot \mathbb{I}\left\{ {z \in {{z}_{j}}} \right\}.$$

• Since the piecewise constant functions are non-differentiable, we replace \(h(z)\) and \(H(z)\) by linear interpolation during optimization, which leads to continuous piecewise linear functions whose gradients are defined and constant within individual bins. Such a relaxation was proposed in [10]; a short sketch of this soft binning is given below.
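A hedged sketch of the soft binning for a single query; the bin layout and names are illustrative assumptions rather than the authors’ exact implementation.

```python
import torch

def fast_ap_single_query(dist, relevant, num_bins=10):
    # dist: (N,) distances in [0, 2]; relevant: (N,) boolean relevance mask
    centers = torch.linspace(0, 2, num_bins)          # bin centers z_1..z_L
    width = 2.0 / (num_bins - 1)
    # triangular "soft count" of each object in each bin -> piecewise linear h(z)
    delta = torch.clamp(1 - (dist[None, :] - centers[:, None]).abs() / width, min=0)
    h = delta.sum(dim=1)                                      # h_j
    h_pos = (delta * relevant[None, :].float()).sum(dim=1)    # h_j^+
    H, H_pos = h.cumsum(0), h_pos.cumsum(0)                   # H_j, H_j^+
    n_pos = relevant.float().sum()
    return (H_pos * h_pos / H.clamp(min=1e-8)).sum() / n_pos

ap = fast_ap_single_query(torch.rand(100) * 2, torch.rand(100) > 0.8)
```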

3.1.4. Centroid Triplet Loss. In [11], the authors propose to modify (2) by replacing the positive and negative objects \(({{x}_{p}},{{x}_{n}})\) with respect to \({{x}_{a}}\) by the centers of their classes \(({{c}_{p}},{{c}_{n}})\)

$$L = {{[{{D}_{\theta }}{{({{x}_{a}},{{c}_{p}})}^{2}} - {{D}_{\theta }}{{({{x}_{a}},{{c}_{n}})}^{2}} + m]}_{ + }}.$$

During training, the class centers are computed within the batch (the batches themselves should be made large), with \({{x}_{a}}\) excluded from its class center. The principal motivation of this method is to simplify the inference stage: the search can be carried out only over the class centers (precomputed in advance, this time for the entire sample) rather than over all objects.
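A rough sketch of Centroid Triplet Loss for one batch, assuming every class in the batch has at least two objects; the choice of the negative class is simplified here to the first other class encountered.

```python
import torch

def class_centroid(embeddings, labels, cls, exclude=None):
    mask = labels == cls
    if exclude is not None:
        mask = mask & (torch.arange(len(labels)) != exclude)   # drop the anchor itself
    return embeddings[mask].mean(dim=0)

def centroid_triplet_loss(embeddings, labels, margin=0.2):
    losses = []
    for i, y in enumerate(labels):
        c_p = class_centroid(embeddings, labels, y, exclude=i)  # positive centroid
        other = labels[labels != y]
        if other.numel() == 0 or torch.isnan(c_p).any():        # skip singleton classes
            continue
        c_n = class_centroid(embeddings, labels, other[0])      # some negative centroid
        d_p = (embeddings[i] - c_p).pow(2).sum()
        d_n = (embeddings[i] - c_n).pow(2).sum()
        losses.append(torch.clamp(d_p - d_n + margin, min=0))
    return torch.stack(losses).mean()

loss = centroid_triplet_loss(torch.randn(16, 64), torch.randint(0, 4, (16,)))
```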

3.1.5. Margin Loss. Authors of [12] suggest using

$$L = {{\left[ {\alpha + \left( {2 \cdot \mathbb{I}\{ {{y}_{i}} = {{y}_{j}}\} - 1} \right) \cdot \left( {{{D}_{\theta }}({{x}_{i}},{{x}_{j}}) - \beta } \right)} \right]}_{ + }}.$$
(4)

In fact, (4) differs from (1) by the square of the Euclidean metric replaced by the metric itself \((l_{2}^{2}\, \mapsto \,{{l}_{2}})\) and by the modified parameterization \({{m}_{p}} = \beta - \alpha ,\) \({{m}_{n}} = \beta + \alpha \). Thus, \(\beta \) corresponds to the boundary between positive and negative pairs, and \(\alpha \) corresponds to the required distance from this boundary.
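A sketch of (4) with the \((\alpha ,\beta )\) parameterization; the hyperparameter values below are placeholders.

```python
import torch
import torch.nn.functional as F

def margin_loss(emb1, emb2, same_class, alpha=0.2, beta=1.2):
    d = F.pairwise_distance(emb1, emb2)      # non-squared l2 distance
    sign = 2 * same_class.float() - 1        # +1 for positive pairs, -1 for negative pairs
    return torch.clamp(alpha + sign * (d - beta), min=0).mean()
```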

3.1.6. Multi Similarity Loss. In [13], it is proposed to use more information about the distances between objects in the batch in the loss function

$$\begin{gathered} L = \sum\limits_{i = 1}^m \left\{ {\frac{1}{\alpha }\log \left[ {1 + \sum\limits_{k \in {{P}_{i}}} {{e}^{{ - \alpha \left( {{{S}_{{ik}}} - \lambda } \right)}}}} \right]} \right. \\ \, + \left. {\frac{1}{\beta }\log \left[ {1 + \sum\limits_{k \in {{N}_{i}}} {{e}^{{\beta \left( {{{S}_{{ik}}} - \lambda } \right)}}}} \right]} \right\}. \\ \end{gathered} $$
(5)

Here, the similarity \({{S}_{\theta }}({{x}_{i}},{{x}_{j}}) = \langle {{f}_{\theta }}({{x}_{i}}),{{f}_{\theta }}({{x}_{j}})\rangle \) is used, and \({{P}_{i}}\) and \({{N}_{i}}\) denote the sets of positive and negative pairs for the ith object in the batch. Note that expression (5) can be considered as a smooth approximation of (1) (with additional coefficients) that includes all pairwise distances within the batch.
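A sketch of (5) over one batch of \({{l}_{2}}\)-normalized embeddings; the pair-mining step of the original paper is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=50.0, lam=0.5):
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t()                                    # S_ij
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool)
    losses = []
    for i in range(len(labels)):
        pos = sim[i][same[i] & ~eye[i]]                # positives of the i-th object
        neg = sim[i][~same[i]]                         # negatives of the i-th object
        pos_term = torch.log1p(torch.exp(-alpha * (pos - lam)).sum()) / alpha
        neg_term = torch.log1p(torch.exp(beta * (neg - lam)).sum()) / beta
        losses.append(pos_term + neg_term)
    return torch.stack(losses).sum()
```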

3.1.7. SNN Loss. The loss function based on distances within a batch can also be defined as follows [14]

$$L = - \frac{1}{m}\sum\limits_{i = 1}^m \log \frac{{\sum\limits_{j = 1}^m {\mathbb{I}\{ {{y}_{i}} = {{y}_{j}}\} \exp \{ {{S}_{\theta }}({{x}_{i}},{{x}_{j}}){\text{/}}\tau \} } }}{{\sum\limits_{j = 1}^m {\exp \{ {{S}_{\theta }}({{x}_{i}},{{x}_{j}}){\text{/}}\tau \} } }},$$
(6)

where τ is the temperature (it can be either a trainable parameter or a fixed hyperparameter).
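A sketch of (6); self-pairs are excluded from the sums here, which is a common convention even though it is not written explicitly in (6).

```python
import torch
import torch.nn.functional as F

def snn_loss(embeddings, labels, tau=0.1):
    z = F.normalize(embeddings, dim=1)
    logits = (z @ z.t()) / tau
    logits.fill_diagonal_(float('-inf'))               # exclude self-similarity
    same = (labels[:, None] == labels[None, :]).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_prob = (same * log_prob.exp()).sum(dim=1)      # probability mass on the positives
    return -torch.log(pos_prob.clamp(min=1e-12)).mean()
```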

3.1.8. SupCon Loss. In [15], the authors noticed that there are different ways to aggregate positive examples in (6) and proposed the following alternative

$$L = - \frac{1}{m}\sum\limits_{i = 1}^m \sum\limits_{j = 1}^m \mathbb{I}\{ {{y}_{i}} = {{y}_{j}}\} \log \frac{{\exp \{ {{S}_{\theta }}({{x}_{i}},{{x}_{j}}){\text{/}}\tau \} }}{{\sum\limits_{k = 1}^m {\exp \{ {{S}_{\theta }}({{x}_{i}},{{x}_{k}}){\text{/}}\tau \} } }}.$$
(7)

Option (7) is usually better optimized than (6).
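A sketch of (7) in the same notation; compared with the previous sketch, the logarithm is taken per positive pair and then summed, rather than aggregating the positives inside one logarithm.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, tau=0.1):
    z = F.normalize(embeddings, dim=1)
    eye = torch.eye(len(labels), dtype=torch.bool)
    logits = (z @ z.t()) / tau
    logits = logits.masked_fill(eye, float('-inf'))     # drop self-pairs from the softmax
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    same = (labels[:, None] == labels[None, :]) & ~eye   # positive pairs, excluding self
    return -(log_prob.masked_fill(eye, 0.0) * same.float()).sum(dim=1).mean()
```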

3.1.9. SNR Loss. In [16], (1) is used, but instead of the Euclidean metric the authors optimize SNR (signal-to-noise ratio)

$${{D}_{\theta }}({{x}_{i}},{{x}_{j}}) = \frac{1}{{{\text{SNR}}}} = \frac{{{\text{Var}}\left[ {{{f}_{\theta }}({{x}_{i}}) - {{f}_{\theta }}({{x}_{j}})} \right]}}{{{\text{Var}}\left[ {{{f}_{\theta }}({{x}_{i}})} \right]}},$$

here Var is the sample variance. Unlike the Euclidean metric, the SNR distance is not symmetric, so the order of the elements in the pair matters, similar to how Triplet Loss (2) is affected by the replacement \({{x}_{a}} \leftrightarrow {{x}_{p}}\).
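The SNR distance itself is simple to write down; in the sketch below the variance is taken over the embedding coordinates, and the asymmetry in i and j is visible directly.

```python
import torch

def snr_distance(f_i, f_j):
    # f_i, f_j: (batch, n) embeddings; variance is taken over the n coordinates
    noise = f_i - f_j
    return noise.var(dim=1) / f_i.var(dim=1)
```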

3.1.10. Tuplet Margin Loss. Authors of [17] made several significant changes to (2); they are listed below, followed by a short sketch

$$L = \log \left( {1 + \sum\limits_{i = 1}^{k - 1} {{e}^{{s({\text{cos}}{{\theta }_{{a,{{n}_{i}}}}} - \,{\text{cos}}({{\theta }_{{a,p}}} - \beta ))}}}} \right).$$
(8)

\( \bullet \) In (8), the number of negative objects is not limited by the triplet, viz. for each positive pair \(({{x}_{a}},{{x}_{p}})\) from the remaining classes, one negative example is sampled, forming the set

$$({{x}_{a}},{{x}_{p}},{{x}_{{{{n}_{1}}}}}, \ldots ,{{x}_{{{{n}_{{k - 1}}}}}}).$$

\( \bullet \) Representations on the unit sphere are used \((\left\| {{{f}_{\theta }}({{x}_{i}})} \right\| = 1)\); therefore, the cosine of the angle between two vectors equals their scalar product.

\( \bullet \) The coefficient \(\beta \; \geqslant \;0\) is used to combat overfitting on hard triplets (triplets in which \({{\theta }_{{a,{{n}_{i}}}}} \ll {{\theta }_{{a,p}}}\)), which, with \(\beta = 0\), make a much larger contribution to the loss function than the other elements of the set.

\( \bullet \) Instead of the cut-off function \({{\left[ z \right]}_{ + }} = \max \left\{ {z,0} \right\}\), its differentiable approximation Softplus(z) = log(1 + \({{e}^{{sz}}})\) is used, where s is the scale factor (responsible for the radius of the hypersphere and affecting the speed of convergence).
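A minimal sketch of (8) for one tuple; the sampling of the tuple itself is assumed to be done elsewhere.

```python
import torch
import torch.nn.functional as F

def tuplet_margin_loss(anchor, positive, negatives, beta=0.01, s=32.0):
    a = F.normalize(anchor, dim=0)
    p = F.normalize(positive, dim=0)
    n = F.normalize(negatives, dim=1)                  # (k-1, n) matrix of negatives
    theta_ap = torch.acos((a * p).sum().clamp(-1 + 1e-7, 1 - 1e-7))
    cos_an = n @ a                                     # cos(theta_{a, n_i})
    logits = s * (cos_an - torch.cos(theta_ap - beta))
    return torch.log1p(torch.exp(logits).sum())

loss = tuplet_margin_loss(torch.randn(64), torch.randn(64), torch.randn(5, 64))
```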

3.1.11. Circle Loss. In [18], it is proposed to optimize a modified relaxation of (1)

$$L = \log \left[ {1 + \sum\limits_{j = 1}^L {{e}^{{\gamma {{{[s_{n}^{j} - {{m}_{n}}]}}_{ + }}s_{n}^{j}}}}\sum\limits_{i = 1}^K {{e}^{{ - \gamma {{{[{{m}_{p}} - s_{p}^{i}]}}_{ + }}s_{p}^{i}}}}} \right],$$
(9)

where \(s_{p}^{i}\) and \(s_{n}^{j}\) are the similarities of the positive and negative pairs, respectively.
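A sketch that follows the simplified relaxation exactly as printed in (9); the hyperparameter values are placeholders.

```python
import torch

def circle_loss(s_p, s_n, m_p=0.8, m_n=0.2, gamma=80.0):
    # s_p: similarities of positive pairs, s_n: similarities of negative pairs
    pos = torch.exp(-gamma * torch.clamp(m_p - s_p, min=0) * s_p).sum()
    neg = torch.exp( gamma * torch.clamp(s_n - m_n, min=0) * s_n).sum()
    return torch.log1p(neg * pos)
```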

3.2 Classification Based Methods

This group of methods differs in that a training sample labeled with a fixed number of classes \(c\) is used. In fact, a classification problem is solved, but standard classification methods do not guarantee a representation space in which objects of each class are well grouped.

3.2.1. ArcFace. In ArcFace [19], a loss function is proposed whose optimization allowed the authors to obtain high-quality vector representations in the face recognition problem. The idea is to consider Softmax-loss on the unit sphere and use the geodesic distance.

We introduce an auxiliary weight matrix \(W \in {{\mathbb{R}}^{{n \times c}}}\). When solving a classification problem, one could optimize Softmax-loss (cross-entropy)

$$ - \log \frac{{{{e}^{{W_{{{{y}_{i}}}}^{T}{{f}_{\theta }}({{x}_{i}})}}}}}{{\sum\limits_{j = 1}^c {{{e}^{{W_{j}^{T}{{f}_{\theta }}({{x}_{i}})}}}} }}.$$
(10)

We consider the jth logit in (10): \(W_{j}^{T}{{f}_{\theta }}({{x}_{i}}) = \left\| {{{W}_{j}}} \right\|\;\left\| {{{f}_{\theta }}({{x}_{i}})} \right\|\cos {{\theta }_{j}}\), where \({{\theta }_{j}}\) is the angle between \({{W}_{j}}\) and \({{f}_{\theta }}({{x}_{i}})\). We fix the norm of all weights \(\left\| {{{W}_{j}}} \right\| = 1\) and normalize the space of representations so that \(\left\| {{{f}_{\theta }}({{x}_{i}})} \right\| = s\); then, (10) takes the form (11)

$$L = - \log \frac{{{{e}^{{s(\cos ({{\theta }_{{{{y}_{i}}}}}))}}}}}{{{{e}^{{s(\cos ({{\theta }_{{{{y}_{i}}}}}))}}} + \sum\limits_{j \ne {{y}_{i}}}^{} {{{e}^{{s\cos ({{\theta }_{j}})}}}} }},$$
(11)

if we add the margin to the angle in (11), we get ArcFace Loss [19]

$$L = - \log \frac{{{{e}^{{s(\cos ({{\theta }_{{{{y}_{i}}}}} + m))}}}}}{{{{e}^{{s\left( {\cos ({{\theta }_{{{{y}_{i}}}}} + m)} \right)}}} + \sum\limits_{j \ne {{y}_{i}}}^{} {{{e}^{{s\cos ({{\theta }_{j}})}}}} }}.$$
(12)

The parameter m in this case is responsible for both intraclass compactness and interclass separability.
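A sketch of (12), assuming a trainable class-weight matrix (stored row-wise as \({{W}^{T}}\)) and standard cross-entropy on the rescaled logits.

```python
import torch
import torch.nn.functional as F

def arcface_loss(embeddings, labels, weight, s=64.0, m=0.5):
    # weight: (c, n) matrix, i.e., W^T in the notation above
    cos = F.normalize(embeddings, dim=1) @ F.normalize(weight, dim=1).t()
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=weight.size(0)).bool()
    logits = s * torch.where(target, torch.cos(theta + m), cos)  # margin only on the true class
    return F.cross_entropy(logits, labels)

W = torch.randn(100, 128, requires_grad=True)          # c = 100 classes, n = 128
loss = arcface_loss(torch.randn(32, 128), torch.randint(0, 100, (32,)), W)
```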

3.2.2. CosFace. Almost the same idea was proposed in [20], only with the margin used for cosines rather than for angles

$$L = - \log \frac{{{{e}^{{s\left( {\cos ({{\theta }_{{{{y}_{i}}}}}) - m} \right)}}}}}{{{{e}^{{s\left( {\cos ({{\theta }_{{{{y}_{i}}}}}) - m} \right)}}} + \sum\limits_{j \ne {{y}_{i}}}^{} {{{e}^{{s\cos ({{\theta }_{j}})}}}} }}.$$
(13)

Although methods (12) and (13) often have almost the same quality [19], ArcFace tends to perform slightly better in the face recognition problem.

3.2.3. SubCenter ArcFace. In [21], it is proposed to improve the robustness of functional (12) to noise. To do this, K subcenters are used instead of one center for each class. The form of loss function (12) is the same, but K weight matrices \({{W}^{1}}, \ldots ,{{W}^{K}} \in {{\mathbb{R}}^{{n \times c}}}\) are introduced

$${{\theta }_{j}} = \arccos \left( {\mathop {\max }\limits_{k \in 1, \ldots ,K} \{ ({{W}^{k}})_{j}^{T}{{f}_{\theta }}({{x}_{i}})\} } \right).$$
(14)

Thus, the closest of the K subcenters is selected in (14). With this approach, as shown in [21], most of the objects of a particular class accumulate in the neighbourhood of one subcenter, while noise objects are assigned to other, non-dominant subcenters (containing a small number of representatives) that can be discarded at the inference stage.
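A sketch of the subcenter selection (14); the rest of the loss follows the ArcFace sketch above, and the class-major layout of the weight rows is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def subcenter_cosines(embeddings, weight, num_classes, K):
    # weight: (num_classes * K, n), K consecutive rows per class
    cos = F.normalize(embeddings, dim=1) @ F.normalize(weight, dim=1).t()
    cos = cos.view(len(embeddings), num_classes, K)
    return cos.max(dim=2).values                       # closest subcenter per class
```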

3.2.4. SoftTriple Loss. In [22], the idea of several centers per class is also used:

$$L = - \log \frac{{{{e}^{{s\left( {G(i,{{y}_{i}}) + m} \right)}}}}}{{{{e}^{{s\left( {G(i,{{y}_{i}}) + m} \right)}}} + \sum\limits_{j \ne {{y}_{i}}} {{e}^{{sG(i,j)}}}}},$$

where \(G(i,c)\) is the proximity function between the \(i\)th object and class c, and \(w_{c}^{k}\) is the \(k\)th center of class c

$$G(i,c) = \sum\limits_k \frac{{\exp \left\{ {\frac{1}{\gamma }\langle {{f}_{\theta }}({{x}_{i}}),w_{c}^{k}\rangle } \right\}}}{{\sum\limits_{k{\kern 1pt} '}^{} {\exp \left\{ {\frac{1}{\gamma }\langle {{f}_{\theta }}({{x}_{i}}),w_{c}^{{k{\kern 1pt} '}}\rangle } \right\}} }}\langle {{f}_{\theta }}({{x}_{i}}),w_{c}^{k}\rangle .$$

3.2.5. Proxy-Anchor Loss. The authors of [23] propose adding special proxy objects and optimizing the distance between them and objects of the same class. Some of the methods above can be viewed this way; for example, in (12) the columns \({{W}_{j}}\) of the weight matrix can be considered as proxy objects. In [23], the following loss function is used

$$\begin{gathered} L = \frac{1}{{{\text{|}}{{P}^{ + }}{\text{|}}}}\sum\limits_{p \in {{P}^{ + }}} \log \left( {1 + \sum\limits_{x \in X_{p}^{ + }} {{e}^{{ - \alpha (S(x,p) - m)}}}} \right) \\ \, + \frac{1}{{{\text{|}}P{\text{|}}}}\sum\limits_{p \in P} \log \left( {1 + \sum\limits_{x \in X_{p}^{ - }} {{e}^{{\alpha (S(x,p) + m)}}}} \right), \\ \end{gathered} $$

where \(\alpha \) is the scale hyperparameter, m is the shift hyperparameter, P is the set of all proxy objects, \({{P}^{ + }}\) is the set of positive proxies (those having positive objects in the batch), and \(X_{p}^{ + }\) and \(X_{p}^{ - }\) are the partition of the batch into objects of the same class as the proxy p and all other objects.
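A sketch of this loss over one batch, with one trainable proxy per class; the names and the normalization choices are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, margin=0.1):
    # proxies: (c, n) trainable proxy matrix, one proxy per class
    sim = F.normalize(embeddings, dim=1) @ F.normalize(proxies, dim=1).t()  # S(x, p)
    pos_mask = F.one_hot(labels, num_classes=proxies.size(0)).bool()        # X_p^+
    pos_exp = torch.exp(-alpha * (sim - margin)).masked_fill(~pos_mask, 0).sum(dim=0)
    neg_exp = torch.exp( alpha * (sim + margin)).masked_fill(pos_mask, 0).sum(dim=0)
    with_pos = pos_mask.any(dim=0)                     # proxies that have positives in the batch
    loss_pos = torch.log1p(pos_exp[with_pos]).sum() / with_pos.sum().clamp(min=1)
    loss_neg = torch.log1p(neg_exp).sum() / proxies.size(0)
    return loss_pos + loss_neg
```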

4 EXPERIMENTS

As we saw in the previous section, there are quite a lot of loss functions for solving the metric learning problem. Often, when choosing a specific method, one checks whether a ready-made implementation is available or how popular the method is. In this work, we decided to implement and test all the described methods, ensuring fair assessment and comparison of their quality.

4.1 Data

One of the most important requirements in metric learning is that the metric properties (similar objects are close, objects of different classes are far apart) are preserved on new data. To measure quality fairly in the supervised setting, training and testing are performed on non-overlapping classes.

This work uses six datasets, viz. four with images and two text ones; the objects in them are divided into training and test sets according to Table 1.

Table 1. Partition into training and test sets; the classes of the two partitions do not intersect

4.1.1. Cars-196. The initial Cars-196 dataset [24] contains 16 185 photos divided into 196 classes. Each photo shows a car, and the class is characterized by the car’s make, year, and model (color, angle, background, etc. may vary). Examples of images are shown in Fig. 1. This is one of the most frequently used datasets for measuring the quality of metric learning methods.

Fig. 1. Cars-196.

4.1.2. CUB-200. The Caltech-UCSD Birds 200 dataset [25] contains 11 788 photos of birds, a total of 200 categories, examples are given in Fig. 2.

Fig. 2. CUB-200: random test subset after preprocessing.

4.1.3. SOP. The Stanford Online Products dataset [26] contains 120 053 images of various products, the total number of categories is 22 634, examples are given in Fig. 3.

Fig. 3. SOP: random test subset after preprocessing.

4.1.4. Dogs-130. The Tsinghua Dogs dataset [27] contains 70 432 photos of dogs from 130 breeds (Fig. 4).

Fig. 4. Dogs-130: random test subset after preprocessing.

4.1.5. News-20. The 20 Newsgroups dataset [28] is usually used to compare classification methods, but in this work we use it to build textual representations. In total, the dataset contains 18 846 texts and 20 classes.

4.1.6. WOS-134. The Web Of Science dataset [29] contains a total of 46 985 text documents. The number of categories is quite large, viz. 134, so deep metric learning approaches may be especially relevant here.

4.2 Neural Network Architectures

4.2.1. ConvNeXt. In the original articles, most of the previously discussed methods were compared on architectures from 2014–2015, viz. ResNet-50 [30] and GoogLeNet [31]. Since then, many other models have emerged for which metric learning has not been explored. In this work, ConvNeXt [32] is used as the principal neural network for images; it has a convolutional architecture yet is comparable in quality to modern transformer models. We use a pretrained ConvNeXt-T, which has 28M trainable parameters. On top of the neural network, we add a linear layer that maps the hidden state into the space of the required dimension.

4.2.2. DistilBERT. To work with text data, we use the DistilBERT transformer model [33], which is a distilled version of BERT [34]. To obtain vector representations of the required dimension, we also add a linear layer on top of the model, applied to the representation of the CLS token.

4.2.3. Training details. PyTorch and [35] were the principal frameworks used for writing the training and measurement scripts. Adam [36] was used as the optimizer for all models; the learning rate varied from \(10^{-5}\) to \(10^{-3}\), and the batch size varied from 32 to 128 (the best quality achieved for each model is reported below). The hyperparameters of the loss functions themselves were fixed in accordance with the authors’ results. Optimization used early stopping if the quality did not improve within 10 epochs. In the case of DistilBERT, warmup was additionally used for the first 100 iterations. For images, standard augmentation methods are used during training: random horizontal flip, cropping and rescaling of a random region of the image. When measuring quality, test images are scaled to 256, and then the central 224 × 224 fragment is cropped.
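For reference, a hedged sketch of how such an embedding model can be assembled (backbone plus linear projection head); the exact layer names and dimensions are illustrative assumptions, not the configuration used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

class ImageEmbedder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = convnext_tiny(weights=ConvNeXt_Tiny_Weights.DEFAULT)
        self.backbone.classifier[-1] = nn.Identity()   # drop the ImageNet classification head
        self.head = nn.Linear(768, dim)                # project to the embedding space

    def forward(self, x):
        # l2-normalization, assumed by most of the losses above
        return F.normalize(self.head(self.backbone(x)), dim=1)

model = ImageEmbedder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```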

4.3 Results

In this work, all loss functions described in Section 3 are compared on the problems discussed above, viz. those based on images (Cars-196, CUB-200, SOP, Dogs-130) and on texts (News-20, WOS-134). Note that many methods, individually or in small sets, were tested on the Cars-196, CUB-200, and SOP datasets in the works where the methods were first proposed, with the authors’ algorithms often achieving the highest quality. Below, we describe the results of comparing Deep Metric Learning approaches under completely equal conditions.

On Cars-196 (Table 2) with ConvNeXt, Tuplet Margin Loss worked best, but the highest values of the MAP functionals were achieved with Circle Loss (9). Indeed, from the point of view of different quality criteria, different methods may be better or worse (often only Recall@K is reported in articles). Note that Tuplet Margin Loss also worked best for MRR, NMI, and AMI on Cars-196.

Table 2. Evaluation metrics for Cars-196 test

For News-20 (20 Newsgroups), a text dataset with 20 classes processed with the transformer model, Table 3 shows that Tuplet Margin Loss and Triplet Loss achieve better Recall, while the MAP functionals are better with Multi Similarity Loss and the cluster functionals are better with Circle Loss.

Table 3. Evaluation metrics for 20 Newsgroups test

On CUB-200 (Table 4), the results largely repeat Cars-196, viz. Tuplet Margin Loss and Circle Loss achieve the highest quality.

Table 4. Evaluation metrics for CUB-200 test

The Dogs-130 dataset (Table 5) is quite easy to optimize, as its classes contain a particularly large number of examples. Nevertheless, Tuplet Margin Loss was clearly the best on it in terms of both Recall and AMI/NMI, while MAP turned out to be the best for SNN Loss.

Table 5. Evaluation metrics for Dogs-130 test

On SOP (Table 6), the dataset with the largest number of classes, Circle Loss performs best in terms of the Recall and MAP functionals, although Multi Similarity Loss performed better on AMI/NMI.

Table 6. Evaluation metrics for SOP test

On the final WOS-134 text dataset, SNR Loss has the highest MAP (Table 7) while cluster metrics are better for Centroid Triplet Loss. At the same time, Tuplet Margin Loss again turned out to be the best on Recall@1.

Table 7. Evaluation metrics for WOS-134 test

The results obtained for the image datasets can be considered as new benchmarks for Deep Metric Learning based on ConvNeXt rather than on ResNet or GoogLeNet. According to the results of the experiments, the methods often achieve higher quality than in the original articles; for example, on the SOP dataset with ConvNeXt, the obtained Recall@K values are higher than in [9, 12, 13, 18, 22, 23]. For the text datasets used here, such a statement of the problem was previously considered in [37], but the approach using a transformer model and loss functions from computer vision is probably being tested for the first time.

5 CONCLUSIONS

We give an overview of loss functions that were previously proposed for optimizing neural networks in deep metric learning problems. Experiments were carried out “under equal conditions” with the described loss functions for different domains, viz. images and texts, as well as with modern neural network architectures. As expected, the choice of a loss function depends on the problem, model and evaluation metrics. Nevertheless, in most of the experiments conducted, Tuplet Margin Loss and Circle Loss were more likely to achieve the best results than other methods, which is a little surprising considering that the first one was proposed in 2019 and there are many “more recent” methods.

We also note that we sometimes managed to achieve better quality in our experiments than in the works in which the loss functions involved were proposed. Although it would be possible to repeat the experiments with different partitions of the datasets into training and test sets and to construct confidence intervals for the results, we did not do so due to the complexity of such a series of experiments. Among the interesting directions in which this work can be developed, we should mention the case when the labels take real values. Most loss functions are focused on categorical label values, so their analytical form involves only objects with equal and unequal labels. Intuitively, in the case of real-valued labels, the closer the labels, the closer the representations should be, but various formalizations of this idea have not yet been fully explored.