1 INTRODUCTION

Many data analysis and machine learning problems require estimating the proximity of objects, for example, in search over images [1] or text documents [2], face recognition [3], and identification [4]. Modern methods for solving such problems use deep metric learning. By metric learning we mean learning an automatic way to compute distances (for example, as the output of a neural network) such that the distances between objects that should be similar (for example, belonging to the same class) are small, while the distances between objects that must differ are large.

We describe this idea more formally. Consider a set of objects X and a set of labels Y. Objects \(x \in X\) can be of any kind (images, texts, audio, tabular data, etc.), and each object can have its corresponding label \(y \in Y\). If objects are indexed, we assume that the object \({{x}_{t}}\) is matched to the label \({{y}_{t}}\). The general metric learning problem is to build a metric \({{D}_{\theta }}( \cdot , \cdot ):X \times X \to \mathbb{R}\) out of the parametric family \(\{ {{D}_{\theta }}\} \) so that it satisfies the property

$${{D}_{\theta }}\left( {{{x}_{1}},{{x}_{2}}} \right) \ll {{D}_{\theta }}\left( {{{x}_{1}},{{x}_{3}}} \right)$$

for \({{y}_{1}} = {{y}_{2}} \ne {{y}_{3}}\). Not all problems have object labels; however, there is often a way to construct a pair \(({{x}_{1}},{{x}_{2}})\) of objects that can be logically considered similar such as an object and its augmentation:

• a signal and its noisy copy;

• an image and its rotated copy;

• a text and the same text with some words replaced by synonyms.

Such pairs are called positive, while the other pairs, for example, a random pair of sample objects, are called negative. Formally, for the set of positive pairs P and the set of negative pairs \(N\), we can write the system of inequalities

$${{D}_{\theta }}\left( {{{x}_{i}},{{x}_{j}}} \right) < {{D}_{\theta }}\left( {{{x}_{t}},{{x}_{s}}} \right),\quad (i,j) \in P,\quad (t,s) \in N,$$

However, such a system contains a very large number of inequalities; moreover, the sign \( \ll \) in the property above expresses the desire for these distances to differ significantly, so in practice one uses the approaches described below.

The deep metric learning problem is to train a neural network, i.e., the mapping \({{f}_{\theta }}( \cdot ):X \to {{\mathbb{R}}^{n}}\), wherein

$${{D}_{\theta }}\left( {{{x}_{1}},{{x}_{2}}} \right) = D\left( {{{f}_{\theta }}({{x}_{1}}),{{f}_{\theta }}({{x}_{2}})} \right),$$

where D is a fixed simple metric in the finite-dimensional space \({{\mathbb{R}}^{n}}\) of vector representations (embeddings) of objects, the Euclidean distance being used most often. It is also common to restrict the parameterization to the unit sphere \(\left\| {{{f}_{\theta }}(x)} \right\| = 1\) and introduce the scalar product

$${{S}_{\theta }}({{x}_{1}},{{x}_{2}}) = \langle {{f}_{\theta }}({{x}_{1}}),{{f}_{\theta }}({{x}_{2}})\rangle ,$$

rather than \({{D}_{\theta }}({{x}_{1}},{{x}_{2}})\); here, one strives to make

$${{S}_{\theta }}\left( {{{x}_{1}},{{x}_{2}}} \right) \gg {{S}_{\theta }}\left( {{{x}_{1}},{{x}_{3}}} \right)$$

hold for \({{y}_{1}} = {{y}_{2}} \ne {{y}_{3}}\).

To find the optimal parameters \(\theta \), a loss function L is specified and optimized on the training set. Below, we consider particular loss functions.

Metric learning is useful in supervised learning problems where there are many classes and each class is poorly represented. It is also useful when not all classes are present in the training set and the algorithm must detect the appearance of objects from unknown classes (open-world classification).

In this work, we try to systematize neural network approaches to metric learning and perform a great number of experiments to compare different loss functions in different domains, viz. images and texts. This is the first time such a large-scale comparison has been made. Section 2 discusses the principal evaluation metrics, allowing us to compare methods with each other. Section 3 provides an overview of various methods, such as those based on comparing representations or on reducing metric learning to a classification problem. Section 4 describes experiments with these methods on various datasets, neural network architectures, and modalities.

2 EVALUATION METRICS

2.1 Search Related

An object, for which the nearest neighbors are searched for, is called a query, and the objects, over which the search is performed, are called documents; the terminology comes from information retrieval. Upon the query, the information retrieval algorithm yields an ordered list of documents; here one can use a trained metric and arrange the documents in order of increasing distance from the query. Moreover, among all documents there is a subset of relevant documents, viz. those that we would like to retrieve, and they should be at the top of our list. The following evaluation metrics assess how well this works.

Recall@K is one of the most commonly used evaluation metrics in metric learning and is defined as the proportion of queries for which a relevant document was found among \(K\) nearest neighbors. Note that this definition of Recall@K is used specifically in metric learning problems; in information retrieval, Recall@K is usually understood as the proportion of the found objects among the relevant ones.

MAP (mean average precision) is another frequently used evaluation metric. We define Precision@K as the proportion of the relevant objects among \(K\) found ones. Let the relevant objects occupy positions \({{K}_{1}}, \ldots ,{{K}_{R}}\) in the output; then we can calculate

$${\text{AP}} = \frac{1}{R}\sum\limits_{i = 1}^R {\text{Precision@}}{{K}_{i}}$$

and MAP is AP averaged over all queries.

R-Precision is also frequently calculated; it is equal to Precision@R, where R is the total number of relevant documents. In [5], a functional is proposed that combines the ideas of R-Precision and MAP. For a separate query with R relevant documents, it equals

$${\text{AP@R}} = \frac{1}{R}\sum\limits_{i = 1}^R {\text{Precision@i}},$$

the overall MAP@R is obtained for the entire dataset by averaging over all queries.

One also measures the mean reciprocal rank (MRR)

$$MRR = \frac{1}{m}\sum\limits_{i = 1}^m \frac{1}{{{\text{rank}}_{i}}},$$

where \({\text{rank}}_{i}\) is the position of the first relevant document for the ith query, and m is the total number of queries.
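For concreteness, the retrieval metrics above can be computed from the ranked output of a single query as in the following minimal NumPy sketch (the function names and the example ranking are illustrative, not from the paper); MAP, MAP@R, and MRR are then obtained by averaging the per-query values.

```python
import numpy as np

def recall_at_k(relevant: np.ndarray, k: int) -> float:
    """relevant[i] = 1 if the i-th retrieved document is relevant (sorted by distance)."""
    return float(relevant[:k].any())          # metric-learning convention: a "hit" within top-K

def precision_at_k(relevant: np.ndarray, k: int) -> float:
    return float(relevant[:k].mean())

def ap_at_r(relevant: np.ndarray) -> float:
    """AP@R: average Precision@i over the first R positions, R = number of relevant docs."""
    r = int(relevant.sum())
    return float(np.mean([precision_at_k(relevant, i) for i in range(1, r + 1)])) if r else 0.0

def reciprocal_rank(relevant: np.ndarray) -> float:
    hits = np.flatnonzero(relevant)
    return 1.0 / (hits[0] + 1) if hits.size else 0.0

# Example: 1 marks relevant documents in the ranked output of one query.
ranked = np.array([0, 1, 0, 1, 1, 0])
print(recall_at_k(ranked, 1), precision_at_k(ranked, 3), ap_at_r(ranked), reciprocal_rank(ranked))
```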

2.2 Cluster Related

In addition to search quality criteria, in metric learning one often evaluates the quality of clustering in the space \({{\mathbb{R}}^{n}}\) of representations \(\{ {{f}_{\theta }}(x){\kern 1pt} |{\kern 1pt} x \in X\} \), as representations of similar objects are expected to form clusters. The disadvantage of “cluster” evaluation metrics based on this idea is that they depend on the selected clustering algorithm.

Let \(U = \{ {{U}_{1}}, \ldots ,{{U}_{n}}\} \) be the true partition of X into groups of similar objects, and \(V = \{ {{V}_{1}}, \ldots ,{{V}_{n}}\} \) be the partition obtained by the clustering algorithm (in this work, the experiments use k-means [6]) on \({{f}_{\theta }}(X)\). Then mutual information is

$$\begin{gathered} MI(U,V) = \sum\limits_{i = 1}^{|U|} \sum\limits_{j = 1}^{|V|} {{P}_{{UV}}}(i,j)\log \frac{{{{P}_{{UV}}}(i,j)}}{{{{P}_{U}}(i){{P}_{V}}(j)}} \\ \, = KL({{P}_{{UV}}}\;{\text{||}}\;{{P}_{U}}{{P}_{V}}), \\ \end{gathered} $$

where \({{P}_{U}}(i) = \frac{{|{{U}_{i}}|}}{{|X|}}\) is the estimate of the probability of a random object falling into the ith cluster of the partition U; similarly, \({{P}_{V}}(j) = \frac{{|{{V}_{j}}|}}{{|X|}}\), and \({{P}_{{UV}}}(i,j) = \frac{{|{{U}_{i}} \cap {{V}_{j}}|}}{{|X|}}\) is the estimate of the probability of falling into \({{U}_{i}}\) and \({{V}_{j}}\) simultaneously.

The entropy of partition S is defined as follows

$$H(S) = - \sum\limits_{k = 1}^{|S|} {{P}_{S}}(k)\log {{P}_{S}}(k).$$

There exist several normalizations: NMI (normalized mutual information) and AMI (adjusted mutual information). NMI is calculated as follows

$$NMI(U,V) = \frac{{MI(U,V)}}{{\frac{1}{2}(H(U) + H(V))}},$$

while AMI

$$AMI(U,V) = \frac{{MI(U,V) - \mathbb{E}{\kern 1pt} MI(U,V)}}{{\frac{1}{2}(H(U) + H(V)) - \mathbb{E}{\kern 1pt} MI(U,V)}},$$

is used more often, where \(\mathbb{E}{\kern 1pt} MI(U,V)\) is the mathematical expectation of \(MI(U,V)\) over all possible partitions U and V.
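A minimal sketch of this cluster-based evaluation, assuming scikit-learn is used for k-means and for the NMI/AMI scores (the embeddings and labels below are random placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_mutual_info_score

embeddings = np.random.randn(1000, 128)       # f_theta(x) for the test objects
labels = np.random.randint(0, 20, size=1000)  # true classes

pred = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(embeddings)
print("NMI:", normalized_mutual_info_score(labels, pred))
print("AMI:", adjusted_mutual_info_score(labels, pred))
```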

3 METHODS OF DEEP METRIC LEARNING

3.1 Methods Based on Representation Comparison

We describe the methods that are used to train the metric and are based on comparing object representations in the value space \(\{ {{f}_{\theta }}(x){\kern 1pt} |{\kern 1pt} x \in X\} \). Each method is defined by a loss function used to optimize the neural network that defines the mapping \({{f}_{\theta }}(x)\).

3.1.1. Contrastive Loss. One of the first loss functions proposed for metric learning was Contrastive Loss [7], which is defined as follows

$$\begin{gathered} L = \mathbb{I}\left\{ {{{y}_{i}} = {{y}_{j}}} \right\}\left[ {{{D}_{\theta }}\left( {{{x}_{i}},{{x}_{j}}} \right) - {{m}_{p}}} \right]_{ + }^{2} \\ \, + \mathbb{I}\left\{ {{{y}_{i}} \ne {{y}_{j}}} \right\}\left[ {{{m}_{n}} - {{D}_{\theta }}\left( {{{x}_{i}},{{x}_{j}}} \right)} \right]_{ + }^{2}, \\ \end{gathered} $$
(1)

here and in what follows \(\mathbb{I}\{ A\} = 1\) if and only if the expression A is true, \({{\left[ z \right]}_{ + }} = \max \left\{ {z,0} \right\}\) is the cut-off function. Functional (1) penalizes positive pairs if the distance between them is greater than a certain threshold mp (most often assumed to equal zero) while it penalizes the negative pairs until the distance exceeds the threshold mn.
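A minimal PyTorch sketch of (1) for a batch of embedding pairs; the thresholds \({{m}_{p}}\), \({{m}_{n}}\) and the construction of the pairs themselves are assumed to be given elsewhere.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_class, m_p=0.0, m_n=1.0):
    d = F.pairwise_distance(emb1, emb2)           # Euclidean distance per pair
    pos = torch.clamp(d - m_p, min=0).pow(2)      # penalize distant positives
    neg = torch.clamp(m_n - d, min=0).pow(2)      # penalize close negatives
    return torch.where(same_class, pos, neg).mean()

# same_class is a boolean tensor marking positive pairs.
loss = contrastive_loss(torch.randn(8, 64), torch.randn(8, 64),
                        torch.tensor([True, False] * 4))
```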

3.1.2. Triplet Loss. In [8], it is proposed to compare triplets of objects \(({{x}_{a}},{{x}_{p}},{{x}_{n}})\), where \({{y}_{a}} = {{y}_{p}}\) yet \({{y}_{a}} \ne {{y}_{n}}\), rather than pairs of objects

$$L = {{[{{D}_{\theta }}{{({{x}_{a}},{{x}_{p}})}^{2}} - {{D}_{\theta }}{{({{x}_{a}},{{x}_{n}})}^{2}} + m]}_{ + }}.$$
(2)

The main difference between (1) and (2) is that Contrastive Loss penalizes the absolute distance within a single pair of objects while Triplet Loss limits the difference between positive and negative pairs in triplets. Triplet Loss is often used in the face recognition problem [8], with its modifications used in other domains, for example, for training text representations [2].
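A sketch of (2) in the same style; PyTorch also provides nn.TripletMarginLoss, which uses non-squared distances, so the squared form of (2) is written out explicitly here.

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # squared distance to the positive
    d_an = (anchor - negative).pow(2).sum(dim=1)   # squared distance to the negative
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

loss = triplet_loss(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64))
```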

3.1.3. Fast AP. As noted in Section 2, MAP is one of the principal evaluation metrics in metric learning problems. In [9], it is proposed to optimize an approximation of AP. The principal issue with this functional from an optimization point of view is the sorting operation, to which gradient methods cannot be applied. The authors’ idea is to interpret AP as the area under the PR curve and to consider Precision and Recall as parametric functions of the distance between the query and the objects. The authors propose the following approximation

$${\text{FastAP}} = \frac{1}{{N_{q}^{ + }}}\sum\limits_{j = 1}^L \frac{{H_{j}^{ + }h_{j}^{ + }}}{{{{H}_{j}}}}.$$
(3)

• Representations are assumed to be \({{l}_{2}}\)-normalized, with \(\left\| {{{f}_{\theta }}(x)} \right\| = 1\)—in this case, the possible distances between the query and the retrieval candidates belong to the segment [0, 2].

• The segment [0, 2] is divided into bins \(\{ {{z}_{1}}, \ldots ,{{z}_{L}}\} \) (the number of bins L is a hyperparameter), and \({{h}_{j}}\) is the number of objects included in the jth bin; \({{H}_{j}} = \sum\nolimits_{k = 1}^j {{h}_{k}} \) is the cumulative sum.

• \(h_{j}^{ + }\) and \(H_{j}^{ + }\) are the same counters, but only for the objects relevant to the query; \(N_{q}^{ + }\) is the total number of relevant objects.

• In fact, in (3) histogram binning is used, which in this case approximates the true distribution (viz., the distribution density and function) of distances by piecewise constant functions

$$h(z) = \sum\limits_{j = 1}^L {{h}_{j}} \cdot \mathbb{I}\left\{ {z \in {{z}_{j}}} \right\},\quad H(z) = \sum\limits_{j = 1}^L {{H}_{j}} \cdot \mathbb{I}\left\{ {z \in {{z}_{j}}} \right\}.$$

• Since the piecewise constant functions are non-differentiable, we replace \(h(z)\) and \(H(z)\) by linear interpolation during optimization, which leads to continuous piecewise linear functions whose gradients are defined and constant within individual bins. Such a relaxation was proposed in [10]; a short sketch of this soft binning is given below.
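A hedged sketch of the soft binning for a single query; the bin layout and names are illustrative assumptions rather than the authors’ exact implementation.

```python
import torch

def fast_ap_single_query(dist, relevant, num_bins=10):
    # dist: (N,) distances in [0, 2]; relevant: (N,) boolean relevance mask
    centers = torch.linspace(0, 2, num_bins)          # bin centers z_1..z_L
    width = 2.0 / (num_bins - 1)
    # triangular "soft count" of each object in each bin -> piecewise linear h(z)
    delta = torch.clamp(1 - (dist[None, :] - centers[:, None]).abs() / width, min=0)
    h = delta.sum(dim=1)                                      # h_j
    h_pos = (delta * relevant[None, :].float()).sum(dim=1)    # h_j^+
    H, H_pos = h.cumsum(0), h_pos.cumsum(0)                   # H_j, H_j^+
    n_pos = relevant.float().sum()
    return (H_pos * h_pos / H.clamp(min=1e-8)).sum() / n_pos

ap = fast_ap_single_query(torch.rand(100) * 2, torch.rand(100) > 0.8)
```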

3.1.4. Centroid Triplet Loss. In [11], the authors propose to modify (2) by replacing the positive and negative objects \(({{x}_{p}},{{x}_{n}})\) with respect to \({{x}_{a}}\) by the centers of their classes \(({{c}_{p}},{{c}_{n}})\)

$$L = {{[{{D}_{\theta }}{{({{x}_{a}},{{c}_{p}})}^{2}} - {{D}_{\theta }}{{({{x}_{a}},{{c}_{n}})}^{2}} + m]}_{ + }}.$$

During training, the class centers are computed within the batch (the batches themselves should be made large), with \({{x}_{a}}\) excluded from its class center. The principal motivation of this method is to simplify the inference stage: the search can be carried out only over the class centers (precomputed in advance, this time for the entire sample) rather than over all objects.
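A rough sketch of Centroid Triplet Loss for one batch, assuming every class in the batch has at least two objects; the choice of the negative class is simplified here to the first other class encountered.

```python
import torch

def class_centroid(embeddings, labels, cls, exclude=None):
    mask = labels == cls
    if exclude is not None:
        mask = mask & (torch.arange(len(labels)) != exclude)   # drop the anchor itself
    return embeddings[mask].mean(dim=0)

def centroid_triplet_loss(embeddings, labels, margin=0.2):
    losses = []
    for i, y in enumerate(labels):
        c_p = class_centroid(embeddings, labels, y, exclude=i)  # positive centroid
        other = labels[labels != y]
        if other.numel() == 0 or torch.isnan(c_p).any():        # skip singleton classes
            continue
        c_n = class_centroid(embeddings, labels, other[0])      # some negative centroid
        d_p = (embeddings[i] - c_p).pow(2).sum()
        d_n = (embeddings[i] - c_n).pow(2).sum()
        losses.append(torch.clamp(d_p - d_n + margin, min=0))
    return torch.stack(losses).mean()

loss = centroid_triplet_loss(torch.randn(16, 64), torch.randint(0, 4, (16,)))
```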

3.1.5. Margin Loss. Authors of [12] suggest using

$$L = {{\left[ {\alpha + \left( {2 \cdot \mathbb{I}\{ {{y}_{i}} = {{y}_{j}}\} - 1} \right) \cdot \left( {{{D}_{\theta }}({{x}_{i}},{{x}_{j}}) - \beta } \right)} \right]}_{ + }}.$$
(4)

In fact, (4) differs from (1) by the square of the Euclidean metric replaced by the metric itself \((l_{2}^{2}\, \mapsto \,{{l}_{2}})\) and by the modified parameterization \({{m}_{p}} = \beta - \alpha ,\) \({{m}_{n}} = \beta + \alpha \). Thus, \(\beta \) corresponds to the boundary between positive and negative pairs, and \(\alpha \) corresponds to the required distance from this boundary.
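A sketch of (4) with the \((\alpha ,\beta )\) parameterization; the hyperparameter values below are placeholders.

```python
import torch
import torch.nn.functional as F

def margin_loss(emb1, emb2, same_class, alpha=0.2, beta=1.2):
    d = F.pairwise_distance(emb1, emb2)      # non-squared l2 distance
    sign = 2 * same_class.float() - 1        # +1 for positive pairs, -1 for negative pairs
    return torch.clamp(alpha + sign * (d - beta), min=0).mean()
```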

3.1.6. Multi Similarity Loss. In [13], it is proposed to use more information about the distances between objects in the batch in the loss function

$$\begin{gathered} L = \sum\limits_{i = 1}^m \left\{ {\frac{1}{\alpha }\log \left[ {1 + \sum\limits_{k \in {{P}_{i}}} {{e}^{{ - \alpha \left( {{{S}_{{ik}}} - \lambda } \right)}}}} \right]} \right. \\ \, + \left. {\frac{1}{\beta }\log \left[ {1 + \sum\limits_{k \in {{N}_{i}}} {{e}^{{\beta \left( {{{S}_{{ik}}} - \lambda } \right)}}}} \right]} \right\}. \\ \end{gathered} $$
(5)

Here, the similarity \({{S}_{\theta }}({{x}_{i}},{{x}_{j}}) = \langle {{f}_{\theta }}({{x}_{i}}),{{f}_{\theta }}({{x}_{j}})\rangle \) is used, and \({{P}_{i}}\) and \({{N}_{i}}\) denote the sets of positive and negative pairs for the ith object in the batch. Note that expression (5) can be considered as a smooth approximation of (1) (with additional coefficients) that includes all pairwise distances within the batch.
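A sketch of (5) over one batch of \({{l}_{2}}\)-normalized embeddings; the pair-mining step of the original paper is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=50.0, lam=0.5):
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t()                                    # S_ij
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool)
    losses = []
    for i in range(len(labels)):
        pos = sim[i][same[i] & ~eye[i]]                # positives of the i-th object
        neg = sim[i][~same[i]]                         # negatives of the i-th object
        pos_term = torch.log1p(torch.exp(-alpha * (pos - lam)).sum()) / alpha
        neg_term = torch.log1p(torch.exp(beta * (neg - lam)).sum()) / beta
        losses.append(pos_term + neg_term)
    return torch.stack(losses).sum()
```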

3.1.7. SNN Loss. The loss function based on distances within a batch can also be defined as follows [14]

$$L = - \frac{1}{m}\sum\limits_{i = 1}^m \log \frac{{\sum\limits_{j = 1}^m {\mathbb{I}\{ {{y}_{i}} = {{y}_{j}}\} \exp \{ {{S}_{\theta }}({{x}_{i}},{{x}_{j}}){\text{/}}\tau \} } }}{{\sum\limits_{j = 1}^m {\exp \{ {{S}_{\theta }}({{x}_{i}},{{x}_{j}}){\text{/}}\tau \} } }},$$
(6)

where τ is the temperature (it can be either a trainable parameter or a fixed hyperparameter).
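A sketch of (6); self-pairs are excluded from the sums here, which is a common convention even though it is not written explicitly in (6).

```python
import torch
import torch.nn.functional as F

def snn_loss(embeddings, labels, tau=0.1):
    z = F.normalize(embeddings, dim=1)
    logits = (z @ z.t()) / tau
    logits.fill_diagonal_(float('-inf'))               # exclude self-similarity
    same = (labels[:, None] == labels[None, :]).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_prob = (same * log_prob.exp()).sum(dim=1)      # probability mass on the positives
    return -torch.log(pos_prob.clamp(min=1e-12)).mean()
```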

3.1.8. SupCon Loss. In [15], the authors noticed that there are different ways to aggregate positive examples in (6) and proposed the following alternative

$$L = - \frac{1}{m}\sum\limits_{i = 1}^m \sum\limits_{j = 1}^m \mathbb{I}\{ {{y}_{i}} = {{y}_{j}}\} \log \frac{{\exp \{ {{S}_{\theta }}({{x}_{i}},{{x}_{j}}){\text{/}}\tau \} }}{{\sum\limits_{k = 1}^m {\exp \{ {{S}_{\theta }}({{x}_{i}},{{x}_{k}}){\text{/}}\tau \} } }}.$$
(7)

Option (7) is usually better optimized than (6).
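A sketch of (7) in the same notation; compared with the previous sketch, the logarithm is taken per positive pair and then summed, rather than aggregating the positives inside one logarithm.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, tau=0.1):
    z = F.normalize(embeddings, dim=1)
    eye = torch.eye(len(labels), dtype=torch.bool)
    logits = (z @ z.t()) / tau
    logits = logits.masked_fill(eye, float('-inf'))     # drop self-pairs from the softmax
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    same = (labels[:, None] == labels[None, :]) & ~eye   # positive pairs, excluding self
    return -(log_prob.masked_fill(eye, 0.0) * same.float()).sum(dim=1).mean()
```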

3.1.9. SNR Loss. In [16], (1) is used, but instead of the Euclidean metric the authors optimize SNR (signal-to-noise ratio)

$${{D}_{\theta }}({{x}_{i}},{{x}_{j}}) = \frac{1}{{{\text{SNR}}}} = \frac{{{\text{Var}}\left[ {{{f}_{\theta }}({{x}_{i}}) - {{f}_{\theta }}({{x}_{j}})} \right]}}{{{\text{Var}}\left[ {{{f}_{\theta }}({{x}_{i}})} \right]}},$$

here Var is the sample variance. Unlike the Euclidean metric, the SNR distance is not symmetric, so the order of the elements in the pair matters, similar to how Triplet Loss (2) is affected by the replacement \({{x}_{a}} \leftrightarrow {{x}_{p}}\).
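The SNR distance itself is simple to write down; in the sketch below the variance is taken over the embedding coordinates, and the asymmetry in i and j is visible directly.

```python
import torch

def snr_distance(f_i, f_j):
    # f_i, f_j: (batch, n) embeddings; variance is taken over the n coordinates
    noise = f_i - f_j
    return noise.var(dim=1) / f_i.var(dim=1)
```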

3.1.10. Tuplet Margin Loss. Authors of [17] made several significant changes to (2); they are listed below, followed by a short sketch

$$L = \log \left( {1 + \sum\limits_{i = 1}^{k - 1} {{e}^{{s({\text{cos}}{{\theta }_{{a,{{n}_{i}}}}} - \,{\text{cos}}({{\theta }_{{a,p}}} - \beta ))}}}} \right).$$
(8)

\( \bullet \) In (8), the number of negative objects is not limited by the triplet, viz. for each positive pair \(({{x}_{a}},{{x}_{p}})\) from the remaining classes, one negative example is sampled, forming the set

$$({{x}_{a}},{{x}_{p}},{{x}_{{{{n}_{1}}}}}, \ldots ,{{x}_{{{{n}_{{k - 1}}}}}}).$$

\( \bullet \) Representations on the unit sphere are used \((\left\| {{{f}_{\theta }}({{x}_{i}})} \right\| = 1)\); therefore, the cosine of the angle between two vectors equals their scalar product.

\( \bullet \) The coefficient \(\beta \; \geqslant \;0\) is used to combat overfitting on hard triplets (triplets in which \({{\theta }_{{a,{{n}_{i}}}}} \ll {{\theta }_{{a,p}}}\)), which, with \(\beta = 0\), make a much larger contribution to the loss function than the other elements of the set.

\( \bullet \) Instead of the cut-off function \({{\left[ z \right]}_{ + }} = \max \left\{ {z,0} \right\}\), its differentiable approximation Softplus(z) = log(1 + \({{e}^{{sz}}})\) is used, where s is the scale factor (responsible for the radius of the hypersphere and affecting the speed of convergence).
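A minimal sketch of (8) for one tuple; the sampling of the tuple itself is assumed to be done elsewhere.

```python
import torch
import torch.nn.functional as F

def tuplet_margin_loss(anchor, positive, negatives, beta=0.01, s=32.0):
    a = F.normalize(anchor, dim=0)
    p = F.normalize(positive, dim=0)
    n = F.normalize(negatives, dim=1)                  # (k-1, n) matrix of negatives
    theta_ap = torch.acos((a * p).sum().clamp(-1 + 1e-7, 1 - 1e-7))
    cos_an = n @ a                                     # cos(theta_{a, n_i})
    logits = s * (cos_an - torch.cos(theta_ap - beta))
    return torch.log1p(torch.exp(logits).sum())

loss = tuplet_margin_loss(torch.randn(64), torch.randn(64), torch.randn(5, 64))
```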

3.1.11. Circle Loss. In [18], it is proposed to optimize a modified relaxation of (1)

$$L = \log \left[ {1 + \sum\limits_{j = 1}^L {{e}^{{\gamma {{{[s_{n}^{j} - {{m}_{n}}]}}_{ + }}s_{n}^{j}}}}\sum\limits_{i = 1}^K {{e}^{{ - \gamma {{{[{{m}_{p}} - s_{p}^{i}]}}_{ + }}s_{p}^{i}}}}} \right],$$
(9)

where \(s_{p}^{i}\) and \(s_{n}^{j}\) are the similarities of the positive and negative pairs, respectively.
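A sketch that follows the simplified relaxation exactly as printed in (9); the hyperparameter values are placeholders.

```python
import torch

def circle_loss(s_p, s_n, m_p=0.8, m_n=0.2, gamma=80.0):
    # s_p: similarities of positive pairs, s_n: similarities of negative pairs
    pos = torch.exp(-gamma * torch.clamp(m_p - s_p, min=0) * s_p).sum()
    neg = torch.exp( gamma * torch.clamp(s_n - m_n, min=0) * s_n).sum()
    return torch.log1p(neg * pos)
```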

3.2 Classification Based Methods

This group of methods differs in that a training sample labeled with a fixed number of classes \(c\) is used. In fact, a classification problem is solved, but standard classification methods do not guarantee a representation space in which objects of each class are well grouped.

3.2.1. ArcFace. In ArcFace [19], a loss function is proposed whose optimization allowed the authors to obtain high-quality vector representations in the face recognition problem. The idea is to consider Softmax-loss on the unit sphere and use the geodesic distance.

We introduce an auxiliary weight matrix \(W \in {{\mathbb{R}}^{{n \times c}}}\). When solving a classification problem, one could optimize Softmax-loss (cross-entropy)

$$ - \log \frac{{{{e}^{{W_{{{{y}_{i}}}}^{T}{{f}_{\theta }}({{x}_{i}})}}}}}{{\sum\limits_{j = 1}^c {{{e}^{{W_{j}^{T}{{f}_{\theta }}({{x}_{i}})}}}} }}.$$
(10)

We consider the jth logit in (10): \(W_{j}^{T}{{f}_{\theta }}({{x}_{i}}) = \left\| {{{W}_{j}}} \right\|\;\left\| {{{f}_{\theta }}({{x}_{i}})} \right\|\cos {{\theta }_{j}}\), where \({{\theta }_{j}}\) is the angle between \({{W}_{j}}\) and \({{f}_{\theta }}({{x}_{i}})\). We fix the norm of all weights \(\left\| {{{W}_{j}}} \right\| = 1\) and normalize the space of representations so that \(\left\| {{{f}_{\theta }}({{x}_{i}})} \right\| = s\); then, (10) takes the form (11)

$$L = - \log \frac{{{{e}^{{s(\cos ({{\theta }_{{{{y}_{i}}}}}))}}}}}{{{{e}^{{s(\cos ({{\theta }_{{{{y}_{i}}}}}))}}} + \sum\limits_{j \ne {{y}_{i}}}^{} {{{e}^{{s\cos ({{\theta }_{j}})}}}} }},$$
(11)

if we add the margin to the angle in (11), we get ArcFace Loss [19]

$$L = - \log \frac{{{{e}^{{s(\cos ({{\theta }_{{{{y}_{i}}}}} + m))}}}}}{{{{e}^{{s\left( {\cos ({{\theta }_{{{{y}_{i}}}}} + m)} \right)}}} + \sum\limits_{j \ne {{y}_{i}}}^{} {{{e}^{{s\cos ({{\theta }_{j}})}}}} }}.$$
(12)

The parameter m in this case is responsible for both intraclass compactness and interclass separability.
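A sketch of (12), assuming a trainable class-weight matrix (stored row-wise as \({{W}^{T}}\)) and standard cross-entropy on the rescaled logits.

```python
import torch
import torch.nn.functional as F

def arcface_loss(embeddings, labels, weight, s=64.0, m=0.5):
    # weight: (c, n) matrix, i.e., W^T in the notation above
    cos = F.normalize(embeddings, dim=1) @ F.normalize(weight, dim=1).t()
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=weight.size(0)).bool()
    logits = s * torch.where(target, torch.cos(theta + m), cos)  # margin only on the true class
    return F.cross_entropy(logits, labels)

W = torch.randn(100, 128, requires_grad=True)          # c = 100 classes, n = 128
loss = arcface_loss(torch.randn(32, 128), torch.randint(0, 100, (32,)), W)
```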

3.2.2. CosFace. Almost the same idea was proposed in [20], only with the margin used for cosines rather than for angles

$$L = - \log \frac{{{{e}^{{s\left( {\cos ({{\theta }_{{{{y}_{i}}}}}) - m} \right)}}}}}{{{{e}^{{s\left( {\cos ({{\theta }_{{{{y}_{i}}}}}) - m} \right)}}} + \sum\limits_{j \ne {{y}_{i}}}^{} {{{e}^{{s\cos ({{\theta }_{j}})}}}} }}.$$
(13)

Although methods (12) and (13) often have almost the same quality [19], ArcFace tends to perform slightly better in the face recognition problem.

3.2.3. SubCenter ArcFace. In [21], it is proposed to improve the robustness of functional (12) to noise. To do this, K subcenters are used instead of one center for each class. The form of loss function (12) is the same, but K weight matrices \({{W}^{1}}, \ldots ,{{W}^{K}} \in {{\mathbb{R}}^{{n \times c}}}\) are introduced

$${{\theta }_{j}} = \arccos \left( {\mathop {\max }\limits_{k \in 1, \ldots ,K} \{ ({{W}^{k}})_{j}^{T}{{f}_{\theta }}({{x}_{i}})\} } \right).$$
(14)

Thus, the closest of the K subcenters is selected in (14). With this approach, as shown in [21], most of the objects of a particular class accumulate in the neighbourhood of one subcenter, while noise objects are assigned to other, non-dominant subcenters (containing a small number of representatives) that can be discarded at the inference stage.
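A sketch of the subcenter selection (14); the rest of the loss follows the ArcFace sketch above, and the class-major layout of the weight rows is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def subcenter_cosines(embeddings, weight, num_classes, K):
    # weight: (num_classes * K, n), K consecutive rows per class
    cos = F.normalize(embeddings, dim=1) @ F.normalize(weight, dim=1).t()
    cos = cos.view(len(embeddings), num_classes, K)
    return cos.max(dim=2).values                       # closest subcenter per class
```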

3.2.4. SoftTriple Loss. In [22], the idea of several centers per class is also used:

$$L = - \log \frac{{{{e}^{{s\left( {G(i,{{y}_{i}}) + m} \right)}}}}}{{{{e}^{{s\left( {G(i,{{y}_{i}}) + m} \right)}}} + \sum\limits_{j \ne {{y}_{i}}} {{e}^{{sG(i,j)}}}}},$$

where \(G(i,c)\) is the proximity function between the \(i\)th object and class c, and \(w_{c}^{k}\) is the \(k\)th center of class c

$$G(i,c) = \sum\limits_k \frac{{\exp \left\{ {\frac{1}{\gamma }\langle {{f}_{\theta }}({{x}_{i}}),w_{c}^{k}\rangle } \right\}}}{{\sum\limits_{k{\kern 1pt} '}^{} {\exp \left\{ {\frac{1}{\gamma }\langle {{f}_{\theta }}({{x}_{i}}),w_{c}^{{k{\kern 1pt} '}}\rangle } \right\}} }}\langle {{f}_{\theta }}({{x}_{i}}),w_{c}^{k}\rangle .$$

3.2.5. Proxy-Anchor Loss. The authors of [23] propose adding special proxy objects and optimizing the distance between them and objects of the same class. Some of the methods above can be viewed this way; for example, in (12) the columns \({{W}_{j}}\) of the weight matrix can be considered as proxy objects. In [23], the following loss function is used

$$\begin{gathered} L = \frac{1}{{{\text{|}}{{P}^{ + }}{\text{|}}}}\sum\limits_{p \in {{P}^{ + }}} \log \left( {1 + \sum\limits_{x \in X_{p}^{ + }} {{e}^{{ - \alpha (S(x,p) - m)}}}} \right) \\ \, + \frac{1}{{{\text{|}}P{\text{|}}}}\sum\limits_{p \in P} \log \left( {1 + \sum\limits_{x \in X_{p}^{ - }} {{e}^{{\alpha (S(x,p) + m)}}}} \right), \\ \end{gathered} $$

where \(\alpha \) is the scale hyperparameter, m is the shift hyperparameter, P is the set of all proxy objects, \({{P}^{ + }}\) is the set of positive proxies (those having positive objects in the batch), and \(X_{p}^{ + }\) and \(X_{p}^{ - }\) are the partition of the batch into objects of the same class as the proxy p and all other objects.
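A sketch of this loss over one batch, with one trainable proxy per class; the names and the normalization choices are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, margin=0.1):
    # proxies: (c, n) trainable proxy matrix, one proxy per class
    sim = F.normalize(embeddings, dim=1) @ F.normalize(proxies, dim=1).t()  # S(x, p)
    pos_mask = F.one_hot(labels, num_classes=proxies.size(0)).bool()        # X_p^+
    pos_exp = torch.exp(-alpha * (sim - margin)).masked_fill(~pos_mask, 0).sum(dim=0)
    neg_exp = torch.exp( alpha * (sim + margin)).masked_fill(pos_mask, 0).sum(dim=0)
    with_pos = pos_mask.any(dim=0)                     # proxies that have positives in the batch
    loss_pos = torch.log1p(pos_exp[with_pos]).sum() / with_pos.sum().clamp(min=1)
    loss_neg = torch.log1p(neg_exp).sum() / proxies.size(0)
    return loss_pos + loss_neg
```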

4 EXPERIMENTS

As we saw in the previous section, there are quite a lot of loss functions for solving the metric learning problem. Often, when choosing a specific method, one checks whether a ready-made implementation is available or how popular the method is. In this work, we decided to implement and test all the described methods, ensuring fair assessment and comparison of their quality.

4.1 Data

One of the most important requirements in metric learning is that the metric properties (similar objects are close, objects of different classes are far apart) are preserved on new data. To measure quality fairly in the supervised setting, training and testing are performed on non-overlapping classes.

This work uses six datasets, viz. four with images and two text ones; the objects in them are divided into training and test sets according to Table 1.

Table 1. Partition into training and test sets; the classes of the two partitions do not intersect

4.1.1. Cars-196. The initial Cars-196 dataset [24] contains 16 185 photos divided into 196 classes. Each photo shows a car, and the class is characterized by the car’s make, year, and model (color, angle, background, etc. may vary). Examples of images are shown in Fig. 1. This is one of the most frequently used datasets for measuring the quality of metric learning methods.

Fig. 1. Cars-196.

4.1.2. CUB-200. The Caltech-UCSD Birds 200 dataset [25] contains 11 788 photos of birds, a total of 200 categories, examples are given in Fig. 2.

Fig. 2. CUB-200: random test subset after preprocessing.

4.1.3. SOP. The Stanford Online Products dataset [26] contains 120 053 images of various products, the total number of categories is 22 634, examples are given in Fig. 3.

Fig. 3. SOP: random test subset after preprocessing.

4.1.4. Dogs-130. The Tsinghua Dogs dataset [27] contains 70 432 photos of dogs from 130 breeds (Fig. 4).

Fig. 4. Dogs-130: random test subset after preprocessing.

4.1.5. News-20. The 20 Newsgroups dataset [28] is usually used to compare classification methods, but in this work we use it to build textual representations. In total, the dataset contains 18 846 texts and 20 classes.

4.1.6. WOS-134. The Web Of Science dataset [29] contains a total of 46 985 text documents. The number of categories is quite large, viz. 134, so deep metric learning approaches may be especially relevant here.

4.2 Neural Network Architectures

4.2.1. ConvNeXt. In the original articles, most of the previously discussed methods were compared on architectures from 2014–2015, viz. ResNet-50 [30] and GoogLeNet [31]. Since then, many other models have emerged for which metric learning has not been explored. In this work, ConvNeXt [32] is used as the principal neural network for images; it has a convolutional architecture yet is comparable in quality to modern transformer models. We use a pretrained ConvNeXt-T, which has 28M trainable parameters. On top of the neural network, we add a linear layer that maps the hidden state into the space of the required dimension.

4.2.2. DistilBERT. To work with text data, we use the DistilBERT transformer model [33], which is a distilled version of BERT [34]. To obtain vector representations of the required dimension, we also add a linear layer on top of the model, applied to the representation of the CLS token.

4.2.3. Training details. PyTorch and [35] were the principal frameworks used for writing the training and measurement scripts. Adam [36] was used as the optimizer for all models; the learning rate varied from \(10^{-5}\) to \(10^{-3}\), and the batch size varied from 32 to 128 (the best quality achieved for each model is reported below). The hyperparameters of the loss functions themselves were fixed in accordance with the authors’ results. Optimization used early stopping if the quality did not improve within 10 epochs. In the case of DistilBERT, warmup was additionally used for the first 100 iterations. For images, standard augmentation methods are used during training: random horizontal flip, cropping and rescaling of a random region of the image. When measuring quality, test images are scaled to 256, and then the central 224 × 224 fragment is cropped.
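For reference, a hedged sketch of how such an embedding model can be assembled (backbone plus linear projection head); the exact layer names and dimensions are illustrative assumptions, not the configuration used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

class ImageEmbedder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = convnext_tiny(weights=ConvNeXt_Tiny_Weights.DEFAULT)
        self.backbone.classifier[-1] = nn.Identity()   # drop the ImageNet classification head
        self.head = nn.Linear(768, dim)                # project to the embedding space

    def forward(self, x):
        # l2-normalization, assumed by most of the losses above
        return F.normalize(self.head(self.backbone(x)), dim=1)

model = ImageEmbedder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```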

4.3 Results

In this work, all loss functions described in Section 3 are compared on the problems discussed above, viz. those based on images (Cars-196, CUB-200, SOP, Dogs-130) and on texts (News-20, WOS-134). Note that many methods, individually or in small sets, were tested on the Cars-196, CUB-200, and SOP datasets in the works where the methods were first proposed, with the authors’ algorithms often achieving the highest quality. Below, we describe the results of comparing Deep Metric Learning approaches under completely equal conditions.

On Cars-196 (Table 2) with ConvNeXt, Tuplet Margin Loss worked best, but the highest values of the MAP functionals were achieved with Circle Loss (9). Indeed, from the point of view of different quality criteria, different methods may be better or worse (often only Recall@K is reported in articles). Note that Tuplet Margin Loss also worked best for MRR, NMI, and AMI on Cars-196.

Table 2. Evaluation metrics for Cars-196 test

For News-20 (20 Newsgroups), a text dataset with 20 classes processed with the transformer model, Table 3 shows that Tuplet Margin Loss and Triplet Loss achieve better Recall, while the MAP functionals are better with Multi Similarity Loss and the cluster functionals are better with Circle Loss.

Table 3. Evaluation metrics for 20 Newsgroups test

On CUB-200 (Table 4), the results largely repeat Cars-196, viz. Tuplet Margin Loss and Circle Loss achieve the highest quality.

Table 4. Evaluation metrics for CUB-200 test

The Dogs-130 dataset (Table 5) is quite easy to optimize, as its classes contain a particularly large number of examples. Nevertheless, Tuplet Margin Loss was clearly the best on it in terms of both Recall and AMI/NMI, while MAP turned out to be the best for SNN Loss.

Table 5. Evaluation metrics for Dogs-130 test

On SOP (Table 6), the dataset with the largest number of classes, Circle Loss performs best in terms of the Recall and MAP functionals, although Multi Similarity Loss performed better on AMI/NMI.

Table 6. Evaluation metrics for SOP test

On the final WOS-134 text dataset, SNR Loss has the highest MAP (Table 7) while cluster metrics are better for Centroid Triplet Loss. At the same time, Tuplet Margin Loss again turned out to be the best on Recall@1.

Table 7. Evaluation metrics for WOS-134 test

The results obtained for the image datasets can be considered as new benchmarks for Deep Metric Learning based on ConvNeXt rather than on ResNet or GoogLeNet. According to the results of the experiments, the methods often achieve higher quality than in the original articles; for example, on the SOP dataset with ConvNeXt, the obtained Recall@K values are higher than in [9, 12, 13, 18, 22, 23]. For the text datasets used here, such a statement of the problem was previously considered in [37], but the approach using a transformer model and loss functions from computer vision is probably being tested for the first time.

5 CONCLUSIONS

We give an overview of loss functions that were previously proposed for optimizing neural networks in deep metric learning problems. Experiments were carried out “under equal conditions” with the described loss functions for different domains, viz. images and texts, as well as with modern neural network architectures. As expected, the choice of a loss function depends on the problem, model and evaluation metrics. Nevertheless, in most of the experiments conducted, Tuplet Margin Loss and Circle Loss were more likely to achieve the best results than other methods, which is a little surprising considering that the first one was proposed in 2019 and there are many “more recent” methods.

We also note that we sometimes managed to achieve better quality in our experiments than in the works in which the loss functions involved were proposed. Although it would be possible to repeat the experiments with different partitions of the datasets into training and test sets and to construct confidence intervals for the results, we did not do so due to the complexity of such a series of experiments. Among the interesting directions in which this work can be developed, we should mention the case when the labels take real values. Most loss functions are focused on categorical label values, so their analytical form involves only objects with equal and unequal labels. Intuitively, in the case of real-valued labels, the closer the labels, the closer the representations should be, but various formalizations of this idea have not yet been fully explored.