
1 Introduction

The common approach to contrastive learning is to maximize the agreement between individual views of the data [28, 33]. The views are arranged in pairs, such that they are either positive, encoding different views of the same object, or negative, corresponding to views of different objects. The pairs are compared against one another by a contrastive loss objective [12, 39, 44]. Contrastive learning was successfully applied, among others, in metric learning [23], self-supervised classification [12, 21], and pre-training for dense prediction tasks [45]. However, when two views of an object are drastically different, the view of object A will resemble the same view of object B much more than it resembles the other view of object A. By comparing individual pairs, the common patterns in the differences between two views are not exploited. In this paper, we propose to go beyond contrasting individual pairs of objects and focus on contrasting sets of objects.

In contrastive learning, the best alignment of objects follows from maximizing the total similarity over positive pairs, while the negative pairs are needed to encourage diversity. The diversity in representations is needed to avoid collapse [42]. We note that a single number, the total similarity from contrasting individual pairs, cannot account both for the intra-set similarities among objects from the same view and the inter-set similarities with objects from another view. Therefore, considering only the total similarity over pairs essentially limits the information content of the supervisory signal available to the model. We aim to exploit the information from contrasting sets of objects rather than contrasting objects pairwise only, as further illustrated in Fig. 1a. From the perspective of set assignment theory [8], two sets may expose the same linear pairwise alignment costs, yet have different internal structures and hence different set alignment costs, see Fig. 1b. Therefore, contrastive learning from sets provides a richer supervisory signal.

Fig. 1. (a) Contrasting views by similarity over pairs (pairwise linear alignment) versus similarity over sets with set-alignment (set-wise quadratic alignment). (b) Comparing total pairwise similarities versus set similarities for different configurations of representation graphs. In both configurations the total similarity over pairs remains the same, so it cannot discriminate between the internal structures of the different views, whereas quadratic alignment can.

To contrast objects as sets, we turn to combinatorial quadratic assignment theory [8], designed to evaluate set and graph similarities. Quadratic assignment extends linear assignment [9], which relies on pairwise similarities between objects. We pose contrastive learning as structural risk minimization [43] over assignment problems. In this view, pairwise contrastive learning methods emerge from the linear assignment case. We aim to extend the contrastive objective to the set level by generalizing the underlying assignment problem from linear to quadratic, as illustrated in Fig. 1a. Since directly computing set similarities from quadratic alignment is computationally expensive, we derive an efficient approximation. It can be implemented as a regularizer for existing contrastive learning methods. We provide a theory for the proposed method from the perspective of assignment problems and dependence maximization, and we experimentally demonstrate the advantages of contrasting objects by quadratic alignment.

We make the following contributions:

  • Using combinatorial assignment theory, we propose to learn representations by contrasting objects as sets rather than as pairs of individual instances.

  • We propose a computationally efficient implementation, where the set contrastive part is implemented as a regularizer for existing contrastive learning methods.

  • We demonstrate that set-contrastive regularization improves recent contrastive learning methods for the tasks of metric learning and self-supervised classification.

As a byproduct of viewing representation learning through the lens of combinatorial assignment theory, we additionally propose SparseCLR contrastive loss, a modification of the popular InfoNCE [39] objective. Different from InfoNCE, SparseCLR enjoys sparse support over negative pairs, which permits better use of hard negative examples. Our experiments demonstrate that such an approach improves the quality of learned representations for self-supervised classification.

2 Related Work

2.1 Contrastive Learning

In contrastive learning, a model is trained to align representations of the data from different views (modalities), which dates back to the work of Becker et al. [4]. Contrastive learning combines a joint embedding architecture, such as a Siamese network [7], with a contrastive objective function. The network maps different views of the data into an embedding space, where the alignment between embeddings is measured by contrasting positive and negative pairs. The pairs are drawn either from a complete set of observations [12, 13, 40, 41] or from partially complete views [26, 32, 46, 47], and the contrastive objective is implemented by a contrastive loss function [12, 21, 39, 44]. Van den Oord et al. [39] derived the InfoNCE contrastive loss as a lower bound on the mutual information between different views of the signal. In computer vision, Chen et al. [12] apply InfoNCE to contrast images with their distorted versions. In PIRL [37], the authors propose to maintain a memory bank of image representations to improve the generalization of the model through better sampling of negative examples. Along the same lines, MoCo [14, 21] introduces a running queue of negative samples and a momentum encoder to increase the contrasting effect of negative pairs. All of these contrastive methods are based on measuring the alignment between individual pairs of objects. This approach does not account for patterns in the views of objects beyond contrasting them pairwise. We aim to extend contrastive learning to include set similarities.

2.2 Information Maximization Methods

In contrastive learning, information maximization methods aim to improve contrastive representations by maximizing the information content of the embeddings [2]. In W-MSE [16], batch representations are passed through a whitening (Karhunen-Loève) transformation before computing the contrastive loss. Such transformations help to avoid collapsing representations when the contrastive objective can be minimized without learning discriminative representations. In [48], the authors follow the redundancy-reduction principle to design a loss function that brings the cross-correlation matrix of representations from different views close to the identity. In the recent work of Bardes et al. [2], the authors extend the formulation of Barlow Twins with an explicit variance term, which helps to stabilize the training. In this paper, we also seek to maximize the information content of the embeddings, but we do so by computing a rich representation of set similarities between data views. In contrast to existing methods, our approach does not require any additional transformations such as whitening and can be easily incorporated into other contrastive learning methods.

2.3 Distillation Methods

Another branch of self-supervised methods is based not on contrastive learning but on knowledge distillation [24]. Instead of contrasting positive and negative pairs of samples, these methods try to predict one positive view from another. We discuss them here for completeness. BYOL [19] uses a student network that learns to predict the output of a teacher network. SimSiam [15] simplifies BYOL by demonstrating that a stop-gradient operation suffices to learn a generalized representation from distillation. In SwAV [11], the authors combine online clustering and distillation by predicting swapped cluster assignments. These self-supervised methods have demonstrated promising results in the task of learning representations. We follow the different path of contrastive learning, although we do not exclude the possibility that set-based methods may also be applicable to distillation.

2.4 Assignment Theory

Generally, optimal assignment problems can be categorized into linear and higher-order assignments. Linear assignment problems can be viewed as a transportation problem over a bipartite graph, where the transportation distances are defined on individual pairs of nodes. Linear assignment problems have many real-world applications, such as scheduling and vehicle routing; we refer to [9] for an overview. Higher-order, and in particular quadratic, assignment problems [8, 17] extend the domain of the transportation distances from individual pairs to sets of objects. Where the linear assignment problem seeks to optimally match individual instances, its quadratic counterpart matches interconnected graphs of instances, bringing additional structure to the task. In this work, we exploit linear and quadratic assignment theory to rethink contrastive learning.

2.5 Structured Predictions

In a structured prediction problem, the goal is to learn to predict a structured object such as a sequence, a tree, or a permutation matrix [1]. Training with structured outputs is non-trivial, as one has to compute the loss on the manifold defined by the structured output space. In the seminal work of Tsochantaridis et al. [43], the authors derive a structured loss for support vector machines and apply it to a sequence labeling task. Later, structured prediction was applied to learn parameters of constraint optimization problems [6, 20]. In this work, we utilize structured prediction principles to incorporate set similarities into contrastive losses.

3 Background

We start by formally introducing contrastive learning and linear assignment problems. Then, we argue about the connection between these two problems and demonstrate how one leads to another. This connection is essential for the derivation of our contrastive method.

3.1 Contrastive Representation Learning

Consider a dataset \(\mathcal {D}=\{ d_i \}_{i=1}^{N}\) and an encoder function \(f_{\theta }: \mathcal {D} \longrightarrow \mathbb {R}^{N \times E}\), which outputs an embedding vector of dimension E for each of the objects in \(\mathcal {D}\). The task is to learn such an \(f_{\theta }\) that embeddings of objects belonging to the same category are pulled together, while the embeddings of objects from different categories are pushed apart.

As category labels are not available during training, the common strategy is to maximize the agreement between different views of the instances, i.e. to maximize the mutual information between different data modalities. The mutual information between two latent variables X, Y can be expressed as:

$$\begin{aligned} MI(X,Y) = H(X) - H(X | Y) \end{aligned}$$
(1)

where X and Y correspond to representations of two different views of a dataset. Minimizing the conditional entropy H(X|Y) aims at reducing uncertainty in the representations from one view given the representations of another, while H(X) enforces the diversity in representations and prevents trivial solutions. In practice, sample-based estimators of the mutual information such as InfoNCE [39] or NT-Xent [12] are used to maximize the objective in (1).

3.2 Linear Assignment Problem

Given two input sets, \(\mathcal {A} = \{ a_{i} \}_{i=1}^{N}\) and \(\mathcal {B} = \{ b_{j} \}_{j=1}^{N}\), we define the inter-set similarity matrix \({\textbf {S}} \in \mathbb {R}^{N \times N}\) between each element in set \(\mathcal {A}\) and each element in \(\mathcal {B}\). This similarity matrix encodes pairwise distances between the elements of the sets, i.e. \([{\textbf {S}}]_{i,j} = \phi (a_{i}, b_{j})\), where \(\phi \) is a distance metric. The goal of the linear assignment problem is to find a one-to-one assignment \(\hat{y}({\textbf {S}})\), such that the sum of distances between assigned elements is optimal. Formally:

$$\begin{aligned} \hat{y}({\textbf {S}}) = \underset{{\textbf {Y}} \in \varPi }{\text {argmin}}~\textit{tr}({\textbf {S}}{} {\textbf {Y}}^T) \end{aligned}$$
(2)

where \(\varPi \) corresponds to a set of all \(N \times N\) permutation matrices.

The linear assignment problem in (2) is also known as the bipartite graph matching problem. It can be efficiently solved by linear programming algorithms [5].
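For concreteness, the optimal assignment in (2) can be computed with an off-the-shelf solver. The following minimal sketch (our illustration, not part of the original paper) uses SciPy's Hungarian-algorithm implementation with the Euclidean distance as \(\phi \) and matches two noisy, shuffled views of the same objects:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_assignment(A, B):
    """Solve the linear assignment problem (2) between two sets of embeddings.

    A, B: arrays of shape (N, E), one embedding per object.
    Returns the permutation pi such that A[i] is matched to B[pi[i]].
    """
    # Inter-set similarity (here: distance) matrix, [S]_{i,j} = phi(a_i, b_j).
    S = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # Hungarian algorithm: argmin over permutation matrices Y of tr(S Y^T).
    _, col_ind = linear_sum_assignment(S)
    return col_ind

# Toy usage: two noisy copies of the same five objects, the second one shuffled.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
perm = rng.permutation(5)
A = X + 0.01 * rng.normal(size=X.shape)
B = X[perm] + 0.01 * rng.normal(size=X.shape)
print(np.array_equal(optimal_assignment(A, B), np.argsort(perm)))  # True
```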

3.3 Learning to Produce Correct Assignments

Consider two sets, \(\mathcal {D}_{\mathcal {Z}_1}\) and \(\mathcal {D}_{\mathcal {Z}_2}\), which encode the dataset \(\mathcal {D}\) under two different views \(\mathcal {Z}_1\) and \(\mathcal {Z}_2\). By design, the two sets consist of the same instances, but the modalities and the order of the objects may differ. With an encoder that maximizes the mutual information, objects in both sets can be uniquely associated with one another by comparing their representations, as the uncertainty term in (1) is minimized. In other words, \(\hat{y}({\textbf {S}}) = {\textbf {Y}}_{gt}\), where \({\textbf {Y}}_{gt} \in \varPi \) is the ground truth assignment between elements of \(\mathcal {D}_{\mathcal {Z}_1}\) and \(\mathcal {D}_{\mathcal {Z}_2}\).

Thus, a natural objective to supervise an encoder function is to train it to produce the correct assignment between the different views of the data. As the assignment emerges as a result of the optimization problem, we employ a structured prediction framework [43] to define a structural loss from the linear assignment problem as follows:

$$\begin{aligned} L({\textbf {S}}, {\textbf {Y}}_{gt}) = \textit{tr}({\textbf {S}} {\textbf {Y}}_{gt}^T) - \underset{{\textbf {Y}} \in \varPi }{\min } \textit{tr}({\textbf {S}}{} {\textbf {Y}}^T) \end{aligned}$$
(3)

Note that \(L \ge 0\), and \(L = 0\) only when the similarities produced by the encoder lead to the correct optimal assignment. Intuitively, the structured linear assignment loss in (3) encodes the discrepancy between the true cost of the assignment and the cost induced by the encoder.

By minimizing the objective in (3), we train the encoder \(f_{\theta }\) to correctly assign objects from one view to another. In practice, it is desirable that such an assignment is robust under small perturbations of the input data. A straightforward way to achieve this is to enforce a separation margin m in the input similarities, i.e. \({\textbf {S}}_m = {\textbf {S}} + m{\textbf {Y}}_{gt}\). The structured linear assignment loss with separation margin then reduces to the known margin triplet loss [23].

Proposition 1

The structured linear assignment loss \(L({\textbf {S}}_m, {\textbf {Y}}_{gt})\) with separation margin \(m \ge 0\) is equivalent to the margin triplet loss.

Mining Strategies. By default, the loss in (3) enforces a one-to-one negative pair mining strategy due to the structural domain constraint \({\textbf {Y}} \in \varPi \). By relaxing this domain constraint to row-stochastic binary matrices, we arrive at the known batch-hard mining [23]. This is essential to have a computationally tractable implementation of structured assignment losses.
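To make the reduction concrete, the following PyTorch sketch (our illustration; it assumes both views are ordered identically so that \({\textbf {Y}}_{gt}\) is the identity matrix, and that \({\textbf {S}}\) holds pairwise distances) implements the structured linear assignment loss with a separation margin under batch-hard mining:

```python
import torch

def structured_linear_assignment_loss(S, margin=0.2):
    """Structured linear assignment loss (3) with separation margin and
    batch-hard mining.

    S: (N, N) matrix of pairwise distances between two views, ordered so
    that the ground-truth assignment Y_gt is the identity matrix.
    """
    eye = torch.eye(S.size(0), device=S.device)
    S_m = S + margin * eye                 # S_m = S + m * Y_gt
    pos = torch.diagonal(S_m)              # per-row cost of the ground-truth assignment
    hardest = S_m.min(dim=1).values        # batch-hard relaxation of min_Y tr(S_m Y^T)
    # Each row equals max(0, d(a_i, p_i) + m - d(a_i, hardest negative)),
    # i.e. the batch-hard margin triplet loss of Proposition 1.
    return (pos - hardest).mean()
```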

Smoothness. An immediate issue when directly optimizing the structured linear assignment loss is the non-smoothness of the minimum function in (3). It is known that optimizing smoothed functions can be more efficient than directly optimizing their non-smooth counterparts [3]. The common way to smooth a minimum is by log-sum-exp approximation. Thus, we can obtain a smoothed version of structured linear assignment loss:

$$\begin{aligned} L_{\tau }({\textbf {S}}, {\textbf {Y}}_{gt}) = \textit{tr}({\textbf {S}} {\textbf {Y}}_{gt}^T) + \tau \log \sum _{{\textbf {Y}} \in \varPi } \exp (-\frac{1}{\tau } \textit{tr}({\textbf {S}}{} {\textbf {Y}}^T) ) \end{aligned}$$
(4)

where \(\tau \) is a temperature parameter controlling the degree of smoothness. Practically, the formulation in (4) requires summing N! terms, which makes it computationally intractable under the default structural constraints \({\textbf {Y}} \in \varPi \). Fortunately, as in the non-smooth case, we can utilize batch-hard mining, which leads to \(O(N^2)\) computational complexity. The smoothed structured linear assignment loss with batch-hard mining reduces to the known normalized-temperature cross-entropy [12], also known as the InfoNCE [39] loss.
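Analogously, a minimal sketch of the smoothed loss (4) under batch-hard mining (our illustration, with the same distance convention and ordering assumptions as above):

```python
import torch

def smoothed_assignment_loss(S, tau=0.1):
    """Smoothed structured linear assignment loss (4) with batch-hard mining.

    S: (N, N) matrix of pairwise distances between two views, ordered so
    that the ground-truth assignment is on the diagonal.
    """
    pos = torch.diagonal(S)                               # tr(S Y_gt^T), row-wise
    smooth_min = -tau * torch.logsumexp(-S / tau, dim=1)  # log-sum-exp smoothing of the row-wise minimum
    # Up to the sign convention of the similarity measure, each row is the
    # normalized-temperature cross-entropy (InfoNCE) term of Proposition 2.
    return (pos - smooth_min).mean()
```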

Proposition 2

The smoothed structured linear assignment loss \(L_{\tau }({{\textbf {S}}}, {\textbf {Y}}_{gt})\) with batch-hard mining is equivalent to the normalized-temperature cross entropy loss.

Connection to Mutual Information. It is known that the InfoNCE objective is a lower bound on the mutual information between the representations of data modalities [39]. Thus, Propositions 1 and 2 reveal a connection between mutual information maximization and minimization of structured losses for assignment problems. In fact, the assignment cost \(\textit{tr}({\textbf {S}} {\textbf {Y}}_{gt}^T)\) in (3) is related to the conditional entropy H(X|Y), while \(\min _{{\textbf {Y}} \in \varPi } \textit{tr}({\textbf {S}}{} {\textbf {Y}}^T)\) aims to maximize the diversity in representations. This connection allows contrastive representation learning methods based on InfoNCE to be considered a special case of the structured linear assignment loss.

4 Extending Contrastive Losses

We next demonstrate how to exploit the connection between contrastive learning and assignment problems to extend contrastive losses to the set level. As a byproduct of this connection, we also derive the SparseCLR contrastive objective.

4.1 Contrastive Learning with Quadratic Assignments on Sets

Quadratic Assignment Problem. As in the linear case, we are given two input sets \(\mathcal {A}, \mathcal {B}\) and the inter-set similarity matrix \({\textbf {S}}\). For the Quadratic Assignment Problem (QAP), we additionally define intra-set similarity matrices \({\textbf {S}}_{\mathcal {A}}\) and \({\textbf {S}}_{\mathcal {B}}\), measuring similarities within the sets \(\mathcal {A}\) and \(\mathcal {B}\) respectively, i.e. \([{\textbf {S}}_{\mathcal {A}}]_{ij} = \phi (a_i, a_j)\). The goal of the quadratic assignment problem is to find a one-to-one assignment \(\hat{y}_{\mathcal {Q}}({\textbf {S}}, {\textbf {S}}_{\mathcal {A}}, {\textbf {S}}_{\mathcal {B}}) \in \varPi \) that maximizes the set similarity between \(\mathcal {A}\) and \(\mathcal {B}\), where the set similarity is defined as follows:

$$\begin{aligned} \mathcal {Q}({\textbf {S}}, {\textbf {S}}_{\mathcal {A}}, {\textbf {S}}_{\mathcal {B}}) = \underset{{\textbf {Y}} \in \varPi }{\min } \left[ \textit{tr}({\textbf {S}}{} {\textbf {Y}}^T) + \textit{tr}({\textbf {S}}_{\mathcal {A}} {\textbf {Y}} {\textbf {S}}_{\mathcal {B}}^T {\textbf {Y}}^T) \right] \end{aligned}$$
(5)

Compared to the linear assignment problem, the quadratic term \(\mathcal {Q}({\textbf {S}}, {\textbf {S}}_{\mathcal {A}}, {\textbf {S}}_{\mathcal {B}})\) in (5) additionally measures the discrepancy in internal structures between sets \({\textbf {S}}_{\mathcal {A}}\) and \({\textbf {S}}_{\mathcal {B}}\).

Learning with Quadratic Assignments. Following similar steps as for the linear assignment problem in Sect. 3.3, we next define the structured quadratic assignment loss by extending the structured linear assignment loss in (3) with the quadratic term:

$$\begin{aligned} L_{QAP} = \textit{tr}({\textbf {S}}{} {\textbf {Y}}_{gt}) - \mathcal {Q}({\textbf {S}}, {\textbf {S}}_{\mathcal {A}}, {\textbf {S}}_{\mathcal {B}}) \end{aligned}$$
(6)

Minimization of the structured quadratic assignment loss in (6) encourages the encoder to learn representations that result in the same solutions of the linear and quadratic assignment problems, which is only possible when the inter-set and intra-set similarities are sufficiently close [8]. Note that we do not use a ground truth quadratic assignment in \(L_{QAP}\); instead, the ground truth linear assignment cost is used for supervision. This is due to the quadratic nature of \(\mathcal {Q}\), for which minimizing the discrepancy between ground truth and predicted assignments is a subtle optimization objective, e.g. for an equidistant set of points the costs of all quadratic assignments are identical.

To compute \(L_{QAP}\), we first need to evaluate the quadratic term \(\mathcal {Q}({\textbf {S}}, {\textbf {S}}_{\mathcal {A}}, {\textbf {S}}_{\mathcal {B}})\), which requires solving the quadratic assignment problem. This problem is notoriously hard to solve exactly, even for moderate input dimensionality [8]. To alleviate this, we use a computationally tractable upper bound:

$$\begin{aligned} \begin{aligned} L_{QAP}&\le \textit{tr}({\textbf {S}}{} {\textbf {Y}}_{gt}) - \underset{{\textbf {Y}} \in \varPi }{\min } \textit{tr}({\textbf {S}} {\textbf {Y}}^T) - \underset{{\textbf {Y}} \in \varPi }{\min } \textit{tr}({\textbf {S}}_{\mathcal {A}} {\textbf {Y}} {\textbf {S}}_{\mathcal {B}}^T {\textbf {Y}}^T) \\&\le L({\textbf {S}}, {\textbf {Y}}_{gt}) - \langle \lambda _{\mathcal {A}}, \lambda _{\mathcal {B}} \rangle _{-} \end{aligned} \end{aligned}$$
(7)

where \(\langle \lambda _{\mathcal {A}}, \lambda _{\mathcal {B}} \rangle _{-}\) corresponds to the minimum dot product between the eigenvalues of the matrices \({\textbf {S}}_{\mathcal {A}}\) and \({\textbf {S}}_{\mathcal {B}}\).

The first inequality is derived from Jensen's inequality applied to (6), and the second inequality holds because \(\langle \lambda _{F}, \lambda _{D} \rangle _{-} \le \textit{tr}({\textbf {F}} {\textbf {X}} {\textbf {D}}^T {\textbf {X}}^T) \le \langle \lambda _{F}, \lambda _{D} \rangle _{+}\) for symmetric matrices \({\textbf {F}}, {\textbf {D}}\) and \({\textbf {X}} \in \varPi \), as shown in Theorem 5 by Burkard [8]. The above derivation is for the case when the similarity metric is a valid distance function, i.e. minimizing a distance corresponds to maximizing a similarity. The derivation for the reverse case can be done analogously by replacing min with max in the optimal assignment problem formulation (we provide more details in the supplementary material).
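The eigenvalue bound underlying the second inequality can be checked numerically. A short NumPy sketch (our own sanity check, not part of the paper) for random symmetric matrices and a random permutation:

```python
import numpy as np

# Quick numerical check of the eigenvalue bound used in (7):
#   <lam_F, lam_D>_-  <=  tr(F X D^T X^T)  <=  <lam_F, lam_D>_+
# for symmetric F, D and a permutation matrix X.
rng = np.random.default_rng(1)
N = 6
F = rng.normal(size=(N, N)); F = (F + F.T) / 2
D = rng.normal(size=(N, N)); D = (D + D.T) / 2
X = np.eye(N)[rng.permutation(N)]

lam_F = np.sort(np.linalg.eigvalsh(F))
lam_D = np.sort(np.linalg.eigvalsh(D))
lower = lam_F @ lam_D[::-1]   # opposite sort orders give the minimum dot product
upper = lam_F @ lam_D         # identical sort orders give the maximum dot product
value = np.trace(F @ X @ D.T @ X.T)
assert lower - 1e-9 <= value <= upper + 1e-9
print(f"{lower:.3f} <= {value:.3f} <= {upper:.3f}")
```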

Optimizing the upper bound in (7) is computationally tractable compared to optimizing the exact version of \(L_{QAP}\). Another advantage is that the upper bound in (7) decomposes into the structured linear assignment loss \(L({\textbf {S}}, {\textbf {Y}}_{gt})\) plus a regularizing term that accounts for the set similarity. This makes it easy to modify existing contrastive learning approaches, such as [12, 23, 36, 39], that are based on pairwise similarities and thus stem from the linear assignment problem, by simply plugging in the regularizing term. We provide a simple pseudocode example demonstrating InfoNCE with the quadratic assignment regularization (Algorithm 1).

Algorithm 1. InfoNCE with quadratic assignment regularization (QARe).
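The published pseudocode is not reproduced here; instead, the following PyTorch sketch (our own illustration) conveys the same idea. It assumes L2-normalized embeddings, the cosine distance for the intra-set matrices, a single-direction InfoNCE term for the pairwise part, and an illustrative weight `alpha`; the exact loss composition and hyperparameters used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def qare(z1, z2):
    """Quadratic assignment regularizer from the upper bound in (7):
    minus the minimum dot product between the eigenvalue spectra of the
    intra-set (within-view) similarity matrices."""
    S_a = 1.0 - z1 @ z1.T                 # intra-set cosine distances, [S_A]_{ij} = phi(a_i, a_j)
    S_b = 1.0 - z2 @ z2.T
    lam_a = torch.linalg.eigvalsh(S_a)    # full spectra, ascending; O(N^3) cost
    lam_b = torch.linalg.eigvalsh(S_b)    # (the paper notes a top-k variant with O(k^2 N) cost)
    # <lam_A, lam_B>_-: pair the spectra in opposite orders (rearrangement inequality).
    return -(lam_a * lam_b.flip(0)).sum()

def info_nce(z1, z2, tau=0.1):
    """A minimal one-directional InfoNCE / NT-Xent term over cosine similarities."""
    logits = z1 @ z2.T / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def info_nce_with_qare(h1, h2, alpha=1.0, tau=0.1):
    """InfoNCE with quadratic assignment regularization (QARe).

    h1, h2: (N, E) embeddings of two views, row i being the same object in
    both tensors.  `alpha` is the regularizer weight (a hyperparameter)."""
    z1 = F.normalize(h1, dim=1)
    z2 = F.normalize(h2, dim=1)
    return info_nce(z1, z2, tau) + alpha * qare(z1, z2)
```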

Computational Complexity. Computing the upper bound in (7) has a computational complexity of \(O(N^3)\), as one needs to compute the eigenvalues of the intra-set similarity matrices. This is opposed to (6), which requires directly computing quadratic assignments, for which no polynomial-time algorithms are known [8]. The computational complexity can be pushed further down to \(O(k^2N)\) by evaluating only the top-k eigenvalues instead of computing all eigenvalues. In the supplementary material, we also provide an empirical analysis of how the proposed approach influences the training time of a baseline contrastive method.

4.2 SparseCLR

In Sect. 3 we noted that the smoothness of a loss function is a desirable property for optimization and that the log-sum-exp smoothing of the structured linear assignment loss yields the normalized temperature cross-entropy objective as a special case. Such an approach, however, is known to have limitations. Specifically, the log-sum-exp smoothing yields dense support over samples, being unable to completely rule out irrelevant examples [38]. That can be statistically and computationally overwhelming and can distract the model from fully utilizing information from hard negative examples.

We propose to use sparsemax instead of the log-sum-exp approximation to alleviate the non-smoothness of the structured linear assignment objective, yet keep the sparse support of the loss function. Here, sparsemax is defined as the minimum-distance Euclidean projection onto the k-dimensional simplex, as in [35, 38].
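For reference, a minimal sketch of this projection (our own implementation following the construction of [35, 38]):

```python
import torch

def sparsemax(z):
    """Euclidean projection of a 1-D score vector z onto the probability simplex.
    Unlike softmax, the result is typically sparse."""
    z_sorted, _ = torch.sort(z, descending=True)
    cumsum = torch.cumsum(z_sorted, dim=0) - 1.0
    ks = torch.arange(1, z.numel() + 1, dtype=z.dtype, device=z.device)
    support = z_sorted * ks > cumsum      # elements kept in the support
    k = support.sum()
    tau = cumsum[k - 1] / k               # thresholding operator T(z)
    return torch.clamp(z - tau, min=0.0)

# Example: sparsemax puts zero mass on clearly irrelevant entries.
print(sparsemax(torch.tensor([3.0, 1.0, 0.2, 0.1])))              # tensor([1., 0., 0., 0.])
print(torch.softmax(torch.tensor([3.0, 1.0, 0.2, 0.1]), dim=0))   # dense, all entries > 0
```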

Let \(\tilde{x} \in \mathbb {R}^{N!}\) be a vector that consists of the realizations of all possible assignment costs for the similarity matrix \({\textbf {S}}\). With this, we can define SparseCLR as follows:

(8)

where \(\varOmega (X) = \{ j \in X : \textit{ sparsemax}(X)_{j} > 0\}\) is the support of the sparsemax function, and \(\mathcal {T}\) denotes the thresholding operator [35]. In practice, to avoid summing over a factorial number of terms, we use the batch-hard mining strategy, as for the other methods, resulting in \(O(N^2)\) computational complexity for SparseCLR.

5 Experiments

In this section, we evaluate the quality of representations trained with and without our Quadratic Assignment Regularization (QARe). We also evaluate the performance of SparseCLR and compare it with other contrastive losses. As we consider representation learning from the perspective of assignment theory, we first conduct an instance matching experiment, where the goal is to learn to predict the correct assignment between different views of the dataset. Then, we test the proposed method on the task of self-supervised classification and compare it with other contrastive learning approaches. Finally, we present an ablation study on the role of the weighting term when combining QARe with a baseline contrastive learning method.

5.1 Matching Instances from Different Views

In this experiment, the goal is to train representations of objects from different views, such that the learned representations provide a correct matching between identities in the dataset. This problem is closely related to metric learning and can be solved by contrasting views of the data [23].

Data. We adopt the CUHK-03 dataset [31] designed for the ReID [23] task. CUHK-03 consists of 1467 different identities, each recorded from front and back views. To train and test the models, we randomly divide the dataset into 70/15/15 train/test/validation splits.

Evaluation. We evaluate the quality of representations learned with contrastive losses by computing the matching accuracy between front and back views. In practice, we first obtain the embeddings of the views of the instances from the encoder and then compute their inter-view similarity matrix using the Euclidean distance. Given this, the matching accuracy is defined via the mean Hamming distance between the optimal assignment derived from the inter-view similarity matrix and the ground truth assignment. We report the average matching accuracy and the standard deviation over 3 runs with the same fixed random seeds. As baselines, we choose the triplet loss with batch-hard mining [23], InfoNCE [39], and NT-Logistic [36] contrastive losses, which we extend with the proposed QARe.
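A sketch of this evaluation step is given below (our own illustration). It assumes both views are ordered identically so that the ground truth assignment is the identity permutation, and it reports the fraction of correctly matched identities, which corresponds to the Hamming-distance-based score up to normalization:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_accuracy(front_emb, back_emb):
    """Fraction of identities matched correctly between the two views.

    Both arrays are (N, E); row i in both corresponds to the same identity,
    so the ground-truth assignment is the identity permutation.
    """
    # Inter-view Euclidean distance matrix between the view embeddings.
    S = np.linalg.norm(front_emb[:, None, :] - back_emb[None, :, :], axis=-1)
    _, col_ind = linear_sum_assignment(S)   # optimal assignment from the inter-view distances
    return float(np.mean(col_ind == np.arange(len(col_ind))))
```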

Implementation Details. As the encoder, we use ResNet-18 [22] with the top classification layer replaced by a linear projection head that outputs feature vectors of dimension 64. To obtain representations, we normalize the feature vectors to unit L2 norm.

We train the models for 50 epochs using the Adam optimizer [29] with cosine annealing without restarts [34]. The initial learning rate is 0.01 and the batch size is set to 128. During training, we apply random color jittering and horizontal flip augmentations. To compute the test matching accuracy, we select the best model over the epochs based on the validation split. We provide detailed hyperparameters of the losses, QARe weighting, and augmentations in the supplementary material.

Results. The results are reported in Table 1. As can be seen, adding QARe regularization to a baseline contrastive loss leads to better matching accuracy, which indicates an improved quality of representations. Notably, QARe delivers a 3.6% improvement when combined with the triplet loss and a 4.1% increase in accuracy with the InfoNCE objective. This demonstrates that the quadratic assignment regularization on sets helps the model to learn a more general representation space.

Table 1. Matching accuracy for instances from different views of the CUHK-03 dataset. Pairwise corresponds to the contrastive losses acting on the level of pairs; +QARe denotes methods with quadratic assignment regularization. Best results are in bold.

5.2 Self-supervised Classification

Next, we evaluate the quadratic assignments on sets in the task of self-supervised classification. The goal in this experiment is to learn, from unlabeled data, a representation space that provides linear separation between the classes in a dataset. This is a common problem for testing contrastive learning methods.

Evaluation. We follow the standard linear probing protocol [30]. The evaluation is performed by retrieving embeddings from a contrastively trained backbone and fitting a linear classifier on top of the embeddings. Since linear classifiers have low discriminative power on their own, they heavily rely on the input representations; thus, a higher classification accuracy indicates a better quality of the learned representations. As baselines for pairwise contrastive learning methods, we select the popular SimCLR and the proposed SparseCLR. We extend these methods to the set level by adding the quadratic assignment regularization. We compare the set-level contrastive methods with other self-supervised [10, 18] and contrastive approaches [11, 12].

Table 2. Self-supervised training and linear probing for self-supervised classification. Average classification accuracy (percentage) and standard deviation over 3 runs with common fixed random seeds are reported. \(^*\): results from our reimplementation of SimCLR.

Implementation Details. We train shallow convolutional (Conv-4) and deep (ResNet-32) backbone encoders to obtain representations. We follow the standard approach and attach an MLP projection head that outputs feature vectors of dimension 64 to compute a contrastive loss. We use the cosine distance to get similarities between embeddings. We provide more details on the encoder architectures in the supplementary material.

We train the models for 200 epochs using the Adam optimizer with a learning rate of 0.001 and a batch size of 128. For contrastive learning methods, we adopt the set of augmentations from [12] to form different views of the data. For linear probing, we freeze the encoder weights and train a logistic regression on top of the learned representation for 100 epochs with the Adam optimizer, using a learning rate of 0.001 and a batch size of 128. We provide detailed hyperparameters of the contrastive losses and augmentations in the supplementary material.
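For illustration, a minimal sketch of the linear probing step is shown below (our own illustration; the paper trains the probe with Adam, whereas this sketch uses scikit-learn's LogisticRegression as a stand-in, and `encoder`, `train_loader`, `test_loader` are assumed to exist):

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(encoder, loader, device="cuda"):
    """Collect frozen-encoder embeddings and labels for linear probing."""
    encoder.eval()
    feats, labels = [], []
    for x, y in loader:
        feats.append(encoder(x.to(device)).cpu())
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# Linear probing: fit a linear classifier on the frozen features; its test
# accuracy serves as the proxy for representation quality.
# train_x, train_y = extract_features(encoder, train_loader)
# test_x, test_y = extract_features(encoder, test_loader)
# probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)
# print(probe.score(test_x, test_y))
```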

We perform training and testing on standard CIFAR-10, CIFAR-100, and tiny-ImageNet datasets. For each dataset, we follow the same training procedure and fine-tune the optimal weighting of the quadratic assignment regularizer.

Fig. 2. Influence of the weighting of the QARe term on self-supervised classification for ResNet-32 trained with SimCLR. The accuracy is recorded over 3 runs with common fixed random seeds.

Table 3. Self-supervised training on unlabeled data and linear evaluation on labeled data, comparing SimCLR [12] with the proposed SparseCLR and SparseCLR with QARe. Average accuracy and standard deviation over 3 runs with common fixed random seeds are reported.

QARe Results. The results for the quadratic assignment regularization are reported in Table 2 for SimCLR and in Table 3 for the proposed SparseCLR. For SimCLR, adding QARe delivers a 1.9% accuracy improvement on CIFAR-10 for the shallow encoder network and up to a 1.6% improvement for ResNet-32 on tiny-ImageNet. We observed a consistent improvement in the other dataset-architecture setups, except tiny-ImageNet with the shallow Conv-4 encoder, where the performance gain from adding QARe is modest. For this case, we investigated the training behavior of the models and observed that both SimCLR and SimCLR+QARe with the Conv-4 architecture quickly saturate to a saddle point, where the quality of representations stops improving. Since this does not happen with the ResNet-32 architecture, we attribute this phenomenon to the limited discriminative power of the shallow Conv-4 backbone, which cannot be compensated for by regularizing the loss.

For SparseCLR, we observed the same overall pattern. As can be seen from Table 3, extending SparseCLR to the set level with QARe delivers a 1.6% accuracy improvement for ResNet-32 on tiny-ImageNet and also steadily improves the performance on the other datasets. This indicates that QARe helps to provide a richer supervisory signal for the model to train representations.

SparseCLR Results. Next, we compare SimCLR against the proposed SparseCLR. As can be seen in Table 3, SparseCLR consistently improves over the baseline on CIFAR-100 and tiny-ImageNet datasets, where it delivers 2.4% improvement. Surprisingly, we noticed a significant drop in performance for ResNet-32 on CIFAR-10. For this case, we investigated the training behavior of the model and observed that in the case of CIFAR-10, the batch often includes many false negative examples, which results in uninformative negative pairs. Since SparseCLR assigns the probability mass only to a few hardest negative pairs, a lot of false-negative examples in the batch impede obtaining a clear supervisory signal. We assume the problem can be alleviated by using false-negative cancellation techniques [27].

5.3 Ablation Study

Here we illustrate how the weighting of the QARe term influences the quality of representations learned with SimCLR under linear probing evaluation. We search for the optimal weighting in the range from 0 to 1.875 with a step of 0.125. The results are depicted in Fig. 2. In practice, we observed that QARe is not too sensitive to the weighting, and values in the range 0.75–1.25 provide a consistent improvement.

6 Discussion

In this work, we present a set contrastive learning method based on quadratic assignments. Different from other contrastive learning approaches, our method works on the level of set similarities, as opposed to only pairwise similarities, which improves the information content of the supervisory signal. For the derivation, we view contrastive learning through the lens of combinatorial assignment theory. We show how pairwise contrastive methods emerge from learning to produce correct optimal assignments, and then extend them to the set level by generalizing the underlying assignment problem, implemented as a regularizer for existing methods. As a byproduct of viewing representation learning through the lens of assignment theory, we additionally propose the SparseCLR contrastive loss.

Our experiments on instance matching and self-supervised classification suggest that adding quadratic assignment regularization improves the quality of representations learned by the backbone methods. We expect our approach to be most useful in problems where the joint analysis of objects and their groups appears naturally and labels are not readily available.