
1 Introduction

Self-Supervised Learning (SSL) has emerged as an effective strategy for unsupervised learning of image representations, eliminating the need to manually annotate vast quantities of data. By training large models on unlabeled data, SSL aims to learn representations that can be effectively applied to a downstream prediction task with few labels [15].

One of the core ideas of SSL is to remove a portion of the input and learn to predict the removed content [43]. Auto-regressive models and denoising auto-encoders instantiate this principle in vision by predicting the missing parts at the pixel or token level [3, 5, 12, 27, 50]. Masked auto-encoders in particular, which learn representations by reconstructing randomly masked patches from an input, have been successfully applied in vision [5, 27, 52, 55]. However, optimizing a reconstruction loss requires modelling low-level image details that are not necessary for classification tasks involving semantic abstraction. Thus, the resulting representations often need to be fine-tuned for semantic recognition tasks which can lead to overfitting in low-shot settings. Nevertheless, masked auto-encoders have enabled the training of large-scale models and demonstrated state-of-the-art performance when fine-tuning on large labeled datasets, with millions of labels [3, 5, 27, 55].

Fig. 1. Masked Siamese Networks. First use random data augmentations to generate two views of an image, referred to as the anchor view and the target view. Subsequently, a random mask is applied to the anchor view, while the target view is left unchanged. The objective is then to assign the representation of the masked anchor view to the same clusters as the representation of the unmasked target view. A standard cross-entropy loss is used as the criterion to optimize.

Joint-embedding architectures, on the other hand, avoid reconstruction. Approaches such as Siamese Networks [6, 10, 11, 15, 25, 28, 57] learn a representation by training an encoder network to produce similar embeddings for two different views of the same image [9, 22]. Here the views are typically constructed by applying different image transforms—such as random scaling, cropping, and color jitter—to the input [41, 53]. The inductive bias introduced by this invariance-based pre-training typically produces strong off-the-shelf representations of a high semantic level [11] but often disregards rich local structure that can be helpful to model.

In this work, we propose Masked Siamese Networks (MSNs), a self-supervised learning framework that leverages the idea of mask-denoising while avoiding pixel and token-level reconstruction. Given two views of an image, MSN randomly masks patches from one view while leaving the other view unchanged. The objective is to train a neural network encoder, parametrized with a vision transformer (ViT) [21], to output similar embeddings for the two views. In this procedure, MSN does not predict the masked patches at the input level, but rather performs the denoising step implicitly at the representation level by ensuring that the representation of the masked input matches the representation of the unmasked one. Figure 1 shows a schematic of the method.

Empirically, we demonstrate that MSNs learn strong off-the-shelf representations that excel at low-shot prediction (cf. Fig. 2). In particular, MSN achieves good classification performance using \(100\times \) fewer labels than current mask-based auto-encoders [27, 54]. In the standard 1% ImageNet low-shot classification task, an MSN-trained ViT-B/4 (using a patch size of \(4\times 4\) pixels) achieves 75.7% top-1 accuracy, outperforming the previous 800M parameter state-of-the-art convolutional network [14] while using nearly \(10\times \) fewer parameters (cf. Fig. 2a).

Since a good representation should not need many examples to learn about a concept [24], we also consider a more challenging evaluation benchmark for label-efficient low-shot classification [39, 45], using from 1 labeled image per class up to 5 images per class (cf. Table 2). MSN also achieves state-of-the-art performance in this regime; e.g., with only 5 labeled images per class, we can pre-train a ViT-B with MSN on ImageNet-1K to achieve over 72% top-1 accuracy, surpassing the previous state-of-the-art method, DINO [11], by 8% top-1.

Similar to masked auto-encoders, MSNs also exhibit good computational scaling since only the unmasked patches are processed by the ViT encoder. For example, by randomly masking 70% of the patches, MSN uses half the computation and memory compared to an unmasked joint-embedding baseline. In practice, we pre-train a ViT-L/7 on as few as 18 AWS p4d-24xlarge machines. Without masking, the same job requires over 42 machines.

Finally, we also show that MSNs are competitive with prior works on other self-supervised benchmarks that use many labels for evaluation (e.g., fine-tuning, linear-evaluation, transfer learning).

2 Prerequisites

Problem Formulation. Consider a large collection of unlabeled images, \(\mathcal {D}=(\textbf{x}_i)_{i=1}^U\), and a small dataset of annotated images, \(\mathcal {S}=({\textbf{x}_{s}}_i, y_i)_{i=1}^L\), with \(L \ll U\). Here, the images in \(\mathcal {S}\) may overlap with the images in the dataset \(\mathcal {D}\). Our goal is to learn image representations by first pre-training on \(\mathcal {D}\) and then adapting the representation to the supervised task using \(\mathcal {S}\).

Siamese Networks. The goal of siamese networks [7, 9], as they are used in self-supervised learning, is to learn an encoder that produces similar image embeddings for two views of an image. Specifically, given an encoder \(f_\theta (\cdot )\) and two views \(\mathbf {x_i}\) and \(\mathbf {x^+_i}\) of an image, the encoder independently processes each view and outputs representations \(z_i\) and \(z^+_i\) respectively, referred to as the anchor representation and the target representation. The objective of siamese networks is to learn an encoder that is not sensitive to differences between views, so the representations \(z_i\) and \(z^+_i\) should match. In practice, the encoder \(f_\theta (\cdot )\) is usually parameterized as a deep neural network with learnable parameters \(\theta \).
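For intuition, the generic Siamese objective can be summarized in a few lines of PyTorch-style code. The sketch below assumes an arbitrary encoder `f_theta` and simply maximizes the cosine similarity between the embeddings of the two views; names and details are illustrative and are not taken from any particular implementation.

```python
import torch.nn.functional as F

def siamese_loss(f_theta, x_anchor, x_target):
    """Negative cosine similarity between the embeddings of two views (sketch)."""
    z = F.normalize(f_theta(x_anchor), dim=-1)       # anchor representation z_i
    z_plus = F.normalize(f_theta(x_target), dim=-1)  # target representation z_i^+
    # Minimizing this loss pulls the two embeddings together. On its own, it
    # admits collapsed (constant) solutions; see the discussion below.
    return -(z * z_plus).sum(dim=-1).mean()
```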

The main challenge with siamese architectures is to prevent representation collapse in which the encoder produces a constant image embedding regardless of the input. Several approaches have been investigated in the literature. Contrastive losses explicitly push away embeddings of different images [9, 15, 28]. Information maximization approaches try to maximize the entropy of the average prediction [1, 11] or spread out the embeddings uniformly on the surface of a sphere [10]. Asymmetric approaches rely on an asymmetric architectural choice such as stop-gradient operations and a momentum encoder [15, 25] to prevent collapse. Other approaches try to decorrelate the vector components of the embeddings to minimize redundancy across samples [6, 57].

Fig. 2. Low-shot evaluation of self-supervised models pre-trained on ImageNet-1K. (Left) MSN matches the previous 800M parameter state-of-the-art, while using a model that is \(10\times \) smaller, and no fine-tuning. (Right) MSN achieves good classification performance using fewer labels than current mask-based auto-encoders.

Vision Transformer. We use a standard Vision Transformer (ViT) architecture [21] as the encoder. Vision Transformers first extract a sequence of non-overlapping patches of resolution \(N \times N\) from an image. Next, they apply a linear layer to extract patch tokens, and subsequently add learnable positional embeddings to them. An extra learnable [CLS] token is added to the sequence. This token aims to aggregate information from the full sequence of patches [11, 21]. The sequence of tokens is then fed to a stack of Transformer layers [49]. A Transformer layer is composed of a self-attention layer [49] and a fully-connected layer with skip connections [29]. Self-attention uses an attention mechanism [4] applied to the entire sequence of elements to update the representation. The output representation associated with the [CLS] token is used as the output of the encoder.
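As a point of reference, the ViT front-end described above can be sketched as the following minimal PyTorch module; the default sizes correspond to a ViT-B/16 on \(224\times 224\) inputs and are illustrative, not the exact implementation used in this work. The Transformer layers that consume the resulting token sequence are omitted.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify an image, linearly embed the patches, add positional
    embeddings, and prepend a learnable [CLS] token (illustrative sketch)."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel = stride = patch size patchifies and
        # linearly embeds the patches in a single step.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # (B, 1, dim)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed
```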

Fig. 3. Masking strategies. When applying a Random Mask, we randomly drop patches across a global view of the image. When applying a Focal Mask, we randomly select a local contiguous block of an image, and mask everything around it. We typically leverage both Random and Focal Masking strategies when pre-training with MSNs.

3 Masked Siamese Networks

We now describe the proposed Masked Siamese Network (MSN) training procedure, which combines invariance-based pre-training with mask denoising; see Fig. 1 for a schematic. MSNs first use random data augmentations to generate two views of an image, referred to as the anchor view and the target view. Subsequently, a random mask is applied to the anchor view, while the target view is left unchanged. Similar to clustering-based SSL approaches [1, 10, 11], learning occurs by computing a soft-distribution over a set of prototypes for both the anchor and target views. The objective is then to assign the representation of the masked anchor view to the same prototypes as the representation of the unmasked target view. We use a standard cross-entropy loss to optimize this criterion.

In contrast to previous work on masked image modelling, the mask-denoising process in MSN is discriminative, rather than generative [5, 27, 52, 55, 61]. MSN architectures do not directly predict pixel values (or tokens) for the masked patches. Instead, the loss is applied directly to the output corresponding to the [CLS] token of the encoder.

Input Views. In each iteration of pre-training, we sample a mini-batch of \(B \ge 1\) images. For an index \(i \in [B]\), let \(\textbf{x}_i\) denote the \(i^{\text {th}}\) image in the mini-batch. For each image \(\textbf{x}_i\), we first apply a random set of data augmentations to generate a target view, denoted \(\textbf{x}^+_i\), and \(M \ge 1\) anchor views, denoted \(\textbf{x}_{i,1}, \textbf{x}_{i,2}, \ldots , \textbf{x}_{i,M}\).
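A minimal sketch of this view-generation step is given below, using standard torchvision transforms; the specific augmentations and parameters are common defaults shown for illustration only, and the actual recipe is described in Appendix C.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline (parameters are common defaults, not
# necessarily those used for MSN pre-training).
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

def make_views(image, M=1):
    """Generate one target view and M anchor views from a PIL image."""
    target = augment(image)                       # x_i^+
    anchors = [augment(image) for _ in range(M)]  # x_{i,1}, ..., x_{i,M}
    return target, anchors
```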

Patchify and Mask. Next, we “patchify” each view by converting it into a sequence of non-overlapping \(N \times N\) patches. After patchifying the anchor view \(\textbf{x}_{i,m}\), we also apply the additional step of masking by randomly dropping some of the patches. We denote by \(\mathbf{\hat{x}}_{i,m}\) the sequence of masked anchor patches, and by \(\mathbf{\hat{x}}^+_i\) the sequence of unmasked target patches. Because of masking, the anchor sequence \(\mathbf{\hat{x}}_{i,m}\) can have a different length than the patchified target sequence \(\mathbf{\hat{x}}^+_i\), even if both image views originally have the same resolution.

We investigate two strategies for masking the anchor views, Random Masking and Focal Masking, which are depicted in Fig. 3. When applying Random Masking, we randomly drop potentially non-contiguous patches across the sequence. When applying Focal Masking, we instead randomly select a local contiguous block of the anchor view and drop all the patches around it.
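Both strategies can be sketched as index-selection routines over the patch grid. The helper functions below are illustrative only: they return the indices of the patches that are kept, which can then be used to subsample the anchor token sequence before it is fed to the encoder.

```python
import torch

def random_mask(num_patches, mask_ratio):
    """Random Masking: keep a random, possibly non-contiguous subset of patches."""
    num_keep = int(num_patches * (1.0 - mask_ratio))
    return torch.randperm(num_patches)[:num_keep]

def focal_mask(grid_size, block_size):
    """Focal Masking: keep one contiguous block of patches, drop all others.

    `grid_size` is the side length of the patch grid (e.g. 14 for a ViT-B/16
    on 224x224 inputs); `block_size` is the side length of the kept block.
    """
    top = torch.randint(0, grid_size - block_size + 1, (1,)).item()
    left = torch.randint(0, grid_size - block_size + 1, (1,)).item()
    rows = torch.arange(top, top + block_size)
    cols = torch.arange(left, left + block_size)
    return (rows[:, None] * grid_size + cols[None, :]).flatten()

# e.g., masked_anchor_tokens = patch_tokens[:, random_mask(196, 0.5)]
```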

Encoder. Given a parameterized anchor encoder, denoted \(f_\theta (\cdot )\), let \(z_{i,m} \in \mathbb {R}^{d}\) denote the representation computed from the patchified (and masked) anchor view \(\mathbf{\hat{x}}_{i,m}\). Similarly, given a parameterized target encoder \(f_{\bar{\theta }}(\cdot )\), with a potentially different set of parameters \(\bar{\theta }\), let \(z^+_i \in \mathbb {R}^{d}\) denote the representation computed from the patchified target view \(\mathbf{\hat{x}}^+_i\). In MSNs, the parameters \(\bar{\theta }\) of the target encoder are updated via an exponential moving average of the anchor encoder parameters [25]. Both encoders correspond to the trunk of a ViT [21]. We take the output of the network to be the representation corresponding to the [CLS] token.
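The update of the target parameters \(\bar{\theta }\) is a standard exponential moving average of the anchor parameters. A hedged sketch follows; the momentum value shown is a common choice in the literature, not necessarily the one used here.

```python
import copy
import torch

# target_encoder = copy.deepcopy(anchor_encoder)  # typical initialization

@torch.no_grad()
def update_target_encoder(anchor_encoder, target_encoder, momentum=0.996):
    """theta_bar <- momentum * theta_bar + (1 - momentum) * theta."""
    for p, p_bar in zip(anchor_encoder.parameters(), target_encoder.parameters()):
        p_bar.mul_(momentum).add_(p, alpha=1.0 - momentum)
```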

Similarity Metric and Predictions. Let \(\textbf{q} \in \mathbb {R}^{K \times d}\) denote \(K > 1\) learnable prototypes, each of dimension d. To train the encoder, we compute a distribution based on the similarity between these prototypes and each anchor and target view pair, and we penalize the encoder for differences between these distributions. More precisely, for an anchor representation \(z_{i,m}\), we compute a “prediction” \(p_{i,m} \in \varDelta _K\) in the K-dimensional simplex by measuring the cosine similarity to the prototypes matrix \(\textbf{q}\). For L\(_2\)-normalized representations and prototypes, the predictions \(p_{i,m}\) can be concisely written as

$$ p_{i,m} :=\text {softmax}\left( \frac{z_{i,m} \cdot \textbf{q}}{\tau } \right) , $$

where \(\tau \in (0, 1)\) is a temperature. Similarly, for each target representation \(z^+_i\), we generate a prediction \(p^+_i \in \varDelta _K\) by measuring the cosine similarity to the same prototypes matrix \(\textbf{q}\). When computing the target predictions, we also use a temperature parameter \(\tau ^+ \in (0, 1)\). Note, we always choose \(\tau ^+ < \tau \) to encourage sharper target predictions, which implicitly guides the model to produce confident low entropy anchor predictions. As we show in Appendix D, target sharpening coupled with mean-entropy maximization is provably sufficient to eliminate collapsing solutions in the MSN framework.
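In code, the prediction step amounts to a temperature-scaled softmax over cosine similarities; the sketch below is illustrative and assumes the representations and prototypes are stored as ordinary tensors.

```python
import torch
import torch.nn.functional as F

def prototype_predictions(z, prototypes, temperature):
    """Soft assignment of representations to prototypes.

    z: (batch, d) representations; prototypes: (K, d). Both are L2-normalized
    so the dot product is a cosine similarity; returns points in the simplex.
    """
    z = F.normalize(z, dim=-1)
    q = F.normalize(prototypes, dim=-1)
    return F.softmax(z @ q.t() / temperature, dim=-1)

# Anchor predictions use temperature tau; target predictions use a sharper
# tau_plus < tau and are detached so that no gradient flows through them:
# p_anchor = prototype_predictions(z_anchor, prototypes, tau)
# p_target = prototype_predictions(z_target, prototypes, tau_plus).detach()
```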

Training Objective. As previously mentioned, to train the encoder, we penalize when the anchor prediction \(p_{i,m}\) is different from the target prediction \(p_i^+\). We enforce this criterion using a standard cross-entropy loss \(H(p_{i,m}, p^+_i)\).

We also incorporate the mean entropy maximization (me-max) regularizer, also used in [1, 33], to encourage the model to utilize the full set of prototypes. Denote the average prediction across all the anchor views by

$$ \overline{p} :=\frac{1}{MB}\sum ^B_{i=1}\sum ^M_{m=1} p_{i,m}. $$

The me-max regularizer simply seeks to maximize the entropy of \(\overline{p}\), denoted \(H(\overline{p})\), or equivalently, minimize the negative entropy of \(\overline{p}\). Thus, the overall objective to be minimized when training the encoder parameters \(\theta \) and prototypes \(\textbf{q}\) is

$$\begin{aligned} \frac{1}{MB} \sum ^B_{i=1}\sum ^M_{m=1} H(p_{i,m}, p^+_i) - \lambda H(\overline{p}), \end{aligned}$$
(1)

where \(\lambda > 0\) controls the weight of the me-max regularization. Note that when training, we only compute gradients with respect to the anchor predictions \(p_{i,m}\), not the target predictions \(p^+_i\).
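For concreteness, Eq. (1) can be implemented in a few lines. The sketch below assumes the anchor and target predictions have already been computed as above (with the target prediction of each image repeated for its M anchor views) and that the targets are detached; the value of `lam` is illustrative.

```python
import torch

def msn_objective(anchor_preds, target_preds, lam=1.0, eps=1e-8):
    """Cross-entropy to the target predictions plus the me-max regularizer.

    anchor_preds, target_preds: (M*B, K) rows on the simplex.
    """
    # H(p_{i,m}, p_i^+): cross-entropy of the anchor prediction w.r.t. the target.
    ce = -(target_preds * torch.log(anchor_preds + eps)).sum(dim=-1).mean()
    # me-max: maximize H(p_bar), i.e., add the negative entropy of the mean prediction.
    p_bar = anchor_preds.mean(dim=0)
    neg_entropy = (p_bar * torch.log(p_bar + eps)).sum()
    return ce + lam * neg_entropy
```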

4 Related Work

Unsupervised pre-training for vision has seen rapid progress with the development of view-invariant representation learning and joint embedding architectures [6, 11, 15, 25, 28, 53]. Most similar to our approach is DINO [11] which leverages a Siamese Network with a cross-entropy loss and a momentum encoder. DINO also uses multi-crop training, which is a form of focal masking, but it requires an unmasked anchor view during training. MSN can be seen as a generalization of DINO, leveraging both random and focal masking without requiring any unmasked anchor views. Since the cross-entropy loss in Eq. (1) is only differentiated with respect to the anchor predictions, not the target, MSN only backpropagates through the anchor network and only needs to store the activations associated with the masked views. MSN therefore reduces the computational and memory requirements. MSN also differs from DINO in its mechanism for preventing representation collapse (entropy maximization as opposed to centering and sharpening). Our empirical results show that MSN compares favourably to DINO across various degrees of supervision for the downstream task.

A prominent line of work in SSL is to remove a portion of the input and learn to reconstruct the removed content [18]. For example, in the field of image recognition, some works have proposed to predict augmented image channels [60], which can be regarded as a form of image colorization [34, 35, 59]. Other approaches propose to remove and learn to regress entire image regions: the seminal Context Encoders of Pathak et al. [43] train a network to generate missing image patches based on their surroundings. Recent works revisit this idea and investigate the pre-training of ViTs with masked auto-encoders [5, 12, 27, 52, 55]. These approaches corrupt images with mask-noise and predict missing input values at the pixel level [21, 27, 54] or using a tokenizer [5, 52]. Our approach does not predict the missing value at the input level, but instead performs the denoising step implicitly by ensuring that the global representation of the noisy input matches that of the uncorrupted input.

Some recent approaches have started to explore the combination of joint-embedding architectures and denoising pre-training tasks [3, 23, 61]. Those approaches mask an image by replacing the masked patches with a learnable mask token, and output a single vector for each masked patch. The objective is then to directly match each computed patch vector to the equivalent patch token extracted from a target encoder. Different from these approaches, we only match the view representations globally and do not consider a patch level loss. Consequently, we can completely ignore the masked patches, significantly reducing the computational and memory requirements. For example, when training our largest model, a ViT-L/7, we mask over 70% of the input patches, and reduce memory and computational overhead by half.

Table 1. Extreme low-shot. We evaluate the label-efficiency of self-supervised models pretrained on the ImageNet-1K dataset. For evaluation, we use an extremely small number of the ImageNet-1K labels and report the mean top-1 accuracy and standard deviation across 3 random splits of the data.

5 Results

We evaluate MSN representations learned on the ImageNet-1K dataset [44]. We first consider low-shot evaluation on ImageNet-1K using as few as 1–5 images per class. We also compare with the state-of-the-art in settings where more supervision is available and investigate transfer-learning performance. Finally, we conduct ablation experiments with MSN. By default, we pre-train with a batch-size of 1024 images, generating several anchor views from each image: 1 view with a random mask, and 10 views with focal masks. We find that the optimal masking ratio is model-dependent, with larger models benefiting from more aggressive patch dropping. We describe MSN implementation details in Appendix C.

5.1 Label-Efficient Learning

The premise of SSL is to learn representations on unlabeled data that can be effectively applied to prediction tasks with few labels [14]. In this section we explore the performance of self-supervised approaches when very few labeled examples are available.

Table 2. Low-shot evaluation on ImageNet-1K using 1% of the labels (approximately 13 images per class). \(^\dagger \)Indicates evaluations we computed using publicly available models.

Extreme Low-Shot. We first evaluate the classification performance of unsupervised models that have been pre-trained on ImageNet-1K, using 1, 2, and 5 labeled images per class for supervised evaluation. We compare MSN to the joint-embedding approach DINO [11], the auto-encoding approach MAE [27], and the hybrid approach iBOT [61], which combines a joint-embedding architecture with a token-based patch-level loss. We download the officially released models of each approach for evaluation.

To adapt the joint-embedding models to the supervised task, we freeze the weights of the pre-trained model and train a linear classifier on top using 1, 2, or 5 labeled samples per class (see Appendix C). For MAE, we rely on partial fine-tuning [27], except in the 1 image per class setting and for all results with the ViT-H/14 architecture, where we use a linear classifier. Partial fine-tuning corresponds to fine-tuning the last block of the pre-trained model along with a linear head. MAE benefits from partial fine-tuning, but for sufficiently large models, such as the ViT-H/14, this leads to significant overfitting in the low-shot regime. We compare both protocols in more detail in Appendix E.
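As a rough sketch of this adaptation protocol, partial fine-tuning can be set up by unfreezing only the last Transformer block of the pre-trained encoder and attaching a fresh linear head. The attribute name `blocks` below is an assumption borrowed from common ViT implementations, not a reference to the official code.

```python
import torch.nn as nn

def setup_partial_finetuning(vit_encoder, embed_dim=768, num_classes=1000):
    """Freeze the encoder except its last Transformer block; add a linear head."""
    for p in vit_encoder.parameters():
        p.requires_grad = False
    for p in vit_encoder.blocks[-1].parameters():  # assumed attribute name
        p.requires_grad = True
    head = nn.Linear(embed_dim, num_classes)       # trained from scratch
    return vit_encoder, head
```

The linear-probe protocol is the same, except that no encoder blocks are unfrozen and only the head is trained.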

Table 1 reports the extreme low-shot evaluation results. MSN outperforms the other representation learning approaches across all levels of supervision. Moreover, the improvement offered by MSN increases as the amount of available labeled data is decreased. The performance of MSN also benefits from increased model size—settings with less labeled data appear to benefit more from increased model depth and smaller patch sizes.

Table 3. Linear evaluation on ImageNet-1K using 100% of the labels.

We also observe that joint-embedding approaches appear to be more robust to the limited availability of downstream supervision than reconstruction-based auto-encoding approaches. To explain this observation, we refer to the Masked Auto-Encoders paper [27] which conjectures that using a pixel reconstruction loss results in encoder representations of a lower semantic level than other methods. Conversely, the inductive bias introduced by invariance-based pre-training appears to be helpful in the low-shot regime.

1% ImageNet-1K. Table 2 reports a comparison on the 1% ImageNet-1K task, which is a standard benchmark for low-shot evaluation of self-supervised models [13]. For reference, the best reported result in the literature on 1% labeled data is 76.6%, achieved with a multi-stage semi-supervised pipeline, i.e., self-distilling from a fine-tuned ResNet-152 with 3\(\times \) wider channels and selective kernels [14]. Here we focus on comparing to other models trained in a self-supervised setting. Our best MSN model using a ViT-L/7 achieves 75.1% top-1 accuracy, surpassing the previous 800M parameter state-of-the-art convolutional network [14] while using significantly fewer parameters and no fine-tuning. When focusing the comparison on similar architectures (models with similar FLOP counts), MSN also consistently improves upon previous approaches.

5.2 Linear Evaluation and Fine-Tuning

In this section we compare with the state-of-the-art on standard evaluation benchmarks where more supervised samples are available to adapt the representation. We use the full ImageNet-1K training images with 1.28M labels.

Table 4. End-to-end fine-tuning of a ViT-B/16 encoder on ImageNet-1K using 100% of the labels. MSN obtains competitive performance with both joint-embedding approaches and auto-encoding approaches.

Linear Evaluation. We evaluate self-supervised pretrained models by freezing their weights and training a linear classifier. Table 3 reports the linear evaluation results on ImageNet-1K. We observe that MSN performs competitively with the state-of-the-art. The best MSN model achieves 80.7% top-1 accuracy.

Fine-Tuning. In this evaluation setting, we fine-tune all the weights of the self-supervised model using all the labels from the ImageNet-1K training set. We focus on the ViT-B/16 architecture. We adopt the same fine-tuning protocol as [5], and provide the details in Appendix C. Table 4 reports the fine-tuning evaluation using 100% of the labels on ImageNet-1K. MSN is competitive with joint-embedding approaches, such as DINO, and generative auto-encoding approaches, such as MAE.

5.3 Transfer Learning

We also report transfer learning experiments on the CIFAR10, CIFAR100 and iNaturalist datasets in Table 5 when using a self-supervised ViT-B/16 pre-trained on ImageNet-1K. Across all tasks, various levels of supervision, and evaluation methods, MSN either outperforms or achieves similar results to DINO pre-training. Recall that MSN pre-training is also less computationally expensive than DINO pre-training due to the anchor masking.

Table 5. Transfer Learning with a ViT-Base/16 pre-trained on ImageNet-1K. Across all tasks, various levels of supervision, and evaluation methods, MSN either outperforms or achieves similar results to DINO pre-training. The MSN model is trained with a masking ratio of 0.3; i.e., dropping 30% of patches, and thus reduces the computational cost of pre-training relative to DINO.

5.4 Ablations

We now conduct a series of experiments to gain insights into the important design decisions used in MSN such as the masking strategy and the data augmentation strategy. We measure the accuracy of the models by training a logistic regression classifier on the frozen trunk using 1% of ImageNet-1K labels (\(\sim \)13 imgs/class).
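A minimal sketch of this evaluation protocol is shown below: frozen [CLS] features are extracted for the labeled subset and a logistic-regression classifier is fit on top. The scikit-learn solver and hyper-parameters are illustrative stand-ins, not the exact setup used for the reported numbers.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(encoder, loader, device="cuda"):
    """Compute frozen representations for a labeled data loader (sketch)."""
    encoder.eval()
    feats, labels = [], []
    for images, targets in loader:
        feats.append(encoder(images.to(device)).cpu().numpy())
        labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# X_train, y_train = extract_features(encoder, loader_1pct)  # ~13 imgs/class
# X_val, y_val = extract_features(encoder, val_loader)
# probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# top1 = probe.score(X_val, y_val)
```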

Combining Random and Focal Masking. In MSN we apply both random and focal masking to the anchor views. Focal masking corresponds to selecting a small crop from the anchor view. Random masking corresponds to randomly dropping potentially non-contiguous patches from the anchor view.

Table 6. Masking strategy. Impact of masking strategy on low-shot accuracy (1% of ImageNet-1K labels) of a ViT-B/16. We only generate one anchor view of each image, except in the last row, where we generate two views, one with a Random Mask and one with a Focal Mask. A random masking ratio of 0.5 is used. Applying a random mask to the anchor view is better than applying no mask. By combining both random and focal masking strategies, we obtain the strongest performance.

Table 6 reports the effect on low-shot evaluation when using a) No Masking, b) Focal Masking, c) Random Masking, or d) Random and Focal Masking. Applying a random mask to the anchor view is always better than applying no mask. By contrast, applying only a focal mask degrades the performance, which highlights the importance of maintaining a global view during pre-training. By combining both random and focal masking strategies, we obtain the strongest performance.

Random Masking Ratio. Here we explore the relationship between the optimal masking ratio and the model size. Table 7 reports the low-shot learning performance for various random masking ratios as we increase the model size.

Table 7. Masking ratio. Impact of pre-training random masking ratio (fraction of randomly dropped patches in each random mask) on ImageNet 1% accuracy. Accuracy of larger models improves when leveraging aggressive masking during pre-training.

When increasing the model size, we find that increasing the masking ratio (dropping more patches) is helpful for improving low-shot performance. We also find that the ViT-L/16 runs with weak masking are unstable, while the runs with more aggressive masking are quite stable. However, we do not have sufficient evidence to claim that increasing the masking ratio always improves the stability of large ViT pre-training.

Augmentation Invariance and Low-Shot Learning. We explore the importance of data-augmentation invariance for low-shot learning. We pre-train a ViT-B/16 with MSN, where the target and anchor networks either share the input image view or use different input views; in both cases, the anchor view is always masked. The views are constructed by applying random ColorJitter, Crop, Horizontal Flips, and GaussianBlur to the input image.

Table 8 reports top-1 accuracy when evaluating with 1% of ImageNet-1K labels. Sharing the view leads to a top-1 accuracy of \(7\%\); MSN finds a shortcut solution relying on color statistics. Using different colors in the input views resolves this pathological behaviour and achieves a top-1 of \(48.3\%\). Further applying the geometric data-augmentations independently to the two views (as opposed to sharing views) further improves the performance to \(52.3\%\), showing the importance of learning view-invariant representations in the low-shot setting.

Random Masking Compute and Memory. We look at the effect of the random masking ratio, i.e., the fraction of dropped patches from the global anchor view, on the computational requirements of large model pre-training. In each iteration we also generate 10 focal views (small crops) of each input image; the random masking ratio has no impact on these views.

Table 8. Impact of view-sharing during pre-training when evaluating on ImageNet 1%. The target view is constructed by applying random ColorJitter, Crop, Horizontal Flips, and GaussianBlur to the input image. When using the same image view, MSN finds a shortcut solution. Using color jitter prevents this pathological behaviour. Randomly applying additional geometric data transformations to the anchor further improves performance, demonstrating the importance of view invariance in the low-shot setting.

Table 9. Impact of random masking ratio on GPU memory usage and runtime when pre-training a ViT-L/7. Measurements are conducted on a single AWS p4d-24xlarge machine, containing 8 A100 GPUs, using a batch-size of 2 images per GPU. In each iteration we also generate 10 focal views (small crops) of each input image; the random masking ratio has no impact on these views. Using more aggressive masking of the global view progressively reduces device memory utilization and speeds up training.

Table 9 reports the memory consumption and throughput (imgs/s) of a ViT-L/7 model on a single AWS p4d-24xlarge machine using a batch-size of 2 images per GPU. As expected, using more aggressive masking of the global view progressively reduces device memory utilization and speeds up training. For example, by randomly masking 70% of the patches, we can use MSN to pre-train a full-precision ViT-Large with a patch-size of \(7\times 7\) on as few as 18 AWS p4d-24xlarge machines. Without masking, the same job requires over 42 machines when using the default batch-size of 1024 images.

6 Conclusion

We propose Masked Siamese Networks (MSNs), a self-supervised learning framework that leverages the idea of mask-denoising while avoiding pixel and token-level reconstruction. We demonstrate empirically that MSNs learn strong off-the-shelf representations that excel at label-efficient learning, while simultaneously improving the scalability of joint-embedding architectures. Because MSN relies on view-invariant representation learning, it requires the specification of data transformations, and the optimal transformations and invariances may be dataset- and task-dependent. In future work, we plan to explore more flexible mechanisms for learning those transformations, as well as the use of equivariant representations.