
1 Introduction

Self-supervised learning proposes to leverage large amounts of unlabeled data to solve complex visual tasks. Early attempts hand-designed pretext tasks, which required some semantic understanding of images and their layout to solve [21, 60, 64, 89]. Contrastive learning departed from this tradition by radically simplifying the self-supervised protocol, in that the pretext task is specified by the data itself: representations must learn to distinguish a given example from the others in the dataset [25, 36, 62, 81]. Modern instances of the contrastive framework have proven to be very powerful, leading to strong performance on a variety of downstream tasks [13, 39, 43]. More recent self-supervised methods have simplified the framework further, removing the need for negative samples [35], bespoke architectural components [15], and learning dynamics [87], suggesting that increasingly domain-agnostic and data-driven methods might enable learning from ever-larger and more general sources of data.

However, a parallel line of work has asked whether the current self-supervised paradigm—which maximizes the similarity of the same data-point under different views—is too simple. By treating data-points as monolithic instances, these methods overlook the complexity of real-world data: natural scenes are composed of many objects, natural speech of multiple speakers, and natural videos of many scenes. Ignoring this variability and encouraging models to represent different parts of an image in a similar manner risks dampening their selectivity for objects, their relationships, and layouts in real-world scenes. Indeed, several works have demonstrated the benefits of properly handling this variability when learning task-relevant representations [42, 70, 76, 77, 79]. While such object- and segmentation-aware approaches have yielded impressive empirical gains, they have relied on more domain-specific prior knowledge to expose the structure of the data—for example by using hand-crafted segmentation algorithms [42], or salience estimation networks trained on human annotations [77]—bounding how much they can learn from the data and what data they can be used on.

In this work we ask whether this knowledge can instead be derived from the data itself. To do so, we propose to couple two learning processes: object discovery and object representation. We use object discovery to uncover the structure of individual data-points, allowing the self-supervised task to focus on learning invariant representations of object-level instances. In turn, we use the resulting object representations as features for unsupervised object discovery, which feeds back into the representation learning process. These object discovery and representation networks thus engage in a virtuous cycle of representation and segmentation quality: better representations lead to better segmentations, and vice versa. Crucially, we derive the unsupervised segmentations with no prior knowledge about image structure or content, using a simple k-means clustering of local features to partition each image. We thus open the possibility of applying the algorithm to different domains, modalities, and their combination.

We make the following contributions: 1) Our object discovery networks uncover, in an entirely self-supervised manner and without any prior knowledge of image structure or segmentation, meaningful decompositions of real-world scenes. 2) Our object representation networks lead to state-of-the-art results in transfer learning to object detection and instance segmentation on COCO, and semantic segmentation on PASCAL and Cityscapes, surpassing prior works which exploit segmentation and saliency information, without requiring this prior knowledge. 3) Our object representation networks seamlessly generalize to video understanding, surpassing supervised pre-training for video object segmentation on DAVIS. Finally, we test the resilience of our method by varying its essential components, and find it to be very robust, supporting further computational benefits. Together these results suggest that knowledge of scene structure, and the benefits it confers in representing objects, can—with the right learning paradigm—be extracted from the data itself.

2 Related Work

Pre-Contrastive Self-supervised Learning: Hand-Designed Tasks. Early self-supervised approaches focused on injecting human expertise and intuition into the design of proxy tasks for pretraining. For example, to stimulate the network to learn object parts, [21] designed the task of predicting the spatial arrangement between local image patches. A rich collection of such intuitions and objectives was further developed, ranging from pixel-wise reconstruction-based approaches, such as denoising [78], inpainting [64], colorization [49, 89], and more [23, 90], to higher-level pretext tasks, such as predicting spatial layouts [21, 59, 60], orientation [30], egomotion [1], and temporal ordering [58].

Contrastive Learning and its Variants. Instance discrimination [25] has proven to be a very powerful pretext task which, we argue, owes its superior performance to being minimally hand-designed and maximally data-driven. By minimizing a contrastive loss [36, 62], the similarity of a representation across different ‘views’ of the same image is maximized, while minimizing their similarity with distracting negative samples. Multiple views of a single data-point can naturally be extracted from multimodal or multisensory data [2, 48, 56, 63, 68, 72] while for a single image-only modality they are typically constructed via local and global cropping [5, 43, 44, 62] or data-augmentation [13, 22, 25, 39, 81]. Positive pairs then correspond to views of the same data point, while negatives are sampled views of different data-points (typically from the same mini-batch), although the need for negative samples has recently been questioned [15, 35, 87].

Baking Prior Knowledge Back into Self-supervised Learning. A growing body of research has brought hand-designed supervisory signals back into the self-supervised paradigm. For example, [42, 77, 79, 82, 88] decompose input images into their constituent objects and regions of interest using supervised segmentation algorithms, or hand-crafted heuristics. Object-level features are then computed for each region, and optimized using a contrastive objective. Other approaches use object-agnostic learning objectives, but integrate knowledge from segmentation heuristics or models in their augmentation strategies [57, 76, 92].

This trend is reflected in the broader research in self-supervised learning for other modalities. For example, [50] uses domain-specific knowledge to improve the masking strategies of BERT and other masked-language models. [37, 85] leverage motion and flow information to improve learning from video. And similar to previously described works in vision, [61] uses a segmentation step prior to applying SSL on point clouds. In all cases, we aim to remove the dependency on such prior knowledge while retaining its benefits for representation learning.

Clustering and Representation Learning. In parallel to the advent of contrastive methods, clustering-based representation learning methods have seen similar success, particularly in harnessing large amounts of uncurated images for transfer learning [4, 9, 10, 11, 33, 45]. Although they differ in their formulation of the self-supervised objective, these works also treat entire images as monolithic entities.

In contrast, IIC [45] performs within-image clustering using similarity losses counterbalanced by information maximization, obtaining compelling results in unsupervised segmentation. PiCIE [17] improves on this approach by imposing carefully-chosen, modality-specific geometric data augmentations and corresponding invariance and equivariance constraints. However, neither of these works explicitly leverages its unsupervised segmentations for transfer learning across datasets and tasks, which we investigate here.

Object Discovery. Recent years have seen a growing interest in developing generative models that perform object discovery. By introducing different inductive biases such as mixture-model likelihoods [8, 34], attention [54, 93] and specific forms of factorization [46, 47], such models are able to discover objects and their interactions [20, 32, 71]. While much progress has been made, models from this family have yet to be demonstrated to work on natural images [34] and their application has been limited to synthetic data and simple highly structured environments typically used in robotics. Here we investigate object discovery on natural images in the wild, leveraging contrastive representation learning to enable this with simple k-means clustering.

3 Method

3.1 Self-supervised Learning with Odin

Our method learns two sets of networks which work in collaboration. The object discovery network produces feature maps from high-resolution images. These feature maps are then spatially clustered to produce a segmentation of the image. The object representation networks learn better features via a contrastive loss which uses the masks proposed by the object discovery network. The resulting improved features are then used by the object discovery network to create better segmentations, and this process is continuously repeated. Figure 1 illustrates the full method, which we detail below.

Fig. 1. Object discovery and representation networks. The object discovery network takes as input a cropped but otherwise un-augmented view of the image, and parses it using k-means clustering on its representation of it. The resulting segmentation is mapped into two augmented views of the same image, such that the masks are aligned across views and with the underlying image. The object representation networks take as input the augmented views of the image, and are trained using a self-supervised objective based on features pooled within each mask. The object discovery network is regularly updated with the parameters of the object representation network.

Object Discovery Network: From Representations to Segmentations. Given an image \({\boldsymbol{x}}\), we compute a spanning view \({\boldsymbol{v}}^0\) which encompasses most of the area of the image (Fig. 1, spanning view, defined below) and which is simply cropped and resized. We use a feature extractor \(f_\tau \) to encode this view into a spatial map of hidden vectors \({\boldsymbol{h}}^0 = f_\tau ({\boldsymbol{v}}^0)\) and projections \({\boldsymbol{z}}^0 = g_\tau ({\boldsymbol{h}}^0)\), where \(g_\tau \) is a two-layer MLP which is applied to each vector independently, and \(\tau \) are the parameters of the object discovery network. We apply K-means clustering to the spatial map of features \({\boldsymbol{h}}^0\) or \({\boldsymbol{z}}^0\), segmenting it (independently across images) into K non-overlapping binary masks \({\boldsymbol{m}}^{k,0}\) (Fig. 1, top row).
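To make the discovery step concrete, the following is a minimal sketch of how a spatial feature map might be partitioned into K non-overlapping binary masks with k-means; the array shapes, the use of scikit-learn, and the function names are illustrative rather than the paper's implementation.

```python
# Minimal sketch of the object discovery step: cluster the spanning view's
# feature map into K non-overlapping binary masks (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def discover_masks(feature_map: np.ndarray, num_segments: int) -> np.ndarray:
    """Cluster an (H, W, C) feature map into `num_segments` binary masks.

    Returns an array of shape (num_segments, H, W) with {0, 1} entries.
    """
    h, w, c = feature_map.shape
    flat = feature_map.reshape(h * w, c)
    labels = KMeans(n_clusters=num_segments, n_init=10).fit_predict(flat)
    labels = labels.reshape(h, w)
    return np.stack([(labels == k).astype(np.float32)
                     for k in range(num_segments)])

# Example: a random 56x56 feature map segmented into K = 8 regions.
masks = discover_masks(np.random.randn(56, 56, 256).astype(np.float32), 8)
assert masks.sum(axis=0).max() == 1.0  # each location belongs to exactly one mask
```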

Object Representation Networks: From Segmentations to Representations. We produce two views \({\boldsymbol{v}}^1\) and \({\boldsymbol{v}}^2\) of the image by augmenting \({\boldsymbol{x}}\) twice, using the random preprocessing pipeline of BYOL [35], which includes random cropping, flipping, blurring, and point-wise color transformations (Fig. 1, augmented views and appendix).

The spanning view \({\boldsymbol{v}}^0\) is chosen as the smallest crop which spans the spatial extent of the augmented views \({\boldsymbol{v}}^1\) and \({\boldsymbol{v}}^2\). We can therefore obtain two sets of masks \({\boldsymbol{m}}^{k,1}, {\boldsymbol{m}}^{k,2}\) which are consistent with each other and aligned with the underlying image content, by simply cropping, flipping, and resizing each mask \({\boldsymbol{m}}^{k,0}\) as necessary (Fig. 1, right). Despite the significant differences in appearance across views, these masks contain the same underlying image content (up to differences in cropping), which we leverage in our objective.
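The mask-alignment step can be illustrated as follows: given the crop parameters of an augmented view relative to the spanning view, each mask is cropped, flipped if the view was flipped, and resized by nearest-neighbour sampling. The fractional crop-box format and function name below are assumptions made for the example.

```python
# Sketch of mapping one spanning-view mask into an augmented view's frame.
import numpy as np

def align_mask(mask: np.ndarray, crop_box, flip: bool, out_hw) -> np.ndarray:
    """mask: (H, W) binary; crop_box: fractional (y0, x0, y1, x1) of the
    augmented view within the spanning view; out_hw: output (height, width)."""
    h, w = mask.shape
    y0, x0, y1, x1 = crop_box
    ys = np.clip((np.linspace(y0, y1, out_hw[0]) * (h - 1)).round().astype(int), 0, h - 1)
    xs = np.clip((np.linspace(x0, x1, out_hw[1]) * (w - 1)).round().astype(int), 0, w - 1)
    cropped = mask[np.ix_(ys, xs)]          # crop + nearest-neighbour resize
    return cropped[:, ::-1] if flip else cropped
```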

Each augmented view \({\boldsymbol{v}}^l \in \{{\boldsymbol{v}}^1, {\boldsymbol{v}}^2\}\) is encoded with a feature extractor \(f_\theta \) into a spatial map of hidden vectors: \({\boldsymbol{h}}^l_\theta = f_\theta ({\boldsymbol{v}}^l)\) where \(\theta \) are the parameters of the object representation network being optimized. For every mask \({\boldsymbol{m}}^{k,l}\) in the image, we compute a mask-pooled hidden vector

$$\begin{aligned} {\boldsymbol{h}}^{k,l}_\theta = \frac{1}{\sum _{i,j} {\boldsymbol{m}}^{k,l}[i,j] } \sum _{i,j} {\boldsymbol{m}}^{k,l}[i,j] \ {\boldsymbol{h}}_\theta ^l[i,j], \end{aligned}$$
(1)

discarding masks that are empty (due to cropping). Our goal is to ensure that these object-level features are roughly invariant across views. Specifically, we wish for an object-level feature in one view to be predictive of the same image content in the other view. To that end we transform the object-level hidden vectors \({\boldsymbol{h}}^{k,l}_\theta \) with two-layer MLPs \(g_\theta \) and \(q_\theta \), yielding non-linear projections \({\boldsymbol{z}}_\theta ^{k,l} = g_\theta ( {\boldsymbol{h}}_\theta ^{k,l} )\) and predictions \(q_\theta ( {\boldsymbol{z}}_\theta ^{k,l} )\). In theory, we could regress the prediction \(q_\theta ( {\boldsymbol{z}}_\theta ^{k,1} )\) directly onto its target \({\boldsymbol{z}}_\theta ^{k,2}\); however, it is helpful to stabilize these targets by encoding them instead with specific target networks \(g_\xi \) and \(f_\xi \), where the parameters \(\xi \) vary more slowly [16, 35, 42, 73]. We therefore instead use the projections \({\boldsymbol{z}}_\xi ^{k,l} = g_\xi ( {\boldsymbol{h}}_\xi ^{k,l} )\) as targets for the online prediction networks.
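The mask-pooling of Eq. (1) amounts to a masked average over spatial locations; a minimal numpy rendering, with empty masks discarded, is given below (shapes and names are illustrative).

```python
# Eq. (1) as a masked spatial average, dropping masks that are empty.
import numpy as np

def mask_pooled_features(features: np.ndarray, masks: np.ndarray):
    """features: (H, W, C) hidden vectors; masks: (K, H, W) binary masks.
    Returns the (K', C) pooled vectors and the indices of the kept masks."""
    areas = masks.sum(axis=(1, 2))                     # (K,)
    keep = np.flatnonzero(areas > 0)                   # discard empty masks
    pooled = np.einsum('khw,hwc->kc', masks[keep], features)
    return pooled / areas[keep, None], keep
```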

Jointly Learning to Discover and Represent Objects. Given a set of masks which approximately segment an image into objects, we wish to learn representations which distinguish these objects, while being invariant to identity-preserving transformations. Contrastive learning provides a straightforward objective for achieving this. Specifically, contrastive detection [42] trains a network to recognize object-level features across views, in the presence of many distracting “negative” features from other objects. The resulting objective maximizes the similarity between different views of the same object, while minimizing the similarity between different objects. We define the similarity between object-level features across views as

$$\begin{aligned} s_k^{1 \rightarrow 2} = \frac{1}{\alpha } \frac{\langle q_\theta ( {\boldsymbol{z}}_\theta ^{k,1}), {\boldsymbol{z}}_\xi ^{k,2} \rangle }{ \Vert q_\theta ( {\boldsymbol{z}}_\theta ^{k,1}) \Vert \Vert {\boldsymbol{z}}_\xi ^{k,2} \Vert } \end{aligned}$$
(2)

where \(\alpha \) is a temperature hyper-parameter. We define the similarity between an object-level feature and a distracting negative sample \(s_k^{1 \rightarrow n}\) analogously, by replacing the paired feature \({\boldsymbol{z}}_\xi ^{k,2}\) with one from a different mask in the same image, or a different image altogether. The contrastive loss function for an individual feature is then

$$\begin{aligned} \ell _k^{1 \rightarrow 2}(\theta ; \xi , \tau ) = - \log \frac{\exp ( s_k^{1 \rightarrow 2} ) }{\exp ( s_k^{1 \rightarrow 2} ) + \sum _n \exp ( s_k^{1 \rightarrow n} )}, \end{aligned}$$
(3)

which we sum across objects, views, and images in the mini-batch (summation across images not shown for clarity)

$$\begin{aligned} \mathcal {L}(\theta ; \xi , \tau ) = \frac{1}{K} \sum _{k=1}^K \ell _k^{1 \rightarrow 2}(\theta ; \xi , \tau ) + \ell _k^{2 \rightarrow 1}(\theta ; \xi , \tau ). \end{aligned}$$
(4)
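For concreteness, the sketch below evaluates Eqs. (2)–(4) for a single pair of views of one image, using the other masks' targets as negatives; in the full method, negatives are also drawn from other images in the mini-batch and the loss is summed over both view directions. The temperature value is a placeholder.

```python
# Illustrative numpy version of the contrastive detection loss, one direction
# (view 1 -> view 2) for a single image.
import numpy as np

def contrastive_detection_loss(pred_1: np.ndarray,
                               target_2: np.ndarray,
                               alpha: float = 0.1) -> float:
    """pred_1: (K, D) predictions q(z^{k,1}); target_2: (K, D) targets z^{k,2}."""
    p = pred_1 / np.linalg.norm(pred_1, axis=1, keepdims=True)
    t = target_2 / np.linalg.norm(target_2, axis=1, keepdims=True)
    sims = (p @ t.T) / alpha          # (K, K); entry [k, n] is s_k^{1->n}
    # Eq. (3): softmax cross-entropy with the matching mask (diagonal) as the
    # positive; Eq. (4) adds the symmetric 2 -> 1 term and averages over masks.
    logsumexp = np.log(np.exp(sims).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(sims)))
```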

We optimize the object discovery and representation networks using a strategy inspired by BYOL [35]. One object representation network (the online network with parameters \(\theta \)) is updated with gradients from the contrastive loss. The second object representation network (the target network with parameters \(\xi \)) and the object discovery network (with parameters \(\tau \)) are updated using an exponential moving average of the online network:

$$\begin{aligned} \theta&\leftarrow \text {optimizer}(\theta , \nabla _\theta \mathcal {L}(\theta ; \xi , \tau ), \lambda _\theta ) \end{aligned}$$
(5)
$$\begin{aligned} \xi&\leftarrow (1 - \lambda _\xi ) \xi + \lambda _\xi \theta \end{aligned}$$
(6)
$$\begin{aligned} \tau&\leftarrow (1 - \lambda _\tau ) \tau + \lambda _\tau \theta , \end{aligned}$$
(7)

where the optimizer is LARS [86], and \(\lambda _\theta \), \(\lambda _\xi \), \(\lambda _\tau \) are learning rates for the online, target, and discovery networks respectively. We adopt the learning rates for the online and target networks from BYOL without modification. For the object discovery network, we consider two schedules: a constant learning rate which continuously updates the object discovery network with the online one (e.g. \(\lambda _\tau = 10^{-3}\)), and a small number of discrete updates which copy the online representation network into the object discovery network (e.g. \(\lambda _\tau = 1\) every 100 epochs and \(\lambda _\tau = 0\) otherwise). The advantage of the second scheme is computational: if the object discovery network does not change between updates, the segments for training the object representation networks can be cached, removing the need to evaluate the object discovery network at every iteration.
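The sketch below spells out the three updates of Eqs. (5)–(7) and the two discovery-network schedules, with parameters stored as dictionaries of arrays; the optimizer interface and learning-rate values are placeholders rather than the exact configuration.

```python
# Parameter updates for the online (theta), target (xi) and discovery (tau)
# networks; `optimizer.apply` stands in for a LARS step.
def ema_update(slow: dict, fast: dict, rate: float) -> None:
    """slow <- (1 - rate) * slow + rate * fast, applied in place."""
    for name in slow:
        slow[name] = (1.0 - rate) * slow[name] + rate * fast[name]

def update_networks(theta, xi, tau, grads, optimizer, epoch,
                    lambda_xi=4e-3, lambda_tau=1e-3,
                    discrete_updates=False, copy_every=100):
    optimizer.apply(theta, grads)            # Eq. (5): gradient step
    ema_update(xi, theta, lambda_xi)         # Eq. (6): slow-moving target
    if discrete_updates:
        # Discrete schedule: copy the online network every `copy_every` epochs
        # (lambda_tau = 1) and freeze the discovery network otherwise.
        if epoch % copy_every == 0:
            ema_update(tau, theta, 1.0)
    else:
        ema_update(tau, theta, lambda_tau)   # Eq. (7): constant schedule
```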

Pretraining Details. We train object discovery and representation networks (Odin) on ImageNet [69] for 1000 epochs, using a ResNet-50 [41] or Swin Transformer [53] backbone equipped with Feature Pyramid Networks (FPN; [51]) as the feature extractor f. The FPN takes as input the hierarchy of latent vectors output by the ResNet or Swin backbone, and progressively upsamples them while adding information from intermediate feature arrays, yielding high-level and high-resolution representations of the input image. We use the highest-resolution output of the FPN (subsampling the image by a factor of 4) as the array of hidden vectors \({\boldsymbol{h}}\).

After pretraining, we discard the target and object discovery networks, and use only the online object representation network for evaluation, facilitating the comparison to other methods which have their own means of learning the model.

3.2 Evaluating Object Discovery and Representation

Having trained object representation networks using the Odin framework, we evaluate the quality of their representations by fine-tuning them for object detection and instance segmentation on COCO, and semantic segmentation on PASCAL and Cityscapes. For consistency with prior work we retain only the pretrained backbone (ResNet or Swin transformer) for transfer learning, discarding feature pyramid networks and projection heads.

Object Detection and Instance Segmentation. For instance segmentation we use Mask-RCNN [40], while for object detection we report results for Mask-RCNN and FCOS\(^\star \). Both methods are equipped with feature pyramid networks [51] and cross-replica batch-norm [65]. For Mask-RCNN we adopt the Cloud TPU implementation [31] and use it without modification. FCOS\(^\star \) is our implementation of a single-stage detector based on FCOS [75], and improved with IoU prediction [80], ATSS [91] and T-Head [29]; full details are available in the appendix. We follow the common transfer setup and evaluate on COCO [52]: the pretrained network is used to initialize the backbone of a Mask-RCNN or FCOS\(^\star \) model, which is then fine-tuned on the train2017 set, and we report bounding-box AP (AP\(^\text {bb}\)) and mask AP (AP\(^\text {mk}\)) on the val2017 set. We use two standard training schedules: 12 epochs and 24 epochs [39].

Semantic Segmentation with FCN. Following [39] we initialize the backbone of a fully-convolutional network (FCN, [55]) with our model. For PASCAL [27], we fine-tune on the train_aug2012 set for 45 epochs and report the mean intersection over union (mIoU) on the val2012 set. For Cityscapes [19], we fine-tune on the train_fine set for 160 epochs and evaluate on the val_fine set.

Object Discovery on COCO. We wish to assess whether our representations uncover the structure of real-world scenes during self-supervised pretraining. Simply visualizing saliency maps induced by the model only weakly tests this ability, however [26]; we therefore use the COCO dataset, which comprises complex scenes with human-annotated object segments. Specifically, we evaluate models on COCO images, cluster their features, and measure the overlap between the resulting segments and human-annotated ones. Given the diversity of object scales in COCO, we run multiple K-means segmentations (for K in [1, 2, \(\dots \), 128]) on the same set of latents, resulting in 255 object proposals which we resize to the input image resolution.

For each ground-truth segment \({\boldsymbol{g}}_t\) we compute the overlap with all proposals \({\boldsymbol{m}}_k\) using their intersection-over-union (IoU), and record the “best overlap” by taking the maximum across proposals. Averaging this metric across ground-truth segments, we obtain the “average best overlap” (ABO) metric [3], and computing the fraction of “best overlaps” greater than 50% yields the “object recovery” metric [18]. We then average each of these metrics across images.
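A sketch of these metrics for a single image is given below; the proposals would come from clustering the same features at several values of K, and the function names are illustrative.

```python
# Average best overlap (ABO) and object recovery (OR) for one image.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def abo_and_recovery(gt_masks, proposals):
    """gt_masks, proposals: lists of (H, W) binary arrays for one image."""
    best = [max(iou(g, m) for m in proposals) for g in gt_masks]
    abo = float(np.mean(best))                           # average best overlap
    recovery = float(np.mean([b > 0.5 for b in best]))   # fraction recovered
    return abo, recovery
```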

Video Object Segmentation on DAVIS. As a further test of scene understanding, we assess whether learned representations can continue to recognize parts of an object as they evolve over time. Video object segmentation, specifically in its semi-supervised setting, captures this ability, which we evaluate on the DAVIS’17 benchmark [66]. Having evaluated a learned representation on a video independently across frames, we segment these features with nearest neighbor matching from frame to frame, given a segmentation of the first frame. In this way, the segmentation is propagated according to the similarity of the representation across space and time.
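The core matching step can be sketched as follows: each location in the current frame adopts the label of its most similar feature in the previous frame. The actual evaluation follows DINO's inference protocol (e.g. restricted neighbourhoods and multiple context frames), which this simplified version omits.

```python
# Nearest-neighbour label propagation between consecutive frames.
import numpy as np

def propagate_labels(prev_feats: np.ndarray, prev_labels: np.ndarray,
                     next_feats: np.ndarray) -> np.ndarray:
    """prev_feats, next_feats: (H*W, C) L2-normalised feature vectors;
    prev_labels: (H*W,) integer segment ids. Returns labels for the next frame."""
    sims = next_feats @ prev_feats.T      # cosine similarities between locations
    nearest = sims.argmax(axis=1)         # best match in the previous frame
    return prev_labels[nearest]
```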

4 Experiments

4.1 Transfer Learning

Our first goal is to assess whether strong transfer learning performance can be obtained without resorting to prior knowledge of scene segmentations. To that end we train a ResNet-50 on ImageNet for 1000 epochs using the proposed Odin framework, and transfer it to object detection and instance segmentation on COCO, and semantic segmentation on PASCAL and Cityscapes.

Table 1. Transfer to COCO object detection and instance segmentation with Mask-RCNN: all methods pretrain a ResNet-50 on ImageNet before fine-tuning on COCO with Mask-RCNN for 12 epochs (1\(\times \) schedule) or 24 epochs (2\(\times \) schedule). We report average precision on object detection (AP\(^\text {bb}\)) and instance segmentation (AP\(^\text {mk}\))

Object Detection and Instance Segmentation on COCO. Self-supervised learning has made steady gains on transfer learning from ImageNet to COCO, with a majority of methods surpassing supervised pretraining. The top-performing methods are ReLIC v2 and DetCon\(_B\) which make heavy use of saliency or segmentation information in their learning paradigm. DetCon uses the same learning objective as Odin, but relies on a hand-crafted image segmentation algorithm [28] applied to the pixel lattice rather than a learned object discovery network. ReLIC v2 does not use segmentation information explicitly in its objective, but uses a hand-crafted saliency network to separate objects from their background in the data-augmentation pipeline. Both represent a step-change in performance relative to previous methods. Odin, which instead derives segmentations from its own learned representations, surpasses both of these methods (Table 1).

A recent self-supervised method, DINO [12], reports high-quality unsupervised segmentations; however, it appears to do so at the cost of object representation. We fine-tune the publicly available ResNet checkpoint in our framework, and find it underperforms relative to simple methods such as BYOL. Other SSL methods, such as SwAV and DeepCluster-v2 [11], which cluster representations across images rather than within them, also underperform in this setting.

Table 2. Transfer to PASCAL and Cityscapes semantic segmentation with fully convolutional networks: all methods pretrain a ResNet-50 on ImageNet before fine-tuning for semantic segmentation on PASCAL or Cityscapes, and report the mean intersection-over-union

Semantic Segmentation on PASCAL and Cityscapes. We assess the generality of these results by transferring to two further datasets and tasks: semantic segmentation on PASCAL and Cityscapes. As with the transfer to COCO, DetCon and ReLIC v2 substantially outperform supervised and BYOL pretraining, confirming the utility of prior knowledge about segmentation and saliency. In this case as well, Odin successfully recovers this knowledge in a fully learned manner and surpasses both methods (Table 2).

In this setting DINO performs better, surpassing BYOL, possibly because semantic segmentation on PASCAL, which contains only 20 classes compared with 80 in COCO, weights object discovery more than object representation—isolating objects from the background rather than distinguishing object classes from each other. Nevertheless, Odin surpasses it as well, indicating that it achieves a better trade-off between object representation and discovery.

Transfer Learning with High-Performance Architectures. While Mask-RCNN has become a standard method for evaluating the quality of object-level representations, we asked whether the performance gains afforded by the Odin framework persisted with more sophisticated models. For this we turned to our FCOS\(^\star \) implementation, whose supervised baseline surpasses Mask-RCNN by 4.6% AP\(^\text {bb}\). In this setting as well, Odin surpasses the supervised baseline and DINO (+1.3% AP\(^\text {bb}\), Table 3, 1st column).

Table 3. Transfer to COCO object detection with FCOS\(^\star \): all methods pretrain on ImageNet before fine-tuning on COCO with FCOS\(^\star \) for 30 epochs, and report average precision on object detection (AP\(^\text {bb}\)).

Swin transformers appear to be a compelling candidate for general-purpose vision architectures, surpassing ResNets in a variety of tasks [53]. Despite the almost universal success of self-supervised pretraining in improving the transfer learning performance of ResNet architectures, similar results have yet to become widespread for Swin transformers.

We therefore pretrain Swin-T and Swin-S transformers on ImageNet using Odin, and transfer them to COCO object detection using FCOS\(^\star \). We also evaluate a pre-trained Moby checkpoint in the same setting. Moby pretraining marginally improves the performance of a Swin-T, whereas Odin furthers these gains (+1.8% AP\(^\text {bb}\), Table 3, 2nd column). The benefits of Odin pretraining are emphasized when pretraining and transferring the larger Swin-S backbone (+2.1% AP\(^\text {bb}\), Table 3, 3rd column).

We return to our original question of whether Odin has successfully recovered knowledge of scene structure by pretraining ResNet-50, Swin-T, and Swin-S with DetCon (which uses a hand-crafted segmentation algorithm instead of the object discovery network [28, 42]), and transferring them with FCOS\(^\star \). We find Odin to match or slightly surpass their performance, confirming our previous results.

4.2 Object Discovery in COCO

We have found thus far that Odin surpasses the transfer learning performance of state-of-the-art self-supervised methods which rely on prior knowledge of scene segmentations, suggesting it has derived this knowledge from the data itself. In this section, we directly evaluate the extent to which Odin has discovered objects in real-world scenes. We extract Odin features from COCO images, cluster them, and visualize the resulting segments (Fig. 2). Comparing unsupervised object proposals (last column) to human-annotated segments (2nd column) we see that Odin recovers a reasonable decomposition of real-world scenes: figures are separated from their background, small objects are isolated, and even different instances of the same class (such as cars, last row) are roughly separated from one-another. Failure modes, such as grouping multiple fruit—or two shirts—together, reflect the ambiguity of unsupervised object discovery.

Comparing these proposals to those obtained from a randomly-initialized network (3rd column) and an ImageNet-supervised one (4th column), we appreciate the benefits of learning with the Odin framework. Both of these networks make erroneous proposals, failing to delineate object boundaries, or lacking the coherence and locality of real-world objects. We quantify this difference by evaluating the average best overlap (ABO) and fraction of recovered objects (OR) of the segments derived from each network. Consistent with the qualitative results, Odin strongly surpasses both baselines in all metrics (Table 4, left).

We also evaluate the accuracy of a recently-proposed self-supervised method, DINO, which specializes in object discovery. In this challenging task of discovering multiple objects in an unsupervised setting, we find that it underperforms relative to Odin. We test Odin in two regimes, one using the ResNet and FPN used for pretraining, the other with the ResNet only. Although its performance degrades slightly with the lower-resolution ResNet, it continues to outperform all other methods in all metrics. In particular, Odin surpasses DetCon by a large margin (+7% ABO, +16% OR), indicating that it has discovered more relevant image structure than the hand-crafted segmentations used in DetCon.

Fig. 2. Object discovery with Odin. 1st column: original image, 2nd: human-annotated COCO segmentations, 3rd, 4th, 5th: segmentations obtained from k-means clustering on randomly initialized, ImageNet-supervised, and Odin-trained features, respectively.

Finally, we note that the DINO method was primarily designed for use with vision transformers [24]. We therefore train a ViT-B/8 (as in DINO) on ImageNet for 100 epochs using the Odin framework (all other parameters unchanged). In this setting we find Odin to achieve compelling results, surpassing the high-resolution ResNet-FPN, and a supervised and DINO-pretrained vision-transformer (Table 4, right). Figure A.1. illustrates that Odin seems particularly effective at discovering small objects and differentiating instances of the same class. In sum, Odin provides a powerful means of discovering objects in real-world scenes irrespective of the architecture used.

Table 4. Object discovery on COCO: all methods pretrain on ImageNet before evaluating object discovery on COCO in an unsupervised manner, reporting average best overlap of instance masks (ABO\(^i\)) and categorical masks (ABO\(^c\)), and average object recovery (OR). By default we retain only the pretrained ResNet-50 from Odin’s feature extractor, such that all methods are matched in their architecture. Odin\(^\dagger \) denotes the model equipped with the FPN used during training. ResNets use 1024 \(\times \) 1024 images for evaluation with a stride of 32, yielding a 32 \(\times \) 32 feature grid. ViTs use 448 \(\times \) 448 resolution with a patch size of 8, yielding 56 \(\times \) 56 feature grids.
Table 5. Video Object Segmentation on DAVIS’17: we evaluate representation quality for video segmentation by nearest neighbor inference. We report the standard region \(\mathcal {J}\) and contour \(\mathcal {F}\) metrics and their mean. All representations are trained on ImageNet then evaluated without fine-tuning on the 30 validation videos of DAVIS’17. \(\text {Odin}^{\dagger }\) includes a feature pyramid network to reduce the output stride from \(32\times \) to \(8\times \).

4.3 Video Object Segmentation

We evaluate on the DAVIS’17 benchmark [66] following the experimental setup and nonparametric inference method of DINO [12]. Given a video and a segmentation of its first frame, we propagate the segmentation between consecutive frames by nearest neighbor matching of the extracted representation. In Table 5 we evaluate random, supervised, and our self-supervised representations with the ResNet-50 architecture. This evaluation does not fine-tune or train on the DAVIS benchmark, and so accuracy is a measure of object representation, as the fixed representation must support segmentation of the novel objects in these held-out videos. Consistent with our previous results, Odin strongly surpasses supervised pre-training in all metrics.

4.4 Ablations and Analysis

What components are necessary for driving Odin’s ability to represent and discover objects? We systematically vary the two hyper-parameters governing the behavior of the object discovery network: the number of segments used for learning, and the schedule used for updating the network (Table 6). Starting with the number of segments K, we find that object discovery degrades substantially when using too coarse a segmentation (e.g. \(K = 8\)). However, given a fine enough segmentation (K greater than 16), its performance is stable.

Regarding the rate at which the object discovery network is updated, we find both schemes to be viable: continuously updating the network leads to slightly better representations, whereas discrete updates lead to slightly better object discovery. The advantage of the latter scheme is that the computational cost of the object discovery network becomes negligible, as it only needs to be evaluated every 100 epochs, with the resulting segmentations cached in between.

Table 6. Ablating the components of Odin: We use the variant of Odin equipped with FPN for object discovery. Transfer learning is performed with the ResNet backbone only. K denotes the number of segments obtained through K-means during pretraining

5 Conclusions

We have presented Odin, a new approach to self-supervised training which couples object discovery and representation learning. The resulting framework benefits from the same representation quality as methods which utilize explicit priors about image segmentation [42, 76], while deriving this knowledge from the data itself. The result is a simpler and more generally-applicable learning paradigm, and leads to state-of-the-art performance on a range of transfer learning tasks.

In this work, we have shown the utility of coupling representation learning and object discovery, for transfer learning and unsupervised scene understanding. Nevertheless, we have presented a single instance of this coupling, and there remain several open questions around how best to tie them together. This may require greater integration of the learning procedure and architecture—for example our self-supervised algorithm learns mask-pooled features which are different from those used in downstream tasks. The learning dynamics of Odin also warrant further investigation, as well as the objective used for representation learning. Recent work has revived interest in masked-autoencoding [7, 24, 38] and masked-distillation [6] as viable alternatives to contrastive learning. Odin, by proposing to leverage learned representations in the design of iteratively refined self-supervised tasks, is well positioned to benefit them as well.