
1 Introduction

Supervised deep learning has enabled great progress and achieved impressive results across a wide range of visual tasks, but it requires large annotated datasets for effective training. Building such fully-annotated datasets involves a significant effort in terms of data cleansing and manual labeling. This is especially true for fine-grained annotations, such as the pixel-level annotations needed for segmentation tasks, where the annotation cost per image is considerably high [5, 17]. This hurdle can be overcome with unsupervised learning, where unknown but useful patterns can be extracted from easily accessible unlabeled data. Recent advances in unsupervised learning [7, 22, 27, 36], which have closed the performance gap with supervised counterparts, make it a strong alternative.

Fig. 1. Overview. Given an encoder-decoder type network and two valid orderings \((o_1, o_2)\) as illustrated in (c), the goal is to maximize the Mutual Information (MI) between the two outputs over the different views, i.e., the different orderings. (a) For Autoregressive Clustering (AC), we output the cluster assignments in the form of a probability distribution over pixels, and the goal is to have similar assignments regardless of the applied ordering. (b) For Autoregressive Representation Learning (ARL), the objective is to have similar representations at each corresponding spatial location and its neighbors over a window of small displacements \(\Omega \).

Recent works are mainly interested in two objectives: unsupervised representation learning and clustering. Representation learning aims to learn semantic features that are useful for down-stream tasks, be it classification, regression or visualization. In clustering, the unlabeled data points are directly grouped into semantic classes. In both cases, recent works have shown the effectiveness of maximizing the Mutual Information (MI) between different views of the inputs to learn useful and transferable features [13, 22, 36, 41] or to discover clusters that accurately match semantic classes [21, 27].

Another line of study in unsupervised learning is generative modeling. In particular, for image modeling, generative autoregressive models [9, 34, 35, 40], such as PixelCNN, are powerful generative models with tractable likelihood computation. In this case, the high-dimensional data, e.g., an image, is factorized as a product of conditionals over its pixels. The generative model is then trained to predict the current pixel \(x_i\) based on the past values \(x_{\le i-1}\) in a raster-scan fashion using masked convolutions [34] (Fig. 3(a)).
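
In generic notation, for an image \(\mathbf {x}\) with n (sub-)pixels, this factorization reads (the exact conditioning set varies between models):

$$\begin{aligned} p(\mathbf {x}) = \prod _{i=1}^{n} p(x_i \mid x_1, \ldots , x_{i-1}) \end{aligned}$$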

In this work, instead of using a single left-to-right, top-to-bottom ordering, we propose to use several orderings obtained with different forms of masked convolutions and an attention mechanism. The various orderings over the input pixels, or over the intermediate representations, are then considered as different views of the input image, and the model is trained to maximize the MI between the outputs over these different views.

Our approach is generic and can be applied to both clustering and representation learning (see Fig. 1). For a clustering task (Fig. 1(a)), we apply a pair of distinct orderings over a given input image, producing two pixel-level predictions in the form of probability distributions over the semantic classes. We then maximize the MI between the two outputs at each corresponding spatial location and its immediate neighbors. Maximizing the MI helps avoid degeneracy (e.g., uniform output distributions) and trivial solutions (e.g., assigning all of the pixels to the same cluster). For representation learning (Fig. 1(b)), we maximize a lower bound of the MI between the two output feature maps over the different views.

We evaluate the proposed method on standard image segmentation datasets, Potsdam [14] and COCO-Stuff [5], and show competitive results. We present an extensive ablation study to highlight the contribution of each component within the proposed framework and to emphasize the flexibility of the method.

To summarize, we make the following contributions: (i) a novel unsupervised method for image segmentation based on autoregressive models and MI maximization; (ii) various forms of masked convolutions to generate different orderings; (iii) an attention-augmented version of masked convolutions for a larger receptive field and a larger set of possible orderings; (iv) improved performance over the previous state of the art in unsupervised image segmentation.

2 Related Works

Autoregressive Models. Many autoregressive models [9, 10, 15, 31, 34, 37, 40] for natural image modeling have been proposed. They model the joint probability distribution of high-dimensional images as a product of conditionals over the pixels. PixelCNN [34, 35] specifies the conditional distribution of a sub-pixel (i.e., a color channel of a pixel) as a full 256-way softmax, while PixelCNN++ [40] uses a mixture of logistics. In both cases, masked convolutions are used to process the input image in an autoregressive manner. In the Image Transformer [37] and Sparse Transformer [10], self-attention [43] is used over the input pixels, while PixelSNAIL [9] combines both attention and masked convolutions.

Clustering and Unsupervised Representation Learning. Recent works in clustering aim at combining traditional clustering algorithms [19] with deep learning, such as using K-means style objectives when training deep networks [6, 12, 18]. However, such objectives can lead to trivial and degenerate solutions [6]. IIC [27] proposed to use an MI-based objective, which is intrinsically more robust to such trivial solutions. Unsupervised representation learning [1, 16, 22, 36] rather aims to train a model mapping the unlabeled inputs into some lower-dimensional space, while preserving semantic information and discarding instance-specific details. The pre-trained model can then be fine-tuned on a down-stream task with fewer labels.

Unsupervised Learning and MI Maximization. Maximizing MI for unsupervised learning is not a new idea [2, 19], and recent works have demonstrated its effectiveness. For representation learning, the training objective is to maximize a lower bound of the MI over continuous random variables between distinct views of the inputs. These views can be the input image and its representation [23], the global and local features [22], the features at different scales [1], a sequence of patches extracted from an image in some fixed order [36], or different modalities of the image [41]. For a clustering objective, with discrete random variables as outputs, the exact MI can be maximized over the different views, e.g., IIC [27] maximizes the MI between the image and its augmented version.

Unsupervised Image Segmentation. Methods that learn the segmentation masks entirely from data with no supervision can be categorized as follows: (1) GAN-based methods [4, 8] that extract and redraw the main object in the image for object segmentation. Such methods are limited to instances with only two classes, a foreground and a background, whereas the proposed method is more general and independent of the number of ground-truth classes. (2) Iterative methods [24] consisting of a two-step process, where the features produced by a CNN are first grouped into clusters using spherical K-means, and the CNN is then trained for better feature extraction to discriminate between the clusters. We propose an end-to-end method, simplifying both training and inference. (3) MI maximization based methods [27], where the MI between two views of the same instance at the corresponding spatial locations is maximized. We propose an efficient and effective way to create different views of the input using masked convolutions. Another line of work leverages the learned representations of a deep network for unsupervised segmentation, e.g., CRFs [29] and deep priors [29].

3 Method

Our goal is to learn a representation that maximizes the MI, denoted as I, between different views of the input. These views are generated using various orderings, capturing different aspects of the inputs. Formally, let \(\mathbf {x} \in \mathcal {X}\) be an unlabeled data point, and let \(f: \mathcal {X} \rightarrow \mathcal {Y}\) be a deep representation to be learned as a mapping between the inputs and the outputs. For clustering, \(\mathcal {Y}\) is the set of possible clusters corresponding to semantic classes, and for representation learning, \(\mathcal {Y}\) corresponds to a lower-dimensional space of the output features. Let \(o_i\) and \(o_j\) be two orderings obtained from the set \(\mathcal {O}\) of possible and valid orderings (Fig. 2). For two outputs \(f(\mathbf {x}, o_i)\) and \(f(\mathbf {x}, o_j)\), the objective is to maximize the predictability of one from the other and vice-versa, where \(f(\mathbf {x}, o_i)\) corresponds to applying the learned function f with a given ordering \(o_i\) to process the image \(\mathbf {x}\). This objective is equivalent to maximizing the MI between the two encoded variables:

$$\begin{aligned} \max _{f} \; I\big (f(\mathbf {x}, o_i);\, f(\mathbf {x}, o_j)\big ) \end{aligned}$$
(1)

We start by presenting different forms of masked convolutions to generate various raster-scan orderings, and propose an attention-augmented variant (Sect. 3.1). We then formulate the training objective for maximizing Eq. (1) (Sect. 3.2). We finally conclude with a flexible design architecture for the function f (Sect. 3.3).

Fig. 2. Raster-scan type orderings.

3.1 Orderings

Masked Convolutions. In neural autoregressive modeling [9, 34, 40], for an input image with 3 color channels, a raster-scan ordering is first imposed on the image (see Fig. 2, ordering \(o_1\)). Such an ordering, where the pixel \(x_i\) only depends on the pixels that come before it, is maintained using masked convolutions.
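
As a concrete reference, below is a minimal PyTorch sketch of such a masked convolution for the ordering \(o_1\); the class name and hyperparameters are illustrative, not taken from the paper's implementation. Mask type 'A' also hides the current pixel as in Fig. 3(a), while type 'B' keeps it, i.e., the relaxed variant discussed later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """PixelCNN-style masked convolution for the raster-scan ordering o_1."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kh, kw = self.kernel_size
        mask = torch.ones(1, 1, kh, kw)
        mask[:, :, kh // 2 + 1:, :] = 0                          # rows below the centre
        mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0   # centre row, at/right of centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        # The mask is applied to the weights at every forward pass.
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# Illustrative usage: a 3x3 type-'A' masked convolution over an RGB image.
conv = MaskedConv2d("A", in_channels=3, out_channels=64, kernel_size=3, padding=1)
out = conv(torch.randn(1, 3, 32, 32))
```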

Our proposition is to use all 8 possible raster-scan type orderings as the set \(\mathcal {O}\) of valid orderings, as illustrated in Fig. 2. A simple way to obtain them is to use a single ordering \(o_1\) with the standard masked convolution (Fig. 3(a)), along with geometric transformations g (i.e., image rotations by multiples of 90 degrees and horizontal flips), resulting in 8 versions of the input image. We can then maximize the MI between the outputs computed over two such transformed versions of the input. In this case, however, since the masked weights are never trained, we cannot fall back to a normal convolution in which the function f has access to the full input during inference, greatly limiting the performance of such an approach.

Fig. 3. Masked Convolutions. (a) Standard masked convolution used in autoregressive generative modeling, yielding an ordering \(o_1\). (b) A relaxed version of the standard masked convolution where we have access to the current pixel at each step. (c) A simplified version of the masked convolution with a reduced number of masked weights. (d) The 8 versions of the standard masked convolution to construct all of the possible raster-scan type orderings. (e) The proposed types of masked convolutions with the corresponding shifts to obtain all 8 desired raster-scan type orderings. \(F = 3\) in this case.

This point motivates our approach. Our objective is to learn all the weights of the masked convolution during training, and to use an unmasked version during inference. This can be achieved by using a normal convolution: for a given ordering \(o_i\), we mask the corresponding weights during the forward pass to construct the desired view of the inputs; then, in the backward pass, only the unmasked weights are updated, while the masked weights remain unchanged. In this way, all of the weights are learned and we converge to a normal convolution given enough training iterations. During inference, no masking is applied, giving the function f full access to the inputs.
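
A minimal sketch of this train-time-only masking, assuming a PyTorch setup; the module name, the mask dictionary, and the forward signature are our own illustrative choices. Because the mask multiplies the weights, the masked weights receive zero gradient, which is exactly the "update only the unmasked weights" behaviour described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderingMaskedConv2d(nn.Conv2d):
    """Normal convolution whose weights are masked only during training.
    `masks` maps an ordering id to a binary (kh, kw) tensor. Masked weights
    get zero gradient and stay unchanged, so over training every weight is
    eventually learned and the layer is used unmasked at inference."""

    def __init__(self, masks, *args, **kwargs):
        super().__init__(*args, **kwargs)
        for i, m in masks.items():
            self.register_buffer(f"mask_{i}", m.view(1, 1, *m.shape))

    def forward(self, x, ordering=None):
        if self.training and ordering is not None:
            weight = self.weight * getattr(self, f"mask_{ordering}")
        else:
            weight = self.weight          # full receptive field at inference
        return F.conv2d(x, weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```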

A straightforward way to implement this is to use the 8 versions of the standard masked convolution to create the set \(\mathcal {O}\) (Fig. 3(d)). However, for each forward pass, the majority of the weights are masked, resulting in a reduced receptive field; moreover, fewer weights are learned at each iteration, leading to some disparity between them.

Given that we are interested in a discriminative task, rather than generative image modeling where access to the current pixel is not allowed, we start by relaxing the conditional dependency and allowing the model to access the current pixel, reducing the number of masked locations by one (Fig. 3(b)). To further reduce the number of masked weights, for an \(F \times F\) convolution, instead of masking the lower rows we can simply shift the input by the same amount and only mask the weights of the last row. We thus reduce the number of masked weights from \(\lfloor {F^2/2}\rfloor \) (Fig. 3(b)) to \(\lfloor {F/2}\rfloor \) (Fig. 3(c)). With four possible masked convolutions, \(\{\mathrm {Conv}_{\mathrm {A}}, \mathrm {Conv}_{\mathrm {B}}, \mathrm {Conv}_{\mathrm {C}}, \mathrm {Conv}_{\mathrm {D}}\}\), and four possible shifts, \(\{\mathrm {Shift}_{\mathrm {1}}, \mathrm {Shift}_{\mathrm {3}}, \mathrm {Shift}_{\mathrm {2}}, \mathrm {Shift}_{\mathrm {4}}\}\), we can create all 8 raster-scan orderings as illustrated in Fig. 3(e). The proposed masked convolutions do not introduce any additional computational overhead, neither in training nor in inference, making them easy to implement and integrate into existing architectures with minor changes.
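
A sketch of the shifted variant for the ordering \(o_1\) (Fig. 3(c)), again with illustrative names and assuming PyTorch; the other orderings would be obtained by changing the shift direction and mirroring or rotating the mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedMaskedConv2d(nn.Conv2d):
    """Shifted masked convolution for the ordering o_1. The input is shifted
    down by floor(F/2) rows (pad top, crop bottom), so only the floor(F/2)
    weights to the right of the centre in the last kernel row need masking."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones(1, 1, kh, kw)
        mask[:, :, -1, kw // 2 + 1:] = 0   # only floor(F/2) masked weights
        self.register_buffer("mask", mask)
        self.shift = kh // 2

    def forward(self, x):
        h = x.size(2)
        x = F.pad(x, (0, 0, self.shift, 0))[:, :, :h, :]   # shift rows down
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)
```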

Attention Augmented Masked Convolutions. The proposed masked convolutions are limited in terms of expressiveness since, as pointed out by [34], they create blind spots in the receptive field (Fig. 6). In our case, by applying different orderings, we will have access to all of the input over the course of training, and this bug can be seen as a feature, where the blind spots act as an additional restriction. This restricted receptive field can, however, be overcome using the self-attention mechanism [43]. Similar to previous works [3, 44, 45], we propose to add attention blocks to model long-range dependencies that are hard to capture with standalone convolutions. Given an input tensor of shape \((H, W, C_{in})\), after reshaping it into a matrix \(X \in \mathbb {R}^{HW \times C_{in}}\), we can apply a masked version of attention [43] in a straightforward manner. The output of the attention operation is:

$$\begin{aligned} A = \mathrm {softmax}\Big ( \frac{Q K^{\top }}{\sqrt{d}} + M_{o_i} \Big ) V \end{aligned}$$
(2)

with \(Q = XW_q\), \(K = XW_k\) and \(V = XW_v\), where \(W_q, W_k \in \mathbb {R}^{C_{in} \times d}\) and \(W_{v} \in \mathbb {R}^{C_{in} \times d}\) are learned linear transformations that map the input X to queries Q, keys K and values V, and \(M_{o_i}\) corresponds to a masking operation to maintain the correct ordering \(o_i\).

Fig. 4. Zigzag type orderings.

The output of the attention operation is then projected into the output space using a learned linear transformation \(W^O \in \mathbb {R}^{d \times C_{in}}\), obtaining \(X_{\text {att}} = A W^{O}\). \(X_{\text {att}}\) is concatenated channel-wise with the input X and merged using a \(1 \times 1\) convolution, resulting in the output of the attention block.
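
A minimal single-head sketch of such a block in PyTorch; the class name, the use of nn.Linear for the projections, and the boolean mask convention are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MaskedAttentionBlock(nn.Module):
    """Single-head masked self-attention, concatenated channel-wise with the
    input and merged by a 1x1 convolution. `mask` is a precomputed (HW, HW)
    boolean tensor, True where attending is NOT allowed for the ordering o_i;
    the current position stays unmasked (relaxed dependency), so every row
    has at least one valid entry and the softmax is well defined."""

    def __init__(self, c_in, d):
        super().__init__()
        self.q = nn.Linear(c_in, d, bias=False)
        self.k = nn.Linear(c_in, d, bias=False)
        self.v = nn.Linear(c_in, d, bias=False)
        self.proj = nn.Linear(d, c_in, bias=False)          # W^O
        self.merge = nn.Conv2d(2 * c_in, c_in, kernel_size=1)
        self.scale = d ** -0.5

    def forward(self, x, mask):                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)       # (B, HW, C)
        q, k, v = self.q(seq), self.k(seq), self.v(seq)
        logits = q @ k.transpose(1, 2) * self.scale          # (B, HW, HW)
        logits = logits.masked_fill(mask, float("-inf"))     # enforce ordering o_i
        att = torch.softmax(logits, dim=-1) @ v              # (B, HW, d)
        x_att = self.proj(att).transpose(1, 2).reshape(b, c, h, w)
        return self.merge(torch.cat([x, x_att], dim=1))      # concat + 1x1 conv
```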

Zigzag Orderings. Using attention gives us another benefit: we can extend the set of possible orderings to include the zigzag type orderings introduced in [9] (Fig. 4). With zigzag orderings, the outputs at each spatial location are mostly influenced by the values of the corresponding neighboring input pixels, which can give rise to more semantically meaningful representations compared to raster-scan orderings. This is done by simply using a mask corresponding to the desired zigzag ordering \(o_i\), resulting in a set of 16 possible and valid orderings \(o_i\), \(i \in \{1, \ldots , 16\}\), in total. See Fig. 5 for an example.

Fig. 5. Attention Masks. Examples of the different attention masks of shape \(HW \times HW\) applied for a given ordering \(o_i\), with \(HW = 9\).

Fig. 6. Blind Spots. Blind spots in the receptive field of a given pixel as a result of using a masked convolution for a given ordering \(o_i\).

3.2 Training Objective

In information theory, the MI \(I(X; Y)\) between two random variables X and Y measures the amount of information learned from knowledge of Y about X and vice-versa. The MI can be expressed as the difference of two entropy terms:

$$\begin{aligned} I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X) \end{aligned}$$
(3)

Intuitively, \(I(X; Y)\) can be seen as the reduction of uncertainty in one of the variables when the other one is observed. If X and Y are independent, knowing one variable reveals nothing about the other; in this case, \(I(X; Y) = 0\). Inversely, if the state of one variable is deterministic given the state of the other, the MI is maximized. Such an interpretation explains the goal behind maximizing Eq. (1): the neural network must be able to preserve information and extract semantically similar representations regardless of the applied ordering \(o_i\), and learn representations that encode the underlying shared information between the different views. The objective can also be interpreted as having a regularization effect, forcing the function f to focus on the different views and subparts of the input to produce similar outputs, reducing the reliance on specific objects or parts of the image.

Let \(P\big (f(\mathbf {x}, o_i), f(\mathbf {x}, o_j)\big )\) be the joint distribution produced by sampling examples \(\mathbf {x} \in \mathcal {X}\) and then sampling two outputs \(f(\mathbf {x}, o_i)\) and \(f(\mathbf {x}, o_j)\) with two possible orderings \(o_i\) and \(o_j\). In this case, the MI in Eq. (1) can be defined as the Kullback–Leibler (KL) divergence between the joint and the product of the marginals:

$$\begin{aligned} I\big (f(\mathbf {x}, o_i); f(\mathbf {x}, o_j)\big ) = D_{\mathrm {KL}}\Big ( P\big (f(\mathbf {x}, o_i), f(\mathbf {x}, o_j)\big ) \, \big \Vert \, P\big (f(\mathbf {x}, o_i)\big ) \, P\big (f(\mathbf {x}, o_j)\big ) \Big ) \end{aligned}$$
(4)

To maximize Eq. (4), we can either maximize the exact MI for a clustering task over discrete predictions, or a lower bound of it for unsupervised representation learning over the continuous outputs. We will now formulate the loss functions \(\mathcal {L}_{\mathrm {AC}}\) and \(\mathcal {L}_{\mathrm {ARL}}\) of both objectives for a segmentation task.

Autoregressive Clustering (AC). In a clustering task, the goal is to train a neural network to predict a cluster assignment corresponding to a semantic class \(k \in \{1, \ldots , K\}\), with K possible clusters, at each spatial location. In this case, the encoder-decoder type network is terminated with a K-way softmax, outputting a probability map of the same spatial dimensions as the input. Concretely, for a given input image \(\mathbf {x}\) and two valid orderings \(o_i\) and \(o_j\), we forward pass the input through the network, producing two output probability distributions over the K clusters at each spatial location. After reshaping the outputs into two matrices \(\mathbf {Y}_{o_i}\) and \(\mathbf {Y}_{o_j}\) of shape \(HW \times K\), with each element corresponding to the probability of assigning pixel \(x_l\), \(l \in \{1, \ldots , HW\}\), to cluster k, we can compute the joint distribution \(\mathbf {P}\) of shape \(K \times K\) as follows:

$$\begin{aligned} \mathbf {P} = \frac{1}{HW} \, \mathbf {Y}_{o_i}^{\top } \, \mathbf {Y}_{o_j} \end{aligned}$$
(5)

The marginals \(\mathbf {P}_{k}\) and \(\mathbf {P}_{k'}\) can then be obtained by summing over the rows and columns of \(\mathbf {P}\). Similar to IIC [27], we symmetrize \(\mathbf {P}\) using \((\mathbf {P} + \mathbf {P}^{\top })/2\) to maximize the MI in both directions. The clustering loss in this case can be written as follows:

$$\begin{aligned} \mathcal {L}_{\mathrm {AC}} = - \sum _{k=1}^{K} \sum _{k'=1}^{K} \mathbf {P}_{kk'} \, \log \frac{\mathbf {P}_{kk'}}{\mathbf {P}_{k} \cdot \mathbf {P}_{k'}} \end{aligned}$$
(6)

In practice, instead of only maximizing the MI between two corresponding spatial locations, we maximize it between each spatial location and its immediate neighbors over small displacements \(\Omega \) (see Fig. 1). This can be efficiently implemented using a convolution operation, as demonstrated in [27].
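
A minimal sketch of the resulting clustering loss, assuming softmax outputs of shape (B, K, H, W); the displacement window \(\Omega \) is omitted for brevity, and the function name is ours.

```python
import torch

def ac_loss(y1, y2, eps=1e-8):
    """Negative MI between cluster assignments under two orderings (Eq. 6)."""
    b, k, h, w = y1.shape
    y1 = y1.permute(0, 2, 3, 1).reshape(-1, k)    # (B*H*W, K)
    y2 = y2.permute(0, 2, 3, 1).reshape(-1, k)
    p = y1.t() @ y2 / y1.shape[0]                 # joint distribution, (K, K)
    p = (p + p.t()) / 2                           # symmetrize
    p = p.clamp(min=eps)
    pi = p.sum(dim=1, keepdim=True)               # marginal over rows
    pj = p.sum(dim=0, keepdim=True)               # marginal over columns
    mi = (p * (p.log() - pi.log() - pj.log())).sum()
    return -mi
```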

Autoregressive Representation Learning (ARL). Although the clustering objective in Eq. (6) can also be used as a pre-training objective for f, Tschannen et al. [42] recently showed that maximizing the MI does not always result in transferable and semantically meaningful features, especially when the down-stream task is a priori unknown. To this end, we follow recent representation learning works based on MI maximization [1, 22, 36, 41], where a lower bound estimate of MI (e.g., InfoNCE [36], NWJ [33]) is maximized between different views of the inputs. These estimates are based on the simple and intuitive idea that if a critic f is able to differentiate between samples drawn from the joint distribution and samples drawn from the marginals, then the true MI is maximized. We refer the reader to [42] for a detailed discussion.

In our case, with image segmentation as the target down-stream task, we maximize the InfoNCE estimator [36] over the continuous outputs. Specifically, the two outputs \(\mathbf {y}^{o_i}\) and \(\mathbf {y}^{o_j}\) take the form of C-dimensional feature maps, and the training objective is to maximize the InfoNCE-based loss \(\mathcal {L}_{\mathrm {ARL}}\):

$$\begin{aligned} \mathcal {L}_{\mathrm {ARL}} = \mathbb {E} \left[ \log \frac{\exp \big ( f(\mathbf {y}^{o_i}_{l}, \mathbf {y}^{o_j}_{l}) \big )}{\sum _{m=1}^{N} \exp \big ( f(\mathbf {y}^{o_i}_{l}, \mathbf {y}^{o_j}_{m}) \big )} \right] \end{aligned}$$
(7)

For an input image \(\mathbf {x}\) and two outputs \(\mathbf {y}^{o_i}\) and \(\mathbf {y}^{o_j}\), let \(\mathbf {y}^{o_i}_{l}\) and \(\mathbf {y}^{o_j}_{m}\) correspond to the C-dimensional feature vectors at spatial positions l and m in the first and second outputs, respectively. We start by creating N pairs of feature vectors \((\mathbf {y}^{o_i}_{l}, \mathbf {y}^{o_j}_{m})\), with one positive pair drawn from the joint distribution and \(N-1\) negative pairs drawn from the marginals. A positive pair is a pair of feature vectors corresponding to the same spatial location in the two outputs, i.e., a pair with \(m=l\). The negatives are pairs corresponding to two distinct spatial positions \(m\ne l\). In practice, we also consider small displacements \(\Omega \) (Fig. 1) when constructing positives. Additionally, the negatives are generated from two distinct images, since two feature vectors might share similar characteristics even with different spatial positions. By maximizing Eq. (7), we push the model to produce similar representations for the same spatial location regardless of the applied ordering, so that the critic function f is able to give high matching scores to the positive pairs and low matching scores to the negatives. We follow [22] and use separable critics \(f(\mathbf {y}^{o_i}_{l}, \mathbf {y}^{o_j}_{m}) = \phi _1(\mathbf {y}^{o_i}_{l})^{\top } \phi _2(\mathbf {y}^{o_j}_{m})\), where the functions \(\phi _1/\phi _2\) non-linearly transform the outputs to a higher-dimensional vector space and the dot product produces a scalar corresponding to a matching score between the two representations at spatial positions l and m.
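
A simplified sketch of the corresponding InfoNCE computation over the projected feature maps; for brevity it treats every other location in the mini-batch as a negative (the paper draws negatives from distinct images), omits the displacement window \(\Omega \), and the temperature parameter is our own addition.

```python
import torch
import torch.nn.functional as F

def arl_loss(z1, z2, temperature=1.0):
    """InfoNCE-style loss over feature maps z1, z2 of shape (B, C, H, W),
    already projected by the separable critics phi_1 / phi_2. In practice a
    subset of N locations would be sampled rather than using all B*H*W."""
    b, c, h, w = z1.shape
    z1 = z1.permute(0, 2, 3, 1).reshape(-1, c)    # (B*H*W, C)
    z2 = z2.permute(0, 2, 3, 1).reshape(-1, c)
    scores = z1 @ z2.t() / temperature            # matching scores f(., .)
    targets = torch.arange(scores.size(0), device=scores.device)
    # Maximizing the bound in Eq. (7) amounts to minimizing this
    # cross-entropy, with the positive pairs on the diagonal.
    return F.cross_entropy(scores, targets)
```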

Note that both losses \(\mathcal {L}_{\mathrm {AC}}\) and \(\mathcal {L}_{\mathrm {ARL}}\) can be applied interchangeably for both objectives, a case we investigate in our experiments (Sect. 4.1). For \(\mathcal {L}_{\mathrm {AC}}\), we can consider the clustering objective as an intermediate task for learning useful representations. For \(\mathcal {L}_{\mathrm {ARL}}\), during inference, the K-means [28] algorithm can be applied over the outputs to obtain the cluster assignments.

3.3 Model

The representation f can be implemented in a general manner using three sub-parts, i.e., \(f = d \circ g_{ar} \circ h\), with a feature extractor h, an autoregressive encoder \(g_{ar}\) and a decoder d. With such a formulation, the function f is flexible and can take different forms. With h as an identity mapping, f becomes a fully autoregressive network, where we apply the different orderings directly over the inputs. Inversely, if \(g_{ar}\) is an identity mapping, f becomes a generic encoder-decoder network, where h plays the role of an encoder. Additionally, h can be a simple convolutional stem, which plays an important role in learning local features such as edges, or even multiple residual blocks [20] to extract higher-level representations; in this case, the orderings are applied over the hidden features using \(g_{ar}\). \(g_{ar}\) is similar to h, containing a series of residual blocks, with two main differences: the proposed masked convolutions are used, and the batch normalization [25] layers are omitted to maintain the autoregressive dependency, with an optional attention block. The decoder d can be a simple \(\text {conv}1\times 1\) to adapt the channels to the number of clusters K, followed by bilinear upsampling and a softmax operation for a clustering objective. For representation learning, d consists of the two separable critics \(\phi _1/\phi _2\), which are implemented as a series of \(\text {conv}3\times 3 - \text {BN} - \text {ReLU}\) and a \(\text {conv}1\times 1\) for projecting to a higher-dimensional space. See sup. mat. for the architectural details.
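
A structural sketch of the composition \(f = d \circ g_{ar} \circ h\); the concrete sub-modules below are placeholders, not the architecture used in the paper (see sup. mat. for those details).

```python
import torch.nn as nn

class AutoregressiveSegmenter(nn.Module):
    """f = d o g_ar o h: feature extractor h, autoregressive encoder g_ar
    built from the masked convolutions above (no batch norm), decoder d."""

    def __init__(self, h, g_ar, d):
        super().__init__()
        self.h, self.g_ar, self.d = h, g_ar, d

    def forward(self, x, ordering):
        feats = self.h(x)                    # local features (identity => fully AR model)
        feats = self.g_ar(feats, ordering)   # orderings applied over hidden features
        return self.d(feats)                 # K-way softmax or projection head

# Illustrative clustering head d: 1x1 conv to K clusters, upsampling, softmax
# (channel count, K and scale factor are placeholders).
d = nn.Sequential(nn.Conv2d(64, 6, kernel_size=1),
                  nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
                  nn.Softmax(dim=1))
```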

4 Experiments

Datasets. The experiments are conducted on the challenging baselines newly established by [27]. Potsdam [14] contains 8550 RGBIR satellite images of size \(200 \times 200\), of which 3150 are unlabeled. We experiment on both the 6-label variant and Potsdam-3, a 3-label variant formed by merging each of the pairs (roads and cars, vegetation and trees, buildings and clutter). We also use COCO-Stuff [5], a dataset containing stuff classes. Specifically, we use a reduced version of COCO-Stuff with 15 coarse labels, reduced from 164k to 52k images by taking only images with at least 75% stuff pixels, in addition to COCO-Stuff-3 with only 3 labels: sky, ground and plants.

Table 1. AC Ablations. Ablation studies conducted on Potsdam (POS) and Potsdam-3 (POS3) for Autoregressive Clustering. We show the pixel classification accuracy (%).

Evaluation Metrics. We report the pixel classification Accuracy (Acc). For a clustering task, there is a mismatch between the learned and the ground-truth clusters; we follow the standard procedure and find the best one-to-one permutation matching the output clusters to the ground-truth classes using the Hungarian algorithm [30]. The Acc is then computed over the labeled examples.
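
For reference, a small sketch of this evaluation step using SciPy's Hungarian solver; function and variable names are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(pred, target, k):
    """Pixel accuracy after the best one-to-one matching between predicted
    clusters and ground-truth classes. pred, target: flat integer arrays over
    the labeled pixels; k: number of clusters/classes."""
    overlap = np.zeros((k, k), dtype=np.int64)
    for c in range(k):
        for g in range(k):
            overlap[c, g] = np.sum((pred == c) & (target == g))
    rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
    mapping = dict(zip(rows, cols))
    remapped = np.array([mapping[c] for c in pred])
    return np.mean(remapped == target)
```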

Implementation Details. The different variations of f are trained using Adam with a learning rate of \(10^{-5}\) to optimize both objectives in Eqs. (6) and (7). The training is conducted on NVidia V100 GPUs and implemented using the PyTorch framework [38]. For more experimental details, see sup. mat.

4.1 Ablation Studies

We start by performing comprehensive ablation studies on the different components and variations of the proposed method. Table 1 and Fig. 7 show the ablation results for AC, and Table 2 shows a comparison between AC and ARL, analyzed as follows:

Fig. 7. Overclustering. The Acc obtained when using a number of output clusters greater than the number of ground truth classes \(K>K_{gt}\), with a variable number of images used to find the best many-to-one matching between the outputs and targets.

Variations of f. Table 1a compares different variations of the network f. With a fixed decoder d (i.e., a \(1\times 1 \mathrm {Conv}\) followed by bilinear upsampling and a softmax function), we adjust h and \(g_{ar}\), going from a fully autoregressive model to a normal encoder-decoder network. When using masked versions, we see an improvement over the normal case, with up to 8 points for Potsdam, and to a lesser extent for Potsdam-3, where the task is relatively easier with only three ground-truth classes. When using a fully autoregressive model and applying the orderings directly over the inputs, maximizing the MI becomes much harder, and the model fails to learn meaningful representations. Inversely, when no masking is applied, the task becomes comparatively simpler, and we see a drop in performance. The best results are obtained when applying the orderings over low-level features. Interestingly, the unmasked versions yield results better than random, and perform competitively with 3 output classes for Potsdam-3, validating the effectiveness of maximizing the MI over small displacements \(\Omega \). For the rest of the experiments, we use the variant that applies the orderings over low-level features as our model.

Attention and Different Orderings. Table 1c shows the effectiveness of attention. With a single attention block added at a shallow level, we observe an improvement over the baseline for both raster-scan and zigzag orderings, and for their combination, with up to 4 points for Potsdam. In this case, given the quadratic complexity of attention, we used an output stride of 4.

Data Augmentations. For a given training iteration, we pass the same image two times through the network, applying a different ordering at each forward pass. We can, however, pass a transformed version of the image as the second input. We investigate using photometric (i.e., color jittering) and geometric (i.e., rotations and H-flips) transformations. For geometric transformations, we bring the outputs back to the input coordinate space before computing the loss. Results are shown in Table 1e. As expected, we obtain relative improvements with data augmentations, highlighting the flexibility of the approach.
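
A small illustrative helper for the geometric case, assuming the model interface sketched in Sect. 3.3; it applies a rotation and optional horizontal flip to the input, then maps the prediction back to the input coordinate frame before the loss is computed.

```python
import torch

def forward_transformed_view(model, x, ordering, k=1, flip=False):
    """Feed a rotated (k * 90 degrees) and optionally H-flipped view through
    the model, then invert the transformation on the output so it aligns
    with the untransformed prediction. `model(x, ordering)` follows the
    hypothetical interface of the earlier sketch."""
    x_t = torch.rot90(x, k, dims=(2, 3))
    if flip:
        x_t = torch.flip(x_t, dims=(3,))
    y = model(x_t, ordering)
    if flip:                                   # invert in reverse order
        y = torch.flip(y, dims=(3,))
    return torch.rot90(y, -k, dims=(2, 3))
```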

Dropout. To add some degree of stochasticity to the network, and as an additional regularization, we apply dropout to the intermediate activations within the residual blocks. Table 1f shows a small increase in Acc for Potsdam.

Orderings. Until now, at each forward pass we sample a pair of orderings with replacement from the set \(\mathcal {O}\). With such a sampling procedure, we might end up with the same pair of orderings for a given training iteration. As an alternative, we investigate two other sampling procedures: first, sampling with no repetition (No Rep.), where we choose two distinct orderings for each training iteration; second, hard sampling, choosing two orderings with opposite receptive fields (e.g., \(o_1\) and \(o_6\)). Table 1d shows the obtained results. We see a 2-point improvement when using hard sampling for Potsdam. For simplicity, we use random sampling for the rest of the experiments. Additionally, to investigate the effect of the number of orderings (i.e., the cardinality of \(\mathcal {O}\)), we compute the Acc over different choices and sizes of \(\mathcal {O}\). Table 1b shows that the best results are obtained when using all 8 raster-scan orderings. Interestingly, for some choices we observe better results, which may be due to selecting orderings that do not share any receptive fields, as in hard sampling.
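
A small sketch of the three sampling procedures; only the \((o_1, o_6)\) pairing for hard sampling is taken from the text, the rest of the structure and names are our own.

```python
import random

ORDERINGS = list(range(1, 9))    # the 8 raster-scan orderings o_1 .. o_8
HARD_PAIRS = [(1, 6)]            # orderings with opposite receptive fields;
                                 # (o_1, o_6) is the example given in the text,
                                 # further pairs would be added analogously.

def sample_orderings(mode="random"):
    if mode == "random":         # with replacement (the pair may repeat)
        return random.choice(ORDERINGS), random.choice(ORDERINGS)
    if mode == "no_rep":         # two distinct orderings
        return tuple(random.sample(ORDERINGS, 2))
    if mode == "hard":           # opposite receptive fields
        return random.choice(HARD_PAIRS)
    raise ValueError(f"unknown sampling mode: {mode}")
```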

Table 2. Comparing ARL and AC. We compare ARL and AC on a clustering task (left), and investigate the quality of the learned representations by freezing the trained model and reporting the test Acc obtained when training a linear (center) and a non-linear (right) function on the labeled training examples.

Overclustering. To compute the Acc for a clustering task using linear assignment, the number of output clusters is chosen to match the number of ground-truth classes, \(K = K_{gt}\). Nonetheless, we can choose a higher number of clusters \(K > K_{gt}\), and then find the best many-to-one matching between the output clusters and the ground truths based on a given number of labeled examples. In this case, however, we are not in a fully unsupervised setting, given that we extract some information, although limited, from the labels. Figure 7 shows that, even with a very limited number of labeled examples used for the mapping, we can obtain better results than in the fully unsupervised case.

AC and ARL. To compare AC and ARL, we apply them interchangeably on both clustering and representation learning objectives. In clustering, for ARL, after PCA Whitening, we apply K-means over the output features to get the cluster assignments. In representation learning, we evaluate the quality of the learned representations using both linear and non-linear separability as a proxy for disentanglement, and as a measure of MI between representations and class labels. Table 2 shows the obtained results.

Table 3. Unsupervised image segmentation. Comparison of AC with state-of-the-art methods on unsupervised segmentation.

Clustering. As expected, AC outperforms ARL on a clustering task, given that the clusters are directly optimized by computing the exact MI during training.

Quality of the Learned Representations. Surprisingly, AC outperforms ARL on both linear and non-linear classifications. We hypothesize that unsupervised representation learning objectives that work well on image classification, fail in image segmentation due to the dense nature of the task. The model in this case needs to output distinct representations over pixels, rather than the whole image, which is a harder task to optimize. This might also be due to using only a small number of features (i.e., N pairs) for each training iteration.

4.2 Comparison with the State-of-the-Art

Table 3 shows the results of the comparison. AC outperforms previous work, and by a good margin for the harder segmentation tasks with a larger number of output classes (i.e., Potsdam and COCO-Stuff), highlighting the effectiveness of maximizing the MI between the different orderings as a training objective. We note that no regularization or data augmentation was used, and we expect that better results can be obtained by combining AC with such procedures, as demonstrated in the ablation studies.

5 Conclusion

We presented a novel method to create different views of the inputs using different orderings, and showed the effectiveness of maximizing the MI over these views for unsupervised image segmentation. We showed that for image segmentation, maximizing the MI over the discrete outputs works better for both clustering and unsupervised representation learning, due to the dense nature of the task. Given the simplicity and ease of adoption of the method, we hope that the proposed approach can be adapted to other visual tasks and used in future works.