
1 Introduction

Semantic scene segmentation, i.e., assigning a class label to every pixel in an input image, has received growing attention in the computer vision community, with accuracy greatly increasing over the years [1,2,3,4,5,6]. In particular, fully-supervised approaches based on Convolutional Neural Networks (CNNs) have recently achieved impressive results [1,2,3,4, 7]. Unfortunately, these methods require large amounts of training images with pixel-level annotations, which are expensive and time-consuming to obtain. Weakly-supervised techniques have therefore emerged as a solution to address this limitation [8,9,10,11,12,13,14,15]. These techniques rely on a weaker form of training annotations, such as, from weaker to stronger levels of supervision, image tags [12, 14, 16, 17], information about object sizes [17], labeled points or squiggles [12] and labeled bounding boxes [13, 18]. In the current Deep Learning era, existing weakly-supervised methods typically start from a network pre-trained on an object recognition dataset (e.g., ImageNet [19]) and fine-tune it using segmentation losses defined according to the weak annotations at hand [12,13,14, 16, 17].

In this paper, we are particularly interested in exploiting one of the weakest levels of supervision, i.e., image tags, which is a rather inexpensive attribute to annotate and thus more common in practice (e.g., Flickr [20]). Image tags simply determine which classes are present in the image without specifying any other information, such as the location of the objects. In this extreme setting, a naive weakly-supervised segmentation algorithm will typically yield poor localization accuracy. Therefore, recent works [12, 16, 21] have proposed to make use of objectness priors [22,23,24,25], which provide each pixel with a probability of being an object. In particular, these methods have exploited existing objectness algorithms, such as [22,23,24], with the drawback of introducing external sources of potential error. Furthermore, [22] typically only yields a rough foreground/background estimate, and [23, 24] rely on additional training data with pixel-level annotations.

Here, by contrast, we introduce a Deep Learning approach to weakly-supervised semantic segmentation where the localization information is directly extracted from the network itself. Our approach relies on the following intuition: One can expect that a network trained for the task of object recognition extracts features that focus on the objects themselves, and thus has hidden layers with units firing up on foreground objects, but not on background regions. A similar intuition was also recently explored for other tasks, such as object localization [26] and detection [27]. Starting from a fully-convolutional network pre-trained on ImageNet, we therefore propose to extract a foreground/background mask by directly exploiting the unit activations of some of the hidden layers in the network.

In particular, we focus on the fourth and fifth convolution layers of the VGG-16 pre-trained network [28], which provide higher-level information than the first three layers, such as highlighting complete objects or object parts. We then make use of a fully-connected Conditional Random Field (CRF) to smooth out this information and generate a foreground/background mask. We finally incorporate the resulting masks in our network via a weakly-supervised loss. The resulting masks can also be thought of as a form of objectness measure. While several CNN-based approaches have proposed to learn objectness or saliency measures from annotations [29,30,31], to the best of our knowledge, our approach is the first to extract this information directly from the hidden layer activations of a segmentation network and to employ the resulting masks as localization cues for weakly-supervised semantic segmentation. Ultimately, our model, illustrated by Fig. 1, can therefore be thought of as a weakly-supervised segmentation network with built-in foreground/background prior.

We demonstrate the benefits of our approach on two datasets (Pascal VOC 2012 [32] and a subset of Flickr (MIRFLICKR-1M) [20]). Our experiments show that our approach outperforms the state-of-the-art methods that use image tags only, and even some methods that leverage additional supervision, such as object size information [17] and point supervision [12]. Furthermore, we extend our framework to incorporate some additional, yet cheap, supervision, taking the form of asking the user to select the best foreground/background mask among several automatically generated candidates. Our experiments reveal that this additional supervision only costs the user roughly 2–3 seconds per image and yields another significant accuracy boost over our tags-only results.

Fig. 1. Our weakly-supervised network with built-in foreground/background prior.

2 Related Work

Weakly-supervised semantic segmentation has attracted a lot of attention, because it alleviates the painstaking process of manually generating pixel-level training annotations. Over the years, great progress has been made [9,10,11,12,13,14, 16,17,18, 33]. In particular, recently, Convolutional Neural Networks have been applied to the task of weakly-supervised segmentation with great success. In this section, we discuss these CNN-based approaches, which are the ones most related to our work.

The work of [14] constitutes the first method to consider fine-tuning a pre-trained CNN using image-level tags only within a weakly-supervised segmentation context. This approach relies on a simple Multiple Instance Learning (MIL) loss to account for image tags during training. While this loss improves segmentation accuracy over a naive baseline, this accuracy remains relatively low, due to the fact that no other prior than image tags is employed. By contrast, [13] incorporates an additional prior in the MIL framework in the form of an adaptive foreground/background bias. This bias significantly increases accuracy, which [13] shows can be further improved by introducing stronger supervision, such as labeled bounding boxes. Importantly, however, this bias is data-dependent and not trivial to re-compute for a new dataset. Furthermore, the results remain inaccurate in terms of object localization. In [17], weakly-supervised segmentation is formulated as a constrained optimization problem, and an additional prior modeling the size of objects is introduced. This prior relies on thresholds determining the percentage of the image area that certain classes of objects can occupy, which again is problem-dependent. More importantly, and as in [13], the resulting method does not exploit any information about the location of objects, and thus yields poor localization accuracy.

To overcome this weakness, some approaches [12, 16, 21] have proposed to exploit the notion of objectness. In particular, [16] makes use of a post-processing step that smooths the initial segmentation results using the object proposals obtained by BING [23] or MCG [24]. While this improves localization, being a post-processing step, it cannot recover from mistakes made by the initial segmentation. By contrast, [12, 21] directly incorporate an objectness score [22, 24] in their loss function. While accounting for objectness when training the network indeed improves segmentation accuracy, the whole framework depends on the success of the external objectness module, which, in practice, only produces a coarse heat map and does not accurately determine the location and shape of the objects (as evidenced by our results in the supplementary material).

Note that BING and MCG have been trained on PASCAL train images with full pixel-level annotations or bounding boxes, and thus [16, 21] inherently make use of stronger supervision than our approach. Here, instead of relying on an external objectness method, we leverage the intuition that, within its hidden layers, a network pre-trained for object recognition should already have learned to focus on the objects themselves. This lets us derive a foreground/background mask directly from the information built into the network, which we empirically show provides a more accurate object localization prior. A related idea has also recently been presented in an arXiv paper [34], further evidencing the popularity and importance of this research direction.

3 Our Approach

In this section, we introduce our approach to weakly-supervised semantic segmentation. After briefly discussing the CNN architecture that we use, we present our approach to extracting a foreground/background mask directly from the network itself. We then introduce our weakly-supervised learning algorithm that leverages this foreground/background information, and finally discuss our novel way to introduce additional weak supervision in the process.

3.1 Network Architecture

As in most recent weakly-supervised semantic segmentation algorithms [12,13,14, 16, 17], and as shown in Fig. 2, our architecture is based on the VGG-16 network [28], whose weights were trained on ImageNet for the task of object recognition. Following the fully-convolutional approach [1], all fully-connected layers are converted to convolutional layers, and the final classifier is replaced with a \(1\,\times \,1\) convolution layer with N channels, where N represents the number of classes of the problem. This fully-convolutional network has a stride of 32; inspired by [3], we instead use a stride of 8 and a smaller receptive field (128 pixels), which has proven effective for weakly-supervised semantic segmentation [13]. At the end of the network, we add a deconvolution layer to up-sample the output of the network to the size of the input image. In short, the network takes an image of size \(W\,\times \,H\) as input and generates an \(N\,\times \,W\,\times \,H\) output encoding a score for each pixel and for each class.
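To make this architecture concrete, the following is a minimal sketch of the classifier head described above. The paper's implementation uses Caffe; PyTorch is used here purely for illustration. The sketch keeps the standard stride-32 VGG-16 backbone (the stride-8, 128-pixel receptive field modification is omitted), and the channel sizes of the converted fully-connected layers are assumptions, not the authors' exact values.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class WeaklySupervisedFCN(nn.Module):
    """Fully-convolutional VGG-16 head: 1x1 classifier with N channels followed
    by a deconvolution that up-samples the scores to the input resolution."""

    def __init__(self, num_classes=21):
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")   # ImageNet pre-trained backbone
        self.features = vgg.features                  # stride 32 here; the paper uses a
                                                      # modified stride-8 variant (omitted)
        self.fc_conv = nn.Sequential(                 # fc6/fc7 converted to convolutions
            nn.Conv2d(512, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(1024, 1024, kernel_size=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(1024, num_classes, kernel_size=1)  # 1x1, N channels
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,   # deconvolution
                                           kernel_size=64, stride=32, padding=16)

    def forward(self, x):                             # x: B x 3 x H x W (H, W multiples of 32)
        h, w = x.shape[2:]
        x = self.classifier(self.fc_conv(self.features(x)))
        return self.upsample(x)[:, :, :h, :w]         # B x N x H x W class scores
```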

Fig. 2. Network architecture: fully convolutional neural network derived from the VGG-16 network. We employ a receptive field of 128 pixels and a stride of 8.

3.2 Built-in Foreground/Background Model

We now introduce our approach to extracting a foreground/background mask directly from our network. In Sect. 3.3, we show how this mask can be employed for weakly-supervised semantic segmentation.

Intuitively, we expect that a network trained for an object recognition task has learned to focus on the objects themselves, and their parts, rather than on background regions. In other words, it should produce high activation values on objects and their parts. To evaluate this, we studied the activations of the different hidden layers of our initial network pre-trained on ImageNet. To this end, we forward each image through the network and visualize each activation by computing the mean over the channels after resizing the activation map to the input image size. Perhaps unsurprisingly, this led to the following observations, illustrated in Fig. 3. The first two convolutional layers of the VGG network extract image edges. As we move deeper in the network, the convolutional layers extract higher-level features. In particular, the third convolutional layer fires up on prototypical object shapes. The fourth layer indicates the location of complete objects, and the fifth one fires up on the most discriminative object parts [35].

Based on these observations, we propose to make use of the fourth and fifth layers to produce an initial foreground/background mask estimate. To this end, we first convert these two layers from 3D tensors (\(512\,\times \,W\,\times \,H\)) to 2D matrices (\(W\,\times \,H\)) via an average pooling operation over the 512 channels. We then fuse the two resulting matrices by simple elementwise sum, and scale the resulting values between 0 and 1. The resulting \(W\,\times \,H\) map can be thought of as a pixelwise foreground probability. Figure 3 illustrates the results of this method on a few images from PASCAL VOC 2012. While the resulting scores indeed accurately indicate the location of the foreground objects, this initial mask remains noisy.
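As a concrete illustration, here is a minimal NumPy sketch of this fusion step, assuming the two activation tensors have already been extracted as (512, h, w) arrays; the function and variable names are ours, not the authors'.

```python
import numpy as np
from scipy.ndimage import zoom

def foreground_prior(conv4, conv5, image_size):
    """Fuse conv4/conv5 activations into a pixelwise foreground probability.

    conv4, conv5: (512, h, w) activation tensors from the hidden layers
    (hypothetical variable names); image_size: (H, W) of the input image.
    """
    H, W = image_size
    fused = np.zeros((H, W), dtype=np.float32)
    for act in (conv4, conv5):
        m = act.mean(axis=0)                                    # average pool over channels
        m = zoom(m, (H / m.shape[0], W / m.shape[1]), order=1)  # resize to image size
        fused += m                                              # elementwise sum
    fused -= fused.min()                                        # scale values to [0, 1]
    return fused / (fused.max() + 1e-8)
```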

To overcome this, we therefore propose to exploit these foreground probabilities as unary potentials in a fully-connected CRF. Let \(\mathbf{x} = \{x_i\}_{i=1}^{W\cdot H}\) be the set of random variables, where \(x_i\) encodes the label of pixel i, i.e., either foreground or background. We encode the joint distribution over all pixels with a Gibbs energy of the form

$$\begin{aligned} E(\mathbf{x} = \mathbf{X}) = -\sum _i \log P_f(x_i = X_i) + \sum _{i}\sum _{j>i}\theta _{ij}(x_i=X_i, x_j=X_j), \end{aligned}$$
(1)

where \(P_f(x_i=X_i)\) is the probability of pixel i taking label assignment \(X_i\), obtained directly from the foreground probability of our initial fusion strategy. Following [36], we define the pairwise term \(\theta _{ij}\) as a contrast-sensitive Potts model using two Gaussian kernels encoding color similarity and spatial smoothness. This form lets us make use of the filtering-based mean-field strategy of [36] to perform inference efficiently. Some resulting masks are shown in the last column of Fig. 3.
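Below is a hedged sketch of this refinement step using the publicly available pydensecrf bindings; the paper does not name a specific implementation, and the kernel parameters used here are illustrative assumptions rather than the values used in our experiments.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine_mask(fg_prob, image, n_iters=10):
    """Smooth the fused foreground probability with a fully-connected CRF.

    fg_prob: (H, W) foreground probability in [0, 1]; image: (H, W, 3) uint8 RGB.
    Kernel parameters below are illustrative, not the paper's values.
    """
    H, W = fg_prob.shape
    probs = np.stack([1.0 - fg_prob, fg_prob]).astype(np.float32)   # (2, H, W): bg, fg
    crf = dcrf.DenseCRF2D(W, H, 2)
    crf.setUnaryEnergy(unary_from_softmax(probs))                   # -log P_f as unaries
    crf.addPairwiseGaussian(sxy=3, compat=3)                        # spatial smoothness
    crf.addPairwiseBilateral(sxy=60, srgb=10, compat=5,             # color similarity
                             rgbim=np.ascontiguousarray(image))
    q = np.array(crf.inference(n_iters)).reshape(2, H, W)
    return (q[1] > q[0]).astype(np.uint8)                           # binary fg/bg mask
```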

Note that our foreground/background masks can be thought of as a form of objectness measure. While objectness has been used previously for weakly-supervised semantic segmentation (MCG and BING in [16], and the generic objectness [22] in [12]), the benefits of our approach are twofold. First, we extract this information directly from the same network that will be used for semantic segmentation, which prevents us from having to rely on an external method. Second, as opposed to BING and MCG, we require neither object bounding boxes, nor object segments to train our method. While [22] predicts objectness after training on a set of images, as shown in our experiments in the supplementary materials, our method yields much more accurate object localization than this technique. To further evidence the benefits of our approach, in supplementary material, we evaluate the masks obtained using the probabilities of [22, 24] as unary potentials in the same dense CRF.

Fig. 3. Built-in foreground/background mask. From left to right: the input image, the activations of the \(1^{st}\), \(2^{nd}\), \(3^{rd}\), \(4^{th}\), and \(5^{th}\) convolutional layers, the result of our fusion strategy, the final mask after CRF smoothing, and the ground truth. Note that "Fusion" constitutes the unary potential of the dense CRF used to obtain "Our mask".

3.3 Weakly-Supervised Learning

We now introduce our learning algorithm for weakly-supervised semantic segmentation. We first introduce a simple loss based on image tags only, and then show how we can incorporate our foreground/background masks in our framework.

Intuitively, given image tags, one would like to encourage the image pixels to be labeled as one of the classes that are observed in the image, while preventing them from being assigned to unobserved classes. Note that this assumes that the tags cover all the classes depicted in the image. This assumption, however, is commonly employed in weakly-supervised semantic segmentation [12, 14, 16]. Formally, given an input image I, let \(\mathcal {L}\) be the set of classes that are present in the image (including background) and \(\bar{\mathcal {L}}\) the set of classes that are absent. Furthermore, let us denote by \(s_{i,j}^k(\theta )\) the score produced by our network with parameters \(\theta \) for the pixel at location (i, j) and for class k, \(0\le k < N\). Note that, in general, we will omit the explicit dependency of the variables on the network parameters. Finally, let \(S_{i,j}^k\) be the probability of class k obtained after a softmax layer, i.e.,

$$\begin{aligned} S_{i,j}^k = \frac{\exp (s_{i,j}^k)}{\sum _{c=1}^N\exp (s_{i,j}^c)}. \end{aligned}$$
(2)

Encoding the above-mentioned intuition can then simply be achieved by designing a loss of the form

$$\begin{aligned} L_{weak} = -\frac{1}{|\mathcal {L}|}\sum _{k\in \mathcal {L}} \log {S^k} - \frac{1}{|\bar{\mathcal {L}}|}\sum _{k\in \bar{\mathcal {L}}}\log (1-S^k), \end{aligned}$$
(3)

where \(S^k\) represents a candidate score for each class in the image. In short, the first term in Eq. 3 encourages each present class to be predicted somewhere in the image, while the second term penalizes the pixels that have high probabilities for the absent classes. In practice, instead of computing \(S^k\) as the maximum probability for class k over all pixels in the image (as in [12, 14]), we make use of the convex Log-Sum-Exp (LSE) approximation of the maximum (as in [16]), which can be written as

$$\begin{aligned} \tilde{S}^k = \frac{1}{r}\log \left[ \frac{1}{|I|}\sum _{i,j \in I}\exp (rS_{i,j}^k)\right] , \end{aligned}$$
(4)

where |I| denotes the total number of pixels in the image and r is a parameter allowing this function to behave in a range between the maximum and the average. In practice, following [16], we set r to 5.
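For clarity, a small NumPy sketch of the tags-only loss of Eq. 3 with the LSE aggregation of Eq. 4 is given below. It is a forward-pass illustration only (in practice the loss is implemented as a differentiable layer), and the function names are ours.

```python
import numpy as np

def lse(probs, r=5.0):
    """Log-Sum-Exp approximation of the maximum over pixels (Eq. 4)."""
    return np.log(np.mean(np.exp(r * probs))) / r

def weak_loss(S, present, absent, r=5.0):
    """Tags-only loss of Eq. 3. S: (N, H, W) softmax output; present/absent:
    lists of class indices observed / not observed in the image tags."""
    loss_present = -np.mean([np.log(lse(S[k], r)) for k in present])
    loss_absent = -np.mean([np.log(1.0 - lse(S[k], r)) for k in absent])
    return loss_present + loss_absent
```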

The loss in Eq. 3 does not rely on any notion of foreground and background. As a consequence, minimizing it will typically yield poor object localization accuracy. To overcome this issue, we propose to make use of our built-in foreground/background mask introduced in Sect. 3.2. Let \(M_{i,j}\) denote the mask value at pixel (i, j), i.e., \(M_{i,j} = 1\) if pixel (i, j) belongs to the foreground and 0 otherwise. We can then re-write our loss as

$$\begin{aligned} L_{mask} = -\frac{1}{|\mathcal {L}|-1}\sum _{k\in \mathcal {L},\, k \ne 0}\log (S_f^k) -\log (S^0) - \frac{1}{|\bar{\mathcal {L}}|\cdot |I|}\sum _{i,j\in I,\;k\in \bar{\mathcal {L}}}\log (1-S_{i,j}^k), \end{aligned}$$
(5)

where

$$\begin{aligned} S_f^k = \frac{1}{r}\log \left[ \frac{1}{|M|}\sum _{i,j | M_{i,j}=1}\exp (rS_{i,j}^k)\right] , \end{aligned}$$
(6)

and

$$\begin{aligned} S^0 = \frac{1}{r}\log \left[ \frac{1}{|\bar{M}|}\sum _{i,j | M_{i,j}=0}\exp (rS_{i,j}^0)\right] , \end{aligned}$$
(7)

where |M| and \(|\bar{M}|\) denote the number of foreground and background pixels, respectively, and \(S_f^k\) computes an approximate maximum probability for the present class k over all pixels in the foreground mask. Similarly, \(S^0\) denotes an approximate maximum probability for the background class over all pixels outside the foreground mask. In short, the loss of Eq. 5 favors present classes to appear in the foreground mask, while pixels predicted as background should be assigned to the background class and no pixels should take on an absent label.
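Continuing the previous sketch, the mask-aware loss of Eqs. 5–7 can be written as follows (again a NumPy forward-pass illustration with our own naming, not the authors' implementation).

```python
import numpy as np

def mask_loss(S, mask, present_fg, absent, r=5.0):
    """Mask-aware loss of Eqs. 5-7. S: (N, H, W) softmax output; mask: (H, W)
    binary foreground/background mask; class 0 is the background."""
    fg, bg = (mask == 1), (mask == 0)

    def lse_region(probs, region):             # Eqs. 6 and 7: LSE over a pixel set
        return np.log(np.mean(np.exp(r * probs[region]))) / r

    loss_fg = -np.mean([np.log(lse_region(S[k], fg)) for k in present_fg])
    loss_bg = -np.log(lse_region(S[0], bg))    # background class outside the mask
    loss_absent = -np.mean([np.log(1.0 - S[k]) for k in absent])  # all pixels, absent classes
    return loss_fg + loss_bg + loss_absent
```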

To learn the parameters of our network, we follow a standard back-propagation strategy to search for the parameters \(\theta \) that minimize the loss in Eq. 5. In particular, the network is fine-tuned using stochastic gradient descent (SGD) with momentum \(\mu \), which updates the weights by a linear combination of the negative gradient and the previous weight update. At inference time, given the test image, the network performs a dense prediction. We optionally apply a fully connected CRF to smooth the segmentation using the default parameters of [3].
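For completeness, a one-step sketch of the SGD-with-momentum update described above (the learning rate and momentum values are those reported in Sect. 4.2; the function name is ours):

```python
def sgd_momentum_step(theta, grad, v, lr=1e-4, mu=0.9):
    """One SGD-with-momentum update: the new step is a linear combination of
    the previous update and the negative gradient."""
    v = mu * v - lr * grad
    return theta + v, v
```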

Remark

Although our loss function performs well, an alternative formulation can be expressed as

$$\begin{aligned} L_{weak} = -\frac{1}{|I|}\sum _{i,j \in I}\log (S_{i,j}) - \frac{1}{|I|}\sum _{i,j\in I,\; k\in \bar{\mathcal {L}}}\log (1-S_{i,j}^k), \end{aligned}$$
(8)

where \(S_{i,j}\) represents the approximation of the maximum by using the LSE over the observed classes for each pixel as

$$\begin{aligned} S_{i,j} = \frac{1}{r}\log \left[ \frac{1}{|\mathcal {L}|}\sum _{k \in \mathcal {L}}\exp (rS_{i,j}^k)\right] . \end{aligned}$$
(9)

Such a formulation can also be extended to incorporate our mask, which yields

$$\begin{aligned} L_{mask} = -\frac{1}{|M|}\sum _{i,j | M_{i,j}=1}\log (S^f_{i,j}) -\frac{1}{|\bar{M}|}\sum _{i,j | M_{i,j}=0}\log (S_{i,j}^0) - \frac{1}{|I|}\sum _{i,j\in I,\; k\in \bar{\mathcal {L}}}\log (1-S_{i,j}^k), \end{aligned}$$
(10)

where \(S_{i,j}^0\) denotes the probability for the background class and

$$\begin{aligned} S^f_{i,j} = \frac{1}{r}\log \left[ \frac{1}{|\mathcal {L}|-1}\sum _{k \in \mathcal {L}, \; k \ne 0}\exp (rS_{i,j}^k)\right] \end{aligned}$$
(11)

computes an approximate maximum probability over the observed foreground classes.
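This alternative formulation of Eqs. 8–11 can be sketched in the same style as before; the per-pixel LSE over the observed classes replaces the region-wise LSE used above (illustrative NumPy code with our own naming).

```python
import numpy as np

def mask_loss_pixelwise(S, mask, present_fg, absent, r=5.0):
    """Alternative mask-aware loss of Eqs. 10-11. S: (N, H, W) softmax output."""
    fg, bg = (mask == 1), (mask == 0)
    # Eq. 11: per-pixel LSE over the observed foreground classes.
    S_f = np.log(np.mean(np.exp(r * np.stack([S[k] for k in present_fg])), axis=0)) / r
    loss_fg = -np.mean(np.log(S_f[fg]))                    # foreground pixels
    loss_bg = -np.mean(np.log(S[0][bg]))                   # background pixels
    loss_absent = -np.mean(np.sum([np.log(1.0 - S[k]) for k in absent], axis=0))
    return loss_fg + loss_bg + loss_absent
```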

We found that, while the two approaches starting from the losses in Eqs. 3 and 8 differ from each other, incorporating our masks in both of them using Eqs. 5 and 10 improves the segmentation quality considerably. Empirically, however, we found that this second formulation was slightly less effective than the one in Eq. 5. This will be further discussed in the experiments.

3.4 A Novel Weak Supervision: The CheckMask Procedure

The masks obtained with the approach introduced in Sect. 3.2 are not always perfect. This is due to the fact that the information obtained by fusing the activations of the fourth and fifth layers is noisy, and thus the solution found by inference in the CRF is not always the desired one. As a matter of fact, many other solutions also have a low energy (Eq. 1). Rather than relying on a single mask prediction, we propose to generate multiple such predictions, and provide them to a user who decides which one is the best one.

The problem of generating several predictions in a given CRF is known as the M-best problem. Here, in particular, we are interested in generating solutions that all have low energy, but are diverse, and thus follow the approach of [37]. In essence, this approach iteratively generates solutions, and, at each iteration, modifies the energy of Eq. 1 to encourage the next solution to be different from the ones generated previously. In practice, we make use of the Hamming distance as a diversity measure. This diversity measure can be encoded as an additional unary potential in Eq. 1, and thus comes at virtually no additional cost in the inference procedure. For more details about the diverse M-best strategy, we refer the reader to [37].
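As an illustration of this procedure, the sketch below generates diverse candidates by re-running dense-CRF inference while raising the unary energy of the labels selected by previous solutions (a Hamming-distance diversity term). The pydensecrf bindings, the kernel parameters, and the diversity weight `lam` are assumptions, not the authors' exact settings.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def diverse_masks(fg_prob, image, n_solutions=30, lam=0.1, n_iters=10):
    """Diverse M-best mask candidates in the spirit of [37]. After each solution,
    the unary energy of the labels it selected is increased by lam, encouraging
    the next solution to differ from the previous ones."""
    H, W = fg_prob.shape
    probs = np.stack([1.0 - fg_prob, fg_prob]).astype(np.float32)
    unary = unary_from_softmax(probs)                       # (2, H*W) unary energies
    masks = []
    for _ in range(n_solutions):
        crf = dcrf.DenseCRF2D(W, H, 2)
        crf.setUnaryEnergy(np.ascontiguousarray(unary))
        crf.addPairwiseGaussian(sxy=3, compat=3)
        crf.addPairwiseBilateral(sxy=60, srgb=10, compat=5,
                                 rgbim=np.ascontiguousarray(image))
        q = np.array(crf.inference(n_iters)).reshape(2, H, W)
        mask = (q[1] > q[0]).astype(np.uint8)
        masks.append(mask)
        flat = mask.reshape(-1)
        unary = unary.copy()
        unary[0, flat == 0] += lam                          # discourage repeating bg labels
        unary[1, flat == 1] += lam                          # discourage repeating fg labels
    return masks
```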

Ultimately, we generate several masks with this procedure, and ask the user to click on the one that best matches the input image. Such a selection can be achieved very quickly. In practice, we found that a user takes roughly 2–3 seconds per image to select the best mask. As a consequence, this new source of weak supervision remains very cheap while, as evidenced by our experiments, yielding a significant improvement over our tags-only formulation (Fig. 4).

Fig. 4. (a) Mask candidates generated with our approach. From left to right, we show the input image, the \(1^{st}\), \(5^{th}\), \(10^{th}\), \(15^{th}\), \(20^{th}\), \(25^{th}\) and \(30^{th}\) solutions. (b) Our new level of supervision: The annotator selects a mask which he/she thinks contains all foreground object(s) and the minimum amount of background.

4 Experiments

In this section, we first describe the datasets used for our experiments, and give some details about our learning and inference procedures. We then compare our approach to the state-of-the-art methods that use the same level of supervision as us. We provide an evaluation of our foreground/background masks in supplementary material.

4.1 Datasets

In our experiments, we first made use of the standard Pascal VOC 2012 dataset [32], which serves as a benchmark in most weakly-supervised semantic segmentation papers [12,13,14, 16, 17]. Similar to the dataset used in [12, 13, 17], this dataset contains \(N=21\) classes, and 10,582 training images (the VOC 2012 training set and the additional data annotated by [38]), 1,449 validation images and 1,456 test images. The image tags were obtained from the pixel-level annotations by simply listing the classes observed in each image. As in [12, 13, 16, 17], we report results on both the validation and the test set.

To further demonstrate the generality of our approach, we applied our method to a dataset that truly contains only image tags. To this end, we created a new training dataset from a subset of the MIRFLICKR-1M dataset [20]. In order to facilitate comparison, this subset was built using images containing the same classes as Pascal VOC 2012. In total, it contains 7,238 images, which were used for training purposes only. This new Flickr-based dataset does not provide any ground-truth pixel-level annotations and, hence, the Pascal VOC validation set was used as test data. This training data will be made publicly available upon acceptance of the paper.

For both datasets, we report the mean intersection over union (mIOU), averaged over the 21 classes.
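For reference, the mIOU metric can be computed from a class-confusion matrix as sketched below (standard practice; void pixels, labeled 255 in Pascal VOC, are ignored; function names are ours).

```python
import numpy as np

def confusion_matrix(pred, gt, n_classes=21):
    """Accumulate an (N, N) confusion matrix; rows = ground truth, cols = prediction."""
    valid = gt < n_classes                                  # ignore void pixels (label 255)
    idx = n_classes * gt[valid].astype(int) + pred[valid]
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def mean_iou(confusion):
    """Mean intersection-over-union averaged over the classes."""
    inter = np.diag(confusion).astype(np.float64)
    union = confusion.sum(axis=1) + confusion.sum(axis=0) - inter
    return np.mean(inter / np.maximum(union, 1))
```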

4.2 Implementation Details

Our network architecture is detailed in Sect. 3.1. The parameters of this network were found using stochastic gradient descent with a learning rate of \(10^{-4}\) for the first 40k iterations and \(10^{-5}\) for the next 20k iterations, a momentum of 0.9, a weight decay of 0.0005, and mini-batches of size 1. Similar to recent weakly-supervised segmentation methods [12,13,14, 16, 17], the network weights were initialized with those of a network pre-trained for a 1000-way classification task on the ILSVRC 2012 dataset [19]. Hence, for the last convolutional layer, we used the weights corresponding to the 20 classes shared by Pascal VOC and ILSVRC. For the background class, we initialized the weights with zero-mean Gaussian noise with a standard deviation of 0.1. At inference time, given only the test image, the network generates a dense prediction as a complete semantic segmentation map. We used C++ and Python (Caffe framework [39]) for our implementation. As in other methods [13, 17], we further optionally apply a dense CRF to refine this initial segmentation. To this end, we used the same CRF parameter values as these other approaches, i.e., the same as in [3].

4.3 Semantic Segmentation Results

We now compare our approach with state-of-the-art baselines. We first present the results obtained with image tags only, and then those with additional weak supervision. For the sake of completeness, in addition to the state-of-the-art baselines, we also report the results of our approach without using our foreground/background masks, i.e., by using Eq. 3 as training loss. We also provide the segmentation results achieved by training a model using the losses introduced in Eqs. 8 and 10. In the following, we will refer to our baseline as Ours (baseline), to our approach with tags only as Ours (tags) and to our approach with additional weak supervision as Ours (CheckMask). We indicate the additional use of a dense CRF to further refine our results with +CRF after the method’s name.

Pascal VOC with Image Tags. In Table 1, we compare our approach with our mask-free baseline and state-of-the-art methods on the task of semantic segmentation given only image tags during training. Note that our approach outperforms all the baselines by a large margin, whether we use CRF smoothing or not. Importantly, we outperform the methods based on an objectness prior [12, 16], which clearly shows the benefits of using our built-in foreground/background masks instead of external objectness algorithms. The importance of our mask is further evidenced by the fact that we outperform our mask-free baseline by 13.8 mIOU points. The best-performing baseline (MIL w/ILP) [16] uses a large amount of additional images (roughly 700K) from the ILSVRC2013 dataset to boost the accuracy of the basic MIL method; we still outperform this baseline, even without using any such additional data.

Table 1. Per class IOU on the PASCAL VOC 2012 validation set for methods trained using image tags.

Pascal VOC with Additional Weak Supervision. We then evaluate our approach on Pascal VOC with our additional CheckMask weak supervision procedure. While no other approaches have used this same kind of weak supervision, we report the results of methods that have used additional weak supervision of a similar cost. In particular, these include the point supervision of [12], the random crops of [13], the size information of [17] and the MCG segments of [16, 21]. The results of this comparison are provided in Table 2. Note that our CheckMask procedure yields an improvement of 4.2 mIOU points (and 4.9 mIOU points when a CRF is applied) over our tags-only approach. More importantly, our approach outperforms the baselines by a large margin. Note that other approaches have proposed to rely on labeled bounding boxes, which require a user to provide a bounding box for each individual foreground object in an image and to associate a label to each such bounding box. While this procedure is clearly more costly than ours, we achieve similar accuracy to these baselines (52.5 % for [13] when using labeled bounding boxes and 54.1 % for [13] when using labeled bounding boxes in an EM process vs. 51.49 % for our approach). We believe that this further evidences the benefits of our approach. We also report the results on the test set of Pascal VOC 2012 and compare our method with other baselines (see Table 3).

Table 2. Per class IOU on the PASCAL VOC 2012 validation set using additional supervision during training.
Table 3. Per class IOU on the PASCAL VOC 2012 test set.

Flickr (MIRFLICKR-1M) with Image Tags and Additional Weak Supervision. We now evaluate our method by training it on our new dataset containing a subset of the MIRFLICKR-1M images [20]. Since no other results have been reported on this dataset, we also computed the results of CCNN [17], whose code is publicly available and which has been shown to yield good accuracy in the previous experiments. In Table 4, we compare the results of our approach with this baseline when trained using tags only, and, as mentioned before, tested on the Pascal VOC 2012 validation dataset, since no ground-truth pixel-level annotations are available in Flickr. Note that our approach outperforms both our mask-free baseline and CCNN by a large margin. It is worth mentioning that this dataset contains three rare classes, Chair, Dining Table, and Sofa, which account for only 1.1 %, 0.5 %, and 1.3 % of the whole dataset, respectively. Although these classes contribute little to the training data, our approach still performs well in comparison to CCNN when segmenting them (17.0 % vs. 10.7 %, 31.2 % vs. 0 %, and 16.8 % vs. 0 %, respectively).

We then further used our CheckMask procedure to evaluate how much can be gained from some cheap additional weak supervision. Note that, here, we were unable to report the result of the CCNN with additional supervision, since, in practice, we did not have access to per-image object size information. Our results in Table 4 evidence the benefits of our CheckMask procedure over our tags-only approach (see also Fig. 5 for qualitative results). Note that selecting the best mask for all 7,238 training images took roughly 5 hours, which corresponds to 2.5 seconds per image. This shows that our additional level of weak supervision remains very cheap to obtain.

Table 4. Per class IOU on the PASCAL VOC 2012 validation set for models trained with a subset of the MIRFLICKR-1M dataset.
Fig. 5. Qualitative results. From left to right: the input image, the results of the model trained on Pascal VOC (columns 2, 3, and 4), the results of the model trained on Flickr (columns 5 and 6), and the ground truth. The last two rows show failure cases.

5 Conclusion

We have introduced a Deep Learning approach to weakly-supervised semantic segmentation that leverages foreground/background masks directly extracted from our network, which is pre-trained for the task of object recognition. Our experiments have shown that our approach outperforms the state-of-the-art methods when trained on image tags only. Furthermore, we have introduced a new level of weak supervision, consisting of selecting one mask among a set of candidates. This procedure can be achieved very easily, taking only roughly 2–3 seconds per image, and yields a further significant boost in accuracy. In the future, we intend to study whether jointly training the foreground/background mask extraction procedure and the weakly-supervised segmentation network can further improve our results.