1 Introduction

Fully supervised approaches have demonstrated excellent performance by training convolutional neural networks (CNNs) with human annotations, e.g., bounding boxes for object localization and pixel-wise class labels for semantic segmentation [1,2,3,4,5]. However, obtaining accurate annotations requires substantial human labor. Therefore, weakly supervised approaches that use only image-level supervision have received significant attention across various computer vision tasks [6,7,8,9,10,11,12]. In particular, weakly supervised object localization (WSOL) is a challenging task that pursues both classification and localization of the target object when the training dataset provides only class labels.

For example, Zhou et al. [6] generate class activation maps (CAM) using a classification model with global average pooling (GAP). CAM highlights the class-specific discriminative regions in a given image [7,8,9, 13]. The crucial pitfall of such activation maps is that they focus on discriminative parts (e.g., the head of a bird) rather than covering the full extent of the object. To mitigate this limitation, recent methods [9, 14, 15] propose to erase the most discriminative parts by thresholding so that the activations spread out to less discriminative regions. However, they are likely to extend the activations excessively into the background, which over-estimates the bounding boxes (Fig. 1).

Fig. 1.

Comparison of methods for generating activation maps on the CUB [16] dataset. We display the final results obtained by ADL [9] (first row), SPG [8] (second row), and our method (last row). The red boxes are the ground-truth and the green boxes are the predicted ones. Activation maps are illustrated in a heatmap color scale. ADL tries to activate more on less discriminative parts but ends up extending excessively into the background. SPG tries to suppress the background but still over-estimates the object regions. In contrast, our method covers the whole object delicately without extending into the background. (Color figure online)

In this paper, we propose four ingredients for more accurate attention over the entire object: a contrastive attention loss, a foreground consistency loss, a non-local attention block, and a dropped foreground mask. The contrastive attention loss draws the foreground feature and its erased version close together, and pushes the erased foreground feature away from the background feature (Sect. 3.2). It helps the learned representation reflect only the object region rather than the background, which is usually helpful for classification but harmful to localization. The foreground consistency loss penalizes disagreement of attentions between layers to provide early layers with a sense of backgroundness (Sect. 3.3). While low-level features are usually activated on locally distinctive regions (e.g., edges) regardless of the presence of objects, adding the foreground consistency loss boosts the activations on the object regions while suppressing those on the background regions. Furthermore, we apply non-local attention blocks to produce enhanced attention maps that consider the similarity between locations in a feature map (Sect. 3.4). This boosts the weights on regions whose features are similar to the most discriminative parts, pursuing correct activation. Last but not least, we propose a dropped foreground mask that drops the background region as well as the most discriminative region. It prevents the model from excessively spreading attention to the background.

Our method achieves state-of-the-art performance in terms of both the conventional top-1 localization accuracy and MaxBoxAccV2 [17].

In summary, our main contributions are:

  • We propose a contrastive attention loss that favors similarity between the foreground feature and its dropped version, and dissimilarity between the dropped foreground feature and the background feature.

  • We propose a foreground consistency loss that provides a sense of localization to earlier layers by guiding their features to be consistent with a high-level layer.

  • We propose a dropped foreground mask which drops the background region and the most discriminative region.

  • Our method achieves state-of-the-art performance on CUB-200-2011 and ImageNet benchmark datasets in terms of top-1 localization accuracy and MaxBoxAccV2.

2 Related Work

Weakly Supervised Object Localization (WSOL). Given only the class labels with the images, most WSOL methods train a classifier and extract CAM [6]. CAM indicates how strongly each location in the feature map stimulates the corresponding class [7,8,9, 13]. Recent methods [6,7,8,9,10, 13] propose erasing the most discriminative region of the feature map to spread the activations to regions that are less discriminative but still within the object. Hide-and-Seek (HaS) [13] divides a training image into a grid of evenly-divided patches and selects a random patch to be hidden. Adversarial complementary learning (ACoL) [7] and attention-based dropout layer (ADL) [9] partially drop the most discriminative region by thresholding the feature map. MEIL [18] runs two branches, one with erasing and one without, and imposes a classification task on both branches. These approaches guide the models to discover previously neglected object regions. Our method goes a step further by also considering the background as a region to drop, so that the model does not spread its activation excessively into the background.

Several methods have been proposed to suppress the background and localize the whole object. Zhang et al. [8] present self-produced guidance (SPG) that generates three pixel-wise masks (foreground, background, and undefined areas). Each mask is used as auxiliary supervision. However, it requires finding six optimal hyperparameters for producing the three masks. We also focus on the background but introduce a simpler and more effective way.

Fig. 2.

Overview of the proposed method. The non-local attention block generates the enhanced attention map reflecting the similarity between locations. We create a dropped foreground mask and an importance map using thresholding and sigmoid activation, respectively. The selected map is multiplied with the input feature to feed the next layer. The foreground consistency loss encourages consistency between the early and last layers. We calculate the contrastive attention loss at each convolution layer where our non-local attention block is inserted.

Yang et al. [10] use a non-local block following every convolution-pooling block. While their non-local blocks are inserted within the main stem of the network, our non-local attention blocks branch off the main stem and produce attention maps that are multiplied with the main convolutional features at chosen layers.

Contrastive Visual Representation Learning. Contrastive learning [19] tries to distinguish similar and dissimilar pairs of samples by embedding the samples as feature representations. Recent self-supervised learning methods [20, 21] learn representations by maximizing agreement between differently augmented views of the same image. They also use different images as negative pairs whose agreement is minimized.

Inspired by [20, 21], we define a contrastive prediction task for WSOL. Instead of building similar and dissimilar pairs of image samples, we regard the foreground region except for the most discriminative part (i.e., the dropped foreground) as an anchor, and build the positive pair with the original foreground and the negative pair with the background. Our contrastive objective does not require a large batch size or a large queue because it finds the pairs within an image. Separating the foreground representation from the background representation is suitable for the WSOL task.

3 Proposed Method

This section describes elements of the proposed method and how we employ them on the networks.

3.1 Network Overview

As shown in Fig. 2, we augment a classification network with the non-local attention blocks (Sect. 3.4) and train it with the contrastive attention loss (Sect. 3.2) and the foreground consistency loss (Sect. 3.3). The non-local attention block receives a feature map \(\mathbf {F}\) and provides an enhanced attention map \(\mathbf {A}\), which becomes an importance map \(\tilde{\mathbf {A}}\) through sigmoid activation and a dropped foreground mask \(\mathbf {M}_\text {dfg}\) by thresholding (Eq. 1). Either the dropped foreground mask or the importance map is randomly chosen based on a drop_rate, and the chosen one is applied to the input feature by pixel-wise multiplication (element-wise multiplication with broadcasting over the channel dimension); the importance map is not dropped but applied to the feature map as is. The dropped foreground mask encourages the input feature to activate on less discriminative parts excluding the background, maximizing classification accuracy without losing localization accuracy, while the importance map rewards higher activation on the most discriminative part.

In the attention branch, the enhanced attention map and the dropped foreground mask from a non-local attention block are used to compute the contrastive attention loss. In addition, the enhanced attention maps from multiple non-local attention blocks are used to compute the foreground consistency loss.

The differences with ADL [9] in the forward process are that we use the dropped foreground mask instead of the drop mask, and that the attention map is produced by our non-local attention block instead of the vanilla convolutional feature. Figure 3 illustrates the importance map, our dropped foreground mask, and the drop mask of [9]. Our dropped foreground mask \(\mathbf {M}_\text {dfg}\) is defined by:

$$\begin{aligned} \mathbf {M}_\text {dfg}= \mathbbm {1}[\mathbf {A}< \theta _\text {fg}] \wedge \mathbbm {1}[\mathbf {A}> \theta _\text {bg}], \end{aligned}$$
(1)

where \(\mathbbm {1}[\cdot ]\) denotes a matrix with the same shape as the input, having ones where the condition holds, \(\wedge \) denotes the logical AND operation, and the \(\theta \)'s are pre-defined thresholds. Unlike the drop masks of ACoL [7] and ADL [9], our dropped foreground mask remedies the excessive expansion of activation to the background by further erasing background regions in the mask.
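The following is a minimal PyTorch sketch of Eq. 1 together with the random selection between the importance map and the dropped foreground mask described above. The tensor shapes, function names, and the default values of drop_rate, \(\gamma _\text {fg}\), and \(\gamma _\text {bg}\) are illustrative assumptions (the actual values are backbone-specific, see Table 1); the thresholding rule follows Sect. 4.1.

```python
import torch

def dropped_foreground_mask(A, gamma_fg=0.8, gamma_bg=1.0):
    # Sketch of Eq. 1: keep locations that are neither the most discriminative
    # part (A >= theta_fg) nor background (A <= theta_bg). Thresholds follow the
    # rule in Sect. 4.1; the gamma defaults here are placeholders.
    theta_fg = gamma_fg * A.amax(dim=(2, 3), keepdim=True)   # max intensity of A
    theta_bg = gamma_bg * A.mean(dim=(2, 3), keepdim=True)   # average intensity of A
    return ((A < theta_fg) & (A > theta_bg)).float()

def attention_branch(feat, A, drop_rate=0.75):
    # feat: B x C x H x W input feature, A: B x 1 x H x W enhanced attention map.
    # With probability drop_rate the dropped foreground mask is applied,
    # otherwise the sigmoid importance map (Sect. 3.1).
    if torch.rand(1).item() < drop_rate:
        mask = dropped_foreground_mask(A)   # drops background and the most discriminative part
    else:
        mask = torch.sigmoid(A)             # importance map
    return feat * mask                      # broadcast over the channel dimension
```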

The contrastive attention loss and the foreground consistency loss are computed wherever the attention maps are extracted.

Fig. 3.

Examples of the importance map \(\tilde{\mathbf {A}}\), the drop mask from [9], and our dropped foreground mask \(\mathbf {M}_\text {dfg}\).

3.2 Contrastive Attention Loss

Contrastive loss [20] is a function whose value is low when a query is similar to its matching instance and dissimilar to other instances. Likewise, we design a contrastive attention loss whose value is low when a dropped foreground feature \(\mathbf {z}_\text {dfg}\) is similar to a foreground feature \(\mathbf {z}_\text {fg}\) and dissimilar to a background feature \(\mathbf {z}_\text {bg}\) (Fig. 4). The features \(\mathbf {z}_\alpha \) are obtained by masked global average pooling of \(\mathbf {F}\odot \mathbf {M}_\alpha \) where

$$\begin{aligned} \begin{aligned} \mathbf {M}_\text {fg}= \mathbbm {1}[\mathbf {A}> \theta _\text {bg}], \\ \mathbf {M}_\text {bg}= \mathbbm {1}[\mathbf {A}< \theta _\text {bg}], \end{aligned} \end{aligned}$$
(2)

and the masked global average pooling is spatial average pooling of the pixels whose value on the mask is 1. Then, the contrastive attention loss is given by

$$\begin{aligned} \mathcal {L} _{ca} = d(\mathbf {z}_\text {dfg}, \mathbf {z}_\text {fg}) + \left[ m - d(\mathbf {z}_\text {dfg}, \mathbf {z}_\text {bg}) \right] _+, \end{aligned}$$
(3)

where \([\cdot ]_+=\text {max}(\cdot , 0)\), \(d(\cdot , \cdot )\) denotes the \(L_2\) distance in an auxiliary 128-dimensional embedding obtained by a \(1\times 1\) convolution, and m denotes the margin.

Our contrastive attention loss guides the attention map to spread until it reaches the object boundary, because including background in the attention map is penalized by the dissimilarity term. In addition, the similarity term favors homogeneous features between the most discriminative part and less discriminative parts in the foreground region. Our contrastive attention loss requires neither mining positive and negative samples as in the triplet loss [22] nor managing a large set of negative samples [20, 21], since we regard the masked features \(\mathbf {z}_{\text {dfg}}, \mathbf {z}_{\text {fg}}\) and \(\mathbf {z}_{\text {bg}}\) from a single image as the anchor, the positive sample, and the negative sample, respectively.
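As a concrete reference, here is a minimal PyTorch sketch of the contrastive attention loss under the definitions above. The module and argument names, and the choice of applying the \(1\times 1\) embedding to the pooled features (equivalent to a linear projection), are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAttentionLoss(nn.Module):
    # Sketch of the contrastive attention loss (Eqs. 1-3).
    def __init__(self, in_channels, embed_dim=128, margin=1.0):
        super().__init__()
        # auxiliary 128-d embedding; a 1x1 conv on a pooled C-d vector acts as a linear map
        self.embed = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        self.margin = margin

    @staticmethod
    def masked_gap(feat, mask):
        # spatial average pooling over locations where the mask is 1
        return (feat * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1.0)

    def forward(self, feat, A, theta_fg, theta_bg):
        m_fg = (A > theta_bg).float()                      # Eq. 2, foreground
        m_bg = (A < theta_bg).float()                      # Eq. 2, background
        m_dfg = ((A < theta_fg) & (A > theta_bg)).float()  # Eq. 1, dropped foreground
        pooled = [self.masked_gap(feat, m) for m in (m_dfg, m_fg, m_bg)]
        z_dfg, z_fg, z_bg = [self.embed(p[..., None, None]).flatten(1) for p in pooled]
        d_pos = F.pairwise_distance(z_dfg, z_fg)           # anchor vs. positive (foreground)
        d_neg = F.pairwise_distance(z_dfg, z_bg)           # anchor vs. negative (background)
        return (d_pos + (self.margin - d_neg).clamp(min=0)).mean()
```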

Fig. 4.

The details of the contrastive attention loss, where A denotes an enhanced attention map from a non-local attention block. We generate three maps and their corresponding features and compare their similarity. \(\odot \) denotes pixel-wise multiplication. The contrastive attention loss is computed in an embedding space.

3.3 Foreground Consistency Loss

Attention maps roughly represent the magnitude of activation at every location. Convolutions in early layers activate more on locally distinctive regions such as edges and corners [23], without inspecting the entire extent of objects due to their limited receptive field. To alleviate this problem, we propose a foreground consistency loss that encourages attention maps from early layers to resemble those from later layers (Fig. 2).

Let \({A}_{i}\) and \({A}_{j}\) be the attention maps from an early and a later layer, respectively. Then we define the foreground consistency loss as:

$$\begin{aligned} \mathcal {L} _{fc} = \left\Vert {A}_{i} - {A}_{j} \right\Vert ^2_2, \end{aligned}$$
(4)

where \(\Vert \cdot \Vert _2\) denotes the \(L_2\) norm of a matrix.

Gradients from the foreground consistency loss flow only through the early layer to achieve the abovementioned goal. It reduces noisy activations outside the object and boosts activations inside the object.
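A minimal sketch of Eq. 4 in PyTorch, assuming the reference (later) attention map is detached so that gradients flow only into the early layer as stated above; the bilinear resizing for mismatched resolutions and the batch-mean reduction are assumptions for illustration.

```python
import torch.nn.functional as F

def foreground_consistency_loss(A_early, A_late):
    # Sketch of Eq. 4: squared L2 distance between attention maps.
    # The later (reference) map is detached so gradients reach only the early layer.
    A_late = A_late.detach()
    if A_early.shape[-2:] != A_late.shape[-2:]:
        # attention maps from different stages may differ in resolution (assumption)
        A_late = F.interpolate(A_late, size=A_early.shape[-2:],
                               mode='bilinear', align_corners=False)
    return ((A_early - A_late) ** 2).sum(dim=(1, 2, 3)).mean()
```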

3.4 Non-local Attention Block

In order to provide additional capacity for the network to produce a correct attention map, we employ a non-local block [24] instead of average channel pooling of the convolutional features [9, 25]. Given a feature map, our non-local attention block embeds it into three different embeddings and outputs a spatial summation of the third embedding weighted by the similarity between the first two. The enhanced attention map is then defined as the channel-pooled result.

Specifically, the block receives a feature map \(x\in \mathbb {R}^{C\times H\times W}\) from a convolution layer. For simplicity, we omit the mini-batch dimension. We define \(f(x), g(x)\in \mathbb {R}^{\tilde{C}\times H\times W}\) and \(z(x)\in \mathbb {R}^{C\times H\times W}\), each computed by a \(1\times 1\) convolution layer. Then, f(x), g(x), and z(x) are reshaped to \(f(x), g(x)\in \mathbb {R}^{\tilde{C}\times HW}\) and \(z(x)\in \mathbb {R}^{C\times HW}\), respectively.

The enhanced attention map A is given by:

$$\begin{aligned} {A} = \mathbb {E}_C[\text {Softmax}({f(x)}^{T}g(x)) \odot z(x)], \end{aligned}$$
(5)

where \(\mathbb {E}_C\) denotes average pooling over the channel dimension.

The non-local attention block produces the enhanced attention map with regard to similarities between locations. It unleashes the receptive field of the layer and provides an additional clue for deciding where to attend. Our non-local attention block differs from [10] in that we use it only to generate the enhanced attention maps at chosen layers, whereas Yang et al. [10] apply the non-local module with a residual connection to all layers in the main branch.
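Below is a minimal PyTorch sketch of the block. Following the textual description (a spatial summation of the value embedding weighted by the similarity between the first two embeddings, then averaged over channels), the weighting is implemented as a matrix product; the channel-reduction ratio, the softmax axis, and the class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class NonLocalAttentionBlock(nn.Module):
    # Sketch of the non-local attention block (Sect. 3.4): the value embedding z
    # is spatially aggregated with weights given by the similarity between the
    # f and g embeddings, then averaged over channels (E_C) to produce the
    # enhanced attention map A of shape B x 1 x H x W.
    def __init__(self, in_channels, reduced_channels=None):
        super().__init__()
        c_tilde = reduced_channels or max(in_channels // 8, 1)  # reduction ratio is an assumption
        self.f = nn.Conv2d(in_channels, c_tilde, kernel_size=1)
        self.g = nn.Conv2d(in_channels, c_tilde, kernel_size=1)
        self.z = nn.Conv2d(in_channels, in_channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        f = self.f(x).flatten(2)                              # B x C~ x HW
        g = self.g(x).flatten(2)                              # B x C~ x HW
        z = self.z(x).flatten(2)                              # B x C  x HW
        sim = torch.softmax(f.transpose(1, 2) @ g, dim=-1)    # B x HW x HW location similarities
        out = z @ sim                                         # weighted spatial summation
        return out.mean(dim=1, keepdim=True).view(b, 1, h, w) # E_C: channel average
```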

3.5 Training and Inference

We train the base network and non-local attention block with the full objective:

$$\begin{aligned} {\mathcal {L}}_{total} = {\mathcal {L}}_{cls} + {\mathcal {L}}_{ca} + {\mathcal {L}}_{fc} \end{aligned}$$
(6)

We employ a GAP layer at the end of the network to produce the softmax output \(\hat{y}\) and compute the classification loss given the one-hot ground-truth label y:

$$\begin{aligned} {\mathcal {L}}_{cls} = \text {CrossEntropy}(\hat{y}, y) \end{aligned}$$
(7)

All network weights are updated by the gradients of all losses flowing towards the input, except that the foreground consistency loss does not convey gradients to its reference (later) layer.
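For concreteness, a sketch of Eqs. 6 and 7 that reuses the loss sketches above. The way features and attention maps are gathered from the instrumented layers, the pairing of every early map with the last (reference) map, and the placeholder \(\gamma \) defaults are assumptions (the actual hyperparameters are backbone-specific, Table 1).

```python
import torch.nn.functional as F

def total_loss(logits, labels, feats, attn_maps, ca_losses,
               gamma_fg=0.8, gamma_bg=1.0):
    # Sketch of Eqs. 6-7. feats / attn_maps hold the features and enhanced
    # attention maps of the layers where a non-local attention block is inserted;
    # the last attention map serves as the reference for the consistency loss.
    loss_cls = F.cross_entropy(logits, labels)                    # Eq. 7
    loss_ca = 0.0
    for feat, A, ca in zip(feats, attn_maps, ca_losses):
        theta_fg = gamma_fg * A.amax(dim=(2, 3), keepdim=True)
        theta_bg = gamma_bg * A.mean(dim=(2, 3), keepdim=True)
        loss_ca = loss_ca + ca(feat, A, theta_fg, theta_bg)       # contrastive attention loss
    loss_fc = sum(foreground_consistency_loss(A, attn_maps[-1])
                  for A in attn_maps[:-1])                        # foreground consistency loss
    return loss_cls + loss_ca + loss_fc                           # Eq. 6
```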

Our non-local attention block is applied only during training and deactivated in the testing phase. The input image goes through only the vanilla model to produce the class assignment. Then we follow [17] to extract the heatmap, from which the bounding boxes are obtained by thresholding and taking its connected contours.
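A sketch of this box-extraction step, assuming an OpenCV-based implementation; the threshold tau is a placeholder (the protocol of [17] sweeps it to find the optimal value), and the resizing and normalization steps are assumptions for illustration.

```python
import cv2
import numpy as np

def heatmap_to_boxes(cam, image_size, tau=0.2):
    # Sketch of the box extraction at test time: normalize the score map,
    # threshold it, and take one bounding box per connected contour.
    cam = cv2.resize(cam.astype(np.float32), image_size)   # (W, H) target size
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    mask = np.uint8(cam >= tau) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]         # (x, y, w, h) per region
    return [(x, y, x + w, y + h) for (x, y, w, h) in boxes]
```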

4 Experiments

4.1 Experimental Setup

Datasets. We evaluate the proposed method on two benchmark datasets: CUB-200-2011 [16] and ILSVRC [26] (ImageNet) for the WSOL task, from which only the image-level labels are used in training. Many weakly supervised methods have used full supervision to some extent, directly or indirectly, for hyperparameter tuning. Since the amount of full supervision used for hyperparameter tuning has not been consistent, fair comparison under the previous evaluation protocol has been ambiguous. We follow the recent evaluation protocol [17], which fixes the amount of full supervision allowed for hyperparameter search. Each dataset is divided into three subsets: train-weaksup, train-fullsup, and test. The train-weaksup split includes images with only the class labels for training. The train-fullsup split contains images with full supervision, i.e., bounding boxes as well as class labels; users are free to use it for hyperparameter search. The authors of [17] collected five images per class (1,000 images in total) from Flickr for the CUB experiments and ten images per class (10,000 images in total) from ImageNetV2 [27] for the ImageNet experiments. The test split used for the final numbers is the same as in the standard WSOL settings on CUB and ImageNet [7,8,9, 17, 28]. The CUB dataset provides 5,994 images for training and 5,794 for testing from 200 bird species. ImageNet consists of 1.2M training images and 10K test images across 1,000 classes. All experimental analyses of the proposed method are conducted on the test splits of the two abovementioned datasets.

Evaluation Metrics. We use top-1 classification accuracy, top-1 localization accuracy, and MaxBoxAccV2 [17].

Top-1 classification accuracy is the ratio of correctly classified samples. The conventional top-1 localization accuracy measures the ratio of samples with the correct class and a bounding box whose IoU with the ground truth is greater than 0.5.

MaxBoxAcc measures the ratio of samples with a correct box, where correctness is defined by an IoU criterion \(\delta \) at the optimal activation threshold. MaxBoxAccV2 averages MaxBoxAcc over three IoU criteria \(\delta \in \{0.3, 0.5, 0.7\}\) to address diverse demands for localization fineness. It is similar to the common GT-known metric but differs in that it evaluates on three IoUs by extracting the bounding box with the optimal score map threshold. We use the % symbol to denote percent points when mentioning differences in comparisons.
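A small sketch of how MaxBoxAccV2 aggregates per-sample IoUs under the description above; the data layout (best IoU per sample, per score-map threshold) is an assumption for illustration, not the reference implementation of [17].

```python
import numpy as np

def max_box_acc_v2(best_ious_per_threshold, deltas=(0.3, 0.5, 0.7)):
    # best_ious_per_threshold[t][n]: best IoU between the ground-truth box and
    # the boxes extracted from sample n at score-map threshold t (assumed layout).
    accs = []
    for delta in deltas:
        box_acc = [np.mean(np.asarray(ious) >= delta)     # ratio of correct boxes
                   for ious in best_ious_per_threshold]
        accs.append(max(box_acc))                          # optimal score-map threshold
    return float(np.mean(accs))                            # average over IoU criteria
```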

Implementation Details. We build the proposed method upon three CNN backbones: VGG16 [29], InceptionV3 [30], and ResNet50 [31]. We need three hyperparameters: the drop_rate for randomly choosing between the importance map and the dropped foreground mask, and \(\theta _\text {fg}\) and \(\theta _\text {bg}\) for thresholding. The threshold \(\theta _\text {fg}\) is set to the maximum intensity of \(\mathbf {A}\) times a pre-defined ratio \(\gamma _\text {fg}\), and \(\theta _\text {bg}\) is set to the average intensity of \(\mathbf {A}\) times a pre-defined ratio \(\gamma _\text {bg}\). The specific values of the hyperparameters for each backbone are shown in Table 1.

The layers from which the enhanced attention maps are extracted are chosen to be the same as in the baseline method [9]. We calculate our contrastive attention loss and foreground consistency loss at all layers where attention maps are produced. We set the batch size to 32, the weight decay to 0.0001, and the margin m to 1. The initial learning rate and the momentum of the SGD optimizer are set to 0.001 and 0.9, respectively. We initialize the network with weights pre-trained on ImageNet classification [26] and then fine-tune it. Our model is implemented in PyTorch and trained on two NVIDIA GeForce RTX 2080 Ti GPUs for approximately three hours. The input images are randomly cropped to 224 \(\times \) 224 pixels after being resized to 256 \(\times \) 256 pixels. During the testing phase, we directly resize the input images to 224 \(\times \) 224.

Table 1. Hyperparameters (drop_rate, \(\gamma _\text {fg}\), \(\gamma _\text {bg}\)) for each backbone.

4.2 Ablation Study

We first present detailed experiments to validate the effectiveness of each component. We fix ResNet50 [31] as the backbone and add or remove each component. The experiments are performed on the CUB test split. Differences in performance reported in % represent percent points.

Ablation of the Proposed Losses. Table 2 shows that both the contrastive attention loss and the foreground consistency loss are crucial elements for the improved performance. Ours without the contrastive attention loss achieves 2% lower performance than the full setting. The loss has a positive effect on all three IoU thresholds. The foreground consistency loss also contributes an improvement of 0.79%. It especially boosts the accuracy at IoU 0.7. We suggest that the loss helps precisely estimate the location of the object in the early layers by providing hints from the later layer. Using both losses leads to balanced improvements over all IoU thresholds. In addition, the contrastive attention loss with the normalized temperature-scaled cross-entropy loss (NT-Xent) [20, 21] also shows improvements to some extent. Its result can be found in the supplementary material.

Table 2. The ablation study for each element of our method on the ResNet50 [31] backbone in terms of MaxBoxAccV2. Contrastive: contrastive attention loss. \( \mathcal {L} _\text {fc}\): foreground consistency loss. Non-local: non-local attention block. \(\mathbf {M}_\text {dfg}\): dropped foreground mask. All elements contribute to the performance improvement.

Effectiveness of the Non-local Attention Block. If we use the vanilla attention map, i.e., the channel-pooled result of the convolutional feature, the performance drops by 0.62% (the fourth row in Table 2). This shows that considering the relationship between pixels in the feature map helps localize where to attend.

Effectiveness of the Dropped Foreground Mask. Here we validate the effectiveness of replacing the drop mask [9] with the dropped foreground mask \(\mathbf {M}_\text {dfg}\). Without the replacement, the model achieves 1.12% lower performance than the full setting (the fifth row in Table 2). Also, replacing only the drop mask with the dropped foreground mask improves the performance of the baseline [9] by 0.87%. We suppose that the dropped foreground mask improves ours more than the baseline because the additional two losses and the non-local attention block provide an extra guide for a better importance map.

Location of Our Attention Block. We investigate the influence of where to insert our non-local attention block on VGG16 [29] and report the results in Table 3. The conv_5_3 layer is fixed as the reference layer for the foreground consistency loss, and its preceding layers are added one by one cumulatively. The setting with the top three layers, which is the same as in the baseline [9], achieves the best performance in terms of MaxBoxAccV2. Adding the attention blocks on the pool_1 and pool_2 layers decreases the performance. We suppose that the reason is their small receptive field, which leads to noisy activations on extremely locally salient regions. Hence, we do not use the attention mechanism on the two earliest layers.

Table 3. Performance comparison regarding at which layer to insert our attention block. The contrastive attention loss and the foreground consistency loss are in use for all cases. The conv_5_3 layer is fixed as the reference layer and its performance is left empty because the foreground consistency loss requires at least two layers. We add the layers from later to earlier and report their performance in a cumulative setting.
Table 4. MaxBoxAccV2 [17] comparison with the state-of-the-art methods. The results for each backbone represent the average of the three IoU thresholds 0.3, 0.5, and 0.7. VGG: VGG16 [29]. Inc: InceptionV3 [30]. Res: ResNet50 [31]. The best and the second best entries in a column are marked in boldface and italic, respectively.

4.3 Comparison with State-of-the-art Methods

We compare our method with the state-of-the-art WSOL methods in terms of the MaxBoxAccV2 [17], top-1 localization and top-1 classification accuracy.

MaxBoxAccV2 [17]. Table 4 shows a comparison of MaxBoxAccV2 across all competitors on ImageNet and CUB. Our method outperforms all existing methods in terms of MaxBoxAccV2 (Mean) and on most backbone choices. Table 5 shows a detailed comparison with the runner-up methods on each dataset. Our method boosts performance especially when the IoU criteria are 0.5 and 0.7, except when the Inception network is the backbone. Our method exhibits the largest improvement when employed on the ResNet backbone.

Table 5. Detailed MaxBoxAccV2 [17] comparison with the runner-up methods on each dataset. We compare ours and the second best methods on each dataset and each backbone in terms of MaxBoxAccV2, including individual measures on the three IoU criteria. Mean indicates the average value over the three IoU thresholds. VGG: VGG16 [29]. Inc: InceptionV3 [30]. Res: ResNet50 [31]. Bold texts denote the best performance in each column.

Top-1 Localization Accuracy. Top-1 localization accuracy on the ImageNet and CUB datasets is shown in Table 6. Our model outperforms the state-of-the-art methods in most settings. Note that, following the competitors, we do not perform hyperparameter tuning using the train-fullsup split for a fair comparison.

Table 6. Conventional Top-1 localization accuracy comparison with the state-of-the-art methods. The values are taken from their respective papers. Bold texts denote the best performance in each backbone network.

Top-1 Classification Accuracy. Table 7 compares our method with the state-of-the-art methods in terms of top-1 classification accuracy. While some other methods compromise classification accuracy for improving localization, our method achieves the best MaxBoxAccV2 and localization accuracy without damaging the classification accuracy.

Table 7. Top-1 classification performance of the state-of-the-art methods. Hyperparameters for each method are optimally selected for the localization performance on the train-fullsup split. Bold texts denote the best performance. MEIL does not provide code for reproduction, so its values are taken from the paper. Other values are reproduced from [17].

4.4 Qualitative Results

Figure 1 compares activation maps and estimated bounding boxes from ADL [9], SPG [8], and ours. ADL excessively covers the background because it simply encourages the model to use less discriminative parts, and SPG still over-estimates the bounding boxes although it tries to suppress the background. In contrast, our method focuses on the entire object more accurately and estimates tighter bounding boxes. Figure 5 illustrates more examples from our model. Our method not only spreads the activations beyond the most discriminative parts, but also restrains them within the object regions. Note that the water and the mirrored image of the pelican do not receive large activations even though they are helpful cues for classification (the second row of the second column).

Fig. 5.

Qualitative examples of activation maps and localization produced by our model on the ImageNet and CUB test splits. The red boxes are the ground-truth and the green boxes are the predicted ones. The maps are colored from red (higher importance) to blue (lower importance, e.g., background). (Color figure online)

5 Conclusion

In this paper, we consider the background as an important clue for localizing the entire object without excessive coverage and present two novel objective functions. The crucial weakness of previous methods is that they either focus on discriminative parts rather than localizing the whole object, or extend too much into the background. The proposed contrastive attention loss guides the model to spread the attention map within the objects. The foreground consistency loss decreases the activation on the background in the early layers. The generated attention map not only better localizes the target object but also suppresses the background concurrently. In addition, our non-local attention block enhances the attention map with a larger capacity to better optimize the proposed losses. We achieve state-of-the-art performance on the ImageNet and CUB-200-2011 datasets and provide detailed analysis of the effects of our individual components.