1 Introduction

Weakly supervised object localization (WSOL) aims to localize objects in a scene when only image-level labels, rather than bounding box annotations, are available during training. Owing to the reduced cost of manual labeling and the potential to exploit the vast number of weakly-annotated images in public datasets and on the Web, WSOL is gaining more and more attention in the research community (Selvaraju et al., 2020; Zhai et al., 2022; Luo et al., 2022; Zhang et al., 2021b). Moreover, it can serve various downstream tasks, such as weakly supervised object detection (WSOD) (Song et al., 2021; Zhang et al., 2020b, 2019) and weakly supervised semantic segmentation (WSSS) (Ru et al., 2022a; Chan et al., 2021; Pan et al., 2022).

This paper proposes an effective approach for WSOL and its downstream task WSSS. The two tasks are similar in that both use image-level labels as supervision and need to obtain a high-quality pixel-level localization map from a classification network. In fact, WSSS can be implemented by directly training a fully supervised semantic segmentation network with the localization maps generated by WSOL as pseudo labels. For these reasons, both tasks face the same challenge: establishing effective supervision between image-level labels and pixel-level localization maps.

Previously, most WSOL and WSSS methods utilized the class activation map (CAM) (Zhou et al., 2016) to extract a localization map from the classifier. While CAM can localize approximate object regions, it tends to capture the most discriminative regions rather than the overall extent of the object, resulting in limited localization performance. Numerous CAM-based approaches have therefore been proposed to alleviate this problem. Adversarial erasing methods (Singh & Lee, 2017; Zhang et al., 2018a; Choe & Shim, 2019; Mai et al., 2020; Yun et al., 2019) erase the most discriminative regions during training, forcing the network to learn more object features that facilitate complete localization. Some methods (Zhang et al., 2020d; Pan et al., 2021; Lee et al., 2022a) improve the localization performance of CAM by establishing pixel-level spatial and semantic correlations. Other methods (Zhang et al., 2018b; Wei et al., 2021; Kolesnikov & Lampert, 2016) adopt the idea of region growing to spread confident regions and mine relevant features.

Although CAM-based methods can conveniently extract the localization map from the classifier, they suffer from limitations and conflicts in optimization, since the classifier has to perform both localization and classification. Very recently, a CAM-independent paradigm (Meng et al., 2021; Xie et al., 2021) was devised for WSOL that achieves localization with a foreground prediction map (FPM) obtained directly from a generator, allowing the two tasks to be accomplished separately within a unified model. For example, ORNet (Xie et al., 2021) is a two-stage approach that first trains a classification network as an evaluator and then uses CE loss to guide the learning of the generator by masking the original image with the foreground prediction map. In FAM (Meng et al., 2021), by contrast, the foreground prediction map is split into several parts that separately mask high-level feature maps, so that different regions are learned through CE loss. Despite their promising performance, FPM-based methods still suffer from incomplete object localization.

Fig. 1

A Experimental procedure and related definitions. B The entropy value of the CE loss w.r.t. the foreground mask and the foreground activation value w.r.t. the foreground mask. C The results with statistical significance. Implementation details of the experiment and further results are available in Sect. 3.5

To better understand FPM-based methods, we focus on exploring the entropy value of the CE loss (entropy) with respect to (w.r.t.) the foreground mask. As shown in Fig. 1A, by changing the area of the foreground mask and masking the feature map, the relationship between the entropy and the foreground mask area is plotted in Fig. 1B. An important phenomenon can be observed: there is a “mismatch” between the entropy and the ground-truth mask, i.e., the entropy is already close to zero when the foreground mask retains only part of the object region, which indicates that entropy cannot force the foreground map to learn the complete object area. The reason is that the exponential form of softmax amplifies the discrepancy in activation values and drives premature convergence of the entropy. To find a better factor to facilitate localization learning, we further explore the activation value (before the softmax calculation) w.r.t. the foreground mask. As shown in Fig. 1B, there is a stronger “correlation” between the activation value and the foreground mask, i.e., the activation value tends to saturate only when the mask expands to the object boundary. This suggests that better localization ability can be learned by optimizing the activation value. Figure 1C confirms the generality of these phenomena in a statistical sense.

Inspired by the above exploratory analysis, a straightforward way to obtain a complete foreground prediction map is to maximize the activation value. However, since a minimization problem is more conducive to training stability and loss convergence than a maximization problem, this paper proposes a novel alternative: learn a background prediction map by minimizing the background activation value, and then obtain the accurate foreground prediction map by inversion. Indeed, the statistics on background activation values in Fig. 1C are “symmetric” to the statistics on activation values, with both converging at the ground-truth mask area, which further supports the feasibility of background activation value suppression.

Fig. 2

The architecture of the proposed background activation suppression (BAS) in the training phase. The class-specific foreground prediction map \({\textbf{M}}_{f}\) and the coupled background prediction map \({\textbf{M}}_{b}\) are obtained by the generator according to the ground-truth (\(\textbf{GT}\)) class, and then fed into the Activation Map Constraint module together with the feature maps \({\textbf{F}}\)

In this paper, we propose a simple but effective Background Activation Suppression (BAS) method. As shown in Fig. 2, our method includes three modules: an extractor, a generator, and an Activation Map Constraint (AMC) module. First, the extractor extracts image features for subsequent localization and classification. The generator produces a class-specific foreground prediction map for localization. The coupled background prediction map is then obtained by inverting the foreground prediction map, and both are fed into AMC for localization training. The AMC is supervised by four losses: a background activation suppression loss, an area constraint loss, a foreground region guidance loss, and a classification loss. The most important one is the background activation suppression loss, which is devised to promote the learning of the generator by minimizing the ratio of the background activation value to the overall activation value (the activation value generated by the entire image). In the inference phase, the Top-k prediction maps are selected based on the predicted category probabilities, and their average is adopted as the final localization result. The main contributions of this paper can be summarized as follows:

(1):

This paper identifies that the essential reason why minimizing the CE loss facilitates the generation of the foreground map is that it indirectly increases the foreground activation value, and accordingly proposes to promote the generation of the foreground prediction map by suppressing the background activation value.

(2):

This paper proposes a simple but effective Background Activation Suppression (BAS) approach that facilitates the generation of the foreground map through an Activation Map Constraint (AMC) module in a weakly supervised manner. The AMC is composed of four losses, including the background activation suppression loss, which together drive the generation of the foreground prediction map for localization.

(3):

Extensive experiments on both CUB-200-2011 (Wah et al., 2011) and ILSVRC (Russakovsky et al., 2015) benchmarks demonstrate that our method achieves consistent and significant improvement in terms of GT-known/Top-1/Top-5 Loc. In addition, the proposed BAS approach can be extended to Weakly Supervised Semantic Segmentation (WSSS) task, which also achieves new state-of-the-art results on PASCAL VOC 2012 (Everingham et al., 2010) and MS COCO 2014 (Lin et al., 2014) datasets.

This paper builds upon our conference version (Wu et al., 2021), which has been extended in six distinct aspects. (1) We explain the advantages of Background Activation Suppression and its generalizability (on more complex datasets) in more detail and more comprehensively (in a statistical sense), see Fig. 6 and Sect. 3.5. (2) To alleviate the problem of inadequate convergence of the BAS loss (Fig. 12), we focus on the location of the ReLU function, which is closely related to the activation value, and improve the previous BAS after this exploration, see Fig. 4 and Sect. 3.2. (3) To verify the extensibility of the BAS approach, we develop a Weakly Supervised Semantic Segmentation (WSSS) framework with the proposed BAS in Sect. 5. The framework enhances the quality of the seed generation process in the popular WSSS pipeline through BAS, resulting in better performance on the WSSS task, as shown in Tables 9, 11 and 12. (4) To exploit the advantage of BAS on WSSS of obtaining localization maps through a generator, we propose to produce a class-agnostic foreground map using BAS and combine it with the class-specific maps to improve the quality of the initial seed, see Fig. 20 and Table 13. (5) To further improve the segmentation quality, we propose to apply the losses of BAS as evaluation scores in the inference phase to assess each threshold and find the image-specific threshold on WSSS, see Fig. 22 and Table 15. (6) We have made substantial efforts to improve the presentation (e.g., motivation, related illustrative diagrams, formulation, experimental analysis, key results) and organization of the paper. Several sections have been refined to improve readability and to provide more detailed explanations of the motivation, quantitative/qualitative comparisons, and discussions.

The rest of this paper is organized as follows. Section 2 describes existing works related to WSOL and WSSS. The detailed method is described in Sect. 3. Sections 4 and 5 present the experimental results of WSOL and WSSS, respectively. Limitation and future work are discussed in Sect. 6. Finally, we conclude our work in Sect. 7.

2 Related Work

2.1 Weakly Supervised Object Localization

Weakly supervised object localization (WSOL) is a challenging task that requires localizing objects using only image-level labels. To obtain localization results from a classification network, CAM (Zhou et al., 2016) proposes to replace the top layers with global average pooling and multiply the fully connected weights with the deep feature maps to generate a class activation map (CAM) as the localization map. Unfortunately, CAM usually focuses on the most discriminative regions. To alleviate this problem, a series of methods propose erasing strategies. HaS (Singh & Lee, 2017) splits the original image into different patches and randomly masks part of them, forcing the classification network to learn more features of objects. ACoL (Zhang et al., 2018a) and EIL (Mai et al., 2020) erase areas with high response in the feature map and use two parallel branches for adversarial erasing. Differently, ADL (Choe & Shim, 2019) erases the most significant regions of each layer during forward propagation to achieve a balance between classification and localization. CutMix (Yun et al., 2019) adopts a data augmentation strategy that mixes two different images to force the network to learn relevant regions of different objects.

In addition, another class of approaches adopts the idea of spreading confident regions to mine relevant features. SPG (Zhang et al., 2018b) uses thresholds to filter foreground and background regions with high confidence from CAM to guide shallow network learning. Further, SPOL (Wei et al., 2021) generates more reliable confident regions by a multiplicative feature fusion strategy and trains a full segmentation network with the confident regions as pseudo labels. I2C (Zhang et al., 2020d) proposes to increase the robustness and reliability of localization by considering the correlation of different pictures from the same class. Besides, SPA (Pan et al., 2021) uses a post-processing approach to extract structure-preserving feature maps. SLT (Guo et al., 2021) treats several similar classes as one class when generating the classification loss and localization maps, which alleviates the problem of focusing on the most discriminative regions by strengthening learning tolerance. DA-WSOL (Zhu et al., 2022) aligns the feature distributions between the image and pixel domains with the idea of domain adaptation.

Most recently, two Foreground-Prediction-Map-based works (Xie et al., 2021; Meng et al., 2021) both achieve the localization task by generating a foreground prediction map. ORNet (Xie et al., 2021) uses a two-stage approach, where an encoder-decoder layer is inserted in the shallow part of the network as a generator and trained via the classification task in the first stage. In the second stage, the parameters of the classification network are fixed as an evaluator, and the foreground prediction map output by the generator is used to mask the image. The masked image is then fed into the evaluator for classification training, so that the foreground prediction map can learn the object region. FAM (Meng et al., 2021) utilizes a Foreground Memory Mechanism to store different foreground classifiers and generate a class-agnostic foreground prediction map. The foreground prediction map is split into several specific parts, which are used to mask the feature map to obtain different part-aware feature maps. After classification training with the corresponding foreground classifiers, the class-agnostic foreground map is forced to learn different object regions. Note that both ORNet (Xie et al., 2021) and FAM (Meng et al., 2021) consider only foreground regions and use cross-entropy to facilitate the learning of the generator. Different from these methods, this paper proposes a background activation suppression strategy to learn foreground prediction maps in a simple but effective way.

2.2 Weakly Supervised Semantic Segmentation

Weakly supervised semantic segmentation (WSSS) aims to alleviate the reliance on pixel-level ground-truth labels by using weak labels instead. Existing WSSS methods usually include the following three stages: (1) obtaining a high-quality initial seed; (2) refining the seed and generating pseudo labels; (3) training a full segmentation network with the pseudo labels. Generating a high-quality pixel-level localization map is thus also crucial for WSSS, similar to WSOL.

Seed Generation. Extracting CAMs is arguably the most common and convenient approach to generating the initial seed, despite the problem that only the discriminative regions are highlighted. To alleviate this issue, some methods propose to improve the quality of CAM by iterative manipulation. AE-PSL (Wei et al., 2017) performs iterative training steps to mine more object-related regions with adversarial erasure. RIB (Lee et al., 2021a) applies a post-processing method to fine-tune the classification model and obtain CAMs by iteration. AdvCAM (Lee et al., 2022a) proposes an anti-adversarial approach to continuously identify more object areas. Besides, a category of methods tries to improve the classification learning process. CONTA (Zhang et al., 2020c) aims to avoid contextual confusion by proposing a structural causal model to analyze the causalities among images, contexts, and class labels. SEAM (Wang et al., 2020b) applies consistency regularization on CAMs across differently sized images to mitigate the supervision gap issue. ReCAM (Chen et al., 2022b) proposes to use softmax cross-entropy loss to suppress the response of different categories to the same receptive field. CLIMS (Xie et al., 2022a) utilizes the CLIP (Radford et al., 2021) model to assist the network in activating more complete object regions. GAIN (Li et al., 2018) uses Grad-CAM to obtain localization maps and improves them by exploiting the prediction scores of the network as supervision. In contrast, BAS builds on the FPM paradigm and, motivated by our experimental observations, proposes a more essential and effective background activation suppression loss than the cross-entropy used in FPM-based methods.

Mask Generation. The initial seed is usually coarse and needs to be refined. Some researchers adopt the idea of region growing to spread the initial seed. SEC (Kolesnikov & Lampert, 2016) proposes three principles: seed, expand, and constrain. The initial seed is expanded during segmentation training and constrained to the object boundaries. PSA (Ahn & Kwak, 2018) trains a deep network to predict semantic affinity between pairs of adjacent image coordinates and propagates the semantics by random walk (Lovász, 1993). IRN (Ahn et al., 2019) predicts a transition probability matrix from the boundary activation map and generates pseudo masks in a similar way to PSA.

3 Methodology

In this section, we first introduce the main architecture of the network and the definition of the symbols in Sect. 3.1. Then we describe the structure of the AMC module, including the form of the four loss functions, and the improvement of BAS compared to the previous conference version in Sect. 3.2. The total loss functions for WSOL and WSSS are listed in Sects. 3.3 and 3.4, respectively. Finally, we provide specific details of the exploratory experiments and statistical results on three different datasets in Sect. 3.5.

3.1 Overview

Based on the experimental observations, we enhance the completeness of the localization map for WSOL by proposing a background activation suppression (BAS) approach. As shown in Fig. 2, BAS consists of three modules: an extractor, a generator, and an activation map constraint (AMC) module. The extractor extracts features related to classification and localization. The generator produces the foreground prediction maps. The AMC module promotes the learning of the extractor and generator through four kinds of losses.

Specifically, we divide the original backbone network into two sub-networks \({\mathcal {F}}_1\) and \({\mathcal {F}}_2\) according to the location of the generator, and denote the network parameters by \(\Theta \). The sub-network \({\mathcal {F}}_1\) before the generator is used as a feature extractor. Given an image \({\textbf{I}}\), the feature maps \({\textbf{F}} \in {\mathbb {R}}^{H \times W \times N}\) are generated by the extractor \({\mathcal {F}}_1({\textbf{I}}, \Theta _1)\) in the forward propagation, where H, W, and N denote the height, width, and number of channels of the feature maps, respectively. Afterward, the feature maps \({\textbf{F}}\) are fed into the generator, which consists of a \(3\times 3\) convolution layer and a Sigmoid activation function, to produce a set of foreground prediction maps \({\textbf{M}} \in {\mathbb {R}}^{H \times W \times C}\) with values in [0, 1], where C is the number of categories. We choose the class-specific foreground prediction map \({\textbf{M}}_{f} \in {\mathbb {R}}^{H \times W \times 1}\) corresponding to the ground-truth class and invert it to obtain the coupled background prediction map \({\textbf{M}}_{b} \in {\mathbb {R}}^{H \times W \times 1}\), where \({\textbf{M}}_{b} = 1 - {\textbf{M}}_{f}\). Finally, \({\textbf{M}}_{f}\), \({\textbf{M}}_{b}\), and \({\textbf{F}}\) are fed together into the AMC module for prediction map learning. We describe the AMC structure and loss functions in detail in Sect. 3.2.
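For concreteness, the generator step can be sketched in a few lines of PyTorch. The batch-first tensor layout and all names below are illustrative assumptions for exposition, not the released implementation:

```python
import torch
import torch.nn as nn


class Generator(nn.Module):
    """Sketch: a 3x3 convolution + Sigmoid mapping backbone features F
    (B, N, H, W) to C class-specific foreground prediction maps in [0, 1]."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor, gt_class: torch.Tensor):
        maps = torch.sigmoid(self.conv(feats))                # M: (B, C, H, W)
        # select the class-specific foreground map M_f for the ground-truth class
        idx = gt_class.view(-1, 1, 1, 1).expand(-1, 1, *maps.shape[2:])
        m_f = maps.gather(1, idx)                             # (B, 1, H, W)
        m_b = 1.0 - m_f                                       # coupled background map M_b
        return maps, m_f, m_b
```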

In the inference phase, as illustrated in Fig. 3, the feature maps \({\textbf{F}}\) obtained by the extractor are input into the generator and the sub-network \({\mathcal {F}}_2({\textbf{F}},\Theta _2)\) to generate the set of foreground prediction maps \({\textbf{M}}\) and the classification logits \({\tilde{\textbf{y}}}\), respectively. We select the prediction maps corresponding to the Top-k predicted categories, including the ground-truth class, and take their average as the final localization result. Notably, the Top-k strategy is only used in WSOL and not in WSSS.
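The Top-k averaging admits an equally short sketch under the same assumed layout (in the paper's protocol the selected categories also include the ground-truth class; here only the top-scoring classes are shown):

```python
import torch


def topk_localization_map(maps: torch.Tensor, logits: torch.Tensor, k: int) -> torch.Tensor:
    """Average the foreground maps of the k highest-scoring classes.

    maps:   (B, C, H, W) foreground prediction maps M from the generator
    logits: (B, C) classification logits y~ from sub-network F2
    """
    topk = logits.topk(k, dim=1).indices                      # (B, k)
    idx = topk[:, :, None, None].expand(-1, -1, *maps.shape[2:])
    return maps.gather(1, idx).mean(dim=1, keepdim=True)      # (B, 1, H, W)
```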

Fig. 3

The architecture of the proposed BAS in the inference phase. We utilize Top-k to generate the final localization map

3.2 Activation Map Constraint

The proposed AMC module takes the foreground map, the background map, and the feature maps as input to jointly promote the learning of the extractor and generator. It consists of four different losses: \({\mathcal {L}}_{bas}\), \({\mathcal {L}}_{ac}\), \({\mathcal {L}}_{frg}\), and \({\mathcal {L}}_{cls}\).

Background Activation Suppression (\(\varvec{\mathcal {L}}_{bas}\)). For the input background prediction map \({\textbf{M}}_{b}\), we multiply it by the feature maps \({\textbf{F}}\) to obtain the background feature maps \({\textbf{F}}^{b} = {\textbf{F}} \cdot {\textbf{M}}_{b} \in {\mathbb {R}}^{H \times W \times N}\). Subsequently, the feature maps \({\textbf{F}}\) and \({\textbf{F}}^{b}\) are fed into two weight-sharing sub-networks \({\mathcal {F}}_2({\textbf{F}},\Theta _2)\) and \({\mathcal {F}}_2({\textbf{F}}^{b},\Theta _2)\), respectively. For the sub-network with \({\textbf{F}}^{b}\) as input, the goal is to generate the background activation value through the same function, and the parameters of this sub-network are frozen in the back propagation. Following the sub-network \({\mathcal {F}}_2\) and global average pooling (GAP) (Zhou et al., 2016), \({\textbf{F}}\) and \({\textbf{F}}^{b}\) produce the prediction logits \({\tilde{\textbf{y}}}\in {\mathbb {R}}^{C}\) and \({\tilde{\textbf{y}}}^{b}\in {\mathbb {R}}^{C}\), respectively, which can be expressed as follows:

$$\begin{aligned}&{\tilde{\textbf{y}}}=\text {GAP}\left( {\mathcal {F}}_2\left( {\textbf{F}},\Theta _2 \right) \right) , \end{aligned}$$
(1)
$$\begin{aligned}&{\tilde{\textbf{y}}}^{b}=\text {GAP}\left( {\mathcal {F}}_2\left( {\textbf{F}}^{b},\Theta _2 \right) \right) . \end{aligned}$$
(2)

We select the values in \({\tilde{\textbf{y}}}\) and \({\tilde{\textbf{y}}}^{b}\) according to the ground-truth class. After applying a ReLU activation function, these values are denoted as the activation value \({\textbf{S}}\in {\mathbb {R}}^{1}\) and the background activation value \({\textbf{S}}^{b}\in {\mathbb {R}}^{1}\), respectively. \({\textbf{S}}\) is the activation value generated by the unmasked feature maps, containing both foreground and background information, while \({\textbf{S}}^{b}\) is the activation value generated by the background feature maps, retaining only the background information. We measure the difference between the background activation value and the activation value in ratio form to achieve background activation suppression, and \({\mathcal {L}}_{bas}\) is defined as follows:

$$\begin{aligned} {\mathcal {L}}_{bas}= \frac{ {\textbf{S}}^{b} }{{\textbf{S}}+ \varepsilon }, \end{aligned}$$
(3)

where \(\varepsilon \) is a very small value (\(10^{-8}\)) that keeps the equation well-defined. This ratio form not only avoids introducing more hyperparameters, but also acts as a normalization, keeping the range of the loss value within one order of magnitude.
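Putting Eqs. 1–3 together, a minimal sketch of the \({\mathcal {L}}_{bas}\) computation might look as follows, where f2 and gap stand for \({\mathcal {F}}_2(\cdot ,\Theta _2)\) and global average pooling. The freezing of \(\Theta _2\) on the background branch and the clipping of the loss at 1 (see Sect. 4.1) are noted in comments; all names are illustrative:

```python
import torch
import torch.nn.functional as F


def bas_loss(feats, m_b, f2, gap, gt_class, eps: float = 1e-8):
    """Sketch of Eq. 3. feats: (B, N, H, W) feature maps F;
    m_b: (B, 1, H, W) background prediction map M_b; gt_class: (B,) labels."""
    logits = gap(f2(feats))                  # y~  (Eq. 1), shape (B, C)
    # Background branch: the paper freezes Theta_2 here so that gradients
    # reach only M_b (e.g., by running f2 with detached parameters); the
    # plain shared-weights forward is shown for brevity.
    logits_b = gap(f2(feats * m_b))          # y~^b (Eq. 2)
    s = F.relu(logits.gather(1, gt_class[:, None]))       # S   >= 0
    s_b = F.relu(logits_b.gather(1, gt_class[:, None]))   # S^b >= 0
    loss = s_b / (s + eps)
    # clipped to 1 for early-training stability, as noted in Sect. 4.1
    return loss.clamp(max=1.0).mean()
```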

Fig. 4

The improvement of BAS. Partial structure of a the previous conference version and b this work. The green pixels in the localization map indicate positive values and the purple ones indicate negative values (Color figure online)

Generating non-negative \({\textbf{S}}\) and \({\textbf{S}}^{b}\) is necessary for \({\mathcal {L}}_{bas}\). In the previous conference version, we used a ReLU as the activation function at the end of the network to ensure the non-negativity of the outputs, as shown in Fig. 4. This approach causes pixels with negative values to be set to 0 after the ReLU, so their gradients do not take part in the back propagation. However, pixels with negative values are usually associated with background areas, which are also important for the learning of classification and prediction maps. As shown in Fig. 12, neglecting negative activation values in the classification loss indirectly causes the BAS loss to converge inadequately (the loss value even becomes larger) later in the training process. To solve this problem, we remove this ReLU layer so that negative pixels also participate in the gradient back propagation. To ensure the non-negativity of \({\textbf{S}}\) and \({\textbf{S}}^{b}\), we instead apply the ReLU activation function separately just before generating them.

Area Constraint (\({\mathcal {L}}_{ac}\)). The background prediction map can be guided by \({\mathcal {L}}_{bas}\) in a suppressive way: a smaller \({\mathcal {L}}_{bas}\) means that the region covered by the background prediction map is less discriminative. When the background prediction map covers the background region well, its \({\mathcal {L}}_{bas}\) should be minimal while the background area is as large as possible; accordingly, the foreground area should be as small as possible. We therefore use the area of the foreground prediction map as a constraint:

$$\begin{aligned} {\mathcal {L}}_{ac}= \frac{1}{H \times W}\sum _{h=1}^H \sum _{w=1}^W {\textbf{M}}_{f}\left( h,w \right) . \end{aligned}$$
(4)

Foreground Region Guidance (\({\mathcal {L}}_{frg}\)). Meanwhile, we retain the FPM approach of employing the classification task to drive the learning of the foreground prediction map, which uses high-level semantic information to guide the foreground prediction map toward the approximately correct object region. Consequently, a foreground region guidance loss based on cross-entropy is utilized. After \({\textbf{F}}\) is fed into \({\mathcal {F}}_2({\textbf{F}},{\Theta }_2)\), the result is multiplied element-wise with \({\textbf{M}}_{f}\) to produce \({\mathcal {L}}_{frg}\):

$$\begin{aligned}&{\tilde{\textbf{y}}}^{f}=\text {GAP}\left( {\textbf{M}}_{f} \cdot {\mathcal {F}}_2\left( {\textbf{F}},\Theta _2 \right) \right) , \end{aligned}$$
(5)
$$\begin{aligned}&{\mathcal {L}}_{frg}=-\sum _{i=1}^C {\textbf{y}}_{i} \log {\frac{e^{{\tilde{\textbf{y}}}_{i}^{f}}}{\sum _{j=1}^{C} e^{{\tilde{\textbf{y}}}_{j}^{f}}}}, \end{aligned}$$
(6)

where \({\textbf{y}}\) denotes the image-level one-hot encoding label.

Classification (\({\mathcal {L}}_{cls}\)). Besides, we obtain the classification loss \({\mathcal {L}}_{cls}\) by applying cross-entropy to \({\tilde{\textbf{y}}}\), which is used for classification learning of the entire image:

$$\begin{aligned} {\mathcal {L}}_{cls}=-\sum _{i=1}^C {\textbf{y}}_{i} \log {\frac{e^{{\tilde{\textbf{y}}}_{i}}}{\sum _{j=1}^{C} e^{{\tilde{\textbf{y}}}_{j}}}}. \end{aligned}$$
(7)
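The remaining AMC losses (Eqs. 4–7) are equally compact. The sketch below reuses the illustrative f2/gap placeholders from the \({\mathcal {L}}_{bas}\) sketch above and assumes \({\textbf{M}}_{f}\) is resized when its spatial size differs from that of the class score maps:

```python
import torch
import torch.nn.functional as F


def amc_other_losses(feats, m_f, f2, gap, gt_class):
    """Sketch of L_ac (Eq. 4), L_frg (Eq. 6), and L_cls (Eq. 7)."""
    score_maps = f2(feats)                               # (B, C, h, w) class score maps
    y = gap(score_maps)                                  # y~ (Eq. 1)
    m = F.interpolate(m_f, size=score_maps.shape[2:])    # align M_f spatially (assumption)
    y_f = gap(m * score_maps)                            # y~^f (Eq. 5)
    l_cls = F.cross_entropy(y, gt_class)                 # Eq. 7
    l_frg = F.cross_entropy(y_f, gt_class)               # Eq. 6
    l_ac = m_f.mean()                                    # Eq. 4: normalized foreground area
    return l_ac, l_frg, l_cls
```

The total objective (Eq. 8 below) is then simply the weighted sum l_cls + α·l_frg + β·l_ac + λ·l_bas.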

3.3 Weakly Supervised Object Localization

By jointly optimizing the background activation suppression loss, area constraint loss, foreground region guidance loss, and classification loss in the AMC module, the foreground prediction map can be guided toward the overall area of the object. The total loss of the BAS training process is defined as follows:

$$\begin{aligned} {\mathcal {L}}={\mathcal {L}}_{cls} + \alpha {\mathcal {L}}_{frg} + \beta {\mathcal {L}}_{ac} + \lambda {\mathcal {L}}_{bas}, \end{aligned}$$
(8)

where \(\alpha \), \(\beta \), and \(\lambda \) are hyperparameters, and \({\mathcal {L}}_{cls}\) and \({\mathcal {L}}_{frg}\) are both cross-entropy losses. For all backbones and datasets, we set \(\lambda =1\). Ablation experiments on the hyperparameters \(\alpha \), \(\beta \), and \(\lambda \) for WSOL are described in Sect. 4.3.

3.4 Weakly Supervised Semantic Segmentation

BAS can also be applied to weakly supervised semantic segmentation to verify the generality of our method. Different from weakly supervised object localization, weakly supervised semantic segmentation no longer assumes that there is only one ground-truth class per image, which is more challenging. In addition, comparing with state-of-the-art weakly supervised semantic segmentation methods reflects the segmentation quality of the prediction map more directly.

Based on the network structure in Fig. 2, we apply BAS to weakly supervised semantic segmentation with minor changes. As shown in Fig. 5, we maintain the learning process for a single prediction map in the AMC module by randomly selecting one foreground category of the image and denoting its corresponding prediction map as \({\textbf{M}}_{f}\). In addition, to enable multi-label classification, we adopt a modified softmax cross-entropy loss rather than a Sigmoid-based loss (binary cross-entropy). This is mainly because the activation value \({\textbf{S}}^{b}\) obtained from the background localization map would have to be less than 0 to make the probability \(1/({1+e^{-{\textbf{S}}^{b}}})\) close to 0, which conflicts with the non-negativity of \({\textbf{S}}^{b}\).

Multi-Label Classification (\({\mathcal {L}}_{mcls}\)). For the weakly supervised semantic segmentation task, we adopt the multi-label classification loss \({\mathcal {L}}_{mcls}\) instead of \({\mathcal {L}}_{cls}\) to deal with the multi-label case. To avoid class imbalance and training instability when multiple labels enter the softmax formulation, we only consider the differentiation between foreground and background classes and ignore the interrelationship among foreground categories. It can be expressed as follows:

$$\begin{aligned} {\mathcal {L}}_{mcls}=-\sum _{i \in L} {\textbf{y}}_{i} \log {\left( \frac{e^{{\tilde{\textbf{y}}}_{i}}}{{\sum _{j \in K} e^{{\tilde{\textbf{y}}}_{j}}} + e^{\tilde{\textbf{y}}_{i}}} \right) }, \end{aligned}$$
(9)

where L is the set of ground-truth classes in the image, and K denotes the set of remaining categories.

Fig. 5

Applying BAS to weakly supervised semantic segmentation task

The total loss function in weakly supervised semantic segmentation is of the following form:

$$\begin{aligned} {\mathcal {L}}={\mathcal {L}}_{mcls} + \alpha {\mathcal {L}}_{frg} + \beta {\mathcal {L}}_{ac} + \lambda {\mathcal {L}}_{bas}. \end{aligned}$$
(10)

The \(\lambda \) is set to 1 for all datasets. For PASCAL VOC 2012, we set \(\alpha =0.2\) and \(\beta =1.2\). For MS COCO 2014, we adopt \(\alpha =0.5\) and \(\beta =1.5\). The ablation experiments of the hyperparameters \(\alpha \), \(\beta \), and \(\lambda \) on WSSS, and the results of different combinations of hyperparameters on five datasets are presented in Sect. 5.2.
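Written directly from Eq. 9, a sketch of \({\mathcal {L}}_{mcls}\) under the same illustrative assumptions is given below (the usual log-sum-exp stabilization is omitted for clarity):

```python
import torch


def mcls_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sketch of L_mcls (Eq. 9): each ground-truth class i in L competes only
    with the non-ground-truth classes K, not with other foreground classes.

    logits: (B, C) prediction logits y~;  labels: (B, C) multi-hot in {0, 1}
    """
    exp = logits.exp()
    k_sum = (exp * (1.0 - labels)).sum(dim=1, keepdim=True)   # sum over j in K
    prob = exp / (k_sum + exp)                                # per foreground class i
    loss = -(labels * prob.clamp_min(1e-12).log()).sum(dim=1)
    return loss.mean()
```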

3.5 Empirical Justification

In this part, we empirically justify the advantage of introducing background activation suppression and its generalizability.

The purpose of the exploratory experiment is to investigate the relationship of the activation value (Activation), cross-entropy (Entropy), and background activation value (Background Activation) with the mask area. Specifically, we first train a VGG16 classification network on CUB-200-2011 using \({\mathcal {L}}_{cls}\) (Eq. 7) as supervision. Then, for a given pixel-level mask, the activation and entropy corresponding to this mask are generated by masking the feature map. We erode and dilate the ground-truth mask with a convolution of kernel size \(5n \times 5n\), obtain masks with different areas by changing the value of n, and plot the activation and entropy with the mask area as the horizontal axis. As shown in Fig. 1A, we display the curve for a single image obtained through the above process.
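The erosion/dilation step can be sketched with max pooling, a common trick for binary masks; the \(5n \times 5n\) window is assumed to be odd-sized here so that the spatial resolution is preserved (this is an illustration of the procedure, not the exact experimental code):

```python
import torch
import torch.nn.functional as F


def dilate(mask: torch.Tensor, n: int) -> torch.Tensor:
    """Dilate a binary mask (B, 1, H, W) with a 5n x 5n window."""
    k = 5 * n  # assumed odd so that padding k // 2 keeps H and W unchanged
    return F.max_pool2d(mask, kernel_size=k, stride=1, padding=k // 2)


def erode(mask: torch.Tensor, n: int) -> torch.Tensor:
    # erosion is the complement of dilating the complement
    return 1.0 - dilate(1.0 - mask, n)
```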

Fig. 6

Motivation. Statistical analysis about exploratory experiments on different datasets

Since each image has a different activation value distribution and a different ground-truth mask area, we normalize the activation curve of each image by dividing by the activation value generated by the entire image, as in Eq. 3, to obtain a more statistically meaningful result. In addition, the horizontal axis (area) is also normalized by the ground-truth mask area, which is marked by a red line. As shown in Fig. 6, we present the curves of the foreground activation value, cross-entropy, and background activation value with respect to the mask area, computed over the CUB-200-2011 test set. The samples on the whole present the following phenomena: when the mask expands near the ground-truth mask, the activation value starts to saturate and the corresponding background activation value tends to converge, while cross-entropy converges to zero early or even diverges as the mask expands. This suggests that the object region learned from activation values is larger and closer to the real object region than that learned from cross-entropy. We further explore why the cross-entropy occasionally diverges and visualize some results in Fig. 7. When the network classifies objects incorrectly, e.g., identifying a cow as a horse, the calculated cross-entropy maintains a high value as the mask area increases. In this case, adopting cross-entropy values to supervise the localization map is less feasible and appropriate than using activation values, which are not influenced by other categories. Besides, to verify the generality of this observation, we perform the same experiments on the more complex OpenImages and PASCAL VOC 2012 datasets. For PASCAL VOC 2012, we select one ground-truth category and its corresponding mask at a time, convert the multi-label setting into a single-label one, and then plot the curve in the same way. As shown in Fig. 6, the statistical analysis demonstrates similar phenomena; therefore, we believe it is general that better localization ability can be learned through activation values than through cross-entropy.

Fig. 7

Cross-entropy presents a divergence trend as the area of the mask increases when the model classifies the object incorrectly. The dashed line represents the position of ground-truth mask. Entropy: cross-entropy. GT: Ground-Truth

Table 1 Comparison with state-of-the-art methods
Fig. 8

Visualization comparison with the baseline CAM (Zhou et al., 2016) method on CUB-200-2011 (Wah et al., 2011) and ILSVRC (Russakovsky et al., 2015). The ground-truth bounding boxes are in Red, and the predictions are in Green (Color figure online)

4 Experiments on Weakly Supervised Object Localization

4.1 Experimental Setup

Datasets. We evaluate the proposed method on the popular benchmarks CUB-200-2011 (Wah et al., 2011), ILSVRC (Russakovsky et al., 2015), and OpenImages (Choe et al., 2020b). CUB-200-2011 contains 200 fine-grained classes of birds, with 5994 training images and 5794 testing images. ILSVRC contains about 1.2 million training images and 50,000 validation images, divided into 1000 categories. OpenImages consists of 29,819, 2500, and 5000 samples from 100 classes for training, validation, and testing, respectively. In addition to class labels, CUB-200-2011 and OpenImages also provide pixel-level mask annotations for evaluating the predicted mask.

Metrics. Following DA-WSOL (Zhu et al., 2022), we apply both bounding box and mask metrics to evaluate the performance of BAS. For bounding boxes, following Xu et al. (2022), Zhu et al. (2022), and Lee et al. (2022a), four metrics are used: GT-known localization accuracy (GT-known Loc), Top-1 localization accuracy (Top-1 Loc), Top-5 localization accuracy (Top-5 Loc), and maximal box accuracy (MaxBoxAccV2). Specifically, GT-known Loc counts a prediction as correct when the intersection over union (IoU) between the ground-truth bounding box and the predicted bounding box is greater than a fixed threshold (\(\delta \) = 0.5). Top-1/Top-5 Loc is correct when the Top-1/Top-5 predicted categories contain the ground-truth class and GT-known Loc is correct. Compared to GT-known Loc (\(\delta \) = 0.5), MaxBoxAccV2 considers multiple IoU thresholds (\(\delta \in \{0.3, 0.5, 0.7\}\)) and takes the average localization performance as the result. For masks, we adopt both the peak intersection over union (PIoU) (Zhang et al., 2020a) and the pixel average precision (PxAP) (Choe et al., 2020b) as metrics when pixel-level ground-truth labels are available.
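For reference, the bounding box criterion can be sketched in plain Python (boxes as (x1, y1, x2, y2)); this illustrates the standard IoU definition rather than code from the benchmark toolkits:

```python
def box_iou(a, b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def gt_known_correct(pred_box, gt_box, delta: float = 0.5) -> bool:
    # GT-known Loc: correct when IoU exceeds the fixed threshold delta;
    # MaxBoxAccV2 averages this check over delta in {0.3, 0.5, 0.7}
    return box_iou(pred_box, gt_box) >= delta
```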

Implementation Details. We evaluate the proposed method on the most popular backbones, including VGG16 (Simonyan & Zisserman, 2014), InceptionV3 (Szegedy et al., 2016), ResNet50 (He et al., 2016), and MobileNetV1 (Howard et al., 2017). All networks are fine-tuned from weights pre-trained on ILSVRC (Russakovsky et al., 2015). We train for 120 epochs on CUB-200-2011 (Wah et al., 2011) and 9 epochs on ILSVRC (Russakovsky et al., 2015). In the training phase, the input images are resized to 256\(\times \)256 and then randomly cropped to 224\(\times \)224. When \({\mathcal {L}}_{bas}\) is larger than 1, we clip it to 1 to ensure the stability of early training. In the inference phase, we use ten-crop augmentation to obtain the final classification results, following the settings in Pan et al. (2021), Guo et al. (2021), Zhang et al. (2018b). For localization, we replace the random crop with a center crop, as in previous works (Wei et al., 2021; Zhang et al., 2020a; Yun et al., 2019; Choe & Shim, 2019).

Table 2 Ablation study

4.2 Comparison with State-of-the-Arts

We compare the proposed BAS with state-of-the-art methods on the CUB-200-2011 (Wah et al., 2011) and ILSVRC (Russakovsky et al., 2015) datasets. As shown in Table 1, BAS achieves stable and excellent performance on various backbones. On CUB-200-2011 (Wah et al., 2011), BAS surpasses all existing methods by a large margin in terms of GT-known/Top-1/Top-5 Loc with the MobileNetV1, ResNet50, and InceptionV3 backbones. Compared with the current Foreground-Prediction-Map-based method FAM (Meng et al., 2021), BAS achieves 1.78%, 7.33%, 9.68%, and 7.38% improvement in GT-known Loc on VGG16, MobileNetV1, ResNet50, and InceptionV3, respectively. On ResNet50, BAS achieves 95.41% GT-known Loc, a significant increase of 3.81% over the best-performing counterpart (Kim et al., 2022). In addition, our method improves GT-known Loc by 5.53% and 4.20% over the latest multi-stage model CREAM (Xu et al., 2022) on ResNet50 and InceptionV3, respectively.

Fig. 9

Hyperparameters. a \(\varvec{\alpha }\) for foreground region guidance loss \({\mathcal {L}}_{frg}\). b \(\varvec{\beta }\) for area constraint loss \({\mathcal {L}}_{ac}\). c \(\varvec{\lambda }\) for background activation suppression loss \({\mathcal {L}}_{bas}\)

On ILSVRC (Russakovsky et al., 2015), BAS overall exceeds all baseline methods in terms of GT-known/Top-1/Top-5 Loc on all backbones. With MobileNetV1 as the backbone, BAS achieves 72.03% GT-known Loc, surpassing FAM (Meng et al., 2021) by a large margin of 9.98%. Moreover, InceptionV3-BAS and ResNet50-BAS obtain 72.07% and 72.00% GT-known Loc, respectively, establishing a new state-of-the-art. This shows that BAS performs well on both a fine-grained dataset and a large universal dataset. Furthermore, we visualize the localization maps of the proposed BAS and of CAM (Zhou et al., 2016) on CUB-200-2011 and ILSVRC in Fig. 8. Compared to CAM, BAS robustly covers the entire object area even in noisy environments, and its maps are sharper and more compact at the object edges.

4.3 Ablation Study

In this section, we perform a series of ablation experiments using ResNet50 (He et al., 2016) as the backbone. First, we conduct ablation experiments on the components of BAS on CUB-200-2011 (Wah et al., 2011). We take \({\mathcal {L}}_{cls}\), \({\mathcal {L}}_{frg}\), and \({\mathcal {L}}_{ac}\) together as the baseline for the Foreground-Prediction-Map-based architecture. As shown in Table 2, adding \({\mathcal {L}}_{bas}\) to the baseline enables the localization map to cover the object region more completely, significantly increasing localization accuracy, with 21.01% and 16.26% improvement in GT-known Loc and Top-1 Loc, respectively. Moreover, although the Top-k strategy for integrating the final localization result makes the localization maps less sharp, it further improves GT-known Loc (from 92.15% to 95.41%) by alleviating the classification network's focus on the most discriminative parts.

Table 3 Localization accuracy and visualization results about inserting the generator after different layers on ResNet50

Hyperparameters \({\alpha }\), \({\beta }\), and \({\lambda }\) in the total loss. There are three hyperparameters in Eq. 8. Their effectiveness and sensitivity with respect to localization quality are analyzed on CUB-200-2011 and ILSVRC in Fig. 9. \(\alpha \) denotes the factor of \({\mathcal {L}}_{frg}\), and Fig. 9a shows that the presence of the foreground region guidance loss (\(\alpha \ge 0.2\)) significantly improves the localization accuracy by ensuring stable learning of the foreground activation maps on both datasets. \(\beta \) balances the foreground area constraint against background suppression. When \(\beta \) is small, more areas in the foreground activation map are activated, while a too large \(\beta \) suppresses the learning of the activation map. As shown in Fig. 9b, our method performs stably with high accuracy as \(\beta \) varies from 1.2 to 1.7 on CUB-200-2011 and from 1.6 to 2.4 on ILSVRC. \(\lambda \) denotes the factor of \({\mathcal {L}}_{bas}\); a larger \(\lambda \) means more regions in the prediction map are activated by background activation suppression. As shown in Fig. 9c, the localization accuracy continues to grow on CUB-200-2011 as \(\lambda \) increases from 0.3 to 0.6, and remains stable from 0.6 to 1.3 with less than 1% change in GT-known Loc, which shows that the proposed BAS approach can significantly improve the localization accuracy. In summary, although the loss function has three hyperparameters, it is easy to choose suitable values for \(\alpha \), \(\beta \), and \(\lambda \). In addition, we also provide the results of different combinations of hyperparameters on CUB-200-2011 and ILSVRC in Sect. 5.2.

Fig. 10

GT-known Loc (\(\%\)) \(\varvec{w.r.t}\) \(\varvec{k}\). Evaluation results of combining the Top-k prediction maps when the backbone is VGG16 and ResNet50 respectively

Hyperparameter k in the Top-k strategy. We evaluate the effect of the hyperparameter k in BAS. As shown in Fig. 10, the GT-known Loc accuracy on CUB-200-2011 improves for \(k>1\) compared to \(k=1\). For VGG16 and ResNet50, the highest localization accuracy is achieved at k = 80 and k = 200, respectively. This suggests that on CUB-200-2011 the Top-k strategy can obtain more complete localization results and further improve performance by integrating the localization maps of similar categories. In contrast, for both VGG16 and ResNet50 the best localization results on ILSVRC are obtained at \(k=1\), which reflects the high variability of classes on ILSVRC and the few localization similarities shared between categories.

Fig. 11

Comparison of background prediction maps learned from a original image or b feature maps

Table 4 Evaluation results in terms of MaxBoxAccV2 on the CUB-200-2011 and ILSVRC datasets using various backbones

Generator after different layers. We report the results of inserting the generator after different layers of ResNet50. As shown in Table 3 (left table), the quantitative results indicate that inserting the generator after layer 3 achieves the best results, significantly better than other positions. The prediction maps learned at different layers are visualized in Table 3 (right figure). When the generator learns localization information from shallow feature maps (layer 1 and layer 2), the prediction map performs better at the edges of objects, but it is insufficient to resist background distractions and has poor semantic learning ability. Conversely, learning localization information from the high-level features (layer 4) results in imprecise localization due to the limited resolution.

Original image vs. feature maps. We fix the generator after layer 3 and conduct experiments on the masking position (original image vs. feature maps) of the background prediction map. As illustrated in Fig. 11, masking the feature maps achieves higher accuracy and better coverage of the object in the localization results, while masking the original image yields results that focus more on the edges or texture of the object and are less able to locate smooth regions. This may be because the learning process in shallow layers usually focuses on common basic features (e.g., edges, textures) and ignores high-level semantic features (Table 4).

Fig. 12

Experimental comparison between the previous conference version and this work. a The \({\mathcal {L}}_{bas}\) training loss curves. b Visualization of the localization results

Comparison with the previous conference version. We compare the improvement over the conference version in both quantitative and qualitative terms. Benefiting from the adjusted position of the ReLU activation function, BAS can learn the feature maps more adequately and efficiently. As shown in Fig. 12a, we display the \({\mathcal {L}}_{bas}\) (Eq. 3) training loss curves over the training iterations for both the previous conference version and this work. The loss curve of this work converges to a lower point with a more stable trend, while the loss curve of the previous conference version even increases during the iterations. This indicates that the ReLU in the last layer (Fig. 4) makes the classification network learn the background region insufficiently, resulting in the inadequate convergence of the BAS loss. Figure 12b illustrates some localization maps supporting this analysis. Compared with the previous conference version, BAS (this work) is more robust in learning the background region and consequently improves the localization accuracy in Table 5. We achieve average GT-known Loc gains of 0.83% and 0.11% over the four backbone networks on CUB-200-2011 and ILSVRC, respectively, without additional parameters or computation.

4.4 Performance Analysis

In this section, we evaluate and analyze in detail the localization quality and segmentation quality of BAS.

Table 5 Improvement in GT-known Loc compared to the previous conference version
Fig. 13

Statistical analysis of correct bounding boxes, based on ResNet50 (CAM (Zhou et al., 2016), ADL (Choe et al., 2020a), and DA-WSOL (Zhu et al., 2022))

Fig. 14

Visualization of the initial seed generated by CAM and the proposed BAS on the PASCAL VOC 2012 dataset

Fig. 15

Segmentation Quality. IoU-Threshold curves for different baseline methods and evaluation results of PIoU, PxAP on CUB-200-2011 (Wah et al., 2011) and OpenImages (Choe et al., 2020b) datasets, based on ResNet50 (CAM (Zhou et al., 2016), ADL (Choe et al., 2020a), and DA-WSOL (Zhu et al., 2022))

Fig. 16

Examples of semantic segmentation results on PASCAL VOC 2012 for IRN and BAS (with IRN)

Fig. 17

Visualization of the initial seed generated by CAM and the proposed BAS on the MS COCO 2014 dataset

Localization Quality. Table 4 shows the MaxBoxAccV2 scores compared with other methods on CUB-200-2011 (Wah et al., 2011) and ILSVRC (Russakovsky et al., 2015). The quantitative experiments indicate that our method achieves the best results for different backbone networks and datasets under the MaxBoxAccV2 criterion, which proves the high quality of the bounding boxes generated by BAS and verifies the effectiveness and generalizability of the proposed method. In particular, on CUB-200-2011, we exceed the previous best methods by 2.1% and 5.6% with the VGG16 and ResNet50 backbones, respectively. Besides, in Fig. 13, we present a statistical analysis of IoU based on ResNet50, plotting the IoU distribution between the predicted bounding boxes and the ground-truth boxes for correct localizations, following DANet (Xue et al., 2019). On CUB-200-2011, we achieve a median IoU of 78.7%, exceeding the latest state-of-the-art method DA-WSOL (Zhu et al., 2022) by 12.6%, and correspondingly by 3.0% on ILSVRC. From the median IoU and the IoU distribution, it can be seen that the proposed BAS significantly improves the localization quality on both CUB-200-2011 and ILSVRC (Fig. 14).

Segmentation Quality. We compare the localization map with the ground-truth mask using two metrics, PIoU and PxAP, following DA-WSOL (Zhu et al., 2022). As shown in Fig. 15 (left table), we evaluate the proposed BAS against CAM (Zhou et al., 2016), ADL (Choe et al., 2020a), and DA-WSOL (Zhu et al., 2022) on ResNet50. Compared to DA-WSOL, BAS achieves significant and consistent improvement, with a 15.06% increase in PIoU and a 15.24% increase in PxAP on CUB-200-2011. The proposed method also surpasses all methods on OpenImages, although OpenImages is a more challenging dataset with many small objects and complex backgrounds. In addition, we present the IoU-Threshold curves in the right graph of Fig. 15, which plot the IoU values at varying thresholds within [0, 255]. As the curves on both datasets show, our method is less sensitive to the threshold and achieves better results at arbitrary thresholds compared to other methods, which indicates that the localization map produced by BAS has fewer low-confidence regions and is closer to the ground-truth object region.

5 Experiments on Weakly Supervised Semantic Segmentation

5.1 Experimental Setup

Datasets and Evaluation Metric. To evaluate the performance of BAS on the weakly supervised semantic segmentation task, we conduct experiments on the commonly used PASCAL VOC 2012 (Everingham et al., 2010) and MS COCO 2014 (Lin et al., 2014) datasets. PASCAL VOC 2012 contains 21 categories (including one background class), with 1464, 1449, and 1456 samples in the training, val, and test sets, respectively. Following the common experimental protocol (Chen et al., 2014), the training set is augmented with 10,582 weakly annotated images provided by the SBD dataset (Hariharan et al., 2011). The MS COCO 2014 dataset has 81 semantic classes (including one background class). Following Lee et al. (2022a) and Jiang et al. (2022), images without the target categories are removed from the dataset, leaving 82,081 training images and 40,137 validation images. We use the mean Intersection-over-Union (mIoU) as the evaluation metric for all experiments (Figs. 16, 17).

Implementation Details. For seed generation, the input image is resized to 512\(\times \)512, then augmented by horizontal flipping and random cropping to 448\(\times \)448. We train the network for 10 epochs. The batch size is set to 16 and 64 on PASCAL VOC 2012 and MS COCO 2014, respectively. To optimize the network, the SGD optimizer is adopted with a momentum coefficient of 0.9. The initial learning rate is set to 0.005 and decayed following the poly policy \(lr = lr_{\text {init}}(1-itr/max\_itr)^\rho \) with \(\rho \) = 0.9. Following Lee et al. (2022a) and Xie et al. (2022a), we use ResNet50 as the backbone network to generate the initial seed for both PASCAL VOC 2012 and MS COCO 2014.

Seed Refinement and Segmentation. For seed refinement, to make a fair comparison, we follow Lee et al. (2022a), Lee et al. (2021a), and Chen et al. (2022b) in using IRN (Ahn et al., 2019) to improve the quality of the initial seed. After generating pseudo masks, we select DeepLabV2 (Chen et al., 2017) with ResNet-101 (He et al., 2016) as the segmentation network, following Xie et al. (2022a) and Jo and Yu (2021). We adopt the default settings to train DeepLabV2 as in Lee et al. (2022a), with weights pretrained on MS COCO 2014.

Table 6 Ablation study for the components of BAS on PASCAL VOC 2012 and MS COCO 2014

5.2 Ablation Study

In this section, we perform a series of ablation experiments with ResNet50 as the backbone on PASCAL VOC 2012 and MS COCO 2014. We first conduct an ablation study on the loss composition of BAS; as in Sect. 4.3, we take \({\mathcal {L}}_{cls}\), \({\mathcal {L}}_{frg}\), and \({\mathcal {L}}_{ac}\) together as the baseline for the Foreground-Prediction-Map-based architecture. As Table 6 shows, adding \({\mathcal {L}}_{bas}\) significantly improves the segmentation quality of the baseline, with 7.6% and 4.4% mIoU gains on PASCAL VOC 2012 and MS COCO 2014, respectively, which verifies the effectiveness of the proposed \({\mathcal {L}}_{bas}\) in capturing object regions relevant to classification.

Fig. 18

Hyperparameters. a \(\varvec{\alpha }\) for foreground region guidance loss \({\mathcal {L}}_{frg}\). b \(\varvec{\beta }\) for area constraint loss \({\mathcal {L}}_{ac}\). c \(\varvec{\lambda }\) for background activation suppression loss \({\mathcal {L}}_{bas}\)

Table 7 Effect of different combinations of hyperparameters on WSOL and WSSS with ResNet50 backbone

Hyperparameters \({\alpha }\), \({\beta }\), and \({\lambda }\) in the total loss. Figure 18 illustrates the sensitivity of the segmentation quality to the hyperparameters \(\alpha \), \(\beta \), and \(\lambda \) on PASCAL VOC 2012 and MS COCO 2014. \(\alpha \) is the coefficient of \({\mathcal {L}}_{frg}\), and a small \(\alpha \) already enables \({\mathcal {L}}_{frg}\) to work well. As shown in Fig. 18a, the mIoU result improves significantly on PASCAL VOC 2012 when \(\alpha \) is greater than 0.1 and varies very little in the interval 0.15 to 0.5, with less than 0.3% mIoU change. \({\mathcal {L}}_{ac}\) constrains the foreground area to avoid its unlimited expansion; therefore, if the coefficient \(\beta \) of \({\mathcal {L}}_{ac}\) is too small, too many regions are activated, drastically reducing the segmentation performance, as shown in Fig. 18b. The purpose of \({\mathcal {L}}_{bas}\) is to let the localization map learn regions contributing to classification in a background activation suppression manner. As shown in Fig. 18c, the mIoU result remains stable on both datasets when the factor \(\lambda \) of \({\mathcal {L}}_{bas}\) is in the range 0.8 to 1.2.

Although the total loss includes three hyperparameters, in practice we simply follow the principle \(\beta = \alpha +\lambda \), so that \({\mathcal {L}}_{ac}\) is balanced with the losses \({\mathcal {L}}_{frg}\) and \({\mathcal {L}}_{bas}\). Meanwhile, in searching for the most suitable ratio between \(\alpha \) and \(\lambda \), \(\lambda \) is fixed at 1 on both WSOL and WSSS for simplicity. Therefore, once \(\alpha \) is determined, \(\beta \) and \(\lambda \) are also determined. In Table 7, we provide the results of different combinations of hyperparameters on five datasets. When \(\alpha \) changes from 0.2 to 1.5, the effect on the results is limited, with less than 0.6% change. In fact, the setting \(\alpha =0.5, \beta =1.5, \lambda =1.0\) is feasible for all datasets, with very little change compared to the results reported in the paper. These experiments illustrate that it is easy to find a suitable set of hyperparameters on different datasets.

Table 8 The mIoU results of inserting the generator after different layers with ResNet50 backbone

Generator after different layers. In Table 8, we report the mIoU results of inserting the generator after different layers of ResNet50. Since the generator contains only one convolution layer, the semantic representation of the generated localization map depends mainly on the reused backbone part. Therefore, inserting the generator after layer 1 or layer 2 results in insufficient semantic representation and poor segmentation performance, as presented in Table 8. In addition, inserting the generator after layer 4 does not perform better than layer 3, reducing mIoU by 2.4% and 2.1% on PASCAL VOC 2012 and MS COCO 2014, respectively. This is mainly because the feature maps of layer 4 are usually coarser than those of layer 3, hindering fine segmentation results.

Table 9 Effects of applying BAS on different baseline methods, including mIoU of the initial seed (Seed) and the pseudo ground-truth mask (Mask) on the PASCAL VOC 2012 training set
Table 10 Semantic segmentation performance gains for per-class on PASCAL VOC 2012

5.3 Results on PASCAL VOC 2012 Dataset

Quality of Initial Seed and Pseudo Labels. Table 9 compares the quality of the initial seeds and the pseudo ground-truth masks on the PASCAL VOC 2012 training set. For the initial seed, we achieve 57.7% mIoU, exceeding previous methods by a large margin. The state-of-the-art method CLIMS (Xie et al., 2022a) uses both ResNet50 and CLIP (Radford et al., 2021) networks in the seed generation phase, whereas BAS uses only a ResNet50 network and still achieves a 1.1% gain. Furthermore, by normalizing the seeds generated by our method and by a baseline method and adding them together, BAS can be combined with various baseline methods and significantly improves their segmentation quality by providing high-quality foreground prediction maps. As shown in Table 9, the proposed BAS improves IRN (Ahn et al., 2019) by a remarkable 9.4% mIoU, and we achieve the best result of 59.8% mIoU when applying BAS to AdvCAM (Lee et al., 2022a). For a fair comparison, Table 9 also reports adding together the initial seeds of the other methods; combining with BAS clearly brings a larger improvement than combining with any of them, because BAS produces high and balanced responses on the object, which benefits the other methods significantly. We report the per-class mIoU in Table 10. Although our method achieves consistent improvements over the above baselines, it does not perform well in some categories. This is because the classification network has difficulty distinguishing objects from class-related context, e.g., boats and water, or TVs and the programs shown on them, which in turn limits the localization ability of BAS. Figure 14 shows a visual comparison of the initial seeds generated by BAS and IRN; our method clearly captures the whole object area with high confidence. For the pseudo ground-truth masks, after refinement by IRN (Ahn et al., 2019), we achieve gains of 4.8%, 3.3%, and 1.6% when BAS is deployed on IRN, CDA, and AdvCAM, respectively, which illustrates the effectiveness of the proposed method. BAS enables a better foreground-background segmentation and thus provides strong support for the seed generation stage of the WSSS task.
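The seed-combination step mentioned above can be sketched as follows. Per-image min-max normalization is our assumption for "normalizing", and all variable names are illustrative.

```python
import torch

def min_max_normalize(seed, eps=1e-8):
    # Per-image min-max normalization to [0, 1] over the spatial dimensions
    s = seed - seed.amin(dim=(-2, -1), keepdim=True)
    return s / (s.amax(dim=(-2, -1), keepdim=True) + eps)

bas_seed = torch.rand(1, 21, 56, 56)       # placeholder BAS seed maps
baseline_seed = torch.rand(1, 21, 56, 56)  # placeholder baseline (e.g. IRN) seeds
combined_seed = min_max_normalize(bas_seed) + min_max_normalize(baseline_seed)
```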

Table 11 Performance comparison of WSSS methods in terms of mIoU (%) on the PASCAL VOC 2012 val and test sets

Quality of Segmentation. To further validate the effectiveness of our method, we employ the pseudo segmentation labels to directly train a semantic segmentation network. Table 11 presents the segmentation results of the proposed BAS (with IRN) and other methods on the PASCAL VOC 2012 dataset. Our BAS exceeds previous methods under the same level of supervision, achieving 69.6% and 69.9% mIoU on the val and test sets, respectively. Compared to the latest method ReCAM (Chen et al., 2022b) with the same backbone network, we achieve mIoU improvements of 1.1% on the val set and 1.5% on the test set. We also show qualitative segmentation results in Fig. 16. Compared with IRN, BAS is more robust to various challenging scenarios, such as objects of various sizes, complex environments, and multi-instance situations.

Fig. 19 Examples of semantic segmentation results on MS COCO 2014 for IRN and BAS (with IRN)

Table 12 Evaluation results on MS COCO 2014 validation set

5.4 Results on MS COCO 2014 Dataset

Table 12 compares the accuracy of the proposed method and other state-of-the-art approaches on the MS COCO 2014 validation set. Our BAS based on IRN achieves 45.1% mIoU, exceeding all previous methods. Compared to the previous best model AdvCAM, which adopts ResNet101 as the backbone network, we use the smaller ResNet50 yet achieve better results. In particular, we surpass our baseline method IRN (Ahn et al., 2019) by 3.7% mIoU. Figure 17 shows a visual comparison of the initial seeds obtained by CAM (Zhou et al., 2016) and by our method: BAS captures more object area than CAM, especially for large objects and multiple instances, and achieves balanced, comprehensive responses on the target regions across various categories. Figure 19 shows examples of semantic segmentation masks on MS COCO 2014 produced by IRN and by BAS (with IRN). Applied to IRN, our method achieves more accurate segmentation and a better demarcation between different objects, because the proposed BAS provides a more complete and accurate seed region than IRN.

5.5 Analysis

In this section, we explore how to fully leverage BAS, focusing on its core background activation suppression loss, and how to enhance the segmentation capability of BAS by integrating it with other methods.

Class-agnostic foreground map. Different from CAM-based approaches, which extract class activation maps from the classifier, the proposed BAS obtains localization maps through an extra generator. In addition to generating class-specific localization maps, BAS can also produce a class-agnostic foreground map given a suitable objective: we treat all classes present in the image as a single foreground class and sum the \({\mathcal {L}}_{bas}\) of the existing classes to supervise the foreground map. In this way, the foreground map can be fully trained on the entire dataset. As shown in Fig. 20, the class-agnostic foreground map localizes objects more completely and robustly than the class-specific localization map and generates less noise. However, the foreground map cannot distinguish objects of different categories and often identifies objects outside the target classes, as shown in Fig. 20e. To use the foreground map to improve class-specific localization maps, we follow the intuition that the foreground map usually covers all class-specific localization maps and has higher segmentation quality. If the class-specific localization map has a higher response than the foreground map in some regions, this is likely caused by noise or confusing background, as shown in Fig. 20c, d. We therefore weaken the response in these regions, either by directly replacing it with the response of the foreground map or by averaging the responses of both maps. The experimental results in Table 13 show that both strategies improve the quality of the initial seed and hence the accuracy of the pseudo ground-truth mask. The best results are achieved by the averaging approach, which not only reduces the response of the uncertain regions but also combines the prediction probabilities of the class-agnostic and class-specific maps, improving the initial seed and pseudo ground-truth mask by 0.6% and 0.5% mIoU, respectively.
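The two fusion strategies can be written compactly as below. The function names are ours; cls_map denotes the class-specific localization map and frg_map the class-agnostic foreground map, assumed normalized to [0, 1] and spatially aligned.

```python
import torch

def fuse_replace(cls_map, frg_map):
    # Where the class-specific response exceeds the foreground map,
    # replace it with the lower foreground response.
    return torch.where(cls_map > frg_map, frg_map, cls_map)

def fuse_average(cls_map, frg_map):
    # Average the two responses in those suspicious regions instead.
    return torch.where(cls_map > frg_map, 0.5 * (cls_map + frg_map), cls_map)
```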

Fig. 20 Class-agnostic foreground map vs. class-specific localization map in the following five aspects: a Completeness. b Connectivity. c Less noise. d Identifying class-related background. e Class awareness

Improve the quality of BAS. As analyzed in Sect. 5.3, BAS does not perform well in some categories, usually because a co-occurring context supports the classification decision, causing the localization map to learn the context. To alleviate this problem, we apply the proposed BAS to W-OoD (Lee et al., 2022b), which uses additional out-of-distribution data to address spurious correlations with the background, such as boat-water and aeroplane-sky/runway. As presented in Table 14, benefiting from the strong discriminative ability of the classification network in W-OoD, BAS achieves better performance, with a 1.8% mIoU improvement on the initial seed, including 16.0% and 7.1% mIoU gains on the boat and aeroplane categories, respectively. After applying IRN and DeepLabV2, BAS w/ W-OoD obtains 71.3% and 71.1% mIoU on the PASCAL VOC 2012 val and test sets. In addition, we combine BAS with CLIMS (Xie et al., 2022a) to suppress the co-occurring background using the natural language supervision in CLIP (Radford et al., 2021), which also significantly improves the quality of the initial seed and brings an 8.9% boost in the boat category. Consequently, BAS w/ CLIMS obtains 70.6% and 70.9% mIoU on the val and test sets, substantially improving the segmentation ability of BAS.

Finding the image-specific threshold by BAS. Unlike CAM, the proposed BAS designs a set of loss functions to evaluate the quality of the localization map during training, and these functions are equally suitable for the testing phase. As shown in Fig. 22a, the unbalanced response of CAM makes the segmentation performance heavily dependent on the threshold, and the optimal threshold varies significantly across images, so a single global threshold for the whole dataset is clearly inappropriate. We therefore propose to find an image-specific threshold by employing the background activation suppression loss \({\mathcal {L}}_{bas}\) and the area constraint loss \({\mathcal {L}}_{ac}\) to evaluate candidate thresholds. As illustrated in Fig. 22b, we obtain a series of binary masks by varying the threshold and feed them into the AMC module to compute \({\mathcal {L}}_{bas}\) and \({\mathcal {L}}_{ac}\), following the same process as in Fig. 2. We then simply add \({\mathcal {L}}_{bas}\) and \({\mathcal {L}}_{ac}\) as the evaluation score and select the binary mask with the smallest score as the final result. Table 15 compares this image-specific threshold post-processing with a global threshold for different methods on the PASCAL VOC 2012 training set. The proposed post-processing improves segmentation quality by providing feedback on candidate thresholds and selecting the best one for each image, especially for CAM (Zhou et al., 2016) and CDA (Su et al., 2021), bringing 0.7% and 0.5% mIoU improvements, respectively. However, the gain is limited when applied to the proposed BAS itself, mainly because BAS produces few uncertain regions and is not very sensitive to the threshold.
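A hedged sketch of this search follows. Here amc_module is a stand-in for the AMC pipeline of Fig. 2, assumed to return the two evaluation losses for a given image and binary mask; the threshold grid and all names are ours.

```python
import torch

@torch.no_grad()
def image_specific_threshold(loc_map, image, amc_module, num_steps=19):
    best_mask, best_score = None, float("inf")
    for t in torch.linspace(0.05, 0.95, num_steps):
        mask = (loc_map > t).float()           # binarize at candidate threshold
        l_bas, l_ac = amc_module(image, mask)  # evaluation losses from AMC
        score = (l_bas + l_ac).item()          # summed evaluation score
        if score < best_score:                 # keep the lowest-scoring mask
            best_score, best_mask = score, mask
    return best_mask
```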

Table 13 Applying the class-agnostic foreground map to the class-specific localization maps with different strategies on PASCAL VOC 2012
Table 14 Applying BAS to CLIMS and W-OoD on PASCAL VOC 2012
Fig. 21 Density distribution maps of IoU versus object size. For WSOL, the experiment is conducted on ILSVRC and bounding boxes are used to measure IoU and object size. For WSSS, results are computed from pixel-level masks on the PASCAL VOC 2012 training set at the seed phase

6 Discussion

Limitation. In this section, we discuss the localization ability of BAS for objects of different sizes. We first visualize the density distribution of IoU for BAS and CAM (Zhou et al., 2016) in Fig. 21. BAS performs better on medium and large objects but underperforms on small ones. We attribute this to two main factors: first, localizing small objects is an inherently hard problem in computer vision; second, the area constraint loss penalizes objects of different sizes unequally, penalizing small objects less, so BAS cannot balance both large and small objects with only a single hyperparameter for the area constraint loss.

Future Works. In the future, we plan to pursue two main directions: (1) improving localization capability across different object sizes, and (2) further extending the applications of BAS.

To address the inconsistent localization ability across object sizes, we would like to explore the following promising research directions: (1) the area constraint loss can be improved to allow different tolerances for objects of various sizes; (2) since WSOL works better for localizing large objects, we can determine the approximate region of the object in a first stage, then crop and resize that region so the original small object becomes larger, and perform localization in a second stage, as sketched below.
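The second direction is illustrative only, since it is future work; all names here are hypothetical, and the image is assumed to be an NCHW tensor.

```python
import torch
import torch.nn.functional as F

def two_stage_localize(image, rough_box, localizer, out_size=224):
    x1, y1, x2, y2 = rough_box                 # rough region from stage one
    crop = image[..., y1:y2, x1:x2]            # crop the approximate region
    crop = F.interpolate(crop, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)
    return localizer(crop)                     # stage-two localization map
```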

Table 15 Effect of applying the image-specific threshold on different methods, compared to a global threshold, on the PASCAL VOC 2012 training set
Fig. 22 Finding the image-specific threshold by BAS. a IoU-threshold and loss-threshold curves, where loss denotes the sum of \({\mathcal {L}}_{bas}\) and \({\mathcal {L}}_{ac}\). b Process of finding the image-specific threshold using \({\mathcal {L}}_{bas}\) and \({\mathcal {L}}_{ac}\) as the evaluation

Apart from the above possible improvements, BAS can also be extended to weakly supervised instance segmentation (WSIS), since obtaining a high-quality localization map is equally essential for WSIS.

7 Conclusion

In this paper, we find that previous FPM-based works use cross-entropy to facilitate the learning of foreground prediction maps, which essentially works by changing the activation value, and that the activation value correlates more strongly with the foreground mask. Thus, we propose a background activation suppression (BAS) approach that promotes the generation of foreground maps through an activation map constraint (AMC) module, which facilitates the learning of foreground prediction maps mainly by suppressing background activation. Extensive experiments on CUB-200-2011 and ILSVRC verify the effectiveness of the proposed BAS, which surpasses previous methods by a large margin. In addition, BAS can be extended to WSSS, enhancing the seed quality of other methods by providing high-quality foreground maps, and achieves state-of-the-art performance on PASCAL VOC 2012 and MS COCO 2014.