1 Introduction

Weakly supervised object localization (WSOL) aims to localize objects in a scene when only image-level labels, rather than bounding box annotations, are available during training. Owing to the reduced cost of manual labeling and the potential to exploit the vast number of weakly-annotated images in public datasets and on the Web, WSOL is gaining more and more attention in the research community (Selvaraju et al., 2020; Zhai et al., 2022; Luo et al., 2022; Zhang et al., 2021b). Moreover, it can serve various downstream tasks, such as weakly supervised object detection (WSOD) (Song et al., 2021; Zhang et al., 2020b, 2019) and weakly supervised semantic segmentation (WSSS) (Ru et al., 2022a; Chan et al., 2021; Pan et al., 2022).

This paper proposes an effective approach for WSOL and its downstream task WSSS. The two tasks are similar in that both use image-level labels as supervision and need to obtain a high-quality pixel-level localization map from a classification network. In fact, WSSS can be implemented by directly training a fully supervised semantic segmentation network with the localization maps generated by WSOL as pseudo labels. For these reasons, both tasks face the same challenge: establishing effective supervision between image-level labels and pixel-level localization maps.

Previously, most WSOL and WSSS methods utilized the class activation map (CAM) (Zhou et al., 2016) to extract a localization map from the classifier. While CAM can localize approximate object regions, it tends to capture the most discriminative regions rather than the overall extent of the object, resulting in limited localization performance. Numerous CAM-based approaches have therefore been proposed to alleviate this problem. Adversarial erasing methods (Singh & Lee, 2017; Zhang et al., 2018a; Choe & Shim, 2019; Mai et al., 2020; Yun et al., 2019) erase the most discriminative regions during training, forcing the network to learn more object features that facilitate complete localization. Some methods (Zhang et al., 2020d; Pan et al., 2021; Lee et al., 2022a) improve the localization performance of CAM by establishing pixel-level spatial and semantic correlations. Other methods (Zhang et al., 2018b; Wei et al., 2021; Kolesnikov & Lampert, 2016) adopt the idea of region growing to spread confident regions and mine relevant features.

Although CAM-based methods can conveniently extract the localization map from the classifier, they suffer from limitations and conflicts in optimization, since the classifier has to perform both localization and classification. Very recently, a CAM-independent paradigm (Meng et al., 2021; Xie et al., 2021) was devised for WSOL that achieves localization with a foreground prediction map (FPM) obtained directly from a generator, allowing the two tasks to be accomplished separately within a unified model. For example, ORNet (Xie et al., 2021) is a two-stage approach that first trains a classification network as an evaluator and then uses CE loss to guide the learning of the generator by masking the original image with the foreground prediction map. In FAM (Meng et al., 2021), by contrast, the foreground prediction map is split into several parts that separately mask high-level feature maps, so that different regions are learned through CE loss. Despite their promising performance, FPM-based methods still suffer from incomplete object localization.

Fig. 1

A Experimental procedure and related definitions. B The entropy value of the CE loss w.r.t. the foreground mask and the foreground activation value w.r.t. the foreground mask. C The results with statistical significance. Implementation details of the experiment and further results are available in Sect. 3.5

To better understand FPM-based methods, we focus on exploring the entropy value of the CE loss (entropy) with respect to (w.r.t.) the foreground mask. As shown in Fig. 1A, by changing the area of the foreground mask and masking the feature map, the relationship between the entropy and the foreground mask area is plotted in Fig. 1B. An important phenomenon can be observed: there is a “mismatch” between the entropy and the ground-truth mask, i.e., the entropy is already close to zero when the foreground mask retains only part of the object region, which indicates that entropy cannot force the foreground map to learn the complete object area. The reason is that the exponential form of softmax amplifies the discrepancy in activation values and drives premature convergence of the entropy. To find a better factor to facilitate localization learning, we further explore the activation value (before the softmax calculation) w.r.t. the foreground mask. As shown in Fig. 1B, there is a stronger “correlation” between the activation value and the foreground mask, i.e., the activation value tends to saturate only when the mask expands to the object boundary. This suggests that better localization ability can be learned by optimizing the activation value. Figure 1C confirms the generality of these phenomena in a statistical sense.

Inspired by the above exploratory analysis, a straightforward way to obtain a complete foreground prediction map is to maximize the activation value. However, since a minimization problem is more conducive to training stability and loss convergence than a maximization problem, this paper proposes a novel alternative: learn a background prediction map by minimizing the background activation value, and then obtain the accurate foreground prediction map by inversion. Indeed, the statistics on background activation values in Fig. 1C are “symmetric” to the statistics on activation values, with both converging at the ground-truth mask area, which further supports the feasibility of background activation value suppression.

Fig. 2

The architecture of the proposed background activation suppression (BAS) in the training phase. The class-specific foreground prediction map \({\textbf{M}}_{f}\) and the coupled background prediction map \({\textbf{M}}_{b}\) are obtained by the generator according to the ground-truth (\(\textbf{GT}\)) class, and then fed into the Activation Map Constraint module together with the feature maps \({\textbf{F}}\)

In this paper, we propose a simple but effective Background Activation Suppression (BAS) method. As shown in Fig. 2, our method includes three modules: an extractor, a generator, and an Activation Map Constraint (AMC) module. First, the extractor extracts image features for subsequent localization and classification. The generator produces a class-specific foreground prediction map for localization. The coupled background prediction map is then obtained by inverting the foreground prediction map, and both are fed into AMC for localization training. The AMC is supervised by four losses: a background activation suppression loss, an area constraint loss, a foreground region guidance loss, and a classification loss. The most important one is the background activation suppression loss, which is devised to promote the learning of the generator by minimizing the ratio of the background activation value to the overall activation value (the activation value generated by the entire image). In the inference phase, the Top-k prediction maps are selected based on the predicted category probabilities, and their average is adopted as the final localization result. The main contributions of this paper can be summarized as follows:

(1):

This paper identifies that the essential reason why minimizing the CE loss facilitates the generation of the foreground map is that it indirectly increases the foreground activation value, and accordingly proposes to promote the generation of the foreground prediction map by suppressing the background activation value.

(2):

This paper proposes a simple but effective Background Activation Suppression (BAS) approach that facilitates the generation of the foreground map through an Activation Map Constraint (AMC) module in a weakly supervised manner. The AMC is composed of four losses, including the background activation suppression loss, which together drive the generation of the foreground prediction map for localization.

(3):

Extensive experiments on both CUB-200-2011 (Wah et al., 2011) and ILSVRC (Russakovsky et al., 2015) benchmarks demonstrate that our method achieves consistent and significant improvement in terms of GT-known/Top-1/Top-5 Loc. In addition, the proposed BAS approach can be extended to Weakly Supervised Semantic Segmentation (WSSS) task, which also achieves new state-of-the-art results on PASCAL VOC 2012 (Everingham et al., 2010) and MS COCO 2014 (Lin et al., 2014) datasets.

This paper builds upon our conference version (Wu et al., 2021), which has been extended in six distinct aspects. (1) We explain the advantages of Background Activation Suppression and its generalizability (on more complex datasets) in more detail and more comprehensively (in a statistical sense), see Fig. 6 and Sect. 3.5. (2) To alleviate the problem of inadequate convergence of the BAS loss (Fig. 12), we focus on the location of the ReLU function, which is closely related to the activation value, and improve the previous BAS after this exploration, see Fig. 4 and Sect. 3.2. (3) To verify the extensibility of the BAS approach, we develop a Weakly Supervised Semantic Segmentation (WSSS) framework with the proposed BAS in Sect. 5. The framework enhances the quality of the seed generation process in the popular WSSS pipeline through BAS, resulting in better performance on the WSSS task, as shown in Tables 9, 11 and 12. (4) To exploit the advantage of BAS on WSSS of obtaining localization maps through a generator, we propose to produce a class-agnostic foreground map using BAS and combine it with the class-specific maps to improve the quality of the initial seed, see Fig. 20 and Table 13. (5) To further improve the segmentation quality, we propose to apply the losses of BAS as evaluation scores in the inference phase to assess each threshold and find the image-specific threshold on WSSS, see Fig. 22 and Table 15. (6) We have made substantial efforts to improve the presentation (e.g., motivation, related illustrative diagrams, formulation, experimental analysis, key results) and organization of the paper. Several sections have been refined to improve readability and to provide more detailed explanations of the motivation, quantitative/qualitative comparisons, and discussions.

The rest of this paper is organized as follows. Section 2 describes existing works related to WSOL and WSSS. The detailed method is described in Sect. 3. Sections 4 and 5 present the experimental results of WSOL and WSSS, respectively. Limitation and future work are discussed in Sect. 6. Finally, we conclude our work in Sect. 7.

2 Related Work

2.1 Weakly Supervised Object Localization

Weakly supervised object localization (WSOL) is a challenging task that requires localizing objects using only image-level labels. To obtain localization results from a classification network, CAM (Zhou et al., 2016) proposes to replace the top layers with global average pooling and multiply the fully connected weights with the deep feature maps to generate a class activation map (CAM) as the localization map. Unfortunately, CAM usually focuses on the most discriminative regions. To alleviate this problem, a series of methods propose erasing strategies. HaS (Singh & Lee, 2017) splits the original image into different patches and randomly masks part of them, forcing the classification network to learn more features of objects. ACoL (Zhang et al., 2018a) and EIL (Mai et al., 2020) erase areas with high response in the feature map and use two parallel branches for adversarial erasing. Differently, ADL (Choe & Shim, 2019) erases the most significant regions of each layer during forward propagation to achieve a balance between classification and localization. CutMix (Yun et al., 2019) adopts a data augmentation strategy that mixes two different images to force the network to learn relevant regions of different objects.

In addition, another class of approaches adopts the idea of spreading confident regions to mine relevant features. SPG (Zhang et al., 2018b) uses thresholds to filter foreground and background regions with high confidence from CAM to guide shallow network learning. Further, SPOL (Wei et al., 2021) generates more reliable confident regions by a multiplicative feature fusion strategy and trains a full segmentation network with the confident regions as pseudo labels. I2C (Zhang et al., 2020d) proposes to increase the robustness and reliability of localization by considering the correlation of different pictures from the same class. Besides, SPA (Pan et al., 2021) uses a post-processing approach to extract structure-preserving feature maps. SLT (Guo et al., 2021) treats several similar classes as one class when generating the classification loss and localization maps, which alleviates the problem of focusing on the most discriminative regions by strengthening learning tolerance. DA-WSOL (Zhu et al., 2022) aligns the feature distributions between the image and pixel domains with the idea of domain adaptation.

Most recently, two Foreground-Prediction-Map-based works (Xie et al., 2021; Meng et al., 2021) both achieve the localization task by generating a foreground prediction map. ORNet (Xie et al., 2021) uses a two-stage approach, where an encoder-decoder layer is inserted in the shallow part of the network as a generator and trained via the classification task in the first stage. In the second stage, the parameters of the classification network are fixed as an evaluator, and the foreground prediction map output by the generator is used to mask the image. The masked image is then fed into the evaluator for classification training, so that the foreground prediction map can learn the object region. FAM (Meng et al., 2021) utilizes a Foreground Memory Mechanism to store different foreground classifiers and generate a class-agnostic foreground prediction map. The foreground prediction map is split into several specific parts, which are used to mask the feature map to obtain different part-aware feature maps. After classification training with the corresponding foreground classifiers, the class-agnostic foreground map is forced to learn different object regions. Note that both ORNet (Xie et al., 2021) and FAM (Meng et al., 2021) consider only foreground regions and use cross-entropy to facilitate the learning of the generator. Different from these methods, this paper proposes a background activation suppression strategy to learn foreground prediction maps in a simple but effective way.

2.2 Weakly Supervised Semantic Segmentation

Weakly supervised semantic segmentation (WSSS) aims to alleviate the reliance on pixel-level ground-truth labels by using weak labels instead. Existing WSSS methods usually include the following three stages: (1) obtaining a high-quality initial seed; (2) refining the seed and generating pseudo labels; (3) training a full segmentation network with the pseudo labels. Generating a high-quality pixel-level localization map is thus also crucial for WSSS, similar to WSOL.

Seed Generation. Extracting CAMs is arguably the most common and convenient approach to generating the initial seed, despite the problem that only the discriminative regions are highlighted. To alleviate this issue, some methods propose to improve the quality of CAM by iterative manipulation. AE-PSL (Wei et al., 2017) performs iterative training steps to mine more object-related regions with adversarial erasure. RIB (Lee et al., 2021a) applies a post-processing method to fine-tune the classification model and obtain CAMs by iteration. AdvCAM (Lee et al., 2022a) proposes an anti-adversarial approach to continuously identify more object areas. Besides, a category of methods tries to improve the classification learning process. CONTA (Zhang et al., 2020c) aims to avoid contextual confusion by proposing a structural causal model to analyze the causalities among images, contexts, and class labels. SEAM (Wang et al., 2020b) applies consistency regularization on CAMs across differently sized images to mitigate the supervision gap issue. ReCAM (Chen et al., 2022b) proposes to use softmax cross-entropy loss to suppress the response of different categories to the same receptive field. CLIMS (Xie et al., 2022a) utilizes the CLIP (Radford et al., 2021) model to assist the network in activating more complete object regions. GAIN (Li et al., 2018) uses Grad-CAM to obtain localization maps and improves them by exploiting the prediction scores of the network as supervision. In contrast, BAS builds on the FPM paradigm and, motivated by our experimental observations, proposes a more essential and effective background activation suppression loss than the cross-entropy used in FPM-based methods.

Mask Generation. The initial seed is usually coarse and needs to be refined. Some researchers adopt the idea of region growing to spread the initial seed. SEC (Kolesnikov & Lampert, 2016) proposes three principles: seed, expand, and constrain. The initial seed is expanded during segmentation training and constrained to the object boundaries. PSA (Ahn & Kwak, 2018) trains a deep network to predict semantic affinity between pairs of adjacent image coordinates and propagates the semantics by random walk (Lovász, 1993). IRN (Ahn et al., 2019) predicts a transition probability matrix from the boundary activation map and generates pseudo masks in a similar way to PSA.

3 Methodology

In this section, we first introduce the main architecture of the network and the definition of the symbols in Sect. 3.1. Then we describe the structure of the AMC module, including the form of the four loss functions, and the improvement of BAS compared to the previous conference version in Sect. 3.2. The total loss functions for WSOL and WSSS are listed in Sects. 3.3 and 3.4, respectively. Finally, we provide specific details of the exploratory experiments and statistical results on three different datasets in Sect. 3.5.

3.1 Overview

Based on the experimental observations, we enhance the completeness of the localization map for WSOL by proposing a background activation suppression (BAS) approach. As shown in Fig. 2, BAS consists of three modules: an extractor, a generator, and an activation map constraint (AMC) module. The extractor extracts features related to classification and localization. The generator produces the foreground prediction maps. The AMC module promotes the learning of the extractor and generator through four kinds of losses.

Specifically, we divide the original backbone network into two sub-networks \({\mathcal {F}}_1\) and \({\mathcal {F}}_2\) according to the location of the generator, and denote the network parameters by \(\Theta \). The sub-network \({\mathcal {F}}_1\) before the generator is used as a feature extractor. Given an image \({\textbf{I}}\), the feature maps \({\textbf{F}} \in {\mathbb {R}}^{H \times W \times N}\) are generated by the extractor \({\mathcal {F}}_1({\textbf{I}}, \Theta _1)\) in the forward propagation, where H, W, and N denote the height, width, and number of channels of the feature maps, respectively. Afterward, the feature maps \({\textbf{F}}\) are fed into the generator, which consists of a \(3\times 3\) convolution layer and a Sigmoid activation function, to produce a set of foreground prediction maps \({\textbf{M}} \in {\mathbb {R}}^{H \times W \times C}\) with values in [0, 1], where C is the number of categories. We choose the class-specific foreground prediction map \({\textbf{M}}_{f} \in {\mathbb {R}}^{H \times W \times 1}\) corresponding to the ground-truth class and invert it to obtain the coupled background prediction map \({\textbf{M}}_{b} \in {\mathbb {R}}^{H \times W \times 1}\), where \({\textbf{M}}_{b} = 1 - {\textbf{M}}_{f}\). Finally, \({\textbf{M}}_{f}\), \({\textbf{M}}_{b}\), and \({\textbf{F}}\) are fed together into the AMC module for prediction map learning. We describe the AMC structure and loss functions in detail in Sect. 3.2.
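For concreteness, the generator step can be sketched in a few lines of PyTorch. The batch-first tensor layout and all names below are illustrative assumptions for exposition, not the released implementation:

```python
import torch
import torch.nn as nn


class Generator(nn.Module):
    """Sketch: a 3x3 convolution + Sigmoid mapping backbone features F
    (B, N, H, W) to C class-specific foreground prediction maps in [0, 1]."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor, gt_class: torch.Tensor):
        maps = torch.sigmoid(self.conv(feats))                # M: (B, C, H, W)
        # select the class-specific foreground map M_f for the ground-truth class
        idx = gt_class.view(-1, 1, 1, 1).expand(-1, 1, *maps.shape[2:])
        m_f = maps.gather(1, idx)                             # (B, 1, H, W)
        m_b = 1.0 - m_f                                       # coupled background map M_b
        return maps, m_f, m_b
```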

In the inference phase, as illustrated in Fig. 3, the feature maps \({\textbf{F}}\) obtained by the extractor are input into the generator and the sub-network \({\mathcal {F}}_2({\textbf{F}},\Theta _2)\) to generate the set of foreground prediction maps \({\textbf{M}}\) and the classification logits \({\tilde{\textbf{y}}}\), respectively. We select the prediction maps corresponding to the Top-k predicted categories, including the ground-truth class, and take their average as the final localization result. Notably, the Top-k strategy is only used in WSOL and not in WSSS.
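The Top-k averaging admits an equally short sketch under the same assumed layout (in the paper's protocol the selected categories also include the ground-truth class; here only the top-scoring classes are shown):

```python
import torch


def topk_localization_map(maps: torch.Tensor, logits: torch.Tensor, k: int) -> torch.Tensor:
    """Average the foreground maps of the k highest-scoring classes.

    maps:   (B, C, H, W) foreground prediction maps M from the generator
    logits: (B, C) classification logits y~ from sub-network F2
    """
    topk = logits.topk(k, dim=1).indices                      # (B, k)
    idx = topk[:, :, None, None].expand(-1, -1, *maps.shape[2:])
    return maps.gather(1, idx).mean(dim=1, keepdim=True)      # (B, 1, H, W)
```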

Fig. 3

The architecture of the proposed BAS in the inference phase. We utilize Top-k to generate the final localization map

3.2 Activation Map Constraint

The proposed AMC module takes the foreground map, the background map, and the feature maps as input to jointly promote the learning of the extractor and generator. It consists of four different losses: \({\mathcal {L}}_{bas}\), \({\mathcal {L}}_{ac}\), \({\mathcal {L}}_{frg}\), and \({\mathcal {L}}_{cls}\).

Background Activation Suppression (\(\varvec{\mathcal {L}}_{bas}\)). For the input background prediction map \({\textbf{M}}_{b}\), we multiply it by the feature maps \({\textbf{F}}\) to obtain the background feature maps \({\textbf{F}}^{b} = {\textbf{F}} \cdot {\textbf{M}}_{b} \in {\mathbb {R}}^{H \times W \times N}\). Subsequently, the feature maps \({\textbf{F}}\) and \({\textbf{F}}^{b}\) are fed into two weight-sharing sub-networks \({\mathcal {F}}_2({\textbf{F}},\Theta _2)\) and \({\mathcal {F}}_2({\textbf{F}}^{b},\Theta _2)\), respectively. For the sub-network with \({\textbf{F}}^{b}\) as input, the goal is to generate the background activation value through the same function, and the parameters of this sub-network are frozen in the back propagation. Following the sub-network \({\mathcal {F}}_2\) and global average pooling (GAP) (Zhou et al., 2016), \({\textbf{F}}\) and \({\textbf{F}}^{b}\) produce the prediction logits \({\tilde{\textbf{y}}}\in {\mathbb {R}}^{C}\) and \({\tilde{\textbf{y}}}^{b}\in {\mathbb {R}}^{C}\), respectively, which can be expressed as follows:

$$\begin{aligned}&{\tilde{\textbf{y}}}=\text {GAP}\left( {\mathcal {F}}_2\left( {\textbf{F}},\Theta _2 \right) \right) , \end{aligned}$$
(1)
$$\begin{aligned}&{\tilde{\textbf{y}}}^{b}=\text {GAP}\left( {\mathcal {F}}_2\left( {\textbf{F}}^{b},\Theta _2 \right) \right) . \end{aligned}$$
(2)

We select the values in \({\tilde{\textbf{y}}}\) and \({\tilde{\textbf{y}}}^{b}\) according to the ground-truth class. After applying a ReLU activation function, these values are denoted as the activation value \({\textbf{S}}\in {\mathbb {R}}^{1}\) and the background activation value \({\textbf{S}}^{b}\in {\mathbb {R}}^{1}\), respectively. \({\textbf{S}}\) is the activation value generated by the unmasked feature maps, containing both foreground and background information, while \({\textbf{S}}^{b}\) is the activation value generated by the background feature maps, retaining only the background information. We measure the difference between the background activation value and the activation value in ratio form to achieve background activation suppression, and \({\mathcal {L}}_{bas}\) is defined as follows:

$$\begin{aligned} {\mathcal {L}}_{bas}= \frac{ {\textbf{S}}^{b} }{{\textbf{S}}+ \varepsilon }, \end{aligned}$$
(3)

where \(\varepsilon \) is a very small value (\(10^{-8}\)) that keeps the equation well-defined. This ratio form not only avoids introducing more hyperparameters, but also acts as a normalization, keeping the range of the loss value within one order of magnitude.
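Putting Eqs. 1–3 together, a minimal sketch of the \({\mathcal {L}}_{bas}\) computation might look as follows, where f2 and gap stand for \({\mathcal {F}}_2(\cdot ,\Theta _2)\) and global average pooling. The freezing of \(\Theta _2\) on the background branch and the clipping of the loss at 1 (see Sect. 4.1) are noted in comments; all names are illustrative:

```python
import torch
import torch.nn.functional as F


def bas_loss(feats, m_b, f2, gap, gt_class, eps: float = 1e-8):
    """Sketch of Eq. 3. feats: (B, N, H, W) feature maps F;
    m_b: (B, 1, H, W) background prediction map M_b; gt_class: (B,) labels."""
    logits = gap(f2(feats))                  # y~  (Eq. 1), shape (B, C)
    # Background branch: the paper freezes Theta_2 here so that gradients
    # reach only M_b (e.g., by running f2 with detached parameters); the
    # plain shared-weights forward is shown for brevity.
    logits_b = gap(f2(feats * m_b))          # y~^b (Eq. 2)
    s = F.relu(logits.gather(1, gt_class[:, None]))       # S   >= 0
    s_b = F.relu(logits_b.gather(1, gt_class[:, None]))   # S^b >= 0
    loss = s_b / (s + eps)
    # clipped to 1 for early-training stability, as noted in Sect. 4.1
    return loss.clamp(max=1.0).mean()
```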

Fig. 4

The improvement of BAS. Partial structure of a the previous conference version and b this work. The green pixels in the localization map indicate positive values and the purple ones indicate negative values (Color figure online)

Generating non-negative \({\textbf{S}}\) and \({\textbf{S}}^{b}\) is necessary for \({\mathcal {L}}_{bas}\). In the previous conference version, we used a ReLU as the activation function at the end of the network to ensure the non-negativity of the outputs, as shown in Fig. 4. This approach causes pixels with negative values to be set to 0 after the ReLU, so their gradients do not take part in the back propagation. However, pixels with negative values are usually associated with background areas, which are also important for the learning of classification and prediction maps. As shown in Fig. 12, neglecting negative activation values in the classification loss indirectly causes the BAS loss to converge inadequately (the loss value even becomes larger) later in the training process. To solve this problem, we remove this ReLU layer so that negative pixels also participate in the gradient back propagation. To ensure the non-negativity of \({\textbf{S}}\) and \({\textbf{S}}^{b}\), we instead apply the ReLU activation function separately just before generating them.

Area Constraint (\({\mathcal {L}}_{ac}\)). The background prediction map can be guided by \({\mathcal {L}}_{bas}\) in a suppressive way: a smaller \({\mathcal {L}}_{bas}\) means that the region covered by the background prediction map is less discriminative. When the background prediction map covers the background region well, its \({\mathcal {L}}_{bas}\) should be minimal while the background area is as large as possible; accordingly, the foreground area should be as small as possible. We therefore use the area of the foreground prediction map as a constraint:

$$\begin{aligned} {\mathcal {L}}_{ac}= \frac{1}{H \times W}\sum _{h=1}^H \sum _{w=1}^W {\textbf{M}}_{f}\left( h,w \right) . \end{aligned}$$
(4)

Foreground Region Guidance (\({\mathcal {L}}_{frg}\)). Meanwhile, we retain the FPM approach of employing the classification task to drive the learning of the foreground prediction map, which uses high-level semantic information to guide the foreground prediction map toward the approximately correct object region. Consequently, a foreground region guidance loss based on cross-entropy is utilized. After \({\textbf{F}}\) is fed into \({\mathcal {F}}_2({\textbf{F}},{\Theta }_2)\), the result is multiplied element-wise with \({\textbf{M}}_{f}\) to produce \({\mathcal {L}}_{frg}\):

$$\begin{aligned}&{\tilde{\textbf{y}}}^{f}=\text {GAP}\left( {\textbf{M}}_{f} \cdot {\mathcal {F}}_2\left( {\textbf{F}},\Theta _2 \right) \right) , \end{aligned}$$
(5)
$$\begin{aligned}&{\mathcal {L}}_{frg}=-\sum _{i=1}^C {\textbf{y}}_{i} \log {\frac{e^{{\tilde{\textbf{y}}}_{i}^{f}}}{\sum _{j=1}^{C} e^{{\tilde{\textbf{y}}}_{j}^{f}}}}, \end{aligned}$$
(6)

where \({\textbf{y}}\) denotes the image-level one-hot encoding label.

Classification (\({\mathcal {L}}_{cls}\)). Besides, we obtain the classification loss \({\mathcal {L}}_{cls}\) by applying cross-entropy to \({\tilde{\textbf{y}}}\), which is used for classification learning of the entire image:

$$\begin{aligned} {\mathcal {L}}_{cls}=-\sum _{i=1}^C {\textbf{y}}_{i} \log {\frac{e^{{\tilde{\textbf{y}}}_{i}}}{\sum _{j=1}^{C} e^{{\tilde{\textbf{y}}}_{j}}}}. \end{aligned}$$
(7)
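The remaining AMC losses (Eqs. 4–7) are equally compact. The sketch below reuses the illustrative f2/gap placeholders from the \({\mathcal {L}}_{bas}\) sketch above and assumes \({\textbf{M}}_{f}\) is resized when its spatial size differs from that of the class score maps:

```python
import torch
import torch.nn.functional as F


def amc_other_losses(feats, m_f, f2, gap, gt_class):
    """Sketch of L_ac (Eq. 4), L_frg (Eq. 6), and L_cls (Eq. 7)."""
    score_maps = f2(feats)                               # (B, C, h, w) class score maps
    y = gap(score_maps)                                  # y~ (Eq. 1)
    m = F.interpolate(m_f, size=score_maps.shape[2:])    # align M_f spatially (assumption)
    y_f = gap(m * score_maps)                            # y~^f (Eq. 5)
    l_cls = F.cross_entropy(y, gt_class)                 # Eq. 7
    l_frg = F.cross_entropy(y_f, gt_class)               # Eq. 6
    l_ac = m_f.mean()                                    # Eq. 4: normalized foreground area
    return l_ac, l_frg, l_cls
```

The total objective (Eq. 8 below) is then simply the weighted sum l_cls + α·l_frg + β·l_ac + λ·l_bas.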

3.3 Weakly Supervised Object Localization

By jointly optimizing the background activation suppression loss, area constraint loss, foreground region guidance loss, and classification loss in the AMC module, the foreground prediction map can be guided toward the overall area of the object. The total loss of the BAS training process is defined as follows:

$$\begin{aligned} {\mathcal {L}}={\mathcal {L}}_{cls} + \alpha {\mathcal {L}}_{frg} + \beta {\mathcal {L}}_{ac} + \lambda {\mathcal {L}}_{bas}, \end{aligned}$$
(8)

where \(\alpha \), \(\beta \), and \(\lambda \) are hyperparameters, and \({\mathcal {L}}_{cls}\) and \({\mathcal {L}}_{frg}\) are both cross-entropy losses. For all backbones and datasets, we set \(\lambda =1\). Ablation experiments on the hyperparameters \(\alpha \), \(\beta \), and \(\lambda \) for WSOL are described in Sect. 4.3.

3.4 Weakly Supervised Semantic Segmentation

BAS can also be applied to weakly supervised semantic segmentation to verify the generality of our method. Different from weakly supervised object localization, weakly supervised semantic segmentation no longer assumes that there is only one ground-truth class per image, which is more challenging. In addition, comparing with state-of-the-art weakly supervised semantic segmentation methods reflects the segmentation quality of the prediction map more directly.

Based on the network structure in Fig. 2, we apply BAS to weakly supervised semantic segmentation with minor changes. As shown in Fig. 5, we maintain the learning process for a single prediction map in the AMC module by randomly selecting one foreground category of the image and denoting its corresponding prediction map as \({\textbf{M}}_{f}\). In addition, to enable multi-label classification, we adopt a modified softmax cross-entropy loss rather than a Sigmoid-based loss (binary cross-entropy). This is mainly because the activation value \({\textbf{S}}^{b}\) obtained from the background localization map would have to be less than 0 to make the probability \(1/({1+e^{-{\textbf{S}}^{b}}})\) close to 0, which conflicts with the non-negativity of \({\textbf{S}}^{b}\).

Multi-Label Classification (\({\mathcal {L}}_{mcls}\)). For the weakly supervised semantic segmentation task, we adopt the multi-label classification loss \({\mathcal {L}}_{mcls}\) instead of \({\mathcal {L}}_{cls}\) to deal with the multi-label case. To avoid class imbalance and training instability when multiple labels enter the softmax formulation, we only consider the differentiation between foreground and background classes and ignore the interrelationship among foreground categories. It can be expressed as follows:

$$\begin{aligned} {\mathcal {L}}_{mcls}=-\sum _{i \in L} {\textbf{y}}_{i} \log {\left( \frac{e^{{\tilde{\textbf{y}}}_{i}}}{{\sum _{j \in K} e^{{\tilde{\textbf{y}}}_{j}}} + e^{\tilde{\textbf{y}}_{i}}} \right) }, \end{aligned}$$
(9)

where L is the set of ground-truth classes in the image, and K denotes the set of remaining categories.

Fig. 5

Applying BAS to weakly supervised semantic segmentation task

The total loss function in weakly supervised semantic segmentation is of the following form:

$$\begin{aligned} {\mathcal {L}}={\mathcal {L}}_{mcls} + \alpha {\mathcal {L}}_{frg} + \beta {\mathcal {L}}_{ac} + \lambda {\mathcal {L}}_{bas}. \end{aligned}$$
(10)

The \(\lambda \) is set to 1 for all datasets. For PASCAL VOC 2012, we set \(\alpha =0.2\) and \(\beta =1.2\). For MS COCO 2014, we adopt \(\alpha =0.5\) and \(\beta =1.5\). The ablation experiments of the hyperparameters \(\alpha \), \(\beta \), and \(\lambda \) on WSSS, and the results of different combinations of hyperparameters on five datasets are presented in Sect. 5.2.
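Written directly from Eq. 9, a sketch of \({\mathcal {L}}_{mcls}\) under the same illustrative assumptions is given below (the usual log-sum-exp stabilization is omitted for clarity):

```python
import torch


def mcls_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sketch of L_mcls (Eq. 9): each ground-truth class i in L competes only
    with the non-ground-truth classes K, not with other foreground classes.

    logits: (B, C) prediction logits y~;  labels: (B, C) multi-hot in {0, 1}
    """
    exp = logits.exp()
    k_sum = (exp * (1.0 - labels)).sum(dim=1, keepdim=True)   # sum over j in K
    prob = exp / (k_sum + exp)                                # per foreground class i
    loss = -(labels * prob.clamp_min(1e-12).log()).sum(dim=1)
    return loss.mean()
```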

3.5 Empirical Justification

In this part, we empirically justify the advantage of introducing background activation suppression and its generalizability.

The purpose of the exploratory experiment is to investigate the relationship of the activation value (Activation), cross-entropy (Entropy), and background activation value (Background Activation) with the mask area. Specifically, we first train a VGG16 classification network on CUB-200-2011 using \({\mathcal {L}}_{cls}\) (Eq. 7) as supervision. Then, for a given pixel-level mask, the activation and entropy corresponding to this mask are generated by masking the feature map. We erode and dilate the ground-truth mask with a convolution of kernel size \(5n \times 5n\), obtain masks with different areas by changing the value of n, and plot the activation and entropy with the mask area as the horizontal axis. As shown in Fig. 1A, we display the curve for a single image obtained through the above process.
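The erosion/dilation step can be sketched with max pooling, a common trick for binary masks; the \(5n \times 5n\) window is assumed to be odd-sized here so that the spatial resolution is preserved (this is an illustration of the procedure, not the exact experimental code):

```python
import torch
import torch.nn.functional as F


def dilate(mask: torch.Tensor, n: int) -> torch.Tensor:
    """Dilate a binary mask (B, 1, H, W) with a 5n x 5n window."""
    k = 5 * n  # assumed odd so that padding k // 2 keeps H and W unchanged
    return F.max_pool2d(mask, kernel_size=k, stride=1, padding=k // 2)


def erode(mask: torch.Tensor, n: int) -> torch.Tensor:
    # erosion is the complement of dilating the complement
    return 1.0 - dilate(1.0 - mask, n)
```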

Fig. 6

Motivation. Statistical analysis about exploratory experiments on different datasets

Since each image has a different activation value distribution and a different ground-truth mask area, we normalize the activation curve of each image by dividing by the activation value generated by the entire image, as in Eq. 3, to obtain a more statistically meaningful result. In addition, the horizontal axis (area) is also normalized by the ground-truth mask area, which is marked by a red line. As shown in Fig. 6, we present the curves of the foreground activation value, cross-entropy, and background activation value with respect to the mask area, computed over the CUB-200-2011 test set. The samples on the whole present the following phenomena: when the mask expands near the ground-truth mask, the activation value starts to saturate and the corresponding background activation value tends to converge, while cross-entropy converges to zero early or even diverges as the mask expands. This suggests that the object region learned from activation values is larger and closer to the real object region than that learned from cross-entropy. We further explore why the cross-entropy occasionally diverges and visualize some results in Fig. 7. When the network classifies objects incorrectly, e.g., identifying a cow as a horse, the calculated cross-entropy maintains a high value as the mask area increases. In this case, adopting cross-entropy values to supervise the localization map is less feasible and appropriate than using activation values, which are not influenced by other categories. Besides, to verify the generality of this observation, we perform the same experiments on the more complex OpenImages and PASCAL VOC 2012 datasets. For PASCAL VOC 2012, we select one ground-truth category and its corresponding mask at a time, convert the multi-label setting into a single-label one, and then plot the curve in the same way. As shown in Fig. 6, the statistical analysis demonstrates similar phenomena; therefore, we believe it is general that better localization ability can be learned through activation values than through cross-entropy.

Fig. 7

Cross-entropy presents a divergence trend as the area of the mask increases when the model classifies the object incorrectly. The dashed line represents the position of ground-truth mask. Entropy: cross-entropy. GT: Ground-Truth

Table 1 Comparison with state-of-the-art methods
Fig. 8

Visualization comparison with the baseline CAM (Zhou et al., 2016) method on CUB-200-2011 (Wah et al., 2011) and ILSVRC (Russakovsky et al., 2015). The ground-truth bounding boxes are in Red, and the predictions are in Green (Color figure online)

4 Experiments on Weakly Supervised Object Localization

4.1 Experimental Setup

Datasets. We evaluate the proposed method on the popular benchmarks CUB-200-2011 (Wah et al., 2011), ILSVRC (Russakovsky et al., 2015), and OpenImages (Choe et al., 2020b). CUB-200-2011 contains 200 fine-grained classes of birds, with 5994 training images and 5794 testing images. ILSVRC contains about 1.2 million training images and 50,000 validation images, divided into 1000 categories. OpenImages consists of 29,819, 2500, and 5000 samples from 100 classes for training, validation, and testing, respectively. In addition to class labels, CUB-200-2011 and OpenImages also provide pixel-level mask annotations for evaluating the predicted mask.

Metrics. Following DA-WSOL (Zhu et al., 2022), we apply both bounding box and mask metrics to evaluate the performance of BAS. For bounding boxes, following Xu et al. (2022), Zhu et al. (2022), and Lee et al. (2022a), four metrics are used: GT-known localization accuracy (GT-known Loc), Top-1 localization accuracy (Top-1 Loc), Top-5 localization accuracy (Top-5 Loc), and maximal box accuracy (MaxBoxAccV2). Specifically, GT-known Loc counts a prediction as correct when the intersection over union (IoU) between the ground-truth bounding box and the predicted bounding box is greater than a fixed threshold (\(\delta \) = 0.5). Top-1/Top-5 Loc is correct when the Top-1/Top-5 predicted categories contain the ground-truth class and GT-known Loc is correct. Compared to GT-known Loc (\(\delta \) = 0.5), MaxBoxAccV2 considers multiple IoU thresholds (\(\delta \in \{0.3, 0.5, 0.7\}\)) and takes the average localization performance as the result. For masks, we adopt both the peak intersection over union (PIoU) (Zhang et al., 2020a) and the pixel average precision (PxAP) (Choe et al., 2020b) as metrics when pixel-level ground-truth labels are available.
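For reference, the bounding box criterion can be sketched in plain Python (boxes as (x1, y1, x2, y2)); this illustrates the standard IoU definition rather than code from the benchmark toolkits:

```python
def box_iou(a, b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def gt_known_correct(pred_box, gt_box, delta: float = 0.5) -> bool:
    # GT-known Loc: correct when IoU exceeds the fixed threshold delta;
    # MaxBoxAccV2 averages this check over delta in {0.3, 0.5, 0.7}
    return box_iou(pred_box, gt_box) >= delta
```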

Implementation Details. We evaluate the proposed method on the most popular backbones, including VGG16 (Simonyan & Zisserman, 2014), InceptionV3 (Szegedy et al., 2016), ResNet50 (He et al., 2016), and MobileNetV1 (Howard et al., 2017). All networks are fine-tuned from weights pre-trained on ILSVRC (Russakovsky et al., 2015). We train for 120 epochs on CUB-200-2011 (Wah et al., 2011) and 9 epochs on ILSVRC (Russakovsky et al., 2015). In the training phase, the input images are resized to 256\(\times \)256 and then randomly cropped to 224\(\times \)224. When \({\mathcal {L}}_{bas}\) is larger than 1, we clip it to 1 to ensure the stability of early training. In the inference phase, we use ten-crop augmentation to obtain the final classification results, following the settings in Pan et al. (2021), Guo et al. (2021), Zhang et al. (2018b). For localization, we replace the random crop with a center crop, as in previous works (Wei et al., 2021; Zhang et al., 2020a; Yun et al., 2019; Choe & Shim, 2019).

Table 2 Ablation study

4.2 Comparison with State-of-the-Arts

We compare the proposed BAS with state-of-the-art methods on the CUB-200-2011 (Wah et al., 2011) and ILSVRC (Russakovsky et al., 2015) datasets. As shown in Table 1, BAS achieves stable and excellent performance on various backbones. On CUB-200-2011 (Wah et al., 2011), BAS surpasses all existing methods by a large margin in terms of GT-known/Top-1/Top-5 Loc with the MobileNetV1, ResNet50, and InceptionV3 backbones. Compared with the current Foreground-Prediction-Map-based method FAM (Meng et al., 2021), BAS achieves 1.78%, 7.33%, 9.68%, and 7.38% improvement in GT-known Loc on VGG16, MobileNetV1, ResNet50, and InceptionV3, respectively. On ResNet50, BAS achieves 95.41% GT-known Loc, a significant increase of 3.81% over the best-performing counterpart (Kim et al., 2022). In addition, our method improves GT-known Loc by 5.53% and 4.20% over the latest multi-stage model CREAM (Xu et al., 2022) on ResNet50 and InceptionV3, respectively.

Fig. 9

Hyperparameters. a \(\varvec{\alpha }\) for foreground region guidance loss \({\mathcal {L}}_{frg}\). b \(\varvec{\beta }\) for area constraint loss \({\mathcal {L}}_{ac}\). c \(\varvec{\lambda }\) for background activation suppression loss \({\mathcal {L}}_{bas}\)

On ILSVRC (Russakovsky et al., 2015), BAS overall exceeds all baseline methods in terms of GT-known/Top-1/Top-5 Loc on all backbones. With MobileNetV1 as the backbone, BAS achieves 72.03% GT-known Loc, surpassing FAM (Meng et al., 2021) by a large margin of 9.98%. Moreover, InceptionV3-BAS and ResNet50-BAS obtain 72.07% and 72.00% GT-known Loc, respectively, establishing a new state-of-the-art. This shows that BAS performs well on both a fine-grained dataset and a large universal dataset. Furthermore, we visualize the localization maps of the proposed BAS and of CAM (Zhou et al., 2016) on CUB-200-2011 and ILSVRC in Fig. 8. Compared to CAM, BAS robustly covers the entire object area even in noisy environments, and its maps are sharper and more compact at the object edges.

4.3 Ablation Study

In this section, we perform a series of ablation experiments using ResNet50 (He et al., 2016) as the backbone. First, we conduct ablation experiments on the components of BAS on CUB-200-2011 (Wah et al., 2011). We take \({\mathcal {L}}_{cls}\), \({\mathcal {L}}_{frg}\), and \({\mathcal {L}}_{ac}\) together as the baseline for the Foreground-Prediction-Map-based architecture. As shown in Table 2, adding \({\mathcal {L}}_{bas}\) to the baseline enables the localization map to cover the object region more completely, significantly increasing localization accuracy, with 21.01% and 16.26% improvement in GT-known Loc and Top-1 Loc, respectively. Moreover, although the Top-k strategy for integrating the final localization result makes the localization maps less sharp, it further improves GT-known Loc (from 92.15% to 95.41%) by alleviating the classification network's focus on the most discriminative parts.

Table 3 Localization accuracy and visualization results about inserting the generator after different layers on ResNet50

Hyperparameters \({\alpha }\), \({\beta }\), and \({\lambda }\) in the total loss. There are three hyperparameters in Eq. 8. Their effectiveness and sensitivity with respect to localization quality are analyzed on CUB-200-2011 and ILSVRC in Fig. 9. \(\alpha \) denotes the factor of \({\mathcal {L}}_{frg}\), and Fig. 9a shows that the presence of the foreground region guidance loss (\(\alpha \ge 0.2\)) significantly improves the localization accuracy by ensuring stable learning of the foreground activation maps on both datasets. \(\beta \) balances the foreground area constraint against background suppression. When \(\beta \) is small, more areas in the foreground activation map are activated, while a too large \(\beta \) suppresses the learning of the activation map. As shown in Fig. 9b, our method performs stably with high accuracy as \(\beta \) varies from 1.2 to 1.7 on CUB-200-2011 and from 1.6 to 2.4 on ILSVRC. \(\lambda \) denotes the factor of \({\mathcal {L}}_{bas}\); a larger \(\lambda \) means more regions in the prediction map are activated by background activation suppression. As shown in Fig. 9c, the localization accuracy continues to grow on CUB-200-2011 as \(\lambda \) increases from 0.3 to 0.6, and remains stable from 0.6 to 1.3 with less than 1% change in GT-known Loc, which shows that the proposed BAS approach can significantly improve the localization accuracy. In summary, although the loss function has three hyperparameters, it is easy to choose suitable values for \(\alpha \), \(\beta \), and \(\lambda \). In addition, we also provide the results of different combinations of hyperparameters on CUB-200-2011 and ILSVRC in Sect. 5.2.

Fig. 10

GT-known Loc (\(\%\)) \(\varvec{w.r.t}\) \(\varvec{k}\). Evaluation results of combining the Top-k prediction maps when the backbone is VGG16 and ResNet50 respectively

Hyperparameter k in the Top-k strategy. We evaluate the effect of the hyperparameter k in BAS. As shown in Fig. 10, the GT-known Loc accuracy on CUB-200-2011 improves for \(k>1\) compared to \(k=1\). For VGG16 and ResNet50, the highest localization accuracy is achieved at k = 80 and k = 200, respectively. This suggests that on CUB-200-2011 the Top-k strategy can obtain more complete localization results and further improve performance by integrating the localization maps of similar categories. In contrast, for both VGG16 and ResNet50 the best localization results on ILSVRC are obtained at \(k=1\), which reflects the high variability of classes on ILSVRC and the few localization similarities shared between categories.

Fig. 11

Comparison of background prediction maps learned from a original image or b feature maps

Table 4 Evaluation results in terms of MaxBoxAccV2 on the CUB-200-2011 and ILSVRC datasets using various backbones

Generator after different layers. We report the results of inserting the generator after different layers of ResNet50. As shown in Table 3 (left table), the quantitative results indicate that inserting the generator after layer 3 achieves the best results, significantly better than other positions. The prediction maps learned at different layers are visualized in Table 3 (right figure). When the generator learns localization information from shallow feature maps (layer 1 and layer 2), the prediction map performs better at the edges of objects, but it is insufficient to resist background distractions and has poor semantic learning ability. Conversely, learning localization information from the high-level features (layer 4) results in imprecise localization due to the limited resolution.

Original image vs. feature maps. We fix the generator after layer 3 and conduct experiments on the masking position (original image vs. feature maps) of the background prediction map. As illustrated in Fig. 11, masking the feature maps achieves higher accuracy and better coverage of the object in the localization results, while masking the original image yields results that focus more on the edges or texture of the object and are less able to locate smooth regions. This may be because the learning process in shallow layers usually focuses on common basic features (e.g., edges, textures) and ignores high-level semantic features (Table 4).

Fig. 12

Experimental comparison between the previous conference version and this work. a The \({\mathcal {L}}_{bas}\) training loss curves. b Visualization of the localization results

Comparison with the previous conference version. We compare the improvement over the conference version in both quantitative and qualitative terms. Benefiting from the adjusted position of the ReLU activation function, BAS can learn the feature maps more adequately and efficiently. As shown in Fig. 12a, we display the \({\mathcal {L}}_{bas}\) (Eq. 3) training loss curves over the training iterations for both the previous conference version and this work. The loss curve of this work converges to a lower point with a more stable trend, while the loss curve of the previous conference version even increases during the iterations. This indicates that the ReLU in the last layer (Fig. 4) makes the classification network learn the background region insufficiently, resulting in the inadequate convergence of the BAS loss. Figure 12b illustrates some localization maps supporting this analysis. Compared with the previous conference version, BAS (this work) is more robust in learning the background region and consequently improves the localization accuracy in Table 5. We achieve average GT-known Loc gains of 0.83% and 0.11% over the four backbone networks on CUB-200-2011 and ILSVRC, respectively, without additional parameters or computation.

4.4 Performance Analysis

In this section, we evaluate and analyze in detail the localization quality and segmentation quality of BAS.

Table 5 Improvement in GT-known Loc compared to the previous conference version
Fig. 13

Statistical analysis of correct bounding boxes, based on ResNet50 (CAM (Zhou et al., 2016), ADL (Choe et al., 2020a), and DA-WSOL (Zhu et al., 2022))

Fig. 14

Visualization of the initial seed generated by CAM and the proposed BAS on the PASCAL VOC 2012 dataset

Fig. 15

Segmentation Quality. IoU-Threshold curves for different baseline methods and evaluation results of PIoU, PxAP on CUB-200-2011 (Wah et al., 2011) and OpenImages (Choe et al., 2020b) datasets, based on ResNet50 (CAM (Zhou et al., 2016), ADL (Choe et al., 2020a), and DA-WSOL (Zhu et al., 2022))

Fig. 16

Examples of semantic segmentation results on PASCAL VOC 2012 for IRN and BAS (with IRN)

Fig. 17

Visualization of the initial seed generated by CAM and the proposed BAS on the MS COCO 2014 dataset

Localization Quality. Table 4 shows the MaxBoxAccV2 scores compared with other methods on CUB-200-2011 (Wah et al., 2011) and ILSVRC (Russakovsky et al., 2015). The quantitative experiments indicate that our method achieves the best results for different backbone networks and datasets under the MaxBoxAccV2 criterion, which proves the high quality of the bounding boxes generated by BAS and verifies the effectiveness and generalizability of the proposed method. In particular, on CUB-200-2011, we exceed the previous best methods by 2.1% and 5.6% with the VGG16 and ResNet50 backbones, respectively. Besides, in Fig. 13, we present a statistical analysis of IoU based on ResNet50, plotting the IoU distribution between the predicted bounding boxes and the ground-truth boxes for correct localizations, following DANet (Xue et al., 2019). On CUB-200-2011, we achieve a median IoU of 78.7%, exceeding the latest state-of-the-art method DA-WSOL (Zhu et al., 2022) by 12.6%, and correspondingly by 3.0% on ILSVRC. From the median IoU and the IoU distribution, it can be seen that the proposed BAS significantly improves the localization quality on both CUB-200-2011 and ILSVRC (Fig. 14).

Segmentation Quality. We compare the localization map with the ground-truth mask using two metrics, PIoU and PxAP, following DA-WSOL (Zhu et al., 2022). As shown in Fig. 15 (left table), we evaluate the proposed BAS against CAM (Zhou et al., 2016), ADL (Choe et al., 2020a), and DA-WSOL (Zhu et al., 2022) on ResNet50. Compared to DA-WSOL, BAS achieves significant and consistent improvement, with a 15.06% increase in PIoU and a 15.24% increase in PxAP on CUB-200-2011. The proposed method also surpasses all methods on OpenImages, although OpenImages is a more challenging dataset with many small objects and complex backgrounds. In addition, we present the IoU-Threshold curves in the right graph of Fig. 15, which plot the IoU values at varying thresholds within [0, 255]. As the curves on both datasets show, our method is less sensitive to the threshold and achieves better results at arbitrary thresholds compared to other methods, which indicates that the localization map produced by BAS has fewer low-confidence regions and is closer to the ground-truth object region.

5 Experiments on Weakly Supervised Semantic Segmentation

5.1 Experimental Setup

Datasets and Evaluation Metric. To evaluate the performance of BAS on the weakly supervised semantic segmentation task, we conduct experiments on the commonly used PASCAL VOC 2012 (Everingham et al., 2010) and MS COCO 2014 (Lin et al., 2014) datasets. PASCAL VOC 2012 contains 21 categories (including one background class), with 1464, 1449, and 1456 samples in the training, val, and test sets, respectively. Following the common experimental protocol (Chen et al., 2014), the training set is augmented with 10,582 weakly annotated images provided by the SBD dataset (Hariharan et al., 2011). The MS COCO 2014 dataset has 81 semantic classes (including one background class). Following Lee et al. (2022a) and Jiang et al. (2022), images without the target categories are removed from the dataset, leaving 82,081 training images and 40,137 validation images. We use the mean Intersection-over-Union (mIoU) as the evaluation metric for all experiments (Figs. 16, 17).

Implementation Details. For seed generation, the input image is resized to 512\(\times \)512, then augmented by horizontal flipping and random cropping to 448\(\times \)448. We train the network for 10 epochs. The batch size is set to 16 and 64 on PASCAL VOC 2012 and MS COCO 2014, respectively. To optimize the network, the SGD optimizer is adopted with a momentum coefficient of 0.9. The initial learning rate is set to 0.005 and decayed following the poly policy \(lr = lr_{\text {init}}(1-itr/max\_itr)^\rho \) with \(\rho \) = 0.9. Following Lee et al. (2022a) and Xie et al. (2022a), we use ResNet50 as the backbone network to generate the initial seed for both PASCAL VOC 2012 and MS COCO 2014.

Seed Refinement and Segmentation. For seed refinement, to make a fair comparison, we follow Lee et al. (2022a), Lee et al. (2021a), and Chen et al. (2022b) in using IRN (Ahn et al., 2019) to improve the quality of the initial seed. After generating pseudo masks, we select DeepLabV2 (Chen et al., 2017) with ResNet-101 (He et al., 2016) as the segmentation network, following Xie et al. (2022a) and Jo and Yu (2021). We adopt the default settings to train DeepLabV2 as in Lee et al. (2022a), with weights pretrained on MS COCO 2014.

Table 6 Ablation study for the components of BAS on PASCAL VOC 2012 and MS COCO 2014

5.2 Ablation Study

In this section, we perform a series of ablation experiments with ResNet50 as the backbone on PASCAL VOC 2012 and MS COCO 2014. We first conduct an ablation study on the loss composition of BAS; as in Sect. 4.3, we take \({\mathcal {L}}_{cls}\), \({\mathcal {L}}_{frg}\), and \({\mathcal {L}}_{ac}\) together as the baseline for the Foreground-Prediction-Map-based architecture. As Table 6 shows, adding \({\mathcal {L}}_{bas}\) significantly improves the segmentation quality of the baseline, with 7.6% and 4.4% mIoU gains on PASCAL VOC 2012 and MS COCO 2014, respectively, which verifies the effectiveness of the proposed \({\mathcal {L}}_{bas}\) in capturing object regions relevant to classification.

Fig. 18

Hyperparameters. a \(\varvec{\alpha }\) for foreground region guidance loss \({\mathcal {L}}_{frg}\). b \(\varvec{\beta }\) for area constraint loss \({\mathcal {L}}_{ac}\). c \(\varvec{\lambda }\) for background activation suppression loss \({\mathcal {L}}_{bas}\)

Table 7 Effect of different combinations of hyperparameters on WSOL and WSSS with ResNet50 backbone

Hyperparameters \({\alpha }\), \({\beta }\), and \({\lambda }\) in the total loss. Figure 18 illustrates the sensitivity of the segmentation quality to the hyperparameters \(\alpha \), \(\beta \), and \(\lambda \) on PASCAL VOC 2012 and MS COCO 2014. \(\alpha \) is the coefficient of \({\mathcal {L}}_{frg}\), and a small \(\alpha \) already enables \({\mathcal {L}}_{frg}\) to work well. As shown in Fig. 18a, the mIoU result improves significantly on PASCAL VOC 2012 when \(\alpha \) is greater than 0.1 and varies very little in the interval 0.15 to 0.5, with less than 0.3% mIoU change. \({\mathcal {L}}_{ac}\) constrains the foreground area to avoid its unlimited expansion; therefore, if the coefficient \(\beta \) of \({\mathcal {L}}_{ac}\) is too small, too many regions are activated, drastically reducing the segmentation performance, as shown in Fig. 18b. The purpose of \({\mathcal {L}}_{bas}\) is to let the localization map learn regions contributing to classification in a background activation suppression manner. As shown in Fig. 18c, the mIoU result remains stable on both datasets when the factor \(\lambda \) of \({\mathcal {L}}_{bas}\) is in the range 0.8 to 1.2.

Although the total loss includes three hyperparameters, in practice we simply follow the principle \(\beta = \alpha +\lambda \), so that \({\mathcal {L}}_{ac}\) is balanced with the losses \({\mathcal {L}}_{frg}\) and \({\mathcal {L}}_{bas}\). Meanwhile, in searching for the most suitable ratio between \(\alpha \) and \(\lambda \), \(\lambda \) is fixed at 1 on both WSOL and WSSS for simplicity. Therefore, once \(\alpha \) is determined, \(\beta \) and \(\lambda \) are also determined. In Table 7, we provide the results of different combinations of hyperparameters on five datasets. When \(\alpha \) changes from 0.2 to 1.5, the effect on the results is limited, with less than 0.6% change. In fact, the setting \(\alpha =0.5, \beta =1.5, \lambda =1.0\) is feasible for all datasets, with very little change compared to the results reported in the paper. These experiments illustrate that it is easy to find a suitable set of hyperparameters on different datasets.

Table 8 The mIoU results of inserting the generator after different layers with ResNet50 backbone

Generator after different layers. In Table 8, we report the mIoU results of inserting the generator after different layers of ResNet50. Since the generator contains only one convolution layer, the semantic representation of the generated localization map depends mainly on the reused backbone part. Therefore, inserting the generator after layer 1 or layer 2 results in insufficient semantic representation and poor segmentation performance, as presented in Table 8. In addition, inserting the generator after layer 4 does not perform better than layer 3, reducing mIoU by 2.4% and 2.1% on PASCAL VOC 2012 and MS COCO 2014, respectively. This is mainly because the feature maps of layer 4 are usually coarser than those of layer 3, hindering fine segmentation results.

Table 9 Effects of applying BAS on different baseline methods, including mIoU of the initial seed (Seed) and the pseudo ground-truth mask (Mask) on the PASCAL VOC 2012 training set
Table 10 Semantic segmentation performance gains for per-class on PASCAL VOC 2012

5.3 Results on PASCAL VOC 2012 Dataset

Quality of Initial Seed and Pseudo Labels. Table 9 compares the quality of the initial seeds and the pseudo ground-truth masks on the PASCAL VOC 2012 training set. For the initial seed, we achieve 57.7% mIoU, exceeding previous methods by a large margin. The state-of-the-art method CLIMS (Xie et al., 2022a) uses both ResNet50 and CLIP (Radford et al., 2021) networks in the seed generation phase, whereas BAS uses only a ResNet50 network and still achieves a 1.1% gain. Furthermore, by normalizing the seeds generated by our method and by a baseline method and adding them together, BAS can be combined with various baseline methods and significantly improves their segmentation quality by providing high-quality foreground prediction maps. As shown in Table 9, the proposed BAS improves IRN (Ahn et al., 2019) by a remarkable 9.4% mIoU, and we achieve the best result of 59.8% mIoU when applying BAS to AdvCAM (Lee et al., 2022a). For a fair comparison, Table 9 also reports adding together the initial seeds of the other methods; combining with BAS clearly brings a larger improvement than combining with any of them, because BAS produces high and balanced responses on the object, which benefits the other methods significantly. We report the per-class mIoU in Table 10. Although our method achieves consistent improvements over the above baselines, it does not perform well in some categories. This is because the classification network has difficulty distinguishing objects from class-related context, e.g., boats and water, or TVs and the programs shown on them, which in turn limits the localization ability of BAS. Figure 14 shows a visual comparison of the initial seeds generated by BAS and IRN; our method clearly captures the whole object area with high confidence. For the pseudo ground-truth masks, after refinement by IRN (Ahn et al., 2019), we achieve gains of 4.8%, 3.3%, and 1.6% when BAS is deployed on IRN, CDA, and AdvCAM, respectively, which illustrates the effectiveness of the proposed method. BAS enables a better foreground-background segmentation and thus provides strong support for the seed generation stage of the WSSS task.
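The seed-combination step mentioned above can be sketched as follows. Per-image min-max normalization is our assumption for "normalizing", and all variable names are illustrative.

```python
import torch

def min_max_normalize(seed, eps=1e-8):
    # Per-image min-max normalization to [0, 1] over the spatial dimensions
    s = seed - seed.amin(dim=(-2, -1), keepdim=True)
    return s / (s.amax(dim=(-2, -1), keepdim=True) + eps)

bas_seed = torch.rand(1, 21, 56, 56)       # placeholder BAS seed maps
baseline_seed = torch.rand(1, 21, 56, 56)  # placeholder baseline (e.g. IRN) seeds
combined_seed = min_max_normalize(bas_seed) + min_max_normalize(baseline_seed)
```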

Table 11 Performance comparison of WSSS methods in terms of mIoU (%) on the PASCAL VOC 2012 val and test sets

Quality of Segmentation. To further validate the effectiveness of our method, we employ the pseudo segmentation labels to directly train a semantic segmentation network. Table 11 presents the segmentation results of the proposed BAS (with IRN) and other methods on the PASCAL VOC 2012 dataset. Our BAS exceeds previous methods under the same level of supervision, achieving 69.6% and 69.9% mIoU on the val and test sets, respectively. Compared to the latest method ReCAM (Chen et al., 2022b) with the same backbone network, we achieve mIoU improvements of 1.1% on the val set and 1.5% on the test set. We also show qualitative segmentation results in Fig. 16. Compared with IRN, BAS is more robust to various challenging scenarios, such as objects of various sizes, complex environments, and multi-instance situations.

Fig. 19 Examples of semantic segmentation results on MS COCO 2014 for IRN and BAS (with IRN)

Table 12 Evaluation results on MS COCO 2014 validation set

5.4 Results on MS COCO 2014 Dataset

Table 12 compares the accuracy of the proposed method and other state-of-the-art approaches on the MS COCO 2014 validation set. Our BAS based on IRN achieves 45.1% mIoU, exceeding all previous methods. Compared to the previous best model AdvCAM, which adopts ResNet101 as the backbone network, we use the smaller ResNet50 yet achieve better results. In particular, we surpass our baseline method IRN (Ahn et al., 2019) by 3.7% mIoU. Figure 17 shows a visual comparison of the initial seeds obtained by CAM (Zhou et al., 2016) and by our method: BAS captures more object area than CAM, especially for large objects and multiple instances, and achieves balanced, comprehensive responses on the target regions across various categories. Figure 19 shows examples of semantic segmentation masks on MS COCO 2014 produced by IRN and by BAS (with IRN). Applied to IRN, our method achieves more accurate segmentation and a better demarcation between different objects, because the proposed BAS provides a more complete and accurate seed region than IRN.

5.5 Analysis

In this section, we explore how to fully leverage BAS, focusing on its core background activation suppression loss, and how to enhance the segmentation capability of BAS by integrating it with other methods.

Class-agnostic foreground map. Different from CAM-based approaches, which extract class activation maps from the classifier, the proposed BAS obtains localization maps through an extra generator. In addition to generating class-specific localization maps, BAS can also produce a class-agnostic foreground map given a suitable objective: we treat all classes present in the image as a single foreground class and sum the \({\mathcal {L}}_{bas}\) of the existing classes to supervise the foreground map. In this way, the foreground map can be fully trained on the entire dataset. As shown in Fig. 20, the class-agnostic foreground map localizes objects more completely and robustly than the class-specific localization map and generates less noise. However, the foreground map cannot distinguish objects of different categories and often identifies objects outside the target classes, as shown in Fig. 20e. To use the foreground map to improve class-specific localization maps, we follow the intuition that the foreground map usually covers all class-specific localization maps and has higher segmentation quality. If the class-specific localization map has a higher response than the foreground map in some regions, this is likely caused by noise or confusing background, as shown in Fig. 20c, d. We therefore weaken the response in these regions, either by directly replacing it with the response of the foreground map or by averaging the responses of both maps. The experimental results in Table 13 show that both strategies improve the quality of the initial seed and hence the accuracy of the pseudo ground-truth mask. The best results are achieved by the averaging approach, which not only reduces the response of the uncertain regions but also combines the prediction probabilities of the class-agnostic and class-specific maps, improving the initial seed and pseudo ground-truth mask by 0.6% and 0.5% mIoU, respectively.
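The two fusion strategies can be written compactly as below. The function names are ours; cls_map denotes the class-specific localization map and frg_map the class-agnostic foreground map, assumed normalized to [0, 1] and spatially aligned.

```python
import torch

def fuse_replace(cls_map, frg_map):
    # Where the class-specific response exceeds the foreground map,
    # replace it with the lower foreground response.
    return torch.where(cls_map > frg_map, frg_map, cls_map)

def fuse_average(cls_map, frg_map):
    # Average the two responses in those suspicious regions instead.
    return torch.where(cls_map > frg_map, 0.5 * (cls_map + frg_map), cls_map)
```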

Fig. 20 Class-agnostic foreground map vs. class-specific localization map in the following five aspects: a Completeness. b Connectivity. c Less noise. d Identifying class-related background. e Class awareness

Improve the quality of BAS. As analyzed in Sect. 5.3, BAS does not perform well in some categories, usually because a co-occurring context supports the classification decision, causing the localization map to learn the context. To alleviate this problem, we apply the proposed BAS to W-OoD (Lee et al., 2022b), which uses additional out-of-distribution data to address spurious correlations with the background, such as boat-water and aeroplane-sky/runway. As presented in Table 14, benefiting from the strong discriminative ability of the classification network in W-OoD, BAS achieves better performance, with a 1.8% mIoU improvement on the initial seed, including 16.0% and 7.1% mIoU gains on the boat and aeroplane categories, respectively. After applying IRN and DeepLabV2, BAS w/ W-OoD obtains 71.3% and 71.1% mIoU on the PASCAL VOC 2012 val and test sets. In addition, we combine BAS with CLIMS (Xie et al., 2022a) to suppress the co-occurring background using the natural language supervision in CLIP (Radford et al., 2021), which also significantly improves the quality of the initial seed and brings an 8.9% boost in the boat category. Consequently, BAS w/ CLIMS obtains 70.6% and 70.9% mIoU on the val and test sets, substantially improving the segmentation ability of BAS.

Finding the image-specific threshold by BAS. Unlike CAM, the proposed BAS designs a set of loss functions to evaluate the quality of the localization map during training, and these functions are equally suitable for the testing phase. As shown in Fig. 22a, the unbalanced response of CAM makes the segmentation performance heavily dependent on the threshold, and the optimal threshold varies significantly across images, so a single global threshold for the whole dataset is clearly inappropriate. We therefore propose to find an image-specific threshold by employing the background activation suppression loss \({\mathcal {L}}_{bas}\) and the area constraint loss \({\mathcal {L}}_{ac}\) to evaluate candidate thresholds. As illustrated in Fig. 22b, we obtain a series of binary masks by varying the threshold and feed them into the AMC module to compute \({\mathcal {L}}_{bas}\) and \({\mathcal {L}}_{ac}\), following the same process as in Fig. 2. We then simply add \({\mathcal {L}}_{bas}\) and \({\mathcal {L}}_{ac}\) as the evaluation score and select the binary mask with the smallest score as the final result. Table 15 compares this image-specific threshold post-processing with a global threshold for different methods on the PASCAL VOC 2012 training set. The proposed post-processing improves segmentation quality by providing feedback on candidate thresholds and selecting the best one for each image, especially for CAM (Zhou et al., 2016) and CDA (Su et al., 2021), bringing 0.7% and 0.5% mIoU improvements, respectively. However, the gain is limited when applied to the proposed BAS itself, mainly because BAS produces few uncertain regions and is not very sensitive to the threshold.
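A hedged sketch of this search follows. Here amc_module is a stand-in for the AMC pipeline of Fig. 2, assumed to return the two evaluation losses for a given image and binary mask; the threshold grid and all names are ours.

```python
import torch

@torch.no_grad()
def image_specific_threshold(loc_map, image, amc_module, num_steps=19):
    best_mask, best_score = None, float("inf")
    for t in torch.linspace(0.05, 0.95, num_steps):
        mask = (loc_map > t).float()           # binarize at candidate threshold
        l_bas, l_ac = amc_module(image, mask)  # evaluation losses from AMC
        score = (l_bas + l_ac).item()          # summed evaluation score
        if score < best_score:                 # keep the lowest-scoring mask
            best_score, best_mask = score, mask
    return best_mask
```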

Table 13 Applying the class-agnostic foreground map to the class-specific localization maps with different strategies on PASCAL VOC 2012
Table 14 Applying BAS to CLIMS and W-OoD on PASCAL VOC 2012
Fig. 21 Density distribution maps of IoU versus object size. For WSOL, the experiment is conducted on ILSVRC and bounding boxes are used to measure IoU and object size. For WSSS, results are computed from pixel-level masks on the PASCAL VOC 2012 training set at the seed phase

6 Discussion

Limitation. In this section, we discuss the localization ability of BAS for objects of different sizes. We first visualize the density distribution of IoU for BAS and CAM (Zhou et al., 2016) in Fig. 21. BAS performs better on medium and large objects but underperforms on small ones. We attribute this to two main factors: first, localizing small objects is an inherently hard problem in computer vision; second, the area constraint loss penalizes objects of different sizes unequally, penalizing small objects less, so BAS cannot balance both large and small objects with only a single hyperparameter for the area constraint loss.

Future Works. In the future, we plan to pursue two main directions: (1) improving localization capability across different object sizes, and (2) further extending the applications of BAS.

To address the inconsistent localization ability across object sizes, we would like to explore the following promising research directions: (1) the area constraint loss can be improved to allow different tolerances for objects of various sizes; (2) since WSOL works better for localizing large objects, we can determine the approximate region of the object in a first stage, then crop and resize that region so the original small object becomes larger, and perform localization in a second stage, as sketched below.
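The second direction is illustrative only, since it is future work; all names here are hypothetical, and the image is assumed to be an NCHW tensor.

```python
import torch
import torch.nn.functional as F

def two_stage_localize(image, rough_box, localizer, out_size=224):
    x1, y1, x2, y2 = rough_box                 # rough region from stage one
    crop = image[..., y1:y2, x1:x2]            # crop the approximate region
    crop = F.interpolate(crop, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)
    return localizer(crop)                     # stage-two localization map
```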

Table 15 Effect of applying the image-specific threshold on different methods, compared to a global threshold, on the PASCAL VOC 2012 training set
Fig. 22 Finding the image-specific threshold by BAS. a IoU-threshold and loss-threshold curves, where loss denotes the sum of \({\mathcal {L}}_{bas}\) and \({\mathcal {L}}_{ac}\). b Process of finding the image-specific threshold using \({\mathcal {L}}_{bas}\) and \({\mathcal {L}}_{ac}\) as the evaluation

Apart from the above possible improvements, BAS can also be extended to weakly supervised instance segmentation (WSIS), since obtaining a high-quality localization map is equally essential for WSIS.

7 Conclusion

In this paper, we find that previous FPM-based works use cross-entropy to facilitate the learning of foreground prediction maps, which essentially works by changing the activation value, and that the activation value correlates more strongly with the foreground mask. Thus, we propose a background activation suppression (BAS) approach that promotes the generation of foreground maps through an activation map constraint (AMC) module, which facilitates the learning of foreground prediction maps mainly by suppressing background activation. Extensive experiments on CUB-200-2011 and ILSVRC verify the effectiveness of the proposed BAS, which surpasses previous methods by a large margin. In addition, BAS can be extended to WSSS, enhancing the seed quality of other methods by providing high-quality foreground maps, and achieves state-of-the-art performance on PASCAL VOC 2012 and MS COCO 2014.