
1 Introduction

Weakly supervised learning, which uses coarse annotations as supervision during model learning, has attracted extensive attention in recent years, especially for localization-relevant vision tasks such as image segmentation [4, 9, 13] and object detection [10, 27]. Typically, weakly supervised object localization (WSOL) relaxes the requirement of bounding boxes or even densely-annotated pixel-level localization masks by learning the localization model with only image-level annotations, i.e., image classes, which effectively saves human effort in the annotation process. The majority of WSOL methods adopt the classification activation map (CAM) mechanism [38], which utilizes global average pooling (GAP) to spatially average pixel-level features into an image-level feature and learns an image classifier with image-level supervision. Besides generating the classification result, this image classifier also serves as an object localizer that acts on pixel-level features to produce the localization map in the test process.

Fig. 1. Comparison between our BagCAMs and CAM. Our BagCAMs (upper part) derives regional localizers from the classifier with the RLG strategy, while CAM (bottom part) only copies the globally-learned classifier to locate objects.

Though CAM provides an efficient tool for learning a localization model with weak supervision, it directly adopts the classifier as the localizer without considering the difference between them. In detail, the classifier is learned only on image-level features, which are spatially aggregated and contain sufficient object evidence to be discerned: catching a few discriminative factors is enough for the classifier to determine the class of an image. The object localizer, however, must discern the class of every spatial position based on pixel-level features, where discriminative factors may not be well aggregated, i.e., may be insufficient to activate the globally-learned classifier. Thus, the classifier of CAM only catches the most discriminative parts rather than the whole object extent when it is directly adopted to locate objects on pixel-level features.

To alleviate this issue, a series of methods have been proposed to force the classifier to discern object features more comprehensively, for example, by developing augmentation strategies to enrich the global features [17, 25, 32], aligning the feature distributions between image level and pixel level [35, 39], adopting multiple classifiers to synergistically localize the object [16, 30, 31, 34], or refining the classifier to catch class-agnostic object features [11, 37]. Though these strategies are effective to some extent, adopting them requires re-training or revising the baseline structure, which increases the complexity of the training process. Moreover, they still follow CAM in directly adopting the globally-learned classifier as the localizer, so the gap between classifier and localizer remains unresolved.

Unlike the above methods, our work proposes a plug-and-play approach called BagCAMs, which better projects an image-level trained classifier to comply with the requirement of the localization task. It can directly replace the classifier-projection step of CAM and be plugged into existing WSOL methods without re-training or modifying the network structure. As visualized in Fig. 1, instead of directly adopting the globally-learned classifier, our method derives a set of regional localizers from this well-trained classifier. These regional localizers discern object-related factors with respect to each spatial position, acting as the base learners of ensemble learning, and the final localization result is obtained by integrating their outputs. Experiments show that the proposed BagCAMs significantly improves the performance of baseline methods and achieves state-of-the-art performance on three WSOL benchmarks.

2 Related Work

Existing WSOL methods can be categorized into multi-stage methods [6, 11, 18,19,20, 33] and one-stage methods [25, 31, 35,36,37, 39]. The former require training additional structures upon the classification network to generate class-agnostic localization results. Our method belongs to the latter, which produce the localization score by projecting the image classifier back onto the pixel-level features based on CAM, so we only review representative one-stage methods.

To force the classifier to discern less distinguishable object features, Singh et al. [17] proposed the hide-and-seek (HAS) augmentation that randomly hides patches of images in the training process. However, hiding patches also causes information loss. Yun et al. [32] elaborated the CutMix strategy to address this issue, replacing the hidden regions with a patch of another image. Babar et al. [1] adopted a siamese neural network to align the localization maps of two images that contain complementary patches of the input. Instead of developing augmentation strategies, some one-stage methods fuse the localization maps of multiple classifiers to catch object parts more comprehensively. Typically, Zhang et al. [34] suggested learning two classifiers that discern object features in a complementary way. Kou et al. [16] added an additional classifier to adaptively produce an auxiliary pixel-level mask, which is then used by a metric-learning loss for supervision. To consider hierarchical cues, Xue et al. [30] elaborated DANet, which learns multiple classifiers on hierarchical features, and Tan et al. [25] proposed a pixel-level class selection (PCS) strategy that generalizes CAM to hierarchical features. Seunghun et al. [31] fused localization maps of different classes with non-local blocks [29, 40] to catch locations correlated to multiple classes. Compared with these works, our BagCAMs generates multiple localizers for each spatial position by degrading a well-trained classifier with efficient post-processing like CAM, rather than re-training the extractor or additional classifiers, which would increase the complexity of the training process.

Beyond the WSOL community, some methods also improve CAM for the visual explanation of convolutional neural networks, i.e., explaining why a CNN makes specific decisions. To engage CAM in CNNs without the GAP operator, Selvaraju et al. [23] proposed GradCAM, which summarizes the gradient as the importance of neurons to aggregate feature maps. Aditya et al. [5] further improved GradCAM by elaborating a spatial weighting strategy when summarizing the gradient. Recently, Wang et al. [28] and Desai [22] explored obtaining neuron importance through forward passes to avoid the gradient calculation. Unlike these methods, which aim to better activate the discriminative locations, our method adapts the CAM mechanism to the purpose of WSOL, activating as many object locations as possible.

3 Methodology

This section first formally overviews our proposed method, which localizes objects with a series of regional localizers. Then, the regional localizer generation (RLG) strategy is illustrated, which generates these regional localizers for the localization task. Finally, BagCAMs is proposed to derive these localizers from a well-trained image classifier and produce the final localization map.

3.1 Problem Definition

Given an input image \(\boldsymbol{X}\), WSOL aims to approximate the localization map \(\boldsymbol{Y} \in \mathbb {R}^{K \times N^{I}}\) by a localization model learned only with the image-level classification mask \(\boldsymbol{y} \in \mathbb {R}^{K}\), where K and \(N^{I}\) are the numbers of classes of interest and pixels, respectively. To learn the localization model with \(\boldsymbol{y}\), a backbone network, e.g., ResNet [12] or InceptionV3 [24], is firstly adopted as the feature extractor \(e(\cdot )\) to extract pixel-level features \(\boldsymbol{Z} \in \mathbb {R}^{C \times N}\), where C is the number of channels of the features with spatial resolution N. These pixel-level features are fed into the GAP layer to generate the image-level feature \(\boldsymbol{z} \in \mathbb {R}^{C}\). Finally, the classifier \(c(\cdot )\), implemented as a fully-connected layer with weight \(\textbf{W} \in \mathbb {R}^{K \times C}\), acts on the image-level feature to generate the classification result \(\boldsymbol{s}\):

$$\begin{aligned} \boldsymbol{s}_{k} = c(\boldsymbol{z})_{k} = (\textbf{W}\boldsymbol{z})_{k} = \sum _{c} \textbf{W}_{k, c} \boldsymbol{z}_{c}, \end{aligned}$$
(1)

where k and c are the index of class and channel, respectively. This classification score \(\boldsymbol{s}\) is supervised by the cross-entropy \(\mathcal {L}_{ce}(\boldsymbol{y}, \boldsymbol{s})\) to learn the extractor \(e(\cdot )\) and the classifier \(c(\cdot )\) in the training process.

In the test process, besides generating the classification score \(\boldsymbol{s}\), CAM-based methods also utilize the classifier \(c(\cdot )\) as a localizer \(f(\cdot )\) that acts on the pixel-level features \(\boldsymbol{Z}\) to obtain the localization map \(\boldsymbol{P} \in \mathbb {R}^{K \times N}\):

$$\begin{aligned} \boldsymbol{P}_{k, i} = f(\boldsymbol{Z})_{k, i} = c(\boldsymbol{Z}_{:, i})_{k} = \sum _{c} \textbf{W}_{k, c} \boldsymbol{Z}_{c, i}. \end{aligned}$$
(2)
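For concreteness, a minimal PyTorch sketch of Eq. 1 and Eq. 2 is given below; the tensor names and shapes are our own illustrative assumptions, not the paper's released code.

```python
import torch

# Z: pixel-level features from the extractor e(.), assumed shape (C, N) with N = H*W
# W: weight of the fully-connected classifier c(.), assumed shape (K, C)
def classify(Z: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Eq. 1: classification score s from the GAP-ed image-level feature."""
    z = Z.mean(dim=1)   # global average pooling, shape (C,)
    return W @ z        # classification score s, shape (K,)

def cam(Z: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Eq. 2: reuse the classifier weights as a localizer on every position."""
    # P[k, i] = sum_c W[k, c] * Z[c, i]
    return W @ Z        # localization map P, shape (K, N)
```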

As discussed in Sect. 1, the classifier \(c(\cdot )\) is learned only on the image-level feature \(\boldsymbol{z}\), which aggregates the object features of all positions of \(\boldsymbol{Z}\). This makes the classifier \(c(\cdot )\) discern only the most discriminative features rather than all features correlated to the objects. When the classifier \(c(\cdot )\) is directly projected as the localizer \(f(\cdot )\) acting on the pixel-level features, some less distinguishable parts, e.g., the body of animals, are not activated on the output localization map \(\boldsymbol{P}\). Thus, our method adopts the proposed RLG strategy to generate a base localizer set \(\mathcal {F} = \{ f_{1}, f_{2}, ..., f_{n} \}\) that discerns object features comprehensively. Then, the proposed BagCAMs implements the base localizer set \(\mathcal {F}\) based on the image classifier \(c(\cdot )\) and generates a series of localization maps \(\mathcal {P} = \{ \boldsymbol{P}_{1}, \boldsymbol{P}_{2}, ..., \boldsymbol{P}_{n} \}\). Finally, these maps are integrated with coefficients \(\{ \lambda _{1}, \lambda _{2}, ..., \lambda _{n} \}\) to form the final localization map \(\boldsymbol{P}^{*}\) that determines \(\boldsymbol{Y}\):

$$\begin{aligned} {\boldsymbol{P}}^{*} = \sum _{n} \lambda _{n} f_{n}({\boldsymbol{Z}}). \end{aligned}$$
(3)

3.2 Regional Localizers Generation Strategy

The proposed RLG strategy utilizes localization scores and pixel-level feature maps to generate a set of regional localizers that focus on regional features rather than only discerning global features as the classifier of the classification task does. To better illustrate the proposed RLG strategy, we first design the regional localizer, inspired by a property of the image classifier. In detail, by differentiating Eq. 1, the weight \(\textbf{W}\) of the global classifier \(c(\cdot )\) can be reformulated as [25]:

$$\begin{aligned} \textbf{W} = \frac{\partial c(\boldsymbol{z})}{\partial \boldsymbol{z}} = (\frac{\partial \boldsymbol{s}}{\partial \boldsymbol{z}})^{\top }. \end{aligned}$$
(4)

Substituting it into Eq. 1, an equivalent form of the classifier \(c(\cdot )\) is obtained [25]:

$$\begin{aligned} c(\boldsymbol{z}) = \textbf{W} \boldsymbol{z} = (\frac{\partial \boldsymbol{s}}{\partial \boldsymbol{z}})^{\top } \boldsymbol{z}. \end{aligned}$$
(5)

Equation 5 indicates that an image classifier \(c(\cdot )\) can be represented by the transposition of the partial derivative of the image classification score \(\boldsymbol{s}\) with respect to the image feature \(\boldsymbol{z}\) [25]. Analogizing this property to the localization task, the regional localizer can be defined as follows.

Definition 1

Assuming \(f(\cdot )\) is a localizer that generates the classification score \(\boldsymbol{p}\) at a specific spatial location based on the pixel-level features \(\boldsymbol{Z} \in \mathbb {R}^{C \times N}\), i.e., \(\boldsymbol{p} = f(\boldsymbol{Z})\), the localizer \(f(\cdot )\) can be simulated by a function set \(\mathcal {F}\) that contains the partial derivatives of this regional classification score \(\boldsymbol{p}\) with respect to each regional position of the pixel-level features \(\boldsymbol{Z}\):

$$\begin{aligned} \mathcal {F} = \{ f_{1}, ..., f_{n}, ..., f_{N} \} = \{ (\frac{\partial \boldsymbol{p}}{\partial \boldsymbol{Z}_{:, 1}})^{\top }, ..., (\frac{\partial \boldsymbol{p}}{\partial \boldsymbol{Z}_{:, n}})^{\top }, ..., (\frac{\partial \boldsymbol{p}}{\partial \boldsymbol{Z}_{:, N}})^{\top } \}, \end{aligned}$$
(6)

where \(f_{n}(\cdot ) = (\frac{\partial \boldsymbol{p}}{\partial \boldsymbol{Z}_{:, n}})^{\top } (\cdot )\) is the regional localizer that catches the relation between the regional score \(\boldsymbol{p}\) and the pixel-level feature of a specific regional position \(\boldsymbol{Z}_{:, n}\).

Fig. 2. Workflow of our method, where the RLG strategy (orange) generates a set of classifiers and the BagCAMs (green) weights their effect to produce localization maps. (Color figure online)

Based on Definition 1, each column vector \(\boldsymbol{P}_{:, i}\) of a given localization map \(\boldsymbol{P}\) can be viewed as a regional classification score \(\boldsymbol{p}\) that defines N regional localizers based on the pixel-level features \(\boldsymbol{Z}\). Thus, as indicated in Fig. 2, our RLG strategy (marked in orange) can simulate \(N*N\) regional localizers based on the correlation between each vector pair of \(\boldsymbol{P}\) and \(\boldsymbol{Z}\):

$$\begin{aligned} f^{m}_{n}(\boldsymbol{x}) = (\frac{\partial \boldsymbol{P}_{:, m}}{\partial \boldsymbol{Z}_{:, n}})^{\top } (\boldsymbol{x}) ~\longrightarrow ~ f^{m}_{n}(\boldsymbol{x})_{k} = \sum _{c} \frac{\partial \boldsymbol{P}_{k, m}}{\partial \boldsymbol{Z}_{c, n}} \boldsymbol{x}_{c}, \end{aligned}$$
(7)

where \(f^{m}_{n}(\cdot )_{k}\) represents the regional localizer of class k and \(\boldsymbol{x} \in \mathbb {R}^{C}\) is a variable that represents a feature vector. With this extension, a localizer set \(\mathcal {F}^{*}\) that contains \(N*N\) regional localizers can be defined based on \(\boldsymbol{P}\) and \(\boldsymbol{Z}\), i.e., \(\mathcal {F}^{*} = \{ f^{1}_{1}, ..., f^{m}_{n}, ..., f^{N}_{N} \}\). Compared with the global classifier \((\frac{\partial \boldsymbol{s}}{\partial \boldsymbol{z}})^{\top }\) used by CAM, our regional localizer set \(\mathcal {F}^{*}\) contains sufficient localizers that catch the regional correlation between scores and features at each position, which helps to discern object features comprehensively.
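As a concrete illustration of Eq. 7, the weights of a regional localizer can be read off from an automatic-differentiation Jacobian. The sketch below is our own and assumes \(\boldsymbol{P}\) has been computed from \(\boldsymbol{Z}\) inside an autograd graph; names and shapes are illustrative.

```python
import torch

def regional_localizer_weights(P: torch.Tensor, Z: torch.Tensor,
                               k: int, m: int) -> torch.Tensor:
    """Gradient of the regional score P[k, m] w.r.t. all pixel-level features Z.

    P: localization map of shape (K, N), built from Z with autograd enabled.
    Z: pixel-level features of shape (C, N).
    Column n of the returned (C, N) tensor is the weight vector of the regional
    localizer f^m_n for class k in Eq. 7, i.e. dP[k, m] / dZ[:, n].
    """
    return torch.autograd.grad(P[k, m], Z, retain_graph=True)[0]

# Applying f^m_n to a feature vector x of shape (C,):
#   score_k = regional_localizer_weights(P, Z, k, m)[:, n] @ x
```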

3.3 Bagging Regional Classification Activation Maps

The proposed RLG strategy provides an efficient mechanism to generate a localizer set \(\mathcal {F}^{*}\) based on the localization map \(\boldsymbol{P}\). When implementing \(\boldsymbol{P}\) as a coarse localization map \(\hat{\boldsymbol{P}} \in \mathbb {R}^{K \times N}\), the regional localizers \(f^{m}_{n}\) can be viewed as base learners that are integrated into a strong learner to locate objects. For this purpose, our BagCAMs, shown in Fig. 2 (marked in green), generates the base localizers based on the coarse localization map \(\hat{\boldsymbol{P}}\) and then weights their localization results to form the final localization score:

$$\begin{aligned} \boldsymbol{P}^{*}_{k, i} = \sum _{m} \sum _{n} \mathbf {\Lambda }^{i}_{m,n} f^{m}_{n}(\boldsymbol{Z}_{:, i})_{k} = \sum _{m} \sum _{n} \mathbf {\Lambda }^{i}_{m,n} \sum _{c} \frac{\partial \hat{\boldsymbol{P}}_{k, m}}{\partial \boldsymbol{Z}_{c, n}} \boldsymbol{Z}_{c, i}, \end{aligned}$$
(8)

where \(\boldsymbol{P}^{*}\) is the localization map of our proposed BagCAMs, whose element \(\boldsymbol{P}^{*}_{k, i}\) represents the score of class k at position i. \(\mathbf {\Lambda }^{i}\) is a matrix whose element \(\mathbf {\Lambda }^{i}_{m,n}\) is the coefficient of the regional localizer \(f^{m}_{n}\) at position i. In detail, the PCS strategy [25] is adopted to initialize the coarse localization map \(\boldsymbol{\hat{P}}_{k, m}\) for computational convenience and better performance on intermediate feature maps:

$$\begin{aligned} \boldsymbol{\hat{P}}_{k, m} = \sum _{c} \frac{\partial \boldsymbol{s}_{k}}{\partial \boldsymbol{Z}_{c, m}}{\boldsymbol{Z}_{c, m}}. \end{aligned}$$
(9)
Table 1. Summary of degrading the proposed BagCAMs into other methods

With this initialization of the coarse localization map \(\boldsymbol{\hat{P}}_{k, m}\) and defining \(\bar{\boldsymbol{s}} = \log (\boldsymbol{s})\), the formulation of the base localizer generated by our RLG derives into the following, whose proof is given in Appendix B:

$$\begin{aligned} \begin{aligned} f^{m}_{n}(\boldsymbol{x})_{k} = \sum _{c_{1}} \boldsymbol{s}_{k}(1 + \frac{\partial \bar{\boldsymbol{s}}_{k}}{\partial \boldsymbol{Z}_{c_{1}, m}}\boldsymbol{Z}_{c_{1}, m}) \sum _{c_2} (\frac{\partial \bar{\boldsymbol{s}}_{k}}{\partial \boldsymbol{Z}_{c_{2}, n}} \boldsymbol{x}_{c_{2}}). \end{aligned} \end{aligned}$$
(10)

As for the weight matrix \(\mathbf {\Lambda }^{i}\), the grouping strategy of PCS [25] is also adopted for consistency, assuming \((\frac{\partial \boldsymbol{p}}{\partial \boldsymbol{Z}_{:, i}})^{\top }\) is the localizer dedicated to position i:

$$\begin{aligned} \mathbf {\Lambda }^{i}_{m, n}=\left\{ \begin{array}{rcl} 1, &{} &{} {i = n}\\ 0, &{} &{} {i \ne n} \end{array} \right. . \end{aligned}$$
(11)

This setting assigns the \(N*N\) regional localizers into N groups, where the group for position i contains the localizers that are applied specifically to that position. Note that \(\mathbf {\Lambda }^{i}\) can also be implemented with other mechanisms, for example, spatial averaging [23] or spatial attention [5], but we find the grouping strategy performs best because it introduces less noise. Finally, substituting Eq. 10 and Eq. 11 into Eq. 8, an executable formulation of BagCAMs is obtained:

$$\begin{aligned} \boldsymbol{P}^{*}_{k, i} = \sum _{m} \sum _{c_{1}} \boldsymbol{s}_{k}(1+\frac{\partial \bar{\boldsymbol{s}}_{k}}{ \partial \boldsymbol{Z}_{c_{1}, m}}\boldsymbol{Z}_{c_{1}, m}) (\sum _{c_{2}} \frac{\partial \bar{\boldsymbol{s}}_{k}}{ \partial \boldsymbol{Z}_{c_{2}, i}} \boldsymbol{Z}_{c_{2}, i}). \end{aligned}$$
(12)

As indicated in Eq. 12, the computation of our BagCAMs relies only on the gradients \(\frac{\partial \bar{\boldsymbol{s}}}{\partial \boldsymbol{Z}}\), which can be calculated by back-propagating the logarithm of the classification score \(\boldsymbol{s}\). Thus, our BagCAMs can be projected onto intermediate layers of the CNN and retains a computational complexity similar to gradient-based CAM mechanisms [5, 23, 25]. Moreover, Table 1 shows that PCS [25] and other CAM mechanisms [5, 23, 38] can be recovered as special cases of our BagCAMs under the assumption that the initial localization results of all positions are equal to \(\boldsymbol{s}_{k}\), i.e., \(\forall m: \boldsymbol{\hat{P}}_{k, m}=\boldsymbol{s}_{k}\). However, this assumption is clearly invalid for the localization task, because background locations should not have the same score as object locations. In contrast, our BagCAMs generates a specific initial score \(\boldsymbol{\hat{P}} \in \mathbb {R}^{K \times N}\) for each position to obtain more valid base localizers and thus high-quality localization maps, rather than defining the localizer only from the global score \(\boldsymbol{s} \in \mathbb {R}^{K \times 1}\). This makes our BagCAMs perform much better than these mechanisms when integrated into WSOL methods.

The proposed BagCAMs can easily replace the CAM step of WSOL methods to generate localization maps. Algorithm 1 and Fig. 2 show the workflow of localizing objects in an input image \(\boldsymbol{X}\) with a trained WSOL model that contains a feature extractor \(e(\cdot )\) and a classifier \(c(\cdot )\). Specifically, the input image \(\boldsymbol{X}\) is first fed into the feature extractor \(e(\cdot )\) to generate the pixel-level features \(\boldsymbol{Z} = e(\boldsymbol{X})\). Then, \(\boldsymbol{Z}\) is aggregated into the image-level feature \(\boldsymbol{z}\), which is fed into the classifier to produce the classification score \(\boldsymbol{s}\) determining the object class \(k=\arg \max (\boldsymbol{s})\). Next, backward propagation is applied to \(\bar{\boldsymbol{s}}_{k}\) to calculate \(\frac{\partial \bar{\boldsymbol{s}}_{k}}{\partial \boldsymbol{Z}}\), which is crucial for defining the base localizers. Finally, the localization map \(\boldsymbol{Y}\) is obtained by weighting the localization scores of the base localizers as in Eq. 12.

Algorithm 1. Workflow of BagCAMs in the test process.
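For reference, the sketch below transcribes this test-time workflow (Eq. 9, Eq. 11, and Eq. 12) in PyTorch. It is our own illustration under assumed tensor shapes, not the authors' released implementation; in particular, taking \(\bar{\boldsymbol{s}}\) as the log-softmax of the scores is our assumption to keep the logarithm well defined.

```python
import torch
import torch.nn.functional as F

@torch.enable_grad()
def bagcams(extractor, classifier, X):
    """BagCAMs localization map for the predicted class (a literal sketch of Eq. 12).

    extractor:  module mapping an image X to pixel-level features Z of shape (C, H, W)
    classifier: fully-connected layer acting on the GAP-ed feature, (C,) -> (K,)
    Returns a localization map of shape (H, W).
    """
    Z = extractor(X)                               # pixel-level features, (C, H, W)
    C, H, W = Z.shape
    s = classifier(Z.mean(dim=(1, 2)))             # classification score s, (K,)
    k = s.argmax()

    # s_bar_k = log(s_k); log-softmax is our assumption (see lead-in)
    s_bar_k = F.log_softmax(s, dim=0)[k]

    # gradient of the log-score w.r.t. the pixel-level features
    G = torch.autograd.grad(s_bar_k, Z)[0].reshape(C, H * W)   # (C, N)
    Zf = Z.reshape(C, H * W)                                   # (C, N)

    # second factor of Eq. 12: sum_c2 dS_bar_k/dZ[c2, i] * Z[c2, i] per position i
    regional = (G * Zf).sum(dim=0)                             # (N,)

    # first factor of Eq. 12: sum_m sum_c1 s_k * (1 + dS_bar_k/dZ[c1, m] * Z[c1, m])
    coeff = (s[k].detach() * (1.0 + G * Zf)).sum()

    return (coeff * regional).reshape(H, W)
```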

4 Experiments

This section first introduces the experimental settings. Then, the results of our BagCAMs are compared with SOTA methods on three datasets. Finally, we investigate different settings of our BagCAMs to further demonstrate its validity.

4.1 Settings

The proposed BagCAMs can be plugged into a well-trained WSOL model by simply replacing CAM in the test process. Thus, we reproduced five WSOL methods as baselines and trained them with their optimal settings, including CAM [38], HAS [17], CutMix [32], ADL [6], and DAOL [39]. In detail, ResNet-50 with the down-sampling layer of \(Res_4\) removed was used as the feature extractor. When using InceptionV3 as the extractor, we follow existing works [20, 25, 34, 35] and add two additional layers at the end of the original structure. The classifier is implemented as a fully-connected layer whose outputs are supervised by the cross-entropy loss on the image-level annotations during training. Apart from method-specific strategies [6, 17, 32], random resizing to \(256 \times 256\) followed by random horizontal flipping and random cropping to \(224 \times 224\) was adopted as the augmentation. SGD with weight decay \(10^{-4}\) and momentum 0.9 was used as the optimizer. Note that the learning rate and the method-specific hyper-parameters for all datasets follow the released optimal settings [7, 39]. In the test process, our BagCAMs replaces the CAM step of these methods to project the learned classifier as the localizer based on the features output by \(Res_3\) of ResNet (\(Mix_{6e}\) for InceptionV3). All experiments were implemented with the PyTorch toolbox [21] on an Intel Core i9 CPU and an NVIDIA RTX 3090 GPU.
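A sketch of this training configuration is given below; the exact composition of the augmentation pipeline, the placeholder module, and the learning rate are our assumptions, since only the values stated above are taken from the text.

```python
import torch
from torchvision import transforms

# Augmentation as described above: resize to 256x256, random crop to 224x224,
# and a random horizontal flip (the exact ordering is our assumption).
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Optimizer as stated: SGD with momentum 0.9 and weight decay 1e-4.
# The module and learning rate below are placeholders; the real values follow
# the released per-dataset settings [7, 39].
model = torch.nn.Linear(2048, 200)   # placeholder for extractor + classifier
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()
```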

Three standard benchmarks were utilized to evaluate our methods:

  • CUB-200 dataset [26] contains 11,788 images with fine-grained annotations for 200 bird classes. We follow the official training/test split and use 5,994 images as the training set, which only provides image-level annotations to supervise WSOL methods. The other 5,794 images, provided with additional bounding boxes and pixel-level masks, serve as the test set to evaluate the performance.

  • ILSVRC dataset [8] contains 1.3 million images covering 1,000 object classes. Among them, 50,000 images with bounding box annotations are adopted as the test set to report the localization performance.

  • OpenImages dataset [3, 7] contains 37,319 images of 100 classes, of which 29,819 images serve as the training set. Following the split released by Junsuk [7], the remaining 7,500 images, annotated with pixel-level localization masks, are divided into a validation set (2,500 images) and a test set (5,000 images).

Note that our BagCAMs does not contain any hyper-parameters, so only the test images of these datasets are utilized for comparison. The Top-1 localization accuracy (T-Loc) [17], ground-truth known localization accuracy (G-Loc) [17], and the recently proposed MaxBoxAccV2 [7] (B-Loc) were adopted to evaluate the performance based on bounding box annotations. As for the pixel-level localization masks, the peak intersection over union (pIoU) [37] and the pixel average precision (PxAP) [7] were calculated as the metrics.
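The sketch below illustrates how the mask-based metrics can be computed for a single image. It is a simplified, per-image reading of pIoU [37] and PxAP [7] (the official PxAP accumulates the precision-recall curve over all pixels of the test set), with function names and thresholding of our own choosing.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def pxap(pred_map: np.ndarray, gt_mask: np.ndarray) -> float:
    """Pixel average precision: area under the pixel-wise PR curve (per image)."""
    return float(average_precision_score(gt_mask.reshape(-1).astype(int),
                                         pred_map.reshape(-1)))

def piou(pred_map: np.ndarray, gt_mask: np.ndarray, steps: int = 100) -> float:
    """Peak IoU over score thresholds."""
    gt = gt_mask.reshape(-1).astype(bool)
    scores = pred_map.reshape(-1)
    best = 0.0
    for t in np.linspace(scores.min(), scores.max(), steps):
        fg = scores >= t                            # binarize the localization map
        inter = np.logical_and(fg, gt).sum()
        union = np.logical_or(fg, gt).sum()
        best = max(best, inter / max(union, 1))
    return best
```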

Table 2. Comparison with SOTA methods with ResNet50 (bold indicates the best).
Table 3. Comparison with SOTA methods with InceptionV3 (bold indicates the best).
Fig. 3. Visualization of replacing CAM with BagCAMs for different WSOL methods.

4.2 Comparison with State-of-the-Arts

Table 2 presents the results of SOTA methods and our BagCAMs on the three standard WSOL benchmarks. Adopting our proposed BagCAMs improves the performance of the baseline methods to a great extent, especially on the CUB-200 dataset. This is because CUB-200 is a fine-grained dataset that contains only birds, making the classifier more likely to catch discriminative parts rather than the parts common to all birds. As discussed in Sect. 3, this situation causes unsatisfactory performance when the classifier is directly used to localize objects as in CAM. By adopting our BagCAMs to project the classifier into a set of regional localizers, the regional factors of the bird class are better captured, improving G-Loc by nearly 21.38% over the baseline method. Additionally, when evaluated more finely with the pixel-level masks, the improvements of our method remain remarkable, achieving 64.40% pIoU and 84.38% PxAP, which are 17.70% and 18.44% higher than CAM, respectively. On the larger-scale ILSVRC dataset, directly replacing CAM with our BagCAMs in the test process also achieves 3.48% higher G-Loc, i.e., correcting the localization of nearly 1,740 images without any fine-tuning or structure modification. In addition, even with the recently proposed DAOL [39], which achieves SOTA performance on the OpenImages dataset, adopting our method still clearly improves its performance by about \(2.49\%\) and \(2.26\%\) in pIoU and PxAP, respectively.

Besides the five reproduced methods, nine other SOTA methods are also compared in Table 2, whose scores are cited from the corresponding papers. Our BagCAMs outperforms the SOTA methods on nearly all metrics of all three datasets even when engaged in the vanilla WSOL structure, i.e., "CAM + Ours". Only the T-Loc metric of our BagCAMs is lower than that of methods that generate class-agnostic localization results and adopt additional stages for classification (marked with underlines) [19, 33, 37]. This is because our BagCAMs is adopted only in the test process to enhance the localization results, so our classification accuracy is determined directly by the baseline WSOL methods. Moreover, Table 3 reports the comparison when using InceptionV3 as the feature extractor, indicating our generalization to backbones other than ResNet. The results are consistent with those of ResNet, improving the performance of all baseline methods, for example, 11.31% and 9.38% G-Loc improvement for the vanilla structure (CAM) and DAOL on the CUB-200 dataset, respectively. Our BagCAMs also outperforms the other SOTA methods with InceptionV3 on nearly all metrics of the three datasets.

Localization maps generated by WSOL methods with our BagCAMs are visualized in Fig. 3. In the localization maps of the vanilla structure, only the most discriminative locations are activated, e.g., the pedestal of the toy, both ends of the pillar, the shade of the lamp, and the head of the bird. Though existing WSOL methods catch more object positions, they only enlarge or refine the activation of regions near the discriminative parts rather than catching more parts of the object. This visually verifies that the CAM mechanism limits the performance of these WSOL methods, making the localizer attend only to global cues. Thanks to our base localizer set, more object parts are effectively activated when adopting our BagCAMs in place of CAM for these methods, for example, the head of the toy, the pedestal of the lamp, and the body of the pillar/bird. Moreover, our BagCAMs can generate localization maps on intermediate layers that contain finer cues, such as pixels near object edges, which also contributes to our high performance (Table 6).

Table 4. The best scores of different CAMs on layers of ResNet for CUB-200 dataset
Table 5. PxAP on layers of ResNet
Table 6. PxAP on layers of Inception
Table 7. Efficiency (fps) of CAMs
Table 8. Different weighting strategies
Fig. 4. Localization maps generated from the features of different ResNet layers by different CAMs.

4.3 Discussion

To investigate the effectiveness of our BagCAMs in depth, we also conducted experiments comparing it with methods that generalize or enhance CAM for GAP-free structures or intermediate layers of CNNs, e.g., GradCAM (Grad) [23], GradCAM++ (Grad++) [5], and PCS [25]. We adopted the same trained checkpoint for all methods and used them to project the classifier in the test process. Except for the original CAM, these methods can be applied to the intermediate layers of the feature extractor, so we generated localization maps on each layer and report the best performance.

The corresponding results are shown in Table 4, where the baseline methods, i.e., CAM, HAS, ADL, CutMix, and DAOL, directly adopt the classifier for localization as in CAM. For all baseline WSOL methods, our BagCAMs achieves the highest improvement among the CAM mechanisms. This is because the other CAM methods all initialize \(\hat{\boldsymbol{P}}\) with the global classification result \(\boldsymbol{s}\) for all positions, as discussed in Sect. 3.3, resulting in lower improvements. Unlike them, our BagCAMs adopts \(\boldsymbol{\hat{P}}_{:, m}\) to assign a specific initial localization score to each position m, which helps generate valid localizers and contributes to our outstanding improvement, e.g., 15.68% higher PxAP than the original CAM for DAOL.

In addition, our BagCAMs also achieves satisfactory performance when localizing objects based on features of intermediate layers, which may inspire localization maps of higher resolution that capture more detail. Table 5 reports the PxAP when generating localization maps from the features of \(Res_1\) (\(256\times 56\times 56\)), \(Res_2\) (\(512\times 28\times 28\)), \(Res_3\) (\(1024\times 28\times 28\)), and \(Res_4\) (\(2048\times 28\times 28\)). Note that the original CAM can only be applied to the last layer before GAP because the number of channels in \(\textbf{W}\) differs from that of the intermediate features, so it is not included in Table 5. GradCAM and GradCAM++ suffer large performance drops when projected onto the earlier intermediate layers, i.e., \(Res_1\) and \(Res_2\). Though PCS, which was proposed for generating localization results on intermediate layers, slightly decelerates this decline, its PxAP on \(Res_1\) is still \(30.97\%\) lower than on \(Res_4\). In contrast, our BagCAMs generates the localization map by bagging \(N \times N\) base localizers, where N is the spatial resolution of the feature map. Thus, for earlier layers with higher resolution, more base localizers can be defined for bagging, i.e., 3,136 for \(Res_{1}\). This enables our BagCAMs to achieve 29.34% higher PxAP than the best of the other methods when projected onto the features of \(Res_{1}\).

Figure 4 qualitatively visualizes the localization maps generated from the intermediate features. The localization maps of GradCAM and GradCAM++ contain more noise on \(Res_1\) and \(Res_2\), and PCS activates only a few discriminative locations. In contrast, though our BagCAMs suffers from the grid effect caused by down-sampling, our localization maps cover more object parts even for \(Res_1\). Finally, the efficiency of the different CAMs is reported in Table 7 as the mean frames per second (fps) for inferring the CUB-200 test set. Though considering multiple regional localizers rather than only the global one, the complexity of our BagCAMs is only slightly higher than that of the other methods, indicating that our method balances localization performance and efficiency well.

Besides comparing with other CAM mechanisms, different weighting strategies, i.e., various settings of the weight matrix \(\mathbf {\Lambda }\), were also explored on the CUB-200 dataset. Specifically, we designed three variants of BagCAMs: (1) Ours\(_1\), which simply averages the scores generated by the localizers \(f^{m}_{n}\), i.e., \(\mathbf {\Lambda }^{i}=\frac{1}{N}\textbf{I}\); (2) Ours\(_2\), which aggregates the scores with the spatial weighting mechanism of GradCAM++ [5], i.e., \(\mathbf {\Lambda }^{i}=diag({\alpha })\); (3) Ours\(_3\), the mechanism used in our paper as defined in Eq. 11, which selects specific localizers for each position like PCS [25]. The corresponding results are shown in Table 8. All three weighting mechanisms enhance the performance of the baseline methods, profiting from the regional localizer set rather than the globally defined classifier. Specifically, simply averaging the localization scores of the regional localizers (Ours\(_1\)) improves PxAP by about \(11.35\%\). Adopting the spatial weighting strategy to weight the effect of each spatial position brings an additional \(4.5\%\) improvement. When grouping the \(N*N\) localizers into N clusters used specifically for each spatial position to reduce noise, as in PCS [25], the performance is highest, i.e., about 84.38% PxAP. Thus, we suggest adopting this grouping strategy to weight the effect of the regional localizers.

5 Conclusion

This paper proposes a novel mechanism called BagCAMs for WSOL to replace CAM [38] when projecting an image-level trained classifier as the localizer to locate objects. Our BagCAMs can be engaged in existing WSOL methods to improve their performance without re-training the baseline structure. Experiments show that our method achieves SOTA performance on three WSOL benchmarks.