
1 Introduction

Thanks to the breakthrough of deep learning, recent years have witnessed great progress in object detection [7]. However, training a high-performance object detector requires massive bounding box annotations, which are expensive and sometimes unavailable. To alleviate the model's thirst for annotations, weakly-supervised object localization (WSOL) has gained a lot of attention [1,2,3,4, 9, 12, 15, 16, 24, 27, 28, 31, 32, 34, 35], as it aims at predicting objects' bounding boxes from cheap image-level annotations. It therefore largely reduces the annotation cost and is of great practical significance. Previous WSOL methods mainly rely on class activation maps (CAMs) [35]. However, CAMs tend to cover only small discriminative regions of an object, causing incomplete predictions. Hence, many approaches have been proposed to improve CAMs, such as erasing-based methods [3, 14, 19, 26, 32], feature-refining-based methods [4, 16, 24, 27, 29] and regression-based methods [6, 12, 31]. All these methods have achieved remarkable localization performance. However, most of them are built on classification models, whose goal is inconsistent with object localization due to the following two defects.

Fig. 1.

(a) Visualization of discriminative (eye, beak and foot) and similar (body and feather) features between two categories. (b) Comparison between the CAM-based method and the non-negative matrix factorization (NMF)-based one. NMF utilizes features of multiple images of the same category to assist the object mask prediction, while CAM operates on single-image features and relies heavily on the classification layer.

First, the similarity between classes has been ignored. Previous classification models [8, 18, 20] usually adopt the cross-entropy loss for training, thus finding the discriminative features of each category. For WSOL, however, focusing too much on the differences between images leads to incomplete predictions, because images from different categories might share highly similar features. Forcing the model to distinguish between them causes it to focus only on the most discriminative object regions. To better elaborate this statement, we present an illustrative example in Fig. 1(a). The two images come from different categories, and each of them can be regarded as a bag of features, such as beak, eye, body, feather and foot. Between these two images, the beak, eye and foot show different colors or shapes, as marked by the blue and red areas. These features are usually regarded as the discriminative features and are extracted to help the classification model make decisions. However, there also exist some similar features in the two images, namely the body and feather, as marked by the green area. Overemphasizing the differences between images results in these similar features being ignored, which leads to incomplete predictions for WSOL. Besides, due to the classification supervision, even when there are no discriminative features in an image, the model is still forced to learn the differences between images and thus overfits to background noise.

Fig. 2.

Visualization of feature maps and predictions. (a) shows some feature maps before the classification layer, where the red rectangle in the first row marks features unused by CAM. (b) compares the predictions of CAM and our model. (Color figure online)

Second, as shown in Fig. 1(b), the generation of CAMs depends heavily on the final classifier (i.e., the last fully connected layer), so the final predictions suffer from false positives (i.e., noise) and false negatives (i.e., missing content). CAM-based methods [32, 35] generally adopt the parameters of the classification layer as the coefficients to combine feature maps into the final prediction. However, these parameters are optimized for classification, where only the most discriminative feature maps are selected for combination and the remaining maps are ignored. We argue that those overlooked feature maps actually contain helpful information for WSOL. To illustrate this, we visualize some feature maps extracted from a CAM-based model in Fig. 2. The body region (red rectangle) of the bird has been activated in the feature maps, but in the final prediction the body region is ignored and only the head region stands out. In other words, CAM-based methods cannot make full use of the extracted feature maps. Besides, CAMs are generated from a single image, which is not robust to background noise. As shown in the blue rectangle in Fig. 2, many background areas are also activated, which interferes with the model's predictions.

To address the above concerns, we propose the Inter-class feature similarity and Intra-class appearance Consistency (ISIC) model, which improves WSOL from two aspects: supervision and object mask generation. For supervision, we introduce the inter-class feature similarity (ICFS) loss to supplement the widely used cross-entropy (CE) loss. In practice, the CE loss focuses on the discriminative features of each category (i.e., the red and blue areas in Fig. 1(a)), while the ICFS loss focuses on the similarities between different categories (the green area in Fig. 1(a)). These two losses work against each other and eventually reach an equilibrium. Therefore, ISIC better balances the localization and classification tasks and is less prone to overfitting to background noise, resulting in more complete predictions.

For object mask generation, instead of relying on the classification layer, we apply a non-negative matrix factorization module (NMFM) to obtain the object mask. NMFM operates on features of multiple images from the same category and obtains the object mask by extracting the commonalities of these images, as shown in Fig. 1(b). Compared with previous methods, NMFM does not rely on the high-level classification layer, so it does not ignore the body region in Fig. 2 and can fully exploit all the feature maps. Besides, NMFM is based on multiple images, making it more robust to background noise than methods based on a single image. After obtaining the predicted mask, we follow [24] to train a class-agnostic segmentation model that produces the final mask, and apply a bounding box extractor to obtain the final object localization. In summary, our contributions fall into three parts:

  • In contrast to classification, we propose the ICFS loss to constrain and maintain the similarity between classes. The ICFS loss largely reduces the risk of over-optimizing the discriminative features, so more complete object regions can be activated.

  • We propose to replace the original CAMs with non-negative matrix factorization for object mask generation, which avoids the over-discriminative effect of the classification layer and suppresses background noise.

  • With negligible computational overhead, our proposed methods achieve consistent and substantial gains, reaching state-of-the-art performance on both the CUB-200-2011 and ImageNet-1K benchmarks for WSOL.

2 Related Works

2.1 Class Activation Maps (CAMs) Based WSOL

Weakly-supervised object localization (WSOL) is a challenging task that aims to localize objects with inexpensive image-level annotations. Zhou et al. [35] first propose class activation maps (CAMs) to extract the object location. However, restricted by the classification mechanism, CAMs cover only the discriminative object parts. To make CAMs more complete, HaS [19] randomly erases image patches to force the model to mine more object regions. ACoL [32], ADL [3], EIL [14] and AE [26] follow this erasure paradigm and drop the most discriminative features to reduce the model's dependency on them. CutMix [30] assembles patches from different images to guide the model to learn more object parts. These methods greatly improve the quality of CAMs, but risk spreading into background regions when discriminative features are insufficient.

2.2 Pseudo Label Based WSOL

[6, 12, 31] treat object localization as a regression task. Specifically, GCNet [12] utilizes a detector to regress the object bounding box and produces the object mask with a generator that maximizes the score of the classifier, but this indirect supervision brings unstable predictions. Inspired by GCNet, SLTNet [6] supervises the regressor with pseudo bounding boxes generated by a newly designed locator. In contrast, PSOL [31] divides WSOL into two separate tasks, classification and localization. It applies DDT [25] to produce pseudo bounding boxes from a pre-trained model, which are then used to train a detector. However, these pseudo labels come from the pre-trained model, so they are inexact and limit the upper bound of the detector. Instead of pseudo bounding box labels, SPOL [24] generates pseudo masks to train a class-agnostic segmentation model and achieves higher performance.

2.3 Attention Based WSOL

SPG [33] adopts a stage-wise manner to refine the object mask: it regards highly confident object regions as foreground seeds and uses self-produced guidance maps to progressively expand these seeds. SPOL [24] focuses on shallow features and proposes a multiplicative feature fusion to combine the complementary features of different layers. To capture long-range feature dependencies, TS-CAM [4] generates a token semantic coupled attention map with a vision transformer [22], which extracts both semantic and positional information. Similarly, SPA [16] proposes self-correlation to capture long-range structural information of objects. All these methods have made great progress in WSOL. However, the similarity between categories has been ignored. In this paper, we explicitly use inter-class similarity to boost WSOL performance.

3 Methodology

3.1 Pipeline

Figure 3 depicts the pipeline of our proposed ISIC model, which consists of two stages (i.e., object mask generation and class-agnostic segmentation). During training, both stages are involved: the object mask generation stage produces pseudo masks for the input images, and the class-agnostic segmentation stage adopts these pseudo masks as labels to train a binary (i.e., object or no-object) segmentation model. During inference, only the class-agnostic segmentation stage is involved: we directly derive the segmentation mask for each input image and extract the object bounding box from the mask. This decoupled design brings three benefits. First, the complex components (i.e., ICFS and NMFM) in the object mask generation stage are not carried into the inference phase. Hence, the time complexity of the model depends entirely on the segmentation network and is not affected by ICFS or NMFM. Second, unlike CAM-based methods, which handle the classification task and the localization task at once, our class-agnostic segmentation model focuses only on localization and is not disturbed by the classification task, so it can derive more complete object regions. Third, extracting a bounding box from a segmentation mask is much easier and less sensitive to threshold selection than from a class activation map, because the values in the segmentation mask are more consistent (tending to 0 or 1) than those in a class activation map. After obtaining the bounding box, we follow SPOL [24] and use a separate classification network (SPOL adopts EfficientNet-B7 [21]) to predict the category of the input image. Combining the bounding box and the category, we derive the final results. This step of obtaining the object category can be omitted if we focus only on object localization without category information.

Fig. 3.

Pipeline of the proposed ISIC model. In the object mask generation stage, the inter-class feature similarity loss \(\mathcal {L}_{\text {ICFS}}\) is applied to improve the similarity between different categories. Besides, based on non-negative matrix factorization, we design the NMFM module to generate object masks instead of CAMs, which flow into the subsequent segmentation stage as pseudo labels. After training, a class-agnostic segmentation model is obtained and adopted as the final model to predict object bounding boxes during inference.

3.2 Baseline

As shown in the left part of Fig. 3, our proposed methods (i.e., ICFS and NMFM) are concentrated in the object mask generation stage, which aims to improve the accuracy of the pseudo masks. Before introducing the specific methods, we first introduce the baseline model that we use. Our model is based on SPOL [24], which combines the complementarity of deep and shallow features and designs a multiplicative fusion strategy to improve the completeness of the object regions. Specifically, SPOL adopts ResNet50 [8] as the backbone network. For each input image of size \(H \times W\), SPOL extracts features at five scales (denoted as \(\{f_i | i=1,...,5 \}\)) with resolutions \([\frac{H}{2^i}, \frac{W}{2^i}]\). Considering the computational cost, SPOL only uses the last three scales (i.e., \(f_3, f_4\) and \(f_5\)). These features are first upsampled to the same scale \([\frac{H}{8}, \frac{W}{8}]\) and then aggregated by element-wise multiplication, combining the details of the shallow features with the semantics of the deep features, both of which are helpful for WSOL. We refer to these aggregated features as the multi-scale fusion features, as shown in Fig. 3. In addition, SPOL introduces a Gaussian prior pseudo label, self-distillation and an auxiliary loss to further enhance the WSOL model; readers can refer to the original paper [24] for details. To keep the model simple, only the most effective multiplication strategy is included in our baseline model and the other parts are omitted.
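As a rough illustration of this fusion, the sketch below upsamples the three scales to a common resolution and multiplies them element-wise. The 1\(\,\times \,\)1 projection layers used to align channel dimensions are an assumption made only for this sketch, not a detail prescribed by SPOL.

```python
import torch
import torch.nn.functional as F

def multiplicative_fusion(f3, f4, f5, proj3, proj4, proj5):
    """Fuse the last three ResNet50 scales by element-wise multiplication.

    f3: [B, C3, H/8, W/8], f4: [B, C4, H/16, W/16], f5: [B, C5, H/32, W/32].
    proj3/proj4/proj5: 1x1 convs mapping each scale to a common channel
    dimension (an assumption for this sketch).
    """
    size = f3.shape[-2:]  # target resolution [H/8, W/8]
    p3 = proj3(f3)
    p4 = F.interpolate(proj4(f4), size=size, mode='bilinear', align_corners=False)
    p5 = F.interpolate(proj5(f5), size=size, mode='bilinear', align_corners=False)
    # multiplication keeps only regions that are active at all three scales
    return p3 * p4 * p5
```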

Fig. 4.

(a) visualizes incomplete predictions of CAM-based models. (b) shows images from three categories; the appearance similarity between images 1 and 2 is clearly larger than that between images 1 and 3. (c) shows the inter-class similarity matrix, where the horizontal and vertical axes both represent the category index. The bright areas (e.g., blue circle) and dark areas (e.g., green circle) indicate high and low similarity between categories, respectively. (d) shows the cross-entropy (CE) loss curve when the ICFS loss is not adopted as supervision. (e) shows the loss curves when the ICFS loss is adopted as supervision.

3.3 Inter-class and Intra-class Features Analysis

For WSOL, most previous methods [1, 24, 32, 35] rely on classification models to predict object masks and then obtain bounding boxes. Unfortunately, limited by the classification objective, these masks cover only the most discriminative object regions while other, less discriminative ones are ignored. As shown in Fig. 4(a), only the head regions of the birds are highlighted while the body parts are ignored, because classification models focus only on the differences (i.e., the head parts) between classes. To maximize classification accuracy, features with similar appearance (i.e., the body parts) are discarded. For WSOL, however, classification accuracy is not the only goal, and overemphasizing inter-class differences leads to incomplete object masks. Thus, we argue that WSOL models should also consider inter-class similarity.

Figure 4(b) shows three images from three categories. In appearance, the similarity between images 1 and 2 is larger than that between images 1 and 3. To quantify the similarities between different categories, we use a pretrained ResNet50 to extract a 128-dimensional vector for each image in both CUB-200 [23] and ImageNet-1k [17], then average the vectors of the i-th category as its class representation \(c_i\). For any two representations \(c_i, c_j\), we calculate their cosine similarity \(s_{ij}=\frac{c_i \cdot c_j}{||c_i||_2 ||c_j||_2}\). Collecting all \(s_{ij}\), we obtain the similarity matrix S. As shown in Fig. 4(c), S is not evenly distributed: the bright areas (e.g., blue circle) show high similarity between categories and the dark areas (e.g., green circle) show low similarity. However, previous methods ignore this inter-class similarity, leading to incomplete predictions.
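The class representations and the similarity matrix S can be computed as in the following sketch, assuming the 128-dimensional image vectors have already been extracted; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def class_similarity_matrix(features, labels, num_classes):
    """Compute the inter-class cosine similarity matrix S.

    features: [N, D] image vectors (e.g., 128-d embeddings from a pretrained ResNet50)
    labels:   [N] category index of each image
    Returns S: [num_classes, num_classes] with S[i, j] = cos(c_i, c_j).
    """
    reps = torch.zeros(num_classes, features.shape[1])
    for k in range(num_classes):
        reps[k] = features[labels == k].mean(dim=0)  # class representation c_k
    reps = F.normalize(reps, dim=1)  # unit norm, so the dot product equals cosine similarity
    return reps @ reps.t()

# Similar categories of class i under threshold gamma (Sect. 3.4):
# M_i = {j | S[i, j] > gamma, j != i}
```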

3.4 Inter-class Similarity Feature Loss

To address the above concerns, we propose the inter-class feature similarity (ICFS) loss, which aims at reducing the feature distance between similar categories. Specifically, we first derive the representation \(c_i\) of the i-th category as in Sect. 3.3 and then find its similar categories by \(M_i = \{j\ |\ S_{ij}>\gamma \}\), where i and j are category indexes, \(\gamma \) is a threshold, \(M_i\) is the index set of categories similar to \(c_i\), and S is the inter-class similarity matrix shown in Fig. 4(c). Finally, we define the distance \(\mathcal {D}_i\) between \(c_i\) and \(c_j (j\in M_i)\) and derive the ICFS loss \(\mathcal {L}_{\text {ICFS}}\) by Eq. 1

$$\begin{aligned} \mathcal {D}_i = \frac{1}{N_i}\sum _{j\in M_i} ||c_i-c_j||_2^2, \quad \mathcal {L}_{\text {ICFS}} = \frac{1}{N_k}\sum _{i=1}^{N_k} \mathcal {D}_i \end{aligned}$$
(1)

where \(N_i\) is the number of elements in \(M_i\) and \(N_k\) is the total number of categories. The challenge is how to obtain the representation c of each category during training. The naive way is to feed the entire training set into the model at each iteration and compute the class representation for each category, which is unacceptable due to the high cost of computation and storage. Alternatively, we regard \(c_i, c_j\) as the expectations of the image vectors of the i-th and j-th categories, respectively.

$$\begin{aligned} c_i=E[X_i], \quad c_j=E[X_j] \end{aligned}$$
(2)

where \(E[\cdot ]\) is the expectation and \(X_i,X_j\) are the image vectors corresponding to the i-th and j-th categories, respectively. Hence, by Jensen's inequality applied to the convex squared norm, we derive an upper bound of \(\mathcal {D}_i\).

$$\begin{aligned} \begin{aligned} \mathcal {D}_i = \frac{1}{N_i}\sum _{j\in M_i} ||E[X_i]-E[X_j]||_2^2 \le \frac{1}{N_i}\sum _{j\in M_i} E||X_i-X_j||_2^2 \end{aligned} \end{aligned}$$
(3)

In Eq. 4, we use Monte Carlo sampling to approximate this upper bound, where i and j are category indexes, p and q are sample indexes, \(x_i^p\) and \(x_j^q\) are specific image vectors, and \(N_{ip}\) and \(N_{iq}\) are the numbers of sampled \(x_i^p\) and \(x_j^q\), respectively.

$$\begin{aligned} \mathcal {D}_i \le \mathcal {U}_i = \frac{1}{N_iN_{ip}N_{iq}}\sum _{j\in M_i}\sum _{p=1}^{N_{ip}}\sum _{q=1}^{N_{iq}} ||x_i^p-x_j^q||_2^2 \end{aligned}$$
(4)

Finally, we replace \(\mathcal {D}_i\) with its upper bound \(\mathcal {U}_i\) in \(\mathcal {L}_{\text {ICFS}}\) and get Eq. 5.

$$\begin{aligned} \mathcal {L}_{\text {ICFS}} = \frac{1}{N_k}\sum _{i=1}^{N_k}\frac{1}{N_iN_{ip}N_{iq}}\sum _{j\in M_i}\sum _{p=1}^{N_{ip}}\sum _{q=1}^{N_{iq}} ||x_i^p-x_j^q||_2^2 \end{aligned}$$
(5)

The total training loss consists of the cross-entropy loss \(\mathcal {L}_{\text {CE}}\) and \(\mathcal {L}_{\text {ICFS}}\), as shown in Eq. 6, where \(\lambda \) is a hyper-parameter. \(\mathcal {L}_{\text {CE}}\) supervises the model to learn the discriminative features between categories, whereas \(\mathcal {L}_{\text {ICFS}}\) forces the model to learn the similarities between categories. These two losses work against each other, so the model does not go to extremes and eventually reaches an equilibrium. Figure 4(d) shows the loss curves of \(\mathcal {L}_{\text {CE}}\) and \(\mathcal {L}_{\text {ICFS}}\) when \(\lambda =0\) on CUB-200: the model minimizes \(\mathcal {L}_{\text {CE}}\) as much as possible and the inter-class difference gradually becomes large. In contrast, when \(\lambda =1\), as shown in Fig. 4(e), the inter-class difference is constrained and the model does not go to extremes for classification, thus producing more complete predictions. Note that the ICFS loss aims at improving the integrity of the pseudo masks and does not concern classification performance; following SPOL [24], a separate classification model is adopted to predict the object category.

$$\begin{aligned} \mathcal {L}_{\text {total}} = \mathcal {L}_{\text {CE}} + \lambda \mathcal {L}_{\text {ICFS}} \end{aligned}$$
(6)
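A possible mini-batch realization of Eq. 5 and Eq. 6 is sketched below. Pairing samples of similar categories that happen to co-occur in a batch is our assumption about how the Monte Carlo sampling could be implemented; the actual training code may differ.

```python
import torch

def icfs_loss(embeddings, labels, similar_sets):
    """Monte Carlo estimate of L_ICFS (Eq. 5) within one mini-batch.

    embeddings:   [B, D] image vectors produced by the backbone
    labels:       [B] category indexes
    similar_sets: dict mapping category i to its set M_i of similar categories
    """
    losses = []
    for i in labels.unique().tolist():
        xi = embeddings[labels == i]  # samples x_i^p of category i
        sim = [j for j in similar_sets.get(i, []) if (labels == j).any()]
        if not sim:
            continue
        pair_losses = []
        for j in sim:
            xj = embeddings[labels == j]           # samples x_j^q of a similar category j
            d = torch.cdist(xi, xj, p=2) ** 2      # ||x_i^p - x_j^q||_2^2 for all (p, q)
            pair_losses.append(d.mean())           # average over the sampled pairs
        losses.append(torch.stack(pair_losses).mean())  # average over M_i
    if not losses:
        return embeddings.new_zeros(())
    return torch.stack(losses).mean()              # average over categories in the batch

# Total loss (Eq. 6): loss = ce_loss + lam * icfs_loss(emb, labels, similar_sets)
```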

3.5 Intra-class Appearance Consistency

Most previous WSOL methods obtain the object mask from class activation maps, where the parameters of the classifier play an important role. Specifically, consider a group of feature maps \(\{F_1, F_2, ..., F_N\}\) (extracted before the classifier) with spatial size \(W \times H\) and the parameters L of the final classifier with shape \(N \times C\), where N and C are the numbers of feature maps and categories, respectively. The class activation map \(M_c\) for the c-th class is derived as in Eq. 7. With a threshold, \(M_c\) can be binarized to extract the object bounding box.

$$\begin{aligned} M_c = \sum _{i=1}^NL_{i,c}F_i \end{aligned}$$
(7)
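For reference, Eq. 7 corresponds to the following weighted sum of feature maps; the tensor layout is assumed for illustration.

```python
import torch

def class_activation_map(feature_maps, fc_weight, c):
    """Eq. 7: weight the feature maps by the classifier parameters of class c.

    feature_maps: [N, H, W] maps before the classifier
    fc_weight:    [N, C] parameters L of the final fully connected layer
    """
    return torch.einsum('nhw,n->hw', feature_maps, fc_weight[:, c])
```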

However, CAM-based methods are flawed in two ways. First, the goals of classification and localization are inconsistent, so directly using the parameters of the final classifier to generate the class activation maps is harmful. As shown in Fig. 2, although the bird's body has been captured in the feature maps, the final prediction suffers from the under-utilization of these maps and is incomplete. Second, CAM-based methods process each image separately, which exposes them to the risk of accidental noise: cluttered background may lead to prediction failure. In contrast, predictions based on multiple images (Fig. 1(b)) are statistically more robust to noise. By extracting the commonality of multiple images of the same category, the accidental risk is reduced and the complementarity between images is fully explored.

Given the above concerns, we propose the non-negative matrix factorization mask (NMFM) module to generate object masks. Different from CAMs [35], NMFM does not rely on the final classifier. Instead, it obtains the object mask based on the appearance consistency of multiple images from the same category. Specifically, NMFM utilizes non-negative matrix factorization (NMF) to extract the commonalities between images. NMF was first proposed in [10] and has been widely used in face recognition [5], recommender systems [13] and data compression [11]. Given a non-negative matrix \(V \in R^{m \times n}\), NMF finds two non-negative matrices \(P\in R^{m \times c}\) and \(Q\in R^{c \times n}\) such that \(V \approx PQ\). The specific optimization problem is shown in Eq. 8.

$$\begin{aligned} \min _{P, Q} \quad f(P, Q) = \frac{1}{2} \sum _{i=1}^{n} \sum _{j=1}^{m}(V_{ij}-(PQ)_{ij})^2 \end{aligned}$$
(8)
$$ \text{ subject } \text{ to } \quad P_{ia} \ge 0, Q_{bj} \ge 0, \quad \forall i, a, b, j $$

Instead of relying on the classifier, we apply NMF to compress the feature maps \(F\in R^{W \times H \times N}\) into the object mask \(M\in R^{W \times H}\). Namely, we find a projection direction vector \(S\in R^{N\times 1}\) such that \(M = F \cdot S\) (dot product), where S is derived from the statistics of multiple F of the same category rather than from the parameters of the classifier. Specifically, we split F into \(W\times H\) vectors, each of which has N dimensions. Supposing there are T images in each category, we obtain \(T\times W\times H\) vectors. Stacking these vectors, we form a large matrix \(\Theta \in R^{TWH\times N}\). To find the optimal projection direction S, we use NMF to decompose \(\Theta \) into two small matrices \(\theta _1 \in R^{TWH\times 1}\) and \(\theta _2 \in R^{1\times N}\) such that \(\Theta \approx \theta _1 \cdot \theta _2\) (dot product), where \(\theta _1\) represents the set of dimension-reduced vectors and is discarded. \(\theta _2\) is what we need: it represents the projection direction and combines the commonality of multiple images. Namely, \(S=\theta _2^T\). According to \(M=F \cdot S\), the object mask is derived by \(M=F \cdot \theta _2^T\).
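A minimal sketch of this procedure using scikit-learn's rank-1 NMF is shown below. The solver initialization, iteration budget and normalization are our assumptions for illustration, and the features are assumed non-negative (e.g., taken after a ReLU).

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_masks(features):
    """Derive object masks for T images of one category via rank-1 NMF.

    features: [T, W, H, N] non-negative feature maps of the same category
    Returns masks of shape [T, W, H].
    """
    T, W, H, N = features.shape
    theta = features.reshape(T * W * H, N)       # the large matrix Theta
    nmf = NMF(n_components=1, init='nndsvda', max_iter=400)
    nmf.fit_transform(theta)                     # theta_1 ([TWH, 1]) is discarded
    theta2 = nmf.components_                     # theta_2 ([1, N]): shared projection direction
    masks = theta.dot(theta2.T).reshape(T, W, H) # M = F . theta_2^T
    masks = masks / (masks.max() + 1e-8)         # normalize to [0, 1] before thresholding
    return masks
```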

Compared with CAMs, NMFM does not rely on the classifier and hence makes better use of the feature maps, as shown in Fig. 2. Besides, NMFM extracts the commonality of a whole category of images, which is more robust to background noise. Note that NMFM is applied only once, after the classification model has been trained, to generate the pseudo masks; it is not part of the subsequent training or inference computation and thus adds no time complexity to either phase. With these pseudo masks, we train a class-agnostic segmentation model. The final object bounding boxes are extracted from the predictions of this class-agnostic segmentation model rather than from the pseudo masks generated by NMFM.

3.6 Class-Agnostic Segmentation Stage

Although NMFM generates accurate object masks, too many modules are involved in the object mask generation stage, which adds computation and complexity. To make inference faster and simpler, we use the object masks generated by NMFM as pseudo labels to train a separate class-agnostic segmentation model for prediction. Specifically, we use ResNet50 as the backbone network to extract features at five scales (denoted as \(\{f_i | i=1,...,5 \}\)) for each image. As in the baseline model of Sect. 3.2, only the features of the last three scales are used, namely \(f_3, f_4, f_5\). We upsample these features to the same scale and aggregate them by element-wise multiplication. Finally, we feed the aggregated features to a 1\(\,\times \,\)1 convolutional layer to generate the binary object mask, supervised by the pseudo labels derived from NMFM. During inference, we use the segmentation model to obtain the object mask for each image, and the complex object mask generation stage is discarded, so the whole inference process is simple and fast. Besides, compared with class activation maps, the segmentation predictions are already nearly binary, so precise threshold adjustment for bounding box extraction is no longer required.
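Extracting a box from such a near-binary mask then reduces to thresholding and taking the extent of the foreground pixels, as in the following sketch; the fixed 0.5 threshold is an illustrative choice.

```python
import numpy as np

def mask_to_bbox(mask, threshold=0.5):
    """Extract one bounding box (x1, y1, x2, y2) from a (near-)binary mask.

    Because the segmentation outputs tend toward 0 or 1, the result is far less
    sensitive to the threshold than thresholding a class activation map.
    """
    binary = mask >= threshold
    ys, xs = np.where(binary)
    if len(xs) == 0:  # no foreground predicted
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```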

4 Experiments

4.1 Experimental Setup

Datasets. CUB-200 [23] and ImageNet-1K [17] are adopted for model evaluation. CUB-200 consists of 200 categories, with 5,994 training images and 5,794 testing images. ImageNet-1K consists of 1,000 categories, with 1,281,197 training images and 50,000 testing images. All training images have only image-level labels, while the testing images have bounding box annotations.

Metrics. Following [3, 24, 35], three metrics are adopted to quantify model performance. 1) Top-1 localization (Top-1 Loc): the top-1 prediction is the correct image class and the IoU (Intersection over Union) between the predicted bounding box and the ground truth is larger than 0.5. 2) Top-5 localization (Top-5 Loc): the top-5 predictions contain the correct image class and the IoU between the predicted bounding box and the ground truth is larger than 0.5. 3) GT-known localization (GT-known Loc): the IoU between the predicted bounding box and the ground truth is larger than 0.5.
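For clarity, the IoU test shared by all three metrics can be written as follows; the box layout and the small epsilon are illustrative assumptions.

```python
def box_iou(box_a, box_b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

# GT-known Loc counts a prediction as correct when IoU > 0.5;
# Top-1 / Top-5 Loc additionally require the class to be among the top-1 / top-5 predictions.
```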

Data Augmentation and Training Settings. During training, we follow previous methods [3, 24, 31, 32, 35] and first resize each input image to \(256\times 256\), then randomly crop it to \(224\times 224\). A random flip is also adopted to increase the diversity of input images. During inference, random cropping is replaced by center cropping and the random flip is removed [3, 31]. We use the SGD optimizer to train our model, with a learning rate of 0.02 kept constant throughout training for both CUB-200 [23] and ImageNet-1K [17]. Due to the difference in dataset size, the numbers of training epochs for CUB-200 and ImageNet-1K are set to 32 and 5, respectively.

Table 1. Performance comparison with state-of-the-art methods. '–' means not reported. The highest scores are highlighted in bold.

4.2 Comparison with State-of-the-Arts

Quantitative Comparison. To evaluate the performance of the proposed ISIC, we train it on both CUB-200 [23] and ImageNet-1k [17], as shown in Table 1. Many state-of-the-art methods [1, 3, 4, 12, 15, 16, 24, 27, 31,32,33,34,35] are also included in Table 1 for comparison, with the highest scores highlighted in bold. Among all these methods, ISIC achieves the highest accuracy on both CUB-200 and ImageNet-1K in terms of the Top-1 Loc, Top-5 Loc and GT-Known Loc metrics. Especially on the GT-Known Loc metric, ISIC achieves a pronounced performance boost, demonstrating its superiority in object localization.

Visual Comparison. Figure 5 shows some localization maps for CUB-200 and ImageNet-1k, where the bottom and middle rows visualize the predictions of CAM [35] and our proposed ISIC, respectively. Compared with CAM, ISIC covers more complete object regions rather than focusing only on the most discriminative ones. Besides, ISIC predictions preserve sharper object boundaries and more detailed shapes.

4.3 Ablation Studies

Ablation Study for Each Component. We use CUB-200 to evaluate each component of the proposed ISIC. As shown in Table 2, the ICFS loss largely improves the localization accuracy of the baseline model, by 4.9% on the GT-Known Loc metric, surpassing many SOTA methods and proving the significance of inter-class similarity for WSOL. Compared with the excessive pursuit of inter-class difference in classification models, the ICFS loss guides the model to a better balance between inter-class similarity and inter-class difference, thus achieving more complete predictions. NMFM also boosts the model performance by suppressing noise and improving feature utilization. With all components, the object localization capability of ISIC is largely enhanced.

Fig. 5.

Visualization of the localization maps with CAM [35] (bottom row) and the proposed ISIC (middle row). Ground truth bounding boxes and the predicted bounding boxes are shown in red and green color, respectively. (Color figure online)

Ablation Study for \(\lambda \). In Eq. 6, \(\lambda \) balances \(\mathcal {L}_{\text {CE}}\) and \(\mathcal {L}_{\text {ICFS}}\). To study its effect on performance, different values are evaluated on CUB-200, as shown in Table 3. \(\lambda =0\) means no ICFS supervision. When \(\lambda =1.0\), the model reaches an equilibrium between inter-class similarity and inter-class difference and achieves the best performance. However, when \(\lambda \) keeps increasing, the balance is broken and the model degrades.

Ablation Study for \(\gamma \). In Sect. 3.4, we set a threshold \(\gamma \) to find the similar categories. Table 4 shows its effect at different values. When \(\gamma =0.3\), our model achieves the best performance.

Visualization of the Similar Categories. Figure 6 shows some images from the similar categories (Sect. 3.4). As shown, category similarity is widespread both in the fine-grained dataset (CUB-200) and in the general dataset (ImageNet-1k).

Fig. 6.

Images from the similar categories. One row represents a group of categories.

Table 2. Ablation studies for each component of ISIC. BASE is the baseline model. ICFS and NMFM are the proposed components of ISIC. SEG means the class-agnostic segmentation model. CUB-200 is adopted for evaluation.
Table 3. Ablation studies for \(\lambda \).
Table 4. Ablation studies for \(\gamma \).

5 Conclusion

In this paper, we investigate the effect of inter-class similarity on WSOL and propose the ICFS loss to counterbalance the widely used cross-entropy loss. Besides, considering that predictions from the classifier are biased toward the classification task, we propose to abandon CAMs and apply non-negative matrix factorization to generate object masks. All the proposed modules greatly improve WSOL performance.