
1 Introduction

Weakly supervised object localization (WSOL), which learns to localize objects using only image-level labels, has recently attracted much attention for its low annotation cost. The representative WSOL study, Class Activation Map (CAM) [36], generates localization results from the features of the last convolutional layer. However, a model trained for classification usually focuses on the most discriminative regions, resulting in insufficient activation for object localization. To address this issue, many CNN-based methods have been proposed in the literature, including regularization [18, 28, 30, 33], adversarial training [5, 18, 33], and divergent activation [25, 30, 31], but CNNs' inherent limitation of local activation dampens their performance. Although discriminative activation is optimal for minimizing the image classification loss, it cannot capture object boundaries precisely.

Fig. 1. Transformer-based localization pipelines in WSOL. The dashed arrows indicate the module parameters updated during backpropagation. (a) TS-CAM [7]: the training pipeline encodes the feature maps into semantic maps (SM) through a convolution head, then applies GAP to receive gradients from the image-label supervision. (b) SCM (Ours): our training pipeline incorporates an external SCM to produce new semantic maps refined with the learned spatial and semantic correlation, then updates the Transformer backbone through backpropagation to obtain better attention maps and semantic representations for WSOL. (c) Inference: SCM is dropped, and we couple the attention maps (AM) and SM as in TS-CAM for the final localization prediction. (d) Comparison of AM, SM, and final activation maps of TS-CAM and the proposed SCM.

Recently, vision Transformers have succeeded in computer vision due to their superior ability to capture long-range feature dependencies. Vision Transformer [24] splits an input image into patches with positional embeddings and constructs a sequence of tokens as its visual representation. The self-attention mechanism enables the Transformer to learn long-range semantic correlations, which is pivotal for object localization. A representative study is Token Semantic Coupled Attention Map (TS-CAM) [7], which replaces the traditional CNN with a Transformer and takes full advantage of long-range dependencies to address the partial activation problem. It localizes objects with semantic-aware attention maps derived from patch tokens. However, we argue that using a Transformer alone is not optimal in practice. Firstly, the Transformer attends to long-range global dependencies but inevitably cannot capture local structure well, which is critical for describing object boundaries. In addition, the Transformer splits images into discrete patches and thus may not attend to the inherent spatial coherence of objects, which prevents it from predicting complete activation. As shown in Fig. 1(d), the activation map obtained from TS-CAM captures the global structure, but it still concentrates on a small semantic-rich region such as the bird's upper body, failing to fully resolve partial activation. Furthermore, we observe that the bird's texture shows no abrupt change across neighboring locations, so its semantic context favors propagating the activated regions to produce a more accurate result covering the whole body.

Inspired by this potential continuity, we propose a novel external module named Spatial Calibration Module (SCM), tailored for Transformers to produce activation maps with sharper boundaries. As shown in Fig. 1(a)-(b), instead of directly applying Global Average Pooling (GAP) on semantic maps to calculate the loss as in TS-CAM [7], we insert an external SCM to refine both semantic and attention maps and then use the calibrated features to calculate the semantic loss. More precisely, it implicitly calibrates the Transformer's attention representations and produces more meaningful activation maps that cover object regions based on spatial and contextual coherence. Our core design, a unified diffusion model, incorporates the semantic similarities of patch tokens and their local spatial relations during training. In the inference phase, SCM is dropped to keep the model simple, as shown in Fig. 1(c). Then, we use the calibrated Transformer backbone to predict the localization results by coupling SM and AM. The main contributions of this paper are as follows:

  1. We propose a novel spatial calibration module (SCM) as an external Transformer module to solve the partial activation problem in WSOL by leveraging the spatial correlation. Specifically, SCM is designed to optimize Transformers implicitly and is dropped during inference.

  2. We propose a novel information propagation methodology that provides a flexible way to integrate spatial and semantic relationships to enlarge the semantic-rich regions and cover objects completely. In practice, we introduce learnable parameters to adjust the diffusion range and filter noise dynamically for flexible control and better adaptability.

  3. Extensive experiments demonstrate that the proposed framework outperforms its counterparts on two challenging WSOL benchmarks.

2 Related Work

2.1 Weakly Supervised Object Localization

Weakly supervised object localization aims to localize objects using only image-level labels. The seminal work CAM [36] demonstrates the effectiveness of localizing objects with feature maps from CNNs originally trained for classification. Despite its simplicity, CAM-based methods suffer from limited discriminative regions, which cannot cover objects completely. The field has therefore focused on expanding the activation through various attempts. Firstly, dropout strategies are proposed to guide the model to attend to more significant regions. For instance, HaS [25] hides patches of training images randomly to force the network to seek other relevant parts; CutMix [31] also removes patches but fills the removed regions with patches from other training images, with ground-truth labels mixed in proportion to the patch area, to reduce information loss. Similarly, ADL [5] adopts an importance map to maintain the classification power of informative regions. Instead of dropping patches, other methods leverage pixel correlations to complete objects, as neighboring pixels often share similar patterns. SPG [34] learns to sense more areas with similar distributions and expand the attention scope. I\(^2\)C [35] exploits inter- and cross-image pixel-level consistency to improve the quality of localization maps. Furthermore, the predicted masks can be enhanced for completeness. GC-Net [16] highlights tight geometric shapes to fit the masks. SPOL [27] fuses shallow and deep CNN features to filter background noise and generate sharp boundaries.

Instead of using only CNNs as the backbone for WSOL, the Transformer is another candidate for alleviating partial activation, as it captures long-range feature dependencies. A recent study, TS-CAM [7], utilizes attention maps from patch tokens coupled with reallocated semantics to predict localization maps, surpassing most of its CNN counterparts in WSOL. The recent LCTR [2] adopts a similar Transformer framework while inserting a tailored module into each Transformer block to strengthen global features. However, we observe that using the Transformer alone cannot completely solve partial activation, as it fails to capture local structure and ignores spatial coherence. Moreover, it is cumbersome to insert a module into each Transformer block as LCTR [2] does. To address these issues, we propose a simple external module termed the spatial calibration module (SCM) that calibrates the Transformer by incorporating spatial and semantic relations to produce more complete feature maps and erase background noise.

2.2 Graph Diffusion

Pixels in natural images generally exhibit strong correlation, and constructing graph structures to capture such relationships has attracted much attention. In semantic segmentation, studies such as [13, 14] build graphs on images to obtain contextual information and long-term dependencies to jointly model label distributions. In image preprocessing, Gene et al. [3] analyze graphs constructed from 2D images in the spectral domain and succeed in many traditional processing areas, including image compression, restoration, filtering, and segmentation. The graph structure enables many classic graph algorithms and leads to new insights into image properties.

Similarly, in WSOL, the limited activation regions share semantic coherence with neighboring locations, making it possible to expand the activated area via information flow to cover objects precisely. In our study, we revise the classic Graph Diffusion Kernel (GDK) algorithm [11] to infer complete pseudo masks from partial activation results. GDK was initially adopted in graph analysis, e.g., social networks [1], search engines [17], and biology [22], to infer pathway membership in genetic interaction networks. GDK's strategy of exploring graphs via random walks inspires us to modify it to incorporate information from the image context, enabling dynamic adjustment according to semantic similarity.

3 Methodology

This section describes the Spatial Calibration Module (SCM), which is built by stacking multiple activation diffusion blocks (ADBs). An ADB consists of several submodules: semantic similarity estimation, activation diffusion, diffuse matrix approximation, and dynamic filtering. At the end of the section, we show how to predict the final localization results with the proposed framework during inference.

Fig. 2. The overall framework consists of two parts. (Left) The Vision Transformer provides the original attention map \(\boldsymbol{F}^0\) and semantic map \(\boldsymbol{S}^0\). (Right) They are dynamically adjusted by stacked activation diffusion blocks (ADBs); the layer design is detailed in the bottom-right corner (the residual connections for \(\boldsymbol{F}^l\) and \(\boldsymbol{S}^l\) are omitted for simplicity). Once the model is optimized, \(\boldsymbol{F}^0\) and \(\boldsymbol{S}^0\) are directly element-wise multiplied for the final prediction.

3.1 Overall Architecture

In WSOL, the attention maps of models trained on image-level labels mainly concentrate on discriminative parts and fail to cover whole objects. Our proposed SCM diffuses the activation of small areas outwards to alleviate the partial activation problem in WSOL. In a broad view, the whole framework is supervised by image-level labels during training. As shown in Fig. 1(b), the Transformer learns to calibrate both attention maps and semantic maps implicitly through the semantic loss computed after SCM. For inference, as described in Fig. 1(c), we drop SCM and use the element-wise product of the calibrated maps to localize objects.

As shown in Fig. 2, an input image is split into \(N=H\times W\) patches, each represented as a token, where (H, W) is the patch resolution. After grouping these patch tokens and the CLS token into a sequence, we send it into I cascaded Transformer blocks for further representation learning. Similar to TS-CAM [7], to build the initial attention map \(\boldsymbol{F}^{0} \in \mathbb {R}^{H \times W}\), the self-attention matrix \(\boldsymbol{W}_i \in \mathbb {R}^{(N+1)\times (N+1)}\) at the \(i^{th}\) layer is averaged over the multiple self-attention heads. Denoting \(\boldsymbol{M}_i \in \mathbb {R}^{H\times W}\) as the attention weights that correspond to the class token in \(\boldsymbol{W}_i\), we average \(\{\boldsymbol{M}_i\}_{i=1}^I\) across all intermediate layers to obtain the attention map \(\boldsymbol{F}^{0}\) of the Transformer.

$$\begin{aligned} {\boldsymbol{F}^0 = \frac{1}{I} \sum _{i=1}^I {\boldsymbol{M}_i}} \end{aligned}$$
(1)
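For illustration, a minimal PyTorch sketch of Eq. (1), assuming the per-layer self-attention matrices are available as a list of tensors and the CLS token sits at index 0 (names and shapes are ours, not the authors' implementation):

```python
import torch

def initial_attention_map(attn_mats, H, W):
    """Average class-token attention over heads and layers (Eq. 1).

    attn_mats: list of I tensors, each of shape [num_heads, N+1, N+1],
    with N = H * W patch tokens and the CLS token at index 0.
    """
    maps = []
    for W_i in attn_mats:
        W_i = W_i.mean(dim=0)           # average over the self-attention heads
        M_i = W_i[0, 1:]                # CLS-token row: attention to the N patches
        maps.append(M_i.reshape(H, W))  # back onto the 2D patch grid
    return torch.stack(maps).mean(dim=0)  # average over the I layers -> F^0
```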

To obtain the semantic map \(\boldsymbol{S}^{0} \in \mathbb {R}^{H \times W \times C}\), where C denotes the number of categories, we extract all spatial tokens \( \{ \boldsymbol{t}_{n} \}_{n=1}^N\) from the last Transformer layer and then encode them by a convolution head,

$$\begin{aligned} \boldsymbol{S}^0 = \text {reshape} (\boldsymbol{t}_{1}... \boldsymbol{t}_{N}) * \boldsymbol{k} \end{aligned}$$
(2)

where \(*\) is the convolution operation, \(\boldsymbol{k}\) is a \(3\times 3\) convolution kernel, and \(\text {reshape}(\cdot )\) is an operation that converts a sequence of tokens into 2D feature maps. Then we send both \(\boldsymbol{F}^{0}\) and \(\boldsymbol{S}^{0}\) into SCM to refine them.
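A minimal sketch of Eq. (2), assuming DeiT-S-style token dimensions; the shapes, names, and 200-class head are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative dimensions: N = H * W patch tokens with embedding size D and
# C object categories.
H, W, D, C = 14, 14, 384, 200
conv_head = nn.Conv2d(D, C, kernel_size=3, stride=1, padding=1)  # kernel k in Eq. (2)

def semantic_map(patch_tokens):
    """patch_tokens: [B, N, D] spatial tokens from the last Transformer layer."""
    B, N, D_ = patch_tokens.shape
    x = patch_tokens.transpose(1, 2).reshape(B, D_, H, W)  # the reshape(.) operation
    return conv_head(x)  # S^0 with shape [B, C, H, W] (channel-first layout)
```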

As illustrated in Fig. 2, for the \(l^{th}\) ADB, denote \({\boldsymbol{S}}^{l}\) and \(\boldsymbol{F}^{l}\) as the inputs, and \(\boldsymbol{S}^{l+1}\) and \(\boldsymbol{F}^{l+1}\) as the outputs. First, to guide the propagation, we estimate the embedding similarity \(\boldsymbol{E}\) between pairs of patches in \({\boldsymbol{S}}^{l}\). To enlarge the activation \(\boldsymbol{F}^{l}\), we apply \(\boldsymbol{E}\) to diffuse \(\boldsymbol{F}^{l}\) towards the equilibrium status indicated by the inverse of the Laplacian matrix \(\boldsymbol{L}^{l}\). In practice, we re-activate \(\boldsymbol{F}^{l}\) by approximating \((\boldsymbol{L}^{l})^{-1}\) with the Newton-Schulz iteration. Afterward, a dynamic filtering module removes over-diffused parts. Finally, the refined \(\boldsymbol{F}^{l}\) updates \(\boldsymbol{S}^{l}\) via an element-wise multiplication.

In general, by stacking multiple ADBs, the intensity of both maps is dynamically adjusted to balance semantic and spatial features. In the training phase, we apply GAP to \(\boldsymbol{S}^{L}\) to obtain classification logits and calculate the semantic loss against the ground truth. During inference, SCM is dropped, and the element-wise product of the newly extracted \({\boldsymbol{F}}^{0}\) and \({\boldsymbol{S}}^{0}\) is used to obtain the localization result.
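A minimal sketch of this training objective, where cross-entropy is our assumption for the "semantic loss":

```python
import torch.nn.functional as F

def semantic_loss(S_L, labels):
    """GAP over the refined semantic map S^L gives class logits, which are
    compared with the image-level labels.

    S_L: [B, C, H, W] output of the last ADB; labels: [B] class indices.
    """
    logits = S_L.mean(dim=(2, 3))          # global average pooling (GAP)
    return F.cross_entropy(logits, labels)
```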

3.2 Activation Diffusion Block

In this subsection, we dive into the Activation Diffusion Block (ADB). Under the assumption of the continuity of visual content, we calculate the semantic and spatial relationships of patches in \(\boldsymbol{S}^{l}\), then diffuse the activation outwards dynamically to alleviate the partial activation problem in WSOL.

3.2.1 Semantic Similarity Estimation.

Within the \(l^{th}\) activation diffusion block, \(l \in \{1, 2, ..., L\}\), we need the semantic and spatial relationships between every pair of patches for propagation. Given the token representation of \(\boldsymbol{S}^l\), we build an N-node undirected graph \(G^l\), where the \(i^{th}\) node \(\boldsymbol{v}_i^l\in \mathbb {R}^{Q}\) is connected to its first-order neighbors (please refer to Fig. 5 in the Appendix for details). We then infer the semantic similarity \(\boldsymbol{E}^l\), whose element \(\boldsymbol{E}^l_{i, j}\) is defined as the cosine similarity between \(\boldsymbol{v}_i^l\) and \(\boldsymbol{v}_j^l\):

$$\begin{aligned} \boldsymbol{E}^l_{i, j} = \frac{{\boldsymbol{v}}_i^l({\boldsymbol{v}}_j^l)^{\intercal }}{|| {\boldsymbol{v}_i}^l || ||{\boldsymbol{v}_j}^l||} \end{aligned}$$
(3)

where \(\boldsymbol{v}_i^l\) and \(\boldsymbol{v}_j^l\) are flattened vectors, and a larger value of \(\boldsymbol{E}^l_{i, j}\) indicates higher similarity between \(\boldsymbol{v}_i^l\) and \(\boldsymbol{v}_j^l\).
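A minimal PyTorch sketch of Eq. (3); shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def semantic_similarity(S_l):
    """Pairwise cosine similarity between patch embeddings (Eq. 3).

    S_l: [B, N, Q] flattened node representations v_i^l.
    Returns E^l of shape [B, N, N]; larger values mean higher similarity.
    """
    v = F.normalize(S_l, dim=-1)             # divide each v_i^l by its norm
    return torch.bmm(v, v.transpose(1, 2))   # inner products of unit vectors
```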

Fig. 3. Illustration of the activation diffusion pipeline with a hand-crafted example. (a) Input image. (b) Original Transformer attention map. (c) Diffused attention map. (d) Filtered attention map. As spatial coherence is embedded into the attention map via our SCM, the attention map obtained by the proposed method captures a complete object boundary with less noise.

3.2.2 Activation Diffusion.

To represent the spatial relationship, we define a binary adjacency matrix \(\boldsymbol{A}^l \in \mathbb {R}^{N \times N}\), whose element \(\boldsymbol{A}^l_{i, j}\) indicates whether \(\boldsymbol{v}_i^l\) and \(\boldsymbol{v}_j^l\) are connected. We further introduce a diagonal degree matrix \(\boldsymbol{D}^l \in \mathbb {R}^{N \times N}\), where \(\boldsymbol{D}^l_{i, i}\) is the degree of \(\boldsymbol{v}_i^l\). Then we obtain the Laplacian matrix \(\hat{\boldsymbol{L}}^l = \boldsymbol{D}^l - \boldsymbol{A}^l\), with each element \((\boldsymbol{L}^l)^{-1}_{i, j}\) of the inverse describing the correlation of \(\boldsymbol{v}_i^l\) and \(\boldsymbol{v}_j^l\) at the equilibrium status.

Recent studies [6, 13, 14] on graph representation suggest that the inverse of the Laplacian matrix leads to global diffusion, which allows each unit to communicate with all the others. To enhance the diffusion with semantic relationships, we combine \(\hat{\boldsymbol{L}}^l\) with the node contextual information \(\boldsymbol{E}^l\). Intuitively, we take advantage of spatial connectivity and semantic coherence to split the tokens into semantic-aware foreground objects and the background environment. In practice, we use a learnable parameter \(\lambda \) to dynamically adjust the semantic intensity, which makes the diffusion process more flexible and easier to adapt to various situations. The Laplacian matrix \(\boldsymbol{L}^l\) with semantics is defined as,

$$\begin{aligned} {\boldsymbol{L}^l} = (\boldsymbol{D}^l - \boldsymbol{A}^l) \odot (\lambda \boldsymbol{E}^l-\boldsymbol{1}) \end{aligned}$$
(4)

where \(\odot \) represents element-wise multiplication, and \(\boldsymbol{1}\) is the all-ones matrix, which preserves the information-flow exchange with neighboring vertices. \((\boldsymbol{D}^l - \boldsymbol{A}^l)\) captures the spatial connectivity, \((\lambda \boldsymbol{E}^l-\boldsymbol{1})\) captures the semantic coherence, and \(\odot \) combines them for diffusion. Please refer to the Appendix for the full derivation of Eq. (4). After the global propagation, the reallocated activation score map can be calculated as follows,

$$\begin{aligned} \boldsymbol{F}^{l+1} = ({\boldsymbol{L}^l})^{-1} \Gamma (\boldsymbol{F}^{l}) \end{aligned}$$
(5)

where \(\boldsymbol{F}^{l+1}\) is the re-allocated attention map and \(\Gamma \) is a flattening operation that reshapes \(\boldsymbol{F}^{l}\) into a patch sequence.
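The following sketch constructs the first-order grid adjacency and the semantically modulated Laplacian of Eq. (4); the 4-connected neighborhood and the unbatched shapes are our assumptions for illustration:

```python
import torch

def grid_adjacency(H, W):
    """Binary adjacency A for an H x W patch grid; we assume first-order
    neighbors means 4-connectivity (up/down/left/right)."""
    N = H * W
    A = torch.zeros(N, N)
    for i in range(H):
        for j in range(W):
            n = i * W + j
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < H and 0 <= jj < W:
                    A[n, ii * W + jj] = 1.0
    return A

def semantic_laplacian(E_l, A, lam):
    """Eq. (4): L^l = (D - A) * (lam * E^l - 1), with D the diagonal degree matrix."""
    D = torch.diag(A.sum(dim=-1))
    return (D - A) * (lam * E_l - 1.0)
```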

3.2.3 Diffuse Matrix Approximation.

In practice, directly using \(({\boldsymbol{L}^l})^{-1}\) may be impractical since \({\boldsymbol{L}^l}\) is not guaranteed to be positive-definite and its inverse may not exist. Meanwhile, as observed in our initial experiments, directly applying the inverse produced unwanted artifacts. To deal with these problems, we exploit the Newton-Schulz iteration [20, 21] to approximate \(({\boldsymbol{L}^l})^{-1}\) and thus the global diffusion result,

$$\begin{aligned} \begin{aligned} X_0&= \alpha (\boldsymbol{L}^l)^{\intercal }\\ X_{p+1}&= X_{p}(2\boldsymbol{I}-\boldsymbol{L}^lX_p), \end{aligned} \end{aligned}$$
(6)

where \(X_0\) is initialized as \((\boldsymbol{L}^l)^{\intercal }\) multiplied by a small constant \(\alpha \). The subscript p denotes the iteration index, and \(\boldsymbol{I}\) is the identity matrix. As discussed above, we only need \(({\boldsymbol{L}^l})^{-1}\) to thrust the propagation rather than to reach the exact equilibrium, so we iterate Eq. (6) p times and substitute the approximated \(({\boldsymbol{L}^l})^{-1}\) back into Eq. (5). We then obtain the diffused activation of \(\boldsymbol{F}^{l}\), visualized in Fig. 3(c). We can see that the diffusion has redistributed the averaged attention map with more boundary details, such as the ear and the mouth, which are beneficial for final object localization.
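A minimal sketch of the approximation in Eq. (6) together with the propagation of Eq. (5), using the \(p=4\) and \(\alpha =0.002\) reported in the implementation details (function names are ours):

```python
import torch

def newton_schulz_inverse(L, p=4, alpha=2e-3):
    """Approximate L^{-1} with p Newton-Schulz iterations (Eq. 6)."""
    I = torch.eye(L.shape[-1], dtype=L.dtype, device=L.device)
    X = alpha * L.transpose(-2, -1)      # X_0 = alpha * L^T
    for _ in range(p):
        X = X @ (2.0 * I - L @ X)        # X_{p+1} = X_p (2I - L X_p)
    return X

def diffuse(F_l, L_inv, H, W):
    """Eq. (5): flatten F^l into a patch sequence (the Gamma operation),
    propagate with the approximate inverse, and reshape back to the grid."""
    return (L_inv @ F_l.reshape(-1, 1)).reshape(H, W)
```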

3.2.4 Dynamic Filtering.

As depicted in Fig. 3(c), the reallocated score map \(\boldsymbol{F}^{l+1}\) provides a sharper boundary, but a side effect is that it also diffuses activation beyond the object boundaries, which may leak unnecessary background context into \(\boldsymbol{S}^{l+1}\) or lead to over-estimated bounding boxes. Therefore, we propose a soft-threshold filter, given in Eq. (7), to increase the contrast between the objects and the surrounding background and suppress the outside noise.

$$\begin{aligned} \mathcal {T}(\boldsymbol{F}^{l},\beta ) = \beta \cdot \text {tanhShrink}(\frac{\boldsymbol{F}^{l}}{\beta }) \end{aligned}$$
(7)

where \(\beta \in (0, 1)\) is a threshold parameter for more flexible control, \(\mathcal {T}\) denotes the soft-threshold function, and \(\text {tanhShrink}(x) = x - \text {tanh}(x)\) suppresses activation below \(\beta \). Then \(\boldsymbol{S}^{l+1}=\boldsymbol{S}^{l}\odot \mathcal {T}(\boldsymbol{F}^{l},\beta )\). As shown in Fig. 3(d), the filtering operation removes noise and provides sharper contrast.
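A minimal sketch of Eq. (7) and the semantic-map update; shapes are assumed as above:

```python
import torch.nn.functional as F

def soft_threshold(F_l, beta):
    """Eq. (7): T(F, beta) = beta * tanhshrink(F / beta), which suppresses
    activation values below the learnable threshold beta."""
    return beta * F.tanhshrink(F_l / beta)

def update_semantic_map(S_l, F_l, beta):
    """S^{l+1} = S^l * T(F^l, beta): the filtered attention re-weights every
    class channel of the semantic map (S_l: [C, H, W], F_l: [H, W])."""
    return S_l * soft_threshold(F_l, beta).unsqueeze(0)
```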

3.3 Prediction

After optimizing the model through backpropagation, the calibrated Transformer can generate object-boundary-aware activation maps. Thus, we drop SCM during inference to obtain the final bounding box. Specifically, the bounding box prediction is generated by coupling \(\boldsymbol{S}^0\) and \(\boldsymbol{F}^0\), as depicted in Fig. 2. Since \(\boldsymbol{S}^0\in \mathbb {R}^{H \times W \times C}\) is a C-channel 2D semantic map, each channel represents an activation map for a specific class c. To obtain the prediction from the score maps, we carry out the following procedure: (1) pass \(\boldsymbol{S}^0\) through a GAP layer to calculate classification scores; (2) select the map \(\boldsymbol{S}_c^0\in \mathbb {R}^{H \times W}\) corresponding to the highest classification score from \(\boldsymbol{S}^0\); (3) calculate the element-wise product \(\boldsymbol{F}^0 \odot \boldsymbol{S}_c^0\). The coupled result is then up-sampled to the input size for bounding box prediction.
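A minimal sketch of this inference procedure; bilinear up-sampling and the channel-first layout are our assumptions:

```python
import torch
import torch.nn.functional as F

def localize(S0, F0, input_size=224):
    """Couple S^0 and F^0 at inference (SCM dropped).

    S0: [C, H, W] semantic map, F0: [H, W] attention map.
    Returns the up-sampled activation map of the top-scoring class;
    thresholding with gamma and box extraction follow.
    """
    logits = S0.mean(dim=(1, 2))     # (1) GAP -> per-class classification scores
    c = logits.argmax()              # (2) channel with the highest score
    act = F0 * S0[c]                 # (3) element-wise coupling F^0 * S_c^0
    act = F.interpolate(act[None, None], size=(input_size, input_size),
                        mode='bilinear', align_corners=False)
    return act[0, 0]
```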

Fig. 4. Visual comparison of TS-CAM and SCM on 4 samples from CUB-200-2011 and ILSVRC2012. For each method, three rows show the activation maps, the binary map predictions, and the bounding box predictions, respectively. The threshold \(\gamma \) is set to the optimal value proposed in TS-CAM and SCM, respectively.

4 Experiments

4.1 Experiment Settings

4.1.1 Datasets.

We evaluate SCM on two commonly used benchmarks, CUB-200-2011 [29] and ILSVRC2012 [23]. CUB-200-2011 is an image dataset of 200 bird species, containing a training set of 5,994 images and a test set of 5,794 images. ILSVRC contains about 1.2 million training images from 1,000 categories and 50,000 validation images. SCM is trained on the training set and evaluated on the validation set, of which we use only the bounding box annotations for evaluation.

4.1.2 Evaluation Metrics.

We evaluate the performance with the commonly used GT-Known metric and save the models with the best performance. For GT-Known, a bounding box prediction is positive if its Intersection-over-Union (IoU) \(\delta \) with at least one of the ground-truth boxes is over 50%. Furthermore, for a fair comparison with previous works, we report the commonly used Top-1/5 Localization Accuracy (Loc Acc) and Classification Accuracy (Cls Acc). Compared with GT-Known, Loc Acc additionally requires the correct classification result. Please refer to the Appendix for stricter measures such as MaxboxAccV1 and MaxboxAccV2, recommended by [4] to evaluate localization performance only.

4.1.3 Implementation Details.

The Transformer module is built upon Deit [26] pretrained on ILSVRC. In detail, we initialize \(\lambda \) and \(\beta \) in the ADBs to constant values (1 and 0.5, respectively), and choose \(p=4\) and \(\alpha =0.002\) in Eq. (6). Each input image is re-scaled to 256\(\times \)256 and then randomly cropped to 224\(\times \)224. The MLP head of the pretrained Transformer is replaced by a 2D convolution head with a kernel size of 3, stride of 1, and padding of 1 to encode feature maps into the semantic maps \(\boldsymbol{S}^0\) (200 output channels for CUB-200-2011 and 1,000 for ILSVRC). The new head is initialized with He's approach [9]. During training, we use AdamW [15] with \(\epsilon =1e^{-8}\), \(\beta _{1}=0.9\), \(\beta _{2}=0.99\), and a weight decay of 5e-4. On CUB-200-2011, training lasts 30 epochs with an initial learning rate of 5e-5 and a batch size of 256. On ILSVRC, training runs for 20 epochs with a learning rate of 1e-6 and a batch size of 512. We measure model performance on the validation set after every epoch and save the parameters with the best GT-Known performance.
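A sketch of the reported optimizer and input pipeline; the model object is a placeholder, and dataset wiring is omitted:

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Reported input pipeline: resize to 256x256, then random crop to 224x224.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])

# Placeholder module standing in for the Deit-S backbone + 3x3 conv head + SCM.
model = nn.Conv2d(3, 200, kernel_size=3, stride=1, padding=1)

# Reported AdamW settings (learning rate shown for CUB-200-2011).
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, eps=1e-8,
                              betas=(0.9, 0.99), weight_decay=5e-4)
```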

4.2 Performance

To demonstrate the effectiveness of the proposed SCM, we compare it against previous methods on CUB-200-2011 and ILSVRC2012 in Table 1. In terms of GT-Known on CUB, SCM outperforms the baseline TS-CAM [7] by a large margin, yielding 96.6\(\%\) with a gain of 8.9\(\%\). Compared with its CNN counterparts, SCM is competitive and outperforms the state-of-the-art SPOL [27] using only about 24\(\%\) of the parameters. On ILSVRC, SCM surpasses TS-CAM by 1.2\(\%\) on GT-Known and 5.1\(\%\) on Top-1 Loc Acc and is competitive against SPOL, which is built on multi-stage CNN models. Compared with SPOL, SCM has the following advantages: (1) Simple: SPOL produces semantic maps and attention maps with two separate modules, while SCM is finetuned on a single backbone. (2) Lightweight: SPOL is built on a multi-stage model with a huge number of parameters, while SCM is built on a small Transformer with only about 24\(\%\) of the former's parameters. (3) Convenient: SPOL has to run its complex network design at inference, while SCM is dropped during the inference stage. Furthermore, compared with recent Transformer-based works such as LCTR [2] with the same Deit-S backbone, we surpass it by a large margin of 4.2\(\%\) in GT-Known on CUB and obtain comparable Loc Acc on both CUB and ILSVRC. We achieve this without additional parameters during inference, while other recently proposed methods add carefully designed modules or processes to improve performance. The models are saved with the best GT-Known performance and achieve satisfactory Loc Acc and Cls Acc. Please refer to Sect. 4.3 for more details.

The visual comparison of SCM and TS-CAM is shown in Fig. 4. We observe that TS-CAM preserves the global structure but still suffers from the partial activation problem, which degrades its localization ability. Specifically, it cannot predict a complete component from the activation map. We notice that minor and sporadic artifacts appear on the binarized maps, and most of them cover only part of the objects. After adding SCM as a simple external adapter, the masks become integral and accurate, so we believe SCM is necessary for Transformers to find their niche in WSOL.

Table 1. Comparison of SCM with state-of-the-art methods in both classification and localization on the CUB [29] and ILSVRC [23] test sets. The Params column indicates the number of parameters of the backbone on which each model is built. The improvements of our method over TS-CAM [7] are also shown. GT-K. stands for GT-Known.

4.3 Ablation Study

In this section, we first illustrate the trade-off between localization and classification given a pre-determined backbone. Then we explore why SCM can reallocate and enlarge activation from two perspectives. Specifically, we show the visual results of both semantic maps \(\boldsymbol{S}^l\) and attention maps \(\boldsymbol{F}^l\) across all layers, and analyze them with the trend of the learnable parameters during training. Next, we illustrate the influence of module scale by stacking different numbers of ADBs. At last, we apply SCM to other Transformers, such as ViT [24] and Conformer [8], to demonstrate SCM's adaptability. Unless mentioned otherwise, we carry out all experiments on Deit-small with an SCM consisting of four ADBs, and all experiments share the implementation details discussed above.

Fig. 5. (a) Overview of the activation-score propagation, which evolves from the raw attention regions to the semantic-rich regions. (b) Status with the best Loc Acc at a relatively early training stage. (c) Status with the best Cls Acc at a later training stage. (d) Comparison of SCM on different Transformers and at various scales. We record GT-Known and the epoch at which the best GT-Known performance is obtained.

Trade-off Between Classification and Localization. SCM is an external module and is dropped during inference, adding no additional computational burden. Thus there is a trade-off between localization and classification performance when the backbone is pre-determined. As shown in Fig. 5(a), SCM aims to calibrate the raw attention to localize the bird. Specifically, the Transformer trained with SCM localizes objects well but suffers from sub-optimal Cls Acc in Fig. 5(b). In contrast, as training continues, it classifies objects better but focuses only on the discriminative parts of the object, resulting in worse localization in Fig. 5(c). To clearly show the advantage of SCM for localization, we save the model with the highest GT-Known, as depicted in Fig. 5(b).

Visualization Result of \(\boldsymbol{S}^l\) and \(\boldsymbol{F}^l\). The implicit attention of models trained on image-level labels is blessed with remarkable localization ability, as shown by CAM [36]. However, due to the effect of the label-wise semantic loss, the models are eventually driven to gather around semantic-rich regions, causing the problem of partial activation. TS-CAM [7] suffers from a similar issue despite improving localization performance with the Transformer's long-range feature dependency. In Fig. 6, we display both \(\boldsymbol{S}^l\) and \(\boldsymbol{F}^l\) at each layer of SCM. We observe that \(\boldsymbol{F}^0\) and \(\boldsymbol{S}^0\) already cover the object completely, demonstrating that SCM can calibrate the Transformer to cover objects. As the layers get deeper, \(\boldsymbol{S}^l\) and \(\boldsymbol{F}^l\) concentrate more on semantic-rich regions, and \(\boldsymbol{S}^L\) at the last layer is used to calculate the loss. This explains why we drop SCM instead of appending it to the Transformer: sharper boundaries are provided at \(\boldsymbol{S}^0\) and \(\boldsymbol{F}^0\).

Fig. 6. Visualization of both the semantic maps \(\boldsymbol{S}^l\) (upper) and attention maps \(\boldsymbol{F}^l\) (lower) input to the \(l^{th}\) ADB for a sample from the CUB-200-2011 test set.

Propagating and Filtering. To understand the effect of propagating and filtering, we analyze the parameters \(\lambda \) and \(\beta \) in each layer of SCM. As shown in Fig. 7, the training record shows that \(\lambda \) in deeper layers increases, while \(\lambda \) in shallow layers decreases. This indicates that SCM learns to diffuse activation in the front layers while concentrating it in the later layers, verifying that SCM can enlarge partially activated regions under label-wise supervision. On the other hand, \(\beta \) at all layers drops at the beginning, possibly because the activation provided by the Transformer is sparse; it takes time for the model to shift its focus from classification to localization, as the Transformer is pretrained for classification. \(\beta \) then starts climbing and goes down again, indicating that the attention first becomes more concentrated and then turns sparse to fit the demands across layers. For instance, the front layer prefers a higher filtering threshold to reduce noise, while the other layers prefer a smaller threshold to retain more semantic context.

Fig. 7. Updates of the learnable parameters when trained on Deit-small; the layer number l is shown below. (a) \(\lambda \) controls the diffusion scale, and a lower \(\lambda \) means a wider diffusion. (b) \(\beta \) determines the threshold below which the activation maps are filtered. (c) Evaluation of GT-Known and Cls Acc (Top-1) for different numbers of ADBs. \(\gamma \) (in percentage) denotes the threshold above which the bounding box is predicted from the score maps.

Stacking ADBs. We further investigate the effect of module scale by stacking different numbers of ADBs. As shown in Fig. 7(c), the trends of GT-Known and the optimal threshold almost follow a bell curve. This indicates that setting a suitable scale for SCM is essential: when SCM becomes too deep, it fails to classify and localize objects precisely. On the other hand, the classification accuracy drops as the number of ADBs increases, while the localization performance first increases and then drops. This tells us that classification and localization are two different tasks, and we cannot obtain the optimum for both simultaneously.

Adapting SCM to More Situations. To evaluate SCM with other Transformers, we select ViT [24] and Conformer [8] to test SCM. Next, we compare SCM at various model scales on Deit. As shown in Fig. 5(d), we record the localization performance along with the epoch at which the best model is saved. It turns out that SCM is successfully adapted to ViT and Conformer, achieving satisfactory GT-Known of 91.8\(\%\) and 96.1\(\%\) on CUB-200-2011, respectively. On the other hand, we test SCM on Deit at different scales. Surprisingly, the larger models do not perform as well as Deit-small. It turns out that increasing the model size may not be optimal for SCM, and the decreased optimal epoch number indicates that larger models may need a lower learning rate during training for better results.

Discussions. Our study presents a novel way to calibrate the Transformer for WSOL. Although we demonstrate its adaptability to ViT [24] and Conformer [8], we cannot calibrate Transformers without a CLS token, such as Swin [12], since the CLS token is required to obtain \(\boldsymbol{F}^0\). Furthermore, the choice of the number of iterations in Eq. (6) is heuristic, and we simplify it to a constant. Future research may explore methods such as deep reinforcement learning to search the parameter space for the optimal diffusion policy. Furthermore, the equilibrium status of Eq. (4) is a patch-wise correlation like the self-attention matrix, which may indicate a new way to find regions of interest by diffusion.

5 Conclusions

We proposed a simple external spatial calibration module (SCM) to refine the attention and semantic representations of Vision Transformers for weakly supervised object localization (WSOL). SCM exploits the spatial and semantic coherence in images and calibrates Transformers to address the issue of partial activation. To dynamically incorporate the semantic similarities and local spatial relationships of patch tokens, we propose a unified diffusion model that captures sharper object boundaries and inhibits irrelevant background activation. SCM is removed during the inference phase, and we use the Transformer's calibrated attention and semantic representations to predict localization results. Experiments on the CUB-200-2011 and ILSVRC2012 datasets show that SCM effectively covers the full objects and significantly outperforms its counterpart TS-CAM. As the first external calibration module for Transformers in WSOL, we hope SCM can shed light on refining Transformers for more challenging WSOL scenarios.