1 Introduction

Deep learning has become a powerful tool for a wide range of pattern recognition applications, including classification, detection, segmentation, and control problems. Due to its data-driven nature and the availability of large-scale parallel computing, deep neural networks achieve state-of-the-art results in most areas. Researchers have made many efforts to boost performance in various ways, such as designing optimizers (Zeiler 2012; Kingma and Ba 2014), proposing adversarial training schemes (Goodfellow et al. 2014), or building task-specific meta-architectures such as the two-stage architectures for object detection (Ren et al. 2015).

Fig. 1

BAM integrated with a general CNN architecture. As illustrated, BAM is placed at every bottleneck of the network. Interestingly, we observe that sequential BAMs construct hierarchical attention maps, similar to the human perception procedure. BAM denoises low-level features, such as background texture features, at the early stage, and then gradually focuses on the exact target, which is a high-level semantic. More visualizations and analysis are included in Figs. 5 and 6

However, the most fundamental approach to boosting performance is to design a good backbone architecture. Since the very first large-scale deep neural network, AlexNet (Krizhevsky et al. 2012), various backbone architectures such as VGGNet (Simonyan and Zisserman 2015), GoogLeNet (Szegedy et al. 2015), ResNet (He et al. 2016b), and DenseNet (Huang et al. 2017) have been proposed. Each makes its own design choices and has shown significant performance gains over its predecessors.

The most intuitive way to boost network performance is to stack more layers. Deep neural networks can then approximate high-dimensional functions with their many layers. The philosophy of VGGNet (Simonyan and Zisserman 2015) and ResNet (He et al. 2016a) precisely follows this. Compared to AlexNet, VGGNet has twice as many layers. Furthermore, ResNet has 22x more layers than VGGNet, with improved gradient flow obtained by adopting residual connections. GoogLeNet (Szegedy et al. 2015), which is also very deep, concatenates features computed with various filter sizes at each convolutional block. Using diverse features at the same layer yields a more powerful representation and improved performance. DenseNet (Huang et al. 2017) also concatenates diverse feature maps, but the features come from different layers; that is, outputs of convolutional layers are iteratively concatenated with the input feature maps. WideResNet (Zagoruyko and Komodakis 2016) shows that using more channels, i.e., wider convolutions, can achieve higher performance than naively deepening the networks. Similarly, PyramidNet (Han et al. 2017) shows that increasing the number of channels in deeper layers can effectively boost performance. Recent approaches with grouped convolutions, such as ResNeXt (Xie et al. 2017) or Xception (Chollet 2017), show state-of-the-art performance as backbone architectures. The success of ResNeXt and Xception comes from convolutions with higher cardinality, which improve accuracy while remaining efficient. Besides, a practical line of research seeks mobile-oriented, computationally efficient architectures. MobileNet (Howard et al. 2017), sharing a similar philosophy with ResNeXt and Xception, uses depthwise convolutions with high cardinality.

Apart from these previous approaches, we investigate the effect of attention in DNNs and propose a simple, light-weight module for general DNNs; that is, the proposed module is designed for easy integration with existing CNN architectures. Attention mechanisms in deep neural networks have been investigated in many previous works (Mnih et al. 2014; Ba et al. 2015; Bahdanau et al. 2014; Xu et al. 2015; Gregor et al. 2015; Jaderberg et al. 2015a). While most previous works use attention for task-specific purposes, we explicitly investigate the use of attention as a way to improve a network’s representational power in an extremely efficient way. As a result, we propose the “Bottleneck Attention Module” (BAM), a simple and efficient attention module that can be used in any CNN. Given a 3D feature map, BAM produces a 3D attention map to emphasize important elements. In BAM, we decompose the process of inferring a 3D attention map into two streams (Fig. 2), so that the computational and parametric overheads are significantly reduced. As the channels of feature maps can be regarded as feature detectors, the two branches (channel and spatial) explicitly learn ‘what’ and ‘where’ to focus on.

We test the efficacy of BAM with various baseline architectures on various tasks. On the CIFAR-100 and ImageNet classification tasks, we observe performance improvements over the baseline networks by placing BAM. Interestingly, we observe that multiple BAMs located at different bottlenecks build hierarchical attention, as shown in Fig. 1. We validate the performance improvement for object detection on the VOC 2007 and MS COCO datasets. We further apply BAM to pixel-level prediction tasks, namely super resolution and scene parsing, and show consistent performance improvements over the baselines, demonstrating the wide applicability of BAM. Since we have carefully designed our module to be light-weight, the parameter and computational overheads are negligible.

In short, we investigate the effect of attention with the proposed module BAM. BAM is a simple, self-contained module that can be inserted into any feed-forward convolutional neural network without bells and whistles. We extensively validate several design choices via ablation studies and demonstrate the effectiveness of BAM on various vision tasks, including classification, detection, segmentation, and super resolution. Moreover, we analyze and explain the difference between the baseline and the BAM-integrated network in terms of the class-selectivity index (Morcos et al. 2018). Finally, we analyze the effect of attention with visualizations.

2 Related Works

A number of studies (Itti et al. 1998; Rensink 2000; Corbetta and Shulman 2002) have shown that attention plays an important role in human perception. For example, the resolution at the foveal center of the human eye is higher than that of the surrounding areas (Hirsch and Curcio 1989). In order to efficiently and adaptively process visual information, the human visual system iteratively processes spatial glimpses and focuses on salient areas (Larochelle and Hinton 2010).

2.1 Cross-modal attention Attention is a widely used technique in multi-modal settings, especially when certain modalities should be conditioned on the others. Visual question answering (VQA) is a well-known example of such a task. Given an image and a natural-language question, the goal is to predict an answer, such as counting objects or inferring the position or attributes of the targets. The VQA task can be seen as a set of dynamically changing tasks in which the provided image should be processed according to the given question. The attention mechanism softly selects the task (question)-relevant aspects of the image features. As suggested in Yang et al. (2016), attention maps for the image features are produced from the given question and act as queries to retrieve question-relevant features; the final answer is then classified with the stacked image features. Another way of doing this is bi-directional inference, producing attention maps for both text and images, as suggested in Nam et al. (2017). In this line of work, attention maps are an effective way to solve tasks in a conditional fashion, but they are trained in separate stages for task-specific purposes.

2.2 Self-attention There have been various approaches to integrate attention into DNNs, jointly training the feature extraction and attention generation in an end-to-end manner. A few attempts (Wang et al. 2017; Hu et al. 2018a, b) have been made to consider attention as an effective solution for the general classification task. Wang et al. proposed Residual Attention Networks, which use an hour-glass module to generate 3D attention maps for intermediate features. Although the architecture is resistant to noisy labels thanks to the generated attention maps, the computational/parameter overhead is large because of the heavy 3D map generation process. Hu et al. proposed a compact ‘Squeeze-and-Excitation’ module to exploit inter-channel relationships. Although it is not explicitly stated in the paper, it can be regarded as an attention mechanism applied along the channel axis. Recently, the Gather-Excite framework by Hu et al. (2018a) further improved this approach by replacing the global average pooling with a depth-wise convolution, enhancing the gathering operation in the attention module. However, the method still misses the spatial axis, which is also an important factor in inferring an accurate attention map.

SCA-CNN (Chen et al. 2017b) and HANet (Li et al. 2018) have shown that using both spatial and channel attention is effective for image captioning and person re-identification, respectively. Here, we carefully design a module that outputs both spatial and channel attention maps for image classification. Our method greatly reduces the heavy computation of 3D attention map inference (Wang et al. 2017) and improves the baseline significantly. We also investigate the effective point at which to place the module, namely right before the pooling occurs (see Fig. 1). The recently proposed CBAM (Woo et al. 2018b) is an extended version of BAM. It improves upon BAM in module design and placement (i.e., the modules are placed at every convolution block), but introduces much more parameter overhead than BAM.

2.3 Adaptive modules Several previous works use adaptive modules that dynamically change their output according to their input. Dynamic Filter Network (Jia et al. 2016) generates convolution filters conditioned on the input features for flexibility. Spatial Transformer Network (Jaderberg et al. 2015b) adaptively generates the parameters of an affine transformation from the input features so that the feature maps of the target area end up well aligned; this can be seen as a hard attention over the feature maps. Deformable Convolutional Network (Dai et al. 2017) uses deformable convolutions, where sampling offsets are dynamically generated from the input features so that only the relevant features are sampled for the convolutions. Similar to the above approaches, BAM is a self-contained adaptive module that dynamically suppresses or emphasizes feature maps through an attention mechanism.

In this work, we exploit both channel and spatial axes of attention with a simple and light-weight design. Furthermore, we find an efficient location to put our module—bottleneck of the network.

3 Bottleneck Attention Module

We design a module that learns spatial (where) and channel-wise (what) attention separately. The intuition behind the factorization is that those two attentions have different properties. Thus, separation can make them focus on their own objectives more clearly.

It is well known that each channel of the feature maps corresponds to a certain visual pattern (Simon and Rodner 2015; Zhang et al. 2016). Therefore, estimating and applying channel-wise attention can be viewed as a process of picking out the semantic attributes needed for the target task. The spatial attention, on the other hand, attempts to select the important spatial locations rather than considering each image region equally; it can thus be seen as a form of clutter removal, quite different from the channel attention. We therefore expect that using these two complementary attentions in combination is crucial for many classification tasks, and we empirically confirm that the combination provides the best result in Table 1b.

We implement the attention map generation of each branch to be highly efficient. For the channel attention, we squeeze the spatial axis using global average pooling and then regress the channel attention with two fully connected layers. For the spatial attention, we gradually reduce the channel dimension so that it becomes 1 in the final layer. Here, we adopt atrous (dilated) convolutions to enlarge the receptive field and effectively determine the spatially important parts.

The overall structure of BAM is illustrated in Fig. 2. For the given input feature map \(\mathbf {F}\in \mathbb {R}^{C\times H\times W}\), BAM infers a 3D attention map \(\mathbf {M}(\mathbf {F})\in \mathbb {R}^{C\times H\times W}\). The refined feature map \(\mathbf {F}'\) is computed as:

$$\begin{aligned} \mathbf {F}'=\mathbf {F}+ \mathbf {F} \otimes \mathbf {M}(\mathbf {F}), \end{aligned}$$
(1)

where \(\otimes \) denotes element-wise multiplication. We adopt a residual learning scheme along with the attention mechanism to facilitate the gradient flow. To design an efficient yet powerful module, we first compute the channel attention \(\mathbf {M_c}(\mathbf {F})\in \mathbb {R}^{C}\) and the spatial attention \(\mathbf {M_s}(\mathbf {F})\in \mathbb {R}^{H\times W}\) at two separate branches, then compute the attention map \(\mathbf {M}(\mathbf {F})\) as:

$$\begin{aligned} \mathbf {M}(\mathbf {F})=\sigma (\mathbf {M_c}(\mathbf {F})+\mathbf {M_s}(\mathbf {F})), \end{aligned}$$
(2)

where \(\sigma \) is a sigmoid function. Both branch outputs are resized to \(\mathbb {R}^{C\times H\times W}\) before addition.

Fig. 2

Detailed module architecture. Given the intermediate feature map F, the module computes the corresponding attention map M(F) through two separate attention branches: channel \(\mathbf {M}_\mathbf {c}\) and spatial \(\mathbf {M}_\mathbf {s}\). The two intermediate tensors from the channel and spatial branches are broadcast to match the final tensor shape. The module has two hyper-parameters: the dilation value (d) and the reduction ratio (r). The dilation value determines the size of the receptive field, which is helpful for contextual information aggregation in the spatial branch. The reduction ratio controls the capacity and overhead of both attention branches. Through experimental validation (see Sect. 5.1), we set {d = 4, r = 16}

Channel attention branch As each channel contains a specific feature response, we exploit the inter-channel relationship in the channel branch. To aggregate the feature map in each channel, we apply global average pooling to the feature map \(\mathbf {F}\) and produce a channel vector \(\mathbf {F_c}\in \mathbb {R}^{C}\). This vector softly encodes global information in each channel. To estimate attention across channels from the channel vector \(\mathbf {F_c}\), we use a multi-layer perceptron (MLP) with one hidden layer. To limit the parameter overhead, the hidden activation size is set to \(\mathbb {R}^{C/r}\), where \(r\) is the reduction ratio. After the MLP, we add a batch normalization (BN) layer (Ioffe and Szegedy 2015) to adjust the scale to match the spatial branch output. In short, the channel attention is computed as:

$$\begin{aligned} \mathbf {M_c}(\mathbf {F})&=BN(MLP(AvgPool(\mathbf {F}))) \\&=BN(\mathbf {W_1}(\mathbf {W_0}AvgPool(\mathbf {F})+\mathbf {b_0})+\mathbf {b_1}), \end{aligned}$$
(3)

where \(\mathbf {W_0}\in \mathbb {R}^{C/r\times C}\), \(\mathbf {b_0}\in \mathbb {R}^{C/r}\), \(\mathbf {W_1}\in \mathbb {R}^{C\times C/r}\), \(\mathbf {b_1}\in \mathbb {R}^{C}\).
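
As a concrete illustration, a minimal PyTorch sketch of the channel branch in Eq. 3 is given below. The class name and the non-linearity between the two linear layers are our assumptions (Eq. 3 only specifies the weights, biases, and BN); this is a sketch, not the authors' released implementation.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel branch of Eq. (3): global average pooling -> MLP -> BN."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W_0, b_0
            nn.ReLU(inplace=True),                       # assumed hidden non-linearity
            nn.Linear(channels // reduction, channels),  # W_1, b_1
        )
        self.bn = nn.BatchNorm1d(channels)               # scale adjustment

    def forward(self, x):                                # x: (B, C, H, W)
        squeezed = x.mean(dim=(2, 3))                    # AvgPool(F) -> (B, C)
        att = self.bn(self.mlp(squeezed))                # M_c(F) in R^C
        return att.view(x.size(0), -1, 1, 1)             # reshape for broadcasting
```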

Spatial attention branch The spatial branch produces a spatial attention map \(\mathbf {M_s}(\mathbf {F})\in \mathbb {R}^{H\times W}\) to emphasize or suppress features at different spatial locations. It is widely known (Yu and Koltun 2016; Long et al. 2015; Bell et al. 2016; Hariharan et al. 2015) that utilizing contextual information is crucial for deciding which spatial locations should be focused on, and a large receptive field is important for leveraging contextual information effectively. We employ the dilated convolution (Yu and Koltun 2016) to enlarge the receptive field with high efficiency. We observe that the dilated convolution facilitates constructing a more effective spatial map than the standard convolution (see Sect. 5.1). The “bottleneck structure” suggested by ResNet (He et al. 2016a) is adopted in our spatial branch, which saves both parameters and computation. Specifically, the feature \(\mathbf {F}\in \mathbb {R}^{C\times H\times W}\) is projected into a reduced dimension \(\mathbb {R}^{C/r\times H\times W}\) using a 1 \(\times \) 1 convolution to integrate and compress the feature map across the channel dimension. We use the same reduction ratio \(r\) as the channel branch for simplicity. After the reduction, two 3 \(\times \) 3 dilated convolutions are applied to utilize contextual information effectively. Finally, the features are reduced to a \(\mathbb {R}^{1\times H\times W}\) spatial attention map using a 1 \(\times \) 1 convolution. For scale adjustment, a batch normalization layer is applied at the end of the spatial branch. In short, the spatial attention is computed as:

$$\begin{aligned} \mathbf {M_s}(\mathbf {F})&=BN(f_{3}^{1\times 1}(f_{2}^{3\times 3}(f_{1}^{3\times 3}(f_{0}^{1\times 1}(\mathbf {F}))))), \end{aligned}$$
(4)

where \(f\) denotes a convolution operation, \(BN\) denotes a batch normalization operation, and the superscripts denote the convolutional filter sizes. There are two 1 \(\times \) 1 convolutions for channel reduction, and the intermediate 3 \(\times \) 3 dilated convolutions are applied to aggregate contextual information with a larger receptive field.
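
A corresponding sketch of the spatial branch in Eq. 4 follows; the class name and the ReLUs between convolutions are assumptions, and the padding is set equal to the dilation value so that the spatial size is preserved.

```python
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial branch of Eq. (4): 1x1 reduce -> two dilated 3x3 -> 1x1 -> BN."""

    def __init__(self, channels, reduction=16, dilation=4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),         # f_0: channel reduction
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3,
                      padding=dilation, dilation=dilation),  # f_1: dilated 3x3
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3,
                      padding=dilation, dilation=dilation),  # f_2: dilated 3x3
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, kernel_size=1),                # f_3: project to one channel
            nn.BatchNorm2d(1),                               # scale adjustment
        )

    def forward(self, x):      # x: (B, C, H, W)
        return self.body(x)    # M_s(F): (B, 1, H, W)
```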

Combining the two attention branches After acquiring the channel attention \(\mathbf {M_c}(\mathbf {F})\) and the spatial attention \(\mathbf {M_s}(\mathbf {F})\) from the two attention branches, we combine them to produce our final 3D attention map \(\mathbf {M}(\mathbf {F})\). Since the two attention maps have different shapes, we expand them to \(\mathbb {R}^{C\times H\times W}\) before combining them. Among various combining methods, such as element-wise summation, multiplication, or a max operation, we choose element-wise summation for efficient gradient flow (He et al. 2016a), and we empirically verify that it yields the best performance among the three options (see Sect. 5). After the summation, we apply a sigmoid function to obtain the final 3D attention map \(\mathbf {M}(\mathbf {F})\) in the range from 0 to 1. This 3D attention map is element-wise multiplied with the input feature map \(\mathbf {F}\) and then added to the original input feature map to obtain the refined feature map \(\mathbf {F'}\), as in Eq. 1.
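
Putting the two branch sketches above together, the combination of Eq. 2 and the residual refinement of Eq. 1 can be written as follows; tensor broadcasting handles the shape expansion.

```python
import torch
import torch.nn as nn

class BAM(nn.Module):
    """Bottleneck Attention Module: Eq. (2) for the attention map, Eq. (1) for refinement."""

    def __init__(self, channels, reduction=16, dilation=4):
        super().__init__()
        self.channel_att = ChannelAttention(channels, reduction)
        self.spatial_att = SpatialAttention(channels, reduction, dilation)

    def forward(self, x):
        # (B, C, 1, 1) + (B, 1, H, W) broadcasts to (B, C, H, W); sigmoid maps to (0, 1).
        att = torch.sigmoid(self.channel_att(x) + self.spatial_att(x))
        return x + x * att   # F' = F + F (*) M(F)
```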

Module placement As BAM is a self-contained module, it can be placed at any point of the network. Through ablation experiments in Table 2, we empirically found that the best location for BAM is the bottlenecks (i.e. right before spatial pooling).
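
For instance, with the BAM sketch above, placing three modules at the bottlenecks of a torchvision ResNet-50 could look like the following. This is a hypothetical integration for illustration, not the authors' training code.

```python
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50(num_classes=1000)

# Append BAM after the first three stages, i.e., right before the stride-2
# downsampling performed at the start of the next stage. The channel widths
# follow the ResNet-50 stage outputs; only three modules are added in total.
model.layer1 = nn.Sequential(model.layer1, BAM(256))
model.layer2 = nn.Sequential(model.layer2, BAM(512))
model.layer3 = nn.Sequential(model.layer3, BAM(1024))
```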

4 Benefits of Using Self-attention

The two main advantages of using a self-attention mechanism in a CNN are: (1) efficient global context modeling, and (2) effective back-propagation (i.e., model training).

The global context allows the model to better recognize patterns that would be locally ambiguous and to attend to important parts. Capturing and utilizing the global context is therefore crucial for various vision tasks. To this end, CNN models typically stack many convolution layers or use pooling operations to ensure that the features have a large receptive field. Although doing so eventually equips the model with a global view, there are several drawbacks. First, naively stacking convolution layers significantly increases the space (i.e., parameter) and time (i.e., computational) complexities. Second, the features at lower layers still have limited receptive fields. Our proposed BAM alleviates these issues nicely. Specifically, a small meta-network (or module) is designed to refine the input feature map based on its global feature statistics. The module is placed at the bottlenecks of the model, allowing lower-layer features to benefit from contextual information. The overall procedure operates in a highly efficient manner thanks to the light-weight module design. We empirically verify that using BAM is more effective than simply deepening the models (i.e., using more convolutions), as shown in Table 1c.

Moreover, our method eases model optimization. In particular, the predicted attention map modulates the training signal (i.e., gradients) to focus on more important regions (Wang et al. 2017). We formulate the attention process as follows:

$$\begin{aligned} \mathbf {F'}=(1+\mathbf {M}(\mathbf {F}))\mathbf {F}(x,\phi ), \end{aligned}$$
(5)

where \(\phi \) denotes the parameters of the feature extractor. Treating the attention map as a modulating factor, the gradient of the attended term can be computed as:

$$\begin{aligned} \frac{\partial \left[ \mathbf {M}(\mathbf {F})\mathbf {F}(x,\phi )\right] }{\partial \phi }&=\mathbf {M}(\mathbf {F})\frac{\partial \mathbf {F}(x,\phi )}{\partial \phi } \end{aligned}$$
(6)

The equation indicates that the higher the attention value (i.e., the more important the region), the larger the gradient that flows into it.
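
For completeness, differentiating Eq. 5 under the same assumption (the attention map is treated as a fixed modulating factor, i.e., the \(\partial \mathbf {M}/\partial \phi \) term is omitted as in Eq. 6) gives

$$\begin{aligned} \frac{\partial \mathbf {F'}}{\partial \phi }\approx (1+\mathbf {M}(\mathbf {F}))\frac{\partial \mathbf {F}(x,\phi )}{\partial \phi }, \end{aligned}$$

so highly attended regions receive up to twice the gradient magnitude of suppressed regions, while the identity term keeps the gradient from vanishing even where \(\mathbf {M}(\mathbf {F})\approx 0\).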

5 Experiments

In this section, we empirically verify the design choices of BAM and show its efficacy across architectures and tasks. We conduct extensive experiments on standard benchmarks: CIFAR-100 (Sects. 5.1, 5.2) and ImageNet-1K (Sects. 5.3, 5.4) for image classification; VOC 2007 (Sect. 5.6) and MS COCO (Sect. 5.5) for object detection; Set5 and Set14 (Sect. 5.7) for super resolution; and ADE20K (Sect. 5.8) for scene parsing.

In order to perform better apples-to-apples comparisons, we first reproduce the reported performance of all networks in the PyTorch framework and set them as our baselines (He et al. 2016a; Zagoruyko and Komodakis 2016; Xie et al. 2017; Huang et al. 2017). When training the baseline models (or BAM-integrated models), we follow their training schemes (i.e., hyper-parameter settings) unless otherwise specified. Throughout all experiments, we verify that BAM outperforms all the baselines without bells and whistles, demonstrating the general applicability of BAM across different architectures as well as different tasks.

5.1 Ablation Studies on CIFAR-100

The CIFAR-100 dataset (Krizhevsky and Hinton 2009) consists of 60,000 32 \(\times \) 32 color images drawn from 100 classes. The training and test sets contain 50,000 and 10,000 images respectively. We adopt a standard data augmentation method of random cropping with 4-pixel padding and horizontal flipping for this dataset. For pre-processing, we normalize the data using RGB mean values and standard deviations.
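
A sketch of this pre-processing pipeline with torchvision is shown below; the channel statistics are commonly used CIFAR-100 values and are an assumption, since the paper does not list the exact numbers.

```python
import torchvision.transforms as T

# Commonly used CIFAR-100 channel statistics (assumed, not taken from the paper).
CIFAR100_MEAN = (0.5071, 0.4865, 0.4409)
CIFAR100_STD = (0.2673, 0.2564, 0.2762)

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),               # random cropping with 4-pixel padding
    T.RandomHorizontalFlip(),                  # horizontal flipping
    T.ToTensor(),
    T.Normalize(CIFAR100_MEAN, CIFAR100_STD),  # per-channel normalization
])

test_transform = T.Compose([
    T.ToTensor(),
    T.Normalize(CIFAR100_MEAN, CIFAR100_STD),
])
```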

Table 1 Ablation studies on the structure and hyper-parameters of BAM on the CIFAR-100 benchmark. (a) Experiments to find the optimal values of the two hyper-parameters; (b) experiments to verify the effectiveness of the spatial and channel branches; (c) experiments to compare the effectiveness of BAM against original conv blocks

Dilation value and Reduction ratio In Table 1a, we perform an experiment to determine the two major hyper-parameters of our module, the dilation value and the reduction ratio, based on the ResNet50 architecture. The dilation value determines the size of the receptive field in the spatial attention branch. Table 1a compares four different dilation values. We can clearly see the performance improvement with larger dilation values, though it saturates at a dilation value of 4. This phenomenon can be interpreted in terms of contextual reasoning, which is widely exploited in dense prediction tasks (Yu and Koltun 2016; Long et al. 2015; Bell et al. 2016; Chen et al. 2016; Zhu et al. 2017). Since the sequence of dilated convolutions rapidly expands the receptive field, it enables our module to seamlessly aggregate contextual information. Note that the standard convolution (i.e., a dilation value of 1) produces the lowest accuracy, demonstrating the efficacy of a context prior for inferring the spatial attention map. The reduction ratio is directly related to the number of channels in both attention branches, which enables us to control the capacity and overhead of our module. In Table 1a, we also compare performance with four different reduction ratios. Interestingly, the reduction ratio of 16 achieves the best accuracy, even though reduction ratios of 4 and 8 give higher capacity. We attribute this result to over-fitting, since the training losses converged in both cases. Based on the results in Table 1a, we set the dilation value to 4 and the reduction ratio to 16 in the following experiments.
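
To make the effect of the dilation value concrete, a quick receptive-field calculation for the spatial branch (a 1 \(\times \) 1, two dilated 3 \(\times \) 3, and a 1 \(\times \) 1 convolution, all with stride 1) is given below; the helper function is ours, for illustration only, and the listed dilation values are examples rather than the exact set ablated in Table 1a.

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs; each layer enlarges
    the receptive field by (kernel_size - 1) * dilation.
    """
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation
    return rf

# Spatial branch of BAM for a few example dilation values.
for d in (1, 2, 4):
    rf = receptive_field([(1, 1), (3, d), (3, d), (1, 1)])
    print(f"dilation={d}: receptive field {rf}x{rf}")
# dilation=1: 5x5, dilation=2: 9x9, dilation=4: 17x17
```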

Separate or Combined branches In Table 1b, we conduct an ablation study to validate our design choices in the module. We first remove each branch to verify the effectiveness of utilizing both the channel and spatial attention branches. As shown in Table 1b, although each attention branch alone improves performance over the baseline, we observe a significant performance boost when both branches are used jointly. This shows that combining the channel and spatial branches plays a critical role in inferring the final attention map. In fact, this design resembles the human visual system, which has ‘what’ (channel) and ‘where’ (spatial) pathways, both of which contribute to processing visual information (Larochelle and Hinton 2010; Chen et al. 2017a).

Combining methods We also explore four different combining strategies: maximum-and-sigmoid, product-and-sigmoid, sum-and-sigmoid, and sigmoid-and-sum. Table 1b summarizes their results. We empirically confirm that sum-and-sigmoid achieves the best performance. In terms of information flow, sum-and-sigmoid is an effective way to integrate and preserve the information from the previous layers. In the forward pass, it enables the network to use the information from the two complementary branches, channel and spatial, without losing any information. In the backward pass, the gradient is distributed equally to all of the inputs, leading to efficient training. Product-and-sigmoid, which can assign a large gradient to a small input, makes the network hard to converge, yielding inferior performance. Maximum-and-sigmoid, which routes the gradient only to the higher input, provides a regularization effect to some extent but leads to unstable training, since our module has few parameters. Sigmoid-and-sum still improves over the baseline but is worse than the other combining options. The main difference from sum-and-sigmoid lies in where we place the sigmoid operation: applying the sigmoid to each branch before the element-wise summation restricts each branch output to the range between 0 and 1, which may affect the original feature representation and the gradient updates. Note, however, that all four implementations outperform the baseline. This implies that utilizing both streams is important, while the best combining strategy further boosts the final performance.
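
For reference, the four strategies can be written compactly over the already-broadcast branch outputs; the function names are ours, for illustration only.

```python
import torch

# Four combining strategies from Table 1b, applied to broadcast branch
# outputs mc and ms of shape (B, C, H, W).
def sum_and_sigmoid(mc, ms):          # the option chosen in BAM
    return torch.sigmoid(mc + ms)

def product_and_sigmoid(mc, ms):
    return torch.sigmoid(mc * ms)

def max_and_sigmoid(mc, ms):
    return torch.sigmoid(torch.maximum(mc, ms))

def sigmoid_and_sum(mc, ms):          # attention values now lie in (0, 2)
    return torch.sigmoid(mc) + torch.sigmoid(ms)
```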

Identity connection In the early stage of training, BAM might produce an inaccurate attention map, which could negatively affect both the forward and backward (i.e., back-propagation) information flow. By introducing the residual connection, we alleviate this possibly detrimental initial behavior and ease the overall model training. Note that the attention value now ranges from 1 to 2 instead of 0 to 1; however, the relative importance is still maintained. We empirically verify in Table 1b that the residual connection is indeed effective.

Comparison with placing original conv blocks It is widely known that larger networks with more parameters tend to perform better. Although BAM introduces negligible overhead, it does add some extra layers to the networks. In this experiment, we verify that the significant improvement does not simply come from the increased depth obtained by naively adding extra layers at the bottlenecks. We add auxiliary convolution blocks that have the same topology as their baseline convolution blocks and compare them with BAM in Table 1c. We can clearly see that plugging in BAM not only produces superior performance but also incurs less overhead than naively adding extra layers. This implies that the improvement from BAM is not merely due to increased depth but to effective feature refinement.

Bottleneck: The efficient point to place BAM We empirically verify that the bottlenecks of a network are the effective points at which to place our module. A bottleneck is where feature downsampling occurs, for example, a pooling operation or a convolution with stride larger than 1. Specifically, we place BAM right before the downsampling. Recent studies on attention mechanisms (Hu et al. 2018b; Wang et al. 2017) mainly focus on modifications within the ‘convolution blocks’ rather than at the ‘bottlenecks’. We compare these two locations using various models on CIFAR-100. In the BAM-C (‘convolution blocks’) case, we place BAM in every convolutional block, which incurs much more overhead. In Table 2, we can clearly observe that placing the module at the bottlenecks is effective in terms of the overhead/accuracy trade-off: it incurs much less overhead with better accuracy in most cases, except for PreResNet-110 (He et al. 2016b).

Table 2 Bottleneck versus inside each convolution block
Table 3 Experiments on image classification tasks: CIFAR-100 classification
Table 4 Experiments on image classification tasks: ImageNet 1K classification

5.2 Classification Results on CIFAR-100

In Table 3, we compare the performance on CIFAR-100 after placing BAM at the bottlenecks of state-of-the-art models (He et al. 2016a, b; Zagoruyko and Komodakis 2016; Xie et al. 2017; Huang et al. 2017). Note that, while the ResNet101 and ResNeXt29 16 \(\times \) 64d networks achieve 20.00% and 17.25% error respectively, ResNet50 with BAM and ResNeXt29 8 \(\times \) 64d with BAM achieve 20.00% and 16.71% error respectively, using only half of the parameters. This suggests that BAM can efficiently raise the capacity of networks with fewer parameters. Thanks to our light-weight design, the overall parameter and computational overheads are trivial.

5.3 Classification Results on ImageNet-1K

The ILSVRC 2012 classification dataset (Deng et al. 2009) consists of 1.2 million images for training and 50,000 for validation, with 1000 object classes. We adopt the same data augmentation scheme as (He et al. 2016a, b) for training and apply a single-crop evaluation with a size of 224 \(\times \) 224 at test time. Following (He et al. 2016a, b; Huang et al. 2016), we report classification errors on the validation set. The ImageNet classification benchmark is one of the largest and most complex image classification benchmarks, and we show the effectiveness of BAM on such a general and complex task. We use ResNet (He et al. 2016a), WideResNet (Zagoruyko and Komodakis 2016), and ResNeXt (Xie et al. 2017) as baseline networks for the ImageNet classification task. More details are included in the supplementary material.

As shown in Table 4, the networks with BAM outperform all the baselines once again, demonstrating that BAM generalizes well to various models on a large-scale dataset. Note that the parameter and computational overheads are negligible, which suggests that BAM can significantly enhance network capacity in an efficient manner. Another notable point is that the improvement comes from placing only three modules in the whole network.

Table 5 MS COCO detection results
Table 6 Detailed MS COCO detection results

5.4 Effectiveness of BAM with Compact Networks

The main advantage of our module is that it significantly improves performance while adding trivial model and computational overhead. To demonstrate this advantage in more practical settings, we incorporate our module into compact networks (Howard et al. 2017; Iandola et al. 2016), which have tight resource constraints. Compact networks are designed for mobile and embedded systems, so their design options are subject to computational and parametric limitations.

As shown in Table 4, BAM boosts the accuracy of all the models with little overhead. Since we do not adopt any squeezing operation (Howard et al. 2017; Iandola et al. 2016) in our module, we believe there is further room for improvement in terms of efficiency.

5.5 MS COCO Object Detection

We conduct object detection on the Microsoft COCO dataset (Lin et al. 2014). Following Bell et al. (2016) and Liu et al. (2016), we train our model using all the training images as well as a subset of the validation images, holding out 5000 examples for validation. We adopt Faster-RCNN (Ren et al. 2015) as our detection method and an ImageNet pre-trained ResNet101 (He et al. 2016a) as the baseline network. Here we are interested in improving performance by plugging BAM into the baseline. Because we use the same detection method for both models, the gains can only be attributed to BAM. As shown in Table 5, we observe significant improvements over the baseline, demonstrating the generalization performance of BAM on other recognition tasks.

In Table 6, we compute mAP over different IoU thresholds and COCO object size criteria (Lin et al. 2014). We confirm that the performance enhancement is not limited to a certain threshold but holds overall. Note that the relative improvement, which we define as the accuracy improvement over the baseline performance, is amplified at higher IoU thresholds, demonstrating that the attention module is effective for accurate bounding box prediction. BAM also improves the baseline model across all object sizes rather than only at a specific object size.

Table 7 VOC2007 detection test set results

5.6 VOC 2007 Object Detection

We further evaluate BAM on the PASCAL VOC 2007 detection task. In this experiment, we apply BAM to the detectors. We adopt the StairNet (Woo et al. 2018a) framework, which is one of the strongest multi-scale methods based on SSD (Liu et al. 2016). We place BAM right before every classifier, refining the final features before prediction and forcing the model to adaptively select only the meaningful features. The experimental results are summarized in Table 7. We can clearly see that BAM improves the accuracy of all strong baselines with the two backbone networks. Note that the accuracy improvement of BAM comes with a negligible parameter overhead, indicating that the enhancement is not due to a naive capacity increase but to our effective feature refinement. In addition, the result using the light-weight backbone network (Howard et al. 2017) again shows that BAM can be an interesting method for low-end devices.

Table 8 Super Resolution experiments

5.7 Super Resolution

For classification and detection, CNNs are used to recognize a single target or multiple targets in a given image, respectively. We further explore the applicability of BAM to more challenging pixel-level prediction tasks. We first apply BAM to the super resolution task. We set SRResNet (Ledig et al. 2017) as our baseline model and place one BAM module at every 4th ResBlock to construct the SRResNet + BAM model. We perform experiments on two widely used benchmark datasets: Set5 and Set14. All experiments are performed with a scale factor of 4\(\times \) between the low- and high-resolution images, which corresponds to a total 16\(\times \) reduction in image pixels. For fair comparison, all reported PSNR [dB] and SSIM scores are calculated on the y-channel of center-cropped images, removing a 4-pixel-wide strip from each border. We use the COCO 2017 train set (118k images) to train both the baseline and the BAM-integrated model. We follow the training details given in the original paper (Ledig et al. 2017). We confirm that the PSNR and SSIM scores of the reproduced baseline closely match the reported values.
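
A sketch of the PSNR part of this evaluation protocol is given below; it assumes the standard BT.601 luma conversion and float RGB inputs in [0, 1] (SSIM can be computed analogously on the same cropped y-channels).

```python
import numpy as np

def rgb_to_y(img):
    """Luma channel via the BT.601 conversion; `img` is float RGB in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b   # Y in [16, 235]

def psnr_y(sr, hr, border=4):
    """PSNR [dB] on the y-channel, removing a `border`-pixel strip from each side."""
    sr_y = rgb_to_y(sr)[border:-border, border:-border]
    hr_y = rgb_to_y(hr)[border:-border, border:-border]
    mse = np.mean((sr_y - hr_y) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```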

As we can see in Table 8, BAM improves over the baseline performance in the super resolution task. Note that BAM was proposed and ablated on semantic tasks and is not optimized for pixel-level prediction tasks, yet it still shows improvement over the baseline. We believe moderate changes to the attention module design could further improve the performance; here, we focus on showing that the attention process can be an effective solution for pixel-level inference tasks.

Table 9 ADE20K scene parsing experiments
Fig. 3

Qualitative evaluation on the ADE20K validation set. Several validation examples are shown above. The baseline is ResNet50 (encoder) + UperNet, and ours is ResNet50 + BAM (encoder) + UperNet. We can see that BAM induces the network to capture a finer object extent

5.8 ADE20K Scene Parsing

We now investigate the effectiveness of BAM on the ADE20K scene parsing task (Zhou et al. 2019). We adopt the recent state-of-the-art architecture UperNet (Xiao et al. 2018) and place BAM in the encoder part. We use the official PyTorch code provided by the authors (Zhou et al. 2018), with ResNet50 as the encoder and UperNet as the decoder. We follow the default hyper-parameters (segmentation downsampling of 4, padding of 32).

The experimental results are summarized in Table 9. The results again show that the attention process is effective for pixel-level inference tasks. We also provide qualitative results in Fig. 3. We can see that BAM helps the model capture finer object extents, such as boundary shapes, edges, and small targets. The attention process enables contextual reasoning and provides a strong global cue to resolve local ambiguities.

5.9 Comparison with Squeeze-and-Excitation

We conduct additional experiments to compare our method with SE on the CIFAR-100 classification task. Table 10 summarizes the results, showing that BAM outperforms SE in most cases with fewer parameters. Our module requires slightly more GFLOPs but has far fewer parameters than SE, as we place our module only at the bottlenecks rather than at every conv block.

Table 10 BAM versus SE (Hu et al. 2018b)

6 Analysis on the Effect of BAM

We have shown that BAM can improve the performance of a deep network on various vision tasks. We now provide an in-depth analysis of how a BAM-integrated model (i.e., ResNet50 + BAM) differs from a vanilla baseline model (i.e., ResNet50) in several aspects. We first explore the features of these models using the class-selectivity index proposed by Morcos et al. (2018). Next, we provide visualization results of the attention process for cases where the BAM-integrated model succeeds in classification but the baseline fails. Finally, we investigate the channel and spatial attentions of the BAM-integrated model.

6.1 Class-Selectivity Index

Class-selectivity is a neuroscience-inspired metric proposed by Morcos et al. (2018). For each feature map, the metric computes the normalized difference between the highest class-conditional mean activity and the mean activity of all other classes over a given data distribution. The resulting value varies between zero and one, where zero indicates that the filter produced the same value for every class (i.e., feature re-use) and one indicates that the filter activates only for a single class. We compute the class-selectivity index for the features generated from the two models (i.e., ResNet50 with and without BAM). The distribution of class-selectivity is illustrated in Fig. 4.
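
A sketch of how the index can be computed from class-conditional mean activations is shown below; the small epsilon for numerical stability and the non-negativity assumption (e.g., post-ReLU activations) are ours.

```python
import numpy as np

def class_selectivity(mean_activity, eps=1e-12):
    """Class-selectivity index per unit (Morcos et al. 2018).

    `mean_activity` has shape (num_units, num_classes): the mean activation of
    each feature map, averaged over spatial positions and over all validation
    images of each class (assumed non-negative, e.g., post-ReLU).
    """
    top = mean_activity.max(axis=1)
    # Mean class-conditional activity over all classes except the top one.
    rest = (mean_activity.sum(axis=1) - top) / (mean_activity.shape[1] - 1)
    return (top - rest) / (top + rest + eps)
```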

We observe a common underlying trend in both models: the class-selectivity increases gradually as the stages progress. It is well known that the filters of deep networks tend to extract class-agnostic features at the early stages (i.e., low-level features) while class-specific features are extracted at the last stage. In contrast to the baseline model, at stages 2 and 3 the class-selectivity distributions of the BAM-integrated model are clearly separated from those of the baseline. We conjecture that the attention module encourages feature re-use within the network and prevents the allocation of highly specialized units. As a result, the sub-features of the intermediate stages of the BAM-integrated model show less class selectivity than those of the ResNet50 baseline (see Fig. 4).

Fig. 4

Class-selectivity index plot of ResNet50 and ResNet50+BAM in ImageNet

Fig. 5

Visualizing the attention process of BAM. In order to provide an intuitive understanding of BAM’s role, we visualize image classification process using the images that baseline (ResNet50) fails to classify correctly while the model with BAM succeeds. Using the models trained on ImageNet-1K, we gather all the 3D attention maps from each bottleneck and examine their distribution spatially and channel-wise. We can clearly observe that the module BAM successfully drives the network to focus on the target while the baseline model fails

6.2 Qualitative Results

In Fig. 5, we visualize our attention maps and compare them with the baseline feature maps for a thorough analysis of the accuracy improvement. We compare two models trained on ImageNet-1K: ResNet50 and ResNet50 + BAM. We select three examples that the baseline model fails to classify correctly while the model with BAM succeeds. We gather all the 3D attention maps at the bottlenecks and examine their distributions with respect to the channel and spatial axes respectively. For visualizing the 2D spatial attention maps, we average the attention maps over the channel axis and resize them. All the 2D maps are normalized according to the global statistics at each stage, computed from the whole ImageNet-1K training set. For visualizing the channel attention profiles, we average our attention map over the spatial axis and uniformly sample 200 channels, similarly to Hu et al. (2018b).
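
A sketch of the channel-axis averaging and resizing step used for the 2D visualizations is shown below; the function name and output size are illustrative, and the per-stage normalization against the ImageNet statistics is applied separately afterwards.

```python
import torch.nn.functional as F

def spatial_heatmap(att_3d, out_size=(224, 224)):
    """Collapse a 3D attention map (a torch tensor of shape (C, H, W)) into a 2D heatmap.

    Averages over the channel axis and bilinearly resizes to the input
    resolution so that the map can be overlaid on the image.
    """
    heat = att_3d.mean(dim=0, keepdim=True).unsqueeze(0)   # (1, 1, H, W)
    heat = F.interpolate(heat, size=out_size, mode="bilinear",
                         align_corners=False)
    return heat.squeeze()                                   # (H_out, W_out)
```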

Fig. 6

Successful cases with BAM. The shown examples are the intermediate activations and BAM attention maps when the baseline \(+\) BAM succeeds and the baseline fails. Figure best viewed in color (Color figure online)

As shown in Fig. 5, we can observe that BAM drives the network to focus on the target gradually, while the baseline model shows more scattered feature activations. Note that accurate targeting is important for fine-grained classification, as the incorrect answers of the baseline are reasonable errors. At the first stage, we observe high variance along the channel axis and enhanced 2D feature maps after BAM. Since the theoretical receptive field size at the first bottleneck is 35, compared to the input image size of 224, the features contain only local information of the input. Therefore, the attention maps at this stage act as a local feature denoiser. We can infer that both channel and spatial attention contribute jointly to selectively refine local features, learning what (‘channel’) and where (‘spatial’) to focus on or suppress. The second stage shows an intermediate characteristic of the first and final stages. At the final stage, the module generates binary-like 2D attention maps focusing on the target object. In terms of channels, the attention profile shows a few spikes with low variance. We conjecture that this is because there is enough information about ‘what’ to focus on at this stage. Although noisy, the features before applying the module show high activations around the target, indicating that the network already has a strong clue about what to focus on. By comparing the features of the baseline with those before and after BAM, we verify that BAM accurately focuses on the target object while the baseline features remain scattered. The visualization of the overall attention process demonstrates the efficacy of BAM, which refines the features using the two complementary attentions jointly to focus on more meaningful information. Moreover, the stage-by-stage gradual focusing resembles a hierarchical human perception process (Hubel and Wiesel 1959; Riesenhuber and Poggio 1999; Marr and Vision 1982), suggesting that BAM drives the network to mimic the human visual system effectively.

6.3 Visualization Results

We show more visualization results of the attention process in Fig. 6, drawn from the ImageNet validation set. The listed samples are correctly classified by the BAM-integrated model (ResNet50 + BAM) but incorrectly classified by the baseline ResNet50. The examples are shown with their intermediate features and attention maps (averaged over the channel axis for visualization). Starting from the early stage 1, we can clearly observe that the attention module acts as a feature denoiser, successfully suppressing much of the noise and highlighting visually meaningful content. Figures are best viewed in color.

7 Conclusion

In this work, we propose a simple and light-weight attention module, the Bottleneck Attention Module, to improve the performance of CNNs. BAM is a self-contained module composed of off-the-shelf CNN layers, so it can be easily implemented and added to any CNN architecture. Our module efficiently learns what and where to focus on or suppress through two separate pathways and refines intermediate features effectively. Inspired by the human visual system, we suggest placing the attention module at the bottlenecks of a network, which are the most critical points of information flow, and empirically verify this choice. To show its efficacy, we conducted extensive experiments with various state-of-the-art models and confirmed that BAM outperforms all the baselines on four different types of vision tasks: classification, detection, super resolution, and scene parsing. Moreover, we analyze and visualize how the module acts on the intermediate feature maps to gain a clearer understanding. We believe our findings on adaptive feature refinement at the bottlenecks will be helpful for other vision tasks as well.