
1 Introduction

Attention mechanisms [1] can improve a model’s accuracy by capturing the key information in images, i.e., finding ‘where’ and ‘what’ to focus on. As an effective component of neural networks [2], attention modules have shown good performance in various visual tasks, including image classification [3], object detection [4], semantic segmentation [5] and object tracking [6].

Existing studies introduce two kinds of fundamental attention modules widely used in computer vision: channel and spatial attention modules [7]. These two modules strengthen feature representations by combining the feature maps from all positions with different strategies, and many practical architectures have been proposed in recent years. For channel attention, Hu et al. [8] adaptively recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels, and Li et al. [9] employ a dynamic selection mechanism that enables every unit to adjust its receptive field size according to multiple scales of input representations. For spatial attention, Fu et al. [10] encode a wider range of contextual information into local features, which improves their representative capability. Moreover, researchers have tried to aggregate both attention mechanisms: Woo et al. [7] sequentially infer attention maps along the channel and spatial dimensions, and the attention maps are then multiplied with the input feature maps for adaptive feature refinement. All these methods introduce attention modules that help neural networks learn better feature representations of images.

However, the above attention modules are designed mainly for full-sized networks. When adapted to lightweight models [11], they exhibit several problems. First, networks with only a spatial or a channel attention module, such as SENet [8], ignore the information in the other dimension. They do not make full use of the other dimension’s representations, and lightweight models with few parameters cannot absorb that information well, which leads to poor performance. Second, a complex mixed architecture such as CBAM [7] violates the design principle of lightweight models, which results in poor efficiency. Specifically, CBAM concatenates the mean and max spatial feature maps; after a convolution with a large kernel, the concatenated maps are reduced back to the original size and multiplied with the initial feature map. These complex concatenation and convolution operations contribute to the vanishing gradient problem. Therefore, incorporating the information of both dimensions in a simple and effective architecture would benefit visual tasks.

In view of this, we propose a novel attention module called the Lightweight Attention Module (LAM). For the spatial part, we use element-wise addition to combine the average-pooled and max-pooled feature maps, and a smaller convolutional kernel to extract features. For the channel part, we also add the max-pooled and average-pooled feature maps first, and then use squeeze-and-excitation layers [8] to extract features. Finally, we add the two output feature maps in a parallel arrangement. Overall, our module removes the extensive convolution operations that may cause vanishing gradient problems in previous modules. Meanwhile, we use a parallel instead of the traditional sequential arrangement [7]. As a result, our module efficiently helps information flow into the next layer of a lightweight neural network by learning which positions to emphasize.

The key contributions can be summarized as follows:

  • A novel lightweight attention module called LAM is proposed, which is capable of capturing information by incorporating the features of the channel and spatial dimensions with a parallel arrangement.

  • The superiority of LAM over previous methods is demonstrated on image classification datasets, and in-depth analysis shows the rationality and robustness of the proposed method.

2 Related Work

In this section, we introduce the related works in the area of lightweight neural networks and attention mechanisms separately.

2.1 Lightweight Model

Since AlexNet [12] achieved excellent performance in the 2012 ImageNet competition [13], deep neural networks have attracted renewed research interest. The 2014 ImageNet champion GoogleNet [14] reached 74.8% top-1 accuracy, and SENet [8] won the 2017 ImageNet competition with 82.7% top-1 accuracy. However, such models are too large to be deployed in real-life applications and push against hardware limitations, so researchers began to shrink models by trading a little accuracy for efficiency. With the popularity of smartphones, various efficient lightweight models have emerged, such as ShuffleNet [15, 16], MobileNet [17,18,19], and EfficientNet [20]. Later, neural architecture search (NAS) [21] proved effective for designing lightweight models, outperforming hand-crafted networks by adapting a model’s width, channels, kernels and sizes. While most network design efforts focus on depth, width and cardinality, we focus on another factor, ‘attention’, which draws lessons from the human visual system.

2.2 Attention Module

Attention is one of the most important concepts in deep learning [22, 23], inspired by the human visual system, which cannot process all the information in an image at once [24]. Instead, humans perform a series of partial glimpses and selectively focus on the salient parts to gather more information.

Recently, many researchers have tried to combine channel and spatial attention modules with models for real-world tasks. RAN (Residual Attention Network) [25] uses an encoder-decoder structure to build its attention module; by refining the feature maps, the model achieves high accuracy even on noisy datasets. Instead of processing full 3D attention feature maps, we decompose the procedure into channel and spatial attention that are learned separately. The resulting attention-generating parts have fewer parameters, and the end-to-end design makes LAM a plug-and-play module, which is well suited to existing lightweight deep neural networks.

Close to our work, CBAM uses a mixed channel and spatial module to find the inner relationships among feature maps. In CBAM’s channel part, MLP layers process the globally pooled features to produce channel-wise attention. We find that these linear layers for inferring attention maps can hinder feature extraction in lightweight models, so we replace them with a squeeze-and-excitation module, which performs better in both speed and feature-capturing ability. Similarly, in the spatial part we remove the 7\(\,\times \,\)7 convolutional kernel applied to the concatenated map, which may cause vanishing gradient problems, and instead apply a smaller kernel to the summed map. In our LAM, we employ both channel and spatial attention in a simplified way intended for lightweight networks. The experiments verify that LAM not only improves accuracy but also preserves simplicity.

3 Methodology

In this section, we present LAM, an attention module intended for lightweight neural networks. We first introduce the overall framework of the algorithm, and then the channel attention module, the spatial attention module and the arrangement of the attention modules, respectively.

Fig. 1. Overview of LAM. The module has two parallel parts: channel and spatial. The feature map is refined by our module at every convolutional step of the network.

3.1 Attention Module

LAM is built on a transformation that separately uses a one-dimensional channel attention map \(\mathbf {M}_{c} \in \mathbf {R}^{C \times 1 \times 1}\) and a two-dimensional spatial attention map \(\mathbf {M}_{s} \in \mathbf {R}^{1 \times H \times W}\) to map an input \(\mathbf {X} \in \mathbf {R}^{C \times H \times W}\) to the output feature maps \(\mathbf {U} \in \mathbf {R}^{C \times H \times W}\), as shown in Fig. 1. The computational procedure can be written as:

$$\begin{aligned} \mathbf {X}_{1}=\mathbf {M}_{c}(\mathbf {X}) \otimes \mathbf {X} \end{aligned}$$
(1)
$$\begin{aligned} \mathbf {X}_{2}=\mathbf {M}_{s}(\mathbf {X}) \otimes \mathbf {X} \end{aligned}$$
(2)
$$\begin{aligned} \mathbf {U}=\mathbf {X}_{1} \oplus \mathbf {X}_{2} \end{aligned}$$
(3)

where \(\otimes \) denotes element-wise multiplication and \(\oplus \) denotes element-wise addition. During computation, the attention maps are broadcast along the missing dimensions: the spatial attention map is broadcast along the channel dimension, and the channel attention map along the spatial dimensions. \(\mathbf {U}\) is the final output. The following parts describe the core of each attention module and its computation in detail.
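To make the composition concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(3), assuming the two attention maps have already been computed with the shapes given above; broadcasting handles the missing dimensions. It is an illustrative sketch, not the authors’ released code.

```python
import torch

def lam_combine(x: torch.Tensor, m_c: torch.Tensor, m_s: torch.Tensor) -> torch.Tensor:
    """Combine the parallel branches of LAM.

    x:   input feature map,     shape (N, C, H, W)
    m_c: channel attention map, shape (N, C, 1, 1)
    m_s: spatial attention map, shape (N, 1, H, W)
    """
    x1 = m_c * x       # Eq. (1): M_c broadcast over H and W
    x2 = m_s * x       # Eq. (2): M_s broadcast over channels
    return x1 + x2     # Eq. (3): element-wise sum of the two branches
```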

3.2 Channel Attention Module

In LAM, we use a channel attention unit to exploit the inter-channel information of the feature representations. Since each channel can be regarded as a feature detector, channel attention focuses on ‘what’ is important in an image. To compute the channel attention map efficiently, we process the input features with a squeeze-and-excitation block. In this block, global spatial information is squeezed into a channel descriptor [26] to capture channel dependencies; the squeeze step uses average pooling to obtain spatial statistics. An excitation step then fully captures channel-wise dependencies: a simple gating mechanism [27] consisting of a dimensionality-reduction layer and a dimensionality-increasing layer that returns to the channel dimension of the transformation, followed by a sigmoid activation.

In CBAM, the authors propose that max pooling gathers another important clue about distinctive objects and thus yields better channel-wise attention, so they apply average and max pooling to the feature maps simultaneously, which markedly improves the effectiveness of models.

Different from these works, we argue that although the two pooling operations do improve the ability to capture key features, the linear layers in the MLP complicate the simple architecture of lightweight neural networks and bring extra computation. We therefore remove the linear and activation layers and use convolutional layers to produce the attention weights that are multiplied with the feature maps. The detailed operation is described below (Fig. 2).

Fig. 2. Diagram of each attention sub-module. The channel sub-module processes both the max-pooled and average-pooled outputs with a squeeze-and-excitation architecture. The spatial sub-module sums the two features pooled along the channel axis and forwards them to a convolutional layer.

We first aggregate spatial information by applying both average and max pooling, generating two channel descriptors: \(\mathbf {X}^c_{avg}\) and \(\mathbf {X}^c_{max}\), which represent the average-pooled and max-pooled features over the spatial dimensions, respectively. These two descriptors are then forwarded through a squeeze-and-excitation module to generate the channel attention map \(\mathbf {M}_{c} \in \mathbf {R}^{C \times 1 \times 1}\). The squeeze-and-excitation module consists of a dimensionality-reduction and a dimensionality-increasing convolutional layer to reduce the parameter overhead. After the squeeze-and-excitation layers are applied to both descriptors, we merge the output vectors with element-wise addition. The channel attention module can be summarized as follows:

$$\begin{aligned} \begin{aligned} \mathbf {M}_{c}(\mathbf {X})&=\sigma (SE(AvgPool(\mathbf {X})) + SE(MaxPool(\mathbf {X}))) \\&=\sigma (\mathbf {W}_{1}(\mathbf {W}_{0}(\mathbf {X}^c_{avg})) + \mathbf {W}_{1}(\mathbf {W}_{0}(\mathbf {X}^c_{max}))) \end{aligned} \end{aligned}$$
(4)

where \(\sigma \) denotes the sigmoid function, \(\mathbf {W}_{0} \in \mathbf {R}^{{C}/{r}\times {C}}\) and \(\mathbf {W}_{1} \in \mathbf {R}^{{C}\times {C}/{r}}\). The squeeze-and-excitation weights \(\mathbf {W}_{0}\) and \(\mathbf {W}_{1}\) are shared for both input descriptors, and a ReLU activation follows \(\mathbf {W}_{0}\).
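A possible PyTorch realization of Eq. (4) is sketched below. The reduction ratio r = 16 and the use of 1\(\,\times \,\)1 convolutions for \(\mathbf {W}_{0}\) and \(\mathbf {W}_{1}\) are our assumptions rather than values stated in the paper; the two pooled descriptors share the same squeeze-and-excitation weights, as noted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel branch of LAM (Eq. 4): shared SE-style layers applied to the
    average- and max-pooled descriptors, summed, then passed through a sigmoid."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # W0 (reduce) + ReLU + W1 (expand), implemented as 1x1 convolutions,
        # so no linear (MLP) layers are needed.
        self.se = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.se(F.adaptive_avg_pool2d(x, 1))  # X^c_avg -> (N, C, 1, 1)
        mx = self.se(F.adaptive_max_pool2d(x, 1))   # X^c_max -> (N, C, 1, 1)
        return torch.sigmoid(avg + mx)              # M_c(X)
```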

3.3 Spatial Attention Module

In LAM, we obtain a spatial attention map by exploiting the inner spatial relationships of the feature maps. Unlike the channel attention module, the spatial attention module focuses on ‘where’ the informative parts of an image are, which is complementary to channel attention. We again apply average-pooling and max-pooling operations, this time along the channel axis, and sum the results to obtain an efficient feature descriptor. The summed descriptor is then fed into a convolutional layer to produce the spatial attention map \(\mathbf {M}_{s}(\mathbf {X}) \in \mathbf {R}^{1 \times H\times W}\), which encodes where to emphasize. We describe the detailed operation below.

We first aggregate channel information by applying both max pooling and average pooling along the channel axis, generating two 2D feature maps: \(\mathbf {X}^s_{avg} \in \mathbf {R}^{1 \times H\times W}\) and \(\mathbf {X}^s_{max} \in \mathbf {R}^{1 \times H\times W}\), which represent the average-pooled and max-pooled features across the channel dimension, respectively. The two maps are merged with element-wise addition and convolved by a standard convolutional layer to produce the final 2D attention map. The spatial attention module can be summarized as follows:

$$\begin{aligned} \begin{aligned} \mathbf {M}_{s}(\mathbf {X})&=\sigma (f^{3 \times 3}(AvgPool(\mathbf {X}) \oplus MaxPool(\mathbf {X}))) \\&=\sigma (f^{3 \times 3}(\mathbf {X}^s_{avg} \oplus \mathbf {X}^s_{max})) \end{aligned} \end{aligned}$$
(5)

where \(\sigma \) denotes the sigmoid function and \(f^{3 \times 3}\) denotes a convolutional layer with a \(3 \times 3\) kernel.
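A matching sketch of Eq. (5) is given below; the padding that keeps the attention map at the input’s spatial size is an implementation detail we assume rather than one stated in the text.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial branch of LAM (Eq. 5): channel-wise average and max maps are
    summed, convolved with a small kernel, then passed through a sigmoid."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)            # X^s_avg -> (N, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)             # X^s_max -> (N, 1, H, W)
        return torch.sigmoid(self.conv(avg + mx))    # M_s(X)
```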

3.4 Arrangement of Attention Modules

The two attention modules attend to the channel and spatial dimensions separately and compute complementary attention. Unlike CBAM, we adopt a parallel arrangement: a sequential arrangement, whether channel-first or spatial-first, hurts the lightweight design. We discuss the experimental results in the next section (Fig. 3).

Fig. 3. LAM integrated with a basic block of a lightweight model. The figure shows the exact position of our module when integrated within a depthwise separable block [17]; we apply LAM to the convolution outputs in each block.
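As a rough illustration of Fig. 3 and the arrangement discussed in Sect. 4.3, the sketch below reuses the ChannelAttention and SpatialAttention sketches above and places LAM after the depthwise convolution of a depthwise separable block; the exact block layout (batch normalization, activations) is our assumption, not a specification from the paper.

```python
import torch
import torch.nn as nn

class LAM(nn.Module):
    """Parallel combination of the two branches (Eqs. 1-3)."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)  # sketch from Sect. 3.2
        self.spatial = SpatialAttention()          # sketch from Sect. 3.3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.channel(x) * x + self.spatial(x) * x


class DepthwiseSeparableBlockWithLAM(nn.Module):
    """Depthwise separable block [17] with LAM refining the depthwise output."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
        )
        self.lam = LAM(in_ch)
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.lam(self.depthwise(x)))
```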

4 Experiment

In this section, we first introduce the experiment settings, including datasets, baselines and evaluation methods. Then we evaluate the effectiveness of the LAM in image classification tasks and ablation studies. In general, we seek to answer the following questions:

  • Q1: Does the LAM have better performances than other attention modules when adapted to lightweight neural networks?

  • Q2: Is the current arrangement most suitable for the LAM?

  • Q3: Is the proposed LAM model sensitive to the main parameters, e.g., the learning rate and kernel size?

4.1 Datasets and Experimental Setting

We conduct experiments on two standard image classification datasets [28]: Cifar10 and Cifar100 [29]. Each dataset comprises 50k training and 10k test 32\(\,\times \,\)32 RGB real-world images, labelled with 10 and 100 classes respectively.

Besides, all experiments use the same data augmentations [30, 31] and parameter settings. The input images are randomly flipped horizontally and zero-padded with 4 pixels on each side before a random 32\(\,\times \,\)32 crop. We also normalize the images by the per-channel mean and standard deviation. We use Top-1 accuracy to compare our model with the baselines on the image classification tasks, and use the authors’ released code for the baseline models.
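A minimal torchvision sketch of this augmentation pipeline is shown below; the normalization statistics are the commonly used Cifar10 values, not numbers reported in the paper (Cifar100 would use its own mean and standard deviation).

```python
import torchvision.transforms as T

# Commonly used Cifar10 per-channel statistics (assumed, not from the paper).
MEAN, STD = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

train_transform = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomCrop(32, padding=4),  # zero-pad 4 pixels per side, then random 32x32 crop
    T.ToTensor(),
    T.Normalize(MEAN, STD),
])

test_transform = T.Compose([
    T.ToTensor(),
    T.Normalize(MEAN, STD),
])
```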

4.1.1 Baselines and Metrics

Here we present existing lightweight neural networks used as baselines, which demonstrate that LAM fits lightweight models well.

  • MobileNetV1 [17]: This model uses depthwise separable convolutions to decrease computational complexity.

  • MobileNetV2 [18]: It is based on an inverted residual structure with linear bottlenecks, using lightweight depthwise convolutions to filter features as a source of non-linearity.

  • MobileNetV3 [19]: It uses a combination of complementary search techniques as well as novel architecture advances.

  • ShuffleNetV1 [15]: It utilizes pointwise group convolution and channel shuffle operations to greatly reduce computation cost.

  • ShuffleNetV2 [16]: It uses an additional convolutional layer right before global averaged pooling to mix up features.

  • EfficientNet [20]: It uses neural architecture search to design a new baseline network and scale it up.

Here we present some attention modules as baselines, which can prove that the LAM has better performances than other attention modules when applied to the lightweight models.

  • Squeeze-and-excitation module [8]: It adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.

  • Convolutional block attention module [7]: It emphasizes meaningful features along those two principal dimensions: channel and spatial axes.

Table 1. Accuracy of image classification in lightweight neural networks

4.2 Results and Analysis (Q1)

To address the first question (Q1), we conduct experiments to measure LAM’s quality and compare it with the baseline methods. We train the models on the training set and test them on the validation set. We implement the models in PyTorch and use the Adam optimizer (learning rate 0.05, weight decay 0.0001, batch size 64). We report the average Top-1 accuracy after training for 100 epochs [32]; the learning rate drops by a factor of 10 at the 40th, 60th and 80th epochs.
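The schedule above can be reproduced with a standard PyTorch optimizer and a multi-step scheduler, as in the sketch below; the stand-in model and the omitted training loop body are placeholders, not part of the paper.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 10, 3)  # stand-in; any LAM-equipped lightweight network goes here

# Settings reported in the text: Adam, lr 0.05, weight decay 1e-4, batch size 64,
# learning rate divided by 10 at epochs 40, 60 and 80 over 100 epochs in total.
optimizer = torch.optim.Adam(model.parameters(), lr=0.05, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 60, 80], gamma=0.1)

for epoch in range(100):
    # ... one training pass and one evaluation pass over the loaders go here ...
    scheduler.step()
```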

Table 1 summarizes the experimental results. The networks with LAM outperform all baselines significantly, showing that LAM generalizes well across lightweight models on the image classification datasets. Moreover, the models with LAM are more accurate than those with the other attention modules. First, although CBAM improves the accuracy of MobileNetV2, its complicated convolutional layers lead to vanishing gradient problems in the other models; we therefore keep both channel and spatial attention but simplify the convolutional parts. Second, MobileNetV3 and EfficientNet use squeeze-and-excitation layers to extract features, which yields a significant improvement, but this is not robust across models: it improves MobileNetV1 yet degrades MobileNetV2, ShuffleNetV1 and ShuffleNetV2. Although it may help models capture channel features, it concentrates on the channel dimension and ignores spatial information, and is thus sensitive to the task at hand.

Finally, our LAM absorbs the strengths of these two modules and attends to both the channel and spatial dimensions. In addition, we combine the output feature maps with a parallel arrangement, which gives the two branches equal weight. The experimental results imply that the proposed module is powerful, showing the efficacy of generating a richer descriptor that combines the two forms of attention effectively. LAM also obeys the lightweight principle, i.e., a small number of parameters and fast forward speed.

4.3 Ablation Study (Q2)

To answer the second question (Q2), we verify whether the current arrangement is the most beneficial to the effectiveness of the model. In the experiment, we design three variants of the proposed model:

  • LAM(I): using the LAM before depthwise convolution;

  • LAM(II): using the LAM after pointwise convolution;

  • LAM(III): using sequential arrangement.

We measure the accuracy of these three variants. As shown in Table 2, the current arrangement performs the best. The performance of LAM(I) and LAM(II) is worse than that of the current module, which shows that the current arrangement, placing the module after the depthwise filters, allows the attention to be applied to the largest representations. The results of LAM(III) show that vanishing gradient problems occur in MobileNetV1, and the sequential arrangement has more parameters and lower speed than the parallel arrangement.

Table 2. Ablation study of the LAM
Fig. 4. Parameter sensitivity analysis on the Cifar100 dataset.

4.4 Sensitivity Analysis (Q3)

In this subsection, we test the robustness of the model and verify whether the hyperparameter settings affect it. We conduct two groups of experiments, varying the learning rate (0.001, 0.005, 0.01, 0.05, 0.1) and the kernel size of the convolutional layer in the spatial attention part (1, 3, 5, 7). As shown in Fig. 4, our model maintains high accuracy within a small range as the learning rate and kernel size change. This implies that the proposed LAM model is not sensitive to these main parameters and thus has good robustness.

5 Conclusion

In this paper, we propose LAM, a novel attention module for lightweight neural networks, which uses two attention mechanisms while simplifying their components. Specifically, in the spatial attention module we use element-wise addition and smaller convolutional kernels to avoid the vanishing gradient problem of previous modules; in the channel module we use squeeze-and-excitation layers in place of MLP layers; and finally we adopt a parallel architecture to integrate the two parts efficiently. The experimental results on two image classification datasets verify the effectiveness of the proposed attention module for lightweight models. LAM can readily be applied to other tasks involving lightweight neural networks, e.g., object tracking in computer vision.