
1 Introduction

Object detection is one of the most important and challenging problems in understanding the visual world. It consists of two tasks: classifying which objects appear in a given image and determining where those objects are located. Object detection has been widely used in many applications, such as autonomous cars, robot vision, and intelligent surveillance systems.

Currently, the rapid development of deep learning techniques, most notably Convolutional Neural Networks (CNNs), has greatly advanced computer vision tasks (e.g., image classification, object detection, instance segmentation) by providing an efficient way to automatically extract feature representations from visual data (e.g., images, videos).

Due to the design of the convolution operation, the receptive field is constrained to local regions. Capturing long-range dependencies remedies this problem and helps extract global contextual features from visual data. In CNNs, long-range dependencies are modeled by deeply stacking many convolutional layers to enlarge the receptive field. Nonetheless, repeating many convolutional layers is not efficient: it increases the computational cost and makes optimization harder, so training takes longer to converge to a good solution.

Fig. 1. Illustration of the bidirectional information paths: one path gathers the features of all key positions to form an attention map, while the other distributes the information of the query positions to all key locations.

The non-local network NLNet [16] introduces the non-local block, inspired by the non-local operation [1], to model long-range dependencies. A non-local operation computes the correlation between a query position and all key positions and then aggregates the features of all positions by a weighted average (Fig. 1(a)). The relationship between each pair of positions is therefore not bidirectional. As a complement to the non-local block, the proposed method, named BNL, introduces bidirectional information paths (Fig. 1(c)) for understanding the complex visual world: it not only aggregates the features of all positions at each query position to model global context but also distributes the important information at each position to all key positions globally.

The global context network GCNet [2] investigates a simplified version of the non-local block based on a query-independent attention map shared by all positions. This simplified network dramatically reduces the number of parameters compared with the non-local network while still maintaining accuracy. Building on the study of the global context network, this paper also relaxes the computational cost of the non-local block and surpasses the accuracy of both the global context network and the non-local network with only a slight increase in computational cost.

The Bidirectional Non-local (BNL) block can be applied to any existing backbone architecture. To demonstrate the improvement of the proposed method, the block is inserted into the residual blocks of ResNet [5]. This work conducts experiments on the MS-COCO dataset [10] for the object detection task. The proposed method achieves a significant improvement, outperforming Mask R-CNN [4] + GC block and Mask R-CNN + NL block by 0.9% in Average Precision (AP).

2 Related Work

CNNs have been one of the most popular research topics in computer vision, since well-designed networks yield significant improvements in image classification [5, 6, 14], object detection [4, 7, 9,10,11,12,13], and segmentation [4, 17].

With the accelerated development of deep learning, object detection has improved in both accuracy and speed. The goal of object detection is to classify which objects appear in a given image and to localize where they are. Based on the number of networks, object detectors fall into two types: two-stage methods and one-stage methods. A two-stage method [4, 12, 13] first creates a set of proposals with a region proposal network (i.e., an anchor generator applied at each center of a sliding window) and assigns each ground-truth box to proposals, identifying them as negative or positive. A second network then classifies each proposal and refines its coordinates by learning offsets. In contrast, one-stage methods do not use a region proposal network to create anchors; instead, they densely place anchors with different sizes and aspect ratios at each position, and the classification and localization networks directly form the final detections with class labels and bounding boxes. Faster R-CNN [13] is a two-stage method and one of the most popular detection architectures in computer vision. Inspired by Faster R-CNN, many architectures such as Mask R-CNN [4], Libra R-CNN [12], and TridentNet [7] have been introduced. Mask R-CNN adds a branch to Faster R-CNN to predict masks for the segmentation task. In this paper, the BNL block is inserted into the ResNet backbone [5] of the Mask R-CNN detector for object detection.

Owing to the restricted receptive field of the convolution operation, the non-local network [16] proposes a new method for capturing long-range dependencies based on the attention mechanism [15] and the non-local filter [1] to extract a global understanding of the visual world. The relationship between key positions and query positions is not bidirectional: information at each query position is gathered from all key positions. Although this operation is effective, it still has a high computational cost. The global context network [2] studies a query-independent attention map shared by all positions. It uses only one attention map (i.e., corresponding to one output channel of the kernel) to represent the attention maps of all positions. From this characteristic, and inspired by SENet [6], GCNet drastically reduces the number of parameters of the non-local block while maintaining the accuracy of the non-local network. Different from the non-local block and the global context block, the proposed method introduces a bidirectional non-local (BNL) block that aggregates the features at each position from all positions and, vice versa, distributes the information at each query position to all positions (Fig. 1). Moreover, this work, inheriting the observation of GCNet, relaxes the high computational cost yet surpasses the accuracy of the non-local network and the global context network with only a slight increase in computational cost.

3 The Proposed Method

3.1 Non-local Network

To design the BNL block, this section revisits the non-local block [16]. As mentioned in Sects. 1 and 2, the non-local block gathers feature information at each query position from all key positions. Equation 1 expresses this relationship as

$$\begin{aligned} \mathbf {z_{i}} = \mathbf {x_{i}} + \mathbf {W_{z}}\sum _{j=1}^{H*W}\mathbf {\omega _{i, j}}\; (\mathbf {W_{v}}\,\mathbf {x_{j}}) \end{aligned}$$
(1)

where \(\mathbf {x_{i}}\) denotes the query position and \(\mathbf {x_{j}}\) denotes the key positions. H*W is the number of key positions in the input feature map. \(\mathbf {W_{z}}\) and \(\mathbf {W_{v}}\) are \(1\times 1\) convolutions. \(\mathbf {z_{i}}\) is the output of this block. \(\mathbf {\omega _{i, j}}\) is the correlation between the query position \(\mathbf {x_{i}}\) and \(\mathbf {x_{j}}\) and comes in four forms, namely the Gaussian function, Embedded Gaussian, Dot product, and Concatenation. In this paper, BNL inherits the Embedded Gaussian, which forms an attention map (i.e., highlights the important regions and suppresses the unnecessary parts). The Embedded Gaussian is calculated as Eq. 2.

$$\begin{aligned} \mathbf {\omega _{i,j}} = \frac{exp(\langle {\mathbf {W_{q}}\mathbf {x_{i}},\, {\mathbf {W_{k}}\mathbf {x_{j}}}} \rangle )}{\sum _{m}exp(\langle {\mathbf {W_{q}}\mathbf {x_{i}},\, {\mathbf {W_{k}}\mathbf {x_{m}}}} \rangle )} \end{aligned}$$
(2)

where \(\mathbf {W_{q}}\) and \(\mathbf {W_{k}}\) are \(1\times 1\) convolutions. The overall computation of the non-local block is shown in Fig. 2(a). The non-local block models the global contextual feature by computing, for each query position, a weighted sum over all key positions based on the query-specific attention map. Hence, the relationship between pairwise positions is not bidirectional. From this observation, the proposed network constructs a bidirectional relationship between query positions and key positions.
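For concreteness, the following is a minimal PyTorch sketch of a non-local block with the Embedded Gaussian similarity of Eqs. 1 and 2. It is an illustration rather than the reference implementation; the layer names and the channel-reduction factor are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Minimal sketch of the non-local block with Embedded Gaussian similarity."""

    def __init__(self, in_channels, reduction=2):
        super().__init__()
        inter = in_channels // reduction
        self.w_q = nn.Conv2d(in_channels, inter, kernel_size=1)  # W_q
        self.w_k = nn.Conv2d(in_channels, inter, kernel_size=1)  # W_k
        self.w_v = nn.Conv2d(in_channels, inter, kernel_size=1)  # W_v
        self.w_z = nn.Conv2d(inter, in_channels, kernel_size=1)  # W_z

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.w_q(x).flatten(2).transpose(1, 2)      # N x HW x C'
        k = self.w_k(x).flatten(2)                      # N x C' x HW
        v = self.w_v(x).flatten(2).transpose(1, 2)      # N x HW x C'
        # omega_{i,j}: softmax over the key positions (Eq. 2)
        attn = F.softmax(torch.bmm(q, k), dim=-1)       # N x HW x HW
        # weighted sum over all key positions for each query (Eq. 1)
        out = torch.bmm(attn, v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.w_z(out)                        # residual connection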

Fig. 2. (a) The non-local block. (b) The squeeze-and-excitation network [6]. \(\bigotimes \) denotes matrix multiplication, \(\bigoplus \) denotes matrix summation, and \(\bigodot \) denotes broadcast element-wise multiplication.

3.2 Bidirectional Non-local Network

In Eq. 2, there are many key positions and query positions in the feature map, which leads to a large number of computations between \(\mathbf {x_{i}}\) and \(\mathbf {x_{j}}\). Hence, the non-local block is simplified by an approximation strategy. First, the function \(\mathbf {\omega _{i, j}}\) is converted as

$$\begin{aligned} \mathbf {\omega _{i,j}} \approx \mathbf {\omega _{i}} = \frac{exp({\mathbf {W_{q}}\mathbf {x_{i}}})}{\sum _{m}exp(\langle {\mathbf {W_{q}}\mathbf {x_{i}},\, {\mathbf {W_{k}}\mathbf {x_{m}}}} \rangle )} \end{aligned}$$
(3)
Fig. 3. (a) The global context block. (b) The bidirectional non-local block; the left part is \(\mathbf {\omega _{i}}\) and the right part is \(\mathbf {\omega _{j}}\). \(\bigotimes \) denotes matrix multiplication, \(\bigoplus \) denotes matrix summation, and \(\bigodot \) denotes broadcast element-wise multiplication.

In Eq. 3, the information path from key position j to query position i is related only to the attention map at the query location i and the correlation of i and j. To model the attention map, the function \(\mathbf {\omega _{i}}\) is transformed into a softmax-like function by ignoring \(\mathbf {W_{q}x_{i}}\) in the denominator.

$$\begin{aligned} \mathbf {\omega _{i,j}} \approx \mathbf {\omega _{i}} = \frac{exp({\mathbf {W_{q}}\mathbf {x_{i}}})}{\sum _{m}exp( {{\mathbf {W_{k}}\mathbf {x_{m}}}})} \end{aligned}$$
(4)

Correspondingly, the function \(\mathbf {\omega _{i, j}}\) is simplified as

$$\begin{aligned} \mathbf {\omega _{i,j}} \approx \mathbf {\omega _{j}} = \frac{exp({\mathbf {W_{k}}\mathbf {x_{j}}})}{\sum _{m}exp({\mathbf {W_{k}}\mathbf {x_{m}}})} \end{aligned}$$
(5)

In Eq. 5, the information path from key position j to query position i is related only to the attention map at the key location j and the correlation of i and j. Notably, the function \(\mathbf {\omega _{j}}\) has the same form as the global context block [2] (Fig. 3(a)). This means that the global context block is a special case of the proposed method, named the bidirectional non-local network (BNL).

Fig. 4. Illustration of the simplified bidirectional non-local block. \(\bigotimes \) denotes matrix multiplication, \(\bigoplus \) denotes matrix summation, and \(\bigodot \) denotes broadcast element-wise multiplication.

Finally, the bidirectional information path is formed as

$$\begin{aligned} \mathbf {z_{i}} = \mathbf {x_{i}} + \mathbf {W_{z}}\sum _{j=1}^{H*W}\mathbf {\omega _{i}}\; (\mathbf {W_{v}}\,\mathbf {x_{j}}) + \mathbf {W_{z}}\sum _{j=1}^{H*W}\mathbf {\omega _{j}}\; (\mathbf {W_{v}}\,\mathbf {x_{j}}) \end{aligned}$$
(6)

The second term shows that each query position gathers features from all other positions. The third term shows that each query position distributes its features to all other positions. Figure 3(b) shows the bidirectional non-local block.

Notably, the global context network [2] proposes a query-independent attention map shared by all positions. The proposed method relaxes the computational cost by inheriting this observation from GCNet. Furthermore, \(\mathbf {W_{z}}\) is removed and \(\mathbf {W_{v}}\) is moved outside the summations, as shown in Fig. 4.
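As a rough illustration, the following PyTorch sketch implements one reading of the simplified block in Fig. 4 (Eq. 6 with \(\mathbf {W_{z}}\) removed and \(\mathbf {W_{v}}\) moved outside the summations). It is not the authors' implementation: here \(\mathbf {W_{q}}\) and \(\mathbf {W_{k}}\) are taken as \(1\times 1\) convolutions with a single output channel, as in the GC block, and both attention maps are normalized with a spatial softmax, which differs slightly from the shared denominator of Eqs. 4 and 5.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedBNLBlock(nn.Module):
    """One interpretation of the simplified bidirectional non-local block (Fig. 4)."""

    def __init__(self, channels):
        super().__init__()
        self.w_q = nn.Conv2d(channels, 1, kernel_size=1)         # W_q -> omega_i
        self.w_k = nn.Conv2d(channels, 1, kernel_size=1)         # W_k -> omega_j
        self.w_v = nn.Conv2d(channels, channels, kernel_size=1)  # W_v (moved outside)

    def forward(self, x):
        n, c, h, w = x.shape
        feat = x.view(n, c, h * w)                                # N x C x HW
        # omega_i: attention over query positions (left path in Fig. 3(b))
        a_q = F.softmax(self.w_q(x).view(n, 1, h * w), dim=-1)    # N x 1 x HW
        # omega_j: attention over key positions (right path, GC-style)
        a_k = F.softmax(self.w_k(x).view(n, 1, h * w), dim=-1)    # N x 1 x HW
        # "gather" path: every position receives the global feature sum,
        # weighted by its own attention value omega_i
        gather = feat.sum(dim=-1, keepdim=True) * a_q             # N x C x HW
        # "distribute" path: one context vector shared by all positions
        context = (feat * a_k).sum(dim=-1, keepdim=True)          # N x C x 1
        out = (gather + context).view(n, c, h, w)                 # broadcast over H x W
        return x + self.w_v(out)                                  # residual connection

Since a \(1\times 1\) convolution is linear up to its bias, applying \(\mathbf {W_{v}}\) once to the sum of the two paths matches Eq. 6, where the same \(\mathbf {W_{v}}\) appears in both terms.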

4 Experiment Setup

The proposed method is evaluated on the challenging MS-COCO 2017 dataset [10] for the object detection task. This dataset includes 115k images for training (the 80k-image training set plus a 35k-image validation subset), 5k validation images for selecting the best hyper-parameters, and 20k images for testing. Because the ground-truth annotations of the test set are not published, results are submitted to the evaluation server. The standard Average Precision (AP) and Average Recall (AR) metrics are used.

All experiments are implemented with the PyTorch framework. The BNL block is applied to stages c3, c4, and c5 of the ResNet-50 backbone [5]. The object detector is Mask R-CNN [4] with an FPN neck [8].
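As a standalone illustration of the insertion point (the paper itself builds on mmdetection's Mask R-CNN, not torchvision), the sketch below appends a block such as the SimplifiedBNLBlock sketched in Sect. 3.2 after every bottleneck of layers 2-4 in a torchvision ResNet-50, which correspond to stages c3, c4, and c5; the exact position inside the residual block may differ from the authors' setup.

import torch.nn as nn
import torchvision

def add_bnl_to_resnet50():
    # plain ResNet-50 backbone; pretrained weights are omitted in this sketch
    backbone = torchvision.models.resnet50()
    # stages c3, c4, c5 correspond to layer2, layer3, layer4 in torchvision
    for stage in (backbone.layer2, backbone.layer3, backbone.layer4):
        for idx, bottleneck in enumerate(list(stage.children())):
            channels = bottleneck.bn3.num_features  # output width of the bottleneck
            # append a (hypothetical) SimplifiedBNLBlock after each residual block
            stage[idx] = nn.Sequential(bottleneck, SimplifiedBNLBlock(channels))
    return backbone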

The Mask R-CNN is configured following the standard settings of mmdetection [3] with 12 epochs. The integrated model is trained with a batch size of 8 on one NVIDIA Titan GPU with CUDA 10.2 and CuDNN 7.6.5. The initial learning rate is 0.01 from the \(1^{st}\) to the \(8^{th}\) epoch and is decayed by a factor of 10 at the \(9^{th}\) and \(10^{th}\) epochs. Input images are resized to \(1333\times 800\).
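Expressed in plain PyTorch rather than the mmdetection config actually used, the schedule above roughly corresponds to the following sketch; the momentum and weight-decay values are common defaults, not taken from the paper.

import torch

def build_optimizer_and_scheduler(model):
    # SGD with the stated initial learning rate of 0.01
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=1e-4)
    # decay the learning rate by 10x entering the 9th and 10th epochs
    # (milestones counted in completed epochs, with one scheduler.step() per epoch)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[8, 9], gamma=0.1)
    return optimizer, scheduler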

5 Results

Comparison with State-of-the-Art. The BNL block is inserted into the Mask R-CNN architecture with the ResNet-50 backbone and the FPN neck. The integrated model with the stronger ResNet-50 + BNL backbone is evaluated on the MS-COCO test-dev set, and the experimental results are compared with state-of-the-art object detectors in Table 1. The learning schedule is 1\(\times \) for models trained with 12 epochs and 2\(\times \) for models trained with 24 epochs. ResNet-50 and ResNet-50(a) denote the pytorch-style and caffe-style backbones, respectively.

Table 1. Results on test-dev set 2017.
Fig. 5. Qualitative results of the proposed method on the MS-COCO validation set.

With the BNL block embedded into the ResNet-50 backbone (i.e., ResNet-50+BNL) of the Mask R-CNN detector, the proposed method achieves 39.3 AP, which is 0.9% higher than GCNet with the ResNet-50+GC backbone at 38.4 AP, without bells and whistles. In particular, the integrated method outperforms the Mask R-CNN baseline by 2%. Furthermore, the proposed method surpasses most object detectors with the same backbone, FPN neck, and learning schedule; e.g., Faster R-CNN [13] with ResNet-50 reaches 36.4 AP, RetinaNet [9] reaches 35.6 AP, and Mask R-CNN [4] reaches 37.3 AP. The results on the test-dev set show that strong baselines are boosted by a large margin when the BNL block is applied to stages c3, c4, and c5 of the ResNet-50 backbone. These results demonstrate the efficiency of the proposed method.

Figure 5 visualizes the qualitative results of the proposed method on the MS-COCO validation set with three levels of the dataset.

Ablation Study. This work studies the importance of each component of the BNL block. The proposed method consists of a gathered information block \(\mathbf {\omega _{i}}\) and a distributed information block \(\mathbf {\omega _{j}}\). First, the efficiency of the gathered information block is investigated by removing the distributed information block, and vice versa.

Table 2 analyzes the impact of each component of the BNL module. This experiment gradually inserts the gathered block and the distributed block into the ResNet-50 Mask R-CNN baseline. The gathered component, inheriting the advantage of the global context block, improves AP by 1.1% over the ResNet-50 Mask R-CNN baseline. The distributed component increases AP by 0.5%, from 37.2 to 37.7. This component is lightweight because it uses only two \(1\times 1\) convolutions with few parameters. When the gathered and distributed components are combined into the BNL block, the accuracy of the integrated model increases by a large margin of 1.9% over the baseline.

Table 2. The impact of each component of the BNL block. Results are reported on the validation set.

6 Conclusion

In this paper, the proposed Bidirectional Non-local (BNL) block studies the effectiveness of a gathered information block and a distributed information block. The gathered information block aggregates the features of all positions at each query position to capture long-range dependencies. The distributed block distributes the important information at each position to all key positions globally. By fusing the two information propagation flows, the proposed method not only encodes long-range dependencies by computing the correlation between each pair of positions but also considers their relative locations. The experimental results demonstrate the significant improvement of the BNL block when applied to the ResNet-50 backbone of the Mask R-CNN baseline detector. Without bells and whistles, the integrated model brings 0.9 points higher AP than GCNet and 2.0 points higher AP than the Mask R-CNN baseline.