1 Introduction

With the development of earth observation technology and the growing capacity to process large-scale data, object detection in remote sensing images has attracted increasing attention [2, 7, 18, 19, 21]. However, it remains a significant challenge for remote sensing images with complex backgrounds, small objects and occlusion. In recent years, deep learning algorithms have demonstrated good performance in object detection, including region-based methods and single-shot methods, e.g., the faster region-based CNN (Faster R-CNN), spatial pyramid pooling networks (SPP-Net) [5], you only look once (YOLO) [15] and the single-shot multibox detector (SSD) [10]. However, these methods usually fail to detect very small objects, as small object features are lost in the downsampling process of deep CNNs (DCNNs). This shortcoming is especially pronounced in remote sensing images, in which objects are usually small and blurry, and it creates considerable challenges.

To increase the accuracy of networks in detecting small objects in remote sensing images, one approach is to employ features of the middle layers of the CNN and then exploit multiscale features with multilevel information. Ding et al. [2] directly concatenated multiscale features of a CNN to obtain fine-grained details for the detection of small objects. Lin et al. adopted pixelwise summation to incorporate the score maps generated by multilevel contextual features of different residual blocks for the segmentation of small objects such as vehicles. Mou et al. [13] proposed a top-down pathway and lateral connections to build a feature pyramid network with strong semantic feature maps at all scales; the method assigns the feature maps of different layers to objects at different scales. Yang et al. [21] introduced the dense feature pyramid network (DFPN) for automatic ship detection, in which the feature maps are densely connected and merged by concatenation. Furthermore, Li et al. [7] proposed a hierarchical selective filtering layer that maps features of multiple scales to the same scale space for ship detection at various scales. Gao et al. [4] designed a tailored pooling pyramid module (TPPM) to take advantage of the contextual information of different subregions at different scales. Qiu et al. proposed A2RMNet [14] to alleviate the inaccurate localization caused by large aspect ratio variations among objects in remote sensing images. Xie et al. proposed a fully connected network model, NEOONet [20], which addresses the category imbalance problem in remote sensing images.

Although the above methods have achieved promising detection results by aggregating multiscale features, some problems remain. When objects appear in complex scenes or are only partly visible due to occlusion, these methods cannot extract discriminative features for detection.

To address these issues, we first propose a multiscale attention module that guides the network to learn more useful information in low-level feature maps with long-range dependencies, which is important for the detection of small objects. We then present a contextual attention module that extracts the interdependencies of foreground and background in feature maps; by learning correlated background and foreground information with long-range dependencies, the network can efficiently detect objects in complex scenes and in the presence of occlusion. In addition, we design a channel attention module to model the importance of each feature map, which suppresses background or unrelated category features with long-range dependencies and improves the accuracy of the model.

2 Proposed Work

Previous methods cannot effectively extract the features of small objects because of the trade-off between resolution and high-level semantic information: low-level feature maps have high resolution but little semantic information, whereas high-level feature maps have low resolution but rich semantic information, and both are useful for accurately detecting objects. In addition, when an object is in a complex scene or occluded, the accuracy of previous algorithms decreases. Therefore, motivated by DANet [3], the proposed method uses parallel self-attention modules to enhance the accuracy of small object detection in remote sensing images.

As illustrated in Fig. 1, three types of attention modules are designed. They contain a series of matrix operations that synthetically consider multiscale features, local contextual features and global features in the residual network after every residual block (resblock). In addition, to extract the features of direction-sensitive objects, we use deformable convolution and upsampling bottleneck operations to add deep feature maps to shallow feature maps, and we use four-scale feature maps for prediction, which is advantageous for detecting small objects. Before prediction, \(3\times 3\) convolutional layers are used to prevent aliasing effects.

Fig. 1. The architecture of the IPAN. The proposed method uses three parallel modules to obtain rich multiscale features, contextual features and global features across channels. Next, an elementwise summation is performed on the feature maps of the three modules to obtain the final representations reflecting long-range contexts. Finally, four scales with a multiscale feature fusion operation are used for prediction.
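
As a rough illustration of the top-down fusion step described above, the NumPy sketch below reduces the deeper map with an assumed \(1\times 1\) convolution weight, upsamples it \(2\times\) by nearest-neighbour repetition (standing in for the upsampling bottleneck), and adds it elementwise to the shallow map. This is a minimal sketch under assumed shapes; the deformable convolution used in the paper is omitted here.

```python
import numpy as np

def topdown_fuse(shallow, deep, W1x1):
    # shallow: (C, H, W); deep: (2C, H/2, W/2); W1x1: (C, 2C) reduction weight.
    reduced = np.tensordot(W1x1, deep, axes=([1], [0]))      # (C, H/2, W/2)
    upsampled = reduced.repeat(2, axis=1).repeat(2, axis=2)  # (C, H, W)
    return shallow + upsampled  # element-wise addition of deep into shallow
```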

2.1 Multiscale Attention Module

Feature maps of different scales carry different semantic and spatial information: high-level feature maps are rich in semantic information, and low-level feature maps are rich in spatial information; both types of information are useful for detecting small objects in remote sensing images. To strengthen the small-object feature representation, a multiscale attention module is proposed to combine deep and shallow features. The structure of the multiscale attention module is illustrated in Fig. 2.

The feature maps A and B, obtained from resblock1 and resblock2 respectively, are used to calculate the attention map. H, W, and C represent the height, width and number of channels of the feature maps, respectively. Here \(\mathbf {B_1} \in \mathbb {R}^{C \times H\times W}\) is the resblock1 output (i.e., A), and \(\mathbf {B_2} \in \mathbb {R}^{2C \times H/2\times W/2}\) is the resblock2 output; a \(1\times 1\) convolution reduces \(B_2\) to \(\mathbf {B} \in \mathbb {R}^{C \times H/2\times W/2}\). We reshape A to \(\mathbb {R}^{C \times N}\) and B to \(\mathbb {R}^{C \times N/4}\), where N = \(H \times W\) is the number of pixels. Then, a matrix multiplication of the transpose of A and B is performed, and a softmax layer is applied to normalize the weights and calculate the multiscale attention map \(\mathbf {M} \in \mathbb {R}^{N \times N/4}\):

$$\begin{aligned} M_{j i}=\frac{\exp \left( A_{j} \cdot B_{i}\right) }{\sum _{i=1}^{N/4} \exp \left( A_{j} \cdot B_{i}\right) } \end{aligned}$$
(1)

where \(M_{ji}\) measures the impact of the ith position of B on the jth position of A at different scales. Then, we perform a matrix multiplication of B and the transpose of M and reshape the result to \(\mathbb {R}^{C \times H\times W}\). Finally, the result is multiplied by a scale parameter \(\alpha \), and an elementwise summation with A is performed to obtain the final output \(\mathbf {E} \in \mathbb {R}^{C \times H\times W}\) as follows:

$$\begin{aligned} E_{j}=\alpha \sum _{i=1}^{N/4}\left( M_{j i} B_{i}\right) +B_{1j} \end{aligned}$$
(2)

The feature map B is deeper than A in the ResNet and thus has more semantic information. Therefore, the attention map can guide B to learn more low-level information. As a result, E combines low-level position information and high-level semantic information, which is effective for small object detection. From the heat map of E, we can see that more small object features are activated.
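
To make the matrix operations above concrete, the following minimal NumPy sketch implements Eqs. (1) and (2). It is an illustrative reimplementation, not the paper's TensorFlow code: the `softmax` helper, the `W1x1` weight standing in for the \(1\times 1\) convolution, and the function signature are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiscale_attention(A, B2, W1x1, alpha=1.0):
    # A: (C, H, W) resblock1 output (B_1); B2: (2C, H/2, W/2) resblock2 output.
    # W1x1: (C, 2C) weight of the 1x1 convolution reducing B2 to C channels.
    C, H, W = A.shape
    N = H * W
    B = np.tensordot(W1x1, B2, axes=([1], [0]))  # (C, H/2, W/2)
    A_flat = A.reshape(C, N)                     # (C, N)
    B_flat = B.reshape(C, N // 4)                # (C, N/4)
    # Eq. (1): attention map M in R^{N x N/4}; row j holds the weights of
    # A's j-th position over the N/4 positions of B.
    M = softmax(A_flat.T @ B_flat, axis=1)
    # Eq. (2): weighted sum over B's positions, scaled by alpha, plus the
    # residual connection to A (= B_1).
    E = alpha * (B_flat @ M.T).reshape(C, H, W) + A
    return E
```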

Fig. 2. The architecture of the multiscale attention module. “\(\times \)” represents matrix multiplication.

2.2 Contextual Attention Module

Discriminating foreground from background features is essential for object detection and can be achieved by capturing contextual information. Some studies [6, 17] have used contextual information to improve detection results. To model rich contextual relationships over local features, we introduce a contextual attention module, which encodes a wide range of contextual information into the local features and thus enhances their representation capability. Next, we elaborate on the process of adaptively aggregating spatial contexts.

As illustrated in Fig. 3, given a local feature \(\mathbf {B_1} \in \mathbb {R}^{C \times H\times W}\), the local feature is first fed into convolution layers to generate two new feature maps C and D, where \(\mathbf {C,D} \in \mathbb {R}^{C \times H \times W}\). The filter size is \(7\times 7\) at the first scale (e.g., for \(B_1\)); as the network deepens, the filter sizes at the different scales are \(7\times 7\), \(5\times 5\), and \(3\times 3\), since large filters can extract more contextual features. Then, we reshape C and D to \(\mathbb {R}^{C \times N}\), where N = \(H \times W\) is the number of pixels. Next, a matrix multiplication of the transpose of C and D is performed, and a softmax layer is applied to calculate the contextual attention map \(\mathbf {P} \in \mathbb {R}^{N \times N}\):

$$\begin{aligned} P_{j i}=\frac{\exp \left( C_{i} \cdot D_{j}\right) }{\sum _{i=1}^{N} \exp \left( C_{i} \cdot D_{j}\right) } \end{aligned}$$
(3)
$$\begin{aligned} F_{j}=\beta \sum _{i=1}^{N}\left( P_{j i} B_{1i}\right) +B_{1j} \end{aligned}$$
(4)

where \(P_{ji}\) measures the ith position's impact on the jth position with contextual information. More similar feature representations of two positions indicate a greater contextual correlation between them. Then, we perform a matrix multiplication of \(B_1\) and the transpose of P and reshape the result to \(\mathbb {R}^{C \times H \times W}\). Finally, we multiply the result by a scale parameter \(\beta \) and perform an elementwise summation with the features \(B_1\) to obtain the final output \(\mathbf {F} \in \mathbb {R}^{C \times H\times W}\), as given in Eq. (4).

The attention map computed from C and D contains contextual information with long-range dependencies. From the heat map shown in F, it can be observed that many features around the objects are activated, which contributes to the detection of small objects.
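
A corresponding NumPy sketch of Eqs. (3) and (4) is given below. `C_feat` and `D_feat` stand in for the outputs of the two convolution layers (assumed to be computed elsewhere), and the `softmax` helper mirrors the one in the previous sketch; this is again an assumption-laden illustration rather than the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contextual_attention(B1, C_feat, D_feat, beta=1.0):
    # B1, C_feat, D_feat: (C, H, W); C_feat and D_feat are the outputs of
    # the two convolution layers (7x7 at this scale) applied to B1.
    C, H, W = B1.shape
    N = H * W
    Cf = C_feat.reshape(C, N)
    Df = D_feat.reshape(C, N)
    B1f = B1.reshape(C, N)
    # Eq. (3): contextual attention map P in R^{N x N}; P[j, i] weights the
    # influence of position i on position j.
    P = softmax(Df.T @ Cf, axis=1)
    # Eq. (4): aggregate B1 over all positions, scale by beta, add residual.
    F = beta * (B1f @ P.T).reshape(C, H, W) + B1
    return F
```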

Fig. 3. The architecture of the contextual attention module. “\(\times \)” represents matrix multiplication.

2.3 Channel Attention Module

Each channel map has a different class and spatial position response; some channels contribute to object detection, while others do not. Emphasizing interdependent feature maps can strengthen positive responses, weaken negative responses and improve the feature representation of specific classes and positions. Therefore, we build a channel attention module to explicitly model the interdependencies between channels and learn long-range dependencies in the feature maps.

The structure of the channel attention module is illustrated in Fig. 4. Different from the contextual attention module, we directly calculate the channel attention map \(\mathbf {Q} \in \mathbb {R}^{C \times C}\) from the original features \(\mathbf {B_1} \in \mathbb {R}^{C \times H\times W}\). Specifically, \(B_1\) is reshaped to \(\mathbb {R}^{C \times N}\), and a matrix multiplication of \(B_1\) and its transpose is performed. Finally, a softmax layer is applied to obtain \(\mathbf {Q}\):

$$\begin{aligned} Q_{j i}=\frac{\exp \left( B_{1i} \cdot B_{1j}\right) }{\sum _{i=1}^{C} \exp \left( B_{1i} \cdot B_{1j}\right) } \end{aligned}$$
(5)
$$\begin{aligned} G_{j}=\gamma \sum _{i=1}^{C}\left( Q_{j i} B_{1i}\right) +B_{1j} \end{aligned}$$
(6)

where \(Q_{ji}\) measures the ith channel's impact on the jth channel. We then perform a matrix multiplication of Q and \(B_1\) and reshape the result to \(\mathbb {R}^{C \times H\times W}\). Next, we multiply the result by a scale parameter \(\gamma \) and perform an elementwise summation with \(B_1\) to obtain the final output \(\mathbf {G} \in \mathbb {R}^{C \times H\times W}\). From the heat map of G shown in Fig. 4, we can see that more features strongly associated with the objects are activated.
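
The channel attention computation of Eqs. (5) and (6) can be sketched in the same way; as before, this is an illustrative NumPy reimplementation under assumed shapes, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(B1, gamma=1.0):
    # B1: (C, H, W). The attention map is computed from B1 itself.
    C, H, W = B1.shape
    B1f = B1.reshape(C, H * W)
    # Eq. (5): channel attention map Q in R^{C x C}; Q[j, i] weights the
    # influence of channel i on channel j.
    Q = softmax(B1f @ B1f.T, axis=1)
    # Eq. (6): reweight the channels, scale by gamma, add the residual.
    G = gamma * (Q @ B1f).reshape(C, H, W) + B1
    return G
```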

Fig. 4. The architecture of the channel attention module. “\(\times \)” represents matrix multiplication.

2.4 Embedding with Networks

As shown in Fig. 1, we place our parallel attention modules after every resblock. After every IPA module, the feature maps contain multiscale, contextual and global information with long-range dependencies. Positive feature responses achieve mutual gains, thus increasing the differences between objects and backgrounds. Then, each scale's feature map is fed to the RPN [16] for further prediction. In addition, the model can exploit spatial information at all corresponding positions to model the channel correlations. Notably, the proposed attention modules are simple and can be directly inserted into existing object detection pipelines. The modules require few additional parameters yet effectively strengthen the feature representations.
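
As a usage illustration, the following hypothetical wiring runs the three sketched modules from Sects. 2.1–2.3 on random inputs (all shapes, weights and stand-in feature maps are placeholders) and fuses their outputs by elementwise summation, as described in the Fig. 1 caption. It assumes the three functions defined in the earlier sketches are in scope.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 64, 32, 32
A = rng.standard_normal((C, H, W))                 # resblock1 output (B_1)
B2 = rng.standard_normal((2 * C, H // 2, W // 2))  # resblock2 output
W1x1 = 0.01 * rng.standard_normal((C, 2 * C))      # assumed 1x1-conv weight
C_feat = rng.standard_normal((C, H, W))            # stand-in 7x7-conv outputs
D_feat = rng.standard_normal((C, H, W))

E = multiscale_attention(A, B2, W1x1)        # Sect. 2.1
F = contextual_attention(A, C_feat, D_feat)  # Sect. 2.2
G = channel_attention(A)                     # Sect. 2.3
out = E + F + G  # element-wise summation of the three branches (Fig. 1)
print(out.shape)  # (64, 32, 32)
```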

3 Experiment

We performed a series of experiments with the RSOD dataset [11], which includes small objects in complex scenes and in the presence of occlusion. The proposed method achieved state-of-the-art performance: 95.9% for aircraft and 95.5% for cars. We first conducted evaluations on the remote sensing object detection dataset to compare the performance with that of other state-of-the-art object detection methods, and then evaluated the robustness of the proposed method on the COCO [9] dataset.

3.1 Comparison with State-of-the-Art Methods on the Remote Sensing Object Detection Dataset

We chose the ResNet-101 model as the backbone network. The model was initialized with the ImageNet classification model and fine-tuned on RSOD [11]. We randomly split the samples into 80% for training and 20% for testing. In all experiments, we trained and tested the proposed method with the TensorFlow deep learning framework. First, we resized the images to 800 \(\times \) 800 pixels and applied stochastic gradient descent for 50k iterations to train our model. The learning rate was 0.001, and it was decreased to 0.0001 after 30k iterations. We adopted only one anchor scale per prediction scale, with three ratios (1:1, 1:2 and 2:1) and areas of 32 \(\times \) 32, 64 \(\times \) 64, 128 \(\times \) 128, and 256 \(\times \) 256 pixels, which performed well on the dataset.
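
The anchor geometry described above can be checked with a short calculation: for each area, the width and height follow from \(w \times h = \text{area}\) and \(w/h = \text{ratio}\). This is only a sketch of the geometry; the actual anchor generation is handled inside the detection framework.

```python
# One anchor area per prediction scale, three aspect ratios (w:h).
areas = [32 ** 2, 64 ** 2, 128 ** 2, 256 ** 2]
ratios = [1.0, 0.5, 2.0]  # 1:1, 1:2, 2:1

anchors_per_level = []
for area in areas:
    level = []
    for r in ratios:
        w = (area * r) ** 0.5  # since w * h = area and w / h = r
        h = (area / r) ** 0.5
        level.append((round(w, 1), round(h, 1)))
    anchors_per_level.append(level)

print(anchors_per_level[0])  # [(32.0, 32.0), (22.6, 45.3), (45.3, 22.6)]
```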

Figure 5 compares the detection results of the proposed algorithm with those of popular deep learning methods. The red boxes in the first three rows show the results of aircraft (small object) detection in complex scenes and in the presence of occlusion, respectively. The last row shows the results for cars: the blue box marks a car in a complex scene, and the red box marks a car in the presence of occlusion.

Fig. 5. Visualization of small object detection results in complex scenes and in the presence of occlusion, comparing the proposed algorithm with popular deep learning methods on the aircraft and car datasets: (a) the results of FPN [8], (b) the results of modified Faster R-CNN [17], (c) the results of R-FCN [1] and (d) the proposed method. (Color figure online)

First, we compared the proposed approach with seven common state-of-the-art methods for object detection and five state-of-the-art methods for remote sensing object detection. As shown in Table 1, the deep learning methods are clearly more accurate than traditional methods such as HOG [19] and SIFT [18]. Because of its strength in small object detection and its ability to handle complex scenes with occlusion, the proposed method achieves higher accuracy and robustness when detecting aircraft and cars. From the AP and recall (R) values shown in Table 1, we can observe that in terms of AP over the aircraft and car categories, the proposed approach improves on other deep learning methods by approximately 7% on average and outperforms the previously most accurate methods, USB-BBR [11] and A2RMNet [14], by 1% each. The proposed method obtains the highest AP while maintaining relatively high recall.

Table 1. Comparison of the proposed method with other methods on the remote sensing object detection datasets. The symbol “–” indicates that the method does not provide relevant results.

3.2 Ablation Study

An ablation study was performed to validate the contribution of each module of the proposed method. In each experiment, one part of our method was omitted, and the remaining parts were retained. The APs are listed in Table 2. The AP increased by almost 1% when the contextual attention module and the channel attention module were included, because the contextual attention module captures contextual information about the objects. Second, we used the channel attention module and the multiscale attention module after the resblocks, and the result improved by 2.1%, which indicates that global information and multiscale information are important for detection. Finally, we inserted the contextual attention module and the multiscale attention module into the network, and the AP improved by 2.3%. Thus, contextual information and multiscale information effectively help the network detect small objects. When all modules are applied, the proposed method improves the AP by 4.7%.

Figure 6 shows the visualized results of the ablation study for situations 1–5 in Table 2, corresponding to (b)–(f). It can easily be seen that, owing to the multiscale attention module, the proposed method obtains more spatial structural information and semantic information about small objects and thus enhances their feature representation. The contextual attention module and the channel attention module enable the network to learn the correspondence between background and foreground and the global interdependence across channels, further strengthening the feature representation, so the network can efficiently detect objects in complex scenes and in the presence of occlusion.

Table 2. Results of the ablation study of the proposed framework on the aircraft detection task. N: no, Y: yes.
Fig. 6. The visualized results of the ablation study. (a) is the original image, and (b)–(f) correspond to situations “1” to “5” in Table 2.

3.3 Robustness Experiments

To verify the robustness of the proposed method, we evaluated its performance on the COCO [9] dataset. As shown in Table 3, COCO has more classes and considerably smaller objects; because the proposed method is designed for small objects, it achieves better performance than the other methods.

Table 3. Comparison of the proposed method with state-of-the-art methods on the COCO dataset.

4 Conclusion

In this paper, an IPAN is presented for small object detection in remote sensing images; it enhances the feature representation and improves the detection accuracy for small objects, especially in complex scenes and in the presence of occlusion. Specifically, we introduced three parallel modules: a multiscale attention module that guides the model to extract more information about small objects, a contextual attention module that captures contextual correlations, and a channel attention module that learns global interdependencies across channels. In addition, the IPAN can capture long-range dependencies, which helps to detect objects. The experiments and ablation study show that the network is effective and achieves precise detection results. The proposed IPAN consistently achieves outstanding performance on remote sensing object detection datasets. Future work will focus on further enhancing the robustness of the model.