1 Introduction

With the advance of deep learning, object detectors can be divided into two groups: two-stage detectors and one-stage detectors. Two-stage detectors such as [1, 2] first generate a set of RoIs and then perform object classification and RoI-wise bounding box regression. One-stage detectors, e.g., YOLO [3] and SSD [4], skip RoI generation and detect objects directly. Because of the extreme foreground–background class imbalance, two-stage detectors usually outperform one-stage detectors. Anchor-free detectors such as [2, 5, 6] address this problem: they instead transform object detection into a point detection problem, which avoids the complex computations of anchors and runs faster.

To recognize and locate objects in remote sensing images more effectively, research on remote sensing detection is urgently needed. In recent years, most object detection research has been based on Convolutional Neural Networks (CNNs). For example, Region-based Convolutional Neural Networks [7] (R-CNN), a pioneering method, first generated region proposals using selective search and then refined them by extracting regional features with a convolutional network. Faster R-CNN [8] introduced a region proposal network and an end-to-end trainable detector to improve performance. Feature Pyramid Networks [9] (FPN) constructed a feature pyramid and assigned objects to different pyramid levels for prediction according to the scale of the region proposal. RetinaNet [10] chose a feature pyramid network similar to FPN as its backbone and introduced a new focal loss to alleviate the imbalance between easy and hard examples. In aerial images, however, most objects are very small, and these methods do not detect them well. This presents us with great challenges.

In recent years, many methods based on feature pyramids have been proposed. This is because FPN can combine low-level high-resolution information with high-level strong semantic information, and can predict simultaneously at different levels using both lower-level and higher-level features. As a result, targets in remote sensing images are not so small that detectors ignore them. Mou et al. [11] proposed a method that builds a feature pyramid with strong semantic feature maps at all scales using a top-down pathway and lateral connections; the feature maps of different layers are responsible for detecting objects of different sizes. A dense feature pyramid network (DFPN) was proposed by Yang et al. [12] for automatic ship detection: every feature map is densely linked and combined by concatenation.

The above methods improve the ability of FPN to recognize small objects, but some problems remain. FPN extracts different features at each level of the image pyramid and then makes predictions at the corresponding level. The shallow layers of the feature pyramid attend more to details and location information, while the upper layers focus more on semantics, which helps locate objects. First, higher-level feature maps help enhance the semantic information of lower levels. Second, the topmost convolution layer loses some information because it has fewer feature channels, and it is not compatible with the other feature levels since it carries only single-scale context information. The feature map at the top layer is therefore very important for detection. To remedy this shortcoming, we propose a method to enrich the top-level feature information. We use a five-layer feature pyramid network (\({C}_{1}{-}{C}_{5}\)), and our method uses a residual branch to produce a new convolution layer \({C}_{6}\). The residual branch injects different spatial background information into the original branch, and the new layer \({C}_{6}\) alleviates the information loss caused by channel reduction during fusion.

In addition to the above method, we also introduce super-resolution (SR) technology to enrich the detailed information of the feature maps. Image super-resolution refers to recovering high-resolution (HR) images from low-resolution (LR) images or image sequences. In general, the higher the resolution of an image, the more detail and information it contains. However, resolution is not the same as pixel count: an image enlarged five times by interpolation says nothing about how much detail it contains. Image super-resolution is concerned with recovering the missing details in the image, that is, the high-frequency information. Figure 8.1 shows an example of SR, where a is the clear image, b is an image that needs to be restored to high resolution, and c is the result of the restoration. As the figure shows, the image restored with SR contains more details and information. We use sub-pixel convolution to enrich the details of the high-level feature map so that \({C}_{5}\) carries more information. We hope this method can reduce information loss and improve the quality of the generated feature pyramids.

Fig. 8.1

The figure is an example of the SR technique: a is the ground truth, b is the low-resolution image, and c is the recovered high-resolution image

To realize the above method, we first improve the network structure of the traditional feature pyramid and propose a module that adds a convolution layer before multi-scale feature fusion. The module also recalculates the fusion weights so that the extracted multi-scale feature layers are fused more effectively. Finally, we introduce sub-pixel convolution to improve the semantic richness of the feature map and reduce the loss of detail.

2 Methods

Previous methods cannot solve the incompatibility between the highest-level feature map and the other feature levels. We propose a new RBFF network, consisting of residual branches and sub-pixel convolution, to detect small objects in aerial images. Figure 8.2 shows the framework of our method. The module we designed performs several operations on the tensor in order to fuse feature maps more efficiently, and we use sub-pixel convolution to enrich the high-frequency information of the feature map. Our method is described in detail below. It first adds a residual branch to generate a new feature map \({C}_{6}\) and recalculates the fusion weights; the features are then fused with these recalculated weights. The ACAR module consists of an anchor classification branch and an anchor regression branch. We then feed the anchor boxes and the input feature maps into a deformable convolution [6] to extract aligned features. Finally, an active rotating filter [13] (ARF) is used to extract orientation-invariant features and produce the final detection results; a schematic code sketch of the alignment step follows Fig. 8.2.

Fig. 8.2

The figure shows the RBFF network architecture
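For concreteness, the following is a highly schematic sketch of the alignment step of this pipeline, assuming PyTorch and torchvision. The names `RBFFSketch`, `offset_conv`, and `arf_head` are our own hypothetical placeholders, the offsets would in practice be derived from the anchors refined by ACAR, and a plain convolution stands in for the ARF of [13]; this is a sketch of the data flow, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class RBFFSketch(nn.Module):
    def __init__(self, channels=256, num_classes=15):
        super().__init__()
        # Hypothetical offset predictor; a real implementation would derive
        # the 2*3*3 sampling offsets from the anchors refined by ACAR.
        self.offset_conv = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.align = DeformConv2d(channels, channels, 3, padding=1)
        # Plain conv standing in for the active rotating filters (ARF) [13].
        self.arf_head = nn.Conv2d(channels, num_classes, 3, padding=1)

    def forward(self, feat):
        offsets = self.offset_conv(feat)     # anchor-guided sampling offsets
        aligned = self.align(feat, offsets)  # deformable feature alignment
        return self.arf_head(aligned)        # per-location class scores

x = torch.randn(1, 256, 64, 64)              # one pyramid level
print(RBFFSketch()(x).shape)                 # torch.Size([1, 15, 64, 64])
```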

2.1 Sub-pixel Convolution

Most remote sensing images are very large. For example, images in the DOTA dataset are about \(4000\times 4000\) pixels, and small objects like vehicles occupy very little of the image. In addition, after the image passes through the feature pyramid network, little detail remains, making it impossible to fully identify the small objects in the image. Image super-resolution technology addresses this problem.

In general, both \({I}^{LR}\) and \({I}^{HR}\) can have C color channels, so they are represented as real-valued tensors of size \(H\times W\times C\) and \(rH\times rW\times C\), respectively. One way to realize image super-resolution is a convolution with a fractional stride of \(\frac{1}{r}\) in the LR space, but this increases the computational cost because the process happens in the HR space. Instead, we use a convolution with stride \(\frac{1}{r}\) in the LR space with filters \({W}_{a}\) of size \({k}_{a}\) and weight spacing \(\frac{1}{r}\), which does not activate all weights of \({W}_{a}\): weights that fall between pixels are neither activated nor computed. At most \({\lceil\frac{{k}_{a}}{r}\rceil}^{2}\) weights are activated per pattern, and these patterns are activated periodically across the convolution according to the sub-pixel position: mod(a, r), mod(b, r), where (a, b) are the coordinates of an output pixel in HR space. In this paper, we use a more efficient way, called sub-pixel convolution, to achieve this process when mod(\({k}_{a},r\)) = 0:

$$I^{SR} = f^{K} \left( {I^{LR} } \right) = PS\left( {W_{K} * f^{K - 1} \left( {I^{LR} } \right) + b_{K} } \right)$$
(8.1)

where PS is a periodic shuffling operator that rearranges the elements of an \(H\times W\times C\cdot {r}^{2}\) tensor into a tensor of size \(rH\times rW\times C\). This operation can be described mathematically as follows:

$$PS(T)_{x,y,c} = T_{\left\lfloor {x/r} \right\rfloor ,\left\lfloor {y/r} \right\rfloor ,\, C \cdot r \cdot \mathrm{mod}\left( {y,r} \right) + C \cdot \mathrm{mod}\left( {x,r} \right) + c}$$
(8.2)
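As a sanity check on Eqs. (8.1) and (8.2), the following is a minimal PyTorch sketch, assuming an upscale factor r = 2 and 256-channel feature maps: `nn.PixelShuffle` implements exactly the periodic shuffling operator PS, so a standard convolution producing \(C\cdot {r}^{2}\) channels followed by it realizes sub-pixel convolution.

```python
import torch
import torch.nn as nn

r, C = 2, 256                    # assumed upscale factor and channel count
conv = nn.Conv2d(C, C * r * r, kernel_size=3, padding=1)  # plays W_K, b_K
ps = nn.PixelShuffle(r)          # the periodic shuffling operator PS

x = torch.randn(1, C, 32, 32)    # LR feature map of size H x W x C
y = ps(conv(x))                  # HR feature map of size rH x rW x C
print(y.shape)                   # torch.Size([1, 256, 64, 64])
```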

2.2 Residual Branches

In the feature pyramid network, the top-down fusion process loses information at the top level because of the reduced number of channels. To address this, we apply ratio-invariant adaptive pooling to the topmost layer of the feature pyramid to produce multiple context features at different scales \(({a}_{1}\times S,{a}_{2}\times S,\ldots,{a}_{n}\times S)\). To avoid the aliasing effects caused by interpolation, we set three different scales to fit these context features rather than simply summing them. Sub-pixel convolution is then used to scale them back up to the scale S for subsequent fusion. Each context feature then independently passes through a \(1\times 1\) convolution layer to reduce its channel dimension to 256. Finally, to construct the feature pyramid, we apply a \(3\times 3\) convolution layer to each feature map, as shown in Fig. 8.3; a code sketch follows the figure.

Fig. 8.3

The diagram shows the detailed structure of the residual branch that we propose. First, the topmost feature map goes through adaptive pooling at three scales. The features are then amplified by sub-pixel convolution and concatenated horizontally
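The following is a minimal sketch of this residual branch, assuming PyTorch. The pooling ratios (here 1/2, 1/4, and 1/8, i.e., upscale factors 2, 4, and 8) are illustrative assumptions, since the text fixes only the number of scales, and `ResidualBranch` and its member names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBranch(nn.Module):
    def __init__(self, ch=256, factors=(2, 4, 8)):
        super().__init__()
        self.factors = factors
        # One sub-pixel path per pooled scale: a conv to ch*r^2 channels
        # followed by the periodic shuffle of Eq. (8.2) restores scale S.
        self.subpixel = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(ch, ch * r * r, 3, padding=1),
                           nn.PixelShuffle(r)) for r in factors])
        # 1x1 convs reduce each context feature to 256 channels.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(ch, 256, 1) for _ in factors])
        # Final 3x3 conv applied after the horizontal concatenation.
        self.smooth = nn.Conv2d(256 * len(factors), 256, 3, padding=1)

    def forward(self, c5):
        h, w = c5.shape[-2:]                  # target scale S
        outs = []
        for r, up, red in zip(self.factors, self.subpixel, self.reduce):
            pooled = F.adaptive_avg_pool2d(c5, (h // r, w // r))
            outs.append(red(up(pooled)))      # back to S, 256 channels
        return self.smooth(torch.cat(outs, dim=1))  # context feature C6

c5 = torch.randn(1, 256, 32, 32)
print(ResidualBranch()(c5).shape)  # torch.Size([1, 256, 32, 32])
```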

3 Experiments

3.1 Data Set

Our experiments were conducted primarily on the DOTA [14] dataset, which contains 2,806 aerial images of approximately \(4000\times 4000\) pixels and 188,282 instances. The dataset has 15 categories: plane (PL), ship (SH), storage tank (ST), baseball diamond (BD), tennis court (TC), basketball court (BC), ground track field (GTF), harbor (HA), bridge (BR), large vehicle (LV), small vehicle (SV), helicopter (HC), roundabout (RA), soccer ball field (SBF), and swimming pool (SP). Each instance is annotated as a quadrilateral of arbitrary shape and orientation determined by four points, rather than by a traditional horizontal box. Specifically, an initial point (\({x}_{1},{y}_{1}\)) is marked first, and points 2, 3, and 4 follow in clockwise order. The initial point is usually chosen at the head of the object; for an object with no obvious visual head, such as a harbor, the upper-left corner is chosen as the first point, as shown in Fig. 8.4. A sketch of parsing this annotation format follows the figure.

Fig. 8.4

The figure shows how the dataset labels are defined
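To make the annotation format concrete, here is a small parsing sketch in Python, assuming the plain-text DOTA v1.0 label format in which each instance line holds eight coordinates followed by the category name and a difficulty flag. `parse_dota_line` is a hypothetical helper; in practice, metadata lines such as `imagesource` at the top of a label file would need to be skipped.

```python
def parse_dota_line(line: str):
    """Parse one DOTA line: 'x1 y1 x2 y2 x3 y3 x4 y4 category difficult'."""
    parts = line.split()
    coords = list(map(float, parts[:8]))
    # Points 1-4 run clockwise from the initial point (the object head,
    # or the upper-left corner for shapeless objects such as harbors).
    points = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    category, difficult = parts[8], int(parts[9])
    return points, category, difficult

pts, cat, diff = parse_dota_line(
    "939.0 885.0 1010.0 885.0 1010.0 913.0 939.0 913.0 small-vehicle 0")
print(cat, pts[0])  # small-vehicle (939.0, 885.0)
```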

Loss Function. The loss function of our method consists of two parts and is defined as follows:

$$\begin{aligned}L & =\frac{1}{{N}_{R}}\left(\sum_{i}{L}_{c}\left({c}_{i}^{R},{l}_{i}^{*}\right) + \sum_{i}{1}_{{l}_{i}^{*}\ge 1}{L}_{r}\left({x}_{i}^{R},{g}_{i}^{*}\right)\right) \\ & \quad + \frac{\lambda }{{N}_{M}}\left(\sum_{i}{L}_{c}\left({c}_{i}^{F},{l}_{i}^{*}\right) + \sum_{i}{1}_{{l}_{i}^{*}\ge 1}{L}_{r}\left({x}_{i}^{F},{g}_{i}^{*}\right)\right), \end{aligned}$$
(8.3)

where λ is a loss balance parameter, 1 is an indicator function, \({N}_{R}\) and \({N}_{M}\) are the numbers of positive samples in the ACAR and ARF stages, respectively, and i is the index of a sample in a minibatch. \({c}_{i}^{R}\) and \({x}_{i}^{R}\) are the predicted category and refined locations of anchor i in ACAR. \({c}_{i}^{F}\) and \({x}_{i}^{F}\) are the predicted object category and bounding-box locations in ARF. \({l}_{i}^{*}\) and \({g}_{i}^{*}\) are the ground-truth category and locations of anchor i. The focal loss [10] and the smooth L1 loss are adopted as the classification loss \({L}_{c}\) and the regression loss \({L}_{r}\), respectively. The hyperparameters of the focal loss \({L}_{c}\) are set to α = 0.25 and γ = 2.0. We use the same training procedure as in Detectron [15].
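A minimal sketch of Eq. (8.3), assuming PyTorch with torchvision's `sigmoid_focal_loss` for \({L}_{c}\) and `smooth_l1_loss` for \({L}_{r}\). For brevity the sketch normalizes both heads by a single positive count and matches them against one target set, whereas Eq. (8.3) uses separate counts \({N}_{R}\) and \({N}_{M}\) and each stage has its own assignments.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(cls_r, box_r, cls_f, box_f,
                   cls_tgt, box_tgt, pos_mask, lam=1.0):
    """Two-part loss of Eq. (8.3): ACAR term plus lambda * ARF term."""
    n_pos = pos_mask.sum().clamp(min=1).float()

    def head_loss(cls_pred, box_pred):
        # Focal loss over all anchors (alpha=0.25, gamma=2.0 as in the text),
        # smooth L1 regression over positive anchors only.
        lc = sigmoid_focal_loss(cls_pred, cls_tgt,
                                alpha=0.25, gamma=2.0, reduction="sum")
        lr = F.smooth_l1_loss(box_pred[pos_mask], box_tgt[pos_mask],
                              reduction="sum")
        return (lc + lr) / n_pos

    return head_loss(cls_r, box_r) + lam * head_loss(cls_f, box_f)
```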

3.2 Ablation Study

Residual Branches. In our approach, the network is enhanced by changing its structure and adding a new branch. For a fair comparison, we use ResNet-50 as the backbone of both methods and choose S2A-Net [16] as the baseline. The results with and without the residual branch are shown in Table 8.1, where S2A-Net denotes the baseline and RBFF denotes our method. Our method provides better detection results for small objects on the DOTA validation set.

Table 8.1 Experimental results with different networks

Sub-pixel Convolution. To test the impact of sub-pixel convolution on the accuracy of small-object detection, we ran two tests with our network, one using sub-pixel convolution and the other not. Here, sub-pixel denotes the network with sub-pixel convolution and S2A-Net denotes the variant without it. The results are shown in Table 8.2. The table shows that sub-pixel convolution generally has a positive impact on the detection of small objects.

Table 8.2 Comparison of the results of the experiment

3.3 Comparison of Experimental Results

The RBFF method was compared with other popular methods on the DOTA dataset; the results are shown in Table 8.3. In contrast to many previous works [13, 17] designed to detect large-scale targets, the experimental results in the table report detection results for nine types of objects, which is aimed at evaluating small objects. The mAP in the last row of the table is the average over these nine categories. The results show that our method outperforms several previous detection methods. With the default input size, e.g., \(1024\times 1024\), RBFF runs at 399 ms per image on an RTX 2080; a single-scale test runs at 66 ms per image. Some visualizations of the detection results are shown in Figs. 8.5 and 8.6.

Table 8.3 Comparison with other methods on the DOTA dataset. FFA-3(M) denotes the use of the multi-stage FFA-3 detector in the experiments
Fig. 8.5

The figure shows visualization results of our method. The four pictures on the left are the detection results of S2A-Net, and the four pictures on the right are the detection results of our method. Significantly more objects are identified in the red boxes on the right than on the left

Fig. 8.6

This figure shows part of the detection results obtained by our method

4 Conclusion

In this paper, a novel method for remote sensing object detection has been proposed based on the feature pyramid network. Our method uses a residual branch to improve the network structure and reduce the feature loss that occurs during feature fusion, and the features are then scaled by sub-pixel convolution. Our method adopts the focal loss to better balance bounding boxes of varying scales, and multi-scale testing can significantly improve detection performance. Our RBFF was trained with both ResNet-50-FPN and ResNet-101-FPN backbones, and both achieved good performance on the DOTA dataset. We hope that our approach will be useful in the field of remote sensing object detection and data statistics.