1 Introduction

With the advance of deep learning, object detectors can be divided into two groups: two-stage detectors and one-stage detectors. Two-stage detectors such as [1, 2] first generate a set of RoIs and then perform object classification and RoI-wise bounding box regression. One-stage detectors, e.g., YOLO [3] and SSD [4], skip RoI generation and detect objects directly. Because of the extreme foreground–background class imbalance, two-stage detectors usually outperform one-stage detectors. Anchor-free detectors such as [2, 5, 6] address this problem: they instead transform object detection into a point detection problem, which avoids the complex computations of anchors and runs faster.

To recognize and locate objects in remote sensing images more effectively, research on remote sensing detection is urgently needed. In recent years, most object detection research has been based on Convolutional Neural Networks (CNNs). For example, Region-based Convolutional Neural Networks [7] (R-CNN), a pioneering method, first generated region proposals using selective search and then refined them by extracting regional features with a convolutional network. Faster R-CNN [8] introduced a region proposal network and an end-to-end trainable detector to improve performance. Feature Pyramid Networks [9] (FPN) constructed a feature pyramid and assigned objects to different pyramid levels for prediction according to the scale of the region proposal. RetinaNet [10] chose a feature pyramid network similar to FPN as its backbone and introduced a new focal loss to alleviate the imbalance between easy and hard examples. In aerial images, however, most objects are very small, and these methods do not detect them well. This presents us with great challenges.

In recent years, many methods based on feature pyramids have been proposed. This is because FPN can combine low-level high-resolution information with high-level strong semantic information, and can predict simultaneously at different levels using both lower-level and higher-level features. As a result, targets in remote sensing images are not so small that detectors ignore them. Mou et al. [11] proposed a method that builds a feature pyramid with strong semantic feature maps at all scales using a top-down pathway and lateral connections; the feature maps of different layers are responsible for detecting objects of different sizes. A dense feature pyramid network (DFPN) was proposed by Yang et al. [12] for automatic ship detection: every feature map is densely linked and combined by concatenation.

The above methods improve the ability of FPN to recognize small objects, but some problems remain. FPN extracts different features at each level of the image pyramid and then makes predictions at the corresponding level. The shallow layers of the feature pyramid attend more to details and location information, while the upper layers focus more on semantics, which helps locate objects. First, higher-level feature maps help enhance the semantic information of lower levels. Second, the topmost convolution layer loses some information because it has fewer feature channels, and it is not compatible with the other feature levels since it carries only single-scale context information. The feature map at the top layer is therefore very important for detection. To remedy this shortcoming, we propose a method to enrich the top-level feature information. We use a five-layer feature pyramid network (\({C}_{1}{-}{C}_{5}\)), and our method uses a residual branch to produce a new convolution layer \({C}_{6}\). The residual branch injects different spatial background information into the original branch, and the new layer \({C}_{6}\) alleviates the information loss caused by channel reduction during fusion.

In addition to the above method, we also introduce super-resolution (SR) technology to enrich the detailed information of the feature maps. Image super-resolution refers to recovering high-resolution (HR) images from low-resolution (LR) images or image sequences. In general, the higher the resolution of an image, the more detail and information it contains. However, resolution is not the same as pixel count: an image enlarged five times by interpolation says nothing about how much detail it contains. Image super-resolution is concerned with recovering the missing details in the image, that is, the high-frequency information. Figure 8.1 shows an example of SR, where a is the clear image, b is an image that needs to be restored to high resolution, and c is the result of the restoration. As the figure shows, the image restored with SR contains more details and information. We use sub-pixel convolution to enrich the details of the high-level feature map so that \({C}_{5}\) carries more information. We hope this method can reduce information loss and improve the quality of the generated feature pyramids.

Fig. 8.1

The figure is an example of the SR technique: a is the ground truth, b is the low-resolution image, and c is the recovered high-resolution image

To realize the above method, we first improve the network structure of the traditional feature pyramid and propose a module that adds a convolution layer before multi-scale feature fusion. The module also recalculates the fusion weights so that the extracted multi-scale feature layers are fused more effectively. Finally, we introduce sub-pixel convolution to improve the semantic richness of the feature map and reduce the loss of detail.

2 Methods

Previous methods cannot solve the incompatibility between the highest-level feature map and the other feature levels. We propose a new RBFF network, consisting of residual branches and sub-pixel convolution, to detect small objects in aerial images. Figure 8.2 shows the framework of our method. The module we designed performs several operations on the tensor in order to fuse feature maps more efficiently, and we use sub-pixel convolution to enrich the high-frequency information of the feature map. Our method is described in detail below. It first adds a residual branch to generate a new feature map \({C}_{6}\) and recalculates the fusion weights; the features are then fused with these recalculated weights. The ACAR module consists of an anchor classification branch and an anchor regression branch. We then feed the anchor boxes and the input feature maps into a deformable convolution [6] to extract aligned features. Finally, an active rotating filter [13] (ARF) is used to extract orientation-invariant features and produce the final detection results; a schematic code sketch of the alignment step follows Fig. 8.2.

Fig. 8.2

The figure shows the RBFF network architecture
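For concreteness, the following is a highly schematic sketch of the alignment step of this pipeline, assuming PyTorch and torchvision. The names `RBFFSketch`, `offset_conv`, and `arf_head` are our own hypothetical placeholders, the offsets would in practice be derived from the anchors refined by ACAR, and a plain convolution stands in for the ARF of [13]; this is a sketch of the data flow, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class RBFFSketch(nn.Module):
    def __init__(self, channels=256, num_classes=15):
        super().__init__()
        # Hypothetical offset predictor; a real implementation would derive
        # the 2*3*3 sampling offsets from the anchors refined by ACAR.
        self.offset_conv = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.align = DeformConv2d(channels, channels, 3, padding=1)
        # Plain conv standing in for the active rotating filters (ARF) [13].
        self.arf_head = nn.Conv2d(channels, num_classes, 3, padding=1)

    def forward(self, feat):
        offsets = self.offset_conv(feat)     # anchor-guided sampling offsets
        aligned = self.align(feat, offsets)  # deformable feature alignment
        return self.arf_head(aligned)        # per-location class scores

x = torch.randn(1, 256, 64, 64)              # one pyramid level
print(RBFFSketch()(x).shape)                 # torch.Size([1, 15, 64, 64])
```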

2.1 Sub-pixel Convolution

Most remote sensing images are very large. For example, images in the DOTA dataset are about \(4000\times 4000\) pixels, and small objects like vehicles occupy very little of the image. In addition, after the image passes through the feature pyramid network, little detail remains, making it impossible to fully identify the small objects in the image. Image super-resolution technology addresses this problem.

In general, both \({I}^{LR}\) and \({I}^{HR}\) can have C color channels, so they are represented as real-valued tensors of size \(H\times W\times C\) and \(rH\times rW\times C\), respectively. One way to realize image super-resolution is a convolution with a fractional stride of \(\frac{1}{r}\) in the LR space, but this increases the computational cost because the process happens in the HR space. Instead, we use a convolution with stride \(\frac{1}{r}\) in the LR space with filters \({W}_{a}\) of size \({k}_{a}\) and weight spacing \(\frac{1}{r}\), which does not activate all weights of \({W}_{a}\): weights that fall between pixels are neither activated nor computed. At most \({\lceil\frac{{k}_{a}}{r}\rceil}^{2}\) weights are activated per pattern, and these patterns are activated periodically across the convolution according to the sub-pixel position: mod(a, r), mod(b, r), where (a, b) are the coordinates of an output pixel in HR space. In this paper, we use a more efficient way, called sub-pixel convolution, to achieve this process when mod(\({k}_{a},r\)) = 0:

$$I^{SR} = f^{K} \left( {I^{LR} } \right) = PS\left( {W_{K} * f^{K - 1} \left( {I^{LR} } \right) + b_{K} } \right)$$
(8.1)

where PS is a periodic shuffling operator that rearranges the elements of an \(H\times W\times C\cdot {r}^{2}\) tensor into a tensor of size \(rH\times rW\times C\). This operation can be described mathematically as follows:

$$PS(T)_{x,y,c} = T_{\left\lfloor {x/r} \right\rfloor ,\left\lfloor {y/r} \right\rfloor ,\, C \cdot r \cdot \mathrm{mod}\left( {y,r} \right) + C \cdot \mathrm{mod}\left( {x,r} \right) + c}$$
(8.2)
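As a sanity check on Eqs. (8.1) and (8.2), the following is a minimal PyTorch sketch, assuming an upscale factor r = 2 and 256-channel feature maps: `nn.PixelShuffle` implements exactly the periodic shuffling operator PS, so a standard convolution producing \(C\cdot {r}^{2}\) channels followed by it realizes sub-pixel convolution.

```python
import torch
import torch.nn as nn

r, C = 2, 256                    # assumed upscale factor and channel count
conv = nn.Conv2d(C, C * r * r, kernel_size=3, padding=1)  # plays W_K, b_K
ps = nn.PixelShuffle(r)          # the periodic shuffling operator PS

x = torch.randn(1, C, 32, 32)    # LR feature map of size H x W x C
y = ps(conv(x))                  # HR feature map of size rH x rW x C
print(y.shape)                   # torch.Size([1, 256, 64, 64])
```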

2.2 Residual Branches

In the feature pyramid network, the top-down fusion process loses information at the top level because of the reduced number of channels. To address this, we apply ratio-invariant adaptive pooling to the topmost layer of the feature pyramid to produce multiple context features at different scales \(({a}_{1}\times S,{a}_{2}\times S,\ldots,{a}_{n}\times S)\). To avoid the aliasing effects caused by interpolation, we set three different scales to fit these context features rather than simply summing them. Sub-pixel convolution is then used to scale them back up to the scale S for subsequent fusion. Each context feature then independently passes through a \(1\times 1\) convolution layer to reduce its channel dimension to 256. Finally, to construct the feature pyramid, we apply a \(3\times 3\) convolution layer to each feature map, as shown in Fig. 8.3; a code sketch follows the figure.

Fig. 8.3

The diagram shows the detailed structure of the residual branch that we propose. First, the topmost feature map goes through adaptive pooling at three scales. The features are then amplified by sub-pixel convolution and concatenated horizontally
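The following is a minimal sketch of this residual branch, assuming PyTorch. The pooling ratios (here 1/2, 1/4, and 1/8, i.e., upscale factors 2, 4, and 8) are illustrative assumptions, since the text fixes only the number of scales, and `ResidualBranch` and its member names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBranch(nn.Module):
    def __init__(self, ch=256, factors=(2, 4, 8)):
        super().__init__()
        self.factors = factors
        # One sub-pixel path per pooled scale: a conv to ch*r^2 channels
        # followed by the periodic shuffle of Eq. (8.2) restores scale S.
        self.subpixel = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(ch, ch * r * r, 3, padding=1),
                           nn.PixelShuffle(r)) for r in factors])
        # 1x1 convs reduce each context feature to 256 channels.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(ch, 256, 1) for _ in factors])
        # Final 3x3 conv applied after the horizontal concatenation.
        self.smooth = nn.Conv2d(256 * len(factors), 256, 3, padding=1)

    def forward(self, c5):
        h, w = c5.shape[-2:]                  # target scale S
        outs = []
        for r, up, red in zip(self.factors, self.subpixel, self.reduce):
            pooled = F.adaptive_avg_pool2d(c5, (h // r, w // r))
            outs.append(red(up(pooled)))      # back to S, 256 channels
        return self.smooth(torch.cat(outs, dim=1))  # context feature C6

c5 = torch.randn(1, 256, 32, 32)
print(ResidualBranch()(c5).shape)  # torch.Size([1, 256, 32, 32])
```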

3 Experiments

3.1 Data Set

Our experiments were conducted primarily on the DOTA [14] dataset, which contains 2,806 aerial images of approximately \(4000\times 4000\) pixels and 188,282 instances. The dataset has 15 categories: plane (PL), ship (SH), storage tank (ST), baseball diamond (BD), tennis court (TC), basketball court (BC), ground track field (GTF), harbor (HA), bridge (BR), large vehicle (LV), small vehicle (SV), helicopter (HC), roundabout (RA), soccer ball field (SBF), and swimming pool (SP). Each instance is annotated as a quadrilateral of arbitrary shape and orientation determined by four points, rather than by a traditional horizontal box. Specifically, an initial point (\({x}_{1},{y}_{1}\)) is marked first, and points 2, 3, and 4 follow in clockwise order. The initial point is usually chosen at the head of the object; for an object with no obvious visual head, such as a harbor, the upper-left corner is chosen as the first point, as shown in Fig. 8.4. A sketch of parsing this annotation format follows the figure.

Fig. 8.4

The figure shows how the dataset labels are defined
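To make the annotation format concrete, here is a small parsing sketch in Python, assuming the plain-text DOTA v1.0 label format in which each instance line holds eight coordinates followed by the category name and a difficulty flag. `parse_dota_line` is a hypothetical helper; in practice, metadata lines such as `imagesource` at the top of a label file would need to be skipped.

```python
def parse_dota_line(line: str):
    """Parse one DOTA line: 'x1 y1 x2 y2 x3 y3 x4 y4 category difficult'."""
    parts = line.split()
    coords = list(map(float, parts[:8]))
    # Points 1-4 run clockwise from the initial point (the object head,
    # or the upper-left corner for shapeless objects such as harbors).
    points = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    category, difficult = parts[8], int(parts[9])
    return points, category, difficult

pts, cat, diff = parse_dota_line(
    "939.0 885.0 1010.0 885.0 1010.0 913.0 939.0 913.0 small-vehicle 0")
print(cat, pts[0])  # small-vehicle (939.0, 885.0)
```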

Loss Function. The loss function of our method consists of two parts and is defined as follows:

$$\begin{aligned}L & =\frac{1}{{N}_{R}}\left(\sum_{i}{L}_{c}\left({c}_{i}^{R},{l}_{i}^{*}\right) + \sum_{i}{1}_{{l}_{i}^{*}\ge 1}{L}_{r}\left({x}_{i}^{R},{g}_{i}^{*}\right)\right) \\ & \quad + \frac{\lambda }{{N}_{M}}\left(\sum_{i}{L}_{c}\left({c}_{i}^{F},{l}_{i}^{*}\right) + \sum_{i}{1}_{{l}_{i}^{*}\ge 1}{L}_{r}\left({x}_{i}^{F},{g}_{i}^{*}\right)\right), \end{aligned}$$
(8.3)

where λ is a loss balance parameter, 1 is an indicator function, \({N}_{R}\) and \({N}_{M}\) are the numbers of positive samples in the ACAR and ARF stages, respectively, and i is the index of a sample in a minibatch. \({c}_{i}^{R}\) and \({x}_{i}^{R}\) are the predicted category and refined locations of anchor i in ACAR. \({c}_{i}^{F}\) and \({x}_{i}^{F}\) are the predicted object category and bounding-box locations in ARF. \({l}_{i}^{*}\) and \({g}_{i}^{*}\) are the ground-truth category and locations of anchor i. The focal loss [10] and the smooth L1 loss are adopted as the classification loss \({L}_{c}\) and the regression loss \({L}_{r}\), respectively. The hyperparameters of the focal loss \({L}_{c}\) are set to α = 0.25 and γ = 2.0. We use the same training procedure as in Detectron [15].
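A minimal sketch of Eq. (8.3), assuming PyTorch with torchvision's `sigmoid_focal_loss` for \({L}_{c}\) and `smooth_l1_loss` for \({L}_{r}\). For brevity the sketch normalizes both heads by a single positive count and matches them against one target set, whereas Eq. (8.3) uses separate counts \({N}_{R}\) and \({N}_{M}\) and each stage has its own assignments.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(cls_r, box_r, cls_f, box_f,
                   cls_tgt, box_tgt, pos_mask, lam=1.0):
    """Two-part loss of Eq. (8.3): ACAR term plus lambda * ARF term."""
    n_pos = pos_mask.sum().clamp(min=1).float()

    def head_loss(cls_pred, box_pred):
        # Focal loss over all anchors (alpha=0.25, gamma=2.0 as in the text),
        # smooth L1 regression over positive anchors only.
        lc = sigmoid_focal_loss(cls_pred, cls_tgt,
                                alpha=0.25, gamma=2.0, reduction="sum")
        lr = F.smooth_l1_loss(box_pred[pos_mask], box_tgt[pos_mask],
                              reduction="sum")
        return (lc + lr) / n_pos

    return head_loss(cls_r, box_r) + lam * head_loss(cls_f, box_f)
```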

3.2 Ablation Study

Residual Branches. In our approach, the network is enhanced by changing its structure and adding a new branch. For a fair comparison, we use ResNet-50 as the backbone of both methods and choose S2A-Net [16] as the baseline. The results with and without the residual branch are shown in Table 8.1, where S2A-Net denotes the baseline and RBFF denotes our method. Our method provides better detection results for small objects on the DOTA validation set.

Table 8.1 Experimental results with different networks

Sub-pixel Convolution. To test the impact of sub-pixel convolution on the accuracy of small-object detection, we ran two tests with our network, one using sub-pixel convolution and the other not. Here, sub-pixel denotes the network with sub-pixel convolution and S2A-Net denotes the variant without it. The results are shown in Table 8.2. The table shows that sub-pixel convolution generally has a positive impact on the detection of small objects.

Table 8.2 Comparison of the results of the experiment

3.3 Comparison of Experimental Results

The RBFF method was compared with other popular methods on the DOTA dataset; the results are shown in Table 8.3. In contrast to many previous works [13, 17] designed to detect large-scale targets, the experimental results in the table report detection results for nine types of objects, which is aimed at evaluating small objects. The mAP in the last row of the table is the average over these nine categories. The results show that our method outperforms several previous detection methods. With the default input size, e.g., \(1024\times 1024\), RBFF runs at 399 ms per image on an RTX 2080; a single-scale test runs at 66 ms per image. Some visualizations of the detection results are shown in Figs. 8.5 and 8.6.

Table 8.3 Comparison with other methods on the DOTA dataset. FFA-3(M) denotes the use of the multi-stage FFA-3 detector in the experiments
Fig. 8.5

The figure shows visualization results of our method. The four pictures on the left are the detection results of S2A-Net, and the four pictures on the right are the detection results of our method. Significantly more objects are identified in the red boxes on the right than on the left

Fig. 8.6

This figure shows part of the detection results obtained by our method

4 Conclusion

In this paper, a novel method for remote sensing object detection has been proposed based on the feature pyramid network. Our method uses a residual branch to improve the network structure and reduce the feature loss that occurs during feature fusion, and the features are then scaled by sub-pixel convolution. Our method adopts the focal loss to better balance bounding boxes of varying scales, and multi-scale testing can significantly improve detection performance. Our RBFF was trained with both ResNet-50-FPN and ResNet-101-FPN backbones, and both achieved good performance on the DOTA dataset. We hope that our approach will be useful in the field of remote sensing object detection and data statistics.