1 Introduction

With the development of remote sensing technologies, remote sensing image analysis is becoming increasingly important. It facilitates applications such as disaster control, environmental studies [6] and traffic planning [27]. As a fundamental task in computer vision, object detection is the basis of remote sensing image analysis. However, remote sensing images have vast backgrounds with many cluttered areas [26] and objects of widely varying sizes, which degrades the performance of object detection on these images.

Object detection includes three main tasks: feature extraction, proposal classification and bounding box regression [1, 17]. Traditional feature extraction usually relies on hand-crafted features, such as the Scale Invariant Feature Transform (SIFT) [20], Histogram of Oriented Gradients (HOG) [5] and texture features [11]. However, with the broad application of deep convolutional neural networks (DCNNs) [13] to image feature extraction, hand-crafted features have gradually been replaced by automatically learned feature representations. The classification task judges the category of objects. There are two main kinds of object detection models. The first divides the detection process into two steps: it generates region proposals by performing a binary classification (object or background) on the extracted feature maps [8], and then judges the category of each region proposal. The second uses a single step, performing a multi-class classification (covering the object categories and the background) directly on the extracted feature maps. The regression task revises the bounding box position and outputs the coordinate offsets.

In early studies on remote sensing image object detection, Cheng et al. [2] applied multi-scale HOG features to build a discriminatively trained mixture model for detecting objects of different sizes in remote sensing images. To identify objects in remote sensing images effectively, Senaras et al. [25] analyzed various object features (e.g., color, texture and shape) and applied different base-layer classifiers in a fuzzy stacked generalization architecture for building detection. Han et al. [10] used a deep Boltzmann machine to capture the spatial and structural information encoded in low-level and middle-level features. Despite the success of the methods above, they are all based on hand-crafted features, which are time-consuming to design and require domain expertise.

With the development of deep learning, DCNNs have enjoyed massive success in computer vision. For object detection, an essential task in computer vision, deep learning based detectors such as R-CNN [9], Faster R-CNN [23] and YOLO [22] significantly improve detection performance over traditional methods. Long et al. [19] proposed an unsupervised score-based bounding box regression for accurate object localization in remote sensing images. Dai et al. [4] proposed position-sensitive score maps for accurate and fast object detection. However, these methods use a single-scale feature and cannot adapt to the cluttered backgrounds and multi-scale objects in remote sensing images. Liu et al. [18] proposed the single shot multibox detector, which uses multi-scale feature maps to detect objects of various sizes and improves detection speed; however, the weak semantics of the high-resolution feature maps limit detection accuracy. Li et al. [14] used a coarse-to-fine merging manner to obtain discriminative candidate regions; nevertheless, the simple, single merging manner limits the feature representation, because high-resolution and low-resolution features differ considerably after passing through several convolution layers.

In recent years, some detection models take advantage of the pyramid structure of backbone networks, using nearest neighbor upsampling and element-wise summation to fuse feature maps of different resolutions and obtain a strong feature representation, which improves performance on generic object detection; FPN [15] is a representative example. However, remote sensing images contain many complex backgrounds (e.g., cities, forests and grasslands), noise and dense tiny objects. Simply fusing high-resolution and low-resolution feature maps with nearest neighbor upsampling and element-wise summation provides insufficient feature information, which is not suitable for remote sensing object detection. Because of the large difference between high-resolution and low-resolution feature maps, the fused feature maps cannot achieve a good balance between details and semantics.

In this paper, we propose a novel remote sensing image object detection method with a fusion based feature reinforcement component (FB-FRC). First, we apply the feature reinforcement component (FRC) to filter out some redundant details and strengthen the semantics of the high-resolution feature maps. The FRC generates a new feature layer and provides more feature information for fusion, compensating for the lack of semantics in high-resolution feature maps and the lack of details in low-resolution feature maps. Then, two feature fusion strategies (a hard fusion strategy and a soft fusion strategy) are designed to obtain a strong feature representation. Finally, experiments carried out on four remote sensing image datasets (NWPU VHR-10 [3], VisDrone2018 [30], DOTA [28] and RSOD [29]) verify the effectiveness of the proposed method.

In summary, the main contributions of the proposed method are as follows.

1) The FRC is applied to filter out some redundant details and strengthen the semantics of high-resolution feature maps, providing more feature information for fusion.

2) The hard fusion and soft fusion strategies are proposed to fuse the feature maps of different scales to get strong feature representation.

The rest of this paper is structured as follows. The details of the proposed method are described in Section 2. Section 3 presents the experimental results. Finally, Section 4 concludes the paper.

2 Proposed method

In this section, we introduce the FB-FRC for remote sensing image object detection in detail. The framework of the proposed method is illustrated in Fig. 1. It is a two-step detection method. In step one, the FRC and the fusion strategies are used to generate pyramid feature maps with high object discrimination, from which the region proposals are obtained. In step two, the region proposals are fed into the classifier and the regressor to obtain the final detection results. The details are described in the following.

Fig. 1 The framework of the proposed method

2.1 Enhancing feature extraction

The dense objects and cluttered backgrounds in remote sensing images tend to reduce the performance of object detection. Therefore, we add the FRC and apply two fusion strategies to enhance the object feature representation.

Figure 2 shows the structure of the FRC. The high-resolution feature map is first downsampled by a residual block, which further extracts semantic features, and then passes through a deconvolution (kernel size = 2, stride = 2) and a 1×1 convolution to produce the reinforced feature map. In the deconvolution, we use a small kernel for upsampling. Unlike ordinary images, remote sensing images contain tiny, densely packed objects (e.g., crowded pedestrians and congested vehicles), so two adjacent pixels in a feature map can represent two different objects. If the kernel size is too large, the corresponding receptive field is also large and covers pixels belonging to different objects; after deconvolution, one pixel in the feature map then mixes features from multiple objects, which disturbs or even destroys the features of tiny objects. Therefore, we use a small kernel size in the deconvolution. In the FRC, the 1×1 convolution following the deconvolution unifies the channel dimensions for feature fusion. In this paper, the output of the FRC is unified to 256 channels.

Fig. 2 The structure of FRC
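To make the data flow concrete, the following is a minimal PyTorch sketch of an FRC-like module (the paper's implementation uses MXNet); the channel widths, the simplified downsampling block without a shortcut, and the layer names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class FRC(nn.Module):
    """Sketch of a feature reinforcement component: residual-style downsampling,
    a small-kernel deconvolution for upsampling, and a 1x1 conv that unifies the
    output to 256 channels (channel sizes are assumptions)."""
    def __init__(self, in_ch, mid_ch=512, out_ch=256):
        super().__init__()
        # stride-2 downsampling stage (simplified; the paper uses a residual block)
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )
        # small (2x2, stride 2) deconvolution to restore the original resolution
        self.deconv = nn.ConvTranspose2d(mid_ch, mid_ch, kernel_size=2, stride=2)
        # 1x1 conv to unify the channel dimension for fusion
        self.unify = nn.Conv2d(mid_ch, out_ch, kernel_size=1)

    def forward(self, x):
        y = self.down(x)        # halve the resolution, strengthen semantics
        y = self.deconv(y)      # upsample back to the input resolution
        return self.unify(y)    # 256-channel reinforced feature map

# usage sketch: reinforce an assumed 256-channel high-resolution map C2l
c2l = torch.randn(1, 256, 200, 200)
c2d = FRC(in_ch=256)(c2l)       # -> (1, 256, 200, 200)
```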

Referring to the description of ResNet in [12], we define the residual block in the FRC as:

$$ y=F(x,{W})+x $$
(1)

where x and y are the input and output of the residual block, respectively. The function F represents a residual mapping. The operation F + x is performed by a shortcut connection and element-wise addition. The structure of the residual block is shown in Fig. 3, where the residual mapping F consists of 3 convolution layers and 2 ReLU functions, and the output of the residual block is activated by a ReLU function. In the residual block, the 1×1 convolutions are responsible for reducing and then increasing (restoring) the dimensions, leaving the 3×3 convolution with smaller input/output dimensions [12], which decreases the computation and training time compared with using two 3×3 convolutions.

Fig. 3 The structure of the residual block
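A minimal PyTorch sketch of such a bottleneck block is given below, following the 1×1 → 3×3 → 1×1 pattern of (1) and Fig. 3; using a 1×1 projection shortcut when the spatial size or channel count changes (e.g., for the stride-2 downsampling in the FRC) is an assumption.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """y = F(x, W) + x, where F = conv1x1 -> ReLU -> conv3x3 -> ReLU -> conv1x1
    (3 convolutions, 2 ReLUs); the sum is followed by a final ReLU as in Fig. 3."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1),                 # reduce dims
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1),                # restore dims
        )
        # identity shortcut when shapes match, otherwise a 1x1 projection (assumption)
        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.residual(x) + self.shortcut(x))
```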

Unlike blurring operations, the FRC is applied to enhance the semantics while reducing the redundant details of the high-resolution feature map, and its parameters are updated continually during training. Blurring operations such as mean blur and Gaussian blur are mainly used to filter image noise, and their parameters are generally fixed or manually adjusted.

Deeper networks suffer from a degradation problem during training: as the depth increases, accuracy saturates and then degrades rapidly. To solve this problem, ResNet [12] was proposed, using residual learning to capture the subtle changes in the network and make training more effective. In this paper, the proposed method uses ResNet101 as the backbone network. As shown in Fig. 1, according to the number of downsampling operations, the backbone architecture can be divided into five stages, denoted as {C1, C2, C3, C4, C5}, where the feature map resolution decreases from C1 to C5 and the feature maps within one stage have the same resolution. The outputs of the last residual block in {C2, C3, C4, C5} are denoted as {C2l, C3l, C4l, C5l}. Each of {C2l, C3l, C4l} passes through an FRC to generate the reinforced feature maps {C2d, C3d, C4d}, and the FRC shares its first residual block with {C3, C4, C5}. Compared with {C2l, C3l, C4l}, {C2d, C3d, C4d} have stronger object semantics and weaker background details. C5l, the last feature output of C5, already has very strong semantics, so we simply append a 3×3 convolution (stride = 1) on C5l to generate P5 and apply another 3×3 convolution (stride = 2) to downsample C5l and generate P6.
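As a small illustration of this last step (the 2048-channel C5 output of ResNet101 is standard, while the 256-channel pyramid width and the tensor shape are assumptions), P5 and P6 could be produced from C5l as follows:

```python
import torch
import torch.nn as nn

# C5l already carries strong semantics, so it is only re-projected and downsampled.
p5_conv = nn.Conv2d(2048, 256, kernel_size=3, stride=1, padding=1)  # -> P5
p6_conv = nn.Conv2d(2048, 256, kernel_size=3, stride=2, padding=1)  # -> P6 (downsampled)

c5l = torch.randn(1, 2048, 25, 25)   # assumed shape for an 800x800 input (stride 32)
p5 = p5_conv(c5l)                    # (1, 256, 25, 25)
p6 = p6_conv(c5l)                    # (1, 256, 13, 13)
```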

Figure 4 shows examples of feature maps before and after the FRC. The original image contains six oil tanks. Figure 4a and c show the feature maps of C2l and C3l, while Fig. 4b and d show C2d and C3d after the FRC. As shown in Fig. 4b and d, the background features in C2d and C3d are less cluttered than before, while the features of the six oil tanks are more semantically prominent compared with those in C2l and C3l.

Fig. 4 Visualized feature maps from a remote sensing image containing six oil tanks. a C2l feature map. b C2d feature map. c C3l feature map. d C3d feature map

As shown in Fig. 4, the FRC filters out some redundant details and strengthens the semantics of the feature maps, which makes the feature maps appear coarse-grained. Instead of simply using pooling and nearest neighbor upsampling, the FRC applies both a residual block and a deconvolution; it is a lightweight component that learns parameters during downsampling and upsampling. Therefore, the feature maps after the FRC retain the main details while enhancing the semantics. As shown in Fig. 1, we fuse the high-resolution feature map, the feature map produced by the FRC and the low-resolution feature map to obtain a better object feature representation. As a newly added layer, the FRC output compensates for the high-resolution feature map's lack of semantics and the low-resolution feature map's lack of details. Because more feature information is considered, the fused feature map achieves a better balance between details and semantics, making remotely sensed objects easier to identify.

2.2 Feature fusion strategies

As shown in Fig. 5a and b, two strategies are used to fuse the different feature maps. One, called the hard fusion strategy, merges the feature maps by element-wise summation; the other, called the soft fusion strategy, learns the fusion parameters.

Fig. 5 Illustration of the two fusion strategies. a The hard fusion strategy. b The soft fusion strategy

For the hard fusion strategy, the feature maps need to be unified to the same size and channel dimensions before fusion. Therefore, as shown in Figs. 1 and 5a, the feature map at level (i + 1) passes through a nearest neighbor upsampling and a 1×1 convolutional layer to generate F\(^{i}_{1}\), where nearest neighbor upsampling can be defined as:

$$ f(a+u,b+v)=\left\{\begin{array}{ll} f(a,b) & u\leq 0.5 \text{ and } v\leq 0.5 \\ f(a,b+1) & u\leq 0.5 \text{ and } v>0.5 \\ f(a+1,b) & u>0.5 \text{ and } v\leq 0.5 \\ f(a+1,b+1) & u>0.5 \text{ and } v>0.5 \end{array}\right. $$
(2)

where (a,b) are pixel coordinates in the feature map before upsampling, f(a,b) is the value of pixel (a,b), and (a + u, b + v) are the coordinates of an upsampled pixel mapped back into the original feature map, with u, v ∈ [0,1).
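As a worked example of (2), the NumPy sketch below performs 2× nearest neighbor upsampling by mapping each output pixel back to the input grid and rounding to the nearest pixel; the exact coordinate-alignment convention is an assumption.

```python
import numpy as np

def nearest_upsample_2x(feat):
    """feat: (H, W) feature map -> (2H, 2W) by nearest neighbor, as in eq. (2)."""
    h, w = feat.shape
    # map each output coordinate back to input space and take the nearest index
    rows = np.floor((np.arange(2 * h) + 0.5) / 2).astype(int)
    cols = np.floor((np.arange(2 * w) + 0.5) / 2).astype(int)
    return feat[np.ix_(rows, cols)]

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(nearest_upsample_2x(x))   # each value is repeated over a 2x2 block
```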

The reinforced feature map Cid serves as F\(^{i}_{2}\), and Cil passes through a 1×1 convolutional layer to output F\(^{i}_{3}\). After that, {F\(^{i}_{1}\), F\(^{i}_{2}\), F\(^{i}_{3}\)} are all unified to 256 channels. Then, element-wise summation is applied to merge F\(^{i}_{1}\), F\(^{i}_{2}\) and F\(^{i}_{3}\). Finally, we use a 3×3 convolution to learn the feature representation from the merged feature map and unify the feature dimensions, generating the final fusion feature map \(P_{hard}^{i}\). This can be expressed as (3).

$$ P_{hard}^{i}=f_{sum}({F_{1}^{i}},{F_{2}^{i}},{F_{3}^{i}})\otimes conv_{3\times3} $$
(3)

where ⊗ represents a convolution operation. To start the iteration, we attach a nearest neighbor upsampling and a 1×1 convolutional layer on C5l to produce F\(^{4}_{1}\) for fusion.
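A hedged PyTorch sketch of one hard-fusion level as in (3) is shown below; the 256-channel width follows the text, while the module and layer names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardFusion(nn.Module):
    """Element-wise sum of the upsampled upper level (F1), the FRC output (F2)
    and the lateral feature (F3), followed by a 3x3 conv, as in eq. (3)."""
    def __init__(self, top_ch, lateral_ch, out_ch=256):
        super().__init__()
        self.top_proj = nn.Conv2d(top_ch, out_ch, kernel_size=1)         # -> F1
        self.lateral_proj = nn.Conv2d(lateral_ch, out_ch, kernel_size=1)  # -> F3
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, upper, frc_out, lateral):
        f1 = self.top_proj(F.interpolate(upper, scale_factor=2, mode="nearest"))
        f2 = frc_out                        # F2: already 256 channels from the FRC
        f3 = self.lateral_proj(lateral)
        return self.smooth(f1 + f2 + f3)    # P_hard^i
```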

For the soft fusion strategy, as shown in Figs. 1 and 5b, we apply nearest neighbor upsampling to the previous final fusion feature map P(i+1) to generate F\(^{i^{\prime }}_{1}\), and the reinforced feature map Cid serves as F\(^{i^{\prime }}_{2}\). Then Cil, as F\(^{i^{\prime }}_{3}\), is concatenated with F\(^{i^{\prime }}_{1}\) and F\(^{i^{\prime }}_{2}\). After the concatenation, the first 3×3 convolution fuses the three feature maps, and the second 3×3 convolution extracts the feature representation from the fused feature map and unifies the feature dimensions to generate the final fusion feature map \(P_{soft}^{i}\). This process is defined in (4).

$$ P_{soft}^{i}=f_{concat}({F_{1}^{i}},{F_{2}^{i}},{F_{3}^{i}})\otimes conv_{3\times3}\otimes conv_{3\times3} $$
(4)

Similar to the hard fusion strategy, we attach a nearest neighbor upsampling on C5l to start the iteration.
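A corresponding sketch of one soft-fusion level as in (4), where the three maps are concatenated along the channel dimension and two 3×3 convolutions learn the fusion; again, the names and channel widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftFusion(nn.Module):
    """Concatenate the upsampled upper level, the FRC output and the lateral
    feature, then fuse with two 3x3 convolutions, as in eq. (4)."""
    def __init__(self, top_ch, frc_ch, lateral_ch, out_ch=256):
        super().__init__()
        in_ch = top_ch + frc_ch + lateral_ch
        self.fuse = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)    # first 3x3
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)  # second 3x3

    def forward(self, upper, frc_out, lateral):
        f1 = F.interpolate(upper, scale_factor=2, mode="nearest")
        fused = torch.cat([f1, frc_out, lateral], dim=1)   # channel-wise concat
        return self.smooth(self.fuse(fused))                # P_soft^i
```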

The final fusion feature maps are denoted as {P2, P3, P4}. {P2, P3, P4, P5, P6}, as a feature pyramid, share a 3×3 convolutional layer and two 1×1 convolutional layers to generate region proposals, where non-maximum suppression (NMS) [21] is used to filter out redundant region proposals. After NMS, the region proposals are unified to the same dimensions by ROI pooling [23] and then pass through two fully connected layers to produce the final predicted results.

2.3 Loss function of the proposed method

The classifier and regressor are shared across the fusion feature maps of all levels to generate the region proposals. The classifier predicts the class probability (object or background) of each anchor in the fusion feature maps. The regressor estimates the coordinate offsets of the object bounding boxes relative to the anchors' positions. We define the anchor areas {512², 256², 128², 64², 32²} on {P2, P3, P4, P5, P6}, respectively, and set the anchor aspect ratios {0.5, 0.75, 1, 1.5, 2} at each level to fit different object shapes.
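For illustration, the following sketch enumerates the anchor widths and heights implied by the areas and aspect ratios above; treating the ratio as height/width and rounding to integers are assumptions.

```python
import math

anchor_areas = {"P2": 512**2, "P3": 256**2, "P4": 128**2, "P5": 64**2, "P6": 32**2}
aspect_ratios = [0.5, 0.75, 1.0, 1.5, 2.0]   # assumed to mean height / width

def anchor_shapes(area, ratios):
    """Return (width, height) pairs that preserve the given area for each ratio."""
    shapes = []
    for r in ratios:
        w = math.sqrt(area / r)
        h = w * r
        shapes.append((round(w), round(h)))
    return shapes

for level, area in anchor_areas.items():
    print(level, anchor_shapes(area, aspect_ratios))
```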

During the training stage, anchors whose Intersection-over-Union (IoU) [23] with any ground-truth box is greater than 0.7 are labeled positive, and anchors whose IoU with all ground-truth boxes is lower than 0.3 are labeled negative, where IoU measures the ratio of intersection to union between two bounding boxes, defined as:

$$ \text{IoU}=\frac{area(r_{i})\cap area(g_{i})}{area(r_{i})\cup area(g_{i})} $$
(5)

where ri is a detection bounding box, gi represents a ground-truth box, and area(ri) is the area enclosed by the detection bounding box ri.
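A minimal Python sketch of (5) for axis-aligned boxes given as (x1, y1, x2, y2); the box format is an assumption.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2), eq. (5)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175, roughly 0.143
```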

Anchors with neither label are ignored during training. The loss function for the region proposals is defined in (6).

$$ \begin{array}{@{}rcl@{}} Loss &=&\sum\limits_{i\in Levels}(\frac{1}{N_{cls}^{i}}\sum\limits_{k\in A_{i}}L_{cls}(p_{k},p_{k}^{*})\\ &&+\lambda\frac{1}{N_{reg}^{i}}\sum\limits_{k\in A_{i}}p_{k}^{*}L_{reg}(c_{k},c_{k}^{*})) \end{array} $$
(6)

where i is the index of the level to which the fusion feature maps belong and Ai is the set of anchors defined at the i-th level. pk represents the predicted probability that anchor k contains an object, and p\(_{k}^{*}\) is the ground-truth label (1 for positive anchors, 0 for negative anchors). ck represents the coordinate offsets of the predicted bounding box, and c\(_{k}^{*}\) denotes the true coordinate offsets to the ground-truth. Lcls is the softmax classification loss, and Lreg is the smooth L1 loss, which learns the four coordinate transformations [23] of the predicted bounding box and minimizes the error between the predicted and ground-truth coordinates. N\(_{cls}^{i}\) is the number of anchors used for classification; similarly, N\(_{reg}^{i}\) is the number of anchors used for position regression. The weight λ balances Lcls and Lreg.
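A hedged PyTorch sketch of one level's contribution to (6), using cross-entropy for Lcls and smooth L1 for Lreg; the normalization by the number of sampled anchors and the default λ are assumptions.

```python
import torch
import torch.nn.functional as F

def rpn_level_loss(cls_logits, labels, box_pred, box_target, lam=1.0):
    """One level's term in eq. (6).

    cls_logits: (N, 2) object/background scores for the sampled anchors
    labels:     (N,)   long tensor, 1 for positive anchors, 0 for negative
    box_pred:   (N, 4) predicted coordinate offsets c_k
    box_target: (N, 4) ground-truth coordinate offsets c_k^*
    """
    # softmax classification loss averaged over the sampled anchors (1 / N_cls)
    l_cls = F.cross_entropy(cls_logits, labels)
    # smooth L1 regression loss counted only for positive anchors (p_k^* = 1)
    pos = labels == 1
    if pos.any():
        l_reg = F.smooth_l1_loss(box_pred[pos], box_target[pos], reduction="sum")
        l_reg = l_reg / labels.numel()   # 1 / N_reg normalization (assumed)
    else:
        l_reg = box_pred.sum() * 0.0
    return l_cls + lam * l_reg

# the total loss in (6) is the sum of these terms over the pyramid levels
```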

During training, the parameters are updated batch by batch, where the parameters are the weights of each network layer used for the network mapping. Taking one batch as an example, the training process can be divided into two parts. In the first part, a batch of images with the corresponding labels is fed into the proposed model; as shown in Fig. 1, after step one and step two the network outputs the predicted results, and the classification loss and regression loss are calculated from the predicted results and the labels. In the second part, based on the loss function, the gradient of each network parameter is calculated by the chain rule, and the parameters of each layer are updated according to the gradients and the learning rate. This batch-wise process is iterated during the training stage until the network converges.
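Schematically, one batch iteration could look like the PyTorch sketch below; the model interface that returns the two losses and the optimizer choice are assumptions (the paper's implementation uses MXNet).

```python
import torch

def train_one_batch(model, optimizer, images, targets):
    """One batch iteration: forward pass, loss computation, backward pass, update."""
    model.train()
    # part 1: forward through step one and step two, then compute the two losses
    cls_loss, reg_loss = model(images, targets)   # assumed model interface
    loss = cls_loss + reg_loss
    # part 2: gradients via the chain rule, then update with the learning rate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```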

3 Experimental study

To evaluate the effectiveness of the proposed method, experiments are conducted on four widely used remote sensing datasets, i.e. NWPU VHR-10 [3], VisDrone2018 [30], DOTA [28] and RSOD [29]. We compare the proposed method with several state-of-the-art methods, i.e. Faster R-CNN [23], R-FCN [4] and FPN [15]. The mean average precision (mAP) [16] and visualized results are adopted to evaluate the performance of these methods.

3.1 Datasets and evaluation metric

Table 1 lists the four commonly used remote sensing datasets, all of which have been released by well-known research teams in recent years. The main contents of these datasets are as follows.

Table 1 Details of four remote sensing datasets

In NWPU VHR-10, there are 650 remote sensing images with positive labels. We choose 500 images for training and the remaining 150 for testing. DOTA is a large-scale dataset including 2806 very high-resolution aerial images. We crop the images larger than 3000 pixels into 1280×1280 tiles, yielding 8813 images for training and 2993 for testing. For VisDrone2018 and RSOD, we use the default training and testing sets.

We use the VOC2007 11-point metric [7] to evaluate the performance of the proposed method, where mAP@[0.5:0.95] is the mean of the mAPs computed at IoU thresholds from 0.5 to 0.95 with a step of 0.05, and mAP@0.5 and mAP@0.75 are reported for more detailed evaluation. The notation used in the evaluation metrics is listed in Table 2. The evaluation process is as follows:

$$ Precision=\frac{TP}{TP+FP} $$
(7)
$$ Recall=\frac{TP}{TP+FN} $$
(8)

The average precision (AP) of each category is computed from precision and recall using 11-point interpolation, where the interpolated precision \(p_{interp}(r)\) is the maximum precision over all recalls not smaller than r:

$$ AP =\frac{1}{11}\sum\limits_{r \in \{0,0.1,\ldots,1\}} p_{interp}(r) $$
(9)

mAP is the mean of the APs over all categories, as shown in (10).

$$ mAP =\frac{\sum\limits_{c \in C}AP_{c}}{|C|} $$
(10)
Table 2 Notation used in the evaluation metrics
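A NumPy sketch of the 11-point AP in (9) and the mAP in (10), assuming per-category precision/recall pairs have already been computed from the TP, FP and FN counts.

```python
import numpy as np

def voc11_ap(recalls, precisions):
    """11-point interpolated AP: average of the maximum precision at
    recall >= r for r in {0, 0.1, ..., 1.0}, as in eq. (9)."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        p_interp = precisions[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap

def mean_ap(per_class_pr):
    """per_class_pr: {class_name: (recalls, precisions)} -> mAP over classes, eq. (10)."""
    aps = [voc11_ap(np.asarray(r), np.asarray(p)) for r, p in per_class_pr.values()]
    return sum(aps) / len(aps)

# toy example with a single (hypothetical) category
recalls = np.array([0.1, 0.4, 0.8])
precisions = np.array([1.0, 0.8, 0.5])
print(mean_ap({"airplane": (recalls, precisions)}))
```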

The proposed method is implemented in MXNet and trained on a graphics workstation (two E5-2609v4@1.70GHz CPUs, 32 GB of memory and two NVIDIA GTX 1080Ti GPUs). During training, we feed one remote sensing image per batch and train for ten epochs until convergence, where one epoch is a complete pass over all training images. The learning rate is set to 0.005 for the first 2/3 of the epochs and 0.0005 for the last 1/3. To provide sufficient training data, we augment the datasets by flipping each image. The backbone networks of all methods are pre-trained on the ImageNet dataset [24] before training.

3.2 Experimental results and analysis

We compare the performance of the proposed method with three state-of-the-art methods: Faster R-CNN, R-FCN and FPN. Faster R-CNN applies a region proposal network to obtain region proposals, which are then unified to the same size by a pooling operation for the subsequent classification and regression. R-FCN uses position-sensitive score maps to obtain better classification and position regression results. FPN builds a feature pyramid of the image to detect objects of different sizes. For a fair comparison, these state-of-the-art methods also use ResNet101 as the backbone, and the hyperparameters reported to give the highest performance in the original papers are used during training. All mAPs shown in the tables are given as percentages (%).

Tables 3, 4, 5 and 6 show the performance of the proposed method compared with the other state-of-the-art methods on the four widely used remote sensing image datasets, respectively, where the best results are shown in bold. Overall, the proposed method outperforms the other three methods in terms of mAP@[0.5:0.95], mAP@0.5 and mAP@0.75. Compared with detecting on a single-scale feature map (Faster R-CNN, R-FCN), object detection on multi-scale feature maps (FPN and the proposed method) can distribute the detection task across scales. The large-scale (high-resolution) feature maps contain rich details and are well suited to detecting small objects, while the small-scale (low-resolution) feature maps have strong semantics, which benefits the detection of large objects. As a result, detection on multi-scale feature maps yields more accurate results than detection on a single-scale feature map: the methods using multi-scale features achieve mAPs more than 10% higher than those using a single-scale feature.

Table 3 The results on the NWPU VHR-10 dataset with the same input size (800×800 pixels)
Table 4 The results on the VisDrone2018 dataset with the same input size (1280×800 pixels)
Table 5 The results on the DOTA dataset with the same input size (1280×1280 pixels)
Table 6 The results on the RSOD dataset with the same input size (800×800 pixels)

Compared with FPN, the proposed method applies the FRC to provide more feature information for the fusion, reinforcing the feature representation of the objects. To prove the effectiveness of the FRC, we compare the proposed method with FPN under the hard fusion strategy and the soft fusion strategy, respectively. As shown in the results, under both fusion strategies the proposed method achieves higher AP on most categories and better mAP@[0.5:0.95], mAP@0.5 and mAP@0.75 than FPN. This indicates that adding the FRC step to the fusion process can effectively improve the accuracy of object detection.

Across the datasets, the two fusion strategies perform similarly on NWPU VHR-10, DOTA and RSOD: the difference in mAP@[0.5:0.95] between them is less than 1%. On VisDrone2018, the hard fusion strategy performs better than the soft fusion strategy on mAP and on most categories' AP. We found some potential reasons in the visualized results.

Figures 6, 7, 8 and 9 show visualized results of the proposed method on the four remote sensing image datasets, respectively, where the green bounding boxes represent the ground-truth, the red bounding boxes denote the detection results of the hard fusion strategy, and the detection results of the soft fusion strategy are drawn with blue lines. Figure 7 shows the detection results on the VisDrone2018 dataset, which contains many occluded objects. In Fig. 7, the hard fusion strategy detects some hard-to-identify objects better than the soft fusion strategy, such as the occluded cars and pedestrians in Fig. 7a, c and d. However, this also makes the hard fusion strategy prone to producing some wrong results, such as the false recognition of some pedestrians in Fig. 7a. In Figs. 6 and 8, the soft fusion strategy detects some small and crowded objects without occlusion better than the hard fusion strategy, such as the tennis courts in Fig. 6c and d and the vehicles and ships in Fig. 8a and c. Unfortunately, some of these objects are not labeled, so these detections may not increase the mAP. The main reason for these results may be that the hard fusion strategy merges feature maps by element-wise summation directly, which preserves the features of occluded objects in the feature maps, whereas the soft fusion strategy merges feature maps by concatenation, which may filter out some occluded-object features after the two 3×3 convolutions while enhancing the features of unoccluded objects.

Fig. 6 Visualized detection results on the NWPU VHR-10 dataset. Green rectangles: ground-truth. Red rectangles: detection results of hard fusion. Blue rectangles: detection results of soft fusion

Fig. 7 Visualized detection results on the VisDrone2018 dataset. Green rectangles: ground-truth. Red rectangles: detection results of hard fusion. Blue rectangles: detection results of soft fusion

Fig. 8 Visualized detection results on the DOTA dataset. Green rectangles: ground-truth. Red rectangles: detection results of hard fusion. Blue rectangles: detection results of soft fusion

Fig. 9 Visualized detection results on the RSOD dataset. Green rectangles: ground-truth. Red rectangles: detection results of hard fusion. Blue rectangles: detection results of soft fusion

4 Conclusion

In this paper, the FB-FRC is proposed for remote sensing image object detection. We use the FRC to strengthen the semantics and filter out the redundant details of the high-resolution feature maps, providing more feature information for the fusion. Two fusion strategies are then designed to enhance the feature representation and further improve detection performance. The experiments on four datasets show that, after adding the FRC, the proposed method performs better than the three state-of-the-art methods. Regarding the two fusion strategies, the experimental results show that the hard fusion strategy performs better on occluded objects, while the soft fusion strategy detects some small and crowded unoccluded objects better. In practical applications, one can train and test both fusion strategies and select the better one for the given scenario.

In the future, we will focus on increasing the accuracy of position regression by adding direction parameters, and we will try to use generative adversarial networks to reconstruct occluded objects to further improve detection accuracy.