1 Introduction

With the development of computer vision, object detection in remote sensing images has become a hot topic [14, 16, 29, 36]. Aircraft detection in remote sensing images is increasingly important in both military and civilian fields and has been an active research area in recent years. Categorizing objects and locating small-scale aircraft are among the most important tasks in object detection. Categorization aims to separate objects from the background, while localization aims to locate each object by drawing a bounding box around it.

To detect objects in remote sensing images, researchers leverage many approaches, such as machine-learning-based [17] and deep-learning-based [6, 20] detection techniques. The machine-learning-based approach proceeds as follows. Firstly, it extracts feature information from images, such as texture, shape, and gradient. Secondly, the feature information is fed into a Bayesian network [31] or a support vector machine (SVM) [27] to train a classifier. Finally, the trained classifier is used to detect new objects. Although the machine-learning-based approach can perform detection effectively, it suffers from complicated feature extraction. Furthermore, it has difficulty extracting shallow features that carry little information. As a result, the ability of the machine learning algorithm [34] is weakened, and recognition accuracy tends to be low when the background of the remote sensing image is complex. Therefore, with the growth of computer processing capabilities, remote sensing image detection based on deep learning has become the mainstream. It can learn multi-level image features from a large amount of data and classify objects automatically, and it offers strong generalization ability and high robustness. However, deep-learning-based aircraft detection in remote sensing images still needs improvement, especially for small-scale aircraft. The main reason is that aircraft appear at different scales in an image and small-scale aircraft are hard to identify.

To improve the accuracy of identifying small-scale aircraft, we propose a novel approach called MFRC based on the K-means algorithm [8] and Faster-RCNN [13]. Firstly, the K-means algorithm is used to cluster the bounding boxes of the aircraft and to improve the anchors in Faster-RCNN, so that the anchors better match the ground truth of the aircraft. Secondly, the structure of the VGG16 [18] feature extraction network is improved by reducing the number of pooling layers from four to two. Finally, Soft-Non Maximum Suppression (Soft-NMS) [1] is leveraged to optimize the bounding boxes of the aircraft. To evaluate the effectiveness of our approach, we conduct experiments with MFRC and compare it with other methods. The results show that MFRC achieves a higher mAP than existing approaches, which demonstrates that our approach can improve the accuracy of aircraft detection.

The main contributions can be summarized as follows.

  • A novel approach based on the K-means algorithm and Faster-RCNN is proposed to detect small-scale aircraft. The feature extraction network and the bounding boxes are optimized.

  • We compare MFRC with existing approaches, demonstrating the effectiveness of our approach.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the detection approach for the small-scale aircraft based on Faster-RCNN. Section 4 illustrates the experimental results and Section 5 summarizes this paper.

2 Related works

This section examines the related works based on machine learning and deep learning approaches.

2.1 Machine-learning-based approach

Much work has been done on machine-learning-based approaches. Zhu et al. [38] employ an optimized invariant to train an SVM to identify aircraft. They describe a single feature fully and effectively. Li et al. [12] propose an aircraft detection algorithm based on visual symmetry and saliency detection. Their approach can identify aircraft objects in remote sensing images quickly. However, because it lacks the ability to handle complex environments, it does not achieve good accuracy in such environments. Liu et al. [15] propose a two-stage matching strategy. In the coarse-grained matching stage, the candidate object is obtained from the edge information of an image. In the fine-grained matching stage, the shape feature is used to match the candidate object. This method can effectively identify the target aircraft in a remote sensing image. However, its recognition accuracy is determined by the shape segmentation and the prior data of the object.

2.2 Deep-learning-based approach

Deep-learning-based approaches can automatically learn and extract features from images. Many researchers detect objects in an image with a convolutional neural network (CNN) because of its generalization and robustness. Yang et al. [33] propose a method called MFCN to detect aircraft by combining a fully convolutional network (FCN) [19] with a Markov random field (MRF). They use the FCN to generate an object-sensitive map that excludes a large number of objectless areas. The map is fed into the multi-MRF algorithm to improve the object shape. Wan et al. [30] propose a shadow processing algorithm with threshold random sampling. Cao et al. [2] propose a detection algorithm based on YOLO-v3 for remote sensing images. Their method effectively improves accuracy and is practical. Chen et al. [3] propose an aircraft detection approach for high-resolution remote sensing images. Their method detects objects with a CNN and leverages a semantic segmentation CNN to implement airplane recognition. Experimental results on a collected airplane dataset demonstrate the effectiveness of their method. Li et al. [11] propose a detection method for aircraft by using a CNN and simulating image analysis. Their method can accurately recognize aircraft objects in remote sensing images.

The most representative two-stage object detector is the RCNN series [5], including Fast RCNN [4] and Faster-RCNN [24]. Specifically, Faster-RCNN is a two-stage detector. The first stage generates candidate regions, while the second stage classifies the candidate regions and adjusts their positions. This method has high recognition accuracy. However, it is difficult to achieve real-time detection. As the network depth increases, the accuracy of Faster-RCNN [32] degrades beyond some point. Faster-RCNN-res101 [7] uses the residual learning method. This method connects the output of the former layer to the input of the next two layers, which can eliminate the degradation as the depth increases. For one-stage object detection, one of the most representative models is YOLO [21,22,23] due to its fast detection speed and high recognition rate for small objects. Some works [9, 25, 35] preprocess the aircraft image and use YOLO or Faster-RCNN to detect the aircraft. The results show that their methods are suitable for detecting shadow aircraft. Li et al. [10] propose a detection method based on the Faster-RCNN-mobile network to identify the types and locations of surface defects. Zhou et al. [37] propose a multi-scale detection network to detect small-scale aircraft. They add a smaller detection scale to the Darknet-53 backbone so that the convolutional neural network can detect aircraft of small size.

3 Small-scale aircraft detection

This section first presents the detection framework and then depicts each component in detail.

3.1 Detection framework

The framework of MFRC for small-scale aircraft is presented in Fig. 1. Firstly, an improved VGG16 network is used to extract the features of the aircraft image. Secondly, the K-means algorithm is used to cluster the aircraft bounding boxes in the remote sensing images, and the anchors in the region proposal network (RPN) are improved accordingly. The RPN generates feature maps with region proposals, while Soft-NMS is used to optimize the aircraft bounding boxes. Finally, the features normalized by RoI pooling are fed into Fast RCNN to obtain the bounding boxes and the categories of the aircraft.

Fig. 1 The framework of MFRC

3.2 Clustering bounding box for aircraft

In a remote sensing image, the orientation and the distance of an aircraft often lead to different scales. To obtain the scale and distribution, the K-means algorithm is used to cluster the aircraft bounding boxes. K-means partitions a given data set of n data objects into k clusters, and each cluster has at least one data object. Since data objects that belong to the same cluster have high similarity and those that belong to different clusters have low similarity, we improve the RPN based on the clustering results.

The K-means algorithm divides the data set C into k clusters C1,C2,...,Ck. The loss function is computed by (1).

$$ E=\sum_{i=1}^{k}\sum_{x\in C_{i}}\|x-u_{i}\|^{2} $$
(1)

where x represents an element of the point set in cluster Ci (1 ≤ i ≤ k), and ui represents the centroid of cluster Ci, which is computed by (2).

$$ u_{i}=\frac{1}{|C_{i}|}\sum_{x\in C_{i}}x $$
(2)

The anchor in the RPN network is improved to obtain the scales and distribution of the aircraft data set by the following steps.

(1) We randomly select k sample points in an aircraft image as the initial centroid of each cluster.

(2) The distance between each sample point and the centroid of each cluster is calculated, and each sample is assigned to the nearest cluster.

(3) The centroid of each cluster is recalculated based on Eq. (2).

(4) We set a threshold according to empirical experience and calculate the distance between the new centroid and the original centroid. Steps (2) and (3) are repeated while this distance is greater than the threshold.

(5) The algorithm terminates when the distance is less than the threshold.
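
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: it assumes each bounding box is represented by its (width, height) pair and uses the Euclidean distance of Eq. (1); the function names and the `threshold`, `max_iter`, and `seed` parameters are hypothetical.

```python
import math
import random


def kmeans_boxes(boxes, k, threshold=1e-3, max_iter=100, seed=0):
    """Cluster (width, height) pairs with plain K-means.

    Returns (centroids, assignments). The default parameter values are
    illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    # Step (1): pick k samples as the initial centroids.
    centroids = rng.sample(boxes, k)
    assignments = [0] * len(boxes)
    for _ in range(max_iter):
        # Step (2): assign every sample to its nearest centroid.
        assignments = [
            min(range(k), key=lambda i: math.dist(box, centroids[i]))
            for box in boxes
        ]
        # Step (3): recompute each centroid as the mean of its members, Eq. (2).
        new_centroids = []
        for i in range(k):
            members = [b for b, a in zip(boxes, assignments) if a == i]
            if not members:
                # Keep the old centroid if a cluster happens to be empty.
                new_centroids.append(centroids[i])
                continue
            new_centroids.append(
                (sum(w for w, _ in members) / len(members),
                 sum(h for _, h in members) / len(members))
            )
        # Steps (4)-(5): stop once no centroid moved more than the threshold.
        shift = max(math.dist(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, assignments


def clustering_loss(boxes, centroids, assignments):
    """Sum of squared distances to the assigned centroid, Eq. (1)."""
    return sum(math.dist(b, centroids[a]) ** 2
               for b, a in zip(boxes, assignments))
```

For example, calling `kmeans_boxes(boxes, k=2)` on a list that mixes boxes of roughly 20×20 and roughly 200×200 separates the two size groups, and the resulting centroids indicate which anchor scales are worth covering.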

3.3 Improving VGG16 network

In Faster-RCNN, a small aircraft with an original scale of around 20×20 is downsampled by a factor of 16, so its feature scale is only 1×1 or 2×2. Such a small scale cannot fully describe the features of an aircraft and therefore reduces the model’s ability to recognize it. To solve this problem, we reduce the number of pooling layers in VGG16 from 4 to 2. The image is then downsampled by a factor of 4 only, and the features of small-scale aircraft can be preserved.
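
The arithmetic behind this choice can be checked with a short sketch. It assumes that each pooling layer halves the spatial resolution (stride 2, as in the standard VGG16 max-pooling layers); the function name is hypothetical.

```python
def feature_size(object_size, num_pooling_layers, pool_stride=2):
    """Approximate side length of an object on the feature map.

    Returns (side_length, total_stride), assuming every pooling layer
    downsamples by `pool_stride`.
    """
    total_stride = pool_stride ** num_pooling_layers
    return object_size / total_stride, total_stride


# A 20x20 aircraft with 4 pooling layers: stride 16, about 1.25 cells per side.
print(feature_size(20, 4))
# The same aircraft with 2 pooling layers: stride 4, 5 cells per side.
print(feature_size(20, 2))
```

With two pooling layers a 20×20 aircraft still covers about 5×5 feature cells, which leaves more spatial detail for the later stages than a single cell does.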

3.4 Optimizing bounding box

Non-Maximum Suppression (NMS), an important part of the object detection algorithm, uses complete suppression to remove duplicate aircraft bounding boxes in Faster-RCNN. With NMS, aircraft can be missed when objects overlap heavily. In contrast, Soft-NMS does not remove an overlapping bounding box outright but reduces its confidence. When the overlap between a bounding box and the currently selected highest-confidence box is larger than a threshold, the confidence of that box is reduced. The box is then compared with the remaining bounding boxes again and is discarded only when its confidence falls below a threshold. Therefore, Soft-NMS is applied to the Faster-RCNN algorithm to improve the accuracy of aircraft detection.
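
A minimal sketch of this procedure is given below. It assumes the linear decay variant of Soft-NMS, in which the confidence of an overlapping box is multiplied by (1 − IoU); the paper does not state which decay function is used, and the function names, the box format (x1, y1, x2, y2), and the default thresholds are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, iou_threshold=0.3, score_threshold=0.001):
    """Linear Soft-NMS. Returns a list of (box, score) in selection order."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Select the remaining box with the highest current confidence.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best)
        kept.append((best_box, best_score))
        decayed = []
        for box, score in candidates:
            overlap = iou(best_box, box)
            # Reduce the confidence instead of deleting the box.
            if overlap > iou_threshold:
                score *= 1.0 - overlap
            # Drop the box only once its confidence is negligible.
            if score >= score_threshold:
                decayed.append((box, score))
        candidates = decayed
    return kept
```

With two heavily overlapping boxes, standard NMS would delete the lower-scoring one, whereas this sketch keeps it with a reduced confidence, so a second aircraft parked close to the first can still be reported.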

4 Evaluation

This section firstly introduces the experimental setup and dataset. Secondly, it evaluates the K-means algorithm on the bounding boxes of the remote sensing image. Finally, it presents the results and comparison with existing methods.

4.1 Environmental setup

All measurements are conducted on a 2.21 GHz Intel Xeon processor running 64-bit Linux (2.6.38 kernel) with 8 GB RAM. The GPU is an NVIDIA GeForce GTX 1060. The TensorFlow version is 1.12.0, the CUDA version is 9.0, and the cuDNN version is 7.6.5.

In our model, we set the batch size to 256 and the learning rate to 1e-5 for model training.

4.2 Dataset

The dataset is selected from the NWPU Dataset [10], which contains many remote sensing images of aircraft with different sizes, shapes, and orientations. Figure 2 presents some images in this dataset. We augment the dataset to enhance the model’s generalization and robustness. Considering that the orientation of the aircraft may differ from image to image, we rotate each aircraft image by 45, 90, 135, 180, 225, 270, and 315 degrees. Consequently, we obtain more than 56,000 data items. We take 60 percent of the dataset as the training set, 30 percent as the validation set, and 10 percent as the test set. Besides, the LabelImg tool [26] is used to label these aircraft images.
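
The 60/30/10 split can be reproduced with a sketch such as the following. The shuffling, the fixed seed, and the function name are assumptions made for illustration; the paper does not describe how the items are assigned to each subset.

```python
import random


def split_dataset(items, train=0.6, val=0.3, seed=0):
    """Shuffle items and split them into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    # Whatever remains after train and validation becomes the test set.
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

When rotated copies of the same source image exist, it may be preferable to split by source image before augmentation so that near-duplicates do not end up in both the training and the test set.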

Fig. 2 Aircraft remote sensing images selected from NWPU Dataset [10]

4.3 Evaluation criteria

The Precision-Recall curve is a graph with Precision values on the y-axis and Recall values on the x-axis. AP can be computed by obtaining the area under the Precision-Recall curve. Generally, the better the classifier is, the higher the AP is. A sample curve can be seen in Fig. 4.

Mean average precision (mAP) is one of the most important criteria for object detection algorithms. The mAP is computed as the average of the APs over all categories. Its value lies in the interval [0, 1], and a model is considered better the closer its mAP is to 1.
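
To make the definitions concrete, the sketch below computes AP as the area under the Precision-Recall curve and mAP as the mean of the per-category APs. It assumes all-point interpolation of the curve (precision is made non-increasing in recall before integration); the paper does not specify the interpolation scheme, and the function names and input format are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve (all-point interpolation)."""
    # Rank detections from the most to the least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for j in range(len(precisions) - 2, -1, -1):
        precisions[j] = max(precisions[j], precisions[j + 1])
    # Sum the rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

When the dataset has a single category, as with aircraft only, the mAP coincides with the AP of that category.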

4.4 Results of K-means cluster analysis

The clustering results are presented in Fig. 3. The shape of each cluster is close to a square. Most bounding boxes are smaller than 300×300, and the smallest scale is around 20×20. Based on the clustering results, we improve the original anchors in the RPN, whose scales are (128, 256, 512) and whose ratios are (0.5, 1, 2). The new anchors, with scales (16, 32, 64, 128, 256, 512) and ratios (0.7, 1, 1.5), are used as the bounding boxes corresponding to each element of the feature map. This makes the detection results more consistent with the ground truth of the aircraft bounding boxes in the dataset.
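
The effect of the new settings on the anchor shapes can be illustrated as follows. The sketch assumes a common Faster-RCNN convention in which each anchor has an area of scale² and the ratio denotes height divided by width; the exact convention of the implementation used here is not stated, and the function name is hypothetical.

```python
import math


def anchor_shapes(scales, ratios):
    """Return (width, height) for every scale/ratio pair.

    Each anchor keeps an area of scale**2; ratio is taken as height / width.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s / math.sqrt(r), s * math.sqrt(r)))
    return shapes


original = anchor_shapes((128, 256, 512), (0.5, 1, 2))
improved = anchor_shapes((16, 32, 64, 128, 256, 512), (0.7, 1, 1.5))
print(len(original), len(improved))
```

Under these assumptions the original setting yields 9 anchor shapes per feature-map element, the smallest with a side of about 90, while the new setting yields 18 shapes that reach down to roughly 16×16 and stay closer to square, which matches the clustered boxes better.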

Fig. 3 Results of the bounding box using K-means

4.5 Results of the improved Faster-RCNN

We conduct experiments under different conditions: (1) 4 or 2 pooling layers; (2) with or without the K-means algorithm; (3) with or without Soft-NMS.

The experimental results are presented in Table 1. When the VGG16 network has 4 pooling layers and neither the K-means algorithm nor Soft-NMS is used, the mAP is 87.39%. When the number of pooling layers is reduced to 2 under the same configuration, the mAP increases by 0.32%, which shows that reducing the number of pooling layers helps to extract more features of small-scale aircraft. Note that the increase is modest because reducing the number of pooling layers weakens the generalization ability of the model for normal-scale aircraft. If Soft-NMS is used, the mAP improves by 0.5% over the model without Soft-NMS, which indicates that Soft-NMS helps to handle overlapping boxes and to improve the accuracy of the aircraft bounding boxes. Furthermore, if the K-means algorithm is used, the mAP increases to 89.64%, which is 1.93% higher than the model with Soft-NMS. The improved anchors bring the bounding boxes closer to the ground truth of the aircraft in the dataset and therefore improve the detection accuracy. We notice that using the K-means algorithm to improve the anchors yields a higher accuracy than optimizing NMS. When both the K-means algorithm and Soft-NMS are used, the mAP reaches 90.39%, which shows that MFRC has higher accuracy than the other configurations for small-scale aircraft detection.

Table 1 Experimental results

4.6 Comparing with other models

We compare MFRC with Faster-RCNN, Faster-RCNN-res101 [28], YOLO v4, and Faster-RCNN-mobile [10].

The mAP results are presented in Table 2. Faster-RCNN-mobile has the lowest mAP, 81.23%, among all models. A possible reason is that Faster-RCNN-mobile uses separable convolution layers to build a lightweight deep neural network. It focuses on compressing the model and reducing the number of parameters at the expense of accuracy. For Faster-RCNN and Faster-RCNN-res101, the mAP is about 6% higher than that of Faster-RCNN-mobile. However, it is about 3% lower than that of MFRC. The main reason is that Faster-RCNN-res101 does not consider the scale, proportion, and overlap of the objects in an image. Compared with Faster-RCNN, Faster-RCNN-res101 uses a feed-forward neural network with skip connections to extract the deep features of the image. Therefore, the accuracy of Faster-RCNN-res101 is 0.48% higher than that of Faster-RCNN. However, neither Faster-RCNN nor Faster-RCNN-res101 takes the characteristics of the data into careful consideration. The mAP of YOLO v4 is 88.29%, which is 0.9% higher than that of Faster-RCNN. For aircraft detection, Faster-RCNN first generates candidate regions and then classifies the candidate regions and adjusts their positions. Its disadvantage is that it is not sensitive to small-scale objects. By contrast, YOLO v4 directly generates the category probability and position coordinates of the object. The feature layer of YOLO v4 combines a feature pyramid with down-sampling, so it performs well on small target detection. However, the mAP of YOLO v4 is 2.1% lower than that of MFRC. A possible reason for the high mAP of MFRC is that it carefully considers the characteristics of the data and improves the network structure accordingly. Therefore, MFRC has the highest mAP among all models, which shows that it is effective for detecting aircraft in remote sensing images.

Table 2 Experimental results of different models

The advantage of a model can be intuitively illustrated by the Precision-Recall curve in Fig. 4. The precision initially remains at 1.0 as the recall increases and then decreases as the recall approaches 1.0. The mAP can be obtained by computing the area under this Precision-Recall curve. The Precision-Recall curve confirms the effectiveness of MFRC.

Fig. 4 Precision-Recall curve of the MFRC method for aircraft detection in remote sensing images

We also compare MFRC with the original method (without improvement) by letting them detect the same images. The experimental results are shown in Fig. 5. The results of the original method are shown in Fig. 5(a1), (a2), and (a3), while the results of the improved method are presented in Fig. 5(b1), (b2), and (b3). We can see from Fig. 5(a1) that only a small number of small-scale aircraft are detected. By contrast, more small-scale aircraft are detected in Fig. 5(b1). The bounding boxes of the aircraft overlap and are irregular in Fig. 5(a2). Our method optimizes the bounding boxes to make the result more accurate in Fig. 5(b2). Two aircraft are detected in Fig. 5(a3), whereas more aircraft are detected in Fig. 5(b3). Compared with the original method, MFRC detects more small-scale aircraft and removes redundant bounding boxes. The experimental results show the effectiveness of MFRC.

Fig. 5 Detection results of the original method and the MFRC method

5 Conclusion

This paper proposes an improved detection approach for small-scale aircraft in remote sensing images based on Faster-RCNN. Three major improvements are made. Firstly, K-means is employed to cluster the aircraft bounding boxes and improve the anchors in the RPN. Secondly, to extract the features of small-scale aircraft, the number of pooling layers of the VGG16 network is reduced from four to two. Finally, Soft-NMS is used to optimize the bounding boxes of the aircraft. The experimental results show that the proposed detection method for small-scale aircraft based on Faster-RCNN has higher accuracy than the existing methods and is effective in detecting small-scale aircraft. In future work, we will continue to improve the accuracy and investigate whether our approach works well in complex environments, such as low-resolution images or cloudy conditions.