1 Introduction

Equipped with cameras and embedded systems, Unmanned Aerial Vehicles (UAVs) gain computer vision capabilities and are widely used for traffic monitoring, pedestrian tracking, and infrastructure inspection. With the rapid development of deep neural networks, detection frameworks based on them have gradually become the mainstream technology for object detection. Although the relevant detectors (e.g., the R-CNN family [9,10,11, 29], the YOLO family [1, 26,27,28], and the SSD family [7, 19]) achieve good performance on natural images, they do not achieve satisfactory results on aerial images.

Compared with the natural images in datasets such as COCO [18], ImageNet [4], and Pascal VOC [6], aerial images have the following features:

(1) Small objects with non-uniform distribution. Generally, a small object refers to an object whose area in the image is less than \(32\times 32\) pixels. The main problems of small objects are low resolution and a small amount of information, which lead to weak feature expression. The traditional way to handle small objects is to enlarge the image, which increases both the processing time and the memory needed to store large feature maps. Another common method is to uniformly crop an image into several regions [8, 20] and then run detection in each region, which solves the problem of storing a large feature map. However, uniform cropping ignores the sparsity of objects: some regions may contain few or no objects, which wastes a lot of computing resources. As can be seen from Fig. 1, the object distribution in an aerial image is uneven, and objects are highly clustered in certain regions. Therefore, one way to improve detection efficiency is to focus the detector on the regions that contain a large number of objects.

Fig. 1. Visualization of uniform cropping vs. cluster cropping. The first row is an example of uniform cropping. The second row is an example of cluster cropping. Compared with uniform cropping, cluster cropping has the following advantages: (1) it concentrates computing resources on cluster regions with a large number of objects, and (2) it avoids background interference.

(2) Diversity of object size. When UAV images are collected, the shooting height varies from tens of meters to hundreds of meters, which leads to a huge difference in size among objects of the same category. This makes it hard to set the anchor sizes for an anchor-based detector, and it makes it difficult for an anchor-free detector to directly regress the width and height of the object. Therefore, it is necessary to reduce the difference in object size among images as much as possible.
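To make the cost of uniform cropping described in feature (1) concrete, the following minimal sketch counts how many object centers fall into each tile of a uniform grid and applies the \(32\times 32\) small-object criterion. The function names, the 2 x 2 grid, and the box format `(x1, y1, x2, y2)` are illustrative assumptions, not part of the proposed method.

```python
def count_objects_per_tile(boxes, image_w, image_h, cols=2, rows=2):
    """Count object centers in each tile of a uniform cols x rows grid.

    boxes: iterable of (x1, y1, x2, y2) in pixels (assumed format).
    Returns a rows x cols nested list of counts.
    """
    tile_w = image_w / cols
    tile_h = image_h / rows
    counts = [[0] * cols for _ in range(rows)]
    for x1, y1, x2, y2 in boxes:
        cx = (x1 + x2) / 2.0
        cy = (y1 + y2) / 2.0
        # Clamp so that centers on the right/bottom border stay in range.
        col = min(int(cx // tile_w), cols - 1)
        row = min(int(cy // tile_h), rows - 1)
        counts[row][col] += 1
    return counts


def is_small(box, limit=32 * 32):
    """Small-object criterion from the text: area below 32 x 32 pixels."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1) < limit
```

Tiles whose count is zero are still passed through the detector under uniform cropping, which is the wasted computation that cluster cropping is meant to avoid.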

To solve the above problems, this paper proposes a coarse-to-fine detection framework, CRENet. As shown in Fig. 2, CRENet is composed of three parts: a coarse detection network (CNet), a cropping module, and a fine detection network (FNet). The aerial image is first sent to CNet to obtain the initial detection results, which give a rough distribution of the objects. The initial detection results are then sent to the cropping module, which works in three steps. In the first step, the cluster regions are obtained through a clustering algorithm. In the second step, the difficulty score of each cluster region is calculated; we assume that a cluster region with a higher difficulty score brings a greater accuracy gain to the detector, so cluster regions with a low score are deleted to improve the detection efficiency of the model. In the third step, the remaining difficult cluster regions are passed to a Gaussian scaling function (GSF), which calculates a scaling factor for each of them. We refer to a difficult cluster region after scaling as an ROI (region of interest). Finally, the ROIs are sent to the fine detection network to obtain fine detection results, which are fused with the coarse detection results.

Fig. 2. Overview of the CRENet framework. First, CRENet sends the aerial image to the coarse detector CNet to get the initial prediction. Then a clustering algorithm generates cluster regions from the initial prediction. Next, difficult cluster regions are mined and scaled with the Gaussian scaling function (GSF). The scaled difficult cluster regions (ROIs) are sent to the fine detector. Finally, the detection result of the global image is fused with the detection results of the ROIs to generate the final detection result. See Sect. 3 for more details.

Compared with previous detectors, the proposed CRENet has the following advantages: (1) The computational resources are concentrated on dense regions with a large number of objects, which reduces the computational cost and improves the detection efficiency; (2) Because cluster regions vary in size, a clustering algorithm is used directly in place of a network to predict them, which avoids anchor setting and the processing of overlapping cluster regions; (3) Calculating the difficulty score of each cluster region and eliminating the cluster regions that can hardly bring any accuracy gain speeds up the computation; (4) Using the Gaussian scaling function (GSF) reduces the difference in object size among different images.

To sum up, the contributions of this paper are as follows:

1) A new detector, CRENet, is proposed that adaptively searches for and scales regions with dense objects for fine detection.

2) A Gaussian scaling function (GSF) is proposed to address the large size difference of objects in aerial images and improve the detection accuracy.

3) We achieve better performance on the representative aerial image dataset VisDrone [45] while using fewer images.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 describes the proposed approach in detail. Section 4 presents the experimental results, and Sect. 5 concludes the paper.

2 Related Work

In this section, we review representative anchor-based and anchor-free detectors for natural images and some recent efforts on aerial images. Finally, we focus on methods that search for regions of interest for fine detection.

2.1 Generic Object Detection

At present, mainstream object detection algorithms are mainly based on deep convolutional neural networks and can be divided into two types: anchor-based detectors and anchor-free detectors. Anchor-based detectors can be further divided into two categories: two-stage detectors and single-stage detectors. A two-stage detector consists of two steps: proposal region extraction and region classification. The first stage produces proposal regions that contain approximate location information of the objects. In the second stage, the proposal regions are classified and their positions are adjusted. Representatives of two-stage detectors include the R-CNN family [9, 10, 29] and Mask R-CNN [11]. A single-stage detector does not need a proposal stage; it directly generates the classification confidence and position of objects in a single stage. Representatives of single-stage detectors include the SSD family [7, 19], the YOLO family [1, 26,27,28], and RetinaNet [17]. In general, two-stage detectors have higher accuracy, and single-stage detectors have higher detection speed. However, an anchor-based detector depends on good prior anchors, and it is difficult to estimate suitable prior anchors when object size varies widely. In addition, dense anchors are set in order to improve the recall rate, and most of them are negative. This leads to an imbalance between negative and positive anchors, which seriously affects the final detection performance.

The anchor-free detector performs object detection based on point estimation. CornerNet [14] uses a convolutional neural network to predict the upper-left and lower-right corners of an object and predicts an embedding vector for each corner to determine whether an upper-left corner and a lower-right corner belong to the same object. CenterNet [43] directly predicts the center of the object and regresses its width and height. ExtremeNet [44] detects the topmost, leftmost, bottommost, and rightmost points and the center point of the object.

It is difficult to find a suitable anchor size for all objects because of the large difference in object size caused by changes in UAV flying height. Therefore, in this paper, the anchor-free detector CenterNet [43] is used as the detection framework to address this problem. Experiments also show that the anchor-free detector performs better on aerial image datasets.

2.2 Aerial Image Detection

Compared with object detection in natural images, object detection in aerial images faces more challenges: small objects, unevenly distributed objects, objects seen from various perspectives, and objects that vary in size. Various solutions have been proposed according to these characteristics of aerial images. Because the focus of this work is deep learning, this paper only reviews related work on aerial image detection using deep neural networks. In [24], a tiling method was used in the training and testing stages to improve the detection of small objects. In [35], the free metadata recorded by drones are used to learn a Nuisance Disentangled Feature Transform (NDFT) that eliminates the interference caused by flying altitude changes, adverse weather conditions, and other nuisances. Objects in an aerial image can appear in any direction and at any position. The works in [5, 21, 38] use rotated anchors to detect objects in any direction. In [16], a shape mask is used to flexibly detect objects in any direction without any pre-defined rotated anchors. In [34], the authors studied the scale variation of aerial image object detection and proposed a Receptive Field Expansion Block (RFEB) to increase the receptive field size for high-level semantic features and a Spatial-Refinement Module (SRM) to repair the spatial details. In [25], a multi-task object detection and segmentation model is proposed. The segmentation map is used as the weight of a self-attention mechanism to weight the feature map for object detection, which reduces the signal of irrelevant regions.

2.3 Region Search in Detection

In object detection, searching for regions of interest for fine detection is usually used to detect small objects. The work of [20] proposes an adaptive detection strategy, which can continuously subdivide the regions that may contain small objects and spend computing resources on the regions that contain sparse small objects. In [31, 37], a clustering algorithm was used to generate ground-truth ROIs on the original datasets, a dedicated CNN was then used to predict the ROIs, and finally the ROIs were sent to the fine detector. In [42], a sliding window is applied on the feature map, a difficulty score is calculated for each window, and the difficult regions are sent to the fine detector. The works in [8, 32, 33] address small object detection in large images and use reinforcement learning to find ROIs for fine detection. [15] proposed an aerial image object detection network based on a density map, from which a rough distribution of objects can be obtained to search for the ROIs.

Among the methods reviewed above, some use a network to predict the ROI, some slide fixed windows over the feature map to search for the ROI, and some uniformly crop the original image to get the ROI. Because ROIs differ in shape and size, it is difficult to set anchor sizes or to regress the width and height of an ROI with a network. The ROIs obtained by fixed-window sliding or by uniformly cropping the original image have a fixed size, which adapts poorly to the real ROIs. Therefore, this paper sends the images to the coarse detection network to get the approximate distribution of objects and then uses a clustering algorithm to obtain the ROIs adaptively. The clustering algorithm yields ROIs of various sizes, which is more in line with the actual situation.

3 Methodology

As shown in Fig. 2, detection on an aerial image can be divided into three stages: difficult cluster region extraction, fine detection on the ROIs, and fusion of the detection results. In the first stage, aerial images are sent into CNet to obtain an initial prediction. The cluster regions are then obtained by applying mean shift [39] to the initial detections. In addition, the difficulty score of each cluster region is calculated, and the regions with a higher score are regarded as difficult cluster regions. In the second stage, we first use the Gaussian scaling function (GSF) to scale the difficult cluster regions so as to reduce the scale difference of objects. Each ROI, i.e., a scaled difficult cluster region, is then finely detected using FNet. Finally, the third stage fuses the detection results of each ROI and of the global image with soft-NMS [2].
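The control flow of the three stages can be summarized in a short sketch. Every callable passed in below (`coarse_detect`, `cluster`, `score_region`, `scale_and_crop`, `fine_detect`, `fuse`) is a hypothetical placeholder for the corresponding component described in this section; the function only shows how the stages connect and makes no claim about the authors' actual implementation.

```python
def crenet_infer(image, coarse_detect, cluster, score_region,
                 scale_and_crop, fine_detect, fuse, difficulty_threshold):
    """Hypothetical orchestration of the coarse-to-fine pipeline.

    coarse_detect(image)            -> list of detections (CNet)
    cluster(detections)             -> list of cluster regions
    score_region(region, dets)      -> difficulty score of a region
    scale_and_crop(image, region, dets)
                                    -> (roi_image, to_global), where
                                       to_global maps an ROI detection
                                       back to image coordinates
    fine_detect(roi_image)          -> list of detections (FNet)
    fuse(detections)                -> final detections (e.g. soft-NMS)
    """
    # Stage 1: coarse detection, clustering, and difficulty filtering.
    coarse = coarse_detect(image)
    regions = cluster(coarse)
    difficult = [r for r in regions
                 if score_region(r, coarse) > difficulty_threshold]

    # Stage 2: scale each difficult region and detect on the resulting ROI.
    fine = []
    for region in difficult:
        roi, to_global = scale_and_crop(image, region, coarse)
        fine.extend(to_global(det) for det in fine_detect(roi))

    # Stage 3: fuse global and ROI detections.
    return fuse(coarse + fine)
```

Passing the components as callables keeps the sketch independent of any particular detector, which matches the statement that CNet and FNet can be any detector.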

3.1 Difficult Cluster Region Extraction

Difficult cluster region extraction consists of three steps: Firstly, aerial images are fed into the trained CNet to obtain coarse detection results of objects. Then the results are used to obtain a cluster region. Finally, the difficulty score for each cluster region is calculated, and the non-difficult region is removed to speed up the detection.

In previous work, [37] proposed using a clustering algorithm to generate the ground truth of the cluster regions for each image and then trained a detector to predict the cluster regions. However, the cluster regions predicted by the network overlap heavily. Iterative Cluster Merging (ICM) is therefore needed to merge them, and the number of cluster regions obtained is fixed. In aerial images, however, the number of cluster regions may differ because of different camera angles, shooting times, and other factors, so a fixed number of cluster regions does not suit all cases. Another problem is that the cluster regions vary in shape. It is difficult to manually set the anchor sizes in Faster R-CNN [29], and it is also difficult to regress the anchors. In this paper, we use a clustering algorithm instead of a network to search for the cluster regions, which avoids the problem of anchor setting. Specifically, aerial images are sent into CNet to obtain the initial prediction results for the objects. The cluster regions are then obtained by applying mean shift [39] to the initial prediction. Because an object can only belong to one region, the overlap between cluster regions is very small. Unlike [37], which relies on a supervised algorithm, our algorithm is unsupervised.

Aerial images are acquired at high altitude, so the background is complex and the objects are small. As can be seen from Fig. 1, the objects are usually gathered together. We can use a clustering algorithm to get the cluster regions and then crop and enlarge them for fine detection, which not only addresses small object detection but also reduces background interference. Mean shift [39] is a density-based clustering algorithm, which assumes that the data of different clusters belong to different probability density distributions. By feeding the initial detections into the mean shift [39] algorithm, the cluster regions of the image can be obtained adaptively.
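As an illustration of how cluster regions might be derived from coarse detections, the sketch below runs a flat-kernel mean shift on the box centers and takes the tight bounding box of each cluster as its region. The flat kernel, the `bandwidth` parameter, the mode-merging rule, and the use of a tight bounding box are all assumptions made for this example; the text above only states that mean shift [39] is applied to the initial detections.

```python
import math


def _dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def mean_shift_labels(points, bandwidth, max_iter=50, tol=1e-3):
    """Flat-kernel mean shift; returns one cluster label per point."""
    modes = []
    for start in points:
        pos = start
        for _ in range(max_iter):
            neighbours = [q for q in points if _dist(pos, q) <= bandwidth]
            if not neighbours:  # defensive guard against rounding effects
                break
            new_pos = (
                sum(q[0] for q in neighbours) / len(neighbours),
                sum(q[1] for q in neighbours) / len(neighbours),
            )
            moved = _dist(pos, new_pos)
            pos = new_pos
            if moved < tol:
                break
        modes.append(pos)

    # Merge modes that converged to (almost) the same location.
    centers, labels = [], []
    for mode in modes:
        for idx, center in enumerate(centers):
            if _dist(mode, center) < bandwidth:
                labels.append(idx)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return labels


def cluster_regions(boxes, bandwidth):
    """Group (x1, y1, x2, y2) boxes and return one enclosing box per cluster."""
    centers = [((x1 + x2) / 2.0, (y1 + y2) / 2.0) for x1, y1, x2, y2 in boxes]
    labels = mean_shift_labels(centers, bandwidth)
    regions = {}
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        if label not in regions:
            regions[label] = [x1, y1, x2, y2]
        else:
            r = regions[label]
            r[0], r[1] = min(r[0], x1), min(r[1], y1)
            r[2], r[3] = max(r[2], x2), max(r[3], y2)
    return [tuple(r) for r in regions.values()], labels
```

Because every box receives exactly one label, each object belongs to a single region, and the number of regions follows from the data rather than being fixed in advance.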

It is worth noting that not every cluster region yields an accuracy gain. To improve the detection efficiency of the detector, it is necessary to eliminate the regions that bring little or no accuracy gain. This paper assumes that the denser the objects are in a cluster region, the lower the average confidence score is, and that a cluster region with denser objects or a lower average confidence score obtains a greater accuracy gain from fine detection. Following this assumption, and similar to [42], the initial prediction results of the aerial image are used to calculate a score for each cluster region, and the regions whose score is greater than the difficulty threshold are retained.

$$\begin{aligned} M&= \frac{\sum _{i=1}^{N}score_{i}}{N}\; \end{aligned}$$
(1)
$$\begin{aligned} S&= \frac{N^{2}}{A \times M} \; \end{aligned}$$
(2)

Eqs. (1) and (2) are used to calculate the difficulty score of a region p, where N is the number of predicted boxes in p, \(score_{i}\) is the confidence score of the i-th predicted box, and M is the average confidence score of all the boxes predicted by the coarse detector in p. We assume that the smaller the value of M is, the greater the accuracy gain will be; therefore, M is placed in the denominator. A is the area of p, and S is the final score of p. \( \frac{N^{2}}{A} \) represents the density of region p. We assume that the denser the region is, the more accuracy gain it can bring, because densely populated areas usually involve occlusion between objects, and the detector misses objects when the occlusion is serious. Therefore, enlarging dense regions can effectively reduce missed detections.
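As a minimal sketch, the difficulty score of Eqs. (1) and (2) could be computed as follows. The function name and the handling of degenerate inputs (an empty region, a non-positive area, or a zero mean confidence) are assumptions added for robustness and are not specified in the text.

```python
def difficulty_score(confidences, region_area):
    """Difficulty score S of a cluster region, following Eqs. (1) and (2).

    confidences: confidence scores of the coarse detections in the region.
    region_area: area A of the region in pixels.
    """
    n = len(confidences)
    if n == 0 or region_area <= 0:
        return 0.0                      # assumption: nothing to refine
    m = sum(confidences) / n            # Eq. (1): mean confidence M
    if m <= 0:
        return float("inf")             # assumption: treat as maximally difficult
    return n ** 2 / (region_area * m)   # Eq. (2): S = N^2 / (A * M)
```

For example, four detections with confidence 0.5 in a region of 1600 pixels give S = 16 / (1600 x 0.5) = 0.02; the same detections with a lower mean confidence, or packed into a smaller region, give a higher score.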

3.2 Fine Detection on Region of Interest

After the difficult cluster regions are obtained, a dedicated detector, FNet, performs fine detection on them. However, the difficult cluster regions have different shapes and contain objects of different sizes, which makes it difficult to regress the width and height of objects. Unlike existing approaches [8, 13, 20] that send these regions directly to fine detection, this paper, inspired by [41], proposes a Gaussian scaling function (GSF) to reduce the size difference of objects. [41] uses the transformation function Scale Match to scale an extra dataset so that there is little difference in object size between the targeted dataset and the extra dataset; specifically, MS COCO [18] is used as the extra dataset to pre-train the detector and improve its performance on the targeted dataset. Unlike [41], we scale the targeted dataset rather than the extra dataset, using the GSF to scale difficult cluster regions. First, we select a mean value that suits the receptive field of the backbone. Then, we select the standard deviation based on the three-sigma rule of thumb. The mean and the standard deviation together define the GSF, which can be used to shrink large objects and enlarge small objects. The implementation process is shown in Algorithm 1.

Algorithm 1.
$$\begin{aligned} AS\left( G_{ij}\right)&= \sqrt{w_{ij}\times h_{ij}}\; \end{aligned}$$
(3)

In Algorithm 1, Eq. (3) is used to calculate the absolute size of each object in each difficult cluster region, where \(w_{ij}\) and \(h_{ij}\) are the width and height of the object. A value is randomly sampled from the Gaussian function, and the ratio of this value to the average absolute size of all objects in the region is used as the scale factor for scaling the difficult cluster region. If the size of the difficult cluster region after scaling is less than a certain range, the padding function is used to pad the region proportionally; otherwise, the uniform crop function is used to divide it into two equal regions.
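The scale-factor computation described above can be sketched as follows. The target size range `target_min`/`target_max`, the choice of placing the mean at its midpoint with a standard deviation of one sixth of its width (one reading of the three-sigma rule), and the clamping of the sample to that range are assumptions for illustration; the text does not give the concrete values. The padding and uniform-crop steps are omitted.

```python
import math
import random


def absolute_size(w, h):
    """Absolute size of an object, Eq. (3): sqrt(w * h)."""
    return math.sqrt(w * h)


def gaussian_scale_factor(object_sizes, target_min=16.0, target_max=64.0,
                          rng=random):
    """Scale factor for one difficult cluster region.

    object_sizes: list of (w, h) for the coarse detections in the region.
    target_min, target_max: assumed range of desired absolute object sizes.
    rng: object with a gauss(mu, sigma) method (the random module by default).
    """
    if not object_sizes:
        return 1.0
    mean = (target_min + target_max) / 2.0
    sigma = (target_max - target_min) / 6.0     # three-sigma rule (assumed form)
    sampled = rng.gauss(mean, sigma)
    # Assumption: clamp the rare samples that fall outside the target range.
    sampled = min(max(sampled, target_min), target_max)
    avg = sum(absolute_size(w, h) for w, h in object_sizes) / len(object_sizes)
    if avg <= 0:
        return 1.0
    return sampled / avg
```

With these assumed defaults, a region whose objects average 8 pixels in absolute size receives a factor between 2 and 8 (enlargement), whereas a region whose objects average 200 pixels receives a factor between 0.08 and 0.32 (shrinking).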

After scaling the difficult cluster regions, we get the ROIs. Then, the fine detection network (FNet) performs fine object detection on them. The architecture of FNet can be that of any state-of-the-art detector, and its backbone can be any standard backbone network, e.g., VGG [30], ResNet [12], or Hourglass-104 [23].

3.3 Final Detection with Local-Global Fusion

NMS [22] is a post-processing step commonly used in object detection to remove duplicate detection boxes and thereby reduce false detections. When there are multiple prediction boxes on the same object, NMS eliminates every remaining prediction box whose IoU with the prediction box of maximum confidence score is greater than the threshold. This hard suppression is too strict, so soft-NMS [2] instead replaces the original score of such a box with a lower score rather than zero. The final detection of an aerial image is obtained by fusing the detection results of the ROIs and the global detection results of the whole image with soft-NMS [2] post-processing.
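The fusion step can be illustrated with a small sketch: ROI detections are mapped back to the coordinates of the original image and then merged with the global detections by soft-NMS. The Gaussian decay form with parameter `sigma`, the score threshold, the class-agnostic treatment, and the mapping via a region offset and scale factor are assumptions for this example; the text only states that soft-NMS [2] is used for fusion.

```python
import math


def iou(a, b):
    """IoU of two boxes whose first four entries are (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def roi_to_global(det, offset_x, offset_y, scale):
    """Map an ROI detection (x1, y1, x2, y2, score) back to image coordinates.

    Assumes the ROI was cropped at (offset_x, offset_y) and resized by scale.
    """
    x1, y1, x2, y2, score = det
    return (x1 / scale + offset_x, y1 / scale + offset_y,
            x2 / scale + offset_x, y2 / scale + offset_y, score)


def soft_nms(dets, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS over (x1, y1, x2, y2, score) detections.

    Overlapping boxes have their scores decayed instead of being removed.
    Returns the rescored detections in the order they were selected.
    """
    remaining = [list(d) for d in dets]
    kept = []
    while remaining:
        best = max(range(len(remaining)), key=lambda i: remaining[i][4])
        top = remaining.pop(best)
        kept.append(tuple(top))
        survivors = []
        for d in remaining:
            d[4] *= math.exp(-(iou(top, d) ** 2) / sigma)
            if d[4] >= score_thresh:
                survivors.append(d)
        remaining = survivors
    return kept
```

In this sketch, a duplicate of a high-scoring box keeps a reduced score rather than being discarded, while a non-overlapping box keeps its score unchanged. In practice the procedure would typically be run separately for each category.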

4 Experiments

4.1 Datasets and Evaluation Metrics

Datasets. The VisDrone [45] dataset was collected with drones in 14 different cities in China under different weather and lighting conditions. The objects in this dataset are mostly small and are often clustered together. It contains a total of 10,209 images, including 6,471 training images, 548 validation images, 1,610 test-dev images, and 1,580 test-challenge images. Annotations are publicly available for all subsets except test-challenge. The dataset contains 10 categories, and its image resolution is about \(2000\times 1500\) pixels. To make a fair comparison with existing works [15, 37], we evaluate the detection performance on the validation set.

Evaluation Metric. We use the same evaluation protocol as proposed in MS COCO [18] to evaluate our method. Six evaluation metrics are reported: AP, \(AP_{50}\), \(AP_{75}\), \(AP_{small}\), \(AP_{medium}\), and \(AP_{large}\). AP is the average precision over all categories at 10 IoU thresholds, ranging from 0.50 to 0.95 with a step size of 0.05. \(AP_{50}\) is the average precision over the ten categories when the IoU threshold is set to 0.5, and \(AP_{75}\) is the same with the IoU threshold set to 0.75. \(AP_{small}\) is the AP for objects with an area less than \(32\times 32\). \(AP_{medium}\) is the AP for objects with an area between \(32\times 32\) and \(96\times 96\). \(AP_{large}\) is the AP for objects with an area greater than \(96\times 96\). The number of ROIs affects inference time. In the following experiments, we use \( \#img \) to record the total number of images sent to the detector, including both original images and cropped ROIs.
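The IoU thresholds and size categories used by these metrics can be written down explicitly. The sketch below is only illustrative: the function names are hypothetical, and assigning objects whose area lies exactly on a boundary to the larger category is an assumption.

```python
def iou_thresholds():
    """The 10 IoU thresholds used for AP: 0.50, 0.55, ..., 0.95."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def size_bucket(w, h):
    """Size category of an object with width w and height h (in pixels)."""
    area = w * h
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"
```

Under this convention, a \(20\times 20\) object counts toward \(AP_{small}\), a \(50\times 50\) object toward \(AP_{medium}\), and a \(100\times 100\) object toward \(AP_{large}\).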

4.2 Implementation Details

We implemented the proposed CRENet in PyTorch 1.4.0 and used an RTX 2080Ti GPU to train and test the model. As is common when training deep CNNs, we use data augmentation; specifically, we use horizontal flipping. In this article, CNet and FNet use the same detector, CenterNet [43] with the Hourglass-104 [23] backbone, although other detectors may be selected. After the detection results of the aerial image are obtained through CNet, mean shift [39] is used to get the preliminary cluster regions. Regions with an area of less than 10000 pixels, an aspect ratio greater than 4 or less than 0.25, or fewer than 3 objects are excluded. Cluster regions with a difficulty score less than 0.01 are also eliminated. The input resolution for both CNet and FNet is \(1024\times 1024\). We train the baseline detector for 140 epochs with Adam, with an initial learning rate of \( 2.5\times 10^{-4} \).
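The region-filtering rules in this paragraph can be collected into a single predicate. The sketch below assumes that a region is discarded if any one of the listed conditions holds and that the aspect ratio is width divided by height; because the two ratio bounds are reciprocal, the result is the same with height divided by width. The function name and argument layout are hypothetical.

```python
def keep_region(region, num_objects, difficulty,
                min_area=10000, max_ratio=4.0, min_ratio=0.25,
                min_objects=3, min_difficulty=0.01):
    """Return True if a cluster region passes the filtering rules.

    region: (x1, y1, x2, y2) in pixels.
    num_objects: number of coarse detections inside the region.
    difficulty: difficulty score S of the region.
    """
    x1, y1, x2, y2 = region
    w, h = x2 - x1, y2 - y1
    if w <= 0 or h <= 0:
        return False
    if w * h < min_area:                    # too small
        return False
    ratio = w / h
    if ratio > max_ratio or ratio < min_ratio:   # too elongated
        return False
    if num_objects < min_objects:           # too few objects
        return False
    return difficulty >= min_difficulty     # difficulty threshold
```

For example, a \(200\times 100\) region containing 5 objects with a difficulty score of 0.02 is kept, whereas a \(50\times 50\) region or a \(1000\times 100\) region is discarded regardless of its score.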

Table 1. Quantitative results on the validation set of VisDrone. \(\#img\) is the number of images sent to the detector. UC refers to uniformly cropping the aerial image into four parts, while RC refers to randomly cropping four \(1024\times 1024\) regions from each image.

4.3 Quantitative Result

CenterNet [43] with Hourglass-104 [23] is chosen as the baseline model. Table 1 compares our approach with the baseline, UC, RC, ClusDet [37], and DMNet [15]. We achieve the best performance while using fewer images than the other methods. Even with the small backbone network DLA-34 [40] with deformable convolution layers [3], as modified by CenterNet [43], our method achieves better performance than ClusDet [37] and DMNet [15], both of which use Faster R-CNN [29] with ResNeXt [36]. We find that the AP of RC is lower than that of the baseline, possibly because RC truncates objects when cropping the image. The experiments show larger improvements in \(AP_{small}\) and \(AP_{medium}\), indicating that the proposed adaptive region cropping based on a clustering algorithm is of great help in detecting small and medium objects.

4.4 Ablation Study

In this experiment, we show how the three components of the framework, namely the clustering algorithm, the difficulty threshold, and the Gaussian scaling function (GSF), affect the final performance. We consider five cases: (a) Baseline: we use CenterNet [43] with Hourglass-104 [23] as the baseline model; (b) CRENet w/o difficulty threshold and GSF: a clustering algorithm is added to the baseline model to search for cluster regions, but neither the difficulty threshold nor the GSF is used to process the regions; (c) CRENet w/o difficulty threshold: the regions produced by the clustering algorithm are not filtered with the difficulty threshold; (d) CRENet w/o GSF: difficult regions are sent directly to the fine detector without the GSF; (e) CRENet: the complete implementation of our method. As can be seen from Table 2, the performance improvement from searching regions with the clustering algorithm alone is limited. Therefore, it is necessary to use the GSF to scale the regions. When the difficulty threshold was used to filter the cluster regions, 767 regions were eliminated without lowering the AP. This shows that the difficulty threshold can effectively eliminate regions that hardly bring any precision gain, thus speeding up detection. The above experiments show that two of the proposed components, the clustering algorithm and the GSF, are very important for improving detection performance, while the difficulty threshold is crucial for achieving a high inference speed.

Table 2. Ablation study of detection results on the validation set of VisDrone.

5 Conclusions

In this paper, we propose a new method, CRENet, for object detection in aerial images. CRENet uses a clustering algorithm to obtain cluster regions adaptively. The difficulty threshold is then used to eliminate the cluster regions that cannot bring a precision gain, which speeds up detection. We also propose the Gaussian scaling function (GSF), which scales the difficult cluster regions to reduce the scale difference between objects. Experiments on the VisDrone [45] dataset show that CRENet performs well for small and medium objects in dense scenarios and achieves better performance than the compared methods.