1 Introduction

Remote sensing technology is widely used in traffic monitoring, military reconnaissance and other fields, and object detection in remote sensing images has become a research hotspot in computer vision [1,2,3]. The basic task of object detection is to determine the class of each object and to localize its boundary. Because of the particular viewpoint of remote sensing observation platforms, remote sensing images often contain complex backgrounds and objects of many categories, varying scales, and unstable shapes, so general-purpose object detectors perform unsatisfactorily on them. Furthermore, the scale problem leads DNNs (Deep Neural Networks) to represent object features in remote sensing images poorly.

Facing the challenges of remote sensing imagery mentioned above, we propose DA-YOLO based on YOLOv4, which is used for multi-category, multi-scale, and multi-pose small-scale object detection in remote sensing images. The contributions of this paper mainly include the following four aspects:

  (1) The “quadruple cropping” data augmentation method is adopted. It ensures that small-target information is not lost when the original image is resized, and it enlarges the number of instances.

  (2) The DSC module is introduced. It expands the receptive field and improves the feature extraction capability for small objects.

  (3) CBAM [4] is introduced. CBAM aggregates global and local features and establishes long-range dependencies between channels, improving the representation of small objects.

  (4) Experiments on the DOTA dataset [5] show that DA-YOLO increases mAP by 1.36% without a significant drop in speed.

2 Related Work

Because of the acquisition method, remote sensing images are far larger than the images in general object detection datasets. Factors such as object density, scale, and scene complexity all affect the results and make detection in remote sensing images more difficult. Accuracy is especially challenged when detecting small-scale objects in high-resolution images. A number of DNN-based works have been devoted to detection in remote sensing images. Fan et al. [6] propose ClusDet, which produces object cluster regions and estimates object scales for these regions, greatly reducing the number of chips for final object detection. Yang et al. [7] adopt a feature fusion strategy that jointly considers feature fusion, anchor sampling, and receptive field. Zhang et al. [8] introduce a feature enhancement method that learns global and local contexts together.

Input images are enormous while objects occupy few pixels. Resizing directly to the network input size is therefore not optimal, since it loses information for objects that have only a few pixels. Kisantal et al. [9] copy and paste small objects multiple times to augment the image and reduce the loss of object information, but the approach still has limitations for dense objects. The algorithm in [10] uses a scale-adaptive proposal network to improve the precision of multi-object detection, but its detection efficiency for small objects is generally not high.

The receptive field in deep learning refers to the size of the region of the original input image that each pixel of a CNN output feature map is mapped from. Many researchers have worked on reducing the loss of useful object information and improving the detection accuracy for small objects. Gan et al. [11] applied an FPN to remote sensing image object detection, which improved accuracy on small-scale objects but only detected specific object types. Qu et al. [12] used dilated convolution to enlarge the receptive field of the third-level features in the network and enrich the detailed information of the object. Dilated convolution with sparse kernels is a better alternative to standard convolutional layers and can flexibly aggregate context information while maintaining the same resolution [13]. Therefore, we use dilated convolution to maintain the size of the feature map, which improves the feature extraction ability of the network and captures more object information.

In addition, small-scale objects depend more on shallow-level features. Fu et al. [14] proposed a feature fusion architecture that generates multi-scale feature levels and combines features of different levels to form a powerful representation of object features. Building on the three-scale detection of YOLOv4, we add a detection branch and convey localization and semantic information through the path aggregation network, so as to obtain richer texture and contour information and improve the detection of small-scale objects.

3 Our Network

In this paper, we propose a fast and accurate small-scale object detection method for remote sensing images. As shown in Fig. 1, given an input image, we first split it into sub-images by “quadruple cropping”. The cropped images are then fed into an improved backbone network based on CSPDarknet53 [15] to extract deep features. We add four DSC modules, which increase the size of the receptive field while preserving the resolution without losing information. Then, we introduce CBAM for feature enhancement. In addition, we use the last four stages of feature maps instead of three to obtain more contour details of small-scale objects. Finally, the predicted bounding boxes are aggregated and redundant detections are removed by Non-Maximum Suppression (NMS) to produce the final detection results.
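The final aggregation step can be illustrated with a minimal sketch of greedy NMS. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`, a single class, and an IoU threshold of 0.5; the paper does not state these details, and the helper names `to_image_coords`, `iou`, and `nms` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def to_image_coords(box: Box, x_off: float, y_off: float) -> Box:
    """Shift a box predicted in a sub-image back to original-image coordinates."""
    return (box[0] + x_off, box[1] + y_off, box[2] + x_off, box[3] + y_off)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep box i only if no higher-scoring kept box overlaps it too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

In a multi-class setting this routine would typically be run once per category, after the boxes from all overlapping sub-images have been shifted back with `to_image_coords`.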

Fig. 1. Network architecture.

3.1 Quadruple Cropping

Feeding large-scale images directly into training leads to excessive compression and can prevent the network from converging, because objects in remote sensing images lose most of their information when over-scaled during preprocessing. Given this situation, this paper proposes a novel data cropping method named “quadruple cropping”, following References [5, 16]. The cropped image is shown in Fig. 2.

Fig. 2. Quadruple cropping.

Figure 2(a) is the original image. As shown in Fig. 2(b), the original image is cropped in four directions from ① to ④ with an overlap rate of 50%, and each crop has a size of 800 × 800. If the width or the height of the original image is less than 800, no cropping is performed in the horizontal or vertical direction, respectively. In addition, if the remaining part is smaller than 800 after cropping, it is discarded.
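The cropping rule can be sketched as a sliding window. This is one possible reading of the description, not the authors' implementation: it assumes the 50% overlap corresponds to a stride of 400 pixels for an 800-pixel crop, and the function names `axis_windows` and `crop_windows` are hypothetical.

```python
from typing import List, Tuple


def axis_windows(length: int, crop: int = 800, overlap: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, size) windows along one image axis."""
    if length < crop:
        # Axis shorter than the crop size: no cropping in this direction.
        return [(0, length)]
    stride = int(crop * (1.0 - overlap))
    if stride <= 0:
        raise ValueError("overlap must be smaller than 1")
    # A window that would cross the border is not generated,
    # so a remainder smaller than the crop size is discarded.
    return [(start, crop) for start in range(0, length - crop + 1, stride)]


def crop_windows(width: int, height: int, crop: int = 800,
                 overlap: float = 0.5) -> List[Tuple[int, int, int, int]]:
    """Return (x, y, w, h) for every sub-image of a width x height image."""
    return [(x, y, w, h)
            for y, h in axis_windows(height, crop, overlap)
            for x, w in axis_windows(width, crop, overlap)]
```

Under these assumptions a 2000 × 1000 image gives four 800 × 800 sub-images along the horizontal axis and one along the vertical axis, with the bottom 200 rows discarded.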

In addition, an object that lies on the cropping boundary is damaged by cropping (Fig. 2(c)). If every such object is discarded, a lot of object information is lost; if every such object is labeled, the insufficient object information causes missed and false detections. This paper therefore computes the incompleteness of the object to decide whether to retain its label, as shown in formula (1):

$$ P = \frac{A_a }{{A_b }} $$
(1)

where \(A_a\) and \(A_b\) denote the area of the labeled box in the cropped sub-image and in the original image, respectively, so that \(P\) is the fraction of the object that remains after cropping. If \(P \ge 0.7\), the sub-image contains most of the object information, so the coordinates are retained unchanged. If \(0.3 \le P < 0.7\), the coordinates are still retained and ‘difficult’ is set to 1 in the label file. If \(P < 0.3\), the annotation of the object is removed. “Quadruple cropping” retains object information to the greatest extent and increases sample diversity after cropping, which avoids the increase in detection error caused by fewer objects and more background information.
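The label-retention rule can be sketched as follows, reading \(P\) as the fraction of the labeled box area that remains inside the sub-image so that it lies in [0, 1]. Axis-aligned boxes are an assumption made for brevity, the handling of values exactly at 0.3 and 0.7 is also an assumption, and the function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def retained_fraction(box: Box, window: Box) -> float:
    """Fraction of the box area that lies inside the crop window."""
    iw = max(0.0, min(box[2], window[2]) - max(box[0], window[0]))
    ih = max(0.0, min(box[3], window[3]) - max(box[1], window[1]))
    area = (box[2] - box[0]) * (box[3] - box[1])
    return (iw * ih) / area if area > 0 else 0.0


def label_decision(box: Box, window: Box, high: float = 0.7, low: float = 0.3) -> str:
    """Return 'keep', 'difficult' (keep with difficult = 1), or 'drop'."""
    p = retained_fraction(box, window)
    if p >= high:
        return "keep"
    if p >= low:
        return "difficult"
    return "drop"
```

For example, with the window `(0, 0, 800, 800)`, a box that has half of its area outside the right border is kept but flagged as difficult, while a box with only a fifth of its area inside is dropped.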

3.2 Network Architecture

As one of the most advanced algorithms, YOLOv4 excels in both speed and precision. Its backbone network CSPDarknet53 adds CSPNet (cross-stage partial network) to each large residual block of Darknet53 and fuses it into the feature map through gradient changes. In the neck, PANet [17] (path aggregation network), with a more flexible ROI pooling, is applied to shorten the fusion path between the top and bottom layers. Inspired by this algorithm, we employ an improved CSPDarknet53 as the backbone, which performs well in extracting small-scale object features and is especially suitable for object detection in remote sensing images.

DSC Module.

Dilated convolution was first applied in image segmentation; it enlarges the perception range of the feature map without additional computation. Fig. 3(a) shows the receptive field of the standard convolution kernel, which is only 3 × 3; Fig. 3(b) shows the receptive field when the dilation rate is 2, where the mapping range grows as the convolution kernel is zero-filled. The receptive field changes from 3 × 3 to 7 × 7, and each convolution output covers a larger range of feature information. Inspired by the Inception network structure [18], we propose a dilated separable convolution (DSC) module based on dilated convolution. As shown in Fig. 4, the DSC module first separates the channels into two groups and combines dilated convolutions with different dilation rates (\(D_1\), \(D_2\)) in each branch to obtain multi-scale information effectively. Finally, an \(Add\) operation on the original input and the output gives the final output.
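The receptive-field figures can be checked with a short calculation. The sketch below uses the standard formula for the effective size of a dilated kernel, \(k + (k-1)(d-1)\). One reading that reproduces the 7 × 7 value is a 3 × 3 convolution with dilation 2 stacked on a standard 3 × 3 convolution; this stacking is an assumption about Fig. 3, and a single 3 × 3 kernel with dilation 2 on its own spans 5 × 5.

```python
from typing import Iterable, Tuple


def effective_kernel(kernel: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def receptive_field(layers: Iterable[Tuple[int, int, int]]) -> int:
    """Receptive field of stacked conv layers given as (kernel, dilation, stride)."""
    rf, jump = 1, 1
    for kernel, dilation, stride in layers:
        # Each layer widens the field by (effective kernel - 1) input steps,
        # scaled by the cumulative stride of the layers before it.
        rf += (effective_kernel(kernel, dilation) - 1) * jump
        jump *= stride
    return rf


standard = receptive_field([(3, 1, 1)])           # one standard 3x3 conv
stacked = receptive_field([(3, 1, 1), (3, 2, 1)])  # followed by a dilation-2 conv
```

Because the dilated layers use stride 1, the feature-map resolution is unchanged while the receptive field grows, which is the property the DSC module relies on.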

Fig. 3. Receptive field diagram.

Fig. 4. DSC module.

YOLOv4 fuses features of different scales to detect objects, so each feature map contains both shallow low-level features and deep high-level features, which improves prediction precision. However, when a deep CNN extracts features of different scales from remote sensing images, the pooling layers reduce the resolution of the feature map, and a receptive field that is too small has difficulty capturing objects that occupy only a few dozen pixels. The DSC module expands the receptive field of the feature map while maintaining the resolution, so that each convolution output covers a wider range of feature information. Therefore, this paper applies the DSC module to the backbone network to expand the effective receptive field of the feature map while preserving the resolution.

CBAM.

The attention mechanism emphasizes the importance of different features by assigning weights to them. It imitates the way the human brain observes a scene, assigning more attention, through weights, to the target points in the image that need it [19]. This highlights the features of the image target and suppresses the feature information of other objects, thereby enhancing the feature information. To better aggregate the local and global features of the image, so that the extracted features better characterize the location and category of the target, CBAM [4] is introduced to enhance each of the multi-scale feature maps extracted by the backbone network. CBAM, shown in Fig. 5, mainly consists of a channel attention module (CAM) and a spatial attention module (SAM).

Fig. 5. The structure of CBAM.

CAM.

After the convolution operation, each channel of the feature map tends to express a different feature. However, all channels keep the same weight and the relative importance of the channels is not considered, which is not conducive to enhancing the feature information of the target. CAM obtains the importance of the different channels of the feature map and assigns a corresponding weight to each channel, which makes better use of features with high weights, suppresses features with low weights, and enhances the expression between features. The CAM structure is shown in Fig. 6.

Fig. 6. The structure of CAM.

SAM.

Different from CAM, SAM focuses more on spatial location information. SAM can obtain the importance of the spatial location information of the feature map and highlight the location of key features, thereby enhancing the representation capability of the feature map. The SAM structure is shown in Fig. 7.
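A corresponding sketch for spatial attention is given below. As with CAM, the internals are taken from the formulation commonly given for CBAM [4] rather than from this text: the feature map is pooled across channels by average and by max, the two pooled maps are convolved with a single filter (7 × 7 in CBAM; any odd size works here), and a sigmoid gives a weight per location. The `kernel` argument is a hypothetical stand-in for the learned filter.

```python
import math
from typing import List

Matrix = List[List[float]]


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def spatial_attention(feature: List[Matrix], kernel: List[Matrix]) -> Matrix:
    """Return an H x W attention map for a C x H x W feature map.

    kernel holds two k x k filters (k odd): the first is applied to the
    channel-average map, the second to the channel-max map.
    """
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    avg = [[sum(feature[c][y][x] for c in range(channels)) / channels
            for x in range(width)] for y in range(height)]
    mx = [[max(feature[c][y][x] for c in range(channels))
           for x in range(width)] for y in range(height)]
    k = len(kernel[0])
    pad = k // 2
    attention = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = 0.0
            for pooled, filt in zip((avg, mx), kernel):
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        # Positions outside the map count as zero (zero padding).
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += filt[dy][dx] * pooled[yy][xx]
            row.append(_sigmoid(acc))
        attention.append(row)
    return attention
```

The resulting map has the same height and width as the input, so it can be multiplied element-wise with every channel to highlight the locations of key features.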

Fig. 7. The structure of SAM.

Multi-scale Prediction.

The detection of small-scale objects often depends more on shallow features, and the three scales in YOLOv4 are not enough to capture the subtle feature information of such objects. Therefore, the original three scales are expanded to four, and more accurate anchor boxes are assigned to objects on the larger feature map. The feature maps from different information streams are combined so that the information in the low-level feature maps is gradually transferred to the high level. In this way the feature information of the object is continuously enriched, the semantic information becomes more complete, and richer texture and contour information is obtained, which effectively improves the detection of small-scale objects. The improved network structure is shown in Fig. 8.
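The effect of the extra scale on the prediction grid can be sketched numerically. The sketch assumes that the three YOLOv4 heads operate at strides 8, 16, and 32, that the added branch uses the stride-4 stage, that 3 anchors are predicted per cell, and that an 800 × 800 sub-image is fed at native resolution; the stride-4 choice and the input size are assumptions rather than values stated in the text.

```python
from typing import List, Sequence, Tuple


def grid_sizes(input_size: int, strides: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (stride, cells per side) for every detection scale."""
    for s in strides:
        if input_size % s:
            raise ValueError(f"input size {input_size} is not divisible by stride {s}")
    return [(s, input_size // s) for s in strides]


def total_predictions(input_size: int, strides: Sequence[int],
                      anchors_per_cell: int = 3) -> int:
    """Number of candidate boxes produced over all scales."""
    return sum(anchors_per_cell * cells * cells
               for _, cells in grid_sizes(input_size, strides))


three_scale = total_predictions(800, (8, 16, 32))
four_scale = total_predictions(800, (4, 8, 16, 32))
```

Under these assumptions the added scale contributes a 200 × 200 grid, so most candidate boxes come from the finest feature map, where small objects still cover several cells.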

Fig. 8. Network architecture of the proposed method.

4 Experiment and Analysis

Our detector is trained on one Titan 2080Ti GPU and optimized by Adam with a momentum of 0.9 and a weight decay of 0.0005. The batch size is 8 because of limited GPU memory. The initial learning rate is 0.001. The model is trained and tested on the DOTA [5] dataset. DOTA [5] includes 2806 remote sensing images, of which 1/2 are used for training, 1/6 for validation, and 1/3 for testing. The image sizes range from 800 × 800 to 4000 × 4000, and the 15 categories contain 188282 instances in total. This paper uses the quadruple cropping method to crop 1411 training images into 30749 sub-images for network training.

4.1 Ablation Study

In this section, a series of ablation experiments were conducted on the validation set of DOTA [5] to evaluate the effect of each improvement. We apply the CSPDarknet53 as our baseline and modify components gradually to find the final appropriate settings. Table 1 shows the results of ablation studies.

Table 1. Comparison of ablation experiment results.

In Table 1, with other settings unchanged, introducing only the data cropping method effectively reduces the information loss caused by excessive image compression and increases mAP by 0.38%. On top of data cropping, adding the DSC module lets the network learn high-level semantic information more effectively and improves the recognition of small-scale objects. Introducing CBAM further improves the ability to recognize small targets. Finally, using feature maps of four scales enriches the texture and contour information of the objects. With these three additions, the mAP reaches 74.01%, 75.12% and 75.50%, increases of 0.49%, 1.12% and 0.37%, respectively.

4.2 Comparison with Other Methods

We compare DA-YOLO with SOTA object detectors on DOTA, including the two-stage algorithms FR-H [20], ICN [1], SCRDet [3], and FADet [21] and the single-stage algorithms SSD [22], EFR [23], etc. The proposed method achieves superior performance compared with other detectors. The comparison is shown in Table 2.

Table 2. Comparison of test results on the DOTA-v1.0 dataset.

As shown in Table 2, DA-YOLO shows competitive performance, with an mAP of 75.50%. For small objects in the image, such as small vehicles and storage tanks, our method clearly improves detection accuracy. This performance on small objects is attributed to the enlarged receptive field, the attention mechanism, and the enriched texture and contour information. The table also shows that two-stage detectors still achieve superior performance on DOTA [5]. However, they all rely on complex model structures to improve accuracy, which greatly slows down detection. The proposed single-stage detector achieves performance comparable to the two-stage detectors while keeping a fast detection speed.

4.3 Speed Experiment

To test the real-time performance of DA-YOLO, we compare its inference speed with that of other algorithms on the DOTA [5] dataset. The comparison results are shown in Table 3.

Table 3. Comparison of the speed of different methods on the DOTA dataset.

As shown in Table 3, because multi-scale prediction makes the network more complicated and image cropping adds inference time, the detection speed is slightly lower than that of one-stage detection algorithms such as SSD [22], but it still reaches 17.2 fps. Compared with the two-stage detection methods, DA-YOLO remains highly competitive in detection speed. In summary, the one-stage detection algorithm proposed in this paper achieves good detection performance while remaining fast.

4.4 Detection Result

Figure 9 shows visualizations of the detection results of DA-YOLO on the DOTA dataset [5].

Fig. 9. Detection results of DA-YOLO on the DOTA dataset.

As shown in Fig. 9, DA-YOLO gives good detection results on the DOTA dataset [5] and shows good generalization ability. Especially for small-scale targets such as airplanes, oil tanks, and small cars, the detection results are better because the shallow features of the network are fully used.

5 Conclusion

Aiming at the problem of small-scale object detection in high-resolution remote sensing images, a novel and robust one-stage object detector, DA-YOLO, is proposed. Cropping the aerial image retains most of the information of small objects; the DSC module expands the receptive field of the feature map; and CBAM [4] enhances the features, further improving the representation of small objects. In addition, the last four scales of the feature maps are used to obtain more texture and contour information of small objects. Experiments show that DA-YOLO achieves better detection performance and detects objects in aerial images more accurately without significantly reducing the detection speed.