1 Introduction

Object detection has broad development prospects and great commercial value, and has become a research hotspot for practitioners worldwide. It has been widely used in intelligent security, autonomous driving, and other fields. In 2005, Dalal et al. [5] proposed the classic object detection method for human detection based on histograms of oriented gradients (HOG). In 2008, Felzenszwalb et al. [7] proposed the Deformable Part Model (DPM) detection method. This method first uses the gradient operator to compute the HOG features of the object, then applies a sliding window and a support vector machine (SVM) for classification, and performs well on object detection problems.

Compared with classic methods, deep neural networks (DNNs) have strong feature extraction capabilities and high accuracy, and have achieved remarkable results in computer vision and image processing. Benefiting from the rapid development of deep learning, object detection models based on convolutional neural networks (CNNs) have emerged in rapid succession, and detection performance keeps improving. In 2014, Girshick et al. [11] introduced the R-CNN network, which used the selective search method instead of the traditional sliding window and increased the mean Average Precision (mAP) on the VOC2012 dataset by about 30%. Girshick et al. then proposed the Fast R-CNN [9] and Faster R-CNN [28] networks. Faster R-CNN adopts a Region Proposal Network (RPN) to generate candidate boxes, and then classifies these candidates and regresses their coordinates; detection accuracy is greatly improved, and the detection speed is about 5 fps. Methods that divide detection into candidate box generation and prediction are called two-stage methods. Networks that perform these two operations at the same time are called one-stage methods, with YOLO [27] and SSD [21] as representatives. In 2016, Redmon et al. [27] proposed the YOLO network, which is characterized by combining candidate box generation and classification-regression into one step. The feature map is divided into S × S cells (S is a constant, 7 in YOLO-V1), and each cell makes predictions, which greatly reduces computational complexity and accelerates object detection; the frame rate can reach 45 fps. Redmon later proposed YOLO-V2 [26], which increased the mAP on the VOC2007 dataset from 67.4% to 76.8%; however, since each cell is responsible for predicting only one object, recognition of occluded objects is not good enough. In April 2018, the third version (YOLO-V3) [25] was released; its mAP-50 on the COCO dataset increased from the 44.0% of YOLO-V2 to 57.9%. Compared with RetinaNet (mAP-50 of 61.1% [20]), which needs about 98 ms/frame at an input size of 500 × 500, YOLO-V3 needs only 29 ms/frame at an input size of 416 × 416, achieving high accuracy while ensuring speed.

The YOLO method generalizes better than R-CNN methods when used in different fields. Therefore, we propose an efficient real-time drone detection algorithm based on YOLO-V3, with the final algorithm optimized specifically for effective drone object detection.

The main contributions of this paper are: 1) designing and building a CNN to solve the problem of the excessively large number of parameters in YOLO-V3; 2) using densely connected modules to enhance the interlayer connections of the CNN and further strengthen the connections between dense neural network blocks; and 3) improving YOLO-V3's multi-scale detection by expanding the three-scale detection to four-scale detection to increase the accuracy of detecting small objects such as drones.

The paper is arranged as follows. Section 2 presents related work. Section 3 presents a brief introduction to YOLO-V3. The proposed algorithm is described in Section 4. Section 5 presents the experimental results, datasets, and analyses, and Section 6 provides the conclusion and future work.

2 Related work

One of the most essential tasks in computer vision is object detection, which determines the presence or absence of specific features in an image [6, 16]. When features are detected, an object has been detected, and that object can be classified as belonging to one of the pre-defined categories; a bounding box is then expected around the central point of that object or around the object itself. In general, there are three main kinds of object detectors, namely:

  • Classic Detectors: Classic detectors operate on the principle of the sliding window, applying a classifier over a pre-defined image grid. The most famous are the Viola-Jones face detector [33], which used Haar features, cascade classifiers, and AdaBoost training together with the sliding-window technique, and the HOG method for pedestrian detection [5]. CNNs also have a long history in computer vision: LeCun et al. [34] showed early success using supervised back-propagation networks for digit recognition, and more recently the convolutional network of Krizhevsky et al. [32] achieved competition-winning results on large benchmark datasets of more than one million images, such as ImageNet, illustrating that the remarkable success of the CNN as a feature extractor for image recognition and classification has a considerable effect on object detection. The Dalal-Triggs detector, which won the 2006 PASCAL object detection challenge, used a single filter on histogram of oriented gradients (HOG) features to represent an object category. It uses a sliding-window approach, where a filter is applied at all positions and scales of an image; we can think of the detector as a classifier that takes as input an image, a position within that image, and a scale, and determines whether there is an instance of the target category at the given position and scale. Since the model is a simple filter, the score is computed as β · Φ(x), where β is the filter, x is an image with a specified position and scale, and Φ(x) is a feature vector (a sliding-window scoring sketch is given after this list). A major innovation of the Dalal-Triggs detector was the construction of particularly effective features: object appearance is characterized by histograms of orientations of local edge gradients binned over a dense image grid [5]. The Dalal and Triggs (D&T) method trains an object-level HOG template for detection using bounding box annotations; since the features are binned, the detector is robust only to image deformation within the bins.

  • Two-stage Detectors: After the development of deep learning, two-stage detectors appeared and outperformed the first kind of detectors. These modern methods use region proposal techniques to first create a set of candidate proposals that should include all objects in the image while filtering out most negative locations; a classifier is then run on these proposals to split them into background/foreground classes. This kind of object detection is the dominant model today, and region proposal-based CNNs have grown through many refinements. R-CNN [10] uses a selective search method [32] to locate RoIs in the input images and uses a DCN-based region-wise classifier to classify the RoIs independently. SPPNet [12] and Fast R-CNN [9] improve R-CNN by extracting the RoIs from the feature maps. Faster R-CNN [28] can be trained end to end by introducing the region proposal network (RPN), which generates RoIs by regressing anchor boxes. The fastest region proposal-based framework is Faster R-CNN [28]; it works on the entire image and is faster than its antecedents, Fast R-CNN [9] and R-CNN [10]. Region proposals and classification are the two stages in the pipeline. Here, the whole image is passed through the CNN to generate a feature map. Then another CNN, the RPN, produces object proposals and objectness scores from that feature map; the scores and proposals are created using resizable anchor boxes around each pixel of the feature map. Region of Interest (RoI) pooling then brings the object proposals in each region to the same size (a minimal RoI pooling sketch is given after this list). Finally, to classify objects and create bounding boxes, the object proposals are passed through fully connected layers with softmax and linear regression.

  • Later, anchor boxes became widely used in the object detection task. Mask-RCNN [14] adds a mask prediction branch to Faster-RCNN, which can detect objects and predict their masks at the same time. R-FCN [4] replaces the fully connected layers with position-sensitive score maps for better object detection. Cascade R-CNN [2] addresses overfitting at training and quality mismatch at inference by training a sequence of detectors with increasing IoU thresholds. Keypoint-based object detection approaches [22, 31] have been proposed to avoid the disadvantages of anchor boxes and bounding box regression. Other meaningful works address different problems in object detection, e.g., [17, 36] focus on architecture design, [1, 8, 29, 35] on contextual relationships, and [3, 18] on multi-scale unification.

  • One-stage Detectors: The third approach is the most recent and assumes a single detection phase, or single-shot detection; it is close to how humans detect objects. The three dominant methods are SSD [21], YOLO [25,26,27], and RetinaNet [20]. SSD [21] places anchor boxes densely over an input image and uses features from different convolutional layers to regress and classify the anchor boxes. YOLO-V1 uses fewer anchor boxes (dividing the input image into an S × S grid) for regression and classification. YOLO-V2 [26] improves performance by using more anchor boxes and a new bounding box regression method. Their effectiveness grows in roughly this order: SSD has lower average precision (by about 10–20%), RetinaNet is so far the most accurate model for detecting objects, and release 3 of YOLO has roughly the same precision as the two-stage detectors. Nevertheless, YOLO-V3 is the fastest among them while keeping acceptable precision. YOLO-V3 further enhances the performance of the basic YOLO model. For feature extraction, three different scales are employed; the fundamental architecture is Darknet-53, and the multi-layer scaling permits predictions at three different scales. The final layer predicts the object's class and bounding box together with an objectness score. Logistic regression is used for the objectness score, and instead of one softmax, independent logistic classifiers are used for each class (a sketch of this decoding is given after this list). Overall, YOLO-V3 performs similarly to the residual neural network ResNet-152 [13] and better than SSD.
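To make the classic sliding-window formulation concrete, the following Python sketch scores a learned HOG filter β at every position of each level of a feature pyramid, computing β · Φ(x) as a dot product. This is a minimal illustration, not the Dalal-Triggs implementation; the feature pyramid layout, filter shape, and stride are assumed inputs.

```python
import numpy as np

def sliding_window_scores(feature_pyramid, filt, stride=1):
    """Score beta . phi(x) at every position and scale by correlating a
    learned filter (h x w x d) with each level's HOG feature map."""
    fh, fw, _ = filt.shape
    scores = []
    for level in feature_pyramid:  # one HOG feature map per image scale
        H, W, _ = level.shape
        level_scores = np.full((H - fh + 1, W - fw + 1), -np.inf)
        for y in range(0, H - fh + 1, stride):
            for x in range(0, W - fw + 1, stride):
                window = level[y:y + fh, x:x + fw, :]   # phi(x) for this window
                level_scores[y, x] = float(np.sum(filt * window))  # beta . phi(x)
        scores.append(level_scores)
    return scores
```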
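The RoI pooling step of the two-stage pipeline can be sketched as follows: each proposal region on the feature map is divided into a fixed grid of bins and max-pooled per bin, so proposals of any size yield a fixed-size output. This is a simplified NumPy sketch assuming integer RoI coordinates; real implementations (e.g., the Fast R-CNN layer) also handle sub-pixel coordinates and batching, which are omitted here.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one RoI of a (C, H, W) feature map into a fixed-size grid."""
    c, h, w = feature_map.shape
    x1, y1, x2, y2 = roi  # integer coords on the feature map, x2 > x1, y2 > y1
    region = feature_map[:, y1:y2, x1:x2]
    out_h, out_w = output_size
    out = np.empty((c, out_h, out_w), dtype=feature_map.dtype)
    # Bin edges that split the region into an out_h x out_w grid.
    ys = np.linspace(0, region.shape[1], out_h + 1).astype(int)
    xs = np.linspace(0, region.shape[2], out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = region[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out
```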
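For the one-stage branch, the sketch below decodes one YOLO-V3 prediction from a grid cell: the center offsets and the objectness score pass through a logistic (sigmoid) function, the width and height scale an anchor prior exponentially, and each class gets an independent logistic score. The tensor layout and names here are assumptions for illustration only.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_cell(raw, anchor_w, anchor_h, cx, cy, stride):
    """Decode one anchor's raw outputs (tx, ty, tw, th, t_obj, class logits)
    for grid cell (cx, cy), following the YOLO-V3 box parameterization."""
    tx, ty, tw, th, t_obj = raw[:5]
    bx = (sigmoid(tx) + cx) * stride   # box center x, in input-image pixels
    by = (sigmoid(ty) + cy) * stride   # box center y
    bw = anchor_w * np.exp(tw)         # box width, scaled from the anchor prior
    bh = anchor_h * np.exp(th)         # box height
    objectness = sigmoid(t_obj)        # logistic-regression objectness score
    class_scores = sigmoid(raw[5:])    # independent logistic per class
    return (bx, by, bw, bh), objectness, class_scores
```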

3 Brief introduction of YOLO-V3

The network of YOLO-V3 [25] evolved from the networks of YOLO-V1 [27] and YOLO-V2 [26]. The YOLO network turns an object detection problem into a regression problem. Unlike Faster R-CNN, YOLO does not require a region proposal stage; instead it directly regresses the bounding box coordinates and the probability of every class. Compared with Faster R-CNN, this significantly improves detection speed.

Figure 1 shows the YOLO model. Each image in the training set is subdivided by the YOLO network into an S × S grid (for YOLO-V1, S = 7). A grid cell is responsible for detecting an object when the ground-truth center of that object falls in this cell. Each cell also predicts C conditional class probabilities and B bounding boxes with their confidence scores. Confidence is defined as follows:

$$ \mathrm{Confidence}={P}_r(\mathrm{object})\times {IoU}_{pred}^{truth},\qquad {P}_r(\mathrm{object})\in \left\{0,1\right\} $$
(1)
Fig. 1 The YOLO detection model

If a target belongs to a grid cell, Pr(object) = 1; otherwise it is 0. The intersection over union \( {IoU}_{pred}^{truth} \) represents the agreement between the predicted bounding box and the ground truth box. If the cell contains objects, the confidence therefore indicates both whether that cell has objects and how accurate the predicted bounding box is. The non-maximum suppression (NMS) approach is used in YOLO to choose the best bounding box when many bounding boxes detect the same object; a sketch is given below.
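As an illustration of the IoU term in Eq. (1) and of the NMS step, the following Python sketch computes the IoU of two boxes and performs greedy non-maximum suppression. The (x1, y1, x2, y2) box format and the 0.45 threshold are assumptions for the example, not values taken from the paper.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it
    by more than iou_thresh, and repeat on the remainder."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = int(order[0])
        keep.append(best)
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_thresh])
    return keep
```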

Although the first release (YOLO-V1) is much faster than Faster R-CNN, it has a significant detection error. To overcome this problem, the second release (YOLO-V2) introduced the anchor box idea and uses the k-means clustering approach to create appropriate prior bounding boxes. Consequently, the number of anchor boxes needed to reach a similar IoU decreases. The network structure of YOLO-V2 is also improved: a convolution layer replaces the FC layer in the YOLO-V1 output layer. Compared to YOLO-V1, batch normalization, dimension clusters, a high-resolution classifier, fine-grained features, multi-scale training, direct location prediction, and other techniques that greatly improve detection precision are introduced in YOLO-V2.

The third release, YOLO-V3, is an enhanced version of YOLO-V2. To detect the final target, YOLO-V3 uses a multi-scale prediction technique, and its network structure is more complicated than that of YOLO-V2. YOLO-V3 predicts bounding boxes at different scales, and these multiple-scale predictions make it more successful than YOLO-V2 at detecting small targets.

4 The proposed method

The proposed algorithm focuses on improving YOLO-V3 for detecting drones. In this section, three phases of improvements to the YOLO-V3 framework are proposed to make it more suitable for drone detection. The first phase improves the structure of the YOLO-V3 network. The second phase enhances the interlayer connections of the CNN and further strengthens the connections between dense neural network blocks by using densely connected modules. The third phase improves the YOLO-V3 multi-scale detection by expanding the three-scale detection to four-scale detection to increase the accuracy of detecting small objects such as drones.

In the first phase, a CNN is designed and built to resolve the problem of the large number of YOLO-V3 parameters by improving the network structure. In YOLO-V3, the authors introduced the Darknet-53 network based on ResNet. Darknet-53 reduces training difficulty through its residual structure and reduces the number of parameters by using a large number of 1 × 1 and 3 × 3 convolution kernels, with convolutions of stride two replacing max pooling. Even so, the Darknet-53 network is too complex and excessive for single-class object detection. Furthermore, its many parameters slow down detection, increase the demand for data, and increase the complexity of training.

In order to realize real-time detection of drones, this paper takes Darknet-53 as a starting point for maintaining accuracy while reducing the number of parameters, and proposes a CNN with a small number of parameters and relatively low computational complexity as the feature extraction network.

The CNN proposed in this paper is called Darknet-49. In this network, a 1 × 1 convolution kernel is used in the transition module to further reduce dimensionality. Since using a non-linear activation function in a low-dimensional convolutional layer destroys image information to a certain extent, this paper uses a linear activation function in that first convolutional layer, as shown in Table 1.

Table 1 The Darknet-49 network structure

The second phase uses densely connected convolutional networks (DenseNet) between layers. Due to convolution and downsampling, the feature maps shrink while training the neural network, and feature information is lost in transmission. To use feature information more effectively, DenseNet was proposed [15, 30]. DenseNet links each layer to the other layers in a feedforward manner; therefore, the input of the l-th layer is the concatenation of the outputs of the previous layers x0, x1, …, xl − 1:

$$ x_l = H_l\left(\left[x_0, x_1, \dots, x_{l-1}\right]\right), $$
(2)

where [x0, x1, …, xl − 1] denotes the concatenation of the feature maps of layers 0 to l − 1, and Hl is a composite function of batch normalization, activation function, and convolutional layer. This paper uses dense connections to enhance the proposed Darknet-49 network, which consists of 5 densely connected modules and 4 transition modules; a transition module sits between each pair of densely connected modules to reduce the feature map size. In the transition module, the output of the preceding dense module passes in parallel through a stride-2 max pooling branch and a stride-2 convolution branch, and the two outputs are concatenated and used as the input of the next dense module, as shown in Fig. 2. In this way, the connections between modules of the dense neural network are enhanced, the loss during feature transfer between modules is reduced, and feature reuse is strengthened. A sketch of the dense block and transition module is given after Fig. 2.

Fig. 2 An illustration of the transition module
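The dense connection of Eq. (2) and the transition module of Fig. 2 can be sketched as follows. The paper's experiments use MATLAB, so this PyTorch sketch is only an illustration of the structure; the channel counts, growth rate, and LeakyReLU activation are assumptions. Note how the transition module applies a linear (activation-free) 1 × 1 convolution for dimensionality reduction, then concatenates a stride-2 max pooling branch with a stride-2 convolution branch.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal dense block: each layer receives the concatenation of all
    previous feature maps as its input (Eq. 2)."""
    def __init__(self, in_ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.LeakyReLU(0.1),
                nn.Conv2d(ch, growth, 3, padding=1, bias=False)))
            ch += growth
        self.out_channels = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class Transition(nn.Module):
    """Transition between dense blocks (Fig. 2): a linear 1x1 convolution
    reduces dimensionality, then a stride-2 max pooling branch and a
    stride-2 convolution branch run in parallel and are concatenated.
    Assumes even spatial dimensions so both branches halve H and W."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch // 2, 1, bias=False)  # no activation
        self.pool = nn.MaxPool2d(2, stride=2)
        self.conv = nn.Conv2d(out_ch // 2, out_ch // 2, 3, stride=2,
                              padding=1, bias=False)

    def forward(self, x):
        x = self.reduce(x)
        return torch.cat([self.pool(x), self.conv(x)], dim=1)
```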

The third phase improves the multi-scale detection. Standard YOLO-V3 introduces the Feature Pyramid Network (FPN) [19], which simultaneously uses low-level features (high resolution) and high-level features (rich semantic information) and fuses the features of different layers through upsampling to detect objects on three feature layers of different scales. Because most drones are small objects, this article improves the scale detection module in YOLO-V3, expands the original three scale detections to four, and assigns more accurate anchor boxes to small targets in the larger feature maps. Using the intersection over union of rectangular boxes (IOU, written RIOU) as the similarity measure, all targets of the drone training set are clustered with K-means to obtain the anchor box sizes, and the distance function of the K-means clustering is:

$$ d\left(B,C\right)=1-R_{IOU}\left(B,C\right), $$
(3)

where B denotes a bounding box, C denotes a cluster center, and RIOU(B, C) is the overlap ratio of the two boxes. Following the DenseNet concept, this article upsamples the feature layers of the four scales by the corresponding factors and then connects them densely. Densely connecting the detection layers of the various scales also enhances the semantic information of each scale's feature layer, merges the features of the various layers, and improves the precision of small-object coordinate regression to a certain degree. Figure 3 shows the multi-scale detection module proposed in this work, where 2× indicates upsampling with a step size of 2, 4× with a step size of 4, and 8× with a step size of 8. A clustering sketch for the anchor boxes is given after Fig. 3.

Fig. 3 Dense connection for multi-scale detection
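The anchor box clustering of Eq. (3) can be sketched as a K-means loop whose distance is 1 − IoU, with boxes compared by width and height only (aligned at a common origin, as in the YOLO-V2 dimension-cluster idea). The number of clusters and the iteration count below are assumptions for illustration.

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between boxes and cluster centers using width/height only,
    i.e., with all boxes aligned at a common origin."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centers[:, 0] * centers[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_anchors(boxes_wh, k=12, iters=100, seed=0):
    """K-means with d(B, C) = 1 - IoU(B, C) as the distance (Eq. 3)."""
    rng = np.random.default_rng(seed)
    centers = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes_wh, centers), axis=1)
        new_centers = np.array([boxes_wh[assign == i].mean(axis=0)
                                if np.any(assign == i) else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers
```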

5 Datasets and experimental results

The main objective of this research is to modify the operating mechanism of YOLO-V3 to enhance its detection efficiency. As a case study, the modified YOLO-V3 algorithm is tested on drones. This section introduces the specifics of the generated dataset as well as the results, and explains the performance comparison between the two models (YOLO-V3 and the proposed one). The algorithm in this paper is implemented on a computer configured with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60 GHz and an NVIDIA GeForce GTX 965M GPU, with 16 GB memory, CUDA 10.1, and cuDNN 9.1. The operating system is 64-bit Windows 10. We adopted a MathWorks support example (Object Detection Using YOLO-V3 Deep Learning) [24] to train our deep learning models. This example, which had been pre-trained on VOC 2007 + 2012, was selected as the backbone of our CNN network. The results of this study confirm the correctness and effectiveness of the proposed technique from both the accuracy and the computational cost points of view.

5.1 Specification of our dataset

We created our dataset by extracting 5000 images of drones from a number of videos in a publicly available dataset recorded with a DJI Mavic Pro in various environments, with and without people present [23]. This dataset is labeled using the Image Labeler application in MATLAB, a widely adopted labeling tool, as shown in Fig. 4. The Image Labeler application exports the labeled coordinates of the images directly to a .mat file containing the image number and the class(es) with their bounding boxes. The labeled images are used as input to train the proposed YOLO-V3-based architecture. For the object detection task, the label and boundary of each object's ground truth must be defined manually. The PASCAL VOC dataset provides standardized ground-truth marking methods, and this technique is also used for generating the bounding boxes and labels in the drone dataset.

Fig. 4 Sample labeling of drones using the Image Labeler application

As shown in Table 2, the dataset is separated into a training set and a test set at a chosen ratio, such as 6:4 or 7:3; the 7:3 ratio is used in this paper. A splitting sketch is given after Table 2.

Table 2 Drones Dataset distribution
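A minimal sketch of the 7:3 split described above, assuming the labeled images are available as a list of file names (the paper performs this step in MATLAB; Python is used here only for illustration):

```python
import random

def split_dataset(image_files, train_ratio=0.7, seed=42):
    """Randomly split a list of labeled images into train/test at 7:3."""
    files = list(image_files)
    random.Random(seed).shuffle(files)  # fixed seed for a reproducible split
    n_train = int(len(files) * train_ratio)
    return files[:n_train], files[n_train:]
```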

5.2 Recall, precision, and F1 score

The performance of the model is evaluated using recall rate, precision, F1-score and overlap ratio IOU. Precision is defined as the ratio of TP to the total positive predictions, expressed as:

$$ \mathrm{P}=\frac{\mathrm{TP}}{\mathrm{FP}+\mathrm{TP}} $$

The recall rate is:

$$ \mathrm{R}=\frac{\mathrm{TP}}{\mathrm{FN}+\mathrm{TP}} $$

The precision-recall (P-R) curve is obtained by plotting precision on the vertical axis against recall on the horizontal axis. Another evaluation metric used to test model performance is the F1-score, defined as:

$$ \mathrm{F}1=\frac{2\mathrm{P}\times \mathrm{R}}{\mathrm{P}+\mathrm{R}}. $$

TP denotes a true positive, FP a false positive, and FN a false negative. A sketch computing these metrics is given below.
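The three metrics follow directly from the TP, FP, and FN counts; a minimal sketch, with guards against division by zero added as an assumption:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1
```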

In this paper, two indicators evaluate the processing speed of the algorithm: one is the number of images N processed per second by the algorithm, in frames per second (f/s); the other is the time the algorithm requires to process each image, t = 1/N, in ms.

5.3 The experimental results

The dataset is trained with the YOLO-V3 algorithm and the proposed algorithm separately. For both algorithms, training continued until the desired average loss was reached, and then all weights were tested. For drone detection, the weights giving the highest mAP are chosen. Figure 5 shows the training loss curves for the two algorithms.

Fig. 5 Training loss curves for the YOLO-V3 algorithm and the proposed algorithm

Figure 6 presents results illustrating that the system successfully detects drones in different scenes and under different backgrounds. The detection accuracy of YOLO-V3 is 92.35%, while that of the proposed YOLO-V3-based algorithm is 95.60%, a clear improvement. YOLO-V3 also took much longer to train than the proposed model, because Darknet-53 is a network of 53 convolutional layers compared with the 49 layers of the proposed Darknet-49. The mAP is also calculated for both algorithms by testing every weight, and the weights with the highest mAP are chosen for the test. As shown in Table 3, the highest mAP reached by YOLO-V3 trained on Darknet-53 is 0.33, while 0.36 is reached after training YOLO-V3 on the proposed Darknet-49.

Fig. 6 Selected examples of drone detection results on the drone dataset using the proposed algorithm. Each output box is associated with its score, and the detected frame rate is displayed in the upper-left corner

Table 3 Comparison of performance for both models

The precision-recall curve is a popular visualization for object detectors. As shown in Fig. 7, the curve is close to a 90-degree angle for our proposed algorithm because the detector finds accurate bounding boxes for the drone class, with an Average Precision (AP) of 96% on the test set. The area under the curve is smaller for YOLO-V3 (92%) because that detector produces less precise bounding boxes. A sketch of the AP computation is given after Fig. 7.

Fig. 7 Precision-recall curves of the proposed detector and YOLO-V3
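Average Precision summarizes the area under the P-R curve. A minimal sketch under assumed inputs (per-detection confidence scores, a TP/FP flag per detection obtained by IoU matching against ground truth, and the number of ground-truth drones):

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """Area under the precision-recall curve for one class: sort detections
    by confidence, accumulate TP/FP counts, integrate precision over recall."""
    order = np.argsort(scores)[::-1]
    flags = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(flags)
    fp = np.cumsum(~flags)
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)
    # Simple rectangle-rule integration of precision over recall increments.
    return float(np.sum(precision * np.diff(np.concatenate(([0.0], recall)))))
```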

Figure 8 shows the TP, FP, and FN rates of the two algorithms; the averages of their results are shown in a pie graph. As the ratios on the pie graph show, the proposed algorithm learns during training the class probabilities of image regions where no drones are present, yielding a high TP rate and low FP and FN rates.

Fig. 8 TP, FP, and FN rates of YOLO-V3 and the proposed algorithm during detection

6 Conclusion

A real-time drone detection algorithm is proposed based on modifications to YOLO-V3. The modified version includes improvements to the network structure and to the multi-scale detection of YOLO-V3. Moreover, a drone detection dataset is designed and used for training. The newly designed algorithm consists of three phases: in the first phase, a convolutional neural network (CNN) of 49 convolutional layers is proposed instead of the 53 convolutional layers of YOLO-V3; in the second phase, dense connections are used in the proposed CNN, and a maxpool layer is used to improve feature transmission between dense blocks; finally, in the third phase, the scale detection is increased from three in YOLO-V3 to four in our proposed algorithm by employing dense connections to merge feature maps among different scales, to deal with the fact that drones are small objects. The newly designed algorithm is trained and evaluated on the designed drone dataset. The experimental results illustrate that our proposed algorithm is robust for drone detection; the accuracy and average precision have reached 95.60% and 96%, respectively. This work has verified the feasibility and efficiency of a YOLO-V3-based approach for detecting drone objects in images.