1 Introduction

The International Civil Aviation Organization (ICAO) refers to any unmanned aerial vehicle (UAV) as a "drone." Furthermore, the Air Force Special Operations Command (AFSOC) gave three additional names for a drone: a flying robotic system, an unmanned aircraft system (UAS), and a micro air vehicle (MAV) [1, 2]. To simplify, a UAV is an aircraft either controlled by a pilot via an RF remote control or flying autonomously, following a mission planner through a flight controller. In the same context, the NATO (North Atlantic Treaty Organization) classification [3] and Lykou et al. [4] state that UAVs weighing between 2 and 25 kg are called mini-UAVs. A mini-UAV can carry an operating payload of up to 15 kg; for example, the DJI MATRICE 600, which weighs 10 kg, is capable of carrying a 6-kg payload for 16 min [5]. Over the last few years, the manufacturing technology of mini-unmanned aerial vehicles (mini-UAVs), also known as mini-drones, has undergone a significant evolution. Mini-UAVs have multiple uses, including precision agriculture for spraying operations, professional aerial photography, and industrial applications [6]. Figure 1 illustrates two types of mini-UAVs carrying heavy payloads. Figure 1(a) shows the DJI Agras T16 mini-UAV equipped with a spray tank that can hold up to 16 L [7], and Fig. 1(b) shows the DJI MATRICE 600 carrying and releasing a payload. However, the versatility of this type of flying device has made it accessible to everyone, including terrorist groups. Therefore, detecting mini-UAVs before serious attacks is of the utmost interest.

Fig. 1
figure 1

Two examples of mini-UAVs carrying payloads: (a) DJI Agras T16; (b) DJI MATRICE 600

Consequently, in this work, we treat the detection and localization of mini-UAVs in diverse environments as a problem of small object detection in air images. To set the record straight, an air image, or ground-to-aerial perspective image, is a picture of a flying object that must include part of the sky as background, taken by a ground-based imaging system and typically used to monitor no-fly zones or restricted areas.

Real-time object detection applied to UAV monitoring is crucial. Such applications require early detection of objects, so that the detections can later serve as inputs for subsequent reactions. Because of this early detection, the apparent size of the objects is generally small. In general, the aim of small object detection is to detect objects in an image that are small in size, meaning that the objects of interest either have a large physical size but occupy only a small area of the image, or are physically very small [32, 34, 35, 37]. Improvements in object detection algorithms allow faster and more accurate results.

The most recent methods using deep convolutional neural networks (deep CNN) usually involve several steps: first, propose candidate regions of interest in the image, then pass them through the deep CNN for feature extraction, classify them using supervised classification techniques, and finally merge the results to properly place the bounding boxes. Among deep CNN models, there are mainly two types of state-of-the-art object detectors. The first type is the two-stage detectors, such as faster R-CNN (region-based convolutional neural networks) [8], which uses a region proposal network to generate regions of interest in the first stage, and mask R-CNN [9], which sends the region proposals down the pipeline for object classification and bounding box regression. Such models perform well in terms of accuracy, in particular faster R-CNN with 73% mAP, but due to their complex pipeline they perform poorly in terms of speed, with 7 frames per second (FPS), which restricts their application to real-time object detection. The second type is the single-stage detectors, such as SSD (single-shot detector) [10], which runs a convolutional network on the input image only once to compute a feature map, and YOLO (you only look once) [11], which treats object detection as a simple regression problem by taking an input image and learning the class probabilities and bounding box coordinates directly. Such models (SSD and YOLO) are designed with both accuracy and processing time in mind.

In particular, YOLO performs well compared to previous region-based algorithms in terms of speed, with 45 FPS, while maintaining a good detection accuracy of more than 63% mAP. Although its speed and accuracy were good, YOLOv1 (the first YOLO version) [11] made some notable localization errors; in other words, the bounding boxes predicted by YOLOv1 were not accurate. To overcome these deficiencies, the creators of YOLO released YOLOv2 (the second version) [12], in which two limitations, (i) the similarity of the predicted bounding box to the ground truth and (ii) the percentage of total relevant objects correctly classified, were resolved without impairing classification accuracy. Moreover, YOLOv2, also called YOLO9000 [12], reached a speed of 59 FPS and an mAP of 77.8% in experiments on the PASCAL VOC 2007 dataset [13, 14]. Furthermore, the main improvement of YOLOv3 (the third version) [15, 16] is the addition of multi-scale prediction, along with further gains in speed and accuracy: on the MS COCO dataset [17, 18], it obtained a 33% AP score and achieved a real-time speed of approximately 75 FPS on a Tesla V100. In February 2020, Joseph Redmon, the creator of YOLO, stopped working in the field of computer vision. However, YOLOv4 (the fourth version) was released on 23 April 2020 and YOLOv5 on 10 June 2020 by other researchers. While YOLOv4 [19, 20] was released in the Darknet framework, YOLOv5 [20,21,22,23,24,25] was released in the Ultralytics PyTorch framework. Although YOLOv4 can reach 43% AP on MS COCO [26] at 65 FPS, the developers of YOLOv5 claim that, in a YOLOv5 Colab notebook running on a Tesla P100, they measured inference times of up to 0.007 s per image, i.e., 140 frames per second (FPS) [24]. In contrast, YOLOv4 achieved 50 FPS after being converted to the same Ultralytics PyTorch library [21]. They also mention that YOLOv5 is smaller: the YOLOv5 weight file is 27 megabytes, whereas the YOLOv4 weight file (with the Darknet architecture) is 244 megabytes, so YOLOv5 is about 88% smaller than YOLOv4 [52].

The development of new versions of YOLO has not stopped. On 28 October 2021, Fang et al. [53] launched YOLOS (you only look at one sequence), a series of object detection models based on the vanilla Vision Transformer with the fewest possible modifications, region priors, and inductive biases of the target task. In addition, Sahin and Ozer [55] recently released an improved YOLO framework, called YOLODrone, for detecting objects in aerial images taken by drones. To summarize, YOLOv5 claims to be fast and has a very light model size compared to the contemporary YOLO series [54].

This paper focuses on detecting mini-drones in ground-to-aerial perspective images, more precisely on the AI techniques used for early detection and localization. The goal is to obtain a real-time, accurate deep-CNN object detector able to correctly detect and locate mini-drones that may carry a payload, in order to trigger a neutralization system. The main contributions of this work can be summarized as follows:

  1.

    We collect air images of flying mini-UAVs in a real environment using our Dahua multi-sensor camera [30], the majority of which contain flying mini-UAVs in poor visibility conditions. Subsequently, we build a custom dataset of air images of different types of flying mini-UAVs, called the "Mini-UAVs air image dataset," which provides a benchmark to evaluate the performance of the proposed detection model. This custom air image dataset will be made public for future research.

  2.

    We propose a strategy of instance (object) augmentation, in which we add tiny/small objects to the samples of our air image dataset. In this work, we denote by tiny/small objects the mini-UAVs that occupy a small portion of the field of view in a real-world environment.

  3.

    We develop a mini-UAV detection model by redesigning the YOLOv5 object detector [21, 23] from scratch. The redesigned model, named "Upgraded-YOLO" (i.e., we upgrade, or mutate, the model by modifying its internal structure and narrowing its depth to adapt it to the detection of tiny/small objects), aggregates more shallow feature information to focus on the detection of small flying objects in air images.

Experimental results show that our proposed Upgraded-YOLO outperforms the general YOLOv5 and other contemporary object detectors in detecting flying mini-UAVs in air images.

The remainder of this paper is organized as follows. Section 2 reviews related work. We present the instance augmentation strategy in Sect. 3.1.3 and the mini-UAV real-time detection algorithm in Sect. 3.2. The results are discussed in Sect. 4, and Sect. 5 concludes the study.

2 Related Work

2.1 Overview of Low-Visibility Conditions in Aerial Perspective

Tunisia's latitudinal extent and its north-to-south geographical stretch give it contrasting climatic zones: sub-humid in the extreme north and desert, with a dusty environment, in the south. The Tunisian climate is therefore characterized by a number of meteorological phenomena, or meteors (see Fig. 2), which disturb visibility and consequently affect video surveillance and the detection of flying drones from the ground.

Fig. 2
figure 2

Examples of phenomena observed in the Tunisian atmosphere. Most of them cause significant visibility degradation: (a) fog, (b) mist, (c) haze, (d) dust storm

Under such conditions, many computer vision and image processing algorithms suffer from visibility degradation, since most of them assume clear scenes under good weather. Therefore, in our work, we took these phenomena into consideration when constructing our training air image dataset (which contains flying mini-UAVs). Among these phenomena, the following are particularly noteworthy: (a) Fog is produced by the suspension in the atmosphere of very small water droplets or ice particles. It reduces horizontal visibility at the earth's surface to less than 1 km, and the humidity is close to 100% (see Fig. 2a). (b) Mist is produced by microscopic water droplets suspended in the atmosphere. It appears grayish, and the visibility is between 1 and 5 km (see Fig. 2b). (c) Haze is the suspension in the atmosphere of dry, extremely small, invisible particles resulting from fumes or airborne dust, sand, or even sea salt. It appears bluish against a dark background and yellowish against a light background. Visibility in haze is between 1 and 5 km, and the relative humidity can reach 60% (see Fig. 2c). (d) A dust storm is caused by dust or sand particles being lifted forcefully from the ground by a strong, turbulent wind, usually reducing visibility to less than 1 km (see Fig. 2d).

2.2 Issues in Object Detection

Deep detectors generally consist of two parts: a backbone pre-trained on ImageNet, and the head, which is the main part used to predict the category and bounding box of the object. In addition, object detectors developed in recent years usually have layers inserted between the backbone and the head, typically used to collect feature maps at different stages; this part is called the neck of the object detector [21, 24]. The detector needs to meet the following conditions:

  1.

    A larger input network scale (resolution), to detect multiple small objects;

  2.

    More layers, providing a larger receptive field to cover the increased scale of the input network;

  3.

    More parameters, to improve the model's ability to detect multiple objects of different sizes in a single image.

In summary, a general object detector consists of the parts presented in Fig. 3.

Fig. 3
figure 3

Concept of architectural object detection for aerial perspective image

Despite these works, research in this area is far from complete and many difficulties remain. An interesting summary of some of the challenges is presented in the review by Agarwal et al. [28].

  • Scale variance: the variation in size of the objects to be detected is a major obstacle, especially when the gap is large. The image pyramid approach is one of the oldest and most effective methods for detecting objects at different scales; its disadvantage is the high computational cost of repeating the convolutions after each scaling step. Data augmentation can be used as an alternative, enriching the training examples by applying transformations to the original images. However, the approach most used by state-of-the-art detection models (e.g., YOLOv5) is anchors: pre-calculated bounding boxes at different scales and aspect ratios, provided as a reference during training to ensure detection at different scales (a minimal anchor-generation sketch is given after this list).

  • Rotation variance: often addressed by data augmentation, applying rotations to the training images so that the network sees rotated examples. This solution is limited to rotations of rigid objects (e.g., a drone) and does not apply to deformable objects (e.g., a cat).

  • Domain adaptation: most known detection networks are pre-trained on huge datasets such as ImageNet or COCO [17]. These models therefore have a very high generalization capacity, except in very specific use cases where the content is very different, in which case domain adaptation is necessary. The most widely adopted solution is transfer learning, which re-trains the network by replacing its top layers and fine-tuning on a small set of data corresponding to the task at hand.

  • Occlusions: this problem, present in most applications, is an issue since part of the information is hidden. Providing examples containing occlusions in the training dataset may partially solve the problem but cannot represent all forms of occlusion.

  • Small objects: detecting small objects is more difficult than detecting medium or large ones, due to factors such as the lack of associated information, inaccurate localization, and confusion of objects with the image background. Solutions vary in complexity, from simple upscaling, through shallow networks and coarse-to-fine networks, to super-resolution methods that can be implemented with a GAN learning to represent small objects at higher resolutions. In addition, low image resolution can cause the same problems and thus also calls for a super-resolution method.
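To make the anchor mechanism mentioned in the scale-variance item above concrete, the following minimal Python sketch (our illustration, not code from any cited detector) generates anchor-box sizes for one feature-map cell from a few scales and aspect ratios; the numeric values are assumptions chosen only for readability.

```python
# Illustrative sketch: anchor box (w, h) sizes for one grid cell, built from
# a base size, a set of scales, and a set of aspect ratios (ratio = h / w).
import itertools

def make_anchors(base_size, scales=(1.0, 2.0, 4.0), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) anchor sizes in pixels for one grid cell."""
    anchors = []
    for s, r in itertools.product(scales, ratios):
        area = (base_size * s) ** 2
        w = (area / r) ** 0.5   # width shrinks as the ratio h/w grows
        h = w * r
        anchors.append((round(w, 1), round(h, 1)))
    return anchors

if __name__ == "__main__":
    # e.g., anchors attached to a stride-8 feature map of a 416x416 input
    print(make_anchors(base_size=8))
```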

2.3 Visualization of YOLOv5 Network Structure

YOLO is a regression-based technique. Instead of selecting relevant parts of an image, it predicts classes and bounding boxes for the entire image in a single run of the algorithm. The idea of YOLO originates from extending the basic CNN idea to joint classification and detection. The YOLO series (from YOLOv1 to YOLOv7) is a regression method based on deep learning, and the release of YOLOv5 that we modified in this work [20,21,22,23,24] is essentially built on the structure of YOLOv3 [15].

As shown in Fig. 4, the YOLO series architecture is divided into three functionally different parts, called backbone network, neck network, and head or detect network [19, 24]. This is a division found in the architecture of many recent image detection models [29].

Fig. 4
figure 4

Basic architecture of the YOLO series network presented as backbone, neck, and detect (head)

The backbone is a convolutional neural network that extracts feature information from the input image through multiple convolution and pooling operations, aggregating fine-grained information to form image features.

The backbone is the body of the network that underpins all the decisions made by the network. In simple terms, it can be seen as a converter that transforms the input image, a data format difficult for AI (artificial intelligence) to process directly, into a set of features (such as the presence of shapes, colors, and textures) from which it is easier to recognize objects. It is thus composed of a series of successive layers.

As shown in Fig. 5, the backbone contains four different layer types: the focus structure, CBNS (convolution, batch normalization, and SiLU activation function), C3B (bottleneck with 3 CBNS), and SPP (spatial pyramid pooling).
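As an illustration of the CBNS block described above, the following minimal PyTorch sketch chains a convolution, batch normalization, and the SiLU activation; the class name, channel widths, and kernel sizes are our assumptions (the official YOLOv5 code implements a similar block under a different name).

```python
# Minimal sketch of a CBNS block: Conv2d -> BatchNorm2d -> SiLU.
import torch
import torch.nn as nn

class CBNS(nn.Module):
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# quick shape check: a 416x416 RGB image downsampled once with stride 2
x = torch.randn(1, 3, 416, 416)
print(CBNS(3, 32, k=3, s=2)(x).shape)  # torch.Size([1, 32, 208, 208])
```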

The backbone is usually trained separately on image classification competitions such as the ImageNet challenge [56], which include hundreds of thousands of images with a wide range of content such as animals, vehicles, and plants. This diversity of content forces the backbone to learn a wide variety of features in terms of size, color, and shape of the elements it observes and thus be more robust and able to extract useful features regardless of the image presented to the backbone.

Fig. 5
figure 5

YOLOv5 structure diagram

The second part of YOLOv5 is the neck network, a series of feature aggregation layers that mix and combine image features. As shown in Fig. 5, this part contains four different layer types: CBNS, C3B, concatenation, and upsampling. The role of the neck is to extract the relevant features from all the layers of the backbone and combine them into features useful for the detection task. Indeed, not all the layers of the backbone learn the same information: the first layers, generally of higher spatial resolution, detect features that are often simpler (the presence of lines, colors) and smaller, while the last, lower-resolution layers tend to provide more complex features (e.g., a combination of specific shapes and colors, such as a metal circle with a hole for a car rim) and larger objects. The neck makes it possible to integrate and combine features of different resolutions and complexities, allowing detection of small and large, simple and complex features.

Finally, the head is responsible for the final decision of the network. Based on the information provided by the neck, it detects the elements of interest by drawing bounding boxes around them and gives the nature of every object present in each bounding box.

In terms of general architecture, YOLOv5 is similar to its predecessors and to other models in the literature, so the real reasons for the difference in performance lie elsewhere. The bag of freebies is a set of enhancements with no impact on the architecture of a network, which can be used "free of charge," with no modification cost on an existing network; it gathers all the improvements that can be applied during training, such as the loss function, data augmentation, and cross mini-batch normalization. The bag of specials, on the contrary, contains improvements that require specific modifications to the architecture of a network; it gathers recent advances in the scientific literature that improve the performance of the network without decreasing its speed [23, 27, 29].

3 Method and Dataset

3.1 Custom Air Image Dataset Construction

To construct our own air image dataset, called “Mini-UAVs air image dataset,” we proceeded through the following steps: data collection, data augmentation, object augmentation, and data annotation.

3.1.1 Methodology of Collection

The dataset was collected using internet videos and our Dahua multi-sensor camera [30]. It mainly includes rotor mini-UAVs, such as four-rotor UAVs (e.g., DJI Phantom 4, DJI Mavic) and six-rotor UAVs (e.g., DJI Matrice 600, DJI Agras T16). Some samples are shown in Fig. 6. A total of 4560 sample images are used in this experiment, divided randomly into 3100 images for training and 1460 images for testing.

Fig. 6
figure 6

Sample images of own custom “Mini-UAVs air image dataset”

3.1.2 Data Augmentation

Data augmentation is a technique that allows researchers to greatly increase the variety of data available for model training without gathering new data. The purpose of data augmentation in the training dataset is thus to create diversity and prevent overfitting by artificially increasing the number of training samples [28]. In our work, traditional data augmentation methods such as adding noise, cropping, flipping, rotation, and brightness and contrast changes are used. Moreover, another technique is proposed, in which a number of meteorological phenomena such as dust, mist, and fog (see Subsect. 2.1) are added to images of our dataset (see Fig. 7).
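To make the meteorological augmentation concrete, the sketch below approximates fog or haze by blending an image with a bright, uniform "airlight" layer, in the spirit of the simple atmospheric scattering model I(x) = J(x)t + A(1 - t). It is a simplified illustration rather than the exact pipeline used in our experiments, and all constants are assumptions.

```python
# Illustrative fog/haze augmentation: alpha-blend the image with an airlight.
import cv2
import numpy as np

def add_fog(img_bgr, transmission=0.6, airlight=240):
    """Blend a uniform airlight into the image; lower transmission = denser fog."""
    img = img_bgr.astype(np.float32)
    fogged = img * transmission + airlight * (1.0 - transmission)
    return np.clip(fogged, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    # stand-in frame; in practice this would be an image from the dataset
    frame = np.random.randint(0, 255, (416, 416, 3), dtype=np.uint8)
    cv2.imwrite("sample_fog.jpg", add_fog(frame, transmission=0.5))
```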

Fig. 7
figure 7

Sample images of the Mini-UAVs air image dataset after data augmentation with meteorological phenomena usually observed in Tunisian climate. (a) Dust. (b) Mist. (c) Fog

3.1.3 Object Augmentation Strategy

The detection of tiny/small objects in an image or video is a research topic in computer vision that aims to develop technologies and techniques for detecting small-size instances [31]. The possible appearances of tiny/small objects are much more numerous than those of other objects because of their limited size, which confuses the detector when it tries to locate these objects among several others in the vicinity or of the same size (or appearance); it is thus difficult to distinguish tiny/small objects from the background. In general, a tiny/small object has two definitions [32,33,34,35,36]: the first relates to the object dimension in the real world, while the other relates to a threshold on the area occupied by the object in the image [32,33,34, 37]. Although all modern detection models are effective for medium and large objects, they are not very efficient at detecting small objects; for example, it is really hard for a model to see a micro/mini-drone flying 2000 m away. This is because there are several obstacles in tiny/small object detection. First, tiny/small objects lack the appearance information required to distinguish them from the background or from similar classes. Second, the possible locations of tiny/small objects are far more numerous, meaning that the precision required for accurate localization is higher [35]. So, in this paper, we demonstrate how to improve the performance of a detector in detecting and classifying small flying objects. In our contribution, based on references [32,33,34,35,36,37,38,39], we group the flying objects according to the distribution shown in Table 1.

Table 1 Size distribution of the mini-UAVs in air images by subheadings

The size of an object is measured as the number of pixels in the bounding box that describes the spatial location of a mini-UAV. To push our model to focus more on tiny/small objects, we perform an augmentation based on the copy-and-paste strategy, which increases the number of tiny objects in each image of our dataset. As shown in Fig. 8, our augmentation strategy consists of two phases. The first phase consists in finding different appearances of tiny/small objects (micro/mini-drones in flight with small appearance) and generating their masks. The second phase consists in searching, in our dataset, for the images that contain objects of bounding box (BB) size \(<32^2\).

Fig. 8
figure 8

Workflow of adding tiny/small instances; BB: a bounding box containing an object

Thus, the tiny/small objects prepared in the first phase are pasted at random locations of these images, as presented in Fig. 9, and the labels of these new tiny/small objects are generated automatically.
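The copy-and-paste step can be sketched as follows. This simplified Python example pastes one masked mini-UAV crop at a random location and emits the corresponding YOLO-format label; the size cap and the demo inputs are illustrative assumptions, not a reproduction of the exact implementation used to build the dataset.

```python
# Illustrative copy-and-paste augmentation for tiny/small instances.
import random
import cv2
import numpy as np

def paste_tiny_object(img, crop_bgra):
    """crop_bgra: BGRA patch whose alpha channel is the object mask."""
    H, W = img.shape[:2]
    h, w = crop_bgra.shape[:2]
    assert h * w < 32 * 32, "only tiny/small instances are pasted (illustrative cap)"
    x0 = random.randint(0, W - w)
    y0 = random.randint(0, H - h)
    alpha = crop_bgra[:, :, 3:4].astype(np.float32) / 255.0
    roi = img[y0:y0 + h, x0:x0 + w].astype(np.float32)
    img[y0:y0 + h, x0:x0 + w] = (alpha * crop_bgra[:, :, :3] +
                                 (1 - alpha) * roi).astype(np.uint8)
    # YOLO label: class 0 (mini-UAV), normalized center coordinates and size
    label = (0, (x0 + w / 2) / W, (y0 + h / 2) / H, w / W, h / H)
    return img, label

if __name__ == "__main__":
    base = np.zeros((416, 416, 3), np.uint8)                         # stand-in air image
    drone = np.random.randint(0, 255, (16, 20, 4), dtype=np.uint8)   # fake BGRA crop
    _, yolo_label = paste_tiny_object(base, drone)
    print(yolo_label)
```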

Fig. 9
figure 9

Sample images of our dataset before (a and c) and after (b and d) tiny/small object augmentation

Figure 10 shows the proportions of mini-UAVs (flying objects) of different sizes in our custom dataset, before and after object augmentation. This figure illustrates our contribution regarding the increase of tiny/small objects in the air images of the "Mini-UAVs air image dataset" (train and test sub-datasets).

As shown in Fig. 10, the train and test datasets contain more tiny/small objects than medium and large objects: more than 50% of the objects are tiny (\(area<16^2\)), more than 16% are small (\(16^2<area<32^2\)), more than 4% are medium (\(32^2<area<96^2\)), and less than 3% are large (\(area>96^2\)).

Fig. 10
figure 10

Size distribution of tiny/small objects in air images before (a) and after (b) tiny/small object augmentation

3.2 Upgraded-YOLO for Mini-UAV Object Detection in Air Images

3.2.1 Model Architecture

In order to implement an optical early-warning detection system, a flying target (i.e., an unauthorized mini-UAV), which necessarily has a small or even tiny appearance, must be detected. The size of a distant mini-UAV against the sky background is very small, and the receptive field configuration of YOLOv5 is not sufficient to detect these tiny flying objects. This is the reason for improving the YOLOv5 architecture. As shown in Fig. 11, we made two improvements to the original YOLOv5 architecture:

  i.

    A fourth scale (marked with a dashed yellow rectangle in Fig. 11) is added to the three scales of YOLOv5 feature maps to capture more texture and contour information of tiny/small objects such as mini-UAVs.

  ii.

    Feature maps from the backbone network are brought into the added fourth scale (represented by the red line) to reduce feature information loss of mini-UAVs.

Fig. 11
figure 11

Anatomy of the Upgraded-YOLO for mini-UAV detection in air images

The final part of YOLOv5 consists of three detection tensors. YOLOv5 applies 8×, 16×, and 32× downsampling of the input image to detect objects at different resolutions. For example, given an image of resolution 416×416 as input, features are extracted by the YOLOv5 backbone, and boxes are predicted at 3 different scales, expressed as T1, T2, and T3 in the following, to precisely detect targets of different sizes. In our experiments on the Mini-UAVs air image dataset, we predict 3 boxes at each scale, which means the tensor is N×N×[3*(1+1+4)] for 1 confidence score (objectness), 1 class (mini-UAV) prediction, and 4 bounding box coordinates. Here, N is the feature map size of T1, T2, and T3, namely 13, 26, and 52, respectively. The problem of lacking appearance information is related to image resolution: if the image resolution is low, the detector may be prevented from detecting very small objects, since the information needed to detect them is very limited. Indeed, in YOLOv5, if an object of interest occupies 8×8 pixels in an image of resolution 416×416, it is represented by only one pixel in the final feature maps, and any object smaller than 8×8 disappears. This architecture of YOLOv5 is therefore insufficient for the detection of tiny objects. The main idea of our proposal is thus to add a detection level (scale 4 in Fig. 11) with a higher resolution, able to extract more features for tiny objects. For this purpose, we added a level that reduces the resolution only four times (i.e., the input image is downsampled with a stride of 4). Since our proposed architecture aims at detecting tiny objects, this higher-resolution detection level generates a tensor T4 of size 104×104×18. Its addition consists of adding seven layers, indicated in Fig. 11 by a yellow box, in which an upsample layer increases the resolution and its output is concatenated with the output of layer three of the backbone. In addition, the connection represented by the red line is added to bring feature information from the backbone network into the added fourth scale of the neck network. Based on the idea of residual networks, this connection improves gradient back-propagation, prevents the gradient from vanishing, and reduces the loss of feature information of very small flying objects.
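The added fourth scale can be summarized by the following simplified PyTorch sketch, which is not the full Upgraded-YOLO definition: a stride-8 neck feature map is upsampled to stride 4, concatenated with an early backbone feature map (the red connection in Fig. 11), and passed to a 1×1 detection convolution that produces the 104×104×18 tensor T4 (3 anchors × (1 objectness + 1 class + 4 box coordinates)). The channel widths are assumptions.

```python
# Simplified sketch of the added fourth (stride-4) detection scale.
import torch
import torch.nn as nn

class FourthScaleHead(nn.Module):
    def __init__(self, c_neck=128, c_backbone=64, n_anchors=3, n_outputs=6):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Conv2d(c_neck + c_backbone, c_neck, kernel_size=3, padding=1)
        self.detect = nn.Conv2d(c_neck, n_anchors * n_outputs, kernel_size=1)

    def forward(self, p3_neck, p2_backbone):
        x = torch.cat([self.up(p3_neck), p2_backbone], dim=1)  # stride-4 fusion
        return self.detect(torch.relu(self.fuse(x)))

p3 = torch.randn(1, 128, 52, 52)    # stride-8 neck feature for a 416x416 input
p2 = torch.randn(1, 64, 104, 104)   # stride-4 backbone feature (red connection)
print(FourthScaleHead()(p3, p2).shape)  # torch.Size([1, 18, 104, 104]) -> T4
```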

3.2.2 The Loss Function

The loss of YOLOv5 [22] is a multi-task loss that contains three terms: the first for the localization or bounding box regression loss (denoted \(Loss^{box}\)), the second for the classification loss (denoted \(Loss^{class}\)), and the third for the object or confidence loss (denoted \(Loss^{obj}\)) [29]. The total loss (\(Loss^{Total}\)) can be written as:

$$\begin{aligned} Loss^{Total}=Loss^{box}+Loss^{obj}+Loss^{class} \end{aligned}$$
(1)

Next, we will go into the details of the three losses we used in the proposed Upgraded-YOLO network [40].

$$\begin{aligned} Loss^{box}=\lambda _{box}*\sum \limits _{i=0}^{S^2}\sum \limits _{j=0}^{B}L_{i,j}^{obj}*(1-CIoU_{pred_{box}}^{truth_{box}}) \end{aligned}$$
(2)
$$\begin{aligned} Loss^{obj}=&-\lambda _{obj}*\sum \limits _{i=0}^{S^2}\sum \limits _{j=0}^{B}L_{i,j}^{obj}*[\alpha *P_{truth}^{obj}*(1-P_{pred}^{obj})^{\gamma }*log(P_{pred}^{obj})\\&+ (1-\alpha )*(1-P_{truth}^{obj})*(P_{pred}^{obj})^{\gamma }*log(1-P_{pred}^{obj}) ] \end{aligned}$$
(3)
$$\begin{aligned} Loss^{class}=&-\lambda _{class}*\sum \limits _{i=0}^{S^2}\sum \limits _{c\in K}L_{i}^{obj}*[\alpha *P_{truth}^{c}*(1-P_{pred}^{c})^{\gamma }*log(P_{pred}^{c})\\&+ (1-\alpha )*(1-P_{truth}^{c})*(P_{pred}^{c})^{\gamma }*log(1-P_{pred}^{c}) ] \end{aligned}$$
(4)

where \(\lambda _{box}\), \(\lambda _{obj}\), \(\lambda _{class}\) are hyperparameters or scalars to weight each loss function,

  • B is the number of bounding boxes predicted for each grid cell,

  • \(S^2\) is the number of cells (grids) that input images are divided into,

  • K denotes the number of classes,

  • \(P_{truth}^{c}\) equals 1 if the ground truth belongs to class c and 0 otherwise (binary indicator),

  • \(P_{pred}^{c}\) is the predicted probability for class c,

  • \(P_{truth}^{obj}\) equals 1 if the ground-truth bounding box contains an object (drone) and 0 otherwise,

  • \(P_{pred}^{obj}\) is the probability the predicted bounding box contains an object inside,

  • \(\gamma \in [0,+\infty )\) is a focusing parameter or modulating factor,

  • \(\alpha \in [0,1]\) is a balancing parameter, also useful for addressing class imbalance,

  • The loss is similar to the categorical cross entropy; they would be equivalent if \(\gamma =0\) and \(\alpha =1\),

  • Here \(L_{i,j}^{obj}\) and \(L_{i}^{obj}\) are indicator functions such that \(L_{i,j}^{obj} =1\) if box j and cell i are matched together (i.e., an object appears in cell i and the j-th box detects it) and 0 otherwise, and \(L_{i}^{obj} =1\) if cell i contains an object and 0 otherwise.

  • The \(CIoU_{pred_{box}}^{truth_{box}}\) is called the complete-IoU between the predicted box and the ground-truth box [41].

  • \((1-CIoU_{pred_{box}}^{truth_{box}})\) is the complete IoU loss which ensures three geometric measures, i.e., overlap area, central point distance, and aspect ratio [41].
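For reference, the CIoU term appearing in the box loss of Eq. (2) can be computed as in the following PyTorch sketch, which follows the definition in [41] (IoU penalized by the normalized center distance and an aspect-ratio consistency term). It is an illustrative reference implementation, not the exact code of our detector.

```python
# CIoU between predicted and ground-truth boxes given as (x1, y1, x2, y2).
import math
import torch

def ciou(box_pred, box_true, eps=1e-7):
    # intersection and union areas
    ix1 = torch.max(box_pred[..., 0], box_true[..., 0])
    iy1 = torch.max(box_pred[..., 1], box_true[..., 1])
    ix2 = torch.min(box_pred[..., 2], box_true[..., 2])
    iy2 = torch.min(box_pred[..., 3], box_true[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = box_pred[..., 2] - box_pred[..., 0], box_pred[..., 3] - box_pred[..., 1]
    w2, h2 = box_true[..., 2] - box_true[..., 0], box_true[..., 3] - box_true[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # squared center distance, normalized by the enclosing box diagonal
    cw = torch.max(box_pred[..., 2], box_true[..., 2]) - torch.min(box_pred[..., 0], box_true[..., 0])
    ch = torch.max(box_pred[..., 3], box_true[..., 3]) - torch.min(box_pred[..., 1], box_true[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((box_pred[..., 0] + box_pred[..., 2] - box_true[..., 0] - box_true[..., 2]) ** 2 +
            (box_pred[..., 1] + box_pred[..., 3] - box_true[..., 1] - box_true[..., 3]) ** 2) / 4
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v   # the box loss of Eq. (2) then uses 1 - CIoU
```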

4 Experimental Results and Evaluation

4.1 Experimental Setting

Experiments in this paper were performed using the PyTorch 1.9 machine learning framework. At the beginning of our work, training trials of 100 epochs were performed on the Kaggle platform with an NVIDIA TESLA P100 GPU with 16 GB of memory, driver version 450.119.04, and CUDA version 11.0. The neural network training and testing were then performed on a workstation equipped with an AMD Ryzen 9 5900X 12-core processor at 3.70 GHz, an NVIDIA GeForce RTX 3070 GPU, and an NVIDIA TESLA T4 GPU. Model building, training, and result testing were all completed under PyTorch 1.9, using CUDA 11.1.0 for parallel computing and integrating cuDNN 8.2.2 into the PyTorch framework to accelerate computation. The Adam optimizer with a learning rate of 0.001 was used for training optimization. Training was performed with an input image size of 416×416, a batch size of 8 per GPU, and 8 dataloader workers.
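The optimizer and dataloader settings above can be expressed as the minimal PyTorch sketch below; the random tensors and the one-layer placeholder network are stand-ins for the Mini-UAVs air image dataset and the Upgraded-YOLO model, which are not reproduced here.

```python
# Sketch of the training configuration: Adam (lr=0.001), 416x416 inputs,
# batch size 8, 8 dataloader workers.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

images = torch.randn(16, 3, 416, 416)   # stand-in "air images"
targets = torch.zeros(16, 6)            # dummy labels
loader = DataLoader(TensorDataset(images, targets),
                    batch_size=8, num_workers=8, shuffle=True)

model = nn.Conv2d(3, 18, kernel_size=1)                    # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, lr = 0.001
```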

4.2 Evaluation Metrics for Mini-UAV Detection

The following standard criteria are used to quantitatively evaluate and compare detection accuracy. Intersection over union (IoU) is one of the most used tools in machine learning to measure the accuracy of an object detection model. This criterion compares the predicted region with the ground-truth region in a way that is proportional to the size of the object being searched for; here the compared regions are the bounding boxes that locate drones. The IoU is therefore the overlap ratio between the detected box (\(B_D\)) and the ground-truth box (\(B_{GT}\)) [42,43,44].

$$\begin{aligned} IoU = \frac{|B_D \bigcap B_{GT}|}{|B_D \bigcup B_{GT} |} \end{aligned}$$
(5)

For object detection, IoU is used to determine how many objects were detected correctly and how many false positives were generated. Generally, an IoU threshold of 0.5 is targeted for each prediction at the training stage: if the network predicts an object with a detected box that overlaps the ground-truth box by at least \(50\%\), it is considered a true prediction. By defining the true positives (TP) as the number of correct detections with \(IoU > 0.5\), the false positives (FP) as the number of false detections (e.g., a bird detected as a drone) or objects detected more than once, and the false negatives (FN) as the number of drones that are not detected or detected with \(IoU \le 0.5\), the precision and recall scores [45], used to measure the performance of a detection model, are calculated as:

$$\begin{aligned} P = \frac{TP}{TP + FP} \end{aligned}$$
(6)
$$\begin{aligned} R = \frac{TP}{TP +FN} \end{aligned}$$
(7)

where precision shows how accurately the model has detected the drones, and recall is the number of truly detected drones over the sum of truly detected and undetected drones in the image. In order to properly evaluate the performance of our object detector, the average precision (AP) is used together with precision (P) and recall (R). From these indicators, the precision can be drawn as a function of the recall; computing the area under this curve gives the average precision (AP) of the proposed model.

$$\begin{aligned} AP = \int \limits _{0}^{1}{P(R)dR} \end{aligned}$$
(8)

Therefore, the mean average precision (mAP) is defined as the mean of AP across all categories (M):

$$\begin{aligned} mAP = \sum \limits _{i=1}^{M}{\frac{AP_i}{M}} \end{aligned}$$
(9)

If the IoU threshold is set to 0.5 (50%), the mAP is denoted \(mAP\_0.5\) or mAP@50. \(mAP\_0.5:0.95\) denotes the mAP averaged over IoU thresholds from 0.5 to 0.95.
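The metrics above can be summarized by the short sketch below, which computes the IoU of Eq. (5), precision and recall from TP/FP/FN counts (Eqs. (6)-(7)), and AP as the area under the precision-recall curve (Eq. (8)) with a simple trapezoidal rule; it illustrates the definitions and is not the evaluation code used in our experiments.

```python
# IoU, precision/recall, and AP from a precision-recall curve.
def iou(box_a, box_b):
    """Boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall, precision):
    """Integrate P(R) over recall (lists sorted by increasing recall)."""
    ap = 0.0
    for i in range(1, len(recall)):
        ap += (recall[i] - recall[i - 1]) * (precision[i] + precision[i - 1]) / 2
    return ap

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 0.1428... (25 / 175)
```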

4.3 Results

4.3.1 Parameters and Hyper-Parameters Optimization

Our proposed model has two different types of parameters:

Model (network) parameters: these relate to the size and topology of the neural network, which influence model performance. For our model, they are initialized randomly (from scratch) to avoid symmetry, which could otherwise hinder the training process, and are learned during model training.

Hyper-parameters are parameters that influence the speed and quality of the learning process, such as the learning rate and weight decay. They could be determined manually by trying all possible values, but this is very time-consuming because the number of possible combinations is very high. Therefore, in our work, we applied hyperparameter evolution, which uses a genetic algorithm (GA) to automatically find optimal hyper-parameters [46, 47]. Crossover and mutation are the main genetic operators in the GA; here, mutation is used, with a probability of 90% and a variance of 0.04, to create new offspring based on a combination of the best parents from all previous generations [48].

To evolve the hyper-parameters of our model (Fig. 12), we trained it, for 50 epochs, 300 times (300 generations) by maximizing the fitness score, which is defined as follows:

$$\begin{aligned} q_f=0.1 \cdot mAP\_0.5+0.9 \cdot mAP\_0.5:0.95 \end{aligned}$$
(10)
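The evolution loop can be sketched as follows: the fitness of Eq. (10) is maximized over generations, each generation mutating the best parent with a 90% probability and a variance of 0.04, in the spirit of the YOLOv5 "evolve" procedure. The hyper-parameter names, their starting values, and the training stub are placeholders, not our actual configuration.

```python
# Sketch of hyper-parameter evolution: fitness of Eq. (10) + Gaussian mutation.
import random

def fitness(map_05, map_05_095):
    return 0.1 * map_05 + 0.9 * map_05_095            # Eq. (10)

def mutate(hyp, p=0.9, sigma=0.04 ** 0.5):
    """Perturb each hyper-parameter of the best parent with probability p."""
    return {k: v * (1 + random.gauss(0.0, sigma)) if random.random() < p else v
            for k, v in hyp.items()}

def train_and_validate(hyp, epochs=50):
    """Placeholder for a real 50-epoch training run; returns (mAP_0.5, mAP_0.5:0.95)."""
    return random.random(), random.random()

best, best_score = {"lr0": 0.001, "momentum": 0.937, "box": 0.05}, 0.0  # illustrative
for generation in range(300):
    candidate = mutate(best)
    score = fitness(*train_and_validate(candidate, epochs=50))
    if score > best_score:
        best, best_score = candidate, score
```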
Fig. 12
figure 12

Fitness (y axis) vs hyper-parameter values (x axis) of the base scenario trained on “mini-UAV dataset.” Yellow indicates higher concentrations. Vertical distributions indicate that a parameter has been disabled and does not mutate

4.3.2 Experimental Analysis and Discussion

Figure 13 shows the performance metrics and loss function curves for training and validation. These curves correspond to the baseline (YOLOv5 small) and the Upgraded-YOLO, both trained on the Mini-UAVs air image dataset. The loss function indicates how well a given predictor models the input data points of a dataset: the smaller the loss, the better the detector models the relationship between the input data and the output targets.

Fig. 13
figure 13

Comparison of common evaluation indicators and loss curves between the baseline (YOLOv5), shown in blue, and the Upgraded-YOLO, shown in red. The first two plots are (a) the bounding box loss (measured by CIoU) and (b) the confidence loss on the validation dataset. The remaining four curves represent performance metrics of the object detection task: (c) mAP_0.5:0.95, (d) mAP_0.5, (e) precision, and (f) recall

The line plots in Fig. 13 show two different types of loss. Those in Fig. 13(a), (b) are the losses associated with a given cell containing an object during training: the confidence (objectness) loss (\(obj\_loss\)) and the predicted bounding box loss (\(box\_loss\)). In other words, the \(box\_loss\) reflects how well the model locates the center of an object and how well the predicted bounding box covers it, while the objectness loss reflects whether there is an object in the predicted bounding box. Based on Fig. 13(a), we observe that as the number of iterations increases, the baseline and Upgraded-YOLO curves gradually converge and the loss values become smaller and smaller; after 250 iterations, the loss values are essentially stable and the networks have converged. On one hand, Fig. 13(a) shows that the baseline (blue curve) has the smallest objectness loss: after 300 training epochs, the confidence loss of the baseline is 0.0007, whereas for the Upgraded-YOLO (red curve) it reaches 0.0014. Indeed, when we added a fourth level to our proposed model, we increased the number of detected object parts, which leads to an increase in the number of bounding boxes predicted by each cell. Moreover, as mentioned by the authors of reference [15]: "If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness"; this means that all box predictions contribute to the objectness loss, which accumulates the losses related to each given cell containing an object during training, and this is why the Upgraded-YOLO has a slightly higher objectness loss.

On the other hand, Fig. 13(b) shows that the box loss of our Upgraded-YOLO is much lower than that of the baseline model, because the mutation of our proposed model (the changes made to the anatomy of YOLO to detect tiny/small objects) improves the prediction of the bounding box covering the target object. The mutation of YOLO, by injecting a neural structure capable of detecting most parts of an object, reduces the error between the ground truth and the predicted bounding box; hence, the box loss pushes the network to predict a better CIoU. Since only the best-fitting boxes in each spatial cell contribute to the box loss, we obtain better results with our model. It can also be seen that the number of true positives (TP), i.e., the number of detections with \(IoU > 0.5\), increases; hence the better results obtained by our model in Fig. 13(c)-(f). The general trend of the performance metrics (\(mAP\_0.5:0.95\), \(mAP\_0.5\), precision, and recall) is roughly the same for the two models (Upgraded-YOLO and baseline), but the baseline curves jitter during almost the whole training process, whereas the jitter of the Upgraded-YOLO during the early stage is shorter and of smaller amplitude. Generally, the jitter observed during the training of a detection model is due to noise and to local minima: every time the optimizer approaches a local minimum, the performance metrics increase, but with a good learning rate the model learns to jump out of these points and the optimizer converges towards the global minimum, which is the solution. Hence, our model converges better towards the global solution. In terms of speed, the Upgraded-YOLO is faster than the baseline; e.g., for mAP_0.5, our model reaches \(91\%\) after 11 epochs, whereas the baseline does not exceed 5%. In terms of accuracy, the evaluation metric curves show that our model is more accurate. The mAP measures the quality of the detection model: the higher the value, the higher the average detection accuracy and the better the performance; a high mAP also denotes a well-trained model. Figure 13(c) shows that, from the start, the \(mAP\_0.5:0.95\) of our model is higher than that of the baseline; it reaches about 86% after 300 epochs, while the baseline reaches \(47\%\) after 150 epochs and then starts overfitting until epoch 300, ending with an \(mAP\_0.5:0.95\) of \(45\%\). As the graphs in Fig. 13(d) show, the mAP_0.5 of our model reaches 92%, whereas the baseline reaches 88%. As shown in Fig. 13(e), the precision (exactness) of the Upgraded-YOLO reaches 0.98 after 300 epochs, whereas the baseline does not exceed 0.91. The recall tells us what proportion of mini-UAVs was correctly predicted; the curves of Fig. 13(f) show that the completeness of our model is better than that of the baseline. The improvement in the precision and recall of our model compared to the baseline can be attributed to the increase in true positives (the number of correctly classified objects). Therefore, an improvement in prediction accuracy can be seen.

Furthermore, Table 2 shows the mAP, precision, and recall of four models, YOLOv3-tiny, YOLOv3, YOLOv5, and ours, after training on our air image dataset. It can be seen that, after 300 epochs, our method has the best performance.

Compared with the results of the baseline (the YOLOv5 model), the precision of the Upgraded-YOLO model is increased by 6.84% and the recall by 9.04%; moreover, mAP_0.5:0.95 is increased by 40.57% and mAP_0.5 by 9.9%. These results confirm what was noted at the beginning of this discussion, namely that the performance of our model is higher than that of the baseline.

Table 2 Model performance evaluation with the custom air image dataset

Moreover, Table 3 shows that after 300 epochs of training, our model has the lowest total loss value, which makes it more accurate and better-performing than the three contemporary object detectors: YOLOv3-tiny, YOLOv3, and YOLOv5.

Table 3 Comparison of loss functions between our model and other contemporary detectors

To highlight the performance of our detector, we compare it with the baseline. The test results are based on 400 frames from YouTube video sequences captured in outdoor environments with different drone models, and from visible-spectrum video clips shot with our Dahua multi-sensor camera. An illustration of the detection results of both the baseline model and the Upgraded-YOLO on some air images (i.e., ground-to-aerial perspective images) is shown in Fig. 14, where the red and green bounding boxes correspond to detections by the Upgraded-YOLO and the baseline, respectively. For instance, in Fig. 14(a), (b), we used two frames of size 1920×1080 taken by our Dahua camera, which contain very distant mini-drones. The Upgraded-YOLO detected the distant mini-drones with a confidence score higher than 0.76, which is higher than that of the baseline (between 0.64 and 0.72). Accordingly, Fig. 14(a), (b) show that our model is effective and outperforms the baseline (the original YOLOv5) in the detection of mini-UAVs with tiny and small appearance.

Fig. 14
figure 14

Comparison of the detection results in air images at diverse distances and with different visibility conditions: (a) and (b) mini-UAVs in very small appearance; (c) and (d) mini-UAVs with medium appearance; (e) and (f) mini-UAVs flying in low visibility conditions

Furthermore, the results in Fig. 14(c), (d) show that the bounding boxes of our model (red) fit the detected mini-UAVs more tightly than those of the original YOLOv5, which is consistent with the previous evaluation and shows that our method has the lowest box loss. Finally, the last images (lack of lighting in Fig. 14(e) and fog in Fig. 14(f)) show that our model performs well even under low-visibility conditions.

5 Conclusion

In this research, deep learning technology was applied to tiny/small flying object detection in air images, and a high-precision mini-UAV detection model based on the YOLOv5 object detector [21] was proposed. We first collected images of mini-UAVs in a real environment using our Dahua Thermal Network PTZ Camera, most of them showing mini-UAVs flying in poor visibility conditions. We then constructed a custom dataset, named the "Mini-UAVs air image dataset," which provides a benchmark to evaluate the performance of the proposed detection model, especially under low-visibility conditions. In addition, we proposed a new instance augmentation strategy to increase the accuracy of our model, which consists of adding tiny/small objects (mini-UAVs) to the images of our custom dataset. We then implemented a mini-UAV detection model based on the recent state-of-the-art YOLOv5 object detector, designed to reduce the total loss. In the proposed detector, called Upgraded-YOLO, a new feature fusion layer was added to capture more feature information about the tiny and small flying objects detected in air images. This paper mainly addresses drone-related threats under the requirement of a real-time flying object detector; however, fast detection still requires a specific hardware configuration. In the future, we will continue to optimize Upgraded-YOLO, especially with a large multi-class drone dataset, and we will deploy and integrate our model with a flying object tracker such as DeepSORT in order to set up an anti-UAV system [49,50,51].