
1 Introduction

Search-and-Rescue (SAR) refers to the search for people who are in danger, or imminent danger, with the aim of rescuing them. Rescue operations must be performed quickly, as any delay can potentially cause injury or even loss of human life. Furthermore, the environments in which they are performed are often hostile, as in the case of post-disaster scenes, low-light situations, inaccessible areas, etc. [6].

In this context, unmanned aerial vehicles (UAVs), commonly known as drones, are increasingly used as technological support tools [10, 23]. In fact, once equipped with high-resolution cameras, drones can provide cost-effective help to emergency rescue operations for several reasons. Swarms of aerial vehicles can quickly spread across a disaster area, providing mobile ad-hoc networks [1]. They can quickly fly over and traverse hard-to-reach regions, such as mountains, islands, deserts, etc., covering large areas with sparse human distribution. They can deliver rescue equipment, such as medicines, much faster than rescue teams. Furthermore, compared to the classic helicopters used for these purposes, drones can fly below the normal altitude of air traffic, have lower costs and faster response times, and can get closer to the area of interest.

The literature is already populated with many use cases where drones have been successfully used in humanitarian settings, e.g. [7, 19]. However, detecting people in live aerial images during SAR inspection flights is still not a trivial task for human operators. First, it requires sustained concentration to perform the flight operation and the search task at the same time. Second, operators may work in poor viewing conditions, mainly due to the small size of the monitors they are equipped with, as well as the limited brightness of a screen viewed outdoors. Therefore, it would be helpful if the visual inspection process were somehow automated using visual patterns that suggest or detect potential humans in the image. Such a system can prove extremely beneficial for SAR operations, mainly because, as previously mentioned, locating victims, who may be unconscious or injured, as quickly as possible is critical to improving their chances of survival.

This goal has motivated research efforts to develop intelligent real-time decision support tools to be mounted directly on board drones, leveraging the integrated yet powerful GPUs available for UAVs. Unfortunately, there is currently still a small body of knowledge on applying pattern recognition and computer vision strategies to this type of problem, e.g. [15, 21]. The large-scale variation and dense distribution of objects characteristic of UAV images pose challenges to human detection, due to different heights, perspectives, and poses of the human body, the presence of many objects in the scene, the very small size of the objects to be detected, etc. Furthermore, most state-of-the-art computer vision algorithms for people detection in images and videos (e.g., [13, 25]), which leverage convolutional neural network (CNN) models, while effective, are usually very expensive from a computational point of view, making their use on flying drones impractical. In fact, despite being equipped with powerful GPUs, a drone still needs fast solutions, given its limited battery as well as the limited bandwidth available to communicate with the operator or the ground station [5].

In light of this, in this paper we investigate how recently proposed lighter versions of the popular YOLO detection algorithm [20], fine-tuned on new datasets specifically designed to help the community focused on this task, can provide an effective and at the same time efficient solution to aid SAR operations with drones. This experimental effort promises to better guide further research towards an acceptable trade-off between accuracy and speed of detection, which is crucial and still missing in this particular domain.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the datasets used and the proposed method. Section 4 presents the experimental results obtained. Section 5 concludes the paper and outlines future developments of our research.

2 Related Work

The use of drones to support SAR operations is becoming increasingly popular. However, there are currently few noteworthy works in the literature that explicitly make use of computer vision techniques to provide a drone with automatic image processing capabilities.

In an early work [16], Martins et al. used traditional Histogram of Oriented Gradient features and a Support Vector Machine classifier to classify frames as containing or not containing people and distinguish between safe and dangerous landing sites. While this approach provides promising prediction results, one of its main drawbacks is detection time.

Thanks to the success of deep learning in many areas, including classification and detection tasks in aerial images [4], research in this direction has rapidly moved towards end-to-end pipelines. Kashihara et al. [11] proposed using a fast object detection model like YOLO to detect people in drone videos. Interestingly, to improve processing time, they did not feed the model with the overall video stream, but proposed to extract snapshots at periodic time intervals. Unfortunately, the video frames they collected are very few, which penalizes the generalizability of the results obtained. Mishra et al. [18] have recently proposed a large dataset for human action recognition from drones, whose main goal is to provide support for SAR. The proposed dataset is characterized by a rich variety in terms of colors, heights, poses and backgrounds. However, the dataset was collected in and around a university campus, making the depicted scenes less realistic for a typical SAR scenario, which involves remote and hostile environments.

Similar to us, Lygouras et al. [15] used a lightweight version of YOLO, TinyYOLOv3 in their case, to aid SAR operations with drones. In particular, they proposed a system to detect swimmers in danger. The novelty of the proposed method is the combination of computer vision with a global navigation satellite system, both for the precise detection of people and for the release of the rescue apparatus. However, the sea provides a completely different background, with its own specificities, from the land-based SAR scenarios we are interested in.

Two new datasets for SAR with drones have recently been released that offer the community new benchmarks to advance research in this field. HERIDAL focuses on Mediterranean and sub-Mediterranean landscapes [3]. The authors have proposed a method that works quite well on these data, which begins by reducing the search space through a visual attention algorithm that detects the salient or most prominent segments in the image; then, the regions most likely to contain a person are selected using a pre-trained and fine-tuned CNN for detection. Similarly, the SARD database has been created to detect victims and people in SAR scenarios in drone images and videos [22]. Actors were involved, who were asked to simulate tired and injured people and classic types of movement of people in nature, such as running and walking. Several state-of-the-art detectors have been used, including YOLOv4, which obtained very promising results. In this paper, we look at how newer algorithms for detecting people in images and videos, especially the lightweight versions of YOLOv5, can further improve detection performance.

Many other works, such as [17] and [21], have used thermal imaging to develop real-time people detection systems. However, it should be noted that in certain circumstances thermal cameras have not proved to be a good solution, for example when the scanned areas reach very high temperatures, as the environment can then emit thermal radiation that exceeds the heat emanating from a human body. These situations limit the use of thermal imaging cameras during the day; however, fusion methods, based on the combination of thermal and optical imaging, appear to be a candidate solution.

3 Materials and Methods

The goal of our research is to provide a model for people detection that could potentially work on board a drone, therefore with limited hardware resources. For this reason, we focused on lightweight versions of YOLO, which was chosen for its well-known speed and accuracy. YOLO (“You Only Look Once”) is an open source model, initially introduced in 2016 by Joseph Redmon et al. [20], which is capable of performing object detection at very high speed, thus making it suitable for real-time systems. In our study we leverage YOLOv5, which at the time of writing is the latest version of YOLO, and also the lightest. Two new datasets available in the literature have been considered, namely the HERIDAL dataset and the SARD dataset. In the following we provide details on the datasets and the YOLO models adopted.

3.1 HERIDAL Dataset

The HERIDAL dataset contains approximately 1700 images of wilderness scenes in various locations, captured from an aerial perspective by drones equipped with a high-definition camera. In particular, the images were captured by various UAVs, from custom solutions to popular ones such as the DJI Phantom 3 or Mavic Pro, at altitudes from 30 m to 60 m. HERIDAL contains \(4000 \times 3000\) full-size, labeled real images, split into 1583 training and 101 testing images. The images consist of realistic scenes of mountains, wilderness or remote places in non-urban areas. Since most search operations are conducted in remote locations outside of urban areas, the emphasis is on land and natural environments. In particular, the images were collected in various locations in Croatia and Bosnia-Herzegovina during mountain hikes, nature trips or mountain rescue exercises. Most of the images contain more than one person, 3.38 people on average. In order to compile a dataset that would be a realistic representation of a real SAR operation, the authors used statistical data and specialist knowledge on SAR operations. In fact, there are many variations of the positions (standing, lying, squatting, etc.) in which a lost person can be found. Nevertheless, this specific information is not labeled, so training and testing were done for human detection only.

3.2 SARD Dataset

To develop the SARD dataset, the authors involved actors who simulated exhausted and injured people as well as classic types of movement. The images were sampled from video recorded at 50 frames per second, with a full HD resolution of \(1920 \times 1080\) pixels, by a high-performance camera mounted on a DJI Phantom 4A drone. All the videos were shot in the Moslavacka Gora area, Croatia, outside the urban area. The positions of the people in the images vary from standard ones (standing, sitting, lying, walking, running) to typical positions of tired or injured people, reconstructed by the actors at their discretion. The actors were nine people of different ages and genders, aged 7 to 55, so as to include differences in movement and posture associated with age and body constitution. As different terrains and backgrounds determine possible events and scenarios in the captured images and videos, the actors were located in various places, from those clearly visible (to the naked eye) to those in the woods, tall grass, shade and the like, which further complicates detection. From the approximately 35 min of recordings, 1981 individual frames with people on them were identified. In the selected images, people have been manually labeled with the bounding boxes typically used to annotate objects. The training set contains 1189 images, in which 3921 people are marked, while the test set contains 792 images, in which 2611 people are marked. Furthermore, each person was labeled as belonging to one of six classes corresponding to different human movements, although some of them are under-represented.

3.3 YOLOv5

YOLO is a regression-based method and is in fact much faster than region proposal-based methods (such as R-CNN [8]), although it is not as accurate. The idea behind YOLO is to cast object detection as a joint regression and classification problem: regression is used to find the bounding box coordinates of the objects in the image, while classification is used to assign each detected object to an object class. This is done in a single step by first dividing the input image into a grid of cells and then computing, for each cell, the bounding box and the relative confidence score for the object in that cell.
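To make the grid-based formulation more concrete, the following minimal sketch illustrates how per-cell raw predictions can be decoded into absolute bounding boxes. It follows the classic YOLO decoding scheme (sigmoid offsets within a cell plus exponential scaling of anchors) rather than the exact YOLOv5 implementation, and the tensor layout and variable names are our own assumptions for illustration only:

```python
import torch

def decode_cell_predictions(raw, anchors, stride):
    """Illustrative decoding of raw YOLO-style outputs into absolute boxes.

    raw:     tensor of shape (S, S, A, 5 + C) holding, for each grid cell and
             anchor, the raw predictions (tx, ty, tw, th, objectness, classes).
    anchors: float tensor of shape (A, 2) with anchor widths/heights in pixels.
    stride:  size of one grid cell in input-image pixels.
    """
    S = raw.shape[0]
    # grid of cell indices (cy = row, cx = column)
    cy, cx = torch.meshgrid(torch.arange(S), torch.arange(S), indexing="ij")
    cx = cx[..., None]  # (S, S, 1), broadcast over anchors
    cy = cy[..., None]

    # box centre: sigmoid offset inside the cell, shifted by the cell index
    x = (torch.sigmoid(raw[..., 0]) + cx) * stride
    y = (torch.sigmoid(raw[..., 1]) + cy) * stride
    # box size: anchor dimensions scaled by an exponential of the raw output
    w = anchors[:, 0] * torch.exp(raw[..., 2])
    h = anchors[:, 1] * torch.exp(raw[..., 3])
    # confidence that the cell/anchor actually contains an object
    obj = torch.sigmoid(raw[..., 4])
    return torch.stack([x, y, w, h, obj], dim=-1)  # (S, S, A, 5)
```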

Although the latest stable version of YOLO is YOLOv4 [2], we used YOLOv5 [9], which is still in development. This latest version was chosen because several empirical results showed that it can reach an accuracy comparable to YOLOv4, but with a considerably smaller model size. Many controversies about YOLOv5 have been raised by the community, mainly caused by the fact that YOLOv5 does not (yet) have a published paper; nevertheless, we preferred to use this latest version so as to obtain experimental results that could be a valid reference for future work, as other projects, such as [24], have done. In some experiments on the well-known COCO dataset [14], YOLOv5s showed much lower training time and model storage size than a custom YOLOv4 model. YOLOv5 also requires less inference time, making it faster than YOLOv4.

The original YOLO architecture features 24 convolutional layers followed by 2 fully connected layers. More generally, a YOLO network is made up of three main parts: a CNN backbone that aggregates image features at different granularities; a series of layers that combine the extracted features; and an output head whose goal is to regress bounding box coordinates and classify objects. YOLOv5 differs from all previous versions; in particular, the major improvements introduced include mosaic data augmentation and the ability to autonomously learn bounding box anchors, i.e. the set of predefined bounding boxes of a certain height and width used to detect objects.

In order to evaluate the best human detection model for SAR operations using the integrated resources available on drones, as mentioned above we tested different architectures based on different versions of YOLOv5. In particular, we considered the less expensive models YOLOv5s and YOLOv5m, standing for the “small” and “medium” size models, respectively. YOLOv5s is the smallest model available among those provided by the Ultralytics repository [9], while YOLOv5m is the medium one. The two models differ in size, speed and mAP (mean average precision) achieved. The size of YOLOv5s is only 14 MB, compared to 41 MB for YOLOv5m; moreover, the former is faster in detection but tends to exhibit a lower mAP.
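As an illustration of the practical difference between the two variants, the sketch below loads both pre-trained checkpoints through PyTorch Hub (which downloads them from the Ultralytics repository) and compares their parameter counts; the image path is a hypothetical placeholder, and the snippet is only meant to show the standard Hub interface, not our experimental pipeline:

```python
import torch

# Download the pre-trained "small" and "medium" YOLOv5 checkpoints via PyTorch Hub
yolov5s = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
yolov5m = torch.hub.load("ultralytics/yolov5", "yolov5m", pretrained=True)

for name, model in [("YOLOv5s", yolov5s), ("YOLOv5m", yolov5m)]:
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")

# Run a single detection on an example aerial frame (hypothetical file name)
results = yolov5s("example_drone_frame.jpg")
results.print()  # summary of detected objects, confidences and speed
```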

4 Experiment

4.1 Setting

As a programming environment, we used Google Colaboratory (Colab). Colab provides a Jupyter Notebook-like coding environment with free access to a Graphics Processing Unit (GPU) or a Tensor Processing Unit (TPU). It comes with popular deep learning libraries pre-installed, such as PyTorch, TensorFlow, Keras, and OpenCV. Since machine learning/deep learning algorithms require high processing speed and power (usually GPU-based), which normal computers are not equipped with, Colab supplies GPUs such as the NVIDIA Tesla K80. To optimize the use of the datasets, we used the increasingly popular Roboflow platform. It is very useful for computer vision tasks, as it provides tools for label annotation, dataset organization, image preprocessing, image augmentation, and training. For example, we used Roboflow to convert the originally provided Pascal VOC annotations to the supported YOLOv5 PyTorch format. Additionally, storing the dataset on Roboflow offers practical benefits for training models, as it avoids some additional steps usually required to load large datasets into Colab.
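For reference, downloading a dataset hosted on Roboflow into Colab in the YOLOv5 PyTorch format typically takes only a few lines with the official roboflow Python package; in the sketch below the API key, workspace, project name and version are all hypothetical placeholders:

```python
# pip install roboflow
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")                       # hypothetical key
project = rf.workspace("your-workspace").project("sard")    # hypothetical names
dataset = project.version(1).download("yolov5")             # labels are served already
                                                            # converted to YOLOv5 format
print(dataset.location)  # local folder with images, labels and the data.yaml file
```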

In our experiments we performed transfer learning with weights pre-trained on the COCO dataset, running the small size model for 200 epochs and the medium size model for 100 epochs, i.e., until a plateau was reached on a held-out validation set. As an optimization algorithm, we experimented with traditional stochastic gradient descent, using a learning rate of 0.01. To make training feasible, the high-resolution input images were resized to \(800 \times 800\), and the mini-batch size was set to 32 for the small and to 16 for the medium size model.
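A minimal sketch of how such a fine-tuning run can be launched with the Ultralytics YOLOv5 repository is given below; the flags mirror the hyperparameters listed above (the repository's default optimizer is SGD with a 0.01 learning rate), while the dataset description file name is a hypothetical placeholder:

```python
import subprocess

# Fine-tune YOLOv5s from COCO-pretrained weights with the settings used in this work:
# 800x800 input, batch size 32, 200 epochs. "sard.yaml" is a hypothetical data file,
# e.g. the one produced by the Roboflow export described above.
subprocess.run([
    "python", "train.py",
    "--img", "800",
    "--batch", "32",
    "--epochs", "200",
    "--data", "sard.yaml",
    "--weights", "yolov5s.pt",
], check=True)
```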

4.2 Metrics

To properly evaluate the detection models on both datasets, standard object detection metrics, specifically precision and recall, were calculated. Precision represents the percentage of people correctly detected out of the total number of proposed objects classified as people. Recall measures the percentage of persons labeled in the dataset that were correctly detected. We also used average precision (AP) to evaluate the differences between the trained models; it summarizes the precision values obtained for recall values ranging from 0 to 1.

Finally, it is worth noting that a detection is counted as a true positive when the intersection over union (IoU) between the ground truth and the predicted bounding box is greater than or equal to a fixed threshold; as a common choice, we used a threshold of 50%.
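As a worked example of this matching criterion, the short sketch below computes the IoU between an axis-aligned ground-truth box and a predicted box, both given in (x1, y1, x2, y2) pixel coordinates, and applies the 50% threshold; the two example boxes are arbitrary illustrative values:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as a true positive only if it overlaps a ground-truth
# person box with IoU >= 0.5.
ground_truth = (100, 100, 140, 180)   # example person annotation
prediction   = (105, 110, 150, 185)   # example detection
print(iou(ground_truth, prediction) >= 0.5)  # True for this pair (IoU ~ 0.59)
```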

4.3 Results

Table 1 reports the results obtained with the models on the test sets of the HERIDAL and SARD datasets. The state-of-the-art results on these data are also reported, for comparison with the current literature. On HERIDAL, YOLOv5s achieved 0.753 precision, 0.694 recall and an AP of 0.731. Detection took an extremely short time of \(\sim \)0.015 s per image. Slightly better results were obtained with YOLOv5m, which achieved a precision of 0.797, a recall of 0.812 and an AP of 0.810. This gain in accuracy doubled the detection time, which nonetheless remains extremely fast at 0.030 s per image. Currently, the state-of-the-art results on this dataset have been obtained by the multi-modal region-proposal CNN proposed in [12], which reports a precision of 0.689 and a recall of 0.946. Our results are much better in terms of precision, at the expense of a lower recall. However, it should be noted that the state-of-the-art model takes around 15 s to process every single image, even on the powerful NVIDIA GeForce GTX 1080Ti Turbo. This implies that our proposed models are not only much faster, but can also mitigate the lower recall since, given the very high frame rate, a missed detection can be recovered in a subsequent frame within a very short time.
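For reproducibility, per-image detection time can be estimated along the lines of the following sketch (not our exact measurement code); it assumes a CUDA-capable GPU as in Colab, uses a hypothetical test image path, and includes warm-up runs and device synchronization so that CUDA initialization does not bias the timing:

```python
import time
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.to("cuda")

img = "heridal_test_image.jpg"  # hypothetical test image

# warm-up runs so that lazy CUDA initialization is excluded from the measurement
for _ in range(5):
    model(img)

torch.cuda.synchronize()
start = time.time()
n = 50
for _ in range(n):
    model(img)
torch.cuda.synchronize()
print(f"average detection time: {(time.time() - start) / n:.3f} s per image")
```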

Regarding the SARD dataset, we observed the opposite behavior: with the same set of hyperparameters as the previous experiment, the medium size model performs worse than the small one. In fact, YOLOv5s achieved very good precision, recall and AP of 0.940, 0.917 and 0.933, respectively. In contrast, YOLOv5m achieved a performance of around 0.77 for all three metrics. This may be explained by the fact that the medium size model is over-parameterized compared to the small one, so it may have overfitted on this different dataset. On the same computational platform, and with the same input image resolution, the detection time remains unchanged. The YOLOv4 model previously experimented with in [22] maintains the state-of-the-art on the SARD dataset. However, the single AP value reported there is only 0.03 better than that of our best model, and the detection time, while pretty short, is still an order of magnitude higher. Again, the state-of-the-art model was not tested on one of the embedded GPUs typically mounted on drones, but on a laptop with a GeForce GTX 1660Ti.

Table 1. Detection performance and comparison with the state-of-the-art (“-” means “not provided”).
Fig. 1. Examples of human detection on the HERIDAL dataset.

Examples of human detection on both datasets are shown in Figs. 1 and 2. It can be seen that, in the case of the HERIDAL dataset, the model can only detect whether a region contains a human, whereas in the case of SARD it can also classify the type of pose/movement exhibited by the detected person. It is worth noting that, especially in the HERIDAL images, we have high-altitude scenes where it is sometimes difficult to detect people even with the naked eye. Nevertheless, the computer vision model proved very effective in this challenging task.

Fig. 2. Examples of human detection on the SARD dataset.

Since SARD also provides human pose labels, in Fig. 3 we report the confusion matrix for the task of classifying people’s posture (i.e., “running”, “walking”, etc.) obtained on this dataset. Due to the data imbalance, our model performed worse on the under-represented target class “running”: the latter is often (\(\sim \)33% of the time) confused with the “walking” class, which is substantially more represented in the dataset, and is correctly predicted only \(\sim \)56% of the time. As for the other classes, the model achieves really good results, with an average accuracy of \(\sim \)84%.

Fig. 3. Normalized confusion matrix obtained from the classification of people’s posture on the SARD dataset (darker means better).

5 Conclusion

In this paper, we have demonstrated promising human detection performance using several approaches based on the recent YOLOv5 detection algorithm. Experimental results on two new benchmark datasets showed that such a computer vision approach can provide an effective support tool to aid Search-and-Rescue operations with drones. Furthermore, the detection speed achieved allows people to be located within a very short time, thus supporting a rapid organization of the rescue.

As future work, we plan to combine optical and thermal imaging technology to further improve detection accuracy. Furthermore, we want to improve the results obtained on the multi-class classification based on human pose by properly augmenting the under-represented classes.