
1.1 Introduction

Developing countries are facing an increase in the number of vehicles in regional capitals, and vehicle traffic between regions is dense. In Senegal, road traffic accidents account for 63.17% of all accidents [1], which means that the majority of accidents are related to road insecurity. These traffic accidents occur outside built-up areas, i.e. between regions of the same country or between different countries. In nonurban areas, they are most often caused by wildlife. The authors in [2] conducted research to assess the impact of wildlife-vehicle collisions along the Dakar-Bamako corridor on animal populations in the Niokolo Koba National Park.

Given all these road-related problems, video surveillance is a necessary means of ensuring road safety. Video surveillance, also known as video protection, consists of cameras and all the equipment needed to record and exploit images in order to detect abnormal events [3]. The main objective of processing a digital image is to extract information from it and to improve its visual quality so that it is easier to interpret by a human analyst or an autonomous machine perception system.

Video surveillance makes it possible to apply computer vision, which includes many object detection models. In our previous work [4], we proposed a lateral road obstacle detection model based on machine learning to contribute to road safety outside built-up areas. Today, with the use of deep neural networks in computer vision, deep learning is overtaking classical machine learning in video surveillance. In this paper, we have chosen YOLOv4 to detect three types of animals: cows, donkeys and goats. The rest of the paper is structured as follows: in Sect. 1.2, we present related work on obstacle detection systems in the field of road video surveillance. In Sect. 1.3, we propose an approach based on the YOLOv4 detection model. In Sect. 1.4, we present the training and testing results of the model as well as the performance metrics. In Sect. 1.5, we conclude and outline perspectives.

1.2 Related Work

In this section, we review work on animal detection and monitoring using deep learning, specifically YOLO (You Only Look Once).

1.2.1 Object Detection Models Based on Deep Learning

Over the past decade, object detection models based on deep learning have gained great importance in research. A good overview of the state of the art of these models is given in [5,6,7]. For example, the author of [5] shows that, with the advancement of artificial intelligence, neural networks such as convolutional neural networks (CNNs) have often been used in image processing. However, CNN models face many problems in terms of execution, performance, deployment, etc. In [8], another deep learning network, the Faster Regional Convolutional Neural Network (Faster R-CNN), is discussed for object detection and tracking. The literature also reports other types of algorithms such as SSD [9] and F-CNN [10]. These algorithms have often been used in video surveillance for object detection. In this paper, we choose the deep learning detection algorithm YOLOv4, which is much faster and more efficient in terms of detection [11].

1.2.2 Roadside Video Surveillance of Wild Animals

In recent years, video surveillance has been the subject of much research using deep learning, which has also given rise to detection techniques such as YOLO. A presentation of the state of the art is available in [5, 9, 12].

For example, in [9], Haomin and He studied YOLO object detection algorithms for road scenes based on computer vision. The authors in [12] made an in-depth study of the progress in optimising road object detection, which is an important part of detection, and of the evaluation of detection models.

As pointed out by [13], YOLO detection models withstand conditions such as night, rain and snow and still provide fast and reliable detection. In [14], the authors presented work on detecting wild animals in the forest in order to monitor their movement. In our survey of research on detection models for road safety, we did not find any work that uses the YOLOv4 detection model to warn of wildlife crossing roads. Our paper therefore relies on YOLOv4 to perform automatic wildlife detection on roads for accident prevention. In the following section, we present the detection approach based on YOLOv4.

1.3 Detection Approach Based on YOLOv4

1.3.1 Architecture of YOLOv4

As described in [15], the YOLOv4 architecture is made up of several parts. The input comes first: it is the set of training images passed to the network, which the GPU processes in parallel batches. Then come the backbone and the neck, which perform feature extraction and aggregation, respectively. Together, the detection neck and the detection head can be referred to as the object detector (Fig. 1.1).

Fig. 1.1
An architecture diagram depicts YOLO dense prediction using two-stage and one-stage detectors. It comprises a set of pooling layers. The steps include input, backbone, neck, dense prediction, and sparse prediction. Data is provided at the bottom.

Object detector

YOLOv4 explores different backbones and data augmentation methods. Its object detector involves the following parts:

  • Backbone network

  • Neck

  • PANet (Path Aggregation Network)

  • Head

The main function of the head is to locate the bounding boxes and perform the classification.

The head predicts the coordinates of each bounding box (x, y, height and width) and the associated scores. Here, the x and y coordinates are the centre of the bounding box, expressed relative to the boundary of the grid cell. The width and height are predicted relative to the whole image.

$$ {b}_x=\sigma \left({t}_x\right)+{C}_x $$
(1.1)
$$ {b}_y=\sigma \left({t}_y\right)+{C}_y $$
(1.2)
$$ {b}_w={p}_w{e}^{{t}_w} $$
(1.3)
$$ {b}_h={p}_h{e}^{{t}_h} $$
(1.4)
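
Equations (1.1) to (1.4) can be illustrated with a short Python sketch. The text does not define the symbols, so the sketch assumes the usual YOLO conventions: t_x, t_y, t_w and t_h are the raw network outputs, (C_x, C_y) is the offset of the top-left corner of the grid cell, and (p_w, p_h) are the dimensions of the anchor (prior) box. The function name and the sample values are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function, written sigma in Eqs. (1.1) and (1.2)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Apply Eqs. (1.1)-(1.4) to one raw prediction.

    Assumed meaning of the arguments (not stated in the text):
    t_* are raw network outputs, (c_x, c_y) is the grid cell offset,
    (p_w, p_h) are the anchor box dimensions.
    """
    b_x = sigmoid(t_x) + c_x      # Eq. (1.1)
    b_y = sigmoid(t_y) + c_y      # Eq. (1.2)
    b_w = p_w * math.exp(t_w)     # Eq. (1.3)
    b_h = p_h * math.exp(t_h)     # Eq. (1.4)
    return b_x, b_y, b_w, b_h


# Hypothetical example: all raw outputs are zero, so the centre falls in the
# middle of grid cell (3, 4) and the box keeps the anchor dimensions.
box = decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=4, p_w=2.0, p_h=1.5)
```

Because the sigmoid is bounded between 0 and 1, the predicted centre cannot leave its grid cell, while the exponential keeps the width and height positive.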

1.3.2 Construction of Our Dataset

Our data consist of a collection of images of three types of animals: cows, goats and donkeys. The images were acquired through Google searches and on a farm in Niague, Senegal, which raises cows. After collection, we first removed the irrelevant images, which is an essential step, and then renamed the remaining images with Python code to make the renaming faster. Because the images acquired from websites and those taken with a camera on the farm did not have the same size, we resized all of them to 671 × 480. Finally, we labelled the images with labelImg, an open-source image annotation tool. We have 1000 images for each type of animal, making 3000 images in total (Fig. 1.2).
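
The renaming step could look like the following minimal sketch. It is not the script used in this work: the function name, the naming pattern (for example cow_0001.jpg) and the list of accepted extensions are assumptions made for illustration, and the demonstration runs on empty placeholder files in a temporary folder.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; other files are left untouched.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def rename_images(folder, prefix):
    """Rename the images of `folder` to prefix_0001.ext, prefix_0002.ext, ..."""
    folder = Path(folder)
    images = sorted(
        (p for p in folder.iterdir()
         if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS),
        key=lambda p: p.name,
    )
    # First pass: move every image to a temporary name so that a final name
    # can never overwrite an image that has not been renamed yet.
    staged = []
    for index, path in enumerate(images):
        temporary = path.with_name(f"__tmp_{index}{path.suffix.lower()}")
        path.rename(temporary)
        staged.append(temporary)
    # Second pass: assign the final sequential names.
    renamed = []
    for index, temporary in enumerate(staged, start=1):
        target = folder / f"{prefix}_{index:04d}{temporary.suffix}"
        temporary.rename(target)
        renamed.append(target)
    return renamed


# Demonstration on placeholder files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("IMG_20.JPG", "download (3).png", "notes.txt"):
        (Path(tmp_dir) / name).touch()
    new_names = [p.name for p in rename_images(tmp_dir, "cow")]
    untouched = sorted(p.name for p in Path(tmp_dir).iterdir())
```

Resizing to 671 × 480 would be a separate step that requires an image library and is not shown here.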

Fig. 1.2
Three photographs. The photo on the left has a donkey on a grassland. The photo in the center has a goat on a mountain. The photo on the right has a cow on a grassland.

Donkey, goat and cow

1.4 Experimentation and Validation

In this section, we detail the training of our model to detect three classes (donkey, goat and cow). Then, we present the performance measures of our model.

1.4.1 Setting Up the Experiments

Implementation Details

For the training of the YOLO model, we relied on the Darknet framework, which contains all the necessary files [16]. The training phase of YOLO requires a lot of time, which is why we use transfer learning. This method consists of reusing the weights of a deep neural network that has already been trained on another dataset as the starting point for training, which saves machine resources and computing time. We therefore need a pre-trained YOLO model. Before starting the training, we adapt some settings to our model: the number of classes, the number of iterations and the number of filters used in the convolutional layers.
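
As an illustration of these settings, the sketch below applies the rules of thumb given in the Darknet repository documentation for training on custom objects: max_batches = classes × 2000 (but not less than 6000 and not less than the number of training images), steps at 80% and 90% of max_batches, and filters = (classes + 5) × 3 in the convolutional layer preceding each YOLO layer. That these exact rules were used here is an assumption, although the resulting 6000 iterations match the training run reported in Sect. 1.4.3. The function name is hypothetical.

```python
def darknet_settings(num_classes, num_train_images=0):
    """Suggested .cfg values following the Darknet custom-training guidelines."""
    # Number of training iterations.
    max_batches = max(num_classes * 2000, num_train_images, 6000)
    # Iterations at which the learning rate is reduced (80% and 90%).
    steps = (max_batches * 8 // 10, max_batches * 9 // 10)
    # 3 anchors per YOLO layer, each predicting 4 box coordinates,
    # 1 objectness score and one score per class.
    filters = (num_classes + 5) * 3
    return {
        "classes": num_classes,
        "max_batches": max_batches,
        "steps": steps,
        "filters": filters,
    }


# Three classes (donkey, goat, cow); 2400 training images is an assumed value
# corresponding to 80% of the 3000 images.
settings = darknet_settings(num_classes=3, num_train_images=2400)
```

With three classes, this gives 24 filters and 6000 iterations, with learning-rate steps at 4800 and 5400 iterations.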

Training Environment

The training phase of a YOLO model is rather heavy: with a large number of images, a machine with very powerful resources (GPUs, RAM) is needed for the model to learn in a reasonable time. This is why we use Google Colab Pro to train our model [17].

Splitting the Dataset (Training/Test)

We split our dataset (1000 images per class) into two parts:

  • A training data set (80%)

  • A test data set (20%)
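
A minimal sketch of such a split is given below. It shuffles the images of each class separately so that every class keeps the 80/20 proportion; the per-class strategy, the fixed random seed, the function name and the file paths are assumptions, not details taken from this work.

```python
import random


def split_per_class(images_by_class, train_ratio=0.8, seed=0):
    """Split each class into a training part and a test part."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for class_name in sorted(images_by_class):
        paths = sorted(images_by_class[class_name])
        rng.shuffle(paths)
        cut = round(len(paths) * train_ratio)
        train.extend(paths[:cut])
        test.extend(paths[cut:])
    return train, test


# Hypothetical file names: 1000 images for each of the three classes.
images_by_class = {
    name: [f"data/obj/{name}_{i:04d}.jpg" for i in range(1, 1001)]
    for name in ("cow", "donkey", "goat")
}
train_paths, test_paths = split_per_class(images_by_class)
```

With 1000 images per class, this yields 2400 training images and 600 test images. In Darknet, the two lists are typically written to text files with one image path per line.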

1.4.2 Performance Measures of Our Model

Several indicators can be used to measure the performance of an object detection model. Each one has its own specificities, and it is often necessary to use several of them to have a complete view of the performance of a model. Most of these indicators depend on the parameters true positive (TP), false positive (FP), false negative (FN) and true negative (TN) [18].

  • TP: These are the correctly predicted positive values, which means that the actual class value is yes and the predicted class value is also yes.

  • TN: These are the correctly predicted negative values, which means that the actual class value is no and the predicted class value is also no.

  • FP: When the actual class is no and the predicted class is yes.

  • FN: When the actual class is yes but the predicted class is no.

Now we will define the performance measurement indicators for the case of a YOLO model [18].

Precision (P)

Precision is the number of objects correctly assigned to class i relative to the total number of objects predicted to belong to class i.

$$ P=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} $$
(1.5)

Recall (R)

Recall is the number of objects correctly assigned to class i out of the total number of objects belonging to class i.

$$ R=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
(1.6)

F1-Score (F1)

Although useful, neither precision nor recall alone can fully evaluate a model. The F1-Score combines the two as their harmonic mean, which provides a single, balanced assessment of the model's performance.

$$ {F}_1=2\times \frac{P\times R}{P+R} $$
(1.7)
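
Equations (1.5) to (1.7) translate directly into code. The following sketch uses hypothetical counts (8 true positives, 2 false positives and 8 false negatives) that are not results of this work; returning 0 when a denominator is zero is a convention chosen here.

```python
def precision(tp, fp):
    """Eq. (1.5): share of predicted objects that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Eq. (1.6): share of actual objects that are found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Eq. (1.7): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts for one class.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
f1 = f1_score(p, r)
```

Here the precision is 0.8 and the recall is 0.5; the F1-Score lies between the two and is pulled towards the lower value, which is why it penalises a model that is good on only one of the two measures.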

Intersection over Union (IoU)

It indicates the overlap of the coordinates of the predicted bounding box with the ground truth box. A higher IoU indicates that the coordinates of the predicted bounding box closely resemble the coordinates of the ground truth box.

$$ \mathrm{IoU}=\frac{\mathrm{Area}\ \mathrm{of}\ \mathrm{overlap}}{\mathrm{Area}\ \mathrm{of}\ \mathrm{union}} $$
(1.8)
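
Equation (1.8) can be computed as follows for two axis-aligned boxes. The sketch assumes that each box is given by its corner coordinates (x_min, y_min, x_max, y_max); the function name and the sample boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    inter_x_min = max(box_a[0], box_b[0])
    inter_y_min = max(box_a[1], box_b[1])
    inter_x_max = min(box_a[2], box_b[2])
    inter_y_max = min(box_a[3], box_b[3])
    # The overlap is zero when the boxes do not intersect.
    inter_w = max(0.0, inter_x_max - inter_x_min)
    inter_h = max(0.0, inter_y_max - inter_y_min)
    area_overlap = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    area_union = area_a + area_b - area_overlap
    return area_overlap / area_union if area_union > 0 else 0.0


# Hypothetical predicted and ground truth boxes.
predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
score = iou(predicted, ground_truth)
```

In this example, the overlap has an area of 1 and the union an area of 7, so the IoU is 1/7. A detection is usually counted as a true positive only when its IoU with a ground truth box exceeds a chosen threshold.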

Mean Average Precision

The mAP is calculated by finding the average precision (AP) for each class and then averaging over the total number of classes n, where AP_i denotes the AP of class i. Note that the average precision (AP) is not the average of the precision (P). The definition of AP has evolved over time; to simplify, it can be described as the area under the precision-recall curve. The mAP incorporates the trade-off between precision and recall and considers both false positives (FP) and false negatives (FN). This property makes mAP a suitable metric for most detection applications [19].

$$ \mathrm{mAP}=\frac{1}{n}\sum \limits_{i=1}^n\mathrm{A}{\mathrm{P}}_i $$
(1.9)
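
The AP term of Eq. (1.9) can be sketched as the area under the precision-recall curve. The version below uses all-point interpolation (the precision at each recall level is replaced by the highest precision reached at that recall or beyond), which is one common definition; whether Darknet uses exactly this variant is not stated in the text, and the sample points are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve.

    `recalls` must be sorted in increasing order, with `precisions`
    giving the precision measured at each recall value.
    """
    # Add sentinel points at recall 0 and recall 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make the precision non-increasing when moving towards higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Hypothetical curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
ap = average_precision([0.5, 1.0], [1.0, 0.5])
```

For this curve, the area is 0.5 × 1.0 + 0.5 × 0.5 = 0.75.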

Loss Function

The loss function is the sum of the errors made on each example of the training set. The main objective of a learning model is to minimise the value of the loss function with respect to the model parameters by modifying the values of the weight vector using different optimisation methods, such as back-propagation in neural networks.

1.4.3 Results and Analysis

After training, a graph is generated. It shows the evolution of the mean average precision (mAP) of the model and of the loss function over the iterations (Fig. 1.3).

Fig. 1.3
A graph of the evolution of the m A P versus the loss function as a function of iterations. The current average loss is 0.4660. Iteration = 6000. The approximate time left is 0.12 hours. A message reads, press s to save chart dot p n g.

Curve representing the evolution of the mAP and the loss function as a function of iterations

This graph shows that the mAP reaches 72% after 1000 iterations, 98% at 1200 iterations and 99% at 2500 iterations, and stays at about 98% for all remaining iterations. The loss function keeps decreasing until the end of the training, where it reaches 0.466. This graph gives only the overall results, whereas we have three classes. The following figure details the results obtained for each class (Fig. 1.4).

Fig. 1.4
A screenshot of a few lines of code for object detection. It calculates m A P and detection counts at 6000 iterations. The total detection time is 13 seconds.

Detailed results of the training

For donkeys, we have an average precision (AP) of 99.96%, with the number of true positives (TP) = 156 and the number of false positives (FP) = 30.

For cows, we have an AP of 94.32%, with TP = 308 and FP = 24.

For goats, we have an AP of 98.85%, with TP = 274 and FP = 18.

Averaging the AP values of our three classes gives a mAP of 97.71%. This shows that the detection model is acceptable.
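
These figures can be checked with a few lines of arithmetic. The sketch below averages the three per-class scores reported above and, for comparison, computes TP / (TP + FP) for each class as in Eq. (1.5). The dictionary layout is only an illustration; the numbers are those given in the text.

```python
# Per-class results reported in the text (scores in percent).
results = {
    "donkey": {"score": 99.96, "tp": 156, "fp": 30},
    "cow": {"score": 94.32, "tp": 308, "fp": 24},
    "goat": {"score": 98.85, "tp": 274, "fp": 18},
}

# Eq. (1.9): mean of the per-class scores.
mean_score = sum(r["score"] for r in results.values()) / len(results)

# Eq. (1.5) applied to the reported counts, in percent.
precision_from_counts = {
    name: 100 * r["tp"] / (r["tp"] + r["fp"]) for name, r in results.items()
}

print(f"mean of per-class scores: {mean_score:.2f}%")
for name, value in precision_from_counts.items():
    print(f"{name}: TP / (TP + FP) = {value:.2f}%")
```

The mean is 97.71%, as reported. The ratio TP / (TP + FP) is lower than the per-class score for each class (about 83.87%, 92.77% and 93.84%), which is consistent with the per-class score being an area under the precision-recall curve rather than a precision at a single operating point; presumably, the TP and FP counts are taken at one fixed confidence threshold.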

1.4.4 Test

Testing on Images

To perform detection on an image, we use a Python script that takes images as input and makes a prediction with our model (Fig. 1.5).

Fig. 1.5
Three photographs of detection. The photo on the left has a donkey, a child, and a small goat. The photo in the center has a goat. The photo on the right has a cow. All photos are captured on the road.

Detection from images of the three categories of animals: donkey, cow and goat

Testing on a Video

To perform detection on a video, we use a Python script that takes a video as input and makes a prediction with our model. This video was taken in real time on the Niague road located in Keur Massar, Senegal, as the cows were returning to the Niague farm after a day’s walk (Fig. 1.6).

Fig. 1.6
A photograph. It detects 5 cows on the road with an accuracy of 0.99. The value is simplified to 2 decimals.

Detection of cows in a video recorded on the “Niague” road, Senegal

1.5 Conclusion and Outlook

Wild animals are increasingly unpredictable obstacles. In this paper, we reviewed related work on road obstacle detection and proposed a YOLOv4-based approach to detect animals such as cows, goats and donkeys crossing roads, especially in nonurban areas. A performance study of the model was carried out to validate this approach. As future work, we plan to integrate more types of animals and to estimate the distance of obstacles. We also plan to integrate IoT devices to deploy our model in a vehicle.