1 Introduction

Object detection has many applications in computer vision. One of them is pedestrian detection, which is used in surveillance [1, 2], robotics [3, 4, 5, 6], navigation [7, 8, 9], driver assistance systems, particularly pedestrian protection systems (PPSs) [10, 11], and other areas. In the state of the art, multiple feature extraction algorithms working with machine learning and datasets have been created to deal with this problem.

Developments in computer vision have been introduced for UAVs [12, 13, 14]. Pedestrian detection can be used with UAVs, taking into consideration that their complex dynamics and altitude variation add extra challenges to the detection [15, 16]. Conventional classifiers degrade as altitude increases, generating more false positives.

Our proposal for pedestrian detection in UAVs considers the altitude and introduces the CICTE-PeopleDetection dataset, with images captured from surveillance cameras. We use two trained algorithms: the first is based on a combination of the feature extraction methods HAAR-LBP, and the second is based on HOG. Both algorithms use cascade classifiers with Adaboost training. In addition, we propose an algorithm that merges the Saliency Maps algorithm presented in [17] with a cascade classifier to provide detection robustness. Our proposal is evaluated on images captured from UAVs in different scenarios.

This paper is organized as follows: Sect. 2 describes the related work on pedestrian detection. Next, our proposal for pedestrian detection, the creation of the dataset and the algorithm are described in Sect. 3. In Sect. 4 we present the experimental results, followed by the summary. Finally, conclusions and future work are presented in Sect. 5.

2 Related Works

In the literature, several research groups have created different datasets and methods for pedestrian detection. INRIA was introduced in [18], with training based on Histograms of Oriented Gradients (HOG). The Caltech Pedestrian Dataset [19] and KITTI [20] are widely used because they are comparatively large and challenging. According to [21, 22] there are two types of datasets: photo datasets and video datasets. Photo datasets like MIT [23], CVC [11] and NICTA [24] address the classification problem: training binary classification algorithms. Video datasets such as ETH [25], TUD-Brussels [26] or Daimler (DB) [27] are focused on the detection problem: designing and testing full-image detection systems and human locomotion modeling.

Two important algorithms have been developed for pedestrian detection and object detection in general: Haar-like features [28] by Viola and Jones, and the algorithm by Dalal and Triggs called HOG [18]. Both algorithms have generated over 40 new approaches [21]. Several methods for pedestrian detection include feature extraction algorithms: HAAR [28], HOG [18, 29], HOG-HAAR [30] and HOG-LBP [31], working with machine learning approaches based on SVMs [18, 32] or Adaboost [11, 27].

The applications of pedestrian detection in UAVs are manifold: human safety [33], rescue and monitoring missions [34, 35], people tracking systems [32, 36], and others. One of the challenges of pedestrian detection in UAVs is the variation in camera perspective, which deforms the images. In [37, 38], thermal imagery is combined with cascade classifiers to perform the detection. Few papers, such as [35], work at altitudes of around five meters; in that paper, the authors propose post-disaster victim detection with cascade classifier methods. In UAVs, saliency maps are widely used for object and motion detection in aerial images [35, 39]. Works like [34] use saliency maps to reduce the search space when detecting people: bounding boxes are chosen randomly inside the salient region, all detection windows are treated separately, and the results are fused using a mean-shift procedure, applied in flights from 10 to 40 m of altitude.

3 Our Approach

3.1 Dataset Creation

One of the reasons for introducing our dataset is the requirement to detect people from UAV cameras. The main problem in pedestrian detection from UAVs is high altitude, at which the appearance of people in the images is deformed. The main difference between CICTE-PeopleDetection and previous photo datasets is the location and perspective of the cameras, which emulate the onboard camera perspective of the UAV. We use surveillance cameras to create the photo dataset because UAV video captures are stable and comparable with those of fixed cameras. There are approximately 100 cameras (we cannot specify the exact number of cameras for security reasons) with D1 resolution located in the University between 2.3 m and 5 m of height, looking down as shown in Fig. 1.

Fig. 1. Location of the cameras on the campus. (a) 5 m height. (b) 2.3 m height.

For training we need positive and negative images. Positive images are the images that contain the object to be detected, in our case pedestrians. Negative images are frames without pedestrians. Our dataset has 3900 positive images and 1212 negative images. The positive images were captured at the Universidad de las Fuerzas Armadas ESPE during the day and the night in different scenarios, and contain samples of fully visible and partially occluded people.

3.2 Training Process

Our approach consists of the combination of two algorithms for extraction of the feature set: Local Binary Patterns (LBP) and Haar-like features. We use Adaptive Boosting (AdaBoost) as the training algorithm and a combination of Haar-LBP features because both have low computation time. To create our Haar-LBP algorithm we divided the images into 70% for training and 30% for testing; afterwards we applied the algorithm to UAV images in different scenarios. Additionally, we train a HOG cascade classifier and compare it with OpenCV HOG to validate our dataset. The training processes are shown in Fig. 2.

Fig. 2. Pedestrian detection training. (a) HOG features with Adaboost. (b) Haar-LBP features with Adaboost.
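The 70%/30% partition can be illustrated with a minimal sketch. The text does not state how the images were assigned to each subset, so the random shuffle, the fixed seed, the function name `split_dataset` and the file names below are all assumptions made for illustration.

```python
import random

def split_dataset(paths, train_fraction=0.7, seed=0):
    """Shuffle image paths and split them into training and testing lists."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle (assumed)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical file names standing in for the 3900 positive images.
positives = [f"pos_{i:04d}.jpg" for i in range(3900)]
train, test = split_dataset(positives)
print(len(train), len(test))
```

The same function could be applied to the negative images so that both classes keep the same proportion in each subset.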

The methods used for training the cascade classifiers are described as follows:

Local Binary Patterns (LBP)

This feature extractor was presented in [40] as a texture descriptor for object detection; it compares a central pixel with its neighbours. The window to be examined is separated into cells of 16 × 16 pixels. For each pixel inside the cell, its 8 neighbours are considered, with the central pixel value acting as the threshold. A value of 1 is assigned if the neighbour is greater than or equal to the central pixel; otherwise the value is 0.
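The thresholding rule can be sketched in a few lines of plain Python. The clockwise neighbour ordering and the per-cell 256-bin histogram are assumptions of this sketch (they follow common LBP practice and are not specified in the text), and the names `lbp_code` and `lbp_histogram` are illustrative.

```python
def lbp_code(img, r, c):
    """8-neighbour LBP code of pixel (r, c) in a 2-D list of grey values."""
    center = img[r][c]
    # Clockwise neighbour offsets starting at the top-left (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for dr, dc in offsets:
        bit = 1 if img[r + dr][c + dc] >= center else 0
        code = (code << 1) | bit               # first neighbour ends up as the MSB
    return code

def lbp_histogram(cell):
    """256-bin histogram of LBP codes over the interior pixels of a cell."""
    hist = [0] * 256
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            hist[lbp_code(cell, r, c)] += 1
    return hist

patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 6, 3]]
print(format(lbp_code(patch, 1, 1), "08b"))    # 01010100
```

For the 3 × 3 patch, the neighbours 9, 7 and 6 are greater than or equal to the central value 6, so only their bits are set.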

Haar-like Features

Viola and Jones use a statistical approach for the tracking and detection problem, describing the contrast between light and dark areas within a defined kernel. This algorithm is robust to noise and lighting changes. The method uses simple feature sets similar to Haar basis functions [28, 41].
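As an illustration, a two-rectangle Haar-like feature can be computed as the difference between the pixel sums of two adjacent areas, using the integral image of Viola and Jones so that each sum costs four lookups. This is a minimal sketch on a toy image; the function names and the left-minus-right layout are choices made for the example, not the trained feature set.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def two_rect_feature(ii, top, left, height, width):
    """Left half minus right half of a window (responds to vertical edges)."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum

# Toy 4x4 image: bright left half, dark right half.
img = [[9, 9, 1, 1],
       [9, 9, 1, 1],
       [9, 9, 1, 1],
       [9, 9, 1, 1]]
ii = integral_image(img)
print(two_rect_feature(ii, 0, 0, 4, 4))        # 72 - 8 = 64
```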

Histogram of Oriented Gradients (HOG)

This algorithm is a feature descriptor for object detection, introduced in [18] with a focus on pedestrian detection. The image window is separated into smaller parts called cells. For each cell, we accumulate a local 1-D histogram of gradient orientations of the pixels in the cell. Each cell is discretized into angular bins according to the gradient orientation, and each pixel of the cell contributes a gradient weight to its corresponding angular bin. Adjacent cells are grouped into spatial regions called blocks, and the normalized group of histograms represents the block histogram.
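The per-cell accumulation can be sketched as follows. The 9 unsigned orientation bins over 0 to 180 degrees, the centred differences and the L2 block normalisation follow the usual choices of [18] but are assumptions here, since the text does not give the parameters used; the function names are illustrative.

```python
import math

def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]   # centred horizontal difference
            gy = cell[r + 1][c] - cell[r - 1][c]   # centred vertical difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

def l2_normalize(block, eps=1e-6):
    """L2-normalise the concatenated cell histograms of one block."""
    norm = math.sqrt(sum(v * v for v in block) + eps ** 2)
    return [v / norm for v in block]

# A vertical step edge: every gradient points horizontally (orientation 0 degrees).
cell = [[0, 0, 10, 10]] * 4
hist = cell_histogram(cell)
print(hist)
```

In this toy cell all four interior pixels have a horizontal gradient of magnitude 10, so the whole weight falls into the first bin.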

Adaboost

Adaboost is a machine learning algorithm [42] that initially assigns a uniform distribution of weights over the training samples. In the first iteration the algorithm trains a weak classifier using a feature extraction method, or a mix of them, that achieves the highest recognition performance on the training samples. In the second iteration, the training samples misclassified by the first weak classifier receive higher weights. The newly selected feature extraction methods should focus on these misclassified samples.
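The reweighting scheme can be illustrated with a minimal discrete AdaBoost sketch. It assumes threshold stumps on a single scalar feature as weak classifiers and labels in {-1, +1}; in the cascade training the weak classifiers are built on Haar, LBP or HOG features instead, so the data and the names `best_stump`, `adaboost` and `predict` are purely illustrative.

```python
import math

def best_stump(xs, ys, weights):
    """Pick the threshold/polarity stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity, preds)
    return best

def adaboost(xs, ys, rounds=3):
    """Discrete AdaBoost over 1-D feature values with labels in {-1, +1}."""
    n = len(xs)
    weights = [1.0 / n] * n                        # uniform initial weights
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity, preds = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        # Misclassified samples (p != y) receive larger weights for the next round.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((alpha, threshold, polarity))
    return ensemble, weights

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5]
ys = [-1, -1, 1, -1, 1]
_, weights_after_one = adaboost(xs, ys, rounds=1)
print([round(w, 3) for w in weights_after_one])    # [0.125, 0.125, 0.125, 0.5, 0.125]
ensemble, _ = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

After the first round, the only sample misclassified by the best stump (x = 4) carries half of the total weight, which forces the following weak classifiers to concentrate on it.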

3.3 People Detection Algorithm

In order to improve the performance of the classifier, we implement a combination of a cascade classifier with saliency maps, an algorithm presented in [17]. The purpose of saliency maps is to locate prominent areas at every location in the visual field. The areas with high saliency correspond to objects or to the places where they are most likely to be found, and the areas with lower saliency are associated with the background [43]. The saliency map is obtained by convolving the function \( f \) with an isotropic two-dimensional Gaussian function [44]:

$$ S(X) = f(X) * G_{\sigma}(X) $$
(1)

where \( * \) denotes convolution and σ is the standard deviation of the Gaussian function. The standard deviation depends on the experimental setup (size of the screen and viewing distance). To eliminate the false positives in the image we obtain the salient region: we apply a threshold to the saliency map and create a mask in which values greater than the threshold belong to the salient region. Additionally, this region is dilated to give it robustness. This algorithm is shown in Fig. 3.

Fig. 3. Saliency maps algorithm. (a) Saliency map. (b) Saliency region.
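The thresholding and dilation steps can be sketched on a small saliency map stored as a 2-D list. The threshold value, the square structuring element and its radius are assumptions of this sketch, since the text does not specify them; `salient_mask` and `dilate` are illustrative names.

```python
def salient_mask(saliency, threshold):
    """Binary mask: 1 where the saliency value exceeds the threshold."""
    return [[1 if v > threshold else 0 for v in row] for row in saliency]

def dilate(mask, radius=1):
    """Binary dilation with a square (2 * radius + 1) structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = 1
    return out

# Toy saliency map with a single strong response in the centre.
saliency = [[0.1, 0.1, 0.1, 0.1, 0.1],
            [0.1, 0.2, 0.3, 0.1, 0.1],
            [0.1, 0.3, 0.9, 0.2, 0.1],
            [0.1, 0.1, 0.2, 0.1, 0.1],
            [0.1, 0.1, 0.1, 0.1, 0.1]]
region = dilate(salient_mask(saliency, threshold=0.5))
for row in region:
    print(row)
```

Only the central pixel passes the threshold, and the dilation grows it into a 3 × 3 salient region, which gives some tolerance around the detected salient area.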

Once the salient region has been obtained, our algorithm accepts as true positives only the cascade classifier detections inside this region. For this reason we take the salient region as the Region of Interest (ROI). To determine whether a detection bounding box is inside the salient region, we compute the center point of the bounding box with the formulas:

$$ x_{m} = x + \left( {\frac{w}{2}} \right); \,\,\,\,\, y_{m} = y + \left( {\frac{h}{2}} \right) $$
(2)

where \( x \) and \( y \) are the horizontal and vertical coordinates of the top-left corner of the bounding box, \( x_{m} \) and \( y_{m} \) are the coordinates of the central point, and \( w, h \) are the width and height. We take the center point as reference to avoid false positives that have only small parts of their bounding box in salient regions. Unlike other methods presented in the literature [34], we use our own algorithm for combining the cascade classifier with saliency maps. Our proposal is presented graphically in Fig. 4.
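The center-point test of Eq. (2) can be sketched as follows, assuming detections are given as (x, y, w, h) tuples in pixel coordinates and the salient region as a binary mask indexed as `region[row][column]`. The function names and the toy data are illustrative only.

```python
def box_center(box):
    """Center (x_m, y_m) of a box given as (x, y, w, h), following Eq. (2)."""
    x, y, w, h = box
    return x + w / 2, y + h / 2

def filter_detections(boxes, region):
    """Keep the detections whose center falls on a salient pixel of the mask."""
    kept = []
    for box in boxes:
        xm, ym = box_center(box)
        col, row = int(xm), int(ym)
        inside = 0 <= row < len(region) and 0 <= col < len(region[0])
        if inside and region[row][col]:
            kept.append(box)
    return kept

# Toy 6x6 mask with a salient 3x3 block (rows 1-3, columns 1-3).
region = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(6)]
          for r in range(6)]
detections = [(1, 1, 2, 2),   # center (2.0, 2.0): inside the region
              (3, 3, 3, 3)]   # center (4.5, 4.5): only a corner overlaps
print(filter_detections(detections, region))   # [(1, 1, 2, 2)]
```

The second box touches the salient region with one corner only, so testing the center rather than any overlap rejects it.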

Fig. 4. Algorithm for people detection using HOG cascade classifier and saliency maps.

The results of applying this algorithm are presented in Sect. 4.

4 Results and Discussion

4.1 Dataset and Training Evaluation

The evaluation metrics for our approach are the sensitivity (true positive rate, TPR) and the miss rate (false negative rate, FNR), defined as follows:

$$ TPR = \frac{TP}{TP + FN} \times 100\% ;\quad FNR = \frac{FN}{TP + FN} \times 100\% $$
(3)
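Both metrics in Eq. (3) follow directly from the counts of true positives and false negatives, as the short sketch below shows. The counts are hypothetical and are not taken from the experiments.

```python
def sensitivity(tp, fn):
    """True positive rate (TPR) in percent."""
    return 100.0 * tp / (tp + fn)

def miss_rate(tp, fn):
    """False negative rate (FNR) in percent."""
    return 100.0 * fn / (tp + fn)

# Hypothetical counts: 80 pedestrians detected, 20 missed.
tp, fn = 80, 20
print(sensitivity(tp, fn), miss_rate(tp, fn))   # 80.0 20.0
```

Since both rates share the denominator TP + FN, they always add up to 100%, so a lower miss rate implies a higher sensitivity.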

For the dataset evaluation we have trained the cascade classifier based on HOG features and compared this classifier with the OpenCV HOG cascade classifier. We tested the cascade classifier with videos captured from UAVs. Experimental results are presented in Table 1.

Table 1. Dataset training performance

In this table, two cascade classifiers are compared: HOG-CICTE PeopleDetection and a HOG cascade classifier with Adaboost training from the OpenCV library. The results show that our approach has better performance: the miss rate of our proposal is 20% lower than that of the conventional classifier, and the sensitivity is higher. ROC curves comparing both algorithms are presented in Fig. 5.

Fig. 5. Comparison of ROC curves for HOG from OpenCV cascade classifier and HOG-CICTE cascade classifier trained for pedestrian detection in UAVs.

In Fig. 5, the HOG-CICTE classifier performs better than the HOG cascade classifier from OpenCV on videos captured from UAVs.

4.2 Algorithm Evaluation

For the algorithm evaluation we use 3 scenarios with 3 different altitudes. We compare HAAR-LBP features and HOG features (trained with CICTE-PeopleDetection) against other cascade classifiers. Results are presented in Table 2.

Table 2. Cascade classifiers performance

In Table 2, the combination of HAAR-LBP features has low sensitivity compared with the other methods; however, its sensitivity is higher than that of HAAR features alone. As altitude increases, sensitivity decreases in all cascade classifiers. Performance curves are presented in Fig. 6.

Fig. 6. Comparison of ROC curves for different approaches.

In Fig. 6, the performance of the HAAR-LBP features algorithm is better than that of HAAR applied individually. HAAR-LBP features generate a lower rate of false positives. The true positive rate of HAAR-LBP features is higher than that of HAAR features but lower than that of LBP. Nevertheless, the HOG-CICTE cascade classifier still has the best performance due to its higher true positive rate and lower false positive rate.

4.3 Cascade Classifier-Saliency Maps Combination

Based on the performance results, we choose the HOG-CICTE cascade classifier to implement our algorithm. Graphical results are shown in Fig. 7 and video results are available at: https://www.youtube.com/watch?v=KN_hVgp1_t4

Fig. 7. Combination of saliency maps and cascade classifier. (a) Cascade classifier result. (b) Saliency regions. (c) Final result.

As we can see in Fig. 7, the use of the saliency region helps to reject the false positives in the images. For the evaluation we use an additional metric, the precision or positive predictive value (PPV), given by:

$$ PPV = \frac{TP}{TP + FP} \times 100\% $$

where TP is the number of true positives and FP is the number of false positives. The precision of the detector with the application of the saliency region algorithm (SR) is shown in Table 3.
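A short numerical sketch shows why rejecting false positives raises the precision when the true positives are preserved. The counts below are hypothetical and only illustrate the formula; they are not the values reported in Table 3.

```python
def precision(tp, fp):
    """Positive predictive value (PPV) in percent."""
    return 100.0 * tp / (tp + fp)

# Hypothetical counts: the saliency region rejects 15 of 25 false positives
# while all 50 true positives are kept.
before = precision(tp=50, fp=25)
after = precision(tp=50, fp=10)
print(round(before, 1), round(after, 1))   # 66.7 83.3
```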

Table 3. Comparison of precision between algorithms

In Table 3, the application of the saliency region algorithm improves the detection precision by approximately 20%, which also indicates an improvement in performance. The performance curves of the two algorithms are shown in Fig. 8.

Fig. 8. ROC curves for HOG and HOG-SR.

Fig. 8 shows that the use of the saliency region algorithm improves the detection performance by eliminating false positives.

5 Conclusions and Future Work

Our proposal for pedestrian detection based on HOG features performs better than OpenCV HOG with respect to sensitivity and miss rate (with an improvement of 20%), as shown in Table 1 and Fig. 5, because the images used for training emulate the UAV perspective.

In order to improve the performance of the HAAR algorithm we combine two algorithms (HAAR and LBP). The sensitivity increased and the miss rate decreased, as shown in Table 2 and Fig. 6; however, the performance is lower than that of the HOG-CICTE and LBP algorithms. When the altitude increased from 2 to 4 meters, the sensitivity decreased in the four algorithms. Comparing HAAR-LBP and HAAR, HAAR-LBP performs better even at an altitude of 4 m.

The use of saliency maps improves detector performance. Saliency maps help to eliminate background regions even with mobile cameras such as those on UAVs; these regions may contain objects that confuse the classifier, so removing them is important to decrease the number of false positives.

In future work, it is necessary to improve the detection. We will train new classifiers with images captured from UAVs, taking into consideration other human body parts such as the face, head and shoulders. In addition, a robust detector could be used for many applications such as people tracking or people avoidance systems.