
1 Introduction

Fast pedestrian detection in aerial images has been a challenge due to the dynamic nature of the images and hardware constraints. The Integral Image for fast feature calculation [2], Histograms of Oriented Gradients (HOG) [3] and classification by a Support Vector Machine (SVM) based tree-type neural network [4] are some of the initial works. Improved methods for feature extraction include Integral Channel Features [5], Boosted HOG [6], Channel Feature Extrapolation [7] and Search Region Proposal based on a Saliency Map [8]. Pedestrian detection in infrared images has been shown by Zhang et al. [9]. Techniques exploiting input image properties include Image Orientation Adjustment by Xu et al. [10] and Locality-constrained Linear Coding based detection by Yang et al. [11]. Techniques that enhance detection speed by breaking the task into multiple stages include Local Binary Patterns with a HOG-SVM classifier [12], simplified HOG [13], Center-Symmetric Local Binary Patterns (XCS-LBP) [14], Bin-Interleaved HOG [15] and a two-stage linear with non-linear SVM [16], but they need improvement for real-time application. Hardware designs suitable for fast HOG-SVM classification have been presented in [17, 18]. This work proposes a two-stage (sparse-dense) sliding window technique for the pedestrian detection task, which improves over the standard single-stage sliding window technique used with HOG+SVM based classifiers [3, 5]. Researchers have also proposed features other than HOG for better detection, but the present work shows how the detection process itself can be sped up using the existing HOG-SVM classification method to achieve real-time or near real-time performance. Section 2 discusses standard sliding window based pedestrian detection. Section 3 presents the proposed two-stage (sparse-dense) sliding window detection technique. Section 4 discusses the experimental setup, results and analysis. Section 5 concludes the work.

2 HOG-SVM Classification Based Pedestrian Detection

Histogram of Oriented Gradients

Histogram of Oriented Gradients (HOG) was proposed by Dalal and Triggs [3] to extract visual information from an image patch using pixel gradients. The technique has been used widely for classification/detection [5,6,7,8,9]; one can refer to [3] for the HOG feature vector calculation. The parameters used to calculate the HOG feature descriptor for an image patch of dimension \(\left[ {M,N} \right]\) are shown in Table 1, and the resulting descriptor length \(F_{l}\) [3], given by (1), is 3780.

Table 1. HOG feature descriptor parameters
$$ F_{l} = \left( {\frac{M}{{C_{h} }} - 1} \right) \cdot \left( {\frac{N}{{C_{v} }} - 1} \right) \cdot b \cdot B_{h} \cdot B_{v} $$
(1)

HOG feature plots with their corresponding RGB images are shown in Fig. 1.
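As a concrete illustration, a minimal sketch of HOG extraction for a \(\left[ {64,128} \right]\) patch using the scikit-image library is given below; the parameter values (9 orientation bins, \(8 \times 8\) cells, \(2 \times 2\) blocks) are the standard ones from [3] and are assumed to match Table 1.

```python
# A minimal sketch of HOG feature extraction with scikit-image,
# assuming the standard parameters from Dalal and Triggs [3]:
# 9 orientation bins, 8x8-pixel cells and 2x2-cell blocks.
import numpy as np
from skimage.feature import hog

patch = np.random.rand(128, 64)  # stand-in for a grayscale [64, 128] window
features = hog(patch,
               orientations=9,          # b
               pixels_per_cell=(8, 8),  # [C_h, C_v]
               cells_per_block=(2, 2))  # [B_h, B_v]
print(features.shape)  # (3780,), as predicted by Eq. (1)
```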

SVM Classification

The Support Vector Machine (SVM) [19, 20] is a supervised learning based classification algorithm that constructs a separating hyper-plane in the feature space to divide samples of \(m\) classes. The input feature vector (\(p\)-dimensional) and the output label for a sample \(i\) are denoted by \(h_{i} \in R^{p}\) and \(y_{i} \in \left\{ { - 1,1} \right\}\) respectively, where \(i = 1,2,3,...,n\); the Person and Background classes are represented by the labels \(1\) and \(- 1\) respectively.

$$ f\left( {h_{i} } \right) = w^{T} \phi \left( {h_{i} } \right) + b $$
(2)

Here, \(f\left( {h_{i} } \right)\) is the distance of a sample from the decision boundary and its sign indicates the predicted class, \(w\) is the weight vector, \(\phi \left( {h_{i} } \right)\) is a (kernel) function of \(h_{i}\) and \(b\) is the offset of the decision boundary.
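In scikit-learn, which is used later for training (Sect. 4), the signed distance \(f(h_{i})\) of (2) is exposed by `decision_function`; a minimal sketch with stand-in data follows (the random features and alternating labels are illustrative only).

```python
# A minimal sketch of Eq. (2) with scikit-learn: decision_function
# returns the signed distance f(h_i); its sign gives the class.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3780))        # stand-in HOG feature vectors
y = np.array([1, -1] * 20)             # 1 = Person, -1 = Background
clf = LinearSVC().fit(X, y)

scores = clf.decision_function(X[:3])  # f(h_i) for three samples
labels = np.sign(scores)               # predicted class labels
```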

Fig. 1. Example images from the UAV123 [1] dataset with their HOG features

Detection in an Image

While the HOG-SVM classifier classifies an image patch of dimension \(\left[ {W_{h} ,W_{v} } \right]\), pedestrian detection on a full image is done by extracting patches of dimension \(\left[ {W_{h} ,W_{v} } \right]\) in a sliding window manner from a multi-scale image pyramid. A five-level image pyramid formed by the original and downscaled versions of an image is shown in Fig. 2.

An important factor that determines detection speed is the number of classification steps for an image, denoted by \(HOG_{count}^{I}\), which can be obtained using (3), where \(W_{h}\) and \(W_{v}\) are the dimensions of the HOG window, \(DS_{h}\) and \(DS_{v}\) are the dense sliding strides of the HOG window \(\left[ {W_{h} ,W_{v} } \right]\) in the horizontal and vertical directions respectively, and \(\left[ {M,N} \right]\) are the input image width and height in pixels.

$$ HOG_{count}^{I} = \frac{{M - W_{h} }}{{DS_{h} }} \cdot \frac{{N - W_{v} }}{{DS_{v} }} $$
(3)
Fig. 2. Multi-resolution image pyramid formed from downscaled versions of the original image

The downscaled image shape is given by \(shape_{dn} = \left[ {\frac{M}{{\alpha^{l} }},\frac{N}{{\alpha^{l} }}} \right]\), where \(\alpha = 1.5\) is the downscaling factor and \(l \in \left\{ {0,1,...,\left( {L - 1} \right)} \right\}\) is the image pyramid level. \(L = 5\) is the number of levels in the image pyramid. The number of classification steps for an image pyramid (\(HOG_{count}^{P}\)) can be obtained using (4).

$$ HOG_{count}^{P} = \sum\nolimits_{l = 0}^{L - 1} {\left( {\frac{{M - W_{h} }}{{\alpha^{l} \cdot DS_{h} }} \cdot \frac{{N - W_{v} }}{{\alpha^{l} \cdot DS_{v} }}} \right)} $$
(4)

where \(HOG_{count}^{P}\) denotes the number of HOG-SVM classification steps for an image pyramid with \(L\) levels. The value of \(HOG_{count}^{P}\) for the parameters given in Table 2, using (4), is 3692. This work focuses on reducing the required number of classification steps for an image pyramid by introducing a two-stage (sparse-dense) sliding window technique.
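The following sketch evaluates (4) numerically; the parameter values (input \(\left[ {640,360} \right]\), window \(\left[ {64,128} \right]\), stride \(\left[ {8,8} \right]\), \(\alpha = 1.5\), \(L = 5\)) are assumed from the text and Table 2, and each level's term is floored to a whole window count.

```python
# A minimal sketch of Eq. (4), assuming the parameters used in the text:
# input [640, 360], HOG window [64, 128], dense stride [8, 8],
# alpha = 1.5 and L = 5 pyramid levels.
import math

def hog_count_pyramid(M=640, N=360, Wh=64, Wv=128,
                      DSh=8, DSv=8, alpha=1.5, L=5):
    total = 0
    for l in range(L):
        nh = (M - Wh) / (alpha**l * DSh)  # window positions, horizontal
        nv = (N - Wv) / (alpha**l * DSv)  # window positions, vertical
        total += math.floor(nh * nv)      # whole windows at this level
    return total

print(hog_count_pyramid())  # 3692, matching the value reported above
```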

Table 2. Input image shape, HOG window shape, sliding stride for standard detection technique [3] and image pyramid parameters

3 A Two-Stage Sliding Window

Conventional classifier based detection techniques follow a dense sampling approach to classify patches from the image pyramid into a number of classes [3, 5, 9]. The dense sampling window stride is denoted by \(\left[ {DS_{h} ,DS_{v} } \right]\) and is \(\left[ {8,8} \right]\); here, \(DS_{h}\) and \(DS_{v}\) are the strides in the horizontal and vertical directions respectively. The proposed two-stage sliding window technique divides the detection task into sparse and dense sampling stages. Its block diagram is shown in Fig. 3.

Fig. 3. Block diagram of the two-stage sliding window pedestrian detection

Stage 1: Sparse Sampling

In the first stage, image patches of size \(\left[ {W_{h} ,W_{v} } \right]\) are extracted from all levels of the image pyramid at larger window sliding strides, denoted by \(\left[ {SS_{h} ,SS_{v} } \right]\), compared to the \(\left[ {DS_{h} ,DS_{v} } \right]\) taken by most previous techniques [3, 5, 6, 10, 11]. Here, \(\left[ {SS_{h} ,SS_{v} } \right] \in \left\{ {\left[ {32,64} \right],\left[ {43,90} \right],\left[ {51,102} \right]} \right\}\). The HOG-SVM classifier output represents the distance from the SVM decision boundary and is taken as the confidence of classification. The distance threshold for this stage is denoted by \(Th_{sparse}\), and image patches exceeding \(Th_{sparse}\) are recorded as regions to propose to stage 2. The regions proposed in this stage are of dimension \(\left[ {W_{h} ,W_{v} } \right]\) in their corresponding downscaled image from the pyramid. The shape of a proposed region transformed back to level 0 (to represent the same region in the original image) is denoted by \(PRS_{0}\) and can be obtained using (5), where \(l \in \left\{ {0,1,...,\left( {L - 1} \right)} \right\}\).

$$ PRS_{0} = \left[ {W_{h} ,W_{v} } \right] \cdot \alpha^{l} $$
(5)
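A small sketch of this bookkeeping, with a hypothetical `Proposal` record (not from the paper), shows how a stage-1 hit at pyramid level \(l\) is mapped back to level-0 coordinates via (5).

```python
# A minimal sketch of Eq. (5): mapping a sparse-stage proposal from
# pyramid level l back to level-0 (original image) coordinates.
# The Proposal record here is hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Proposal:
    x: int       # top-left corner in the level-l image
    y: int
    level: int   # pyramid level l at which the window fired

def to_level0(p, Wh=64, Wv=128, alpha=1.5):
    s = alpha ** p.level
    # Both the corner and the window shape scale by alpha^l.
    return (round(p.x * s), round(p.y * s),  # level-0 corner
            round(Wh * s), round(Wv * s))    # PRS_0 = [Wh, Wv] * alpha^l

print(to_level0(Proposal(x=40, y=16, level=2)))  # (90, 36, 144, 288)
```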
Fig. 4. Sparse sampling for region proposal and dense sampling on the proposed regions

The window sliding stride for the sparse sampling stage is calculated from the percentage overlap between consecutive sampling windows. Overlap percentages of \(50\%\), \(30\%\) and \(20\%\) have been taken for experimentation. The window sliding stride values \(\left[ {SS_{h} ,SS_{v} } \right]\) in pixels can be obtained using (6) and are \(\left[ {32,64} \right]\), \(\left[ {43,90} \right]\) and \(\left[ {51,102} \right]\) respectively. The number of HOG-SVM classification steps in the sparse sampling stage is denoted by \(HOG_{count}^{S}\) and can be obtained using (7).

$$ \left[ {SS_{h} ,SS_{v} } \right] = \left( {1 - overlap\% } \right) \cdot \left[ {W_{h} ,W_{v} } \right] $$
(6)
$$ HOG_{count}^{S} = \sum\nolimits_{l = 0}^{L - 1} {\left( {\frac{{M - W_{h} }}{{\alpha^{l} \cdot SS_{h} }} \cdot \frac{{N - W_{v} }}{{\alpha^{l} \cdot SS_{v} }}} \right)} $$
(7)

Stage 2: Dense Sampling

In this stage, the image regions proposed by the first stage are searched for objects by the HOG-SVM classifier with the dense sampling window strides \(\left[ {DS_{h} ,DS_{v} } \right]\). Image patches crossing a threshold \(Th_{dense}\) are the final detections. An example is shown in Fig. 4. Only the regions proposed by stage 1 are processed in this stage, not the whole image pyramid, which saves significant processing time. A flow-chart of the two-stage process is shown in Fig. 5.
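The overall control flow can be sketched as below; `score_window` is a hypothetical helper standing in for HOG extraction followed by the SVM decision function (2), the thresholds take the values reported in Sect. 4, and the extent of the dense search neighbourhood around each sparse hit is an assumption, not specified by the paper.

```python
# A minimal sketch of the two-stage (sparse-dense) sliding window,
# assuming a score_window(img, x, y) helper returning the HOG-SVM
# decision value f(h_i) for the [Wh, Wv] window at top-left (x, y).
def detect_two_stage(pyramid, score_window, Wh=64, Wv=128,
                     SSh=32, SSv=64, DSh=8, DSv=8,
                     th_sparse=0.1, th_dense=1.0, alpha=1.5):
    detections = []
    for l, img in enumerate(pyramid):
        H, W = img.shape[:2]
        # Stage 1: sparse sampling over the whole pyramid level.
        for y in range(0, H - Wv + 1, SSv):
            for x in range(0, W - Wh + 1, SSh):
                if score_window(img, x, y) <= th_sparse:
                    continue
                # Stage 2: dense sampling inside the proposed region
                # (neighbourhood extent around the hit assumed here).
                for yy in range(max(0, y - SSv), min(H - Wv, y + SSv) + 1, DSv):
                    for xx in range(max(0, x - SSh), min(W - Wh, x + SSh) + 1, DSh):
                        s = score_window(img, xx, yy)
                        if s > th_dense:
                            f = alpha ** l  # back to level 0, Eq. (5)
                            detections.append((round(xx * f), round(yy * f),
                                               round(Wh * f), round(Wv * f), s))
    return detections
```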

Fig. 5. Two-stage sliding window pedestrian detector process flow

The number of classification steps for the dense sampling stage is denoted by \(HOG_{count}^{I}\) (3) and depends upon the region proposals; here, \(I\) in \(HOG_{count}^{I}\) represents an image region proposed by the sparse sampling stage. \(\left( {HOG_{count}^{S} + HOG_{count}^{I} } \right)\) is the combined number of classification steps and has been determined as the average value over 1000 pedestrian images from the UAV123 [1] dataset.

Suppressing Duplicate Detections

The dense sampling stage yields multiple detections around an object, as the classifier output crosses \(Th_{dense}\) at several nearby windows. These duplicate detections are suppressed by computing the Intersection-over-Union (IoU) among the detection boxes using (8), where \(R_{{B_{1} }}\) and \(R_{{B_{2} }}\) denote the areas of boxes \(B_{1}\) and \(B_{2}\) respectively. If \(IoU\left( {B_{1} ,B_{2} } \right)\) crosses a threshold \(IoU_{th}\), then the box with the lower confidence (\(f\left( {h_{i} } \right)\), (2)) is discarded. An example can be seen in Fig. 6.

$$ IoU\left( {B_{1} ,B_{2} } \right) = \frac{{R_{{B_{1} }} \cap R_{{B_{2} }} }}{{R_{{B_{1} }} \cup R_{{B_{2} }} }} $$
(8)
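A compact sketch of (8) and the suppression rule follows; boxes are assumed to be \((x, y, w, h, score)\) tuples, as produced by the stage-2 sketch above.

```python
# A minimal sketch of Eq. (8) and duplicate suppression, assuming
# boxes given as (x, y, w, h, score) tuples in level-0 coordinates.
def iou(b1, b2):
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)      # R_B1 ∩ R_B2
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter  # R_B1 ∪ R_B2
    return inter / union if union else 0.0

def suppress(boxes, iou_th=0.5):
    keep = []
    for b in sorted(boxes, key=lambda b: b[4], reverse=True):
        # Keep a box only if it does not overlap a more confident one.
        if all(iou(b, k) < iou_th for k in keep):
            keep.append(b)
    return keep
```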
Fig. 6. (a) Sparse sampling stage output, (b) dense sampling stage output, (c) \(IoU\)-thresholded output

4 Experimental Results and Analysis

Dataset

The dataset used for training and testing of the HOG-SVM classifier is UAV123 [1]. It has 11575 images in its person class, out of which 2620 images with pedestrians were randomly selected and a window of size \(\left[ {64,128} \right]\) was cropped from each to form the person class; 2450 windows were cropped for the background class. Positive and negative class formation for training is shown in Fig. 7.

Fig. 7. Person and background dataset creation from UAV123 [1] for training and testing of the SVM classifier

SVM Training

The dataset size is 5070 images (2620 person class and 2450 background class) with a train-test split ratio of 80:20. A 5-fold cross validation scheme has been adopted to split the dataset into 5 mutually exclusive parts: training is done on 4 parts combined and testing on the remaining part, and 5 trials of training/testing have been performed. The scikit-learn [21] Python library has been used to train a Support Vector Classifier (SVC) for binary classification of the Person and Background classes. The classifier parameters are shown in Table 3.
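A minimal sketch of this training loop is given below; since Table 3 is not reproduced here, the SVC hyper-parameters are placeholders, and the feature/label files are hypothetical stand-ins for the 3780-dimensional HOG vectors and their \(\pm 1\) labels.

```python
# A minimal sketch of 5-fold cross-validated SVC training with
# scikit-learn [21]; hyper-parameters are placeholders, not the
# actual values from Table 3.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

X = np.load("hog_features.npy")  # hypothetical file of 5070 HOG vectors
y = np.load("labels.npy")        # +1 = Person, -1 = Background

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, te) in enumerate(folds.split(X, y)):
    clf = SVC(kernel="linear")   # placeholder for Table 3 parameters
    clf.fit(X[tr], y[tr])
    pred = clf.predict(X[te])
    p, r, f1, _ = precision_recall_fscore_support(y[te], pred,
                                                  average="binary")
    print(fold, p, r, f1, accuracy_score(y[te], pred))
```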

Table 3. SVM parameters

Classification Performance

Precision, Recall, F1-score and Accuracy have been taken as the classification performance metrics [22]. The metrics for 5 trials on test data, with mean values, are presented in Table 4.

Table 4. Classification performance (5-fold cross validation, SD - Standard Deviation)

Detection Performance

UAV123 [1] provides ground-truth bounding boxes for evaluation of the detection results. The prediction bounding box dimension of the HOG-SVM classifier is fixed to \(\left[ {64,128} \right]\), as the classifier is not designed for the bounding box regression task. Centre Prediction Error (CPE) [23] has been taken as the performance metric; it measures the difference between the predicted object centre and the ground-truth box centre. \(CE_{avg}^{{S_{i} }}\) denotes the average CPE for an image sequence \(S_{i}\) and can be obtained using (9), where \(N_{{S_{i} }}\) denotes the number of frames in sequence \(S_{i}\), and \(\left( {x^{G} ,y^{G} } \right)\) and \(\left( {x^{P} ,y^{P} } \right)\) represent the ground-truth box centre and the predicted box centre respectively. The chosen dataset has 23 person-class image sequences, named person1 to person23, captured from a UAV platform. The average of \(CE_{avg}^{{S_{i} }}\) over the 23 sequences is given by (10), where \(i = \left\{ {1,2,3,...,K} \right\}\) and \(K = 23\).

$$ CE_{avg}^{{S_{i} }} = \frac{1}{{N_{{S_{i} }} }}\sum\nolimits_{t = 1}^{{N_{{S_{i} }} }} {\sqrt {\left( {x_{t}^{G} - x_{t}^{P} } \right)^{2} + \left( {y_{t}^{G} - y_{t}^{P} } \right)^{2} } } $$
(9)
$$ CE_{avg} = \frac{1}{K}\sum\nolimits_{i = 1}^{K} {CE_{avg}^{{S_{i} }} } $$
(10)
$$ CE_{avg} \left( \% \right) = \frac{{CE_{avg} \times 100}}{{\sqrt {M^{2} + N^{2} } }} $$
(11)

\(CE_{avg}\) as a percentage of the input image diagonal length is given by (11), to present a metric invariant to the input image dimension. \(CE_{avg} \left( \% \right)\) for the person sequences in UAV123 [1] has been found by running the detector on all 23 sequences and comes out to be 1.47%. The centre coordinate plots in the X and Y directions for person1 are shown in Fig. 8; it can clearly be seen from these graphs that the detector follows the ground-truth coordinates almost all the time. An example image with the ground-truth box, the detection box and a line joining their centres is shown in Fig. 9.
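Equations (9)-(11) can be sketched as follows; `gt` and `pred` are assumed to be per-sequence arrays of box-centre coordinates, and the image size defaults are assumptions matching the input dimension used later.

```python
# A minimal sketch of Eqs. (9)-(11), assuming gt[i] and pred[i] are
# (T_i, 2) arrays of (x, y) box centres for sequence i, and an input
# image of size [M, N].
import numpy as np

def ce_avg_percent(gt, pred, M=640, N=360):
    per_seq = [np.mean(np.linalg.norm(np.asarray(g) - np.asarray(p), axis=1))
               for g, p in zip(gt, pred)]  # Eq. (9), one value per sequence
    ce = np.mean(per_seq)                  # Eq. (10), mean over K sequences
    return 100.0 * ce / np.hypot(M, N)     # Eq. (11), % of image diagonal
```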

Fig. 8. X and Y coordinate detections vs. ground-truth

Fig. 9. Ground-truth box (pink), detection box (white) and line between centres (green)

Proposed Technique Evaluation

In the proposed technique, \(Th_{sparse} = 0.1\) has been taken for the sparse sampling stage, \(Th_{dense} = 1.0\) for the dense sampling stage and \(IoU_{th} = 0.5\) for the IoU thresholding step. The improvement in processing time is shown in Table 5 for different percentages of overlap between consecutive sampling windows in the sparse sampling stage. Testing has been done on a 3.2 GHz CPU machine. A significant reduction in processing time can be seen, as the number of HOG-SVM classification steps has decreased from the original 3692 to 469, 344 and 294 (Table 5).

Table 5. Percentage overlap vs. processing time for the two-stage detector (input size: \(\left[ {640,360} \right]\))

As shown in Fig. 10, the technique gives different region proposals for different sliding strides, and detections are concentrated around the object after the dense sampling stage. It can be observed that the single-stage dense sampling method (Fig. 10-b) yields the maximum number of detections, but most of them are redundant and are removed by IoU thresholding. Sparse sampling with \(\left[ {32,64} \right]\) (Fig. 10-c) and \(\left[ {43,90} \right]\) (Fig. 10-d) strides provides a significant number of region proposals around the object and multiple detections after the dense sampling stage (Fig. 10-f and 10-g). IoU thresholding is then applied to remove the duplicate detections.

The comparison with the single-stage dense sampling technique is shown in Table 5. For an input image of size \(\left[ {640,360} \right]\), the speed improves from 1.95 to 15.36 fps with \(\left[ {SS_{h} ,SS_{v} } \right] = \left[ {32,64} \right]\) (a factor of 7.88) and from 1.95 to 24.48 fps with \(\left[ {SS_{h} ,SS_{v} } \right] = \left[ {51,102} \right]\) (a factor of 10.50).

Fig. 10. (a) Original image, (b) standard dense sampling detection output on the image pyramid, (c) sparse stage output - [32, 64], (d) sparse stage output - [43, 90], (e) sparse stage output - [51, 102], (f) dense sampling output for (c), (g) dense sampling output for (d), (h) dense sampling output for (e)

Detection speed on full images has been compared with existing techniques; the results are presented in Tables 6, 7, 8, 9 and 10 for different input image dimensions.

Table 6. Comparison with existing work (input image size: \(\left[ {320,240} \right]\))
Table 7. Comparison with existing work (input image size: \(\left[ {640,480} \right]\))
Table 8. Comparison with existing work (input image size: \(\left[ {720,400} \right]\))
Table 9. Comparison with existing work (input image size: \(\left[ {1200,400} \right]\))
Table 10. Comparison with existing work (input image size: \(\left[ {640,320} \right]\))

It is evident from the comparison tables that the proposed technique performs better than similar techniques. Moreover, the detection quality of the technique has been quantified in terms of the average Centre Prediction Error (CPE), a standard metric that judges the distance of the predicted bounding box from the ground-truth. The average CPE \(CE_{avg} \left( \% \right)\) for the UAV123 dataset (person class) is only \(1.47\%\) of the image diagonal length. It should also be noted that classification based techniques combined with window sliding always introduce a quantization error in detection (evident in Fig. 9), equal to \(0.5\) times the HOG window sliding stride \(\left[ {DS_{h} ,DS_{v} } \right]\); a trade-off therefore has to be maintained between detection speed and quality when choosing a particular sliding stride. This work has used \(\left[ {8,8} \right]\) as the sliding stride in the dense sampling stage, which is the standard used by other researchers. The presented average CPE value inherently includes this quantization error along with the actual detection error.

5 Conclusion

This work introduced a two-stage (sparse-dense) sliding window sampling technique for a fast pedestrian detector. The first stage is a sparse sampling stage that extracts relevant regions; in the second stage, the proposed regions are processed by the classifier with smaller strides and a larger classification threshold. Thus, a modified version of the simple HOG-SVM detector has been presented, in which visual information in the image is exploited for region proposal and most of the processing time is spent on the proposed regions. The proposed technique can be utilized to run real-time detection on a low-cost processor on a UAV platform, eliminating the dependency on an external system and eventually opening up the scope for more applications of UAV systems.