1 Introduction

Pedestrians are omnipresent but vulnerable participants in urban traffic. It is therefore important for vehicles to locate them early in on-board video data in order to prevent collisions. From the perspective of active safety, moving pedestrians are more endangered than those standing along the street; in addition, walking pedestrians are more challenging to detect, since their movements cause significant variations in appearance. An early attempt on this topic was made by Curio et al. [7], which was restricted to detecting pedestrians walking across the street. In this paper, we consider all moving pedestrians, including those walking across as well as along the street. General pedestrian detection techniques could readily be employed for moving pedestrian detection, but a specialized solution such as the one presented here improves both effectiveness and efficiency.

The approach we present utilizes motion information from an optical flow estimation in two different ways: first, to select ROIs via motion field segmentation and thus avoid an exhaustive search over the whole image; second, to compute novel motion-based features, motivated by the characteristic motion patterns that moving pedestrians exhibit against the background as well as against other moving objects, e.g. vehicles and bicycles.

Our previous findings on moving pedestrian detection were published in [36]; in this article we provide several new enhancements and deeper insights into this challenge. Most notable are the following four contributions:

Moving object localization from graph-based motion field segmentation

in video frames, motion is an important cue for distinguishing moving objects from the background, as moving objects usually generate flow fields that differ from those of a static background and static objects, in terms of both magnitude and orientation. This assumption still holds for a moving camera, as on a driving vehicle. We observed that the motion field responds more sensitively to small-scale (∼50 pixels in image height) and low-contrast pedestrians than single-image cues (e.g. gradients, colors) do. Demanding scenes in which pedestrians evoke only a weak response from still-image appearance cues often still exhibit distinguishable optical flow vectors. An example is given in Fig. 1, where motion segmentation yields a clearly separated blob around a small-scale, low-contrast pedestrian. We implemented an efficient graph-based segmentation algorithm on motion fields as a pre-processing step to localize moving objects, e.g. walking pedestrians and driving vehicles.

Fig. 1

An example of using motion segmentation for ROI selection when the pedestrian is of small scale and low contrast. (a) The red bounding box indicates the ground truth annotation. (c) The yellow bounding box denotes the final detection result; the number above it is the detection score output by our classification method

Height-prior detection window generation from blob hypothesis

an obvious disadvantage of sliding window detectors is that a huge number of detection windows has to be examined to detect people at various scales across the whole image. We cope with this challenge by an efficient algorithm for detection window proposals. Candidate windows are placed along the width of each segmented blob, with sizes determined from the blob height and a fixed aspect ratio. The motivation is our observation that the flow field of pedestrians always shows clear boundaries along the upper and lower sides of each person, while it sometimes presents weak or even no boundaries on the left and right sides of human bodies. One explanation is that pedestrians in images usually appear side by side rather than on top of each other. See Fig. 4 for an example.

Motion self difference features

we designed new motion-based features for moving pedestrians. As exemplified in Fig. 5, moving pedestrians exhibit rather characteristic motion patterns, in terms of flow magnitudes, compared to static backgrounds and non-articulated moving objects, e.g. driving cars. More specifically, such objects exhibit a flow magnitude map reflecting their fixed outer shapes, while moving pedestrians show non-uniform structures due to inter-body motion. We exploit this property of walking pedestrians to design motion self difference (MSD) features, which compute the difference between neighboring pairs of rectangular regions over the human body with respect to optical flow vectors.

Moving pedestrian annotations

we added “moving” annotations to the Daimler mono pedestrian detection benchmark. For each original pedestrian annotation, we differentiated between moving and static by examining several video frames before and after the current frame. This was done manually by a single, expressly instructed person to avoid bias and inhomogeneity. It was extensive but valuable work for further research in this area. The ground truth data with moving/static annotations can be downloaded from http://www.iai.uni-bonn.de/~zhangs.

2 Related work

In recent years, the topic of pedestrian detection has attracted intense attention in the computer vision community [11, 13], but even the best-performing approaches are still far from being reliable and real-time capable.

Arguably the most prominent pedestrian detector is the HOG+SVM approach proposed by Dalal et al. [8]. Since then, HOG has been considered an effective feature for people detection, as it accurately describes the human-specific, stable head-shoulder appearance and the variable lower body in a rich set of gradients. Most subsequently developed detectors employ it at least as one basic feature, sometimes complemented by others [10, 31, 32, 35]. Apart from that, shape features are also frequently considered a useful cue: Gavrila et al. [18] introduced the Hausdorff distance transform and a template hierarchy to quickly match image edges against a set of shape templates.

Compared to appearance-based features, motion features for dynamic scenes have not yet been investigated as intensively in state-of-the-art detectors, due to the irregular influence of camera motion, which is difficult to remove. The Histogram of Optical Flow (HOF) features [9] were proposed as HOG-like motion features; they were successful on a per-window basis but achieved only minimal benefit for full-image detection in a recent evaluation [11]. Walk et al. [31] later proposed a number of modifications to HOF features, which improve performance modestly. More recently, Park et al. [28] computed temporal differences as features, following a weak stabilization based on coarse optical flow estimates across multiple frames. Bouchrika et al. [4] also analyzed walking patterns to improve detection accuracy.

In terms of classifiers, SVMs have frequently been used in different approaches to people detection. A linear-kernel SVM [8] is computationally much faster than a latent SVM [16], while the latter satisfies the requirements of part-based models. To improve efficiency, Walk et al. [31] developed MPLBoost, which learns multiple pose-specific classifiers in parallel.

Techniques for ROI extraction are of considerable importance, although they are not widely discussed in previous work [19]. Once ROIs are determined, unlikely regions such as the uniform ground plane and sky [24] need not be scanned for people, significantly reducing the number of candidate detection windows that a more costly classifier must examine. Gualdi et al. [20] proposed a statistical ROI search approach that uses Monte Carlo sampling to estimate the likelihood density function with Gaussian kernels; Kamijo et al. [22] applied a spatio-temporal Markov Random Field (MRF) model [23] to extract foreground objects from background scenes; some researchers used symmetry and pedestrian size constraints to select ROIs [2, 3, 5]; Itti et al. [21] employed a biologically inspired attentional algorithm that selects candidate regions according to a saliency map built on pixel color, intensity, and gradient orientation; Elzein et al. [12] and Lim et al. [25] selected moving object regions based on temporal differences computed between successive frames; Enzweiler et al. [14] used Bayes’ rule to estimate the posterior probability of a pedestrian being present in an image region based on motion parallax features. When stereo images are available, more information can be exploited: Franke et al. [17] proposed merging stereo processing and motion analysis to detect moving objects, and Benenson et al. [1] estimated the stixel world to restrict search regions.

3 Overview on our approach

A flow chart of our approach is shown in Fig. 2. The whole procedure can be divided into four main parts:

Fig. 2

Flow chart of our approach; see Section 3 for details

Motion segmentation

Graph-based motion segmentation is applied to the optical flow field, which is computed from two down-sampled, consecutive input frames.

ROI selection and detection window generation

Interesting segments, i.e. blobs that may contain pedestrians, are selected according to permissible sizes and positions derived from prior knowledge. For each such blob, a line of detection windows is defined using a height-prior principle based on the blob’s height progression.

Feature computation

HOG features and MSD features are extracted from each detection window.

Classification

We propose a two-layer classification scheme adapted to the two different kinds of features. First, we train an SVM with a Radial Basis Function kernel (RBF-SVM) for each kind of feature separately; thus we obtain two primary scores for every detection window from the two individually trained classifiers. Next, a linear SVM trained on these two primary scores determines the merged confidence for each detection window, and the final detections are obtained by non-maximum suppression (NMS) [16], which suppresses nearby, less confident windows.

4 Graph-based motion segmentation for detection window generation

We apply motion segmentation as a pre-processing step to select ROIs where moving objects may appear. Since the magnitudes of optical flow vectors are related to the distance between camera and subject, static persons are also distinguishable from the background when the camera itself moves, especially if the following distance ratio $R_z$ is high:

$$ R_{z} = \frac{\text{distance}({bg},{cam})}{\text{distance}({person},{cam})}. $$
(1)

Therefore, although we focus on moving pedestrian detection in this paper, our approach to ROI selection can be extended to static pedestrian detection given an accurate optical flow estimation: the more accurate the optical flow, the weaker the required constraint on $R_z$.

4.1 Graph-based motion segmentation

We use a robust optical flow estimation algorithm introduced by Liu [26], which produces a layered flow field from two consecutive frames $I_t$ and $I_{t+1}$. Because optical flow computation on full-resolution frames would be too time consuming, we first resize the input frames to 1/16 of their original size, preserving the aspect ratio. We find that this down-sampling causes little performance loss, while the overall computation time is significantly reduced.
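For illustration, a minimal Python sketch of this pre-processing step follows. We substitute OpenCV's Farneback estimator for the layered flow of Liu [26], since it is readily available, and we read the 1/16 factor as an area ratio (1/4 per side); both choices are our assumptions, not specifications from the paper. The frames are assumed to be single-channel, matching the monochrome dataset used later.

```python
import cv2
import numpy as np

def flow_image(frame_t, frame_t1, scale=0.25):
    """Compute a two-channel flow image on down-sampled frames.

    Sketch only: the paper uses the layered flow of Liu [26]; here
    OpenCV's Farneback estimator stands in. `scale=0.25` per side
    corresponds to 1/16 of the image area (our reading of the paper's
    down-sampling factor). Frames are 8-bit grayscale arrays.
    """
    small_t = cv2.resize(frame_t, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_AREA)
    small_t1 = cv2.resize(frame_t1, None, fx=scale, fy=scale,
                          interpolation=cv2.INTER_AREA)
    flow = cv2.calcOpticalFlowFarneback(small_t, small_t1, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    return flow[..., 0], flow[..., 1]  # the f_x and f_y channels of I_f(t)
```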

We treat the optical flow field as a two-channel image, referred to as the flow image $I_f(t)$ in the following; its two channels $f_x$ and $f_y$ contain the flow vector components along the corresponding axes. Before segmentation, we apply gentle Gaussian smoothing (σ=0.8) to the flow image in order to reduce noise while only slightly blurring region edges. Segmentation then follows the efficient graph-based algorithm proposed by Felzenszwalb [15], which we adapted to operate on flow images.

First, we build a 4D feature space by mapping each pixel $p_i$ to a vector

$$ \vec{\nu}(i) = (\,x(i), y(i), f_{x}(i), f_{y}(i)\,){^{\mathsf{T}}} , $$
(2)

where $x(i)$ and $y(i)$ are the coordinates of $p_i$ in $I_f(t)$, and $f_x(i)$ and $f_y(i)$ are its flow components.

Second, we construct the graph $G=(V,E)$ by connecting each vertex (feature vector) to its 8 nearest neighbors within this 4D space. The weight $w(v_i,v_j)$ of an edge between vertices $v_i$ and $v_j$ is the $L_2$ (Euclidean) distance between the two corresponding points in feature space.

Finally, we iterate over the set of edges, combining components according to the weights $w(v_i,v_j)$, until a segmentation $S$ satisfying

$$ w(v_{i},v_{j}) > \text{MInt}(C_{i},C_{j}), \quad v_{i},v_{j}\in V, \quad i\neq j $$
(3)

has been found. Here, $C_i$ and $C_j$ denote the components to which $v_i$ and $v_j$ belong, and MInt is the minimum internal difference, defined as:

$$ \text{MInt}(C_{i},C_{j}) = \min\Big( \max\limits_{p,q\in C_{i}} w(p,q)+\tau(C_{i}),\; \max\limits_{p,q\in C_{j}} w(p,q)+\tau(C_{j}) \Big). $$
(4)

The threshold function τ controls the degree of segmentation. A boundary between two components is evidenced when the difference between them exceeds their internal differences; τ controls the required gap between intra- and inter-component differences. In our method, we define τ as follows:

$$ \tau(C) =\frac{k}{|C|}, $$
(5)

where $k$ is a constant and $|C|$ is the number of pixels in component $C$. Thus, for small components we require stronger evidence of a boundary. Intuitively, a larger $k$ produces larger segmented components.
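A compact Python sketch of the adapted segmentation follows. It approximates the 8 nearest neighbors in 4D feature space by the 8-connected grid neighbors (an assumption on our part, for efficiency), and the value of $k$ is an example, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def segment_flow(fx, fy, k=300.0, sigma=0.8):
    """Felzenszwalb-style graph segmentation on a two-channel flow image.

    Edges connect 8-connected grid neighbours, weighted by the Euclidean
    distance in the (x, y, f_x, f_y) space of eq. (2). Returns an integer
    label map of the same shape as the flow channels.
    """
    h, w = fx.shape
    fxs = gaussian_filter(np.asarray(fx, float), sigma)  # noise reduction
    fys = gaussian_filter(np.asarray(fy, float), sigma)
    ys, xs = np.mgrid[0:h, 0:w]
    feat = np.stack([xs, ys, fxs, fys], axis=-1).reshape(-1, 4)  # eq. (2)
    idx = np.arange(h * w).reshape(h, w)

    # Each undirected neighbour pair appears exactly once.
    pairs = [(idx[:, :-1], idx[:, 1:]), (idx[:-1, :], idx[1:, :]),
             (idx[:-1, :-1], idx[1:, 1:]), (idx[:-1, 1:], idx[1:, :-1])]
    src = np.concatenate([a.ravel() for a, _ in pairs])
    dst = np.concatenate([b.ravel() for _, b in pairs])
    wgt = np.linalg.norm(feat[src] - feat[dst], axis=1)

    parent = np.arange(h * w)
    size = np.ones(h * w, dtype=int)
    internal = np.zeros(h * w)  # max edge weight inside each component

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for e in np.argsort(wgt):  # Kruskal-style ascending pass
        a, b = find(src[e]), find(dst[e])
        if a == b:
            continue
        # merge unless the edge exceeds MInt, cf. eqs. (3)-(5)
        if wgt[e] <= min(internal[a] + k / size[a],
                         internal[b] + k / size[b]):
            parent[b] = a
            size[a] += size[b]
            internal[a] = wgt[e]  # ascending order => running maximum

    return np.array([find(i) for i in range(h * w)]).reshape(h, w)
```

The returned label map is what the blob analysis of Section 4.2 operates on.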

4.2 Analysis of interesting blobs

Assuming we obtain $n$ segments from the previous step, only a subset is selected as ROIs. Interesting blobs are those likely to contain moving pedestrians, chosen according to the following weak constraints on blob size and position: blobs belonging to the background usually have a large extent and a huge horizontal span, so they can be rejected easily; blobs smaller than the minimum pedestrian size we want to detect (cf. Section 7.1) are considered noise and discarded; and, assuming pedestrians stand on the ground, the lower boundary of an interesting blob cannot lie above the vanishing point.

The $i$-th segment comprises a set of $s(i)$ unique pixels

$$ \mathrm{S}_{i} =\big\{(x_{1},y_{1}),(x_{2},y_{2}),...,(x_{s(i)},y_{s(i)})\big\}_{i}, $$
(6)

which is required to fulfill the following conditions:

$$ w_{\text{min}}\leq \max\limits_{a,b \in [1,s(i)]} \{|x_{a} - x_{b}|\}_{i} \leq w_{\text{max}}, $$
(7)
$$ h_{\text{min}}\leq \max\limits_{a,b \in [1,s(i)]} \{|y_{a} - y_{b}|\}_{i} \leq h_{\text{max}}, $$
(8)
$$ \max\limits_{k \in [1,s(i)]} \{y_{k}\}_{i} \geq \xi y_{v}, $$
(9)

where $y_v$ denotes the vertical coordinate of the vanishing point, which can be computed from the camera parameters, and $\xi$ ($\xi<1$) is a tolerance parameter.

A number $m$ of interesting blobs satisfying the above requirements remain and undergo further examination for pedestrians.
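A minimal sketch of this filtering, assuming the segmentation is given as an integer label map (as returned by the sketch above); the limit values and the tolerance are scene-dependent placeholders, not values from the paper.

```python
import numpy as np

def select_interesting_blobs(labels, w_min, w_max, h_min, h_max, y_v, xi=0.9):
    """Filter segmented blobs by the size/position constraints (7)-(9).

    `labels`: integer label map from the motion segmentation.
    `y_v`: vertical coordinate of the vanishing point (image y grows
    downward, so a blob's lower boundary is its maximum y).
    """
    keep = []
    for lab in np.unique(labels):
        ys, xs = np.nonzero(labels == lab)
        width = xs.max() - xs.min()      # max |x_a - x_b|, eq. (7)
        height = ys.max() - ys.min()     # max |y_a - y_b|, eq. (8)
        if not (w_min <= width <= w_max):
            continue
        if not (h_min <= height <= h_max):
            continue
        if ys.max() < xi * y_v:          # lower boundary above xi*y_v, eq. (9)
            continue
        keep.append(lab)
    return keep
```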

4.3 Detection window generation

In this subsection, we describe how detection windows are created from interesting blobs as input to our classifier. We generate detection window coordinates from blobs in $I_f(t)$ but crop the windows from the original frame $I_t$; a coordinate transformation is therefore necessary to reverse the down-sampling (cf. Section 4.1). To obtain reasonable detection windows, we propose a novel height-prior algorithm, illustrated in Algorithm 1, which derives window coordinates from segmented blobs in the motion frame. For each blob, we employ a sliding window strategy, generating detection windows from left to right. At each x-coordinate, the height of the detection window is adjusted to the blob's upper and lower boundaries at that coordinate. In order to include some context for feature computation, we grow the height by a ratio $m_b$. After fixing the height, the width is calculated from a constant aspect ratio $r_{wh}$; this aspect ratio is a fixed parameter predefined by the training examples used to learn the classification model. Figure 3 shows two examples of detection windows at different positions.

Fig. 3

Generation of detection windows from a blob: two examples are drawn at different horizontal positions

Algorithm 1: Height-prior detection window generation

The reason for this height-prior rule is our observation that the optical flow around human bodies usually exhibits clear edges in the head-to-sky and feet-to-ground areas; in contrast, the flow field easily forms a horizontally connected region when a group of people walks shoulder to shoulder, as illustrated in Fig. 4.

Fig. 4

An example of blobs each containing multiple people walking side by side. The green bounding boxes in (a) indicate ground truth annotations

We emphasize two details that turned out to be important for good performance: one is to keep the aspect ratio $r_{wh}$ of each detection window fixed to that of the classification model (64/128); the other is to add surrounding context as a fixed proportion $m_b$ of the detection window height, so that gradient information between a person and the background is included during training and classification. In our approach, we choose $m_b = 0.2$.
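The following sketch illustrates the height-prior rule under these conventions, assuming the blob is given as a binary mask in flow-image coordinates; the function name, the column step, and the per-side up-scaling factor of 4 (matching a 1/16-area down-sampling) are our assumptions.

```python
import numpy as np

def height_prior_windows(blob_mask, down_factor=4, m_b=0.2, r_wh=64/128, step=1):
    """Generate detection windows along a blob (cf. Algorithm 1).

    `blob_mask`: boolean mask of one blob in the down-sampled flow image.
    Returns (x, y, w, h) boxes in original-frame coordinates.
    """
    windows = []
    cols = np.nonzero(blob_mask.any(axis=0))[0]   # x-range of the blob
    for x in cols[::step]:
        ys = np.nonzero(blob_mask[:, x])[0]
        if ys.size == 0:
            continue
        top, bottom = ys.min(), ys.max()
        h = (bottom - top + 1) * (1.0 + m_b)      # grow height by context ratio m_b
        w = h * r_wh                              # fixed model aspect ratio (64/128)
        cy = 0.5 * (top + bottom)
        # centre the window on the blob column, then map back to I_t coordinates
        windows.append(((x - w / 2) * down_factor, (cy - h / 2) * down_factor,
                        w * down_factor, h * down_factor))
    return windows
```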

5 Motion self difference features

Typical urban traffic environments contain various objects, including buildings, vehicles, and pedestrians. In this paper, we categorize all of these into static background, comprising all static objects, e.g. buildings and standing pedestrians; and moving objects, e.g. walking pedestrians and driving cars. After the motion segmentation procedure, blobs likely to contain moving objects are selected, but inevitably some static scene parts may also be included, due to a high $R_z$ (cf. (1)) or to errors in the optical flow estimation or the segmentation algorithm. The task of the classification procedure is to distinguish moving pedestrians from all other objects; we therefore examine the optical flow magnitude maps of various objects to find representative features for moving pedestrians.

We observed that moving pedestrians exhibit flow patterns that differ markedly from those of the static background (including static pedestrians) and of other moving objects, e.g. moving cars. The main reason is that the different body parts of a moving pedestrian generate inconsistent motions; see Fig. 5 for examples. More specifically, the flow magnitude maps of static background and rigid moving objects clearly show the constant silhouette of the object, whereas for moving pedestrians only the silhouette of the upper body stays constant, and there is a significant difference between the two legs, since only one leg moves at a time. We call this inter-body motion, caused by the non-rigidity of the human body. Based on this observation, we propose MSD features to describe the inter-body relative motion of moving pedestrians. The feature extraction procedure is listed in Algorithm 2.

Fig. 5

Several examples of flow magnitudes for various objects in a typical urban traffic environment. Each blue bounding box contains a pair of the original image and its corresponding flow magnitude map

First, we resize each detection window to the size of our pedestrian model, 64×128 pixels, and divide the resized window into square regions of s×s pixels, called cells in the following. We denote the number of cells along the horizontal and vertical directions by $nc_h$ and $nc_v$, respectively.

Second, we compute two histograms for each cell using trilinear interpolation [8], over $f_x$ and $f_y$ respectively. For the whole detection window, this yields two histogram sets, denoted

$$ H_{x} = \{h_{x}(i,j)|1\leq i\leq nc_{h}, 1\leq j\leq nc_{v}\}, $$
(10)

and

$$ H_{y} = \{h_{y}(i,j)|1\leq i\leq nc_{h}, 1\leq j\leq nc_{v}\}. $$
(11)

Third, we compute differences between histograms within $H_x$ and within $H_y$. For each cell, we consider all eight nearest neighboring cells and compute the difference between the center cell and each neighbor. As difference measure we use histogram intersection (HI). Given two histograms $h_p$ and $h_q$, each containing $n$ bins, histogram intersection is defined as

$$ \text{HI}(h_{p},h_{q}) = \frac{\sum\limits_{k=1}^{n}\min(h_{p}(k), h_{q}(k))}{\sum\limits_{k=1}^{n}h_{p}(k)}. $$
(12)

If we simply iterated over all cells, redundancy would emerge, since some cell pairs would be considered twice. We therefore employ a tabu list: cell pairs that have been considered once are appended to the list, and a second visit is prohibited.

Choosing a cell size of s=8 pixels for a 64×128 pixel detection window, we obtain a 1072-dimensional feature vector. Note that cells along the detection window borders have fewer than eight neighbors.

Algorithm 2: Motion self difference feature extraction
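For illustration, a minimal Python sketch of the MSD extraction, assuming the flow channels of a window are already resized to the 64×128 model size. Simple per-cell histograms stand in for the trilinear interpolation of [8], and the bin count and value range are placeholder choices, so the resulting dimensionality may differ from the paper's 1072.

```python
import numpy as np

def msd_features(flow_win, s=8, n_bins=9):
    """Motion self difference features for one window (cf. Algorithm 2).

    `flow_win`: array of shape (128, 64, 2) holding the f_x and f_y
    channels of a resized detection window.
    """
    h, w, _ = flow_win.shape
    nc_v, nc_h = h // s, w // s
    hists = np.zeros((2, nc_v, nc_h, n_bins))
    for c in range(2):                      # channel 0: f_x, channel 1: f_y
        for i in range(nc_v):
            for j in range(nc_h):
                cell = flow_win[i*s:(i+1)*s, j*s:(j+1)*s, c]
                hists[c, i, j], _ = np.histogram(cell, bins=n_bins, range=(-8, 8))

    def hi(hp, hq):                         # histogram intersection, eq. (12)
        return np.minimum(hp, hq).sum() / max(hp.sum(), 1e-12)

    feats, seen = [], set()                 # `seen` acts as the tabu list
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for i in range(nc_v):
        for j in range(nc_h):
            for di, dj in offsets:
                ni, nj = i + di, j + dj
                if not (0 <= ni < nc_v and 0 <= nj < nc_h):
                    continue                # border cells have fewer neighbors
                pair = tuple(sorted([(i, j), (ni, nj)]))
                if pair in seen:
                    continue                # tabu: each pair only once
                seen.add(pair)
                feats.append(hi(hists[0, i, j], hists[0, ni, nj]))
                feats.append(hi(hists[1, i, j], hists[1, ni, nj]))
    return np.asarray(feats)
```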

6 Two-layer classification

In this paper, we use two different kinds of features: HOG, which is appearance based, and MSD, which is motion based. Generally, there are two strategies for classification with multiple features. One is to combine features, i.e. to concatenate all features into one long vector and train a single classifier; the other is to combine classifiers, i.e. to train one classifier per feature set individually and then combine their outputs into a joint decision.

The first method is easy to implement, but it may suffer from the curse of dimensionality; the second consists of a two-layer classification procedure, which adapts better to high variation among different kinds of features, as these may give rise to quite different decision boundaries during training. We therefore apply the two-layer classification scheme in this article (cf. the illustration in Fig. 2).

In the first layer, as in [8], we choose support vector machines as classifiers for both HOG and MSD features, but in two separate feature spaces. Unlike the linear kernel widely used in similar applications, we use an RBF kernel: the linear kernel is usually chosen for its efficiency rather than its quality of results. Since our approach examines far fewer detection windows than previous methods, the time consumed by an RBF-kernel SVM remains tolerable.

In the second classification layer, there are multiple possible ways to combine the decision scores from the first layer [6]. In our case a linear-kernel SVM suffices, since the feature space is only two-dimensional, consisting of the two decision scores from the first-layer classifiers. The trained linear SVM outputs the final decision score of the whole classification procedure.
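A minimal sketch of the two-layer scheme using scikit-learn; the hyper-parameters are placeholders, as the paper does not specify them.

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

class TwoLayerClassifier:
    """Two-layer scheme: one RBF-SVM per feature kind, fused by a linear SVM."""

    def __init__(self):
        self.svm_hog = SVC(kernel="rbf", C=1.0, gamma="scale")
        self.svm_msd = SVC(kernel="rbf", C=1.0, gamma="scale")
        self.fuser = LinearSVC(C=1.0)

    def _scores(self, X_hog, X_msd):
        # The two primary decision scores form a 2D feature space.
        return np.column_stack([self.svm_hog.decision_function(X_hog),
                                self.svm_msd.decision_function(X_msd)])

    def fit(self, X_hog, X_msd, y):
        self.svm_hog.fit(X_hog, y)          # layer 1, appearance features
        self.svm_msd.fit(X_msd, y)          # layer 1, motion features
        self.fuser.fit(self._scores(X_hog, X_msd), y)  # layer 2
        return self

    def decision_function(self, X_hog, X_msd):
        return self.fuser.decision_function(self._scores(X_hog, X_msd))
```

In practice one would train the fusion layer on scores from a held-out split to avoid an optimistic bias; the sketch omits this for brevity.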

As in [8], we use the INRIA person dataset for training with HOG features, and the TUD-Brussels dataset [34] for MSD features, since the latter provides motion-pair annotations. A multi-round training strategy is used for both: the final model is produced by retraining on an augmented set (initial training data plus hard negative samples).

After our classification procedure assigns a final score to each detection window, the final detections in the current frame are determined by a pairwise non-maximum suppression method [16], which discards the less confident of every pair of detection windows that overlap sufficiently according to (13).
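For illustration, a greedy sketch of pairwise suppression under the overlap criterion of (13); the greedy confidence ordering is our own simplification of [16].

```python
import numpy as np

def pairwise_nms(boxes, scores, thr=0.25):
    """Drop the less confident of any pair whose overlap ratio (eq. 13)
    exceeds `thr`. Boxes are (x, y, w, h); returns kept indices."""
    def overlap(a, b):
        ax2, ay2 = a[0] + a[2], a[1] + a[3]
        bx2, by2 = b[0] + b[2], b[1] + b[3]
        iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
        ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    kept = []
    for i in np.argsort(scores)[::-1]:      # most confident first
        if all(overlap(boxes[i], boxes[j]) <= thr for j in kept):
            kept.append(i)
    return kept
```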

7 Experiments

To our knowledge, none of the current public pedestrian datasets provide “moving” annotations for pedestrians. Therefore, we chose a basis dataset and manually annotated moving pedestrians ourselves. Our criteria for an appropriate dataset were: first, it should contain consecutive video frames for optical flow computation; second, it should be large, with many annotated pedestrians; third, larger pedestrians are preferable, because they make adding moving annotations easier.

Based on these criteria, we adopted the Daimler mono pedestrian detection benchmark [13], captured by a monochrome camera mounted on a vehicle driving through an urban environment, as our basis dataset. It consists of 21,790 consecutive frames (640×480) with 56,492 pedestrian annotations. The on-board camera setting and the large number of pedestrian annotations suit our purpose of an extensive test for intelligent vehicle applications. We manually determined each pedestrian annotation to be moving or static by observing multiple consecutive video frames before and after the current time point, and added a “moving” label to the ground truth data to record the movement status of each pedestrian annotation.

In this section, we explain our evaluation protocol, examine performance of our detector under different configurations, and evaluate the performance of our approach in comparison to several other methods using the same protocol. In addition, we discuss the runtimes of all the detectors considered in this article in Section 7.4.

7.1 Evaluation protocol

In the following, we explain our evaluation protocol in four aspects, consistent with the conventions of this field.

Ground truth regulation

(1) Ignored data selection: three categories of pedestrians are not considered in our experiments: first, static pedestrians; second, pedestrians less than 50 pixels in image height, which in this dataset corresponds to a real-world height of 1 meter at a distance of 25 meters; third, heavily occluded pedestrians (36-100 % occlusion). All such data is marked with an ignore label, meaning it need not be matched, but matches against it are not counted as mistakes either. (2) Aspect ratio standardization: because most detectors use windows with an aspect ratio of 0.5, the ground truth must be annotated in the same way to obtain meaningful performance figures [11]. We standardize all ground truth bounding boxes by keeping the original height and center while adjusting the width. From our observation, the performance of the different detectors remains stable over various choices of $r_{wh}$.

Detection results filtering

Detection results are filtered using the expanded filtering method of [11], so that detections far outside the evaluated scale range are not considered. When evaluating a scale range $[S_1, S_2]$, only detections in $[S_1/\xi, S_2\,\xi]$ are considered. In our evaluation, we set ξ=1.25.

Matching rules

Filtered ground truth bounding boxes and detection bounding boxes are denoted $B_{\text{gt}}$ and $B_{\text{dt}}$, respectively. A detected bounding box and a ground truth bounding box match if and only if the ratio of the area of their overlap to the area of their union exceeds a given threshold [13]:

$$ \text{match}(B_{\text{dt}},B_{\text{gt}}) = \frac{\text{area}(B_{\text{dt}} \cap B_{\text{gt}})}{\text{area}(B_{\text{dt}} \cup B_{\text{gt}})} \stackrel{!}{>} 0.25 \text{ .} $$
(13)

Performance measurements

We perform full-image evaluation instead of per-window evaluation, as the former provides a natural measure of the error of an overall detection system. To compare detectors, we plot miss rate against false positives per image (FPPI) on logarithmic scales by varying the threshold on the classifiers' detection confidence. We only plot curves for FPPI values up to $10^0$, as more than one false positive per image is in any case unacceptable for intelligent vehicle applications. In addition to the miss rate vs. FPPI curves, we report a single numerical measure of detector performance: the log-average miss rate [11], computed by averaging the miss rates at nine FPPI values evenly sampled in log space over $[10^{-2}, 10^0]$. According to [11], the log-average miss rate gives a more stable and informative assessment of overall performance than the single miss rate at $10^{-1}$ FPPI.
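A small sketch of this summary measure, assuming the curve is given as sampled (FPPI, miss rate) pairs; the log-log interpolation is a common convention rather than something specified in [11].

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Average miss rate at nine FPPI values evenly spaced in log space
    over [1e-2, 1e0]. `fppi` must be sorted in increasing order."""
    ref = np.logspace(-2.0, 0.0, 9)
    # interpolate log(miss rate) over log(fppi); clip zeros for safety
    log_mr = np.interp(np.log(ref), np.log(fppi),
                       np.log(np.maximum(miss_rate, 1e-12)))
    return np.exp(log_mr.mean())
```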

7.2 Comparisons of our detectors under different configurations

To examine the improvements brought by motion segmentation and motion self difference features, we define two detector configurations: MoSeg+HOG, using motion segmentation for ROI selection and HOG features for classification; and MoSeg+HOG+MSD, using motion segmentation for ROI selection and both HOG and MSD features for classification. We compare these two detectors with a detector based solely on HOG features to observe the performance gains from both modifications.

From Fig. 6, we can see that adding motion segmentation for ROI selection to the original HOG detector results in a significant decrease in miss rate of 12 % on average; moreover, integrating our novel, motion-based MSD features yields a further improvement of 7 % in average miss rate over MoSeg+HOG. Our best detector is therefore MoSeg+HOG+MSD, which improves on the average miss rate of the original HOG detector by 19 % overall.

Fig. 6

Performance comparison of our detectors under different configurations on Daimler Monocular Pedestrian Dataset

7.3 Comparisons against state-of-the-art detectors

We also compare our proposed MoSeg+HOG+MSD detector against several state-of-the-art detectors from a recent survey [11]: the original HOG [8] as the most similar approach; VJ [30], Shapelet [29], and MultiFtr [33] for comparison of different features; and HikSvm [27] and LatSvm [16] for comparison of different SVM kernels. We do not consider detectors that use color information, as the dataset contains only gray-scale images. Moreover, we do not discuss detectors whose runtimes per frame (640×480 pixels) are on the level of minutes as reported by [11], since they are far from the real-time requirements of intelligent vehicle applications. The original detection results of the six selected detectors, together with our own, are evaluated using the evaluation protocol explained in Section 7.1.

The miss rate vs. FPPI curves of the above detectors are shown in Fig. 7, with log-average miss rates indicated in the box at the bottom left of the figure. Our approach obtains a log-average miss rate of 36 %, lower than the other state-of-the-art detectors, and its miss rate is consistently lower in the FPPI range $[10^{-3}, 10^{-1}]$, the most relevant range for intelligent vehicle applications (Fig. 8).

Fig. 7

Performance comparison of state-of-the-art detectors on Daimler Monocular Pedestrian Dataset

Fig. 8

Some exemplary results from our approach in different scenarios. In each row, the original frame, optical flow field, motion segmentation blobs and final detection results are shown from left to right

7.4 Runtime analysis

We performed our experiments on a computer with an Intel Core i5 CPU (2.4 GHz); our source code is written in Matlab. Evaluated over all 21,790 frames of the dataset, our runtime is 1.93 seconds per frame on average.

As re-implementing every other detector and rerunning it on our machine would be infeasible, we instead normalize all runtimes reported in [11] to the speed of a single machine. The normalization works as follows: of the 15 detectors selected in [11], one has publicly available code. We downloaded and ran it on our computer, and comparing its runtime there with the runtime reported in [11] yields a relative speed ratio between our computer and the one used in [11]. We then use this ratio to normalize all detector runtimes to the speed of our computer; a similar method was used in [11]. From Table 1, among the six other state-of-the-art detectors, LatSvm [16] is the fastest, with an average runtime of 10.86 seconds per 640×480 frame, but our approach is more than five times faster. The total runtime of our detector is 1.93 seconds, consisting of 0.63 seconds for ROI selection and 1.60 seconds for classification. Among all the detectors listed in Table 1, ours is the fastest and the only one that applies ROI selection, which demonstrates that ROI selection is very important for efficiency.

Table 1 Runtimes (in seconds) of different detectors by normalizing to the rate of a single machine

7.5 Discussion

Based on the experimental results reported in Section 7.3, we highlight two aspects of our comparisons in more detail: first, our modifications to the HOG approach result in a significant decrease in miss rate of 19 % on average, which confirms that our improvements are effective; second, we outperform MultiFtr, which combines various kinds of appearance features such as Haar-like features and shapelets, suggesting that an integration of temporal and spatial information is more informative than spatial information alone, and that we managed to exploit this.

Another important aspect of evaluating a detector is speed. The runtimes reported in Section 7.4 are per single frame. For our detector, the computational cost comes from two parts, ROI selection and the detection algorithm, whereas for the other detectors it comes from the detection algorithm alone. There is thus a trade-off between the additional ROI selection cost and the reduced detection cost. From Table 1, our detector is significantly faster than the other detectors considered, indicating that our ROI selection method is an effective way to reduce the runtime of the whole detector while maintaining reasonable detection performance.

8 Conclusion and future work

In this paper, we proposed a new approach utilizing motion information for moving pedestrian detection. A main contribution is the use of motion information from optical flow estimation for ROI selection, accomplished by motion segmentation and blob selection. Furthermore, the motion field is used to extract novel motion features by computing self differences between the flow-vector histograms of neighboring regions. These features have been shown to produce distinctive patterns for moving pedestrians versus static background, static pedestrians, and other moving objects, thus increasing classification performance.

Our approach significantly reduces the number of examined detection windows by explicitly generating suitable pedestrian hypotheses inside motion field blobs (ROIs), resulting in fewer false positives as well as lower runtime. Furthermore, a two-layer classification scheme enhances the detection confidence by combining a HOG+SVM and an MSD+SVM classifier via a linear SVM.

Experimental results on the Daimler mono pedestrian detection benchmark showed that our approach obtains a lower log-average miss rate than several state-of-the-art detectors, while being at least five times faster than the fastest of them.

In the future, we will focus on motion pattern analysis for other moving objects, e.g. moving cars, so as to extend our approach to recognize various moving objects, enabling a better understanding of urban traffic environments for collision avoidance and other tasks.