
1 Introduction

1.1 Pedestrian Detection

With the rise of autonomous driving systems, visual perception algorithms face a new challenge: detecting and recognizing objects in highly varying scenes. Among the typical objects present in traffic scenes, pedestrians are particularly challenging to identify because they assume different poses, vary widely in appearance and can easily be confused with other objects with similar properties [25].

In the past decades, numerous pedestrian detection algorithms [1, 7, 8, 25, 31] have been proposed, the majority of which have been tested on publicly available datasets such as Caltech [5] and KITTI [11]. Although these datasets contain a sufficiently large amount of data for evaluating the performance of pedestrian detection algorithms, they lack variability in scene properties, such as different lighting conditions and changes in pedestrians' appearance under different weather conditions.

Fig. 1. Different sources of detection errors due to the variability in the appearance of pedestrians and scenes: (a) localization errors due to the presence of bags, backpacks and umbrellas, which are commonly associated with pedestrians observed in the scenes; (b) false positives caused by various environmental factors such as reflections on wet surfaces and over-exposure, as well as the presence of objects resembling pedestrians; and (c) false negatives due to variation in shape, e.g. children who have a different aspect ratio compared to adults, and appearance, e.g. pedestrians wearing hooded jackets, holding umbrellas or carrying bulky backpacks.

Given the dynamic nature of driving and the fact that autonomous vehicles should be able to handle a wide range of conditions robustly (see examples in Fig. 1), there is a need to examine the performance of pedestrian detection algorithms and measure their limitations under various visual conditions.

A number of past studies have investigated the role of data properties, such as deformation and occlusion [17], ground truth annotation [23, 30], and scale [18], in pedestrian detection algorithms. What is missing, however, is an analysis of the effects of changes in visual appearance due to pedestrian attributes and environmental conditions.

The newly proposed detection datasets collected under various conditions, such as CityPersons [32] and JAAD [20], provide the opportunity to further investigate the role of data properties in the performance of pedestrian detection algorithms. To this end, we analyze the performance of state-of-the-art pedestrian detection algorithms using the publicly available JAAD dataset for which we annotated all pedestrian samples with information regarding their appearance, such as clothing, accessories, objects being carried and pose.

Using the newly annotated dataset together with available properties of JAAD, we show performance variation in detection algorithms as a result of the changes in train/test data. In particular, we investigate the influence of weather conditions under which the data is collected and attributes that impact appearance and visibility of pedestrians. We also examine the effect of data diversity on generalizability by cross-evaluating the state-of-the-art pedestrian detection algorithms on the JAAD and Caltech datasets. As part of our contribution, we release an evaluation framework for training and testing pedestrian detection algorithms using common benchmarks and evaluation metrics.

2 Related Works

Pedestrian detection is a well-studied field. Over the years, a wide range of algorithms have been developed, ranging from models based on hand-crafted features [7, 14, 31] to modern convolutional neural networks [1, 8, 29], and hybrid algorithms benefitting from a combination of both of these techniques [15, 28].

The modern pedestrian detection algorithms use various techniques to overcome the challenges of identifying pedestrians in the wild. For example, Tian et al. [24] propose a part-based detection algorithm to deal with occlusion. The model consists of a number of part detectors, combinations of which determine the existence of a pedestrian in a given location. In [25], the authors use semantic information of the scene in the form of pedestrian attributes, e.g. carrying a backpack, and scene attributes such as trees or vehicles to distinguish the pedestrians from the background.

In [29] the authors use a bootstrapping technique to mine hard negative samples in order to minimize confusion caused by the background when detecting pedestrians. The proposed algorithm uses features learned by a region proposal network (RPN) to train a cascaded boosted forest for the final hard negative mining and classification. In a more recent approach, Brazil et al. [1] show that jointly training a Faster R-CNN network and a semantic segmentation network on pedestrian bounding boxes can improve the overall detection results.

As the performance of state-of-the-art pedestrian detection algorithms on benchmark datasets began to saturate (e.g. 7–9% miss rate reported on Caltech [5]), attention shifted towards the effects of data properties on detection performance. A recent study on generic object recognition tasks shows that an order-of-magnitude increase in the number of training samples can enhance performance even in the presence of up to 20% error in the ground truth annotations [22].

As for pedestrian detection algorithms, the effects of occlusion and sample size [17], the balance between negative and positive samples [12], and the cleanness of ground truth annotations [23] have been investigated. Zhang et al. [30], for example, demonstrate that the proportions of misclassification and localization errors vary significantly depending on the algorithm. Through experimental evaluations, the authors show that simply improving the quality of ground truth annotations can significantly reduce localization errors, resulting in an overall performance boost of more than 7% in miss rate for state-of-the-art pedestrian detection algorithms.

2.1 Datasets

There are a number of publicly available pedestrian detection datasets, among which the Caltech [5] and KITTI [11] datasets are widely used for evaluating the performance of pedestrian detection algorithms. These datasets, although large in scale, lack diversity in data properties such as weather conditions, geographical locations and pedestrian attributes. For example, Caltech contains 10 hours of driving footage collected under sunny and clear weather conditions in the streets of Los Angeles. Likewise, KITTI was collected under similar weather conditions in the streets of Karlsruhe, Germany.

Recently, we have witnessed the emergence of more diverse pedestrian detection datasets. For instance, CityPersons [32] is a pedestrian detection dataset comprising data collected in various cities across Germany, in different seasons and under different weather conditions. Another pedestrian detection dataset, JAAD [20], is a set of high-resolution image sequences collected in different countries that contains video footage recorded under both clear and extreme weather conditions such as heavy rain.

These recently proposed datasets provide a variety of scenery and pedestrian samples suitable for studying the limitations of pedestrian detection algorithms under different conditions. Examples of errors caused by changes in data properties are illustrated in Fig. 1.

Despite the introduction of diverse pedestrian detection datasets, there have been very few attempts to quantify the effect of data properties on pedestrian detection algorithms. To this end, in this paper we analyze the effect of data properties in two ways: their impact on the performance of state-of-the-art algorithms, and the generalizability of the algorithms across different datasets. More specifically, the contributions of this paper are as follows:

1. We introduce a large dataset of pedestrian attributes by annotating the pedestrian samples from the JAAD dataset [20] to study the effect of pedestrian appearance changes on detection algorithms.

2. We examine the performance of state-of-the-art pedestrian detection algorithms with respect to dataset properties and highlight changes in their behavior with respect to different training and testing samples.

3. We perform a cross-evaluation of the state-of-the-art algorithms on the JAAD and Caltech datasets to measure the generalizability of algorithms and datasets based on different properties of the data.

4. We propose a software framework for experimentation and benchmarking of classical and state-of-the-art pedestrian detection algorithms using publicly available pedestrian datasets.

3 The Attribute Dataset

There are a number of existing pedestrian attribute datasets that provide fine-grained attributes (e.g. RAP [13], PETA [4]). These datasets primarily cater to applications such as surveillance and identification tasks, and, as a result, often contain indoor scenes or are recorded using on-site security cameras. Such characteristics make these datasets unsuitable for analyzing pedestrian detection algorithms for applications such as autonomous driving.

Fig. 2. (a) Types and frequency of the new attribute labels in the JAAD dataset, color-coded by attribute type (e.g. pose, clothing color, accessories); (b) samples of pedestrians with select attribute labels shown. (Color figure online)

Tian et al. [25] introduced pedestrian attribute information for the Caltech dataset. The authors augmented the dataset with 9 attributes on 2.7K pedestrian samples. As was mentioned earlier, the Caltech dataset has insufficient variability of weather and scenery properties, hence the attributes lack diversity as well.

To investigate the effect of pedestrian attributes and data properties on detection algorithms, we utilized the publicly available JAAD dataset. The JAAD dataset is a naturalistic driving dataset which comprises videos gathered under different weather and road conditions and contains annotations for video properties, as well as some characteristics of pedestrians (e.g. their age and gender).

We further extended these annotations by adding 16 attributes for each of the 392K pedestrian samples, a total of 900K new attribute labels, summarized in Fig. 2(a). There are attributes for coarse pose (left, right, back, front), clothing color (upper_dark and lower_dark) and length (below_knee for long coats and skirts).

There are also several attributes for the presence, location and type of bags: whether they are worn on the left_side or right_side relative to the pedestrian's body, and whether they are carried on the shoulder (bag_shoulder), elbow (bag_elbow) or back (backpack), or held in the hand (bag_hand). In addition, we add labels for hooded clothing (hood) and caps (cap), accessories (e.g. phone, sunglasses) and various objects that pedestrians can hold in their hands (e.g. object, baby).
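To make the labeling scheme concrete, below is an illustrative record for a single pedestrian sample. This is only a sketch: the field names and grouping are hypothetical and do not reflect the dataset's actual annotation format.

```python
# Hypothetical representation of the attribute labels for one pedestrian sample;
# field names and grouping are illustrative only, not the JAAD annotation format.
sample_attributes = {
    'pose': 'left',                       # one of: left, right, back, front
    'clothing': {'upper_dark': True, 'lower_dark': True, 'below_knee': False},
    'bags': {'backpack': True, 'bag_shoulder': False, 'bag_elbow': False,
             'bag_hand': False, 'side': 'right_side'},
    'headwear': {'hood': False, 'cap': True},
    'accessories': ['phone'],             # e.g. phone, sunglasses
    'carried_objects': [],                # e.g. object, baby
}
```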

The attributes were selected based on their relevance to the driving task. For instance, the pose of the pedestrian and the color of their clothing affect visibility; long clothing obscures the shape and movement of the human body; caps, hoods and sunglasses occlude the pedestrian's face and may limit their view of the traffic scene as well; carrying large bags, backpacks or other objects may not only change the appearance and shape of the pedestrian but also limit their mobility; and holding a phone does not change the pose significantly, but can be used to infer the pedestrian's distraction [19].

Clothing color and pose are the only attributes provided for all bounding boxes in the JAAD dataset and form the minimum attribute set. As can be seen from the bar plot in Fig. 2(a), most pedestrians in the dataset are wearing dark clothes, for instance, nearly 70% of pedestrians have both upper_dark and lower_dark attributes present.

Pose attributes, left, right, back, and front, are nearly equally distributed. Aside from clothing color and pose, the bags category is the most represented. In fact, nearly 50% of all pedestrians carry a bag or a backpack. In the following sections, we will consider the effect of the diversity and uneven distribution of attributes in the training data on detection.

4 Experimental Setup

4.1 Evaluation Framework

Our framework provides a unified API for experimentation with 10 classical and state-of-the-art pedestrian detection algorithms, including SPP+ [16], ACF+ [7], Faster-RCNN [21], CCF [28], Checkerboards [31], DeepPed [26], RPN+BF [29], LDCF+ [14], MS-CNN [2], and SDS-RCNN [1]. All algorithms in the API have training and testing code, except SPP+ and DeepPed, which only have testing code since no official training code has been released by their authors.

The proposed framework is compatible with major publicly available pedestrian detection datasets including INRIA [3], ETH [10], TUD-Brussels [27], Daimler [9], Caltech [5], KITTI [11], CityPersons [32], and JAAD [20]. It allows the manipulation of these datasets in terms of scale, balancing training and testing samples, selection of ground truth, etc. The results can be evaluated using common metrics for pedestrian detection.

The software is implemented in Matlab and is based on the code published by the authors of the corresponding algorithms. The training and testing functions are modified for ease of use from a single API. Our framework uses the original and modified versions of the Piotr toolbox [6] and follows the Caltech benchmark standards [5].

4.2 Algorithms

For the experimental evaluations in this paper, we chose three classical algorithms as baselines: ACF+ and its variation LDCF+ [7], and LDCF++ [14]; and four state-of-the-art algorithms: RPN+BF [29], MS-CNN [2], SDS-RPN and SDS-RCNN [1] (the top-performing algorithm as of ICCV 2017). For the RPN+BF algorithm, we only report the results of its RPN component to highlight how the weak segmentation approach proposed in SDS-RPN behaves under different conditions.

The algorithms were trained on the subsets of the JAAD dataset using the default parameters proposed by the authors for the Caltech dataset. The only exception is that we modified the width of training and testing images to maintain the aspect ratio of the images in JAAD. For cross-evaluation with the Caltech dataset, we used the pre-trained models published by the authors of corresponding algorithms.

4.3 Data

The JAAD dataset contains HD-quality images with dimensions of 1080 \(\times \) 1920 pixels. To maximize the performance of the detection algorithms using the default parameters tuned on Caltech, we resized all images to half scale, 540 \(\times \) 960. For evaluation and training, we selected samples at a reasonable scale (bounding box height of 50 pixels or more) with at most partial occlusion (visibility of 75% or more).
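A minimal sketch of this sample selection step is shown below, assuming a simple list-of-dictionaries annotation format with `bbox` and `visibility` fields; these names, and the choice to apply the 50-pixel threshold after rescaling, are our assumptions rather than the framework's actual code.

```python
# Sketch of the "reasonable" sample selection described above (assumed field names).
def reasonable_samples(annotations, scale=0.5, min_height=50, min_visibility=0.75):
    kept = []
    for ann in annotations:
        # Rescale the box to the half-scale (540 x 960) images.
        x, y, w, h = (v * scale for v in ann['bbox'])
        # Keep pedestrians that are tall enough and at most partially occluded.
        if h >= min_height and ann.get('visibility', 1.0) >= min_visibility:
            kept.append({**ann, 'bbox': [x, y, w, h]})
    return kept
```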

For the experimental evaluations, we divided JAAD into four different train/test subsets according to the weather conditions under which the data was collected: clear, cloudy, cloudy+clear (c+c) and mix. As the names imply, the clear and cloudy subsets only include training images collected under clear and cloudy skies, respectively, with no rain or snow, while mix contains all weather conditions, i.e. clear and cloudy as well as more extreme conditions such as rain and snow. It should be noted that we excluded the videos in the JAAD dataset that were collected under very poor visibility conditions such as nighttime and heavy rain.

The training images for each subset are generated by uniformly sampling 50% of the videos recorded under the given condition. Each training subset contains approximately 6.5K pedestrian samples. The remaining videos (which may include all weather conditions) are also uniformly sampled and divided into validation and test sets.
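The following sketch illustrates one way such a video-level split could be implemented. It is not the procedure's actual code: the data structure (a mapping from video id to weather label), the even validation/test split of the remaining videos, and the fixed random seed are all assumptions.

```python
import random

# Illustrative video-level split (assumed data layout: {video_id: weather_label}).
def split_by_weather(videos, train_conditions, seed=0):
    rng = random.Random(seed)
    # Uniformly sample 50% of the videos recorded under the given condition(s).
    candidates = [vid for vid, weather in videos.items() if weather in train_conditions]
    rng.shuffle(candidates)
    train = set(candidates[:len(candidates) // 2])
    # The remaining videos (any weather condition) form the validation and test sets.
    rest = [vid for vid in videos if vid not in train]
    rng.shuffle(rest)
    half = len(rest) // 2
    return sorted(train), rest[:half], rest[half:]   # train, validation, test

# Example: videos = {'video_0001': 'clear', 'video_0002': 'rain', ...}
# train, val, test = split_by_weather(videos, {'clear', 'cloudy'})   # the c+c subset
```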

4.4 Metrics

To report the performance of the algorithms, we use the log-average miss rate over the false positives per image (FPPI) ranges of \([10^{-2},10^0]\) (\(MR_2\)) and \([10^{-4},10^0]\) (\(MR_4\)), as in [29, 30]. We also follow [30] and apply two oracle test cases to measure the contributions of background and localization errors. The localization oracle excludes from the evaluation all false positives that overlap with the ground truth, thus reflecting the contribution of background errors. The background oracle does not count false positives that do not overlap with the ground truth, hence showing the amount of localization error. All of our results are presented using the matching criterion of intersection over union (IoU) \(\ge \) 0.5, unless otherwise stated.
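For reference, the log-average miss rate follows the standard Caltech protocol: the miss rate is sampled at points evenly spaced in log space over the FPPI range and averaged in the log domain. A minimal sketch of this computation is shown below (not the evaluation framework's actual code, which is implemented in Matlab).

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, fppi_range=(1e-2, 1e0), n_points=9):
    """Log-average miss rate over an FPPI range (Caltech-style protocol).

    `fppi` and `miss_rate` describe the detector's miss rate vs. false positives
    per image curve, sorted by increasing FPPI."""
    # Reference points evenly spaced in log space over the FPPI range.
    refs = np.logspace(np.log10(fppi_range[0]), np.log10(fppi_range[1]), n_points)
    sampled = []
    for r in refs:
        below = np.where(np.asarray(fppi) <= r)[0]
        # Miss rate at the largest FPPI not exceeding the reference point;
        # fall back to the first point if the curve starts above it.
        sampled.append(miss_rate[below[-1]] if below.size else miss_rate[0])
    sampled = np.clip(np.asarray(sampled, dtype=float), 1e-10, None)  # avoid log(0)
    return float(np.exp(np.mean(np.log(sampled))))

# MR_2 and MR_4 differ only in the FPPI range over which the curve is averaged:
# mr2 = log_average_miss_rate(fppi, mr, fppi_range=(1e-2, 1e0))
# mr4 = log_average_miss_rate(fppi, mr, fppi_range=(1e-4, 1e0))
```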

5 Data Properties and Detection Accuracy

5.1 Weather

Weather conditions have multiple effects on the visibility of the pedestrians (e.g. due to rain) and their appearance (e.g. presence of sunglasses or umbrellas). In addition, the appearance of the scene itself may be altered by different lighting conditions, precipitation, reflections, sharp shadows, etc., leading to detection errors as illustrated in Fig. 1. In order to quantify these effects, we trained and tested all pedestrian detection algorithms on different subsets of JAAD dataset split by weather conditions as explained in Sect. 4.

Fig. 3. ROC curves for all algorithms trained and tested on the mix, clear, cloudy and c+c (clear and cloudy) subsets with the detection threshold set to 0.5 IoU. The legend of each plot shows the names of the algorithms together with their \(MR_{2} (MR_{4})\) measures, sorted by \(MR_{2}\) in descending order.

Fig. 4. The relative contributions of background and localization errors to the performance of the pedestrian detection algorithms. The errors are calculated as changes in the (a) \(MR_2\) and (b) \(MR_4\) measures for algorithms trained and tested on different subsets of JAAD.

We begin by reporting the ROC curves along with \(MR_{2}\) and \(MR_{4}\) metrics. As can be seen in Fig. 3, despite the changes in the overall performance of the algorithms, the rankings are the same across different subsets. The only exception is in the clear case where SDS-RPN outperforms RPN.

The main difference between SDS-RPN and the regular RPN is that the former adds a weak segmentation component utilizing binary masks derived from the bounding box annotations. It is apparent that this technique is only effective under clear weather conditions, which correspond to the properties of the Caltech dataset on which the algorithm was originally tested (see Table 2). Under other weather conditions, however, the weak segmentation results in poorer performance compared to the regular RPN.

Another observation is that the MS-CNN algorithm (which, according to [1], is not among the top five performing algorithms on Caltech) achieves the best performance by a large margin (up to 2% on the mix, clear and c+c subsets and more than 5% on cloudy) compared to the state-of-the-art SDS-RCNN.

To further understand the underlying factors impacting the performance of each algorithm, we report the background and localization errors under different weather conditions. As depicted in Fig. 4, training and testing on subsets of JAAD with different properties reveals inconsistencies in the performance of each detection algorithm, as well as in its performance relative to the other algorithms. For example, in the case of c+c, MS-CNN reaches its highest background error while at the same time achieving the lowest localization error compared to the others.

The same trend does not hold for the RPN-based models, as they all perform poorly in terms of localization error when trained and tested on c+c. Comparatively, MS-CNN has the lowest background error on the mix, clear and cloudy subsets and the second worst on c+c.

Likewise, on average, RPN performs best in terms of localization error; however, it is the worst in terms of background error. One interesting observation is the added benefit of the weak segmentation component to RPN (in SDS-RPN), which helps reduce the background error, but at the price of lower localization accuracy.

5.2 Pedestrian Attributes

In this section, we evaluate the contribution of select attributes (shown in Table 1) to the performance of detection algorithms trained and tested on the mix subset.

Table 1. The performance of pedestrian detection algorithms in the presence of individual attributes. The results are reported as \(MR_4\) metric. The top performing algorithms for each attribute are highlighted in bold.

Because many attributes often appear together in various combinations, it is very hard to disentangle the effect of individual attributes on the overall detection accuracy of each algorithm. However, major differences can be observed in the relative performance of the algorithms in the presence of certain attributes in the scene.

As one would expect, the performance of the classical models is inferior to that of the CNN-based algorithms, particularly with respect to some of the rarely occurring attributes such as child and umbrella. The performance of the state-of-the-art algorithms also varies across attributes. For example, MS-CNN, which shows the best results on the mix subset of JAAD, underperforms compared to SDS-RCNN on select attributes such as umbrella, backpack, child and pose-back.

Fig. 5. Error analysis for MS-CNN and SDS-RCNN trained and tested on the mix reasonable subset of JAAD. Plot (a) shows the relative percentages of false positives (FP) and false negatives (FN) for each algorithm at 0.1 FPPI. FP is further split into localization and background errors depending on whether the detected bounding box overlaps with the ground truth or not. Plot (b) shows a detailed breakdown of false positive and false negative errors grouped by the corresponding attributes.

To investigate the common causes of error for MS-CNN and SDS-RCNN we group false positive (FP) and false negative (FN) detections at 0.1 FPPI by the object present in the bounding box as shown in Fig. 5.
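This grouping relies on matching detections to ground truth boxes at a fixed IoU threshold and classifying the unmatched ones. The sketch below illustrates such a per-image categorization; the greedy matching, the field names (`bbox`, `score`) and the assumption that detections have already been thresholded at the score corresponding to 0.1 FPPI are ours, not the framework's actual implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def categorize_detections(dets, gts, iou_thr=0.5):
    """Greedily match detections (highest score first) to ground truth boxes and
    split the errors: false positives that still overlap some ground truth box are
    localization errors, those with no overlap are background errors, and unmatched
    ground truth boxes are false negatives."""
    matched = set()
    tp, fp_loc, fp_bg = [], [], []
    for d in sorted(dets, key=lambda d: -d['score']):
        overlaps = [(iou(d['bbox'], g['bbox']), i) for i, g in enumerate(gts)]
        best_iou, best_i = max(overlaps, default=(0.0, -1))
        if best_iou >= iou_thr and best_i not in matched:
            matched.add(best_i)
            tp.append(d)
        elif best_iou > 0:
            fp_loc.append(d)   # overlaps a pedestrian, but poorly localized
        else:
            fp_bg.append(d)    # confused with the background
    fn = [g for i, g in enumerate(gts) if i not in matched]
    return tp, fp_loc, fp_bg, fn
```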

With respect to FP, SDS-RCNN and MS-CNN differ greatly not only in the relative contributions of background and localization errors but also in terms of the objects they commonly confuse with pedestrians. Aside from annotation errors, MS-CNN is much more distracted by elongated objects often found in the street scenes, such as tree trunks, hydrants and parking meters.

Many of the localization errors for both MS-CNN and SDS-RCNN are caused by the inability to distinguish individual pedestrians in groups of two or more, particularly when children are also present (attribute group child in Fig. 5b). SDS-RCNN also has a higher tendency than MS-CNN to place bounding boxes on body parts of pedestrians or on objects they carry (e.g. bags). Finally, for both MS-CNN and SDS-RCNN, partially occluded pedestrians, groups of pedestrians and children stand out as the main sources of false negatives.

Note that despite individual sensitivities to certain attributes, both MS-CNN and SDS-RCNN have trouble detecting children and pedestrians with infrequently occurring attributes such as backpacks, umbrellas and hooded clothing. There is also evidence that the algorithms may learn the appearance of common attributes, such as bags, instead of the pedestrian itself, leading to poor localization.

The former issue may be addressed by increasing the variability of the training data, either explicitly, by ensuring the presence of certain hard attributes, or implicitly, by gathering data under different weather conditions, which in turn affect the appearance of the pedestrians. Explicitly learning the attributes may also help, as demonstrated by [25].

5.3 Generalizability Across Different Datasets

Here, our goal is to identify the link between the generalizability of the dataset and its properties, i.e. we want to measure whether using training data from a diverse dataset can improve the performance of detection algorithms on other datasets with more uniform properties.

For this purpose, we employed the widely used Caltech dataset [5] and JAAD. We evaluated the algorithms trained on Caltech using the test data from the mix subset of JAAD, and the models trained on different subsets of JAAD using the Caltech test set. All tests are done on a reasonable set of pedestrians with a height of 50 pixels or more. The minimum allowable visibility is set to 75% on the Caltech test set to match the partial occlusion setting of the JAAD dataset.

Given that a large portion of the original bounding box annotations in the Caltech dataset are poorly localized, following the advice of [30], we report the results on both the original and the newly cleaned Caltech test annotations. We denote the miss rate results as \(MR^O\) and \(MR^N\) for the old and new annotations, respectively. All detections are evaluated at \(IoU\ge 0.5\). The results for the algorithms trained and tested on the same dataset are summarized in Table 2, and the results of the cross-evaluation between algorithms trained and tested on Caltech and subsets of JAAD are shown in Table 3.

Table 2. The performance of state-of-the-art pedestrian detection algorithms on the Caltech and JAAD mix datasets. The table shows the results for algorithms trained and tested on the same dataset. The performances on the Caltech test set are reported on both old (\(MR^O\)) and new (\(MR^N\)) annotations. The best results are highlighted with blue color.
Table 3. The performance of state-of-the-art pedestrian detection algorithms on the Caltech and different subsets of the JAAD dataset. The results show the performance of the algorithms trained on Caltech and tested on JAAD (\(C\rightarrow mix\)) and trained on different subsets of JAAD and tested on Caltech (\(J\rightarrow C\)). The performances on the Caltech test set are reported on both old (\(MR^O\)) and new (\(MR^N\)) annotations. The best and second best results are highlighted with blue and green color respectively.

The first observation is that the performance of the algorithms varies significantly between the uniform dataset and the diverse one. The SDS-RCNN algorithm, which achieves state-of-the-art performance on Caltech, is only the second best on JAAD, and its counterpart SDS-RPN, which has the second-best performance on Caltech, performs worse than the regular RPN algorithm. MS-CNN, on the other hand, performs best on the mix subset, even though on Caltech it is the third best in our evaluation and not even in the top five in the latest benchmarks [1].

As was mentioned earlier, the Caltech dataset contains images collected during daylight under a clear sky. Surprisingly, we observe that the clear subset of JAAD, which has similar properties, does not generalize best to Caltech. Besides achieving the second-best performance with the SDS-RCNN models, it ranks third in the other cases. In fact, we can see that diversifying the data by training on c+c, and further adding extreme weather samples such as rain and snow, achieves the best results on the Caltech dataset.

This performance improvement is partly due to better localization. For instance, MS-CNN and SDS-RCNN have average IoUs of 0.73 and 0.75, respectively, when trained on JAAD clear, and 0.74 and 0.76 when trained on JAAD mix. The same models trained on Caltech, however, have an average IoU of 0.73.

Fig. 6. Examples of the performance of state-of-the-art pedestrian detection algorithms on samples with different weather conditions and pedestrian attributes. From left to right, the ground truth (GT) and the results of algorithms trained on different subsets of the JAAD dataset are shown. Colors green, red and blue correspond to the , and respectively. The results show that the behavior of both detection algorithms is affected by the changes in the training data, but in different and somewhat unpredictable ways. For instance, in the example in the second row, SDS-RCNN performs better when trained on the mix subset, whereas MS-CNN does so when trained on the clear subset. (Color figure online)

It should be noted that the CNN-based models in the table are trained on Caltech10x [31], which contains over 45K images with more than 16K training samples. The diverse mix subset contains fewer than 7K samples, yet it generalizes better to Caltech than vice versa.

6 Discussion

In this paper, we conducted a series of experiments to investigate the effect of dataset diversity on the performance of pedestrian detection algorithms (see the qualitative examples in Fig. 6). Using the newly proposed JAAD dataset, we showed that the performance measures reported on classical benchmark datasets, such as Caltech, do not necessarily reflect the true potential of detection algorithms in dealing with a wider range of environmental conditions. For instance, MS-CNN, which does not even rank in the top five in recent state-of-the-art benchmarks, outperforms the current top-ranking algorithm, SDS-RCNN, by a significant margin on all subsets of the JAAD dataset.

We showed that the changes in relative performance can be attributed to different properties of the datasets, e.g. which types of weather conditions are represented in the training data. For example, SDS-RPN outperforms the regular RPN on the Caltech dataset owing to its use of a weak segmentation technique; however, it shows inferior results on the JAAD dataset under all weather conditions except clear (the condition most similar to Caltech).

Similar fluctuations in the performance of the detection algorithms can be seen with respect to pedestrian attributes. In particular, rarely occurring attributes such as child, backpack and umbrella are associated with the highest miss rates for all algorithms. On the other hand, some of the most frequently occurring attributes, such as hand bags, are often localized instead of the pedestrians themselves.

The diversity of the training data also leads to better generalization of pedestrian detection algorithms across datasets. Our empirical results suggest that mixing samples with different properties can improve the performance of the algorithms even on a more uniform dataset. For example, the MS-CNN algorithm trained on the mix subset of JAAD had 7% and 3% lower miss rates on Caltech compared to the models trained on the clear and c+c subsets, respectively.

A carefully selected dataset can also reduce the need for a large volume of training data. For example, the models trained on the mix subset of JAAD using only 7K training samples performed better on the Caltech dataset than the models trained on more than 16K Caltech training samples performed on the JAAD mix subset.

In conclusion, our study shows that the selection of benchmark datasets for evaluating pedestrian detection algorithms in practical applications such as autonomous driving should be revisited, both to properly assess the algorithms' performance and limitations under different conditions and to better reflect the kind of generalizability that is desired.

Using larger datasets certainly benefits the training of the algorithms as does balancing the data with respect to underrepresented weather conditions and pedestrian categories. On the other hand, overrepresented attributes in the data can cause detection errors which should be taken into account when designing pedestrian detection algorithms.