
1 Introduction

The Aggregate Channel Features (ACF) object detector [1], from Piotr’s Computer Vision Matlab Toolbox (PMT) [2], has been used for detecting a wide range of objects. It was originally introduced as a pedestrian detector in [1], but has since been applied in several other areas related to driver assistance systems (DAS). These areas are not limited to looking out of the vehicle [3], where other vehicles [4], signs [5], and traffic lights (TLs) [6] have been popular targets; looking-in applications, such as hand detection [7], have also made use of the ACF detector. Common to all of these areas is that the ACF object detector has been adjusted heuristically. Fine-tuning towards the optimal parameters is a common problem amongst researchers, as it can be difficult without prior experience with the given detector or prior knowledge of the test data. All of the above DAS areas remain important and challenging, as people unfortunately keep getting injured in traffic. In 2012, 683 people died and 133,000 people were injured in crashes related to red light running in the USA [8]. Traffic light detection is thus an obvious part of a DAS in the transition towards fully autonomous cars.

A major issue in research is that evaluations are done on small, private datasets captured by the authors themselves. For better and easier comparison in DAS-related areas, benchmarks such as the VIVA-challenge [9] and the KITTI Vision Benchmark Suite [10] can be highly beneficial for determining promising future research directions.

In this paper, we present a comprehensive analysis of three central parameters of the ACF object detector, applied to the night data from the freely available LISA Traffic Light Dataset used in the VIVA-challenge [11]. The contributions of this paper are thus threefold:

  1. Exhaustive parameter sweep of ACF.

  2. Analysis of correlations between detector parameters.

  3. Optimized TL detection results on the night data from the LISA Traffic Light Dataset.

The paper is organized as follows: Relevant previous work is summarized in Sect. 2. In Sect. 3 we present the detector and the three parameters that are investigated. The extensive evaluation of the parameter sweep is presented in Sect. 4. Finally, in Sect. 5 we give our concluding remarks.

2 Related Work

The related work can be split into two parts: model-based and learning-based approaches. For a more comprehensive overview of the related work, we refer to [11].

2.1 Model-Based

Model-based object detection is a very popular approach for detecting TLs. Most model-based detectors are defined by heuristic parameters, in most cases relying on color or shape information for detecting TL candidates. The color information is used by heuristically defining thresholds for the color of interest in a given color space [12, 13]. The shape information is usually found by applying a circular Hough transform to an edge map [14], or by finding circles using radial symmetry [15, 16]. In [17, 18] shape information is fused with structural information, and additionally with color information in [19, 20]. The output of the above approaches is usually a binary image with TL candidates. BLOB analysis is introduced to reduce the number of TL candidates by doing connected component analysis and examining each BLOB by its size, ratio, circular shape, and so on [21].

2.2 Learning-Based

One of the first learning-based detectors was introduced in [22, 23], where a cascade classifier using Haar-like features was tested, but it was unable to perform better than the authors' Gaussian color classifier. The popular combination of Histogram of Oriented Gradients (HoG) features and an SVM classifier was introduced in [24], which additionally relied on prior maps with very precise knowledge of the TL locations. The learning-based ACF detector has previously been used for TLs, where features are extracted as summed blocks of pixels in 10 different channels created from the original input RGB frame. In [25] and [6] the extracted features are classified using depth-2 and depth-4 decision trees, respectively. In [6] the octave parameter, which defines the number of octaves to compute above the original scale, is changed from 0 to 1.

3 Method

The method section is twofold. First, the learning-based ACF detector is presented. Second, the method for conducting the comprehensive parameter optimization of the TL detector is presented.

3.1 Learning-Based Detector

The features for the ACF object detector are extracted from 10 feature channels: 1 normalized gradient magnitude channel, 6 histogram of oriented gradients channels, and the 3 LUV color channels. The features are hence created by single pixel lookups in the feature maps. The channels are sub-sampled, which corresponds to a halving of the dimensions [4].
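
As a rough illustration of the aggregation step, the sketch below sums non-overlapping 2 x 2 blocks of a channel, which halves each dimension, and then builds a feature vector from single lookups in the aggregated maps. The function names and the toy data are hypothetical, the block size of 2 is inferred from the halving mentioned above, and the channels are assumed to be already computed; this is not the toolbox implementation.

```python
def aggregate_channel(channel, block=2):
    """Sum non-overlapping block x block cells of a 2-D channel (list of rows)."""
    rows = len(channel) // block
    cols = len(channel[0]) // block
    aggregated = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            total = 0.0
            for dr in range(block):
                for dc in range(block):
                    total += channel[r * block + dr][c * block + dc]
            out_row.append(total)
        aggregated.append(out_row)
    return aggregated


def window_features(aggregated_channels, top, left, height, width):
    """Concatenate single-cell lookups from every channel inside one window."""
    features = []
    for channel in aggregated_channels:
        for r in range(top, top + height):
            features.extend(channel[r][left:left + width])
    return features


# Toy 4 x 4 channel standing in for one of the 10 channels (hypothetical data).
toy_channel = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
aggregated = aggregate_channel(toy_channel)

# Reuse the same toy map 10 times to mimic the 10-channel layout.
features = window_features([aggregated] * 10, top=0, left=0, height=2, width=2)
```

With 10 channels and a 2 x 2 window in the aggregated maps, the feature vector has 40 entries, each of which is one lookup.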

The training is done using 3,728 positive TL samples (Fig. 1), resized to a resolution of \(25\times 25\), and 5,772 frames without any TLs, together with hard negatives generated from 1 round of bootstrapping on the 5 night training clips from the LISA TL dataset [11]. Examples of these hard negatives are seen in Fig. 2. The number of extracted negative samples varies depending on the configuration, but is limited to a maximum of 175,000 samples.

AdaBoost is used to train 3 stages of soft cascades; the three stages consist of 10, 100, and 4000 weak learners. However, the comprehensive parameter optimization showed that training often converges earlier. The resulting AdaBoost classifier uses decision trees as weak learners.

For detecting TLs at greater distances, the range of scales can be adjusted with the octave up parameter, which defines the number of octaves to compute above the original scale, e.g. by changing it from 0 to 1. The number of samples extracted during training depends heavily on the model size, tree-depth, and octave up parameters.
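
To make the effect of the octave up parameter concrete, the following sketch lists the scale factors of an image pyramid and the smallest object a fixed-size model can cover. It assumes 8 scales per octave and a fixed number of downsampling octaves; both values, and all function names, are assumptions for illustration rather than settings reported in this paper.

```python
def pyramid_scales(n_oct_up, n_octaves_down, n_per_oct=8):
    """Scale factors from the most upsampled pyramid level downward.

    n_oct_up: octaves computed above the original scale (upsampling).
    n_octaves_down: octaves below the original scale (assumed fixed here;
        in practice it depends on image and model size).
    n_per_oct: scales per octave (assumed value).
    """
    n_scales = n_per_oct * (n_oct_up + n_octaves_down) + 1
    return [2.0 ** (n_oct_up - i / n_per_oct) for i in range(n_scales)]


def smallest_detectable_size(model_size, n_oct_up):
    """Object size in original pixels covered by the model at the top scale."""
    return model_size / (2 ** n_oct_up)


for n_oct_up in (0, 1, 2):
    scales = pyramid_scales(n_oct_up, n_octaves_down=1)
    print(n_oct_up, len(scales), scales[0],
          smallest_detectable_size(25, n_oct_up))
```

Under these assumptions, each additional octave up doubles the largest scale factor, so a 25-pixel model covers objects of 25, 12.5, and 6.25 original pixels for 0, 1, and 2 octaves up, at the cost of more pyramid levels to process.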

Finally the detection is done by using a sliding window which is moved across each of the 10 aggregated feature channels created from the test frame.

Fig. 1. Positive samples cropped from training data.

Fig. 2. Hard negatives generated from bootstrapping.

3.2 Parameter Optimization

In this paper, a comprehensive parameter optimization is made by adjusting the dimensions of the sliding window, hereafter defined as mDs, the decision tree’s depth, hereafter defined as treeDepth, and the number of octaves to compute above the original scale, hereafter defined as nOctUp. To speed up the parameter optimization, a MATLAB script was developed which uses an FTP connection to communicate with a master web host, such that n computers can work on the parameter optimization simultaneously.

The parameter optimization is done by adjusting one parameter at a time, e.g. creating a TL detector with nOctUp = 0 and treeDepth = 2, and then varying the mDs size from [12, 12] to [25, 25]. A total of \(14^2 = 196\) detectors are made with the above nOctUp and treeDepth settings. By adjusting nOctUp and treeDepth and repeating the sliding window variation, we obtain a very comprehensive overview of the optimal mDs size and of how the performance correlates with nOctUp and treeDepth.
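
The size of one sweep can be checked with a short enumeration. The sketch below is only an illustration in Python (the sweep itself was run in MATLAB); the function name and the dictionary layout are hypothetical, and treating the two mDs entries as height and width is an assumption.

```python
from itertools import product


def sweep_configurations(sizes, n_oct_up, tree_depth):
    """Enumerate every mDs combination for fixed nOctUp and treeDepth."""
    return [
        {"mDs": [height, width], "nOctUp": n_oct_up, "treeDepth": tree_depth}
        for height, width in product(sizes, repeat=2)
    ]


# mDs from [12, 12] to [25, 25]: 14 values per dimension.
configs = sweep_configurations(range(12, 26), n_oct_up=0, tree_depth=2)
print(len(configs))
```

Each entry corresponds to one detector that has to be trained and evaluated, which is why the sweep benefits from being distributed over several computers.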

4 Evaluation

In this paper the parameter optimization is done according to the parameter variations seen in Table 1. The parameter optimization is performed on nighttime sequence 1 from the LISA TL dataset, which was collected in an urban environment in San Diego, USA and contains 4,993 frames and 18,984 annotations. The data is generated from a 5 min and 12 s long video sequence containing 25 physical TLs split between 5 different types: go, go left, warning, stop, and stop left [11].

The mDs range is decreased in the last two iterations in Table 1, as the training time increases significantly when nOctUp and treeDepth are increased. As the training has been done on multiple different computers, the average training time given in Table 1 is calculated from the computer that was involved in all 6 iterations, for the most comparable results. This computer is a Lenovo Thinkpad T550 with an Intel i7-5600U CPU @ 2.6 GHz, 8 GB of memory, and an SSD page file. The parameter sweep was done using MATLAB R2015b on Windows 7 Enterprise, both 64-bit.

Table 1. ACF detector parameter variation

Fig. 3. PR-curves of best ACF detector from each heatmap.

Each detection result is quantified in accordance with the VIVA-challenge [9], where the Area-Under-Curve (AUC) of a Precision-Recall curve (PR-curve) generated from the ACF results is used as the final evaluation metric [11]. Furthermore, the true positive criterion in the VIVA-challenge defines a true positive as a detection that overlaps an annotation by more than 50%, as defined in Eq. (1).

$$\begin{aligned} a_0 = \frac{\text {area}(B_d \cap B_{gt})}{\text {area}(B_d \cup B_{gt})} \end{aligned} \tag{1}$$

where \(a_0\) denotes the overlap ratio between the detected bounding box \(B_d\) and the ground truth bounding box \(B_{gt}\). \(a_0\) must be equal to or greater than 0.5 to meet the true positive criterion [26].
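
A minimal sketch of Eq. (1) for axis-aligned boxes is given below. The (x, y, width, height) box layout and the function names are assumptions made for illustration; the threshold of 0.5 follows the criterion stated above.

```python
def overlap_ratio(box_d, box_gt):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    xd, yd, wd, hd = box_d
    xg, yg, wg, hg = box_gt
    # Width and height of the intersection rectangle (zero if disjoint).
    inter_w = max(0.0, min(xd + wd, xg + wg) - max(xd, xg))
    inter_h = max(0.0, min(yd + hd, yg + hg) - max(yd, yg))
    intersection = inter_w * inter_h
    union = wd * hd + wg * hg - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(box_d, box_gt, threshold=0.5):
    """Apply the overlap criterion a_0 >= threshold."""
    return overlap_ratio(box_d, box_gt) >= threshold


# A detection shifted by half its width against a 10 x 10 annotation.
shifted = overlap_ratio((5, 0, 10, 10), (0, 0, 10, 10))
print(shifted, is_true_positive((5, 0, 10, 10), (0, 0, 10, 10)))
```

In this example the intersection is 50 and the union is 150, so the overlap ratio is one third and the detection would not count as a true positive.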

Fig. 4. Heatmaps of ACF detector with varying octaves and tree-depths. (Color figure online)

In Fig. 4, the 6 different parameter sweeps defined in Table 1 are shown. All of the heatmaps are plotted with the same color range, spanning from dark blue to dark red, indicating a detection rate of 0% and 100%, respectively. For each heatmap in Fig. 4, the model dimension with the highest detection rate is marked in bold. By examining the figures in pairs, e.g. 4a+4b and 4a+4c, one can determine the effect of changing tree-depth or octave, respectively. Increasing only the octave from 0 to 1 increases the best performance from 33.42% to 49.29%. Furthermore, the average AUC of the entire heatmap is also increased significantly as a result of the octave increment, which is best illustrated by the larger bright green areas in Fig. 4c compared to Fig. 4a. Increasing the tree-depth from 2 to 4 increases the AUC of the best performing mDs by 6.79%, and the overall average AUC is also increased, as seen by comparing the color schemes of Fig. 4a and b. In Fig. 4d both the octave and the tree-depth are increased, to 1 and 4 respectively, resulting in an AUC of 56.85% with an mDs of [18, 16]. There is no clear tendency towards a grouping of mDs with a good detection rate in Fig. 4a. In Fig. 4a–d, a grouping with a lower detection rate is present in the upper right corner and the lower left corner, which suggests that the optimal mDs is found between a size of 15 and 22. Finally, the octave is increased to 2 in Fig. 4e and f, where only the detectors with mDs from 15 to 22 have been evaluated, due to time restrictions and the previously mentioned grouping of low detection rates. Increasing the octave to 2 increases the AUC to 61.28% with a tree-depth of 2, and finally to 66.63% with a tree-depth of 4, which is the highest AUC achieved in this parameter sweep.

In Fig. 3, the Precision-Recall curves of the best performing mDs from each heatmap are shown. The precision is decent for all of the detectors when the recall is under 0.35, meaning that we have high confidence in our detections up to this point. The detectors with octave 0 detect less than 60% of the true positives. Increasing the octave greatly improves the recall, and thus the number of true positive detections, reaching over 90% with octave 2 and tree-depth 4. With an increased octave, all detectors reach a recall above 79%, resulting in a higher AUC.
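
The relation between recall and AUC can be illustrated with a small sketch that builds a PR-curve from scored detections and integrates it with the trapezoidal rule. The exact integration used by the VIVA-challenge is not specified here, so the trapezoidal rule, the starting point of the curve, the function names, and the toy detections are all assumptions.

```python
def precision_recall_curve(detections, n_ground_truth):
    """Build PR points from (score, is_true_positive) pairs, best score first."""
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    false_pos = 0
    precisions, recalls = [], []
    for _, is_tp in ordered:
        if is_tp:
            true_pos += 1
        else:
            false_pos += 1
        precisions.append(true_pos / (true_pos + false_pos))
        recalls.append(true_pos / n_ground_truth)
    return precisions, recalls


def area_under_curve(precisions, recalls):
    """Trapezoidal area under the PR-curve, starting at recall 0."""
    if not precisions:
        return 0.0
    area = 0.0
    prev_recall, prev_precision = 0.0, precisions[0]
    for precision, recall in zip(precisions, recalls):
        area += (recall - prev_recall) * (precision + prev_precision) / 2.0
        prev_recall, prev_precision = recall, precision
    return area


# Hypothetical detections: 3 of 4 annotated TLs are found, with 1 false positive.
toy_detections = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
precisions, recalls = precision_recall_curve(toy_detections, n_ground_truth=4)
auc = area_under_curve(precisions, recalls)
```

Because the curve stops at the highest recall reached, a detector that misses many TLs is limited in AUC regardless of its precision, which is consistent with the lower AUC of the octave 0 detectors.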

5 Conclusion

Increasing only the octave provides better capabilities for detecting a larger size range of TLs, resulting in the most significant AUC increments. Incrementing the tree-depth improves the results when keeping the octave unchanged; however, the AUC increase is not as high as when increasing the octave while keeping the tree-depth the same. The AUC is nearly doubled by increasing both tree-depth and octave from Fig. 4a to d, leading to the conclusion that these parameters are correlated, as the color scheme strongly shows the overall AUC increase. Finally, the AUC is improved further by increasing the octave and tree-depth again, as seen in Fig. 4e and f, respectively. As in the first 4 heatmaps, the best AUC increases when both octave and tree-depth are increased simultaneously, which supports the conclusion that the parameters are highly correlated. Examining Fig. 4f, it is clear that the best AUC is increased further and is found at an mDs of [20, 20] with 2 octaves and a tree-depth of 4.

Further experiments include finding the convergence points by continuing to increase the parameters. Additionally, a similar parameter sweep on the daytime data from the LISA TL dataset would be interesting.