1 Introduction

Traffic Sign Recognition (TSR) is an active topic in intelligent traffic systems. It is important for driverless vehicles and driver assistance systems [1]. For example, a driver assistance system can warn drivers ahead of time so that they can act to avoid accidents [2]. The task of traffic sign recognition usually contains two main stages: traffic sign detection (TSD) and traffic sign classification (TSC). Traffic sign detection aims at accurately locating traffic signs in an image or in each frame of a video. Traffic sign classification focuses on labeling a detected traffic sign. Although the two stages share some components, such as the feature representation of traffic signs, they are usually studied independently. In this paper, we focus on the second task, which is itself often called traffic sign recognition.

Traffic sign recognition is challenging because of the complicated and dynamic nature of real scenes. It faces four difficulties: (1) The appearance of traffic signs changes with viewpoint and illumination, weather conditions such as rain or fog, motion blur during driving, occlusion, physical damage, color fading, graffiti, stickers, and so on. (2) For practical applications, traffic sign recognition must run in real time with high recognition accuracy. (3) The training data are unbalanced: the frequencies of occurrence of traffic signs differ greatly. For example, speed limit signs appear more frequently than no-entry signs in the German Traffic Sign Recognition Benchmark (GTSRB) [3] and the Swedish Traffic Signs Dataset (STSD) [4]. (4) The number of traffic sign classes is large. For example, there are 112 important warning sign templates among Chinese traffic signs.

Much of the literature deals with the first two difficulties. Popular machine learning methods have been applied to traffic sign recognition, such as Bayesian classifiers [5], boosting [6], support vector machines (SVM) [7], and random forest classifiers [8]. These methods use hand-crafted features such as the Histogram of Oriented Gradients (HOG) [9] and the scale-invariant feature transform (SIFT) [6–8]. In [10], Zaklouta used a K-d tree classifier combined with HOG as well as distance transforms. Maldonado [11] designed a recognition system based on SVMs, whose results showed high recognition accuracy and a very low false positive rate. Convolutional neural networks (CNNs) have also been used for traffic sign recognition [3, 12, 13] and achieved high recognition accuracy. CNNs have recently become popular in computer vision, achieving state-of-the-art performance in ILSVRC2012 [14–16] and in the 2011 International Joint Conference on Neural Networks (IJCNN) competition [3, 4, 17].

However, few works have discussed how to deal with the imbalance of training data or how to improve the efficiency of multi-class prediction for traffic sign recognition. Most current multi-class prediction schemes are flat; that is, a one-vs.-all or one-vs.-one classification scheme is used for label prediction. The flat classification scheme is time-consuming. Moreover, the imbalance of training data has a negative influence on classification performance. Traffic signs are man-made objects with a few distinctive shapes and can be divided into three shape classes: circle, triangle, and square. We therefore construct a two-layer tree structure for traffic signs. The first layer contains the coarse shape classes, and the second layer contains the fine classes, which identify the individual traffic signs. On this basis, we propose a hierarchical traffic sign recognition method whose main advantage is improved recognition efficiency.

The paper is organized as follows. Section 2 introduces the hierarchical recognition method for traffic sign recognition. Section 3 presents the experimental results. Conclusions are given in Sect. 4.

2 Hierarchical Class Prediction Algorithm

In this section, we detail the implementation of the hierarchical traffic sign recognition method. Figure 1 shows the framework of our method. In the training stage, a two-layer classification tree \( G = (V,E) \) is constructed. In the first layer, traffic signs are divided into three shape groups by an AdaBoost classifier combined with Aggregate Channel Features (ACF) [18]. The non-leaf nodes in the first layer are the shape nodes, and each node in the second layer corresponds to a traffic sign identity. Leaf nodes are identified by a random forest classifier learned on the data of the classes contained in their parent node. In the testing stage, a query traffic sign image traverses the classification tree. In each layer, the query image is assigned to the node with the maximum confidence value. Finally, the label of the leaf node with the maximum confidence value is taken as the label of the traffic sign.

Fig. 1. The framework of the hierarchical recognition for traffic signs

2.1 Building the Non-leaf Node for Shape Classification

In this subsection, we introduce the construction of the non-leaf nodes based on shape classification. Aggregated channel features (ACF) [18] have proved useful for pedestrian detection, offering high speed and detection accuracy [19]. Motivated by this success, we use ACF for the feature representation of traffic signs. The basic unit of ACF is the channel. We use ten channels: the three color channels of the RGB image, the gradient magnitude, and six oriented gradient maps. Figure 2 shows the ACF used in our method. In our implementation, six oriented gradient filters are used: horizontal, vertical, 30°, 60°, 120°, and 150°. A traffic sign image is first normalized to a 10 × 10 image, and then its ACF are computed. We use all the obtained maps to train an AdaBoost classifier.
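
The ten channels described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `acf_channels`, the central-difference gradient, and the hard assignment of each pixel to one of six 30°-wide orientation bins are choices made for this sketch and are not confirmed by the paper or by [18].

```python
import math

def acf_channels(rgb):
    """Return 10 channels: R, G, B, gradient magnitude, 6 orientation maps.

    `rgb` is a list of rows, each row a list of (r, g, b) tuples in [0, 255].
    """
    h, w = len(rgb), len(rgb[0])
    gray = [[sum(px) / 3.0 for px in row] for row in rgb]
    color = [[[px[c] for px in row] for row in rgb] for c in range(3)]
    mag = [[0.0] * w for _ in range(h)]
    orient = [[[0.0] * w for _ in range(h)] for _ in range(6)]
    for y in range(h):
        for x in range(w):
            # Central differences, clamped at the image border (assumption).
            gx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
            m = math.hypot(gx, gy)
            mag[y][x] = m
            # Unsigned orientation in [0, 180), binned around 0, 30, ..., 150 degrees.
            theta = math.degrees(math.atan2(gy, gx)) % 180.0
            b = int((theta + 15.0) // 30.0) % 6
            orient[b][y][x] = m
    return color + [mag] + orient
```

For a 10 × 10 input this yields ten 10 × 10 maps, i.e. 1000 values that could be flattened into the feature vector given to the boosted classifier.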

Fig. 2. Aggregated channel features for shape classification

For the shape classification in the first layer, we adopt the AdaBoost framework of Viola and Jones (VJ) [20]. An AdaBoost classifier combines many weak classifiers, also called weak learners, into a strong classifier. It has been proven to converge to the optimal solution given a sufficient number of weak classifiers. AdaBoost assigns weights to the weak classifiers based on their quality, and the resulting strong classifier is a linear combination of the weak classifiers with these weights.

We use depth-2 decision trees as weak classifiers for boosting [21], where each tree node is a simple decision stump defined by a rectangular region, a channel, and a threshold [23]. Following the VJ framework, the final classifier is a weighted linear combination of the boosted depth-2 decision trees. Because each weak classifier is a depth-2 decision tree, applying it requires only two comparison operations, so the shape classification is fast.
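
A minimal sketch of how such a boosted ensemble could be evaluated is given below. The class names `Stump` and `Depth2Tree`, the function `strong_score`, and the simplification that a stump reads a single entry of a flattened feature vector (rather than a rectangular region of a channel) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

@dataclass
class Stump:
    feature: int      # index into the flattened channel-feature vector
    threshold: float

@dataclass
class Depth2Tree:
    root: Stump
    left: Stump                        # used when the root test is false
    right: Stump                       # used when the root test is true
    leaves: Tuple[int, int, int, int]  # outputs in {-1, +1}
    alpha: float                       # AdaBoost weight of this weak classifier

    def predict(self, x: Sequence[float]) -> int:
        # Exactly two comparisons per tree: the root, then one child.
        if x[self.root.feature] > self.root.threshold:
            child, base = self.right, 2
        else:
            child, base = self.left, 0
        return self.leaves[base + (x[child.feature] > child.threshold)]

def strong_score(trees: List[Depth2Tree], x: Sequence[float]) -> float:
    """Weighted linear combination of the weak classifiers' votes."""
    return sum(t.alpha * t.predict(x) for t in trees)
```

With one such ensemble per shape, the shape node with the largest `strong_score` would be the one retained in the first layer.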

2.2 Building the Leaf Node for Traffic Sign Identification

In this subsection, we detail how to build the leaf nodes based on a random forest classifier. Because each shape node contains several traffic sign classes, we build one random forest classifier for the traffic sign classes contained in each shape node. Each training sample is normalized to a 40 × 40 image. If a shape node contains N classes of traffic signs, the samples from those N classes are used to train its random forest classifier, and each leaf node corresponds to one traffic sign class. The classifier is trained on three types of features: Histogram of Oriented Gradients (HOG), Local Binary Pattern (LBP), and an HSV color histogram.

HOG: An image is converted from the RGB color space to gray scale and then divided into 7 × 7 blocks, each containing 4 cells. In each cell, a gradient orientation histogram with 9 bins is computed. Thus, the HOG descriptor has 7 × 7 × 4 × 9 = 1764 dimensions.

LBP: As with HOG, an image is first converted to gray scale. LBP has low computational complexity and is invariant to rotation and to gray-scale changes. In this paper, a 256-dimensional LBP descriptor is used.
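
A 256-dimensional descriptor is consistent with a histogram of basic 8-neighbour LBP codes, which is what the following sketch assumes; the paper does not state which LBP variant is used, and the function name `lbp_histogram` is hypothetical.

```python
def lbp_histogram(gray):
    """256-bin histogram of basic 8-neighbour LBP codes.

    `gray` is a list of rows of intensities; border pixels are skipped.
    """
    # Neighbour offsets (dy, dx), clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            center = gray[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                # A neighbour at least as bright as the center sets its bit.
                if gray[y + dy][x + dx] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist
```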

HSV: Because the RGB color space is very sensitive to illumination, the HSV color space is used in this paper. An image is first converted to the HSV color space. For each pixel, the hue and saturation values are scaled to the range [0, 255]. In the H and S channels, the components of two similar colors are numerically much closer, so HSV is less sensitive to illumination. A histogram is computed for each of the two channels, and the two histograms are concatenated into a 512-dimensional vector, which is used as the color feature.
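
The color feature can be sketched with the standard-library `colorsys` module. The function name `hs_histogram`, the flat list-of-pixels input, and the rounding used to map values to bins are assumptions of this sketch.

```python
import colorsys

def hs_histogram(rgb_pixels, bins=256):
    """Concatenate hue and saturation histograms (2 * bins values)."""
    hue_hist = [0] * bins
    sat_hist = [0] * bins
    for r, g, b in rgb_pixels:
        h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # h and s lie in [0, 1]; scale them to integer bins 0..bins-1.
        hue_hist[min(int(h * (bins - 1) + 0.5), bins - 1)] += 1
        sat_hist[min(int(s * (bins - 1) + 0.5), bins - 1)] += 1
    return hue_hist + sat_hist
```

With the default of 256 bins per channel, the returned vector has the 512 dimensions stated above; the value (V) channel is discarded.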

The three types of features are concatenated to form a 2532-dimensional vector (1764 + 256 + 512). In our experiments, 500 trees are used to form each random forest classifier. The predicted label is determined by the ensemble of all the trees.
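
The stated dimensions can be checked with a few lines of arithmetic. The cell size of 5 pixels and the block stride of one cell are assumptions: they are one layout of a 40 × 40 image that reproduces the 7 × 7 blocks given in the text, not a setting confirmed by the paper.

```python
# Assumed HOG layout: 5 x 5 pixel cells, 2 x 2 cell blocks, stride of one cell.
cells_per_side = 40 // 5                  # 8 cells per side
blocks_per_side = cells_per_side - 2 + 1  # 7 blocks per side

# Feature dimensions as stated in Sect. 2.2.
hog_dim = blocks_per_side * blocks_per_side * 4 * 9  # blocks x cells x bins
lbp_dim = 256                                        # one bin per 8-bit LBP code
hsv_dim = 2 * 256                                    # hue + saturation histograms
total_dim = hog_dim + lbp_dim + hsv_dim
print(hog_dim, lbp_dim, hsv_dim, total_dim)  # 1764 256 512 2532
```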

2.3 Class Prediction Scheme

A query image traverses the classification tree as follows. In the first layer, it is scored by the classifiers of the three shape nodes, and the shape node with the maximum score is retained. The query image is then scored by the random forest classifier of the retained shape node. The traffic sign label with the maximum score is assigned to the query image.
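
The prediction scheme can be sketched as a two-step routing function. All names here (`predict_sign`, `shape_scorers`, `sign_scorers`) are hypothetical, and the scorers are passed in as plain callables so that the sketch does not depend on any particular classifier library.

```python
from typing import Callable, Dict, Sequence, Tuple

ShapeScorer = Callable[[Sequence[float]], float]
SignScorer = Callable[[Sequence[float]], Dict[str, float]]

def predict_sign(
    shape_features: Sequence[float],
    sign_features: Sequence[float],
    shape_scorers: Dict[str, ShapeScorer],
    sign_scorers: Dict[str, SignScorer],
) -> Tuple[str, str]:
    """Route a query through the two-layer classification tree."""
    # Layer 1: keep the shape node with the maximum confidence.
    shape = max(shape_scorers, key=lambda s: shape_scorers[s](shape_features))
    # Layer 2: only the classifier attached to that shape node is evaluated.
    scores = sign_scorers[shape](sign_features)
    label = max(scores, key=scores.get)
    return shape, label
```

The two feature arguments are kept separate because the shape layer and the identification layer use different representations (ACF on a 10 × 10 image versus HOG, LBP, and HSV on a 40 × 40 image).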

3 Experimental Results

We evaluate the proposed hierarchical recognition method on three traffic sign datasets: GTSRB, STSD, and the 2015 Traffic Sign Recognition Competition Dataset (Mutil-72TSD).

GTSRB: This well-known dataset was used in the 2011 IJCNN traffic sign recognition competition [4]. It contains 43 classes, with 39209 images in the training set and 12630 images in the testing set. The sizes of the traffic sign images vary from 15 × 15 to 250 × 250 pixels. The ground-truth data are reliable owing to semi-automatic annotation. GTSRB has two test sets: final_test and online_test. We report results on both.

STSD: It was built in 2011 by the Department of Electronic Engineering at Linköping University. It is mainly used for traffic sign detection, and its scene images may contain one or more traffic signs. To test our method, we crop the traffic signs from the scene images. To further test the performance of the algorithm, we create a sub-dataset of STSD, Swedish30, whose training set contains all of the samples with four statuses. The first 18 classes are those that occur most frequently in STSD; the other 12 are those that appear in STSD at least 5 times. Swedish30 includes 3129 traffic signs.

Mutil-72TSD: It is a multi-class traffic sign dataset used in the 2015 China Traffic Sign Recognition Competition. Figure 3 shows some examples from Mutil-72TSD. The training set consists of 66 video sequences containing 72 traffic sign classes. The classes are split into 7 main categories: (1) warning signs, (2) prohibitory or restrictive signs, (3) mandatory signs, (4) tourism district signs, (5) road construction safety signs, (6) direction, position, or indication signs, and (7) assist signs. According to image quality, the samples are divided into visible, blurred, occluded, shaded, and sloping. The training set contains 10611 images and the test set contains 8520 images.

Fig. 3. Some examples in Mutil-72TSD.

Additionally, in Mutil-72TSD, classes with low occurrence frequency have very few samples, which results in an imbalance of the training data. We therefore augment the training data so that learning is robust to potential deformations in the test set. We build a synthesized dataset by adding 5 transformed versions of the original training set, which increases the number of training samples. Samples are randomly perturbed in position ([−0.2, 0.2] pixels), in scale ([0.1, 5] ratio), and in rotation ([−10, +10] degrees).
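
The sampling of perturbation parameters might look like the sketch below. The ranges are copied as printed in the text; the uniform distribution, the independent shifts in x and y, and the function name `sample_perturbations` are assumptions, and applying the transform to the image is left to an image library of choice.

```python
import random

def sample_perturbations(copies=5, seed=None):
    """Draw (dx, dy, scale, angle) tuples in the ranges quoted above."""
    rng = random.Random(seed)
    params = []
    for _ in range(copies):
        dx = rng.uniform(-0.2, 0.2)       # position shift, units as in the text
        dy = rng.uniform(-0.2, 0.2)
        scale = rng.uniform(0.1, 5.0)     # scale ratio
        angle = rng.uniform(-10.0, 10.0)  # rotation in degrees
        params.append((dx, dy, scale, angle))
    return params
```

Calling the function once per training sample gives five parameter sets, one for each transformed copy added to the synthesized dataset.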

3.1 The Imbalance of Traffic Sign Classes

We analyze the distribution of the training data for the three traffic sign datasets. The histograms of the class frequencies are given in Fig. 4. They show that class imbalance exists in all three datasets: the histograms have long tails. In GTSRB, the largest class contains more than 2000 samples, while the smallest contains only dozens. In Swedish30, the largest class contains about 600 samples, while the smallest contains only a few. In Mutil-72TSD, the largest class contains about 1000 samples, while the smallest contains only dozens. The imbalance of training data has a negative influence on classification performance, which motivates a method such as ours for multi-class classification.
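
Given a list of training labels, the imbalance described above can be quantified with a short helper. The function name `imbalance_summary` is hypothetical, and the ratio of the largest to the smallest class is one simple measure among several.

```python
from collections import Counter

def imbalance_summary(labels):
    """Return (largest class size, smallest class size, their ratio)."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return largest, smallest, largest / smallest
```

For instance, a largest class of more than 2000 samples against a smallest class of a few dozen, as reported for GTSRB, would give a ratio in the tens.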

Fig. 4. The histograms of the class frequencies in the traffic sign datasets. The x-axis is the class label; the y-axis is the number of samples in the training set. (a) The histogram for GTSRB. (b) The histogram for Swedish30. (c) The histogram for Mutil-72TSD.

3.2 The Analysis of Computational Complexity

The main computational cost of our method comes from building the decision trees. A random forest is an ensemble, so the cost of building it is close to the sum of the costs of building its individual trees. If every tree has the same cost, the total is the cost of one tree times the number of trees. With n training instances and m attributes, the computational cost of building one tree is O(mn log n). Growing M trees therefore costs O(M · mn log n). This is not exact, because each tree is grown using a subset of the features and extra time may be needed for the randomization, but it is a close estimate. The parameters are n, m, and M: the number of instances in the training data, the number of attributes, and the number of trees. The number of trees is set by the user when the model is trained [24].
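
The estimate above can be turned into a rough cost function for comparing configurations. The function name `forest_training_cost` and the choice of base-2 logarithm are assumptions; the result is a relative operation count, not a running time.

```python
import math

def forest_training_cost(n, m, num_trees):
    """Rough operation count M * m * n * log2(n) for growing a forest."""
    return num_trees * m * n * math.log2(n)
```

Because n log n grows faster than linearly, training separate forests on disjoint parts of the data costs less under this estimate than training one forest on all of it: two forests on n/2 instances each cost M · m · n · (log n − 1) in total.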

We compare the flat classification scheme with the proposed hierarchical scheme in terms of computational complexity. Table 1 gives the number of classes contained in each shape node. Take GTSRB as an example. This dataset contains 43 classes, of which 26 are circles, 16 are triangles, and 1 is a rectangle. In the flat classification scheme, 43 classifiers are used and the label with the largest score is assigned to the query image. In our method, 3 shape classifiers and 16 classifiers identifying traffic signs are used, so the total number of classifiers is 19, which saves time. Moreover, in the training stage of the flat scheme, all the data are loaded to train every classifier, whereas in our method the classifiers in the leaf nodes load only the data of their own shape node. We also show the distribution of the Mutil-72TSD training data over the non-leaf nodes in Fig. 5. The horizontal axis denotes the shape nodes, and the vertical axis denotes the number of samples in each shape node. It demonstrates that our method can overcome the imbalance of classes.
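
The classifier counts for GTSRB can be reproduced with a short calculation. The count of 19 equals 3 + 16, which corresponds to a query routed to the triangle node; this is one reading of the text, and under it the count depends on which shape node is selected. The function names are hypothetical.

```python
def evaluations_flat(num_classes):
    """One-vs.-all: one classifier evaluation per class."""
    return num_classes

def evaluations_hierarchical(classes_per_shape, routed_shape):
    """Shape classifiers plus the per-class classifiers of the chosen node."""
    return len(classes_per_shape) + classes_per_shape[routed_shape]

# Class counts per shape node for GTSRB, as given in the text.
gtsrb = {"circle": 26, "triangle": 16, "rectangle": 1}
```

Under this reading, `evaluations_flat(43)` is 43, while `evaluations_hierarchical(gtsrb, "triangle")` is 19 and the circle node, the largest, gives 29, which is still below the flat count.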

Table 1. The class number in the classification tree for the three datasets

Fig. 5. The distribution of classes in the shape nodes for Mutil-72TSD.

3.3 The Performance of the Proposed Method

In this subsection, we evaluate the proposed method in terms of recognition accuracy. We first apply our method to GTSRB, the benchmark of the 2011 IJCNN traffic sign competition. Table 2 compares traffic sign recognition methods in terms of recognition accuracy. For a fair comparison, we only compare with methods based on tree classifiers and on SVM classifiers with HOG. The results show that our method is superior to the other tree-based methods. We also compare our method with the CNN method [22] trained on GTSRB. The results indicate that our method outperforms the CNN method with the same number of training samples.

Table 2. Comparison of traffic sign recognition in GTSRB

On Swedish30, we compare our method with the random forest classifier used in [25], which discusses the performance of HOG and LBP in different color channels. Table 3 shows the comparison, in which our result is the average of five test runs. HH means that HOG is computed in each of the three channels of the HSV color space and the three HOG vectors are concatenated into the feature vector. HL means that LBP is computed in the three channels of the HSV color space, and H+L means that the color histogram in the HSV color space and the LBP histogram are concatenated to form the feature vector. Table 3 shows that our method achieves comparable results, while the dimension of the features used in our method is lower than that of the method in [25].

Table 3. Comparison of traffic sign recognition in Swedish30.

We also apply the proposed method to Mutil-72TSD. Furthermore, we compare the flat classification scheme with the hierarchical classification scheme. Table 4 gives the comparison results on the three datasets.

Table 4. Comparison between the flat classification and the hierarchical classification

4 Conclusion

In this paper, we focus on traffic sign recognition with a hierarchical classification scheme. A classification tree is first constructed, in which the non-leaf nodes are built for shape classification and the leaf nodes for traffic sign identification. For shape classification, aggregated channel features are used as the feature representation of a traffic sign image, and an AdaBoost classifier based on weak decision trees is used as the classifier. In each shape node, a random forest classifier is trained on HOG, LBP, and HSV color features. The proposed method can overcome the inefficiency of the flat classification scheme and the imbalance of the training data. The proposed method is evaluated on three traffic sign recognition datasets, and the experimental results demonstrate its efficiency and effectiveness.