1 Introduction

Traffic sign recognition is an important sub-task in advanced driver assistance systems and autonomous driving systems. In general, a traffic sign recognition system has two stages: finding the locations of the traffic signs in real traffic scenes (traffic sign detection) and classifying the detected traffic signs into their specific sub-classes (traffic sign classification). Traffic sign detection faces many difficulties in real traffic scenes due to illumination changes, partial occlusion, cluttered backgrounds, and small sign sizes, as shown in Fig. 1. The second stage also faces many problems, such as missing samples, unbalanced sample categories, and rare classes with very few samples. TT100k [37] is a Chinese traffic sign dataset containing over 150 categories, the widest coverage of any current traffic sign dataset. However, many rather rare traffic signs are still not included. Moreover, the existing categories in the dataset are extremely imbalanced: most classes have fewer than 100 samples, which is clearly not enough for training a detection model, as shown in Fig. 2.

Fig. 1

The difficulties in traffic sign detection. In real traffic scenes, traffic sign detection faces many difficulties, including small size, viewpoint changes, occlusion, illumination and so on

Fig. 2

The number of traffic signs in TT100k. The sample distribution is extremely unbalanced: the categories from w30 onward have fewer than 100 samples each, and those from w26 onward have fewer than 10

On the one hand, current object detection methods [12, 21, 25,26,27,28] are either not robust to small-sized traffic signs or fail to meet real-time requirements. On the other hand, limited by traffic sign data sets, most approaches can only classify a few super-classes, such as just the 3 classes of Mandatory, Danger and Prohibitory [34], or a limited number of sub-classes, such as 45 classes [20, 22, 37]. In this paper, oriented to real traffic scenes, we propose a novel two-level detection architecture to address the aforementioned challenges. The contributions of this paper are summarized as follows:

1) We propose a two-level detection architecture to deal with the problems of missing samples and imbalanced samples. We present a revised YOLOv3 network for traffic sign detection that improves the performance of small object detection.

2) We present an effective data augmentation method based on traffic sign logos to generate enough training data and achieve unlimited traffic sign recognition. A carefully designed experiment proves the effectiveness of the method.

3) The approach achieves good trade-offs among completeness, real-time performance and accuracy, and it can be applied widely in fields such as advanced driver assistance systems and autonomous driving systems.

2 Related work

2.1 Traffic sign detection

Traditional traffic sign detection approaches contain a wide variety of algorithms and ideas [23, 24]. Escalera et al. [19] use color and shape features to detect road traffic signs, while Garcia-Garrido et al. [36] employ the Hough transform to extract information from the edges in the image. To improve detection speed, Bahlmann et al. [1] detect traffic signs using a set of Haar wavelet features obtained from AdaBoost training. Salti et al. [29] use HOG features and an SVM classifier to detect traffic signs. Recently, Berkaya et al. [3] extended this approach by combining features including HOG, local binary patterns (LBP) and Gabor features within an SVM classification framework.

In addition to the traditional methods, CNN-based methods have developed rapidly in recent years. Jin et al. [18] use a hinge loss stochastic gradient descent (HLSGD) method to train a detection network. Zhu et al. [36] employ a fully convolutional network (FCN) to guide traffic sign proposals and a deep convolutional neural network (CNN) for classification. Meng et al. [22] detect traffic signs based on SSD using an image pyramid. Li et al. [20] improve small object detection performance by using Generative Adversarial Networks (GANs). All these approaches are able to detect some traffic signs. However, limited by the data sets and the small size of traffic signs, these methods struggle to make good trade-offs among completeness, real-time performance and accuracy. To achieve unlimited traffic sign detection, some researchers have turned their attention to data augmentation, since traffic signs follow specific templates. In reference [4], the authors present a pipeline-based approach to image augmentation including z-stack augmentation, randomized elastic distortions, etc., and these image augmentation methods are open to the public. Zhong et al. [35] propose Random Erasing, which randomly selects a rectangular region in an image and erases its pixels with random values; this improves the robustness of the model against occlusion and helps prevent over-fitting. Similar to [35], the authors of [9] randomly mask out square regions of input images and prove the effectiveness of the method with thorough experiments.

2.2 Object detection

As early as 2001, Viola and Jones [33] used a sliding window strategy and multi-scale Haar features to realize real-time face detection for the first time. In 2005, the HOG feature [8] was proposed for pedestrian detection and achieved robust results. In 2008, the DPM method [10] was proposed and achieved the best results of its time. Before deep learning, traditional object detection methods were roughly divided into three parts: region selection (sliding window, ROI, etc.), feature extraction (SIFT, HOG, etc.) and classification (SVM, AdaBoost, etc.). However, the traditional object detection algorithms have many shortcomings: on the one hand, the sliding window selection strategy is time-consuming and redundant; on the other hand, the hand-designed features are not robust.

Recently, CNN-based approaches have achieved great success in many fields [5, 6]. Generally speaking, CNN-based object detection methods can be divided into two categories: region-proposal methods and end-to-end methods. Among region-proposal methods, Overfeat [30] is an early work using CNNs for object detection. Its main idea is to use multi-scale sliding windows for classification, localization and detection. Unlike Overfeat, which uses sliding windows to propose regions, R-CNN [13] uses selective search to propose ROI regions and makes final predictions using an SVM. To handle input images of different sizes, SPP-Net [15] introduces a spatial pyramid pooling (SPP) layer to reduce the adverse effects of deformation and cropping. Fast R-CNN [12] enables the network to be trained end-to-end. Considering the huge time cost of selective search in Fast R-CNN, Faster R-CNN [28] proposes a region proposal network and integrates it with Fast R-CNN by sharing convolutional layers, which further improves object detection in terms of both speed and accuracy. In R-FCN [7], the final fully connected layer is replaced by a position-sensitive convolutional network, which greatly improves the detection rate while maintaining high localization accuracy.

Among end-to-end methods, YOLO [27] can be viewed as the originator: it uses a fully connected layer to directly produce the object class as well as the bounding box. Subsequently, the single shot multi-box detector (SSD) [21] introduces default boxes, inspired by [28], and multi-scale feature mapping layers to raise bounding box precision. DSOD [31] designs an efficient framework and a set of principles to learn object detectors from scratch, following the network structure of SSD. Based on [27], YOLO9000 [25] and YOLOv3 [26] further improve performance and speed by a large margin by merging many tricks, including multi-scale training, anchor boxes, a new classification network design and so on.

3 Base network

Our method is based on the YOLOv3 structure. Compared with YOLO and YOLO9000, YOLOv3 uses a new feature extraction network with shortcut connections, darknet53, and makes predictions across 3 scales. Instead of a softmax loss, the authors use a logistic loss to deal with more complex domains like the Open Images Dataset.

3.1 Network structure

The YOLOv3 network is organized as follows. The classifier network darknet53, built from many residual blocks, achieves performance similar to ResNet-152 at twice the speed. darknet53 performs down-sampling with stride-2 convolution layers instead of pooling layers, and after each down-sampling operation, residual blocks are employed at that scale. darknet53 contains 5 down-sampling operations, resulting in a 32-fold reduction in resolution. On top of darknet53, YOLOv3 adds 3 prediction blocks at different scales. Specifically, for each prediction block, up-sampling and route operations are employed to obtain higher-resolution feature maps. The YOLOv3 structure is illustrated in Fig. 3.
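For illustration, the following is a minimal PyTorch-style sketch of one darknet53 stage (our own sketch, not the Darknet implementation): a stride-2 convolution halves the resolution in place of pooling, and residual units then operate at the new scale.

```python
import torch
import torch.nn as nn

class ConvBNLeaky(nn.Module):
    """Convolution + batch norm + leaky ReLU, the basic darknet53 unit."""
    def __init__(self, in_ch, out_ch, kernel, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel, stride,
                      padding=kernel // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1))
    def forward(self, x):
        return self.block(x)

class Residual(nn.Module):
    """darknet53 residual unit: 1x1 bottleneck, 3x3 conv, shortcut add."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = ConvBNLeaky(ch, ch // 2, 1)
        self.conv2 = ConvBNLeaky(ch // 2, ch, 3)
    def forward(self, x):
        return x + self.conv2(self.conv1(x))

class Stage(nn.Module):
    """One down-sampling stage: a stride-2 convolution replaces pooling,
    then n residual units run at the new scale."""
    def __init__(self, in_ch, out_ch, n):
        super().__init__()
        layers = [ConvBNLeaky(in_ch, out_ch, 3, stride=2)]
        layers += [Residual(out_ch) for _ in range(n)]
        self.stage = nn.Sequential(*layers)
    def forward(self, x):
        return self.stage(x)

# Five such stages give the 32-fold resolution reduction described above.
x = torch.randn(1, 32, 512, 512)
print(Stage(32, 64, 1)(x).shape)  # torch.Size([1, 64, 256, 256])
```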

Fig. 3

The architecture of our method. The RPM consists of the feature extraction network and the location prediction network. The CM accepts the proposals from the RPM for further classification. In this work, we select darknet19 as our classifier network

3.2 Loss function

YOLOv3 predicts 3 boxes at every scale and outputs tensors of dimension N × N × [3 × (4 + 1 + C)], covering the 4 bounding box offsets, 1 objectness prediction and C class predictions. The authors use k-means clustering to acquire 9 preset bounding boxes, with 3 preset bounding boxes per scale. The loss function is as follows:

$$ \begin{array}{@{}rcl@{}} loss &=& \lambda_{coord}[\sum\limits_{i=1}^{s^{2}}\sum\limits_{j=1}^{k} 1_{ij}^{obj}[(x_{i}-\hat{x}_{i})^{2} +(y_{i}-\hat{y}_{i})^{2}] \\ &&+\sum\limits_{i=1}^{s^{2}}\sum\limits_{j=1}^{k} 1_{ij}^{obj}[(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}})^{2} +(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}})^{2}]]\\ &&+\lambda_{obj}\sum\limits_{i=1}^{s^{2}}\sum\limits_{j=1}^{k} 1_{ij}^{obj}(C_{i}-\hat{C}_{i})^{2}\\ &&+\lambda_{obj}\sum\limits_{i=1}^{s^{2}}1_{i}^{obj}\sum\limits_{c} (p_{i}(c)-\hat{p}_{i}(c))^{2}\\ &&+\lambda_{noobj}\sum\limits_{i=1}^{s^{2}}\sum\limits_{j=1}^{k} 1_{ij}^{noobj}(C_{i}-\hat{C}_{i})^{2} \end{array} $$
(1)

where \( 1_{i}^{obj} \in \{0,1\}\) indicates whether grid cell i is responsible for predicting an object, \( 1_{ij}^{obj} \in \{0,1\}\) indicates whether the j-th default bounding box of grid cell i is responsible for the object, and \( 1_{ij}^{noobj} \in \{0,1\}\) marks the j-th default bounding box of a grid cell that is not responsible for any object. \(p_{i}(c)\) is a 1-D vector indicating the class probabilities of the object, \(s^{2}\) denotes the area (in grid cells) of the feature map, and k denotes the number of preset bounding boxes per scale, which is 3 in YOLOv3.

4 Our methodology

Our goal is to design an ultra-efficient traffic sign detection network for real scenes; therefore, in this work we mainly focus on small object detection and unlimited traffic sign classification while ensuring real-time performance. However, a general detection framework does not work here, since many classes are not included in the dataset. To solve this problem, we propose a novel two-level network architecture formed by two modules, i.e., the region proposal module (RPM) and the classifier module (CM). The RPM regresses the locations of the objects, whereas the CM predicts multi-class labels based on those locations, as shown in Fig. 3.

4.1 Two-level detection architecture

For unlimited traffic sign detection, we design a two-level detection architecture, as shown in Fig. 3. In the first stage, we focus on regressing the locations of the traffic signs using the RPM; in the second stage, we classify the specific categories of the traffic signs with the CM. We view all traffic signs as one class and use the RPM to propose the locations of the traffic signs; then we add an extra classifier module to acquire the labels of the objects.

4.2 Improved YOLOv3 for the RPM

The RPM is based on YOLOv3. Although YOLOv3 achieves better small object detection performance by making predictions across scales, there is still much room for improvement. For the traffic signs in TT100k, nearly 90% of the boxes are smaller than 80 pixels, while the whole image is 2048 pixels wide, as shown in Fig. 4.

Fig. 4

The cumulative distribution function (CDF) of the sizes of the traffic signs in TT100k. Nearly 90% of the boxes are smaller than 80 pixels, which is rather small compared with the original 2048-pixel resolution

Our solution is inspired by the fact that larger feature maps tend to encode more location information, which has been proven crucial [2, 11] for small object detection. In our network, features are extracted from the tail of each stage in the encoder part. To make use of the low-level features, one straightforward approach is to add more low-level layers to the decoder part. As shown in Table 1, YOLOv3 [26] routes the layers {36, 67, 83} to the decoder network. In this work, we instead route the layers {11, 36, 67, 83} to the decoder network to acquire 4 scales for region proposals, and we adjust the number of channels to keep the overall computational complexity the same (see the sketch after Table 1). The experiments show that this simple modification dramatically improves the detection of small and medium objects while keeping the same detection speed, as shown in Table 4, which implies that the low-level, large feature maps provide more location information for small and medium objects.

Table 1 Our revised network based on YOLOv3; the input resolution is set to 512 × 512
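The following is a hedged PyTorch-style sketch of the 4-scale decoder described above. The routed layer indices follow Table 1, but the channel widths and head shapes are illustrative assumptions, not the exact configuration.

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k):
    """1x1 or 3x3 convolution + batch norm + leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1))

class FourScaleDecoder(nn.Module):
    """Sketch of the revised decoder. YOLOv3 routes encoder layers
    {36, 67, 83} into 3 heads; here the shallow layer 11 (large,
    low-level feature maps) is routed in as a 4th, highest-resolution
    scale. Channel widths are illustrative, reduced to keep the
    overall computation comparable."""
    def __init__(self):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.reduce = conv(1024, 512, 1)        # deepest features
        self.merge67 = conv(512 + 512, 256, 1)  # + layer-67 route
        self.merge36 = conv(256 + 256, 128, 1)  # + layer-36 route
        self.merge11 = conv(128 + 128, 64, 1)   # + layer-11 route (new)
        # each head predicts 3 anchors x (4 offsets + 1 objectness)
        self.pred = nn.ModuleList(nn.Conv2d(c, 3 * 5, 1)
                                  for c in (512, 256, 128, 64))

    def forward(self, f11, f36, f67, f83):
        # for a 512x512 input: f11 128ch@128, f36 256ch@64,
        # f67 512ch@32, f83 1024ch@16
        x = self.reduce(f83)
        p32 = self.pred[0](x)
        x = self.merge67(torch.cat([self.up(x), f67], dim=1))
        p16 = self.pred[1](x)
        x = self.merge36(torch.cat([self.up(x), f36], dim=1))
        p8 = self.pred[2](x)
        x = self.merge11(torch.cat([self.up(x), f11], dim=1))
        p4 = self.pred[3](x)                    # the extra scale
        return p32, p16, p8, p4

feats = [torch.randn(1, c, s, s) for c, s in
         [(128, 128), (256, 64), (512, 32), (1024, 16)]]
print([p.shape[-1] for p in FourScaleDecoder()(*feats)])  # [16, 32, 64, 128]
```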

In addition, the RPM only needs to produce object proposals, without the specific categories of the objects; therefore we modify the loss function by removing the classification loss term. We rewrite the loss function as follows:

$$ \begin{array}{@{}rcl@{}} loss &=& \lambda_{coord}[\sum\limits_{i=1}^{s^{2}}\sum\limits_{j=1}^{k} 1_{ij}^{obj}[(x_{i}-\hat{x}_{i})^{2} +(y_{i}-\hat{y}_{i})^{2}] \\ &&+\sum\limits_{i=1}^{s^{2}}\sum\limits_{j=1}^{k} 1_{ij}^{obj}[(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}})^{2} +(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}})^{2}]]\\ &&+\lambda_{obj}\sum\limits_{i=1}^{s^{2}}\sum\limits_{j=1}^{k} 1_{ij}^{obj}(C_{i}-\hat{C}_{i})^{2}\\ &&+\lambda_{noobj}\sum\limits_{i=1}^{s^{2}}\sum\limits_{j=1}^{k} 1_{ij}^{noobj}(C_{i}-\hat{C}_{i})^{2} \end{array} $$
(2)
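For clarity, a tensorized sketch of Eq. 2 is shown below. This is our own illustration (training actually runs in the Darknet framework), and the default λ values are assumptions, not tuned values from the paper; the masks play the role of the indicator functions.

```python
import torch

def rpm_loss(pred, target, obj_mask, lambda_coord=5.0,
             lambda_obj=1.0, lambda_noobj=0.5):
    """Sketch of the RPM loss of Eq. 2. pred/target: (s*s, k, 5)
    tensors holding (x, y, w, h, objectness) per grid cell and anchor;
    obj_mask: (s*s, k) float mask, 1 where anchor j of cell i is
    responsible for an object (the indicator 1_ij^obj)."""
    noobj_mask = 1.0 - obj_mask
    xy_err = ((pred[..., :2] - target[..., :2]) ** 2).sum(-1)
    wh_err = ((pred[..., 2:4].sqrt() - target[..., 2:4].sqrt()) ** 2).sum(-1)
    conf_err = (pred[..., 4] - target[..., 4]) ** 2
    return (lambda_coord * (obj_mask * (xy_err + wh_err)).sum()
            + lambda_obj * (obj_mask * conf_err).sum()
            + lambda_noobj * (noobj_mask * conf_err).sum())
```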

4.3 Data augmentation based on logo for the CM

There are many datasets for traffic sign detection and classification, such as TT100k [37], GTSDB [17], GTSRB [32] and so on. However, traffic signs are severely unevenly distributed, regardless of country. For example, in TT100k, No Parking, No Entry and Speed Limit signs are very common, with more than 1000 samples each, while signs such as Mountain Danger and Falling Rocks are very rare. This non-uniform distribution makes traffic sign recognition a difficult problem.

Through careful observation of the existing data sets, we find that the differences among instances of the same traffic sign are mainly reflected in the following aspects: viewpoint, background, size, color, illumination, contrast, occlusion and pollution. To obtain enough evenly distributed images and simulate as many real-world situations as possible, we propose a specific data augmentation technique based on traffic sign logos. The pipeline of the logo-based data augmentation is shown in Fig. 5.

Fig. 5

The pipeline of the data augmentation based on logo

First, the logo of the traffic sign is acquired manually, and the mask of the sign is formed automatically by the Canny operator. To simulate the change of viewpoint in real scenes, a perspective transformation is applied to the traffic sign logo: the 2-D image is mapped onto another plane. This process can be expressed as:

$$ \left[\begin{array}{l} a \\ b \\ c \end{array}\right] = \left[\begin{array}{lll} M_{11} & M_{12} & M_{13} \\ M_{21} & M_{22} & M_{23} \\ M_{31} & M_{32} & M_{33} \end{array}\right] \left[\begin{array}{l} x \\ y \\ 1 \end{array}\right], \qquad x^{\prime}=\frac{a}{c},\quad y^{\prime}=\frac{b}{c} $$

where M is the transformation matrix with 9 parameters Mij, (x, y) are the coordinates before the transformation, and \((x^{\prime},y^{\prime})\) are the coordinates after the transformation.
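As an illustration, the perspective step can be implemented with OpenCV as follows. This is a minimal sketch: the corner-jitter parametrization and the max_shift value are our own assumptions; OpenCV solves for the 3 × 3 matrix M of the equation above.

```python
import cv2
import numpy as np

def random_perspective(logo, max_shift=0.15):
    """Warp a square logo image to simulate a change of viewpoint.
    Each corner is moved by up to max_shift of the side length."""
    h, w = logo.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = np.random.uniform(-max_shift, max_shift, (4, 2)).astype(np.float32)
    dst = src + jitter * np.float32([w, h])
    M = cv2.getPerspectiveTransform(src, dst)  # the 3x3 matrix M above
    return cv2.warpPerspective(logo, M, (w, h), borderValue=0)
```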

In real traffic scenarios, the stroke thickness of traffic signs varies due to the influence of illumination, contamination, manufacturing technology and so on. Erosion and dilation are the primary morphological image operations; these two operations have completely opposite effects and can further increase the variation of the samples, as in the sketch below.
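A minimal sketch of this step (the kernel sizes and the 50/50 choice between the two operations are illustrative assumptions):

```python
import cv2
import numpy as np

def random_morphology(logo):
    """Randomly erode or dilate the logo to vary stroke thickness,
    imitating illumination, pollution and manufacturing differences."""
    k = np.ones((np.random.choice([1, 2, 3]),) * 2, np.uint8)
    if np.random.rand() < 0.5:
        return cv2.erode(logo, k, iterations=1)
    return cv2.dilate(logo, k, iterations=1)
```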

The next step is the fusion of the sign and the background. We randomly select an image from the real traffic scenes as the background and crop out a sub-image with the same size as the sign. The fusion of the sign and the sub-image can then be described as follows:

$$ I_{o} = mask\odot I_{s} + (1-mask)\odot I_{b} $$
(3)

where Is denotes the traffic sign logo, mask denotes the mask of the sign, and Ib denotes the image cropped from the real traffic scenes.
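Eq. 3 translates directly into a pixel-wise blend. A minimal sketch, assuming an 8-bit RGB background crop and a 0–255 mask:

```python
import numpy as np

def fuse(sign, background, mask):
    """Eq. 3: I_o = mask * I_s + (1 - mask) * I_b. `mask` is the
    Canny-derived sign mask scaled to [0, 1] and broadcast over the
    color channels; `background` is a same-sized crop from a real
    traffic scene."""
    m = (mask.astype(np.float32) / 255.0)[..., None]  # (H, W) -> (H, W, 1)
    out = m * sign.astype(np.float32) + (1.0 - m) * background.astype(np.float32)
    return out.astype(np.uint8)
```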

Finally, color jitter is applied to the fused image. In addition, we blur the image with Gaussian kernels of different sizes, including {1 × 1, 3 × 3, 5 × 5, 7 × 7}, and cut-out [9, 35] is applied to the synthetic image, which gives the network better generalization ability; a sketch of these steps follows. Well-designed experiments are conducted to validate the effectiveness of our data augmentation method, as shown in Table 2. Some synthetic images are shown in Fig. 6.
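A sketch of the blurring and cut-out steps. The kernel sizes follow the text; the erased-region size range is our own assumption, in the spirit of [9, 35].

```python
import cv2
import numpy as np

def blur_and_cutout(img, max_frac=0.3):
    """Randomly blur a 3-channel synthetic image, then erase one
    square region (cut-out) with random pixels to imitate occlusion."""
    k = int(np.random.choice([1, 3, 5, 7]))
    img = cv2.GaussianBlur(img, (k, k), 0)
    h, w = img.shape[:2]
    s = int(np.random.uniform(0.1, max_frac) * min(h, w))
    y, x = np.random.randint(0, h - s), np.random.randint(0, w - s)
    img[y:y + s, x:x + s] = np.random.randint(0, 256, (s, s, 3), np.uint8)
    return img
```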

Table 2 The performance of our classifiers trained on our generated data
Fig. 6

Some synthetic images

5 Experiment

In this section, we compare the accuracy and speed of our approach with other traffic sign detection methods on TT100k [37] and GTSDB [17]. TT100k is a Chinese traffic sign data set composed of 9176 images containing 143 classes of traffic signs. The image resolution is 2048 × 2048 pixels, and the images are collected under real-world conditions with large illumination variations and weather differences. A typical traffic sign is about 80 × 80 pixels in a 2048 × 2048 image, or just 0.2% of the image area. GTSDB is a German traffic sign data set containing 3 categories, namely mandatory, prohibitory and danger. GTSDB is split into a training set of 600 images and a test set of 300 images and covers natural traffic scenes of various road types (highway, rural, urban) recorded during the day and at dusk. Because GTSDB contains just 3 classes, we only evaluate the performance of our RPM on this data set.

We design our experiments in 3 parts. In the first part, we verify the effectiveness of our logo-based data augmentation. In the second part, the performance of our revised network is compared with the base network YOLOv3. In the final part, we compare our approach with other state-of-the-art methods in terms of speed, accuracy and the number of recognizable traffic signs. All experiments are conducted on the same workstation with an Intel Core i7-6700 3.4 GHz CPU and a single GeForce GTX 1080 Ti GPU, running Ubuntu 16.04 LTS. We use the Darknet neural network framework for training and testing.

5.1 Implementation details

For our region proposal module (RPM), the k-means clustering method is first employed to generate the default anchor boxes, and the generated anchor boxes replace the original ones in the cfg file. We set the learning rate, max batches, momentum and decay to 0.0005, 100000, 0.9 and 0.001, respectively. We then start the training process and monitor the training losses. A sketch of the anchor clustering step is given below.
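For illustration, the anchor clustering can be sketched as follows, using the 1 − IoU distance introduced by YOLOv2/v3. This is a minimal sketch of the standard procedure; the actual anchor values we use are listed in Section 5.3.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) boxes and (w, h) centroids, assuming all
    boxes share the same top-left corner (the YOLOv2/v3 convention)."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
        + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=12, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors with distance 1 - IoU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        centroids = np.array([boxes[assign == j].mean(axis=0)
                              if (assign == j).any() else centroids[j]
                              for j in range(k)])
    return centroids[np.argsort(centroids.prod(axis=1))]  # sort by area
```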

For our classification module (CM), our data augmentation method is first used to generate the training data. The batch, subdivisions, height and width parameters in the cfg file are then set to 1280, 2, 72 and 72, respectively; other configurations remain unchanged. Because classifier training in the original Darknet framework enables random cropping by default, and random cropping seriously distorts the appearance of traffic signs, we modify the code to disable it.

After obtaining the detector model and the classifier model, we combine the two models into a prediction pipeline: one predicts the locations of the objects and the other classifies them. For the detection model, we set a relatively low threshold of 0.2 to ensure a high recall rate; the NMS threshold is set to 0.35. For the classifier model, we set a relatively high threshold of 0.75 to ensure high accuracy. The pipeline is sketched below.
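The combined inference can be sketched as follows. The callables rpm_detect and cm_classify are hypothetical wrappers standing in for the two trained Darknet models; the thresholds are those reported above.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2, conf) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, thresh):
    """Greedy non-maximum suppression over (x1, y1, x2, y2, conf) rows."""
    keep = []
    for b in sorted(boxes, key=lambda b: -b[4]):
        if all(iou(b, k) < thresh for k in keep):
            keep.append(b)
    return keep

def detect_traffic_signs(image, rpm_detect, cm_classify,
                         det_thresh=0.2, nms_thresh=0.35, cls_thresh=0.75):
    """Two-level pipeline: the RPM proposes class-agnostic boxes with a
    low threshold (high recall), NMS removes duplicates, and the CM
    classifies each cropped proposal with a high threshold (precision).
    Box coordinates are assumed to be integer pixel values."""
    boxes = nms(rpm_detect(image, conf_thresh=det_thresh), nms_thresh)
    results = []
    for (x1, y1, x2, y2, conf) in boxes:
        label, score = cm_classify(image[y1:y2, x1:x2])
        if score >= cls_thresh:
            results.append((x1, y1, x2, y2, label, score))
    return results
```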

5.2 The experiments of the CM

5.2.1 The effectiveness analysis of the data augmentation based on logo

In this section, we conduct experiments to evaluate the performance of our logo-based data augmentation method. The signs in TT100k are cropped to form our testing data, which includes 143 categories and 19546 images; the io, wo and po categories are excluded. The training data are generated by our data augmentation method. Specifically, we manually produce 143 standard logos and then generate 5,000 synthetic images for each logo, so the training data eventually contains 143 categories and 715,000 images. Some synthetic images are shown in Fig. 6. We train 3 different classifiers on this training data and test them on our testing data. The results are shown in Table 2: darknet19, ResNet-50 and VGG-16 achieve 91.3%, 91.2% and 86.4% average precision, respectively. A model trained purely on data generated by our augmentation method thus retains good robustness and generalization for traffic signs in real scenes, and it can serve as a baseline for future comparisons in traffic sign data augmentation. Overall, our study offers a new strategy for treating the unbalanced-sample problem in traffic sign classification.

5.2.2 Ablation study of the data augmentation based on logo

To verify the effectiveness of each component of our data augmentation method, we conduct several experiments in which specific components are removed: perspective transformation, erosion and dilation, and cut-out are removed in turn. For a fair comparison, we use the exact same network structure, training images and hyper-parameters (epochs, learning rate, etc.). Specifically, darknet19 is used as the training network with a learning rate of 0.1 for 300 epochs, and we select the final weights as our testing model. The results in Table 3 suggest that all of these components are helpful to the model to some degree. Although some important data augmentation operations are identified in this paper, whether other effective augmentation means exist remains an open question.

Table 3 The ablation study of the data augmentation method for traffic signs

5.3 The experiments of the RPM

In this section, we conduct experiments comparing our revised network with the base network YOLOv3 on the TT100k and GTSDB data sets.

5.3.1 The results in TT100k

First, we use the k-means clustering method to generate the default anchor boxes. In YOLOv3, the anchor boxes are set to {4,4, 6,7, 6,12, 9,9, 11,22, 13,13, 18,19, 25,28, 42,42}, and in our revised network the anchor boxes are {3,4, 5,6, 6,11, 7,7, 8,9, 11,12, 11,22, 14,14, 17,18, 22,24, 30,32, 46,45}. The input resolution is 512 × 512. Most of our training strategies follow YOLOv3, including multi-scale training, data augmentation, aspect ratios, learning rate and so on. As shown in Table 4, the average detection time of YOLOv3 is 80 ms with 71.24% mAP, while our network achieves 79.38% mAP at 86 ms. For a further comparison, following [37], the AUC curves are drawn for 3 different size ranges, namely (0,32], (32,96] and (96,400], as shown in Figs. 7 and 8. Compared with the base network, our revised network mainly improves the performance in the (0,32] and (32,96] ranges.

Table 4 The comparison on the TT100K data set
Fig. 7

Some detection results with an input resolution of 512 × 512. The predicted label appears as an image in the bottom right corner of each box

Fig. 8

The detection results of Zhu et al. [37], YOLOv3 (512) [26] and ours (512, 1024)

5.3.2 The results in GTSDB

Similar to the steps above, the k-means clustering method is used to produce the default anchor boxes. In YOLOv3, the generated anchors are {7,12, 8,14, 10,17, 12,20, 15,24, 17,30, 22,37, 28,47, 40,67}; in our revised network, they are {6,8, 7,9, 8,11, 9,12, 10,13, 11,14, 12,16, 15,19, 17,24, 22,29, 29,37, 40,52}. As shown in Table 5, YOLOv3 achieves an mAP of 0.89 at 78 ms, whereas our revised network obtains a better result, with an mAP of 0.93 at 83 ms. Moreover, from the IoU point of view, our network achieves more accurate localization than YOLOv3, with an average IoU of 0.845. This mainly benefits from our larger feature maps and additional anchor boxes.

Table 5 The comparison on the GTSDB data set, aIoU denotes the average of IoU values of true positive bounding boxes

5.4 The experiments of the whole framework

5.4.1 Overall performance comparison to other methods

In this section, we compare our whole algorithm, consisting of the RPM and the CM, with other state-of-the-art methods. Figure 8 compares our approach with other state-of-the-art methods in terms of recall and accuracy on TT100k. It can be observed that our proposed approach obtains results comparable to the previous state-of-the-art methods [26, 37]. Specifically, on the one hand, compared with methods that detect a limited set of classes, our method achieves the detection of all traffic sign classes thanks to our two-level detection architecture and the logo-based data augmentation, as shown in Tables 6 and 7. On the other hand, as shown in Table 4, when the input size of the network is set to 512 × 512, our method achieves comparable results with an mAP of 79.38% at 41.67 FPS, a good trade-off between accuracy and speed. When the input size is set to 1024 × 1024, we achieve better results with an mAP of 82.6% at 11.6 FPS, which outperforms the other state-of-the-art methods in terms of detection accuracy and speed. In [37], the resolution of the input images is 2048 × 2048, which far exceeds our input sizes. Some detection results are shown in Fig. 7.

Table 6 The analysis of the effectiveness of the RPM and CM. All the input images are resized to the same size of 512 × 512 in all variants
Table 7 The per-class detection results in TT100k; '-' means the detection results do not contain this category

5.4.2 Ablation studies

To better verify the effectiveness of the RPM and CM in our method, we construct four variants and evaluate them on TT100k, as shown in Table 6. For a fair comparison, we resize the input images to the same size of 512 × 512 in all variants. The CM is trained to classify 45 traffic signs using the generated data; the variants without the CM are trained end-to-end on the original data only. We compare the results by mAP value and speed. As shown in Table 6, YOLOv3 with the CM achieves a great improvement, with an mAP of 0.85 at a slightly slower speed, and the RPM with the CM further improves performance, with an mAP of 79.38 at 24 ms. This can be explained by the fact that the RPM improves the detection accuracy of small objects while the CM achieves more accurate classification of rare categories.

6 Conclusion

In this paper, we present a novel two-level detection architecture composed of a region proposal module (RPM) and a classification module (CM). The RPM locates the objects, and the CM then obtains their specific labels. We revised the YOLOv3 network to locate small objects more precisely. To solve the problem of missing categories, a logo-based data augmentation method is presented, and a well-designed experiment proves its effectiveness. In the future, we plan to apply the two-level detection architecture to other specific kinds of objects, e.g., traffic lights. In addition, in traffic sign detection, a wrong detection or classification result may appear suddenly, and we cannot yet effectively explain this phenomenon. Therefore, in the future we will explore the emerging field of explainable AI, e.g., why a traffic sign is misclassified. For those truly interested in this field, more details can be found in [14, 16].