1 Introduction

In the past decade, the rise of deep learning has made many aspects of human life increasingly intelligent. One of its applications is the detection and recognition of traffic signs for environment perception in autonomous driving. The quality of environment perception is a key factor in ensuring that a vehicle can safely reach its destination, and traffic sign detection is crucial to deploying self-driving vehicles on real roads. Although many well-designed traffic signs are installed along roads, thousands of people are killed or seriously injured in traffic accidents every year. Traffic sign detection therefore remains a challenging practical problem.

The accuracy and speed of traffic sign recognition are among the most important factors affecting the safety of autonomous driving and have been studied by many scholars. Traditional traffic sign detection mainly extracts a large number of candidate regions using simple and effective features such as symbols, text, colors, and shapes. Luo et al. [1] proposed an approach that recognizes both symbol-based and text-based traffic signs. Yang et al. [2] defined a detection module for traffic sign extraction and classification based on a color probability model and color features, which obtains classification results from a convolutional neural network at high speed. Gudigar et al. [3] utilized multiple thresholding to extract regions of interest (RoIs) for detecting traffic signs. Xu et al. [4] used a shape-symmetry hypothesis algorithm to extract the RoIs of traffic signs. Among these approaches, color is one of the most conspicuous features of traffic signs and is used in much of the literature. However, such methods adapt poorly to complex environmental conditions.

In recent years, object detection algorithms based on deep learning have been widely adopted. The single shot detector (SSD) [5], faster region-based convolutional neural network (faster R-CNN) [6], the you only look once (YOLO) series [7, 8], RetinaNet [9], cascade R-CNN [10], the fully convolutional one-stage object detector (FCOS) [11], and CornerNet-Squeeze [12] are the current mainstream frameworks. For example, Yao et al. [13] used YOLOv3 combined with DenseNet to detect foreign bodies, and He et al. [14] built foreign body detection models from popular object detection networks (SSD, faster R-CNN) together with different feature extractors (visual geometry group network (VGG), ResNet). Deep learning algorithms have similarly been applied to industrial quality inspection [15], obstacle detection [16], remote sensing image detection [17], unmanned patrol inspection [18], and other tasks [19].

Deep learning research is active in the transportation field [20,21,22,23,24,25,26], especially in traffic sign detection, where detection accuracy and speed have improved continuously and satisfactory progress has been made. Lee et al. [27] used a CNN to detect traffic signs by predicting their exact boundary and location simultaneously, which was more effective than other boundary estimation methods. Yang et al. [28] placed an attention network (AN) in faster R-CNN to find RoIs and adopted a fine RPN to generate the final region proposals. Arcos-García et al. [29] analyzed the detection performance of several frameworks (faster R-CNN, R-FCN, SSD, and YOLOv2) combined with different backbone networks (ResNet50, Inception V2, Darknet19, and so on), discussed the characteristics of these models for traffic sign detection via transfer learning, and identified the most appropriate model for different requirements.

Song et al. [30] defined an efficient CNN with few parameters and fast inference; at a comparable model size, its detection accuracy was better than that of faster R-CNN. Li et al. [31] designed and implemented a traffic sign detector built on faster R-CNN and MobileNet, optimizing its performance with a location refinement approach based on color and shape information. Tian et al. [32] proposed a traffic sign detection method based on repeated attention, which effectively improved the use of image context.

Zhang et al. [33] put forward two new lightweight networks, a teacher network and a student network, which achieved higher traffic sign detection accuracy with fewer trainable parameters. Wan et al. [34] presented a novel small traffic sign detection approach based on the YOLOv3 architecture. Tabernik et al. [35] employed an improved mask R-CNN to detect and recognize traffic signs through end-to-end learning. Gamez Serna et al. [36] designed a mask R-CNN method for detection and refinement on the German Traffic Sign Detection Benchmark (GTSDB) and adopted a CNN architecture for classification.

In 2021, scholars continued to improve traffic sign detection. The thin and deep convolutional neural network proposed by Haque et al. [37] detected traffic signs rapidly and accurately. In [38], a cascade saccade network with a class hierarchy was proposed for traffic sign detection; it performed well and met the real-time requirements of autonomous driving. Tang et al. [39] presented a neck network, IFA-FPN, to address the detection difficulties caused by imbalanced size and category distributions of traffic signs, and experiments on three mainstream datasets showed the superiority of the algorithm. Shen et al. [40] proposed a group multi-scale attention pyramid network that detects small traffic signs with high accuracy. Liu et al. [41] presented TSingNet, which uses AbFPN to learn scale-aware features and ARFF blocks to derive more context information; extensive experiments showed TSingNet to be superior to state-of-the-art approaches.

Traffic sign detection based on deep learning performs well, but several problems still need attention. Traffic signs are small and multi-scale, which makes feature extraction difficult. During training, the network applies many convolution and pooling operations, so much feature map detail is lost, and pixel-level deviations in the feature maps often cause detection errors, greatly reducing accuracy; small traffic signs are therefore especially hard to identify. In real driving environments, traffic signs are easily confused with billboards and other similar signs. Illumination changes, partial occlusion, and complex backgrounds also make traffic sign detection more challenging for convolutional neural networks. Consequently, in terms of detection accuracy and speed, general object detection approaches cannot fully meet the requirements.

To address these problems, we choose the faster R-CNN model: it is sensitive to small traffic signs, it is an end-to-end detection network, and it shares parameters during training, which reduces training time and improves model quality. Faster R-CNN is used as the baseline and improved on this basis to achieve better model performance.

The main contributions of this study are as follows:

  1. An improved faster R-CNN model is proposed, which effectively improves the accuracy of traffic sign detection. Experiments on the CCTSDB dataset show that the proposed method is competitive for traffic sign detection, with an mAP of up to 99.5%.

  2. ResNet50-D, which has fewer parameters, is adopted as the backbone network for feature extraction. Compared with ResNet50, accuracy is improved by 0.5%.

  3. ACFPN is introduced into traffic sign detection; it focuses on context information and only slightly increases model complexity and computation, while further improving detection accuracy by 0.7%.

  4. Data augmentation methods such as random flip, resize, and AutoAugment are adopted to enhance the generalization ability of the model, and transfer learning significantly reduces training time and speeds up network convergence.

  5. Our method extends to other traffic sign datasets; an adaptability experiment on the TT100K dataset further proves its usability.

The rest of the paper is arranged as follows: Sect. 2 presents the proposed method. In Sect. 3, the experimental process is described, containing the training ways, data processing methods, transfer learning approach, and so on. Section 4 gives the experimental results and the discussion. Conclusions and future work are presented in Sect. 5.

2 Proposed method

2.1 Improved faster R-CNN

In this study, the traffic sign detection system starts from the original faster R-CNN detector, a general method developed for detecting and recognizing generic objects, and improves it for the specific field of traffic sign detection. First, the traffic sign images are processed with data augmentation and fed into ResNet50-D to extract multi-level features. Second, the multi-level features are fused by the ACFPN to generate a multi-scale feature pyramid rich in context information. Third, the feature pyramid is fed into the RPN to produce region proposals. Fourth, the feature pyramid and region proposals are passed to the RoI align layer to produce proposal feature maps. Finally, the proposal feature maps are sent to the Bbox head, which classifies the objects and regresses the precise positions of their bounding boxes. The overall structure is shown in Fig. 1.

Fig. 1 The overall structure
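For clarity, the pipeline in Fig. 1 can be summarized structurally in Python. This is a sketch of the data flow only; every stage is passed in as a callable, and none of the names below correspond to the actual implementation.

```python
def detect_traffic_signs(image, augment, backbone, neck, rpn, roi_align, bbox_head):
    """Structural sketch of the pipeline in Fig. 1. Each stage is an
    injected callable; only the order of operations is fixed here."""
    x = augment(image)                    # random flip, resize, AutoAugment (training)
    feats = backbone(x)                   # ResNet50-D multi-level features (C2-C5)
    pyramid = neck(feats)                 # ACFPN context-enriched pyramid (P2-P6)
    proposals = rpn(pyramid)              # region proposals from the pyramid
    rois = roi_align(pyramid, proposals)  # fixed-size proposal feature maps
    return bbox_head(rois)                # class scores and refined boxes
```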

Compared with faster R-CNN, the proposed method has the following advantages: (1) a series of data augmentation methods, such as AutoAugment, random flip, and resize, enhances generalization and detection performance; (2) ResNet50-D is adopted as the backbone network, improving feature extraction; (3) ACFPN reduces the loss of context information, so it is used to fuse the multi-level features output by ResNet50-D, yielding feature maps rich in both semantic and location information.

2.2 Faster R-CNN

Region-based networks mainly include R-CNN, fast R-CNN, and faster R-CNN. Both R-CNN and fast R-CNN adopt selective search (SS) to extract about 2000 candidate boxes, which is time-consuming and makes real-time operation difficult. The typical two-stage model faster R-CNN was therefore built on R-CNN and fast R-CNN; its core idea is to replace SS with a region proposal network (RPN) for extracting candidate regions.

Here, we briefly describe faster R-CNN; see [6] for a more detailed account. Faster R-CNN uses anchors instead of SS to extract candidate regions. Anchors containing target objects are screened and passed to RoI pooling to extract fixed-size feature maps. The candidate regions are then classified from the proposal feature maps, and the target positions are predicted by bounding box regression. The faster R-CNN architecture is shown in Fig. 2.

Fig. 2 Architecture of faster R-CNN

In faster R-CNN, the RPN is a fully convolutional network that generates candidate regions.

RPN loss is defined as follows:

$$L_{rpn} = L\left( {\left\{ {p_{i} } \right\},\left\{ {t_{i} } \right\}} \right) = \frac{1}{{N_{cls} }}\sum\limits_{i} {L_{cls} } \left( {p_{i} ,p_{i}^{*} } \right) + \lambda \frac{1}{{N_{reg} }}\sum\limits_{i} {p_{i}^{*} } L_{reg} \left( {t_{i} ,t_{i}^{*} } \right)$$
(1)

where the classification loss function is

$$L_{cls} \left( {p_{i} ,p_{i}^{*} } \right) = - \log \left[ {p_{i} p_{i}^{*} + \left( {1 - p_{i}^{*} } \right)\left( {1 - p_{i} } \right)} \right]$$
(2)

The regression loss function is

$$L_{reg} \left( {t_{i} ,t_{i}^{*} } \right) = R\left( {t_{i} - t_{i}^{*} } \right)$$
(3)

where R is the robust loss function, defined as the smooth L1 loss:

$$R\left( x \right) = smooth_{{L_{1} }} \left( x \right) = \left\{ {\begin{array}{*{20}l} {0.5x^{2} ,} \hfill & {{\text{if}}\;\left| x \right| < 1} \hfill \\ {\left| x \right| - 0.5,} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right.$$
(4)
$$t_{i} = \left\{ {t_{x} ,t_{y} ,t_{w} ,t_{h} } \right\}$$
(5)

where \(i\) is the index of an anchor in a mini-batch; \(p_{i}\) is the predicted probability that anchor \(i\) contains an object; \(p_{i}^{*}\) is the classification label, whose value is 0 (the anchor is a negative sample) or 1 (the anchor is a positive sample); \(\lambda\) is a constant balancing the classification and regression losses; \(t_{i}\) denotes the four parameterized coordinates of the predicted candidate box; \(t_{i}^{*}\) is the coordinate vector of the corresponding ground-truth box when the anchor is positive; and \(N_{cls}\) and \(N_{reg}\) are the normalization terms for the classification and regression losses, respectively.
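As a concrete illustration of Eqs. (1)–(4), the following NumPy sketch evaluates the smooth L1 term and combines the two loss components for a mini-batch of anchors. It is a simplified rendering of the definitions above, not the framework's actual implementation.

```python
import numpy as np

def smooth_l1(x):
    """Eq. (4), element-wise: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=1.0):
    """Eq. (1) for one mini-batch.

    p      -- predicted objectness probabilities, shape (N,)
    p_star -- anchor labels (1 positive, 0 negative), shape (N,)
    t      -- predicted box offsets (tx, ty, tw, th), shape (N, 4)
    t_star -- ground-truth box offsets, shape (N, 4)
    """
    eps = 1e-7  # numerical safety for the logarithm
    # Eq. (2): binary cross-entropy over all sampled anchors
    l_cls = -np.log(p * p_star + (1 - p) * (1 - p_star) + eps).mean()
    # Eqs. (3)-(4): smooth L1 regression, positive anchors only
    l_reg = (p_star[:, None] * smooth_l1(t - t_star)).sum() / max(p_star.sum(), 1)
    return l_cls + lam * l_reg
```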

2.3 ResNet50-D

In general, the more layers a network has, the richer the extracted features and the more semantic information it captures. However, simply increasing depth leads to vanishing or exploding gradients. ResNet mitigates this with residual connections, making much deeper networks trainable. Depending on the number of layers, the main ResNet variants are ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152, of which ResNet50 is the most commonly used. However, in ResNet50's downsampling path, a 1 × 1 convolution with a stride of 2 discards three-quarters of the input feature map. Therefore, the backbone network selected for our method is ResNet50-D. As shown in Fig. 3, ResNet50-D moves the downsampling of path A to the following 3 × 3 convolution, so all the information is used. In the downsampling of path B, a 2 × 2 average pooling layer with a stride of 2 is added before the convolution, avoiding the loss of a large amount of feature information. Compared with ResNet50, ResNet50-D thus uses average pooling to reduce spatial dimensions in the skip connection. Average pooling retains image background information well, extracts features better, reduces the number of parameters, and saves hardware resources.

Fig. 3 The architecture of ResNet50-D
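The path-B modification can be written compactly in PaddlePaddle. The block below is a minimal sketch of just the downsampling shortcut described above, not the full ResNet50-D definition, and the class name is ours.

```python
import paddle.nn as nn

class ShortcutD(nn.Layer):
    """Path B of a ResNet50-D downsampling stage: a 2x2 average pooling
    with stride 2 runs before a stride-1 1x1 convolution, so no input
    activations are discarded (unlike a stride-2 1x1 convolution)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.pool = nn.AvgPool2D(kernel_size=2, stride=2)
        self.conv = nn.Conv2D(in_channels, out_channels,
                              kernel_size=1, stride=1, bias_attr=False)
        self.bn = nn.BatchNorm2D(out_channels)

    def forward(self, x):
        return self.bn(self.conv(self.pool(x)))
```

Path A is modified analogously: the stride of 2 is moved from the 1 × 1 convolution to the following 3 × 3 convolution, whose kernel covers every input position.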

2.4 RoI align

The core of RoI pooling is that candidate boxes share the features of the feature map while keeping the output size consistent. However, RoI pooling performs two rounds of quantization, which causes a mismatch between the detected regions and the extracted features. To improve the recognition accuracy of traffic signs, RoI pooling is replaced with RoI align in faster R-CNN.

RoI align eliminates the quantization errors introduced by RoI pooling. First, RoI align samples N points uniformly in each region and finds the four grid points of the feature map closest to each sampled point. Then, the value at each sampled point is obtained by bilinear interpolation, as shown in Fig. 4. Finally, the output value of the region is obtained by averaging over the N points. The whole feature aggregation procedure thus becomes a continuous operation, which achieves better performance and preserves the features of the original region as much as possible.

Fig. 4 Bilinear interpolation diagram
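The sampling step in Fig. 4 reduces to standard bilinear interpolation; below is a NumPy sketch for a single-channel feature map (an illustrative helper, assuming the coordinates lie inside the map):

```python
import numpy as np

def bilinear_interpolate(feature, x, y):
    """Value of `feature` (H x W array) at fractional coordinates (x, y),
    blended from the four nearest grid points as in RoI align."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, feature.shape[1] - 1)
    y1 = min(y0 + 1, feature.shape[0] - 1)
    dx, dy = x - x0, y - y0
    return (feature[y0, x0] * (1 - dx) * (1 - dy) +
            feature[y0, x1] * dx * (1 - dy) +
            feature[y1, x0] * (1 - dx) * dy +
            feature[y1, x1] * dx * dy)
```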

The proposals from the RPN are passed to the RoI align layer along with the feature pyramid, and the final result is obtained after two fully connected layers. In practical terms, the probability that a bounding box contains a traffic sign is obtained from the classification layer, and the position of the traffic sign is obtained from the regression layer.

2.5 Attention-guided context feature pyramid network (ACFPN)

The feature pyramid network (FPN) was proposed to exploit the intrinsic multi-scale feature representation of deep convolutional networks. Unfortunately, FPN has two problems: (1) insufficient communication between multi-scale receptive fields; and (2) a contradiction between feature map resolution and receptive field for high-resolution input images. ACFPN was proposed to address these defects. Its structure is shown in Fig. 5.

Fig. 5 The structure of the ACFPN network

Compared with the original FPN, ACFPN applies dilated convolution to F5, subsamples P5, and maintains the same stride as FPN through max pooling. ACFPN [42] consists of two modules, the context extraction module (CEM) and the attention-guided module (AM), which together resolve the contradiction between feature map resolution and receptive field on high-resolution inputs. CEM gathers rich contextual information from multiple receptive fields. The AM comprises two sub-modules, the contextual attention module (CxAM) and the content attention module (CnAM), which are dedicated to capturing discriminative semantics and precise localization, respectively.
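To make the idea behind CEM concrete, the sketch below shows a parallel dilated-convolution branch that gathers context from several receptive fields at unchanged resolution. It is our simplified stand-in, not the exact module of [42].

```python
import paddle
import paddle.nn as nn

class DilatedContextBranch(nn.Layer):
    """Parallel 3x3 convolutions with increasing dilation rates enlarge
    the receptive field without downsampling; a 1x1 convolution fuses
    the concatenated multi-scale context (a simplified CEM-like block)."""
    def __init__(self, channels, rates=(1, 3, 6)):
        super().__init__()
        self.branches = nn.LayerList([
            nn.Conv2D(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates])
        self.fuse = nn.Conv2D(channels * len(rates), channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(paddle.concat([b(x) for b in self.branches], axis=1))
```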

3 Experiment

3.1 Training

In this study, the benchmark datasets are the CCTSDB traffic sign dataset released by Changsha University of Science and Technology [43] and the TT100K dataset [44]. The CCTSDB dataset contains three categories: warning, prohibitory, and mandatory. It provides 11,062 images for training and 2767 images for testing; both splits contain all categories of traffic signs and do not overlap. The TT100K dataset contains 232 categories, with 6105 images for training and 3071 images for testing.

All methods are trained and tested with the PaddlePaddle framework in a Python environment; the cloud hardware is configured with a 4-core CPU, 32 GB of memory, and an NVIDIA Tesla V100 GPU with 16 GB of video memory.

Stochastic gradient descent (SGD) is employed to train the whole network, with an initial learning rate of 0.01 that is divided by 10 at iterations 80,000 and 100,000. The weight decay is set to 0.0001, the momentum to 0.9, and the batch size to 2. The network is initialized with a pre-trained ImageNet classification model, which supplies the initial parameters and weights for the traffic sign detection model. Training runs for 120,000 iterations; the model is evaluated every 10,000 iterations, and the evaluation results are saved until training ends.
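For reference, these settings amount to the following configuration sketch; the values are taken from the text, while the key names are illustrative rather than the exact PaddleDetection configuration.

```python
# Training settings used in this study (key names illustrative).
train_config = {
    "optimizer": "SGD",
    "base_lr": 0.01,
    "lr_milestones": [80_000, 100_000],  # learning rate divided by 10 at each
    "lr_gamma": 0.1,
    "weight_decay": 1e-4,
    "momentum": 0.9,
    "batch_size": 2,
    "max_iters": 120_000,
    "eval_interval": 10_000,             # evaluate and save every 10k iterations
    "init_weights": "ImageNet classification pre-trained model",
}
```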

3.2 Data processing

The CCTSDB dataset contains images of different sizes, such as 1280*720, 1024*768, and 1000*350, whereas the image resolution of the TT100K dataset is 2048*2048. The target input size of the faster R-CNN network is set to 800. All models adopt data augmentation methods such as random flip, normalization, and resize, mainly to increase the number and diversity of training images and thereby enhance the generalization ability of the model; a model trained with data augmentation achieves better detection performance.

In addition to the data augmentation mentioned above, the proposed method also uses AutoAugment. Here we briefly describe the AutoAugment strategy; see [45] for a more detailed description. The idea is to establish a search space of data augmentation policies. A policy in the search space comprises many sub-policies, each consisting of two image-processing functions with their parameters (probability and magnitude). A sub-policy is randomly assigned to each image in each mini-batch, and a search algorithm finds the policy that yields the highest accuracy.
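The following toy sketch illustrates how such a sub-policy is applied: one sub-policy is drawn at random per image, and each of its two operations fires with its own probability and magnitude. The operations and numbers here are invented for illustration; the real search space is described in [45].

```python
import random
from PIL import ImageEnhance

# Each sub-policy is two (operation, probability, magnitude) entries;
# these particular operations and values are illustrative only.
def rotate(img, mag):   return img.rotate(3 * mag)  # magnitude -> degrees
def contrast(img, mag): return ImageEnhance.Contrast(img).enhance(1 + 0.1 * mag)

SUB_POLICIES = [
    [(rotate, 0.7, 2), (contrast, 0.4, 5)],
    [(contrast, 0.8, 3), (rotate, 0.3, 7)],
]

def apply_autoaugment(img):
    """Pick one sub-policy at random; apply each op with its probability."""
    for op, prob, mag in random.choice(SUB_POLICIES):
        if random.random() < prob:
            img = op(img, mag)
    return img
```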

3.3 Increase the number of anchor frames

To adapt to traffic sign detection and detect small targets more successfully, anchor sizes of 32*32 and 64*64 are added to the defaults, so that each anchor point corresponds to 15 candidate windows: anchor_sizes {32*32, 64*64, 128*128, 256*256, 512*512} and aspect_ratios {1:2, 1:1, 2:1}.
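The 15 candidate windows per anchor point can be enumerated directly; a NumPy sketch of the size/ratio combinations above, assuming the aspect ratio is taken as height/width with the anchor area preserved:

```python
import numpy as np

def make_anchors(sizes=(32, 64, 128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """(width, height) of every size/ratio combination: 5 x 3 = 15 anchors,
    where ratio = height / width and each anchor keeps the area size**2."""
    return np.array([(s / np.sqrt(r), s * np.sqrt(r))
                     for s in sizes for r in ratios])

print(make_anchors().shape)  # (15, 2)
```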

3.4 Transfer learning

At present, object detection algorithms built on large datasets and deep CNN models show prominent capability. In practical applications, however, when the amount of training data is small, the model is prone to overfitting. Transfer learning solves this problem well. In this study, an ImageNet classification pre-trained model is employed to initialize the detection model, and our dataset is converted to COCO format by scripts for transfer learning. Because some variables of the pre-trained model differ from those of the actual model, there is a shape-mismatch problem; the pre-trained model is therefore loaded selectively, so that parameters whose shapes differ from the corresponding parameters of the detection model are automatically ignored. Using transfer learning significantly reduces training time.
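Selective loading can be sketched as a simple shape filter over the parameter dictionaries: pre-trained parameters are kept only when both name and shape match the detection model. This is a generic illustration, not the framework's exact loading routine.

```python
def load_matching_weights(model_state, pretrained_state):
    """Copy pre-trained parameters whose name and shape both match the
    detection model; mismatched parameters are skipped automatically."""
    kept = {name: value for name, value in pretrained_state.items()
            if name in model_state
            and tuple(value.shape) == tuple(model_state[name].shape)}
    model_state.update(kept)
    return model_state
```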

3.5 Performance evaluation index

To compare the performance of the method presented in this study with other methods, the following evaluation indexes are utilized.

  (1) mAP

    AP indicates the performance of the detection model on each class, and mAP is the average of the AP values over all categories. mAP with intersection over union (IoU) = 0.5 and area = all is used as the evaluation index on the validation set, following the COCO metrics: the mean of the per-category AP values at IoU_threshold = 0.5. mAP summarizes the overall detection performance of the model.

  (2) FPS

    FPS is the number of image frames processed per second. Listing the FPS of different methods reveals their real-time capability.

  (3) Precision, Recall_rate, Missing_rate, and F1-measure

Precision represents the proportion of correct predictions among all predictions. Recall_rate is the proportion of positive samples that are correctly detected, and Missing_rate is its complement. F1-measure comprehensively evaluates Precision and Recall_rate. True positive (TP), true negative (TN), false positive (FP), and false negative (FN) denote the four cases of detection results. These indicators are defined as follows:

$$Precision = \frac{TP}{TP + FP}$$
(6)
$$Recall\_rate = \frac{TP}{TP + FN}$$
(7)
$$Missing\_rate = 1 - \frac{TP}{TP + FN}$$
(8)
$$F1{\text{-}}measure = 2 \times \frac{Precision \times Recall\_rate}{Precision + Recall\_rate}$$
(9)
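Equations (6)–(9) translate directly into code; a small helper operating on raw counts:

```python
def detection_metrics(tp, fp, fn):
    """Compute Eqs. (6)-(9) from true positive, false positive, and
    false negative counts. Returns zeros where a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    missing = 1.0 - recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, missing, f1
```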

4 Results and discussion

4.1 Experimental results and discussion

Figure 6 shows some detection results on the CCTSDB dataset. The traffic signs were photographed on different roads and vary in shape, color, and size.

Fig. 6 Part of the detection results on the CCTSDB dataset

Although many methods have been proposed for traffic sign detection, reliable comparisons are difficult because of differing datasets and inconsistent hyperparameters. Therefore, to verify the capability of the proposed method, we compare it with several object detection approaches: SSD [5], YOLOv3 [8], RetinaNet [9], cascade R-CNN [10], FCOS [11], and CornerNet-Squeeze [12]. All models use the same CCTSDB dataset, share key hyperparameters (momentum, number of training iterations, and so on), are measured on the same hardware (an NVIDIA Tesla V100 GPU, a 4-core CPU, and 32 GB of memory), and are compared on both the mAP and FPS metrics. The experimental results are shown in Table 1, and Fig. 7 presents them more intuitively.

Table 1 Experimental results of different object detection methods on the CCTSDB dataset
Fig. 7 Performance comparison of each model on the CCTSDB dataset

Clearly, the detection accuracy and speed of the original faster R-CNN model are unsatisfactory. When the stronger ResNet50-D backbone, the ACFPN module, and AutoAugment are applied to faster R-CNN, the best detection performance is achieved. The mAP of the improved faster R-CNN is 99.5%, the highest among the eight detection models. Its FPS of 29.8 is lower than that of the one-stage models SSD and YOLOv3, but it is close to 30 and thus sufficient for real-time traffic sign detection, while its mAP is significantly higher than theirs. Given the substantial gain in mAP, the sacrifice in speed is worthwhile, because accuracy deserves more attention once real-time requirements are met. Compared with the original faster R-CNN model, the improved model increases mAP by 1.8% and FPS by 24.5, an FPS improvement of more than five times, so the gain is reflected most strongly in the FPS indicator. This indicates that the improved faster R-CNN model has better comprehensive performance than the other models.

Among the one-stage models, the improved faster R-CNN surpasses SSD, YOLOv3, and RetinaNet by 13.7%, 4.9%, and 1.1% in mAP, respectively. It is significantly more accurate than SSD and YOLOv3 because faster R-CNN is a two-stage model consisting of region proposal and object classification: RoI align is first performed for each extracted proposal, and in the second stage two fully connected layers classify the candidate regions and predict the target positions, yielding higher detection accuracy. However, its inference speed is significantly lower than that of SSD and YOLOv3, mainly because faster R-CNN first uses the RPN to generate a set of candidate regions and then makes predictions on them. This slows faster R-CNN down, whereas SSD and YOLOv3 are fast one-stage models with fewer layers; nevertheless, this does not undermine the overall detection performance of the proposed method.

Among the two-stage models, compared with faster R-CNN and cascade R-CNN, the improved faster R-CNN not only improves mAP by 1.6% and 1.8% but also improves FPS by 6.8 and 24.5, respectively; the improvement is notable.

Among the anchor-free models, compared with FCOS and CornerNet-Squeeze, the improved faster R-CNN raises mAP by 0.6% and 8.6% and FPS by 12.4 and 1.3, respectively.

Figure 8 shows the precision–recall curves of these methods, further demonstrating the validity of the proposed model. Across the three types of traffic signs (warning, prohibitory, and mandatory), faster R-CNN + ResNet50-D + ACFPN + AutoAugment achieves the highest average precision.

Fig. 8 Precision–recall curves of traffic sign detection performance on the CCTSDB dataset for warning, prohibitory, and mandatory

Additionally, the proposed model is compared with state-of-the-art approaches on the CCTSDB dataset. As Table 2 shows, our method is superior to Shen et al. [40] on the Recall_rate, Missing_rate, and F1-measure indicators, and it achieves an improvement of over 9.0% in mAP compared with Liu et al. [38]. These results further confirm the effectiveness of the proposed method.

Table 2 Comparison of our method with the state-of-the-art methods on the CCTSDB dataset

To further prove the effectiveness of our method, the faster R-CNN + ResNet50 model is used as the baseline for a comparative experiment. Table 3 shows the average recall (AR) on the CCTSDB dataset. Compared with the baseline model, the proposed method improves the AR of small objects (AR_S) by 6.8%, a marked gain that verifies its significantly enhanced detection of small targets. The proposed method performs worse on large objects (AR_L) because ACFPN, after fusing more information, pays more attention to small targets, which improves small-target detection at the cost of attention to large targets.

Table 3 Average recall rate performance comparison on the CCTSDB dataset

In summary, the improved faster R-CNN model has higher accuracy compared with the other models and achieves a satisfactory balance between detection accuracy and speed. It shows that the improved faster R-CNN model has superior performance and is effective in the field of traffic sign detection.

4.2 Ablation study

To assess the influence of ResNet50-D, ACFPN, and AutoAugment on the detection capability of faster R-CNN, we conduct an ablation study on the CCTSDB dataset. First, the backbone network is changed to ResNet50-D to study its impact. Next, we demonstrate the role of ACFPN in the detection model. Finally, AutoAugment is added to evaluate the performance further. Table 4 reports the results of the ablation study.

Table 4 Experimental results of ablation study on the CCTSDB dataset

As Table 4 shows, changing the backbone network to ResNet50-D increases the mAP of faster R-CNN by 0.5%. Faster R-CNN with FPN outperforms faster R-CNN without FPN by 0.6%, and replacing FPN with ACFPN further improves the mAP by 0.1%, confirming the effectiveness of ACFPN in our model. Finally, adding AutoAugment increases the mAP by another 0.6%. These improvements arise mainly because ResNet50-D retains image background information and extracts features better; ACFPN, by fusing features from different receptive fields, enlarges the receptive fields of objects and uses their context information for better classification; and AutoAugment improves accuracy by automatically searching for data augmentation strategies suited to the specific dataset.

4.3 Adaptability experiment

To prove the adaptability of the proposed method, it is applied to the TT100K dataset. The experimental results are shown in Table 5: the improved faster R-CNN achieves a competitive mAP of 95.8%, surpassing mainstream approaches such as SSD, YOLOv3, and faster R-CNN. The improved faster R-CNN is also compared with state-of-the-art approaches in Table 6, where it obtains improvements of 2.2%, 0.38%, and 1.08% in mAP, Precision, and Recall_rate, respectively, further illustrating its superiority. This shows that our method adapts well: applied to a new dataset, it works well with only fine-tuning.

Table 5 Experimental results of different object detection methods on the TT100K dataset
Table 6 Comparison of our method with the state-of-the-art methods on the TT100K dataset

In summary, ResNet50-D, ACFPN, and AutoAugment can effectively improve the traffic sign detection performance of the faster R-CNN model.

5 Conclusions

Traffic sign detection is an important prerequisite for autonomous driving environment perception. In real road scenes, traffic signs are small and the environment changes constantly, which makes detection difficult. This paper proposes an improved faster R-CNN traffic sign detection model that is better suited to the task through a stronger backbone network (ResNet50-D), RPN optimization (anchor sizes), hyperparameter tuning, ACFPN, and AutoAugment. Compared with state-of-the-art approaches on the CCTSDB dataset, our method obtains an mAP of 99.5% at 29.8 FPS, and an adaptability experiment on the TT100K dataset further proves its superiority. The proposed method therefore offers strong generalization ability, provides a useful approach for the environment perception of autonomous driving, and can be extended to other small-object detection tasks similar to traffic signs.

The results also show that ideal performance has not yet been achieved, mainly because an apparent gap in real-time detection speed remains between the proposed approach and one-stage object detection methods. Future work will focus on optimizing the detection model to further improve detection speed, and we also plan to test the proposed traffic sign model in more traffic scenarios.