Abstract
Detecting small, multi-scale, and easily occluded traffic signs in real-world scenes remains a persistent challenge. This paper proposes an approach that uses a multi-scale feature pyramid module to capture hierarchical features, enabling robust detection of traffic signs across varying viewing angles and scales. To aggregate features at different scales and suppress background interference, we stack dilated convolution kernels with varying dilation rates, expanding the receptive field from small to large. This covers the object distribution across multiple scales while enhancing the resolution of the final output feature map, which improves small-target localization. Extensive experiments on two public traffic sign detection datasets show that our method outperforms several state-of-the-art approaches.
Y. Ke and W. Mo contributed equally to this work.
Keywords
- Traffic sign detection
- Multi-scale feature pyramid
- Hierarchical features
- Perceptual field expansion
- Robust detection
1 Introduction
The rapid advancement of autonomous driving and intelligent driver assistance systems has spurred extensive research on traffic sign detection [4]. However, this task is beset by various challenges, including interference from external factors such as illumination, occlusion, weather conditions, and shooting angles. Furthermore, traffic sign targets are typically small and exhibit variations in scale, exacerbating the difficulty of detection.
To overcome these challenges, researchers have proposed innovative approaches. Wang et al. [17] have replaced the original feature pyramid network in YOLOv5, resulting in improved real-time detection performance. Similarly, Yao et al. [19] have enhanced the feature fusion method of YOLOv4-Tiny through the introduction of an AFPN (Adaptive Feature Pyramid Network). While single-stage networks are commonly employed in traffic sign detection studies due to computational limitations in real-world applications, this approach is not without shortcomings. Environmental changes and occlusions often impact the visibility of traffic signs, leading to a decline in detection performance.
Recently, researchers have made significant progress on ATDR (Automatic Traffic Sign Detection and Recognition) in real-world scenarios. One promising line of work uses multi-scale pre-trained networks. For example, TSingNet leverages scale-aware and context-rich features to detect and recognize small or occluded traffic signs [11]. Shen et al. [16] propose a group multi-scale attention pyramid network that facilitates optimal feature fusion patterns and the construction of information-rich feature pyramids to detect traffic signs of various sizes. Although these approaches perform well, they mainly inject high-level semantic information into the earlier layers. Consequently, the foreground semantics of small traffic signs remain prone to vanishing at the higher levels of the FPN (Feature Pyramid Network).
We introduce MDCN (Multi-Scale Dilated Convolutional Enhanced Residual Network), a novel traffic sign detection network that leverages scale awareness and context-rich feature representation to detect multi-scale and small-object traffic signs efficiently. Our paper makes the following key contributions:
- We employ several data augmentation methods to increase the diversity and difficulty of the data, thereby enhancing the generalization ability of the model and helping it detect targets at various scales.
- We propose MDRNet (Multi-scale Deep Residual Network), a new backbone architecture designed to learn scale-aware and context-rich features for traffic sign detection in outdoor environments. MDRNet aims to narrow the semantic gap between multiple scales, leading to improved detection performance.
- To overcome hardware limitations and accommodate high-resolution images, we adopt GN (Group Normalization), whose statistics do not depend on the batch size. Additionally, we use WS (Weight Standardization) to normalize the convolution weights, accelerating model convergence and improving accuracy.
2 Methodology
We developed MDCN based on the Faster R-CNN framework [15]. The architecture of our model is illustrated in Fig. 1, which comprises three primary components: feature extraction, feature fusion, and detection.
2.1 Multiple Data Augmentation Fusion
In our experiments, we utilized the CTSD and GTSDB datasets, which exhibit an imbalanced distribution of target scales and contain small target samples. To address this issue and enhance the model’s attention towards small targets during training, we expanded and augmented the datasets with additional data specifically focused on small targets.
To achieve this, we employed various data augmentation techniques, including Mixup [22], Mosaic [1], and Random affine [14]. These methods allowed us to synthesize samples and expand the dataset effectively. Fig. 2 provides a visualization of the data augmentation techniques employed.
Mixup blends two images into one by a weighted overlay, which improves detection accuracy without incurring significant computational overhead. Mosaic, on the other hand, stitches together four images by randomly cropping, scaling, and aligning them. This approach not only enriches the image background but also increases the diversity of target scales within a single image. Together, these augmentation methods enrich the detection dataset and contribute to the network's robustness.
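The Mixup step described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes grayscale images stored as nested lists, keeps the boxes of both source images (a common convention for detection), and draws the blending weight from a Beta distribution as in the original Mixup paper [22]. The function names and the value of `alpha` are hypothetical.

```python
import random


def sample_lambda(alpha=1.5):
    """Draw the blending weight from Beta(alpha, alpha); alpha is an assumed value."""
    return random.betavariate(alpha, alpha)


def mixup(img_a, boxes_a, img_b, boxes_b, lam):
    """Blend two equally sized grayscale images pixel by pixel.

    img_* are H x W nested lists of floats; boxes_* are lists of
    (x1, y1, x2, y2, label) tuples. Both box lists are kept because
    objects from both source images remain visible after blending.
    """
    if len(img_a) != len(img_b) or len(img_a[0]) != len(img_b[0]):
        raise ValueError("images must have the same size")
    mixed = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]
    return mixed, list(boxes_a) + list(boxes_b)


# Toy example: a black and a white 2 x 2 image, blended with lam = 0.25.
img_a = [[0.0, 0.0], [0.0, 0.0]]
img_b = [[1.0, 1.0], [1.0, 1.0]]
mixed, boxes = mixup(img_a, [(0, 0, 1, 1, "warning")],
                     img_b, [(1, 1, 2, 2, "mandatory")], lam=0.25)
```

In training, `lam` would typically come from `sample_lambda()` for each image pair instead of being fixed.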
2.2 MDRNet
To address the challenges posed by small traffic sign targets, multiple scale levels, and potential occlusion, we propose MDRNet. Our approach replaces all 3 \(\times \) 3 regular convolutions in the conv4-conv5 layers of ResNet50 with dilated convolutions [20]. This modification expands the model's receptive field without increasing computational complexity or compromising resolution. Additionally, the different receptive fields obtained from the various convolutional layers facilitate the extraction of multi-scale contextual information, thereby enabling the effective detection of traffic sign targets at different scales. Please refer to Fig. 3 for a visual representation of the details.
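To make the effect on the receptive field concrete, the sketch below computes the receptive field of a stack of convolutions with and without dilation. It relies on the standard relation that a \(k \times k\) kernel with dilation rate \(r\) covers an extent of \(k + (k-1)(r-1)\). The layer stacks and the dilation rates 1, 2, 4 are illustrative assumptions, not the configuration used in MDRNet.

```python
def effective_kernel(k, r):
    """Extent covered by a k x k kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)


def receptive_field(layers):
    """Receptive field of stacked convolutions.

    layers: iterable of (kernel, stride, dilation) tuples, input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for k, s, r in layers:
        rf += (effective_kernel(k, r) - 1) * jump
        jump *= s
    return rf


# Three stacked 3 x 3 convolutions, stride 1 (hypothetical example stacks).
regular = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]

print(receptive_field(regular))  # 7
print(receptive_field(dilated))  # 15
```

Both stacks use the same number of weights per layer, so the larger receptive field of the dilated stack comes at no extra parameter cost.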
The regular convolution operation is given by Eq. 1, and the dilated convolution by Eq. 2.
Let \(\textbf{y}(m, n)\) denote the result of convolving an input signal \(\textbf{I}(m, n)\) with a filter \(\textbf{F}(i, j)\), where \(\textbf{I}\) has a length and width of M and N, respectively. The regular convolution is given by Eq. 1:
\[\textbf{y}(m, n) = \sum_{i}\sum_{j} \textbf{I}(m + i, n + j)\, \textbf{F}(i, j), \tag{1}\]
where the sums run over the support of the filter \(\textbf{F}\).
In contrast, the dilated convolution introduces a hyperparameter known as the dilation rate, denoted by r. This parameter defines the spacing between the input values sampled by the convolution kernel. The dilated convolution is given by Eq. 2:
\[\textbf{y}(m, n) = \sum_{i}\sum_{j} \textbf{I}(m + r \cdot i, n + r \cdot j)\, \textbf{F}(i, j). \tag{2}\]
When the dilation rate r is set to 1, the dilated convolution reduces to a regular convolution.
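The following sketch implements a dilated convolution directly from its definition: each output value is the sum of filter weights multiplied by input values sampled r pixels apart. It is a plain-Python illustration under simplifying assumptions (single channel, no padding, stride 1, cross-correlation form as used in deep learning frameworks), and the function name is hypothetical.

```python
def dilated_conv2d(image, kernel, r=1):
    """Valid (no padding) 2-D dilated convolution on nested lists.

    y(m, n) = sum_i sum_j image[m + r*i][n + r*j] * kernel[i][j]
    """
    kh, kw = len(kernel), len(kernel[0])
    span_h = (kh - 1) * r + 1  # extent covered by the dilated kernel
    span_w = (kw - 1) * r + 1
    out_h = len(image) - span_h + 1
    out_w = len(image[0]) - span_w + 1
    if out_h <= 0 or out_w <= 0:
        raise ValueError("kernel extent exceeds image size")
    return [
        [
            sum(image[m + r * i][n + r * j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for n in range(out_w)
        ]
        for m in range(out_h)
    ]


# 5 x 5 ramp image with values 0..24 and a 3 x 3 kernel of ones.
image = [[float(5 * row + col) for col in range(5)] for row in range(5)]
kernel = [[1.0] * 3 for _ in range(3)]

y1 = dilated_conv2d(image, kernel, r=1)  # regular convolution, 3 x 3 output
y2 = dilated_conv2d(image, kernel, r=2)  # same 9 weights span 5 x 5, 1 x 1 output
```

With r = 1 the function is an ordinary 3 \(\times \) 3 convolution; with r = 2 the same nine weights cover the whole 5 \(\times \) 5 input.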
2.3 Normalization Methods
Smaller batch sizes have been shown to be more robust to variations between training and test sets [8]. For pixel-level tasks such as object detection and image segmentation, memory consumption limits the batch size to be small [6]. Moreover, the high resolution of traffic sign images and the constraints of general hardware resources further support the use of smaller batch sizes.
However, the effectiveness of BN (Batch Normalization) decreases significantly with small batches, limiting its applicability in micro-batch training. To address this limitation, we employ GN (Group Normalization), which divides channels into groups and calculates the mean and variance within each group for normalization. GN demonstrates stable accuracy across a wide range of batch sizes compared to BN. GN differs from BN in terms of the statistical range over which the mean and variance are calculated. We also introduce WS (Weight Standardization) [13] to further enhance the model’s generalization ability and network performance. WS provides regularization without compromising information exchange, thereby improving model generalization. The joint application of GN and WS is illustrated in Fig. 4. The fusion of GN and WS is expressed as follows:
where \(\boldsymbol{x}\) is the input tensor, \(\mathbf {\Gamma }\) is the convolution kernel, and \(\textbf{y}\) is the output tensor. i and j denote the indices of the output tensor, and k and l denote the indices of the convolution kernel. In WS, the mean and standard deviation of the weight parameters are computed per output channel. The \(\boldsymbol{\mu }_{i,\cdot }\) and \(\boldsymbol{\sigma }_{i,\cdot }\) in Eq. 4 are this mean and standard deviation, calculated as \(\boldsymbol{\mu }_i\) and \(\boldsymbol{\sigma }_i\).
\(\mathcal {S}_i\) is the set of pixels for which the mean and variance are calculated, and \(\mathcal {S}_i\) of GN is defined as
where G is the number of groups (default value is 32) and C/G is the number of channels per group. \(\lfloor \cdot \rfloor \) represents the floor operation. GN computes \(\boldsymbol{\mu }\) and \(\boldsymbol{\sigma }\) along the (H, W) axes and along a group of C/G channels.
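As an illustration of the two normalizations, the sketch below follows their standard definitions: WS standardizes the weights of each output channel to zero mean and unit variance, and GN standardizes activations using the mean and variance computed over (H, W) and a group of C/G channels. It is a plain-Python sketch and not the authors' implementation; the learnable scale and shift of GN are omitted, and `EPS` and the function names are assumptions.

```python
import math
from statistics import fmean, pvariance

EPS = 1e-5  # small constant for numerical stability (assumed value)


def standardize_weights(weight):
    """Weight Standardization: zero mean, unit variance per output channel.

    weight: list of output channels, each a flat list of C_in * k * k floats.
    """
    out = []
    for w in weight:
        mu = fmean(w)
        sigma = math.sqrt(pvariance(w, mu) + EPS)
        out.append([(v - mu) / sigma for v in w])
    return out


def group_norm(x, num_groups):
    """Group Normalization for one sample (affine scale and shift omitted).

    x: list of C channels, each a flat list of H * W activations.
    Statistics are shared by the C / num_groups channels of each group.
    """
    c = len(x)
    if c % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    per_group = c // num_groups
    out = []
    for g in range(num_groups):
        chans = x[g * per_group:(g + 1) * per_group]
        vals = [v for ch in chans for v in ch]
        mu = fmean(vals)
        sigma = math.sqrt(pvariance(vals, mu) + EPS)
        out.extend([[(v - mu) / sigma for v in ch] for ch in chans])
    return out


# Two output channels with three weights each (toy values).
w_hat = standardize_weights([[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]])

# Four channels of two activations each, normalized in two groups.
x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
y = group_norm(x, num_groups=2)
```

Neither function uses the batch dimension, which is why this combination remains usable when memory limits training to very small batches.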
3 Experiment
3.1 Implementation Details
Datasets. GTSDB [7]: The traffic signs are classified into three categories: triangular warning signs (yellow or red), circular prohibitory signs (red or black), and mandatory signs (blue circles or squares). The GTSDB dataset consists of 900 images with a size of 1360 \(\times \) 800, covering 43 types of traffic signs. The CTSD dataset contains 1100 images captured on highways, urban roads, and rural roads, with sizes of 1024 \(\times \) 768 and 1280 \(\times \) 720. It classifies traffic signs in China into the same three categories as the GTSDB dataset.
Both the CTSD and GTSDB datasets were selected for their inclusion of small targets and multiple scales in complex environments. Figure 5 shows some typical examples from these datasets, where small targets and multiple scales of traffic signs are often present in challenging conditions such as blurred, poorly lit, damaged, and obscured signs. The datasets were divided into training and test sets in a 7:1 ratio.
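A 7:1 split of this kind can be sketched as follows. The text does not state how the split was drawn, so the random shuffle, the fixed seed, and the rounding down of the training-set size are assumptions, and `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(items, train_parts=7, test_parts=1, seed=0):
    """Shuffle items and split them in a train_parts:test_parts ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = len(items) * train_parts // (train_parts + test_parts)
    return items[:n_train], items[n_train:]


# 900 image indices, matching the size of GTSDB.
train, test = split_dataset(range(900))
print(len(train), len(test))  # 787 113
```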
Evaluation Metrics. In this paper, the evaluation metrics used are \(AP_{50}\), \(AP_{75}\), \(AP_S\), \(AP_M\), \(AP_L\), and AR, which are calculated following the methods described in COCO [10].
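For reference, the sketch below shows the two ingredients behind these metrics under the COCO conventions: the IoU between a predicted and a ground-truth box, which must reach 0.5 for \(AP_{50}\) and 0.75 for \(AP_{75}\), and the object-size buckets behind \(AP_S\), \(AP_M\), and \(AP_L\), with thresholds at \(32^2\) and \(96^2\) pixels. The function names are hypothetical, and the handling of areas exactly on a threshold is simplified.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def coco_size_bucket(area):
    """COCO size bucket by pixel area (boundary handling simplified)."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


pred = (10, 10, 30, 30)   # predicted box, 20 x 20 pixels
truth = (12, 12, 32, 32)  # ground-truth box, 20 x 20 pixels

overlap = iou(pred, truth)           # 324 / 476, about 0.68
matches_ap50 = overlap >= 0.5        # counts as a match for AP_50
matches_ap75 = overlap >= 0.75       # does not count for AP_75
bucket = coco_size_bucket(20 * 20)   # contributes to AP_S
```

The example shows why \(AP_{75}\) is the stricter measure: a prediction shifted by two pixels on a 20-pixel sign already fails the 0.75 threshold.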
Setting. All experiments were conducted on Ubuntu 20.04 using PyTorch 1.9.1. Training used an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The SGD optimizer was employed with an initial learning rate of 0.00125, momentum of 0.9, and weight decay of 0.0001.
3.2 Results and Analysis
Comparison with State-of-the-Art Methods on CTSD. We compared our MDCN method with several state-of-the-art approaches on the CTSD dataset. The comparison results are presented in Table 1 and Table 2.
Our proposed MDCN achieves 83.0 AP and 87.6 AR, surpassing all other methods in the table, which indicates a good balance between false detections and missed detections. MDCN not only improves substantially over the original Faster R-CNN but also outperforms the other two-stage and single-stage detectors. It exceeds the second-best method, DH Faster R-CNN, by 1.7 AP and the lowest-scoring method, YOLOF, by 14.9 AP. MDCN leads on every metric except \(AP_L\), which demonstrates the strength of our method in multi-scale detection.
Furthermore, our proposed method achieves outstanding results for all three major categories of traffic signs, particularly for the mandatory category, where it outperforms all other detection methods by at least 3 points. This is due to the variable shapes and scales of directional traffic signs, with their rectangular aspect ratios often being more extreme compared to other traffic signs. Consequently, the detection of multi-scale signs in this category poses greater challenges.
Overall, the results validate the effectiveness of our MDCN method in achieving superior performance in multi-scale traffic sign detection.
Comparison with State-of-the-Art Methods on GTSDB. We conducted a comparison between MDCN and several popular detection algorithms on the GTSDB dataset, which is widely recognized as a representative dataset in the field of traffic sign object detection. Table 1 presents the results of this comparison.
Among the two-stage algorithms, DH Faster R-CNN and Cascade R-CNN achieved strong detection performance with 77.9 AP and 76.6 AP, respectively. MDCN surpassed both with 78.7 AP. Furthermore, MDCN outperformed the baseline by 5.5 AP and exceeded YOLOF, which had the lowest AP, by a factor of 1.48. Overall, MDCN exhibited superior detection accuracy compared to other methods, particularly for the warning class of traffic signs, which typically occupy a smaller proportion of the image.
Additionally, MDCN demonstrated higher AP scores in the \(AP_S\) and \(AP_L\) metrics compared to all other methods. Although its \(AP_M\) was slightly lower than Dynamic R-CNN, Cascade R-CNN, and DH Faster R-CNN by 0.6 \(AP_M\), 1.2 \(AP_M\), and 2.0 \(AP_M\), respectively, MDCN still outperformed them by significant margins in the \(AP_S\) metric (10.1 \(AP_S\), 13.1 \(AP_S\), and 6.8 \(AP_S\), respectively). Moreover, MDCN’s \(AP_L\) was higher than that of these methods. These results indicate that our method exhibits superior and more stable performance in multi-scale detection.
In conclusion, MDCN demonstrated excellent detection performance on the GTSDB dataset, affirming the effectiveness and generalization capabilities of our proposed model.
3.3 Ablation Studies
We performed a series of ablation experiments to demonstrate the effectiveness of MDCN for the detection of small traffic sign targets at multiple scales in real-world complex environments.
Component Ablation Studies of MDCN. The effectiveness of various optimization components in improving the performance of the baseline model was evaluated through experiments on the CTSD and GTSDB datasets, and the results are summarized in Table 3 and Table 4.
The addition of GN led to improvements of 1.4 AP and 3.4 AP on the CTSD and GTSDB datasets, respectively, highlighting its positive impact on detecting small targets. Combining GN with WS further improved detection, adding 1.5 AP and 1.2 AP, as well as 8.9 and 15.4 in \(AP_S\), respectively, compared to GN alone. This demonstrates that the combination of GN and WS contributes to improved detection of small targets.
The integration of MDRNet improved AP by 3.7 and 5.3 on the CTSD and GTSDB datasets, respectively, compared to the baseline. Moreover, it led to notable improvements in \(AP_S\), \(AP_M\), and \(AP_L\), with gains of 8.9, 2.6, and 4.1, respectively, on both datasets. The largest improvement was observed in \(AP_S\), indicating the effectiveness of MDRNet in detecting small targets. This improvement can be attributed to the ability of MDRNet to capture more contextual information, which helps reduce the rates of false detections and missed detections for small targets.
Furthermore, including the MDEF method (the multiple data augmentation fusion strategy of Sect. 2.1) yielded improvements of 4.0 AP and 5.5 AP over the baseline on the CTSD and GTSDB datasets, respectively. MDEF effectively increased the number of small targets in the dataset and disrupted the regular positioning of traffic signs on the road, thereby preventing overfitting to specific road environments.
Visualizations in Fig. 6 further demonstrate the superior performance of MDCN compared to the baseline. MDCN exhibits greater sensitivity to traffic sign objects with uneven positional distribution and a wide range of scales. Additionally, it demonstrates better detection performance for small objects, as evident from the second row of the visualization.
Ablation Experiments of Dilated Convolution Embedding Positions. In our ablation experiments, we investigated the impact of dilated convolution on the task of traffic sign detection. Dilated convolution is known for its ability to expand the receptive field while preserving resolution. However, it is crucial to carefully select the positions where regular convolutions are replaced with dilated convolutions to achieve optimal results.
We examined the effect of incorporating dilated convolutions from the bottom up, from the conv2 to the conv5 layers of the backbone network. The results, shown in Table 5 and Table 6, revealed that adding dilated convolutions in the last two layers of the convolutional hierarchy yielded the most significant improvements. Specifically, on the CTSD dataset, the AP values increased to 81.7 and 81.6, while on the GTSDB dataset, they improved to 76.4 and 77.6, respectively.
This observation can be attributed to the fact that deeper convolution layers tend to lose internal data structures, resulting in the loss of some crucial information related to small objects. By replacing the deeper convolutions with dilated convolutions, we were able to improve the localization accuracy of small targets.
Building upon these findings, we gradually added dilated convolution modules in pairs until all 3\(\,\times \,\)3 convolutions between conv4 and conv5 were replaced. This configuration gave the optimal embedding position for MDRNet, achieving the best performance of 83.0 AP and 78.7 AP on the CTSD and GTSDB datasets, respectively.
These ablation experiments highlight the importance of selecting appropriate positions for dilated convolutions, and they provide valuable insights into the optimal design choices for our MDRNet architecture.
4 Conclusion
The primary objective of this paper is to enhance the detection performance of small multi-scale traffic signs in complex real-world environments. To achieve this goal, we propose MDCN, a traffic sign detection method based on Faster R-CNN. Our approach uses MDRNet as the backbone network, which effectively increases the receptive field and sampling rate. This improvement enhances the feature representation capability for targets of different scales and those that may be partially obscured. Considering the characteristics of the traffic sign datasets, hardware resource limitations, and the high resolution of the images, we introduce a combined normalization method (GN with WS) at the backbone and pooling layers. This technique accelerates convergence and improves the generalization ability of the model. Furthermore, we employ a multi-data augmentation fusion strategy to increase the number of small-target samples and enhance the network's robustness. The effectiveness and generalization of the proposed MDCN method are validated through comprehensive evaluations on the CTSD and GTSDB datasets. The results demonstrate its superiority in detecting small traffic signs and establish its competitiveness compared to other state-of-the-art methods.
For future research, we intend to explore traffic sign recognition techniques tailored specifically for mobile terminals. By adapting the methodology to mobile devices, we aim to further expand the applicability and practicality of traffic sign detection in real-world scenarios.
References
[1] Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
[2] Cai, Z., Vasconcelos, N.: Cascade R-CNN: high quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 43(5), 1483–1498 (2019)
[3] Chen, Q., Wang, Y., Yang, T., Zhang, X., Cheng, J., Sun, J.: You only look one-level feature. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13039–13048 (2021)
[4] Elsagheer Mohamed, S.A., AlShalfan, K.A.: Intelligent traffic management system based on the internet of vehicles (IoV). J. Adv. Transp. 2021, 1–23 (2021)
[5] Feng, C., Zhong, Y., Gao, Y., Scott, M.R., Huang, W.: TOOD: task-aligned one-stage object detection. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3490–3499. IEEE Computer Society (2021)
[6] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
[7] Houben, S., Stallkamp, J., Salmen, J., Schlipsing, M., Igel, C.: Detection of traffic signs in real-world images: the German traffic sign detection benchmark. In: The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2013)
[8] Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016)
[9] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
[10] Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
[11] Liu, Y., Peng, J., Xue, J.H., Chen, Y., Fu, Z.H.: TSingNet: scale-aware and context-rich feature learning for traffic sign detection and recognition in the wild. Neurocomputing 447, 10–22 (2021)
[12] Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W., Lin, D.: Libra R-CNN: towards balanced learning for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 821–830 (2019)
[13] Qiao, S., Wang, H., Liu, C., Shen, W., Yuille, A.: Micro-batch training with batch-channel normalization and weight standardization. arXiv preprint arXiv:1903.10520 (2019)
[14] Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
[15] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28 (2015)
[16] Shen, L., You, L., Peng, B., Zhang, C.: Group multi-scale attention pyramid network for traffic sign detection. Neurocomputing 452, 1–14 (2021)
[17] Wang, J., Chen, Y., Dong, Z., Gao, M.: Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput. Appl. 35(10), 7853–7865 (2022)
[18] Wu, Y., et al.: Rethinking classification and localization for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10186–10195 (2020)
[19] Yao, Y., Han, L., Du, C., Xu, X., Jiang, X.: Traffic sign detection algorithm based on improved YOLOv4-Tiny. Signal Process.: Image Commun. 107, 116783 (2022)
[20] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
[21] Zhang, H., Wang, Y., Dayoub, F., Sunderhauf, N.: VarifocalNet: an IoU-aware dense object detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8514–8523 (2021)
[22] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
[23] Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759–9768 (2020)
Acknowledgment
This work is supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2020D01C33).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ke, Y., Mo, W., Li, Z., Cao, R., Zhang, W. (2023). MDCN: Multi-scale Dilated Convolutional Enhanced Residual Network for Traffic Sign Detection. In: Yang, X., et al. Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol 14176. Springer, Cham. https://doi.org/10.1007/978-3-031-46661-8_39
DOI: https://doi.org/10.1007/978-3-031-46661-8_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46660-1
Online ISBN: 978-3-031-46661-8
eBook Packages: Computer Science, Computer Science (R0)