
1 Introduction

The rapid advancement of autonomous driving and intelligent driver assistance systems has spurred extensive research on traffic sign detection [4]. However, this task is beset by various challenges, including interference from external factors such as illumination, occlusion, weather conditions, and shooting angles. Furthermore, traffic sign targets are typically small and exhibit variations in scale, exacerbating the difficulty of detection.

To overcome these challenges, researchers have proposed innovative approaches. Wang et al. [17] replaced the original feature pyramid network in YOLOv5, improving real-time detection performance. Similarly, Yao et al. [19] enhanced the feature fusion of YOLOv4-Tiny by introducing an AFPN (Adaptive Feature Pyramid Network). Single-stage networks are commonly employed in traffic sign detection because of the computational constraints of real-world deployments, but this choice is not without shortcomings: environmental changes and occlusions often degrade the visibility of traffic signs, leading to a decline in detection performance.

Recently, researchers have made significant advances in improving the performance of ATDR (Automatic Traffic Sign Detection and Recognition) in real-world scenarios. One promising line of work employs multiscale pre-trained networks. In this context, a novel traffic sign detection network called TSingNet has been introduced; it leverages scale-aware and context-rich features to effectively detect and identify small or occluded traffic signs [11]. Furthermore, Shen et al. [16] propose a population multiscale attention pyramid network that facilitates optimal feature fusion patterns and the construction of information-rich feature pyramids to detect traffic signs of various sizes. Although these approaches perform strongly, they incorporate high-level semantic information primarily in the earlier layers. Consequently, they suffer from the problem that foreground semantics, particularly those of small traffic signs, tend to vanish at the higher levels of the FPN (Feature Pyramid Network).

We introduce MDCN (Multi-Scale Dilated Convolutional Enhanced Residual Network), a novel traffic sign detection network that leverages scale awareness and context-rich feature representation to detect multi-scale and small-object traffic signs efficiently. Our paper makes the following key contributions:

  • We employ several novel data augmentation methods to increase the diversity and difficulty of the data, thereby enhancing the generalization ability of the model and effectively addressing the detection of various scales.

  • We propose MDRNet (Multi-scale Deep Residual Network), a new backbone architecture designed to learn scale-aware and context-rich features for traffic sign detection in outdoor environments. MDRNet aims to narrow the semantic gap between multiple scales, leading to improved detection performance.

  • To overcome hardware limitations and accommodate high-resolution images, we introduce GN (Group Normalization) to remove the batch size limitation. Additionally, we utilize WS (Weight Standardization) to further normalize the data from a weight perspective, accelerating model convergence and improving accuracy.

2 Methodology

We developed MDCN based on the Faster R-CNN framework [15]. The architecture of our model is illustrated in Fig. 1, which comprises three primary components: feature extraction, feature fusion, and detection.

Fig. 1. The overall architecture of MDCN. The feature maps C1 to C5, obtained from the backbone network, serve as the input for feature extraction. The feature maps P2 to P6 are then generated to extract features at different scales. The RPN (Region Proposal Network) is responsible for extracting regions of interest.

2.1 Multiple Data Augmentation Fusion

In our experiments, we utilized the CTSD and GTSDB datasets, which exhibit an imbalanced distribution of target scales and contain small target samples. To address this issue and enhance the model’s attention towards small targets during training, we expanded and augmented the datasets with additional data specifically focused on small targets.

To achieve this, we employed various data augmentation techniques, including Mixup [22], Mosaic [1], and Random affine [14]. These methods allowed us to synthesize samples and expand the dataset effectively. Fig. 2 provides a visualization of the data augmentation techniques employed.

Mixup involves overlaying two images, resulting in improved image detection accuracy without incurring significant computational overhead. Mosaic, on the other hand, stitches together four images by randomly cropping, scaling, and aligning them. This approach not only enhances the richness of the image background but also increases the diversity of target scales within a single image. These augmentation methods significantly enrich the detection dataset and contribute to the network’s robustness.
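A minimal sketch of these two augmentations is given below (our own illustration with hypothetical helper names; the paper does not publish its implementation, and the box bookkeeping details are assumptions):

```python
import random
import numpy as np
import cv2  # assumed available for resizing

def mixup(img_a, boxes_a, img_b, boxes_b, alpha=1.5):
    # Blend two images pixel-wise; the union of both box sets labels the
    # result (in practice each label also carries its mixing weight lam).
    lam = np.random.beta(alpha, alpha)
    mixed = (lam * img_a.astype(np.float32)
             + (1.0 - lam) * img_b.astype(np.float32)).astype(np.uint8)
    return mixed, np.concatenate([boxes_a, boxes_b], axis=0)

def mosaic(samples, out_size=1024):
    # samples: four (image, boxes) pairs; boxes are [x1, y1, x2, y2] arrays.
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    cx = int(random.uniform(0.3, 0.7) * out_size)  # random stitch center
    cy = int(random.uniform(0.3, 0.7) * out_size)
    quads = [(0, 0, cx, cy), (cx, 0, out_size, cy),
             (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    all_boxes = []
    for (img, boxes), (x1, y1, x2, y2) in zip(samples, quads):
        h, w = img.shape[:2]
        canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
        sx, sy = (x2 - x1) / w, (y2 - y1) / h  # per-quadrant scale factors
        all_boxes.append(boxes * [sx, sy, sx, sy] + [x1, y1, x1, y1])
    return canvas, np.concatenate(all_boxes, axis=0)
```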

Fig. 2. Visualization of multiple data augmentation fusion. The ground truth is represented by the green box. (a) demonstrates the Mosaic method, in which four images are randomly stitched together, enriching the image background and increasing the diversity of target scales within a single image. (b) depicts Mixup, which blends two sample-label pairs to generate a new sample with proportionally mixed labels, effectively improving detection accuracy. (Color figure online)

2.2 MDRNet

To address the challenges posed by small-scale traffic sign targets, multiple scale levels, and potential occlusion, we propose MDRNet. Our approach replaces all 3 \(\times \) 3 regular convolutions in the conv4-conv5 stages of ResNet50 with dilated convolutions [20]. This modification expands the model’s receptive field without increasing computational complexity or compromising resolution. Moreover, the different receptive fields obtained from the various convolutional layers facilitate the extraction of multi-scale contextual information, enabling the effective detection of traffic sign targets at different scales. Please refer to Fig. 3 for a visual representation of the details.
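As a concrete sketch of this modification (an illustration of ours, not the authors’ released code): in the torchvision implementation of ResNet-50, the conv4 and conv5 stages correspond to layer3 and layer4, and each 3 \(\times \) 3 convolution can be dilated in place. Enlarging the padding to match the dilation rate keeps the feature-map resolution unchanged; the fixed rate of 2 below is an assumption for illustration.

```python
import torch.nn as nn
import torchvision

def dilate_3x3_convs(stage: nn.Module, rate: int = 2) -> None:
    # Patch every 3x3 convolution in the stage to use dilation `rate`.
    # Setting padding equal to the rate preserves the spatial output size.
    for m in stage.modules():
        if isinstance(m, nn.Conv2d) and m.kernel_size == (3, 3):
            m.dilation = (rate, rate)
            m.padding = (rate, rate)

backbone = torchvision.models.resnet50(pretrained=True)
for stage in (backbone.layer3, backbone.layer4):  # conv4 and conv5
    dilate_3x3_convs(stage)
```

Because the pretrained 3 \(\times \) 3 weights are reused unchanged, this patch only alters where the kernel samples the input, which is what enlarges the receptive field at no extra computational cost.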

The regular convolution operation can be represented by Eq. 1, while the dilated convolution is defined by Eq. 2.

Let \(\textbf{y}(m, n)\) denote the output at position (m, n) of convolving an input signal \(\textbf{I}\) with a filter \(\textbf{F}(i, j)\) of height M and width N. The regular convolution operation is given by Eq. 1:

$$\begin{aligned} \textbf{y}(m, n) = \sum _{i=1}^M \sum _{j=1}^N \textbf{I}(m+i, n+j) * \textbf{F}(i, j) \end{aligned}$$
(1)

In contrast, the dilated convolution introduces a hyperparameter known as the dilation rate, denoted by r. This parameter defines the spacing between values as the convolution kernel processes the data. The formulation of the dilated convolution can be expressed as Eq. 2:

$$\begin{aligned} \textbf{y}(m, n) = \sum _{i=1}^M \sum _{j=1}^N \textbf{I}(m+r \times i, n+r \times j) * \textbf{F}(i, j) \end{aligned}$$
(2)

When the dilation rate r is set to 1, the dilated convolution reduces to a regular convolution.
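To make the two definitions concrete, Eq. 2 can be implemented directly as nested loops (a minimal single-channel sketch of ours, ignoring padding and boundary handling); setting r = 1 reproduces the regular convolution of Eq. 1.

```python
import numpy as np

def dilated_conv2d(I: np.ndarray, F: np.ndarray, r: int = 1) -> np.ndarray:
    # Literal implementation of Eq. 2 for a single-channel input I and an
    # M x N filter F; r = 1 recovers the regular convolution of Eq. 1.
    M, N = F.shape
    H, W = I.shape
    out_h, out_w = H - r * M, W - r * N  # valid positions only, no padding
    y = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            for i in range(M):      # i, j are 1-indexed in Eq. 2,
                for j in range(N):  # hence the (i + 1), (j + 1) offsets
                    y[m, n] += I[m + r * (i + 1), n + r * (j + 1)] * F[i, j]
    return y
```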

Fig. 3. A detailed design of MDRNet. (a) The network architecture of MDRNet. (b) The residual module employed in MDRNet to enhance feature extraction and localization, consisting of two 1 \(\times \) 1 regular convolutional modules and a 3 \(\times \) 3 dilated convolutional module. (c) The dilated convolutional module, showing the spacing between sampled points.

2.3 Normalization Methods

Smaller batch sizes have been shown to be more robust to variations between training and test sets [8]. For pixel-level image generation tasks like object detection and image segmentation, memory consumption limits the batch size to be small [6]. Moreover, the large image size of traffic signs and the constraints of general hardware resources further support the use of smaller batch sizes.

However, the effectiveness of BN (Batch Normalization) degrades significantly with small batches, limiting its applicability in micro-batch training. To address this limitation, we employ GN (Group Normalization), which divides the channels into groups and calculates the mean and variance within each group for normalization. GN differs from BN only in the statistical range over which the mean and variance are computed; since its statistics are independent of the batch dimension, GN maintains stable accuracy across a wide range of batch sizes. We also introduce WS (Weight Standardization) [13] to further enhance the model’s generalization ability and network performance: WS provides regularization without compromising information exchange. The joint application of GN and WS is illustrated in Fig. 4, and their fusion is expressed as follows:

$$\begin{aligned} \hat{\boldsymbol{x}}_{i, j} = \frac{1}{\boldsymbol{\sigma }_{i}}\left[ \sum _{k, l} \boldsymbol{x}_{i-k, j-l} \cdot \frac{1}{\boldsymbol{\sigma }_w}\left( \mathbf {\Gamma }_{k, l}-\boldsymbol{\mu }_w\right) -\boldsymbol{\mu }_{i}\right] \end{aligned}$$
(3)

where \(\boldsymbol{x}\) is the input tensor, \(\mathbf {\Gamma }\) is the convolution kernel, and \(\hat{\boldsymbol{x}}\) is the output tensor. i and j index the output tensor, while k and l index the convolution kernel. In WS, the statistical range of the mean \(\boldsymbol{\mu }_w\) and standard deviation \(\boldsymbol{\sigma }_w\) of the weight parameters is each output channel. The GN statistics \(\boldsymbol{\mu }_{i}\) and \(\boldsymbol{\sigma }_{i}\) in Eq. 3 are calculated as in Eq. 4.

$$\begin{aligned} \boldsymbol{\mu }_{i}=\frac{1}{m} \sum _{k \in \mathcal {S}_i} \boldsymbol{x}_k, \quad \boldsymbol{\sigma }_{i}=\sqrt{\frac{1}{m} \sum _{k \in \mathcal {S}_i}\left( \boldsymbol{x}_k-\boldsymbol{\mu }_{i}\right) ^2+\epsilon } \end{aligned}$$
(4)

where \(\mathcal {S}_i\) is the set of pixels over which the mean and variance are calculated. For GN, \(\mathcal {S}_i\) is defined as

$$\begin{aligned} \mathcal {S}_i=\{k \mid k_N=i_N,\left\lfloor \frac{k_c}{C / G}\right\rfloor =\left\lfloor \frac{i_c}{C / G}\right\rfloor \} \end{aligned}$$
(5)

where G is the number of groups (32 by default) and C/G is the number of channels per group. \(\lfloor \cdot \rfloor \) denotes the floor operation. GN computes \(\boldsymbol{\mu }\) and \(\boldsymbol{\sigma }\) along the (H, W) axes and across each group of C/G channels.
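A minimal PyTorch sketch of the joint scheme follows (our own illustration, mirroring the public reference implementation of WS rather than the authors’ code; the group count of 32 matches the default above):

```python
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    # Convolution whose weights are standardized per output channel,
    # i.e. the inner (Gamma - mu_w) / sigma_w term of Eq. 3.
    def forward(self, x):
        w = self.weight
        mu = w.mean(dim=(1, 2, 3), keepdim=True)           # per-filter mean
        sigma = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5  # per-filter std
        return F.conv2d(x, (w - mu) / sigma, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

def conv_gn_ws(in_ch: int, out_ch: int, groups: int = 32) -> nn.Sequential:
    # Weight-standardized convolution followed by Group Normalization;
    # GN supplies the outer (x - mu_i) / sigma_i normalization of Eq. 3.
    return nn.Sequential(
        WSConv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.GroupNorm(groups, out_ch),
        nn.ReLU(inplace=True),
    )
```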

Fig. 4. The joint application of GN and WS, utilizing \(\boldsymbol{\mu }_i\), \(\boldsymbol{\sigma }_i\), \(\boldsymbol{\mu }_w\), and \(\boldsymbol{\sigma }_w\), the means and variances of the respective statistical ranges. The normalized convolution kernel of Filter 1 is denoted Filter \(1'\), and the normalized feature layer of Feature 1 is denoted Feature \(1'\). The same operation performed on Filter 1 is repeated for Filters 2 to 5, ensuring consistency across all filters.

3 Experiment

3.1 Implementation Details

Datasets. GTSDB [7]: The GTSDB dataset consists of 900 images of size 1360 \(\times \) 800 and covers 43 types of traffic signs, grouped into three categories: triangular warning signs (yellow or red), circular prohibitory signs (red or black), and mandatory signs (blue circles or squares). CTSD: The CTSD dataset contains 1100 images captured on highways and on urban and rural roads, with sizes of 1024 \(\times \) 768 and 1280 \(\times \) 720; it classifies Chinese traffic signs into the same three categories as GTSDB.

Both the CTSD and GTSDB datasets were selected for their inclusion of small targets and multiple scales in complex environments. Figure 5 shows some typical examples from these datasets, where small targets and multiple scales of traffic signs are often present in challenging conditions such as blurred, poorly lit, damaged, and obscured signs. The datasets were divided into training and test sets in a 7:1 ratio.

Evaluation Metrics. In this paper, the evaluation metrics used are \(AP_{50}\), \(AP_{75}\), \(AP_S\), \(AP_M\), \(AP_L\), and AR, which are calculated following the methods described in COCO [10].
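These metrics can be computed with the standard pycocotools pipeline, as in the sketch below (the annotation and detection file names are placeholders of ours):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/test.json")       # COCO-format ground truth (placeholder path)
coco_dt = coco_gt.loadRes("detections.json")  # model detections (placeholder path)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, AP_S, AP_M, AP_L, and AR
```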

Setting. All experiments were conducted on Ubuntu 20.04 using PyTorch 1.9.1. The training process utilized an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The SGD optimizer was employed with an initial learning rate of 0.00125, momentum of 0.9, and weight decay of 0.0001.
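For reference, this optimizer configuration corresponds to the following PyTorch call (the model variable is a stand-in for the full MDCN detector, which we do not reproduce here):

```python
import torch
import torchvision

model = torchvision.models.resnet50()  # placeholder for the MDCN detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.00125,
                            momentum=0.9, weight_decay=0.0001)
```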

Fig. 5. Some difficult examples for traffic sign detection.

3.2 Results and Analysis

Comparison with State-of-the-Art Methods on CTSD. We compared our MDCN method with several state-of-the-art approaches on the CTSD dataset. The comparison results are presented in Table 1 and Table 2.

Our proposed MDCN achieves 83.0 AP and 87.6 AR, surpassing all other methods in the table and demonstrating a strong balance between minimizing false detections and reducing missed detections. In terms of detection metrics, MDCN not only improves significantly over the original Faster R-CNN but also outperforms both two-stage and single-stage detection algorithms: it exceeds the second-best method, DH Faster R-CNN, by 1.7 AP and the lowest-scoring method, YOLOF, by 14.9 AP. MDCN leads on all metrics except \(AP_L\), demonstrating the superiority of our method in multi-scale detection.

Furthermore, our proposed method achieves outstanding results for all three major categories of traffic signs, particularly the mandatory category, where it outperforms all other detection methods by at least 3 points. This margin is notable because directional traffic signs vary widely in shape and scale, and their rectangular aspect ratios are often more extreme than those of other signs, making multi-scale detection in this category especially challenging.

Overall, the results validate the effectiveness of our MDCN method in achieving superior performance in multi-scale traffic sign detection.

Table 1. Comparison with the state-of-the-art methods on the CTSD dataset. Set Wa is the abbreviation of Warning, Pr is the abbreviation of Prohibitory, and Ma is the abbreviation of Mandatory.

Comparison with State-of-the-Art Methods on GTSDB. We conducted a comparison between MDCN and several popular detection algorithms on the GTSDB dataset, which is widely recognized as a representative dataset in the field of traffic sign object detection. Table 2 presents the results of this comparison.

Among the two-stage algorithms, DH Faster R-CNN and Cascade R-CNN achieved commendable detection performance with 77.9 AP and 76.6 AP, respectively; MDCN surpassed them both with 78.7 AP. Furthermore, MDCN outperformed the baseline by 5.5 AP and achieved 1.48 times the AP of YOLOF, the lowest-scoring method. Overall, MDCN exhibited superior detection accuracy, particularly for the warning class of traffic signs, which typically occupy a smaller proportion of the image.

Additionally, MDCN demonstrated higher AP scores in the \(AP_S\) and \(AP_L\) metrics compared to all other methods. Although its \(AP_M\) was slightly lower than Dynamic R-CNN, Cascade R-CNN, and DH Faster R-CNN by 0.6 \(AP_M\), 1.2 \(AP_M\), and 2.0 \(AP_M\), respectively, MDCN still outperformed them by significant margins in the \(AP_S\) metric (10.1 \(AP_S\), 13.1 \(AP_S\), and 6.8 \(AP_S\), respectively). Moreover, MDCN’s \(AP_L\) was higher than that of these methods. These results indicate that our method exhibits superior and more stable performance in multi-scale detection.

In conclusion, MDCN demonstrated excellent detection performance on the GTSDB dataset, affirming the effectiveness and generalization capabilities of our proposed model.

Table 2. Comparison with the state-of-the-art methods on the GTSDB dataset. Set Wa is the abbreviation of Warning, Pr is the abbreviation of Prohibitory, and Ma is the abbreviation of Mandatory.

3.3 Ablation Studies

We performed a series of ablation experiments to demonstrate the effectiveness of MDCN for the detection of small traffic sign targets at multiple scales in real-world complex environments.

Component Ablation Studies of MDCN. The effectiveness of various optimization components in improving the performance of the baseline model was evaluated through experiments on the CTSD and GTSDB datasets, and the results are summarized in Table 3 and Table 4.

The addition of GN led to improvements of 1.4 and 3.4 AP on the CTSD and GTSDB datasets, respectively, highlighting its positive impact on detecting small targets. Combining GN with WS further enhanced detection performance, yielding additional gains of 1.5 AP and 1.2 AP, as well as 8.9 and 15.4 in \(AP_S\), respectively, compared to GN alone. This demonstrates that the combination of GN and WS contributes to improved detection of small targets.

The integration of MDRNet improved the results by 3.7 AP and 5.3 AP on the two datasets compared to the baseline. Moreover, it led to notable improvements in \(AP_S\), \(AP_M\), and \(AP_L\), with gains of 8.9, 2.6, and 4.1, respectively. The largest improvement was observed in \(AP_S\), indicating the effectiveness of MDRNet in detecting small targets. This improvement can be attributed to MDRNet’s ability to capture more contextual information, which helps reduce both false detections and missed detections for small targets.

Furthermore, the inclusion of the MDEF method resulted in additional improvements of 4.0 AP and 5.5 AP on the two datasets compared to the baseline. MDEF effectively expanded the number of small targets in the dataset and disrupted the regular positioning of traffic signs on the road, thereby preventing overfitting to specific road environments.

Visualizations in Fig. 6 further demonstrate the superior performance of MDCN compared to the baseline. MDCN exhibits greater sensitivity to traffic sign objects with uneven positional distribution and a wide range of scales. Additionally, it demonstrates better detection performance for small objects, as evident from the second row of the visualization.

Table 3. Ablation study on the effectiveness of the various MDCN component modules on CTSD dataset. MDEF is short for Multiple Data augmentation Fusion.
Table 4. Ablation study on the effectiveness of the various MDCN component modules on GTSDB dataset. MDEF is short for Multiple Data augmentation Fusion.
Fig. 6. Comparison of detection results between the baseline and MDCN. (a) is the original image. (b) is the detection result of the baseline. (c) is the detection result of MDCN. The images in the last two columns are taken from the green box in the first column. (Color figure online)

Ablation Experiments of Dilated Convolution Embedding Positions. In our ablation experiments, we investigated the impact of dilated convolution on the task of traffic sign detection. Dilated convolution is known for its ability to expand the receptive field while preserving resolution. However, it is crucial to carefully select the positions where regular convolutions are replaced with dilated convolutions to achieve optimal results.

We examined the effect of incorporating dilated convolutions bottom-up, from the conv2 to conv5 stages of the backbone network. The results, shown in Table 5 and Table 6, reveal that adding dilated convolutions in the last two stages of the convolutional hierarchy yielded the most significant improvements: on the CTSD dataset the AP increased to 81.7 and 81.6, while on GTSDB it improved to 76.4 and 77.6, respectively.

This observation can be attributed to the fact that deeper convolution layers tend to lose internal data structures, resulting in the loss of some crucial information related to small objects. By replacing the deeper convolutions with dilated convolutions, we were able to improve the localization accuracy of small targets.

Building upon these findings, we gradually added dilated convolution modules in pairs until all 3\(\,\times \,\)3 convolutions in the c4 and c5 stages were replaced. This configuration constitutes the optimal embedding position for MDRNet, achieving the best performance of 83.0 AP and 78.7 AP on the two datasets; the sketch below illustrates the corresponding configuration.
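In terms of the in-place patch sketched in Sect. 2.2 (again an illustration of ours, with layer1 to layer4 in torchvision naming corresponding to c2 to c5), the best setting amounts to:

```python
import torch.nn as nn
import torchvision

def dilate_stage(stage: nn.Module, rate: int = 2) -> None:
    # Same in-place patch as in the Sect. 2.2 sketch: dilate each 3x3 conv
    # and enlarge its padding so the spatial output size is preserved.
    for m in stage.modules():
        if isinstance(m, nn.Conv2d) and m.kernel_size == (3, 3):
            m.dilation = m.padding = (rate, rate)

backbone = torchvision.models.resnet50(pretrained=True)
for name in ("layer3", "layer4"):  # the best-performing setting (c4 + c5)
    dilate_stage(getattr(backbone, name))
```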

These ablation experiments highlight the importance of selecting appropriate positions for dilated convolutions, and they provide valuable insights into the optimal design choices for our MDRNet architecture.

Table 5. Ablation study with gradually dilated modules on CTSD dataset. In the setting column, c2, c3, c4 and c5 stand for applying dilated convolution at c2, c3, c4 and c5 layers.
Table 6. Ablation study with gradually dilated modules on GTSDB dataset. In the setting column, c2, c3, c4 and c5 stand for applying dilated convolution at c2, c3, c4 and c5 layers.

4 Conclusion

The primary objective of this paper is to enhance the detection performance of small, multi-scale traffic signs in complex real-world environments. To achieve this goal, we propose MDCN, a traffic sign detection method based on Faster R-CNN. Our approach incorporates MDRNet as the backbone network, which effectively enlarges the receptive field and sampling rate, enhancing the feature representation of targets at different scales and of those that may be partially obscured. Considering the characteristics of the traffic sign datasets, hardware resource limitations, and the datasets’ high resolution, we introduce a combined normalization method at the backbone and pooling layers; this technique accelerates convergence and improves the generalization ability of the model. Furthermore, we employ a multi-data augmentation fusion strategy to expand the number of small targets and enhance the network’s robustness. The effectiveness and generalization of MDCN are validated through comprehensive evaluations on the CTSD and GTSDB datasets, and the results demonstrate its superiority in detecting small traffic signs and its competitiveness with other state-of-the-art methods.

For future research, we intend to explore traffic sign recognition techniques tailored specifically for mobile terminals. By adapting the methodology to mobile devices, we aim to further expand the applicability and practicality of traffic sign detection in real-world scenarios.