1 Introduction

As the foundation of the national economy, the road transportation system is developing rapidly. Meanwhile, traffic problems such as urban congestion, frequent accidents, and increased air pollution have become increasingly prominent. Traffic accidents in particular endanger personal safety and social security. The main causes of accidents include fatigue driving, illegal driving, and bad weather; among them, subjective driver behavior such as driving in violation of traffic signs is one of the leading causes. Therefore, it is necessary to develop an Intelligent Transport System (ITS) to assist the driver [51]. In addition, with the global spread of the COVID-19 epidemic in 2020, unmanned vehicles have been used in many hospitals to distribute emergency supplies, bringing unmanned vehicles back into the public's attention. In summary, the Traffic Sign Detection (TSD) system is an important sub-module of ITS [48], and its detection accuracy is an important prerequisite for ITS to effectively assist the driver and for unmanned systems to drive safely.

TSD can be regarded as a sub-task in the field of object detection, where the goal is to detect traffic signs and their boundaries. Traffic signs are designed in a specific pattern, differentiated from their surroundings mainly by color, shape, and what they signify. Therefore, early traffic sign recognition algorithms were mainly aimed at the localization and classification of target regions. Traditional methods [55] used color thresholding and shape analysis to segment traffic signs from images. With the development of computer vision technology, deep learning has demonstrated the powerful ability to learn feature representations from raw data, which has received great attention in pattern recognition and computer vision research. It has been widely used in object detection and recognition. For example, Convolutional Neural Networks (CNNs) have shown their powerful feature extraction ability [6]. Many CNN-based methods have achieved fruitful results in object detection tasks.

However, traffic sign recognition still faces the following challenges:

  1. Under different viewing angles and distances, traffic sign images may exhibit distortions of shape and color.

  2. The complicated road environment can produce complex backgrounds behind traffic signs.

  3. Traffic signs have characteristics that most object detection targets do not have, so it is hard to obtain satisfactory performance by simply applying conventional object detection methods.

In order to address the above challenges, scholars have conducted extensive studies. However, these early methods are difficult to apply widely in practical detection scenarios. On the one hand, designing feature extraction methods for specific traffic sign categories requires a great deal of work and consumes manpower and material resources. On the other hand, simple feature extraction methods are not powerful enough to deal with complex and changing traffic environments. In addition, in real traffic images captured by in-vehicle equipment, traffic signs often occupy only a small part of the frame, as shown in Fig. 1. Conventional object detection classifiers use a series of down-sampling operations to obtain high-level feature maps, which leads to the loss of small targets and is unfavorable for the TSD task dominated by small targets. It is therefore difficult to obtain satisfactory performance simply by using traditional object detection methods. For this reason, a series of strong backbone networks such as VGGNet [39], ResNet [10] and DenseNet [14] have been proposed; typical detection models include Fast R-CNN [7], YOLO [32], SSD [23], and RetinaNet [20].

Fig. 1

Image sample in TT100K. (a) Original images in the TT100K dataset; the green rectangular regions contain traffic signs. (b) Image patches cropped from (a) according to the green rectangles

Although increasing the complexity of the detection classifier can improve detection performance, a more complex model heavily increases the number of parameters and the amount of computation. In actual scenarios, the TSD system must be deployed on onboard embedded devices to identify traffic signs effectively. Overly large models struggle to meet the real-time requirements of industrial applications, so lightweight networks have emerged. SqueezeNet [15] uses common compression techniques to compress and then expand the model; with performance similar to AlexNet [18], its parameter size is only 1/50 of AlexNet's. However, the network still adopts the standard convolution calculation method. MobileNet [12] employs the more effective depthwise separable convolution, which improves network speed and further promotes the application of convolutional networks on mobile terminals, obtaining higher precision with less computation; theoretically, the amount of computation can still be reduced. ShuffleNet [52] uses group convolution and channel shuffling to effectively reduce the computation of pointwise convolution and achieves better performance. With the advancement of mobile devices and the diversification of application scenarios, lightweight networks show high engineering value. Therefore, how to ensure the accuracy and speed of TSD simultaneously is still a difficult problem.

Inspired by the above methods, the purpose of this study is to develop a lightweight method for TSD that strikes a balance between accuracy and efficiency and solves the problem of small target loss. In this paper, a new traffic sign recognition method called Ghost-YOLO is proposed. The main contributions can be summarized as follows:

  1. The detection and recognition of traffic signs in the actual environment is one of the technical bottlenecks of ITS. Through experimental research and testing, this paper provides a scientific means and framework for the accurate recognition of traffic signs.

  2. The Ghost-YOLO model is proposed. Based on GhostConv, a more lightweight C3Ghost structure is designed to replace the backbone of the YOLOv5 object detection model. After realizing model compression and faster inference, this paper obtains an optimized neural network model that greatly reduces dependence on the hardware environment.

  3. Aiming at the many small objects present in actual TSD scenes, a multi-scale feature fusion detection head is used to detect large, medium and small scales, which improves the detection performance on small objects.

  4. Experimental results on the TT100K dataset show that, compared with several current advanced object detection methods, this method obtains a more lightweight model while maintaining competitive performance, which enhances the practicability of the model.

The rest of this paper is organized as follows. Section 2 introduces the related work of TSD in recent years. Section 3 introduces the C3Ghost module design method, multi-scale feature fusion scheme, and overall model structure. Section 4 describes the dataset, experimental setup and experimental results. Finally, Section 5 provides a summary and outlook of this paper.

2 Related work

2.1 Traditional TSD

Traffic signs are usually designed in specific patterns and are mainly distinguished by color, shape, and the content they carry, which are often different from the surroundings. Color and shape are the basic attributes of traffic signs, and this information was used for identification in early research. The core of color-based detection algorithms is the choice of color space, and the images collected by in-vehicle equipment are generally RGB images. Benallal et al. found that the RGB components differ significantly under different lighting conditions; segmenting the RGB images collected by the camera directly reduces the amount of calculation [3], thereby greatly improving speed and meeting the real-time requirements of the algorithm. However, when detecting in a complex environment, interference such as background noise is mixed with the traffic signs, so algorithms that only consider the color space cannot achieve good detection results. Several solutions have been proposed. Zhou et al. used color thresholding and shape analysis to segment traffic signs from images, and fused complementary data obtained from different sensors, including the prior location, color, laser reflectivity, and lidar data of traffic signs, which improved the robustness of the algorithm [55]. Zhu et al. converted the image from the RGB model to the HSI model, extracted red from the H channel, used a LoG template to extract edges, and finally adopted a BP network to process the image [38]. However, converting from RGB to the HSI color space requires a certain amount of computation, which calls for hardware acceleration to maintain real-time performance.
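
As a rough illustration of the color-thresholding idea described above, the following sketch segments red regions (a typical traffic sign color) with OpenCV. It uses the HSV color space (rather than the HSI model mentioned above) because OpenCV supports it directly, and the threshold values are illustrative assumptions, not values taken from the cited works.

```python
import cv2
import numpy as np

def segment_red_signs(bgr_image: np.ndarray) -> np.ndarray:
    """Return a binary mask of roughly red regions (candidate sign areas)."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Red wraps around the hue axis, so two hue ranges are combined.
    # These thresholds are illustrative and would need tuning per dataset.
    mask_low = cv2.inRange(hsv, (0, 80, 60), (10, 255, 255))
    mask_high = cv2.inRange(hsv, (170, 80, 60), (180, 255, 255))
    mask = cv2.bitwise_or(mask_low, mask_high)
    # Simple morphology to suppress background noise before shape analysis.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```

In a traditional pipeline, the resulting mask would then be passed to shape analysis (e.g., contour or Hough-based checks) to confirm sign candidates.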

2.2 Deep learning-based TSD

With the wide application of deep learning technology in various fields, it has demonstrated a powerful ability to learn feature representations from raw data. The representative CNN is one of the most widely used deep learning models in computer vision [22]. Therefore, many CNN-based methods have been adapted to address the task of TSD. Various object detection models were improved in [1] and applied to TSD. Sermanet et al. used a multi-scale CNN for TSD and obtained an accuracy of 99.17% [36]. Belghaouti et al. proposed an automatic road sign recognition system based on the LeNet model, which achieved 99% accuracy on the German traffic sign dataset [2]. Song et al. proposed an efficient convolutional neural network (CNN) that significantly reduces redundant parameters and increases the speed of the network [40]. Wang et al. proposed a new space-cover convolutional neural network (SC-CNN) for this problem [46]. Zhou et al. proposed the Ice Environment Traffic Sign Recognition Benchmark (ITSRB) and Detection Benchmark (ITSDB), annotated in the COCO2017 dataset format; they put forward an attention-network-based approach for high-resolution traffic sign classification (PFANet) and performed ablation experiments on the designed parallel fused attention module [56]. Zhu et al. modified the OverFeat framework and proposed a single network that simultaneously detects and classifies traffic signs [57]. Li developed a novel perceptual generative adversarial network that improves detection performance by generating super-resolution images of small traffic signs [19]. MR-CNN [25] adopts a multi-scale deconvolution structure that combines features from deep and shallow layers; the fused feature maps reduce the number of region proposals to a certain extent and improve the efficiency of TSD. In [30], a feature aggregation structure is proposed to aggregate regional features of different scales, which improves performance on small traffic signs. Zhang et al. proposed a cascaded R-CNN network for detecting small traffic sign instances and designed a data augmentation method to increase the number of hard negative samples [53]. SADANet [26] combines a domain-adaptive network and a multi-scale prediction network to address the scale variation problem. TSD methods based on deep learning learn features from large amounts of data, which gives them an advantage over traditional methods relying on hand-crafted features; they are also less easily affected by external factors such as illumination and occlusion, and they generalize better than traditional detection methods.

2.3 Multi-scale feature fusion

In the object detection task, the most important problem is how to extract target features more accurately [9]. Current neural networks typically use deep convolutional structures for feature extraction. As the network deepens, the receptive field gradually increases and so does the semantic expressiveness, but the resolution of the feature maps decreases and many fine details, such as small traffic signs, become blurred after multiple convolutional layers. Shallow layers have smaller receptive fields and richer details, but the semantic information they extract is weaker. In order to obtain accurate semantic information, traditional object detection models usually only use the feature map output by the last layer of the feature extraction network to classify and locate objects. However, this last feature map corresponds to a large down-sampling rate, so it carries less effective information and weakens the detection of small targets. Multi-scale feature fusion solves this problem well. FPN (Feature Pyramid Network) used an RPN to extract candidate regions on the feature pyramid [33]; by fusing deep features with shallow features, predictions were made at multiple scales of the pyramid, which enhanced the semantics of shallow feature maps and improved the detection accuracy of small targets. The study [44] proposed an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low-computational-cost segmentation head and a learnable post-processing step.

2.4 Research on model lightweight

Deep neural networks (DNNs) have recently achieved great success in many visual recognition tasks. However, existing neural networks require a lot of memory and computation, making them difficult to deploy on devices with limited memory resources. Solving these problems requires joint efforts from many disciplines, including but not limited to machine learning, system structure optimization, and hardware design. Reference [16] proposed using different tensor decomposition schemes, achieving a 4.5x speedup at the cost of only 1% accuracy. Because the translation-invariant property is preserved when extracting features from the input image, the parameters of a CNN are used efficiently, which is key to successfully training deep neural networks and avoiding overfitting. Using a compact convolution kernel to replace a convolution kernel with a large number of parameters can directly reduce the amount of computation. SqueezeNet [47] adopts 1 × 1 convolutional layers instead of 3 × 3 convolutional layers, which reduces the number of parameters; the same approach is also adopted by MobileNets [13]. The work in [37] introduces a more advanced successor of CNNs called 3-D CNNs, and the reported computing time (0.19 seconds per frame) shows that the proposal may be used in real-time applications. The work in [35] exploits the advantages of deep neural networks to solve the network compression problem, proposing FitNets, which train deep yet lightweight networks to compress large deep neural networks.

3 Proposed method

3.1 Overall architecture

The YOLOv5 [27, 42] network is the latest model in the YOLO series. It offers high detection accuracy and fast inference, with a maximum detection speed of up to 140 frames per second. The weight file of the YOLOv5 network is nearly 90% smaller than that of YOLOv4, which indicates that YOLOv5 is suitable for deployment on embedded devices for real-time target detection. However, YOLOv5 still has shortcomings in small target detection and cannot accurately identify very small targets. To address this, a further multi-scale detection layer is added in the latest YOLOv5 series, named YOLOv5-P6 [42], but this comes with a substantial increase in the number of parameters and FLOPs. The YOLOv5 model has four architectures, named YOLOv5-s [42], YOLOv5-m [42], YOLOv5-l [42] and YOLOv5-x [42]. The main differences between them are the depth and number of feature extraction modules and the number of convolution kernels at specific positions; the size and parameter count of the four models increase accordingly. This paper needs to identify many small targets, and the intelligent driving system places high requirements on the real-time and lightweight performance of the recognition model. Hence, the accuracy, efficiency and scale of the recognition model are comprehensively considered in this paper, and the improved design is carried out based on the YOLOv5-s architecture.

The overall framework is shown in Fig. 2 and consists of the backbone and the head; the head is composed of the neck and the detector. Features are extracted from the input image by the backbone and then used for localization and classification, so that the position and class of traffic signs can be detected. The backbone network consists of Focus, Ghost Convolution (GhostConv), GhostBottleneck with three convolutional layers (C3Ghost) and Spatial Pyramid Pooling (SPP). The first layer of the backbone is the Focus module, shown in Fig. 3. This module performs a slicing operation on the 640 × 640 × 3 input image and divides it into 4 parts; the 4 parts complement each other and expand the 3 channels of the input image to 12 channels. Finally, convolution is performed on the generated new image. The Focus module reduces the cost of convolution: tensor reshaping is used to downsample and increase channels, which reduces FLOPs and increases speed. The channel expansion algorithm adopted by this module is as follows:

Fig. 2

The architecture of Ghost-YOLO

Fig. 3

Focus module

Algorithm 1: The process of Channel Expansion.
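
Since the pseudocode of Algorithm 1 is not reproduced here, the following PyTorch sketch illustrates the slicing-based channel expansion performed by the Focus module, under the assumption that it matches the standard YOLOv5 implementation: every second pixel is taken in each spatial direction and the four slices are concatenated along the channel axis.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice an image into 4 interleaved patches, stack them on the channel
    axis (3 -> 12 channels for an RGB input), then apply a convolution."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels * 4, out_channels, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> (B, 4C, H/2, W/2) by taking every second pixel.
        patches = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2],
             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(patches)

# Example: a 640x640 RGB image becomes a (B, 12, 320, 320) tensor before the conv.
# y = Focus(3, 32)(torch.randn(1, 3, 640, 640))   # -> (1, 32, 320, 320)
```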

The C3Ghost module follows the structure of CSPNet [45] and combines GhostConv to perform convolution on the feature maps. The feature map of the base layer in each stage is divided into two parts, which realizes feature extraction by cheap operations; at the same time, the probability of duplicated information is reduced during information integration. Section 3.2 details the specific structure.

The last layer of the backbone network is the SPP module. By combining three multi-scale max-pooling layers, the receptive field can be greatly enlarged with almost no speed loss while extracting features. It also effectively reduces the loss of image information that can occur when images are directly stretched, which helps preserve detection accuracy.
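
A minimal sketch of the SPP idea described above, assuming the commonly used YOLOv5-style pooling kernel sizes (5, 9, 13); the exact kernel sizes used in Ghost-YOLO are not specified in the text and are an assumption here.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Concatenate the input with three max-pooled versions of itself to
    enlarge the receptive field without reducing spatial resolution."""

    def __init__(self, channels: int, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes)
        # 1x1 convolution to fuse the concatenated maps back to `channels`.
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([x] + [pool(x) for pool in self.pools], dim=1))
```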

In the head part, the high-level feature information and the low-level feature information are transferred and fused by upsampling to realize a top-down information transfer structure. A Concat operation is performed on the low-level and high-level features so that the high-resolution low-level features can easily be propagated to the higher levels, thereby realizing the PANet [24] structure. This effectively exploits the complementary advantages of multi-scale features and improves the accuracy of target recognition.

Based on YOLOv5-s, the proposed model first replaces the convolutional layers and Bottleneck with the GhostConv module and the C3Ghost module to extract features. It then adds a detection layer at a small target scale to the detector, which identifies small targets more effectively. Finally, the feature maps of each scale are fed into the Detect module. In general, this method effectively enhances the recognition of small targets at high resolution. Moreover, it has fewer parameters and less computation, achieving model compression while maintaining accurate detection, which is beneficial for dealing with complex TSD scenes.

3.2 C3Ghost

CNNs have shown excellent performance in various computer vision tasks. Traditional CNNs usually require a large number of parameters and FLOPs to achieve satisfactory accuracy. Considering the extensive redundancy in the intermediate feature maps computed by mainstream CNNs, GhostNet [8] proposed an innovative convolution module, named the Ghost module. It generates more feature maps through cheap linear operations to obtain the same effect as the original convolution. This new basic unit successfully achieves more feature maps with fewer parameters and computations, as shown in Fig. 4(a). Given input data \( X\in {R}^{c\times h\times w} \), where h and w are the height and width of the input data and c is the number of channels, any convolution operation used to generate n feature maps can be expressed as

$$ Y=X\cdotp \omega +b $$
(1)

where \( Y\in {R}^{n\times {h}^{\prime}\times {w}^{\prime }} \) represents the output of n feature maps with height h′ and width w′, \( \omega \in {R}^{c\times k\times k\times n} \) represents the convolution performed by c × n convolution kernels of size k × k, and b is the bias term. It is not difficult to find that the FLOPs required by this convolution amount to n ∙ h′ ∙ w′ ∙ c ∙ k ∙ k. This value usually reaches hundreds of thousands, because the number of convolution kernels n and the number of channels c are usually very large.

Fig. 4

Standard convolution and Ghost module

According to Formula 1, the dimensions of the input and output maps explicitly determine the number of parameters to be optimized (in ω and b). The Ghost module is based on the observation that the feature maps generated by mainstream CNN operations contain a lot of redundancy, with some maps being similar to each other. These redundant feature maps can instead be generated by cheaper operations. As shown in Fig. 4(b), the feature extraction process of the Ghost module to generate m feature maps \( {Y}^{\prime}\in {R}^{m\times {h}^{\prime}\times {w}^{\prime }} \) can be expressed as:

$$ {Y}^{\prime }=X\cdotp {\omega}^{\prime } $$
(2)

where \( {\omega}^{\prime}\in {R}^{c\times k\times k\times m} \) represents the filters, m ≤ n, and no bias term is required. Other hyperparameters such as the convolution kernel size, stride and padding are kept consistent with ordinary convolution (Formula 1) to ensure the same output feature map size. A series of linear operations is then adopted to generate the ghost features according to the following formula:

$$ {Y}_{ij}={\varPhi}_{i,j}\left({Y}_i^{\prime}\right),\forall i=1,\dots m,j=1,\dots s $$
(3)

where \( {Y}_i^{\prime } \) represents the i-th feature map in Y′, and \( {\varPhi}_{i,j} \) is the j-th linear operation applied to \( {Y}_i^{\prime } \) to generate the j-th ghost feature map \( {Y}_{ij} \). The desired output is obtained by directly splicing the generated feature maps with the feature maps produced by the primary convolution. With the linear convolution kernel size set to d × d, comparing the computation of the Ghost module and standard convolution gives the theoretical speed-up:

$$ {C}_s=\frac{c\cdotp k\cdotp k\cdotp n\cdotp {h}^{\prime}\cdotp {w}^{\prime }}{\frac{n}{s}\cdotp {h}^{\prime}\cdotp {w}^{\prime}\cdotp c\cdotp k\cdotp k+\left(s-1\right)\cdotp \frac{n}{s}\cdotp {h}^{\prime}\cdotp {w}^{\prime}\cdotp d\cdotp d}=\frac{c\cdotp k\cdotp k}{c\cdotp k\cdotp k\cdotp \frac{1}{s}+d\cdotp d\cdotp \frac{s-1}{s}}\approx \frac{s\cdotp c}{s+c-1}\approx s $$
(4)

where d × d has a similar magnitude to k × k and s ≪ c. Hence, it can be quantitatively calculated that the computation of the Ghost module is about 1/s of that of standard convolution. The calculation for the parameter count is similar and also simplifies to s. Theoretically, the superiority of the Ghost module can thus be quantitatively demonstrated. Based on the Ghost module, the GhostBottleneck is designed; its specific structure is shown in Fig. 5(b).
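
To make the speed-up ratio in Formula 4 concrete, the following snippet plugs in illustrative values (c = 64 input channels, k = d = 3, s = 2, a 160 × 160 output); these numbers are examples for demonstration, not measurements from the paper.

```python
def conv_flops(c, n, k, h_out, w_out):
    """Multiply-accumulate count of a standard convolution (Formula 1)."""
    return n * h_out * w_out * c * k * k

def ghost_flops(c, n, k, d, s, h_out, w_out):
    """Primary convolution producing n/s maps, plus (s-1) cheap d x d maps for each."""
    primary = (n // s) * h_out * w_out * c * k * k
    cheap = (s - 1) * (n // s) * h_out * w_out * d * d
    return primary + cheap

c, n, k, d, s, h, w = 64, 128, 3, 3, 2, 160, 160
ratio = conv_flops(c, n, k, h, w) / ghost_flops(c, n, k, d, s, h, w)
print(f"speed-up ~ {ratio:.2f}")   # close to s = 2, as Formula 4 predicts
```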

Fig. 5

The structure of Ghost module

Algorithm 2: Feature Extraction Based on C3Ghost.

Taking advantage of the Ghost module and GhostBottleneck, we introduce a lightweight feature extraction structure named C3Ghost. As shown in Fig. 5(c), it consists of three 1 × 1 convolution layers and n linearly stacked GhostBottlenecks. c1 and c2 in Fig. 6 refer to the number of input and output feature map channels respectively, and h and w have the same meaning as before. The first 1 × 1 ordinary convolution reduces the number of channels to half the number of output channels. Features are then extracted by the linearly stacked GhostBottlenecks and the residual branch respectively. In this way, the deep semantic information of the input image is extracted through two branches, and the two sets of features are concatenated by a Concat module. Concat is a feature fusion operation that splices two or more feature maps along the channel dimension and better exploits the semantic information of feature maps of different scales by increasing the number of channels. Finally, the concatenated feature information passes through a BatchNorm module and uses LeakyReLU as the activation function. The feature extraction procedure of C3Ghost is shown in Algorithm 2. In this process, the feature information of the original image is effectively preserved and the loss of features in the deep network is avoided. The C3Ghost module is applied to replace all BottleneckCSP modules in YOLOv5 to reduce the amount of computation and compress the model size. In theory, this method is completely feasible.
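
Since the pseudocode of Algorithm 2 is not reproduced here, the following PyTorch sketch shows one plausible realization of GhostConv, GhostBottleneck and C3Ghost as described above. The layer hyperparameters (the depthwise 5 × 5 "cheap" operation, kernel sizes, and the use of LeakyReLU) are assumptions based on common GhostNet/YOLOv5 practice rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Produce half the output maps with a normal conv and the other half
    with a cheap depthwise conv applied to them (Formulas 2 and 3, s = 2)."""

    def __init__(self, c_in, c_out, k=1, stride=1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, stride, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.LeakyReLU(0.1, inplace=True))
        self.cheap = nn.Sequential(  # depthwise 5x5 "linear" operation
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.LeakyReLU(0.1, inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class GhostBottleneck(nn.Module):
    """Two stacked GhostConvs with a residual connection."""

    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(GhostConv(c, c, k=1), GhostConv(c, c, k=1))

    def forward(self, x):
        return x + self.block(x)

class C3Ghost(nn.Module):
    """CSP-style module: one branch stacks n GhostBottlenecks, the other is a
    plain 1x1 conv; their outputs are concatenated and fused by a 1x1 conv."""

    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_out // 2
        self.branch1 = nn.Sequential(
            nn.Conv2d(c_in, c_half, 1, bias=False),
            *[GhostBottleneck(c_half) for _ in range(n)])
        self.branch2 = nn.Conv2d(c_in, c_half, 1, bias=False)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c_half, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1, inplace=True))

    def forward(self, x):
        return self.fuse(torch.cat([self.branch1(x), self.branch2(x)], dim=1))
```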

Fig. 6

The structure of multi-scale feature fusion module

Algorithm 3: Training of Ghost-YOLO.

3.3 Improvement of the feature fusion layer

The fusion of features at different scales is an important way to improve the recognition performance of an object detection network [21]. The purpose of feature fusion is to combine the features extracted from images into a representation with more discriminative ability. Current detection and segmentation networks mainly use convolutional networks to extract target features layer by layer. Low-level feature maps have higher resolution and contain more location and detail information, but because they pass through fewer convolutional layers, they carry less semantic information and more noise. High-level feature maps have stronger semantic information, but due to the enlarged receptive field, their resolution is lower and their ability to represent geometric information is weakened. How to efficiently integrate the two is the key to improving performance.

Considering that there are many small targets in TSD, this study adds a multi-scale feature fusion detection module [29] on top of the model design described above. As shown in Fig. 7, it consists of a top-down structure and a bottom-up structure. First, feature extraction is performed on the input image, yielding five groups of feature maps of different sizes, [C1, C2, C3, C4, C5]. Through up-sampling, the network obtains four groups of feature maps [P5, P4, P3, P2] along the top-down path, and then obtains four groups of feature maps [N2, N3, N4, N5] along the bottom-up path. Element-wise addition is adopted in the fusion process, as shown in the dotted box. In addition, two shortcuts spanning multiple layers are included in the module to reduce information loss across layers. Finally, feature maps at four scales are obtained.
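
A schematic sketch of the top-down/bottom-up fusion described above, assuming all pyramid levels have already been projected to a common channel width so that element-wise addition is possible; the cross-layer shortcuts and the exact level indices used in Ghost-YOLO are simplified here.

```python
import torch
import torch.nn.functional as F

def fuse_pyramid(feats):
    """feats: list [C2, C3, C4, C5] ordered fine -> coarse, equal channel width.
    Returns the bottom-up maps [N2, N3, N4, N5] used by the detection heads."""
    # Top-down path: upsample the coarser map and add it to the finer one.
    p = [None] * len(feats)
    p[-1] = feats[-1]
    for i in range(len(feats) - 2, -1, -1):
        up = F.interpolate(p[i + 1], size=feats[i].shape[-2:], mode="nearest")
        p[i] = feats[i] + up
    # Bottom-up path: downsample the finer map and add it to the coarser one.
    n = [None] * len(p)
    n[0] = p[0]
    for i in range(1, len(p)):
        down = F.max_pool2d(n[i - 1], kernel_size=2, stride=2)
        n[i] = p[i] + down
    return n

# Example with an assumed channel width of 128 and strides 4/8/16/32 on a 640 input:
# feats = [torch.randn(1, 128, s, s) for s in (160, 80, 40, 20)]
# n2, n3, n4, n5 = fuse_pyramid(feats)
```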

Fig. 7

Typical traffic sign categories in TT100K

To incorporate the multi-scale feature fusion module into the network structure, the fusion of layers 4 and 15, 6 and 11, and 10 and 21 in the original YOLOv5-s architecture is changed to the fusion of layers 4 and 22, 6 and 18, 8 and 14, and 16 and 28 in the architecture designed in this paper. In order to improve accuracy and compensate for the information loss caused by the low resolution of high-level features, the output features of the 20th and 25th layers of the improved network are also fused.

3.4 Training

Algorithm 3 describes the construction of the dataset and the complete training process. The design of hyperparameters will be given in Section 4.

4 Experiment

4.1 Data description

In this paper, we choose the TT100K [58] dataset as the experimental object. The TT100K dataset contains 9170 images, of which 6105 are used as the training set and 4071 as the test set. The images are 2048 × 2048 pixels and cover different lighting and weather conditions. The size of a traffic sign ranges from 8 × 8 to 400 × 400 pixels, which is about 0.001%–4% of the whole picture. We ignore classes with fewer than 100 instances in TT100K to ensure there is enough data for each type of traffic sign, leaving 45 classes for detection. The visualization results of an analysis of the dataset are shown in Fig. 8. The dataset is annotated in PASCAL VOC format, but YOLOv5 requires txt label files in YOLO format, and Ghost-YOLO inherits this requirement. The YOLO format is specifically (class_id, x, y, w, h), where all values are normalized, so the original annotations need to be converted accordingly. The conversion rules are as follows:

$$ x={x}_{center}/ width $$
(5)
$$ y={y}_{center}/ height $$
(6)
$$ w=\left({\mathrm{x}}_{\mathrm{max}}-{x}_{\mathrm{min}}\right)/ width $$
(7)
$$ h=\left({\mathrm{y}}_{\mathrm{max}}-{y}_{\mathrm{min}}\right)/ height $$
(8)
Fig. 8

Dataset analysis

Here, xmin, ymin, xmax and ymax refer to the coordinates of the upper-left and lower-right corners of the annotated object, measured relative to the upper-left corner of the image in the VOC annotation format; these coordinates are given in the dataset, and xcenter and ycenter in Formulas 5 and 6 are the midpoints of the corresponding coordinate pairs.
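
A minimal sketch of the conversion in Formulas 5–8. The annotation parsing assumes a standard PASCAL VOC XML layout, and the class-to-index mapping passed in is a hypothetical placeholder.

```python
import xml.etree.ElementTree as ET

def voc_box_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert absolute VOC corner coordinates to normalized YOLO (x, y, w, h)."""
    x = (xmin + xmax) / 2.0 / img_w   # Formula 5, with x_center = (xmin + xmax) / 2
    y = (ymin + ymax) / 2.0 / img_h   # Formula 6
    w = (xmax - xmin) / img_w         # Formula 7
    h = (ymax - ymin) / img_h         # Formula 8
    return x, y, w, h

def convert_annotation(xml_path, class_to_id):
    """Yield 'class_id x y w h' lines for one VOC XML annotation file."""
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in class_to_id:       # skip rare classes (< 100 instances)
            continue
        box = obj.find("bndbox")
        coords = [float(box.find(tag).text)
                  for tag in ("xmin", "ymin", "xmax", "ymax")]
        x, y, w, h = voc_box_to_yolo(*coords, img_w, img_h)
        yield f"{class_to_id[name]} {x:.6f} {y:.6f} {w:.6f} {h:.6f}"
```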

4.2 Metrics and experiment setup

In the object detection task, detections and ground truths can be divided into three types: TP (true positive) denotes targets that are correctly detected, FN (false negative) denotes targets that are not detected, and FP (false positive) denotes incorrect detections. Three criteria are used to evaluate performance in this study: precision, recall and mAP. Precision (P) [50] evaluates the percentage of correct predictions among the results, and recall (R) [50] evaluates how many positive samples are correctly detected. These two criteria are defined as follows:

$$ precision=\frac{TP}{TP+ FP} $$
(9)
$$ recall=\frac{TP}{TP+ FN} $$
(10)

Mean average precision (mAP) [50] is a commonly used metric to evaluate object detectors, as shown in Formula 11. To be counted as a true positive, the intersection-over-union (IoU) overlap between a detection and the ground truth needs to exceed a defined minimum value; IoU represents the overlap between the predicted and real boxes, i.e., the ratio of their intersection to their union. We use two types of mAP here: mAP_0.5 refers to the average AP over classes when the IoU threshold is set to 0.5, and mAP_0.5:0.95 refers to the average mAP over IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05.

$$ mAP=\frac{1}{N}\sum \limits_{i=1}^N{AP}_i $$
(11)
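
For reference, the following short sketch shows the IoU computation that underlies the TP/FP decision and the precision/recall definitions of Formulas 9 and 10; boxes are assumed to be in (x1, y1, x2, y2) corner format.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(tp, fp, fn):
    """Formulas 9 and 10."""
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

# A detection counts as a true positive only if its IoU with a ground-truth box
# exceeds the chosen threshold (0.5 for mAP_0.5; 0.5:0.05:0.95 for the averaged metric).
```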

The experiment was run on a GPU server; Table 1 shows the details of the experimental environment. We mainly use Python 3.8, PyTorch, OpenCV and other required libraries to implement our model.

Table 1 Experimental environment

The proposed Ghost-YOLO was trained with a backpropagation learning algorithm using CIoU (Complete-IoU) and BCE (Binary Cross Entropy) as the loss functions and stochastic gradient descent (SGD) as the optimizer. The model has about 30 hyperparameters for training, including training parameters and image processing parameters; the initial values of the key hyperparameters are given in Table 2. The training parameters include various coefficients and momentum, and the image processing parameters include the coefficients for data augmentation. The learning rate is an important hyperparameter in model training, and setting a proper learning rate helps training. All experiments in this paper use warmup [11] to avoid the model oscillation caused by a high initial learning rate, and a cosine schedule is used to update the learning rate after warmup. The learning rate of the bias parameters is decreased from 0.1 to the preset learning rate of 0.01, the learning rate of the other parameters is increased from 0 to 0.01, and both are then attenuated according to the cosine function. All experiments are trained for 700 epochs.

Table 2 Settings of parameters
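
The warmup and cosine decay described above can be sketched as follows. The warmup length (in iterations) and the final learning-rate floor are assumptions, since only the initial value (0.01), the bias warmup start (0.1) and the total epoch count (700) are stated in the text.

```python
import math

def learning_rate(step, warmup_steps, total_steps,
                  base_lr=0.01, bias_start=0.1, final_factor=0.01, is_bias=False):
    """Warmup followed by cosine decay of the learning rate."""
    if step < warmup_steps:
        frac = step / max(1, warmup_steps)
        start = bias_start if is_bias else 0.0   # biases warm *down*, others warm *up*
        return start + frac * (base_lr - start)
    # Cosine decay from base_lr down to base_lr * final_factor.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (final_factor + (1.0 - final_factor) * cosine)

# Example: with 3 warmup epochs out of 700 and 100 iterations per epoch,
# learning_rate(step, warmup_steps=300, total_steps=70_000) gives the value
# applied to the non-bias parameter groups at a given iteration.
```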

4.3 Result analysis

To demonstrate the advantages of the proposed method in the task of TSD, we compare Ghost-YOLO with RetinaNet [20], Faster R-CNN [34], R-FCN [5], SSD [23], YOLOv3 [31], MSA_YOLOv3 [54], YOLOv4 [4], and the original YOLOv5-s. Among these models, Faster R-CNN is a representative two-stage detector, while YOLOv3 is a representative one-stage detector. As can be seen from Table 3, RetinaNet achieves relatively poor results, with a precision of only 69.83%. YOLOv3 is an efficient one-stage detector with results comparable to Faster R-CNN. These results demonstrate that simply applying a generic object detector to TSD does not yield significant results. The Ghost-YOLO proposed in this paper achieves competitive performance on the TT100K dataset: the precision is 93.48%, the recall is 89.65%, mAP_0.5 is 92.71%, and mAP_0.5:0.95 is 73.31%. Compared with the YOLOv5-s model, the precision increases by 5.2%, the recall by 4.13%, mAP_0.5 by 3.26%, and mAP_0.5:0.95 by 3.04%. Compared with the two-stage detectors, mAP_0.5 is 6.8% higher than Faster R-CNN and 5.52% higher than R-FCN. Compared with the other established models, the Ghost-YOLO model is competitive on all detection metrics.

Table 3 The recognition results of different methods on TT100K dataset

Moreover, some state-of-the-art methods were compared. As shown in Table 4, our method achieves similar or even better results than these latest methods. Finally, we show the detection results (mAP_0.5) of the different algorithms for each category. Methods such as Faster R-CNN perform better on large objects but worse on small ones. Our method achieves better performance in most categories, especially in small-sign categories such as 'wo' and 'io', where Ghost-YOLO shows a significant improvement, which also reflects the benefit of multi-scale feature fusion for small target detection (Table 5).

Table 4 Comparison with the state-of-the-art method on the TT100K dataset
Table 5 Comparison of mAP_0.5 for each class in TT100K dataset

To further confirm the efficiency of the modules and network proposed in this article, Table 6 presents mAP_0.5, speed and several other evaluation indicators of the improved model. We consider inference speeds above 30 FPS to be real-time detection. It can be seen that the YOLO series has an obvious advantage in speed: the inference speed of our model reaches 56 FPS, YOLOv5-s reaches 48.3 FPS, slightly lower than our model, and YOLOv3-tiny reaches 60.4 FPS, which is the fastest. Ghost-YOLO also has advantages in network size; the amount of computation and the number of parameters are compressed to 50.29% and 91.4% of the original, and the inference speed is improved by 6.25%. Compared with YOLOv3-tiny, a representative lightweight model, the amount of computation and parameters are compressed to 65.64% and 75.8%. Although the inference speed is slightly slower, our model maintains high accuracy while performing real-time detection, which shows that our method achieves a balance between accuracy and speed. Finally, we compare the training process of the improved method with those of YOLOv5-s and YOLOv3, which are representative models of the YOLO series, and the comparable lightweight model YOLOv3-tiny. Figure 9 sketches these curves. As the models learn on the dataset, these values all increase rapidly; the changes stabilize around 100 epochs while still gradually increasing, and all fluctuations remain within acceptable limits. When training ends at 700 epochs, the metrics of each model reach their maximum values, and Ghost-YOLO achieves the best performance.

Table 6 Performance comparison of each model
Fig. 9

Performance comparisons of Ghost-YOLO, YOLOv5, YOLOv3 and YOLOv3-tiny

In addition to the training curves above, this paper also presents the classification confusion matrix, as shown in Fig. 10. The confusion matrix summarizes the classification results and serves as an accuracy evaluation matrix; the darker the color, the higher the recognition rate of the corresponding class.

Fig. 10

Confusion matrix of Ghost-YOLO on test set

4.4 Ablation studies

In this section, we verify the impact of each component of Ghost-YOLO on the final performance by conducting an ablation study on the TT100K dataset. The baseline is the original YOLOv5-s. As shown in Table 7, we compare the results in terms of mAP_0.5 and FPS. Compared to the baseline, the model using only the C3Ghost module shows a significant improvement in speed and a small loss of accuracy that remains within acceptable limits. This is because GhostConv replaces part of the complex convolution with cheap linear operations, so the feature extraction effect fluctuates slightly. The network using the improved feature fusion module shows a significant improvement in mAP but is slightly slower. This can be explained as follows: multi-scale feature fusion significantly improves the accuracy of small target recognition, but the more complex feature fusion also reduces the inference speed.

Table 7 Ablation studies of the proposed Ghost-YOLO, FF stands for the improved feature fusion structure

4.5 Visualization result

To verify the detection capability of the model directly, we visualize the results. Figure 11(a) shows the detection results of YOLOv5-s and Ghost-YOLO: the original image is in the first column, and the second and third columns show the visual detection results of the two models. Zooming in on the pictures reveals the more detailed detection results. It can be seen that both YOLOv5-s and Ghost-YOLO have good recall, since they both detect the targets. Ghost-YOLO shows a certain improvement in accuracy across the detected targets and is more accurate than YOLOv5-s in recognizing farther and smaller traffic signs, which is valuable for TSD tasks and driverless safety. Moreover, Fig. 11 also shows detection results for traffic signs in complex environments, such as occlusion and shadow, as well as results for small targets. It shows that our model detects and recognizes traffic signs well.

Fig. 11

Visualization results on the TT100K dataset: (a) detection results of YOLOv5-s and Ghost-YOLO; (b) detection results for small targets; (c) detection results in complex environments (shadows, occlusions, cloudy weather, etc.)

5 Conclusions

In this paper, based on the YOLOv5 framework and aiming at the difficulty of traffic sign detection and the shortcomings of the original YOLOv5, the lightweight network Ghost-YOLO is proposed. We design a new feature extraction module to reduce the amount of redundant parameters and computation and to speed up inference. At the same time, a multi-scale feature fusion structure is used to combine the high-level semantic information in the deep feature maps with the shallow feature maps, improving the feature representation of small targets and the accuracy of TSD. Experimental results on the TT100K dataset show that the method achieves a balance between accuracy and lightness and has good robustness. In future work, we plan to explore an even more lightweight model and further address the efficiency of the model on mobile terminals. At the same time, considering that traffic signs are usually stored in image format, image classification based on massive data has become an important topic [17], and the work in this paper can also serve as a starting point for future work on image classification.