1 Introduction

Traffic signs use characters, symbols, and colors to convey mandatory, prohibitory, and danger information. Traffic sign recognition is an indispensable component of autonomous vehicles and advanced driver assistance systems (ADAS). It imposes high requirements in terms of real-time processing, accuracy, and robustness; moreover, energy-efficient processing is also important in a mobile computing environment [1]. Traffic sign images or videos in natural scenes are collected by a camera installed on the vehicle and then input into the vehicle computer. The semantics of signs can be understood through several types of processing, including detection, localization, tracking, and classification. However, no real-time, accurate, and adaptable system has been introduced so far due to several challenges, such as the complex and diverse backgrounds of real scenes, differing national traffic sign standards, illumination variations, diverse shooting angles, and real-time requirements.

Deep learning [2], a prominent topic in machine learning in recent years, has been successfully applied to tasks including handwritten numeral recognition [3], classification [4, 5], detection [6], tracking [7,8,9], natural language processing [10], and intelligent question answering systems [11], achieving results beyond the reach of traditional methods. The success of deep learning benefits not only from larger and deeper models with more parameters but also from the large-scale annotated or unlabeled data provided by academia and industry. More specifically, the large model structure enhances the nonlinearity of deep learning, while the huge amount of training data enhances its generalizability.

Traditional traffic sign recognition methods typically apply extreme learning machine (ELM) [12] or support vector machine (SVM) [13, 14] classifiers to handcrafted features, which may result in the loss of significant information. The recognition rate of traditional methods declines when traffic signs are occluded or shaded. Recently, convolutional neural networks (CNNs) have been used for traffic sign recognition. MSCNN [15] extracts features from different convolutional layers for traffic sign classification and achieves a recognition rate superior to that of traditional methods. MCDNN [16] proposes a multi-column network to classify traffic signs, whose recognition ability is improved through expert voting. Moreover, after extracting features with CNNs, CNN-ELM [17] uses ELM as a classifier, combining the advantages of deep learning and traditional machine learning. CNN-HLSGD [18] trains a convolutional neural network with hinge loss, achieving a recognition rate on the GTSRB dataset better than that of most methods. Furthermore, in [19], the authors propose a novel approach called DP-KELM, which classifies deep perceptual features in the perceptual LAB color space using a kernel-based extreme learning machine (KELM); this method reduces computational cost and yields an improved recognition rate.

Although deep neural networks perform well in traffic sign recognition experiments, they are still restricted by time and space in practical applications. As larger and deeper networks require more resources, graphics processing units (GPUs) [20] are commonly used to speed up computation. The strong representation ability of convolutional neural networks arises from their millions of trainable parameters; for example, AlexNet [21] has 60 M parameters, while VGG16 [22] has 138 M parameters. However, the limited computing power and storage space of on-board equipment or wearable devices cannot supply the operational resources required by such complex networks. As reported in [23], the parameters in CNNs exhibit a great deal of redundancy. As a result, many methods have been proposed to compress large CNNs. Deep network compression primarily encompasses five kinds of methods: low-rank decomposition, pruning, quantization, knowledge distillation, and compact network design. Low-rank decomposition methods [24] such as singular value decomposition (SVD) or tensor train decomposition use a low-rank matrix to approximate a weight matrix in CNNs. Channel pruning methods remove unimportant channels [25], making the choice of pruning criterion an essential aspect of these methods. Quantization reduces model size by means of low-bit representation, for example by quantizing weights into ternary values [26]. Identifying traffic signs in real time with deep neural networks on resource-limited equipment is therefore of practical significance. In the present paper, we design two slim networks with fewer trainable parameters that do not require special software or hardware accelerators.

The main contributions of this paper can be summarized as follows:

  1. To alleviate the abovementioned problems, a new training strategy and two lightweight convolutional neural networks are proposed. These two networks work as a teacher model and a student model respectively. We design a new module in our teacher network that combines two streams of feature channels with dense connectivity to make the network deeper. Our student network is a simple, end-to-end architecture with five convolutional layers and a fully connected layer.

  2. The teacher model assists the training of the student model by means of knowledge distillation, and the student model obtains a better traffic sign recognition rate than the teacher model. Finally, channels whose BN scaling factors are close to zero are identified as insignificant, and the student model is pruned accordingly to reduce the number of parameters and the computational cost.

  3. Compared with some existing traffic sign classification algorithms, the proposed lightweight network has fewer parameters while still obtaining the same high recognition rate; moreover, the input data does not require extra preprocessing operations, thus enabling a simple and efficient end-to-end network.

Our proposed lightweight networks incur an accuracy loss of only 0.33–0.63% while reducing the number of parameters to one tenth, or even one hundredth, of those employed by the compared algorithms. The rest of this paper is organized as follows: In Sect. 2, our proposed method is described in detail. The performance on two traffic sign datasets is presented in Sect. 3. Finally, the conclusion is provided in Sect. 4.

2 Proposed methodology

Our work in this paper is divided into three steps. First, we design a teacher network and a student network. Second, after the teacher model converges on the traffic sign classification training set, knowledge distillation is used to improve the precision of the student model. Finally, the student model's channels are pruned, reducing the overall computational cost.

2.1 Knowledge distillation

Knowledge distillation [27] helps to train a shallower student network through the softened output of the teacher network on the target dataset. The training set is D = {X = {x1, x2, ..., xN}, Y = {y1, y2, ..., yN}}, where x and y represent an input and a target output respectively. The output of the teacher model is t = teacher(x); likewise, the output of the student model is s = student(x). We train the student model to minimize the following loss function:

$$ {L}_{KD}=\left(1-\alpha \right){L}_{CE}\left(y,\sigma (s)\right)+2{T}^2\alpha {L}_{CE}\left(\sigma \left(\frac{t}{T}\right),\sigma \left(\frac{s}{T}\right)\right) $$
(1)

where T and α are hyperparameters: α controls the ratio of the two terms, and T is a temperature parameter; σ(·) is the softmax function. LCE denotes the standard cross-entropy loss, which penalizes the student network when it classifies a target incorrectly. The second term is minimized when the softened outputs of the teacher model and the student model are close.
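As a concrete illustration, Eq. (1) can be transcribed into PyTorch as follows; this is a minimal sketch, and the tensor names (student_logits, teacher_logits) are ours rather than part of the method:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=20.0, alpha=0.9):
    """Knowledge distillation loss of Eq. (1).

    student_logits, teacher_logits: raw (pre-softmax) outputs s and t.
    targets: ground-truth class indices y.
    """
    # Hard-label term: standard cross-entropy with the true labels.
    ce = F.cross_entropy(student_logits, targets)
    # Soft-label term: cross-entropy between the softened teacher and
    # student distributions; the 2*T^2 factor keeps the gradient
    # magnitudes of this term comparable across temperatures.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_ce = -(soft_teacher * log_soft_student).sum(dim=1).mean()
    return (1.0 - alpha) * ce + 2.0 * T * T * alpha * soft_ce
```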

A student network trained on softened outputs is significantly better than one that learns directly from the original training data, because what the model learns directly from the traffic sign training set are one-hot class labels. For example, suppose a network trained to classify an object as a car, dog, or cat produces the softened output [0.05, 0.9, 0.6] for an image whose most likely prediction is "dog". Since dogs are more similar to cats than to cars, the gap between the second and third values is smaller than the gap between the first and second. When this secondary information in the output contributes only minimally to the weight updates, increasing the temperature parameter T can help transfer the knowledge to the student model.
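To see why raising T helps, one can soften a sharp output distribution directly; the logits below are illustrative values of our own choosing, not taken from any experiment:

```python
import torch
import torch.nn.functional as F

# Illustrative logits for the classes (car, dog, cat); not from the paper.
logits = torch.tensor([-2.0, 3.0, 1.5])

for T in (1.0, 5.0, 20.0):
    probs = F.softmax(logits / T, dim=0)
    print(f"T={T:>4}: {probs.tolist()}")

# As T grows, the dog/cat similarity (logit gap of 1.5) becomes visible
# in the probabilities instead of being crushed toward a one-hot vector.
```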

The knowledge distillation procedure is illustrated in Fig. 1. The teacher network and the student network use the same training set. The teacher network is first trained to converge on the traffic sign training set; its parameters are not updated during the training of the student network, as they only guide the updating of the student network's parameters. We thus need to find appropriate hyperparameters T and α. When the accuracy of the teacher network on the target dataset is low, it is hard for it to guide the student model's parameter updates, and the student network easily falls into a local optimum. Addressing this concern, our proposed teacher network achieves accuracies of 99.23% and 98.89% on the GTSRB and BTSC datasets respectively, meaning that it can effectively help improve the traffic sign recognition rate of the student model.

Fig. 1 The knowledge distillation procedure
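The procedure of Fig. 1 can be sketched as follows, assuming the kd_loss function above and already-constructed teacher and student models; the loop is our paraphrase of the training setup, not the authors' released code:

```python
import torch

def train_student(student, teacher, loader, epochs, lr=0.001):
    teacher.eval()                      # the teacher is already converged
    for p in teacher.parameters():
        p.requires_grad_(False)         # its weights are never updated

    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for epoch in range(epochs):
        for images, targets in loader:
            with torch.no_grad():       # the teacher only guides the student
                teacher_logits = teacher(images)
            student_logits = student(images)
            loss = kd_loss(student_logits, teacher_logits, targets,
                           T=20.0, alpha=0.9)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```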

2.2 Teacher network

Our proposed teacher network is based on an important observation: low-level convolutional features better represent the target's texture information, while high-level convolutional features contain more semantic information. We connect the layers by means of dense connectivity [28], with each layer's feature maps input to all subsequent layers. This achieves a high utilization rate of feature maps and effectively combines the low-level and high-level feature maps, enabling improved feature learning (Fig. 2).

Fig. 2 The architecture of our teacher network

We propose a novel module, as shown in Fig. 3. Our teacher network is constructed as follows: (a) Two 1 × 1 convolutional filters are used to reduce the number of channels of the input feature maps. Compared with 3 × 3 kernels, 1 × 1 kernels require one ninth as many parameters. The nonlinearity of the network can be greatly increased while the size of the feature maps remains unchanged, which makes the network deeper. (b) The 1 × 1 kernels and the 3 × 3 kernels execute convolution operations in parallel, and all output results are concatenated, as shown in Fig. 4. Different convolution operations obtain different information about the input image, so the feature maps produced by the parallel operations exhibit a stronger feature representation ability. (c) Six cells establish direct connections between different layers, making full use of the feature maps of each layer and integrating the characteristics of each channel in order to alleviate the vanishing gradient problem. At the same time, a batch normalization [29] layer and the ReLU [30] function are used in each layer to further prevent vanishing and exploding gradients and to increase the degree of network nonlinearity.

Fig. 3 One Stage module

Fig. 4 One Cell block

The teacher network proposed in this paper can integrate features between different layers, obtain more detailed information (such as texture features and edge features), and increase the network's ability to recognize traffic signs. By using our novel module, we can stack multiple convolutional layers to obtain a deeper network structure. As the number of network layers increases, the representation ability of the extracted features grows stronger; finally, a single fully connected layer transforms the feature vector into a probability vector for traffic sign classification.
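The Stage and Cell blocks can be approximated in PyTorch as below; since the exact wiring is given only in Figs. 3 and 4, the channel widths and the precise dense-connection pattern in this sketch are our assumptions:

```python
import torch
import torch.nn as nn

class Cell(nn.Module):
    """Parallel 1x1 and 3x3 convolutions whose outputs are concatenated,
    each followed by BN and ReLU (an approximation of Fig. 4)."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x)], dim=1)

class Stage(nn.Module):
    """Six cells with dense connectivity: each cell sees the concatenation
    of the stage input and all earlier cell outputs (cf. Fig. 3)."""
    def __init__(self, in_ch, branch_ch=16, num_cells=6):
        super().__init__()
        self.cells = nn.ModuleList()
        ch = in_ch
        for _ in range(num_cells):
            self.cells.append(Cell(ch, branch_ch))
            ch += 2 * branch_ch       # each cell adds 2*branch_ch channels

    def forward(self, x):
        features = [x]
        for cell in self.cells:
            out = cell(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)
```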

2.3 Student network

CNNs can learn features automatically, without handcrafted feature engineering. As shown in Table 1, the student network is an end-to-end structure consisting of five convolutional layers and a fully connected layer. The input layer loads the input data, and the input images do not require data augmentation. The three RGB color channels are used to retain the original information of traffic signs. The convolutional layers learn features, with each convolutional filter extracting a specific feature from the image. We add a BN layer and a ReLU layer after each convolutional layer to increase the network's nonlinearity. The pooling layers prevent overfitting and reduce the dimensionality of the feature maps: average-pooling better preserves the overall information of the input feature maps, such as background information, while max-pooling captures textural features and inhibits the attenuation of the backward gradient. Finally, the fully connected layer works as a classifier, transforming the feature vectors into target class probabilities.

Table 1 Description of our student network
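A sketch consistent with this description is given below; as Table 1 holds the exact configuration, the filter counts and pooling placement here are illustrative assumptions (only the 64 filters of Conv1 are implied later, by Table 6):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Each convolution is followed by BN and ReLU, as described above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class StudentNet(nn.Module):
    """Five convolutional layers plus one fully connected layer; raw RGB
    input, no data augmentation. Channel widths are illustrative."""
    def __init__(self, num_classes=43):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64),    nn.MaxPool2d(2),
            conv_block(64, 128),  nn.MaxPool2d(2),
            conv_block(128, 128),
            conv_block(128, 256), nn.MaxPool2d(2),
            conv_block(256, 256),
            nn.AdaptiveAvgPool2d(1))   # average-pool before the classifier
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```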

The Adam [31] algorithm is used to update the weights and biases during the training of the student network. Let mt be the exponential moving average of the gradient, which estimates the first moment of the gradient, and let vt be the exponential moving average of the squared gradient, which estimates the second raw moment of the gradient. The exponential decay rates of these moving averages are controlled by the hyperparameters β1, β2 ∈ [0, 1). Finally, gt denotes the gradient at timestep t:

$$ {m}_t={\beta}_1{m}_{t-1}+\left(1-{\beta}_1\right){g}_t $$
(2)
$$ {v}_t={\beta}_2{v}_{t-1}+\left(1-{\beta}_2\right){g}_t^2 $$
(3)

The bias-corrected first moment estimate and the bias-corrected second raw moment estimate are computed as follows:

$$ {\hat{m}}_t=\frac{m_t}{1-{\beta}_1^t} $$
(4)
$$ {\hat{v}}_t=\frac{v_t}{1-{\beta}_2^t} $$
(5)

Finally, the parameter update rule is given below, where η is the learning rate and θ0 denotes the initial parameter vector; ϵ is set to 10−8 to avoid division by zero:

$$ {\theta}_{t+1}={\theta}_t-\eta \frac{{\hat{m}}_t}{\sqrt{{\hat{v}}_t}+\epsilon } $$
(6)
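Written out in code, one Adam step over Eqs. (2)–(6) looks as follows; this is a plain single-tensor sketch (in practice we rely on PyTorch's built-in torch.optim.Adam):

```python
import torch

def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; theta, grad, m, v are same-shaped tensors and
    t is the 1-based timestep."""
    m = beta1 * m + (1 - beta1) * grad                 # Eq. (2)
    v = beta2 * v + (1 - beta2) * grad * grad          # Eq. (3)
    m_hat = m / (1 - beta1 ** t)                       # Eq. (4)
    v_hat = v / (1 - beta2 ** t)                       # Eq. (5)
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)  # Eq. (6)
    return theta, m, v
```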

2.4 Network pruning

There are several common methods for converting networks into more compact ones with fewer trainable parameters. Weight quantization [26] uses low-bit weights and activations for model compression; however, such methods require a suitable data structure for storing the quantized parameters. Low-rank decomposition [32] can factorize a convolutional layer into several efficient ones, although it is not efficient for current networks with 1 × 1 convolutional layers. Network pruning remains a popular choice, as it offers small accuracy drops and efficient structured models.

Network pruning reduces the computation required by the model by removing redundant parameters from the network. The most time-consuming part of a convolutional neural network is the convolutional layers, while the fully connected layers contain most of the network's parameters. Minimizing the difference in accuracy between the full and pruned models depends on the criterion used to identify the "least important" parameters. Reasonable criteria of this kind include minimum weight, activation value, and mutual information. The method outlined in [33] prunes redundant connections by learning only the important connections. In [34], moreover, the model is pruned on the fully connected layer.

Network slimming [35] employs the γ parameters of the batch normalization layers as scaling factors for channel pruning: the smaller the value, the less important the channel and the easier it is to prune. Let B = {x1, x2, ..., xm} denote the current mini-batch:

$$ {\hat{x}}_t=\frac{x_t-{\mu}_B}{\sqrt{\sigma_B^2+\epsilon }} $$
(7)

Here, μB and σB are the mean and standard deviation of the mini-batch, Zout is the output of the batch normalization layer, and γ and β are trainable affine transformation parameters, such that

$$ {Z}_{out}=\gamma {\hat{x}}_t+\beta $$
(8)

The loss function imposes an L1-norm penalty on the scaling factor of each channel in order to drive the values of unimportant channels toward zero. Channels with small scaling factors are then pruned, after which the pruned network is retrained to recover the recognition accuracy on the target dataset. The training objective to be minimized is:

$$ L=\sum \limits_{\left(x,y\right)}l\left(f\left(x,W\right),y\right)+\lambda \sum \limits_{\gamma \in \varGamma }g\left(\gamma \right) $$
(9)

Here, (x, y) represents an input and its target output. W denotes the trainable parameters, g(·) is an L1 regularization on the channel scaling factors, and Γ is the set of all scaling factors; the two terms are balanced by λ.
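In training, the penalty term of Eq. (9) simply adds the L1 norm of all BN γ parameters to the task loss; a minimal sketch follows (the value of λ here is an illustrative assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def slimming_loss(model, logits, targets, lam=1e-4):
    """Eq. (9): cross-entropy plus an L1 penalty on all BN scaling
    factors, which pushes unimportant channels' gamma toward zero."""
    task_loss = F.cross_entropy(logits, targets)
    l1 = sum(m.weight.abs().sum()
             for m in model.modules()
             if isinstance(m, nn.BatchNorm2d))
    return task_loss + lam * l1
```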

3 Experimental results

3.1 Dataset and hyperparameters

3.1.1 Dataset

The German Traffic Sign Recognition Benchmark (GTSRB) [36] dataset is a challenging multi-class, single-image traffic sign classification dataset. It consists of 51,839 samples, ranging in size from 15 × 15 to 250 × 250 pixels, not all of which are square. The dataset has 43 categories, each comprising 100~1000 images, including prohibitory signs, danger signs, and mandatory signs. The training set contains 39,209 images; the remaining 12,630 images serve as the testing set. Due to perspective changes, shade, color degradation, lighting conditions, and so on, many of these signs are difficult even for humans to recognize (Fig. 5).

Fig. 5 Sample images from the GTSRB dataset

The Belgian Traffic Sign Classification (BTSC) dataset [37] contains 4533 training images and 2562 testing images, divided into 62 traffic sign types. The images in the BTSC dataset are often distorted by weather changes, occlusions, etc. Compared with the GTSRB dataset, the BTSC dataset contains a larger number of sign types but fewer training samples, which increases the difficulty of correct classification (Fig. 6).

Fig. 6 Sample images from the BTSC dataset

3.1.2 Training teacher network

Our experiments are performed with PyTorch on a Linux PC with an Intel Xeon E5-2670 v3 CPU @ 2.30 GHz × 24 and an NVIDIA TITAN X GPU with 12 GB of memory. We first train our teacher network on a general dataset. The CIFAR-10 dataset [38] contains ten classes, each with 6000 RGB images: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The training set contains 50,000 images, while the testing set contains 10,000 images. We train with a batch size of 128 for 300 epochs. The learning rate is initially set to 0.001, and individual adaptive learning rates are computed using the Adam method. We use a weight decay of 10−5. The converged model achieves comparable performance with fewer trainable parameters; the results in Table 2 demonstrate the generality of our teacher network.

Table 2 Performance comparison of our teacher model and the general model on the CIFAR-10 dataset

3.1.3 Hyperparameters for knowledge distillation

To find appropriate hyperparameters T and α, we evaluate 16 hyperparameter configurations, each trained for 500 epochs on the GTSRB dataset. The results are listed in Table 3. For knowledge distillation, we set α to 0.9 and use a temperature of 20, which yields better experimental results.

Table 3 Different hyperparameters for knowledge distillation

3.2 Performance on GTSRB dataset

The best recognition rate achieved by our teacher model is 99.23%, while the average recognition rate of our student model increases to 99.61%. The confusion matrix (CM) is one of the most widely used evaluation metrics; it measures an algorithm's performance in a visual way, with each row representing an actual category and each column a predicted category. We construct a confusion matrix in order to further analyze the effect of the proposed lightweight network, as shown in Fig. 7; whether the multiple categories are confused or not can be seen intuitively there.

Fig. 7 CM of student network on GTSRB dataset

To demonstrate the performance of our teacher and student networks, we first evaluate the classification task on the GTSRB testing images. Table 4 presents the recognition rates and trainable parameter counts of typical current CNN models on the GTSRB dataset. Our teacher network achieves a competitive result with significantly improved computational efficiency. The batch size is 128, the learning rate is initially set to 0.001, and the number of epochs is 300. As CNN-HLSGD [18] needs to train 20 identical networks, its parameter count is as large as 2.3 × 107, nearly 31 times that of our student network. Compared with DP-KELM [19], moreover, the student network uses half as many parameters; our student network is also an end-to-end system requiring no augmentation.

Table 4 Performance comparison on GTSRB Dataset

Table 5 shows the results of pruning the student network followed by fine-tuning on the GTSRB dataset. We use scaling factor values close to zero to identify insignificant channels, prune those channels via thresholding, and then retrain the pruned network until convergence on the target task. As shown in Table 6, we prune the filters of each convolutional layer in order to reduce the number of redundant parameters; the code sketch after Table 6 illustrates the channel selection. For example, 32(− 32) in Conv1 means that the first convolutional layer originally has 64 filters, of which 32 remain after channel pruning.

Table 5 Pruning 50% of channels for the student network
Table 6 Pruning 70% of channels for the student network
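Concretely, channel selection can be performed by ranking all γ values globally and pruning those below a percentile; the sketch below assumes a single global threshold, which is one common choice:

```python
import torch
import torch.nn as nn

def bn_prune_masks(model, prune_ratio=0.5):
    """Return a per-BN-layer boolean mask of channels to keep, using a
    global threshold on the |gamma| scaling factors."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            masks[name] = m.weight.detach().abs() > threshold
    return masks

# After building the masks, a slimmer network is constructed with the
# surviving channel counts and retrained (fine-tuned) until it recovers
# accuracy on the target dataset.
```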

Table 7 presents the recognition rates of the pruned networks and the full student network. When 50% of the model's channels are pruned, the recognition rate of the student network falls to 99.38%, but the number of parameters is reduced by 70%. Moreover, compared with DP-KELM [19], our pruned student network uses only 16% as many parameters while losing only a small amount of accuracy. When 70% of the channels are pruned, the student network has only 85,593 parameters; it can thus be deployed on mobile devices with limited power budgets.

Table 7 Performance comparison of the original and pruned student models on the GTSRB dataset

3.3 Performance on BTSC dataset

As shown in Fig. 8, the best recognition rate achieved by our teacher network is 98.89%, while the average training loss value over 30 epochs is 0.628. One cross-entropy loss value is calculated at each epoch, and the mean of the loss values across all epochs is defined as the average loss value. We use a standard cross-entropy loss to optimize the traffic sign classification task; here, the batch size is 128 and the initial learning rate is 0.001. The average recognition rate of our student network subsequently increases to 99.13%; compared with the teacher network, the student network makes better progress thanks to knowledge distillation.

Fig. 8 Comparison of the accuracy and loss of the teacher network (two pictures on the left) and the student network (two pictures on the right)

As shown in Table 8, the best recognition rate achieved by our teacher network is 98.89%, while that of our student network is 99.13%. A single CNN with 3 STNs [41] has 14,629,801 trainable parameters; by contrast, our student network has only 809,982. After half of the filters are pruned, our student network has 266,782 parameters and still obtains a recognition rate of 98.89% on the BTSC dataset.

Table 8 Performance comparison on BTSC dataset

4 Conclusion

In this paper, we propose two lightweight networks for traffic sign classification. Because of their large size, many neural network models are difficult to deploy on the mobile devices, with limited power budgets, used in traffic sign recognition systems. We implement a new module in our first model, referred to as the teacher network, which uses 1 × 1 convolutional layers and dense connectivity to learn features through parallel channels. The second model, referred to as the student network, is a simple end-to-end architecture comprising only six layers. The performance of our method illustrates that our lightweight network reduces the number of redundant parameters while retaining comparable accuracy. Moreover, we also prune the student network's channels, which yields a compact model. In conclusion, our lightweight networks provide an effective solution for deploying CNNs for traffic sign classification in resource-limited settings. In future work, we aim to find a novel pruning criterion that prunes channels with lower accuracy loss. We also plan to accelerate both inference and training by implementing a compact model.