1 Introduction

Traffic signs use images, characters, and shapes with meaningful colors to indicate mandatory, prohibitive, and warning information. While driving a motor vehicle, it is common for drivers to misinterpret or misread traffic signs. Intelligent vehicles can likewise misinterpret traffic signs, for instance speed limits, which can cause major accidents [19]. According to surveys, more than 1.24 million deaths and critical injuries from vehicle accidents are reported every year across different age groups and vehicle types, caused by light reflection, lack of attention, and many other factors behind terrible accidents. Cameras detect most objects in automated driving, including traffic signals, routes, other vehicles, and traffic police gestures. Radar, meanwhile, is inherently incapable of identifying signs such as speed limits and stop signs. Many autonomous vehicles and driver assistance systems have cameras mounted on their dashboards. These cameras can collect real-time traffic sign images or videos that are fed into the vehicle system for machine learning models [39]. The deep learning model must therefore be reliable when recording traffic signs at different angles and positions. Moreover, the vehicle’s speed and geographic position change constantly as it passes through varying backgrounds and lighting [11]. The camera mounted on the car captures images of traffic signs in their natural environment. Although deep neural networks perform well in traffic sign detection experiments, real applications still face time and space constraints [50]. The size of a sign in an image is determined by its distance from the camera. Identifying traffic signs is critical for the responsiveness and safety of intelligent vehicles; the limited scale of the signs, however, poses a significant difficulty for traffic sign recognition. Accurately detecting traffic signs matters even more than labeling them [7, 21]. Convolutional Neural Networks (CNNs) [22] are a form of deep neural network that can learn discriminative features and approximate the visual processing of human vision [1, 52]. Compared with the current best-performing approaches, CNNs perform better in traffic sign identification [31]. Traditional research methods such as simple machine learning models and support vector machines have also been applied to traffic sign classification [27, 49, 50]. However, when traffic signs are blurred, the recognition rate of these traditional methods decreases (Figs. 1 and 2).

Fig. 1 Traffic sign categories [34]

Fig. 2 Design of our CNN model

The German traffic sign database, which consists of the German Traffic Sign Recognition Benchmark (GTSRB) and the German Traffic Sign Detection Benchmark (GTSDB) [40], is frequently used in research on traffic sign detection and classification. However, GTSRB and GTSDB do not necessarily represent real-world driving circumstances, as the sign occupies a considerable portion of each GTSRB image. Compared to GTSDB signs, real-world traffic sign images frequently have a smaller image area [39].

This paper suggests a deep learning strategy for identifying traffic signs: our proposed methodology is a newly constructed neural network architecture for successful traffic sign recognition. The CNN architecture has been designed to detect traffic signs; different convolutional layers, pooling layers, max-pooling, and dropout layers are arranged in a configuration that constitutes the novelty of our work. Furthermore, the proposed model is applied to several standard datasets, namely GTSRB, GTSDB [40], BTSC [51], and TSRD. We then analyze the results in terms of precision, recall, and F1-score, which represent the best accuracy under validation on these datasets. Our experimental evaluation demonstrates that the model achieves significantly higher accuracy in traffic sign identification. The Adam optimizer [5] is applied to optimize the overall training objective, with the learning rate set to lr = 0.001 (Table 1).
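As a concrete illustration, here is a minimal sketch of this optimizer setting in Keras (the framework choice and the placeholder model are our assumptions; only the Adam optimizer and lr = 0.001 come from the paper):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Placeholder classifier; the actual architecture appears in Section 3.4.2.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Flatten(),
    layers.Dense(43, activation="softmax"),  # e.g. 43 GTSRB classes
])

# Adam optimizer with the learning rate used in this work (lr = 0.001).
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```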

Table 1 Classes of the traffic signs

2 Related works

S. He et al. [11] proposed CapsNet for the detection of different traffic signs. Their study addresses the misunderstanding issues caused by CNNs in traffic sign identification and the information loss of max pooling. Visual characteristics, position information such as the location and shape of the input image, and spatial relationships are all retained. The network has the advantage of being robust and generalizable. The study applies CapsNets, which are emerging in deep learning for analyzing traffic scenes, to traffic sign identification, solving traffic sign recognition by modifying the parameters and weights. In 2021, Ghosh R. [9] presented a method for detecting and monitoring on-road vehicles under various weather circumstances by utilizing multiple region proposal networks (RPNs) in Faster R-CNN. The application of several RPNs in Faster R-CNN is a relatively unexplored area of research in this field. Since the typical Faster R-CNN produces regions of interest (ROIs) with a single fixed RPN, it cannot distinguish vehicles of different sizes. In contrast, that study provides an end-to-end method for detecting on-road cars that generates ROIs using several RPNs of varying sizes and thereby detects various vehicle sizes. The suggested approach is novel in that it incorporates multiple RPNs of varying sizes into a standard Faster R-CNN. Three distinct public datasets were used to evaluate the proposed system’s performance: DAWN, CDNet 2014, and LISA, on which it attained average precisions of 89.48%, 91.20%, and 95.16%, respectively. A study conducted by Gupta H. et al. [10] developed an algorithm that automatically differentiates various kinds of vehicles in aerial footage using deep learning techniques. Their UAV dataset has unbalanced classes, so 500 images gathered through web mining were added to the cleaned and preprocessed dataset. Faster R-CNN, SSD, YOLOv3, and YOLOv4, four prominent object detection algorithms with strong generalization capability and flexibility in real-world scenarios, were compared; the traffic recognition model constructed with YOLOv4 outperforms the others by at least 88%, 13%, and 25%, respectively. Dewi C. et al. [6] proposed a Spatial Pyramid Pooling (SPP) variant of the YOLO V3 model. SPP considers input images of various dimensions during training, and the experimental results indicated that SPP improves recognition of Taiwan’s prohibitory signs. All YOLO V3 models were evaluated using mAP.

Y. Wu et al. [46] provided a two-level detection architecture with a region proposal module (RPM) and a classification module (CM) that seeks to categorize objects. Additionally, they demonstrated that a logo-based data augmentation technique addresses missing categories efficiently. Fu J. et al. [7] suggested an object detection approach for detecting a specific vehicle and its wheels. Because the standard SSD struggles to distinguish small objects, completing vehicle and wheel detection tasks poses problems. To address this issue, they developed a new benchmark dataset with five separate categories and proposed a novel SSD-based approach that uses multiple multi-concatenation modules and SEBlocks to enhance recognition accuracy for small objects. The proposed technique is demonstrated through trials on the Pascal VOC2007 dataset, the KITTI dataset, and their benchmark dataset. Using visualization tests on the Pascal VOC2007 test set and wheel detection, they showed that MSSD substantially boosts small object recognition; furthermore, it is extensible to other models with a large number of prediction layers. Said Y. et al. [30] identify a country’s flags without requiring localization. The suggested local context network design creates region proposals by vertically expanding tested anchors to encompass the flag under challenging conditions such as distortion and occlusion. The collected dataset includes 20,000 images of actual flags from 200 countries, and the proposed method achieves a mean average precision of 89.5%.

The classification of traffic signs [43] has been examined for decades, and the German Traffic Sign Recognition Benchmark (GTSRB) [28, 35, 45] has fostered analogous studies. One article proposed the DC-GCNN model, whose performance was compared to standard deep structures on ten separate datasets in a first experiment set. On the same convolutional network, DC-GCNN improves on a DCNN built with conventional fully connected techniques: on the application datasets, DC-GCNN improves precision by 44.45%, recall by 39.69%, and F1-score by 42.57%. The performance of DC-GCNN was compared to that of other CNN-based classifiers in a second experiment set; on the CIFAR-10 and MNIST datasets, the proposed structure outperformed competing techniques with classification accuracies of 89.12% and 99.28%, respectively [36]. The CNN-ELM model was implemented by integrating the excellent feature learning ability of CNNs with the efficiency of ELMs. CNN-ELM uses an ELM as a classifier after extracting features with a CNN, combining the advantages of deep learning with conventional machine learning [46]; the ELM then performs fast and accurate classification. CNN-HLSGD trains a convolutional neural network with a hinge loss; its recognition rate was higher than that of most approaches on the GTSRB dataset [15]. The DP-KELM algorithm classifies deep perceptual features of traffic sign images with a kernel-based extreme learning machine (KELM in the deep perceptual Lab color space instead of the base RGB color space). The approach uses a basic architecture that lowers computing costs and increases comparative recognition performance [49]. Graphics processing units (GPUs) reduce the computation time of large, deep networks [27]. Sun C. et al. [37] suggested a technique for recognizing small traffic signs (Dense-RefineDet) based on RefineDet. They presented a novel anchor-design method for placing small traffic signs at or next to feature-map cell corners. A Dense-TCB communicates semantic data from all higher-level layers to the target lower-level layer, providing rich contextual data for small traffic signs. Dense-RefineDet was also faster than previous deep-learning-based algorithms due to its single-stage design [11].

The following publicly available standard traffic sign image datasets contain a variety of traffic sign categories:

  • The German Traffic-Sign Recognition Benchmark (GTSRB) Dataset [2, 4, 35, 37, 51]: There are 43 different classes or categories. The training set contains 39,209 images, and the remaining 12,630 images are used for testing.

  • The German Traffic-Sign Detection Benchmark (GTSDB) Dataset [28, 35, 37, 45]: There are 43 different classes or groups, with 900 images in total. The training set contains 600 images, whereas the testing set contains 300 images.

  • The Belgium Traffic Signs (BTS) dataset [22]: There are 62 classes or categories for detection and recognition, with 4,533 training and 2,562 testing images.

  • TSRD dataset [17]: The TSRD contains 6,164 traffic sign images divided into 58 sign categories. The images are separated into two sub-databases: the training database has 4,170 images, and the testing database contains 1,994 images.

In 2020, Liu Z. et al. [21] introduced an approach to improve the identification of small traffic sign images. Their DR-CNN leverages deconvolution to combine the deep and shallow layers. Huang W. et al. [12] proposed a technique, evaluated on MS COCO, that uses a two-phase adaptive loss function to separate hard negative samples from clear positive samples in the total loss. In that research, they also examined ways to speed up the processing of high-resolution traffic images to achieve on-time performance for automobile applications, including driver assistance, near-road mapping, sign inventory maintenance, automatic driving, and automated traffic. A hinge loss stochastic gradient descent (HLSGD) technique for training convolutional neural networks (CNNs) has been proposed: with 1,162,284 trainable parameters, the CNN consists of three stages (70-110-180). The authors applied the HLSGD strategy to train CNNs on the German traffic sign recognition benchmark and compared the results [15]. They also introduced DP-KELM, a kernel-based extreme learning approach using a perceptual color space, and demonstrated that it is more beneficial for identifying CNN-based traffic signs [49]. Two backpropagation neural network (BPNN) models were constructed in another article for early yield prediction, namely for immature small green fruitlets and mature red fruits, by combining fruit attributes with four tree canopy variables. The four canopy variables chosen are suitable for early yield prediction and provide an elegant method for forecasting fruit yield, for apples and possibly other fruit crops, using backpropagation neural network prediction [5]. Another article suggested a novel approach to traffic sign identification and recognition: ResNet-50 serves as the backbone in the detection phase to construct feature pyramid networks that improve the semantic representation of small objects, together with focal loss [20, 23]. Xing J. et al. [48] proposed Faster R-CNN for traffic sign detection and identification, comparing detection and recognition results across different backbone networks such as VGGNet, GoogleNet, and ResNet, and achieving 92.6% recall and 93.4% precision. Adam stands for Adaptive Moment Estimation, and it computes adaptive learning rates from its hyper-parameters. The Adam optimization algorithm is simple to use, needs little memory, is computationally efficient, and is well suited to sparse gradient problems [5].

3 Proposed methods

Our system preprocesses the traffic sign images in the dataset and trains a deep CNN traffic sign identification model [38]. It recognizes and detects various signs and symbols [4, 8, 14, 52]. Initially, the images are transformed into a suitable shape from which feature extraction becomes easier. Following that, image preprocessing methods are applied, features are extracted from the different CNN layers, and the traffic signs are finally classified using these features [18].

3.1 Input layer: The input layer takes images as input

3.1.1 CNN

A CNN is a multi-layer neural network composed of several similar building blocks [1, 52]. The machine primarily reads images as three basic color channels, the RGB channels Red, Green, and Blue [5, 24, 28]. Each color channel has its own pixel values. A binary image, by contrast, has only two colors, black and white.

Here we consider different CNN architectures, adjusting the convolutional layers, pooling layers, and max-pooling with dropouts at various levels. With the Adam optimizer, our results outperform previous studies on the same datasets, namely GTSRB, GTSDB, BTSC, and TSRD.

In its convolution layers, a CNN applies numerous kernels for convolving both the whole image and the resulting feature maps, producing many characteristics [32]. In the majority of earlier work, fully connected networks classify images. Suppose the input images are 32 × 32 × 3 and 200 × 200 × 3 pixels; each neuron in the first hidden layer would then require 3,072 and 120,000 weights, respectively. Dealing with such a large number of parameters across a whole, complicated image collection requires so many neurons that it can result in overfitting or become computationally infeasible. As a result, convolutional neural networks rather than fully connected networks are used for image classification. A CNN is composed of neurons with learnable weights and biases, similar to a feed-forward neural network. These neurons learn to transform inputs such as images to the corresponding output labels during training. In a CNN, a neuron in one layer is connected to only a small number of neurons in the next layer (a local region), needing fewer neurons and fewer weights than a fully connected network, where a single neuron connects to all neurons. Each neuron takes in a number of inputs, performs a weighted sum over them, and, when the sum passes through an activation function, responds with an output. A CNN, in general, is made from four layers that help extract information from images: convolution, ReLU, pooling, and fully connected. The CNN takes an image of a traffic sign as input, constructs various transformed versions of it, and classifies them. CNNs classify images by converting them into a matrix of numbers, one for each pixel [3].
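To make the parameter arithmetic concrete, a small sketch (our own illustration, not from the paper) comparing first-layer weight counts:

```python
# Weights into ONE neuron of a first fully connected hidden layer equal
# the number of input values (height * width * channels).
for h, w, c in [(32, 32, 3), (200, 200, 3)]:
    dense_weights = h * w * c            # 3,072 and 120,000 respectively
    print(f"{h}x{w}x{c} image -> {dense_weights} weights per dense neuron")

# A convolutional neuron sees only a local region, e.g. a 5x5x3 receptive
# field, so it needs 5 * 5 * 3 = 75 weights regardless of image size.
print("5x5x3 conv kernel ->", 5 * 5 * 3, "weights per neuron")
```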

The convolution layer consists of many feature filters performing the convolution function [26]. These filters compare small patches of the larger image and evaluate whether they match. The layer involves the following steps: first, the feature filter is aligned with a patch of the image and each image pixel is multiplied by the matching filter pixel; the filter then slides across the image, stopping at each position and repeating the previous steps. This procedure is performed for each feature filter to obtain the convolution output. The next component is the rectified linear unit (ReLU layer), which activates a node only if its input is greater than a certain value; otherwise the output is 0, as seen in Fig. 6. When the input value exceeds that threshold, there is a linear relationship between it and the output. ReLU thus removes all negative values, converting them to zero, and is applied to all the feature maps. An additional activation function is the sigmoid, whose S-shape is seen in Fig. 7; its result can be interpreted as a probability, since the output lies between 0 and 1.
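A minimal NumPy sketch of this slide-multiply-sum operation followed by ReLU (an illustrative toy, not the paper's implementation; the edge-detecting kernel is an assumption):

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution: slide the filter, multiply pixel-wise, and sum."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    # Negative responses become 0, positive ones pass through (Fig. 6).
    return np.maximum(0, x)

image = np.random.rand(8, 8)          # toy grayscale patch
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])       # vertical-edge feature filter
feature_map = relu(convolve2d(image, kernel))
print(feature_map.shape)              # (6, 6)
```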

3.1.2 Pooling layer

A pooling layer is applied to reduce the complexity of the feature maps and the number of network parameters following a convolution. Pooling layers, like convolutional layers, are tolerant to small translations because their computations take surrounding pixels into consideration. The two most common methods are average and maximum pooling.
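A short NumPy sketch of 2 × 2 max pooling (our illustration; swapping `max` for `mean` gives average pooling):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Keep only the strongest response in each pooling window."""
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()   # window.mean() -> average pooling
    return out

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fm))   # 2x2 output: each spatial dimension is halved
```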

3.1.3 Nonlinear layer

The CNN transforms its input non-linearly to classify the features contained inside each hidden layer. In the CNN framework, ReLU, a frequently used nonlinear transform, is applied. This layer performs a simple thresholding operation, setting any input value less than zero to zero.

3.1.4 Fully connected layer

After many rounds of the preceding layers, the data reaches the final part of the CNN, the fully connected layers. The neurons in two adjacent fully connected layers are directly connected to one another. Our suggested design also makes use of a batch normalization layer, which normalizes each channel across a mini-batch.
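A hedged Keras sketch of such a fully connected head with batch normalization (the layer sizes and the incoming feature-map shape are illustrative assumptions):

```python
from tensorflow.keras import layers, models

head = models.Sequential([
    layers.Input(shape=(4, 4, 64)),        # assumed shape of the conv features
    layers.Flatten(),
    layers.Dense(256, activation="relu"),  # fully connected: each unit sees all inputs
    layers.BatchNormalization(),           # normalize each channel over the mini-batch
    layers.Dropout(0.5),
    layers.Dense(43, activation="softmax"),  # one output per traffic sign class
])
head.summary()
```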

3.1.5 Softmax

Interpreting the network’s raw outputs is a challenge. In classification [19] problems, it is typical to finish the CNN with a softmax function. At this stage, a traffic sign class is selected based on the features extracted by the previous layers through which the traffic sign images have passed; the layer uses the softmax to identify the correct sign class.

3.1.6 Training

Network training identifies the kernels and weights in the convolutional and fully connected layers that minimize the inconsistency between the output predictions and the ground truth labels of a training dataset [41]. Our model uses the cross-entropy loss as the objective function, and 70% of the data in our study is used for training.
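A minimal end-to-end training sketch under these choices (cross-entropy objective, 70% of the data for training; the synthetic arrays, tiny model, and epoch count are stand-ins, not the paper's data):

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-ins: 1,000 preprocessed 32x32 RGB sign images, 43 classes.
X = np.random.rand(1000, 32, 32, 3).astype("float32")
y = np.random.randint(0, 43, size=1000)

n_train = int(0.70 * len(X))                  # 70% for training
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(43, activation="softmax"),
])
# Cross-entropy is the objective function, as in this work.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))
```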

3.1.7 Testing

The testing dataset is used to evaluate the final model.

3.2 Image preprocessing and data augmentation

To train our CNN, we use the GTSRB [35, 51], GTSDB [51], TSRD [17], and BTSC [17] datasets, which contain 43, 43, 58, and 62 classes, respectively. GTSRB is the German Traffic Sign Recognition Benchmark image dataset. A large number of convolutional neural network (CNN) approaches have been used to train and test traffic sign recognition and classification models [29, 44]. Figure 3 shows grayscale road sign images from the GTSRB [14] dataset.

Fig. 3 Samples of different categories in the GTSRB image dataset [50]

3.2.1 Image preprocessing

The primary objective of image preprocessing in the traffic sign recognition and detection method [13, 35, 53] is to reduce low-frequency background noise, normalize the image amplitude, suppress reflections, and mask image parts [14, 18]. The input image is split into its R, G, and B channels [3].
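An illustrative preprocessing sketch with OpenCV (the exact steps, namely resizing, luminance equalization, channel splitting, and amplitude normalization, are our assumptions of typical practice rather than a recipe from the paper):

```python
import cv2

def preprocess(path, size=(32, 32)):
    """Hypothetical pipeline: resize, suppress lighting effects, normalize."""
    bgr = cv2.imread(path)
    bgr = cv2.resize(bgr, size)

    # Equalize luminance to reduce low-frequency background and reflections.
    yuv = cv2.cvtColor(bgr, cv2.COLOR_BGR2YUV)
    yuv[:, :, 0] = cv2.equalizeHist(yuv[:, :, 0])
    bgr = cv2.cvtColor(yuv, cv2.COLOR_YUV2BGR)

    # Separate the image into its B, G, R channels, then rescale to [0, 1].
    b, g, r = cv2.split(bgr)
    return cv2.merge([b, g, r]).astype("float32") / 255.0
```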

3.2.2 Data augmentation

It is essential to perform some basic data augmentation to prevent overfitting and boost generalization [3]. For this reason, images are rotated about their geometric centers by angles in the range [−20°, 20°]. The data augmentation parameters are shown in Table 2, and Fig. 4 shows sample images after these parameters are applied.
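A hedged sketch of this rotation-based augmentation with Keras' ImageDataGenerator (rotation_range matches the ±20° above; the shift and zoom values are illustrative assumptions):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,       # rotate about the center within [-20°, 20°]
    width_shift_range=0.1,   # assumed extra jitter
    height_shift_range=0.1,
    zoom_range=0.1,
)
# flow() then yields randomly augmented batches during training, e.g.:
# model.fit(augmenter.flow(X_train, y_train, batch_size=64), epochs=30)
```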

Table 2 Data augmentation parameters
Fig. 4 Images after data augmentation

3.3 Datasets

The GTSRB dataset is a multi-category image collection that presents a challenging traffic sign classification task. There are 51,839 samples in the dataset, varying in scale from 15 × 15 to 250 × 250 pixels, and not all signs are circular. The dataset has 43 classes, each containing on the order of 1,000 images; it includes 39,209 training and 12,630 test images. Owing to viewpoint changes, shading, color loss, and lighting conditions, it can be complicated even for human eyes to recognize all of these signs [2]. In addition, the Belgian Traffic Sign Classification dataset (BTSC) separates 4,637 training images and 2,534 test images into 62 classes. The BTSC dataset includes more distinct shapes but fewer training examples per traffic sign than the GTSRB dataset, increasing the difficulty of proper classification, as shown in Table 3. The GTSDB dataset includes 900 high-resolution images of traffic signs under natural conditions, split into 600 training and 300 test images. Finally, we combine the GTSRB and TSRD datasets to test our approach on 101 classes, as shown in Table 4.

Table 3 Comparison between different publicly available standard datasets
Table 4 Comparison between different publicly available standard datasets in terms of performance

3.4 Experimental results

The proposed automatic traffic sign identification system has been tested in this work on the GTSRB, GTSDB, BTSC, and TSRD datasets, which contain traffic sign images from 43, 43, 62, and 58 classes, respectively. Cross-validation is applied to each dataset: 70% of the images from each class are used as the training set and 30% as the validation set. The training set is used to fit the model and lets it learn the traffic sign images, while the validation set offers an unbiased evaluation of the model fitted to the training data.
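A hedged sketch of this per-class 70/30 split using scikit-learn (array names and sizes are assumptions; `stratify` keeps the ratio within every class):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical label array: one class id (0..42) per image.
labels = np.random.randint(0, 43, size=5000)
indices = np.arange(len(labels))

# stratify=labels preserves the 70/30 ratio inside each traffic sign class.
train_idx, val_idx = train_test_split(
    indices, train_size=0.70, stratify=labels, random_state=0)
print(len(train_idx), len(val_idx))   # 3500 1500
```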

Operating system (OS): Windows 10; Development platform: Python 3.8 + OpenCV + Jupyter Notebook; CPU: Intel(R) Core(TM) i7; Memory: 8 GB; Disk: 1 TB.

The Belgian traffic sign image collection comprises two comprehensive datasets [2, 35]. The comparatively well-established, publicly available standard traffic sign identification datasets are mainly the GTSDB dataset [51] and the GTSRB dataset [2]. Numerical results (precision, recall, and F1-score) for the GTSDB dataset are shown in Table 5, and Fig. 5 shows its confusion matrix.

Table 5 Report of the confusion matrix
Fig. 5 Confusion matrix

3.4.1 Simultaneous classification and detection

The test outcomes have four classifications: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). Precision (P) is the probability of correctly predicting a positive sample. Recall (R) represents the fraction of positive samples that are predicted correctly [42]. From these counts, the precision, recall, and F1-score are determined as follows:

$$ P = \frac{TP}{TP + FP} $$
$$ R = \frac{TP}{TP + FN} $$
$$ F1 = 2 \cdot \frac{P \cdot R}{P + R} $$
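For concreteness, a small sketch computing these metrics from raw counts (our own illustration with hypothetical numbers):

```python
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)            # P = TP / (TP + FP)
    r = tp / (tp + fn)            # R = TP / (TP + FN)
    f1 = 2 * p * r / (p + r)      # harmonic mean of P and R
    return p, r, f1

print(precision_recall_f1(tp=95, fp=5, fn=3))  # (0.95, 0.969..., 0.959...)
```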

3.4.2 The architecture of the model

Conv1 → ReLU → Conv2 → ReLU → Pool1 → Dropout → Conv3 → ReLU → Conv4 → ReLU → Pool4 → Dropout → Fully_Connected → Output_Layer → Result.
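A hedged Keras rendering of this layer sequence (the paper specifies only the ordering; filter counts, kernel sizes, dropout rates, and the 43-class output are illustrative assumptions):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # Conv1 -> ReLU
    layers.Conv2D(32, (3, 3), activation="relu"),   # Conv2 -> ReLU
    layers.MaxPooling2D((2, 2)),                    # Pool1
    layers.Dropout(0.25),                           # dropout
    layers.Conv2D(64, (3, 3), activation="relu"),   # Conv3 -> ReLU
    layers.Conv2D(64, (3, 3), activation="relu"),   # Conv4 -> ReLU
    layers.MaxPooling2D((2, 2)),                    # Pool4
    layers.Dropout(0.25),                           # dropout
    layers.Flatten(),
    layers.Dense(256, activation="relu"),           # fully connected
    layers.Dense(43, activation="softmax"),         # output layer -> result
])
model.summary()
```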

We consider traffic signs of the different classes and train networks using our architecture. As a result, our network attained 99.81% accuracy on the GTSDB dataset. Due to its non-saturating linear form, the ReLU has become very popular in recent years; it dramatically accelerates the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. However, the ReLU discards all negative information and is not suitable for every dataset and architecture (Figs. 6 and 7).

Fig. 6 The ReLU activation function

Fig. 7 The sigmoid function

3.4.3 Softmax

The softmax activation function is used in the output layer for classification purposes [36, 47]. This activation function is preferred over others for classification because it constrains the output for each class to a value between 0 and 1. The softmax activation can be written mathematically as the following equation, where K is the number of classes.

$$ \sigma(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}} $$
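A direct NumPy rendering of this equation (the max-subtraction is a standard numerical-stability detail we add):

```python
import numpy as np

def softmax(x):
    """sigma(x)_i = exp(x_i) / sum_j exp(x_j), computed stably."""
    z = np.exp(x - np.max(x))   # subtracting max(x) avoids overflow
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # each output in (0, 1); they sum to 1
```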

3.4.4 Evaluations on GTSRB, GTSDB, BTSC, and TSRD

Our work was tested on GTSRB, GTSDB, BTSC, and TSRD. Table 6 shows the comparison metrics for related work and the proposed method.

Table 6 Comparison metrics for related work and the proposed method

Our CNN architecture achieved accuracies of 99.76%, 99.81%, 99.79%, and 98.37% on the GTSRB, GTSDB, BTSC, and TSRD+GTSRB datasets, respectively. The training and validation accuracy and loss graphs are shown in Figs. 8, 9, 10, 11, 12, 13, 14, and 15.

Fig. 8 Training accuracy result of GTSRB

Fig. 9 Training loss result of GTSRB

Fig. 10 Training accuracy result of GTSDB

Fig. 11 Training loss result of GTSDB

Fig. 12 Training accuracy result of BTSC

Fig. 13 Training loss result of BTSC

Fig. 14 Training accuracy result of TSRD+GTSRB

Fig. 15 Training loss result of TSRD+GTSRB

4 Conclusion

Advanced driver assistance systems (ADAS) are helpful in today’s motor vehicles, and one of their essential functions is to assist autonomous vehicles and drivers with traffic signs. This research proposed a CNN model for traffic sign recognition and detection based on deep learning. Traffic sign detection is a difficult task. Above all, our model consistently learns the data, and its accuracy increases as it is trained with more data. Our deep CNN is more reliable and accurate than comparable CNNs, even on blurred, rotated, and distorted images, correctly performing image classification and recognition tasks. Our future work will consist of two components. The first is to collect traffic signs under difficult weather situations as test cases to create our own dataset. The second is to use Generative Adversarial Networks (GANs) to further optimize the approach and build a complete traffic sign classification and detection system.