
1 Introduction

In recent years, self-driving cars have become a hot topic that receives a lot of attention from both academia and industry. One of the key components of a self-driving car is the computer vision module used for obtaining various types of traffic data from the environment. Traffic signs are among the crucial data a self-driving car needs to operate properly. For example, based on the instructions on traffic signs, the vehicle knows whether it can turn left or right, or whether it must reduce its speed. Therefore, a traffic sign detection system is a must for any self-driving car system [3].

Detecting a single traffic sign is not a difficult problem, as most traffic signs have simple patterns with features that are easy to extract. Detecting and differentiating many traffic signs [4], however, is a challenging problem, as many traffic signs have similar patterns. Besides accuracy, processing time is another factor of concern. For an application like self-driving cars, any mistake or delay in detecting and classifying a traffic sign might lead to serious consequences. The problem is exacerbated in developing countries with modest traffic infrastructure, where traffic signs are often blocked by obstacles.

Thanks to the development of advanced object detection algorithms, traffic sign detection has become much more approachable than it was less than 10 years ago. Among the possible approaches for traffic sign detection, deep learning based algorithms are likely to have the best performance in terms of accuracy and processing time. A tremendous number of experiments have shown that deep learning based techniques like You Only Look Once (YOLO) [5] and Single Shot Detection (SSD) [6] perform very well in many object detection tasks. However, compared to typical object detection tasks, traffic sign detection [7] is different in that the number of object classes, i.e., the number of types of traffic signs, is much larger. The larger the number of classes, the higher the possibility of misclassifying the detected object [8].

This paper focuses on building a traffic sign detection application to detect popular traffic signs in Vietnam. The application receives a traffic video as input, locates the regions of the traffic signs in the video, and recognizes these signs. To train the traffic sign detection model, a large dataset consisting of 16,770 images of 54 types of traffic signs has been built. The performance of the proposed application has been tested and evaluated using various metrics. Based on the experimental results, an analysis of the detection errors of the application is also provided.

Fig. 1. System design.

2 Proposed System Architecture

System Design: The design of the proposed system is described in Fig. 1. The inputs of the system are traffic video frames. A transfer learning model based on YOLOv4 is used to detect the traffic signs in each video frame and obtain the labels of these signs. The contents of these labels are then shown to users through the web-based interface of the system.
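As a rough illustration of this frame-by-frame flow, the sketch below uses OpenCV's dnn module to load a Darknet YOLOv4 model and crop each detected sign. The file names yolov4-signs.cfg/weights and the helper process_video are hypothetical placeholders for illustration, not the exact code of the system.

```python
import cv2  # OpenCV >= 4.4 ships YOLOv4 support in its dnn module

# Hypothetical file names for the trained model; substitute the real paths.
net = cv2.dnn.readNetFromDarknet("yolov4-signs.cfg", "yolov4-signs.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

def process_video(path, class_names, conf_threshold=0.25):
    """Run the detector on every frame and yield (label, confidence, cropped sign)."""
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of the video
        class_ids, scores, boxes = model.detect(frame, confThreshold=conf_threshold,
                                                nmsThreshold=0.4)
        for class_id, score, (x, y, w, h) in zip(class_ids, scores, boxes):
            crop = frame[y:y + h, x:x + w]  # region later shown on the web interface
            yield class_names[int(class_id)], float(score), crop
    cap.release()
```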

Transfer Learning Model Based on YOLOv4: YOLOv4 [9] introduces many enhancements that improve both the accuracy and the speed of its predecessor YOLOv3 [10] on the same COCO dataset and the same V100 GPU. The structure of YOLOv4 is divided into four parts: backbone, neck, dense prediction, and sparse prediction.

The backbone network for object detection is usually pre-trained on the ImageNet classification task. Pre-training means that the weights of the network have already been adjusted to identify relevant features in an image; they are then fine-tuned on the new object detection task. The authors consider the following backbones: CSPResNeXt50, CSPDarknet53, and EfficientNet-B3.

The neck is responsible for mixing and matching the feature maps learned by the backbone (feature extraction) before passing them to the detection stage (which YOLOv4 calls dense prediction).

YOLOv4 allows customization of the neck with structures such as FPN, PAN, NAS-FPN, BiFPN, ASFF, SFAM, and SPP.
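To make the relation between these parts concrete, the following conceptual sketch (our own illustration, not the actual YOLOv4 implementation) shows how an image flows through the backbone, the neck, and the dense-prediction heads:

```python
# Conceptual sketch of the YOLOv4 structure described above: backbone -> neck -> heads.
class YOLOv4Sketch:
    def __init__(self, backbone, neck, heads):
        self.backbone = backbone  # e.g. CSPDarknet53: extracts multi-scale feature maps
        self.neck = neck          # e.g. SPP + PAN: mixes features across scales
        self.heads = heads        # dense prediction: one detection head per scale

    def forward(self, image):
        features = self.backbone(image)   # feature maps at several scales
        fused = self.neck(features)       # feature maps enriched across scales
        # Each head outputs raw box coordinates, objectness and class scores for its scale.
        return [head(f) for head, f in zip(self.heads, fused)]
```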

Fig. 2. Experimental procedure.

3 Experiment

The procedure of the experiment in this paper is described in Fig. 2. The experiment includes four steps: data preparation, data labeling, model training, and performance analysis.

Table 1. Mapping between class_id and sign label.

3.1 Datasets

In this paper, the dataset of traffic signs was collected in two ways: image collection from Google search results and video recording. Most of the data was collected by video recording because of its closeness to reality, the variety of contexts, and the presence of noise that images available on Google rarely contain. The video recording was done in two ways: one is filming in the field (going out to the street to record traffic signs), the other is recording the imagery from Google Maps satellite view played back on a screen. The first way yields more images than the second. However, the second way is used to supplement data for signs that are difficult to encounter in real life. If this second way still does not provide enough samples, sample signs are stitched into real contexts to create realistic images and ensure a sufficient number of images for those signs.

The collected signs are common signs that can be encountered on the road, with label names based on the traffic manual. There are 54 labels in total, of which 53 are single signs; the remaining category contains images deemed complex or not belonging to the selected set of signs, as shown in Table 1.

This extra label was added for later extensions of the problem. After labeling the images and videos, there are 16,770 images in total, of which 13,439 belong to the training set and 3,331 to the test set. Figure 3 shows the number of images assigned to each label.

Fig. 3. Number of photos per set.

3.2 Data Preprocessing

The collected images vary in size and color format. Therefore, to be used in the model, the image data has to go through several preprocessing steps, listed below (a minimal code sketch follows the list):

  • Read the image, then convert its color channels to RGB format so that all images have a consistent number of color channels matching the model input.

  • Resize the image to the required size of 416 × 416 pixels, so that all images end up with shape 416 × 416 × 3.
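A minimal sketch of these two steps, assuming OpenCV is used for reading and resizing (the helper name preprocess_image and the [0, 1] scaling are our own illustration):

```python
import cv2
import numpy as np

def preprocess_image(path, size=(416, 416)):
    """Read an image, convert its channels to RGB and resize to 416 x 416 x 3."""
    image = cv2.imread(path)                        # OpenCV loads images as BGR
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # consistent RGB channel order
    image = cv2.resize(image, size)                 # 416 x 416 x 3, as the model expects
    return image.astype(np.float32) / 255.0         # scale pixel values to [0, 1]
```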

After preprocessing, we train YOLOv4 on the labeled images from the dataset, starting from pre-trained YOLOv4 weights. The training configuration is: batch = 64, subdivisions = 16, max_batches = 108000, steps = 86400,97200, filters = 177, classes = 54, width = 416, height = 416.
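These values follow the usual Darknet conventions for YOLOv4; the short check below (our own illustration, not part of the training code) shows how filters, max_batches, and steps are derived from the number of classes:

```python
# Darknet rule-of-thumb configuration for a 54-class YOLOv4 model.
num_classes = 54
anchors_per_scale = 3

# Each YOLO head predicts (x, y, w, h, objectness) + one score per class, per anchor.
filters = anchors_per_scale * (num_classes + 5)
assert filters == 177

# max_batches is commonly set to 2000 iterations per class (at least 6000).
max_batches = max(2000 * num_classes, 6000)
assert max_batches == 108000

# The learning-rate decay steps are usually 80% and 90% of max_batches.
steps = (int(0.8 * max_batches), int(0.9 * max_batches))
assert steps == (86400, 97200)
```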

3.3 Evaluation Methods

Performance metrics for the object detection problem include:

  • IoU (Intersection over Union) measures the degree of overlap between two boxes (usually the predicted box and the ground-truth box) and is used to decide whether two boxes match. It is calculated as the area of the intersection of the two boxes divided by the area of their union (a small computation sketch is given after this list).

  • Precision measures how accurate the model’s predictions are, i.e., the percentage of the model’s predictions that are correct.

  • Recall measures how well the model finds all positive samples.
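The following sketch shows how these quantities can be computed; the helper names iou and precision_recall are our own, written for illustration:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Coordinates of the intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(true_positives, false_positives, false_negatives):
    """Precision and recall from detection counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall
```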

Fig. 4. mAP.

From the precision and recall defined above, we can also evaluate the model by varying a confidence threshold and observing how precision and recall change. The area under the resulting curve is defined analogously to the Area Under the Curve (AUC); for the Precision-Recall curve, this area is called the Average Precision (AP). Suppose there are N thresholds, and each threshold n gives a precision-recall pair \((P_n, R_n)\), \(n = 1, 2, \ldots, N\), with \(R_0 = 0\). The Precision-Recall curve is drawn by plotting each point \((R_n, P_n)\) on the coordinate axes and connecting them. AP is defined by:

$$\begin{aligned} AP = \sum \limits _{n = 1}^{N} (R_n - R_{n-1}) \, P_n \end{aligned}$$
(1)

In multi-class object detection, mAP is the average of the AP values calculated over all classes.
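A short sketch of Eq. (1), assuming the precision and recall values have already been computed at the N thresholds and ordered by increasing recall (the helper names average_precision and mean_average_precision are our own illustration):

```python
def average_precision(precisions, recalls):
    """AP as the weighted sum of precisions, Eq. (1): sum_n (R_n - R_{n-1}) * P_n."""
    ap, prev_recall = 0.0, 0.0  # prev_recall starts at R_0 = 0
    for p, r in zip(precisions, recalls):  # points ordered by increasing recall
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP is simply the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```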

Fig. 5. SIGN detection demo.

3.4 Results

Training proceeded at roughly 4,000 iterations per day and took approximately 27 days to complete. Many models were saved during this period: checkpoints at iterations 10000, 20000, and so on, together with the models saved when mAP was computed at smaller intervals. We compared the obtained models, and the best one achieves mAP@0.5 = 94.81% and mAP@0.75 = 68.53%.

From Figs. 4a and 4b and Table 1, it can be seen that the overall evaluation results on the dataset are very good, with mAP@0.5 reaching 94.81% and mAP@0.75 = 68.53%. Only a few classes have low accuracy; for example, class_id = 7, the sign prohibiting motorcycles and tricycles, has a very low AP of 22.85% at the mAP@0.5 rating and AP = 0 at the mAP@0.75 rating.

3.5 SIGN Detection Application

To build this application, we use Python with Flask as the main library, which allows us to create a web-based interface. After the signs are detected and their classes determined, the regions containing the signs are cropped from the video frames and shown, together with their contents, in the right-hand panels as illustrated in Fig. 5. The cropped image and its predicted category are transmitted to the web interface so that the driver can observe them. The application maintains a list of the 20 most recent signs, giving drivers more information about the signs on the upcoming road section.
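A hedged sketch of how such a Flask interface could expose the latest detections; the route names, the in-memory latest_signs list, and the add_detection helper are our illustrative assumptions, not the application's exact code:

```python
from collections import deque
from flask import Flask, jsonify, render_template

app = Flask(__name__)

# Keep the 20 most recent detections (label + path of the cropped sign image).
latest_signs = deque(maxlen=20)

def add_detection(label, crop_path):
    """Called by the detection loop whenever a new sign is recognized."""
    latest_signs.appendleft({"label": label, "image": crop_path})

@app.route("/")
def index():
    # Renders the page with the video stream and the right-hand sign panels.
    return render_template("index.html", signs=list(latest_signs))

@app.route("/signs")
def signs():
    # JSON endpoint the page can poll to refresh the list of recent signs.
    return jsonify(list(latest_signs))

if __name__ == "__main__":
    app.run(debug=True)
```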

Fig. 6. Typical errors of detecting and labeling.

3.6 Error Analysis

Some pairs of signs contain many details that look similar to each other, as illustrated in Fig. 6a; this easily causes confusion for the detection model. Frames with a large number of objects (signs), or with overlapping objects as shown in Fig. 6b, also make detection and recognition difficult. Another issue is detecting small objects in large scenes: in Fig. 6c, because the sign is so small relative to the frame, it is mistaken for sign #21 instead of sign #25. A few road signs are rarely used in practice, so collecting data for them is difficult, and the resulting class imbalance also affects the predictive performance of the model. In summary, the differences in accuracy across signs are due to the uneven complexity of the signs, the highly similar appearance of some signs, and the uneven proportions of the classes in the constructed dataset.

4 Concluding Remarks

In this article, we used YOLOv4 to detect traffic signs. The method gave a high result of 94.81% for mAP@0.5, but a rather low result of 68.53% for mAP@0.75 due to some cases that could not be identified, such as the signs prohibiting motorcycles and prohibiting three-wheeled vehicles. The accuracy for those signs was extremely low: AP = 22.85% at mAP@0.5 and AP = 0 at mAP@0.75. This happened because the numbers of images of the different signs in the dataset are quite uneven. In the future, we plan to improve and expand the dataset by recording more videos of different routes to make it closer to real life, and to add more images for the types of signs that currently have few samples. We also plan to develop methods that improve recognition quality by combining information from multiple frames. In addition to identifying the signs, we would like to extend the system to provide instructions or warnings based on the images collected from the dash camera. We hope to contribute this dataset to the community in order to motivate research on traffic sign recognition and to improve recognition efficiency with better methods.