1 Introduction

Development of autonomous vehicles has been one of the most prevalent research hotspots in the recent decade (Wang et al. 2014; Chen et al. 2019; Wang et al. 2020a). Traffic light recognition is a critical technology for autonomous vehicles: it provides information on the status, color, and number of signal lights, as well as the lanes controlled by each light (Fairfield and Urmson 2011; Possatti et al. 2019; De Charette and Nashashibi 2009b). Although traffic lights have been studied with various techniques, there are still challenges in identifying them: (1) in a complex and changing traffic environment, the recognition must be highly robust (Chiang et al. 2011; Wang et al. 2020b); (2) to ensure the safety of the vehicle while driving, the algorithm must run in real time (Greenhalgh and Mirmehdi 2012).

In the initial phase of developing traffic light recognition systems, feature-based methods (Saini et al. 2017; Lee et al. 2018; Cai et al. 2012; Diaz-Cabrera et al. 2015; Hosseinyalamdary and Yilmaz 2017; Wang and Xiong 2016) were widely adopted. For example, an ellipsoid geometry threshold model in HSV color space is built to extract color regions of interest. In addition, a kernel function is proposed to combine two heterogeneous features that describe the candidate regions of traffic lights (Liu et al. 2016). However, such methods do not perform well under diverse weather and brightness conditions (Wang et al. 2021). Furthermore, an adaptive background suppression filter is implemented to predict the location of traffic lights (Shi et al. 2015). This method highlights the traffic light candidate regions while suppressing the undesired background. Several other features, such as the aspect ratio, area, location, and context of traffic lights, have also been explored (Li et al. 2017; Kim et al. 2013; De Charette and Nashashibi 2009a). The contribution of these works is to design and apply various new features under one or more conditions to improve the accuracy of traffic light detection. Their common limitation, however, is that feature design based on the researcher's prior knowledge cannot cope with complex and diverse real-world scenarios.

In recent years, deep learning methods, which provide models imitating neural decision-making, have been applied to classification and object detection (Jensen et al. 2016). For example, in some works, various deep neural networks are trained as efficient classifiers on accumulated training data. The contribution of designing and using neural networks is to greatly improve the accuracy of traffic light recognition in dynamic scenes, because a neural network establishes an implicit function that describes the various characteristics of traffic lights (Lee and Kim 2019; Bach et al. 2018; John et al. 2015; Chen and Huang 2016). Recent studies reveal that combining image information with deep learning is a promising way to improve recognition performance (Wang and Zhou 2018; Wang et al. 2019; Hirabayashi et al. 2019): prior features are exploited to generate regions of interest (ROIs), and a neural network is used to determine the state or color of traffic lights. John et al. (2014) used image processing techniques to extract the texture, color, and shape features of the candidate area, after which the traffic light state is identified by an artificial neural network, a Multilayer Perceptron (MLP). In these works, preprocessing slightly reduces the amount of computation and saves processing time. Still, the common problem of deep learning methods is the excessive computation that slows down processing and causes instability in video detection.

In order to reduce computational redundancy and meet the real-time requirement of autonomous vehicles, some researchers have explored informing drivers of the position, status, and remaining time of traffic lights through Vehicle-to-roadside-Infrastructure (V2I) communication or GPS in the last several years. For example, Hirabayashi et al. (2019) use the current location to find traffic lights on the road, and Ci et al. (2019) study the effect of V2I on traffic flow at signalized intersections. However, the large-scale introduction of V2I requires a large investment in infrastructure, which will not be possible in the short term. Therefore, it is still meaningful to study onboard traffic light recognition algorithms.

This paper proposes a novel traffic light recognition strategy. First, a multi-thread program is built, in which the video reading, the CNN model, and the ICFT are arranged. Then, the detection and tracking modules cooperate to search for traffic light targets and determine the light color. Finally, the performance of the presented traffic light recognition method is validated in experiments. The results indicate that the presented method is a promising choice for traffic light recognition.

Three original innovations and contributions are underlined in this article: (1) a composite mechanism of traffic light recognition is constructed to jointly utilize both detection and tracking information; to the best of our knowledge, this is a novel attempt to combine deep learning and object tracking in the traffic light recognition of autonomous vehicles. (2) The architecture of the CNN and the features in the ICFT are well designed and suitable for traffic light recognition. (3) Compared with traditional image processing methods or a single deep learning algorithm, the proposed strategy achieves better recognition accuracy and speed for traffic lights.

The remainder of this paper is organized as follows: the details of the method are explained in Sect. 2. Section 3 describes the results and analysis of various experiments performed on the datasets. Finally, the conclusion and future work are summarized in Sect. 4.

2 Methods

In this section, the details of the method, including the mechanism, the CNN model, and the ICFT, are given. The constructions and mathematical formulations of these three parts are expounded carefully.

2.1 Mechanism of simultaneous detection and tracking

In this paper, an innovative mechanism of simultaneous detection and tracking is created. There are three threads in the mechanism: Reading, Detection and Tracking.

Figure 1 demonstrates the main process of the mechanism. The output of the detection thread is utilized as auxiliary information to update and correct the initial information for the tracking thread.

Fig. 1
figure 1

The mechanism of the traffic light recognition algorithm. The reading thread reads every frame from the input video at a speed of 100 frames per second. The detection thread produces the updated coordinate information of targets needed by the tracking thread. The tracking thread also runs on the newest image captured by the reading thread

Figure 2 describes the process on the time scale. The recognition process can be regarded as a cycle without a fixed period. Once the tracking module starts, the image frames are quickly processed based on the initial information, and the outputs are saved. After each detection, the inter-frame buffer discriminates newly appearing or disappearing targets, filters out mutations caused by false detections, and updates the target positions and number. In addition, the final candidate area given by the inter-frame buffer is corrected using MSER (Maximally Stable Extremal Regions) to acquire more accurate initial information for the tracking thread.
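As a minimal sketch of this three-thread mechanism (the shared-state layout and the callables detect_fn, refresh_targets, and track_fn are illustrative assumptions, not the authors' implementation), the reading thread keeps only the newest frame, the detection thread periodically refreshes the target list, and the tracking thread consumes both:

```python
import threading
import time
import cv2

latest_frame = None                 # newest image produced by the reading thread
frame_lock = threading.Lock()
targets = []                        # target boxes maintained by the inter-frame buffer
targets_lock = threading.Lock()

def reading_thread(video_path):
    """Read every frame and keep only the newest one in the shared slot."""
    global latest_frame
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        with frame_lock:
            latest_frame = frame

def detection_thread(detect_fn, refresh_targets):
    """Run the CNN on the newest frame and refresh the target list (hypothetical callables)."""
    global targets
    while True:
        with frame_lock:
            frame = None if latest_frame is None else latest_frame.copy()
        if frame is None:
            time.sleep(0.01)
            continue
        boxes = detect_fn(frame)                       # CNN detection (Sect. 2.2)
        with targets_lock:
            targets = refresh_targets(targets, boxes)  # inter-frame buffer + MSER (Sect. 2.4)

def tracking_thread(track_fn):
    """Track the current targets on the newest frame and publish the results."""
    while True:
        with frame_lock:
            frame = None if latest_frame is None else latest_frame.copy()
        with targets_lock:
            init = list(targets)
        if frame is None or not init:
            time.sleep(0.005)
            continue
        results = track_fn(frame, init)                # ICFT particle tracking (Sect. 2.3)
        # ... save or publish `results`
```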

Fig. 2
figure 2

The process in time scale. Each picture is read from the input video, and we take the sequence of pictures that we keep reading as the timeline. When the first frame of the video is read, the detection thread starts running immediately. The tracking thread won’t start until the first target is found

2.2 Deep learning based detection

A deep learning based method is implemented in the detection part. The CNN model consists of two main parts: the backbone and the backend. Table 1 shows the brief structure of the backbone network with its main layers. The backbone is composed of five residual network blocks. Different from sequential networks such as GoogLeNet and VGG19, residual networks can better solve the overfitting problem of deep neural networks. In terms of the number of layers, the network depth is strictly limited to increase the calculation speed: compared with Faster R-CNN (152 layers), the proposed network model has only 58 layers. Before the data enter each block, the feature map is processed by a convolutional layer with a stride of 2, so its size is reduced to a quarter and the number of filters is doubled. Softmax is used as the activation function.

Table 1 The Backbone of the Network
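The following PyTorch sketch illustrates the stage pattern described above (the channel widths, activation, and block layout are placeholders rather than the exact configuration of Table 1): a stride-2 convolution quarters the feature map area and doubles the filters before a residual block processes it.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions with an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)      # placeholder activation for the sketch

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)          # skip connection

class DownsampleStage(nn.Module):
    """Stride-2 conv: area reduced to a quarter (half per side), filters doubled."""
    def __init__(self, in_ch):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, in_ch * 2, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(in_ch * 2),
            nn.LeakyReLU(0.1),
        )
        self.block = ResidualBlock(in_ch * 2)

    def forward(self, x):
        return self.block(self.down(x))
```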

In the backend network, some models use only a single-scale feature. Many network models employ feature maps of different sizes to detect targets, such as SSD. However, SSD does not reuse the low-level high-resolution feature maps, that is, it does not make full use of the spatial information in the low-level feature maps, which is very important for the detection of small objects. Therefore, we add the feature maps obtained by the last residual blocks to the previous feature maps. Through such connections, the feature maps used in each prediction layer fuse different resolutions and different strengths of semantic features.

For the connection, the element-wise Add operation is adopted rather than the usual Concatenate layer. At the same time, since this method only adds extra cross-layer connections to the original network, practically no additional time or computation is required; the calculation amount of the Concatenate layer is twice that of the Add layer.
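A minimal sketch of such a cross-layer connection (the 1x1 projection and the channel counts are assumptions): the deeper, lower-resolution map is upsampled and added element-wise to an earlier map instead of being concatenated with it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AddFusion(nn.Module):
    """Fuse a deep, low-resolution feature map into a shallower one by element-wise Add."""
    def __init__(self, deep_ch, shallow_ch):
        super().__init__()
        # 1x1 conv so both maps have the same number of channels before adding
        self.proj = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)

    def forward(self, deep, shallow):
        deep = self.proj(deep)
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="nearest")
        # Add keeps the channel count unchanged; Concatenate would double the channels
        # fed to the next layer and roughly double its computation.
        return shallow + deep
```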

According to the explanation above, the entire CNN is illustrated in Fig. 3.

Fig. 3
figure 3

The structure of the detection network. The feature map is extracted in the backbone while detection is finished in the backend of the network. The number of network layers and the number of filters in the backbone are set properly to improve the computational efficiency without excessively reducing the recognition accuracy. Subsequently, the connection of multi-scale feature maps is created in the backend network to enhance the performance of small object detection without substantially increasing the calculation amount of the original model

2.3 Integrated channel feature tracking

In the proposed method, an integrated channel feature is used as the object model to compute the particle weights. The steps of the tracking algorithm are described below:

2.3.1 Target model description

This method uses the integrated channel feature as the description of the target model. The integrated feature includes HSV and LBP, whose calculation methods are introduced later.

2.3.2 Particle sample set and particle initialization

The position and size of the traffic light target in the video are represented by a rectangular box, so the state space \(s_{t}^{\left( n \right) }\) of a traffic light particle sample at time t is constructed from five parameters of the rectangle:

$$\begin{aligned} s_{t}^{\left( n \right) }=\left[ x_{t}^{\left( n \right) },y_{t}^{\left( n \right) },h_{t}^{\left( n \right) },w_{t}^{\left( n \right) },a_{t}^{\left( n \right) } \right] \end{aligned}$$
(1)

Where \(n\in \left\{ 1,2,\ldots ,N \right\}\) and N is the number of random particles; \(x_{t}^{\left( n \right) }\) and \(y_{t}^{\left( n \right) }\) denote the center coordinates of the rectangular box; \(h_{t}^{\left( n \right) }\) and \(w_{t}^{\left( n \right) }\) are the height and width of the rectangular box; and \(a_{t}^{\left( n \right) }\) is the corresponding scale factor. In particle initialization, a random set of N particles is generated, with each state vector obeying a Gaussian distribution.
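A small numpy sketch of this initialization (the standard deviations reuse the covariance diag(2.0, 2.0, 1.0, 1.0, 0.01) given below; the number of particles N is a free parameter): each particle state [x, y, h, w, a] is drawn from a Gaussian centred on the detected box.

```python
import numpy as np

def init_particles(box, n_particles=200, rng=None):
    """box = (x, y, h, w): centre coordinates and size of the detected traffic light."""
    rng = np.random.default_rng() if rng is None else rng
    x, y, h, w = box
    mean = np.array([x, y, h, w, 1.0])            # state [x, y, h, w, a], scale a = 1
    std = np.array([2.0, 2.0, 1.0, 1.0, 0.1])     # sqrt of diag(2.0, 2.0, 1.0, 1.0, 0.01)
    particles = mean + std * rng.standard_normal((n_particles, 5))
    weights = np.full(n_particles, 1.0 / n_particles)   # equal initial weights
    return particles, weights
```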

The range of traffic lights in the image area can be estimated and restricted by the transition model. A second-order auto-regressive dynamics model is adopted. The particle sample set is propagated through the system state transition equation to obtain a new particle sample set:

$$\begin{aligned} \begin{aligned} s_{t}^{\left( n \right) }&={{A}_{1}}s_{t-1}^{\left( n \right) }+{{A}_{2}}s_{t-2}^{\left( n \right) }+BW \end{aligned} \end{aligned}$$
(2)
$$\begin{aligned} \begin{aligned}&W\sim N\left( 0,\varLambda \right) \end{aligned} \end{aligned}$$
(3)

Where \({{A}_{1}}\) and \({{A}_{2}}\) are the auto-regressive coefficients and B is the noise coefficient, taking \({{A}_{1}}=2.0,{{A}_{2}}=-1.0,B=1.0\). \(N\left( 0,\varLambda \right)\) denotes the Gaussian distribution with zero mean and covariance \(\varLambda =\text {diag}\left( \sigma _{x}^{2},\sigma _{y}^{2},\sigma _{w}^{2},\sigma _{h}^{2},\sigma _{a}^{2} \right)\). Here, \(\sigma _{x}^{2}=2.0,\sigma _{y}^{2}=2.0,\sigma _{w}^{2}=1.0,\sigma _{h}^{2}=1.0,\sigma _{a}^{2}=0.01\).
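A sketch of the state transition of Eqs. (2)-(3) with the stated coefficients A1 = 2.0, A2 = -1.0, B = 1.0 (keeping the particle sets of the two previous time steps is an implementation assumption):

```python
import numpy as np

A1, A2, B = 2.0, -1.0, 1.0
SIGMA = np.sqrt([2.0, 2.0, 1.0, 1.0, 0.01])     # stddevs from the covariance Lambda

def propagate(particles_t1, particles_t2, rng=None):
    """Second-order auto-regressive transition: s_t = A1*s_{t-1} + A2*s_{t-2} + B*W."""
    rng = np.random.default_rng() if rng is None else rng
    noise = SIGMA * rng.standard_normal(particles_t1.shape)
    return A1 * particles_t1 + A2 * particles_t2 + B * noise
```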

2.3.3 Weight calculation

First, the histograms of the Hue and Saturation channels of the target image and of each particle sample are computed separately. Then, the Bhattacharyya coefficient is used to calculate the likelihood between the two histograms:

$$\begin{aligned} \beta \left( {{p}_{s_{t}^{\left( n \right) }}},{{q}_{0}} \right) =\frac{1}{2}\underset{H,~S}{\mathop \sum }\,\underset{u=1}{\overset{m}{\mathop \sum }}\,\sqrt{p_{s_{t}^{\left( n \right) }}^{\left( u \right) }q_{0}^{\left( u \right) }} \end{aligned}$$
(4)

Where \(p_{s_{t}^{\left( n \right) }}^{\left( u \right) }\) denotes each histogram bin of one particle sample, \({{q}_{0}}\) is the corresponding histogram bin of the target, and m is the number of histogram bins. The color weight \(c_{t}^{\left( n \right) }\) of each particle sample \(s_{t}^{\left( n \right) }\) is calculated from the Bhattacharyya coefficient:

$$\begin{aligned} \begin{aligned} c_{t}^{\left( n \right) }&={{f}_{c}}\frac{1}{\sqrt{2\pi \sigma }}{{e}^{-\frac{\left[ 1-\beta \left( {{p}_{s_{t}^{\left( n \right) }}},{{q}_{0}} \right) \right] }{2{{\sigma }^{2}}}}} \end{aligned} \end{aligned}$$
(5)
$$\begin{aligned} \begin{aligned} {{f}_{c}}&=\frac{1}{\mathop {\sum }_{n=1}^{N}\frac{1}{\sqrt{2\pi \sigma }}{{e}^{-\frac{\left[ 1-\beta \left( {{p}_{s_{t}^{\left( n \right) }}},{{q}_{0}} \right) \right] }{2{{\sigma }^{2}}}}}} \end{aligned} \end{aligned}$$
(6)

Where \({{f}_{c}}\) is the normalization coefficient as well as the following \({{f}_{h}}\).
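A sketch of Eqs. (4)-(6) in Python (the histogram bin count and the bandwidth sigma are assumptions, not values reported in the paper):

```python
import numpy as np
import cv2

def hs_hist(patch_bgr, bins=16):
    """Per-channel Hue and Saturation histograms, each normalised to sum to 1."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    h = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).ravel()
    s = cv2.calcHist([hsv], [1], None, [bins], [0, 256]).ravel()
    return h / (h.sum() + 1e-12), s / (s.sum() + 1e-12)

def bhattacharyya_hs(p_hs, q_hs):
    """Eq. (4): half the sum of the per-channel Bhattacharyya coefficients."""
    return 0.5 * sum(np.sum(np.sqrt(p * q)) for p, q in zip(p_hs, q_hs))

def color_weights(particle_hists, target_hist, sigma=0.2):
    """Eqs. (5)-(6): Gaussian-shaped likelihood, normalised over the particle set."""
    beta = np.array([bhattacharyya_hs(p, target_hist) for p in particle_hists])
    w = np.exp(-(1.0 - beta) / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma)
    return w / w.sum()
```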

Second, the LBP histogram is calculated: the LBP code of each pixel is computed, and the histogram of each cell counts the frequency of each (decimal) LBP value. Similarly, the LBP weight \(h_{t}^{\left( n \right) }\) is calculated:

$$\begin{aligned} h_{t}^{\left( n \right) }={{f}_{h}}\frac{1}{\sqrt{2\pi \sigma }}{{e}^{-\frac{\left[ 1-\beta \left( {{j}_{s_{t}^{\left( n \right) }}},{{k}_{0}} \right) \right] }{2{{\sigma }^{2}}}}} \end{aligned}$$
(7)

Where \({{j}_{s_{t}^{\left( n \right) }}}\) is the LBP histogram of each particle sample and \({{k}_{0}}\) is the LBP histogram of the target.
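A compact numpy sketch of the LBP histogram used here (plain 8-neighbour LBP on a grayscale patch; the cell partitioning is omitted for brevity and the 256-bin layout is an assumption):

```python
import numpy as np

def lbp_histogram(gray):
    """8-neighbour LBP codes of the interior pixels, returned as a normalised 256-bin histogram."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                                   # centre pixels
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(shifts):
        nbr = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= ((nbr >= c).astype(np.int32) << bit)    # set bit if neighbour >= centre
    hist = np.bincount(code.ravel(), minlength=256).astype(float)
    return hist / (hist.sum() + 1e-12)
```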

Also, the distance weight is calculated:

$$\begin{aligned} \begin{aligned} r_{t}^{\left( n \right) }&={{(x_{t}^{\left( n \right) }-{{x}_{0}})}^{2}}+{{\left( y_{t}^{\left( n \right) }-{{y}_{0}} \right) }^{2}} \end{aligned} \end{aligned}$$
(8)
$$\begin{aligned} \begin{aligned} R_{t}^{\left( n \right) }&={{e}^{-\frac{r{{_{t}^{\left( n \right) }}^{2}}/2{{\sigma }^{2}}}{\sigma \sqrt{2\pi }}}} \end{aligned} \end{aligned}$$
(9)
$$\begin{aligned} \begin{aligned} \sigma&=\frac{1}{3}\left( w_{t}^{\left( n \right) }\right) ^{2} \end{aligned} \end{aligned}$$
(10)

Where \({{x}_{0}}, {{y}_{0}}\) are the coordinates of the target center in the image, \(r_{t}^{\left( n \right) }\) measures the distance between each particle and the target center, and \(R_{t}^{\left( n \right) }\) is the distance weight.

Then the distance weight is assigned to every feature weight:

$$\begin{aligned} \begin{aligned} C_{t}^{\left( n \right) }&=c_{t}^{\left( n \right) }R_{t}^{\left( n \right) } \end{aligned} \end{aligned}$$
(11)
$$\begin{aligned} \begin{aligned} H_{t}^{\left( n \right) }&=h_{t}^{\left( n \right) }R_{t}^{\left( n \right) } \end{aligned} \end{aligned}$$
(12)

Therefore, the Integrated Channel Feature Based Weight can be obtained:

$$\begin{aligned} I_{t}^{\left( n \right) }=\sqrt{C_{t}^{\left( n \right) }H_{t}^{\left( n \right) }} \end{aligned}$$
(13)

The weighted average of the particle sample set is estimated as the output of the object tracking:

$$\begin{aligned} E\left( s_{t}^{\left( n \right) } \right) =\underset{n=1}{\overset{N}{\mathop \sum }}\,I_{t}^{\left( n \right) }s_{t}^{\left( n \right) } \end{aligned}$$
(14)
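Putting Eqs. (8)-(14) together, the following sketch fuses the weights and computes the state estimate. The color and LBP weights come from routines such as those above; the distance weight is implemented here as a plain Gaussian of the particle-to-target distance, a simplification of Eqs. (8)-(10), and the final weights are normalised so that Eq. (14) is a proper weighted average.

```python
import numpy as np

def integrated_weights(particles, c_w, h_w, target_xy):
    """Eqs. (8)-(13): fuse the color, LBP, and distance weights into one weight per particle."""
    x0, y0 = target_xy
    d2 = (particles[:, 0] - x0) ** 2 + (particles[:, 1] - y0) ** 2   # squared distance, cf. Eq. (8)
    sigma = particles[:, 3] / 3.0          # spread tied to the particle width (simplified Eq. (10))
    R = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian distance weight, cf. Eq. (9)
    I = np.sqrt((c_w * R) * (h_w * R))     # Eqs. (11)-(13): geometric mean of the weighted features
    return I / I.sum()

def estimate_state(particles, I):
    """Eq. (14): weighted mean of the particle set as the tracking output."""
    return (I[:, None] * particles).sum(axis=0)
```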

2.3.4 Re-sampling

First, the particles are sorted according to their weights, and then a new set of particles is re-sampled according to the discrete probability distribution obtained after sorting. The newly generated particles are given equal initial weights. To maintain particle diversity, Gaussian noise is added during re-sampling.
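A sketch of this step as multinomial re-sampling with Gaussian jitter (the explicit sorting is omitted since multinomial draws do not require it, and the jitter scale is an assumption):

```python
import numpy as np

def resample(particles, weights, jitter_std=0.5, rng=None):
    """Draw a new particle set according to the weights, reset the weights to equal values,
    and add Gaussian noise to keep the particles diverse."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(particles)
    idx = rng.choice(n, size=n, replace=True, p=weights)
    new_particles = particles[idx] + jitter_std * rng.standard_normal(particles.shape)
    new_weights = np.full(n, 1.0 / n)
    return new_particles, new_weights
```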

2.4 Inter-frame buffer

Fig. 4
figure 4

The process of inter-frame buffer

The inter-frame buffer is demonstrated in Fig. 4. Assume that after the kth detection a certain target is found; its distance to every target tracked in the previous frame is compared with a threshold to determine whether it is a new target. If it is a new target, it is not tracked immediately in this cycle.

Then, in the \((k+1)th\) detection, if the target still appears, it starts to be tracked, that is, the new target enters the tracking thread. If the target does not appear in the \((k+1)th\) detection, it does not enter the tracking.
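A sketch of this two-step confirmation (the distance threshold, box format, and the handling of pending candidates are assumptions): a detection far from every tracked target becomes a pending candidate and is only promoted if it is detected again in the next cycle.

```python
import numpy as np

def inter_frame_buffer(tracked, detections, pending, dist_thresh=30.0):
    """Return the confirmed targets and the candidates still waiting for a second detection."""
    confirmed, new_pending = list(tracked), []
    for det in detections:                              # det = (x, y, h, w)
        dists = [np.hypot(det[0] - t[0], det[1] - t[1]) for t in tracked]
        if dists and min(dists) < dist_thresh:
            continue                                    # matches an already tracked target
        seen_before = any(np.hypot(det[0] - p[0], det[1] - p[1]) < dist_thresh
                          for p in pending)
        if seen_before:
            confirmed.append(det)                       # appeared in two consecutive detections
        else:
            new_pending.append(det)                     # wait for the (k+1)th detection
    return confirmed, new_pending
```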

After the new target is confirmed, MSER is carried out. MSER performs a binarization process on the image after it has been converted to gray-scale. As shown in Fig. 5, the coordinates of dark traffic light housings set against other backgrounds can be corrected by MSER.
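As a sketch of this correction with OpenCV's MSER (the padding and the "pick the largest stable region" rule are assumptions; boxes use the same centre-based (x, y, h, w) format as the particle state):

```python
import cv2
import numpy as np

def mser_correct(gray, box, pad=10):
    """Refine a candidate box with MSER on the surrounding gray-scale patch."""
    x, y, h, w = box
    x0, y0 = max(0, int(x - w / 2) - pad), max(0, int(y - h / 2) - pad)
    x1 = min(gray.shape[1], int(x + w / 2) + pad)
    y1 = min(gray.shape[0], int(y + h / 2) + pad)
    patch = gray[y0:y1, x0:x1]
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(patch)
    if len(bboxes) == 0:
        return box                                      # nothing stable found, keep the box
    bx, by, bw, bh = max(bboxes, key=lambda b: b[2] * b[3])   # largest stable region
    return (x0 + bx + bw / 2, y0 + by + bh / 2, float(bh), float(bw))
```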

Fig. 5
figure 5

MSER correction

3 Results and discussion

First, the detection network is trained and four derived networks are compared afterward. Meanwhile, a performance comparison between the common single-channel feature and the proposed integrated channel feature is also conducted. Finally, the overall algorithm is tested.

3.1 Datasets

3.1.1 Berkeley deep drive 100K

44,932 images in which traffic lights appear under diverse traffic and weather conditions are obtained from the 80,000 annotation files of BDD100K. The resolution of the training images is 1280\(\times\)720 pixels and the frame rate is 30 FPS.

3.1.2 Local urban dataset

A local urban dataset (LUD) is established to evaluate the developed algorithm. The data are captured in Jiangbei District, Chongqing, China. The video acquisition device is a Logitech C922 HD camera fixed at the top of the front windshield of an electric vehicle. All videos are 720p with a frame rate of 30 FPS. After editing and filtering, 22 videos are retained, and these video sequences are split into 1770 images.

3.2 Evaluations of detection network

The computer used throughout the experiments is equipped with an Nvidia Titan XP with 12 GB memory, and the resolution of the training input images is 512\(\times\)288. The LUD is used to test the network models trained above.

The details of the detection results compared with YOLOv3 are listed in Table 2. Our network model and YOLOv3 have little difference in the numbers of TP and FN, so the recall rates are similar, but the number of FP is reduced by 15%, which means that our network has stronger anti-interference ability. Furthermore, the precision of our network increases by 2.3 percentage points, and the F1 score is also better. At the same time, the results indicate that our network has a faster calculation speed, with the operating speed increased by 23.8%. In summary, our network greatly increases the calculation speed while maintaining the recognition performance of existing excellent network models and showing stronger stability.

Table 2 Details of the detection results

To further prove the optimality of our network, four self-derived networks (MF, BR, MU, MFBR) are compared with our network in the experiment. The structural differences among these networks are given in Table 3. The presence or absence of the first residual block determines the initial resolution of the three feature maps: the feature map sizes of BR and MFBR are 4 times larger than those of the other models. The larger the feature map, the more conducive it is to recognizing small targets, but the fewer global features the receptive field captures. Setting more filters in the network yields more features; for example, MF and MFBR have twice as many filters as the other models and therefore obtain twice as many features. More features allow a more accurate description of the target, but obviously the amount of calculation is also greater. MU increases the up-sampling factor, which enlarges the feature map (the largest one) used to identify the smallest targets and is thus more beneficial to small-target recognition. There are two ways to join two feature maps in the connection layer: Add is more efficient, and Concat retains more information.

Table 3 Parameters of Models

Each network introduced above is trained on the identical device with identical settings. The details of the detection results are shown in Table 4 and Fig. 6. Regarding the precision rate, all the derived networks are higher than YOLOv3; in particular, MFBR is 6.4 percentage points higher. In contrast, the recall rate of most models is lower than or equal to that of YOLOv3; only our network is slightly higher. In terms of the F1 score, which measures the accuracy of a binary classification model, MFBR and our network are better. Regarding FPS, MU and our network are much faster than YOLOv3, with increases of 31.2% and 23.8% respectively. Although MFBR has the best precision rate and F1 score, its speed is too slow to reach real-time detection. Figure 7 shows several detection results of our network in the test.

Table 4 Details of detection results via self-derived networks
Fig. 6
figure 6

Comparison of the detection performance. In the daytime, the captured image conditions are favorable and objects similar to traffic lights are few, so a model with a better recall rate is preferred. Furthermore, the detection speed is always an indicator that we care about. According to the comprehensive comparison, our network is the optimal choice for the detection stage of the overall algorithm

Fig. 7
figure 7

Samples of traffic light detection results. The detected traffic lights are marked by red rectangles

3.3 Feature channel tracking comparison

The Intersection over Union (IoU) between the tracking result and the ground truth after a certain number of tracking steps is used to characterize the accuracy. The Success Rate Map and Accuracy Map of different tracking features are displayed in Fig. 8. The detailed data are shown in Table 5, with the best values in bold.

Fig. 8
figure 8

Performance of tracking

Table 5 Details of tracking test results

According to the experimental results, among these groups of channel features, the Average Error, Average IoU, and AUC of the single-channel features are not as good as those of the integrated channel features. Furthermore, in our algorithm, HSV+LBP reaches 52.5 FPS, which is much faster than the other integrated groups. Overall, the performance of the integrated channel feature is satisfactory, and its accuracy and stability are better than those of the single-channel features.

3.4 Entire algorithm performance evaluation

The five test videos used in the evaluation are shown in Fig. 10, and Table 6 reveals the results of this test. The algorithm processes a total of 14,103 frames during the experiment, which takes 660.2 seconds, for an actual average running frame rate of 21.4 FPS. Compared with the performance of YOLOv3 in Table 2 (14.47 FPS), the complete algorithm can process 47.9% more images in the same amount of time.

Referring to the precision and recall rates, the precision rate ranges from 0.937 to 0.973 with an average of 0.962, and the recall rate ranges from 0.834 to 0.953 with an average of 0.909. Both are superior to the results of the previous evaluation of YOLOv3: the precision rate increases by 15.9% and the recall rate by 8.5%. As revealed by the experimental results, the proposed algorithm brings a significant improvement in traffic light recognition.

Nevertheless, the proposed algorithm still has certain limitations. Figure 9 shows two typical defects observed in the experiment: (a) the rightmost traffic light in the bottom row is missed; (b) although the traffic light is found, its box position is disturbed by the countdown indicator next to the light. A higher precision rate means accurate recognition and few false detections, but the recall rate of the proposed method is relatively low, that is, there are cases of missed detection. In addition, when the target pixel area is very small, the image composed of the half-black countdown indicator and a red number resembles a traffic light, resulting in inaccurate positioning of the traffic light (Fig. 10).

Fig. 9
figure 9

Mistakes in detection result

For the five video test results in Table 6, we conducted a statistical significance test to determine whether the average performance of the proposed method is significantly improved compared to YOLOv3 (results in Table 2). The test process for FPS is shown below. The significance level \(\alpha\) is set to 0.05, and the hypotheses are as follows:

\({ H_{0} }\): The FPS of the proposed method is not higher than that of YOLOv3.

\({ H_{1} }\): The FPS of the proposed method is higher than that of YOLOv3.

For a test of the mean of a single normal population with unknown standard deviation, the T-test is used:

$$S = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^{n}{(X_{i} - \bar{X})^{2}}} = 1.975$$
(15)
$$T = \frac{\bar{X} - \mu_{0}}{S}\sqrt{n} = 7.846$$
(16)
$$\text{Rejection interval}:\{ t > t_{1-\alpha}(n-1) = 2.132\}$$
(17)

Where S is the sample standard deviation and T is the test statistic. Since T falls in the rejection interval, \(H_{0}\) is rejected, that is, \(H_{1}\) is accepted. The significance test results for Precision and Recall are the same. Therefore, the improvement of the proposed method over YOLOv3 can be considered not accidental.
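The test statistic can be reproduced directly from the reported numbers (sample mean 21.4 FPS over the n = 5 videos, sample standard deviation S = 1.975, and the YOLOv3 baseline mu0 = 14.47 FPS); a short check:

```python
import numpy as np
from scipy import stats

n, x_bar, s, mu0, alpha = 5, 21.4, 1.975, 14.47, 0.05

t_stat = (x_bar - mu0) / s * np.sqrt(n)          # Eq. (16): about 7.846
t_crit = stats.t.ppf(1 - alpha, df=n - 1)        # t_{0.95}(4) = 2.132
print(f"T = {t_stat:.3f}, reject H0: {t_stat > t_crit}")
```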

Fig. 10
figure 10

Proportions of light colors in the videos. The test videos contain 10712 targets, of which 7132 are red lights and 3589 are green lights

Table 6 Details of evaluation results

The color of the traffic lights is determined from the hue feature. The corresponding confusion matrix is shown in Fig. 11. The recognition accuracy of red lights is higher than that of green lights, because red differs from the background colors more strongly than green does, especially against blue sky and green trees.
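A minimal sketch of hue-based color classification (the hue and saturation thresholds below are illustrative assumptions, not the values used in the paper):

```python
import cv2
import numpy as np

def classify_light_color(patch_bgr):
    """Classify a traffic light patch as red or green from its dominant hue.
    OpenCV's 8-bit hue range is [0, 180); the thresholds are illustrative only."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    mask = (s > 80) & (v > 80)                      # keep saturated, bright pixels
    if mask.sum() == 0:
        return "unknown"
    hue = h[mask]
    red = np.count_nonzero((hue < 10) | (hue > 160))
    green = np.count_nonzero((hue > 35) & (hue < 90))
    return "red" if red >= green else "green"
```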

Fig. 11
figure 11

Confusion matrix of color classification

Figure 12 reveals the time consumption of the detection thread, the tracking thread, and the total process. The average detection time per frame of the detection thread is not much different from that in the previous experiment, at roughly 0.056 s. However, the average tracking time per frame of the tracking thread is much shorter, about 0.019 s. The shorter the time required to process the task, the lower the computational complexity of the algorithm. As shown in Fig. 12, both the tracking thread and the detection thread can locate traffic light targets, but the average time for the detection thread to process each frame is 2.6 times that of the tracking thread; that is, the computational complexity of the detection network model is 2.6 times that of the tracking model. Compared with other deep learning methods, the FPS of YOLOv3 is 14.47 while that of the proposed method is 21.4, so the proposed method processes 47.9% more frames than YOLOv3 in the same time and can meet the requirement of real-time application.

Fig. 12
figure 12

Processing time of every thread

4 Conclusion

To enhance the usability of traffic light recognition systems in autonomous vehicles, this article employs a CNN and ICFT to determine the coordinates and color of traffic lights, improving recognition accuracy and processing speed by combining detection and tracking. The experimental results first establish the optimality of the presented CNN model and ICFT: the Recall (0.842) and Precision (0.853) of the modified model are close to those of YOLOv3 (0.838 and 0.830), while its FPS (17.92) is higher than 14.47. Additionally, the ICFT is shown to achieve better performance, with 4.393 Average Error, 0.567 Average IoU, and 0.344 AUC, than single-channel feature tracking. The overall test further demonstrates the superiority of the proposed method, indicating that it can be adapted to autonomous vehicles and achieve better performance.

Future work will focus on three perspectives: (1) port the traffic light recognition system of this article to a system-on-chip and deploy it on a real vehicle; (2) communicate the traffic light information via 5G, so that the efficiency and safety of networked autonomous vehicles can be improved by sharing this information; (3) employ more advanced algorithms to improve the adaptability of the CNN to different places; reinforcement learning (RL) is a promising way to train the network without extensive manual labeling.