1 Introduction

Development of autonomous vehicles has been one of the most prevalent research hotspots in the recent decade (Wang et al. 2014; Chen et al. 2019; Wang et al. 2020a). Traffic light recognition is a critical technology for autonomous vehicles: it provides information on the status, color, and number of signal lights, as well as the lanes controlled by each light (Fairfield and Urmson 2011; Possatti et al. 2019; De Charette and Nashashibi 2009b). Although traffic lights have been studied with various techniques, there are still challenges in identifying them: (1) in a complex and changing traffic environment, the recognition must be highly robust (Chiang et al. 2011; Wang et al. 2020b); (2) to ensure the safety of the vehicle while driving, the algorithm must run in real time (Greenhalgh and Mirmehdi 2012).

In the initial phase of developing traffic light recognition systems, feature-based methods (Saini et al. 2017; Lee et al. 2018; Cai et al. 2012; Diaz-Cabrera et al. 2015; Hosseinyalamdary and Yilmaz 2017; Wang and Xiong 2016) were widely adopted. For example, an ellipsoid geometry threshold model in HSV color space is built to extract color regions of interest. In addition, a kernel function is proposed to combine two heterogeneous features that describe the candidate regions of traffic lights (Liu et al. 2016). However, such methods do not perform well under diverse weather and brightness conditions (Wang et al. 2021). Furthermore, an adaptive background suppression filter is implemented to predict the location of traffic lights (Shi et al. 2015). This method highlights the traffic light candidate regions while suppressing the undesired background. Several other features, such as the aspect ratio, area, location, and context of traffic lights, have also been explored (Li et al. 2017; Kim et al. 2013; De Charette and Nashashibi 2009a). The contribution of these works is to design and apply various new features under one or more conditions to improve the accuracy of traffic light detection. Their common limitation, however, is that feature design based on the researcher's prior knowledge cannot cope with complex and diverse real-world scenarios.

In recent years, deep learning methods, which provide models imitating neural decision-making, have been applied to classification and object detection (Jensen et al. 2016). For example, in some works, various deep neural networks are trained as efficient classifiers on accumulated training data. The contribution of designing and using neural networks is to greatly improve the accuracy of traffic light recognition in dynamic scenes, because a neural network establishes an implicit function that describes the various characteristics of traffic lights (Lee and Kim 2019; Bach et al. 2018; John et al. 2015; Chen and Huang 2016). Recent studies reveal that combining image information with deep learning is a promising way to improve recognition performance (Wang and Zhou 2018; Wang et al. 2019; Hirabayashi et al. 2019): prior features are exploited to generate regions of interest (ROIs), and a neural network is used to determine the state or color of traffic lights. John et al. (2014) used image processing techniques to extract the texture, color, and shape features of the candidate area, after which the traffic light state is identified by an artificial neural network, a Multilayer Perceptron (MLP). In these works, preprocessing slightly reduces the amount of computation and saves processing time. Still, the common problem of deep learning methods is the excessive computation that slows down processing and causes instability in video detection.

In order to reduce computational redundancy and meet the real-time requirement of autonomous vehicles, some researchers have explored informing drivers of the position, status, and remaining time of traffic lights through Vehicle-to-roadside-Infrastructure (V2I) communication or GPS in the last several years. For example, Hirabayashi et al. (2019) use the current location to find traffic lights on the road, and Ci et al. (2019) study the effect of V2I on traffic flow at signalized intersections. However, the large-scale introduction of V2I requires a large investment in infrastructure, which will not be possible in the short term. Therefore, it is still meaningful to study onboard traffic light recognition algorithms.

This paper proposes a novel traffic light recognition strategy. First, a multi-thread program is built, in which the video reading, the CNN model, and the ICFT are arranged. Then, the detection and tracking modules cooperate to search for traffic light targets and determine the light color. Finally, the performance of the presented traffic light recognition method is validated in experiments. The results indicate that the presented method is a promising choice for traffic light recognition.

Three original innovations and contributions are underlined in this article: (1) a composite mechanism of traffic light recognition is constructed to jointly utilize both detection and tracking information; to the best of our knowledge, this is a novel attempt to combine deep learning and object tracking in the traffic light recognition of autonomous vehicles. (2) The architecture of the CNN and the features in the ICFT are well designed and suitable for traffic light recognition. (3) Compared with traditional image processing methods or a single deep learning algorithm, the proposed strategy achieves better recognition accuracy and speed for traffic lights.

The remainder of this paper is organized as follows: the details of the method are explained in Sect. 2. Section 3 describes the results and analysis of various experiments performed on the datasets. Finally, the conclusion and future work are summarized in Sect. 4.

2 Methods

In this section, the details of the method, including the mechanism, the CNN model, and the ICFT, are given. The constructions and mathematical formulations of these three parts are expounded carefully.

2.1 Mechanism of simultaneous detection and tracking

In this paper, an innovative mechanism of simultaneous detection and tracking is created. There are three threads in the mechanism: Reading, Detection and Tracking.

Figure 1 demonstrates the main process of the mechanism. The output of the detection thread is utilized as auxiliary information to update and correct the initial information for the tracking thread.

Fig. 1
figure 1

The mechanism of the traffic light recognition algorithm. The reading thread reads every frame from the input video at a speed of 100 frames per second. The detection thread produces the updated coordinate information of targets needed by the tracking thread. The tracking thread also runs on the newest image captured by the reading thread

Figure 2 describes the process on the time scale. The recognition process can be regarded as a cycle without a fixed period. Once the tracking module starts, the image frames are quickly processed based on the initial information, and the outputs are saved. After each detection, the inter-frame buffer discriminates newly appearing or disappearing targets, filters out mutations caused by false detections, and updates the target positions and number. In addition, the final candidate area given by the inter-frame buffer is corrected using MSER (Maximally Stable Extremal Regions) to acquire more accurate initial information for the tracking thread.
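As a minimal sketch of this three-thread mechanism (the shared-state layout and the callables detect_fn, refresh_targets, and track_fn are illustrative assumptions, not the authors' implementation), the reading thread keeps only the newest frame, the detection thread periodically refreshes the target list, and the tracking thread consumes both:

```python
import threading
import time
import cv2

latest_frame = None                 # newest image produced by the reading thread
frame_lock = threading.Lock()
targets = []                        # target boxes maintained by the inter-frame buffer
targets_lock = threading.Lock()

def reading_thread(video_path):
    """Read every frame and keep only the newest one in the shared slot."""
    global latest_frame
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        with frame_lock:
            latest_frame = frame

def detection_thread(detect_fn, refresh_targets):
    """Run the CNN on the newest frame and refresh the target list (hypothetical callables)."""
    global targets
    while True:
        with frame_lock:
            frame = None if latest_frame is None else latest_frame.copy()
        if frame is None:
            time.sleep(0.01)
            continue
        boxes = detect_fn(frame)                       # CNN detection (Sect. 2.2)
        with targets_lock:
            targets = refresh_targets(targets, boxes)  # inter-frame buffer + MSER (Sect. 2.4)

def tracking_thread(track_fn):
    """Track the current targets on the newest frame and publish the results."""
    while True:
        with frame_lock:
            frame = None if latest_frame is None else latest_frame.copy()
        with targets_lock:
            init = list(targets)
        if frame is None or not init:
            time.sleep(0.005)
            continue
        results = track_fn(frame, init)                # ICFT particle tracking (Sect. 2.3)
        # ... save or publish `results`
```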

Fig. 2
figure 2

The process in time scale. Each picture is read from the input video, and we take the sequence of pictures that we keep reading as the timeline. When the first frame of the video is read, the detection thread starts running immediately. The tracking thread won’t start until the first target is found

2.2 Deep learning based detection

A deep learning based method is implemented in the detection part. The CNN model consists of two main parts: the backbone and the backend. Table 1 shows the brief structure of the backbone network with its main layers. The backbone is composed of five residual network blocks. Different from sequential networks such as GoogLeNet and VGG19, residual networks can better solve the overfitting problem of deep neural networks. In terms of the number of layers, the network depth is strictly limited to increase the calculation speed: compared with Faster R-CNN (152 layers), the proposed network model has only 58 layers. Before the data enter each block, the feature map is processed by a convolutional layer with a stride of 2, so its size is reduced to a quarter and the number of filters is doubled. Softmax is used as the activation function.

Table 1 The Backbone of the Network
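The following PyTorch sketch illustrates the stage pattern described above (the channel widths, activation, and block layout are placeholders rather than the exact configuration of Table 1): a stride-2 convolution quarters the feature map area and doubles the filters before a residual block processes it.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions with an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)      # placeholder activation for the sketch

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)          # skip connection

class DownsampleStage(nn.Module):
    """Stride-2 conv: area reduced to a quarter (half per side), filters doubled."""
    def __init__(self, in_ch):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, in_ch * 2, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(in_ch * 2),
            nn.LeakyReLU(0.1),
        )
        self.block = ResidualBlock(in_ch * 2)

    def forward(self, x):
        return self.block(self.down(x))
```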

In the backend network, some models use only a single-scale feature. Many network models employ feature maps of different sizes to detect targets, such as SSD. However, SSD does not reuse the low-level high-resolution feature maps, that is, it does not make full use of the spatial information in the low-level feature maps, which is very important for the detection of small objects. Therefore, we add the feature maps obtained by the last residual blocks to the previous feature maps. Through such connections, the feature maps used in each prediction layer fuse different resolutions and different strengths of semantic features.

For the connection, the element-wise Add operation is adopted rather than the usual Concatenate layer. At the same time, since this method only adds extra cross-layer connections to the original network, practically no additional time or computation is required; the calculation amount of the Concatenate layer is twice that of the Add layer.
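A minimal sketch of such a cross-layer connection (the 1x1 projection and the channel counts are assumptions): the deeper, lower-resolution map is upsampled and added element-wise to an earlier map instead of being concatenated with it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AddFusion(nn.Module):
    """Fuse a deep, low-resolution feature map into a shallower one by element-wise Add."""
    def __init__(self, deep_ch, shallow_ch):
        super().__init__()
        # 1x1 conv so both maps have the same number of channels before adding
        self.proj = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)

    def forward(self, deep, shallow):
        deep = self.proj(deep)
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="nearest")
        # Add keeps the channel count unchanged; Concatenate would double the channels
        # fed to the next layer and roughly double its computation.
        return shallow + deep
```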

According to the explanation above, the entire CNN is illustrated in Fig. 3.

Fig. 3
figure 3

The structure of the detection network. The feature map is extracted in the backbone while detection is finished in the backend of the network. The number of network layers and the number of filters in the backbone are set properly to improve the computational efficiency without excessively reducing the recognition accuracy. Subsequently, the connection of multi-scale feature maps is created in the backend network to enhance the performance of small object detection without substantially increasing the calculation amount of the original model

2.3 Integrated channel feature tracking

In the proposed method, an integrated channel feature is used as the object model to compute the particle weights. The steps of the tracking algorithm are described below:

2.3.1 Target model description

This method uses the integrated channel feature as the description of the target model. The integrated feature includes HSV and LBP, whose calculation methods are introduced later.

2.3.2 Particle sample set and particle initialization

The position and size of the traffic light target in the video are represented by a rectangular box, so the state space \(s_{t}^{\left( n \right) }\) of a traffic light particle sample at time t is constructed from five parameters of the rectangle:

$$\begin{aligned} s_{t}^{\left( n \right) }=\left[ x_{t}^{\left( n \right) },y_{t}^{\left( n \right) },h_{t}^{\left( n \right) },w_{t}^{\left( n \right) },a_{t}^{\left( n \right) } \right] \end{aligned}$$
(1)

Where \(n\in \left\{ 1,2,\ldots ,N \right\}\) and N is the number of random particles; \(x_{t}^{\left( n \right) }\) and \(y_{t}^{\left( n \right) }\) denote the center coordinates of the rectangular box; \(h_{t}^{\left( n \right) }\) and \(w_{t}^{\left( n \right) }\) are the height and width of the rectangular box; and \(a_{t}^{\left( n \right) }\) is the corresponding scale factor. In particle initialization, a random set of N particles is generated, with each state vector obeying a Gaussian distribution.
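A small numpy sketch of this initialization (the standard deviations reuse the covariance diag(2.0, 2.0, 1.0, 1.0, 0.01) given below; the number of particles N is a free parameter): each particle state [x, y, h, w, a] is drawn from a Gaussian centred on the detected box.

```python
import numpy as np

def init_particles(box, n_particles=200, rng=None):
    """box = (x, y, h, w): centre coordinates and size of the detected traffic light."""
    rng = np.random.default_rng() if rng is None else rng
    x, y, h, w = box
    mean = np.array([x, y, h, w, 1.0])            # state [x, y, h, w, a], scale a = 1
    std = np.array([2.0, 2.0, 1.0, 1.0, 0.1])     # sqrt of diag(2.0, 2.0, 1.0, 1.0, 0.01)
    particles = mean + std * rng.standard_normal((n_particles, 5))
    weights = np.full(n_particles, 1.0 / n_particles)   # equal initial weights
    return particles, weights
```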

The range of traffic lights in the image area can be estimated and restricted by the transition model. A second-order auto-regressive dynamics model is adopted. The particle sample set is propagated through the system state transition equation to obtain a new particle sample set:

$$\begin{aligned} \begin{aligned} s_{t}^{\left( n \right) }&={{A}_{1}}s_{t-1}^{\left( n \right) }+{{A}_{2}}s_{t-2}^{\left( n \right) }+BW \end{aligned} \end{aligned}$$
(2)
$$\begin{aligned} \begin{aligned}&W\sim N\left( 0,\varLambda \right) \end{aligned} \end{aligned}$$
(3)

Where \({{A}_{1}}\) and \({{A}_{2}}\) are the auto-regressive coefficients and B is the noise coefficient, taking \({{A}_{1}}=2.0,{{A}_{2}}=-1.0,B=1.0\). \(N\left( 0,\varLambda \right)\) denotes the Gaussian distribution with zero mean and covariance \(\varLambda =\text {diag}\left( \sigma _{x}^{2},\sigma _{y}^{2},\sigma _{w}^{2},\sigma _{h}^{2},\sigma _{a}^{2} \right)\). Here, \(\sigma _{x}^{2}=2.0,\sigma _{y}^{2}=2.0,\sigma _{w}^{2}=1.0,\sigma _{h}^{2}=1.0,\sigma _{a}^{2}=0.01\).
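A sketch of the state transition of Eqs. (2)-(3) with the stated coefficients A1 = 2.0, A2 = -1.0, B = 1.0 (keeping the particle sets of the two previous time steps is an implementation assumption):

```python
import numpy as np

A1, A2, B = 2.0, -1.0, 1.0
SIGMA = np.sqrt([2.0, 2.0, 1.0, 1.0, 0.01])     # stddevs from the covariance Lambda

def propagate(particles_t1, particles_t2, rng=None):
    """Second-order auto-regressive transition: s_t = A1*s_{t-1} + A2*s_{t-2} + B*W."""
    rng = np.random.default_rng() if rng is None else rng
    noise = SIGMA * rng.standard_normal(particles_t1.shape)
    return A1 * particles_t1 + A2 * particles_t2 + B * noise
```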

2.3.3 Weight calculation

First, the histograms of the Hue and Saturation channels of the target image and of each particle sample are computed separately. Then, the Bhattacharyya coefficient is used to calculate the likelihood between the two histograms:

$$\begin{aligned} \beta \left( {{p}_{s_{t}^{\left( n \right) }}},{{q}_{0}} \right) =\frac{1}{2}\underset{H,~S}{\mathop \sum }\,\underset{u=1}{\overset{m}{\mathop \sum }}\,\sqrt{p_{s_{t}^{\left( n \right) }}^{\left( u \right) }q_{0}^{\left( u \right) }} \end{aligned}$$
(4)

Where \(p_{s_{t}^{\left( n \right) }}^{\left( u \right) }\) denotes each histogram bin of one particle sample, \({{q}_{0}}\) is the corresponding histogram bin of the target, and m is the number of histogram bins. The color weight \(c_{t}^{\left( n \right) }\) of each particle sample \(s_{t}^{\left( n \right) }\) is calculated from the Bhattacharyya coefficient:

$$\begin{aligned} \begin{aligned} c_{t}^{\left( n \right) }&={{f}_{c}}\frac{1}{\sqrt{2\pi \sigma }}{{e}^{-\frac{\left[ 1-\beta \left( {{p}_{s_{t}^{\left( n \right) }}},{{q}_{0}} \right) \right] }{2{{\sigma }^{2}}}}} \end{aligned} \end{aligned}$$
(5)
$$\begin{aligned} \begin{aligned} {{f}_{c}}&=\frac{1}{\mathop {\sum }_{n=1}^{N}\frac{1}{\sqrt{2\pi \sigma }}{{e}^{-\frac{\left[ 1-\beta \left( {{p}_{s_{t}^{\left( n \right) }}},{{q}_{0}} \right) \right] }{2{{\sigma }^{2}}}}}} \end{aligned} \end{aligned}$$
(6)

Where \({{f}_{c}}\) is the normalization coefficient as well as the following \({{f}_{h}}\).
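A sketch of Eqs. (4)-(6) in Python (the histogram bin count and the bandwidth sigma are assumptions, not values reported in the paper):

```python
import numpy as np
import cv2

def hs_hist(patch_bgr, bins=16):
    """Per-channel Hue and Saturation histograms, each normalised to sum to 1."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    h = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).ravel()
    s = cv2.calcHist([hsv], [1], None, [bins], [0, 256]).ravel()
    return h / (h.sum() + 1e-12), s / (s.sum() + 1e-12)

def bhattacharyya_hs(p_hs, q_hs):
    """Eq. (4): half the sum of the per-channel Bhattacharyya coefficients."""
    return 0.5 * sum(np.sum(np.sqrt(p * q)) for p, q in zip(p_hs, q_hs))

def color_weights(particle_hists, target_hist, sigma=0.2):
    """Eqs. (5)-(6): Gaussian-shaped likelihood, normalised over the particle set."""
    beta = np.array([bhattacharyya_hs(p, target_hist) for p in particle_hists])
    w = np.exp(-(1.0 - beta) / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma)
    return w / w.sum()
```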

Second, the LBP histogram is calculated: the LBP code of each pixel is computed, and the histogram of each cell counts the frequency of each (decimal) LBP value. Similarly, the LBP weight \(h_{t}^{\left( n \right) }\) is calculated:

$$\begin{aligned} h_{t}^{\left( n \right) }={{f}_{h}}\frac{1}{\sqrt{2\pi \sigma }}{{e}^{-\frac{\left[ 1-\beta \left( {{j}_{s_{t}^{\left( n \right) }}},{{k}_{0}} \right) \right] }{2{{\sigma }^{2}}}}} \end{aligned}$$
(7)

Where \({{j}_{s_{t}^{\left( n \right) }}}\) is the LBP histogram of each particle sample and \({{k}_{0}}\) is the LBP histogram of the target.
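A compact numpy sketch of the LBP histogram used here (plain 8-neighbour LBP on a grayscale patch; the cell partitioning is omitted for brevity and the 256-bin layout is an assumption):

```python
import numpy as np

def lbp_histogram(gray):
    """8-neighbour LBP codes of the interior pixels, returned as a normalised 256-bin histogram."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                                   # centre pixels
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(shifts):
        nbr = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= ((nbr >= c).astype(np.int32) << bit)    # set bit if neighbour >= centre
    hist = np.bincount(code.ravel(), minlength=256).astype(float)
    return hist / (hist.sum() + 1e-12)
```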

Also, the distance weight is calculated:

$$\begin{aligned} \begin{aligned} r_{t}^{\left( n \right) }&={{(x_{t}^{\left( n \right) }-{{x}_{0}})}^{2}}+{{\left( y_{t}^{\left( n \right) }-{{y}_{0}} \right) }^{2}} \end{aligned} \end{aligned}$$
(8)
$$\begin{aligned} \begin{aligned} R_{t}^{\left( n \right) }&={{e}^{-\frac{r{{_{t}^{\left( n \right) }}^{2}}/2{{\sigma }^{2}}}{\sigma \sqrt{2\pi }}}} \end{aligned} \end{aligned}$$
(9)
$$\begin{aligned} \begin{aligned} \sigma&=\frac{1}{3}\left( w_{t}^{\left( n \right) }\right) ^{2} \end{aligned} \end{aligned}$$
(10)

Where \({{x}_{0}}, {{y}_{0}}\) are the coordinates of the target center in the image, \(r_{t}^{\left( n \right) }\) measures the distance between each particle and the target center, and \(R_{t}^{\left( n \right) }\) is the distance weight.

Then the distance weight is assigned to every feature weight:

$$\begin{aligned} \begin{aligned} C_{t}^{\left( n \right) }&=c_{t}^{\left( n \right) }R_{t}^{\left( n \right) } \end{aligned} \end{aligned}$$
(11)
$$\begin{aligned} \begin{aligned} H_{t}^{\left( n \right) }&=h_{t}^{\left( n \right) }R_{t}^{\left( n \right) } \end{aligned} \end{aligned}$$
(12)

Therefore, the Integrated Channel Feature Based Weight can be obtained:

$$\begin{aligned} I_{t}^{\left( n \right) }=\sqrt{C_{t}^{\left( n \right) }H_{t}^{\left( n \right) }} \end{aligned}$$
(13)

The weighted average of the particle sample set is estimated as the output of the object tracking:

$$\begin{aligned} E\left( s_{t}^{\left( n \right) } \right) =\underset{n=1}{\overset{N}{\mathop \sum }}\,I_{t}^{\left( n \right) }s_{t}^{\left( n \right) } \end{aligned}$$
(14)
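Putting Eqs. (8)-(14) together, the following sketch fuses the weights and computes the state estimate. The color and LBP weights come from routines such as those above; the distance weight is implemented here as a plain Gaussian of the particle-to-target distance, a simplification of Eqs. (8)-(10), and the final weights are normalised so that Eq. (14) is a proper weighted average.

```python
import numpy as np

def integrated_weights(particles, c_w, h_w, target_xy):
    """Eqs. (8)-(13): fuse the color, LBP, and distance weights into one weight per particle."""
    x0, y0 = target_xy
    d2 = (particles[:, 0] - x0) ** 2 + (particles[:, 1] - y0) ** 2   # squared distance, cf. Eq. (8)
    sigma = particles[:, 3] / 3.0          # spread tied to the particle width (simplified Eq. (10))
    R = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian distance weight, cf. Eq. (9)
    I = np.sqrt((c_w * R) * (h_w * R))     # Eqs. (11)-(13): geometric mean of the weighted features
    return I / I.sum()

def estimate_state(particles, I):
    """Eq. (14): weighted mean of the particle set as the tracking output."""
    return (I[:, None] * particles).sum(axis=0)
```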

2.3.4 Re-sampling

First, the particles are sorted according to their weights, and then a new set of particles is re-sampled according to the discrete probability distribution obtained after sorting. The newly generated particles are given equal initial weights. To maintain particle diversity, Gaussian noise is added during re-sampling.
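A sketch of this step as multinomial re-sampling with Gaussian jitter (the explicit sorting is omitted since multinomial draws do not require it, and the jitter scale is an assumption):

```python
import numpy as np

def resample(particles, weights, jitter_std=0.5, rng=None):
    """Draw a new particle set according to the weights, reset the weights to equal values,
    and add Gaussian noise to keep the particles diverse."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(particles)
    idx = rng.choice(n, size=n, replace=True, p=weights)
    new_particles = particles[idx] + jitter_std * rng.standard_normal(particles.shape)
    new_weights = np.full(n, 1.0 / n)
    return new_particles, new_weights
```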

2.4 Inter-frame buffer

Fig. 4
figure 4

The process of inter-frame buffer

The inter-frame buffer is demonstrated in Fig. 4. Assume that after the kth detection a certain target is found; its distance to every target tracked in the previous frame is compared with a threshold to determine whether it is a new target. If it is a new target, it is not tracked immediately in this cycle.

Then, in the \((k+1)th\) detection, if the target still appears, it starts to be tracked, that is, the new target enters the tracking thread. If the target does not appear in the \((k+1)th\) detection, it does not enter the tracking.
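A sketch of this two-step confirmation (the distance threshold, box format, and the handling of pending candidates are assumptions): a detection far from every tracked target becomes a pending candidate and is only promoted if it is detected again in the next cycle.

```python
import numpy as np

def inter_frame_buffer(tracked, detections, pending, dist_thresh=30.0):
    """Return the confirmed targets and the candidates still waiting for a second detection."""
    confirmed, new_pending = list(tracked), []
    for det in detections:                              # det = (x, y, h, w)
        dists = [np.hypot(det[0] - t[0], det[1] - t[1]) for t in tracked]
        if dists and min(dists) < dist_thresh:
            continue                                    # matches an already tracked target
        seen_before = any(np.hypot(det[0] - p[0], det[1] - p[1]) < dist_thresh
                          for p in pending)
        if seen_before:
            confirmed.append(det)                       # appeared in two consecutive detections
        else:
            new_pending.append(det)                     # wait for the (k+1)th detection
    return confirmed, new_pending
```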

After the new target is confirmed, MSER is carried out. MSER performs a binarization process on the image after it has been converted to gray-scale. As shown in Fig. 5, the coordinates of dark traffic light housings set against other backgrounds can be corrected by MSER.
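As a sketch of this correction with OpenCV's MSER (the padding and the "pick the largest stable region" rule are assumptions; boxes use the same centre-based (x, y, h, w) format as the particle state):

```python
import cv2
import numpy as np

def mser_correct(gray, box, pad=10):
    """Refine a candidate box with MSER on the surrounding gray-scale patch."""
    x, y, h, w = box
    x0, y0 = max(0, int(x - w / 2) - pad), max(0, int(y - h / 2) - pad)
    x1 = min(gray.shape[1], int(x + w / 2) + pad)
    y1 = min(gray.shape[0], int(y + h / 2) + pad)
    patch = gray[y0:y1, x0:x1]
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(patch)
    if len(bboxes) == 0:
        return box                                      # nothing stable found, keep the box
    bx, by, bw, bh = max(bboxes, key=lambda b: b[2] * b[3])   # largest stable region
    return (x0 + bx + bw / 2, y0 + by + bh / 2, float(bh), float(bw))
```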

Fig. 5
figure 5

MSER correction

3 Results and discussion

First, the detection network is trained and four derived networks are compared afterward. Meanwhile, a performance comparison between the common single-channel feature and the proposed integrated channel feature is also conducted. Finally, the overall algorithm is tested.

3.1 Datasets

3.1.1 Berkeley deep drive 100K

44,932 images in which traffic lights appear under diverse traffic and weather conditions are obtained from the 80,000 annotation files of BDD100K. The resolution of the training images is 1280\(\times\)720 pixels and the frame rate is 30 FPS.

3.1.2 Local urban dataset

A local urban dataset (LUD) is established to evaluate the developed algorithm. The data are captured in Jiangbei District, Chongqing, China. The video acquisition device is a Logitech C922 HD camera fixed at the top of the front windshield of an electric vehicle. All videos are 720p with a frame rate of 30 FPS. After editing and filtering, 22 videos are retained, and these video sequences are split into 1770 images.

3.2 Evaluations of detection network

The computer used throughout the experiments is equipped with an Nvidia Titan XP with 12 GB memory, and the resolution of the training input images is 512\(\times\)288. The LUD is used to test the network models trained above.

The details of the detection results compared with YOLOv3 are listed in Table 2. Our network model and YOLOv3 have little difference in the numbers of TP and FN, so the recall rates are similar, but the number of FP is reduced by 15%, which means that our network has stronger anti-interference ability. Furthermore, the precision of our network increases by 2.3 percentage points, and the F1 score is also better. At the same time, the results indicate that our network has a faster calculation speed, with the operating speed increased by 23.8%. In summary, our network greatly increases the calculation speed while maintaining the recognition performance of existing excellent network models and showing stronger stability.

Table 2 Details of the detection results

To further prove the optimality of our network, four self-derived networks (MF, BR, MU, MFBR) are compared with our network in the experiment. The structural differences among these networks are given in Table 3. The presence or absence of the first residual block determines the initial resolution of the three feature maps: the feature map sizes of BR and MFBR are 4 times larger than those of the other models. The larger the feature map, the more conducive it is to recognizing small targets, but the fewer global features the receptive field captures. Setting more filters in the network yields more features; for example, MF and MFBR have twice as many filters as the other models and therefore obtain twice as many features. More features allow a more accurate description of the target, but obviously the amount of calculation is also greater. MU increases the up-sampling factor, which enlarges the feature map (the largest one) used to identify the smallest targets and is thus more beneficial to small-target recognition. There are two ways to join two feature maps in the connection layer: Add is more efficient, and Concat retains more information.

Table 3 Parameters of Models

Each network introduced above is trained on the identical device with identical settings. The details of the detection results are shown in Table 4 and Fig. 6. Regarding the precision rate, all the derived networks are higher than YOLOv3; in particular, MFBR is 6.4 percentage points higher. In contrast, the recall rate of most models is lower than or equal to that of YOLOv3; only our network is slightly higher. In terms of the F1 score, which measures the accuracy of a binary classification model, MFBR and our network are better. Regarding FPS, MU and our network are much faster than YOLOv3, with increases of 31.2% and 23.8% respectively. Although MFBR has the best precision rate and F1 score, its speed is too slow to reach real-time detection. Figure 7 shows several detection results of our network in the test.

Table 4 Details of detection results via self-derived networks
Fig. 6
figure 6

Comparison of the detection performance. In the daytime, the captured image conditions are favorable and objects similar to traffic lights are few, so a model with a better recall rate is preferred. Furthermore, the detection speed is always an indicator that we care about. According to the comprehensive comparison, our network is the optimal choice for the detection stage of the overall algorithm

Fig. 7
figure 7

Samples of traffic light detection results. The detected traffic lights are marked by red rectangles

3.3 Feature channel tracking comparison

The Intersection over Union (IoU) between the tracking result and the ground truth after a certain number of tracking steps is used to characterize the accuracy. The Success Rate Map and Accuracy Map of different tracking features are displayed in Fig. 8. The detailed data are shown in Table 5, with the best values in bold.

Fig. 8
figure 8

Performance of tracking

Table 5 Details of tracking test results

According to the experimental results, among these groups of channel features, the Average Error, Average IoU, and AUC of the single-channel features are not as good as those of the integrated channel features. Furthermore, in our algorithm, HSV+LBP reaches 52.5 FPS, which is much faster than the other integrated groups. Overall, the performance of the integrated channel feature is satisfactory, and its accuracy and stability are better than those of the single-channel features.

3.4 Entire algorithm performance evaluation

The five test videos used in the evaluation are shown in Fig. 10, and Table 6 reveals the results of this test. The algorithm processes a total of 14,103 frames during the experiment, which takes 660.2 seconds, for an actual average running frame rate of 21.4 FPS. Compared with the performance of YOLOv3 in Table 2 (14.47 FPS), the complete algorithm can process 47.9% more images in the same amount of time.

Referring to the precision and recall rates, the precision rate ranges from 0.937 to 0.973 with an average of 0.962, and the recall rate ranges from 0.834 to 0.953 with an average of 0.909. Both are superior to the results of the previous evaluation of YOLOv3: the precision rate increases by 15.9% and the recall rate by 8.5%. As revealed by the experimental results, the proposed algorithm brings a significant improvement in traffic light recognition.

Nevertheless, the proposed algorithm still has certain limitations. Figure 9 shows two typical defects observed in the experiment: (a) the rightmost traffic light in the bottom row is missed; (b) although the traffic light is found, its box position is disturbed by the countdown indicator next to the light. A higher precision rate means accurate recognition and few false detections, but the recall rate of the proposed method is relatively low, that is, there are cases of missed detection. In addition, when the target pixel area is very small, the image composed of the half-black countdown indicator and a red number resembles a traffic light, resulting in inaccurate positioning of the traffic light (Fig. 10).

Fig. 9
figure 9

Mistakes in detection result

For the five video test results in Table 6, we conducted a statistical significance test to determine whether the average performance of the proposed method is significantly improved compared to YOLOv3 (results in Table 2). The test process for FPS is shown below. The significance level \(\alpha\) is set to 0.05, and the hypotheses are as follows:

\({ H_{0} }\): The FPS of the proposed method is not higher than that of YOLOv3.

\({ H_{1} }\): The FPS of the proposed method is higher than that of YOLOv3.

For a test of the mean of a single normal population with unknown standard deviation, the T-test is used:

$$S = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^{n}{(X_{i} - \bar{X})^{2}}} = 1.975$$
(15)
$$T = \frac{\bar{X} - \mu_{0}}{S}\sqrt{n} = 7.846$$
(16)
$$\text{Rejection interval}:\{ t > t_{1-\alpha}(n-1) = 2.132\}$$
(17)

Where S is the sample standard deviation and T is the test statistic. Since T falls in the rejection interval, \(H_{0}\) is rejected, that is, \(H_{1}\) is accepted. The significance test results for Precision and Recall are the same. Therefore, the improvement of the proposed method over YOLOv3 can be considered not accidental.
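The test statistic can be reproduced directly from the reported numbers (sample mean 21.4 FPS over the n = 5 videos, sample standard deviation S = 1.975, and the YOLOv3 baseline mu0 = 14.47 FPS); a short check:

```python
import numpy as np
from scipy import stats

n, x_bar, s, mu0, alpha = 5, 21.4, 1.975, 14.47, 0.05

t_stat = (x_bar - mu0) / s * np.sqrt(n)          # Eq. (16): about 7.846
t_crit = stats.t.ppf(1 - alpha, df=n - 1)        # t_{0.95}(4) = 2.132
print(f"T = {t_stat:.3f}, reject H0: {t_stat > t_crit}")
```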

Fig. 10
figure 10

Proportions of light colors in the videos. The test videos contain 10712 targets, of which 7132 are red lights and 3589 are green lights

Table 6 Details of evaluation results

The color of the traffic lights is determined from the hue feature. The corresponding confusion matrix is shown in Fig. 11. The recognition accuracy of red lights is higher than that of green lights, because red differs from the background colors more strongly than green does, especially against blue sky and green trees.
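A minimal sketch of hue-based color classification (the hue and saturation thresholds below are illustrative assumptions, not the values used in the paper):

```python
import cv2
import numpy as np

def classify_light_color(patch_bgr):
    """Classify a traffic light patch as red or green from its dominant hue.
    OpenCV's 8-bit hue range is [0, 180); the thresholds are illustrative only."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    mask = (s > 80) & (v > 80)                      # keep saturated, bright pixels
    if mask.sum() == 0:
        return "unknown"
    hue = h[mask]
    red = np.count_nonzero((hue < 10) | (hue > 160))
    green = np.count_nonzero((hue > 35) & (hue < 90))
    return "red" if red >= green else "green"
```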

Fig. 11
figure 11

Confusion matrix of color classification

Figure 12 reveals the time consumption of the detection thread, the tracking thread, and the total process. The average detection time per frame of the detection thread is not much different from that in the previous experiment, at roughly 0.056 s. However, the average tracking time per frame of the tracking thread is much shorter, about 0.019 s. The shorter the time required to process the task, the lower the computational complexity of the algorithm. As shown in Fig. 12, both the tracking thread and the detection thread can locate traffic light targets, but the average time for the detection thread to process each frame is 2.6 times that of the tracking thread; that is, the computational complexity of the detection network model is 2.6 times that of the tracking model. Compared with other deep learning methods, the FPS of YOLOv3 is 14.47 while that of the proposed method is 21.4, so the proposed method processes 47.9% more frames than YOLOv3 in the same time and can meet the requirement of real-time application.

Fig. 12
figure 12

Processing time of every thread

4 Conclusion

To enhance the usability of traffic light recognition systems in autonomous vehicles, this article employs a CNN and ICFT to determine the coordinates and color of traffic lights, improving recognition accuracy and processing speed by combining detection and tracking. The experimental results first establish the optimality of the presented CNN model and ICFT: the Recall (0.842) and Precision (0.853) of the modified model are close to those of YOLOv3 (0.838 and 0.830), while its FPS (17.92) is higher than 14.47. Additionally, the ICFT is shown to achieve better performance, with 4.393 Average Error, 0.567 Average IoU, and 0.344 AUC, than single-channel feature tracking. The overall test further demonstrates the superiority of the proposed method, indicating that it can be adapted to autonomous vehicles and achieve better performance.

Future work will focus on three perspectives: (1) port the traffic light recognition system of this article to a system-on-chip and deploy it on a real vehicle; (2) communicate the traffic light information via 5G, so that the efficiency and safety of networked autonomous vehicles can be improved by sharing this information; (3) employ more advanced algorithms to improve the adaptability of the CNN to different places; reinforcement learning (RL) is a promising way to train the network without extensive manual labeling.