
1 Introduction

With the rapid development of society and the increasing number of cars, traffic congestion has increasingly become an important factor limiting urban development, and an excellent signal control system is of great significance for improving traffic efficiency [1]. A key element of traffic signal optimization is traffic state sensing. In the past, intersections obtained real-time traffic state through induction coils and cameras, but these devices have the disadvantages of high installation and maintenance costs and short service life [2, 3]. Thanks to the development of vehicle networking and Vehicle-to-Everything (V2X) communication, a Connected Vehicle (CV) driving on the road can act as a powerful motion detector, providing high-quality and inexpensive traffic state input to signalized intersections [4,5,6]. With the further development of communication and artificial intelligence technologies, the Intelligent Transportation System (ITS), which combines the concepts of vehicle networking and intelligent control, is gradually becoming the future direction of transportation systems, making it possible to improve traffic efficiency at urban intersections on a large scale and in all directions [7, 8].

Combined with data-driven control approaches, the efficiency of intersection passage can be improved effectively. Reinforcement learning (RL) is a machine learning method that, unlike supervised approaches, requires no pre-labeled data: the model learns entirely through interaction with the environment until the best policy for guiding actions is found, making it a highly adaptive control method [9]. Data-driven reinforcement learning control allows intersection signal control to be fully integrated with the real-time characteristics of traffic flow, which is important for improving traffic flow efficiency [10]. Traditional reinforcement learning is limited to simple function approximators such as piecewise-constant tables and linear regression, which restricts its scalability and optimality in practice. Deep Neural Networks (DNNs), which are able to learn complex tasks [11, 12], combined with reinforcement learning solve this problem and bring a new direction to adaptive intersection control. In [12], experience replay and target network mechanisms are used to improve the stability of the algorithm; simulation results show that it reduces vehicle delays by 47% and 86%, respectively, compared with two other popular traffic signal control algorithms, the longest-queue-first algorithm and fixed-time control. Liang et al. use a rasterized state representation that quantifies traffic flow information by dividing the intersection into occupancy grid maps; multiple quantities such as vehicle position and speed are fused, and different green light durations form the action space. A convolutional network approximates the Q-network, and mechanisms such as experience replay and target networks are employed to stabilize it [13].

This paper focuses on designing a reinforcement learning-based control model for an intersection signal controller suffering from information attacks. There are two main reasons: first, to evaluate how much improvement the reinforcement learning model can bring to intersection traffic efficiency; second, because of the model-free nature of reinforcement learning, the evaluation of the impact of an attack is more convincing, since the attacker cannot craft a targeted strategy based on the flaws of a specific model.

The main contributions of this paper are summarized as follows:

  1. A reinforcement learning-based signal control model is designed that uses a deep neural network to approximate the maximum cumulative return and an experience replay mechanism to stabilize training.

  2. A signaling attack is performed on the reinforcement learning control model. Because the controller is model-free, the measured interference reflects the information attack itself rather than the flaws of a specific model.

  3. The state of the controller is tampered with in order to explore the most damaging attack strategy.

2 Problem Description

2.1 Application Scenarios

To solve traffic control problems, intelligent transportation systems (ITS) have been created by integrating advanced technologies such as edge and cloud computing, V2X, and IoT into traffic monitoring and management [14]. V2X technology connects vehicles and road infrastructure, making adaptive real-time signal control at intersections possible.

The On-Board Unit (OBU) is the main vehicle-side application: each CV carries an OBU, which mainly collects the vehicle's status information. The Road-Side Unit (RSU) is the main road-side product; deployed at each intersection, it detects status information about itself or the surrounding environment (for example, via ETC and other contactless sensing devices) and receives and returns communication data over microwave links. The controller implements the specifically adapted traffic control algorithms. For example, in a queuing lane the OBU transmits the CV's position, speed, heading, and acceleration to the RSU in the form of Basic Safety Messages (BSMs), and consecutive BSMs represent the vehicle trajectory. At the same time, the RSU and the controller broadcast the signal timing plan to the traffic monitoring center for area-wide coordination [15].
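For illustration, a BSM-style record exchanged between an OBU and the RSU could be represented by the following minimal sketch; the field names are illustrative and do not follow the exact SAE J2735 message layout:

```python
from dataclasses import dataclass

@dataclass
class BasicSafetyMessage:
    """Illustrative BSM-style record sent from an OBU to the RSU at each time step."""
    vehicle_id: str      # temporary identifier of the connected vehicle
    timestamp: float     # message time in seconds
    position: tuple      # (x, y) position, e.g. distance to the stop line and lane offset
    speed: float         # speed in m/s
    heading: float       # heading in degrees
    acceleration: float  # longitudinal acceleration in m/s^2
    lane_id: str         # identifier of the occupied entrance lane

# Consecutive BSMs with the same vehicle_id form a trajectory, which the RSU
# can aggregate into the grid-occupancy state described below.
```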

In the ITS environment, the states of vehicles approaching the intersection can be obtained in real time from the trajectory information transmitted by connected vehicles or from cameras. The state representation should use real-time information from all vehicles as far as possible; in general, using more information to describe the state helps the network determine the value of the state more precisely. One such state description is grid (raster) occupancy [16, 17]. Its working principle is shown in Fig. 1:

Fig. 1.

Environment perception with grid occupancy. Each lane approaching the intersection in the diagram is divided into multiple cells; if a cell is occupied by a car, it returns 1, indicating that a vehicle is present in that cell, otherwise it returns 0.

This state description provides a large amount of information to the controller, similar to an image input. The method is therefore expected to integrate naturally with image-recognition-based vehicle detectors.

In an intelligent transportation system, it is assumed that an attacker is able to tamper with the state information obtained by the RSU. There are two main ways to carry out this attack: first, the attacker gains access to the internal V2X network and tampers with the state input in the RSU; second, the attacker can falsify a CV trajectory at the end of the queue to create the illusion that a vehicle is parked there.

A state-table attack is defined as a direct modification of the state table so that all lanes of the intersection appear to be filled with connected vehicles: the state of every lane is tampered with, and the corresponding entries of the occupancy grid change from 0 to 1. The state-table attack is a direct attack that requires penetrating the controller's interior, and it is therefore more difficult to implement. There are few quantitative evaluations of information attacks, and the specific impact such an attack would have on a traffic control system still needs to be evaluated, for example, how an attacker should attack in order to maximize overall traffic congestion.

3 RL Signal Controller for Intersection Suffering from Information Attack

3.1 Reinforcement Learning Model Controller

Intersection Scene Description.

In this paper, a typical urban road intersection controlled by traffic signals is considered.

Fig. 2.

Intersection phase settings. There are four available traffic signal phases; after a given phase is executed, the corresponding priority time interval is assigned to a set of non-conflicting vehicle movements.

As shown in Fig. 2, there are four available traffic signal phases; after a given phase is executed, the corresponding priority time interval is assigned to a set of non-conflicting vehicle movements. The lanes served by each phase are marked in green. This control process can be formulated as a reinforcement learning (RL) problem: the environment is a Markov process, and the reinforcement learning model learns by interacting with it. In our case, the environment is an isolated signalized intersection.

Single Intersection Reinforcement Learning Model.

In RL, there are three important elements \(\left( {S , A , R} \right)\) in the training process, where \(S\) is the state space \((s \in S)\), \(A\) is the action space \((a \in A)\), and \(R\) is the reward function. In the case of traffic signal control, we define these elements as follows:

State: the input data is a description of the current state. In general, using more information to describe the state helps the network determine the value of the state more precisely. One such state description uses grid occupancy, as described in Sect. 2.1.

In the traditional grid division, the lane is usually divided into equally spaced cells; however, an approach is usually several hundred meters long, and equal spacing leads to an overly large feature state set, which burdens the computation. This subsection improves the state input by taking into account the arrival distribution of vehicles in the traffic flow and dividing the grid non-uniformly by distance. Specifically, near the stop line, where queues frequently form, cells are divided according to vehicle length, and the cell length increases with distance. The intervals are [0, 7], [7, 14], [14, 21], [21, 28], [28, 40], [40, 60], [60, 100], [100, 160], [160, 400], and [400, 750] (in meters), which ensures that the focus is on vehicle information in the queue while not losing information about vehicles that are still moving at high speed.
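A minimal sketch of this non-uniform cell mapping, assuming the distance is measured from the stop line in meters:

```python
import numpy as np

# Upper boundaries (in meters) of the ten non-uniform cells along a 750 m approach.
CELL_EDGES = [7, 14, 21, 28, 40, 60, 100, 160, 400, 750]

def cell_index(distance_to_stop_line: float) -> int:
    """Map a vehicle's distance to the stop line to one of the ten cells."""
    return int(np.searchsorted(CELL_EDGES, distance_to_stop_line, side="left"))

def lane_state(distances) -> np.ndarray:
    """Build the 10-element binary occupancy vector s_i for one lane group."""
    s = np.zeros(10, dtype=int)
    for d in distances:                 # distances of all detected vehicles in the lane group
        s[min(cell_index(d), 9)] = 1    # a cell becomes 1 as soon as any vehicle falls into it
    return s
```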

Similarly, for lanes that are in the same phase, we can combine them into the same state table since their traffic distribution is similar and they have the same travel route.

The detailed working principle is shown in Fig. 3: a cell is marked as 1 whenever any of the merged lanes (for example, the two through lanes of the same phase) contains a vehicle in that cell, and it is marked as 0 only when none of them does.

Fig. 3.

Intersection state awareness. A cell is marked as 1 whenever any of the merged lanes contains a vehicle in that cell, and it is marked as 0 only when none of them does.

For each entrance lane group, the state is defined as \(s_{i}\), an array of 10 elements recording whether each cell is occupied, where \(i\) denotes the lane-group number. The set of the states of all lane groups constitutes the overall state \(S\):

$$ S = \left\{ {s_{1} , \ldots ,s_{i} , \ldots ,s_{8} } \right\},\quad i = 1,2, \ldots ,8 $$
(1)

Action: an action here means selecting a signal phase. The possible signal phase settings are shown in Fig. 2. The agent picks one of these four phases every second. The set of actions is as follows:

$$ A = \{ 1,2,3,4\} $$
(2)

Because the green light has a minimum time interval, if the agent changes the current phase, the new phase must be maintained for at least this interval; conversely, if the agent keeps the current phase, it simply chooses the same signal phase action again.

Reward: the reward guides the agent to perform actions in the right direction. In this case, we choose the difference between the cumulative vehicle waiting times of two adjacent actions as the reward. The vehicle waiting time is defined as the time that the vehicles in the entrance lanes have spent waiting since they entered the environment.

$$ W_{t} = \sum\nolimits_{i = 1}^{16} {w_{t}^{i} } $$
(3)

\(W_{t}\) denotes the total delay time of the vehicles in all lanes, and \(w_{t}^{i}\) denotes the cumulative waiting time of all vehicles in lane \(i\) at time \(t\).

Since the agent is engaged in a continuous decision process, an immediate reward, denoted \(R_{t}\), is generated for each state-action pair. Here we use the difference between the cumulative vehicle waiting time \(W_{t - 1}\) after the previous action \(a_{t - 1}\) and the cumulative vehicle waiting time \(W_{t}\) after the current action \(a_{t}\) as \(R_{t}\), as shown below:

$$ R_{t} = W_{t - 1} - W_{t} $$
(4)

This means that \(R_{t}\) encourages the agent to choose current actions that yield a smaller cumulative vehicle waiting time than past actions, thereby securing a larger reward and reducing the overall cumulative vehicle waiting time.
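A minimal sketch of computing this reward from the simulation, assuming SUMO/TraCI is already running and that ENTRANCE_LANES lists the IDs of the 16 entrance lanes (the lane IDs below are illustrative):

```python
import traci

ENTRANCE_LANES = [f"edge_{d}_0_{k}" for d in ("N", "S", "E", "W") for k in range(4)]  # illustrative IDs

def total_waiting_time() -> float:
    """W_t: cumulative waiting time of all vehicles currently on the entrance lanes."""
    total = 0.0
    for lane in ENTRANCE_LANES:
        for veh in traci.lane.getLastStepVehicleIDs(lane):
            total += traci.vehicle.getAccumulatedWaitingTime(veh)
    return total

# Reward of Eq. (4): R_t = W_{t-1} - W_t
prev_wait = total_waiting_time()
traci.simulationStep()            # advance the simulation after applying an action
curr_wait = total_waiting_time()
reward = prev_wait - curr_wait
```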

Q-Network Design.

Unlike model-based control methods that require full information about the environment, an RL agent can learn optimal traffic signal control strategies by interacting with the traffic environment. In our RL approach we use a Q-network: a well-designed neural network serves as the Q-function approximator.

The goal of the Q-network is to train the traffic signal to adapt the phases and phase durations of the intersection to its real-time traffic patterns. This is achieved by selecting, in each training step, an action \((a_{t} \in A)\) that maximizes the expected future cumulative reward:

$$ Q^{*} \left( {s_{t} ,a_{t} } \right) = \max E[R_{t} + \gamma R_{t + 1} + \gamma^{2} R_{t + 2} + \ldots |\pi ] $$
(5)

where \({\upgamma } \in (0,1)\) is a discount factor that represents the trade-off between future and immediate rewards, and the policy \(\pi\) is defined as the probability of taking action \(a_{t}\) in state \(s_{t}\). According to dynamic programming theory, the optimal Q-function in Eq. (5) can be rewritten in the form of the Bellman equation, as shown below:

$$ Q^{*} \left( {s_{t} ,a_{t} } \right) = E[R_{t} + \gamma \mathop {\max }\limits_{{a^{\prime}}} Q^{*} \left( {s_{t + 1} ,a^{\prime}} \right)] $$
(6)

The traditional Q-learning algorithm solves the Bellman equation iteratively and relies on a Q-table, i.e., a discrete table of Q-values. However, in our traffic signal control formulation, traditional Q-learning suffers from the curse of dimensionality, because the state-action space becomes huge under the state-space definition above [18].

Therefore, in this paper, we adopt the recent prominent idea of Q-network, which is to use neural network to approximate the Q-function:

$$ Q\left( {s , a;\theta } \right) \approx Q^{*} \left( {s , a} \right) $$
(7)

where \(\theta\) denotes the parameters of the neural network.

Our Q-network for traffic signal control is shown in Fig. 4. At the beginning of each training step, state information is collected as the input layer of the neural network, and then feedforward is performed to estimate the Q-value at the output layer of the neural network.

Fig. 4.

Q network structure diagram. It contains state inputs, six layers of neurons and action outputs.

To fit the function \( Q^{*} (s , a)\), the deep neural network shown in Fig. 4 is used. The network consists of six layers of neurons: the first layer receives the intersection state \(S\) and contains 80 neurons, and each of the four hidden layers contains 400 neurons. The last layer is the output layer, with four neurons that fit the action-value function of the four phase actions. The hidden layers use ReLU as the activation function, which performs well in fitting nonlinear functions, and a linear activation is used for the output layer, so that the maximum cumulative return can be approximated effectively.
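A minimal sketch of this network in TensorFlow/Keras is given below; the optimizer, learning rate, and loss function are assumptions, since they are not specified above:

```python
import tensorflow as tf

def build_q_network(state_dim: int = 80, n_actions: int = 4) -> tf.keras.Model:
    """Q-network: 80-neuron state input, four hidden layers of 400 ReLU units,
    and a linear output layer with one Q-value per phase action."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(400, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(400, activation="relu"),
        tf.keras.layers.Dense(400, activation="relu"),
        tf.keras.layers.Dense(400, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="linear"),
    ])
    # Mean-squared error against the Bellman target of Eq. (6); Adam is an assumed choice.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
    return model
```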

Algorithm Process.

In the reinforcement learning model designed in this paper, the model first observes the state of the environment; the Q-network then fits the action-value function and outputs the chosen action, while the environment simultaneously generates a new state, and this cycle repeats until the best strategy is learned. This eventually makes vehicle flow through the intersection more efficient and achieves the goal of reducing delay.

In signal control at intersections, the traffic flow state and the phase settings are directly related, so there is a strong correlation between the samples \(\left( {s , a , r , s^{\prime}} \right)\) generated by the reinforcement learning model; this correlation must be broken up and the samples redistributed in order to prevent overfitting of the deep neural network. To improve the stability of Q-network training, the experience replay mechanism is used in our reinforcement learning. Specifically, experience replay stores historical samples \( (s_{t} , a_{t } , r_{t } , s_{t + 1} )\) in a memory \(m\) that records historical samples; when the Q-network is trained, a certain number of historical samples are randomly drawn as training data in each training step.
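A minimal sketch of such a replay memory; the capacity and mini-batch size are assumptions:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (s_t, a_t, r_t, s_{t+1}) samples and returns random mini-batches."""
    def __init__(self, capacity: int = 50_000):
        self.buffer = deque(maxlen=capacity)   # old samples are discarded when full

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int = 64):
        # Random sampling breaks the temporal correlation between consecutive samples.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```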

The exact process is shown in Algorithm 1:

Algorithm 1. Training procedure of the reinforcement learning signal controller.
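As an illustration of this process, the following is a minimal sketch of a deep Q-learning training loop with experience replay, consistent with the description above and reusing the `build_q_network` and `ReplayMemory` sketches given earlier. The exploration schedule, episode length, and the environment helpers `get_state` and `apply_phase` are assumptions (the latter would set the signal phase via TraCI and advance the simulation), and `total_waiting_time` is the reward helper sketched earlier:

```python
import numpy as np

def train(steps_per_episode=5400, episodes=400,
          gamma=0.9, epsilon_start=1.0, epsilon_end=0.05):
    """Deep Q-learning with experience replay (minimal sketch; hyper-parameters are assumed)."""
    q_net = build_q_network()
    memory = ReplayMemory()

    for ep in range(episodes):
        epsilon = max(epsilon_end, epsilon_start * (1 - ep / episodes))  # assumed linear decay
        state = get_state()                       # 80-element occupancy vector (assumed helper)
        for _ in range(steps_per_episode):
            # Epsilon-greedy selection over the four phase actions.
            if np.random.rand() < epsilon:
                action = int(np.random.randint(4))
            else:
                action = int(np.argmax(q_net.predict(state[None, :], verbose=0)))

            prev_wait = total_waiting_time()
            apply_phase(action)                   # set the phase and step the simulation (assumed helper)
            next_state = get_state()
            reward = prev_wait - total_waiting_time()   # Eq. (4)

            memory.add(state, action, reward, next_state)
            state = next_state

            # Train on a random mini-batch drawn from the replay memory.
            batch = memory.sample()
            if batch:
                s, a, r, s2 = map(np.array, zip(*batch))
                target = q_net.predict(s, verbose=0)
                target[np.arange(len(a)), a] = r + gamma * q_net.predict(s2, verbose=0).max(axis=1)
                q_net.fit(s, target, verbose=0)
```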

3.2 Signaling Attack Model Based on State Table

Assume that an attacker can directly manipulate the controller's state-table input \( S = \{ s_{1} , \cdots ,s_{i} , \cdots ,s_{8} \}\). By directly modifying the state vector \(S\), the attacker makes the controller incorrectly believe that there are cars in free lanes.

The attacked lane state is modified to \(s_{a}\), in which every cell of the lane is set to 1:

$$ s_{a} = \{ 1,1,1,1,1,1,1,1,1,1\} $$
(8)

The complete state is generally collected by multiple RSUs, and we assume that the attacker's capability is limited, so that it can only interfere with the state perception of certain lanes. To explore the impact of state attacks, state-table attacks on different lanes are designed.

Attacking a single lane of traffic:

$$ s_{i} = s_{a} $$
(9)

Attacking two lanes of traffic:

$$ \left\{ {\begin{array}{*{20}l} {s_{i} = s_{a} } \\ {s_{j} = s_{a} } \\ \end{array} } \right.\quad \left( {i, j \in \left\{ {1,2, \ldots ,8} \right\},\;i \ne j} \right) $$
(10)

When \(i\) and \(j\) are in different relative positions, there are at least three distinct scenarios: (1) lane \(i\) and lane \(j\) belong to different phases in the same entrance direction, (2) lane \(i\) and lane \(j\) belong to different entrance directions and different phases, and (3) lane \(i\) and lane \(j\) belong to the same phase.
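A minimal sketch of applying such a state-table attack before the state is passed to the controller; the lane-group indices in the example are illustrative:

```python
import numpy as np

S_A = np.ones(10, dtype=int)   # Eq. (8): a fully "occupied" lane state

def state_table_attack(S, lanes):
    """Overwrite the states of the targeted lane groups with the all-ones vector s_a.
    S has shape (8, 10): one 10-cell occupancy vector per lane group."""
    S_attacked = np.array(S, copy=True)
    for i in lanes:
        S_attacked[i] = S_A
    return S_attacked

# Example: attack two lane groups belonging to the same phase (illustrative indices).
# S_attacked = state_table_attack(S, lanes=[0, 4])
```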

The lane states are modified in order to measure the extent to which the controller is affected by the signaling attack and to confirm whether it causes additional congestion. Revealing the degree of impact of an information attack demonstrates the necessity of defending against potential information attacks.

4 Simulation Result

4.1 Simulation Settings

The simulation environment uses SUMO, a microscopic traffic simulation platform, to model the intersection. TraCI, SUMO's development interface, is used together with the Python language for secondary development, and the TensorFlow framework is used for modeling and training the Q-network. A joint SUMO-TensorFlow simulation platform is built to test and evaluate the signal control performance of the reinforcement learning model.

The intersection road network is built with SUMO's road network editor, netedit, as an intersection of two-way, eight-lane roads, where each approach has four entrance lanes: the leftmost lane is a left-turn lane, the middle two lanes are through lanes, and the rightmost lane is a shared through/right-turn lane. The road length is set to 750 m.

Traffic Distribution: For each simulation instance, the initial state is a traffic network with no vehicles; vehicles are then inserted randomly with their destinations and corresponding routes. Each simulation lasts 1.5 h (5400 s). As we seek a method that can be generalized to any case, vehicle arrivals follow a Poisson distribution to simulate the general case, with a total traffic flow of 4500 veh/h. The left-turn and right-turn lanes are allocated 25% of the traffic flow.
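A minimal sketch of generating such Poisson arrivals for a SUMO route file; the seed and the way departures are written to the .rou.xml file are assumptions:

```python
import numpy as np

def poisson_departures(total_flow_veh_per_h=4500, duration_s=5400, seed=0):
    """Generate departure times with exponential inter-arrival gaps,
    i.e. a Poisson arrival process with the given hourly flow."""
    rng = np.random.default_rng(seed)
    rate = total_flow_veh_per_h / 3600.0          # vehicles per second
    t, times = 0.0, []
    while t < duration_s:
        t += rng.exponential(1.0 / rate)          # exponential gap between arrivals
        if t < duration_s:
            times.append(round(t, 2))
    return times

# Each departure time would then become a <vehicle depart="..."/> entry in the
# SUMO .rou.xml file, with its route drawn according to the turning ratios.
```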

Signal Timing Scheme: It is worth noting that this scheme has no fixed phase sequence; the phase settings are the same as in Sect. 3.1. Under the control of the reinforcement learning model, the phase adjustment is completely determined by the reward, and the agent decides which phase action to take next.

4.2 Reinforcement Learning Q-Network Training

First, we examine the simulation data, and the results show that our algorithm does learn good action strategies that are effective in reducing vehicle dwell time, thereby reducing vehicle delays and traffic congestion, and that our algorithm is stable in terms of control decisions. The controller does not oscillate between good and bad action strategies, or even favor bad action strategies.

Since we set the reward as the difference in vehicle waiting time between two adjacent actions, and traffic arrives randomly, it cannot be guaranteed that the next action will always yield a smaller cumulative vehicle waiting time than the previous one. To better evaluate the training process, only the negative reward values are collected and summed in this paper, and the change in the cumulative negative reward reflects the whole training process, as shown in Fig. 5a):

Fig. 5.

a): Cumulative negative reward. b): Average number of parked vehicles.

As the number of training rounds increases, the cumulative negative reward decreases rapidly, and the reward converges at around 370 rounds. This indicates that the model does learn good action strategies from training. The same trend can be seen in Fig. 5b), where the average number of parked vehicles also remains stable at small values after 370 rounds, with a significant decrease in the number of queued vehicles. This indicates that the algorithm converges to a good action strategy and that the experience replay mechanism used to stabilize the algorithm is effective.

4.3 Reinforcement Learning Control Model Results Analysis

The RL model does not need to follow a specific phase cycle but runs acyclically, and each action is chosen to maximize the reward. Fixed-time control, in contrast, does not exhibit this goal-directed behavior and follows a comparatively crude, predetermined logic. To evaluate the performance of the proposed reinforcement learning model, a fixed-time control strategy is used for comparison: the signal phase durations are pre-calculated from the proportions of vehicle arrivals and kept fixed for the duration of the run.

Fig. 6.

Comparison of the average number of stops of the two controllers.

Figure 6 shows the difference in the number of queued vehicles at each moment between the RL adaptive control strategy and the fixed-time control strategy. The reinforcement learning control model significantly outperforms fixed-time control, owing to its adaptive control strategy. It can be seen intuitively that most instantaneous queue lengths under RL control are lower than those under fixed-time control: with the RL strategy, the maximum number of queued vehicles remains around 40 and the minimum around 5, whereas under fixed-time control the maximum queue reaches 60 vehicles and the minimum is about 10.

To better evaluate the overall control performance, the average queue length and the total stopped waiting time are used as evaluation metrics. As shown in Fig. 7a) and Fig. 7b), the number of queued vehicles and the total stopped time are reduced by 23% and 22%, respectively, which demonstrates the effectiveness of our reinforcement learning control model.

Fig. 7.

Comparison of the control effect of the two controllers.

4.4 Analysis of Signal Attack Results Based on State Table

On the basis of the established reinforcement learning model, the state is tampered with according to the different attack scenarios. As in the previous section, the average number of queued vehicles and the total stopped time are used as evaluation metrics. The specific results are shown in Table 1.

From the data in Table 1, it can be seen that, contrary to what might be expected, the total average queue length increases by only 7% when the state perception of a single lane is attacked, indicating that this attack (Type 1) does not have a significant impact on the controller. However, when the perception of two lanes is attacked and both lanes belong to the same phase (Type 2), the average queue length and total stopped time increase sharply, by 253%, which shows the significant interference of the signaling attack with the intersection controller. Attacking two lanes of different phases in the same entrance direction (Type 3) and two lanes of different phases in different entrance directions (Type 4) increases the metrics by 60% and 19.6%, respectively. Both scenarios involve different phases, yet the results differ considerably; this is because the same entrance direction contains a left-turn lane, and the RL model assigns fewer actions to left-turn lanes in order to release more through traffic, so falsifying the left-turn lane state makes a large difference in the results.

Table 1. The degree to which the controller is affected by the signaling attack.

The above results show that the RL model is robust and resistant to interference: it can still make reasonable decisions even when part of the state perception within a phase is compromised by an attack. However, when the attacker compromises the entire perception of a phase, the impact on the system is particularly large, which confirms the serious threat that network attacks pose to signal control systems. Fortunately, such direct attacks are easily identified by cyber-security systems, but strengthening the cyber security of signal control systems remains an urgent issue.

5 Conclusion

V2X, 5G, and other communication technologies that connect vehicles, roads, and infrastructure have undoubtedly brought new solutions to traffic signal control [19]. At the same time, this extensive connectivity makes traffic signal systems highly vulnerable to information attacks [20]. The first step in protecting against such attacks is to analyze how, and to what extent, they affect signal controllers. In this paper, the impact of attacks is evaluated more objectively by building a model-free controller with reinforcement learning. We also argue that reinforcement learning not only offers excellent control performance but also a degree of robustness, which coincides with the literature that regards reinforcement learning as the next generation of signal controllers.

For intersection signal controllers, the impact is huge if the perception of an entire phase is attacked. Conversely, if only some of the lane states in the same phase are attacked, the impact is limited and within an acceptable range; a clever attacker will surely exploit this property [21, 22].

In the future, it will be necessary to develop a series of defensive measures that work not only at the level of network security, preventing malicious attackers from penetrating the internal network, but also address the way attackers use personal vehicles to send fake trajectories, by establishing reasonable screening methods. Finally, some degree of cross-validation against data from other sensors is needed, and a certain amount of signal redundancy is necessary.