Introduction

Intelligent Transportation Systems (ITS) use synergistic technologies and systems engineering concepts to develop and improve transportation systems of all kinds [1]. Machine intelligence on the road has become a popular research area with the advent of modern technologies, especially artificial intelligence, wireless communication, and advanced novel sensors.

Current traffic signal control systems are designed from historical traffic flow data and cannot adapt to the rapidly varying conditions at an intersection. In extreme cases, a green light serves an approach with no vehicles while many vehicles wait at a red one.

Many researchers have proposed schemes to solve the aforementioned problems. Choy et al. [2] introduced a hybrid agent architecture for real-time signal control and suggested a dynamic database for storing all recommendations of the controller agents for each evaluation period. Liu et al. [3] proposed a method for calculating intersection delay under signal control, while Bao et al. [4] studied an adaptive traffic signal timing scheme for an isolated intersection. However, all of these approaches rely on historical flow data rather than on current traffic information [5, 6].

This paper makes the following contributions:

  (a) A novel traffic flow control mechanism is proposed, based on the cooperation of the vehicles, the road, and the traffic management system. A roadside wireless communication network supports a dynamic traffic flow control method.

  (b) Reinforcement learning is introduced as the core algorithm to dynamically plan traffic flow and improve efficiency. A Q-learning based intersection traffic signal control system is studied as an example of the proposed mechanism.

Study of Intersection Signal Control

In this section, a Q-learning algorithm is used to create a real-time cooperative control policy for an isolated intersection under the proposed traffic control mechanism. The algorithm and the simulation are both described in detail, and the results show the advantage of the proposed method.

Q-Learning Algorithm

Q learning, a type of reinforcement learning, can develop optimal control strategies from delayed rewards, even when an agent has no prior knowledge of the effects of its actions on the environment [7].

The agent’s learning task can be described as follows. We require that the agent learn a policy \( \pi \) that maximizes the discounted cumulative reward \( V^{\pi } (s) \) for all states s. We call such a policy an optimal policy and denote it by \( \pi^{*} \):

$$ \pi^{*} \equiv \mathop {\arg \hbox{max} }\limits_{\pi } V^{\pi } (s),(\forall s) $$
(1)

To simplify notation, we will refer to the value function \( V^{{\pi^{*} }} (s) \) of such an optimal policy as \( V^{*} (s) \). \( V^{*} (s) \) gives the maximum discounted cumulative reward that the agent can obtain starting from state s; that is, the discounted cumulative reward obtained by following the optimal policy beginning at state s.
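Here \( V^{\pi } (s) \) denotes the discounted cumulative reward obtained by starting in state s and following policy \( \pi \); for completeness, the standard reinforcement learning definition is

$$ V^{\pi } (s) \equiv r_{t} + \gamma r_{t + 1} + \gamma^{2} r_{t + 2} + \cdots = \sum\limits_{i = 0}^{\infty } {\gamma^{i} r_{t + i} } $$

where \( r_{t + i} \) is the immediate reward received i steps after starting from state s under \( \pi \), and \( 0 \le \gamma < 1 \) is the discount factor.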

However, it is difficult to learn the function \( \pi^{*} :S \to A \) directly, because the available training data does not provide training examples of the form \( < s,a > \). Instead, the only training information available to the learner is the sequence of immediate rewards \( r(s_{i} ,a_{i} ) \) for \( i = 0,1,2, \ldots \). As we shall see, given this kind of training information it is easier to learn a numerical evaluation function defined over states and actions, and then implement the optimal policy in terms of this evaluation function.

What evaluation function should the agent attempt to learn? One obvious choice is \( V^{*} \). The agent should prefer state \( s_{1} \) over state \( s_{2} \) whenever \( V^{*} (s_{1} ) > V^{*} (s_{2} ) \), because the cumulative future reward will be greater from \( s_{1} \). The agent’s policy must choose among actions, not among states. However, it can use \( V^{*} \) in certain settings to choose among actions as well: the optimal action in state s is the action a that maximizes the sum of the immediate reward \( r(s,a) \) plus the value \( V^{*} \) of the immediate successor state, discounted by \( \gamma \),

$$ \pi^{*} (s) = \mathop {\arg \hbox{max} }\limits_{a} [r(s,a) + \gamma V^{*} (\delta (s,a))] $$
(2)

where \( \delta (s,a) \) denotes the state resulting from applying action a to state s.

Thus, the agent can acquire the optimal policy by learning \( V^{*} \), provided it has perfect knowledge of the immediate reward function r and the state transition function \( \delta \). When the agent knows the functions r and \( \delta \) used by the environment to respond to its actions, it can then use Eq. (2) to calculate the optimal action for any state s.
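As a concrete illustration of Eq. (2), the short Python sketch below performs this one-step lookahead when r and \( \delta \) are known. It is not from the paper; the toy reward, transition function, and value table are hypothetical.

GAMMA = 0.8  # discount factor; 0.8 is the value used later in this paper


def greedy_action(s, actions, r, delta, V_star, gamma=GAMMA):
    """Return argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ], i.e. Eq. (2)."""
    return max(actions, key=lambda a: r(s, a) + gamma * V_star[delta(s, a)])


# Toy two-state example: "move" toggles the state and earns 1, "stay" does nothing.
V_star = {0: 0.0, 1: 10.0}
r = lambda s, a: 1.0 if a == "move" else 0.0
delta = lambda s, a: (1 - s) if a == "move" else s
print(greedy_action(0, ["stay", "move"], r, delta, V_star))  # -> move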

Unfortunately, learning \( V^{*} \) is a useful way to learn the optimal policy only when the agent has perfect knowledge of \( \delta \) and r.

Let us define the evaluation function Q(s, a) so that its value is the maximum discounted cumulative reward that can be achieved starting from state s and applying action a as the first action. In other words, the value of Q is the reward received immediately upon executing action a from state s, plus the value (discounted by \( \gamma \)) of following the optimal policy thereafter.

$$ Q(s,a) \equiv r(s,a) + \gamma V^{*} (\delta (s,a)) $$
(3)

Note that Q(s, a) is exactly the quantity that is maximized in Eq. (2) in order to choose the optimal action a in state s. Therefore, we can rewrite Eq. (2) in terms of Q(s, a) as

$$ \pi^{*} (s) = \mathop {\arg \hbox{max} }\limits_{a} Q(s,a) $$
(4)

Why is this rewrite important? Because it shows that if the agent learns the Q function instead of the \( V^{*} \) function, it will be able to select optimal actions even when it has no knowledge of the functions r and \( \delta \). As Eq. (4) makes clear, it need only consider each available action a in its current state s and choose the action that maximizes Q(s, a). This is the most important advantage of Q-learning, and it is the reason why Q-learning is chosen in this paper.

How should the Q-learning algorithm be implemented? The key problem is finding a reliable way to estimate training values for Q, given only a sequence of immediate rewards r spread out over time. This can be accomplished through iterative approximation. To see how, notice the close relationship between Q and \( V^{*} \), namely \( V^{*} (s) = \mathop {\hbox{max} }\limits_{{a^{\prime}}} Q(s,a^{\prime} ) \), which allows Eq. (3) to be rewritten as follows:

$$ Q(s,a) = r(s,a) + \gamma \mathop {\hbox{max} }\limits_{{a^{\prime}}} Q(\delta (s,a),a^{\prime}) $$
(5)

Equation (5) provides the basis for algorithms that iteratively approximate Q. In such an algorithm, \( \overline{Q} \) denotes the learner’s estimate, or hypothesis, of the actual Q function. \( \overline{Q} \) is represented by a large table with a separate entry for each state-action pair. The table can be initially filled with random values (though it is easier to understand the algorithm if one assumes initial values of zero). The agent repeatedly observes its current state s, chooses some action a, executes this action, and then observes the resulting reward \( r = r(s,a) \) and the new state \( s^{\prime} = \delta (s,a) \). It then updates the table entry for \( \overline{Q} (s,a) \) after each such transition, according to the rule:

$$ \overline{Q} (s,a) \leftarrow r(s,a) + \gamma \mathop {\hbox{max} }\limits_{{a^{\prime}}} \overline{Q} (s^{\prime},a^{\prime}) $$
(6)

Note that the above training rule uses the agent’s current \( \overline{Q} \) values for the new state \( s^{\prime} \) to refine its estimate of \( \overline{Q} (s,a) \) for the previous state s.
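For concreteness, the following is a minimal sketch of this tabular training procedure in Python (the paper’s own simulation is in MATLAB). The environment interface (reset/step) and the epsilon-greedy exploration rule are illustrative assumptions, not details taken from the paper.

import random
from collections import defaultdict


def q_learning(env, actions, episodes=100, gamma=0.8, epsilon=0.1):
    """Tabular Q-learning using the update rule of Eq. (6)."""
    Q = defaultdict(float)  # one table entry per (state, action) pair, initialized to 0
    for _ in range(episodes):
        s = env.reset()                          # assumed: returns the initial state
        done = False
        while not done:
            # choose some action a: explore with probability epsilon,
            # otherwise act greedily with respect to the current estimate
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, reward, done = env.step(a)   # observe r(s, a) and s' = delta(s, a)
            # Eq. (6): Q(s, a) <- r(s, a) + gamma * max_a' Q(s', a')
            Q[(s, a)] = reward + gamma * max(Q[(s_next, x)] for x in actions)
            s = s_next
    return Q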

In our traffic control problem, the immediate reward \( r(s,a) \) is replaced by an immediate cost \( g(s,a) \) that should be minimized, so the iterative training rule (6) is replaced by

$$ \overline{Q} (s,a) \leftarrow g(s,a) + \gamma \mathop {\hbox{min} }\limits_{{a^{\prime}}} \overline{Q} (s^{\prime},a^{\prime}). $$
(7)

That is, the learning target is now to minimize the Q function, i.e., to minimize the total cost incurred when acting according to the optimal action sequence. This is exactly the algorithm used in this paper.
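As a minimal illustration of rule (7) in Python (not the paper’s MATLAB code), a single table update could be written as follows; the state and phase names in the usage example are hypothetical.

def q_update_cost(Q, s, a, cost, s_next, actions, gamma=0.8):
    """One application of Eq. (7): Q(s, a) <- g(s, a) + gamma * min_a' Q(s', a')."""
    Q[(s, a)] = cost + gamma * min(Q.get((s_next, x), 0.0) for x in actions)


# Toy usage with hypothetical state and phase names:
Q = {}
q_update_cost(Q, s="s0", a="Ph1", cost=12.0, s_next="s1", actions=["Ph1", "Ph2"])
print(Q)  # {('s0', 'Ph1'): 12.0}, since all successor estimates are still 0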

Model of the Intersection Signal System

A traffic system consists of various components, among which the traffic intersection is one of the most important [8]. Our method is applied to a traffic intersection that consists of two intersecting roads, each with several lanes and a set of synchronized traffic lights that manage the flow of vehicles, as shown in Fig. 1.

Fig. 1 Isolated intersection

In this intersection, the traffic rule is right-hand based, as used in China and South Korea. The vehicles in lanes ①, ③, ⑤ and ⑦ are approaching the intersection, while vehicles in lanes ②, ④, ⑥ and ⑧ are leaving it. For each approaching lane, there are three directions a vehicle can choose: turn left, turn right, or go straight, as shown in Fig. 1.

We do not consider right turns, because they do not conflict with the other movements. To keep the model simple, pedestrians crossing the road are also not considered; an additional rule for pedestrians can easily be added under the proposed mechanism.

Therefore, this problem can be modeled as 8 queues for different paths, as shown in Table 1.

Table 1 Basic action definition of different queues

We assume that a random number of vehicles is distributed over the different queues at the beginning of a signal period; this is the initial state of the environment. The final state is reached when all the vehicles in the initial state have crossed the intersection. The intersection signal control system is modeled as a leader agent that manages the actions of all vehicle agents around the intersection. Since the action libraries of the vehicle agents include actions A1 to A8, the leader agent can choose any one action, or any reasonable combination of actions, to reach the final state.

If two of the actions A1 to A8 do not conflict with each other, they form a possible action combination. We call each such combination a signal phase. All possible combinations are shown in Table 2.

Table 2 Action combination symbol

Therefore, the problem can be described as finding the optimal sequence of action combinations that reaches the final state. This is the main function of the intersection signal control agent.

For each discrete state between the initial state and the final state, the optimal policy is independent of the previous states, and the successor state is deterministic once an action combination has been executed. Therefore, the problem can be modeled as a deterministic Markov decision process.
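To make this model concrete, the sketch below encodes the state as the tuple of the eight queue lengths and treats a signal phase as an action that releases every vehicle in the two non-conflicting queues it serves. This is an interpretation for illustration; the PHASES mapping is a placeholder and does not reproduce Tables 1 and 2.

# Hedged sketch of the deterministic MDP: states are 8-tuples of queue lengths,
# and a phase (action combination) empties the two queues it serves.
PHASES = {
    "Ph1": (0, 4),  # e.g. two opposing through movements (placeholder assignment)
    "Ph2": (1, 5),
    "Ph3": (2, 6),
    "Ph4": (3, 7),
}


def transition(state, phase):
    """delta(s, a): the queues served by the phase are emptied, the rest are unchanged."""
    served = PHASES[phase]
    return tuple(0 if i in served else n for i, n in enumerate(state))


def is_final(state):
    """The final state is reached once every queue is empty."""
    return all(n == 0 for n in state)


# Example: a randomly chosen initial state over the eight queues.
s0 = (3, 1, 4, 2, 0, 5, 2, 1)
print(transition(s0, "Ph1"))  # -> (0, 1, 4, 2, 0, 5, 2, 1)
print(is_final((0,) * 8))     # -> True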

Parameters of Learning Process

  (1) Cost function

Suppose that the number of vehicles at state s is n. After the selected action a has been completed, the number of remaining vehicles is \( n_{1} \). The cost of the action depends on the waiting time t and the number of remaining vehicles \( n_{1} \):

$$ g(s,a) = n_{1} \times (t + t_{transition} ). $$
(8)

where \( t_{transition} \) takes one of the three values {0, 1.5, 3} s, as shown in Table 3. The average time for each vehicle to cross the intersection is assumed to be 3 s. A sketch of this cost computation is given after these parameter definitions.

Table 3 \( t_{transition} \) for different phase transitions

  (2) Discount factor

In the simulation we set the discount factor, \( \gamma = 0.8 \).
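Continuing the Python illustrations above, the following sketch computes the cost of Eq. (8) for one phase. Taking the waiting time t to be 3 s times the number of vehicles released by the phase is an assumed reading, since the paper does not state explicitly how t is obtained.

T_CROSS = 3.0  # assumed average crossing time per vehicle, in seconds (3 s, from the text)


def phase_cost(state, served, t_transition):
    """g(s, a) = n1 * (t + t_transition), following Eq. (8).

    `state` is the tuple of queue lengths, `served` the indices of the queues
    released by the chosen phase, and `t_transition` one of {0, 1.5, 3} s from
    Table 3. Treating t as the crossing time of the released vehicles is an
    assumption for this sketch.
    """
    n_served = sum(state[i] for i in served)
    n1 = sum(state) - n_served            # n1: vehicles still waiting after the action
    t = T_CROSS * n_served                # waiting time incurred by this phase
    return n1 * (t + t_transition)


# Example with the state used above and a hypothetical phase serving queues 0 and 4:
print(phase_cost((3, 1, 4, 2, 0, 5, 2, 1), served=(0, 4), t_transition=1.5))
# n_served = 3, n1 = 15, t = 9 s  ->  cost = 15 * (9 + 1.5) = 157.5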

Simulation and Results

We wrote MATLAB code to carry out the simulation on a machine with the following configuration:

  • CPU: Intel Pentium 4 Processor 2.40 GHz,

  • Memory: 1,047,792 KB (about 1 GB),

  • Operating System: Microsoft Windows XP Professional (SP3).

In order to show the advantage of our proposed mechanism, the traditional signal mechanism was used for a comparative study. In the traditional mechanism, the signal phases follow a fixed sequence Ph1, Ph2, Ph3, Ph4, Ph5, Ph6, whereas our proposed method determines the optimal phase sequence automatically based on the current situation.

In the following, we show the comparative results for three different periods T and different phase time intervals \( t_{phase} \).

In the simulation result tables, Ps is the simulation period series, NIV is the total number of vehicles in the initial state, Random Queues is the number of vehicle queues that are randomly created, TIQ is the time interval from the initial state to the final state for the Q-learning method, TWQ is the total waiting time for the Q-learning method, TIT is the time interval from the initial state to the final state for the traditional method, with \( T_{IT} = 6 \times t_{phase} \), and TWT is the total waiting time for the traditional method.

$$ P_{EI} = \frac{{T_{IT} - T_{IQ} }}{{T_{IT} }} \times 100\;\% $$
(9)

Equation (9) determines the percent improvement in traffic efficiency.

$$ P_{WD} = \frac{{T_{WT} - T_{WQ} }}{{T_{WT} }} \times 100\;\% $$
(10)

Equation (10) shows the percent decrease in total waiting time.

Finally, OA is the optimum phase sequence found by Q-learning, and TL is the running time of the Q-learning program on the above-mentioned computer.
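For reference, the two evaluation metrics of Eqs. (9) and (10) can be computed directly from these quantities; the numbers in the example below are made up and are not taken from Table 4.

def improvement_percentages(T_IT, T_IQ, T_WT, T_WQ):
    """Compute P_EI (Eq. 9) and P_WD (Eq. 10) from the simulation outputs."""
    P_EI = (T_IT - T_IQ) / T_IT * 100.0  # percent improvement in traffic efficiency
    P_WD = (T_WT - T_WQ) / T_WT * 100.0  # percent decrease in total waiting time
    return P_EI, P_WD


# Hypothetical example: t_phase = 60 s gives T_IT = 6 * 60 = 360 s.
print(improvement_percentages(T_IT=360, T_IQ=240, T_WT=900, T_WQ=600))
# -> (33.33..., 33.33...)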

Analysis of the Results

From Table 4, we find that the running time TL of the Q-learning program is less than one second in every period. This is short enough for application in an intersection signal control system.

Table 4 Simulation results when \( t_{phase} \) = 60 s

At the same time, the percent traffic efficiency improvement PEI ranges from 4.17 % to 47.5 %, and the percent decrease in total waiting time PWD ranges from 1.07 % to 56.95 %. The average PEI is 32.2 % and the average PWD is 37.5 %.

Conclusion

A new traffic control mechanism based on a combination of machine learning and multiagent modeling methods is proposed for future intelligent transportation systems. The control systems, the vehicles, and the necessary roadside sensors are all modeled as intelligent agents, so the ITS becomes a multiagent system in which traffic control efficiency can be improved by artificial intelligence algorithms.

The control method for an isolated intersection was then studied. The intersection signal system was first modeled according to the proposed mechanism; next, a new algorithm based on reinforcement learning, specifically Q-learning, was proposed and studied in detail. Finally, a simulation of such an intersection system was carried out, together with a comparative study against the traditional intersection signal method.

The simulation results showed that the proposed intersection control mechanism can improve traffic efficiency by more than 30 % over the traditional method while also benefiting drivers by decreasing the total waiting time by more than 30 %. These results indicate that the proposed traffic control mechanism could be applicable in the near future.