Introduction

Intelligent Transportation Systems (ITS) use synergistic technologies and systems engineering concepts to develop and improve transportation systems of all kinds [1]. Machine intelligence on the road has become a popular research area with the advent of modern technologies, especially artificial intelligence, wireless communication, and advanced novel sensors.

Current traffic signal control systems are designed from historical traffic flow data and cannot adapt to the rapidly varying conditions at an intersection. In extreme cases, a green light serves an approach with no vehicles while many vehicles wait at a red one.

Many researchers have proposed schemes to solve the aforementioned problems. Choy et al. [2] introduced a hybrid agent architecture for real-time signal control and suggested a dynamic database for storing all recommendations of the controller agents for each evaluation period. Liu et al. [3] proposed a method for calculating intersection delay under signal control, while Bao et al. [4] studied an adaptive traffic signal timing scheme for an isolated intersection. However, all of these approaches rely on historical flow data rather than on current traffic information [5, 6].

This paper makes the following contributions:

  (a) A novel traffic flow control mechanism is proposed, based on the cooperation of the vehicles, the road, and the traffic management system. A roadside wireless communication network supports a dynamic traffic flow control method.

  (b) Reinforcement learning is introduced as the core algorithm to dynamically plan traffic flow and improve efficiency. A Q-learning based intersection traffic signal control system is studied as an example of the proposed mechanism.

Study of Intersection Signal Control

In this section, a Q-learning algorithm is used to create a real-time cooperative control policy for an isolated intersection under the proposed traffic control mechanism. The algorithm and the simulation are both described in detail, and the results show the advantage of the proposed method.

Q-Learning Algorithm

Q learning, a type of reinforcement learning, can develop optimal control strategies from delayed rewards, even when an agent has no prior knowledge of the effects of its actions on the environment [7].

The agent’s learning task can be described as follows. We require that the agent learn a policy \( \pi \) that maximizes the discounted cumulative reward \( V^{\pi } (s) \) for all states s. We call such a policy an optimal policy and denote it by \( \pi^{*} \):

$$ \pi^{*} \equiv \mathop {\arg \hbox{max} }\limits_{\pi } V^{\pi } (s),(\forall s) $$
(1)

To simplify notation, we will refer to the value function \( V^{{\pi^{*} }} (s) \) of such an optimal policy as \( V^{*} (s) \). \( V^{*} (s) \) gives the maximum discounted cumulative reward that the agent can obtain starting from state s; that is, the discounted cumulative reward obtained by following the optimal policy beginning at state s.
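Here \( V^{\pi } (s) \) denotes the discounted cumulative reward obtained by starting in state s and following policy \( \pi \); for completeness, the standard reinforcement learning definition is

$$ V^{\pi } (s) \equiv r_{t} + \gamma r_{t + 1} + \gamma^{2} r_{t + 2} + \cdots = \sum\limits_{i = 0}^{\infty } {\gamma^{i} r_{t + i} } $$

where \( r_{t + i} \) is the immediate reward received i steps after starting from state s under \( \pi \), and \( 0 \le \gamma < 1 \) is the discount factor.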

However, it is difficult to learn the function \( \pi^{*} :S \to A \) directly, because the available training data does not provide training examples of the form \( < s,a > \). Instead, the only training information available to the learner is the sequence of immediate rewards \( r(s_{i} ,a_{i} ) \) for \( i = 0,1,2, \ldots \). As we shall see, given this kind of training information it is easier to learn a numerical evaluation function defined over states and actions, and then implement the optimal policy in terms of this evaluation function.

What evaluation function should the agent attempt to learn? One obvious choice is \( V^{*} \). The agent should prefer state \( s_{1} \) over state \( s_{2} \) whenever \( V^{*} (s_{1} ) > V^{*} (s_{2} ) \), because the cumulative future reward will be greater from \( s_{1} \). The agent’s policy must choose among actions, not among states. However, it can use \( V^{*} \) in certain settings to choose among actions as well: the optimal action in state s is the action a that maximizes the sum of the immediate reward \( r(s,a) \) plus the value \( V^{*} \) of the immediate successor state, discounted by \( \gamma \),

$$ \pi^{*} (s) = \mathop {\arg \hbox{max} }\limits_{a} [r(s,a) + \gamma V^{*} (\delta (s,a))] $$
(2)

where \( \delta (s,a) \) denotes the state resulting from applying action a to state s.

Thus, the agent can acquire the optimal policy by learning \( V^{*} \), provided it has perfect knowledge of the immediate reward function r and the state transition function \( \delta \). When the agent knows the functions r and \( \delta \) used by the environment to respond to its actions, it can then use Eq. (2) to calculate the optimal action for any state s.
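As a concrete illustration of Eq. (2), the short Python sketch below performs this one-step lookahead when r and \( \delta \) are known. It is not from the paper; the toy reward, transition function, and value table are hypothetical.

GAMMA = 0.8  # discount factor; 0.8 is the value used later in this paper


def greedy_action(s, actions, r, delta, V_star, gamma=GAMMA):
    """Return argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ], i.e. Eq. (2)."""
    return max(actions, key=lambda a: r(s, a) + gamma * V_star[delta(s, a)])


# Toy two-state example: "move" toggles the state and earns 1, "stay" does nothing.
V_star = {0: 0.0, 1: 10.0}
r = lambda s, a: 1.0 if a == "move" else 0.0
delta = lambda s, a: (1 - s) if a == "move" else s
print(greedy_action(0, ["stay", "move"], r, delta, V_star))  # -> move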

Unfortunately, learning \( V^{*} \) is a useful way to learn the optimal policy only when the agent has perfect knowledge of \( \delta \) and r.

Let us define the evaluation function Q(s, a) so that its value is the maximum discounted cumulative reward that can be achieved starting from state s and applying action a as the first action. In other words, the value of Q is the reward received immediately upon executing action a from state s, plus the value (discounted by \( \gamma \)) of following the optimal policy thereafter.

$$ Q(s,a) \equiv r(s,a) + \gamma V^{*} (\delta (s,a)) $$
(3)

Note that Q(s, a) is exactly the quantity that is maximized in Eq. (2) in order to choose the optimal action a in state s. Therefore, we can rewrite Eq. (2) in terms of Q(s, a) as

$$ \pi^{*} (s) = \mathop {\arg \hbox{max} }\limits_{a} Q(s,a) $$
(4)

Why is this rewrite important? Because it shows that if the agent learns the Q function instead of the \( V^{*} \) function, it will be able to select optimal actions even when it has no knowledge of the functions r and \( \delta \). As Eq. (4) makes clear, it need only consider each available action a in its current state s and choose the action that maximizes Q(s, a). This is the most important advantage of Q-learning, and it is the reason why Q-learning is chosen in this paper.

How should the Q-learning algorithm be implemented? The key problem is finding a reliable way to estimate training values for Q, given only a sequence of immediate rewards r spread out over time. This can be accomplished through iterative approximation. To see how, notice the close relationship between Q and \( V^{*} \), namely \( V^{*} (s) = \mathop {\hbox{max} }\limits_{{a^{\prime}}} Q(s,a^{\prime} ) \), which allows Eq. (3) to be rewritten as follows:

$$ Q(s,a) = r(s,a) + \gamma \mathop {\hbox{max} }\limits_{{a^{\prime}}} Q(\delta (s,a),a^{\prime}) $$
(5)

Equation (5) provides the basis for algorithms that iteratively approximate Q. In such an algorithm, \( \overline{Q} \) denotes the learner’s estimate, or hypothesis, of the actual Q function. \( \overline{Q} \) is represented by a large table with a separate entry for each state-action pair. The table can be initially filled with random values (though it is easier to understand the algorithm if one assumes initial values of zero). The agent repeatedly observes its current state s, chooses some action a, executes this action, and then observes the resulting reward \( r = r(s,a) \) and the new state \( s^{\prime} = \delta (s,a) \). It then updates the table entry for \( \overline{Q} (s,a) \) after each such transition, according to the rule:

$$ \overline{Q} (s,a) \leftarrow r(s,a) + \gamma \mathop {\hbox{max} }\limits_{{a^{\prime}}} \overline{Q} (s^{\prime},a^{\prime}) $$
(6)

Note that the above training rule uses the agent’s current \( \overline{Q} \) values for the new state \( s^{\prime} \) to refine its estimate of \( \overline{Q} (s,a) \) for the previous state s.
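For concreteness, the following is a minimal sketch of this tabular training procedure in Python (the paper’s own simulation is in MATLAB). The environment interface (reset/step) and the epsilon-greedy exploration rule are illustrative assumptions, not details taken from the paper.

import random
from collections import defaultdict


def q_learning(env, actions, episodes=100, gamma=0.8, epsilon=0.1):
    """Tabular Q-learning using the update rule of Eq. (6)."""
    Q = defaultdict(float)  # one table entry per (state, action) pair, initialized to 0
    for _ in range(episodes):
        s = env.reset()                          # assumed: returns the initial state
        done = False
        while not done:
            # choose some action a: explore with probability epsilon,
            # otherwise act greedily with respect to the current estimate
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, reward, done = env.step(a)   # observe r(s, a) and s' = delta(s, a)
            # Eq. (6): Q(s, a) <- r(s, a) + gamma * max_a' Q(s', a')
            Q[(s, a)] = reward + gamma * max(Q[(s_next, x)] for x in actions)
            s = s_next
    return Q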

In our traffic control problem, the immediate reward \( r(s,a) \) is replaced by an immediate cost \( g(s,a) \) that should be minimized, so the iterative training rule (6) is replaced by

$$ \overline{Q} (s,a) \leftarrow g(s,a) + \gamma \mathop {\hbox{min} }\limits_{{a^{\prime}}} \overline{Q} (s^{\prime},a^{\prime}). $$
(7)

That is, the learning target is now to minimize the Q function, i.e., to minimize the total cost incurred when acting according to the optimal action sequence. This is exactly the algorithm used in this paper.
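As a minimal illustration of rule (7) in Python (not the paper’s MATLAB code), a single table update could be written as follows; the state and phase names in the usage example are hypothetical.

def q_update_cost(Q, s, a, cost, s_next, actions, gamma=0.8):
    """One application of Eq. (7): Q(s, a) <- g(s, a) + gamma * min_a' Q(s', a')."""
    Q[(s, a)] = cost + gamma * min(Q.get((s_next, x), 0.0) for x in actions)


# Toy usage with hypothetical state and phase names:
Q = {}
q_update_cost(Q, s="s0", a="Ph1", cost=12.0, s_next="s1", actions=["Ph1", "Ph2"])
print(Q)  # {('s0', 'Ph1'): 12.0}, since all successor estimates are still 0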

Model of the Intersection Signal System

A traffic system consists of various components, among which the traffic intersection is one of the most important [8]. Our method is applied to a traffic intersection that consists of two intersecting roads, each with several lanes and a set of synchronized traffic lights that manage the flow of vehicles, as shown in Fig. 1.

Fig. 1 Isolated intersection

In this intersection, the traffic rule is right-hand based, as used in China and South Korea. The vehicles in lanes ①, ③, ⑤ and ⑦ are approaching the intersection, while vehicles in lanes ②, ④, ⑥ and ⑧ are leaving it. For each approaching lane, there are three directions a vehicle can choose: turn left, turn right, or go straight, as shown in Fig. 1.

We do not consider right turns, because they do not conflict with the other movements. To keep the model simple, pedestrians crossing the road are also not considered; an additional rule for pedestrians can easily be added under the proposed mechanism.

Therefore, this problem can be modeled as 8 queues for different paths, as shown in Table 1.

Table 1 Basic action definition of different queues

We assume that a random number of vehicles is distributed over the different queues at the beginning of a signal period; this is the initial state of the environment. The final state is reached when all the vehicles in the initial state have crossed the intersection. The intersection signal control system is modeled as a leader agent that manages the actions of all vehicle agents around the intersection. Since the action libraries of the vehicle agents include actions A1 to A8, the leader agent can choose any one action, or any reasonable combination of actions, to reach the final state.

If two of the actions A1 to A8 do not conflict with each other, they form a possible action combination. We call each such combination a signal phase. All possible combinations are shown in Table 2.

Table 2 Action combination symbol

Therefore, the problem can be described as finding the optimal sequence of action combinations that reaches the final state. This is the main function of the intersection signal control agent.

For each discrete state between the initial state and the final state, the optimal policy is independent of the previous states, and the successor state is deterministic once an action combination has been executed. Therefore, the problem can be modeled as a deterministic Markov decision process.
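To make this model concrete, the sketch below encodes the state as the tuple of the eight queue lengths and treats a signal phase as an action that releases every vehicle in the two non-conflicting queues it serves. This is an interpretation for illustration; the PHASES mapping is a placeholder and does not reproduce Tables 1 and 2.

# Hedged sketch of the deterministic MDP: states are 8-tuples of queue lengths,
# and a phase (action combination) empties the two queues it serves.
PHASES = {
    "Ph1": (0, 4),  # e.g. two opposing through movements (placeholder assignment)
    "Ph2": (1, 5),
    "Ph3": (2, 6),
    "Ph4": (3, 7),
}


def transition(state, phase):
    """delta(s, a): the queues served by the phase are emptied, the rest are unchanged."""
    served = PHASES[phase]
    return tuple(0 if i in served else n for i, n in enumerate(state))


def is_final(state):
    """The final state is reached once every queue is empty."""
    return all(n == 0 for n in state)


# Example: a randomly chosen initial state over the eight queues.
s0 = (3, 1, 4, 2, 0, 5, 2, 1)
print(transition(s0, "Ph1"))  # -> (0, 1, 4, 2, 0, 5, 2, 1)
print(is_final((0,) * 8))     # -> True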

Parameters of Learning Process

  (1) Cost function

Suppose that the number of vehicles at state s is n. After the selected action a has been completed, the number of remaining vehicles is \( n_{1} \). The cost of the action depends on the waiting time t and the number of remaining vehicles \( n_{1} \):

$$ g(s,a) = n_{1} \times (t + t_{transition} ). $$
(8)

where \( t_{transition} \) takes one of the three values {0, 1.5, 3} s, as shown in Table 3. The average time for each vehicle to cross the intersection is assumed to be 3 s. A sketch of this cost computation is given after these parameter definitions.

Table 3 \( t_{transition} \) for different phase transitions

  (2) Discount factor

In the simulation we set the discount factor, \( \gamma = 0.8 \).
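Continuing the Python illustrations above, the following sketch computes the cost of Eq. (8) for one phase. Taking the waiting time t to be 3 s times the number of vehicles released by the phase is an assumed reading, since the paper does not state explicitly how t is obtained.

T_CROSS = 3.0  # assumed average crossing time per vehicle, in seconds (3 s, from the text)


def phase_cost(state, served, t_transition):
    """g(s, a) = n1 * (t + t_transition), following Eq. (8).

    `state` is the tuple of queue lengths, `served` the indices of the queues
    released by the chosen phase, and `t_transition` one of {0, 1.5, 3} s from
    Table 3. Treating t as the crossing time of the released vehicles is an
    assumption for this sketch.
    """
    n_served = sum(state[i] for i in served)
    n1 = sum(state) - n_served            # n1: vehicles still waiting after the action
    t = T_CROSS * n_served                # waiting time incurred by this phase
    return n1 * (t + t_transition)


# Example with the state used above and a hypothetical phase serving queues 0 and 4:
print(phase_cost((3, 1, 4, 2, 0, 5, 2, 1), served=(0, 4), t_transition=1.5))
# n_served = 3, n1 = 15, t = 9 s  ->  cost = 15 * (9 + 1.5) = 157.5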

Simulation and Results

We wrote MATLAB code to carry out the simulation on a machine with the following configuration:

  • CPU: Intel Pentium 4 Processor 2.40 GHz,

  • Memory: 1,047,792 KB (about 1 GB),

  • Operating System: Microsoft Windows XP Professional (SP3).

In order to show the advantage of our proposed mechanism, the traditional signal mechanism was used for a comparative study. In the traditional mechanism, the signal phases follow a fixed sequence Ph1, Ph2, Ph3, Ph4, Ph5, Ph6, whereas our proposed method determines the optimal phase sequence automatically based on the current situation.

In the following, we show the comparative results for three different periods T and different phase time intervals \( t_{phase} \).

In the simulation result tables, Ps is the simulation period series, NIV is the total number of vehicles in the initial state, Random Queues is the number of vehicle queues that are randomly created, TIQ is the time interval from the initial state to the final state for the Q-learning method, TWQ is the total waiting time for the Q-learning method, TIT is the time interval from the initial state to the final state for the traditional method, with \( T_{IT} = 6 \times t_{phase} \), and TWT is the total waiting time for the traditional method.

$$ P_{EI} = \frac{{T_{IT} - T_{IQ} }}{{T_{IT} }} \times 100\;\% $$
(9)

Equation (9) determines the percent improvement in traffic efficiency.

$$ P_{WD} = \frac{{T_{WT} - T_{WQ} }}{{T_{WT} }} \times 100\;\% $$
(10)

Equation (10) shows the percent decrease in total waiting time.

Finally, OA is the optimum phase sequence found by Q-learning, and TL is the running time of the Q-learning program on the above-mentioned computer.
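For reference, the two evaluation metrics of Eqs. (9) and (10) can be computed directly from these quantities; the numbers in the example below are made up and are not taken from Table 4.

def improvement_percentages(T_IT, T_IQ, T_WT, T_WQ):
    """Compute P_EI (Eq. 9) and P_WD (Eq. 10) from the simulation outputs."""
    P_EI = (T_IT - T_IQ) / T_IT * 100.0  # percent improvement in traffic efficiency
    P_WD = (T_WT - T_WQ) / T_WT * 100.0  # percent decrease in total waiting time
    return P_EI, P_WD


# Hypothetical example: t_phase = 60 s gives T_IT = 6 * 60 = 360 s.
print(improvement_percentages(T_IT=360, T_IQ=240, T_WT=900, T_WQ=600))
# -> (33.33..., 33.33...)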

Analysis of the Results

From Table 4, we find that the running time TL of the Q-learning program is less than one second in every period. This is short enough for application in an intersection signal control system.

Table 4 Simulation results when \( t_{phase} \) = 60 s

At the same time, the percent traffic efficiency improvement PEI ranges from 4.17 % to 47.5 %, and the percent decrease in total waiting time PWD ranges from 1.07 % to 56.95 %. The average PEI is 32.2 % and the average PWD is 37.5 %.

Conclusion

A new traffic control mechanism based on a combination of machine learning and multiagent modeling methods is proposed for future intelligent transportation systems. The control systems, the vehicles, and the necessary roadside sensors are all modeled as intelligent agents, so the ITS becomes a multiagent system in which traffic control efficiency can be improved by artificial intelligence algorithms.

The control method for an isolated intersection was then studied. The intersection signal system was first modeled according to the proposed mechanism; next, a new algorithm based on reinforcement learning, specifically Q-learning, was proposed and studied in detail. Finally, a simulation of such an intersection system was carried out, together with a comparative study against the traditional intersection signal method.

The simulation results showed that the proposed intersection control mechanism can improve traffic efficiency by more than 30 % over the traditional method while also benefiting drivers by decreasing the total waiting time by more than 30 %. These results indicate that the proposed traffic control mechanism could be applicable in the near future.