
1 Introduction

Thousands of vehicles are distributed across a large and broad urban area. Effectively managing such a large-scale, dynamic, and distributed system with a high degree of uncertainty is a difficult and complicated task [1]. Although the number of vehicles in major cities keeps increasing, most current traffic control methods have not taken advantage of intelligent traffic light control [2]. It is observed that sensible traffic control, which enhances the deployment effectiveness of existing roads, is an efficient and cost-effective way to resolve the urban traffic crisis in most urban areas [3]. A traffic signal light control strategy, a vital part of the intelligent transportation system, therefore becomes necessary [4]. Many parameters affect traffic light control, and a static control method is not feasible for rapid and irregular traffic flow. This paper suggests a dynamic traffic control framework based on reinforcement learning [5]. Reinforcement learning offers a crucial approach to the problems cited above and has been deployed effectively in solving various problems [6]. The framework defines the different traffic signal control types as action selections; the number of vehicles arriving and the vehicle density at a junction are observed as the context of the environment, and common signal control indicators, including delay time, the number of stopped vehicles, and the total vehicle density, are described as received rewards. The paper is organized as follows: Sect. 2 gives insight into related work in the area of traffic signal control, Sect. 3 describes the multi-agent cooperative Q learning algorithm (MCQL), Sect. 4 explains the system model, experimental results are given in Sect. 5, and the conclusion is presented in Sect. 6.

2 Related Work

Traffic control systems can be categorized into offline and online traffic control systems. Offline methods use a theoretical approach to optimize the controls, whereas online methods regulate traffic signal timing dynamically according to instantaneous traffic conditions. Many achievements in collaborative traffic flow guidance and control strategy have been made. The F-B approach of [7] was applied in the transportation industry, and by means of this method the traffic congestion problem was partially resolved [7]. Afterward, several enhanced methods based on the F-B approach were developed [8]. A driving reimbursement coefficient and delay time were used to estimate the effectiveness of the time distribution system given in [9]. That approach reduces the waiting delay, making the method appear sharp and sensible, but it can hardly solve the heavy traffic problem, so a more suitable technique is needed. Traffic congestion has been addressed using intelligent traffic control in [10], but congestion among neighboring junctions required a better technique. Local synchronization demonstrated a fine result for this problem, as discussed in [11]. Because of the complexity and unpredictability involved, there is limited opportunity to construct a precise mathematical model of a traffic system in advance [12]. It has become a trend to resolve traffic problems by taking advantage of computing expertise and machine learning [13]. Among many machine intelligence methods, reinforcement learning is feasible for the optimal control of the transport system [14]. The study using the learning algorithm in [15] achieved online traffic control and was able to choose the optimal coordination model under different traffic conditions. Several applications [16] that utilize learning algorithms have shown significant effects. One paper implemented online traffic control through a learning algorithm, yielding good results in the normal state of traffic congestion [17].

2.1 Traffic Estimation Parameters

Signal light control has a crucial responsibility in traffic management. The commonly applied traffic estimation parameters [18] comprise the delay time, the number of vehicles stopped at a signal, and the number of newly arriving vehicles.

Delay Time. The difference between the real time and the theoretically calculated time for a vehicle to leave a signal is defined as the delay time. In practice, the total delay time during a certain period and the average delay time at a crossing can be used to evaluate this time difference. A larger delay time indicates a slower average speed of vehicles leaving the signal.

Number of Vehicles Stopped. The number of vehicles waiting behind the stop line to pass the road signal gives the number of vehicles stopped. This indicator [18] measures the smoothness of the road as well as the road traffic flow. It is defined as

$$ \text{stop} = \text{stopG} + \text{stopR}, $$
(1)

where stopG is the number of automobiles stopped before the green light and stopR is the number of vehicles stopped at the red light.

Number of Vehicles Newly Arrived. The ratio of the actual traffic flow to the maximum available traffic flow gives the signal saturation. It is calculated as

$$ S = \frac{\text{traffic flow}}{dr \times sf}, $$
(2)

where dr is the ratio of the red light duration to the green light duration and sf is the traffic flow of the signal.

Traffic Flow Capacity. The maximum number of vehicles passing through the signal is given by the traffic flow capacity. This indicator reflects the result of the signal control strategy; the traffic signal duration is associated with the traffic flow capacity.
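To make these indicators concrete, the following Python sketch (not part of the original paper; the variable names and the example values are assumptions) computes the stopped-vehicle count of Eq. (1) and the saturation of Eq. (2):

```python
def stopped_vehicles(stop_green: int, stop_red: int) -> int:
    """Eq. (1): total vehicles stopped, counted during the green and red phases."""
    return stop_green + stop_red

def saturation(traffic_flow: float, red_green_ratio: float, signal_flow: float) -> float:
    """Eq. (2): ratio of the actual traffic flow to the maximum available flow,
    where red_green_ratio (dr) is the red/green duration ratio and
    signal_flow (sf) is the traffic flow of the signal."""
    return traffic_flow / (red_green_ratio * signal_flow)

# Illustrative values only: 4 vehicles stopped at green, 7 at red;
# a flow of 300 veh/h, red/green ratio 0.8, signal flow 450 veh/h.
print(stopped_vehicles(4, 7))               # 11
print(round(saturation(300, 0.8, 450), 3))  # 0.833
```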

2.2 Reinforcement Learning

Reinforcement learning is concerned with maximizing a numerical reward by mapping states into actions in different ways [18, 19]. Signal agents identify situations and responses from traffic scenarios and learn information according to learning algorithms; each agent then makes action choices with respect to its own accumulated information. The purpose of the traffic light control system is to increase the traffic flow and decrease the average delay time. In this traffic arrangement, the signals at one intersection coordinate with the signals at other intersections for better transport flow. Throughout the process, the signals at each intersection develop a cooperation strategy to maximize their individual benefit. Cooperation between agents is accomplished by sharing partial information about their states with the adjacent agents. In view of the changing scenarios of real traffic situations, a multi-agent cooperative Q learning algorithm (MCQL) is developed for the intelligent traffic control approach [19,20,21]. The Q update equation is given as:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma\, Q(s', a') - Q(s, a) \right), $$
(3)

where the state, action, immediate reward, and cumulative reward at time \( t \) are denoted by \( s_t \), \( a_t \), \( r_t \), and \( Q_t \), respectively. \( Q_t(s, a) \) is called the policy function, \( \alpha \in [0, 1] \) refers to the learning rate, and \( \gamma \in [0, 1] \) indicates the discount rate.
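As an illustration of Eq. (3), here is a minimal tabular update in Python (a sketch, not the authors' implementation; the dictionary-based Q table and the state/action encodings are assumptions, while the learning rate 0.5 and discount rate 0.9 follow the values used in Sect. 5):

```python
from collections import defaultdict

# Q table: maps (state, action) pairs to cumulative reward estimates.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, next_action,
             alpha=0.5, gamma=0.9):
    """One step of Eq. (3): Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_error = reward + gamma * Q[(next_state, next_action)] - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return Q[(state, action)]

# Example: an agent in state (5, 3) takes action 2, receives reward 2,
# and lands in state (5, 2) where it picks action 1 (hypothetical values).
q_update(state=(5, 3), action=2, reward=2, next_state=(5, 2), next_action=1)
```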

3 Multi-agent Cooperative Q Learning (MCQL)

Synchronization in multi-agent reinforcement learning generates a complex set of behaviors arising from the different agents' actions. A portion of the knowledge of a well-performing group of agents (i.e., a general form) is shared among the different agents via a specific form (Qi) [22]. Such specific forms embrace limited details about the environment, and these strategies are incorporated to improve the sum of the partial rewards received using a satisfactory cooperation prototype. The action plans, or forms, are created by the multi-agent Q learning algorithm by making the agents search for the most excellent form Q* and accumulate the rewards. When forms Q1, …, Qx are incorporated, it is possible to construct new forms, the General Form (GF = {GF1, …, GFx}), in which GFi denotes the outstanding reinforcement received by agent i throughout the learning mode [5]. Algorithm 1 expresses the get_form algorithm that shares the agents' knowledge. The forms are designed by the Q learning used for all prototypes. Outstanding reinforcements are responsible for GF, which compiles all outstanding rewards and is shared with the other agents [21, 22]. Transforming partial rewards into GF through the outstanding reinforcements achieves the cooperation between the agents. A status utility gives the outstanding form between the opening states and the closing state for a known form, which approximates GF with the outstanding reinforcements. The status utility is calculated as the sum of the steps the agent needed to reach the destination at the closing state and the sum of the received status in the forms between each opening state and the closing state [22].

Algorithm 1

Cooperative Multi-agent Q Learning Algorithm

Algorithm get_form (I, technique)

  1. Initialize Qi(s, a) and GFi(s, a).

  2. Coordinate the agents i ∈ I.

  3. Agents collaborate until the target state is found; episode \( \leftarrow \) episode + 1.

  4. Apply the renewal rule that estimates the reward value:

     $$ Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma\, Q(s', a') - Q(s, a) \right) $$

  5. Fcooperate (episode, tech, s, a, i).

  6. Qi \( \leftarrow \) GF, that is, Qi of agent i ∈ I is updated by means of GFi.
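The following Python sketch shows how the get_form loop of Algorithm 1 could be organized. It is an illustration only: the env interface (reset, step, is_target, actions) and the cooperate_fn hook (playing the role of Fcooperate, sketched after the cooperation models below) are assumed, and exploration is simplified to random action choice.

```python
import random
from collections import defaultdict

def get_form(agents, technique, env, cooperate_fn, episodes=6000,
             alpha=0.5, gamma=0.9):
    """Sketch of Algorithm 1 (get_form). Each agent i keeps a specific form Q_i
    and a general form GF_i; after every episode the agents cooperate through
    cooperate_fn and copy the shared general form back into Q_i."""
    Q = {i: defaultdict(float) for i in agents}   # step 1: Q_i(s, a)
    GF = {i: defaultdict(float) for i in agents}  # step 1: GF_i(s, a)

    for episode in range(episodes):               # steps 2-3: coordinated episodes
        for i in agents:
            s = env.reset(i)
            a = random.choice(env.actions)
            while not env.is_target(s):           # collaborate until the target state
                s_next, r = env.step(i, s, a)
                a_next = random.choice(env.actions)
                # step 4: renewal rule, Eq. (3)
                Q[i][(s, a)] += alpha * (r + gamma * Q[i][(s_next, a_next)]
                                         - Q[i][(s, a)])
                s, a = s_next, a_next
        cooperate_fn(episode, technique, Q, GF, agents)   # step 5: Fcooperate
        for i in agents:
            Q[i].update(GF[i])                    # step 6: Q_i <- GF_i
    return Q, GF
```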

Cooperation Models

Various collaboration methods for cooperative reinforcement learning are proposed here:

  (i) Group model—reinforcements are distributed in a sequence of steps.

  (ii) Dynamic model—reinforcements are distributed at each action.

  (iii) Goal-oriented model—the sum of reinforcements is distributed when the agent reaches the goal state (Sgoal).

Algorithm 2

Cooperation Model

Fcooperate (episode, tech, s, a, i) /* cooperation between agents as four cases */

q: count of sequence

  1. Switch between the cases.

  2. In the case of the Group method:
     • if episode mod q = 0 then
     • get_Policy(Qi, Q*, GFi);

  3. In the case of the Dynamic method:
     • \( r \leftarrow \sum\nolimits_{j = 1}^{x} Q_j(s, a) \);
     • Qi(s, a) \( \leftarrow \) r;
     • get_Policy(Qi, Q*, GFi);

  4. In the case of the Goal-oriented method:
     • if S = Sgoal then
     • \( r \leftarrow \sum\nolimits_{j = 1}^{x} Q_j(s, a) \);
     • Qi(s, a) \( \leftarrow \) r;
     • get_Policy(Qi, Q*, GFi);

Algorithm 3

get_Policy

Function get_Policy(Qi, Q*, GFi) /* find the universal agent policy */

  1. for each agent i ∈ I

  2. for each state s ∈ S

  3. if value(Qi, s) \( \le \) value(Q*, s) then
     • GFi(s, a) \( \leftarrow \) Qi(s, a);

  4. end for

Group Model: During the learning process, each agent receives reinforcements for its actions. At the end of the sequence (step q), each agent contributes the value of Qj to GF. If a reward value is suitable, that is, it improves the usefulness of another agent for a given state, the agents will afterward contribute these rewards [21,22,23].
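A possible Python rendering of the group-model branch of Fcooperate (Algorithm 2) together with get_Policy (Algorithm 3) is sketched below; it plugs into the get_form sketch given earlier. The sequence length q, the value() helper, and the fallback choice of Q* are assumptions, since the pseudocode does not fix them.

```python
Q_SEQUENCE = 10  # q: length of the reward-sharing sequence in the group model (assumed)

def state_value(table, state):
    """Assumed value() helper: the best action value recorded for a state."""
    vals = [v for (s, _), v in table.items() if s == state]
    return max(vals) if vals else 0.0

def get_policy(Q_i, Q_star, GF_i):
    """Sketch of Algorithm 3 (get_Policy): copy Q_i entries into the general
    form GF_i for states where Q_i's value satisfies the comparison with Q*."""
    for (s, a), v in Q_i.items():
        if state_value(Q_i, s) <= state_value(Q_star, s):
            GF_i[(s, a)] = v

def f_cooperate(episode, technique, Q, GF, agents, Q_star=None):
    """Group-model branch of Algorithm 2 (Fcooperate): every Q_SEQUENCE episodes
    each agent shares its accumulated form through get_policy. How the reference
    form Q* is maintained is not fixed by the pseudocode, so each agent's own
    table is used as a fallback here."""
    if technique != "group" or episode % Q_SEQUENCE != 0:
        return
    for i in agents:
        get_policy(Q[i], Q_star if Q_star is not None else Q[i], GF[i])
```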

4 Model Design

In a practical environment, traffic flows of four signals with eight flow directions are considered for the development. As shown in Fig. 1, there are four junctions in total, each with its own signal agent, i.e., agent 1, agent 2, agent 3, and agent 4 for Ja, Jb, Jc, and Jd, respectively.

Fig. 1 Traffic flow and control of four intersections with eight flow directions

The control coordination between the intersections can be viewed as a Markov process, denoted by ⟨S, A, R⟩, where S represents the state of the intersection, A stands for the action for traffic control, and R indicates the return attained by the control agent.

Definition of State: The agent receives the instantaneous traffic state and returns a traffic control decision according to the present state of the road. Essential data such as the number of newly arriving vehicles and the number of vehicles currently stopped at the signal are used to reflect the state of road traffic.

Maximum number of newly arriving vehicles: Xmax = 10, applying to each of x1, x2, x3, and x4.

Maximum number of vehicles currently stopped at junction J: Imax = 20, applying to each of i1, i2, i3, and i4.

The state for agent 1 becomes (x1, i1); e.g., (5, 0) means that 5 new vehicles are arriving at agent 1 and 0 vehicles are stopped at junction 1. Similarly, the state for agent 2 becomes (x2, i2), the state for agent 3 becomes (x3, i3), and the state for agent 4 becomes (x4, i4). The state of the system thus becomes the input (xi, ii). Combining a maximum of 10 arriving vehicles with a maximum of 20 vehicles stopped at the signal gives 200 possible states per agent (10 ∗ 20 = 200).
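A small sketch of how the per-agent state space could be enumerated and indexed in Python; the assumption that arrival counts run over 10 values and queue lengths over 20 values (0-based here) reproduces the 200-state figure:

```python
X_MAX, I_MAX = 10, 20  # number of distinct arriving / stopped counts (assumed 0..9 and 0..19)

# Enumerate every (arriving, stopped) pair for one agent.
STATES = [(x, i) for x in range(X_MAX) for i in range(I_MAX)]
assert len(STATES) == 200  # 10 * 20 possible states per agent

def state_index(arriving: int, stopped: int) -> int:
    """Map an observed (arriving, stopped) pair to a row index of a Q table."""
    return arriving * I_MAX + stopped

# Example from the text: 5 new vehicles arriving, 0 stopped at junction 1.
print(state_index(5, 0))  # 100
```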

Definition of Action: In the reinforcement learning framework, the policy denotes the learning agent's behavior at a given time. Traffic light control actions can be categorized into three types: no change in signal duration, increasing the signal duration, and reducing the signal duration.

Value   Action
1       No change in signal duration
2       Increase in signal duration
3       Reduce the signal duration

The action set for signal agent 1 is A1 = {1, 2, 3}, the action set for signal agent 2 is A2 = {1, 2, 3}, and the action set for signal agent 3 is A3 = {1, 2, 3}.

Each action corresponds to one of the following actual traffic scenarios. The strategy of no change in signal duration is used in the case of normal traffic flow, when the light control rules do not need to change. The strategy of increasing the signal duration is mostly used when the traffic flow on one route is stopped while the other route is regular; increasing the signal duration extends the traffic flow while the signal lights are still timing. The strategy of decreasing the signal duration is mostly used when the traffic flow on one route is small while that of the other route is large; decreasing the signal light duration reduces the waiting time of the other route and lets its vehicles pass the junction faster, while the signal lights keep timing [23, 24].
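For illustration, the three action values could be mapped onto the green-phase duration as in the following sketch; the 5-second step and the minimum/maximum durations are assumptions, not values from the paper:

```python
NO_CHANGE, INCREASE, REDUCE = 1, 2, 3   # action values from the table above
STEP = 5                                # assumed adjustment step in seconds

def apply_action(green_duration: int, action: int,
                 min_green: int = 10, max_green: int = 60) -> int:
    """Adjust the green-phase duration according to the chosen action,
    clamped to assumed minimum and maximum durations."""
    if action == INCREASE:
        green_duration += STEP
    elif action == REDUCE:
        green_duration -= STEP
    return max(min_green, min(max_green, green_duration))

print(apply_action(30, INCREASE))  # 35
print(apply_action(30, REDUCE))    # 25
```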

Definitions of Reward and Return: The agent makes signal control decisions under diverse traffic circumstances and returns an action sequence. We use the traffic value display to estimate the traffic flows, as in [26].

Reward is calculated in the system as given below:

  • Assume the current state is i = (xi, ii) and the next state is j = (xj, ij); the transition is current state i \( \to \) next state j.

  • Case 1: [xi, ii] \( \to \) [xi, ii − 1], e.g.,

    $$ [\text{X}_{\max} = 10, \text{I}_{\max} = 20] \to [\text{X}_{\max} = 10, \text{I}_{\max} = 19] $$

    This means one currently stopped vehicle passes the junction.

  • Case 2: [xi, ii] \( \to \) [xi + 1, ii − 1], e.g.,

    $$ [\text{X}_{\max} = 9, \text{I}_{\max} = 20] \to [\text{X}_{\max} = 10, \text{I}_{\max} = 19] $$

    This means one vehicle newly arrives at the junction and one vehicle passes.

  • Case 3: [xi, ii] \( \to \) [xi, ii − 3], e.g.,

    $$ [\text{X}_{\max} = 10, \text{I}_{\max} = 20] \to [\text{X}_{\max} = 10, \text{I}_{\max} = 17] $$

    This means more than one stopped vehicle passes the junction.

  • Case 4: [xi, 0] \( \to \) [xi + 1, 0], e.g.,

    $$ [\text{X}_{\max} = 2, \text{I}_{\max} = 0] \to [\text{X}_{\max} = 3, \text{I}_{\max} = 0] $$

    This means one new vehicle arrives and no vehicle is stopped at the junction.

Depending on the above state transitions from the current state to the next state, the reward is calculated as [24]

$$ r_{\text{p}}(i, p, j) = \begin{cases} 1 & \text{if } x_{1}^{\prime} = x_{1} + 1 \quad (\text{Case 4}) \\ 2 & \text{if } i_{1}^{\prime} = i_{1} - 1 \quad (\text{Case 1}) \\ 3 & \text{if } i_{1}^{\prime} = i_{1} - 3 \quad (\text{Cases 2 and 3}) \\ 0 & \text{otherwise.} \end{cases} $$
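A minimal Python sketch of this reward rule is given below. Because the conditions for Cases 1, 2, and 4 overlap, the ordering of the checks and the tightened condition for Case 4 (queue unchanged, as in its transition example) are assumptions:

```python
def reward(current, nxt):
    """Case-based reward mirroring the equation above.
    current = (x1, i1): newly arriving and stopped vehicles before the transition,
    nxt     = (x1', i1'): the same quantities after the transition.
    Where conditions overlap, the first match wins."""
    x1, i1 = current
    x1_next, i1_next = nxt
    if x1_next == x1 + 1 and i1_next == i1:   # Case 4: a new arrival, queue unchanged
        return 1
    if i1_next == i1 - 1:                     # Case 1 (and Case 2): one stopped vehicle passes
        return 2
    if i1_next == i1 - 3:                     # Case 3: three stopped vehicles pass
        return 3
    return 0

print(reward((2, 0), (3, 0)))      # Case 4 -> 1
print(reward((10, 20), (10, 19)))  # Case 1 -> 2
print(reward((10, 20), (10, 17)))  # Case 3 -> 3
```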

5 Experimental Results

The study learns a controller with learning rate α = 0.5, discount rate γ = 0.9, and λ = 0.6. During the learning process, the cost was updated 1000 times over 6000 episodes.

Figure 2 shows the delay time versus the number of states given by simple Q learning (without cooperation) and the group method (with cooperation). The delay time obtained by the cooperative method, i.e., the group method, is much less than that of the non-cooperative method, i.e., simple Q learning, for agent 1 in the multi-agent scenario.

Fig. 2 States versus delay time for agent 1 by Q learning and the group method

Figure 3 shows the delay time versus the number of states given by simple Q learning (without cooperation) and the group method (with cooperation). The delay time obtained by the cooperative method, i.e., the group method, is much less than that of the non-cooperative method, i.e., simple Q learning, for agent 2 in the multi-agent scenario.

Fig. 3 States versus delay time for agent 2 by Q learning and the group method

Figure 4 shows the delay time versus the number of states given by simple Q learning (without cooperation) and the group method (with cooperation). The delay time obtained by the cooperative method, i.e., the group method, is much less than that of the non-cooperative method, i.e., simple Q learning, for agent 3 in the multi-agent scenario.

Fig. 4 States versus delay time for agent 3 by Q learning and the group method

6 Conclusion

The traffic control system is so complicated and variable that a Q learning model without cooperation and with a fixed strategy can rarely cope with traffic jams and sudden traffic accidents, which may occur at any time; the demand for combining a timely, intelligent traffic control policy with real-time road traffic is therefore increasingly urgent. Reinforcement learning gathers experience and information by continually interacting with the environment. Although it usually needs a long time to complete learning, it has a good ability to learn a complex system, enabling it to handle unknown complex states well. The application of reinforcement learning in the traffic management area is gradually receiving more attention. This paper proposed a cooperative multi-agent reinforcement learning-based model (CMRLM) for traffic control optimization. The actual continuous traffic states are discretized for the purpose of simplification. We design actions for traffic control and define reward and return by means of a traffic cost that combines multiple traffic capacity indicators.