
1 Introduction

A large number of vehicles is dispersed across a large and broad urban area. This makes it difficult and complicated to successfully manage such a large-scale, dynamic, and distributed system with a high degree of uncertainty [1]. Although the number of vehicles keeps growing in major cities, most current traffic control methods have not taken advantage of intelligent traffic light control [2]. It is observed that sensible traffic control, together with enhancing the deployment effectiveness of existing roads, is an efficient and cost-effective technique for resolving the urban traffic crisis in most urban areas [3]. A traffic signal light control strategy is therefore a vital and necessary part of any intelligent transportation system [4]. Many parameters affect traffic light control, and a static control method is not feasible for rapid and irregular traffic flow. This paper proposes a dynamic traffic control framework based on reinforcement learning [5]. Reinforcement learning offers a crucial way to resolve the problems cited above and has been deployed effectively to solve various problems [6]. The framework defines the different traffic signal control types as action selections; the number of arriving vehicles and the vehicle density at a junction are observed as the environment state. Signal management parameters, such as delay time, the number of stopped vehicles, and the total vehicle density, are defined as the received rewards.

The remainder of the article is organized as follows. Section 2 describes the traffic estimation parameters. The cooperative multi-agent reinforcement learning algorithm (CMRLA) is proposed in Sect. 3. Section 4 discusses the system model, including definitions of the state, action, and reward function. Section 5 discusses the experiments and the analysis of the results, followed by concluding remarks.

2 Traffic Estimation Parameters

In traffic management, a very crucial responsibility is handled by signal light control. A practical time allotment method ensures that under usual conditions the traffic moves seamlessly. Commonly applied traffic estimation parameters [7] comprise the delay time, the number of vehicles stopped at the intersection, and the number of newly arriving vehicles.

2.1 Delay Time

Delay time is defined as the difference between the actual time and the theoretically calculated time needed for a vehicle to leave a signal. In practice, the total delay time during a certain period and the average delay time at a crossing can be used to evaluate this time difference. A larger delay time indicates a lower average speed of vehicles leaving the signal.
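As a minimal illustration (assuming per-vehicle actual and theoretical departure times are logged; all names are ours), the total and average delay over one observation period could be computed as:

```python
def delay_statistics(actual_departures, theoretical_departures):
    """Return (total_delay, average_delay) for one observation period."""
    delays = [a - t for a, t in zip(actual_departures, theoretical_departures)]
    total_delay = sum(delays)
    average_delay = total_delay / len(delays) if delays else 0.0
    return total_delay, average_delay

# Example: three vehicles, times in seconds since the start of the period.
print(delay_statistics([12.0, 30.5, 47.0], [8.0, 25.0, 40.0]))   # (16.5, 5.5)
```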

2.2 Number of Vehicles Stopped

The number of vehicles stopped is the number of vehicles waiting behind the stop line to leave the road signal. This indicator [8] measures how smoothly the road operates as well as the road traffic flow. It is defined as

$$ \text{stop} = \text{stopG} + \text{stopR} $$
(1)

where stopR is the number of vehicles stopped before the red light and stopG is the number of vehicles stopped before the green light.

2.3 Number of Vehicles Newly Arrived

The signal saturation is the ratio of the actual traffic flow to the maximum available traffic flow. The indicator for newly arrived vehicles is calculated as

$$ S = \frac{\text{traffic flow}}{dr \cdot sf} $$
(2)

where sf is the traffic flow of the signal and dr is the ratio of the red-light duration to the green-light duration.
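The two indicators above follow directly from the observed quantities; a hedged sketch (the variable names are ours) is:

```python
def stopped_vehicles(stop_g: int, stop_r: int) -> int:
    """Eq. (1): vehicles stopped before the green light plus those before the red light."""
    return stop_g + stop_r

def saturation(traffic_flow: float, dr: float, sf: float) -> float:
    """Eq. (2): ratio of the actual traffic flow to the maximum available flow,
    where dr is the red/green duration ratio and sf is the signal's traffic flow."""
    return traffic_flow / (dr * sf)

print(stopped_vehicles(stop_g=4, stop_r=9))                      # 13
print(round(saturation(traffic_flow=600, dr=1.5, sf=800), 3))    # 0.5
```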

2.4 Traffic Flow Capacity

Traffic flow capacity is the highest number of vehicles that can cross through the signal. This indicator reflects the result of the signal control strategy. Traffic signal duration and traffic flow capacity are associated with each other: in general, a longer crossing period results in a greater crossing capability.

3 Cooperative Multi-agent Reinforcement Learning Algorithm (CMRLA)

Synchronization in multi-agent reinforcement learning generates a complex set of policies assembled from the different agents’ actions. The portion belonging to a well-performing agent group (i.e., a general form) is shared among the agents via each agent’s specific form (Qi) [9]. These specific forms embody partial details about the environment. The strategies are combined to improve the sum of the partial rewards received, using a suitable cooperation prototype. The action plans, or forms, are created by a multi-agent Q-learning algorithm that drives the agents to search for the best form Q* while accumulating rewards. When the forms Q1, …, Qx are combined, a new set of forms, the General Form (GF = {GF1, …, GFx}), can be constructed, in which GFi denotes the outstanding reinforcement received by agent i throughout the learning mode [10]. Algorithm 1 presents the get_form algorithm, which shares the agents’ knowledge. The forms are built by Q-learning for all prototypes. The outstanding reinforcements feed GF, which compiles all outstanding rewards; GF is then shared with the other agents [11, 12]. Transforming partial rewards into GF through the outstanding reinforcements is what achieves the cooperation between the agents. A status utility gives the outstanding form between the opening states and the closing state for a known form, which approximates GF with the outstanding reinforcements. The status utility is calculated as the sum of the number of steps the agent needs to reach the destination at the closing state and the sum of the statuses received in the forms between each opening state and the closing state [13].

Algorithm 1. The get_form algorithm (figure a)

The Fcooperate utility selects a coordination method. Its parameters are period, tech, s, a, and i, where period is the current iteration, the cooperation technique tech is one of {grp, dyna, gol}, and s and a are the chosen state and action, respectively.
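As a rough illustration of this scheme, the sketch below shows a tabular multi-agent Q-learning skeleton with a get_form-style sharing step. It is only our reading of the description above: the data structures, the ε-greedy rule, and the rule that only values better than the current GF entry count as “outstanding” are assumptions, not the authors’ implementation (the gol technique is omitted).

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # learning rate, discount factor, exploration rate

Q  = defaultdict(float)   # an agent's specific form Q_i, keyed by (state, action)
GF = defaultdict(float)   # shared general form, keyed by (state, action)

def q_update(s, a, r, s_next, actions):
    """Standard tabular Q-learning update on the agent's specific form."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def f_cooperate(tech, period, s, a, q=50):
    """Share outstanding reinforcements into GF.
    tech is one of {"grp", "dyna", "gol"}; in the grp technique sharing
    happens only at the end of every q-step period (q is assumed)."""
    if tech == "dyna" or (tech == "grp" and period % q == 0):
        GF[(s, a)] = max(GF[(s, a)], Q[(s, a)])   # keep only outstanding values

def choose_action(s, actions):
    """Epsilon-greedy choice over the agent's own form combined with GF."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)] + GF[(s, a)])
```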

3.1 Cooperation Models

Two cooperation methods for cooperative reinforcement learning are proposed:

  (i) Grp model – reinforcements are disseminated in a series of periods.

  (ii) Dyna model – reinforcements are distributed in each action.

figure b

figure c
Grp Model: During the learning period, each agent collects the expertise-based rewards received from its actions. At the end of the period (step q), every agent contributes the value of Qj to GF. The utility of the other agents for a given state is enhanced when the reward value is appropriate, and these expertise-based reinforcements are afterwards supplied to the agents. Each agent continues to make use of its rewards with the objective of converging to the latest values [11,12,13].

Dyna Model: Coordination in the dyna method is achieved as follows: each action perceived by an agent produces a reinforcement value (positive or negative), which is the sum of the expertise-based rewards of all agents for action a performed in state s. Each agent collaborates to increase the sum of rewards while fulfilling its own policy [14].
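To contrast the two sharing rules, a minimal sketch is given below. It follows the descriptions above under our own assumptions (a simple Agent class holding a Q-table, the period length q, and a max-merge into GF); it is not the authors’ exact code.

```python
from collections import defaultdict

class Agent:
    """A learning agent holding its specific form (Q-table)."""
    def __init__(self):
        self.Q = defaultdict(float)   # keyed by (state, action)

def grp_share(agents, GF, step, q=50):
    """Grp model: at the end of each period (every q steps), every agent
    contributes its Q_j values to the general form GF."""
    if step % q != 0:
        return
    for agent in agents:
        for key, value in agent.Q.items():
            GF[key] = max(GF[key], value)   # keep only outstanding values

def dyna_share(agents, s, a, reinforcement):
    """Dyna model: every perceived action immediately broadcasts its
    reinforcement (positive or negative) for (s, a) to all agents."""
    for agent in agents:
        agent.Q[(s, a)] += reinforcement
```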

4 Model Design

In a practical environment, the traffic flows of four signals with eight flow directions are considered for the development. The control coordination between the intersections can be viewed as a Markov process, denoted by ⟨S, A, R⟩, where S represents the state of the intersection, A stands for the traffic control action, and R indicates the return attained by the control agent [15].

4.1 States of System

Each agent receives instantaneous traffic states and returns a traffic control decision that reflects the present state of the road. Essential data such as the number of newly arriving vehicles and the number of vehicles currently stopped at the signal are used to describe the state of road traffic [14, 15].

  • Number of vehicles newly arriving: x1, x2, x3, x4, with Xmax = 10

  • Number of vehicles currently stopped at junction J: i1, i2, i3, i4, with Imax = 20

  • The state of the system, given as input, is the pair (xi, ii).

Hence, by combining a maximum of 10 arriving vehicles with a maximum of 20 vehicles stopped at the signal, the system has 200 possible states (10 * 20 = 200).
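A minimal sketch of this state space is shown below; the 1-based ranges and the index mapping are our own choices for illustration.

```python
X_MAX, I_MAX = 10, 20   # max newly arriving vehicles, max stopped vehicles

# All 200 possible states (x, i): x newly arriving, i currently stopped.
STATES = [(x, i) for x in range(1, X_MAX + 1) for i in range(1, I_MAX + 1)]
assert len(STATES) == 200

def state_index(x: int, i: int) -> int:
    """Map the input pair (x, i) to a single index in [0, 200)."""
    return (x - 1) * I_MAX + (i - 1)

print(state_index(1, 1))     # 0   (first state)
print(state_index(10, 20))   # 199 (last state)
```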

4.2 Actions of System

In a reinforcement learning framework, each policy denotes the learning agent’s activity at a given time. Rewards are obtained by mapping the scene to an action; the action affects not only the next scene but also the direct reward, so that all successive rewards are affected [15, 16]. In this study, traffic light control actions are categorized into three types: no change in signal duration, increasing the signal duration, and reducing the signal duration.

Value   Action
1       No change in signal duration
2       Increase in signal duration
3       Reduce the signal duration

Action set for signal agent 1 is A1 = {1, 2, 3}, action set for signal agent 2 is A2 = {1, 2, 3} and action set for signal agent 3 is A3 = {1, 2, 3}.

Each of them corresponds to one of the following actual traffic scenarios.

The strategy of keeping the signal duration unchanged is used for normal traffic flow, when the light control rules do not need to change [16,17,18]. The strategy of increasing the signal duration is mostly used when traffic on one route is flowing regularly while traffic on the other route is stopped. Two cases are possible: the signal duration is increased to extend the traffic flow, and it is decreased when the traffic flow on one route is lower than on the other route. With a shorter signal duration, the waiting time on the other route is reduced, so vehicles pass the junction faster.
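A minimal sketch of how these three actions could be applied to a signal’s green duration is given below; the 5-second adjustment step and the duration bounds are illustrative assumptions, not values taken from the paper.

```python
ACTIONS = {1: "no change in signal duration",
           2: "increase in signal duration",
           3: "reduce the signal duration"}

STEP, MIN_GREEN, MAX_GREEN = 5, 10, 60   # seconds; assumed values

def apply_action(green_duration: int, action: int) -> int:
    """Return the new green-light duration after applying action 1, 2, or 3."""
    if action == 2:
        green_duration += STEP
    elif action == 3:
        green_duration -= STEP
    return max(MIN_GREEN, min(MAX_GREEN, green_duration))

A1 = A2 = A3 = [1, 2, 3]     # action sets for signal agents 1, 2, and 3
print(apply_action(30, 2))   # 35
```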

4.3 Definitions of Reward and Return

The reward function in reinforcement learning describes the target of the problem. The apparent state of the environment is mapped to a value, the reinforcement, expressing the internal needs of that state [18].

In this work, the agent makes signal control decisions under diverse traffic circumstances and returns an action sequence such that the resulting road traffic congestion indicator is minimal. In addition, the model provides the best traffic synchronization mode for a particular traffic state. Here, we use a traffic value indicator to estimate the traffic flows.

The reward is calculated in the system as given below:

  • Assume the current state is i = (xi, ii) and the next state is j = (xj, ij), i.e., current state i → next state j.

  • Case 1: [xi, ii] → [xi, ii−1], e.g., [Xmax = 10, Imax = 20] → [Xmax = 10, Imax = 19]

  • That means one of the currently stopped vehicles passes the junction.

  • Case 2: [xi, ii] → [xi+1, ii−1], e.g., [Xmax = 9, Imax = 20] → [Xmax = 10, Imax = 19]

  • That means one vehicle newly arrives at the junction and one vehicle passes the junction.

  • Case 3: [xi, ii] → [xi, ii−3], e.g., [Xmax = 10, Imax = 20] → [Xmax = 10, Imax = 17]

  • That means more than one stopped vehicle passes the junction.

  • Case 4: [xi, 0] → [xi+1, 0], e.g., [Xmax = 2, Imax = 0] → [Xmax = 3, Imax = 0]

  • That means one new vehicle arrives and no vehicle is stopped at the junction. Depending on these transitions from the current state to the next state, the reward is calculated as

$$ r_p(i, p, j) = \begin{cases} 1 & \text{if } x_1' = x_1 + 1 \quad \text{(Case 4)} \\ 2 & \text{if } i_1' = i_1 - 1 \quad \text{(Case 1)} \\ 3 & \text{if } i_1' = i_1 - 3 \quad \text{(Cases 2 and 3)} \\ 0 & \text{otherwise} \end{cases} $$
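The reward definition translates directly into a small function. The sketch below follows the cases as listed above; the ordering of the checks where the cases overlap is our own choice.

```python
def reward(state, next_state):
    """Reward r_p(i, p, j) for a transition (x, i) -> (x', i')."""
    x, i = state
    x_next, i_next = next_state
    if x_next == x + 1 and i == 0 and i_next == 0:
        return 1   # Case 4: one new vehicle arrives, no vehicle is queued
    if i_next == i - 1:
        return 2   # Case 1: one queued vehicle passes (Case 2 also matches here)
    if i_next == i - 3:
        return 3   # Case 3: several queued vehicles pass (grouped with Case 2 above)
    return 0       # otherwise

print(reward((10, 20), (10, 19)))   # 2
print(reward((2, 0), (3, 0)))       # 1
```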

5 Experimental Results

The study trains a controller with learning rate = 0.5, discount rate = 0.9, and λ = 0.6. During the learning process, the cost values were updated 1000 times over 6000 episodes.
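For reference, a tabular Q(λ)-style update with the hyperparameters reported above (α = 0.5, γ = 0.9, λ = 0.6) is sketched below; the accumulating-trace formulation is a standard one and is not necessarily the exact variant used in the study.

```python
from collections import defaultdict

ALPHA, GAMMA, LAMBDA = 0.5, 0.9, 0.6
N_EPISODES = 6000

Q = defaultdict(float)   # action values, keyed by (state, action)
E = defaultdict(float)   # eligibility traces, keyed by (state, action)

def q_lambda_step(s, a, r, s_next, actions):
    """One tabular Q(lambda) update with accumulating eligibility traces."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    delta = r + GAMMA * best_next - Q[(s, a)]
    E[(s, a)] += 1.0
    for key in list(E):
        Q[key] += ALPHA * delta * E[key]
        E[key] *= GAMMA * LAMBDA
```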

The grp method appears to be extremely robust, converging very fast to an optimal action form Q*. The rewards obtained by the agents are produced in a series of pre-identified stages, and the agents gather reasonable reward values that lead to good convergence. In the grp method, the global policy converges to the best action strategy because there is an interval of periods needed to gather good reinforcements. The general form of the dyna method is able to assemble good reward values in a small number of learning periods. It is observed, however, that after some periods the performance of the global strategy decreases. This happens because the states neighbouring the final state start to receive higher reward values, leading to a local maximum: the agent no longer visits the other states, which penalizes it. In the dyna method, as the reinforcement learning algorithm updates the learning values, actions with higher accumulated reinforcements are chosen with higher likelihood than actions with small accumulated reinforcements.

Figures 1 and 2 show the delay time versus the number of states for simple Q-learning (without cooperation) compared with the grp and dyna methods (with cooperation), respectively. The delay time obtained by the cooperative methods, i.e., the grp and dyna methods, is much less than that of the non-cooperative method, i.e., simple Q-learning, for agent 1 in the multi-agent scenario.

Fig. 1.
figure 1

States vs delay time for agent 1 by Q-learning & grp method

Fig. 2.
figure 2

States vs delay time for agent 1 by Q-learning & dyna method

Figures 3 and 4 show the delay time versus the number of states for simple Q-learning (without cooperation) compared with the grp and dyna methods (with cooperation), respectively. The delay time obtained by the cooperative methods, i.e., the grp and dyna methods, is much less than that of the non-cooperative method, i.e., simple Q-learning, for agent 2 in the multi-agent scenario.

Fig. 3.
figure 3

States vs delay time for agent 2 by Q-learning & grp method

Fig. 4.
figure 4

States vs delay time for agent 2 by Q-learning & dyna method

Figures 5 and 6 show the delay time versus the number of states for simple Q-learning (without cooperation) compared with the grp and dyna methods (with cooperation), respectively. The delay time obtained by the cooperative methods, i.e., the grp and dyna methods, is much less than that of the non-cooperative method, i.e., simple Q-learning, for agent 3 in the multi-agent scenario.

Fig. 5.
figure 5

States vs delay time for agent 3 by Q-learning & grp method

Fig. 6.
figure 6

States vs delay time for agent 3 by Q-learning & dyna method

6 Conclusion

The traffic control system is complicated and dynamic in nature. It is impossible to manage traffic jams and sudden traffic accidents with a Q-learning model that relies on a predefined strategy without cooperation. The demand for combining a timely, intelligent traffic control policy with real-time road traffic is becoming increasingly urgent. Reinforcement learning collects information by continuously interacting with its environment. Although it usually needs a long time to complete learning, it has a good ability to learn complex systems, enabling it to handle unknown complex states well. The application of reinforcement learning to traffic management is therefore gradually receiving more attention. This paper proposed a cooperative multi-agent reinforcement learning algorithm (CMRLA) for traffic control optimization. The actual continuous traffic states are discretized for simplification. Actions for traffic control are designed, and rewards are defined in terms of a traffic cost that combines multiple traffic capacity indicators.