1 Introduction

In this work, we test solutions based on multiagent reinforcement learning to decrease travel times, modeling the problem as a multiagent Markov Decision Process (MDP). A collection of agents learns to minimize vehicle queuing delays and queue lengths at all junctions. Coordination between agents is achieved in two different ways. In the first approach (Q-VE), agents are modeled as vertices of a coordination graph and the joint action is found with the variable elimination algorithm. The second method (Q-BR) computes the action of each agent as the best response of a two-player game played with each member of its neighborhood.

2 Main Purpose

Multiagent RL for traffic light control allows the global function Q to be split into a linear combination of contributions, one per agent. However, decisions made at the individual level must be optimal for the group. Hence, the coordination problem is to find, at each step, the joint action:

$$\begin{aligned} \mathbf {a}^{*}=\mathop {{{\,\mathrm{argmax}}}}\limits _{\mathbf {a}^{\prime } \in \mathcal {A}}Q(\mathbf {s}^{k},\mathbf {a}^{\prime }) \end{aligned}$$
(1)
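
The decomposition referred to above can be written in standard coordination-graph notation as a sum of edge factors; this is a sketch consistent with Sect. 2.1, not necessarily the paper's exact factorization:

$$\begin{aligned} Q(\mathbf {s}^{k},\mathbf {a})=\sum _{(i,j)\in \mathcal {E}}Q_{ij}(s_{ij}^{k},a_{i},a_{j}) \end{aligned}$$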

2.1 Coordination Graphs - (Q-VE)

In a coordination graph \(G=(\mathcal {V},\mathcal {E})\), agent \(i \in \mathcal {V}\) needs to coordinate its actions with its neighbors \(\varGamma (i)=\{j\,:\, (i,j)\in \mathcal {E}\}\). Given the action \(\mathbf {a}^{*}\) and the joint state \(s_{ij}^{k}\), we update the factor \(Q_{ij}\) of each edge \((i,j)\in \mathcal {E}\) with a temporal-difference rule.
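
A standard edge-factor Q-learning update of this kind (a sketch consistent with the description above, where \(\alpha \) is the learning rate, \(\gamma \) the discount factor, \(r_{ij}\) a reward shared by agents i and j, and \(a_{i}^{*},a_{j}^{*}\) the components of the optimal joint action in the next state; this notation is introduced here for illustration) is:

$$\begin{aligned} Q_{ij}(s_{ij}^{k},a_{i},a_{j}) \leftarrow Q_{ij}(s_{ij}^{k},a_{i},a_{j})+\alpha \left[ r_{ij}+\gamma \, Q_{ij}(s_{ij}^{k+1},a_{i}^{*},a_{j}^{*})-Q_{ij}(s_{ij}^{k},a_{i},a_{j})\right] \end{aligned}$$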

To find the optimal joint action \(\mathbf {a}^{*}\) in (1), we use the variable elimination (VE) algorithm proposed by Guestrin et al. [3].
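
For concreteness, a minimal sketch of max-sum variable elimination over pairwise Q factors is given below. It is an illustration, not the paper's implementation; the function and variable names are ours, and the factors are stored as explicit tables, which assumes small discrete action sets.

```python
import itertools

def variable_elimination(actions, factors, order):
    """Exact joint-action selection by max-sum variable elimination.

    actions : dict mapping each agent to its list of admissible actions
    factors : list of (scope, table) pairs; scope is a tuple of agents and
              table maps an action tuple (aligned with scope) to a value
    order   : elimination order over the agents

    Returns a dict (agent -> action) maximizing the sum of all factors.
    """
    factors = list(factors)
    best_response = {}  # agent -> (scope, conditional best-response table)

    for agent in order:
        involved = [f for f in factors if agent in f[0]]
        factors = [f for f in factors if agent not in f[0]]
        # The new factor is defined over every other agent mentioned by the
        # involved factors (the current neighborhood of `agent`).
        scope = tuple(sorted({v for s, _ in involved for v in s if v != agent}))

        new_table, br_table = {}, {}
        for ctx in itertools.product(*(actions[v] for v in scope)):
            assignment = dict(zip(scope, ctx))
            best_val, best_act = float("-inf"), None
            for a in actions[agent]:
                assignment[agent] = a
                val = sum(tab[tuple(assignment[v] for v in s)]
                          for s, tab in involved)
                if val > best_val:
                    best_val, best_act = val, a
            new_table[ctx], br_table[ctx] = best_val, best_act
        factors.append((scope, new_table))
        best_response[agent] = (scope, br_table)

    # Backward pass: recover the maximizing joint action.
    joint = {}
    for agent in reversed(order):
        scope, br_table = best_response[agent]
        joint[agent] = br_table[tuple(joint[v] for v in scope)]
    return joint


# Toy usage: three signalized junctions in a line (1 -- 2 -- 3), two phases
# each, and edge factors that reward adjacent agents choosing the same phase.
if __name__ == "__main__":
    acts = {i: ["NS", "EW"] for i in (1, 2, 3)}
    q12 = {(a, b): (2.0 if a == b else 0.0) for a in acts[1] for b in acts[2]}
    q23 = {(a, b): (2.0 if a == b else 0.0) for a in acts[2] for b in acts[3]}
    print(variable_elimination(acts, [((1, 2), q12), ((2, 3), q23)], [1, 2, 3]))
```

Eliminating an agent combines all factors that mention it into a new factor over its remaining neighbors, which is why neighborhoods can grow to include nonadjacent agents, as noted in Sect. 4.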

2.2 Best Response from Game Theory - (Q-BR)

We follow the work of El-Tantawy et al. [2], in which each agent participates in a two-player game with its neighborhood \(\varGamma (i)\). Agent i creates and updates a model \(\theta \) that estimates the likelihood of action selection for each neighbor \(j \in \varGamma (i)\).

To find the best joint action \(\mathbf {a}^{*}\), each agent computes its best response, that is, the action that maximizes the Q factors at the level of its neighborhood, regardless of the policies of the other members:

$$\begin{aligned} a_{i}^{*}=\mathop {{{\,\mathrm{argmax}}}}\limits _{a_{i} \in \mathcal {A}_{i}}\sum _{j \in \varGamma (i)}\sum _{a_{j} \in \mathcal {A}_{j}}\theta _{j}(a_{j})\,Q_{ij}(s_{ij}^{k},a_{i},a_{j}) \end{aligned}$$
(2)
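
As a concrete illustration of this computation (a sketch only; the function and argument names are ours, and the Q factors and neighbor models are assumed to be available as plain Python callables and dictionaries), the expected neighborhood value of each candidate action can be evaluated as follows:

```python
def best_response(agent, neighbors, actions, q_factors, theta, state):
    """Best-response action of `agent` given models of its neighbors.

    neighbors : ids in Gamma(agent)
    actions   : dict mapping agent id -> list of admissible actions
    q_factors : dict mapping (i, j) -> callable(state, a_i, a_j) -> float
    theta     : dict mapping j -> {a_j: estimated probability j selects a_j}
    state     : joint state s_ij^k observed by the agent (passed through)
    """
    best_act, best_val = None, float("-inf")
    for a_i in actions[agent]:
        # Expected value of a_i over the neighborhood, under the models theta.
        val = sum(theta[j][a_j] * q_factors[(agent, j)](state, a_i, a_j)
                  for j in neighbors
                  for a_j in actions[j])
        if val > best_val:
            best_act, best_val = a_i, val
    return best_act
```

Each agent can run this computation locally once the models of its neighbors are up to date, which is what keeps the method scalable (Sect. 4).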

2.3 Learning Parameters

The state vector of each agent contains the hour of day, to capture temporal dynamics; the maximum queue length (in vehicles) on each edge; and the queuing delay (in minutes) of the stopped vehicles on each edge. Regarding actions, every agent has two phases. Finally, the reward function encourages short queue lengths and low waiting times for the vehicles along the road:

$$\begin{aligned} r_i=-\sum _{k=1}^{edges}\left( \beta _q(q_k)^{\theta _q} + \beta _w(w_k)^{\theta _w}\right) \quad \forall i \in \mathcal {N} \end{aligned}$$
(3)

Here, edges is the number of approaches of agent i; \(q_k\) and \(w_k\) are the maximum queue length and queuing delay on edge k; \(\beta _q\) and \(\beta _w\) are coefficients that set the relative priority of the two terms; and \(\theta _q\) and \(\theta _w\) balance queue lengths and waiting times across approaches.
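
A direct transcription of Eq. (3) as code is sketched below; the default coefficient values are placeholders, not the tuning used in the experiments.

```python
def reward(queues, delays, beta_q=1.0, beta_w=1.0, theta_q=1.0, theta_w=1.0):
    """Reward of one agent following Eq. (3).

    queues : maximum queue length (vehicles) per approach of the agent
    delays : queuing delay (minutes) of stopped vehicles per approach
    The default coefficients are placeholders, not the paper's values.
    """
    return -sum(beta_q * q ** theta_q + beta_w * w ** theta_w
                for q, w in zip(queues, delays))
```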

3 Demonstration

Both methods were simulated on a road network of Bogotá, shown in Fig. 1, using the SUMO simulator [4] and the TraCI interface. To compare the Q-VE and Q-BR methods, we also implemented independent Q-learning, as proposed by Camponogara et al. [1], and the coordination method proposed by Xu et al. [5].

Fig. 1. Test framework for multiagent traffic control
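
For reference, the interaction with SUMO follows the usual TraCI observe-then-actuate loop, sketched below. The configuration file name, lane and traffic-light IDs, and the placeholder decision rule are illustrative, not those of the Bogotá network.

```python
import traci  # TraCI Python client shipped with SUMO


def placeholder_policy(queue, wait):
    """Stand-in for a learned policy: returns a phase index (0 or 1)."""
    return 0 if queue > 5 else 1


# Illustrative file name and IDs; the real network and agents differ.
traci.start(["sumo", "-c", "bogota.sumocfg"])
try:
    for step in range(3600):
        traci.simulationStep()
        # State features used by the agents: queue lengths and waiting times.
        queue = traci.lane.getLastStepHaltingNumber("approach_lane_0")
        wait = traci.lane.getWaitingTime("approach_lane_0")  # seconds
        # Each agent has two phases; apply the chosen one to its junction.
        traci.trafficlight.setPhase("junction_4", placeholder_policy(queue, wait))
finally:
    traci.close()
```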

Q-VE and Q-BR achieve reductions of at least \(14\%\) and at most \(48\%\) with respect to Fixed Time (FT) control. The largest reductions are obtained with Q-BR. We also note that the Q-BR policy generates green waves along arterials, as shown in Fig. 2.

Fig. 2. Space-time diagram for route 1, Ak7 north to south, under the Q-BR policy. At some intervals, agents 4, 5, and 6 coordinate their actions to generate a platoon.

Figure 3 shows that the reward evolution with independent learning is very similar to that obtained with the coordinated method Q-BR. Nonetheless, the policy learned by Q-BR positively influences other variables that are not included in the reward function.

Fig. 3. Agents' learning curves. Q-VE and Q-BR achieve a better reward than FT control.

4 Conclusions

Distributing the reward function into a contribution per agent simplifies the problem, since the Q factors can be split into dependencies between agents. This structure is represented by the coordination graph, which favors the application of the VE algorithm and yields an exact solution to the joint action selection problem. However, as the algorithm eliminates agents, neighborhoods change and may come to include agents that are not adjacent and thus have no direct communication. The method would then require an estimation of the Q factors for nonadjacent agents.

On the other hand, the coordination strategy based on BR scales well, since the communication between agents is known a priori. However, the policies in the neighborhood are not shared knowledge, so more information must be transmitted to estimate and model the behavior of the neighbors.