1 Introduction

As vehicles grow increasingly widespread, one of the most basic issues is understanding how to navigate a fleet of vehicles to perform a specified job. Moreover, the huge population densities in today's cities put all existing infrastructure, especially urban transportation networks, under strain. With the growth of services like e-commerce and vehicle sharing, the transportation demands of these congested cities have also become more complicated. It is therefore very important to route vehicles in a way that reduces overall cost, time, and congestion. Different methods [1] have been proposed to route vehicles. One classic formulation, in which a single agent is entrusted with determining the shortest path through a set of sites and destinations, is the Travelling Salesperson Problem. The multi-agent version of this problem is called the Vehicle Routing Problem (VRP) [2]. In VRP, multiple agents try to find an optimal set of routes that visits every location exactly once. Although a huge number of solvers exist, they are primarily built to perform planning offline and cannot modify solutions when used online. They are, moreover, often evaluated on simple planar network benchmarks with limited exploration in multi-agent environments. Furthermore, none of these solutions were created for dynamic contexts where online communication may be quite advantageous.

Value Iteration Networks [3] have excellent planning capability and generalize well across a diverse set of tasks. Their purpose is to discover a policy that optimizes expected returns. The value function peaks at the goal, so a high value indicates the destination. In the attention mechanism, only a subset of the input features (the value function) is meaningful for a specific label prediction (action). It is also well known that attention improves learning performance by lowering the effective number of network parameters used during learning. We are given a fleet of vehicles in a multi-agent environment and must determine the minimum total cost of mapping a given graph under traffic conditions, such that every route is traversed at least a defined number of times, where this number is not known in advance. The Vehicle Routing Problem Value Iteration Network (VRP-VIN) is a distributed neural network designed for managing multiple vehicles intended to complete a specific task. Each agent has a value iteration module to carry out its own planning, aided by communication between agents via an attention mechanism. Because our focus is on sparse road graphs, a dense adjacency matrix [4] is used to encode pairwise edge information, which accelerates information sharing and allows for more complex encodings. Using actual traffic flow simulation, we illustrate the usefulness of VRP-VIN on real road maps derived from eighteen different cities around the world. Random sub-graphs of those cities were used to produce training and evaluation examples comprising real-world mapping difficulties, and a random number is then selected that determines how many times each node in each graph must be covered [5, 6]. The fleet is unaware of this number until it is reached. We use the total traversal time as our major evaluation criterion and demonstrate that this technique outperforms both conventional VRP solvers and recently proposed deep learning models. Moreover, VRP-VIN adapts effectively to the graph size and agent count.

The paper is organized as follows: Sect. 2 gives a detailed literature review related to value iteration networks. The proposed model is presented in Sect. 3. An evaluation of the proposed model is given in Sect. 4. Finally, Sect. 5 presents the conclusion and future work.

2 Related Work

In [8], Tamar et al. proposed a neural network incorporating a planning module. Such networks can learn to plan and can anticipate planning-related outcomes, as in reinforcement learning. They are based on a differentiable approximation of the value-iteration algorithm implemented with a CNN. Value iteration is a technique based on the Markov decision process (MDP); the MDP's purpose is to discover a policy that optimizes the expected return. Using the value iteration technique, \(V_n\) (the state value function at iteration n) converges to \(V^*\) (the optimal state value function) as n approaches infinity [6]. The VI module in the VIN exploits the fact that each iteration of VI can be viewed as passing the previous \(V_n\) and the reward function through a convolutional layer and a max-pooling layer. Each channel in the convolution layer corresponds to the Q function for a specific action. As a result, K iterations of VI are equivalent to applying a convolutional layer K times.
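For intuition, the following minimal PyTorch sketch shows the grid-world form of this idea, where a single convolution produces per-action Q-values and a max over channels yields the updated value function; the layer sizes, number of actions, and reward-map shape are illustrative assumptions, not the configuration used in [8].

```python
import torch
import torch.nn as nn

# Minimal sketch of the VI-as-convolution idea from Tamar et al. [8] (grid-world form).
# Layer sizes and action count are illustrative only.
class VIModule(nn.Module):
    def __init__(self, num_actions=8, k_iterations=10):
        super().__init__()
        # Each output channel of the convolution corresponds to one action's Q-values.
        self.q_conv = nn.Conv2d(2, num_actions, kernel_size=3, padding=1, bias=False)
        self.k = k_iterations

    def forward(self, reward_map):               # reward_map: (B, 1, H, W)
        v = torch.zeros_like(reward_map)         # initial value function
        for _ in range(self.k):                  # K VI iterations = K conv + max steps
            q = self.q_conv(torch.cat([reward_map, v], dim=1))
            v, _ = q.max(dim=1, keepdim=True)    # max over action channels
        return v
```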

In [9], Lu et al. suggested a distributed cooperative routing method (DCR) based on evolutionary game theory to coordinate vehicles. This solution combines edge computing and intelligence and runs on roadside units. Under DCR, a Nash equilibrium is achieved: no vehicle can find a path more suitable than the one it holds at the Nash equilibrium [7].

In [10], Tang et al. proposed a multi-agent reinforcement learning model for centralized vehicle routing in order to improve spatial-temporal coverage. Two reinforcement learning algorithms, proximal policy optimization and deep Q-learning, have been used to create routing policies. A centralized routing method based on MARL is proposed for vehicular mobile crowd-sensing systems (VMCS) to expand their sensing range. The authors begin by customizing a reinforcement learning environment to obtain the maximum feasible spatial-temporal coverage based on user preferences for various regions. They design two MARL algorithms based on the Deep Q-Network [14] and Proximal Policy Optimization (PPO) [15], and then perform comparisons and sensitivity analyses to determine how well the two methods work for VMCS problems.

In [11], Niu et al. proposed Multi-Agent Graph-attention Communication (MAGIC), a novel multi-agent reinforcement learning algorithm with a graph-attention communication protocol. It has a Scheduler that addresses the challenge of when and to whom messages should be sent, and a Message Processor that employs Graph Attention Networks (GATs) [12] with dynamic graphs for handling communication signals. The Scheduler is built from a combination of a graph attention encoder and a differentiable attention mechanism [16]; it provides dynamic, differentiable graphs to the Message Processor, allowing the Scheduler and Message Processor to be trained at the same time.

3 System Model

In this section, the solution to the Vehicle Routing Problem with Value Iteration Network is discussed in detail. The Vehicle Routing Problem with Value Iteration Network (VRP-VIN) has two main components:

  • The asynchronous communication module [17] saves messages sent by agents in a temporary unit and retrieves information via an agent-level attention mechanism. This information is passed to the value iteration network for future path planning.

  • The value iteration network operates locally on each agent and repeatedly calculates the value of traveling to each node on its next route. An LSTM [20] planning unit with an attention mechanism repeatedly refines the node features and produces a value function associated with each node. The next destination is selected on the basis of this value function: the node with the highest value becomes the next destination.

Figure 1 represents the flowchart of the proposed VRP-VIN model. The proposed VRP-VIN model is dynamic in nature. To achieve this, VRP-VIN includes a communication module based on the attention mechanism, in which attention is focused on the agents rather than on the street segments as in the VI module. Whenever an agent acts, it outputs a communication vector \(y^{(i)}\), which is subsequently transmitted to every agent using \(Z^{(K)}\), the value iteration module's final encoding. The communication vector [18] is expressed as a set of node attributes in order to capture the topology of the street graph. At the receiver end, each sender's most recent communication vector is stored temporarily. Every agent has an attention layer that aggregates data from the receiver's inbox whenever the agent wants to take a new action, as sketched below.
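A hypothetical sketch of this agent-level attention over the stored inbox messages is shown here; the projection layers, feature dimensions, and scoring scheme are illustrative assumptions rather than the exact architecture of VRP-VIN.

```python
import torch
import torch.nn as nn

# Illustrative agent-level attention over inbox messages (dimensions assumed).
class InboxAttention(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, own_state, inbox):
        # own_state: (n, dim) node features of the acting agent
        # inbox:     (A, n, dim) latest communication vectors, one per sender
        q = self.q_proj(own_state)                                        # (n, dim)
        k = self.k_proj(inbox)                                            # (A, n, dim)
        v = self.v_proj(inbox)                                            # (A, n, dim)
        # attention weight per sender from node-wise similarity
        scores = torch.einsum('nd,and->an', q, k) / q.shape[-1] ** 0.5    # (A, n)
        weights = scores.softmax(dim=0)                                   # normalize over senders
        return (weights.unsqueeze(-1) * v).sum(dim=0)                     # (n, dim) aggregated message
```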

Fig. 1. Flowchart of the proposed VRP-VIN model

The routing problem [19] is defined on a strongly connected graph (G) with edges (E) and vertices (V), where we want to generate routing paths for L agents \({\{R^{(i)}\}}^L_{i=1}\) such that each vertex v in the graph is traversed \(D_v\) times across all agents. \(D_v\) is unknown to the agents until that number of visits is reached, and local traffic information [13] is the only thing that can be observed. Each agent collects observations of the surrounding environment and information gathered from other agents and then outputs the next step of its route.

Symbol        Representation
t             Current timestamp
G             Graph of the road map
L             Number of agents
n             Number of graph nodes
f             Routing and communication policy
\(\pi \)      Routing policy
F             Time cost given a route R
o             Agent i's observation at time t
s             Agent i's state at time t
a             Agent i's action at time t
m             Message vector sent by agent i at time t
\(D_v\)       Number of times node v needs to be visited
\(Z_i\)       Agent i's node features at the kth value iteration
\(Y_i\)       Agent i's input communication features

A route is defined as a sequence of actions \(R^{(i)} = [s_0^{i}, \ldots , s_N^{i}]\), where \(s_t^{i}\) represents the routing step taken by agent i at time t, indicating the next node to traverse; each step represents an intermediate destination [20]. The strategy of a single agent can be described as a function of the road-network graph G; the surrounding environment observation \(b_t^{i}\); the communication messages \(y_{t-1}^{j}\) sent by the other agents; and the current status of the agent \(j_t^{i}\). This is expressed in Eq. 1:

$$\begin{aligned} s_t^i, y_t^{i} = f(G, b_t^i,\{y_{t-1}^j\}_{j=1}^L; j_t^i), \end{aligned}$$
(1)

Given a traffic model D that determines how long it takes to travel a route, we want our system to accomplish the following goal:

$$\begin{aligned} \min _{\{R^{(i)}\}} \sum _{i=1}^{L} D(R^{(i)}), \end{aligned}$$
(2)

subject to

$$\begin{aligned} \sum _i T(R^{(i)},v) \ge D_v, \quad \forall v \end{aligned}$$
(3)

where \(T(R, v)\) denotes how many times node v is visited in route R.
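As an illustration of the objective in Eq. 2 and the constraint in Eq. 3, the short Python sketch below computes the total cost of a set of routes and checks whether the required visit counts are satisfied; `travel_time`, standing in for the traffic model D, is a hypothetical helper.

```python
from collections import Counter

# Illustrative check of the objective (Eq. 2) and the coverage constraint (Eq. 3).
def total_cost(routes, travel_time):
    # routes: list of node sequences, one per agent; travel_time plays the role of D
    return sum(travel_time(route) for route in routes)

def coverage_satisfied(routes, required_visits):
    # required_visits: dict node -> D_v (number of visits required for node v)
    visits = Counter(node for route in routes for node in route)
    return all(visits[v] >= d for v, d in required_visits.items())
```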

It is worth noting that the model is resilient and failure-proof as the model runs locally on all the agents, which allows it to scale better with the number of agents.

The road network [21] is represented by a strongly connected graph G(V, E). Each graph node represents a street segment, and each agent's goal is to choose a node as its next destination. The initial node features are refined by passing them through a graph neural network [22] for a fixed number of iterations. These features are then turned into a value function, and the next destination is the node with the highest value. Let \(Z = (z_1, z_2, ..., z_n)\) be the vector of initial node features, where n denotes the number of nodes, and let \(Y = (y_1, y_2, ..., y_n)\) denote the node features of the input communication. A linear layer encodes the node input features to produce the initial features for the value iteration module [2]:

$$\begin{aligned} Z^{0} = (Z || Y)W_{enc} + b_{enc} \end{aligned}$$
(4)
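A minimal PyTorch rendering of Eq. 4 is given below; the 16-dimensional feature size and the number of nodes are assumptions used only for illustration.

```python
import torch
import torch.nn as nn

# Eq. (4): concatenate node features Z with communication features Y
# and project them with a single linear layer (W_enc, b_enc).
enc = nn.Linear(2 * 16, 16)                         # feature dimension 16 is assumed
Z, Y = torch.randn(25, 16), torch.randn(25, 16)     # n = 25 nodes, for illustration
Z0 = enc(torch.cat([Z, Y], dim=-1))                 # Z^(0), shape (n, 16)
```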

We conduct the following iterative update across neighboring nodes at each planning iteration k using an attention LSTM:

$$\begin{aligned} Z^{(k+1)} = Z^{(k)} + LSTM(Att(Z^{(k)}, A); H^{(k)}) \end{aligned}$$
(5)

where K denotes the number of value iteration steps, \(H^{(k)}\) the LSTM hidden state, and A the adjacency matrix. Rather than feeding the raw adjacency matrix into the network, the Floyd-Warshall algorithm [23] is employed to compute a dense distance matrix, which is then used as the input to this model. This ensures that the model uses more useful information. The Floyd-Warshall algorithm generates a matrix \(D_{i,j} = d(v_i, v_j)\) containing the pairwise shortest-path distance between any two nodes. This matrix is then normalized to create a dense adjacency matrix \(A = (D -\mu )/\sigma \), where \(\mu \) is the mean of the elements of D and \(\sigma \) is their standard deviation.
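The dense input matrix can be sketched as follows in NumPy; the relaxation and the normalization follow the description above, while the representation of the edge-weight matrix is an assumption.

```python
import numpy as np

# Build the dense adjacency input: all-pairs shortest paths via Floyd-Warshall,
# then normalize as A = (D - mean) / std.
def dense_adjacency(weights):
    # weights: (n, n) edge travel times, np.inf where no edge, 0 on the diagonal
    d = weights.copy()
    n = d.shape[0]
    for k in range(n):                                   # Floyd-Warshall relaxation
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return (d - d.mean()) / d.std()                      # normalized dense matrix A
```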

The graph attention layer (GAL) [24] is responsible for the exchange of information within a graph. The attention module used in VRP-VIN is a transformer layer that receives the adjacency matrix and node features and outputs modified features. First, key, query, and value features are computed for each node.

$$\begin{aligned} Q^{(k)} = Z^{(k)}W_q + b_q,\end{aligned}$$
(6)
$$\begin{aligned} K^{(k)} = Z^{(k)}W_k + b_k,\end{aligned}$$
(7)
$$\begin{aligned} V^{(k)} = Z^{(k)}W_v + b_v \end{aligned}$$
(8)

The key, query, and value are each calculated by multiplying the node feature vector by a weight matrix. We then form an attention matrix \(A_{att}\) by computing attention between every pair of nodes.

$$\begin{aligned} A_{att} = Q^{(k)}.K^{(k)T} \end{aligned}$$
(9)

To incorporate edge features, we mix the adjacency matrix A and the attention matrix \(A_{att}\) in a multi-layer neural network g as shown below.

$$\begin{aligned} A(k) = softmax(g(A_{att}^{(k)}, A)) \end{aligned}$$
(10)

The new node values are calculated by combining the values produced by the other nodes, weighted according to the merged attention matrix. The output of the GALs is fed into the LSTM module.

$$\begin{aligned} Z^{(k+1)} = Z^{(k)} + LSTM(A^{(k)}V^{(k)}; H^{(k)}) \end{aligned}$$
(11)

Before decoding, the entire procedure is repeated for a fixed number of iterations, \(k = 1, \ldots , K\).
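The planning loop of Eqs. 5-11 can be summarized in the following PyTorch sketch; the layer sizes, the form of the mixing network g, and the use of an LSTMCell over nodes are assumptions made for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

# Illustrative attention-LSTM planning module (Eqs. 5-11); sizes are assumed.
class GraphAttentionVI(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # g: small MLP mixing the attention matrix with the dense adjacency matrix
        self.g = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
        self.cell = nn.LSTMCell(dim, dim)

    def step(self, z, adj, hidden):
        q, k, v = self.q(z), self.k(z), self.v(z)                 # Eqs. (6)-(8)
        a_att = q @ k.t()                                         # Eq. (9)
        mixed = self.g(torch.stack([a_att, adj], dim=-1)).squeeze(-1)
        a = mixed.softmax(dim=-1)                                 # Eq. (10)
        h, c = self.cell(a @ v, hidden)                           # Eq. (11): LSTM update
        return z + h, (h, c)                                      # residual update of Z^(k)

    def forward(self, z, adj, k_iters=5):
        n, d = z.shape
        hidden = (z.new_zeros(n, d), z.new_zeros(n, d))
        for _ in range(k_iters):                                  # k = 1 ... K
            z, hidden = self.step(z, adj, hidden)
        return z                                                  # Z^(K)
```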

After iterating the attention LSTM module for K iterations, each node feature is translated into a scalar value function on the graph. A softmax function is then applied across the remaining nodes to derive action probabilities, masking out the values of any nodes that no longer need to be visited because they are fully traversed.

$$\begin{aligned} \pi (s_i ; j_i) = softmax(Z^{K}W_{dec} + b_{dec}) \end{aligned}$$
(12)

The node with the highest probability is then chosen as the next destination. With the help of a shortest-path algorithm, a full path is constructed by linking the current node with the node chosen as the next destination. The graph weights are computed by dividing the length of a road segment by the average speed of the cars driving on it, which gives the expected time to traverse that road segment.
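A short sketch of the decoding step in Eq. 12 follows; the feature dimension and the masking mechanics are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Eq. (12): decode node features to a scalar value per node, mask fully traversed
# nodes, and pick the highest-probability node as the next destination.
dec = nn.Linear(16, 1)                               # W_dec, b_dec (dimension assumed)

def next_destination(z_final, remaining_visits):
    # z_final: (n, 16) node features after K iterations
    # remaining_visits: (n,) tensor, 0 where a node is already fully traversed
    values = dec(z_final).squeeze(-1)                # scalar value per node
    values = values.masked_fill(remaining_visits == 0, float('-inf'))
    probs = values.softmax(dim=-1)                   # action probabilities
    return probs.argmax().item()                     # index of the next destination
```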

There are 22,814 directed road graphs in the collection, gathered from 18 cities on six continents. For testing, we pick a different location, and for validation, we use 10% of the training set. We also include actual traffic situations and mapping issues in this benchmark. The extra problems fall into three groups: random revisits, realistic traffic, and asynchronous execution (Figs. 2, 3 and 4).

4 Evaluation

A 3-layer MLP with 16 dimensions per layer and ReLU activations is used to combine the dot-product attention with the distance matrix. The encoding vectors are 16-dimensional. The Adam optimizer is used with a learning rate of 1e-3 and a decay rate of 0.1 per 2000 epochs, and the model is trained for 5000 epochs. Each batch contains 50 graphs with a maximum of 25 nodes each. We train our network with only two agents and evaluate it with one to nine agents [25].

LKH3 is the best-performing iterative solver with available data. First, the solver chooses the best route for covering all nodes exactly once. After that, it determines a new optimal route across the nodes that still need to be traversed. This is repeated until every node has been completely mapped; in effect, the solver performs VRP traversals until the required number of visits has been reached. The oracle baseline represents the best possible performance obtainable if an agent were given global knowledge of every hidden state; its solution is found by giving the LKH3 solver information about every hidden variable and performing an optimal plan search, altering the adjacency matrix by duplicating nodes and raising the edge weights affected by traffic congestion.

GATs [12] exchange complex information between the nodes based on the attention mechanism. However, standard GAT architectures presume that all edges have equal weight rather than encoding the distance-matrix information, which restricts their potential. Although GATs are not specifically designed to solve TSP or VRP problems, they remain among the most cutting-edge options for graph and network encoding (Tables 1 and 2).
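The training configuration reported above can be sketched as follows; `GraphAttentionVI` refers to the illustrative module sketched in Sect. 3, and `train_one_epoch` is a hypothetical placeholder for the imitation/reinforcement learning update.

```python
import torch

# Training setup matching the reported hyperparameters:
# Adam, lr 1e-3, decay 0.1 every 2000 epochs, 5000 epochs, batches of 50 graphs.
model = GraphAttentionVI(dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2000, gamma=0.1)

for epoch in range(5000):
    train_one_epoch(model, optimizer, batch_size=50)   # 50 graphs, up to 25 nodes each
    scheduler.step()
```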

Table 1. Total cost (runtime in hrs)
Table 2. Average traversal cost on real graphs; Time cost (hrs); Runtime (ms)
Fig. 2. Total cost graph

Fig. 3. Cost per agent (runtime in hrs)

Fig. 4. Oracle vs VRP-VIN (runtime in hrs)

VRP-VIN performs best over a range of agent counts and graph sizes. Notably, on graphs with 25 nodes and two agents, the technique trained with reinforcement learning achieves a total cost within roughly 3% of the oracle. We found that the model combining imitation and reinforcement learning outperformed all rival models. The model's generalization to more agents and larger graph sizes is outstanding. Each traversal's cost is distributed across the agents fairly equally. The method performs significantly better than the current state of the art, the LKH solver. The overall cost increases only marginally when the number of agents is increased, demonstrating high scalability. When dealing with more agents, the models trained with reinforcement learning generalize particularly well. Increasing the number of value iterations further improves performance.

5 Conclusion

The proposed VRP-VIN model can route multiple vehicles online in a real-world environment with dynamic obstacles. By leveraging the learned value iteration transitions and a communication protocol based on an attention mechanism, the model beats all current approaches on real road graphs. It can also be scaled up or down to different numbers of agents and nodes without retraining. Communication is a key component in multi-agent systems learning coordinated behavior, so future studies will involve a more in-depth examination of the information stored in the communication and its semantic value. There will also be further research into approaches that allow this system to operate on very large graphs.