1 Introduction

Reinforcement Learning (RL) is often considered a general formalization of decision-making tasks and a subfield of machine learning. In RL, agents learn not from sample data, as in supervised and unsupervised learning, but from the experience gained by interacting with the environment. With the success of deep neural networks (DNN), reinforcement learning algorithms have been combined with them to form deep reinforcement learning (DRL) methods that can solve complex real-world problems. The pioneering model is the Deep Q-Network, which was able to play Atari console games without adjusting the network architecture or hyperparameters. Deep reinforcement learning methods have been extensively researched and significantly improved since then.

Most successful DRL methods have so far been in single-agent domains, and extending DRL to multi-agent settings is indispensable. However, deep reinforcement learning for multi-agent settings is fundamentally more difficult than the single-agent case because of multi-agent pathologies such as the curse of dimensionality and multi-agent credit assignment. Despite this complexity, there has been a great deal of work in general control, robotic systems (Gu et al. 2017; Kurek and Jakowski 2016), human-machine games (Fu et al. 2019; Lanctot et al. 2017; Leibo et al. 2017), autonomous driving (Shalev-Shwartz et al. 2016), Internet advertising (Jin et al. 2018), and resource utilization (Xi et al. 2018; Perolat et al. 2017).

This paper systematically summarizes several research directions in the field of multi-agent deep reinforcement learning (MDRL), including scalability, non-stationarity, partial observability, communication learning, coordinated exploration, and agent modeling. The rest of this paper is structured as follows: the basic theory of multi-agent reinforcement learning is reviewed in Sect. 2; the latest work on deep reinforcement learning for single-agent and multi-agent settings is summarized in Sect. 3; the applications and prospects of multi-agent deep reinforcement learning are discussed in Sect. 4; the conclusion and future directions are given in Sect. 5.

2 Background

2.1 Single-agent reinforcement learning

In single-agent reinforcement learning, the agent learns through interaction with the dynamic environment by a trial and error procedure as shown in Fig. 1. The goal of the agent is to learn an optimal policy by maximizing the expected value of the cumulative sum of rewards.

Fig. 1 A single agent interacting with its environment

The basic framework of reinforcement learning is the Markov decision process (MDP), a stochastic process represented by the tuple \(\left( {S,A,R,P} \right)\):

  1. \(S\) is the state space; \(s_{t} \in S\) is the state the agent is in at timestep t.
  2. \(A\) is the action space; \(a_{t} \in A\) is the action taken by the agent at timestep t.
  3. \(R\) is the reward; \(r_{t} \sim \rho \left( {s_{t} ,a_{t} } \right)\) is the immediate reward received by the agent for performing action \(a_{t}\) in state \(s_{t}\).
  4. \(P\) is the state transition probability; \(p[s_{t + 1} |s_{t} ,a_{t} ]\) is the probability that the agent transfers from state \(s_{t}\) to the next state \(s_{t + 1}\) after taking action \(a_{t}\).

In RL, the agent in state \(s_{t}\) selects action \(a_{t}\) according to policy \(\pi\), transfers to the next state \(s_{t + 1}\) with probability \(p[s_{t + 1} |s_{t} ,a_{t} ]\), and receives the reward \(r_{t}\) from the environment. The reward at each timestep is multiplied by a discount factor \(\gamma\) that determines how much weight is given to immediate versus future rewards, so the cumulative reward from timestep t to the final timestep T is \(R_{t} = \sum\nolimits_{t^{\prime} = t}^{T} \gamma^{{t^{\prime} - t}} r_{t^{\prime}}\). The action-value function \(Q_{\pi } \left( {s,a} \right)\) is the expected cumulative reward obtained by taking action \(a\) in the current state \(s\) and then following policy \(\pi\) until the episode ends:

$$Q_{\pi } \left( {s,a} \right) = E_{\pi } \left[ {R_{t} |s_{t} = s,a_{t} = a} \right]$$
(1)
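As a small, hypothetical illustration of the discounted return \(R_{t}\) used in Eq. (1) (the reward values and discount factor below are our own toy numbers, not from the paper), the following Python snippet computes it for a finite list of rewards:

```python
def discounted_return(rewards, gamma, t=0):
    """Compute R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for a finite episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# Example: a four-step episode with gamma = 0.9.
print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1.0 + 0.9**3 * 10 = 8.29
```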

If the expected return of policy \(\pi^{*}\) is greater than or equal to that of every other policy in all states, then \(\pi^{*}\) is called an optimal policy. There may be more than one optimal policy, but they all share the same optimal action-value function:

$$Q_{*} \left( {s,a} \right) = \max_{\pi } E_{\pi } \left[ {R_{t} |S_{t} = s,A_{t} = a} \right]$$
(2)

The function is called the optimal action-value function, and it follows the Bellman optimality equation:

$$Q_{*} \left( {s,a} \right) = E_{s^{\prime} \sim P} \left[ {\left. {r + \gamma \max_{a^{\prime}} Q_{*}\left( {s^{\prime},a^{\prime}} \right)} \right|s,a} \right]$$
(3)

Linear function approximators are often used to approximate the action-value function, i.e. \(Q(s,a|\theta ) \approx Q_{*}(s,a)\). Deep neural networks and other nonlinear function approximators can also be used to approximate action-value functions or policies.
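As a minimal sketch of this function-approximation idea (the one-hot feature map and problem sizes below are illustrative assumptions, not from the paper), a semi-gradient Q-learning update toward the Bellman optimality target of Eq. (3) can be written as:

```python
import numpy as np

def q_value(theta, phi, s, a):
    """Linear approximation Q(s, a | theta) = theta^T phi(s, a)."""
    return theta @ phi(s, a)

def q_learning_step(theta, phi, transition, n_actions, gamma=0.99, alpha=0.01):
    """One semi-gradient update toward the Bellman optimality target of Eq. (3)."""
    s, a, r, s_next, done = transition
    target = r
    if not done:
        target += gamma * max(q_value(theta, phi, s_next, a2) for a2 in range(n_actions))
    td_error = target - q_value(theta, phi, s, a)
    return theta + alpha * td_error * phi(s, a)  # gradient of theta^T phi(s, a) is phi(s, a)

# Hypothetical one-hot features for a tiny discrete problem (4 states, 2 actions).
N_S, N_A = 4, 2
def phi(s, a):
    f = np.zeros(N_S * N_A)
    f[s * N_A + a] = 1.0
    return f

theta = np.zeros(N_S * N_A)
theta = q_learning_step(theta, phi, (0, 1, 1.0, 2, False), n_actions=N_A)
```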

2.2 Multi-agent reinforcement learning

The framework of multi-agent reinforcement learning is a stochastic game, a generalization of the Markov decision process represented by the tuple \(\left( S,A_{1} , \ldots ,A_{n} ,R_{1} , \ldots ,R_{n} ,P \right)\), where n is the number of agents, \(A = A_{1} \times \cdots \times A_{n}\) is the joint action space of all agents, \(R_{i} :S \times A \times S \to R\) is the reward function of agent i, and \(P:S \times A \times S \to \left[ {0,1} \right]\) is the state transition function; the reward functions are assumed to be bounded (Littman 1994) (Fig. 2).

Fig. 2 Multiple agents interacting with the environment

In multi-agent settings, state transitions are the result of all agents acting together, so each agent's rewards depend on the joint policy \(H:S \times A \to \left[ {0,1} \right]\), and the corresponding expected reward for agent i is

$$R_{i}^{H} = E[R_{t + 1} |S_{t} = s,A_{t,i} = a,H]$$
(4)

The Bellman equations are

$$V_{i}^{H} \left( s \right) = E_{i}^{H} \left[ {R_{t + 1} + \gamma V_{i}^{H} \left( {S_{t + 1} } \right)|S_{t} = s} \right]$$
(5)
$$Q_{i}^{H} \left( {s,a} \right) = E_{i}^{H} \left[ {R_{t + 1} + \gamma Q_{i}^{H} \left( {S_{t + 1} ,A_{t + 1} } \right)|S_{t} = s,A_{t} = a} \right]$$
(6)

Stochastic games can be divided into fully cooperative games, fully competitive games, and mixed games. In a fully cooperative stochastic game, the reward function is the same for all agents, \(R_{1} = R_{2} = \cdots = R_{n}\), so the rewards are identical and the goal of the agents is to maximize the common reward. In a fully competitive stochastic game, for example with \(n = 2\) and \(R_{1} = -R_{2}\), the two agents have opposite goals. In mixed games, agents' rewards are generally different but correlated.

It is a challenge to specify good general goals for the agents. Reviewing the previous literature, the definition of learning goals can mainly be summarized in two aspects: stability and adaptability. Stability means that agents converge to a stationary policy. Adaptability ensures that the performance of an agent does not decrease as the other agents change their policies (Buşoniu et al. 2010).

2.2.1 Fully cooperative game

In a fully cooperative stochastic game, agents have the same reward function and the learning goal is to maximize the common discounted return. If a centralized structure is available, the goal can be achieved by learning the optimal joint-action values, and Q-learning over joint actions is a common way to learn them:

$$Q_{t + 1} \left( {s_{t} ,a_{t} } \right) = Q_{t} \left( {s_{t} ,a_{t} } \right) + \alpha \left[ {r_{t + 1} + \gamma \mathop {\max }\limits_{{a^{\prime}}} Q_{t} \left( {s_{t + 1} ,a^{\prime}} \right) - Q_{t} \left( {s_{t} ,a_{t} } \right)} \right]$$
(7)

Each agent uses a greedy policy to maximize common rewards:

$$h_{i} \left( s \right) = \mathop {\arg \max }\limits_{{a_{i} }} \mathop {\max }\limits_{{a_{1} , \ldots ,a_{i - 1} ,a_{i + 1} , \ldots ,a_{n} }} Q^{*} \left( {s,a} \right)$$
(8)

It is necessary to consider coordination between agents. The Team-Q algorithm (Littman 2001) avoids the coordination problem by assuming that the optimal joint action is unique. The Distributed-Q algorithm (Lauer and Riedmiller 2000) solves it with a computational complexity similar to that of single-agent Q-learning, but it is only applicable to deterministic problems with non-negative reward functions.
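A minimal sketch of the centralized view above (the toy problem sizes are assumptions for illustration): a single Q-table over joint actions is updated as in Eq. (7), and each agent extracts its component of the greedy joint action as in Eq. (8).

```python
import itertools
import numpy as np

N_STATES, N_AGENTS, N_ACTIONS = 5, 2, 3
Q = np.zeros((N_STATES,) + (N_ACTIONS,) * N_AGENTS)          # Q(s, a_1, ..., a_n)
joint_actions = list(itertools.product(range(N_ACTIONS), repeat=N_AGENTS))

def update(Q, s, joint_a, r, s_next, alpha=0.1, gamma=0.95):
    """Eq. (7): Q(s, a) += alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]."""
    best_next = max(Q[s_next][a] for a in joint_actions)
    Q[s][joint_a] += alpha * (r + gamma * best_next - Q[s][joint_a])

def greedy_action(Q, s, i):
    """Eq. (8): agent i plays its component of the best joint action in state s."""
    best = max(joint_actions, key=lambda a: Q[s][a])
    return best[i]

update(Q, s=0, joint_a=(1, 2), r=1.0, s_next=3)
print(greedy_action(Q, s=0, i=0))
```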

2.2.2 Fully competitive game

In a fully competitive stochastic game (for two agents, \(R_{1} = -R_{2}\)), the agents have opposite goals, and the minimax principle can be applied: each agent maximizes its own return under the assumption that the opponent always acts to minimize it. The minimax-Q algorithm (Littman 1994) computes the stage-wise policy from the minimax return:

$$h_{1,t} \left( {s_{t} , \cdot } \right) = \arg m_{1} \left( {Q_{t} ,s_{t} } \right)$$
(9)

and updates the Q-function over joint actions as:

$$Q_{t + 1} \left( {s_{t} ,a_{1,t} ,a_{2,t} } \right) = Q_{t} \left( {s_{t} ,a_{1,t} ,a_{2,t} } \right) + \alpha \left[ {r_{t + 1} + \gamma m_{1} \left( {Q_{t} ,s_{t + 1} } \right) - Q_{t} \left( {s_{t} ,a_{1,t} ,a_{2,t} } \right)} \right]$$
(10)


where \(m_{1}\) is the minimax return of agent 1:

$$m_{1} \left( {Q,s} \right) = \mathop {\max }\limits_{{h_{{1\left( {s, \cdot } \right)}} }} \mathop {\min }\limits_{{a_{2} }} \mathop \sum \limits_{{a_{1} }} h_{1} (s,a_{1} )Q\left( {s,a_{1} ,a_{2} } \right)$$
(11)

where \(h_{{1\left( {s, \cdot } \right)}}\) denotes the policy of agent 1 in state s, and the dot stands for the action argument.
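Since Eq. (11) maximizes over mixed (stochastic) policies, it can be computed with a small linear program. The sketch below (using SciPy; the matching-pennies payoff matrix is an illustrative assumption) returns the minimax value and the corresponding mixed policy for one state:

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(Q_s):
    """Compute m_1(Q, s) of Eq. (11), where Q_s[a1, a2] = Q(s, a1, a2).

    Maximize v subject to sum_a1 h(a1) * Q(s, a1, a2) >= v for every a2,
    with h a probability distribution over agent 1's actions.
    """
    n1, n2 = Q_s.shape
    c = np.zeros(n1 + 1)
    c[-1] = -1.0                                     # linprog minimizes, so minimize -v
    A_ub = np.hstack([-Q_s.T, np.ones((n2, 1))])     # v - h^T Q[:, a2] <= 0 for each a2
    b_ub = np.zeros(n2)
    A_eq = np.hstack([np.ones((1, n1)), np.zeros((1, 1))])   # h sums to one
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n1 + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n1]

# Matching pennies: the minimax value is 0 with the uniform mixed policy.
value, policy = minimax_value(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(value, policy)  # ~0.0, [0.5, 0.5]
```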

2.2.3 Mixed game

In mixed stochastic games, the methods are mainly divided into agent-independent methods and agent-aware methods. Agent-independent methods generally adopt a common structure based on Q-learning, in which policies and state values are computed with a game-theoretic solver of the stage game. A representative agent-independent approach is Nash Q-learning (Hu and Wellman 2003); Correlated Q-learning (CE-Q) (Greenwald et al. 2003) and Asymmetric Q-learning (Kononen 2004) instead solve the equilibrium problem using correlated and Stackelberg (leader–follower) equilibria respectively. Agent-aware methods typically take convergence into account; a representative algorithm is Win-or-Learn-Fast Policy Hill-Climbing (WoLF-PHC) (Xi et al. 2015). Many methods for mixed stochastic games suffer from scalability challenges and are sensitive to partial observability, the latter especially for agent-independent algorithms.

3 Multi-agent deep reinforcement learning

3.1 Deep reinforcement learning

Deep reinforcement learning algorithms can be roughly divided into two categories: value-based and policy-based. Value-based methods construct the optimal policy by approximating the optimal action-value function \(Q_{*}\left({s},{a}\right)\) in the spirit of dynamic programming; in DRL the Q-function is represented with a deep neural network. Policy-based algorithms directly optimize the policy \(\pi^{*}\) without additional information about the MDP, using gradient estimates with respect to the policy parameters.

3.1.1 Deep Q-network

Deep Q-Network (DQN), proposed by Mnih et al. (2013), is a representative value-based method. Concisely, the DQN structure leverages a deep neural network to directly extract a representation of the input state from the environment and outputs the Q-values of all possible actions. DQN can therefore be considered a value network parameterized by \(\beta\), trained continually to approximate the optimal policy. Mathematically, DQN uses the Bellman equation mentioned above to minimize the loss function \({\mathcal{L}}\left( \beta \right)\)

$${\mathcal{L}}\left( \beta \right) = E\left[ {\left( {r + \gamma \max_{a^{\prime}} Q\left( {s^{\prime},a^{\prime}|\beta } \right) - Q\left( {s,a|\beta } \right)} \right)^{2} } \right]$$
(12)

The drawback of using a neural network to approximate the value function is instability: training may diverge because of the bias introduced by correlated samples. Mnih et al. introduced a target network parameterized by \(\beta^{\prime}\) to decorrelate the targets; the target network is updated from the estimation network every N steps. In addition, generated samples are stored in an experience replay memory and retrieved from it at random. Equation (12) can therefore be rewritten as:

$${\mathcal{L}}\left( \beta \right) = E\left[ {\left( {r + \gamma \max Q\left( {s^{\prime},a^{\prime}|\beta^{\prime}} \right) - Q\left( {s,a|\beta } \right)} \right)^{2} } \right]$$
(13)
$$\beta^{\prime} \leftarrow \beta \quad \text{every } N \text{ steps}$$
(14)
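A minimal PyTorch sketch of this training scheme follows (network sizes, the toy replay buffer, and the batching details are our assumptions, not the original implementation): the online network is regressed onto the target of Eq. (13), and the target network is synchronized as in Eq. (14).

```python
import random
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = []   # filled elsewhere with (s, a, r, s_next, done) tuples

def dqn_update(batch_size=32, gamma=0.99):
    batch = random.sample(replay_buffer, batch_size)          # decorrelated minibatch
    s, a, r, s_next, done = map(torch.tensor, zip(*batch))
    s, s_next, r = s.float(), s_next.float(), r.float()
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                     # target uses frozen beta'
        target = r + gamma * (1 - done.float()) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)                  # Eq. (13)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    target_net.load_state_dict(q_net.state_dict())            # Eq. (14): beta' <- beta
```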

On the basis of DQN, a variety of deep reinforcement learning algorithms were proposed. Double Deep Q-Network (DDQN) was proposed by Hasselt et al.; by applying Double Q-learning to DQN, it separates action selection from policy evaluation and reduces the risk of overestimating Q-values. Dueling Deep Q-Network was put forward by Wang et al.; the model splits the abstract features extracted by the CNN into two branches, one representing the state value function and the other the advantage function. Through this dueling network structure, agents can identify the correct behaviors faster during policy evaluation and the network architecture is better integrated (Wang et al. 2015). Schaul et al. developed a framework for prioritized experience replay and applied it to DDQN: uniform sampling is replaced with priority-based sampling so that important transitions are replayed more frequently, which speeds up learning of the optimal policy (Schaul et al. 2015). Lakshminarayanan et al. proposed the Dynamic Frame Skip Deep Q-Network (DFDQN), which uses a dynamic frame skip instead of the fixed action repeat of DQN; experiments show that DFDQN achieves better performance on some Atari 2600 games (Lakshminarayanan et al. 2016). François-Lavet et al. used an adaptive discount factor and learning rate in DQN to accelerate the convergence of the deep network (Francois-Lavet et al. 2015). Fortunato et al. (2017) proposed adding noise to the network parameters instead of using \(\epsilon\)-greedy exploration to increase the exploration ability of the model. The success of DQN encouraged full-scale research on value-based methods that studies the various shortcomings of DQN and develops auxiliary extensions. Hessel et al. proposed Rainbow DQN (Hessel et al. 2017), which unites six Q-learning-based extensions in one agent to examine whether these merged extensions are essential for RL algorithms.

3.1.2 Deep deterministic policy gradient

The actor-critic algorithm is a widely used policy-based method. The structure consists of two networks: a policy network called the actor and a value network called the critic. The actor takes the state as input and outputs an action, while the critic takes the state and action as input and outputs a Q-value (Zhao et al. 2019, 2018; Ding et al. 2019). Lillicrap et al. proposed the Deep Deterministic Policy Gradient (DDPG) method based on the actor-critic structure, which tackles continuous action spaces in DRL. Experiments show that DDPG is not only stable on a series of continuous-action tasks, but also requires far fewer time steps than DQN to obtain the optimal policy. Compared with value-based DRL methods, the deep deterministic policy gradient method based on the actor-critic framework has higher optimization efficiency and faster convergence (Lillicrap et al. 2016; Silver et al. 2014). Various algorithms have been derived from the actor-critic model, such as the asynchronous advantage actor-critic (A3C) algorithm (Mnih et al. 2016) and the distributed proximal policy optimization (DPPO) algorithm (Schulman et al. 2017; Heess et al. 2017). A3C, proposed by Mnih et al. in 2016, uses multiple threads to collect data in parallel, and each thread is an independent agent exploring its own copy of the environment. Each agent can use a different exploration policy, so the samples obtained by the threads are naturally decorrelated and sampling is faster. Fujimoto et al. (2018) proposed the Twin Delayed DDPG (TD3) algorithm, which tackles overestimation in the actor-critic structure by taking the minimum over a pair of critics. Haarnoja et al. proposed the soft actor-critic method based on the maximum-entropy RL framework, in which the actor maximizes expected reward while also maximizing entropy (Haarnoja et al. 2018).
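The following is a minimal single-agent DDPG-style sketch in PyTorch (dimensions, network sizes, and hyperparameters are assumptions for illustration): the critic is regressed on a bootstrapped target computed with target networks, and the actor follows the deterministic policy gradient by ascending the critic.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_tgt, critic_tgt = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, done, gamma=0.99, tau=0.005):
    """s, a, r, s_next, done: batched float tensors sampled from a replay buffer."""
    # Critic: regress Q(s, a) toward r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        q_next = critic_tgt(torch.cat([s_next, actor_tgt(s_next)], dim=1)).squeeze(1)
        target = r + gamma * (1 - done) * q_next
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, maximize Q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (Polyak) update of the target networks.
    for net, tgt in ((actor, actor_tgt), (critic, critic_tgt)):
        for p, p_tgt in zip(net.parameters(), tgt.parameters()):
            p_tgt.data.mul_(1 - tau).add_(tau * p.data)
```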

3.2 MDRL research progress: challenges and structures

Many MDRL methods suffer from scalability issues and are sensitive to partial observability (partial observability means that agents do not know the complete state of the environment when they interact with it). Several training structures are used in multi-agent settings. The first is decentralized training and decentralized execution (DTDE), in which each agent is trained independently of the others. This structure can handle the scalability problems caused by the growth in the number of agents, but other problems emerge, including environment non-stationarity, reward distribution, and the sensitivity of independent agents to partial observability. The paradigm of centralized training and centralized execution (CTCE) can tackle these problems by modeling the agents together and learning a joint policy; the disadvantage of this centralized structure is its huge input and output space, since the dimension of the output joint policy grows exponentially with the number of agents. The other main structure is centralized training with decentralized execution (CTDE), which addresses partial observability while avoiding the huge input and output spaces caused by centralized execution.

3.2.1 Scalability and DTDE

Scalability is one of the core issues in MDRL. It mainly refers to the extension from single-agent to multi-agent environments, including the growth of the state and action dimensions and of the number of agents.

Tampuu et al. (2017) first proposed playing the Atari Pong game with two independent DQN agents. The results indicate that DQN can be extended to the decentralized learning of multi-agent systems; this is the earliest work adopting the DTDE framework to address the scalability problem. Leibo et al. (2017) applied independent DQN to sequential social dilemmas and analyzed the dynamics of multiple independently learning agents, each using its own deep Q-network. Foerster et al. (2017) proposed two methods to make experience replay more stable and compatible with multi-agent learning: on the one hand, importance sampling is adopted to naturally decay outdated data; on the other hand, each agent conditions on extra information that disambiguates the other agents' changing policies.
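A minimal sketch of this DTDE setup (the sizes and the epsilon-greedy scheme are illustrative assumptions): every agent owns its own Q-network and replay buffer and acts only on its local observation, so the rest of the team is implicitly treated as part of the environment.

```python
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, N_ACTIONS = 3, 10, 4

def make_q_net():
    return nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

# One independent learner per agent: separate parameters, separate replay buffer.
agents = [{"q_net": make_q_net(), "buffer": []} for _ in range(N_AGENTS)]

def act(observations, epsilon=0.1):
    """Each agent chooses from its own network using only its local observation."""
    actions = []
    for agent, obs in zip(agents, observations):
        if torch.rand(1).item() < epsilon:
            actions.append(torch.randint(N_ACTIONS, (1,)).item())
        else:
            actions.append(agent["q_net"](obs).argmax().item())
    return actions

print(act([torch.randn(OBS_DIM) for _ in range(N_AGENTS)]))
```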

Song et al. (2018) proposed a new multi-agent policy gradient algorithm that addresses the high-variance gradient estimation problem and effectively optimizes multi-agent cooperative tasks in a highly complex particle environment. Wai et al. (2018) proposed a decentralized local-exchange scheme that makes the method more scalable and robust: each agent communicates only with its neighbors over the network and iterates spatially and temporally to combine neighboring gradient information and local reward information respectively. Abouheaf and Gueaieb (2017) proposed an online adaptive reinforcement learning method for multi-agent settings based on graph interaction, in which the Bellman equation of the multi-agent setting is solved with a reduced value function, lowering the computational complexity and addressing large-scale optimization problems. Palmer et al. (2018) proposed the LDQN algorithm, which introduces leniency into the deep Q-network and applies lenient treatment to negative policy updates to improve convergence and stability.

The weakness of independent agents is that, by treating the other agents as part of the environment, they ignore the fact that those agents' policies change over time. While independent learning avoids the scalability problems of centralized learning, it introduces a new problem: the environment becomes non-stationary from the point of view of each agent. In this situation, the optimal policy of an agent is affected by the learning of the other agents. Meanwhile, the convergence theory of Q-learning in single-agent reinforcement learning no longer applies to most multi-agent settings, because the Markov property is no longer valid in a non-stationary environment.

3.2.2 Partial observability and CTDE

In real-world tasks, the environment is often only partially observable; in other words, when agents interact with the environment, they do not know all the information about its state. This type of problem is typically modeled as a partially observable Markov decision process (POMDP).

In single-agent deep reinforcement learning, there are many models and algorithms dealing with POMDPs, among which the deep recurrent Q-network (DRQN) is the most representative. Hausknecht and Stone (2015) modified DQN by combining it with a Long Short-Term Memory (LSTM) to handle the noisy observations of a POMDP. Although DRQN sees only one frame at a time, it can still combine information across frames to detect relevant information, such as the speed of objects on the screen. Foerster et al. (2016) proposed Reinforced Inter-Agent Learning (RIAL), also based on DRQN, to address partial observability, and introduced two variants: in one, each agent learns its policy with its own network parameters, as in independent Q-learning, treating the others as part of the environment; in the other, all agents share the parameters of a single network.

Gupta et al. (2017) introduced the parameter sharing (PS) method to improve learning in homogeneous, partially observable multi-agent environments where agents have the same action space. The idea is that a globally shared policy network can still behave differently for different agents through its inputs (each agent's individual observation). They tested three methods based on parameter sharing, PS-DQN, PS-DDPG and PS-TRPO, and the results showed that PS-TRPO was superior to the other two.
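A minimal parameter-sharing sketch (the dimensions and the one-hot agent index are illustrative assumptions): one policy network is shared by all homogeneous agents, and each agent conditions it on its own observation plus its index, so behavior can still differ across agents.

```python
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, N_ACTIONS = 3, 6, 5
shared_policy = nn.Sequential(
    nn.Linear(OBS_DIM + N_AGENTS, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS)
)

def act(observations):
    actions = []
    for i, obs in enumerate(observations):        # obs: tensor of shape (OBS_DIM,)
        one_hot = torch.zeros(N_AGENTS)
        one_hot[i] = 1.0                          # identifies which agent is acting
        logits = shared_policy(torch.cat([obs, one_hot]))
        actions.append(torch.distributions.Categorical(logits=logits).sample().item())
    return actions

print(act([torch.randn(OBS_DIM) for _ in range(N_AGENTS)]))
```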

In fact, in many multi-agent settings partial observability calls for learning decentralized policies, because on the one hand a decentralized policy only requires the local action-observation history of each agent, and on the other hand decentralized policies naturally avoid joint action spaces that grow rapidly with the number of agents. Fortunately, decentralized policies can often be learned in a centralized way, so the partial observability problem can be tackled with the centralized training and decentralized execution structure. Lowe et al. (2017) adopted the CTDE paradigm, allowing the critics to use additional information to ease training. They proposed an extension of the actor-critic setting in which the critic is given extra information about the other agents' policies, while the actor only has local information. After training is completed, only the local actors are used at execution time, acting in a decentralized manner. Concretely, consider a game with N agents whose policies are \(\mu = \left\{ \mu_{1}, \ldots, \mu_{N} \right\}\) with parameters \(\theta = \left\{ \theta_{1}, \ldots, \theta_{N} \right\}\). The gradient of the expected return for agent i with policy \(\mu_{i}\), \(J\left( {\theta_{i} } \right) = E\left[ {R_{i} } \right]\), is:

$$\nabla_{\theta_{i}} J\left( {\theta_{i} } \right) = E_{x,a \sim D} \left[ {\nabla_{\theta_{i}} \mu_{i} \left( {a_{i} |o_{i} } \right)\nabla_{a_{i}} Q_{i}^{\mu } \left( {x,a_{1} , \ldots ,a_{N} } \right)\big|_{a_{i} = \mu_{i} \left( {o_{i} } \right)} } \right]$$
(15)

where \(Q_{i}^{\mu } \left( {x,a_{1} , \ldots ,a_{N} } \right)\) is a centralized action-value function that takes all agents' actions \(a_{1} , \ldots ,a_{N}\) and some state information x (e.g., \(x = \left( {o_{1} , \ldots ,o_{N} } \right)\)) as input and outputs the Q-value for agent i. The experience replay buffer D contains tuples \(\left( {x,x^{\prime},a_{1} , \ldots ,a_{N} ,r_{1} , \ldots ,r_{N} } \right)\) recording the experiences of all agents, where \(x^{\prime}\) is the next state after taking actions \(a_{1} , \ldots ,a_{N}\). The centralized action-value function \(Q_{i}^{\mu }\) is updated as:

$$L\left( {\theta _{i} } \right) = E_{{x,a,r,x'}} \left[ {Q_{i}^{\mu } \left( {x,a_{1} , \ldots ,a_{N} } \right) - y} \right]^{2}$$
(16)
$$y = r_{i} + \gamma Q_{i}^{\mu^{\prime}} \left( {x^{\prime},a_{1}^{\prime} , \ldots ,a_{N}^{\prime} } \right)\big|_{a_{j}^{\prime} = \mu_{j}^{\prime} \left( {o_{j} } \right)}$$
(17)

where \(\mu^{\prime} = \left\{ {\mu_{\theta_{1}^{\prime}} , \ldots ,\mu_{\theta_{N}^{\prime}} } \right\}\) is the set of target policies with delayed parameters \(\theta_{i}^{\prime}\). It is worth noting that the centralized Q-function is only used during training, while each policy \(\mu_{\theta_{i}}\) only takes local information \(o_{i}\) to produce an action during decentralized execution (Fig. 3).

Fig. 3 Architecture of Multi-Agent Deep Deterministic Policy Gradient
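A minimal PyTorch sketch of the update of Eqs. (15)-(17) for agent i (sizes, batching, and the omitted optimizer and target-update steps are assumptions for illustration): the centralized critic sees all observations and actions, while each actor only sees its own observation.

```python
import copy
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, ACT_DIM = 2, 6, 2
actors = [nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, ACT_DIM), nn.Tanh())
          for _ in range(N_AGENTS)]
critics = [nn.Sequential(nn.Linear(N_AGENTS * (OBS_DIM + ACT_DIM), 64), nn.ReLU(), nn.Linear(64, 1))
           for _ in range(N_AGENTS)]
actors_tgt, critics_tgt = copy.deepcopy(actors), copy.deepcopy(critics)

def maddpg_losses(i, obs, acts, rewards, obs_next, gamma=0.95):
    """obs, acts, obs_next: lists of per-agent batched tensors; rewards: list of (B,) tensors."""
    x = torch.cat(obs + acts, dim=1)                                   # centralized critic input
    with torch.no_grad():
        acts_next = [actors_tgt[j](obs_next[j]) for j in range(N_AGENTS)]
        y = rewards[i] + gamma * critics_tgt[i](torch.cat(obs_next + acts_next, dim=1)).squeeze(1)  # Eq. (17)
    critic_loss = nn.functional.mse_loss(critics[i](x).squeeze(1), y)  # Eq. (16)

    # Eq. (15): ascend Q_i with agent i's action replaced by its current policy output.
    acts_pg = [a.detach() for a in acts]
    acts_pg[i] = actors[i](obs[i])
    actor_loss = -critics[i](torch.cat(obs + acts_pg, dim=1)).mean()
    return critic_loss, actor_loss    # back-propagated by the respective optimizers

B = 8
obs = [torch.randn(B, OBS_DIM) for _ in range(N_AGENTS)]
acts = [torch.randn(B, ACT_DIM) for _ in range(N_AGENTS)]
rewards = [torch.randn(B) for _ in range(N_AGENTS)]
print(maddpg_losses(0, obs, acts, rewards, [torch.randn(B, OBS_DIM) for _ in range(N_AGENTS)]))
```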

Many works have emerged based on this framework. Li et al. (2019) studied the problem of training robust DRL agents with continuous actions in multi-agent settings, so that trained agents retain their generalization ability when the opponents' policies change. They proposed the MiniMax Multi-agent Deep Deterministic Policy Gradient (M3DDPG) method with two main contributions: on the one hand, they introduced a minimax extension of the popular MADDPG algorithm for robust policy learning; on the other hand, because the continuous action space makes the minimax learning objective computationally difficult, they proposed Multi-Agent Adversarial Learning (MAAL) to tackle that problem efficiently.

Foerster et al. (2018) proposed a novel multi-agent actor-critic method, similar to MADDPG, called counterfactual multi-agent policy gradients (COMA) to learn decentralized policies for cooperative agents. In addition, COMA addresses the credit assignment problem in multi-agent settings by using a counterfactual baseline. The baseline keeps the other agents' actions fixed and marginalizes out the action of a single agent; by comparing the current Q-value with the baseline, an advantage function can be calculated. This counterfactual was inspired by difference rewards (Tumer and Agogino 2007), a way of capturing individual contributions in a coordinated multi-agent setting.
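A minimal sketch of COMA's counterfactual advantage for one agent (tensor shapes are our assumptions): the baseline marginalizes agent i's action under its current policy while the centralized critic keeps the other agents' actions fixed.

```python
import torch

def counterfactual_advantage(q_values, pi_i, a_i):
    """
    q_values: (B, n_actions_i) tensor of Q(s, u) with agent i's action varied and
              the other agents' actions held fixed.
    pi_i:     (B, n_actions_i) tensor of agent i's current policy probabilities.
    a_i:      (B,) tensor with the action agent i actually took.
    """
    baseline = (pi_i * q_values).sum(dim=1)                           # E_{u_i'}[Q(s, (u_-i, u_i'))]
    q_taken = q_values.gather(1, a_i.long().unsqueeze(1)).squeeze(1)  # Q(s, u)
    return q_taken - baseline                                         # advantage A_i(s, u)

q = torch.randn(4, 3)
pi = torch.softmax(torch.randn(4, 3), dim=1)
a = torch.randint(3, (4,))
print(counterfactual_advantage(q, pi, a))
```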

Coordinated MDRL problems such as coordinating self-driving vehicles can often be treated with a centralized approach. However, Sunehag et al. (2017) found that in practice the centralized approach consistently fails even on simple cooperative MDRL tasks: it learns inefficient policies in which only one agent behaves actively and the others are "lazy". This occurs when one agent learns an effective policy and thereby discourages the other agents from learning. The authors proposed the value decomposition network (VDN) architecture to solve this problem; it decomposes the team value function into agent-wise value functions and trains the individual agents on them. They showed that value decomposition performs much better than centralization or fully independent learners.

Rashid et al. (2018) proposed a novel value-based method called QMIX and demonstrated that the full additive decomposition of VDN is unnecessary for extracting decentralized policies. QMIX is an extension of VDN, which obtains the joint action-value function by summing the local action-value functions of the agents; QMIX instead adopts a mixing network to merge the local value functions of the individual agents and adds global state information during training to improve performance (Fig. 4).

Fig. 4 Architecture of QMIX

However, VDN and QMIX perform value decomposition heuristically, without a solid theoretical grounding. Yang et al. (2020) theoretically derived a linear decomposition from \(Q_{tot}\) into the individual \(Q_{i}\); based on this result, they introduced multi-head attention mechanisms to approximate each term of the decomposition and provided theoretical explanations. Son et al. (2019) proposed a new factorization method named QTRAN, which is not subject to the structural constraints of VDN and QMIX. Yang et al. (2020) proposed the Q-value Path Decomposition (QPD) method, which decomposes the global Q-value of the system into per-agent Q-values using the integrated-gradient attribution technique (Table 1).

Table 1 Multi-agent deep reinforcement learning main Challenges and Structure
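To make the contrast between the two decomposition schemes concrete, the sketch below (sizes and layer shapes are illustrative assumptions) mixes per-agent utilities in the two styles: VDN simply sums them, while a QMIX-style mixer produces state-conditioned non-negative weights via hypernetworks so that \(Q_{tot}\) is monotonic in every \(Q_{i}\).

```python
import torch
import torch.nn as nn

N_AGENTS, STATE_DIM, EMBED = 3, 12, 32

def vdn_mix(agent_qs):                               # agent_qs: (B, N_AGENTS)
    return agent_qs.sum(dim=1, keepdim=True)         # Q_tot = sum_i Q_i

class QmixMixer(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypernetworks generate the mixing weights from the global state.
        self.w1 = nn.Linear(STATE_DIM, N_AGENTS * EMBED)
        self.b1 = nn.Linear(STATE_DIM, EMBED)
        self.w2 = nn.Linear(STATE_DIM, EMBED)
        self.b2 = nn.Linear(STATE_DIM, 1)

    def forward(self, agent_qs, state):              # (B, N_AGENTS), (B, STATE_DIM)
        B = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(B, N_AGENTS, EMBED)   # non-negative weights
        b1 = self.b1(state).view(B, 1, EMBED)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.w2(state)).view(B, EMBED, 1)
        return (torch.bmm(hidden, w2) + self.b2(state).view(B, 1, 1)).squeeze(2)

qs, s = torch.randn(4, N_AGENTS), torch.randn(4, STATE_DIM)
print(vdn_mix(qs).shape, QmixMixer()(qs, s).shape)   # both (4, 1)
```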

In addition to the above classification, the next sections describe two current mainstream branches, communication learning and agent modeling, in which most of the methods also target the scalability, partial observability, and non-stationarity problems of MDRL; some of them could likewise be placed in the table above.

3.3 Communication learning

Communication between agents has been one of the hotspots in the MDRL field in recent years. The communication-learning setting usually considers a set of cooperating agents in a partially observable environment that exchange information through communication to promote cooperation.

Foerster et al. (2016) first proposed updating the communication model through backpropagation, namely Differentiable Inter-Agent Learning (DIAL). DIAL takes the messages of other agents as input, providing a reference for each agent's decision-making, and gradients flow back through the communication channel to update the message-generating layers. Foerster et al. also proposed the Deep Distributed Recurrent Q-network (DDRQN) based on DRQN, which enables teams of agents to learn to solve communication-based coordination tasks. The agents are not given any pre-designed communication protocol, so they must first develop and agree on their own protocol. This was the first success of deep reinforcement learning in learning communication protocols.

Sukhbaatar et al. (2016) proposed the CommNet method in a similar spirit: each agent extracts information from the other agents through broadcasting. The CommNet model is shown in Fig. 5. The left part is the view of the model for a single agent, whose parameters are shared across all agents. The middle part is a single communication step, in which each agent module propagates its internal state and broadcasts a communication vector on a common channel (shown in yellow). The right part is the full model, showing the input states of each agent, two communication steps, and the output actions of each agent.

Fig. 5 Architecture of CommNet
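A minimal sketch of one such communication step (layer sizes are illustrative assumptions): every agent broadcasts its hidden state, receives the mean of the other agents' hidden states as its communication vector, and updates its hidden state with weights shared by all agents.

```python
import torch
import torch.nn as nn

N_AGENTS, HIDDEN = 4, 16
f_h = nn.Linear(HIDDEN, HIDDEN)    # transforms an agent's own hidden state
f_c = nn.Linear(HIDDEN, HIDDEN)    # transforms the received communication vector

def comm_step(h):                  # h: (N_AGENTS, HIDDEN)
    totals = h.sum(dim=0, keepdim=True)
    c = (totals - h) / (N_AGENTS - 1)      # mean of the *other* agents' hidden states
    return torch.tanh(f_h(h) + f_c(c))     # shared weights across agents

h = torch.randn(N_AGENTS, HIDDEN)
for _ in range(2):                 # two communication steps, as in the full model
    h = comm_step(h)
print(h.shape)
```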

Peng et al. (2017) proposed a new communication network, Multi-agent Bidirectional-Coordinated Nets (BicNet). BicNet uses a bidirectional RNN as the communication channel and stores local state memory. It differs from DIAL and CommNet in that it can adjust the order in which agents join the communication network and supports continuous action spaces; however, BicNet requires each agent to obtain the global state.

Mao et al. (2017) proposed an Actor-Coordinator-Critic Net (ACCNet) framework for learning-to-communicate problems in multi-agent settings. ACCNet either forms an integrated state through the communication channel and then runs an agent-independent actor-critic learning process on it, or carries out communication in the critic part to obtain better estimated Q-values. Jiang et al. (2018) proposed a novel method named ATOC that uses attention for communication: an attention unit builds a communication group (selecting agents for information sharing) and a bidirectional LSTM unit serves as the communication channel. In addition, ATOC supports environments with large numbers of agents.

3.4 Agent modeling

An important ability of agents is to reason about other agents' behavior. By constructing models, an agent can make predictions about properties of interest of the modeled agents, including their actions, goals, and beliefs. Typically, a model is a function that takes a portion of the observed interaction history as input and outputs predictions about some properties of the modeled agent; the interaction history contains information such as the past actions the modeled agent took in various situations. Many modeling approaches now exist for multi-agent environments, with very different methods and underlying assumptions.

He et al. (2016) proposed the Deep Reinforcement Opponent Network (DRON) algorithm, an early work using deep neural networks for agent modeling. The architecture has two networks: one evaluates Q-values and the other learns a representation of the opponent's policy. They also proposed using several expert networks, each capturing one type of opponent policy, whose predictions are combined to estimate the Q-value. While DRON defines the opponent network with hand-crafted features, the Deep Policy Inference Q-Network (DPIQN) (Hong et al. 2018) learns "policy features" directly from raw observations of the other agents. DPIQN consists of three main parts: a Q-value learning module, a feature extraction module, and an auxiliary policy-feature learning module. The first two modules are responsible for learning the Q-value, while the last one learns hidden representations of the other agents' policies.

A variety of opponent-modeling approaches learn from observations. Raileanu et al. (2018) proposed a different approach, Self Other Modeling (SOM), which uses the agent's own policy to predict the actions of opponents. The authors present a new method for inferring hidden states from the behavior of other agents and using these estimates to select actions. SOM uses two networks, one to compute the agent's own policy and the other to infer the opponent's goal; the approach requires no additional parameters for modeling other agents, since the networks share the same structure and inputs but are evaluated with the agent's or the opponent's values. Compared with previous methods, the focus of SOM is not to learn the opponent's policy but to estimate the opponent's goal (Fig. 6).

Fig. 6 General agent model

Rabinowitz et al. (2018) proposed a new neural network, the Theory of Mind Network (ToMnet), which is capable of modeling the mental states of itself and of the agents around it. Theory of mind refers broadly to the human ability to represent the mental states of others, such as desires, beliefs, and intentions, and belongs to the family of recursive reasoning approaches (Gmytrasiewicz and Doshi 2005; Gmytrasiewicz and Durfee 2000; Camerer et al. 2004; Carmel and Markovitch 1996), in which agents hold explicit beliefs about the other agents' mental states. ToMnet is composed of three networks: a character network that learns from past information; a mental-state network that takes the character output and the most recent trajectory as input; and a prediction network whose inputs are the current state and the outputs of the other two networks and whose output is a prediction of the opponent's next action. ToMnet can learn a generic model of the agents in a training distribution and build agent-specific models while observing the actions of new agents.

Yang et al. (2019) proposed Deep Bayesian Theory of Mind Policy (Bayes-ToMoP), which takes inspiration from theory of mind. Their setup assumes that the opponent chooses among a set of policies and changes over time. Earlier work such as BPR+ (Hernandez-Leal et al. 2016) extends the Bayesian policy reuse framework (Rosman et al. 2016) to multi-agent domains to deal with this setup; Deep Bayes-ToMoP provides a higher level of reasoning than BPR+ by using theory of mind. The Deep BPR+ method (Zheng et al. 2018) is also inspired by BPR+: it uses not only environmental rewards but also an online-learned opponent model (Hernandez-Leal and Kaisers 2017) to construct a rectified belief over the opponent's policy, and it borrows ideas from policy distillation (Rusu et al. 2015; Hinton et al. 2015), extending them to multi-agent settings to create a distilled policy network.

Lanctot et al. (2017) quantified a serious problem with independent reinforcement learning agents, joint policy correlation (JPC), which limits the generality of these methods. They presented a generalized multi-agent reinforcement learning algorithm that subsumes several previous algorithms, and demonstrated in their experiments that PSRO/DCH produces general policies that significantly reduce JPC in partially observable coordination games (Table 2).

Table 2 Multi-agent deep reinforcement learning main approaches

4 Applications and prospect of MDRL

In recent years, MDRL methods have been applied in various fields to solve complex real-world tasks. This section outlines the application of these methods in different domains (Mao et al. 2019; Wang et al. 2019; Tan 1993; Duan et al. 2016).

MDRL is used in domains such as autonomous driving, Internet marketing, resource management, and traffic control. Shalev-Shwartz et al. (2016) addressed the safety and environmental unpredictability of autonomous driving. They demonstrated that policy gradient iterations can be used without Markovian assumptions, and they decomposed the problem into a policy for desires and trajectory planning with hard constraints, which provide driving comfort and driving safety respectively. Jin et al. (2018) proposed the Distributed Coordinated Multi-Agent Bidding (DCMAB) algorithm, which combines the idea of clustering with MDRL to optimize real-time online bidding in the face of a large number of advertisers; a practical distributed coordinated multi-agent bidding algorithm was proposed and implemented to balance competition and cooperation between advertisers. As shown in Fig. 7, the model uses a separate actor and Q-network for each agent: the action \(a_{i}\) is computed by actor \(u_{i}\) from the general information g and the clustering feature x, and in addition to states and actions, the consumer distribution d is collected as input to all agents' Q-functions.

Fig. 7 Structure of the DCMAB algorithm

In the field of resource management, Xi et al. (2018) proposed a novel MDRL algorithm named PDWoLF-PHC to handle the stochastic disturbances of the power grid caused by the integration of distributed and new energy sources. It effectively realizes the application of stochastic games in non-Markovian environments, and the model's faster convergence and stronger robustness allow the grid to improve the utilization of new energy under more complex conditions. Perolat et al. (2017) studied the emergent behavior of groups of independent agents in partially observable Markov games; they modeled the appropriation of common-pool resources, revealed the relationship between exclusivity, sustainability, and inequality, and proposed ways to improve resource management. Kofinas et al. (2018) proposed a fuzzy Q-learning method to effectively improve the energy management of a decentralized microgrid. Noureddine et al. (2017) proposed a DRL cooperative task allocation method that enables multiple agents to interact and allocate resources and tasks effectively; in a loosely coupled distributed multi-agent setting, agents can benefit from collaborating neighbors.

In the traffic control domain, Chen et al. (2016) proposed a cooperative multi-agent reinforcement learning framework to alleviate bus congestion on bus lanes in real time. They adopted coordination graphs to automatically select coordinated holding actions when multiple buses are stationed at stops, and, exploiting the sparsity of the graphs, they developed a sparse cooperative Q-learning algorithm for the coordinated holding actions. Simulation experiments showed that the method could be applied in an advanced public transportation system to improve bus operation. Vidhate et al. (2017) proposed a traffic control model based on cooperative multi-agent reinforcement learning for the control and optimization of traffic systems that can deal with unknown complex states. The model extends the traffic state of each vehicle, including the delay time, the newly arriving vehicles, and the number of vehicles parked at a signal, to learn and set the optimal actions, and it substantially improves traffic control, demonstrating real-time dynamic control. As shown in Fig. 8, a real-world setting with four signals and eight traffic flows is considered. Calvo et al. (2018) proposed a novel IDQN algorithm to address the heterogeneity of urban traffic signal control in a multi-agent environment. Each agent learns through a dueling double deep Q-network (DDDQN), which integrates dueling networks, DDQN, and prioritized experience replay. As shown in Fig. 9, agents learn from local experience and from the information exchanged between them.

Fig. 8 Traffic flow and control of four intersections with eight flow directions

Fig. 9 Architecture of traffic signal control

The agents interact in two manners: (1) intra-level or horizontal interaction, where agents interact with each other at the same level, and (2) inter-level interaction, where agents interact across different levels of the hierarchy. Traffic agents use intra-level interaction to share information among themselves, while communication between the monitor agent and the traffic agents is inter-level (Table 3).

Table 3 Typical MDRL applications in different fields

5 Conclusion and research directions

Deep reinforcement learning has shown success in many single-agent fields, and the next step is to focus on multi-agent scenarios. However, deep reinforcement learning for multi-agent settings is fundamentally more difficult due to non-stationarity, the curse of dimensionality, partial observability, and the credit assignment problem, among other factors (Stone and Veloso 2000; Panait and Luke 2005; Hernandez-Leal et al. 2017, 2018; Albrecht and Stone 2018; Silva and Costa 2019; Palmer et al. 2019; Wei et al. 2018; Nguyen et al. 2018; Xu et al. 2016).

This paper gives a systematic review of multi-agent deep reinforcement learning, covering its background, classical algorithms, research progress, and practical applications. Multi-agent deep reinforcement learning has shown great potential in many practical problems and fields, with research results and new algorithms constantly emerging. The theoretical background and classical algorithms of multi-agent reinforcement learning were introduced first, including the Markov framework and the stochastic game model; then recent innovations and improvements in multi-agent deep reinforcement learning algorithms were reviewed in detail from different perspectives; finally, practical applications and future prospects were discussed. Some promising research directions are not detailed in this paper, including learning from demonstration, model-based deep RL, and transfer learning.

Learning from demonstration, which consists of imitation learning and inverse RL, has made significant progress in single-agent deep RL (Piot et al. 2016). Imitation learning attempts to map states to actions in a supervised way, extending the expert policy directly to unvisited states and making the problem close to multi-class classification when the action set is finite. Inverse RL agents try to infer a reward function from expert demonstrations (Hadfield-Menell et al. 2016, 2017). However, these methods have not been fully studied in multi-agent settings. In MAS, such applications pose a very direct challenge: they require multiple experts who can demonstrate tasks collaboratively, and the communication and reasoning abilities of the experts are difficult to describe and model with autonomous agents. These raise important questions for extending imitation learning and inverse RL to MDRL (Christiano et al. 2017; Nguyen et al. 2018). Zhang et al. combined suboptimal human prior knowledge with RL to introduce a knowledge-guided policy network (Zhang et al. 2020). In addition, model-free deep RL has been applied to many complex problems in single-agent and multi-agent domains, but such methods require a large number of samples and long learning times to achieve good performance. Model-based deep RL extensions have made great progress in the single-agent field (Finn and Levine 2017; Gu et al. 2016; Levine et al. 2016), but they have not been widely studied in multi-agent settings.

Many studies have used transfer learning to improve the performance of MDRL models during training and reduce computational cost. Yin et al. (2017) introduced a policy distillation framework to apply knowledge transfer to DRL; this method reduces training time and outperforms DQN, but its exploration policy is not effective enough. Egorov et al. (2016) reconstructed the multi-agent environment as an image-like representation and used CNNs to estimate the Q-value of each agent. Because transfer learning can speed up the training process, it can help address the scalability problem in MAS. Parisotto et al. (2015) proposed an actor-mimic method for multi-task transfer learning to improve the learning speed of deep policy networks; the network is not very complex, yet it can achieve expert performance on multiple games at the same time.

Although MDRL is a recent area, there are already many open-source benchmarks that exercise different characteristics, such as the StarCraft Multi-Agent Challenge and Hanabi. The StarCraft Multi-Agent Challenge (Samvelyan et al. 2019) is based on StarCraft II, a real-time strategy game, and focuses on the challenges of micromanagement, i.e. the fine-grained control of individual units; each unit is controlled by an independent agent that acts on local observations. Hanabi is a cooperative card game for two to five players.

The game is designed so that players cannot see their own cards, but the other players can reveal information about them. This presents an interesting challenge to learning algorithms, especially for self-play learning and ad-hoc teams (Bard et al. 2020; Hessel et al. 2018; Bowling and McCracken 2005). Pommerman (Resnick et al. 2018) is another multi-agent benchmark that supports partial observability and communication learning among agents; it can be used to test cooperative, competitive, and mixed tasks, and it is very challenging because the rewards are sparse and delayed (Gao et al. 2019). The Apprentice Firemen Game (Palmer et al. 2019) and Fully Cooperative Multiagent Object Transportation Problems (CMOTPs) (Palmer et al. 2018) are both two-agent pixel-based environments. In addition, there are many other benchmarks, such as MuJoCo Multi-Agent Soccer (Liu et al. 2019), Neural MMO (Suarez et al. 2019), Arena (Song et al. 2019), the MARLO competition (Johnson et al. 2016), and MAgent (Zheng et al. 2017).