
1 Introduction

Basic reinforcement learning is powerful for training a single agent to behave well according to the rules and rewards provided by the environment. However, in many artificial intelligence applications where the environment is large-scale and the tasks are complicated, we expect not only a single agent to make smart decisions, but also a group of agents that can communicate and make decisions with each other. Accordingly, we need to propose and apply learning strategies for each agent. Considering the interactions among multiple agents, multi-agent reinforcement learning is put forward to meet this expectation.

For clear analysis and better understanding, we define the basic components of multi-agent learning as the agent, the policy, and the utility, which are elaborated as follows.

  • Agent: We define an agent as an autonomous individual that can independently interact with the environment and choose its own strategy based on observations of others’ behaviors, aiming to maximize its revenue or minimize its loss. In the considered scenarios, there exist multiple agents; when the number of agents equals one, multi-agent reinforcement learning reduces to the regular single-agent reinforcement learning scenario.

  • Policy: Each agent follows its own policy in multi-agent reinforcement learning. The policy is normally designed to maximize the revenue and minimize the cost of the agent, while being affected by the environment and the policies of the other agents.

  • Utility: Each agent has a unique utility that reflects its requirements and its dependencies on the environment and the other agents. The utility is defined as the revenue minus the cost of the agent under its objectives. In multi-agent scenarios, each agent aims to maximize its own utility through learning from the environment and the other agents.

Accordingly, in multi-agent reinforcement learning, agents are assigned their own utility functions. Based on the observations and experience gathered through interactions, each agent performs policy learning autonomously, aiming to optimize its own utility value without considering the utilities of the other agents. Therefore, competition or cooperation may arise through the interactions among agents. Considering the different kinds of interactions among multiple agents, game-theoretic analysis is commonly applied as a powerful tool for decision making (Fudenberg and Tirole 1991). Depending on the scenario, the games fall into different categories, listed as follows.

  • Static Game: The static game is the simplest form for modeling the interactions of agents. In a static game, each agent makes a single decision. As each agent acts only once, unexpected cheating and betraying in the static game can be profitable. Thus, each agent is required to carefully predict the strategies of the other agents so as to act smartly and gain a high utility.

  • Repeated Game: The repeated game refers to the situation where all agents take actions repeatedly on the same state for multiple iterations. The overall utility of each agent is the sum of its discounted utilities over the iterations of the game. Because all agents act repeatedly, cheating and betraying during the interactions can trigger penalties or retaliation from other agents in future iterations. Thus, the repeated game discourages malicious behaviors and generally improves the total utilities of all agents.

  • Stochastic Game: The stochastic game (or Markov game) can be regarded as an MDP with multiple agents, or as a repeated game with multiple states. It models the iterated interactions of multiple agents in general scenarios, where in each iteration the agents may be in different states and each tries to gain a high utility based on its observation and prediction of the other agents.

In this chapter, building on fundamental single-agent reinforcement learning, we focus on the relations between agents, seeking equilibrium solutions where each agent is able to achieve a high and stable utility.

2 Optimization and Equilibrium

As each agent aims to maximize its own utility, multi-agent reinforcement learning can be considered as solving an optimization problem for each agent. Suppose there are m agents; \(\mathcal {X}=\mathcal {X}_1\times \mathcal {X}_2\times \dotsb \times \mathcal {X}_m\) refers to the policy space of all agents and \(\mathbf{u} = (u_1(\mathbf{x}), \dotsc, u_m(\mathbf{x}))\) represents the utility profile of all agents under policy profile \(\mathbf {x} \in \mathcal {X}\). Accordingly, each agent i, ∀i ∈{1, 2, …, m}, is required to maximize its own utility given the behaviors of the others. For multi-agent reinforcement learning, the task is generally to solve multiple optimization problems simultaneously or sequentially, so as to make sure each agent is able to obtain a high utility.

As the utility of each agent can be affected by the policies of all other agents, it is required to seek a stable policy profile for all agents so that no agent is willing to deviate from its current policy for a higher utility. Therefore, the equilibrium concept in multi-agent reinforcement learning is put forward. For better understanding and analysis, without loss of generality, we introduce different equilibrium concepts based on an intuitive chicken dare game. The chicken dare game is a static game between two agents. Each agent can independently choose “chicken” (short as “C”) or “dare” (short as “D”) as its action. The utilities under the different joint actions are shown in Fig. 11.1. When both agents choose “D,” both dare and receive the lowest utility 0. When one agent chooses “D” and the other chooses “C,” the one who dares receives the largest utility 6 while the one who chickens gets a relatively low utility 3. When both agents choose “C,” both chicken and get a relatively high reward 5.

Fig. 11.1 Chicken dare game (payoffs: (C, C) → (5, 5); (C, D) → (3, 6); (D, C) → (6, 3); (D, D) → (0, 0), where the first entry is agent 1’s utility)

2.1 Nash Equilibrium

Following the chicken dare game (Rapoport and Chammah 1966) in Fig. 11.1, we set the rule that both agents are required to act simultaneously. When both agents choose “C,” each of them would like to switch to “D” to gain a higher utility, under the assumption that its opponent will not change its action. When both agents switch their actions to “D,” both of them receive 0 utility and would definitely switch back to “C” for a higher utility. Nevertheless, when one agent chooses action “C” while the other chooses “D,” assuming the opponent does not switch its action, neither agent can change its own action to gain a higher utility. Therefore, we call the scenarios in which one agent chooses “C” and the other agent chooses “D” Nash equilibria (Nash et al. 1950), which are formally defined as follows.

Definition 11.1

Let \(\left ( \mathcal {X}, \mathbf {u} \right )\) denote a static scenario with m agents. \(\mathcal {X}=\mathcal {X}_1\times \mathcal {X}_2\times \dotsb \times \mathcal {X}_m\) refers to the policy space of all agents and \(\mathbf{u} = (u_1(\mathbf{x}), \dotsc, u_m(\mathbf{x}))\) is the utility profile of all agents under policy profile \(\mathbf {x} \in \mathcal {X}\). Let \(x_i\) be a policy of agent i and \({\mathbf{x}}_{-i}\) be the policies of all agents except agent i. The policy profile \(\mathbf{x}^* \in \mathcal {X}\) achieves a Nash equilibrium if, ∀i and \(\forall {x}_i\in \mathcal {X}_i\),

$$\displaystyle \begin{aligned} \begin{array}{l} u_i({x}^*_{i}, {\mathbf{x}}^*_{-i}) \geq u_i({x}_{i},{\mathbf{x}}^*_{-i}). \end{array} \end{aligned} $$
(11.1)

2.1.1 Pure Strategy Nash Equilibrium

As shown in the definition, in the static scenario of multi-agent reinforcement learning, each agent is required to determine one action at each timestamp. When the actions of the other agents are fixed and no agent can deviate from its current action for a higher utility, the agents achieve a pure strategy Nash equilibrium. In the chicken dare game, there exist two pure strategy Nash equilibrium solutions, in which one agent plays chicken and the other plays dare. A pure strategy Nash equilibrium may not always exist, since a fixed pure action of one agent may give the other agents an incentive to deviate from their current behaviors.
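
To make the best-response reasoning concrete, the minimal sketch below enumerates the pure strategy Nash equilibria of the chicken dare game by brute force. The payoff values are taken from Fig. 11.1; the data structure and function names are ours for illustration only.

```python
import itertools

# Payoffs from Fig. 11.1: payoff[(a1, a2)] = (utility of agent 1, utility of agent 2)
payoff = {
    ("C", "C"): (5, 5),
    ("C", "D"): (3, 6),
    ("D", "C"): (6, 3),
    ("D", "D"): (0, 0),
}
actions = ["C", "D"]

def is_pure_nash(a1, a2):
    """A profile is a pure Nash equilibrium if neither agent gains by a unilateral deviation."""
    u1, u2 = payoff[(a1, a2)]
    best_dev_1 = max(payoff[(d, a2)][0] for d in actions)
    best_dev_2 = max(payoff[(a1, d)][1] for d in actions)
    return u1 >= best_dev_1 and u2 >= best_dev_2

equilibria = [p for p in itertools.product(actions, actions) if is_pure_nash(*p)]
print(equilibria)  # expected: [('C', 'D'), ('D', 'C')]
```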

2.1.2 Mixed Strategy Nash Equilibrium

Moreover, each agent can set up a policy in which each action is chosen with a certain probability at each timestamp. The policies of the agents bring randomness and uncertainty into the interactions. Thus, the agents can adjust their policies considering their effects on the other agents, and a mixed strategy Nash equilibrium always exists. Taking the chicken dare game as an example, suppose the probability for agent 1 to play chicken is p; then the probability to play dare is 1 − p. In order for agent 1’s policy not to bias the decision making of agent 2, i.e., for agent 2 to be indifferent between its two actions, the following equation must hold:

$$\displaystyle \begin{aligned} 5p + 3(1-p) = 6p + 0(1-p). \end{aligned} $$
(11.2)

Thus, we get p = 0.75, and the policy is the same for both agents, namely to choose action “C” with probability 0.75 and action “D” with probability 0.25. With these policies, both agents also achieve a Nash equilibrium, and the expected utility of each agent is 4.5.
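
As a quick numerical check, the short sketch below plugs the solution of the indifference condition in (11.2) back into the payoffs from Fig. 11.1 and recovers the expected utility 4.5; the variable names are illustrative.

```python
# Indifference condition (11.2): 5p + 3(1 - p) = 6p + 0(1 - p)  =>  2p + 3 = 6p
p = 3 / 4  # closed-form solution

# Expected utility of agent 1 when agent 2 plays "C" with probability p and "D" with 1 - p
u_chicken = p * 5 + (1 - p) * 3   # agent 1 plays "C"
u_dare = p * 6 + (1 - p) * 0      # agent 1 plays "D"
print(p, u_chicken, u_dare)       # 0.75 4.5 4.5 -- both actions give the same expected utility
```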

Based on the above, we further show the results in Fig. 11.2, where the X axis is the utility of agent 1 and the Y axis is the utility of agent 2. Based on the utilities of both agents in Fig. 11.1, point A refers to the result when both agents play “C.” Point B denotes the result when agent 1 plays “C” and agent 2 plays “D.” Point C represents the result when agent 1 plays “D” and agent 2 plays “C.” Point D indicates the result when both agents play “D.” Therefore, whatever the policies of the agents, the result falls in the region ABDC. Points B and C are the pure strategy Nash equilibria with deterministic actions, and the middle point E of the line BC is the mixed strategy Nash equilibrium. The total utility of both agents equals 9 for all of these Nash equilibrium solutions.

Fig. 11.2 Nash equilibrium in chicken dare game

2.2 Correlated Equilibrium

In the Nash equilibrium solutions, the total utility of both agents is 9, which is less than the maximum value 10. However, achieving the maximum value 10 requires both agents to choose “C” at the same time, which is unstable in a distributed setting since each agent would unilaterally switch to “D.” Therefore, the concept of correlation among agents is put forward to further improve the total utility while guaranteeing the stability of the solution.

In the chicken dare game, we denote by v the probability distribution over the joint actions “CC” (the first action is for agent 1 and the second action is for agent 2), “CD,” “DC,” and “DD.” When both agents correlate with each other and set \(\mathbf {v} = \left [1/3, 1/3, 1/3, 0 \right ]\), the expected total utility of both agents is 9.3333, which is larger than in the Nash equilibrium solutions. Moreover, when one agent is told to choose “C,” it knows that its opponent follows the correlated distribution and will therefore choose “C” with probability 0.5 and “D” with probability 0.5. Thus, if the agent continues to choose “C,” it receives an expected utility of 0.5 ∗ 5 + 0.5 ∗ 3 = 4. If the agent switches its action while its opponent keeps its current policy, it receives an expected utility of 0.5 ∗ 6 + 0.5 ∗ 0 = 3, which is less than 4. Similarly, when the agent is told to choose “D,” its opponent follows the correlation and chooses action “C” with probability 1, so the agent cannot switch its action to “C” to gain a higher utility. Accordingly, the probability distribution v lets both agents achieve a correlated equilibrium, with the definition as follows.

Definition 11.2

A correlated equilibrium (Aumann 1987) is achieved by any probability distribution v over the joint policies satisfying

$$\displaystyle \begin{aligned} \begin{array}{l} \sum_{{{\mathbf{x}}_{ - i}} \in {\mathcal{X} _{ - i}}} {v({x}_i^*,{{\mathbf{x}}_{ - i}})[} {u_i}({x}_i^*,{{\mathbf{x}}_{ - i}}) - {u_i}({{x}_i},{{\mathbf{x}}_{ - i}})] \geqslant 0, \forall {x}_i \in \mathcal{X}_i, \end{array} \end{aligned} $$
(11.3)

where \({\mathcal {X} _{i}}\) is the policy space of agent i and \({\mathcal {X} _{ - i}}\) denotes the policy space of all agents except agent i.

Therefore, as long as both agents follow the correlated probability distribution, no agent can deviate from the current policy for a higher utility. We depict this correlated equilibrium as point F in Fig. 11.3. Moreover, any point in the region ABC that satisfies the relation in (11.3) is also a correlated equilibrium solution.
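
The sketch below numerically verifies that the distribution v = [1/3, 1/3, 1/3, 0] satisfies the correlated equilibrium condition (11.3) for both agents of the chicken dare game; the helper function and names are ours for illustration.

```python
# Joint distribution over (action of agent 1, action of agent 2), as in the text
v = {("C", "C"): 1/3, ("C", "D"): 1/3, ("D", "C"): 1/3, ("D", "D"): 0.0}
payoff = {("C", "C"): (5, 5), ("C", "D"): (3, 6), ("D", "C"): (6, 3), ("D", "D"): (0, 0)}
actions = ["C", "D"]

def ce_condition_holds(agent):
    """Check inequality (11.3): no agent gains by deviating from its recommended action."""
    for recommended in actions:                 # recommended action x_i^*
        for deviation in actions:               # deviation x_i
            gain = 0.0
            for other in actions:               # opponent action x_{-i}
                joint = (recommended, other) if agent == 0 else (other, recommended)
                dev_joint = (deviation, other) if agent == 0 else (other, deviation)
                gain += v[joint] * (payoff[joint][agent] - payoff[dev_joint][agent])
            if gain < 0:
                return False
    return True

print(ce_condition_holds(0), ce_condition_holds(1))  # expected: True True
```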

Fig. 11.3 Correlated equilibrium in chicken dare game

2.3 Stackelberg Equilibrium

Apart from simultaneous scenarios, the agents may also take actions sequentially. In sequential scenarios, the agents are divided into leaders and followers, where the leaders act first and the followers act correspondingly (Bjorn and Vuong 1985). Accordingly, a first-mover advantage exists: the leaders are able to predict the corresponding reactions of the followers and take actions that yield high utilities. In the chicken dare game, if we regard agent 1 as the leader and agent 2 as the follower, agent 1 can choose action “D,” since when agent 1 chooses “D,” agent 2 will definitely choose “C” for a higher utility. Thus, agent 1 achieves the maximum utility 6 for itself, and both agents reach the Stackelberg equilibrium (Zhang et al. 2018), which can be defined as follows.
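
The minimal sketch below illustrates this backward-induction reasoning for the two-agent chicken dare game: the leader evaluates, for each of its actions, the follower's best response and then picks the action with the highest resulting utility. The function names are ours for illustration.

```python
payoff = {("C", "C"): (5, 5), ("C", "D"): (3, 6), ("D", "C"): (6, 3), ("D", "D"): (0, 0)}
actions = ["C", "D"]

def follower_best_response(leader_action):
    """Follower (agent 2) maximizes its own utility given the leader's observed action."""
    return max(actions, key=lambda a2: payoff[(leader_action, a2)][1])

def stackelberg_leader_action():
    """Leader (agent 1) anticipates the follower's best response and maximizes its own utility."""
    return max(actions, key=lambda a1: payoff[(a1, follower_best_response(a1))][0])

a1 = stackelberg_leader_action()
a2 = follower_best_response(a1)
print(a1, a2, payoff[(a1, a2)])  # expected: D C (6, 3)
```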

Definition 11.3

Let \(\left ( {(\mathcal {X},\boldsymbol {\Pi }),(g,f)} \right )\) be a general sequential scenario with m leaders and n followers. \(\mathcal {X}=\mathcal {X}_1\times \mathcal {X}_2\times \dotsb \times \mathcal {X}_m\) and \(\boldsymbol{\Pi} = \boldsymbol{\Pi}_1 \times \boldsymbol{\Pi}_2 \times \dotsb \times \boldsymbol{\Pi}_n\) are the policy spaces of all leaders and all followers, respectively. \(\mathbf{g} = (g_1(\mathbf{x}), \dotsc, g_m(\mathbf{x}))\) is the utility function of the leaders for \(\mathbf {x}\in \mathcal {X}\), and \(\mathbf{f} = (f_1(\boldsymbol{\pi}), \dotsc, f_n(\boldsymbol{\pi}))\) is the utility function of the followers for \(\boldsymbol{\pi} \in \boldsymbol{\Pi}\). Let \(x_i\) be the policy of leader i, \({\mathbf{x}}_{-i}\) be the policies of all leaders except leader i, \(\pi_j\) be the policy of follower j, and \(\boldsymbol{\pi}_{-j}\) be the policies of all followers except follower j. The policies \(\mathbf{x}^* \in \mathcal {X}\) and \(\boldsymbol{\pi}^* \in \boldsymbol{\Pi}\) achieve the Stackelberg equilibrium of the multi-leader multi-follower scenario if ∀i, ∀j, \(\forall {x}_i\in \mathcal {X}_i\), \(\forall \pi _j\in \boldsymbol {\Pi }_j\),

$$\displaystyle \begin{aligned} \begin{array}{l} g_i\big({x}^*_{i}, {\mathbf{x}}^*_{-i}, \boldsymbol\pi^*\big) \geq g_i\big({x}_{i},{\mathbf{x}}^*_{-i}, \boldsymbol\pi^*\big) \geq g_i\big({x}_{i},{\mathbf{x}}_{-i}, \boldsymbol\pi^*\big), \end{array} \end{aligned} $$
(11.4)
$$\displaystyle \begin{aligned} \begin{array}{l} f_j\big(\mathbf{x}, \boldsymbol\pi^*_{j}, \boldsymbol\pi^*_{-j}\big) \geq f_j\big(\mathbf{x}, \boldsymbol\pi_{j}, \boldsymbol\pi^*_{-j}\big). \end{array} \end{aligned} $$
(11.5)

3 Competition and Cooperation

In the last section, we took the static chicken dare game as an example to introduce the optimization and equilibrium concepts. However, the relation among multiple agents varies across applications. In this section, we further analyze the competitive and cooperative relations among multiple agents in a distributed fashion. Unless stated otherwise, we consider a scenario with m agents, where \(\mathcal {X}=\mathcal {X}_1\times \mathcal {X}_2\times \dotsb \times \mathcal {X}_m\) refers to the policy space of all agents and \(\mathbf{u} = (u_1(\mathbf{x}), \dotsc, u_m(\mathbf{x}))\) is the utility profile of all agents under policy profile \(\mathbf {x} \in \mathcal {X}\).

3.1 Cooperation

When multiple agents cooperate with each other, the total utility is in most cases higher than the sum of the agents’ utilities without cooperation. Moreover, in a distributed network, each agent only considers its own utility. Accordingly, in order to include an agent in the cooperative coalition, the agent is required to receive a higher utility than it would obtain by not cooperating. The optimization problem for agent i, ∀i ∈{1, 2, …, m}, can be formulated as follows:

$$\displaystyle \begin{aligned} \begin{array}{l} {} \max_{{x}_i} \;{\kern 1pt} \, {\sum_{k =1}^{k=m} u_{k}({x}_k | {\mathbf{x}}_{-k})}, \hfill \\ s.t. ~~ {\begin{gathered} u_i({x}^*_i | {\mathbf{x}}^*_{-i}) \geq u_i({x}_i|{\mathbf{x}}^*_{-i}). \hfill \\ \end{gathered}} \hfill \end{array} \end{aligned} $$
(11.6)
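
As a toy illustration of (11.6), the sketch below reuses the chicken dare payoffs and searches the joint pure actions for the profile with the highest total utility among those from which no agent would unilaterally deviate; the function names are ours and the example is only a finite, pure-strategy sketch of the general problem.

```python
import itertools

payoff = {("C", "C"): (5, 5), ("C", "D"): (3, 6), ("D", "C"): (6, 3), ("D", "D"): (0, 0)}
actions = ["C", "D"]

def stable(profile):
    """Constraint of (11.6): no agent can gain by a unilateral deviation from the profile."""
    for i in range(2):
        for dev in actions:
            deviated = list(profile)
            deviated[i] = dev
            if payoff[tuple(deviated)][i] > payoff[profile][i]:
                return False
    return True

candidates = [p for p in itertools.product(actions, actions) if stable(p)]
best = max(candidates, key=lambda p: sum(payoff[p]))
print(best, sum(payoff[best]))  # expected: a stable profile such as ('C', 'D') with total utility 9
```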

3.2 Zero-Sum Game

The zero-sum game (Vincent 1974) is frequently adopted in many applications. For simplicity, we suppose there are two agents, and each agent can choose to take action “A” or “B.” The utility function is shown in Fig. 11.4, where we can observe that the total utility in every situation equals zero. Accordingly, in a general zero-sum problem, each agent is required to maximize its own utility under the prediction that its utility is simultaneously being minimized by its opponents. The optimization problem for agent i, ∀i ∈{1, 2, …, m}, can be summarized as follows:

$$\displaystyle \begin{aligned} \begin{array}{l} \max_{{x}_i}{ \;{\kern 1pt} \, \min_{{\mathbf{x}}_{-i}} {\;{\kern 1pt} \, {u_{i}} }}. \end{array} \end{aligned} $$
(11.7)
Fig. 11.4 Zero-sum game

In Littman (1994), the authors analyze a simplified football competition and model it as a zero-sum game. In the game, there are two agents, and each agent tries to maximize its own utility while minimizing the utility of its opponent. Thus, for agent i, the optimization problem can be represented as

$$\displaystyle \begin{aligned} \begin{array}{l} \max_{\pi_i}{ \;{\kern 1pt} \, \min_{{\mathbf{a}}_{-i}} {\;{\kern 1pt} \, {\sum_{a_i} Q(s,a_i, {\mathbf{a}}_{-i}) \pi_i} }}, \end{array} \end{aligned} $$
(11.8)

where \(\pi_i\) is the strategy of agent i and \(a_i\) is the actual action of agent i sampled from the strategy \(\pi_i\). In the game, agent i tries to maximize its value function, while its opponent tries to minimize the value function by taking the actions \({\mathbf{a}}_{-i}\).
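
Under the minimax objective in (11.8), the maximin strategy at a single state can be computed with a small linear program over the stage-game Q values. The sketch below does this with SciPy for an illustrative 2 × 2 zero-sum matrix (matching pennies, used here as a hypothetical example rather than the payoffs of Fig. 11.4); the variable names are ours.

```python
import numpy as np
from scipy.optimize import linprog

# Q[i, j]: payoff to the maximizing agent when it plays action i and the opponent plays action j.
Q = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
n_actions = Q.shape[0]

# Decision variables: [v, pi_1, ..., pi_n]; maximize v, i.e., minimize -v.
c = np.zeros(n_actions + 1)
c[0] = -1.0

# For every opponent action j: v - sum_i Q[i, j] * pi_i <= 0.
A_ub = np.hstack([np.ones((Q.shape[1], 1)), -Q.T])
b_ub = np.zeros(Q.shape[1])

# The probabilities sum to one; v is unbounded, probabilities lie in [0, 1].
A_eq = np.hstack([[[0.0]], np.ones((1, n_actions))])
b_eq = np.array([1.0])
bounds = [(None, None)] + [(0.0, 1.0)] * n_actions

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[0], res.x[1:])  # expected: maximin value 0.0 with strategy [0.5, 0.5]
```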

3.3 Simultaneous Competition

Apart from the zero-sum game, many applications require general simultaneous competition among multiple agents. In simultaneous competition, all agents are required to take actions at the same time. The optimization problem for agent i, ∀i ∈{1, 2, …, m}, can be summarized as follows:

$$\displaystyle \begin{aligned} \begin{array}{l} {} \max_{{x}_i} \;{\kern 1pt} \, {u_{i}(\mathbf{x_i} | \mathbf{x_{-i}})}. \end{array} \end{aligned} $$
(11.9)

In Hu and Wellman (1998), general Q-learning is put forward to solve the competition among multiple agents. The algorithm is illustrated in Algorithm 1. Based on the experience of interactions, each agent i maintains a Q table to guide its policy \(\pi_i\) and updates it with the following function:

$$\displaystyle \begin{aligned} \begin{array}{l} {} Q_i(s,a_i,{\mathbf{a}}_{-i}) = (1 - \alpha_i) Q_i(s,a_i,{\mathbf{a}}_{-i}) + \alpha_i \big[ r_i + \gamma \pi_i(s^\prime)Q_i\big(s^\prime,a^\prime_i,{\mathbf{a}}^\prime_{-i}\big) \boldsymbol{\pi}_{-i}(s^\prime) \big]. \end{array} \end{aligned} $$
(11.10)

In multi-agent scenarios, as the update of the Q table requires the policies of the other agents \(\boldsymbol{\pi}_{-i}\), agent i is also required to maintain estimated Q tables for all other agents. Based on its prediction of the other agents’ policies \(\boldsymbol{\pi}_{-i}\), agent i aims to set up its policy \(\pi_i\) so that \((\pi_i, \boldsymbol{\pi}_{-i})\) achieves a mixed strategy Nash equilibrium.

Algorithm 1 Multi-agent general Q learning
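
Since the algorithm listing is not reproduced here, the following is a minimal sketch of the tabular update in (11.10) for a two-agent stochastic game. The equilibrium-policy computation is left as a placeholder (a full implementation would solve the stage game defined by the Q tables for a Nash equilibrium), and all names and sizes are illustrative.

```python
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.9

# Agent i's Q table over (state, own action, opponent action), as in (11.10).
Q_i = np.zeros((n_states, n_actions, n_actions))

def equilibrium_policies(q_table, state):
    """Placeholder: return mixed strategies (pi_i, pi_minus_i) for the stage game q_table[state].
    A full implementation would solve the bimatrix stage game for a Nash equilibrium."""
    return np.full(n_actions, 1.0 / n_actions), np.full(n_actions, 1.0 / n_actions)

def q_update(s, a_i, a_minus_i, r_i, s_next):
    """One step of the general multi-agent Q update in (11.10)."""
    pi_i, pi_minus_i = equilibrium_policies(Q_i, s_next)
    # Expected value of the next stage game under the (estimated) equilibrium policies.
    next_value = pi_i @ Q_i[s_next] @ pi_minus_i
    Q_i[s, a_i, a_minus_i] = (1 - alpha) * Q_i[s, a_i, a_minus_i] + alpha * (r_i + gamma * next_value)

q_update(s=0, a_i=1, a_minus_i=0, r_i=1.0, s_next=2)
print(Q_i[0, 1, 0])  # 0.1 after one update from a zero-initialized table
```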

Apart from basic Q-learning, other deep reinforcement learning approaches can also be explored considering the interactions among agents. Multi-agent deep deterministic policy gradient (MADDPG) (Lowe et al. 2017), developed from the single-agent deep deterministic policy gradient (DDPG) algorithm, provides strategies for each agent in the simultaneous competition scenario. In MADDPG, as shown in Algorithm 2, each agent is allocated a decentralized actor, which suggests the actions the agent takes. On the other side, the critic is centralized and maintains the Q value related to the action profile of all agents.

Algorithm 2 Multi-agent deep deterministic policy gradient (MADDPG)

Specifically, the gradient of the expected return for each actor i can be denoted as

$$\displaystyle \begin{aligned} \begin{array}{l} \nabla_{\theta_i^{\pi}} J(\pi_i) = \mathbb{E} \Big[ \nabla_{\theta_i^{\pi}} \pi_i(o_i) \, \nabla_{a_i} Q^{\boldsymbol{\pi}}_i \big(o_1, \ldots, o_m, a_1, \ldots, a_m|\theta^Q_i\big) \big|_{a_i = \pi_i(o_i)} \Big], \end{array} \end{aligned} $$
(11.11)

where \(o_1, \ldots, o_m\) are the observations of the m agents, respectively, and \(\pi_i\), parameterized by \(\theta _i^{\pi }\), is the deterministic policy of agent i satisfying \(a_i = \pi_i(o_i)\).

Correspondingly, the loss function of the critic for agent i is the squared TD-error of the Q value:

$$\displaystyle \begin{aligned} \mathcal{L}_i &= \mathbb{E} \bigg[\bigg(Q^{\boldsymbol{\pi}}_i \big(o_1, \ldots, o_m, a_1, \ldots, a_m|\theta^Q_i\big) \\&\quad - r_i - \gamma Q^{\boldsymbol{\pi^\prime}}_i \big(o^\prime_1, \ldots, o^\prime_m, a^\prime_1, \ldots, a^\prime_m|\theta_i^{Q^\prime}\big) \bigg)^2\bigg], \end{aligned} $$
(11.12)

where \(\theta _i^{Q^\prime }\) is the delayed parameter for Q prediction, and \(\boldsymbol{\pi}^\prime\) refers to the target policies with delayed parameters \(\theta _i^{\pi ^\prime }\).
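
The PyTorch-style sketch below illustrates one critic and one actor update for agent i in the spirit of (11.11) and (11.12). The network classes, batch data, and hyperparameters are placeholders of our own, not the reference MADDPG implementation.

```python
import torch
import torch.nn as nn

m, obs_dim, act_dim, gamma = 2, 8, 2, 0.95

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, o):
        return self.net(o)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(m * (obs_dim + act_dim), 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, obs_all, act_all):
        # Centralized critic: conditions on the observations and actions of all agents.
        return self.net(torch.cat(obs_all + act_all, dim=-1))

actors = [Actor() for _ in range(m)]
critics = [Critic() for _ in range(m)]
target_actors = [Actor() for _ in range(m)]
target_critics = [Critic() for _ in range(m)]

def update_agent(i, obs, acts, rews, next_obs, actor_opt, critic_opt):
    """One MADDPG-style update for agent i; obs, acts, next_obs are lists of batched tensors."""
    # Critic update, following (11.12): regress Q_i toward r_i + gamma * target Q of next joint action.
    with torch.no_grad():
        next_acts = [target_actors[k](next_obs[k]) for k in range(m)]
        y = rews[i] + gamma * target_critics[i](next_obs, next_acts)
    critic_loss = nn.functional.mse_loss(critics[i](obs, acts), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update, following (11.11): ascend the centralized Q with a_i replaced by pi_i(o_i).
    acts_for_actor = [a.detach() for a in acts]
    acts_for_actor[i] = actors[i](obs[i])
    actor_loss = -critics[i](obs, acts_for_actor).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example usage with a random batch (size 32) and hypothetical optimizers.
obs = [torch.randn(32, obs_dim) for _ in range(m)]
acts = [torch.randn(32, act_dim) for _ in range(m)]
rews = [torch.randn(32, 1) for _ in range(m)]
next_obs = [torch.randn(32, obs_dim) for _ in range(m)]
update_agent(0, obs, acts, rews, next_obs,
             torch.optim.Adam(actors[0].parameters(), lr=1e-3),
             torch.optim.Adam(critics[0].parameters(), lr=1e-3))
```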

3.4 Sequential Competition

In some applications, different types of agents may have different priorities when taking actions. Thus, the agents in competition take actions sequentially, and the agents that act first have a first-mover advantage. Generally, we denote by \(\left ( {(\mathcal {X},\boldsymbol {\Pi }),(g,f)} \right )\) the general sequential scenario with m leaders and n followers. \(\mathcal {X}=\mathcal {X}_1\times \mathcal {X}_2\times \dotsb \times \mathcal {X}_m\) and \(\boldsymbol{\Pi} = \boldsymbol{\Pi}_1 \times \boldsymbol{\Pi}_2 \times \dotsb \times \boldsymbol{\Pi}_n\) are the policy spaces of all leaders and all followers, respectively. \(\mathbf{g} = (g_1(\mathbf{x}), \dotsc, g_m(\mathbf{x}))\) is the utility function of the leaders for \(\mathbf {x}\in \mathcal {X}\), and \(\mathbf{f} = (f_1(\boldsymbol{\pi}), \dotsc, f_n(\boldsymbol{\pi}))\) is the utility function of the followers for \(\boldsymbol{\pi} \in \boldsymbol{\Pi}\). Accordingly, the optimization problem for follower j, ∀j ∈{1, 2, …, n}, is

$$\displaystyle \begin{aligned} \begin{array}{l} \max \;{\kern 1pt} \, f_{j}({\pi}_j | \boldsymbol{\pi}_{-j}, \boldsymbol{x}). \end{array} \end{aligned} $$
(11.13)

The optimization problem for leader i, ∀i ∈{1, 2, …, m} can be depicted as

$$\displaystyle \begin{aligned} \begin{array}{l} \begin{gathered} \max \;{\kern 1pt} \, {g_{i}} ({x}_i | {\mathbf{x}}_{-i}, \boldsymbol{\pi}),\hfill \\ s.t. ~~ {\begin{gathered} {\pi}_j = \arg \max \;{\kern 1pt} \, f_{j}({\pi}_j | \boldsymbol{\pi}_{-j}, \boldsymbol{x}) , ~~~~ \forall j \in \{1,2, \ldots, n \}. \hfill \\ \end{gathered}}\hfill \end{gathered} \end{array} \end{aligned} $$
(11.14)
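
As a small illustration of the bilevel structure in (11.13) and (11.14), the sketch below considers one leader and one follower with finite policy sets and hypothetical utility functions g and f of our own choosing; the leader enumerates its options while anticipating the follower's best response.

```python
# Hypothetical finite policy sets and utilities for one leader and one follower.
leader_policies = [0, 1, 2]
follower_policies = [0, 1]

def g(x, pi):   # leader utility, illustrative only
    return 3 * x - 2 * x * pi

def f(x, pi):   # follower utility, illustrative only
    return pi * (2 - x)

def follower_best_response(x):
    # Inner problem (11.13): the follower maximizes f given the leader's policy x.
    return max(follower_policies, key=lambda pi: f(x, pi))

def leader_optimum():
    # Outer problem (11.14): the leader maximizes g subject to the follower best responding.
    return max(leader_policies, key=lambda x: g(x, follower_best_response(x)))

x_star = leader_optimum()
pi_star = follower_best_response(x_star)
print(x_star, pi_star, g(x_star, pi_star))  # expected: 2 0 6 for these illustrative utilities
```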

4 Game Theoretical Framework

Based on the analysis of the relationships among multiple agents, we summarize a general game theoretical framework in Fig. 11.5. In the framework, we suppose an iterative scenario where all the agents are able to take actions during each time interval. Within the same time interval, we further classify the agents into multiple levels. The agents at the top level act first; based on the observation of the actions in the upper levels, the agents in the lower levels behave correspondingly. Moreover, within each level, there are multiple agents taking actions simultaneously. Accordingly, a Stackelberg equilibrium is expected between every two levels, and a Nash equilibrium or correlated equilibrium is expected among the agents within one level.

Fig. 11.5 Game theoretical framework

The game theoretical framework can be regarded as a general structure for dealing with all kinds of multi-agent reinforcement learning problems. For further tests and evaluations, various multi-agent platforms have been put forward. For example, AlphaStar is a platform for simulating the behaviors of multiple agents in the StarCraft video game. The multi-agent connected autonomous driving (MACAD) platform (Palanisamy 2019) is provided for learning and adapting to the driving environment. Google Research Football (Kurach et al. 2019) is a platform that simulates football games for multiple autonomous agents. Based on multi-agent platforms for different kinds of scenarios, detailed game theoretical frameworks for multi-agent reinforcement learning can be proposed and analyzed for optimal strategies.