Definition of the Subject

Stochastic games, first introduced by Shapley [60], model dynamic interactions in which the environment changes in response to the behavior of the players. Formally, a stochastic game is a tuple \( { G = \langle N, S, (\mathcal{A}_i, A_i, u_i)_{i \in N}, q\rangle } \) where

  • N is a set of players.

  • S is a state space. If S is uncountable, it is supplemented with a σ‑algebra of measurable sets.

  • For every player \( { i \in N } \), \( { \mathcal{A}_i } \) is a set of actions for that player, and \( { A_i \colon S \to \mathcal{A}_i } \) is a set‐valued (measurable) function that assigns to each state \( { s \in S } \) the set of actions \( { A_i(s) } \) that are available to player i in state s. If \( { \mathcal{A}_i } \) is uncountable, it is supplemented with a σ‑algebra of measurable sets. Denote \( SA = \{(s,a) \colon s \in S, a=(a_i)_{i \in N}, a_i\in A_i(s) \enskip \forall i \in N\} \). This is the set of all pairs of a state and an action profile that is feasible at that state.

  • For every player \( { i \in N } \), \( { u_i \colon SA \to \mathbf{R} } \) is a (measurable) stage payoff function for player i.

  • \( q \colon SA \to \Delta(S) \) is a (measurable) transition function, where \( { \Delta(S) } \) is the space of probability distributions over S.

The game starts at an initial state \( s^1 \), and is played as follows. At each stage \( { t \in \mathbf{N} } \), each player \( { i \in N } \) chooses an action \( { a_i^t \in A_i(s^t) } \), receives the stage payoff \( { u_i(s^t,a^t) } \), where \( a^t = (a_i^t)_{i \in N} \), and the game moves to a new state \( { s^{t+1} } \) that is chosen according to the probability distribution \( { q(\cdot\mid s^t,a^t) } \).
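To make the stage-by-stage dynamics concrete, the following minimal Python sketch simulates a play of a finite stochastic game under stationary strategies; the data layout (dictionaries keyed by states, players, and action profiles) and function names are illustrative choices, not part of the formal model.

```python
import random

def simulate_play(payoff, transition, strategies, s1, T):
    """Simulate T stages of a finite stochastic game.

    payoff:     dict player -> function (state, action_profile) -> stage payoff
    transition: function (state, action_profile) -> dict next_state -> probability
    strategies: dict player -> function state -> dict action -> probability
                (stationary strategies, for simplicity)
    s1:         initial state
    Returns the list of triples (state, action profile, stage payoffs).
    """
    s, history = s1, []
    for t in range(T):
        # each player draws an action from his mixed action at the current state
        profile = tuple(
            random.choices(list(strategies[i](s).keys()),
                           weights=list(strategies[i](s).values()))[0]
            for i in sorted(strategies)
        )
        stage_payoffs = {i: payoff[i](s, profile) for i in payoff}
        history.append((s, profile, stage_payoffs))
        # the next state is drawn according to q(. | s, a)
        dist = transition(s, profile)
        s = random.choices(list(dist.keys()), weights=list(dist.values()))[0]
    return history
```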

A few comments are in order.

  1. A stochastic game lasts infinitely many stages. However, the model also captures finite interactions (of length t), by assuming that at stage t the play moves to an absorbing state with payoff 0 to all players.

  2. In particular, by setting \( { t=1 } \), we see that stochastic games are a generalization of matrix games (games in normal form), which are played only once.

  3. Stochastic games are also a generalization of repeated games, in which the players play the same matrix game over and over again. Indeed, a repeated game is equivalent to a stochastic game with a single state.

  4. Stopping games are also a special case of stochastic games. In these games every player has two actions in all states, continue and quit. As long as all players choose continue the stage payoff is 0; once at least one player chooses quit the game moves to an absorbing state.

  5. Markov decision problems (see, e. g., [49]) are stochastic games with a single player.

  6. The transition function q governs the evolution of the game. It depends on the actions of all players and on the current state, so that all the players influence the evolution of the game.

  7. The payoff function \( u_i \) of player i depends on the current state as well as on the actions chosen by all players. Thus, a player's payoff depends not only on that player's choice, but also on the behavior of the other players.

  8. Though we refer to the functions \( { (u_i)_{i \in N} } \) as “stage payoffs”, with the implicit assumption that each player tries to maximize his payoff, in some applications these functions describe a stage cost, and then the implicit assumption is that each player tries to minimize his cost.

  9. The action of a player at a given stage affects both his stage payoff and the evolution of the state variable, thereby affecting his future payoffs. These two, sometimes conflicting, effects make the optimization problem of the players quite intricate, and the analysis of the game challenging.

  10. The players receive a stage payoff at each stage. So far we have not mentioned how the players evaluate the infinite stream of stage payoffs that they receive, nor what their information is at each stage: Do they observe the current state? Do they observe the actions of the other players? These issues will be discussed later.

  11. The actions that are available to the players at each stage, the payoff functions, and the transition function all depend on the current state, and not on past play (that is, past states that the game visited and past actions that the players chose). This assumption is without loss of generality. Indeed, suppose that the actions available to the players at each stage, the payoff functions, and the transition function all depend on past play as well as on the current state. For every \( t\in \mathbf{N} \) let \( H_t \) be the set of all possible histories of length t, that is, all sequences of the form \( (s^1, a^1, s^2, a^2, \ldots, s^t) \), where \( s^k \in S \) for every \( k=2,3,\ldots,t \), \( a^k = (a^k_i)_{i \in N} \), and \( a^k_i \) is an action that is available to player i at stage k, for every \( { k=1,2,\ldots,t-1 } \). Then the game is equivalent to a game with state space \( { H \mathrel{\mathop:}= \bigcup_{t \in \mathbf{N}} H_t } \), in which the state variable captures past play, and the state at stage t lies in \( H_t \). In the new game, the sets of available actions, the payoff function, and the transition function depend on the current state rather than on all past play.

The interested reader is referred to [20,42,72] for further reading on stochastic games. We now provide a few applications.

Example 1 (Capital Accumulation ([7,18,19,34,45]))

Two (or more) agents jointly own a natural resource or a productive asset; at every period they have to decide the amount of the resource to consume. The amount that is not consumed grows by a known (or an unknown) fraction. Such a situation occurs, e. g., in fisheries: Fishermen from various countries fish in the same area, and each country sets a quota for its fishermen. Here the state variable is the current amount of the resource, an action is the amount of the resource to exploit in the current period, and the transition is influenced by the decisions of all the players, as well as possibly by the random growth of the resource.

Example 2 (Taxation ([14,48]))

A government sets a tax rate at every period. Each citizen decides at every period how much to work, and, from the total amount of money he or she has, how much to consume; the rest is saved for the next period, and grows by a known interest rate. Here the state is the amount of savings each citizen has; the stage payoff of a citizen depends on the amount of money that he or she consumed, on the amount of free time he or she has, and on the total amount of tax that the government collected. The stage payoff of the government may be the average stage payoff of the citizens, the amount of tax collected, or a mixture of the two.

Example 3 (Communication Network [58])

In a single‐cell system, one receiver and multiple uplink transmitters share a single, slotted, synchronous classical collision channel. Assume that all transmitted packets have the same length, and require one time unit, which is equal to one time slot, for transmission. Whenever a collision occurs, the users attempt to retransmit their packets in subsequent slots to resolve the collision and achieve reliable communication.

Here a state lists all relevant data for a given stage: e. g., the number of packets waiting at each transmitter, or the length of time each has been waiting to be transmitted. The players are the transmitters, and the action of each transmitter is which packet to transmit, if any. The stage cost may depend on the number of time slots that the transmitted packet waited, on the number of packets that have not been transmitted at that period, and possibly on additional variables. The transition depends on the actions chosen by the players, but it has a stochastic component, which captures the number of new packets that arrive at the various transmitters during every time slot.

Example 4 (Queues [1])

Individuals that require service have to choose whether to be served by a private slow service provider, or by a powerful public service provider. This situation arises, e. g., when jobs can be executed on either a slow personal computer or a fast mainframe. Here a state lists the current load of the public and private service providers, and the cost is the time to be served.

The importance of stochastic games stems from the wide range of applications they encompass. Many repeated interactions can be recast as stochastic games; the wide range of theoretical results that have been obtained provides insights that can help in analyzing specific situations and in suggesting proper behavior to the participants. For certain classes of games, algorithms have been developed that may be used to calculate such behavior.

Strategies, Evaluations and Equilibria

So far we have not described the information that the players have at each stage. In most of the chapter we assume that the players have complete information of past play; that is, at each stage t, they know the sequence \( { s^1,a^1,s^2,a^2,\ldots,s^{t} } \) of states that were visited in the past (including the current state) and the actions that were chosen by all players. This assumption is too strong for most applications, and in the sequel we will mention the consequences of its relaxation.

Since the players observe past play, a pure strategy for player i is a (measurable) function \( \sigma_i \) that assigns to every finite history \( { (s^1,a^1,s^2,a^2,\ldots,s^{t}) } \) an action \( { \sigma_i(s^1,a^1,s^2,a^2,\ldots,s^{t}) \in A_i(s^t) } \), with the interpretation that, at stage t, if the finite history \( { (s^1,a^1,s^2,a^2,\ldots,s^{t}) } \) occurred, player i plays the action \( { \sigma_i(s^1,a^1,s^2,a^2,\ldots,s^{t}) } \). If the player does not know the complete history, then a strategy for player i is a function that assigns to every possible information set an action that is available to the player when he has this information. A mixed strategy for player i is a probability distribution over the set of his pure strategies. The space of mixed strategies of player i is denoted by \( \Sigma_i \).

A simple class of strategies is the class of stationary strategies; a strategy \( \sigma_i \) for player i is stationary if \( { \sigma_i(s^1,a^1,s^2,a^2,\ldots,s^{t}) } \) depends only on the current state \( s^t \), and not on past play \( { s^1,a^1,s^2,a^2,\ldots,a^{t-1} } \). A stationary strategy of player i can be identified with an element \( { x = (x_s)_{s \in S} \in \times_{s \in S}\Delta(A_i(s)) } \), with the interpretation that player i plays the mixed action \( x_s \) whenever the current state is s. Denote by \( { X_i = \times_{s \in S}\Delta(A_i(s)) } \) the space of stationary strategies of player i.

There are three common ways to evaluate the infinite stream of payoffs that the players receive in a stochastic game: the finite‐horizon evaluation, in which a player considers the average payoff during the first T stages; the discounted evaluation, in which a player considers the discounted sum of his stage payoffs; and the limsup evaluation, in which a player considers the limsup of his long-run average payoffs. We now formally define these evaluations.

Every profile \( { \sigma = (\sigma_i)_{i \in N} } \) of mixed strategies, together with the initial state, induces a probability distribution \( { \mathbf{P}_{s_1,\sigma} } \) over the space of infinite plays \( { H_\infty \mathrel{\mathop:}= SA^{\mathbf{N}} } \). We denote the corresponding expectation operator by \( { \mathbf{E}_{s_1,\sigma} } \).

Definition 5

Let σ be a profile of mixed strategies. For every finite horizon \( { T \in \mathbf{N} } \), the T‑stage payoff under σ for player i is

$$ \gamma_i^T(s_1,\sigma) \mathrel{\mathop:}= \mathbf{E}_{s_1,\sigma}\left[ \frac{1}{T} \sum_{t=1}^T u_i(s^t,a^t)\right]\:. $$

For every discount factor \( { \lambda \in (0,1] } \), the λ-discounted payoff under σ for player i is

$$ \gamma_i^\lambda(s_1,\sigma) \mathrel{\mathop:}= \mathbf{E}_{s_1,\sigma}\left[ \lambda \sum_{t=1}^\infty (1-\lambda)^{t-1}u_i(s^t,a^t)\right]\:. $$

The limsup payoff under σ for player i is

$$ \gamma_i^\infty(s_1,\sigma) \mathrel{\mathop:}= \mathbf{E}_{s_1,\sigma}\left[ \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^T u_i(s^t,a^t)\right]\:. $$

The T‑stage payoff captures the situation in which the interaction lasts exactly T stages. The λ‑discounted evaluation captures the situation in which the game lasts “many” stages, and the player discounts stage payoffs: it is better to receive $1 today than tomorrow. The limsup payoff also captures the situation in which the game lasts “many” stages, but here the player does not discount his payoffs, and the payoff at each given stage is insignificant as compared to the payoff in all other stages. Equivalently, one could consider the liminf payoff, in which the player considers the liminf of the long-run average payoffs.
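For a single realized play, the three evaluations can be computed (or, for the infinite-horizon ones, approximated over a long finite horizon) directly from the stream of stage payoffs; expectations with respect to \( \mathbf{P}_{s_1,\sigma} \) can then be approximated by averaging over many simulated plays, e. g., using the simulation sketch above. The function names below are illustrative.

```python
def t_stage_payoff(stage_payoffs, T):
    """(1/T) * sum of the first T stage payoffs."""
    return sum(stage_payoffs[:T]) / T

def discounted_payoff(stage_payoffs, lam):
    """lam * sum_{t>=1} (1 - lam)^(t-1) * u_t, truncated at the length of the stream
    (the neglected tail is small when the stream is long relative to 1/lam)."""
    return lam * sum((1 - lam) ** t * u for t, u in enumerate(stage_payoffs))

def limsup_payoff(stage_payoffs, tail=1000):
    """Crude finite-horizon proxy for limsup_T (1/T) * sum_{t<=T} u_t:
    the largest running average among the last `tail` horizons."""
    total, running_averages = 0.0, []
    for t, u in enumerate(stage_payoffs, start=1):
        total += u
        running_averages.append(total / t)
    return max(running_averages[-tail:])
```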

As usual, an equilibrium is a vector of strategies such that no player can profit by a unilateral deviation. For every player i and every strategy profile \( { \sigma = (\sigma_i)_{i \in N} } \) we denote the strategy profile of all other players, except player i, by \( { \sigma_{-i} = (\sigma_j)_{j \neq i} } \).

Definition 6

Let \( { \varepsilon \geq 0 } \). A profile of strategies σ is a T-stage ε-equilibrium if

$$ \gamma_i^T(s_1,\sigma) \geq \gamma_i^T(s_1,\sigma^{\prime}_i,\sigma_{-i})-\varepsilon\:, \\ \forall s_1 \in S, \forall i \in N, \forall \sigma^{\prime}_i \in \Sigma_i \:. $$

It is a λ‑discounted ε-equilibrium if

$$ \gamma_i^\lambda(s_1,\sigma) \geq \gamma_i^\lambda(s_1,\sigma^{\prime}_i,\sigma_{-i})-\varepsilon\:, \\ \forall s_1 \in S, \forall i \in N, \forall \sigma^{\prime}_i \in \Sigma_i \:. $$

It is a limsup ε-equilibrium if

$$ \gamma_i^\infty(s_1,\sigma) \geq \gamma_i^\infty(s_1,\sigma^{\prime}_i,\sigma_{-i})-\varepsilon\:, \\ \forall s_1 \in S, \forall i \in N, \forall \sigma^{\prime}_i \in \Sigma_i \:. $$

The payoff that corresponds to an ε-equilibrium, that is, any one of the quantities \( { \gamma^T(s_1,\sigma) } \), \( { \gamma^\lambda(s_1,\sigma) } \), and \( { \gamma^\infty(s_1,\sigma) } \), is called an ε-equilibrium payoff at the initial state \( s_1 \).

As we will see below, when both state and action spaces are finite, a T‑stage and a λ‑discounted 0‑equilibrium exist. However, when the state or action spaces are infinite such a 0‑equilibrium may fail to exist, yet ε-equilibria may exist for every \( { \varepsilon > 0 } \).

As the length of the game T varies, or as the discount factor λ varies, the equilibrium strategy profile varies as well. A strategy profile that is an ε-equilibrium for every T sufficiently large and for every λ sufficiently small is called a uniform ε-equilibrium.

Definition 7

Let \( { \varepsilon > 0 } \). A strategy profile σ is a uniform ε-equilibrium if there are \( { T_0 \in \mathbf{N} } \) and \( { \lambda_0 \in (0,1) } \) such that for every \( { T \geq T_0 } \) the strategy profile σ is a T‑stage ε-equilibrium, and for every \( { \lambda \in (0,\lambda_0) } \) it is a λ‑discounted ε-equilibrium.

If for every \( { \varepsilon > 0 } \) the game has a (T‑stage, λ‑discounted, limsup or uniform) ε-equilibrium with corresponding payoff \( { g_\varepsilon } \), then any accumulation point of \( { (g_\varepsilon )_{\varepsilon > 0} } \) as ε goes to 0 is a (T‑stage, λ‑discounted, limsup or uniform) equilibrium payoff.

Zero-Sum Games

A two‐player stochastic game is zero-sum if \( u_1(s,a) + u_2(s,a) = 0 \) for every \( { (s,a) \in SA } \). As in matrix games, every two‐player zero-sum stochastic game admits at most one equilibrium payoff at every initial state \( s_1 \), which is termed the value of the game at \( s_1 \). Each player's strategy which is part of an ε-equilibrium is termed ε-optimal. The definition of ε-equilibrium implies that an ε-optimal strategy guarantees the value up to ε; for example, in the T‑stage evaluation, if \( \sigma_1 \) is an ε-optimal strategy of player 1, then for every strategy of player 2 we have

$$ \gamma_1^T(s_1,\sigma_1,\sigma_2) \geq v^T(s_1) - \varepsilon \:, $$

where \( { v^T(s_1) } \) is the T‑stage value at \( s_1 \).

In his seminal work, Shapley [60] presented the model of two‐player zero-sum stochastic games with finite state and action spaces, and proved the following.

Theorem 8 [60]

For every two‐player zero-sum stochastic game, the λ‑discounted value at every initial state exists. Moreover, both players have λ‑discounted 0‑optimal stationary strategies.

Proof

Let \( { \mathcal{V} } \) be the space of all functions \( { v \colon S \to \mathbf{R} } \). For every \( { v \in \mathcal{V} } \) and every state \( s \in S \) define a zero-sum matrix game \( { G_s^\lambda(v) } \) as follows:

  • The action spaces of the two players are \( { A_1(s) } \) and \( { A_2(s) } \) respectively.

  • The payoff function (that player 2 pays player 1) is

    $$ \lambda u_1(s,a) + (1-\lambda) \sum_{s^{\prime} \in S} q(s^{\prime} \mid s,a) v(s^{\prime})\:. $$

The game \( { G_s^\lambda(v) } \) captures the situation in which, after the first stage, the game terminates with a terminal payoff \( { v(s^{\prime}) } \), where \( { s^{\prime} } \) is the state reached after stage 1. Define an operator \( { \varphi \colon \mathcal{V} \to \mathcal{V} } \) as follows:

$$ \varphi_s(v) = \operatorname{val}(G_s^\lambda(v))\:, $$

where \( { \operatorname{val}(G_s^\lambda(v)) } \) is the value of the matrix game \( { G^\lambda_s(v) } \). Since the value operator is non‐expansive, it follows that the operator φ is contracting: \( \Vert\varphi(v) - \varphi(w)\Vert_\infty \leq (1-\lambda)\Vert v - w\Vert_\infty \), so that this operator has a unique fixed point \( { \widehat v^\lambda } \). One can show that the fixed point is the value of the stochastic game, and every strategy σ i of player i in which he plays, after each finite history \( { (s^1,a^1,s^2,a^2,\ldots,s^{t}) } \), an optimal mixed action in the matrix game \( { G_{s^t}^\lambda(\widehat v^\lambda) } \), is a λ‑discounted 0‑optimal strategy in the stochastic game.□

Example 9

Consider the following two‐player zero-sum game with three states, \( s_0 \), \( s_1 \), and \( s_2 \); each entry of the matrix indicates the payoff that player 2 (the column player) pays player 1 (the row player), which appears in the middle of the entry, and the transition (which is deterministic), which is denoted at the top-right corner of the entry.

$$ \begin{aligned} \begin{array}{l} {\begin{array}{c} \\ T\\ B\\ \\ \end{array} \begin{array}{cc} L & R \\ \framebox{$0\enskip s^2$\hphantom{00}}& \framebox{$1 \enskip s^1$\hphantom{00}}\\ \framebox{$1\enskip s^1$\hphantom{00}}& \framebox{$0 \enskip s^0$\hphantom{00}}\\ \quad \text{State}& s_2 \end{array}} \end{array} \begin{array}{c} \\ T\\ \\ \end{array} \begin{array}{c} L\\ \framebox{$1 \enskip s^1$\hphantom{00}}\\ \text{State}\enskip s_1 \end{array} \begin{array}{c} \\ T\\ \\ \end{array} \begin{array}{c} L\\ \framebox{$0 \enskip s^0$\hphantom{00}}\\ \text{State} \enskip s_0\\ \end{array}\:. \end{aligned} $$

The states \( s_0 \) and \( s_1 \) are absorbing: Once the play reaches one of these states it never leaves it. State \( s_2 \) is non‐absorbing. Stochastic games with a single non‐absorbing state are called absorbing games. For every \( v = (v_0,v_1,v_2) \in \mathcal{V} = \mathbf{R}^3 \) the game \( { G_{s_2}^\lambda(v) } \) is the following matrix game:

$$ \begin{aligned} &\begin{array}{c} \\ T\\ B\\ \\ \end{array} \begin{array}{cc} L & R \\ \framebox{$(1-\lambda)v_2$\hphantom{0000}} & \framebox{$\lambda+(1-\lambda)v_1$\hspace*{0.65mm}}\\ \framebox{$\lambda+(1-\lambda) v_1$} & \framebox{$(1-\lambda)v_0$\hphantom{0000}\hspace*{0.5mm}}\\[1.5mm] \text{The game}\enskip G^\lambda_{s_2} \end{array}\\ &\begin{array}{c} \\ T\\ \\ \end{array} \begin{array}{c} L \\ \framebox{$\lambda+(1-\lambda)v_1$}\\[1.5mm] \text{The game}\enskip G^\lambda_{s_1}\\ \end{array} \begin{array}{c} \\ T\\ \\ \end{array} \begin{array}{c} L \\ \framebox{$(1-\lambda)v_0$}\\[1.5mm] \text{The game}\enskip G^\lambda_{s_0}\\ \end{array}\:. \end{aligned} $$

The unique fixed point \( \widehat v \) of the operator φ must satisfy

  • \( { \widehat v_0 = \operatorname{val}(G_{s_0}^\lambda(\widehat v)) } \), so that \( { \widehat v_{s_0}^\lambda = \widehat v_0=0 } \);

  • \( { \widehat v_1 = \operatorname{val}(G_{s_1}^\lambda(\widehat v)) } \), so that \( { \widehat v_{s_1}^\lambda = \widehat v_1=1 } \);

  • \( { \widehat v_2 = \operatorname{val}(G_{s_2}^\lambda(\widehat v)) } \). By Theorem 8 both players have a stationary λ‑discounted 0‑optimal strategy. Denote by x (resp. y) the probability that the mixed action played at state \( s_2 \) under such a strategy of player 1 (resp. player 2) assigns to T (resp. L). Since we know that in the fixed point \( { \widehat v_0=0 } \) and \( { \widehat v_1=1 } \), \( { \widehat v_2 } \) must be the unique solution of

    $$ v_2 = y(1-\lambda)v_2 + (1-y) = y\:, $$

    so that \( { \widehat v_{s_2}^\lambda = \widehat v_2 = (1-\sqrt{\lambda})/(1-\lambda) } \). The 0‑optimal mixed action of player 2 at state \( s_2 \) thus plays L with probability \( y=\widehat v_2= (1-\sqrt{\lambda})/(1-\lambda) \), and the 0‑optimal mixed action of player 1, which plays T with probability \( x = \widehat v_2 = (1-\sqrt{\lambda})/(1-\lambda) \), is obtained as his optimal strategy in \( G^\lambda_{s_2}(\widehat v) \).
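As a quick numerical sanity check (an illustration, not part of the original example), one can verify that \( \widehat v_2 = (1-\sqrt{\lambda})/(1-\lambda) \) is indeed the value of the \( 2\times 2 \) matrix game \( G^\lambda_{s_2}(\widehat v) \) with \( \widehat v_0 = 0 \) and \( \widehat v_1 = 1 \), for instance for \( \lambda = 0.19 \):

```python
import math

lam = 0.19                                   # any discount factor in (0, 1]
v2 = (1 - math.sqrt(lam)) / (1 - lam)        # claimed fixed point at state s_2

# The matrix game G^lambda_{s_2}(v) with v_0 = 0, v_1 = 1:
a, b = (1 - lam) * v2, lam + (1 - lam) * 1   # row T:  (T,L), (T,R)
c, d = lam + (1 - lam) * 1, (1 - lam) * 0    # row B:  (B,L), (B,R)

# value of a 2x2 game with no pure saddle point (as here): (ad - bc) / (a + d - b - c)
value = (a * d - b * c) / (a + d - b - c)
print(abs(value - v2) < 1e-12)               # True: v2 is indeed the fixed point
```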

Bewley and Kohlberg [11] proved that when the state and action spaces are finite, the function \( { \lambda \mapsto v^\lambda_s } \), which assigns to every state s and every discount factor λ the λ‑discounted value at the initial state s, is a Puiseux function; that is, it has a representation \( { v^\lambda_s = \sum_{k=K}^\infty a_k \lambda^{k/M} } \) that is valid for every \( { \lambda \in (0,\lambda_0) } \) for some \( { \lambda_0 > 0 } \), where M is a natural number, K is a non‐negative integer, and \( { (a_k)_{k=K}^\infty } \) are real numbers. In particular, the function \( { \lambda \mapsto v^\lambda_s } \) is monotone in a neighborhood of 0, and its limit as λ goes to 0 exists. This result turned out to be crucial in the subsequent study of games with finitely many states and actions.
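For instance, the discounted value at state \( s_2 \) computed in Example 9 already exhibits such a fractional-power expansion around \( \lambda = 0 \):

$$ v^\lambda_{s_2} = \frac{1-\sqrt{\lambda}}{1-\lambda} = \big(1-\lambda^{1/2}\big)\sum_{k=0}^{\infty}\lambda^{k} = 1 - \lambda^{1/2} + \lambda - \lambda^{3/2} + \lambda^{2} - \cdots\:, $$

a Puiseux series with \( M=2 \) and \( K=0 \), whose limit as λ goes to 0 is 1.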

Shapley's work has been extended to general state and action spaces; for a recent survey see [46]. The tools developed in [46], together with a dynamic programming argument, prove that under proper conditions on the payoff function and on the transitions the two‐player zero-sum stochastic game has a T‑stage value.

Maitra and Sudderth [35] proved that the limsup value exists in a very general setup. Their proof follows closely that of Martin [36] for the determinacy of Blackwell games.

The study of the uniform value emanated from an example, called the “Big Match”, due to Gillette [28], that was solved by Blackwell and Ferguson [13].

Example 10

Consider the following stochastic game with two absorbing states and one non‐absorbing state.

$$ \begin{array}{c} \\ T\\ B\\ \\ \end{array} \begin{array}{cc} L & R \\ \framebox{$0 \enskip s^2$\hphantom{00}} & \framebox{$1 \enskip s^2$\hphantom{00}}\\ \framebox{$1 \enskip s^1$\hphantom{00}} & \framebox{$0 \enskip s^0$\hphantom{00}}\\ \text{State}\enskip s_2 \end{array} \begin{array}{c} \\ T\\ \\ \end{array} \begin{array}{c} L\\ \framebox{$1 \enskip s^1$\hphantom{00}}\\ \text{State}\enskip s_1 \end{array} \begin{array}{c} \\ T\\ \\ \end{array} \begin{array}{c} L\\ \framebox{$0 \enskip s^0$\hphantom{00}}\\ \text{State}\enskip s_0 \end{array} $$

Suppose the initial state is \( s_2 \). As long as player 1 plays T the play remains at \( s_2 \); once he plays B the play moves to either \( s_0 \) or \( s_1 \), and is effectively terminated. By finding the fixed point of the operator φ one can show that the discounted value at the initial state \( s_2 \) is \( { \frac{1}{2} } \), and a λ‑discounted stationary 0‑optimal strategy for player 2 is \( [\frac{1}{2}(L),\frac{1}{2}(R)] \), the mixed action that plays L and R with probability \( \frac{1}{2} \) each. Indeed, if player 1 plays T then the expected stage payoff is \( { \frac{1}{2} } \) and play remains at \( s_2 \), while if player 1 plays B then the game moves to an absorbing state, and the expected stage payoff from that stage onwards is \( { \frac{1}{2} } \). In particular, this strategy guarantees \( { \frac{1}{2} } \) for player 2 both in the limsup evaluation and uniformly. A λ‑discounted 0‑optimal strategy for player 1 is \( [\frac{1}{1+\lambda}(T), \frac{\lambda}{1+\lambda}(B)] \).

What can player 1 guarantee in the limsup evaluation and uniformly? If player 1 plays the stationary strategy \( { [x(T),(1-x)(B)] } \) that plays at each stage the action T with probability x and the action B with probability \( { 1-x } \), then player 2 has a reply that ensures that the limsup payoff is 0: If \( { x=1 } \) and player 2 always plays L, the payoff is 0 at each stage; if \( { x < 1 } \) and player 2 always plays R, the payoff is 1 until the play moves to \( s_0 \), and then it is 0 forever. Since player 1 plays the action B with probability \( { 1-x > 0 } \) at each stage, the distribution of the stage in which play moves to \( s_0 \) is geometric. Therefore, the limsup payoff is 0, and if λ is sufficiently small, the discounted payoff is close to 0.

One can verify that if player 1 uses a bounded‐recall strategy, that is, a strategy that uses only the last k actions that were played, then player 2 has a reply that guarantees that the limsup payoff is 0, and that the discounted payoff is close to 0, provided λ is close to 0. Thus, under the limsup evaluation and uniformly, finite memory cannot guarantee more than 0 in this game (see also [27]).

Intuitively, player 1 would like to condition the probability of playing B on the past behavior of player 2: If in the past player 2 played the action L more often than the action R, he would like to play B with higher probability; if in the past player 2 played the action R more often than the action L, he would like to play T with higher probability. Blackwell and Ferguson [13] constructed a family of good strategies \( { \{\sigma_1^M, M \in \mathbf{N}\} } \) for player 1. The parameter M determines the amount that the strategy guarantees: The strategy \( { \sigma_1^M } \) guarantees a limsup payoff and a discounted payoff of \( { \frac{M}{2M+1} } \), provided the discount factor is sufficiently small. In other words, player 1 cannot guarantee \( { \frac{1}{2} } \), but he may guarantee an amount as close to \( { \frac{1}{2} } \) as he wishes by choosing M to be sufficiently large. The strategy \( { \sigma_1^M } \) is defined as follows: At stage t, play B with probability \( { \frac{1}{(M + r_t - l_t)^2} } \), where \( l_t \) is the number of stages before stage t in which player 2 played L, and \( r_t \) is the number of stages before stage t in which player 2 played R.

Since \( { r_t+l_t=t-1 } \) one has \( { r_t-l_t=2r_t -(t-1) } \). The quantity \( r_t \) is the total payoff that player 1 received in the first \( { t-1 } \) stages if he played T in those stages (and the game was not absorbed). Thus, this total payoff is a linear function of the difference \( { r_t-l_t } \). When presented this way, the strategy \( { \sigma_1^M } \) depends on that total payoff. Observe that as \( r_t \) increases, \( { r_t-l_t } \) increases as well, and the probability of playing B decreases.
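The following Python sketch (illustrative function and parameter names, with an arbitrarily chosen simulation horizon) simulates \( \sigma_1^M \) against a given strategy of player 2 in the Big Match, playing B at each stage with probability \( 1/(M + r_t - l_t)^2 \); a small clamp guards the simulation when the count difference becomes non-positive.

```python
import random

def big_match_sigma1_M(M, opponent, num_stages=100_000):
    """Simulate sigma_1^M of Example 10 against a strategy of player 2.

    opponent: function mapping player 2's own past actions (a list of 'L'/'R')
              to the action 'L' or 'R' played at the current stage.
    Returns player 1's average payoff over the simulated horizon.
    """
    l, r = 0, 0                      # how often player 2 played L and R so far
    total, history = 0.0, []
    for t in range(num_stages):
        a2 = opponent(history)
        d = M + r - l                # play B with probability 1/(M + r_t - l_t)^2
        p_B = 1.0 if d <= 1 else 1.0 / d ** 2   # clamp: a guard for the simulation only
        if random.random() < p_B:    # player 1 plays B: the play is absorbed
            absorbing_payoff = 1.0 if a2 == 'L' else 0.0
            total += absorbing_payoff * (num_stages - t)
            break
        total += 1.0 if a2 == 'R' else 0.0      # player 1 plays T
        history.append(a2)
        if a2 == 'L':
            l += 1
        else:
            r += 1
    return total / num_stages

# In expectation the payoff is 1/2 when player 2 mixes uniformly at every stage, e.g.:
# print(big_match_sigma1_M(M=20, opponent=lambda hist: random.choice('LR')))
```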

Mertens and Neyman [38] generalized the idea presented at the end of Example 10 to stochastic games with finite state and action spaces.

Theorem 11

If the state and action spaces of a two‐player zero-sum stochastic game are finite, the game has a uniform value \( { v^0_s } \) at every initial state \( { s \in S } \). Moreover, \( { v^0_s = \lim_{\lambda \to 0} v^\lambda_s = \lim_{T \to \infty} v^T_s } \).

In their proof, Mertens and Neyman describe a uniform ε-optimal strategy. In this strategy the player keeps a parameter, \( \lambda_t \), which is a fictitious discount factor to use at stage t. This parameter changes at each stage as a function of the stage payoff; if the stage payoff at stage t is high then \( { \lambda_{t+1} < \lambda_t } \), whereas if the stage payoff at stage t is low then \( { \lambda_{t+1} > \lambda_t } \). The intuition is as follows. As mentioned before, in stochastic games there are two forces that influence the player's behavior: He tries to get high stage payoffs, while keeping future prospects high (by playing in such a way that the next stage that is reached is favorable). When considering the λ‑discounted payoff there is a clear comparison between the importance of the two forces: The weight of the stage payoff is λ and the weight of future prospects is \( { 1-\lambda } \); the lower the discount factor, the more weight is given to the future. When considering the uniform value (or the uniform equilibrium) the weight of the stage payoff is 0. However, if the player never attempts to receive a high stage payoff, the overall payoff in the game will not be high. Therefore, the player keeps a fictitious discount factor; if past payoffs are low and do not meet the expectation, the player increases the weight of the stage payoff by increasing the fictitious discount factor; if past payoffs are high, the player increases the weight of the future by lowering this fictitious discount factor.
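The following schematic sketch (an illustration only, with made-up parameter names; the actual Mertens–Neyman update rule is different and carefully calibrated) conveys this bookkeeping: the player compares the realized stage payoff with a target and adjusts the fictitious discount factor accordingly.

```python
def fictitious_discount_update(lam, stage_payoff, target, step=0.01,
                               lam_min=1e-6, lam_max=0.5):
    """Schematic update of the fictitious discount factor lambda_t.

    If the realized stage payoff falls short of the target, increase lambda
    (more weight on the present); if it exceeds the target, decrease lambda
    (more weight on the future).  The actual Mertens-Neyman rule is more
    delicate; this function is only meant to convey the idea.
    """
    if stage_payoff < target:
        lam = min(lam_max, lam * (1 + step))
    else:
        lam = max(lam_min, lam * (1 - step))
    return lam
```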

Multi‐Player Games

Takahashi [75] and Fink [21] extended Shapley's [60] result to discounted equilibria in non-zero-sum games.

Theorem 12

Every stochastic game with finite state and action spaces has a λ‑discounted equilibrium in stationary strategies.

Proof

The proof utilizes Kakutani's fixed point theorem [31]. Let \( M = \max_{i,s,a} \vert u_i(s,a)\vert \) be a bound on the absolute values of the payoffs. Set \( X = \times_{i \in N,s \in S} \left(\Delta(A_i(s)) \times [-M,M]\right) \). A point \( x=(x^A_{i,s}, x^V_{i,s})_{i \in N, s \in S} \in X \) is a collection of one mixed action and one payoff to each player at every state. For every \( v = (v_i)_{i \in N}\in [-M,M]^{N \times S} \) and every \( s \in S \) define a matrix game \( { G_s^\lambda(v) } \) as follows:

  • The action space of each player i is \( { A_i(s) } \);

  • The payoff to player i is

    $$ \lambda u_i(s,a) + (1-\lambda) \sum_{s^{\prime} \in S} q(s^{\prime} \mid s,a) v_i(s^{\prime})\:. $$

We define a set‐valued function \( { \varphi \colon X \to X } \) as follows.

  • For every \( { i \in N } \) and every \( { s \in S } \), \( { \varphi^A_{i,s} } \) is the set of all best responses of player i to the strategy vector \( { x_{-i,s} \mathrel{\mathop:}= (x_{j,s})_{j \neq i} } \) in the game \( { G_s^\lambda(v) } \). That is,

    $$ \begin{aligned} \varphi^A_{i,s}(x,v) &\mathrel{\mathop:}= \operatorname{argmax}_{y_{i,s} \in \Delta(A_i(s))} \Big(\lambda u_i(s,y_{i,s},x_{-i,s})\\ &\quad \: + (1-\lambda)\sum_{s^{\prime} \in S} q(s^{\prime} \mid s,y_{i,s},x_{-i,s}) v_{i,s^{\prime}}\Big)\:. \end{aligned} $$
  • For every \( { i \in N } \) and every \( { s \in S } \), \( { \varphi^V_{i,s}(x,v) } \) is the maximal payoff for player i in the game \( { G_s^\lambda(v) } \), when the other players play \( { x_{-i} } \):

    $$ \varphi^V_{i,s}(x,v) \mathrel{\mathop:}= \max_{y_{i,s} \in \Delta(A_i(s))} \left(\lambda u_i(s,y_{i,s},x_{-i,s}) + (1-\lambda) \sum_{s^{\prime} \in S} q(s^{\prime} \mid s,y_{i,s},x_{-i,s}) v_{i,s^{\prime}}\right)\:. $$

The set‐valued function φ has convex and non-empty values and its graph is closed, so that by Kakutani's fixed point theorem it has a fixed point. It turns out that every fixed point of φ defines a λ‑discounted equilibrium in stationary strategies.□
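A fixed point of φ can also be checked directly: given a candidate stationary profile, the λ‑discounted payoff vector of each player solves a linear system (the policy-evaluation equations), and, by the standard one-stage-deviation argument for discounted games, it then suffices to verify that no player can gain by deviating at a single stage at any state. The sketch below is an illustrative verification routine for finite games (the data layout is an assumption, not notation from the text), not the fixed-point computation itself.

```python
import numpy as np
from itertools import product

def discounted_payoff_vector(u_i, q, x, lam):
    """lambda-discounted payoff of a single player under a stationary profile x.

    Illustrative data layout:
      u_i[s][a] : stage payoff of the player at state s under action profile a (a tuple)
      q[s][a]   : list of transition probabilities to each state
      x[i][s]   : dict mapping each action of player i at state s to its probability
    Solves the policy-evaluation system  gamma = lam * u(x) + (1 - lam) * Q(x) gamma.
    """
    S = len(q)
    players = sorted(x)
    def profiles(s):
        return product(*[list(x[i][s].keys()) for i in players])
    def prob(s, a):
        p = 1.0
        for i, a_i in zip(players, a):
            p *= x[i][s][a_i]
        return p
    u_bar = np.array([sum(prob(s, a) * u_i[s][a] for a in profiles(s))
                      for s in range(S)])
    Q_bar = np.array([[sum(prob(s, a) * q[s][a][s2] for a in profiles(s))
                       for s2 in range(S)] for s in range(S)])
    return np.linalg.solve(np.eye(S) - (1 - lam) * Q_bar, lam * u_bar)
```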

This result has been extended to general state and action spaces by various authors. These results assume strong continuity conditions on either the payoff function or the transition function, see, e. g., [39,44,62].

As in the case of zero-sum games, a dynamic programming argument shows that under a strong continuity assumption on the payoff function or on the transitions a T‑stage equilibrium exists.

Regarding the existence of the limsup equilibrium and the uniform equilibrium little is known. The most significant result in this direction is due to Vieille [77,78], who proved that every two‐player stochastic game with finite state and action spaces has a uniform ε-equilibrium for every \( { \varepsilon > 0 } \). This result has been proved for other classes of stochastic games, see, e. g., [6,24,25,61,63,65,76]. Several influential works in this area are [22,32,71,79]. Most of the papers mentioned above rely on the vanishing discount factor approach, which constructs a uniform ε-equilibrium by studying a sequence of λ‑discounted equilibria as the discount factor goes to 0.

For games with general state and action spaces, a limsup equilibrium exists under an ergodicity assumption on the transitions, see, e. g., Nowak [44], Remark 4, and Jaskiewicz and Nowak [30].

A game has perfect information if there are no simultaneous moves, and the players observe past play. Existence of an equilibrium in this case was proven by Mertens [37] in a very general setup.

Correlated Equilibrium

The notion of correlated equilibrium was introduced by Aumann [8,9]; see also the entry Correlated Equilibria and Communication in Games by Forges. A correlated equilibrium is an equilibrium of an extended game, in which each player receives at the outset of the game a private signal such that the vector of signals is chosen according to a known joint probability distribution. In repeated interactions, such as in stochastic games, there are two natural notions of correlated equilibria: (a) each player receives one signal at the outset of the game (normal‐form correlated equilibrium); (b) each player receives a signal at each stage (extensive‐form correlated equilibrium). It follows from Forges [26] that when the state and action sets are finite, the set of all correlated T‑stage equilibrium payoffs (either normal‐form or extensive‐form) is a polytope.

Nowak and Raghavan [47] proved the existence of an extensive‐form correlated discounted equilibrium under weak conditions on the state and action spaces. In their construction, the strategies of the players are stationary, and so is the distribution according to which the signals are chosen after each history; both depend only on the current state, rather than on the whole past play. Roughly, their approach is to apply Kakutani's fixed point theorem to the set‐valued function that assigns to each game \( { G^\lambda_s(v) } \) the set of all correlated equilibrium payoffs in this game, which is convex and compact.

Solan and Vieille [66] proved the existence of an extensive‐form correlated uniform equilibrium payoff when the state and action spaces are finite. Their approach is to let each player play his uniform optimal strategy in a zero-sum game in which all other players try to minimize his payoff. Existence of a normal‐form correlated equilibrium was proved for the class of absorbing games [68].

Solan [64] characterized the set of extensive‐form correlated equilibrium payoffs for general state and action spaces and a general evaluation of the stage payoffs, and provided a sufficient condition that ensures that the set of normal‐form correlated equilibrium payoffs coincides with the set of extensive‐form correlated equilibrium payoffs.

Imperfect Monitoring

So far it has been assumed that at each stage the players know the past play. There are cases in which this assumption is too strong; in some cases players do not know the complete description of the current state (Examples 3 and 4), and in others players do not fully observe the actions of all other players (Examples 2, 3 and 4). For a most general description of stochastic games see Chapter IV in Mertens, Sorin and Zamir [40] and Coulomb [17].

In the study of the discounted equilibrium, the T‑stage equilibrium or the limsup equilibrium, one may consider the game as a one-shot game: The players simultaneously choose strategies, and the payoff is either the discounted payoff, the T‑stage payoff or the limsup payoff. If the strategy spaces of the players are compact (e. g., if the state and action spaces are finite), and if the payoff is upper-semi‐continuous in each player's strategy, keeping the strategies of the other players fixed, then an equilibrium exists. This approach can be used successfully for the discounted equilibrium or the T‑stage equilibrium under weak conditions (see, e. g., [4]), and may be used for the limsup equilibrium under a proper ergodicity condition.

Whenever there exists an equilibrium in stationary strategies (e. g., a discounted equilibrium in games with finitely many states and actions) the only information that players need in order to follow the equilibrium strategies is the current state. In particular, they need not observe past actions of the other players. As we now show, in the “Big Match” (Example 10) the limsup value and the uniform value may fail to exist when each player does not observe the past actions of the other player.

Example 13 (Example 10: Continued.)

Assume that neither player observes the actions of the other player, and assume that the initial state is \( s_2 \). Player 2 can still guarantee \( { \frac{1}{2} } \) in the limsup evaluation by playing the stationary strategy \( { [\frac{1}{2}(L),\frac{1}{2}(R)] } \). One can show that for every strategy of player 2, player 1 has a reply such that the limsup payoff is at least \( { \frac{1}{2} } \). In other words, \( { \inf_{\sigma_2}\sup_{\sigma_1} \gamma^\infty(s_2,\sigma_1,\sigma_2) =\frac{1}{2} } \). We now argue that \( { \sup_{\sigma_1}\inf_{\sigma_2} \gamma^\infty(s_2,\sigma_1,\sigma_2)= 0 } \). Indeed, fix a strategy \( \sigma_1 \) for player 1, and \( { \varepsilon > 0 } \). Let θ be sufficiently large such that the probability that under \( \sigma_1 \) player 1 plays B for the first time after stage θ is at most ε. Observe that as t increases, the probability that player 1 plays B for the first time after stage t decreases to 0, so that such a θ exists. Consider the following strategy \( \sigma_2 \) of player 2: Play R up to stage θ, and play L from stage \( { \theta+1 } \) on. By the definition of θ, either player 1 plays B before or at stage θ, and then the game moves to \( s_0 \), and the payoff is 0 at each stage thereafter; or player 1 plays T at each stage, and then the stage payoff after stage θ is 0; or, with probability less than ε, player 1 plays B for the first time after stage θ, the play moves to \( s_1 \), and the payoff is 1 thereafter. Thus, the limsup payoff is at most ε. A similar analysis shows that

$$ \begin{aligned} \inf_{\sigma_2}\sup_{\sigma_1} \gamma^\lambda(s_2,\sigma_1,\sigma_2) &=\tfrac{1}{2}\:, \\ \sup_{\sigma_1}\inf_{\sigma_2} \lim_{\lambda\to 0}\gamma^\lambda(s_2,\sigma_1,\sigma_2)&= 0\:, \end{aligned} $$

so that the uniform value does not exist either.

This example shows that in general the limsup value and the uniform value need not exist when the players do not observe past play. Though in general the value (and therefore also an equilibrium) need not exist, in many classes of stochastic games the value and an equilibrium do exist, even in the presence of imperfect monitoring.

Rosenberg et al. [55] and Renault [54] showed that the uniform value exists in the one-player setup (Markov decision problems), in which the player receives partial information regarding the current state. Thus, a single decision maker who faces a dynamic situation and does not fully observe the state of the environment can play in such a way that guarantees a high payoff, provided the interaction is sufficiently long or the discount factor is sufficiently low.

Altman et al. [5,6] and Flesch et al. [24] studied stochastic games in which each player has a “private” state, which only he can observe, and the state of the world is composed of the vector of private states. Altman et al. [5,6] studied the situation in which players do not observe the actions of the other players, and Flesch et al. [24] studied the situation in which players do observe each other's payoffs. Such games arise naturally in wireless communication (see [5]); take, for example, several mobiles that periodically send information to a base station. The private state of a mobile may depend, e. g., on its exact physical environment, and it determines the power attenuation between the mobile and the base station. The throughput (the amount of bits per second) that a mobile can send to the base station depends on the power attenuations of all the mobiles. Finally, the stage payoff is the stage power consumption.

Rosenberg et al. [57] studied the extreme case of two-player zero-sum games in which the players observe neither the current state nor the action of the other player, and proved that the uniform value does exist in two classes of games, which capture the behavior of certain communication protocols. Classes of games in which the actions are observed but the state is not observed were studied, e. g., by Sorin [69,70], Sorin and Zamir [74], Krausz and Rieder [33], Flesch et al. [23], Rosenberg et al. [56], and Renault [52,53]. For additional results, see [16,73].

Algorithms

There are two kinds of algorithms: Those that terminate in a finite number of steps, and those that iterate and approximate solutions. Both kinds of algorithms were devised to calculate the value and optimal strategies (or equilibria) in stochastic games.

It is well known that the value of a two‐player zero-sum matrix game and optimal strategies for the two players can be calculated efficiently using a linear program. Equilibria in two‐player non-zero-sum games can be calculated by the Lemke–Howson algorithm, which is usually efficient; however, its worst-case running time is exponential in the number of pure strategies of the players [59]. Unfortunately, to date there are no efficient algorithms to calculate either the value in zero-sum stochastic games, or equilibria in non-zero-sum games. Moreover, as Example 9 shows, the discounted value may be irrational even when the discount factor and the data of the game (payoffs and transitions) are rational, so it is not clear whether linear programming methods can be used to calculate the value of a stochastic game. Nevertheless, linear programming methods were used to calculate the discounted and uniform value of several classes of stochastic games, see [20,50]. Other methods that were used to calculate the value or equilibria in discounted stochastic games include fictitious play [80], value iteration, policy improvement, general methods to find the maximum of a function (see [20,51]), a homotopy method [29], and algorithms to solve sentences in formal logic [15,67].

Additional and Future Directions

The research on stochastic games extends in directions beyond those mentioned in earlier sections. We mention a few here. Approximation of games with infinite state and action spaces by finite games was discussed by Whitt [81], and further developed by Nowak [43]. Stochastic games in continuous time have also been studied, as well as hybrid models that include both discrete and continuous aspects, see, e. g., [2,10].

Among the many directions of future research in this area, we mention here but a few. One challenging question is the existence of a uniform equilibrium and a limsup equilibrium in multi‐player stochastic games with finite state and action spaces. Another is the development of efficient algorithms that calculate the value of two‐player zero-sum games. A third direction concerns the identification of applications that can be recast in the framework of stochastic games, and that can be successfully analyzed using the theoretical tools that the literature has developed. Another problem of interest is the characterization of approachable and excludable sets in stochastic games with vector payoffs (see [12] for the presentation of matrix games with vector payoffs, and [41] for partial results regarding this problem).