
1 Introduction

Stochastic games extend the model of strategic form games to situations in which the environment changes in time in response to the players’ actions. They also extend the Markov decision model to competitive situations with more than one decision maker. The choices made by the players have two effects. First, together with the current state, the players’ actions determine the immediate payoff that each player receives. Second, the current state and the players’ actions influence the chance of moving to a new state, where future payoffs will be received. Therefore, each player has to take into account both the current payoffs and the possible evolution of the state. This issue is also present in one-player sequential decision problems, but the presence of additional players who have their own goals adds complexity to the analysis of the situation. Stochastic games were introduced in a seminal paper of Shapley (1953). He considered zero-sum dynamic games with finite state and action spaces and a positive probability of termination. His model is often interpreted as a stochastic game with discounted payoffs. Gillette (1957) studied a similar model but with zero stop probability. These two papers inspired an enormous stream of research in dynamic game theory and Markov decision processes. There is a large variety of mathematical tools used in studying stochastic games. For example, the asymptotic theory of stochastic games is based on algebraic methods such as the theory of semi-algebraic functions. On the other hand, the theory of stochastic games with general state spaces has a direct connection to descriptive set theory. Furthermore, the algorithmic aspects of stochastic games yield interesting combinatorial problems. Other basic mathematical tools come from martingale limit theory. There is also a well-known link between nonzero-sum stochastic games and the theory of fixed points in infinite-dimensional spaces. The principal goal of this chapter is to provide a comprehensive overview of the aforementioned aspects of zero-sum stochastic games.

To begin a literature review, let us mention that a basic and clear introduction to dynamic games is given in Başar and Olsder (1995) and Haurie et al. (2012). Mathematical programming problems occurring in algorithms for stochastic games with finite state and action spaces are broadly discussed in Filar and Vrieze (1997). Studies of stochastic games by methods developed in gambling theory, with many informative examples, are described in Maitra and Sudderth (1996). Advanced material on repeated and stochastic games is presented in Sorin (2002) and Mertens et al. (2015). The two edited volumes by Raghavan et al. (1991) and Neyman and Sorin (2003) survey a large part of the area of stochastic games developed over the almost fifty years since Shapley’s seminal work. This chapter and the chapter of Jaśkiewicz and Nowak (2018) provide a very broad overview of state-of-the-art results on stochastic games. Moreover, the surveys given by Mertens (2002), Vieille (2002), Solan (2009), Krishnamurthy and Parthasarathy (2011), Solan and Vieille (2015), and Laraki and Sorin (2015) constitute relevant complementary material.

There is a great variety of applications of stochastic games in science and engineering. Here, we only mention the ones concerning zero-sum games. For instance, Altman and Hordijk (1995) applied stochastic games to queueing models. On the other hand, wireless communication networks were examined in terms of stochastic games by Altman et al. (2005). For applications of stochastic games to models arising in operations research, the reader is referred to Charnes and Schroeder (1967), Winston (1978), Filar (1985), or Patek and Bertsekas (1999). There is also a growing literature on applications of zero-sum stochastic games in theoretical computer science (see, for instance, Condon (1992), de Alfaro et al. (2007), and Kehagias et al. (2013) and the references cited therein). Applications of zero-sum stochastic games to economic growth models and robust Markov decision processes are described in Sect. 3, which is mainly based on the paper of Jaśkiewicz and Nowak (2011). The class of possible applications of nonzero-sum stochastic games is larger than in the zero-sum case. They are discussed in our second survey in this handbook.

The chapter is organized as follows: In Sect. 2 we describe some basic material needed for a study of stochastic games with general state spaces. It incorporates auxiliary results on set-valued mappings (correspondences), their measurable selections, and the measurability of the value of a parameterized zero-sum game. This part is naturally redundant in a study of stochastic games with discrete state and action spaces. Sect. 3 is devoted to a general maxmin decision problem in discrete time with a Borel state space. The main motivation is to show its applications to stochastic economic growth models and some robust decision problems in macroeconomics. Therefore, the utility (payoff) function in the illustrative examples is unbounded and the transition probability function is weakly continuous. In Sect. 4 we consider standard discounted and positive Markov games with Borel state spaces and simultaneous moves of the players. Sect. 5 is devoted to semi-Markov games with a Borel state space and weakly continuous transition probabilities satisfying some stochastic stability assumptions. In the limit-average payoff case, two criteria are compared, the time-average and the ratio-average payoff criterion, and a question of path optimality is discussed. Furthermore, stochastic games with a general Borel payoff function on the space of infinite plays are examined in Sect. 6. This part includes results on games with limsup payoffs and limit-average payoffs as special cases. In Sect. 7 we present some basic results from the asymptotic theory of stochastic games, mainly with finite state spaces, including the notion of uniform value. This part of the theory exhibits nontrivial algebraic aspects. Some algorithms for solving zero-sum stochastic games of different types are described in Sect. 8. In Sect. 9 we provide an overview of zero-sum stochastic games with incomplete information and imperfect monitoring. This is a vast subarea of stochastic games, and therefore, we deal only with selected recent contributions. Stochastic games with vector payoffs and Blackwell’s approachability concept are discussed briefly in Sect. 10. Finally, Sect. 11 gives a short overview of stochastic Markov games in continuous time. We mainly focus on Markov games with short-stage duration. This theory is based on an asymptotic analysis of discrete-time games when the stage duration tends to zero.

2 Preliminaries

Let \(\mathbb {R}\) be the set of all real numbers, \( \underline {\mathbb {R}} = \mathbb {R}\cup \{-\infty \}\) and \(\mathbb {N}=\{1,2,\ldots \}\). By a Borel space X we mean a nonempty Borel subset of a complete separable metric space endowed with the relative topology and the Borel σ-algebra \({ \mathcal {B}}(X)\). We denote by \(\Pr (X)\) the set of all Borel probability measures on X. Let \(\mathcal {B}_\mu (X)\) be the completion of \({\mathcal {B}} (X)\) with respect to some \(\mu \in \Pr (X)\). Then \(\mathcal {U}(X)= \cap _{\mu \in \Pr (X)} \mathcal {B}_\mu (X)\) is the σ-algebra of all universally measurable subsets of X. There are a couple of ways to define analytic sets in X (see Chap. 12 in Aliprantis and Border 2006 or Chap. 7 in Bertsekas and Shreve 1996). One can say that C ⊂ X is an analytic set if and only if there is a Borel set D ⊂ X × X whose projection on X is C. If X is uncountable, then there exist analytic sets in X which are not Borel (see Example 12.33 in Aliprantis and Border 2006). Every analytic set C ⊂ X belongs to \(\mathcal {U}(X)\). A function \(\psi :X\to\underline { \mathbb {R}}\) is called upper semianalytic (lower semianalytic) if for any \(c\in \mathbb {R}\) the set {x ∈ X : ψ(x) ≥ c} ({x ∈ X : ψ(x) ≤ c}) is analytic. It is known that ψ is both upper and lower semianalytic if and only if ψ is Borel measurable. Let Y  be also a Borel space. A mapping ϕ : X → Y  is universally measurable if \(\phi ^{-1}(C) \in \mathcal {U}(X)\) for each \(C\in \mathcal {B}(Y)\).

A set-valued mapping x → Φ(x) ⊂ Y  (also called a correspondence from X to Y ) is upper semicontinuous (lower semicontinuous) if the set Φ−1(C) := {x ∈ X : Φ(x) ∩ C ≠ ∅} is closed (open) for each closed (open) set C ⊂ Y . Φ is continuous if it is both lower and upper semicontinuous. Φ is weakly or lower measurable if \(\varPhi ^{-1}(C) \in \mathcal {B}(X)\) for each open set C ⊂ Y . Assume that Φ(x) ≠ ∅ for every x ∈ X. If Φ is compact valued and upper semicontinuous, then by Theorem 1 in Brown and Purves (1973), Φ admits a measurable selector, that is, there exists a Borel measurable mapping g : X → Y  such that g(x) ∈ Φ(x) for each x ∈ X. Moreover, the same holds if Φ is weakly measurable and has complete values Φ(x) for all x ∈ X (see Kuratowski and Ryll-Nardzewski 1965). Assume that D ⊂ X  × Y  is a Borel set such that D(x) := {y ∈ Y : (x, y) ∈ D} is nonempty and compact for each x ∈ X. If C is an open set in Y , then D−1(C) := {x ∈ X : D(x) ∩ C ≠ ∅} is the projection on X of the Borel set D0 = (X × C) ∩ D and D0(x) = {y ∈ Y : (x, y) ∈ D0} is σ -compact for any x ∈ X. By Theorem 1 in Brown and Purves (1973), \(D^{-1}(C)\in \mathcal {B}(X)\). For a broad discussion of semicontinuous or measurable correspondences, the reader is referred to Himmelberg (1975), Klein and Thompson (1984) or Aliprantis and Border (2006). For any Borel space Y , let C(Y ) be the space of all bounded continuous real-valued functions on Y . Assume that \(\Pr (Y)\) is endowed with the weak topology and the Borel σ-algebra \(\mathcal {B} (\Pr (Y))\) (see Bertsekas and Shreve 1996; Billingsley 1968 or Parthasarathy 1967). The σ -algebra \(\mathcal {B}(\Pr (Y))\) of all Borel subsets of \(\Pr (Y)\) coincides with the smallest σ-algebra on \(\Pr (Y)\) for which all the mappings p → p(D) from \(\Pr (Y)\) to [0, 1] are measurable for each \(D\in \mathcal {B} (Y)\) (see Proposition 7.25 in Bertsekas and Shreve 1996). Recall that a sequence \( (p_n)_{n\in \mathbb {N}}\)converges weakly to some \(p\in \Pr (Y)\) if and only if for any ϕ ∈ C(Y ),

$$\displaystyle \begin{aligned} \int_Y \phi(y)p_n(dy)\to \int_Y \phi(y)p(dy)\quad\mbox{as}\quad n\to\infty.\end{aligned} $$

If Y  is a Borel space, then \(\Pr (Y)\) is a Borel space too, and if Y  is compact, so is \(\Pr (Y)\) (see Corollary 7.25.1 and Proposition 7.22 in Bertsekas and Shreve 1996).

Consider the correspondence \(x\to \varPsi (x):= \Pr (\varPhi (x))\subset \Pr (Y)\). The following result from Himmelberg and Van Vleck (1975) is useful in studying stochastic games.

Proposition 1

If Φ is upper (lower) semicontinuous and compact valued, then so is Ψ.

A transition probability or a stochastic kernel from X to Y  is a function \(\varphi :{\mathcal {B} }(Y)\times X\to [0,1]\) such that φ(D|⋅) is a Borel measurable function on X for every \(D\in {\mathcal {B}} (Y)\) and \(\varphi (\cdot |x)\in \Pr (Y)\) for each x ∈ X. It is well known that every Borel measurable mapping \(f: X\to \Pr (Y)\) may be regarded as a transition probability φ from X to Y . Namely, φ(D|x) = f(x)(D), \(D\in \mathcal {B} (Y)\), x ∈ X (see Proposition 7.26 in Bertsekas and Shreve 1996). We shall write f(dy|x) instead of f(x)(dy). Clearly, any Borel measurable mapping f : X → Y  is a special transition probability φ from X to Y  such that for each x ∈ X, φ(⋅|x) is the Dirac measure concentrated at the point f(x). Similarly, universally measurable transition probabilities are defined, when \(\mathcal {B} (X)\) is replaced by \(\mathcal {U} (X)\).

In studying zero-sum stochastic games with Borel state spaces, we must use in the proofs some results on minmax measurable selections in parameterized games. Let X, A, and B be Borel spaces. Assume that \( K_{A}\in \mathcal {B}(X\times A)\) and \(K_{B}\in \mathcal {B}(X\times B)\) and suppose that the sets A(x) := {a ∈ A : (x, a) ∈ KA} and B(x) := {b ∈ B : (x, b) ∈ KB} are nonempty for all x ∈ X. Let K := {(x, a, b) : x ∈ X, a ∈ A(x), b ∈ B(x)}. Then K is a Borel subset of X × A × B. Let \(r:K\to \mathbb {R}\) be a Borel measurable payoff function in a zero-sum game parameterized by x ∈ X. If players 1 and 2 choose mixed strategies \( \mu \in \Pr (A(x))\) and \(\nu \in \Pr (B(x))\), respectively, then the expected payoff to player 1 (cost to player 2) depends on x ∈ X and is of the form

$$\displaystyle \begin{aligned} R(x,\mu,\nu):= \int_{A(x)}\int_{B(x)}r(x,a,b)\nu(db)\mu(da) \end{aligned}$$

provided that the double integral is well defined. Assuming this and that B(x) is compact for each x ∈ X and r(x, a, ⋅) is lower semicontinuous on B(x) for each (x, a) ∈ KA, we conclude from the minmax theorem of Fan (1953) that the game has a value, that is, the following equality holds

$$\displaystyle \begin{aligned} v^*(x):= \min_{\nu\in \Pr(B(x))}\sup_{\mu\in \Pr(A(x))} R(x,\mu,\nu)= \sup_{\mu\in \Pr(A(x))}\min_{\nu\in \Pr(B(x))} R(x,\mu,\nu), \quad x\in X. \end{aligned}$$
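At a fixed parameter x with finite action sets A(x) and B(x), the value v∗(x) in the above equality is simply the value of a finite matrix game and can be computed by linear programming. The following is a minimal Python sketch of this computation; the payoff matrix R and the use of scipy are illustrative assumptions and not part of the general Borel-space setting discussed here.

```python
import numpy as np
from scipy.optimize import linprog

# Value and an optimal mixed strategy of player 1 in a finite zero-sum game.
# R is a hypothetical payoff matrix r(x,a,b) at a fixed x:
# rows = actions of player 1, columns = actions of player 2.
R = np.array([[1.0, -1.0],
              [-2.0, 3.0]])
m, n = R.shape
# Variables z = (mu_1, ..., mu_m, v); player 1 maximizes v subject to
# sum_a mu_a R[a,b] >= v for every column b, with mu a probability vector.
c = np.zeros(m + 1); c[-1] = -1.0                       # minimize -v
A_ub = np.hstack([-R.T, np.ones((n, 1))])               # v - sum_a mu_a R[a,b] <= 0
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # sum_a mu_a = 1
b_eq = np.array([1.0])
bounds = [(0, None)] * m + [(None, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
mu_star, value = res.x[:m], res.x[-1]
print(value, mu_star)                                   # 1/7 and (5/7, 2/7) for this R
```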

A universally (Borel) measurable strategy for player 1 is a universally (Borel) measurable transition probability f from X to A such that f(A(x)|x) = 1 for all x ∈ X. By the Jankov-von Neumann theorem (see Theorem 18.22 in Aliprantis and Border 2006), there exists a universally measurable function φ : X → A such that φ(x) ∈ A(x) for all x ∈ X. Thus, the set of universally measurable strategies for player 1 is nonempty. Universally (Borel) measurable strategies for player 2 are defined similarly. A strategy g∗ is optimal for player 2 if

$$\displaystyle \begin{aligned} v^*(x) = \sup_{\mu\in \Pr(A(x))}\int_{A(x)}\int_{B(x)}r(x,a,b)g^*(db|x)\mu(da)\quad\mbox{for all} \quad x\in X. \end{aligned}$$

Let ε ≥ 0. A strategy f∗ is ε-optimal for player 1 if

$$\displaystyle \begin{aligned} v^*(x) \le \inf_{\nu\in \Pr(B(x))}\int_{A(x)}\int_{B(x)}r(x,a,b)\nu(db)f^*(da|x)+\varepsilon \quad \mbox{for all} \quad x\in X. \end{aligned}$$

A 0-optimal strategy is called optimal.

The following result follows from Nowak (1985b). For a much simpler proof, see Nowak (2010).

Proposition 2

Under the above assumptions, the value function v∗ is upper semianalytic. Player 2 has a universally measurable optimal strategy and, for any ε > 0, player 1 has a universally measurable ε-optimal strategy. If, in addition, we assume that A(x) is compact for each x ∈ X and r(x, ⋅, b) is upper semicontinuous for each (x, b) ∈ KB, then v∗ is Borel measurable and both players have Borel measurable optimal strategies.

As a corollary to Theorem 5.1 in Nowak (1986), we can state the following result.

Proposition 3

Assume that x → A(x) is lower semicontinuous and has complete values in A and x → B(x) is upper semicontinuous and compact valued. If \(r:K\to \mathbb {R}\) is lower semicontinuous on K, then v∗ is lower semicontinuous, player 2 has a Borel measurable optimal strategy, and for any ε > 0, player 1 has a Borel measurable ε-optimal strategy.

The lower semicontinuity of v∗ in Proposition 3 is a corollary to the maximum theorem of Berge (1963). In some games or minmax control models, one can consider the minmax value

$$\displaystyle \begin{aligned} \underline{v}^*(x):= \inf_{\nu\in \Pr(B(x))}\sup_{\mu\in \Pr(A(x))} R(x,\mu,\nu), \quad x\in X, \end{aligned}$$

if the mixed strategies are used, or

$$\displaystyle \begin{aligned} \underline{w}^*(x):= \inf_{b\in B(x)}\sup_{a\in A(x)} r(x,a,b), \quad x\in X, \end{aligned}$$

if the attention is restricted to pure strategies. If the assumption on semicontinuity of the function r is dropped, then the measurability of \(\underline {v}^*\) or \( \underline {w}^*\) is connected with the measurability of projections of coanalytic sets. This issue leads to considerations in classical descriptive set theory. A comprehensive study of the measurability of the upper or lower value of a game with a Borel payoff function r is given in Prikry and Sudderth (2016).

3 Robust Markov Decision Processes

A discounted maxmin Markov decision process is defined by the objects X, A, B, KA, K, u, q, and β, where:

  • X is a Borel state space;

  • A is the action space of the controller (player 1) and B is the action space of the opponent (player 2). It is assumed that A and B are Borel spaces;

  • \(K_{A} \in \mathcal {B}(X\times A)\) is the constraint set for the controller. It is assumed that

    $$\displaystyle \begin{aligned} A(x):= \{a\in A: (x,a)\in K_{A}\}\neq \emptyset \end{aligned}$$

    for each x ∈ X. This is the set of admissible actions of the controller in the state x ∈ X;

  • \(K \in \mathcal {B}(X\times A\times B)\) is the constraint set for the opponent. It is assumed that

    $$\displaystyle \begin{aligned} B(x,a):= \{b\in B: (x,a,b)\in K\}\neq \emptyset \end{aligned}$$

    for each (x, a) ∈ KA. This is the set of admissible actions of the opponent for (x, a) ∈ KA;

  • \(u:K \to\underline {\mathbb {R}}\) is a Borel measurable stage payoff function;

  • q is a transition probability from K to X, called the law of motion among states. If xn is a state at the beginning of period n of the process and actions an ∈ A(xn) and bn ∈ B(xn, an) are selected by the players, then q(⋅|xn, an, bn) is the probability distribution of the next state xn+1;

  • β ∈ (0, 1) is the discount factor.

We make the following assumptions on the admissible action sets.

  1. (C1)

    For any x ∈ X, A(x) is compact and the set-valued mapping x → A(x) is upper semicontinuous.

  2. (C2)

    The set-valued mapping (x, a) → B(x, a) is lower semicontinuous.

  3. (C3)

    There exists a Borel measurable mapping g : KA → B such that g(x, a) ∈ B(x, a) for all (x, a) ∈ KA.

Remark 1

From Sect. 2, it follows that condition (C3) holds if B(x, a) is σ-compact for each (x, a) ∈ KA (see Brown and Purves 1973) or if B is a complete separable metric space and each set B(x, a) is closed (see Kuratowski and Ryll-Nardzewski 1965).

Let H1 := X, \(H_n:=K^{n-1}\times X\) for n ≥ 2. Put \(H_1^*:=K_{A}\) and \( H_n^*:=K^{n-1}\times K_{A}\) if n ≥ 2. Generic elements of Hn and \(H_n^*\) are histories of the process, and they are of the form h1 = x1, \( h_1^*=(x_1,a_1)\) and for each n ≥ 2, hn = (x1, a1, b1, …, xn−1, an−1, bn−1, xn), \(h_n^*=(h_n,a_n). \)

A strategy for the controller is a sequence \(\pi =(\pi _n)_{n\in \mathbb {N}}\) of stochastic kernels πn from Hn to A such that πn(A(xn)|hn) = 1 for each hn ∈ Hn. The class of all strategies for the controller will be denoted by Π. A strategy for the opponent is a sequence \(\gamma =(\gamma _n)_{n\in \mathbb {N}}\) of stochastic kernels γn from \(H^*_n\) to B such that \( \gamma _n(B(x_n,a_n)|h^*_n)=1\) for all \(h^*_n\in H^*_n\). The class of all strategies for the opponent will be denoted by Γ. Let F be the set of Borel measurable mappings f from X to A such that f(x) ∈ A(x) for each x ∈ X. A deterministic stationary strategy for the controller is a sequence \(\pi = (f_n)_{n\in \mathbb {N}}\) where fn = f for all \(n\in \mathbb {N}\) and some f ∈ F. Such a strategy can obviously be identified with the mapping f ∈ F. Let

$$\displaystyle \begin{aligned} u^+(x,a,b)&:=\max\{u(x,a,b),0\}\quad\mbox{ and}\\ u^-(x,a,b)&:=\min\{u(x,a,b),0\},\quad (x,a,b)\in K. \end{aligned} $$

For each initial state x1 = x and any strategies π ∈ Π and γ ∈ Γ, define

$$\displaystyle \begin{aligned} J_{\beta}^+(x,\pi,\gamma) = E^{\pi\gamma}_{x}\left(\sum^{\infty}_{n=1} \beta^{n-1}u^+(x_{n},a_{n},b_n)\right), \end{aligned} $$
(5.1)
$$\displaystyle \begin{aligned} J_{\beta}^-(x,\pi,\gamma) = E^{\pi\gamma}_{x}\left(\sum^{\infty}_{n=1} \beta^{n-1}u^-(x_{n},a_{n},b_n)\right). \end{aligned} $$
(5.2)

Here, \(E^{\pi \gamma }_{x}\) denotes the expectation operator corresponding to the unique probability measure \(P^{\pi \gamma }_{x}\) defined on the space of histories endowed with the product σ-algebra; this measure is determined by the initial state x, the strategies π and γ, and the transition probability q according to the Ionescu-Tulcea Theorem (see Proposition 7.45 in Bertsekas and Shreve 1996 or Proposition V.1.1 in Neveu 1965). In the sequel, we give conditions under which \(J_{\beta }^+(x,\pi ,\gamma )< \infty \) for any x ∈ X, π ∈ Π, γ ∈ Γ. They enable us to define the expected discounted payoff over an infinite time horizon as follows:

$$\displaystyle \begin{aligned} J_{\beta}(x,\pi,\gamma) = E^{\pi\gamma}_{x}\left(\sum^{\infty}_{n=1} \beta^{n-1}u(x_{n},a_{n},b_n)\right). \end{aligned} $$
(5.3)

Then, for every x ∈ X, π ∈ Π, γ ∈ Γ we have that \( J_{\beta }(x,\pi ,\gamma )\in\underline {\mathbb {R}}\) and

$$\displaystyle \begin{aligned} J_{\beta} (x,\pi,\gamma)= J_{\beta}^+(x,\pi,\gamma)+J_{\beta}^-(x,\pi,\gamma) =\sum^{\infty}_{n=1}\beta^{n-1}E^{\pi\gamma}_{x} u(x_{n},a_{n},b_n). \end{aligned}$$
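To build intuition for the expectation defining Jβ, the discounted payoff (5.3) can be estimated by simulation once the strategies are fixed. The Python sketch below uses a small finite model with hypothetical data (payoffs u, transitions q, and deterministic stationary strategies f and g); it merely illustrates how the discounted sum is accumulated along sampled histories.

```python
import numpy as np

# Monte Carlo estimate of the discounted payoff (5.3) in a small finite model
# under fixed deterministic stationary strategies; all data are hypothetical.
rng = np.random.default_rng(1)
nX, nA, nB, beta = 3, 2, 2, 0.9
u = rng.uniform(-1.0, 1.0, size=(nX, nA, nB))        # stage payoffs u(x,a,b)
q = rng.dirichlet(np.ones(nX), size=(nX, nA, nB))     # transitions q(.|x,a,b)
f = np.array([0, 1, 0])                               # controller's stationary strategy
g = np.array([1, 0, 1])                               # opponent's stationary strategy

def discounted_payoff(x0, horizon=300, runs=2000):
    total = 0.0
    for _ in range(runs):
        x, disc, acc = x0, 1.0, 0.0
        for _ in range(horizon):                      # beta**300 is negligible
            a, b = f[x], g[x]
            acc += disc * u[x, a, b]
            x = rng.choice(nX, p=q[x, a, b])
            disc *= beta
        total += acc
    return total / runs

print(discounted_payoff(0))                           # estimate of J_beta(0, f, g)
```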

Let

$$\displaystyle \begin{aligned} v_{\beta}(x) := \sup_{\pi\in\varPi}\inf_{\gamma\in\varGamma}J_{\beta}(x,\pi,\gamma), \quad x\in X. \end{aligned}$$

This is the maxmin or lower value of the game starting at the state x ∈ X. A strategy π∗ ∈ Π is called optimal for the controller if \(\inf _{\gamma \in \varGamma }J_{\beta }(x,\pi ^*,\gamma ) =v_{\beta }(x)\) for every x ∈ X.

It is worth mentioning that if u is unbounded, then an optimal strategy π need not exist even if 0 ≤ vβ(x) < ∞ for every x ∈ X and the available action sets A(x) and B(x) are finite (see Example 1 in Jaśkiewicz and Nowak 2011).

The maxmin control problems with Borel state spaces have already been considered by González-Trejo et al. (2003), Hansen and Sargent (2008), Iyengar (2005), and Küenle (1986) and are referred to as games against nature or robust dynamic programming (Markov decision) models. The idea of using maxmin decision rules was introduced in statistics (see Blackwell and Girshick 1954). It is also used in economics (see, e.g., the variational preferences in Maccheroni et al. 2006).

3.1 One-Sided Weighted Norm Approach

We now describe our regularity assumptions imposed on the payoff and transition probability functions.

  1. (W1)

    The payoff function \(u:K\to\underline {\mathbb {R}}\) is upper semicontinuous.

  2. (W2)

    For any ϕ ∈ C(X) the function

    $$\displaystyle \begin{aligned} (x,a,b) \to \int_X \phi (y)q(dy|x,a,b) \end{aligned}$$

    is continuous.

  3. (M1)

    There exist a continuous function ω : X → [1, ∞) and a constant α > 0 such that

    $$\displaystyle \begin{aligned} \sup_{(x,a,b)\in K}\frac{\int_X \omega(y)q(dy|x,a,b)}{\omega(x)} \le\alpha\quad\mbox{and}\quad \beta\alpha<1.\end{aligned} $$
    (5.4)

    Moreover, the function (x, a, b) →∫Xω(y)q(dy|x, a, b) is continuous.

  4. (M2)

    There exists a constant d > 0 such that

    $$\displaystyle \begin{aligned} \sup_{a\in A(x)}\sup_{b\in B(x,a)}u^+(x,a,b)\le d\omega(x)\end{aligned} $$

    for all x ∈ X.

Note that under conditions (M1) and (M2), the discounted payoff function is well defined, since

$$\displaystyle \begin{aligned} 0\le E_x^{\pi\gamma} \left(\sum_{n=1}^\infty \beta^{n-1} u^+(x_n,a_n,b_n)\right)\le d\sum_{n=1}^\infty\beta^{n-1}\alpha^{n-1}\omega(x)<\infty. \end{aligned}$$

Remark 2

Assumption (W2) states that transition probabilities are weakly continuous. It is worth emphasizing that this property, in contrast to setwise continuity of transitions, is satisfied in a number of models arising in operations research, economics, etc. Indeed, Feinberg and Lewis (2005) studied the typical inventory model:

$$\displaystyle \begin{aligned} x_{n+1}=x_n+a_n-\xi_{n+1},\quad n\in\mathbb{N}, \end{aligned}$$

where xn is the inventory at the end of period n, an is the decision on how much should be ordered, and ξn is the demand during period n; each ξn has the same distribution as the random variable ξ. Assume that \(X=\mathbb {R}\), \(A=\mathbb {R}_+\). Let q(⋅|x, a) be the transition law for this problem. In view of Lebesgue’s dominated convergence theorem, it is clear that q is weakly continuous. On the other hand, recall that setwise continuity means that q(D|x, ak) → q(D|x, a0) as ak → a0 for every \(D\in \mathcal {B}(X)\). Suppose that the demand is deterministic and equal to 1, ak = a + 1∕k and D = (−∞, x + a − 1]. Then, q(D|x, a) = 1, but q(D|x, ak) = 0.
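A quick numerical check of the last claim (with hypothetical values of x and a): with deterministic demand equal to 1, the transition kernel is the Dirac measure at x + a − 1, so q is weakly continuous in a, yet q(D|x, ak) does not converge to q(D|x, a) setwise.

```python
# Numerical check of Remark 2 (hypothetical x and a): with deterministic demand
# equal to 1, q(.|x,a) is the Dirac measure at x + a - 1, so q is weakly
# continuous in a, yet q(D|x, a_k) does not converge to q(D|x, a) setwise.
x, a = 2.0, 1.5
D_upper = x + a - 1.0                                 # D = (-infinity, x + a - 1]
q_D = lambda action: 1.0 if x + action - 1.0 <= D_upper else 0.0
print(q_D(a))                                         # 1.0 at the limit action a
for k in (1, 10, 100, 1000):
    print(k, q_D(a + 1.0 / k))                        # 0.0 for every k
```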

For any function \(\phi :X\to \mathbb {R}\), define the ω-norm as follows:

$$\displaystyle \begin{aligned} \|\phi\|{}_\omega=\sup_{x\in X} \frac{|\phi(x)|}{\omega(x)}, \end{aligned} $$
(5.5)

provided that it is finite. Let Uω(X) be the space of all upper semicontinuous functions \(\phi :X\to \mathbb {R}\) with finite ω-norm, endowed with the metric induced by the ω-norm. By \( \underline {U}_\omega (X)\) we denote the set of all upper semicontinuous functions \(\phi :X\to\underline {\mathbb {R}}\) such that ϕ+ ∈ Uω(X).

Define \(u_k:= \max \{u, -k\}\), \(k\in \mathbb {N}\). For any \(\phi \in\underline { U}_\omega (X)\), (x, a, b) ∈ K, and \(k\in \mathbb {N}\), let

$$\displaystyle \begin{aligned} L_{\beta,k}\phi(x,a,b)= u_k(x,a,b)+\beta\int_X \phi(y)q(dy|x,a,b) \end{aligned}$$

and

$$\displaystyle \begin{aligned} L_\beta \phi(x,a,b)= u(x,a,b)+\beta\int_X \phi(y)q(dy|x,a,b). \end{aligned}$$

The maximum theorem of Berge (1963) (see also Proposition 10.2 in Schäl 1975) implies the following auxiliary result.

Lemma 1

Assume (C1)–(C3), (W1)–(W2) , and (M1)–(M2) . Then for any \(\phi \in\underline {U}_\omega (X)\), the functions

$$\displaystyle \begin{aligned} \inf_{b\in B(x,a)}L_{\beta,k}\phi(x,a,b) \quad and\quad \max_{a\in A(x)}\inf_{b\in B(x,a)}L_{\beta,k}\phi(x,a,b)\end{aligned} $$

are upper semicontinuous on KA and X, respectively. Similar properties hold if Lβ,kϕ(x, a, b) is replaced by Lβϕ(x, a, b).

For any x ∈ X, define

$$\displaystyle \begin{aligned} T_{\beta,k}\phi(x)&= \max_{a\in A(x)}\inf_{b\in B(x,a)}L_{\beta,k}\phi(x,a,b) \quad\mbox{and}\\ T_\beta \phi(x)&= \max_{a\in A(x)}\inf_{b\in B(x,a)}L_\beta \phi(x,a,b).\end{aligned} $$
(5.6)

By Lemma 1, the operators Tβ,k and Tβ are well defined. Additionally, note that

$$\displaystyle \begin{aligned} T_\beta \phi(x)= \max_{a\in A(x)}\inf_{\rho \in \Pr(B(x,a))}\int_{B(x,a)}L_\beta \phi(x,a,b)\rho (db).\end{aligned} $$

We can now state the main result in Jaśkiewicz and Nowak (2011).

Theorem 1

Assume (C1)–(C3), (W1)–(W2), and (M1)–(M2). Then \(v_{\beta }\in\underline {U}_\omega (X)\), Tβvβ = vβ and there exists a stationary strategy f∗ ∈ F such that

$$\displaystyle \begin{aligned} v_{\beta}(x)=\inf_{b\in B(x,f^*(x))}L_\beta v_{\beta}(x,f^*(x),b)\end{aligned} $$

for x ∈ X. Moreover,

$$\displaystyle \begin{aligned} v_{\beta}(x)=\inf_{\gamma\in\varGamma} J_{\beta} (x,f^*,\gamma)=\sup_{\pi\in\varPi} \inf_{\gamma\in\varGamma} J_{\beta}(x,\pi,\gamma)\end{aligned} $$

for all x ∈ X, so f∗ is an optimal stationary strategy for the controller.

The proof of Theorem 1 consists of two steps. First, we deal with truncated models, in which the payoff function u is replaced by uk. Then, making use of a fixed point argument, we obtain an upper semicontinuous solution to the Bellman equation, say vβ,k. Next, we observe that the sequence \((v_{\beta ,k})_{k\in \mathbb {N}}\) is nonincreasing. Letting k →∞ and making use of Lemma 1, we arrive at the conclusion.
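For a finite model with bounded payoffs (so that no truncation is needed), the fixed point behind Theorem 1 can be approximated by simply iterating the operator Tβ from (5.6). The following Python sketch uses hypothetical data and is only meant to illustrate this iteration, not the Borel-space proof.

```python
import numpy as np

# Iterating the operator T_beta from (5.6) on a hypothetical finite model with
# bounded payoffs; T_beta is a beta-contraction, so the iteration converges.
rng = np.random.default_rng(0)
nX, nA, nB, beta = 3, 2, 2, 0.9
u = rng.uniform(-1.0, 1.0, size=(nX, nA, nB))          # stage payoffs u(x,a,b)
q = rng.dirichlet(np.ones(nX), size=(nX, nA, nB))       # transitions q(.|x,a,b)

v = np.zeros(nX)
for _ in range(1000):
    L = u + beta * q @ v                                 # L_beta v(x,a,b)
    v_new = L.min(axis=2).max(axis=1)                    # max over a of inf over b
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new

f_star = (u + beta * q @ v).min(axis=2).argmax(axis=1)   # a maxmin selector f*(x)
print(v, f_star)
```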

Remark 3

The weighted supremum norm approach in Markov decision processes was proposed by Wessels (1977) and further developed, e.g., by Hernández-Lerma and Lasserre (1999). This method has also been adapted to zero-sum stochastic games (see Couwenbergh 1980; González-Trejo et al. 2003; Jaśkiewicz 2009, 2010; Jaśkiewicz and Nowak 2006, 2011; Küenle 2007 and references cited therein). The common feature of the aforementioned works is the fact that the authors use the weighted norm condition instead of assumption (M2). More precisely, in our notation this means that the following holds

$$\displaystyle \begin{aligned} \sup_{a\in A(x)}\sup_{b\in B(x,a)} |u(x,a,b)|\le d\omega(x),\quad x\in X\end{aligned} $$
(5.7)

for some constant d > 0. This assumption, however, excludes many examples studied in economics where the utility function u equals −∞ in some states. Moreover, the pair of conditions (M1) and (5.7) often enforces additional constraints on the discount factor β in comparison with (M1) and (M2) (see Example 6 in Jaśkiewicz and Nowak 2011).

Observe that if the payoff function u takes only negative values, then assumption (M2) is redundant. Thus, the problem reduces to negative dynamic programming, which was solved by Strauch (1966) in the case of a one-player game (Markov decision process).

3.1.1 Models with Unknown Disturbance Distributions

Consider the control system in which

$$\displaystyle \begin{aligned} x_{n+1} = \varPsi(x_n,a_n,\xi_n), \quad n\in \mathbb{N}.\end{aligned} $$

It is assumed that \((\xi _n)_{n\in \mathbb {N}}\) is a sequence of independent random variables with values in a Borel space S having unknown probability distributions that can change from period to period. The set B of all possible distributions is assumed to be a nonempty Borel subset of the space \(\Pr (S)\) endowed with the weak topology. The mapping Ψ : KA × S → X is assumed to be continuous. Let u0 be an upper semicontinuous utility function defined on KA × S such that \( u_0^+(x,a,s)\le d\omega (x)\) for some constant d > 0 and all (x, a) ∈ KA, s ∈ S.

We can formulate a maxmin control model in the following way:

  1. (a)

    \(B(x,a)=B\subset \Pr (S)\) for each (x, a) ∈ KA, K = KA × B;

  2. (b)

    u(x, a, b) =∫Su0(x, a, s)b(ds), (x, a, b) ∈ K;

  3. (c)

    for any Borel set D ⊂ X, q(D|x, a, b)=∫S1D(Ψ(x, a, s))b(ds), (x, a, b) ∈ K.

Then for any bounded continuous function \(\phi :X\to \mathbb {R}\), we have that

$$\displaystyle \begin{aligned} \int_X \phi(y)q(dy|x,a,b)= \int_S\phi(\varPsi(x,a,s))b(ds).\end{aligned} $$
(5.8)

From Proposition 7.30 in Bertsekas and Shreve (1996) or Lemma 5.3 in Nowak (1986) and (5.8), it follows that q is weakly continuous. Moreover, by virtue of Proposition 7.31 in Bertsekas and Shreve (1996), it is easily seen that u is upper semicontinuous on K.

The following result can be viewed as a corollary to Theorem 1.

Proposition 4

Let Ψ and u0 satisfy the above assumptions. If (M1) holds, then the controller has an optimal strategy.

Proposition 4 is a counterpart of the results obtained in Sect. 6 of González-Trejo et al. (2003) for discounted models (see Propositions 6.1, 6.2, 6.3 and their consequences in González-Trejo et al. (2003)). However, our assumptions imposed on the primitive data are weaker than the ones used by González-Trejo et al. (2003). They are satisfied for a fairly large class of systems, in which the disturbances comprise “random noises” that are difficult to observe and are often caused by external factors influencing the dynamics. Below we give certain examples which stem from economic growth theory and related topics. They are mainly inspired by models studied in Stokey et al. (1989), Bhattacharya and Majumdar (2007), and Hansen and Sargent (2008).

Example 1 (A growth model with multiplicative shocks)

Let X = [0, ∞) be the set of all possible capital stocks. If xn is a capital stock at the beginning of period n, then the level of satisfaction from consuming an ∈ A(xn) = [0, xn] in this period is \(a_n^\sigma \). Here σ ∈ (0, 1] is a fixed parameter. The evolution of the state process is described by the following equation:

$$\displaystyle \begin{aligned} x_{n+1}=(x_{n}-a_{n})^\theta\xi_{n}, \quad n\in \mathbb{N},\end{aligned} $$

where θ ∈ (0, 1) is some constant and ξn is a random shock in period n. Assume that each ξn follows a probability distribution b ∈ B for some Borel set \(B\subset \Pr ([0,\infty ))\). We assume that b is unknown.

Consider the maxmin control model, where X = [0, ∞), A(x) = [0, x], B(x, a) = B, and u(x, a, b) = aσ for (x, a, b) ∈ K. Then, the transition probability q is of the form

$$\displaystyle \begin{aligned} q(D|x,a,b)=\int_{0}^{\infty}1_{D}((x-a)^\theta s)b(ds), \end{aligned}$$

where \(D\in \mathcal {B}( X)\). If ϕ ∈ C(X), then the integral

$$\displaystyle \begin{aligned} \int_{X}\phi(y)q(dy|x,a,b)=\int_{0}^{\infty}\phi((x-a)^\theta s)b(ds) \end{aligned}$$

is continuous at (x, a, b) ∈ K. We further assume that

$$\displaystyle \begin{aligned} \bar{s} = \sup_{b\in B}\int_0^{\infty}s b(ds) <\infty.\end{aligned} $$

Define now

$$\displaystyle \begin{aligned} \omega (x)= (r+x)^\sigma, \quad x\in X,\end{aligned} $$
(5.9)

where r ≥ 1 is a constant. Clearly, u+(x, a, b) = aσ ≤ ω(x) for any (x, a, b) ∈ K. Hence, condition (M2) is satisfied. Moreover, by Jensen’s inequality we obtain

$$\displaystyle \begin{aligned} \int_X\omega(y)q(dy|x,a,b) =\int_0^{\infty}(r+(x-a)^\theta s)^\sigma b(ds) \le (r+x^\theta \bar{s})^\sigma.\end{aligned} $$

Thus,

$$\displaystyle \begin{aligned} \frac{ \int_X\omega(y)q(dy|x,a,b)}{\omega(x)}\le \eta^\sigma(x), \quad\mbox{ where }\quad \eta(x):= \frac{r+\bar{s}x^\theta}{r+x},\quad x\in X.\end{aligned} $$
(5.10)

If \(x\ge \bar {x}:= \bar {s}^{1/(1-\theta )}\), then η(x) ≤ 1, and consequently, ησ(x) ≤ 1. If \(x < \bar {x}\), then

$$\displaystyle \begin{aligned} \eta(x) <\frac{r+ \bar{s}\bar{x}^{\theta}}{r+x}\le \frac{r+ \bar{s}\bar{x}^{\theta}}{r}= 1+ \frac{\bar{x}}{r},\end{aligned} $$

and

$$\displaystyle \begin{aligned} \eta^\sigma(x) \le \alpha:= \left(1+ \frac{\bar{x}}{r}\right)^\sigma.\end{aligned} $$
(5.11)

Let β ∈ (0, 1) be any discount factor. Then, there exists r ≥ 1 such that αβ < 1, and from (5.10) and (5.11) it follows that assumption (M1) is satisfied.
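The choice of r in this argument is constructive: for given values of θ, σ, s̄, and β, one can simply enlarge r until αβ < 1. A short Python sketch with hypothetical parameter values:

```python
# Choosing r in (5.9) so that (M1) holds in Example 1; the parameter values
# theta, sigma, beta, and s_bar below are hypothetical.
theta, sigma, beta, s_bar = 0.5, 0.8, 0.95, 2.0
x_bar = s_bar ** (1.0 / (1.0 - theta))        # threshold below which eta may exceed 1
r = 1.0
while beta * (1.0 + x_bar / r) ** sigma >= 1.0:
    r *= 2.0                                  # enlarge r until alpha * beta < 1
alpha = (1.0 + x_bar / r) ** sigma
print(r, alpha, alpha * beta)                 # alpha * beta < 1, so (M1) holds
```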

Example 2

Let us consider again the model from Example 1 but with \(u(x,a,b)=\ln a\), a ∈ A(x) = [0, x]. This utility function has a number of applications in economics (see Stokey et al. 1989). Nonetheless, the two-sided weighted norm approach cannot be employed, because \(\ln (0)= -\infty \). Assume now that the state evolution equation is of the form

$$\displaystyle \begin{aligned} x_{n+1}= (1+\rho_0)(x_n-a_n)\xi_n, \quad n\in \mathbb{N},\end{aligned} $$

where ρ0 > 0 is a constant rate of growth and ξn is an additional random income (shock) received in period n. Let \(\omega (x)= r\,{+}\,\ln (1\,{+}\,x)\) for all x ∈ X and some r ≥ 1. Clearly, \(u^+(x,a,b)=\max \{0,\ln a\} \le \max \{0,\ln x\} \le \omega (x)\) for all (x, a, b) ∈ K. By Jensen’s inequality it follows that

$$\displaystyle \begin{aligned} \int_X\omega(y)q(dy|x,a,b) =\int_0^\infty \omega((x-a)(1+\rho_0)s)b(ds) \le r+ \ln (1+x(1+\rho_0)\bar{s}) \end{aligned}$$

for all (x, a, b) ∈ K. Thus

$$\displaystyle \begin{aligned} \frac{\int_X\omega(y)q(dy|x,a,b)}{\omega(x)} \le \psi(x):= \frac{ r+\ln (1+x(1+\rho_0)\bar{s})}{r+\ln(1+x)}. \end{aligned} $$
(5.12)

If we assume that \(\bar {s}(1+\rho _0)>1\), then

$$\displaystyle \begin{aligned} \psi(x)-1 = \frac{\ln\left(\frac{1+(1+\rho_0)\bar{s}x}{1+x}\right)}{r+\ln(1+x)} \le \frac{1}{r}\ln \left(\frac{1+(1+\rho_0)\bar{s}x}{1+x}\right)\le \frac{1}{r}\ln (\bar{s}(1+\rho_0)).\end{aligned} $$

Hence

$$\displaystyle \begin{aligned} \psi(x)\le \alpha:= 1 + \frac{1}{r}\ln (\bar{s}(1+\rho_0)).\end{aligned} $$

Choose now any β ∈ (0, 1). If r is sufficiently large, then αβ < 1 and by (5.12) condition (M1) holds.

Example 3 (A growth model with additive shocks)

Consider the model from Example 1 with the following state evolution equation:

$$\displaystyle \begin{aligned} x_{n+1}=(1+\rho_0 )(x_{n}-a_{n})+\xi_{n}, \quad n\in \mathbb{N},\end{aligned} $$

where ρ0 is the constant introduced in Example 2. The transition probability q is now of the form

$$\displaystyle \begin{aligned} q(D|x,a,b)=\int_{0}^{\infty}1_{D}((1+\rho_0 )(x-a)+s)b(ds),\end{aligned} $$

where \(D\in \mathcal {B}( X)\). If ϕ ∈ C(X), then the integral

$$\displaystyle \begin{aligned} \int_{X}\phi(y)q(dy|x,a,b)=\int_{0}^{\infty}\phi((1+\rho_0 )(x-a)+s )b(ds)\end{aligned} $$

is continuous in (x, a, b) ∈ K. Let the function ω be as in (5.9). Applying Jensen’s inequality we obtain

$$\displaystyle \begin{aligned} \int_X\omega(y)q(dy|x,a,b) &=\int_0^\infty\omega((x-a)(1+\rho_0)+s) b(ds)\\ &\le \omega(x(1+\rho_0)+\bar{s}) = (r+x(1+\rho_0)+\bar{s})^\sigma.\end{aligned} $$

Thus,

$$\displaystyle \begin{aligned} \frac{\int_X\omega(y)q(dy|x,a,b)}{\omega(x)}\le \eta_0^\sigma(x), \quad\mbox{ where }\quad \eta_0(x):= \frac{r+x(1+\rho_0)+\bar{s}}{r+x},\quad x\in X.\end{aligned} $$

Take \(r>\bar {s}/\rho _0\) and note that

$$\displaystyle \begin{aligned} \lim_{x\to 0+}\eta_0(x)= 1+\frac{\bar{s}}{r} < \lim_{x\to \infty} \eta_0(x) = 1+\rho_0. \end{aligned}$$

Hence,

$$\displaystyle \begin{aligned} \sup_{(x,a,b)\in K} \frac{\int_X\omega(y)q(dy|x,a,b)}{\omega(x)} \le \sup_{x\in X}\eta_0^\sigma(x)= (1+\rho_0)^\sigma. \end{aligned}$$

Therefore, condition (M1) holds for all β ∈ (0, 1) such that β(1 + ρ0)σ < 1.

For other examples involving quadratic cost/payoff functions and linear evolution of the system, the reader is referred to Jaśkiewicz and Nowak (2011).

3.1.2 An Application to the Hansen-Sargent Model in Macroeconomics

In this subsection, we study a maxmin control model in which the minimizing player (nature) helps the controller to design a decision rule that is robust to misspecification of a dynamic approximating model linking controls today to state variables tomorrow. The constraint on nature is represented by a cost based on a reference transition probability q. Nature can deviate from q, but the larger the deviation, the higher the cost. In particular, this cost is proportional to the relative entropy \(I(\hat {q}||q)\) between the chosen probability \(\hat {q}\) and the reference probability q, i.e., the cost equals \(\theta _0 I(\hat {q}||q)\), where θ0 > 0. Such preferences in macroeconomics are called multiplier preferences (see Hansen and Sargent 2008).

Let us consider the following scalar system:

$$\displaystyle \begin{aligned} x_{n+1}=x_n+a_n+\varepsilon_n+b_{n},\quad n\in\mathbb{N}, \end{aligned} $$
(5.13)

where \(x_n\in X=\mathbb {R}\), \(a_n\in A(x_n)\equiv A=[0,\hat {a}]\) is an action selected by the controller and bn ∈ B(xn, an) ≡ B = (−∞, 0] is a parameter chosen by the malevolent nature. The sequence of random variables \((\varepsilon _n)_{n\in \mathbb {N}}\) is i.i.d., where εn follows the standard Gaussian distribution with the density denoted by ϕ. At each period the controller selects a control a ∈ A, which incurs the payoff u0(x, a). It is assumed that the function u0 is upper semicontinuous on X × A. The controller has a unique explicitly specified approximating model (when bn ≡ 0 for all n) but concedes that the data might actually be generated by one of a set of models that surround the approximating model.

Let \(n\in \mathbb {N}\) be fixed. By p we denote the conditional density of variable Y = xn+1 implied by equation (5.13). Setting a = an, x = xn, and bn = b we obtain that

$$\displaystyle \begin{aligned} p(y|x,a,b)=\frac 1{\sqrt{2\pi}}e^{-\frac{(y-x-a-b)^2}2} \quad\mbox{for}\quad y\in \mathbb{R}. \end{aligned}$$

Clearly, p(⋅|x, a, b) defines the probability measure q, where

$$\displaystyle \begin{aligned} q(D|x,a,b)=\int_D p(y|x,a,b)dy\quad \mbox{for }D\in \mathcal{B}(\mathbb{R}). \end{aligned}$$

If b = 0, then we deal with the baseline model. Hence, the relative entropy

$$\displaystyle \begin{aligned} I(q(\cdot|x,a,b)||q(\cdot|x,a,0))=\frac 12 b^2, \end{aligned}$$

and consequently, the payoff function in the model is

$$\displaystyle \begin{aligned} u(x,a,b)=u_0(x,a)+\frac{1}{2}\theta_0 b^2. \end{aligned}$$

The term \(\frac {1}{2}\theta _0 b^2\) is a penalty cost paid by nature. The parameter θ0 can be viewed as the degree of robustness. For example, if θ0 is large, then the penalization becomes so great that only the nominal model remains and the strategy is less robust. Conversely, lower values of θ0 allow the controller to design a strategy which is appropriate for a wider set of model misspecifications. Therefore, a lower θ0 corresponds to a higher degree of robustness.
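The relative-entropy formula I(q(⋅|x, a, b)||q(⋅|x, a, 0)) = b²∕2 used above can be checked numerically for the unit-variance Gaussian densities implied by (5.13). In the Python sketch below, the values of b, x, and a are hypothetical.

```python
import numpy as np

# Numerical check that I(q(.|x,a,b) || q(.|x,a,0)) = b^2 / 2 for the
# unit-variance Gaussian densities implied by (5.13); b, x, a are hypothetical.
b, x, a = -0.7, 1.0, 0.5
y = np.linspace(-12.0, 12.0, 200001)
p_b = np.exp(-(y - x - a - b) ** 2 / 2) / np.sqrt(2 * np.pi)   # density with shift b
p_0 = np.exp(-(y - x - a) ** 2 / 2) / np.sqrt(2 * np.pi)       # reference density
kl = np.sum(p_b * np.log(p_b / p_0)) * (y[1] - y[0])           # Riemann sum for the entropy
print(kl, b ** 2 / 2)                                          # both approximately 0.245
```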

Within such a framework, we shall consider pure strategies for nature. A strategy \(\gamma =(\gamma _n)_{n\in \mathbb {N}}\) is an admissible strategy for nature if \(\gamma _n: H_n^*\to B\) is a Borel measurable function, i.e., \(b_n=\gamma _n(h_n^*)\), \(n\in \mathbb {N}\), and for every x ∈ X and π ∈ Π

$$\displaystyle \begin{aligned} E_x^{\pi\gamma}\left( \sum_{n=1}^{\infty}\beta^{n-1} b_n^2\right)<\infty. \end{aligned}$$

The set of all admissible strategies for nature is denoted by \(\varGamma ^*_0\).

The objective of the controller is to find a policy π∈ Π such that

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle { \inf_{\gamma\in\varGamma^*_0}E_x^{\pi^*\gamma}\left( \sum_{n=1}^{\infty}\beta^{n-1}\left\{u_0(x_n,a_n)+\frac{1}{2}\theta_0 b_n^2\right\} \right)=} \\ &\displaystyle &\displaystyle \max_{\pi\in\varPi}\inf_{\gamma\in\varGamma^*_0}E_x^{\pi\gamma}\left( \sum_{n=1}^{\infty}\beta^{n-1}\left\{u_0(x_n,a_n)+\frac{1}{2}\theta_0 b_n^2\right\} \right). \end{array} \end{aligned} $$

We solve the problem by proving that there exists a solution to the optimality equation. First, we note that assumption (M1) is satisfied for \( \omega (x)=\max \{x,0\}\,{+}\,r\), where r ≥ 1 is some constant. Indeed, on page 268 in Jaśkiewicz and Nowak (2011), it is shown that for every discount factor β ∈ (0, 1), we may choose sufficiently large r ≥ 1 such that αβ < 1, where \(\alpha = 1+(\hat {a}+1)/r\). Further, we shall assume that \(\sup _{a\in A}u_0^+(x,a)\le d\omega (x)\) for all x ∈ X.

For any function \(\phi \in\underline {U}_\omega (X)\), we define the operator \( \mathcal {T}_{\beta }\) as follows:

$$\displaystyle \begin{aligned} \mathcal{T}_{\beta} \phi(x)=\max_{a\in A}\inf_{b\in B}\left[u_0(x,a)+\frac{1}{2} \theta_0b^2 +\beta\int_X \phi(y)q(dy|x,a,b)\right] \end{aligned}$$

for all x ∈ X. Clearly, \(\mathcal {T}_{\beta }\) maps the space \( \underline {U} _\omega (X)\) into itself. Indeed, we have

$$\displaystyle \begin{aligned} \mathcal{T}_{\beta} \phi(x)\le\max_{a\in A}\left[u_0(x,a)+\beta\int_X \phi(y)q(dy|x,a,b)\right]\\ \le d\omega(x)+ \beta\alpha \|\phi^+\|{}_\omega\omega(x) \end{aligned} $$

for all x ∈ X. Hence, \((\mathcal {T}_{\beta } \phi )^+\in U_\omega (X)\) and by Lemma 1, \(\mathcal {T}_{\beta } \phi \) is upper semicontinuous. Proceeding analogously to the proof of Theorem 1, we infer that \(v_{\beta } \in\underline {U}_\omega (X)\), where \(v_{\beta }=\mathcal {T} _\beta v_{\beta }\) and there exists f∗ ∈ F such that

$$\displaystyle \begin{aligned} v_{\beta}(x)=\mathcal{T}_{\beta} v_{\beta}(x)&=\max_{a\in A}\inf_{b\in B}\left[ u_0(x,a)+\frac{1}{2} \theta_0b^2 +\beta\int_X v_{\beta} (y)q(dy|x,a,b)\right] \notag \\ &=\inf_{b\in B}\left[u_0(x,f^*(x))+\frac{1}{2} \theta_0b^2 +\beta\int_X v_{\beta}(y)q(dy|x,f^*(x),b)\right] \end{aligned} $$
(5.14)

for x ∈ X. Finally, we may formulate the following result.

Proposition 5

Consider the system given in (5.13). Then, \(v_{\beta }\in\underline {U}_\omega (X)\) and there exists a stationary strategy f∗ such that (5.14) is satisfied for all x ∈ X. The strategy f∗ is optimal for the controller.

3.2 Average Reward Robust Markov Decision Process

In this subsection, we assume that u takes values in \(\mathbb {R}\) rather than in \( \underline {\mathbb {R}}\). Moreover, the action set of nature is independent of (x, a) ∈ KA, i.e., B(x, a) ≡ B, where B is a compact metric space. Obviously, (C3) is then immediately satisfied. Since we consider the average payoff in the maxmin control problem, we impose slightly stronger assumptions than in the previous subsection. Below are their counterparts.

  1. (C~1)

    For any x ∈ X, A(x) is compact and the set-valued mapping x → A(x) is continuous.

  2. (W~1)

    The payoff function u is continuous on K.

A strategy for the opponent is a sequence \(\gamma =(\gamma _n)_{n\in \mathbb {N }}\) of Borel measurable mappings \(\gamma _n: H_n^* \to B\) rather than a sequence of stochastic kernels. The set of all strategies for the opponent is denoted by \(\varGamma _0^*\).

For any initial state x ∈ X and strategies π ∈ Π, \(\gamma \in \varGamma ^*_0\), we set \(u_n^-(x,\pi ,\gamma )= E^{\pi \gamma }_{x}[u^-(x_n,a_n,b_n)]\), \(u_n^+(x,\pi ,\gamma )= E^{\pi \gamma }_{x}[u^+(x_n,a_n,b_n)]\), and \( u_n(x,\pi ,\gamma )= E^{\pi \gamma }_{x}[u(x_n,a_n,b_n)]\), provided that the integral is well defined, i.e., either \(u_n^+(x,\pi ,\gamma )<+\infty \) or \( u_n^-(x,\pi ,\gamma ) >-\infty \). Note that un(x, π, γ) is the expected payoff at stage n. For x ∈ X, strategies π ∈ Π, \(\gamma \in \varGamma _0^*\), and β ∈ (0, 1), we define \(J^+_{\beta }(x,\pi ,\gamma )\) and \( J^-_{\beta }(x,\pi ,\gamma )\) as in (5.1) and (5.2), respectively. Assuming that these expressions are finite, we define the expected discounted payoff to the controller as in (5.3). Clearly, the maxmin value vβ is defined as in the previous subsection, i.e.,

$$\displaystyle \begin{aligned} v_{\beta}(x) =\sup_{\pi\in\varPi} \inf_{\gamma\in\varGamma_0^*}J_{\beta}(x,\pi,\gamma). \end{aligned}$$

For any initial state x ∈ X, strategies π ∈ Π, \(\gamma \in \varGamma _0^*\), and \(n\in \mathbb {N}\), we let

$$\displaystyle \begin{aligned} J^-_{n}(x,\pi,\gamma) &:= E_{x}^{\pi\gamma}\left[ \sum_{m=1}^{n} u^-(x_{m} ,a_{m} ,b_{m})\right] \quad\mbox{and}\\ J^+_{n}(x,\pi,\gamma) &:= E_{x}^{\pi\gamma}\left[\sum_{m=1}^{n} u^+(x_{m} ,a_{m} ,b_{m})\right]. \end{aligned} $$

If these expressions are finite, we can define the total expected n-stage payoff to the controller as follows:

$$\displaystyle \begin{aligned} J_{n}(x,\pi,\gamma) :=J^+_{n}(x,\pi,\gamma) +J^-_{n}(x,\pi,\gamma).\end{aligned} $$

Clearly, we have that

$$\displaystyle \begin{aligned} J_{n}(x,\pi,\gamma) =\sum_{m=1}^{n}u_m(x,\pi,\gamma).\end{aligned} $$

Furthermore, we set

$$\displaystyle \begin{aligned} \overline{J}_n^- (x,\pi,\gamma)= \frac{J_{n}^- (x,\pi,\gamma)}{n}, \quad\quad \overline{J}_n^+ (x,\pi,\gamma)= \frac{J_{n}^+ (x,\pi,\gamma)}{n }, \end{aligned}$$

and

$$\displaystyle \begin{aligned} \overline{J}_n (x,\pi,\gamma)= \frac{J_{n} (x,\pi,\gamma)}{n}. \end{aligned}$$

The robust expected average payoff per unit time (average payoff, for short) is defined as follows:

$$\displaystyle \begin{aligned}\hat{R}(x,\pi) =\liminf_{n\to\infty}\inf_{\gamma\in\varGamma_0^*}\overline{J}_n (x,\pi,\gamma). \end{aligned} $$
(5.15)

A strategy \(\bar {\pi }\in \varPi \) is called an optimal robust strategy for the controller in the average payoff case, if \(\sup _{\pi \in \varPi }\hat {R} (x,\pi )=\hat {R}(x,\bar {\pi })\) for each x ∈ X.

We can now formulate our assumption.

  1. (D)

    There exist functions D+ : X → [1, ∞) and D− : X → [1, ∞) such that

    $$\displaystyle \begin{aligned} \overline{J}_n^+(x,\pi,\gamma) \le D^+(x)\quad\mbox{and}\quad |\overline{J} _n^-(x,\pi,\gamma)| \le D^-(x) \end{aligned}$$

    for every x ∈ X, π ∈ Π, \(\gamma \in \varGamma _0^*\) and \(n \in \mathbb { N}\). Moreover, D+ is continuous and the function (x, a, b) →∫XD+(y)q(dy|x, a, b) is continuous on K.

Condition (D) trivially holds if the payoff function u is bounded. Models with unbounded payoffs satisfying (D) are given in Jaśkiewicz and Nowak (2014) (see Examples 1 and 2). Our aim is to consider the robust expected average payoff per unit time. The analysis is based upon studying the so-called optimality inequality, which is obtained via the vanishing discount factor approach. However, we note that we cannot use the results from the previous subsection, since in our approach we must take a sequence of discount factors converging to one. Theorem 1 was obtained under assumption (M1). Unfortunately, in our case this assumption is useless. Clearly, if α > 1, as happens in Examples 1, 2, and 3, the requirement αβ < 1 is a limitation and makes it impossible to define a desirable sequence \((\beta _n)_{n\in \mathbb {N}}\) converging to one. Therefore, we first reconsider the robust discounted payoff model under a different assumption.

Put w(x) = D+(x)∕(1 − β), x ∈ X. Let \(\tilde {U}_w(X)\) be the space of all real-valued upper semicontinuous functions \(v: X\to \mathbb {R}\) such that v(x) ≤ w(x) for all x ∈ X. Assume now that \(\phi \in \tilde {U}_w(X)\) and f ∈ F. For every x ∈ X we set (recall (5.6))

$$\displaystyle \begin{aligned} T_\beta \phi(x)= \sup_{a\in A(x)}\inf_{b\in B}\left[u(x,a,b)+\beta\int_X \phi(y)q(dy|x,a,b) \right]. \end{aligned} $$
(5.16)

The following result is Theorem 1 in Jaśkiewicz and Nowak (2014).

Theorem 2

Assume (C~1), (W~1), (W2), and (D). Then, for each β ∈ (0, 1), \(v_{\beta }\in \tilde {U}_w(X)\), vβ = Tβvβ, and there exists f∗ ∈ F such that

$$\displaystyle \begin{aligned} v_{\beta}(x)= \inf_{b\in B}\left[u(x,f^*(x),b)+\beta\int_Xv_\beta(y)q(dy|x,f^*(x),b) \right],\quad x\in X. \end{aligned}$$

Moreover, \(v_{\beta }(x)=\sup _{\pi \in \varPi }\inf _{\gamma \in \varGamma ^*_0}J_{\beta }(x,\pi ,\gamma )=\inf _{\gamma \in \varGamma ^*_0}J_{\beta }(x,f^*,\gamma )\) for each x ∈ X, i.e., f∗ is optimal.

Remark 4

The proof of Theorem 2 is to some extent standard, but as mentioned we cannot apply the Banach contraction principle (see for instance Blackwell 1965 or Bertsekas and Shreve 1996). The majority of papers that deal with maximization of the expected discounted payoff assume that the one-stage payoff function is bounded from above (see Hernández-Lerma and Lasserre 1996; Schäl 1975) or it satisfies inequality (5.7). Neither requirement is met in this framework. Therefore, we have to consider truncated models and finite horizon maxmin problems.

In order to establish the optimality inequality, we shall need a generalized Tauberian relation, which plays a crucial role in proving Theorem 3 stated below.

For any sequence \((u_k)_{k\in \mathbb {N}}\) of real numbers, let \(\overline {u} _{n} := \frac {1}{n}\sum _{k=1}^n u_k\) for any \(n\in \mathbb {N}\). Fix a constant D ≥ 1 and consider the set SD of all sequences \((u_k)_{k\in \mathbb {N}}\) such that \(|\overline {u}_{n}| \le D\) for each \(n\in \mathbb {N}\). Assume now that the elements of the sequence \((u_k(\xi ))_{k\in \mathbb {N}} \in S_D\) may depend on ξ belonging to some set Ξ. Define

$$\displaystyle \begin{aligned} \overline{u}_{n}(\xi)= \frac{1}{n}\sum_{k=1}^n u_k(\xi)\end{aligned} $$

and

$$\displaystyle \begin{aligned} v_{\beta} = \inf_{\xi \in \varXi}(1-\beta)\sum_{k=1}^\infty \beta^{k-1} u_k(\xi)\quad\mbox{for }\quad \beta\in(0,1), \quad v_{n} := \inf_{\xi \in \varXi}\overline{u}_{n}(\xi).\end{aligned} $$

Proposition 6

Assume that \((u_n(\xi ))_{n\in \mathbb {N}}\in S_D\) for each ξ ∈ Ξ. Then, we have the following

$$\displaystyle \begin{aligned} \liminf_{\beta \to 1^-}v_{\beta} \ge \liminf_{n \to \infty}v_{n}. \end{aligned}$$

Proposition 6 extends Proposition 4 and Corollary 5 in Lehrer and Sorin (1992) that are established under the assumption that 0 ≤ un(ξ) ≤ 1 for every \(n\in \mathbb {N}\) and ξ ∈ Ξ. This result is related to the so-called Tauberian relations. Recent advances on this issue can be found in Renault (2014) (see also the discussion in Sect. 7). It is worth mentioning that Proposition 6 is also useful in the study of risk-sensitive control models (see Jaśkiewicz 2007 or Appendix in Jaśkiewicz and Nowak 2014).
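For intuition, the inequality in Proposition 6 can be illustrated numerically in the trivial case where Ξ is a singleton, i.e., for a single bounded sequence. The Python sketch below compares a few Abel (discounted) means with the Cesàro means for the hypothetical sequence uk = sin(log k), whose Cesàro means do not converge; it is an illustration only.

```python
import numpy as np

# Abel (discounted) means versus Cesaro means for a single hypothetical bounded
# sequence u_k = sin(log k); this is the case of Proposition 6 with a singleton Xi.
N = 200000
k = np.arange(1, N + 1)
u = np.sin(np.log(k))                              # Cesaro means of u oscillate
cesaro = np.cumsum(u) / k                          # (1/n) sum_{k<=n} u_k
for beta in (0.9, 0.99, 0.999):
    abel = (1 - beta) * np.sum(beta ** (k - 1) * u)
    print("beta =", beta, "Abel mean =", abel)
print("min of Cesaro means for n >= 1000:", cesaro[999:].min())
```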

Let us fix a state z ∈ X and define

$$\displaystyle \begin{aligned} h_\beta(x):= v_\beta(x)- v_\beta(z), \quad\mbox{for } x\in X \mbox{ and } \beta\in (0,1).\end{aligned} $$

Furthermore, we make the following assumptions.

  1. (B1)

    There exists a function M : X → (−∞, 0] such that infβ ∈ (0,1)hβ(x) ≥ M(x), and there exists a continuous function Q : X → [0, +∞) such that supβ ∈ (0,1)hβ(x) ≤ Q(x) for every x ∈ X. Moreover, the function (x, a, b) →∫XQ(y)q(dy|x, a, b) is continuous on K.

  2. (B2)

    For any x ∈ X, π ∈ Π, and \(\gamma \in \varGamma ^*_0\), it holds that

    $$\displaystyle \begin{aligned} \lim_{n\to\infty}\frac{E_x^{\pi\gamma}[Q(x_n)]}n=0. \end{aligned}$$

The main result in Jaśkiewicz and Nowak (2014) is as follows.

Theorem 3

Assume (C~1), (W~1), (W2), (D) , and (B1)–(B2) . Then, there exist a constant g, a real-valued upper semicontinuous function h, and a stationary strategy \(\bar {f}\in F\) such that

$$\displaystyle \begin{aligned} \begin{array}{rcl} h(x) + g &\displaystyle \le&\displaystyle \sup_{a\in A(x)}\inf_{b\in B} \left[u(x,a,b) + \int_X h(y)q(dy|x,a,b)\right]\\ &\displaystyle =&\displaystyle \inf_{b\in B} \left[u(x,\bar{f}(x),b)+ \int_X h(y)q(dy|x,\bar{f}(x),b)\right] \end{array} \end{aligned} $$

for x ∈ X. Moreover, \(g=\sup _{\pi \in \varPi }\hat {R}(x,\pi )=\hat {R}(x, \bar {f})\) for all x ∈ X, i.e., \(\bar {f}\) is an optimal robust strategy.

4 Discounted and Positive Stochastic Markov Games with Simultaneous Moves

From now on we assume that B(x, a) = B(x) is independent of a ∈ A(x) for each x ∈ X. Therefore, we now have \(K_{A}\in \mathcal {B}(X\times A)\),

$$\displaystyle \begin{aligned} K_{B}\in \mathcal{B}(X\times B),\quad\mbox{and}\quad K:= \{(x,a,b): x\in X, a\in A(x),b\in B(x)\}.\end{aligned} $$
(5.17)

Thus, at every stage \(n\in \mathbb {N}\), player 2 does not observe player 1’s action an ∈ A(xn) in state xn ∈ X. One can say that the players act simultaneously and play the standard discounted stochastic game as in the seminal work of Shapley (1953). It is assumed that both players know at every stage \(n\in \mathbb {N}\) the entire history of the game up to state xn ∈ X. Now a strategy for player 2 is a sequence \( \gamma =(\gamma _n)_{n\in \mathbb {N}}\) of Borel (or universally measurable) transition probabilities γn from Hn to B such that γn(B(xn)|hn) = 1 for each hn ∈ Hn. The set of all Borel (universally) measurable strategies for player 2 is denoted by Γ (Γu). Let G (Gu) be the set of all Borel (universally) measurable mappings \(g: X \to \Pr (B)\) such that \(g(x)\in \Pr (B(x))\) for all x ∈ X. Every g ∈ Gu induces a transition probability g(db|x) from X to B and is recognized as a randomized stationary strategy for player 2. A semistationary strategy for player 2 is determined by a Borel or universally measurable function \(g:X\times X \to \Pr (B)\) such that \( g(x,x^{\prime })\in \Pr (B(x^{\prime }))\) for all (x, x) ∈ X × X. Using a semistationary strategy, player 2 chooses an action bn ∈ B(xn) on any stage n ≥ 2 according to the probability measure g(x1, xn) depending on xn and the initial state x1. Let F (Fu) be the set of all Borel (universally) measurable mappings \(f: X \to \Pr (A)\) such that \( f(x)\in \Pr (A(x))\) for all x ∈ X. Then, F (Fu) can be considered as the set of all randomized stationary strategies for player 1. The set of all Borel (universally) measurable strategies for player 1 is denoted by Π (Πu). For any initial state x ∈ X, π ∈ Πu, γ ∈ Γu, the expected discounted payoff function Jβ(x, π, γ) is well defined under conditions (M1) and (M2). Since Π ⊂ Πu and Γ ⊂ Γu, Jβ(x, π, γ) is well defined for all π ∈ Π, γ ∈ Γ. If we restrict attention to Borel measurable strategies, then the lower value of the game is

$$\displaystyle \begin{aligned} \underline{v}_\beta(x)=\sup_{\pi\in\varPi}\inf_{\gamma\in\varGamma}J_{\beta}(x,\pi, \gamma) \end{aligned}$$

and the upper value of the game is

$$\displaystyle \begin{aligned} \overline{v}_\beta(x)=\inf_{\gamma\in\varGamma}\sup_{\pi\in\varPi}J_{\beta}(x,\pi, \gamma),\ x\in X. \end{aligned}$$

Suppose that the stochastic game has a value, i.e., \(v_{\beta }(x):=\underline {v}_\beta (x)=\overline {v}_\beta (x)\), for each x ∈ X. Then, under our assumptions (M1) and (M2), \(v_{\beta }(x)\in\underline {\mathbb {R}}\). Let \(\underline {X}:=\{x\in X: v_{\beta }(x)=-\infty \}\). A strategy π∈ Π is optimal for player 1 if

$$\displaystyle \begin{aligned} \inf_{\gamma\in\varGamma}J_{\beta}(x,\pi^*,\gamma)=v_{\beta}(x)\ \ \mbox{for all} \ x\in X. \end{aligned}$$

Let ε > 0 be fixed. A strategy γ∈ Γ is ε-optimal for player 2 if

$$\displaystyle \begin{aligned} \sup_{\pi\in\varPi}J_{\beta}(x,\pi,\gamma^*)&=v_{\beta}(x) \ \ \mbox{for all}\ \ x\in X\setminus \underline{X} \quad \mbox{and} \\ \sup_{\pi\in\varPi}J_{\beta}(x,\pi,\gamma^*) &< -\frac{1}{\varepsilon} \ \ \mbox{for all}\ x\in \underline{X}. \end{aligned} $$

Similarly, the value vβ and ε-optimal or optimal strategies can be defined in the class of universally measurable strategies. Let

$$\displaystyle \begin{aligned} \bar{K}_A: = \{(x,\nu): x\in X, \nu\in \Pr(A(x))\}, \quad \bar{K}_B: = \{(x,\rho): x\in X, \rho\in \Pr(B(x))\},\end{aligned} $$

and

$$\displaystyle \begin{aligned} \bar{K}: = \{(x,\nu,\rho): x\in X, \nu\in \Pr(A(x)), \rho \in \Pr(B(x))\}.\end{aligned} $$

For any \((x,\nu ,\rho )\in \bar {K}\) and \(D\in \mathcal {B}(X)\), define

$$\displaystyle \begin{aligned} u(x,\nu,\rho):= \int_{A(x)}\int_{B(x)}u(x,a,b)\rho(db)\nu(da)\end{aligned} $$

and

$$\displaystyle \begin{aligned} q(D|x,\nu,\rho):= \int_{A(x)}\int_{B(x)}q(D|x,a,b)\rho(db)\nu(da).\end{aligned} $$

If f ∈ Fu and g ∈ Gu, then

$$\displaystyle \begin{aligned} u(x,f,g):= u(x,f(x),g(x)) \quad \mbox{and}\quad q(D|x,f,g):= q(D|x,f(x),g(x)). \end{aligned} $$
(5.18)

For any \((x,\nu ,\rho )\in \bar {K}\) and \(\phi \in\underline {U}_\omega (X)\), define

$$\displaystyle \begin{aligned} L_\beta \phi(x,\nu,\rho)= u(x,\nu,\rho)+\beta\int_X \phi(y)q(dy|x,\nu,\rho) \end{aligned} $$
(5.19)

and

$$\displaystyle \begin{aligned} T_\beta \phi(x)= \max_{\nu\in \Pr(A(x))}\inf_{\rho \in \Pr(B(x))}L_\beta \phi(x,\nu,\rho). \end{aligned} $$
(5.20)

By Lemma 7 in Jaśkiewicz and Nowak (2011), the operator Tβ is well defined, and using the maximum theorem of Berge (1963), it can be proved that \(T_\beta \phi \in\underline {U}_\omega (X)\) for any \(\phi \in\underline {U}_\omega (X)\).

Theorem 4

Assume (C1), (W1)–(W2), and (M1)–(M2). In addition, let the correspondence x → B(x) be lower semicontinuous and let every set B(x) be a complete subset of B. Then, the game has a value \(v_{\beta } \in\underline {U}_\omega (X)\), player 1 has an optimal stationary strategy f* ∈ F, and

$$\displaystyle \begin{aligned} T_\beta v_{\beta}(x){=}v_{\beta}(x){=} \max_{\nu\in \Pr(A(x))}\inf_{\rho\in \Pr(B(x))}L_\beta v_{\beta} (x,\nu,\rho){=} \inf_{\rho\in \Pr(B(x))}L_\beta v_{\beta}(x,f^*(x),\rho) \end{aligned}$$

for each x ∈ X. Moreover, for any ε > 0, player 2 has an ε-optimal Borel measurable semistationary strategy.

The assumption that every B(x) is complete in B is made to ensure that G ≠ ∅ (see Kuratowski and Ryll-Nardzewski 1965). The construction of an ε-optimal semistationary strategy for player 2 is based on using “truncated games” \(\mathcal {G}_k\) with the payoff functions \(u_k:= \max \{u, -k\}\), \(k\in \mathbb {N}\). In every game \(\mathcal {G}_k\) player 2 has an \(\frac {\varepsilon }{2}\)-optimal stationary strategy, say \(g^*_k\in G\). If vβ,k is the value function of the game \(\mathcal {G}_k\), then it is shown that \(v_{\beta }(x)= \inf _{k\in \mathbb {N}}v_{\beta ,k}(x)\) for all x ∈ X. This fact can be easily used to construct a measurable partition {Xk}k∈Z of the state space (\(Z\subset \mathbb {N}\)) such that \( v_{\beta }(x) >v_{\beta ,k}(x)-\frac {\varepsilon }{2}\) for all x ∈ Xk, k ∈ Z. If \(g^*(x,x^{\prime }):=g^{*}_k(x^{\prime })\) for every x ∈ Xk, k ∈ Z, and for each x′ ∈ X, then g* is an ε-optimal semistationary strategy for player 2. The above definition is valid if vβ(x) > −∞ for all x ∈ X. If vβ(x) = −∞ for some state x ∈ X, then the reader is referred to the proof of Theorem 2 in Jaśkiewicz and Nowak (2011), where a modified construction of the ε-optimal semistationary strategy is provided.

Remark 5

Zero-sum discounted stochastic games with a compact metric state space and weakly continuous transitions were first studied by Maitra and Parthasarathy (1970). Kumar and Shiau (1981) extended their result to Borel state space games with bounded continuous payoff functions and weakly continuous transitions. Couwenbergh (1980) studied continuous games with unbounded payoffs and a metric state space using the weighted supremum norm approach introduced by Wessels (1977). He proved that both players possess optimal stationary strategies. In order to obtain such a result, additional conditions should be imposed. Namely, the function u is continuous and such that |u(x, a, b)| ≤ dω(x) for some constant d > 0 and all (x, a, b) ∈ K. Moreover, the mappings x → A(x) and x → B(x) are compact valued and continuous. It should be noted that our condition (M2) allows for a much larger class of models and is less restrictive for discount factors compared with the weighted supremum norm approach. We also point out that a class of zero-sum lower semicontinuous stochastic games with weakly continuous transition probabilities and nonadditive payoff functions bounded from below was studied by Nowak (1986).

A similar result can also be proved under the following conditions:

  1. (C4)

    A(x) is compact for each x ∈ X.

  2. (C5)

    The payoff function u is Borel measurable and u(x, ⋅, b) is upper semicontinuous and q(D|x, ⋅, b) is continuous on A(x) for any \(D\in \mathcal {B}(X)\), x ∈ X, b ∈ B(x).

A simple modification of the proof of Theorem 2 in Jaśkiewicz and Nowak (2011) using appropriately adapted theorems on measurable minmax selections proved in Nowak (1985b) yields the following result:

Theorem 5

Assume (C4)–(C5) and (M1)–(M2). Then, the game has a value vβ, which is a lower semianalytic function on X. Player 1 has an optimal stationary strategy f* ∈ Fu and

$$\displaystyle \begin{aligned} T_\beta v_{\beta}(x)=v_{\beta}(x)= \max_{\nu\in \Pr(A(x))}\inf_{\rho\in \Pr(B(x))}L_\beta v_{\beta} (x,\nu,\rho)= \inf_{\rho\in \Pr(B(x))}L_\beta v_{\beta}(x,f^*(x),\rho) \end{aligned}$$

for each x ∈ X. Moreover, for any ε > 0, player 2 has an ε-optimal universally measurable semistationary strategy.

Maitra and Parthasarathy (1971) first studied positive stochastic games, where the stage payoff function u ≥ 0 and β = 1. The extended payoff in a positive stochastic game is

$$\displaystyle \begin{aligned} J_p(x,\pi,\gamma):= E^{\pi\gamma}_x\left(\sum_{n=1}^\infty u(x_n,a_n,b_n)\right), \quad x=x_1\in X,\ \pi\in\varPi,\ \gamma\in\varGamma.\end{aligned} $$

Using standard iteration arguments as in Strauch (1966) or Bertsekas and Shreve (1996), one can show that Jp(x, π, γ) < ∞ if and only if there exists a nonnegative universally measurable function w on X such that the following condition holds:

$$\displaystyle \begin{aligned} u(x,a,b)+\int_Xw(y)q(dy|x,a,b) \le w(x)\quad\mbox{for all}\quad (x,a,b)\in K. \end{aligned} $$
(5.21)
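
To sketch the sufficiency part of this equivalence, note that integrating (5.21) along any pair of strategies (π, γ) and any initial state x = x1, and using u ≥ 0 and w ≥ 0, one obtains for every \(n\in \mathbb {N}\)

$$\displaystyle \begin{aligned} E^{\pi\gamma}_x\left(\sum_{k=1}^{n} u(x_k,a_k,b_k)\right)\le E^{\pi\gamma}_x\left(\sum_{k=1}^{n} u(x_k,a_k,b_k)+w(x_{n+1})\right)\le w(x), \end{aligned}$$

and letting n →∞, the monotone convergence theorem yields Jp(x, π, γ) ≤ w(x) < ∞.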

Value functions and ε-optimal strategies are defined in positive stochastic games in an obvious manner. Studying positive stochastic games, it is convenient to use approximation of Jp(x, π, γ) from below by Jβ(x, π, γ) as β goes to 1. To make this method effective we must change our assumptions on the primitives in the way described below.

  1. (C6)

    B(x) is compact for each x ∈ X.

  2. (C7)

    The payoff function u is Borel measurable and u(x, a, ⋅) is lower semicontinuous and q(D|x, a, ⋅) is continuous on B(x) for any \(D\in \mathcal {B}(X)\), x ∈ X, a ∈ A(x).

As noted in the preliminaries, assumption (C6) implies that ∅ ≠ G ⊂ Gu and Fu ≠ ∅. Let L1 and T1 be the operators defined as in (5.19) and (5.20), respectively, but with β = 1.

Theorem 6

Assume that (5.21) and (C6)–(C7) hold. Then the positive stochastic game has a value function vp which is upper semianalytic and vp(x) = supβ∈(0,1)vβ(x) for all x ∈ X. Moreover, vp is the smallest nonnegative upper semianalytic solution to the equation

$$\displaystyle \begin{aligned} T_1v(x)=v(x), \quad x\in X. \end{aligned}$$

Player 2 has an optimal stationary strategy g* ∈ Gu such that

$$\displaystyle \begin{aligned} T_1v_p(x)= \sup_{\nu\in \Pr(A(x))}\min_{\rho\in \Pr(B(x))}L_1v_p(x,\nu,\rho)= \sup_{\nu\in \Pr(A(x))}L_1v_p(x,\nu,g^*(x)), \quad x\in X \end{aligned}$$

and for any ε > 0, player 1 has an ε-optimal universally measurable semistationary strategy.

Theorem 6 is a version of Theorem 5.4 in Nowak (1985a). Some special cases under much stronger continuity assumptions were considered by Maitra and Parthasarathy (1971) for games with compact state spaces and by Kumar and Shiau (1981) for games with a Borel state space and finite action sets in each state. An essential part of the proof of Theorem 6 is Proposition 2.

A similar result holds for positive semicontinuous games satisfying the following conditions:

  1. (C8)

    For any x ∈ X, A(x) is a complete set in A and the correspondence x → A(x) is lower semicontinuous.

  2. (C9)

    For any x ∈ X, B(x) is compact and the correspondence x → B(x) is upper semicontinuous.

  3. (W3)

    u ≥ 0 and u is lower semicontinuous on K.

Theorem 7

Assume (5.21), (C8)–(C9) , and (W3) . Then, the positive stochastic game has a value function vp which is lower semicontinuous and vp(x) =supβ ∈ (0,1)vβ(x) for all x  X. Moreover, vp is the smallest nonnegative lower semicontinuous solution to the equation

$$\displaystyle \begin{aligned} T_1v(x)=v(x), \quad x\in X. \end{aligned} $$
(5.22)

Player 2 has an optimal stationary strategy g* ∈ G such that

$$\displaystyle \begin{aligned} T_1v_p(x)= \sup_{\nu\in \Pr(A(x))}\min_{\rho\in \Pr(B(x))}L_1v_p(x,\nu,\rho)= \sup_{\nu\in \Pr(A(x))}L_1v_p(x,\nu,g^*(x)), \quad x\in X \end{aligned}$$

and for any ε > 0, player 1 has an ε-optimal Borel measurable semistationary strategy.

The proof of Theorem 7 is similar to that of Theorem 6 and makes use of Proposition 3.

Player 1 need not have an optimal strategy even if X is finite. This is shown in Kumar and Shiau (1981) in Example 1 (see also pages 192–193 in Maitra and Sudderth 1996), which was inspired by Everett (1957). We present this example below.

Example 4

Let X = {−1, 0, 1}, A = {0, 1}, B = {0, 1}. States x = −1 and x = 1 are absorbing with zero payoffs. If x = 0 and both players choose the same actions (a = 1 = b or a = 0 = b), then u(x, a, b) = 1 and q(−1|0, a, b) = 1. Moreover, q(0|0, 0, 1) = q(1|0, 1, 0) = 1 and u(0, 0, 1) = u(0, 1, 0) = 0. It is obvious that vp(−1) = 0 = vp(1). In state x = 0 we obtain the equation vp(0) = 1∕(2 − vp(0)), which yields the solution vp(0) = 1. In this game player 1 has no optimal strategy.
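
The derivation of this equation can be written out explicitly (with rows corresponding to the actions 0 and 1 of player 1 and columns to those of player 2): writing v = vp(0) for the continuation value, the optimality equation at x = 0 reads

$$\displaystyle \begin{aligned} v= \mathrm{val} \left[ \begin{array}{rr} 1 \ \ & v\\ 0 \ \ & 1 \end{array}\right]=\frac{1\cdot 1-v\cdot 0}{1+1-v-0}=\frac{1}{2-v}\quad \mbox{(for } v<1\mbox{)}, \end{aligned}$$

since for v < 1 this matrix game has no pure saddle point; the fixed point equation v(2 − v) = 1, i.e., (v − 1)2 = 0, has the unique solution v = 1.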

If player 2 is a dummy, i.e., every set B(x) is a singleton, X is a countable set, and vp is bounded on X, then by a result of Ornstein (1969) player 1 has a stationary ε-optimal strategy. A counterpart of this result does not hold for positive stochastic games.

Example 5

Let \(X=\mathbb {N}\cup \{0\}\), A = {1, 2}, B = {1, 2}. State x = 0 is absorbing with zero payoffs. Let x ≥ 2 and a = 1. Then u(x, 1, b) = 0 for b ∈ B and q(x − 1|x, 1, 1) = q(x + 1|x, 1, 2) = 1. If x ≥ 2 and a = 2, then u(x, 2, 1) = 0 and u(x, 2, 2) = 1. In both cases (b = 1 or b = 2) the game moves to the absorbing state x = 0 with probability one. If x = 1, then u(1, a, b) = 1 and q(0|1, a, b) = 1 for all a ∈ A and b ∈ B. It is obvious that vp(0) = 0 and vp(1) = 1. It is shown that vp(x) = (x + 1)∕(2x) for x ≥ 2 and player 1 has no stationary ε-optimal strategy. It is easy to check that the function vp given here is a solution to equation (5.22). It may be interesting to note that v(0) = 0, v(x) = 1 for x ≥ 1 is also a solution to equation (5.22) and v(x) > vp(x) for x > 1. For details see the counterexample in Nowak and Raghavan (1991), an interesting modification of which, called the “Big Match on the integers,” was studied by Fristedt et al. (1995).
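
To verify the claim about vp, fix x ≥ 2 and consider the auxiliary matrix game appearing in (5.22) (rows correspond to the actions a = 1, 2 of player 1 and columns to the actions b = 1, 2 of player 2):

$$\displaystyle \begin{aligned} \left[ \begin{array}{rr} v_p(x-1) \ \ & v_p(x+1)\\ 0 \ \ & 1 \end{array}\right]. \end{aligned}$$

This game has no pure saddle point, and its value equals

$$\displaystyle \begin{aligned} \frac{v_p(x-1)\cdot 1 - v_p(x+1)\cdot 0}{v_p(x-1)+1-v_p(x+1)-0}= \frac{\frac{x}{2(x-1)}}{\frac{x^2}{x^2-1}}=\frac{x+1}{2x}=v_p(x), \end{aligned}$$

so T1vp(x) = vp(x) for x ≥ 2; the equation at the states x = 0 and x = 1 is immediate.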

The assumption that q(D|x, a, ⋅) is continuous on B(x) for each (x, a) ∈ KA and \(D\in \mathcal {B} (X)\) is weaker than the norm continuity of q(⋅|x, a, b) in b ∈ B(x). However, from the point of view of applications, e.g., in dynamic economic models or engineering problems, the weak continuity assumption of q(⋅|x, a, b) in (x, a, b) ∈ K is more useful (see Remark 2).

We close this section with a remark on the weighted evaluation proposed for Markov decision models in Krass et al. (1992) and for zero-sum stochastic games in Filar and Vrieze (1992). The criterion is either a convex combination of discounted evaluation and an average evaluation or a convex combination of two discounted evaluations. In the first case, it is proved that the value of the game exists and that both players have 𝜖-optimal strategies. In the second case, it is shown that the value is the unique solution of some system of functional equations and that both players have optimal Markov policies. The idea of using the weighted evaluations was applied to the study of nonzero-sum stochastic games (with finite state and action sets) by Flesch et al. (1999). Zero-sum perfect information games under the weighted discounted payoff criterion were studied by Altman et al. (2000). We would like to point out that discounted utility (payoff) functions belong to the class of “recursive utilities” extensively examined in economics (see Miao 2014). It seems, however, that the weighted discounted utilities are not in this class.

5 Zero-Sum Semi-Markov Games

In this section, we study zero-sum semi-Markov games on a general state space with possibly unbounded payoffs. Different limit-average expected payoff criteria can be used for such games, but under some conditions they turn out to be equivalent. Such games are characterized by the fact that the time between jumps is a random variable with distribution dependent on the state and actions chosen by the players. Most of the primitive data for the game model considered here are as in Sect. 4. More precisely, let \( K_{A}\in \mathcal {B} (X\times A)\) and \(K_{B}\in \mathcal {B}( X\times B)\). Then, the set K in (5.17) is Borel. As in Sect. 4 we assume that A(x) and B(x) are the admissible action sets for players 1 and 2, respectively, in state x ∈ X. Let Q be a transition probability from K to [0, ∞) × X. Hence, if a ∈ A(x) and b ∈ B(x) are actions chosen by the players in state x, then for \(D\in \mathcal {B}(X)\) and t ≥ 0, Q([0, t] × D|x, a, b) is the probability that the sojourn time of the process in x will be not greater than t and the next state x′ will be in D. Let k = (x, a, b) ∈ K. Clearly, q(D|k) = Q([0, ∞) × D|k) is the transition law of the next state. The mean holding time given k is defined as

$$\displaystyle \begin{aligned} \tau(k)=\int_0^{+\infty} tH(dt|k), \end{aligned}$$

where H(t|k) = Q([0, t] × X|k) is a distribution function of the sojourn time. The payoff function to player 1 is a Borel measurable function \(u: K\to \mathbb {R}\) and is usually of the form

$$\displaystyle \begin{aligned} u(x,a,b)=u^1(x,a,b)+u^2(x,a,b)\tau(x,a,b),\quad (x,a,b)\in K,\end{aligned} $$
(5.23)

where u1(x, a, b) is an immediate reward obtained at the transition time and u2(x, a, b) is the reward rate in the time interval between successive transitions.

The game starts at T1 := 0 and is played as follows. If the initial state is x1 ∈ X and the actions (a1, b1) ∈ A(x1) × B(x1) are selected by the players, then the immediate payoff u1(x1, a1, b1) is incurred for player 1 and the game remains in state x1 for a random time T2 that enjoys the probability distribution H(⋅|x1, a1, b1). The payoff u2(x1, a1, b1) to player 1 is incurred until the next transition occurs. Afterwards the system jumps to the state x2 according to the transition law q(⋅|x1, a1, b1). The situation repeats itself yielding a trajectory (x1, a1, b1, t2, x2, a2, b2, t3, …) of some stochastic process, where xn, an, bn and tn+1 describe the state, the actions chosen by the players, and the decision epoch, respectively, on the nth stage of the game. Clearly, tn+1 is a realization of the random variable Tn+1, and H(⋅|xn, an, bn) is a distribution function of the random variable Tn+1 − Tn for any \(n\in \mathbb {N}\).
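
For illustration, the following minimal simulation sketch (with purely illustrative choices: a toy two-state model, exponential sojourn times, and deterministic transitions) generates a trajectory (x1, a1, b1, t2, x2, a2, b2, t3, …) under stationary strategies and estimates the long-run accumulated reward per unit of elapsed time, the quantity underlying the average payoff criteria introduced below.

```python
import random

# A toy two-state semi-Markov game used only for illustration. The stage payoff has
# the form (5.23): an immediate reward u1 plus a reward rate u2 earned until the next jump.
STATES = [0, 1]
ACTIONS = [0, 1]                       # A(x) = B(x) = {0, 1} in both states (illustrative)

def u1(x, a, b):                       # immediate reward at a transition
    return 1.0 if a == b else 0.0

def u2(x, a, b):                       # reward rate between transitions
    return 0.5 * x

def sojourn(x, a, b):                  # illustrative sojourn time distribution H(.|x,a,b)
    return random.expovariate(1.0 + x + a + b)

def next_state(x, a, b):               # illustrative transition law q(.|x,a,b), deterministic here
    return x if a == b else 1 - x

def simulate(f, g, n_jumps=100_000, x=0):
    """Simulate one trajectory under stationary strategies f, g (state -> action
    probabilities) and return the accumulated reward per unit of elapsed time."""
    reward, clock = 0.0, 0.0
    for _ in range(n_jumps):
        a = random.choices(ACTIONS, weights=f[x])[0]
        b = random.choices(ACTIONS, weights=g[x])[0]
        t = sojourn(x, a, b)
        reward += u1(x, a, b) + u2(x, a, b) * t   # payoff (5.23) accrued on this stage
        clock += t
        x = next_state(x, a, b)
    return reward / clock

f = {0: [0.5, 0.5], 1: [1.0, 0.0]}     # a randomized stationary strategy for player 1
g = {0: [0.5, 0.5], 1: [0.0, 1.0]}     # a randomized stationary strategy for player 2
print(simulate(f, g))
```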

Strategies and their sets for both players are defined in a similar way as in Sect. 4. The only difference now is that the history of the process also includes the jump epochs, i.e., hn = (x1, a1, b1, t2, …, xn) is the history of the process up to the nth state.

Let N(t) be the number of jumps that have occurred prior to time t, i.e., \(N(t) = \max \{n \in \mathbb {N}:\ T_n \le t\}\). Under our assumptions, for each initial state x ∈ X and any strategies (π, γ) ∈ Π × Γ, we have \(P^{\pi \gamma }_x (N(t) < \infty) = 1\) for any t ≥ 0.

For any pair of strategies (π, γ) ∈ Π × Γ and an initial state x ∈ X, we define

  • the ratio average payoff

    $$\displaystyle \begin{aligned} \hat{J}(x,\pi,\gamma) = \liminf_{n \to\infty} \frac {E_x^{\pi\gamma} (\sum_{k=1}^{n} u(x_k,a_{k},b_k))} {E_x^{\pi\gamma}(\sum_{k=1}^{n} \tau(x_k,a_{k},b_k))};\end{aligned} $$
    (5.24)
  • the time average payoff

    $$\displaystyle \begin{aligned} \hat{j}(x,\pi,\gamma) = \liminf_{t \to\infty} \frac {E_x^{\pi\gamma}( \sum_{n=1}^{N(t)} u(x_n,a_{n},b_n))}t,\end{aligned} $$
    (5.25)

where \(E_x^{\pi \gamma }\) is the expectation operator corresponding to the unique measure \(P_x^{\pi \gamma }\) defined on the space of all histories of the process starting at x and induced by q, H, and strategies π ∈ Π and γ ∈ Γ.

Remark 6

(a) The definition of the average reward in (5.25) is more natural for semi-Markov games, since it takes into account the continuous-time nature of such processes. Formally, the time average payoff should be defined as follows:

$$\displaystyle \begin{aligned} \hat{j}(x,\pi,\gamma) {=} \liminf_{t \to\infty} \frac {E_x^{\pi\gamma}\left(\sum_{n=1}^{N(t)} u(x_n,a_{n},b_n)\,{+}\,(T_{N(t)+1}\,{-}\,t)u^2(x_{N(t)},a_{N(t)},b_{N(t)})\right)}t.\end{aligned} $$

However, from Remark 3.1 in Jaśkiewicz (2009), it follows that the assumptions imposed on the game model with the time average payoff imply that

$$\displaystyle \begin{aligned} \lim_{t \to\infty} \frac {E_x^{\pi\gamma}\left((T_{N(t)+1}-t)u^2(x_{N(t)},a_{N(t)},b_{N(t)})\right)}t=0.\end{aligned} $$

Finally, it is worth emphasizing that the study of the payoff defined in (5.25) requires additional tools and methods (such as renewal theory, martingale theory, and an analysis of the returns of the underlying process to the so-called small set) compared with the model with the ratio average payoff (5.24).

(b) It is worth mentioning that payoff criteria (5.24) and (5.25) need not coincide even for stationary policies and may lead to different optimal policies. Such situations happen if the Markov chain induced by stationary strategies is not ergodic (see Feinberg 1994).

We shall need the following continuity-compactness, ergodicity, and regularity assumptions.

  1. (C10)

    The set-valued mappings x → A(x) and x → B(x) are continuous; moreover, A(x) and B(x) are compact for each x ∈ X.

  2. (C11)

    The functions u and τ are continuous on K, and there exist a positive constant d and a continuous function ω : X → [1, ∞) such that

    $$\displaystyle \begin{aligned} \tau(x,a,b)\le d\omega(x),\quad\quad |u(x,a,b)| \le d\omega(x),\end{aligned} $$

    for all (x, a, b) ∈ K.

  3. (C12)

    The function (x, a, b) →∫Xω(y)q(dy|x, a, b) is continuous.

  4. (GE1)

    There exists a Borel set C ⊂ X such that for some \( \hat {\lambda } \in (0,1)\) and η > 0, we have

    $$\displaystyle \begin{aligned} \int_X \omega(y)q(dy|x,a,b)\le \hat{\lambda}\omega(x)+\eta 1_C(x) \end{aligned} $$

    for each (x, a, b) ∈ K, with ω introduced in (C11).

  5. (GE2)

    The function ω is bounded on C, that is,

    $$\displaystyle \begin{aligned} \omega_{C} := \sup_{x \in C} \omega(x) < \infty.\end{aligned} $$
  6. (GE3)

    There exist some δ ∈ (0, 1) and a probability measure μ on C with the property that

    $$\displaystyle \begin{aligned} q(D|x,a,b) \ge \delta\mu (D),\end{aligned} $$

    for each Borel set D ⊂ C, x ∈ C, a ∈ A(x), and b ∈ B(x).

  7. (R1)

    There exist κ > 0 and ξ < 1 such that

    $$\displaystyle \begin{aligned} H(\kappa|x,a,b)\le \xi,\end{aligned} $$

    for all x ∈ C, a ∈ A(x) and b ∈ B(x). Moreover, τ(x, a, b) ≤ d for all (x, a, b) ∈ K.

  8. (R2)

    There exists a decreasing function α such that α(0) ≤ d, \(\lim _{t\to \infty }\alpha (t)= 0\), and

    $$\displaystyle \begin{aligned} \int_t^\infty sH(ds|x,a,b)\le \alpha(t) \end{aligned}$$

    for all (x, a, b) ∈ K. Moreover, \(\lim _{t\to \infty }\sup _{x\in C}\sup _{a\in A(x),\, b\in B(x)}[1-H(t|x,a,b)]= 0\).

  9. (C13)

    There exists an open set \(\widetilde {C}\subset C\) such that \( \mu (\widetilde {C})>0\).

For any Borel function \(v: X\to \mathbb {R}\), we define the ω-norm as in (5.5). By \(\mathcal {B}_\omega (X)\) we denote the set of all Borel measurable functions with finite ω-norm.

Remark 7

  1. (a)

    In the theory of Markov chains, assumption (GE3) implies that the process generated by the stationary strategies of the players and the transition law q is φ-irreducible and aperiodic. The irreducibility measure can be defined as follows:

    $$\displaystyle \begin{aligned} \varphi(D):=\delta\mu(D\cap C)\quad\mbox{for}\quad D\in \mathcal{B}(X). \end{aligned}$$

In other words, if φ(D) > 0, then the probability of reaching the set D is positive, independent of the initial state. The set C is called a “small set.”

The function ω in (GE1)–(GE2) is, up to a multiplicative constant, a bound for the average time of the first entry of the process to the set C (Theorem 14.2.2 in Meyn and Tweedie 2009).

Assumptions (GE) imply that the underlying Markov chain \((x_n)_{n\in \mathbb {N}}\) induced by a pair of stationary strategies (f, g) ∈ F × G of the players possesses a unique invariant probability measure πfg. Moreover, \((x_n)_{n\in \mathbb {N}}\) is ω-uniformly ergodic (see Meyn and Tweedie 1994), i.e., there exist constants θ > 0 and \(\hat {\alpha }<1\) such that

$$\displaystyle \begin{aligned} \left|\int_X \phi(y)q^{(n)}(dy|x,f,g)-\int_X \phi(y)\pi_{fg}(dy) \right| \le \|\phi\|{}_\omega\theta \omega(x)\hat{\alpha}^n \end{aligned} $$
(5.26)

for every \(\phi \in \mathcal {B}_\omega (X)\) and x ∈ X, n ≥ 1. Here q(n)(⋅|x, f, g) denotes the n-step transition probability induced by q, f ∈ F, and g ∈ G. Clearly, for integers n ≥ 2 and \(D \in \mathcal {B}(X)\), we have

$$\displaystyle \begin{aligned} q^{(n)}(D|x,f,g):= \int_Xq^{(n-1)}(D|y,f,g) q(dy|x,f,g) \end{aligned}$$

and q(1)(D|x, f, g) := q(D|x, f, g). From (5.26) we conclude that

$$\displaystyle \begin{aligned} \hat{J}(f,g) := \hat{J}(x,f,g) = \frac{\int_{X} u(y,f,g)\pi_{fg}(dy)}{\int_{X} \tau(y,f,g)\pi_{fg}(dy)}, \quad x\in X, \end{aligned} $$
(5.27)

for every f ∈ F and g ∈ G, that is, the average payoff is independent of the initial state. Obviously, τ(x, f, g) = τ(x, f(x), g(x)) (see (5.18)). Consult also Proposition 10.2.5 in Hernández-Lerma and Lasserre (1999) and Theorem 3.6 in Kartashov (1996) for similar types of assumptions that lead to ω-ergodicity of the underlying Markov chains induced by stationary strategies of the players. The reader is also referred to Arapostathis et al. (1993) for an overview of ergodicity assumptions.

  1. (b)

    Condition (R1) ensures that an infinite number of transitions cannot occur in a finite time interval when the process is in the set C. Indeed, when the process is outside the set C, assumption (GE) implies that the process governed by any strategies of the players returns to the set C within a finite number of transitions with probability one. Then, (R1) prevents the process from exploding in the set C. As an immediate consequence of (R1), we get that τ(x, a, b) > κ(1 − ξ) for all (x, a, b) ∈ K with x ∈ C. Assumption (R2) is a technical condition used in the proof of the equivalence of the aforementioned two average payoff criteria.

In order to formulate the first result, we replace the function ω by a new one \(W(x):=\omega (x)+\frac \eta \delta \) that satisfies the following inequality:

$$\displaystyle \begin{aligned} \int_X W(y)q(dy|x,a,b)\le\lambda^* W(x)+\delta1_C(x)\int_CW(y)\mu(dy), \end{aligned}$$

for (x, a, b) ∈ K and a suitably chosen λ* ∈ (0, 1) (see Lemma 3.2 in Jaśkiewicz 2009). Observe that if we define the subprobability measure p(⋅|x, a, b) := q(⋅|x, a, b) − δ1C(x)μ(⋅), then

$$\displaystyle \begin{aligned} \int_X W(y)p(dy|x,a,b)\le\lambda^*W(x). \end{aligned}$$

The above inequality plays a crucial role in the application of the fixed point argument in the proof of Theorem 8 given below.

Similarly as in (5.5) we define ∥⋅∥W and the set \(\mathcal {B }_W(X)\). For each average payoff, we define the lower value, upper value, and the value of the game in an obvious way.

The first result summarizes Theorem 4.1 in Jaśkiewicz (2009) and Theorem 1 in Jaśkiewicz (2010).

Theorem 8

Assume (C10)–(C13), (GE1)–(GE3), and (W2) . Then, the following hold:

  1. (a)

    There exist a constant v and \(h^*\in \mathcal {B}_{W}(X), \) which is continuous and such that

    $$\displaystyle \begin{aligned} \begin{array}{rcl} {} h^*(x)&\displaystyle =&\displaystyle \mathrm{val}\left[u(x,\cdot,\cdot)-v\tau(x,\cdot,\cdot)+\int_X h^*(y)q(dy|x,\cdot,\cdot)\right]\\&\displaystyle =&\displaystyle \sup_{\nu\in \Pr(A(x))}\inf_{\rho\in \Pr(B(x))} \left[u(x,\nu,\rho)-v\tau(x,\nu,\rho)+\int_Xh^*(y)q(dy|x,\nu,\rho)\right]\\&\displaystyle =&\displaystyle \inf_{\rho\in \Pr(B(x))}\sup_{\nu\in \Pr(A(x))} \left[u(x,\nu,\rho)-v\tau(x,\nu,\rho)+\int_X h^*(y)q(dy|x,\nu,\rho)\right]\vspace{-3pt} \end{array} \end{aligned} $$
    (5.28)

    for all x  X.

  2. (b)

    The constant v is the value of the game with the average payoff defined in (5.24).

  3. (c)

    There exists a pair \((\hat {f},\hat {g})\in F\times G\) such that

$$\displaystyle \begin{aligned} \begin{array}{rcl} h^*(x)&\displaystyle =&\displaystyle \inf_{\rho\in \Pr(B(x))} \left[u(x,\hat{f}(x),\rho)-v\tau(x,\hat{f}(x),\rho)+\int_Xh^*(y)q(dy|x,\hat{f}(x),\rho)\right]\\ &\displaystyle =&\displaystyle \sup_{\nu\in \Pr(A(x))} \left[u(x,\nu,\hat{g}(x))-v\tau(x,\nu,\hat{g}(x))+\int_X h^*(y)q(dy|x,\nu,\hat{g}(x))\right]\vspace{-3pt} \end{array} \end{aligned} $$

for all x  X. The stationary strategy \(\hat {f}\in F\) (\(\hat {g}\in G\)) is optimal for player 1 (player 2).

The proof of Theorem 8 owes much to the approach introduced by Vega-Amaya (2003), who used a fixed point argument in the game model with setwise continuous transition probabilities. However, we cannot directly apply a fixed point argument. First, we have to regularize (to smooth in some sense) certain functions. Using this smoothing method, we are able to apply the Banach fixed point theorem in the space of lower semicontinuous functions that are bounded in the W-norm. It is worth mentioning that the contraction operator for any lower semicontinuous function \(h:X\to \mathbb {R}\) is of the form

$$\displaystyle \begin{aligned} (\hat{T}h)(x):=\inf_{\rho\in \Pr(B(x))}\sup_{\nu\in \Pr(A(x))} \varPhi^h(x,\nu,\rho),\end{aligned} $$

where

$$\displaystyle \begin{aligned} \varPhi^h(\bar{k}):=\liminf_{d(k^{\prime },\bar{k})\to 0}\left(u(k^{\prime })- \mathcal{V}\tau(k^{\prime })+\int_X h(y)p(dy|k^{\prime })\right),\end{aligned} $$

d is a metric on \(X\times \Pr (A)\times \Pr (B)\), and

$$\displaystyle \begin{aligned} \mathcal{V}:=\sup_{f\in F}\inf_{g\in G} \hat{J}(f,g) \end{aligned}$$

is the lower value (in the class of stationary strategies) of the game with the payoff function defined in (5.24). Next, it is proved that \(\bar {k} \to \varPhi ^h(\bar {k})\) is indeed lower semicontinuous. The definition of the operator \( \hat {T}\) is more involved than the one studied by Vega-Amaya (2003), who assumed that the transition law is setwise continuous in actions, i.e., the function (x, a, b) → q(D|x, a, b) is continuous in (a, b) for every set \(D\in \mathcal {B}(X)\). Within such a framework he obtained a solution to the optimality equation \(h^*\in \mathcal {B}_W(X)\). The operator \( \hat {T}\), on the other hand, enables us to get a lower semicontinuous solution to the optimality equation. In order to obtain a continuous solution, we have to repeat this procedure for a game with the payoff − u. Then, it is sufficient to show that the obtained lower semicontinuous solution for the game with the payoff − u coincides with the solution to the optimality equation obtained for the original game. Hence, it must be continuous. The optimal strategies and the conclusion that \(\mathcal {V}=v\) are deduced immediately from the optimality equation.

The problem of finding optimal strategies for the players in ergodic zero-sum Markov games on a general state space was considered by, among others, Ghosh and Bagchi (1998), who assumed that the transition law q satisfies a minorization condition, i.e., there exists a probability measure \(\hat {\nu }\) such that \( q(\cdot |x,a,b)\ge \hat {\nu }(\cdot )\) for all (x, a, b) ∈ K. Then, the solution to the optimality equation is obtained via the Banach fixed point theorem, since due to the aforementioned assumption, one can introduce a contractive operator in the so-called span semi-norm \(\|h\|_{sp} :=\sup _{x\in X}h(x) -\inf _{x\in X}h(x)\), where \(h:X\to \mathbb {R}\) is a bounded Borel function. Nowak (1994) studied Markov games with state-independent transitions and obtained some optimality inequalities using the standard vanishing discount factor approach. Finally, the results of Meyn and Tweedie (1994, 2009) and Kartashov (1996) made it possible to study other classes of stochastic (Markov or semi-Markov) games satisfying general ergodicity conditions. These assumptions were used to prove the existence of the game value with the average payoff criteria and the existence of optimal strategies for the players in games with unbounded payoff functions (see Jaśkiewicz 2002; Vega-Amaya 2003 or Jaśkiewicz and Nowak 2006; Küenle 2007, and references cited therein). For instance, the first two papers mentioned above deal with semi-Markov zero-sum games with setwise continuous transition probabilities. The payoffs and transitions in Jaśkiewicz (2002) and Vega-Amaya (2003) need not be continuous with respect to the state variable. Within such a framework, the authors proved that the optimality equation has a solution, there exists a value of the game, and both players possess optimal stationary strategies. However, the proofs in these papers are based on different methods. For instance, Jaśkiewicz (2002) analyzes auxiliary perturbed models, whereas Vega-Amaya (2003) makes use of a fixed point theorem, which directly leads to a solution of the optimality equation. Moreover, neither of these works deals with the time average payoff criterion.

Jaśkiewicz and Nowak (2006) and Küenle (2007), on the other hand, examine Markov games with weakly continuous transition probabilities. Jaśkiewicz and Nowak (2006) proved that such a Markov game has a value and both players have optimal stationary strategies. Their approach relies on applying Fatou’s lemma for weakly convergent measures, which in turn leads to the optimality inequalities instead of the optimality equation. Moreover, the proof employs Michael’s theorem on a continuous selection. A completely different approach was presented by Küenle (2007). Under slightly weaker assumptions, he introduced certain contraction operators that lead to a parameterized family of functional equations. Making use of some continuity and monotonicity properties of the solutions to these equations (with respect to the parameter), he obtained a lower semicontinuous solution to the optimality equation.

Remark 8

Jaśkiewicz (2009) and Küenle (2007) imposed a weaker version of basic assumption (C10). In particular, they assumed that the payoff function u is lower semicontinuous, A(x) is a complete metric space, and the mapping x → A(x) is lower semicontinuous, while the correspondence x → B(x) is upper semicontinuous and B(x) is a compact metric space. Then, it was shown that the game has a value and the second player has an optimal stationary strategy, whereas the first player has an 𝜖-optimal stationary strategy for any 𝜖 > 0.

The next result is concerned with the second payoff criterion.

Theorem 9

Assume (C10)–(C13), (GE1)–(GE3), (W2), and (R1)–(R2) . Then, v is the value of the game and the pair of stationary strategies \((\hat {f},\hat {g})\) is also optimal for the players in the game with the time average payoff defined in (5.25).

The proof of Theorem 9 requires different methods than the proof of Theorem 8 and was formulated as Theorem 5.1 in Jaśkiewicz (2009). The point of departure of its proof is the optimality equation (5.28). It allows one to define a certain martingale or a super- (sub-) martingale, to which the optional sampling theorem is applied. Use of this result requires an analysis of the returns of the process to the small set C and certain consequences of ω-geometric ergodicity, as well as some facts from renewal theory. Theorem 5.1 in Jaśkiewicz (2009) refers to the result in Jaśkiewicz (2004) on the equivalence of the expected time and ratio average payoff criteria for semi-Markov control processes with setwise continuous transition probabilities. Some adaptation to the weakly continuous transition probability case is needed. Moreover, the conclusion of Lemma 7 in Jaśkiewicz (2004) that is also used in the proof of Theorem 9 requires an additional assumption such as (R2) given above.

The third result deals with the sample path optimality. For any pair of strategies (π, γ) ∈ Π × Γ and an initial state x ∈ X, we define three payoffs:

  • the sample path ratio average payoff (I)

    $$\displaystyle \begin{aligned} \hat{J}^1(x,\pi,\gamma) = \liminf_{n \to\infty} \frac {\sum_{k=1}^{n} u(x_k,a_{k},b_k)} {T_{n} }; \end{aligned} $$
    (5.29)
  • the sample path ratio average payoff (II)

    $$\displaystyle \begin{aligned} \hat{J}^2(x,\pi,\gamma) = \liminf_{n \to\infty} \frac {\sum_{k=1}^{n} u(x_k,a_{k},b_k)} {\sum_{k=1}^{n} \tau(x_k,a_{k},b_k)};\end{aligned} $$
    (5.30)
  • the sample path time average payoff

    $$\displaystyle \begin{aligned} \hat{j}(x,\pi,\gamma) = \liminf_{t \to\infty} \frac {\sum_{n=1}^{N(t)} u(x_n,a_{n},b_n)}t. \end{aligned} $$
    (5.31)

A pair of strategies (π*, γ*) ∈ Π × Γ is said to be sample path optimal with respect to (5.29) if there exists a function \(v_1\in \mathcal {B}_\omega (X)\) such that for all x ∈ X the following conditions hold:

$$\displaystyle \begin{aligned} \hat{J}^1(x,\pi^*,\gamma^*)=v_1(x)\quad P^{\pi^*\gamma^*}_x\ a.s. \end{aligned}$$
$$\displaystyle \begin{aligned} \mbox{ for every} \quad \gamma\in\varGamma \quad \hat{J}^1(x,\pi^*,\gamma)\ge v_1(x)\quad P^{\pi^*\gamma}_x\ a.s. \end{aligned}$$
$$\displaystyle \begin{aligned} \mbox{for every}\quad \pi\in\varPi\quad \hat{J}^1(x,\pi,\gamma^*)\le v_1(x)\quad P^{\pi\gamma^*}_x\ a.s.\end{aligned} $$

Analogously, we define sample path optimality with respect to (5.30) and (5.31). In order to prove sample path optimality, we need additional assumptions.

  1. (C14)

    There exist positive constants d1, d2, and p ∈ [1, 2) such that

    $$\displaystyle \begin{aligned} d_2\le\tau(x,a,b)^p\le d_1\omega(x),\quad\mbox{and}\quad |u(x,a,b)|{}^p \le d_1\omega(x),\end{aligned} $$

    for all (x, a, b) ∈ K.

  2. (C15)

    If we introduce

    $$\displaystyle \begin{aligned} \hat{ \eta}(x,a,b)=\int_0^\infty t^pH(dt|x,a,b),\end{aligned} $$

    where the constant p is introduced in (C14) and (x, a, b) ∈ K, then there exists a constant d3 > 0 such that

    $$\displaystyle \begin{aligned} \hat{\eta}(x,a,b)\le d_3\omega(x),\quad (x,a,b)\in K. \end{aligned}$$

The following result states that the sample path average payoff criteria coincide. The result was proved by Vega-Amaya and Luque-Vásquez (2000) (see Theorems 3.7 and 3.8) for semi-Markov control processes (one-player games).

Theorem 10

Assume (C10)–(C15), (W2), and (GE1)–(GE2). Then, the pair of optimal strategies \((\hat {f},\hat {g})\in F\times G\) from Theorem 8 is sample path optimal with respect to each of the payoffs in (5.29), (5.30), and (5.31). Moreover, \( \hat {J}^1(x,\hat {f},\hat {g})= \hat {J}^2(x,\hat {f},\hat {g})=\hat {j}(x,\hat {f},\hat {g})=v\).

The point of departure in the proof of Theorem 10 is the optimality equation from Theorem 8. Namely, from (5.28) we get two inequalities. The first one is obtained with the optimal stationary strategy \(\hat {f}\) for player 1, whereas the second one is connected with the optimal stationary strategy \(\hat {g}\) for player 2. Then, the proofs proceed as in Vega-Amaya and Luque-Vásquez (2000) and make use of the strong law of large numbers for Markov chains and for martingales (see Hall and Heyde 1980).

6 Stochastic Games with Borel Payoffs

Consider a game \(\mathcal {G}\) with a countable state space X, finite action spaces, and transition law q. Let \(r:H_\infty \to \mathbb {R}\) be a bounded Borel measurable payoff function defined on the set H∞ of all plays \((x_t,a_t,b_t)_{t\in \mathbb {N}}\) endowed with the product topology and the Borel σ-algebra. (X, A, and B are given the discrete topology.) For any initial state x = x1 and each pair of strategies (π, γ), the expected payoff is

$$\displaystyle \begin{aligned} R(x,\pi,\gamma):= E^{\pi\gamma}_xr(x_1,a_1,b_1,x_2,a_2,b_2,\ldots). \end{aligned}$$

If X is a singleton, then \(\mathcal {G}\) is called the Blackwell game (see Martin 1998). Blackwell (1969, 1989) proved the following result:

Theorem 11

The game \(\cal G\) has a value if r = 1Z is the indicator function of a Gδ-set Z ⊂ H∞.

Martin (1998) proved the following remarkable result:

Theorem 12

The Blackwell game \(\cal G\) has a value for any bounded Borel measurable payoff function \(r:H_\infty \to \mathbb {R}\).

Maitra and Sudderth (2003b) noted that Theorem 12 can be extended easily to stochastic games with a countable set of states X. It is interesting that the proof of the above result is in part based on the theorem of Martin (1975, 1985) on the determinacy of infinite Borel games with perfect information, which extends the classical work of Gale and Stewart (1953) on clopen games. A further discussion of games with perfect information can be found in Mycielski (1992). An extension to games with delayed information was studied by Shmaya (2011). Theorem 12 was extended by Maitra and Sudderth (1998) in a finitely additive measure setting to a fairly large class of stochastic games with arbitrary state and action spaces endowed with the discrete topology and the history space H∞ equipped with the product topology. The payoff function r in their approach is Borel measurable. Since Fubini’s theorem is not true for finitely additive measures, the integration order is fixed in the model. The proof of Maitra and Sudderth (1998) is based on some considerations described in Maitra and Sudderth (1993b) and basic ideas of Martin (1998).

As shown in Maitra and Sudderth (1992), Blackwell Gδ-games (as in Theorem 11) belong to a class of games where the payoff function \(r=\limsup _{n\to \infty } r_n\) and each rn depends on finite histories of play. Clearly, the limsup payoffs include the discounted ones. A “partial history trick” on page 181 in Maitra and Sudderth (1996) or page 358 in Maitra and Sudderth (2003a) can be used to show that the limsup payoffs also generalize the usual limiting average ones. Using the operator approach of Blackwell (1989) and some ideas from gambling theory developed in Dubins and Savage (2014) and Dubins et al. (1989), Maitra and Sudderth (1992) showed that every stochastic game with the limsup payoff and countable state and action spaces has a value. The approach is algorithmic in some sense and was extended to a Borel space framework by Maitra and Sudderth (1993a), where some measurability issues were resolved by using the minmax measurable selection theorem from Nowak (1985a) and some methods from the theory of inductive definability. The authors first studied “leavable games,” where player 1 can use a stop rule. Then, they considered the approximation of a non-leavable game by leavable ones. The limsup payoffs are Borel measurable, but the methods used in Martin (1998) and Maitra and Sudderth (1998) are not suitable for the countably additive games considered in Maitra and Sudderth (1993a). On the other hand, the proof given in Maitra and Sudderth (1998) has no algorithmic aspect compared with Maitra and Sudderth (1993a). As mentioned above, the class of games with the limsup payoffs includes the games with the average payoffs defined as follows: Let X, A, and B be Borel spaces and let \(u:X\times A\times B\to \mathbb {R}\) be a bounded Borel measurable stage payoff function defined on the Borel set K. Assume that the players are allowed to use universally measurable strategies. For any initial state x = x1 and each strategy pair (π, γ), the expected limsup payoff is

$$\displaystyle \begin{aligned} R(x,\pi,\gamma):= E^{\pi\gamma}_x\left(\limsup_{n\to\infty} \frac{1}{n}\sum_{k=1}^n u(x_k,a_k,b_k) \right). \end{aligned} $$
(5.32)

By a minor modification of the proof of Theorem 1.1 in Maitra and Sudderth (1993a) together with the “partial history trick” mentioned above, one can conclude the following result:

Theorem 13

Assume that X, A, and B are Borel spaces, \(K_{A}\in \mathcal {B}(X\times A)\), \(K_{B}\in \mathcal {B}(X\times B)\), and the set B(x) is compact for each x  X. If \(u:K\to \mathbb {R}\) is bounded Borel measurable, u(x, a, ⋅) is lower semicontinuous and q(D|x, a, ⋅) is continuous on B(x) for all (x, a) ∈ KA and \(D\in \mathcal {B}(X)\), then the game with the expected limiting average payoff defined in (5.32) has a value and for any ε > 0 both players have ε-optimal universally measurable strategies.

The methods of gambling theory were also used to study “games of survival” of Milnor and Shapley (1957) (see Theorem 16.4 in Maitra and Sudderth 1996). As defined by Everett (1957) a recursive game is a stochastic game, where the payoff is zero in every state from which the game can move after some choice of actions to a different state. Secchi (1997, 1998) gave conditions for recursive games with countably many states and finite action sets under which the value exists and the players have stationary ε-optimal strategies. He used techniques from gambling theory.

The lower semicontinuous payoffs \(r:H_\infty \to \mathbb {R}\) used in Nowak (1986) are of the limsup type. However, Theorem 4.2 on the existence of value in a semicontinuous game established in Nowak (1986) is not a special case of the aforementioned works of Maitra and Sudderth. The reason is that the transition law in Nowak (1986) is weakly continuous. If r is bounded and continuous and the action correspondences are compact valued and continuous, then Theorem 4.2 in Nowak (1986) implies that both players have “persistently optimal strategies.” This notion comes from gambling theory (see Kertz and Nachman 1979). A pair of persistently optimal strategies forms a sub-game perfect equilibrium in the sense of Selten (1975).

We close this section with a famous example of Gillette (1957) called the Big Match.

Example 6

Let X = {0, 1, 2}, A(x) = A = {0, 1}, and B(x) = B = {0, 1}. The state x = 0 is absorbing with zero payoffs and x = 2 is absorbing with payoffs 1. The game starts in state x = 1. As long as player 1 picks 0, she gets one unit on each stage that player 2 picks 0 and gets nothing on stages when player 2 chooses 1. If player 1 plays 0 forever, then she gets

$$\displaystyle \begin{aligned} \limsup_{n\to\infty}\frac{r_1+\cdots+r_n}{n}, \end{aligned}$$

where rk is the number of units obtained on stage \(k\in \mathbb {N}\). However, if player 1 picks 1 on some stage (goes to “Big Match”) and the choice of player 2 is also 1, then the game moves to the absorbing state 2 and she will get 1 from this stage on. If player 1 picks 1 on some stage and the choice of player 2 is 0, then the game moves to the absorbing state 0 and all future payoffs will be zero. The definition of the transition probability is obvious. Blackwell and Ferguson (1968) proved the following: The Big Match has no value in the class of stationary strategies. However, if the players know the entire history at every stage of the game, then the game has a value in general classes of strategies. Player 2 has a stationary optimal strategy (toss a coin in state x = 1), and for any ε > 0 player 1 has an ε-optimal strategy. The value of the game in state 1 is 1∕2. An important feature of this example (that belongs to the class of games studied by Maitra and Sudderth 1992) is that player 1 must remember the entire history of the game at every moment of play. Blackwell and Ferguson (1968) gave two different constructions of an ε-optimal strategy for player 1. One of them relies on using a sequence of optimal stationary strategies in the discounted games with the discount factor tending to one. The idea was to switch from one discounted optimal strategy to another on the basis of some statistics defined on the past plays. This concept was used by Mertens and Neyman (1981) in their fundamental work on stochastic games with average payoffs. The Big Match was generalized by Kohlberg (1974), who considered finite state and finite action games in which all states but one are absorbing. Useful comments on the Big Match can be found in Mertens (2002) or Solan (2009).

7 Asymptotic Analysis and the Uniform Value

In this section, we briefly review some results found in the literature in terms of “normalized discounted payoffs.” Let x = x1 ∈ X, π ∈ Π, and γ ∈ Γ. The normalized discounted payoff is of the form

$$\displaystyle \begin{aligned} J_{\lambda}(x,\pi,\gamma):= E^{\pi\gamma}_x\left(\lambda\sum_{n=1}^\infty(1-\lambda)^{n-1}u(x_n,a_n,b_n) \right). \end{aligned}$$

The discount factor is β = 1 − λ where λ ∈ (0, 1). Clearly Jλ(x, π, γ) = (1 − β) Jβ(x, π, γ). If the value wλ(x) exists for the normalized game for an initial state x ∈ X, then wλ(x) = (1 − β)vβ(x). By vn(x) we denote the value function of the n-stage game with the payoff function:

$$\displaystyle \begin{aligned} \overline{J}_n(x,\pi,\gamma):= E^{\pi\gamma}_x\left(\frac{\sum_{k=1}^n u(x_k,a_k,b_k)}{n}\right). \end{aligned}$$

A function \(v_\infty :X\to \mathbb {R}\) is called a uniform value for the stochastic game if for any 𝜖 > 0, there exist a pair of strategies (π𝜖, γ𝜖) ∈ Π × Γ, some \(n_0\in \mathbb {N}\) and λ0 ∈ (0, 1) such that for all n ≥ n0 and x ∈ X,

$$\displaystyle \begin{aligned} \sup_{\pi\in\varPi} \overline{J}_n(x,\pi,\gamma^\epsilon)-\epsilon \le v_\infty(x)\le \inf_{\gamma\in\varGamma} \overline{J}_n(x,\pi^\epsilon,\gamma) +\epsilon \end{aligned} $$
(5.33)

and for all λ ∈ (0, λ0) and x ∈ X,

$$\displaystyle \begin{aligned} \sup_{\pi\in\varPi}J_{\lambda}(x,\pi,\gamma^\epsilon)-\epsilon\le v_\infty(x)\le \inf_{\gamma\in\varGamma} J_{\lambda}(x,\pi^\epsilon,\gamma) +\epsilon. \end{aligned} $$
(5.34)

If v∞ exists, then from (5.33) and (5.34), it follows that \(v_\infty (x)=\lim _{n\to \infty }v_n(x)=\lim _{\lambda \to 0^+} w_\lambda (x)\). Moreover, (π𝜖, γ𝜖) is a pair of nearly optimal strategies in all sufficiently long finite games as well as in all discounted games with the discount factor β (or λ) sufficiently close to one (zero).

Mertens and Neyman (1981) gave sufficient conditions for the existence of v∞ for arbitrary state space games. For a proof of the following result, see Mertens and Neyman (1981) or Chap. VII in Mertens et al. (2015).

Theorem 14

Assume that

  • the payoff function u is bounded,

  • for any λ ∈ (0, 1), wλ exists, and both players have ε-optimal stationary strategies,

  • for any α < 1, there exists a sequence \((\lambda _i)_{i\in \mathbb {N}}\) such that 0 < λi < 1, λi+1 ≥ αλi for all \(i\in \mathbb {N}\), limiλi = 0 and

$$\displaystyle \begin{aligned} \sum_{i=1}^\infty \sup_{x\in X} |w_{\lambda_i}(x)-w_{\lambda_{i+1}}(x)|<\infty. \end{aligned}$$

Then, the uniform value v∞ exists. Moreover, if x = x1 is an initial state and

$$\displaystyle \begin{aligned} \overline{U}_n(h_n,a_n,b_n)=\frac{u(x_1,a_1,b_1)+\cdots+ u(x_n,a_n,b_n)}{n}, \end{aligned}$$

then we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \sup_{\pi\in\varPi}E^{\pi\gamma^\epsilon}_x\left(\limsup_{n\to\infty}\overline{U}_n(h_n,a_n,b_n)\right)-\epsilon &\displaystyle \le &\displaystyle v_\infty(x)\\&\displaystyle \le&\displaystyle \inf_{\gamma\in\varGamma}E^{\pi^\epsilon\gamma}_x\left(\liminf_{n\to\infty}\overline{U}_n(h_n,a_n,b_n)\right)+\epsilon. \end{array} \end{aligned} $$
(5.35)

Mertens and Neyman (1981) proved additionally that wλ and vn converge to v∞ uniformly on X. It is worth emphasizing that their 𝜖-optimal strategy has a simple intuition behind it. Namely, at every step, the strategy updates a fictitious discount factor and plays an optimal strategy for that fictitious parameter. This parameter summarizes past play, and its updating is based on the payoffs received in the previous steps. If the payoffs received so far are high, the player places higher weight on the future and increases his patience by letting the fictitious discount factor get closer to one. If, on the other hand, the payoffs received so far are low, he focuses more on short-term payoffs and therefore decreases this fictitious discount factor. The construction of such a strategy relies on careful fine-tuning and hinges on algebraic properties of the value of the discounted game as a function of the discount factor (see Bewley and Kohlberg 1976a). For a detailed discussion of the assumptions made in Theorem 14, consult Mertens (2002) and Mertens et al. (2015). It should be noted that neither the existence of the uniform value nor (5.35) follows from the general minmax theorems of Maitra and Sudderth (1992, 1993a).

Assume that X, A, and B are finite. Bewley and Kohlberg (1976a,b) proved that the limits \(\lim _{\lambda \to 0^+}w_\lambda (x)\) and \(\lim _{n\to \infty }v_n(x)\) exist and have a common value v(x), called the asymptotic value. Using their results, Mertens and Neyman (1982) proved that v(x) is actually the uniform value v∞(x). Independently of this result, it is possible to show using Bewley and Kohlberg (1976a) that the assumptions of Theorem 14 hold for games with a finite state space and finite action sets (see Remark VII.3.2 in Mertens et al. 2015). Bewley and Kohlberg (1976a) actually proved more, namely, that wλ(x) has a Puiseux series expansion in a neighborhood of zero. More precisely, there exist λ0 ∈ (0, 1), \(M\in \mathbb {N}\), and numbers ai(x) (i = 0, 1, …) (depending on x ∈ X) such that for all λ ∈ (0, λ0), we have

$$\displaystyle \begin{aligned} w_\lambda(x) =\sum_{i=0}^\infty a_i(x) \lambda^{i/M}. \end{aligned} $$
(5.36)

Recently, Oliu-Barton (2014) gave a direct proof of the existence of \( \lim _{\lambda \to 0^+} w_\lambda \). His proof does not utilize the Tarski-Seidenberg elimination from real algebraic geometry as in Bewley and Kohlberg (1976a). (An excellent introduction to semi-algebraic functions and their usage in finite state and action stochastic games can be found in Neyman 2003a.) Moreover, based upon the explicit description of asymptotically optimal strategies, Oliu-Barton (2014) showed that his approach can also be used to obtain the uniform value as in Mertens and Neyman (1981). Further generalization of the abovementioned results to other stochastic games was provided by Ziliotto (2016).

A similar Puiseux expansion can be obtained for stationary optimal strategies in discounted games. Mertens (1982, 2002) showed how to get (5.36) for normalized discounted payoffs in finite nonzero-sum games. Different proofs of (5.36) are given in Milman (2002), Szczechla et al. (1997), and Neyman (2003a). It is also worth mentioning that the values vn of finite stage games can also be approximated by certain series expansions. Bewley and Kohlberg (1976b) proved that there exist \(M\in \mathbb {N}\) and real numbers bi(x) (i = 0, 1, 2, …) such that for n sufficiently large we have

$$\displaystyle \begin{aligned} \left|v_n(x)- \sum_{i=0}^\infty b_i(x)n^{-i/M}\right|=O(\ln n/n) \end{aligned} $$
(5.37)

and the bound in (5.37) is tight. A result on a uniform polynomial convergence rate of the values vn to v∞ is given in Milman (2002). The results on the values wλ described above generalize the paper of Blackwell (1962) on dynamic programming (one-person games), where it was shown that the normalized value is a bounded and rational function of the discount factor.

The Puiseux series expansions can also be used to characterize average payoff games, in which the players have optimal stationary strategies (see Bewley and Kohlberg 1978, Chap. 8 in Vrieze 1987 or Filar and Vrieze 1997). For example, one can prove that the average payoff game has a constant value v0 and both players have optimal stationary strategies if and only if a0(x) = v0 and a1(x) = ⋯ = aM−1(x) = 0 in (5.36) for all x ∈ X (see, e.g., Theorem 5.3.3 in Filar and Vrieze 1997).

We recall that a stochastic game is absorbing if all states but one are absorbing. A recursive or an absorbing game is called continuous if the action sets are compact metric, the state space is countable, and the payoffs and transition probabilities depend continuously on actions. Mertens and Neyman (1981) gave sufficient conditions for \(\lim _{\lambda \to 0^+}w_\lambda = \lim _{n\to \infty }v_n\) to hold that include the finite case as well as a more general situation, e.g., when the function λ → wλ is of bounded variation or satisfies some integrability condition (see also Remark 2 in Mertens 2002 and Laraki and Sorin 2015). However, their conditions are not known to hold in continuous absorbing or recursive games. Rosenberg and Sorin (2001) studied the asymptotic properties of wλ and vn using some non-expansive operators called Shapley operators, naturally connected with stochastic games (see also Kohlberg 1974; Neyman 2003b; Sorin 2004). They obtained results implying that equality \(\lim _{\lambda \to 0^+}w_\lambda = \lim _{n\to \infty }v_n\) holds for continuous absorbing games with finite state spaces. Their result was used by Mertens et al. (2009) to show that every game in this class has a uniform value (consult also Sect. 3 in Ziliotto 2016).

Recursive games were introduced by Everett (1957), who proved the existence of the value and of stationary ε-optimal strategies when the state space and action sets are finite. Recently, Li and Venel (2016) proved that recursive games on a countable state space with finite action spaces have the uniform value if the family {vn} is totally bounded. Their proofs follow the same idea as in Solan and Vieille (2002). Moreover, the result in Li and Venel (2016) together with the ones in Rosenberg and Vieille (2000) provides the uniform Tauberian theorem for recursive games: (vn) converges uniformly if and only if (wλ) converges uniformly, and both limits are the same. For finite state continuous recursive games, the existence of \( \lim _{\lambda \to 0^+}w_\lambda \) was recently proved by Sorin and Vigeral (2015a).

We also mention one more class of stochastic games, the so-called definable games, studied by Bolte et al. (2015). Such games involve a finite number of states, and it is additionally assumed that all their data (action sets, payoffs, and transition probabilities) are definable in an o-minimal structure. Bolte et al. (2015) proved that these games have the uniform value. The reason for that lies in the fact that definability allows one to avoid highly oscillatory phenomena in various settings (partial differential equations, control theory, continuous optimization) (see Bolte et al. 2015 and the references cited therein).

Generally, the asymptotic value \(\lim _{\lambda \to 0^{\,+}}w_\lambda \) or \(\lim _{n\to \infty }v_n\) may fail to exist for stochastic games with finitely many states when the action sets are not finite. An example with four states (two of them being absorbing) and compact action sets was recently given by Vigeral (2013). Moreover, there are problems with the asymptotic theory in stochastic games with a finite state space and countable action sets (see Ziliotto 2016). In particular, the example given in Ziliotto (2016) contradicts the famous hypothesis formulated by Mertens (1987) on the existence of the asymptotic value. A generalization of the examples due to Vigeral (2013) and Ziliotto (2016) is presented in Sorin and Vigeral (2015b).

A new approach to the asymptotic value in games with finite state and action sets was recently given by Oliu-Barton (2014). His proof, when compared to Bewley and Kohlberg (1976a), is direct, relatively short, and more elementary. It is based on the theory of finite-dimensional systems and the theory of finite Markov chains. The existence of the uniform value is obtained without using algebraic tools. A simpler proof of the existence of the asymptotic value limλ→0wλ of finite λ-discounted absorbing games was provided by Laraki (2010), who obtained explicit formulas for this value. According to the author’s comments, certain extensions to absorbing games with finite state and compact action spaces are also possible, but under some continuity assumptions on the payoff function. The convergence of the values of n-stage games (as n →∞) and the existence of the uniform value in stochastic games with a general state space and finite action spaces were studied by Venel (2015), who assumed that the transition law is in a certain sense commutative with respect to the actions played at two consecutive periods. Absorbing games can be reformulated as commutative stochastic games.

8 Algorithms for Zero-Sum Stochastic Games

Let P = [pij] be a payoff matrix in a zero-sum game where 1 ≤ i ≤ m1, 1 ≤ j ≤ m2. By valP we denote the value for this game in mixed strategies. We assume in this section that X, A, and B are finite sets. For any function \(\phi :X\to \mathbb {R}\), we can consider the zero-sum game Γϕ(x) where the payoff matrix is

$$\displaystyle \begin{aligned} P_\phi(x):= \left[ \lambda u(x,i,j)+(1-\lambda)\sum_{y\in X} \phi(y)q(y|x,i,j)\right],\quad x\in X.\end{aligned} $$

Recall that β = 1 − λ. Similarly to (5.20), we define Tλϕ(x) as the value of the game Γϕ(x), i.e., Tλϕ(x) = valPϕ(x). If ϕ(x) = ϕ0(x) = 0 for all x ∈ X, then \(T_\lambda ^n\phi _0(x)\) is the value of the n-stage discounted stochastic game starting at the state x ∈ X. As we know from Shapley (1953), the value function wλ of the normalized discounted game is the unique solution to the equation wλ(x) = Tλwλ(x), x ∈ X. Moreover, \(w_\lambda (x)=\lim _{n\to \infty }T_\lambda ^n\phi _0(x)\). The procedure of computing \(T_\lambda ^n\phi _0(x)\) is known as value iteration and can be used as an algorithm to approximate the value function wλ. However, this algorithm is rather slow. If f(x) (g(x)) is an optimal mixed strategy for player 1 (player 2) in the game \(\varGamma _{w_\lambda }(x)\), then the functions f and g are stationary optimal strategies for the players in the infinite horizon discounted game.

Example 7

Let X = {1, 2}, A(x) = B(x) = {1, 2} for x ∈ X. Assume that state x = 2 is absorbing with zero payoffs. In state x = 1, we have u(1, 1, 1) = 2, u(1, 2, 2) = 6, and u(1, i, j) = 0 for i ≠ j. Further, we have q(1|1, 1, 1) = q(1|1, 2, 2) = 1 and q(2|1, i, j) = 1 for i ≠ j. If λ = 1∕2, then for x = 1 the Shapley equation takes the form

$$\displaystyle \begin{aligned} w_\lambda (1)= val \left[ \begin{array}{rr} 1+\frac{1}{2}w_\lambda(1) \ \ &0+\frac{1}{2}w_\lambda(2)\\ 0+\frac{1}{2}w_\lambda(2) \ \ &3+\frac{1}{2}w_\lambda(1) \end{array}\right].\end{aligned} $$

Clearly, wλ(2) = 0 and wλ(1) ≥ 0. Hence, the above matrix game has no pure saddle point and it is easy to calculate that \(w_\lambda (1)= (-4+2\sqrt {13})/3\). This example is taken from Parthasarathy and Raghavan (1981) and shows that in general there is no finite step algorithm for solving zero-sum discounted stochastic games.
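For illustration, the value-iteration sketch above can be run on the data of Example 7 (with the states relabeled 0 and 1, state 1 being absorbing); since the exact value is irrational, the iterates only approximate it, in line with the remark that no finite step algorithm exists.

```python
import numpy as np

u = np.zeros((2, 2, 2))
u[0, 0, 0], u[0, 1, 1] = 2.0, 6.0          # u(1,1,1) = 2, u(1,2,2) = 6
q = np.zeros((2, 2, 2, 2))
q[0, 0, 0, 0] = q[0, 1, 1, 0] = 1.0        # stay in state 1 on the diagonal
q[0, 0, 1, 1] = q[0, 1, 0, 1] = 1.0        # move to the absorbing state otherwise
q[1, :, :, 1] = 1.0                        # state 2 is absorbing

w = shapley_value_iteration(u, q, lam=0.5)
print(w[0], (-4.0 + 2.0 * np.sqrt(13.0)) / 3.0)   # both approximately 1.0704
```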

The value iteration algorithm of Shapley does not utilize any information on optimal strategies in the n-stage games. Hoffman and Karp (1966) proposed a new algorithm involving both payoffs and strategies. Let \(g^1(x)\) be an optimal strategy for player 2 in the matrix game \(P_{\phi _0}(x)\), x ∈ X. Define \(w^1(x)=\sup _{\pi \in \varPi }J_\lambda (x,\pi ,g^1)\). Then, choose an optimal strategy \(g^2(x)\) for player 2 in the matrix game \(P_{w^{1}}(x)\). Define \(w^2(x)=\sup _{\pi \in \varPi }J_\lambda (x,\pi ,g^2)\) and continue the procedure. It is shown that \(\lim _{n\to \infty }w^n(x)=w_\lambda (x)\).
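A sketch of the Hoffman-Karp iteration, reusing the matrix_game_value helper and the hypothetical arrays u, q, lam from the value-iteration sketch above, might look as follows; the best response of player 1 against a fixed stationary strategy of player 2 is computed by ordinary dynamic-programming value iteration.

```python
import numpy as np


def best_response_value(u, q, lam, g, n_iter=500):
    """Value of player 1's MDP when player 2 plays the stationary strategy g."""
    n_states = u.shape[0]
    w = np.zeros(n_states)
    for _ in range(n_iter):
        w_new = np.empty(n_states)
        for x in range(n_states):
            r = lam * u[x] @ g[x]                                  # expected payoff per action a
            t = (1.0 - lam) * np.einsum('aby,b->ay', q[x], g[x]) @ w
            w_new[x] = np.max(r + t)
        w = w_new
    return w


def hoffman_karp(u, q, lam, n_rounds=20):
    n_states, n_a, n_b = u.shape
    w = np.zeros(n_states)
    for _ in range(n_rounds):
        g = np.zeros((n_states, n_b))
        for x in range(n_states):
            P = lam * u[x] + (1.0 - lam) * q[x] @ w
            # Player 2's optimal strategy in P is player 1's optimal strategy in -P^T;
            # matrix_game_value is the helper from the value-iteration sketch above.
            _, g[x] = matrix_game_value(-P.T)
        w = best_response_value(u, q, lam, g)
    return w
```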

Let X = {1, …, k}. Any function \(w:X\to \mathbb {R}\) can be viewed as a vector \( \bar {w}=(w(1),\ldots ,w(k))\in \mathbb {R}^k\). The fact that wλ is a unique solution to the Shapley equation is equivalent to saying that the unconstrained optimization problem

$$\displaystyle \begin{aligned} \min_{\bar{w}\in \mathbb{R}^k}\sum_{x\in X}(T_\lambda w(x)-w(x))^2\end{aligned} $$

has a unique global minimum. Pollatschek and Avi-Itzhak (1969) proposed a successive iterations algorithm, which corresponds to the “policy iteration” in dynamic programming. The proposed algorithm is connected with a Newton-Raphson type procedure associated with the global minimum problem mentioned above. Van der Wal (1978) showed that their algorithm does not converge in general. Filar and Tolwinski (1991) presented an improved version of the Pollatschek and Avi-Itzhak algorithm for solving discounted zero-sum stochastic games based on a “modified Newton’s method.” They demonstrated that it always converges to the value of the stochastic game and solved the example of Van der Wal (1978). For further comments on the abovementioned iterative algorithms, the reader is referred to Vrieze (1987), Breton (1991), Raghavan and Filar (1991), Filar and Vrieze (1997), and Raghavan (2003).
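To make the least-squares formulation above concrete, the short sketch below evaluates the residual \(\sum _{x}(T_\lambda w(x)-w(x))^2\) and minimizes it with a generic derivative-free routine (scipy's Nelder-Mead); this merely stands in for the Newton-type schemes discussed above and again assumes the hypothetical u, q and matrix_game_value from the earlier sketches.

```python
import numpy as np
from scipy.optimize import minimize


def shapley_residual(w, u, q, lam):
    """sum_x (T_lambda w(x) - w(x))^2 for a candidate value vector w."""
    total = 0.0
    for x in range(u.shape[0]):
        P = lam * u[x] + (1.0 - lam) * q[x] @ w
        Tw_x, _ = matrix_game_value(P)
        total += (Tw_x - w[x]) ** 2
    return total


lam = 0.5                                   # as in Example 7
res = minimize(shapley_residual, x0=np.zeros(u.shape[0]),
               args=(u, q, lam), method='Nelder-Mead')
w_lambda = res.x                            # approximate fixed point of T_lambda
```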

Observe now that every f ∈ F (and every g ∈ G) can be viewed as a vector in a Euclidean space. If f ∈ F, then we write

$$\displaystyle \begin{aligned} u(x,f,b):=\sum_{a\in A(x)}u(x,a,b)f(a|x)\quad \mbox{and}\quad q(y|x,f,b):=\sum_{a\in A(x)}q(y|x,a,b)f(a|x). \end{aligned}$$

Similarly u(x, a, g) and q(y|x, a, g) are defined for any g ∈ G.

In the remaining part of this section, we assume that u ≥ 0. This assumption is made only to simplify the presentation. A zero-sum discounted stochastic game can also be solved by a constrained nonlinear programming technique studied by Filar et al. (1991) (see also Chap. 3 in Filar and Vrieze 1997). Consider the problem (NP1) defined as follows:

$$\displaystyle \begin{aligned} \min \ \sum_{x\in X}(w_1(x)+w_2(x)) \end{aligned}$$

subject to (f, g) ∈ F × G, w1 ≥ 0, w2 ≤ 0 and

$$\displaystyle \begin{aligned} \lambda u(x,a,g)+(1-\lambda)\sum_{y\in X}w_1(y)q(y|x,a,g)\le w_1(x), \ \mbox{for all} \ x\in X,\ a\in A(x), \end{aligned}$$
$$\displaystyle \begin{aligned} -\lambda u(x,f,b)+(1-\lambda)\sum_{y\in X}w_2(y)q(y|x,f,b)\le w_2(x), \ \mbox{for all} \ x\in X,\ b\in B(x). \end{aligned}$$

Note that the objective function is linear, but the constraint set is not convex. It is shown (see Chap. 3 in Filar and Vrieze 1997) that every local minimum of (NP1) is a global minimum. Hence, we have the following result.

Theorem 15

Let \((w_1^*,w_2^*,f^*,g^*)\) be a global minimum of (NP1). Then, \(\sum _{x\in X}(w_1^*(x)+w_2^*(x))=0\) and \(w_1^*(x)=w_\lambda (x)\) for all x ∈ X. Moreover, \((f^*,g^*)\) is a pair of stationary optimal strategies for the players in the discounted stochastic game.
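A rough sketch of how (NP1) could be handed to a generic nonlinear solver is given below; it again assumes the hypothetical arrays u, q, lam from the earlier sketches and uses scipy's SLSQP, which only searches for a local minimum, although by the result quoted above every local minimum of (NP1) is global.

```python
import numpy as np
from scipy.optimize import minimize

n, A, B = u.shape
nv = 2 * n + n * A + n * B          # variables: w1, w2, f (flattened), g (flattened)


def unpack(z):
    w1, w2 = z[:n], z[n:2 * n]
    f = z[2 * n:2 * n + n * A].reshape(n, A)
    g = z[2 * n + n * A:].reshape(n, B)
    return w1, w2, f, g


def inequalities(z):
    """All inequality constraints of (NP1), written as expressions >= 0."""
    w1, w2, f, g = unpack(z)
    cons = []
    for x in range(n):
        for a in range(A):          # constraints involving player 2's strategy g
            lhs = lam * (u[x, a] @ g[x]) + (1 - lam) * (g[x] @ (q[x, a] @ w1))
            cons.append(w1[x] - lhs)
        for b in range(B):          # constraints involving player 1's strategy f
            lhs = -lam * (f[x] @ u[x, :, b]) + (1 - lam) * (f[x] @ (q[x, :, b] @ w2))
            cons.append(w2[x] - lhs)
    return np.array(cons)


def simplex(z):
    """f(.|x) and g(.|x) must be probability distributions."""
    _, _, f, g = unpack(z)
    return np.concatenate([f.sum(axis=1) - 1.0, g.sum(axis=1) - 1.0])


z0 = np.concatenate([np.full(n, u.max()), np.zeros(n),           # feasible starting point
                     np.full(n * A, 1.0 / A), np.full(n * B, 1.0 / B)])
bounds = [(0, None)] * n + [(None, 0)] * n + [(0, 1)] * (n * A + n * B)
res = minimize(lambda z: unpack(z)[0].sum() + unpack(z)[1].sum(), z0,
               method='SLSQP', bounds=bounds,
               constraints=[{'type': 'ineq', 'fun': inequalities},
                            {'type': 'eq', 'fun': simplex}])
w1_star, w2_star, f_star, g_star = unpack(res.x)   # w1_star approximates w_lambda
```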

In the case of single-controller stochastic game, in which q(y|x, a, b) is independent of a ∈ A(x) for each x ∈ X and denoted by q(y|x, b), the problem of finding optimal strategies for the players is much simpler. We now present a result of Parthasarathy and Raghavan (1981). Consider the following linear programming problem (LP1):

$$\displaystyle \begin{aligned} \max \ \sum_{x\in X}w(x) \end{aligned}$$

subject to f ∈ F, w ≥ 0 and

$$\displaystyle \begin{aligned} \lambda u(x,f,b)+(1-\lambda)\sum_{y\in X}w(y)q(y|x,b)\ge w(x),\ \mbox{for all}\ x\in X,\ b\in B(x). \end{aligned}$$

Note that the constraint set in (LP1) is convex.

Theorem 16

The problem (LP1) has an optimal solution \((w^*,f^*)\). Moreover, \(w^*(x)=w_\lambda (x)\) for all x ∈ X, and \(f^*\) is an optimal stationary strategy for player 1 in the single-controller discounted stochastic game.
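Because the constraints of (LP1) are jointly linear in (w, f) when the transitions do not depend on player 1's actions, the problem can be passed directly to a linear-programming solver. The sketch below assumes hypothetical arrays u[x, a, b] and q1[x, b, y] (transitions controlled by player 2 only) and a discount rate lam.

```python
import numpy as np
from scipy.optimize import linprog


def solve_single_controller_lp(u, q1, lam):
    """(LP1): maximize sum_x w(x) over w >= 0 and mixed strategies f of player 1."""
    n, A, B = u.shape
    nv = n + n * A                         # variables: w(x), then f(a|x)
    c = np.zeros(nv)
    c[:n] = -1.0                           # linprog minimizes, so minimize -sum_x w(x)

    rows, b_ub = [], []
    for x in range(n):
        for b in range(B):
            # w(x) - lam*u(x,f,b) - (1-lam) sum_y w(y) q1(y|x,b) <= 0
            row = np.zeros(nv)
            row[x] += 1.0
            row[:n] -= (1.0 - lam) * q1[x, b]
            row[n + x * A: n + (x + 1) * A] = -lam * u[x, :, b]
            rows.append(row)
            b_ub.append(0.0)

    A_eq = np.zeros((n, nv))
    for x in range(n):
        A_eq[x, n + x * A: n + (x + 1) * A] = 1.0     # f(.|x) is a distribution
    res = linprog(c, A_ub=np.array(rows), b_ub=b_ub, A_eq=A_eq, b_eq=np.ones(n),
                  bounds=[(0.0, None)] * nv)
    w = res.x[:n]                          # w(x) = w_lambda(x) by Theorem 16
    f = res.x[n:].reshape(n, A)            # optimal stationary strategy of player 1
    return w, f
```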

Remark 9

Knowing wλ one can find an optimal stationary strategy g for player 2 using the Shapley equation wλ = Tλwλ, i.e., g(x) can be any optimal strategy in the matrix game with the payoff function:

$$\displaystyle \begin{aligned} \lambda u(x,a,b)+(1-\lambda)\sum_{y\in X}w_\lambda(y)q(y|x,b), \quad a\in A(x),\ b\in B(x). \end{aligned}$$

Let X = X1 ∪ X2 and X1 ∩ X2 = ∅. Assume that q(y|x, a, b) = q1(y|x, a) for x ∈ X1 and q(y|x, a, b) = q2(y|x, b) for x ∈ X2, a ∈ A(x), b ∈ B(x), y ∈ X. Then the game is called a switching control stochastic game (SCSG for short). Filar (1981) studied this class of games with discounting and showed the order field property, which says that a solution to the game can be found in the same algebraic field as the data of the game. Other classes of stochastic games having the order field property are described in Raghavan (2003). It is interesting that the value function wλ for the SCSG can be represented in a neighborhood of zero by a power series in λ (see Theorem 6.3.5 in Filar and Vrieze 1997). It should be mentioned that every discounted SCSG can be solved by a finite sequence of linear programming problems (see Algorithm 3.2.1 in Filar and Vrieze 1997). This was first shown by Vrieze (1987).

We can now turn to the limiting average payoff stochastic games. We know from the Big Match example of Blackwell and Ferguson (1968) that ε-optimal stationary strategies may not exist. A characterization of limiting average payoff games, where the players have stationary optimal strategies, was given by Vrieze (1987) (see also Theorem 5.3.5 in Filar and Vrieze 1997). Below we state this result. For any function \( \phi :X\to \mathbb {R}\) we consider the zero-sum game \(\varGamma _\phi ^0(x)\) with the payoff matrix

$$\displaystyle \begin{aligned} P_\phi^0(x):= \left[ \sum_{y\in X} \phi(y)q(y|x,i,j)\right],\quad x\in X \end{aligned}$$

and the zero-sum game \(\varGamma _\phi ^1(x)\) with the payoff matrix

$$\displaystyle \begin{aligned} \tilde{P}_\phi(x):= \left[ u(x,i,j)+\sum_{y\in X} \phi(y)q(y|x,i,j)\right] ,\quad x\in X. \end{aligned}$$

Theorem 17

Consider a function \(v^*:X\to \mathbb {R}\) and \(f^*\in F\), \(g^*\in G\). Then, \(v^*\) is the value of the limiting average payoff stochastic game and \(f^*, g^*\) are stationary optimal strategies for players 1 and 2, respectively, if and only if for each x ∈ X

$$\displaystyle \begin{aligned} v^*(x)= \mathrm{val} P_{v^*}^0(x), \end{aligned} $$
(5.38)

\((f^*(x), g^*(x))\) is a pair of optimal mixed strategies in the zero-sum game with the payoff matrix \(P_{v^*}^0(x)\), and there exist functions \(\phi _i:X\to \mathbb {R}\) (i = 1, 2) such that for every x ∈ X, we have

$$\displaystyle \begin{aligned} v^*(x)+\phi_1(x)=\mathrm{val} \tilde{P}_{\phi_1}(x)= \min_{b\in B(x)}\left[u(x,f^*,b)+\sum_{y\in X}\phi_1(y)q(y|x,f^*,b)\right], \end{aligned} $$
(5.39)

and

$$\displaystyle \begin{aligned} v^*(x)+\phi_2(x)=\mathrm{val} \tilde{P}_{\phi_2}(x)= \max_{a\in A(x)}\left[u(x,a,g^*)+\sum_{y\in X}\phi_2(y)q(y|x,a,g^*)\right]. \end{aligned} $$
(5.40)

Remark 10

If the Markov chain induced by any stationary strategy pair is irreducible, then \(v^*\) is constant. In that case, (5.38) holds trivially, and any ϕ1(x), ϕ2(x) satisfying (5.39) and (5.40) are such that ϕ1(x) − ϕ2(x) is independent of x ∈ X; in such a case we may take ϕ1 = ϕ2. However, in other cases (without irreducibility), ϕ1(x) − ϕ2(x) may depend on x ∈ X. For details the reader is referred to Chap. 8 in Vrieze (1987).

A counterpart to the optimization problem (NP1) with non-convex constraints can also be formulated for the limiting average payoff case. Consider the problem (NP2):

$$\displaystyle \begin{aligned} \min\ \sum_{x\in X}(v_1(x)+v_2(x)) \end{aligned}$$

subject to (f, g) ∈ F × G, v1 ≥ 0, v2 ≤ 0, ϕ1 ≥ 0, ϕ2 ≥ 0 and

for all x ∈ X, a ∈ A(x) and

for all x ∈ X, b ∈ B(x).

Theorem 18

If \((\phi _1^*,\phi _2^*,v_1^*,v_2^*,f^*,g^*)\) is a feasible solution of (NP2) with the property that \(\sum _{x\in X}(v_1^*(x)+v_2^*(x))=0\), then it is a global minimum and \((f^*,g^*)\) is a pair of optimal stationary strategies. Moreover, \(v_1^*(x)=R(x,f^*,g^*)\) (see (5.32)) for all x ∈ X.

For a proof, consult Filar et al. (1991) or pages 127–129 in Filar and Vrieze (1997). Single-controller average payoff stochastic games can also be solved by linear programming. The formulation is more involved than in the discounted case and generalizes the approach known in the theory of Markov decision processes. Two independent studies on this topic are given in Hordijk and Kallenberg (1981) and Vrieze (1981). As in the discounted case, the SCSG with the average payoff criterion can be solved by a finite sequence of nested linear programs (see Vrieze et al. 1983).

If X = X1 ∪ X2, X1 ∩ X2 = ∅, and A(x) (B(x)) is a singleton for each x ∈ X1 (x ∈ X2), then the stochastic game is of perfect information. Raghavan and Syed (2003) gave a policy-improvement type algorithm to find optimal pure stationary strategies for the players in discounted stochastic games of perfect information. Avrachenkov et al. (2012) proposed two algorithms to find uniformly optimal strategies in discounted games. Such strategies are also optimal in the limiting average payoff stochastic game. Fresh ideas for constructing optimal stationary strategies in zero-sum limiting average payoff games can be found in Boros et al. (2013). In particular, Boros et al. (2013) introduced a potential transformation of the original game to an equivalent canonical form and applied this method to games with additive transitions (AT games) as well as to stochastic games played on a directed graph. The existence of a canonical form was also established for stochastic games with perfect information, switching control games, and ARAT (additive reward-additive transition) games. Such a potential transformation allows some classes of games to be solved in sub-exponential time. Additional results can be found in Boros et al. (2016). It is worth noting that a finite-step algorithm of Cottle-Dantzig type was recently applied to solve discounted zero-sum semi-Markov ARAT games by Mondal et al. (2016).

Computation of the uniform value is a difficult task. Chatterjee et al. (2008) provided a finite algorithm for finding an approximation of the uniform value. As mentioned in the previous section, Bewley and Kohlberg (1976a) showed that the function λ → wλ is semi-algebraic; it can be expressed as a Taylor series in fractional powers of λ (a Puiseux series) in a neighborhood of zero. By Mertens and Neyman (1981), the uniform value \(v(x)=\lim _{\lambda \to 0^{\,+}}w_\lambda (x)\). Chatterjee et al. (2008) noted that, for a given α > 0, determining whether v > α is equivalent to finding the truth value of a sentence in the theory of real-closed fields. A generalization of the quantifier elimination algorithm of Tarski (1951) due to Basu (1999) (see also Basu et al. 2003) can be used to compute this truth value. The uniform value v is bounded by the maximum payoffs of the game; it is therefore sufficient to repeat this algorithm for finitely many different values of α to get a good approximation of v. An ε-approximation of v(x) at a given state x can be computed in time bounded by an exponential in a polynomial of the size of the game times a polynomial function of \(\log (1/\varepsilon )\). This means that the problem of approximating the uniform value v(x) lies in the computational complexity class EXPTIME (see Papadimitriou 1994). Solan and Vieille (2010) applied the methods of Chatterjee et al. (2008) to calculate the uniform ε-optimal strategies described by Mertens and Neyman (1981). These strategies are good for all sufficiently long finite horizon games as well as for all (normalized) discounted games with λ sufficiently small. However, they use unbounded memory. As shown by Bewley and Kohlberg (1976a), any pair of stationary optimal strategies in discounted games (which are obviously functions of λ) can also be represented by a Taylor series of fractional powers of λ for λ ∈ (0, λ0) with λ0 sufficiently small. This result, the theory of real-closed fields, and the methods of formal logic developed in Basu (1999) are the basic ingredients in Solan and Vieille (2010). A complexity bound on the algorithm of Solan and Vieille (2010) has not yet been determined.

9 Zero-Sum Stochastic Games with Incomplete Information or Imperfect Monitoring

The following model of a general two-player zero-sum stochastic game, say \( \mathcal {G}\), is described in Sorin (2003a).

  • X is a finite state space.

  • A and B are finite admissible action sets for players 1 and 2, respectively.

  • Ω is a finite set of signals.

  • r : X × A × B → [0, 1] is a payoff function to player 1.

  • q is a transition probability mapping from X × A × B to \(\Pr (X\times \varOmega )\).

Let p be an initial probability distribution on X × Ω. The game evolves as follows. At stage one, nature chooses (x1, ω1) according to p and the players learn ω1. Then, simultaneously player 1 selects a1 ∈ A and player 2 selects b1 ∈ B. The stage payoff r(x1, a1, b1) is paid by player 2 to player 1, and a pair (x2, ω2) is drawn according to q(⋅|x1, a1, b1). The game proceeds to stage two and the situation is repeated. The standard stochastic game with incomplete information is obtained when Ω = A × B. Such a game with a finite horizon of play was studied by Krausz and Rieder (1997), who showed the existence of the game value and presented an algorithm to compute optimal strategies for the players via linear programming. Their model assumes incomplete information on one side, i.e., player 2 is never informed about the state of the underlying Markov chain, in contrast to player 1. In addition, both players have perfect recall. Renault (2006) studied a similar model. Namely, he assumed that the sequence of states follows a Markov chain, i.e., q is independent of the actions of the players. At the beginning of each stage, only player 1 is informed of the current state, the actions are selected simultaneously, and they are observed by both players. The play then proceeds to the next stage. Renault (2006) showed that such a game has a uniform value and that the second player has an optimal strategy.

Clearly, if Ω is a singleton, the game is a standard stochastic game. For general stochastic games with incomplete information, little is known, but some classes have been studied in the literature. For the Big Match game, Sorin (1984, 1985) and Sorin and Zamir (1991) proved the existence of the maxmin value and the minmax value. These values may be different. Moreover, they showed that the values of the n-stage games (λ-discounted games with normalized payoffs) converge as n → ∞ (as λ → 0+) to the maxmin value.

Another model was considered by Rosenberg et al. (2004). Namely, at the beginning of the game a signal ω is chosen according to \(p\in \Pr (\varOmega )\). Only player 1 is informed of ω. At stage \(n\in \mathbb {N}\), the players simultaneously choose actions an ∈ A and bn ∈ B. The stage payoff rω(xn, an, bn) is incurred and the next state xn+1 is drawn according to q(⋅|xn, an, bn). Both players are informed of (an, bn, xn+1). Note that in this setting rω(xn, an, bn) is known to player 1, but not to player 2. Rosenberg et al. (2004) proved the following result.

Theorem 19

If player 1 controls the transition probability, the game value exists. If player 2 controls the transition probability, both the minmax value and the maxmin value exist.

Recursive games with incomplete information on one side were studied by Rosenberg and Vieille (2000), who proved that the maxmin value exists and is equal to the limit of the values of n-stage games (λ-discounted games) as n → ∞ (as λ → 0+). Rosenberg (2000), on the other hand, considered absorbing games. She proved the existence of the limit of the values of finitely repeated absorbing games (discounted absorbing games) with incomplete information on one side as the number of repetitions goes to infinity (as λ → 0+). Additional discussion on stochastic games with incomplete information on one side can be found in Sorin (2003b) and Laraki and Sorin (2015).

Coulomb (1992, 1999, 2001) was the first to study stochastic games with imperfect monitoring. These games are played as follows. At every stage, the game is in one of finitely many states. Each player chooses an action, independently of his opponent. The current state, together with the pair of actions, determines a daily payoff, a probability distribution according to which a new state is chosen, and a probability distribution over pairs of signals, one for each player. Each player is then informed of his private signal and of the new state. However, neither player is informed of his opponent's signal or of the daily payoff (see also the detailed model in Coulomb 2003a). Coulomb (1992, 1999, 2001) studied the class of absorbing games and proved that the uniform maxmin and minmax values exist. In addition, he provided a formula for both values. One of his main findings is that the maxmin value does not depend on the signaling structure of player 2. Similarly, the minmax value does not depend on the signaling structure of player 1. In general, the maxmin and minmax values do not coincide; hence, stochastic games with imperfect monitoring need not have a uniform value. Based on these ideas, Coulomb (2003c) and Rosenberg et al. (2003) independently proved that the uniform maxmin value always exists in a stochastic game in which each player observes the state and his/her own action. Moreover, the uniform maxmin value is independent of the information structure of player 2. Symmetric results hold for the uniform minmax value.

We now consider the general model of a zero-sum dynamic game presented in Mertens et al. (2015) and Coulomb (2003b). These games are known as games of incomplete information on both sides.

  • X, A, and B are as above.

  • S and T are finite signal spaces for players 1 and 2, respectively.

  • The payoff function is defined as above, and the transition probability function is \(q:X\times A\times B \to \Pr (X\times S\times T)\).

The evolution of the game is as follows. At stage one nature chooses (x1, s1, t1) according to a given distribution \(p\in \Pr (X\times S\times T)\). Player 1 learns s1 and player 2 is informed of t1. Then, simultaneously player 1 selects a1 ∈ A and player 2 selects b1 ∈ B. The stage payoff r(x1, a1, b1) is incurred and a new triple (x2, s2, t2) is drawn according to q(⋅|x1, a1, b1). The game proceeds to stage two and the process repeats. Let us denote this game by \(\mathcal {G}_0\). Renault (2012) proved that such a game has a value under an additional condition.

Theorem 20

Assume that player 1 can always deduce the state and player 2’s signal from his own signal. Then, the game \(\mathcal {G}_0\) has a uniform value.

Further examples of games for which Theorem 20 holds were recently provided by Gensbittel et al. (2014). In particular, they showed that if player 1 is more informed than player 2 and controls the evolution of information on the state, then the uniform value exists. This result extends, on the one hand, results on Markov decision processes with partial observation given by Rosenberg et al. (2002) and, on the other hand, a result on repeated games with an informed controller studied by Renault (2012).

An extension of the repeated game in Renault (2006) to a game with incomplete information on both sides was examined by Gensbittel and Renault (2015). The model is described by two finite action sets A and B and two finite sets of states S and T. The payoff function is r : S × T × A × B → [−1, 1]. There are given two initial probabilities \(p_1\in \Pr (S)\) and \( p_2\in \Pr (T)\) and two transition probability functions \(q_1: S\to \Pr (S)\) and \(q_2: T\to \Pr (T)\). The Markov chains \((s_n)_{n\in \mathbb {N}}\), \((t_n)_{n\in \mathbb {N}}\) are independent. At the beginning of stage \(n\in \mathbb {N}\), player 1 observes sn and player 2 observes tn. Then, both players simultaneously select actions an ∈ A and bn ∈ B. Player 1's payoff in stage n is r(sn, tn, an, bn). Then, (an, bn) is publicly announced and the play goes to stage n + 1. Notice that the payoff r(sn, tn, an, bn) is not directly known to the players and cannot be deduced. The main theorem states that \(\lim _{n\to \infty }v_n\) exists and is a unique continuous solution to the so-called Mertens-Zamir system of equations (see Mertens et al. 2015). Recently, Sorin and Vigeral (2015a) showed in a simpler model (a repeated game model, where s1 and t1 are chosen once and kept fixed throughout the play) that vλ converges uniformly as λ → 0.

In this section, we should also mention the Mertens conjecture (see Mertens 1987) and its solution. His hypothesis is twofold: the first statement says that in any general model of a zero-sum repeated game, the asymptotic value exists, and the second one says that if player 1 is always more informed than player 2 (in the sense that player 2's signal can be deduced from player 1's private signal), then in the long run player 1 is able to guarantee the asymptotic value. Ziliotto (2016) showed that in general the Mertens hypothesis is false. Namely, he constructed an example of a seven-state symmetric information game, in which each player has two actions. The signals are public. The game is played as the game \(\mathcal { G}\) described above. More details can be found in Solan and Ziliotto (2016), where related issues are also discussed.

Although the Mertens conjecture does not hold in general, there are some classes of games for which it is true. The interested reader is referred to Sorin (1984, 1985), Rosenberg et al. (2004), Renault (2012), Gensbittel et al. (2014), Rosenberg and Vieille (2000), and Li and Venel (2016). For instance, Li and Venel (2016) dealt with a stochastic game \(\mathcal {G}_0\) with incomplete information on both sides and proved the following (see Theorem 5.8 in Li and Venel 2016).

Theorem 21

Let \(\mathcal {G}_0\) be a recursive game such that player 1 is more informed than player 2. Then, for every initial distribution \(p\in \Pr (X\times S\times T)\), both the asymptotic value and the uniform maxmin exist and are equal, i.e.,

$$\displaystyle \begin{aligned} \underline{v}_\infty=\lim_{n\to\infty} v_n=\lim_{\lambda\to 0} v_\lambda. \end{aligned}$$

Different notions of value in two-person zero-sum repeated games were recently examined by Gimbert et al. (2016). Assuming that the state evolves and players receive signals, they showed that the uniform value (limsup value) may not exist. However, the value exists if the payoff function is Borel measurable and the players observe a public signal including the actions played. The existence of the uniform value was proved for recursive games with nonnegative payoffs without any special assumptions on signals.

Stochastic games with partial observations, in which one player observes the sequence of states, while the other player observes the sequence of state-dependent signals, are examined in Basu and Stettner (2015) and its references. A class of dynamic games in which a player is informed of his opponent's actions and states after some time delay was studied by Dubins (1957), Scarf and Shapley (1957), and Levy (2012). For obvious reasons, this survey does not cover all models and cases of games with incomplete information. Further references and applications can be found in Laraki and Sorin (2015), Neyman and Sorin (2003), or Solan and Ziliotto (2016).

10 Approachability in Stochastic Games with Vector Payoffs

In this section, we consider games with payoffs in Euclidean space \(\mathbb {R}^k\), where the inner product is denoted by 〈⋅, ⋅〉 and the norm of any \( \bar {c}\in \mathbb {R}^k\) is \(\|\bar {c}\|=\sqrt {\langle \bar {c},\bar {c}\rangle } \). Let A and B be finite sets of pure strategies for players 1 and 2, respectively. Let \(u^0:A\times B\to \mathbb {R}^k\) be a vector payoff function. For any mixed strategies \(s_1\in \Pr (A)\) and \(s_2\in \Pr (B)\), \( \bar {u}^0(s_1,s_2)\) stands for the expected vector payoff. Consider a two-person infinitely repeated game G defined as follows. At each stage \(t\in \mathbb {N}\), players 1 and 2 choose simultaneously at ∈ A and bt ∈ B. Behavioral strategies \(\hat \pi \) and \(\hat \gamma \) for the players are defined in the usual way. The corresponding vector outcome is \( g_t =u^0(a_t,b_t) \in \mathbb {R}^k\). The couple of actions (at, bt) is announced to both players. The average vector outcome up to stage n is \( \bar {g}_n=(g_1+\cdots +g_n)/n\). The aim of player 1 is to make \(\bar {g}_n\) approach a target set \(C\subset \mathbb {R}^k\). If k = 1, then we usually have in mind C = [v0, ∞), where v0 is the value of the game in mixed strategies. If \(C\subset \mathbb {R}^k\) and \(y\in \mathbb {R}^k\), then the distance from y to the set C is \(d(y,C)=\inf _{z\in C}\|y-z\|\).

A nonempty closed set \(C\subset \mathbb {R}^k\) is approachable by player 1 in G if for every 𝜖 > 0 there exist a strategy \(\hat \pi \) of player 1 and \(n_\epsilon \in \mathbb {N}\) such that for any strategy \( \hat \gamma \) of player 2 and any n ≥ n𝜖, we have

$$\displaystyle \begin{aligned} E^{ \hat{\pi}\hat{\gamma}}d(\bar{g}_n,C)\le\epsilon. \end{aligned}$$

The dual concept is excludability.

Let PC(y) denote the set of closest points to y in C. A closed set \( C\subset \mathbb {R}^k\) satisfies the Blackwell condition for player 1 if for any y ∉ C, there exist z ∈ PC(y) and a mixed action (depending on y) \(s_1=s_1(y) \in \Pr (A)\) such that the hyperplane through z orthogonal to the line segment [yz] separates y from the set \(\{ \bar {u}^0(s_1,s_2):s_2\in \Pr (B)\}\), i.e.,

$$\displaystyle \begin{aligned} \langle\bar{u}^0(s_1,s_2)-z,y-z\rangle \le 0\quad\mbox{for all}\quad s_2\in \Pr(B). \end{aligned}$$

The following two results are due to Blackwell (1956).

Theorem 22

If \(C\subset \mathbb {R}^k\) is a nonempty closed set satisfying the Blackwell condition, then C is approachable in the game G. An approachability strategy is \(\hat {\pi }(h_n)=s_1(\bar {g}_n)\), where hn is the history of a play at stage n.
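A small simulation sketch of Blackwell's strategy for a convex target set is given below. It reuses the matrix_game_value helper from the value-iteration sketch in Sect. 8; the 2 × 2 vector-payoff game and the target set C = {1∕2} × [0, 1] are purely illustrative assumptions (player 1 fully controls the first payoff coordinate, so C is approachable).

```python
import numpy as np

rng = np.random.default_rng(0)

# Vector payoffs u0[a, b] in R^2: the first coordinate is 1{a = 1},
# the second coordinate is 1{b = 1}.
u0 = np.zeros((2, 2, 2))
u0[1, :, 0] = 1.0
u0[:, 1, 1] = 1.0


def project_on_C(y):
    """Projection onto the closed convex target set C = {1/2} x [0, 1]."""
    return np.array([0.5, np.clip(y[1], 0.0, 1.0)])


g_bar = np.zeros(2)
for n in range(1, 5001):
    z = project_on_C(g_bar)
    if np.allclose(z, g_bar):
        s1 = np.full(2, 0.5)            # average already in C: any mixed action will do
    else:
        # Optimal strategy of player 1 in the game scalarized along z - g_bar; for a
        # convex C it provides the separation required by the Blackwell condition.
        # matrix_game_value is the helper from the value-iteration sketch in Sect. 8.
        P = u0 @ (z - g_bar)
        _, s1 = matrix_game_value(P)
        s1 = np.clip(s1, 0.0, None)
        s1 /= s1.sum()                  # guard against round-off from the LP solver
    a = rng.choice(2, p=s1)
    b = rng.integers(2)                 # player 2 plays arbitrarily (uniform here)
    g_bar += (u0[a, b] - g_bar) / n     # running average of the vector payoffs

print(np.linalg.norm(g_bar - project_on_C(g_bar)))   # distance to C, close to zero
```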

Theorem 23

A closed and convex set \(C\subset \mathbb {R}^k\) is either approachable or excludable.

The next result was proved by Spinat (2002).

Theorem 24

A closed set \(C\subset \mathbb {R}^k\) is approachable if and only if C contains a subset satisfying the Blackwell condition.

Related results with applications to repeated games can be found in Sorin (2002) and Mertens et al. (2015). Applications to optimization models, learning, and games with partial monitoring can be found in Cesa-Bianchi and Lugosi (2006), Cesa-Bianchi et al. (2006), Perchet (2011a,b), and Lehrer and Solan (2016). A theorem on approachability for stochastic games with vector payoffs was proved by Shimkin and Shwartz (1993). They imposed certain ergodicity conditions on the transition probability and showed the applications of these results to queueing models. A more general theorem on approachability for vector payoff stochastic games was proved by Milman (2006). Below we briefly describe his result.

Consider a stochastic game with finite state space X and action spaces A(x) ⊂ A and B(x) ⊂ B, where A and B are finite sets. The stage payoff function is \(u:X\times A\times B\to \mathbb {R}^k\). For any strategies π ∈ Π and γ ∈ Γ and an initial state x = x1, there exists a unique probability measure \(P^{\pi \gamma }_x\) on the space of all plays (the Ionescu-Tulcea theorem) generated by these strategies and the transition probability q. By \(PD_x^{\pi \gamma }\) we denote the probability distribution on the stream of vector payoffs \(\overline {g}=(g_1,g_2,\ldots )\). Clearly, \(PD_x^{\pi \gamma }\) is uniquely induced by \(P^{\pi \gamma }_x\).

A closed set \(C\subset \mathbb {R}^k\) is approachable in probability from all initial states x ∈ X, if there exists a strategy π0 ∈ Π such that for any x ∈ X and 𝜖 > 0 we have

$$\displaystyle \begin{aligned} \lim_{n\to\infty}\sup_{\gamma\in\varGamma} PD^{\pi_0\gamma}_x(\{\overline{g}: d(\overline{g}_n,C)>\epsilon\})=0. \end{aligned}$$

Assume that y ∉ C and z ∈ PC(y). Let σ(z, y) := (z − y)∕∥z − y∥. Consider the stochastic game with scalarized payoffs uσ(x, a, b) := 〈u(x, a, b), σ(z, y)〉. By Mertens and Neyman (1981), this game has a uniform value, denoted here by vσ(x), x ∈ X. An analogue of the theorem of Blackwell (1956), due to Milman (2006), reads as follows.

Theorem 25

A closed set \(C\subset \mathbb {R}^k\) is approachable in probability from all initial states x ∈ X if, for each y ∉ C, there exists z ∈ PC(y) such that vσ(x) ≥〈z, σ(z, y)〉 for all x ∈ X.

We close this section by mentioning a recent paper by Kalathil et al. (2016) devoted to the approachability problem in Stackelberg stochastic games with vector costs. They constructed a simple and computationally tractable approachability strategy for this class of games and gave a reinforcement learning algorithm for learning this strategy when the transition kernel is unknown.

11 Stochastic Games with Short-Stage Duration and Related Models

Studying continuous-time Markov games entails some conceptual and mathematical difficulties. One of the main issues concerns randomization in continuous time. Zachrisson (1964) first considered zero-sum Markov games of a finite and commonly known duration. His method of evaluating the stream of payoffs in continuous time was simply to integrate over time. In his approach, the players use Markov strategies, i.e., they choose their actions as a function of time and the current state only. Stochastic games on Markov jump processes were studied by many authors (see, e.g., Guo and Hernández-Lerma 2003, 2005). The payoff functions and transition rates are time independent, and it is assumed that, using randomized Markov strategies, the players determine an infinitesimal operator of the stochastic process, whose trajectories determine the stream of payoffs. The assumptions made on the primitives imply that the players have optimal stationary strategies in the zero-sum case (stationary equilibria in the nonzero-sum case), i.e., strategies that are independent of time but depend on the state, which changes at random time epochs. Altman and Gaitsgory (1995) studied zero-sum “hybrid games,” where the state evolves according to linear continuous-time dynamics. The parameters of the state evolution equation may change at discrete times according to a countable state Markov chain that is directly controlled by both players. Each player has a finite action space. The authors proposed a procedure (similar in form to the well-known maximum principle) that determines a pair of stationary strategies for the players, which is asymptotically a saddle point as the number of transitions during the finite time horizon grows to infinity. Levy (2013) studied some connections of continuous-time (finite state and action spaces) n-person Markov games with differential games and the theory of differential inclusions. He also gave some results on correlated equilibria with public randomization in an approximating game. He considered Markov strategies only. We mention his paper here because no section on continuous-time games is included in our chapter on nonzero-sum stochastic games. Cardaliaguet et al. (2012) considered the asymptotic value of two-person zero-sum repeated games with incomplete information, splitting games, and absorbing games. They used a technique relying on embedding the discrete repeated game into a continuous-time game and on viscosity solution methods. Other approaches to continuous-time Markov games, including discretization of time, are briefly described in Laraki and Sorin (2015). The class of games discussed here is important for many applications, e.g., in studying queueing models involving birth and death processes and more general ones (see Altman et al. 1997).

Recently, Neyman (2013) presented a framework for fairly general strategies using an asymptotic analysis of stochastic games with stage duration converging to zero. He established some new results, especially on the uniform value and approximate equilibria. There has been very little development in this direction. In order to describe briefly certain ideas from Neyman (2013), we must introduce some notation. We assume that the state space X and the action sets A and B are finite. Let δ > 0 and Γδ be a zero-sum stochastic game played at times tδ, \(t\in \mathbb {N}\). Strategies for the players are defined in the usual way, but we should note that the players act in time epochs δ, 2δ, and so on. Following Neyman (2013), we say that δ is the stage duration. The stage payoff function \(u_\delta :X\times A\times B\to \mathbb {R}\) is assumed to depend on δ. The evaluation of streams of payoffs in a multistage game is not specified at this moment. The transition probability qδ also depends on δ and is defined using a so-called transition rate function \(q^0_\delta :X\times X\times A\times B\to \mathbb {R}\) satisfying the standard assumptions

$$\displaystyle \begin{aligned} q^0_\delta(y,x,a,b)\ge 0\ \mbox{ for }\ y\neq x\quad\mbox{and}\quad \sum_{y\in X}q^0_\delta(y,x,a,b)=0. \end{aligned}$$

The transition probability is

$$\displaystyle \begin{aligned} q_\delta(y|x,a,b)= q^0_\delta(y,x,a,b)\mbox{ if } y\neq x\quad\mbox{and} \quad q_\delta(x|x,a,b)= q^0_\delta(x,x,a,b) +1 \end{aligned}$$

for all x ∈ X, a ∈ A and b ∈ B. The transition rate \( q^0_\delta (y,x,a,b)\) represents the difference between the probability that the next state will be y and the probability (0 or 1) that the current state is y when the current state is x and the players’ actions are a and b, respectively.

Following Neyman (2013), we say that the family of games (Γδ)δ>0 is converging if there exist functions \( \mu :X\times X\times A\times B\to \mathbb {R}\) and \(u:X\times A\times B\to \mathbb {R}\) such that for all x, y ∈ X, a ∈ A, and b ∈ B, we have

$$\displaystyle \begin{aligned} \lim_{\delta\to0^+} \frac{q^0_\delta(y,x,a,b)}{\delta} =\mu(y,x,a,b)\quad\mbox{and}\quad \lim_{\delta\to0^+}\frac{u_\delta(x,a,b)}{\delta} =u(x,a,b), \end{aligned}$$

and the family of games (Γδ)δ>0 is exact if there exist functions \(\mu :X\times X\times A\times B\to \mathbb {R}\) and \( u:X\times A\times B\to \mathbb {R}\) such that for all x, y ∈ X, a ∈ A, and b ∈ B, we have \(q^0_\delta (y,x,a,b)/\delta =\mu (y,x,a,b)\) and uδ(x, a, b)∕δ = u(x, a, b).
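As a small illustration of how an exact family can be specified in practice, the sketch below builds the stage payoff uδ and the transition probability qδ from a hypothetical transition-rate array mu[y, x, a, b] (nonnegative off the diagonal, summing to zero over y) and a payoff-rate array u_rate[x, a, b]; it is only a sketch of the construction described above, valid once δ is small enough for all entries of qδ to lie in [0, 1].

```python
import numpy as np


def stage_game(mu, u_rate, delta):
    """Stage payoff u_delta and transition probability q_delta for stage duration delta."""
    n = mu.shape[0]
    u_delta = delta * u_rate                 # u_delta(x,a,b) = delta * u(x,a,b)
    q_delta = delta * mu                     # q_delta(y|x,a,b) = delta * mu(y,x,a,b) ...
    for x in range(n):
        q_delta[x, x] += 1.0                 # ... plus 1 on the diagonal (y = x)
    return u_delta, q_delta
```

In the ρ-discounted game Γδ,ρ the corresponding stage discount factor is then 1 − ρδ, so once uδ and qδ are in hand, the normalized value can be approximated by any solver for discounted games.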

Assume that (x1, a1, b1, …) is a play in the game with stage duration δ. According to Neyman (2013), the unnormalized payoff in the ρ-discounted game, denoted by Γδ,ρ, is

$$\displaystyle \begin{aligned} \sum_{t=1}^\infty(1-\rho\delta)^{t-1}u_\delta(x_t,a_t,b_t).\end{aligned} $$

The discount factor β in the sense of Sect. 3 is 1 − δρ. It is called admissible if \(\lim _{\delta \to 0^+} (1-\beta (\delta ))/\delta \) exists. This limit is known as an asymptotic discount rate. In the case of β(δ) = 1 − ρδ, ρ > 0 is the asymptotic discount rate. Another example of an admissible δ-dependent discount factor is \(e^{-\rho \delta }\). Assuming that the family of games (Γδ)δ>0 is converging, it is proved that the value of Γδ,ρ, denoted by vδ,ρ(x), converges to some vρ(x) (called the asymptotic ρ-discounted value) for any initial state x ∈ X as δ → 0+ and that the players have stationary optimal strategies πρ and γρ that are independent of δ. Optimality of πρ means that πρ is 𝜖(δ)-optimal in the game Γδ,ρ, where 𝜖(δ) → 0 as δ → 0+. Optimality of γρ is defined similarly. For details the reader is referred to Theorem 1 in Neyman (2013).

For any play (x1, a1, b1, …) and s > 0, define the average per unit time payoff gδ(s) as

$$\displaystyle \begin{aligned} g_\delta(s):=\frac{1}{s} \sum_{1\le t<s/\delta} u_\delta(x_t,a_t,b_t).\end{aligned} $$

A family (Γδ)δ>0 of two-person zero-sum stochastic games has an asymptotic uniform value v(x) (x ∈ X) if for every 𝜖 > 0 there are strategies πδ of player 1 and γδ of player 2, a duration δ0 > 0 and a time s0 > 0 such that for every δ ∈ (0, δ0) and s > s0, strategy π of player 1, and strategy γ of player 2, we have

$$\displaystyle \begin{aligned} \epsilon + E_x^{\pi_\delta \gamma}g_\delta(s)\ge v(x)\ge E_x^{\pi\gamma_\delta }g_\delta(s)-\epsilon.\end{aligned} $$

Theorem 6 in Neyman (2013) states that any exact family of zero-sum games (Γδ)δ>0 has an asymptotic uniform value.

The paper by Neyman (2013) also contains some results on limit-average games and n-person games with short stage duration. His asymptotic analysis is partly based on the theory of Bewley and Kohlberg (1976a) and Mertens and Neyman (1981). His work inspired other researchers. For instance, Cardaliaguet et al. (2016) studied the asymptotics of a class of two-person zero-sum stochastic games with incomplete information on one side. Furthermore, Gensbittel (2016) considered a zero-sum dynamic game with incomplete information, in which one player is more informed. He analyzed the limit value and characterized it both through an auxiliary optimization problem and as the unique viscosity solution of a Hamilton-Jacobi equation. Sorin and Vigeral (2016), on the other hand, examined stochastic games with varying stage duration using iterations of non-expansive Shapley operators, which have been successfully used in the theory of discrete-time repeated and stochastic games.