1 Introduction

The paper deals with a class of discrete-time zero-sum discounted Markov games with nonconstant discount factors of the form

$$\begin{aligned} \tilde{\alpha }(x_{n},a_{n},b_{n},\xi _{n+1}), \end{aligned}$$
(1)

where \(x_{n}\) is the state of the game, \(a_{n}\) and \(b_{n}\) represent the actions of players 1 and 2, respectively, at time \(n\), and \(\left\{ \xi _{n}\right\} \) is a sequence of independent and identically distributed random variables with common distribution \(\theta \) representing a random disturbance at each stage. The one-stage payoff (or utility) \( r(x_{n},a_{n},b_{n})\) accumulates over an infinite horizon by means of the functional

$$\begin{aligned} E\left[ \sum _{n=0}^{\infty }\prod \limits _{k=0}^{n-1}\tilde{\alpha } (x_{k},a_{k},b_{k},\xi _{k+1})r(x_{n},a_{n},b_{n})\right] , \end{aligned}$$
(2)

which defines the total expected discounted payoff with random discount factors depending on the state and the actions. Thus, in this scenario, our main objective is to prove the existence of a value of the game and a pair of optimal strategies.

Among the optimality criteria used to study zero-sum and nonzero-sum Markov games, the discounted payoff with a constant discount factor is the best understood. It has been analyzed under several approaches, for instance, dynamic programming via Shapley’s equation, linear programming, and estimation and control procedures (see, e.g., [9, 18,19,20, 24,25,26,27,28, 36, 37]), as well as Nash equilibrium [5, 13, 31]. Moreover, its main applications are in economic and financial models where the discount factor is a function of the interest rate. Hence, assuming a constant discount factor could be restrictive in problems where such an interest rate is random. It is in these situations that the need arises to consider a function such as (1) as the discount factor.

Even though the usual applications of the discounted criterion are in economic and financial models, there are other problems where a discount factor such as (1) appears naturally. For instance, consider a game that is played as follows. At stage n, when the game is in state \( x_{n}\) and once the players choose the actions \((a_{n},b_{n}),\) player 2 pays \(r(x_{n},a_{n},b_{n})\) to player 1. Then, with a positive probability that depends on \((x_{n},a_{n},b_{n}),\) the game stops; otherwise, the game moves to a new state \(x_{n+1}\) according to a transition law, and the process is repeated. Under these circumstances, the performance of the zero-sum game is measured by the total expected payoff criterion with a random horizon \(\tau \) of the form

$$\begin{aligned} E\left[ \sum _{n=0}^{\tau }r(x_{n},a_{n},b_{n})\right] . \end{aligned}$$
(3)

We prove that (3) can be written in the form (2) with

$$\begin{aligned} \tilde{\alpha }(x_{n},a_{n},b_{n},\xi _{n+1}):=1-\gamma (x_{n},a_{n},b_{n}), \end{aligned}$$

where \(\gamma (x_{n},a_{n},b_{n})\) is the probability that the game stops at stage \(n\).
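To see this informally (the precise statement is proved in Example 2 of Sect. 5), note that the game survives stage k with probability \(1-\gamma (x_{k},a_{k},b_{k})\), so that, conditionally on the trajectory, \(P[\tau \ge n]=\prod _{k=0}^{n-1}(1-\gamma (x_{k},a_{k},b_{k}))\). Hence, heuristically,

$$\begin{aligned} E\left[ \sum _{n=0}^{\tau }r(x_{n},a_{n},b_{n})\right] =E\left[ \sum _{n=0}^{\infty }\mathbf {1}_{\{\tau \ge n\}}r(x_{n},a_{n},b_{n})\right] =E\left[ \sum _{n=0}^{\infty }\prod \limits _{k=0}^{n-1}(1-\gamma (x_{k},a_{k},b_{k}))r(x_{n},a_{n},b_{n})\right] . \end{aligned}$$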

Another example with random state-actions-dependent discount factors is the following zero-sum semi-Markov game. Let \(\left\{ \xi _{n}\right\} \) be a sequence of independent and identically distributed random variables with exponential distribution representing the sojourn (or holding) times. In addition, let \(\gamma (x_{n},a_{n},b_{n})\) be the discount rate applied at stage \(n\). Then, by defining \(\tilde{\alpha }(x_{n},a_{n},b_{n},\xi _{n+1}):= \exp (-\gamma (x_{n},a_{n},b_{n})\xi _{n+1})\), the total expected discounted payoff takes the form (2).

Typically, the existence of optimal strategies in zero-sum Markov games is studied via Shapley’s equation. Such an approach has the advantage that it allows one to apply the nice contractive properties of the minimax (maximin) operator. In our case, since the discount factor is a nonconstant function \( \tilde{\alpha }\) which explicitly depends on the random disturbance process \( \left\{ \xi _{n}\right\} \), it is not possible to obtain, at least directly, a Shapley-like equation. To obtain the advantages that such an equation entails, we first need to establish a representation of the performance index related to (2) in terms of the common distribution \(\theta \) of the random variables \(\left\{ \xi _{n}\right\} \). However, due to measurability issues, such a representation is only possible when the players are restricted to Markov strategies, not arbitrary strategies (see Proposition 2). Taking this fact into account, we then prove the sufficiency of Markov (stationary) strategies in the sense that if a pair of stationary strategies is optimal with respect to the Markov strategies, then it is also optimal with respect to all strategies (see Proposition 1). It is worth remarking that this fact is a well-known result in zero-sum games under the standard discounted criterion (see, e.g., [21, 30]). In our game model, because the discount factor is a function of the game process \(\{(x_{n},a_{n},b_{n},\xi _{n+1})\},\) such a result is not a direct consequence of [21], and therefore some modifications must be made. Hence, for completeness, we have included its proof. Once the sufficiency of Markov strategies is ensured, we establish, in Theorem 1, the existence of a value of the game and an optimal pair of stationary strategies.

Problems with nonconstant discount factors have been extensively studied for Markov decision processes from several points of view (see, e.g., [3, 8, 10,11,12, 23, 34, 38]). In particular, control processes with random state-action-dependent discount factors are analyzed in [23], which is the setting closest to ours. Nonetheless, in addition to the usual difficulties of dealing with stochastic games, proving Proposition 1 requires arguments different from those followed in the single-controller case.

The paper is organized as follows. In Sect. 2 we present the game model we deal with. Next, in Sect. 3, we introduce the optimality criterion and the main properties related to the sufficiency of Markov strategies. The existence of the value of the game and of the pair of optimal strategies is established in Sect. 4, whereas the proofs are deferred to Sect. 6. In order to illustrate our results, in Sect. 5 we present some examples with nonconstant discount factors. The first one is a financial model where the discount factor is a function of a random interest rate. Next we present an example of a game with random horizon and nondiscounted payoff criterion which is equivalent to a game where the discount factor is a state-actions-dependent function representing the probability of continuing the game. The third example is a semi-Markov game where the payoffs are exponentially discounted according to a random state-actions-dependent discount factor. We also present some insights into the fulfillment of our hypotheses.

Notation

As usual, \(\mathbb {N}\) (respectively \(\mathbb {N}_{0}\)) denotes the set of positive (resp. nonnegative) integers. On the other hand, given a Borel space X (that is, a Borel subset of a complete and separable metric space) its Borel sigma-algebra is denoted by \(\mathcal {B}(X)\), and “measurable”, for either sets or functions, means “Borel measurable”. Let X and Y be Borel spaces. Then a stochastic kernel \(\gamma (\mathrm{d}x\mid y)\) on X given Y is a function such that \(\gamma (\cdot \mid y)\) is a probability measure on X for each fixed \(y\in Y,\) and \(\gamma (B\mid \cdot ) \) is a measurable function on Y for each fixed \(B\in \mathcal {B}(X)\) . The space of probability measures on X is denoted by \(\mathbb {P}(X),\) which is endowed with the weak topology. In addition, we denote by \(\mathbb {P}(X\mid Y)\) the family of stochastic kernels on X given Y.

2 The Game Model

A zero-sum Markov game model with random state-actions-dependent discount factors is defined by the collection

$$\begin{aligned} \mathcal {GM}:=(\mathbf {X},\mathbf {A},\mathbf {B},{\mathbb {K}}_{\mathbf {A}},{ \mathbb {K}}_{\mathbf {B}},\mathbf {S},Q,\tilde{\alpha },r), \end{aligned}$$
(4)

satisfying the following conditions. The state space \(\mathbf {X}\) and the action sets \(\mathbf {A}\) and \(\mathbf {B}\) for players 1 and 2, respectively, as well as the discount factor disturbance space \(\mathbf {S}\), are assumed to be Borel spaces. The constraint sets \(\mathbb {K}_{\mathbf {A}}\) and \( \mathbb {K}_{\mathbf {B}}\) are Borel subsets of \(\mathbf {X}\times \mathbf {A}\) and \(\mathbf {X}\times \mathbf {B},\) respectively. For each \(x\in \mathbf {X},\) the x-sections

$$\begin{aligned} A(x):=\{a\in \mathbf {A}:(x,a)\in \mathbb {K}_{\mathbf {A}}\} \end{aligned}$$

and

$$\begin{aligned} B(x):=\{b\in \mathbf {B}:(x,b)\in \mathbb {K}_{\mathbf {B}}\} \end{aligned}$$

represent the admissible actions or controls sets for players 1 and 2, respectively, and the set

$$\begin{aligned} \mathbb {K}=\{(x,a,b):x\in \mathbf {X},\ a\in A(x),\ b\in B(x)\} \end{aligned}$$

of admissible state-actions triplets is a Borel subset of \(\mathbf {X}\times \mathbf {A}\times \mathbf {B}\). The transition law \(Q(\cdot |x,a,b)\) is a stochastic kernel on \(\mathbf {X}\) given \(\mathbb {K},\) and \(\tilde{\alpha }: \mathbb {K}\times \mathbf {S}\rightarrow (0,1)\) is a measurable function which gives the discount factor \(\tilde{\alpha }(x_{n},a_{n},b_{n},\xi _{n+1})\) at stage \(n\in \mathbb {N}\), where \(\left\{ \xi _{n}\right\} \) is a sequence of independent and identically distributed (i.i.d.) random variables defined on the probability space \((\varOmega ,\mathcal {F},P)\) taking values in \(\mathbf {S}\) with common distribution \(\theta \in \mathbb {P}(S)\). That is

$$\begin{aligned} \theta (S)=P(\xi _{n}\in S),\ \ S\in \mathcal {B}(\mathbf {S}),n\in \mathbb {N}. \end{aligned}$$

Finally, \(r:\mathbb {K}\rightarrow \mathbb {R}\) is a measurable function that represents the one-stage payoff.

The game is played as follows. At the initial state \(x_{0}\in \mathbf {X},\) the players independently choose actions \(a_{0}\in A(x_{0})\) and \(b_{0}\in B(x_{0}).\) Then player 1 receives a payoff \(r(x_{0},a_{0},b_{0})\) from player 2, the game jumps to a new state \(x_{1}\) according to the transition law \(Q(\cdot |x_{0},a_{0},b_{0})\), and the random disturbance \(\xi _{1}\) is realized. Once the system is in state \(x_{1}\), the players select actions \(a_{1}\in A(x_{1})\) and \(b_{1}\in B(x_{1})\) and player 1 receives a discounted payoff \(\tilde{\alpha }(x_{0},a_{0},b_{0},\xi _{1})r(x_{1},a_{1},b_{1})\) from player 2. Next the system moves to a state \(x_{2}\), and the process is repeated. In general, at stage \(n\in \mathbb {N},\) player 1 receives from player 2 a discounted payoff of the form

$$\begin{aligned} \tilde{\varGamma }_{n}r(x_{n},a_{n},b_{n}) \end{aligned}$$
(5)

where

$$\begin{aligned} \tilde{\varGamma }_{n}:=\prod \limits _{k=0}^{n-1}\tilde{\alpha } (x_{k},a_{k},b_{k},\xi _{k+1})\text { if }n\in \mathbb {N}\text {, and }\tilde{ \varGamma }_{0}=1. \end{aligned}$$
(6)

Thus, the goal of player 1 (player 2, resp.) is to maximize (minimize, resp.) the total expected discounted payoff defined by the accumulation of the one-stage payoffs (5) over an infinite horizon.
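For readers who prefer to think computationally, the accumulation of the payoffs (5) along one trajectory can be sketched as follows. This is a minimal illustration, truncated at a finite horizon; `r`, `alpha_tilde`, `step`, `pi1`, and `pi2` are hypothetical stand-ins for the payoff \(r\), the discount function \(\tilde{\alpha }\), the transition/disturbance mechanism, and the players' decision rules (none of them are part of the model above).

```python
def sample_payoff(x, r, alpha_tilde, step, pi1, pi2, horizon=500):
    """One simulated realization of sum_n Gamma_n r(x_n, a_n, b_n), cf. (5)-(6)."""
    total, Gamma = 0.0, 1.0                # Gamma_0 = 1 by (6)
    for _ in range(horizon):
        a, b = pi1(x), pi2(x)              # players choose actions at x_n
        total += Gamma * r(x, a, b)        # stage-n term Gamma_n r(x_n, a_n, b_n)
        x_next, xi = step(x, a, b)         # draw x_{n+1} ~ Q and xi_{n+1} ~ theta
        Gamma *= alpha_tilde(x, a, b, xi)  # Gamma_{n+1} = Gamma_n alpha~(x_n, a_n, b_n, xi_{n+1})
        x = x_next
    return total
```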

The actions chosen by the players at each stage are selected according to rules known as strategies, which are defined as follows.

Let \(\mathbb {H}_{0}:=\mathbf {X}\) and \(\mathbb {H}_{n}:=\mathbb {K\times } \mathbf {S}\times \mathbb {H}_{n-1}\) for \(n\in \mathbb {N}.\) For each \(n\in \mathbb {N}_{0},\) an element \(h_{n}\in \mathbb {H}_{n}\) takes the form

$$\begin{aligned} h_{n}:=(x_{0},a_{0},b_{0},s_{1},\ldots ,x_{n-1},a_{n-1},b_{n-1},s_{n},x_{n}), \end{aligned}$$

which represents the history of the game up to time \(n\). A strategy for player 1 is a sequence \(\pi ^{1}=\{\pi _{n}^{1}\}\) of stochastic kernels \(\pi _{n}^{1}\in \mathbb {P}(\mathbf {A}|\mathbb {H}_{n})\) such that \(\pi _{n}^{1}(A(x_{n})|h_{n})=1\) for every \(h_{n}\in \mathbb {H}_{n},\ n\in \mathbb {N}_{0}.\) We denote by \(\varPi ^{1}\) the family of all strategies for player 1.

For each \(x\in \mathbf {X},\) let \(\mathbb {A}(x):=\mathbb {P}(A(x))\) and \( \mathbb {B}(x):=\mathbb {P}(B(x)).\) We denote by \(\varPhi ^{1}\) the class of all stochastic kernels \(\varphi ^{1}\in \mathbb {P}(\mathbf {A}|\mathbf {X})\) such that \(\varphi ^{1}(\cdot |x)\in \mathbb {A}(x)\), \(x\in \mathbf {X},\) and by \( \varPhi ^{2}\) the class of all stochastic kernels \(\varphi ^{2}\in \mathbb {P}( \mathbf {B}|\mathbf {X})\) such that \(\varphi ^{2}(\cdot |x)\in \mathbb {B}(x)\), \(x\in \mathbf {X}.\) Hence, a strategy \(\pi ^{1}=\{\pi _{n}^{1}\}\in \varPi ^{1}\) is called a Markov strategy if there exists \(\varphi _{n}^{1}\in \varPhi ^{1}\ \)such that \(\pi _{n}^{1}(\cdot |h_{n})=\varphi _{n}^{1}(\cdot |x_{n})\) for every \(h_{n},n\in \mathbb {N}_{0}.\) The class of all Markov strategies for player 1 is denoted by \(\varPi _{M}^{1}\). Now, a Markov strategy is called stationary if \(\varphi _{n}^{1}=\varphi ^{1}\) for every \(n\in \mathbb {N}_{0}\) and some stochastic kernel \(\varphi ^{1} \) in \(\varPhi ^{1}\). The set of stationary strategies for player 1 is denoted by \(\varPi _{S}^{1}\). The sets \(\varPi ^{2},\)\(\varPi _{M}^{2},\) and \(\varPi _{S}^{2}\) corresponding to player 2 are defined similarly.

According to the previous definitions, and by using a standard convention, a Markov strategy \(\varphi ^{i}\in \varPi _{M}^{i}\) takes the form \( \varphi ^{i}=\left\{ \varphi _{0}^{i},\varphi _{1}^{i},\ldots \right\} =:\left\{ \varphi _{n}^{i}\right\} \), for \(i=1,2.\) In particular, for stationary strategies, we have \(\varphi ^{i}=\left\{ \varphi ^{i},\varphi ^{i},\ldots \right\} =\left\{ \varphi ^{i}\right\} \).

The game process. Let \((\varOmega ^{\prime },\mathcal {F}^{\prime })\) be the measurable space consisting of the sample space \(\varOmega ^{\prime }=(\mathbb {K}\times \mathbf {S})^{\infty }\) and its product \(\sigma \)-algebra \(\mathcal {F}^{\prime }.\) Following standard arguments (see, e.g., [6]), we have that for each pair of strategies \((\pi ^{1},\pi ^{2})\in \varPi ^{1}\times \varPi ^{2}\) and initial state \(x_{0}=x\in \mathbf {X},\) there exist a unique probability measure \(P_{x}^{\pi ^{1},\pi ^{2}}\) and a stochastic process \(\left\{ (x_{n},a_{n},b_{n},\xi _{n+1})\right\} \), where \(x_{n},\) \(a_{n},\) \(b_{n},\) and \(\xi _{n+1}\) represent the state, the actions of the players, and the random disturbance in the discount factor, respectively, at stage \(n\in \mathbb {N}_{0},\) satisfying

$$\begin{aligned} P_{x}^{\pi ^{1},\pi ^{2}}\left[ x_{0}\in X\right]= & {} \delta _{x}(X),\quad X\in \mathcal {B}(\mathbf {X}); \end{aligned}$$
(7)
$$\begin{aligned} P_{x}^{\pi ^{1},\pi ^{2}}\left[ a_{n}\in A,b_{n}\in B|h_{n}\right]= & {} \pi _{n}^{1}\left( A|h_{n}\right) \pi _{n}^{2}\left( B|h_{n}\right) , \ A\in \mathcal {B}(\mathbf {A}),B\in \mathcal {B}(\mathbf {B}); \end{aligned}$$
(8)
$$\begin{aligned} P_{x}^{\pi ^{1},\pi ^{2}}\left[ x_{n+1}\in X|h_{n},a_{n},b_{n},\xi _{n+1} \right]= & {} Q\left( X|x_{n},a_{n},b_{n}\right) ,\ \ X\in \mathcal {B}(\mathbf {X }); \end{aligned}$$
(9)
$$\begin{aligned} P_{x}^{\pi ^{1},\pi ^{2}}\left[ \xi _{n+1}\in S|h_{n},a_{n},b_{n}\right]= & {} \theta (S),\ \ S\in \mathcal {B}(\mathbf {S}), \end{aligned}$$
(10)

where \(\delta _{x}(\cdot )\) is the Dirac measure concentrated at \(x\). We denote by \(E_{x}^{\pi ^{1},\pi ^{2}}\) the expectation operator with respect to \(P_{x}^{\pi ^{1},\pi ^{2}}.\) The stochastic process \(\left\{ x_{n}\right\} \) defined on \((\varOmega ^{\prime },\mathcal {F}^{\prime },P_{x}^{\pi ^{1},\pi ^{2}})\) is called the game process.

3 The Optimality Criterion

According to (5) and (6), given the initial state \( x_{0}=x\in \mathbf {X}\) and a pair of strategies \((\pi ^{1},\pi ^{2})\in \varPi ^{1}\times \varPi ^{2}\), the total expected discounted payoff—with random state-actions-dependent discount factors—is defined as

$$\begin{aligned} \tilde{V}(x,\pi ^{1},\pi ^{2}):=E_{x}^{\pi ^{1},\pi ^{2}}\left[ \sum _{n=0}^{\infty }\tilde{\varGamma }_{n}r(x_{n},a_{n},b_{n})\right] . \end{aligned}$$
(11)

The lower and the upper value of the game are

$$\begin{aligned} L(x):=\sup _{\pi ^{1}\in \varPi ^{1}}\inf _{\pi ^{2}\in \varPi ^{2}}\tilde{V} (x,\pi ^{1},\pi ^{2})\, \, \text {and} \, \, U(x):=\inf _{\pi ^{2}\in \varPi ^{2}}\sup _{\pi ^{1}\in \varPi ^{1}}\tilde{V}(x,\pi ^{1},\pi ^{2}), \end{aligned}$$

respectively, for each initial state \(x\in \mathbf {X}\). Of course, \(U(\cdot )\ge L(\cdot )\). If \(U(\cdot )=L(\cdot )\) holds, then the common function is called the value of the game and is denoted by \(V^{*}(\cdot ).\)

Suppose the game has a value \(V^{*}\). A strategy \(\pi _{*}^{1}\in \varPi ^{1}\) is said to be optimal for player 1 if

$$\begin{aligned} V^{*}(x)=\inf _{\pi ^{2}\in \varPi ^{2}}\tilde{V}(x,\pi _{*}^{1},\pi ^{2}),\ \ x\in \mathbf {X}. \end{aligned}$$

Similarly, a strategy \(\pi _{*}^{2}\in \varPi ^{2}\) is said to be optimal for player 2 if

$$\begin{aligned} V^{*}(x)=\sup _{\pi ^{1}\in \varPi ^{1}}\tilde{V}(x,\pi ^{1},\pi _{*}^{2}),\ \ x\in \mathbf {X.} \end{aligned}$$

Hence, the pair \((\pi _{*}^{1},\pi _{*}^{2})\) is called an optimal pair of strategies. Observe that \((\pi _{*}^{1},\pi _{*}^{2}) \in \varPi ^{1}\times \varPi ^{2}\) is an optimal pair if and only if

$$\begin{aligned} \tilde{V}(x,\pi ^{1},\pi _{*}^{2})\le \tilde{V}(x,\pi _{*}^{1},\pi _{*}^{2})\le \tilde{V}(x,\pi _{*}^{1},\pi ^{2}),\ \ \forall (\pi ^{1},\pi ^{2})\in \varPi ^{1}\times \varPi ^{2},\ x\in \mathbf {X}. \end{aligned}$$
(12)

An important fact in our analysis on the existence of a value of the game is the sufficiency of Markov strategies in the following sense.

Proposition 1

Let \((\varphi _{*}^{1},\varphi _{*}^{2})\in \varPi _{S}^{1}\times \varPi _{S}^{2}\) be an optimal pair with respect to the Markov strategies, i.e.,

$$\begin{aligned} \tilde{V}(x,\varphi ^{1},\varphi _{*}^{2})\le \tilde{V}(x,\varphi _{*}^{1},\varphi _{*}^{2})\le \tilde{V}(x,\varphi _{*}^{1},\varphi ^{2}),\ \ \forall (\varphi ^{1},\varphi ^{2})\in \varPi _{M}^{1}\times \varPi _{M}^{2},\ x\in \mathbf {X}. \end{aligned}$$
(13)

Then \((\varphi _{*}^{1},\varphi _{*}^{2})\) is an optimal pair with respect to all strategies, i.e., (12) holds.

By virtue of Proposition 1, we can restrict our study to the set of Markov strategies. Furthermore, for Markov strategies, we can express the performance index (11) in terms of the distribution \(\theta \) of the discount factor random disturbance. We proceed to establish this fact precisely.

We define the mean discount factor function \(\alpha _\theta :\mathbb {K} \rightarrow (0,1)\) as

$$\begin{aligned} \alpha _{\theta }(x,a,b):=\int _{\mathbf {S}}\tilde{\alpha }(x,a,b,s)\theta (\mathrm{d}s),\ \ (x,a,b)\in \mathbb {K}, \end{aligned}$$
(14)

and denote

$$\begin{aligned} \varGamma _{n}=\prod \limits _{k=0}^{n-1}\alpha _{\theta }(x_{k},a_{k},b_{k}) \,\text { if } \, n\in \mathbb {N},\text { and }\varGamma _{0}=1. \end{aligned}$$
(15)
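When the integral in (14) has no closed form, \(\alpha _{\theta }\) can be approximated by averaging over draws from \(\theta \). A minimal sketch, in which `alpha_tilde` and `theta_sampler` are hypothetical stand-ins for \(\tilde{\alpha }\) and a sampler of \(\theta \):

```python
import numpy as np

def alpha_theta(x, a, b, alpha_tilde, theta_sampler, n=100_000):
    """Monte Carlo estimate of (14): the mean of alpha~(x, a, b, .) under theta."""
    s = theta_sampler(n)                        # n i.i.d. draws from theta
    return float(np.mean(alpha_tilde(x, a, b, s)))
```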

For each pair of strategies \((\pi ^{1},\pi ^{2})\in \varPi ^{1}\times \varPi ^{2}\) and initial state \(x\in \mathbf {X}\), we define

$$\begin{aligned} V(x,\pi ^{1},\pi ^{2}):=E_{x}^{\pi ^{1},\pi ^{2}}\left[ \sum _{n=0}^{\infty } \varGamma _{n}r(x_{n},a_{n},b_{n})\right] . \end{aligned}$$
(16)

Proposition 2

For each initial state \(x\in \mathbf {X}\) and pair of strategies \((\varphi ^{1},\varphi ^{2})\in \varPi _{M}^{1}\times \varPi _{M}^{2} \),

$$\begin{aligned} V(x,\varphi ^{1},\varphi ^{2})=\tilde{V}(x,\varphi ^{1},\varphi ^{2}). \end{aligned}$$
(17)

4 Existence of Optimal Strategies

To ease notation, the probability measures \(\varphi ^{1}(\cdot |x)\in \mathbb { A}(x)\) and \(\varphi ^{2}(\cdot |x)\in \mathbb {B}(x)\), \(x\in \mathbf {X},\) are written \(\varphi ^{i}(x)=\varphi ^{i}(\cdot |x),\)\(i=1,2.\) In addition, for a measurable function \(u:\mathbb {K}\rightarrow \mathbb {R} \),

$$\begin{aligned} u(x,\varphi ^{1},\varphi ^{2})=u(x,\varphi ^{1}(x),\varphi ^{2}(x)):=\int _{B(x)}\int _{A(x)}u(x,a,b)\varphi ^{1}(da|x)\varphi ^{2}(db|x). \end{aligned}$$
(18)

For instance, for \(x\in \mathbf {X}\), we have

$$\begin{aligned} r(x,\varphi ^{1},\varphi ^{2}):=\int _{B(x)}\int _{A(x)}r(x,a,b)\varphi ^{1}(da|x)\varphi ^{2}(db|x), \end{aligned}$$

and

$$\begin{aligned} Q(X|x,\varphi ^{1},\varphi ^{2}):=\int _{B(x)}\int _{A(x)}Q(X|x,a,b)\varphi ^{1}(da|x)\varphi ^{2}(db|x),\ \ X\in \mathcal {B}(\mathbf {X}). \end{aligned}$$

The existence of a value of the game as well as a pair of optimal strategies is analyzed under the following conditions.

Assumption 1

The game model (4) satisfies the following:

(a) For each \(x\in \mathbf {X}\), the sets A(x) and B(x) are compact.

(b) For each \((x,a,b)\in \mathbb {K},\) \(r(x,\cdot ,b)\) is upper semicontinuous (usc) on A(x), and \(r(x,a,\cdot )\) is lower semicontinuous (lsc) on B(x). Moreover, there exist a constant \(r_{0}>0\) and a function \( W:\mathbf {X}\rightarrow [1,\infty )\) such that

$$\begin{aligned} |r(x,a,b)|\le r_{0}W(x), \end{aligned}$$
(19)

and the functions

$$\begin{aligned} \int _{\mathbf {X}}W(y)Q(\mathrm{d}y|x,\cdot ,b)\ \ \text { and }\ \ \int _{\mathbf {X} }W(y)Q(\mathrm{d}y|x,a,\cdot ) \end{aligned}$$
(20)

are continuous on A(x) and B(x),  respectively.

(c) For each \((x,a,b)\in \mathbb {K}\) and each bounded measurable function u on \(\mathbf {X},\) the functions

$$\begin{aligned} \int _{\mathbf {X}}u(y)Q(\mathrm{d}y|x,\cdot ,b) \ \ \text { and } \ \ \int _{\mathbf {X} }u(y)Q(\mathrm{d}y|x,a,\cdot ) \end{aligned}$$

are continuous on A(x) and B(x),  respectively.

(d) The function \(\tilde{\alpha }(x,a,b,s)\) is continuous on \( \mathbb {K}\times \mathbf {S},\) and

$$\begin{aligned} \alpha ^{*}:=\sup _{(x,a,b)\in \mathbb {K}}\alpha _{\theta }(x,a,b)<1. \end{aligned}$$
(21)

(e) There exists a positive constant \(\beta \) such that \( 1\le \beta <(\alpha ^{*})^{-1},\) and for every \((x,a,b)\in \mathbb {K}\)

$$\begin{aligned} \ \int \limits _{\mathbf {X}}W(y)Q(\mathrm{d}y\mid x,a,b)\le \beta W(x). \end{aligned}$$
(22)

For each measurable function \(u:\mathbf {X}\rightarrow {\mathbb {R}}\), we define the W-norm as

$$\begin{aligned} ||u||_{W}:=\sup _{x\in \mathbf {X}}\frac{|u(x)|}{W(x)}, \end{aligned}$$

and let \({\mathbb {B}}_{W}\) be the Banach space of all real-valued measurable functions defined on \(\mathbf {X}\) with finite W-norm. It is easy to prove that under Assumption 1, the Shapley operator

$$\begin{aligned} Tu(x):=\inf _{\varphi ^{2}\in \mathbb {B}(x)}\sup _{\varphi ^{1}\in \mathbb {A} (x)}\hat{T}(u,x,\varphi ^{1},\varphi ^{2}),\ \ x\in \mathbf {X}, \end{aligned}$$
(23)

maps \({\mathbb {B}}_{W}\) into itself, where

$$\begin{aligned} \hat{T}(u,x,a,b):=r(x,a,b)+\alpha _{\theta }(x,a,b)\int _{\mathbf {X} }u(y)Q(\mathrm{d}y|x,a,b),\ \ (x,a,b)\in \mathbb {K}. \end{aligned}$$
(24)

Moreover, as will be established later, the interchange of inf and sup in (23) holds.
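Although the paper works on Borel spaces, the contraction property of (23) is easy to visualize in a finite toy model, where each application of T solves one matrix game per state. The following is a minimal sketch, not the paper's construction: `r`, `alpha`, and `Q` are hypothetical finite arrays (with state-independent action sets) standing in for \(r\), \(\alpha _{\theta }\), and \(Q\), and the matrix-game value is computed by linear programming.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game M (row player maximizes)."""
    m, n = M.shape
    c = np.zeros(m + 1); c[-1] = -1.0          # variables (p, v); maximize v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])  # v <= p^T M[:, j] for every column j
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=A_eq, b_eq=np.array([1.0]), bounds=bounds)
    return res.x[-1]

def shapley_iteration(r, alpha, Q, tol=1e-10):
    """Iterate v <- Tv with hat-T as in (24); r, alpha: (S,A,B); Q: (S,A,B,S)."""
    v = np.zeros(r.shape[0])
    while True:
        M = r + alpha * np.einsum("xaby,y->xab", Q, v)  # hat-T(v, x, a, b)
        v_new = np.array([matrix_game_value(Mx) for Mx in M])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new                                # fixed point, cf. Theorem 1(b)
        v = v_new
```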

We now state our main results as follows.

Theorem 1

Suppose that Assumption 1 holds. Then

  1. (a)

    the game \(\mathcal {GM}\) (4) has a value \( V^{*}\in \mathbb {B}_{W}\),

  2. (b)

    the value \(V^{*}\) is the unique function in \(\mathbb {B} _{W} \) such that \(TV^{*}=V^{*}\), and

  3. (c)

    there exist \(\varphi _{*}^{1}(x)\in \mathbb {A}(x)\) and \(\varphi _{*}^{2}(x)\in \mathbb {B}(x)\) such that

    $$\begin{aligned} V^{*}(x)= & {} \hat{T}(V^{*},x,\varphi _{*}^{1},\varphi _{*}^{2}) \end{aligned}$$
    (25)
    $$\begin{aligned}= & {} \max _{\varphi ^{1}\in \mathbb {A}(x)}\hat{T}(V^{*},x,\varphi ^{1},\varphi _{*}^{2}) \end{aligned}$$
    (26)
    $$\begin{aligned}= & {} \min _{\varphi ^{2}\in \mathbb {B}(x)}\hat{T}(V^{*},x,\varphi _{*}^{1},\varphi ^{2}),\;\;\;\forall x\in \mathbf {X}. \end{aligned}$$
    (27)

In addition, the stationary strategies \(\varphi _{*}^{1}=\left\{ \varphi _{*}^{1}\right\} \in \varPi _{S}^{1}\) and \(\varphi _{*}^{2}=\left\{ \varphi _{*}^{2}\right\} \in \varPi _{S}^{2}\) form an optimal pair of strategies with respect to the Markov strategies. Hence, from Proposition 1, \(\left( \varphi _{*}^{1},\varphi _{*}^{2}\right) \) is an optimal pair of strategies for the game \(\mathcal {GM}\).

5 Examples

In order to illustrate the theory developed above, we present two classes of examples. In the first one, Examples 1–3, we describe potential applications of performance indices with nonconstant discount factors. Specifically, in Example 1 we present an application of this kind of optimality criterion in games involving monetary units, in which the discount factor is a function of a random interest rate and/or inflation rate, while in Examples 2 and 3 the state-actions-dependent discount factors appear in a natural manner. Finally, Examples 4 and 5, which constitute the second class, are devoted to illustrating the assumptions imposed on the game model.

Example 1

(Monetary payoffs) Consider the game model (4). In general, r is a utility function which represents the preferences over the outcomes \((x,a,b)\) in \(\mathbb {K}\), and so money is not necessarily involved ([22, p. 9] or [32, p. 13]). In this example, we assume that \(r(x,a,b)\) is indeed measured in monetary units. Let

$$\begin{aligned} \tilde{\alpha }(x,a,b,\xi )=\frac{1}{1+\rho -\xi }, \end{aligned}$$

where \(\rho >0\) is the (constant) nominal interest rate and \(\xi \) represents the inflation rate between two consecutive periods; thus \(\rho -\xi \) is the real interest rate, which is random. Assume that \(\xi \) takes values in \(\mathbf {S}=[\underline{s},\overline{s}]\), with \(0<\underline{s}<\overline{s}<\rho \). Hence, Assumption 1(d) trivially follows since

$$\begin{aligned} \alpha ^{*}=\frac{1}{1+\rho -\overline{s}}<1. \end{aligned}$$
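For instance, with the illustrative values \(\rho =0.05\), \(\underline{s}=0.01\), and \(\overline{s}=0.03\) (these numbers are ours, not part of the model), the discount factor \(\tilde{\alpha }\) ranges over \([1/1.04,\,1/1.02]\) and \(\alpha ^{*}=1/1.02\approx 0.98<1\).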

Example 2

(A nondiscounted payoff game with random horizon) In Sect. 2 we described how the discounted game with infinite horizon is played. Let us consider the game model

$$\begin{aligned} (\mathbf {X},\mathbf {A},\mathbf {B},{\mathbb {K}}_{\mathbf {A}},{\mathbb {K}}_{ \mathbf {B}},Q,\alpha ,r) \end{aligned}$$
(28)

with the following alternative play, in which the horizon is random. For simplicity, we do not consider the disturbance space \(\mathbf {S}\). At state \(x_{n}\), players 1 and 2 choose actions \((a_{n},b_{n})\) and receive \(r(x_{n},a_{n},b_{n})\) and \(-r(x_{n},a_{n},b_{n})\), respectively; then, with probability \(1-\alpha (x_{n},a_{n},b_{n})\), the game stops; otherwise, the system moves to another state \(x_{n+1}\) according to \(Q(\cdot \mid x_{n},a_{n},b_{n})\), where the nondiscounted payoff \( r(x_{n+1},a_{n+1},b_{n+1})\) is determined by the actions \((a_{n+1},b_{n+1})\). We assume that there is \(\gamma \in (0,1)\) such that \(1-\alpha (x_{n},a_{n},b_{n})\ge \gamma \). Thus

$$\begin{aligned} \alpha ^{*}:=\sup _{(x,a,b)\in \mathbb {K}}\alpha (x,a,b)\le 1-\gamma <1. \end{aligned}$$

We will show that the total expected payoff in this game takes the form (16). For this purpose, let \(x^{*}\) be an artificial state and \((a^{*},b^{*})\) artificial actions. We define the game model

$$\begin{aligned} \mathcal {GM}^{*}=(X^{*},A^{*},B^{*},\mathbb {K}_{A^{*}}, \mathbb {K}_{B^{*}},Q^{*},r^{*}) \end{aligned}$$

where \(X^{*}=X\cup \{x^{*}\},\)\(A^{*}=A\cup \{a^{*}\},\)\( B^{*}=B\cup \{b^{*}\},\) and the corresponding x-sections are the sets

$$\begin{aligned} A^{*}(x):= & {} \left\{ \begin{array}{ccc} \left\{ a^{*}\right\} &{} if &{} x=x^{*}, \\ A(x) &{} if &{} x\in X; \end{array} \right. \\ B^{*}(x):= & {} \left\{ \begin{array}{ccc} \left\{ b^{*}\right\} &{} if &{} x=x^{*}, \\ B(x) &{} if &{} x\in X; \end{array} \right. \end{aligned}$$

The transition law \(Q^{*}\) among the states in \(X^{*}\) is a stochastic kernel on \(X^{*}\) given the set

$$\begin{aligned} \mathbb {K}^{*}:=\left\{ (x,a,b):x\in X^{*},a\in A^{*}(x),b\in B^{*}(x)\right\} \end{aligned}$$

defined as follows: For \((x,a,b)\in \mathbb {K},\)

$$\begin{aligned} Q^{*}(D\mid x,a,b):= & {} \alpha (x,a,b)Q(D\mid x,a,b),\ \ D\in \mathcal {B} (X), \\ Q^{*}(\{x^{*}\}\mid x,a,b):= & {} 1-\alpha (x,a,b), \\ Q^{*}(\{x^{*}\}\mid x^{*},a^{*},b^{*}):= & {} 1. \end{aligned}$$

Finally, the payoff function \(r^{*}:\mathbb {K}^{*}\rightarrow \mathbb {R} \) is given by

$$\begin{aligned} r^{*}(x,a,b):=\left\{ \begin{array}{lll} r(x,a,b) &{} \text{ if } &{} (x,a,b)\in \mathbb {K}, \\ 0 &{} \text{ if } &{} (x,a,b)=(x^{*},a^{*},b^{*}). \end{array} \right. \end{aligned}$$

On the other hand, let \((\varOmega ^{\prime },\mathcal {F}^{\prime })\) be the measurable space associated with the game model \(\mathcal {GM}^{*}\) (see Sect. 2) and define the first passage time \(\tau :\varOmega ^{\prime }\rightarrow \mathbb {N}_{0}\cup \{+\infty \}\) as

$$\begin{aligned} \tau (x_{0},a_{0},b_{0},\ldots ):=\inf \{n\in \mathbb {N}_{0}:x_{n}=x^{*}\}, \end{aligned}$$

where, as usual, \(\inf \emptyset =+\infty \). For each pair of strategies \( (\varphi ^{1},\varphi ^{2})\in \varPi _{M}^{1}\times \varPi _{M}^{2}\) and initial state \(x\in X,\) the total expected payoff with random horizon \(\tau \) takes the form

$$\begin{aligned} V_{\tau }(x,\varphi ^{1},\varphi ^{2}):=E_{x}^{\varphi ^{1},\varphi ^{2}}\sum _{n=0}^{\tau }r^{*}(x_{n},a_{n},b_{n}). \end{aligned}$$
(29)

Then a straightforward calculation shows that the performance index (29) can be written as a performance index with state-actions-dependent discount factors. Specifically, by following arguments similar to those in the proof of Proposition 2, it is possible to prove the equality

$$\begin{aligned} V_{\tau }(x,\varphi ^{1},\varphi ^{2})=V(x,\varphi ^{1},\varphi ^{2})=E_{x}^{\varphi ^{1},\varphi ^{2}}\sum _{n=0}^{\infty }\prod \limits _{k=0}^{n-1}\alpha (x_{k},a_{k},b_{k})r^{*}(x_{n},a_{n},b_{n}). \end{aligned}$$

Hence, provided that Assumption 1 holds, Theorem 1 yields that the game (28) with random horizon has a value and that there exists a pair of optimal stationary strategies.
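The equivalence can also be checked numerically. The following minimal sketch uses an illustrative two-state model with actions suppressed (all numbers are assumptions, not part of the example): it compares a Monte Carlo estimate of the random-horizon payoff with the discounted value obtained from the fixed point of \(v=r+\alpha \,Qv\).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.7, 0.8])          # continuation probabilities alpha(x)
r = np.array([1.0, -2.0])             # stage payoffs r(x)
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])            # Q(. | x), given that the game continues

def random_horizon_payoff(x, n_paths=100_000):
    """Monte Carlo estimate of the payoff accumulated until the game stops."""
    total = 0.0
    for _ in range(n_paths):
        s, acc = x, 0.0
        while True:
            acc += r[s]
            if rng.random() > alpha[s]:    # the game stops after this stage
                break
            s = rng.choice(2, p=P[s])
        total += acc
    return total / n_paths

def discounted_payoff(x, n_iter=300):
    """E[sum_n Gamma_n r(x_n)] with Gamma_n = prod_k alpha(x_k): solve v = r + alpha * P v."""
    v = np.zeros(2)
    for _ in range(n_iter):
        v = r + alpha * (P @ v)
    return v[x]

print(random_horizon_payoff(0), discounted_payoff(0))   # the two estimates agree
```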

This game model with random horizon is in the spirit of Shapley’s seminal paper [35], where finite stochastic games were introduced. Similar games, but with continuous and bounded payoff functions in the performance index (16) and a countable state space, were studied by Rieder [33]. On the other hand, Markov decision models with random horizon and Borel spaces have also been studied in several settings (see, for instance, [2, 4]); however, such control processes are assumed to stop with constant probability. Therefore, our example generalizes many results in the existing literature.

Example 3

(A semi-Markov game) Consider a zero-sum semi-Markov game (see, e.g., [17, 20, 24]) where the sojourn (or holding) times \(\xi _{1},\xi _{2},\ldots \) are i.i.d. random variables with common exponential distribution with parameter \(\lambda >0.\) Suppose that the discount rate is a continuous function \(\gamma :\mathbb {K}\rightarrow (d,\infty )\) where \(d>0.\) Then the expected discounted payoff is

$$\begin{aligned} \bar{V}(x,\pi ^{1},\pi ^{2}):=E_{x}^{\pi ^{1},\pi ^{2}}\left[ r(x_{0},a_{0},b_{0})+\sum \limits _{n=1}^{\infty }\prod \limits _{k=0}^{n-1}\mathrm{e}^{-\gamma (x_{k},a_{k},b_{k})\xi _{k+1}}r(x_{n},a_{n},b_{n})\right] . \end{aligned}$$

If we define the function \(\tilde{\alpha }:\mathbb {K}\times \mathbf {S} \rightarrow (0,1)\) as

$$\begin{aligned} \tilde{\alpha }(x,a,b,\xi )=\mathrm{e}^{-\gamma (x,a,b)\xi }, \end{aligned}$$

where \(\mathbf {S}=(0,\infty ),\) then the performance index \(\bar{V}\) takes the form (11). In addition, observe that \(\tilde{\alpha }\) is continuous on \(\mathbb {K}\times \mathbf {S}\), and for all \((x,a,b)\in \mathbb {K},\)

$$\begin{aligned} \alpha _{\theta }(x,a,b)=\lambda \int _{0}^{\infty }\mathrm{e}^{-\gamma (x,a,b)s}\mathrm{e}^{-\lambda s}\mathrm{d}s=\frac{\lambda }{\lambda +\gamma (x,a,b)}< \frac{ \lambda }{\lambda +d}. \end{aligned}$$

Thus

$$\begin{aligned} \alpha ^{*}< \frac{\lambda }{\lambda +d}<1. \end{aligned}$$
(30)
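The closed-form value of \(\alpha _{\theta }\) is easy to confirm by simulation; a minimal sketch with illustrative rates (the numbers are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, gamma = 2.0, 0.5                                    # illustrative lambda and gamma(x,a,b)
xi = rng.exponential(scale=1.0 / lam, size=1_000_000)    # sojourn times xi ~ Exp(lam)
print(np.exp(-gamma * xi).mean(), lam / (lam + gamma))   # both are approximately 0.8
```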

To the best of our knowledge, semi-Markov models with state-action-dependent discount factors have been considered only for decision processes in [15, 16].

We conclude by presenting some insights into the fulfillment of the continuity and W-growth conditions imposed in Assumption 1. Such conditions are standard in the literature (see, e.g., [14, 18, 20, 24,25,26]) and are satisfied by several zero-sum game models.

As stated in [14, Appendix C], Assumption 1(c) holds if the transition kernel Q on \(\mathbf {X}\) given \(\mathbb {K}\) has a density \(q(y|x,a,b)\), continuous in \((x,a,b)\in \mathbb {K}\), with respect to a \(\sigma \)-finite measure m on \(\mathbf {X}\), that is,

$$\begin{aligned} Q\left( X|x,a,b\right) =\int _{X}q(y|x,a,b)m(\mathrm{d}y),\ \ X\in \mathcal {B}(\mathbf { X}),\ (x,a,b)\in \mathbb {K}. \end{aligned}$$

Furthermore, Assumption 1(c) also holds for games that evolve on \( \mathbf {X}=\mathbb {R}\) according to noise-additive difference equations of the form

$$\begin{aligned} x_{n+1}=G(x_{n},a_{n},b_{n})+w_{n},\ \ n\in \mathbb {N}_{0}, \end{aligned}$$

with \(\mathbf {A}=\mathbf {B}=\mathbb {R}\), where G is a continuous function and \(\left\{ w_{n}\right\} \) is a sequence of i.i.d. random variables with continuous density g on \(\mathbb {R}\). In this case the kernel Q takes the form

$$\begin{aligned} Q\left( X|x,a,b\right) =\int _\mathbb {R}I_{X}\left[ G(x,a,b)+w\right] g(w)dw, \end{aligned}$$

where \(I_{X}(\cdot )\) stands for the indicator function of the set \(X\in \mathcal {B}(\mathbf {X}).\)

In general, the conditions related to the weighted function W are easier to illustrate in difference-equation game models as we show in the following examples.

Example 4

(A linear quadratic game (see [7])) Consider a game whose dynamics is defined by the linear equation

$$\begin{aligned} x_{n+1}=x_{n}+a_{n}+b_{n}+w_{n},\ \ n\in \mathbb {N}_{0}, \end{aligned}$$

where \(\mathbf {X}=\mathbf {A}=\mathbf {B}=\mathbb {R}\) and \(\left\{ w_{n}\right\} \) is a sequence of i.i.d. random variables with the standard normal density

$$\begin{aligned} g(w):=\frac{1}{\sqrt{2\pi }}\exp \left( -\frac{w^{2}}{2}\right) ,\ \ \ w\in \mathbb {R}. \end{aligned}$$

We assume that the admissible action sets are \(A(x)=B(x)=[ - \vert x \vert /2,\vert x \vert / 2].\) In addition, the one-stage payoff r is a quadratic function such that

$$\begin{aligned} \left| r(x,a,b)\right| \le r_{0}(x^{2}+1), \end{aligned}$$

for some positive constant \(r_{0}.\)

Since the dynamics are defined by a continuous noise-additive function and g is a continuous density, Assumption 1(c) holds by [14, Appendix C]. Moreover, if we take \(W(x):=x^{2}+1,\) the same arguments yield the continuity of the functions defined in (20).

On the other hand, for all \((x,a,b)\in \mathbb {K},\)

$$\begin{aligned} \int _{\mathbb {R}}W(y)Q(\mathrm{d}y|x,a,b)= & {} \int _{\mathbb {R}}[(x+a+b+w)^{2}+1]\frac{1 }{\sqrt{2\pi }}\exp \left( -\frac{w^{2}}{2}\right) dw \\= & {} \int _{\mathbb {R}}(x+a+b+w)^{2}\frac{1}{\sqrt{2\pi }}\exp \left( -\frac{ w^{2}}{2}\right) dw+1 \\= & {} \int _{\mathbb {R}}y^{2}\frac{1}{\sqrt{2\pi }}\exp \left( -\frac{ (y-(x+a+b))^{2}}{2}\right) \mathrm{d}y+1 \\= & {} (x+a+b)^{2}+2\le 4x^{2}+2\le 4W(x). \end{aligned}$$

Hence, Assumption 1(d) and (e) are satisfied with \(\beta =4\) and any continuous discount function \(\tilde{\alpha }\) such that \(\tilde{\alpha }(x,a,b,s)<\frac{1}{4}\).
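The bound \(\int W\,\mathrm{d}Q\le 4W(x)\) derived above can be sanity-checked by simulation; a minimal sketch with illustrative admissible values (here \(x=1.5\), so \(a,b\in [-0.75,0.75]\); the numbers are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
W = lambda x: x**2 + 1.0
x, a, b = 1.5, 0.75, -0.5                    # admissible since |a|, |b| <= |x| / 2
w = rng.standard_normal(1_000_000)
lhs = W(x + a + b + w).mean()                # Monte Carlo E[W(x_{n+1})]
print(lhs, (x + a + b)**2 + 2.0, 4 * W(x))   # ~5.06, 5.0625, 13.0
```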

Example 5

(A semi-Markov storage system) We consider a storage system with controlled input/output, whose evolution is as follows. At time \(T_{n}\), when an amount \(M>0\) of a certain product accumulates for admission to the system, player 1 selects an action \(a\in [a_{*},1]=:\mathbf {A},\) \(a_{*}\in (0,1),\) representing the portion of M to be admitted. In addition, there is a continuous consumption of the admitted product, which is controlled by player 2 by selecting \(b\in [b_{*},b^{*}]=:\mathbf {B}\) \((0<b_{*}<b^{*}),\) which represents the consumption rate per unit time. Thus, if \(x_{n}\in \mathbf {X}:=[0,\infty )\) is the stock level and \(a_{n}\) and \(b_{n}\) are the decisions of players 1 and 2, respectively, at the nth decision epoch \(T_{n},\) the process \(\left\{ x_{n}\right\} \) can be modeled as a semi-Markov game evolving according to the equation

$$\begin{aligned} x_{n+1}=(x_{n}+a_{n}M-b_{n}\xi _{n+1})^{+} \end{aligned}$$

with holding times \(\xi _{n+1}:=T_{n+1}-T_{n}.\) In the context of Example 3, we suppose that \(\left\{ \xi _{n}\right\} \) is a sequence of i.i.d. random variables, exponentially distributed with parameter \(\lambda >0.\) Moreover, the discount rate is a continuous function \(\gamma :\mathbb {K}\rightarrow (d,\infty )\) where \(d>0.\) It is reasonable to assume that

$$\begin{aligned} b_{*}E(\xi )<b^{*}E(\xi )=\frac{b^{*}}{\lambda }<M. \end{aligned}$$
(31)

Let \(\varPsi \) be the moment generating function of the random variable \( M-b_{*}\xi \), that is:

$$\begin{aligned} \varPsi (t)=E[\exp (t(M-b_{*}\xi ))]=\frac{\lambda \exp (Mt)}{b_{*}t+\lambda }. \end{aligned}$$

Then, computing the derivative \(\varPsi ^{\prime }\) and using the fact that \(b_{*}<M\lambda \) (see (31)), it is easy to prove that \(\varPsi ^{\prime }(t)>0\) for \(t>0\). Moreover, taking the constant \(d>\lambda \), since \(\varPsi \) is continuous and increasing with \(\varPsi (0)=1\) and \(\varPsi (t)\rightarrow \infty \) as \(t\rightarrow \infty \), there exists \(\lambda ^{*}>0\) such that

$$\begin{aligned} \beta _{0}:=\varPsi (\lambda ^{*})=\frac{d}{\lambda }. \end{aligned}$$
(32)
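In practice, \(\lambda ^{*}\) in (32) can be computed numerically; since \(\varPsi (0)=1<d/\lambda \) and \(\varPsi \) is increasing and unbounded, a bracketing root finder applies. A minimal sketch with illustrative parameters satisfying \(b_{*}<M\lambda \) and \(d>\lambda \) (all values are assumptions):

```python
import numpy as np
from scipy.optimize import brentq

lam, b_star, M, d = 1.0, 0.5, 1.0, 1.5     # require b_* < M * lam and d > lam

def Psi(t):
    """Moment generating function of M - b_* xi for xi ~ Exp(lam), cf. the display above."""
    return lam * np.exp(M * t) / (b_star * t + lam)

t_hi = 1.0
while Psi(t_hi) < d / lam:                 # grow the bracket until Psi exceeds d / lam
    t_hi *= 2.0
lam_star = brentq(lambda t: Psi(t) - d / lam, 0.0, t_hi)
print(lam_star, Psi(lam_star))             # Psi(lam_star) = d / lam = 1.5
```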

Now we assume that the one-stage payoff r is an arbitrary function satisfying Assumption 1(b) such that

$$\begin{aligned} \left| r(x,a,b)\right| \le r_{0}\mathrm{e}^{\lambda ^{*}x}, \end{aligned}$$

for some constant \(r_{0}>0.\) Hence, defining \(W(x):=\mathrm{e}^{\lambda ^{*}x},\) relation (19) is satisfied. Moreover, for \((x,a,b)\in \mathbb {K},\)

$$\begin{aligned} \int _{\mathbf {X}}\mathrm{e}^{\lambda ^{*}y}Q(\mathrm{d}y|x,a,b)= & {} \int _{0}^{\infty }\mathrm{e}^{\lambda ^{*}(x+aM-bs)^{+}}\lambda \mathrm{e}^{-\lambda s}\mathrm{d}s \\\le & {} P[x+aM-b\xi \le 0]+\mathrm{e}^{\lambda ^{*}x}\int _{0}^{\infty }\mathrm{e}^{\lambda ^{*}(M-bs)}\lambda \mathrm{e}^{-\lambda s}\mathrm{d}s \\\le & {} 1+W(x)E[\mathrm{e}^{\lambda ^{*}(M-b_{*}\xi )}] \\\le & {} (\beta _{0}+1)W(x). \end{aligned}$$

Hence, combining (30) and (32), we obtain

$$\begin{aligned} 1<\beta _{0}+1=\frac{d}{\lambda }+1=\frac{\lambda +d}{\lambda }<(\alpha ^{*})^{-1}, \end{aligned}$$

and defining \(\beta :=\beta _{0}+1,\) Assumption 1(d) and (e) are satisfied.

Finally, to verify Assumption 1(c), let u be a bounded measurable function on \(\mathbf {X}\) and let \(\rho _{(a,b)}\) be the density of the random variable \(aM-b\xi \), for every fixed \(a\in \mathbf {A}\) and \(b\in \mathbf {B}\). Observe that

$$\begin{aligned} \rho _{(a,b)}(y)=\frac{1}{b}\lambda \mathrm{e}^{-\lambda \left( \frac{aM-y}{b}\right) },\;\;-\infty <y\le aM, \end{aligned}$$

and therefore, for each \(y\in \mathbb {R},\) \((a,b)\longmapsto \rho _{(a,b)}(y)\) is a continuous function on \(\mathbf {A}\times \mathbf {B}\). Hence,

$$\begin{aligned} \int _{\mathbf {X}}u(y)Q(\mathrm{d}y \mid x,a,b)&=\int _{\mathbb {R}}u[(x+y)^{+}]\rho _{(a,b)}(y)\mathrm{d}y \\&=u(0)\int _{-\infty }^{-x}\rho _{(a,b)}(y)\mathrm{d}y+\int _{-x}^{\infty }u(x+y)\rho _{(a,b)}(y)\mathrm{d}y \\&=u(0)\int _{-\infty }^{-x}\rho _{(a,b)}(y)\mathrm{d}y+\int _{0}^{\infty }u(y)\rho _{(a,b)}(y-x)\mathrm{d}y. \end{aligned}$$

Thus by Scheffé’s Theorem,

$$\begin{aligned} (a,b)\longmapsto \int _{X}u(y)Q(\mathrm{d}y\mid x,a,b) \end{aligned}$$

defines a continuous function on \(\mathbf {A}\times \mathbf {B},\) which proves that Assumption 1(c) holds. The continuity of the functions in (20) is shown similarly.

6 Proofs

6.1 Proof of Proposition 1

The proof is a consequence of the following facts.

Let us fix \(\varphi ^{2}\in \varPi _{S}^{2}.\) Define the stochastic kernel \( Q_{\varphi ^{2}}\) on \(\mathbf {X}\) given \(\mathbb {K}_{\mathbf {A}}\) as

$$\begin{aligned} Q_{\varphi ^{2}}(X|x,a):=\int _{\mathbf {B}}Q(X|x,a,b)\varphi ^{2}(db|x),\ \ X\in \mathcal {B}(\mathbf {X}), \end{aligned}$$
(33)

and let \(r_{\varphi ^{2}}:\mathbb {K}_{\mathbf {A}}\rightarrow \mathbb {R}\) and \(\tilde{\alpha }_{\varphi ^{2}}:\mathbb {K}_{\mathbf {A}}\times \mathbf {S}\rightarrow (0,1)\) be the measurable functions defined as

$$\begin{aligned} r_{\varphi ^{2}}(x,a):= & {} \int _{\mathbf {B}}r(x,a,b)\varphi ^{2}(db|x), \end{aligned}$$
(34)
$$\begin{aligned} \tilde{\alpha }_{\varphi ^{2}}(x,a,s):= & {} \int _{\mathbf {B}}\tilde{\alpha } (x,a,b,s)\varphi ^{2}(db|x). \end{aligned}$$
(35)

In addition, let \(\pi ^{1}\in \varPi ^{1}\) be an arbitrary strategy, and for \( x\in \mathbf {X},\) we define the performance index

$$\begin{aligned} \tilde{V}_{\varphi ^{2}}(x,\pi ^{1}):=E_{x}^{\pi ^{1}}\left[ \sum _{n=0}^{\infty }\tilde{\varGamma }_{n}^{\varphi ^{2}}r_{\varphi ^{2}}(x_{n},a_{n})\right] , \end{aligned}$$
(36)

where

$$\begin{aligned} \tilde{\varGamma }_{n}^{\varphi ^{2}}=\prod \limits _{k=0}^{n-1}\tilde{\alpha } _{\varphi ^{2}}(x_{k},a_{k},\xi _{k+1}),\ \ \tilde{\varGamma } _{0}^{\varphi ^{2}}=1, \end{aligned}$$

and \(E_{x}^{\pi ^{1}}\) is the expectation operator with respect to the probability measure \(P_{x}^{\pi ^{1}}\equiv P_{x}^{\pi ^{1},\varphi ^{2}}\) induced by \((\pi ^{1},\varphi ^{2})\in \varPi ^{1}\times \varPi _{S}^{2}\) and \( x_{0}=x.\) Then, from (7)–(10), \(P_{x}^{\pi ^{1}}\) satisfies the following properties:

$$\begin{aligned} P_{x}^{\pi ^{1}}\left[ x_{0}\in X\right]= & {} \delta _{x}(X),\ \ X\in \mathcal {B} (\mathbf {X}); \end{aligned}$$
(37)
$$\begin{aligned} P_{x}^{\pi ^{1}}\left[ a_{n}\in A|h_{n}\right]= & {} P_{x}^{\pi ^{1}}\left[ a_{n}\in A,b_{n}\in \mathbf {B}|h_{n}\right] \nonumber \\= & {} \pi _{n}^{1}\left( A|h_{n}\right) \varphi _{n}^{2}\left( \mathbf {B} |x_{n}\right) \nonumber \\= & {} \pi _{n}^{1}\left( A|h_{n}\right) ,\ \ A\in \mathcal {B}(\mathbf {A}); \end{aligned}$$
(38)
$$\begin{aligned} P_{x}^{\pi ^{1}}\left[ x_{n+1}\in X|h_{n},a_{n},b_{n},\xi _{n+1}\right]= & {} Q_{\varphi ^{2}}\left( X|x_{n},a_{n}\right) ,\ \ X\in \mathcal {B}(\mathbf {X} ); \end{aligned}$$
(39)
$$\begin{aligned} P_{x}^{\pi ^{1}}\left[ \xi _{n+1}\in S|h_{n},a_{n},b_{n}\right]= & {} \theta (S),\ \ S\in \mathcal {B}(\mathbf {S}). \end{aligned}$$
(40)

Similarly, for a fixed \(\varphi ^{1}\in \varPi _{S}^{1},\) define \( Q_{\varphi ^{1}},\)\(r_{\varphi ^{1}},\)\(\tilde{\alpha }_{\varphi ^{1}}\) and the performance index

$$\begin{aligned} \tilde{V}_{\varphi ^{1}}(x,\pi ^{2}):=E_{x}^{\pi ^{2}}\left[ \sum _{n=0}^{\infty }\tilde{\varGamma }_{n}^{\varphi ^{1}}r_{\varphi ^{1}}(x_{n},b_{n})\right] ,\ \ \pi ^{2}\in \varPi ^{2},\ x\in \mathbf {X}, \end{aligned}$$
(41)

where

$$\begin{aligned} \tilde{\varGamma }_{n}^{\varphi ^{1}}=\prod \limits _{k=0}^{n-1}\tilde{\alpha } _{\varphi ^{1}}(x_{k},b_{k},\xi _{k+1}),\ \ \tilde{\varGamma } _{0}^{\varphi ^{1}}=1. \end{aligned}$$

The next result is an adaptation of [23, Lemma 15] to our context. The proof follows by applying similar arguments and making the appropriate changes.

Lemma 1

For each \(x\in \mathbf {X}\), \(\varphi ^{2}\in \varPi _{S}^{2}\), and \(\pi ^{1}\in \varPi ^{1}\) there exists \(\varphi ^{1}\in \varPi _{M}^{1}\) such that

$$\begin{aligned} \tilde{V}_{\varphi ^{2}}(x,\pi ^{1})=\tilde{V}_{\varphi ^{2}}(x,\varphi ^{1}). \end{aligned}$$
(42)

Remark 1

Let us fix \(\varphi ^{1}\in \varPi _{S}^{1}.\) Then we can also prove that for each \(\pi ^{2}\in \varPi ^{2}\) there exists \(\varphi ^{2}\in \varPi _{M}^{2}\) such that

$$\begin{aligned} \tilde{V}_{\varphi ^{1}}(x,\pi ^{2})=\tilde{V}_{\varphi ^{1}}(x,\varphi ^{2}),\ \ x\in \mathbf {X}, \end{aligned}$$
(43)

where \(\tilde{V}_{\varphi ^{1}}\) is the performance index defined in (41).

Lemma 2

(a) For each \(\pi ^{1}\in \varPi ^{1}\) and \( \varphi ^{2}\in \varPi _{S}^{2}\), there exists \(\varphi ^{1}\in \varPi _{M}^{1}\) such that

$$\begin{aligned} \tilde{V}(x,\pi ^{1},\varphi ^{2})=\tilde{V}(x,\varphi ^{1},\varphi ^{2}),\ \ x\in \mathbf {X}. \end{aligned}$$
(44)

(b) For each \(\pi ^{2}\in \varPi ^{2}\) and \(\varphi ^{1}\in \varPi _{S}^{1}\), there exists \(\varphi ^{2}\in \varPi _{M}^{2}\) such that

$$\begin{aligned} \tilde{V}(x,\varphi ^{1},\pi ^{2})=\tilde{V}(x,\varphi ^{1},\varphi ^{2}),\ \ x\in \mathbf {X}. \end{aligned}$$
(45)

Proof

Let \(\pi ^{1}\in \varPi ^{1}\) and \(\varphi ^{2}\in \varPi _{S}^{2}\) be arbitrary strategies and consider the corresponding performance index \( \tilde{V}_{\varphi ^{2}}(x,\pi ^{1}),\)\(x\in \mathbf {X}\). From Lemma 1, there exists \(\varphi ^{1}\in \varPi _{M}^{1}\) such that \( \tilde{V}_{\varphi ^{2}}(x,\pi ^{1})=\tilde{V}_{\varphi ^{2}}(x,\varphi ^{1}), \)\(x\in \mathbf {X}.\) Hence, to obtain (44), it is enough to prove

$$\begin{aligned} \tilde{V}_{\varphi ^{2}}(x,\pi ^{1})=\tilde{V}(x,\pi ^{1},\varphi ^{2}),\ \ x\in \mathbf {X}, \end{aligned}$$
(46)

which is obtained by comparing the corresponding terms in the sums (36) and (11).

Indeed, for the first term, from (34)

$$\begin{aligned} E_{x}^{\pi ^{1}}r_{\varphi ^{2}}(x_{0},a_{0})= & {} \int _{\mathbf {A} }r_{\varphi ^{2}}(x,a_{0})\pi _{0}^{1}(da_{0}|x) \\= & {} \int _{\mathbf {A}}\int _{\mathbf {B}}r(x,a_{0},b_{0}) \varphi _{0}^{2}(db_{0}|x)\pi _{0}^{1}(da_{0}|x) \\= & {} E_{x}^{\pi ^{1},\varphi ^{2}}r(x_{0},a_{0},b_{0}). \end{aligned}$$

Furthermore, from (35)

$$\begin{aligned} E_{x}^{\pi ^{1}}\tilde{\varGamma }_{1}^{\varphi ^{2}}r_{\varphi ^{2}}(x_{1},a_{1})= & {} E_{x}^{\pi ^{1}}\tilde{\alpha } _{\varphi ^{2}}(x_{0},a_{0},\xi _{1})r_{\varphi ^{2}}(x_{1},a_{1}) \\= & {} \int \limits _{\mathbf {A}\times \mathbf {S}\times \mathbf {X}\times \mathbf {A} }\tilde{\alpha }_{\varphi ^{2}}(x,a_{0},\xi _{1})r_{\varphi ^{2}}(x_{1},a_{1})\pi _{1}^{1}(da_{1}|h_{1}) \\&\quad Q_{\varphi ^{2}}(\mathrm{d}x_{1}|x_{0},a_{0})\theta (\mathrm{d}\xi _{1})\pi _{0}^{1}(da_{0}|x) \\= & {} \int \limits _{\mathbf {A}\times \mathbf {B}\times \mathbf {S}\times \mathbf {X} \times \mathbf {A\times B}}\tilde{\alpha }(x,a_{0},b_{0}, \xi _{1})r(x_{1},a_{1},b_{1}) \\&\quad \varphi _{1}^{2}(db_{1}|x)\pi _{1}^{1}(da_{1}|h_{1})Q(\mathrm{d}x_{1}|x_{0},a_{0},b_{0})\theta (\mathrm{d}\xi _{1})\varphi _{0}^{2}(db_{0}|x)\pi _{0}^{1}(da_{0}|x) \\= & {} E_{x}^{\pi ^{1},\varphi ^{2}}\tilde{\alpha }(x_{0},a_{0},b_{0}, \xi _{1})r(x_{1},a_{1},b_{1}) \\= & {} E_{x}^{\pi ^{1},\varphi ^{2}}\tilde{\varGamma }_{1}r(x_{1},a_{1},b_{1}). \end{aligned}$$

An induction argument shows that

$$\begin{aligned} E_{x}^{\pi ^{1}}\tilde{\varGamma }_{n}^{\varphi ^{2}}r_{\varphi ^{2}}(x_{n},a_{n})=E_{x}^{\pi ^{1},\varphi ^{2}}\tilde{\varGamma } _{n}r(x_{n},a_{n},b_{n}),\ \ \ \forall n\in \mathbb {N}_{0}. \end{aligned}$$

Hence, from (36) and (11), we obtain (46).

Part (b) is proved similarly. \(\square \)

Proof of Proposition 1

Let \((\varphi _{*}^{1},\varphi _{*}^{2})\in \varPi _S^1\times \varPi _S^2\) be a pair that satisfies (13). From Lemma 2, we have, for each \(\varphi ^{2}\in \varPi _{S}^{2}, \)

$$\begin{aligned} \max _{\pi ^{1}\in \varPi ^{1}}\tilde{V}(x,\pi ^{1},\varphi ^{2})=\max _{\varphi ^{1}\in \varPi _{M}^{1}}\tilde{V}(x,\varphi ^{1},\varphi ^{2}),\ \ x\in \mathbf {X}, \end{aligned}$$
(47)

and for each \(\varphi ^{1}\in \varPi _{S}^{1}\)

$$\begin{aligned} \min _{\pi ^{2}\in \varPi ^{2}}\tilde{V}(x,\varphi ^{1},\pi ^{2})=\min _{\varphi ^{2}\in \varPi _{M}^{2}}\tilde{V}(x,\varphi ^{1},\varphi ^{2}),\ \ x\in \mathbf {X}. \end{aligned}$$
(48)

Now, from (13) and (47)

$$\begin{aligned} \tilde{V}(x,\varphi _{*}^{1},\varphi _{*}^{2})\ge & {} \max _{\varphi ^{1}\in \varPi _{M}^{1}}\tilde{V}(x,\varphi ^{1},\varphi _{*}^{2}) \nonumber \\= & {} \max _{\pi ^{1}\in \varPi ^{1}}\tilde{V}(x,\pi ^{1},\varphi _{*}^{2}) \nonumber \\\ge & {} \tilde{V}(x,\pi ^{1},\varphi _{*}^{2}),\ \ \ \forall \pi ^{1}\in \varPi ^{1},\ x\in \mathbf {X}. \end{aligned}$$
(49)

Similarly, from (13) and (48)

$$\begin{aligned} \tilde{V}(x,\varphi _{*}^{1},\varphi _{*}^{2})\le & {} \min _{\varphi ^{2}\in \varPi _{M}^{2}}\tilde{V}(x,\varphi _{*}^{1},\varphi ^{2}) \nonumber \\= & {} \min _{\pi ^{2}\in \varPi ^{2}}\tilde{V}(x,\varphi _{*}^{1},\pi ^{2}) \nonumber \\\le & {} \tilde{V}(x,\varphi _{*}^{1},\pi ^{2}),\ \ \ \forall \pi ^{2}\in \varPi ^{2},\ x\in \mathbf {X}. \end{aligned}$$
(50)

Therefore, (49) and (50) yield the desired inequality (12). This completes the proof of the proposition. \(\square \)

6.2 Proof of Proposition 2

Proof

The proof follows by applying similar arguments as those in the proof of Lemma 2. For instance, observe that for each \(x\in \mathbf {X}\) and \((\varphi ^{1},\varphi ^{2})\in \varPi _{M}^{1}\times \varPi _{M}^{2}\), from (14)

$$\begin{aligned} E_{x}^{\varphi ^{1},\varphi ^{2}}\tilde{\varGamma } _{1}r(x_{1},a_{1},b_{1})= & {} E_{x}^{\varphi ^{1},\varphi ^{2}}\tilde{\alpha } (x_{0},a_{0},b_{0},\xi _{1})r(x_{1},a_{1},b_{1})\\= & {} \int \limits _{\mathbf {A}\times \mathbf {B}\times \mathbf {S}\times \mathbf {X} \times \mathbf {A\times B}}\tilde{\alpha }(x_{0},a_{0},b_{0}, \xi _{1})r(x_{1},a_{1},b_{1}) \\&\left. \quad \varphi _{1}^{2}(db_{1}|x_{1}) \varphi _{1}^{1}(da_{1}|x_{1})Q(\mathrm{d}x_{1}|x_{0},a_{0},b_{0})\theta (\mathrm{d}\xi _{1})\varphi _{0}^{2}(db_{0}|x)\varphi _{0}^{1}(da_{0}|x)\right. \\= & {} \int \limits _{\mathbf {A}\times \mathbf {B}}\int \limits _{\mathbf {S}}\tilde{ \alpha }(x_{0},a_{0},b_{0},\xi _{1})\theta (\mathrm{d}\xi _{1})\int \limits _{\mathbf {X} }\int \limits _{\mathbf {A}\times \mathbf {B}}r(x_{1},a_{1},b_{1}) \\&\left. \quad \varphi _{1}^{2}(db_{1}|x_{1}) \varphi _{1}^{1}(da_{1}|x_{1})Q(\mathrm{d}x_{1}|x_{0},a_{0},b_{0})\varphi _{0}^{2}(db_{0}|x)\varphi _{0}^{1}(da_{0}|x)\right. \\= & {} E_{x}^{\varphi ^{1},\varphi ^{2}}\alpha _{\theta }(x_{0},a_{0},b_{0})r(x_{1},a_{1},b_{1}) \\= & {} E_{x}^{\varphi ^{1},\varphi ^{2}}\varGamma _{1}r(x_{1},a_{1},b_{1}). \end{aligned}$$

It is shown, by induction, that

$$\begin{aligned} E_{x}^{\varphi ^{1},\varphi ^{2}}\tilde{\varGamma } _{n}r(x_{n},a_{n},b_{n})=E_{x}^{\varphi ^{1},\varphi ^{2}}\varGamma _{n}r(x_{n},a_{n},b_{n}),\ \ \ \forall n\in \mathbb {N}_{0}. \end{aligned}$$

Therefore, from (11) and (16), we get (17). \(\square \)

6.3 Proof of Theorem 1

Before presenting the proof, we establish some important facts on minimax theorems and the W-norm, as well as on the Shapley operator (23). All these facts are summarized in the following remark.

Remark 2

(a) Provided that Assumption 1 holds, for \(u\in {\mathbb {B}}_{W}\) and \((x,a,b)\in \mathbb {K},\) \(\hat{T}(u,x,\cdot ,b)\) is usc on A(x) and \(\hat{T}(u,x,a,\cdot )\) is lsc on B(x). Hence, by applying well-known properties of weak convergence of measures on the sets \(\mathbb {A}(x)\) and \(\mathbb {B}(x)\) (see, e.g., Theorem 2.8.1 in [1]), we can prove that the function \(\hat{T}(u,x,\cdot ,\varphi ^{2})\) is usc on \(\mathbb {A}(x)\) while \(\hat{T}(u,x,\varphi ^{1},\cdot )\) is lsc on \(\mathbb {B}(x)\). In addition, since \(\hat{T}(u,x,\varphi ^{1},\varphi ^{2})\) is concave in \(\varphi ^{1}\) and convex in \(\varphi ^{2},\) Fan’s well-known minimax theorem implies that we can interchange \(\inf \) and \(\sup \) in (23), i.e.,

$$\begin{aligned} Tu(x)=\sup _{\varphi ^{1}\in \mathbb {A}(x)}\inf _{\varphi ^{2}\in \mathbb {B} (x)}\hat{T}(u,x,\varphi ^{1}(x),\varphi ^{2}(x)),\ \ x\in X. \end{aligned}$$
(51)

(b) Moreover, suitable measurable selection theorems yield the existence of \(\varphi _{*}^{1}(x)\in \mathbb {A}(x)\) and \(\varphi _{*}^{2}(x)\in \mathbb {B}(x)\) such that (see, e.g., Lemma 4.3 in [29])

$$\begin{aligned} Tu(x)= & {} \hat{T}(u,x,\varphi _{*}^{1}(x),\varphi _{*}^{2}(x)) \\= & {} \max _{\varphi ^{1}\in \mathbb {A}(x)}\hat{T}(u,x,\varphi ^{1},\varphi _{*}^{2})=\min _{\varphi ^{2}\in \mathbb {B}(x)}\hat{T} (u,x,\varphi _{*}^{1},\varphi ^{2}),\ \ x\in \mathbf {X}. \end{aligned}$$

(c) For \(u,v\in {\mathbb {B}}_{W}\), (21)–(23) and properties of the W-norm imply

$$\begin{aligned} \left| Tu(x)-Tv(x)\right|\le & {} \sup _{a\in A(x)}\sup _{b\in B(x)}\alpha _{\theta }(x,a,b)\int \limits _{\mathbf {X}}\left| u\left( y\right) -v\left( y\right) \right| Q(\mathrm{d}y\mid x,a,b) \\\le & {} \alpha ^{*}\left\| u-v\right\| _{W}\sup _{a\in A(x)}\sup _{b\in B(x)}\int \limits _{\mathbf {X}}W(y)Q(\mathrm{d}y\mid x,a,b) \\\le & {} \alpha ^{*}\beta \left\| u-v\right\| _{W}W(x), \end{aligned}$$

which in turn yields

$$\begin{aligned} \left\| Tu-Tv\right\| _{W}\le \alpha ^{*}\beta \left\| u-v\right\| _{W}. \end{aligned}$$

Hence, T is a contraction operator on \(\mathbb {B}_{W}\) with modulus \( \alpha ^{*}\beta <1.\) Similarly, the operator

$$\begin{aligned} T_{\varphi ^{1}\varphi ^{2}}u(x):=\hat{T}(u,x,\varphi ^{1}(x),\varphi ^{2}(x)),\ \ x\in \mathbf {X}, \end{aligned}$$
(52)

defined for a pair of stationary strategies \((\varphi ^{1},\varphi ^{2})\in \varPi _{S}^{1}\times \varPi _{S}^{2}\), is a contraction operator on \( \mathbb {B}_{W}\) with modulus \(\alpha ^{*}\beta \).

(d) Thus, there exist unique fixed points v and \(v_{\varphi ^{1}\varphi ^{2}}\) in \(\mathbb {B}_{W}\) of operators T and \(T_{\varphi ^{1}\varphi ^{2}} \), respectively, that is

$$\begin{aligned} Tv(x)=v(x)\ \ \text { and } \ \ T_{\varphi ^{1}\varphi ^{2}}v_{\varphi ^{1}\varphi ^{2}}(x)=v_{\varphi ^{1}\varphi ^{2}}(x),\ \ \ x\in \mathbf {X}. \end{aligned}$$
(53)

(e) Finally, we also apply the following properties of the weighted function W. From (22), for each \(\ x\in \mathbf {X}\),\(\ \left( \pi ^{1},\pi ^{2}\right) \in \varPi ^{1}\times \varPi ^{2},\) and \(n\in \mathbb {N}_{0}\),

$$\begin{aligned} E_{x}^{\pi ^{1},\pi ^{2}}\left[ W\left( x_{n+1}\right) \right] \le \beta E_{x}^{\pi ^{1},\pi ^{2}}\left[ W\left( x_{n}\right) \right] . \end{aligned}$$

Iteration of this inequality yields

$$\begin{aligned} E_{x}^{\pi ^{1},\pi ^{2}}\left[ W\left( x_{n+1}\right) \right] \le \beta ^{n+1}W\left( x\right) ,\ \ \ x\in \mathbf {X},\ n\in \mathbb {N}_{0}. \end{aligned}$$
(54)

Furthermore, from (54) and (15), for each \(u\in \mathbb {B} _{W},\)\(x\in \mathbf {X}\),\(\ \left( \pi ^{1},\pi ^{2}\right) \in \varPi ^{1}\times \varPi ^{2},\) and \(n\in \mathbb {N}_{0},\)

$$\begin{aligned} \left| E_{x}^{\pi ^{1},\pi ^{2}}\varGamma _{n}u(x_{n})\right|\le & {} \left( \alpha ^{*}\right) ^{n}\left\| u\right\| _{W}E_{x}^{\pi ^{1},\pi ^{2}}\left[ W\left( x_{n}\right) \right] \\\le & {} \left( \beta \alpha ^{*}\right) ^{n}\left\| u\right\| _{W}W\left( x\right) . \end{aligned}$$

Therefore,

$$\begin{aligned} \lim _{n\rightarrow \infty }E_{x}^{\pi ^{1},\pi ^{2}}\varGamma _{n}u(x_{n})=0,\ \ x\in \mathbf {X},\ \left( \pi ^{1},\pi ^{2}\right) \in \varPi ^{1}\times \varPi ^{2}. \end{aligned}$$
(55)

Proof of Theorem 1

From (23) and (51)

$$\begin{aligned} v(x)= & {} Tv(x)=\sup _{\varphi ^{1}\in \mathbb {A}(x)}\inf _{\varphi ^{2}\in \mathbb {B}(x)}\hat{T}(v,x,\varphi ^{1}(x),\varphi ^{2}(x)) \\= & {} \inf _{\varphi ^{2}\in \mathbb {B}(x)}\sup _{\varphi ^{1}\in \mathbb {A}(x)} \hat{T}(v,x,\varphi ^{1}(x),\varphi ^{2}(x)),\ \ x\in \mathbf {X}, \end{aligned}$$

where v is the fixed point of T (see Remark 2 (d)). In addition, from Remark 2 (b), there exists a pair of stationary strategies \((\varphi _{*}^{1},\varphi _{*}^{2})\in \varPi _{S}^{1}\times \varPi _{S}^{2}\) such that

$$\begin{aligned} v(x)= & {} \hat{T}(v,x,\varphi _{*}^{1}(x),\varphi _{*}^{2}(x))=T_{\varphi _{*}^{1}\varphi _{*}^{2}}v(x) \end{aligned}$$
(56)
$$\begin{aligned}= & {} \max _{\varphi ^{1}\in \mathbb {A}(x)}\hat{T}(v,x,\varphi ^{1}(x),\varphi _{*}^{2}(x)) \end{aligned}$$
(57)
$$\begin{aligned}= & {} \min _{\varphi ^{2}\in \mathbb {B}(x)}\hat{T}(v,x,\varphi _{*}^{1}(x),\varphi ^{2}(x)),\ \ x\in \mathbf {X}. \end{aligned}$$
(58)

On the other hand, \(V(\cdot ,\varphi _{*}^{1},\varphi _{*}^{2})\) is the unique fixed point of \(T_{\varphi _{*}^{1}\varphi _{*}^{2}}\) belonging to \(\mathbb {B}_{W},\) i.e.,

$$\begin{aligned} v_{\varphi _{*}^{1}\varphi _{*}^{2}}(\cdot )=V(\cdot ,\varphi _{*}^{1},\varphi _{*}^{2}). \end{aligned}$$
(59)

Indeed, from (53), (24), and (52)

$$\begin{aligned} v_{\varphi _{*}^{1}\varphi _{*}^{2}}(x)=\int _{B}\int _{A}\left[ r(x,a,b)+\alpha _{\theta }(x,a,b)\int _{\mathbf {X}}v_{\varphi _{*}^{1}\varphi _{*}^{2}}(y)Q(\mathrm{d}y|x,a,b)\right] \varphi _{*}^{1}(da | x)\varphi _*^{2}(db | x), \end{aligned}$$

for every x in \(\mathbf {X}\). Iterating this equation, we obtain

$$\begin{aligned} v_{\varphi _{*}^{1}\varphi _{*}^{2}}(x)=E_{x}^{\varphi _{*}^{1},\varphi _{*}^{2}}\sum \limits _{n=0}^{m-1}\varGamma _{n}r(x_{n},a_{n},b_{n})+E_{x}^{\varphi _{*}^{1},\varphi _{*}^{2}}\varGamma _{m}v_{\varphi _{*}^{1}\varphi _{*}^{2}}(x_{m}). \end{aligned}$$

Now, letting \(m\rightarrow \infty \), from (55) and (16), we obtain (59).

Since \(V(\cdot ,\varphi _*^{1},\varphi _{*}^{2})\) is the unique fixed point of \(T_{\varphi _*^1\varphi _*^2},\) (56) implies that \( v(x)=V(x,\varphi _*^1,\varphi _*^2)\), \(x\in \mathbf {X}.\) Therefore, considering (57) and (58), Theorem 1 will be proved if we show that

$$\begin{aligned} V(x,\varphi ^{1},\varphi _{*}^{2})\le V(x,\varphi _{*}^{1},\varphi _{*}^{2})\le V(x,\varphi _{*}^{1},\varphi ^{2}),\ \ \forall (\varphi ^{1},\varphi ^{2})\in \varPi _{M}^{1}\times \varPi _{M}^{2},\ x\in \mathbf {X}. \end{aligned}$$
(60)

To prove the first inequality in (60), let \(\varphi ^{1}\in \varPi _{M}^{1}\) be an arbitrary Markov strategy for player 1. Then, for all \(n\in \mathbb {N}_{0},\)

$$\begin{aligned}&E_{x}^{\varphi ^{1},\varphi _{*}^{2}}\left[ \varGamma _{n+1}V(x_{n+1}, \varphi _{*}^{1},\varphi _{*}^{2})|h_{n},a_{n},b_{n}\right] =\varGamma _{n+1}E_{x}^{\varphi ^{1},\varphi _{*}^{2}}\left[ V(x_{n+1},\varphi _{*}^{1},\varphi _{*}^{2})|h_{n},a_{n},b_{n}\right] \nonumber \\&\quad =\varGamma _{n+1}\int \limits _{\mathbf {X}}V(y,\varphi _{*}^{1},\varphi _{*}^{2})Q(\mathrm{d}y|x_{n},\varphi _{n}^{1},\varphi _{*}^{2})\nonumber \\&\quad =\varGamma _{n}\left\{ \alpha _{\theta }(x_{n},\varphi _{n}^{1},\varphi _{*}^{2})\int \limits _{\mathbf {X}}V(y,\varphi _{*}^{1},\varphi _{*}^{2})Q(\mathrm{d}y|x_{n},\varphi _{n}^{1},\varphi _{*}^{2})\right. \nonumber \\&\qquad \left. +\,r(x_{n},\varphi _{n}^{1},\varphi _{*}^{2})-r(x_{n},\varphi _{n}^{1},\varphi _{*}^{2})\right\} \nonumber \\&\quad \le \varGamma _{n}\left\{ \sup _{\varphi ^{1}\in \mathbb {A}(x_{n})}\hat{T}(v,x_{n},\varphi ^{1}(x_{n}),\varphi _{*}^{2}(x_{n}))-r(x_{n},\varphi _{n}^{1},\varphi _{*}^{2})\right\} \nonumber \\&\quad =\varGamma _{n}\left\{ v(x_{n})-r(x_{n},\varphi _{n}^{1},\varphi _{*}^{2})\right\} \nonumber \\&\quad = \varGamma _{n}\left\{ V(x_{n},\varphi _{*}^{1},\varphi _{*}^{2})-r(x_{n},\varphi _{n}^{1},\varphi _{*}^{2})\right\} , \end{aligned}$$
(61)

where the last two equalities come from (56) and (57). Now, from (61), for all \(n\in \mathbb {N}_{0},\)

$$\begin{aligned} \varGamma _{n}V(x_{n},\varphi _{*}^{1},\varphi _{*}^{2})-E_{x}^{\varphi ^{1},\varphi _{*}^{2}}\left[ \varGamma _{n+1}V(x_{n+1},\varphi _{*}^{1},\varphi _{*}^{2})|h_{n},a_{n},b_{n} \right] \ge \varGamma _{n}r(x_{n},\varphi _{n}^{1},\varphi _{*}^{2}), \end{aligned}$$

which, by taking the expectation \(E_{x}^{\varphi ^{1},\varphi _{*}^{2}}\) and summing over \(n=0,1,\ldots ,m-1,\) \(m>0,\) implies

$$\begin{aligned} V(x,\varphi _{*}^{1},\varphi _{*}^{2})-E_{x}^{\varphi ^{1},\varphi _{*}^{2}}\left[ \varGamma _{m}V(x_{m},\varphi _{*}^{1},\varphi _{*}^{2})\right] \ge E_{x}^{\varphi ^{1},\varphi _{*}^{2}}\sum \limits _{n=0}^{m-1}\varGamma _{n}r(x_{n},a_{n},b_{n}). \end{aligned}$$

Letting \(m\rightarrow \infty \), from (16) and (55), we get

$$\begin{aligned} V(x,\varphi _{*}^{1},\varphi _{*}^{2})\ge V(x,\varphi ^{1},\varphi _{*}^{2}),\ \ \ x\in \mathbf {X}, \end{aligned}$$

that is, the first inequality in (60) holds. The second inequality is proved similarly. Hence, the proof of Theorem 1 is completed. \(\square \)