Abstract
This work deals with a class of discrete-time zero-sum Markov games under a discounted optimality criterion with random state-actions-dependent discount factors of the form \(\tilde{\alpha }(x_{n},a_{n},b_{n},\xi _{n+1})\), where \(x_{n}, a_{n}, b_{n}\), and \(\xi _{n+1}\) are the state, the actions of the players, and a random disturbance at time n, respectively, all taking values in Borel spaces. Assuming a possibly unbounded payoff, we prove the existence of a value of the game as well as a stationary pair of optimal strategies.
1 Introduction
The paper deals with a class of discrete-time zero-sum discounted Markov games with nonconstant discount factors of the form
where \(x_{n}\) is the state of the game, \(a_{n}\) and \(b_{n}\) represent the actions of players 1 and 2, respectively, at time n, and \(\left\{ \xi _{n}\right\} \) is a sequence of independent and identically distributed random variables with common distribution \(\theta \) representing a random disturbance at each time. The one-stage payoff (or utility) \( r(x_{n},a_{n},b_{n})\) accumulates in an infinite horizon by means of the functional
which defines the total expected discounted payoff with random discount factors depending on the state and the actions. Thus, in this scenario, our main objective is to prove the existence of a value of the game and a pair of optimal strategies.
Among the optimality criteria for zero-sum and nonzero-sum Markov games, the discounted payoff with a constant discount factor is the best understood. It has been analyzed under several approaches, for instance, dynamic programming via Shapley's equation, linear programming, and estimation and control procedures (see, e.g., [9, 18,19,20, 24,25,26,27,28, 36, 37]), as well as Nash equilibrium [5, 13, 31]. Its main applications are in economic and financial models where the discount factor is a function of the interest rate. Hence, a constant discount factor could be restrictive in problems where such an interest rate is random. It is in these situations that the need arises to consider a function such as (1) to represent the discount factor.
Even though the usual applications of the discounted criterion are in economic and financial models, there are other problems where a discount factor such as (1) appears naturally. For instance, consider a game that is played as follows. At stage n, when the game is in state \(x_{n}\) and once the players choose the actions \((a_{n},b_{n}),\) player 2 pays \(r(x_{n},a_{n},b_{n})\) to player 1. Then there is a positive probability, influenced by \((x_{n},a_{n},b_{n}),\) that the game stops; otherwise, the game moves to a new state \(x_{n+1}\) according to a transition law, and the process is repeated. Under these circumstances, the performance of the zero-sum game is measured by the total expected payoff criterion with a random horizon \(\tau \) of the form
However, we prove that (3) can be written as (2) with
where \(\gamma (x_{n},a_{n},b_{n})\) is the probability the game stops at stage n.
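This equivalence is easy to check numerically in the simplest case. The sketch below is our own illustration (the source contains no code): a constant stop probability \(\gamma =0.3\) and constant payoff \(r=1\) are hypothetical choices, under which a Monte Carlo estimate of the random-horizon payoff (3) should match the geometric series \(r/(1-\alpha )\) with \(\alpha =1-\gamma \) obtained from (2):

```python
import numpy as np

# Hypothetical constants: the game stops with probability gamma = 0.3 at each
# stage (so the "discount factor" is alpha = 1 - gamma = 0.7) and the stage
# payoff is constant, r = 1.
rng = np.random.default_rng(0)
gamma, r = 0.3, 1.0
alpha = 1.0 - gamma

def random_horizon_payoff(rng):
    """One sample of the total (nondiscounted) payoff up to the random horizon."""
    total = r                        # the stage-0 payoff is always collected
    while rng.random() >= gamma:     # the game continues with probability alpha
        total += r
    return total

mc_mean = np.mean([random_horizon_payoff(rng) for _ in range(200_000)])
discounted = r / (1.0 - alpha)       # geometric series: sum of r * alpha^n, n >= 0
print(mc_mean, discounted)           # both close to 10/3
```

The agreement of the two numbers reflects the identity between (3) and (2) in this constant-probability special case.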
Another example with random state-actions-dependent discount factors is the following zero-sum semi-Markov game. Let \(\left\{ \xi _{n}\right\} \) be a sequence of independent and identically distributed random variables with exponential distribution representing the sojourn (or holding) times. In addition, let \(\gamma (x_{n},a_{n},b_{n})\) be the discount factor imposed at stage n. Then, by defining \(\tilde{\alpha }(x_{n},a_{n},b_{n},\xi _{n+1}):= \exp (-\gamma (x_{n},a_{n},b_{n})\xi _{n+1})\), the total expected discounted payoff takes the form (2).
Typically, the existence of optimal strategies in zero-sum Markov games is studied via Shapley's equation. Such an approach has the advantage that it allows one to apply the nice contractive properties of the minimax (maximin) operator. In our case, since the discount factor is a nonconstant function \(\tilde{\alpha }\) which depends explicitly on the random disturbance process \(\left\{ \xi _{n}\right\} \), it is not possible to obtain, at least directly, a Shapley-like equation. To gain the advantages that such an equation entails, we first need to establish a representation of the performance index related to (2) in terms of the common distribution \(\theta \) of the random variables \(\left\{ \xi _{n}\right\} \). However, due to measurability issues, such a representation is only possible when the players are restricted to Markov strategies, not arbitrary strategies (see Proposition 2). Taking this fact into account, we then prove the sufficiency of Markov (stationary) strategies in the sense that if a pair of stationary strategies is optimal with respect to the Markov strategies, then it is optimal with respect to all strategies (see Proposition 1). It is worth remarking that this is a well-known result in zero-sum games under the standard discounted criterion (see, e.g., [21, 30]). In our game model, because the discount factor is a function of the game process \(\{(x_{n},a_{n},b_{n},\xi _{n+1})\},\) such a result is not a direct consequence of [21], and therefore some modifications must be made. Hence, for completeness, we have included its proof. Once the sufficiency of Markov strategies is ensured, we establish, in Theorem 1, the existence of a value of the game and an optimal pair of stationary strategies.
Problems with nonconstant discount factors have been extensively studied for Markov decision processes from several points of view (see, e.g., [3, 8, 10,11,12, 23, 34, 38]). In particular, control processes with random state-action-dependent discount factors are analyzed in [23], which is close to our context. Nonetheless, in addition to the usual difficulties of dealing with stochastic games, proving Proposition 1 requires different arguments from those followed in the single-controller case.
The paper is organized as follows. In Sect. 2 we present the game model we deal with. Next, in Sect. 3, we introduce the optimality criterion and the main properties related to the sufficiency of Markov strategies. The existence of the value of the game and of the pair of optimal strategies is established in Sect. 4, whereas the proofs are deferred to Sect. 6. To illustrate our results, in Sect. 5 we present some examples with nonconstant discount factors. The first is a financial model where the discount factor is a function of a random interest rate. Next we present an example of a game with random horizon and nondiscounted payoff criterion, which is equivalent to a game where the discount factor is a state-actions-dependent function representing the probability of continuing the game. The third example is a semi-Markov game where the payoffs are exponentially discounted according to a random state-actions-dependent discount factor. We also present some insights into the fulfillment of our hypotheses.
Notation
As usual, \(\mathbb {N}\) (respectively, \(\mathbb {N}_{0}\)) denotes the set of positive (resp. nonnegative) integers. Given a Borel space X (that is, a Borel subset of a complete and separable metric space), its Borel sigma-algebra is denoted by \(\mathcal {B}(X)\), and "measurable," for either sets or functions, means "Borel measurable." Let X and Y be Borel spaces. A stochastic kernel \(\gamma (\mathrm{d}x\mid y)\) on X given Y is a function such that \(\gamma (\cdot \mid y)\) is a probability measure on X for each fixed \(y\in Y,\) and \(\gamma (B\mid \cdot )\) is a measurable function on Y for each fixed \(B\in \mathcal {B}(X)\). The space of probability measures on X is denoted by \(\mathbb {P}(X)\) and is endowed with the weak topology. In addition, we denote by \(\mathbb {P}(X\mid Y)\) the family of stochastic kernels on X given Y.
2 The Game Model
A zero-sum Markov game model with random state-actions-dependent discount factors is defined by the collection
satisfying the following conditions. The state space \(\mathbf {X}\) and the action sets \(\mathbf {A}\) and \(\mathbf {B}\) for players 1 and 2, respectively, as well as the discount-factor disturbance space \(\mathbf {S}\), are assumed to be Borel spaces. The constraint sets \(\mathbb {K}_{\mathbf {A}}\) and \(\mathbb {K}_{\mathbf {B}}\) are Borel subsets of \(\mathbf {X}\times \mathbf {A}\) and \(\mathbf {X}\times \mathbf {B},\) respectively. For each \(x\in \mathbf {X},\) the x-sections
and
represent the admissible actions or controls sets for players 1 and 2, respectively, and the set
of admissible state-actions triplets is a Borel subset of \(\mathbf {X}\times \mathbf {A}\times \mathbf {B}\). The transition law \(Q(\cdot |x,a,b)\) is a stochastic kernel on \(\mathbf {X}\) given \(\mathbb {K},\) and \(\tilde{\alpha }:\mathbb {K}\times \mathbf {S}\rightarrow (0,1)\) is a measurable function which gives the discount factor \(\tilde{\alpha }(x_{n},a_{n},b_{n},\xi _{n+1})\) at stage \(n\in \mathbb {N}\), where \(\left\{ \xi _{n}\right\} \) is a sequence of independent and identically distributed (i.i.d.) random variables defined on the probability space \((\varOmega ,\mathcal {F},P)\), taking values in \(\mathbf {S}\), with common distribution \(\theta \in \mathbb {P}(\mathbf {S})\). That is,
Finally, \(r\) is a real-valued measurable function on \(\mathbb {K}\) that represents the one-stage payoff.
The game is played as follows. At the initial state \(x_{0}\in \mathbf {X},\) the players independently choose actions \(a_{0}\in A(x_{0})\) and \(b_{0}\in B(x_{0}).\) Then player 1 receives a payoff \(r(x_{0},a_{0},b_{0})\) from player 2, and the game jumps to a new state \(x_{1}\) according to the transition law \(Q(\cdot |x_{0},a_{0},b_{0})\), and the random disturbance \( \xi _{1}\) comes in. Once the system is in state \(x_{1}\), the players select actions \(a_{1}\in A(x_{1})\) and \(b_{1}\in B(x_{1})\) and player 1 receives a discounted payoff \(\tilde{\alpha }(x_{0},a_{0},b_{0},\xi _{1})r(x_{1},a_{1},b_{1})\) from player 2. Next the system moves to a state \(x_{2}\), and the process is repeated over and over again. In general, at stage \(n\in \mathbb {N},\) player 1 receives from player 2 a discounted payoff of the form
where
Thus, the goal of player 1 (player 2, resp.) is to maximize (minimize, resp.) the total expected discounted payoff defined by the accumulation of the one-stage payoffs (5) over an infinite horizon.
The actions chosen by players at each stage are selected by rules known as strategies which are defined as follows.
Let \(\mathbb {H}_{0}:=\mathbf {X}\) and \(\mathbb {H}_{n}:=\mathbb {K}\times \mathbf {S}\times \mathbb {H}_{n-1}\) for \(n\in \mathbb {N}.\) For each \(n\in \mathbb {N}_{0},\) an element \(h_{n}\in \mathbb {H}_{n}\) takes the form
which represents the history of the game up to time n. A strategy for player 1 is a sequence \(\pi ^{1}=\{\pi _{n}^{1}\}\) of stochastic kernels \(\pi _{n}^{1}\in \mathbb {P}(\mathbf {A}|\mathbb {H}_{n})\) such that \(\pi _{n}^{1}(A(x_{n})|h_{n})=1\) for every \(h_{n}\in \mathbb {H}_{n},\) \(n\in \mathbb {N}_{0}.\) We denote by \(\varPi ^{1}\) the family of all strategies for player 1.
For each \(x\in \mathbf {X},\) let \(\mathbb {A}(x):=\mathbb {P}(A(x))\) and \(\mathbb {B}(x):=\mathbb {P}(B(x)).\) We denote by \(\varPhi ^{1}\) the class of all stochastic kernels \(\varphi ^{1}\in \mathbb {P}(\mathbf {A}|\mathbf {X})\) such that \(\varphi ^{1}(\cdot |x)\in \mathbb {A}(x)\) for \(x\in \mathbf {X},\) and by \(\varPhi ^{2}\) the class of all stochastic kernels \(\varphi ^{2}\in \mathbb {P}(\mathbf {B}|\mathbf {X})\) such that \(\varphi ^{2}(\cdot |x)\in \mathbb {B}(x)\) for \(x\in \mathbf {X}.\) A strategy \(\pi ^{1}=\{\pi _{n}^{1}\}\in \varPi ^{1}\) is called a Markov strategy if there exist \(\varphi _{n}^{1}\in \varPhi ^{1}\) such that \(\pi _{n}^{1}(\cdot |h_{n})=\varphi _{n}^{1}(\cdot |x_{n})\) for every \(h_{n}\) and \(n\in \mathbb {N}_{0}.\) The class of all Markov strategies for player 1 is denoted by \(\varPi _{M}^{1}\). A Markov strategy is called stationary if \(\varphi _{n}^{1}=\varphi ^{1}\) for every \(n\in \mathbb {N}_{0}\) and some stochastic kernel \(\varphi ^{1}\in \varPhi ^{1}\). The set of stationary strategies for player 1 is denoted by \(\varPi _{S}^{1}\). The sets \(\varPi ^{2},\) \(\varPi _{M}^{2},\) and \(\varPi _{S}^{2}\) for player 2 are defined similarly.
According to the previous definitions, and by using a standard convention, a Markov strategy \(\varphi ^{i}\in \varPi _{M}^{i}\) takes the form \( \varphi ^{i}=\left\{ \varphi _{0}^{i},\varphi _{1}^{i},\ldots \right\} =:\left\{ \varphi _{n}^{i}\right\} \), for \(i=1,2.\) In particular, for stationary strategies, we have \(\varphi ^{i}=\left\{ \varphi ^{i},\varphi ^{i},\ldots \right\} =\left\{ \varphi ^{i}\right\} \).
The game process. Let \((\varOmega ^{\prime },\mathcal {F} ^{\prime })\) be the measurable space consisting of the sample space \( \varOmega ^{\prime }=(\mathbb {K}\times \mathbf {S})^{\infty }\) and its product \(\sigma \)-algebra \(\mathcal {F}^{\prime }.\) Following standard arguments (see, e.g., [6]), we have that for each pair of strategies \((\pi ^{1},\pi ^{2})\in \varPi ^{1}\times \varPi ^{2}\) and initial state \(x_{0}=x\in \mathbf {X},\) there exists a unique probability measure \(P_{x}^{\pi ^{1},\pi ^{2}}\) and a stochastic process \(\left\{ (x_{n},a_{n},b_{n},\xi _{n+1})\right\} \), where \(x_{n},\)\(a_{n},\)\(b_{n},\) and \(\xi _{n+1}\) represent the state, the actions of players, and the random disturbance in the discount factor, respectively, at stage \(n\in \mathbb {N}_{0},\) satisfying
where \(\delta _{x}(\cdot )\) is the Dirac measure concentrated at x. We denote by \(E_{x}^{\pi ^{1},\pi ^{2}}\) the expectation operator with respect to \(P_{x}^{\pi ^{1},\pi ^{2}}.\) The stochastic process \(\left\{ x_{n}\right\} \) defined on \((\varOmega ^{\prime },\mathcal {F}^{\prime },P_{x}^{\pi ^{1},\pi ^{2}})\) is called the game process.
3 The Optimality Criterion
According to (5) and (6), given the initial state \( x_{0}=x\in \mathbf {X}\) and a pair of strategies \((\pi ^{1},\pi ^{2})\in \varPi ^{1}\times \varPi ^{2}\), the total expected discounted payoff—with random state-actions-dependent discount factors—is defined as
The lower and the upper value of the game are
respectively, for each initial state \(x\in \mathbf {X}\). Of course, \(U(\cdot )\ge L(\cdot )\); however, if \(U(\cdot )=L(\cdot )\) holds, then the common function is called the value of the game and is denoted by \(V^{*}(\cdot ).\)
Suppose the game has a value \(V^{*}\). A strategy \(\pi _{*}^{1}\in \varPi ^{1}\) is said to be optimal for player 1 if
Similarly, a strategy \(\pi _{*}^{2}\in \varPi ^{2}\) is said to be optimal for player 2 if
Hence, the pair \((\pi _{*}^{1},\pi _{*}^{2})\) is called an optimal pair of strategies. Observe that \((\pi _{*}^{1},\pi _{*}^{2}) \in \varPi ^{1}\times \varPi ^{2}\) is an optimal pair if and only if
An important fact in our analysis on the existence of a value of the game is the sufficiency of Markov strategies in the following sense.
Proposition 1
Let \((\varphi _{*}^{1},\varphi _{*}^{2})\in \varPi _{S}^{1}\times \varPi _{S}^{2}\) be an optimal pair with respect to the Markov strategies, i.e.,
Then \((\varphi _{*}^{1},\varphi _{*}^{2})\) is an optimal pair with respect to all strategies, i.e., (12) holds.
By virtue of Proposition 1 we can restrict our study to the set of Markov strategies. Furthermore, over the Markov strategies, we can express the performance index (11) in terms of the distribution \(\theta \) of the random disturbances in the discount factor. We proceed to establish this fact precisely.
We define the mean discount factor function \(\alpha _\theta :\mathbb {K} \rightarrow (0,1)\) as
and denote
For each pair of strategies \((\pi ^{1},\pi ^{2})\in \varPi ^{1}\times \varPi ^{2}\) and initial state \(x\in \mathbf {X}\), we define
Proposition 2
For each initial state \(x\in \mathbf {X}\) and pair of strategies \((\varphi ^{1},\varphi ^{2})\in \varPi _{M}^{1}\times \varPi _{M}^{2} \),
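As a concrete illustration of the mean discount factor \(\alpha _\theta \), take the semi-Markov case \(\tilde{\alpha }(x,a,b,s)=\exp (-\gamma (x,a,b)s)\) with \(\theta \) exponential. The sketch below (our own; the rate \(\lambda \) and the value \(\gamma (x,a,b)=\gamma _{0}\) are hypothetical constants) integrates \(\tilde{\alpha }\) against \(\theta \) numerically and recovers the closed form \(\lambda /(\lambda +\gamma _{0})\):

```python
import numpy as np

# alpha_theta(x,a,b) = integral of alpha_tilde(x,a,b,s) over theta(ds).
# Semi-Markov case: alpha_tilde = exp(-gamma0 * s), theta ~ Exp(lam).
lam, gamma0 = 2.0, 0.5              # hypothetical rate of theta and discount rate

ds = 1e-4
s = np.arange(ds / 2.0, 60.0, ds)   # midpoint grid truncating S = (0, infinity)
alpha_theta = np.sum(np.exp(-gamma0 * s) * lam * np.exp(-lam * s)) * ds

closed_form = lam / (lam + gamma0)  # exact value of the integral: 0.8
print(alpha_theta, closed_form)
```

Note that the numerical value lies in (0, 1), as required of \(\alpha _\theta \) on \(\mathbb {K}\).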
4 Existence of Optimal Strategies
To ease notation, the probability measures \(\varphi ^{1}(\cdot |x)\in \mathbb { A}(x)\) and \(\varphi ^{2}(\cdot |x)\in \mathbb {B}(x)\), \(x\in \mathbf {X},\) are written \(\varphi ^{i}(x)=\varphi ^{i}(\cdot |x),\)\(i=1,2.\) In addition, for a measurable function \(u:\mathbb {K}\rightarrow \mathbb {R} \),
For instance, for \(x\in \mathbf {X}\), we have
and
The existence of a value of the game as well as a pair of optimal strategies is analyzed under the following conditions.
Assumption 1
The game model (4) satisfies the following:
(a) For each \(x\in \mathbf {X}\), the sets A(x) and B(x) are compact.
(b) For each \((x,a,b)\in \mathbb {K},\) \(r(x,\cdot ,b)\) is upper semicontinuous (usc) on A(x), and \(r(x,a,\cdot )\) is lower semicontinuous (lsc) on B(x). Moreover, there exist a constant \(r_{0}>0\) and a function \(W:\mathbf {X}\rightarrow [1,\infty )\) such that
and the functions
are continuous on A(x) and B(x), respectively.
(c) For each \((x,a,b)\in \mathbb {K}\) and each bounded measurable function u on \(\mathbf {X},\) the functions
are continuous on A(x) and B(x), respectively.
(d) The function \(\tilde{\alpha }(x,a,b,s)\) is continuous on \( \mathbb {K}\times \mathbf {S},\) and
(e) There exists a positive constant \(\beta \) such that \( 1\le \beta <(\alpha ^{*})^{-1},\) and for every \((x,a,b)\in \mathbb {K}\)
For each measurable function \(u:\mathbf {X}\rightarrow {\mathbb {R}}\), we define the W-norm as
and let \({\mathbb {B}}_{W}\) be Banach space of all real-valued measurable functions defined on \(\mathbf {X}\) with finite W-norm. It is easy to prove that under Assumption 1, the Shapley operator
maps \({\mathbb {B}}_{W}\) into itself, where
Moreover, as will be established later, the interchange of inf and sup in (23) holds.
We now state our main results as follows.
Theorem 1
Suppose that Assumption 1 holds. Then
(a) the game \(\mathcal {GM}\) (4) has a value \(V^{*}\in \mathbb {B}_{W}\);
(b) the value \(V^{*}\) is the unique function in \(\mathbb {B}_{W}\) such that \(TV^{*}=V^{*}\); and
(c) there exist \(\varphi _{*}^{1}(x)\in \mathbb {A}(x)\) and \(\varphi _{*}^{2}(x)\in \mathbb {B}(x)\) such that
$$\begin{aligned} V^{*}(x)&=\hat{T}(V^{*},x,\varphi _{*}^{1},\varphi _{*}^{2})&(25)\\ &=\max _{\varphi ^{1}\in \mathbb {A}(x)}\hat{T}(V^{*},x,\varphi ^{1},\varphi _{*}^{2})&(26)\\ &=\min _{\varphi ^{2}\in \mathbb {B}(x)}\hat{T}(V^{*},x,\varphi _{*}^{1},\varphi ^{2}),\quad \forall x\in \mathbf {X}.&(27) \end{aligned}$$
In addition, the stationary strategies \(\varphi _{*}^{1}=\left\{ \varphi _{*}^{1}\right\} \in \varPi _{S}^{1}\) and \(\varphi _{*}^{2}=\left\{ \varphi _{*}^{2}\right\} \in \varPi _{S}^{2}\) form an optimal pair of strategies with respect to the Markov strategies. Hence, by Proposition 1, \(\left( \varphi _{*}^{1},\varphi _{*}^{2}\right) \) is an optimal pair of strategies for the game \(\mathcal {GM}\).
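For intuition on how Theorem 1 plays out computationally, the following sketch iterates a finite-state analogue of the Shapley operator \(T\) in (23): at each state the minimax over \(\mathbb {A}(x)\times \mathbb {B}(x)\) is solved as a matrix-game linear program. All function names and the toy data are ours, not part of the model; this is an illustration under the assumption of finitely many states and actions.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game max_p min_q p^T M q (rows maximize)."""
    m, n = M.shape
    shift = min(0.0, M.min()) - 1.0              # shift so all entries are positive
    A = M - shift
    # Classical LP: minimize sum(x) subject to A^T x >= 1, x >= 0; value = 1/sum(x).
    res = linprog(np.ones(m), A_ub=-A.T, b_ub=-np.ones(n),
                  bounds=[(0.0, None)] * m, method="highs")
    return 1.0 / res.x.sum() + shift

def shapley_iteration(r, alpha, Q, tol=1e-10):
    """Fixed-point iteration V <- TV for a finite game.

    r[x,a,b]: one-stage payoff; alpha[x,a,b]: mean discount factor;
    Q[x,a,b,y]: transition probabilities.
    """
    V = np.zeros(r.shape[0])
    while True:
        EV = np.einsum("xaby,y->xab", Q, V)      # integral of V against Q
        V_new = np.array([matrix_game_value(r[x] + alpha[x] * EV[x])
                          for x in range(r.shape[0])])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy check: one state, one action per player, r = 1, alpha = 0.5.
# The fixed point of V = 1 + 0.5 * V is V* = 2.
V_star = shapley_iteration(np.ones((1, 1, 1)), 0.5 * np.ones((1, 1, 1)),
                           np.ones((1, 1, 1, 1)))
print(V_star)
```

Since \(T\) is a contraction with modulus \(\alpha ^{*}\beta <1\) (see Remark 2), the iteration converges geometrically to \(V^{*}\).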
5 Examples
In order to illustrate the theory developed above, we present two classes of examples. In the first, Examples 1–3, we describe potential applications of indices with nonconstant discount factors. Specifically, in Example 1 we present an application of this kind of optimality criterion in games involving monetary units, in which the discount factor is a function of a random interest rate and/or inflation rate, and in Examples 2 and 3 the state-actions-dependent discount factors appear in a natural manner. Finally, Examples 4 and 5, which constitute the second class, are devoted to illustrating the assumptions imposed on the game model.
Example 1
(Monetary payoffs) Consider the game model (4). In general, r is a utility function which represents the preferences over the outcomes (x, a, b) in \(\mathbb {K}\), and so money is not necessarily involved ([22, p. 9] or [32, p. 13]). In this example, we assume that r(x, a, b) is indeed measured in monetary units. Let
where \(\rho >0\) is the (constant) nominal interest rate and \(\xi \) represents the inflation rate between two consecutive periods; thus \(\rho -\xi \) is the real interest rate, which is random. Assume that \(\xi \) takes values in \(\mathbf {S}=[\underline{s},\overline{s}]\), with \(0<\underline{s}<\overline{s}<\rho \). Hence, Assumption 1(d) trivially follows since
Example 2
(A nondiscounted payoff game with random horizon) In Sect. 2 we described how the discounted game with infinite horizon is played. Let us consider the game model
with the following alternative mode of play, in which the horizon is random. For simplicity, we do not consider the disturbance space \(\mathbf {S}\). At state \(x_{n}\), players 1 and 2 choose actions \((a_{n},b_{n})\) and respectively receive \(r(x_{n},a_{n},b_{n})\) and \(-r(x_{n},a_{n},b_{n})\); then with probability \(1-\alpha (x_{n},a_{n},b_{n})\) the game stops; otherwise, the system moves to another state \(x_{n+1}\) according to \(Q(\cdot \mid x_{n},a_{n},b_{n})\), where the nondiscounted payoff \(r(x_{n+1},a_{n+1},b_{n+1})\) is determined by the actions \((a_{n+1},b_{n+1})\). We assume that there is \(\gamma \in (0,1)\) such that \(1-\alpha (x_{n},a_{n},b_{n})\ge \gamma \). Thus
We will show that the total expected payoff in this game takes the form (16). For this purpose, let \(x^{*}\) and \((a^{*},b^{*})\) be an artificial state and artificial actions. We define the game model
where \(X^{*}=X\cup \{x^{*}\},\)\(A^{*}=A\cup \{a^{*}\},\)\( B^{*}=B\cup \{b^{*}\},\) and the corresponding x-sections are the sets
The transition law \(Q^{*}\) among the states in \(X^{*}\) is a stochastic kernel on \(X^{*}\) given the set
defined as follows: For \((x,a,b)\in \mathbb {K},\)
Finally, the payoff function \(r^{*}:\mathbb {K}^{*}\rightarrow \mathbb {R} \) is given by
On the other hand, let \((\varOmega ^{\prime },\mathcal {F}^{\prime })\) be the measurable space associated with the game model \(\mathcal {GM}^{*}\) (see Sect. 2) and define the first passage time \(\tau :\varOmega ^{\prime }\rightarrow \mathbb {N}_{0}\cup \{+\infty \}\) as
where, as usual, \(\inf \emptyset =+\infty \). For each pair of strategies \( (\varphi ^{1},\varphi ^{2})\in \varPi _{M}^{1}\times \varPi _{M}^{2}\) and initial state \(x\in X,\) the total expected payoff with random horizon \(\tau \) takes the form
Then a straightforward calculation shows that the performance index (29) can be written as a performance index with state-actions-dependent discount factors. Specifically, by following arguments similar to those in the proof of Proposition 2, it is possible to prove the equality
Hence, provided that Assumption 1 holds, Theorem 1 implies that the game (28) with random horizon has a value and that there exists a pair of optimal stationary strategies.
This game model with random horizon is in the spirit of Shapley's seminal paper [35], where finite stochastic games were introduced. Similar games, but with continuous and bounded payoff functions in the performance index (16) and a countable state space, were studied by Rieder [33]. On the other hand, Markov decision models with random horizon and Borel spaces have also been studied in several settings (see, for instance, [2, 4]); however, such control processes are assumed to stop with constant probability. Therefore, our example generalizes many results in the existing literature.
Example 3
(A semi-Markov game) Consider a zero-sum semi-Markov game (see, e.g., [17, 20, 24]) where the sojourn (or holding) times \(\xi _{1},\xi _{2},\ldots \) are i.i.d. random variables with common exponential distribution with parameter \(\lambda >0.\) Suppose that the discount factor is a continuous function \(\gamma :\mathbb {K}\rightarrow (d,\infty )\) where \(d>0.\) Then the expected discounted payoff is
If we define the function \(\tilde{\alpha }:\mathbb {K}\times \mathbf {S} \rightarrow (0,1)\) as
where \(\mathbf {S}=(0,\infty ),\) then the performance index \(\bar{V}\) takes the form (16). In addition, observe that \(\tilde{\alpha }\) is continuous on \(\mathbb {K}\times \mathbf {S}\), and for all (x, a, b),
Thus
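The bound behind this example follows from a standard exponential integral; writing it out explicitly (with \(\theta \) the exponential distribution with parameter \(\lambda \), consistent with the definition of \(\alpha _\theta \) in Sect. 3),

$$\begin{aligned} \alpha _{\theta }(x,a,b)=\int _{0}^{\infty }\mathrm {e}^{-\gamma (x,a,b)s}\,\lambda \mathrm {e}^{-\lambda s}\,\mathrm {d}s=\frac{\lambda }{\lambda +\gamma (x,a,b)}\le \frac{\lambda }{\lambda +d}<1, \end{aligned}$$

so that \(\alpha ^{*}\le \lambda /(\lambda +d)\), since \(\gamma (x,a,b)>d>0\) on \(\mathbb {K}\).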
To the best of our knowledge, semi-Markov models with state-action-dependent discount factors have been considered only for decision processes in [15, 16].
We conclude by presenting some insights into the fulfillment of the continuity and W-growth conditions imposed in Assumption 1. Such conditions are standard in the literature (see, e.g., [14, 18, 20, 24,25,26]) and are satisfied by several zero-sum game models.
As stated in [14, Appendix C], Assumption 1(c) holds if the transition kernel Q on \(\mathbf {X}\) given \(\mathbb {K}\) has a density q(y|x, a, b), continuous in \((x,a,b)\in \mathbb {K}\), with respect to a \(\sigma \)-finite measure m on \(\mathbf {X}\), that is,
Furthermore, Assumption 1(c) also holds for games that evolve on \( \mathbf {X}=\mathbb {R}\) according to noise-additive difference equations of the form
with \(\mathbf {A}=\mathbf {B}=\mathbb {R}\), where G is a continuous function and \(\left\{ w_{n}\right\} \) is a sequence of i.i.d. random variables with continuous density g on \(\mathbb {R}\). In this case the kernel Q takes the form
where \(I_{X}(\cdot )\) stands for the indicator function of the set \(X\in \mathcal {B}(\mathbf {X}).\)
In general, the conditions related to the weighted function W are easier to illustrate in difference-equation game models as we show in the following examples.
Example 4
(A linear quadratic game (see [7])) Consider a game whose dynamics is defined by the linear equation
where \(\mathbf {X}=\mathbf {A}=\mathbf {B}=\mathbb {R}\) and \(\left\{ w_{n}\right\} \) is a sequence of i.i.d. random variables with standard normal distribution
We assume that the admissible action sets are \(A(x)=B(x)=[ - \vert x \vert /2,\vert x \vert / 2].\) In addition, the one-stage payoff r is a quadratic function such that
for some positive constant \(r_{0}.\)
Since the dynamics is defined by a continuous noise-additive function and g is a continuous density, from [14, Appendix C], Assumption 1(c) holds. Moreover, if we take \(W(x):=x^{2}+1,\) by applying the same arguments, the continuity of the functions defined in (20) follows.
On the other hand, for all \((x,a,b)\in \mathbb {K},\)
Hence, Assumption 1(d) and (e) are satisfied with \(\beta =4\) and any continuous function such that \(\tilde{\alpha }(x,a,b,s)<\frac{1}{4}\).
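To see where the constant \(\beta =4\) comes from, consider, for illustration, the noise-additive dynamics \(x_{n+1}=x_{n}+a_{n}+b_{n}+w_{n}\) (a hypothetical instance of a linear equation of the form above; only \(|a|,|b|\le |x|/2\) matters for the bound). With \(W(x)=x^{2}+1\) and \(w_{n}\) standard normal,

$$\begin{aligned} \int _{\mathbf {X}}W(y)\,Q(\mathrm {d}y|x,a,b)=E\left[ (x+a+b+w_{n})^{2}\right] +1=(x+a+b)^{2}+2\le 4x^{2}+2\le 4W(x), \end{aligned}$$

since \(|x+a+b|\le 2|x|\) for \(a,b\in [-|x|/2,|x|/2]\). This is precisely the growth condition in Assumption 1(e) with \(\beta =4\).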
Example 5
(A semi-Markov storage system) We consider a storage system with controlled input/output, whose evolution is as follows. At time \(T_{n},\) when an amount \(M>0\) of a certain product accumulates for admission in the system, player 1 selects an action \(a\in [a_{*},1]=:\mathbf {A},\) \(a_{*}\in (0,1),\) representing the portion of M to be admitted. In addition, there is a continuous consumption of the admitted product, which is controlled by player 2 by selecting \(b\in [b_{*},b^{*}]=:\mathbf {B}\) \((0<b_{*}<b^{*})\), the consumption rate per unit time. Thus, if \(x_{n}\in \mathbf {X}:=[0,\infty )\) is the stock level, and \(a_{n}\) and \(b_{n}\) are the decisions of players 1 and 2, respectively, at the nth decision epoch \(T_{n},\) the process \(\left\{ x_{n}\right\} \) can be modeled as a semi-Markov game evolving according to the equation
with holding times \(\xi _{n+1}:=T_{n+1}-T_{n}.\) In the context of Example 3, we suppose that \(\left\{ \xi _{n}\right\} \) is a sequence of i.i.d. random variables, exponentially distributed with parameter \(\lambda >0.\) Moreover, the discount factor is a continuous function \(\gamma :\mathbb {K}\rightarrow (d,\infty )\) where \(d>0.\) It is reasonable to assume that
Let \(\varPsi \) be the moment generating function of the random variable \( M-b_{*}\xi \), that is:
Then, computing the derivative \(\varPsi ^{\prime }\) and using the fact that \(b_{*}<M\lambda \) (see (31)), it is easy to prove that \(\varPsi ^{\prime }(t)>0\) for \(t>0\). Moreover, taking the constant \(d>\lambda \), from the continuity of \(\varPsi \) and because \(\varPsi (0)=1\), there exists \(\lambda ^{*}>0\) such that
Now we assume that the one-stage payoff r is an arbitrary function satisfying Assumption 1(b) such that
for some constant \(r_{0}>0.\) Hence, defining \(W(x):=\mathrm{e}^{\lambda ^{*}x},\) relation (19) is satisfied. Moreover, for \((x,a,b)\in \mathbb {K},\)
Hence, combining (30) and (32), we obtain
and defining \(\beta :=\beta _{0}+\bar{r},\) Assumptions 1(d), (e) are satisfied.
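The monotonicity of \(\varPsi \) used in this verification is easy to check numerically. For \(\xi \sim \mathrm {Exp}(\lambda )\), a direct computation gives the closed form \(\varPsi (t)=\mathrm {e}^{tM}\lambda /(\lambda +b_{*}t)\) for \(t>-\lambda /b_{*}\); the sketch below (hypothetical constants satisfying \(b_{*}<M\lambda \) from (31)) confirms \(\varPsi (0)=1\) and \(\varPsi ^{\prime }>0\) on a grid:

```python
import numpy as np

# Psi(t) = E[exp(t*(M - b_star*xi))] with xi ~ Exp(lam); the closed form below
# is valid for t > -lam/b_star.  Constants are hypothetical and satisfy the
# condition b_star < M*lam from (31).
M, b_star, lam = 1.0, 0.5, 2.0

def psi(t):
    return np.exp(t * M) * lam / (lam + b_star * t)

t = np.linspace(0.0, 3.0, 301)
dpsi = np.gradient(psi(t), t)     # numerical derivative of Psi on the grid
print(psi(0.0), dpsi.min())       # Psi(0) = 1 and Psi' > 0 throughout
```

Indeed, \(\varPsi ^{\prime }(t)=\lambda \mathrm {e}^{tM}\bigl (M(\lambda +b_{*}t)-b_{*}\bigr )/(\lambda +b_{*}t)^{2}>0\) for \(t\ge 0\) precisely because \(b_{*}<M\lambda \).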
Finally, to verify Assumption 1(c), let u be a bounded measurable function on \(\mathbf {X}\) and \(\rho _{(a,b)}\) be the density of the random variable \(aM-b\delta \), for every fixed \(a\in \mathbf {A}\) and \( b\in \mathbf {B}\). Observe that
and therefore, for each \(y\in \mathbb {R},\) \((a,b)\longmapsto \rho _{(a,b)}(y)\) is a continuous function on \(\mathbf {A}\times \mathbf {B}\). Hence,
Thus by Scheffé’s Theorem,
defines a continuous function on \(\mathbf {A}\times \mathbf {B},\) which proves that Assumption 1(c) holds. The continuity of the functions in (20) is shown similarly.
6 Proofs
6.1 Proof of Proposition 1
The proof is a consequence of the following facts.
Let us fix \(\varphi ^{2}\in \varPi _{S}^{2}.\) Define the stochastic kernel \( Q_{\varphi ^{2}}\) on \(\mathbf {X}\) given \(\mathbb {K}_{\mathbf {A}}\) as
Moreover, \(r_{\varphi ^{2}}:\mathbb {K}_{\mathbf {A}}\rightarrow \mathbb {R}\) and \(\tilde{\alpha }_{\varphi ^{2}}:\mathbb {K}_{\mathbf {A}}\times \mathbf {S}\rightarrow (0,1)\) are the measurable functions defined as
In addition, let \(\pi ^{1}\in \varPi ^{1}\) be an arbitrary strategy, and for \( x\in \mathbf {X},\) we define the performance index
where
and \(E_{x}^{\pi ^{1}}\) is the expectation operator with respect to the probability measure \(P_{x}^{\pi ^{1}}\equiv P_{x}^{\pi ^{1},\varphi ^{2}}\) induced by \((\pi ^{1},\varphi ^{2})\in \varPi ^{1}\times \varPi _{S}^{2}\) and \( x_{0}=x.\) Then, from (7)–(10), \(P_{x}^{\pi ^{1}}\) satisfies the following properties:
Similarly, for a fixed \(\varphi ^{1}\in \varPi _{S}^{1},\) define \( Q_{\varphi ^{1}},\)\(r_{\varphi ^{1}},\)\(\tilde{\alpha }_{\varphi ^{1}}\) and the performance index
where
The next result is an adaptation of [23, Lemma 15] to our context. The proof follows by applying similar arguments and making the appropriate changes.
Lemma 1
For each \(x\in \mathbf {X}\), \(\varphi ^{2}\in \varPi _{S}^{2}\), and \(\pi ^{1}\in \varPi ^{1}\) there exists \(\varphi ^{1}\in \varPi _{M}^{1}\) such that
Remark 1
Let us fix \(\varphi ^{1}\in \varPi _{S}^{1}.\) Then we can also prove that for each \(\pi ^{2}\in \varPi ^{2}\) there exists \(\varphi ^{2}\in \varPi _{M}^{2}\) such that
where \(\tilde{V}_{\varphi ^{1}}\) is the performance index defined in (41).
Lemma 2
(a) For each \(\pi ^{1}\in \varPi ^{1}\) and \( \varphi ^{2}\in \varPi _{S}^{2}\), there exists \(\varphi ^{1}\in \varPi _{M}^{1}\) such that
(b) For each \(\pi ^{2}\in \varPi ^{2}\) and \(\varphi ^{1}\in \varPi _{S}^{1}\), there exists \(\varphi ^{2}\in \varPi _{M}^{2}\) such that
Proof
Let \(\pi ^{1}\in \varPi ^{1}\) and \(\varphi ^{2}\in \varPi _{S}^{2}\) be arbitrary strategies and consider the corresponding performance index \( \tilde{V}_{\varphi ^{2}}(x,\pi ^{1}),\)\(x\in \mathbf {X}\). From Lemma 1, there exists \(\varphi ^{1}\in \varPi _{M}^{1}\) such that \( \tilde{V}_{\varphi ^{2}}(x,\pi ^{1})=\tilde{V}_{\varphi ^{2}}(x,\varphi ^{1}), \)\(x\in \mathbf {X}.\) Hence, to obtain (44), it is enough to prove
which is obtained by comparing the corresponding terms in the sums (36) and (11).
Indeed, for the first term, from (34)
Furthermore, from (35)
An induction argument shows that
Hence, from (36) and (11), we obtain (46).
Part (b) is proved similarly. \(\square \)
Proof of Proposition 1
Let \((\varphi _{*}^{1},\varphi _{*}^{2})\in \varPi _S^1\times \varPi _S^2\) be a pair that satisfies (13). From Lemma 2, we have, for each \(\varphi ^{2}\in \varPi _{S}^{2}, \)
and for each \(\varphi ^{1}\in \varPi _{S}^{1}\)
Therefore, (49) and (50) yield the desired inequality (12). This completes the proof of the proposition. \(\square \)
6.2 Proof of Proposition 2
Proof
The proof follows by arguments similar to those in the proof of Lemma 2. For instance, observe that, for each \(x\in \mathbf {X}\) and \((\varphi ^{1},\varphi ^{2})\in \varPi _{M}^{1}\times \varPi _{M}^{2}\), from (14),
An induction argument then shows that
Therefore, from (11) and (16), we get (17). \(\square \)
6.3 Proof of Theorem 1
Before presenting the proof, we collect some facts on minimax theorems, the W-norm, and the Shapley operator (23). These facts are summarized in the following remark.
Remark 2
(a) Provided that Assumption 1 holds, for \(u\in {\mathbb {B}}_{W}\) and \((x,a,b)\in \mathbb {K}\), \(\hat{T}(u,x,\cdot ,b)\) is usc on A(x) and \(\hat{T}(u,x,a,\cdot )\) is lsc on B(x). Hence, by applying well-known properties of weak convergence of measures on the sets \(\mathbb {A}(x)\) and \(\mathbb {B}(x)\) (see, e.g., Theorem 2.8.1 in [1]), we can prove that \(\hat{T}(u,x,\cdot ,\varphi ^{2})\) is usc on \(\mathbb {A}(x)\) while \(\hat{T}(u,x,\varphi ^{1},\cdot )\) is lsc on \(\mathbb {B}(x)\). In addition, since \(\hat{T}(u,x,\varphi ^{1},\varphi ^{2})\) is concave in \(\varphi ^{1}\) and convex in \(\varphi ^{2}\), Fan's well-known minimax theorem implies that we can interchange \(\inf \) and \(\sup \) in (23), i.e.,
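Written out explicitly (our reconstruction of the display, using the notation of the Shapley operator (23)), the interchange reads:

```latex
\inf_{\varphi^{2}\in\mathbb{B}(x)}\ \sup_{\varphi^{1}\in\mathbb{A}(x)}
  \hat{T}(u,x,\varphi^{1},\varphi^{2})
\;=\;
\sup_{\varphi^{1}\in\mathbb{A}(x)}\ \inf_{\varphi^{2}\in\mathbb{B}(x)}
  \hat{T}(u,x,\varphi^{1},\varphi^{2}),
\qquad u\in\mathbb{B}_{W},\ x\in\mathbf{X}.
```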
(b) Moreover, suitable measurable selection theorems yield the existence of \(\varphi _{*}^{1}\in \mathbb {A}(x)\) and \(\varphi _{*}^{2}\in \mathbb {B}(x)\) such that (see, e.g., Lemma 4.3 in [29])
(c) For \(u,v\in {\mathbb {B}}_{W}\), (21)–(23) and properties of the W-norm imply
which in turn yields
Hence, T is a contraction operator on \(\mathbb {B}_{W}\) with modulus \( \alpha ^{*}\beta <1.\) Similarly, the operator
defined for a pair of stationary strategies \((\varphi ^{1},\varphi ^{2})\in \varPi _{S}^{1}\times \varPi _{S}^{2}\), is a contraction operator on \( \mathbb {B}_{W}\) with modulus \(\alpha ^{*}\beta \).
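The estimate behind the contraction claim can be sketched as follows. This is our reconstruction of the routine computation; it assumes, as is standard in weighted-norm dynamic programming, that the random discount factor is bounded by \(\alpha ^{*}\) and that the transition law satisfies a growth bound of the form \(\int W\,dQ\le \beta W\), in line with (22):

```latex
\begin{aligned}
|Tu(x)-Tv(x)|
 &\le \alpha^{*}\sup_{a\in A(x),\,b\in B(x)}
      \int_{\mathbf{X}}|u(y)-v(y)|\,Q(dy\mid x,a,b)\\
 &\le \alpha^{*}\,\Vert u-v\Vert_{W}
      \sup_{a\in A(x),\,b\in B(x)}\int_{\mathbf{X}}W(y)\,Q(dy\mid x,a,b)
 \;\le\; \alpha^{*}\beta\,\Vert u-v\Vert_{W}\,W(x).
\end{aligned}
```

Dividing by \(W(x)\) and taking the supremum over \(x\in \mathbf {X}\) gives \(\Vert Tu-Tv\Vert _{W}\le \alpha ^{*}\beta \Vert u-v\Vert _{W}\).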
(d) Thus, there exist unique fixed points v and \(v_{\varphi ^{1}\varphi ^{2}}\) in \(\mathbb {B}_{W}\) of operators T and \(T_{\varphi ^{1}\varphi ^{2}} \), respectively, that is
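The fixed-point mechanism in parts (c) and (d) can be illustrated numerically on a toy finite model (not the paper's Borel-space setting). Once a pair of stationary strategies is fixed, \(T_{\varphi ^{1}\varphi ^{2}}\) reduces to an affine map \(u\mapsto r+\mathrm {diag}(\alpha )Pu\); taking \(W\equiv 1\), the W-norm is the sup norm and \(\beta =1\), so the contraction modulus is \(\alpha ^{*}\). All names and numbers below are illustrative assumptions:

```python
import numpy as np

# Toy sketch of Remark 2(c)-(d): 3-state chain under a fixed pair of
# stationary strategies; all data below are illustrative assumptions.
rng = np.random.default_rng(0)
n = 3
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)     # transition matrix under (phi1, phi2)
r = rng.random(n)                     # one-stage payoff under (phi1, phi2)
alpha = np.array([0.6, 0.7, 0.5])     # state-dependent expected discounts, alpha* = 0.7 < 1

def T(u):
    """(T u)(x) = r(x) + alpha(x) * sum_y P(x, y) u(y)."""
    return r + alpha * (P @ u)

# Banach's fixed-point theorem: iterating T from any starting point
# converges geometrically (modulus alpha* = 0.7) to the unique fixed point.
u = np.zeros(n)
for _ in range(200):
    u = T(u)

# The fixed point can also be computed directly: (I - diag(alpha) P) v = r.
v = np.linalg.solve(np.eye(n) - np.diag(alpha) @ P, r)
print(np.max(np.abs(u - v)))          # essentially zero
```

After 200 iterations the gap to the true fixed point is of order \(0.7^{200}\), i.e., numerically indistinguishable from zero.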
(e) Finally, we also apply the following properties of the weight function W. From (22), for each \(x\in \mathbf {X}\), \(\left( \pi ^{1},\pi ^{2}\right) \in \varPi ^{1}\times \varPi ^{2}\), and \(n\in \mathbb {N}_{0}\),
Iteration of this inequality yields
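The one-step bound and its iterate are presumably of the following form (our reconstruction from the surrounding text; \(h_{n}\) denotes the history up to time \(n\)):

```latex
E_{x}^{\pi^{1},\pi^{2}}\bigl[W(x_{n+1})\mid h_{n}\bigr]\le \beta\,W(x_{n}),
\qquad\text{and, iterating,}\qquad
E_{x}^{\pi^{1},\pi^{2}}\,W(x_{n})\le \beta^{n}\,W(x),\quad n\in\mathbb{N}_{0}.
```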
Furthermore, from (54) and (15), for each \(u\in \mathbb {B}_{W}\), \(x\in \mathbf {X}\), \(\left( \pi ^{1},\pi ^{2}\right) \in \varPi ^{1}\times \varPi ^{2}\), and \(n\in \mathbb {N}_{0}\),
Therefore,
Proof of Theorem 1
where v is the fixed point of T (see Remark 2 (d)). In addition, from Remark 2 (b), there exists a pair of stationary strategies \((\varphi _{*}^{1},\varphi _{*}^{2})\in \varPi _{S}^{1}\times \varPi _{S}^{2}\) such that
On the other hand, \(V(\cdot ,\varphi _{*}^{1},\varphi _{*}^{2})\) is the unique fixed point of \(T_{\varphi _{*}^{1}\varphi _{*}^{2}}\) belonging to \(\mathbb {B}_{W},\) i.e.,
Indeed, from (53), (24), and (52)
for every x in \(\mathbf {X}\). Iterating this equation, we obtain
Now, letting \(m\rightarrow \infty \), from (55) and (16), we obtain (59).
Since \(V(\cdot ,\varphi _*^{1},\varphi _{*}^{2})\) is the unique fixed point of \(T_{\varphi _*^1\varphi _*^2},\) (56) implies that \( v(x)=V(x,\varphi _*^1,\varphi _*^2)\), \(x\in \mathbf {X}.\) Therefore, considering (57) and (58), Theorem 1 will be proved if we show that
To prove the first inequality in (60), let \(\varphi ^{1}\in \varPi _{M}^{1}\) be an arbitrary Markov strategy for player 1. Then, for all \( n\in \mathbb {N},\)
where the last two equalities come from (56) and (57). Now, from (61), for all \(n\in \mathbb {N},\)
which, by taking the expectation \(E_{x}^{\varphi ^{1},\varphi _{*}^{2}}\) and summing over \(n=0,1,\ldots ,m-1\), \(m>0\), implies
Letting \(m\rightarrow \infty \), from (16) and (55), we get
that is, the first inequality in (60) holds. The second inequality is proved similarly. Hence, the proof of Theorem 1 is completed. \(\square \)
References
Ash RB, Doléans-Dade C (2000) Probability and measure theory, 2nd edn. Academic Press, London
Bäuerle N, Rieder U (2011) Markov decision processes with applications to finance. Springer, Berlin
Carmon Y, Shwartz A (2009) Markov decision processes with exponentially representable discounting. Oper Res Lett 37(1):51–55
Cruz-Suárez H, Ilhuicatzi-Roldán R, Montes-de Oca R (2014) Markov decision processes on Borel spaces with total cost and random horizon. J Optim Theory Appl 162(1):329–346
Dutta PK, Sundaram R (1992) Markovian equilibrium in a class of stochastic games: existence theorems for discounted and undiscounted models. Econ Theory 2(2):197–214
Dynkin EB, Yushkevich AA (1979) Controlled Markov processes. Springer, Berlin
Engwerda J (2005) LQ dynamic optimization and differential games. Wiley, New York
Feinberg EA, Shwartz A (1999) Constrained dynamic programming with two discount factors: applications and an algorithm. IEEE Trans Autom Control 44(3):628–631
Filar J, Vrieze K (2012) Competitive Markov decision processes. Springer, Berlin
González-Hernández J, López-Martínez RR, Minjárez-Sosa JA (2008) Adaptive policies for stochastic systems under a randomized discounted cost criterion. Bol Soc Mat Mexicana 3(14):149–163
González-Hernández J, López-Martínez RR, Minjárez-Sosa JA (2009) Approximation, estimation and control of stochastic systems under a randomized discounted cost criterion. Kybernetika 45(5):737–754
González-Hernández J, López-Martínez RR, Minjárez-Sosa JA, Gabriel-Arguelles JR (2013) Constrained Markov control processes with randomized discounted cost criteria: occupation measures and extremal points. Risk Decis Anal 4(3):163–176
He W, Sun Y (2017) Stationary Markov perfect equilibria in discounted stochastic games. J Econ Theory 169:35–61
Hernández-Lerma O, Lasserre JB (1996) Discrete-time Markov control processes: basic optimality criteria, vol 30. Springer, Berlin
Huang Y, Guo X (2012) Constrained optimality for first passage criteria in semi-Markov decision processes. In: Hernández-Hernández D, Minjárez-Sosa JA (eds) Optimization, control, and applications of stochastic systems, systems & control: foundations & applications. Birkhauser, Boston, pp 181–202 chap. 11
Huang Y, Wei Q, Guo X (2013) Constrained Markov decision processes with first passage criteria. Ann Oper Res 206(1):197–219
Jaśkiewicz A, Nowak AS (2006) Approximation of noncooperative semi-Markov games. J Optim Theory Appl 131(1):115–134
Jaśkiewicz A, Nowak AS (2006) Zero-sum ergodic stochastic games with Feller transition probabilities. SIAM J Control Optim 45(3):773–789
Krausz A, Rieder U (1997) Markov games with incomplete information. Math Methods Oper Res 46(2):263–279
Luque-Vásquez F (2002) Zero-sum semi-Markov games in Borel spaces: discounted and average payoff. Bol Soc Mat Mexicana 8:227–241
Maitra A, Parthasarathy T (1970) On stochastic games. J Optim Theory Appl 5(4):289–300
Maschler M, Solan E, Zamir S (2013) Game theory. Cambridge University Press, Cambridge
Minjárez-Sosa JA (2015) Markov control models with unknown random state-action-dependent discount factors. Top 23(3):743–772
Minjárez-Sosa JA, Luque-Vásquez F (2008) Two person zero-sum semi-Markov games with unknown holding times distribution on one side: a discounted payoff criterion. Appl Math Optim 57(3):289–305
Minjárez-Sosa JA, Vega-Amaya O (2009) Asymptotically optimal strategies for adaptive zero-sum discounted Markov games. SIAM J Control Optim 48(3):1405–1421
Minjárez-Sosa JA, Vega-Amaya O (2013) Optimal strategies for adaptive zero-sum average Markov games. J Math Anal Appl 402(1):44–56
Neyman A, Sorin S (2003) Stochastic games and applications, vol 570. Kluwer, Dordrecht
Nowak AS (1984) On zero-sum stochastic games with general state space. I. Prob Math Stat 4(1):13–32
Nowak AS (1985) Measurable selection theorems for minimax stochastic optimization problems. SIAM J Control Optim 23(3):466–476
Nowak AS (1987) Nonrandomized strategy equilibria in noncooperative stochastic games with additive transition and reward structure. J Optim Theory Appl 52(3):429–441
Nowak AS, Szajowski K (1999) Nonzero-sum stochastic games. In: Stochastic and differential games. Annals of the international society of dynamic games, vol 4, chap 7. Springer, Berlin, pp 297–342
Osborne MJ, Rubinstein A (1994) A course in game theory. MIT Press, Cambridge
Rieder U (1991) Non-cooperative dynamic games with general utility functions. In: Raghavan TES, Ferguson TS, Parthasarathy T, Vrieze OJ (eds) Stochastic games and related topics, theory and decision library, vol 7. Springer, Berlin, pp 161–174
Schäl M (1975) Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Probab Theory Rel Fields 32(3):179–196
Shapley LS (1953) Stochastic games. Proc Natl Acad Sci USA 39(10):1095–1100
Shimkin N, Shwartz A (1995) Asymptotically efficient adaptive strategies in repeated games. Part I: certainty equivalence strategies. Math Oper Res 20(3):743–767
Shimkin N, Shwartz A (1996) Asymptotically efficient adaptive strategies in repeated games. Part II: asymptotic optimality. Math Oper Res 21(2):487–512
Wei Q, Guo X (2011) Markov decision processes with state-dependent discount factors and unbounded rewards/costs. Oper Res Lett 39(5):369–374
Work supported by Consejo Nacional de Ciencia y Tecnología (CONACYT) under Grant CB2015/254306.
González-Sánchez, D., Luque-Vásquez, F. & Minjárez-Sosa, J.A. Zero-Sum Markov Games with Random State-Actions-Dependent Discount Factors: Existence of Optimal Strategies. Dyn Games Appl 9, 103–121 (2019). https://doi.org/10.1007/s13235-018-0248-8