Abstract
The semi-Markov model is one of the most general models for stochastic dynamic systems. This paper deals with a two-person zero-sum game for semi-Markov processes. We focus on the expected discounted criterion with state-action-dependent discount factors. The state and action spaces are both Polish spaces, and the reward rate function is ω-bounded. We first construct a fairly general model of semi-Markov games under a given semi-Markov kernel and a pair of strategies. Next, based on the standard regularity condition and the continuity-compactness condition for semi-Markov games, we derive a “drift condition” on the semi-Markov kernel and suppose that the discount factors have a positive lower bound, under which the existence of the value function and of an optimal pair of stationary strategies for our semi-Markov game is proved by means of a general Shapley equation. Moreover, in the scenario of finite state and action spaces, a value-iteration-type algorithm for approximating the value function and an optimal pair of stationary strategies is developed. The convergence and the error bound of the algorithm are also proved. Finally, we present numerical examples to demonstrate the main results.
1 Introduction
This paper deals with two-person zero-sum semi-Markov games (SMGs) with the expected discounted criterion, which generalize discrete-time Markov games (DTMGs) (Shapley 1953) in that the sojourn time between two consecutive decision epochs may follow an arbitrary distribution rather than being constant. Such games have already been studied in the literature (Chen et al. 2021; Lal and Sinha 1992; Luque-Vásquez 2002a, 2002b; Minjárez-Sosa and Luque-Vásquez 2008; Mondal et al. 2016, 2020). However, the existing references are all restricted to the case where the discount factor is a constant, which may not always hold in practice.
In this paper, we study a more general class of SMGs with varying discount factors to better match real-world problems. Consider, for example, an application in economics: the discount factor (interest rate) may depend both on the economic environment and on the decision-makers’ actions. That is, the interest rate usually varies across financial markets and monetary policies, where financial markets can be regarded as states and monetary policies as actions taken by the government. Problems with non-constant discount factors have been studied for Markov decision processes (MDPs) (Feinberg and Shwartz 1994; González-Hernández et al. 2009; Schäl 1975; Wei and Guo 2011; Ye and Guo 2012; Minjárez-Sosa 2015; Wu and Zhang 2016) and for two-person zero-sum DTMGs (González-Sánchez et al. 2019). Furthermore, in light of the applications of SMGs to the dynamic oligopoly model (Kirman and Sobel 1974) and the dynamic overlapping generations model (Raut 1990), it is both necessary and reasonable to investigate SMGs with state-action-dependent discount factors so as to close this gap between theory and application.
On the other hand, most of the literature on SMGs focuses on the existence of optimal strategies. However, efficiently solving a stochastic dynamic game and computing an optimal pair of stationary strategies are especially important for the practical implementation of game theory. The classic algorithmic study of game theory focuses on matrix games, which can be solved by linear programming (LP) (Barron 2013). This LP technique has been combined with the policy iteration of MDPs to solve discounted two-person zero-sum DTMGs (Hoffman and Karp 1966; Pollatschek and Avi-Itzhak 1969), where LP is used to improve the strategy at every iteration. More recently, emerging investigations aim at the efficient computation of stochastic dynamic games using approximation or learning algorithms. Littman (1994) proposes a minimax-Q algorithm to solve discounted two-person zero-sum DTMGs, which is essentially the standard Q-learning algorithm with the max operator of reinforcement learning replaced by a minimax operator over Markov games. In addition, finite algorithms for some interesting special classes of stochastic games have also been widely studied, such as LP for DTMGs with a single controller, a switching controller, or separable reward and state-independent transitions (Filar and Vrieze 2012). Recently, similar results have been extended to SMGs. Mondal et al. (2016) study discounted two-person zero-sum SMGs with AR-AT-AITT (Additive Reward and Additive Transition and Action Independent Transition Times) structure. They prove that such a game can be formulated as a vertical linear complementarity problem (VLCP), which can be solved by Cottle-Dantzig’s algorithm. They further propose a policy improvement algorithm for solving a mixed class of perfect-information and AR-AT SMGs (Mondal et al. 2020).
Notice that the above algorithms for solving SMGs are only suitable for these special structures and fail to solve general discounted two-person zero-sum SMGs effectively.
In this paper, we study two-person zero-sum SMGs with the expected discounted criterion in which the discount factors are state-action-dependent. The objective is to find an optimal pair of stationary strategies that maximizes the reward of player 1 (P1) and minimizes the payoff of player 2 (P2). More precisely, we deal with SMGs specified by five primitive data: the state space X; the action spaces A,B for P1 and P2, respectively; the semi-Markov kernel Q(t,y|x,a,b); the discount factor α(x,a,b); and the reward rate function r(x,a,b). The state space X and action spaces A,B are all Polish spaces, and the reward rate function r(x,a,b) is ω-bounded. The semi-Markov kernel Q(t,y|x,a,b) is a joint distribution of the sojourn time and the next state for any given (x,a,b) ∈ X × A × B, which is more general than in the literature (Luque-Vásquez 2002a, 2002b; Minjárez-Sosa and Luque-Vásquez 2008; Mondal et al. 2016, 2020). To the best of our knowledge, the framework with a common state-time distribution has so far been used only in works dealing with semi-Markov decision processes (for non-negative reward rate functions (Huang and Guo 2011, 2010); for the average criterion with unbounded reward rate functions (Ross 1970)) and SMGs (for bounded reward rate functions (Chen et al. 2021; Lal and Sinha 1992; Tanaka and Wakuta 1976); for the average criterion with unbounded reward rate functions (Vega-Amaya et al. 2022)), and has not been studied for discounted SMGs with unbounded reward rate functions. This extension necessitates a new “drift condition”, which leads to a general Shapley equation and a more involved proof of optimality. With these new features, we construct an SMG model in a fairly general setting. Then we impose suitable conditions on the model parameters, stated in Assumptions 1-4, under which we establish the Shapley equation and prove the existence of the value function and of an optimal pair of stationary strategies of the game.
Our proof is quite different from that of González-Sánchez et al. (2019), since we search for optimal strategies directly within the class of history-dependent strategies instead of restricting attention to Markov strategies. Furthermore, in the scenario of finite state and action spaces, we derive a value-iteration-type algorithm, based on the Shapley equation, to approximate the value function and an optimal pair of stationary strategies of the game. The convergence and the error bound of the algorithm are also derived. Finally, we conduct numerical examples on an investment problem to demonstrate the main results of our paper.
The contributions of this paper can be summarized as follows. (1) This paper studies a fairly general model of SMGs with state-action-dependent discount factors and joint probability transition functions. We derive a “drift condition” (see Assumption 3) on the generic semi-Markov kernel, which is more general than its counterparts in the literature (Luque-Vásquez 2002a, 2002b; González-Sánchez et al. 2019), as stated in Remark 3. (2) In order to find an ε-optimal pair of stationary strategies and an approximate value function, a value-iteration-type algorithm is proposed, which can be viewed as a combination of the value iteration of MDPs and the LP of matrix games. Moreover, the convergence and the error bound of the algorithm are analyzed.
The rest of this paper is organized as follows. In Section 2, we introduce the model of SMG as well as the optimality criterion. Our main optimality results are stated in Section 3 and studied with the proof in Section 4. A value iteration-type algorithm for approximating an optimal pair of stationary strategies is developed in Section 5, and some numerical examples are conducted to demonstrate our main results in Section 6. Finally, we conclude the paper and discuss some future research topics in Section 7.
2 Two-person zero-sum semi-Markov game model
Notation: If E is a Polish space (that is, a complete and separable metric space), its Borel σ-algebra is denoted by \({\mathscr{B}}(E)\), and \(\mathbb {P}(E)\) denotes the family of probability measures on \({\mathscr{B}}(E)\), endowed with the topology of weak convergence.
In this section, we introduce a two-person zero-sum SMG model with expected discounted criterion and state-action-dependent discount factors, which is denoted by the collection
where the symbols are explained as follows.
-
X is the state space which is a Polish space, and A, B are action spaces for P1 and P2, respectively, which are also supposed to be Polish spaces.
-
A(x) and B(x) are Borel subsets of A and B, which represent the sets of the admissible actions for P1 and P2 at state x ∈ X, respectively. Let
$$ K := \{(x,a,b)|x\in X, a\in A(x), b\in B(x)\} $$be a measurable Borel subset of X × A × B.
-
Q(t,y|x,a,b) is a semi-Markov kernel which satisfies the following properties.
-
(a)
For each fixed (x,a,b) ∈ K, Q(⋅,⋅|x,a,b) is a probability measure on \([0,\infty ) \times X\), whereas for each fixed \(t \in [0,\infty ), D\in {\mathscr{B}}(X)\), Q(t,D|⋅,⋅,⋅) is a real-valued Borel function on K.
-
(b)
For each fixed (x,a,b) ∈ K and \(D\in {\mathscr{B}}(X)\), Q(⋅,D|x,a,b) is a non-decreasing right-continuous real-valued Borel function on \([0,\infty )\) such that Q(0,D|x,a,b) = 0.
-
(c)
For each fixed (x,a,b) ∈ K, we denote by
$$ H(\cdot|x,a,b) := Q(\cdot,X|x,a,b) $$the distribution function of the sojourn time at state x ∈ X when the actions a ∈ A(x),b ∈ B(x) are chosen. For each x ∈ X and \(D\in {\mathscr{B}}(X)\), when P1 and P2 select actions a ∈ A(x) and b ∈ B(x), respectively, Q(t,D|x,a,b) denotes the joint probability that the sojourn time in state x is not greater than \(t \in \mathbb {R}_{+}\) and the next state belongs to D.
-
α(x,a,b) is a measurable function from K to \((0,\infty )\), which denotes the state-action-dependent discount factor.
-
r(x,a,b) is a real-valued function on K, which represents the reward/payoff rate function for P1/P2.
Remark 1
In our SMG model, the semi-Markov kernel Q(⋅,⋅|x,a,b) is a joint probability distribution with respect to sojourn time and state for given (x,a,b) ∈ K. Thus, our model is more general than the counterpart in the literature (Lal and Sinha 1992; Luque-Vásquez 2002a, 2002b; González-Sánchez et al. 2019).
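As a concrete illustration of this generality (our example, not one from the paper): if the sojourn time is exponential with rate β(x,a,b) and independent of the next state, the kernel factorizes as

$$ Q(t,D|x,a,b)=\left(1-e^{-\beta(x,a,b)t}\right)P(D|x,a,b),\quad t\geq0,~D\in{\mathscr{B}}(X), $$

so that H(t|x,a,b) = 1 − e−β(x,a,b)t. A general joint kernel need not factorize in this way, since the sojourn time and the next state may be dependent.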
The evolution of SMGs with the expected discounted criterion proceeds as follows.
Assume that the game starts from the initial state x0 ∈ X at the initial decision epoch t0 := 0. The two players simultaneously choose actions a0 ∈ A(x0), b0 ∈ B(x0) based on t0 and x0; then P1 receives the immediate reward r(x0,a0,b0) and P2 incurs the immediate payoff r(x0,a0,b0). After staying at state x0 up to time t1 > t0, the system moves to a new state x1 ∈ D according to the transition law Q(t1 − t0,D|x0,a0,b0). Once the transition to x1 occurs at the first decision epoch t1, the process repeats itself and the game evolves in this way.
Thus, we obtain an admissible history at the n-th decision epoch
Letting the game run indefinitely, we obtain the infinite history
where tn ≤ tn+ 1, (xn,an,bn) ∈ K for all n ≥ 0. Moreover, let \({\mathscr{H}}_{n}\) be the class of all admissible histories hn of the system up to the n-th decision epoch, endowed with a Borel σ-algebra.
Let \(({\Omega }, \mathcal {F})\) be the canonical measurable space consisting of the sample space \({\Omega }=(\mathbb {R}_{+}\times X\times A\times B)^{\infty }\) and the corresponding product σ-algebra \(\mathcal {F}\). For each \(\omega =(t_{0},x_{0},a_{0},b_{0},t_{1},x_{1},\dots )\in {\Omega }\), we define a stochastic process \(\left \{T_{n}, X_{n}, A_{n}, B_{n}, n\ge 0\right \}\) on \(({\Omega }, \mathcal {F})\) by
Moreover, we denote by Hn the process of decision times, states and actions up to the n-th decision epoch:
To introduce the expected discounted criterion studied in this paper, we first define strategies as follows.
Definition 1
A randomized history-dependent strategy for P1 is a sequence of stochastic kernels \(\pi ^{1}:=({\pi _{n}^{1}},n\geq 0)\) that satisfies the following conditions:
-
(i)
for each \(D\in {\mathscr{B}}(A)\), \({\pi _{n}^{1}}(D|\cdot )\) is a Borel function on \({\mathscr{H}}_{n}\), and for each \(h_{n}\in {\mathscr{H}}_{n}\), \({\pi _{n}^{1}}(\cdot |h_{n})\) is a probability measure on A;
-
(ii)
\({\pi _{n}^{1}}(\cdot |h_{n})\) is concentrated on A(xn), that is
$$ {\pi_{n}^{1}}(A(x_{n})|h_{n})=1, ~~~ \forall h_{n}\in \mathcal{H}_{n} ~ \text{and}~ n\geq0. $$
For simplicity, we denote by Π1 the set of all randomized history-dependent strategies for P1.
Definition 2
-
(1)
A strategy \(\pi ^{1}=({\pi _{n}^{1}},n\geq 0) \in {\Pi }_{1}\) is called a randomized Markov strategy if there exists a sequence of stochastic kernels ϕ1 = (φn,n ≥ 0) such that
$$ \pi_{n}^{1}(\cdot|h_{n})=\varphi_{n}(\cdot|x_{n}),~~~ \forall h_{n}\in \mathcal{H}_{n} ~ \text{and}~ n\geq0. $$ -
(2)
A randomized Markov strategy ϕ1 = (φn,n ≥ 0) is called stationary if φn is independent of n; that is, if there exists a stochastic kernel φ on A given x such that
$$ \varphi_{n}(\cdot|x)\equiv \varphi (\cdot|x),~~~ \forall x\in X~ \text{and}~ n\geq0. $$That is, \(\phi _{1}=(\varphi ,\varphi ,\ldots )=:\varphi ^{\infty }\). For convenience, we still write φ for the randomized stationary strategy φ∞ unless otherwise stated.
-
(3)
Moreover, if φ(⋅|x) is a Dirac measure for all x ∈ X, then the stationary strategy φ is called a pure strategy.
We denote by \({{\Pi }_{1}^{M}}\), Φ1 and \({\Pi }_{1}^{MD}\) the sets of all the randomized Markov strategies, randomized stationary strategies and pure strategies for P1, respectively.
The sets Π2 of all randomized history-dependent strategies, \({{\Pi }_{2}^{M}}\) of randomized Markov strategies, Φ2 of randomized stationary strategies, and \({\Pi }_{2}^{MD}\) of pure strategies for P2 are defined similarly, with B(x) in lieu of A(x). Clearly, \({\Pi }_{1}^{MD}\subset {\Phi }_{1} \subset {{\Pi }_{1}^{M}} \subset {\Pi }_{1} \) and \({\Pi }_{2}^{MD}\subset {\Phi }_{2} \subset {{\Pi }_{2}^{M}} \subset {\Pi }_{2} \).
For each \(x\in X,\pi ^{1}\in {\Pi }_{1},\pi ^{2}\in {\Pi }_{2}\), by the Ionescu-Tulcea theorem (Hernández-Lerma and Lasserre 2012a, P.178), there exist a unique probability measure \(\mathbb {P}_{x}^{\pi ^{1}, \pi ^{2}}\) on \(({\Omega }, \mathcal {F})\) and a stochastic process {Tn,Xn,An,Bn, n ≥ 0} such that for each \(D\in {\mathscr{B}}(X),D_{1}\in {\mathscr{B}}(A),D_{2}\in {\mathscr{B}}(B)\) and n ≥ 0, we have
Here and in what follows, we denote by \(\mathbb {E}_{x}^{\pi ^{1}, \pi ^{2}}\) the expectation operator with respect to \(\mathbb {P}_{x}^{\pi ^{1}, \pi ^{2}}\).
To exclude the possibility of infinitely many decision epochs within a finite time interval, we impose an assumption on the semi-Markov kernel, which is also used in Lal and Sinha (1992), Luque-Vásquez (2002a, 2002b), and the references therein.
Assumption 1
There exist constants 𝜃 > 0 and δ > 0 such that
Proposition 1
If Assumption 1 holds, then for each fixed x ∈ X and \(\pi ^{1}\in {\Pi }_{1},\pi ^{2}\in {\Pi }_{2}\), we have
The proof of Proposition 1 is provided in Appendix B.
Since \(T_{n}\overset {P}{\longrightarrow }\infty \), there is no need to consider the process for \(t {>} T_{\infty }=\lim \limits _{n\rightarrow \infty } T_{n}\). Now, we define the underlying continuous-time state-action process {X(t),A(t),B(t),t ≥ 0}, which corresponds to the stochastic process {Tn,Xn,An,Bn,n ≥ 0} on the probability space \(({\Omega }, \mathcal {F}, \mathbb {P}_{x}^{\pi ^{1}, \pi ^{2}})\), by
where \(\mathbbm {1}_{E}\) is an indicator function on any set E.
Definition 3
The stochastic process {X(t),A(t),B(t),t ≥ 0} is called a semi-Markov game.
Next, we define the expected discounted criterion considered in this paper.
Definition 4
For each \((\pi ^{1}, \pi ^{2})\in {\Pi }_{1} \times {\Pi }_{2}\), the initial state x ∈ X and discount factor α(⋅) > 0, the expected discounted reward/payoff for P1/P2 is defined as follows:
P1 aims to maximize the reward while P2 aims to minimize the payoff. Both players aim to find an optimal strategy.
Definition 5
The upper value and lower value of the expected discounted SMG are defined as
respectively. Obviously, U(x) ≥ L(x) for all x ∈ X. Moreover, if it holds that L(x) = U(x) for all x ∈ X, then the common function is called the value function of the game and denoted by V∗.
Definition 6
Assume that the game has a value function V∗(⋅). Then a strategy \(\pi ^{1}_{*}\in {\Pi }_{1}\) is said to be optimal for P1 if
Similarly, \(\pi ^{2}_{*}\in {\Pi }_{2}\) is said to be optimal for P2 if
If \(\pi ^{i}_{*}\) is optimal for player i (i = 1,2), then we call \((\pi ^{1}_{*}, \pi ^{2}_{*})\) an optimal pair of strategies.
Remark 2
\((\pi ^{1}_{*}, \pi ^{2}_{*})\) is an optimal pair of strategies if and only if
Remark 2 provides an effective method to verify whether a pair of strategies (π1,π2) is optimal, and it is widely used in the literature on two-person zero-sum stochastic games; see, for instance, González-Sánchez et al. (2019), Luque-Vásquez (2002a, 2002b), and the references therein.
3 Main results
This section focuses on the existence of the value function and of an optimal pair of stationary strategies, which requires imposing suitable assumptions on the model parameters. We first introduce some notation for convenience.
Given a measurable function \(\omega : X\rightarrow [1, \infty )\), a function u on X is said to be ω-bounded if it has a finite ω-norm, which is defined as
$$ \|u\|_{\omega}:=\sup_{x\in X}\frac{|u(x)|}{\omega(x)}; $$
such a function ω is referred to as a weight function. For convenience, we denote by Bω(X) the Banach space of all ω-bounded measurable functions on X.
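For instance (an illustration of ours), take X = ℝ and the weight function ω(x) = 1 + x2. Then the unbounded function u(x) = x2 satisfies

$$ \|u\|_{\omega}=\sup_{x\in\mathbb{R}}\frac{x^{2}}{1+x^{2}}=1<\infty, $$

so u ∈ Bω(X) even though u has no finite sup-norm; this is exactly the situation of the quadratic reward rate in Example 1 below.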
For each given function u ∈ Bω(X) and (x,a,b) ∈ K, we write
For each fixed x ∈ X and probability measures \(\mu \in \mathbb {A}(x) := \mathbb {P}(A(x)) \text { and } \lambda \in \mathbb {B}(x) := \mathbb {P}(B(x))\), we denote
whenever the integral is well defined.
We define an operator T on Bω(X) by
which is called the Shapley operator. A function v ∈ Bω(X) is said to be a solution of the Shapley equation if v(x) = Tv(x) for all x ∈ X.
In order to explore the existence of an optimal pair of stationary strategies, we also need to define a stationary-strategy-dependent operator T(φ1,φ2) on Bω(X) by
where (φ1,φ2) ∈Φ1 ×Φ2 is a pair of stationary strategies.
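For orientation, in the constant-discount case of Luque-Vásquez (2002a) the Shapley operator has the following typical form, which we sketch here adapted to state-action-dependent discount factors (the displayed formula is our reconstruction and may differ in details from the one used in this paper):

$$ Tu(x)=\sup_{\mu\in\mathbb{A}(x)}\inf_{\lambda\in\mathbb{B}(x)}\left\{r(x,\mu,\lambda){\int}_{0}^{\infty}e^{-\alpha(x,\mu,\lambda)t}\bigl(1-H(t|x,\mu,\lambda)\bigr)dt+{\int}_{0}^{\infty}{\int}_{X}e^{-\alpha(x,\mu,\lambda)t}u(y)\,Q(dt,dy|x,\mu,\lambda)\right\}, $$

where the first term is the expected discounted reward accumulated during the sojourn at x and the second is the discounted continuation value; for T(φ1,φ2), the sup and inf are replaced by integration against φ1(⋅|x) and φ2(⋅|x).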
Next, we impose some hypotheses to guarantee the existence of an optimal pair of stationary strategies. The framework settled by these hypotheses has become quite standard for the study of semi-Markov models (Chen et al. 2021; Huang and Guo 2011, 2010; Lal and Sinha 1992; Luque-Vásquez 2002a, 2002b; Minjárez-Sosa and Luque-Vásquez 2008) and of varying discount factors (Feinberg and Shwartz 1994; González-Hernández et al. 2009; González-Sánchez et al. 2019; Minjárez-Sosa 2015; Wei and Guo 2011; Ye and Guo 2012; Wu and Zhang 2016).
Assumption 2
(a) There exists a constant α0 > 0 such that α(x,a,b) ≥ α0 for all (x,a,b) ∈ K. (b) There exist a measurable function \(\omega : X\rightarrow [1, \infty )\) and a non-negative constant M such that for all (x,a,b) ∈ K,
The key point is that Assumption 2 ensures a finiteness property of the expected discounted reward. Below we give an important consequence of Assumptions 1 and 2(a).
Lemma 1
If Assumptions 1&2(a) hold, then there exists a constant 0 < γ < 1 such that for each (x,a,b) ∈ K,
The proof of Lemma 1 is provided in Appendix A.
Assumption 3
There exists a constant η with 0 < ηγ < 1 such that for each fixed t ≥ 0 and (x,a,b) ∈ K,
where ω(⋅) is the function mentioned in Assumption 2.
Remark 3
-
(1)
We call Assumption 3 the “drift condition”; it is needed to ensure that the Shapley operator (defined in (3)) is a contraction with respect to the weighted norm, and hence to establish our main results. Obviously, Assumption 3 holds automatically when r is bounded, by taking ω(⋅) ≡ 1 and η = 1.
-
(2)
In particular, if Q(t,y|x,a,b) = H(t|x,a,b)P(y|x,a,b), where P(y|x,a,b) denotes the state transition probability, then (5) reduces to \({\int \limits }_{X} \omega (y) P(dy|x,a,b) \leq \eta \omega (x)\), which coincides with Assumption 3(b) of Luque-Vásquez (2002a) and Assumption 1(e) of González-Sánchez et al. (2019). Thus, our Assumption 3 is more general than its counterparts in Luque-Vásquez (2002a), and Example 3 of González-Sánchez et al. (2019) concerning SMGs is a special case of ours.
-
(3)
Combining Lemma 1 with Assumption 3, it is easy to derive
$$ {\int}_{0}^{\infty} e^{-\alpha(x,a,b) t}{\int}_{X} u(y) Q(d t, d y | x, a, b)\leq \eta\gamma\|u\|_{\omega}\omega(x),\quad\forall u\in B_{\omega}(X),(x,a,b)\in K. $$(6)
Moreover, we impose the following continuity-compactness conditions to ensure the existence of an optimal pair of stationary strategies of our SMG model.
Assumption 4
-
(a)
For each fixed x ∈ X, both A(x) and B(x) are compact sets.
-
(b)
For each fixed (x,a,b) ∈ K, r(x,⋅,b) is upper semi-continuous on A(x) and r(x,a,⋅) is lower semi-continuous on B(x).
-
(c)
For each fixed (x,a,b) ∈ K, t ≥ 0 and v ∈ Bω(X), the functions
$$ a \longmapsto \int v(y) Q(t,d y | x, a, b) \quad \text { and } \quad b \longmapsto \int v(y) Q(t,d y | x, a, b) $$are continuous on A(x) and B(x), respectively.
-
(d)
For each fixed t ≥ 0, H(t|⋅,⋅,⋅) is continuous on K.
-
(e)
The function α(x,a,b) is continuous on K.
Remark 4
-
(1)
Assumption 4 is similar to the standard continuity-compactness hypotheses for Markov control processes; see, for instance, Hernández-Lerma and Lasserre (2012b) and the references therein. It is commonly used to establish the existence of minimax points of games.
-
(2)
By Lemma 1.11 in Nowak (1984), if Assumption 4(a) holds, then the probability spaces \(\mathbb {A}(x) \text { and } \mathbb {B}(x)\) are also compact for each x ∈ X.
-
(3)
These continuity-compactness conditions are aimed at infinite state and action spaces; they hold trivially when X and A, B are finite.
Now we present our main result, Theorem 1 below, which extends the analysis given in Luque-Vásquez (2002a) for SMGs with a constant discount factor to SMGs with a common state-time distribution and state-action-dependent discount factors.
Theorem 1
Suppose that Assumptions 1-4 hold. Then
-
(a)
The SMG has a value function V∗(⋅), which is the unique function in Bω(X) that satisfies the Shapley equation, i.e.,
$$ V^{*}(x)=TV^{*}(x),\quad\forall x\in X, $$and furthermore, there exists an optimal pair of stationary strategies.
-
(b)
A pair of stationary strategies \((\varphi _{1}^{*}, \varphi _{2}^{*}) \in {\Phi }_{1} \times {\Phi }_{2}\) is optimal if and only if its expected discounted reward satisfies the Shapley equation, i.e., \(TV(x,\varphi _{1}^{*},\varphi _{2}^{*})=V(x,\varphi _{1}^{*},\varphi _{2}^{*})\) for all x ∈ X.
4 Preliminaries and proofs
In this section, we present some results needed to prove Theorem 1. Some of these results are already known in the literature (Luque-Vásquez 2002a, 2002b), but we state them here for completeness and ease of reference.
Lemma 2
Suppose that Assumptions 1-4 hold, then for each given function u ∈ Bω(X), the function Tu is in Bω(X) and
Moreover, there exists a pair of stationary strategies (φ1,φ2) ∈Φ1 ×Φ2 such that
Lemma 3
Both T and T(φ1,φ2) are contraction operators with modulus less than 1.
Since T and T(φ1,φ2) are both contraction operators, by Banach’s fixed point theorem there exist unique functions u∗∈ Bω(X) and \(u_{\varphi _{1},\varphi _{2}}^{*}\in B_{\omega }(X)\) such that Tu∗(⋅) = u∗(⋅) and \(T(\varphi _{1},\varphi _{2})u_{\varphi _{1},\varphi _{2}}^{*}(\cdot )=u_{\varphi _{1},\varphi _{2}}^{*}(\cdot )\).
Before stating the next important result, we give the definition of the m-shift strategy (Hernández-Lerma and Lasserre 2012b, P.96).
Definition 7
Given a strategy \(\pi ^{i}=\left \{{\pi _{n}^{i}},n=0,1,\ldots \right \}\in {\Pi }_{i}, i=1,2,\) and an integer m ≥ 0, the corresponding m-shift strategy \(^{(m)}\!\pi ^{i}=\left \{{~}^{(m)}\!{\pi ^{i}_{n}}, n=0,1,\ldots \right \}\) is given by
and for n = 1,2,…,
where hm := (t0,x0,a0,b0,…,tm− 1,xm− 1,am− 1,bm− 1,tm,xm) denotes the admissible history at the m-th decision epoch.
Lemma 4
For each \((\pi ^{1}, \pi ^{2})\in {\Pi }_{1} \times {\Pi }_{2}\) and x ∈ X,
where \(^{(1)}\!\pi ^{i}:=({\pi _{n}^{i}},n\geq 1)\) is the 1-shift strategy of \(\pi ^{i}:=({\pi _{n}^{i}},n\geq 0)\), \(\pi _{0}^{i\infty }=({\pi _{0}^{i}},{\pi _{0}^{i}},\cdots )\), i = 1,2, and \((\pi _{0}^{1\infty }, \pi _{0}^{2\infty })\) is a pair of stationary strategies.
Now, setting π1 = φ1 and π2 = φ2, which are both stationary strategies, we obtain from Lemma 4 that
which implies that the function V (x,φ1,φ2) is the unique fixed point of the contraction operator T(φ1,φ2).
Lemma 5
Suppose that Assumptions 1-3 hold and let \((\pi ^{1}, \pi ^{2})\in {\Pi }_{1} \times {\Pi }_{2}\). Then for each x ∈ X and u ∈ Bω(X), we have
Proofs of Lemmas 2-5 are provided in Appendix A. Next, a complete proof of Theorem 1 is given by applying Lemmas 2-5.
Proof of Theorem 1 (a) Let u∗ be the unique fixed point of T in Bω(X), that is
By Lemma 2, there exists a pair of stationary strategies (φ1,φ2) ∈Φ1 ×Φ2 such that for each x ∈ X,
which implies that
Moreover, by Lemma 4,
from which we can derive
Next, we prove that u∗ is the value function of the game and (φ1,φ2) is an optimal pair of stationary strategies, that is
We only prove the first inequality in (10); the second follows by a similar argument. By (9), we have
In particular, taking λ to be the Dirac measure concentrated at b, we have, for each b ∈ B(x),
For each given \(h_{n}=(t_{0},x_{0},a_{0},b_{0},\ldots ,t_{n},x_{n})\in {\mathscr{H}}_{n}\) and bn ∈ B(xn), we have
For any \(\pi ^{2}\in {\Pi }_{2}\), integrating both sides of the above inequality with respect to bn, we have
Since the above inequality holds for all \(h_{n}\in {\mathscr{H}}_{n}\), replacing xn by the random variable Xn, we have
Multiplying both sides of the above inequality by \(e^{-{\int \limits }_{0}^{T_{n}}\alpha (X(s), A(s), B(s))ds}\) and using the properties of conditional expectation, we have
Then, taking the expectation \(\mathbb {E}_{x}^{\varphi _{1}, \pi ^{2}}\), we have
Now, summing over \(n=0,1,2,\dots ,N\), we obtain
Letting \(N\rightarrow \infty \) and applying Lemma 5, we derive
which means that the first inequality in (10) holds.
(b) (⇒)
Suppose that \((\varphi _{1}^{*}, \varphi _{2}^{*}) \in {\Phi }_{1} \times {\Phi }_{2}\) is an optimal pair of stationary strategies, then for each \(x\in X,(\pi ^{1},\pi ^{2}) \in {\Pi }_{1}\times {\Pi }_{2}\), we have
For each fixed \(\lambda \in \mathbb {B}(x)\), let \(\pi ^{2}=\{{\pi _{n}^{2}},n\geq 0\}\) with \({\pi _{0}^{2}}=\lambda \) and \({\pi _{n}^{2}}=\varphi _{2}^{*}, n\geq 1\), then by Lemma 4, for each x ∈ X, we have
which yields
Similarly, we can prove
Combining the last two inequalities, we obtain the desired result.
(⇐)
This implication has already been established in the proof of part (a).
5 Algorithm
In this section, for the feasible implementation of algorithms, the state and action spaces are assumed to be finite. To avoid extra notation, we retain the ω-norm with ω(⋅) ≡ 1, which is the infinity norm, i.e., \(\|u\|_{\omega }=\sup \limits _{x\in X}|u(x)|=\|u\|_{\infty },~\forall u\in B_{\omega }(X)\). We develop an iterative algorithm to approximate the value function and an optimal pair of stationary strategies of our two-person zero-sum SMG, in which a matrix game is solved numerically at every state in a value-iteration fashion.
Without loss of generality, we assume that \(A(x):=\{a_{1},a_{2},\dots ,a_{m_{1}}\}\) and \(B(x):=\{b_{1},b_{2},\dots ,b_{m_{2}}\}\), for any \(x\in X :=\{x_{0},x_{1},\dots ,x_{n-1}\}\). Under Assumptions 1-4 presented in Section 3, we rewrite the Shapley equation in a matrix form as follows
where \(\boldsymbol {\mathcal G}(u,x)\) denotes an m1 × m2-dimensional matrix for each fixed x ∈ X and given function u ∈ Bω(X), with elements defined as
and \(\boldsymbol \varphi _{k}(x):=(\varphi _{k}(1|x),\dots ,\varphi _{k}(m_{k}|x))^{T}\) is an mk-dimensional column vector for k = 1,2. Obviously, the last equation of (11) can be viewed as a matrix game with reward matrix \(\boldsymbol {\mathcal G}(V^{*},x)\) at each state x ∈ X.
However, we cannot solve (11) directly since the value function V∗ is unknown. Below, we develop Algorithm 1 to iteratively compute a sequence of matrix games whose values asymptotically approach V∗(x) at each state x. From lines 11-12 of Algorithm 1, we can see that at the n-th iteration, Vn(x) and \(({\boldsymbol \varphi _{1}^{n}}(x),{\boldsymbol \varphi _{2}^{n}}(x))\) are obtained by using LP to solve the game with reward matrix \(\boldsymbol {\mathcal G}(V_{n-1},x)\). This iterative procedure for computing the sequence Vn is similar to the classic value iteration algorithm in MDP theory. Since iterative algorithms usually converge to approximate solutions in finitely many steps, we give the definition of ε-optimality.
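The matrix-game step in lines 11-12 can be carried out with a standard LP formulation. Below is a self-contained sketch (the function name, the positivity shift, and the use of `scipy.optimize.linprog` are our choices, not part of Algorithm 1):

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(G):
    """Value and an optimal mixed strategy for the row (maximizing) player
    of the zero-sum matrix game with reward matrix G (shape m1 x m2)."""
    m1, m2 = G.shape
    # Shift all entries to be strictly positive; the game value shifts by
    # the same constant, which is subtracted back at the end.
    shift = 1.0 - G.min()
    Gs = G + shift
    # Variables: (p_1, ..., p_m1, v).  Maximize v subject to
    #   Gs^T p >= v (componentwise),  sum(p) = 1,  p >= 0.
    c = np.zeros(m1 + 1)
    c[-1] = -1.0                                  # minimize -v
    A_ub = np.hstack([-Gs.T, np.ones((m2, 1))])   # v - (Gs^T p)_j <= 0
    b_ub = np.zeros(m2)
    A_eq = np.hstack([np.ones((1, m1)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * m1 + [(None, None)]  # p >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1] - shift, res.x[:m1]
```

At iteration n of Algorithm 1, one would call this with G = 𝒢(Vn−1,x) for each state x; the column player’s optimal strategy comes from the dual of the same LP (or from the symmetric program with G replaced by −GT).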
Definition 8
Assume that the game has a value function V∗. Then a pair of strategies \((\pi ^{1}_{\varepsilon },\pi ^{2}_{\varepsilon })\in {\Pi }_{1}\times {\Pi }_{2}\) is said to be ε-optimal of the game if
Moreover, \(V_{\varepsilon }(\cdot ):=V(\cdot , \pi ^{1}_{\varepsilon }, \pi ^{2}_{\varepsilon })\) is called the ε-value function of the game.
Furthermore, we establish Theorem 2 to guarantee the convergence of Algorithm 1.
Theorem 2
Under Algorithm 1, for any given 𝜖 > 0 and initial value \(V_{0}\in \mathbb {R}\), there exists a non-negative integer \(N_{\epsilon }=\big (1+\lfloor \log _{\eta \gamma } (\frac {\epsilon }{\|TV_{0}-V_{0}\|_{\omega }}) \rfloor \big )\mathbbm {1}_{TV_{0}\neq V_{0}}\) such that \(\|V_{N_{\epsilon }+1}-V_{N_{\epsilon }}\|_{\omega }<\epsilon \), which implies that Algorithm 1 converges within N𝜖 iterations. Moreover, the strategy pair \((\varphi _{1}^{\varepsilon },\varphi _{2}^{\varepsilon })\) output by Algorithm 1 is ε-optimal with \(\varepsilon =\frac {\epsilon }{1-\eta \gamma }\).
Proof
According to the iterative formula of Algorithm 1, we have
which by iteration yields
For each given 𝜖 > 0 and initial value \(V_{0}\in \mathbb {R}\), if TV0 = V0, choose N𝜖 = 0, and we have
otherwise, if TV0≠V0, choose \(N_{\epsilon }=1+\lfloor \log _{\eta \gamma } (\frac {\epsilon }{\|TV_{0}-V_{0}\|_{\omega }}) \rfloor \), and we have
Combining the two cases above, we choose \(N_{\epsilon }=\big (1+\lfloor \log _{\eta \gamma } (\frac {\epsilon }{\|TV_{0}-V_{0}\|_{\omega }}) \rfloor \big )\mathbbm {1}_{TV_{0}\neq V_{0}}\) and obtain \(\|V_{N_{\epsilon }+1}-V_{N_{\epsilon }}\|_{\omega }<\epsilon \), which implies that Algorithm 1 converges within N𝜖 iterations.
Moreover, since V∗ is the unique solution of the Shapley equation, we have
thus,
taking n = N𝜖, and we have
which implies that \((\varphi _{1}^{\varepsilon },\varphi _{2}^{\varepsilon })\) is ε-optimal by Definition 8. □
Therefore, with Algorithm 1, we can iteratively approach the value function and an optimal pair of stationary strategies of our SMG problem by recursively solving a matrix game at each state x. Theorem 2 guarantees the convergence of Algorithm 1. We can implement Algorithm 1 to solve practical problems, as illustrated in the next section.
6 Numerical experiment
In this section, we present numerical examples to illustrate the main results derived in Sections 3-5. First, we give an example to demonstrate that Assumptions 1-4, which ensure the existence of the value function and an optimal pair of stationary strategies of SMGs, are easy to verify in practice.
Example 1
Consider a system with a model of SMG which is defined as follows:
The state space \(X:=(-\infty ,\infty )\) and the action spaces \(A:=[\underline {a},\bar {a}],B:=[\underline {b},\bar {b}]\) with admissible action sets A(x) := A,B(x) := B for each x ∈ X. The semi-Markov kernel is given by:
where Φ(⋅) and F(⋅) denote the cumulative distribution functions of a normal distribution with mean μ(x,a,b) and variance σ²(x,a,b) and of an exponential distribution with parameter β(x,a,b), respectively. The reward rate function is r(x,a,b) := x² + a + b. Moreover, the discount factor is defined as α(x,a,b) := e^{|x|+a+b}.
Now, we verify that the conditions on the existence of an optimal pair of stationary strategies described in Assumptions 1-4 are satisfied in this example. To this end, we need the following hypothesis:
Assumption 5
(a) The function β(x,a,b) is continuous on K and has both a positive lower bound \(\underline {\beta }\) and a positive upper bound \(\bar {\beta }\); (b) \(\mu ^{2}(x,a,b)+\sigma ^{2}(x,a,b)\leq \frac {1}{4}x^{2}+\frac {e^{\underline {a}+\underline {b}-1}}{8(\bar {\beta }+e^{\underline {a}+\underline {b}})},~\forall (x,a,b)\in K\).
With this hypothesis, we directly have the following result.
Proposition 2
Suppose that Assumption 5 holds. Then Example 1 satisfies Assumptions 1-4, which implies that the SMG has an optimal pair of stationary strategies.
The proof of Proposition 2 is provided in Appendix B.
Next, we give another example, an investment problem, to demonstrate the numerical computation of Algorithm 1 for the value function and an optimal pair of stationary strategies of the game.
Example 2
Consider an investment problem with three states {1,2,3}, which denote the benefit, medium, and loss economy environments, respectively. At each state, the investor buys some assets while the market-maker sells. The interest rate depends on the economy environment as well as on the number of assets that the investor buys and the market-maker sells. In state i ∈{1,2,3}, the investor buys a certain amount of assets from {ai1,ai2} and the market-maker sells from {bi1,bi2}, which leads to a reward rate r(i,a,b) to the investor and − r(i,a,b) to the market-maker, where a ∈{ai1,ai2}, b ∈{bi1,bi2}. The system then moves to a new state j with probability p(j|i,a,b) after staying at state i for a random sojourn time, which follows an exponential distribution with parameter β(i,a,b) for i ∈{1,2}, and is uniformly distributed on [0,β(3,a,b)] with parameter β(3,a,b) > 0 for i = 3. For this system, the decision makers aim to find an optimal pair of stationary strategies.
We establish an SMG model to solve this investment problem. The state space is X = {1,2,3}, action spaces are A(i) = {ai1,ai2}, B(i) = {bi1,bi2} for each i ∈ X and the semi-Markov kernel Q is given by:
from which we can obtain
and
Then by (2), we have
For the numerical calculation of this example, we assume the model parameter values shown in Table 1.
With this data setting, by taking α0 = 0.25,δ = 0.1,𝜃 = 0.023, Assumptions 1-4 obviously hold since both the state and action spaces are finite. Thus, the existence of the value function and an optimal pair of stationary strategies of the SMG are ensured by Theorem 1. Next, we use Algorithm 1 to find the approximate value function and an ε-optimal pair of stationary strategies of the game. The detailed steps are listed as follows.
Step 1: Initialization.
Let n = 0 and V0(⋅) = 1; set a small threshold 𝜖 := 10^{−4}, and we have \(\varepsilon =\frac {\epsilon }{1-\eta \gamma }=0.33\).
Step 2: Iteration.
For n ≥ 0, (a,b) ∈ A(i) × B(i), we have
Then, for each state i ∈{1,2,3}, we solve the linear program
with the solution denoted by \({\pi _{n}^{1}}(\cdot | i)\) where \({\pi _{n}^{1}}(a_{i1}| i)=f(i,a_{i1})\), \({\pi _{n}^{1}}(a_{i2}| i)=f(i,a_{i2})\).
Also we solve the dual program of (12)
with the solution denoted by \({\pi _{n}^{2}}(\cdot | i)\) where \({\pi _{n}^{2}}(b_{i1}| i)=g(i,b_{i1})\), \({\pi _{n}^{2}}(b_{i2}| i)=g(i,b_{i2})\). We set
Step 3: Termination judgement.
If \(\underset {i=1,2,3}{\max \limits }|V_{n+1}(i)-V_{n}(i)|<\epsilon \), then the iteration stops, Vn is the ε-value function and \(({\pi _{n}^{1}},{\pi _{n}^{2}})\) is an ε-optimal pair of stationary strategies of the SMG. Otherwise, set n = n + 1 and go to Step 2.
We use Matlab to implement the iteration algorithm for this example. It takes about 10 seconds to stop at the 93rd iteration. The curves of the error of two successive iterations, the value function, and the optimal pair of stationary strategies of players with respect to the iteration times are illustrated by Figs. 1, 2 and 3.
Based on the experimental results, we have the following observations:
1. When the state is benefit, the investor should take action a11 with probability 0.60217 and a12 with probability 0.39783, while the market-maker should take action b11 with probability 0.55737 and b12 with probability 0.44263;
2. When the state is medium, the investor should take action a21 with probability 0.87111 and a22 with probability 0.12889, while the market-maker should take action b21 with probability 0.77887 and b22 with probability 0.22113;
3. When the state is loss, the investor should always take action a31 while the market-maker should always take action b31;
4. If both the investor and the market-maker use the optimal strategies, the investor will obtain a profit of 12.6054 at the benefit state, 12.1271 at the medium state, and 11.1653 at the loss state, while the market-maker will lose the same amounts, respectively.
Remark 5
In this example, we choose a uniformly distributed sojourn time at state 3 to show that arbitrary distributions are permitted for the sojourn times of semi-Markov processes. Other distributions can also be chosen for the sojourn time according to practical situations. Moreover, if all the sojourn times are exponentially distributed, the semi-Markov games degenerate into continuous-time Markov games.
7 Conclusion
In this paper, we concentrate on the two-person zero-sum SMG with expected discounted criterion in which the discount factors are state-action-dependent. We first construct the SMG model with a fairly general definition setting. Then we impose suitable conditions on the model parameters, under which we establish the Shapley equation whose unique solution is the value function and prove the existence of an optimal pair of stationary strategies of the game. When the state and action spaces are finite, a value iteration-type algorithm for approximating the value function and an optimal pair of stationary strategies is developed. Finally, we apply our results to an investment problem, which demonstrates that our algorithm performs well.
One of the future research topics is to deal with the nonzero-sum case of this game model. We wish to find sufficient conditions under which similar arguments can be used to establish the Shapley equation and prove the existence of Nash equilibria for such games. In practice, many decision problems are continuous in states or actions. It is of interest to further study discretization techniques for continuous SMGs in order to apply our value iteration algorithm. Moreover, considering the limitations of computing resources, the dynamic programming algorithm is difficult to implement in reality when the scale of the game becomes huge. Another future research topic is to develop data-driven learning algorithms to approximately solve game problems, for instance in combination with multi-agent reinforcement learning approaches.
References
Ash RB, Doleans-Dade CA (2000) Probability and measure theory. Academic Press
Barron EN (2013) Game Theory: An Introduction, 2nd Edn. Wiley, New York
Chen F, Guo X, Liao ZW (2021) Discounted semi-Markov games with incomplete information on one side. arXiv:2107.07499
Fan K (1953) Minimax theorems. Proc Natl Acad Sci U S A 39 (1):42–47
Feinberg E A, Shwartz A (1994) Markov decision models with weighted discounted criteria. Math Oper Res 19(1):152–168
Filar J, Vrieze K (2012) Competitive Markov Decision Processes. Springer Science & Business Media, Berlin
González-Hernández J, López-martínez RR, Minjárez-Sosa JA (2009) Approximation, estimation and control of stochastic systems under a randomized discounted cost criterion. Kybernetika 45(5):737–754
González-Sánchez D, Luque-Vásquez F, Minjárez-Sosa J A (2019) Zero-Sum Markov games with random State-Actions-Dependent discount factors: Existence of optimal strategies. Dyn Games Appl 9(1):103–121
Hernández-Lerma O, Lasserre JB (2012a) Discrete-time markov control processes: basic optimality criteria, vol 30. Springer Science & Business Media, Berlin
Hernández-Lerma O, Lasserre JB (2012b) Further topics on discrete-time markov control processes, vol 42. Springer Science & Business Media
Hoffman A J, Karp R M (1966) On nonterminating stochastic games. Manag Sci 12(5):359–370
Huang Y, Guo X (2011) Finite horizon semi-Markov decision processes with application to maintenance systems. Eur J Oper Res 212(1):131–140
Huang Y H, Guo X P (2010) Discounted semi-Markov decision processes with nonnegative costs. Acta Math Sinica (Chinese Series) 53:503–514
Kirman A P, Sobel M J (1974) Dynamic oligopoly with inventories. Econometrica: Journal of the Econometric Society, 279–287
Lal A K, Sinha S (1992) Zero-sum two-person semi-Markov games. J Appl Probab 29(1):56–72
Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: Proceedings of the Eleventh International Conference on Machine Learning, pp 157–163
Luque-Vásquez F (2002a) Zero-sum semi-Markov game in Borel spaces with discounted payoff. Morfismos 6(1):15–29
Luque-Vásquez F (2002b) Zero-sum semi-Markov games in Borel spaces: discounted and average payoff. Bol Soc Mat Mexicana 8:227–241
Minjárez-Sosa J A (2015) Markov control models with unknown random state–action-dependent discount factors. Top 23(3):743–772
Minjárez-Sosa J A, Luque-Vásquez F (2008) Two person zero-sum semi-Markov games with unknown holding times distribution on one side: a discounted payoff criterion. Appl Math Optim 57(3):289–305
Mondal P, Sinha S, Neogy S K, Das A K (2016) On discounted AR-AT semi-Markov games and its complementarity formulations. Int J Game Theory 45(3):567–583
Mondal P, Neogy S, Gupta A, Ghorui D (2020) A policy improvement algorithm for solving a mixture class of perfect information and AR-AT semi-Markov games. Int Game Theory Rev 22(02):2040008
Nowak A S (1984) On zero-sum stochastic games with general state space I. Probab Math Stat 4(1):13–32
Pollatschek M, Avi-Itzhak B (1969) Algorithms for stochastic games with geometrical interpretation. Manag Sci 15(7):399–415
Raut LK (1990) Two-sided altruism, Lindahl equilibrium and Pareto efficiency in overlapping generations models. Department of Economics, University of California
Ross SM (1970) Average cost semi-Markov decision processes. J Appl Probability 7(3):649–656
Schäl M (1975) Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 32(3):179–196
Shapley L S (1953) Stochastic games. Proc Natl Acad Sci 39 (10):1095–1100
Tanaka K, Wakuta K (1976) On semi-Markov games. Science Reports of Niigata University Series A, Mathematics 13:55–64
Vega-Amaya Ó, Luque-Vásquez F, Castro-Enríquez M (2022) Zero-sum average cost semi-markov games with weakly continuous transition probabilities and a minimax semi-markov inventory problem. Acta Appl Math 177(1):1–27
Wei Q, Guo X (2011) Markov decision processes with state-dependent discount factors and unbounded rewards/costs. Oper Res Lett 39(5):369–374
Wu X, Zhang J (2016) Finite approximation of the first passage models for discrete-time Markov decision processes with varying discount factors. Discrete Event Dyn Syst 26(4):669–683
Ye L, Guo X (2012) Continuous-time Markov decision processes with state-dependent discount factors. Acta Applicandae Math 121(1):5–27
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (62073346, 11931018, U1811462), the Guangdong Basic and Applied Basic Research Foundation (2021A1515011984), and the Guangdong Province Key Laboratory of Computational Science at the Sun Yat-sen University (2020B1212060032).
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Appendices
Appendix A: Proofs of Lemmas 1-5
Proof
(Lemma 1) For each fixed (x,a,b) ∈ K, integrating by parts, we have
Let \(\gamma =1-\delta +\delta e^{-\alpha _{0}\theta }\), which yields (4). □
Proof
(Lemma 2)
By Assumption 2 and formulation (6), for each given function u ∈ Bω(X) and (x,a,b) ∈ K, we can easily get
The above inequality yields \(\|G(u, \cdot , a, b)\|_{\omega }\leq \frac {M}{\alpha _{0}}+\eta \gamma \|u\|_{\omega }\), which implies that G(u,⋅,a,b) ∈ Bω(X), and so Tu ∈ Bω(X).
On the one hand, by Assumption 4, it follows that G(u,x,⋅,b) is upper semi-continuous on A(x); then for each fixed \(\lambda \in \mathbb {B}(x)\), by Fatou's theorem, the function
is also upper semi-continuous on A(x). Moreover, since the set of probability measures on \({\mathscr{B}}(X)\) is endowed with the topology of weak convergence, by Theorem 2.8.1 in Ash et al. (2000), the function G(u,x,⋅,λ) is upper semi-continuous on \(\mathbb {A}(x)\). Similarly, G(u,x,μ,⋅) is lower semi-continuous on \(\mathbb {B}(x)\). Thus, by Theorem A.2.3 in Ash et al. (2000), the supremum and the infimum are indeed attained in (3), which means
Then, by Fan's minimax theorem (Fan 1953), we obtain (7).
On the other hand, since \(\mathbb {A}(x)\) and \(\mathbb {B}(x)\) are compact, by the well-known measurable selection theorem for minimax problems (Nowak 1984), there exists a pair of stationary strategies (φ1,φ2) ∈Φ1 ×Φ2 that satisfies (8). □
Proof
(Lemma 3)
First, it is easy to verify that the operator T(φ1,φ2) is monotonically increasing. Let u, v ∈ Bω(X). By the definition of the ω-norm, u(⋅) ≤ v(⋅) + ∥u − v∥ω ω(⋅); it follows that for each fixed x ∈ X, we have
where the last inequality follows from (6). Furthermore, taking the maximum over φ1 ∈ Φ1 and the minimum over φ2 ∈ Φ2 on both sides of inequality (13), we have
i.e.,
Similarly, interchanging u and v, we obtain
Combining the two inequalities above, we have
i.e.,
which implies that T is a contraction operator with modulus ηγ < 1. Using the same arguments, we can prove that T(φ1,φ2) is also a contraction operator with modulus ηγ < 1. □
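In compact form, the contraction estimate established above reads (a restatement in the notation of the proof; the elided displays follow the same pattern):

```latex
\begin{aligned}
Tu(x) - Tv(x) &\le \eta\gamma\,\|u-v\|_{\omega}\,\omega(x),\\
Tv(x) - Tu(x) &\le \eta\gamma\,\|v-u\|_{\omega}\,\omega(x),
\end{aligned}
\qquad\text{hence}\qquad
\|Tu - Tv\|_{\omega} \le \eta\gamma\,\|u-v\|_{\omega} < \|u-v\|_{\omega}.
```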
Proof
(Lemma 4)
where the third and fourth equalities are ensured by the property of conditional expectation, and the fifth equality follows from the strong Markov property. Hence,
which is required. □
Proof
(Lemma 5)
For all n ≥ 1 and x ∈ X, we have
where the first and second equalities are ensured by the property of conditional expectation, and the last inequality follows from (6). Iterating, we obtain
which yields Lemma 5. □
Appendix B: Proofs of Propositions 1-2
Proof
(Proposition 1)
By the property of conditional expectation and Lemma 1, we have
where the last inequality follows directly from the proof of Lemma 1 by taking α0 := 1. Iterating, we obtain
For any given t > 0, by Chebyshev's inequality, we obtain
noticing that \(1-\delta +\delta e^{-\theta }<1\), we have
since t is arbitrary, Proposition 1 holds. □
Proof
(Proposition 2)
Using the marginal distribution formula, the corresponding distribution of sojourn time is given as follows,
It is easy to show that Assumption 2 holds by choosing \(\alpha _{0}=e^{\underline {a}+\underline {b}}\), ω(x) = x² + 1 and \(M=3\max \limits \{|\bar {a}|,|\underline {a}|,|\bar {b}|,|\underline {b}|,1\}\). Now, let \(\theta =\frac {1}{\alpha _{0}}\ln \left (1+\frac {\alpha _{0}}{\bar {\beta }}\right )\) and \(\delta =\left (1+\frac {\alpha _{0}}{\bar {\beta }}\right )^{-\bar {\beta }/\alpha _{0}}\); then for each (x,a,b) ∈ K,
thus, Assumption 1 holds.
Next we verify Assumption 3. According to Lemma 1 and its proof, we can choose \(\gamma =1-\delta +\delta e^{-\alpha _{0} \theta }\) and further take \(\eta =\frac {2}{1+\gamma }\), then
which implies that Assumption 3 holds. Finally, since the reward rate r(x,a,b), the discount factor α(x,a,b) and the semi-Markov kernel Q(t,y|x,a,b) are continuous on K, Assumption 4 holds. Hence, the SMG of Example 1 has an optimal pair of stationary strategies. □
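The constants θ, δ, γ, and η appearing in this verification are straightforward to check numerically. A small sketch; the bounds \(\underline{a}+\underline{b}=0\) (so α0 = 1) and \(\bar{\beta}=2\) are illustrative choices of our own:

```python
import math

# Illustrative bounds: a_low + b_low = 0 (so alpha_0 = 1) and beta_bar = 2.0.
alpha0 = math.exp(0.0)                       # alpha_0 = e^{a_low + b_low}
beta_bar = 2.0

theta = math.log(1.0 + alpha0 / beta_bar) / alpha0
delta = (1.0 + alpha0 / beta_bar) ** (-beta_bar / alpha0)
gamma = 1.0 - delta + delta * math.exp(-alpha0 * theta)
eta = 2.0 / (1.0 + gamma)

# Assumption 3 requires eta * gamma < 1; note that with eta = 2/(1+gamma),
# eta * gamma = 2*gamma/(1+gamma) < 1 automatically whenever gamma < 1.
```

With these numbers, γ = 23/27 ≈ 0.852 and ηγ = 0.92 < 1, consistent with the choice η = 2/(1 + γ) in the proof.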
About this article
Cite this article
Yu, Z., Guo, X. & Xia, L. Zero-sum semi-Markov games with state-action-dependent discount factors. Discrete Event Dyn Syst 32, 545–571 (2022). https://doi.org/10.1007/s10626-022-00366-4