Abstract
The semi-Markov model is one of the most general models for stochastic dynamic systems. This paper deals with a two-person zero-sum game for semi-Markov processes. We focus on the expected discounted criterion with state-action-dependent discount factors. The state and action spaces are both Polish spaces, and the reward rate function is ω-bounded. We first construct a fairly general model of semi-Markov games under a given semi-Markov kernel and a pair of strategies. Next, based on the standard regularity condition and the continuity-compactness condition for semi-Markov games, we derive a “drift condition” on the semi-Markov kernel and suppose that the discount factors have a positive lower bound, under which the existence of the value function and of an optimal pair of stationary strategies for our semi-Markov game is proved by means of a general Shapley equation. Moreover, in the scenario of finite state and action spaces, a value-iteration-type algorithm for approximating the value function and an optimal pair of stationary strategies is developed. The convergence and the error bound of the algorithm are also proved. Finally, we present numerical examples to demonstrate the main results.
1 Introduction
This paper deals with two-person zero-sum semi-Markov games (SMGs) with the expected discounted criterion, which generalize discrete-time Markov games (DTMGs) (Shapley 1953) in that the sojourn time between two consecutive decision epochs may follow an arbitrary distribution rather than being constant. Such games have already been studied in the literature (Chen et al. 2021; Lal and Sinha 1992; Luque-Vásquez 2002a, 2002b; Minjárez-Sosa and Luque-Vásquez 2008; Mondal et al. 2016, 2020). However, the existing references are all restricted to the case where the discount factor is a constant, which may not always hold in practice.
In this paper, we study a more general class of SMGs with varying discount factors to better match real-world problems. Consider, for example, an application in economics: the discount factor (interest rate) may depend both on the economic environment and on the decision-makers’ actions. That is, the interest rate usually varies across financial markets and monetary policies, where financial markets can be regarded as states and monetary policies as actions taken by the government. Problems with non-constant discount factors have been studied for Markov decision processes (MDPs) (Feinberg and Shwartz 1994; González-Hernández et al. 2009; Schäl 1975; Wei and Guo 2011; Ye and Guo 2012; Minjárez-Sosa 2015; Wu and Zhang 2016) and for two-person zero-sum DTMGs (González-Sánchez et al. 2019). Furthermore, in light of the applications of SMGs to the dynamic oligopoly model (Kirman and Sobel 1974) and the dynamic overlapping generations model (Raut 1990), it is both necessary and reasonable to investigate SMGs with state-action-dependent discount factors so as to close this gap between theory and application.
On the other hand, most of the literature on SMGs focuses on the existence of optimal strategies. However, efficiently solving a stochastic dynamic game and computing an optimal pair of stationary strategies are especially important for the practical implementation of game theory. The classic algorithmic study of game theory focuses on matrix games, which can be solved by linear programming (LP) (Barron 2013). This LP technique has been combined with the policy iteration of MDPs to solve discounted two-person zero-sum DTMGs (Hoffman and Karp 1966; Pollatschek and Avi-Itzhak 1969), where LP is used to improve the strategy at every iteration. More recently, emerging investigations aim at the efficient computation of stochastic dynamic games using approximation or learning algorithms. Littman (1994) proposes a minimax-Q algorithm to solve discounted two-person zero-sum DTMGs, which is essentially the standard Q-learning algorithm with the max operator of reinforcement learning replaced by a minimax operator over Markov games. In addition, finite algorithms for some interesting special classes of stochastic games have also been widely studied, such as LP for DTMGs with a single controller, a switching controller, or separable reward and state-independent transitions (Filar and Vrieze 2012). Recently, similar results have been extended to SMGs. Mondal et al. (2016) study discounted two-person zero-sum SMGs with AR-AT-AITT (Additive Reward and Additive Transition and Action Independent Transition Times) structure. They prove that such a game can be formulated as a vertical linear complementarity problem (VLCP), which can be solved by Cottle-Dantzig’s algorithm. They further propose a policy improvement algorithm for solving a mixed class of perfect-information and AR-AT SMGs (Mondal et al. 2020).
Notice that the above algorithms for solving SMGs are only suitable for these special structures and fail to solve general discounted two-person zero-sum SMGs effectively.
In this paper, we study two-person zero-sum SMGs with the expected discounted criterion in which the discount factors are state-action-dependent. The objective is to find an optimal pair of stationary strategies that maximizes the reward of player 1 (P1) and minimizes the payoff of player 2 (P2). More precisely, we deal with SMGs specified by five primitive data: the state space X; the action spaces A,B for P1 and P2, respectively; the semi-Markov kernel Q(t,y|x,a,b); the discount factor α(x,a,b); and the reward rate function r(x,a,b). The state space X and action spaces A,B are all Polish spaces, and the reward rate function r(x,a,b) is ω-bounded. The semi-Markov kernel Q(t,y|x,a,b) is a joint distribution of the sojourn time and the next state for any given (x,a,b) ∈ X × A × B, which is more general than in the literature (Luque-Vásquez 2002a, 2002b; Minjárez-Sosa and Luque-Vásquez 2008; Mondal et al. 2016, 2020). To the best of our knowledge, the framework with a common state-time distribution has so far been used only in works dealing with semi-Markov decision processes (for non-negative reward rate functions (Huang and Guo 2011, 2010); for the average criterion with unbounded reward rate functions (Ross 1970)) and SMGs (for bounded reward rate functions (Chen et al. 2021; Lal and Sinha 1992; Tanaka and Wakuta 1976); for the average criterion with unbounded reward rate functions (Vega-Amaya et al. 2022)), and has not been studied for discounted SMGs with unbounded reward rate functions. This extension necessitates a new “drift condition”, which leads to a general Shapley equation and a more involved proof of optimality. With these new features, we construct an SMG model in a fairly general setting. Then we impose suitable conditions on the model parameters, stated in Assumptions 1-4, under which we establish the Shapley equation and prove the existence of the value function and of an optimal pair of stationary strategies of the game.
Our proof is quite different from that of González-Sánchez et al. (2019), since we search for optimal strategies directly within the class of history-dependent strategies instead of restricting attention to Markov strategies. Furthermore, in the scenario of finite state and action spaces, we derive a value-iteration-type algorithm, based on the Shapley equation, to approximate the value function and an optimal pair of stationary strategies of the game. The convergence and the error bound of the algorithm are also derived. Finally, we conduct numerical examples on an investment problem to demonstrate the main results of our paper.
The contributions of this paper can be summarized as follows. (1) This paper studies a fairly general model of SMGs with state-action-dependent discount factors and joint probability transition functions. We derive a “drift condition” (see Assumption 3) on the generic semi-Markov kernel, which is more general than its counterparts in the literature (Luque-Vásquez 2002a, 2002b; González-Sánchez et al. 2019), as stated in Remark 3. (2) In order to find an ε-optimal pair of stationary strategies and an approximate value function, a value-iteration-type algorithm is proposed, which can be viewed as a combination of the value iteration of MDPs and the LP of matrix games. Moreover, the convergence and the error bound of the algorithm are analyzed.
The rest of this paper is organized as follows. In Section 2, we introduce the model of SMG as well as the optimality criterion. Our main optimality results are stated in Section 3 and studied with the proof in Section 4. A value iteration-type algorithm for approximating an optimal pair of stationary strategies is developed in Section 5, and some numerical examples are conducted to demonstrate our main results in Section 6. Finally, we conclude the paper and discuss some future research topics in Section 7.
2 Two-person zero-sum semi-Markov game model
Notation: If E is a Polish space (that is, a complete and separable metric space), its Borel σ-algebra is denoted by \({\mathscr{B}}(E)\), and \(\mathbb {P}(E)\) denotes the family of probability measures on \({\mathscr{B}}(E)\), endowed with the topology of weak convergence.
In this section, we introduce a two-person zero-sum SMG model with expected discounted criterion and state-action-dependent discount factors, which is denoted by the collection
where the symbols are explained as follows.
-
X is the state space which is a Polish space, and A, B are action spaces for P1 and P2, respectively, which are also supposed to be Polish spaces.
-
A(x) and B(x) are Borel subsets of A and B, which represent the sets of the admissible actions for P1 and P2 at state x ∈ X, respectively. Let
$$ K := \{(x,a,b)|x\in X, a\in A(x), b\in B(x)\} $$be a measurable Borel subset of X × A × B.
-
Q(t,y|x,a,b) is a semi-Markov kernel which satisfies the following properties.
-
(a)
For each fixed (x,a,b) ∈ K, Q(⋅,⋅|x,a,b) is a probability measure on \([0,\infty ) \times X\), whereas for each fixed \(t \in [0,\infty ), D\in {\mathscr{B}}(X)\), Q(t,D|⋅,⋅,⋅) is a real-valued Borel function on K.
-
(b)
For each fixed (x,a,b) ∈ K and \(D\in {\mathscr{B}}(X)\), Q(⋅,D|x,a,b) is a non-decreasing right-continuous real-valued Borel function on \([0,\infty )\) such that Q(0,D|x,a,b) = 0.
-
(c)
For each fixed (x,a,b) ∈ K, we denote by
$$ H(\cdot|x,a,b) := Q(\cdot,X|x,a,b) $$the distribution function of the sojourn time at state x ∈ X when the actions a ∈ A(x),b ∈ B(x) are chosen. For each x ∈ X and \(D\in {\mathscr{B}}(X)\), when P1 and P2 select actions a ∈ A(x) and b ∈ B(x), respectively, Q(t,D|x,a,b) denotes the joint probability that the sojourn time in state x is not greater than \(t \in \mathbb {R}_{+}\) and the next state belongs to D.
-
α(x,a,b) is a measurable function from K to \((0,\infty )\), which denotes the state-action-dependent discount factor.
-
r(x,a,b) is a real-valued function on K, which represents the reward/payoff rate function for P1/P2.
Remark 1
In our SMG model, the semi-Markov kernel Q(⋅,⋅|x,a,b) is a joint probability distribution with respect to sojourn time and state for given (x,a,b) ∈ K. Thus, our model is more general than the counterpart in the literature (Lal and Sinha 1992; Luque-Vásquez 2002a, 2002b; González-Sánchez et al. 2019).
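As a concrete illustration of this generality (our example, not one from the paper): if the sojourn time is exponential with rate β(x,a,b) and independent of the next state, the kernel factorizes as

$$ Q(t,D|x,a,b)=\left(1-e^{-\beta(x,a,b)t}\right)P(D|x,a,b),\quad t\geq0,~D\in{\mathscr{B}}(X), $$

so that H(t|x,a,b) = 1 − e−β(x,a,b)t. A general joint kernel need not factorize in this way, since the sojourn time and the next state may be dependent.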
The evolution of SMGs with the expected discounted criterion proceeds as follows.
Assume that the game starts from the initial state x0 ∈ X at the initial decision epoch t0 := 0. The two players simultaneously choose actions a0 ∈ A(x0), b0 ∈ B(x0) based on t0 and x0; then P1 receives the immediate reward r(x0,a0,b0) and P2 incurs the immediate payoff r(x0,a0,b0). After staying at state x0 up to time t1 > t0, the system moves to a new state x1 ∈ D according to the transition law Q(t1 − t0,D|x0,a0,b0). Once the transition to x1 occurs at the first decision epoch t1, the process repeats itself and the game evolves in this way.
Thus, we obtain an admissible history at the n-th decision epoch
Letting the game run indefinitely, we obtain the infinite history
where tn ≤ tn+ 1, (xn,an,bn) ∈ K for all n ≥ 0. Moreover, let \({\mathscr{H}}_{n}\) be the class of all admissible histories hn of the system up to the n-th decision epoch, endowed with a Borel σ-algebra.
Let \(({\Omega }, \mathcal {F})\) be the canonical measurable space consisting of the sample space \({\Omega }=(\mathbb {R}_{+}\times X\times A\times B)^{\infty }\) and the corresponding product σ-algebra \(\mathcal {F}\). For each \(\omega =(t_{0},x_{0},a_{0},b_{0},t_{1},x_{1},\dots )\in {\Omega }\), we define a stochastic process \(\left \{T_{n}, X_{n}, A_{n}, B_{n}, n\ge 0\right \}\) on \(({\Omega }, \mathcal {F})\) by
Moreover, we denote by Hn the process of decision times, states and actions up to the n-th decision epoch:
To introduce the expected discounted criterion studied in this paper, we first define strategies as follows.
Definition 1
A randomized history-dependent strategy for P1 is a sequence of stochastic kernels \(\pi ^{1}:=({\pi _{n}^{1}},n\geq 0)\) that satisfies the following conditions:
-
(i)
for each \(D\in {\mathscr{B}}(A)\), \({\pi _{n}^{1}}(D|\cdot )\) is a Borel function on \({\mathscr{H}}_{n}\), and for each \(h_{n}\in {\mathscr{H}}_{n}\), \({\pi _{n}^{1}}(\cdot |h_{n})\) is a probability measure on A;
-
(ii)
\({\pi _{n}^{1}}(\cdot |h_{n})\) is concentrated on A(xn), that is
$$ {\pi_{n}^{1}}(A(x_{n})|h_{n})=1, ~~~ \forall h_{n}\in \mathcal{H}_{n} ~ \text{and}~ n\geq0. $$
For simplicity, we denote by Π1 the set of all randomized history-dependent strategies for P1.
Definition 2
-
(1)
A strategy \(\pi ^{1}=({\pi _{n}^{1}},n\geq 0) \in {\Pi }_{1}\) is called a randomized Markov strategy if there exists a sequence of stochastic kernels ϕ1 = (φn,n ≥ 0) such that
$$ \pi_{n}^{1}(\cdot|h_{n})=\varphi_{n}(\cdot|x_{n}),~~~ \forall h_{n}\in \mathcal{H}_{n} ~ \text{and}~ n\geq0. $$ -
(2)
A randomized Markov strategy ϕ1 = (φn,n ≥ 0) is called stationary if φn is independent of n; that is, if there exists a stochastic kernel φ on A given x such that
$$ \varphi_{n}(\cdot|x)\equiv \varphi (\cdot|x),~~~ \forall x\in X~ \text{and}~ n\geq0. $$That is, \(\phi _{1}=(\varphi ,\varphi ,\ldots )=:\varphi ^{\infty }\). For convenience, we still write φ for the randomized stationary strategy φ∞ unless otherwise stated.
-
(3)
Moreover, if φ(⋅|x) is a Dirac measure for all x ∈ X, then the stationary strategy φ is called a pure strategy.
We denote by \({{\Pi }_{1}^{M}}\), Φ1 and \({\Pi }_{1}^{MD}\) the sets of all the randomized Markov strategies, randomized stationary strategies and pure strategies for P1, respectively.
The sets Π2 of all randomized history-dependent strategies, \({{\Pi }_{2}^{M}}\) of randomized Markov strategies, Φ2 of randomized stationary strategies, and \({\Pi }_{2}^{MD}\) of pure strategies for P2 are defined similarly, with B(x) in lieu of A(x). Clearly, \({\Pi }_{1}^{MD}\subset {\Phi }_{1} \subset {{\Pi }_{1}^{M}} \subset {\Pi }_{1} \) and \({\Pi }_{2}^{MD}\subset {\Phi }_{2} \subset {{\Pi }_{2}^{M}} \subset {\Pi }_{2} \).
For each \(x\in X,\pi ^{1}\in {\Pi }_{1},\pi ^{2}\in {\Pi }_{2}\), by the Ionescu-Tulcea theorem (Hernández-Lerma and Lasserre 2012a, P.178), there exist a unique probability measure \(\mathbb {P}_{x}^{\pi ^{1}, \pi ^{2}}\) on \(({\Omega }, \mathcal {F})\) and a stochastic process {Tn,Xn,An,Bn, n ≥ 0} such that for each \(D\in {\mathscr{B}}(X),D_{1}\in {\mathscr{B}}(A),D_{2}\in {\mathscr{B}}(B)\) and n ≥ 0, we have
Here and in what follows, we denote by \(\mathbb {E}_{x}^{\pi ^{1}, \pi ^{2}}\) the expectation operator with respect to \(\mathbb {P}_{x}^{\pi ^{1}, \pi ^{2}}\).
To exclude the possibility of infinitely many decision epochs within a finite time interval, we impose an assumption on the semi-Markov kernel, which is also used in Lal and Sinha (1992), Luque-Vásquez (2002a, 2002b), and the references therein.
Assumption 1
There exist constants 𝜃 > 0 and δ > 0 such that
Proposition 1
If Assumption 1 holds, then for each fixed x ∈ X and \(\pi ^{1}\in {\Pi }_{1},\pi ^{2}\in {\Pi }_{2}\), we have
The proof of Proposition 1 is provided in Appendix B.
Since \(T_{n}\overset {P}{\longrightarrow }\infty \), there is no need to consider the process for \(t {>} T_{\infty }=\lim \limits _{n\rightarrow \infty } T_{n}\). Now, we define the underlying continuous-time state-action process {X(t),A(t),B(t),t ≥ 0}, which corresponds to the stochastic process {Tn,Xn,An,Bn,n ≥ 0} on the probability space \(({\Omega }, \mathcal {F}, \mathbb {P}_{x}^{\pi ^{1}, \pi ^{2}})\), by
where \(\mathbbm {1}_{E}\) is an indicator function on any set E.
Definition 3
The stochastic process {X(t),A(t),B(t),t ≥ 0} is called a semi-Markov game.
Next, we define the expected discounted criterion considered in this paper.
Definition 4
For each \((\pi ^{1}, \pi ^{2})\in {\Pi }_{1} \times {\Pi }_{2}\), the initial state x ∈ X and discount factor α(⋅) > 0, the expected discounted reward/payoff for P1/P2 is defined as follows:
P1 aims to maximize the reward while P2 aims to minimize the payoff. Both players aim to find an optimal strategy.
Definition 5
The upper value and lower value of the expected discounted SMG are defined as
respectively. Obviously, U(x) ≥ L(x) for all x ∈ X. Moreover, if it holds that L(x) = U(x) for all x ∈ X, then the common function is called the value function of the game and denoted by V∗.
Definition 6
Assume that the game has a value function V∗(⋅). Then a strategy \(\pi ^{1}_{*}\in {\Pi }_{1}\) is said to be optimal for P1 if
Similarly, \(\pi ^{2}_{*}\in {\Pi }_{2}\) is said to be optimal for P2 if
If \(\pi ^{i}_{*}\) is optimal for player i (i = 1,2), then we call \((\pi ^{1}_{*}, \pi ^{2}_{*})\) an optimal pair of strategies.
Remark 2
\((\pi ^{1}_{*}, \pi ^{2}_{*})\) is an optimal pair of strategies if and only if
Remark 2 provides an effective method to verify whether a pair of strategies (π1,π2) is optimal, and it is widely used in the literature on two-person zero-sum stochastic games; see, for instance, González-Sánchez et al. (2019), Luque-Vásquez (2002a, 2002b), and the references therein.
3 Main results
This section focuses on the existence of the value function and of an optimal pair of stationary strategies, which requires imposing suitable assumptions on the model parameters. We first introduce some notation for convenience.
Given a measurable function \(\omega : X\rightarrow [1, \infty )\), a function u on X is said to be ω-bounded if it has a finite ω-norm, which is defined as
$$ \|u\|_{\omega}:=\sup_{x\in X}\frac{|u(x)|}{\omega(x)}; $$
such a function ω is referred to as a weight function. For convenience, we denote by Bω(X) the Banach space of all ω-bounded measurable functions on X.
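For instance (an illustration of ours), take X = ℝ and the weight function ω(x) = 1 + x2. Then the unbounded function u(x) = x2 satisfies

$$ \|u\|_{\omega}=\sup_{x\in\mathbb{R}}\frac{x^{2}}{1+x^{2}}=1<\infty, $$

so u ∈ Bω(X) even though u has no finite sup-norm; this is exactly the situation of the quadratic reward rate in Example 1 below.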
For each given function u ∈ Bω(X) and (x,a,b) ∈ K, we write
For each fixed x ∈ X and probability measures \(\mu \in \mathbb {A}(x) := \mathbb {P}(A(x)) \text { and } \lambda \in \mathbb {B}(x) := \mathbb {P}(B(x))\), we denote
whenever the integral is well defined.
We define an operator T on Bω(X) by
which is called the Shapley operator. A function v ∈ Bω(X) is said to be a solution of the Shapley equation if v(x) = Tv(x) for all x ∈ X.
In order to explore the existence of an optimal pair of stationary strategies, we also need to define a stationary-strategy-dependent operator T(φ1,φ2) on Bω(X) by
where (φ1,φ2) ∈Φ1 ×Φ2 is a pair of stationary strategies.
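For orientation, in the constant-discount case of Luque-Vásquez (2002a) the Shapley operator has the following typical form, which we sketch here adapted to state-action-dependent discount factors (the displayed formula is our reconstruction and may differ in details from the one used in this paper):

$$ Tu(x)=\sup_{\mu\in\mathbb{A}(x)}\inf_{\lambda\in\mathbb{B}(x)}\left\{r(x,\mu,\lambda){\int}_{0}^{\infty}e^{-\alpha(x,\mu,\lambda)t}\bigl(1-H(t|x,\mu,\lambda)\bigr)dt+{\int}_{0}^{\infty}{\int}_{X}e^{-\alpha(x,\mu,\lambda)t}u(y)\,Q(dt,dy|x,\mu,\lambda)\right\}, $$

where the first term is the expected discounted reward accumulated during the sojourn at x and the second is the discounted continuation value; for T(φ1,φ2), the sup and inf are replaced by integration against φ1(⋅|x) and φ2(⋅|x).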
Next, we impose some hypotheses to guarantee the existence of an optimal pair of stationary strategies. The framework settled by these hypotheses has become quite standard for the study of semi-Markov models (Chen et al. 2021; Huang and Guo 2011, 2010; Lal and Sinha 1992; Luque-Vásquez 2002a, 2002b; Minjárez-Sosa and Luque-Vásquez 2008) and of varying discount factors (Feinberg and Shwartz 1994; González-Hernández et al. 2009; González-Sánchez et al. 2019; Minjárez-Sosa 2015; Wei and Guo 2011; Ye and Guo 2012; Wu and Zhang 2016).
Assumption 2
(a) There exists a constant α0 > 0 such that α(x,a,b) ≥ α0 for all (x,a,b) ∈ K. (b) There exist a measurable function \(\omega : X\rightarrow [1, \infty )\) and a non-negative constant M such that for all (x,a,b) ∈ K,
The key point is that Assumption 2 ensures a finiteness property of the expected discounted reward. Below we give an important consequence of Assumptions 1 and 2(a).
Lemma 1
If Assumptions 1&2(a) hold, then there exists a constant 0 < γ < 1 such that for each (x,a,b) ∈ K,
The proof of Lemma 1 is provided in Appendix A.
Assumption 3
There exists a constant η with 0 < ηγ < 1 such that for each fixed t ≥ 0 and (x,a,b) ∈ K,
where ω(⋅) is the function mentioned in Assumption 2.
Remark 3
-
(1)
We call Assumption 3 the “drift condition”; it is needed to ensure that the Shapley operator (defined in (3)) is a contraction with respect to the weighted norm, and hence to establish our main results. Obviously, Assumption 3 holds automatically when r is bounded, by taking ω(⋅) ≡ 1 and η = 1.
-
(2)
In particular, if Q(t,y|x,a,b) = H(t|x,a,b)P(y|x,a,b), where P(y|x,a,b) denotes the state transition probability, then (5) reduces to \({\int \limits }_{X} \omega (y) P(dy|x,a,b) \leq \eta \omega (x)\), which coincides with Assumption 3(b) of Luque-Vásquez (2002a) and Assumption 1(e) of González-Sánchez et al. (2019). Thus, our Assumption 3 is more general than its counterparts in Luque-Vásquez (2002a), and Example 3 of González-Sánchez et al. (2019) concerning SMGs is a special case of ours.
-
(3)
Combining Lemma 1 with Assumption 3, it is easy to derive
$$ {\int}_{0}^{\infty} e^{-\alpha(x,a,b) t}{\int}_{X} u(y) Q(d t, d y | x, a, b)\leq \eta\gamma\|u\|_{\omega}\omega(x),\quad\forall u\in B_{\omega}(X),(x,a,b)\in K. $$(6)
Moreover, we impose the following continuity-compactness conditions to ensure the existence of an optimal pair of stationary strategies of our SMG model.
Assumption 4
-
(a)
For each fixed x ∈ X, both A(x) and B(x) are compact sets.
-
(b)
For each fixed (x,a,b) ∈ K, r(x,⋅,b) is upper semi-continuous on A(x) and r(x,a,⋅) is lower semi-continuous on B(x).
-
(c)
For each fixed (x,a,b) ∈ K, t ≥ 0 and v ∈ Bω(X), the functions
$$ a \longmapsto \int v(y) Q(t,d y | x, a, b) \quad \text { and } \quad b \longmapsto \int v(y) Q(t,d y | x, a, b) $$are continuous on A(x) and B(x), respectively.
-
(d)
For each fixed t ≥ 0, H(t|⋅,⋅,⋅) is continuous on K.
-
(e)
The function α(x,a,b) is continuous on K.
Remark 4
-
(1)
Assumption 4 is similar to the standard continuity-compactness hypotheses for Markov control processes; see, for instance, Hernández-Lerma and Lasserre (2012b) and the references therein. It is commonly used to establish the existence of minimax points of games.
-
(2)
By Lemma 1.11 in Nowak (1984), if Assumption 4(a) holds, then the probability spaces \(\mathbb {A}(x) \text { and } \mathbb {B}(x)\) are also compact for each x ∈ X.
-
(3)
These continuity-compactness conditions are aimed at infinite state and action spaces; they hold trivially when X and A, B are finite.
Now we present our main result, Theorem 1 below, which extends the analysis given in Luque-Vásquez (2002a) for SMGs with a constant discount factor to SMGs with a common state-time distribution and state-action-dependent discount factors.
Theorem 1
Suppose that Assumptions 1-4 hold. Then
-
(a)
The SMG has a value function V∗(⋅), which is the unique function in Bω(X) that satisfies the Shapley equation, i.e.,
$$ V^{*}(x)=TV^{*}(x),\quad\forall x\in X, $$and furthermore, there exists an optimal pair of stationary strategies.
-
(b)
A pair of stationary strategies \((\varphi _{1}^{*}, \varphi _{2}^{*}) \in {\Phi }_{1} \times {\Phi }_{2}\) is optimal if and only if its expected discounted reward satisfies the Shapley equation, i.e., \(TV(x,\varphi _{1}^{*},\varphi _{2}^{*})=V(x,\varphi _{1}^{*},\varphi _{2}^{*})\) for all x ∈ X.
4 Preliminaries and proofs
In this section, we present some results needed to prove Theorem 1. Some of these results are already known in the literature (Luque-Vásquez 2002a, 2002b), but we state them here for completeness and ease of reference.
Lemma 2
Suppose that Assumptions 1-4 hold, then for each given function u ∈ Bω(X), the function Tu is in Bω(X) and
Moreover, there exists a pair of stationary strategies (φ1,φ2) ∈Φ1 ×Φ2 such that
Lemma 3
Both T and T(φ1,φ2) are contraction operators with modulus less than 1.
Since T and T(φ1,φ2) are both contraction operators, by Banach’s fixed point theorem there exist unique functions u∗∈ Bω(X) and \(u_{\varphi _{1},\varphi _{2}}^{*}\in B_{\omega }(X)\) such that Tu∗(⋅) = u∗(⋅) and \(T(\varphi _{1},\varphi _{2})u_{\varphi _{1},\varphi _{2}}^{*}(\cdot )=u_{\varphi _{1},\varphi _{2}}^{*}(\cdot )\).
Before stating the next important result, we give the definition of the m-shift strategy (Hernández-Lerma and Lasserre 2012b, P.96).
Definition 7
Given a strategy \(\pi ^{i}=\left \{{\pi _{n}^{i}},n=0,1,\ldots \right \}\in {\Pi }_{i}, i=1,2,\) and an integer m ≥ 0, the corresponding m-shift strategy \(^{(m)}\!\pi ^{i}=\left \{{~}^{(m)}\!{\pi ^{i}_{n}}, n=0,1,\ldots \right \}\) is given by
and for n = 1,2,…,
where hm := (t0,x0,a0,b0,…,tm− 1,xm− 1,am− 1,bm− 1,tm,xm) denotes the admissible history at the m-th decision epoch.
Lemma 4
For each \((\pi ^{1}, \pi ^{2})\in {\Pi }_{1} \times {\Pi }_{2}\) and x ∈ X,
where \(^{(1)}\!\pi ^{i}:=({\pi _{n}^{i}},n\geq 1)\) is the 1-shift strategy of \(\pi ^{i}:=({\pi _{n}^{i}},n\geq 0)\), \(\pi _{0}^{i\infty }=({\pi _{0}^{i}},{\pi _{0}^{i}},\cdots )\), i = 1,2, and \((\pi _{0}^{1\infty }, \pi _{0}^{2\infty })\) is a pair of stationary strategies.
Now, setting π1 = φ1 and π2 = φ2, which are both stationary strategies, we obtain from Lemma 4 that
which implies that the function V (x,φ1,φ2) is the unique fixed point of the contraction operator T(φ1,φ2).
Lemma 5
Suppose that Assumptions 1-3 hold and let \((\pi ^{1}, \pi ^{2})\in {\Pi }_{1} \times {\Pi }_{2}\). Then for each x ∈ X and u ∈ Bω(X), we have
Proofs of Lemmas 2-5 are provided in Appendix A. Next, a complete proof of Theorem 1 is given by applying Lemmas 2-5.
Proof of Theorem 1 (a) Let u∗ be the unique fixed point of T in Bω(X), that is
By Lemma 2, there exists a pair of stationary strategies (φ1,φ2) ∈Φ1 ×Φ2 such that for each x ∈ X,
which implies that
Moreover, by Lemma 4,
from which we can derive
Next, we prove that u∗ is the value function of the game and (φ1,φ2) is an optimal pair of stationary strategies, that is
We only prove the first inequality in (10); the second follows by a similar argument. By (9), we have
In particular, taking λ to be the Dirac measure concentrated at b, we have, for each b ∈ B(x),
For each given \(h_{n}=(t_{0},x_{0},a_{0},b_{0},\ldots ,t_{n},x_{n})\in {\mathscr{H}}_{n}\) and bn ∈ B(xn), we have
For any \(\pi ^{2}\in {\Pi }_{2}\), integrating both sides of the above inequality with respect to bn, we have
Since the above inequality holds for all \(h_{n}\in {\mathscr{H}}_{n}\), replacing xn by the random variable Xn, we have
Multiplying both sides of the above inequality by \(e^{-{\int \limits }_{0}^{T_{n}}\alpha (X(s), A(s), B(s))ds}\) and using the properties of conditional expectation, we have
Then, taking the expectation \(\mathbb {E}_{x}^{\varphi _{1}, \pi ^{2}}\), we have
Now, summing over \(n=0,1,2,\dots ,N\), we obtain
Letting \(N\rightarrow \infty \) and applying Lemma 5, we derive
which means that the first inequality in (10) holds.
(b) (⇒)
Suppose that \((\varphi _{1}^{*}, \varphi _{2}^{*}) \in {\Phi }_{1} \times {\Phi }_{2}\) is an optimal pair of stationary strategies, then for each \(x\in X,(\pi ^{1},\pi ^{2}) \in {\Pi }_{1}\times {\Pi }_{2}\), we have
For each fixed \(\lambda \in \mathbb {B}(x)\), let \(\pi ^{2}=\{{\pi _{n}^{2}},n\geq 0\}\) with \({\pi _{0}^{2}}=\lambda \) and \({\pi _{n}^{2}}=\varphi _{2}^{*}, n\geq 1\), then by Lemma 4, for each x ∈ X, we have
which yields
Similarly, we can prove
Combining the last two inequalities, we obtain the desired result.
(⇐)
This implication has already been established in the proof of part (a).
5 Algorithm
In this section, for the feasible implementation of algorithms, the state and action spaces are assumed to be finite. To avoid extra notation, we retain the ω-norm with ω(⋅) ≡ 1, which is the infinity norm, i.e., \(\|u\|_{\omega }=\sup \limits _{x\in X}|u(x)|=\|u\|_{\infty },~\forall u\in B_{\omega }(X)\). We develop an iterative algorithm to approximate the value function and an optimal pair of stationary strategies of our two-person zero-sum SMG, in which a matrix game is solved numerically at every state in a value-iteration fashion.
Without loss of generality, we assume that \(A(x):=\{a_{1},a_{2},\dots ,a_{m_{1}}\}\) and \(B(x):=\{b_{1},b_{2},\dots ,b_{m_{2}}\}\), for any \(x\in X :=\{x_{0},x_{1},\dots ,x_{n-1}\}\). Under Assumptions 1-4 presented in Section 3, we rewrite the Shapley equation in a matrix form as follows
where \(\boldsymbol {\mathcal G}(u,x)\) denotes an m1 × m2-dimensional matrix for each fixed x ∈ X and given function u ∈ Bω(X), with elements defined as
and \(\boldsymbol \varphi _{k}(x):=(\varphi _{k}(1|x),\dots ,\varphi _{k}(m_{k}|x))^{T}\) is an mk-dimensional column vector for k = 1,2. Obviously, the last equation of (11) can be viewed as a matrix game with reward matrix \(\boldsymbol {\mathcal G}(V^{*},x)\) at each state x ∈ X.
However, we cannot solve (11) directly since the value function V∗ is unknown. Below, we develop Algorithm 1 to iteratively compute a sequence of matrix games whose values asymptotically approach V∗(x) at each state x. From lines 11-12 of Algorithm 1, we can see that at the n-th iteration, Vn(x) and \(({\boldsymbol \varphi _{1}^{n}}(x),{\boldsymbol \varphi _{2}^{n}}(x))\) are obtained by using LP to solve the game with reward matrix \(\boldsymbol {\mathcal G}(V_{n-1},x)\). This iterative procedure for computing the sequence Vn is similar to the classic value iteration algorithm in MDP theory. Since iterative algorithms usually converge to approximate solutions in finitely many steps, we give the definition of ε-optimality.
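The matrix-game step in lines 11-12 can be carried out with a standard LP formulation. Below is a self-contained sketch (the function name, the positivity shift, and the use of `scipy.optimize.linprog` are our choices, not part of Algorithm 1):

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(G):
    """Value and an optimal mixed strategy for the row (maximizing) player
    of the zero-sum matrix game with reward matrix G (shape m1 x m2)."""
    m1, m2 = G.shape
    # Shift all entries to be strictly positive; the game value shifts by
    # the same constant, which is subtracted back at the end.
    shift = 1.0 - G.min()
    Gs = G + shift
    # Variables: (p_1, ..., p_m1, v).  Maximize v subject to
    #   Gs^T p >= v (componentwise),  sum(p) = 1,  p >= 0.
    c = np.zeros(m1 + 1)
    c[-1] = -1.0                                  # minimize -v
    A_ub = np.hstack([-Gs.T, np.ones((m2, 1))])   # v - (Gs^T p)_j <= 0
    b_ub = np.zeros(m2)
    A_eq = np.hstack([np.ones((1, m1)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * m1 + [(None, None)]  # p >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1] - shift, res.x[:m1]
```

At iteration n of Algorithm 1, one would call this with G = 𝒢(Vn−1,x) for each state x; the column player’s optimal strategy comes from the dual of the same LP (or from the symmetric program with G replaced by −GT).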
Definition 8
Assume that the game has a value function V∗. Then a pair of strategies \((\pi ^{1}_{\varepsilon },\pi ^{2}_{\varepsilon })\in {\Pi }_{1}\times {\Pi }_{2}\) is said to be ε-optimal of the game if
Moreover, \(V_{\varepsilon }(\cdot ):=V(\cdot , \pi ^{1}_{\varepsilon }, \pi ^{2}_{\varepsilon })\) is called the ε-value function of the game.
Furthermore, we establish Theorem 2 to guarantee the convergence of Algorithm 1.
Theorem 2
Under Algorithm 1, for any given 𝜖 > 0 and initial value \(V_{0}\in \mathbb {R}\), there exists a non-negative integer \(N_{\epsilon }=\big (1+\lfloor \log _{\eta \gamma } (\frac {\epsilon }{\|TV_{0}-V_{0}\|_{\omega }}) \rfloor \big )\mathbbm {1}_{TV_{0}\neq V_{0}}\) such that \(\|V_{N_{\epsilon }+1}-V_{N_{\epsilon }}\|_{\omega }<\epsilon \), which implies that Algorithm 1 converges within N𝜖 iterations. Moreover, the strategy pair \((\varphi _{1}^{\varepsilon },\varphi _{2}^{\varepsilon })\) output by Algorithm 1 is ε-optimal with \(\varepsilon =\frac {\epsilon }{1-\eta \gamma }\).
Proof
According to the iterative formula of Algorithm 1, we have
which by iteration yields
For each given 𝜖 > 0 and initial value \(V_{0}\in \mathbb {R}\), if TV0 = V0, choose N𝜖 = 0, and we have
otherwise, if TV0≠V0, choose \(N_{\epsilon }=1+\lfloor \log _{\eta \gamma } (\frac {\epsilon }{\|TV_{0}-V_{0}\|_{\omega }}) \rfloor \), and we have
Combining the two cases above, we choose \(N_{\epsilon }=\big (1+\lfloor \log _{\eta \gamma } (\frac {\epsilon }{\|TV_{0}-V_{0}\|_{\omega }}) \rfloor \big )\mathbbm {1}_{TV_{0}\neq V_{0}}\) and obtain \(\|V_{N_{\epsilon }+1}-V_{N_{\epsilon }}\|_{\omega }<\epsilon \), which implies that Algorithm 1 converges within N𝜖 iterations.
Moreover, since V∗ is the unique solution of the Shapley equation, we have
thus,
taking n = N𝜖, and we have
which implies that \((\varphi _{1}^{\varepsilon },\varphi _{2}^{\varepsilon })\) is ε-optimal by Definition 8. □
Therefore, with Algorithm 1, we can iteratively approach the value function and an optimal pair of stationary strategies of our SMG problem by recursively solving a matrix game at each state x. Theorem 2 guarantees the convergence of Algorithm 1. We can implement Algorithm 1 to solve practical problems, as illustrated in the next section.
6 Numerical experiment
In this section, we present numerical examples to illustrate the main results derived in Sections 3-5. First, we give an example to demonstrate that Assumptions 1-4, which ensure the existence of the value function and an optimal pair of stationary strategies of SMGs, are easy to verify in practice.
Example 1
Consider a system with a model of SMG which is defined as follows:
The state space \(X:=(-\infty ,\infty )\) and the action spaces \(A:=[\underline {a},\bar {a}],B:=[\underline {b},\bar {b}]\) with admissible action sets A(x) := A,B(x) := B for each x ∈ X. The semi-Markov kernel is given by:
where Φ(⋅) and F(⋅) denote the cumulative distribution functions of a normal distribution with mean μ(x,a,b) and variance σ²(x,a,b) and of an exponential distribution with parameter β(x,a,b), respectively. The reward rate function is r(x,a,b) := x² + a + b. Moreover, the discount factor is defined as α(x,a,b) := e^{|x|+a+b}.
Now, we verify that the conditions on the existence of an optimal pair of stationary strategies described in Assumptions 1-4 are satisfied in this example. To this end, we need the following hypothesis:
Assumption 5
(a) The function β(x,a,b) is continuous on K and has both a positive lower bound \(\underline {\beta }\) and a positive upper bound \(\bar {\beta }\); (b) \(\mu ^{2}(x,a,b)+\sigma ^{2}(x,a,b)\leq \frac {1}{4}x^{2}+\frac {e^{\underline {a}+\underline {b}-1}}{8(\bar {\beta }+e^{\underline {a}+\underline {b}})},~\forall (x,a,b)\in K\).
With this hypothesis, we directly have the following result.
Proposition 2
Suppose that Assumption 5 holds. Then Example 1 satisfies Assumptions 1-4, which implies that the SMG has an optimal pair of stationary strategies.
The proof of Proposition 2 is provided in Appendix B.
Next, we give another example, an investment problem, to demonstrate the numerical computation of Algorithm 1 for the value function and an optimal pair of stationary strategies of the game.
Example 2
Consider an investment problem with three states {1,2,3}, which denote the benefit, medium, and loss economy environments, respectively. At each state, the investor buys some assets while the market-maker sells. The interest rate depends on the economy environment as well as on the number of assets that the investor buys and the market-maker sells. In state i ∈{1,2,3}, the investor buys a certain amount of assets from {ai1,ai2} and the market-maker sells from {bi1,bi2}, which leads to a reward rate r(i,a,b) to the investor and − r(i,a,b) to the market-maker, where a ∈{ai1,ai2}, b ∈{bi1,bi2}. The system then moves to a new state j with probability p(j|i,a,b) after staying at state i for a random sojourn time, which follows an exponential distribution with parameter β(i,a,b) for i ∈{1,2}, and is uniformly distributed on [0,β(3,a,b)] with parameter β(3,a,b) > 0 for i = 3. For this system, the decision makers aim to find an optimal pair of stationary strategies.
We establish an SMG model to solve this investment problem. The state space is X = {1,2,3}, action spaces are A(i) = {ai1,ai2}, B(i) = {bi1,bi2} for each i ∈ X and the semi-Markov kernel Q is given by:
from which we can obtain
and
Then by (2), we have
For the numerical calculation of this example, we assume the model parameter values shown in Table 1.
With this data setting, by taking α0 = 0.25,δ = 0.1,𝜃 = 0.023, Assumptions 1-4 obviously hold since both the state and action spaces are finite. Thus, the existence of the value function and an optimal pair of stationary strategies of the SMG are ensured by Theorem 1. Next, we use Algorithm 1 to find the approximate value function and an ε-optimal pair of stationary strategies of the game. The detailed steps are listed as follows.
Step 1: Initialization.
Let n = 0 and V0(⋅) = 1; set a small threshold 𝜖 := 10^{−4}, and we have \(\varepsilon =\frac {\epsilon }{1-\eta \gamma }=0.33\).
Step 2: Iteration.
For n ≥ 0, (a,b) ∈ A(i) × B(i), we have
Then, for each state i ∈{1,2,3}, we solve the linear program
with the solution denoted by \({\pi _{n}^{1}}(\cdot | i)\) where \({\pi _{n}^{1}}(a_{i1}| i)=f(i,a_{i1})\), \({\pi _{n}^{1}}(a_{i2}| i)=f(i,a_{i2})\).
Also we solve the dual program of (12)
with the solution denoted by \({\pi _{n}^{2}}(\cdot | i)\) where \({\pi _{n}^{2}}(b_{i1}| i)=g(i,b_{i1})\), \({\pi _{n}^{2}}(b_{i2}| i)=g(i,b_{i2})\). We set
Step 3: Termination judgement.
If \(\underset {i=1,2,3}{\max \limits }|V_{n+1}(i)-V_{n}(i)|<\epsilon \), then the iteration stops, Vn is the ε-value function and \(({\pi _{n}^{1}},{\pi _{n}^{2}})\) is an ε-optimal pair of stationary strategies of the SMG. Otherwise, set n = n + 1 and go to Step 2.
We use Matlab to implement the iteration algorithm for this example. It takes about 10 seconds to stop at the 93rd iteration. The curves of the error of two successive iterations, the value function, and the optimal pair of stationary strategies of players with respect to the iteration times are illustrated by Figs. 1, 2 and 3.
Based on the experimental results, we have the following observations:
1. When the state is benefit, the investor should take action a11 with probability 0.60217 and a12 with probability 0.39783, while the market-maker should take action b11 with probability 0.55737 and b12 with probability 0.44263;
2. When the state is medium, the investor should take action a21 with probability 0.87111 and a22 with probability 0.12889, while the market-maker should take action b21 with probability 0.77887 and b22 with probability 0.22113;
3. When the state is loss, the investor should always take action a31 while the market-maker should always take action b31;
4. If both the investor and the market-maker use the optimal strategies, the investor will obtain a profit of 12.6054 at the benefit state, 12.1271 at the medium state, and 11.1653 at the loss state, while the market-maker will lose the same amounts, respectively.
Remark 5
In this example, we choose a uniformly distributed sojourn time at state 3 to show that arbitrary distributions are permitted for the sojourn times of semi-Markov processes. Other distributions can also be chosen for the sojourn time according to practical situations. Moreover, if all the sojourn times are exponentially distributed, the semi-Markov games degenerate into continuous-time Markov games.
7 Conclusion
In this paper, we concentrate on the two-person zero-sum SMG with expected discounted criterion in which the discount factors are state-action-dependent. We first construct the SMG model with a fairly general definition setting. Then we impose suitable conditions on the model parameters, under which we establish the Shapley equation whose unique solution is the value function and prove the existence of an optimal pair of stationary strategies of the game. When the state and action spaces are finite, a value iteration-type algorithm for approximating the value function and an optimal pair of stationary strategies is developed. Finally, we apply our results to an investment problem, which demonstrates that our algorithm performs well.
One of the future research topics is to deal with the nonzero-sum case of this game model. We wish to find sufficient conditions under which similar arguments can be used to establish the Shapley equation and prove the existence of Nash equilibria for such games. In practice, many decision problems are continuous in states or actions. It is of interest to further study discretization techniques for continuous SMGs in order to apply our value iteration algorithm. Moreover, considering the limitations of computing resources, the dynamic programming algorithm is difficult to implement in reality when the scale of the game becomes huge. Another future research topic is to develop data-driven learning algorithms to approximately solve game problems, for instance in combination with multi-agent reinforcement learning approaches.
References
Ash RB, Doleans-Dade CA (2000) Probability and measure theory. Academic Press
Barron EN (2013) Game Theory: An Introduction, 2nd Edn. Wiley, New York
Chen F, Guo X, Liao ZW (2021) Discounted semi-Markov games with incomplete information on one side. arXiv:2107.07499
Fan K (1953) Minimax theorems. Proc Natl Acad Sci U S A 39 (1):42–47
Feinberg E A, Shwartz A (1994) Markov decision models with weighted discounted criteria. Math Oper Res 19(1):152–168
Filar J, Vrieze K (2012) Competitive Markov Decision Processes. Springer Science & Business Media, Berlin
González-Hernández J, López-martínez RR, Minjárez-Sosa JA (2009) Approximation, estimation and control of stochastic systems under a randomized discounted cost criterion. Kybernetika 45(5):737–754
González-Sánchez D, Luque-Vásquez F, Minjárez-Sosa J A (2019) Zero-Sum Markov games with random State-Actions-Dependent discount factors: Existence of optimal strategies. Dyn Games Appl 9(1):103–121
Hernández-Lerma O, Lasserre JB (2012a) Discrete-time markov control processes: basic optimality criteria, vol 30. Springer Science & Business Media, Berlin
Hernández-Lerma O, Lasserre JB (2012b) Further topics on discrete-time markov control processes, vol 42. Springer Science & Business Media
Hoffman A J, Karp R M (1966) On nonterminating stochastic games. Manag Sci 12(5):359–370
Huang Y, Guo X (2011) Finite horizon semi-Markov decision processes with application to maintenance systems. Eur J Oper Res 212(1):131–140
Huang Y H, Guo X P (2010) Discounted semi-Markov decision processes with nonnegative costs. Acta Math Sinica (Chinese Series) 53:503–514
Kirman A P, Sobel M J (1974) Dynamic oligopoly with inventories. Econometrica: Journal of the Econometric Society, 279–287
Lal A K, Sinha S (1992) Zero-sum two-person semi-Markov games. J Appl Probab 29(1):56–72
Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: Proceedings of the Eleventh International Conference on Machine Learning, pp 157–163
Luque-Vásquez F (2002a) Zero-sum semi-Markov game in Borel spaces with discounted payoff. Morfismos 6(1):15–29
Luque-Vásquez F (2002b) Zero-sum semi-Markov games in Borel spaces: discounted and average payoff. Bol Soc Mat Mexicana 8:227–241
Minjárez-Sosa J A (2015) Markov control models with unknown random state–action-dependent discount factors. Top 23(3):743–772
Minjárez-Sosa J A, Luque-Vásquez F (2008) Two person zero-sum semi-Markov games with unknown holding times distribution on one side: a discounted payoff criterion. Appl Math Optim 57(3):289–305
Mondal P, Sinha S, Neogy S K, Das A K (2016) On discounted AR-AT semi-Markov games and its complementarity formulations. Int J Game Theory 45(3):567–583
Mondal P, Neogy S, Gupta A, Ghorui D (2020) A policy improvement algorithm for solving a mixture class of perfect information and AR-AT semi-Markov games. Int Game Theory Rev 22(02):2040008
Nowak A S (1984) On zero-sum stochastic games with general state space I. Probab Math Stat 4(1):13–32
Pollatschek M, Avi-Itzhak B (1969) Algorithms for stochastic games with geometrical interpretation. Manag Sci 15(7):399–415
Raut LK (1990) Two-sided altruism, Lindahl equilibrium and Pareto efficiency in overlapping generations models. Department of Economics, University of California
Ross SM (1970) Average cost semi-Markov decision processes. J Appl Probability 7(3):649–656
Schäl M (1975) Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 32(3):179–196
Shapley L S (1953) Stochastic games. Proc Natl Acad Sci 39 (10):1095–1100
Tanaka K, Wakuta K (1976) On semi-Markov games. Science Reports of Niigata University Series A, Mathematics 13:55–64
Vega-Amaya Ó, Luque-Vásquez F, Castro-Enríquez M (2022) Zero-sum average cost semi-markov games with weakly continuous transition probabilities and a minimax semi-markov inventory problem. Acta Appl Math 177(1):1–27
Wei Q, Guo X (2011) Markov decision processes with state-dependent discount factors and unbounded rewards/costs. Oper Res Lett 39(5):369–374
Wu X, Zhang J (2016) Finite approximation of the first passage models for discrete-time Markov decision processes with varying discount factors. Discrete Event Dyn Syst 26(4):669–683
Ye L, Guo X (2012) Continuous-time Markov decision processes with state-dependent discount factors. Acta Applicandae Math 121(1):5–27
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (62073346, 11931018, U1811462), the Guangdong Basic and Applied Basic Research Foundation (2021A1515011984), and the Guangdong Province Key Laboratory of Computational Science at the Sun Yat-sen University (2020B1212060032).
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Appendices
Appendix A: Proofs of Lemmas 1-5
Proof
(Lemma 1) For each fixed (x,a,b) ∈ K, integrating by parts, we have
Let \(\gamma =1-\delta +\delta e^{-\alpha _{0}\theta }\), which yields (4). □
Proof
(Lemma 2)
By Assumption 2 and formulation (6), for each given function u ∈ Bω(X) and (x,a,b) ∈ K, we can easily get
The above inequality yields \(\|G(u, \cdot , a, b)\|_{\omega }\leq \frac {M}{\alpha _{0}}+\eta \gamma \|u\|_{\omega }\), which implies that G(u,⋅,a,b) ∈ Bω(X), and so Tu ∈ Bω(X).
On the one hand, by Assumption 4, it follows that G(u,x,⋅,b) is upper semi-continuous on A(x); then for each fixed \(\lambda \in \mathbb {B}(x)\), by Fatou's theorem, the function
is also upper semi-continuous on A(x). Moreover, since the set of probability measures on \({\mathscr{B}}(X)\) is endowed with the topology of weak convergence, by Theorem 2.8.1 in Ash et al. (2000), the function G(u,x,⋅,λ) is upper semi-continuous on \(\mathbb {A}(x)\). Similarly, G(u,x,μ,⋅) is lower semi-continuous on \(\mathbb {B}(x)\). Thus, by Theorem A.2.3 in Ash et al. (2000), the supremum and the infimum are indeed attained in (3), which means
Then, by Fan's minimax theorem (Fan 1953), we obtain (7).
On the other hand, since \(\mathbb {A}(x)\) and \(\mathbb {B}(x)\) are compact, by the well-known measurable selection theorem for minimax problems (Nowak 1984), there exists a pair of stationary strategies (φ1,φ2) ∈Φ1 ×Φ2 that satisfies (8). □
Proof
(Lemma 3)
First, it is easy to verify that the operator T(φ1,φ2) is monotonically increasing. Let u, v ∈ Bω(X). By the definition of the ω-norm, u(⋅) ≤ v(⋅) + ∥u − v∥ω ω(⋅); it follows that for each fixed x ∈ X, we have
where the last inequality follows from (6). Furthermore, taking the maximum over φ1 ∈ Φ1 and the minimum over φ2 ∈ Φ2 on both sides of inequality (13), we have
i.e.,
Similarly, interchanging u and v, we obtain
Combining the two inequalities above, we have
i.e.,
which implies that T is a contraction operator with modulus ηγ < 1. Using the same arguments, we can prove that T(φ1,φ2) is also a contraction operator with modulus ηγ < 1. □
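In compact form, the contraction estimate established above reads (a restatement in the notation of the proof; the elided displays follow the same pattern):

```latex
\begin{aligned}
Tu(x) - Tv(x) &\le \eta\gamma\,\|u-v\|_{\omega}\,\omega(x),\\
Tv(x) - Tu(x) &\le \eta\gamma\,\|v-u\|_{\omega}\,\omega(x),
\end{aligned}
\qquad\text{hence}\qquad
\|Tu - Tv\|_{\omega} \le \eta\gamma\,\|u-v\|_{\omega} < \|u-v\|_{\omega}.
```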
Proof
(Lemma 4)
where the third and fourth equalities are ensured by the property of conditional expectation, and the fifth equality follows from the strong Markov property. Hence,
which is required. □
Proof
(Lemma 5)
For all n ≥ 1 and x ∈ X, we have
where the first and second equalities are ensured by the property of conditional expectation, and the last inequality follows from (6). Iterating, we obtain
which yields Lemma 5. □
Appendix B: Proofs of Propositions 1-2
Proof
(Proposition 1)
By the property of conditional expectation and Lemma 1, we have
where the last inequality follows directly from the proof of Lemma 1 by taking α0 := 1. Iterating, we obtain
For any given t > 0, by Chebyshev's inequality, we obtain
noticing that \(1-\delta +\delta e^{-\theta }<1\), we have
since t is arbitrary, Proposition 1 holds. □
Proof
(Proposition 2)
Using the marginal distribution formula, the corresponding distribution of sojourn time is given as follows,
It is easy to show that Assumption 2 holds by choosing \(\alpha _{0}=e^{\underline {a}+\underline {b}}\), ω(x) = x² + 1 and \(M=3\max \limits \{|\bar {a}|,|\underline {a}|,|\bar {b}|,|\underline {b}|,1\}\). Now, let \(\theta =\frac {1}{\alpha _{0}}\ln \left (1+\frac {\alpha _{0}}{\bar {\beta }}\right )\) and \(\delta =\left (1+\frac {\alpha _{0}}{\bar {\beta }}\right )^{-\bar {\beta }/\alpha _{0}}\); then for each (x,a,b) ∈ K,
thus, Assumption 1 holds.
Next we verify Assumption 3. According to Lemma 1 and its proof, we can choose \(\gamma =1-\delta +\delta e^{-\alpha _{0} \theta }\) and further take \(\eta =\frac {2}{1+\gamma }\), then
which implies that Assumption 3 holds. Finally, since the reward rate r(x,a,b), the discount factor α(x,a,b) and the semi-Markov kernel Q(t,y|x,a,b) are continuous on K, Assumption 4 holds. Hence, the SMG of Example 1 has an optimal pair of stationary strategies. □
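The constants θ, δ, γ, and η appearing in this verification are straightforward to check numerically. A small sketch; the bounds \(\underline{a}+\underline{b}=0\) (so α0 = 1) and \(\bar{\beta}=2\) are illustrative choices of our own:

```python
import math

# Illustrative bounds: a_low + b_low = 0 (so alpha_0 = 1) and beta_bar = 2.0.
alpha0 = math.exp(0.0)                       # alpha_0 = e^{a_low + b_low}
beta_bar = 2.0

theta = math.log(1.0 + alpha0 / beta_bar) / alpha0
delta = (1.0 + alpha0 / beta_bar) ** (-beta_bar / alpha0)
gamma = 1.0 - delta + delta * math.exp(-alpha0 * theta)
eta = 2.0 / (1.0 + gamma)

# Assumption 3 requires eta * gamma < 1; note that with eta = 2/(1+gamma),
# eta * gamma = 2*gamma/(1+gamma) < 1 automatically whenever gamma < 1.
```

With these numbers, γ = 23/27 ≈ 0.852 and ηγ = 0.92 < 1, consistent with the choice η = 2/(1 + γ) in the proof.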
About this article
Cite this article
Yu, Z., Guo, X. & Xia, L. Zero-sum semi-Markov games with state-action-dependent discount factors. Discrete Event Dyn Syst 32, 545–571 (2022). https://doi.org/10.1007/s10626-022-00366-4