1 Introduction

This paper deals with two-person zero-sum semi-Markov games (SMGs) with the expected discounted criterion, which generalize discrete-time Markov games (DTMGs) (Shapley 1953) in that the sojourn time between two consecutive decision epochs may follow an arbitrary distribution rather than being constant. Such games have already been studied in the literature (Chen et al. 2021; Lal and Sinha 1992; Luque-Vásquez 2002a, 2002b; Minjárez-Sosa and Luque-Vásquez 2008; Mondal et al. 2016, 2020). However, the existing references are all restricted to the case where the discount factor is a constant, which may not always hold in practice.

In this paper, we study a more general case of SMGs with varying discount factors to better match the demands of real-world problems. For example, in economic applications the discount factor (interest rate) may depend both on the economic environment and on the decision-makers’ actions. That is, the interest rate usually varies across financial markets and monetary policies, where financial markets can be regarded as states and monetary policies as actions taken by the government. Problems with non-constant discount factors have been studied for Markov decision processes (MDPs) (Feinberg and Shwartz 1994; González-Hernández et al. 2009; Schäl 1975; Wei and Guo 2011; Ye and Guo 2012; Minjárez-Sosa 2015; Wu and Zhang 2016) and two-person zero-sum DTMGs (González-Sánchez et al. 2019). Furthermore, in light of the application of SMGs to the dynamic oligopoly model (Kirman and Sobel 1974) and the dynamic overlapping generations model (Raut 1990), it is both necessary and natural to investigate SMGs with state-action-dependent discount factors to fill this gap in theory and application.

On the other hand, most of the literature on SMGs focuses on the existence of optimal strategies. However, efficiently solving a stochastic dynamic game and computing an optimal pair of stationary strategies are especially important for the practical implementation of game theory. The classic algorithmic study of game theory focuses on matrix games, which can be solved by linear programming (LP) (Barron 2013). This LP technique is further combined with the policy iteration of MDPs to solve discounted two-person zero-sum DTMGs (Hoffman and Karp 1966; Pollatschek and Avi-Itzhak 1969), where LP is used to improve the strategy at every iteration. Recently, emerging investigations aim to study the efficient computation of stochastic dynamic games using approximation or learning algorithms. Littman (1994) proposes a minimax-Q algorithm to solve discounted two-person zero-sum DTMGs, which is essentially the standard Q-learning algorithm with the max operator of reinforcement learning replaced by a minimax operator over Markov games. In addition, finite algorithms for some interesting special classes of stochastic games are also widely studied, such as LP for DTMGs with single-controller, switching-controller, and separable reward and state independent transition structures (Filar and Vrieze 2012). Recently, similar results have been extended to SMGs. Mondal et al. (2016) study discounted two-person zero-sum SMGs with AR-AT-AITT (Additive Reward and Additive Transition and Action Independent Transition Times) structure. They prove that such a game can be formulated as a vertical linear complementarity problem (VLCP), which can be solved by Cottle-Dantzig’s algorithm. They further propose a policy improvement algorithm for solving a mixture class of perfect information and AR-AT SMGs (Mondal et al. 2020). Notice that the above algorithms for solving SMGs are only suitable for games with special structures and fail to effectively solve general discounted two-person zero-sum SMGs.

In this paper, we study two-person zero-sum SMGs with the expected discounted criterion in which the discount factors are state-action-dependent. The objective is to find an optimal pair of stationary strategies that maximizes the reward of player 1 (P1) and minimizes the payoff of player 2 (P2). More precisely, we deal with SMGs specified by five primitive data: the state space X; the action spaces A,B for P1 and P2, respectively; the semi-Markov kernel Q(t,y|x,a,b); the discount factor α(x,a,b); and the reward rate function r(x,a,b). The state space X and action spaces A,B are all Polish spaces, and the reward rate function r(x,a,b) is ω-bounded. The semi-Markov kernel Q(t,y|x,a,b) is a joint distribution of the sojourn time and the next state for any given (x,a,b) ∈ X × A × B, which is more general than in the literature (Luque-Vásquez 2002a, 2002b; Minjárez-Sosa and Luque-Vásquez 2008; Mondal et al. 2016, 2020). To the best of our knowledge, the framework with a joint state-time distribution has only been used in previous works dealing with semi-Markov decision processes (for non-negative reward rate functions (Huang and Guo 2011, 2010); for the average criterion with unbounded reward rate functions (Ross 1970)) and SMGs (for bounded reward rate functions (Chen et al. 2021; Lal and Sinha 1992; Tanaka and Wakuta 1976); for the average criterion with unbounded reward rate functions (Vega-Amaya et al. 2022)), and it has not been studied for discounted SMGs with unbounded reward rate functions. This extension necessitates a new “drift condition”, which leads to a general Shapley equation and a more complicated proof of optimality. With these new features, we construct an SMG model with a fairly general problem setting. Then we impose suitable conditions on the model parameters, stated in Assumptions 1-4, under which we establish the Shapley equation and prove the existence of the value function and of an optimal pair of stationary strategies of the game. Our proof is quite different from González-Sánchez et al. (2019) since we directly search for optimal strategies among history-dependent strategies instead of restricting attention to Markov strategies. Furthermore, in the scenario of finite state and action spaces, we derive a value iteration-type algorithm to approximate the value function and an optimal pair of stationary strategies of the game based on the Shapley equation. The convergence and the error bound of the algorithm are also analyzed. Finally, we conduct numerical examples on an investment problem to demonstrate the main results of our paper.

The contributions of this paper can be summarized as follows. (1) This paper studies a fairly general model of SMGs with state-action-dependent discount factors and joint probability transition functions. We derive a “drift condition” (see Assumption 3) on the generic semi-Markov kernel, which is more general than its counterparts in the literature (Luque-Vásquez 2002a, 2002b; González-Sánchez et al. 2019), as stated in Remark 3. (2) In order to find an ε-optimal pair of stationary strategies and an approximate value function, a value iteration-type algorithm is proposed, which can be viewed as a combination of the value iteration of MDPs and the LP of matrix games. Moreover, the convergence and the error bound of the algorithm are also analyzed.

The rest of this paper is organized as follows. In Section 2, we introduce the SMG model as well as the optimality criterion. Our main optimality results are stated in Section 3 and proved in Section 4. A value iteration-type algorithm for approximating an optimal pair of stationary strategies is developed in Section 5, and some numerical examples are conducted to demonstrate our main results in Section 6. Finally, we conclude the paper and discuss some future research topics in Section 7.

2 Two-person zero-sum semi-Markov game model

Notation: If E is a Polish space (that is, a complete and separable metric space), its Borel σ-algebra is denoted by \({\mathscr{B}}(E)\), and \(\mathbb {P}(E)\) denotes the family of probability measures on \({\mathscr{B}}(E)\), endowed with the topology of weak convergence.

In this section, we introduce a two-person zero-sum SMG model with expected discounted criterion and state-action-dependent discount factors, which is denoted by the collection

$$ \{X, A, B, (A(x),B(x), x\in X), Q(t,y|x,a,b), \alpha(x,a,b), r(x,a,b)\}, $$

where the symbols are explained as follows.

  • X is the state space, which is a Polish space, and A, B are the action spaces for P1 and P2, respectively, which are also assumed to be Polish spaces.

  • A(x) and B(x) are Borel subsets of A and B, which represent the sets of the admissible actions for P1 and P2 at state x ∈ X, respectively. Let

    $$ K := \{(x,a,b)|x\in X, a\in A(x), b\in B(x)\} $$

    be a measurable Borel subset of X × A × B.

  • Q(t,y|x,a,b) is a semi-Markov kernel which satisfies the following properties.

  1. (a)

    For each fixed (x,a,b) ∈ K, Q(⋅,⋅|x,a,b) is a probability measure on \([0,\infty ) \times X\), whereas for each fixed \(t \in [0,\infty ), D\in {\mathscr{B}}(X)\), Q(t,D|⋅,⋅,⋅) is a real-valued Borel function on K.

  2. (b)

    For each fixed (x,a,b) ∈ K and \(D\in {\mathscr{B}}(X)\), Q(⋅,D|x,a,b) is a non-decreasing right-continuous real-valued Borel function on \([0,\infty )\) such that Q(0,D|x,a,b) = 0.

  3. (c)

    For each fixed (x,a,b) ∈ K, we denote by

    $$ H(\cdot|x,a,b) := Q(\cdot,X|x,a,b) $$

    the distribution function of the sojourn time at state x ∈ X when the actions a ∈ A(x), b ∈ B(x) are chosen. For each x ∈ X and \(D\in {\mathscr{B}}(X)\), when P1 and P2 select actions a ∈ A(x) and b ∈ B(x), respectively, Q(t,D|x,a,b) denotes the joint probability that the sojourn time in state x is not greater than \(t \in \mathbb {R}_{+}\) and the next state belongs to D.

  • α(x,a,b) is a measurable function from K to \((0,\infty )\), which denotes the state-action-dependent discount factor.

  • r(x,a,b) is a real-valued function on K, which represents the reward/payoff rate function for P1/P2.

Remark 1

In our SMG model, the semi-Markov kernel Q(⋅,⋅|x,a,b) is a joint probability distribution with respect to sojourn time and state for given (x,a,b) ∈ K. Thus, our model is more general than the counterpart in the literature (Lal and Sinha 1992; Luque-Vásquez 2002a, 2002b; González-Sánchez et al. 2019).

The evolution of the SMG with the expected discounted criterion proceeds as follows.

Assume that the game starts from the initial state x0 ∈ X at the initial decision epoch t0 := 0. The two players simultaneously choose actions a0 ∈ A(x0), b0 ∈ B(x0) based on t0 and x0; then P1 receives reward and P2 incurs payoff at rate r(x0,a0,b0). After staying at state x0 up to time t1 > t0, the system moves to a new state x1 ∈ D according to the transition law Q(t1 − t0,D|x0,a0,b0). Once the state transition to x1 occurs at the 1st decision epoch t1, the entire process repeats and the game evolves in this way.

Thus, we obtain an admissible history at the n-th decision epoch

$$ h_{n}:=(t_{0},x_{0},a_{0},b_{0},t_{1},x_{1},a_{1},b_{1},\dots, t_{n},x_{n}). $$

As the game continues indefinitely, we obtain the history

$$ h:=(t_{0},x_{0},a_{0},b_{0},t_{1},x_{1},a_{1},b_{1},\dots), $$

where tn ≤ tn+1 and (xn,an,bn) ∈ K for all n ≥ 0. Moreover, let \({\mathscr{H}}_{n}\) be the class of all admissible histories hn of the system up to the n-th decision epoch, endowed with a Borel σ-algebra.

Let \(({\Omega }, \mathcal {F})\) be the canonical measurable space consisting of the sample space \({\Omega }=(\mathbb {R}_{+}\times X\times A\times B)^{\infty }\) and the corresponding product σ-algebra \(\mathcal {F}\). For each \(\omega =(t_{0},x_{0},a_{0},b_{0},t_{1},x_{1},\dots )\in {\Omega }\), we define a stochastic process \(\left \{T_{n}, X_{n}, A_{n}, B_{n}, n\ge 0\right \}\) on \(({\Omega }, \mathcal {F})\) by

$$ T_{n}(\omega)=t_{n}, \ X_{n}(\omega)=x_{n}, \ A_{n}(\omega)=a_{n}, \ B_{n}(\omega)=b_{n}. $$

Moreover, we denote by Hn the process of decision epochs, states and actions up to the n-th decision epoch:

$$ H_{n}(\omega)=(T_{0},X_{0},A_{0},B_{0},\ldots,T_{n},X_{n})(\omega)=h_{n}. $$

To introduce our expected discounted criterion discussed in this paper, we give the definitions of strategies as follows.

Definition 1

A randomized history-dependent strategy for P1 is a sequence of stochastic kernels \(\pi ^{1}:=({\pi _{n}^{1}},n\geq 0)\) that satisfies the following conditions:

  1. (i)

    for each \(D\in {\mathscr{B}}(A)\), \({\pi _{n}^{1}}(D|\cdot )\) is a Borel function on \({\mathscr{H}}_{n}\), and for each \(h_{n}\in {\mathscr{H}}_{n}\), \({\pi _{n}^{1}}(\cdot |h_{n})\) is a probability measure on A;

  2. (ii)

    \({\pi _{n}^{1}}(\cdot |h_{n})\) is concentrated on A(xn), that is

    $$ {\pi_{n}^{1}}(A(x_{n})|h_{n})=1, ~~~ \forall h_{n}\in \mathcal{H}_{n} ~ \text{and}~ n\geq0. $$

We denote by Π1 the set of all randomized history-dependent strategies for P1.

Definition 2

  1. (1)

    A strategy \(\pi ^{1}=({\pi _{n}^{1}},n\geq 0) \in {\Pi }_{1}\) is called a randomized Markov strategy if there exists a sequence of stochastic kernels ϕ1 = (φn,n ≥ 0) such that

    $$ \pi_{n}^{1}(\cdot|h_{n})=\varphi_{n}(\cdot|x_{n}),~~~ \forall h_{n}\in \mathcal{H}_{n} ~ \text{and}~ n\geq0. $$
  2. (2)

    A randomized Markov strategy ϕ1 = (φn,n ≥ 0) is called stationary if φn is independent of n; that is, if there exists a stochastic kernel φ on A given x such that

    $$ \varphi_{n}(\cdot|x)\equiv \varphi (\cdot|x),~~~ \forall x\in X~ \text{and}~ n\geq0. $$

    That is, \(\phi _{1}=(\varphi ,\varphi ,\ldots )=:\varphi ^{\infty }\). For convenience, we still use φ to denote this randomized stationary strategy when no confusion arises.

  3. (3)

    Moreover, if φ(⋅|x) is a Dirac measure for all x ∈ X, then the stationary strategy φ is called a pure strategy.

We denote by \({{\Pi }_{1}^{M}}\), Φ1 and \({\Pi }_{1}^{MD}\) the sets of all the randomized Markov strategies, randomized stationary strategies and pure strategies for P1, respectively.

The sets of all randomized history-dependent strategies Π2, randomized Markov strategies \({{\Pi }_{2}^{M}}\), randomized stationary strategies Φ2, and pure strategies \({\Pi }_{2}^{MD}\) for P2 are defined similarly, with B(x) in lieu of A(x). Clearly, \({\Pi }_{1}^{MD}\subset {\Phi }_{1} \subset {{\Pi }_{1}^{M}} \subset {\Pi }_{1} \) and \({\Pi }_{2}^{MD}\subset {\Phi }_{2} \subset {{\Pi }_{2}^{M}} \subset {\Pi }_{2} \).

For each \(x\in X,\pi ^{1}\in {\Pi }_{1},\pi ^{2}\in {\Pi }_{2}\), by the Ionescu-Tulcea theorem (Hernández-Lerma and Lasserre 2012a, P.178), there exist a unique probability measure \(\mathbb {P}_{x}^{\pi ^{1}, \pi ^{2}}\) on \(({\Omega }, \mathcal {F})\) and a stochastic process {Tn,Xn,An,Bn, n ≥ 0} such that for each \(D\in {\mathscr{B}}(X),D_{1}\in {\mathscr{B}}(A),D_{2}\in {\mathscr{B}}(B)\) and n ≥ 0, we have

$$ \mathbb{P}_{x}^{\pi^{1}, \pi^{2}}(X_{0}=x)=1, $$
$$ \mathbb{P}_{x}^{\pi^{1}, \pi^{2}}(A_{n}\in D_{1},B_{n}\in D_{2}|h_{n})={\pi_{n}^{1}}(D_{1}|h_{n}){\pi_{n}^{2}}(D_{2}|h_{n}), $$
$$ \mathbb{P}_{x}^{\pi^{1}, \pi^{2}}(T_{n+1}-T_{n}\leq t ,X_{n+1}\in {D} | h_{n}, a_{n}, b_{n})=Q(t,D|x_{n},a_{n},b_{n}). $$

Here and in what follows, we denote by \(\mathbb {E}_{x}^{\pi ^{1}, \pi ^{2}}\) the expectation operator with respect to \(\mathbb {P}_{x}^{\pi ^{1}, \pi ^{2}}\).

To rule out infinitely many decision epochs within a finite time interval, we impose the following assumption on the semi-Markov kernel, which is also used in Lal and Sinha (1992), Luque-Vásquez (2002a, 2002b), and the references therein.

Assumption 1

There exist constants 𝜃 > 0 and δ > 0 such that

$$H(\theta| x, a, b) \leq 1-\delta, \quad \forall(x, a, b) \in K.$$

Proposition 1

If Assumption 1 holds, then for each fixed xX and \(\pi ^{1}\in {\Pi }_{1},\pi ^{2}\in {\Pi }_{2}\), we have

$$ \mathbb{P}_{x}^{\pi^{1}, \pi^{2}}(\lim\limits_{n\to\infty}T_{n}=\infty )=1. $$

The proof of Proposition 1 is provided in Appendix B.

Since \(T_{n}\overset {P}{\longrightarrow }\infty \), it is not required to consider the processes for \(t {>} T_{\infty }=\lim \limits _{n\rightarrow \infty } T_{n}\). Now, we establish an underlying continuous-time state-action process {X(t),A(t),B(t),t ≥ 0}, which corresponds to the stochastic process {Tn,Xn,An,Bn,n ≥ 0} with probability space \(({\Omega }, \mathcal {F}, \mathbb {P}_{x}^{\pi ^{1}, \pi ^{2}})\),

$$ X(t)=\sum\limits_{n=0}^{\infty}\mathbbm{1}_{\{T_{n}\leq t<T_{n+1}\}}X_{n}, $$
$$ A(t)=\sum\limits_{n=0}^{\infty}\mathbbm{1}_{\{T_{n}\leq t<T_{n+1}\}}A_{n}, $$
$$ B(t)=\sum\limits_{n=0}^{\infty}\mathbbm{1}_{\{T_{n}\leq t<T_{n+1}\}}B_{n}, $$

where \(\mathbbm {1}_{E}\) is an indicator function on any set E.

Definition 3

The stochastic process {X(t),A(t),B(t),t ≥ 0} is called a semi-Markov game.

Next, we define the expected discounted criterion considered in this paper.

Definition 4

For each \((\pi ^{1}, \pi ^{2})\in {\Pi }_{1} \times {\Pi }_{2}\), the initial state x ∈ X and discount factor α(⋅) > 0, the expected discounted reward/payoff for P1/P2 is defined as follows:

$$ V(x, \pi^{1}, \pi^{2}):= \mathbb{E}_{x}^{\pi^{1}, \pi^{2}}\Big[{\int}_{0}^{\infty} e^{-{{\int}_{0}^{t}}\alpha(X(s),A(s),B(s))ds} r(X(t), A(t), B(t))dt\Big]. $$
(1)

P1 aims to maximize the reward while P2 aims to minimize the payoff. Both players aim to find an optimal strategy.
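For intuition, when the model primitives can be sampled, the criterion (1) can be estimated by simulation. The following Python sketch is a minimal illustration under additional assumptions: finite state and action sets, a pair of stationary strategies given as probability vectors, a user-supplied sampler sample_sojourn_and_next(x, a, b) for the semi-Markov kernel Q (a hypothetical interface), and truncation of the infinite horizon after a fixed number of decision epochs.

```python
import numpy as np

def estimate_discounted_value(x0, phi1, phi2, sample_sojourn_and_next,
                              r, alpha, horizon=200, episodes=2000, rng=None):
    """Monte Carlo estimate of V(x0, phi1, phi2) in (1) for finite X, A, B.

    phi1[x], phi2[x]  : probability vectors over A(x) and B(x) (stationary strategies)
    sample_sojourn_and_next(x, a, b) -> (tau, y) : one draw of sojourn time and next
        state from the semi-Markov kernel Q(., .|x, a, b) (user-supplied)
    r(x, a, b), alpha(x, a, b) : reward rate and discount factor.
    """
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(episodes):
        x, discount_exponent, payoff = x0, 0.0, 0.0
        for _ in range(horizon):
            a = rng.choice(len(phi1[x]), p=phi1[x])
            b = rng.choice(len(phi2[x]), p=phi2[x])
            tau, y = sample_sojourn_and_next(x, a, b)
            al = alpha(x, a, b)
            # Reward accrues at constant rate r over the sojourn, discounted
            # continuously from time 0:  e^{-Lambda} * r * (1 - e^{-alpha*tau}) / alpha.
            payoff += np.exp(-discount_exponent) * r(x, a, b) * (1.0 - np.exp(-al * tau)) / al
            discount_exponent += al * tau
            x = y
        total += payoff
    return total / episodes
```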

Definition 5

The upper value and lower value of the expected discounted SMG are defined as

$$ U(x):= \underset{\pi^{2}\in {\Pi}_{2}}{\inf}\underset{\pi^{1}\in {\Pi}_{1}}{\sup}V(x, \pi^{1}, \pi^{2})~\text{and}~L(x):= \underset{\pi^{1}\in {\Pi}_{1}}{\sup} \underset{\pi^{2}\in {\Pi}_{2}}{\inf}V(x, \pi^{1}, \pi^{2}), $$

respectively. Obviously, U(x) ≥ L(x) for all x ∈ X. Moreover, if it holds that L(x) = U(x) for all x ∈ X, then the common function is called the value function of the game and denoted by V∗.

Definition 6

Assume that the game has a value function V∗(⋅). Then a strategy \(\pi ^{1}_{*}\in {\Pi }_{1}\) is said to be optimal for P1 if

$$ \underset{\pi^{2}\in {\Pi}_{2}}{\inf} V(x, \pi^{1}_{*}, \pi^{2})=V^{*}(x),~~ \forall x\in X. $$

Similarly, \(\pi ^{2}_{*}\in {\Pi }_{2}\) is said to be optimal for P2 if

$$ \underset{\pi^{1}\in {\Pi}_{1}}{\sup} V(x, \pi^{1}, \pi^{2}_{*})=V^{*}(x),~~ \forall x\in X. $$

If \(\pi ^{i}_{*}\) is optimal for player i (i = 1,2), then we call \((\pi ^{1}_{*}, \pi ^{2}_{*})\) an optimal pair of strategies.

Remark 2

\((\pi ^{1}_{*}, \pi ^{2}_{*})\) is an optimal pair of strategies if and only if

$$ V(x, \pi^{1}, \pi^{2}_{*})\leq V(x, \pi^{1}_{*}, \pi^{2}_{*})\leq V(x, \pi^{1}_{*}, \pi^{2}),~~~ \forall x\in X, \pi^{1}\in {\Pi}_{1}, \pi^{2}\in {\Pi}_{2}. $$

Remark 2 provides an effective way to verify whether a pair of strategies (π1,π2) is an optimal pair of strategies, and it is widely used in the literature on two-person zero-sum stochastic games; see, for instance, González-Sánchez et al. (2019), Luque-Vásquez (2002a, 2002b), and the references therein.

3 Main results

This section focuses on the existence of the value function and an optimal pair of stationary strategies, which requires imposing suitable assumptions on the model parameters. We first introduce some notation for convenience.

Given a measurable function \(\omega : X\rightarrow [1, \infty )\), a function u on X is said to be ω-bounded if it has finite ω-norm which is defined as

$$ \|u\|_{\omega}:=\underset{x\in X}{\sup}\frac{|u(x)|}{\omega(x)}. $$

Such a function ω is referred to as a weight function. For convenience, we denote by Bω(X) the Banach space of all ω-bounded measurable functions on X.

For each given function u ∈ Bω(X) and (x,a,b) ∈ K, we write

$$ G(u, x, a, b):=r(x, a, b){\int}_{0}^{\infty} e^{-\alpha(x,a,b) t}(1-H(t| x, a, b))dt+{\int}_{0}^{\infty} e^{-\alpha(x,a,b) t}{\int}_{X} u(y) Q(d t, d y | x, a, b). $$
(2)

For each fixed x ∈ X and probability measures \(\mu \in \mathbb {A}(x) := \mathbb {P}(A(x)) \text { and } \lambda \in \mathbb {B}(x) := \mathbb {P}(B(x))\), we denote

$$ G(u, x, \mu, \lambda):={\int}_{A(x)} {\int}_{B(x)} G(u, x, a, b) \mu(d a) \lambda(d b), $$

whenever the integral is well defined.

We define an operator T on Bω(X) by

$$ Tu(x):=\underset{\mu \in \mathbb{A}(x)}{\sup} \underset{\lambda \in \mathbb{B}(x)}{\inf} G(u, x, \mu, \lambda),\quad \forall x \in X, $$
(3)

which is called the Shapley operator. A function v ∈ Bω(X) is said to be a solution of the Shapley equation if

$$ Tv(x)=v(x), \quad \forall x \in X. $$

In order to explore the existence of an optimal pair of stationary strategies, we also need to define a stationary-strategy-dependent operator T(φ1,φ2) on Bω(X) by

$$ T(\varphi_{1},\varphi_{2})u(x):=G(u, x, \varphi_{1}(\cdot|x), \varphi_{2}(\cdot|x)),\quad \forall x \in X, $$

where (φ1,φ2) ∈Φ1 ×Φ2 is a pair of stationary strategies.

Next, we give some hypotheses to guarantee the existence of an optimal pair of stationary strategies. The framework established by these hypotheses has become quite standard for the study of semi-Markov models (Chen et al. 2021; Huang and Guo 2011, 2010; Lal and Sinha 1992; Luque-Vásquez 2002a, 2002b; Minjárez-Sosa and Luque-Vásquez 2008) and varying discount factors (Feinberg and Shwartz 1994; González-Hernández et al. 2009; González-Sánchez et al. 2019; Minjárez-Sosa 2015; Wei and Guo 2011; Ye and Guo 2012; Wu and Zhang 2016).

Assumption 2

(a) There exists a constant α0 > 0 such that α(x,a,b) ≥ α0 for all (x,a,b) ∈ K. (b) There exists a measurable function \(\omega : X\rightarrow [1, \infty )\) and a non-negative constant M such that for all (x,a,b) ∈ K,

$$ |r(x,a,b)|\leq M\omega(x). $$

The key point is that Assumption 2 entails a finiteness property of expected discounted reward. Below we give an important consequence of Assumption 1 and Assumption 2(a).

Lemma 1

If Assumptions 1&2(a) hold, then there exists a constant 0 < γ < 1 such that for each (x,a,b) ∈ K,

$$ {\int}_{0}^{\infty} e^{-\alpha(x,a,b) t} H(d t|x, a, b) \leqslant \gamma $$
(4)

The proof of Lemma 1 is provided in Appendix A.
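For intuition, the following sketch indicates why such a γ exists (the precise argument is given in Appendix A): splitting the integral at the constant θ of Assumption 1 and using α(x,a,b) ≥ α0 from Assumption 2(a),

$$ \begin{aligned} \int_{0}^{\infty} e^{-\alpha(x,a,b) t}\, H(dt|x,a,b) &\le \int_{0}^{\theta} H(dt|x,a,b) + e^{-\alpha_{0}\theta}\int_{\theta}^{\infty} H(dt|x,a,b)\\ &\le H(\theta|x,a,b)+e^{-\alpha_{0}\theta}\big(1-H(\theta|x,a,b)\big)\\ &= 1-\big(1-e^{-\alpha_{0}\theta}\big)\big(1-H(\theta|x,a,b)\big)\\ &\le 1-\delta\big(1-e^{-\alpha_{0}\theta}\big)<1, \end{aligned} $$

so one may take, for instance, \(\gamma := 1-\delta (1-e^{-\alpha _{0}\theta })\).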

Assumption 3

There exists a constant η with 0 < ηγ < 1 such that for each fixed t ≥ 0 and (x,a,b) ∈ K,

$$ {\int}_{X} \omega(y) Q(t,dy|x,a,b) \leq \eta\omega(x)H(t|x,a,b), $$
(5)

where ω(⋅) is the function mentioned in Assumption 2.

Remark 3

  1. (1)

    We call Assumption 3 the “drift condition”, which is needed to ensure that the Shapley operator (defined in (3)) is a contraction operator with respect to the weighted norm and hence to establish our main results. Obviously, Assumption 3 holds naturally when r is bounded, by taking ω(⋅) ≡ 1 and η = 1.

  2. (2)

    In particular, if we set Q(t,y|x,a,b) = H(t|x,a,b)P(y|x,a,b), where P(y|x,a,b) denotes the state transition probability, then (5) degenerates into \({\int \limits }_{X} \omega (y) P(dy|x,a,b) \leq \eta \omega (x)\), which is the same as Assumption 3(b) of Luque-Vásquez (2002a) and Assumption 1(e) of González-Sánchez et al. (2019). Thus, our Assumption 3 is more general than its counterparts in the literature, and Example 3 of González-Sánchez et al. (2019) about SMGs is a special case of ours.

  3. (3)

    Combining Lemma 1 with Assumption 3, it is easy to derive (a short verification is sketched after this remark)

    $$ {\int}_{0}^{\infty} e^{-\alpha(x,a,b) t}{\int}_{X} u(y) Q(d t, d y | x, a, b)\leq \eta\gamma\|u\|_{\omega}\omega(x),\quad\forall u\in B_{\omega}(X),(x,a,b)\in K. $$
    (6)
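As a brief verification of (6) (a sketch under Assumptions 1-3): since |u(y)| ≤ ‖u‖ω ω(y) by the definition of the ω-norm, and since the identity \(e^{-\alpha t}=\alpha {\int }_{t}^{\infty }e^{-\alpha s}ds\) together with Fubini’s theorem allows the bound (5) on distribution functions to be integrated against e−α(x,a,b)t, we obtain

$$ \begin{aligned} \int_{0}^{\infty} e^{-\alpha(x,a,b)t}\int_{X}|u(y)|\,Q(dt,dy|x,a,b) &\le \|u\|_{\omega}\,\alpha(x,a,b)\int_{0}^{\infty} e^{-\alpha(x,a,b)s}\Big[\int_{X}\omega(y)\,Q(s,dy|x,a,b)\Big]ds\\ &\le \eta\,\omega(x)\,\|u\|_{\omega}\,\alpha(x,a,b)\int_{0}^{\infty} e^{-\alpha(x,a,b)s}H(s|x,a,b)\,ds\\ &= \eta\,\omega(x)\,\|u\|_{\omega}\int_{0}^{\infty} e^{-\alpha(x,a,b)t}H(dt|x,a,b) \le \eta\gamma\,\|u\|_{\omega}\,\omega(x), \end{aligned} $$

where the last inequality follows from Lemma 1.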

Moreover, we impose the following continuity-compactness conditions to ensure the existence of an optimal pair of stationary strategies of our SMG model.

Assumption 4

  1. (a)

    For each fixed x ∈ X, both A(x) and B(x) are compact sets.

  2. (b)

    For each fixed (x,a,b) ∈ K, r(x,⋅,b) is upper semi-continuous on A(x) and r(x,a,⋅) is lower semi-continuous on B(x).

  3. (c)

    For each fixed (x,a,b) ∈ K, t ≥ 0 and v ∈ Bω(X), the functions

    $$ a \longmapsto \int v(y) Q(t,d y | x, a, b) \quad \text { and } \quad b \longmapsto \int v(y) Q(t,d y | x, a, b) $$

    are continuous on A(x) and B(x), respectively.

  4. (d)

    For each fixed t ≥ 0, H(t|⋅,⋅,⋅) is continuous on K.

  5. (e)

    The function α(x,a,b) is continuous on K.

Remark 4

  1. (1)

    Assumption 4 is similar to the standard continuity-compactness hypotheses for Markov control processes; see, for instance, Hernández-Lerma and Lasserre (2012b), and the references therein. It is commonly used for the existence of minimax points of games.

  2. (2)

    By Lemma 1.11 in Nowak (1984), if Assumption 4(a) holds, then the probability spaces \(\mathbb {A}(x) \text { and } \mathbb {B}(x)\) are also compact for each x ∈ X.

  3. (3)

    These continuity-compactness conditions are mainly intended for infinite state and action spaces; they hold trivially when X and A, B are finite.

Now, we present our main result, Theorem 1 below, which extends the analysis given in Luque-Vásquez (2002a) for SMGs with a constant discount factor to SMGs with a joint state-time distribution and state-action-dependent discount factors.

Theorem 1

Suppose that Assumptions 1-4 hold. Then

  1. (a)

    The SMG has a value function V∗(⋅), which is the unique function in Bω(X) that satisfies the Shapley equation, i.e.,

    $$ V^{*}(x)=TV^{*}(x),\quad\forall x\in X, $$

    and furthermore, there exists an optimal pair of stationary strategies.

  2. (b)

    A pair of stationary strategies \((\varphi _{1}^{*}, \varphi _{2}^{*}) \in {\Phi }_{1} \times {\Phi }_{2}\) is optimal if and only if its expected discounted reward satisfies the Shapley equation, i.e., \(TV(x,\varphi _{1}^{*},\varphi _{2}^{*})=V(x,\varphi _{1}^{*},\varphi _{2}^{*})\) for all x ∈ X.

4 Preliminaries and proofs

In this section, we present some results needed to prove Theorem 1. Some of these results are already known in the literature (Luque-Vásquez 2002a, 2002b), but we state them here for completeness and ease of reference.

Lemma 2

Suppose that Assumptions 1-4 hold. Then for each given function u ∈ Bω(X), the function Tu is in Bω(X) and

$$ Tu(x)=\underset{\lambda \in \mathbb{B}(x)}{\min} \underset{\mu \in \mathbb{A}(x)}{\max} G(u,x, \mu, \lambda). $$
(7)

Moreover, there exists a pair of stationary strategies (φ1,φ2) ∈Φ1 ×Φ2 such that

$$ \begin{array}{@{}rcl@{}} T u(x) &=&G(u, x, \varphi_{1}(\cdot|x), \varphi_{2}(\cdot|x)) \\ &=&\underset{\mu \in \mathbb{A}(x)}{\max} G(u, x, \mu, \varphi_{2}(\cdot|x)) \\ &=&\underset{\lambda \in \mathbb{B}(x)}{\min} G(u, x, \varphi_{1}(\cdot|x), \lambda). \end{array} $$
(8)

Lemma 3

Both T and T(φ1,φ2) are contraction operators with modulus less than 1.

Since T and T(φ1,φ2) are both contraction operators, by Banach’s fixed point theorem there exist unique functions u∗ ∈ Bω(X) and \(u_{\varphi _{1},\varphi _{2}}^{*}\in B_{\omega }(X)\) such that Tu∗(⋅) = u∗(⋅) and \(T(\varphi _{1},\varphi _{2})u_{\varphi _{1},\varphi _{2}}^{*}(\cdot )=u_{\varphi _{1},\varphi _{2}}^{*}(\cdot )\).

Before stating the next important result, we give the definition of the m-shift strategy (Hernández-Lerma and Lasserre 2012b, P.96).

Definition 7

Given a strategy \(\pi ^{i}=\left \{{\pi _{n}^{i}},n=0,1,\ldots \right \}\in {\Pi }_{i}, i=1,2,\) and an integer m ≥ 0, the corresponding m-shift strategy \(^{(m)}\!\pi ^{i}=\left \{{~}^{(m)}\!{\pi ^{i}_{n}}, n=0,1,\ldots \right \}\) is given by

$$ {~}^{(m)}\!{\pi^{i}_{0}}(\cdot|x_{m}):={\pi_{m}^{i}}(\cdot|h_{m}), $$

and for n = 1,2,…,

$$ {~}^{(m)}\!{\pi^{i}_{n}}(\cdot|x_{m},a_{m},b_{m},\ldots,x_{m+n}):=\pi_{m+n}^{i}(\cdot|h_{m},a_{m},b_{m},\ldots,x_{m+n}), $$

where hm := (t0,x0,a0,b0,…,tm−1,xm−1,am−1,bm−1,tm,xm) denotes the admissible history at the m-th decision epoch.

Lemma 4

For each \((\pi ^{1}, \pi ^{2})\in {\Pi }_{1} \times {\Pi }_{2}\) and x ∈ X,

$$ V(x,\pi^{1}, \pi^{2})=T(\pi_{0}^{1\infty}, \pi_{0}^{2\infty})V(x,^{(1)}\!\pi^{1},^{(1)}\!\pi^{2}) $$

where \(^{(1)}\!\pi ^{i}:=({\pi _{n}^{i}},n\geq 1)\) is the 1-shift strategy of \(\pi ^{i}:=({\pi _{n}^{i}},n\geq 0)\), \(\pi _{0}^{i\infty }=({\pi _{0}^{i}},{\pi _{0}^{i}},\cdots )\), i = 1,2, and \((\pi _{0}^{1\infty }, \pi _{0}^{2\infty })\) is a pair of stationary strategies.

In particular, taking π1 = φ1 and π2 = φ2 to be stationary strategies, Lemma 4 yields

$$ V(x,\varphi_{1},\varphi_{2})=T(\varphi_{1},\varphi_{2})V(x,\varphi_{1},\varphi_{2}),~~~~\forall x\in X, $$

which implies that the function V (x,φ1,φ2) is the unique fixed point of the contraction operator T(φ1,φ2).

Lemma 5

Suppose that Assumptions 1-3 hold and let \((\pi ^{1}, \pi ^{2})\in {\Pi }_{1} \times {\Pi }_{2}\). Then for each x ∈ X and u ∈ Bω(X), we have

$$ \underset{n\to\infty}{\lim}\mathbb{E}_{x}^{\pi^{1}, \pi^{2}}\Big[e^{-{\int}_{0}^{T_{n}}\alpha(X(s),A(s),B(s))ds}u(X_{n})\Big]=0. $$

Proofs of Lemmas 2-5 are provided in Appendix A. Next, a complete proof of Theorem 1 is given by applying Lemmas 2-5.

Proof of Theorem 1 (a) Let u∗ be the unique fixed point of T in Bω(X), that is,

$$ u^{*}(x)=Tu^{*}(x),~~~~\forall x\in X. $$

By Lemma 2, there exists a pair of stationary strategies (φ1,φ2) ∈Φ1 ×Φ2 such that for each x ∈ X,

$$ \begin{array}{@{}rcl@{}} Tu^{*}(x) &=&G(u^{*}, x, \varphi_{1}(\cdot|x), \varphi_{2}(\cdot|x))\\ &=&\underset{\mu \in \mathbb{A}(x)}{\max} G(u^{*}, x, \mu, \varphi_{2}(\cdot|x))\\ &=&\underset{\lambda \in \mathbb{B}(x)}{\min} G(u^{*}, x, \varphi_{1}(\cdot|x), \lambda) , \end{array} $$
(9)

which implies that

$$ u^{*}(x)=G(u^{*}, x, \varphi_{1}(\cdot|x), \varphi_{2}(\cdot|x))=T(\varphi_{1},\varphi_{2})u^{*}(x),\quad \forall x\in X. $$

Moreover, by Lemma 4,

$$ V(x,\varphi_{1},\varphi_{2})=T(\varphi_{1},\varphi_{2})V(x,\varphi_{1},\varphi_{2}),\quad \forall x\in X, $$

from which we can derive

$$ u^{*}(x)=V(x,\varphi_{1},\varphi_{2}),\quad \forall x\in X. $$

Next, we prove that u∗ is the value function of the game and (φ1,φ2) is an optimal pair of stationary strategies, that is,

$$ V(x,\varphi_{1},\pi^{2})\geq V(x,\varphi_{1},\varphi_{2})\geq V(x,\pi^{1},\varphi_{2})~,~~~\forall (\pi^{1},\pi^{2})\in {\Pi}_{1} \times {\Pi}_{2}. $$
(10)

We only prove the first inequality in (10); the second inequality follows by a similar argument. By (9), we have

$$ u^{*}(x)\leq G(u^{*},x,\varphi_{1}(\cdot|x),\lambda)~,~~~\forall \lambda \in \mathbb{B}(x). $$

In particular, for each b ∈ B(x), let λ be the Dirac measure concentrated at b. Then we have

$$ \begin{array}{@{}rcl@{}} u^{*}(x)&\leq& {\int}_{A(x)}\left\{r(x,a,b){\int}_{0}^{\infty}e^{-\alpha(x,a,b)t}\Big[1-H(t| x, a, b)\Big]dt\right.\\ && +\left.{\int}_{0}^{\infty}e^{-\alpha(x,a,b)t}\Big [{\int}_{X}u^{*}(y)Q(d t,dy| x, a, b)\Big] \right\}\varphi_{1}(da|x). \end{array} $$

For each given \(h_{n}=(t_{0},x_{0},a_{0},b_{0},\ldots ,t_{n},x_{n})\in {\mathscr{H}}_{n}\) and bn ∈ B(xn), we have

$$ \begin{array}{@{}rcl@{}} u^{*}(x_{n})&\leq& {\int}_{A(x_{n})}\left\{r(x_{n},a_{n},b_{n}){\int}_{0}^{\infty}e^{-\alpha (x_{n},a_{n},b_{n})t}\Big[1-H(t| x_{n},a_{n},b_{n})\Big]dt\right.\\ &&+\left.{\int}_{0}^{\infty}e^{-\alpha(x_{n},a_{n},b_{n})t}\Big [{\int}_{X}u^{*}(y)Q(d t,dy|x_{n},a_{n},b_{n})\Big] \right\}\varphi_{1}(da_{n}|X_{n}=x_{n}). \end{array} $$

For any \(\pi ^{2}\in {\Pi }_{2}\), integrating over bn on both sides of the above inequality, we have

$$ \begin{array}{@{}rcl@{}} u^{*}(x_{n})&\leq& {\int}_{B(x_{n})}{\int}_{A(x_{n})}\left\{{\int}_{0}^{\infty}e^{-\alpha(x_{n},a_{n},b_{n})t}\Big [{\int}_{X}u^{*}(y)Q(dt,dy|x_{n},a_{n},b_{n})\Big]+\right.\\ &&\left.r(x_{n},a_{n},b_{n}){\int}_{0}^{\infty}e^{-\alpha(x_{n},a_{n},b_{n})t}\Big[1-H(t| x_{n},a_{n},b_{n})\Big]dt \right\}\varphi_{1}(da_{n}|X_{n}=x_{n}){\pi_{n}^{2}}(db_{n}|H_{n}=h_{n})\\ &=&\mathbb{E}_{x}^{\varphi_{1}, \pi^{2}}\Big[e^{-{\int}_{T_{n}}^{T_{n+1}}\alpha(X(s),A(s),B(s))ds}u^{*}(X_{n+1})|H_{n}=h_{n}\Big]\\ &&+ \mathbb{E}_{x}^{\varphi_{1}, \pi^{2}}\Big[{\int}_{T_{n}}^{T_{n+1}} e^{-{\int}_{T_{n}}^{t}\alpha(X(s),A(s),B(s))ds} r(X(t), A(t), B(t))dt|H_{n}=h_{n}\Big]. \end{array} $$

Since the above inequality holds for all \(h_{n}\in {\mathscr{H}}_{n}\), replacing xn by the random variable Xn yields

$$ \begin{array}{@{}rcl@{}} u^{*}(X_{n}) &\leq&\mathbb{E}_{x}^{\varphi_{1},\pi^{2}}\Big[e^{-{\int}_{T_{n}}^{T_{n+1}}\alpha(X(s),A(s),B(s))ds}u^{*}(X_{n+1})|H_{n}\Big]\\ &&+\mathbb{E}_{x}^{\varphi_{1}, \pi^{2}}\Big[{\int}_{T_{n}}^{T_{n+1}} e^{-{\int}_{T_{n}}^{t}\alpha(X(s),A(s),B(s))ds} r(X(t), A(t), B(t))dt|H_{n}\Big]. \end{array} $$

Multiplying both sides of the above inequality by \(e^{-{\int \limits }_{0}^{T_{n}}\alpha (X(s), A(s), B(s))ds}\) and using the properties of conditional expectation, we have

$$ \begin{array}{@{}rcl@{}} e^{-{\int}_{0}^{T_{n}}\alpha(X(s),A(s),B(s))ds}u^{*}(X_{n})&\leq& \mathbb{E}_{x}^{\varphi_{1}, \pi^{2}}\Big[e^{-{\int}_{0}^{T_{n+1}}\alpha(X(s),A(s),B(s))ds}u^{*}(X_{n+1})|H_{n}\Big]\\ &&+\mathbb{E}_{x}^{\varphi_{1}, \pi^{2}}\Big[{\int}_{T_{n}}^{T_{n+1}} e^{-{{\int}_{0}^{t}}\alpha(X(s),A(s),B(s))ds} r(X(t), A(t), B(t))dt|H_{n}\Big]. \end{array} $$

Then, taking the expectation \(\mathbb {E}_{x}^{\varphi _{1}, \pi ^{2}}\), we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{x}^{\varphi_{1}, \pi^{2}}\Big[e^{-{\int}_{0}^{T_{n}}\alpha(X(s),A(s),B(s))ds}u^{*}(X_{n})\Big]&\leq& \mathbb{E}_{x}^{\varphi_{1}, \pi^{2}}\Big[e^{-{\int}_{0}^{T_{n+1}}\alpha(X(s),A(s),B(s))ds}u^{*}(X_{n+1})\Big] \\ &&+\mathbb{E}_{x}^{\varphi_{1}, \pi^{2}}\Big[{\int}_{T_{n}}^{T_{n+1}} e^{-{{\int}_{0}^{t}}\alpha(X(s),A(s),B(s))ds} r(X(t), A(t), B(t))dt\Big]. \end{array} $$

Now, summing over \(n=0,1,2,\dots ,N\), we obtain

$$ \begin{array}{@{}rcl@{}} u^{*}(x)&\leq& \mathbb{E}_{x}^{\varphi_{1}, \pi^{2}}\Big[{\int}_{0}^{T_{N+1}} e^{-{{\int}_{0}^{t}}\alpha(X(s),A(s),B(s))ds}r(X(t), A(t), B(t))dt\Big]\\ &&+\mathbb{E}_{x}^{\varphi_{1}, \pi^{2}}\Big[e^{-{\int}_{0}^{T_{N+1}}\alpha(X(s),A(s),B(s))ds}u^{*}(X_{N+1})\Big]. \end{array} $$

Letting \(N\rightarrow \infty \), according to Lemma 5, we derive

$$ u^{*}(x)\leq \mathbb{E}_{x}^{\varphi_{1}, \pi^{2}}\Big[{\int}_{0}^{\infty} e^{-{{\int}_{0}^{t}}\alpha(X(s),A(s),B(s))ds}r(X(t), A(t), B(t))dt\Big], $$

which means that the first inequality in (10) holds.

(b) (⇒)

Suppose that \((\varphi _{1}^{*}, \varphi _{2}^{*}) \in {\Phi }_{1} \times {\Phi }_{2}\) is an optimal pair of stationary strategies. Then for each \(x\in X,(\pi ^{1},\pi ^{2}) \in {\Pi }_{1}\times {\Pi }_{2}\), we have

$$ V(x,\varphi_{1}^{*},\pi^{2})\geq V(x,\varphi_{1}^{*},\varphi_{2}^{*})\geq V(x,\pi^{1},\varphi_{2}^{*}). $$

For each fixed \(\lambda \in \mathbb {B}(x)\), let \(\pi ^{2}=\{{\pi _{n}^{2}},n\geq 0\}\) with \({\pi _{0}^{2}}=\lambda \) and \({\pi _{n}^{2}}=\varphi _{2}^{*}, n\geq 1\), then by Lemma 4, for each xX, we have

$$ V(x,\varphi_{1}^{*},\varphi_{2}^{*}) \leq V(x,\varphi_{1}^{*},\pi^{2})=T(\varphi_{1}^{*},\lambda)V(x,\varphi_{1}^{*},\varphi_{2}^{*}), $$

which yields

$$ V(x,\varphi_{1}^{*},\varphi_{2}^{*})\leq \underset{\lambda \in \mathbb{B}(x)}{\min} T(\varphi_{1}^{*},\lambda)V(x,\varphi_{1}^{*},\varphi_{2}^{*})\leq TV(x,\varphi_{1}^{*},\varphi_{2}^{*}). $$

Similarly, we can prove

$$V(x,\varphi_{1}^{*},\varphi_{2}^{*})\geq TV(x,\varphi_{1}^{*},\varphi_{2}^{*}). $$

Combining the last two inequalities, we obtain the desired result.

(⇐)

This direction has already been established in the proof of part (a).

5 Algorithm

In this section, for the feasible implementation of algorithms, the state and action spaces are assumed to be finite. To avoid excessive symbols, we still use the ω-norm with ω(⋅) ≡ 1, which coincides with the infinity norm, i.e., \(\|u\|_{\omega }=\sup \limits _{x\in X}|u(x)|=\|u\|_{\infty },~\forall u\in B_{\omega }(X)\). We develop an iterative algorithm to approximate the value function and an optimal pair of stationary strategies of our two-person zero-sum SMG, in which a matrix game is solved numerically at every state within a value iteration scheme.

Without loss of generality, we assume that \(A(x):=\{a_{1},a_{2},\dots ,a_{m_{1}}\}\) and \(B(x):=\{b_{1},b_{2},\dots ,b_{m_{2}}\}\), for any \(x\in X :=\{x_{0},x_{1},\dots ,x_{n-1}\}\). Under Assumptions 1-4 presented in Section 3, we rewrite the Shapley equation in a matrix form as follows

$$ \begin{array}{@{}rcl@{}} V^{*}(x) &=&\underset{\varphi_{2}(\cdot|x) \in \mathbb{B}(x)}{\min} \underset{\varphi_{1}(\cdot|x) \in \mathbb{A}(x)}{\max}G(V^{*},x, \varphi_{1}(\cdot|x), \varphi_{2}(\cdot|x))\\ &=& \underset{\boldsymbol\varphi_{2}(x) \in \mathbb{B}(x)}{\min} \underset{\boldsymbol\varphi_{1}(x) \in \mathbb{A}(x)}{\max}{\boldsymbol\varphi_{1}^{T}}(x)\boldsymbol{\mathcal G}(V^{*},x)\boldsymbol\varphi_{2}(x)\\ &=& \underset{\boldsymbol\varphi_{1}(x) \in \mathbb{A}(x)}{\max} \underset{\boldsymbol\varphi_{2}(x) \in \mathbb{B}(x)}{\min}{\boldsymbol\varphi_{1}^{T}}(x)\boldsymbol{\mathcal G}(V^{*},x)\boldsymbol\varphi_{2}(x),\quad\forall x\in X, \end{array} $$
(11)

where \(\boldsymbol {\mathcal G}(u,x)\) denotes an m1 × m2-dimensional matrix for each fixed x ∈ X and given function u ∈ Bω(X), with elements defined as

$$ \boldsymbol{\mathcal G}(u,x)_{ij}:=G(u,x, a_{i}, b_{j}), \quad i=1,2,\dots,m_{1}; \ j=1,2,\dots,m_{2}, $$

and \(\boldsymbol \varphi _{k}(x):=(\varphi _{k}(1|x),\dots ,\varphi _{k}(m_{k}|x))^{T}\) is an mk-dimensional column vector for k = 1,2. Obviously, the last equation of (11) can be viewed as a matrix game with reward matrix \(\boldsymbol {\mathcal G}(V^{*},x)\) at each state x ∈ X.
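For a given reward matrix, such a matrix game can be solved by a standard LP. The following Python sketch is one possible implementation (it uses SciPy's linprog as the LP solver, an implementation choice not prescribed here); it returns the value of the game and a maximin mixed strategy for P1, while an optimal mixed strategy for P2 can be obtained from the dual LP, or equivalently by applying the same routine to \(-\boldsymbol {\mathcal G}^{T}\).

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(G):
    """Value and a maximin mixed strategy of the row player (P1) for the matrix
    game with reward matrix G, where rows index P1's actions and columns P2's."""
    m1, m2 = G.shape
    # Decision variables: (f_1, ..., f_m1, v); linprog minimizes, so minimize -v.
    c = np.zeros(m1 + 1)
    c[-1] = -1.0
    # Constraints v <= sum_i G[i, j] * f_i for every column j, written as A_ub x <= 0.
    A_ub = np.hstack([-G.T, np.ones((m2, 1))])
    b_ub = np.zeros(m2)
    # Probabilities sum to one; v is a free variable.
    A_eq = np.concatenate([np.ones(m1), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * m1 + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[-1], res.x[:m1]   # game value, P1's mixed strategy
```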

However, we cannot directly solve (11) since the value function V∗ is unknown. Below, we develop Algorithm 1 to iteratively compute a series of matrix games whose values asymptotically approach V∗(x) at each state x. From lines 11-12 of Algorithm 1, we can see that at the n-th iteration, Vn(x) and \(({\boldsymbol \varphi _{1}^{n}}(x),{\boldsymbol \varphi _{2}^{n}}(x))\) are obtained by using LP to solve the game with reward matrix \(\boldsymbol {\mathcal G}(V_{n-1},x)\). This iterative procedure of computing a series of Vn is similar to the classic value iteration algorithm in MDP theory. Considering that iterative algorithms usually converge to approximate solutions within finitely many steps, we give the definition of ε-optimality.

Definition 8

Assume that the game has a value function V∗. Then a pair of strategies \((\pi ^{1}_{\varepsilon },\pi ^{2}_{\varepsilon })\in {\Pi }_{1}\times {\Pi }_{2}\) is said to be ε-optimal for the game if

$$ \| V(\cdot, \pi^{1}_{\varepsilon}, \pi^{2}_{\varepsilon})-V^{*}(\cdot)\|_{\omega}<\varepsilon. $$

Moreover, \(V_{\varepsilon }(\cdot ):=V(\cdot , \pi ^{1}_{\varepsilon }, \pi ^{2}_{\varepsilon })\) is called the ε-value function of the game.

Algorithm 1 Value iteration-type algorithm to solve the two-person zero-sum SMG.
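Since the pseudocode of Algorithm 1 is displayed as a figure, a compact Python sketch of the same value iteration-type procedure is provided below for reference; it reuses the solve_matrix_game helper sketched after (11), and G_eval(V, x), which returns the matrix \(\boldsymbol {\mathcal G}(V,x)\) of (2) for the model at hand, is a hypothetical user-supplied routine.

```python
import numpy as np

def value_iteration_smg(states, G_eval, eps=1e-4, max_iter=10000):
    """Value iteration-type scheme for a finite two-person zero-sum SMG.

    states  : list of states
    G_eval  : G_eval(V, x) -> (m1, m2) array with entries G(V, x, a_i, b_j), cf. (2)
    Returns an approximate value function and a pair of stationary strategies.
    """
    V = np.zeros(len(states))                 # initial guess (arbitrary)
    for _ in range(max_iter):
        V_new = np.empty_like(V)
        strat1, strat2 = [], []
        for k, x in enumerate(states):
            G = G_eval(V, x)
            val, f = solve_matrix_game(G)     # value and P1's maximin strategy
            _, g = solve_matrix_game(-G.T)    # P2's minimax strategy via the dual game
            V_new[k] = val                    # V_{n+1}(x): value of the matrix game
            strat1.append(f)
            strat2.append(g)
        if np.max(np.abs(V_new - V)) < eps:   # stopping criterion as in Theorem 2
            return V_new, strat1, strat2
        V = V_new
    return V, strat1, strat2
```

Here Vn+1(x) is set to the LP value of the matrix game, which coincides with \({\boldsymbol \varphi _{1}^{T}}(x)\boldsymbol {\mathcal G}(V_{n},x)\boldsymbol \varphi _{2}(x)\) at a saddle point.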

Furthermore, we derive Theorem 2 to guarantee the convergence of Algorithm 1.

Theorem 2

Under Algorithm 1, for any given 𝜖 > 0 and initial value \(V_{0}\in \mathbb {R}\), there exists a non-negative integer \(N_{\epsilon }=\big (1+\lfloor log_{\eta \gamma } (\frac {\epsilon }{\|TV_{0}-V_{0}\|_{\omega }}) \rfloor \big )\mathbbm {1}_{TV_{0}\neq V_{0}}\) such that \(\|V_{N_{\epsilon }+1}-V_{N_{\epsilon }}\|_{\omega }<\epsilon \), which implies that Algorithm 1 terminates within N𝜖 iterations. Moreover, the strategy pair \((\varphi _{1}^{\varepsilon },\varphi _{2}^{\varepsilon })\) output by Algorithm 1 is ε-optimal, where \(\varepsilon =\frac {\epsilon }{1-\eta \gamma }\).

Proof

According to the iterative formula of Algorithm 1, we have

$$ \|V_{n+1}-V_{n}\|_{\omega}=\|TV_{n}-TV_{n-1}\|_{\omega}\leq \eta\gamma \|V_{n}-V_{n-1}\|_{\omega},\quad\forall n\ge 1, $$

which by iteration yields

$$ \|V_{n+1}-V_{n}\|_{\omega} \leq (\eta\gamma)^{n}\|TV_{0}-V_{0}\|_{\omega},\quad\forall n\ge 0. $$

For each given 𝜖 > 0 and initial value \(V_{0}\in \mathbb {R}\), if TV0 = V0, choose N𝜖 = 0, and we have

$$ \|V_{N_{\epsilon}+1}-V_{N_{\epsilon}}\|_{\omega}=0<\epsilon, $$

otherwise, if TV0V0, choose \(N_{\epsilon }=1+\lfloor log_{\eta \gamma } (\frac {\epsilon }{\|TV_{0}-V_{0}\|_{\omega }}) \rfloor \), and we have

$$ \|V_{N_{\epsilon}+1}-V_{N_{\epsilon}}\|_{\omega}\leq (\eta\gamma)^{N_{\epsilon}}\|TV_{0}-V_{0}\|_{\omega}<\epsilon. $$

Combining the two cases above, choose \(N_{\epsilon }=\big (1+\lfloor log_{\eta \gamma } (\frac {\epsilon }{\|TV_{0}-V_{0}\|_{\omega }}) \rfloor \big )\mathbbm {1}_{TV_{0}\neq V_{0}}\) and we have \(\|V_{N_{\epsilon }+1}-V_{N_{\epsilon }}\|_{\omega }<\epsilon \), which implies that Algorithm 1 terminates within N𝜖 iterations.

Moreover, since V∗ is the unique solution of the Shapley equation, we have

$$ \|V_{n}-V^{*}\|_{\omega} \leq \|V_{n+1}-V^{*}\|_{\omega}+\|V_{n}-V_{n+1}\|_{\omega}\leq \eta\gamma\|V_{n}-V^{*}\|_{\omega}+\|V_{n}-V_{n+1}\|_{\omega} , $$

thus,

$$ \|V_{n}-V^{*}\|_{\omega} \leq \frac{\|V_{n}-V_{n+1}\|_{\omega}}{1-\eta\gamma}, $$

taking n = N𝜖, we have

$$ \|V_{N_{\epsilon}}-V^{*}\|_{\omega} < \frac{\epsilon}{1-\eta\gamma}=\varepsilon, $$

which implies that \((\varphi _{1}^{\varepsilon },\varphi _{2}^{\varepsilon })\) is ε-optimal by Definition 8. □
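For a quick illustration of the bound in Theorem 2 with hypothetical numbers (not those of Section 6), take ηγ = 0.9, \(\|TV_{0}-V_{0}\|_{\omega }=1\) and 𝜖 = 10^{-4}; then

$$ N_{\epsilon}=1+\big\lfloor \log_{0.9} 10^{-4} \big\rfloor=1+\lfloor 87.4\rfloor=88, $$

so Algorithm 1 terminates within at most 88 iterations, and the returned strategy pair is ε-optimal with \(\varepsilon =\frac {\epsilon }{1-\eta \gamma }=10^{-3}\).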

Therefore, with Algorithm 1, we can iteratively approximate the value function and an optimal pair of stationary strategies of our SMG problem by recursively solving a matrix game at each state x. Theorem 2 guarantees the convergence of Algorithm 1. We can implement Algorithm 1 to solve practical problems, as illustrated in the next section.

6 Numerical experiment

In this section, we conduct numerical examples to illustrate the main results derived in Sections 3-5. First, we give an example to demonstrate that Assumptions 1-4, which ensure the existence of the value function and an optimal pair of stationary strategies of SMGs, are easy to verify in practice.

Example 1

Consider a system modeled by the following SMG:

The state space is \(X:=(-\infty ,\infty )\) and the action spaces are \(A:=[\underline {a},\bar {a}],B:=[\underline {b},\bar {b}]\), with admissible action sets A(x) := A, B(x) := B for each x ∈ X. The semi-Markov kernel is given by:

$$ Q(t,y|x, a, b)={\Phi}\bigg(\frac{1+t}{2+t}y\bigg)F(t),\quad\forall (t,y,x,a,b)\in [0,+\infty)\times X\times K, $$

where Φ(⋅) and F(⋅) denote the cumulative distribution functions of a normal distribution with mean μ(x,a,b) and variance σ2(x,a,b) and an exponential distribution with parameter β(x,a,b), respectively. The reward rate function is denoted by r(x,a,b) := x2 + a + b. Moreover, the discount factor is defined as α(x,a,b) := e|x|+a+b.

Now, we verify that the conditions on the existence of an optimal pair of stationary strategies described in Assumptions 1-4 are satisfied in this example. To this end, we need the following hypothesis:

Assumption 5

(a) The function β(x,a,b) is continuous on K and has both a positive lower bound \(\underline {\beta }\) and a positive upper bound \(\bar {\beta }\); (b) \(\mu ^{2}(x,a,b)+\sigma ^{2}(x,a,b)\leq \frac {1}{4}x^{2}+\frac {e^{\underline {a}+\underline {b}-1}}{8(\bar {\beta }+e^{\underline {a}+\underline {b}})},~\forall (x,a,b)\in K\).

With this hypothesis, we directly have the following result.

Proposition 2

Suppose that Assumption 5 holds. Then Example 1 satisfies Assumptions 1-4, which means that the SMG has an optimal pair of stationary strategies.

The proof of Proposition 2 is provided in Appendix B.

Next, we give another example, an investment problem, to demonstrate how Algorithm 1 numerically computes the value function and an optimal pair of stationary strategies of the game.

Example 2

Consider an investment problem with three states {1,2,3}, which denote the benefit, medium and loss economic environments, respectively. At each state, the investor buys some assets while the market-maker sells. The interest rate depends on the economic environment as well as the amounts of assets that the investor buys and the market-maker sells. In state i ∈{1,2}, the investor buys a certain amount of assets from {ai1,ai2} and the market-maker sells from {bi1,bi2}, which leads to a reward rate r(i,a,b) to the investor and − r(i,a,b) to the market-maker, where a ∈{ai1,ai2}, b ∈{bi1,bi2}. Then the system moves to a new state j with probability p(j|i,a,b) after staying at state i for a random time that follows an exponential distribution with parameter β(i,a,b). In state 3, the investor buys a certain amount of assets from {a31,a32} and the market-maker sells from {b31,b32}, which leads to a reward rate r(3,a,b) to the investor and − r(3,a,b) to the market-maker, where a ∈{a31,a32}, b ∈{b31,b32}. Then the system moves to a new state j with probability p(j|3,a,b) after staying at state 3 for a random time uniformly distributed on [0,β(3,a,b)] with parameter β(3,a,b) > 0. For this system, the decision makers aim to find an optimal pair of stationary strategies.

We establish an SMG model to solve this investment problem. The state space is X = {1,2,3}, action spaces are A(i) = {ai1,ai2}, B(i) = {bi1,bi2} for each i ∈ X and the semi-Markov kernel Q is given by:

$$Q(t,j|i, a, b)=\left\{ \begin{array}{ll} (1-e^{-\beta (i,a,b)t})p(j|i,a,b) & {\text{if}\quad i\in \{1,2\}},\\ {\frac{t}{\beta(i, a, b)} p(j | i, a, b)} & {\text{if}\quad i=3, \ 0 \leq t \leq \beta(i, a, b)}, \\ {p(j | i, a, b)} & {\text {otherwise}}, \end{array} \right. $$

from which we can obtain

$$ \begin{array}{@{}rcl@{}} Q(dt,j|i,a,b)=\left\{ \begin{array}{ll} p(j|i,a,b)\beta (i,a,b)e^{-\beta(i, a, b)t}dt & {\text{if}\quad i\in \{1,2\}},\\ {\frac{dt}{\beta(i, a, b)} p(j | i, a, b)} & {\text{if}\quad i=3, \ 0 \leq t \leq \beta(i, a, b)}, \\ 0 & {\text{otherwise}}, \end{array} \right. \end{array} $$

and

$$H(t|i,a,b)=\left\{ \begin{array}{ll} 1-e^{-\beta (i,a,b)t} & {\text{if}\quad i\in \{1,2\}},\\ {\frac{t}{\beta(i, a, b)} } & {\text{if}\quad i=3, \ 0 \leq t \leq \beta(i, a, b)}, \\ 1 & {\text {otherwise}}. \end{array} \right. $$

Then by (2), we have

$$ G(u,i, a, b)=\left\{ \begin{array}{ll} \frac {r(i,a,b)}{\alpha (i,a,b)+\beta (i,a,b)}+\frac {\beta (i,a,b)}{\alpha (i,a,b)+\beta (i,a,b)}\sum\limits_{j=1}^{3}p(j|i,a,b)u(j) & {\text{if}\quad i\in \{1,2\}},\\ \frac {r(3,a,b)}{(\alpha (3,a,b))^{2}\beta (3,a,b)}\Big[\alpha (3,a,b)\beta (3,a,b)-1+e^{-\alpha (3,a,b)\beta (3,a,b)}\Big]\\ +\frac {1-e^{-\alpha (3,a,b)\beta (3,a,b)}}{\alpha (3,a,b)\beta (3,a,b)}\sum\limits_{j=1}^{3}p(j|3,a,b)u(j) & {\text{if}\quad i=3}. \end{array} \right. $$
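The closed form above translates directly into code. A minimal Python sketch follows (the function name and the parameter containers r, alpha, beta, p are illustrative and assumed to be indexed by state-action triples with the values of Table 1); assembling the resulting values into a 2 × 2 matrix over (a,b) ∈ A(i) × B(i) yields the routine G_eval used in the sketch after Algorithm 1.

```python
import numpy as np

def G_example2(V, i, a, b, r, alpha, beta, p):
    """Closed-form G(V, i, a, b) of Example 2: exponential sojourn times in
    states 1 and 2, uniform sojourn time on [0, beta(3,a,b)] in state 3.
    V maps each state j in {1, 2, 3} to its current value V(j)."""
    cont = sum(p[(j, i, a, b)] * V[j] for j in (1, 2, 3))   # sum_j p(j|i,a,b) V(j)
    al, be = alpha[(i, a, b)], beta[(i, a, b)]
    if i in (1, 2):
        return r[(i, a, b)] / (al + be) + be / (al + be) * cont
    return (r[(i, a, b)] / (al ** 2 * be) * (al * be - 1.0 + np.exp(-al * be))
            + (1.0 - np.exp(-al * be)) / (al * be) * cont)
```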

For the numerical calculation of this example, the values of the model parameters are shown in Table 1.

Table 1 The values of model parameters

With this data setting, by taking α0 = 0.25,δ = 0.1,𝜃 = 0.023, Assumptions 1-4 obviously hold since both the state and action spaces are finite. Thus, the existence of the value function and an optimal pair of stationary strategies of the SMG are ensured by Theorem 1. Next, we use Algorithm 1 to find the approximate value function and an ε-optimal pair of stationary strategies of the game. The detailed steps are listed as follows.

Step 1: Initialization.

Let n = 0, and V0(⋅) = 1; set a small threshold 𝜖 := 10^{-4}, and we have \(\varepsilon =\frac {\epsilon }{1-\eta \gamma }=0.33\).

Step 2: Iteration.

For n ≥ 0, (a,b) ∈ A(i) × B(i), we have

$$ u_{n}(i, a, b)=\frac {r(i,a,b)}{\alpha (i,a,b)+\beta (i,a,b)}+\frac{\beta (i,a,b)}{\alpha (i,a,b)+\beta (i,a,b)}\sum\limits_{j=1}^{3}p(j|i,a,b)V_{n}(j),\quad i=1,2, $$
$$ \begin{array}{@{}rcl@{}} u_{n}(3, a, b)&=&\frac {r(3,a,b)}{(\alpha (3,a,b))^{2}\beta (3,a,b)}\Big[\alpha (3,a,b)\beta (3,a,b)-1+e^{-\alpha (3,a,b)\beta (3,a,b)}\Big]\\ &&+\frac {1-e^{-\alpha (3,a,b)\beta (3,a,b)}}{\alpha (3,a,b)\beta (3,a,b)}\sum \limits_{j=1}^{3}p(j|3,a,b)V_{n}(j). \end{array} $$

Then, for each state i ∈{1,2,3}, we solve the linear program

$$ \begin{array}{@{}rcl@{}} &&\max\limits_{f\left( i,a_{i1}\right), f\left( i,a_{i2}\right),v}\quad v \\ &&\text{s.t.}\quad \left\{ \begin{array}{lr} v \leq u_{n}\left( i, a_{i1}, b_{ij}\right) f\left( i,a_{i1}\right)+u_{n}\left( i, a_{i2}, b_{ij}\right) f\left( i,a_{i2}\right),\quad j=1,2 \\ f\left( i,a_{i1}\right)+f\left( i,a_{i2}\right)=1 \\ f\left( i,a_{i1}\right) \geq 0, f\left( i,a_{i2}\right) \geq 0 , \end{array}\right. \end{array} $$
(12)

with the solution denoted by \({\pi _{n}^{1}}(\cdot | i)\) where \({\pi _{n}^{1}}(a_{i1}| i)=f(i,a_{i1})\), \({\pi _{n}^{1}}(a_{i2}| i)=f(i,a_{i2})\).

Also we solve the dual program of (12)

$$ \begin{array}{@{}rcl@{}} &&\min\limits_{g\left( i,b_{i1}\right), g\left( i,b_{i2}\right),z}\quad z \\ &&\text{s.t.}\quad \left\{ \begin{array}{lr} z \geq u_{n}\left( i, a_{ij}, b_{i1}\right) g\left( i,b_{i1}\right)+u_{n}\left( i, a_{ij}, b_{i2}\right) g\left( i,b_{i2}\right),\quad j=1,2\\ g\left( i,b_{i1}\right)+g\left( i,b_{i2}\right)=1 \\ g\left( i,b_{i1}\right) \geq 0, g\left( i,b_{i2}\right) \geq 0 , \end{array} \right. \end{array} $$

with the solution denoted by \({\pi _{n}^{2}}(\cdot | i)\) where \({\pi _{n}^{2}}(b_{i1}| i)=g(i,b_{i1})\), \({\pi _{n}^{2}}(b_{i2}| i)=g(i,b_{i2})\). We set

$$ V_{n+1}(i)=\sum\limits_{a\in A(i),b\in B(i)}u_{n}(i, a, b){\pi_{n}^{1}}(a| i){\pi_{n}^{2}}(b| i). $$

Step 3: Termination judgement.

If \(\underset {i=1,2,3}{\max \limits }|V_{n+1}(i)-V_{n}(i)|<\epsilon \), then the iteration stops, Vn is the ε-value function and \(({\pi _{n}^{1}},{\pi _{n}^{2}})\) is an ε-optimal pair of stationary strategies of the SMG. Otherwise, set n = n + 1 and go to Step 2.

We use Matlab to implement the iteration algorithm for this example. It takes about 10 seconds to stop at the 93rd iteration. The curves of the error of two successive iterations, the value function, and the optimal pair of stationary strategies of the players with respect to the iteration number are illustrated in Figs. 1, 2 and 3.

Fig. 1 The error of two successive iterations

Fig. 2 The value function of the game

Fig. 3 The optimal pair of stationary strategies \((\pi _{*}^{1},\pi _{*}^{2})\)

Based on the experimental results, we have the following observations:

  1. 1.

    When the state is benefit, the investor should take action a11 with probability 0.60217 and a12 with probability 0.39783, while the market-maker should take action b11 with probability 0.55737 and b12 with probability 0.44263;

  2. 2.

    When the state is medium, the investor should take action a21 with probability 0.87111 and a22 with probability 0.12889, while the market-maker should take action b21 with probability 0.77887 and b22 with probability 0.22113;

  3. 3.

    When the state is loss, the investor should always take action a31 while the market-maker should always take action b31;

  4. 4.

    If both the investor and the market-maker use their optimal strategies, the investor will obtain a profit of 12.6054 at the benefit state, 12.1271 at the medium state and 11.1653 at the loss state, while the market-maker will lose the same amounts, respectively.

Remark 5

In this example, we choose a uniformly distributed sojourn time at state 3 to show that arbitrary distributions are permitted for the sojourn time of semi-Markov processes. Other distributions can also be chosen for the sojourn time according to practical situations. Moreover, if all the sojourn times are exponentially distributed, the semi-Markov games degenerate into discrete-time Markov games.

7 Conclusion

In this paper, we concentrate on two-person zero-sum SMGs with the expected discounted criterion in which the discount factors are state-action-dependent. We first construct the SMG model with a fairly general problem setting. Then we impose suitable conditions on the model parameters, under which we establish the Shapley equation, whose unique solution is the value function, and prove the existence of an optimal pair of stationary strategies of the game. When the state and action spaces are finite, a value iteration-type algorithm for approximating the value function and an optimal pair of stationary strategies is developed. Finally, we apply our results to an investment problem, which demonstrates that our algorithm performs well.

One of the future research topics is to deal with the nonzero-sum case of this game model. We wish to find sufficient conditions under which similar arguments can be used to establish the Shapley equation and prove the existence of a Nash equilibrium for such games. In practice, many decision problems are continuous in states or actions. It is of interest to further study discretization techniques for continuous SMGs in order to apply our value iteration algorithm. Moreover, considering the limitations of computing resources, dynamic programming algorithms are difficult to implement in reality when the scale of the game becomes huge. Another future research topic is to develop data-driven learning algorithms to approximately solve such game problems, for example by combining with multi-agent reinforcement learning approaches.