Abstract
In this paper, we consider a controlled continuous-time Markov chain describing an interacting particle system with a finite number of types. The system is controlled by two players with opposite purposes. This Markov game converges to a zero-sum differential game as the number of particles tends to infinity. The Krasovskii–Subbotin extremal shift provides the optimal strategy in the limiting game. The main result of the paper is the near optimality of the Krasovskii–Subbotin extremal shift rule for the original Markov game.
1 Introduction
The paper is concerned with the construction of near-optimal strategies for a zero-sum two-player continuous-time Markov game based on a deterministic game. The term ‘Markov game’ is used for a Markov chain whose Kolmogorov matrix depends on the controls of the players. These games are also called continuous-time stochastic games. Continuous-time Markov games were first studied by Zachrisson [20]. Recent progress in the theory of continuous-time Markov games can be found in [15, 17] and the references therein.
We consider the case when the continuous-time Markov chain describes an interacting particle system. The interacting particle system converges to a deterministic system as the number of particles tends to infinity [7, 8] (see also [3, 4]). The value function of the controlled Markov chain converges to the value function of the limiting control system [8] (see [6] for the corresponding result for discrete-time systems). This result is extended to zero-sum games as well as to nonzero-sum games in [8]. If a nonanticipative strategy is optimal for the differential game, then it is near optimal for the Markov game [8]. However, nonanticipative strategies require knowledge of the control of the second player. Often this information is inaccessible, and the player has only information about the current position. In this case, one can use feedback strategies or control with guide strategies.
Control with guide strategies were proposed by Krasovskii and Subbotin to construct solutions of deterministic differential games under informational disturbances [13]. Note that feedback strategies do not provide a stable solution of the differential game. If the player uses a control with guide strategy, then the control is formed stepwise, and the player has a model of the system that is used to choose an appropriate control at each step. The easiest way to construct a control with guide strategy is the extremal shift rule. The value function is achieved in the limit when the time between control corrections tends to zero. In the original work by Krasovskii and Subbotin, the motion of the model is governed by a copy of the original system, and the motion of the original system stays close to the motion of the model. Therefore, the model can be called a guide. Formally, a control with guide strategy is a strategy with memory. However, it suffices to store only a finite number of vectors. Moreover, the player does not require information about the second player’s control.
Control with guide strategies realizing the extremal shift were used for differential games without the Lipschitz continuity assumption on the system dynamics in [14] and for games governed by delay differential equations in [12, 16]. Krasovskii and Kotelnikova proposed stochastic control with guide strategies [9–11]. In that case, the real motion of the deterministic system is close to an auxiliary stochastic process generated by an optimal control of the stochastic differential game. A Nash equilibrium for a two-player game in the class of control with guide strategies was constructed via extremal shift in [1].
In this paper, we let the player use the control with guide strategy realizing the extremal shift rule in the Markov game. We assume that the motion of the guide is given by the limiting deterministic differential game. We estimate the expectation of the distance between the Markov chain and the motion of the model (guide). This leads to an estimate of the difference between the outcome of the player in the Markov game and the value function of the limiting differential game.
The paper is organized as follows. In the preliminary Sect. 2, we introduce the Markov game describing the interacting particle system and the limiting deterministic differential game. In Sect. 3, we give the explicit definition of control with guide strategies and formulate the main results. Section 4 is devoted to a property of transition probabilities. In Sect. 5, we estimate the expectation of the distance between the Markov chain and the deterministic guide. Section 6 provides the proofs of the statements formulated in Sect. 3. Finally, in Sect. 7, we illustrate the theoretical results by a simulation of a particular Markov chain.
2 Preliminaries
We consider a system of a finite number of particles. Each particle can be of type i, \(i\in \{1,\ldots ,d\}\). The type of each particle is a random variable governed by a Markov chain. To specify this chain, consider the Kolmogorov matrix \(Q(t,x,u,v)=(Q_{ij}(t,x,u,v))_{i,j=1}^d\). This means that the elements of the matrix Q(t, x, u, v) satisfy the following properties:
-
\(Q_{ij}(t,x,u,v)\ge 0\) for \(i\ne j\);
-
$$\begin{aligned} Q_{ii}(t,x,u,v)=-\sum _{j\ne i}Q_{ij}(t,x,u,v). \end{aligned}$$(1)
Here \(t\in [0,T],\ \ x\in \varSigma _d=\{(x_1,\ldots , x_d):x_i\ge 0,\ x_1+\cdots +x_d=1\},\) \(u\in U\), \(v\in V.\) The parameter x is used below for the state of the interacting particle system. We assume that \(x=(x_1,\ldots ,x_d)\) is a row vector. The variables u and v are controlled by the first and the second players, respectively.
Additionally we assume that
-
U and V are compact sets;
-
Q is a continuous function;
-
for any t, u and v, the function \(x\mapsto Q(t,x,u,v)\) is Lipschitz continuous;
-
for any \(t\in [0,T]\), \(\xi ,x\in \mathbb {R}^d\), the following equality holds true:
$$\begin{aligned} \min _{u\in U}\max _{v\in V}\langle \xi ,xQ(t,x,u,v)\rangle =\max _{v\in V}\min _{u\in U}\langle \xi ,xQ(t,x,u,v)\rangle \end{aligned}$$(2)
Condition (2) is an analog of the well-known Isaacs condition.
For fixed parameters \(x\in \varSigma _d\), \(u\in {U}\), and \(v\in {V}\), the type of each particle is determined by the Markov chain with the generator
$$\begin{aligned} \mathcal {Q}_t[x,u,v]f(i)=\sum _{j=1}^dQ_{ij}(t,x,u,v)f(j),\quad i\in \{1,\ldots ,d\}. \end{aligned}$$
Another way to specify the Markov chain is the Kolmogorov forward equation
$$\begin{aligned} \frac{\partial }{\partial t}P(s,t,x)=P(s,t,x)Q(t,x,u,v),\quad P(s,s,x)=\mathrm {Id}. \end{aligned}$$
Here \(P(s,t,x)=(P_{ij}(s,t,x))_{i,j=1}^d\) is the matrix of the transition probabilities.
Now we consider the controlled mean-field interacting particle system (see [8] for details). Let \(n_i\) be the number of particles of type i. The vector \(N=(n_1,\ldots ,n_d)\in \mathbb {Z}_+^d\) is the state of the system consisting of \(|N|=n_1+\cdots +n_d\) particles. For \(i\ne j\) and a vector \(N=(n_1,\ldots ,n_d)\), denote by \(N^{[ij]}\) the vector obtained from N by removing one particle of type i and adding one particle of type j, i.e., we replace the ith coordinate with \(n_i-1\) and the jth coordinate with \(n_j+1\). The mean-field interacting particle system is a Markov chain with the generator
$$\begin{aligned} \mathcal {L}_t[u,v]f(N)=\sum _{i\ne j}n_iQ_{ij}(t,N/|N|,u,v)\left[ f(N^{[ij]})-f(N)\right] . \end{aligned}$$
The purpose of the first (respectively, second) player is to minimize (respectively, maximize) the expectation of \(\sigma (N/|N|)\).
Denote the inverse number of particles by \(h=1/|N|\). Normalizing the states of the interacting particle system, we get the generator (see [8])
$$\begin{aligned} L_t^{h}[u,v]f(hN)=\sum _{i\ne j}n_iQ_{ij}(t,hN,u,v)\left[ f(hN^{[ij]})-f(hN)\right] . \end{aligned}$$
Denote the vector N / |N| by \(x=(x_1,\ldots ,x_d)\). Thus, we have that
$$\begin{aligned} L_t^h[u,v]f(x)=\frac{1}{h}\sum _{i\ne j}x_iQ_{ij}(t,x,u,v)\left[ f(x-he^i+he^j)-f(x)\right] . \end{aligned}$$(3)
Here \(e^i\) is the ith coordinate vector. The vector x belongs to the set
$$\begin{aligned} \varSigma _d^h\triangleq \varSigma _d\cap h\mathbb {Z}_+^d=\left\{ x\in \varSigma _d:x_i/h\in \mathbb {Z}_+\right\} . \end{aligned}$$
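To make the generator (3) concrete, here is a minimal simulation sketch of one jump of the chain by the standard Gillespie method. It is an illustration only: the callable `Q`, the array conventions, and all names are our assumptions rather than notation from the paper.

```python
import numpy as np

def jump_rates(t, x, u, v, Q, h):
    """Rate of the jump x -> x - h*e^i + h*e^j equals (1/h)*x_i*Q_ij(t,x,u,v)
    for i != j, in accordance with the generator (3).  Q(t, x, u, v) is a
    user-supplied function returning a d x d Kolmogorov matrix."""
    rates = x[:, None] * Q(t, x, u, v) / h
    np.fill_diagonal(rates, 0.0)   # diagonal entries do not correspond to jumps
    return rates

def gillespie_step(t, x, u, v, Q, h, rng):
    """Sample the waiting time and the post-jump state of the chain."""
    rates = jump_rates(t, x, u, v, Q, h)
    total = rates.sum()
    if total == 0.0:               # no admissible jump: the state is frozen
        return t, x
    t_next = t + rng.exponential(1.0 / total)
    k = rng.choice(rates.size, p=(rates / total).ravel())
    i, j = divmod(k, x.size)
    x_next = x.copy()
    x_next[i] -= h                 # one particle leaves type i ...
    x_next[j] += h                 # ... and becomes a particle of type j
    return t_next, x_next
```

Here x is a NumPy array representing a point of \(\varSigma _d^h\) and rng is a `numpy.random.Generator`.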
Further, let \(\mathcal {U}_\mathrm{det}[s]\) (respectively, \(\mathcal {V}_\mathrm{det}[s]\)) denote the set of deterministic controls of the first (respectively, second) player on [s, T], i.e., the set of measurable functions \(u:[s,T]\rightarrow U\) (respectively, \(v:[s,T]\rightarrow V\)).
Let \((\varOmega ,\mathcal {F},\{\mathcal {F}_t\},P)\) be a filtered probability space. Extending the definition given in [5, p. 135] to the stochastic game case, we say that the pair of stochastic processes u and v on [s, T] is an admissible pair of controls if
1.
\(u(t)\in U\), \(v(t)\in V\);
2.
the processes u and v are progressively measurable;
3.
for any \(y\in \varSigma ^h_d\), there exists a unique \(\{\mathcal {F}_t\}_{t\in [s,T]}\)-adapted càdlàg stochastic process \(X^h(t,s,y,u,v)\) taking values in \(\varSigma ^h_d\), starting at y at time s, and satisfying the following condition for any \(f\in C(\varSigma ^h_d)\):
$$\begin{aligned} f\left( X^h(t,s,y,u,v)\right) -\int _s^t L_{\tau }^h[u(\tau ),v(\tau )]f\left( X^h(\tau ,s,y,u,v)\right) \hbox {d}\tau \end{aligned}$$is a martingale.
In particular, the third condition means that
$$\begin{aligned} \mathbb {E}_{sy}^hf\left( X^h(t,s,y,u,v)\right) =f(y)+\mathbb {E}_{sy}^h\int _s^t L_{\tau }^h[u(\tau ),v(\tau )]f\left( X^h(\tau ,s,y,u,v)\right) \hbox {d}\tau . \end{aligned}$$(4)
Here \(\mathbb {E}_{sy}^h\) denotes the conditional expectation of corresponding stochastic processes.
The purposes of the players can be reformulated in the following way. The first (respectively, second) player wishes to minimize (respectively, maximize) the payoff
$$\begin{aligned} \mathbb {E}_{sy}^h\sigma \left( X^h(T,s,y,u,v)\right) . \end{aligned}$$
Let \(\mathcal {U}^h[s]\) be the set of stochastic processes u taking values in U such that the pair (u, v) is admissible for any \(v\in \mathcal {V}_\mathrm{det}[s]\). Analogously, let \(\mathcal {V}^h[s]\) be the set of stochastic processes v taking values in V such that the pair (u, v) is admissible for any \(u\in \mathcal {U}_\mathrm{det}[s]\).
Denote by \(P_{sy}^h(A)\) the conditional probability of an event A under the condition that the Markov chain corresponding to the parameter h starts at y at time s.
Further, let \(p^h(s,y,t,z,u,v)\) denote the transition probability, i.e.,
$$\begin{aligned} p^h(s,y,t,z,u,v)\triangleq P_{sy}^h\left( X^h(t,s,y,u,v)=z\right) . \end{aligned}$$
Substituting \(\mathbf {1}_{\{z\}}\) for f in (3) and (4) gives that
$$\begin{aligned} p^h(s,y,t,z,u,v)=\mathbf {1}_{\{z\}}(y)+\frac{1}{h}\mathbb {E}_{sy}^h\int _s^t\sum _{i\ne j}X_{i}^h(\tau ,s,y,u,v)Q_{ij}\left( \tau ,X^h(\tau ,s,y,u,v),u(\tau ),v(\tau )\right) \\ \times \left[ \mathbf {1}_{\{z\}}\left( X^h(\tau ,s,y,u,v)-he^i+he^j\right) -\mathbf {1}_{\{z\}}\left( X^h(\tau ,s,y,u,v)\right) \right] \hbox {d}\tau . \end{aligned}$$(5)
Here \(X_{i}^h(\tau ,s,y,u,v)\) denotes the ith component of \(X^h(\tau ,s,y,u,v)\).
Recall (see [8]) that if \(h\rightarrow 0\), then the generator \(L_t^h[u,v]\) converges to the generator
$$\begin{aligned} \varLambda _t[u,v]f(x)=\sum _{i\ne j}x_iQ_{ij}(t,x,u,v)\left[ \frac{\partial f(x)}{\partial x_j}-\frac{\partial f(x)}{\partial x_i}\right] =\langle \nabla f(x),xQ(t,x,u,v)\rangle . \end{aligned}$$
For controls \(u\in \mathcal {U}_\mathrm{det}[s]\) and \(v\in \mathcal {V}_\mathrm{det}[s]\), the deterministic evolution generated by the generator \(\varLambda _t[u(t),v(t)]\) is described by the equation
$$\begin{aligned} \frac{\hbox {d}}{\hbox {d}t}f_t(y)=\varLambda _t[u(t),v(t)]f_t(y). \end{aligned}$$(6)
Here the function \(f_t(y)\) is equal to f(x(t)) when \(x(s)=y\). The characteristics of (6) solve the ODEs
$$\begin{aligned} \dot{x}_j=\sum _{i=1}^dx_iQ_{ij}(t,x,u(t),v(t)),\quad j=1,\ldots ,d. \end{aligned}$$
One can rewrite this system in the vector form
$$\begin{aligned} \dot{x}=xQ(t,x,u(t),v(t)). \end{aligned}$$(7)
For given \(u\in \mathcal {U}_\mathrm{det}[s]\), \(v\in \mathcal {V}_\mathrm{det}[s]\), denote by \(x(\cdot ,s,y,u,v)\) the solution of the initial value problem for (7) with the initial condition \(x(s)=y\).
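The trajectories \(x(\cdot ,s,y,u,v)\) can be approximated by any standard ODE scheme; the following minimal Euler sketch (again with a hypothetical callable `Q`, and the controls u, v given as Python callables of time) is included for illustration.

```python
import numpy as np

def limit_trajectory(Q, u, v, s, T, y, n_steps=10_000):
    """Euler scheme for the vector dynamics (7): dx/dt = x Q(t, x, u(t), v(t)).
    Since the rows of the Kolmogorov matrix sum to zero, the coordinates of x
    keep summing to one up to round-off."""
    x = np.asarray(y, dtype=float)
    t, dt = s, (T - s) / n_steps
    for _ in range(n_steps):
        x = x + dt * (x @ Q(t, x, u(t), v(t)))   # x is a row vector
        t += dt
    return x   # approximates x(T, s, y, u, v)
```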
Consider the deterministic zero-sum game with the dynamics given by (7) and the terminal payoff equal to \(\sigma (x(T,s,y,u,v))\). This game has a value in the class of feedback strategies [13] that is a continuous function of the position. Denote it by \(\mathrm{Val}(s,y)\). Recall (see [18]) that the function \(\mathrm{Val}(s,y)\) is a minimax (viscosity) solution of the Hamilton–Jacobi PDE
$$\begin{aligned} \frac{\partial \varphi }{\partial t}+H(t,x,\nabla \varphi )=0,\quad \varphi (T,x)=\sigma (x). \end{aligned}$$(8)
Here the Hamiltonian H is defined by the rule
$$\begin{aligned} H(t,x,\xi )\triangleq \min _{u\in U}\max _{v\in V}\langle \xi ,xQ(t,x,u,v)\rangle =\max _{v\in V}\min _{u\in U}\langle \xi ,xQ(t,x,u,v)\rangle . \end{aligned}$$
Remark 1
The approaches based on control with guide strategies and nonanticipative strategies (see [2] for a detailed presentation of the latter approach) are equivalent to the feedback formalization [18]. The value function in these cases is also equal to \(\mathrm{Val}\).
3 Control with Guide Strategies
In this section, we introduce the control with guide strategies for the Markov game. It is assumed that the control is formed stepwise and the player has information about the current state of the system, i.e., the vector x is known. Additionally, we assume that the player can evaluate the expected state, and the player’s control depends both on the current state of the system and on the evaluated state. This evaluation is called a guide. At each time of control correction, the player computes the value of the guide and the control that is used up to the next time of control correction.
Formally (see [19]), control with guide strategy of player 1 is a triple \(\mathfrak {u}=(u(t,x,w),\psi _1(t_+,t,x,w),\chi _1(s,y))\). Here the function u(t, x, w) is equal to the control implemented after time t if at time t the state of the system is x and the state of the guide is w. The function \(\psi _1(t_+,t,x,w)\) determines the state of the guide at time \(t_+\) under the condition that at time t the state of the system is x and the state of the guide is w. The function \(\chi _1\) initializes the guide, i.e., \(\chi _1(s,y)\) is the state of the guide in the initial position (s, y).
We use the control with guide strategies for the Markov game with the generator \(L_t^h\). Here we assume that \(h>0\) is fixed. Let (s, y) be an initial position, \(s\in [0,T]\), \(y\in \varSigma _d^h\). Assume that player 1 chooses the control with guide strategy \(\mathfrak {u}\) and the partition \(\varDelta =\{t_k\}_{k=0}^m\) of the time interval [s, T], whereas player 2 chooses the control \(v\in \mathcal {V}^h[s]\). This control can be also formed stepwise using some second player’s control with guide strategy.
We say that the stochastic process \(\mathcal {X}_1^h[\cdot ,s,y,\mathfrak {u},\varDelta ,v]\) is generated by the strategy \(\mathfrak {u}\), the partition \(\varDelta \), and the second player’s control v if, for \(t\in [t_k,t_{k+1})\), \(\mathcal {X}_1^h[t,s,y,\mathfrak {u},\varDelta ,v]=X^h(t,t_k,x_k,u_k,v), \) where
-
\(x_0=y\), \(w_0=\chi _1(t_0,x_0)\), \(u_0=u(t_0,x_0,w_0)\);
-
for \(k=\overline{1,m}\), \(x_k=X^h(t_k,t_{k-1},x_{k-1},u_{k-1},v)\), \(w_k=\psi _1(t_k,t_{k-1},x_{k-1},w_{k-1})\), \(u_k=u(t_k,x_k,w_k)\).
Note that even though the state of the guide \(w_k\) is determined by a deterministic function, it depends on the random variable \(x_{k-1}\). Thus, \(w_k\) is a random variable.
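The stepwise construction above can be summarized by a simple control loop. In the sketch below, `step_chain(t_k, t_next, x_k, u_k)` is a hypothetical routine that simulates \(X^h\) on one partition interval under the constant first-player control \(u_k\) (for instance, by iterating the Gillespie step from Sect. 2), with the second player's control embedded inside it; all names are ours.

```python
def play_with_guide(strategy, partition, step_chain, y):
    """Realize a control-with-guide strategy (u, psi_1, chi_1) over a
    partition s = t_0 < t_1 < ... < t_m = T; returns the state at time T."""
    u_fun, psi_fun, chi_fun = strategy
    x = y
    w = chi_fun(partition[0], y)                   # w_0 = chi_1(t_0, x_0)
    for t_k, t_next in zip(partition, partition[1:]):
        u_k = u_fun(t_k, x, w)                     # u_k = u(t_k, x_k, w_k)
        x_new = step_chain(t_k, t_next, x, u_k)    # random motion of X^h
        w = psi_fun(t_next, t_k, x, w)             # w_{k+1} uses x_k and w_k
        x = x_new
    return x
```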
Below we define the first player’s control with guide strategy that realizes the extremal shift rule (see [13]). Let \(\varphi \) be a supersolution of Eq. (8). That means (see [18]) that for any \((t_*,x_*)\in [0,T]\times \varSigma _{d}\), \(t_+>t_*\), and \(v_*\in V\), there exists a solution \(\zeta _1(\cdot ,t_+,t_*,x_*,v_*)\) of the differential inclusion
$$\begin{aligned} \dot{\zeta }(t)\in \mathrm {co}\left\{ \zeta (t)Q(t,\zeta (t),u,v_*):u\in U\right\} \end{aligned}$$
satisfying the conditions \(\zeta _1(t_*,t_+,t_*,x_*,v_*)=x_*\) and \(\varphi (t_+,\zeta _1(t_+,t_+,t_*,x_*,v_*))\le \varphi (t_*,x_*)\).
Define the control with guide strategy \(\hat{\mathfrak {u}}=(\hat{u},\hat{\psi }_1,\hat{\chi }_1)\) by the following rules. If \(t_*,t_+\in [0,T]\), \(t_+>t_*\), \(x_*,w_*\in \varSigma _d\), then choose \(u_*\), \(v_*\) by the rules
$$\begin{aligned} u_*\in \mathop {\hbox {argmin}}\limits _{u\in U}\max _{v\in V}\langle x_*-w_*,x_*Q(t_*,x_*,u,v)\rangle , \end{aligned}$$(9)
$$\begin{aligned} v_*\in \mathop {\hbox {argmax}}\limits _{v\in V}\min _{u\in U}\langle x_*-w_*,w_*Q(t_*,w_*,u,v)\rangle . \end{aligned}$$(10)
Put
(u1)
\(\hat{u}(t_*,x_*,w_*)=u_*\);
(u2)
\(\hat{\psi }_1(t_+,t_*,x_*,w_*)=\zeta _1(t_+,t_+,t_*,w_*,v_*)\);
(u3)
\(\hat{\chi }_1(s,y)=y\).
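When U and V are replaced by finite grids, the extremal shift pair \((u_*,v_*)\) of (9) and (10) can be computed by brute force. The sketch below is a simplification under this grid assumption; the function names are ours.

```python
import numpy as np

def extremal_shift_controls(t, x, w, Q, U_grid, V_grid):
    """Extremal shift on finite grids; s = x - w is the shift direction."""
    xa, wa = np.asarray(x, float), np.asarray(w, float)
    s = xa - wa
    # (9): u_* minimizes the worst-case velocity of the real state along s.
    u_star = min(U_grid,
                 key=lambda u: max(float(s @ (xa @ Q(t, xa, u, v)))
                                   for v in V_grid))
    # (10): v_* maximizes the best-case velocity of the guide along s.
    v_star = max(V_grid,
                 key=lambda v: min(float(s @ (wa @ Q(t, wa, u, v)))
                                   for u in U_grid))
    return u_star, v_star
```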
Note that if the first player uses the strategy \(\hat{\mathfrak {u}}\) in the differential game with the dynamics given by (7), then she guarantees that the limit outcome is not greater than \(\varphi (s,y)\) (see [13, 18]). If additionally \(\varphi =\mathrm{Val}\), then the strategy \(\hat{\mathfrak {u}}\) is optimal in the deterministic game.
The main result of the paper is the following.
Theorem 1
Assume that \(\sigma \) is Lipschitz continuous with a constant R, and the function \(\varphi \) is a supersolution of (8). If the first player uses the control with guide strategy \(\hat{\mathfrak {u}}\) determined by (u1)–(u3), then
(i)
$$\begin{aligned}&\lim _{\delta \downarrow 0}\sup \left\{ \mathbb {E}_{sy}^h\left( \sigma \left( \mathcal {X}_1^h[T,s,y,\hat{\mathfrak {u}},\varDelta ,v]\right) \right) :d(\varDelta )\le \delta ,v\in \mathcal {V}^h[s]\right\} \\&\quad \le \varphi (s,y)+R\sqrt{Dh}. \end{aligned}$$
(ii)
$$\begin{aligned}&\lim _{\delta \downarrow 0}\sup \left\{ P_{sy}^h\left( \sigma \left( \mathcal {X}_1^h[T,s,y,\hat{\mathfrak {u}},\varDelta ,v]\right) \ge \varphi (s,y)+R\root 3 \of {Dh}\right) :\right. \\&\quad \left. d(\varDelta )\le \delta ,\ \ v\in \mathcal {V}^h[s]\right\} \le \root 3 \of {Dh}. \end{aligned}$$
Here D is a constant that depends on neither \(\varphi \) nor \(\sigma \).
The theorem is proved in Sect. 6.
Now let us consider the case when the second player uses control with guide strategies. A control with guide strategy of the second player is a triple \(\mathfrak {v}=(v(t,x,w),\psi _2(t_+,t,x,w),\chi _2(s,y))\). Here w denotes the state of the second player’s guide. The control in this case is also formed stepwise. If (s, y) is an initial position, \(\varDelta \) is a partition of the time interval [s, T], and \(u\in \mathcal {U}^h[s]\) is a control of player 1, then denote by \(\mathcal {X}_2^h[\cdot ,s,y,\mathfrak {v},\varDelta ,u]\) the corresponding stochastic process.
Let \(\phi \) be a subsolution of Eq. (8). That means (see [18]) that for any \((t_*,x_*)\in [0,T]\times \varSigma _{d}\), \(t_+>t_*\), and \(u^*\in U\), there exists a trajectory \(\zeta _2(\cdot ,t_+,t_*,x_*,u^*)\) of the differential inclusion
$$\begin{aligned} \dot{\zeta }(t)\in \mathrm {co}\left\{ \zeta (t)Q(t,\zeta (t),u^*,v):v\in V\right\} \end{aligned}$$
satisfying the conditions \(\zeta _2(t_*,t_+,t_*,x_*,u^*)=x_*\) and \(\phi (t_+,\zeta _2(t_+,t_+,t_*,x_*,u^*))\ge \phi (t_*,x_*).\)
Define the strategy \(\hat{\mathfrak {v}}\) by the following rules. If \((t_*,x_*)\) is a position, \(t_+>t_*\), and \(w_*\in \varSigma _d\) is a state of the guide, then choose \(v^*\) and \(u^*\) by the rules
$$\begin{aligned} v^*\in \mathop {\hbox {argmax}}\limits _{v\in V}\min _{u\in U}\langle w_*-x_*,x_*Q(t_*,x_*,u,v)\rangle , \end{aligned}$$
$$\begin{aligned} u^*\in \mathop {\hbox {argmin}}\limits _{u\in U}\max _{v\in V}\langle w_*-x_*,w_*Q(t_*,w_*,u,v)\rangle . \end{aligned}$$
Put
Put
(v1)
\(v(t_*,x_*,w_*)=v^*\);
(v2)
\(\psi _2(t_+,t_*,x_*,w_*)=\zeta _2(t_+,t_+,t_*,w_*,u^*)\);
(v3)
\(\chi _2(s,y)=y\).
Corollary 1
Let \(\phi \) be a subsolution of (8). If the second player uses the control with guide strategy \(\hat{\mathfrak {v}}\) determined by (v1)–(v3), then
(i)
$$\begin{aligned}&\lim _{\delta \downarrow 0}\inf \left\{ \mathbb {E}_{sy}^h\left( \sigma \left( \mathcal {X}_2^h[T,s,y,\hat{\mathfrak {v}},\varDelta ,u]\right) \right) :d(\varDelta )\le \delta ,u\in \mathcal {U}^h[s]\right\} \\&\quad \ge \phi (s,y)-R\sqrt{Dh}. \end{aligned}$$
(ii)
$$\begin{aligned}&\lim _{\delta \downarrow 0}\sup \left\{ P_{sy}^h\left( \sigma \left( \mathcal {X}_2^h[T,s,y,\hat{\mathfrak {v}},\varDelta ,u]\right) \le \phi (s,y)-R\root 3 \of {Dh}\right) :\right. \\&\quad \left. d(\varDelta )\le \delta ,\ \ u\in \mathcal {U}^h[s]\right\} \le \root 3 \of {Dh}. \end{aligned}$$
The corollary is also proved in Sect. 6.
4 Properties of Transition Probabilities
Now we prove the following.
Lemma 1
There exists a function \(\alpha ^h(\delta )\) such that \(\alpha ^h(\delta )\rightarrow 0\) as \(\delta \rightarrow 0\) and for any \(t_*,t_+\in [0,T]\), \(\xi ,\eta \in \varSigma _d\), \(\xi =(\xi _1,\ldots ,\xi _d)\), \(\bar{u}\in U\), \(\bar{v}\in \mathcal {V}^h[t_*]\) the following properties hold true
1.
if \(\eta =\xi \), then
$$\begin{aligned} p^h(t_*,\xi ,t_+,\eta ,\bar{u},\bar{v})\le 1+\frac{1}{h}\sum _{k=1}^d\int _{t_*}^{t_+}\int _{V}\xi _kQ_{kk}(t_*,\xi ,\bar{u},v)\nu _\tau (\hbox {d}v)\hbox {d}\tau +\alpha ^h(t_+-t_*)\cdot (t_+-t_*); \end{aligned}$$
2.
if \(\eta =\xi -h e^i+he^j\), then
$$\begin{aligned} p^h(t_*,\xi ,t_+,\eta ,\bar{u},\bar{v})\le \frac{1}{h}\int _{t_*}^{t_+}\int _{V}\xi _iQ_{ij}(t_*,\xi ,\bar{u},v)\nu _\tau (\hbox {d}v)\hbox {d}\tau +\alpha ^h(t_+-t_*)\cdot (t_+-t_*); \end{aligned}$$
3.
if \(\eta \ne \xi \) and \(\eta \ne \xi -h e^i+he^j\), then
$$\begin{aligned} p^h(t_*,\xi ,t_+,\eta ,\bar{u},\bar{v})\le \alpha ^h(t_+-t_*)\cdot (t_+-t_*). \end{aligned}$$
Here \(\nu _\tau \) is a measure on V depending on \(t_*,t_+\), \(\xi \), \(\eta \), \(\bar{u}\) and \(\bar{v}\).
Proof
First denote
For any \(x\in \varSigma _d\), \(t\in [0,T]\), \(u\in U\), \(v\in V\), the following estimates hold true
Further, let \(\gamma (\delta )\) be a common modulus of continuity with respect to t of the functions \(Q_{ij}\), i.e., for all i, j, \(t',t''\in [0,T]\), \(x\in \varSigma _d\), \(u\in U\), \(v\in V\),
$$\begin{aligned} |Q_{ij}(t',x,u,v)-Q_{ij}(t'',x,u,v)|\le \gamma (|t'-t''|), \end{aligned}$$
and \(\gamma (\delta )\rightarrow 0\) as \(\delta \rightarrow 0\). From (5) and (12), we obtain that
Further, for a given control \(\bar{v}\in \mathcal {V}^h[t_*]\), let \(\mathbb {E}^h_{t_*\xi ;\tau x}\) denote the expectation under conditions \(X^h(t_*,t_*,\xi ,\bar{u},\bar{v})=\xi \), and \(X^h(\tau ,t_*,\xi ,\bar{u},\bar{v})=x\).
We have that
From this and (5), we get
Here \(x_i\) denotes the ith component of the vector x.
We have that \(p^h(t_*,\xi ,t_*,x,\bar{u},\bar{v})=1\) for \(x=\xi \) and \(p^h(t_*,\xi ,t_*,x,\bar{u},\bar{v})=0\) for \(x\ne \xi \). Thus,
Recall that, for each \(\tau \), \(\bar{v}(\tau )\) is a random variable with values in V. Define the measure \(\nu _\tau \) on V as the image of the measure \(P_{t_*\xi ;\tau \xi }\) under \(\bar{v}(\tau )\), where for \(A\in \mathcal {F}\)
We have that
Consequently,
Here we denote
From (15), the second and the third statements of the lemma follow. To derive the first statement, use property (1) of Kolmogorov matrices. We have that
\(\square \)
5 Key Estimate
This section provides the estimate of the distance between the controlled Markov chain and the guide. This estimate is an analog of [13, Lemma 2.3.1].
Lemma 2
There exist constants \(\beta ,C>0\), and a function \(\varkappa ^h(\delta )\) such that \(\varkappa ^h(\delta )\rightarrow 0\) as \(\delta \rightarrow 0\) and the following property holds true.
If
1.
\((t_*,x_*)\in [0,T]\times \varSigma _{d}^h\), \(w_*\in \varSigma _d\), \(t_+>t_*\);
2.
the controls \(u_*\), \(v_*\) are chosen by rules (9) and (10), respectively;
3.
\(w_+=\zeta _1(t_+,t_+,t_*,w_*,v_*)\),
then for any \(v\in \mathcal {V}^h[t_*]\)
Proof
Denote the ith component of the vector \(x_*\) by \(x_{*i}\).
We have that
Further,
It follows from (12) that
From Lemma 1, it follows that
For simplicity, denote \(\zeta _*(t)=\zeta _1(t,t_+,t_*,w_*,v_*)\). We have that for each t there exists a probability measure \(\mu _t\) on U such that
Therefore,
Define
We have that \(\varrho (\delta )\rightarrow 0\), as \(\delta \rightarrow 0\). From (17), (19), and (20), it follows that
Using Lemma 1 one more time we get the inequality
The first term on the right-hand side of (22) can be transformed as follows. For simplicity, denote
Note that \(\widehat{Q}=(\widehat{Q}_{ij})_{i,j=1}^d\) is a Kolmogorov matrix. We have that
This and (22) yield the estimate
Substituting (17)–(21), (23) in (16), we get the inequality
Let \(\varUpsilon \) be a Lipschitz constant of the function \(y\mapsto yQ(t,y,u,v)\), i.e., for all \(y',y''\in \varSigma _d\), \(t\in [0,T]\), \(u\in U\), \(v\in V\),
$$\begin{aligned} \Vert y'Q(t,y',u,v)-y''Q(t,y'',u,v)\Vert \le \varUpsilon \Vert y'-y''\Vert . \end{aligned}$$
We have that
The choice of \(u_*\) and \(v_*\) together with the Isaacs condition (2) gives that for all \(u\in U\), \(v\in V\)
$$\begin{aligned} \langle x_*-w_*,x_*Q(t_*,x_*,u_*,v)-w_*Q(t_*,w_*,u,v_*)\rangle \le \varUpsilon \Vert x_*-w_*\Vert ^2. \end{aligned}$$
Consequently, we get the estimate
From this and (24), the conclusion of the Lemma follows for \(\beta =2\varUpsilon ,\ \ C=2d^2K, \ \ \varkappa ^h(\delta )=6d^3\alpha ^h(\delta )+\sqrt{2d}\varrho (\delta )\). \(\square \)
6 Near-Optimal Strategies
In this section, we prove Theorem 1 and Corollary 1.
Proof of Theorem 1
Let \(v\in \mathcal {V}^h[s]\) be a control of the second player. Consider a partition \(\varDelta =\{t_k\}_{k=0}^m\) of the time interval [s, T]. If \(x_0,x_1,\ldots ,x_m\) are vectors, \(x_0=y\), then denote by \(\hat{p}^h_r(x_1,\ldots ,x_r,\varDelta )\) the probability of the event \(\mathcal {X}_1^h[t_k,s,y,\hat{\mathfrak {u}},\varDelta ,v]=x_k\) for \(k=\overline{1,r}\). Define the vectors \(w_0,\ldots ,w_m\) recursively in the following way. Put
for \(k>0\) put
If \(w_0,\ldots ,w_m\) are defined by rules (25), (26) and \(r\in \overline{1,m}\), we set
In addition, put \(g_0(\varDelta )\triangleq y\).
Below we use the transformation \(G(\cdot ,\mathcal {X}^h_1[\cdot ,s,y,\hat{\mathfrak {u}},\varDelta ,v])\) of the stochastic process \(\mathcal {X}^h_1[\cdot ,s,y,\hat{\mathfrak {u}},\varDelta ,v]\) defined in the following way. If \(x_i\) are values of \(\mathcal {X}^h_1[t_i,s,y,\hat{\mathfrak {u}},\varDelta ,v]\), \(i=0,\ldots ,r\), and \((w_0,\ldots ,w_r)=g_r(x_0,\ldots ,x_{r-1},\varDelta )\), then we put
In general, the stochastic process \(G(\cdot ,\mathcal {X}^h_1[\cdot ,s,y,\hat{\mathfrak {u}},\varDelta ,v])\) is non-Markovian.
Further, if \(u_i=\hat{u}(t_i,x_i,w_i)\), \(i=0,\ldots ,r\), and
we write \(\varsigma _r(x_0,\ldots ,x_r,\varDelta )\triangleq u_r\).
We have that for any \(r\in \overline{1,m}\)
By Lemma 2, we have that
From this and (27), it follows that
Applying this inequality recursively, we get
Taking into account the equality \(x_0=y=g_0(\varDelta )\), we conclude that
Here we denote
Note that, for any h, \(\epsilon (h,\delta )\rightarrow Dh\) as \(\delta \rightarrow 0\). Using (29) and Jensen’s inequality, we get
By construction of control with guide strategy \(\hat{\mathfrak {u}}\), the following inequalities hold true
Hence,
Since \(\sigma \) is Lipschitz continuous with the constant R, we have that for any partition \(\varDelta \) and any second player’s control v
This and (30) yield the inequality
Passing to the limit as \(d(\varDelta )\rightarrow 0\) and taking into account the property \(\epsilon (h,\delta )\rightarrow Dh\) as \(\delta \rightarrow 0\), we obtain the first statement of the theorem.
Now let us prove the second statement of the theorem. Using Markov’s inequality and (29), we get
The Lipschitz continuity of the function \(\sigma \) and (31) yield the following inclusion
Finally, for any partition \(\varDelta \) and any second player’s control \(v\in \mathcal {V}^h[s]\), we have that
From this, the second statement of the theorem follows. \(\square \)
To prove Corollary 1, it suffices to replace the payoff function with \(-\sigma \) and interchange the players.
7 Example
In this section, we illustrate the theory developed above with a simulation study. Let \(d=2\), \(u,v\in [0,1]\), and \(\sigma (x_1,x_2)=\frac{1}{2}|x_1-x_2|\). In this case, the generator \(L_t^h[u,v]\) (see (3)) takes the form
whereas the limiting dynamics (7) takes the form
Using the equality \(x_1+x_2=1\), we can reduce system (33) to the ODE
The corresponding Hamiltonian for \(x_2\in [0,1]\) is equal to
Further, we replace the payoff function \(\sigma \) with the equivalent payoff function \(\tilde{\sigma }(x_2)=|1/2-x_2|\). The solution of the Hamilton–Jacobi PDE (8) with the Hamiltonian (35) and the boundary condition given by \(\tilde{\sigma }\) is
The controls \(u_*\) and \(v_*\) satisfying conditions (9) and (10) are
Let \(t_*,t_+\in [0,T]\), \(x_*\in [0,1]\), \(v_*\in [0,1]\). If \(x_*<1/2\), then put
if \(x_*\ge 1/2\) then put
Note that \(\tilde{\zeta }_1(\cdot ,t_+,t_*,x_*,v_*)\) is a solution of the differential inclusion
Moreover,
and \(\varphi (t_+,\tilde{\zeta }_1(t_+,t_+,t_*,x_*,v_*))\le \varphi (t_*,x_*)\). The function \(\tilde{\zeta }_1\) is the second coordinate of the function \(\zeta _1\) defined in Sect. 3. Thus, the control with guide strategy determined by (u1)–(u3) takes the form
\(\hat{\psi }_1(t_+,t,x_2,w)=\tilde{\zeta }_1(t_+,t_+,t,w,v_*)\), where
\(\hat{\chi }_1(s,y_2)=y_2\).
To simulate the Markov chain with the generator (32), we consider the discrete-time Markov chain defined for \(t_k=k\varDelta t\) with the transition probabilities
The state of the guide was computed analytically. The results of the simulation for \(N=32\), \(t_0=0\), \(x_0=(7/8,1/8)\), \(T=1\), \(\varDelta t=10^{-3}\),
and 100 trials are presented in Figs. 1 and 2.
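A possible implementation of this discrete-time scheme is sketched below; our reading is that, on each step of length \(\varDelta t\), the chain jumps from x to \(x-he^i+he^j\) with probability \(\frac{1}{h}x_iQ_{ij}(t,x,u,v)\varDelta t\). The matrix of the example enters through the callable `Q`, and the feedbacks \(u_*\), \(v_*\) through `u_fun`, `v_fun`; the names are our assumptions.

```python
import numpy as np

def simulate_discrete_chain(Q, u_fun, v_fun, x0, T, dt, h, rng):
    """Discrete-time approximation of the controlled chain: on each step,
    jump x -> x - h*e^i + h*e^j with probability (x_i/h)*Q_ij(t,x,u,v)*dt;
    dt must be small enough that these probabilities sum to at most one."""
    x, t = np.asarray(x0, dtype=float).copy(), 0.0
    d = x.size
    while t < T:
        u, v = u_fun(t, x), v_fun(t, x)
        probs = x[:, None] * Q(t, x, u, v) * dt / h
        np.fill_diagonal(probs, 0.0)
        stay = 1.0 - probs.sum()                  # probability of no jump
        k = rng.choice(d * d + 1, p=np.append(probs.ravel(), stay))
        if k < d * d:
            i, j = divmod(k, d)
            x[i] -= h
            x[j] += h
        t += dt
    return x

# e.g.: rng = np.random.default_rng(0); h = 1/32; dt = 1e-3; T = 1.0
```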
Note that the value of the game is 0.0017119, whereas the average distance between the state of the Markov chain and 1/2 is 0.08375, and the average distance between the state of the guide and 1/2 is 0.0295650. Moreover, the standard deviation of the distance between the state of the Markov chain and the state of the guide is 0.0074127.
The results of the simulation of the Markov chain with \(N=128\) particles are presented in Figs. 3 and 4. In this case, the average distance between the state of the Markov chain and 1/2 is 0.0678906, the average distance between the state of the guide and 1/2 is 0.0379969, and the standard deviation of the distance between the state of the Markov chain and the state of the guide is 0.0017119. Note that the ratio of the standard deviations is less than the ratio of the numbers of particles. This result corresponds to estimate (29).
8 Conclusion
A continuous-time Markov game describing an interacting particle system is considered. A near-optimal guaranteeing strategy in the above Markov game is designed on the basis of an auxiliary deterministic game derived from the Markov one. The extremal shift rule proposed by Krasovskii and Subbotin is used for the design. The above strategy uses a model of the Markov game defined by deterministic relations, called the guide. However, it receives state estimates of the Markov game as input. Thus, the strategy used in the paper is a stochastic strategy with memory. The question whether the auxiliary game has a purely feedback strategy that is near optimal in the Markov game remains open.
The study in this paper is restricted to Markov games describing interacting particle systems. The extension of the obtained results to general Markov games is a subject of future work.
References
Averboukh Y (2015) Universal Nash equilibrium strategies for differential games. J Dyn Control Syst 21:329–350
Bardi M, Capuzzo Dolcetta I (1997) Optimal control and viscosity solutions of Hamilton–Jacobi–Bellman equations. Birkhäuser, Basel
Benaïm M, Le Boudec J-Y (2008) A class of mean field interaction models for computer and communication systems. Perform Eval 65:823–838
Darling RWR, Norris JR (2010) Differential equation approximations for Markov chains. Probab Surv 5:37–79
Fleming WH, Soner HM (2006) Controlled Markov processes and viscosity solutions. Springer, New York
Gast N, Gaujal B, Le Boudec J-Y (2010) Mean field for Markov decision processes: from discrete to continuous optimization. INRIA report no. 7239
Kolokoltsov VN (2010) Nonlinear Markov process and kinetic equations. Cambridge University Press, Cambridge
Kolokoltsov VN (2013) Nonlinear Markov games on a finite state space (mean-field and binary interactions). Int J Stat Probab 1:77–91
Krasovskii NN, Kotelnikova AN (2009) Unification of differential games, generalized solutions of the Hamilton–Jacobi equations, and a stochastic guide. Differ Equ 45:1653–1668
Krasovskii NN, Kotelnikova AN (2010) An approach-evasion differential game: stochastic guide. Proc Steklov Inst Math 269:S191–S213
Krasovskii NN, Kotelnikova AN (2010) On a differential interception game. Proc Steklov Inst Math 268:161–206
Krasovskii NN, Kotelnikova AN (2012) Stochastic guide for a time-delay object in a positional differential game. Proc Steklov Inst Math 277:S145–S151
Krasovskii NN, Subbotin AI (1988) Game-theoretical control problems. Springer, New York
Kriazhimskii AV (1978) On stable position control in differential games. J Appl Math Mech 42(6):1055–1060
Levy Y (2013) Continuous-time stochastic games of fixed duration. Dyn Games Appl 3:279–312
Lukoyanov NYu, Plaksin AR (2013) Finite-dimensional modeling guides in time-delay systems. Trudy Inst Mat i Mekh UrO RAN 19:182–195 (in Russian)
Neyman A (2012) Continuous-time stochastic games. DP #616. Center for the Study of Rationality, Hebrew University, Jerusalem
Subbotin AI (1995) Generalized solutions of first-order PDEs: the dynamical optimization perspective. Birkhäuser, Boston
Subbotin AI, Chentsov AG (1981) Optimization of guarantee in control problems. Nauka, Moscow (in Russian)
Zachrisson LE (1964) Markov games. In: Dresher M, Shapley LS, Tucker AW (eds) Advances in game theory. Princeton University Press, Princeton, pp 211–253
Acknowledgments
The author would like to thank Vassili Kolokoltsov for insightful discussions and the anonymous reviewer for the valuable comments. The research was supported by Russian Foundation for Basic Research (Project N15-01-07909).