Abstract
In this paper, we consider a controlled continuous-time Markov chain describing an interacting particle system with a finite number of types. The system is controlled by two players with opposite purposes. This Markov game converges to a zero-sum differential game as the number of particles tends to infinity. The Krasovskii–Subbotin extremal shift provides the optimal strategy in the limiting game. The main result of the paper is the near optimality of the Krasovskii–Subbotin extremal shift rule for the original Markov game.
1 Introduction
The paper is concerned with the construction of near-optimal strategies for a zero-sum two-player continuous-time Markov game based on a deterministic game. The term ‘Markov game’ is used for a Markov chain whose Kolmogorov matrix depends on the controls of the players. These games are also called continuous-time stochastic games. Continuous-time Markov games were first studied by Zachrisson [20]. Recent progress in the theory of continuous-time Markov games can be found in [15, 17] and the references therein.
We consider the case when the continuous-time Markov chain describes an interacting particle system. The interacting particle system converges to a deterministic system as the number of particles tends to infinity [7, 8] (see also [3, 4]). The value function of the controlled Markov chain converges to the value function of the limiting control system [8] (see [6] for the corresponding result for discrete-time systems). This result is extended to zero-sum games as well as to nonzero-sum games in [8]. If a nonanticipative strategy is optimal for the differential game, then it is near optimal for the Markov game [8]. However, nonanticipative strategies require knowledge of the control of the second player. Often this information is inaccessible, and the player has only information about the current position. In this case, one can use feedback strategies or control with guide strategies.
Control with guide strategies were proposed by Krasovskii and Subbotin to construct solutions of deterministic differential games under informational disturbances [13]. Note that feedback strategies do not provide a stable solution of the differential game. If the player uses a control with guide strategy, then the control is formed stepwise, and the player has a model of the system that is used to choose an appropriate control at each step. The easiest way to construct a control with guide strategy is the extremal shift rule. The value function is achieved in the limit when the time between control corrections tends to zero. In the original work by Krasovskii and Subbotin, the motion of the model is governed by a copy of the original system, and the motion of the original system stays close to the motion of the model. Therefore, the model can be called a guide. Formally, a control with guide strategy is a strategy with memory. However, it suffices to store only a finite number of vectors. Moreover, the player does not require information about the second player’s control.
Control with guide strategies realizing the extremal shift were used for differential games without the Lipschitz continuity assumption on the system dynamics in [14] and for games governed by delay differential equations in [12, 16]. Krasovskii and Kotelnikova proposed stochastic control with guide strategies [9–11]. In that case, the real motion of the deterministic system is close to an auxiliary stochastic process generated by an optimal control of the stochastic differential game. A Nash equilibrium for a two-player game in the class of control with guide strategies was constructed via extremal shift in [1].
In this paper, we let the player use the control with guide strategy realizing the extremal shift rule in the Markov game. We assume that the motion of the guide is given by the limiting deterministic differential game. We estimate the expectation of the distance between the Markov chain and the motion of the model (guide). This leads to an estimate of the difference between the outcome of the player in the Markov game and the value function of the limiting differential game.
The paper is organized as follows. In the preliminary Sect. 2, we introduce the Markov game describing the interacting particle system and the limiting deterministic differential game. In Sect. 3, we give the explicit definition of control with guide strategies and formulate the main results. Section 4 is devoted to a property of transition probabilities. In Sect. 5, we estimate the expectation of the distance between the Markov chain and the deterministic guide. Section 6 provides the proofs of the statements formulated in Sect. 3. Finally, in Sect. 7, we illustrate the theoretical results by a simulation of a particular Markov chain.
2 Preliminaries
We consider a system of a finite number of particles. Each particle can be of type i, \(i\in \{1,\ldots ,d\}\). The type of each particle is a random variable governed by a Markov chain. To specify this chain, consider the Kolmogorov matrix \(Q(t,x,u,v)=(Q_{ij}(t,x,u,v))_{i,j=1}^d\). This means that the elements of the matrix Q(t, x, u, v) satisfy the following properties:
-
\(Q_{ij}(t,x,u,v)\ge 0\) for \(i\ne j\);
-
$$\begin{aligned} Q_{ii}(t,x,u,v)=-\sum _{j\ne i}Q_{ij}(t,x,u,v). \end{aligned}$$(1)
Here \(t\in [0,T],\ \ x\in \varSigma _d=\{(x_1,\ldots , x_d):x_i\ge 0,\ x_1+\cdots +x_d=1\},\) \(u\in U\), \(v\in V.\) The parameter x is used below for the state of the interacting particle system. We assume that \(x=(x_1,\ldots ,x_d)\) is a row vector. The variables u and v are controlled by the first and the second players, respectively.
Additionally we assume that
-
U and V are compact sets;
-
Q is a continuous function;
-
for any t, u and v, the function \(x\mapsto Q(t,x,u,v)\) is Lipschitz continuous;
-
for any \(t\in [0,T]\), \(\xi ,x\in \mathbb {R}^d\), the following equality holds true:
$$\begin{aligned} \min _{u\in U}\max _{v\in V}\langle \xi ,xQ(t,x,u,v)\rangle =\max _{v\in V}\min _{u\in U}\langle \xi ,xQ(t,x,u,v)\rangle \end{aligned}$$(2)
Condition (2) is an analog of the well-known Isaacs condition.
For fixed parameters \(x\in \varSigma _d\), \(u\in {U}\), and \(v\in {V}\), the type of each particle is determined by the Markov chain with the generator
$$\begin{aligned} \mathcal {Q}_t[x,u,v]f(i)=\sum _{j=1}^dQ_{ij}(t,x,u,v)f(j),\quad i\in \{1,\ldots ,d\}. \end{aligned}$$
Another way to specify the Markov chain is the Kolmogorov forward equation
$$\begin{aligned} \frac{\partial }{\partial t}P(s,t,x)=P(s,t,x)Q(t,x,u,v),\quad P(s,s,x)=\mathrm {Id}. \end{aligned}$$
Here \(P(s,t,x)=(P_{ij}(s,t,x))_{i,j=1}^d\) is the matrix of the transition probabilities.
Now we consider the controlled mean-field interacting particle system (see [8] for details). Let \(n_i\) be the number of particles of type i. The vector \(N=(n_1,\ldots ,n_d)\in \mathbb {Z}_+^d\) is the state of the system consisting of \(|N|=n_1+\cdots +n_d\) particles. For \(i\ne j\) and a vector \(N=(n_1,\ldots ,n_d)\), denote by \(N^{[ij]}\) the vector obtained from N by removing one particle of type i and adding one particle of type j, i.e., we replace the ith coordinate with \(n_i-1\) and the jth coordinate with \(n_j+1\). The mean-field interacting particle system is a Markov chain with the generator
$$\begin{aligned} \mathcal {L}_t[u,v]f(N)=\sum _{i\ne j}n_iQ_{ij}(t,N/|N|,u,v)\left[ f(N^{[ij]})-f(N)\right] . \end{aligned}$$
The purpose of the first (respectively, second) player is to minimize (respectively, maximize) the expectation of \(\sigma (N/|N|)\).
Denote the inverse number of particles by \(h=1/|N|\). Normalizing the states of the interacting particle system, we get the generator (see [8])
$$\begin{aligned} L_t^{h}[u,v]f(hN)=\sum _{i\ne j}n_iQ_{ij}(t,hN,u,v)\left[ f(hN^{[ij]})-f(hN)\right] . \end{aligned}$$
Denote the vector N / |N| by \(x=(x_1,\ldots ,x_d)\). Thus, we have that
$$\begin{aligned} L_t^h[u,v]f(x)=\frac{1}{h}\sum _{i\ne j}x_iQ_{ij}(t,x,u,v)\left[ f(x-he^i+he^j)-f(x)\right] . \end{aligned}$$(3)
Here \(e^i\) is the ith coordinate vector. The vector x belongs to the set
$$\begin{aligned} \varSigma _d^h\triangleq \varSigma _d\cap h\mathbb {Z}_+^d=\left\{ x\in \varSigma _d:x_i/h\in \mathbb {Z}_+\right\} . \end{aligned}$$
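To make the generator (3) concrete, here is a minimal simulation sketch of one jump of the chain by the standard Gillespie method. It is an illustration only: the callable `Q`, the array conventions, and all names are our assumptions rather than notation from the paper.

```python
import numpy as np

def jump_rates(t, x, u, v, Q, h):
    """Rate of the jump x -> x - h*e^i + h*e^j equals (1/h)*x_i*Q_ij(t,x,u,v)
    for i != j, in accordance with the generator (3).  Q(t, x, u, v) is a
    user-supplied function returning a d x d Kolmogorov matrix."""
    rates = x[:, None] * Q(t, x, u, v) / h
    np.fill_diagonal(rates, 0.0)   # diagonal entries do not correspond to jumps
    return rates

def gillespie_step(t, x, u, v, Q, h, rng):
    """Sample the waiting time and the post-jump state of the chain."""
    rates = jump_rates(t, x, u, v, Q, h)
    total = rates.sum()
    if total == 0.0:               # no admissible jump: the state is frozen
        return t, x
    t_next = t + rng.exponential(1.0 / total)
    k = rng.choice(rates.size, p=(rates / total).ravel())
    i, j = divmod(k, x.size)
    x_next = x.copy()
    x_next[i] -= h                 # one particle leaves type i ...
    x_next[j] += h                 # ... and becomes a particle of type j
    return t_next, x_next
```

Here x is a NumPy array representing a point of \(\varSigma _d^h\) and rng is a `numpy.random.Generator`.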
Further, let \(\mathcal {U}_\mathrm{det}[s]\) (respectively, \(\mathcal {V}_\mathrm{det}[s]\)) denote the set of deterministic controls of the first (respectively, second) player on [s, T], i.e., the set of measurable functions \(u:[s,T]\rightarrow U\) (respectively, \(v:[s,T]\rightarrow V\)).
Let \((\varOmega ,\mathcal {F},\{\mathcal {F}_t\},P)\) be a filtered probability space. Extending the definition given in [5, p. 135] to the stochastic game case, we say that the pair of stochastic processes u and v on [s, T] is an admissible pair of controls if
1.
\(u(t)\in U\), \(v(t)\in V\);
2.
the processes u and v are progressively measurable;
3.
for any \(y\in \varSigma ^h_d\), there exists a unique \(\{\mathcal {F}_t\}_{t\in [s,T]}\)-adapted càdlàg stochastic process \(X^h(t,s,y,u,v)\) taking values in \(\varSigma ^h_d\), starting at y at time s, and satisfying the following condition for any \(f\in C(\varSigma ^h_d)\):
$$\begin{aligned} f\left( X^h(t,s,y,u,v)\right) -\int _s^t L_{\tau }^h[u(\tau ),v(\tau )]f\left( X^h(\tau ,s,y,u,v)\right) \hbox {d}\tau \end{aligned}$$is a martingale.
In particular, the third condition means that
$$\begin{aligned} \mathbb {E}_{sy}^hf\left( X^h(t,s,y,u,v)\right) =f(y)+\mathbb {E}_{sy}^h\int _s^t L_{\tau }^h[u(\tau ),v(\tau )]f\left( X^h(\tau ,s,y,u,v)\right) \hbox {d}\tau . \end{aligned}$$(4)
Here \(\mathbb {E}_{sy}^h\) denotes the conditional expectation of corresponding stochastic processes.
The purposes of the players can be reformulated in the following way. The first (respectively, second) player wishes to minimize (respectively, maximize) the payoff
$$\begin{aligned} \mathbb {E}_{sy}^h\sigma \left( X^h(T,s,y,u,v)\right) . \end{aligned}$$
Let \(\mathcal {U}^h[s]\) be the set of stochastic processes u taking values in U such that the pair (u, v) is admissible for any \(v\in \mathcal {V}_\mathrm{det}[s]\). Analogously, let \(\mathcal {V}^h[s]\) be the set of stochastic processes v taking values in V such that the pair (u, v) is admissible for any \(u\in \mathcal {U}_\mathrm{det}[s]\).
Denote by \(P_{sy}^h(A)\) the conditional probability of an event A under the condition that the Markov chain corresponding to the parameter h starts at y at time s.
Further, let \(p^h(s,y,t,z,u,v)\) denote the transition probability, i.e.,
$$\begin{aligned} p^h(s,y,t,z,u,v)\triangleq P_{sy}^h\left( X^h(t,s,y,u,v)=z\right) . \end{aligned}$$
Substituting \(\mathbf {1}_{\{z\}}\) for f in (3) and (4) gives that
$$\begin{aligned} p^h(s,y,t,z,u,v)=\mathbf {1}_{\{z\}}(y)+\frac{1}{h}\mathbb {E}_{sy}^h\int _s^t\sum _{i\ne j}X_{i}^h(\tau ,s,y,u,v)Q_{ij}\left( \tau ,X^h(\tau ,s,y,u,v),u(\tau ),v(\tau )\right) \\ \times \left[ \mathbf {1}_{\{z\}}\left( X^h(\tau ,s,y,u,v)-he^i+he^j\right) -\mathbf {1}_{\{z\}}\left( X^h(\tau ,s,y,u,v)\right) \right] \hbox {d}\tau . \end{aligned}$$(5)
Here \(X_{i}^h(\tau ,s,y,u,v)\) denotes the ith component of \(X^h(\tau ,s,y,u,v)\).
Recall (see [8]) that if \(h\rightarrow 0\), then the generator \(L_t^h[u,v]\) converges to the generator
$$\begin{aligned} \varLambda _t[u,v]f(x)=\sum _{i\ne j}x_iQ_{ij}(t,x,u,v)\left[ \frac{\partial f(x)}{\partial x_j}-\frac{\partial f(x)}{\partial x_i}\right] =\langle \nabla f(x),xQ(t,x,u,v)\rangle . \end{aligned}$$
For controls \(u\in \mathcal {U}_\mathrm{det}[s]\) and \(v\in \mathcal {V}_\mathrm{det}[s]\), the deterministic evolution generated by the generator \(\varLambda _t[u(t),v(t)]\) is described by the equation
$$\begin{aligned} \frac{\hbox {d}}{\hbox {d}t}f_t(y)=\varLambda _t[u(t),v(t)]f_t(y). \end{aligned}$$(6)
Here the function \(f_t(y)\) is equal to f(x(t)) when \(x(s)=y\). The characteristics of (6) solve the ODEs
$$\begin{aligned} \dot{x}_j=\sum _{i=1}^dx_iQ_{ij}(t,x,u(t),v(t)),\quad j=1,\ldots ,d. \end{aligned}$$
One can rewrite this system in the vector form
$$\begin{aligned} \dot{x}=xQ(t,x,u(t),v(t)). \end{aligned}$$(7)
For given \(u\in \mathcal {U}_\mathrm{det}[s]\), \(v\in \mathcal {V}_\mathrm{det}[s]\), denote by \(x(\cdot ,s,y,u,v)\) the solution of the initial value problem for (7) with the initial condition \(x(s)=y\).
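The trajectories \(x(\cdot ,s,y,u,v)\) can be approximated by any standard ODE scheme; the following minimal Euler sketch (again with a hypothetical callable `Q`, and the controls u, v given as Python callables of time) is included for illustration.

```python
import numpy as np

def limit_trajectory(Q, u, v, s, T, y, n_steps=10_000):
    """Euler scheme for the vector dynamics (7): dx/dt = x Q(t, x, u(t), v(t)).
    Since the rows of the Kolmogorov matrix sum to zero, the coordinates of x
    keep summing to one up to round-off."""
    x = np.asarray(y, dtype=float)
    t, dt = s, (T - s) / n_steps
    for _ in range(n_steps):
        x = x + dt * (x @ Q(t, x, u(t), v(t)))   # x is a row vector
        t += dt
    return x   # approximates x(T, s, y, u, v)
```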
Consider the deterministic zero-sum game with the dynamics given by (7) and the terminal payoff equal to \(\sigma (x(T,s,y,u,v))\). This game has a value in the class of feedback strategies [13] that is a continuous function of the position. Denote it by \(\mathrm{Val}(s,y)\). Recall (see [18]) that the function \(\mathrm{Val}(s,y)\) is a minimax (viscosity) solution of the Hamilton–Jacobi PDE
$$\begin{aligned} \frac{\partial \varphi }{\partial t}+H(t,x,\nabla \varphi )=0,\quad \varphi (T,x)=\sigma (x). \end{aligned}$$(8)
Here the Hamiltonian H is defined by the rule
$$\begin{aligned} H(t,x,\xi )\triangleq \min _{u\in U}\max _{v\in V}\langle \xi ,xQ(t,x,u,v)\rangle =\max _{v\in V}\min _{u\in U}\langle \xi ,xQ(t,x,u,v)\rangle . \end{aligned}$$
Remark 1
The approaches based on control with guide strategies and nonanticipative strategies (see [2] for a detailed presentation of the latter approach) are equivalent to the feedback formalization [18]. The value function in these cases is also equal to \(\mathrm{Val}\).
3 Control with Guide Strategies
In this section, we introduce the control with guide strategies for the Markov game. It is assumed that the control is formed stepwise and the player has information about the current state of the system, i.e., the vector x is known. Additionally, we assume that the player can evaluate the expected state, and the player’s control depends both on the current state of the system and on the evaluated state. This evaluation is called a guide. At each time of control correction, the player computes the value of the guide and the control that is used up to the next time of control correction.
Formally (see [19]), control with guide strategy of player 1 is a triple \(\mathfrak {u}=(u(t,x,w),\psi _1(t_+,t,x,w),\chi _1(s,y))\). Here the function u(t, x, w) is equal to the control implemented after time t if at time t the state of the system is x and the state of the guide is w. The function \(\psi _1(t_+,t,x,w)\) determines the state of the guide at time \(t_+\) under the condition that at time t the state of the system is x and the state of the guide is w. The function \(\chi _1\) initializes the guide, i.e., \(\chi _1(s,y)\) is the state of the guide in the initial position (s, y).
We use the control with guide strategies for the Markov game with the generator \(L_t^h\). Here we assume that \(h>0\) is fixed. Let (s, y) be an initial position, \(s\in [0,T]\), \(y\in \varSigma _d^h\). Assume that player 1 chooses the control with guide strategy \(\mathfrak {u}\) and the partition \(\varDelta =\{t_k\}_{k=0}^m\) of the time interval [s, T], whereas player 2 chooses the control \(v\in \mathcal {V}^h[s]\). This control can be also formed stepwise using some second player’s control with guide strategy.
We say that the stochastic process \(\mathcal {X}_1^h[\cdot ,s,y,\mathfrak {u},\varDelta ,v]\) is generated by the strategy \(\mathfrak {u}\), the partition \(\varDelta \), and the second player’s control v if, for \(t\in [t_k,t_{k+1})\), \(\mathcal {X}_1^h[t,s,y,\mathfrak {u},\varDelta ,v]=X^h(t,t_k,x_k,u_k,v), \) where
-
\(x_0=y\), \(w_0=\chi _1(t_0,x_0)\), \(u_0=u(t_0,x_0,w_0)\);
-
for \(k=\overline{1,m}\), \(x_k=X^h(t_k,t_{k-1},x_{k-1},u_{k-1},v)\), \(w_k=\psi _1(t_k,t_{k-1},x_{k-1},w_{k-1})\), \(u_k=u(t_k,x_k,w_k)\).
Note that even though the state of the guide \(w_k\) is determined by a deterministic function, it depends on the random variable \(x_{k-1}\). Thus, \(w_k\) is a random variable.
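The stepwise construction above can be summarized by a simple control loop. In the sketch below, `step_chain(t_k, t_next, x_k, u_k)` is a hypothetical routine that simulates \(X^h\) on one partition interval under the constant first-player control \(u_k\) (for instance, by iterating the Gillespie step from Sect. 2), with the second player's control embedded inside it; all names are ours.

```python
def play_with_guide(strategy, partition, step_chain, y):
    """Realize a control-with-guide strategy (u, psi_1, chi_1) over a
    partition s = t_0 < t_1 < ... < t_m = T; returns the state at time T."""
    u_fun, psi_fun, chi_fun = strategy
    x = y
    w = chi_fun(partition[0], y)                   # w_0 = chi_1(t_0, x_0)
    for t_k, t_next in zip(partition, partition[1:]):
        u_k = u_fun(t_k, x, w)                     # u_k = u(t_k, x_k, w_k)
        x_new = step_chain(t_k, t_next, x, u_k)    # random motion of X^h
        w = psi_fun(t_next, t_k, x, w)             # w_{k+1} uses x_k and w_k
        x = x_new
    return x
```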
Below we define the first player’s control with guide strategy that realizes the extremal shift rule (see [13]). Let \(\varphi \) be a supersolution of Eq. (8). That means (see [18]) that for any \((t_*,x_*)\in [0,T]\times \varSigma _{d}\), \(t_+>t_*\), and \(v_*\in V\), there exists a solution \(\zeta _1(\cdot ,t_+,t_*,x_*,v_*)\) of the differential inclusion
$$\begin{aligned} \dot{\zeta }(t)\in \mathrm {co}\left\{ \zeta (t)Q(t,\zeta (t),u,v_*):u\in U\right\} \end{aligned}$$
satisfying the conditions \(\zeta _1(t_*,t_+,t_*,x_*,v_*)=x_*\) and \(\varphi (t_+,\zeta _1(t_+,t_+,t_*,x_*,v_*))\le \varphi (t_*,x_*)\).
Define the control with guide strategy \(\hat{\mathfrak {u}}=(\hat{u},\hat{\psi }_1,\hat{\chi }_1)\) by the following rules. If \(t_*,t_+\in [0,T]\), \(t_+>t_*\), \(x_*,w_*\in \varSigma _d\), then choose \(u_*\), \(v_*\) by the rules
$$\begin{aligned} u_*\in \mathop {\hbox {argmin}}\limits _{u\in U}\max _{v\in V}\langle x_*-w_*,x_*Q(t_*,x_*,u,v)\rangle , \end{aligned}$$(9)
$$\begin{aligned} v_*\in \mathop {\hbox {argmax}}\limits _{v\in V}\min _{u\in U}\langle x_*-w_*,w_*Q(t_*,w_*,u,v)\rangle . \end{aligned}$$(10)
Put
(u1)
\(\hat{u}(t_*,x_*,w_*)=u_*\);
(u2)
\(\hat{\psi }_1(t_+,t_*,x_*,w_*)=\zeta _1(t_+,t_+,t_*,w_*,v_*)\);
(u3)
\(\hat{\chi }_1(s,y)=y\).
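When U and V are replaced by finite grids, the extremal shift pair \((u_*,v_*)\) of (9) and (10) can be computed by brute force. The sketch below is a simplification under this grid assumption; the function names are ours.

```python
import numpy as np

def extremal_shift_controls(t, x, w, Q, U_grid, V_grid):
    """Extremal shift on finite grids; s = x - w is the shift direction."""
    xa, wa = np.asarray(x, float), np.asarray(w, float)
    s = xa - wa
    # (9): u_* minimizes the worst-case velocity of the real state along s.
    u_star = min(U_grid,
                 key=lambda u: max(float(s @ (xa @ Q(t, xa, u, v)))
                                   for v in V_grid))
    # (10): v_* maximizes the best-case velocity of the guide along s.
    v_star = max(V_grid,
                 key=lambda v: min(float(s @ (wa @ Q(t, wa, u, v)))
                                   for u in U_grid))
    return u_star, v_star
```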
Note that if the first player uses the strategy \(\hat{\mathfrak {u}}\) in the differential game with the dynamics given by (7), then she guarantees that the limit outcome is not greater than \(\varphi (s,y)\) (see [13, 18]). If additionally \(\varphi =\mathrm{Val}\), then the strategy \(\hat{\mathfrak {u}}\) is optimal in the deterministic game.
The main result of the paper is the following.
Theorem 1
Assume that \(\sigma \) is Lipschitz continuous with a constant R, and the function \(\varphi \) is a supersolution of (8). If the first player uses the control with guide strategy \(\hat{\mathfrak {u}}\) determined by (u1)–(u3), then
(i)
$$\begin{aligned}&\lim _{\delta \downarrow 0}\sup \left\{ \mathbb {E}_{sy}^h\left( \sigma \left( \mathcal {X}_1^h[T,s,y,\hat{\mathfrak {u}},\varDelta ,v]\right) \right) :d(\varDelta )\le \delta ,v\in \mathcal {V}^h[s]\right\} \\&\quad \le \varphi (s,y)+R\sqrt{Dh}. \end{aligned}$$
(ii)
$$\begin{aligned}&\lim _{\delta \downarrow 0}\sup \left\{ P_{sy}^h\left( \sigma \left( \mathcal {X}_1^h[T,s,y,\hat{\mathfrak {u}},\varDelta ,v]\right) \ge \varphi (s,y)+R\root 3 \of {Dh}\right) :\right. \\&\quad \left. d(\varDelta )\le \delta ,\ \ v\in \mathcal {V}^h[s]\right\} \le \root 3 \of {Dh}. \end{aligned}$$
Here D is a constant that depends on neither \(\varphi \) nor \(\sigma \).
The theorem is proved in Sect. 6.
Now let us consider the case when the second player uses control with guide strategies. A control with guide strategy of the second player is a triple \(\mathfrak {v}=(v(t,x,w),\psi _2(t_+,t,x,w),\chi _2(s,y))\). Here w denotes the state of the second player’s guide. The control in this case is also formed stepwise. If (s, y) is an initial position, \(\varDelta \) is a partition of the time interval [s, T], and \(u\in \mathcal {U}^h[s]\) is a control of player 1, then denote by \(\mathcal {X}_2^h[\cdot ,s,y,\mathfrak {v},\varDelta ,u]\) the corresponding stochastic process.
Let \(\phi \) be a subsolution of Eq. (8). That means (see [18]) that for any \((t_*,x_*)\in [0,T]\times \varSigma _{d}\), \(t_+>t_*\), and \(u^*\in U\), there exists a trajectory \(\zeta _2(\cdot ,t_+,t_*,x_*,u^*)\) of the differential inclusion
$$\begin{aligned} \dot{\zeta }(t)\in \mathrm {co}\left\{ \zeta (t)Q(t,\zeta (t),u^*,v):v\in V\right\} \end{aligned}$$
satisfying the conditions \(\zeta _2(t_*,t_+,t_*,x_*,u^*)=x_*\) and \(\phi (t_+,\zeta _2(t_+,t_+,t_*,x_*,u^*))\ge \phi (t_*,x_*).\)
Define the strategy \(\hat{\mathfrak {v}}\) by the following rules. If \((t_*,x_*)\) is a position, \(t_+>t_*\), and \(w_*\in \varSigma _d\) is a state of the guide, then choose \(v^*\) and \(u^*\) by the rules
$$\begin{aligned} v^*\in \mathop {\hbox {argmax}}\limits _{v\in V}\min _{u\in U}\langle w_*-x_*,x_*Q(t_*,x_*,u,v)\rangle , \end{aligned}$$
$$\begin{aligned} u^*\in \mathop {\hbox {argmin}}\limits _{u\in U}\max _{v\in V}\langle w_*-x_*,w_*Q(t_*,w_*,u,v)\rangle . \end{aligned}$$
Put
Put
(v1)
\(v(t_*,x_*,w_*)=v^*\);
(v2)
\(\psi _2(t_+,t_*,x_*,w_*)=\zeta _2(t_+,t_+,t_*,w_*,u^*)\);
(v3)
\(\chi _2(s,y)=y\).
Corollary 1
Let \(\phi \) be a subsolution of (8). If the second player uses the control with guide strategy \(\hat{\mathfrak {v}}\) determined by (v1)–(v3), then
(i)
$$\begin{aligned}&\lim _{\delta \downarrow 0}\inf \left\{ \mathbb {E}_{sy}^h\left( \sigma \left( \mathcal {X}_2^h[T,s,y,\hat{\mathfrak {v}},\varDelta ,u]\right) \right) :d(\varDelta )\le \delta ,u\in \mathcal {U}^h[s]\right\} \\&\quad \ge \phi (s,y)-R\sqrt{Dh}. \end{aligned}$$
(ii)
$$\begin{aligned}&\lim _{\delta \downarrow 0}\sup \left\{ P_{sy}^h\left( \sigma \left( \mathcal {X}_2^h[T,s,y,\hat{\mathfrak {v}},\varDelta ,u]\right) \le \phi (s,y)-R\root 3 \of {Dh}\right) :\right. \\&\quad \left. d(\varDelta )\le \delta ,\ \ u\in \mathcal {U}^h[s]\right\} \le \root 3 \of {Dh}. \end{aligned}$$
The corollary is also proved in Sect. 6.
4 Properties of Transition Probabilities
Now we prove the following.
Lemma 1
There exists a function \(\alpha ^h(\delta )\) such that \(\alpha ^h(\delta )\rightarrow 0\) as \(\delta \rightarrow 0\) and for any \(t_*,t_+\in [0,T]\), \(\xi ,\eta \in \varSigma _d\), \(\xi =(\xi _1,\ldots ,\xi _d)\), \(\bar{u}\in U\), \(\bar{v}\in \mathcal {V}^h[t_*]\) the following properties hold true
1.
if \(\eta =\xi \), then
$$\begin{aligned} p^h(t_*,\xi ,t_+,\eta ,\bar{u},\bar{v})\le 1+\frac{1}{h}\sum _{k=1}^d\int _{t_*}^{t_+}\int _{V}\xi _kQ_{kk}(t_*,\xi ,\bar{u},v)\nu _\tau (\hbox {d}v)\hbox {d}\tau +\alpha ^h(t_+-t_*)\cdot (t_+-t_*); \end{aligned}$$
2.
if \(\eta =\xi -h e^i+he^j\), then
$$\begin{aligned} p^h(t_*,\xi ,t_+,\eta ,\bar{u},\bar{v})\le \frac{1}{h}\int _{t_*}^{t_+}\int _{V}\xi _iQ_{ij}(t_*,\xi ,\bar{u},v)\nu _\tau (\hbox {d}v)\hbox {d}\tau +\alpha ^h(t_+-t_*)\cdot (t_+-t_*); \end{aligned}$$
3.
if \(\eta \ne \xi \) and \(\eta \ne \xi -h e^i+he^j\), then
$$\begin{aligned} p^h(t_*,\xi ,t_+,\eta ,\bar{u},\bar{v})\le \alpha ^h(t_+-t_*)\cdot (t_+-t_*). \end{aligned}$$
Here \(\nu _\tau \) is a measure on V depending on \(t_*,t_+\), \(\xi \), \(\eta \), \(\bar{u}\) and \(\bar{v}\).
Proof
First denote
For any \(x\in \varSigma _d\), \(t\in [0,T]\), \(u\in U\), \(v\in V\), the following estimates hold true
Further, let \(\gamma (\delta )\) be a common modulus of continuity with respect to t of the functions \(Q_{ij}\), i.e., for all i, j, \(t',t''\in [0,T]\), \(x\in \varSigma _d\), \(u\in U\), \(v\in V\),
$$\begin{aligned} |Q_{ij}(t',x,u,v)-Q_{ij}(t'',x,u,v)|\le \gamma (|t'-t''|), \end{aligned}$$
and \(\gamma (\delta )\rightarrow 0\) as \(\delta \rightarrow 0\). From (5) and (12), we obtain that
Further, for a given control \(\bar{v}\in \mathcal {V}^h[t_*]\), let \(\mathbb {E}^h_{t_*\xi ;\tau x}\) denote the expectation under conditions \(X^h(t_*,t_*,\xi ,\bar{u},\bar{v})=\xi \), and \(X^h(\tau ,t_*,\xi ,\bar{u},\bar{v})=x\).
We have that
From this and (5), we get
Here \(x_i\) denotes the ith component of the vector x.
We have that \(p^h(t_*,\xi ,t_*,x,\bar{u},\bar{v})=1\) for \(x=\xi \) and \(p^h(t_*,\xi ,t_*,x,\bar{u},\bar{v})=0\) for \(x\ne \xi \). Thus,
Recall that, for each \(\tau \), \(\bar{v}(\tau )\) is a random variable with values in V. Define the measure \(\nu _\tau \) on V as the image of the measure \(P_{t_*\xi ;\tau \xi }\) under \(\bar{v}(\tau )\), where for \(A\in \mathcal {F}\)
We have that
Consequently,
Here we denote
From (15), the second and the third statements of the lemma follow. To derive the first statement, use property (1) of Kolmogorov matrices. We have that
\(\square \)
5 Key Estimate
This section provides the estimate of the distance between the controlled Markov chain and the guide. This estimate is an analog of [13, Lemma 2.3.1].
Lemma 2
There exist constants \(\beta ,C>0\), and a function \(\varkappa ^h(\delta )\) such that \(\varkappa ^h(\delta )\rightarrow 0\) as \(\delta \rightarrow 0\) and the following property holds true.
If
1.
\((t_*,x_*)\in [0,T]\times \varSigma _{d}^h\), \(w_*\in \varSigma _d\), \(t_+>t_*\);
2.
the controls \(u_*\), \(v_*\) are chosen by rules (9) and (10), respectively;
3.
\(w_+=\zeta _1(t_+,t_+,t_*,w_*,v_*)\),
then for any \(v\in \mathcal {V}^h[t_*]\)
Proof
Denote the ith component of the vector \(x_*\) by \(x_{*i}\).
We have that
Further,
It follows from (12) that
From Lemma 1, it follows that
For simplicity, denote \(\zeta _*(t)=\zeta _1(t,t_+,t_*,w_*,v_*)\). We have that for each t there exists a probability measure \(\mu _t\) on U such that
Therefore,
Define
We have that \(\varrho (\delta )\rightarrow 0\), as \(\delta \rightarrow 0\). From (17), (19), and (20), it follows that
Using Lemma 1 one more time we get the inequality
The first term on the right-hand side of (22) can be transformed as follows. For simplicity, denote
Note that \(\widehat{Q}=(\widehat{Q}_{ij})_{i,j=1}^d\) is a Kolmogorov matrix. We have that
This and (22) yield the estimate
Substituting (17)–(21), (23) in (16), we get the inequality
Let \(\varUpsilon \) be a Lipschitz constant of the function \(y\mapsto yQ(t,y,u,v)\), i.e., for all \(y',y''\in \varSigma _d\), \(t\in [0,T]\), \(u\in U\), \(v\in V\),
$$\begin{aligned} \Vert y'Q(t,y',u,v)-y''Q(t,y'',u,v)\Vert \le \varUpsilon \Vert y'-y''\Vert . \end{aligned}$$
We have that
The choice of \(u_*\) and \(v_*\) together with the Isaacs condition (2) gives that for all \(u\in U\), \(v\in V\)
$$\begin{aligned} \langle x_*-w_*,x_*Q(t_*,x_*,u_*,v)-w_*Q(t_*,w_*,u,v_*)\rangle \le \varUpsilon \Vert x_*-w_*\Vert ^2. \end{aligned}$$
Consequently, we get the estimate
From this and (24), the conclusion of the Lemma follows for \(\beta =2\varUpsilon ,\ \ C=2d^2K, \ \ \varkappa ^h(\delta )=6d^3\alpha ^h(\delta )+\sqrt{2d}\varrho (\delta )\). \(\square \)
6 Near-Optimal Strategies
In this section, we prove Theorem 1 and Corollary 1.
Proof of Theorem 1
Let \(v\in \mathcal {V}^h[s]\) be a control of the second player. Consider a partition \(\varDelta =\{t_k\}_{k=0}^m\) of the time interval [s, T]. If \(x_0,x_1,\ldots ,x_m\) are vectors, \(x_0=y\), then denote by \(\hat{p}^h_r(x_1,\ldots ,x_r,\varDelta )\) the probability of the event \(\mathcal {X}_1^h[t_k,s,y,\hat{\mathfrak {u}},\varDelta ,v]=x_k\) for \(k=\overline{1,r}\). Define the vectors \(w_0,\ldots ,w_m\) recursively in the following way. Put
for \(k>0\) put
If \(w_0,\ldots ,w_m\) are defined by rules (25), (26) and \(r\in \overline{1,m}\), we set
In addition, put \(g_0(\varDelta )\triangleq y\).
Below we use the transformation \(G(\cdot ,\mathcal {X}^h_1[\cdot ,s,y,\hat{\mathfrak {u}},\varDelta ,v])\) of the stochastic process \(\mathcal {X}^h_1[\cdot ,s,y,\hat{\mathfrak {u}},\varDelta ,v]\) defined in the following way. If \(x_i\) are values of \(\mathcal {X}^h_1[t_i,s,y,\hat{\mathfrak {u}},\varDelta ,v]\), \(i=0,\ldots ,r\), and \((w_0,\ldots ,w_r)=g_r(x_0,\ldots ,x_{r-1},\varDelta )\), then we put
In general, the stochastic process \(G(\cdot ,\mathcal {X}^h_1[\cdot ,s,y,\hat{\mathfrak {u}},\varDelta ,v])\) is non-Markovian.
Further, if \(u_i=\hat{u}(t_i,x_i,w_i)\), \(i=0,\ldots ,r\), and
we write \(\varsigma _r(x_0,\ldots ,x_r,\varDelta )\triangleq u_r\).
We have that for any \(r\in \overline{1,m}\)
By Lemma 2, we have that
From this and (27), it follows that
Applying this inequality recursively, we get
Taking into account the equality \(x_0=y=g_0(\varDelta )\), we conclude that
Here we denote
Note that, for any h, \(\epsilon (h,\delta )\rightarrow Dh\) as \(\delta \rightarrow 0\). Using (29) and Jensen’s inequality, we get
By construction of control with guide strategy \(\hat{\mathfrak {u}}\), the following inequalities hold true
Hence,
Since \(\sigma \) is Lipschitz continuous with the constant R, we have that for any partition \(\varDelta \) and any second player’s control v
This and (30) yield the inequality
Passing to the limit as \(d(\varDelta )\rightarrow 0\) and taking into account the property \(\epsilon (h,\delta )\rightarrow Dh\) as \(\delta \rightarrow 0\), we obtain the first statement of the theorem.
Now let us prove the second statement of the theorem. Using Markov’s inequality and (29), we get
The Lipschitz continuity of the function \(\sigma \) and (31) yield the following inclusion
Finally, for any partition \(\varDelta \) and any second player’s control \(v\in \mathcal {V}^h[s]\), we have that
From this, the second statement of the theorem follows. \(\square \)
To prove Corollary 1, it suffices to replace the payoff function with \(-\sigma \) and interchange the players.
7 Example
In this section, we illustrate the theory developed above with a simulation study. Let \(d=2\), \(u,v\in [0,1]\), and \(\sigma (x_1,x_2)=\frac{1}{2}|x_1-x_2|\). In this case, the generator \(L_t^h[u,v]\) (see (3)) takes the form
whereas the limiting dynamics (7) takes the form
Using the equality \(x_1+x_2=1\), we can reduce system (33) to the ODE
The corresponding Hamiltonian for \(x_2\in [0,1]\) is equal to
Further, we replace the payoff function \(\sigma \) with the equivalent payoff function \(\tilde{\sigma }(x_2)=|1/2-x_2|\). The solution of the Hamilton–Jacobi PDE (8) with the Hamiltonian (35) and the boundary condition given by \(\tilde{\sigma }\) is
The controls \(u_*\) and \(v_*\) satisfying conditions (9) and (10) are
Let \(t_*,t_+\in [0,T]\), \(x_*\in [0,1]\), \(v_*\in [0,1]\). If \(x_*<1/2\), then put
if \(x_*\ge 1/2\) then put
Note that \(\tilde{\zeta }_1(\cdot ,t_+,t_*,x_*,v_*)\) is a solution of the differential inclusion
Moreover,
and \(\varphi (t_+,\tilde{\zeta }_1(t_+,t_+,t_*,x_*,v_*))\le \varphi (t_*,x_*)\). The function \(\tilde{\zeta }_1\) is the second coordinate of the function \(\zeta _1\) defined in Sect. 3. Thus, the control with guide strategy determined by (u1)–(u3) takes the form
\(\hat{\psi }_1(t_+,t,x_2,w)=\tilde{\zeta }_1(t_+,t_+,t,w,v_*)\), where
\(\hat{\chi }_1(s,y_2)=y_2\).
To simulate the Markov chain with the generator (32), we consider the discrete-time Markov chain defined for \(t_k=k\varDelta t\) with the transition probabilities
The state of the guide was computed analytically. The results of the simulation for \(N=32\), \(t_0=0\), \(x_0=(7/8,1/8)\), \(T=1\), \(\varDelta t=10^{-3}\),
and 100 trials are presented in Figs. 1 and 2.
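A possible implementation of this discrete-time scheme is sketched below; our reading is that, on each step of length \(\varDelta t\), the chain jumps from x to \(x-he^i+he^j\) with probability \(\frac{1}{h}x_iQ_{ij}(t,x,u,v)\varDelta t\). The matrix of the example enters through the callable `Q`, and the feedbacks \(u_*\), \(v_*\) through `u_fun`, `v_fun`; the names are our assumptions.

```python
import numpy as np

def simulate_discrete_chain(Q, u_fun, v_fun, x0, T, dt, h, rng):
    """Discrete-time approximation of the controlled chain: on each step,
    jump x -> x - h*e^i + h*e^j with probability (x_i/h)*Q_ij(t,x,u,v)*dt;
    dt must be small enough that these probabilities sum to at most one."""
    x, t = np.asarray(x0, dtype=float).copy(), 0.0
    d = x.size
    while t < T:
        u, v = u_fun(t, x), v_fun(t, x)
        probs = x[:, None] * Q(t, x, u, v) * dt / h
        np.fill_diagonal(probs, 0.0)
        stay = 1.0 - probs.sum()                  # probability of no jump
        k = rng.choice(d * d + 1, p=np.append(probs.ravel(), stay))
        if k < d * d:
            i, j = divmod(k, d)
            x[i] -= h
            x[j] += h
        t += dt
    return x

# e.g.: rng = np.random.default_rng(0); h = 1/32; dt = 1e-3; T = 1.0
```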
Note that the value of the game is 0.0017119, whereas the average distance between the state of the Markov chain and 1/2 is 0.08375, and the average distance between the state of the guide and 1/2 is 0.0295650. Moreover, the standard deviation of the distance between the state of the Markov chain and the state of the guide is 0.0074127.
The results of the simulation of the Markov chain with \(N=128\) particles are presented in Figs. 3 and 4. In this case, the average distance between the state of the Markov chain and 1/2 is 0.0678906, the average distance between the state of the guide and 1/2 is 0.0379969, and the standard deviation of the distance between the state of the Markov chain and the state of the guide is 0.0017119. Note that the ratio of the standard deviations is less than the ratio of the numbers of particles. This result corresponds to estimate (29).
8 Conclusion
A continuous-time Markov game describing an interacting particle system is considered. A near-optimal guaranteeing strategy in the above Markov game is designed on the basis of an auxiliary deterministic game derived from the Markov one. The extremal shift rule proposed by Krasovskii and Subbotin is used for the design. The above strategy uses a model of the Markov game defined by deterministic relations, called the guide. However, it receives state estimates of the Markov game as input. Thus, the strategy used in the paper is a stochastic strategy with memory. The question whether the auxiliary game has a purely feedback strategy that is near optimal in the Markov game remains open.
The study in this paper is restricted to Markov games describing interacting particle systems. The extension of the obtained results to general Markov games is a subject of future work.
References
Averboukh Y (2015) Universal Nash equilibrium strategies for differential games. J Dyn Control Syst 21:329–350
Bardi M, Capuzzo Dolcetta I (1997) Optimal control and viscosity solutions of Hamilton–Jacobi–Bellman equations. Birkhäuser, Basel
Benaïm M, Le Boudec J-Y (2008) A class of mean field interaction models for computer and communication systems. Perform Eval 65:823–838
Darling RWR, Norris JR (2010) Differential equation approximations for Markov chains. Probab Surv 5:37–79
Fleming WH, Soner HM (2006) Controlled Markov processes and viscosity solutions. Springer, New York
Gast N, Gaujal B, Le Boudec J-Y (2010) Mean field for Markov decision processes: from discrete to continuous optimization. INRIA report no. 7239
Kolokoltsov VN (2010) Nonlinear Markov process and kinetic equations. Cambridge University Press, Cambridge
Kolokoltsov VN (2013) Nonlinear Markov games on a finite state space (mean-field and binary interactions). Int J Stat Probab 1:77–91
Krasovskii NN, Kotelnikova AN (2009) Unification of differential games, generalized solutions of the Hamilton–Jacobi equations, and a stochastic guide. Differ Equ 45:1653–1668
Krasovskii NN, Kotelnikova AN (2010) An approach-evasion differential game: stochastic guide. Proc Steklov Inst Math 269:S191–S213
Krasovskii NN, Kotelnikova AN (2010) On a differential interception game. Proc Steklov Inst Math 268:161–206
Krasovskii NN, Kotelnikova AN (2012) Stochastic guide for a time-delay object in a positional differential game. Proc Steklov Inst Math 277:S145–S151
Krasovskii NN, Subbotin AI (1988) Game-theoretical control problems. Springer, New York
Kriazhimskii AV (1978) On stable position control in differential games. J Appl Math Mech 42(6):1055–1060
Levy Y (2013) Continuous-time stochastic games of fixed duration. Dyn Games Appl 3:279–312
Lukoyanov NYu, Plaksin AR (2013) Finite-dimensional modeling guides in time-delay systems. Trudy Inst Mat i Mekh UrO RAN 19:182–195 (in Russian)
Neyman A (2012) Continuous-time stochastic games. DP #616. Center for the Study of Rationality, Hebrew University, Jerusalem
Subbotin AI (1995) Generalized solutions of first-order PDEs: the dynamical optimization perspective. Birkhäuser, Boston
Subbotin AI, Chentsov AG (1981) Optimization of guarantee in control problems. Nauka, Moscow (in Russian)
Zachrisson LE (1964) Markov games. In: Dresher M, Shapley LS, Tucker AW (eds) Advances in game theory. Princeton University Press, Princeton, pp 211–253
Acknowledgments
The author would like to thank Vassili Kolokoltsov for insightful discussions and the anonymous reviewer for the valuable comments. The research was supported by Russian Foundation for Basic Research (Project N15-01-07909).