
6.1 Introduction

This chapter studies continuous-time Markov decision processes (MDPs) and continuous-time zero-sum stochastic dynamic games. In the continuous-time setup, although the infinite horizon case has been well studied, the literature on the finite horizon case is sparse. Infinite horizon continuous-time Markov decision processes have been studied by many authors (e.g. see [5] and the references therein). In the finite horizon case, Pliska [7] used a semigroup approach to characterise the value function and the optimal control, but his approach yields only existential results. In this chapter, we show that the value function is a smooth solution of an appropriate dynamic programming equation. Our method of proof also yields algorithms for computing the value function and an optimal control.

The situation is analogous for continuous-time stochastic dynamic Markov games. Here too, the infinite horizon case has been studied in the literature [6], but to our knowledge the finite horizon case has not. In this chapter, we prove that the value of the finite horizon game exists and is the solution of an appropriate Isaacs equation. This leads to the existence of a saddle point equilibrium.

The rest of our chapter is structured as follows. In Sect. 6.2 we analyse the finite horizon continuous-time MDP. Section 6.3 deals with zero-sum stochastic dynamic games. We conclude our chapter in Sect. 6.4 with a few remarks.

6.2 Finite Horizon Continuous-Time MDP

Throughout this chapter the time horizon is \(T\). The control model we consider is given by

$$\{X,U,(\lambda (t,x,u),t \in [0,T],x \in X,u \in U),Q(t,x,u,\mathrm{d}z),c(t,x,u)\}$$

where each element is described below.

The state space \(X\). The state space \(X\) is the set of states of the process under observation, and it is assumed to be a Polish space.

The action space \(U\). The decision-maker dynamically takes his action from the action space \(U\). We assume that \(U\) is a compact metric space.

The instantaneous transition rate \(\lambda \). \(\lambda : [0,T] \times X \times U \rightarrow [0,\infty )\) is a given function satisfying the following assumption:

  1. (A1)

    \(\lambda \) is continuous and there exists a finite constant \(M\) such that

    $$\sup\limits_{t,x,u}\lambda (t,x,u) \leq M.$$

The transition probability kernel \(Q\). For fixed \(t \in [0,T]\), \(x \in X\), \(u \in U\), \(Q(t,x,u,\cdot )\) is a probability measure on \(X\) with \(Q(t,x,u,\{x\}) = 0\). \(Q\) satisfies the following:

  2. (A2)

    \(Q\) is weakly continuous, i.e. if \({x}_{n} \rightarrow x\), \({t}_{n} \rightarrow t\), \({u}_{n} \rightarrow u\), then for any \(f \in {C}_{b}(X)\)

    $${\int \nolimits \nolimits }_{X}f(z)Q({t}_{n},{x}_{n},{u}_{n},\mathrm{d}z) \rightarrow {\int \nolimits \nolimits }_{X}f(z)Q(t,x,u,\mathrm{d}z).$$

The cost rate \(c\). \(c : [0,T] \times X \times U \rightarrow [0,\infty )\) is a given function satisfying the following assumption:

  3. (A3)

    \(c\) is continuous and there exists a finite constant \(\tilde{C}\) such that

    $$\sup\limits_{t,x,u}c(t,x,u) \leq \tilde{ C}.$$

Next we give an informal description of the evolution of the controlled system. Suppose that the system is in state \(x\) at time \(t \in [0,T)\) and the controller (the decision-maker) takes an action \(u \in U\). Then the following happens on the time interval \([t,t + \mathrm{d}t]\) (a simulation sketch follows the list below):

  1.

    The decision-maker pays an infinitesimal cost \(c(t,x,u)\,\mathrm{d}t\), and

  2.

    A transition from state \(x\) to a set \(A\) (not containing \(x\)) occurs with probability

    $$\lambda (t,x,u)\mathrm{d}t{\int \nolimits \nolimits }_{A}Q(t,x,u,\mathrm{d}z) + o(\mathrm{d}t);$$

    or the system remains in state \(x\) with probability

    $$1 - \lambda (t,x,u)\mathrm{d}t + o(\mathrm{d}t).$$
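
To make this description concrete, the following is a minimal simulation sketch for a hypothetical finite model; the state space, the rate \(\lambda \), the kernel \(Q\), the cost \(c\), the terminal cost \(g\) and the Markov control `policy` are all illustrative assumptions, not data from this chapter. Since \(\lambda \leq M\) by (A1), candidate jump times can be drawn from a rate-\(M\) Poisson clock and accepted with probability \(\lambda /M\) (thinning).

```python
# A minimal simulation sketch of the controlled jump process described above,
# for a hypothetical finite model.  The state space {0, ..., n-1} and the
# functions lam, Q, c, g and `policy` below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, T, M = 3, 1.0, 2.0                                  # states, horizon, rate bound M from (A1)

def lam(t, x, u):                                      # jump rate, bounded by M
    return 1.0 + 0.5 * u * np.sin(t)

def Q(t, x, u):                                        # transition law with Q(t, x, u, {x}) = 0
    p = np.ones(n) / (n - 1)
    p[x] = 0.0
    return p

def c(t, x, u):                                        # running cost rate (A3)
    return x + 0.1 * u

def g(x):                                              # terminal cost
    return float(x)

def policy(t, x):                                      # a Markov control u(t, x)
    return 0 if t < T / 2 else 1

def run_cost(t0, x0):
    """One trajectory on [t0, T]: candidate jump times come from a rate-M
    Poisson clock and are accepted with probability lam/M (thinning, valid
    since lam <= M); the running cost is accumulated along the way."""
    t, x, cost = t0, x0, 0.0
    while t < T:
        t_next = min(t + rng.exponential(1.0 / M), T)  # next candidate jump time
        cost += c(0.5 * (t + t_next), x, policy(t, x)) * (t_next - t)   # midpoint rule
        if t_next < T:
            u = policy(t_next, x)
            if rng.random() < lam(t_next, x, u) / M:   # accept the candidate jump
                x = rng.choice(n, p=Q(t_next, x, u))
        t = t_next
    return cost + g(x)

# Monte Carlo estimate of the expected total cost from (0, 0) under `policy`
print(np.mean([run_cost(0.0, 0) for _ in range(2000)]))
```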

Now we describe the optimal control problem. To this end, we first describe the set of admissible controls. Let

$$\bf{u} : [0,T] \times X \rightarrow U$$

be a measurable function. Let \(\mathcal{U}\) denote the set of all such measurable functions; this is the set of admissible controls. Such controls are called Markov controls. For each \(\bf{u} \in \mathcal{U}\), it can be shown that there exists a strong Markov process \(\{{X}_{t}\}\) (see [1, 3]) having the generator

$$\begin{array}{rcl}{ \mathcal{A}}_{t}^{\bf{u}}f(x) = -\lambda (t,x,\bf{u}(t,x))f(x) + \lambda (t,x,\bf{u}(t,x)){ \int \nolimits \nolimits }_{X}f(z)Q(t,x,\bf{u}(t,x),\mathrm{d}z)& & \\ \end{array}$$

where \(f\) is a bounded measurable function (a finite-state sketch of this generator is given below).
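
For a finite state space the generator above reduces to a matrix-vector operation; the sketch below (with assumed, illustrative \(\lambda \) and \(Q\)) evaluates \({\mathcal{A}}_{t}^{\bf{u}}f(x)\) directly from its definition.

```python
# A sketch of the generator acting on a bounded function f, for a
# hypothetical finite state space {0, ..., n-1}; lam, Q and the chosen
# control value are illustrative assumptions.
import numpy as np

n = 3

def lam(t, x, u):                        # bounded jump rate (A1)
    return 1.0 + 0.5 * u

def Q(t, x, u):                          # Q(t, x, u, .) as a probability vector with Q({x}) = 0
    p = np.ones(n) / (n - 1)
    p[x] = 0.0
    return p

def generator(f, t, x, u):
    """A_t^u f(x) = -lam(t,x,u) f(x) + lam(t,x,u) * sum_z f(z) Q(t,x,u,dz)."""
    return lam(t, x, u) * (Q(t, x, u) @ f - f[x])

f = np.array([0.0, 1.0, 4.0])            # a bounded function on the state space
print([generator(f, 0.0, x, 1) for x in range(n)])
```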

For each \(\bf{u} \in \mathcal{U}\), define

$$\begin{array}{rcl}{ V }^{\bf{u}}(t,x) = {\mathbb{E}}_{ t,x}^{\bf{u}}\left [{\int \nolimits \nolimits }_{t}^{T}c(s,{X}_{ s},\bf{u}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ]& &\end{array}$$
(6.1)

where \(g : X \rightarrow {\mathbb{R}}_{+}\) is the terminal cost function, assumed to be bounded and continuous, and \({\mathbb{E}}_{t,x}^{\bf{u}}\) is the expectation operator under the control \(\bf{u}\) with initial condition \({X}_{t} = x\). The aim of the controller is to minimise \({V }^{\bf{u}}\) over all \(\bf{u} \in \mathcal{U}\). Define

$$\begin{array}{rcl} V (t,x) =\inf\limits_{\bf{u}\in \mathcal{U}}{\mathbb{E}}_{t,x}^{\bf{u}}\left [{\int \nolimits \nolimits }_{t}^{T}c(s,{X}_{ s},\bf{u}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ]\,.& &\end{array}$$
(6.2)

The function \(V\) is called the value function. If \({\bf{u}}^{{_\ast}}\in \mathcal{U}\) satisfies

$${V }^{{\bf{u}}^{{_\ast}} }(t,x) = V (t,x)\quad \forall (t,x),$$

then \({\bf{u}}^{{_\ast}}\) is called an optimal control.

The associated dynamic programming equation is

$$\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} \frac{\mathrm{d}\varphi } {\mathrm{d}t} (t,x) +\inf\limits_{u\in U}[c(t,x,u)-\lambda (t,x,u)\varphi (t,x)\quad \\ +\lambda (t,x,u){\int \nolimits \nolimits }_{X}\varphi (t,z)Q(t,x,u,\mathrm{d}z)] = 0 \quad \\ \mbox{ on}\;X \times [0,T)\quad \mbox{ and} \quad \\ \varphi (T,x) = g(x). \quad \end{array} \right.& &\end{array}$$
(6.3)

The importance of (6.3) is illustrated by the following verification theorem.

Theorem 6.2.1

If (6.3) has a solution \(\varphi \) in \({C}_{b}^{1,0}([0,T] \times X)\), then \(\varphi = V\), the value function. Moreover, if \({\bf{u}}^{{_\ast}}\in \mathcal{U}\) is such that, for every \((t,x)\),

$$\begin{array}{rcl} & & \left [c(t,x,{\bf{u}}^{{_\ast}}(t,x)) - \lambda (t,x,{\bf{u}}^{{_\ast}}(t,x))\varphi (t,x) + \lambda (t,x,{\bf{u}}^{{_\ast}}(t,x)){ \int \nolimits \nolimits }_{X}\varphi (t,z)Q(t,x,{\bf{u}}^{{_\ast}}(t,x),\mathrm{d}z)\right ] \\ & & \qquad =\inf\limits_{u\in U}\left [c(t,x,u) - \lambda (t,x,u)\varphi (t,x) + \lambda (t,x,u){ \int \nolimits \nolimits }_{X}\varphi (t,z)Q(t,x,u,\mathrm{d}z)\right ], \end{array}$$
(6.4)

then \({\bf{u}}^{{_\ast}}\) is an optimal control.

Proof.

Applying the Itô-Dynkin formula to the solution \(\varphi \) of (6.3) and noting that, for any \(\bf{u} \in \mathcal{U}\), the infimum in (6.3) gives \(\frac{\mathrm{d}\varphi }{\mathrm{d}t}(s,x) + c(s,x,\bf{u}(s,x)) - \lambda (s,x,\bf{u}(s,x))\varphi (s,x) + \lambda (s,x,\bf{u}(s,x)){\int \nolimits \nolimits }_{X}\varphi (s,z)Q(s,x,\bf{u}(s,x),\mathrm{d}z) \geq 0\), we obtain

$$\varphi (t,x) \leq \inf\limits_{\bf{u}\in \mathcal{U}}{\mathbb{E}}_{t,x}^{\bf{u}}\left [{\int \nolimits \nolimits }_{t}^{T}c(s,{X}_{ s},\bf{u}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ]\,.$$

For \(\bf{u} ={ \bf{u}}^{{_\ast}}\) as in the statement of the theorem, we get the equality

$$\varphi (t,x) = {\mathbb{E}}_{t,x}^{{\bf{u}}^{{_\ast}} }\left [{\int \nolimits \nolimits }_{t}^{T}c(s,{X}_{ s},{\bf{u}}^{{_\ast}}(s,{X}_{ s}))\mathrm{d}s + g({X}_{T})\right ]\,.$$

The existence of such a \({\bf{u}}^{{_\ast}}\) follows by a standard measurable selection theorem [2]. □ 

In view of the above theorem, it suffices to show that (6.3) has a solution in \({C}_{b}^{1,0}([0,T] \times X)\).

Theorem 6.2.2

Under \((A1)\)–\((A3)\), the dynamic programming equation (6.3) has a unique solution in \({C}_{b}^{1,0}([0,T] \times X)\).

Proof.

Let \(\varphi (t,x) ={ \mathrm{e}}^{-\gamma t}\psi (t,x)\) for some constant \(\gamma > 0\) to be chosen later. Then from (6.3) we get

$$\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} {\mathrm{e}}^{-\gamma t}\frac{\mathrm{d}\psi } {\mathrm{d}t} (t,x) - \gamma {\mathrm{e}}^{-\gamma t}\psi (t,x) +\inf\limits_{ u\in U}[c(t,x,u) - \lambda (t,x,u){\mathrm{e}}^{-\gamma t}\psi (t,x)\quad \\ +\lambda (t,x,u){\int \nolimits \nolimits }_{X}{\mathrm{e}}^{-\gamma t}\psi (t,z)Q(t,x,u,\mathrm{d}z)] = 0 \quad \\ \mbox{ on}\quad X \times [0,T)\quad \mbox{ and} \quad \\ \psi (T,x) ={ \mathrm{e}}^{\gamma T}g(x). \quad \end{array} \right.& & \\ \end{array}$$

Thus (6.3) has a solution if and only if

$$\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} \frac{\mathrm{d}\psi } {\mathrm{d}t} (t,x) - \gamma \psi (t,x) +\inf\limits_{u\in U}[{\mathrm{e}}^{\gamma t}c(t,x,u) - \lambda (t,x,u)\psi (t,x)\quad \\ +\lambda (t,x,u){\int \nolimits \nolimits }_{X}\psi (t,z)Q(t,x,u,\mathrm{d}z)] = 0 \quad \\ \mbox{ on}\quad X \times [0,T)\quad \mbox{ and} \quad \\ \psi (T,x) ={ \mathrm{e}}^{\gamma T}g(x) \quad \end{array} \right.& & \\ \end{array}$$

has a solution. The above differential equation is equivalent to the following integral equation:

$$\begin{array}{rcl} \psi (t,x)& ={ \mathrm{e}}^{\gamma t}g(x) +{ \mathrm{e}}^{\gamma t}{ \int \nolimits \nolimits }_{t}^{T}\mathrm{e}{}^{-\gamma s}\inf\limits_{u\in U}\left[{\mathrm{e}}^{\gamma s}c(s,x,u) - \lambda (s,x,u)\psi (s,x) \right.\\ &\left. \quad + \lambda (s,x,u){\int \nolimits \nolimits }_{X}\psi (s,z)Q(s,x,u,\mathrm{d}z) \right]\mathrm{d}s\,.\end{array}$$

Let \({C}_{b}^{\mathrm{unif}}([0,T] \times X)\) be the space of bounded continuous functions \(\varphi \) on \([0,T] \times X\) with the additional property that given \(\varepsilon > 0\) there exists \(\delta > 0\) such that

$$\sup\limits_{x}\vert \varphi (t + h,x) - \varphi (t,x)\vert < \varepsilon \quad \mbox{ whenever}\quad \vert h\vert < \delta.$$

Suppose \({\varphi }_{n} \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\) and \({\varphi }_{n} \rightarrow \varphi \) uniformly. Then

$$\begin{array}{rcl} \vert \varphi (t + h,x) - \varphi (t,x)\vert & \leq & \vert \varphi (t + h,x) - {\varphi }_{n}(t + h,x)\vert + \vert {\varphi }_{n}(t + h,x) - {\varphi }_{n}(t,x)\vert \\ & &\quad + \vert {\varphi }_{n}(t,x) - \varphi (t,x)\vert \,.\end{array}$$

Given \(\varepsilon > 0\), there exists \({n}_{0}\) such that \(\sup\limits_{t,x}\vert {\varphi }_{{n}_{0}}(t,x) - \varphi (t,x)\vert < \frac{\varepsilon } {3}\), and for this \({n}_{0}\), there exists \(\delta > 0\) such that \(\sup\limits_{x}\vert {\varphi }_{{n}_{0}}(t + h,x) - {\varphi }_{{n}_{0}}(t,x)\vert < \frac{\varepsilon } {3}\) whenever \(\vert h\vert < \delta \). Putting \(n = {n}_{0}\), we get from the above inequality

$$\sup\limits_{x}\vert \varphi (t + h,x) - \varphi (t,x)\vert < \varepsilon \quad \mbox{ whenever}\quad \vert h\vert < \delta \,.$$

Thus \({C}_{b}^{\mathrm{unif}}([0,T] \times X)\) is a closed subspace of \({C}_{b}([0,T] \times X)\), and hence it is a Banach space.

Now for \(\varphi \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\), it follows from the assumption on \(Q\) that \({\int \nolimits \nolimits }_{X}\varphi (t,z)Q(t,x,u,\mathrm{d}z)\) is continuous in \(t,x\) and \(u\). Define

$$\mathcal{T} : {C}_{b}^{\mathrm{unif}}([0,T] \times X) \rightarrow {C}_{ b}^{\mathrm{unif}}([0,T] \times X)\quad \mbox{ by}$$
$$\begin{array}{rcl} \mathcal{T} \psi (t,x)& =&{ \mathrm{e}}^{\gamma t}g(x) +{ \mathrm{e}}^{\gamma t}{ \int \nolimits \nolimits }_{t}^{T}\mathrm{e}{}^{-\gamma s}\inf\limits_{ u\in U}\left[{\mathrm{e}}^{\gamma s}c(s,x,u) - \lambda (s,x,u)\psi (s,x) \right.\\& &\left. +\lambda (s,x,u){\int \nolimits \nolimits }_{X}\psi (s,z)Q(s,x,u,\mathrm{d}z) \right]\mathrm{d}s\,.\end{array}$$

For \({\psi }_{1},{\psi }_{2} \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\), we have

$$\begin{array}{rcl} \vert \mathcal{T} {\psi }_{1}(t,x) -\mathcal{T} {\psi }_{2}(t,x)\vert & \leq &{ \mathrm{e}}^{\gamma t}{ \int \nolimits \nolimits }_{t}^{T}{\mathrm{e}}^{-\gamma s}2M\vert \vert {\psi }_{ 1} - {\psi }_{2}\vert \vert \mathrm{d}s \\ & =& \frac{2M} {\gamma }{ \mathrm{e}}^{\gamma t}[{\mathrm{e}}^{-\gamma t} -{\mathrm{e}}^{-\gamma T}]\vert \vert {\psi }_{ 1} - {\psi }_{2}\vert \vert \\ & =& \frac{2M} {\gamma } [1 -{\mathrm{e}}^{-\gamma (T-t)}]\vert \vert {\psi }_{ 1} - {\psi }_{2}\vert \vert \\ &\leq & \frac{2M} {\gamma } \vert \vert {\psi }_{1} - {\psi }_{2}\vert \vert.\end{array}$$

Thus if we choose \(\gamma = 2M + 1\), then \(\mathcal{T}\) is a contraction with modulus \(\frac{2M}{2M+1} < 1\) and hence has a unique fixed point \({\psi }^{{_\ast}}\). Then \(\varphi (t,x) ={ \mathrm{e}}^{-(2M+1)t}{\psi }^{{_\ast}}(t,x)\) is the unique solution of (6.3). \(\square \)
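
The proof above is constructive, and for a hypothetical model with finitely many states and actions the value function and a minimising control can be computed numerically. The sketch below is one simple realisation: an explicit backward Euler discretisation of (6.3), with the minimiser recorded as in (6.4). The model data \(\lambda \), \(Q\), \(c\), \(g\) and the time grid are illustrative assumptions, and the scheme is a discretisation in the spirit of, not identical to, the contraction iteration used in the proof.

```python
# A sketch of computing the value function and a minimising Markov control
# for a hypothetical finite model (states 0..n-1, actions 0..m-1) by an
# explicit backward Euler discretisation of (6.3); the minimiser at each
# step is recorded as in (6.4).  All model data and the grid are assumptions.
import numpy as np

n, m, T, K = 3, 2, 1.0, 200                       # states, actions, horizon, time steps
dt = T / K

def lam(t, x, u):                                 # bounded jump rate (A1)
    return 1.0 + 0.5 * u

def Q(t, x, u):                                   # kernel with Q(t, x, u, {x}) = 0
    p = np.ones(n) / (n - 1)
    p[x] = 0.0
    return p

def c(t, x, u):                                   # bounded cost rate (A3)
    return x + 0.1 * u

g = np.arange(n, dtype=float)                     # terminal cost g(x)

V = g.copy()                                      # V(T, .) = g
u_star = np.zeros((K, n), dtype=int)              # candidate optimal control u*(t_k, x)
for k in reversed(range(K)):                      # march backward in time
    t = (k + 1) * dt
    V_new = np.empty(n)
    for x in range(n):
        vals = [c(t, x, u) - lam(t, x, u) * V[x] + lam(t, x, u) * Q(t, x, u) @ V
                for u in range(m)]
        u_star[k, x] = int(np.argmin(vals))       # minimiser as in (6.4)
        V_new[x] = V[x] + dt * min(vals)          # explicit Euler step for (6.3)
    V = V_new

print("V(0, .) =", V)
print("u*(0, .) =", u_star[0])
```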

6.3 Zero-Sum Stochastic Game

In this section, we consider a zero-sum stochastic game. The control model we consider here is given by

$$\{X,U,V,(\lambda (t,x,u,v),t \in [0,T],x \in X,u \in U,v \in V),Q(t,x,u,v,\mathrm{d}z),r(t,x,u,v)\}\,$$

where \(X\) is the state space as before; \(U\) and \(V\) are the action spaces for player I and player II, respectively; \(\lambda \) and \(Q\) denote the rate and transition kernel, respectively, which now depend on the additional parameter \(v\); and \(r\) is the reward rate. The dynamics of the game are similar to those of the MDP in the previous section, with appropriate modifications. Player I receives a payoff from player II; player I aims to maximise this payoff, while player II seeks to minimise it.

Now we describe the strategies of the players. In order to solve the problem, we will need to consider Markov relaxed strategies. We denote the space of strategies of player I by \(\mathcal{U}\) and that of player II by \(\mathcal{V}\), where

$$\mathcal{U} =\{ \bf{u}\,\vert \,\bf{u} : [0,T] \times X \rightarrow \mathcal{P}(U)\;\mbox{ measurable}\}\,,$$
$$\mathcal{V} =\{ \mathbf{v}\,\vert \,\mathbf{v} : [0,T] \times X \rightarrow \mathcal{P}(V )\;\mbox{ measurable}\}\,.$$

Now corresponding to \(\lambda \), \(Q\) and \(r\), define

$$\tilde{\lambda }(t,x,\mu,\nu ) ={ \int \nolimits \nolimits }_{V }{ \int \nolimits \nolimits }_{U}\lambda (t,x,u,v)\mu (\mathrm{d}u)\nu (\mathrm{d}v),$$
$$\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z) ={ \int \nolimits \nolimits }_{V }{ \int \nolimits \nolimits }_{U}Q(t,x,u,v,\mathrm{d}z)\mu (\mathrm{d}u)\nu (\mathrm{d}v),$$
$$\tilde{r}(t,x,\mu,\nu ) ={ \int \nolimits \nolimits }_{V }{ \int \nolimits \nolimits }_{U}r(t,x,u,v)\mu (\mathrm{d}u)\nu (\mathrm{d}v),$$

where \(\mu \in \mathcal{P}(U)\) and \(\nu \in \mathcal{P}(V )\). As in the previous section, we make the following assumptions:

  1. (A1′)

    \(\lambda \) is continuous and there exists a finite constant \(M\) such that

    $$\sup\limits_{t,x,u,v}\lambda (t,x,u,v) \leq M.$$
  2. (A2′)

    \(Q\) is weakly continuous, i.e. if \({x}_{n} \rightarrow x\), \({t}_{n} \rightarrow t\), \({u}_{n} \rightarrow u\) and \({v}_{n} \rightarrow v\), then for any \(f \in {C}_{b}(X)\)

    $${\int \nolimits \nolimits }_{X}f(z)Q({t}_{n},{x}_{n},{u}_{n},{v}_{n},\mathrm{d}z) \rightarrow {\int \nolimits \nolimits }_{X}f(z)Q(t,x,u,v,\mathrm{d}z).$$
  3. (A3′)

    \(r\) is continuous and there exists a finite constant \(\tilde{C}\) such that

    $$\sup\limits_{t,x,u,v}r(t,x,u,v) \leq \tilde{ C}.$$

If the players use strategies \((\bf{u},\mathbf{v}) \in \mathcal{U}\times \mathcal{V}\), then the expected payoff to player I is given by

$${\mathbb{E}}_{t,x}^{\bf{u},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{ s},\bf{u}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ]$$

where \(g\) is the terminal reward function which is assumed to be bounded and continuous. Now we define the upper and lower values for our game. Define

$$\begin{array}{rcl} \overline{V }(t,x) =\inf\limits_{\mathbf{v}\in \mathcal{V}}\sup\limits_{\bf{u}\in \mathcal{U}}{\mathbb{E}}_{t,x}^{\bf{u},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{ s},\bf{u}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ]\,.& & \\ \end{array}$$

Also define

$$\begin{array}{rcl} \underline{V}(t,x) =\sup\limits_{\bf{u}\in \mathcal{U}}\inf\limits_{\mathbf{v}\in \mathcal{V}}{\mathbb{E}}_{t,x}^{\bf{u},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{ s},\bf{u}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ].& & \\ \end{array}$$

The function \(\overline{V }\) is called the upper value function of the game, and \(\underline{V}\) is called the lower value function of the game. In the game, player I tries to maximise his payoff and player II tries to minimise the payoff of player I. Thus \(\underline{V}\) is the largest payoff that player I can guarantee for himself, and \(\overline{V }\) is the smallest loss that player II can guarantee not to exceed. In general \(\underline{V} \leq \overline{V }\). If \(\overline{V }(t,x) = \underline{V}(t,x)\), then the game is said to have a value. A strategy \({\bf{u}}^{{_\ast}}\) is said to be an optimal strategy for player I if

$$\begin{array}{rcl}{ \mathbb{E}}_{t,x}^{{\bf{u}}^{{_\ast}},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{ s},{\bf{u}}^{{_\ast}}(s,{X}_{ s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ] \geq \overline{V }(t,x)& & \\ \end{array}$$

for any \(t,x,\mathbf{v}\).

Similarly, \({\mathbf{v}}^{{_\ast}}\) is called an optimal strategy for player II if

$$\begin{array}{rcl}{ \mathbb{E}}_{t,x}^{\bf{u},{\mathbf{v}}^{{_\ast}} }\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{ s},\bf{u}(s,{X}_{s}),{\mathbf{v}}^{{_\ast}}(s,{X}_{ s}))\mathrm{d}s + g({X}_{T})\right ] \leq \underline{V}(t,x)& & \\ \end{array}$$

for any \(t,x,\bf{u}\). Such a pair \(({\bf{u}}^{{_\ast}},{\mathbf{v}}^{{_\ast}})\), if it exists, is called a saddle point equilibrium. Our aim is to find the value of the game and to find optimal strategies for both the players. To this end, consider the following pair of Isaacs equations:

$$\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} \frac{\mathrm{d}\varphi } {\mathrm{d}t} (t,x) +\inf\limits_{\nu \in \mathcal{P}(V )}\sup\limits_{\mu \in \mathcal{P}(U)}\bigl [\tilde{r}(t,x,\mu,\nu ) -\tilde{ \lambda }(t,x,\mu,\nu )\varphi (t,x)\quad \\ +\tilde{\lambda }(t,x,\mu,\nu ){\int \nolimits \nolimits }_{X}\varphi (t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\bigr ] = 0 \quad \\ \mbox{ on}\;X \times [0,T)\quad \mbox{ and} \quad \\ \varphi (T,x) = g(x). \quad \end{array} \right.& &\end{array}$$
(6.5)
$$\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} \frac{\mathrm{d}\psi } {\mathrm{d}t} (t,x) +\sup\limits_{\mu \in \mathcal{P}(U)}\inf\limits_{\nu \in \mathcal{P}(V )}\bigl [\tilde{r}(t,x,\mu,\nu ) -\tilde{ \lambda }(t,x,\mu,\nu )\psi (t,x)\quad \\ +\tilde{\lambda }(t,x,\mu,\nu ){\int \nolimits \nolimits }_{X}\psi (t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\bigr ] = 0 \quad \\ \mbox{ on}\;X \times [0,T)\quad \mbox{ and} \quad \\ \psi (T,x) = g(x). \quad \end{array} \right.& &\end{array}$$
(6.6)

By Fan’s minimax theorem [4], if \(\varphi \in {C}_{b}^{1,0}([0,T] \times X)\) is a solution of (6.5), then it is also a solution of (6.6), and vice versa. The importance of the Isaacs equations is illustrated by the following theorem.
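
When \(U\) and \(V\) are finite, the bracketed expression in (6.5)-(6.6) at a fixed \((t,x)\) is bilinear in \((\mu,\nu )\), so the inner \(\inf \sup \) (equivalently \(\sup \inf \), by Fan's minimax theorem) is simply the value of a finite matrix game. The sketch below computes that value by linear programming; the payoff matrix `A` is an assumed stand-in for the bracketed term evaluated at pure action pairs \((u_i, v_j)\), and the matching-pennies example at the end is only a sanity check.

```python
# A sketch of evaluating the inner inf/sup of (6.5)-(6.6) at a fixed (t, x)
# when U and V are finite: the bracketed term is bilinear in (mu, nu), so its
# inf/sup value is the value of a finite matrix game, computable by linear
# programming.  The payoff matrix A below is an assumed stand-in for that
# bracketed term evaluated at pure action pairs (u_i, v_j).
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(A):
    """Value and an optimal mixed strategy mu* of the maximiser for payoff A[i, j]."""
    m, k = A.shape
    # variables (mu_1, ..., mu_m, w): maximise w  <=>  minimise -w
    cost = np.zeros(m + 1)
    cost[-1] = -1.0
    A_ub = np.hstack([-A.T, np.ones((k, 1))])     # w <= sum_i A[i, j] mu_i for every column j
    b_ub = np.zeros(k)
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0                             # mu is a probability vector
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]     # mu >= 0, w free
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

A = np.array([[1.0, -1.0], [-1.0, 1.0]])          # matching pennies, as a sanity check
value, mu_star = matrix_game_value(A)
print(value, mu_star)                             # approximately 0.0 and [0.5, 0.5]
```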

Theorem 6.3.1

Let \({\varphi }^{{_\ast}}\in {C}_{b}^{1,0}([0,T] \times X)\) be a solution of (6.5) and (6.6). Then

  1. (i)

    \({\varphi }^{{_\ast}}\) is the value of the game.

  2. (ii)

    Let \(({\bf{u}}^{{_\ast}},{\mathbf{v}}^{{_\ast}}) \in \mathcal{U}\times \mathcal{V}\) be such that

    $$ \begin{array}{rcl} & \inf\limits_{\nu \in \mathcal{P}(V )}\left[\tilde{r}(t,x,{\bf{u}}^{{_\ast}}(t,x),\nu ) -\tilde{ \lambda }(t,x,{\bf{u}}^{{_\ast}}(t,x),\nu ){\varphi }^{{_\ast}}(t,x) +\tilde{ \lambda }(t,x,{\bf{u}}^{{_\ast}}(t,x),\nu )\right. \\ &\left. \qquad \qquad {\int \nolimits \nolimits }_{X}{\varphi }^{{_\ast}}(t,z)\tilde{Q}(t,x,{\bf{u}}^{{_\ast}}(t,x),\nu,\mathrm{d}z)\right] \\ & =\sup\limits_{\mu \in \mathcal{P}(U)}\inf\limits_{\nu \in \mathcal{P}(V )}\left[\tilde{r}(t,x,\mu,\nu ) -\tilde{ \lambda }(t,x,\mu,\nu ){\varphi }^{{_\ast}}(t,x) +\tilde{ \lambda }(t,x,\mu,\nu ) \right. \\ &\left. \qquad \qquad \qquad \qquad {\int \nolimits \nolimits }_{X}{\varphi }^{{_\ast}}(t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right] & \end{array}$$
    (6.7)

    and

    $$ \begin{array}{rcl} & \sup\limits_{\mu \in \mathcal{P}(U)}\left[\tilde{r}(t,x,\mu,{\mathbf{v}}^{{_\ast}}(t,x)) -\tilde{ \lambda }(t,x,\mu,{\mathbf{v}}^{{_\ast}}(t,x)){\varphi }^{{_\ast}}(t,x) +\tilde{ \lambda }(t,x,\mu,{\mathbf{v}}^{{_\ast}}(t,x))\right. \\ & \left.\qquad \qquad {\int \nolimits \nolimits }_{X}{\varphi }^{{_\ast}}(t,z)\tilde{Q}(t,x,\mu,{\mathbf{v}}^{{_\ast}}(t,x),\mathrm{d}z)\right] \\ & =\inf\limits_{\nu \in \mathcal{P}(V )}\sup\limits_{\mu \in \mathcal{P}(U)}\left[\tilde{r}(t,x,\mu,\nu ) -\tilde{ \lambda }(t,x,\mu,\nu ){\varphi }^{{_\ast}}(t,x) +\tilde{ \lambda }(t,x,\mu,\nu ) \right. \\ & \left.\qquad \qquad \qquad \qquad {\int \nolimits \nolimits }_{X}{\varphi }^{{_\ast}}(t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right]. \end{array}$$
    (6.8)

    Then \({\bf{u}}^{{_\ast}}\) is an optimal strategy for player I and \({\mathbf{v}}^{{_\ast}}\) is an optimal strategy for player II.

Proof.

Let \({\bf{u}}^{{_\ast}}\) be as in (6.7) and let \(\mathbf{v}\) be an arbitrary strategy of player II. Then by the Itô-Dynkin formula applied to the solution \({\varphi }^{{_\ast}}\), together with (6.7), we obtain \({\varphi }^{{_\ast}}(t,x) \leq {\mathbb{E}}_{t,x}^{{\bf{u}}^{{_\ast}},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{s},{\bf{u}}^{{_\ast}}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ]\). Since \(\mathbf{v}\) is arbitrary, taking the infimum over \(\mathbf{v}\) gives

$$\begin{array}{rcl}{ \varphi }^{{_\ast}}(t,x)& \leq \inf\limits_{\mathbf{v}\in \mathcal{V}}{\mathbb{E}}_{t,x}^{{\bf{u}}^{{_\ast}},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{s},{\bf{u}}^{{_\ast}}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ] & \\ & \leq \sup\limits_{\bf{u}\in \mathcal{U}}\inf\limits_{\mathbf{v}\in \mathcal{V}}{\mathbb{E}}_{t,x}^{\bf{u},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ] = \underline{V}(t,x)\,. & \\ \end{array}$$

Now let \({\mathbf{v}}^{{_\ast}}\) be as in (6.8) and let \(\bf{u}\) be an arbitrary strategy of player I. Then again by the Itô-Dynkin formula, together with (6.8), we obtain \({\varphi }^{{_\ast}}(t,x) \geq {\mathbb{E}}_{t,x}^{\bf{u},{\mathbf{v}}^{{_\ast}}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),{\mathbf{v}}^{{_\ast}}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ]\). Since \(\bf{u}\) is arbitrary, taking the supremum over \(\bf{u}\) gives

$$\begin{array}{rcl}{ \varphi }^{{_\ast}}(t,x)& \geq \sup\limits_{\bf{u}\in \mathcal{U}}{\mathbb{E}}_{t,x}^{\bf{u},{\mathbf{v}}^{{_\ast}}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),{\mathbf{v}}^{{_\ast}}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ] & \\ & \geq \inf\limits_{\mathbf{v}\in \mathcal{V}}\sup\limits_{\bf{u}\in \mathcal{U}}{\mathbb{E}}_{t,x}^{\bf{u},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ] = \overline{V }(t,x)\,. & \\ \end{array}$$

From the above two inequalities, it follows that

$${\varphi }^{{_\ast}}(t,x) = \overline{V }(t,x) = \underline{V}(t,x)\,.$$

Hence \({\varphi }^{{_\ast}}\) is the value of the game. Moreover it follows that \(({\bf{u}}^{{_\ast}},{\mathbf{v}}^{{_\ast}})\) is a saddle point equilibrium. \(\square \)

Now our aim is to find a solution of (6.5) (and hence of (6.6)) in \({C}_{b}^{1,0}([0,T] \times X)\). Our next theorem asserts the existence of such a solution.

Theorem 6.3.2

Under \((A1^{\prime})\)–\((A3^{\prime})\), equation (6.5) has a unique solution in \({C}_{b}^{1,0}([0,T] \times X)\).

Proof.

Let \(\varphi (t,x) ={ \mathrm{e}}^{-\gamma t}\psi (t,x)\) for some constant \(\gamma > 0\) to be chosen later. Substituting in (6.5), we get

$$\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} {\mathrm{e}}^{-\gamma t}\frac{\mathrm{d}\psi } {\mathrm{d}t} (t,x) - \gamma {\mathrm{e}}^{-\gamma t}\psi (t,x) +\inf\limits_{ \nu \in \mathcal{P}(V )}\sup\limits_{\mu \in \mathcal{P}(U)}\bigl [\tilde{r}(t,x,\mu,\nu ) -\tilde{ \lambda }(t,x,\mu,\nu ){\mathrm{e}}^{-\gamma t}\psi (t,x)\quad \\ +\tilde{\lambda }(t,x,\mu,\nu ){\int \nolimits \nolimits }_{X}{\mathrm{e}}^{-\gamma t}\psi (t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\bigr ] = 0 \quad \\ \mbox{ on}\quad X \times [0,T)\quad \mbox{ and} \quad \\ \psi (T,x) ={ \mathrm{e}}^{\gamma T}g(x). \quad \end{array}\right.& & \\ \end{array}$$

Thus (6.5) has a solution if and only if

$$\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} \frac{\mathrm{d}\psi } {\mathrm{d}t} (t,x) - \gamma \psi (t,x) +\inf\limits_{\nu \in \mathcal{P}(V )}\sup\limits_{\mu \in \mathcal{P}(U)}\bigl [{\mathrm{e}}^{\gamma t}\tilde{r}(t,x,\mu,\nu ) -\tilde{ \lambda }(t,x,\mu,\nu )\psi (t,x)\quad \\ +\tilde{\lambda }(t,x,\mu,\nu ){\int \nolimits \nolimits }_{X}\psi (t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\bigr ] = 0 \quad \\ \mbox{ on}\quad X \times [0,T)\quad \mbox{ and} \quad \\ \psi (T,x) ={ \mathrm{e}}^{\gamma T}g(x) \quad \end{array} \right.& & \\ \end{array}$$

has a solution. The above differential equation is equivalent to the following integral equation:

$$\begin{array}{rcl} \psi (t,x)& ={ \mathrm{e}}^{\gamma t}g(x) +{ \mathrm{e}}^{\gamma t}{ \int \nolimits \nolimits }_{t}^{T}\mathrm{e}{}^{-\gamma s}\inf\limits_{\nu \in \mathcal{P}(V )}\sup\limits_{\mu \in \mathcal{P}(U)}\left[{\mathrm{e}}^{\gamma s}\tilde{r}(s,x,\mu,\nu ) -\tilde{ \lambda }(s,x,\mu,\nu )\psi (s,x) \right.\\ & \left.\quad +\tilde{ \lambda }(s,x,\mu,\nu ){\int \nolimits \nolimits }_{X}\psi (s,z)\tilde{Q}(s,x,\mu,\nu,\mathrm{d}z) \right]\mathrm{d}s\,. \end{array}$$

Let \({C}_{b}^{\mathrm{unif}}([0,T] \times X)\) be the same space as defined in the previous section. Define

$$\mathcal{T} : {C}_{b}^{\mathrm{unif}}([0,T] \times X) \rightarrow {C}_{ b}^{\mathrm{unif}}([0,T] \times X)\quad \mbox{ by}$$
$$\begin{array}{rcl} \mathcal{T} \psi (t,x)& ={ \mathrm{e}}^{\gamma t}g(x) +{ \mathrm{e}}^{\gamma t}{ \int \nolimits \nolimits }_{t}^{T}\mathrm{e}{}^{-\gamma s}\inf\limits_{\nu \in \mathcal{P}(V )}\sup\limits_{\mu \in \mathcal{P}(U)}\left[{\mathrm{e}}^{\gamma s}\tilde{r}(s,x,\mu,\nu ) \right.\\ & \left.\quad -\tilde{ \lambda }(s,x,\mu,\nu )\psi (s,x) +\tilde{ \lambda }(s,x,\mu,\nu ){\int \nolimits \nolimits }_{X}\psi (s,z)\tilde{Q}(s,x,\mu,\nu,\mathrm{d}z) \right]\mathrm{d}s\,.\end{array}$$

For \({\psi }_{1},{\psi }_{2} \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\), we have

$$\begin{array}{rcl} \vert \mathcal{T} {\psi }_{1}(t,x) -\mathcal{T} {\psi }_{2}(t,x)\vert & \leq &{ \mathrm{e}}^{\gamma t}{ \int \nolimits \nolimits }_{t}^{T}{\mathrm{e}}^{-\gamma s}2M\vert \vert {\psi }_{ 1} - {\psi }_{2}\vert \vert \mathrm{d}s \\ & =& \frac{2M} {\gamma }{ \mathrm{e}}^{\gamma t}[{\mathrm{e}}^{-\gamma t} -{\mathrm{e}}^{-\gamma T}]\vert \vert {\psi }_{ 1} - {\psi }_{2}\vert \vert \\ & =& \frac{2M} {\gamma } [1 -{\mathrm{e}}^{-\gamma (T-t)}]\vert \vert {\psi }_{ 1} - {\psi }_{2}\vert \vert \\ &\leq & \frac{2M} {\gamma } \vert \vert {\psi }_{1} - {\psi }_{2}\vert \vert.\end{array}$$

Thus if we choose \(\gamma = 2M + 1\), then \(\mathcal{T}\) is a contraction with modulus \(\frac{2M}{2M+1} < 1\) and hence has a unique fixed point \({\psi }^{{_\ast}}\). Then \(\varphi (t,x) ={ \mathrm{e}}^{-(2M+1)t}{\psi }^{{_\ast}}(t,x)\) is the unique solution of (6.5). \(\square \)
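
As in Sect. 6.2, the construction suggests a numerical scheme for a hypothetical model with finitely many states and actions: march backward in time through an explicit Euler discretisation of (6.5), and at each \((t,x)\) replace the \(\inf \sup \) over \(\mathcal{P}(U) \times \mathcal{P}(V )\) by the value of the corresponding matrix game, computed by linear programming as above. All model data (\(\lambda \), \(Q\), \(r\), \(g\)) and the grid below are illustrative assumptions.

```python
# A sketch of computing the value of the game for a hypothetical finite model:
# an explicit backward Euler discretisation of (6.5) in which the inf/sup over
# P(U) x P(V) at each (t, x) is the value of a finite matrix game, obtained by
# linear programming.  All model data (lam, Q, r, g) and the grid are assumptions.
import numpy as np
from scipy.optimize import linprog

def game_value(A):
    """Value of the matrix game with payoff A[i, j] (the maximiser picks rows)."""
    m, k = A.shape
    cost = np.zeros(m + 1)
    cost[-1] = -1.0                               # maximise w  <=>  minimise -w
    A_ub = np.hstack([-A.T, np.ones((k, 1))])     # w <= (A^T mu)_j for every column j
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0                             # mu is a probability vector
    res = linprog(cost, A_ub=A_ub, b_ub=np.zeros(k), A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1]

n, mU, mV, T, K = 3, 2, 2, 1.0, 200               # states, actions of I and II, horizon, steps
dt = T / K

def lam(t, x, u, v):                              # bounded rate (A1')
    return 1.0 + 0.25 * (u + v)

def Q(t, x, u, v):                                # kernel with Q(t, x, u, v, {x}) = 0
    p = np.ones(n) / (n - 1)
    p[x] = 0.0
    return p

def r(t, x, u, v):                                # bounded reward rate (A3')
    return x + 0.2 * u - 0.1 * v

g = np.arange(n, dtype=float)                     # terminal reward g(x)

phi = g.copy()                                    # phi(T, .) = g
for k in reversed(range(K)):                      # march backward in time
    t = (k + 1) * dt
    phi_new = np.empty(n)
    for x in range(n):
        A = np.array([[r(t, x, u, v) - lam(t, x, u, v) * phi[x]
                       + lam(t, x, u, v) * Q(t, x, u, v) @ phi
                       for v in range(mV)] for u in range(mU)])
        phi_new[x] = phi[x] + dt * game_value(A)  # explicit Euler step for (6.5)
    phi = phi_new

print("value of the game at t = 0:", phi)
```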

6.4 Conclusion

In this chapter we have established smooth solutions of dynamic programming equations for continuous-time controlled Markov chains on the finite horizon. This has led to the existence of an optimal Markov strategy for continuous-time MDPs and of a saddle point equilibrium in Markov strategies for zero-sum games. We have used the boundedness condition on the cost function \(c\) for simplicity. For continuous-time MDPs, if \(c\) is unbounded above, then we can show that \(V (t,x)\) is the minimal non-negative solution of (6.3) by approximating the cost function \(c\) by \(c \wedge n\) for a positive integer \(n\) and then letting \(n \rightarrow \infty \). If \(c\) is unbounded both above and below and satisfies a suitable growth condition, then we can again prove the existence of unique solutions of the dynamic programming equations in \({C}^{1,0}([0,T] \times X)\) with an appropriate weighted norm; see [5] and [6] for analogous results.