Abstract
We study continuous-time controlled Markov chains on the finite horizon. For the Markov decision problem, we show that the value function is the unique solution of the corresponding dynamic programming equation. This leads to the existence of an optimal Markov control. We then consider a zero-sum game. We show that the value function exists and is the unique solution of the corresponding Isaacs equations. This yields the existence of a pair of saddle point Markov strategies.
Keywords
- Finite Horizon Case
- Dynamic Programming Equation
- Continuous-time Controlled Markov Chains
- Optimal Markov Strategies
- Isaacs Equation
6.1 Introduction
This chapter studies continuous-time Markov decision processes and continuous-time zero-sum stochastic dynamic games. In the continuous-time setup, although the infinite horizon case has been well studied, the literature on the finite horizon case is sparse. Infinite horizon continuous-time Markov decision processes have been studied by many authors (see, e.g., [5] and the references therein). In the finite horizon case, Pliska [7] used a semigroup approach to characterise the value function and the optimal control, but his approach yields only existential results. In this chapter, we show that the value function is a smooth solution of an appropriate dynamic programming equation. Our method of proof yields algorithms for computing the value function and an optimal control.
The situation is analogous for continuous-time zero-sum stochastic dynamic games. Here as well, the infinite horizon case has been studied in the literature [6], but to our knowledge the finite horizon case has not been treated. In this chapter, we prove that the value of the finite horizon game exists and is the solution of an appropriate Isaacs equation. This leads to the existence of a saddle point equilibrium.
The rest of our chapter is structured as follows. In Sect. 6.2 we analyse the finite horizon continuous-time MDP. Section 6.3 deals with zero-sum stochastic dynamic games. We conclude our chapter in Sect. 6.4 with a few remarks.
6.2 Finite Horizon Continuous-Time MDP
Throughout this chapter the time horizon is \(T\). The control model we consider is given by
$$(X,\; U,\; \lambda,\; Q,\; c),$$
where each element is described below.
The state space \(X\). The state space \(X\) is the set of states of the process under observation which is assumed to be a Polish space.
The action space \(U\). The decision-maker dynamically takes his action from the action space \(U\). We assume that \(U\) is a compact metric space.
The instantaneous transition rate \(\lambda \). \(\lambda : [0,T] \times X \times U \rightarrow [0,\infty )\) is a given function satisfying the following assumption:
-
(A1)
\(\lambda \) is continuous and there exists a constant \(M\) such that
$$\sup\limits_{t,x,u}\lambda (t,x,u) \leq M.$$

The transition probability kernel \(Q\). For fixed \(t \in [0,T]\), \(x \in X\), \(u \in U\), \(Q(t,x,u,\cdot)\) is a probability measure on \(X\) with \(Q(t,x,u,\{x\}) = 0\). \(Q\) satisfies the following:
-
(A2)
\(Q\) is weakly continuous, i.e. if \({x}_{n} \rightarrow x\), \({t}_{n} \rightarrow t\), \({u}_{n} \rightarrow u\), then for any \(f \in {C}_{b}(X)\)
$${\int}_{X}f(z)Q({t}_{n},{x}_{n},{u}_{n},\mathrm{d}z) \rightarrow {\int}_{X}f(z)Q(t,x,u,\mathrm{d}z).$$

The cost rate \(c\). \(c : [0,T] \times X \times U \rightarrow [0,\infty )\) is a given function satisfying the following assumption:
-
(A3)
\(c\) is continuous and there exists a finite constant \(\tilde{C}\) such that
$$\sup\limits_{t,x,u}c(t,x,u) \leq \tilde{ C}.$$
Next we give an informal description of the evolution of the controlled system. Suppose that the system is in state \(x\) at time \(t \in [0,T)\) and the controller, or decision-maker, takes an action \(u \in U\). Then the following happens over the time interval \([t, t + \mathrm{d}t]\):
-
1.
The decision maker has to pay an infinitesimal cost \(c(t,x,u)\mathrm{d}t,\) and
-
2.
A transition from state \(x\) to a set \(A\) (not containing \(x\)) occurs with probability
$$\lambda (t,x,u)\mathrm{d}t{\int \nolimits \nolimits }_{A}Q(t,x,u,\mathrm{d}z) + o(\mathrm{d}t);$$or the system remains in state \(x\) with probability
$$1 - \lambda (t,x,u)\mathrm{d}t + o(\mathrm{d}t).$$

Now we describe the optimal control problem. To this end we first describe the set of admissible controls. Let
$$\bf{u} : [0,T] \times X \rightarrow U$$be a measurable function. Let \(\mathcal{U}\) denote the set of all such measurable functions; this is the set of admissible controls. Such controls are called Markov controls. For each \(\bf{u} \in \mathcal{U}\), it can be shown (see [1, 3]) that there exists a strong Markov process \(\{{X}_{t}\}\) having the generator
$$\begin{array}{rcl}{\mathcal{A}}_{t}^{\bf{u}}f(x) = \lambda (t,x,\bf{u}(t,x))\left[{\int}_{X}f(z)Q(t,x,\bf{u}(t,x),\mathrm{d}z) - f(x)\right],& & \\ \end{array}$$where \(f\) is a bounded measurable function.
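As a toy illustration of the dynamics just described, the following Python sketch simulates one path of a controlled chain over \([0,T]\) on a three-point state space. The concrete choices below (the rate `lam`, the kernel `Q`, and the Markov control `policy`) are hypothetical stand-ins satisfying (A1)–(A2); they are not from the chapter.

```python
import random

# Hypothetical toy model: X = {0, 1, 2}, U = {0, 1}, horizon T = 1.
T = 1.0
X_STATES = [0, 1, 2]

def lam(t, x, u):
    # jump rate; bounded, sup <= M = 2 (assumption (A1))
    return 1.0 + 0.5 * u * (x + 1) / 3.0

def Q(t, x, u):
    # post-jump distribution: uniform over the OTHER states (no self-jumps,
    # matching Q(t,x,u,{x}) = 0)
    others = [y for y in X_STATES if y != x]
    return {y: 1.0 / len(others) for y in others}

def policy(t, x):
    # a Markov control u(t, x) taking values in U = {0, 1}
    return 0 if x == 0 else 1

def simulate(x0, seed=0):
    rng = random.Random(seed)
    t, x = 0.0, x0
    path = [(t, x)]
    while True:
        u = policy(t, x)
        # holding time: exponential with the current rate (constant between
        # jumps in this toy model)
        hold = rng.expovariate(lam(t, x, u))
        if t + hold > T:
            break                      # no further jump before the horizon
        t += hold
        r, acc = rng.random(), 0.0
        for y, p in Q(t, x, u).items():
            acc += p
            if r <= acc:
                x = y                  # jump according to the kernel Q
                break
        path.append((t, x))
    return path

path = simulate(x0=0)
```

Since the rates here do not vary between jumps, exponential holding times are exact; for genuinely time-dependent rates one would simulate via thinning or uniformization instead.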
For each \(\bf{u} \in \mathcal{U}\), define
$${V}^{\bf{u}}(t,x) ={\mathbb{E}}_{t,x}^{\bf{u}}\left[{\int}_{t}^{T}c(s,{X}_{s},\bf{u}(s,{X}_{s}))\,\mathrm{d}s + g({X}_{T})\right],$$
where \(g : X \rightarrow {\mathbb{R}}_{+}\) is the terminal cost function, assumed to be bounded and continuous, and \({\mathbb{E}}_{t,x}^{\bf{u}}\) is the expectation operator under the control \(\bf{u}\) with initial condition \({X}_{t} = x\). The aim of the controller is to minimise \({V}^{\bf{u}}\) over all \(\bf{u} \in \mathcal{U}\). Define
$$V(t,x) =\inf\limits_{\bf{u}\in \mathcal{U}}{V}^{\bf{u}}(t,x).$$
The function \(V\) is called the value function. If \({\bf{u}}^{*}\in \mathcal{U}\) satisfies
$${V}^{{\bf{u}}^{*}}(t,x) = V(t,x)\quad \text{for all } (t,x) \in [0,T] \times X,$$
then \({\bf{u}}^{*}\) is called an optimal control.
The associated dynamic programming equation is
$$\frac{\partial \varphi}{\partial t}(t,x) +\inf\limits_{u\in U}\left[c(t,x,u) - \lambda (t,x,u)\varphi (t,x) + \lambda (t,x,u){\int}_{X}\varphi (t,z)Q(t,x,u,\mathrm{d}z)\right] = 0,\qquad \varphi (T,x) = g(x). \tag{6.3}$$
The importance of (6.3) is illustrated by the following verification theorem.
Theorem 6.2.1
If (6.3) has a solution \(\varphi \) in \({C}_{b}^{1,0}([0,T] \times X)\), then \(\varphi = V\), the value function. Moreover, if \({\bf{u}}^{*}\in \mathcal{U}\) is such that
$$c(t,x,{\bf{u}}^{*}(t,x)) - \lambda (t,x,{\bf{u}}^{*}(t,x))\varphi (t,x) + \lambda (t,x,{\bf{u}}^{*}(t,x)){\int}_{X}\varphi (t,z)Q(t,x,{\bf{u}}^{*}(t,x),\mathrm{d}z) =\inf\limits_{u\in U}\left[c(t,x,u) - \lambda (t,x,u)\varphi (t,x) + \lambda (t,x,u){\int}_{X}\varphi (t,z)Q(t,x,u,\mathrm{d}z)\right]$$
for all \((t,x)\), then \({\bf{u}}^{*}\) is an optimal control.
Proof.
Applying the Ito–Dynkin formula to the solution \(\varphi \) of (6.3), for any \(\bf{u}\in \mathcal{U}\) we obtain
$$\varphi (t,x) \leq {\mathbb{E}}_{t,x}^{\bf{u}}\left[{\int}_{t}^{T}c(s,{X}_{s},\bf{u}(s,{X}_{s}))\,\mathrm{d}s + g({X}_{T})\right] = {V}^{\bf{u}}(t,x),$$
so that \(\varphi \leq V\). For \(\bf{u} ={\bf{u}}^{*}\) as in the statement of the theorem, the inequality above becomes an equality, whence \(\varphi = {V}^{{\bf{u}}^{*}} = V\) and \({\bf{u}}^{*}\) is an optimal control. The existence of such a \({\bf{u}}^{*}\) follows by a standard measurable selection theorem [2]. □
In view of the above theorem, it suffices to show that (6.3) has a solution in \({C}_{b}^{1,0}([0,T] \times X)\).
Theorem 6.2.2
Under \((A1)\) –(A3), the dynamic programming equation (6.3) has a unique solution in \({C}_{b}^{1,0}([0,T] \times X).\)
Proof.
Let \(\varphi (t,x) ={\mathrm{e}}^{-\gamma t}\psi (t,x)\) for some \(\gamma < \infty \). Then from (6.3) we get
$$\frac{\partial \psi}{\partial t}(t,x) - \gamma \psi (t,x) +\inf\limits_{u\in U}\left[{\mathrm{e}}^{\gamma t}c(t,x,u) - \lambda (t,x,u)\psi (t,x) + \lambda (t,x,u){\int}_{X}\psi (t,z)Q(t,x,u,\mathrm{d}z)\right] = 0,\qquad \psi (T,x) ={\mathrm{e}}^{\gamma T}g(x).$$
Thus (6.3) has a solution if and only if the above equation has a solution. The above differential equation is equivalent to the following integral equation:
$$\psi (t,x) ={\mathrm{e}}^{\gamma t}g(x) +{\int}_{t}^{T}{\mathrm{e}}^{\gamma (t-s)}\inf\limits_{u\in U}\left[{\mathrm{e}}^{\gamma s}c(s,x,u) - \lambda (s,x,u)\psi (s,x) + \lambda (s,x,u){\int}_{X}\psi (s,z)Q(s,x,u,\mathrm{d}z)\right]\mathrm{d}s.$$
Let \({C}_{b}^{\mathrm{unif}}([0,T] \times X)\) be the space of bounded continuous functions \(\varphi \) on \([0,T] \times X\) with the additional property that given \(\varepsilon > 0\) there exists \(\delta > 0\) such that
$$\sup\limits_{x}\vert \varphi (t + h,x) - \varphi (t,x)\vert < \varepsilon \ \text{ for all } t,\ \text{whenever } \vert h\vert < \delta.$$
Suppose \({\varphi }_{n} \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\) and \({\varphi }_{n} \rightarrow \varphi \) uniformly. Then
$$\vert \varphi (t + h,x) - \varphi (t,x)\vert \leq \vert \varphi (t + h,x) -{\varphi }_{n}(t + h,x)\vert + \vert {\varphi }_{n}(t + h,x) -{\varphi }_{n}(t,x)\vert + \vert {\varphi }_{n}(t,x) - \varphi (t,x)\vert.$$
Given \(\varepsilon > 0\), there exists \({n}_{0}\) such that \({\sup }_{t,x}\vert {\varphi }_{{n}_{0}}(t,x) - \varphi (t,x)\vert < \frac{\varepsilon }{3}\), and for this \({n}_{0}\), there exists \(\delta > 0\) such that \(\sup\limits_{x}\vert {\varphi }_{{n}_{0}}(t + h,x) - {\varphi }_{{n}_{0}}(t,x)\vert < \frac{\varepsilon }{3}\) whenever \(\vert h\vert < \delta \). Putting \(n = {n}_{0}\), we get from the above inequality
$$\sup\limits_{x}\vert \varphi (t + h,x) - \varphi (t,x)\vert < \varepsilon \ \text{ whenever } \vert h\vert < \delta.$$
Thus \({C}_{b}^{\mathrm{unif}}([0,T] \times X)\) is a closed subspace of \({C}_{b}([0,T] \times X)\), and hence it is a Banach space.
Now for \(\varphi \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\), it follows from the assumption on \(Q\) that \({\int}_{X}\varphi (t,z)Q(t,x,u,\mathrm{d}z)\) is continuous in \(t\), \(x\) and \(u\). Define the operator \(\mathcal{T}\) by
$$(\mathcal{T}\psi )(t,x) ={\mathrm{e}}^{\gamma t}g(x) +{\int}_{t}^{T}{\mathrm{e}}^{\gamma (t-s)}\inf\limits_{u\in U}\left[{\mathrm{e}}^{\gamma s}c(s,x,u) - \lambda (s,x,u)\psi (s,x) + \lambda (s,x,u){\int}_{X}\psi (s,z)Q(s,x,u,\mathrm{d}z)\right]\mathrm{d}s.$$
For \({\psi }_{1},{\psi }_{2} \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\), we have
$$\vert (\mathcal{T}{\psi }_{1})(t,x) - (\mathcal{T}{\psi }_{2})(t,x)\vert \leq {\int}_{t}^{T}{\mathrm{e}}^{\gamma (t-s)}\,2M\,\Vert {\psi }_{1} -{\psi }_{2}\Vert \,\mathrm{d}s \leq \frac{2M}{\gamma }\Vert {\psi }_{1} -{\psi }_{2}\Vert.$$
Thus if we choose \(\gamma = 2M + 1\), then \(\mathcal{T}\) is a contraction and hence has a unique fixed point \({\psi }^{*}\). Then \({\mathrm{e}}^{-(2M+1)t}{\psi }^{*}\) is the unique solution of (6.3). \(\square \)
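The contraction argument above can be checked numerically. The sketch below discretises the integral equation on a time grid for a hypothetical two-state, two-action model (the functions `lam`, `c`, `jump_dist`, `g` are illustrative choices with \(M = 1\), not from the chapter) and iterates the operator \(\mathcal{T}\) with \(\gamma = 2M + 1\) until the Picard iterates settle.

```python
import math

# Hypothetical toy model: X = {0, 1}, U = {0, 1}, horizon T = 1.
T_HORIZON = 1.0
N = 100                       # time grid points
DT = T_HORIZON / N
M = 1.0                       # bound on the jump rate, as in (A1)
GAMMA = 2 * M + 1             # gamma = 2M + 1, as in the proof

def lam(t, x, u):             # jump rate, bounded by M = 1
    return 0.5 + 0.5 * u

def c(t, x, u):               # bounded running cost
    return 1.0 if x == 1 else 0.2 * u

def jump_dist(x):             # post-jump distribution: swap states, no self-jumps
    return {1 - x: 1.0}

def g(x):                     # terminal cost
    return float(x)

def apply_T(psi):
    """One application of the integral operator T to psi[i][x] on the grid."""
    out = [[0.0, 0.0] for _ in range(N + 1)]
    for i in range(N + 1):
        t = i * DT
        for x in (0, 1):
            acc = math.exp(GAMMA * t) * g(x)
            for j in range(i, N):        # left Riemann sum over [t, T]
                s = j * DT
                h = min(
                    math.exp(GAMMA * s) * c(s, x, u)
                    - lam(s, x, u) * psi[j][x]
                    + lam(s, x, u) * sum(p * psi[j][z]
                                         for z, p in jump_dist(x).items())
                    for u in (0, 1))
                acc += math.exp(GAMMA * (t - s)) * h * DT
            out[i][x] = acc
    return out

def sup_dist(a, b):
    return max(abs(a[i][x] - b[i][x]) for i in range(N + 1) for x in (0, 1))

# Picard iteration: successive applications of T contract in the sup norm
psi = [[0.0, 0.0] for _ in range(N + 1)]
gaps = []
for _ in range(50):
    nxt = apply_T(psi)
    gaps.append(sup_dist(nxt, psi))
    psi = nxt

# The value function estimate is V(t, x) ~ exp(-GAMMA * t) * psi(t, x);
# at t = 0 the factor is 1.
V0 = psi[0][0]
```

The successive gaps shrink roughly by the factor \(2M/(2M+1)\) per iteration, which is the discrete shadow of the contraction estimate in the proof.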
6.3 Zero-Sum Stochastic Game
In this section, we consider a zero-sum stochastic game. The control model we consider here is given by
$$(X,\; U,\; V,\; \lambda,\; Q,\; r),$$
where \(X\) is the state space as before; \(U\) and \(V\) are the action spaces of players I and II, respectively; \(\lambda \) and \(Q\) denote the rate and the transition kernel, respectively, which now depend on the additional parameter \(v\); and \(r\) is the reward rate. The dynamics of the game are similar to those of the MDP, with appropriate modifications. Here player I receives a payoff from player II. The aim of player I is to maximise his payoff, while player II seeks to minimise the payoff to player I.
Now we describe the strategies of the players. In order to solve the problem, we need to consider Markov relaxed strategies. Let \(\mathcal{P}(U)\) and \(\mathcal{P}(V)\) denote the spaces of probability measures on \(U\) and \(V\), respectively. We denote the space of strategies of player I by \(\mathcal{U}\) and that of player II by \(\mathcal{V}\), where
$$\mathcal{U} =\{\bf{u} : [0,T] \times X \rightarrow \mathcal{P}(U)\ \text{measurable}\},\qquad \mathcal{V} =\{\mathbf{v} : [0,T] \times X \rightarrow \mathcal{P}(V)\ \text{measurable}\}.$$
Now corresponding to \(\lambda \), \(Q\) and \(r\), define, for \(\mu \in \mathcal{P}(U)\) and \(\nu \in \mathcal{P}(V )\),
$$\tilde{\lambda }(t,x,\mu,\nu ) ={\int}_{V}{\int}_{U}\lambda (t,x,u,v)\,\mu (\mathrm{d}u)\,\nu (\mathrm{d}v),\qquad \tilde{r}(t,x,\mu,\nu ) ={\int}_{V}{\int}_{U}r(t,x,u,v)\,\mu (\mathrm{d}u)\,\nu (\mathrm{d}v),$$
and
$$\tilde{\lambda }(t,x,\mu,\nu )\,\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z) ={\int}_{V}{\int}_{U}\lambda (t,x,u,v)\,Q(t,x,u,v,\mathrm{d}z)\,\mu (\mathrm{d}u)\,\nu (\mathrm{d}v).$$
As in the previous section, we make the following assumptions:
-
(A1′)
\(\lambda \) is continuous and there exists a finite constant \(M\) such that
$$\sup\limits_{t,x,u,v}\lambda (t,x,u,v) \leq M.$$ -
(A2′)
\(Q\) is weakly continuous, i.e. if \({x}_{n} \rightarrow x\), \({t}_{n} \rightarrow t\), \({u}_{n} \rightarrow u\) and \({v}_{n} \rightarrow v\), then for any \(f \in {C}_{b}(X)\)
$${\int \nolimits \nolimits }_{X}f(z)Q({t}_{n},{x}_{n},{u}_{n},{v}_{n},\mathrm{d}z) \rightarrow {\int \nolimits \nolimits }_{X}f(z)Q(t,x,u,v,\mathrm{d}z).$$ -
(A3′)
\(r\) is continuous and there exists a finite constant \(\tilde{C}\) such that
$$\sup\limits_{t,x,u,v}r(t,x,u,v) \leq \tilde{C}.$$

If the players use strategies \((\bf{u},\mathbf{v}) \in \mathcal{U}\times \mathcal{V}\), then the expected payoff to player I is given by
$${\mathbb{E}}_{t,x}^{\bf{u},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{ s},\bf{u}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ]$$where \(g\) is the terminal reward function which is assumed to be bounded and continuous. Now we define the upper and lower values for our game. Define
$$\overline{V }(t,x) =\inf\limits_{\mathbf{v}\in \mathcal{V}}\sup\limits_{\bf{u}\in \mathcal{U}}{\mathbb{E}}_{t,x}^{\bf{u},\mathbf{v}}\left[{\int}_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right].$$

Also define

$$\underline{V}(t,x) =\sup\limits_{\bf{u}\in \mathcal{U}}\inf\limits_{\mathbf{v}\in \mathcal{V}}{\mathbb{E}}_{t,x}^{\bf{u},\mathbf{v}}\left[{\int}_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right].$$

The function \(\overline{V }\) is called the upper value function of the game, and \(\underline{V}\) the lower value function. Since player I tries to maximise his payoff and player II tries to minimise the payoff to player I, \(\underline{V}\) is the minimum payoff that player I is guaranteed to receive, and \(\overline{V }\) is the greatest amount that player II may be forced to pay player I. In general \(\underline{V} \leq \overline{V }\). If \(\overline{V }(t,x) = \underline{V}(t,x)\), then the game is said to have a value. A strategy \({\bf{u}}^{*}\) is said to be an optimal strategy for player I if
$$\begin{array}{rcl}{ \mathbb{E}}_{t,x}^{{\bf{u}}^{{_\ast}},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{ s},{\bf{u}}^{{_\ast}}(s,{X}_{ s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ] \geq \overline{V }(t,x)& & \\ \end{array}$$for any \(t,x,\mathbf{v}\).
Similarly, \({\mathbf{v}}^{*}\) is called an optimal policy for player II if
$${\mathbb{E}}_{t,x}^{\bf{u},{\mathbf{v}}^{*}}\left[{\int}_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),{\mathbf{v}}^{*}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right] \leq \underline{V}(t,x)$$
for any \(t,x,\bf{u}\). Such a pair \(({\bf{u}}^{*},{\mathbf{v}}^{*})\), if it exists, is called a saddle point equilibrium. Our aim is to find the value of the game and optimal strategies for both players. To this end, consider the following pair of Isaacs equations:
$$\frac{\partial \varphi}{\partial t}(t,x) +\sup\limits_{\mu \in \mathcal{P}(U)}\inf\limits_{\nu \in \mathcal{P}(V )}\left[\tilde{r}(t,x,\mu,\nu ) -\tilde{\lambda }(t,x,\mu,\nu )\varphi (t,x) +\tilde{\lambda }(t,x,\mu,\nu ){\int}_{X}\varphi (t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right] = 0,\qquad \varphi (T,x) = g(x), \tag{6.5}$$
and
$$\frac{\partial \varphi}{\partial t}(t,x) +\inf\limits_{\nu \in \mathcal{P}(V )}\sup\limits_{\mu \in \mathcal{P}(U)}\left[\tilde{r}(t,x,\mu,\nu ) -\tilde{\lambda }(t,x,\mu,\nu )\varphi (t,x) +\tilde{\lambda }(t,x,\mu,\nu ){\int}_{X}\varphi (t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right] = 0,\qquad \varphi (T,x) = g(x). \tag{6.6}$$
By Fan’s minimax theorem [4], we have that if \(\varphi \in {C}_{b}^{1,0}([0,T] \times X)\) is a solution of (6.5), then it is also a solution of (6.6) and vice versa. The importance of Isaacs equations is illustrated by the following theorem.
Theorem 6.3.1
Let \({\varphi }^{{_\ast}}\in {C}_{b}^{1,0}([0,T] \times X)\) be a solution of (6.5) and (6.6). Then
-
(i)
\({\varphi }^{{_\ast}}\) is the value of the game.
-
(ii)
Let \(({\bf{u}}^{{_\ast}},{\mathbf{v}}^{{_\ast}}) \in \mathcal{U}\times \mathcal{V}\) be such that
$$\begin{array}{rcl} & \inf\limits_{\nu \in \mathcal{P}(V )}\left[\tilde{r}(t,x,{\bf{u}}^{*}(t,x),\nu ) -\tilde{\lambda }(t,x,{\bf{u}}^{*}(t,x),\nu ){\varphi }^{*}(t,x) +\tilde{\lambda }(t,x,{\bf{u}}^{*}(t,x),\nu )\right. \\ &\left. \qquad \qquad {\int}_{X}{\varphi }^{*}(t,z)\tilde{Q}(t,x,{\bf{u}}^{*}(t,x),\nu,\mathrm{d}z)\right] \\ & =\sup\limits_{\mu \in \mathcal{P}(U)}\inf\limits_{\nu \in \mathcal{P}(V )}\left[\tilde{r}(t,x,\mu,\nu ) -\tilde{\lambda }(t,x,\mu,\nu ){\varphi }^{*}(t,x) +\tilde{\lambda }(t,x,\mu,\nu ) \right. \\ &\left. \qquad \qquad \qquad \qquad {\int}_{X}{\varphi }^{*}(t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right] & \end{array} \tag{6.7}$$

and

$$\begin{array}{rcl} & \sup\limits_{\mu \in \mathcal{P}(U)}\left[\tilde{r}(t,x,\mu,{\mathbf{v}}^{*}(t,x)) -\tilde{\lambda }(t,x,\mu,{\mathbf{v}}^{*}(t,x)){\varphi }^{*}(t,x) +\tilde{\lambda }(t,x,\mu,{\mathbf{v}}^{*}(t,x))\right. \\ & \left.\qquad \qquad {\int}_{X}{\varphi }^{*}(t,z)\tilde{Q}(t,x,\mu,{\mathbf{v}}^{*}(t,x),\mathrm{d}z)\right] \\ & =\inf\limits_{\nu \in \mathcal{P}(V )}\sup\limits_{\mu \in \mathcal{P}(U)}\left[\tilde{r}(t,x,\mu,\nu ) -\tilde{\lambda }(t,x,\mu,\nu ){\varphi }^{*}(t,x) +\tilde{\lambda }(t,x,\mu,\nu ) \right. \\ & \left.\qquad \qquad \qquad \qquad {\int}_{X}{\varphi }^{*}(t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right]. \end{array} \tag{6.8}$$

Then \({\bf{u}}^{*}\) is an optimal policy for player I and \({\mathbf{v}}^{*}\) is an optimal policy for player II.
Proof.
Let \({\bf{u}}^{*}\) be as in (6.7) and \(\mathbf{v}\) be any arbitrary strategy of player II. Then by the Ito–Dynkin formula applied to the solution \({\varphi }^{*}\), we obtain
$${\mathbb{E}}_{t,x}^{{\bf{u}}^{*},\mathbf{v}}\left[{\int}_{t}^{T}\tilde{r}(s,{X}_{s},{\bf{u}}^{*}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right] \geq {\varphi }^{*}(t,x).$$
Now let \({\mathbf{v}}^{*}\) be as in (6.8) and let \(\bf{u}\) be any arbitrary strategy of player I. Then again by the Ito–Dynkin formula we obtain
$${\mathbb{E}}_{t,x}^{\bf{u},{\mathbf{v}}^{*}}\left[{\int}_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),{\mathbf{v}}^{*}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right] \leq {\varphi }^{*}(t,x).$$
From the above two inequalities, it follows that
$${\varphi }^{*}(t,x) \leq \underline{V}(t,x) \leq \overline{V }(t,x) \leq {\varphi }^{*}(t,x).$$
Hence \({\varphi }^{{_\ast}}\) is the value of the game. Moreover it follows that \(({\bf{u}}^{{_\ast}},{\mathbf{v}}^{{_\ast}})\) is a saddle point equilibrium. \(\square \)
Now our aim is to find a solution of (6.5) (and hence of (6.6)) in \({C}_{b}^{1,0}([0,T] \times X)\). Our next theorem asserts the existence of such a solution.
Theorem 6.3.2
Under \((A1^{\prime})\) –(A3′), equation (6.5) has a unique solution in \({C}_{b}^{1,0}([0,T] \times X).\)
Proof.
Let \(\varphi (t,x) ={\mathrm{e}}^{-\gamma t}\psi (t,x)\) for some \(\gamma < \infty \). Substituting in (6.5), we get
$$\frac{\partial \psi}{\partial t}(t,x) - \gamma \psi (t,x) +\sup\limits_{\mu \in \mathcal{P}(U)}\inf\limits_{\nu \in \mathcal{P}(V )}\left[{\mathrm{e}}^{\gamma t}\tilde{r}(t,x,\mu,\nu ) -\tilde{\lambda }(t,x,\mu,\nu )\psi (t,x) +\tilde{\lambda }(t,x,\mu,\nu ){\int}_{X}\psi (t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right] = 0,\qquad \psi (T,x) ={\mathrm{e}}^{\gamma T}g(x).$$
Thus (6.5) has a solution if and only if the above equation has a solution. The above differential equation is equivalent to the following integral equation:
$$\psi (t,x) ={\mathrm{e}}^{\gamma t}g(x) +{\int}_{t}^{T}{\mathrm{e}}^{\gamma (t-s)}\sup\limits_{\mu \in \mathcal{P}(U)}\inf\limits_{\nu \in \mathcal{P}(V )}\left[{\mathrm{e}}^{\gamma s}\tilde{r}(s,x,\mu,\nu ) -\tilde{\lambda }(s,x,\mu,\nu )\psi (s,x) +\tilde{\lambda }(s,x,\mu,\nu ){\int}_{X}\psi (s,z)\tilde{Q}(s,x,\mu,\nu,\mathrm{d}z)\right]\mathrm{d}s.$$
Let \({C}_{b}^{\mathrm{unif}}([0,T] \times X)\) be the same space as defined in the previous section. Define the operator \(\mathcal{T}\) on this space by the right-hand side of the integral equation above. For \({\psi }_{1},{\psi }_{2} \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\), we have
$$\vert (\mathcal{T}{\psi }_{1})(t,x) - (\mathcal{T}{\psi }_{2})(t,x)\vert \leq {\int}_{t}^{T}{\mathrm{e}}^{\gamma (t-s)}\,2M\,\Vert {\psi }_{1} -{\psi }_{2}\Vert \,\mathrm{d}s \leq \frac{2M}{\gamma }\Vert {\psi }_{1} -{\psi }_{2}\Vert.$$
Thus if we choose \(\gamma = 2M + 1\), then \(\mathcal{T}\) is a contraction and hence has a unique fixed point \({\psi }^{*}\). Then \({\mathrm{e}}^{-(2M+1)t}{\psi }^{*}\) is the unique solution of (6.5). \(\square \)
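The minimax interchange behind (6.5) and (6.6), which Fan's theorem guarantees for the bracketed expression, can be illustrated on a finite toy game: for a payoff bilinear in the mixed strategies \((\mu,\nu )\), the sup-inf and the inf-sup coincide. The 2×2 reward matrix below is a hypothetical example, not from the chapter; scanning a grid of mixed strategies suffices here because the payoff is bilinear in \((p,q)\).

```python
# Hypothetical matching-pennies rewards to player I (the maximiser).
R = [[1.0, -1.0],
     [-1.0, 1.0]]

# Mixed strategies: player I plays row 0 with probability p, player II plays
# column 0 with probability q; scan both on a grid containing 1/2.
GRID = [i / 100 for i in range(101)]

def payoff(p, q):
    """Expected reward under the mixed-strategy pair (p, q)."""
    return (p * q * R[0][0] + p * (1 - q) * R[0][1]
            + (1 - p) * q * R[1][0] + (1 - p) * (1 - q) * R[1][1])

lower = max(min(payoff(p, q) for q in GRID) for p in GRID)  # sup_mu inf_nu
upper = min(max(payoff(p, q) for p in GRID) for q in GRID)  # inf_nu sup_mu
```

For matching pennies the common value is 0, attained at \(p = q = 1/2\) — a saddle point that exists only because the players randomise, which is exactly why relaxed (randomised) Markov strategies are used in this section.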
6.4 Conclusion
In this chapter we have established smooth solutions of dynamic programming equations for continuous-time controlled Markov chains on the finite horizon. This has led to the existence of an optimal Markov strategy for continuous-time MDP and saddle point equilibrium in Markov strategies for zero-sum games. We have used the boundedness condition on the cost function \(c\) for simplicity. For continuous-time MDP, if \(c\) is unbounded above, then we can show that \(V (t,x)\) is the minimal non-negative solution of (6.3) by approximating the cost function \(c\) by \(c \wedge n\) for a positive integer \(n\) and then letting \(n \rightarrow \infty \). If \(c\) is unbounded on both sides and it satisfies a suitable growth condition, then again we can prove the existence of unique solutions of dynamic programming equations in \({C}^{1,0}([0,T] \times X)\) with appropriate weighted norm; see [5] and [6] for analogous results.
References
A. Arapostathis, V. S. Borkar and M. K. Ghosh, Ergodic Control of Diffusion Processes, Cambridge University Press, 2011.
V. E. Benes, Existence of optimal strategies based on specified information for a class of stochastic decision problems, SIAM J. Control 8 (1970), 179–188.
M. H. A. Davis, Markov Models and Optimization, Chapman and Hall, 1993.
K. Fan, Fixed-point and minimax theorems in locally convex topological linear spaces, Proc. Natl. Acad. Sci. USA 38 (1952), 121–126.
X. Guo and O. Hernández-Lerma, Continuous-Time Markov Decision Processes. Theory and Applications, Springer-Verlag, 2009.
X. Guo and O. Hernández-Lerma, Zero-sum games for continuous-time jump Markov processes in Polish spaces: discounted payoffs, Adv. in Appl. Probab. 39 (2007), 645–668.
S. R. Pliska, Controlled jump processes, Stochastic Processes Appl. 3 (1975), 259–282.
© 2012 Springer Science+Business Media, LLC
Ghosh, M.K., Saha, S. (2012). Continuous-Time Controlled Jump Markov Processes on the Finite Horizon. In: Hernández-Hernández, D., Minjárez-Sosa, J. (eds) Optimization, Control, and Applications of Stochastic Systems. Systems & Control: Foundations & Applications. Birkhäuser, Boston. https://doi.org/10.1007/978-0-8176-8337-5_6