Abstract
We study continuous-time controlled Markov chains on the finite horizon. For the Markov decision problem, we show that the value function is the unique solution of the corresponding dynamic programming equation. This leads to the existence of an optimal Markov control. We then consider a zero-sum game. We show that the value function exists and is the unique solution of the corresponding Isaacs equations. This yields the existence of a pair of saddle point Markov strategies.
Keywords
- Finite Horizon Case
- Dynamic Programming Equation
- Continuous-time Controlled Markov Chains
- Optimal Markov Strategies
- Isaacs Equation
6.1 Introduction
This chapter studies continuous-time Markov decision processes and continuous-time zero-sum stochastic dynamic games. In the continuous-time setup, although the infinite horizon case has been well studied, the literature on the finite horizon case is sparse. Infinite horizon continuous-time Markov decision processes have been studied by many authors (see, e.g., [5] and the references therein). In the finite horizon case, Pliska [7] used a semigroup approach to characterise the value function and the optimal control, but his approach yields only existential results. In this chapter, we show that the value function is a smooth solution of an appropriate dynamic programming equation. Our method of proof yields algorithms for computing the value function and an optimal control.
The situation is analogous for continuous-time zero-sum stochastic dynamic games. Here as well, the infinite horizon case has been studied in the literature [6], but to our knowledge the finite horizon case has not been treated. In this chapter, we prove that the value of the finite horizon game exists and is the solution of an appropriate Isaacs equation. This leads to the existence of a saddle point equilibrium.
The rest of our chapter is structured as follows. In Sect. 6.2 we analyse the finite horizon continuous-time MDP. Section 6.3 deals with zero-sum stochastic dynamic games. We conclude our chapter in Sect. 6.4 with a few remarks.
6.2 Finite Horizon Continuous-Time MDP
Throughout this chapter the time horizon is \(T\). The control model we consider is given by
$$(X,\; U,\; \lambda,\; Q,\; c),$$
where each element is described below.
The state space \(X\). The state space \(X\) is the set of states of the process under observation which is assumed to be a Polish space.
The action space \(U\). The decision-maker dynamically takes his action from the action space \(U\). We assume that \(U\) is a compact metric space.
The instantaneous transition rate \(\lambda \). \(\lambda : [0,T] \times X \times U \rightarrow [0,\infty )\) is a given function satisfying the following assumption:
-
(A1)
\(\lambda \) is continuous and there exists a constant \(M\) such that
$$\sup\limits_{t,x,u}\lambda (t,x,u) \leq M.$$

The transition probability kernel \(Q\). For fixed \(t \in [0,T]\), \(x \in X\), \(u \in U\), \(Q(t,x,u,\cdot)\) is a probability measure on \(X\) with \(Q(t,x,u,\{x\}) = 0\). \(Q\) satisfies the following:
-
(A2)
\(Q\) is weakly continuous, i.e. if \({x}_{n} \rightarrow x\), \({t}_{n} \rightarrow t\), \({u}_{n} \rightarrow u\), then for any \(f \in {C}_{b}(X)\)
$${\int}_{X}f(z)Q({t}_{n},{x}_{n},{u}_{n},\mathrm{d}z) \rightarrow {\int}_{X}f(z)Q(t,x,u,\mathrm{d}z).$$

The cost rate \(c\). \(c : [0,T] \times X \times U \rightarrow [0,\infty )\) is a given function satisfying the following assumption:
-
(A3)
\(c\) is continuous and there exists a finite constant \(\tilde{C}\) such that
$$\sup\limits_{t,x,u}c(t,x,u) \leq \tilde{ C}.$$
Next we give an informal description of the evolution of the controlled system. Suppose that the system is in state \(x\) at time \(t \in [0,T)\) and the controller, or decision-maker, takes an action \(u \in U\). Then the following happens over the time interval \([t, t + \mathrm{d}t]\):
-
1.
The decision maker has to pay an infinitesimal cost \(c(t,x,u)\mathrm{d}t,\) and
-
2.
A transition from state \(x\) to a set \(A\) (not containing \(x\)) occurs with probability
$$\lambda (t,x,u)\mathrm{d}t{\int \nolimits \nolimits }_{A}Q(t,x,u,\mathrm{d}z) + o(\mathrm{d}t);$$or the system remains in state \(x\) with probability
$$1 - \lambda (t,x,u)\mathrm{d}t + o(\mathrm{d}t).$$

Now we describe the optimal control problem. To this end we first describe the set of admissible controls. Let
$$\bf{u} : [0,T] \times X \rightarrow U$$be a measurable function. Let \(\mathcal{U}\) denote the set of all such measurable functions; this is the set of admissible controls. Such controls are called Markov controls. For each \(\bf{u} \in \mathcal{U}\), it can be shown (see [1, 3]) that there exists a strong Markov process \(\{{X}_{t}\}\) having the generator
$$\begin{array}{rcl}{\mathcal{A}}_{t}^{\bf{u}}f(x) = \lambda (t,x,\bf{u}(t,x))\left[{\int}_{X}f(z)Q(t,x,\bf{u}(t,x),\mathrm{d}z) - f(x)\right],& & \\ \end{array}$$where \(f\) is a bounded measurable function.
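As a toy illustration of the dynamics just described, the following Python sketch simulates one path of a controlled chain over \([0,T]\) on a three-point state space. The concrete choices below (the rate `lam`, the kernel `Q`, and the Markov control `policy`) are hypothetical stand-ins satisfying (A1)–(A2); they are not from the chapter.

```python
import random

# Hypothetical toy model: X = {0, 1, 2}, U = {0, 1}, horizon T = 1.
T = 1.0
X_STATES = [0, 1, 2]

def lam(t, x, u):
    # jump rate; bounded, sup <= M = 2 (assumption (A1))
    return 1.0 + 0.5 * u * (x + 1) / 3.0

def Q(t, x, u):
    # post-jump distribution: uniform over the OTHER states (no self-jumps,
    # matching Q(t,x,u,{x}) = 0)
    others = [y for y in X_STATES if y != x]
    return {y: 1.0 / len(others) for y in others}

def policy(t, x):
    # a Markov control u(t, x) taking values in U = {0, 1}
    return 0 if x == 0 else 1

def simulate(x0, seed=0):
    rng = random.Random(seed)
    t, x = 0.0, x0
    path = [(t, x)]
    while True:
        u = policy(t, x)
        # holding time: exponential with the current rate (constant between
        # jumps in this toy model)
        hold = rng.expovariate(lam(t, x, u))
        if t + hold > T:
            break                      # no further jump before the horizon
        t += hold
        r, acc = rng.random(), 0.0
        for y, p in Q(t, x, u).items():
            acc += p
            if r <= acc:
                x = y                  # jump according to the kernel Q
                break
        path.append((t, x))
    return path

path = simulate(x0=0)
```

Since the rates here do not vary between jumps, exponential holding times are exact; for genuinely time-dependent rates one would simulate via thinning or uniformization instead.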
For each \(\bf{u} \in \mathcal{U}\), define
$${V}^{\bf{u}}(t,x) ={\mathbb{E}}_{t,x}^{\bf{u}}\left[{\int}_{t}^{T}c(s,{X}_{s},\bf{u}(s,{X}_{s}))\,\mathrm{d}s + g({X}_{T})\right],$$
where \(g : X \rightarrow {\mathbb{R}}_{+}\) is the terminal cost function, assumed to be bounded and continuous, and \({\mathbb{E}}_{t,x}^{\bf{u}}\) is the expectation operator under the control \(\bf{u}\) with initial condition \({X}_{t} = x\). The aim of the controller is to minimise \({V}^{\bf{u}}\) over all \(\bf{u} \in \mathcal{U}\). Define
$$V(t,x) =\inf\limits_{\bf{u}\in \mathcal{U}}{V}^{\bf{u}}(t,x).$$
The function \(V\) is called the value function. If \({\bf{u}}^{*}\in \mathcal{U}\) satisfies
$${V}^{{\bf{u}}^{*}}(t,x) = V(t,x)\quad \text{for all } (t,x) \in [0,T] \times X,$$
then \({\bf{u}}^{*}\) is called an optimal control.
The associated dynamic programming equation is
$$\frac{\partial \varphi}{\partial t}(t,x) +\inf\limits_{u\in U}\left[c(t,x,u) - \lambda (t,x,u)\varphi (t,x) + \lambda (t,x,u){\int}_{X}\varphi (t,z)Q(t,x,u,\mathrm{d}z)\right] = 0,\qquad \varphi (T,x) = g(x). \tag{6.3}$$
The importance of (6.3) is illustrated by the following verification theorem.
Theorem 6.2.1
If (6.3) has a solution \(\varphi \) in \({C}_{b}^{1,0}([0,T] \times X)\), then \(\varphi = V\), the value function. Moreover, if \({\bf{u}}^{*}\in \mathcal{U}\) is such that
$$c(t,x,{\bf{u}}^{*}(t,x)) - \lambda (t,x,{\bf{u}}^{*}(t,x))\varphi (t,x) + \lambda (t,x,{\bf{u}}^{*}(t,x)){\int}_{X}\varphi (t,z)Q(t,x,{\bf{u}}^{*}(t,x),\mathrm{d}z) =\inf\limits_{u\in U}\left[c(t,x,u) - \lambda (t,x,u)\varphi (t,x) + \lambda (t,x,u){\int}_{X}\varphi (t,z)Q(t,x,u,\mathrm{d}z)\right]$$
for all \((t,x)\), then \({\bf{u}}^{*}\) is an optimal control.
Proof.
Applying the Ito–Dynkin formula to the solution \(\varphi \) of (6.3), for any \(\bf{u}\in \mathcal{U}\) we obtain
$$\varphi (t,x) \leq {\mathbb{E}}_{t,x}^{\bf{u}}\left[{\int}_{t}^{T}c(s,{X}_{s},\bf{u}(s,{X}_{s}))\,\mathrm{d}s + g({X}_{T})\right] = {V}^{\bf{u}}(t,x),$$
so that \(\varphi \leq V\). For \(\bf{u} ={\bf{u}}^{*}\) as in the statement of the theorem, the inequality above becomes an equality, whence \(\varphi = {V}^{{\bf{u}}^{*}} = V\) and \({\bf{u}}^{*}\) is an optimal control. The existence of such a \({\bf{u}}^{*}\) follows by a standard measurable selection theorem [2]. □
In view of the above theorem, it suffices to show that (6.3) has a solution in \({C}_{b}^{1,0}([0,T] \times X)\).
Theorem 6.2.2
Under \((A1)\) –(A3), the dynamic programming equation (6.3) has a unique solution in \({C}_{b}^{1,0}([0,T] \times X).\)
Proof.
Let \(\varphi (t,x) ={\mathrm{e}}^{-\gamma t}\psi (t,x)\) for some \(\gamma < \infty \). Then from (6.3) we get
$$\frac{\partial \psi}{\partial t}(t,x) - \gamma \psi (t,x) +\inf\limits_{u\in U}\left[{\mathrm{e}}^{\gamma t}c(t,x,u) - \lambda (t,x,u)\psi (t,x) + \lambda (t,x,u){\int}_{X}\psi (t,z)Q(t,x,u,\mathrm{d}z)\right] = 0,\qquad \psi (T,x) ={\mathrm{e}}^{\gamma T}g(x).$$
Thus (6.3) has a solution if and only if the above equation has a solution. The above differential equation is equivalent to the following integral equation:
$$\psi (t,x) ={\mathrm{e}}^{\gamma t}g(x) +{\int}_{t}^{T}{\mathrm{e}}^{\gamma (t-s)}\inf\limits_{u\in U}\left[{\mathrm{e}}^{\gamma s}c(s,x,u) - \lambda (s,x,u)\psi (s,x) + \lambda (s,x,u){\int}_{X}\psi (s,z)Q(s,x,u,\mathrm{d}z)\right]\mathrm{d}s.$$
Let \({C}_{b}^{\mathrm{unif}}([0,T] \times X)\) be the space of bounded continuous functions \(\varphi \) on \([0,T] \times X\) with the additional property that given \(\varepsilon > 0\) there exists \(\delta > 0\) such that
$$\sup\limits_{x}\vert \varphi (t + h,x) - \varphi (t,x)\vert < \varepsilon \ \text{ for all } t,\ \text{whenever } \vert h\vert < \delta.$$
Suppose \({\varphi }_{n} \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\) and \({\varphi }_{n} \rightarrow \varphi \) uniformly. Then
$$\vert \varphi (t + h,x) - \varphi (t,x)\vert \leq \vert \varphi (t + h,x) -{\varphi }_{n}(t + h,x)\vert + \vert {\varphi }_{n}(t + h,x) -{\varphi }_{n}(t,x)\vert + \vert {\varphi }_{n}(t,x) - \varphi (t,x)\vert.$$
Given \(\varepsilon > 0\), there exists \({n}_{0}\) such that \({\sup }_{t,x}\vert {\varphi }_{{n}_{0}}(t,x) - \varphi (t,x)\vert < \frac{\varepsilon }{3}\), and for this \({n}_{0}\), there exists \(\delta > 0\) such that \(\sup\limits_{x}\vert {\varphi }_{{n}_{0}}(t + h,x) - {\varphi }_{{n}_{0}}(t,x)\vert < \frac{\varepsilon }{3}\) whenever \(\vert h\vert < \delta \). Putting \(n = {n}_{0}\), we get from the above inequality
$$\sup\limits_{x}\vert \varphi (t + h,x) - \varphi (t,x)\vert < \varepsilon \ \text{ whenever } \vert h\vert < \delta.$$
Thus \({C}_{b}^{\mathrm{unif}}([0,T] \times X)\) is a closed subspace of \({C}_{b}([0,T] \times X)\), and hence it is a Banach space.
Now for \(\varphi \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\), it follows from the assumption on \(Q\) that \({\int}_{X}\varphi (t,z)Q(t,x,u,\mathrm{d}z)\) is continuous in \(t\), \(x\) and \(u\). Define the operator \(\mathcal{T}\) by
$$(\mathcal{T}\psi )(t,x) ={\mathrm{e}}^{\gamma t}g(x) +{\int}_{t}^{T}{\mathrm{e}}^{\gamma (t-s)}\inf\limits_{u\in U}\left[{\mathrm{e}}^{\gamma s}c(s,x,u) - \lambda (s,x,u)\psi (s,x) + \lambda (s,x,u){\int}_{X}\psi (s,z)Q(s,x,u,\mathrm{d}z)\right]\mathrm{d}s.$$
For \({\psi }_{1},{\psi }_{2} \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\), we have
$$\vert (\mathcal{T}{\psi }_{1})(t,x) - (\mathcal{T}{\psi }_{2})(t,x)\vert \leq {\int}_{t}^{T}{\mathrm{e}}^{\gamma (t-s)}\,2M\,\Vert {\psi }_{1} -{\psi }_{2}\Vert \,\mathrm{d}s \leq \frac{2M}{\gamma }\Vert {\psi }_{1} -{\psi }_{2}\Vert.$$
Thus if we choose \(\gamma = 2M + 1\), then \(\mathcal{T}\) is a contraction and hence has a unique fixed point \({\psi }^{*}\). Then \({\mathrm{e}}^{-(2M+1)t}{\psi }^{*}\) is the unique solution of (6.3). \(\square \)
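The contraction argument above can be checked numerically. The sketch below discretises the integral equation on a time grid for a hypothetical two-state, two-action model (the functions `lam`, `c`, `jump_dist`, `g` are illustrative choices with \(M = 1\), not from the chapter) and iterates the operator \(\mathcal{T}\) with \(\gamma = 2M + 1\) until the Picard iterates settle.

```python
import math

# Hypothetical toy model: X = {0, 1}, U = {0, 1}, horizon T = 1.
T_HORIZON = 1.0
N = 100                       # time grid points
DT = T_HORIZON / N
M = 1.0                       # bound on the jump rate, as in (A1)
GAMMA = 2 * M + 1             # gamma = 2M + 1, as in the proof

def lam(t, x, u):             # jump rate, bounded by M = 1
    return 0.5 + 0.5 * u

def c(t, x, u):               # bounded running cost
    return 1.0 if x == 1 else 0.2 * u

def jump_dist(x):             # post-jump distribution: swap states, no self-jumps
    return {1 - x: 1.0}

def g(x):                     # terminal cost
    return float(x)

def apply_T(psi):
    """One application of the integral operator T to psi[i][x] on the grid."""
    out = [[0.0, 0.0] for _ in range(N + 1)]
    for i in range(N + 1):
        t = i * DT
        for x in (0, 1):
            acc = math.exp(GAMMA * t) * g(x)
            for j in range(i, N):        # left Riemann sum over [t, T]
                s = j * DT
                h = min(
                    math.exp(GAMMA * s) * c(s, x, u)
                    - lam(s, x, u) * psi[j][x]
                    + lam(s, x, u) * sum(p * psi[j][z]
                                         for z, p in jump_dist(x).items())
                    for u in (0, 1))
                acc += math.exp(GAMMA * (t - s)) * h * DT
            out[i][x] = acc
    return out

def sup_dist(a, b):
    return max(abs(a[i][x] - b[i][x]) for i in range(N + 1) for x in (0, 1))

# Picard iteration: successive applications of T contract in the sup norm
psi = [[0.0, 0.0] for _ in range(N + 1)]
gaps = []
for _ in range(50):
    nxt = apply_T(psi)
    gaps.append(sup_dist(nxt, psi))
    psi = nxt

# The value function estimate is V(t, x) ~ exp(-GAMMA * t) * psi(t, x);
# at t = 0 the factor is 1.
V0 = psi[0][0]
```

The successive gaps shrink roughly by the factor \(2M/(2M+1)\) per iteration, which is the discrete shadow of the contraction estimate in the proof.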
6.3 Zero-Sum Stochastic Game
In this section, we consider a zero-sum stochastic game. The control model we consider here is given by
$$(X,\; U,\; V,\; \lambda,\; Q,\; r),$$
where \(X\) is the state space as before; \(U\) and \(V\) are the action spaces of players I and II, respectively; \(\lambda \) and \(Q\) denote the rate and the transition kernel, respectively, which now depend on the additional parameter \(v\); and \(r\) is the reward rate. The dynamics of the game are similar to those of the MDP, with appropriate modifications. Here player I receives a payoff from player II. The aim of player I is to maximise his payoff, while player II seeks to minimise the payoff to player I.
Now we describe the strategies of the players. In order to solve the problem, we need to consider Markov relaxed strategies. Let \(\mathcal{P}(U)\) and \(\mathcal{P}(V)\) denote the spaces of probability measures on \(U\) and \(V\), respectively. We denote the space of strategies of player I by \(\mathcal{U}\) and that of player II by \(\mathcal{V}\), where
$$\mathcal{U} =\{\bf{u} : [0,T] \times X \rightarrow \mathcal{P}(U)\ \text{measurable}\},\qquad \mathcal{V} =\{\mathbf{v} : [0,T] \times X \rightarrow \mathcal{P}(V)\ \text{measurable}\}.$$
Now corresponding to \(\lambda \), \(Q\) and \(r\), define, for \(\mu \in \mathcal{P}(U)\) and \(\nu \in \mathcal{P}(V )\),
$$\tilde{\lambda }(t,x,\mu,\nu ) ={\int}_{V}{\int}_{U}\lambda (t,x,u,v)\,\mu (\mathrm{d}u)\,\nu (\mathrm{d}v),\qquad \tilde{r}(t,x,\mu,\nu ) ={\int}_{V}{\int}_{U}r(t,x,u,v)\,\mu (\mathrm{d}u)\,\nu (\mathrm{d}v),$$
and
$$\tilde{\lambda }(t,x,\mu,\nu )\,\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z) ={\int}_{V}{\int}_{U}\lambda (t,x,u,v)\,Q(t,x,u,v,\mathrm{d}z)\,\mu (\mathrm{d}u)\,\nu (\mathrm{d}v).$$
As in the previous section, we make the following assumptions:
-
(A1′)
\(\lambda \) is continuous and there exists a finite constant \(M\) such that
$$\sup\limits_{t,x,u,v}\lambda (t,x,u,v) \leq M.$$ -
(A2′)
\(Q\) is weakly continuous, i.e. if \({x}_{n} \rightarrow x\), \({t}_{n} \rightarrow t\), \({u}_{n} \rightarrow u\) and \({v}_{n} \rightarrow v\), then for any \(f \in {C}_{b}(X)\)
$${\int \nolimits \nolimits }_{X}f(z)Q({t}_{n},{x}_{n},{u}_{n},{v}_{n},\mathrm{d}z) \rightarrow {\int \nolimits \nolimits }_{X}f(z)Q(t,x,u,v,\mathrm{d}z).$$ -
(A3′)
\(r\) is continuous and there exists a finite constant \(\tilde{C}\) such that
$$\sup\limits_{t,x,u,v}r(t,x,u,v) \leq \tilde{C}.$$

If the players use strategies \((\bf{u},\mathbf{v}) \in \mathcal{U}\times \mathcal{V}\), then the expected payoff to player I is given by
$${\mathbb{E}}_{t,x}^{\bf{u},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{ s},\bf{u}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ]$$where \(g\) is the terminal reward function which is assumed to be bounded and continuous. Now we define the upper and lower values for our game. Define
$$\overline{V }(t,x) =\inf\limits_{\mathbf{v}\in \mathcal{V}}\sup\limits_{\bf{u}\in \mathcal{U}}{\mathbb{E}}_{t,x}^{\bf{u},\mathbf{v}}\left[{\int}_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right].$$

Also define

$$\underline{V}(t,x) =\sup\limits_{\bf{u}\in \mathcal{U}}\inf\limits_{\mathbf{v}\in \mathcal{V}}{\mathbb{E}}_{t,x}^{\bf{u},\mathbf{v}}\left[{\int}_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right].$$

The function \(\overline{V }\) is called the upper value function of the game, and \(\underline{V}\) the lower value function. Since player I tries to maximise his payoff and player II tries to minimise the payoff to player I, \(\underline{V}\) is the minimum payoff that player I is guaranteed to receive, and \(\overline{V }\) is the greatest amount that player II may be forced to pay player I. In general \(\underline{V} \leq \overline{V }\). If \(\overline{V }(t,x) = \underline{V}(t,x)\), then the game is said to have a value. A strategy \({\bf{u}}^{*}\) is said to be an optimal strategy for player I if
$$\begin{array}{rcl}{ \mathbb{E}}_{t,x}^{{\bf{u}}^{{_\ast}},\mathbf{v}}\left [{\int \nolimits \nolimits }_{t}^{T}\tilde{r}(s,{X}_{ s},{\bf{u}}^{{_\ast}}(s,{X}_{ s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right ] \geq \overline{V }(t,x)& & \\ \end{array}$$for any \(t,x,\mathbf{v}\).
Similarly, \({\mathbf{v}}^{*}\) is called an optimal policy for player II if
$${\mathbb{E}}_{t,x}^{\bf{u},{\mathbf{v}}^{*}}\left[{\int}_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),{\mathbf{v}}^{*}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right] \leq \underline{V}(t,x)$$
for any \(t,x,\bf{u}\). Such a pair \(({\bf{u}}^{*},{\mathbf{v}}^{*})\), if it exists, is called a saddle point equilibrium. Our aim is to find the value of the game and optimal strategies for both players. To this end, consider the following pair of Isaacs equations:
$$\frac{\partial \varphi}{\partial t}(t,x) +\sup\limits_{\mu \in \mathcal{P}(U)}\inf\limits_{\nu \in \mathcal{P}(V )}\left[\tilde{r}(t,x,\mu,\nu ) -\tilde{\lambda }(t,x,\mu,\nu )\varphi (t,x) +\tilde{\lambda }(t,x,\mu,\nu ){\int}_{X}\varphi (t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right] = 0,\qquad \varphi (T,x) = g(x), \tag{6.5}$$
and
$$\frac{\partial \varphi}{\partial t}(t,x) +\inf\limits_{\nu \in \mathcal{P}(V )}\sup\limits_{\mu \in \mathcal{P}(U)}\left[\tilde{r}(t,x,\mu,\nu ) -\tilde{\lambda }(t,x,\mu,\nu )\varphi (t,x) +\tilde{\lambda }(t,x,\mu,\nu ){\int}_{X}\varphi (t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right] = 0,\qquad \varphi (T,x) = g(x). \tag{6.6}$$
By Fan’s minimax theorem [4], we have that if \(\varphi \in {C}_{b}^{1,0}([0,T] \times X)\) is a solution of (6.5), then it is also a solution of (6.6) and vice versa. The importance of Isaacs equations is illustrated by the following theorem.
Theorem 6.3.1
Let \({\varphi }^{{_\ast}}\in {C}_{b}^{1,0}([0,T] \times X)\) be a solution of (6.5) and (6.6). Then
-
(i)
\({\varphi }^{{_\ast}}\) is the value of the game.
-
(ii)
Let \(({\bf{u}}^{{_\ast}},{\mathbf{v}}^{{_\ast}}) \in \mathcal{U}\times \mathcal{V}\) be such that
$$\begin{array}{rcl} & \inf\limits_{\nu \in \mathcal{P}(V )}\left[\tilde{r}(t,x,{\bf{u}}^{*}(t,x),\nu ) -\tilde{\lambda }(t,x,{\bf{u}}^{*}(t,x),\nu ){\varphi }^{*}(t,x) +\tilde{\lambda }(t,x,{\bf{u}}^{*}(t,x),\nu )\right. \\ &\left. \qquad \qquad {\int}_{X}{\varphi }^{*}(t,z)\tilde{Q}(t,x,{\bf{u}}^{*}(t,x),\nu,\mathrm{d}z)\right] \\ & =\sup\limits_{\mu \in \mathcal{P}(U)}\inf\limits_{\nu \in \mathcal{P}(V )}\left[\tilde{r}(t,x,\mu,\nu ) -\tilde{\lambda }(t,x,\mu,\nu ){\varphi }^{*}(t,x) +\tilde{\lambda }(t,x,\mu,\nu ) \right. \\ &\left. \qquad \qquad \qquad \qquad {\int}_{X}{\varphi }^{*}(t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right] & \end{array} \tag{6.7}$$

and

$$\begin{array}{rcl} & \sup\limits_{\mu \in \mathcal{P}(U)}\left[\tilde{r}(t,x,\mu,{\mathbf{v}}^{*}(t,x)) -\tilde{\lambda }(t,x,\mu,{\mathbf{v}}^{*}(t,x)){\varphi }^{*}(t,x) +\tilde{\lambda }(t,x,\mu,{\mathbf{v}}^{*}(t,x))\right. \\ & \left.\qquad \qquad {\int}_{X}{\varphi }^{*}(t,z)\tilde{Q}(t,x,\mu,{\mathbf{v}}^{*}(t,x),\mathrm{d}z)\right] \\ & =\inf\limits_{\nu \in \mathcal{P}(V )}\sup\limits_{\mu \in \mathcal{P}(U)}\left[\tilde{r}(t,x,\mu,\nu ) -\tilde{\lambda }(t,x,\mu,\nu ){\varphi }^{*}(t,x) +\tilde{\lambda }(t,x,\mu,\nu ) \right. \\ & \left.\qquad \qquad \qquad \qquad {\int}_{X}{\varphi }^{*}(t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right]. \end{array} \tag{6.8}$$

Then \({\bf{u}}^{*}\) is an optimal policy for player I and \({\mathbf{v}}^{*}\) is an optimal policy for player II.
Proof.
Let \({\bf{u}}^{*}\) be as in (6.7) and \(\mathbf{v}\) be any arbitrary strategy of player II. Then by the Ito–Dynkin formula applied to the solution \({\varphi }^{*}\), we obtain
$${\mathbb{E}}_{t,x}^{{\bf{u}}^{*},\mathbf{v}}\left[{\int}_{t}^{T}\tilde{r}(s,{X}_{s},{\bf{u}}^{*}(s,{X}_{s}),\mathbf{v}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right] \geq {\varphi }^{*}(t,x).$$
Now let \({\mathbf{v}}^{*}\) be as in (6.8) and let \(\bf{u}\) be any arbitrary strategy of player I. Then again by the Ito–Dynkin formula we obtain
$${\mathbb{E}}_{t,x}^{\bf{u},{\mathbf{v}}^{*}}\left[{\int}_{t}^{T}\tilde{r}(s,{X}_{s},\bf{u}(s,{X}_{s}),{\mathbf{v}}^{*}(s,{X}_{s}))\mathrm{d}s + g({X}_{T})\right] \leq {\varphi }^{*}(t,x).$$
From the above two inequalities, it follows that
$${\varphi }^{*}(t,x) \leq \underline{V}(t,x) \leq \overline{V }(t,x) \leq {\varphi }^{*}(t,x).$$
Hence \({\varphi }^{{_\ast}}\) is the value of the game. Moreover it follows that \(({\bf{u}}^{{_\ast}},{\mathbf{v}}^{{_\ast}})\) is a saddle point equilibrium. \(\square \)
Now our aim is to find a solution of (6.5) (and hence of (6.6)) in \({C}_{b}^{1,0}([0,T] \times X)\). Our next theorem asserts the existence of such a solution.
Theorem 6.3.2
Under \((A1^{\prime})\) –(A3′), equation (6.5) has a unique solution in \({C}_{b}^{1,0}([0,T] \times X).\)
Proof.
Let \(\varphi (t,x) ={\mathrm{e}}^{-\gamma t}\psi (t,x)\) for some \(\gamma < \infty \). Substituting in (6.5), we get
$$\frac{\partial \psi}{\partial t}(t,x) - \gamma \psi (t,x) +\sup\limits_{\mu \in \mathcal{P}(U)}\inf\limits_{\nu \in \mathcal{P}(V )}\left[{\mathrm{e}}^{\gamma t}\tilde{r}(t,x,\mu,\nu ) -\tilde{\lambda }(t,x,\mu,\nu )\psi (t,x) +\tilde{\lambda }(t,x,\mu,\nu ){\int}_{X}\psi (t,z)\tilde{Q}(t,x,\mu,\nu,\mathrm{d}z)\right] = 0,\qquad \psi (T,x) ={\mathrm{e}}^{\gamma T}g(x).$$
Thus (6.5) has a solution if and only if the above equation has a solution. The above differential equation is equivalent to the following integral equation:
$$\psi (t,x) ={\mathrm{e}}^{\gamma t}g(x) +{\int}_{t}^{T}{\mathrm{e}}^{\gamma (t-s)}\sup\limits_{\mu \in \mathcal{P}(U)}\inf\limits_{\nu \in \mathcal{P}(V )}\left[{\mathrm{e}}^{\gamma s}\tilde{r}(s,x,\mu,\nu ) -\tilde{\lambda }(s,x,\mu,\nu )\psi (s,x) +\tilde{\lambda }(s,x,\mu,\nu ){\int}_{X}\psi (s,z)\tilde{Q}(s,x,\mu,\nu,\mathrm{d}z)\right]\mathrm{d}s.$$
Let \({C}_{b}^{\mathrm{unif}}([0,T] \times X)\) be the same space as defined in the previous section. Define the operator \(\mathcal{T}\) on this space by the right-hand side of the integral equation above. For \({\psi }_{1},{\psi }_{2} \in {C}_{b}^{\mathrm{unif}}([0,T] \times X)\), we have
$$\vert (\mathcal{T}{\psi }_{1})(t,x) - (\mathcal{T}{\psi }_{2})(t,x)\vert \leq {\int}_{t}^{T}{\mathrm{e}}^{\gamma (t-s)}\,2M\,\Vert {\psi }_{1} -{\psi }_{2}\Vert \,\mathrm{d}s \leq \frac{2M}{\gamma }\Vert {\psi }_{1} -{\psi }_{2}\Vert.$$
Thus if we choose \(\gamma = 2M + 1\), then \(\mathcal{T}\) is a contraction and hence has a unique fixed point \({\psi }^{*}\). Then \({\mathrm{e}}^{-(2M+1)t}{\psi }^{*}\) is the unique solution of (6.5). \(\square \)
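The minimax interchange behind (6.5) and (6.6), which Fan's theorem guarantees for the bracketed expression, can be illustrated on a finite toy game: for a payoff bilinear in the mixed strategies \((\mu,\nu )\), the sup-inf and the inf-sup coincide. The 2×2 reward matrix below is a hypothetical example, not from the chapter; scanning a grid of mixed strategies suffices here because the payoff is bilinear in \((p,q)\).

```python
# Hypothetical matching-pennies rewards to player I (the maximiser).
R = [[1.0, -1.0],
     [-1.0, 1.0]]

# Mixed strategies: player I plays row 0 with probability p, player II plays
# column 0 with probability q; scan both on a grid containing 1/2.
GRID = [i / 100 for i in range(101)]

def payoff(p, q):
    """Expected reward under the mixed-strategy pair (p, q)."""
    return (p * q * R[0][0] + p * (1 - q) * R[0][1]
            + (1 - p) * q * R[1][0] + (1 - p) * (1 - q) * R[1][1])

lower = max(min(payoff(p, q) for q in GRID) for p in GRID)  # sup_mu inf_nu
upper = min(max(payoff(p, q) for p in GRID) for q in GRID)  # inf_nu sup_mu
```

For matching pennies the common value is 0, attained at \(p = q = 1/2\) — a saddle point that exists only because the players randomise, which is exactly why relaxed (randomised) Markov strategies are used in this section.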
6.4 Conclusion
In this chapter we have established smooth solutions of dynamic programming equations for continuous-time controlled Markov chains on the finite horizon. This has led to the existence of an optimal Markov strategy for continuous-time MDP and saddle point equilibrium in Markov strategies for zero-sum games. We have used the boundedness condition on the cost function \(c\) for simplicity. For continuous-time MDP, if \(c\) is unbounded above, then we can show that \(V (t,x)\) is the minimal non-negative solution of (6.3) by approximating the cost function \(c\) by \(c \wedge n\) for a positive integer \(n\) and then letting \(n \rightarrow \infty \). If \(c\) is unbounded on both sides and it satisfies a suitable growth condition, then again we can prove the existence of unique solutions of dynamic programming equations in \({C}^{1,0}([0,T] \times X)\) with appropriate weighted norm; see [5] and [6] for analogous results.
References
A. Arapostathis, V. S. Borkar and M. K. Ghosh, Ergodic Control of Diffusion Processes, Cambridge University Press, 2011.
V. E. Benes, Existence of optimal strategies based on specified information for a class of stochastic decision problems, SIAM J. Control 8 (1970), 179–188.
M. H. A. Davis, Markov Models and Optimization, Chapman and Hall, 1993.
K. Fan, Fixed-point and minimax theorems in locally convex topological linear spaces, Proc. Natl. Acad. Sci. USA 38 (1952), 121–126.
X. Guo and O. Hernández-Lerma, Continuous-Time Markov Decision Processes. Theory and Applications, Springer-Verlag, 2009.
X. Guo and O. Hernández-Lerma, Zero-sum games for continuous-time jump Markov processes in Polish spaces: discounted payoffs, Adv. in Appl. Probab. 39 (2007), 645–668.
S. R. Pliska, Controlled jump processes, Stochastic Processes Appl. 3 (1975), 259–282.
© 2012 Springer Science+Business Media, LLC
Ghosh, M.K., Saha, S. (2012). Continuous-Time Controlled Jump Markov Processes on the Finite Horizon. In: Hernández-Hernández, D., Minjárez-Sosa, J. (eds) Optimization, Control, and Applications of Stochastic Systems. Systems & Control: Foundations & Applications. Birkhäuser, Boston. https://doi.org/10.1007/978-0-8176-8337-5_6