Abstract
We consider a piecewise deterministic Markov decision process, where the expected exponential utility of total (nonnegative) cost is to be minimized. The cost rate, transition rate and post-jump distributions are under control. The state space is Borel, and the transition and cost rates are locally integrable along the drift. Under natural conditions, we establish the optimality equation, justify the value iteration algorithm, and show the existence of a deterministic stationary optimal policy. Applied to special cases, the obtained results already significantly improve some existing results in the literature on finite horizon and infinite horizon discounted risk-sensitive continuous-time Markov decision processes.
1 Introduction
Since the pioneering work [16], risk-sensitive discrete-time Markov decision processes (DTMDPs) have been studied intensively. Restricting attention to total undiscounted or discounted problems, let us mention e.g., [4, 6, 7, 11, 12, 17, 18], most of which deal with the exponential utility, as does the present paper. As an application, an open problem in insurance was recently solved in [1] in the framework of risk-sensitive DTMDPs. There are notable differences between risk-sensitive and risk-neutral DTMDPs. For instance, in a finite model, i.e., when the state and action spaces are both finite, there is always a deterministic stationary optimal policy in a discounted risk-neutral DTMDP, but not always in a discounted risk-sensitive DTMDP; see [17].
One of the first works on risk-sensitive continuous-time Markov decision processes (CTMDPs) is [21], where only verification theorems were presented. Recently, there has been a revival of interest in this topic; see e.g., [8, 14, 20, 24, 25, 27]. A finite horizon total undiscounted risk-sensitive CTMDP was considered in [14, 21, 24], whose arguments can be summarized as follows. Firstly, the optimality equation is shown to admit a solution in a small enough class. Secondly, by using the Feynman–Kac formula, this solution is shown to be the value function, and any Markov policy attaining the minimizer in the optimality equation is optimal. The proofs of [14, 24] reveal that the main technicalities lie in the first step, for which the state space was assumed to be denumerable. This assumption is important for the diagonalization argument used in [24], which extends [14] from a bounded transition rate to a possibly unbounded transition rate whose growth is bounded by a Lyapunov function. The latter requirement and the boundedness of the cost rate then validate the Feynman–Kac formula applied in the second step. Wei [24] mentioned that it was unclear how to extend his argument to an unbounded cost rate; see Sect. 7 therein. Following a similar argument, a discounted risk-sensitive CTMDP was also considered in [14], although now the first step becomes, to quote the authors’ words (see p. 658 therein), “surprisingly far more involved”, for which the state space was further assumed to be finite; see Remark 3.6 therein. As a corollary of the present paper, we significantly weaken the restrictive conditions in [14, 24]; see Sect. 3 below.
The present paper is concerned with a risk-sensitive piecewise deterministic Markov decision process (PDMDP), where the expected exponential utility of the total cost is to be minimized. The state space is a general Borel space, and the transition and the nonnegative cost rates need only be locally integrable along the drift. A PDMDP is an extension of a CTMDP: between two consecutive jumps, the process evolves according to a deterministic Markov process. For simplicity, and to keep the conditions as weak as possible, we do not consider control of the drift. Although there is a vast literature on PDMDPs (see the well-known monographs [9, 10] and the references therein), to the best of our knowledge, risk-sensitive PDMDPs have not been systematically studied before.
Our main contributions are the following. We establish the optimality equation satisfied by the value function, justify the value iteration algorithm, and show the existence of a deterministic stationary optimal policy. As an application and corollary, finite horizon and infinite horizon discounted risk-sensitive CTMDPs are reformulated as total undiscounted risk-sensitive PDMDPs, and are thus treated in a unified way and under much weaker conditions than in [14, 24]. This is possible because we follow a different argument. Namely, we directly show that the value function satisfies the optimality equation, by reducing the total undiscounted risk-sensitive PDMDP to a risk-sensitive DTMDP. This method, which does not refer to the Feynman–Kac formula, was originally developed by Yushkevich [26] for risk-neutral CTMDPs. Later, it was employed in [2, 3, 9, 10, 13, 23] in studies of risk-neutral PDMDPs, and in [27] for risk-sensitive CTMDPs. In [8], restricting to stationary policies, the authors reduced the discounted risk-sensitive CTMDP with bounded transition rates to a DTMDP problem using the uniformization technique. The induced DTMDP is less standard (with a random cost), and was not further investigated there.
The rest of the paper is organized as follows. In Sect. 2 we describe the optimal control problem under consideration. In Sect. 3 we present the main results, the proofs of which are postponed to Sect. 4. We finish the paper with a conclusion in Sect. 5. Some relevant facts are collected in the Appendix for ease of reference.
2 Model Description and Problem Statement
2.1 Notations and Conventions
In what follows, \(\mathcal{{B}}(X)\) is the Borel \(\sigma \)-algebra of the topological space X, I stands for the indicator function, and \(\delta _{\{x\}}(\cdot )\) is the Dirac measure concentrated on the singleton \(\{x\},\) assumed to be measurable. A measure is \(\sigma \)-additive and \([0,\infty ]\)-valued. Below, unless stated otherwise, measurability is always understood in the Borel sense. Throughout this paper, we adopt the conventions of
If a mapping f is defined on X, and \(\{X_i\}\) is a partition of X, then when f is defined piecewise by \(f(x)=g_i(x)\) for all \(x\in X_i\), the notation \(f(x)=\sum _i I\{x\in X_i\}g_i(x)\) is used, even if f is not real-valued.
Let S be a nonempty Borel state space, A be a nonempty Borel action space, and q stand for a signed kernel q(dy|x, a) on \(\mathcal{{B}}(S)\) given \((x,a)\in S\times A\) such that
for all \(\Gamma _S\in \mathcal{{B}}(S).\) Throughout this article we assume that \(q(\cdot |x,a)\) is conservative and stable, i.e.,
where \(q_x(a):=-q(\{x\}|x,a).\) The signed kernel q is often called the transition rate. Between two consecutive jumps, the state of the process evolves according to a measurable mapping \(\phi \) from \(S\times [0,\infty )\) to S, see (5) below. It is assumed that for each \(x\in S\)
and \(t\rightarrow \phi (x,t)\) is continuous.
Finally let the cost rate c be a \([0,\infty )\)-valued measurable function on \(S\times A\). For simplicity, we do not consider the case of different admissible action spaces at different states.
Condition 2.1
-
(a)
For each bounded measurable function f on S and each \(x\in S\), \(\int _S f(y)\tilde{q}(dy|x,a)\) is continuous in \(a\in A.\)
-
(b)
For each \(x\in S,\) the (nonnegative) function c(x, a) is lower semicontinuous in \(a\in A.\)
-
(c)
The action space A is a compact Borel space.
Condition 2.2
For each \(x\in S\), \(\int _{0}^t\overline{q}_{\phi (x,s)}ds<\infty \), and \(\int _{0}^t \sup _{a\in A} c(\phi (x,s),a)ds<\infty \), for each \(t\in [0,\infty ).\)
The integrals in the above condition are well defined: the integrands are universally measurable in \(s\in [0,\infty )\); see Chapter 7 of [5].
We obtain the sample space \(\Omega \) by adjoining to the countable product space \(S\times ((0,\infty )\times S)^\infty \) the sequences of the form \((x_0,\theta _1,\dots ,\theta _n,x_n,\infty ,x_\infty ,\infty ,x_\infty ,\dots ),\) where \(x_0,x_1,\dots ,x_n\) belong to S, \(\theta _1,\dots ,\theta _n\) belong to \((0,\infty ),\) and \(x_{\infty }\notin S\) is an isolated point. We equip \(\Omega \) with its Borel \(\sigma \)-algebra \(\mathcal F\).
Let \(t_0(\omega ):=0=:\theta _0,\) and for each \(n\ge 0\), and each element \(\omega :=(x_0,\theta _1,x_1,\theta _2,\dots )\in \Omega \), let
and
Obviously, \((t_n(\omega ))\) are measurable mappings on \((\Omega ,\mathcal{F})\). In what follows, we often omit the argument \(\omega \in \Omega \) from the presentation for simplicity. Also, we regard \(x_n\) and \(\theta _{n+1}\) as the coordinate variables, and note that the pairs \(\{t_n,x_n\}\) form a marked point process with the internal history \(\{\mathcal{F}_t\}_{t\ge 0},\) i.e., the filtration generated by \(\{t_n,x_n\}\); see Chapter 4 of [19] for further details. The marked point process \(\{t_n,x_n\}\) defines the stochastic process \(\{\xi _t,t\ge 0\}\) on \((\Omega ,\mathcal{F})\) of interest by
where \(S_{\infty }:=S\bigcup \{x_\infty \},\) and we accept \(0\cdot x:=0\) and \(1\cdot x:=x\) for each \(x\in S_\infty .\)
A (history-dependent) policy \(\pi \) is given by a sequence \((\pi _n)\) such that, for each \(n=0,1,2,\dots ,\)\(\pi _n(da|x_0,\theta _1,\dots ,x_{n},s)\) is a stochastic kernel on A, and for each \(\omega =(x_0,\theta _1,x_1,\theta _2,\dots )\in \Omega \), \(t> 0,\)
where \(a_\infty \notin A\) is some isolated point. A policy \(\pi \) is called Markov if, with a slight abuse of notation, \( \pi (da|\omega ,s)=\pi ^M(da|\xi _{s-},s)\) for some stochastic kernel \(\pi ^M\). A Markov policy is further called deterministic if \(\pi ^M(da|x,s)=\delta _{\{f^M(x,s)\}}(da)\) for some measurable mapping \(f^M\) from \(S\times (0,\infty )\) to A. A policy is called deterministic stationary if for each \(n=0,1,\dots ,\) \(\pi _{n}(da|x_0,\theta _1,\dots ,\theta _n,x_n, t-t_n)=\delta _{\{f(\phi (x_n,t-t_n))\}}(da)\) for some measurable mapping f from S to A. We shall identify such a deterministic stationary policy with the underlying measurable mapping f.
The class of all policies is denoted by \(\Pi .\) Under a fixed policy \(\pi =(\pi _n)\), for each initial distribution \(\gamma \) on \((S,\mathcal{B}(S)),\) by using the Ionescu–Tulcea theorem, one can build a probability measure \(P_\gamma ^\pi \) on \((\Omega ,\mathcal{F})\) such that \(P_\gamma ^\pi (x_0\in \Gamma )=\gamma (\Gamma )\) for each \(\Gamma \in \mathcal{B}(S)\), and the conditional distribution of \((\theta _{n+1},x_{n+1})\) given \(x_0,\theta _1,x_1,\dots ,\theta _{n},x_n\) is given on \(\{\omega :x_n(\omega )\in S\}\) by
and given on \(\{\omega :x_n(\omega )=x_\infty \}\) by
Below, when \(\gamma \) is a Dirac measure concentrated at \(x\in S,\) we write \({}{P}_x^\pi .\) Expectations with respect to \({}{P}_\gamma ^\pi \) and \({}{P}_x^\pi \) are denoted by \({}{E}_{\gamma }^\pi \) and \({}{E}_{x}^\pi ,\) respectively. Roughly speaking, the uncontrolled version of the process evolves as follows: given the current state, the process evolves deterministically according to the mapping \(\phi \) up to the next jump, which takes place after a random time whose distribution is (nonstationary) exponential, and the dynamics then continue in a similar manner. A detailed book treatment, with many examples of this and more general types of processes allowing deterministic jumps, can be found in [10].
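To fix ideas, the uncontrolled dynamics just described can be simulated directly: along the drift, the sojourn time \(\theta \) until the next jump has survival function \(e^{-\int _0^\theta q_{\phi (x,s)}ds}\), so it can be sampled by inverse transform. The following Python sketch uses a toy one-dimensional model of our own choosing; the drift \(\phi \), the rate \(q_x\) and the post-jump distribution below are illustrative assumptions, not taken from this paper.

```python
import math
import random

def phi(x, t):
    """Deterministic drift between jumps (toy choice): exponential decay."""
    return x * math.exp(-t)

def rate(x):
    """Jump intensity q_x (toy choice): 1 + x, bounded below by 1."""
    return 1.0 + x

def sample_sojourn(x, rng):
    """Sample theta with P(theta > t) = exp(-int_0^t q_{phi(x,s)} ds).

    Inverse transform: solve H(theta) = E with E ~ Exp(1), where here
    H(t) = int_0^t (1 + x e^{-s}) ds = t + x (1 - e^{-t}).
    """
    e = rng.expovariate(1.0)
    lo, hi = 0.0, 1.0
    while hi + x * (1.0 - math.exp(-hi)) < e:   # bracket the root
        hi *= 2.0
    for _ in range(80):                          # bisection
        mid = 0.5 * (lo + hi)
        if mid + x * (1.0 - math.exp(-mid)) < e:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate(x0, horizon, rng):
    """Return the marked point process {(t_n, x_n)} up to the horizon."""
    t, x, path = 0.0, x0, [(0.0, x0)]
    while True:
        theta = sample_sojourn(x, rng)
        if t + theta > horizon:
            return path
        t += theta
        # Toy post-jump distribution: drift endpoint plus an Exp(2) kick.
        x = phi(x, theta) + rng.expovariate(2.0)
        path.append((t, x))

rng = random.Random(0)
path = simulate(x0=1.0, horizon=5.0, rng=rng)
```

Between the recorded jump epochs \(t_n\), the state of the process is \(\phi (x_n, t-t_n)\), in agreement with (5).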
For each \(x\in S\), and policy \(\pi =(\pi _n)\),
defines the performance measure of the policy \(\pi \in \Pi \) given the initial state \(x\in S.\) Here and below, we put \(c(x_\infty ,a):=0\) for each \(a\in A,\) and \(\phi (x_\infty ,t):=x_\infty \) for each \(t\in [0,\infty ).\) We are interested in the following optimal control problem for each \(x\in S:\)
A policy \(\pi ^*\) is called optimal if \( V(x,\pi ^*)=\inf _{\pi \in \Pi }V(x,\pi )=:V^*(x)\) for each \(x\in S\).
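For intuition about this criterion, consider a toy two-state CTMDP of our own making (not from this paper): state 0 incurs cost at constant rate c and jumps, after an \(\mathrm{Exp}(\lambda )\) sojourn, to an absorbing, cost-free state. The total cost is then \(cT\) with \(T\sim \mathrm{Exp}(\lambda )\), so the expected exponential utility equals \(\lambda /(\lambda -c)\) whenever \(c<\lambda \), and \(+\infty \) otherwise, which illustrates why the finiteness of the value function is a genuine restriction. A minimal Monte Carlo check of this closed form:

```python
import math
import random

def exp_utility_mc(lam, c, n, rng):
    """Monte Carlo estimate of V = E[exp(total cost)] in the toy two-state
    CTMDP: cost accrues at rate c until an Exp(lam) jump to an absorbing,
    cost-free state.  Exact value: lam / (lam - c) when c < lam."""
    total = 0.0
    for _ in range(n):
        t = rng.expovariate(lam)     # sojourn time in state 0
        total += math.exp(c * t)     # exponential utility of total cost c*t
    return total / n

rng = random.Random(1)
est = exp_utility_mc(lam=2.0, c=0.5, n=200_000, rng=rng)
exact = 2.0 / (2.0 - 0.5)            # = 4/3
```

The estimate is always at least 1, consistent with \(V^*(x)\ge 1\) noted below.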
The objective of this paper is to show, under the imposed conditions, the existence of a deterministic stationary optimal policy, and to establish the corresponding optimality equation satisfied by the value function \(V^*\), together with its value iteration. Evidently, \(V^*(x)\ge 1\) for each \(x\in S.\) Under the next condition, it will be seen that for each \(x\in S,\)\(V^*(\phi (x,s))\) is absolutely continuous in s.
Condition 2.3
For each \(x\in S,\)\(V^*(x)<\infty \).
The above condition is mainly assumed for notational convenience. In fact, the main optimality results (such as the existence of a deterministic stationary optimal policy) obtained in this paper can be established without assuming Condition 2.3, at the cost of some additional notation. In a nutshell, one has to consider the sets \( \hat{S}:=\{x\in S:~V^*(x)<\infty \} \) and \(S{\setminus }\hat{S}\) separately, and note that if \(x\in \hat{S}\), then \(\phi (x,t)\in \hat{S}\) for each \(t\in [0,\infty ).\) The reasoning presented under Condition 2.3 can be followed in an obvious manner. We formulate the corresponding optimality results in Remarks 3.1 and 3.2 below.
3 Main Statements
We first present the main optimality results concerning problem (8) for the PDMDP model. Their proofs are postponed to the next section.
Theorem 3.1
Suppose Conditions 2.1, 2.2 and 2.3 are satisfied. Then the following assertions hold.
- (a)
The value function \(V^*\) for problem (8) is the minimal \([1,\infty )\)-valued solution to the following optimality equation:
$$\begin{aligned}&-\,(V(\phi (x,t))-V(x))\\&\quad =\int _0^t \inf _{a\in A}\left\{ \int _S V(y)\tilde{q}(dy|\phi (x,\tau ),a)- (q_{\phi (x,\tau )}(a)\right. \\&\qquad \left. -\,c(\phi (x,\tau ),a) )V(\phi (x,\tau ))\right\} d\tau ,t\in [0,\infty ),x\in S. \end{aligned}$$In particular, \(V^*(\phi (x,t))\) is absolutely continuous in t for each \(x\in S.\)
- (b)
There exists a deterministic stationary optimal policy f, which can be taken as any measurable mapping from S to A such that
$$\begin{aligned}&\inf _{a\in A}\left\{ \int _S V^*(y)\tilde{q}(dy|x,a)- (q_{x}(a)-c(x,a))V^*(x)\right\} \\&\quad =\int _S V^*(y)\tilde{q}(dy|x,f(x))- (q_{x}(f(x))-c(x,f(x)))V^*(x),~\forall ~x\in S. \end{aligned}$$
Remark 3.1
By inspecting its proof, one can see the following version of Theorem 3.1 holds without assuming Condition 2.3. Suppose Conditions 2.1 and 2.2 are satisfied. Then the following assertions hold.
- (a)
The value function \(V^*\) for problem (8) is the minimal \([1,\infty ]\)-valued solution to the following optimality equation:
$$\begin{aligned}&-\,(V(\phi (x,t))-V(x))\\&\quad =\int _0^t \inf _{a\in A}\Bigg \{ \int _S V(y)\tilde{q}(dy|\phi (x,\tau ),a)- (q_{\phi (x,\tau )}(a)\\&\qquad -\,c(\phi (x,\tau ),a) )V(\phi (x,\tau ))\Bigg \}d\tau ,\quad t\in [0,\infty ),\quad x\in \hat{S};\\&\quad \qquad V(x)<\infty ,\quad x\in \hat{S};\quad V(x)=\infty ,\quad x\in S{\setminus }\hat{S}. \end{aligned}$$In particular, \(V^*(\phi (x,t))\) is absolutely continuous in t for each \(x\in \hat{S}.\)
- (b)
There exists a deterministic stationary optimal policy f, which can be taken as any measurable mapping from S to A such that
$$\begin{aligned}&\inf _{a\in A}\left\{ \int _S V^*(y)\tilde{q}(dy|x,a)- (q_{x}(a)-c(x,a))V^*(x)\right\} \\&\quad =\int _S V^*(y)\tilde{q}(dy|x,f(x))- (q_{x}(f(x))-c(x,f(x)))V^*(x),~\forall ~x\in \hat{S}. \end{aligned}$$
Next, we present the value iteration algorithm for the value function \(V^*\).
Theorem 3.2
Suppose Conditions 2.1, 2.2 and 2.3 are satisfied. Let \(V^{(0)}(x):=1\) for each \(x\in S\). For each \(n\ge 0,\) let \(V^{(n+1)}\) be the minimal \([1,\infty )\)-valued measurable solution to
such that \(V^{(n+1)}(\phi (x,t))\) is absolutely continuous in t for each \(x\in S.\) (For each \(n\ge 0,\) such a solution always exists.) Furthermore, \(\{V^{(n)}\}\) is a monotone nondecreasing sequence of measurable functions on S such that for each \(x\in S,\) \(V^{(n)}(x)\uparrow V^*(x)\) as \(n\uparrow \infty .\)
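As a concrete illustration of the monotone convergence \(V^{(n)}\uparrow V^*\), consider the CTMDP special case \(\phi (x,t)\equiv x\) and restrict attention, for illustration only, to constant controls. For an action a with \(q_x(a)>c(x,a)\), integrating \(e^{-(q_x(a)-c(x,a))\tau }\) over \(\tau \in [0,\infty )\) reduces the minimization to the fixed-point form \(V(x)=\min _{a} \int _S V(y)\tilde{q}(dy|x,a)/(q_x(a)-c(x,a))\), the boundary term vanishing. The Python sketch below iterates this operator, starting from \(V^{(0)}\equiv 1\), on a three-state toy model entirely of our own choosing (all states, rates and cost rates are illustrative assumptions).

```python
# States 0, 1 are transient; state 2 is absorbing and cost-free, so V(2) = 1.
# MODEL[(x, a)] = (total rate q_x(a), cost rate c(x, a), {y: rate to y}),
# chosen so that q_x(a) > c(x, a) for every state-action pair.
MODEL = {
    (0, "fast"): (2.0, 0.5, {2: 2.0}),    # jump straight to absorption
    (0, "detour"): (3.0, 0.5, {1: 3.0}),  # pass through state 1
    (1, "only"): (1.0, 0.2, {2: 1.0}),
}

def bellman(v):
    """One value-iteration step of the risk-sensitive fixed point
    V(x) = min_a sum_{y != x} q(y|x,a) V(y) / (q_x(a) - c(x,a))."""
    new = dict(v)
    for x in (0, 1):
        best = float("inf")
        for (s, a), (qx, c, jumps) in MODEL.items():
            if s != x:
                continue
            val = sum(r * v[y] for y, r in jumps.items()) / (qx - c)
            best = min(best, val)
        new[x] = best
    return new

v = {0: 1.0, 1: 1.0, 2: 1.0}   # V^(0) := 1
for _ in range(60):
    v = bellman(v)
```

In this toy model the iterates increase to the fixed point \(V^*(0)=4/3\), \(V^*(1)=5/4\), \(V^*(2)=1\); as in the theorem, convergence is monotone from below, and here it is exact after two iterations.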
Remark 3.2
Similar to Remark 3.1, we have the following version of Theorem 3.2 without assuming Condition 2.3. Suppose Conditions 2.1 and 2.2 are satisfied. Let \(V^{(0)}(x):=1\) for each \(x\in \hat{S}\) and \(V^{(0)}(x):=\infty \) if \(x\in S{\setminus }\hat{S}\). For each \(n\ge 0,\) let \(V^{(n+1)}\) be the minimal \([1,\infty ]\)-valued measurable solution to
Here \(V^{(n+1)}(\phi (x,t))\) is absolutely continuous in t for each \(x\in \hat{S}.\) (For each \(n\ge 0,\) such a solution always exists.) Furthermore, \(\{V^{(n)}\}\) is a monotone nondecreasing sequence of measurable functions on S such that for each \(x\in S,\) \(V^{(n)}(x)\uparrow V^*(x)\) as \(n\uparrow \infty .\)
We can apply our theorems to the special case of a CTMDP, i.e., \(\phi (x,t)\equiv x\) for each \(x\in S.\) The following \(\alpha \)-discounted risk-sensitive CTMDP problem was considered in [14]:
Here \(\alpha >0\) is a fixed constant. In fact, Ghosh and Saha [14] restricted themselves to Markov policies, bounded transition and cost rates, i.e., \(\sup _{x\in S}\overline{q}_x<\infty \) and \(\sup _{x\in S,a\in A}c(x,a)<\infty \), and a finite state space S. These restrictions, e.g., the finiteness of S, were needed for their investigations; see e.g., [14, Remark 3.6]. Under the compactness-continuity condition (Condition 2.1), it was shown in [14] that there exists an optimal Markov policy for the discounted risk-sensitive CTMDP, and the optimality equation was established. By using the theorems presented earlier in this section, we can obtain these optimality results for problem (10) in a much more general setup: the state space S is Borel, there is no boundedness requirement on the transition rate with respect to the state \(x\in S\), and the optimality is over the class of history-dependent policies. Furthermore, we let the CTMDP model be nonhomogeneous, i.e., the transition rate q(dy|t, x, a) is now a signed kernel on \(\mathcal{B}(S)\) from \((t,x,a)\in [0,\infty )\times S\times A\), satisfying the corresponding version of (3); the notation \(\tilde{q}\) is kept as before, see (2), with the extra argument t in addition to x. Similarly, the nonnegative cost rate c is allowed to be a measurable function on \([0,\infty )\times S\times A\).
Corollary 1
Consider the \(\alpha \)-discounted risk-sensitive (nonhomogeneous) CTMDP problem (10) with \(c(\xi _t,a)\) being replaced by \(c(t,\xi _t,a)\). Suppose
and the corresponding version of Condition 2.1, where x is replaced by (t, x), is satisfied by the nonhomogeneous CTMDP model. Then the following assertions hold.
- (a)
There exists some \([1,\infty )\)-valued measurable solution on \([0,\infty )\times S\) to
$$\begin{aligned}&-\,(V(t,x)-V(0,x))\\&\quad =\int _0^t \inf _{a\in A}\left\{ \int _S V(u,y)\tilde{q}(dy|u,x,a)+(e^{-\alpha u} c(u,x,a)\right. \\&\qquad \left. -\,q_{(u,x)}(a))V(u,x) \right\} du,~x\in S,~t\in [0,\infty ), \end{aligned}$$so that V(t, x) is absolutely continuous in t for each \(x\in S.\)
- (b)
Let L be the minimal \([1,\infty )\)-valued measurable solution on \([0,\infty )\times S\) to the above equation. Then the value function say \(L^*\) to the \(\alpha \)-discounted risk-sensitive CTMDP problem (10) (with \(c(\xi _t,a)\) being replaced by \(c(t,\xi _t,a)\)) is given by \(L^*(x)=L(0,x)\) for each \(x\in S.\)
- (c)
There exists an optimal deterministic Markov policy f for the \(\alpha \)-discounted risk-sensitive CTMDP problem (10) (with \(c(\xi _t,a)\) being replaced by \(c(t,\xi _t,a)\)). One can take f as any measurable mapping from \([0,\infty )\times S\) to A such that
$$\begin{aligned}&\inf _{a\in A}\left\{ \int _S L(u,y)\tilde{q}(dy|u,x,a)+(e^{-\alpha u} c(u,x,a)-q_{(u,x)}(a))L(u,x) \right\} \\&\quad =\int _S L(u,y)\tilde{q}(dy|u,x,f(u,x))+(e^{-\alpha u} c(u,x,f(u,x))\\&\qquad -\,q_{(u,x)}(f(u,x)))L(u,x) \end{aligned}$$for each \(u\in [0,\infty )\) and \(x\in S.\)
Proof
We prove this by reformulating the \(\alpha \)-discounted risk-sensitive (nonhomogeneous) CTMDP problem (10) in the form of problem (8) for a PDMDP, which we introduce as follows. We use the notation “hat” to distinguish this model from the original (nonhomogeneous) CTMDP model.
The state space is \(\hat{S}=[0,\infty )\times S.\)
The action space is the same as in the CTMDP: \(\hat{A}=A.\)
The transition rate \(\hat{q}(ds\times dy|(t,x),a)\) is defined by
$$\begin{aligned} \hat{q}(ds\times dy|(t,x),a):=\tilde{\hat{q}}(ds\times dy|(t,x),a)-I\{(t,x)\in ds\times dy\}q_{(t,x)}(a), \end{aligned}$$where
$$\begin{aligned} \tilde{\hat{q}}(ds\times dy|(t,x),a):=I\{t\in ds\}\tilde{q}(dy|t,x,a), \end{aligned}$$for each \((t,x)\in \hat{S}\) and \(a\in \hat{A}.\)
The drift is given by \(\hat{\phi }((t,x),s):=(t+s,x)\) for each \(x\in S\) and \(t,s\ge 0.\) Clearly it satisfies the corresponding version of (4).
The cost rate is given by
$$\begin{aligned} \hat{c}((t,x),a):=e^{-\alpha t} c(t,x,a),\quad ~\forall ~t\in [0,\infty ),\quad ~x\in S,\quad ~a\in A. \end{aligned}$$
Now the marked point process \(\{\hat{t}_n,\hat{x}_n\}\) and controlled process \(\hat{\xi }_t\) in this PDMDP model are connected to those in the original (nonhomogeneous) CTMDP model, namely \((t_n,x_n)\) and \(\xi _t\), via \(\hat{t}_n=t_n\), \(\hat{x}_n=(t_n,x_n),\) and \(\hat{\xi }_t=(t,\xi _t).\) For example, under a fixed policy \(\hat{\pi }\) and initial distribution \(\hat{\gamma }\) in this PDMDP model, the version of the first equation in (7) now reads on \(\{\omega :x_n(\omega )\in S\}\)
Clearly, Conditions 2.1, 2.2 and 2.3 are satisfied by this PDMDP model. It remains to apply Theorem 3.1. \(\square \)
The condition in the previous corollary is much weaker than in [14], and can be further weakened: one only needs the reformulated PDMDP to satisfy Conditions 2.1, 2.2 and 2.3. Moreover, the boundedness of the cost rate c was assumed in the previous corollary only to ensure that Condition 2.3 is satisfied. It can be relaxed if one formulates the previous corollary using the statements in Remarks 3.1 and 3.2.
One can also consider the risk-sensitive nonhomogeneous CTMDP problem on the finite horizon [0, T] with \(T>0\) being a fixed constant:
where g is a \([0,\infty )\)-valued measurable function; g(x) represents the terminal cost incurred when \(\xi _T=x\in S\). Let us put \(g(x_\infty ):=0.\) Here \(\alpha \) is a fixed nonnegative finite constant. A simpler version of this problem was considered in [24], with \(\alpha =0\) and a bounded cost rate, where additional restrictions were put on the growth of the transition rate. We can reformulate this problem as the PDMDP problem (8) just as above. The only difference is that now we put \(q_{(t,x)}(a)\equiv 0\) for each \(x\in S\) and \(t\ge T,\) and introduce the following cost rate for each \(x\in S\), \(t\ge 0\) and \(a\in A:\)
4 Proof of the Main Statements
For the rest of this paper, it is convenient to introduce the following notation. Let \(\mathbb {P}(A)\) be the space of probability measures on \(\mathcal{B}(A)\), endowed with the standard weak topology. For each \(\mu \in \mathbb {P}(A)\),
Let \(\mathcal{R}\) denote the set of (Borel) measurable mappings \(\rho _t(da)\) from \((0,\infty )\) to \(\mathbb {P}(A).\) Here, we do not distinguish two measurable mappings that coincide almost everywhere with respect to the Lebesgue measure on \((0,\infty ).\) Let us equip \(\mathcal{R}\) with the Young topology, which is the weakest topology with respect to which the function \( \rho \in \mathcal{{R}}\rightarrow \int _0^\infty \int _A f(t,a)\rho _t(da)dt \) is continuous for each strongly integrable Carathéodory function f on \((0,\infty )\times A\). Here a real-valued measurable function f on \((0,\infty )\times A\) is called a strongly integrable Carathéodory function if for each fixed \(t\in (0,\infty )\), f(t, a) is continuous in \(a\in A,\) and \(\sup _{a\in A}|f(t,a)|\) is integrable in t, i.e., \(\int _0^\infty \sup _{a\in A}|f(t,a)|dt<\infty .\) It is known that if A is a compact Borel space, then so is \(\mathcal{R}\); see Chapter 4 of [10].
Lemma 4.1
Suppose Conditions 2.1 and 2.2 are satisfied. Then the following assertions hold.
- (a)
The value function \(V^*\) is the minimal \([1,\infty ]\)-valued measurable solution to
$$\begin{aligned}&V^*(x)\\&\quad = \inf _{\rho \in \mathcal{R}}\left\{ \int _0^\infty e^{-\int _0^\tau (q_{\phi (x,s)}(\rho _s)-c(\phi (x,s),\rho _s))ds} \left( \int _S V^*(y)\tilde{q}(dy|\phi (x,\tau ),\rho _\tau )\right) d\tau \right. \\&\qquad \left. +\,e^{-\int _0^\infty q_{\phi (x,s)}(\rho _s)ds}e^{\int _0^\infty c(\phi (x,s),\rho _s)ds} \right\} ,\quad ~\forall ~x\in S. \end{aligned}$$
- (b)
The mapping
$$\begin{aligned}&\rho \in \mathcal{R}\rightarrow W(x,\rho )\\&\quad :=\int _0^\infty e^{-\int _0^\tau (q_{\phi (x,s)}(\rho _s)-c(\phi (x,s),\rho _s))ds} \left( \int _S V^*(y)\tilde{q}(dy|\phi (x,\tau ),\rho _\tau )\right) d\tau \\&\qquad +\,e^{-\int _0^\infty q_{\phi (x,s)}(\rho _s)ds}e^{\int _0^\infty c(\phi (x,s),\rho _s)ds} \end{aligned}$$is lower semicontinuous for each \(x\in S.\)
Proof
One can legitimately consider the following DTMDP (discrete-time Markov decision process): according to [9, Lemma 2.29], all the involved mappings are measurable.
The state space is \(\mathbf X :=((0,\infty )\times S)\bigcup \{(\infty ,x_\infty )\}\). Whenever the topology is concerned, \((\infty ,x_\infty )\) is regarded as an isolated point in \(\mathbf X .\)
The action space is \(\mathbf A :=\mathcal{R}\).
The transition kernel p on \(\mathcal{B}(\mathbf X )\) from \(\mathbf X \times \mathbf A \), c.f. (7), is given for each \(\rho \in \mathbf A \) by
$$\begin{aligned}&p(\Gamma _1\times \Gamma _2|(\theta ,x),\rho ):=\int _{\Gamma _2} e^{-\int _0^t q_{\phi (x,s)}(\rho _s)ds}\tilde{q}(\Gamma _1|\phi (x,t),\rho _t)dt,\\&\quad \forall ~\Gamma _1\in \mathcal{B}(S),~\Gamma _2 \in \mathcal{B}((0,\infty )),~x\in S,~\theta \in (0,\infty ),\nonumber \\&p(\{(\infty ,x_\infty )\}|(\theta ,x),\rho ):=e^{-\int _0^\infty q_{\phi (x,s)}(\rho _s)ds},\quad ~\forall ~x\in S,\quad ~\theta \in (0,\infty );\nonumber \\&p(\{(\infty ,x_\infty )\}|(\infty ,x_\infty ),\rho ):=1. \end{aligned}$$The cost function l is a \([0,\infty ]\)-valued measurable function on \(\mathbf X \times \mathbf A \times \mathbf X \) given by
$$\begin{aligned}&l((\theta ,x),\rho ,(\tau ,y))\\&\quad :=\int _0^\infty I\{s<\tau \} c(\phi (x,s),\rho _s)ds,~\forall ~((\theta ,x),\rho ,(\tau ,y))\in \mathbf X \times \mathbf A \times \mathbf X . \end{aligned}$$
The relevant facts and statements for the DTMDP are included in the Appendix.
One can show that under Conditions 2.1 and 2.2, for each \((\theta ,x)\in \mathbf X \), \(\rho \in \mathbf A \rightarrow \int _\mathbf{X }f(z)p(dz|(\theta ,x),\rho )\) is continuous for each bounded measurable function f on \(\mathbf X \); for each \((\theta ,x)\in \mathbf X \) and \((\tau ,y)\in \mathbf X \), \(\rho \in \mathbf A \rightarrow l((\theta ,x),\rho ,(\tau ,y))\) is lower semicontinuous; and \(\mathbf A \) is a compact Borel space. Hence, Condition A.1 for the DTMDP model \(\{\mathbf{X },\mathbf{A },p,l\}\) is satisfied.
The controlled process in the above DTMDP model \(\{\mathbf{X },\mathbf{A },p,l\}\) is denoted by \(\{Y_n,n=0,1,\dots \}\), where \(Y_n=(\Theta _n,X_n)\), and the controlling process is denoted by \(\{A_n,n=0,1,\dots \}.\) For \(n\ge 1,\) \(\Theta _n\) and \(X_n\) correspond to the nth sojourn time and post-jump state in the PDMDP; \(\Theta _0\) is fictitious, and \(X_0\) is the initial state in the PDMDP. Let \(\Sigma \) be the class of all strategies for the DTMDP model \(\{\mathbf{X },\mathbf{A },p,l\}\), and \(\Sigma _{DM}^0\) be the class of deterministic Markov strategies of the form \(\sigma =(\varphi _n)\) for which \(\varphi _0((\theta ,x))\) does not depend on \(\theta \in (0,\infty )\) for each \(x\in S.\) We reserve the term “policy” for the PDMDP and the term “strategy” for the DTMDP.
According to Proposition A.1, the function
is the minimal \([1,\infty ]\)-valued measurable solution to the optimality equation
for each \(x\in S\) and \(\theta \in (0,\infty );\) this is just (20). Furthermore, by Proposition A.1, there exists a deterministic stationary strategy \(\sigma ^*\) for the DTMDP such that \(\sigma ^*((\theta ,x))\) attains the above infimum for each \(x\in S\) and \(\theta \in (0,\infty ),\) and any such strategy \(\sigma ^*\) verifies
Let \(\hat{\theta }\in (0,\infty )\) be arbitrarily fixed. Since the function \(\mathbf V ^*((\theta ,x))\) is measurable in \((\theta ,x)\in \mathbf X \), the mapping \(x\in S\rightarrow \mathbf V ^*((\hat{\theta },x))\) is measurable. The strategy \(\sigma ^*\) and the constant \(\hat{\theta }\) induce a deterministic Markov strategy \(\sigma ^{**}=(\varphi _n)\in \Sigma ^0_{DM}\), where \(\varphi _0((\theta ,x)):=\sigma ^*((\hat{\theta },x))\) for each \(\theta \in (0,\infty ),~x\in S\), and \(\varphi _n((\theta ,x)):=\sigma ^*((\theta ,x))\) for each \(n\ge 1\), \(\theta \in (0,\infty ),~x\in S.\) (The control at the isolated point \((\infty ,x_\infty )\) is irrelevant, and we do not specify the strategy at that point.) This strategy can be identified with a policy \(\pi ^*\) in the PDMDP, cf. (6). On the other hand, each policy \(\pi =(\pi _n)\) can be identified with a deterministic strategy in this DTMDP. Thus,
for each \(x\in S.\) Consequently, the policy \(\pi ^*\) is optimal, \(V^*(x)=\mathbf V ^*((\theta ,x))\) for each \(x\in S\) and \(\theta \in (0,\infty );\) recall that \(\hat{\theta }\) was arbitrarily fixed. The statement of this lemma now follows. \(\square \)
The policy \(\pi ^*\) in the proof of the previous lemma is actually optimal for problem (8). However, it is not necessarily deterministic or stationary. The reduction of the risk-sensitive PDMDP problem (8) to a risk-sensitive problem for the DTMDP model \(\{\mathbf{X },\mathbf{A },p,l\}\), as seen in the proof of the above lemma, will be used without special reference in what follows.
Lemma 4.2
Suppose Conditions 2.1, 2.2 and 2.3 are satisfied. For each \(x\in S\) and \(\rho \in \mathcal{R}\),
is monotone nondecreasing in \(t\in [0,\infty )\).
Proof
Let \(0\le t_1<t_2<\infty \) be arbitrarily fixed. We need to show
Without loss of generality, we may assume
Then all the four terms in (11) are nonnegative and finite, and (11) is equivalent to
which is verified as follows. Let \(\delta >0\) be arbitrarily fixed. By Lemma 4.1, there exists some \(\hat{\nu }\in \mathcal{R}\) such that
(Recall \(\phi (x,t_2+t)=\phi (\phi (x,t_2),t)\) for each \(t \ge 0.\)) Consider \(\tilde{\nu }\in \mathcal{R}\) defined by
Then routine calculations lead to
Since \(\delta >0\) was arbitrarily fixed, it follows that the term in parentheses in (12) is nonnegative, and thus inequality (12) is verified. \(\square \)
Lemma 4.3
Suppose Conditions 2.1, 2.2 and 2.3 are satisfied. For each \(x\in S\), there is some \(\rho ^*\in \mathcal{R}\) such that
Proof
Let \(x\in S\) be fixed, and let \(\rho ^*\in \mathcal{R}\) be such that \(V^*(x)=W(x,\rho ^*)\), see Lemma 4.1. Suppose \(t\in [0,\infty )\) is arbitrarily fixed. Consider \(\tilde{\rho }\in \mathcal{R}\) defined by \( \tilde{\rho }_s=\rho ^*_{t+s}\) for each \(s>0\). Then
recall (4). On the other hand, by Lemma 4.2,
The statement of this lemma is thus proved. \(\square \)
Lemma 4.4
Suppose Conditions 2.1, 2.2 and 2.3 are satisfied. Then for each \(x\in S,\)\(t\in [0,\infty )\rightarrow V^*(\phi (x,t))\) is absolutely continuous.
Proof
This immediately follows from Lemma 4.3. \(\square \)
Proof of Theorem 3.1
(a) Under Conditions 2.1, 2.2 and 2.3, by Lemma 4.4, for each \(x\in S,\) let \(t\in [0,\infty )\rightarrow U^*(x,t)\) be an integrable real-valued function such that \(U^*(x,t)\) coincides with the derivative of \(t\in [0,\infty )\rightarrow V^*(\phi (x,t))\) almost everywhere. Let \(x\in S\) and \(t\in [0,\infty )\) be fixed, and let \(\rho ^*\in \mathcal{R}\) be from Lemma 4.3.
and
are absolutely continuous in \(\tau \) and are finite for each \(\tau \in [0,\infty )\). Since \(\phi (x,0)=x\), see (4),
Now by Lemma 4.3,
where f is a measurable mapping from S to A such that
for each \(x\in S\); the existence of such a mapping follows from a well-known measurable selection theorem, cf. Proposition D.5 of [15].
Note that \(e^{-\int _0^\tau (q_{\phi (x,v)}(\rho _v)-c(\phi (x,v),\rho _v))dv}\) is bounded and separated from zero in \(\tau \in [0,t]\) for each \(\rho \in \mathcal{R};\) recall Condition 2.2. So
is finite. If
then
which contradicts (14). Therefore,
Then
is absolutely continuous on [0, t]. After legitimately differentiating the above expression with respect to v, and applying Lemma 4.2, we see
for almost all \(v\in [0,t].\) This and (14) imply
almost everywhere in \(\tau \in [0,t].\) Remember, \(t\in [0,\infty )\) was arbitrarily fixed. The first part of (a) is thus verified, and we postpone the justification of the second part of (a) after the proof of part (b).
(b) We use the same notation as in the above. Note that
Indeed, if either \(\int _0^\infty q_{\phi (x,s)}(f(\phi (x,s)))ds\) or \(\int _0^\infty c(\phi (x,s),f(\phi (x,s))))ds\) is finite, then in the above inequality, the equality takes place; and if both \(\int _0^\infty q_{\phi (x,s)}(f(\phi (x,s)))ds\) and \(\int _0^\infty c(\phi (x,s),f(\phi (x,s))))ds\) are infinite, then the right hand side of the inequality is zero according to (1).
In the proof of part (a), it was observed that
and
are absolutely continuous in t and are thus finite for each \(t\in [0,\infty )\). As in the proof of part (a), similar calculations to those in (14) imply that for each \(t\in [0,\infty ),\)
where the last equality is by what was established in part (a). Therefore, for each \(t\in [0,\infty ),\)
where the inequality holds because \(V^*(x)\ge 1\) for each \(x\in S.\) Taking \(\mathop {\underline{\lim }}_{t\rightarrow \infty }\) on the both sides of the previous equality yields:
with the inequality following from (15). Hence
Here it is clear that \(s\in [0,\infty )\rightarrow f(\phi (x,s))\) can be identified as an element of \(\mathcal{R}\), denoted as \(\tilde{f}^x\). In fact, \(\tilde{f}_s^x=\delta _{\{f(\phi (x,s))\}}\) for each \(s\in [0,\infty )\), whereas \(x\in S\rightarrow \tilde{f}^x\in \mathcal{R}\) is measurable. This measurable mapping \(x\in S\rightarrow \tilde{f}^x\in \mathcal{R}\) defines a deterministic stationary optimal strategy for the risk-sensitive DTMDP problem (20) by Proposition A.1. It is clear that the measurable mapping \(x\in S\rightarrow f(x)\in A\) defines an optimal deterministic stationary policy for the PDMDP problem (8).
Finally, we show the remaining part of (a). Let \(H^*\) be a measurable \([1,\infty )\)-valued function on S such that
There exists a measurable mapping h from S to A such that
c.f., Proposition D.5 of [15]. It follows that \(\int _0^s\int _S H^*(y)\tilde{q}(dy|\phi (x,\tau ),h(\phi (x,\tau )))d\tau \) is absolutely continuous in \(s\in [0,t]\) for each \(t\ge 0.\) As in the proof of part (b),
and by passing to the lower limit as \(t\rightarrow \infty \),
It remains to refer to Proposition A.1 for that \(H^*(x)\ge V^*(x)\) for each \(x\in S.\)\(\square \)
Proof of Theorem 3.2
Let \(V^*_0(x):=1\) for each \(x\in S.\) For each \(n\ge 0,\) one can legitimately define
Recall that the DTMDP model \(\{\mathbf{X },\mathbf{A },p,l\}\) satisfies Condition A.1, as noted in the proof of Lemma 4.1. Then by Proposition A.1, \(\{V_n^*\}\) is a monotone nondecreasing sequence of \([1,\infty )\)-valued measurable functions on S such that \(V^*_n(x)\uparrow V^*(x)\) as \(n\uparrow \infty ,\) for each \(x\in S.\)
Let \(n\ge 0\) be fixed. As in Lemma 4.3, for each \(x\in S\), there is some \(\rho ^*\in \mathcal{R}\) such that
Also the relevant version of Lemma 4.2 holds: for each \(x\in S\) and \(\rho \in \mathcal{R}\),
is monotone nondecreasing in \(t\in [0,\infty )\). Clearly, \(V^*_{n+1}(\phi (x,t))\) is absolutely continuous in \(t\in [0,\infty )\) for each \(x\in S\).
Corresponding to (14), we now have
where \(\tau \in [0,t]\rightarrow U^*_{n+1}(x,\tau )\) is integrable and coincides with \(\frac{\partial V^*_{n+1}(\phi (x,t))}{\partial t}\) almost everywhere, and f is some measurable mapping from S to A, whose existence is guaranteed by [15, Proposition D.5]. Continued from the above relation, the reasoning in the proof of the first assertion in part (a) of Theorem 3.1 can be followed: eventually we see
almost everywhere in \(\tau \in [0,t],\) i.e., the equation
is satisfied by \(V=V^*_{n+1}.\)
Recall that \(V^*_{0}=V^{(0)}\). Suppose the recursive definition in (9) is valid up to step n, and \(V^*_{n}(x)=V^{(n)}(x)\) for each \(x\in S.\) Consider an arbitrarily fixed \([1,\infty )\)-valued measurable solution V to (18), and let \(f^*\) be a measurable mapping from S to A such that
One can follow the reasoning in the last part of the proof of Theorem 3.1, and see, c.f. (16),
where the last equality is by (17). Thus, \(V^*_{n+1}\) is the minimal \([1,\infty )\)-valued measurable solution to (18), and coincides with \(V^{(n+1)}\). Therefore, by induction \(V^*_{n}=V^{(n)}\) for each \(n\ge 0.\) It follows now that \(V^{(n)}(x)\uparrow V^*(x)\) as \(n\uparrow \infty \) for each \(x\in S.\)\(\square \)
5 Conclusion
In this paper, we considered total undiscounted risk-sensitive PDMDP in Borel state and action spaces with a nonnegative cost rate. The transition and cost rates are assumed to be locally integrable along the drift. Under quite natural conditions, we showed that the value function is a solution to the optimality equation, justified the value iteration algorithm, and showed the existence of deterministic stationary optimal policy. As a corollary, the obtained results were applied to improving significantly known results for finite horizon undiscounted and infinite horizon discounted risk-sensitive CTMDP in the literature.
References
Bäuerle, N., Jaśkiewicz, A.: Risk-sensitive Divident problems. Eur. J. Oper. Res. 242, 161–171 (2015)
Bäuerle, N., Rieder, U.: MDP algorithms for portfolio optimization problems in pure jump markets. Financ. Stoch. 13, 591–611 (2009)
Bäuerle, N., Rieder, U.: Markov Decision Processes with Applications to Finance. Springer, Berlin (2011)
Bäuerle, N., Rieder, U.: More risk-sensitive Markov decision processes. Math. Oper. Res. 39, 105–120 (2014)
Bertsekas, D., Shreve, S.: Stochastic Optimal Control. Academic Press, New York (1978)
Cavazos-Cadena, R., Montes-de-Oca, R.: Optimal stationary policies in risk-sensitive dynamic programs with finite state space and nonnegative rewards. Appl. Math. (Warsaw) 27, 167–185 (2000)
Chung, K., Sobel, M.: Discounted MDP’s: distribution functions and exponential utility maximization. SIAM J Control Optim. 25, 49–62 (1987)
Coraluppi, S., Marcus, S.: Risk-sensitive queueing. In: Proceedings of the 35th Annual Allerton Conference on Communication Control and Computing, 943–952 (1997)
Costa, O., Dufour, F.: Continuous Average Control of Piecewise Deterministic Markov Processes. Springer, New York (2013)
Davis, M.: Markov Models and Optimization. Chapman and Hall, London (1993)
Di Masi, G., Stettner, L.: Risk-sensitive control of discrete-time Markov processes with infinite horizon. SIAM J. Control Optim. 38, 61–78 (1999)
Fainberg, E.: Controlled Markov processes with arbitrary numerical criteria. Theory Probab. Appl. 27, 486–503 (1982)
Forwick, L., Schäl, M., Schmitz, M.: Piecewise deterministic Markov control processes with feedback controls and unbounded costs. Acta Appl. Math. 82, 239–267 (2004)
Ghosh, M., Saha, S.: Risk-sensitive control of continuous time Markov chains. Stochastics 86, 655–675 (2014)
Hernández-Lerma, O., Lasserre, J.: Discrete-Time Markov Control Processes. Springer, New York (1996)
Howard, R., Matheson, J.: Risk-sensitive Markov decision proceses. Manag. Sci. 18, 356–369 (1972)
Jaquette, S.: A utility criterion for Markov decision processes. Manag. Sci. 23, 43–49 (1976)
Jaśkiewicz, A.: A note on negative dynamic programming for risk-sensitive control. Oper. Res. Lett. 36, 531–534 (2008)
Kitaev, M., Rykov, V.: Controlled Queueing Systems. CRC Press, Boca Raton (1995)
Kumar, S., Pal, C.: Risk-sensitive control of pure jump process on countable space with near monotone cost. Appl. Math. Optim. 68, 311–331 (2013)
Piunovski, A., Khametov, V.: New effective solutions of optimality equations for the controlled Markov chains with continuous parameter (the unbounded price-function). Probl. Control Inf. Theory 14, 303–318 (1985)
Piunovskiy, A.: Optimal Control of Random Sequences in Problems with Constraints. Kluwer, Dordrecht (1997)
Schäl, M.: On piecewise deterministic Markov control processes: control of jumps and of risk processes in insurance. Insur. Math. Econ. 22, 75–91 (1998)
Wei, Q.: Continuous-time Markov decision processes with risk-sensitive finite-horizon cost criterion. Math. Methods Oper. Res. 84, 461–487 (2016)
Wei, Q., Chen, X.: Continuous-time Markov decision processes under the risk-sensitive average cost criterion. Oper. Res. Lett. 44, 457–462 (2016)
Yushkevich, A.: On reducing a jump controllable Markov model to a model with discrete time. Theory Probab. Appl. 25, 58–68 (1980)
Zhang, Y.: Continuous-time Markov decision processes with exponential utility. SIAM J. Control Optim. 55, 2636–2660 (2017)
Acknowledgements
We thank the referees for their remarks, which improved the presentation of this paper. This work is partially supported by a grant from the Royal Society (IE160503).
Author information
Authors and Affiliations
Corresponding author
Appendix
Appendix
Consider a discrete-time Markov decision process with the following primitives:
\(\mathbf X \) is a nonempty Borel state space.
\(\mathbf A \) is a nonempty Borel action space.
p(dy|x, a) is a stochastic kernel on \(\mathcal{B}(\mathbf X )\) given \((x,a)\in \mathbf X \times \mathbf A \).
l a \([0,\infty ]\)-valued measurable cost function on \(\mathbf X \times \mathbf A \times \mathbf X .\)
Let \(\Sigma \) be the space of strategies, and \(\Sigma _{DM}\) be the space of all deterministic strategies for the DTMDP. Let the controlled and controlling processes be denoted by \(\{Y_n, n=0,1,\dots ,\infty \}\) and \(\{A_n,n=0,1,\dots ,\infty \}\), respectively. The strategic measure of a strategy \(\sigma \) given the initial state \(x\in \mathbf X \) is denoted by \(\mathbf P _x^\sigma \). The expectation taken with respect to \(\mathbf P _x^\sigma \) is denoted by \(\mathbf E _x^\sigma .\)
Consider the optimal control problem
It is also referred to as the risk-sensitive DTMDP problem. We denote the value function of problem (19) by \(\mathbf V ^*\). Then a strategy \(\sigma ^*\) is called optimal for problem (19) if \(\mathbf V (x,\sigma ^*)=\mathbf V ^*(x)\) for each \(x\in \mathbf X .\)
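To make the exponential-utility criterion \(\mathbf E _x^\sigma [e^{\sum l}]\) in (19) concrete, here is a minimal Monte Carlo sketch on a hypothetical two-state chain under a fixed strategy (all numbers are illustrative and not taken from the paper): state 1 is absorbing and cost-free, so the total cost is finite almost surely, and by Jensen's inequality the risk-sensitive value exceeds the exponential of the risk-neutral value.

```python
import math
import random

random.seed(0)

# Hypothetical two-state chain under a fixed strategy (illustrative numbers):
# from state 0, stay with probability 0.5 at cost 0.2, or jump to the
# absorbing, cost-free state 1 at cost 0.1.
def run_episode():
    x, total = 0, 0.0
    while x == 0:
        if random.random() < 0.5:
            total += 0.2          # stay in state 0
        else:
            total += 0.1          # absorb in state 1
            x = 1
    return total

n = 200_000
costs = [run_episode() for _ in range(n)]
risk_neutral = sum(costs) / n                         # E[ total cost ] = 0.3
risk_sensitive = sum(math.exp(c) for c in costs) / n  # E[ exp(total cost) ], cf. (19)
# Exact value: sum_k 0.5^{k+1} e^{0.2k+0.1} = 0.5 e^{0.1} / (1 - 0.5 e^{0.2})
```

In this toy model the risk-sensitive value is \(0.5e^{0.1}/(1-0.5e^{0.2})\approx 1.419\), strictly larger than \(e^{0.3}\approx 1.350\), illustrating the risk aversion built into the exponential utility.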
Condition A.1
-
(a)
The function l(x, a, y) is lower semicontinuous in \(a\in \mathbf A \) for each \(x,y\in \mathbf X .\)
-
(b)
For each bounded measurable function f on \(\mathbf X \) and each \(x\in \mathbf X ,\)\(\int _\mathbf{X }f(y)p(dy|x,a)\) is continuous in \(a\in \mathbf A .\)
-
(c)
The space \(\mathbf A \) is a compact Borel space.
Proposition A.1
Suppose Condition A.1 is satisfied.
- (a)
The value function \(\mathbf V ^*\) is the minimal \([1,\infty ]\)-valued measurable solution to
$$\begin{aligned} \mathbf V (x)=\inf _{a\in \mathbf A }\left\{ \int _\mathbf{X }p(dy|x,a)e^{l(x,a,y)}{} \mathbf V (y)\right\} ,\quad ~x\in \mathbf X . \end{aligned}$$(20) - (b)
Let \(\mathbf U \) be a \([1,\infty ]\)-valued lower semianalytic function on \(\mathbf X \). If
$$\begin{aligned} \mathbf U (x)\ge \inf _{a\in \mathbf A }\left\{ \int _\mathbf{X }p(dy|x,a)e^{l(x,a,y)}{} \mathbf U (y)\right\} ,\quad ~\forall ~x\in \mathbf X , \end{aligned}$$then \(\mathbf U (x)\ge \mathbf V ^*(x)\) for each \(x\in \mathbf X .\) In particular, if the function \(\mathbf U \) satisfying the above relation is \([1,\infty )\)-valued, then so is the value function \(\mathbf V ^*.\)
(c) Let \(\varphi \) be a deterministic stationary strategy for the DTMDP model \(\{\mathbf{X },\mathbf{A },p,l\}\). If
$$\begin{aligned} \mathbf V ^*(x)=\int _\mathbf{X }p(dy|x,\varphi (x))e^{l(x,\varphi (x),y)}\mathbf V ^*(y),\quad ~\forall ~x\in \mathbf X , \end{aligned}$$
(21)
then \(\mathbf V ^*(x)=\mathbf V (x,\varphi )\) for each \(x\in \mathbf X .\)
(d) Let \(\mathbf V ^{(0)}(x):=1\) for each \(x\in \mathbf X \), and for each \(n=1,2,\dots ,\)
$$\begin{aligned} \mathbf V ^{(n)}(x):=\inf _{a\in \mathbf A }\left\{ \int _\mathbf{X }p(dy|x,a)e^{l(x,a,y)}\mathbf V ^{(n-1)}(y)\right\} ,\quad ~\forall ~x\in \mathbf X . \end{aligned}$$
Then \((\mathbf V ^{(n)}(x))\) increases to \(\mathbf V ^*(x)\) for each \(x\in \mathbf X \), where \(\mathbf V ^*\) is the value function for problem (19). Furthermore, there exists a deterministic stationary strategy \(\varphi \) satisfying (21), and so in particular, there exists a deterministic stationary optimal strategy for the risk-sensitive DTMDP problem (19).
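On a finite model, the value iteration of Proposition A.1(d) is directly implementable. The sketch below iterates \(\mathbf V ^{(n)}(x)=\min _a\sum _y p(y|x,a)e^{l(x,a,y)}\mathbf V ^{(n-1)}(y)\) from \(\mathbf V ^{(0)}\equiv 1\) on a hypothetical two-state, two-action model (all numbers illustrative), with a greedy arg-min playing the role of the measurable selection step; state 1 is made absorbing and cost-free so that the value function is finite.

```python
import numpy as np

# Hypothetical model (illustrative numbers); state 1 is absorbing, cost-free.
# p[a, x, y] = transition probability, l[a, x, y] = nonnegative one-step cost.
p = np.array([[[0.5, 0.5], [0.0, 1.0]],
              [[0.2, 0.8], [0.0, 1.0]]])
l = np.array([[[0.2, 0.1], [0.0, 0.0]],
              [[0.5, 0.3], [0.0, 0.0]]])

V = np.ones(2)                      # V^(0) = 1
for _ in range(500):
    # Q[a, x] = sum_y p(y|x,a) * exp(l(x,a,y)) * V(y)
    Q = np.einsum('axy,axy,y->ax', p, np.exp(l), V)
    V_new = Q.min(axis=0)           # value iteration step of Proposition A.1(d)
    if np.max(np.abs(V_new - V)) < 1e-12:
        V = V_new
        break
    V = V_new

policy = Q.argmin(axis=0)           # greedy selector, cf. the selection theorem
```

For this toy model the iterates converge to \(\mathbf V ^*(0)=0.5e^{0.1}/(1-0.5e^{0.2})\approx 1.419\) and \(\mathbf V ^*(1)=1\), with the first action optimal in state 0; the deterministic stationary strategy read off from `policy` satisfies (21).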
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Guo, X., Zhang, Y. On Risk-Sensitive Piecewise Deterministic Markov Decision Processes. Appl Math Optim 81, 685–710 (2020). https://doi.org/10.1007/s00245-018-9485-x
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00245-018-9485-x
Keywords
- Continuous-time Markov decision processes
- Piecewise deterministic Markov decision processes
- Exponential utility
- Dynamic programming