Abstract
This paper considers finite-horizon risk-sensitive optimality for continuous-time Markov decision processes, focusing on the general case in which the transition rates are unbounded, the cost/reward rates may be unbounded from below and from above, the policies can be history-dependent, and the state and action spaces are Borel spaces. Under mild conditions imposed on the decision process's primitive data, we establish the existence of a solution to the corresponding optimality equation (OE) by a so-called approximation technique. Then, using the OE and the extension of Dynkin's formula developed here, we prove the existence of an optimal Markov policy and verify that the value function is the unique solution to the OE. Finally, we give an example to illustrate the difference between our conditions and those in the previous literature.
1 Introduction
Continuous-time Markov decision processes (CTMDPs) are an important class of stochastic optimal control problems and have been widely studied; see, for instance, the monographs (Guo and Hernández-Lerma 2009; Prieto-Rumeau and Hernández-Lerma 2012) and the extensive references therein. In most of the existing literature on CTMDPs, the commonly used optimality criteria are the infinite-horizon expected discounted criterion (Guo 2007; Guo and Hernández-Lerma 2009; Guo and Song 2011; Guo and Piunovskiy 2011; Piunovskiy and Zhang 2011; Prieto-Rumeau and Hernández-Lerma 2012), the long-run expected average criterion (Guo and Hernández-Lerma 2009; Guo et al. 2012; Prieto-Rumeau and Hernández-Lerma 2012; Wei and Chen 2017; Xia 2014), and the finite-horizon expected criterion (Guo et al. 2015b; Yushkevich 1977). All these expected criteria are risk-neutral and cannot reflect a decision-maker's attitude toward risk. Since many decision-makers in real-world applications may be either risk-seeking or risk-averse, risk-sensitive criteria, which incorporate the risk-sensitivity of a decision-maker into the optimality criterion, have been employed and studied in CTMDPs. In this paper, we further study a risk-sensitive criterion in CTMDPs and focus on the finite-horizon case. Thus, we detail neither the main results in the earlier literature on the infinite-horizon risk-sensitive discounted criterion (Ghosh and Saha 2014; Huo et al. 2017; Kumar and Chandan 2013; Zhang 2017) and the long-run risk-sensitive average criterion (Ghosh and Saha 2014; Kumar and Chandan 2015; Kumar and Chandan 2013) for CTMDPs, nor those on risk-sensitive discrete-time Markov decision processes (Anantharam and Borkar 2017; Bäuerle and Rieder 2014; Cavazos-Cadena and Hernández-Hernández 2011; Jaskiewicz 2007; Xia 2018) or on zero-sum risk-sensitive stochastic games (Basu and Ghosh 2014; Bäuerle and Rieder 2017).
Our concern is the finite-horizon risk-sensitive criterion for CTMDPs (Ghosh and Saha 2014; Wei 2016), under which the value function depends on both time and states, whereas the value functions for the infinite-horizon risk-sensitive discounted and average criteria in Ghosh and Saha (2014), Kumar and Chandan (2015), Kumar and Chandan (2013), and Zhang (2017) are independent of time. Therefore, the optimality equation for the finite-horizon case (Ghosh and Saha 2014; Wei 2016) is rather different from that for the infinite-horizon cases (Ghosh and Saha 2014; Kumar and Chandan 2013; Zhang 2017).
To the best of our knowledge, the finite-horizon risk-sensitive criterion for CTMDPs is addressed only in Ghosh and Saha (2014), Huang (2018), and Wei (2016). Precisely, Ghosh and Saha (2014) uses the exponential utility function to characterize the risk-sensitivity of a decision-maker and obtains the existence of optimal Markov policies for the case of denumerable states, uniformly bounded transition rates, bounded cost rates, and the class of Markov policies. Wei (2016) extends the main results in Ghosh and Saha (2014) to the case of unbounded transition rates, establishes the existence of a solution to the corresponding optimality equation (OE) by the approximation technique in Guo et al. (2015b), and gives the corresponding error estimates of the approximations. However, the arguments in Ghosh and Saha (2014) and Wei (2016) require the assumptions of bounded cost rates and denumerable states, and the policies in Ghosh and Saha (2014) and Wei (2016) are independent of histories.
Huang (2018) also studies finite-horizon CTMDPs with Borel state and action spaces and unbounded reward functions and transition rates. However, the optimization criteria in Huang (2018) are very different from ours: there, the mean is first maximized and then, for a fixed mean, the variance is minimized, which is suitable for a risk-averse decision-maker. The optimization criterion in this paper is risk-sensitivity, that is, maximization of the exponential utility function. From Eq. 2.7 in Section 2, we can see that the exponential utility function is related to both the mean and the variance. Since the coefficient of the variance term is positive, our optimization criterion is suitable for a risk-seeking decision-maker.
As indicated above, the finite-horizon risk-sensitive criterion for CTMDPs with unbounded cost/transition rates, Borel state and action spaces, and randomized history-dependent policies has not been studied yet, and it will be considered in this paper. Precisely, we study CTMDPs having the following features: (1) the transition rates can be unbounded; (2) the cost rates are allowed to be unbounded from below and from above (as the cost rates may take positive and negative values, they can be interpreted as reward rates rather than "cost rates" only); (3) the state and action spaces are Borel spaces; (4) the policies can be randomized and history-dependent; and (5) the optimality criterion is finite-horizon risk-sensitive.
First, we give a suitable condition under which, using Jensen's inequality, we establish the finiteness of the finite-horizon risk-sensitive criterion with unbounded cost rates. This condition is new and is satisfied by the bounded costs in Ghosh and Saha (2014) and Wei (2016); see Lemma 3.1.
Second, under conditions slightly weaker than those in Guo and Hernández-Lerma (2009), Guo et al. (2012), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012) on infinite-horizon risk-neutral CTMDPs, we derive an extension of Dynkin's formula, which generalizes the analog of Ito-Dynkin's formula recently developed in Guo et al. (2015b) for the underlying state process {xt,t ≥ 0} induced by the transition rates and randomized history-dependent policies (see Theorem 3.1 below). This result is also a natural extension of the Feynman-Kac formula in Wei (2016) for continuous-time jump Markov processes to the case of continuous-time jump "non-Markov" processes and to a larger class of functions φ(ω,t,x) of samples ω, times t, and states x. On the one hand, since the analog of Ito-Dynkin's formula in Guo et al. (2015b) is designed for functions φ(t,x) of time t and states x, it is not suitable for the more general functions φ(ω,t,x) with an additional sample variable ω, such as the function \(e^{{{\int \limits }_{0}^{t}}{\int \limits }_{A}c(x_{s}(\omega ),a)\pi (da|\omega ,s)ds}\varphi (t,x)\) of (ω,t,x), which needs to be considered for our case of history-dependent policies π(da|ω,s) and cost rates c(x,a) in the risk-sensitive CTMDPs; on the other hand, the Feynman-Kac formula in Wei (2016) is for Markov processes, and thus it is not applicable to the case in which the underlying process {xt} here may not be Markovian. Since our optimality problem concerns the risk-sensitive criterion over the class of history-dependent policies, these two facts motivate our study of the extension of Dynkin's formula.
Third, under suitable conditions as in Ghosh and Saha (2014), Guo and Hernández-Lerma (2009), Guo et al. (2012), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012) on risk-neutral CTMDPs with infinite horizon, we prove the existence and uniqueness of a solution to the OE as well as the existence of an optimal Markov policy for the finite-horizon risk-sensitive CTMDPs in three steps. The first step considers the simple case of bounded transition rates and bounded cost rates, establishes the existence of a solution to the OE by Banach's fixed point theorem, and also proves the existence of an optimal Markov policy; see Proposition 4.2. The second step deals with the case of unbounded transition rates and nonnegative cost rates. By constructing a sequence of models of CTMDPs with bounded transition rates and bounded cost rates, using the results of the first step and some new properties of the value function of the finite-horizon risk-sensitive CTMDPs, we prove that the limit of the sequence of the corresponding value functions is a solution to the OE for the case of unbounded transition rates and nonnegative cost rates; see Proposition 4.3. In the third step, by designing a technique of approximations from nonnegative but unbounded cost rates to the more general case of cost rates unbounded from above and from below, we further show the existence of a solution to the OE for the most general case of unbounded transition and cost rates; see Theorem 4.1. It should be mentioned that our approximation technique here is rather different from the approximation in Guo et al. (2015b) and Wei (2016) for denumerable states.
Fourth, using the existence of a solution to the OE and the extension of Dynkin's formula developed here, we prove the existence of an optimal Markov policy, and also show that the value function is the unique solution to the optimality equation; see Theorem 4.1. All arguments here are direct and self-contained.
Finally, our conditions in this paper are a generalization of those in Ghosh and Saha (2014) and Wei (2016) on the finite-horizon risk-sensitive CTMDPs. To further illustrate our main results and the difference between the conditions in this paper and those in Ghosh and Saha (2014) and Wei (2016), we present an applied example in which our conditions are satisfied, the transition and cost rates are both unbounded, and the state and action spaces are non-denumerable, while some of the conditions in Ghosh and Saha (2014) and Wei (2016) fail to hold.
The rest of the paper is organized as follows. In Section 2 we introduce the optimality problem for the risk-sensitive CTMDPs. The main results are presented in Section 4 after giving technical preliminaries in Section 3, and are illustrated with an example in Section 5.
2 The optimal control problems
Notation
For any Borel space X endowed with the Borel σ-algebra \({\mathscr{B}}(X)\), we will denote by Ec := X ∖ E the complement of a subset E of X, by IE the indicator function on any set E, by δz(dx) the Dirac measure at point z ∈ X (i.e., δz(D) = ID(z) for all \(D\in {\mathscr{B}}(X))\), by \(\mathbb {B}_{1}(X)\) the set of all bounded Borel measurable functions φ on X with the norm \(\|\varphi \|:=\sup _{x\in X}|\varphi (x)|\), and by \(\mathcal {{U}}(X)\) the universal σ-algebra on X, that is, \(\mathcal { U}(X):=\cap _{p\in P(X)}{\mathscr{B}}_{p}(X),\) where P(X) represents the set of all probability measures on \({\mathscr{B}}(X)\), and \({\mathscr{B}}_{p}(X)\) is the completion of \({\mathscr{B}}(X)\) with respect to p ∈ P(X). To discern the “measurability”, we will say “Borel measurable” or “universally measurable” in the following.
The model of CTMDPs is a five-tuple

$$ \{S, \ A, \ (A(x), x\in S), \ q(dy|x,a), \ c(x,a)\} \ \ \ \ \ \ (2.1) $$
consisting of the following elements:
- (a)
a Borel space S, called the state space, whose elements are referred to as states of a system.
- (b)
a Borel space A, called the action space, whose elements are referred to as actions (or decisions) of a decision-maker (or controller);
- (c)
a family {A(x),x ∈ S} of nonempty subsets A(x) of A, where A(x) denotes the set of actions available to a controller when the system is in state x ∈ S, and it is assumed to be Borel-measurable, that is, \(A(x)\in {\mathscr{B}}(A)\) for every x ∈ S;
- (d)
a Borel measurable function c(x,a) on K, called the cost rates, where K := {(x,a)|x ∈ S,a ∈ A(x)} denotes the set of all feasible state-action pairs and is assumed in \({\mathscr{B}}(S\times A);\) (As c(x,a) is allowed to take positive and negative values, it can be interpreted as rewards rather than “costs” only.)
- (e)
transition rates q(dy|x,a), a universally measurable signed kernel on S given K. That is, q(⋅|x,a) is countably additive and satisfies q(D|x,a) ≥ 0 for all \(D\in {\mathscr{B}}(S)\) with (x,a) ∈ K and x∉D; it is conservative in the sense that q(S|x,a) ≡ 0, and stable in the sense that
$$ \begin{array}{@{}rcl@{}} q^{*}(x):=\sup\limits_{a\in A(x)}q(x,a)<\infty \ \ \ \ \forall \ x\in S, \end{array} $$(2.2)where q(x,a) := −q({x}|x,a) ≥ 0 for all (x,a) ∈ K.
Next, we give an informal description of the evolution of CTMDPs with model (2.1).
Roughly speaking, CTMDPs evolve as follows: A controller observes states of a system continuously in time. If the system is at state xt at time t, he/she chooses an action at ∈ A(xt) according to a given policy, as a consequence of which, the following happen:
- (i)
An immediate cost takes place at the rate c(xt,at).
- (ii)
After a random sojourn time (i.e., the holding time at state xt), the system jumps to a set D (xt∉D) of states with the transition probability \(\frac {q(D|x_{t},a_{t})}{q(x_{t},a_{t})}\) determined by the transition rates q(dy|xt,at). The distribution function of the sojourn time is \((1-e^{-{{\int \limits }_{t}^{t+x}}q(x_{s},a_{s})ds})\).
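The two-step evolution above (an exponential sojourn with total rate q(xt,at), followed by a jump drawn from the normalized kernel q(dy|xt,at)/q(xt,at)) can be sketched as a small simulation. The following Python sketch uses a hypothetical birth-death model with births at rate λ⋅a and deaths at rate μ⋅x; the model and all parameter values are illustrative assumptions, not part of the paper's data:

```python
import random

def simulate_path(x0, policy, T, lam=1.0, mu=1.0, rng=None):
    """Simulate one sample path of a toy birth-death CTMDP up to horizon T.

    From state x under action a, the process jumps to x + 1 at rate lam * a
    and to x - 1 at rate mu * x; the sojourn time at x is exponential with
    the total rate q(x, a), and the post-jump state is drawn from the
    normalized kernel q(dy | x, a) / q(x, a).
    """
    rng = rng or random.Random()
    t, x = 0.0, x0
    jumps = [(0.0, x0)]            # (jump moment T_k, post-jump state X_k)
    while True:
        a = policy(t, x)           # a Markov policy: action depends on (t, x)
        up, down = lam * a, mu * x
        q_total = up + down        # q(x, a) = -q({x} | x, a)
        if q_total == 0.0:         # no feasible jumps: state is absorbing
            break
        theta = rng.expovariate(q_total)   # exponential sojourn time at x
        if t + theta > T:
            break                  # next jump falls beyond the horizon
        t += theta
        x = x + 1 if rng.random() < up / q_total else x - 1
        jumps.append((t, x))
    return jumps

path = simulate_path(2, lambda t, x: 1, 5.0, rng=random.Random(0))
```

Each returned pair corresponds to a jump moment Tk and the post-jump state Xk of the construction below; between jumps the path is constant, matching the piecewise-constant form of {xt}.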
To formalize what is described above, below we describe the construction of CTMDPs under possibly randomized history-dependent policies.
To construct the underlying CTMDPs, we introduce some notation: Let \({\Omega }_{0} := (S\times (0,\infty ))^{\infty }\), \({\Omega }_{k} := (S\times (0,\infty ))^{k}\times S\times (\{\infty \}\times \{{\Delta }\})^{\infty }\) for k ≥ 1 and some Δ∉S, \({\Omega }:=\cup _{k=0}^{\infty }{\Omega }_{k}\). \(\mathcal {F}\) is the Borel σ-algebra on the Borel space Ω. Then we obtain the measurable space \(({\Omega },\mathcal {F})\). For some k ≥ 1, and sample \(\omega :=(x_{0},\theta _{1},x_{1},\ldots ,\theta _{k},x_{k},\infty ,{\Delta }) \in {\Omega }\), define

$$ T_{0}(\omega):=0, \ \ \ T_{k}(\omega):=\theta_{1}+\theta_{2}+\cdots+\theta_{k}, \ \ \ T_{\infty}(\omega):=\lim\limits_{k\to\infty}T_{k}(\omega). \ \ \ \ \ \ (2.3) $$
In what follows, the argument ω is always omitted except in some special informational statements. Then, we define the state process {xt(ω),t ≥ 0} on \(({\Omega },\mathcal {F})\) by

$$ x_{t}(\omega):=\sum\limits_{k\geq 0}I_{\{T_{k}\leq t<T_{k+1}\}}x_{k}+I_{\{t\geq T_{\infty}\}}{\Delta}, \ \ \ \ t\geq 0. \ \ \ \ \ \ (2.4) $$
Obviously, xt(ω) is right-continuous on \([0,\infty )\). We denote \(x_{t-}(\omega ):=\lim _{s\to t-}x_{s}(\omega )\). Here we have used the conventions that 0 ⋅ z := 0 and 0 + z := z for all \(z\in S_{\Delta }:=S\cup \{{\Delta }\}\).
For each fixed ω := (x0,𝜃1,x1,…,𝜃k,xk,…) ∈Ω, from Eq. 2.4, we see that Tk(ω) (k ≥ 1) denotes the k-th jump moment of {xt,t ≥ 0}, Xk−1(ω) = xk−1 is the state of the process on [Tk−1(ω),Tk(ω)), 𝜃k = Tk(ω) − Tk−1(ω) plays the role of the sojourn time at state xk−1, and the sample path {xt(ω),t ≥ 0} takes at most denumerably many values xk (k = 0,1,…). We do not intend to consider the controlled process {xt,t ≥ 0} after the moment \(T_{\infty }\), and thus view it as absorbed in the cemetery state Δ. Hence, we write AΔ := A ∪{aΔ}, A(Δ) := {aΔ}, q(⋅|Δ,aΔ) :≡ 0, c(Δ,aΔ) :≡ 0, where aΔ is an isolated point.
To precisely define the optimality criterion, we need to introduce the concept of a policy in Guo et al. (2012), Guo and Piunovskiy (2011), Kitaev and Rykov (1995), and Piunovskiy and Zhang (2011). To do so, we recall some notation. Take the right-continuous family of σ-algebras \(\{\mathcal { F}_{t}\}_{t\ge 0}\) with \(\mathcal { F}_{t}:=\sigma (\{T_{k}\leq s, X_{k}\in D\}: D\in \mathcal {{B}}(S),s\leq t,k\geq 0)\). As in Guo and Song (2011), Guo and Piunovskiy (2011), Kitaev and Rykov (1995), and Piunovskiy and Zhang (2011), let \(\mathcal P\) be the σ-algebra of predictable sets on \({\Omega }\times [0,\infty )\) related to \(\{\mathcal { F}_{t}\}_{t\ge 0}\), that is, \(\mathcal {{P}}:=\sigma (B\times [0,\infty ), C\times (s,\infty ): B\in \mathcal {{F}}_{0}, C\in \mathcal {{F}}_{s-}, s>0)\) with \(\mathcal {{F}}_{s-}:=\bigvee _{t<s}\mathcal {{F}}_{t}:=\sigma (\mathcal {{F}}_{t}, t<s)\). A real-valued function on \({\Omega }\times [0,\infty )\) is called predictable if it is measurable with respect to \(\mathcal P\).
Definition 2.1
A transition probability π(da|ω,t) from \(({\Omega }\times [0,\infty ),\mathcal { P} )\) onto \((A_{\Delta }, \mathcal {{B}}(A_{\Delta }))\) such that π(A(xt−(ω))|ω,t) ≡ 1 is called a (randomized history-dependent) policy. A policy π(da|ω,t) is called randomized Markovian if π(da|ω,t) ≡ π(da|xt−(ω),t). We will denote such a Markov policy by πt(da|⋅) to emphasize its dependence on t. A randomized Markov policy πt(da|⋅) is called deterministic Markovian whenever there exists an A-valued universally measurable function f(t,x) on \([0,\infty )\times S\) such that πt(da|x) is the Dirac measure at the point f(t,x) ∈ A(x) for all t ≥ 0 and x ∈ S. Such a Markov policy will be denoted by f for simplicity.
We denote by Π the set of all randomized history-dependent policies, by \({{\Pi }_{m}^{r}}\) the set of all randomized Markov policies, and by \({{\Pi }_{m}^{d}}\) the set of all (deterministic) Markov policies.
For any initial distribution γ on S and policy π ∈ Π, as shown in Guo et al. (2012), Guo and Song (2011), Guo and Piunovskiy (2011), Kitaev and Rykov (1995), and Piunovskiy and Zhang (2011), but using the extension of the Ionescu Tulcea theorem (e.g., Proposition 7.45 in Bertsekas and Shreve 1996), we see that there exists a unique probability measure \(\mathbb {P}_{\gamma }^{\pi }\) (depending on γ and π) on \(({\Omega },\mathcal {F})\). Let \(\mathbb {E}_{\gamma }^{\pi }\) be the corresponding expectation operator. In particular, \(\mathbb {E}_{\gamma }^{\pi }\) and \(\mathbb {P}_{\gamma }^{\pi }\) will be written as \(\mathbb {E}_{x}^{\pi }\) and \(\mathbb {P}_{x}^{\pi }\), respectively, when γ is the Dirac measure at a state x ∈ S.
Fix any finite horizon T > 0. For each policy π ∈ Π and state x ∈ S, we define the T-horizon risk-sensitive criterion J(π,0,x) of CTMDPs by

$$ J(\pi,0,x):=\mathbb{E}_{x}^{\pi}\left[e^{\delta {{\int\limits}_{0}^{T}}{\int\limits}_{A}c(x_{t},a)\pi(da|\omega,t)dt}\right], \ \ \ \ \ \ (2.5) $$

provided that the integral is well defined, where δ is a constant called the risk-sensitive parameter. In the following arguments, we assume that δ > 0. For the other case of δ < 0, the corresponding arguments are similar, and thus omitted.
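For intuition, the criterion above can be estimated by Monte Carlo: simulate paths under a fixed policy, integrate the piecewise-constant cost along each path to obtain C := ∫0T c(xt,at)dt, and average eδC. The sketch below does this for a hypothetical birth-death model with cost rate c(x,a) = x (all parameter values are our illustrative assumptions); the final check is the finite-sample form of Jensen's inequality, ln J ≥ δV, discussed at the end of this section:

```python
import math
import random

def sample_cost_integral(x0, T, lam=1.0, mu=1.0, rng=None):
    """One sample of C := int_0^T c(x_t, a_t) dt for a toy birth-death CTMDP
    under the stationary policy a == 1, with cost rate c(x, a) = x."""
    rng = rng or random.Random()
    t, x, cost = 0.0, x0, 0.0
    while True:
        up, down = lam, mu * x
        q_total = up + down              # total jump rate q(x, a) >= lam > 0
        theta = rng.expovariate(q_total)
        if t + theta >= T:
            return cost + x * (T - t)    # last piece of the cost integral
        cost += x * theta                # cost rate is constant between jumps
        t += theta
        x = x + 1 if rng.random() < up / q_total else x - 1

rng = random.Random(0)
delta, T, x0 = 0.5, 2.0, 1
samples = [sample_cost_integral(x0, T, rng=rng) for _ in range(5000)]

# Monte Carlo estimates of J(pi,0,x) = E[e^{delta*C}] and V(pi,0,x) = E[C].
J_hat = sum(math.exp(delta * c) for c in samples) / len(samples)
V_hat = sum(samples) / len(samples)
print(J_hat >= 1.0 and math.log(J_hat) >= delta * V_hat)  # True (Jensen)
```

The last line holds for any sample: the log-mean-exp of δC dominates the mean of δC, which is exactly the empirical version of ln J(π,0,x) ≥ δV(π,0,x).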
Note that the process {xt,t ≥ 0} on \(({\Omega },\mathcal {F}, \mathbb {P}^{\pi }_{\gamma })\) may not be Markovian, since the policy π can depend on histories (x0,𝜃1,x1,…,𝜃k,xk). However, for each \(\pi :=\pi _{t}(da|\cdot ) \in {{\Pi }^{r}_{m}}\), it is well known (e.g., Feinberg et al. 2014) that {xt,t ≥ 0} is a Markov process on \(({\Omega },\mathcal {F}, \mathbb {P}^{\pi }_{\gamma })\), and thus, for each x ∈ S and t ∈ [0,T], the expression

$$ J(\pi,t,x):=\mathbb{E}_{\gamma}^{\pi}\left[e^{\delta {{\int\limits}_{t}^{T}}{\int\limits}_{A}c(x_{v},a)\pi_{v}(da|x_{v})dv}\Big{|}x_{t}=x\right] \ \ \ \ \ \ (2.6) $$

is well defined (when the integral exists), and it is called the risk-sensitive value of π from the horizon t to T.
For each x ∈ S, let

$$ J^{*}(t,x):=\inf\limits_{\pi\in{{\Pi}_{m}^{r}}}J(\pi,t,x), \ \ \ \ t\in[0,T]. $$
The function J∗(t,x) on [0,T] × S is called the value function of the CTMDPs with the T-horizon risk-sensitive criterion.
Definition 2.2
A policy π∗ ∈ Π is said to be optimal if J(π∗,0,x) ≤ J(π,0,x) for all π ∈ Π and x ∈ S.
The main goal of this paper is to give conditions for the existence of optimal Markov policies.
To this end, we give some remarks on the difference between the T-horizon risk-sensitive criterion J(π,0,x) and the risk-neutral one in Guo et al. (2015b) and Yushkevich (1977), which is defined by

$$ V(\pi,0,x):=\mathbb{E}_{x}^{\pi}\left[{{\int\limits}_{0}^{T}}{\int\limits}_{A}c(x_{t},a)\pi(da|\omega,t)dt\right], $$

where c(x,a) is assumed to be bounded and positive on K. Then, by Jensen's inequality and the monotonicity of the \(\log \) function, we have \(\ln J(\pi ,0,x)\geq \delta V(\pi ,0,x)\). Moreover, writing \(Y:={{\int \limits }_{0}^{T}}{\int \limits }_{A}c(x_{t},a)\pi (da|\omega ,t)dt\) and applying Taylor's expansion of eδY at the point \(z:=\mathbb {E}_{x}^{\pi }Y\), we have

$$ e^{\delta Y}=e^{\delta z}\left[1+\delta(Y-z)+\frac{1}{2}\delta^{2}(Y-z)^{2}+\xi\right], $$

where the remainder ξ is a random variable on \(({\Omega },\mathcal {F}, \mathbb {P}^{\pi }_{\gamma })\). Therefore, taking expectations, we have

$$ \ln J(\pi,0,x)=\delta V(\pi,0,x)+\ln\left[1+\frac{1}{2}\delta^{2}var(Y)+\mathbb{E}_{x}^{\pi}\xi\right], \ \ \ \ \ \ (2.7) $$

where \(var(Y):=\mathbb {E}_{x}^{\pi }(Y-\mathbb {E}_{x}^{\pi }Y)^{2}\) denotes the variance of Y, and the corresponding term \(\ln [1+\frac {1}{2}\delta ^{2}var(Y)]\) is called a risk premium. Thus, the difference between \(\ln J(\pi ,0,x)\) and δV(π,0,x) is the sum of the risk premium and another term of smaller order. Therefore, the risk-sensitive criterion is more risk-sensitive than the risk-neutral one because of its inclusion of the risk premium.
From Eq. 2.7 we see that \(\lim _{\delta \to 0}\frac {\ln J(\pi ,0,x)}{\delta }=V(\pi ,0,x)\). On the other hand, for any fixed δ > 0, since \(\inf _{\pi \in {\Pi }}\ln J(\pi ,0,x)=\ln J(\pi ^{*},0,x)\) if and only if \(\inf _{\pi \in {\Pi }}J(\pi ,0,x)=J(\pi ^{*},0,x)\), we will consider J(π,0,x) instead of \(\ln J(\pi ,0,x)\).
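The expansion in Eq. 2.7 and the limit \(\lim _{\delta \to 0}\frac {\ln J(\pi ,0,x)}{\delta }=V(\pi ,0,x)\) can be illustrated numerically with a hypothetical two-point total cost Y (taking the values ±1 with probability 1/2), for which all quantities are exact and no simulation is needed:

```python
import math

# Hypothetical two-point "total cost" Y = +1 or -1 with probability 1/2,
# so E[Y] = 0 (the risk-neutral value V) and var(Y) = 1.
# Then J = E[e^{delta * Y}] = cosh(delta).
def ln_J(delta):
    return math.log(0.5 * math.exp(delta) + 0.5 * math.exp(-delta))

mean_Y, var_Y = 0.0, 1.0

# Second-order expansion (2.7): ln J ~ delta*E[Y] + ln(1 + delta^2/2 * var(Y)).
for delta in (0.5, 0.1, 0.01):
    approx = delta * mean_Y + math.log(1.0 + 0.5 * delta ** 2 * var_Y)
    print(delta, ln_J(delta), approx)   # the two columns agree as delta -> 0

# As delta -> 0, ln J / delta -> E[Y] = V, the risk-neutral criterion.
print(ln_J(1e-6) / 1e-6)
```

For δ = 0.1 the exact value ln cosh(0.1) ≈ 0.0049917 and the approximation ln(1.005) ≈ 0.0049875 already agree to about 4 ⋅ 10−6, illustrating that the gap between ln J and δV is the risk premium plus a term of smaller order.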
3 Preliminaries
This section provides some preliminary facts for our arguments below.
Since the rates q(dy|x,a) and costs c(x,a) are allowed to be unbounded, we next give conditions for the non-explosion of {xt,t ≥ 0} and finiteness of J(π,t,x).
Assumption 3.1
There exist a real-valued Borel measurable function V0 ≥ 1 on S and constants ρ0, b0 ≥ 0, \(M_{0}^{\prime }, M_{0}\geq 1\), such that
- (i): \({\int \limits }_{S}V_{0}(y)q(dy|x,a)\leq \rho _{0} V_{0}(x)+b_{0}\) for all (x,a) ∈ K;
- (ii): \(q^{*}(x)\leq M_{0}^{\prime }V_{0}(x)\) for all x ∈ S, where q∗(x) is as in Eq. 2.2;
- (iii): e2Tδ|c(x,a)| ≤ M0V0(x) for all (x,a) ∈ K, with T and δ as in Eq. 2.5.
Remark 3.1
- (a) Assumptions 3.1(i,ii) are used and verified with examples in Guo and Hernández-Lerma (2009), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012) for risk-neutral CTMDPs. When the transition rates are bounded (i.e., \(\|q^{*}\|<\infty \)) (Ghosh and Saha 2014; Kitaev and Rykov 1995; Kumar and Chandan 2013, 2015; Yushkevich 1977), Assumptions 3.1(i,ii) are satisfied by taking V0(x) ≡ 1.
- (b) Assumption 3.1(iii) is for the finiteness of the value function J∗(t,x); it is new and is satisfied when c(x,a) is bounded (i.e., \(\|c\|:=\sup _{(x,a)\in K}|c(x,a)|<\infty \)) (Ghosh and Saha 2014; Kumar and Chandan 2013, 2015; Wei 2016; Yushkevich 1977). Moreover, if \(\delta |c(x,a)|\leq \sqrt {\ln V_{0}(x)}+L\) for all (x,a) ∈ K and some constant L ≥ 0, then Assumption 3.1(iii) holds. Indeed, since \(\delta |c(x,a)|\leq \sqrt {\ln V_{0}(x)}+L\leq \frac {t}{2}+\frac {\ln V_{0}(x)}{2t}+L\) for all (x,a) ∈ K and t > 0, choosing t = T we have \(e^{2T\delta |c(x,a)|} \leq e^{T^{2}+\ln V_{0}(x)+2TL}=e^{T^{2}+2TL} V_{0}(x)\), which implies Assumption 3.1(iii) with \(M_{0}:=e^{T^{2}+2TL}\).
- (c) If the number ρ0 in Assumption 3.1(i) is not positive, then Assumption 3.1(i) still holds when ρ0 is replaced with the positive number 1 + |ρ0|. Thus, it is merely for convenience that we assume ρ0 > 0 throughout the following. However, the corresponding number is assumed to be negative in Guo and Hernández-Lerma (2009), Guo et al. (2012), and Prieto-Rumeau and Hernández-Lerma (2012), or less than the discount factor in Guo and Hernández-Lerma (2009) and Guo and Piunovskiy (2011).
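The sufficient condition in Remark 3.1(b) can be checked numerically at the extreme case δ|c(x,a)| = \(\sqrt {\ln V_{0}(x)}+L\); the parameter values below are arbitrary illustrative choices:

```python
import math

# Remark 3.1(b): if delta*|c(x,a)| <= sqrt(ln V0(x)) + L, then
# e^{2*T*delta*|c(x,a)|} <= e^{T^2 + 2*T*L} * V0(x), i.e. Assumption 3.1(iii)
# holds with M0 := e^{T^2 + 2*T*L}.  We test the extreme case
# delta*|c| = sqrt(ln V0) + L over a grid of values V0 >= 1.
T, L = 2.0, 0.5
M0 = math.exp(T ** 2 + 2 * T * L)

ok = True
for v0 in (1.0, 2.0, 10.0, 1e3, 1e6):
    delta_c = math.sqrt(math.log(v0)) + L      # largest delta*|c| allowed
    ok = ok and math.exp(2 * T * delta_c) <= M0 * v0 * (1 + 1e-12)
print(ok)  # True, since 2*T*sqrt(ln V0) <= T^2 + ln V0 for every V0 >= 1
```

The check succeeds because \(2T\sqrt {\ln V_{0}}\leq T^{2}+\ln V_{0}\) is exactly \((T-\sqrt {\ln V_{0}})^{2}\geq 0\), which is the inequality used in the remark with t = T.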
Lemma 3.1
Under Assumptions 3.1(i,ii), for each π ∈ Π, the following assertions hold.
- (a): \(\mathbb {P}_{x}^{\pi }(x_{t}\in S)=1\), \(\mathbb {P}_{x}^{\pi }(T_{\infty }=\infty )=1\), and \( \mathbb {P}_{x}^{\pi }(x_{0}=x)=1\) for each t ≥ 0 and x ∈ S;
- (b): \(\mathbb {E}_{x}^{\pi }[V_{0}(x_{t})] \leq e^{\rho _{0}t}[V_{0}(x)+\frac {b_{0}}{\rho _{0}}]\) for each t ≥ 0, x ∈ S and π ∈ Π;
- (c): \(\mathbb {E}_{\gamma }^{\pi }[V_{0}(x_{t})|x_{s}=x] \leq e^{\rho _{0}(t-s)}[V_{0}(x)+\frac {b_{0}}{\rho _{0}}]\) for each t ≥ s ≥ 0, x ∈ S and \(\pi \in {{\Pi }_{m}^{r}}\);
- (d): If, in addition, Assumption 3.1(iii) is satisfied, then
- (d1): \( e^{-LV_{0}(x)}\leq J(\pi ,0,x)\leq LV_{0}(x) \) for x ∈ S and π ∈ Π, where \(L:=M_{0}e^{\rho _{0}T}[1+\frac {b_{0}}{\rho _{0}}]\);
- (d2): \(e^{-LV_{0}(x)}\leq J(\pi ,t,x)\leq LV_{0}(x)\) for (t,x) ∈ [0,T] × S and \({\pi \in {\Pi }_{m}^{r}}\).
Proof
Parts (a) and (b) follow from Guo et al. (2012) and Guo and Piunovskiy (2011) (or Piunovskiy and Zhang 2011), while part (c) follows from Theorem 3.1 in Guo (2007). We next prove part (d). For \(\mathbb {P}_{x}^{\pi }\)-almost all (a.s.) ω := (x0,𝜃1,x1,…,𝜃k,xk,…) ∈Ω, let k (depending on ω) be determined by Tk(ω) ≤ T < Tk+1(ω). Then, it follows from (a) and Eq. 2.4 that k is finite and {xt(ω),t ∈ [0,T]} = {X0(ω),…,Xk(ω)}. Thus, since \(|c(x,a)|\leq \frac {1}{T\delta }\ln \sqrt {M_{0}V_{0}(x)}\) (by Assumption 3.1(iii)) and 𝜃i+1 = Ti+1(ω) − Ti(ω) (i = 0,…,k − 1), we have
which implies that \({{\int \limits }_{0}^{s}} {\int \limits }_{A} c(x_{t},a)\pi (da|\omega ,t)dt\) is real-valued Borel measurable in s ∈ [0,T] (for the given ω). Thus, by the Jensen inequality with respect to the probability measure \(\frac {dt}{T}\) on \(\mathcal {{B}}([0,T])\), we have
Therefore (by (b)),
Hence, by Eq. 2.5, we have
On the other hand, since Tδ|c(x,a)|≤ e2Tδ|c(x,a)|≤ M0V0(x), we have
which, together with Eq. 3.1, implies (d1). Similarly, we see that (d2) is also true. □
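The bound in Lemma 3.1(b) can be sanity-checked by simulation. For a hypothetical birth-death model with V0(x) = x + 1, births at rate λ, and deaths at rate μx (toy data of our own choosing), we have \({\int \limits }_{S}V_{0}(y)q(dy|x,a)=\lambda -\mu x\leq \rho _{0}V_{0}(x)+b_{0}\) with ρ0 = 1 and b0 = λ, so Assumption 3.1(i) holds and the bound in (b) applies:

```python
import math
import random

lam, mu, rho0, b0 = 1.0, 1.0, 1.0, 1.0
x0, t_end = 0, 1.0
rng = random.Random(0)

def V0_at_t():
    """Simulate the birth-death chain up to t_end and return V0(x_{t_end})."""
    t, x = 0.0, x0
    while True:
        up, down = lam, mu * x
        q_total = up + down
        theta = rng.expovariate(q_total)     # q_total >= lam > 0 here
        if t + theta >= t_end:
            return x + 1.0                   # V0(x) = x + 1
        t += theta
        x = x + 1 if rng.random() < up / q_total else x - 1

n = 3000
mc = sum(V0_at_t() for _ in range(n)) / n    # Monte Carlo E_x[V0(x_t)]
bound = math.exp(rho0 * t_end) * (x0 + 1.0 + b0 / rho0)
print(mc <= bound)  # True: the sample mean stays well below the bound
```

With these parameters the bound equals e ⋅ 2 ≈ 5.44, while the sample mean stays near 1.6; the Lyapunov bound of Lemma 3.1(b) is not tight, but it is exactly what is needed for the finiteness arguments above.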
Lemma 3.1 gives conditions for the finiteness of J(π,t,x) as well as the non-explosion of {xt,t ≥ 0}. In order to deal with the optimality for history-dependent policies, we need the following facts, which extend both the analog of Ito-Dynkin's formula in Guo et al. (2015b) and the Feynman-Kac formula in Wei (2016) to the more general case of possibly non-Markov processes {xt,t ≥ 0} and functions φ(ω,t,x) with the additional argument ω ∈Ω.
Denote by mL the Lebesgue measure on [0,T], and by \(\mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\) the set of real-valued and \(\mathcal {{P}}\times \mathcal {{B}}(S)\)-measurable functions φ(ω,t,x) with the following feature: given any x ∈ S, π ∈ Π, and a.s. ω ∈Ω with respect to \(\mathbb {P}_{x}^{\pi }\), there exists a Borel subset E(φ,ω,x,π) (depending on φ,ω,x,π) of [0,T] such that the partial derivative \(\varphi ^{\prime }(\omega ,t,x)\) (with respect to t) exists for every t ∈ E(φ,ω,x,π) and \(m_{L}(E_{(\varphi ,\omega ,x,\pi )}^{c})=0\). Obviously, if a function φ(ω,t,x) in \(\mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\) is independent of ω (written as φ(t,x)), then so is the corresponding E(φ,ω,x,π), which will be denoted by E(φ,x).
For any \(\varphi (\omega , t,x)\in \mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\), when the partial derivative does not exist at some (ω,t,x) ∈Ω× [0,T]×S, for simplicity of the arguments we take \(\varphi ^{\prime }(\omega ,t,x)\) to be any real number, so that \(\varphi ^{\prime }\) is defined on Ω × [0,T] × S. We will see that such a modification of \(\varphi ^{\prime }\) loses nothing. For example, \(\mathbb {E}_{x}^{\pi } \left [{\int \limits }_{0}^{T}|\varphi ^{\prime }(\omega ,t,x_{t})|dt\right ]\) and \({{\int \limits }_{0}^{T}}|\varphi ^{\prime }(\omega ,t,x_{t})|dt\) are well defined for each (x,π) ∈ S × Π and a.s. ω ∈Ω with respect to \(\mathbb {P}_{x}^{\pi }\), respectively. Indeed, for each x ∈ S, π ∈ Π, and s ∈ [0,T], Lemma 3.1(a) together with Eq. 2.4 gives the existence of a Borel measurable subset \(\hat {\Omega }\) of Ω such that: 1) \(\mathbb {P}_{x}^{\pi }(\hat {\Omega }^{c})=0\), and 2) for each given \(\omega \in \hat {\Omega }\), xt(ω) = xi for Ti(ω) ≤ t < Ti+1(ω) (0 ≤ i ≤ k) for some finite k, where k and xi ∈ S depend on ω, and k is determined by Tk(ω) ≤ T < Tk+1(ω). Let \(\tilde i:={\min \limits } \{l: s<T_{l} \}\) (depending on ω and s). Then, since \(m_{L}(\cup _{i=0}^{k} E^{c}_{(\varphi ,\omega ,x_{i},\pi )})=0\), for each \(\omega \in \hat {\Omega }\),
is well defined. Therefore (by \(\mathbb {P}_{x}^{\pi }(\hat {\Omega }^{c})=0\)), \(\mathbb {E}_{x}^{\pi } \left [{\int \limits }_{0}^{T}|\varphi ^{\prime }(\omega ,t,x_{t})|dt\right ]\) is also well defined.
Lemma 3.2
Under Assumption 3.1, the following assertions hold.
- (a): For any given π ∈ Π and x ∈ S, if a function \(\varphi \in \mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\) satisfies
$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T}{\int}_{A}{\int}_{S}|\varphi(\omega,t,y)| |q|(dy|x_{t},a)\pi(da|\omega,t)dt\right]\\&&+\mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}|\varphi^{\prime}(\omega,t,x_{t})|dt\right]<\infty, \end{array} $$(3.2)where \(|q|(dy|x,a):=q(dy\setminus \{x\}|x,a)-q(\{x\}|x,a)\delta _{x}(dy)\) for all (x,a) ∈ K, then
$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T} \left( \varphi^{\prime}(\omega,t,x_{t})+{\int}_{A}{\int}_{S}\varphi(\omega,t,y)q(dy|x_{t},a)\pi(da|\omega,t)\right)dt \right]\\ &=&\mathbb{E}_{x}^{\pi}\varphi(\omega,T,x_{T})-\mathbb{E}_{x}^{\pi}\varphi(\omega,0,x). \end{array} $$ - (b):
For any \(\pi =\pi _{t}(da|\cdot ){\in {\Pi }_{m}^{r}}\) and any Borel measurable function \(\phi \in \mathbb {B}_{\mathcal { P}}([0,T]\times S)\) such that the corresponding function \(\varphi (\omega ,t,x):=e^{{{\int \limits }_{s}^{t}} \delta c(x_{v}(\omega ),\pi _{v})dv}\phi (t,x)\) satisfies (3.2), we have
$$ \mathbb{E}_{\gamma}^{\pi}\!\left[{\int}_{s}^{T} \!\left( \!\left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,x_{t})\!\right)'+{\int}_{S}e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,y)q(dy|x_{t},\pi_{t})\!\right)dt|x_{s}=x\right]$$$${}=\mathbb{E}_{\gamma}^{\pi}\left[e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}\phi(T,x_{T})|x_{s}=x\right]-\phi(s,x),\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ $$where \(c(x,\pi _{t}):={\int \limits }_{A(x)}c(x,a)\pi _{t}(da|x)\) and \(q(dy|x,\pi _{t}):={\int \limits }_{A(x)} q(dy|x,a)\pi _{t}(da|x)\).
- (c): If the functions c(x,a) and ϕ(t,x) in (b) above are both bounded and \(|\phi ^{\prime }(t,x)|\leq CV_{0}(x)\) for all (t,x) ∈ [0,T] × S with some constant C > 0, then
$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!&&\mathbb{E}_{\gamma}^{\pi}\left[{\int}_{s}^{T} \left( \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,x_{t})\right)'+{\int}_{S}e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,y)q(dy|x_{t},\pi_{t})\right)dt|x_{s}=x\right]\\ \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!&=& \left\{\begin{array}{lll} \mathbb{E}_{x}^{\pi}\left( e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}\phi(T,x_{T})\right)-\phi(0,x), \ \ \ \text{for } \ s=0, \pi\in {\Pi} \\ \mathbb{E}_{\gamma}^{\pi}\left[e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}\phi(T,x_{T})|x_{s}=x\right]-\phi(s,x), \ \text{for } \ s\in [0,T],\pi\in {{\Pi}_{m}^{r}}. \end{array}\right. \end{array} $$
Proof
- (a)
Lemma 3.1(a) together with Eq. 2.4 gives the existence of a Borel measurable subset \(\hat {\Omega }\) of Ω such that \(\mathbb {P}_{x}^{\pi }(\hat {\Omega }^{c})=0\). For each \(\omega \in \hat {\Omega }\) and 0 ≤ t ≤ T, recalling definitions (2.3) and (2.4), denote \(k_{t}(\omega ):=\max \limits \{k|T_{k}(\omega )\le t\}\). Since xt(ω) is right-continuous on \([0,\infty )\), we have \(x_{t}=x_{T_{k_{t}}}\) when \(T_{k_{t}}\le t<T_{k_{t}+1}\). Since \({{\int \limits }_{s}^{T}} \varphi ^{\prime }(\omega ,t,x_{t})dt\) is well defined (as shown above), for 0 ≤ s ≤ T and \(\omega \in \hat {\Omega }\), we have
$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\varphi(\omega,T,x_{T}) = \varphi(\omega,s,x_{s}) + {{\int}_{s}^{T}} \varphi^{\prime}(\omega,t,x_{t})dt + \sum\limits_{n=k_{s}+1}^{k_{T}}{\int}_{(s,T]}{\Delta} \varphi(\omega,t,x_{t})\delta_{T_{n} }(dt) \ \ \ \mathbb{P}_{x}^{\pi} - a.s. \end{array} $$(3.3)Denote Δφ(ω,t,xt) := φ(ω,t,xt(ω)) − φ(ω,t,xt−(ω)), and recall that \(x_{t-}(\omega )=\lim _{s\to t-}x_{s}(\omega )\). Equation 3.3 is proved as follows.
$$ \begin{array}{@{}rcl@{}} &&\varphi(\omega,T,x_{T})\\ &=&\varphi(\omega,s,x_{s})+\varphi(\omega,T_{k_{s}+1},x_{T_{k_{s}+1}-})-\varphi(\omega,s,x_{s})\\ &&+\sum\limits_{n=k_{s}+1}^{k_{T}}\left[\varphi(\omega,T_{n},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}-})\right]+\sum\limits_{n=k_{s}+1}^{k_{T}-1}\left[ \varphi(\omega,T_{n+1},x_{T_{n+1}-})-\varphi(\omega,T_{n},x_{T_{n}})\right]\\ &&+\varphi(\omega,T,x_{T-})-\varphi(\omega,T_{k_{T}},x_{T_{k_{T}}})+\varphi(\omega,T,x_{T})-\varphi(\omega,T,x_{T-})\\ &=&\varphi(\omega,s,x_{s})+\varphi(\omega,T_{k_{s}+1},x_{s})-\varphi(\omega,s,x_{s})\\ &&+\sum\limits_{n=k_{s}+1}^{k_{T}}\left[\varphi(\omega,T_{n},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}-})\right]+\sum\limits_{n=k_{s}+1}^{k_{T}-1}\left[ \varphi(\omega,T_{n+1},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}})\right]\\ &&+\varphi(\omega,T,x_{T_{k_{T}}})-\varphi(\omega,T_{k_{T}},x_{T_{k_{T}}})+\varphi(\omega,T,x_{T})-\varphi(\omega,T,x_{T-})\\ &=&\varphi(\omega,s,x_{s})+\varphi(\omega,T_{k_{s}+1},x_{s})-\varphi(\omega,s,x_{s})+\sum\limits_{n=k_{s}+1}^{k_{T}-1}\left[ \varphi(\omega,T_{n+1},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}})\right]\\ &&+\varphi(\omega,T,x_{T_{k_{T}}})-\varphi(\omega,T_{k_{T}},x_{T_{k_{T}}})\\ &&+\sum\limits_{n=k_{s}+1}^{k_{T}}\left[\varphi(\omega,T_{n},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}-})\right]+\varphi(\omega,T,x_{T})-\varphi(\omega,T,x_{T-})\\ &=&\varphi(\omega,s,x_{s})+{\int}_{[s,T_{k_{s}+1})}\varphi^{\prime}(\omega,t,x_{s})dt+\sum\limits_{n=k_{s}+1}^{k_{T}-1}{\int}_{[T_{n},T_{n+1})} \varphi^{\prime}(\omega,t,x_{T_{n}})dt\\ &&+{\int}_{[T_{k_{T}},T)}\varphi^{\prime}(\omega,t,x_{T_{k_{T}}})dt+\sum\limits_{n=k_{s}+1}^{k_{T}}{\Delta}\varphi(\omega,T_{n},x_{T_{n}})+{\Delta}\varphi(\omega,T,x_{T})\\ &=&\varphi(\omega,s,x_{s})+{\int}_{[s,T_{k_{s}+1})}\varphi^{\prime}(\omega,t,x_{t})dt+\sum\limits_{n=k_{s}+1}^{k_{T}-1}{\int}_{[T_{n},T_{n+1})} \varphi^{\prime}(\omega,t,x_{t})dt\\ 
&&+{\int}_{[T_{k_{T}},T)}\varphi^{\prime}(\omega,t,x_{t})dt+\sum\limits_{n=k_{s}+1}^{k_{T}}{\int}_{(s,T]}{\Delta}\varphi(\omega,t,x_{t})\delta_{T_{n}}(dt)\\ &=&\varphi(\omega,s,x_{s})+{{\int}_{s}^{T}} \varphi^{\prime}(\omega,t,x_{t})dt+\sum\limits_{n=k_{s}+1}^{k_{T}}{\int}_{(s,T]}{\Delta} \varphi(\omega,t,x_{t})\delta_{T_{n} }(dt) \ \ \ \mathbb{P}_{x}^{\pi}-a.s. \end{array} $$
Then, Lemma 4.28 in Kitaev and Rykov (1995) shows that the random measure mπ defined by
is the dual predictable projection of the random measure \({\sum }_{n\ge 1}\delta _{(T_{n},X_{n-1})}(dt,dx)\) on \({\mathscr{B}}((0,\infty )\times S)\) under \(\mathbb {P}_{x}^{\pi }\). Thus, by the definition of a dual predictable projection in (4.5) in Kitaev and Rykov (1995), we have
which, together with Eq. 3.4 and the expectation of both sides of Eq. 3.3 with s = 0, gives
Here, integrability results such as Eq. 3.2 validate all the involved operations. Moreover, since xt−(ω) = xt(ω) on (0,T] except at finitely many time points t, part (a) follows.
- (b)
Replacing φ(ω,t,xt) with \(e^{{{\int \limits }_{s}^{t}}{\int \limits }_{A}\delta c(x_{v},a)\pi _{v}(da|x_{v})dv}\phi (t,x_{t})\), taking the conditional expectation \(\mathbb {E}_{\gamma }^{\pi }[\cdot |x_{s}=x]\) on both sides of Eq. 3.3 and using the Markov property of {xt,t ≥ 0}, as in the proof of (a), we see that (b) is also true.
- (c)
Under the condition in (c), the function φ(ω,t,x) in (b) is bounded on Ω × [0,T] × S, and thus the first part of Eq. 3.2 is finite (by Lemma 3.1(b) and Assumptions 3.1(i,ii)). Moreover, since
$$ \begin{array}{@{}rcl@{}} \left( e^{{{\int}_{s}^{t}}{\int}_{A}|\delta c(x_{v},a)|\pi_{v}(da|x_{v})dv}\phi(t,x_{t})\right)'&=&\delta |c(x_{t},\pi_{t})|e^{{{\int}_{s}^{t}}{\int}_{A}|\delta c(x_{v},a)|\pi_{v}(da|x_{v})dv}\phi(t,x_{t})\\ && \ \ +e^{{{\int}_{s}^{t}}{\int}_{A}|\delta c(x_{v},a)|\pi_{v}(da|x_{v})dv}\phi^{\prime}(t,x_{t})\\ &\leq & \delta\|c\|e^{T\delta\|c\|}\|\phi\|+Ce^{T\delta\|c\|}V_{0}(x_{t}) \end{array} $$which, together with Lemma 3.1(b), implies that the second part of Eq. 3.2 is finite. Thus, (c) follows.
□
Next we derive an extension of Dynkin’s formula by Lemma 3.2. To do so, we introduce the following condition and notation.
Assumption 3.2
There exist a Borel measurable function V1 ≥ 1 on S, and positive constants ρ1,b1, and M1, such that
- (i):
-
\({\int \limits }_{S}{V_{1}^{2}}(y)q(dy|x,a)\leq \rho _{1}{V_{1}^{2}}(x)+b_{1}\) for all (x,a) ∈ K;
- (ii):
-
\({V_{0}^{2}}(x)\leq M_{1}V_{1}(x)\) for all x ∈ S, with V0(x) as in Assumption 3.1(i).
Assumption 3.2 is used to specify a domain for the extension of Dynkin’s formula below, and it is obviously satisfied when the transition rates are bounded (Ghosh and Saha 2014; Kumar and Chandan 2013, 2015; Yushkevich 1977).
Given the Vk (k = 0,1) as in Assumption 3.2 and any Borel set Z, a real-valued function φ on Z × S is called Vk-bounded if the Vk-weighted norm of φ, \(\|\varphi \|_{V_{k}}:=\sup _{(z,x)\in Z\times S}\frac {|\varphi (z,x)|}{V_{k}(x)}\), is finite. We denote by \(\mathbb {B}_{V_{k}}(Z\times S)\) the Banach space of all Vk-bounded functions on Z × S. When Vk(x) ≡ 1, \(\mathbb {B}_{1}(Z\times S)\) is the space of all bounded functions. In particular, taking Z = Ω × [0,T] or Z = [0,T], we define
and then
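As a toy illustration (not part of the paper), the Vk-weighted sup norm defined above can be evaluated on a finite grid. All data below (the grid Z, the weight V, and the function phi) are hypothetical and chosen only to make the definition concrete.

```python
# Toy sketch of the V-weighted sup norm  ||phi||_V = sup_{(z,x)} |phi(z,x)| / V(x)
# evaluated on a finite grid; V, phi, Z, and S below are hypothetical.

def weighted_norm(phi, V, Z, S):
    """Sup over the grid Z x S of |phi(z, x)| / V(x)."""
    return max(abs(phi(z, x)) / V(x) for z in Z for x in S)

V = lambda x: 1.0 + x * x          # a weight function V >= 1
phi = lambda z, x: z * x           # a V-bounded function on [0, 1] x S
Z = [i / 10 for i in range(11)]    # time grid for Z = [0, 1]
S = [0, 1, 2]                      # a finite "state space"

norm = weighted_norm(phi, V, Z, S)  # attained at z = 1, x = 1: 1/2 = 0.5
```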
Theorem 3.1 below extends Theorem 3.1 in Guo et al. (2015b) and Theorem 3.1 in Wei (2016) from functions of the form φ(t,x) to the more general form φ(ω,t,x) with the additional variable ω; this extension is needed for the arguments below involving history-dependent policies. The proof of Theorem 3.1 is based on Lemma 3.2, and thus differs from the proofs in Guo et al. (2015b) and Wei (2016), which treat Markov policies only.
Theorem 3.1
Suppose Assumptions 3.1 and 3.2 are satisfied. Then, for each (s,x) ∈ [0,T] × S, the following assertions hold.
- (a):
-
(The extension of Dynkin’s formula): For every π ∈ Π and \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}({\Omega }\times [0,T]\times S)\),
$$ \begin{array}{@{}rcl@{}} && \mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T} \left( \varphi^{\prime}(\omega,t,x_{t})+{\int}_{A}{\int}_{S}\varphi(\omega,t,y)q(dy|x_{t},a)\pi(da|\omega,t)\right)dt \right] \\&& \ \ =\mathbb{E}_{x}^{\pi}\varphi(\omega,T,x_{T})-\mathbb{E}_{x}^{\pi}\varphi(\omega,0,x), \end{array} $$where {xt,t ≥ 0} may not be Markovian, since the policy π may depend on histories.
- (b):
-
(The Dynkin’s formula): For each \({\pi \in {\Pi }_{m}^{r}}\), and \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\),
$$ \qquad \mathbb{E}_{\gamma}^{\pi}\left[{\int}_{s}^{T} \left( \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x_{t})\right)'+{\int}_{S}e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,y)q(dy|x_{t},\pi_{t})\right)dt|x_{s}=x\right]$$$${}=\mathbb{E}_{\gamma}^{\pi}\left[e^{{{\int}_{s}^{T}} \delta c(x_{t},\pi_{t})dt}\varphi(T,x_{T})|x_{s}=x\right]-\varphi(s,x).\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ $$
Proof
-
(a)
By the definition of \(\mathbb {B}_{V_{0},V_{1}}^{1}({\Omega }\times [0,T]\times S)\) and \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}({\Omega }\times [0,T]\times S)\), we have
$$ \begin{array}{@{}rcl@{}} |\varphi(\omega,t,x)|\leq \|\varphi\|_{V_{0}}V_{0}(x), \ |\varphi^{\prime}(\omega,t,x)|\leq \|\varphi^{\prime}\|_{V_{1}}V_{1}(x) \ \ \ \forall \ (\omega,t,x)\in {\Omega}\times [0,T]\times S. \end{array} $$(3.5)
Thus, by Assumptions 3.1(i,ii) and 3.2 we have, for (ω,t,z) ∈Ω× [0,T] × S,
$$ \begin{array}{@{}rcl@{}} {\int}_{A}{\int}_{S}|\varphi(\omega,t,y)||q|(dy|z,a)\pi(da|\omega,t) &\leq& \|\varphi\|_{V_{0}}\left[{\int}_{A}{\int}_{S}V_{0}(y)|q|(dy|z,a)\pi(da|\omega,t)\right] \\ &\leq& \|\varphi\|_{V_{0}}[\rho_{0}V_{0}(z)+b_{0}+2M_{0}^{\prime}{V_{0}^{2}}(z)] \\ &\leq & \|\varphi\|_{V_{0}}[\rho_{0}M_{1}+b_{0}+2M_{0}^{\prime}M_{1}]V_{1}(z). \end{array} $$(3.6)Moreover, by Eq. 3.5 we have
$$ \begin{array}{@{}rcl@{}} {{\int}_{0}^{T}}|\varphi^{\prime}(\omega,t,x_{t})|dt\leq\|\varphi^{\prime}\|_{V_{1}}{{\int}_{0}^{T}}V_{1}(x_{t})dt\leq\|\varphi^{\prime}\|_{V_{1}}{{\int}_{0}^{T}}{V_{1}^{2}}(x_{t})dt, \end{array} $$which, together with Lemma 3.1(b), gives
$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}|\varphi^{\prime}(\omega,t,x_{t})|dt\right] &\leq& \|\varphi^{\prime}\|_{V_{1}}\mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}{V_{1}^{2}}(x_{t})dt\right]\\&\leq&\|\varphi^{\prime}\|_{V_{1}}Te^{\rho_{1}T}\left[\frac{{V_{1}^{2}}(x)}{T\rho_{1}}+\frac{b_{1}}{\rho_{1}}\right]<\infty. \end{array} $$(3.7)Thus, by Eqs. 3.6–3.7 and Lemma 3.1(b) we have
$$ \begin{array}{@{}rcl@{}} && \mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}{\int}_{A}{\int}_{S}|\varphi(\omega,t,y)||q|(dy|x_{t},a)\pi(da|\omega,t)dt\right] \\ &\leq & T\|\varphi\|_{V_{0}}[\rho_{0}M_{1}+b_{0}+2M_{0}^{\prime}M_{1}]e^{\rho_{1}T}\left[\frac{{V_{1}^{2}}(x)}{T\rho_{1}}+\frac{b_{1}}{\rho_{1}}\right], \ \ \ \ \ \end{array} $$which, together with Eq. 3.7 and Lemma 3.2(a), implies (a).
-
(b)
For any fixed x ∈ S, a.e. t ∈ [0,T] (depending on ω), and 0 ≤ s ≤ t, since
$$ \begin{array}{@{}rcl@{}} \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x)\right)'=\delta c(x_{t},\pi_{t})e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x)+e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi^{\prime}(t,x), \end{array} $$by \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) and Tδ|c(x,a)|≤ M0V0(x) (using Assumption 3.1(iii)) we have
$$ \begin{array}{@{}rcl@{}} \left|\left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x)\right)'\right|&\leq & \frac{M_{0}}{T}\|\varphi\|_{V_{0}}V_{0}(x)e^{{{\int}_{s}^{t}} |\delta c(x_{v},\pi_{v})|dv}V_{0}(x) \\ && +\|\varphi^{\prime}\|_{V_{1}}e^{{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}V_{1}(x), \\ &\leq & \left( \frac{\|\varphi\|_{V_{0}}M_{0}M_{1}}{T}+\|\varphi^{\prime}\|_{V_{1}}\right)e^{{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}V_{1}(x). \end{array} $$which, together with the same arguments for Eq. 3.3, gives
$$ \begin{array}{@{}rcl@{}} &&\left|\left( e^{{{\int}_{s}^{t}}\delta c(x_{v}(\omega),\pi_{v})dv}\varphi(t,x_{t}(\omega))\right)'\right| \\&\leq& \left( \frac{\|\varphi\|_{V_{0}}M_{0}M_{1}}{T}+\|\varphi^{\prime}\|_{V_{1}}\right)e^{{{\int}_{s}^{t}}|\delta c(x_{v}(\omega),\pi_{v})|dv}V_{1}(x_{t}(\omega)) \end{array} $$(3.8)for a.s. ω ∈Ω with respect to \(\mathbb {P}_{x}^{\pi }\), and a.e. t ∈ [0,T] (depending on the given ω).
On the other hand, by the Hölder inequality we have
Furthermore, by the arguments similar to the proof of Eq. 3.1, we also have
which, together with Lemma 3.1(b) and Eq. 3.9, implies
Also, by Assumptions 3.1 and 3.2, we have
Thus, by Eqs. 3.8–3.12 we have
which, together with Lemma 3.2, verifies (b). □
Theorem 3.2
Under Assumptions 3.1 and 3.2, the following assertions hold.
- (a):
-
If there exists \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) such that
$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\inf_{a\in A(x)}\left[\delta c(x,a)\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,a)\right]=0, \\ \varphi(T,x)\equiv 1, \end{array}\right. \end{array} $$(3.13)for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\), then
- (a1):
-
J(π,0,x) ≥ φ(0,x), for all π ∈ Π and x ∈ S, and
- (a2):
-
J(π,t,x) ≥ φ(t,x), for all \(\pi \in {{\Pi }_{m}^{r}}\) and (t,x) ∈ [0,T] × S.
- (b):
-
For any randomized Markov policy \(\pi \in {{\Pi }_{m}^{r}}\), J(π,t,x) is the unique solution in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) of the following equation
$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\delta c(x,\pi_{t})\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,\pi_{t})=0 \\ \varphi(T,x)=1 \end{array}\right. \end{array} $$(3.14)for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).
Proof
-
(a)
For each ω ∈Ω, under the conditions for (a) we have
$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\varphi^{\prime}\!(t,x) + {\int}_{A}\delta c(x\!,a)\pi(da|\omega,t)\varphi(t,x) + {\int}_{S}\!{\int}_{A}\varphi(t,y)q(dy|x,a)\pi(da|\omega,t)\!\geq\! 0 \end{array} $$(3.15)for all x ∈ S and t ∈ E(φ,x).
On the other hand, for a.s. ω ∈Ω (with respect to \(\mathbb {P}_{x}^{\pi }\)), since the path {xt(ω),t ∈ [0,T]} takes only the finitely many values {x0,…,xk} (xi ∈ S,0 ≤ i ≤ k) for some finite k (depending on ω) determined by Tk(ω) ≤ T < Tk+ 1(ω), and \(m_{L}(E_{(\varphi ,x_{i})}^{c})=0\), by Eq. 3.15 we have
for a.s. ω ∈Ω and a.e. t ∈ [0,T] (because of \(m_{L}(\cup _{i=0}^{k} E_{(\varphi ,x_{i})}^{c})=0\)).
Let \(\bar c(\omega ,t,x,\pi ):={\int \limits }_{A}\delta c(x_{t},a)\pi (da|\omega ,t)\). Then, we have, for any 0 ≤ s ≤ T,
The proof of (a1): For each π ∈ Π and x ∈ S, since φ(T,y) = 1 for all y ∈ S, by Theorem 3.1(a) and Eq. 3.16 we have
and so
which implies (a1).
Similarly, by Theorem 3.1(b) we see that (a2) is also true.
- (b)
If there exists a \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.14), by (a2) and φ(T,x) ≡ 1, we have
$$\varphi(s,x)=\mathbb{E}_{\gamma}^{\pi} \left[e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}|x_{s}=x\right]=J(\pi,s,x), \ \ s\in [0,T],$$and so (b) follows. Thus, to complete the proof of (b), it suffices to show the existence of \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.14), the proof of which is rather long and is therefore postponed to Section 4; see Remark 4.2 below.
□
To establish the existence of \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.13), we need Lemma 3.3 below. To state it, we recall some concepts. A subset of a Borel space X is analytic (by Proposition 7.41 in Bertsekas and Shreve 1996) if it is a projection into X of a Borel subset of X × Y for some uncountable Borel space Y. Then, a function u(⋅) on X is called upper semianalytic if {x ∈ X : u(x) > r} is an analytic set for each \(r\in (-\infty ,\infty )\). It is known that each Borel-measurable function is upper semianalytic; see more details in Chapter 7 of Bertsekas and Shreve (1996). Hence, each Borel-measurable function, such as c(x,a), is upper semianalytic on K.
Lemma 3.3
Suppose that Assumption 3.1 holds. For any \(u(t,x)\in \mathbb {B}_{V_{0}}([0,T]\times S)\), define a corresponding function \(u_{*}(t,x): \ [0,T]\times S\longrightarrow (-\infty ,\infty )\) by
Then, the following assertions hold.
- (a):
-
The function u∗(t,x) is upper semianalytic (and hence universally measurable).
- (b):
-
For every ε > 0, there exists a deterministic Markov policy \(f_{\varepsilon }\in {{\Pi }_{m}^{d}}\) (depending on ε) such that
$$ \begin{array}{@{}rcl@{}} &&\delta c(x,f_{\varepsilon}(t,x))u(t,x)+{\int}_{S}u(t,y)q(dy|x,f_{\varepsilon}(t,x))\\&\leq& u_{*}(t,x)+\varepsilon \ \ \forall \ (t,x)\in [0,\infty)\times S. \end{array} $$
Proof
See Lemma 3.3 in Guo et al. (2015b) or (Bertsekas and Shreve 1996, Propositions 7.47 and 7.50). □
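For a finite action set, the ε-optimal selector of Lemma 3.3(b) can be realized by direct enumeration. The following sketch uses a hypothetical finite model (the sets S and A, the rates q, the costs c, the constant delta, and the function u are all invented for illustration); for a finite A the infimum is attained, so the selector is in fact exactly optimal.

```python
# Sketch of an epsilon-optimal selector as in Lemma 3.3(b) for a hypothetical
# finite model: f_eps(t, x) picks an action whose bracketed value is within
# eps of u_*(t, x) = inf_{a in A(x)} [ delta*c(x,a)*u(t,x) + sum_y u(t,y)*q(y|x,a) ].

delta = 0.5
S, A = [0, 1], [0, 1, 2]

# Hypothetical conservative transition rates: q[(x, a)][y], each row sums to 0.
q = {}
for a in A:
    q[(0, a)] = (-1.0 - 0.1 * a, 1.0 + 0.1 * a)
    q[(1, a)] = (0.5 + 0.2 * a, -0.5 - 0.2 * a)
# Hypothetical cost rates.
c = {(x, a): 1.0 + x - 0.3 * a for x in S for a in A}

def bracket(t, x, a, u):
    """The expression delta*c(x,a)*u(t,x) + sum_y u(t,y)*q(y|x,a)."""
    return delta * c[(x, a)] * u(t, x) + sum(u(t, y) * q[(x, a)][y] for y in S)

def u_star(t, x, u):
    """The infimum over the (finite) action set, as in Lemma 3.3."""
    return min(bracket(t, x, a, u) for a in A)

def f_eps(t, x, u, eps):
    """Return an action within eps of the infimum (Lemma 3.3(b))."""
    target = u_star(t, x, u) + eps
    return next(a for a in A if bracket(t, x, a, u) <= target)

u = lambda t, x: 1.0 + t * (x + 1)   # a hypothetical function in B_{V0}
a0 = f_eps(0.5, 0, u, 1e-6)
```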
4 The existence of optimal Markov policies
In this section, we prove the existence of an optimal Markov policy and of a solution to the following optimality equation (4.1) for finite-horizon CTMDPs with the risk-sensitive criterion. The proofs proceed in three steps: 1) consider the case of bounded transition and cost rates, 2) deal with the case of unbounded transition rates but nonnegative costs, and 3) study the case of unbounded transition and cost rates.
Suppose that f(x) is defined on X. Denote \(\|f\|:=\sup _{x\in X}|f(x)|\). The following results are for the case of bounded transition and cost rates.
Proposition 4.1
If ∥q∥ and ∥c∥ are finite, then the following assertions hold.
- (a):
-
There exists a unique φ in \(\mathbb {B}^{1}_{1,1}([0,T]\times S)\) (that is, V0(x) = V1(x) ≡ 1,x ∈ S) satisfying the following optimality equation for the risk-sensitive criterion of CTMDPs on the finite horizon:
$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\inf_{a\in A(x)}[\delta c(x,a)\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,a)]=0, & \\ \varphi(T,x)=1, \end{array}\right. \end{array} $$(4.1)for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).
- (b):
-
\(\varphi (t,x)=J_{*}(t,x)=\inf _{f\in {{\Pi }_{m}^{d}} }J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).
- (c):
-
For any ε > 0, there exists \(f_{\varepsilon }\in {{\Pi }_{m}^{d}}\) such that J(fε,t,x) ≤ φ(t,x) + ε for all (t,x) ∈ [0,T] × S.
Proof
-
(a)
Define the following operator B on \(\mathbb {B}_{1}([0,T]\times S)\) (actually the space of all bounded functions) by
$$ \begin{array}{@{}rcl@{}} B\psi(t,x):=1+{{\int}_{t}^{T}} \inf_{a\in A(x)}\left[\delta c(x,a)\psi(s,x)+{\int}_{S}\psi(s,y)q(dy|x,a)\right]ds \end{array} $$(4.2)for any (t,x) ∈ [0,T] × S and \(\psi \in \mathbb {B}_{1}([0,T]\times S)\).
Then, for each (t,x) ∈ [0,T] × S, and any \(\psi _{1},\psi _{2}\in \mathbb {B}_{1}([0,T]\times S)\), from Eq. 4.2 and q({x}|x,a) + q(S ∖{x}|x,a) ≡ 0 we obtain
where \(\tilde {L}:=\delta \|c\|+2\|q\|<\infty \). Furthermore, by induction we can prove the following fact:
Since \({\sum }_{n=1}^{\infty } \tilde {L}^{n} \frac {T^{n}}{n!}\|\psi _{1}-\psi _{2}\|<\infty \), there exists some integer k such that the constant \(\beta := \tilde {L}^{k} \frac {T^{k}}{k!}<1\). Thus, by Eq. 4.3 we have ∥Bkψ1 − Bkψ2∥≤ β∥ψ1 − ψ2∥. Therefore, B is a k-step contraction operator, and hence there exists a function \(\varphi \in \mathbb {B}_{1}([0,T]\times S)\) such that Bφ = φ, that is
Since ∥q∥ and ∥c∥ are finite, by Eq. 4.4 we see that \(\varphi \in \mathbb {B}^{1}_{1,1}([0,T]\times S)\), and thus (a) follows.
- (b-c)
We are going to prove (b) and (c) together. Since \(\|q\|<\infty \) and \(\|c\|<\infty \), Assumptions 3.1 and 3.2 are satisfied by taking V0 = V1 ≡ 1. Thus, it follows from Eq. 4.1 and Theorem 3.2(a) that
$$ \begin{array}{@{}rcl@{}} J(\pi,t,x)\geq \varphi(t,x) \ \ \ \text{for \ all } \ \pi\in {{\Pi}_{m}^{r}}. \end{array} $$(4.5)Moreover, for any ε > 0, since \(\varphi \in \mathbb {B}_{1}([0,T]\times S)\), Lemma 3.3 together with Eq. 4.1 gives the existence of \(f_{\varepsilon }\in {{\Pi }_{m}^{d}}\), such that, for each x ∈ S and t ∈ E(φ,x),
$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\delta c(x,f_{\varepsilon}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f_{\varepsilon}(t,x))\leq \frac{\varepsilon}{T}e^{-T\delta\|c\|}, \\ \varphi(T,x)=1. \end{array}\right. \end{array} $$(4.6)Then, as the arguments for Eq. 3.16, by |c(x,a)|≤∥c∥ and Eq. 4.6 we have
$$ \begin{array}{@{}rcl@{}} \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},f_{\varepsilon}(v,x_{v}))dv}\varphi(t,x_{t})\right)' + {\int}_{S}q(dy|x_{t},f_{\varepsilon}(t,x_{t}))\!\left( e^{{{\int}_{s}^{t}}\delta c(x_{v},f_{\varepsilon}(v,x_{v}))dv}\varphi(t,y)\right)\!\leq\! \frac{\varepsilon}{T} \end{array} $$for every 0 ≤ s ≤ t. Thus, by Theorem 3.1(a) we have
$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{x}^{f_{\varepsilon}} \left( e^{{{\int}_{s}^{T}}\delta c(x_{v},f_{\varepsilon}(v,x_{v}))dv}\right)-\varphi(s,x)\\&=&\mathbb{E}_{x}^{f_{\varepsilon}} \left( e^{{{\int}_{s}^{T}}\delta c(x_{v},f_{\varepsilon}(v,x_{v}))dv}\varphi(T,x_{T})\right)-\varphi(s,x)\leq \varepsilon \end{array} $$and so
$$ \begin{array}{@{}rcl@{}} J(f_{\varepsilon},s,x)\leq \varphi(s,x)+\varepsilon \ \ \ \ \text{for \ all} \ (s,x)\in [0,T]\times S. \end{array} $$(4.7)Therefore, since ε can be arbitrary, by Eqs. 4.5 and 4.7 we have
$$ \begin{array}{@{}rcl@{}} \inf_{\pi\in {{\Pi}_{m}^{r}}} J(\pi,t,x)=\varphi(t,x)=\inf_{f\in {{\Pi}_{m}^{d}}} J(f,t,x), \ \text{ and} \ \ J(f_{\varepsilon},t,x)\leq \varphi(t,x)+\varepsilon, \end{array} $$for all (t,x) ∈ [0,T] × S, which, together with Eq. 2.6, gives (b) and (c).
□
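To make the fixed-point argument in the proof of Proposition 4.1(a) concrete, the following sketch iterates a left-Riemann discretization of the operator B from Eq. 4.2 on a small finite model. All model data (the state space S, action set A, rates q, costs c, and the constants delta and T) are invented for illustration; since B is a k-step contraction, repeated application converges to the fixed point φ = Bφ.

```python
# Sketch: value iteration with (a time-discretized version of) the operator
#   B psi(t, x) = 1 + int_t^T inf_a [ delta*c(x,a)*psi(s,x) + sum_y psi(s,y)*q(y|x,a) ] ds
# on a hypothetical two-state, two-action model with bounded rates.

T, delta, N = 1.0, 1.0, 200
h = T / N
S, A = [0, 1], [0, 1]

# Hypothetical bounded transition rates q[x][a] = (q(0|x,a), q(1|x,a)), rows sum to 0.
q = {0: {0: (-1.0, 1.0), 1: (-2.0, 2.0)},
     1: {0: (2.0, -2.0), 1: (0.5, -0.5)}}
# Hypothetical bounded, nonnegative cost rates c(x, a).
c = {0: {0: 1.0, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}

def apply_B(psi):
    """One application of a left-Riemann discretization of B."""
    g = [[min(delta * c[x][a] * psi[i][x]
              + sum(psi[i][y] * q[x][a][y] for y in S)
              for a in A)
          for x in S] for i in range(N + 1)]
    return [[1.0 + h * sum(g[j][x] for j in range(i, N)) for x in S]
            for i in range(N + 1)]

psi = [[1.0 for _ in S] for _ in range(N + 1)]   # start from psi == 1
for _ in range(60):                              # B is a k-step contraction
    new = apply_B(psi)
    residual = max(abs(new[i][x] - psi[i][x]) for i in range(N + 1) for x in S)
    psi = new
    if residual < 1e-12:
        break
# Terminal condition psi(T, x) = 1 holds exactly by construction; since the
# costs here are positive, psi(t, x) > 1 for t < T, mirroring J = E[e^{delta int c}] > 1.
```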
Proposition 4.1 shows the existence of a solution to the optimality equation for the bounded transition and cost rates. To further establish the existence of an optimal Markov policy, we need some conditions below.
Assumption 4.1
- (i):
-
For each x ∈ S, A(x) is compact;
- (ii):
-
For each x ∈ S and \(D\in \mathcal {{B}}(S)\), the function q(D|x,a) is continuous in a ∈ A(x);
- (iii):
-
For each x ∈ S, the functions c(x,a) and \({\int \limits }_{S}V_{0}(y)q(dy|x,a)\) are continuous in a ∈ A(x), with V0 as in Assumption 3.1.
Remark 4.1
Assumption 4.1 is used to ensure the existence of minimizers in the optimality equation (4.1).
By Lemma 8.3.7(a) in Hernández-Lerma and Lasserre (1999), under Assumption 4.1 we have the following lemma.
Lemma 4.1
Under Assumptions 4.1(ii,iii), the function \({\int \limits }_{S}q(dy|x,a)u(t,y)\) is continuous in a ∈ A(x), for every fixed (t,x) ∈ [0,T] × S and \(u\in \mathbb {B}_{V_{0}}([0,T]\times S)\).
Proposition 4.2
Under Assumption 4.1, if \(\|q\|<\infty \) and \(\|c\|<\infty \), that is, the transition and cost rates are bounded, then the following assertions hold.
- (a):
-
There exists a unique φ(t,x) in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying the optimality equation (4.1).
- (b):
-
\(\varphi (t,x)=J_{*}(t,x)=\inf _{f\in {{\Pi }_{m}^{d}}} J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).
- (c):
-
There exists a deterministic Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that
$$ \begin{array}{@{}rcl@{}} \varphi^{\prime}(t,x)+ \delta c(x,f^{*}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f^{*}(t,x))=0 \end{array} $$for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).
- (d):
-
If, in addition, c(x,a) ≥ 0 for all (x,a) ∈ K, then J∗(t,x) (and also φ(t,x)) is decreasing in t ∈ [0,T] for each fixed x ∈ S.
Proof
We only need to prove (c) and (d) since (a) and (b) follow from Proposition 4.1. For the function φ(t,x) from (a), when \(\varphi ^{\prime }(t,x)\) does not exist for some (t,x), we define
which, together with Eq. 4.1, implies that Eq. 4.8 holds for each (t,x) ∈ [0,T] × S. Hence, Proposition 7.50 in Bertsekas and Shreve (1996) together with Lemma 4.1 ensures the existence of a Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that
Then, as in the proofs of (b) and (c) in Proposition 4.1, we see that (c) is true.
(d) Fix any s,t ∈ [0,T] with s < t. Then, for any Markov policy \(f\in {{\Pi }_{m}^{d}}\), we define the corresponding Markov policy \({f_{s}^{t}}\) as follows: for each x ∈ S,
Then, we have, for each (v,x) ∈ [s,s + T − t] × S,
Let
By the Markov property of {xt,t ≥ 0} under any Markov policy f and Eqs. 4.9–4.10, we have \(J(f,t\sim T,x)= J({f_{s}^{t}},s\sim T+s-t,x)\), and thus \(J_{*}(t\sim T,x)\geq J_{*}(s\sim T+s-t,x)\). Similarly, we can prove that \(J_{*}(s\sim T+s-t,x)\geq J_{*}(t\sim T,x)\). Thus, we have \(J_{*}(t\sim T,x)=J_{*}(s\sim T+s-t,x)\). Moreover, since c(x,a) ≥ 0 on K, by Eq. 4.10 and t > s, we have \(J_{*}(t\sim T,x)=J_{*}(s\sim T+s-t,x)\leq J_{*}(s\sim T,x)\), which, together with \(J_{*}(t\sim T,x)=J_{*}(t,x)\), gives (d). □
Proposition 4.2 shows the existence of an optimal policy under bounded transition and cost rates. We next extend the results in Proposition 4.2 to the case of unbounded transition rates and nonnegative costs by approximating with models having bounded transition and cost rates.
Proposition 4.3
Under Assumptions 3.1, 3.2 and 4.1, if in addition c(x,a) ≥ 0 for all (x,a) ∈ K, then the following assertions hold.
- (a):
-
There exists a unique φ(t,x) in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying the optimality equation (4.1).
- (b):
-
\(\varphi (t,x)=J_{*}(t,x)=\inf _{f{\in {\Pi }_{m}^{d}}} J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).
- (c):
-
There exists a Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that
$$ \begin{array}{@{}rcl@{}} \varphi^{\prime}(t,x)+ \delta c(x,f^{*}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f^{*}(t,x))=0 \end{array} $$for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).
Proof
We only prove (a), since (b) and (c) can be proved by the same arguments as for (b) and (c) of Proposition 4.2.
The main thread of our proof is as follows.
First, we construct a sequence of models \( \mathcal {{M}}_{n}^{+}\) in which the transition rates and cost rates are all bounded. By Proposition 4.2, there exists a decreasing (with respect to t) function \(\varphi _{n}(t,x)\in \mathbb {B}_{1,1}^{1}([0,T]\times S)\), which is the value function of \( \mathcal {{M}}_{n}^{+}\).
Second, we prove that the limit of φn(t,x) exists, denoted by φ(t,x).
Third, we prove that φ(t,x) is the value function of the original model.
Now, we construct \( \mathcal {{M}}_{n}^{+}\) first. Under Assumption 3.1(iii), we have
For each n ≥ 1, let An(x) := A(x) for x ∈ S, Kn := {(x,a)|x ∈ S,a ∈ An(x)}, and Sn := {x ∈ S|V0(x) ≤ n}. Moreover, for each x ∈ S, a ∈ An(x), let
Fix any n ≥ 1. By Eq. 4.11, qn(dy|x,a) is indeed a transition rate on S, which is conservative and stable. By Eq. 4.12, recalling T > 0, δ > 0, M0 ≥ 1, V0(x) ≥ 1 and c(x,a) ≥ 0, we have \(c_{n}^{+}(x,a) \ge 0\) for all n ≥ 1. Then, we obtain a sequence of models \(\{\mathcal {{M}}_{n}^{+}\}\):
for which the transition rates qn(dy|x,a) and costs \(c_{n}^{+}(x,a)\) are all bounded (by Assumption 3.1 and Eqs. 4.11–4.12). In the following arguments, any quantity associated with \(\mathcal {{M}}_{n}^{+}\) is labeled by a subscript n, such as the risk-sensitive criterion Jn(f,t,x) of a Markov policy f and the value function \(J_{n}(t,x):=\inf _{f\in {{\Pi }_{m}^{d}}} J_{n}(f,t,x)\).
Obviously, Assumptions 3.1, 3.2 and 4.1 still hold for each model \({\mathscr{M}}_{n}^{+}\). Thus, for each n ≥ 1, it follows from Proposition 4.2 that there exists \(\varphi _{n}(t,x)\in \mathbb {B}_{1,1}^{1}([0,T]\times S)\) satisfying (4.1) for the corresponding \({\mathscr{M}}_{n}^{+}\), that is,
for all x ∈ S and \(t\in E_{(\varphi _{n},x)}\) with \(m_{L}(E_{(\varphi _{n},x)}^{c})=0\). Furthermore, since \(c_{n}^{+}(x,a) \ge 0\), by Proposition 4.2(d), φn(t,x) is decreasing in t ∈ [0,T] for each fixed x ∈ S.
Then, Proposition 7.50 in Bertsekas and Shreve (1996) together with Lemma 4.1 and Eq. 4.13 gives the existence of a Markov policy \(f_{n}\in {{\Pi }_{m}^{d}}\) such that,
for all x ∈ S and \(t\in E_{(\varphi _{n},x)}\) with \(m_{L}(E_{(\varphi _{n},x)}^{c})=0\).
Also, by Eq. 4.12 we have \(e^{2T\delta c_{n}^{+}(x,a)}\leq M_{0}V_{0}(x)\) for all x ∈ S and n ≥ 1. Then, using Theorem 3.2(a2) with V0 = V1 ≡ 1 and Lemma 3.1, from Eq. 4.14 we have
Moreover, since φn(t,x) ≥ 0 and \(c_{n}^{+}(x,a)\geq c_{n-1}^{+}(x,a)\) for all (x,a) ∈ K, by Eqs. 4.11 and 4.14 as well as Proposition 4.2(d), we have, for all \(x\in S, t\in E_{(\varphi _{n},x)}\) and n ≥ 2
which, together with Theorem 3.2(a2) with V0 = V1 ≡ 1, implies that Jn− 1(fn,t,x) ≤ φn(t,x) for all (t,x) ∈ [0,T] × S. Therefore, we have φn− 1(t,x) ≤ Jn− 1(fn,t,x) ≤ φn(t,x), that is, the sequence {φn,n ≥ 1} is nondecreasing in n ≥ 1, and thus the limit
exists for each (t,x) ∈ [0,T] × S.
Since φn(t,x) is decreasing in t ∈ [0,T], φ(t,x) is decreasing in t ∈ [0,T] for each fixed x ∈ S. Therefore, φ(t,x) is differentiable at a.e. t ∈ [0,T] (for each fixed x ∈ S).
Let, for every n ≥ 1 and (t,x) ∈ [0,T] × S,
We next show that \(\lim _{n\to \infty } H_{n}(t,x)=H(t,x)\) for each (t,x) ∈ [0,T] × S.
Indeed, for any fixed (t,x) ∈ [0,T] × S, there exists n0 ≥ 1 such that \((t,x)\in [0,T]\times S_{n_{0}}\), and then qn(dy|x,a) = q(dy|x,a) for all n ≥ n0 and \(\lim _{n\to \infty } c_{n}^{+}(x,a)=c(x,a)\) for all a ∈ A(x). Thus, by Lemma 8.3.7 in Hernández-Lerma and Lasserre (1999) and Eq. 4.15 we have
Hence,
On the other hand, note that \(\liminf _{n\to \infty }H_{n}(t,x)=\lim _{m\to \infty }H_{n_{m}}(t,x)\) for some subsequence {nm,m ≥ 1} of {n,n ≥ 1}. For each m ≥ 1, under Assumption 4.1, the measurable selection theorem (e.g. Proposition 7.50 in Bertsekas and Shreve 1996) together with Lemma 4.1 ensures the existence of \(f_{n_{m}}\in {{\Pi }_{m}^{d}}\) such that
Since \(f_{n_{m}}(t,x)\in A(x)\) for all m ≥ 1 and A(x) is compact, there exists a subsequence \(\{f_{n_{m_{k}}}(t,x), k\geq 1\}\) of \(\{f_{n_{m}}(t,x), m\geq 1\}\) and a(t,x) ∈ A(x) (depending on (t,x)) such that \(f_{n_{m_{k}}}(t,x)\to a(t,x)\) as \(k\to \infty \). Thus, since \(\lim _{m\to \infty }H_{n_{m}}(t,x)=\lim _{k\to \infty }H_{n_{m_{k}}}(t,x)\), using Assumption 4.1, by Lemma 8.3.7 in Hernández-Lerma and Lasserre (1999) and Eq. 4.18 we have
which, together with Eq. 4.17, implies that \(\lim _{n\to \infty } H_{n}(t,x)=H(t,x)\). Thus, by Eq. 4.13 we have
Since φ(t,x) is the integral of a measurable function, it is absolutely continuous, which again shows that φ(t,x) is differentiable at a.e. t ∈ [0,T] (for each fixed x ∈ S). One can then verify that φ(t,x) satisfies (4.1). To show that \(\varphi (t,x)\in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\), since \(\varphi (t,x)\in \mathbb {B}_{V_{0}}([0,T]\times S)\) (by Eqs. 4.15–4.16), it remains to verify that \(\varphi ^{\prime }(t,x)\) is V1-bounded. Indeed, since T|δc(x,a)|≤ eTδ|c(x,a)|≤ M0V0(x), from Eq. 4.19 we have
which implies that φ(t,x) is in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\), and thus (a) is proved. □
Next, we use Proposition 4.3 to prove our main results by approximation from nonnegative cost rates to the cost rates that may be unbounded from above and from below.
Theorem 4.1
Under Assumptions 3.1, 3.2 and 4.1, the following assertions hold.
- (a):
-
There exists a unique φ(t,x) in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying the optimality equation (4.1).
- (b):
-
\(\varphi (t,x)=\inf _{{\pi \in {\Pi }_{m}^{r}}}J(\pi ,t,x)=\inf _{f{\in {\Pi }_{m}^{d}}}J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).
- (c):
-
There exists a Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that
$$ \begin{array}{@{}rcl@{}} \varphi^{\prime}(t,x)+ \delta c(x,f^{*}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f^{*}(t,x))=0 \end{array} $$for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\), and f∗ is optimal.
Proof
We only prove (a) and the optimality of the policy f∗, since the others can be proved as in (b) and (c) of Proposition 4.3. For each n ≥ 1, define cn on K as follows: for each (x,a) ∈ K,
which implies that \(\lim _{n\to \infty } c_{n}(x,a)=c(x,a)\) and kn(x,a) := cn(x,a) + n ≥ 0 for each (x,a) ∈ K and n ≥ 1. Moreover, it follows from Assumption 3.1(iii) that
Thus, \(e^{2T\delta k_{n}(x,a)}\leq e^{2Tn\delta }M_{0}V_{0}(x)\) for all (x,a) ∈ K for all n ≥ 1, and so Assumptions 3.1 (with M0 replaced by e2TnδM0), 3.2 and 4.1 still hold for each model \(\mathcal {{N}}_{n}\) defined by
For any real-valued Borel measurable function u on K, let
provided the integral exists. Then, for each n ≥ 1, since kn(x,a) ≥ 0, by Proposition 4.3(b) we have \(J_{k_{n}}\) is in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) and satisfies
for all x ∈ S and \(t\in E_{(J_{k_{n}},x)}\).
Moreover, since \(J_{k_{n}}(t,x)=J_{c_{n}+n}(t,x)=J_{c_{n}}(t,x)e^{\delta (T-t)n}\), by Eq. 4.21 we derive that
That is,
On the other hand, for each (t,x) ∈ [0,T] × S, it follows from Eq. 4.20 and Lemma 3.1(d) that
Since cn(x,a) is decreasing in n ≥ 1, so are the corresponding value functions \(J_{c_{n}}(t,x)\). Therefore, the limit \(\varphi (t,x):=\lim _{n\to \infty }J_{c_{n}}(t,x)\) exists for each (t,x) ∈ [0,T] × S. Then, as in the arguments for Proposition 4.3 with φn(t,x) replaced by \(J_{c_{n}}(t,x)\) here, from Eqs. 4.22 and 4.23 we see that (a) is also true.
Moreover, by Eq. 4.1 and (c), using Theorem 3.2 we see that f∗ is optimal. □
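The displayed definition of cn is not reproduced in this excerpt; a standard truncation consistent with the stated properties (cn decreasing in n, cn → c, and kn := cn + n ≥ 0) is cn(x,a) := max{c(x,a), −n}, which is assumed in the minimal sketch below purely for illustration.

```python
# Hedged sketch of the cost truncation in the proof of Theorem 4.1, ASSUMING
# the (elided) definition c_n(x, a) := max(c(x, a), -n), truncation from below.
# This choice is consistent with the stated properties: c_n is decreasing in n,
# c_n -> c as n -> infinity, and k_n := c_n + n >= 0.

def c_n(c_val, n):
    """Cost rate truncated from below at level -n (assumed definition)."""
    return max(c_val, -n)

def k_n(c_val, n):
    """Shifted cost k_n = c_n + n, which is nonnegative by construction."""
    return c_n(c_val, n) + n

# A hypothetical cost value that is unbounded from below in the original model:
c_val = -7.3
trunc = [c_n(c_val, n) for n in range(1, 12)]  # -1, -2, ..., decreasing toward c_val
```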
Remark 4.2
For the given \(\pi _{t}(da|x)\in {{\Pi }_{m}^{r}}\), to show the existence of a \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.14), we modify the operator B in Eq. 4.2 to the following operator Bπ:
for all (t,x) ∈ [0,T] × S. Then, a similar argument as in the proof of Proposition 4.3(a) gives the existence of a \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.14).
5 An example
Recall that Assumptions 3.1 and 4.1 above are generalizations of the corresponding ones in Guo and Hernández-Lerma (2009), Guo et al. (2012), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012). Hence, they are satisfied for all the examples and hypotheses in these references. To further illustrate the main results here, we next consider the risk-sensitive optimality problem for the cash-flow model studied in Guo et al. (2015a) for CTMDPs with mean-variance criteria.
Example 5.1 (Risk-sensitive controlled problems of cash flow)
Consider a continuous-time controlled cash-flow problem in an economic market, with the amount of cash as the state, so that the corresponding state space is \(S:=(-\infty ,+\infty )\). When the current state of the cash flow is x ∈ S, a decision-maker withdraws money of amount − a (if a < 0) or supplies money of amount a (if a ≥ 0), where a is regarded as an action. When the current state is x ∈ S and an action a ∈ A(x) is chosen, two things happen: 1) a cost is incurred at rate c(x,a); and 2) the amount x of cash remains unchanged for an exponentially distributed random time with parameter λ(x,a) ≥ 0, and then jumps to a new state according to the normal distribution N(x,σ2) for some constant σ > 0. Therefore, the transition rates of the cash flow are represented by
For this cash-flow model, the decision-maker wishes to minimize the risk-sensitive cost over a given horizon T among all policies.
To ensure the existence of an optimal policy for the cash flow model, we consider the following hypotheses:
(A1) λ(x,a) ≤ M(x2 + 1) and \(|c(x,a)|\leq \frac {1}{2\delta T}[M+\ln (x^{2}+1)]\) for all x ∈ S,a ∈ A(x) with some positive constant M, where the constants δ and T are as before;
(A2) A(x) is assumed to be a compact set of a Borel space A for each x ∈ S;
(A3) λ(x,a) and c(x,a) are Borel measurable on K and continuous in a ∈ A(x) for each fixed x ∈ S.
Under the above conditions, we have the following fact.
Proposition 5.1
Under the hypotheses A1–A3, Example 5.1 satisfies Assumptions 3.1, 3.2 and 4.1, and hence (by Theorem 4.1) there exists an optimal Markov policy.
To verify the conditions required in Theorem 4.1, let
Since \(\frac {1}{\sqrt {2\pi }\sigma }{\int \limits }_{S} (y-x)^{2k+1}e^{-\frac {(y-x)^{2}}{2\sigma ^{2}}}dy=0\) and \(\frac {1}{\sqrt {2\pi }\sigma }{\int \limits }_{S}(y-x)^{2k}e^{-\frac {(y-x)^{2}}{2\sigma ^{2}}}dy =1\cdot 3{\cdots } (2k-1)\sigma ^{2k}\) for all k = 0,1,…, using (5.1) and hypothesis A1, a direct calculation gives
Thus, hypotheses A1–A3 imply Assumptions 3.1, 3.2, and 4.1, and Theorem 4.1 then gives the existence of an optimal Markov policy.
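To illustrate how the moment identities enter this verification, take the weight function \(w(x):=x^{2}+1\) (a natural candidate consistent with A1; the paper's actual choice may differ). The k = 1 even-moment identity gives
\[ {\int \limits }_{S}w(y)\,q(dy|x,a)=\lambda (x,a)\Big [{\int \limits }_{S}(y^{2}+1)\frac {1}{\sqrt {2\pi }\sigma }e^{-\frac {(y-x)^{2}}{2\sigma ^{2}}}dy-(x^{2}+1)\Big ]=\lambda (x,a)\,\sigma ^{2}\leq M\sigma ^{2}\,w(x), \]
which is precisely the kind of drift (growth) condition on the unbounded transition rates that Assumptions 3.1 and 4.1, stated earlier in the paper, involve.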
Remark 5.1
In this example, the cost rate c(x,a) is allowed to be unbounded from above and from below, and the transition rates q(dy|x,a) can be unbounded. Thus, some of the conditions for risk-sensitive criteria in Ghosh and Saha (2014), Jaskiewicz (2007), Kumar and Chandan (2013, 2015), and Wei (2016) fail to hold for this example, because the transition and cost rates are all bounded in those works.
References
Anantharam V, Borkar VS (2017) A variational formula for risk-sensitive reward. SIAM J Control Optim 55:961–988
Basu A, Ghosh MK (2014) Zero-sum risk-sensitive stochastic games on a countable state space. Stochastic Process Appl 124:961–983
Bäuerle N, Rieder U (2014) More risk-sensitive Markov decision processes. Math Oper Res 39:105–120
Bäuerle N, Rieder U (2017) Zero-sum risk-sensitive stochastic games. Stochastic Process Appl 127:622–642
Bertsekas D, Shreve S (1996) Stochastic optimal control: the discrete-time case. Academic Press, Inc
Cavazos-Cadena R, Hernández-Hernández D (2011) Discounted approximations for risk-sensitive average criteria in Markov decision chains with finite state space. Math Oper Res 36:133–146
Feinberg EA, Mandava M, Shiryaev AN (2014) On solutions of Kolmogorov’s equations for nonhomogeneous jump Markov processes. J Math Anal Appl 411:261–270
Ghosh MK, Saha S (2014) Risk-sensitive control of continuous time Markov chains. Stochastics 86:655–675
Guo X (2007) Continuous-time Markov decision processes with discounted rewards: the case of Polish spaces. Math Oper Res 32:73–87
Guo X, Piunovskiy A (2011) Discounted continuous-time Markov decision processes with constraints: unbounded transition and loss rates. Math Oper Res 36:105–132
Guo X, Song X (2011) Discounted continuous-time constrained Markov decision processes in Polish spaces. Ann Appl Probab 21:2016–2049
Guo X, Hernández-Lerma O (2009) Continuous-time Markov decision processes: theory and applications. Springer, Berlin
Guo X, Huang Y, Song X (2012) Linear programming and constrained average optimality for general continuous-time Markov decision processes in history-dependent policies. SIAM J Control Optim 50:23–47
Guo X, Huang XX, Zhang Y (2015a) On the first passage g-mean-variance optimality for discounted continuous-time Markov decision processes. SIAM J Control Optim 53:1406–1424
Guo X, Huang XX, Huang Y (2015b) Finite-horizon optimality for continuous-time Markov decision processes with unbounded transition rates. Adv Appl Probab 47:1064–1087
Hernández-Lerma O, Lasserre JB (1999) Further topics on discrete-time Markov control processes. Springer, New York
Huang Y (2018) Finite horizon continuous-time Markov decision processes with mean and variance criteria. Discret Event Dyn Syst 28(4):539–564
Huo H, Zou X, Guo X (2017) The risk probability criterion for discounted continuous-time Markov decision processes. Discret Event Dyn Syst 27(4):675–699
Jaskiewicz A (2007) Average optimality for risk-sensitive control with general state space. Ann Appl Probab 17:654–675
Kitaev MY, Rykov V (1995) Controlled queueing systems. CRC Press, New York
Kumar KS, Chandan P (2013) Risk-sensitive control of jump process on denumerable state space with near monotone cost. Appl Math Optim 68:311–331
Kumar KS, Chandan P (2015) Risk-sensitive control of continuous-time Markov processes with denumerable state space. Stoch Anal Appl 33:863–881
Piunovskiy A, Zhang Y (2011) Discounted continuous-time Markov decision processes with unbounded rates: the convex analytic approach. SIAM J Control Optim 49:2032–2061
Prieto-Rumeau T, Hernández-Lerma O (2012) Selected topics in continuous-time controlled Markov chains and Markov games. Imperial College Press, London
Wei QD (2016) Continuous-time Markov decision processes with risk-sensitive finite-horizon cost criterion. Math Meth Oper Res 84:461–487
Wei Q, Chen X (2017) Average cost criterion induced by the regular utility function for continuous-time Markov decision processes. Discret Event Dyn Syst 27 (3):501–524
Xia L (2014) Event-based optimization of admission control in open queueing networks. Discret Event Dyn Syst 24(2):133–151
Xia L (2018) Variance minimization of parameterized Markov decision processes. Discret Event Dyn Syst 28:63–81
Yushkevich AA (1977) Controlled Markov models with countable state and continuous time. Theory Probab Appl 22:215–235
Zhang Y (2017) Continuous-time Markov decision processes with exponential utility. SIAM J Control Optim 55:2636–2660
Acknowledgments
Research supported by the National Natural Science Foundation of China (Grant No. 60171009, Grant No. 61673019, Grant No. 11931018).
Cite this article
Guo, X., Zhang, J. Risk-sensitive continuous-time Markov decision processes with unbounded rates and Borel spaces. Discrete Event Dyn Syst 29, 445–471 (2019). https://doi.org/10.1007/s10626-019-00292-y