1 Introduction

Continuous-time Markov decision processes (CTMDPs) are an important class of stochastic optimality control problems and have been widely studied; see, for instance, the monographs (Guo and Hernández-Lerma 2009; Prieto-Rumeau and Hernández-Lerma 2012) and the extensive references therein. In most existing literature on CTMDPs, the infinite-horizon expected discounted criterion (Guo 2007; Guo and Hernández-Lerma 2009; Guo and Song 2011; Guo and Piunovskiy 2011; Piunovskiy and Zhang 2011; Prieto-Rumeau and Hernández-Lerma 2012), the long-run expected average criterion (Guo and Hernández-Lerma 2009; Guo et al. 2012; Prieto-Rumeau and Hernández-Lerma 2012; Wei and Chen 2017; Xia 2014), and the finite-horizon expected criterion (Guo et al. 2015b; Yushkevich 1977), are the commonly used optimality criteria. All these expected criteria are risk-neutral and cannot reflect the attitude of a decision-maker to the risk. Since many decision-makers in the real-world applications may be either risk-seeking or risk-averse, in order to take the risk-sensitivity of a decision-maker into an optimality criterion, the risk-sensitive criteria have been employed and studied in CTMDPs. In this paper, we will further study a risk-sensitive criterion in CTMDPs and focus on the finite horizon case. Thus we shall pinpoint neither the main results in the earlier literature on the infinite-horizon risk-sensitive discounted criterion and the long-run risk-sensitive average criterion in Ghosh and Saha (2014), Huo et al. (2017), Kumar and Chandan (2013), and Zhang (2017) and Ghosh and Saha (2014), Kumar and Chandan (2015), and Kumar and Chandan (2013) respectively for CTMDPs, nor those on the risk-sensitive discrete-time Markov decision processes (Anantharam and Borkar 2017; Baüerle and Rieder 2014; Cavazos-Cadena and Hernndez-Hernndez 2011; Jaskiewicz 2007; Xia 2018) as well as on zero-sum risk-sensitive stochastic games (Basu and Ghosh 2014; Baüerle and Rieder 2017). Our concern is on the finite horizon risk-sensitive criterion for CTMDPs (Ghosh and Saha 2014; Wei 2016), which leads to that the value function also depends on both time and states, while the value functions for the infinite-horizon risk-sensitive discounted and average criteria in Ghosh and Saha (2014), Kumar and Chandan (2015), Kumar and Chandan (2013), and Zhang (2017) are independent of time. Therefore, the optimality equation for the finite horizon case (Ghosh and Saha 2014; Wei 2016) is rather different from that for the infinite horizon cases (Ghosh and Saha 2014; Kumar and Chandan 2013; Zhang 2017).

To the best of our knowledge, the finite horizon risk-sensitive criterion for CTMDPs is addressed only in Ghosh and Saha (2014), Huang (2018), and Wei (2016). Precisely, Ghosh and Saha (2014) uses the exponential utility function to characterize the risk-sensitivity of a decision-maker and obtains the existence of optimal Markov policies for the case of denumerable states, uniformly bounded transition rates, bounded cost rates, and the class of Markov policies. Wei (2016) extends the main results in Ghosh and Saha (2014) to the case of unbounded transition rates, establishes the existence of a solution to the corresponding optimality equation (OE) by the approximation technique in Guo et al. (2015b), and gives the corresponding error estimations of the approximations. However, the arguments in Ghosh and Saha (2014) and Wei (2016) need the assumptions of bounded cost rates and denumerable states, and the policies in Ghosh and Saha (2014) and Wei (2016) are independent of histories.

Huang (2018) also studies the finite horizon CTMDPs with the Borel state and action spaces and unbounded reward functions and transition rates. But the optimization criteria in Huang (2018) are very different from ours. The optimization criteria in Huang (2018) are firstly mean maximization, then for fixed mean, variance minimization. This is suitable for a risk-averse decision maker. The optimization criterion in this paper is risk-sensitivity, which is maximization of the exponential utility function. From Eq. 2.7 in Section 2, we can see that the exponential utility function is related to the mean and variance. Since the coefficient of variance is positive, our optimization criterion is suitable to a risk-seeking decision maker.

As indicated above, the finite horizon risk-sensitive criterion of for CTMDPs with unbounded cost/transition rates, Borel state and action spaces as well as randomized history-dependent policies, has not been studied yet, and it will be considered in this paper. Precisely, we study the CTMDPs having the following features: (1) the transition rates can be unbounded; (2) the cost rates are allowed to be unbounded from below and from above. (As the cost rates are allowed to take positive and negative values, they can be interpreted as reward rates rather than “cost rates” only); (3) the state and action spaces are Borel ones; (4) the policies can be randomized and history-dependent; and (5) the optimality criterion is finite horizon risk-sensitive.

First, we give a suitable condition, under which we establish the finiteness of the finite-horizon risk-sensitive criterion with unbounded cost rates using the Jensen inequality. This condition is new and satisfied for the bounded costs in Ghosh and Saha (2014) and Wei (2016); see Lemma 3.1.

Second, under the conditions slightly weaker than those in Guo and Hernández-Lerma (2009), Guo et al. (2012), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012) on infinite horizon risk-neutral CTMDPs, we derive an extension of Dynkin’s formula, which is a generalization of the analog of Ito-Dynkin’s formula recently developed in Guo et al. (2015b) for the underlying state process {xt,t ≥ 0} induced by the transition rates and randomized history-dependent policies (see Theorem 3.1 below). This result is also a natural extension of the Feynman-Kac formula in Wei (2016) for the continuous-time jump Markov process to the case of continuous-time jump “non-Markov” processes and a more larger class of functions φ(ω,t,x) of samples ω, time t and states x. On the one hand, since the analog of Ito-Dynkin’s formula in Guo et al. (2015b) is designed for the forms of functions φ(t,x) of time t and states x, it is not suitable for the more general forms of φ(ω,t,x) with an additional sample variable ω such as the function \(e^{{{\int \limits }_{0}^{t}}{\int \limits }_{A}c(x_{s}(\omega ),a)\pi (da|\omega ,s)ds}\varphi (t,x)\) of (ω,t,x), which need to be considered for our case of history-dependent policies π(da|ω,s) and the cost rates c(x,a) in the risk-sensitive CTMDPs; on the other hand, the Feynman-Kac formula in Wei (2016) is for Markov processes, and thus it is not applicable to the case that the underlying processes {xt} here may not be Markovian. Since our optimality problem is on the risk-sensitive criterion over the class of history-dependent policies, the two facts just mentioned above motivate our study on the extension of the Dynkin’s formula.

Third, under suitable conditions as in Ghosh and Saha (2014), Guo and Hernández-Lerma (2009), Guo et al. (2012), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012) on risk-neutral CTMDPs with infinite horizon, we prove the existence and uniqueness of a solution to the OE as well as the existence of an optimal Markov policy for the finite horizon risk-sensitive CTMDPs in three steps. The first step is to consider the simple case of bounded transition rates and bounded cost rates, and establishes the existence of a solution to the OE by the Banach’s fixed point theorem, and also proves the existence of an optimal Markov policy; see Proposition 4.2. The second step is to deal with the case of unbounded transition rates and nonnegative cost rates. By constructing a sequence of the models of CTMDPs with bounded transition rates and bounded cost rates, using the results in the first step and some new properties of the value function of the finite horizon risk-sensitive CTMDPs, we prove that the limit of the sequence of the value functions of models of CTMDPs is a solution to the OE for the case of unbounded transition rates and nonnegative cost rates; see Proposition 4.3. In the third step, by designing a technique of approximations from nonnegative but unbounded cost rates to a more general case of cost rates being unbounded from above and from below, we further show the existence of a solution to the OE for the most general case of unbounded transition and cost rates; see Theorem 4.1. It should be mentioned that our approximation technique here is rather different from the approximation in Guo et al. (2015b) and Wei (2016) for denumerable states.

Fourth, using the existence of a solution to the OE and the extension of the Dynkin’s formula developed here, we prove the existence of an optimal Markov policy, and also show that the value function is the unique solution to the optimality equation; see Theorem 4.1. All arguments here are direct and closed.

Finally, our conditions in this paper are an generalization of those in Ghosh and Saha (2014) and Wei (2016) on the finite horizon risk-sensitive CTMDPs. In order to further illustrate our main results and show the difference between the conditions in this paper and those in Ghosh and Saha (2014) and Wei (2016), we present an applied example, in which our conditions are satisfied, in which the transition and cost rates are both unbounded, in which the state and action spaces are non-denumerable, and for which some of the conditions in Ghosh and Saha (2014) and Wei (2016) fail to hold.

The rest of the paper is organized as follows. In Section 2 we introduce the optimality problem for the risk-sensitive CTMDPs. The main results are presented in Section 4 after giving technical preliminaries in Section 3, and are illustrated with an example in Section 5.

2 The optimal control problems

Notation

For any Borel space X endowed with the Borel σ-algebra \({\mathscr{B}}(X)\), we will denote by Ec := XE the complement of a subset E of X, by IE the indicator function on any set E, by δz(dx) the Dirac measure at point zX (i.e., δz(D) = ID(z) for all \(D\in {\mathscr{B}}(X))\), by \(\mathbb {B}_{1}(X)\) the set of all bounded Borel measurable functions φ on X with the norm \(\|\varphi \|:=\sup _{x\in X}|\varphi (x)|\), and by \(\mathcal {{U}}(X)\) the universal σ-algebra on X, that is, \(\mathcal { U}(X):=\cap _{p\in P(X)}{\mathscr{B}}_{p}(X),\) where P(X) represents the set of all probability measures on \({\mathscr{B}}(X)\), and \({\mathscr{B}}_{p}(X)\) is the completion of \({\mathscr{B}}(X)\) with respect to pP(X). To discern the “measurability”, we will say “Borel measurable” or “universally measurable” in the following.

The model of CTMDPs is a five-tuple

$$ \begin{array}{@{}rcl@{}} \mathbb{M}:=\{S, A, (A(x),x\in S),c(x,a), q(dy|x,a)\}, \end{array} $$
(2.1)

consisting of the following elements:

  1. (a)

    a Borel space S, called the state space, whose elements are referred to as states of a system.

  2. (b)

    a Borel space A, called the action space, whose elements are referred to as actions (or decisions) of a decision-maker (or controller);

  3. (c)

    a family {A(x),xS} of nonempty subsets A(x) of A, where A(x) denotes the set of actions available to a controller when the system is in state xS, and it is assumed to be Borel-measurable, that is, \(A(x)\in {\mathscr{B}}(A)\) for every xS;

  4. (d)

    a Borel measurable function c(x,a) on K, called the cost rates, where K := {(x,a)|xS,aA(x)} denotes the set of all feasible state-action pairs and is assumed in \({\mathscr{B}}(S\times A);\) (As c(x,a) is allowed to take positive and negative values, it can be interpreted as rewards rather than “costs” only.)

  5. (e)

    transition rates q(dy|x,a), a universally measurable signed kernel on S given K. That is, q(⋅|x,a) satisfies countable additivity; q(D|x,a) ≥ 0 for all \(D\in {\mathscr{B}}(S)\) with (x,a) ∈ K and xD, being conservative in the sense of q(S|x,a) ≡ 0, and stable in that of

    $$ \begin{array}{@{}rcl@{}} q^{*}(x):=\sup\limits_{a\in A(x)}q(x,a)<\infty \ \ \ \ \forall \ x\in S, \end{array} $$
    (2.2)

    where q(x,a) := −q({x}|x,a) ≥ 0 for all (x,a) ∈ K.

Next, we give an informal description of the evolution of CTMDPs with model (2.1).

Roughly speaking, CTMDPs evolve as follows: A controller observes states of a system continuously in time. If the system is at state xt at time t, he/she chooses an action atA(xt) according to a given policy, as a consequence of which, the following happen:

  1. (i)

    An immediate cost takes place at the rate c(xt,at).

  2. (ii)

    After a random sojourn time (i.e., the holding time at state xt), the system jumps to a set D (xtD) of states with the transition probability \(\frac {q(D|x_{t},a_{t})}{q(x_{t},a_{t})}\) determined by the transition rates q(dy|xt,at). The distribution function of the sojourn time is \((1-e^{-{{\int \limits }_{t}^{t+x}}q(x_{s},a_{s})ds})\).

To formalize what is described above, below we describe the construction of CTMDPs under possibly randomized history-dependent policies.

To construct the underlying CTMDPs, we introduce some notation: Let \({\Omega }_{0} := (S\times (0,\infty ))^{\infty }\), \({\Omega }_{k} := (S\times (0,\infty ))^{k}\times S\times (\{\infty \}\times \{{\Delta }\})^{\infty }\) for k ≥ 1 and some Δ∉S, \({\Omega }:=\cup _{k=0}^{\infty }{\Omega }_{k}\). \(\mathcal {F}\) is the Borel σ-algebra on the Borel space Ω. Then we obtain the measurable space \(({\Omega },\mathcal {F})\). For some k ≥ 1, and sample \(\omega :=(x_{0},\theta _{1},x_{1},\ldots ,\theta _{k},x_{k},\infty ,{\Delta }) \in {\Omega }\), define

$$ \begin{array}{@{}rcl@{}} T_{k}(\omega):=\theta_{1}+\theta_{2}+\ldots+\theta_{k}, T_{\infty}(\omega):=\lim_{k\rightarrow\infty}T_{k}(\omega), \ \text{and} \ X_{k}(\omega):=x_{k}.\end{array} $$
(2.3)

In what follows, the argument ω is always omitted except some special informational statements. Then, we define the state process {xt(ω),t ≥ 0} on \(({\Omega },\mathcal {F})\) by

$$ \begin{array}{@{}rcl@{}} x_{t}(\omega):=\sum\limits_{k\geq0}I_{\{T_{k}\leq t<T_{k+1}\}}X_{k}(\omega)+ I_{\{t\geq T_{\infty}\}}{\Delta} , \ \ \ \ \text{for } \ t\geq 0, \ \ (\text{with} \ T_{0}:=0). \end{array} $$
(2.4)

Obviously, xt(ω) is right-continuous on \([0,\infty )\). We denote \(x_{t-}(\omega ):=\lim _{s\to t-}x_{s}(\omega )\). Here we have used the convenience that 0z =: 0 and 0 + z =: z for all \(z\in S_{\Delta }:=S\cup \{{\Delta }\}\).

For each fixed ω := (x0,𝜃1,x1,…,𝜃k,xk,…) ∈Ω, from Eq. 2.4, we see that Tk(ω) (k ≥ 1) denotes the k-th jump moment of {xt,t ≥ 0}, Xk− 1(ω) = xk− 1 is the state of the process on [Tk− 1(ω),Tk(ω)), 𝜃k = Tk(ω) − Tk− 1(ω) plays the role of sojourn time at state xk− 1, and the sample path {xt(ω),t ≥ 0} has at most denumerable states xk(k = 0,1,…). We do not intend to consider the controlled process {xt,t ≥ 0} after moment \(T_{\infty }\), and thus view it to be absorbed in the cemetery state Δ. Hence, we write AΔ := A ∪{aΔ}, A(Δ) := {aΔ}, q(⋅|Δ,aΔ) :≡ 0, c(Δ,aΔ) :≡ 0, where aΔ is an isolated point.

To precisely define the optimality criterion, we need to introduce the concept of a policy in Guo et al. (2012), Guo and Piunovskiy (2011), Kitaev and Rykov (1995), and Piunovskiy and Zhang (2011). To do so, we recall some notation. Take the right-continuous family of σ-algebras \(\{\mathcal { F}_{t}\}_{t\ge 0}\) with \(\mathcal { F}_{t}:=\sigma (\{T_{k}\leq s, X_{k}\in D\}: D\in \mathcal {{B}}(S),s\leq t,k\geq 0)\). As in Guo and Song (2011), Guo and Piunovskiy (2011), Kitaev and Rykov (1995), and Piunovskiy and Zhang (2011), let \(\mathcal P\) be the σ-algebra of predictable sets on \({\Omega }\times [0,\infty )\) related to \(\{\mathcal { F}_{t}\}_{t\ge 0}\), that is, \(\mathcal {{P}}:=\sigma (B\times [0,\infty ), C\times (s,\infty ): B\in \mathcal {{F}}_{0}, C\in \mathcal {{F}}_{s-}, s>0)\) with \(\mathcal {{F}}_{s-}:=\bigvee _{t<s}\mathcal {{F}}_{t}:=\sigma (\mathcal {{F}}_{t}, t<s)\). A real-valued function on \({\Omega }\times [0,\infty )\) is called predictable if it is measurable with respect to \(\mathcal P\).

Definition 2.1

A transition probability π(da|ω,t) from \(({\Omega }\times [0,\infty ),\mathcal { P} )\) onto \((A_{\Delta }, \mathcal {{B}}(A_{\Delta }))\) such that π(A(xt(ω))|ω,t) ≡ 1 is called a (randomized history-dependent) policy. A policy π(da|ω,t) is called randomized Markovian if π(da|ω,t) ≡ π(da|xt(ω),t). We will denote such a Markov policy by πt(da|⋅) for informational implication. A randomized Markov policy πt(da|⋅) is called deterministic Markovian whenever there exists a A-valued universally measurable function f(t,x) on \([0,\infty )\times S\) such that πt(da|x) is the Dirac measure at point f(t,x) ∈ A(x) for all t ≥ 0 and xS. Such a Markov policy will be denoted by f for simplicity.

We denote by π the set of all randomized history-dependent policies, by \({{\Pi }_{m}^{r}}\) the set of all randomized Markov policies, and by \({{\Pi }_{m}^{d}}\) the set of all (deterministic) Markov policies.

For any initial distribution γ on S and policy π ∈π, as showing in Guo et al. (2012), Guo and Song (2011), Guo and Piunovskiy (2011), Kitaev and Rykov (1995), and Piunovskiy and Zhang (2011) but using the extension of the Ionescu Tulcea theorem (e.g., Proposition 7.45 in Bertsekas and Shreve 1996); we see that there exists a unique probability measure \(\mathbb {P}_{\gamma }^{\pi }\) (depending on γ and π) on \(({\Omega },\mathcal {F})\). Let \(\mathbb {E}_{\gamma }^{\pi }\) be the corresponding expectation operator. In particular, \(\mathbb {E}_{\gamma }^{\pi }\) and \(\mathbb {P}_{\gamma }^{\pi }\) will be respectively written as \(\mathbb {E}_{x}^{\pi }\) and \(\mathbb {P}_{x}^{\pi }\) when γ is the Dirac measure at a state x in S.

Fix any finite horizon T > 0. For each policy π ∈π and state xS, we define the T-horizon risk-sensitive criterion J(π,0,x) of CTMDPs by

$$ \begin{array}{@{}rcl@{}} J(\pi,0,x):=\mathbb{E}_{x}^{\pi}\left[e^{{\delta{\int}_{0}^{T}}{\int}_{A}c(x_{t},a)\pi(da|\omega,t)dt}\right], \end{array} $$
(2.5)

provided that the integral is well defined, where δ is a constant called the risk-sensitive parameter. In the following arguments, we assume that δ > 0. For the other case of δ < 0, the corresponding arguments are similar, and thus omitted.

Note that the process {xt,t ≥ 0} on \(({\Omega },\mathcal {F}, \mathbb {P}^{\pi }_{\gamma })\) may not be Markovian since the policy π can depend on histories (x0,𝜃1,x1,…,𝜃k,xk). However, for each \(\pi :=\pi _{t}(da|\cdot ) \in {{\Pi }^{r}_{m}}\), it is well known (e.g. Feinberg et al. 2014) that {xt,t ≥ 0} is a Markov process on \(({\Omega },\mathcal {F}, \mathbb {P}^{\pi }_{\gamma })\), and thus for each xS and t ∈ [0,T], the following expression

$$ \begin{array}{@{}rcl@{}} J(\pi,t,x):=\mathbb{E}_{\gamma}^{\pi}\left[e^{{\delta{\int}_{t}^{T}}{\int}_{A}c(x_{s},a)\pi_{s}(da|x_{s})ds}|x_{t}=x\right], \end{array} $$

is well defined (when the integral exists), and it is called the risk-sensitive value of π from the horizon t to T.

For each xS, let

$$ \begin{array}{@{}rcl@{}} J_{*}(t,x):=\inf\limits_{\pi\in {{\Pi}_{m}^{r}}} J(\pi,t,x) \ \ \ \ \text{ for \ } t\in [0,T]. \end{array} $$
(2.6)

The function J(t,x) on [0,T] × S is called the value function of the CTMDPs with the T-horizon risk-sensitive criterion.

Definition 2.2

A policy π∈π is said to be optimal if J(π,0,x) ≤ J(π,0,x) for all π ∈π and xS.

The main goal of this paper is to give conditions for the existence of optimal Markov policies.

At this end, we give some remarks about the difference between the T-horizon risk-sensitive criterion J(π,0,x) and the risk-neutral one in Guo et al. (2015b) and Yushkevich (1977), which is defined by

$$V(\pi,0,x):=\mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T}{\int}_{A}c(x_{t},a)\pi(da|\omega,t)dt\right]=:\mathbb{E}_{x}^{\pi}Y, \ \text{with } \ Y:={{\int}_{0}^{T}}{\int}_{A}c(x_{t},a)\pi(da|\omega,t)dt,$$

where c(x,a) is assumed to be bounded positive on K. Then, by the Jensen’s inequality and the monotonicity of the \(\log \) function, we have \(\ln J(\pi ,0,x)\geq \delta V(\pi ,0,x)\). More, by Taylor’s expansion of eδY at the point \(z:=\mathbb {E}_{x}^{\pi }Y\), we have

$$ \begin{array}{@{}rcl@{}} e^{\delta Y}&=&e^{\delta z}+\delta e^{\delta z}(Y-z)+\frac{1}{2}\delta^{2} e^{\delta z}(Y-z)^{2}\\&&+\frac{1}{6}\delta^{3}e^{\delta \xi}(Y-z)^{3} \ (\text{ with} \ \xi \ \text{being \ between} \ Y \ \text{and} \ z), \end{array} $$

which also implies that ξ is a random variable on \(({\Omega },\mathcal {F}, \mathbb {P}^{\pi }_{\gamma })\). Therefore, we have

$$ \begin{array}{@{}rcl@{}} \ln J(\pi,0,x)&=&\ln\left[e^{\delta z}+\frac{1}{2}e^{\delta z} \delta^{2} var(Y)+\frac{1}{6}\delta^{3}\mathbb{E}_{x}^{\pi}e^{\delta \xi}(Y-z)^{3}\right] \\ &=&\delta V(\pi,0,x)+\ln \left[1+\frac{1}{2}\delta^{2}var(Y)+\frac{1}{6}\delta^{3}\mathbb{E}_{x}^{\pi}e^{\delta (\xi-z)}(Y-z)^{3}\right] \\ &=&\delta V(\pi,0,x)+\ln[1+\frac{1}{2}\delta^{2}var(Y)]+o(\delta^{2}), \end{array} $$
(2.7)

where \(var(Y):=\mathbb {E}_{x}^{\pi }(Y-\mathbb {E}_{x}^{\pi }Y)^{2}\) denoting the variance of Y, and the corresponding term \(\ln [1+\frac {1}{2}\delta ^{2}var(Y)]\) is called a risk premium. Thus, the difference between \(\ln J(\pi ,0,x)\) and δV (π,0,x) is the summation of the risk premium plus the other small order term. Therefore, the risk-sensitive criterion is more risk-sensitive than the risk-neutral one because of its inclusion of the risk premium.

From Eq. 2.7 we see that \(\lim _{\delta \to 0}\frac {\ln J(\pi ,0,x)}{\delta }=V(\pi ,0,x)\). On the other hand, for any fixed δ > 0, since \(\inf _{\pi \in {\Pi }}\ln J(\pi ,0,x)=\ln J(\pi ^{*},0,x)\)if and only if\(\inf _{\pi \in {\Pi }}J(\pi ,0,x)=J(\pi ^{*},0,x)\), we will consider J(π,0,x), instead of \(\ln J(\pi ,0,x)\).

3 Preliminaries

This section provides some preliminary facts for our arguments below.

Since the rates q(dy|x,a) and costs c(x,a) are allowed to be unbounded, we next give conditions for the non-explosion of {xt,t ≥ 0} and finiteness of J(π,t,x).

Assumption 3.1

There exist a real-valued Borel measurable function V0 ≥ 1 on S and constants ρ0, b0 ≥ 0, \(M_{0}^{\prime }, M_{0}\geq 1\), such that

(i):

HCode \({\int \limits }_{S}V_{0}(y)q(dy|x,a)\leq \rho _{0} V_{0}(x)+b_{0},\) for all (x,a) ∈ K;

(ii):

\(q^{*}(x)\leq M_{0}^{\prime }V_{0}(x)\) for all xS, where q(x) is as in Eq. 2.2;

(iii):

e2Tδ|c(x,a)|M0V0(x) for all (x,a) ∈ K, with T and δ as in Eq. 2.5.

Remark 3.1

  1. (a)

    Assumptions 3.1(i,ii) are used and verified with examples in Guo and Hernández-Lerma (2009), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012) for the risk-neutral CTMDPs. When the transition rates are bounded (i.e., \(\|q^{*}\|<\infty \)) (Ghosh and Saha 2014; Kitaev and Rykov 1995; Kumar and Chandan2013, 2015; Yushkevich 1977), Assumptions 3.1(i-ii) are satisfied by taking V0(x) ≡ 1.

  2. (b)

    Assumption 3.1(iii) is for the finiteness of the value function J(t,x), and it is new and satisfied when c(x,a) is bounded (i.e., \(\|c\|:=\sup _{(x,a)\in K}|c(x,a)|<\infty \)) (Ghosh and Saha 2014; Kumar and Chandan2013, 2015; Wei 2016; Yushkevich 1977). Moreover, if \(\delta |c(x,a)|\leq \sqrt {\ln V_{0}(x)}+L\) for all (x,a) ∈ K and some constant L ≥ 0, then Assumption 3.1(iii) holds: Indeed, since \(\delta |c(x,a)|\leq \sqrt {\ln V_{0}(x)}+L\leq \frac {t}{2}+\frac {\ln V_{0}(x)}{2t}+L\) for all (x,a) ∈ K and t > 0, we have \(e^{2T\delta |c(x,a)|} \leq e^{T^{2}+\ln V_{0}(x)+2TL}=e^{T^{2}+2TL} V_{0}(x)\), which implies Assumption 3.1(iii) with \(M_{0}:=e^{T^{2}+2TL}\).

  3. (c)

    If the number ρ0 in Assumption 3.1(i) is not positive, then Assumption 3.1(i) still holds when ρ0 is replaced with the positive number 1 + |ρ0|. Thus, it is just for convenience to assume that the constant ρ0 > 0 throughout the following. However, the corresponding number is assumed to be negative in Guo and Hernández-Lerma (2009), Guo et al. (2012), and Prieto-Rumeau and Hernández-Lerma (2012) or less than the discount factor in Guo and Hernández-Lerma (2009) and Guo and Piunovskiy (2011).

Lemma 3.1

Under Assumption 3.1, for each π ∈π, the following assertions hold.

(a):

\(\mathbb {P}_{x}^{\pi }(x_{t}\in S)=1\), \(\mathbb {P}_{x}^{\pi }(T_{\infty }=\infty )=1\), and \( \mathbb {P}_{x}^{\pi }(x_{0}=x)=1\) for each t ≥ 0 and xS;

(b):

\(\mathbb {E}_{x}^{\pi }[V_{0}(x_{t})] \leq e^{\rho _{0}t}[V_{0}(x)+\frac {b_{0}}{\rho _{0}}]\), for each t ≥ 0, xS and π ∈π;

(c):

\(\mathbb {E}_{\gamma }^{\pi }[V_{0}(x_{t})|x_{s}=x] \leq e^{\rho _{0}(t-s)}[V_{0}(x)+\frac {b_{0}}{\rho _{0}}]\), for each ts ≥ 0, xS and \(\pi \in {{\Pi }_{m}^{r}}\).

(d):

If, in addition, Assumption 3.1 (iii) is satisfied, then

(d1):

\( e^{-LV_{0}(x)}\leq J(\pi ,0,x)\leq LV_{0}(x) \) for xS and π ∈π, where \(L:=M_{0}e^{\rho _{0}T}[1+\frac {b_{0}}{\rho _{0}}];\)

(d2):

\(e^{-LV_{0}(x)}\leq J(\pi ,t,x)\leq LV_{0}(x)\) for (t,x) ∈ [0,T] × S and \({\pi \in {\Pi }_{m}^{r}}\).

Proof

Parts (a) and (b) follow from Guo et al. (2012) and Guo and Piunovskiy (2011) (or Piunovskiy and Zhang 2011); while part (c) is from Theorem 3.1 in Guo (2007). We next prove part (d). For almost surely (a.s.) ω := (x0,𝜃1,x1,…,𝜃k,xk,…) ∈Ω (with respect to \(\mathbb {P}_{x}^{\pi }\)), let k (depending on ω) be determined by Tk(ω) ≤ T < Tk+ 1(ω). Then, it follows from (a) and Eq. 2.4 that k is finite, and {xt(ω),t ∈ [0,T]} = {X0(ω),…,Xk(ω)}. Thus, since \(|c(x,a)|\leq \frac {1}{T\delta }\ln \sqrt {M_{0}V_{0}(x)}\) (by Assumption 3.1(iii)) and 𝜃i+ 1 = Ti+ 1(ω) − Ti(ω)(i = 0,…,k − 1), we have

$$ \begin{array}{@{}rcl@{}} &&{{\int}_{0}^{T}} {\int}_{A} |c(x_{t},a)|\pi(da|\omega,t)dt\\ &\leq& \frac{{\sum}_{i=0}^{k-1}\theta_{i+1}\ln\sqrt{M_{0}V_{0}(X_{i}(\omega))}+(T-T_{k}(\omega))\ln\sqrt{M_{0}V_{0}(X_{k}(\omega))}}{T\delta}, \end{array} $$

which implies that \({{\int \limits }_{0}^{s}} {\int \limits }_{A} c(x_{t},a)\pi (da|\omega ,t)dt\) is real-valued Borel measurable in s ∈ [0,T] (for the given ω). Thus, by the Jensen inequality with respect to the probability measure \(\frac {dt}{T}\) on \(\mathcal {{B}}([0,T])\), we have

$$ \begin{array}{@{}rcl@{}} e^{{{\int}_{0}^{T}}\delta{\int}_{A} c(x_{t},a)\pi(da|\omega,t)dt}\leq \frac{1}{T} {{\int}_{0}^{T}} e^{T\delta{\int}_{A} |c(x_{t},a)|\pi(da|\omega,t)}dt, \ \ \ \ a.s.-\mathbb{P}_{x}^{\pi}. \end{array} $$

Therefore (by (b)),

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{x}^{\pi}\left[e^{{{\int}_{0}^{T}}\delta{\int}_{A} c(x_{t},a)\pi(da|\omega,t)dt}\right]&\leq& \mathbb{E}_{x}^{\pi}\left[\frac{1}{T} {{\int}_{0}^{T}} e^{T\delta{\int}_{A} |c(x_{t},a)|\pi(da|\omega,t)}dt\right] \\ &\leq& \frac{1}{T}M_{0}{{\int}_{0}^{T}}\mathbb{E}_{x}^{\pi}\left[V_{0}(x_{t})\right]dt\leq M_{0}e^{\rho_{0} T}[V_{0}(x)+\frac{b_{0}}{\rho_{0}}]. \end{array} $$

Hence, by Eq. 2.5, we have

$$ \begin{array}{@{}rcl@{}} J(\pi,0,x)\leq M_{0}e^{\rho_{0}T}\left[1+\frac{b_{0}}{\rho_{0}V_{0}(x)}\right]V_{0}(x)\leq LV_{0}(x). \end{array} $$
(3.1)

On the other hand, since Tδ|c(x,a)|≤ e2Tδ|c(x,a)|M0V0(x), we have

$$ \begin{array}{@{}rcl@{}} J(\pi,0,x)\!&=&\mathbb{E}_{x}^{\pi}\left[e^{{\delta{\int}_{0}^{T}}{\int}_{A}c(x_{t},a)\pi(da|\omega,t)dt}\right]\geq e^{\delta\mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T}{\int}_{A}c(x_{t},a)\pi(da|\omega,t)dt\right]}\\ &\geq& e^{-\frac{1}{T}\mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T}{\int}_{A}M_{0}V_{0}(x_{t})\pi(da|\omega,t)dt\right]}\!= e^{-\frac{1}{T}\mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T}M_{0}V_{0}(x_{t})dt\right]}\!\geq e^{-\frac{M_{0}}{T}{{\int}_{0}^{T}}\mathbb{E}_{x}^{\pi}[V_{0}(x_{t})]dt}\\ &\geq& e^{-\frac{M_{0}}{T}{{\int}_{0}^{T}} e^{\rho_{0}T}[V_{0}(x)+\frac{b_{0}}{\rho_{0}}]dt}=e^{-M_{0}e^{\rho_{0}T}[V_{0}(x)+\frac{b_{0}}{\rho_{0}}]} \geq e^{-LV_{0}(x)}, \end{array} $$

which, together with Eq. 3.1, implies (d1). Similarly, we see that (d2) is also true. □

Lemma 3.1 gives conditions for the finiteness of J(π,t,x) as well as the non-explosion of {xt,t ≥ 0}. In order to deal with the optimality for history-dependent policies, we need the following facts, which extend both the analog of Ito-Dynkin’s formula in Guo et al. (2015b) and Feynman-Kac formula in Wei (2016) to a more general case of possible non-Markov processes {xt,t ≥ 0} and functions φ(ω,t,x) with an additional element ω ∈Ω.

Denote by mL the Lebesgue’s measure on [0,T], and by \(\mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\) the set of real-valued and \(\mathcal {{P}}\times \mathcal {{B}}(S)\)-measurable functions φ(ω,t,x) with the following features: Given any xS,π ∈π, and a.s. ω ∈Ω with respect to \(\mathbb {P}_{x}^{\pi }\), there exists a Borel subset E(φ,ω,x,π) (depending on the φ,ω,x,π) of [0,T] such that the partial derivative \(\varphi ^{\prime }(\omega ,t,x)\) (with respect to t) exists for every tE(φ,ω,x,π) and \(m_{L}(E_{(\varphi ,\omega ,x,\pi )}^{c})=0\). Obviously, if a function φ(ω,t,x) in \(\mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\) is independent of ω (written as φ(t,x)), and so is the corresponding E(φ,ω,x,π), which will be denoted by E(φ,x).

For any \(\varphi (\omega , t,x)\in \mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\), when the partial derivative does not exist for some (ω,t,x) ∈Ω× [0,TS, for simplicity of arguments, we take \(\varphi ^{\prime }(\omega ,t,x)\) to be any real number, and so \(\varphi ^{\prime }\) is defined on Ω × [0,T] × S. We will see that such a modification of \(\varphi ^{\prime }\) loses nothing. For example, \(\mathbb {E}_{x}^{\pi } \left [{\int \limits }_{0}^{T}|\varphi ^{\prime }(\omega ,t,x_{t})|dt\right ]\) and \({{\int \limits }_{0}^{T}}|\varphi ^{\prime }(\omega ,t,x_{t})|dt\) are defined well for each (x,π) ∈ S ×π and a.s. ω ∈Ω with respect to \(\mathbb {P}_{x}^{\pi }\), respectively. Indeed, for each xS,π ∈π, and s ∈ [0,T], Lemma 3.1(a) together with Eq. 2.4 gives the existence of a Boreal measurable subset \(\hat {\Omega }\) of Ω such that: 1) \(\mathbb {P}_{x}^{\pi }(\hat {\Omega }^{c})=0\), and 2) for each given \(\omega \in \hat {\Omega }\), xt(ω) = xi for Ti(ω) ≤ t < Ti+ 1(ω)(0 ≤ ik) for some a finite k, where k and xiS depend on ω, and k is determined by Tk(ω) ≤ T < Tk+ 1(ω). Let \(\tilde i:={\min \limits } \{l: s<T_{l} \}\) (depending on ω and s). Then, since \(m_{L}(\cup _{i=0}^{k} E^{c}_{(\varphi ,\omega ,x_{i},\pi )})=0\), for each \(\omega \in \hat {\Omega }\),

$${{\int}_{s}^{T}} \!|\varphi^{\prime}(\omega,t,x_{t})|dt = {\int}_{s}^{T_{\tilde i}}\!|\varphi^{\prime}(\omega,t,x_{\tilde i-1})|dt+\sum\limits_{l=\tilde i}^{k-1} {\int}_{T_{l}}^{T_{l+1}}\!|\varphi^{\prime}(\omega,t,x_{l})|dt+{\int}_{T_{k}}^{T}\!|\varphi^{\prime}\!(\omega,t,x_{k})|dt$$

is well defined. Therefore (by \(\mathbb {P}_{x}^{\pi }(\hat {\Omega }^{c})=0\)), \(\mathbb {E}_{x}^{\pi } \left [{\int \limits }_{0}^{T}|\varphi ^{\prime }(\omega ,t,x_{t})|dt\right ]\) is also well defined.

Lemma 3.2

Under Assumption 3.1, the following assertions hold.

(a):

For any given π ∈π,xS, if a function \(\varphi \in \mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\) satisfies that

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T}{\int}_{A}{\int}_{S}|\varphi(\omega,t,y)| |q|(dy|x_{t},a)\pi(da|\omega,t)dt\right]\\&&+\mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}|\varphi^{\prime}(\omega,t,x_{t})|dt\right]<\infty, \end{array} $$
(3.2)

where \(|q|(dy|x,a):={\int \limits } q(dy\setminus \{x\})|x,a)-q(\{x\}|x,a)\delta _{x}(dy)\) for all (x,a) ∈ K, then

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T} \left( \varphi^{\prime}(\omega,t,x_{t})+{\int}_{A}{\int}_{S}\varphi(\omega,t,y)q(dy|x_{t},a)\pi(da|\omega,t)\right)dt \right]\\ &=&\mathbb{E}_{x}^{\pi}\varphi(\omega,T,x_{T})-\mathbb{E}_{x}^{\pi}\varphi(\omega,0,x). \end{array} $$
(b):

For any \(\pi =\pi _{t}(da|\cdot ){\in {\Pi }_{m}^{r}}\) and a Borel measurable function \(\phi \in \mathbb {B}_{\mathcal { P}}([0,T]\times S)\), such that the corresponding function \(\varphi (\omega ,t,x):=e^{{{\int \limits }_{s}^{t}} \delta c(x_{v}(\omega ),\pi _{v})dv}\phi (t,x)\) satisfying (3.2), then

$$ \mathbb{E}_{\gamma}^{\pi}\!\left[{\int}_{s}^{T} \!\left( \!\left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,x_{t})\!\right)'+{\int}_{S}e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,y)q(dy|x_{t},\pi_{t})\!\right)dt|x_{s}=x\right]$$
$${}=\mathbb{E}_{\gamma}^{\pi}\left[e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}\phi(T,x_{T})|x_{s}=x\right]-\phi(s,x),\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ $$

where \(c(x,\pi _{t}):={\int \limits }_{A(x)}c(x,a)\pi _{t}(da|x)\) and \(q(dy|x,\pi _{t}):={\int \limits }_{A(x)} q(dy|x,a)\pi _{t}(da|x)\).

(c):

If the functions c(x,a) and ϕ(t,x) in (b) above are both bounded and \(|\phi ^{\prime }(t,x)|\leq CV_{0}(x)\) for all (t,x) ∈ [0,T] × S with some constant C > 0, then

$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!&&\mathbb{E}_{\gamma}^{\pi}\left[{\int}_{s}^{T} \left( \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,x_{t})\right)'+{\int}_{S}e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,y)q(dy|x_{t},\pi_{t})\right)dt|x_{s}=x\right]\\ \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!&=& \left\{\begin{array}{lll} \mathbb{E}_{x}^{\pi}\left( e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}\phi(T,x_{T})\right)-\phi(0,x), \ \ \ \text{for } \ s=0, \pi\in {\Pi} \\ \mathbb{E}_{\gamma}^{\pi}\left[e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}\phi(T,x_{T})|x_{s}=x\right]-\phi(s,x), \ \text{for } \ s\in [0,T],\pi\in {{\Pi}_{m}^{r}}. \end{array}\right. \end{array} $$

Proof

  1. (a)

    Lemma 3.1(a) together with Eq. 2.4 gives the existence of a Boreal measurable subset \(\hat {\Omega }\) of Ω such that \(\mathbb {P}_{x}^{\pi }(\hat {\Omega }^{c})=0\). For each \(\omega \in \hat {\Omega }\), for 0 ≤ tT, from definitions (2.3) and (2.4), denote \(k_{t}(\omega ):=\max \limits \{k|T_{k}(\omega )\le t\}\). Since xt(ω) is right-continuous on \([0,\infty )\), we have \(x_{t}=x_{T_{k_{t}}}\) when \(T_{k_{t}}\le t<T_{k_{t}+1}\). Since \({{\int \limits }_{s}^{T}} \varphi ^{\prime }(\omega ,t,x_{t})dt\) is well defined (just proved), for 0 ≤ sT and \(\omega \in \hat {\Omega }\), we have

    $$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\varphi(\omega,T,x_{T}) = \varphi(\omega,s,x_{s}) + {{\int}_{s}^{T}} \varphi^{\prime}(\omega,t,x_{t})dt + \sum\limits_{n=k_{s}+1}^{k_{T}}{\int}_{(s,T]}{\Delta} \varphi(\omega,t,x_{t})\delta_{T_{n} }(dt) \ \ \ \mathbb{P}_{x}^{\pi} - a.s. \end{array} $$
    (3.3)

    Denote Δφ(ω,t,xt) := φ(ω,t,xt(ω)) − φ(ω,t,xt(ω)), and recall that \(x_{t-}(\omega )=\lim _{s\to t-}x_{s}(\omega )\). Equation 3.3 is proved as follows.

    $$ \begin{array}{@{}rcl@{}} &&\varphi(\omega,T,x_{T})\\ &=&\varphi(\omega,s,x_{s})+\varphi(\omega,T_{k_{s}+1},x_{T_{k_{s}+1}-})-\varphi(\omega,s,x_{s})\\ &&+\sum\limits_{n=k_{s}+1}^{k_{T}}\left[\varphi(\omega,T_{n},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}-})\right]+\sum\limits_{n=k_{s}+1}^{k_{T}-1}\left[ \varphi(\omega,T_{n+1},x_{T_{n+1}-})-\varphi(\omega,T_{n},x_{T_{n}})\right]\\ &&+\varphi(\omega,T,x_{T-})-\varphi(\omega,T_{k_{T}},x_{T_{k_{T}}})+\varphi(\omega,T,x_{T})-\varphi(\omega,T,x_{T-})\\ &=&\varphi(\omega,s,x_{s})+\varphi(\omega,T_{k_{s}+1},x_{s})-\varphi(\omega,s,x_{s})\\ &&+\sum\limits_{n=k_{s}+1}^{k_{T}}\left[\varphi(\omega,T_{n},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}-})\right]+\sum\limits_{n=k_{s}+1}^{k_{T}-1}\left[ \varphi(\omega,T_{n+1},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}})\right]\\ &&+\varphi(\omega,T,x_{T_{k_{T}}})-\varphi(\omega,T_{k_{T}},x_{T_{k_{T}}})+\varphi(\omega,T,x_{T})-\varphi(\omega,T,x_{T-})\\ &=&\varphi(\omega,s,x_{s})+\varphi(\omega,T_{k_{s}+1},x_{s})-\varphi(\omega,s,x_{s})+\sum\limits_{n=k_{s}+1}^{k_{T}-1}\left[ \varphi(\omega,T_{n+1},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}})\right]\\ &&+\varphi(\omega,T,x_{T_{k_{T}}})-\varphi(\omega,T_{k_{T}},x_{T_{k_{T}}})\\ &&+\sum\limits_{n=k_{s}+1}^{k_{T}}\left[\varphi(\omega,T_{n},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}-})\right]+\varphi(\omega,T,x_{T})-\varphi(\omega,T,x_{T-})\\ &=&\varphi(\omega,s,x_{s})+{\int}_{[s,T_{k_{s}+1})}\varphi^{\prime}(\omega,t,x_{s})dt+\sum\limits_{n=k_{s}+1}^{k_{T}-1}{\int}_{[T_{n},T_{n+1})} \varphi^{\prime}(\omega,t,x_{T_{n}})dt\\ &&+{\int}_{[T_{k_{T}},T)}\varphi^{\prime}(\omega,t,x_{T_{k_{T}}})dt+\sum\limits_{n=k_{s}+1}^{k_{T}}{\Delta}\varphi(\omega,T_{n},x_{T_{n}})+{\Delta}\varphi(\omega,T,x_{T})\\ &=&\varphi(\omega,s,x_{s})+{\int}_{[s,T_{k_{s}+1})}\varphi^{\prime}(\omega,t,x_{t})dt+\sum\limits_{n=k_{s}+1}^{k_{T}-1}{\int}_{[T_{n},T_{n+1})} \varphi^{\prime}(\omega,t,x_{t})dt\\ &&+{\int}_{[T_{k_{T}},T)}\varphi^{\prime}(\omega,t,x_{t})dt+\sum\limits_{n=k_{s}+1}^{k_{T}}{\int}_{(s,T]}{\Delta}\varphi(\omega,t,x_{t})\delta_{T_{n}}(dt)\\ &=&\varphi(\omega,s,x_{s})+{{\int}_{s}^{T}} \varphi^{\prime}(\omega,t,x_{t})dt+\sum\limits_{n=k_{s}+1}^{k_{T}}{\int}_{(s,T]}{\Delta} \varphi(\omega,t,x_{t})\delta_{T_{n} }(dt) \ \ \ \mathbb{P}_{x}^{\pi}-a.s. \end{array} $$

Then, Lemma 4.28 in Kitaev and Rykov (1995) shows that the random measure mπ defined by

$$ \begin{array}{@{}rcl@{}} m^{\pi}(B|\omega,t)dt:={\int}_{A}q(B|x_{t-},a)\pi(da|\omega,t)I_{\{x_{t-}\notin B\}}dt,\quad B\in \mathcal{{B}}(S) \end{array} $$
(3.4)

is the dual predictable projection of the random measure \({\sum }_{n\ge 1}\delta _{(T_{n},X_{n-1})}(dt,dx)\) on \({\mathscr{B}}((0,\infty )\times S)\) under \(\mathbb {P}_{x}^{\pi }\). Thus, by the definition of a dual predictable projection in (4.5) in Kitaev and Rykov (1995), we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{x}^{\pi}\left[ \sum\limits_{n\ge 1}{\int}_{(0,T]}{\Delta} \varphi(\omega,t,x_{t})\delta_{T_{n} }(dt)\right]\\&=&\mathbb{E}_{x}^{\pi}\left[ {\int}_{S} {\int}_{(0,T]}(\varphi(\omega,s,y)-\varphi(\omega,s,x_{s-})) m^{\pi}(dy|\omega,s) d s\right] \end{array} $$

which, together with Eq. 3.4 and the expectation of both sides of Eq. 3.3 with s = 0, gives

$$ \begin{array}{@{}rcl@{}} && \mathbb{E}_{x}^{\pi}\left[ \varphi(\omega,T,x_{T}) \right]\\ &=&\mathbb{E}_{x}^{\pi}\varphi(\omega,0,x)+\mathbb{E}_{x}^{\pi}\left[ {{\int}_{0}^{T}} \varphi^{\prime}(\omega,t,x_{t})dt \right] + \mathbb{E}_{x}^{\pi}\left[ \sum\limits_{n\ge 1}{\int}_{(0,T]}{\Delta} \varphi(\omega,t,x_{t})\delta_{T_{n} }(dt)\right] \\ &=&\mathbb{E}_{x}^{\pi}\varphi(\omega,0,x)+ \mathbb{E}_{x}^{\pi}\left[ {{\int}_{0}^{T}} \varphi^{\prime}(\omega,t,x_{t})dt\right] + \mathbb{E}_{x}^{\pi}\left[ {\int}_{(0,T]}{\int}_{S}[\varphi(\omega,s,y)-\varphi(\omega,s,x_{s-})] m^{\pi}(dy|\omega,s) d s\right] \\ &=&\mathbb{E}_{x}^{\pi}\varphi(\omega,0,x)+ \mathbb{E}_{x}^{\pi}\left[ {{\int}_{0}^{T}} \varphi^{\prime}(\omega,t,x_{t})dt\right]+\mathbb{E}_{x}^{\pi}\left[ {\int}_{(0,T]} {\int}_{S} {\int}_{A} \varphi(\omega,t,y)q(dy|x_{t-},a) \pi(da|\omega,t)dt\right]. \end{array} $$

Here, integrability results such as Eq. 3.2 validate all the involved operations. Moreover, since xt(ω) = xt(ω) on (0,T] except finite time points t, part (a) follows.

  1. (b)

    In lieu of φ(ω,t,xt) with \(e^{{{\int \limits }_{s}^{t}}{\int \limits }_{A}\delta c(v,x_{v},a)\pi _{v}(da|x_{v})dv}\phi (t,x_{t})\), taking the conditional expectation \(\mathbb {E}_{\gamma }^{\pi }[\cdot |x_{s}=x]\) in both sides of Eq. 3.3 and using the Markov property of {xt,t ≥ 0}, as the proof of (a), we see that (b) is also true.

  2. (c)

    Under the condition in (c), the function φ(ω,t,x) in (b) is bounded on Ω × [0,T] × S, and thus the first part of Eq. 3.2 is finite (by Lemma 3.1(b) and Assumptions 3.1(i,ii)). Moreover, since

    $$ \begin{array}{@{}rcl@{}} \left( e^{{{\int}_{s}^{t}}{\int}_{A}|\delta c(x_{v},a)|\pi_{v}(da|x_{v})dv}\phi(t,x_{t})\right)'&=&\delta |c(x_{t},\pi_{t})|e^{{{\int}_{s}^{t}}{\int}_{A}|\delta c(x_{v},a)|\pi_{v}(da|x_{v})dv}\phi(t,x_{t})\\ && \ \ +e^{{{\int}_{s}^{t}}{\int}_{A}|\delta c(x_{v},a)|\pi_{v}(da|x_{v})dv}\phi^{\prime}(t,x_{t})\\ &\leq & \delta\|c\|e^{T\|c\|}\|\phi\|+Ce^{T\delta\|c\|}V_{0}(x_{t}) \end{array} $$

    which, together with Lemma 3.1(b), implies that the second part of Eq. 3.2 is finite. Thus, (c) follows.

Next we derive an extension of Dynkin’s formula by Lemma 3.2. To do so, we introduce the following condition and notation.

Assumption 3.2

There exist a Borel measurable function V1 ≥ 1 on S, and positive constants ρ1,b1, and M1, such that

(i):

HCode \({\int \limits }_{S}{V_{1}^{2}}(y)q(dy|x,a)\leq \rho _{1}{V_{1}^{2}}(x)+b_{1}\) for all (x,a) ∈ K;

(ii):

\({V_{0}^{2}}(x)\leq M_{1}V_{1}(x)\) for all xS, with the V0(x) as the Assumption 3.1(i).

Assumption 3.2 is used to give a domain for the Dynkin’s formula below, and it is obviously satisfied when the transition rates are bounded (Ghosh and Saha 2014; Kumar and Chandan 2013, 2015; Yushkevich 1977).

Given the Vk(k = 0,1) as in Assumption 3.2 and any Borel set Z, a real-valued function φ on Z × S is called Vk-bounded if the Vk-weighted norm of φ, \(\|\varphi \|_{V_{k}}:=\sup _{(z,x)\in Z\times S}\frac {|\varphi (z,x)|}{V_{k}(x)}\), is finite. We denote by \(\mathbb {B}_{V_{k}}(Z\times S)\) the Banach space of all Vk-bounded functions on Z × S. When Vk(x) ≡ 1, \(\mathbb {B}_{1}(Z\times S)\) is the space of all bounded functions. In particular, take Z = Ω × [0,T] or [0,T], we define

$$ \mathbb{B}_{V_{0},V_{1}}^{1}({\Omega}\times [0,T]\times S):=\{\varphi\in \mathbb{B}_{V_{0}}({\Omega}\times [0,T]\times S)\cap\mathbb{B}_\mathcal{{P}}({\Omega}\times [0,T]\times S)\mid \varphi^{\prime}\in \mathbb{B}_{V_{1}}({\Omega}\times [0,T]\times S)\},$$

and then

$$ \mathbb{B}_{V_{0},V_{1}}^{1}([0,T]\times S):=\{\varphi\in \mathbb{B}_{V_{0}}([0,T]\times S)\cap\mathbb{B}_\mathcal{{P}}([0,T]\times S)\mid \varphi^{\prime}\in \mathbb{B}_{V_{1}}([0,T]\times S)\}.$$

Theorem 3.1 below is an extension of Theorem 3.1 in Guo et al. (2015b) and Theorem 3.1 in Wei (2016) from the forms of functions φ(t,x) to the more general forms of φ(ω,t,x) with the additional variable ω, and the extension is needed for the following arguments over history-dependent policies. The proof of Theorem 3.1 is based on Lemma 3.2, which is different from those in Guo et al. (2015b) and Wei (2016) for Markov policies only.

Theorem 3.1

Suppose Assumptions 3.1 and 3.2 are satisfied. Then, for each (s,x) ∈ [0,T] × S, the following assertions hold.

(a):

(The extension of Dynkin’s formula): For every π ∈π and \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}({\Omega }\times [0,T]\times S)\),

$$ \begin{array}{@{}rcl@{}} && \mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T} \left( \varphi^{\prime}(\omega,t,x_{t})+{\int}_{A}{\int}_{S}\varphi(\omega,t,y)q(dy|x_{t},a)\pi(da|\omega,t)\right)dt \right] \\&& \ \ =\mathbb{E}_{x}^{\pi}\varphi(\omega,T,x_{T})-\mathbb{E}_{x}^{\pi}\varphi(\omega,0,x), \end{array} $$

where {xt,t ≥ 0} may be not Markovian since the policy π may depend on histories.

(b):

(The Dynkin’s formula): For each \({\pi \in {\Pi }_{m}^{r}}\), and \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\),

$$ \qquad \mathbb{E}_{\gamma}^{\pi}\left[{\int}_{s}^{T} \left( \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x_{t})\right)'+{\int}_{S}e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,y)q(dy|x_{t},\pi_{t})\right)dt|x_{s}=x\right]$$
$${}=\mathbb{E}_{\gamma}^{\pi}\left[e^{{{\int}_{s}^{T}} \delta c(x_{t},\pi_{t})dt}\varphi(T,x_{T})|x_{s}=x\right]-\varphi(s,x).\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ $$

Proof

  1. (a)

    By the definition \(\mathbb {B}_{V_{0},V_{1}}^{1}({\Omega }\times [0,T]\times S)\) and \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}({\Omega }\times [0,T]\times S)\), we have

    $$ \begin{array}{@{}rcl@{}} |\varphi(\omega,t,x)|\leq \|\varphi\|_{V_{0}}V_{0}(x), \ |\varphi^{\prime}(\omega,t,z)|\leq \|\varphi^{\prime}\|_{V_{1}}V_{1}(x) \ \ \ \forall \ (\omega,t,x)\in {\Omega}\times [0,T]\times S. \end{array} $$
    (3.5)

    Thus, by Assumptions 3.1(i,ii) and 3.2 we have, for (ω,t,z) ∈Ω× [0,T] × S

    $$ \begin{array}{@{}rcl@{}} {\int}_{A}{\int}_{S}|\varphi(\omega,t,y)||q|(dy|z,a)\pi(da|\omega,t) &\leq& \|\varphi\|_{V_{0}}\left[{\int}_{A}{\int}_{S}V_{0}(y)|q|(dy|z,a)\pi(da|\omega,t)\right] \\ &\leq& \|\varphi\|_{V_{0}}[\rho_{0}V_{0}(z)+b_{0}+2M_{0}^{\prime}{V_{0}^{2}}(z)] \\ &\leq & \|\varphi\|_{V_{0}}[\rho_{0}M_{1}+b_{0}+2M_{0}^{\prime}M_{1}]V_{1}(z). \end{array} $$
    (3.6)

    Moreover, by Eq. 3.5 we have

    $$ \begin{array}{@{}rcl@{}} {{\int}_{0}^{T}}|\varphi^{\prime}(\omega,t,x_{t})|dt\leq\|\varphi^{\prime}\|_{V_{1}}{{\int}_{0}^{T}}V_{1}(x_{t})dt\leq\|\varphi^{\prime}\|_{V_{1}}{{\int}_{0}^{T}}{V_{1}^{2}}(x_{t})dt, \end{array} $$

    which, together with Lemma 3.1(b), gives

    $$ \begin{array}{@{}rcl@{}} \mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}|\varphi^{\prime}(\omega,t,x_{t})|dt\right] &\leq& \|\varphi^{\prime}\|_{V_{1}}\mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}{V_{1}^{2}}(x_{t})dt\right]\\&\leq&\|\varphi^{\prime}\|_{V_{1}}Te^{\rho_{1}T}\left[\frac{{V_{1}^{2}}(x)}{T\rho_{1}}+\frac{b_{1}}{\rho_{1}}\right]<\infty. \end{array} $$
    (3.7)

    Thus, by Eqs. 3.63.7 and Lemma 3.1(b) we have

    $$ \begin{array}{@{}rcl@{}} && \mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}{\int}_{A}{\int}_{S}|\varphi(\omega,t,y)||q|(dy|x_{t},a)\pi(da|\omega,t)dt\right] \\ &\leq & T\|\varphi\|_{V_{0}}[\rho_{0}M_{1}+b_{0}+2M_{0}^{\prime}M_{1}]e^{\rho_{1}T}\left[\frac{{V_{1}^{2}}(x)}{T\rho_{1}}+\frac{b_{1}}{\rho_{1}}\right], \ \ \ \ \ \end{array} $$

    which, together with Eq. 3.7 and Lemma 3.2(a), implies (a).

  2. (b)

    For any fixed xS, a.e. t ∈ [0,T] (depending on ω), and 0 ≤ st, since

    $$ \begin{array}{@{}rcl@{}} \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x)\right)'=\delta c(x_{t},\pi_{t})e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x)+e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi^{\prime}(t,x), \end{array} $$

    by \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) and Tδ|c(x,a)|≤ M0V0(x) (using Assumption 3.1(iii)) we have

    $$ \begin{array}{@{}rcl@{}} \left|\left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x)\right)'\right|&\leq & \frac{M_{0}}{T}\|\varphi\|_{V_{0}}V_{0}(x)e^{{{\int}_{s}^{t}} |\delta c(x_{v},\pi_{v})|dv}V_{0}(x) \\ && +\|\varphi^{\prime}\|_{V_{1}}e^{{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}V_{1}(x), \\ &\leq & \left( \frac{\|\varphi\|_{V_{0}}M_{0}M_{1}}{T}+\|\varphi^{\prime}\|_{V_{1}}\right)e^{{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}V_{1}(x). \end{array} $$

    which, together with the same arguments for Eq. 3.3, gives

    $$ \begin{array}{@{}rcl@{}} &&\left|\left( e^{{{\int}_{s}^{t}}\delta c(x_{v}(\omega),\pi_{v})dv}\varphi(t,x_{t}(\omega))\right)'\right| \\&\leq& \left( \frac{\|\varphi\|_{V_{0}}M_{0}M_{1}}{T}+\|\varphi^{\prime}\|_{V_{1}}\right)e^{{{\int}_{s}^{t}}|\delta c(x_{v}(\omega),\pi_{v})|dv}V_{1}(x_{t}(\omega)) \end{array} $$
    (3.8)

    for a.s. ω ∈Ω with respect to \(\mathbb {P}_{x}^{\pi }\), and a.e. t ∈ [0,T](depending on give ω).

On the other hand, by the H\(\ddot {o}\)lder inequality we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{\gamma}^{\pi}\left[ e^{{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}V_{1}(x_{t})|x_{s}=x\right]\\&\leq& \sqrt{\mathbb{E}_{\gamma}^{\pi}\left[e^{2{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}|x_{s}=x\right]\mathbb{E}_{\gamma}^{\pi}\left[V_{1}^{2}(x_{t})|x_{s}=x\right]} \\ &\leq& \mathbb{E}_{\gamma}^{\pi}\left[e^{2{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}|x_{s}=x\right]\mathbb{E}_{\gamma}^{\pi}\left[V_{1}^{2}(x_{t})|x_{s}=x\right]. \ \ \ \ \ \ \ \ \ \ \end{array} $$
(3.9)

Furthermore, by the arguments similar to the proof of Eq. 3.1, we also have

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\gamma}^{\pi}\left[ e^{2{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}|x_{s}=x\right] \leq LV_{0}(x), \end{array} $$
(3.10)

which, together with Lemma 3.1(b) and Eq. 3.9, implies

$$ \begin{array}{@{}rcl@{}} &&{{\int}_{s}^{T}} \mathbb{E}_{\gamma}^{\pi} \left[e^{{{\int}_{s}^{t}} |\delta c(x_{v},\pi_{v})|dv}V_{1}(x_{t})|x_{s}=x\right]dt\\ &\leq &LV_{0}(x){{\int}_{s}^{T}}\mathbb{E}_{\gamma}^{\pi}\left[V_{1}^{2}(x_{t})|x_{s}=x\right]dt \\ &\leq& TLV_{0}(x) e^{\rho_{1} T}\left[V_{1}^{2}(x)+\frac{b_{1}}{\rho_{1}}\right]<\infty. \end{array} $$
(3.11)

Also, by Assumptions 3.1 and 3.2, we have

$$ \begin{array}{@{}rcl@{}} &&{\int}_{S}e^{{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}|\varphi(t,y)||q|(dy|x_{t},\pi_{t})\\ &\leq &\|\varphi\|_{V_{0}} [\rho_{0}V_{0}(x_{t})+b_{0}+2M_{0}^{\prime}{V_{0}^{2}}(x_{t})]e^{{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv} \\ &\leq & \|\varphi\|_{V_{0}}\left[\rho_{0}M_{1}+b_{0}+2M_{0}^{\prime}M_{1}\right]e^{{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}V_{1}(x_{t}). \ \ \ \ \ \ \ \ \end{array} $$
(3.12)

Thus, by Eqs. 3.83.12 we have

$$ \begin{array}{@{}rcl@{}} && \mathbb{E}_{\gamma}^{\pi}\left[{\int}_{s}^{T} \left( \left|\left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x_{t})\right)^{\prime}\right|+{\int}_{S} e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}|\varphi(t,y)||q|(dy|x_{t},\pi_{t}) \right)dt|x_{s}=x\right]\\ \\&\leq& L\left[\|\varphi\|_{V_{0}}M_{0}M_{1}+T\|\varphi^{\prime}\|_{V_{1}}+T\|\varphi\|_{V_{0}}(\rho_{0}M_{1}+b_{0}+2M_{0}^{\prime}M_{1})\right]V_{0}(x) e^{\rho_{1} T}\!\left[V_{1}^{2}(x) + \frac{b_{1}}{\rho_{1}}\right]<\infty, \end{array} $$

which, together with Lemma 3.2, verifies (b). □

Theorem 3.2

Under Assumptions 3.1 and 3.2, the following assertions hold.

(a):

If there exists \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) such that

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\inf_{a\in A(x)}\left[\delta c(x,a)\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,a)\right]=0, \\ \varphi(T,x)\equiv 1, \end{array}\right. \end{array} $$
(3.13)

for each xS and tE(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\), then

(a1):

J(π,0,x) ≥ φ(0,x), for all π ∈π and xS, and

(a2):

J(π,t,x) ≥ φ(t,x), for all \(\pi \in {{\Pi }_{m}^{r}}\) and (t,x) ∈ [0,T] × S.

(b):

For any randomized Markov policy \(\pi \in {{\Pi }_{m}^{r}}\), J(π,t,x) is a unique solution in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) of the following equation

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\delta c(x,\pi_{t})\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,\pi_{t})=0 \\ \varphi(T,x)=1 \end{array}\right. \end{array} $$
(3.14)

for each xS and tE(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).

Proof

  1. (a)

    For each ω ∈Ω, under the conditions for (a) we have

    $$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\varphi^{\prime}\!(t,x) + {\int}_{A}\delta c(x\!,a)\pi(da|\omega,t)\varphi(t,x) + {\int}_{S}\!{\int}_{A}\varphi(t,y)q(dy|x,a)\pi(da|\omega,t)\!\geq\! 0 \end{array} $$
    (3.15)

    for all xS and tE(φ,x).

On the other hand, for a.s. ω ∈Ω (with respect to \(\mathbb {P}_{x}^{\pi }\)), since {xt(ω),t ∈ [0,T]} =: {x0,…,xk}(xiS,0 ≤ ik) for some a finite k (depending on ω) determined by Tk(ω) ≤ T < Tk+ 1(ω) and \(m_{L}(E_{(\varphi ,x_{i})}^{c})=0\), by Eq. 3.15 we have

$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\varphi^{\prime}(t,x_{t}) + {\int}_{A}\delta c(x_{t},a)\pi(da|\omega,t)\varphi(t,x_{t}) + {\int}_{S}\!{\int}_{A}\varphi(t,y)q(dy|x_{t},a)\pi(da|\omega,t)\!\geq\! 0 \end{array} $$
(3.16)

for a.s. ω ∈Ω and a.e. t ∈ [0,T] (because of \(m_{L}(\cup _{i=0}^{k} E_{(\varphi ,x_{i})}^{c})=0\)).

Let \(\bar c(\omega ,t,x,\pi ):={\int \limits }_{A}\delta c(x_{t},a)\pi (da|\omega ,t)\). Then, we have, for any 0 ≤ sT,

$$ \begin{array}{@{}rcl@{}} {{\int}_{s}^{T}}\!\left[\!\left( e^{{{\int}_{s}^{t}}\bar c(\omega,v,x,\pi)dv}\varphi(t,x_{t})\right)' + {\int}_{S}{\int}_{A}q(dy|x_{t},a)\pi(da|\omega,t)e^{{{\int}_{s}^{t}}\bar c(\omega,v,x,\pi)dv}\varphi(t,y)\!\right]\!dt\!\geq \!0. \end{array} $$

The proof of (a1): For each π ∈π and xS, since φ(T,y) = 1 for all yS, by Theorem 3.1(a) and Eq. 3.16 we have

$$ \begin{array}{@{}rcl@{}} && \mathbb{E}_{x}^{\pi} \left( e^{{{\int}_{0}^{T}}{\int}_{A}\delta c(x_{v},a)\pi(da|\omega,v)dv)}\right)-\varphi(0,x)\\ &=&\mathbb{E}_{x}^{\pi} \left( e^{{{\int}_{0}^{T}}{\int}_{A}\delta c(x_{v},a)\pi(da|\omega,v)dv}\varphi(T,x_{T})\right)-\varphi(0,x)\\ &=& \mathbb{E}_{x}^{\pi} \left( e^{{{\int}_{0}^{T}}{\int}_{A}\delta c(x_{v},a)\pi(da|\omega,v)dv}\varphi(T,x_{T})\right)-\mathbb{E}_{x}^{\pi} \left( e^{{{\int}_{0}^{0}}{\int}_{A}\delta c(x_{v},a)\pi(da|\omega,v)dv}\varphi(0,x_{0})\right)\geq 0 \end{array} $$

and so

$$ \begin{array}{@{}rcl@{}} J(\pi,0,x)=\mathbb{E}_{x}^{\pi} \left( e^{{{\int}_{0}^{T}}{\int}_{A}\delta c(x_{v},a)\pi(da|\omega,v)dv}\right)\geq \varphi(0,x), \end{array} $$

which implies (a1).

Similarly, by Theorem 3.1(b) we see that (a2) is also true.

  1. (b)

    If there exists a \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.14), by (a2) and φ(T,x) ≡ 1, we have

    $$\varphi(s,x)=\mathbb{E}_{\gamma}^{\pi} \left[e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}|x_{s}=x\right]=J(\pi,s,x), \ \ s\in [0,T],$$

    and so (b) follows. Thus, to complete the proof of (b), it suffices to show the existence of \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}(I\times S)\) satisfying (3.14), while the proof of which is very long and thus will be postponed to Section 4; see Remark 4.2 below.

To establish the existence of \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.13), we need Lemma 3.3 below. To state it, we recall some concepts. A subset of a Borel space X is analytic (by Proposition 7.41 in Bertsekas and Shreve 1996) if it is a projection into X of a Borel subset of X × Y for some uncountable Borel space Y. Then, a function u(⋅) on X is called upper semianalytic if {xX : u(x) > r} is an analytic set for each \(r\in (-\infty ,\infty )\). It is known that each Bore-measurable function is upper semianalytic; see more details in Chapter 7 of Bertsekas and Shreve (1996). Hence, each Borel measurable function such as c(x,a) is upper semianalytic on K.

Lemma 3.3

Suppose that Assumption 3.1 holds. For any \(u(t,x)\in \mathbb {B}_{V_{0}}([0,T]\times S)\), define a corresponding function \(u_{*}(t,x): \ [0,T]\times S\longrightarrow (-\infty ,\infty )\) by

$$u_{*}(t,x):= \inf_{a\in A(x)}\left\{\delta c(x,a)u(t,x)+{\int}_{S}u(t,y)q(dy|x,a)\right\}.$$

Then, the following assertions hold.

(a):

The function u(t,x) is upper semianalytic (and hence universally measurable).

(b):

For every ε > 0, there exists a deterministic Markov policy \(f_{\varepsilon }\in {{\Pi }_{m}^{d}}\) (depending on ε) such that

$$ \begin{array}{@{}rcl@{}} &&\delta c(x,f_{\varepsilon}(t,x))u(t,x)+{\int}_{S}u(t,y)q(dy|x,f_{\varepsilon}(t,x))\\&\leq& u_{*}(t,x)+\varepsilon \ \ \forall \ (t,x)\in [0,\infty)\times S. \end{array} $$

Proof

See Lemma 3.3 in Guo et al. (2015b) or (Bertsekas and Shreve 1996, Propositions 7.47 and 7.50). □

4 The existence of optimal Markov policies

In this section, we prove the existence of an optimal Markov policy and a solution to the following optimality equation (4.1) for the finite horizon CTMDPs with the risk-sensitive criterion. The proofs are shown in three steps as follows: 1) consider the case of bounded transition and cost rates, 2) deal with the case of unbounded transition rates but nonnegative costs, and 3) study the case of unbounded transition and cost rates.

Suppose that f(x) is defined on X. Denote \(\|f\|:=\sup _{x\in X}|f(x)|\). The following results are for the case of the bounded transition and bounded cost rates.

Proposition 4.1

If ∥q∥ and ∥c∥ are finite, then the following assertions hold.

(a):

There exists a unique φ in \(\mathbb {B}^{1}_{1,1}([0,T]\times S)\) (that is, V0(x) = V1(x) ≡ 1,xS) satisfying the following o ptimality equation for the risk-sensitive criterion of CTMDPs on the finite horizon:

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\inf_{a\in A(x)}[\delta c(x,a)\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,a)]=0, & \\ \varphi(T,x)=1, \end{array}\right. \end{array} $$
(4.1)

for each xS and tE(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).

(b):

\(\varphi (t,x)=J_{*}(t,x)=\inf _{f\in {{\Pi }_{m}^{d}} }J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).

(c):

For any ε > 0, there exists \(f_{\varepsilon }\in {{\Pi }_{m}^{d}}\) such that J(fε,t,x) ≤ φ(t,x) + ε for all (t,x) ∈ [0,T] × S.

Proof

  1. (a)

    Define the following operator B on \(\mathbb {B}_{1}([0,T]\times S)\) (actually the space of all bounded functions) by

    $$ \begin{array}{@{}rcl@{}} B\psi(t,x):=1+{{\int}_{t}^{T}} \inf_{a\in A(x)}\left[\delta c(x,a)\psi(s,x)+{\int}_{S}\psi(s,y)q(dy|x,a)\right]ds \end{array} $$
    (4.2)

    for any (t,x) ∈ [0,T] × S and \(\psi \in \mathbb {B}_{1}([0,T]\times S)\).

Then, for each (t,x) ∈ [0,T] × S, and any \(\psi _{1},\psi _{2}\in \mathbb {B}_{1}([0,T]\times S)\), from Eq. 4.2 and q({x}|x,a) + q(S ∖{x})|x,a) ≡ 0 we obtain

$$ \begin{array}{@{}rcl@{}} |B\psi_{1}(t,x)-B\psi_{2}(t,x)|&\leq&{{\int}_{t}^{T}}\sup_{a\in A(x)}\left[\delta\|c\|\|\psi_{1}-\psi_{2}\|+{\int}_{S}|q|(dy|x,a)\|\psi_{1}-\psi_{2}\|\right]ds \\ &\leq & (\delta\|c\|+2\|q\|) {{\int}_{t}^{T}}\|\psi_{1}-\psi_{2}\|ds \\ &=&\tilde{L}(T-t)\|\psi_{1}-\psi_{2}\| \end{array} $$

where \(\tilde {L}:=\delta \|c\|+2\|q\|<\infty \). Furthermore, by induction we can prove the following fact:

$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!|B^{n}\psi_{1}(t,x)-\!B^{n}\psi_{2}(t,x)| \!\leq\! \tilde{L}^{n} \frac{(T-t)^{n}}{n!}\|\psi_{1} - \psi_{2}\| \ \ \ \forall \ (t,x)\in [0,T]\!\times\! S, n\geq 1. \end{array} $$
(4.3)

Since \({\sum }_{n=1}^{\infty } \tilde {L}^{n} \frac {T^{n}}{n!}\|\psi _{1}-\psi _{2}\|<\infty \), there exists some integer k such that the constant \(\beta := \tilde {L}^{k} \frac {T^{k}}{k!}<1\). Thus, by Eq. 4.3 we have ∥Bkψ1Bkψ2∥≤ βψ1ψ2∥. Therefore, B is a k-step contract operator. Thus, there exists a function \(\varphi \in \mathbb {B}_{1}([0,T]\times S)\) such that Bφ = φ, that is

$$ \begin{array}{@{}rcl@{}} \varphi(t,x):=1+{{\int}_{t}^{T}} \inf_{a\in A(x)}\left[\delta c(x,a)\varphi(s,x)+{\int}_{S}\varphi(s,y)q(dy|x,a)\right]ds \ \forall \ (t,x)\in [0,T]\times S. \end{array} $$
(4.4)

Since ∥q∥ and ∥c∥ are finite, by Eq. 4.4 we see that \(\varphi \in \mathbb {B}^{1}_{1,1}([0,T]\times S)\), and thus (a) follows.

  1. (b-c)

    We are going to prove (b) and (c) together. Since \(\|q\|<\infty \) and \(\|c\|<\infty \), Assumptions 3.1 and 3.2 are satisfied by taking V0 = V1 ≡ 1. Thus, it follows from Eq. 4.1 and Theorem 3.2(a) that

    $$ \begin{array}{@{}rcl@{}} J(\pi,t,x)\geq \varphi(t,x) \ \ \ \text{for \ all } \ \pi\in {{\Pi}_{m}^{r}}. \end{array} $$
    (4.5)

    Moreover, for any ε > 0, since \(\varphi \in \mathbb {B}_{1}([0,T]\times S)\), Lemma 3.3 together with Eq. 4.1 gives the existence of \(f_{\varepsilon }\in {{\Pi }_{m}^{d}}\), such that, for each xS and tE(φ,x),

    $$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\delta c(x,f_{\varepsilon}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f_{\varepsilon}(t,x))\leq \frac{\varepsilon}{T}e^{-T\delta\|c\|}, \\ \varphi(T,x)=1. \end{array}\right. \end{array} $$
    (4.6)

    Then, as the arguments for Eq. 3.16, by |c(x,a)|≤∥c∥ and Eq. 4.6 we have

    $$ \begin{array}{@{}rcl@{}} \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},f_{\varepsilon}(v,x_{v}))dv}\varphi(t,x_{t})\right)' + {\int}_{S}q(dy|x_{t},f_{\varepsilon}(t,x_{t}))\!\left( e^{{{\int}_{s}^{t}}\delta c(x_{v},f_{\varepsilon}(v,x_{v})))}\varphi(t,y)\right)\!\leq\! \frac{\varepsilon}{T} \end{array} $$

    for every 0 ≤ st. Thus, by Theorem 3.1(a) we have

    $$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{x}^{f_{\varepsilon}} \left( e^{{{\int}_{s}^{T}}\delta c(x_{v},f_{\varepsilon}(v,x_{v}))dv)}\right)-\varphi(s,x)\\&=&\mathbb{E}_{x}^{f_{\varepsilon}} \left( e^{{{\int}_{s}^{T}}\delta c(x_{v},f_{\varepsilon}(v,x_{v}))dv}\varphi(T,x_{T})\right)-\varphi(s,x)\leq \varepsilon \end{array} $$

    and so

    $$ \begin{array}{@{}rcl@{}} J(f_{\varepsilon},s,x)\leq \varphi(s,x)+\varepsilon \ \ \ \ \text{for \ all} \ (s,x)\in [0,T]\times S. \end{array} $$
    (4.7)

    Therefore, since ε can be arbitrary, by Eqs. 4.5 and 4.7 we have

    $$ \begin{array}{@{}rcl@{}} \inf_{\pi\in {{\Pi}_{m}^{r}}} J(\pi,t,x)=\varphi(t,x)=\inf_{f\in {{\Pi}_{m}^{d}}} J(f,t,x), \ \text{ and} \ \ J(f_{\varepsilon},t,x)\leq \varphi(t,x)+\varepsilon, \end{array} $$

    for all (t,x) ∈ [0,T] × S, and by Eq. 2.6 so (b) and (c) follow.

Proposition 4.1 shows the existence of a solution to the optimality equation for the bounded transition and cost rates. To further establish the existence of an optimal Markov policy, we need some conditions below.

Assumption 4.1

(i):

For each xS, A(x) is compact;

(ii):

For each xS and \(D\in \mathcal {{B}}(S)\), the function q(D|x,a) is continuous in aA(x);

(iii):

For each xS, the functions c(x,a) and \({\int \limits }_{S}V_{0}(y)q(dy|x,a)\) are continuous in aA(x), with V0 as in Assumption 3.1.

Remark 4.1

Assumption 4.1 is used for the existence of the minimum points in the optimality equation (4.1).

By Lemma 8.3.7(a) in Hernández-Lerma and Lasserre (1999), under Assumption 4.1 we have the following lemma.

Lemma 4.1

Under Assumptions 4.1(ii,iii), the function \({\int \limits }_{S}q(dy|x,a)u(t,y)\) is continuous in aA(x), for every fixed (t,x) ∈ [0,T] × S and \(u\in \mathbb {B}_{V_{0}}([0,T]\times S)\).

Proposition 4.2

Under Assumption 4.1, if \(\|q\|<\infty \) and \(\|c\|<\infty \), that is, transition and cost rates are bounded, then the following assertions hold.

(a):

There exists a unique φ(t,x) in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying the o ptimality equation (4.1).

(b):

\(\varphi (t,x)=J_{*}(t,x)=\inf _{f\in {{\Pi }_{m}^{d}}} J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).

(c):

There exists a deterministic Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that

$$ \begin{array}{@{}rcl@{}} \varphi^{\prime}(t,x)+ \delta c(x,f^{*}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f^{*}(t,x))=0 \end{array} $$

for each xS and tE(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).

(d):

If, in addition, c(x,a) ≥ 0 for all (x,a) ∈ K, then J(x,t) (and also φ(t,x)) is decreasing in t ∈ [0,T] for each fixed xS.

Proof

We only need to prove (c) and (d) since (a) and (b) follow from Proposition 4.1. For the function φ(t,x) from (a), when \(\varphi ^{\prime }(t,x)\) does not exist for some (t,x), we define

$$ \begin{array}{@{}rcl@{}} \varphi^{\prime}(t,x):=-\inf_{a\in A(x)}[\delta c(x,a)\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,a)], \end{array} $$
(4.8)

which, together with Eq. 4.1, implies that Eq. 4.8 holds for each (t,x) ∈ [0,T] × S. Hence, Proposition 7.50 in Bertsekas and Shreve (1996) together with Lemma 4.1 ensures the existence of a Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi^{\prime}(t,x) + \delta c(x,f^{*}(t,x))\varphi(t,x) + {\int}_{S}\varphi(t,y)q(dy|x,f^{*}(t,x)] = 0 \ \ \ \forall \ (t,x)\!\in\! [0,T]\times S, & \\ \varphi(T,x)=1. \end{array}\right. \end{array} $$

Then, as the proofs of (b) and (c) in Proposition 4.1, we see that (c) is true.

(d) Fix any s,t ∈ [0,T] with s < t. Then, for any Markov policy \(f\in {{\Pi }_{m}^{d}}\), we define the corresponding Markov policy \({f_{s}^{t}}\) as follows: for each xS,

$$ \begin{array}{@{}rcl@{}} && {f_{s}^{t}}(v,x) = \left\{\begin{array}{lll} f(v+t-s,x) & \text{$ v\geq s, $} \\ f(v,x) & \text{otherwise.} \end{array}\right. \end{array} $$
(4.9)

Then, we have, for each (v,x) ∈ [s,s + Tt] × S,

$$q(dy|x,{f_{s}^{t}}(v,x))=q(dy|x,f(v+t-s,x)), \ c(x,{f_{s}^{t}}(v,x))=c(x,f(v+t-s,x)).$$

Let

$$ \begin{array}{@{}rcl@{}} && J(f,s\sim t,x):=\mathbb{E}_{\gamma}^{\pi}\left[e^{{{\int}_{s}^{t}}{\int}_{A}\delta c(x_{v},f(v,x_{v}))dv}|x_{s}=x\right], \\ && J_{*}(s\sim t,x):=\inf_{f\in {{\Pi}_{m}^{d}}}J(f,s\sim t,x). \end{array} $$
(4.10)

By the Markov property of {xt,t ≥ 0} under any Markov policy f and Eqs. 4.94.10, we have \(J(f,t\sim T,x)= J({f_{s}^{t}},s\sim T+s-t,x)\), and thus \(J_{*}(t\sim T,x)\geq J_{*}(s\sim T+s-t,x)\); Similarly, we can prove that \(J_{*}(s\sim T+s-t,x)\geq J_{*}(t\sim T,x)\). Thus, we have \(J_{*}(t\sim T,x)=J_{*}(s\sim T+s-t,x)\). Moreover, since c(x,a) ≥ 0 on K, by Eq. 4.10 and t > s, we have \(J_{*}(t\sim T,x)=J_{*}(s\sim T+s-t,x)\leq J_{*}(s\sim T,x)\), which, together with \(J_{*}(t\sim T,x)=J_{*}(t,x)\), gives (d). □

Proposition 4.2 shows the existence of an optimal policy under the bounded transition and cost rates. We next extend the results in Proposition 4.2 to the case of unbounded transition rates by approximations from bounded transition and cost rates to unbounded transition rates and nonnegative costs.

Proposition 4.3

Under Assumptions 3.1, 3.2 and 4.1, if in addition c(x,a) ≥ 0 for all (x,a) ∈ K, then the following assertions hold.

(a):

There exists a unique φ(t,x) in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying the o ptimality equation (4.1).

(b):

\(\varphi (t,x)=J_{*}(t,x)=\inf _{f{\in {\Pi }_{m}^{d}}} J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).

(c):

There exists a Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that

$$ \begin{array}{@{}rcl@{}} \varphi^{\prime}(t,x)+ \delta c(x,f^{*}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f^{*}(t,x))=0 \end{array} $$

for each xS and tE(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).

Proof

We only proof (a). This is because (b) and (c) can be proved as the same arguments of (b)-(c) of Proposition 4.2.

The main thread of our proof is as follows.

First, we construct a series of models \( \mathcal {{M}}_{n}^{+}\) which the transition rates and costs are all bounded. By Proposition 4.2, there exists a decreasing (with respect to t) function \(\varphi _{n}(t,x)\in \mathbb {B}_{1,1}^{1}([0,T]\times S)\) which is the value function of \( \mathcal {{M}}_{n}^{+}\).

Second, we prove that the limit of φn(t,x) exists, denoted by φ(t,x).

Third, we prove that φ(t,x) is the value function of the original model.

Now, we construct \( \mathcal {{M}}_{n}^{+}\) first. under Assumption 3.1 (iii), we have

$$1\leq e^{2T\delta c(x,a)}\leq M_{0}V_{0}(x), (\text{i.e.}, \ 0\leq c(x,a)\leq \frac{1}{T\delta}\ln\sqrt{M_{0}V_{0}(x)} ) \ \text{for \ all \ } (x,a)\in K. $$

For each n ≥ 1, let An(x) := A(x) for xS, Kn := {(x,a)|xS,aAn(x)}, and Sn := {xS|V0(x) ≤ n}. Moreover, for each xS, aAn(x), let

$$ q_{n}(dy|x,a):= \left\{\begin{array}{lll} q(dy|x,a) &\text{if } x\in S_{n}, \\ 0 & \text{if } x\not\in S_{n}; \end{array}\right. $$
(4.11)
$$ c_{n}^{+}(x,a):= \left\{\begin{array}{lll} c(x,a)\wedge \min\{n, \frac{1}{T\delta}\ln\sqrt{M_{0}V_{0}(x)}\} &\text{if } x\in S_{n},\\ 0 & \text{if } x\not\in S_{n}. \end{array}\right. $$
(4.12)

Fix any n ≥ 1. By Eq. 4.11, it is obvious that the qn(dy|t,x,a) denotes indeed transition rates on S, which are conservative and stable. By Eq. 4.12, recall T > 0,δ > 0,M0 ≥ 1,V0(x) ≥ 1 and c(x,a) ≥ 0, \(c_{n}^{+}(x,a) \ge 0\) for n ≥ 1. Then, we obtain a sequence of models \(\{\mathcal {{M}}_{n}^{+}\}\):

$$ \mathcal{{M}}_{n}^{+}:=\left\{S,A,(A_{n}(x), x\in S), c_{n}^{+}(x,a), q_{n}(\cdot|x,a) \right\}, $$

for which the transition rates qn(dy|x,a) and costs \(c_{n}^{+}(x,a)\) are all bounded (by Assumption 3.1 and Eqs. 4.114.12). In the following arguments, any quality with respect to \(\mathcal {{M}}_{n}^{+}\) is labeled by a lower n, such as the risk-sensitive criterion Jn(f,t,x) of a Markov policy f and the value function \(J_{n}(t,x):=\inf _{f\in {{\Pi }_{m}^{d}}} J_{n}(f,t,x)\).

Obviously, Assumptions 3.1, 3.2 and 4.1 still hold for each model \({\mathscr{M}}_{n}^{+}\). Thus, for each n ≥ 1, it follows from Proposition 4.2 that there exists \(\varphi _{n}(t,x)\in \mathbb {B}_{1,1}^{1}([0,T]\times S)\) satisfying (4.1) for the corresponding \({\mathscr{M}}_{n}^{+}\), that is,

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi_{n}^{\prime}(t,x)+\inf_{a\in A_{n}(x)}[\delta c_{n}^{+}(x,a)\varphi_{n}(t,x)+{\int}_{S}\varphi_{n}(t,y)q_{n}(dy|x,a)]=0,& \\ \varphi_{n}(T,x)=1. \end{array}\right. \end{array} $$
(4.13)

for all xS and \(t\in E_{(\varphi _{n},x)}\) with \(m_{L}(E_{(\varphi _{n},x)}^{c})=0\). Furthermore, since \(c_{n}^{+}(x,a) \ge 0\), by Proposition 4.2 (d), φn(t,x)) is decreasing in t ∈ [0,T] for each fixed xS.

Then, Proposition 7.50 in Bertsekas and Shreve (1996) together with Lemma 4.1 and Eq. 4.13 gives the existence of a Markov policy \(f_{n}\in {{\Pi }_{m}^{d}}\) such that,

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi_{n}^{\prime}(t,x)\!+\delta c_{n}^{+}(x,f_{n}(t,x))\varphi_{n}(t,x) + {\int}_{S}\varphi_{n}(t,y)q_{n}(dy|x,f_{n}(t,x) = 0 \ \ \ \forall \ x\in S, & \\ \varphi_{n}(T,x)=1 \end{array}\right. \end{array} $$
(4.14)

for all xS and \(t\in E_{(\varphi _{n},x)}\) with \(m_{L}(E_{(\varphi _{n},x)}^{c})=0\).

Also, by Eq. 4.12 we have \(e^{2T\delta c_{n}^{+}(x,a)}\leq M_{0}V_{0}(x)\) for all xS and n ≥ 1. Then, using Lemma 3.1 and Theorem 3.2(a2) with V0 = V1 ≡ 1 and Lemma 3.1, from Eq. 4.14 we have

$$ \begin{array}{@{}rcl@{}} e^{-LV_{0}(x)}\leq \varphi_{n}(t,x)=J_{n}(f_{n},t,x)\leq LV_{0}(x) \ \ \ \ \ \forall \ n\geq 1. \end{array} $$
(4.15)

Moreover, since φn(t,x) ≥ 0 and \(c_{n}^{+}(x,a)\geq c_{n-1}^{+}(x,a)\) for all (x,a) ∈ K, by Eqs. 4.11 and 4.14 as well as Proposition 4.2(d), we have, for all \(x\in S, t\in E_{(\varphi _{n},x)}\) and n ≥ 2

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi_{n}^{\prime}(t,x)+ \delta c_{n-1}^{+}(x,f_{n}(t,x))\varphi_{n}(t,x)+{\int}_{S}\varphi_{n}(t,y)q_{n-1}(dy|x,f_{n}(t,x))\leq 0, \ x\in S_{n-1}, & \\ \varphi_{n}^{\prime}(t,x)+ \delta c_{n-1}^{+}(x,f_{n}(t,x))\varphi_{n}(t,x)+{\int}_{S}\varphi_{n}(t,y)q_{n-1}(dy|x,f_{n}(t,x))= \varphi_{n}^{\prime}(t,x)\leq 0, \ x\not\in S_{n-1} & \\ \varphi_{n}(T,x)=1, \end{array}\right. \end{array} $$

which, together with Theorem 3.2(a2) with V0 = V1 ≡ 1, implies that Jn− 1(fn,t,x) ≤ φn(t,x) for all (x,t) ∈ [0,T] × S. Therefore, we have φn− 1(t,x) ≤ Jn− 1(fn,t,x) ≤ φn(t,x), that is, the sequence {φn,n ≥ 1} is nondecreasing in n ≥ 1, and thus the limit

$$ \begin{array}{@{}rcl@{}} \varphi(t,x):=\lim_{n\to\infty}\varphi_{n}(t,x) \end{array} $$
(4.16)

exists for each (t,x) ∈ [0,T] × S.

Since φn(t,x) is decreasing in t ∈ [0,T], φ(t,x) is decreasing in t ∈ [0,T], for each fixed xS. Therefore, φ(t,x) is differential in a.e. t ∈ [0,T] (for each fixed xS).

Let, for every n ≥ 1 and (t,x) ∈ [0,T] × S,

$$ \begin{array}{@{}rcl@{}} && H_{n}(t,x):=\inf_{a\in A(x)}\left[\delta c_{n}^{+}(x,a)\varphi_{n}(t,x)+{\int}_{S}\varphi_{n}(t,y)q_{n}(dy|x,a)\right],\\ && H(t,x):=\inf_{a\in A(x)}\left[\delta c(x,a)\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,a)\right]. \end{array} $$

We next show that \(\lim _{n\to \infty } H_{n}(t,x)=H(t,x)\) for each (t,x) ∈ [0,T] × S.

Indeed, for any fixed (t,x) ∈ [0,T] × S, there exists n0 ≥ 1 such that \((t,x)\in [0,T]\times S_{n_{0}}\), and then qn(dy|x,a) = q(dy|x,a) for all nn0 and \(\lim _{n\to \infty } c_{n}^{+}(x,a)=c(x,a)\) for all aA(x). Thus, by Lemma 8.3.7 in Hernández-Lerma and Lasserre (1999) and Eq. 4.15 we have

$$ \begin{array}{@{}rcl@{}} &&\limsup_{n\to \infty}H_{n}(t,x)\\&\leq & \limsup_{n\to \infty}\left[\delta c_{n}^{+}(x,a)\varphi_{n}(t,x)+{\int}_{S}\varphi_{n}(t,y)q(dy|x,a)\right] \ \ \ \\ &\leq& \delta c(x,a)\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,a), \ \ \ \text{for \ all} \ a\in A(x). \end{array} $$

Hence,

$$ \begin{array}{@{}rcl@{}} \limsup_{n\to \infty}H_{n}(t,x)&\leq & \inf_{a\in A(x)} \left[\delta c(x,a)\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,a)\right]. \end{array} $$
(4.17)

On the other hand, note that \(\liminf _{n\to \infty }H_{n}(t,x)=\lim _{m\to \infty }H_{n_{m}}(t,x)\) for some subsequence {nm,m ≥ 1} of {n,n ≥ 1}. For each m ≥ 1, under Assumption 4.1, the measurable selection theorem (e.g. Proposition 7.50 in Bertsekas and Shreve 1996) together with Lemma 4.1 ensures the existence of \(f_{n_{m}}\in {{\Pi }_{m}^{d}}\) such that

$$ \begin{array}{@{}rcl@{}} H_{n_{m}}(t,x)&=&\inf_{a\in A(x)}[\delta c_{n_{m}}^{+}(x,a)\varphi_{n_{m}}(t,x)+{\int}_{S}\varphi_{n_{m}}(t,y)q(dy|x,a)] \\ &=&\delta c_{n_{m}}^{+}(x,f_{n_{m}}(t,x))\varphi_{n_{m}}(t,x)+{\int}_{S}\varphi_{n_{m}}(t,y)q(dy|x,f_{n_{m}}(t,x)). \end{array} $$
(4.18)

Since \(f_{n_{m}}(t,x)\in A(x)\) for all m ≥ 1 and A(x) is compact, there exists a subsequence \(\{f_{n_{m_{k}}}(t,x), k\geq 1\}\) of \(\{f_{n_{m}}(t,x), m\geq 1\}\) and a(t,x) ∈ A(x) (depending on (t,x)) such that \(f_{n_{m_{k}}}(t,x)\to a(t,x)\) as \(k\to \infty \). Thus, since \(\lim _{m\to \infty }H_{n_{m}}(t,x)=\lim _{k\to \infty }H_{n_{m_{k}}}(t,x)\), using Assumption 4.1, by Lemma 8.3.7 in Hernández-Lerma and Lasserre (1999) and Eq. 4.18 we have

$$ \begin{array}{@{}rcl@{}} \liminf_{n\to \infty}H_{n}(t,x)\!&=&\! \lim\limits_{k\to \infty}H_{n_{m_{k}}}(t,x)\\ \!&=&\! \lim\limits_{k\to \infty}\left[\!\delta c_{n_{m_{k}}}^{+}\!(x,f_{n_{m_{k}}}(t,x))\varphi_{n_{m_{k}}}\!(t,x) + {\int}_{S}\varphi_{n_{m_{k}}}(t,y)q(dy|x,f_{n_{m_{k}}}(t,x))\right] \\ \!&=&\! \delta c(x,a(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|t,x,a(t,x)) \\ \!&\geq&\! \inf\limits_{a\in A(x)} \left[\delta c(x,a)\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,a)\right], \end{array} $$

which, together with Eq. 4.17, implies that \(\lim _{n\to \infty } H_{n}(t,x)=H(t,x)\). Thus, by Eq. 4.13 we have

$$ \begin{array}{@{}rcl@{}} \varphi(t,x)=1+{{\int}_{t}^{T}}\inf_{a\in A(x)}\left[\delta c(x,a)\varphi(s,x)+{\int}_{S}\varphi(s,y)q(dy|x,a)\right]ds. \end{array} $$
(4.19)

Since φ(t,x) is the integral of a measurable function, it is a absolutely continuous function. Therefore, we prove that φ(t,x) is differential in a.e. t ∈ [0,T] (for each fixed xS) again. We can verify that φ(t,x) satisfies (4.1). To show \(\varphi (t,x)\in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\), since \(\varphi (t,x)\in \mathbb {B}_{V_{0}}([0,T]\times S)\) (by Eqs. 4.154.16), the rest verifies that \(\varphi ^{\prime }(t,x)\) is V1-bounded. Indeed, since T|δc(x,a)|≤ eTδ|c(x,a)|M0V0(x), from Eq. 4.19 we have

$$ \begin{array}{@{}rcl@{}} |\varphi^{\prime}(t,x)| &\leq & \frac{M_{0}}{T}\|\varphi\|_{V_{0}}V_{0}(x)V_{0}(x)+ \|\varphi\|_{V_{0}}[\rho_{0}V_{0}(x)+b_{0}+2V_{0}(x)q^{*}(x)] \\ &\leq & \|\varphi\|_{V_{0}}\left[ \frac{M_{0}}{T}{V_{0}^{2}}(x)+\rho_{0}V_{0}(x)+b_{0}+2M_{0}^{\prime}{V_{0}^{2}}(x)\right] \\ &\leq & \|\varphi\|_{V_{0}}\left[\frac{M_{1}M_{0}}{T}+\rho_{0}M_{1}+b_{0}+2M_{0}^{\prime}M_{1}\right]V_{1}(x) \end{array} $$

which implies that φ(t,x) is in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\), and thus (a) is proved. □

Next, we use Proposition 4.3 to prove our main results by approximation from nonnegative cost rates to the cost rates that may be unbounded from above and from below.

Theorem 4.1

Under Assumptions 3.1, 3.2 and 4.1, the following assertions hold.

(a):

There exists a unique φ(t,x) in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying the o ptimality equation (4.1).

(b):

\(\varphi (t,x)=\inf _{{\pi \in {\Pi }_{m}^{r}}}J(\pi ,t,x)=\inf _{f{\in {\Pi }_{m}^{d}}}J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).

(c):

There exists a Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that

$$ \begin{array}{@{}rcl@{}} \varphi^{\prime}(t,x)+ \delta c(x,f^{*}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f^{*}(t,x))=0 \end{array} $$

for each xS and tE(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\), and f is optimal.

Proof

We only prove (a) and the optimality of the policy f since the others can be proved as (b) and (c) of Proposition 4.3. For each n ≥ 1, define cn on K as follows: for each (x,a) ∈ K,

$$ c_{n}(x,a):=\max\{-n, c(x,a)\}, $$

which implies that \(\lim _{n\to \infty } c_{n}(x,a)=c(x,a)\) and kn(x,a) := cn(x,a) + n ≥ 0 for each (x,a) ∈ K and n ≥ 1. Moreover, it follows from Assumption 3.1(iii) that

$$ -\frac{1}{T\delta}\ln\sqrt{M_{0}V_{0}(x)}\leq \max\{-n, -\frac{1}{T\delta}\ln\sqrt{M_{0}V_{0}(x)}\}\leq c_{n}(x,a)\leq \frac{1}{T\delta}\ln\sqrt{M_{0}V_{0}(x)}. $$
(4.20)

Thus, \(e^{2T\delta k_{n}(x,a)}\leq e^{2Tn\delta }M_{0}V_{0}(x)\) for all (x,a) ∈ K for all n ≥ 1, and so Assumptions 3.1 (with M0 replaced by e2TnδM0), 3.2 and 4.1 still hold for each model \(\mathcal {{N}}_{n}\) defined by

$$ \mathcal{{N}}_{n}:=\left\{S,(A(x), x\in S), k_{n}(x,a), q(\cdot|x,a) \right\}. $$

For any real-valued Borel measurable function u on K, let

$$J_{u}(x,t):=\inf_{f{\in{\Pi}_{m}^{d}}}\mathbb{E}_{\gamma}^{\pi}\left[e^{{\delta{\int}_{t}^{T}}{\int}_{A}u(x_{t},f(t,x_{t}))dt}|x_{t}=x\right]$$

provided the integral exists. Then, for each n ≥ 1, since kn(x,a) ≥ 0, by Proposition 4.3(b) we have \(J_{k_{n}}\) is in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) and satisfies

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} J_{k_{n}}^{\prime}(t,x)+\inf_{a\in A(x)}[\delta k_{n}(x,a)J_{k_{n}}(t,x)+{\int}_{S}J_{k_{n}}(t,y)q(dy|x,a)]=0\\ J_{k_{n}}(T,x)=1. \end{array}\right. \end{array} $$
(4.21)

for all xS and \(t\in E_{(J_{k_{n}},x)}\).

Moreover, since \(J_{k_{n}}(t,x)=J_{c_{n}+n}(t,x)=J_{c_{n}}(t,x)e^{\delta (T-t)n}\), by Eq. 4.21 we derive that

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} J_{c_{n}}^{\prime}(t,x)+\inf_{a\in A(x)}[\delta c_{n}(x,a)J_{c_{n}}(t,x)+{\int}_{S}J_{c_{n}}(t,y)q(dy|x,a)]=0\\ J_{c_{n}}(T,x)=1. \end{array}\right. \end{array} $$

This is

$$ \begin{array}{@{}rcl@{}} J_{c_{n}}(t,x)=1+{{\int}_{t}^{T}} \inf\limits_{a\in A(x)}[\delta c_{n}(x,a)J_{c_{n}}(s,x)+{\int}_{S}J_{c_{n}}(s,y)q(dy|x,a)]ds. \end{array} $$
(4.22)

On the other hand, for each (t,x) ∈ [0,T] × S, it follows from Eq. 4.20 and Lemma 3.1(d) that

$$ \begin{array}{@{}rcl@{}} |J_{c_{n}}(t,x)|\leq LV_{0}(x), \ \ \ \ n\geq 1. \end{array} $$
(4.23)

Since cn(x,a) is decreasing in n ≥ 1, and so is the corresponding value functions \(J_{c_{n}}(t,x)\). Therefore, the limit \(\varphi (t,x):=\lim _{n\to \infty }J_{c_{n}}(t,x)\) exists for each (t,x) ∈ [0,T] × S. Then, as the arguments for Proposition 4.3 with φn(t,x) replaced with \(J_{c_{n}}(t,x)\) here, from Eqs. 4.22 and 4.23 we can see that (a) is also true.

Moreover, by Eq. 4.1 and (c), using Theorem 3.2 we see that f is optimal. □

Remark 4.2

For the given \(\pi _{t}(da|x)\in {{\Pi }_{m}^{r}}\), to show the existence of a \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.14), we modify the operator B in Eq. 4.2 as the following Bπ:

$$ \begin{array}{@{}rcl@{}} B^{\pi}\psi(t,x):=1+{{\int}_{t}^{T}}\left[\delta c(x,\pi_{s})\psi(s,x)+{\int}_{S}\psi(s,y)q(dy|x,\pi_{s})\right]ds \end{array} $$

for all (t,x) ∈ [0,T] × S. Then, a similar argument as in the proof of Proposition 4.3(a) gives the existence of a \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.14).

5 An example

Recall that Assumptions 3.1 and 4.1 above are the generalization of the corresponding ones in Guo and Hernández-Lerma (2009), Guo et al. (2012), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012). Hence, they are satisfied for all the examples and hypotheses in these references. To further illustrate the main results here, we next consider the risk-sensitive optimality problem of case flow in Guo et al. (2015a) for CTMDPs on the mean-variance criteria.

Example 5.1 (Risk-sensitive controlled problems of cash flow)

Consider a continuous-time controlled problem of cash flow in an economic market with an amount of the cash as a state, and thus the corresponding state space is \(S:=(-\infty ,+\infty )\). When the current state of cash flow is at xS, a decision-maker withdraws money with the amount − a (if a < 0) or takes a supply of money with the amount a for a ≥ 0, where a is regarded an action. When the current state is at xS and an action aA(x) is chosen, the two things happen: 1) a cost is incurred at rate c(x,a); and 2) the amount x of cash is assumed to keep invariable for an exponential-distributed random time with parameter λ(x,a) ≥ 0, and then jump to other states with the normal distribution N(x,σ2) for some constant σ > 0. Therefore, the transition rates of cash flow is represented by

$$ \begin{array}{@{}rcl@{}} q(D|x,a):=\lambda(x,a)\left[\frac{1}{\sqrt{2\pi}\sigma}{\int}_{D} e^{-\frac{(y-x)^{2}}{2\sigma^{2}}}dy-\delta_{x}(D)\right] \ \ \ \ \ \text{for \ each } \ D\in \mathcal{B}(S). \end{array} $$
(5.1)

For this cash flow model, the decision maker wishes to minimize the risk-sensitive costs on a given T horizon over all policies.

To ensure the existence of an optimal policy for the cash flow model, we consider the following hypotheses:

  • (A1) λ(x,a) ≤ M(x2 + 1) and \(|c(x,a)|\leq \frac {1}{2\delta T}[M+\ln (x^{2}+1)]\) for all xS,aA(x) with some positive constant M, where the constants δ and T are as before;

  • (A2) A(x) is assumed to be a compact set of a Borel space A for each xS;

  • (A3) λ(x,a) and c(x,a) are Borel measurable on K and continuous in aA(x) for each fixed xS.

Under the above conditions, we have the following fact.

Proposition 5.1

Under the hypotheses A1A3, Example 5.1 satisfies the Assumptions 3.1, 3.2 and 4.1, and hence (by Theorem 4.1) there exists an optimal Markov policy.

To verify the conditions required in Theorem 4.1, let

$$V_{0}(x):=1+x^{2}, V_{1}(x):=1+x^{4} \ \ \ \forall \ x\in S.$$

Since \(\frac {1}{\sqrt {2\pi }\sigma }{\int \limits }_{S} (y-x)^{2k+1}e^{-\frac {(y-x)^{2}}{2\sigma ^{2}}}dy=0\) and \(\frac {1}{\sqrt {2\pi }\sigma }{\int \limits }_{S}(y-x)^{2k}e^{-\frac {(y-x)^{2}}{2\sigma ^{2}}}dy =1\cdot 3{\cdots } (2k-1)\sigma ^{2k}\) for all k = 0,1,…, using (5.1) and hypothesis A1, a directive calculation gives

$$ \begin{array}{@{}rcl@{}} {\int}_{S}V_{0}(y)q(dy|x,a)&=&\lambda(x,a)\left[\frac{1}{\sqrt{2\pi}\sigma}{\int}_{-\infty}^{+\infty} (y^{2}+1)e^{-\frac{(y-x)^{2}}{2\sigma^{2}}}dy-(x^{2}+1)\right]\\ &=&\lambda(x,a) \sigma^{2}\leq M\sigma^{2}V_{0}(x); \end{array} $$
$$ \begin{array}{@{}rcl@{}} {\int}_{S}{V_{1}^{2}}(y)q(dy|x,a)\!&=&\lambda(x,a)\left[\frac{1}{\sqrt{2\pi}\sigma}{\int}_{-\infty}^{+\infty} (y^{4}+1)^{2}e^{-\frac{(y-x)^{2}}{2\sigma^{2}}}dy-(x^{4}+1)^{2}\right]\\ &=&\lambda(x,a)\!\left( \!105\sigma^{8} + 420x^{2}\sigma^{6} + 210x^{4}\sigma^{4}+6\sigma^{4}+12\sigma^{2}x^{2}+28x^{6}\sigma^{2}\right) \\ &\leq&420 \lambda(x,a)\left( \sigma^{8}+\sigma^{6}+\sigma^{4}+\sigma^{2}\right)\left( x^{6}+x^{4}+x^{2}+1\right);\\ &\leq&420 \lambda(x,a)\left( \sigma^{8}+\sigma^{6}+\sigma^{4}+\sigma^{2}\right)\left( 3x^{6}+3\right);\\ &\leq&1260 M\left( \sigma^{8}+\sigma^{6}+\sigma^{4}+\sigma^{2}\right)\left( x^{6}+1\right)(1+x^{2});\\ &\leq&3780M\left( \sigma^{8}+\sigma^{6}+\sigma^{4}+\sigma^{2}\right)\left( x^{4}+1\right)^{2};\\ &=&3780M\left( \sigma^{8}+\sigma^{6}+\sigma^{4}+\sigma^{2}\right){V_{1}^{2}}(x). \end{array} $$

Thus, the hypotheses A1A3 imply the Assumptions 3.1, 3.2 and 4.1, and then Theorem 4.1 gives the existence of an optimal Markov policy.

Remark 5.1

In this example, the cost c(x,a) are allowed to be unbounded from above and below, and the transition rates q(dy|x,a) can be unbounded. Thus, some of the conditions in Ghosh and Saha (2014), Jaskiewicz (2007), Kumar and Chandan (2015), Kumar and Chandan (2013), and Wei (2016) for CTMDPs on the risk-sensitive criteria fails to hold for this example because the transition and cost rates are all bounded in Ghosh and Saha (2014), Jaskiewicz (2007), Kumar and Chandan (2013, 2015) and Wei (2016).