Abstract
This paper considers finite-horizon risk-sensitive optimality for continuous-time Markov decision processes, focusing on the general case in which the transition rates are unbounded, the cost/reward rates may be unbounded from below and from above, the policies can be history-dependent, and the state and action spaces are Borel spaces. Under mild conditions imposed on the decision process's primitive data, we establish the existence of a solution to the corresponding optimality equation (OE) by a so-called approximation technique. Then, using the OE and the extension of Dynkin's formula developed here, we prove the existence of an optimal Markov policy and verify that the value function is the unique solution to the OE. Finally, we give an example to illustrate the difference between our conditions and those in the previous literature.
1 Introduction
Continuous-time Markov decision processes (CTMDPs) are an important class of stochastic optimal control problems and have been widely studied; see, for instance, the monographs (Guo and Hernández-Lerma 2009; Prieto-Rumeau and Hernández-Lerma 2012) and the extensive references therein. In most of the existing literature on CTMDPs, the commonly used optimality criteria are the infinite-horizon expected discounted criterion (Guo 2007; Guo and Hernández-Lerma 2009; Guo and Song 2011; Guo and Piunovskiy 2011; Piunovskiy and Zhang 2011; Prieto-Rumeau and Hernández-Lerma 2012), the long-run expected average criterion (Guo and Hernández-Lerma 2009; Guo et al. 2012; Prieto-Rumeau and Hernández-Lerma 2012; Wei and Chen 2017; Xia 2014), and the finite-horizon expected criterion (Guo et al. 2015b; Yushkevich 1977). All these expected criteria are risk-neutral and cannot reflect a decision-maker's attitude toward risk. Since many decision-makers in real-world applications may be either risk-seeking or risk-averse, risk-sensitive criteria, which incorporate the risk-sensitivity of a decision-maker into the optimality criterion, have been employed and studied in CTMDPs. In this paper, we further study a risk-sensitive criterion in CTMDPs and focus on the finite-horizon case. Thus, we detail neither the main results in the earlier literature on the infinite-horizon risk-sensitive discounted criterion (Ghosh and Saha 2014; Huo et al. 2017; Kumar and Chandan 2013; Zhang 2017) and the long-run risk-sensitive average criterion (Ghosh and Saha 2014; Kumar and Chandan 2015; Kumar and Chandan 2013) for CTMDPs, nor those on risk-sensitive discrete-time Markov decision processes (Anantharam and Borkar 2017; Bäuerle and Rieder 2014; Cavazos-Cadena and Hernández-Hernández 2011; Jaskiewicz 2007; Xia 2018) or on zero-sum risk-sensitive stochastic games (Basu and Ghosh 2014; Bäuerle and Rieder 2017).
Our concern is the finite-horizon risk-sensitive criterion for CTMDPs (Ghosh and Saha 2014; Wei 2016), under which the value function depends on both time and states, whereas the value functions for the infinite-horizon risk-sensitive discounted and average criteria in Ghosh and Saha (2014), Kumar and Chandan (2015), Kumar and Chandan (2013), and Zhang (2017) are independent of time. Therefore, the optimality equation for the finite-horizon case (Ghosh and Saha 2014; Wei 2016) is rather different from that for the infinite-horizon cases (Ghosh and Saha 2014; Kumar and Chandan 2013; Zhang 2017).
To the best of our knowledge, the finite-horizon risk-sensitive criterion for CTMDPs is addressed only in Ghosh and Saha (2014), Huang (2018), and Wei (2016). Precisely, Ghosh and Saha (2014) uses the exponential utility function to characterize the risk-sensitivity of a decision-maker and obtains the existence of optimal Markov policies for the case of denumerable states, uniformly bounded transition rates, bounded cost rates, and the class of Markov policies. Wei (2016) extends the main results in Ghosh and Saha (2014) to the case of unbounded transition rates, establishes the existence of a solution to the corresponding optimality equation (OE) by the approximation technique in Guo et al. (2015b), and gives the corresponding error estimates of the approximations. However, the arguments in Ghosh and Saha (2014) and Wei (2016) require the assumptions of bounded cost rates and denumerable states, and the policies in Ghosh and Saha (2014) and Wei (2016) are independent of histories.
Huang (2018) also studies finite-horizon CTMDPs with Borel state and action spaces and unbounded reward functions and transition rates. However, the optimization criteria in Huang (2018) are very different from ours: there, the mean is first maximized and then, for a fixed mean, the variance is minimized, which is suitable for a risk-averse decision-maker. The optimization criterion in this paper is risk-sensitivity, that is, maximization of the exponential utility function. From Eq. 2.7 in Section 2, we can see that the exponential utility function is related to both the mean and the variance. Since the coefficient of the variance term is positive, our optimization criterion is suitable for a risk-seeking decision-maker.
As indicated above, the finite-horizon risk-sensitive criterion for CTMDPs with unbounded cost/transition rates, Borel state and action spaces, and randomized history-dependent policies has not been studied yet, and it will be considered in this paper. Precisely, we study CTMDPs having the following features: (1) the transition rates can be unbounded; (2) the cost rates are allowed to be unbounded from below and from above (as the cost rates may take positive and negative values, they can be interpreted as reward rates rather than "cost rates" only); (3) the state and action spaces are Borel spaces; (4) the policies can be randomized and history-dependent; and (5) the optimality criterion is finite-horizon risk-sensitive.
First, we give a suitable condition under which, using Jensen's inequality, we establish the finiteness of the finite-horizon risk-sensitive criterion with unbounded cost rates. This condition is new and is satisfied by the bounded costs in Ghosh and Saha (2014) and Wei (2016); see Lemma 3.1.
Second, under conditions slightly weaker than those in Guo and Hernández-Lerma (2009), Guo et al. (2012), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012) on infinite-horizon risk-neutral CTMDPs, we derive an extension of Dynkin's formula, which generalizes the analog of Ito-Dynkin's formula recently developed in Guo et al. (2015b) for the underlying state process {xt,t ≥ 0} induced by the transition rates and randomized history-dependent policies (see Theorem 3.1 below). This result is also a natural extension of the Feynman-Kac formula in Wei (2016) for continuous-time jump Markov processes to the case of continuous-time jump "non-Markov" processes and to a larger class of functions φ(ω,t,x) of samples ω, times t, and states x. On the one hand, since the analog of Ito-Dynkin's formula in Guo et al. (2015b) is designed for functions φ(t,x) of time t and states x, it is not suitable for the more general functions φ(ω,t,x) with an additional sample variable ω, such as the function \(e^{{{\int \limits }_{0}^{t}}{\int \limits }_{A}c(x_{s}(\omega ),a)\pi (da|\omega ,s)ds}\varphi (t,x)\) of (ω,t,x), which needs to be considered for our case of history-dependent policies π(da|ω,s) and cost rates c(x,a) in the risk-sensitive CTMDPs; on the other hand, the Feynman-Kac formula in Wei (2016) is for Markov processes, and thus it is not applicable to the case in which the underlying process {xt} here may not be Markovian. Since our optimality problem concerns the risk-sensitive criterion over the class of history-dependent policies, these two facts motivate our study of the extension of Dynkin's formula.
Third, under suitable conditions as in Ghosh and Saha (2014), Guo and Hernández-Lerma (2009), Guo et al. (2012), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012) on risk-neutral CTMDPs with infinite horizon, we prove the existence and uniqueness of a solution to the OE as well as the existence of an optimal Markov policy for the finite-horizon risk-sensitive CTMDPs in three steps. The first step considers the simple case of bounded transition rates and bounded cost rates, establishes the existence of a solution to the OE by Banach's fixed point theorem, and also proves the existence of an optimal Markov policy; see Proposition 4.2. The second step deals with the case of unbounded transition rates and nonnegative cost rates. By constructing a sequence of models of CTMDPs with bounded transition rates and bounded cost rates, using the results of the first step and some new properties of the value function of the finite-horizon risk-sensitive CTMDPs, we prove that the limit of the sequence of the corresponding value functions is a solution to the OE for the case of unbounded transition rates and nonnegative cost rates; see Proposition 4.3. In the third step, by designing a technique of approximations from nonnegative but unbounded cost rates to the more general case of cost rates unbounded from above and from below, we further show the existence of a solution to the OE for the most general case of unbounded transition and cost rates; see Theorem 4.1. It should be mentioned that our approximation technique here is rather different from the approximation in Guo et al. (2015b) and Wei (2016) for denumerable states.
Fourth, using the existence of a solution to the OE and the extension of Dynkin's formula developed here, we prove the existence of an optimal Markov policy, and also show that the value function is the unique solution to the optimality equation; see Theorem 4.1. All arguments here are direct and self-contained.
Finally, our conditions in this paper are a generalization of those in Ghosh and Saha (2014) and Wei (2016) on the finite-horizon risk-sensitive CTMDPs. To further illustrate our main results and the difference between the conditions in this paper and those in Ghosh and Saha (2014) and Wei (2016), we present an applied example in which our conditions are satisfied, the transition and cost rates are both unbounded, and the state and action spaces are non-denumerable, while some of the conditions in Ghosh and Saha (2014) and Wei (2016) fail to hold.
The rest of the paper is organized as follows. In Section 2 we introduce the optimality problem for the risk-sensitive CTMDPs. The main results are presented in Section 4 after giving technical preliminaries in Section 3, and are illustrated with an example in Section 5.
2 The optimal control problems
Notation
For any Borel space X endowed with the Borel σ-algebra \({\mathscr{B}}(X)\), we will denote by Ec := X ∖ E the complement of a subset E of X, by IE the indicator function on any set E, by δz(dx) the Dirac measure at point z ∈ X (i.e., δz(D) = ID(z) for all \(D\in {\mathscr{B}}(X))\), by \(\mathbb {B}_{1}(X)\) the set of all bounded Borel measurable functions φ on X with the norm \(\|\varphi \|:=\sup _{x\in X}|\varphi (x)|\), and by \(\mathcal {{U}}(X)\) the universal σ-algebra on X, that is, \(\mathcal { U}(X):=\cap _{p\in P(X)}{\mathscr{B}}_{p}(X),\) where P(X) represents the set of all probability measures on \({\mathscr{B}}(X)\), and \({\mathscr{B}}_{p}(X)\) is the completion of \({\mathscr{B}}(X)\) with respect to p ∈ P(X). To discern the “measurability”, we will say “Borel measurable” or “universally measurable” in the following.
The model of CTMDPs is a five-tuple

$$ \{S, \ A, \ (A(x), x\in S), \ q(dy|x,a), \ c(x,a)\} \ \ \ \ \ \ (2.1) $$
consisting of the following elements:
- (a)
a Borel space S, called the state space, whose elements are referred to as states of a system.
- (b)
a Borel space A, called the action space, whose elements are referred to as actions (or decisions) of a decision-maker (or controller);
- (c)
a family {A(x),x ∈ S} of nonempty subsets A(x) of A, where A(x) denotes the set of actions available to a controller when the system is in state x ∈ S, and it is assumed to be Borel-measurable, that is, \(A(x)\in {\mathscr{B}}(A)\) for every x ∈ S;
- (d)
a Borel measurable function c(x,a) on K, called the cost rates, where K := {(x,a)|x ∈ S,a ∈ A(x)} denotes the set of all feasible state-action pairs and is assumed in \({\mathscr{B}}(S\times A);\) (As c(x,a) is allowed to take positive and negative values, it can be interpreted as rewards rather than “costs” only.)
- (e)
transition rates q(dy|x,a), a universally measurable signed kernel on S given K. That is, q(⋅|x,a) is countably additive and satisfies q(D|x,a) ≥ 0 for all \(D\in {\mathscr{B}}(S)\) with (x,a) ∈ K and x∉D; it is conservative in the sense that q(S|x,a) ≡ 0, and stable in the sense that
$$ \begin{array}{@{}rcl@{}} q^{*}(x):=\sup\limits_{a\in A(x)}q(x,a)<\infty \ \ \ \ \forall \ x\in S, \end{array} $$(2.2)where q(x,a) := −q({x}|x,a) ≥ 0 for all (x,a) ∈ K.
Next, we give an informal description of the evolution of CTMDPs with model (2.1).
Roughly speaking, CTMDPs evolve as follows: A controller observes states of a system continuously in time. If the system is at state xt at time t, he/she chooses an action at ∈ A(xt) according to a given policy, as a consequence of which, the following happen:
- (i)
An immediate cost takes place at the rate c(xt,at).
- (ii)
After a random sojourn time (i.e., the holding time at state xt), the system jumps to a set D (xt∉D) of states with the transition probability \(\frac {q(D|x_{t},a_{t})}{q(x_{t},a_{t})}\) determined by the transition rates q(dy|xt,at). The distribution function of the sojourn time is \((1-e^{-{{\int \limits }_{t}^{t+x}}q(x_{s},a_{s})ds})\).
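The two-step evolution above (an exponential sojourn with total rate q(xt,at), followed by a jump drawn from the normalized kernel q(dy|xt,at)/q(xt,at)) can be sketched as a small simulation. The following Python sketch uses a hypothetical birth-death model with births at rate λ⋅a and deaths at rate μ⋅x; the model and all parameter values are illustrative assumptions, not part of the paper's data:

```python
import random

def simulate_path(x0, policy, T, lam=1.0, mu=1.0, rng=None):
    """Simulate one sample path of a toy birth-death CTMDP up to horizon T.

    From state x under action a, the process jumps to x + 1 at rate lam * a
    and to x - 1 at rate mu * x; the sojourn time at x is exponential with
    the total rate q(x, a), and the post-jump state is drawn from the
    normalized kernel q(dy | x, a) / q(x, a).
    """
    rng = rng or random.Random()
    t, x = 0.0, x0
    jumps = [(0.0, x0)]            # (jump moment T_k, post-jump state X_k)
    while True:
        a = policy(t, x)           # a Markov policy: action depends on (t, x)
        up, down = lam * a, mu * x
        q_total = up + down        # q(x, a) = -q({x} | x, a)
        if q_total == 0.0:         # no feasible jumps: state is absorbing
            break
        theta = rng.expovariate(q_total)   # exponential sojourn time at x
        if t + theta > T:
            break                  # next jump falls beyond the horizon
        t += theta
        x = x + 1 if rng.random() < up / q_total else x - 1
        jumps.append((t, x))
    return jumps

path = simulate_path(2, lambda t, x: 1, 5.0, rng=random.Random(0))
```

Each returned pair corresponds to a jump moment Tk and the post-jump state Xk of the construction below; between jumps the path is constant, matching the piecewise-constant form of {xt}.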
To formalize what is described above, below we describe the construction of CTMDPs under possibly randomized history-dependent policies.
To construct the underlying CTMDPs, we introduce some notation: Let \({\Omega }_{0} := (S\times (0,\infty ))^{\infty }\), \({\Omega }_{k} := (S\times (0,\infty ))^{k}\times S\times (\{\infty \}\times \{{\Delta }\})^{\infty }\) for k ≥ 1 and some Δ∉S, \({\Omega }:=\cup _{k=0}^{\infty }{\Omega }_{k}\). \(\mathcal {F}\) is the Borel σ-algebra on the Borel space Ω. Then we obtain the measurable space \(({\Omega },\mathcal {F})\). For some k ≥ 1, and sample \(\omega :=(x_{0},\theta _{1},x_{1},\ldots ,\theta _{k},x_{k},\infty ,{\Delta }) \in {\Omega }\), define

$$ T_{0}(\omega):=0, \ \ \ T_{k}(\omega):=\theta_{1}+\theta_{2}+\cdots+\theta_{k}, \ \ \ T_{\infty}(\omega):=\lim\limits_{k\to\infty}T_{k}(\omega). \ \ \ \ \ \ (2.3) $$
In what follows, the argument ω is always omitted except in some special informational statements. Then, we define the state process {xt(ω),t ≥ 0} on \(({\Omega },\mathcal {F})\) by

$$ x_{t}(\omega):=\sum\limits_{k\geq 0}I_{\{T_{k}\leq t<T_{k+1}\}}x_{k}+I_{\{t\geq T_{\infty}\}}{\Delta}, \ \ \ \ t\geq 0. \ \ \ \ \ \ (2.4) $$
Obviously, xt(ω) is right-continuous on \([0,\infty )\). We denote \(x_{t-}(\omega ):=\lim _{s\to t-}x_{s}(\omega )\). Here we have used the conventions that 0 ⋅ z := 0 and 0 + z := z for all \(z\in S_{\Delta }:=S\cup \{{\Delta }\}\).
For each fixed ω := (x0,𝜃1,x1,…,𝜃k,xk,…) ∈Ω, from Eq. 2.4, we see that Tk(ω) (k ≥ 1) denotes the k-th jump moment of {xt,t ≥ 0}, Xk−1(ω) = xk−1 is the state of the process on [Tk−1(ω),Tk(ω)), 𝜃k = Tk(ω) − Tk−1(ω) plays the role of the sojourn time at state xk−1, and the sample path {xt(ω),t ≥ 0} takes at most denumerably many values xk (k = 0,1,…). We do not intend to consider the controlled process {xt,t ≥ 0} after the moment \(T_{\infty }\), and thus view it as absorbed in the cemetery state Δ. Hence, we write AΔ := A ∪{aΔ}, A(Δ) := {aΔ}, q(⋅|Δ,aΔ) :≡ 0, c(Δ,aΔ) :≡ 0, where aΔ is an isolated point.
To precisely define the optimality criterion, we need to introduce the concept of a policy in Guo et al. (2012), Guo and Piunovskiy (2011), Kitaev and Rykov (1995), and Piunovskiy and Zhang (2011). To do so, we recall some notation. Take the right-continuous family of σ-algebras \(\{\mathcal { F}_{t}\}_{t\ge 0}\) with \(\mathcal { F}_{t}:=\sigma (\{T_{k}\leq s, X_{k}\in D\}: D\in \mathcal {{B}}(S),s\leq t,k\geq 0)\). As in Guo and Song (2011), Guo and Piunovskiy (2011), Kitaev and Rykov (1995), and Piunovskiy and Zhang (2011), let \(\mathcal P\) be the σ-algebra of predictable sets on \({\Omega }\times [0,\infty )\) related to \(\{\mathcal { F}_{t}\}_{t\ge 0}\), that is, \(\mathcal {{P}}:=\sigma (B\times [0,\infty ), C\times (s,\infty ): B\in \mathcal {{F}}_{0}, C\in \mathcal {{F}}_{s-}, s>0)\) with \(\mathcal {{F}}_{s-}:=\bigvee _{t<s}\mathcal {{F}}_{t}:=\sigma (\mathcal {{F}}_{t}, t<s)\). A real-valued function on \({\Omega }\times [0,\infty )\) is called predictable if it is measurable with respect to \(\mathcal P\).
Definition 2.1
A transition probability π(da|ω,t) from \(({\Omega }\times [0,\infty ),\mathcal { P} )\) onto \((A_{\Delta }, \mathcal {{B}}(A_{\Delta }))\) such that π(A(xt−(ω))|ω,t) ≡ 1 is called a (randomized history-dependent) policy. A policy π(da|ω,t) is called randomized Markovian if π(da|ω,t) ≡ π(da|xt−(ω),t). We will denote such a Markov policy by πt(da|⋅) to emphasize its dependence on t. A randomized Markov policy πt(da|⋅) is called deterministic Markovian whenever there exists an A-valued universally measurable function f(t,x) on \([0,\infty )\times S\) such that πt(da|x) is the Dirac measure at the point f(t,x) ∈ A(x) for all t ≥ 0 and x ∈ S. Such a Markov policy will be denoted by f for simplicity.
We denote by Π the set of all randomized history-dependent policies, by \({{\Pi }_{m}^{r}}\) the set of all randomized Markov policies, and by \({{\Pi }_{m}^{d}}\) the set of all (deterministic) Markov policies.
For any initial distribution γ on S and policy π ∈ Π, as shown in Guo et al. (2012), Guo and Song (2011), Guo and Piunovskiy (2011), Kitaev and Rykov (1995), and Piunovskiy and Zhang (2011), but using the extension of the Ionescu Tulcea theorem (e.g., Proposition 7.45 in Bertsekas and Shreve 1996), we see that there exists a unique probability measure \(\mathbb {P}_{\gamma }^{\pi }\) (depending on γ and π) on \(({\Omega },\mathcal {F})\). Let \(\mathbb {E}_{\gamma }^{\pi }\) be the corresponding expectation operator. In particular, \(\mathbb {E}_{\gamma }^{\pi }\) and \(\mathbb {P}_{\gamma }^{\pi }\) will be written as \(\mathbb {E}_{x}^{\pi }\) and \(\mathbb {P}_{x}^{\pi }\), respectively, when γ is the Dirac measure at a state x ∈ S.
Fix any finite horizon T > 0. For each policy π ∈ Π and state x ∈ S, we define the T-horizon risk-sensitive criterion J(π,0,x) of CTMDPs by

$$ J(\pi,0,x):=\mathbb{E}_{x}^{\pi}\left[e^{\delta {{\int\limits}_{0}^{T}}{\int\limits}_{A}c(x_{t},a)\pi(da|\omega,t)dt}\right], \ \ \ \ \ \ (2.5) $$

provided that the integral is well defined, where δ is a constant called the risk-sensitive parameter. In the following arguments, we assume that δ > 0. For the other case of δ < 0, the corresponding arguments are similar, and thus omitted.
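For intuition, the criterion above can be estimated by Monte Carlo: simulate paths under a fixed policy, integrate the piecewise-constant cost along each path to obtain C := ∫0T c(xt,at)dt, and average eδC. The sketch below does this for a hypothetical birth-death model with cost rate c(x,a) = x (all parameter values are our illustrative assumptions); the final check is the finite-sample form of Jensen's inequality, ln J ≥ δV, discussed at the end of this section:

```python
import math
import random

def sample_cost_integral(x0, T, lam=1.0, mu=1.0, rng=None):
    """One sample of C := int_0^T c(x_t, a_t) dt for a toy birth-death CTMDP
    under the stationary policy a == 1, with cost rate c(x, a) = x."""
    rng = rng or random.Random()
    t, x, cost = 0.0, x0, 0.0
    while True:
        up, down = lam, mu * x
        q_total = up + down              # total jump rate q(x, a) >= lam > 0
        theta = rng.expovariate(q_total)
        if t + theta >= T:
            return cost + x * (T - t)    # last piece of the cost integral
        cost += x * theta                # cost rate is constant between jumps
        t += theta
        x = x + 1 if rng.random() < up / q_total else x - 1

rng = random.Random(0)
delta, T, x0 = 0.5, 2.0, 1
samples = [sample_cost_integral(x0, T, rng=rng) for _ in range(5000)]

# Monte Carlo estimates of J(pi,0,x) = E[e^{delta*C}] and V(pi,0,x) = E[C].
J_hat = sum(math.exp(delta * c) for c in samples) / len(samples)
V_hat = sum(samples) / len(samples)
print(J_hat >= 1.0 and math.log(J_hat) >= delta * V_hat)  # True (Jensen)
```

The last line holds for any sample: the log-mean-exp of δC dominates the mean of δC, which is exactly the empirical version of ln J(π,0,x) ≥ δV(π,0,x).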
Note that the process {xt,t ≥ 0} on \(({\Omega },\mathcal {F}, \mathbb {P}^{\pi }_{\gamma })\) may not be Markovian, since the policy π can depend on histories (x0,𝜃1,x1,…,𝜃k,xk). However, for each \(\pi :=\pi _{t}(da|\cdot ) \in {{\Pi }^{r}_{m}}\), it is well known (e.g., Feinberg et al. 2014) that {xt,t ≥ 0} is a Markov process on \(({\Omega },\mathcal {F}, \mathbb {P}^{\pi }_{\gamma })\), and thus, for each x ∈ S and t ∈ [0,T], the expression

$$ J(\pi,t,x):=\mathbb{E}_{\gamma}^{\pi}\left[e^{\delta {{\int\limits}_{t}^{T}}{\int\limits}_{A}c(x_{v},a)\pi_{v}(da|x_{v})dv}\Big{|}x_{t}=x\right] \ \ \ \ \ \ (2.6) $$

is well defined (when the integral exists), and it is called the risk-sensitive value of π from the horizon t to T.
For each x ∈ S, let

$$ J^{*}(t,x):=\inf\limits_{\pi\in{{\Pi}_{m}^{r}}}J(\pi,t,x), \ \ \ \ t\in[0,T]. $$
The function J∗(t,x) on [0,T] × S is called the value function of the CTMDPs with the T-horizon risk-sensitive criterion.
Definition 2.2
A policy π∗ ∈ Π is said to be optimal if J(π∗,0,x) ≤ J(π,0,x) for all π ∈ Π and x ∈ S.
The main goal of this paper is to give conditions for the existence of optimal Markov policies.
To this end, we give some remarks on the difference between the T-horizon risk-sensitive criterion J(π,0,x) and the risk-neutral one in Guo et al. (2015b) and Yushkevich (1977), which is defined by

$$ V(\pi,0,x):=\mathbb{E}_{x}^{\pi}\left[{{\int\limits}_{0}^{T}}{\int\limits}_{A}c(x_{t},a)\pi(da|\omega,t)dt\right], $$

where c(x,a) is assumed to be bounded and positive on K. Then, by Jensen's inequality and the monotonicity of the \(\log \) function, we have \(\ln J(\pi ,0,x)\geq \delta V(\pi ,0,x)\). Moreover, writing \(Y:={{\int \limits }_{0}^{T}}{\int \limits }_{A}c(x_{t},a)\pi (da|\omega ,t)dt\) and applying Taylor's expansion of eδY at the point \(z:=\mathbb {E}_{x}^{\pi }Y\), we have

$$ e^{\delta Y}=e^{\delta z}\left[1+\delta(Y-z)+\frac{1}{2}\delta^{2}(Y-z)^{2}+\xi\right], $$

where the remainder ξ is a random variable on \(({\Omega },\mathcal {F}, \mathbb {P}^{\pi }_{\gamma })\). Therefore, taking expectations, we have

$$ \ln J(\pi,0,x)=\delta V(\pi,0,x)+\ln\left[1+\frac{1}{2}\delta^{2}var(Y)+\mathbb{E}_{x}^{\pi}\xi\right], \ \ \ \ \ \ (2.7) $$

where \(var(Y):=\mathbb {E}_{x}^{\pi }(Y-\mathbb {E}_{x}^{\pi }Y)^{2}\) denotes the variance of Y, and the corresponding term \(\ln [1+\frac {1}{2}\delta ^{2}var(Y)]\) is called a risk premium. Thus, the difference between \(\ln J(\pi ,0,x)\) and δV(π,0,x) is the sum of the risk premium and another term of smaller order. Therefore, the risk-sensitive criterion is more risk-sensitive than the risk-neutral one because of its inclusion of the risk premium.
From Eq. 2.7 we see that \(\lim _{\delta \to 0}\frac {\ln J(\pi ,0,x)}{\delta }=V(\pi ,0,x)\). On the other hand, for any fixed δ > 0, since \(\inf _{\pi \in {\Pi }}\ln J(\pi ,0,x)=\ln J(\pi ^{*},0,x)\) if and only if \(\inf _{\pi \in {\Pi }}J(\pi ,0,x)=J(\pi ^{*},0,x)\), we will consider J(π,0,x) instead of \(\ln J(\pi ,0,x)\).
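The expansion in Eq. 2.7 and the limit \(\lim _{\delta \to 0}\frac {\ln J(\pi ,0,x)}{\delta }=V(\pi ,0,x)\) can be illustrated numerically with a hypothetical two-point total cost Y (taking the values ±1 with probability 1/2), for which all quantities are exact and no simulation is needed:

```python
import math

# Hypothetical two-point "total cost" Y = +1 or -1 with probability 1/2,
# so E[Y] = 0 (the risk-neutral value V) and var(Y) = 1.
# Then J = E[e^{delta * Y}] = cosh(delta).
def ln_J(delta):
    return math.log(0.5 * math.exp(delta) + 0.5 * math.exp(-delta))

mean_Y, var_Y = 0.0, 1.0

# Second-order expansion (2.7): ln J ~ delta*E[Y] + ln(1 + delta^2/2 * var(Y)).
for delta in (0.5, 0.1, 0.01):
    approx = delta * mean_Y + math.log(1.0 + 0.5 * delta ** 2 * var_Y)
    print(delta, ln_J(delta), approx)   # the two columns agree as delta -> 0

# As delta -> 0, ln J / delta -> E[Y] = V, the risk-neutral criterion.
print(ln_J(1e-6) / 1e-6)
```

For δ = 0.1 the exact value ln cosh(0.1) ≈ 0.0049917 and the approximation ln(1.005) ≈ 0.0049875 already agree to about 4 ⋅ 10−6, illustrating that the gap between ln J and δV is the risk premium plus a term of smaller order.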
3 Preliminaries
This section provides some preliminary facts for our arguments below.
Since the rates q(dy|x,a) and costs c(x,a) are allowed to be unbounded, we next give conditions for the non-explosion of {xt,t ≥ 0} and finiteness of J(π,t,x).
Assumption 3.1
There exist a real-valued Borel measurable function V0 ≥ 1 on S and constants ρ0, b0 ≥ 0, \(M_{0}^{\prime }, M_{0}\geq 1\), such that
- (i): \({\int \limits }_{S}V_{0}(y)q(dy|x,a)\leq \rho _{0} V_{0}(x)+b_{0}\) for all (x,a) ∈ K;
- (ii): \(q^{*}(x)\leq M_{0}^{\prime }V_{0}(x)\) for all x ∈ S, where q∗(x) is as in Eq. 2.2;
- (iii): e2Tδ|c(x,a)| ≤ M0V0(x) for all (x,a) ∈ K, with T and δ as in Eq. 2.5.
Remark 3.1
- (a) Assumptions 3.1(i,ii) are used and verified with examples in Guo and Hernández-Lerma (2009), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012) for risk-neutral CTMDPs. When the transition rates are bounded (i.e., \(\|q^{*}\|<\infty \)) (Ghosh and Saha 2014; Kitaev and Rykov 1995; Kumar and Chandan 2013, 2015; Yushkevich 1977), Assumptions 3.1(i,ii) are satisfied by taking V0(x) ≡ 1.
- (b) Assumption 3.1(iii) is for the finiteness of the value function J∗(t,x); it is new and is satisfied when c(x,a) is bounded (i.e., \(\|c\|:=\sup _{(x,a)\in K}|c(x,a)|<\infty \)) (Ghosh and Saha 2014; Kumar and Chandan 2013, 2015; Wei 2016; Yushkevich 1977). Moreover, if \(\delta |c(x,a)|\leq \sqrt {\ln V_{0}(x)}+L\) for all (x,a) ∈ K and some constant L ≥ 0, then Assumption 3.1(iii) holds. Indeed, since \(\delta |c(x,a)|\leq \sqrt {\ln V_{0}(x)}+L\leq \frac {t}{2}+\frac {\ln V_{0}(x)}{2t}+L\) for all (x,a) ∈ K and t > 0, choosing t = T we have \(e^{2T\delta |c(x,a)|} \leq e^{T^{2}+\ln V_{0}(x)+2TL}=e^{T^{2}+2TL} V_{0}(x)\), which implies Assumption 3.1(iii) with \(M_{0}:=e^{T^{2}+2TL}\).
- (c) If the number ρ0 in Assumption 3.1(i) is not positive, then Assumption 3.1(i) still holds when ρ0 is replaced with the positive number 1 + |ρ0|. Thus, it is merely for convenience that we assume ρ0 > 0 throughout the following. However, the corresponding number is assumed to be negative in Guo and Hernández-Lerma (2009), Guo et al. (2012), and Prieto-Rumeau and Hernández-Lerma (2012), or less than the discount factor in Guo and Hernández-Lerma (2009) and Guo and Piunovskiy (2011).
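The sufficient condition in Remark 3.1(b) can be checked numerically at the extreme case δ|c(x,a)| = \(\sqrt {\ln V_{0}(x)}+L\); the parameter values below are arbitrary illustrative choices:

```python
import math

# Remark 3.1(b): if delta*|c(x,a)| <= sqrt(ln V0(x)) + L, then
# e^{2*T*delta*|c(x,a)|} <= e^{T^2 + 2*T*L} * V0(x), i.e. Assumption 3.1(iii)
# holds with M0 := e^{T^2 + 2*T*L}.  We test the extreme case
# delta*|c| = sqrt(ln V0) + L over a grid of values V0 >= 1.
T, L = 2.0, 0.5
M0 = math.exp(T ** 2 + 2 * T * L)

ok = True
for v0 in (1.0, 2.0, 10.0, 1e3, 1e6):
    delta_c = math.sqrt(math.log(v0)) + L      # largest delta*|c| allowed
    ok = ok and math.exp(2 * T * delta_c) <= M0 * v0 * (1 + 1e-12)
print(ok)  # True, since 2*T*sqrt(ln V0) <= T^2 + ln V0 for every V0 >= 1
```

The check succeeds because \(2T\sqrt {\ln V_{0}}\leq T^{2}+\ln V_{0}\) is exactly \((T-\sqrt {\ln V_{0}})^{2}\geq 0\), which is the inequality used in the remark with t = T.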
Lemma 3.1
Under Assumptions 3.1(i,ii), for each π ∈ Π, the following assertions hold.
- (a): \(\mathbb {P}_{x}^{\pi }(x_{t}\in S)=1\), \(\mathbb {P}_{x}^{\pi }(T_{\infty }=\infty )=1\), and \( \mathbb {P}_{x}^{\pi }(x_{0}=x)=1\) for each t ≥ 0 and x ∈ S;
- (b): \(\mathbb {E}_{x}^{\pi }[V_{0}(x_{t})] \leq e^{\rho _{0}t}[V_{0}(x)+\frac {b_{0}}{\rho _{0}}]\) for each t ≥ 0, x ∈ S and π ∈ Π;
- (c): \(\mathbb {E}_{\gamma }^{\pi }[V_{0}(x_{t})|x_{s}=x] \leq e^{\rho _{0}(t-s)}[V_{0}(x)+\frac {b_{0}}{\rho _{0}}]\) for each t ≥ s ≥ 0, x ∈ S and \(\pi \in {{\Pi }_{m}^{r}}\);
- (d): If, in addition, Assumption 3.1(iii) is satisfied, then
- (d1): \( e^{-LV_{0}(x)}\leq J(\pi ,0,x)\leq LV_{0}(x) \) for x ∈ S and π ∈ Π, where \(L:=M_{0}e^{\rho _{0}T}[1+\frac {b_{0}}{\rho _{0}}]\);
- (d2): \(e^{-LV_{0}(x)}\leq J(\pi ,t,x)\leq LV_{0}(x)\) for (t,x) ∈ [0,T] × S and \({\pi \in {\Pi }_{m}^{r}}\).
Proof
Parts (a) and (b) follow from Guo et al. (2012) and Guo and Piunovskiy (2011) (or Piunovskiy and Zhang 2011), while part (c) follows from Theorem 3.1 in Guo (2007). We next prove part (d). For \(\mathbb {P}_{x}^{\pi }\)-almost all (a.s.) ω := (x0,𝜃1,x1,…,𝜃k,xk,…) ∈Ω, let k (depending on ω) be determined by Tk(ω) ≤ T < Tk+1(ω). Then, it follows from (a) and Eq. 2.4 that k is finite and {xt(ω),t ∈ [0,T]} = {X0(ω),…,Xk(ω)}. Thus, since \(|c(x,a)|\leq \frac {1}{T\delta }\ln \sqrt {M_{0}V_{0}(x)}\) (by Assumption 3.1(iii)) and 𝜃i+1 = Ti+1(ω) − Ti(ω) (i = 0,…,k − 1), we have
which implies that \({{\int \limits }_{0}^{s}} {\int \limits }_{A} c(x_{t},a)\pi (da|\omega ,t)dt\) is real-valued Borel measurable in s ∈ [0,T] (for the given ω). Thus, by the Jensen inequality with respect to the probability measure \(\frac {dt}{T}\) on \(\mathcal {{B}}([0,T])\), we have
Therefore (by (b)),
Hence, by Eq. 2.5, we have
On the other hand, since Tδ|c(x,a)|≤ e2Tδ|c(x,a)|≤ M0V0(x), we have
which, together with Eq. 3.1, implies (d1). Similarly, we see that (d2) is also true. □
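The bound in Lemma 3.1(b) can be sanity-checked by simulation. For a hypothetical birth-death model with V0(x) = x + 1, births at rate λ, and deaths at rate μx (toy data of our own choosing), we have \({\int \limits }_{S}V_{0}(y)q(dy|x,a)=\lambda -\mu x\leq \rho _{0}V_{0}(x)+b_{0}\) with ρ0 = 1 and b0 = λ, so Assumption 3.1(i) holds and the bound in (b) applies:

```python
import math
import random

lam, mu, rho0, b0 = 1.0, 1.0, 1.0, 1.0
x0, t_end = 0, 1.0
rng = random.Random(0)

def V0_at_t():
    """Simulate the birth-death chain up to t_end and return V0(x_{t_end})."""
    t, x = 0.0, x0
    while True:
        up, down = lam, mu * x
        q_total = up + down
        theta = rng.expovariate(q_total)     # q_total >= lam > 0 here
        if t + theta >= t_end:
            return x + 1.0                   # V0(x) = x + 1
        t += theta
        x = x + 1 if rng.random() < up / q_total else x - 1

n = 3000
mc = sum(V0_at_t() for _ in range(n)) / n    # Monte Carlo E_x[V0(x_t)]
bound = math.exp(rho0 * t_end) * (x0 + 1.0 + b0 / rho0)
print(mc <= bound)  # True: the sample mean stays well below the bound
```

With these parameters the bound equals e ⋅ 2 ≈ 5.44, while the sample mean stays near 1.6; the Lyapunov bound of Lemma 3.1(b) is not tight, but it is exactly what is needed for the finiteness arguments above.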
Lemma 3.1 gives conditions for the finiteness of J(π,t,x) as well as the non-explosion of {xt,t ≥ 0}. In order to deal with the optimality for history-dependent policies, we need the following facts, which extend both the analog of Ito-Dynkin's formula in Guo et al. (2015b) and the Feynman-Kac formula in Wei (2016) to the more general case of possibly non-Markov processes {xt,t ≥ 0} and functions φ(ω,t,x) with the additional argument ω ∈Ω.
Denote by mL the Lebesgue measure on [0,T], and by \(\mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\) the set of real-valued and \(\mathcal {{P}}\times \mathcal {{B}}(S)\)-measurable functions φ(ω,t,x) with the following feature: given any x ∈ S, π ∈ Π, and a.s. ω ∈Ω with respect to \(\mathbb {P}_{x}^{\pi }\), there exists a Borel subset E(φ,ω,x,π) (depending on φ,ω,x,π) of [0,T] such that the partial derivative \(\varphi ^{\prime }(\omega ,t,x)\) (with respect to t) exists for every t ∈ E(φ,ω,x,π) and \(m_{L}(E_{(\varphi ,\omega ,x,\pi )}^{c})=0\). Obviously, if a function φ(ω,t,x) in \(\mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\) is independent of ω (written as φ(t,x)), then so is the corresponding E(φ,ω,x,π), which will be denoted by E(φ,x).
For any \(\varphi (\omega , t,x)\in \mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\), when the partial derivative does not exist at some (ω,t,x) ∈Ω× [0,T]×S, for simplicity of the arguments we take \(\varphi ^{\prime }(\omega ,t,x)\) to be any real number, so that \(\varphi ^{\prime }\) is defined on Ω × [0,T] × S. We will see that such a modification of \(\varphi ^{\prime }\) loses nothing. For example, \(\mathbb {E}_{x}^{\pi } \left [{\int \limits }_{0}^{T}|\varphi ^{\prime }(\omega ,t,x_{t})|dt\right ]\) and \({{\int \limits }_{0}^{T}}|\varphi ^{\prime }(\omega ,t,x_{t})|dt\) are well defined for each (x,π) ∈ S × Π and a.s. ω ∈Ω with respect to \(\mathbb {P}_{x}^{\pi }\), respectively. Indeed, for each x ∈ S, π ∈ Π, and s ∈ [0,T], Lemma 3.1(a) together with Eq. 2.4 gives the existence of a Borel measurable subset \(\hat {\Omega }\) of Ω such that: 1) \(\mathbb {P}_{x}^{\pi }(\hat {\Omega }^{c})=0\), and 2) for each given \(\omega \in \hat {\Omega }\), xt(ω) = xi for Ti(ω) ≤ t < Ti+1(ω) (0 ≤ i ≤ k) for some finite k, where k and xi ∈ S depend on ω, and k is determined by Tk(ω) ≤ T < Tk+1(ω). Let \(\tilde i:={\min \limits } \{l: s<T_{l} \}\) (depending on ω and s). Then, since \(m_{L}(\cup _{i=0}^{k} E^{c}_{(\varphi ,\omega ,x_{i},\pi )})=0\), for each \(\omega \in \hat {\Omega }\),
is well defined. Therefore (by \(\mathbb {P}_{x}^{\pi }(\hat {\Omega }^{c})=0\)), \(\mathbb {E}_{x}^{\pi } \left [{\int \limits }_{0}^{T}|\varphi ^{\prime }(\omega ,t,x_{t})|dt\right ]\) is also well defined.
Lemma 3.2
Under Assumption 3.1, the following assertions hold.
- (a): For any given π ∈ Π and x ∈ S, if a function \(\varphi \in \mathbb {B}_{\mathcal { P}}({\Omega }\times [0,T]\times S)\) satisfies
$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T}{\int}_{A}{\int}_{S}|\varphi(\omega,t,y)| |q|(dy|x_{t},a)\pi(da|\omega,t)dt\right]\\&&+\mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}|\varphi^{\prime}(\omega,t,x_{t})|dt\right]<\infty, \end{array} $$(3.2)where \(|q|(dy|x,a):=q(dy\setminus \{x\}|x,a)-q(\{x\}|x,a)\delta _{x}(dy)\) for all (x,a) ∈ K, then
$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T} \left( \varphi^{\prime}(\omega,t,x_{t})+{\int}_{A}{\int}_{S}\varphi(\omega,t,y)q(dy|x_{t},a)\pi(da|\omega,t)\right)dt \right]\\ &=&\mathbb{E}_{x}^{\pi}\varphi(\omega,T,x_{T})-\mathbb{E}_{x}^{\pi}\varphi(\omega,0,x). \end{array} $$ - (b):
For any \(\pi =\pi _{t}(da|\cdot ){\in {\Pi }_{m}^{r}}\) and any Borel measurable function \(\phi \in \mathbb {B}_{\mathcal { P}}([0,T]\times S)\) such that the corresponding function \(\varphi (\omega ,t,x):=e^{{{\int \limits }_{s}^{t}} \delta c(x_{v}(\omega ),\pi _{v})dv}\phi (t,x)\) satisfies (3.2), we have
$$ \mathbb{E}_{\gamma}^{\pi}\!\left[{\int}_{s}^{T} \!\left( \!\left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,x_{t})\!\right)'+{\int}_{S}e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,y)q(dy|x_{t},\pi_{t})\!\right)dt|x_{s}=x\right]$$$${}=\mathbb{E}_{\gamma}^{\pi}\left[e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}\phi(T,x_{T})|x_{s}=x\right]-\phi(s,x),\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ $$where \(c(x,\pi _{t}):={\int \limits }_{A(x)}c(x,a)\pi _{t}(da|x)\) and \(q(dy|x,\pi _{t}):={\int \limits }_{A(x)} q(dy|x,a)\pi _{t}(da|x)\).
- (c): If the functions c(x,a) and ϕ(t,x) in (b) above are both bounded and \(|\phi ^{\prime }(t,x)|\leq CV_{0}(x)\) for all (t,x) ∈ [0,T] × S with some constant C > 0, then
$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!&&\mathbb{E}_{\gamma}^{\pi}\left[{\int}_{s}^{T} \left( \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,x_{t})\right)'+{\int}_{S}e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\phi(t,y)q(dy|x_{t},\pi_{t})\right)dt|x_{s}=x\right]\\ \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!&=& \left\{\begin{array}{lll} \mathbb{E}_{x}^{\pi}\left( e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}\phi(T,x_{T})\right)-\phi(0,x), \ \ \ \text{for } \ s=0, \pi\in {\Pi} \\ \mathbb{E}_{\gamma}^{\pi}\left[e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}\phi(T,x_{T})|x_{s}=x\right]-\phi(s,x), \ \text{for } \ s\in [0,T],\pi\in {{\Pi}_{m}^{r}}. \end{array}\right. \end{array} $$
Proof
- (a)
Lemma 3.1(a) together with Eq. 2.4 gives the existence of a Borel measurable subset \(\hat {\Omega }\) of Ω such that \(\mathbb {P}_{x}^{\pi }(\hat {\Omega }^{c})=0\). For each \(\omega \in \hat {\Omega }\) and 0 ≤ t ≤ T, recalling definitions (2.3) and (2.4), denote \(k_{t}(\omega ):=\max \limits \{k|T_{k}(\omega )\le t\}\). Since xt(ω) is right-continuous on \([0,\infty )\), we have \(x_{t}=x_{T_{k_{t}}}\) when \(T_{k_{t}}\le t<T_{k_{t}+1}\). Since \({{\int \limits }_{s}^{T}} \varphi ^{\prime }(\omega ,t,x_{t})dt\) is well defined (as shown above), for 0 ≤ s ≤ T and \(\omega \in \hat {\Omega }\), we have
$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\varphi(\omega,T,x_{T}) = \varphi(\omega,s,x_{s}) + {{\int}_{s}^{T}} \varphi^{\prime}(\omega,t,x_{t})dt + \sum\limits_{n=k_{s}+1}^{k_{T}}{\int}_{(s,T]}{\Delta} \varphi(\omega,t,x_{t})\delta_{T_{n} }(dt) \ \ \ \mathbb{P}_{x}^{\pi} - a.s. \end{array} $$(3.3)Denote Δφ(ω,t,xt) := φ(ω,t,xt(ω)) − φ(ω,t,xt−(ω)), and recall that \(x_{t-}(\omega )=\lim _{s\to t-}x_{s}(\omega )\). Equation 3.3 is proved as follows.
$$ \begin{array}{@{}rcl@{}} &&\varphi(\omega,T,x_{T})\\ &=&\varphi(\omega,s,x_{s})+\varphi(\omega,T_{k_{s}+1},x_{T_{k_{s}+1}-})-\varphi(\omega,s,x_{s})\\ &&+\sum\limits_{n=k_{s}+1}^{k_{T}}\left[\varphi(\omega,T_{n},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}-})\right]+\sum\limits_{n=k_{s}+1}^{k_{T}-1}\left[ \varphi(\omega,T_{n+1},x_{T_{n+1}-})-\varphi(\omega,T_{n},x_{T_{n}})\right]\\ &&+\varphi(\omega,T,x_{T-})-\varphi(\omega,T_{k_{T}},x_{T_{k_{T}}})+\varphi(\omega,T,x_{T})-\varphi(\omega,T,x_{T-})\\ &=&\varphi(\omega,s,x_{s})+\varphi(\omega,T_{k_{s}+1},x_{s})-\varphi(\omega,s,x_{s})\\ &&+\sum\limits_{n=k_{s}+1}^{k_{T}}\left[\varphi(\omega,T_{n},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}-})\right]+\sum\limits_{n=k_{s}+1}^{k_{T}-1}\left[ \varphi(\omega,T_{n+1},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}})\right]\\ &&+\varphi(\omega,T,x_{T_{k_{T}}})-\varphi(\omega,T_{k_{T}},x_{T_{k_{T}}})+\varphi(\omega,T,x_{T})-\varphi(\omega,T,x_{T-})\\ &=&\varphi(\omega,s,x_{s})+\varphi(\omega,T_{k_{s}+1},x_{s})-\varphi(\omega,s,x_{s})+\sum\limits_{n=k_{s}+1}^{k_{T}-1}\left[ \varphi(\omega,T_{n+1},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}})\right]\\ &&+\varphi(\omega,T,x_{T_{k_{T}}})-\varphi(\omega,T_{k_{T}},x_{T_{k_{T}}})\\ &&+\sum\limits_{n=k_{s}+1}^{k_{T}}\left[\varphi(\omega,T_{n},x_{T_{n}})-\varphi(\omega,T_{n},x_{T_{n}-})\right]+\varphi(\omega,T,x_{T})-\varphi(\omega,T,x_{T-})\\ &=&\varphi(\omega,s,x_{s})+{\int}_{[s,T_{k_{s}+1})}\varphi^{\prime}(\omega,t,x_{s})dt+\sum\limits_{n=k_{s}+1}^{k_{T}-1}{\int}_{[T_{n},T_{n+1})} \varphi^{\prime}(\omega,t,x_{T_{n}})dt\\ &&+{\int}_{[T_{k_{T}},T)}\varphi^{\prime}(\omega,t,x_{T_{k_{T}}})dt+\sum\limits_{n=k_{s}+1}^{k_{T}}{\Delta}\varphi(\omega,T_{n},x_{T_{n}})+{\Delta}\varphi(\omega,T,x_{T})\\ &=&\varphi(\omega,s,x_{s})+{\int}_{[s,T_{k_{s}+1})}\varphi^{\prime}(\omega,t,x_{t})dt+\sum\limits_{n=k_{s}+1}^{k_{T}-1}{\int}_{[T_{n},T_{n+1})} \varphi^{\prime}(\omega,t,x_{t})dt\\ 
&&+{\int}_{[T_{k_{T}},T)}\varphi^{\prime}(\omega,t,x_{t})dt+\sum\limits_{n=k_{s}+1}^{k_{T}}{\int}_{(s,T]}{\Delta}\varphi(\omega,t,x_{t})\delta_{T_{n}}(dt)\\ &=&\varphi(\omega,s,x_{s})+{{\int}_{s}^{T}} \varphi^{\prime}(\omega,t,x_{t})dt+\sum\limits_{n=k_{s}+1}^{k_{T}}{\int}_{(s,T]}{\Delta} \varphi(\omega,t,x_{t})\delta_{T_{n} }(dt) \ \ \ \mathbb{P}_{x}^{\pi}-a.s. \end{array} $$
Then, Lemma 4.28 in Kitaev and Rykov (1995) shows that the random measure mπ defined by
is the dual predictable projection of the random measure \({\sum }_{n\ge 1}\delta _{(T_{n},X_{n-1})}(dt,dx)\) on \({\mathscr{B}}((0,\infty )\times S)\) under \(\mathbb {P}_{x}^{\pi }\). Thus, by the definition of a dual predictable projection in (4.5) in Kitaev and Rykov (1995), we have
which, together with Eq. 3.4 and the expectation of both sides of Eq. 3.3 with s = 0, gives
Here, integrability results such as Eq. 3.2 validate all the involved operations. Moreover, since xt−(ω) = xt(ω) on (0,T] except at finitely many time points t, part (a) follows.
- (b)
Replacing φ(ω,t,xt) with \(e^{{{\int \limits }_{s}^{t}}{\int \limits }_{A}\delta c(x_{v},a)\pi _{v}(da|x_{v})dv}\phi (t,x_{t})\), taking the conditional expectation \(\mathbb {E}_{\gamma }^{\pi }[\cdot |x_{s}=x]\) on both sides of Eq. 3.3 and using the Markov property of {xt,t ≥ 0}, as in the proof of (a), we see that (b) is also true.
- (c)
Under the condition in (c), the function φ(ω,t,x) in (b) is bounded on Ω × [0,T] × S, and thus the first part of Eq. 3.2 is finite (by Lemma 3.1(b) and Assumptions 3.1(i,ii)). Moreover, since
$$ \begin{array}{@{}rcl@{}} \left( e^{{{\int}_{s}^{t}}{\int}_{A}|\delta c(x_{v},a)|\pi_{v}(da|x_{v})dv}\phi(t,x_{t})\right)'&=&\delta |c(x_{t},\pi_{t})|e^{{{\int}_{s}^{t}}{\int}_{A}|\delta c(x_{v},a)|\pi_{v}(da|x_{v})dv}\phi(t,x_{t})\\ && \ \ +e^{{{\int}_{s}^{t}}{\int}_{A}|\delta c(x_{v},a)|\pi_{v}(da|x_{v})dv}\phi^{\prime}(t,x_{t})\\ &\leq & \delta\|c\|e^{T\delta\|c\|}\|\phi\|+Ce^{T\delta\|c\|}V_{0}(x_{t}) \end{array} $$which, together with Lemma 3.1(b), implies that the second part of Eq. 3.2 is finite. Thus, (c) follows.
□
Next we derive an extension of Dynkin’s formula by Lemma 3.2. To do so, we introduce the following condition and notation.
Assumption 3.2
There exist a Borel measurable function V1 ≥ 1 on S, and positive constants ρ1,b1, and M1, such that
- (i):
-
\({\int \limits }_{S}{V_{1}^{2}}(y)q(dy|x,a)\leq \rho _{1}{V_{1}^{2}}(x)+b_{1}\) for all (x,a) ∈ K;
- (ii):
-
\({V_{0}^{2}}(x)\leq M_{1}V_{1}(x)\) for all x ∈ S, with V0(x) as in Assumption 3.1(i).
Assumption 3.2 is used to specify a domain for the extension of Dynkin’s formula below, and it is obviously satisfied when the transition rates are bounded (Ghosh and Saha 2014; Kumar and Chandan 2013, 2015; Yushkevich 1977).
Given the Vk (k = 0,1) as in Assumption 3.2 and any Borel set Z, a real-valued function φ on Z × S is called Vk-bounded if the Vk-weighted norm of φ, \(\|\varphi \|_{V_{k}}:=\sup _{(z,x)\in Z\times S}\frac {|\varphi (z,x)|}{V_{k}(x)}\), is finite. We denote by \(\mathbb {B}_{V_{k}}(Z\times S)\) the Banach space of all Vk-bounded functions on Z × S. When Vk(x) ≡ 1, \(\mathbb {B}_{1}(Z\times S)\) is the space of all bounded functions. In particular, taking Z = Ω × [0,T] or Z = [0,T], we define
and then
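As a toy illustration (not part of the paper), the Vk-weighted sup norm defined above can be evaluated on a finite grid. All data below (the grid Z, the weight V, and the function phi) are hypothetical and chosen only to make the definition concrete.

```python
# Toy sketch of the V-weighted sup norm  ||phi||_V = sup_{(z,x)} |phi(z,x)| / V(x)
# evaluated on a finite grid; V, phi, Z, and S below are hypothetical.

def weighted_norm(phi, V, Z, S):
    """Sup over the grid Z x S of |phi(z, x)| / V(x)."""
    return max(abs(phi(z, x)) / V(x) for z in Z for x in S)

V = lambda x: 1.0 + x * x          # a weight function V >= 1
phi = lambda z, x: z * x           # a V-bounded function on [0, 1] x S
Z = [i / 10 for i in range(11)]    # time grid for Z = [0, 1]
S = [0, 1, 2]                      # a finite "state space"

norm = weighted_norm(phi, V, Z, S)  # attained at z = 1, x = 1: 1/2 = 0.5
```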
Theorem 3.1 below extends Theorem 3.1 in Guo et al. (2015b) and Theorem 3.1 in Wei (2016) from functions of the form φ(t,x) to the more general form φ(ω,t,x) with the additional variable ω; this extension is needed for the arguments below involving history-dependent policies. The proof of Theorem 3.1 is based on Lemma 3.2, and thus differs from the proofs in Guo et al. (2015b) and Wei (2016), which treat Markov policies only.
Theorem 3.1
Suppose Assumptions 3.1 and 3.2 are satisfied. Then, for each (s,x) ∈ [0,T] × S, the following assertions hold.
- (a):
-
(The extension of Dynkin’s formula): For every π ∈ Π and \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}({\Omega }\times [0,T]\times S)\),
$$ \begin{array}{@{}rcl@{}} && \mathbb{E}_{x}^{\pi}\left[{\int}_{0}^{T} \left( \varphi^{\prime}(\omega,t,x_{t})+{\int}_{A}{\int}_{S}\varphi(\omega,t,y)q(dy|x_{t},a)\pi(da|\omega,t)\right)dt \right] \\&& \ \ =\mathbb{E}_{x}^{\pi}\varphi(\omega,T,x_{T})-\mathbb{E}_{x}^{\pi}\varphi(\omega,0,x), \end{array} $$where {xt,t ≥ 0} may not be Markovian, since the policy π may depend on histories.
- (b):
-
(The Dynkin’s formula): For each \({\pi \in {\Pi }_{m}^{r}}\), and \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\),
$$ \qquad \mathbb{E}_{\gamma}^{\pi}\left[{\int}_{s}^{T} \left( \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x_{t})\right)'+{\int}_{S}e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,y)q(dy|x_{t},\pi_{t})\right)dt|x_{s}=x\right]$$$${}=\mathbb{E}_{\gamma}^{\pi}\left[e^{{{\int}_{s}^{T}} \delta c(x_{t},\pi_{t})dt}\varphi(T,x_{T})|x_{s}=x\right]-\varphi(s,x).\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ $$
Proof
-
(a)
By the definition of \(\mathbb {B}_{V_{0},V_{1}}^{1}({\Omega }\times [0,T]\times S)\) and \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}({\Omega }\times [0,T]\times S)\), we have
$$ \begin{array}{@{}rcl@{}} |\varphi(\omega,t,x)|\leq \|\varphi\|_{V_{0}}V_{0}(x), \ |\varphi^{\prime}(\omega,t,x)|\leq \|\varphi^{\prime}\|_{V_{1}}V_{1}(x) \ \ \ \forall \ (\omega,t,x)\in {\Omega}\times [0,T]\times S. \end{array} $$(3.5)
Thus, by Assumptions 3.1(i,ii) and 3.2 we have, for (ω,t,z) ∈Ω× [0,T] × S,
$$ \begin{array}{@{}rcl@{}} {\int}_{A}{\int}_{S}|\varphi(\omega,t,y)||q|(dy|z,a)\pi(da|\omega,t) &\leq& \|\varphi\|_{V_{0}}\left[{\int}_{A}{\int}_{S}V_{0}(y)|q|(dy|z,a)\pi(da|\omega,t)\right] \\ &\leq& \|\varphi\|_{V_{0}}[\rho_{0}V_{0}(z)+b_{0}+2M_{0}^{\prime}{V_{0}^{2}}(z)] \\ &\leq & \|\varphi\|_{V_{0}}[\rho_{0}M_{1}+b_{0}+2M_{0}^{\prime}M_{1}]V_{1}(z). \end{array} $$(3.6)Moreover, by Eq. 3.5 we have
$$ \begin{array}{@{}rcl@{}} {{\int}_{0}^{T}}|\varphi^{\prime}(\omega,t,x_{t})|dt\leq\|\varphi^{\prime}\|_{V_{1}}{{\int}_{0}^{T}}V_{1}(x_{t})dt\leq\|\varphi^{\prime}\|_{V_{1}}{{\int}_{0}^{T}}{V_{1}^{2}}(x_{t})dt, \end{array} $$which, together with Lemma 3.1(b), gives
$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}|\varphi^{\prime}(\omega,t,x_{t})|dt\right] &\leq& \|\varphi^{\prime}\|_{V_{1}}\mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}{V_{1}^{2}}(x_{t})dt\right]\\&\leq&\|\varphi^{\prime}\|_{V_{1}}Te^{\rho_{1}T}\left[\frac{{V_{1}^{2}}(x)}{T\rho_{1}}+\frac{b_{1}}{\rho_{1}}\right]<\infty. \end{array} $$(3.7)Thus, by Eqs. 3.6–3.7 and Lemma 3.1(b) we have
$$ \begin{array}{@{}rcl@{}} && \mathbb{E}_{x}^{\pi} \left[{\int}_{0}^{T}{\int}_{A}{\int}_{S}|\varphi(\omega,t,y)||q|(dy|x_{t},a)\pi(da|\omega,t)dt\right] \\ &\leq & T\|\varphi\|_{V_{0}}[\rho_{0}M_{1}+b_{0}+2M_{0}^{\prime}M_{1}]e^{\rho_{1}T}\left[\frac{{V_{1}^{2}}(x)}{T\rho_{1}}+\frac{b_{1}}{\rho_{1}}\right], \ \ \ \ \ \end{array} $$which, together with Eq. 3.7 and Lemma 3.2(a), implies (a).
-
(b)
For any fixed x ∈ S, a.e. t ∈ [0,T] (depending on ω), and 0 ≤ s ≤ t, since
$$ \begin{array}{@{}rcl@{}} \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x)\right)'=\delta c(x_{t},\pi_{t})e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x)+e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi^{\prime}(t,x), \end{array} $$by \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) and Tδ|c(x,a)|≤ M0V0(x) (using Assumption 3.1(iii)) we have
$$ \begin{array}{@{}rcl@{}} \left|\left( e^{{{\int}_{s}^{t}}\delta c(x_{v},\pi_{v})dv}\varphi(t,x)\right)'\right|&\leq & \frac{M_{0}}{T}\|\varphi\|_{V_{0}}V_{0}(x)e^{{{\int}_{s}^{t}} |\delta c(x_{v},\pi_{v})|dv}V_{0}(x) \\ && +\|\varphi^{\prime}\|_{V_{1}}e^{{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}V_{1}(x), \\ &\leq & \left( \frac{\|\varphi\|_{V_{0}}M_{0}M_{1}}{T}+\|\varphi^{\prime}\|_{V_{1}}\right)e^{{{\int}_{s}^{t}}|\delta c(x_{v},\pi_{v})|dv}V_{1}(x). \end{array} $$which, together with the same arguments for Eq. 3.3, gives
$$ \begin{array}{@{}rcl@{}} &&\left|\left( e^{{{\int}_{s}^{t}}\delta c(x_{v}(\omega),\pi_{v})dv}\varphi(t,x_{t}(\omega))\right)'\right| \\&\leq& \left( \frac{\|\varphi\|_{V_{0}}M_{0}M_{1}}{T}+\|\varphi^{\prime}\|_{V_{1}}\right)e^{{{\int}_{s}^{t}}|\delta c(x_{v}(\omega),\pi_{v})|dv}V_{1}(x_{t}(\omega)) \end{array} $$(3.8)for a.s. ω ∈Ω with respect to \(\mathbb {P}_{x}^{\pi }\), and a.e. t ∈ [0,T] (depending on the given ω).
On the other hand, by the Hölder inequality we have
Furthermore, by the arguments similar to the proof of Eq. 3.1, we also have
which, together with Lemma 3.1(b) and Eq. 3.9, implies
Also, by Assumptions 3.1 and 3.2, we have
Thus, by Eqs. 3.8–3.12 we have
which, together with Lemma 3.2, verifies (b). □
Theorem 3.2
Under Assumptions 3.1 and 3.2, the following assertions hold.
- (a):
-
If there exists \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) such that
$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\inf_{a\in A(x)}\left[\delta c(x,a)\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,a)\right]=0, \\ \varphi(T,x)\equiv 1, \end{array}\right. \end{array} $$(3.13)for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\), then
- (a1):
-
J(π,0,x) ≥ φ(0,x), for all π ∈ Π and x ∈ S, and
- (a2):
-
J(π,t,x) ≥ φ(t,x), for all \(\pi \in {{\Pi }_{m}^{r}}\) and (t,x) ∈ [0,T] × S.
- (b):
-
For any randomized Markov policy \(\pi \in {{\Pi }_{m}^{r}}\), J(π,t,x) is the unique solution in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) of the following equation
$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\delta c(x,\pi_{t})\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,\pi_{t})=0 \\ \varphi(T,x)=1 \end{array}\right. \end{array} $$(3.14)for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).
Proof
-
(a)
For each ω ∈Ω, under the conditions for (a) we have
$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\varphi^{\prime}\!(t,x) + {\int}_{A}\delta c(x\!,a)\pi(da|\omega,t)\varphi(t,x) + {\int}_{S}\!{\int}_{A}\varphi(t,y)q(dy|x,a)\pi(da|\omega,t)\!\geq\! 0 \end{array} $$(3.15)for all x ∈ S and t ∈ E(φ,x).
On the other hand, for a.s. ω ∈Ω (with respect to \(\mathbb {P}_{x}^{\pi }\)), since the path {xt(ω),t ∈ [0,T]} takes only the finitely many values {x0,…,xk} (xi ∈ S,0 ≤ i ≤ k) for some finite k (depending on ω) determined by Tk(ω) ≤ T < Tk+ 1(ω), and \(m_{L}(E_{(\varphi ,x_{i})}^{c})=0\), by Eq. 3.15 we have
for a.s. ω ∈Ω and a.e. t ∈ [0,T] (because of \(m_{L}(\cup _{i=0}^{k} E_{(\varphi ,x_{i})}^{c})=0\)).
Let \(\bar c(\omega ,t,x,\pi ):={\int \limits }_{A}\delta c(x_{t},a)\pi (da|\omega ,t)\). Then, we have, for any 0 ≤ s ≤ T,
The proof of (a1): For each π ∈ Π and x ∈ S, since φ(T,y) = 1 for all y ∈ S, by Theorem 3.1(a) and Eq. 3.16 we have
and so
which implies (a1).
Similarly, by Theorem 3.1(b) we see that (a2) is also true.
- (b)
If there exists a \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.14), by (a2) and φ(T,x) ≡ 1, we have
$$\varphi(s,x)=\mathbb{E}_{\gamma}^{\pi} \left[e^{{{\int}_{s}^{T}}{\int}_{A}\delta c(x_{v},a)\pi_{v}(da|x_{v})dv}|x_{s}=x\right]=J(\pi,s,x), \ \ s\in [0,T],$$and so (b) follows. Thus, to complete the proof of (b), it suffices to show the existence of \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.14), the proof of which is rather long and is therefore postponed to Section 4; see Remark 4.2 below.
□
To establish the existence of \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.13), we need Lemma 3.3 below. To state it, we recall some concepts. A subset of a Borel space X is analytic (by Proposition 7.41 in Bertsekas and Shreve 1996) if it is a projection into X of a Borel subset of X × Y for some uncountable Borel space Y. Then, a function u(⋅) on X is called upper semianalytic if {x ∈ X : u(x) > r} is an analytic set for each \(r\in (-\infty ,\infty )\). It is known that each Borel-measurable function is upper semianalytic; see more details in Chapter 7 of Bertsekas and Shreve (1996). Hence, each Borel-measurable function, such as c(x,a), is upper semianalytic on K.
Lemma 3.3
Suppose that Assumption 3.1 holds. For any \(u(t,x)\in \mathbb {B}_{V_{0}}([0,T]\times S)\), define a corresponding function \(u_{*}(t,x): \ [0,T]\times S\longrightarrow (-\infty ,\infty )\) by
Then, the following assertions hold.
- (a):
-
The function u∗(t,x) is upper semianalytic (and hence universally measurable).
- (b):
-
For every ε > 0, there exists a deterministic Markov policy \(f_{\varepsilon }\in {{\Pi }_{m}^{d}}\) (depending on ε) such that
$$ \begin{array}{@{}rcl@{}} &&\delta c(x,f_{\varepsilon}(t,x))u(t,x)+{\int}_{S}u(t,y)q(dy|x,f_{\varepsilon}(t,x))\\&\leq& u_{*}(t,x)+\varepsilon \ \ \forall \ (t,x)\in [0,\infty)\times S. \end{array} $$
Proof
See Lemma 3.3 in Guo et al. (2015b) or (Bertsekas and Shreve 1996, Propositions 7.47 and 7.50). □
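For a finite action set, the ε-optimal selector of Lemma 3.3(b) can be realized by direct enumeration. The following sketch uses a hypothetical finite model (the sets S and A, the rates q, the costs c, the constant delta, and the function u are all invented for illustration); for a finite A the infimum is attained, so the selector is in fact exactly optimal.

```python
# Sketch of an epsilon-optimal selector as in Lemma 3.3(b) for a hypothetical
# finite model: f_eps(t, x) picks an action whose bracketed value is within
# eps of u_*(t, x) = inf_{a in A(x)} [ delta*c(x,a)*u(t,x) + sum_y u(t,y)*q(y|x,a) ].

delta = 0.5
S, A = [0, 1], [0, 1, 2]

# Hypothetical conservative transition rates: q[(x, a)][y], each row sums to 0.
q = {}
for a in A:
    q[(0, a)] = (-1.0 - 0.1 * a, 1.0 + 0.1 * a)
    q[(1, a)] = (0.5 + 0.2 * a, -0.5 - 0.2 * a)
# Hypothetical cost rates.
c = {(x, a): 1.0 + x - 0.3 * a for x in S for a in A}

def bracket(t, x, a, u):
    """The expression delta*c(x,a)*u(t,x) + sum_y u(t,y)*q(y|x,a)."""
    return delta * c[(x, a)] * u(t, x) + sum(u(t, y) * q[(x, a)][y] for y in S)

def u_star(t, x, u):
    """The infimum over the (finite) action set, as in Lemma 3.3."""
    return min(bracket(t, x, a, u) for a in A)

def f_eps(t, x, u, eps):
    """Return an action within eps of the infimum (Lemma 3.3(b))."""
    target = u_star(t, x, u) + eps
    return next(a for a in A if bracket(t, x, a, u) <= target)

u = lambda t, x: 1.0 + t * (x + 1)   # a hypothetical function in B_{V0}
a0 = f_eps(0.5, 0, u, 1e-6)
```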
4 The existence of optimal Markov policies
In this section, we prove the existence of an optimal Markov policy and of a solution to the following optimality equation (4.1) for finite-horizon CTMDPs with the risk-sensitive criterion. The proofs proceed in three steps: 1) consider the case of bounded transition and cost rates, 2) deal with the case of unbounded transition rates but nonnegative costs, and 3) study the case of unbounded transition and cost rates.
Suppose that f(x) is defined on X. Denote \(\|f\|:=\sup _{x\in X}|f(x)|\). The following results are for the case of bounded transition and cost rates.
Proposition 4.1
If ∥q∥ and ∥c∥ are finite, then the following assertions hold.
- (a):
-
There exists a unique φ in \(\mathbb {B}^{1}_{1,1}([0,T]\times S)\) (that is, V0(x) = V1(x) ≡ 1,x ∈ S) satisfying the following optimality equation for the risk-sensitive criterion of CTMDPs on the finite horizon:
$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\inf_{a\in A(x)}[\delta c(x,a)\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,a)]=0, & \\ \varphi(T,x)=1, \end{array}\right. \end{array} $$(4.1)for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).
- (b):
-
\(\varphi (t,x)=J_{*}(t,x)=\inf _{f\in {{\Pi }_{m}^{d}} }J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).
- (c):
-
For any ε > 0, there exists \(f_{\varepsilon }\in {{\Pi }_{m}^{d}}\) such that J(fε,t,x) ≤ φ(t,x) + ε for all (t,x) ∈ [0,T] × S.
Proof
-
(a)
Define the following operator B on \(\mathbb {B}_{1}([0,T]\times S)\) (actually the space of all bounded functions) by
$$ \begin{array}{@{}rcl@{}} B\psi(t,x):=1+{{\int}_{t}^{T}} \inf_{a\in A(x)}\left[\delta c(x,a)\psi(s,x)+{\int}_{S}\psi(s,y)q(dy|x,a)\right]ds \end{array} $$(4.2)for any (t,x) ∈ [0,T] × S and \(\psi \in \mathbb {B}_{1}([0,T]\times S)\).
Then, for each (t,x) ∈ [0,T] × S, and any \(\psi _{1},\psi _{2}\in \mathbb {B}_{1}([0,T]\times S)\), from Eq. 4.2 and q({x}|x,a) + q(S ∖{x}|x,a) ≡ 0 we obtain
where \(\tilde {L}:=\delta \|c\|+2\|q\|<\infty \). Furthermore, by induction we can prove the following fact:
Since \({\sum }_{n=1}^{\infty } \tilde {L}^{n} \frac {T^{n}}{n!}\|\psi _{1}-\psi _{2}\|<\infty \), there exists some integer k such that the constant \(\beta := \tilde {L}^{k} \frac {T^{k}}{k!}<1\). Thus, by Eq. 4.3 we have ∥Bkψ1 − Bkψ2∥≤ β∥ψ1 − ψ2∥. Therefore, B is a k-step contraction operator, and hence there exists a function \(\varphi \in \mathbb {B}_{1}([0,T]\times S)\) such that Bφ = φ, that is
Since ∥q∥ and ∥c∥ are finite, by Eq. 4.4 we see that \(\varphi \in \mathbb {B}^{1}_{1,1}([0,T]\times S)\), and thus (a) follows.
- (b-c)
We are going to prove (b) and (c) together. Since \(\|q\|<\infty \) and \(\|c\|<\infty \), Assumptions 3.1 and 3.2 are satisfied by taking V0 = V1 ≡ 1. Thus, it follows from Eq. 4.1 and Theorem 3.2(a) that
$$ \begin{array}{@{}rcl@{}} J(\pi,t,x)\geq \varphi(t,x) \ \ \ \text{for \ all } \ \pi\in {{\Pi}_{m}^{r}}. \end{array} $$(4.5)Moreover, for any ε > 0, since \(\varphi \in \mathbb {B}_{1}([0,T]\times S)\), Lemma 3.3 together with Eq. 4.1 gives the existence of \(f_{\varepsilon }\in {{\Pi }_{m}^{d}}\), such that, for each x ∈ S and t ∈ E(φ,x),
$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\left\{\begin{array}{lll} \varphi^{\prime}(t,x)+\delta c(x,f_{\varepsilon}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f_{\varepsilon}(t,x))\leq \frac{\varepsilon}{T}e^{-T\delta\|c\|}, \\ \varphi(T,x)=1. \end{array}\right. \end{array} $$(4.6)Then, as the arguments for Eq. 3.16, by |c(x,a)|≤∥c∥ and Eq. 4.6 we have
$$ \begin{array}{@{}rcl@{}} \left( e^{{{\int}_{s}^{t}}\delta c(x_{v},f_{\varepsilon}(v,x_{v}))dv}\varphi(t,x_{t})\right)' + {\int}_{S}q(dy|x_{t},f_{\varepsilon}(t,x_{t}))\!\left( e^{{{\int}_{s}^{t}}\delta c(x_{v},f_{\varepsilon}(v,x_{v}))dv}\varphi(t,y)\right)\!\leq\! \frac{\varepsilon}{T} \end{array} $$for every 0 ≤ s ≤ t. Thus, by Theorem 3.1(a) we have
$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{x}^{f_{\varepsilon}} \left( e^{{{\int}_{s}^{T}}\delta c(x_{v},f_{\varepsilon}(v,x_{v}))dv}\right)-\varphi(s,x)\\&=&\mathbb{E}_{x}^{f_{\varepsilon}} \left( e^{{{\int}_{s}^{T}}\delta c(x_{v},f_{\varepsilon}(v,x_{v}))dv}\varphi(T,x_{T})\right)-\varphi(s,x)\leq \varepsilon \end{array} $$and so
$$ \begin{array}{@{}rcl@{}} J(f_{\varepsilon},s,x)\leq \varphi(s,x)+\varepsilon \ \ \ \ \text{for \ all} \ (s,x)\in [0,T]\times S. \end{array} $$(4.7)Therefore, since ε can be arbitrary, by Eqs. 4.5 and 4.7 we have
$$ \begin{array}{@{}rcl@{}} \inf_{\pi\in {{\Pi}_{m}^{r}}} J(\pi,t,x)=\varphi(t,x)=\inf_{f\in {{\Pi}_{m}^{d}}} J(f,t,x), \ \text{ and} \ \ J(f_{\varepsilon},t,x)\leq \varphi(t,x)+\varepsilon, \end{array} $$for all (t,x) ∈ [0,T] × S, which, together with Eq. 2.6, gives (b) and (c).
□
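To make the fixed-point argument in the proof of Proposition 4.1(a) concrete, the following sketch iterates a left-Riemann discretization of the operator B from Eq. 4.2 on a small finite model. All model data (the state space S, action set A, rates q, costs c, and the constants delta and T) are invented for illustration; since B is a k-step contraction, repeated application converges to the fixed point φ = Bφ.

```python
# Sketch: value iteration with (a time-discretized version of) the operator
#   B psi(t, x) = 1 + int_t^T inf_a [ delta*c(x,a)*psi(s,x) + sum_y psi(s,y)*q(y|x,a) ] ds
# on a hypothetical two-state, two-action model with bounded rates.

T, delta, N = 1.0, 1.0, 200
h = T / N
S, A = [0, 1], [0, 1]

# Hypothetical bounded transition rates q[x][a] = (q(0|x,a), q(1|x,a)), rows sum to 0.
q = {0: {0: (-1.0, 1.0), 1: (-2.0, 2.0)},
     1: {0: (2.0, -2.0), 1: (0.5, -0.5)}}
# Hypothetical bounded, nonnegative cost rates c(x, a).
c = {0: {0: 1.0, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}

def apply_B(psi):
    """One application of a left-Riemann discretization of B."""
    g = [[min(delta * c[x][a] * psi[i][x]
              + sum(psi[i][y] * q[x][a][y] for y in S)
              for a in A)
          for x in S] for i in range(N + 1)]
    return [[1.0 + h * sum(g[j][x] for j in range(i, N)) for x in S]
            for i in range(N + 1)]

psi = [[1.0 for _ in S] for _ in range(N + 1)]   # start from psi == 1
for _ in range(60):                              # B is a k-step contraction
    new = apply_B(psi)
    residual = max(abs(new[i][x] - psi[i][x]) for i in range(N + 1) for x in S)
    psi = new
    if residual < 1e-12:
        break
# Terminal condition psi(T, x) = 1 holds exactly by construction; since the
# costs here are positive, psi(t, x) > 1 for t < T, mirroring J = E[e^{delta int c}] > 1.
```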
Proposition 4.1 shows the existence of a solution to the optimality equation for the bounded transition and cost rates. To further establish the existence of an optimal Markov policy, we need some conditions below.
Assumption 4.1
- (i):
-
For each x ∈ S, A(x) is compact;
- (ii):
-
For each x ∈ S and \(D\in \mathcal {{B}}(S)\), the function q(D|x,a) is continuous in a ∈ A(x);
- (iii):
-
For each x ∈ S, the functions c(x,a) and \({\int \limits }_{S}V_{0}(y)q(dy|x,a)\) are continuous in a ∈ A(x), with V0 as in Assumption 3.1.
Remark 4.1
Assumption 4.1 is used to ensure the existence of minimizers in the optimality equation (4.1).
By Lemma 8.3.7(a) in Hernández-Lerma and Lasserre (1999), under Assumption 4.1 we have the following lemma.
Lemma 4.1
Under Assumptions 4.1(ii,iii), the function \({\int \limits }_{S}q(dy|x,a)u(t,y)\) is continuous in a ∈ A(x), for every fixed (t,x) ∈ [0,T] × S and \(u\in \mathbb {B}_{V_{0}}([0,T]\times S)\).
Proposition 4.2
Under Assumption 4.1, if \(\|q\|<\infty \) and \(\|c\|<\infty \), that is, the transition and cost rates are bounded, then the following assertions hold.
- (a):
-
There exists a unique φ(t,x) in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying the optimality equation (4.1).
- (b):
-
\(\varphi (t,x)=J_{*}(t,x)=\inf _{f\in {{\Pi }_{m}^{d}}} J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).
- (c):
-
There exists a deterministic Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that
$$ \begin{array}{@{}rcl@{}} \varphi^{\prime}(t,x)+ \delta c(x,f^{*}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f^{*}(t,x))=0 \end{array} $$for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).
- (d):
-
If, in addition, c(x,a) ≥ 0 for all (x,a) ∈ K, then J∗(t,x) (and also φ(t,x)) is decreasing in t ∈ [0,T] for each fixed x ∈ S.
Proof
We only need to prove (c) and (d) since (a) and (b) follow from Proposition 4.1. For the function φ(t,x) from (a), when \(\varphi ^{\prime }(t,x)\) does not exist for some (t,x), we define
which, together with Eq. 4.1, implies that Eq. 4.8 holds for each (t,x) ∈ [0,T] × S. Hence, Proposition 7.50 in Bertsekas and Shreve (1996) together with Lemma 4.1 ensures the existence of a Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that
Then, as in the proofs of (b) and (c) in Proposition 4.1, we see that (c) is true.
(d) Fix any s,t ∈ [0,T] with s < t. Then, for any Markov policy \(f\in {{\Pi }_{m}^{d}}\), we define the corresponding Markov policy \({f_{s}^{t}}\) as follows: for each x ∈ S,
Then, we have, for each (v,x) ∈ [s,s + T − t] × S,
Let
By the Markov property of {xt,t ≥ 0} under any Markov policy f and Eqs. 4.9–4.10, we have \(J(f,t\sim T,x)= J({f_{s}^{t}},s\sim T+s-t,x)\), and thus \(J_{*}(t\sim T,x)\geq J_{*}(s\sim T+s-t,x)\). Similarly, we can prove that \(J_{*}(s\sim T+s-t,x)\geq J_{*}(t\sim T,x)\). Thus, we have \(J_{*}(t\sim T,x)=J_{*}(s\sim T+s-t,x)\). Moreover, since c(x,a) ≥ 0 on K, by Eq. 4.10 and t > s, we have \(J_{*}(t\sim T,x)=J_{*}(s\sim T+s-t,x)\leq J_{*}(s\sim T,x)\), which, together with \(J_{*}(t\sim T,x)=J_{*}(t,x)\), gives (d). □
Proposition 4.2 shows the existence of an optimal policy under bounded transition and cost rates. We next extend the results in Proposition 4.2 to the case of unbounded transition rates and nonnegative costs by approximating with models having bounded transition and cost rates.
Proposition 4.3
Under Assumptions 3.1, 3.2 and 4.1, if in addition c(x,a) ≥ 0 for all (x,a) ∈ K, then the following assertions hold.
- (a):
-
There exists a unique φ(t,x) in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying the optimality equation (4.1).
- (b):
-
\(\varphi (t,x)=J_{*}(t,x)=\inf _{f{\in {\Pi }_{m}^{d}}} J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).
- (c):
-
There exists a Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that
$$ \begin{array}{@{}rcl@{}} \varphi^{\prime}(t,x)+ \delta c(x,f^{*}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f^{*}(t,x))=0 \end{array} $$for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\).
Proof
We only prove (a), since (b) and (c) can be proved by the same arguments as for (b) and (c) of Proposition 4.2.
The main thread of our proof is as follows.
First, we construct a sequence of models \( \mathcal {{M}}_{n}^{+}\) in which the transition rates and cost rates are all bounded. By Proposition 4.2, there exists a decreasing (with respect to t) function \(\varphi _{n}(t,x)\in \mathbb {B}_{1,1}^{1}([0,T]\times S)\), which is the value function of \( \mathcal {{M}}_{n}^{+}\).
Second, we prove that the limit of φn(t,x) exists, denoted by φ(t,x).
Third, we prove that φ(t,x) is the value function of the original model.
Now, we construct \( \mathcal {{M}}_{n}^{+}\) first. Under Assumption 3.1(iii), we have
For each n ≥ 1, let An(x) := A(x) for x ∈ S, Kn := {(x,a)|x ∈ S,a ∈ An(x)}, and Sn := {x ∈ S|V0(x) ≤ n}. Moreover, for each x ∈ S, a ∈ An(x), let
Fix any n ≥ 1. By Eq. 4.11, qn(dy|x,a) is indeed a transition rate on S, which is conservative and stable. By Eq. 4.12, recalling T > 0, δ > 0, M0 ≥ 1, V0(x) ≥ 1 and c(x,a) ≥ 0, we have \(c_{n}^{+}(x,a) \ge 0\) for all n ≥ 1. Then, we obtain a sequence of models \(\{\mathcal {{M}}_{n}^{+}\}\):
for which the transition rates qn(dy|x,a) and costs \(c_{n}^{+}(x,a)\) are all bounded (by Assumption 3.1 and Eqs. 4.11–4.12). In the following arguments, any quantity associated with \(\mathcal {{M}}_{n}^{+}\) is labeled by a subscript n, such as the risk-sensitive criterion Jn(f,t,x) of a Markov policy f and the value function \(J_{n}(t,x):=\inf _{f\in {{\Pi }_{m}^{d}}} J_{n}(f,t,x)\).
Obviously, Assumptions 3.1, 3.2 and 4.1 still hold for each model \({\mathscr{M}}_{n}^{+}\). Thus, for each n ≥ 1, it follows from Proposition 4.2 that there exists \(\varphi _{n}(t,x)\in \mathbb {B}_{1,1}^{1}([0,T]\times S)\) satisfying (4.1) for the corresponding \({\mathscr{M}}_{n}^{+}\), that is,
for all x ∈ S and \(t\in E_{(\varphi _{n},x)}\) with \(m_{L}(E_{(\varphi _{n},x)}^{c})=0\). Furthermore, since \(c_{n}^{+}(x,a) \ge 0\), by Proposition 4.2(d), φn(t,x) is decreasing in t ∈ [0,T] for each fixed x ∈ S.
Then, Proposition 7.50 in Bertsekas and Shreve (1996) together with Lemma 4.1 and Eq. 4.13 gives the existence of a Markov policy \(f_{n}\in {{\Pi }_{m}^{d}}\) such that,
for all x ∈ S and \(t\in E_{(\varphi _{n},x)}\) with \(m_{L}(E_{(\varphi _{n},x)}^{c})=0\).
Also, by Eq. 4.12 we have \(e^{2T\delta c_{n}^{+}(x,a)}\leq M_{0}V_{0}(x)\) for all x ∈ S and n ≥ 1. Then, using Theorem 3.2(a2) with V0 = V1 ≡ 1 and Lemma 3.1, from Eq. 4.14 we have
Moreover, since φn(t,x) ≥ 0 and \(c_{n}^{+}(x,a)\geq c_{n-1}^{+}(x,a)\) for all (x,a) ∈ K, by Eqs. 4.11 and 4.14 as well as Proposition 4.2(d), we have, for all \(x\in S, t\in E_{(\varphi _{n},x)}\) and n ≥ 2
which, together with Theorem 3.2(a2) with V0 = V1 ≡ 1, implies that Jn− 1(fn,t,x) ≤ φn(t,x) for all (t,x) ∈ [0,T] × S. Therefore, we have φn− 1(t,x) ≤ Jn− 1(fn,t,x) ≤ φn(t,x), that is, the sequence {φn,n ≥ 1} is nondecreasing in n ≥ 1, and thus the limit
exists for each (t,x) ∈ [0,T] × S.
Since φn(t,x) is decreasing in t ∈ [0,T], φ(t,x) is decreasing in t ∈ [0,T] for each fixed x ∈ S. Therefore, φ(t,x) is differentiable at a.e. t ∈ [0,T] (for each fixed x ∈ S).
Let, for every n ≥ 1 and (t,x) ∈ [0,T] × S,
We next show that \(\lim _{n\to \infty } H_{n}(t,x)=H(t,x)\) for each (t,x) ∈ [0,T] × S.
Indeed, for any fixed (t,x) ∈ [0,T] × S, there exists n0 ≥ 1 such that \((t,x)\in [0,T]\times S_{n_{0}}\), and then qn(dy|x,a) = q(dy|x,a) for all n ≥ n0 and \(\lim _{n\to \infty } c_{n}^{+}(x,a)=c(x,a)\) for all a ∈ A(x). Thus, by Lemma 8.3.7 in Hernández-Lerma and Lasserre (1999) and Eq. 4.15 we have
Hence,
On the other hand, note that \(\liminf _{n\to \infty }H_{n}(t,x)=\lim _{m\to \infty }H_{n_{m}}(t,x)\) for some subsequence {nm,m ≥ 1} of {n,n ≥ 1}. For each m ≥ 1, under Assumption 4.1, the measurable selection theorem (e.g. Proposition 7.50 in Bertsekas and Shreve 1996) together with Lemma 4.1 ensures the existence of \(f_{n_{m}}\in {{\Pi }_{m}^{d}}\) such that
Since \(f_{n_{m}}(t,x)\in A(x)\) for all m ≥ 1 and A(x) is compact, there exists a subsequence \(\{f_{n_{m_{k}}}(t,x), k\geq 1\}\) of \(\{f_{n_{m}}(t,x), m\geq 1\}\) and a(t,x) ∈ A(x) (depending on (t,x)) such that \(f_{n_{m_{k}}}(t,x)\to a(t,x)\) as \(k\to \infty \). Thus, since \(\lim _{m\to \infty }H_{n_{m}}(t,x)=\lim _{k\to \infty }H_{n_{m_{k}}}(t,x)\), using Assumption 4.1, by Lemma 8.3.7 in Hernández-Lerma and Lasserre (1999) and Eq. 4.18 we have
which, together with Eq. 4.17, implies that \(\lim _{n\to \infty } H_{n}(t,x)=H(t,x)\). Thus, by Eq. 4.13 we have
Since φ(t,x) is the integral of a measurable function, it is absolutely continuous, which again shows that φ(t,x) is differentiable at a.e. t ∈ [0,T] (for each fixed x ∈ S). One can then verify that φ(t,x) satisfies (4.1). To show that \(\varphi (t,x)\in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\), since \(\varphi (t,x)\in \mathbb {B}_{V_{0}}([0,T]\times S)\) (by Eqs. 4.15–4.16), it remains to verify that \(\varphi ^{\prime }(t,x)\) is V1-bounded. Indeed, since T|δc(x,a)|≤ eTδ|c(x,a)|≤ M0V0(x), from Eq. 4.19 we have
which implies that φ(t,x) is in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\), and thus (a) is proved. □
Next, we use Proposition 4.3 to prove our main results by approximation from nonnegative cost rates to the cost rates that may be unbounded from above and from below.
Theorem 4.1
Under Assumptions 3.1, 3.2 and 4.1, the following assertions hold.
- (a):
-
There exists a unique φ(t,x) in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying the optimality equation (4.1).
- (b):
-
\(\varphi (t,x)=\inf _{{\pi \in {\Pi }_{m}^{r}}}J(\pi ,t,x)=\inf _{f{\in {\Pi }_{m}^{d}}}J(f,t,x)\) for all (t,x) ∈ [0,T] × S, with φ(t,x) as in (a).
- (c):
-
There exists a Markov policy \(f^{*}\in {{\Pi }_{m}^{d}}\) such that
$$ \begin{array}{@{}rcl@{}} \varphi^{\prime}(t,x)+ \delta c(x,f^{*}(t,x))\varphi(t,x)+{\int}_{S}\varphi(t,y)q(dy|x,f^{*}(t,x))=0 \end{array} $$for each x ∈ S and t ∈ E(φ,x) with \(m_{L}(E_{(\varphi ,x)}^{c})=0\), and f∗ is optimal.
Proof
We only prove (a) and the optimality of the policy f∗, since the others can be proved as in (b) and (c) of Proposition 4.3. For each n ≥ 1, define cn on K as follows: for each (x,a) ∈ K,
which implies that \(\lim _{n\to \infty } c_{n}(x,a)=c(x,a)\) and kn(x,a) := cn(x,a) + n ≥ 0 for each (x,a) ∈ K and n ≥ 1. Moreover, it follows from Assumption 3.1(iii) that
Thus, \(e^{2T\delta k_{n}(x,a)}\leq e^{2Tn\delta }M_{0}V_{0}(x)\) for all (x,a) ∈ K for all n ≥ 1, and so Assumptions 3.1 (with M0 replaced by e2TnδM0), 3.2 and 4.1 still hold for each model \(\mathcal {{N}}_{n}\) defined by
For any real-valued Borel measurable function u on K, let
provided the integral exists. Then, for each n ≥ 1, since kn(x,a) ≥ 0, by Proposition 4.3(b) we have \(J_{k_{n}}\) is in \(\mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) and satisfies
for all x ∈ S and \(t\in E_{(J_{k_{n}},x)}\).
Moreover, since \(J_{k_{n}}(t,x)=J_{c_{n}+n}(t,x)=J_{c_{n}}(t,x)e^{\delta (T-t)n}\), by Eq. 4.21 we derive that
That is,
On the other hand, for each (t,x) ∈ [0,T] × S, it follows from Eq. 4.20 and Lemma 3.1(d) that
Since cn(x,a) is decreasing in n ≥ 1, so are the corresponding value functions \(J_{c_{n}}(t,x)\). Therefore, the limit \(\varphi (t,x):=\lim _{n\to \infty }J_{c_{n}}(t,x)\) exists for each (t,x) ∈ [0,T] × S. Then, as in the arguments for Proposition 4.3 with φn(t,x) replaced by \(J_{c_{n}}(t,x)\) here, from Eqs. 4.22 and 4.23 we see that (a) is also true.
Moreover, by Eq. 4.1 and (c), using Theorem 3.2 we see that f∗ is optimal. □
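The displayed definition of cn is not reproduced in this excerpt; a standard truncation consistent with the stated properties (cn decreasing in n, cn → c, and kn := cn + n ≥ 0) is cn(x,a) := max{c(x,a), −n}, which is assumed in the minimal sketch below purely for illustration.

```python
# Hedged sketch of the cost truncation in the proof of Theorem 4.1, ASSUMING
# the (elided) definition c_n(x, a) := max(c(x, a), -n), truncation from below.
# This choice is consistent with the stated properties: c_n is decreasing in n,
# c_n -> c as n -> infinity, and k_n := c_n + n >= 0.

def c_n(c_val, n):
    """Cost rate truncated from below at level -n (assumed definition)."""
    return max(c_val, -n)

def k_n(c_val, n):
    """Shifted cost k_n = c_n + n, which is nonnegative by construction."""
    return c_n(c_val, n) + n

# A hypothetical cost value that is unbounded from below in the original model:
c_val = -7.3
trunc = [c_n(c_val, n) for n in range(1, 12)]  # -1, -2, ..., decreasing toward c_val
```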
Remark 4.2
For the given \(\pi _{t}(da|x)\in {{\Pi }_{m}^{r}}\), to show the existence of a \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.14), we modify the operator B in Eq. 4.2 to the following operator Bπ:
for all (t,x) ∈ [0,T] × S. Then, a similar argument as in the proof of Proposition 4.3(a) gives the existence of a \(\varphi \in \mathbb {B}_{V_{0},V_{1}}^{1}([0,T]\times S)\) satisfying (3.14).
5 An example
Recall that Assumptions 3.1 and 4.1 above are generalizations of the corresponding ones in Guo and Hernández-Lerma (2009), Guo et al. (2012), Guo and Piunovskiy (2011), Piunovskiy and Zhang (2011), and Prieto-Rumeau and Hernández-Lerma (2012). Hence, they are satisfied for all the examples and hypotheses in these references. To further illustrate the main results here, we next consider the risk-sensitive optimality problem for the cash-flow model studied in Guo et al. (2015a) for CTMDPs with mean-variance criteria.
Example 5.1 (Risk-sensitive controlled problems of cash flow)
Consider a continuous-time controlled cash-flow problem in an economic market, with the amount of cash as the state, so that the corresponding state space is \(S:=(-\infty ,+\infty )\). When the current state of the cash flow is x ∈ S, a decision-maker withdraws money of amount − a (if a < 0) or supplies money of amount a (if a ≥ 0), where a is regarded as an action. When the current state is x ∈ S and an action a ∈ A(x) is chosen, two things happen: 1) a cost is incurred at rate c(x,a); and 2) the amount x of cash remains unchanged for an exponentially distributed random time with parameter λ(x,a) ≥ 0, and then jumps to a new state according to the normal distribution N(x,σ2) for some constant σ > 0. Therefore, the transition rates of the cash flow are represented by
For this cash-flow model, the decision-maker wishes to minimize the risk-sensitive cost over a given horizon T among all policies.
To ensure the existence of an optimal policy for the cash flow model, we consider the following hypotheses:
(A1) λ(x,a) ≤ M(x2 + 1) and \(|c(x,a)|\leq \frac {1}{2\delta T}[M+\ln (x^{2}+1)]\) for all x ∈ S,a ∈ A(x) with some positive constant M, where the constants δ and T are as before;
(A2) A(x) is assumed to be a compact set of a Borel space A for each x ∈ S;
(A3) λ(x,a) and c(x,a) are Borel measurable on K and continuous in a ∈ A(x) for each fixed x ∈ S.
Under the above conditions, we have the following fact.
Proposition 5.1
Under the hypotheses A1–A3, Example 5.1 satisfies Assumptions 3.1, 3.2 and 4.1, and hence (by Theorem 4.1) there exists an optimal Markov policy.
To verify the conditions required in Theorem 4.1, let
Since \(\frac {1}{\sqrt {2\pi }\sigma }{\int \limits }_{S} (y-x)^{2k+1}e^{-\frac {(y-x)^{2}}{2\sigma ^{2}}}dy=0\) and \(\frac {1}{\sqrt {2\pi }\sigma }{\int \limits }_{S}(y-x)^{2k}e^{-\frac {(y-x)^{2}}{2\sigma ^{2}}}dy =1\cdot 3{\cdots } (2k-1)\sigma ^{2k}\) for all k = 0,1,…, using (5.1) and hypothesis A1, a direct calculation gives
Thus, hypotheses A1–A3 imply Assumptions 3.1, 3.2, and 4.1, and Theorem 4.1 then gives the existence of an optimal Markov policy.
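To illustrate how the moment identities enter this verification, take the weight function \(w(x):=x^{2}+1\) (a natural candidate consistent with A1; the paper's actual choice may differ). The k = 1 even-moment identity gives
\[ {\int \limits }_{S}w(y)\,q(dy|x,a)=\lambda (x,a)\Big [{\int \limits }_{S}(y^{2}+1)\frac {1}{\sqrt {2\pi }\sigma }e^{-\frac {(y-x)^{2}}{2\sigma ^{2}}}dy-(x^{2}+1)\Big ]=\lambda (x,a)\,\sigma ^{2}\leq M\sigma ^{2}\,w(x), \]
which is precisely the kind of drift (growth) condition on the unbounded transition rates that Assumptions 3.1 and 4.1, stated earlier in the paper, involve.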
Remark 5.1
In this example, the cost rate c(x,a) is allowed to be unbounded from above and from below, and the transition rates q(dy|x,a) can be unbounded. Thus, some of the conditions for risk-sensitive criteria in Ghosh and Saha (2014), Jaskiewicz (2007), Kumar and Chandan (2013, 2015), and Wei (2016) fail to hold for this example, because the transition and cost rates are all bounded in those works.
References
Anantharam V, Borkar VS (2017) A variational formula for risk-sensitive reward. SIAM J Control Optim 55:961–988
Basu A, Ghosh MK (2014) Zero-sum risk-sensitive stochastic games on a countable state space. Stochastic Process Appl 124:961–983
Bäuerle N, Rieder U (2014) More risk-sensitive Markov decision processes. Math Oper Res 39:105–120
Bäuerle N, Rieder U (2017) Zero-sum risk-sensitive stochastic games. Stochastic Process Appl 127:622–642
Bertsekas D, Shreve S (1996) Stochastic optimal control: the discrete-time case. Academic Press, Inc
Cavazos-Cadena R, Hernández-Hernández D (2011) Discounted approximations for risk-sensitive average criteria in Markov decision chains with finite state space. Math Oper Res 36:133–146
Feinberg EA, Mandava M, Shiryaev AN (2014) On solutions of Kolmogorov’s equations for nonhomogeneous jump Markov processes. J Math Anal Appl 411:261–270
Ghosh MK, Saha S (2014) Risk-sensitive control of continuous time Markov chains. Stochastics 86:655–675
Guo X (2007) Continuous-time Markov decision processes with discounted rewards: the case of Polish spaces. Math Oper Res 32:73–87
Guo X, Piunovskiy A (2011) Discounted continuous-time Markov decision processes with constraints: unbounded transition and loss rates. Math Oper Res 36:105–132
Guo X, Song X (2011) Discounted continuous-time constrained Markov decision processes in Polish spaces. Ann Appl Probab 21:2016–2049
Guo X, Hernández-Lerma O (2009) Continuous-time Markov decision processes: theory and applications. Springer, Berlin
Guo X, Huang Y, Song X (2012) Linear programming and constrained average optimality for general continuous-time Markov decision processes in history-dependent policies. SIAM J Control Optim 50:23–47
Guo X, Huang XX, Zhang Y (2015a) On the first passage g-mean-variance optimality for discounted continuous-time Markov decision processes. SIAM J Control Optim 53:1406–1424
Guo X, Huang XX, Huang Y (2015b) Finite-horizon optimality for continuous-time Markov decision processes with unbounded transition rates. Adv Appl Probab 47:1064–1087
Hernández-Lerma O, Lasserre JB (1999) Further topics on discrete-time Markov control processes. Springer, New York
Huang Y (2018) Finite horizon continuous-time Markov decision processes with mean and variance criteria. Discret Event Dyn Syst 28(4):539–564
Huo H, Zou X, Guo X (2017) The risk probability criterion for discounted continuous-time Markov decision processes. Discret Event Dyn Syst 27(4):675–699
Jaskiewicz A (2007) Average optimality for risk-sensitive control with general state space. Ann Appl Probab 17:654–675
Kitaev MY, Rykov V (1995) Controlled queueing systems. CRC Press, New York
Kumar KS, Chandan P (2013) Risk-sensitive control of jump process on denumerable state space with near monotone cost. Appl Math Optim 68:311–331
Kumar KS, Chandan P (2015) Risk-sensitive control of continuous-time Markov processes with denumerable state space. Stoch Anal Appl 33:863–881
Piunovskiy A, Zhang Y (2011) Discounted continuous-time Markov decision processes with unbounded rates: the convex analytic approach. SIAM J Control Optim 49:2032–2061
Prieto-Rumeau T, Hernández-Lerma O (2012) Selected topics in continuous-time controlled Markov chains and Markov games. Imperial College Press, London
Wei QD (2016) Continuous-time Markov decision processes with risk-sensitive finite-horizon cost criterion. Math Meth Oper Res 84:461–487
Wei Q, Chen X (2017) Average cost criterion induced by the regular utility function for continuous-time Markov decision processes. Discret Event Dyn Syst 27 (3):501–524
Xia L (2014) Event-based optimization of admission control in open queueing networks. Discret Event Dyn Syst 24(2):133–151
Xia L (2018) Variance minimization of parameterized Markov decision processes. Discret Event Dyn Syst 28:63–81
Yushkevich AA (1977) Controlled Markov models with countable state and continuous time. Theory Probab Appl 22:215–235
Zhang Y (2017) Continuous-time Markov decision processes with exponential utility. SIAM J Control Optim 55:2636–2660
Acknowledgments
Research supported by the National Natural Science Foundation of China (Grant No. 60171009, Grant No. 61673019, Grant No. 11931018).
Cite this article
Guo, X., Zhang, J. Risk-sensitive continuous-time Markov decision processes with unbounded rates and Borel spaces. Discrete Event Dyn Syst 29, 445–471 (2019). https://doi.org/10.1007/s10626-019-00292-y