
11.1 Introduction

In the field of Markov decision problems (MDPs), the control horizon is usually a fixed finite interval [0, T] or the infinite interval [0, +∞). In many real applications, however, the control horizon may be a random duration [0, τ], where the terminal time τ is a random variable at which the state of the controlled system changes critically and control beyond τ may no longer be meaningful or necessary. For example, in insurance systems [27], control after the company goes bankrupt becomes unnecessary; it therefore makes better sense to consider the problem on [0, τ], where τ represents the bankruptcy time of the company. Such situations motivate first passage problems in MDPs [13, 15, 19, 21, 22, 28], for which one generally aims at maximizing/minimizing the expected reward/cost over a first passage time to some target set.

This chapter is devoted to studying constrained optimality for first passage criteria, where the dynamics of the system are described by semi-Markov decision processes (SMDPs). The state space is assumed to be denumerable, while the action set is general. Both reward and cost rates are possibly unbounded. A key feature of our model is that the discount rate is state-action dependent, and, furthermore, the undiscounted case is allowed. This makes our model more general, since a state-action-dependent discount rate captures practical quantities such as the interest rate in economic and financial systems [2, 9, 17, 23, 26], which can be adjusted according to the prevailing circumstances. We aim to maximize the expected reward obtained during a first passage time to some target set, subject to the constraint that the associated expected cost over this first passage time does not exceed a given bound. An interesting special case is the one in which the reward rates are uniformly equal to one, which corresponds to a stochastic time optimal control problem with a target set; see Remark 11.2.4(d) for details.

Previously, Beutler and Ross [3] consider constrained SMDPs with the long-run average criterion. They suppose that the state space of the SMDP is finite and that the action space is a compact metric space. A Lagrange multiplier formulation involving a dynamic programming equation is used to relate the constrained optimization problem to an unconstrained one parametrized by the multiplier. This approach leads to a proof of the existence of a semi-simple optimal constrained policy; that is, there is at most one state at which the action is randomized between two possibilities, while at every other state a single action is chosen. Feinberg [4] further investigates constrained average reward SMDPs with finite state and action sets, develops a technique of state-action renewal intensities, and provides linear programming algorithms for the computation of optimal policies. On the other hand, Feinberg [5] deals with constrained infinite horizon discounted SMDPs. In contrast with the works above, our main interest in this chapter is to analyze constrained optimality for first passage criteria in SMDPs, which, to the best of our knowledge, is an issue not yet explored.

To obtain the existence of a constrained first passage optimal policy, we first give suitable conditions and then employ the so-called Lagrange multiplier technique to analyze the constrained control problem. Based on this technique, we transform the constrained control problem into an unconstrained one, prove that a constrained optimal policy exists, and show that the constrained optimal policy randomizes between two stationary policies differing in at most one state.

The rest of this chapter is organized as follows. In Sect. 11.2, we formulate the control model, followed by the optimality conditions and the main result on the existence of constrained optimal policies. In Sect. 11.3, some technical preliminaries are given, and the proof of the main result is presented in Sect. 11.4.

11.2 The Control Model

The model of constrained SMDPs considered in this chapter is specified by the eight objects

$$\{E,B,(A(i) \subset A,i \in E),Q(\cdot,\cdot \mid i,a),r(i,a),c(i,a),\alpha (i,a),\gamma \},$$
(11.1)

where E is the state space, a denumerable set; B ⊂ E is the given target set, such as the set of all bad states or of all good states of a system; A is the action space, a Borel space endowed with the Borel σ-field \(\mathcal{A}\); and \(A(i) \in \mathcal{A}\) is the set of admissible actions at state i ∈ E. The transition mechanism of the SMDP is defined by the semi-Markov kernel Q( ⋅,  ⋅ ∣ i, a) on R_+ × E given K, where R_+ = [0, +∞) and K = { (i, a)∣i ∈ E, a ∈ A(i)} is the set of feasible state-action pairs. It is assumed that (1) Q( ⋅, j ∣ i, a) (for any fixed j ∈ E and (i, a) ∈ K) is a nondecreasing, right-continuous real function on R_+ such that Q(0, j ∣ i, a) = 0; (2) Q(t,  ⋅ ∣  ⋅,  ⋅) (for each fixed t ∈ R_+) is a sub-stochastic kernel on E given K; and (3) P( ⋅ ∣  ⋅,  ⋅) := Q(∞,  ⋅ ∣  ⋅,  ⋅) is a stochastic kernel on E given K. If action a ∈ A(i) is selected in state i, then Q(t, j ∣ i, a) is the joint probability that the sojourn time in state i is not greater than t ∈ R_+ and that the next state is j. Moreover, r(i, a) and c(i, a) in (11.1) denote the reward and cost rate functions on K, valued in R = (−∞, +∞), which are both assumed to be measurable on A(i) for each fixed i ∈ E. In addition, α(i, a) represents the discount rate, a measurable function from K to R_+. Finally, γ is a given constraint constant.
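
For readers who wish to experiment numerically with the model (11.1), the following minimal Python sketch records its primitives for the special case of exponential sojourn times, so that Q(t, j ∣ i, a) = P(j ∣ i, a)(1 − e^{−rate(i,a)t}). The container SMDPModel and all field names are our own illustrative choices, not notation from the chapter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Illustrative container for the primitives in (11.1); all names are ours.
# Sojourn times are taken to be exponential with rate `rate(i, a)`, so that
# Q(t, j | i, a) = P(j | i, a) * (1 - exp(-rate(i, a) * t)).
@dataclass
class SMDPModel:
    states: List[int]                              # a finite truncation of E
    target: List[int]                              # the target set B ⊂ E
    actions: Callable[[int], List[int]]            # i ↦ A(i)
    trans: Callable[[int, int], Dict[int, float]]  # (i, a) ↦ P(· | i, a)
    rate: Callable[[int, int], float]              # (i, a) ↦ sojourn rate
    reward: Callable[[int, int], float]            # r(i, a)
    cost: Callable[[int, int], float]              # c(i, a)
    discount: Callable[[int, int], float]          # α(i, a), zero allowed
    gamma: float                                   # constraint constant γ

# A toy instance: E = {0, 1, 2}, B = {0}, two actions in every state.
toy = SMDPModel(
    states=[0, 1, 2],
    target=[0],
    actions=lambda i: [0, 1],
    trans=lambda i, a: ({0: 1.0} if i == 0
                        else {0: 0.3 + 0.2 * a, 1: 0.4, 2: 0.3 - 0.2 * a}),
    rate=lambda i, a: 1.0 + a,
    reward=lambda i, a: 1.0,
    cost=lambda i, a: 0.5 * a,
    discount=lambda i, a: 0.1,
    gamma=1.0,
)
```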

Remark 11.2.1.

Compared with the models of the standard constrained discounted and average criteria [3–5], in the model (11.1) we introduce a target set B ⊂ E of the controlled system, and furthermore, the discount rate α(i, a) here is state-action dependent and may be equal to zero (i.e., the undiscounted case is allowed).

To state the constrained SMDPs we are concerned with, we need to introduce the classes of policies. For each n ≥ 0, let H n be the family of admissible histories up to the nth jump (decision epoch), that is, \({H}_{n} = {({R}_{+} \times K)}^{n} \times ({R}_{+} \times E)\), for n = 0, 1, ….

Definition 11.2.1.

A randomized history-dependent policy, or simply a policy, is a sequence π = { π n , n ≥ 0} of stochastic kernels π n on A given H n satisfying

$${\pi }_{n}(A({i}_{n})\mid {h}_{n}) = 1\ \ \ \forall \ {h}_{n} = ({t}_{0},{i}_{0},{a}_{0},\ldots,{t}_{n-1},{i}_{n-1},{a}_{n-1},{t}_{n},{i}_{n}) \in {H}_{n},\quad n = 0,1,\ldots.$$

The class of all policies is denoted by Π. To distinguish the subclasses of Π, we let Φ be the family of all stochastic kernels φ on A given E such that φ(A(i)∣i) = 1 for all i ∈ E, and \(\mathbb{F}\) the set of all functions f : E → A such that f(i) is in A(i) for every i ∈ E. A policy π = { π n } ∈ Π is said to be randomized Markov if there is a sequence {φ n } of φ n  ∈ Φ such that π n ( ⋅∣h n ) = φ n ( ⋅∣i n ) for every h n  ∈ H n and n ≥ 0. We denote such a policy by π = { φ n }. A randomized Markov policy π = { φ n } is said to be randomized stationary if every φ n is independent of n. In this case, we write π = { φ, φ, …} as φ for simplicity. Further, a randomized Markov policy π = { φ n } is said to be deterministic Markov if there is a sequence {f n } of \({f}_{n} \in \mathbb{F}\) such that φ n ( ⋅∣i) is the Dirac measure at f n (i) for all i ∈ E and n ≥ 0. We write such a policy as π = { f n }. In particular, a deterministic Markov policy π = { f n } is said to be (deterministic) stationary if the f n are all independent of n. Similarly, we write π = { f, f, …} as f for simplicity. We denote by Π RM, Π RS, Π DM, and Π DS the families of all randomized Markov, randomized stationary, deterministic Markov, and (deterministic) stationary policies, respectively. Obviously, Φ = Π RS ⊂ Π RM ⊂ Π and \(\mathbb{F} = {\Pi }_{\mathrm{DS}} \subset {\Pi }_{\mathrm{DM}} \subset \Pi \).
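
The policy classes above can be mirrored directly in code. The following sketch (again illustrative, with names of our own choosing) represents a deterministic stationary policy f ∈ 𝔽 as a function i ↦ f(i) and a randomized stationary policy φ ∈ Φ as a function returning a distribution over A(i), together with the Dirac embedding of 𝔽 into Φ.

```python
from typing import Callable, Dict

# Illustrative representations of the policy classes above (names are ours).
# A deterministic stationary policy f ∈ F maps each state to one action.
DetStationary = Callable[[int], int]

# A randomized stationary policy φ ∈ Φ maps each state to a probability
# distribution over A(i), encoded here as a dict {action: probability}.
RandStationary = Callable[[int], Dict[int, float]]

def as_randomized(f: DetStationary) -> RandStationary:
    """Embed f ∈ F into Φ via the Dirac measure concentrated at f(i)."""
    return lambda i: {f(i): 1.0}

# Example: the stationary policy that always chooses action 0.
f0: DetStationary = lambda i: 0
phi0: RandStationary = as_randomized(f0)
```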

Let ℙ(E) denote the set of all probability measures on E. For each (s, μ) ∈ R_+ × ℙ(E) and π ∈ Π, by Tulcea’s theorem ([10, Proposition C.10]), there exist a unique probability space (\(\Omega,\mathcal{F},\) P (s, μ) π) and a stochastic process {T n , J n , A n , n ≥ 0} such that, for each \(i,j \in E,t \in {R}_{+},C \in \mathcal{A}\) and n ≥ 0,

$$\begin{array}{rcl} & & {P}_{(s,\mu )}^{\pi }({T}_{ 0} = s,{J}_{0} = i) = \mu (i),\end{array}$$
(11.2)
$$\begin{array}{rcl} & & {P}_{(s,\mu )}^{\pi }({A}_{ n} \in C\mid {h}_{n}) = {\pi }_{n}(C\mid {h}_{n}),\end{array}$$
(11.3)
$$\begin{array}{rcl} & & {P}_{(s,\mu )}^{\pi }({T}_{ n+1} - {T}_{n} \leq t,{J}_{n+1} = j\mid {h}_{n},{a}_{n}) = Q(t,j\mid {i}_{n},{a}_{n}),\end{array}$$
(11.4)

where T n , J n , and A n denote the nth decision epoch, the state, and the action chosen at the nth decision epoch, respectively. The expectation operator with respect to P (s, μ) π is denoted by E (s, μ) π. In particular, if μ is the Dirac measure δ i ( ⋅) concentrated at some state i ∈ E, we write P (s, μ) π and E (s, μ) π as P (s, i) π and E (s, i) π, respectively. For simplicity, P (0, μ) π and E (0, μ) π are denoted by P μ π and E μ π, respectively. Without loss of generality, in the following, we always set the initial decision epoch T 0 = 0 and omit it.

Remark 11.2.2.

  1. (a)

    The construction of the probability measure space (\(\Omega,\mathcal{F},\) P (s, μ) π) and the above properties (11.2)–(11.4) follow from those in Limnios and Oprisan [18, p.33] and Puterman [24, p.534–535].

  2. (b)

    Let X 0 : = 0 and X n : = T n  − T n − 1 (n ≥ 1) denote the sojourn times between decision epochs (jumps). Then, the stochastic process {T n , J n , A n , n ≥ 0} may be rewritten as {X n , J n , A n , n ≥ 0}.

To avoid the possibility of an infinite number of decision epochs within finite time, we make the following assumption, which ensures that the decision epochs do not accumulate.

Assumption 11.2.1

For all μ ∈ ℙ(E) and π ∈ Π, P μ π ({lim n→∞ T n = ∞}) = 1.

To verify Assumption 11.2.1, one can use the following sufficient condition.

Condition 11.2.2

There exist constants δ > 0 and ε > 0 such that

$$Q(\delta,E\mid i,a) \leq 1 - \varepsilon \ \ \forall (i,a) \in K.$$

Remark 11.2.3.

In fact, Condition 11.2.2 is the standard regularity condition widely used in SMDPs [5, 16, 20, 24, 25], and it implies Assumption 11.2.1 above.

Under Assumption 11.2.1, we define an underlying continuous-time state-action process {Z(t), W(t), t ∈ R  + } corresponding to the stochastic process {T n , J n , A n } by

$$Z(t) = {J}_{n},\quad W(t) = {A}_{n},\ \ \mbox{ for}\ \ {T}_{n} \leq t < {T}_{n+1},\quad t \in {R}_{+}\ \mbox{ and}\ n \geq 0.$$

Definition 11.2.2.

The stochastic process {Z(t), W(t)} is called a (continuous-time) SMDP.

For the target set B ⊂ E, we consider the random variable

$$\begin{array}{rcl}{ \tau }_{B} :=\inf \{ t \geq 0\mid Z(t) \in B\}\ \ (\mbox{ with}\ \inf \varnothing := \infty ),& & \\ \end{array}$$

which is the first passage time of the process {Z(t), t ∈ R_+} into the set B. Now, fix an initial distribution μ ∈ ℙ(E). For each π ∈ Π, the expected first passage reward and cost criteria are defined as follows:

$$\begin{array}{rcl} & & {V }_{r}(\mu,\pi ) := {E}_{\mu }^{\pi }\left[{\int \nolimits \nolimits }_{0}^{{\tau }_{B} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }r(Z(t),W(t))\mathrm{d}t\right],\end{array}$$
(11.5)
$$\begin{array}{rcl} & & {V }_{c}(\mu,\pi ) := {E}_{\mu }^{\pi }\left[{\int \nolimits \nolimits }_{0}^{{\tau }_{B} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }c(Z(t),W(t))\mathrm{d}t\right].\end{array}$$
(11.6)
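
The criteria (11.5) and (11.6) can be estimated by straightforward Monte Carlo simulation of the jump chain via (11.2)–(11.4). The sketch below is our own illustration: it assumes the exponential-sojourn SMDPModel and the randomized stationary policy representation from the earlier snippets, and that B is reached with probability one from the initial state; it accumulates the pathwise discounted reward and cost up to τ_B.

```python
import math
import random

def simulate_first_passage(model, policy, init_state, rng):
    """Simulate one trajectory of {T_n, J_n, A_n} via (11.2)-(11.4) under a
    randomized stationary policy, stopped at the first entry into B, and
    return the pathwise discounted reward and cost up to tau_B, cf. (11.5)-(11.6).

    `model` is assumed to expose the fields of the SMDPModel sketch and
    `policy(i)` a distribution over A(i); B is assumed to be reached a.s."""
    i, reward_acc, cost_acc, disc = init_state, 0.0, 0.0, 1.0
    while i not in model.target:
        probs = policy(i)
        a = rng.choices(list(probs), weights=list(probs.values()))[0]
        x = rng.expovariate(model.rate(i, a))        # exponential sojourn time
        alpha = model.discount(i, a)
        # integral of e^{-alpha t} over [0, x]; the case alpha = 0 is allowed
        w = x if alpha == 0.0 else (1.0 - math.exp(-alpha * x)) / alpha
        reward_acc += disc * model.reward(i, a) * w
        cost_acc += disc * model.cost(i, a) * w
        disc *= math.exp(-alpha * x)                 # running discount up to T_{n+1}
        nxt = model.trans(i, a)
        i = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return reward_acc, cost_acc

def estimate_criteria(model, policy, init_state, n_paths=10_000, seed=42):
    """Crude Monte Carlo estimates of V_r(i, phi) and V_c(i, phi)."""
    rng = random.Random(seed)
    samples = [simulate_first_passage(model, policy, init_state, rng)
               for _ in range(n_paths)]
    v_r = sum(r for r, _ in samples) / n_paths
    v_c = sum(c for _, c in samples) / n_paths
    return v_r, v_c
```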

To introduce the constrained problem, for the constraint constant γ in (11.1), let

$$U :=\{ \pi \in \Pi \mid {V }_{c}(\mu,\pi ) \leq \gamma \}$$

be the set of “constrained” policies. We assume throughout that U ≠ ∅. Then, the optimization problem we are interested in is to maximize the expected first passage reward V r (μ, π) over the set U, and our goal is to find a constrained optimal policy as defined below.

Definition 11.2.3.

A policy π ∗  ∈ U is called constrained optimal if

$${V }_{r}(\mu,{\pi }^{{_\ast}}) {=\sup }_{ \pi \in U}{V }_{r}(\mu,\pi ).$$

Remark 11.2.4.

  1. (a)

    It is worthwhile to point out that the expected first passage reward criterion V r (μ, π) defined in (11.5) differs from the usual discounted reward criteria [11, 12, 24] and the total reward criteria without discount [6, 11, 24]. In fact, the former is concerned with the performance of the system during a first passage time to some target set, while the latter are concerned with the performance of the system over an infinite horizon. However, if the target set B = ∅ (and thus \({\tau }_{B} \equiv \infty \)) and, furthermore, the discount rate α(i, a) is state-action independent (say, α(i, a) ≡ α), then the expected first passage reward criterion (11.5) reduces to the standard infinite horizon expected discounted reward criterion or expected total reward criterion [6, 11, 12, 14, 24].

  2. (b)

    Note that the case without discount, that is, \(\alpha (i,a) \equiv 0\), is allowed in the context of this chapter; see Remark 11.2.5 for further details.

  3. (c)

    When the constraint constant γ in (11.1) is sufficiently large so that U = Π, the constrained first passage optimization problem (recall Definition 11.2.3) reduces to the usual unconstrained first passage optimization problem [13, 15, 19, 21, 22, 28].

  4. (d)

    In real situations, the target set B usually represents the set of failure states of a system, and thus τ B denotes the working life (functioning life) of the system. Therefore, our aim is to maximize the expected reward V r (μ, π) obtained before the system fails, subject to the constraint that the associated cost V c (μ, π) incurred before the failure does not exceed the constant γ. In particular, if the reward rate r(i, a) ≡ 1 and the discount rate α(i, a) ≡ 0, our aim reduces to maximizing the expected working life of the system, subject to the same constraint on V c (μ, π).

To obtain the existence of a constrained optimal policy, we need several sets of conditions.

Assumption 11.2.3

There exist constants M > 0, 0 < β < 1, and a weight function w ≥ 1 on E such that for every i ∈ B c := E − B,

  1. (a)

    \({\sup }_{a\in A(i)}\vert \widetilde{r}(i,a)\vert \leq Mw(i)\) and \({\sup }_{a\in A(i)}\vert \widetilde{c}(i,a)\vert \leq Mw(i)\), where

    $$\begin{array}{rcl} \widetilde{r}(i,a) :& =& r(i,a){\int \nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha (i,a)t}(1 - D(t\mid i,a))\mathrm{d}t, \\ \widetilde{c}(i,a) :& =& c(i,a){\int \nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha (i,a)t}(1 - D(t\mid i,a))\mathrm{d}t,\ \ \mbox{ and} \\ D(t\mid i,a) :& =& Q(t,E\mid i,a).\end{array}$$
  2. (b)

    $${\sup }_{a\in A(i)}{\sum }_{j\in {B}^{c}}w(j)m(j\mid i,a) \leq \beta w(i),\ \ \mbox{ where}\ \ m(j\mid i,a) :={ \int \nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha (i,a)t}Q(\mathrm{d}t,j\mid i,a).$$
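
For a concrete feel of Assumption 11.2.3, the following sketch (our own, for a finite truncation of E with exponential sojourn times, in which case m(j∣i, a) = P(j∣i, a)·rate(i, a)/(rate(i, a) + α(i, a)) and ∫_0^∞ e^{−α(i,a)t}(1 − D(t∣i, a))dt = 1/(rate(i, a) + α(i, a))) checks parts (a) and (b) numerically for a candidate weight function w and constants M, β; it reuses the SMDPModel container introduced earlier.

```python
def check_assumption_11_2_3(model, w, M, beta):
    """Check Assumption 11.2.3(a)-(b) over B^c of a finite truncation of E,
    assuming exponential sojourn times, for which
      m(j | i, a) = P(j | i, a) * rate / (rate + alpha)   and
      integral of e^{-alpha t}(1 - D(t | i, a)) dt = 1 / (rate + alpha)."""
    for i in model.states:
        if i in model.target:            # restrictions are imposed on B^c only
            continue
        for a in model.actions(i):
            lam, alpha = model.rate(i, a), model.discount(i, a)
            mean_disc_sojourn = 1.0 / (lam + alpha)
            r_tilde = model.reward(i, a) * mean_disc_sojourn
            c_tilde = model.cost(i, a) * mean_disc_sojourn
            if max(abs(r_tilde), abs(c_tilde)) > M * w(i):
                return False             # part (a) fails at (i, a)
            drift = sum(w(j) * p * lam / (lam + alpha)
                        for j, p in model.trans(i, a).items()
                        if j not in model.target)
            if drift > beta * w(i):
                return False             # part (b) fails at (i, a)
    return True

# For the toy model above, w ≡ 1, M = 1, and beta = 0.9 already work:
# check_assumption_11_2_3(toy, lambda i: 1.0, M=1.0, beta=0.9)
```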

Remark 11.2.5.

  1. (a)

    In fact, Assumption 11.2.3 is a condition ensuring that the first passage criteria (11.5) and (11.6) are finite and that the dynamic programming operators are contractive; see Lemmas 11.3.1 and 11.3.2 below.

  2. (b)

    Assumption 11.2.3(a) shows that the cost function may be unbounded from above and from below, whereas those in the existing works [3–5, 7, 8, 12] on the standard constrained expected discounted criteria are assumed to be bounded or nonnegative (bounded below).

  3. (c)

    Note that the case without discount, that is, “α(i, a) ≡ 0”, is allowed in Assumption 11.2.3. In this case, Assumption 11.2.3(b) reduces to the existence of a constant 0 < β < 1 such that

    $$ \sup\limits_{a\in A(i)}\sum\limits_{j\in {B}^{c}}w(j)P(j\mid i,a) \leq \beta w(i)\ \forall \ i \in {B}^{c}\ (\mathrm{with}\ P(j\mid i,a) := Q(\infty,j\mid i,a)),$$
    (11.7)

    which can still be verified. This is because the restrictions in Assumption 11.2.3(b) are imposed only on the data of the set B c rather than on the entire space E. However, if the restrictions in Assumption 11.2.3(b) were imposed on the data of the entire space E, that is, if there existed a constant 0 < β < 1 such that

    $$ \sup\limits_{a\in A(i)} \sum\limits_{j\in E}w(j)P(j\mid i,a) \leq \beta w(i)\ \ \forall \ i \in E,$$
    (11.8)

    then (11.8) would fail to hold. Indeed, since P( ⋅ ∣ i, a) is a probability measure on E, the left-hand side of (11.8) is at least inf j∈E w(j) for every i ∈ E; taking the infimum over i ∈ E on both sides of (11.8) then gives β inf i∈E w(i) ≥ inf j∈E w(j), that is, β ≥ 1 (because w ≥ 1), which contradicts “0 < β < 1”.

Assumption 11.2.4

  1. (a)

    For each i ∈ B c , A(i) is compact.

  2. (b)

    The functions \(\widetilde{r}(i,a),\) \(\widetilde{c}(i,a)\) , and m(j∣i,a) defined in Assumption 11.2.3 are continuous in a ∈ A(i) for each fixed i,j ∈ B c , respectively.

  3. (c)

    The function \({\sum \nolimits }_{j\in {B}^{c}}w(j)m(j\mid i,a)\) is continuous in a ∈ A(i), with w as in Assumption 11.2.3.

Remark 11.2.6.

Assumption 11.2.4 collects compactness-continuity conditions for the first passage criteria, similar to the standard compactness-continuity conditions for the discounted and average criteria; see, for instance, Beutler and Ross [3] and Guo and Hernández-Lerma [7, 8]. The difference is that the former imposes restrictions only on the data of the set B c, while the latter concern the entire space E.

Assumption 11.2.5

  1. (a)

    \({\sum \nolimits }_{j\in {B}^{c}}w(j)\mu (j) < \infty.\)

  2. (b)

    U 0 :={ π ∈ Π∣V c (μ,π) < γ}≠∅.

Remark 11.2.7.

  1. (a)

    Assumption 11.2.5(a) is a condition on the “tails” of the initial distribution μ, whereas Assumption 11.2.5(b) is a Slater-like hypothesis, typical for constrained problems; see, for instance, Beutler and Ross [3], Guo and Hernández-Lerma [7, 8], and Zhang and Guo [29].

  2. (b)

    It should be noted that the conditions in Assumptions 11.2.3–11.2.5 are all imposed on the data of the set B c rather than the entire space E and thus can be fulfilled in greater generality.

Our main result is stated as follows.

Theorem 11.2.1.

Suppose that Assumptions 11.2.1–11.2.5 hold. Then there exists a constrained optimal policy that is either a stationary policy or a randomized stationary policy mixing two stationary policies which differ in at most one state; that is, there exist two stationary policies f 1 , f 2 , a state i ∗  ∈ B c , and a number p ∈ [0,1] such that f 1 (i) = f 2 (i) for all i ≠ i ∗  and, in addition, the randomized stationary policy φ p given by (11.9) is constrained optimal, where

$$\begin{array}{rcl}{ \varphi }^{p}(a\mid i) = \left \{\begin{array}{ll} p, &\mbox{ if}\ \ i = {i}^{{_\ast}}\ \mbox{ and}\ a = {f}^{1}({i}^{{_\ast}}), \\ 1 - p,&\mbox{ if}\ \ i = {i}^{{_\ast}}\ \mbox{ and}\ a = {f}^{2}({i}^{{_\ast}}), \\ 1, &\mbox{ if}\ \ i\neq {i}^{{_\ast}}\ \mbox{ and}\ a = {f}^{1}(i) = {f}^{2}(i). \end{array} \right.& &\end{array}$$
(11.9)

Proof.

See Sect. 11.4. □ 

11.3 Technical Preliminaries

This section provides some technical preliminaries necessary for the proof of Theorem 11.2.1 in Sect. 11.4.

To begin with, we define the w-norm for every real-valued function u on E by

$$\begin{array}{rcl} \|{u\|}_{w} :=\sup\limits_{i\in E}\vert u(i)\vert /w(i),& & \\ \end{array}$$

where w is the so-called weight function on E as in Assumption 11.2.3. Let

$$\begin{array}{rcl}{ \mathbb{B}}_{w}(E) :=\{ u :\| {u\|}_{w} < \infty \}& & \\ \end{array}$$

be the space of w-bounded functions on E.

Lemma 11.3.1.

Suppose that Assumptions 11.2.1 and 11.2.3 hold. Then:

  1. (a)

    For each i ∈ E and π ∈ Π,

    $$\begin{array}{rcl} \vert {V }_{r}(i,\pi )\vert \leq Mw(i)/(1 - \beta ),\ \ \vert {V }_{c}(i,\pi )\vert \leq Mw(i)/(1 - \beta ).& & \\ \end{array}$$

    Hence, \({V }_{r}(\cdot,\pi ) \in {\mathbb{B}}_{w}(E)\) , and \({V }_{c}(\cdot,\pi ) \in {\mathbb{B}}_{w}(E).\)

  2. (b)

    For all i ∈ E, π ∈ Π, and \(u \in {\mathbb{B}}_{w}(E),\)

    $$ \begin{array}{rcl} \lim\limits_{n\rightarrow \infty }{E}_{i}^{\pi }\left[{\prod \nolimits }_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}u({J}_{n})\right] = 0,& & \\ \end{array}$$

    where \({1}_{D}\) is the indicator function on a set D.

Proof.

  1. (a)

    By the definition of V r (i, π), we see that V r (i, π) can be expressed as follows:

    $$\begin{array}{rcl} \!\!\!{V }_{r}(i,\pi )& & \\ & =&{E}_{i}^{\pi }\left[{\int \nolimits \nolimits }_{0}^{{\tau }_{B} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }r(Z(t),W(t))\mathrm{d}t\right] \\ & =&{E}_{i}^{\pi }\left[{\int \nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }{1}_{\{{\tau }_{B}>t\}}r(Z(t),W(t))\mathrm{d}t\right] \\ & =& {E}_{i}^{\pi}\left[\sum\limits_{n=0}^{\infty }{\int \nolimits \nolimits }_{{T}_{n}}^{{T}_{n+1} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }\mathrm{d}t{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n})\right] \\ &=& {E}_{i}^{\pi }\left[\sum\limits_{n=0}^{\infty }{\int \nolimits \nolimits }_{0}^{{X}_{n+1} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{{T}_{n}+t}\alpha (Z(u),W(u))\mathrm{d}u }\mathrm{d}t{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n})\right] \\ & =& {E}_{i}^{\pi }\left[\sum\limits_{n=0}^{\infty }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{{T}_{n}}\alpha (Z(u),W(u))\mathrm{d}u }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n})\right. \\ & &\qquad \left.{\int \nolimits \nolimits }_{0}^{{X}_{n+1} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{{T}_{n}}^{{T}_{n}+t}\alpha (Z(u),W(u))\mathrm{d}u }\mathrm{d}t\right] \\ & =& {E}_{i}^{\pi }\left[{\sum }_{n=0}^{\infty }\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n}){\int \nolimits \nolimits }_{0}^{{X}_{n+1} }{\mathrm{e}}^{-\alpha ({J}_{n},{A}_{n})t}\mathrm{d}t\right] \\ & =& {E}_{i}^{\pi}\left[\sum\limits_{n=0}^{\infty }{E}_{ i}^{\pi }\left[\prod\limits _{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n})\right.\right. \\ & &\left.\left. \qquad \times {\int \nolimits\nolimits }_{0}^{{X}_{n+1} }{\mathrm{e}}^{-\alpha ({J}_{n},{A}_{n})t}\mathrm{d}t\vert {X}_{ 0},{J}_{0},{A}_{0},\ldots,{X}_{n},{J}_{n},{A}_{n}\right]\right] \\ &=& {E}_{i}^{\pi }\left[\sum\limits_{n=0}^{\infty }\prod\limits _{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n})\right. \\ & & \qquad \left.\times {E}_{i}^{\pi}\left[{\int \nolimits \nolimits }_{0}^{{X}_{n+1} }{\mathrm{e}}^{-\alpha ({J}_{n},{A}_{n})t}\mathrm{d}t\vert {X}_{ 0},{J}_{0},{A}_{0},\ldots,{X}_{n},{J}_{n},{A}_{n}\right]\right] \\ &=& {E}_{i}^{\pi }\left[\sum\limits_{n=0}^{\infty }\prod\limits _{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n}) \right.\\ & & \left.\qquad \times {\int \nolimits\nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha ({J}_{n},{A}_{n})t}(1 - D(t\mid {J}_{ n},{A}_{n}))\mathrm{d}t\right] \\ & =& {E}_{i}^{\pi}\left[\sum\limits_{n=0}^{\infty }\prod\limits _{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\right], \end{array}$$
    (11.10)

    where the third equality follows from Assumption 11.2.1 and the ninth equality is due to the property (11.4).

    We now show that for each n = 0, 1, …,

    $$\begin{array}{rcl}{ E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}w({J}_{n})\right] \leq {\beta }^{n}w(i).& & \end{array}$$
    (11.11)

    Indeed, (11.11) is trivial for n = 0. Now, for n ≥ 1, it follows from the property (11.4) and Assumption 11.2.3(b) that

    $$\begin{array}{rcl} & &{E}_{i }^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in{B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}w({J}_{n})\right] \\& =&{E}_{i}^{\pi }\left[{E}_{ i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}w({J}_{n})\right.\right. \\& & \qquad \left.\mid{T}_{0},{J}_{0},{A}_{0},\ldots,{T}_{n-1},{J}_{n-1},{A}_{n-1}]\right] \\& =& {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-2}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n-1}\in{B}^{c}\}}{E}_{i}^{\pi }\left[{\mathrm{e}}^{-\alpha({J}_{n-1},{A}_{n-1}){X}_{n} }{1}_{\{{J}_{n}\in {B}^{c}\}}w({J}_{n})\right.\right.\\& & \left.\left.\qquad \mid{T}_{0},{J}_{0},{A}_{0},\ldots,{T}_{n-1},{J}_{n-1},{A}_{n-1}\right]\right]\\& =& {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-2}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n-1}\in {B}^{c}\}} \right.\\& &\left.\qquad \times {\sum \nolimits }_{j\in {B}^{c}}{ \int \nolimits\nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha({J}_{n-1},{A}_{n-1})t}w(j)Q(\mathrm{d}t,j\mid {J}_{n-1},{A}_{n-1})\right] \\& =& {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-2}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n-1}\in {B}^{c}\}}{ \sum\nolimits }_{j\in {B}^{c}}w(j)m(j\mid {J}_{n-1},{A}_{n-1})\right] \\& \leq & \beta {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-2}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n-1}\in{B}^{c}\}}w({J}_{n-1})\right]. \end{array}$$
    (11.12)

    Iterating (11.12) yields (11.11).

    Moreover, observe that Assumption 11.2.3(a) and (11.11) give

    $$\begin{array}{rcl} & & {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}\vert \widetilde{r}({J}_{n},{A}_{n})\vert \right] \\ & \leq & M{E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}w({J}_{n})\right] \\ & \leq & M{\beta }^{n}w(i), \\ \end{array}$$

    which together with (11.10) yields

    $$\begin{array}{rcl} \vert {V }_{r}(i,\pi )\vert &\leq &\sum\limits_{n=0}^{\infty }{E}_{ i}^{\pi }\left[\prod\limits _{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}\vert \widetilde{r}({J}_{n},{A}_{n})\vert \right] \\ & \leq &\sum\limits_{n=0}^{\infty }M{\beta }^{n}w(i) = Mw(i)/(1 - \beta ).\end{array}$$

    Thus, we get

    $$ \begin{array}{rcl} \sup\limits_{i\in E}\vert {V }_{r}(i,\pi )\vert /w(i) \leq M/(1 - \beta ),& & \\ \end{array}$$

    which shows that \({V }_{r}(\cdot,\pi ) \in {\mathbb{B}}_{w}(E)\).

    Similarly, the conclusion for V c ( ⋅, π) can be obtained.

  2. (b)

    Since \(\vert u(i)\vert \leq \| {u\|}_{w}w(i)\) for all i ∈ E, it follows from (11.11) above that

    $$\begin{array}{rcl} & & {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}\vert u({J}_{n})\vert \right] \\ & & \quad \leq \| {u\|}_{w}{E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}w({J}_{n})\right] \leq \| {u\|}_{w}{\beta }^{n}w(i), \\ \end{array}$$

    and so part (b) follows. □ 

Remark 11.3.8.

In fact, Lemma 11.3.1 here for first passage criteria in SMDPs is similar to Lemma 3.1 in Huang and Guo [15]. The main difference is that the discount rate α(i, a) here is state-action dependent and the reward rate is possibly unbounded, whereas those in Huang and Guo [15] are not.

For φ ∈ Φ, we define the dynamic programming operators H φ and H on \({\mathbb{B}}_{w}(E)\) as follows: for \(u \in {\mathbb{B}}_{w}(E)\), if i ∈ B c,

$$\begin{array}{rcl}{ H}^{\varphi }u(i)& :=& {\int \nolimits \nolimits }_{a\in A(i)}\left[\widetilde{r}(i,a) + \sum\limits_{j\in {B}^{c}}u(j)m(j\mid i,a)\right]\varphi (\mathrm{d}a\mid i),\end{array}$$
(11.13)
$$\begin{array}{rcl} Hu(i)& :=& \sup\limits_{a\in A(i)}\left[\widetilde{r}(i,a) + \sum\limits_{j\in {B}^{c}}u(j)m(j\mid i,a)\right],\end{array}$$
(11.14)

and if i ∈ B, H φ u(i) = Hu(i) : = 0.
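
Since H will be shown to be a contraction on \(({\mathbb{B}}_{w}(E),\|\cdot \|_{w})\) (Lemma 11.3.2 below), V r ∗ can in principle be approximated by fixed-point iteration of (11.14). The sketch below is a purely illustrative implementation under the same finite-truncation, exponential-sojourn assumptions as the earlier snippets; it also extracts a greedy selector as in Lemma 11.3.2(b).

```python
def _q(model, u, i, a):
    """The bracketed term in (11.13)-(11.14), specialized to exponential
    sojourn times: r~(i, a) + sum_{j in B^c} u(j) m(j | i, a)."""
    lam, alpha = model.rate(i, a), model.discount(i, a)
    r_tilde = model.reward(i, a) / (lam + alpha)
    cont = sum(u[j] * p * lam / (lam + alpha)
               for j, p in model.trans(i, a).items()
               if j not in model.target)
    return r_tilde + cont

def value_iteration(model, n_iter=500):
    """Fixed-point iteration u <- Hu of (11.14) on a finite truncation of E,
    with u kept equal to 0 on B; returns the approximate V_r^* together with
    a greedy stationary policy, cf. Lemma 11.3.2(b)."""
    Bc = [i for i in model.states if i not in model.target]
    u = {i: 0.0 for i in model.states}
    for _ in range(n_iter):
        u = {i: (max(_q(model, u, i, a) for a in model.actions(i))
                 if i in Bc else 0.0)
             for i in model.states}
    f_star = {i: max(model.actions(i), key=lambda a: _q(model, u, i, a))
              for i in Bc}
    return u, f_star
```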

Lemma 11.3.2.

Suppose that Assumptions 11.2.1 and 11.2.3 hold. Then:

  1. (a)

    For each φ ∈ Φ, V r (⋅,φ) is the unique solution in \({\mathbb{B}}_{w}(E)\) to the equation

    $$\begin{array}{rcl}{ V }_{r}(i,\varphi ) = {H}^{\varphi }{V }_{ r}(i,\varphi )\ \forall i \in E.& & \\ \end{array}$$
  2. (b)

    If, in addition, Assumption 11.2.4 also holds, then V r ∗ (i) := sup π∈Π V r (i,π) is the unique solution in \({\mathbb{B}}_{w}(E)\) to the equation

    $$\begin{array}{rcl}{ V }_{r}^{{_\ast}}(i) = H{V }_{ r}^{{_\ast}}(i)\ \forall i \in E.& & \\ \end{array}$$

    Moreover, there exists an \({f}^{{_\ast}}\in \mathbb{F}\) such that \({V }_{r}^{{_\ast}}(i) = {H}^{{f}^{{_\ast}} }{V }_{r}^{{_\ast}}(i)\) , and such a policy \({f}^{{_\ast}}\in \mathbb{F}\) satisfies V r (i,f ∗ ) = V r ∗ (i) for every i ∈ E.

Proof.

  1. (a)

    From Lemma 11.3.1, it is clear that \({V }_{r}(\cdot,\varphi ) \in {\mathbb{B}}_{w}(E)\). We now establish the equation V r (i, φ) = H φ V r (i, φ). It is obviously true when i ∈ B, and for i ∈ B c, by (11.10), we have

    $$\begin{array}{rcl} & &{V }_{r } (i,\varphi ) \\& =&{E}_{i}^{\varphi }\left[\sum\limits_{ n=0}^{\infty }{\prod \nolimits}_{ k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\right] \\& =&{E}_{i}^{\varphi }\left[{E}_{ i}^{\varphi }\left[\sum\limits_{n=0}^{\infty }\prod\limits_{ k=0}^{n-1}{\mathrm{e}}^{-\alpha({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in{B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\mid{T}_{0},{J}_{0},{A}_{0},{T}_{1},{J}_{1}\right]\right] \\& =&{E}_{i}^{\varphi }\left[{1}_{\{{ J}_{0}\in{B}^{c}\}}\widetilde{r}({J}_{0},{A}_{0}) +{ \mathrm{e}}^{-\alpha({J}_{0},{A}_{0}){X}_{1} }{1}_{\{{J}_{0}\in {B}^{c},{J}_{1}\in{B}^{c}\}}{E}_{i}^{\varphi }\left[\sum\limits_{ n=1}^{\infty }{\prod\nolimits }_{ k=1}^{n-1}{\mathrm{e}}^{-\alpha({J}_{k},{A}_{k}){X}_{k+1} }\right.\right. \\& &\left.\left. \qquad {1}_{\{{J}_{2}\in{B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\mid{T}_{0},{J}_{0},{A}_{0},{T}_{1},{J}_{1}\right]\right] \\& =& {\int\nolimits \nolimits }_{a\in A(i)}\varphi (\mathrm{d}a\mid i)\left[\widetilde{r}(i,a)+\sum\limits_{j\in {B}^{c}}{ \int\nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha(i,a)t}Q(\mathrm{d}t,j\mid i,a){E}_{ i}^{\varphi }\left[\sum\limits_{n=1}^{\infty }\prod\limits_{ k=1}^{n-1}{\mathrm{e}}^{-\alpha({J}_{k},{A}_{k}){X}_{k+1} }\right.\right. \\& & \left.\left. {1}_{\{{J}_{0}\in{B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\mid {T}_{0} = 0,{J}_{0} =i,{A}_{0} = a,{T}_{0} = t,{J}_{1} = j\right]\right] \\& =& {\int\nolimits \nolimits }_{a\in A(i)}\varphi (\mathrm{d}a\mid i)\left[\widetilde{r}(i,a) +\sum\limits_{j\in {B}^{c}}m(j\mid i,a){E}_{j}^{\varphi }\left[\sum\limits_{ n=0}^{\infty }{\prod\nolimits }_{ k=0}^{n-1}{\mathrm{e}}^{-\alpha({J}_{k},{A}_{k}){X}_{k+1} }\qquad \qquad \qquad \qquad \qquad\qquad \quad \qquad \right.\right.\\& & \left.\qquad \qquad \quad \qquad{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\right] \\& =& {\int\nolimits \nolimits }_{a\in A(i)}\varphi (\mathrm{d}a\mid i)\left[\widetilde{r}(i,a) +\sum\limits_{j\in {B}^{c}}m(j\mid i,a){V }_{r}(j,\varphi )\right],\end{array}$$

    where the fifth equality is due to the properties (11.2)–(11.4) and the fact that the policy φ is Markov. Hence, we obtain that V r (i, φ) = H φ V r (i, φ) for all i ∈ E.

    To complete the proof, we need only show that H φ is a contraction from \({\mathbb{B}}_{w}(E)\) to \({\mathbb{B}}_{w}(E)\), and thus H φ has a unique fixed point in \({\mathbb{B}}_{w}(E)\). Indeed, for an arbitrary function \(u \in {\mathbb{B}}_{w}(E)\), by Assumption 11.2.3 it is clear that

    $$\begin{array}{rcl} & & \vert {H}^{\varphi }u(i)\vert \\ & =& \left \vert {\int \nolimits \nolimits }_{a\in A(i)}\left[\widetilde{r}(i,a) + \sum\limits_{j\in {B}^{c}}u(j)m(j\mid i,a)\right]\varphi (\mathrm{d}a\mid i)\right \vert \\ &\leq &{\int \nolimits \nolimits }_{a\in A(i)}\left[\vert \widetilde{r}(i,a)\vert + \sum\limits_{j\in {B}^{c}}\vert u(j)\vert m(j\mid i,a)\right]\varphi (\mathrm{d}a\mid i) \\ & \leq &{\int \nolimits \nolimits }_{a\in A(i)}\left[Mw(i) +\| {u\|}_{w}\beta w(i)\right]\varphi (\mathrm{d}a\mid i) \\ & =& (M + \beta \|{u\|}_{w})w(i)\ \forall i \in {B}^{c}, \\ \end{array}$$

    which implies that \({H}^{\varphi }u \in {\mathbb{B}}_{w}(E)\), that is, H φ maps \({\mathbb{B}}_{w}(E)\) to itself. Moreover, for any \(u,u^{\prime} \in {\mathbb{B}}_{w}(E)\), we have

    $$\begin{array}{rcl} & &\vert {H}^{\varphi } u(i) - {H}^{\varphi }u^{\prime}(i)\vert \\ & =& \left \vert {\int \nolimits \nolimits }_{a\in A(i)}\left[\sum\limits_{j\in {B}^{c}}(u(j) - u^{\prime}(j))m(j\mid i,a)\right]\varphi (\mathrm{d}a\mid i)\right \vert \\ &\leq &{\int \nolimits \nolimits }_{a\in A(i)}\left[\sum\limits_{j\in {B}^{c}}\vert u(j) - u^{\prime}(j)\vert m(j\mid i,a)\right]\varphi (\mathrm{d}a\mid i) \\ & \leq &{\int \nolimits \nolimits }_{a\in A(i)}\left[\|u - {u^{\prime}\|}_{w}\beta w(i)\right]\varphi (\mathrm{d}a\mid i) \\ & =& \beta \|u - {u^{\prime}\|}_{w}w(i)\ \forall i \in {B}^{c}.\end{array}$$

    Hence, \(\|{H}^{\varphi }u - {H}^{\varphi }{u^{\prime}\|}_{w} \leq \beta \|u - {u^{\prime}\|}_{w}\), and thus H φ is a contraction from \({\mathbb{B}}_{w}(E)\) to itself. By Banach’s Fixed Point Theorem, H φ has a unique fixed point in \({\mathbb{B}}_{w}(E)\), and so the proof of part (a) is complete.

  2. (b)

    Under Assumption 11.2.4, arguing as in the proof of part (a) shows that H is a contraction from \({\mathbb{B}}_{w}(E)\) to itself, and thus, by Banach’s Fixed Point Theorem, H has a unique fixed point u  ∗  in \({\mathbb{B}}_{w}(E)\), that is, Hu  ∗  = u  ∗ . Hence, to prove part (b), we need to show that: (b 1) \({V }_{r}^{{_\ast}}\in {\mathbb{B}}_{w}(E)\), with w-norm \(\|{V }_{r}^{{_\ast}}\|_{w} \leq M/(1 - \beta )\). (b 2) V r  ∗  = u  ∗ .

In fact, (b 1) is an immediate result of Lemma 11.3.1(a). Thus, it remains to prove (b 2). To this end, we show that u  ∗  ≤ V r  ∗  and u  ∗  ≥ V r  ∗  as below, respectively. It is clear that u  ∗ (i) = V r  ∗ (i) = 0 for every i ∈ B. Hence, in the following, we restrict our argument to the case of i ∈ B c.

  1. (i)

    This part is to show that u  ∗  ≤ V r  ∗ . By the equality u  ∗  = Hu  ∗  and the measurable selection theorem [10, Proposition D.5, p.182], there exists an \(f \in \mathbb{F}\) such that

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i) =\widetilde{ r}(i,f) + \sum\limits_{j\in {B}^{c}}{u}^{{_\ast}}(j)m(j\mid i,f)\ \forall i \in {B}^{c}.& & \end{array}$$
    (11.15)

    Iteration of (11.15) yields

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i)& =& {E}_{ i}^{f}\left[\sum\limits_{m=0}^{n-1} \prod\limits_{k=0}^{m-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{m}\in {B}^{c}\}}\widetilde{r}({J}_{m},f)\right] \\ & & +{E}_{i}^{f}\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n})\right]\ \ \forall i \in {B}^{c},\ n = 1,2\ldots,\\ \end{array}$$

    and letting n → ∞ we get, by Lemma 11.3.1(b),

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i) = {E}_{ i}^{f}\left[\sum\limits_{m=0}^{\infty }\prod\limits_{k=0}^{m-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{m}\in {B}^{c}\}}\widetilde{r}({J}_{m},f)\right] = {V }_{r}(i,f)\ \ \forall i \in {B}^{c}.& & \\ \end{array}$$

    Thus, by the definition of V r  ∗ , we see that u  ∗  ≤ V r  ∗ .

  2. (ii)

    This part is to show that u  ∗  ≥ V r  ∗ . Note that u  ∗  = Hu  ∗  implies that

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i) \geq \widetilde{ r}(i,a) + \sum\limits_{j\in {B}^{c}}{u}^{{_\ast}}(j)m(j\mid i,a)\ \ \forall i \in {B}^{c},\ \ a \in A(i),& & \end{array}$$
    (11.16)

    which gives

    $$\begin{array}{rcl}{ 1}_{\{{J}_{n}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n}) \geq {1}_{\{{J}_{n}\in {B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n}) + {1}_{\{{J}_{n}\in {B}^{c}\}} \sum\limits_{j\in {B}^{c}}{u}^{{_\ast}}(j)m(j\mid {J}_{ n},{A}_{n})\ \ \forall n \geq 0.& & \end{array}$$
    (11.17)

    Hence, for any initial state i ∈ B c and policy π ∈ Π, using properties (11.2)–(11.4) yields

    $$\begin{array}{rcl}{ 1}_{\{{J}_{n}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n})& \geq & {E}_{i}^{\pi }\left[{1}_{\{{ J}_{n}\in {B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n}) +{ \mathrm{e}}^{-\alpha ({J}_{n},{A}_{n}){X}_{n+1} }{1}_{\{{J}_{n}\in {B}^{c},{J}_{n+1}\in {B}^{c}\}} \right.\\ & &\left. \qquad \times {u}^{{_\ast}}({J}_{ n+1})\mid {T}_{0},{J}_{0},{A}_{0},\ldots,{T}_{n},{J}_{n},{A}_{n}\right]\forall n = 0,1\ldots,\end{array}$$

    which gives

    $$\begin{array}{rcl} & & \prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n}) \\ & & \quad \geq {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n}) \right.\\ & & \left.\qquad + \prod\limits_{k=0}^{n}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n+1}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n+1})\mid {T}_{0},{J}_{0},{A}_{0},\ldots,{T}_{n},{J}_{n},{A}_{n}\right] \\ & & \forall n = 0,1\ldots. \end{array}$$

    Therefore, taking expectation E i π and summing over m = 0, 1, …, n − 1, we obtain

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i)& \geq & {E}_{ i}^{\pi }\left[\sum\limits_{m=0}^{n-1} \prod\limits_{k=0}^{m-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{m}\in {B}^{c}\}}\widetilde{r}({J}_{m},{A}_{m})\right] \\ & & +{E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n})\right],\ \ \forall n = 1,2\ldots.\end{array}$$

    Finally, letting n → ∞ in the latter inequality and using Lemma 11.3.1(b), it follows that

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i) \geq {V }_{ r}(i,\pi )& & \\ \end{array}$$

    so that, as i and π were arbitrary, we conclude that u  ∗  ≥ V r  ∗ .

Combining (i) with (ii) yields that u  ∗  = V r  ∗ , and thus we have V r  ∗  = HV r  ∗ .

Finally, it follows from V r  ∗  = HV r  ∗  and the measurable selection theorem that there exists an \({f}^{{_\ast}}\in \mathbb{F}\) such that \({V }_{r}^{{_\ast}} = {H}^{{f}^{{_\ast}} }{V }_{r}^{{_\ast}}\). This fact together with part (a) implies that \({V }_{r}(\cdot,{f}^{{_\ast}}) = {V }_{r}^{{_\ast}}\). □ 

Remark 11.3.9.

Note that Lemma 11.3.2 also holds for the case of the expected first passage cost V c accordingly.

Note that \(\mathbb{F}\) can be written as the product space \(\mathbb{F} ={ \prod \nolimits }_{i\in E}A(i)\). Hence, by Assumption 11.2.4(a) and Tychonoff’s theorem, \(\mathbb{F}\) is a compact metric space.

Lemma 11.3.3.

Suppose that Assumptions 11.2.1–11.2.4 and 11.2.5(a) hold. Then the functions V r (μ,f) and V c (μ,f) are continuous in \(f \in \mathbb{F}.\)

Proof.

We only prove the continuity of V r (μ, f) in \(f \in \mathbb{F}\) because the other case is similar. Let f n  → f as n → ∞ and fix any i ∈ E. Let v(i) : = limsup n→∞  V r (i, f n ). Then, by Theorem 4.4 in [1], there exists a subsequence \(\{{V }_{r}(i,{f}_{{n}_{m}})\}\) (depending on i) of {V r (i, f n )} such that \({V }_{r}(i,{f}_{{n}_{m}}) \rightarrow v(i)\) as m → ∞. Then, by Lemma 11.3.1(a), we have \({V }_{r}(j,{f}_{{n}_{m}}) \in [-Mw(j)/(1 - \beta ),Mw(j)/(1 - \beta )]\) for all j ∈ E and m ≥ 1, and so \({V }_{r}(\cdot,{f}_{{n}_{m}})\) is in the product space ∏ j ∈ E [ − Mw(j) ∕ (1 − β), Mw(j) ∕ (1 − β)] for each m ≥ 1. Since E is denumerable, the Tychonoff theorem shows that the space ∏ j ∈ E [ − Mw(j) ∕ (1 − β), Mw(j) ∕ (1 − β)] is compact, and thus there exists a subsequence \(\{{V }_{r}(\cdot,{f}_{{n}_{k}})\}\) of \(\{{V }_{r}(\cdot,{f}_{{n}_{m}})\}\) converging to some point u in ∏ j ∈ E [ − Mw(j) ∕ (1 − β), Mw(j) ∕ (1 − β)], that is, \({\lim }_{k\rightarrow \infty }{V }_{r}(j,{f}_{{n}_{k}}) = u(j)\) for all j ∈ E, which, together with f n  → f and \({\lim }_{m\rightarrow \infty }{V }_{r}(i,{f}_{{n}_{m}}) = v(i)\), implies that

$$\begin{array}{rcl} \!\!\!\!\!\!\!\!\!\!\!\!\!v(i) = u(i),\ \lim\limits_{k\rightarrow \infty }{V }_{r}(j,{f}_{{n}_{k}}) = u(j), \mathrm{and}\lim\limits_{k\rightarrow \infty }{f}_{{n}_{k}}(j) = f(j),\ \ \mathrm{for\ all}\ j \in E.\ \ \ \ \ \ \ \ \ \ & &\end{array}$$
(11.18)

Moreover, by Lemma 11.3.1(a), we have

$$\begin{array}{rcl} \vert u(j)\vert \leq Mw(j)/(1 - \beta ),\ \ \ \mathrm{for\ all}\ j \in E,& &\end{array}$$
(11.19)

which implies that \(u \in {\mathbb{B}}_{w}(E)\).

On the other hand, for k ≥ 1 and the given i ∈ B c, by Lemma 11.3.2(a), we have

$$\begin{array}{rcl}{ V }_{r}(i,{f}_{{n}_{k}}) =\widetilde{ r}(i,{f}_{{n}_{k}}) + \sum\limits_{j\in {B}^{c}}{V }_{r}(j,{f}_{{n}_{k}})m(j\mid i,{f}_{{n}_{k}}).& &\end{array}$$
(11.20)

Then, under Assumptions 11.2.3 and 11.2.4, from (11.18)–(11.20) and Lemma 8.3.7 (i.e., the Generalized Dominated Convergence Theorem) in [11], we get

$$\begin{array}{rcl} u(i) =\widetilde{ r}(i,f) + \sum\limits_{j\in {B}^{c}}u(j)m(j\mid i,f).& &\end{array}$$
(11.21)

Thus, by Lemma 11.3.2(a) and (11.18), we conclude that

$$\begin{array}{rcl} \mathop{\lim\sup}\limits_{n\rightarrow \infty }{V }_{r}(i,{f}_{n}) = v(i) = u(i) = {V }_{r}(i,f).& &\end{array}$$
(11.22)

Similarly, we can prove that

$$\begin{array}{rcl} \mathop{\lim\inf}\limits_{n\rightarrow \infty }{V }_{r}(i,{f}_{n}) = {V }_{r}(i,f),& & \\ \end{array}$$

which together with (11.22) implies that

$$\mathop{\lim\sup}\limits_{n\rightarrow \infty }{V }_{r}(i,{f}_{n}) =\mathop{\lim\inf}\limits_{n\rightarrow \infty }{V }_{r}(i,{f}_{n}) = {V }_{r}(i,f),$$

and so

$$ \begin{array}{rcl} \lim\limits_{n\rightarrow \infty }{V }_{r}(i,{f}_{n}) = {V }_{r}(i,f).& &\end{array}$$
(11.23)

Moreover, noting that V r (i, f n ) = V r (i, f) = 0 for every i ∈ B and n ≥ 0, it follows from Assumption 11.2.5(a) and Lemma 8.3.7 in [11] again that

$$ \begin{array}{rcl} \lim\limits_{n\rightarrow \infty }{V }_{r } (\mu,{f}_{n})& =& \lim\limits_{n\rightarrow \infty }\sum\limits_{i\in E}[{V }_{r}(i,{f}_{n})]\mu (i) =\lim\limits_{n\rightarrow \infty }\sum\limits_{i\in {B}^{c}}[{V }_{r}(i,{f}_{n})]\mu (i) \\ & =& \sum\limits_{i\in {B}^{c}}{[\lim }_{n\rightarrow \infty }{V }_{r}(i,{f}_{n})]\mu (i) = \sum\limits_{i\in {B}^{c}}{V }_{r}(i,f)\mu (i) = {V }_{r}(\mu,f),\end{array}$$
(11.24)

which gives the stated result: V r (μ, f n ) → V r (μ, f), as n → ∞. □ 

To analyze the constrained control problem (recall Definition 11.2.3), we introduce a Lagrange multiplier λ ≥ 0 as follows. For each i ∈ E and a ∈ A(i), let

$$\begin{array}{rcl}{ b}^{\lambda }(i,a) := r(i,a) - \lambda c(i,a).& &\end{array}$$
(11.25)

Furthermore, for each policy π ∈ Π and i ∈ E, let

$$\begin{array}{rcl} & & {V }_{{b}^{\lambda }}(i,\pi ) := {E}_{i}^{\pi }\left[{\int \nolimits \nolimits }_{0}^{{\tau }_{B} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }{b}^{\lambda }(Z(t),W(t))\mathrm{d}t\right],\end{array}$$
(11.26)
$$\begin{array}{rcl} & & {V }_{{b}^{\lambda }}(\mu,\pi ) := \sum\limits_{j\in E}{V }_{{b}^{\lambda }}(j,\pi )\mu (j),\end{array}$$
(11.27)
$$\begin{array}{rcl} & & {V }_{{b}^{\lambda }}^{{_\ast}}(i) :=\sup\limits_{ \pi \in \Pi }{V }_{{b}^{\lambda }}(i,\pi ),{V }_{{b}^{\lambda }}^{{_\ast}}(\mu ) :=\sup\limits_{ \pi \in \Pi }{V }_{{b}^{\lambda }}(\mu,\pi ).\end{array}$$
(11.28)

Remark 11.3.10.

Notice that, for each i ∈ B, \({V }_{{b}^{\lambda }}(i,\pi ) = 0\). Therefore, we have

$$\begin{array}{rcl}{ V }_{{b}^{\lambda }}(\mu,\pi ) = \sum\limits_{j\in {B}^{c}}{V }_{{b}^{\lambda }}(j,\pi )\mu (j),\qquad {V }_{{b}^{\lambda }}^{{_\ast}}(\mu ) = \sum\limits_{j\in {B}^{c}}{V }_{{b}^{\lambda }}^{{_\ast}}(j)\mu (j).& & \\ \end{array}$$

Under Assumptions 11.2.1–11.2.4, by Lemma 11.3.2(b), we have

$${ V }_{{b}^{\lambda }}^{{_\ast}}(i) = \left \{\begin{array}{lll} 0, &\mbox{ for}\ i \in B, \\ \sup\limits_{a\in A(i)}\left[\widetilde{{b}^{\lambda }}(i,a) + \sum\limits_{j\in {B}^{c}}{V }_{{b}^{\lambda }}^{{_\ast}}(j)m(j\mid i,a)\right],&\mbox{ for}\ i \in {B}^{c}, \end{array} \right.$$
(11.29)

where \(\widetilde{{b}^{\lambda }}(i,a) := {b}^{\lambda }(i,a){\int \nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha (i,a)t}(1 - D(t\mid i,a))\mathrm{d}t\). Moreover, for each i ∈ E, the maximum in (11.29) is realized by some a ∈ A(i), that is,

$${ A}_{\lambda }^{{_\ast}}(i) := \left \{\begin{array}{lll} A(i), &\mbox{ for}\ i \in B, \\ \{a \in A(i)\mid {V }_{{b}^{\lambda }}^{{_\ast}}(i) =\widetilde{ {b}^{\lambda }}(i,a) + \sum\limits_{j\in {B}^{c}}{V }_{{b}^{\lambda }}^{{_\ast}}(j)m(j\mid i,a)\},&\mbox{ for}\ i \in {B}^{c} \end{array} \right.$$
(11.30)

is nonempty. In other words, the following sets

$$\begin{array}{rcl}{ \mathbb{F}}_{\lambda }^{{_\ast}} := \{f \in \mathbb{F}\mid f(i) \in {A}_{ \lambda }^{{_\ast}}(i)\ \forall i \in E\}& &\end{array}$$
(11.31)

and

$$\begin{array}{rcl}{ \Phi }^{\lambda } := \{\varphi \in \Phi \mid \varphi ({A}_{ \lambda }^{{_\ast}}(i)\mid i) = 1\ \forall i \in E\}& &\end{array}$$
(11.32)

are nonempty.

The next lemma shows that Φ λ is convex.

Lemma 11.3.4.

Under Assumptions 11.2.1–11.2.4, the set Φ λ is convex.

Proof.

For each φ1, φ2 ∈ Φ λ, and p ∈ [0, 1], let

$$\begin{array}{rcl}{ \varphi }^{p}(\cdot \mid i) := p{\varphi }_{ 1}(\cdot \mid i) + (1 - p){\varphi }_{2}(\cdot \mid i),\ \forall i \in E.& &\end{array}$$
(11.33)

Hence, φ p (A λ  ∗ (i)∣i) = pφ 1 (A λ  ∗ (i)∣i) + (1 − p)φ 2 (A λ  ∗ (i)∣i) = 1 for all i ∈ E, so that φ p  ∈ Φ λ, and thus Φ λ is convex. □ 

Notation. For each λ ≥ 0, we take an arbitrary, but fixed policy \({f}^{\lambda } \in {\mathbb{F}}_{\lambda }^{{_\ast}}\), and denote V r (μ, f λ), V c (μ, f λ), and \({V }_{{b}^{\lambda }}(\mu,{f}^{\lambda })\) by V r (λ), V c (λ), and V b (λ), respectively. By Lemma 11.3.2, we have that \({V }_{{b}^{\lambda }}(i,f) = {V }_{{b}^{\lambda }}^{{_\ast}}(i)\) for all i ∈ E and \(f \in {\mathbb{F}}_{\lambda }^{{_\ast}}\). Hence, \({V }_{b}(\lambda ) = {V }_{{b}^{\lambda }}(\mu,{f}^{\lambda }) = {V }_{{b}^{\lambda }}^{{_\ast}}(\mu )\).

Lemma 11.3.5.

If Assumptions 11.2.3, 11.2.4, and 11.2.5(a) hold, then V c (λ) is nonincreasing in λ ∈ [0,∞).

Proof.

By (11.5), (11.6), (11.25), and (11.26), for each π ∈ Π we obtain

$$\begin{array}{rcl}{ V }_{{b}^{\lambda }}(\mu,\pi ) = {V }_{r}(\mu,\pi ) - \lambda {V }_{c}(\mu,\pi )\ \forall \lambda \geq 0.& & \\ \end{array}$$

Moreover, noting that \({V }_{b}(\lambda ) = {V }_{{b}^{\lambda }}(\mu,{f}^{\lambda }) = {V }_{{b}^{\lambda }}^{{_\ast}}(\mu )\) for all λ ≥ 0 and \({f}^{\lambda } \in {\mathbb{F}}_{\lambda }^{{_\ast}}\), we have, for any h > 0,

$$\begin{array}{rcl} -h{V }_{c}(\lambda )& =& {V }_{{b}^{\lambda +h}}(\mu,{f}^{\lambda }) - {V }_{ b}(\lambda ) \\ & \leq & {V }_{b}(\lambda + h) - {V }_{b}(\lambda ) \\ & \leq & {V }_{b}(\lambda + h) - {V }_{{b}^{\lambda }}(\mu,{f}^{\lambda +h}) = -h{V }_{ c}(\lambda + h), \\ \end{array}$$

which gives that

$$\begin{array}{rcl} -h{V }_{c}(\lambda ) \leq -h{V }_{c}(\lambda + h).& & \\ \end{array}$$

Hence, V c (λ) is nonincreasing in λ ∈ [0, ∞). □ 

Lemma 11.3.6.

Suppose that Assumptions 11.2.1–11.2.4 hold. If lim k→∞ λ k = λ, and \({f}^{{\lambda }_{k}} \in {\mathbb{F}}_{{ \lambda }_{k}}^{{_\ast}}\) is such that \({\lim }_{k\rightarrow \infty }{f}^{{\lambda }_{k}} = f\) , then \(f \in {\mathbb{F}}_{\lambda }^{{_\ast}}.\)

Proof.

Since \({f}^{{\lambda }_{k}} \in {\mathbb{F}}_{{ \lambda }_{k}}^{{_\ast}}\), for each i ∈ B c and π ∈ Π, we have

$$\begin{array}{rcl}{ V }_{{b}^{{\lambda }_{k}}}^{{_\ast}}(i) = {V }_{ r}(i,{f}^{{\lambda }_{k} }) - {\lambda }_{k}{V }_{c}(i,{f}^{{\lambda }_{k} }) \geq {V }_{{b}^{{\lambda }_{k}}}(i,\pi ) = {V }_{r}(i,\pi ) - {\lambda }_{k}{V }_{c}(i,\pi ).& &\end{array}$$
(11.34)

Letting k → ∞ in (11.34) and by Lemma 11.3.3, we obtain

$$\begin{array}{rcl}{ V }_{{b}^{\lambda }}(i,f) \geq {V }_{{b}^{\lambda }}(i,\pi )\ \ \forall i \in {B}^{c}\ \ \mathrm{and}\ \ \pi \in \Pi,& & \\ \end{array}$$

which together with the fact that A λ  ∗ (i) = A(i) for each i ∈ B implies that \(f \in {\mathbb{F}}_{\lambda }^{{_\ast}}\). □ 

Under Assumptions 11.2.1–11.2.4 and 11.2.5(a), it follows from Lemma 11.3.5 that the following nonnegative constant

$$\begin{array}{rcl} \overline{\lambda } :=\inf \{ \lambda \geq 0\mid {V }_{c}(\lambda ) \leq \gamma \}& &\end{array}$$
(11.35)

is well defined.
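
Numerically, the monotonicity of V c (λ) established in Lemma 11.3.5 suggests approximating \(\overline{\lambda }\) by bisection, provided one can evaluate λ ↦ V c (λ), for instance by running the value-iteration sketch above on b λ = r − λc and evaluating the cost of the resulting greedy policy. The following sketch is our own illustration; eval_Vc is an assumed user-supplied routine, and \(\overline{\lambda }\) is shown to be finite in Lemma 11.3.7 below.

```python
def approx_lambda_bar(eval_Vc, gamma, lam_hi=1.0, tol=1e-6):
    """Bisection approximation of lambda_bar = inf{lam >= 0 : V_c(lam) <= gamma},
    using the monotonicity of V_c(lam) from Lemma 11.3.5.  `eval_Vc(lam)` is an
    assumed routine returning V_c(lam) = V_c(mu, f^lam) for some f^lam in F*_lam;
    lambda_bar is finite by Lemma 11.3.7."""
    if eval_Vc(0.0) <= gamma:
        return 0.0                      # the case lambda_bar = 0 in the proof of Theorem 11.2.1
    while eval_Vc(lam_hi) > gamma:      # enlarge the bracket; terminates since lambda_bar is finite
        lam_hi *= 2.0
    lam_lo = 0.0
    while lam_hi - lam_lo > tol:
        mid = 0.5 * (lam_lo + lam_hi)
        if eval_Vc(mid) > gamma:
            lam_lo = mid
        else:
            lam_hi = mid
    return lam_hi
```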

Lemma 11.3.7.

Suppose that Assumptions 11.2.1–11.2.5 hold. Then the constant \(\overline{\lambda }\) in (11.35) is finite, that is, \(\overline{\lambda } \in [0,\infty ).\)

Proof.

Suppose that \(\overline{\lambda } = \infty \). By Assumption 11.2.5(b), there exists a policy π′ ∈ Π such that V c (μ, π′) < γ. Let d : = γ − V c (μ, π′) > 0. Then, for any λ > 0, we have

$$\begin{array}{rcl}{ V }_{{b}^{\lambda }}(\mu,\pi ^{\prime}) = {V }_{r}(\mu,\pi ^{\prime}) - \lambda {V }_{c}(\mu,\pi ^{\prime}) = {V }_{r}(\mu,\pi ^{\prime}) - \lambda (\gamma - d).& &\end{array}$$
(11.36)

As \(\overline{\lambda } = \infty \), by (11.35) and Lemma 11.3.5, we obtain V c (λ) > γ for all λ > 0. Therefore, V b (λ) = V r (λ) − λV c (λ) < V r (λ) − λγ. Since \({V }_{b}(\lambda ) = {V }_{{b}^{\lambda }}^{{_\ast}}(\mu ) \geq {V }_{{b}^{\lambda }}(\mu,\pi ^{\prime})\), from (11.36), we have

$$\begin{array}{rcl}{ V }_{r}(\lambda ) - \lambda \gamma > {V }_{b}(\lambda ) \geq {V }_{{b}^{\lambda }}(\mu,\pi ^{\prime}) = {V }_{r}(\mu,\pi ^{\prime}) - \lambda (\gamma - d)\ \ \forall \lambda > 0,& &\end{array}$$
(11.37)

which gives

$$\begin{array}{rcl}{ V }_{r}(\lambda ) > {V }_{r}(\mu,\pi ^{\prime}) + \lambda d\ \ \ \forall \lambda > 0.& &\end{array}$$
(11.38)

On the other hand, by Lemma 11.3.1 and Assumption 11.2.5(a), we have

$$\begin{array}{rcl} \max \{\vert {V }_{r}(\mu,\pi ^{\prime})\vert,\vert {V }_{r}(\lambda )\vert \} \leq M\left[\sum\limits_{j\in {B}^{c}}w(j)\mu (j)\right]/(1 - \beta ) :=\tilde{ M} < \infty \end{array}$$
(11.39)

for all λ > 0. The latter inequality together with (11.38) gives that

$$\begin{array}{rcl} 2\tilde{M} > \lambda d\ \ \forall \lambda > 0,& &\end{array}$$
(11.40)

which is clearly a contradiction; for instance, take \(\lambda = 3\tilde{M}/d > 0\). Hence, \(\overline{\lambda }\) is finite. □ 

11.4 Proof of Theorem 11.2.1

In this section, we prove Theorem 11.2.1 by using the Lagrange approach and the following lemma.

Lemma 11.4.8.

If there exist λ 0  ≥ 0 and π ∗  ∈ U such that

$$\begin{array}{rcl}{ V }_{c}(\mu,{\pi }^{{_\ast}}) = \gamma \ \ \mbox{ and}\ \ {V }_{{ b}^{{\lambda }_{0}}}(\mu,{\pi }^{{_\ast}}) = {V }_{{ b}^{{\lambda }_{0}}}^{{_\ast}}(\mu ),& & \\ \end{array}$$

then π ∗  is constrained optimal.

Proof.

For any π ∈ U, since \({V }_{{b}^{{\lambda }_{0}}}(\mu,{\pi }^{{_\ast}}) = {V }_{{b}^{{\lambda }_{0}}}^{{_\ast}}(\mu ) \geq {V }_{{b}^{{\lambda }_{0}}}(\mu,\pi )\), we have

$$\begin{array}{rcl}{ V }_{r}(\mu,{\pi }^{{_\ast}}) - {\lambda }_{ 0}{V }_{c}(\mu,{\pi }^{{_\ast}}) \geq {V }_{ r}(\mu,\pi ) - {\lambda }_{0}{V }_{c}(\mu,\pi ).& &\end{array}$$
(11.41)

Noting that V c (μ, π ∗ ) = γ and V c (μ, π) ≤ γ (by π ∈ U), from (11.41), we get

$$\begin{array}{rcl}{ V }_{r}(\mu,{\pi }^{{_\ast}}) \geq {V }_{ r}(\mu,\pi ) + {\lambda }_{0}(\gamma - {V }_{c}(\mu,\pi )) \geq {V }_{r}(\mu,\pi )\ \forall \pi \in U,& & \\ \end{array}$$

which means that π ∗  is constrained optimal. □ 

Proof of Theorem  11.2.1. By Lemma 11.3.7, the constant \(\overline{\lambda } \in [0,\infty )\). Thus, we shall consider the two cases: \(\overline{\lambda } = 0\) and \(\overline{\lambda } > 0\).

The case of \(\overline{\lambda } = 0\): By (11.35), there exists a sequence \({f}^{{\lambda }_{k}} \in {\mathbb{F}}_{{ \lambda }_{k}}^{{_\ast}}\) such that λ k  ↓ 0 as k → ∞. Because \(\mathbb{F}\) is compact, we may assume that \({f}^{{\lambda }_{k}} \rightarrow \tilde{ f}\) without loss of generality. Thus, by Lemma 11.3.5, we have \({V }_{c}(\mu,{f}^{{\lambda }_{k}}) \leq \gamma \) for all k ≥ 1, and then it follows from Lemma 11.3.3 that \(\tilde{f} \in U\). Moreover, for each π ∈ U, we have that \({V }_{b}({\lambda }_{k}) = {V }_{{b}^{{\lambda }_{k}}}(\mu,{f}^{{\lambda }_{k}}) \geq {V }_{{ b}^{{\lambda }_{k}}}(\mu,\pi )\). Hence, by Lemma 11.3.1(a) and (11.39),

$$\begin{array}{rcl}{ V }_{r}(\mu,{f}^{{\lambda }_{k} }) - {V }_{r}(\mu,\pi ) \geq {\lambda }_{k}[{V }_{c}(\mu,{f}^{{\lambda }_{k} }) - {V }_{c}(\mu,\pi )] \geq -2{\lambda }_{k}\tilde{M}.& &\end{array}$$
(11.42)

Letting k → ∞ in (11.42), by Lemma 11.3.3, we have

$$\begin{array}{rcl}{ V }_{r}(\mu,\tilde{f}) - {V }_{r}(\mu,\pi ) \geq 0\ \forall \pi \in U,& & \\ \end{array}$$

which means that \(\tilde{f}\) is a constrained optimal stationary policy.

The case of \(\overline{\lambda } > 0\): First, if there is some λ′ ∈ (0, ∞) satisfying V c (λ′) = γ, then there exists an associated \({f}^{\lambda ^{\prime}} \in {\mathbb{F}}_{\lambda ^{\prime}}^{{_\ast}}\) such that V c (λ′) = V c (μ, f λ′ ) = γ and \({V }_{{b}^{\lambda ^{\prime}}}^{{_\ast}}(\mu ) = {V }_{{b}^{\lambda ^{\prime}}}(\mu,{f}^{\lambda ^{\prime}})\). Thus, by Lemma 11.4.8, f λ′  is a constrained optimal stationary policy.

Now, suppose that V c (λ)≠γ for all λ ∈ (0, ). Then, as \(\overline{\lambda }\) is in (0, ), there exist two nonnegative sequences {λ k } and {δ k } such that \({\lambda }_{k} \uparrow \overline{\lambda }\) and \({\delta }_{k} \downarrow \overline{\lambda }\), respectively. Since \(\mathbb{F}\) is compact, we may take \({f}^{{\lambda }_{k}} \in {\mathbb{F}}_{{ \lambda }_{k}}^{{_\ast}}\) and \({f}^{{\delta }_{k}} \in {\mathbb{F}}_{{ \delta }_{k}}^{{_\ast}}\) such that \({f}^{{\lambda }_{k}} \rightarrow {f}^{1} \in \mathbb{F}\) and \({f}^{{\delta }_{k}} \rightarrow {f}^{2} \in \mathbb{F}\), without loss of generality. By Lemma 11.3.6, we have that \({f}^{1},{f}^{2} \in {\mathbb{F}}_{\overline{\lambda }}^{{_\ast}}\). By Lemmas 11.3.3 and 11.3.4, we obtain that V c (μ, f 1) ≥ γ and V c (μ, f 2) ≤ γ. If V c (μ, f 1) = γ ( or V c (μ, f 2) = γ), by Lemma 11.4.8, we have that f 1 (or f 2) is a constrained optimal stationary policy. Hence, to complete the proof, we shall consider the following case:

$$\begin{array}{rcl}{ V }_{c}(\mu,{f}^{1}) > \gamma \ \ \mbox{ and}\ \ \ {V }_{ c}(\mu,{f}^{2}) < \gamma.& &\end{array}$$
(11.43)

Now using f 1 and f 2, we construct a sequence of stationary policies {f n } as follows. For each n ≥ 1 and i ∈ E, let

$$\begin{array}{rcl}{ f}_{n}(i) := \left \{\begin{array}{ll} {f}^{1}(i),&\mbox{ if}\ \ i < n; \\ {f}^{2}(i),&\mbox{ if}\ \ i \geq n, \end{array} \right.& & \\ \end{array}$$

where, without loss of generality, the denumerable state space is assumed to be the set {1, 2, …}. Obviously, \({f}_{1} = {f}^{2}\) and \({\lim }_{n\rightarrow \infty }{f}_{n} = {f}^{1}\). Hence, by Lemma 11.3.3, \({\lim }_{n\rightarrow \infty }{V }_{c}(\mu,{f}_{n}) = {V }_{c}(\mu,{f}^{1})\). Since \({f}^{1},{f}^{2} \in {\mathbb{F}}_{\overline{\lambda }}^{{_\ast}}\) (just mentioned), by (11.31), we see that \({f}_{n} \in {\mathbb{F}}_{\overline{\lambda }}^{{_\ast}}\) for all n ≥ 1. As \({f}_{1} = {f}^{2}\), by (11.43), we have \({V }_{c}(\mu,{f}_{1}) < \gamma \). If there exists n  ∗  such that \({V }_{c}(\mu,{f}_{{n}^{{_\ast}}}) = \gamma \), then by Lemma 11.4.8 and \({f}_{{n}^{{_\ast}}} \in {\mathbb{F}}_{\overline{\lambda }}^{{_\ast}}\), \({f}_{{n}^{{_\ast}}}\) is a constrained optimal stationary policy. Thus, in the remainder of this section, we may assume that V c (μ, f n ) ≠ γ for all n ≥ 1. If V c (μ, f n ) < γ for all n ≥ 1, then \({\lim }_{n\rightarrow \infty }{V }_{c}(\mu,{f}_{n}) = {V }_{c}(\mu,{f}^{1}) \leq \gamma \), which contradicts (11.43). Thus, there exists some n ≥ 1 such that V c (μ, f n ) > γ, which together with \({V }_{c}(\mu,{f}_{1}) < \gamma \) gives the existence of some \(\tilde{n}\) such that

$${V }_{c}(\mu,{f}_{\tilde{n}}) < \gamma \ \ \mbox{ and}\ \ \ {V }_{c}(\mu,{f}_{\tilde{n}+1}) > \gamma.$$
(11.44)

Obviously, the stationary policies \({f}_{\tilde{n}}\) and \({f}_{\tilde{n}+1}\) differ in at most the state \(\tilde{n}\). Here, it should be pointed out that \(\tilde{n}\) must be in B c. Indeed, if \(\tilde{n} \in B\), then \({f}_{\tilde{n}}\) and \({f}_{\tilde{n}+1}\) coincide on B c and \({V }_{c}(\tilde{n},{f}_{\tilde{n}}) = {V }_{c}(\tilde{n},{f}_{\tilde{n}+1}) = 0\), which implies that \({V }_{c}(\mu,{f}_{\tilde{n}}) = {V }_{c}(\mu,{f}_{\tilde{n}+1})\), contradicting (11.44).

For any p ∈ [0, 1], using the stationary policies \({f}_{\tilde{n}}\) and \({f}_{\tilde{n}+1}\), we construct a randomized stationary policy φp as follows. For each i ∈ E,

$$\begin{array}{rcl}{ \varphi }^{p}(a\mid i) = \left \{\begin{array}{ll} p, &\mbox{ if}\ \ a = {f}_{\tilde{n}}(\tilde{n})\ \mbox{ when}\ i =\tilde{ n}, \\ 1 - p,&\mbox{ if}\ \ a = {f}_{\tilde{n}+1}(\tilde{n})\ \mbox{ when}\ i =\tilde{ n}, \\ 1, &\mbox{ if}\ \ a = {f}_{\tilde{n}}(i)\ \mbox{ when}\ i\neq \tilde{n}. \end{array} \right.& &\end{array}$$
(11.45)

Since \({f}_{\tilde{n}},{f}_{\tilde{n}+1} \in {\mathbb{F}}_{\overline{\lambda }}^{{_\ast}}\), by Lemma 11.3.4, we have \({V }_{{ b}^{\overline{\lambda }}}(\mu,{\varphi }^{p}) = {V }_{{ b}^{\overline{\lambda }}}^{{_\ast}}(\mu )\) for all p ∈ [0, 1]. We also have that V c (μ, φ p ) is continuous in p ∈ [0, 1]. Indeed, for any p ∈ [0, 1] and any sequence {p m } in [0, 1] such that lim m→∞  p m  = p, as in the proof of Lemma 11.3.2, we have

$$\begin{array}{rcl} {V }_{c } (i,{\varphi }^{{p}_{m} }) = \sum\limits_{a\in A(i)}{\varphi }^{{p}_{m} }(a\mid i)\left[\widetilde{c}(i,a) + \sum\limits_{j\in {B}^{c}}{V }_{c}(j,{\varphi }^{{p}_{m} })m(j\mid i,a)\right]\ \forall i \in {B}^{c}.& &\end{array}$$
(11.46)

Hence, as in the proof of Lemma 11.3.3, from (11.45) and (11.46), we obtain

$$ \begin{array}{rcl} \lim\limits_{m\rightarrow \infty }{V }_{c}(\mu,{\varphi }^{{p}_{m} }) = {V }_{c}(\mu,{\varphi }^{p}),& & \\ \end{array}$$

and so V c (μ, φp) is continuous in p ∈ [0, 1].

Finally, let p 0 = 0 and p 1 = 1. Then, \({V }_{c}(\mu,{\varphi }^{{p}_{0}}) = {V }_{c}(\mu,{f}_{\tilde{n}+1}) > \gamma \) and \({V }_{c}(\mu,{\varphi }^{{p}_{1}}) = {V }_{c}(\mu,{f}_{\tilde{n}}) < \gamma \). Therefore, by the continuity of V c (μ, φp) in p ∈ [0, 1] there exists a p  ∗  ∈ (0, 1) such that \({V }_{c}(\mu,{\varphi }^{{p}^{{_\ast}} }) = \gamma \). Since \({V }_{{ b}^{\overline{\lambda }}}(\mu,{\varphi }^{{p}^{{_\ast}} }) = {V }_{{ b}^{\overline{\lambda }}}^{{_\ast}}(\mu )\), by Lemma 11.4.8, we have that \({\varphi }^{{p}^{{_\ast}} }\) is a constrained optimal stationary policy, which randomizes between the two stationary policies \({f}_{\tilde{n}}\) and \({f}_{\tilde{n}+1}\) that differ in at most the state \(\tilde{n} \in {B}^{c}\). □
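
Numerically, this last step of the proof amounts to a one-dimensional root-finding problem: by the continuity of p ↦ V c (μ, φ p ) and the bracketing V c (μ, φ 0 ) > γ > V c (μ, φ 1 ), a value p ∗ with V c (μ, φ p∗ ) = γ can be approximated by bisection. The following sketch is our own illustration; eval_cost is an assumed routine returning V c (μ, φ p ), e.g., via the Monte Carlo estimator above or by solving the policy-evaluation equation (11.46).

```python
def find_p_star(eval_cost, gamma, tol=1e-6):
    """Bisection for p* in (0, 1) with V_c(mu, phi^{p*}) = gamma, exploiting the
    continuity of p -> V_c(mu, phi^p) and the bracketing
    V_c(mu, phi^0) > gamma and V_c(mu, phi^1) < gamma.
    `eval_cost(p)` is an assumed routine returning V_c(mu, phi^p)."""
    lo, hi = 0.0, 1.0                   # cost > gamma at lo, cost < gamma at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eval_cost(mid) > gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```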