
11.1 Introduction

In the field of Markov decision problems (MDPs), the control horizon is usually a fixed finite interval [0, T] or the infinite interval [0, +∞). In many real applications, however, the control horizon may be a random duration [0, τ], where the terminal time τ is a random variable at which the state of the controlled system changes critically and control beyond τ may no longer be meaningful or necessary. For example, in insurance systems [27], control after the company goes bankrupt becomes unnecessary; it therefore makes better sense to consider the problem on [0, τ], where τ represents the bankruptcy time of the company. Such situations motivate first passage problems in MDPs [13, 15, 19, 21, 22, 28], for which one generally aims at maximizing/minimizing the expected reward/cost over a first passage time to some target set.

This chapter is devoted to studying constrained optimality for first passage criteria, where the dynamics of the system are described by semi-Markov decision processes (SMDPs). The state space is assumed to be denumerable, while the action set is general. Both reward and cost rates are possibly unbounded. A key feature of our model is that the discount rate is state-action dependent, and, furthermore, the undiscounted case is allowed. This makes our model more general, since a state-action-dependent discount rate captures practical quantities such as the interest rate in economic and financial systems [2, 9, 17, 23, 26], which can be adjusted according to the prevailing circumstances. We aim to maximize the expected reward obtained during a first passage time to some target set, subject to the constraint that the associated expected cost over this first passage time does not exceed a given bound. An interesting special case is the one in which the reward rates are uniformly equal to one, which corresponds to a stochastic time optimal control problem with a target set; see Remark 11.2.4(d) for details.

Previously, Beutler and Ross [3] consider constrained SMDPs with the long-run average criterion. They suppose that the state space of the SMDP is finite and that the action space is a compact metric space. A Lagrange multiplier formulation involving a dynamic programming equation is used to relate the constrained optimization problem to an unconstrained one parametrized by the multiplier. This approach leads to a proof of the existence of a semi-simple optimal constrained policy; that is, there is at most one state at which the action is randomized between two possibilities, while at every other state a single action is chosen. Feinberg [4] further investigates constrained average reward SMDPs with finite state and action sets, develops a technique of state-action renewal intensities, and provides linear programming algorithms for the computation of optimal policies. On the other hand, Feinberg [5] deals with constrained infinite horizon discounted SMDPs. In contrast with the works above, our main interest in this chapter is to analyze constrained optimality for first passage criteria in SMDPs, which, to the best of our knowledge, is an issue not yet explored.

To obtain the existence of a constrained first passage optimal policy, we first give suitable conditions and then employ the so-called Lagrange multiplier technique to analyze the constrained control problem. Based on this technique, we transform the constrained control problem into an unconstrained one, prove that a constrained optimal policy exists, and show that the constrained optimal policy randomizes between two stationary policies differing in at most one state.

The rest of this chapter is organized as follows. In Sect. 11.2, we formulate the control model, followed by the optimality conditions and the main result on the existence of constrained optimal policies. In Sect. 11.3, some technical preliminaries are given, and the proof of the main result is presented in Sect. 11.4.

11.2 The Control Model

The model of constrained SMDPs considered in this chapter is specified by the eight objects

$$\{E,B,(A(i) \subset A,i \in E),Q(\cdot,\cdot \mid i,a),r(i,a),c(i,a),\alpha (i,a),\gamma \},$$
(11.1)

where E is the state space, a denumerable set; B ⊂ E is the given target set, such as the set of all bad states or of all good states of a system; A is the action space, a Borel space endowed with the Borel σ-field \(\mathcal{A}\); and \(A(i) \in \mathcal{A}\) is the set of admissible actions at state i ∈ E. The transition mechanism of the SMDP is defined by the semi-Markov kernel Q( ⋅,  ⋅ ∣ i, a) on R_+ × E given K, where R_+ = [0, +∞) and K = { (i, a)∣i ∈ E, a ∈ A(i)} is the set of feasible state-action pairs. It is assumed that (1) Q( ⋅, j ∣ i, a) (for any fixed j ∈ E and (i, a) ∈ K) is a nondecreasing, right-continuous real function on R_+ such that Q(0, j ∣ i, a) = 0; (2) Q(t,  ⋅ ∣  ⋅,  ⋅) (for each fixed t ∈ R_+) is a sub-stochastic kernel on E given K; and (3) P( ⋅ ∣  ⋅,  ⋅) := Q(∞,  ⋅ ∣  ⋅,  ⋅) is a stochastic kernel on E given K. If action a ∈ A(i) is selected in state i, then Q(t, j ∣ i, a) is the joint probability that the sojourn time in state i is not greater than t ∈ R_+ and that the next state is j. Moreover, r(i, a) and c(i, a) in (11.1) denote the reward and cost rate functions on K, valued in R = (−∞, +∞), which are both assumed to be measurable on A(i) for each fixed i ∈ E. In addition, α(i, a) represents the discount rate, a measurable function from K to R_+. Finally, γ is a given constraint constant.
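
For readers who wish to experiment numerically with the model (11.1), the following minimal Python sketch records its primitives for the special case of exponential sojourn times, so that Q(t, j ∣ i, a) = P(j ∣ i, a)(1 − e^{−rate(i,a)t}). The container SMDPModel and all field names are our own illustrative choices, not notation from the chapter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Illustrative container for the primitives in (11.1); all names are ours.
# Sojourn times are taken to be exponential with rate `rate(i, a)`, so that
# Q(t, j | i, a) = P(j | i, a) * (1 - exp(-rate(i, a) * t)).
@dataclass
class SMDPModel:
    states: List[int]                              # a finite truncation of E
    target: List[int]                              # the target set B ⊂ E
    actions: Callable[[int], List[int]]            # i ↦ A(i)
    trans: Callable[[int, int], Dict[int, float]]  # (i, a) ↦ P(· | i, a)
    rate: Callable[[int, int], float]              # (i, a) ↦ sojourn rate
    reward: Callable[[int, int], float]            # r(i, a)
    cost: Callable[[int, int], float]              # c(i, a)
    discount: Callable[[int, int], float]          # α(i, a), zero allowed
    gamma: float                                   # constraint constant γ

# A toy instance: E = {0, 1, 2}, B = {0}, two actions in every state.
toy = SMDPModel(
    states=[0, 1, 2],
    target=[0],
    actions=lambda i: [0, 1],
    trans=lambda i, a: ({0: 1.0} if i == 0
                        else {0: 0.3 + 0.2 * a, 1: 0.4, 2: 0.3 - 0.2 * a}),
    rate=lambda i, a: 1.0 + a,
    reward=lambda i, a: 1.0,
    cost=lambda i, a: 0.5 * a,
    discount=lambda i, a: 0.1,
    gamma=1.0,
)
```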

Remark 11.2.1.

Compared with the models of the standard constrained discounted and average criteria [3–5], in the model (11.1) we introduce a target set B ⊂ E of the controlled system, and furthermore, the discount rate α(i, a) here is state-action dependent and may be equal to zero (i.e., the undiscounted case is allowed).

To state the constrained SMDPs we are concerned with, we need to introduce the classes of policies. For each n ≥ 0, let H n be the family of admissible histories up to the nth jump (decision epoch), that is, \({H}_{n} = {({R}_{+} \times K)}^{n} \times ({R}_{+} \times E)\), for n = 0, 1, ….

Definition 11.2.1.

A randomized history-dependent policy, or simply a policy, is a sequence π = { π n , n ≥ 0} of stochastic kernels π n on A given H n satisfying

$${\pi }_{n}(A({i}_{n})\mid {h}_{n}) = 1\ \ \ \forall \ {h}_{n} = ({t}_{0},{i}_{0},{a}_{0},\ldots,{t}_{n-1},{i}_{n-1},{a}_{n-1},{t}_{n},{i}_{n}) \in {H}_{n},\quad n = 0,1,\ldots.$$

The class of all policies is denoted by Π. To distinguish the subclasses of Π, we let Φ be the family of all stochastic kernels φ on A given E such that φ(A(i)∣i) = 1 for all i ∈ E, and \(\mathbb{F}\) the set of all functions f : E → A such that f(i) is in A(i) for every i ∈ E. A policy π = { π n } ∈ Π is said to be randomized Markov if there is a sequence {φ n } of φ n  ∈ Φ such that π n ( ⋅∣h n ) = φ n ( ⋅∣i n ) for every h n  ∈ H n and n ≥ 0. We denote such a policy by π = { φ n }. A randomized Markov policy π = { φ n } is said to be randomized stationary if every φ n is independent of n. In this case, we write π = { φ, φ, …} as φ for simplicity. Further, a randomized Markov policy π = { φ n } is said to be deterministic Markov if there is a sequence {f n } of \({f}_{n} \in \mathbb{F}\) such that φ n ( ⋅∣i) is the Dirac measure at f n (i) for all i ∈ E and n ≥ 0. We write such a policy as π = { f n }. In particular, a deterministic Markov policy π = { f n } is said to be (deterministic) stationary if the f n are all independent of n. Similarly, we write π = { f, f, …} as f for simplicity. We denote by Π RM, Π RS, Π DM, and Π DS the families of all randomized Markov, randomized stationary, deterministic Markov, and (deterministic) stationary policies, respectively. Obviously, Φ = Π RS ⊂ Π RM ⊂ Π and \(\mathbb{F} = {\Pi }_{\mathrm{DS}} \subset {\Pi }_{\mathrm{DM}} \subset \Pi \).
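
The policy classes above can be mirrored directly in code. The following sketch (again illustrative, with names of our own choosing) represents a deterministic stationary policy f ∈ 𝔽 as a function i ↦ f(i) and a randomized stationary policy φ ∈ Φ as a function returning a distribution over A(i), together with the Dirac embedding of 𝔽 into Φ.

```python
from typing import Callable, Dict

# Illustrative representations of the policy classes above (names are ours).
# A deterministic stationary policy f ∈ F maps each state to one action.
DetStationary = Callable[[int], int]

# A randomized stationary policy φ ∈ Φ maps each state to a probability
# distribution over A(i), encoded here as a dict {action: probability}.
RandStationary = Callable[[int], Dict[int, float]]

def as_randomized(f: DetStationary) -> RandStationary:
    """Embed f ∈ F into Φ via the Dirac measure concentrated at f(i)."""
    return lambda i: {f(i): 1.0}

# Example: the stationary policy that always chooses action 0.
f0: DetStationary = lambda i: 0
phi0: RandStationary = as_randomized(f0)
```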

Let ℙ(E) denote the set of all probability measures on E. For each (s, μ) ∈ R_+ × ℙ(E) and π ∈ Π, by Tulcea’s theorem ([10, Proposition C.10]), there exist a unique probability space (\(\Omega,\mathcal{F},\) P (s, μ) π) and a stochastic process {T n , J n , A n , n ≥ 0} such that, for each \(i,j \in E,t \in {R}_{+},C \in \mathcal{A}\) and n ≥ 0,

$$\begin{array}{rcl} & & {P}_{(s,\mu )}^{\pi }({T}_{ 0} = s,{J}_{0} = i) = \mu (i),\end{array}$$
(11.2)
$$\begin{array}{rcl} & & {P}_{(s,\mu )}^{\pi }({A}_{ n} \in C\mid {h}_{n}) = {\pi }_{n}(C\mid {h}_{n}),\end{array}$$
(11.3)
$$\begin{array}{rcl} & & {P}_{(s,\mu )}^{\pi }({T}_{ n+1} - {T}_{n} \leq t,{J}_{n+1} = j\mid {h}_{n},{a}_{n}) = Q(t,j\mid {i}_{n},{a}_{n}),\end{array}$$
(11.4)

where T n , J n , and A n denote the nth decision epoch, the state, and the action chosen at the nth decision epoch, respectively. The expectation operator with respect to P (s, μ) π is denoted by E (s, μ) π. In particular, if μ is the Dirac measure δ i ( ⋅) concentrated at some state i ∈ E, we write P (s, μ) π and E (s, μ) π as P (s, i) π and E (s, i) π, respectively. For simplicity, P (0, μ) π and E (0, μ) π are denoted by P μ π and E μ π, respectively. Without loss of generality, in the following, we always set the initial decision epoch T 0 = 0 and omit it.

Remark 11.2.2.

  1. (a)

    The construction of the probability measure space (\(\Omega,\mathcal{F},\) P (s, μ) π) and the above properties (11.2)–(11.4) follow from those in Limnios and Oprisan [18, p.33] and Puterman [24, p.534–535].

  2. (b)

    Let X 0 : = 0 and X n : = T n  − T n − 1 (n ≥ 1) denote the sojourn times between decision epochs (jumps). Then, the stochastic process {T n , J n , A n , n ≥ 0} may be rewritten as {X n , J n , A n , n ≥ 0}.

To avoid the possibility of an infinite number of decision epochs within finite time, we make the following assumption, which ensures that the decision epochs do not accumulate.

Assumption 11.2.1

For all μ ∈ ℙ(E) and π ∈ Π, P μ π ({lim n→∞ T n = ∞}) = 1.

To verify Assumption 11.2.1, one can use the following sufficient condition.

Condition 11.2.2

There exist constants δ > 0 and ε > 0 such that

$$Q(\delta,E\mid i,a) \leq 1 - \varepsilon \ \ \forall (i,a) \in K.$$

Remark 11.2.3.

In fact, Condition 11.2.2 is the standard regularity condition widely used in SMDPs [5, 16, 20, 24, 25], and it implies Assumption 11.2.1 above.

Under Assumption 11.2.1, we define an underlying continuous-time state-action process {Z(t), W(t), t ∈ R  + } corresponding to the stochastic process {T n , J n , A n } by

$$Z(t) = {J}_{n},\quad W(t) = {A}_{n},\ \ \mbox{ for}\ \ {T}_{n} \leq t < {T}_{n+1},\quad t \in {R}_{+}\ \mbox{ and}\ n \geq 0.$$

Definition 11.2.2.

The stochastic process {Z(t), W(t)} is called a (continuous-time) SMDP.

For the target set B ⊂ E, we consider the random variable

$$\begin{array}{rcl}{ \tau }_{B} :=\inf \{ t \geq 0\mid Z(t) \in B\}\ \ (\mbox{ with}\ \inf \varnothing := \infty ),& & \\ \end{array}$$

which is the first passage time of the process {Z(t), t ∈ R_+} into the set B. Now, fix an initial distribution μ ∈ ℙ(E). For each π ∈ Π, the expected first passage reward and cost criteria are defined as follows:

$$\begin{array}{rcl} & & {V }_{r}(\mu,\pi ) := {E}_{\mu }^{\pi }\left[{\int \nolimits \nolimits }_{0}^{{\tau }_{B} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }r(Z(t),W(t))\mathrm{d}t\right],\end{array}$$
(11.5)
$$\begin{array}{rcl} & & {V }_{c}(\mu,\pi ) := {E}_{\mu }^{\pi }\left[{\int \nolimits \nolimits }_{0}^{{\tau }_{B} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }c(Z(t),W(t))\mathrm{d}t\right].\end{array}$$
(11.6)
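
The criteria (11.5) and (11.6) can be estimated by straightforward Monte Carlo simulation of the jump chain via (11.2)–(11.4). The sketch below is our own illustration: it assumes the exponential-sojourn SMDPModel and the randomized stationary policy representation from the earlier snippets, and that B is reached with probability one from the initial state; it accumulates the pathwise discounted reward and cost up to τ_B.

```python
import math
import random

def simulate_first_passage(model, policy, init_state, rng):
    """Simulate one trajectory of {T_n, J_n, A_n} via (11.2)-(11.4) under a
    randomized stationary policy, stopped at the first entry into B, and
    return the pathwise discounted reward and cost up to tau_B, cf. (11.5)-(11.6).

    `model` is assumed to expose the fields of the SMDPModel sketch and
    `policy(i)` a distribution over A(i); B is assumed to be reached a.s."""
    i, reward_acc, cost_acc, disc = init_state, 0.0, 0.0, 1.0
    while i not in model.target:
        probs = policy(i)
        a = rng.choices(list(probs), weights=list(probs.values()))[0]
        x = rng.expovariate(model.rate(i, a))        # exponential sojourn time
        alpha = model.discount(i, a)
        # integral of e^{-alpha t} over [0, x]; the case alpha = 0 is allowed
        w = x if alpha == 0.0 else (1.0 - math.exp(-alpha * x)) / alpha
        reward_acc += disc * model.reward(i, a) * w
        cost_acc += disc * model.cost(i, a) * w
        disc *= math.exp(-alpha * x)                 # running discount up to T_{n+1}
        nxt = model.trans(i, a)
        i = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return reward_acc, cost_acc

def estimate_criteria(model, policy, init_state, n_paths=10_000, seed=42):
    """Crude Monte Carlo estimates of V_r(i, phi) and V_c(i, phi)."""
    rng = random.Random(seed)
    samples = [simulate_first_passage(model, policy, init_state, rng)
               for _ in range(n_paths)]
    v_r = sum(r for r, _ in samples) / n_paths
    v_c = sum(c for _, c in samples) / n_paths
    return v_r, v_c
```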

To introduce the constrained problem, for the constraint constant γ in (11.1), let

$$U :=\{ \pi \in \Pi \mid {V }_{c}(\mu,\pi ) \leq \gamma \}$$

be the set of “constrained” policies. We assume throughout that U ≠ ∅. Then, the optimization problem we are interested in is to maximize the expected first passage reward V r (μ, π) over the set U, and our goal is to find a constrained optimal policy as defined below.

Definition 11.2.3.

A policy π ∗  ∈ U is called constrained optimal if

$${V }_{r}(\mu,{\pi }^{{_\ast}}) {=\sup }_{ \pi \in U}{V }_{r}(\mu,\pi ).$$

Remark 11.2.4.

  1. (a)

    It is worthwhile to point out that the expected first passage reward criterion V r (μ, π) defined in (11.5) differs from the usual discounted reward criteria [11, 12, 24] and the total reward criteria without discount [6, 11, 24]. In fact, the former is concerned with the performance of the system during a first passage time to some target set, while the latter are concerned with the performance of the system over an infinite horizon. However, if the target set B = ∅ (and thus \({\tau }_{B} \equiv \infty \)) and, furthermore, the discount rate α(i, a) is state-action independent (say, α(i, a) ≡ α), then the expected first passage reward criterion (11.5) reduces to the standard infinite horizon expected discounted reward criterion or expected total reward criterion [6, 11, 12, 14, 24].

  2. (b)

    Note that the case without discount, that is, \(\alpha (i,a) \equiv 0\), is allowed in the context of this chapter; see Remark 11.2.5 for further details.

  3. (c)

    When the constraint constant γ in (11.1) is sufficiently large so that U = Π, the constrained first passage optimization problem (recall Definition 11.2.3) reduces to the usual unconstrained first passage optimization problem [13, 15, 19, 21, 22, 28].

  4. (d)

    In real situations, the target set B usually represents the set of failure states of a system, and thus τ B denotes the working life (functioning life) of the system. Therefore, our aim is to maximize the expected reward V r (μ, π) obtained before the system fails, subject to the constraint that the associated cost V c (μ, π) incurred before the failure does not exceed the constant γ. In particular, if the reward rate r(i, a) ≡ 1 and the discount rate α(i, a) ≡ 0, our aim reduces to maximizing the expected working life of the system, subject to the same constraint on V c (μ, π).

To obtain the existence of a constrained optimal policy, we need several sets of conditions.

Assumption 11.2.3

There exist constants M > 0, 0 < β < 1, and a weight function w ≥ 1 on E such that for every i ∈ B c := E − B,

  1. (a)

    \({\sup }_{a\in A(i)}\vert \widetilde{r}(i,a)\vert \leq Mw(i)\) and \({\sup }_{a\in A(i)}\vert \widetilde{c}(i,a)\vert \leq Mw(i)\), where

    $$\begin{array}{rcl} \widetilde{r}(i,a) :& =& r(i,a){\int \nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha (i,a)t}(1 - D(t\mid i,a))\mathrm{d}t, \\ \widetilde{c}(i,a) :& =& c(i,a){\int \nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha (i,a)t}(1 - D(t\mid i,a))\mathrm{d}t,\ \ \mbox{ and} \\ D(t\mid i,a) :& =& Q(t,E\mid i,a).\end{array}$$
  2. (b)

    $${\sup }_{a\in A(i)}{\sum }_{j\in {B}^{c}}w(j)m(j\mid i,a) \leq \beta w(i),\ \ \mbox{ where}\ \ m(j\mid i,a) :={ \int \nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha (i,a)t}Q(\mathrm{d}t,j\mid i,a).$$
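
For a concrete feel of Assumption 11.2.3, the following sketch (our own, for a finite truncation of E with exponential sojourn times, in which case m(j∣i, a) = P(j∣i, a)·rate(i, a)/(rate(i, a) + α(i, a)) and ∫_0^∞ e^{−α(i,a)t}(1 − D(t∣i, a))dt = 1/(rate(i, a) + α(i, a))) checks parts (a) and (b) numerically for a candidate weight function w and constants M, β; it reuses the SMDPModel container introduced earlier.

```python
def check_assumption_11_2_3(model, w, M, beta):
    """Check Assumption 11.2.3(a)-(b) over B^c of a finite truncation of E,
    assuming exponential sojourn times, for which
      m(j | i, a) = P(j | i, a) * rate / (rate + alpha)   and
      integral of e^{-alpha t}(1 - D(t | i, a)) dt = 1 / (rate + alpha)."""
    for i in model.states:
        if i in model.target:            # restrictions are imposed on B^c only
            continue
        for a in model.actions(i):
            lam, alpha = model.rate(i, a), model.discount(i, a)
            mean_disc_sojourn = 1.0 / (lam + alpha)
            r_tilde = model.reward(i, a) * mean_disc_sojourn
            c_tilde = model.cost(i, a) * mean_disc_sojourn
            if max(abs(r_tilde), abs(c_tilde)) > M * w(i):
                return False             # part (a) fails at (i, a)
            drift = sum(w(j) * p * lam / (lam + alpha)
                        for j, p in model.trans(i, a).items()
                        if j not in model.target)
            if drift > beta * w(i):
                return False             # part (b) fails at (i, a)
    return True

# For the toy model above, w ≡ 1, M = 1, and beta = 0.9 already work:
# check_assumption_11_2_3(toy, lambda i: 1.0, M=1.0, beta=0.9)
```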

Remark 11.2.5.

  1. (a)

    In fact, Assumption 11.2.3 is a condition ensuring that the first passage criteria (11.5) and (11.6) are finite and that the dynamic programming operators are contractive; see Lemmas 11.3.1 and 11.3.2 below.

  2. (b)

    Assumption 11.2.3(a) shows that the cost function may be unbounded from above and from below, whereas those in the existing works [3–5, 7, 8, 12] on the standard constrained expected discounted criteria are assumed to be bounded or nonnegative (bounded below).

  3. (c)

    Note that the case without discount, that is, “α(i, a) ≡ 0”, is allowed in Assumption 11.2.3. In this case, Assumption 11.2.3(b) reduces to the existence of a constant 0 < β < 1 such that

    $$ \sup\limits_{a\in A(i)}\sum\limits_{j\in {B}^{c}}w(j)P(j\mid i,a) \leq \beta w(i)\ \forall \ i \in {B}^{c}\ (\mathrm{with}\ P(j\mid i,a) := Q(\infty,j\mid i,a)),$$
    (11.7)

    which can still be verified. This is because the restrictions in Assumption 11.2.3(b) are imposed only on the data of the set B c rather than on the entire space E. However, if the restrictions in Assumption 11.2.3(b) were imposed on the data of the entire space E, that is, if there existed a constant 0 < β < 1 such that

    $$ \sup\limits_{a\in A(i)} \sum\limits_{j\in E}w(j)P(j\mid i,a) \leq \beta w(i)\ \ \forall \ i \in E,$$
    (11.8)

    then (11.8) would fail to hold. Indeed, since P( ⋅ ∣ i, a) is a probability measure on E, the left-hand side of (11.8) is at least inf j∈E w(j) for every i ∈ E; taking the infimum over i ∈ E on both sides of (11.8) then gives β inf i∈E w(i) ≥ inf j∈E w(j), that is, β ≥ 1 (because w ≥ 1), which contradicts “0 < β < 1”.

Assumption 11.2.4

  1. (a)

    For each i ∈ B c , A(i) is compact.

  2. (b)

    The functions \(\widetilde{r}(i,a),\) \(\widetilde{c}(i,a)\) , and m(j∣i,a) defined in Assumption 11.2.3 are continuous in a ∈ A(i) for each fixed i,j ∈ B c , respectively.

  3. (c)

    The function \({\sum \nolimits }_{j\in {B}^{c}}w(j)m(j\mid i,a)\) is continuous in a ∈ A(i), with w as in Assumption 11.2.3.

Remark 11.2.6.

Assumption 11.2.4 collects compactness-continuity conditions for the first passage criteria, similar to the standard compactness-continuity conditions for the discounted and average criteria; see, for instance, Beutler and Ross [3] and Guo and Hernández-Lerma [7, 8]. The difference is that the former imposes restrictions only on the data of the set B c, while the latter concern the entire space E.

Assumption 11.2.5

  1. (a)

    \({\sum \nolimits }_{j\in {B}^{c}}w(j)\mu (j) < \infty.\)

  2. (b)

    U 0 :={ π ∈ Π∣V c (μ,π) < γ}≠∅.

Remark 11.2.7.

  1. (a)

    Assumption 11.2.5(a) is a condition on the “tails” of the initial distribution μ, whereas Assumption 11.2.5(b) is a Slater-like hypothesis, typical for constrained problems; see, for instance, Beutler and Ross [3], Guo and Hernández-Lerma [7, 8], and Zhang and Guo [29].

  2. (b)

    It should be noted that the conditions in Assumptions 11.2.3–11.2.5 are all imposed on the data of the set B c rather than the entire space E and thus can be fulfilled in greater generality.

Our main result is stated as follows.

Theorem 11.2.1.

Suppose that Assumptions 11.2.1–11.2.5 hold. Then there exists a constrained optimal policy that is either a stationary policy or a randomized stationary policy mixing two stationary policies which differ in at most one state; that is, there exist two stationary policies f 1 , f 2 , a state i ∗  ∈ B c , and a number p ∈ [0,1] such that f 1 (i) = f 2 (i) for all i ≠ i ∗  and, in addition, the randomized stationary policy φ p given by (11.9) is constrained optimal, where

$$\begin{array}{rcl}{ \varphi }^{p}(a\mid i) = \left \{\begin{array}{ll} p, &\mbox{ if}\ \ i = {i}^{{_\ast}}\ \mbox{ and}\ a = {f}^{1}({i}^{{_\ast}}), \\ 1 - p,&\mbox{ if}\ \ i = {i}^{{_\ast}}\ \mbox{ and}\ a = {f}^{2}({i}^{{_\ast}}), \\ 1, &\mbox{ if}\ \ i\neq {i}^{{_\ast}}\ \mbox{ and}\ a = {f}^{1}(i) = {f}^{2}(i). \end{array} \right.& &\end{array}$$
(11.9)

Proof.

See Sect. 11.4. □ 

11.3 Technical Preliminaries

This section provides some technical preliminaries necessary for the proof of Theorem 11.2.1 in Sect. 11.4.

To begin with, we define the w-norm for every real-valued function u on E by

$$\begin{array}{rcl} \|{u\|}_{w} :=\sup\limits_{i\in E}\vert u(i)\vert /w(i),& & \\ \end{array}$$

where w is the so-called weight function on E as in Assumption 11.2.3. Let

$$\begin{array}{rcl}{ \mathbb{B}}_{w}(E) :=\{ u :\| {u\|}_{w} < \infty \}& & \\ \end{array}$$

be the space of w-bounded functions on E.

Lemma 11.3.1.

Suppose that Assumptions 11.2.1 and 11.2.3 hold. Then:

  1. (a)

    For each i ∈ E and π ∈ Π,

    $$\begin{array}{rcl} \vert {V }_{r}(i,\pi )\vert \leq Mw(i)/(1 - \beta ),\ \ \vert {V }_{c}(i,\pi )\vert \leq Mw(i)/(1 - \beta ).& & \\ \end{array}$$

    Hence, \({V }_{r}(\cdot,\pi ) \in {\mathbb{B}}_{w}(E)\) , and \({V }_{c}(\cdot,\pi ) \in {\mathbb{B}}_{w}(E).\)

  2. (b)

    For all i ∈ E, π ∈ Π, and \(u \in {\mathbb{B}}_{w}(E),\)

    $$ \begin{array}{rcl} \lim\limits_{n\rightarrow \infty }{E}_{i}^{\pi }\left[{\prod \nolimits }_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}u({J}_{n})\right] = 0,& & \\ \end{array}$$

    where \({1}_{D}\) is the indicator function on a set D.

Proof.

  1. (a)

    By the definition of V r (i, π), we see that V r (i, π) can be expressed as follows:

    $$\begin{array}{rcl} \!\!\!{V }_{r}(i,\pi )& & \\ & =&{E}_{i}^{\pi }\left[{\int \nolimits \nolimits }_{0}^{{\tau }_{B} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }r(Z(t),W(t))\mathrm{d}t\right] \\ & =&{E}_{i}^{\pi }\left[{\int \nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }{1}_{\{{\tau }_{B}>t\}}r(Z(t),W(t))\mathrm{d}t\right] \\ & =& {E}_{i}^{\pi}\left[\sum\limits_{n=0}^{\infty }{\int \nolimits \nolimits }_{{T}_{n}}^{{T}_{n+1} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }\mathrm{d}t{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n})\right] \\ &=& {E}_{i}^{\pi }\left[\sum\limits_{n=0}^{\infty }{\int \nolimits \nolimits }_{0}^{{X}_{n+1} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{{T}_{n}+t}\alpha (Z(u),W(u))\mathrm{d}u }\mathrm{d}t{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n})\right] \\ & =& {E}_{i}^{\pi }\left[\sum\limits_{n=0}^{\infty }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{{T}_{n}}\alpha (Z(u),W(u))\mathrm{d}u }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n})\right. \\ & &\qquad \left.{\int \nolimits \nolimits }_{0}^{{X}_{n+1} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{{T}_{n}}^{{T}_{n}+t}\alpha (Z(u),W(u))\mathrm{d}u }\mathrm{d}t\right] \\ & =& {E}_{i}^{\pi }\left[{\sum }_{n=0}^{\infty }\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n}){\int \nolimits \nolimits }_{0}^{{X}_{n+1} }{\mathrm{e}}^{-\alpha ({J}_{n},{A}_{n})t}\mathrm{d}t\right] \\ & =& {E}_{i}^{\pi}\left[\sum\limits_{n=0}^{\infty }{E}_{ i}^{\pi }\left[\prod\limits _{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n})\right.\right. \\ & &\left.\left. \qquad \times {\int \nolimits\nolimits }_{0}^{{X}_{n+1} }{\mathrm{e}}^{-\alpha ({J}_{n},{A}_{n})t}\mathrm{d}t\vert {X}_{ 0},{J}_{0},{A}_{0},\ldots,{X}_{n},{J}_{n},{A}_{n}\right]\right] \\ &=& {E}_{i}^{\pi }\left[\sum\limits_{n=0}^{\infty }\prod\limits _{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n})\right. \\ & & \qquad \left.\times {E}_{i}^{\pi}\left[{\int \nolimits \nolimits }_{0}^{{X}_{n+1} }{\mathrm{e}}^{-\alpha ({J}_{n},{A}_{n})t}\mathrm{d}t\vert {X}_{ 0},{J}_{0},{A}_{0},\ldots,{X}_{n},{J}_{n},{A}_{n}\right]\right] \\ &=& {E}_{i}^{\pi }\left[\sum\limits_{n=0}^{\infty }\prod\limits _{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}r({J}_{n},{A}_{n}) \right.\\ & & \left.\qquad \times {\int \nolimits\nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha ({J}_{n},{A}_{n})t}(1 - D(t\mid {J}_{ n},{A}_{n}))\mathrm{d}t\right] \\ & =& {E}_{i}^{\pi}\left[\sum\limits_{n=0}^{\infty }\prod\limits _{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\right], \end{array}$$
    (11.10)

    where the third equality follows from Assumption 11.2.1 and the ninth equality is due to the property (11.4).

    We now show that for each n = 0, 1, …,

    $$\begin{array}{rcl}{ E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}w({J}_{n})\right] \leq {\beta }^{n}w(i).& & \end{array}$$
    (11.11)

    Indeed, (11.11) is trivial for n = 0. Now, for n ≥ 1, it follows from the property (11.4) and Assumption 11.2.3(b) that

    $$\begin{array}{rcl} & &{E}_{i }^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in{B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}w({J}_{n})\right] \\& =&{E}_{i}^{\pi }\left[{E}_{ i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}w({J}_{n})\right.\right. \\& & \qquad \left.\mid{T}_{0},{J}_{0},{A}_{0},\ldots,{T}_{n-1},{J}_{n-1},{A}_{n-1}]\right] \\& =& {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-2}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n-1}\in{B}^{c}\}}{E}_{i}^{\pi }\left[{\mathrm{e}}^{-\alpha({J}_{n-1},{A}_{n-1}){X}_{n} }{1}_{\{{J}_{n}\in {B}^{c}\}}w({J}_{n})\right.\right.\\& & \left.\left.\qquad \mid{T}_{0},{J}_{0},{A}_{0},\ldots,{T}_{n-1},{J}_{n-1},{A}_{n-1}\right]\right]\\& =& {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-2}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n-1}\in {B}^{c}\}} \right.\\& &\left.\qquad \times {\sum \nolimits }_{j\in {B}^{c}}{ \int \nolimits\nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha({J}_{n-1},{A}_{n-1})t}w(j)Q(\mathrm{d}t,j\mid {J}_{n-1},{A}_{n-1})\right] \\& =& {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-2}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n-1}\in {B}^{c}\}}{ \sum\nolimits }_{j\in {B}^{c}}w(j)m(j\mid {J}_{n-1},{A}_{n-1})\right] \\& \leq & \beta {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-2}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n-1}\in{B}^{c}\}}w({J}_{n-1})\right]. \end{array}$$
    (11.12)

    Iterating (11.12) yields (11.11).

    Moreover, observe that Assumption 11.2.3(a) and (11.11) give

    $$\begin{array}{rcl} & & {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}\vert \widetilde{r}({J}_{n},{A}_{n})\vert \right] \\ & \leq & M{E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}w({J}_{n})\right] \\ & \leq & M{\beta }^{n}w(i), \\ \end{array}$$

    which together with (11.10) yields

    $$\begin{array}{rcl} \vert {V }_{r}(i,\pi )\vert &\leq &\sum\limits_{n=0}^{\infty }{E}_{ i}^{\pi }\left[\prod\limits _{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}\vert \widetilde{r}({J}_{n},{A}_{n})\vert \right] \\ & \leq &\sum\limits_{n=0}^{\infty }M{\beta }^{n}w(i) = Mw(i)/(1 - \beta ).\end{array}$$

    Thus, we get

    $$ \begin{array}{rcl} \sup\limits_{i\in E}\vert {V }_{r}(i,\pi )\vert /w(i) \leq M/(1 - \beta ),& & \\ \end{array}$$

    which shows that \({V }_{r}(\cdot,\pi ) \in {\mathbb{B}}_{w}(E)\).

    Similarly, the conclusion for V c ( ⋅, π) can be obtained.

  2. (b)

    Since \(\vert u(i)\vert \leq \| {u\|}_{w}w(i)\) for all i ∈ E, it follows from (11.11) above that

    $$\begin{array}{rcl} & & {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}\vert u({J}_{n})\vert \right] \\ & & \quad \leq \| {u\|}_{w}{E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}w({J}_{n})\right] \leq \| {u\|}_{w}{\beta }^{n}w(i), \\ \end{array}$$

    and so part (b) follows. □ 

Remark 11.3.8.

In fact, Lemma 11.3.1 here for first passage criteria in SMDPs is similar to Lemma 3.1 in Huang and Guo [15]. The main difference is that the discount rate α(i, a) here is state-action dependent and the reward rate is possibly unbounded, whereas those in Huang and Guo [15] are not.

For φ ∈ Φ, we define the dynamic programming operators H φ and H on \({\mathbb{B}}_{w}(E)\) as follows: for \(u \in {\mathbb{B}}_{w}(E)\), if i ∈ B c,

$$\begin{array}{rcl}{ H}^{\varphi }u(i)& :=& {\int \nolimits \nolimits }_{a\in A(i)}\left[\widetilde{r}(i,a) + \sum\limits_{j\in {B}^{c}}u(j)m(j\mid i,a)\right]\varphi (\mathrm{d}a\mid i),\end{array}$$
(11.13)
$$\begin{array}{rcl} Hu(i)& :=& \sup\limits_{a\in A(i)}\left[\widetilde{r}(i,a) + \sum\limits_{j\in {B}^{c}}u(j)m(j\mid i,a)\right],\end{array}$$
(11.14)

and if i ∈ B, H φ u(i) = Hu(i) : = 0.
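
Since H will be shown to be a contraction on \(({\mathbb{B}}_{w}(E),\|\cdot \|_{w})\) (Lemma 11.3.2 below), V r ∗ can in principle be approximated by fixed-point iteration of (11.14). The sketch below is a purely illustrative implementation under the same finite-truncation, exponential-sojourn assumptions as the earlier snippets; it also extracts a greedy selector as in Lemma 11.3.2(b).

```python
def _q(model, u, i, a):
    """The bracketed term in (11.13)-(11.14), specialized to exponential
    sojourn times: r~(i, a) + sum_{j in B^c} u(j) m(j | i, a)."""
    lam, alpha = model.rate(i, a), model.discount(i, a)
    r_tilde = model.reward(i, a) / (lam + alpha)
    cont = sum(u[j] * p * lam / (lam + alpha)
               for j, p in model.trans(i, a).items()
               if j not in model.target)
    return r_tilde + cont

def value_iteration(model, n_iter=500):
    """Fixed-point iteration u <- Hu of (11.14) on a finite truncation of E,
    with u kept equal to 0 on B; returns the approximate V_r^* together with
    a greedy stationary policy, cf. Lemma 11.3.2(b)."""
    Bc = [i for i in model.states if i not in model.target]
    u = {i: 0.0 for i in model.states}
    for _ in range(n_iter):
        u = {i: (max(_q(model, u, i, a) for a in model.actions(i))
                 if i in Bc else 0.0)
             for i in model.states}
    f_star = {i: max(model.actions(i), key=lambda a: _q(model, u, i, a))
              for i in Bc}
    return u, f_star
```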

Lemma 11.3.2.

Suppose that Assumptions 11.2.1 and 11.2.3 hold. Then:

  1. (a)

    For each φ ∈ Φ, V r (⋅,φ) is the unique solution in \({\mathbb{B}}_{w}(E)\) to the equation

    $$\begin{array}{rcl}{ V }_{r}(i,\varphi ) = {H}^{\varphi }{V }_{ r}(i,\varphi )\ \forall i \in E.& & \\ \end{array}$$
  2. (b)

    If, in addition, Assumption 11.2.4 also holds, then V r ∗ (i) := sup π∈Π V r (i,π) is the unique solution in \({\mathbb{B}}_{w}(E)\) to the equation

    $$\begin{array}{rcl}{ V }_{r}^{{_\ast}}(i) = H{V }_{ r}^{{_\ast}}(i)\ \forall i \in E.& & \\ \end{array}$$

    Moreover, there exists an \({f}^{{_\ast}}\in \mathbb{F}\) such that \({V }_{r}^{{_\ast}}(i) = {H}^{{f}^{{_\ast}} }{V }_{r}^{{_\ast}}(i)\) , and such a policy \({f}^{{_\ast}}\in \mathbb{F}\) satisfies V r (i,f ∗ ) = V r ∗ (i) for every i ∈ E.

Proof.

  1. (a)

    From Lemma 11.3.1, it is clear that \({V }_{r}(\cdot,\varphi ) \in {\mathbb{B}}_{w}(E)\). We now establish the equation V r (i, φ) = H φ V r (i, φ). It is obviously true when i ∈ B, and for i ∈ B c, by (11.10), we have

    $$\begin{array}{rcl} & &{V }_{r } (i,\varphi ) \\& =&{E}_{i}^{\varphi }\left[\sum\limits_{ n=0}^{\infty }{\prod \nolimits}_{ k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1}}{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\right] \\& =&{E}_{i}^{\varphi }\left[{E}_{ i}^{\varphi }\left[\sum\limits_{n=0}^{\infty }\prod\limits_{ k=0}^{n-1}{\mathrm{e}}^{-\alpha({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in{B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\mid{T}_{0},{J}_{0},{A}_{0},{T}_{1},{J}_{1}\right]\right] \\& =&{E}_{i}^{\varphi }\left[{1}_{\{{ J}_{0}\in{B}^{c}\}}\widetilde{r}({J}_{0},{A}_{0}) +{ \mathrm{e}}^{-\alpha({J}_{0},{A}_{0}){X}_{1} }{1}_{\{{J}_{0}\in {B}^{c},{J}_{1}\in{B}^{c}\}}{E}_{i}^{\varphi }\left[\sum\limits_{ n=1}^{\infty }{\prod\nolimits }_{ k=1}^{n-1}{\mathrm{e}}^{-\alpha({J}_{k},{A}_{k}){X}_{k+1} }\right.\right. \\& &\left.\left. \qquad {1}_{\{{J}_{2}\in{B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\mid{T}_{0},{J}_{0},{A}_{0},{T}_{1},{J}_{1}\right]\right] \\& =& {\int\nolimits \nolimits }_{a\in A(i)}\varphi (\mathrm{d}a\mid i)\left[\widetilde{r}(i,a)+\sum\limits_{j\in {B}^{c}}{ \int\nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha(i,a)t}Q(\mathrm{d}t,j\mid i,a){E}_{ i}^{\varphi }\left[\sum\limits_{n=1}^{\infty }\prod\limits_{ k=1}^{n-1}{\mathrm{e}}^{-\alpha({J}_{k},{A}_{k}){X}_{k+1} }\right.\right. \\& & \left.\left. {1}_{\{{J}_{0}\in{B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\mid {T}_{0} = 0,{J}_{0} =i,{A}_{0} = a,{T}_{0} = t,{J}_{1} = j\right]\right] \\& =& {\int\nolimits \nolimits }_{a\in A(i)}\varphi (\mathrm{d}a\mid i)\left[\widetilde{r}(i,a) +\sum\limits_{j\in {B}^{c}}m(j\mid i,a){E}_{j}^{\varphi }\left[\sum\limits_{ n=0}^{\infty }{\prod\nolimits }_{ k=0}^{n-1}{\mathrm{e}}^{-\alpha({J}_{k},{A}_{k}){X}_{k+1} }\qquad \qquad \qquad \qquad \qquad\qquad \quad \qquad \right.\right.\\& & \left.\qquad \qquad \quad \qquad{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in{B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n})\right] \\& =& {\int\nolimits \nolimits }_{a\in A(i)}\varphi (\mathrm{d}a\mid i)\left[\widetilde{r}(i,a) +\sum\limits_{j\in {B}^{c}}m(j\mid i,a){V }_{r}(j,\varphi )\right],\end{array}$$

    where the fifth equality is due to the properties (11.2)–(11.4) and the fact that the policy φ is Markov. Hence, we obtain that V r (i, φ) = H φ V r (i, φ) for all i ∈ E.

    To complete the proof, we need only show that H φ is a contraction from \({\mathbb{B}}_{w}(E)\) to \({\mathbb{B}}_{w}(E)\), and thus H φ has a unique fixed point in \({\mathbb{B}}_{w}(E)\). Indeed, for an arbitrary function \(u \in {\mathbb{B}}_{w}(E)\), by Assumption 11.2.3 it is clear that

    $$\begin{array}{rcl} & & \vert {H}^{\varphi }u(i)\vert \\ & =& \left \vert {\int \nolimits \nolimits }_{a\in A(i)}\left[\widetilde{r}(i,a) + \sum\limits_{j\in {B}^{c}}u(j)m(j\mid i,a)\right]\varphi (\mathrm{d}a\mid i)\right \vert \\ &\leq &{\int \nolimits \nolimits }_{a\in A(i)}\left[\vert \widetilde{r}(i,a)\vert + \sum\limits_{j\in {B}^{c}}\vert u(j)\vert m(j\mid i,a)\right]\varphi (\mathrm{d}a\mid i) \\ & \leq &{\int \nolimits \nolimits }_{a\in A(i)}\left[Mw(i) +\| {u\|}_{w}\beta w(i)\right]\varphi (\mathrm{d}a\mid i) \\ & =& (M + \beta \|{u\|}_{w})w(i)\ \forall i \in {B}^{c}, \\ \end{array}$$

    which implies that \({H}^{\varphi }u \in {\mathbb{B}}_{w}(E)\), that is, H φ maps \({\mathbb{B}}_{w}(E)\) to itself. Moreover, for any \(u,u^{\prime} \in {\mathbb{B}}_{w}(E)\), we have

    $$\begin{array}{rcl} & &\vert {H}^{\varphi } u(i) - {H}^{\varphi }u^{\prime}(i)\vert \\ & =& \left \vert {\int \nolimits \nolimits }_{a\in A(i)}\left[\sum\limits_{j\in {B}^{c}}(u(j) - u^{\prime}(j))m(j\mid i,a)\right]\varphi (\mathrm{d}a\mid i)\right \vert \\ &\leq &{\int \nolimits \nolimits }_{a\in A(i)}\left[\sum\limits_{j\in {B}^{c}}\vert u(j) - u^{\prime}(j)\vert m(j\mid i,a)\right]\varphi (\mathrm{d}a\mid i) \\ & \leq &{\int \nolimits \nolimits }_{a\in A(i)}\left[\|u - {u^{\prime}\|}_{w}\beta w(i)\right]\varphi (\mathrm{d}a\mid i) \\ & =& \beta \|u - {u^{\prime}\|}_{w}w(i)\ \forall i \in {B}^{c}.\end{array}$$

    Hence, \(\|{H}^{\varphi }u - {H}^{\varphi }{u^{\prime}\|}_{w} \leq \beta \|u - {u^{\prime}\|}_{w}\), and thus H φ is a contraction from \({\mathbb{B}}_{w}(E)\) to itself. By Banach’s Fixed Point Theorem, H φ has a unique fixed point in \({\mathbb{B}}_{w}(E)\), and so the proof of part (a) is complete.

  2. (b)

    Under Assumption 11.2.4, arguing as in the proof of part (a) shows that H is a contraction from \({\mathbb{B}}_{w}(E)\) to itself, and thus, by Banach’s Fixed Point Theorem, H has a unique fixed point u  ∗  in \({\mathbb{B}}_{w}(E)\), that is, Hu  ∗  = u  ∗ . Hence, to prove part (b), we need to show that: (b 1) \({V }_{r}^{{_\ast}}\in {\mathbb{B}}_{w}(E)\), with w-norm \(\|{V }_{r}^{{_\ast}}\|_{w} \leq M/(1 - \beta )\). (b 2) V r  ∗  = u  ∗ .

In fact, (b 1) is an immediate result of Lemma 11.3.1(a). Thus, it remains to prove (b 2). To this end, we show that u  ∗  ≤ V r  ∗  and u  ∗  ≥ V r  ∗  as below, respectively. It is clear that u  ∗ (i) = V r  ∗ (i) = 0 for every i ∈ B. Hence, in the following, we restrict our argument to the case of i ∈ B c.

  1. (i)

    This part is to show that u  ∗  ≤ V r  ∗ . By the equality u  ∗  = Hu  ∗  and the measurable selection theorem [10, Proposition D.5, p.182], there exists an \(f \in \mathbb{F}\) such that

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i) =\widetilde{ r}(i,f) + \sum\limits_{j\in {B}^{c}}{u}^{{_\ast}}(j)m(j\mid i,f)\ \forall i \in {B}^{c}.& & \end{array}$$
    (11.15)

    Iteration of (11.15) yields

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i)& =& {E}_{ i}^{f}\left[\sum\limits_{m=0}^{n-1} \prod\limits_{k=0}^{m-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{m}\in {B}^{c}\}}\widetilde{r}({J}_{m},f)\right] \\ & & +{E}_{i}^{f}\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n})\right]\ \ \forall i \in {B}^{c},\ n = 1,2\ldots,\\ \end{array}$$

    and letting n → ∞ we get, by Lemma 11.3.1(b),

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i) = {E}_{ i}^{f}\left[\sum\limits_{m=0}^{\infty }\prod\limits_{k=0}^{m-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{m}\in {B}^{c}\}}\widetilde{r}({J}_{m},f)\right] = {V }_{r}(i,f)\ \ \forall i \in {B}^{c}.& & \\ \end{array}$$

    Thus, by the definition of V r  ∗ , we see that u  ∗  ≤ V r  ∗ .

  2. (ii)

    This part is to show that u  ∗  ≥ V r  ∗ . Note that u  ∗  = Hu  ∗  implies that

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i) \geq \widetilde{ r}(i,a) + \sum\limits_{j\in {B}^{c}}{u}^{{_\ast}}(j)m(j\mid i,a)\ \ \forall i \in {B}^{c},\ \ a \in A(i),& & \end{array}$$
    (11.16)

    which gives

    $$\begin{array}{rcl}{ 1}_{\{{J}_{n}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n}) \geq {1}_{\{{J}_{n}\in {B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n}) + {1}_{\{{J}_{n}\in {B}^{c}\}} \sum\limits_{j\in {B}^{c}}{u}^{{_\ast}}(j)m(j\mid {J}_{ n},{A}_{n})\ \ \forall n \geq 0.& & \end{array}$$
    (11.17)

    Hence, for any initial state i ∈ B c and policy π ∈ Π, using properties (11.2)–(11.4) yields

    $$\begin{array}{rcl}{ 1}_{\{{J}_{n}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n})& \geq & {E}_{i}^{\pi }\left[{1}_{\{{ J}_{n}\in {B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n}) +{ \mathrm{e}}^{-\alpha ({J}_{n},{A}_{n}){X}_{n+1} }{1}_{\{{J}_{n}\in {B}^{c},{J}_{n+1}\in {B}^{c}\}} \right.\\ & &\left. \qquad \times {u}^{{_\ast}}({J}_{ n+1})\mid {T}_{0},{J}_{0},{A}_{0},\ldots,{T}_{n},{J}_{n},{A}_{n}\right]\forall n = 0,1\ldots,\end{array}$$

    which gives

    $$\begin{array}{rcl} & & \prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n}) \\ & & \quad \geq {E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}\widetilde{r}({J}_{n},{A}_{n}) \right.\\ & & \left.\qquad + \prod\limits_{k=0}^{n}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n+1}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n+1})\mid {T}_{0},{J}_{0},{A}_{0},\ldots,{T}_{n},{J}_{n},{A}_{n}\right] \\ & & \forall n = 0,1\ldots. \end{array}$$

    Therefore, taking expectation E i π and summing over m = 0, 1, …, n − 1, we obtain

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i)& \geq & {E}_{ i}^{\pi }\left[\sum\limits_{m=0}^{n-1} \prod\limits_{k=0}^{m-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{m}\in {B}^{c}\}}\widetilde{r}({J}_{m},{A}_{m})\right] \\ & & +{E}_{i}^{\pi }\left[\prod\limits_{k=0}^{n-1}{\mathrm{e}}^{-\alpha ({J}_{k},{A}_{k}){X}_{k+1} }{1}_{\{{J}_{0}\in {B}^{c},\ldots,{J}_{n}\in {B}^{c}\}}{u}^{{_\ast}}({J}_{ n})\right],\ \ \forall n = 1,2\ldots.\end{array}$$

    Finally, letting n → ∞ in the latter inequality and using Lemma 11.3.1(b), it follows that

    $$\begin{array}{rcl}{ u}^{{_\ast}}(i) \geq {V }_{ r}(i,\pi )& & \\ \end{array}$$

    so that, as i and π were arbitrary, we conclude that u  ∗  ≥ V r  ∗ .

Combining (i) with (ii) yields that u  ∗  = V r  ∗ , and thus we have V r  ∗  = HV r  ∗ .

Finally, it follows from V r  ∗  = HV r  ∗  and the measurable selection theorem that there exists an \({f}^{{_\ast}}\in \mathbb{F}\) such that \({V }_{r}^{{_\ast}} = {H}^{{f}^{{_\ast}} }{V }_{r}^{{_\ast}}\). This fact together with part (a) implies that \({V }_{r}(\cdot,{f}^{{_\ast}}) = {V }_{r}^{{_\ast}}\). □ 

Remark 11.3.9.

Note that Lemma 11.3.2 also holds for the case of the expected first passage cost V c accordingly.

Note that \(\mathbb{F}\) can be written as the product space \(\mathbb{F} ={ \prod \nolimits }_{i\in E}A(i)\). Hence, by Assumption 11.2.4(a) and Tychonoff’s theorem, \(\mathbb{F}\) is a compact metric space.

Lemma 11.3.3.

Suppose that Assumptions 11.2.1–11.2.4 and 11.2.5(a) hold. Then the functions V r (μ,f) and V c (μ,f) are continuous in \(f \in \mathbb{F}.\)

Proof.

We only prove the continuity of V r (μ, f) in \(f \in \mathbb{F}\) because the other case is similar. Let f n  → f as n → ∞ and fix any i ∈ E. Let v(i) : = limsup n→∞  V r (i, f n ). Then, by Theorem 4.4 in [1], there exists a subsequence \(\{{V }_{r}(i,{f}_{{n}_{m}})\}\) (depending on i) of {V r (i, f n )} such that \({V }_{r}(i,{f}_{{n}_{m}}) \rightarrow v(i)\) as m → ∞. Then, by Lemma 11.3.1(a), we have \({V }_{r}(j,{f}_{{n}_{m}}) \in [-Mw(j)/(1 - \beta ),Mw(j)/(1 - \beta )]\) for all j ∈ E and m ≥ 1, and so \({V }_{r}(\cdot,{f}_{{n}_{m}})\) is in the product space ∏ j ∈ E [ − Mw(j) ∕ (1 − β), Mw(j) ∕ (1 − β)] for each m ≥ 1. Since E is denumerable, the Tychonoff theorem shows that the space ∏ j ∈ E [ − Mw(j) ∕ (1 − β), Mw(j) ∕ (1 − β)] is compact, and thus there exists a subsequence \(\{{V }_{r}(\cdot,{f}_{{n}_{k}})\}\) of \(\{{V }_{r}(\cdot,{f}_{{n}_{m}})\}\) converging to some point u in ∏ j ∈ E [ − Mw(j) ∕ (1 − β), Mw(j) ∕ (1 − β)], that is, \({\lim }_{k\rightarrow \infty }{V }_{r}(j,{f}_{{n}_{k}}) = u(j)\) for all j ∈ E, which, together with f n  → f and \({\lim }_{m\rightarrow \infty }{V }_{r}(i,{f}_{{n}_{m}}) = v(i)\), implies that

$$\begin{array}{rcl} \!\!\!\!\!\!\!\!\!\!\!\!\!v(i) = u(i),\ \lim\limits_{k\rightarrow \infty }{V }_{r}(j,{f}_{{n}_{k}}) = u(j), \mathrm{and}\lim\limits_{k\rightarrow \infty }{f}_{{n}_{k}}(j) = f(j),\ \ \mathrm{for\ all}\ j \in E.\ \ \ \ \ \ \ \ \ \ & &\end{array}$$
(11.18)

Moreover, by Lemma 11.3.1(a), we have

$$\begin{array}{rcl} \vert u(j)\vert \leq Mw(j)/(1 - \beta ),\ \ \ \mathrm{for\ all}\ j \in E,& &\end{array}$$
(11.19)

which implies that \(u \in {\mathbb{B}}_{w}(E)\).

On the other hand, for k ≥ 1 and the given i ∈ B c, by Lemma 11.3.2(a), we have

$$\begin{array}{rcl}{ V }_{r}(i,{f}_{{n}_{k}}) =\widetilde{ r}(i,{f}_{{n}_{k}}) + \sum\limits_{j\in {B}^{c}}{V }_{r}(j,{f}_{{n}_{k}})m(j\mid i,{f}_{{n}_{k}}).& &\end{array}$$
(11.20)

Then, under Assumptions 11.2.3 and 11.2.4, from (11.18)–(11.20) and Lemma 8.3.7 (i.e., the Generalized Dominated Convergence Theorem) in [11], we get

$$\begin{array}{rcl} u(i) =\widetilde{ r}(i,f) + \sum\limits_{j\in {B}^{c}}u(j)m(j\mid i,f).& &\end{array}$$
(11.21)

Thus, by Lemma 11.3.2(a) and (11.18), we conclude that

$$\begin{array}{rcl} \mathop{\lim\sup}\limits_{n\rightarrow \infty }{V }_{r}(i,{f}_{n}) = v(i) = u(i) = {V }_{r}(i,f).& &\end{array}$$
(11.22)

Similarly, we can prove that

$$\begin{array}{rcl} \mathop{\lim\inf}\limits_{n\rightarrow \infty }{V }_{r}(i,{f}_{n}) = {V }_{r}(i,f),& & \\ \end{array}$$

which together with (11.22) implies that

$$\mathop{\lim\sup}\limits_{n\rightarrow \infty }{V }_{r}(i,{f}_{n}) =\mathop{\lim\inf}\limits_{n\rightarrow \infty }{V }_{r}(i,{f}_{n}) = {V }_{r}(i,f),$$

and so

$$ \begin{array}{rcl} \lim\limits_{n\rightarrow \infty }{V }_{r}(i,{f}_{n}) = {V }_{r}(i,f).& &\end{array}$$
(11.23)

Moreover, noting that V r (i, f n ) = V r (i, f) = 0 for every i ∈ B and n ≥ 0, it follows from Assumption 11.2.5(a) and Lemma 8.3.7 in [11] again that

$$ \begin{array}{rcl} \lim\limits_{n\rightarrow \infty }{V }_{r } (\mu,{f}_{n})& =& \lim\limits_{n\rightarrow \infty }\sum\limits_{i\in E}[{V }_{r}(i,{f}_{n})]\mu (i) =\lim\limits_{n\rightarrow \infty }\sum\limits_{i\in {B}^{c}}[{V }_{r}(i,{f}_{n})]\mu (i) \\ & =& \sum\limits_{i\in {B}^{c}}{[\lim }_{n\rightarrow \infty }{V }_{r}(i,{f}_{n})]\mu (i) = \sum\limits_{i\in {B}^{c}}{V }_{r}(i,f)\mu (i) = {V }_{r}(\mu,f),\end{array}$$
(11.24)

which gives the stated result: V r (μ, f n ) → V r (μ, f), as n → ∞. □ 

To analyze the constrained control problem (recall Definition 11.2.3), we introduce a Lagrange multiplier λ ≥ 0 as follows. For each i ∈ E and a ∈ A(i), let

$$\begin{array}{rcl}{ b}^{\lambda }(i,a) := r(i,a) - \lambda c(i,a).& &\end{array}$$
(11.25)

Furthermore, for each policy π ∈ Π and i ∈ E, let

$$\begin{array}{rcl} & & {V }_{{b}^{\lambda }}(i,\pi ) := {E}_{i}^{\pi }\left[{\int \nolimits \nolimits }_{0}^{{\tau }_{B} }{\mathrm{e}}^{-{\int \nolimits \nolimits }_{0}^{t}\alpha (Z(u),W(u))\mathrm{d}u }{b}^{\lambda }(Z(t),W(t))\mathrm{d}t\right],\end{array}$$
(11.26)
$$\begin{array}{rcl} & & {V }_{{b}^{\lambda }}(\mu,\pi ) := \sum\limits_{j\in E}{V }_{{b}^{\lambda }}(j,\pi )\mu (j),\end{array}$$
(11.27)
$$\begin{array}{rcl} & & {V }_{{b}^{\lambda }}^{{_\ast}}(i) :=\sup\limits_{ \pi \in \Pi }{V }_{{b}^{\lambda }}(i,\pi ),{V }_{{b}^{\lambda }}^{{_\ast}}(\mu ) :=\sup\limits_{ \pi \in \Pi }{V }_{{b}^{\lambda }}(\mu,\pi ).\end{array}$$
(11.28)

Remark 11.3.10.

Notice that, for each i ∈ B, \({V }_{{b}^{\lambda }}(i,\pi ) = 0\). Therefore, we have

$$\begin{array}{rcl}{ V }_{{b}^{\lambda }}(\mu,\pi ) = \sum\limits_{j\in {B}^{c}}{V }_{{b}^{\lambda }}(j,\pi )\mu (j),\qquad {V }_{{b}^{\lambda }}^{{_\ast}}(\mu ) = \sum\limits_{j\in {B}^{c}}{V }_{{b}^{\lambda }}^{{_\ast}}(j)\mu (j).& & \\ \end{array}$$

Under Assumptions 11.2.1–11.2.4, by Lemma 11.3.2(b), we have

$${ V }_{{b}^{\lambda }}^{{_\ast}}(i) = \left \{\begin{array}{lll} 0, &\mbox{ for}\ i \in B, \\ \sup\limits_{a\in A(i)}\left[\widetilde{{b}^{\lambda }}(i,a) + \sum\limits_{j\in {B}^{c}}{V }_{{b}^{\lambda }}^{{_\ast}}(j)m(j\mid i,a)\right],&\mbox{ for}\ i \in {B}^{c}, \end{array} \right.$$
(11.29)

where \(\widetilde{{b}^{\lambda }}(i,a) := {b}^{\lambda }(i,a){\int \nolimits \nolimits }_{0}^{\infty }{\mathrm{e}}^{-\alpha (i,a)t}(1 - D(t\mid i,a))\mathrm{d}t\). Moreover, for each i ∈ E, the maximum in (11.29) is realized by some a ∈ A(i), that is,

$${ A}_{\lambda }^{{_\ast}}(i) := \left \{\begin{array}{lll} A(i), &\mbox{ for}\ i \in B, \\ \{a \in A(i)\mid {V }_{{b}^{\lambda }}^{{_\ast}}(i) =\widetilde{ {b}^{\lambda }}(i,a) + \sum\limits_{j\in {B}^{c}}{V }_{{b}^{\lambda }}^{{_\ast}}(j)m(j\mid i,a)\},&\mbox{ for}\ i \in {B}^{c} \end{array} \right.$$
(11.30)

is nonempty. In other words, the following sets

$$\begin{array}{rcl}{ \mathbb{F}}_{\lambda }^{{_\ast}} := \{f \in \mathbb{F}\mid f(i) \in {A}_{ \lambda }^{{_\ast}}(i)\ \forall i \in E\}& &\end{array}$$
(11.31)

and

$$\begin{array}{rcl}{ \Phi }^{\lambda } := \{\varphi \in \Phi \mid \varphi ({A}_{ \lambda }^{{_\ast}}(i)\mid i) = 1\ \forall i \in E\}& &\end{array}$$
(11.32)

are nonempty.

The next lemma shows that Φ λ is convex.

Lemma 11.3.4.

Under Assumptions 11.2.1–11.2.4, the set Φ λ is convex.

Proof.

For each φ1, φ2 ∈ Φ λ, and p ∈ [0, 1], let

$$\begin{array}{rcl}{ \varphi }^{p}(\cdot \mid i) := p{\varphi }_{ 1}(\cdot \mid i) + (1 - p){\varphi }_{2}(\cdot \mid i),\ \forall i \in E.& &\end{array}$$
(11.33)

Hence, φ p (A λ  ∗ (i)∣i) = pφ 1 (A λ  ∗ (i)∣i) + (1 − p)φ 2 (A λ  ∗ (i)∣i) = 1 for all i ∈ E, so that φ p  ∈ Φ λ, and thus Φ λ is convex. □ 

Notation. For each λ ≥ 0, we take an arbitrary, but fixed policy \({f}^{\lambda } \in {\mathbb{F}}_{\lambda }^{{_\ast}}\), and denote V r (μ, f λ), V c (μ, f λ), and \({V }_{{b}^{\lambda }}(\mu,{f}^{\lambda })\) by V r (λ), V c (λ), and V b (λ), respectively. By Lemma 11.3.2, we have that \({V }_{{b}^{\lambda }}(i,f) = {V }_{{b}^{\lambda }}^{{_\ast}}(i)\) for all i ∈ E and \(f \in {\mathbb{F}}_{\lambda }^{{_\ast}}\). Hence, \({V }_{b}(\lambda ) = {V }_{{b}^{\lambda }}(\mu,{f}^{\lambda }) = {V }_{{b}^{\lambda }}^{{_\ast}}(\mu )\).

Lemma 11.3.5.

If Assumptions 11.2.3, 11.2.4, and 11.2.5(a) hold, then V c (λ) is nonincreasing in λ ∈ [0,∞).

Proof.

By (11.5), (11.6), (11.25), and (11.26), for each π ∈ Π we obtain

$$\begin{array}{rcl}{ V }_{{b}^{\lambda }}(\mu,\pi ) = {V }_{r}(\mu,\pi ) - \lambda {V }_{c}(\mu,\pi )\ \forall \lambda \geq 0.& & \\ \end{array}$$

Moreover, noting that \({V }_{b}(\lambda ) = {V }_{{b}^{\lambda }}(\mu,{f}^{\lambda }) = {V }_{{b}^{\lambda }}^{{_\ast}}(\mu )\) for all λ ≥ 0 and \({f}^{\lambda } \in {\mathbb{F}}_{\lambda }^{{_\ast}}\), we have, for any h > 0,

$$\begin{array}{rcl} -h{V }_{c}(\lambda )& =& {V }_{{b}^{\lambda +h}}(\mu,{f}^{\lambda }) - {V }_{ b}(\lambda ) \\ & \leq & {V }_{b}(\lambda + h) - {V }_{b}(\lambda ) \\ & \leq & {V }_{b}(\lambda + h) - {V }_{{b}^{\lambda }}(\mu,{f}^{\lambda +h}) = -h{V }_{ c}(\lambda + h), \\ \end{array}$$

which gives that

$$\begin{array}{rcl} -h{V }_{c}(\lambda ) \leq -h{V }_{c}(\lambda + h).& & \\ \end{array}$$

Hence, V c (λ) is nonincreasing in λ ∈ [0, ∞). □ 

Lemma 11.3.6.

Suppose that Assumptions 11.2.1–11.2.4 hold. If lim k→∞ λ k = λ, and \({f}^{{\lambda }_{k}} \in {\mathbb{F}}_{{ \lambda }_{k}}^{{_\ast}}\) is such that \({\lim }_{k\rightarrow \infty }{f}^{{\lambda }_{k}} = f\) , then \(f \in {\mathbb{F}}_{\lambda }^{{_\ast}}.\)

Proof.

Since \({f}^{{\lambda }_{k}} \in {\mathbb{F}}_{{ \lambda }_{k}}^{{_\ast}}\), for each i ∈ B c and π ∈ Π, we have

$$\begin{array}{rcl}{ V }_{{b}^{{\lambda }_{k}}}^{{_\ast}}(i) = {V }_{ r}(i,{f}^{{\lambda }_{k} }) - {\lambda }_{k}{V }_{c}(i,{f}^{{\lambda }_{k} }) \geq {V }_{{b}^{{\lambda }_{k}}}(i,\pi ) = {V }_{r}(i,\pi ) - {\lambda }_{k}{V }_{c}(i,\pi ).& &\end{array}$$
(11.34)

Letting k → ∞ in (11.34) and by Lemma 11.3.3, we obtain

$$\begin{array}{rcl}{ V }_{{b}^{\lambda }}(i,f) \geq {V }_{{b}^{\lambda }}(i,\pi )\ \ \forall i \in {B}^{c}\ \ \mathrm{and}\ \ \pi \in \Pi,& & \\ \end{array}$$

which together with the fact that A λ  ∗ (i) = A(i) for each i ∈ B implies that \(f \in {\mathbb{F}}_{\lambda }^{{_\ast}}\). □ 

Under Assumptions 11.2.1–11.2.4 and 11.2.5(a), it follows from Lemma 11.3.5 that the following nonnegative constant

$$\begin{array}{rcl} \overline{\lambda } :=\inf \{ \lambda \geq 0\mid {V }_{c}(\lambda ) \leq \gamma \}& &\end{array}$$
(11.35)

is well defined.
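
Numerically, the monotonicity of V c (λ) established in Lemma 11.3.5 suggests approximating \(\overline{\lambda }\) by bisection, provided one can evaluate λ ↦ V c (λ), for instance by running the value-iteration sketch above on b λ = r − λc and evaluating the cost of the resulting greedy policy. The following sketch is our own illustration; eval_Vc is an assumed user-supplied routine, and \(\overline{\lambda }\) is shown to be finite in Lemma 11.3.7 below.

```python
def approx_lambda_bar(eval_Vc, gamma, lam_hi=1.0, tol=1e-6):
    """Bisection approximation of lambda_bar = inf{lam >= 0 : V_c(lam) <= gamma},
    using the monotonicity of V_c(lam) from Lemma 11.3.5.  `eval_Vc(lam)` is an
    assumed routine returning V_c(lam) = V_c(mu, f^lam) for some f^lam in F*_lam;
    lambda_bar is finite by Lemma 11.3.7."""
    if eval_Vc(0.0) <= gamma:
        return 0.0                      # the case lambda_bar = 0 in the proof of Theorem 11.2.1
    while eval_Vc(lam_hi) > gamma:      # enlarge the bracket; terminates since lambda_bar is finite
        lam_hi *= 2.0
    lam_lo = 0.0
    while lam_hi - lam_lo > tol:
        mid = 0.5 * (lam_lo + lam_hi)
        if eval_Vc(mid) > gamma:
            lam_lo = mid
        else:
            lam_hi = mid
    return lam_hi
```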

Lemma 11.3.7.

Suppose that Assumptions 11.2.1–11.2.5 hold. Then the constant \(\overline{\lambda }\) in (11.35) is finite, that is, \(\overline{\lambda } \in [0,\infty ).\)

Proof.

Suppose that \(\overline{\lambda } = \infty \). By Assumption 11.2.5(b), there exists a policy π′ ∈ Π such that V c (μ, π′) < γ. Let d : = γ − V c (μ, π′) > 0. Then, for any λ > 0, we have

$$\begin{array}{rcl}{ V }_{{b}^{\lambda }}(\mu,\pi ^{\prime}) = {V }_{r}(\mu,\pi ^{\prime}) - \lambda {V }_{c}(\mu,\pi ^{\prime}) = {V }_{r}(\mu,\pi ^{\prime}) - \lambda (\gamma - d).& &\end{array}$$
(11.36)

As \(\overline{\lambda } = \infty \), by (11.35) and Lemma 11.3.5, we obtain V c (λ) > γ for all λ > 0. Therefore, V b (λ) = V r (λ) − λV c (λ) < V r (λ) − λγ. Since \({V }_{b}(\lambda ) = {V }_{{b}^{\lambda }}^{{_\ast}}(\mu ) \geq {V }_{{b}^{\lambda }}(\mu,\pi ^{\prime})\), from (11.36), we have

$$\begin{array}{rcl}{ V }_{r}(\lambda ) - \lambda \gamma > {V }_{b}(\lambda ) \geq {V }_{{b}^{\lambda }}(\mu,\pi ^{\prime}) = {V }_{r}(\mu,\pi ^{\prime}) - \lambda (\gamma - d)\ \ \forall \lambda > 0,& &\end{array}$$
(11.37)

which gives

$$\begin{array}{rcl}{ V }_{r}(\lambda ) > {V }_{r}(\mu,\pi ^{\prime}) + \lambda d\ \ \ \forall \lambda > 0.& &\end{array}$$
(11.38)

On the other hand, by Lemma 11.3.1 and Assumption 11.2.5(a), we have

$$\begin{array}{rcl} \max \{\vert {V }_{r}(\mu,\pi ^{\prime})\vert,\vert {V }_{r}(\lambda )\vert \} \leq M\left[\sum\limits_{j\in {B}^{c}}w(j)\mu (j)\right]/(1 - \beta ) :=\tilde{ M} < \infty \end{array}$$
(11.39)

for all λ > 0. The latter inequality together with (11.38) gives that

$$\begin{array}{rcl} 2\tilde{M} > \lambda d\ \ \forall \lambda > 0,& &\end{array}$$
(11.40)

which is clearly a contradiction; for instance, take \(\lambda = 3\tilde{M}/d > 0\). Hence, \(\overline{\lambda }\) is finite. □ 

11.4 Proof of Theorem 11.2.1

In this section, we prove Theorem 11.2.1 by using the Lagrange approach and the following lemma.

Lemma 11.4.8.

If there exist λ 0  ≥ 0 and π ∗  ∈ U such that

$$\begin{array}{rcl}{ V }_{c}(\mu,{\pi }^{{_\ast}}) = \gamma \ \ \mbox{ and}\ \ {V }_{{ b}^{{\lambda }_{0}}}(\mu,{\pi }^{{_\ast}}) = {V }_{{ b}^{{\lambda }_{0}}}^{{_\ast}}(\mu ),& & \\ \end{array}$$

then π ∗  is constrained optimal.

Proof.

For any π ∈ U, since \({V }_{{b}^{{\lambda }_{0}}}(\mu,{\pi }^{{_\ast}}) = {V }_{{b}^{{\lambda }_{0}}}^{{_\ast}}(\mu ) \geq {V }_{{b}^{{\lambda }_{0}}}(\mu,\pi )\), we have

$$\begin{array}{rcl}{ V }_{r}(\mu,{\pi }^{{_\ast}}) - {\lambda }_{ 0}{V }_{c}(\mu,{\pi }^{{_\ast}}) \geq {V }_{ r}(\mu,\pi ) - {\lambda }_{0}{V }_{c}(\mu,\pi ).& &\end{array}$$
(11.41)

Noting that V c (μ, π ∗ ) = γ and V c (μ, π) ≤ γ (by π ∈ U), from (11.41), we get

$$\begin{array}{rcl}{ V }_{r}(\mu,{\pi }^{{_\ast}}) \geq {V }_{ r}(\mu,\pi ) + {\lambda }_{0}(\gamma - {V }_{c}(\mu,\pi )) \geq {V }_{r}(\mu,\pi )\ \forall \pi \in U,& & \\ \end{array}$$

which means that π ∗  is constrained optimal. □ 

Proof of Theorem  11.2.1. By Lemma 11.3.7, the constant \(\overline{\lambda } \in [0,\infty )\). Thus, we shall consider the two cases: \(\overline{\lambda } = 0\) and \(\overline{\lambda } > 0\).

The case of \(\overline{\lambda } = 0\): By (11.35), there exists a sequence \({f}^{{\lambda }_{k}} \in {\mathbb{F}}_{{ \lambda }_{k}}^{{_\ast}}\) such that λ k  ↓ 0 as k → ∞. Because \(\mathbb{F}\) is compact, we may assume that \({f}^{{\lambda }_{k}} \rightarrow \tilde{ f}\) without loss of generality. Thus, by Lemma 11.3.5, we have \({V }_{c}(\mu,{f}^{{\lambda }_{k}}) \leq \gamma \) for all k ≥ 1, and then it follows from Lemma 11.3.3 that \(\tilde{f} \in U\). Moreover, for each π ∈ U, we have that \({V }_{b}({\lambda }_{k}) = {V }_{{b}^{{\lambda }_{k}}}(\mu,{f}^{{\lambda }_{k}}) \geq {V }_{{ b}^{{\lambda }_{k}}}(\mu,\pi )\). Hence, by Lemma 11.3.1(a) and (11.39),

$$\begin{array}{rcl}{ V }_{r}(\mu,{f}^{{\lambda }_{k} }) - {V }_{r}(\mu,\pi ) \geq {\lambda }_{k}[{V }_{c}(\mu,{f}^{{\lambda }_{k} }) - {V }_{c}(\mu,\pi )] \geq -2{\lambda }_{k}\tilde{M}.& &\end{array}$$
(11.42)

Letting k → ∞ in (11.42), by Lemma 11.3.3, we have

$$\begin{array}{rcl}{ V }_{r}(\mu,\tilde{f}) - {V }_{r}(\mu,\pi ) \geq 0\ \forall \pi \in U,& & \\ \end{array}$$

which means that \(\tilde{f}\) is a constrained optimal stationary policy.

The case of \(\overline{\lambda } > 0\): First, if there is some λ′ ∈ (0, ∞) satisfying V c (λ′) = γ, then there exists an associated \({f}^{\lambda ^{\prime}} \in {\mathbb{F}}_{\lambda ^{\prime}}^{{_\ast}}\) such that V c (λ′) = V c (μ, f λ′ ) = γ and \({V }_{{b}^{\lambda ^{\prime}}}^{{_\ast}}(\mu ) = {V }_{{b}^{\lambda ^{\prime}}}(\mu,{f}^{\lambda ^{\prime}})\). Thus, by Lemma 11.4.8, f λ′  is a constrained optimal stationary policy.

Now, suppose that V c (λ)≠γ for all λ ∈ (0, ). Then, as \(\overline{\lambda }\) is in (0, ), there exist two nonnegative sequences {λ k } and {δ k } such that \({\lambda }_{k} \uparrow \overline{\lambda }\) and \({\delta }_{k} \downarrow \overline{\lambda }\), respectively. Since \(\mathbb{F}\) is compact, we may take \({f}^{{\lambda }_{k}} \in {\mathbb{F}}_{{ \lambda }_{k}}^{{_\ast}}\) and \({f}^{{\delta }_{k}} \in {\mathbb{F}}_{{ \delta }_{k}}^{{_\ast}}\) such that \({f}^{{\lambda }_{k}} \rightarrow {f}^{1} \in \mathbb{F}\) and \({f}^{{\delta }_{k}} \rightarrow {f}^{2} \in \mathbb{F}\), without loss of generality. By Lemma 11.3.6, we have that \({f}^{1},{f}^{2} \in {\mathbb{F}}_{\overline{\lambda }}^{{_\ast}}\). By Lemmas 11.3.3 and 11.3.4, we obtain that V c (μ, f 1) ≥ γ and V c (μ, f 2) ≤ γ. If V c (μ, f 1) = γ ( or V c (μ, f 2) = γ), by Lemma 11.4.8, we have that f 1 (or f 2) is a constrained optimal stationary policy. Hence, to complete the proof, we shall consider the following case:

$$\begin{array}{rcl}{ V }_{c}(\mu,{f}^{1}) > \gamma \ \ \mbox{ and}\ \ \ {V }_{ c}(\mu,{f}^{2}) < \gamma.& &\end{array}$$
(11.43)

Now using f 1 and f 2, we construct a sequence of stationary policies {f n } as follows. For each n ≥ 1 and i ∈ E, let

$$\begin{array}{rcl}{ f}_{n}(i) := \left \{\begin{array}{ll} {f}^{1}(i),&\mbox{ if}\ \ i < n; \\ {f}^{2}(i),&\mbox{ if}\ \ i \geq n, \end{array} \right.& & \\ \end{array}$$

where, without loss of generality, the denumerable state space is assumed to be the set {1, 2, …}. Obviously, \({f}_{1} = {f}^{2}\) and \({\lim }_{n\rightarrow \infty }{f}_{n} = {f}^{1}\). Hence, by Lemma 11.3.3, \({\lim }_{n\rightarrow \infty }{V }_{c}(\mu,{f}_{n}) = {V }_{c}(\mu,{f}^{1})\). Since \({f}^{1},{f}^{2} \in {\mathbb{F}}_{\overline{\lambda }}^{{_\ast}}\) (just mentioned), by (11.31), we see that \({f}_{n} \in {\mathbb{F}}_{\overline{\lambda }}^{{_\ast}}\) for all n ≥ 1. As \({f}_{1} = {f}^{2}\), by (11.43), we have \({V }_{c}(\mu,{f}_{1}) < \gamma \). If there exists n  ∗  such that \({V }_{c}(\mu,{f}_{{n}^{{_\ast}}}) = \gamma \), then by Lemma 11.4.8 and \({f}_{{n}^{{_\ast}}} \in {\mathbb{F}}_{\overline{\lambda }}^{{_\ast}}\), \({f}_{{n}^{{_\ast}}}\) is a constrained optimal stationary policy. Thus, in the remainder of this section, we may assume that V c (μ, f n ) ≠ γ for all n ≥ 1. If V c (μ, f n ) < γ for all n ≥ 1, then \({\lim }_{n\rightarrow \infty }{V }_{c}(\mu,{f}_{n}) = {V }_{c}(\mu,{f}^{1}) \leq \gamma \), which contradicts (11.43). Thus, there exists some n ≥ 1 such that V c (μ, f n ) > γ, which together with \({V }_{c}(\mu,{f}_{1}) < \gamma \) gives the existence of some \(\tilde{n}\) such that

$${V }_{c}(\mu,{f}_{\tilde{n}}) < \gamma \ \ \mbox{ and}\ \ \ {V }_{c}(\mu,{f}_{\tilde{n}+1}) > \gamma.$$
(11.44)

Obviously, the stationary policies \({f}_{\tilde{n}}\) and \({f}_{\tilde{n}+1}\) differ in at most the state \(\tilde{n}\). Here, it should be pointed out that \(\tilde{n}\) must be in B c. Indeed, if \(\tilde{n} \in B\), then \({f}_{\tilde{n}}\) and \({f}_{\tilde{n}+1}\) coincide on B c and \({V }_{c}(\tilde{n},{f}_{\tilde{n}}) = {V }_{c}(\tilde{n},{f}_{\tilde{n}+1}) = 0\), which implies that \({V }_{c}(\mu,{f}_{\tilde{n}}) = {V }_{c}(\mu,{f}_{\tilde{n}+1})\), contradicting (11.44).

For any p ∈ [0, 1], using the stationary policies \({f}_{\tilde{n}}\) and \({f}_{\tilde{n}+1}\), we construct a randomized stationary policy φp as follows. For each i ∈ E,

$$\begin{array}{rcl}{ \varphi }^{p}(a\mid i) = \left \{\begin{array}{ll} p, &\mbox{ if}\ \ a = {f}_{\tilde{n}}(\tilde{n})\ \mbox{ when}\ i =\tilde{ n}, \\ 1 - p,&\mbox{ if}\ \ a = {f}_{\tilde{n}+1}(\tilde{n})\ \mbox{ when}\ i =\tilde{ n}, \\ 1, &\mbox{ if}\ \ a = {f}_{\tilde{n}}(i)\ \mbox{ when}\ i\neq \tilde{n}. \end{array} \right.& &\end{array}$$
(11.45)

Since \({f}_{\tilde{n}},{f}_{\tilde{n}+1} \in {\mathbb{F}}_{\overline{\lambda }}^{{_\ast}}\), by Lemma 11.3.4, we have \({V }_{{ b}^{\overline{\lambda }}}(\mu,{\varphi }^{p}) = {V }_{{ b}^{\overline{\lambda }}}^{{_\ast}}(\mu )\) for all p ∈ [0, 1]. We also have that V c (μ, φ p ) is continuous in p ∈ [0, 1]. Indeed, for any p ∈ [0, 1] and any sequence {p m } in [0, 1] such that lim m→∞  p m  = p, as in the proof of Lemma 11.3.2, we have

$$\begin{array}{rcl} {V }_{c } (i,{\varphi }^{{p}_{m} }) = \sum\limits_{a\in A(i)}{\varphi }^{{p}_{m} }(a\mid i)\left[\widetilde{c}(i,a) + \sum\limits_{j\in {B}^{c}}{V }_{c}(j,{\varphi }^{{p}_{m} })m(j\mid i,a)\right]\ \forall i \in {B}^{c}.& &\end{array}$$
(11.46)

Hence, as in the proof of Lemma 11.3.3, from (11.45) and (11.46), we obtain

$$ \begin{array}{rcl} \lim\limits_{m\rightarrow \infty }{V }_{c}(\mu,{\varphi }^{{p}_{m} }) = {V }_{c}(\mu,{\varphi }^{p}),& & \\ \end{array}$$

and so V c (μ, φp) is continuous in p ∈ [0, 1].

Finally, let p 0 = 0 and p 1 = 1. Then, \({V }_{c}(\mu,{\varphi }^{{p}_{0}}) = {V }_{c}(\mu,{f}_{\tilde{n}+1}) > \gamma \) and \({V }_{c}(\mu,{\varphi }^{{p}_{1}}) = {V }_{c}(\mu,{f}_{\tilde{n}}) < \gamma \). Therefore, by the continuity of V c (μ, φp) in p ∈ [0, 1] there exists a p  ∗  ∈ (0, 1) such that \({V }_{c}(\mu,{\varphi }^{{p}^{{_\ast}} }) = \gamma \). Since \({V }_{{ b}^{\overline{\lambda }}}(\mu,{\varphi }^{{p}^{{_\ast}} }) = {V }_{{ b}^{\overline{\lambda }}}^{{_\ast}}(\mu )\), by Lemma 11.4.8, we have that \({\varphi }^{{p}^{{_\ast}} }\) is a constrained optimal stationary policy, which randomizes between the two stationary policies \({f}_{\tilde{n}}\) and \({f}_{\tilde{n}+1}\) that differ in at most the state \(\tilde{n} \in {B}^{c}\). □
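
Numerically, this last step of the proof amounts to a one-dimensional root-finding problem: by the continuity of p ↦ V c (μ, φ p ) and the bracketing V c (μ, φ 0 ) > γ > V c (μ, φ 1 ), a value p ∗ with V c (μ, φ p∗ ) = γ can be approximated by bisection. The following sketch is our own illustration; eval_cost is an assumed routine returning V c (μ, φ p ), e.g., via the Monte Carlo estimator above or by solving the policy-evaluation equation (11.46).

```python
def find_p_star(eval_cost, gamma, tol=1e-6):
    """Bisection for p* in (0, 1) with V_c(mu, phi^{p*}) = gamma, exploiting the
    continuity of p -> V_c(mu, phi^p) and the bracketing
    V_c(mu, phi^0) > gamma and V_c(mu, phi^1) < gamma.
    `eval_cost(p)` is an assumed routine returning V_c(mu, phi^p)."""
    lo, hi = 0.0, 1.0                   # cost > gamma at lo, cost < gamma at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eval_cost(mid) > gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```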