1 Introduction

Semi-Markov decision processes (SMDPs), as an important class of stochastic control problems, have been widely studied [1, 10, 11, 15, 20, 28, 31]. The commonly used criteria for SMDPs are the finite horizon expected criterion [8, 14, 26, 28], the expected discounted criterion [1, 3, 10, 13, 25, 27], and the average criterion [10, 23, 31,32,33]. These criteria are linear (i.e. risk-neutral) utility functions of the total rewards: they focus only on the expected total rewards of a system over a fixed or random horizon, and therefore cannot reflect the decision maker's attitude toward risk.

To exhibit the attitude of a decision maker in the face of risk (i.e. risk-seeking or risk-averse), risk-sensitive criteria, which include the exponential utility criterion, have been considered for discrete-time MDPs (DTMDPs) [2, 4,5,6, 21, 22] and continuous-time MDPs (CTMDPs) [7, 9, 30, 34]. Specifically, Jaquette [21] first introduced the exponential utility to DTMDPs. For the resulting optimization problem, Chung and Sobel [6] established the corresponding optimality equation by means of the Banach fixed point theorem. Cavazos-Cadena and Montes-De-Oca [4, 5] gave conditions ensuring the existence of optimal policies for positive dynamic programming, with a finite state space in [4] and a denumerable one in [5]. Jaśkiewicz [22] considered Borel state and action spaces, and established the convergence of the n-stage optimal expected total reward and the existence of an optimal stationary policy. Bäuerle and Rieder [2] considered a problem more general than the classical risk-sensitive optimization problem, namely minimizing a certainty equivalent. They solved the optimization problem by an ordinary MDP with an extended state space, and proved the existence of an optimal policy under suitable conditions. For the case of CTMDPs, Ghosh and Saha [7] studied risk-sensitive control in a discrete state space. They obtained the value function as a solution to the Hamilton-Jacobi-Bellman equation, and proved the existence of an optimal Markov control for the finite horizon problem and of an optimal stationary control for the infinite horizon problem. Wei [30] dealt with the risk-sensitive cost criterion for finite horizon CTMDPs with a denumerable state space and a Borel action space. Under suitable conditions, he established the Feynman-Kac formula and the existence of an optimal deterministic Markov policy. For the same problem as in [30], Guo, Liu and Zhang [9] investigated the case in which the transition and cost rates may be unbounded. They proved that the value function is the unique solution to the optimality equation, and showed the existence of an optimal policy via the Feynman-Kac formula. In [34], the uniformization technique was applied to reduce the CTMDP problem with exponential utility to an equivalent DTMDP. Recently, Huang, Lian and Guo [17] considered the risk-sensitive unconstrained and constrained problems for SMDPs with a Borel state space, unbounded cost rates and general utility functions, and, using the occupation measure approach, established the Bellman equation and proved the existence of optimal policies under some continuity-compactness conditions.

The aforementioned works on risk-sensitive MDPs share two common features: the horizon is either finite or infinite, and the control model is a DTMDP or a CTMDP. In many real-world situations, however, models arising in ruin problems [20, 29], reliability [20, 24], and maintenance [20] involve a random horizon and are described as SMDPs. Moreover, compared to DTMDPs and CTMDPs (under stationary policies), SMDPs are more general stochastic optimal control models, in which the holding time of a system state is allowed to follow an arbitrary probability distribution. This is the main reason for considering a random horizon for SMDPs in this paper.

Compared with the existing work on risk-sensitive SMDPs in [17], this paper has the following new features. First, in order to make the conclusions fit practical situations more closely, we focus on a time horizon given by a random first passage time, which is more general than the horizons considered in [17]. Second, since the random first passage time is considered in our control model, by Remark 4.2 in [17] the occupation measure approach is not suitable for our model, because the definition of the occupation measure relies on the discount factor. Instead, we use a so-called minimal nonnegative solution approach to establish the optimality equation and prove the existence of optimal policies. Third, we are mainly concerned with the existence and the calculation of optimal policies, whereas the purpose of [17] is to establish conditions for the existence of optimal policies. Accordingly, we develop a value iteration algorithm to compute the value function and an optimal policy, which is a new and key feature of our paper.

To the best of our knowledge, the first passage risk-sensitive optimality problem for SMDPs has not been studied yet.

Motivated by the above discussion, we investigate in this paper the first passage risk-sensitive optimality problem for SMDPs. Since we focus on both existence conditions and computational algorithms for an optimal policy, we limit the choice of risk-sensitive criteria to the exponential utility criterion (e.g. [2, 6, 21, 34]), which maximizes the expected exponential utility of the total rewards earned before the state of the system enters the target set. More precisely, in order to ensure the existence of an optimal stationary policy, we first impose a standard regularity condition guaranteeing that the state process is non-explosive, similar to those given in [13,14,15, 18] for SMDPs (see Lemma 1). Second, whereas [13,14,15, 18] are mainly limited to a denumerable state space and finite action sets, we consider more general Borel state and action spaces, and therefore introduce a new continuity-compactness condition (see Assumption 2). Under the regularity and continuity-compactness conditions, we establish the corresponding optimality equation and prove that the value function is a solution to this optimality equation. Moreover, we show the existence of an exponential utility optimal stationary policy by using an invariant embedding technique (see Assumption 1). Furthermore, a value iteration algorithm for computing the value function as well as the optimal policies, in a finite number of iterations, is provided. Finally, an example illustrating the computation of an optimal stationary policy and the value function is given.

The rest of this paper is organized as follows. In Sect. 2, we introduce the semi-Markov decision model and state the first passage exponential utility optimality problem. The main optimality results are stated and proved in Sect. 3. In Sect. 4, an example is provided to illustrate the computational aspects of an optimal policy.

2 Model Description

Models of first passage exponential utility SMDPs are defined by

$$\begin{aligned} \{S,A,(A(x),x\in S),Q(u,y| x,a), B,r(x,a)\} \end{aligned}$$
(1)

with the following components:

  1. (a)

    S denotes a Borel state space, endowed with the Borel \(\sigma \)-algebra \(\mathcal {B}(S)\).

  2. (b)

    A denotes a Borel action space, endowed with the Borel \(\sigma \)-algebra \(\mathcal {B}(A)\).

  3. (c)

    \(A(x)\in \mathcal {B}(A)\) represents the set of allowable actions when the system is at state \(x\in S\). \(K:=\{(x,a)|x\in S,a\in A(x)\}\) represents the set of all feasible pairs of states and actions.

  4. (d)

    \(Q(\cdot ,\cdot | x,a)\) is a semi-Markov kernel on \(R^{+}\times S\) given K, where \(R^{+}:=[0,\infty )\). For any \(u\in R^{+},D\in \mathcal {B}(S)\), when the action \(a\in A(x)\) is taken in state x, \(Q(u,D|x,a)\) denotes the joint probability that the holding time of the system is no more than \(u\in R^{+}\) and the state x changes into the set D. The semi-Markov kernel \(Q(\cdot ,\cdot | x,a),(x,a)\in K\) has the following features:

    1. (i)

      For any \(D\in \mathcal {B}(S)\), \(Q(\cdot ,D| x,a)\) is a non-decreasing, right continuous function from \(R^{+}\) to [0, 1] with \(Q(0,D| x,a)=0\) .

    2. (ii)

      For any \(u\in R^{+}\), \(Q(u,\cdot | x,a)\) is a sub-stochastic kernel on the state space S.

  5. (e)

    B is the target set, which is a measurable subset of S, and usually represents the set of failure (or ruin) states of a system.

  6. (f)

    \(r(x,a)\) denotes the reward rate, which is assumed to be a nonnegative measurable function on K such that \(r(x,\cdot )\equiv 0\) for all \( x\in B\).

The first passage SMDP with exponential utility evolves as follows. When the system state is \(x_{0}\in B^{c}\) at time \(t_{0}=0\), where \(B^{c}\) denotes the complement of B, the decision maker selects an admissible action \(a_{0}\) from the action set \(A(x_{0})\). The system then stays in the state \(x_{0}\) up to time \(t_{1}\), at which point it jumps to state \(x_{1}\) with probability \(p(x_{1}|x_{0},a_{0})\) (the transition probability induced by the semi-Markov kernel Q), and earns a reward \(r(x_{0},a_{0})(t_{1}-t_{0})\). If the state \(x_{1}\in B\), the system remains in the target set B forever. If the state \(x_{1}\in B^{c}\), a new decision epoch \(t_{1}\) comes along. Then, based on the present state \(x_{1}\) and the previous state \(x_{0}\), the decision maker chooses an action \(a_{1}\in A(x_{1})\) and the process is repeated. Thus, during its evolution, the system receives a series of rewards. The decision maker aims at maximizing the exponential utility of the total rewards earned before the state of the system first reaches the target set B.

Let

$$\begin{aligned} h_{k}:= & {} (x_{0},a_{0},t_{1},x_{1},a_{1},\ldots ,t_{k},x_{k}), \end{aligned}$$
(2)

be an admissible history up to the k-th decision epoch, where \(t_{m+1}\ge t_{m}\ge 0\), \(x_{m}\in S,a_{m}\in A(x_{m})\) for \(m=0,1,\ldots ,k-1,x_{k}\in S\). From the evolution of SMDPs, we know that \(t_{k+1}\) (\(k\ge 0\)) denotes the \((k+1)\)-th decision epoch, \(x_{k}\) denotes the state of the system on \([t_{k},t_{k+1})\), \(a_{k}\) denotes an action, which is chosen by the decision maker at time \(t_{k}\). \(\theta _{k+1}:=t_{k+1}-t_{k}\) denotes the sojourn time at state \(x_{k}\), which may follow any given probability distribution.

The set of all admissible histories \(h_{k}\) is denoted by \(H_k\), that is \(H_{0}:=S\) and \(H_{k}:=(S\times A \times (0,+\infty ])^{k}\times S \).

For the sake of the optimality problem, we shall pay close attention to some classes of policies that we introduce below.

Definition 1

A sequence \(\pi =\{\pi _{k},k\ge 0\}\) is called a stochastic history-dependent policy if, for any \(k=0,1,2,\ldots \), the stochastic kernel \(\pi _{k}\) on \(A(x_{k})\) given \(H_{k}\) satisfies

$$\begin{aligned} \pi _{k}(A(x_{k})|h_{k})=1\ \ \text {for any}\ \ h_{k}\in H_{k}. \end{aligned}$$

Denote by \(\varPi \) the set of all stochastic history-dependent policies, by \(\phi \) the set of all stochastic kernels \(\varphi \) on A given S such that \(\varphi (A(x)|x)=1\) for all \(x\in S\), and by F the family of all Borel measurable functions f from S to A such that \(f(x)\in A(x)\) for all \(x \in S\).

Definition 2

A policy \(\pi =\{\pi _{k}\}\in \varPi \) is called stochastic Markov if there exists a sequence of stochastic kernels \(\{\varphi _{k}\}\) such that \(\pi _{k}(\cdot |h_{k})=\varphi _{k}(\cdot |x_{k})\) for \(k\ge 0,h_{k}\in H_{k}\), and \(\varphi _{k}\in \phi \). For simplicity, we denote such a policy by \(\pi =\{\varphi _{k}\}\).

A stochastic Markov policy \(\pi =\{\varphi _{k}\}\) is called stochastic stationary if all the \(\varphi _{k}\) are independent of k. Such a policy is denoted by \(\varphi \), for simplicity.

A stochastic Markov policy \(\pi =\{\varphi _{k}\}\) is called deterministic Markov if each \(\varphi _{k}(\cdot |x_{k})\) is concentrated at \( f_{k}(x_k)\in A(x_{k})\) for some measurable functions \(\{f_{k}\}\) with \(k\ge 0,x_{k}\in S\), and \(f_{k}\in F\).

A deterministic Markov policy \(\pi =\{f_{k}\}\) is called deterministic stationary if all the measurable functions \(f_{k}\) are independent of k. For simplicity, such a policy is denoted by f.

The class of all stochastic Markov, stochastic stationary, deterministic Markov, and deterministic stationary policies are, respectively, denoted by \(\varPi _{RM}, \varPi _{RS}\), \(\varPi _{DM}\) and \(\varPi _{DS}\). Clearly, \(\phi =\varPi _{RS}\subset \varPi _{RM}\subset \varPi \) and \(F=\varPi _{DS}\subset \varPi _{DM}\subset \varPi \).

For the sake of mathematical rigor, we need to construct a well-suited probability space. Define a sample space \(\varOmega :=\{(x_{0},a_{0},t_{1},x_{1},a_{1}, \ldots ,t_{k},x_{k},a_{k},\ldots )|\) \(x_0\in S\), \(a_{0}\in A(x_{0}), t_{l}\in (0,\infty ],x_{l}\in S,a_{l}\in A(x_{l}) \) for each \(1\le l\le k, k\ge 1\)}. Let \(\mathcal {F}\) be the Borel \(\sigma \)-algebra of the sample space \(\varOmega \). For any \(\omega :=(x_{0},a_{0},t_{1},x_{1},a_{1}\), \( \ldots ,t_{k},x_{k},a_{k},\ldots )\in \varOmega \), we define the random variables \(T_{k},X_k,A_{k}\) on \((\varOmega ,\mathcal {F})\) as follows:

$$\begin{aligned} T_{k}(\omega ):=t_{k}, X_k(\omega ):=x_k, A_k(\omega ):=a_k,T_{\infty }(\omega ):=\lim _{k\rightarrow \infty }T_{k}(\omega ). \end{aligned}$$
(3)

In what follows, for the purpose of simplicity, we omit the argument \(\omega \).

Moreover, we define the state process \(\{x_t,t\ge 0\}\) and the action process \(\{A_t,t\ge 0\}\) on \((\varOmega ,\mathcal {F})\) by

$$\begin{aligned} x_t:= & {} \sum _{k\ge 0}I_{\{T_{k}\le t<T_{k+1}\}}X_{k}+\varDelta I_{\{t\ge T_\infty \}},\\ A_t:= & {} \sum _{k\ge 0}I_{\{T_{k}\le t<T_{k+1}\}}A_{k}+a_{\varDelta } I_{\{t\ge T_\infty \}}, \end{aligned}$$

where \(I_{D}(\cdot )\) denotes the indicator function of the set D, \(\varDelta \not \in S\) is a cemetery state, and \(a_{\varDelta }\) is an isolated point.

For any policy \(\pi \in \varPi \) and initial state \(x\in S\), in the light of the Ionescu Tulcea theorem (e.g., Proposition C.10 in [11]), there exists a unique probability measure \(P_{x}^{\pi }\) on the measurable space \((\varOmega ,\mathcal {F})\) such that

$$\begin{aligned}&P_{x}^{\pi }(A_{k}\in \varGamma |T_{0},X_{0},A_{0},\ldots ,T_{k},X_{k})=\pi _{k}(\varGamma |T_{0},X_{0},A_{0},\ldots ,T_{k},X_{k}),\\&P_{x}^{\pi }(T_{k+1}-T_{k}\le u, X_{k+1}\in D|T_{0},X_{0},A_{0},\ldots ,T_{k},X_{k},A_{k})=Q(u,D|X_{k},A_{k}),\nonumber \end{aligned}$$
(4)

for each \(u\in R^{+},\varGamma \in \mathcal {B}(A), D\in \mathcal {B}(S),k\ge 0\). We shall use \(\mathbb {E}_{x}^{\pi }\) to represent the expectation operator with respect to \(P_{x}^{\pi }\).

To avoid the possibility that the system generates an infinite number of jumps within a fixed finite horizon, we need to impose the following condition.

Assumption 1

For any \(\pi \in \varPi ,x\in S\), \(P^\pi _{x}(T_\infty =\infty )=1\).

To ease the verification of Assumption 1, we state the following sufficient condition for its validity.

Lemma 1

If there exist constants \(\delta >0\) and \(\varepsilon >0\) such that \(Q(\delta ,S| x,a)\le 1-\varepsilon \) for all \((x,a)\in K\), then Assumption 1 holds.

Proof

The proof follows directly from Proposition 2.1 in [14].    \(\square \)

Remark 1

  1. (a)

    A key feature of Lemma 1 is that the condition is imposed on the semi-Markov kernel, and can be directly verified.

  2. (b)

    Lemma 1 gives the standard regularity condition, similar to those used for the classical expected criteria for SMDPs; see, for instance, [13,14,15, 18]. A minimal numerical check of this condition is sketched after this remark.
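
As noted in Remark 1(a), the condition of Lemma 1 can be checked directly on the semi-Markov kernel. The following minimal Python sketch does so for a finite model in which the kernel factors as \(Q(du,dy|x,a)=p(dy|x,a)F(du|x,a)\), so that \(Q(\delta ,S|x,a)=F(\delta |x,a)\); the sojourn-time distributions below are illustrative assumptions, not data from this paper.

```python
import math

# Hypothetical sojourn-time c.d.f.s F(u | x, a): uniform on [0, c] and exponential(lam).
def uniform_cdf(c):
    return lambda u: min(max(u / c, 0.0), 1.0)

def exponential_cdf(lam):
    return lambda u: 1.0 - math.exp(-lam * u)

# Illustrative kernel data: one c.d.f. per admissible state-action pair (x, a) in K.
sojourn_cdf = {(1, "a1"): uniform_cdf(30.0),
               (2, "a2"): exponential_cdf(0.11)}

def regularity_holds(delta, eps):
    """Check Q(delta, S | x, a) = F(delta | x, a) <= 1 - eps for all (x, a) in K."""
    return all(F(delta) <= 1.0 - eps for F in sojourn_cdf.values())

print(regularity_holds(delta=1.0, eps=0.05))   # True for these illustrative data
```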

The random variable \(\tau _{B}\), given by

$$\begin{aligned} \tau _{B}={\left\{ \begin{array}{ll}\inf \{t\ge 0:x_{t}\in B\},&{}if \ \ \{t\ge 0:x_{t}\in B\}\ne \emptyset ;\\ +\infty ,&{} { otherwise}. \end{array}\right. } \end{aligned}$$
(5)

represents the first passage time at which the state process \(\{x_{t}, t\ge 0\}\) first enters the target set B.

For any \(x\in S\) and \(\pi \in \varPi \), we define the first passage exponential utility criterion by

$$\begin{aligned} V^{\pi }(x):=E^{\pi }_{x}\Big (e^{-\gamma \int ^{\tau _{B}}_{0}r(x_{t},A_{t})dt}\Big ), \end{aligned}$$
(6)

where \(\gamma >0\) represents the risk aversion coefficient, which expresses the degree of the decision maker's aversion to risk in the total rewards earned before the state of the system first enters the target set.
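
To make criterion (6) concrete, the following Python sketch estimates \(V^{f}(x)\) by Monte Carlo simulation for a small finite model under a deterministic stationary policy f: it simulates sojourn times and embedded transitions until the target set is reached and averages the exponential utility of the accumulated reward. All model primitives (reward rates, sojourn-time samplers, embedded transition probabilities) are hypothetical placeholders, not taken from this paper.

```python
import math
import random

# Hypothetical finite first passage SMDP: state 0 is the target, states 1 and 2 are transient.
B = {0}                                    # target set
gamma = 1.0                                # risk-aversion coefficient

# Placeholder primitives under a fixed deterministic stationary policy f.
reward_rate = {1: 0.01, 2: 0.02}           # r(x, f(x)); r vanishes on B
trans_prob  = {1: {0: 0.5, 2: 0.5},        # embedded kernel p(y | x, f(x))
               2: {0: 0.3, 1: 0.7}}
sojourn     = {1: lambda: random.uniform(0.0, 30.0),   # holding-time samplers
               2: lambda: random.expovariate(0.11)}

def one_path_utility(x0):
    """Simulate one trajectory until it enters B and return exp(-gamma * total reward)."""
    x, total = x0, 0.0
    while x not in B:
        theta = sojourn[x]()                          # sojourn time in the current state
        total += reward_rate[x] * theta               # reward earned during the stay
        states, probs = zip(*trans_prob[x].items())
        x = random.choices(states, weights=probs)[0]  # jump to the next state
    return math.exp(-gamma * total)

def estimate_V(x0, n_paths=100_000):
    """Monte Carlo estimate of V^f(x0) as defined in (6)."""
    return sum(one_path_utility(x0) for _ in range(n_paths)) / n_paths

print(estimate_V(1))
```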

Definition 3

A policy \(\pi ^{*}\in \varPi \) is called an optimal policy, if

$$\begin{aligned} V^{\pi ^{*}}(x)= \sup _{\pi \in \varPi }V^{\pi }(x), x\in S. \end{aligned}$$
(7)

The corresponding value function is given by

$$\begin{aligned} V^{*}(x):=\sup _{\pi \in \varPi }V^{\pi }(x), x\in S. \end{aligned}$$
(8)

Remark 2

Note that for any \(\pi \in \varPi \) and initial state \(x\in B\), in view of (5), (6) and (8), we have \(\tau _{B}=0\) and \(V^{*}(x)=V^{\pi }(x)=1\). In order to avoid this trivial case, our arguments consider only the case \(x\in B^{c}\).

3 Main Results

In this section, we will state the main results concerning the first passage exponential utility optimality problem for SMDPs.

Notation: Let \(\mathcal {V}_{m}\) denote the set of all Borel measurable functions from S to [0, 1]. For any \(x \in B^{c}, V\in \mathcal {V}_{m}, \varphi \in \phi ,a\in A(x)\), we define the operators \(M^{a}V, M^{\varphi }V\) and MV as follows:

$$\begin{aligned} M^{a}V(x):= & {} \int _{B}\int ^{+\infty }_{0}e^{-\gamma r(x,a)u}Q(du,dy| x,a)\\&+\int _{ B^{c} }\int ^{+\infty }_{0}e^{-\gamma r(x,a)u}V(y)Q(du,dy| x,a),\\ M^{\varphi }V(x):= & {} \int _{ A(x) }\varphi (da|x)M^{a}V(x),\\ MV(x):= & {} \sup _{a\in A(x)}M^{a}V(x). \end{aligned}$$

For any \(\varphi \in \phi \), we also define the operators \((M^{n}V,n\ge 1),((M^{\varphi })^{n}V,n\ge 1)\) as follows:

$$ M^{1}V:=MV,\quad (M^{\varphi })^{1}V:=M^{\varphi }V,\quad M^{n+1}V=M(M^{n}V), \quad (M^{\varphi })^{n+1}V=M^{\varphi }((M^{\varphi })^{n}V),\quad n\ge 1.$$
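
When the semi-Markov kernel factors as \(Q(du,dy|x,a)=p(dy|x,a)F(du|x,a)\), with the holding time independent of the next state, the operator reduces to \(M^{a}V(x)=L(x,a)\big [p(B|x,a)+\int _{B^{c}}V(y)p(dy|x,a)\big ]\), where \(L(x,a):=\int _{0}^{\infty }e^{-\gamma r(x,a)u}F(du|x,a)\). The Python sketch below implements \(M^{a}\) and \(M\) under this factorization for a finite model; all the model data are illustrative assumptions only.

```python
# One-step operators M^a and M for a finite model whose kernel factors as
# Q(du, dy | x, a) = p(dy | x, a) F(du | x, a).  All data here are illustrative.
model = {
    "B": {0},                                  # target set
    "A": {1: ["a", "b"]},                      # admissible actions on B^c
    "r": {(1, "a"): 0.01, (1, "b"): 0.02},     # reward rates r(x, a)
    "p": {(1, "a"): {0: 0.5, 1: 0.5},          # embedded kernel p(y | x, a)
          (1, "b"): {0: 0.7, 1: 0.3}},
    # Laplace transforms s -> E[e^{-s theta}] of the sojourn times; an
    # exponential(lam) holding time gives lam / (lam + s).
    "laplace": {(1, "a"): lambda s: 0.1 / (0.1 + s),
                (1, "b"): lambda s: 0.2 / (0.2 + s)},
}

def M_a(V, x, a, m, gamma=1.0):
    """M^a V(x) = L(x,a) * [ p(B | x,a) + sum_{y in B^c} p(y | x,a) V(y) ]."""
    L = m["laplace"][(x, a)](gamma * m["r"][(x, a)])
    return L * sum(q * (1.0 if y in m["B"] else V[y]) for y, q in m["p"][(x, a)].items())

def M(V, x, m, gamma=1.0):
    """M V(x) = max over a in A(x) of M^a V(x)."""
    return max(M_a(V, x, a, m, gamma) for a in m["A"][x])

V = {1: 1.0}                 # the constant function 1 on B^c (i.e. V_{-1})
print(M(V, 1, model))        # one application of the operator M
```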

Since the state and action spaces are Borel spaces, in order to ensure the existence of optimal policies we need, following [28, 31, 32], to impose the following continuity-compactness condition, which is trivially satisfied in the case of a denumerable state space and finite action sets A(x), \(x\in S\).

Assumption 2

  1. (a)

     For any \(x\in B^{c}\), A(x) is compact;

  2. (b)

    For each fixed \(V\in \mathcal {V}_{m}\), \(\int _{y\in S }\int ^{+\infty }_{0}e^{-\gamma r(x,a)u}V(y)Q(du,dy| x,a)\) is upper semicontinuous and inf-compact on K.

Lemma 2

Suppose that Assumptions 1 and 2 hold. Then the operators \(M^{a}\) and M have the following properties:

  1. (a)

    For any \(U,V\in \mathcal {V}_{m}\), if \(U\ge V\), then \(M^{a}U(x)\ge M^{a}V(x)\) and \(MU(x)\ge MV(x)\) for any \(x\in S\) and \( a\in A(x)\).

  2. (b)

    For any \(V\in \mathcal {V}_{m}\), there exists a policy \(f\in \varPi _{DS}\) such that \(MV(x)=M^{f}V(x)\) for any \( x\in S\).

Proof

  1. (a)

     This statement follows from the definitions of operators \(M^{a}\) and M.

  2. (b)

    Under Assumptions 1 and 2, invoking the measurable selection theorem (Theorem B.6 in [28]), we conclude that there is a stationary policy \(f\in F\) such that \(M^{f}V(x)=MV(x)=\sup _{a\in A(x)}M^{a}V(x)\) for each \(x\in S\).

   \(\square \)

Since the state process \(\{x_t,t\ge 0\}\) is non-explosive and the reward rate is nonnegative, in view of the monotone convergence theorem, we can rewrite \(V^{\pi }(x)\) as follows:

$$\begin{aligned} V^{\pi }(x)= & {} E^{\pi }_{x}\Big (e^{-\gamma \int ^{\tau _{B}}_{0}r(x_{t},A_{t})dt}\Big )\nonumber \\= & {} E^{\pi }_{x}\Big (e^{-\gamma \sum _{m=0}^{\infty }\int ^{T_{m+1}}_{T_{m}}I_{\{\tau _{B}>t\}}r(x_{t},A_{t})dt}\Big )\nonumber \\= & {} E^{\pi }_{x}\Big (e^{-\gamma \sum _{m=0}^{\infty }\int ^{T_{m+1}}_{T_{m}}I_{\{\bigcap _{k=0}^{m}\{x_{T_{k}}\in B^{c}\}\}}r(x_{t},A_{t})dt}\Big )\\= & {} \lim _{n\rightarrow \infty }E^{\pi }_{x}\Big (e^{-\gamma \sum _{m=0}^{n}\int ^{T_{m+1}}_{T_{m}}I_{\{\bigcap _{k=0}^{m}\{x_{T_{k}}\in B^{c}\}\}}r(x_{t},A_{t})dt}\Big ).\nonumber \end{aligned}$$
(9)

We shall find it essential to define the sequence \(\{V^{\pi }_{n}(x),n=-1,0,1,\ldots \}\) by

$$\begin{aligned} V^{\pi }_{-1}(x):= & {} 1,\\ V^{\pi }_{n}(x):= & {} E^{\pi }_{x}\Big (e^{-\gamma \sum _{m=0}^{n}\int ^{T_{m+1}}_{T_{m}}I_{\{\bigcap _{k=0}^{m}\{x_{T_{k}}\in B^{c}\}\}}r(x_{t},A_{t})dt}\Big ). \end{aligned}$$

Obviously, \(V^{\pi }_{n}(x)\ge V^{\pi }_{n+1}(x)\) for any \(n\ge -1\) and \(\lim _{n\rightarrow \infty }V^{\pi }_{n}(x)=V^{\pi }(x)\) for all \(x\in B^{c}\).

Proposition 1

For each \(\pi =\{\pi _{0},\pi _{1},\ldots \}\in \varPi \) and \(x \in S\), there exists a policy \(\pi ^{'}=\{\varphi _{0},\varphi _{1},\ldots \}\in \varPi _{RM}\) satisfying \(V^{\pi }(x)=V^{\pi ^{'}}(x)\).

Proof

Since \(V^{\pi }(x)=E^{\pi }_{x}\Big (e^{-\gamma \sum _{m=0}^{\infty }\int ^{T_{m+1}}_{T_{m}}I_{\{\bigcap _{k=0}^{m}\{x_{T_{k}}\in B^{c}\}\}}r(x_{t},A_{t})dt}\Big )\) in (9), to prove this proposition we need to prove that, for each \(x\in S\), there exists a randomized Markov policy \(\pi ^{'}=\{\varphi _{0},\varphi _{1},\ldots \}\in \varPi _{RM}\) such that

$$\begin{aligned}&P_{x}^{\pi ^{'}}(X_{k}\in D,T_{k+1}-T_{k}>u,A_{k}\in \varGamma )\nonumber \\= & {} P_{x}^{\pi }(X_{k}\in D,T_{k+1}-T_{k}>u,A_{k}\in \varGamma ) \end{aligned}$$

for all \(k=0,1,\ldots \), \(u\in R^{+}\), \(D\in \mathcal {B}(S)\), and \(\varGamma \in \mathcal {B}(A)\).

Thus, in view of property (4), it suffices to show that

$$\begin{aligned} P_{x}^{\pi ^{'}}(X_{k}\in D,A_{k}\in \varGamma )=P_{x}^{\pi }(X_{k}\in D,A_{k}\in \varGamma ). \end{aligned}$$
(10)

Along the same arguments as in the proof of Theorem 5.5.1 in [28], one can prove (10) by induction on the integer k.    \(\square \)

Proposition 1 states, in particular, that in seeking optimal policies for (7), it is sufficient to limit the search to the set of randomized Markov policies. Thus, from now on, we will limit our attention to \(\varPi _{RM}\).

The following lemma is required to establish the optimality equation.

Lemma 3

Under Assumptions 1 and 2, for any \(x \in S\), \(n\ge -1\), and \(\pi =\{\varphi _{0},\varphi _{1},\ldots \}\in \varPi _{RM}\), the following statements hold.

  1. (a)

    \(V^{\pi }_{n}\in \mathcal {V}_{m}\) and \(V^{\pi }\in \mathcal {V}_{m}\).

  2. (b)

    \(V^{\pi }_{n+1}(x)=M^{\varphi _{0}}V_{n}^{^{1}\pi }(x)\) and \(V^{\pi }(x)=M^{\varphi _{0}}V^{^{1}\pi }(x)\), with \(^{1}\pi :=\{\varphi _{1},\varphi _{2},\ldots \}\) being the 1-shift policy of \(\pi \).

    In particular, for any \(f\in F \), \(V^{f}_{n+1}(x)=M^{f}V_{n}^{f}(x)\) and \(V^{f}(x)=M^{f}V^{f}(x)\).

Proof

(a) We shall prove the first statement of (a) by induction on the integer \(n\ge -1\). The statement is trivial for \(n =-1\) since \(V^{\pi }_{-1}(x)=1 \in \mathcal {V}_{m}\) for any \(x\in S\) and \( \pi \in \varPi _{RM}\). Assume that the statement holds for all \(n\le k\); we now verify it for \(k+1\). By (4) and the property of conditional expectation, we have

$$\begin{aligned}&V^{\pi }_{k+1}(x)\\= & {} E^{\pi }_{x}\Big (e^{-\gamma \sum _{m=0}^{k+1}\int ^{T_{m+1}}_{T_{m}}I_{\{\bigcap _{k=0}^{m}\{x_{T_{k}}\in B^{c}\}\}}r(x_{t},A_{t})dt}\Big ) \\= & {} E^{\pi }_{x}[E^{\pi }_{x}[e^{ -\gamma \sum _{m=0}^{k+1}\int ^{T_{m+1}}_{T_{m}}I_{\{\bigcap _{k=0}^{m}\{x_{T_{k}}\in B^{c}\}\}}r(x_{t},A_{t})dt} |T_{0},x_{T_{0}},A_{0},T_{1},x_{T_{1}}]]\\= & {} \int _{ A(x) }\varphi _{0}(da|x)\\&\times \int _{ S }\int ^{+\infty }_{0}E^{\pi }_{x}\Big (e^{-\gamma (\int ^{T_{1}}_{0}r(x_{t},A_{t})dt+ \sum _{m=1}^{k+1}\int ^{T_{m+1}}_{T_{m}}I_{\{\bigcap _{k=1}^{m}\{x_{T_{k}}\in B^{c}\}\}}r(x_{t},A_{t})dt) }\\&|T_{0}=0,x_{T_{0}}=x, A_{0}=a,T_{1}=u,x_{T_{1}}=y\Big )Q(du,dy| x,a)\\= & {} \int _{ A(x) }\varphi _{0}(da|x)\int _{ B }\int ^{+\infty }_{0}e^{-\gamma r(x,a)u}Q(du,dy| x,a)+\int _{ A(x) }\varphi _{0}(da|x)\\&\times \int _{ B^{c} }\int ^{+\infty }_{0}E^{\pi }_{x}\Big (e^{-\gamma (\int ^{T_{1}}_{0}r(x_{t},A_{t})dt+ \sum _{m=1}^{k+1}\int ^{T_{m+1}}_{T_{m}}I_{\{\bigcap _{k=1}^{m}\{x_{T_{k}}\in B^{c}\}\}}r(x_{t},A_{t})dt) }\\&|T_{0}=0,x_{T_{0}}=x, A_{0}=a,T_{1}=u,x_{T_{1}}=y\Big )Q(du,dy| x,a)\\= & {} \int _{ A(x) }\varphi _{0}(da|x)[\int _{ B }\int ^{+\infty }_{0}e^{-\gamma r(x,a)u}Q(du,dy| x,a)\\&+\int _{ B^{c} }\int ^{+\infty }_{0}e^{-\gamma r(x,a)u}E^{^{1}\pi }_{y}\Big (e^{-\gamma \sum _{m=0}^{k}\int ^{T_{m+1}}_{T_{m}}I_{\{\bigcap _{k=0}^{m}\{x_{T_{k}}\in B^{c}\}\}}r(x_{t},A_{t})dt}\Big )\\&\times Q(du,dy| x,a)]\\= & {} \int _{ A(x) }\varphi _{0}(da|x)[\int _{ B }\int ^{+\infty }_{0}e^{-\gamma r(x,a)u}Q(du,dy| x,a)\\&+\int _{ B^{c} }\int ^{+\infty }_{0}e^{-\gamma r(x,a)u}V_{k}^{^{1}\pi }(y)Q(du,dy| x,a)]\\:= & {} M^{\varphi _{0}}V_{k}^{^{1}\pi }(x), \end{aligned}$$

which together with the induction hypothesis implies that \(V^{\pi }_{k+1}(x)\) is a measurable function and \(V^{\pi }_{k+1}(x)\le 1\). Thus, \(V^{\pi }_{n}\in \mathcal {V}_{m}\) for all \(n\ge -1\). Since the limit of a convergent sequence of measurable functions is itself a measurable function, we obtain \(\lim _{n\rightarrow \infty }V^{\pi }_{n}=V^{\pi }\in \mathcal {V}_{m}\). This concludes the proof of (a).

(b) From the proof of part (a), we can deduce that, for any \(x\in B^{c}\) and \(n\ge -1\),

$$\begin{aligned} V^{\pi }_{n+1}(x)=M^{\varphi _{0}}V_{n}^{^{1}\pi }(x). \end{aligned}$$
(11)

Letting \(n\rightarrow \infty \) in (11) and using the monotone convergence theorem, we obtain

$$V^{\pi }(x)=M^{\varphi _{0}}V^{^{1}\pi }(x).$$

In particular, for \(\pi =f\in F\), we have \( V^{f}(x)=M^{f}V^{f}(x)\).    \(\square \)

Remark 3

For any \(x\in B^{c}\) and \(f\in F\), one can use Lemma 3 to develop an efficient iteration algorithm for computing the function \(V^{f}(x)\), based on the following: \(V^{f}(x)=\lim _{n\rightarrow \infty }V^{f}_{n}(x)\), where \(V_{-1}^{f}(x):=1\) and \(V_{n+1}^{f}(x)=M^{f}V_{n}^{f}(x)\) for \(n\ge -1\).
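
A minimal Python sketch of this iteration, for a hypothetical model with a single transient state whose kernel factors as \(Q(du,dy|x,f)=p(dy|x,f)F(du|x,f)\), is given below; the closed-form fixed point is printed for comparison. All numerical data are illustrative assumptions.

```python
# Policy evaluation via the recursion of Remark 3 for a hypothetical model with one
# transient state 1, target state 0, exponential sojourn time, and a fixed policy f.
gamma, r = 1.0, 0.02          # risk coefficient and reward rate r(1, f(1))
lam = 0.2                     # parameter of the exponential holding time
p_B, p_stay = 0.7, 0.3        # embedded probabilities p(0 | 1, f) and p(1 | 1, f)

L = lam / (lam + gamma * r)   # E[exp(-gamma * r * theta)] for the exponential sojourn

V, V_prev = 1.0, None         # V_{-1}^f := 1
while V_prev is None or abs(V - V_prev) > 1e-12:
    V_prev = V
    V = L * (p_B + p_stay * V)            # V_{n+1}^f = M^f V_n^f

print(V)                                  # iterated value of V^f(1)
print(L * p_B / (1.0 - L * p_stay))       # closed-form fixed point, for comparison
```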

The following theorem states the existence of an optimality equation.

Theorem 1

Under Assumptions 1 and 2, the following statements hold.

  1. (a)

    For each \( n\ge -1\), let \(V^{*}_{n+1}:=MV^{*}_{n}\) with \(V^{*}_{-1}:=1\). Then, \(\lim _{n\rightarrow \infty }V^{*}_{n}=V^{*}\in \mathcal {V}_{m}\).

  2. (b)

    The value function \(V^{*}\) is a solution to the optimality equation \(V^{*}=MV^{*}\).

  3. (c)

    There is a policy \(f^{*} \in F\) such that \(V^{*}(x)=M^{f^{*}}V^{*}(x)\), \(x\in B^{c}\).

Proof

(a) Using Lemma 2(a) and the definition of the operator M, we obtain \(0\le V^{*}_{n+1}(x)\le V^{*}_{n}(x)\le 1\) and \(V^{*}_{n}\in \mathcal {V}_{m}, n\ge -1\), for any \(x\in B^{c}\). Thus, \(\tilde{V}:=\lim _{n\rightarrow \infty }V^{*}_{n} \in \mathcal {V}_{m}\), since the limit of a convergent sequence of measurable functions is also measurable. To complete the proof of part (a), we need to prove that \(\tilde{V}=V^{*}\).

We first show by induction on \(n\ge -1\) that for any \( x\in B^{c}\) and \(\pi =\{\varphi _{0},\varphi _{1},\ldots \}\in \varPi _{ RM}\)

$$\begin{aligned} V_{n}^{*}(x)\ge V_{n}^{\pi }(x). \end{aligned}$$
(12)

It is clear that \(V_{-1}^{*}= V_{-1}^{\pi }=1\) for any \(\pi \in \varPi _{ RM}\). Suppose that (12) holds for any \(n\le k\). By the induction hypothesis, the definition of the operator M and Lemma 3(b), we have

$$V_{k+1}^{*}(x)=MV_{k}^{*}(x)\ge MV_{k}^{^{1}{\pi }}(x)\ge M^{\varphi _{0}}V_{k}^{^{1}{\pi }}(x)=V_{k+1}^{\pi }(x).$$

Letting \(n\rightarrow \infty \) in (12), we obtain \(\tilde{V}(x)=\lim _{n\rightarrow \infty }V_{n}^{*}(x)\ge V^{\pi }(x)\) with \(\pi \in \varPi _{RM}\). Since \(\pi \) is arbitrary, we conclude that \(\tilde{V}(x)\ge V^{*}(x)\).

We now prove the reverse inequality \(\tilde{V}(x)\le V^{*}(x)\). For any \(x\in B^{c},n\ge -1\), let \(A_{n}:=\{a\in A(x)| M^{a}V_{n}^{*}(x)\ge M\tilde{V}(x)\} \) and \(A^{*}:=\{a\in A(x)| M^{a}\tilde{V}(x)= M\tilde{V}(x)\}\). By the compactness-continuity condition in Assumption 2 and the convergence \(V_{n}^{*}\downarrow \tilde{V}\), we conclude that \(A_{n}\) and \(A^{*}\) are nonempty and compact, and that \(A_{n}\downarrow A^{*}\). It follows from the measurable selection theorem (Theorem B.6 in [28]) that, for each \(n\ge 1\), there exists \(a_{n}\in A_{n}\) such that \( M^{a_{n}}V^{*}_{n}(x)= MV^{*}_{n}(x)\). Hence, using compactness and the convergence \(A_{n}\downarrow A^{*}\), we deduce that there exist an \(a^{*}\in A^{*}\) and a subsequence \(\{a_{n_{k}}\}\) of \(\{a_{n}\}\) such that \(a_{n_{k}}\rightarrow a^{*}\). Since \(V_{n}^{*}\downarrow \tilde{V}\), by Lemma 2(a), for any given \(n\ge 1\), we have

$$\begin{aligned} M^{a_{n_{k}}}V_{n_{k}}^{*}(x)\le M^{a_{n_{k}}}V_{n}^{*}(x)\ \ \forall n_{k}\ge n. \end{aligned}$$

Letting \({k}\rightarrow \infty \) and using the upper semicontinuity condition in Assumption 2 give

$$\begin{aligned} \tilde{V}(x)\le M^{a^{*}}V_{n}^{*}(x), \end{aligned}$$

which together with the convergence \(V_{n}^{*}\downarrow \tilde{V}\) implies

$$\begin{aligned} \tilde{V}(x)\le M^{a^{*}} \tilde{V}(x)\le M\tilde{V}(x). \end{aligned}$$

By Lemma 2(b), there exists a stationary policy \(f\in F\) such that

$$\tilde{V}(x)\le M\tilde{V}(x)=M^{f}\tilde{V}(x).$$

Moreover, using Lemma 2(a), Lemma 3(b) and Remark 3, we obtain

$$\tilde{V}(x)\le (M^{f})^{n}\tilde{V}(x)\le (M^{f})^{n}{V}^{f}_{-1}(x)=V^{f}_{n-1}(x). $$

Letting \(n\rightarrow \infty \), and invoking Remark 3, we obtain \(\tilde{V}(x)\le V^{f}(x)\le V^{*}(x)\), which proves the part (a) of the theorem.

(b) By virtue of Lemma 3(b), for any \(x\in B^{c}\) and \(\pi \in \varPi _{RM}\), we have

$$\begin{aligned} V^{\pi }(x)=M^{\varphi _{0}}V^{^{1}\pi }(x)\le M^{\varphi _{0}}V^{*}(x)\le MV^{*}(x). \end{aligned}$$

Taking the supremum over all policies \(\pi \in \varPi _{RM}\) implies \(V^{*}(x)\le MV^{*}(x)\).

The reverse inequality is proved as follows: From the definition of \(V_{n}^{*}\), for any \(x\in B^{c}\) and \(a\in A(x)\),

$$V_{n+1}^{*}(x)=MV_{n}^{*}(x)\ge M^{a}V_{n}^{*}(x).$$

Letting \(n\rightarrow \infty \) and using the monotone convergence theorem, we obtain

$$V^{*}(x)\ge M^{a}V^{*}(x),$$

which implies that \(V^{*}(x)\ge MV^{*}(x)\) since \(a\in A(x)\) is arbitrary. This proves \(V^{*}=MV^{*}\).

(c) The statement in (c) follows from Lemma 2.    \(\square \)

To guarantee the uniqueness of the solution to the optimality equation and the existence of optimal policies, we require the following additional condition (Assumption 3).

Assumption 3

For any \(x\in B^{c}\) and \(f\in F\), \(P_{x}^{f}(\tau _{B}<+\infty )=1\).

Remark 4

  1. (a)

    Assumption 3 means that, for any initial state \(X_{0}=x \in B^{c}\), the controlled state process \(\{x_{t}, t\ge 0\}\) eventually enters the target set B under any policy \(f\in F\).

  2. (b)

    Let \({X}_{n}:=x_{T_{n}}\), \(n=0,1,\ldots \), where \(T_{n}\) denotes the n-th jump epoch; we thus obtain the discrete-time embedded chain \(\{{X}_{n}, n\ge 0\}\). For every \(x \in B^{c}\), using Theorem 3.3 in [16], Assumption 3 can be rewritten as follows:

    $$P_{x}^{f}(\tau _{B}<+\infty )=P_{x}^{f}(\bigcup _{n\,=\,1}^{\infty }\{{X}_{n}\in B\})=1,$$

    which is equivalent to

    $$\begin{aligned} P^{f}_{x}(\bigcap _{n=1}^{\infty }\{{X}_{n}\in B^{c}\})=0. \end{aligned}$$
    (13)
  3. (c)

    Using Proposition 3.3 in [19], we also obtain a sufficient condition for Assumption 3: if there exists a constant \(\alpha >0\) such that \(\int _{ B}P(dy|x,a)\ge \alpha \) for all \(x\in B^{c}\) and \(a\in A(x)\), then Assumption 3 holds. A minimal numerical check of this condition is sketched below.
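
For a finite model, the sufficient condition of Remark 4(c) amounts to a strictly positive lower bound on the one-step probability of entering B; a minimal Python check (with a hypothetical embedded kernel, not the paper's data) follows.

```python
# Check the sufficient condition of Remark 4(c): p(B | x, a) >= alpha > 0 on B^c x A(x).
B = {0}
p = {(1, "a"): {0: 0.5, 2: 0.5},      # hypothetical embedded kernel p(y | x, a)
     (1, "b"): {0: 0.7, 2: 0.3},
     (2, "a"): {0: 0.3, 1: 0.7}}

alpha = min(sum(q for y, q in kernel.items() if y in B) for kernel in p.values())
print(alpha, alpha > 0)               # alpha = 0.3 > 0 here, so the condition holds
```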

Lemma 4

Suppose that Assumptions 1 and 3 hold.

  1. (a)

    If \(U,V\in \mathcal {V}_{m}\) satisfy \(U(x)-V(x)\le M^{f}(U-V)(x)\) for all \(x\in B^{c}\) and some \(f\in F\), then \(U(x)\le V(x)\).

  2. (b)

    For any \(f\in F\), \(V^{f}\in \mathcal {V}_{m}\) is the unique solution to the equation \(V=M^{f}V\).

Proof

(a) For any \(U,V\in \mathcal {V}_{m}\), \(x\in B^{c}\) and \(f\in F\), we first show by induction that

$$\begin{aligned} (M^{f})^{n}(U-V)(x)\le P^{f}_{x}(\bigcap _{k=1}^{n}\{{X}_{k}\in B^{c}\}), n\ge 1. \end{aligned}$$
(14)

For \(n=1\), it follows from \(U,V\in \mathcal {V}_{m}\) that

$$\begin{aligned} M^{f}(U-V)(x)= & {} M^{f}U(x)-M^{f}V(x)\\= & {} \int _{ B^{c}}\int ^{+\infty }_{0}e^{-\gamma r(x,f)u}(U-V)(y)Q(du,dy| x,f)\\\le & {} \int _{ B^{c}}\int ^{+\infty }_{0}Q(du,dy| x,f)\\= & {} P^{f}_{x}({X}_{1}\in B^{c}). \end{aligned}$$

Suppose that (14) holds for \(n=k\). Then, by using the induction hypothesis and the nonnegativity of the reward rate, we have

$$\begin{aligned} (M^{f})^{k+1}(U-V)(x)= & {} M^{f}(M^{f})^{k}(U-V)(x)\nonumber \\= & {} \int _{ B^{c}}\int ^{+\infty }_{0}e^{-\gamma r(x,f)u} (M^{f})^{k}(U-V)(y)\nonumber \\&\times Q(du,dy| x,f) \nonumber \\\le & {} \int _{ B^{c}}\int ^{+\infty }_{0}e^{-\gamma r(x,f)u} P^{f}_{y}(\bigcap _{l=1}^{k}\{X_{l}\in B^{c}\})\nonumber \\&\times Q(du,dy| x,f)\nonumber \\\le & {} \int _{ B^{c}}\int ^{+\infty }_{0} P^{f}_{y}(\bigcap _{l=1}^{k}\{X_{l}\in B^{c}\}) Q(du,dy| x,f). \end{aligned}$$
(15)

On the other hand,

$$\begin{aligned}&P^{f}_{x}(\bigcap _{l=1}^{k+1}\{{X}_{l}\in B^{c}\})\\= & {} E^{f}_{x}[I_{\{\bigcap _{l=1}^{k+1}\{{X}_{l}\in B^{c}\}\}}]\\= & {} E^{f}_{x}[E^{f}_{x}[I_{\bigcap _{l=1}^{k+1}\{{X}_{l}\in B^{c}\}}|{X}_{0},{X}_{1}]]\\= & {} \int _{ B^{c}}\int ^{+\infty }_{0}P^{f}_{x}\Big (\bigcap _{l=1}^{k+1}\{{X}_{l}\in B^{c}\}|{X}_{0}=x,{X}_{1}=y \Big )Q(du,dy| x,f) \\= & {} \int _{ B^{c}}\int ^{+\infty }_{0}P^{f}_{y}\Big (\bigcap _{l=1}^{k}\{{X}_{l}\in B^{c}\}\Big )Q(du,dy| x,f), \end{aligned}$$

which, together with (15), completes the induction and establishes (14). Iterating the hypothesis \(U(x)-V(x)\le M^{f}(U-V)(x)\) and using the monotonicity in Lemma 2(a), we then obtain, for all \(n\ge 1\),

$$\begin{aligned} U(x)-V(x)\le (M^{f})^{n}(U-V)(x)\le P^{f}_{x}(\bigcap _{k=1}^{n}\{{X}_{k}\in B^{c}\}). \end{aligned}$$
(16)

Letting \(n\rightarrow \infty \), using (13), we obtain

$$\begin{aligned} U(x)-V(x)\le P^{f}_{x}(\bigcap _{k=1}^{\infty }\{{X}_{k}\in B^{c}\})=0. \end{aligned}$$

Then, \(U(x)\le V(x)\), for \(x\in S\).

(b) For any \(x \in S\) and \(f\in F\), it follows from Lemma 3(b) that \(V^{f}\in \mathcal {V}_{m}\) satisfies the equation \(V=M^{f}V\). If U is another solution to the equation \(U=M^{f}U\) on S, then \(U(x)-V^{f}(x)=M^{f}(U-V^{f})(x)\) and \(V^{f}(x)-U(x)=M^{f}(V^{f}-U)(x)\). Applying part (a) in both directions yields \(U(x)\le V^{f}(x)\) and \(V^{f}(x)\le U(x)\), so \(U=V^{f}\), which proves the uniqueness of the solution to the equation.    \(\square \)

Theorem 2

Suppose that Assumptions 1, 2 and 3 hold. Then, the following statements hold.

  1. (a)

    The value function \(V^{*}\) is the unique solution to the optimality equation \(V^{*}=MV^{*}\).

  2. (b)

    There is a policy \(f^{*} \in F\) satisfying \(V^{*}=M^{f^{*}}V^{*}\) and \(V^{*}=V^{f^{*}}\), and such a policy \(f^{*}\) is optimal.

Proof

(a) It follows from Theorem 1(b) that \(V^{*}\) satisfies the equation \(V^{*}=MV^{*}\). Then, by Lemma 2(b), there exists a stationary policy \(f^{*}\in F\) such that \(V^{*}=M^{f^{*}}V^{*}\). Moreover, suppose that U is another solution of the equation \(U=MU\). Similarly, the existence of a policy \(f^{'}\in F\) satisfying \(U=M^{f^{'}}U\) is ensured by Lemma 2(b). Then, we have \(V^{*}-U\le M^{f^{*}}(V^{*}-U)\). Combining this inequality with Lemma 4(a) yields \(V^{*}\le U\). Similarly, we obtain \(U-V^{*}\le M^{f^{'}}(U-V^{*})\) and \(U\le V^{*}\), which implies \(U=V^{*}\), and the uniqueness of \(V^{*}\) is achieved.

(b) Since \(V^{*}\in \mathcal {V}_{m}\), for any \(x \in B^{c}\), Lemma 2 guarantees the existence of a stationary policy \(f^{*}\in F\) such that

$$V^{*}(x)=M^{f^{*}}V^{*}(x),$$

which together with Lemma 2(a), Lemma 3 and Remark 3 yields

$$\begin{aligned} V^{*}=\lim _{n\rightarrow \infty }(M^{f^{*}})^{n}V^{*}\le \lim _{n\rightarrow \infty }(M^{f^{*}})^{n}V_{-1}^{f^{*}}=\lim _{n\rightarrow \infty }V_{n-1}^{f^{*}}=V^{f^{*}}. \end{aligned}$$

This implies the optimality of \(f^{*}\).    \(\square \)

Theorem 1 leads to the following iterative algorithm for computing the value function and the corresponding optimal policies.

The value iteration algorithm procedure:

Step 1: For any \(x\in B^{c}\), set \(V_{-1}^{*}(x):=1\).

Step 2: According to Theorem 1, the value \(V_{n+1}^{*}(x)\), \(n \ge -1\), is iteratively computed as:

$$\begin{aligned} M^{a}V_{n}^{*}(x)= & {} \int _{ B }\int ^{+\infty }_{0}e^{-\gamma r(x,a)u}Q(du,dy| x,a)\\&+\int _{ B^{c} }\int ^{+\infty }_{0}e^{-\gamma r(x,a)u}V_{n}^{*}(y)Q(du,dy| x,a),\\ V_{n+1}^{*}(x)= & {} \sup _{a\in A(x)}\{M^{a}V_{n}^{*}(x)\}. \end{aligned}$$

Step 3: If \(\sup _{x\in B^{c}}|V_{n+1}^{*}(x)-V_{n}^{*}(x)|<10^{-12}\), the iteration stops. Since \(V_{n}^{*}\) is then very close to \(V_{n+1}^{*}\), one can regard \(V_{n+1}^{*}\) as a good approximation of the value function \(V^{*}\); in addition, Lemma 2 and Theorem 2 ensure the existence of a policy \(f^{*}\in F\) such that \(MV^{*}=M^{f^{*}}V^{*}\), and this policy \(f^{*}\) is optimal. Otherwise, replace n with \(n+1\) and go back to Step 2.
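
A compact Python sketch of Steps 1-3, again written for a finite model whose kernel factors as \(Q(du,dy|x,a)=p(dy|x,a)F(du|x,a)\), is given below; the model dictionary at the end is purely illustrative and mirrors the operator sketch of Sect. 3.

```python
def value_iteration(model, gamma=1.0, tol=1e-12):
    """Steps 1-3: iterate V_{n+1} = M V_n from V_{-1} = 1 on B^c and, at convergence,
    read off a greedy stationary policy f* with M V* = M^{f*} V* (cf. Theorem 2)."""
    B, A, r, p, laplace = (model[k] for k in ("B", "A", "r", "p", "laplace"))

    def M_a(V, x, a):
        # M^a V(x) = L(x,a) * [ p(B | x,a) + sum_{y in B^c} p(y | x,a) V(y) ]
        L = laplace[(x, a)](gamma * r[(x, a)])
        return L * sum(q * (1.0 if y in B else V[y]) for y, q in p[(x, a)].items())

    V = {x: 1.0 for x in A}                                        # Step 1
    while True:
        V_new = {x: max(M_a(V, x, a) for a in A[x]) for x in A}    # Step 2
        done = max(abs(V_new[x] - V[x]) for x in A) < tol          # Step 3
        V = V_new
        if done:
            break
    policy = {x: max(A[x], key=lambda a: M_a(V, x, a)) for x in A}
    return V, policy

# Illustrative data (hypothetical): one transient state, exponential sojourn times.
toy = {"B": {0}, "A": {1: ["a", "b"]},
       "r": {(1, "a"): 0.01, (1, "b"): 0.02},
       "p": {(1, "a"): {0: 0.5, 1: 0.5}, (1, "b"): {0: 0.7, 1: 0.3}},
       "laplace": {(1, "a"): lambda s: 0.1 / (0.1 + s),
                   (1, "b"): lambda s: 0.2 / (0.2 + s)}}
print(value_iteration(toy))
```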

4 Example

In this section, an example is given to illustrate our main results, and to demonstrate the computation of an optimal stationary policy and the corresponding value function using the above described iterative algorithm.

Example 1

Consider a company using idle funds for financial management. When the company has some idle funds (denoted by state 1), the decision maker earns a reward at the rate of return \(r(1,a_{11})\ge 0\) through deposit method \(a_{11}\), or at the rate of return \(r(1,a_{12})\ge 0\) through another deposit method \(a_{12}\). When the company has plenty of idle funds (denoted by state 2), the decision maker can choose a financial product \(a_{21}\) earning a reward at rate \(r(2,a_{21})\ge 0\) or another financing option \(a_{22}\) earning a reward at rate \(r(2,a_{22})\ge 0\). When the company goes bankrupt (denoted by state 0), the decision maker takes the only available action \(a_{01}\) and earns no reward, i.e., \(r(0,a_{01})=0\).

Suppose that the evolution mechanism of this system is described by an SMDP. When the system state is 1, the decision maker selects an admissible action \(a_{1n}\), \(n=1,2\). The system then stays at state 1 for a random time following the uniform distribution on \([0,u(1,a_{1n})]\), \(n=1,2\). After this holding time, it moves to a new state \(j\in \{0,2\}\) with probability \(p(j|1,a_{1n})\), \(n=1,2\). When the action \(a_{2n}\), \(n=1,2\), is selected, the system stays at state 2 for a random time following the exponential distribution with parameter \(\lambda (2,a_{2n})\), and consequently jumps to state \(j\in \{0,1\}\) with probability \(p(j|2,a_{2n})\), \(n=1,2\).

The corresponding parameters of this SMDP are given as follows: the state space \(S=\{0,1,2\}\), the target set \(B=\{0\}\), the admissible action sets \(A(0)=\{a_{01}\},A(1)=\{a_{11},a_{12}\}\), \(A(2)=\{a_{21},a_{22}\}\), and the risk-sensitivity coefficient \(\gamma =1\). The transition probabilities are given by

$$\begin{aligned} p(0|0,a_{01})&=1,&p(0|1,a_{11})&=\frac{1}{2},&p(2| 1,a_{11})&=\frac{1}{2},\nonumber \\ p(0| 1,a_{12})&=\frac{2}{3},&p(2| 1,a_{12})&=\frac{1}{3},&p(0|2,a_{21})&=\frac{3}{10},\\ p(1|2,a_{21})&=\frac{7}{10},&p(0| 2,a_{22})&=\frac{2}{5},&p(1|2,a_{22})&=\frac{3}{5}.\nonumber \end{aligned}$$
(17)

In addition, the corresponding distribution parameters are given by

$$\begin{aligned} u(1,a_{11})&={30},&u(1,a_{12})&={40},\nonumber \\ \lambda (2,a_{21})&=0.11,&\lambda (2,a_{22})&=0.13. \end{aligned}$$
(18)

and the reward rates are given by

$$\begin{aligned} r(1,a_{11})&=0.0035,&r(1,a_{12})&=0.011,\\ r(2,a_{21})&=0.013,&r(2,a_{22})&=0.015. \end{aligned}$$

In this model, we mainly focus on the existence and calculation of an optimal policy and the value function for the first passage exponential utility criterion. As can be seen from the discussion in Sect. 3 above, we first need to verify Assumptions 1, 2 and 3. Indeed, by (17) and (18), Assumptions 1 and 3 are satisfied. Moreover, since the state space is denumerable and the action space A is finite, Assumption 2 is trivially satisfied. Thus, by Theorems 1 and 2, the value iteration technique can be used to evaluate the value function and the exponential utility optimal policies as follows:

Step 1: Let \(V_{-1}^{*}(x):=1,x=1,2\).

Step 2: For \(x=1,2\) and \(n\ge 0\), using Theorem 1(a), we obtain

$$\begin{aligned}&V_{n}^{*}(1)=MV_{n-1}^{*}(1)\\= & {} \max \Big \{\frac{1}{2}\times \frac{1}{30} \times \int _{0}^{30}e^{-0.0035u}du\\&+\frac{1}{2}\times \frac{1}{30} \times \int _{0}^{30}e^{-0.0035u}du\times V_{n-1}^{*}(2),\\&\frac{2}{3}\times \frac{1}{40} \times \int _{0}^{40}e^{-0.011u}du+\frac{1}{3}\times \frac{1}{40} \times \int _{0}^{40}e^{-0.011u}du\times V_{n-1}^{*}(2)\Big \},\\&V_{n}^{*}(2)=MV_{n-1}^{*}(2)\\= & {} \max \Big \{\frac{3}{10}\times 0.11 \times \int _{0}^{+\infty }e^{-0.123u}du\\&+\frac{7}{10}\times 0.11 \times \int _{0}^{+\infty }e^{-0.123u}du\times V_{n-1}^{*}(1),\\&\frac{2}{5}\times 0.13 \times \int _{0}^{+\infty }e^{-0.145u}du+\frac{3}{5}\times 0.13 \times \int _{0}^{+\infty }e^{-0.145u}du\times V_{n-1}^{*}(1)\Big \}. \end{aligned}$$
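
For reference, the sojourn-time integrals appearing in the two maxima above have simple closed forms (a routine computation with the uniform and exponential densities):

$$\begin{aligned} \frac{1}{c}\int _{0}^{c}e^{-\gamma r u}du=\frac{1-e^{-\gamma r c}}{\gamma r c},\qquad \lambda \int _{0}^{+\infty }e^{-(\lambda +\gamma r)u}du=\frac{\lambda }{\lambda +\gamma r}, \end{aligned}$$

so that, with the data in (17) and (18), the four factors are approximately 0.9493 (for \(a_{11}\)), 0.8090 (for \(a_{12}\)), 0.8943 (for \(a_{21}\)) and 0.8966 (for \(a_{22}\)).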

Step 3: When \(|V_{n}^{*}-V_{n-1}^{*}|<10^{-12}\), go to Step 4 and take \(V_{n}^{*}\) as an approximation of \(V^{*}\); otherwise, replace n with \(n+1\) and go back to Step 2.

Fig. 1. The function \(M^{a}V_{n}^{*}(1)\)

Step 4: Plot the graphs of the functions \(M^{a_{ij}}V_{n}^{*}(i)\) and \(V_{n}^{*}(i)\), \(i=1,2\), \(j=1,2\); see Figs. 1, 2 and 3.

Fig. 2. The function \(M^{a}V_{n}^{*}(2)\)

Fig. 3. The value function \(V_{n}^{*}(i)\)

Moreover, for \(x=1\), using Theorems 1 and 2 and Figs. 1 and 2, we know that

$$\begin{aligned} MV^{*}(1)=V^{*}(1)=M^{a_{11}}V^{*}(1). \end{aligned}$$

For \(x=2\), we also obtain

$$\begin{aligned} MV^{*}(2)=V^{*}(2)=M^{a_{22}}V^{*}(2). \end{aligned}$$

According to the above analysis and Theorem 2, we obtain the optimal stationary policy \(f^{*}(1)=a_{11}\), \(f^{*}(2)=a_{22}\) and the value function \(V^{*}(1)=0.8660\), \(V^{*}(2)=0.8245\).
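
As a numerical cross-check of the values reported above, the following Python sketch runs the value iteration of Sect. 3 for Example 1, using the closed-form factors noted after Step 2; it reproduces \(V^{*}(1)\approx 0.8660\), \(V^{*}(2)\approx 0.8245\) and the greedy actions \(a_{11}\), \(a_{22}\). The code is an illustration of the algorithm, not part of the original computation.

```python
import math

gamma = 1.0

# Sojourn-time factors E[exp(-gamma * r(x,a) * theta)] for the four actions:
# uniform on [0, c]  ->  (1 - exp(-gamma*r*c)) / (gamma*r*c)
# exponential(lam)   ->  lam / (lam + gamma*r)
def unif_factor(r, c):   return (1.0 - math.exp(-gamma * r * c)) / (gamma * r * c)
def expo_factor(r, lam): return lam / (lam + gamma * r)

# For each transient state: action -> (factor, embedded transition probabilities).
actions = {
    1: {"a11": (unif_factor(0.0035, 30.0), {0: 1/2, 2: 1/2}),
        "a12": (unif_factor(0.011, 40.0),  {0: 2/3, 2: 1/3})},
    2: {"a21": (expo_factor(0.013, 0.11),  {0: 3/10, 1: 7/10}),
        "a22": (expo_factor(0.015, 0.13),  {0: 2/5, 1: 3/5})},
}

def M_a(V, x, a):
    """M^a V(x) for the target set B = {0}."""
    L, p = actions[x][a]
    return L * sum(q * (1.0 if y == 0 else V[y]) for y, q in p.items())

V = {1: 1.0, 2: 1.0}                            # V_{-1}^* = 1 on B^c = {1, 2}
while True:
    V_new = {x: max(M_a(V, x, a) for a in actions[x]) for x in (1, 2)}
    done = max(abs(V_new[x] - V[x]) for x in (1, 2)) < 1e-12
    V = V_new
    if done:
        break

f_star = {x: max(actions[x], key=lambda a: M_a(V, x, a)) for x in (1, 2)}
print(V, f_star)    # about {1: 0.8660, 2: 0.8245} and {1: 'a11', 2: 'a22'}
```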