1 Introduction

In this paper, we study a risk-sensitive control problem for Markov decision processes (MDPs). The risk-sensitive control of MDPs has been widely investigated (see [10, 12, 16, 21, 22] and the references cited therein). The basic goal is to find the optimal solution to the following control problem:

$$ \lambda^{\pi}(x,\gamma):=\frac 1\gamma\underset{N\to\infty}{\lim}\frac {1}{N}\log E_{x}^{\pi}\exp\left[\gamma\sum\limits_{n=0}^{N-1}r(X_{n},A_{n})\right], $$
(1.1)

where Xn is the state of the system at time n, x is the initial state, An is the decision made by the controller at time n, and π is the strategy for decision-making. The risk-sensitive factor γ represents the controller’s risk preference. Regarding r as a reward, we consider the following maximization with γ > 0:

$$ \lambda(x,\gamma):=\underset{\pi}{\sup}\lambda^{\pi}(x,\gamma). $$

It is well known that γ = 0 corresponds to the risk-neutral case in which the performance is evaluated according to the following typical long-run average reward:

$$ \begin{array}{@{}rcl@{}} v(x):=\sup_{\pi}\underset{N\to \infty}{\lim}\frac{1}{N} E_{x}^{\pi}\left[\sum\limits_{n=0}^{N-1}r(X_{n},A_{n})\right]. \end{array} $$

Notice that if \(E(e^{\gamma X})\) exists for every γ > 0 and E(X) exists, then

$$ \begin{array}{@{}rcl@{}} \underset{\gamma\to 0}{\lim}\frac 1\gamma \log E(e^{\gamma X})= E(X). \end{array} $$
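
This can be seen, for example, when X is bounded, by a first-order expansion of the cumulant generating function (a minimal sketch):

$$ \begin{array}{@{}rcl@{}} \frac 1\gamma \log E\left( e^{\gamma X}\right)=\frac 1\gamma\log\left( 1+\gamma E(X)+O(\gamma^{2})\right)= E(X)+O(\gamma)\quad\text{as } \gamma\to 0. \end{array} $$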

It is natural to ask whether the optimal risk-sensitive control converges to the optimal long-run average control as the risk-sensitive factor vanishes. The main purpose of this paper is to prove that

$$ \underset{\gamma\to 0+}{\lim}\lambda(x,\gamma)=v(x), $$
(1.2)

provided that both sides are well-defined (see the next section for an explicit description of this problem). This problem has been studied for minimizing risk-sensitive costs for MDPs; see the references cited above. A similar problem for optimal risk-sensitive portfolios has also been studied. Notice that in the framework of portfolio or other asset processes, maximizing rewards is a natural problem to consider. This maximization problem is essentially different from the minimization problem, but both are of fundamental importance for applications. It is interesting that maximizing the risk-sensitive reward is dual to maximizing the upside chance or minimizing the downside risk under some conditions (see [18, 24] and [26]). These observations motivate us to study the asymptotics of optimal risk-sensitive rewards for MDPs. We shall show in this paper that for MDPs with compact state and action spaces, under certain assumptions, the maximal risk-sensitive reward converges to the maximal long-run average reward as the risk-sensitive factor decreases to 0.

We note that in the approach used to derive the asymptotics of the minimal risk-sensitive cost, besides a few necessary continuity assumptions, some conditions on contraction and strong ergodicity of the transition probabilities were imposed; based on these, a span contraction property of a suitably defined operator can be verified, which guarantees a solution to the corresponding Bellman equation. The strong ergodicity condition also makes it possible to apply some large deviation techniques (see (2.7) and (2.8) in Section 2 for the explicit conditions, with more explanations given in Remark 2.1). In this paper, we shall use a quite different approach: inspired by Anantharam and Borkar [2], we use a nonlinear extension of the Kreĭn-Rutman theorem (see [23]) to find the eigenvalues of suitably defined operators on certain function spaces and characterize the optimal growth rate of the multiplicative reward by this eigenvalue. Using this characterization and a perturbation technique, we derive a variational formula for the optimal growth rate of the MDP (Theorem 3.7) without assuming ergodicity of the transition probability. This variational formula is similar to the Donsker-Varadhan formula (see [14]) and is of independent interest. The vanishing risk-sensitivity limit of the maximal reward of MDPs follows as an application of this formula (Theorems 3.1 and 3.8), and its proof implies that for a risk-sensitive control problem, the optimal policy can be taken to be a stationary one, even when the MDP is not communicating.

We also apply the approach to study the same problem for partially observable Markov decision processes (POMDPs) with compact state and action spaces. For POMDPs, see [4, 6, 8, 11, 17], and [19] and the references cited therein. A widely used approach for studying a POMDP is to transfer it into a completely observable MDP. However, the structure of the transferred MDP is usually much more complicated. Among existing results, such as those in the references mentioned above, few concern the vanishing risk-sensitivity limit. In [1], the limit of the minimal risk-sensitive cost as the risk-sensitive factor tends to 0 is derived for a class of POMDPs with a particular structure. This particular structure makes it possible to apply a large deviation approach. In [11], Di Masi and Stettner established the existence of the solution to the associated Bellman equation for cost-minimizing problems. However, they remarked that the limit as the risk-sensitive factor tends to 0 for general POMDPs had not been proven. Based on our investigation for MDPs, we prove that, as long as the solution to the associated Bellman equation exists, the maximal risk-sensitive reward converges to the maximal long-run average reward as the risk-sensitive factor tends to 0.

Finally, as an application of our approach, we establish, for portfolio optimization, a duality relation between maximizing the risk-sensitive reward and maximizing the chance of outperforming certain reward levels, where the range of these levels is characterized by the optimal average reward (Theorem 5.2).

The paper is organized as follows. At the end of this section, we introduce some notations that will be frequently used in this paper. In Section 2, we define the decision model and derive some properties of the operator corresponding to the Bellman equation. The risk-neutral limit for MDPs is given in Section 3, in which the variational formula mentioned above is established. Section 4 is devoted to POMDPs. The portfolio optimization problem is investigated in Section 5.

Here are some notations and preliminaries. Given a separable and complete metric space (also called a Polish space) \((\mathcal {X},\rho )\), let \({\mathscr{M}}(\mathcal {X})\) and \({\mathscr{M}}^{+}(\mathcal {X})\) denote the set of finite signed measures on \(\mathcal {X}\) and the set of finite measures on \(\mathcal {X}\), respectively. \(\mathcal {P}(\mathcal {X})\) is the space of probability measures on \(\mathcal {X}\), endowed with the weak topology. For \(p,q\in \mathcal {P}(\mathcal {X})\), we use p << q to denote that p is absolutely continuous with respect to q. As usual, δx(⋅) denotes the Dirac measure on point \(x\in \mathcal {X}\). When \(\mathcal {X}\) is compact, \(C(\mathcal {X})\), the real-valued continuous functions on \(\mathcal {X}\), equipped with the supremum norm ∥⋅∥, is a Banach space. Let \(C^{+}(\mathcal {X})\) denote the set of non-negative functions in \(C(\mathcal {X})\). \(C^{+}(\mathcal {X})\) is a cone, which means that for any \(f,g\in C^{+}(\mathcal {X})\) and any c > 0, both f + g and cf are in \(C^{+}(\mathcal {X})\). \(C^{+}(\mathcal {X})\) is convex, closed and satisfies that \(C^{+}(\mathcal {X})\cap (-C^{+}(\mathcal {X}))=\{0\}\) and that \(interior(C^{+}(\mathcal {X}))\neq \emptyset \). We write f ≥ g, f > g, f >> g if \(f-g\in C^{+}(\mathcal {X}),f-g\in C^{+}(\mathcal {X})\backslash \{0\},f-g\in interior(C^{+}(\mathcal {X}))\), respectively. These facts form the basis for applying a nonlinear extension of the Kreĭn-Rutman theorem (see Appendix) to the operator corresponding to the Bellman equation.

For two probability measures \(p,q\in \mathcal {P}(\mathcal {X})\), the relative entropy of p with respect to q is defined by

$$ D(p\Vert q):= \left\{\begin{array}{ll} {\int}_{\mathcal{X}}\log\left( \frac{dp}{dq}(x)\right)p(dx),\quad&p<<q\\ \infty,\quad&\text{otherwise} \end{array}\right., $$
(1.3)

which plays an important role in the variational formula for the optimal reward.

Let \(\text {Lip}(\mathcal {X})\) denote the space of real-valued, bounded, and Lipschitz continuous functions on \(\mathcal {X}\). Given \(f\in \text {Lip}(\mathcal {X})\), define its norm by

$$ \|f\|_{L}:= \max\left\{\underset{x\in\mathcal{X}}{\sup}|f(x)|,{\underset{x\neq y}{\underset{x,y\in\mathcal{X}}{\sup}}}\frac{|f(x)-f(y)|}{\rho(x,y)}\right\}. $$
(1.4)

Then \((\text {Lip}(\mathcal {X}),\ \|\cdot \|_{L})\) is a Banach space when \(\mathcal {X}\) is compact. Given \(\mu \in {\mathscr{M}}(\mathcal {X})\), define the following Kantorovich-Rubinstein norm:

$$ \|\mu\|_{0}:=\sup\left\{\int fd\mu, f\in\text{Lip}(\mathcal{X}), \|f\|_{L}\leq 1\right\}. $$
(1.5)

Then the weak topology on \(\mathcal {P}(\mathcal {X})\) is generated by the Kantorovich-Rubinstein metric d0(μ,ν) := ∥μ − ν∥0 (Theorem 8.3.2, pp. 193–194, in [7]). \(\mathcal {P}(\mathcal {X})\) endowed with the weak topology is a Polish space since \(\mathcal {X}\) is Polish. The space of Lipschitz functions and the Kantorovich-Rubinstein metric will be used in Section 4 to discuss the POMDPs.

Finally, as usual, \(\mathbb {N}\) and \(\mathbb {R}\) denote the sets of non-negative integers and real numbers, respectively.

2 Solution to the Bellman Equation

A discrete-time MDP can be represented as a four-tuple M = 〈S,A,p(⋅|⋅,⋅),r(⋅,⋅)〉. S is the state space, A is the action space, and both are assumed to be compact metric spaces in the present paper. We assume for convenience that any action in A is admissible in any state. The transition kernel, which depends on actions, is denoted by \(p(E|x,a), E\subseteq S, x\in S, a\in A\). The last element in the tuple is the one-step reward function \(r: S\times A\to \mathbb {R}\). To define a probability space and a stochastic process with the desired mechanism, let \({\Omega }=(S\times A)^{\infty }\) and \({\mathscr{B}}({\Omega })\) be the product Borel σ-field. Given a sample path ω = (x1,a1,x2,a2,...) ∈Ω, define \(X_{t}:= x_{t},A_{t}:= a_{t},t\in \mathbb {N}\). At each time \(t\in \mathbb {N}\), the system M occupies a state Xt, based on which the controller chooses an action At, and then the system moves to the next state according to the law p(⋅|Xt,At). A Markov decision rule at time t is a stochastic kernel \(d_{t}\in \mathcal {P}(A|S)\), where dt(B|x) denotes the probability for taking action in \(B\subseteq A\) when observing the current state Xt = x. A Markov policy π is a sequence of Markov decision rules. Let DM denote the set of all the Markov decision rules. \({\Pi }_{M}=(D_{M})^{\infty }\) is the set of all the Markov policies of M. Given an initial state x ∈ S and a policy π = (d1,d2,...) ∈ ΠM, we can define a unique probability measure \(\text {P}_{x}^{\pi }\) on \({\mathscr{B}}({\Omega })\) by the Ionescu-Tulcea theorem, such that for each t ≥ 0,

$$ \text{P}_{x}^{\pi}(dx_{1},da_{1},...,dx_{t-1},da_{t-1},dx_{t})=\delta_{x}(dx_{1})\left( \prod\limits_{i=1}^{t-1}d_{i}(da_{i}|x_{i})p(dx_{i+1}|x_{i},a_{i})\right). $$

The corresponding expectation operator is denoted by \( E_{x}^{\pi }\). Since we view r as a reward, a typical criterion for evaluating the optimal policy is to maximize the average reward, i.e., we are interested in the following function

$$ v(x)=\underset{\pi\in{\Pi}_{M}}{\sup}\underset{N\rightarrow\infty}{\liminf}\frac{1}{N} E_{x}^{\pi}\left[\sum\limits_{t=1}^{N}r(X_{t},A_{t})\right], $$
(2.1)

which is risk-neutral. Informally, by a Taylor expansion, we see that for a small factor γ

$$ \frac{1}{\gamma N}\log E_{x}^{\pi}\exp\left[\gamma\sum\limits_{t=1}^{N}r(X_{t},A_{t})\right] = \frac{1}{N} E_{x}^{\pi}\left[{\sum}_{t=1}^{N}r(X_{t},A_{t})\right]\\ +\frac{\gamma}{2N}\text{Var}_{x}^{\pi}\left[{\sum}_{t=1}^{N}r(X_{t},A_{t})\right]+\frac{1}{N}\cdot o(\gamma^{2}). $$
(2.2)

Hence, we can use γ≠ 0 to evaluate the controller’s risk preference. This leads to the following risk-sensitive criterion for maximization of reward (see [10]):

$$ \lambda(x,\gamma)=\underset{\pi\in{\Pi}_{M}}{\sup}\frac{1}{\gamma}\underset{N\rightarrow\infty}{\liminf}\frac{1}{N}\log E_{x}^{\pi}\exp\left[\gamma\sum\limits_{t=1}^{N}r(X_{t},A_{t})\right], $$
(2.3)

where γ ≠ 0 is a constant evaluating the controller’s risk preference. With “\(\sup \)” replaced by “\(\inf \)” in (2.3), we have the risk-sensitive criterion for minimization of cost. An interesting problem is the asymptotics of λ(x,γ) as γ → 0. For N ≥ 1, define

$$ \lambda_{N}^{\pi}(x,\gamma):=\frac{1}{\gamma N}\log E_{x}^{\pi}\exp\left[\gamma\sum\limits_{t=1}^{N}r(X_{t},A_{t})\right]. $$
(2.4)

Then (2.2) implies that

$$ \lambda_{N}^{\pi}(x,0):=\underset{\gamma\to 0}{\lim\limits}\lambda_{N}^{\pi}(x,\gamma)=\frac{1}{N} E_{x}^{\pi}\left[{\sum}_{t=1}^{N}r(X_{t},A_{t})\right], $$
(2.5)

which motivates us to derive

$$ \lambda(x,0):=\underset{\gamma\to 0}{\lim\limits}\lambda(x,\gamma)=v(x). $$
(2.6)

This is the main concern of this paper.

Remark 2.1

Replace the \(\sup \) in (2.3) with an \(\inf \) to define \(\tilde \lambda (x,\gamma )\); the γ → 0 limit of \(\tilde \lambda (x,\gamma )\) has already been established in [10]. There, the existence of a solution to the Bellman equation was proved under, in addition to some necessary continuity assumptions, the following two requirements:

$$ p(E|x,a)-p(E|x^{\prime},a^{\prime})<\delta\quad\text{for some }\delta\in (0,1)\ \text{and for any }E\in\mathcal{B}(S),\ x,x^{\prime}\in S,\ a,a^{\prime}\in A; $$
(2.7)

and there exist an \(\eta \in \mathcal {P}(S)\) and a continuous density q(x,a,y) such that \(p(E|x,a)={\int \limits }_{E}q(x,a,y)\eta (dy)\) for \(E\in {\mathscr{B}}(S)\) and

$$ q(x,a,y)>0, \underset{x,x^{\prime}\in S}{\sup}\underset{a\in A}{\sup}\underset{y\in S}{\sup}\frac{q(x,a,y)}{q(x^{\prime},a,y)}= K<\infty. $$
(2.8)

These conditions guarantee that the operator defining the corresponding Bellman equation is span contractive, and hence a solution exists. (2.8) also implies strong ergodicity for the family of transition probabilities defining the MDP. A consequence is the applicability of large deviation techniques for ergodic Markov processes. Instead of such conditions, we will use the following assumption (B1) on communication among the family of transition probabilities to guarantee that the eigenvector of the Bellman equation is strictly positive, based on which the variational formula holds. Then we will remove (B1) by a perturbation technique, which means that the limit can hold for completely observable MDPs without any communication requirements on the transition probability.

  1. (B1)

    For any x1, x ∈ S and any open neighborhood U containing x, there exist an N > 0 and a1,...,aN ∈ A such that

    $$ \begin{array}{@{}rcl@{}} \int \textbf{1}_{U}(x_{N+1})p(dx_{N+1}|x_{N},a_{N})...p(dx_{2}|x_{1},a_{1})>0. \end{array} $$

Remark 2.2

When the model is finite (i.e., both S and A are finite), (B1) is the classical communicating condition.
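
For instance, when S and A are finite, every singleton {x} is an open neighborhood of x, and (B1) reads: for all x1, x ∈ S there exist N ≥ 1 and a1,...,aN ∈ A such that

$$ \begin{array}{@{}rcl@{}} \sum\limits_{x_{2},...,x_{N}\in S}p(x_{2}|x_{1},a_{1})p(x_{3}|x_{2},a_{2}){\cdots} p(x|x_{N},a_{N})>0, \end{array} $$

i.e., x can be reached from x1 with positive probability under some finite sequence of actions.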

Now let

$$ v:=\underset{x\in S}{\sup}v(x) $$

and

$$ \lambda(\gamma):=\underset{x\in S}{\sup}\lambda(x,\gamma) $$

for γ > 0. The main objective of this paper is to show that

$$ \underset{\gamma\to 0}{\lim}\lambda(\gamma)=v. $$
(2.9)

In order to do this, we make the following assumptions.

  1. (A1)

    r(x,a) is continuous in (x,a).

  2. (A2)

    \((x,a)\mapsto {\int \limits }_{S}p(dy|x,a)f(y)\) is continuous in (x,a) when f ∈ C(S).

  3. (A3)

    The family of functions

    $$ \begin{array}{@{}rcl@{}} \left\{x\mapsto{\int}_{S}f(y)p(dy|x,a), f\in C(S),\left\|f\right\|\leq 1,a\in A\right\} \end{array} $$

    is equicontinuous.

Moreover, if (B1) holds, we will see that, independent of the choice of the initial state x ∈ S, the value of λ(x,γ) depends only on γ and

$$ \underset{\gamma\to 0}{\lim}\lambda(x,\gamma)=\underset{\gamma\to 0}{\lim}\lambda(\gamma)=v. $$
(2.10)

Remark 2.3

A concrete case in which (A3) is satisfied is that

$$p(dy|x,a)=q(y|x,a){\Lambda}(dy)$$

with \({\Lambda }\in \mathcal {P}(S)\) and {q(y|⋅,a), y ∈ S, a ∈ A} equicontinuous. The equicontinuity assumption (A3) is only used to prove the compactness of the operator related to the Bellman equation. In particular, for every finite MDP, (A1), (A2), and (A3) automatically hold. Combined with (B1), this compactness yields the existence of a positive eigenvalue and the associated positive eigenvector. The continuity assumptions of the result regarding the Bellman equation in [10] are the same as (A1) and (A2). But the γ → 0 limit established in [10] requires that there exist an \(\eta \in \mathcal {P}(S)\) and a density q(x,a,y) > 0 such that \(p(E|x,a)={\int \limits }_{E}q(x,a,y)\eta (dy)\) for \(E\in {\mathscr{B}}(S)\) and (x,a,y) → q(x,a,y) is continuous, which is stricter than (A3) when the state space S and the action space A are compact.
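
In this case, (A3) can be checked directly (a short sketch): for any f ∈ C(S) with ∥f∥ ≤ 1 and any a ∈ A,

$$ \begin{array}{@{}rcl@{}} \left|{\int}_{S}f(y)p(dy|x_{1},a)-{\int}_{S}f(y)p(dy|x_{2},a)\right|\leq{\int}_{S}\left|q(y|x_{1},a)-q(y|x_{2},a)\right|{\Lambda}(dy)\leq\sup\limits_{y\in S,\ a^{\prime}\in A}\left|q(y|x_{1},a^{\prime})-q(y|x_{2},a^{\prime})\right|, \end{array} $$

and the right-hand side tends to 0 uniformly as ρ(x1,x2) → 0 by the assumed equicontinuity of {q(y|⋅,a)}.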

The Bellman equation mentioned above is

$$ \rho f(x) = \underset{\nu\in\mathcal{P}(A)}{\sup}{\int}_{S\times A}e^{\gamma r(x,a)}f(y)p(dy|x,a)\nu(da), $$
(2.11)

and the corresponding operator L(γ) on C(S) is defined by

$$ L^{(\gamma)}f(x):= \sup_{\nu\in\mathcal{P}(A)}{\int}_{S\times A}e^{\gamma r(x,a)}f(y)p(dy|x,a)\nu(da). $$
(2.12)

Since

$$ \begin{array}{@{}rcl@{}} \sup_{\nu\in\mathcal{P}(A)}{\int}_{A}\left( {\int}_{S}e^{\gamma r(x,a)}f(y)p(dy|x,a)\right)\nu(da)\leq\sup_{a^{\prime}\in A}\int\limits_{S}e^{\gamma r(x,a^{\prime})}f(y)p(dy|x,a^{\prime}) \end{array} $$

and

$$ \begin{array}{@{}rcl@{}} \underset{\nu\in\mathcal{P}(A)}{\sup}{\int}_{S\times A}e^{\gamma r(x,a)}f(y)p(dy|x,a)\nu(da)\geq\sup_{a^{\prime}\in A}{\int}_{S\times A}e^{\gamma r(x,a^{\prime})}f(y)p(dy|x,a^{\prime})\delta_{a^{\prime}}(da), \end{array} $$

we see that

$$ L^{(\gamma)}f(x)= \sup_{a\in A}{\int}_{S}e^{\gamma r(x,a)}f(y)p(dy|x,a). $$
(2.13)

Assumptions (A1), (A2) and the compactness of S × A imply that when f ∈ C(S), L(γ)f also belongs to C(S). Combining these with (A3), we can prove that L(γ) is a compact operator, which is crucial to the existence of a positive eigenvalue, as we claimed before.

Proposition 2.1

Assume (A1), (A2), and (A3). Then L(γ) is a compact operator mapping C(S) into itself.

Proof

Notice that r(⋅,⋅) is bounded under assumption (A1) and the compactness of S × A. For convenience, we let rM and rm be the supremum and infimum of r, respectively. For any function f with ∥f∥≤ K, we have \(\sup _{x\in S}\left |L^{(\gamma )}f(x)\right |\leq Ke^{\gamma r_{M}}\). Thus, to apply the Arzelà-Ascoli theorem, we need to verify that the family {L(γ)f, f ∈ C(S), ∥f∥≤ K} is equicontinuous.

To this end, let ρ denote the metric on S. According to (A3), for any ε > 0, there exists δ1 > 0 such that

$$ \begin{array}{@{}rcl@{}} \sup_{a\in A}\sup_{\substack{g\in C(S)\\\|g\|\leq 1}}\left|\int\limits_{S}g(y)p(dy|x_{1},a)-\int\limits_{S}g(y)p(dy|x_{2},a)\right|\leq\varepsilon \end{array} $$

for any x1,x2 with ρ(x1,x2) ≤ δ1. By the uniform continuity of eγr(⋅,⋅), there exists δ2 > 0 such that

$$ \begin{array}{@{}rcl@{}} \sup_{a\in A}\left|e^{\gamma r(x_{1},a)}-e^{\gamma r(x_{2},a)}\right|\leq\varepsilon \end{array} $$

whenever ρ(x1,x2) ≤ δ2. Consequently when ∥f∥≤ K, for x1,x2 with \(\rho (x_{1},x_{2})\leq \min \limits \{\delta _{1},\delta _{2}\}\), we have

$$ \begin{array}{@{}rcl@{}} \left|L^{(\gamma)}\!f(x_{1}) - L^{(\gamma)}\!f(x_{2})\right|&\leq&\sup_{a\in A}\left|\int\limits_{S}e^{\gamma r(x_{1},a)}\!f(y)p(dy|x_{1},a) - \int\limits_{S}e^{\gamma r(x_{2},a)}\!f(y)p(dy|x_{2},a)\right|\\ &\leq& \sup_{a\in A}\left|e^{\gamma r(x_{1},a)}-e^{\gamma r(x_{2},a)}\right|\int\limits_{S}\left|f(y)\right|p(dy|x_{1},a)\\ & & +\sup_{a\in A}e^{\gamma r(x_{2},a)}\left|\int\limits_{S}f(y)p(dy|x_{1},a)-\int\limits_{S}f(y)p(dy|x_{2},a)\right|\\ &\leq&\left( K+Ke^{\gamma r_{M}}\right)\varepsilon. \end{array} $$

Hence the family is equicontinuous, and the compactness of L(γ) follows from the Arzelà-Ascoli theorem. □

L(γ) has the following properties, which will be used to apply the non-linear Kreĭn-Rutman theorem to prove the existence of a solution to the Bellman equation (2.11).

  1. (P1)

    Assume (A1). Then

    $$(L^{(\gamma)})^{N}f(x)=\sup\limits_{\pi\in{\Pi}_{M}} E_{x}^{\pi}\left[\exp\left( {\sum}_{t=1}^{N}\gamma r(X_{t},A_{t})\right)\cdot f(X_{N+1})\right],N\geq 1.$$

    This property can be proven by induction using the Markov property of π ∈ ΠM and the fact that \(e^{\gamma r(X_{i},A_{i})}\leq e^{\gamma r_{M}}\) (see Lemma 2.1 and its proof in [2]); a one-step sketch of the induction is given after this list.

  2. (P2)

    (Positive 1-homogeneity) c(L(γ)f) = L(γ)(cf) for c ≥ 0 and f ∈ C(S).

  3. (P3)

    (Order-preserving) If f ≥ g, then L(γ)f ≥ L(γ)g.
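
A one-step sketch of the induction behind (P1), as referred to above: assuming the formula for some N ≥ 1, the definition (2.13) gives

$$ \begin{array}{@{}rcl@{}} (L^{(\gamma)})^{N+1}f(x)=\sup\limits_{a\in A}{\int}_{S}e^{\gamma r(x,a)}\left( \sup\limits_{\pi^{\prime}\in{\Pi}_{M}} E_{y}^{\pi^{\prime}}\left[\exp\left( {\sum}_{t=1}^{N}\gamma r(X_{t},A_{t})\right)f(X_{N+1})\right]\right)p(dy|x,a), \end{array} $$

and interchanging the inner supremum with the integral (justified by a measurable selection argument and the Markov structure of the policies) yields the formula with N replaced by N + 1.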

The following theorem shows that the spectral radius of L(γ) is an eigenvalue. For an operator T : C(S) → C(S), define

$$ \begin{array}{@{}rcl@{}} \|T\|^{+}:=\sup_{\substack{g\in C^{+}(S)\\\|g\|\leq 1}}\left\{\left\|Tg\right\|\right\}. \end{array} $$

It is not hard to check that

$$\|(L^{(\gamma)})^{m+n}\|^{+}\leq\|(L^{(\gamma)})^{m}\|^{+}\|(L^{(\gamma)})^{n}\|^{+},$$

which implies that the limit

$$ \begin{array}{@{}rcl@{}} \rho(L^{(\gamma)}):=\lim_{n\to\infty}\left( \|(L^{(\gamma)})^{n}\|^{+}\right)^{\frac{1}{n}} \end{array} $$

exists.
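
This is an instance of Fekete's subadditive lemma: writing \(a_{n}:=\log \|(L^{(\gamma )})^{n}\|^{+}\), the submultiplicativity above becomes

$$ \begin{array}{@{}rcl@{}} a_{m+n}\leq a_{m}+a_{n},\qquad m,n\geq 1, \end{array} $$

so \(\lim _{n\to \infty }a_{n}/n=\inf _{n\geq 1}a_{n}/n\) exists, which is precisely the existence of the limit defining ρ(L(γ)).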

Theorem 2.2

Assume (A1), (A2), and (A3). Then ρ(L(γ)) > 0 and there exists an fγ ∈ C+(S) depending on γ with fγ ≠ 0 such that

$$ \rho(L^{(\gamma)})f_{\gamma}=L^{(\gamma)}f_{\gamma}. $$
(2.14)

If in addition (B1) is satisfied, then fγ >> 0 and \(\gamma \lambda (x,\gamma )=\log \rho (L^{(\gamma )})\) is independent of x ∈ S.

Proof

From (P1), we see that \(\left \|(L^{(\gamma )})^{n}\textbf {1}\right \|\geq e^{n\gamma r_{m}}\), which implies that \(\rho (L^{(\gamma )})\geq e^{\gamma r_{m}}>0\). Since L(γ) is compact, positive 1-homogeneous and order-preserving, by Theorem A.1 in the Appendix, there exists an fγ ∈ C+(S) satisfying (2.14). Moreover, from (A1) and (A2), we know that \({\int \limits }_{S}e^{\gamma r(x,a)}f_{\gamma }(y)p(dy|x,a)\) is continuous in a, which means that the supremum in (2.13) can be achieved. Hence, there exists a Markov decision rule \(d^{*}\) such that

$$ \begin{array}{@{}rcl@{}} L^{(\gamma)}f_{\gamma}(x)= {\int}_{S}e^{\gamma r(x,d^{*}(x))}f_{\gamma}(y)p(dy|x,d^{*}(x)). \end{array} $$

Let \(\pi ^{*}=(d^{*})^{\infty }\); then for \(N\in \mathbb {N}\) we have

$$ \left[\rho(L^{(\gamma)})\right]^{N}f_{\gamma}(x)= E_{x}^{\pi^{*}}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}f_{\gamma}(X_{N+1})\right]. $$
(2.15)

Similarly, we have for any Markov policy π

$$ \left[\rho(L^{(\gamma)})\right]^{N}f_{\gamma}(x)\geq E_{x}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}f_{\gamma}(X_{N+1})\right]. $$
(2.16)

Now, assume (B1). Since fγ ∈ C+(S) and fγ ≠ 0, there exist x0 ∈ S, c0 > 0, and an open neighborhood U0 containing x0 such that \(f_{\gamma } |_{U_{0}}>c_{0}>0\). It follows from (B1) that for any x1 ∈ S, there exist M ≥ 1 and a1,...,aM ∈ A such that

$$ \begin{array}{@{}rcl@{}} f_{\gamma}(x_{1})&=&\frac{1}{\left[\rho(L^{(\gamma)})\right]^{M}}\left( L^{(\gamma)}\right)^{M}f_{\gamma}(x_{1})\\ &\geq&\frac{1}{\left[\rho(L^{(\gamma)})\right]^{M}}\int \textbf{1}_{U_{0}}(x_{M+1})p(dx_{M+1}|x_{M},a_{M})...p(dx_{2}|x_{1},a_{1})\exp\left( \gamma{\sum}_{i=1}^{M} r(x_{i},a_{i})\right)f_{\gamma}(x_{M+1})\\ &\geq&\frac{c_{0}\cdot e^{M\gamma r_{m}}}{\left[\rho(L^{(\gamma)})\right]^{M}}\cdot\int \textbf{1}_{U_{0}}(x_{M+1})p(dx_{M+1}|x_{M},a_{M})...p(dx_{2}|x_{1},a_{1})>0. \end{array} $$

Thus, fγ >> 0. Since S is compact, there are constants \(0<k_{\gamma }<K_{\gamma }<\infty \) such that kγ ≤ fγ ≤ Kγ. From (2.15) and (2.16), we see that for any x ∈ S

$$ \begin{array}{@{}rcl@{}} \frac{k_{\gamma}}{K_{\gamma}}\left( E_{x}^{\pi^{*}}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]\right)\leq \left[\rho(L^{(\gamma)})\right]^{N}\leq\frac{K_{\gamma}}{k_{\gamma}}\left( E_{x}^{\pi^{*}}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]\right) \end{array} $$

and for any Markov policy π

$$ \begin{array}{@{}rcl@{}} \frac{k_{\gamma}}{K_{\gamma}}\left( E_{x}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]\right)\leq\left[\rho(L^{(\gamma)})\right]^{N}. \end{array} $$

Taking the logarithm and letting \(N\to \infty \), we see that the limit

$$ \log\rho(L^{(\gamma)})=\lim_{N\rightarrow\infty}\frac{1}{N}\log E_{x}^{\pi^{*}}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right] $$
(2.17)

exists and

$$ \begin{array}{@{}rcl@{}} \log\rho(L^{(\gamma)})=\gamma\lambda(x,\gamma) \end{array} $$

for any x ∈ S. □

Remark 2.4

Assume (A1), (A2), (A3), and (B1). From the proof of Theorem 2.2 we can see that ρ(L(γ)) is the unique positive eigenvalue of L(γ) restricted to interior(C+(S)).

3 Risk-Sensitive Asymptotics of MDP

In this section, we shall apply the following variational formula for λ(γ) to prove (2.9).

$$ \lambda(\gamma)=\sup_{\beta\in\mathcal{I}}\left\{{\int}_{S\times A}\left[ r(x,a)-\frac{1}{\gamma}D(\beta_{2}(\cdot|x,a)\Vert p(\cdot|x,a))\right]\beta^{\prime}(dx,da)\right\}, $$
(3.1)

where \(\mathcal {I}\) is defined by

$$ \mathcal{I}:=\{\beta\in\mathcal{P}(S\times A\times S):\beta(S,A,dx)=\beta(dx,A,S)\}, $$
(3.2)

and for \(\beta \in \mathcal {P}(S\times A\times S),\) the notations β0,β1,β2, and \(\beta ^{\prime }\) are defined by

$$ \beta(dx,da,dy)=\beta_{0}(dx)\beta_{1}(da|x)\beta_{2}(dy|x,a)=\beta^{\prime}(dx,da)\beta_{2}(dy|x,a). $$
(3.3)

Obviously, \(\mathcal {I}\) is nonempty and closed in \(\mathcal {P}(S\times A\times S)\). Notice that \(\mathcal {P}(S\times A\times S)\) is compact since S × A × S is compact. Hence, \(\mathcal {I}\) is compact, too. For \(\beta \in \mathcal {P}(S\times A\times S)\), β0 is the first 1-dimensional marginal of β, \(\beta ^{\prime }\) is the first 2-dimensional marginal of β, and β1 and β2 are the two successive conditional distributions of β. With these notations, \(\mathcal {I}\) is seen to be the set of probability measures β on S × A × S satisfying that β0 is invariant under \({\int \limits }_{A}\beta _{1}(da|x)\beta _{2}(dy|x,a)\). The validity of (3.1) will be verified in Theorems 3.4 and 3.7. At present, we will apply (3.1) to get the limit of λ(γ) as γ → 0.
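
To illustrate the structure of (3.1): for any β ∈ \(\mathcal {I}\) with β2(⋅|x,a) = p(⋅|x,a), the relative entropy term vanishes, so (3.1) yields the lower bound

$$ \lambda(\gamma)\geq\sup\left\{{\int}_{S\times A}r(x,a)\beta^{\prime}(dx,da):\ \beta\in\mathcal{I},\ \beta_{2}=p\right\}, $$

and by Lemma 3.2 below the right-hand side does not exceed v. The proof of Theorem 3.1 shows that, as γ → 0, near-optimal β in (3.1) satisfy β2 = p in the limit, which gives the matching upper bound \(\lim _{\gamma \to 0}\lambda (\gamma )\leq v\).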

Theorem 3.1

Assume (A1) and (A2). If (3.1) holds, then

$$ \lim\limits_{\gamma\to 0}\lambda(\gamma)=v. $$
(3.4)

To prove the theorem, we need the following

Lemma 3.2

If there exists a \(\beta \in \mathcal {I}\) satisfying that β2 = p, then

$$ \begin{array}{@{}rcl@{}} {\int}_{S\times A}r(x,a)\beta^{\prime}(dx,da)\leq v, \end{array} $$

where \(\mathcal {I}\) is defined in (3.2).

Proof

Since β(S,A,dx) = β(dx,A,S), taking β0 as the initial distribution and using the policy \(\pi _{\beta }=(\beta _{1}(da|x))^{\infty }\), we see that

$$ \begin{array}{@{}rcl@{}} E_{\beta_{0}}^{\pi_{\beta}}\left[r(X_{2},A_{2})\right]&=& {\int}_{S\times A\times S\times A} r(x_{2},a_{2})\beta_{1}(da_{2}|x_{2})p(dx_{2}|x_{1},a_{1})\beta_{1}(da_{1}|x_{1})\beta_{0}(dx_{1})\\ &=&{\int}_{S\times A\times S}\left( {\int}_{A}r(x_{2},a_{2})\beta_{1}(da_{2}|x_{2})\right)\beta(dx_{1},da_{1},dx_{2})\\ &=&{\int}_{S\times A\times S}\left( {\int}_{A}r(x_{1},a_{2})\beta_{1}(da_{2}|x_{1})\right)\beta(dx_{1},da_{1},dx_{2})\\ &=&{\int}_{S\times A} r(x_{1},a_{2})\beta_{1}(da_{2}|x_{1})\beta_{0}(dx_{1})= E_{\beta_{0}}^{\pi_{\beta}}\left[r(X_{1},A_{1})\right]. \end{array} $$

The third equality is due to the coincidence of the first and the third marginal of β ensured by (3.2). By induction, we have

$$ \begin{array}{@{}rcl@{}} E_{\beta_{0}}^{\pi_{\beta}}\left[{\sum}_{i=1}^{N}r(X_{i},A_{i})\right]&=&N\cdot E_{\beta_{0}}^{\pi_{\beta}}\left[r(X_{1},A_{1})\right]\\ &=&N{\int}_{S\times A} r(x_{1},a_{1})\beta_{1}(da_{1}|x_{1})\beta_{0}(dx_{1})\\ &=&N{\int}_{S\times A}r(x,a)\beta^{\prime}(dx,da). \end{array} $$

Thus,

$$ \begin{array}{@{}rcl@{}} v\geq\liminf_{N\rightarrow\infty} E_{\beta_{0}}^{\pi_{\beta}}\left[\frac{1}{N}{\sum}_{i=1}^{N}r(X_{i},A_{i})\right]={\int}_{S\times A}r(x,a)\beta^{\prime}(dx,da). \end{array} $$

Now we are ready to prove Theorem 3.1.

Proof of Theorem 3.1

By Hölder’s inequality, for \(\gamma \geq \gamma ^{\prime }>0\), we have

$$ \begin{array}{*{20}l} \frac{1}{\gamma}\frac{1}{N}\log E_{x}^{\pi}\exp\left[{{\sum}_{t=1}^{N}\gamma r(X_{t},A_{t})}\right]\geq\frac{1}{\gamma^{\prime}}\frac{1}{N}\log E_{x}^{\pi}\exp\left[{{\sum}_{t=1}^{N}\gamma^{\prime} r(X_{t},A_{t})}\right] \end{array} $$

and

$$ \begin{array}{@{}rcl@{}} \frac{1}{\gamma}\frac{1}{N}\log E_{x}^{\pi}\exp\left[{{\sum}_{t=1}^{N}\gamma r(X_{t},A_{t})}\right]\geq\frac{1}{N} E_{x}^{\pi}\left[{\sum}_{t=1}^{N}r(X_{t},A_{t})\right]. \end{array} $$

Therefore, λ(γ) is non-decreasing in γ and \(\lim \limits _{\gamma \to 0}\lambda (\gamma )\geq v\). To prove (3.4), it suffices to verify that \(\lim \limits _{\gamma \to 0}\lambda (\gamma )\leq v\). To this end, we notice that it follows from (3.1) that for any ε > 0 and γ > 0, there exists \(\beta _{\gamma }^{\varepsilon }\in \mathcal {I}\) such that

$$ \lambda(\gamma)-\varepsilon\leq{\int}_{S\times A}\left[r(x,a)-\frac{1}{\gamma}D((\beta_{\gamma}^{\varepsilon})_{2}(\cdot|x,a)\Vert p(\cdot|x,a))\right](\beta_{\gamma}^{\varepsilon})'(dx,da). $$
(3.5)

Recalling that \(\mathcal {I}\subseteq \mathcal {P}(S\times A\times S)\) is compact, we can find a sequence \(\{\gamma _{n}\}_{n\in \mathbb {N}}\) decreasing to 0 and a \(\beta ^{\varepsilon }\in \mathcal {I}\) such that

$$\lim\limits_{\gamma\to 0}\lambda(\gamma)=\lim\limits_{n\to\infty}\lambda(\gamma_{n})\quad \text{ and }\quad \lim\limits_{n\to\infty}\beta_{\gamma_{n}}^{\varepsilon}=\beta^{\varepsilon}\text{ weakly}.$$

Therefore, from (A1), we know that

$$ \lim_{n\to\infty}\lambda(\gamma_{n})-\varepsilon\leq\lim_{n\to\infty} {\int}_{S\times A}r(x,a)\left( \beta_{\gamma_{n}}^{\varepsilon}\right)'(dx,da)={\int}_{S\times A}r(x,a)\left( \beta^{\varepsilon}\right)'(dx,da), $$
(3.6)

which is finite. Now we claim that \(\left (\beta ^{\varepsilon }\right )_{2}=p\). Indeed, we have

$$ \begin{array}{@{}rcl@{}} &&{\int}_{S\times A}D\Big(\left( \beta_{\gamma_{n}}^{\varepsilon}\right)_{2}(\cdot|x,a)\Big\Vert p(\cdot|x,a)\Big)\left( \beta_{\gamma_{n}}^{\varepsilon}\right)'(dx,da)\\ &&\quad=D\left( \beta_{\gamma_{n}}^{\varepsilon}(dx,da,dy)\Big\Vert\left( \beta_{\gamma_{n}}^{\varepsilon}\right)'(dx,da)p(dy|x,a)\right). \end{array} $$

It follows from the (joint) lower semicontinuity of D(⋅∥⋅) and (A2) that

$$ \begin{array}{@{}rcl@{}} &-&\liminf_{n\rightarrow\infty}D\left( \beta_{\gamma_{n}}^{\varepsilon}(dx,da,dy)\Big\Vert\left( \beta_{\gamma_{n}}^{\varepsilon}\right)'(dx,da)p(dy|x,a)\right)\leq \\&-&D\left( \beta^{\varepsilon}(dx,da,dy)\Big\Vert\left( \beta^{\varepsilon}\right)'(dx,da)p(dy|x,a)\right). \end{array} $$

Thus if \(\left (\beta ^{\varepsilon }\right )_{2}\neq p\), then

$$D\left( \beta^{\varepsilon}(dx,da,dy)\Big\Vert\left( \beta^{\varepsilon}\right)'(dx,da)p(dy|x,a)\right)>0.$$

Combining this with (3.5) and the fact that λ(γ) ≥ rm, we would have

$$ \begin{array}{@{}rcl@{}} r_{m}-\varepsilon\leq\lim_{n\to\infty}\lambda(\gamma_{n})-\varepsilon\leq-\infty. \end{array} $$

This is impossible. Thus, \(\beta ^{\varepsilon }\in \mathcal {I}\) and \(\left (\beta ^{\varepsilon }\right )_{2}=p\). Recalling (3.6) and Lemma 3.2, we obtain that

$$ \begin{array}{@{}rcl@{}} \lim_{n\to\infty}\lambda(\gamma_{n})-\varepsilon\leq{\int}_{S\times A}r(x,a)\left( \beta^{\varepsilon}\right)'(dx,da)\leq v. \end{array} $$

(3.4) follows by letting ε → 0. □

The remainder of this section is devoted to verifying (3.1) under certain conditions. This is carried out first under assumptions including (B1), and then with (B1) removed. Our assumptions (A2) and (A3) are slightly weaker than those in [2]. In [2], it is required that the family of functions

$$\left\{(x,a)\mapsto{\int}_{S}f(y)p(dy|x,a), f\in C(S),\left\|f\right\|\leq 1\right\}$$

is equicontinuous, while we assume that

$$\left\{x\mapsto{\int}_{S}f(y)p(dy|x,a), f\in C(S),\left\|f\right\|\leq 1,a\in A\right\}$$

is equicontinuous (Theorems 3.4 and 3.8). Moreover, it is worth mentioning that equicontinuity only plays a role in the existence of the positive eigenvalue and the eigenvector. Once L(γ) has a positive eigenvalue and a strictly positive eigenvector, only (A1) and (A2) are needed.

Proposition 3.3

Assume (A1) and (A2). If there exist ργ > 0 and fγ ∈ C(S) such that fγ >> 0 and L(γ)fγ = ργfγ, then (3.1) holds.

Proof

From the proof of Theorem 2.2, we know that \(\log \rho _{\gamma }=\gamma \lambda (x,\gamma )\) for any x ∈ S. Thus, for any nonzero \(\mu \in {\mathscr{M}}^{+}(S)\), we have

$$ \begin{array}{@{}rcl@{}} e^{\gamma\lambda(\gamma)}=\frac{\int L^{(\gamma)}f_{\gamma}d\mu}{\int f_{\gamma}d\mu}. \end{array} $$

Therefore,

$$ e^{\gamma\lambda(\gamma)}=\sup_{\mu\in\mathcal{M}^{+}(S)}\frac{\int L^{(\gamma)}f_{\gamma}d\mu}{\int f_{\gamma}d\mu}\geq\inf_{f>>0}\sup_{\mu\in\mathcal{M}^{+}(S)}\frac{\int L^{(\gamma)}fd\mu}{\int fd\mu}. $$
(3.7)

For any f >> 0, we also have

$$ \begin{array}{@{}rcl@{}} \frac{L^{(\gamma)}f}{f}\leq\sup_{\mu\in\mathcal{M}^{+}(S)}\frac{\int L^{(\gamma)}fd\mu}{\int fd\mu}, \end{array} $$

which means that

$$ \begin{array}{@{}rcl@{}} L^{(\gamma)}f\leq\left( \sup_{\mu\in\mathcal{M}^{+}(S)}\frac{\int L^{(\gamma)}fd\mu}{\int fd\mu}\right)f. \end{array} $$

Since properties (P2) and (P3) hold for L(γ) under (A1) and (A2), we can apply Theorems A.2 and A.3 in the Appendix to deduce that

$$ e^{\gamma\lambda(\gamma)}=\rho(L^{(\gamma)})\leq\left( \sup_{\mu\in\mathcal{M}^{+}(S)}\frac{\int L^{(\gamma)}fd\mu}{\int fd\mu}\right). $$
(3.8)

From (3.7) and (3.8), we have

$$ \begin{array}{@{}rcl@{}} \lambda(\gamma)=\frac{1}{\gamma}\log\inf_{f>>0}\sup_{\mu\in\mathcal{M}^{+}(S)}\frac{\int L^{(\gamma)}fd\mu}{\int fd\mu}=\frac{1}{\gamma}\log\inf_{f>>0}\sup_{\substack{\mu\in\mathcal{M}^{+}(S)\\\int fd\mu=1}}\int L^{(\gamma)}fd\mu. \end{array} $$

Thus,

$$ \begin{array}{@{}rcl@{}} \lambda(\gamma)&=&\frac{1}{\gamma}\log\inf_{f>>0}\sup_{\substack{\mu\in\mathcal{M}^{+}(S)\\\int fd\mu=1}}{\int}_{S}\mu(dx)\sup_{a\in A}{\int}_{S}e^{\gamma r(x,a)}f(y)p(dy|x,a)\\ &=&\frac{1}{\gamma}\log\inf_{f>>0}\sup_{\nu\in\mathcal{P}(S)}{\int}_{S}\nu(dx)\sup_{a\in A}{\int}_{S}e^{\gamma r(x,a)+\log f(y)-\log f(x)}p(dy|x,a)\\ &=&\frac{1}{\gamma}\inf_{g\in C(S)}\sup_{x\in S}\sup_{a\in A}\log{\int}_{S}e^{\gamma r(x,a)+g(y)-g(x)}p(dy|x,a)\\ &=&\frac{1}{\gamma}\inf_{g\in C(S)}\sup_{\eta\in\mathcal{P}(S\times A)}\log{\int}_{S\times A\times S}e^{\gamma r(x,a)+g(y)-g(x)}\eta(dx,da)p(dy|x,a). \end{array} $$

Using the Gibbs variational formula (Proposition 1.4.2(a), pp. 33–34 in [15]), we see that

$$ \begin{array}{@{}rcl@{}} \lambda(\gamma)=\frac{1}{\gamma}\inf_{g\in C(S)}\sup_{\eta\in\mathcal{P}(S\times A)}\sup_{\beta\in\mathcal{P}(S\times A\times S)}\Bigg\{{\int}_{S\times A\times S}[\gamma r(x,a)+g(y)-g(x)]\beta(dx,da,dy)\\ -D(\beta(dx,da,dy)\Vert\eta(dx,da)p(dy|x,a))\Bigg\}. \end{array} $$

Since D(μν) is jointly convex and lower semicontinuous in (μ,ν) (Lemma 1.4.3, pp. 36–38 in [15]) and \(\mathcal {P}(S\times A), \mathcal {P}(S\times A\times S)\) are both compact, the minimax theorem (Theorem 4.2 in [25]) can be applied to get

$$ \begin{array}{@{}rcl@{}} \lambda(\gamma)=\frac{1}{\gamma}\sup_{\beta\in\mathcal{P}(S\times A\times S)}\sup_{\eta\in\mathcal{P}(S\times A)}\inf_{g\in C(S)}\Bigg\{{\int}_{S\times A\times S}[\gamma r(x,a)+g(y)-g(x)]\beta(dx,da,dy)\\ -D(\beta(dx,da,dy)\Vert\eta(dx,da)p(dy|x,a))\Bigg\}. \end{array} $$

Furthermore, by the chain rule for relative entropy (Theorem D.13, pp. 357–359 in [9]), we have that

$$ \begin{array}{@{}rcl@{}} \lambda(\gamma)=\frac{1}{\gamma}\sup_{\beta\in\mathcal{P}(S\times A\times S)}\sup_{\eta\in\mathcal{P}(S\times A)}\inf_{g\in C(S)}\Bigg\{{\int}_{S\times A\times S}[\gamma r(x,a)+g(y)-g(x)]\beta(dx,da,dy)\\ -D(\beta^{\prime}(dx,da)\Vert\eta(dx,da))-{\int}_{S\times A}D(\beta_{2}(dy|x,a)\Vert p(dy|x,a))\beta^{\prime}(dx,da)\Bigg\}. \end{array} $$

Since D(μ∥ν) ≥ 0 and D(μ∥ν) = 0 iff μ = ν (Lemma 1.4.1, p. 33, in [15]), the supremum over \(\eta \in \mathcal {P}(S\times A)\) is attained at \(\eta =\beta ^{\prime }\). Moreover, notice that when \(\beta \in \mathcal {I}\), for any g ∈ C(S),

$${\int}_{S\times A\times S}[g(y)-g(x)]\beta(dx,da,dy)=0,$$

and for \(\beta \notin \mathcal {I}\),

$$\inf\limits_{g\in C(S)}{\int}_{S\times A\times S}[g(y)-g(x)]\beta(dx,da,dy)=-\infty,$$

we obtain that

$$ \begin{array}{@{}rcl@{}} \lambda(\gamma)&=&\sup_{\beta\in\mathcal{I}}\Bigg\{{\int}_{S\times A\times S}r(x,a)\beta(dx,da,dy)-\frac{1}{\gamma}{\int}_{S\times A}D(\beta_{2}(dy|x,a)\Vert p(dy|x,a))\beta^{\prime}(dx,da)\Bigg\}\\ &=&\sup_{\beta\in\mathcal{I}}\Bigg\{{\int}_{S\times A}\left[r(x,a)-\frac{1}{\gamma}D(\beta_{2}(dy|x,a)\Vert p(dy|x,a))\right]\beta^{\prime}(dx,da)\Bigg\}. \end{array} $$

Combining Theorem 2.2 and Proposition 3.3, we obtain the following theorem immediately.

Theorem 3.4

Assume (A1), (A2), (A3), and (B1). Then (3.1) holds.

To remove (B1), we use a perturbation argument. For each 𝜖 > 0, define a new MDP M𝜖 with the transition law and one-step reward given by

$$ \begin{gathered} p^{(\gamma)}_{\epsilon}(dy|x,a):=\frac{\epsilon{\Gamma}(dy)+e^{\gamma r(x,a)}p(dy|x,a)}{\epsilon+e^{\gamma r(x,a)}}\ \text{ and }\ r^{(\gamma)}_{\epsilon}(x,a):=\log\left( \epsilon+e^{\gamma r(x,a)}\right) \end{gathered} $$
(3.9)

respectively, where \({\Gamma }\in \mathcal {P}(S)\) has full support. It is not hard to check that M𝜖 satisfies (A1), (A2), (A3), and (B1). Using \( E_{\epsilon ,\gamma ,x}^{\pi }\) to denote the corresponding expectation operator with initial state x and policy π, we define

$$ L^{(\gamma)}_{\epsilon}f(x):=\sup_{\pi\in{\Pi}_{M}} E_{\epsilon,\gamma,x}^{\pi}\left[e^{r^{(\gamma)}_{\epsilon}(X_{1},A_{1})}f(X_{2})\right]=L^{(\gamma)}f(x)+\epsilon\int{\Gamma}(dy)f(y) $$
(3.10)

and

$$ \lambda_{\epsilon}(x,\gamma):=\sup_{\pi\in{\Pi}_{M}}\frac{1}{\gamma}\liminf_{N\rightarrow\infty}\frac{1}{N}\log E_{\epsilon,\gamma,x}^{\pi}\exp\left[\sum\limits_{t=1}^{N} r^{(\gamma)}_{\epsilon}(X_{t},A_{t})\right]. $$
(3.11)
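
The second equality in (3.10) can be checked directly from (3.9): for f ∈ C(S), x ∈ S, and a ∈ A,

$$ \begin{array}{@{}rcl@{}} e^{r^{(\gamma)}_{\epsilon}(x,a)}{\int}_{S}f(y)p^{(\gamma)}_{\epsilon}(dy|x,a)=\epsilon\int f d{\Gamma}+e^{\gamma r(x,a)}{\int}_{S}f(y)p(dy|x,a), \end{array} $$

and taking the supremum over a ∈ A on both sides (the one-step case of (P1) applied to M𝜖) gives \(L^{(\gamma )}_{\epsilon }f=L^{(\gamma )}f+\epsilon \int f d{\Gamma }\).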

By Theorem 2.2, λ𝜖(x,γ) depends only on γ, and the limit inferior is actually a limit. Hence, we write it as λ𝜖(γ). Without (B1), we will prove the variational formula by exploring properties of λ𝜖(γ) and then letting 𝜖 → 0.

Lemma 3.5

Assume (A1) and (A2). Then λ𝜖(γ) is non-decreasing in 𝜖 and \(\lim \limits _{\epsilon \to 0}\lambda _{\epsilon }(\gamma )\geq \lambda (\gamma )\).

Proof

From property (P1), we have

$$ \lambda_{\epsilon}(\gamma)=\lim_{N\rightarrow\infty}\frac{1}{\gamma}\sup_{\pi\in{\Pi}_{M}}\frac{1}{N}\log E_{\epsilon,\gamma,x}^{\pi}\exp\left[{\sum}_{t=1}^{N}r_{\epsilon}^{(\gamma)}(X_{t},A_{t})\right]=\lim_{N\rightarrow\infty}\frac{1}{\gamma N}\log(L^{(\gamma)}_{\epsilon})^{N}\textbf{1}(x) $$
(3.12)

for any xS. Thus, for any 𝜖1 > 𝜖2 > 0, by (3.10), we obtain that

$$ \begin{array}{@{}rcl@{}} \lambda_{\epsilon_{1}}(\gamma)\geq\lambda_{\epsilon_{2}}(\gamma)&\geq&\sup_{x\in S}\liminf_{N\rightarrow\infty}\frac{1}{\gamma N}\log(L^{(\gamma)})^{N}\textbf{1}(x)\\ &\geq&\sup_{x\in S}\sup_{\pi\in{\Pi}_{M}}\liminf_{N\rightarrow\infty}\frac{1}{\gamma N}\log E_{x}^{\pi}\exp\left[\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})\right]=\lambda(\gamma). \end{array} $$

In order to write the variational formula in a form that is more convenient to use in the following arguments, we define, for given 𝜖 > 0, γ > 0, and \(\beta \in \mathcal {I}\),

$$ \begin{array}{@{}rcl@{}} \phi(\beta,\gamma,\epsilon):=& {\int}_{S\times A}\left[\frac{1}{\gamma}r^{(\gamma)}_{\epsilon}(x,a)-\frac{1}{\gamma}D\left( \beta_{2}(\cdot|x,a)\Vert p_{\epsilon}^{(\gamma)}(\cdot|x,a)\right)\right]\beta^{\prime}(dx,da) \end{array} $$

and

$$ \begin{array}{@{}rcl@{}} \phi(\beta,\gamma,0):=& {\int}_{S\times A}\left[r(x,a)-\frac{1}{\gamma}D\left( \beta_{2}(\cdot|x,a)\Vert p(\cdot|x,a)\right)\right]\beta^{\prime}(dx,da). \end{array} $$

To prove \(\lambda (\gamma )\geq \lim \limits _{\epsilon \to 0}\sup \limits _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,\epsilon )\), we will show that

$$\lambda(\gamma)\geq\sup\limits_{x\in S}\lambda_{SM}(x,\gamma)\geq\sup\limits_{\beta\in\mathcal{I}}\phi(\beta,\gamma,0)\geq\lim\limits_{\epsilon\to 0}\sup\limits_{\beta\in\mathcal{I}}\phi(\beta,\gamma,\epsilon),$$

where λSM(x,γ) is defined by

$$ \begin{array}{@{}rcl@{}} \lambda_{SM}(x,\gamma):=\sup_{d\in D_{M}}\liminf_{N\rightarrow\infty}\frac{1}{\gamma N}\log E_{x}^{d^{\infty}}\exp\left[\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})\right], \end{array} $$

with \(d^{\infty }\) denoting the stationary Markov policy whose decision rule at each time is the same d ∈ DM.

Lemma 3.6

Assume (A1) and (A2). Then \(\sup \limits _{x\in S}\lambda _{SM}(x,\gamma )\geq \sup \limits _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,0)\).

Proof

We need to prove that

$$\sup\limits_{x\in S}\lambda_{SM}(x,\gamma)\geq\phi(\beta,\gamma,0)$$

for each \(\beta \in \mathcal {I}\). If \(\phi (\beta ,\gamma ,0)=-\infty \), the inequality holds trivially. Otherwise, \(\beta \in \mathcal {I}\) with \(\phi (\beta ,\gamma ,0)>-\infty \) implies that β2(⋅|x,a) << p(⋅|x,a) \(\beta ^{\prime }\)-a.s. Choosing the stationary Markov policy \(\pi _{\beta }=(\beta _{1}(da|x))^{\infty }\) and the initial distribution \(\beta _{0}\in \mathcal {P}(S)\), we see that

$$ \sup\limits_{x\in S}\lambda_{SM}(x,\gamma)\geq \underset{N\rightarrow\infty}{\liminf}\frac{1}{\gamma N}\log E_{\beta_{0}}^{\pi_{\beta}}\exp\left[{\sum}_{t=1}^{N}\gamma r(X_{t},A_{t})\right]. $$

Define \(^{\beta }{E}_{\beta _{0}}^{\pi _{\beta }}\) as the expectation operator with respect to the probability measure determined by the initial distribution β0, the transition law β2(dy|x,a) for \(\{X_{t}\}_{t\in \mathbb {N}}\), and the policy πβ. Using the change of measure technique and Jensen’s inequality, we obtain that

$$ \begin{array}{@{}rcl@{}} \log E_{\beta_{0}}^{\pi_{\beta}}\exp\left[\sum\limits_{t=1}^{N}\gamma r(X_{t},A_{t})\right]&=&\log{{~}^{\beta}}{ E}_{\beta_{0}}^{\pi_{\beta}}\exp\left\{{\sum}_{t=1}^{N}\left[\gamma r(X_{t},A_{t})-\log\left( \frac{d\beta_{2}(\cdot|X_{t},A_{t})}{dp(\cdot|X_{t},A_{t})}(X_{t+1})\right)\right]\right\}\\ &\geq&{~}^{\beta}{E}_{\beta_{0}}^{\pi_{\beta}}{\sum}_{t=1}^{N}\left[\gamma r(X_{t},A_{t})-\log\left( \frac{d\beta_{2}(\cdot|X_{t},A_{t})}{dp(\cdot|X_{t},A_{t})}(X_{t+1})\right)\right]. \end{array} $$

Since \(\beta \in \mathcal {I}\), the same argument as in proving Lemma 3.2 shows that

$$ \begin{array}{@{}rcl@{}} &&{{~}^{\beta}{E}}_{\beta_{0}}^{\pi_{\beta}}{\sum}_{t=1}^{N}\left[\gamma r(X_{t},A_{t})-\log\left( \frac{d\beta_{2}(\cdot|X_{t},A_{t})}{dp(\cdot|X_{t},A_{t})}(X_{t+1})\right)\right]\\ &&\quad=N\cdot{{~}^{\beta}}{ E}_{\beta_{0}}^{\pi_{\beta}}\left[\gamma r(X_{1},A_{1})-\log\left( \frac{d\beta_{2}(\cdot|X_{1},A_{1})}{dp(\cdot|X_{1},A_{1})}(X_{2})\right)\right]. \end{array} $$

Consequently,

$$ \begin{array}{@{}rcl@{}} \sup\limits_{x\in S}\lambda_{SM}(x,\gamma)\geq\frac{1}{\gamma}{~}^{\beta}{E}_{\beta_{0}}^{\pi_{\beta}}\left[\gamma r(X_{1},A_{1})-\log\left( \frac{d\beta_{2}(\cdot|X_{1},A_{1})}{dp(\cdot|X_{1},A_{1})}(X_{2})\right)\right]=\phi(\beta,\gamma,0). \end{array} $$

Combining Theorem 3.4 and Lemmas 3.5 and 3.6, we obtain the following

Theorem 3.7

Assume (A1), (A2), and (A3). Then (3.1) holds.

Proof

From Lemmas 3.5 and 3.6, we see that

$$ \lim_{\epsilon\to 0}\lambda_{\epsilon}(\gamma)\geq\lambda(\gamma)=\sup\limits_{x\in S}\lambda(x,\gamma)\geq\sup\limits_{x\in S}\lambda_{SM}(x,\gamma)\geq\sup\limits_{\beta\in\mathcal{I}}\phi(\beta,\gamma,0). $$
(3.13)

Hence, (3.1) will follow once we prove that \(\sup _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,0)\geq \lim \limits _{\epsilon \to 0}\lambda _{\epsilon }(\gamma )\). Since M𝜖 satisfies (A1), (A2), (A3), and (B1), by Theorems 2.2 and 3.4, we have

$$ \lambda_{\epsilon}(\gamma)=\sup_{\beta\in\mathcal{I}}\phi(\beta,\gamma,\epsilon). $$
(3.14)

Therefore, given ξ > 0, for every 𝜖 > 0, there exists \(\beta _{\epsilon }^{\xi }\in \mathcal {I}\) such that

$$\lambda_{\epsilon}(\gamma)<\phi(\beta_{\epsilon}^{\xi},\gamma,\epsilon)+\xi.$$

Since \(\mathcal {I}\) is compact, there exists a sequence \(\{\epsilon _{n}\}_{n\in \mathbb {N}}\) decreasing to 0 such that the weak limit \(\lim \limits _{n\to \infty }\beta _{\epsilon _{n}}^{\xi }=:\beta ^{\xi }\) exists and belongs to \(\mathcal {I}\). By (A1) and Dini's theorem, \(r^{(\gamma )}_{\epsilon }(\cdot ,\cdot )\) converges to γr(⋅,⋅) uniformly as 𝜖 → 0. Thus, we obtain that

$$ \begin{array}{@{}rcl@{}} \lim_{n\to\infty}{\int}_{S\times A}\frac{1}{\gamma}r^{(\gamma)}_{\epsilon_{n}}(x,a)\left( \beta_{\epsilon_{n}}^{\xi}\right)'(dx,da)={\int}_{S\times A}r(x,a)\left( \beta^{\xi}\right)'(dx,da). \end{array} $$

Recalling the definition of β2 for \(\beta \in \mathcal {I}\), by the lower semicontinuity of D(⋅∥⋅), we see that

$$ \begin{array}{@{}rcl@{}} \liminf_{n\rightarrow\infty}& & {\int}_{S\times A}D\left( \left( \beta_{\epsilon_{n}}^{\xi}\right)_{2}(dy|x,a)\Big\Vert p^{(\gamma)}_{\epsilon_{n}}(dy|x,a)\right)\left( \beta_{\epsilon_{n}}^{\xi}\right)'(dx,da)\\ &=&\liminf_{n\rightarrow\infty}D\left( \beta_{\epsilon_{n}}^{\xi}(dx,da,dy)\Big\Vert\left( \beta_{\epsilon_{n}}^{\xi}\right)'(dx,da)p^{(\gamma)}_{\epsilon_{n}}(dy|x,a)\right)\\ &\geq&D\left( \beta^{\xi}(dx,da,dy)\Big\Vert\left( \beta^{\xi}\right)'(dx,da)p(dy|x,a)\right)\\ &=&{\int}_{S\times A}D\left( \left( \beta^{\xi}\right)_{2}(dy|x,a)\Big\Vert p(dy|x,a)\right)\left( \beta^{\xi}\right)'(dx,da). \end{array} $$

It then follows that

$$ \begin{array}{@{}rcl@{}} \limsup\limits_{n\to\infty}\phi(\beta_{\epsilon_{n}}^{\xi},\gamma,\epsilon_{n}) &\leq&{\int}_{S\times A}\left[r(x,a)-\frac{1}{\gamma}D\left( \left( \beta^{\xi}\right)_{2}(\cdot|x,a)\Vert p(\cdot|x,a)\right)\right]\left( \beta^{\xi}\right)'(dx,da)\\ &=&\phi(\beta^{\xi},\gamma,0) \end{array} $$

Thus,

$$ \sup_{\beta\in\mathcal{I}}\phi(\beta,\gamma,0)\geq\phi(\beta^{\xi},\gamma,0)\geq\limsup\limits_{n\to\infty}\phi(\beta_{\epsilon_{n}}^{\xi},\gamma,\epsilon_{n})\geq\lim_{n\to\infty}\lambda_{\epsilon_{n}}(\gamma)-\xi. $$

Letting ξ → 0, we have \(\sup _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,0)\geq \lim _{\epsilon \to 0}\lambda _{\epsilon }(\gamma )\). Now (3.1) follows. □

Remark 3.1

The proof shows that the inequalities in (3.13) are actually equalities, which indicates that the supremum over Markov policies in a risk-sensitive MDP equals the supremum over stationary Markov policies; that is, one may restrict the search for an optimal policy to stationary policies even without ergodicity of the transition probability.
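
Explicitly, combining (3.13) with the inequality \(\sup _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,0)\geq \lim _{\epsilon \to 0}\lambda _{\epsilon }(\gamma )\) established in the proof, we obtain the chain of equalities

$$ \lim\limits_{\epsilon\to 0}\lambda_{\epsilon}(\gamma)=\lambda(\gamma)=\sup\limits_{x\in S}\lambda_{SM}(x,\gamma)=\sup\limits_{\beta\in\mathcal{I}}\phi(\beta,\gamma,0), $$

so the optimal value over all Markov policies coincides with the optimal value over stationary Markov policies.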

Combining Theorems 3.1, 3.4, and 3.7, we obtain the main result immediately.

Theorem 3.8

Assume (A1), (A2), and (A3). Then

$$ \begin{array}{@{}rcl@{}} \lim\limits_{\gamma\to 0}\lambda(\gamma)=v. \end{array} $$

In addition, if (B1) holds, then

$$ \begin{array}{@{}rcl@{}} \lim\limits_{\gamma\to 0}\lambda(x,\gamma)=v \end{array} $$

for any x ∈ S.

Remark 3.2

Recalling the proof of Theorem 3.1, we see that under (A1), (A2), and (A3), there exists \(\mu (dx,da)\in \mathcal {P}(S\times A)\) such that

$$ \begin{array}{@{}rcl@{}} {\int}_{S\times A}\mu(dx,da)p(dy|x,a)={\int}_{A}\mu(dy,da). \end{array} $$


A sufficient condition for the risk-neutral average optimal reward v(x) to be independent of the initial state x is the uniform ergodicity (2.7) (see Section 5.5 in [20]). We rewrite it as

  1. (B2)

    There exists δ < 1 such that

    $$ \sup_{U\in \mathcal{B}(S)}\sup_{x,x^{\prime}\in S}\sup_{a,a^{\prime}\in A}[p(U|x,a)-p(U|x^{\prime},a^{\prime})]\leq\delta. $$
    (3.15)

We provide a brief proof of this.

Theorem 3.9

Assume (A1), (A2), and (B2). Then v(x) is independent of the initial state x.

Proof

Define an operator T on C(S) by

$$ Tf(x):=\sup_{a\in A}\left[r(x,a)+{\int}_{S}p(dy|x,a)f(y)\right]. $$

It is not hard to check that under (A1) and (A2), T maps C(S) into itself. Let

$$\|f\|_{\text{sp}}:=\sup\limits_{x\in S}f(x)-\inf\limits_{x^{\prime}\in S}f(x^{\prime})$$

be the span norm on C(S). For f1, f2 ∈ C(S), x1, x2 ∈ S, and ε > 0, there exist a1, a2 ∈ A such that

$$ Tf_{i}(x_{i})\leq r(x_{i},a_{i})+{\int}_{S}p(dy|x_{i},a_{i})f_{i}(y)+\varepsilon, i=1,2. $$

Therefore, we obtain that

$$ \begin{array}{@{}rcl@{}} Tf_{1}(x_{1})& -&Tf_{2}(x_{1})-\left[Tf_{1}(x_{2})-Tf_{2}(x_{2})\right]\\ &\leq & \left[r(x_{1},a_{1})+{\int}_{S}p(dy|x_{1},a_{1})f_{1}(y)+\varepsilon\right] - \left[r(x_{1},a_{1})+{\int}_{S}p(dy|x_{1},a_{1})f_{2}(y)\right]\\ &- &\left[r(x_{2},a_{2})+{\int}_{S}p(dy|x_{2},a_{2})f_{1}(y)\right] + \left[r(x_{2},a_{2})+{\int}_{S}p(dy|x_{2},a_{2})f_{2}(y)+\varepsilon\right]\\ & =&{\int}_{S}p(dy|x_{1},a_{1})\left[f_{1}(y)-f_{2}(y)\right]-{\int}_{S}p(dy|x_{2},a_{2})\left[f_{1}(y)-f_{2}(y)\right]+2\varepsilon\\ &\leq&\left[p(E|x_{1},a_{1})-p(E|x_{2},a_{2})\right]\cdot\|f_{1}-f_{2}\|_{\text{sp}}+2\varepsilon\leq\delta\cdot\|f_{1}-f_{2}\|_{\text{sp}}+2\varepsilon, \end{array} $$

where E in the second-to-last inequality comes from the Hahn-Jordan decomposition of p(⋅|x1,a1) − p(⋅|x2,a2). Letting ε → 0, we see that T is a contraction mapping on (C(S),∥⋅∥sp). Thus, by the Banach fixed-point theorem, there exists a unique (up to an additive constant) f0 ∈ C(S) such that ∥Tf0 − f0∥sp = 0, which means that Tf0(x) − f0(x) is a constant v0. It follows that for any x ∈ S,

$$ \begin{array}{@{}rcl@{}} v_{0}=\lim_{N\to\infty}\frac{1}{N}T^{N}f_{0}(x)&=&\lim_{N\to\infty}\sup_{\pi\in{\Pi}_{M}}\frac{1}{N} E_{x}^{\pi}\left[{\sum}_{t=1}^{N}r(X_{t},A_{t})\right]\\ &\geq&\sup_{\pi\in{\Pi}_{M}}\liminf_{N\to\infty}\frac{1}{N} E_{x}^{\pi}\left[{\sum}_{t=1}^{N}r(X_{t},A_{t})\right]. \end{array} $$
(3.16)

Since \(a\mapsto r(x,a)+{\int \limits } p(dy|x,a)f_{0}(y)\) is in C(A) due to (A1) and (A2), for each x ∈ S, there exists a d0(x) ∈ A such that

$$ Tf_{0}(x)=r(x,d_{0}(x))+\int p(dy|x,d_{0}(x))f_{0}(y). $$

Letting \(\pi _{0}=(d_{0})^{\infty }\), we have

$$ \begin{array}{@{}rcl@{}} \lim_{N\to\infty}\sup_{\pi\in{\Pi}_{M}}\frac{1}{N} E_{x}^{\pi}\left[{\sum}_{t=1}^{N}r(X_{t},A_{t})\right]&=&\lim_{N\to\infty}\frac{1}{N} E_{x}^{\pi_{0}}\left[{\sum}_{t=1}^{N}r(X_{t},A_{t})\right]\\ &\leq&\sup_{\pi\in{\Pi}_{M}}\liminf_{N\to\infty}\frac{1}{N} E_{x}^{\pi}\left[{\sum}_{t=1}^{N}r(X_{t},A_{t})\right]. \end{array} $$
(3.17)

(3.16) and (3.17) imply that v0 = v(x) for any x ∈ S. □

Combining the above theorem with our main result (Theorem 3.8), we obtain the following

Corollary 3.10

Assume (A1), (A2), (A3), (B1), and (B2). Then

$$ \begin{array}{@{}rcl@{}} \lim\limits_{\gamma\to 0}\lambda(x,\gamma)=v(x), \end{array} $$

for any x ∈ S. Furthermore, this limit is indeed independent of x ∈ S.

4 Risk-Sensitive Asymptotics of POMDP

This section applies the approach explored in the last section to partially observable Markov decision processes (POMDPs). Francesca Albertini, Paolo Dai Pra, and Chiara Prior established such a limit in [1] for processes described by Xn+1 = f(Xn,An,Wn), Yn = h(Xn,Vn), where Xn, An, and Yn denote the state, control, and observation, respectively, and W,V are i.i.d. random variables. As for general POMDPs, Di Masi and Stettner proved the existence of the solution to the associated Bellman equation for cost-minimizing problems and stated that the limit as γ → 0 had not been proven (see Remark 2 in [11]). The method in [11], however, cannot be applied to reward-maximizing problems, since it requires the operator induced from the Bellman equation to preserve concavity, whereas in the maximization case the operator preserves convexity. Nevertheless, we can prove that, given the existence of a solution to the Bellman equation, (3.4) holds for the maximal reward of POMDPs.

A POMDP can be represented as a six-tuple MP = 〈S,A,O,p(⋅|⋅,⋅),q(⋅|⋅),r(⋅,⋅)〉. S is the space of real but unobserved states, A is the action space, and both are assumed to be compact metric spaces. The observation space O is a Polish space. As in MDPs, p is the transition kernel depending on actions, and \(r: S\times A\to \mathbb {R}\) is the reward function. q(⋅|x) denotes the observation probability when the system is in state x ∈ S. As mentioned in the introduction, a widely used technique for analyzing a POMDP is to transfer it into a completely observable MDP. We also adopt this technique, which allows us to employ the analysis for MDPs in the last section to establish

$$ \lim_{\gamma\to 0}\lambda_{P}(\gamma)=v_{P}, $$

where

$$ \lambda_{P}(\gamma):=\sup_{\theta\in\mathcal{P}(S)}\sup_{\pi\in{\Pi}}\frac{1}{\gamma}\liminf_{N\rightarrow\infty}\frac{1}{N}\log E_{\theta}^{\pi}\left[e^{{\sum}_{t=1}^{N}\gamma r(X_{t},A_{t})}\right] $$

and

$$ v_{P}:=\sup_{\theta\in\mathcal{P}(S)}\sup_{\pi\in{\Pi}}\liminf_{N\rightarrow\infty}\frac{1}{N} E_{\theta}^{\pi}\left[{\sum}_{t=1}^{N} r(X_{t},A_{t})\right]. $$

The exact definitions of the set Π of policies and the expectation operator \( E_{\theta }^{\pi }\) will be given after introducing the assumptions needed in this section.

In order to use the measure transformation technique, we first assume that

  1. (C0)

    There exists a \({\Lambda }\in \mathcal {P}(O)\) with full support such that for every x ∈ S, q(⋅|x) << Λ.

The corresponding density function is also denoted by q(⋅|⋅), i.e.,

$$q(dy|x)=q(y|x){\Lambda}(dy), x\in S.$$

To apply Theorem 3.1 and Proposition 3.3, we make the following assumptions to guarantee that the reward and the transition probability of the transferred MDP satisfy (A1) and (A2).

  1. (C1)

    r(⋅,⋅) ∈ C(S × A).

  2. (C2)

    q(y|⋅) ∈ Lip(S). There exist qm > 0 and qM > 0 such that q(y|⋅) ≥ qm and \(\left \|q(y|\cdot )\right \|_{L}\leq q_{M}\) for every y ∈ O.

  3. (C3)

    There exists Kp > 0 such that

    $$\|p(\cdot|x,a)-p(\cdot|x^{\prime},a^{\prime})\|_{KR}\leq K_{p}\left[\rho_{S}(x,x^{\prime})+\rho_{A}(a,a^{\prime})\right]$$

    for any \(x,x^{\prime }\in S\) and \(a,a^{\prime }\in A\), where ∥⋅∥KR denotes the Kantorovich-Rubinstein norm defined by (1.5) on \(\mathcal {P}(S)\), ρS denotes the metric on S, and ρA denotes the metric on A.

To define a probability space and a stochastic process with the desired mechanism, let \({\Omega }_{p}=S\times (A\times S\times O)^{\infty }\) and \({\mathscr{B}}({\Omega }_{p})\) be the product Borel σ-field. Given a sample path ω = (x1,a1,x2,y2,a2,...) ∈ΩP, define Xt := xt,At := at,t ≥ 1, and Yt := yt,t ≥ 2. At each time \(t\in \mathbb {N}\), the system MP occupies a state Xt, which is unobservable. When t = 1, we know the distribution of X1 and then choose an action A1. When t ≥ 2, we can observe a signal Yt generated by Xt and then choose an action At. The optimal policy in a POMDP is usually not a Markovian one due to the unavailability of the real states when making decisions. Hence, we introduce observed-history-dependent policies. Let \(\mathbb {H}_{t}\) denote the set of observed histories up to time \(t\in \mathbb {N}\). Then, \(\mathbb {H}_{1}=\mathcal {P}(S)\) (the set of all the initial state distributions) and \(\mathbb {H}_{t+1}=\mathbb {H}_{t}\times A\times O\). An observed-history-dependent decision rule at time t is a stochastic kernel \(d_{t}\in \mathcal {P}(A|\mathbb {H}_{t})\), where dt(B|ht) denotes the probability of taking an action in \(B\subseteq A\) when observing \(h_{t}=(\theta _{1},a_{1},y_{2},a_{2},y_{3},...,a_{t-1},y_{t})\in \mathbb {H}_{t}\). An observed-history-dependent policy π is a sequence of such decision rules at different times. Let Dt denote all the observed-history-dependent decision rules at time t, and \({\Pi }=\prod\limits_{t=1}^{\infty }D_{t}\) denote all the observed-history-dependent policies. Given an initial distribution of states \(\theta _{1}\in \mathcal {P}(S)\) and a policy π = (d1,d2,...) ∈ Π, a unique probability measure \(\text {P}_{\theta _{1}}^{\pi }\) and the corresponding expectation operator on \({\mathscr{B}}({\Omega }_{p})\) are defined by the Ionescu-Tulcea theorem, such that for each t ≥ 1,

$$ \begin{array}{@{}rcl@{}} &&\text{P}_{\theta_{1}}^{\pi}(dx_{1},da_{1},dx_{2},dy_{2},...da_{t-1},dx_{t},dy_{t})\\ &&\quad=\theta_{1}(dx_{1})\left( \prod\limits_{i=1}^{t-1}d_{i}(da_{i}|h_{i})p(dx_{i+1}|x_{i},a_{i})q(y_{i+1}|x_{i+1}){\Lambda}(dy_{i+1})\right). \end{array} $$

The risk-sensitive criterion introduced in Section 2 is to optimize

$$ \lambda_{P}(\theta, \gamma):=\sup_{\pi\in{\Pi}}\frac{1}{\gamma}\liminf_{N\rightarrow\infty}\frac{1}{N}\log E_{\theta}^{\pi}\left[e^{{\sum}_{t=1}^{N}\gamma r(X_{t},A_{t})}\right],\qquad\theta\in\mathcal{P}(S),\gamma>0. $$
(4.1)

The typical optimal average reward is

$$ v_{P}(\theta):=\sup_{\pi\in{\Pi}}\liminf_{N\rightarrow\infty}\frac{1}{N} E_{\theta}^{\pi}\left[{\sum}_{t=1}^{N} r(X_{t},A_{t})\right],\qquad\theta\in\mathcal{P}(S). $$
(4.2)

Let \(\lambda _{P}(\gamma ):=\sup _{\theta \in \mathcal {P}(S)}\lambda _{P}(\theta , \gamma )\) and \(v_{P}:=\sup _{\theta \in \mathcal {P}(S)}v_{P}(\theta )\). We intend to apply Theorem 3.1 and Proposition 3.3 to prove the risk-sensitive asymptotics

$$ \lim_{\gamma\to0}\lambda_{P}(\gamma)=v_{P}. $$
(4.3)

It has already been shown that optimal control of a POMDP MP under the average reward criterion can be converted to the optimal control of a properly transferred MDP M0 (see, e.g., Section 7.2.1 in [3], and Section 5.3, pp. 157–159 in [5]), where the new states are the conditional distributions of real states given the observed history. The transition law p0 and one-step reward r0 of M0 are

$$ \begin{array}{@{}rcl@{}} p_{0}(d\theta^{\prime}|\theta,a)&:=&{\int}_{O}{\Delta}^{(0)}_{a,y,\theta}(d\theta^{\prime})\left( {\int}_{S\times S}q(y|x^{\prime})p(dx^{\prime}|x,a)\theta(dx)\right){\Lambda}(dy),\\ r_{0}(\theta,a)&:=&{\int}_{S}\theta(dx)r(x,a),\qquad\theta\in\mathcal{P}(S),a\in A, \end{array} $$
(4.4)

where \({\Delta }^{(0)}_{a,y,\theta }\) is a measure on \(\mathcal {P}(S)\) defined by

$$ \begin{array}{@{}rcl@{}} {\Delta}^{(0)}_{a,y,\theta}(U):= \begin{cases} \delta\left\{\frac{T^{*}_{a,y,0}(\theta)}{T^{*}_{a,y,0}(\theta)(S)}\right\}(U),\quad&T^{*}_{a,y,0}(\theta)(S)\neq 0\\ 0,\quad&T^{*}_{a,y,0}(\theta)(S)=0 \end{cases},\qquad U\subseteq \mathcal{P}(S), \end{array} $$
(4.5)

and \(T^{*}_{a,y,0}\) is an operator on \({\mathscr{M}}^{+}(S)\) given by

$$ T^{*}_{a,y,0}(\mu)(E):= {\int}_{S\times S}\textbf{1}_{E}(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a)\mu(dx),\quad \mu\in \mathcal{M}^{+}(S),E\in\mathcal{B}(S). $$
(4.6)

In the case of risk-sensitive control, the transformed MDP is slightly different from the typical form of average reward control (see [4]). We present the transformation procedure in our notation. Assuming (C0), (C1), (C2), and (C3), we first derive the new state and the corresponding transition mechanism of the transformed MDP. For t ≥ 1, define two σ-fields \(\mathcal {F}_{t}\) and \(\mathcal {G}_{t}\) by

$$ \mathcal{F}_{t}=\sigma(A_{1},Y_{2},...,A_{t-1},Y_{t}),\qquad\mathcal{G}_{t}=\sigma(X_{1},A_{1},X_{2},Y_{2},...,A_{t-1},X_{t},Y_{t}), $$

respectively. Let Ht = (𝜃1,A1,Y2,...,At− 1,Yt) denote the observed history up to time t. Since \(\theta _{1}\in \mathcal {P}(S)\) is fixed, \(\mathcal {F}_{t}=\sigma (H_{t})\). Define another probability measure \(\widetilde {P}_{\theta _{1}}^{\pi }\) on \({\mathscr{B}}({\Omega }_{P})\) by

$$ \begin{array}{@{}rcl@{}} &&\widetilde{\text{P}}_{\theta_{1}}^{\pi}(dx_{1},da_{1},dx_{2},dy_{2},...da_{t-1},dx_{t},dy_{t})\\ &&\quad=\theta_{1}(dx_{1})\left(\prod\limits_{i=1}^{t-1}d_{i}(da_{i}|h_{i})p(dx_{i+1}|x_{i},a_{i}){\Lambda}(dy_{i+1})\right), \end{array} $$

or equivalently,

$$ \frac{d\text{P}_{\theta_{1}}^{\pi}}{d\widetilde{\text{P}}_{\theta_{1}}^{\pi}}\Bigg|_{\mathcal{G}_{t}}=\prod\limits_{i=2}^{t}q(Y_{i}|X_{i})=: R_{t}. $$

Since S, A, and O are all Polish spaces, ΩP is also Polish. Thus, the following conditional expectations on \(({\Omega }_{P},{\mathscr{B}}({\Omega }_{P}),\widetilde {P}_{\theta _{1}}^{\pi })\) for bounded Borel functions f on S

$$ \widetilde{ E}_{\theta_{1}}^{\pi}\left[f(X_{t})e^{\gamma{\sum}_{i=1}^{t-1}r(X_{i},A_{i})}R_{t}\Big|H_{t}=h_{t}\right],t\geq 2 $$

have regular versions. Therefore, we can define an \({\mathscr{M}}^{+}(S)\)-valued process \(\{\psi ^{(\gamma )}_{t}\}\) by

$$ \psi^{(\gamma)}_{1}:=\theta_{1},\qquad\psi^{(\gamma)}_{t}(f):=\widetilde{ E}_{\theta_{1}}^{\pi}\left[f(X_{t})e^{\gamma{\sum}_{i=1}^{t-1}r(X_{i},A_{i})}R_{t}\Big|H_{t}\right],t\geq 2, $$
(4.7)

where f is bounded and measurable on S. For a ∈ A and y ∈ O, define Ta,y,γ as an operator on the space of bounded Borel functions on S by

$$ \begin{array}{@{}rcl@{}} T_{a,y,\gamma}(f)(x):={\int}_{S}e^{\gamma r(x,a)}q(y|x^{\prime})f(x^{\prime})p(dx^{\prime}|x,a),\quad x\in S. \end{array} $$

Noticing that under \(\widetilde {P}_{\theta _{1}}^{\pi }\), Yt+1 is independent of Xs+1, As+1, and Ys for s ≤ t, and that Xt+1 depends on σ(Gt ∪ σ(At,Yt+1)) only through Xt and At, we have

$$ \begin{array}{@{}rcl@{}} & &\widetilde{E}_{\theta_{1}}^{\pi}\left[f(X_{t+1})e^{\gamma{\sum}_{i=1}^{t}r(X_{i},A_{i})}R_{t+1}\Big|H_{t+1}=h_{t+1}\right]\\ &=&\widetilde{ E}_{\theta_{1}}^{\pi}\left[\widetilde{ E}_{\theta_{1}}^{\pi}\left[f(X_{t+1})e^{\gamma r(X_{t},A_{t})}q(Y_{t+1}|X_{t+1})\Big|\sigma(G_{t}\cup \sigma(A_{t},Y_{t+1}))\right]e^{\gamma{\sum}_{i=1}^{t-1}r(X_{i},A_{i})}R_{t}\Big|H_{t+1}=h_{t+1}\right]\\ &=&\widetilde{ E}_{\theta_{1}}^{\pi}\left[\widetilde{ E}_{\theta_{1}}^{\pi}\left[f(X_{t+1})e^{\gamma r(X_{t},A_{t})}q(Y_{t+1}|X_{t+1})\Big|\sigma(X_{t},A_{t},Y_{t+1})\right]e^{\gamma{\sum}_{i=1}^{t-1}r(X_{i},A_{i})}R_{t}\Big|H_{t} = h_{t},A_{t} = a_{t},Y_{t+1} = y_{t+1}\right]\\ &=&\widetilde{ E}_{\theta_{1}}^{\pi}\left[T_{A_{t},Y_{t+1},\gamma}(f)(X_{t})e^{\gamma{\sum}_{i=1}^{t-1}r(X_{i},A_{i})}R_{t}\Big|H_{t}=h_{t},A_{t}=a_{t},Y_{t+1}=y_{t+1}\right]\\ &=&\widetilde{ E}_{\theta_{1}}^{\pi}\left[T_{a_{t},y_{t+1},\gamma}(f)(X_{t})e^{\gamma{\sum}_{i=1}^{t-1}r(X_{i},A_{i})}R_{t}\Big|H_{t}=h_{t}\right]. \end{array} $$

Therefore, we have

$$ \begin{array}{@{}rcl@{}} \psi^{(\gamma)}_{t+1}(f)=\psi^{(\gamma)}_{t}\left( T_{A_{t},Y_{t+1},\gamma}(f)\right) =T^{*}_{A_{t},Y_{t+1},\gamma}(\psi^{(\gamma)}_{t})(f),\quad\widetilde{P}_{\theta_{1}}^{\pi} a.s., \end{array} $$

where \(T^{*}_{a,y,\gamma }\) is the adjoint operator of Ta,y,γ defined on \({\mathscr{M}}^{+}(S)\) by

$$ \begin{array}{@{}rcl@{}} T^{*}_{a,y,\gamma}(\mu)(E)&:=&{\int}_{S}T_{a,y,\gamma}(\textbf{1}_{E})d\mu\\ &=& {\int}_{S\times S}\textbf{1}_{E}(x^{\prime})e^{\gamma r(x,a)}q(y|x^{\prime})p(dx^{\prime}|x,a)\mu(dx),\quad \mu\!\in\! \mathcal{M}^{+}(S),E\!\in\!\mathcal{B}(S). \end{array} $$

From (C1), we know that \(\psi ^{(\gamma )}_{t}(S)\) is finite and strictly positive. Hence, we can define a new state process \(\{\theta ^{(\gamma )}_{t}\}\) taking values in \(\mathcal {P}(S)\) by

$$ \theta^{(\gamma)}_{t}:=\frac{\psi^{(\gamma)}_{t}}{\psi^{(\gamma)}_{t}(S)}. $$
(4.8)

We call \(\{\theta ^{(\gamma )}_{t}\}\) the information state process since it represents the cumulative-reward-weighted conditional distribution of the real state given the observed history. The information state \(\theta ^{(\gamma )}_{1}\) at time t = 1 is still 𝜃1. Since the operator \(T^{*}_{a,y,\gamma }\) is positively homogeneous of degree 1, we have

$$ \theta^{(\gamma)}_{t+1}=\frac{T^{*}_{A_{t},Y_{t+1},\gamma}(\psi^{(\gamma)}_{t})}{T^{*}_{A_{t},Y_{t+1},\gamma}(\psi^{(\gamma)}_{t})(S)}=\frac{T^{*}_{A_{t},Y_{t+1},\gamma}(\theta^{(\gamma)}_{t})}{T^{*}_{A_{t},Y_{t+1},\gamma}(\theta^{(\gamma)}_{t})(S)}, $$
(4.9)

which specifies the transition mechanism of \(\theta ^{(\gamma )}_{t}\). As for the new reward function, we define

$$ G_{\gamma}(a,y,\theta):=\log\left[T^{*}_{a,y,\gamma}(\theta)(S)\right]. $$
(4.10)

Then we have

$$ \begin{array}{@{}rcl@{}} \psi^{(\gamma)}_{t}(S)&=&T^{*}_{A_{t-1},Y_{t},\gamma}(\psi^{(\gamma)}_{t-1})(S)=T^{*}_{A_{t-1},Y_{t},\gamma}(\theta^{(\gamma)}_{t-1})(S)\cdot\psi^{(\gamma)}_{t-1}(S)\\ &=&e^{G_{\gamma}(A_{t-1},Y_{t},\theta^{(\gamma)}_{t-1})}\cdot\psi^{(\gamma)}_{t-1}(S),\quad t\geq 2 \end{array} $$
(4.11)

It then follows that

$$ \begin{array}{@{}rcl@{}} E_{\theta_{1}}^{\pi}\left[e^{\gamma{\sum}_{i=1}^{t}r(X_{i},A_{i})}\right]&=&\widetilde{ E}_{\theta_{1}}^{\pi}\left[e^{\gamma{\sum}_{i=1}^{t}r(X_{i},A_{i})}R_{t}\right]=\widetilde{ E}_{\theta_{1}}^{\pi}\left[\widetilde{ E}_{\theta_{1}}^{\pi}\left[e^{\gamma{\sum}_{i=1}^{t}r(X_{i},A_{i})}R_{t}\Big|\mathcal{F}_{t}\right]\right]\\ &=&\widetilde{ E}_{\theta_{1}}^{\pi}\left[\psi^{(\gamma)}_{t+1}(S)\right]\\ &=&\widetilde{ E}_{\theta_{1}}^{\pi}\left[e^{{\sum}_{i=1}^{t}G_{\gamma}(A_{i},Y_{i+1},\theta^{(\gamma)}_{i})}\cdot\psi^{(\gamma)}_{1}(S)\right]\\ &=&\widetilde{ E}_{\theta_{1}}^{\pi}\left[e^{{\sum}_{i=1}^{t}G_{\gamma}(A_{i},Y_{i+1},\theta^{(\gamma)}_{i})}\cdot\theta^{(\gamma)}_{1}(S)\right]\\ &=&\widetilde{ E}_{\theta_{1}}^{\pi}\left[e^{{\sum}_{i=1}^{t}G_{\gamma}(A_{i},Y_{i+1},\theta^{(\gamma)}_{i})}\right]. \end{array} $$
(4.12)

Hence, we can consider Gγ as the new one-step reward. Now, we can transform MP into the following completely observable model \(M^{\prime }_{\gamma }\) with state space \(\mathcal {P}(S)\) and action space A:

  1.

    The initial information state is 𝜃1.

  2.

    At time t, given the current information state \(\theta ^{(\gamma )}_{t}\), we take action At according to a pre-specified policy. Then, the system generates Yt+1, which is independent of \(\theta ^{(\gamma )}_{s}, A_{s}\), and Ys for s ≤ t and is distributed according to the law Λ. The next information state \(\theta ^{(\gamma )}_{t+1}\) is determined by

    $$ \theta^{(\gamma)}_{t+1}=\frac{T^{*}_{A_{t},Y_{t+1},\gamma}(\theta^{(\gamma)}_{t})}{T^{*}_{A_{t},Y_{t+1},\gamma}(\theta^{(\gamma)}_{t})(S)}. $$
  3.

    Once Yt+1 is generated, the next information state \(\theta ^{(\gamma )}_{t+1}\) is obtained according to (4.9), and simultaneously the system generates the one-step reward \(G_{\gamma }(A_{t},Y_{t+1},\theta ^{(\gamma )}_{t})\); a schematic finite-space implementation of this transition and reward is sketched below.
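When S, A, and O are finite, the transition mechanism above is straightforward to implement. The following Python sketch is our illustration only and is not part of the model's construction; it assumes that Λ is the counting measure on O and uses placeholder arrays p, q, r for the transition law, the observation density, and the reward.

```python
import numpy as np

# Schematic finite-space version of the information state update (4.9) and the
# one-step reward G_gamma of (4.10).  Conventions (ours, for illustration):
#   theta[x]    : current information state, a probability vector over S
#   p[x, a, x'] : transition law p(x'|x, a)
#   q[y, x']    : observation density q(y|x') with respect to counting measure
#   r[x, a]     : one-step reward r(x, a)

def T_star(theta, a, y, gamma, p, q, r):
    """Unnormalized measure (T*_{a,y,gamma} theta)(x') for all x'."""
    return q[y] * ((theta * np.exp(gamma * r[:, a])) @ p[:, a, :])

def info_state_step(theta, a, y, gamma, p, q, r):
    """One step of the transformed model: next information state and G_gamma."""
    psi = T_star(theta, a, y, gamma, p, q, r)
    mass = psi.sum()                  # T*_{a,y,gamma}(theta)(S), positive under (C1)
    return psi / mass, np.log(mass)   # (theta_{t+1}, G_gamma(a, y, theta_t))

# Minimal usage with random placeholder data (2 states, 2 actions, 2 signals):
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(2), size=(2, 2))   # p[x, a, :] is a probability vector
q = np.full((2, 2), 0.5)                     # uninformative observations
r = rng.random((2, 2))
theta_next, G = info_state_step(np.array([0.7, 0.3]), 0, 1, 0.1, p, q, r)
```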

Remark 4.1

The one-step reward Gγ in (4.10) depends not only on the state 𝜃 and the action a but also on an independent signal Y under \(\widetilde {\text {P}}_{\theta _{1}}^{\pi }\), which is slightly different from the typical form. We make the following changes to obtain a reward and a transition probability in the standard form, for which assumptions (A1) and (A2) can be verified.

Define the completely observable Markov decision model Mγ with transition law pγ and one-step reward rγ by

$$ \begin{array}{@{}rcl@{}} p_{\gamma}(d\theta^{\prime}|\theta,a)&:=&\frac{1}{{\int}_{O}T^{*}_{a,y,\gamma}(\theta)(S){\Lambda}(dy)}\left( {\int}_{O}{\Delta}^{(\gamma)}_{a,y,\theta}(d\theta^{\prime}) T^{*}_{a,y,\gamma}(\theta)(S){\Lambda}(dy)\right)\\ &=&\frac{1}{{\int}_{S}e^{\gamma r(x,a)}\theta(dx)}\left[{\int}_{O}{\Delta}^{(\gamma)}_{a,y,\theta}(d\theta^{\prime}){\Lambda}(dy){\int}_{S\times S}e^{\gamma r(x,a)}q(y|x^{\prime})p(dx^{\prime}|x,a)\theta(dx)\right],\\ r_{\gamma}(\theta,a)&:=&\log\left( {\int}_{O}T^{*}_{a,y,\gamma}(\theta)(S){\Lambda}(dy)\right)=\log\left( {\int}_{S}e^{\gamma r(x,a)}\theta(dx)\right). \end{array} $$
(4.13)

where \({\Delta }^{(\gamma )}_{a,y,\theta }\) is a measure on \(\mathcal {P}(S)\) defined by

$$ {\Delta}^{(\gamma)}_{a,y,\theta}(U):= \begin{cases} \delta\left\{\frac{T^{*}_{a,y,\gamma}(\theta)}{T^{*}_{a,y,\gamma}(\theta)(S)}\right\}(U),\quad&T^{*}_{a,y,\gamma}(\theta)(S)\neq 0\\ 0,\quad&T^{*}_{a,y,\gamma}(\theta)(S)=0 \end{cases},\qquad U\subseteq \mathcal{P}(S). $$
(4.14)
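For a finite observation space (again with Λ the counting measure, so that q(⋅|x′) sums to one over y), the kernel pγ(⋅|𝜃,a) is a discrete distribution supported on at most |O| posterior points. The sketch below continues our illustration above with the same placeholder arrays; it merely reads (4.13) and (4.14) in that finite setting.

```python
import numpy as np

# Finite-space reading of (4.13)-(4.14); same array conventions as in the
# previous sketch.  Under (C1) every mass T*_{a,y,gamma}(theta)(S) is strictly
# positive, so the degenerate branch of (4.14) does not occur here.

def r_gamma(theta, a, gamma, r):
    """One-step reward of M_gamma: log of the integral of e^{gamma r(x,a)} d theta."""
    return np.log(theta @ np.exp(gamma * r[:, a]))

def p_gamma_support(theta, a, gamma, p, q, r):
    """Support points Delta^{(gamma)}_{a,y,theta} and their weights under p_gamma(.|theta,a)."""
    points, masses = [], []
    for y in range(q.shape[0]):
        psi = q[y] * ((theta * np.exp(gamma * r[:, a])) @ p[:, a, :])
        masses.append(psi.sum())        # T*_{a,y,gamma}(theta)(S)
        points.append(psi / psi.sum())  # the normalized posterior for signal y
    w = np.array(masses)
    return points, w / w.sum()          # w.sum() equals exp(r_gamma(theta, a))
```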

Use \( E_{\gamma ,\theta }^{\pi }\) to denote the expectation operator with respect to the transition probability pγ with initial state 𝜃 and policy π. Since Mγ is an MDP, we consider the Markov policies of Mγ, which consist of decision rules that choose actions only through the current information state 𝜃. Such policies are also called separated policies of the original model MP (see, e.g., [19]). We let DS denote the set of all the Markov decision rules of Mγ and \({\Pi }_{S}=(D_{S})^{\infty }\). Then for πS ∈ ΠS, by a direct calculation, we have

$$ \widetilde{ E}_{\theta}^{\pi_{S}}\left[e^{{\sum}_{t=1}^{N}G_{\gamma}(A_{t},Y_{t+1},\theta^{(\gamma)}_{t})}f(\theta^{(\gamma)}_{t+1})\right]= E_{\gamma,\theta}^{\pi_{S}}\left[e^{{\sum}_{t=1}^{N}r_{\gamma}(\theta^{(\gamma)}_{t},A_{t})}f(\theta^{(\gamma)}_{t+1})\right] $$
(4.15)

for any bounded Borel function f on \(\mathcal {P}(S)\). ΠS is a subset of Π since 𝜃t is \(\mathcal {F}_{t}\)-adapted. Hence, from (4.12) and (4.15), we have for πS ∈ ΠS,

$$ \begin{array}{@{}rcl@{}} \liminf_{N\rightarrow\infty}\frac{1}{N}\log E_{\theta}^{\pi_{S}}\left[e^{{\sum}_{t=1}^{N}\gamma r(X_{t},A_{t})}\right] &=&\liminf_{N\rightarrow\infty}\frac{1}{N}\log\widetilde{ E}_{\theta_{1}}^{\pi_{S}}\left[e^{{\sum}_{i=1}^{N}G_{\gamma}(A_{i},Y_{i+1},\theta^{(\gamma)}_{i})}\right]\\ &=&\liminf_{N\rightarrow\infty}\frac{1}{N}\log E_{\gamma,\theta}^{\pi_{S}}\left[e^{{\sum}_{t=1}^{N}r_{\gamma}(\theta^{(\gamma)}_{t},A_{t})}\right]. \end{array} $$

We define λS as the optimal value over separated policies, that is,

$$ \begin{array}{@{}rcl@{}} \lambda_{S}(\theta, \gamma)&:=&\sup_{\pi_{S}\in{\Pi}_{S}}\frac{1}{\gamma}\liminf_{N\rightarrow\infty}\frac{1}{N}\log E_{\theta}^{\pi_{S}}\left[e^{{\sum}_{t=1}^{N}\gamma r(X_{t},A_{t})}\right]\\ &=&\sup_{\pi_{S}\in{\Pi}_{S}}\frac{1}{\gamma}\liminf_{N\rightarrow\infty}\frac{1}{N}\log E_{\gamma,\theta}^{\pi_{S}}\left[e^{{\sum}_{t=1}^{N}r_{\gamma}(\theta^{(\gamma)}_{t},A_{t})}\right]. \end{array} $$
(4.16)

We will show that, under (C0), (C1), (C2), and (C3), λS(γ) = λP(γ) and (4.3) hold if there exists K > 0 such that, for every γ ∈ (0,K), the Bellman equation

$$ \rho_{\gamma} f_{\gamma}(\theta)=\sup_{a\in A}{\int}_{\mathcal{P}(S)}f_{\gamma}(\theta^{\prime})e^{r_{\gamma}(\theta,a)}p_{\gamma}(d\theta^{\prime}|\theta,a) $$
(4.17)

has a solution ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0. We first verify that rγ satisfies (A1) and pγ satisfies (A2), which implies that the corresponding operator \(L^{(\gamma )}_{P}\) on \(C(\mathcal {P}(S))\), defined by

$$ L^{(\gamma)}_{P}f(\theta):=\sup_{a\in A}{\int}_{\mathcal{P}(S)}f(\theta^{\prime})e^{r_{\gamma}(\theta,a)}p_{\gamma}(d\theta^{\prime}|\theta,a), $$
(4.18)

maps \(C(\mathcal {P}(S))\) into itself.

Lemma 4.1

Assume (C1). Then rγ(𝜃,a) is continuous in (𝜃,a).

Proof

For 𝜃n → 𝜃 weakly and an → a, we have

$$ \begin{array}{@{}rcl@{}} \left|e^{r_{\gamma}(\theta_{n},a_{n})}-e^{r_{\gamma}(\theta,a)}\right|&\leq&{\int}_{S}\left|e^{\gamma r(x,a_{n})}-e^{\gamma r(x,a)}\right|\theta_{n}(dx)\\ &&+\left|{\int}_{S}e^{\gamma r(x,a)}\theta_{n}(dx)-{\int}_{S}e^{\gamma r(x,a)}\theta(dx)\right| \end{array} $$

The second term tends to 0 due to the weak convergence of 𝜃n, while the first term tends to 0 because of the uniform continuity of r(⋅,⋅) on the compact set S × A. Hence, \(e^{r_{\gamma }(\theta ,a)}\) is continuous in (𝜃,a). Since \(e^{r_{\gamma }}\geq e^{\gamma r_{m}}>0\), we see that rγ(𝜃,a) is continuous in (𝜃,a). □

Lemma 4.2

Assume (C0), (C1), (C2), and (C3). Then \((\theta ,a)\mapsto {\int \limits }_{\mathcal {P}(S)}p_{\gamma }(d\theta ^{\prime }|\theta ,a)f(\theta ^{\prime })\) is continuous in (𝜃,a) for \(f\in C(\mathcal {P}(S))\).

Proof

Recall that rM and rm are the supremum and infimum of r, respectively. Fix \(f\in C(\mathcal {P}(S))\). From (4.13) and a direct calculation, we see that

$$ {\int}_{\mathcal{P}(S)}p_{\gamma}(d\theta^{\prime}|\theta,a)f(\theta^{\prime})=e^{-r_{\gamma}(\theta,a)}{\int}_{O}T^{*}_{a,y,\gamma}(\theta)(S)f\left( \frac{T^{*}_{a,y,\gamma}(\theta)}{T^{*}_{a,y,\gamma}(\theta)(S)}\right){\Lambda}(dy), $$

where \(e^{-r_{\gamma }(\theta ,a)}\) is continuous in (𝜃,a) by Lemma 4.1. It suffices to show that

$$ {\int}_{O}T^{*}_{a,y,\gamma}(\theta)(S)f\left( \frac{T^{*}_{a,y,\gamma}(\theta)}{T^{*}_{a,y,\gamma}(\theta)(S)}\right){\Lambda}(dy) $$
(4.19)

is continuous in (𝜃,a). Provided that \(\left \{(\theta ,a)\mapsto T^{*}_{a,y,\gamma }(\theta ),y\in O\right \}\) is equicontinuous, from the uniform continuity of \(f\in C(\mathcal {P}(S))\) (\(\mathcal {P}(S)\) is compact since S is compact) and the fact that

$$ T^{*}_{a,y,\gamma}(\theta)(S)={\int}_{S}\theta(dx)e^{\gamma r(x,a)}{\int}_{S}q(y|x^{\prime})p(dx^{\prime}|x,a)\geq q_{m}e^{\gamma r_{m}}>0 $$

uniformly, we can see that

$$ \left\{(\theta,a)\mapsto T^{*}_{a,y,\gamma}(\theta)(S)f\left( \frac{T^{*}_{a,y,\gamma}(\theta)}{T^{*}_{a,y,\gamma}(\theta)(S)}\right),y\in O\right\} $$

is equicontinuous, which proves this lemma. Now we prove that \(\left \{(\theta ,a)\mapsto T^{*}_{a,y,\gamma }(\theta ),y\in O\right \}\) is equicontinuous. Fix \(\theta \in \mathcal {P}(S)\) and a ∈ A. For 𝜃n → 𝜃 weakly and an → a, we have

$$ \begin{array}{@{}rcl@{}} \sup_{y\in O}\left\|T^{*}_{a_{n},y,\gamma}(\theta_{n})-T^{*}_{a,y,\gamma}(\theta)\right\|_{KR}&\leq&\sup_{y\in O}\left\|T^{*}_{a_{n},y,\gamma}(\theta_{n})-T^{*}_{a,y,\gamma}(\theta_{n})\right\|_{KR}\\ &&+\sup_{y\in O}\left\|T^{*}_{a,y,\gamma}(\theta_{n})-T^{*}_{a,y,\gamma}(\theta)\right\|_{KR}. \end{array} $$

The first term \(\left \|T^{*}_{a_{n},y,\gamma }(\theta _{n})-T^{*}_{a,y,\gamma }(\theta _{n})\right \|_{KR}\) is

$$ \begin{array}{@{}rcl@{}} &&{\underset{\|g\|_{L}\leq 1}{\underset{g\in\text{Lip}(S)}{\sup}}} {\int}_{S}\theta_{n}(dx)\left( e^{\gamma r(x,a_{n})}{\int}_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a_{n})-e^{\gamma r(x,a)}{\int}_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a)\right)\\ &&\quad\leq {\underset{\|g\|_{L}\leq 1} {\underset{g\in\text{Lip}(S)}{\sup}}}{\int}_{S}\theta_{n}(dx)e^{\gamma r(x,a_{n})}\left( {\int}_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a_{n})-{\int}_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a)\right)\\ &&\qquad+{\underset{\|g\|_{L}\leq 1}{\underset{g\in\text{Lip}(S)}{\sup}}}{\int}_{S}\theta_{n}(dx)\left( e^{\gamma r(x,a_{n})}-e^{\gamma r(x,a)}\right)\left( {\int}_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a)\right) \end{array} $$

Notice that ∥g(⋅)∥L ≤ 1 and ∥q(y|⋅)∥L ≤ qM imply ∥g(⋅)q(y|⋅)∥L ≤ 2qM. Thus, by (C1), (C2), and (C3), we have, for every y ∈ O, that

$$ \begin{array}{@{}rcl@{}} & & {\int}_{S}\theta_{n}(dx)e^{\gamma r(x,a_{n})}\left( {\int}_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a_{n})-{\int}_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a)\right)\\ &\leq & 2q_{M}\cdot{\int}_{S}\theta_{n}(dx)e^{\gamma r(x,a_{n})}\|p(\cdot|x,a_{n})-p(\cdot|x,a)\|_{KR}\leq 2q_{M}e^{\gamma r_{M}}K_{p}\cdot\rho_{A}(a_{n},a) \end{array} $$

and

$$ {\int}_{S}\theta_{n}(dx)\left( e^{\gamma r(x,a_{n})}-e^{\gamma r(x,a)}\right)\left( {\int}_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a)\right)\leq q_{M}\cdot{\int}_{S}\left|e^{\gamma r(x,a_{n})}-e^{\gamma r(x,a)}\right|\theta_{n}(dx). $$

Hence, from the uniform continuity of r(⋅,⋅), we know that

$$ \lim\limits_{n\to\infty}\sup_{y\in O}\left\|T^{*}_{a_{n},y,\gamma}(\theta_{n})-T^{*}_{a,y,\gamma}(\theta_{n})\right\|_{KR}=0. $$
(4.20)

The second term \(\left \|T^{*}_{a,y,\gamma }(\theta _{n})-T^{*}_{a,y,\gamma }(\theta )\right \|_{KR}\) is

$$ {\underset{\|g\|_{L}\leq 1}{\underset{g\in\text{Lip}(S)}{\sup}}}{\int}_{S}\theta_{n}(dx)e^{\gamma r(x,a)}{\int}_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a)-{\int}_{S}\theta(dx)e^{\gamma r(x,a)}{\int}_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a). $$

Define two finite measures \({\mu ^{a}_{n}}(dx):=\theta _{n}(dx)e^{\gamma r(x,a)}\) and μa(dx) := 𝜃(dx)eγr(x,a). Then from the continuity of r(⋅,a), we know that \({\mu ^{a}_{n}}\to \mu ^{a}\) weakly, which means that \(\lim \limits _{n\to \infty }\|{\mu ^{a}_{n}}-\mu ^{a}\|_{KR}=0\). Thus, we only need to verify that, as a function of x, if ∥g∥L ≤ 1, then the norm \(\left \|{\int \limits }_{S}q(y|x^{\prime })p(dx^{\prime }|x, a)g(x^{\prime })\right \|_{L}\) is bounded by a constant which is independent of y. First, it is obvious that \(\left |{\int \limits }_{S}q(y|x^{\prime })p(dx^{\prime }|x,a)g(x^{\prime })\right |\leq q_{M}\). As for the Lipschitz constant, we have for x1,x2 ∈ S,

$$ \begin{array}{@{}rcl@{}} & & \left|\int\limits_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x_{1},a)-\int\limits_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x_{2},a)\right|\\ &\leq& \|g(\cdot)q(y|\cdot)\|_{L}\cdot \|p(\cdot|x_{1},a)-p(\cdot|x_{2},a)\|_{KR}\leq 2q_{M}K_{p}\cdot\rho(x_{1},x_{2}). \end{array} $$

Consequently,

$$ \lim\limits_{n\to\infty}\sup_{y\in O}\left\|T^{*}_{a,y,\gamma}(\theta_{n})-T^{*}_{a,y,\gamma}(\theta)\right\|_{KR}\leq\lim\limits_{n\to\infty}\sup_{y\in O}\max\{q_{M},2q_{M}K_{p}\}\cdot\|{\mu^{a}_{n}}-\mu^{a}\|_{KR}=0. $$
(4.21)

(4.20) and (4.21) show that \(\left \{(\theta ,a)\mapsto T^{*}_{a,y,\gamma }(\theta ),y\in O\right \}\) is equicontinuous and the lemma is proved. □

Since rγ satisfies (A1) and pγ satisfies (A2), we can apply Proposition 3.3 to obtain the variational formula for λS.

Theorem 4.3

Assume (C0), (C1), (C2), and (C3). If there exist ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0 satisfying \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\), then

$$ \lambda_{S}(\gamma)=\sup_{\beta\in\mathcal{I}_{P}}\left\{{\int}_{\mathcal{P}(S)\times A}\left[\frac{1}{\gamma}r_{\gamma}(\theta,a)-\frac{1}{\gamma}D(\beta_{2}(\cdot|\theta,a)\Vert p_{\gamma}(\cdot|\theta,a))\right]\beta^{\prime}(d\theta,da)\right\} $$
(4.22)

holds, where

$$ \mathcal{I}_{P}:=\{\beta\in\mathcal{P}(\mathcal{P}(S)\times A\times \mathcal{P}(S)):\beta(\mathcal{P}(S),A,d\theta)=\beta(d\theta,A,\mathcal{P}(S))\}, $$
(4.23)

and the notations β2 and \(\beta ^{\prime }\) are defined by (3.3).

Proof

Lemmas 4.1 and 4.2 imply that Mγ satisfies (A1) and (A2). Since \(\mathcal {P}(S)\) is compact, by Proposition 3.3, we have

$$ \gamma\lambda_{S}(\gamma)=\sup_{\beta\in\mathcal{I}_{P}}\left\{{\int}_{\mathcal{P}(S)\times A}\left[r_{\gamma}(\theta,a)-D(\beta_{2}(\cdot|\theta,a)\Vert p_{\gamma}(\cdot|\theta,a))\right]\beta^{\prime}(d\theta,da)\right\}. $$

Hence, (4.22) holds. □

By Hölder’s inequality, for \(\gamma \geq \gamma ^{\prime }>0\), we have \(\lambda _{P}(\gamma )\geq \lambda _{P}(\gamma ^{\prime })\geq v_{P}\). Therefore, λP(γ) is non-decreasing in γ and \(\lim \limits _{\gamma \to 0}\lambda _{P}(\gamma )\geq v_{P}\). To get the desired assertion, we need that λS(γ) = λP(γ).

Theorem 4.4

Assume (C0), (C1), (C2), and (C3). If there exist ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0 satisfying that \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\), then λS(γ) = λP(γ).

Proof

We first show that it is true in the finite-horizon case. Fix N > 0 and an observed-history-dependent policy π = (d1,...,dN,...). By (4.11), we have that

$$ \psi^{(\gamma)}_{t+1}(S)=e^{G_{\gamma}(A_{t},Y_{t+1},\theta^{(\gamma)}_{t})}\cdot\psi^{(\gamma)}_{t}(S),\quad t\geq 1. $$

Hence

$$ \begin{array}{@{}rcl@{}} E_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]&=&\tilde{ E}_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}R_{N+1}\right]=\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N+1}(S)\right]\\ & =&\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma}(A_{N},Y_{N+1},\theta^{(\gamma)}_{N})}\right]. \end{array} $$

For every \(\psi \in {\mathscr{M}}^{+}(S)\) and ε > 0, there exists \(d_{N}^{*}(\psi )\in A\) such that

$$ \begin{array}{@{}rcl@{}} &&\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma}(d_{N}^{*}(\psi),Y_{N+1},\theta^{(\gamma)}_{N})}\Big|\psi^{(\gamma)}_{N}=\psi\right]\\ &&\quad\geq\sup_{a\in A}\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma}(a,Y_{N+1},\theta^{(\gamma)}_{N})}\Big|\psi^{(\gamma)}_{N}=\psi\right]-\frac{\varepsilon}{N}e^{(N-1)(r_{m}-r_{M})}. \end{array} $$

In fact, \(d_{N}^{*}\) can be chosen to depend only on \(\theta =\frac {\psi }{\psi (S)}\). Indeed, given \(\psi ^{\prime }\neq \psi \) with \(\frac {\psi ^{\prime }}{\psi ^{\prime }(S)}=\theta =\frac {\psi }{\psi (S)}\), we may assume that

$$e^{(N-1)\gamma r_{m}}\leq\psi^{\prime}(S)\quad \text{and}\quad \psi(S)\leq e^{(N-1)\gamma r_{M}}$$

since \(e^{(N-1)\gamma r_{m}}\leq \psi ^{(\gamma )}_{N}(S)\leq e^{(N-1)\gamma r_{M}}\). Thus, for any a ∈ A, we have

$$ \begin{array}{@{}rcl@{}} &&\tilde{ E}_{\theta}^{\pi} \left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma}(a,Y_{N+1},\theta^{(\gamma)}_{N})}\Big|\psi^{(\gamma)}_{N}=\psi^{\prime}\right] =\psi^{\prime}(S){\int}_{O}e^{G_{\gamma}(a,y,\theta)}{\Lambda}(dy)\\ &&\quad=\frac{\psi^{\prime}(S)}{\psi(S)}\!\cdot\!\psi(S){\int}_{O}e^{G_{\gamma}(a,y,\theta)}{\Lambda}(dy) = \frac{\psi^{\prime}(S)}{\psi(S)}\!\cdot\!\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma}(a,Y_{N+1},\theta^{(\gamma)}_{N})}\Big| \psi^{(\gamma)}_{N}\! = \psi\right]. \end{array} $$

Hence,

$$ \begin{array}{@{}rcl@{}} & &\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma}(d_{N}^{*}(\psi),Y_{N+1},\theta^{(\gamma)}_{N})}\Big|\psi^{(\gamma)}_{N}=\psi^{\prime}\right]\\ & \geq & \frac{\psi^{\prime}(S)}{\psi(S)}\cdot\sup_{a\in A}\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma}(a,Y_{N+1},\theta^{(\gamma)}_{N})}\Big|\psi^{(\gamma)}_{N}=\psi\right]-\frac{\psi^{\prime}(S)}{\psi(S)}\cdot\frac{\varepsilon}{N}e^{(N-1)(r_{m}-r_{M})}\\ &=& \sup_{a\in A}\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma}(a,Y_{N+1},\theta^{(\gamma)}_{N})}\Big|\psi^{(\gamma)}_{N}=\psi^{\prime}\right]-\frac{\psi^{\prime}(S)}{\psi(S)}\cdot\frac{\varepsilon}{N}e^{(N-1)(r_{m}-r_{M})}\\ &\geq& \sup_{a\in A}\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma}(a,Y_{N+1},\theta^{(\gamma)}_{N})}\Big|\psi^{(\gamma)}_{N}=\psi^{\prime}\right]-\frac{\varepsilon}{N}. \end{array} $$

Thus, we see that for every \(\psi \in {\mathscr{M}}^{+}(S)\), there exists \(d_{N}^{*}(\theta )\in A\) depending only on \(\theta =\frac {\psi }{\psi (S)}\) such that

$$ \begin{array}{@{}rcl@{}} &&\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma} \left( d_{N}^{*}\left( \theta^{(\gamma)}_{N}\right),Y_{N+1},\theta^{(\gamma)}_{N}\right)}\Big|\psi^{(\gamma)}_{N}=\psi\right]\\ &&\quad\geq\sup_{a\in A}\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma}\left( a,Y_{N+1},\theta^{(\gamma)}_{N}\right)}\Big|\psi^{(\gamma)}_{N}=\psi\right]-\frac{\varepsilon}{N}. \end{array} $$

Now, modifying the policy π by simply replacing dN with \(d^{*}_{N}\) to get a new policy \(\pi ^{*}_{N}=(d_{1},...,d_{N-1},d^{*}_{N},...)\), we obtain that

$$ \tilde{ E}_{\theta}^{\pi^{*}_{N}}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma} \left( A_{N},Y_{N+1},\theta^{(\gamma)}_{N}\right)}\right]\geq\tilde{ E}_{\theta}^{\pi}\left[\psi^{(\gamma)}_{N}(S)e^{G_{\gamma}\left( A_{N},Y_{N+1},\theta^{(\gamma)}_{N}\right)}\right]-\frac{\varepsilon}{N}. $$

Continue this procedure by successively replacing dj with \(d^{*}_{j}\) for j = N − 1,⋯ ,1, with each \(d^{*}_{j}\) depending only on the information state and satisfying that

$$ \tilde{ E}_{\theta}^{\pi^{*}_{n-1}}\left[\psi^{(\gamma)}_{n-1}(S)e^{{\sum}_{j=n-1}^{N}G_{\gamma} \left( A_{j},Y_{j+1},\theta^{(\gamma)}_{j}\right)}\right]\geq\tilde{ E}_{\theta}^{\pi^{*}_{n}}\left[\psi^{(\gamma)}_{n-1}(S)e^{{\sum}_{j=n-1}^{N}G_{\gamma}\left( A_{j},Y_{j+1},\theta^{(\gamma)}_{j}\right)}\right]-\frac{\varepsilon}{N}. $$

where \(\pi ^{*}_{n}=(d_{1},...,d_{n-1},d^{*}_{n},...,d^{*}_{N},...)\). In this way, with \(\pi _{1}^{*}=(d^{*}_{1},d^{*}_{2},...d^{*}_{N},...)\), we obtain that

$$ \tilde{ E}_{\theta}^{\pi^{*}_{1}}\left[e^{{\sum}_{t=1}^{N}G_{\gamma} \left( A_{t},Y_{t+1},\theta^{(\gamma)}_{t}\right)}\right]\geq\tilde{ E}_{\theta}^{\pi}\left[e^{{\sum}_{t=1}^{N}G_{\gamma}\left( A_{t},Y_{t+1},\theta^{(\gamma)}_{t}\right)}\right]-\varepsilon. $$

Noticing that the decision rules after time N are irrelevant and recalling (4.12), we have proved that

$$ \sup_{\pi\in{\Pi}_{S}} E_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]\geq\sup_{\pi\in{\Pi}} E_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]-\varepsilon. $$

for any ε > 0. Obviously,

$$ \sup_{\pi\in{\Pi}_{S}} E_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]\leq\sup_{\pi\in{\Pi}} E_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]. $$

Consequently,

$$ \sup_{\pi\in{\Pi}_{S}} E_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]=\sup_{\pi\in{\Pi}} E_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]. $$

Letting \(N\to \infty \), we obtain that

$$ \begin{array}{@{}rcl@{}} \lambda_{S}(\gamma)\leq\lambda_{P}(\gamma)&\leq&\sup_{\theta\in\mathcal{P}(S)}\frac{1}{\gamma}\liminf_{N\rightarrow\infty}\sup_{\pi\in{\Pi}}\frac{1}{N}\log E_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right] \end{array} $$
(4.24)
$$ \begin{array}{@{}rcl@{}} &=&\sup_{\theta\in\mathcal{P}(S)}\frac{1}{\gamma}\liminf_{N\rightarrow\infty}\sup_{\pi\in{\Pi}_{S}}\frac{1}{N}\log E_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]. \end{array} $$
(4.25)

Thus, it suffices to verify that

$$ \lambda_{S}(\gamma)\geq\sup_{\theta\in\mathcal{P}(S)}\frac{1}{\gamma}\liminf_{N\rightarrow\infty}\sup_{\pi\in{\Pi}_{S}}\frac{1}{N}\log E_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]. $$
(4.26)

Recall that by assumption, we have \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\) with ργ > 0 and fγ >> 0. From Lemmas 4.1 and 4.2, we know that

$$ {\int}_{\mathcal{P}(S)}f_{\gamma}(\theta^{\prime})e^{r_{\gamma}(\theta,a)}p_{\gamma}(d\theta^{\prime}|\theta,a) $$

is continuous in a. Due to the compactness of A, for every \(\theta \in \mathcal {P}(S)\), there exists \(d^{*}(\theta )\in A\) such that

$$ \rho_{\gamma} f_{\gamma}(\theta)=L^{(\gamma)}_{P}f_{\gamma}(\theta)={\int}_{\mathcal{P}(S)}f_{\gamma}(\theta^{\prime})e^{r_{\gamma}(\theta,d^{*}(\theta))}p_{\gamma}(d\theta^{\prime}|\theta,d^{*}(\theta)). $$

Let \(\pi ^{*}=(d^{*})^{\infty }\). Similarly to the argument used to derive (2.17) in the proof of Theorem 2.2, we see that

$$ \lim_{N\rightarrow\infty}\sup_{\pi\in{\Pi}_{S}}\frac{1}{N}\log E_{\theta}^{\pi}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right] = \log(\rho_{\gamma}) = \!\lim_{N\rightarrow\infty}\frac{1}{N}\log E_{\theta}^{\pi^{*}}\left[e^{\gamma{\sum}_{t=1}^{N}r(X_{t},A_{t})}\right]\!\!\leq\!\!\gamma\lambda_{S}(\theta,\gamma). $$

Hence, the inequalities in (4.24) are equalities, which gives that λS(γ) = λP(γ). □

With Theorems 4.3 and 4.4, using a similar argument as in the proof of Theorem 3.1, we can now extend the risk-sensitive asymptotics to POMDPs, which is the main result of this section.

Theorem 4.5

Assume (C0), (C1), (C2), and (C3). If there exists K > 0 such that for every γ ∈ (0,K), there exist ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0 satisfying \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\), then

$$ \lim_{\gamma\to0}\lambda_{P}(\gamma)=v_{P}. $$
(4.27)

Before proving Theorem 4.5, we present a lemma to show that pγ converges to p0 weakly and uniformly, where p0 is defined in (4.4).

Lemma 4.6

Assume (C0), (C1), (C2), and (C3). Then pγ(⋅|𝜃,a) weakly converges to p0(⋅|𝜃,a), uniformly in \(\theta \in \mathcal {P}(S)\) and aA, i.e., if \(f\in C(\mathcal {P}(S)\times A\times \mathcal {P}(S))\), then

$$ \underset{\gamma\to 0}{\lim} \underset{a\in A}{\underset{\theta\in\mathcal{P}(S)}{\sup}} \left|{\int}_{\mathcal{P}(S)}f(\theta,a,\theta^{\prime})p_{\gamma}(d\theta^{\prime}|\theta,a)-{\int}_{\mathcal{P}(S)}f(\theta,a,\theta^{\prime})p_0(d\theta^{\prime}|\theta,a)\right|=0. $$
(4.28)

Proof

Fix an \(f\in C(\mathcal {P}(S)\times A\times \mathcal {P}(S))\). Then

$$ \begin{array}{@{}rcl@{}} &&\left|{\int}_{\mathcal{P}(S)}f(\theta,a,\theta^{\prime})p_{\gamma}(d\theta^{\prime}|\theta,a)-{\int}_{\mathcal{P}(S)}f(\theta,a,\theta^{\prime})p_{0}(d\theta^{\prime}|\theta,a)\right|\\ &&\quad=\left|e^{-r_{\gamma}(\theta,a)}{\int}_{O}T^{*}_{a,y,\gamma}(\theta)(S)f\left( \theta,a,\frac{T^{*}_{a,y,\gamma}(\theta)}{T^{*}_{a,y,\gamma}(\theta)(S)}\right){\Lambda}(dy)-{\int}_{O}T^{*}_{a,y,0}(\theta)(S)f\left( \theta,a,\frac{T^{*}_{a,y,0}(\theta)}{T^{*}_{a,y,0}(\theta)(S)}\right){\Lambda}(dy)\right|. \end{array} $$

Since

$$|e^{r_{\gamma}(\theta,a)}-1|\leq{\int}_{S}|e^{\gamma r(x,a)}-1|\theta(dx)\leq\max\{|e^{\gamma r_{m}}-1|,\ |e^{\gamma r_{M}}-1|\},$$

it follows that \(e^{r_{\gamma }(\theta ,a)}\) converges uniformly to 1 as γ → 0. Then, similarly to the proof of Lemma 4.2, it suffices to verify that \(T^{*}_{a,y,\gamma }(\theta )\) converges weakly to \(T^{*}_{a,y,0}(\theta )\), uniformly in a ∈ A, y ∈ O, and \(\theta \in \mathcal {P}(S)\). In fact, recalling the definition of the Kantorovich-Rubinstein norm, we see that

$$ \begin{array}{@{}rcl@{}} \left\|T^{*}_{a,y,\gamma}(\theta)-T^{*}_{a,y,0}(\theta)\right\|_{KR}&=&{\underset{\|g\|_{L}\leq 1}{\underset{g\in\text{Lip}(S)}{\sup}}}{\int}_{S}\theta(dx)\left( e^{\gamma r(x,a)}-1\right){\int}_{S}g(x^{\prime})q(y|x^{\prime})p(dx^{\prime}|x,a)\\ &\leq& {\underset{\|g\|_{L}\leq 1}{\underset{g\in\text{Lip}(S)}{\sup}}}{\int}_{S}\theta(dx)\left|e^{\gamma r(x,a)}-1\right|{\int}_{S}q_{M}p(dx^{\prime}|x,a)\\ &\leq&\max\{|e^{\gamma r_{m}}-1|,|e^{\gamma r_{M}}-1|\}\cdot q_{M}. \end{array} $$

Consequently,

$$ \begin{array}{@{}rcl@{}} \lim\limits_{\gamma\to 0}{\underset{a\in A,y\in O}{\underset{\theta\in\mathcal{P}(S)}{\sup}}}\left\|T^{*}_{a,y,\gamma}(\theta)-T^{*}_{a,y,0}(\theta)\right\|_{KR}=0 \end{array} $$

and thus (4.28) follows. □

Proof of Theorem 4.5

We already know that \(\lim \limits _{\gamma \to 0}\lambda _{P}(\gamma )\geq v_{P}\). Hence, by Theorem 4.4, it suffices to verify that

$$ \begin{array}{@{}rcl@{}} \lim\limits_{\gamma\to 0}\lambda_{S}(\gamma)\leq v_{P}. \end{array} $$

From Theorem 4.3, we know that for any ε > 0 and γ > 0, there exists \(\beta _{\gamma }^{\varepsilon }\in \mathcal {I}_{P}\) such that

$$ \begin{array}{@{}rcl@{}} \lambda_{S}(\gamma)-\varepsilon\leq{\int}_{\mathcal{P}(S)\times A}\left[\frac{1}{\gamma}r_{\gamma}(\theta,a)-\frac{1}{\gamma}D((\beta_{\gamma}^{\varepsilon})_{2}(\cdot|\theta,a)\Vert p_{\gamma}(\cdot|\theta,a))\right](\beta_{\gamma}^{\varepsilon})'(d\theta,da). \end{array} $$

Recall that \(\mathcal {I}_{P}\subseteq \mathcal {P}(\mathcal {P}(S)\times A\times \mathcal {P}(S))\) is compact. We can find a sequence \(\{\gamma _{n}\}_{n\in \mathbb {N}}\) monotonically tending to 0 and a \(\beta ^{\varepsilon }\in \mathcal {I}_{P}\) such that

$$ \lim\limits_{\gamma\to 0}\lambda_{P}(\gamma)=\lim\limits_{n\to\infty}\lambda_{P}(\gamma_{n})\ \ \text{and}\ \lim\limits_{n\to\infty}\beta_{\gamma_{n}}^{\varepsilon}=\beta^{\varepsilon} $$

weakly. Since the relative entropy is non-negative, we obtain that

$$ \begin{array}{@{}rcl@{}} \lambda_{S}(\gamma_{n})-\varepsilon\leq{\int}_{\mathcal{P}(S)\times A}\frac{1}{\gamma_{n}}r_{\gamma_{n}}(\theta,a)(\beta_{\gamma_{n}}^{\varepsilon})'(d\theta,da). \end{array} $$

The monotonicity of \(\gamma \mapsto \frac {1}{\gamma }r_{\gamma }(\theta ,a)\) in γ follows from Hölder’s inequality. Thus, we have as γn → 0 that

$$ \begin{array}{@{}rcl@{}} \frac{1}{\gamma_{n}}r_{\gamma_{n}}(\theta,a)=\frac{1}{\gamma_{n}}\log\left( {\int}_{S}\theta(dx)e^{\gamma_{n} r(x,a)}\right)\downarrow{\int}_{S}\theta(dx)r(x,a)=r_{0}(\theta,a), \end{array} $$

where r0 is defined in (4.4). Hence, by Dini’s theorem, \(\gamma _{n}^{-1}r_{\gamma _{n}}\) converges to r0 uniformly. From the weak convergence of \(\beta _{\gamma _{n}}^{\varepsilon }\), we then obtain that

$$ \begin{array}{@{}rcl@{}} \lim_{n\rightarrow\infty}{\int}_{\mathcal{P}(S)\times A}\frac{1}{\gamma_{n}}r_{\gamma_{n}}(\theta,a)(\beta_{\gamma_{n}}^{\varepsilon})'(d\theta,da)={\int}_{\mathcal{P}(S)\times A}r_{0}(\theta,a)(\beta^{\varepsilon})'(d\theta,da). \end{array} $$

Now, we claim that \(\left (\beta ^{\varepsilon }\right )_{2}=p_{0}\), where p0 is defined in (4.4). Notice that

$$ \begin{array}{@{}rcl@{}} \lim_{n\to\infty}\lambda_{S}(\gamma_{n})-\varepsilon&\leq&{\int}_{\mathcal{P}(S)\times A}r_{0}(\theta,a)(\beta^{\varepsilon})'(d\theta,da)\\ &&-\liminf_{n\rightarrow\infty}\frac{1}{\gamma_{n}}{\int}_{\mathcal{P}(S)\times A}\left( \beta_{\gamma_{n}}^{\varepsilon}\right)'(d\theta,da)D\left( \left( \beta_{\gamma_{n}}^{\varepsilon}\right)_{2}(\cdot|\theta,a)\Vert p_{\gamma_{n}}(\cdot|\theta,a)\right)\\ &=&{\int}_{\mathcal{P}(S)\times A}r_{0}(\theta,a)(\beta^{\varepsilon})'(d\theta,da)\\ &&-\liminf_{n\rightarrow\infty}\frac{1}{\gamma_{n}}D\left( \beta_{\gamma_{n}}^{\varepsilon}(d\theta,da,d\theta^{\prime})\Big\Vert\left( \beta_{\gamma_{n}}^{\varepsilon}\right)'(d\theta,da)p_{\gamma_{n}}(d\theta^{\prime}|\theta,a)\right)\\ &\leq& r_{M} - \liminf_{n\rightarrow\infty}\frac{1}{\gamma_{n}}D\left( \beta_{\gamma_{n}}^{\varepsilon}(d\theta,da,d\theta^{\prime})\Big\Vert\left( \beta_{\gamma_{n}}^{\varepsilon}\right)'(d\theta,da)p_{\gamma_{n}}(d\theta^{\prime}|\theta,a)\right). \end{array} $$

From the lower semicontinuity of D(⋅∥⋅) and Lemma 4.6, we deduce that

$$ \begin{array}{@{}rcl@{}} &-&\liminf_{n\rightarrow\infty}D\left( \beta_{\gamma_{n}}^{\varepsilon}(d\theta,da,d\theta^{\prime})\Big\Vert\left( \beta_{\gamma_{n}}^{\varepsilon}\right)'(d\theta,da)p_{\gamma_{n}}(d\theta^{\prime}|\theta,a)\right)\leq\\ &-&D\left( \beta^{\varepsilon}(d\theta,da,d\theta^{\prime})\Big\Vert\left( \beta^{\varepsilon}\right)'(d\theta,da)p_{0}(d\theta^{\prime}|\theta,a)\right). \end{array} $$

Thus, if \(\left (\beta ^{\varepsilon }\right )_{2}\neq p_{0}\), then

$$D\left( \beta^{\varepsilon}(d\theta,da,d\theta^{\prime})\Big\Vert\left( \beta^{\varepsilon}\right)'(d\theta,da)p_{0}(d\theta^{\prime}|\theta,a)\right)>0,$$

and we would have

$$ \begin{array}{@{}rcl@{}} r_{m}-\varepsilon\leq v_{P}-\varepsilon\leq\lim_{n\to\infty}\lambda_{S}(\gamma_{n})-\varepsilon=-\infty. \end{array} $$

This is impossible, so \(\beta ^{\varepsilon }\in \mathcal {I}_{P}\) and \(\left (\beta ^{\varepsilon }\right )_{2}=p_{0}\). Now we can employ the same argument as used in the proof of Lemma 3.2 to derive that

$$ \begin{array}{@{}rcl@{}} \lim_{n\to\infty}\lambda_{S}(\gamma_{n})-\varepsilon\leq{\int}_{\mathcal{P}(S)\times A}r_{0}(\theta,a)(\beta^{\varepsilon})'(d\theta,da)\leq v_{P}. \end{array} $$

Then (4.27) follows by letting ε → 0. □

Remark 4.2

From the proof of Theorem 4.5, we can see that the existence of a solution to the risk-sensitive Bellman equation guarantees the existence of an invariant probability measure for p0.

We end this section with a simple example.

Example 4.7

Consider a finite POMDP with S = {x1,x2},A = {a1,a2},O = {y1,y2}. The transition probability p, observation probability q, and reward r are described by

$$ \begin{array}{@{}rcl@{}} \begin{bmatrix} p(x_{1}|x_{1},a_{1}) & p(x_{2}|x_{1},a_{1})\\ p(x_{1}|x_{2},a_{1}) & p(x_{2}|x_{2},a_{1}) \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2\\ 1/2 & 1/2 \end{bmatrix}&, \begin{bmatrix} p(x_{1}|x_{1},a_{2}) & p(x_{2}|x_{1},a_{2})\\ p(x_{1}|x_{2},a_{2}) & p(x_{2}|x_{2},a_{2}) \end{bmatrix} = \begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix},\\ \begin{bmatrix} q(y_{1}|x_{1}) & q(y_{2}|x_{1})\\ q(y_{1}|x_{2}) & q(y_{2}|x_{2}) \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2\\ 1/2 & 1/2 \end{bmatrix}&, \begin{bmatrix} r(x_{1},a_{1}) & r(x_{1},a_{2})\\ r(x_{2},a_{1}) & r(x_{2},a_{2}) \end{bmatrix} = \begin{bmatrix} 0 & 0\\ 0 & 1 \end{bmatrix}. \end{array} $$

The state space of the transformed MDP is \(\mathcal {P}(S)\), which can be identified with [0,1]. We use (1 − t,t) to denote a probability distribution in \(\mathcal {P}(S)\), where 0 ≤ t ≤ 1 represents the probability assigned to x2. On the one hand, to apply Theorem 4.5, by straightforward calculations, we have for f ∈ C([0,1]),

$$ \begin{array}{@{}rcl@{}} L^{(\gamma)}_{P}f(t)=\max\left\{f\left( \frac{1}{2}\right), (1-t+te^{\gamma})f\left( \frac{te^{\gamma}}{1-t+te^{\gamma}}\right)\right\}. \end{array} $$

Let

$$ \begin{array}{@{}rcl@{}} \rho_{\gamma}=e^{\gamma},\quad f_{\gamma}(t)= \begin{cases} 1,\quad&t\in[0,\frac{1}{2}e^{-\gamma})\\ 2e^{\gamma} t,\quad&t\in[\frac{1}{2}e^{-\gamma},1] \end{cases}. \end{array} $$

We can verify that \(L^{(\gamma )}_{P}f_{\gamma }=\rho _{\gamma }f_{\gamma }\). Then all the assumptions in Theorem 4.5 are fulfilled. Thus,

$$ v_{P}=\lim\limits_{\gamma\to 0+}\lambda_{P}(\gamma)=\lim\limits_{\gamma\to 0+}\frac{1}{\gamma}\log\rho_{\gamma}=1. $$

On the other hand, given an initial distribution t ∈ [0,1], the optimal average reward is vP(t) = max{t,1/2}: playing a2 forever keeps the state distribution at (1 − t,t) and yields average reward t, while playing a1 once resets it to (1/2,1/2), after which playing a2 forever yields average reward 1/2; since the conditional probability of x2 can only ever be t or 1/2, no policy does better. Hence, \(v_{P}=\sup _{t\in [0,1]}v_{P}(t)=1\), which coincides with Theorem 4.5. Furthermore, this example illustrates that there are circumstances in which the optimal risk-sensitive reward is independent of the initial distribution while the optimal average reward is not.
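The eigen-relation claimed in Example 4.7 can also be checked numerically. The short script below is ours and only re-implements the operator displayed above; it evaluates both sides of \(L^{(\gamma )}_{P}f_{\gamma }=\rho _{\gamma }f_{\gamma }\) on a grid of t ∈ [0,1] and confirms that they agree up to floating-point error.

```python
import numpy as np

# Numerical check of Example 4.7: rho_gamma = e^gamma and the piecewise linear
# f_gamma solve rho_gamma * f_gamma = L_P^{(gamma)} f_gamma, where t = theta({x_2})
# parametrizes the information state.

def f_gamma(t, gamma):
    return np.where(t < 0.5 * np.exp(-gamma), 1.0, 2.0 * np.exp(gamma) * t)

def L_gamma(f, t, gamma):
    val_a1 = f(0.5, gamma)                              # action a1: reset to t' = 1/2
    norm = 1.0 - t + t * np.exp(gamma)                  # action a2: e^{r_gamma(theta, a2)}
    val_a2 = norm * f(t * np.exp(gamma) / norm, gamma)  # t' = t e^gamma / norm
    return np.maximum(val_a1, val_a2)

t = np.linspace(0.0, 1.0, 10001)
for gamma in [0.1, 0.5, 1.0]:
    err = np.max(np.abs(np.exp(gamma) * f_gamma(t, gamma) - L_gamma(f_gamma, t, gamma)))
    print(gamma, err)   # errors at the level of machine precision
```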

5 A Portfolio Optimization Example

In this section, as an application of the approach developed in the previous sections, we consider a portfolio optimization problem. Consider a market with m securities and k price-affecting factors, and let V(n) denote the portfolio’s value at time n. We assume that the portfolio dynamics are determined by

$$ \frac{V(n+1)}{V(n)}=e^{F(X(n),H(n),W(n))}, $$
(5.1)

where X(n) = (X1(n),...,Xk(n)) denotes the factor process, which is a Markov chain with transition kernel \(P(dx^{\prime }|x)\), H(n) = (H1(n),...,Hm(n)) represents the portfolio strategy, i.e., the proportions of capital invested in the m securities at time n, and {W(n),n ≥ 1} is an i.i.d. random noise sequence which is independent of the factor process and has a common law η. F is a Borel measurable function. X(n) and H(n) take values in some compact subsets \(S\subset \mathbb {R}^{k}\) and \(A\subset \mathbb {R}^{m}\), respectively, while the noise W(n) takes values in a Polish space Z. This model was extensively studied in [26] for the dual relationship between maximizing the probability of outperforming a given benchmark and optimizing the long-term risk-sensitive reward. In this section, we will demonstrate that our approach guarantees the convergence of the optimal risk-sensitive reward to the optimal risk-neutral reward as the risk-sensitive factor tends to 0. As a consequence, we show that the optimal risk-neutral reward can be taken as a benchmark appearing in the duality mentioned above, complementing the studies of [26]. Given an initial state X(1) = x, we use Px to denote the corresponding probability measure on \({\mathscr{B}}((S\times Z)^{\infty })\) and Ex the expectation under Px. Let \(\mathcal {A}\) denote the set of all Markov portfolio strategies. Given a risk-sensitive factor γ > 0, the risk-sensitive optimal value is

$$ \lambda(x,\gamma)=\sup_{H\in\mathcal{A}}\frac{1}{\gamma}\liminf_{N\rightarrow\infty}\frac{1}{N}\log E_{x}\exp\left[\gamma\sum\limits_{n=1}^{N}F(X(n),H(n),W(n))\right]. $$
(5.2)

In what follows, we state two assumptions on F and P that ensure (A1), (A2), and (A3).

  (H1)

    For each w ∈ Z, F(⋅,⋅,w) ∈ C(S × A), and there is an η-integrable random variable g(w) such that

    $$ e^{|F(x,h,w)|}\leq g(w)\ \ \forall x\in S,\ h\in A \text{ and } w\in Z; $$
    (5.3)
  (H2)

    The family of functions \(\left \{x\mapsto {\int \limits } f(x^{\prime })P(dx^{\prime }|x), f\in C(S),\left \|f\right \|\leq 1\right \}\) is equicontinuous.

Remark 5.1

  (1)

    If (H1) holds, then |F(x,h,w)|≤ g(w), and hence \(\hat F(x,h):= {\int \limits } F(x,h,\cdot )d\eta \) is bounded and continuous in (x,h). Let Fm and FM be the infimum and supremum of \(\hat F\), respectively. Then from Jensen’s inequality, it follows that for γ > 0,

    $$ \log \int e^{\gamma F(x,h,\cdot)}d\eta\geq \gamma F_{m}. $$
    (5.4)
  (2)

    A particular case in which (H2) holds is \(P(dx^{\prime }|x)=Q(x^{\prime }|x){\Lambda }(dx^{\prime })\) with \(\{Q(x^{\prime }|\cdot ),x^{\prime }\in S\}\) equicontinuous and \({\Lambda }\in \mathcal {P}(S)\).

The one-step reward F in (5.1) depends not only on the state x and the action h but also on W, which is slightly different from the typical form. We make the following changes to get a reward in the standard form. Define a new Markov decision model with the transition law p(γ) and the one-step reward r(γ) defined respectively by

$$ \begin{array}{@{}rcl@{}} p^{(\gamma)}(dx^{\prime}|x,h)&:=&\frac{1}{{\int}_{Z} e^{\gamma F(x,h,w)}\eta(dw)}\left( {\int}_{Z} P(dx^{\prime}|x)e^{\gamma F(x,h,w)}\eta(dw)\right),\\ r^{(\gamma)}(x,h)&:=&\log\left( {\int}_{Z} e^{\gamma F(x,h,w)}\eta(dw)\right). \end{array} $$
(5.5)

By a direct calculation, we see that \(p^{(\gamma )}(dx^{\prime }|x,h)\) is actually \(P(dx^{\prime }|x)\), and for any N ≥ 1

$$ E_{x}\exp\left[{\sum}_{n=1}^{N}r^{(\gamma)}(X(n),H(n))\right]= E_{x}\exp\left[\gamma{\sum}_{n=1}^{N}F(X(n),H(n),W(n))\right]. $$
(5.6)

Notice that the transition kernel of this MDP is still P, but the reward is r(γ) instead of γr. Assumption (H1) implies that r(γ)(x,h) is continuous in (x,h). Thus, with an extra discussion about the convergence of \(\frac {1}{\gamma }r^{(\gamma )}\) as γ → 0, we can obtain the limit with the same argument as the one in the proof of Theorem 3.1. In particular, it is not hard to check that (H1) and (H2) imply that r(γ) and P satisfy (A1), (A2), and (A3). Therefore, setting the risk-sensitive coefficient in Theorem 3.7 to be one and then dividing both sides by γ, we have the following variational formula for \(\lambda (\gamma )=\sup _{x\in S}\lambda (x,\gamma )\):

$$ \lambda(\gamma)=\frac{1}{\gamma}\sup_{\beta\in\mathcal{I}}\left\{{\int}_{S\times A}\left[r^{(\gamma)}(x,h)-D(\beta_{2}(\cdot|x,h)\Vert P(\cdot|x))\right]\beta^{\prime}(dx,dh)\right\}, $$
(5.7)

where \(\mathcal {I}\) is defined in (3.2). Although Theorem 3.8 cannot be directly applied due to the difference between (5.7) and (3.1), the risk-neutral limit \(\lim _{\gamma \to 0}\lambda (\gamma )\) can still be derived by an argument similar to the one used in proving Theorem 3.1. To see this, we still use v to denote the average optimal return, i.e.,

$$ v=\sup_{x\in S}v(x)=\sup_{x\in S}\sup_{H\in\mathcal{A}}\liminf_{N\rightarrow\infty}\frac{1}{N} E_{x}\left[{\sum}_{n=1}^{N}F(X(n),H(n),W(n))\right]. $$
(5.8)

Theorem 5.1

Assume (H1) and (H2). Then

$$ \lim_{\gamma\to 0}\lambda(\gamma)=v. $$
(5.9)

Proof

By Hölder’s inequality, we see that λ(γ) is nondecreasing in γ and \(\liminf \limits _{\gamma \to 0}\lambda (\gamma )\geq v\). We will apply (5.7) to prove that \(\limsup \limits _{\gamma \to 0}\lambda (\gamma )\leq v\). Similarly to the argument in the proof of Theorem 3.1, for any ε > 0, we can find a sequence \(\{\gamma _{n}\}_{n\in \mathbb {N}}\) decreasing to 0 with \(\lim \limits _{\gamma \to 0}\lambda (\gamma )=\lim \limits _{n\to \infty }\lambda (\gamma _{n})\) and \(\beta _{\gamma _{n}}^{\varepsilon },\beta ^{\varepsilon }\in \mathcal {I}\) with \(\lim \limits _{n\to \infty }\beta _{\gamma _{n}}^{\varepsilon }=\beta ^{\varepsilon }\) weakly such that

$$ \lambda(\gamma_{n})-\varepsilon\leq{\int}_{S\times A}\left[\frac{1}{\gamma_{n}}r^{(\gamma_{n})}(x,h)-\frac{1}{\gamma_{n}}D((\beta_{\gamma_{n}}^{\varepsilon})_{2}(\cdot|x,h)\Vert P(\cdot|x))\right](\beta_{\gamma_{n}}^{\varepsilon})'(dx,dh). $$
(5.10)

Therefore,

$$ \begin{array}{@{}rcl@{}} \lambda(\gamma_{n})-\varepsilon\leq{\int}_{S\times A}\frac{1}{\gamma_{n}}r^{(\gamma_{n})}(x,h)(\beta_{\gamma_{n}}^{\varepsilon})'(dx,dh). \end{array} $$

The monotonicity of \(\gamma \mapsto \frac {1}{\gamma }r^{(\gamma )}(x,h)\) in γ follows from Hölder’s inequality, and thus we have as γn → 0 that

$$ \begin{array}{@{}rcl@{}} \frac{1}{\gamma_{n}}r^{(\gamma_{n})}(x,h)=\frac{1}{\gamma_{n}}\log\left( {\int}_{Z} e^{\gamma_{n} F(x,h,w)}\eta(dw)\right)\downarrow{\int}_{Z}F(x,h,w)\eta(dw)=: r^{(0)}(x,h). \end{array} $$

Therefore, it follows from Dini’s theorem that \(\frac {1}{\gamma _{n}}r^{(\gamma _{n})}\) converges to r(0) uniformly. Combining this fact with the weak convergence of \(\beta _{\gamma _{n}}^{\varepsilon }\), we obtain that

$$ \begin{array}{@{}rcl@{}} \lim_{n\rightarrow\infty}{\int}_{S\times A}\frac{1}{\gamma_{n}}r^{(\gamma_{n})}(x,h)(\beta_{\gamma_{n}}^{\varepsilon})'(dx,dh)={\int}_{S\times A}r^{(0)}(x,h)(\beta^{\varepsilon})'(dx,dh). \end{array} $$

Now we claim that \(\left (\beta ^{\varepsilon }\right )_{2}=P\). Indeed, from the joint lower semicontinuity of the relative entropy, we see that

$$ \begin{array}{@{}rcl@{}} & &-\liminf_{n\rightarrow\infty}{\int}_{S\times A}D\Big(\left( \beta_{\gamma_{n}}^{\varepsilon}\right)_{2}(\cdot|x,h)\Big\Vert P(\cdot|x)\Big)\left( \beta_{\gamma_{n}}^{\varepsilon}\right)'(dx,dh)\\ &=&-\liminf_{n\rightarrow\infty}D\left( \beta_{\gamma_{n}}^{\varepsilon}(dx,dh,dx^{\prime})\Big\Vert\left( \beta_{\gamma_{n}}^{\varepsilon}\right)'(dx,dh)P(dx^{\prime}|x)\right)\\ &\leq&-D\left( \beta^{\varepsilon}(dx,dh,dx^{\prime})\Big\Vert\left( \beta^{\varepsilon}\right)'(dx,dh)P(dx^{\prime}|x)\right). \end{array} $$

Thus, if \(\left (\beta ^{\varepsilon }\right )_{2}\neq P\), then

$$D\left( \beta^{\varepsilon}(dx,dh,dx^{\prime})\Big\Vert\left( \beta^{\varepsilon}\right)'(dx,dh)P(dx^{\prime}|x)\right)>0.$$

From assumption (H1), (5.4), (5.5), and (5.6), together with (5.2) and (5.10), we would have

$$ \begin{array}{@{}rcl@{}} F_{m}-\varepsilon\leq\lim_{n\to\infty}\lambda(\gamma_{n})-\varepsilon\leq-\infty. \end{array} $$

This is impossible, so \(\beta ^{\varepsilon }\in \mathcal {I}\) and \(\left (\beta ^{\varepsilon }\right )_{2}=P\). Then it is routine to follow the same argument as that of Theorem 3.2 to check that \({\int \limits }_{S\times A}r^{(0)}(x,h)(\beta ^{\varepsilon })'(dx,dh)\leq v\). Consequently, (5.9) follows by letting ε → 0. □

As claimed in the introduction, it was shown in [18, 24], and [26] that risk-sensitive portfolio optimization is dual to the maximization of the outperformance probability (upside chance) under a differentiability assumption on the optimal value. To describe this more precisely, for \(b\in \mathbb {R},\gamma >0\), and \(x\in S\), define Ix(b) by

$$ \begin{array}{@{}rcl@{}} I_{x}(b)&:=&\sup_{H\in\mathcal{A}}\liminf_{N\rightarrow\infty}\frac{1}{N}\log\text{P}_{x}\left[\frac{1}{N}\sum\limits_{n=1}^{N}F(X(n),H(n),W(n))\geq b\right],\\ \quad \text{and}\quad I(b)&:=&\sup_{x\in S}I_{x}(b). \end{array} $$
(5.11)

Then by Chebyshev’s inequality, \(\text {P}_{x}\left [\frac {1}{N}\sum _{n=1}^{N}F(X(n),H(n),W(n))\geq b\right ]\leq e^{-N\gamma b} E_{x}\exp \left [\gamma \sum _{n=1}^{N}F(X(n),H(n),W(n))\right ]\) for every strategy \(H\in \mathcal {A}\) and every N; taking logarithms, dividing by N, taking the limit inferior in N, and optimizing over H and x give

$$ \begin{array}{@{}rcl@{}} \gamma\lambda(x,\gamma)-\gamma b\geq I_{x}(b),\text{ and }\ \gamma\lambda(\gamma)-\gamma b\geq I(b). \end{array} $$

Thus,

$$ -\sup_{\gamma\in[0,K)}\{\gamma b-\gamma\lambda(\gamma)\}\geq I(b) $$
(5.12)

for a pre-specified K > 0. Let Λ(γ) := γλ(γ) for convenience. It has already been established in [26] that if Λ(γ) is differentiable on [0,K) and the limit

$$ \lim_{N\rightarrow\infty}\frac{1}{N}\log E_{x}\exp\left[\gamma\sum\limits_{n=1}^{N}F(X(n),H(n),W(n))\right] $$
(5.13)

exists and does not depend on the initial state x, then the duality

$$ -\sup_{\gamma\in[0,K)}\{\gamma b-{\Lambda}(\gamma)\}= I(b) $$
(5.14)

holds whenever \(b\in \{{{\Lambda }^{\prime }}^{+}(\gamma ):\gamma \in [0,K)\}\) or \(b\leq {{\Lambda }^{\prime }}^{+}(0)\), where \({{\Lambda }^{\prime }}^{+}(\gamma )\) denotes the right-hand derivative of Λ(γ) (see Theorem 2.7 in [26]). In the meantime, our result shows that under (H1) and (H2),

$$ \begin{array}{@{}rcl@{}} {{\Lambda}^{\prime}}^{+}(0)&=\lim\limits_{\gamma\to 0+}\frac{\Lambda(\gamma)-{\Lambda}(0)}{\gamma-0}=\lim\limits_{\gamma\to 0}\lambda(\gamma)=v, \end{array} $$

which reveals the connection between the outperformance probability, the risk-neutral average return, and the risk-sensitive average growth rate. In order to guarantee the differentiability, we add the following assumptions on P and r(γ) in accordance with Theorem 3.1 in [26]; these also imply that the transition law satisfies (B2).

  (H3)

    There exists δp < 1 such that

    $$ \sup_{U\in \mathcal{B}(S)}\sup_{x,x^{\prime}\in S}[P(U|x)-P(U|x^{\prime})]\leq \delta_{p}. $$
    (5.15)
  (H4)

    There exists a Kγ > 0 such that the mapping \(\gamma \mapsto \sup \limits _{h\in A}r^{(\gamma )}(x,h)\) is differentiable on [0,Kγ) for any x ∈ S.

Remark 5.2

Let Fm and FM be defined in Remark 5.1(1). Then (H1), (H2), (H3), and the condition \(\gamma \leq -\frac {\log \delta _{p}}{F_{M}-F_{m}}\) guarantee that the limit inferior in (5.2) is actually a limit and λ(x,γ) does not depend on x (see Theorem 1 in [13]). So the constant K in (5.12) can be determined.

Theorem 5.2

Assume (H1), (H2), (H3), and (H4). Let \(K=\min \limits \{-\frac {\log \delta _{p}}{F_{M}-F_{m}}, K_{\gamma }\}\). Then v(x) = v is a constant, and the duality (5.14) holds for every b ≤ v.

Proof

Combining Theorem 3.1 in [26] and the above remark, we see that Λ(γ) is differentiable on [0,K). Thus, by Theorem 2.7 in [26],

$$ -\sup_{\gamma\in[0,K)}\{\gamma b-{\Lambda}(\gamma)\}=I(b) $$

holds for every \(b\leq {{\Lambda }^{\prime }}^{+}(0)\). Theorem 3.7 implies that

$$ {{\Lambda}^{\prime}}^{+}(0) = \lim_{\gamma\to 0}\frac{1}{\gamma}{\Lambda}(\gamma) =\lim\limits_{\gamma\to 0}\sup_{x\in S}\lambda(x,\gamma)=\sup_{x\in S}v(x), $$

where v(x) is indeed a constant due to (H3) and Lemma 3.9. This completes the proof. □

We end this section with an illustrative example.

Example 5.3

Let S = {− 1,1}, A = {(h1,h2) : hi ≥ 0, h1 + h2 = 1}, and let {Wn,n ≥ 1} be i.i.d. with the standard normal distribution N(0,1). F is given by

$$ F(x,h,w)=\alpha\cdot x\cdot(h_{1}-h_{2})\cdot w^{2},\ x\in S,\ h\in A,\ w\in Z, $$

where α ∈ (0,1/2) is a constant. Then \(e^{|F(x,h,w)|}\leq g(w)=e^{\alpha w^{2}}\) with g being η-integrable. Thus (H1), (H2), and (H4) are fulfilled, and (5.9) holds. If the transition probabilities are chosen to satisfy

$$ \min_{x,x^{\prime}\in S}P(x^{\prime}|x)>0, $$

then (H3) is also satisfied; therefore, the assertions of Theorem 5.2 hold true.
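For this example the transformed reward in (5.5) is available in closed form, since for W ∼ N(0,1) and s < 1/2 one has \(E[e^{sW^{2}}]=(1-2s)^{-1/2}\). The script below (our illustration; the parameter values are arbitrary) compares this closed form with a Monte Carlo estimate and displays the monotone convergence of \(\frac {1}{\gamma }r^{(\gamma )}\) to \(r^{(0)}\) used in the proof of Theorem 5.1.

```python
import numpy as np

# Example 5.3 with (illustrative) alpha = 0.3, x = 1, h = (1, 0), so that
# F(x, h, w) = s * w^2 with s = alpha * x * (h1 - h2) = 0.3, and
# r^(gamma)(x, h) = -0.5 * log(1 - 2 * gamma * s)   (valid for 2 * gamma * s < 1),
# while r^(0)(x, h) = E[F(x, h, W)] = s.

alpha, x, h = 0.3, 1, (1.0, 0.0)
s = alpha * x * (h[0] - h[1])

rng = np.random.default_rng(0)
w = rng.standard_normal(10**6)

for gamma in [0.5, 0.25, 0.1, 0.01]:
    closed = -0.5 * np.log(1.0 - 2.0 * gamma * s)         # r^(gamma)(x, h)
    monte = np.log(np.mean(np.exp(gamma * s * w ** 2)))    # Monte Carlo version of (5.5)
    print(gamma, closed / gamma, monte / gamma)
# Both columns decrease towards r^(0)(x, h) = 0.3 as gamma -> 0, illustrating
# the Dini-type step in the proof of Theorem 5.1.
```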