Abstract
In this paper, we prove that the optimal risk-sensitive reward for Markov decision processes with compact state space and action space converges to the optimal average reward as the risk-sensitive factor tends to 0. In doing so, a variational formula for the optimal risk-sensitive reward is derived. An extension of the Kreĭn-Rutman Theorem to certain nonlinear operators is involved. Based on these results, partially observable Markov decision processes are also investigated. A portfolio optimization problem is presented as an application of the approach, in which a duality relation between maximizing the risk-sensitive reward and maximizing the upside chance of outperforming the optimal average reward is established.
1 Introduction
In this paper, we study a risk-sensitive control problem for Markov decision processes (MDPs). The risk-sensitive control of MDPs has been widely investigated (see [10, 12, 16, 21, 22] and the references cited therein). The basic goal is to find the optimal solution to the following control problem
where Xn is the state of the system at time n, x is the initial state, An is the decision made by the controller at time n, and π is the strategy for decision-making. The risk-sensitive factor γ represents the controller’s risk preference. Regarding r as a reward, we are concerned with the following maximization for γ > 0:
It is well known that γ = 0 corresponds to the risk-neutral case in which the performance is evaluated according to the following typical long-run average reward:
Notice that for any γ > 0, if E(eγX) and E(X) both exist, then
It is natural to ask whether the optimal risk-sensitive control converges to the optimal long-run average control as the risk-sensitive factor vanishes. The main purpose of this paper is to prove that
provided that both sides are well-defined (see the next section for an explicit description of this problem). This problem has been studied for minimizing risk-sensitive costs for MDPs; see the references cited above. A similar problem for optimal risk-sensitive portfolios has also been studied. Notice that in the framework of portfolio or other asset processes, maximizing rewards is a natural problem to consider. This maximization problem is essentially different from the minimization problem, but both are of fundamental importance for applications. It is interesting that maximizing the risk-sensitive reward is dual to maximizing the upside chance or minimizing the downside risk under some conditions (see [18, 24] and [26]). These results motivate us to study the asymptotics of optimal risk-sensitive rewards for MDPs. We shall show in this paper that for MDPs with compact state spaces and action spaces, under certain assumptions, the maximal risk-sensitive reward converges to the maximal long-run average reward as the risk-sensitive factor decreases to 0.
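As a purely numerical illustration of the limit discussed above (with a hypothetical discrete random variable, not taken from this paper), the following Python sketch shows that the risk-sensitive value \((1/\gamma)\log E(e^{\gamma X})\) dominates E(X), is non-decreasing in γ, and approaches E(X) as γ → 0:

```python
import math

# A hypothetical discrete random variable X (values and probabilities
# chosen for illustration only).
values = [0.0, 1.0, 3.0]
probs = [0.5, 0.3, 0.2]

def mean(vals, ps):
    return sum(v * q for v, q in zip(vals, ps))

def risk_sensitive_value(gamma, vals, ps):
    """(1/gamma) * log E[exp(gamma * X)] for a finitely supported X."""
    return math.log(sum(q * math.exp(gamma * v) for v, q in zip(vals, ps))) / gamma

m = mean(values, probs)
vals_by_gamma = {g: risk_sensitive_value(g, values, probs)
                 for g in (1.0, 0.1, 0.01, 0.001)}
```

The monotonicity in γ is a consequence of Hölder's inequality, the same argument used later in the proof of Theorem 3.1.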
We note that in the approach for deriving the asymptotics of the minimal risk-sensitive cost, besides a few necessary continuity assumptions, some conditions on contraction and strong ergodicity for the transition probabilities were imposed, based on which span contraction of some properly defined operator can be verified, which guarantees a solution to the corresponding Bellman equation. The strong ergodicity condition also makes it possible to apply some large deviation techniques (see (2.7) and (2.8) in Section 2 for the explicit conditions, with more explanations given in Remark 2.1). In this paper, we shall use a quite different approach: inspired by Anantharam and Borkar [2], we use a nonlinear extension of the Kreĭn-Rutman Theorem (see [23]) to find the eigenvalues of some properly defined operators on certain function spaces and characterize the optimal growth rate of the multiplicative reward with this eigenvalue. Using this characterization and a perturbation technique, we derive a variational formula for the optimal growth rate (Theorem 3.7) of the MDP without assuming ergodicity of the transition probabilities. This variational formula is similar to the Donsker-Varadhan formula (see [14]) and is of independent significance. The vanishing risk-sensitivity limit of the maximal reward of MDPs follows as an application of this formula (Theorems 3.1 and 3.8), and its proof implies that for a risk-sensitive control problem, the optimal policy can be taken to be a stationary one, even when the MDP is not communicating.
We also apply the approach to study the same problem for partially observable Markov decision processes (POMDPs) with compact state and action spaces. For POMDPs, see [4, 6, 8, 11, 17], and [19] and the references cited therein. A widely used approach for studying a POMDP is to transfer it into a completely observable MDP. However, the structure of the transferred MDP is usually much more complicated. Among the current results, such as those in the references mentioned above, few are on the risk-sensitivity vanishing limit. In [1], the limit of the minimal risk-sensitive cost as the risk-sensitive factor tends to 0 is derived for a class of POMDPs with a particular structure. This particular structure makes it possible to apply a large deviation approach. In [11], Di Masi and Stettner established the existence of the solution to the associated Bellman equation for cost-minimizing problems. However, they remarked that the limit as the risk-sensitive factor tends to 0 for general POMDPs had not been proven. Based on our investigation for MDPs, we prove that, as long as the solution to the associated Bellman equation exists, the maximal risk-sensitive reward converges to the maximal long-run average reward as the risk-sensitive factor tends to 0.
Finally, as an application of our approach to portfolio optimization, we establish a duality relation between maximizing the risk-sensitive reward and maximizing the chance of outperforming certain amounts of reward, with the range of the amounts characterized by the optimal average reward (Theorem 5.2).
The paper is organized as follows. At the end of this section, we introduce some notations that will be frequently used in this paper. In Section 2, we define the decision model and derive some properties of the operator corresponding to the Bellman equation. The risk-neutral limit for MDPs is given in Section 3, in which the variational formula mentioned above is established. Section 4 is devoted to POMDPs. The portfolio optimization problem is investigated in Section 5.
Here are some notations and preliminaries. Given a separable and complete metric space (also called a Polish space) \((\mathcal {X},\rho )\), let \({\mathscr{M}}(\mathcal {X})\) and \({\mathscr{M}}^{+}(\mathcal {X})\) denote the set of finite signed measures on \(\mathcal {X}\) and the set of finite measures on \(\mathcal {X}\), respectively. \(\mathcal {P}(\mathcal {X})\) is the space of probability measures on \(\mathcal {X}\), endowed with the weak topology. For \(p,q\in \mathcal {P}(\mathcal {X})\), we use p << q to denote that p is absolutely continuous with respect to q. As usual, δx(⋅) denotes the Dirac measure at the point \(x\in \mathcal {X}\). When \(\mathcal {X}\) is compact, \(C(\mathcal {X})\), the space of real-valued continuous functions on \(\mathcal {X}\), equipped with the supremum norm ∥⋅∥, is a Banach space. Let \(C^{+}(\mathcal {X})\) denote the set of non-negative functions in \(C(\mathcal {X})\). \(C^{+}(\mathcal {X})\) is a cone, which means that for any \(f,g\in C^{+}(\mathcal {X})\) and any c > 0, both f + g and cf are in \(C^{+}(\mathcal {X})\). \(C^{+}(\mathcal {X})\) is convex, closed and satisfies that \(C^{+}(\mathcal {X})\cap (-C^{+}(\mathcal {X}))=\{0\}\) and that \(interior(C^{+}(\mathcal {X}))\neq \emptyset \). We write f ≥ g, f > g, f >> g if \(f-g\in C^{+}(\mathcal {X}),f-g\in C^{+}(\mathcal {X})\backslash \{0\},f-g\in interior(C^{+}(\mathcal {X}))\), respectively. These facts form the basis for applying a nonlinear extension of the Kreĭn-Rutman theorem (see Appendix) to the operator corresponding to the Bellman equation.
For two probability measures \(p,q\in \mathcal {P}(\mathcal {X})\), the relative entropy of p with respect to q is defined by
which plays an important role in the variational formula for the optimal reward.
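For finitely supported measures, the relative entropy defined above reduces to a finite sum with the conventions 0 log 0 := 0 and D(p∥q) := +∞ when p is not absolutely continuous with respect to q. A minimal Python sketch (helper name hypothetical):

```python
import math

def relative_entropy(p, q):
    """D(p||q) = sum_i p_i * log(p_i / q_i) for discrete distributions;
    returns +inf unless p << q, with the convention 0*log 0 = 0."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue                      # 0 * log 0 = 0
        if qi == 0.0:
            return math.inf               # p is not absolutely continuous w.r.t. q
        d += pi * math.log(pi / qi)
    return d
```

As expected, D(p∥p) = 0 and D(p∥q) > 0 whenever p ≠ q, the property used repeatedly in Section 3.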
Let \(\text {Lip}(\mathcal {X})\) denote the space of real-valued, bounded, and Lipschitz continuous functions on \(\mathcal {X}\). Given \(f\in \text {Lip}(\mathcal {X})\), define its norm by
Then \((\text {Lip}(\mathcal {X}),\ \|\cdot \|_{L})\) is a Banach space when \(\mathcal {X}\) is compact. Given \(\mu \in {\mathscr{M}}(\mathcal {X})\), define the following Kantorovich-Rubinstein norm:
Then the weak topology on \(\mathcal {P}(\mathcal {X})\) is generated by the Kantorovich-Rubinstein metric d0(μ,ν) := ∥μ − ν∥0 (Theorem 8.3.2, pp. 193–194, in [7]). \(\mathcal {P}(\mathcal {X})\) endowed with the weak topology is a Polish space since \(\mathcal {X}\) is Polish. The space of Lipschitz functions and the Kantorovich-Rubinstein metric will be used in Section 4 to discuss the POMDPs.
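As an aside, on the real line the closely related Wasserstein-1 distance (the Kantorovich-Rubinstein metric with the Lipschitz seminorm alone) admits a simple CDF formula. A sketch for two distributions on a common sorted finite support (helper name and inputs hypothetical; this is not the exact bounded-Lipschitz norm ∥⋅∥0 used above, only its classical relative):

```python
def wasserstein1_discrete(xs, p, q):
    """W1 between distributions p, q supported on the sorted points xs,
    via W1 = sum_i |F(x_i) - G(x_i)| * (x_{i+1} - x_i)."""
    Fp = Fq = 0.0
    w = 0.0
    for i in range(len(xs) - 1):
        Fp += p[i]
        Fq += q[i]
        w += abs(Fp - Fq) * (xs[i + 1] - xs[i])
    return w
```

For instance, the distance between the Dirac measures δ0 and δ1 is the distance between their atoms.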
Finally, as usual, \(\mathbb {N}\) and \(\mathbb {R}\) denote the sets of non-negative integers and real numbers, respectively.
2 Solution to the Bellman Equation
A discrete-time MDP can be represented as a four-tuple M = 〈S,A,p(⋅|⋅,⋅),r(⋅,⋅)〉. S is the state space, A is the action space, and both are assumed to be compact metric spaces in the present paper. We assume for convenience that any action in A is admissible in any state. The transition kernel, which depends on actions, is denoted by \(p(E|x,a), E\subseteq S, x\in S, a\in A\). The last element in the tuple is the one-step reward function \(r: S\times A\to \mathbb {R}\). To define a probability space and a stochastic process with the desired mechanism, let \({\Omega }=(S\times A)^{\infty }\) and \({\mathscr{B}}({\Omega })\) be the product Borel σ-field. Given a sample path ω = (x1,a1,x2,a2,...) ∈Ω, define \(X_{t}:= x_{t},A_{t}:= a_{t},t\in \mathbb {N}\). At each time \(t\in \mathbb {N}\), the system M occupies a state Xt, based on which the controller chooses an action At, and then the system moves to the next state according to the law p(⋅|Xt,At). A Markov decision rule at time t is a stochastic kernel \(d_{t}\in \mathcal {P}(A|S)\), where dt(B|x) denotes the probability of taking an action in \(B\subseteq A\) when observing the current state Xt = x. A Markov policy π is a sequence of Markov decision rules. Let DM denote the set of all the Markov decision rules. \({\Pi }_{M}=(D_{M})^{\infty }\) is the set of all the Markov policies of M. Given an initial state x ∈ S and a policy π = (d1,d2,...) \(\in {\Pi }_{M}\), we can define a unique probability measure \(\text {P}_{x}^{\pi }\) on \({\mathscr{B}}({\Omega })\) by the Ionescu-Tulcea theorem, such that for each t ≥ 0,
The corresponding expectation operator is denoted by \( E_{x}^{\pi }\). Since we view r as a reward, a typical criterion for evaluating the optimal policy is to maximize the average reward, i.e., we are interested in the following function
which is risk-neutral. Informally, we notice that by the Taylor expansion, we see that for a small factor γ
Hence, we can use γ≠ 0 to evaluate the controller’s risk preference. This leads to the following risk-sensitive criterion for maximization of reward (see [10]):
where γ≠ 0 is a constant evaluating the controller’s risk preference. With “\(\sup \)” replaced by “\(\inf \)” in (2.3), we have the risk-sensitive criterion for minimization of cost. An interesting problem is the asymptotics of λ(x,γ) as γ → 0. Observe that if we define for N ≥ 1
then (2.2) implies that
which motivates us to derive
This is the main concern of this paper.
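For orientation, the finite-horizon quantity λN(x,γ) just defined can be estimated by straightforward Monte Carlo simulation. The sketch below uses a toy 2-state, 2-action MDP with hypothetical numbers (not from this paper) and a fixed stationary policy:

```python
import math
import random

# A toy 2-state, 2-action MDP with hypothetical numbers.
P = {0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
     1: {0: [0.3, 0.7], 1: [0.8, 0.2]}}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 2.0, 1: 0.5}}

def lambda_N(x, gamma, policy, N, n_paths=5000, seed=0):
    """Monte Carlo estimate of (1/(gamma*N)) * log E_x exp(gamma * sum_t r(X_t, A_t))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        s, reward_sum = x, 0.0
        for _ in range(N):
            a = policy(s)
            reward_sum += R[s][a]
            s = rng.choices((0, 1), weights=P[s][a])[0]
        total += math.exp(gamma * reward_sum)
    return math.log(total / n_paths) / (gamma * N)

# Estimate for the stationary policy that always plays action 0.
est = lambda_N(0, 0.1, policy=lambda s: 0, N=30)
```

Since the per-step reward under this policy lies between 1 and 2, so does the estimate, reflecting the general bound rm ≤ λN ≤ rM.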
Remark 2.1
When the \(\sup \) in (2.3) is replaced with an \(\inf \) to define \(\tilde \lambda (x,\gamma )\), the γ → 0 limit has already been established in [10]. There, the existence of a solution to the Bellman equation was proved under, in addition to some necessary continuity assumptions, the following two requirements:
and there exist an \(\eta \in \mathcal {P}(S)\) and a continuous density q(x,a,y) such that \(p(E|x,a)={\int \limits }_{E}q(x,a,y)\eta (dy)\) for \(E\in {\mathscr{B}}(S)\) and
These conditions guarantee that the operator defining the corresponding Bellman equation is span contractive, and hence a solution exists. (2.8) also implies strong ergodicity for the family of transition probabilities defining the MDP. A consequence is the applicability of large deviation techniques for ergodic Markov processes. Instead of such conditions, we will use the following assumption (B1) on communication among the family of transition probabilities to guarantee that the eigenvector of the Bellman equation is strictly positive, based on which the variational formula holds. Then we will remove (B1) by a perturbation technique, which means that the limit can hold for completely observable MDPs without any communication requirements on the transition probability.
-
(B1)
For any x1,x ∈ S and any open neighborhood U containing x, there exist an N > 0 and a1,...,aN ∈ A such that
$$ \begin{array}{@{}rcl@{}} \int \textbf{1}_{U}(x_{N+1})p(dx_{N+1}|x_{N},a_{N})...p(dx_{2}|x_{1},a_{1})>0, \end{array} $$
Remark 2.2
When the model is finite (i.e., both S and A are finite), (B1) is the classical communicating condition.
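For a finite model, the communicating condition of Remark 2.2 reduces to a reachability property of the directed graph whose edges are the positive-probability transitions under some action. A simplified Python check (helper names hypothetical; the start state is counted as trivially reachable, glossing over the N > 0 return-path subtlety in (B1)):

```python
def is_communicating(P):
    """Simplified finite-model check of (B1): every state reaches every state
    with positive probability under some action sequence.
    P[s][a] is the list of transition probabilities to next states."""
    n = len(P)
    # Edge s -> y iff some action moves s to y with positive probability.
    adj = [{y for a in P[s] for y in range(n) if P[s][a][y] > 0.0}
           for s in range(n)]

    def reachable(s):
        # Depth-first search from s.
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    return all(len(reachable(s)) == n for s in range(n))
```

A model with an absorbing state that cannot reach the rest of the state space fails the check.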
Now let
and
for γ > 0. The main objective of this paper is to show that
In order to do this, we make the following assumptions.
-
(A1)
r(x,a) is continuous in (x,a).
-
(A2)
\((x,a)\mapsto {\int \limits }_{S}p(dy|x,a)f(y)\) is continuous in (x,a) when f ∈ C(S).
-
(A3)
The family of functions
$$ \begin{array}{@{}rcl@{}} \left\{x\mapsto{\int}_{S}f(y)p(dy|x,a), f\in C(S),\left\|f\right\|\leq 1,a\in A\right\} \end{array} $$is equicontinuous.
Moreover, if (B1) holds, we will see that, independent of the choice of the initial state x ∈ S, the value of λ(x,γ) depends only on γ and
Remark 2.3
A concrete case in which (A3) is satisfied is that
with \({\Lambda }\in \mathcal {P}(S)\) and {q(y|⋅,a),y ∈ S,a ∈ A} equicontinuous. The equicontinuity assumption (A3) is only used to prove the compactness of the operator related to the Bellman equation. In particular, for every finite MDP, (A1), (A2), and (A3) hold automatically. Combined with (B1), this compactness yields the existence of a positive eigenvalue and an associated positive eigenvector. The continuity assumptions of the result regarding the Bellman equation in [10] are the same as (A1) and (A2). But the γ → 0 limit established in [10] requires that there exist an \(\eta \in \mathcal {P}(S)\) and a density q(x,a,y) > 0 such that \(p(E|x,a)={\int \limits }_{E}q(x,a,y)\eta (dy)\) for \(E\in {\mathscr{B}}(S)\) and (x,a,y) → q(x,a,y) is continuous, which is stricter than (A3) when the state space S and action space A are compact.
The Bellman equation mentioned above is
and the corresponding operator L(γ) on C(S) is defined by
Since
and
we see that
Assumptions (A1), (A2) and the compactness of S × A imply that when f ∈ C(S), L(γ)f also belongs to C(S). Combining these with (A3), we can prove that L(γ) is a compact operator, which is crucial to the existence of a positive eigenvalue, as we claimed before.
Proposition 2.1
Assume (A1), (A2), and (A3). Then L(γ) is a compact operator mapping C(S) into itself.
Proof
Notice that r(⋅,⋅) is bounded under assumption (A1) and the compactness of S × A. For convenience, we let rM and rm be the supremum and infimum of r, respectively. For any function f with ∥f∥≤ K, we have \(\sup _{x\in S}\left |L^{(\gamma )}f(x)\right |\leq Ke^{\gamma r_{M}}\). Thus, to apply the Arzelà-Ascoli theorem, we need to verify that the family {L(γ)f,f ∈ C(S),∥f∥≤ K} is equicontinuous.
To this end, let ρ denote the metric on S. According to (A3), for any ε > 0, there exists δ1 > 0 such that
for any x1,x2 with ρ(x1,x2) ≤ δ1. By the uniform continuity of eγr(⋅,⋅), there exists δ2 > 0 such that
whenever ρ(x1,x2) ≤ δ2. Consequently when ∥f∥≤ K, for x1,x2 with \(\rho (x_{1},x_{2})\leq \min \limits \{\delta _{1},\delta _{2}\}\), we have
□
L(γ) has the following properties, which will be used when applying the nonlinear Kreĭn-Rutman Theorem to prove the existence of a solution to (2.13).
-
(P1)
Assume (A1). Then
$$(L^{(\gamma)})^{N}f(x)=\sup\limits_{\pi\in{\Pi}_{M}} E_{x}^{\pi}\left[\exp\left( {\sum}_{t=1}^{N}\gamma r(X_{t},A_{t})\right)\cdot f(X_{N+1})\right],N\geq 1.$$This property can be proven by induction using the Markov feature of π ∈πM and the fact that \(e^{\gamma r(X_{i},A_{i})}\leq e^{\gamma r_{M}}\) (see Lemma 2.1 and its proof in [2]).
-
(P2)
(Positive 1-homogeneity) c(L(γ)f) = L(γ)(cf) for c ≥ 0 and f ∈ C(S).
-
(P3)
(Order-preserving) If f ≥ g, then L(γ)f ≥ L(γ)g.
The following theorem shows that the spectral radius of L(γ) is an eigenvalue. For an operator T : C(S) → C(S), define
It is not hard to check that
which implies that the limit
exists.
Theorem 2.2
Assume (A1), (A2), and (A3). Then ρ(L(γ)) > 0 and there exists an fγ ∈ C+(S) depending on γ with fγ≠ 0 such that
If in addition (B1) is satisfied, then fγ >> 0 and \(\gamma \lambda (x,\gamma )=\log \rho (L^{(\gamma )})\) is independent of x ∈ S.
Proof
From (P1), we see that \(\left \|(L^{(\gamma )})^{n}\textbf {1}\right \|\geq e^{n\gamma r_{m}}\), which implies that \(\rho (L^{(\gamma )})\geq e^{\gamma r_{m}}>0\). Since L(γ) is compact, positive 1-homogeneous and order-preserving, by Theorem A.1 in the Appendix, there exists an fγ ∈ C+(S) satisfying (2.14). Moreover, from (A1) and (A2), we know that \({\int \limits }_{S}e^{\gamma r(x,a)}f(y)p(dy|x,a)\) is continuous in a, which means that the supremum in (2.13) can be achieved. Hence, there exists a Markov decision rule d∗ such that
Let \(\pi ^{*}=(d^{*})^{\infty }\), then we have for \(N\in \mathbb {N}\)
Similarly, we have for any Markov policy π
Now, assume (B1). Since fγ ∈ C+(S) and fγ≠ 0, there exist x0 ∈ S, c0 > 0 and an open neighborhood U0 containing x0 such that \(f_{\gamma } |_{U_{0}}>c_{0}>0\). It follows from (B1) that for any x1 ∈ S, there exist a1,...,aM such that
Thus, fγ >> 0. Since S is compact, there are constants \(0<k_{\gamma }<K_{\gamma }<\infty \) such that kγ ≤ fγ ≤ Kγ. From (2.15) and (2.16), we see that for any x ∈ S
and for any Markov policy π
Taking the logarithm and letting \(N\to \infty \), we see that the limit
exists and
for any x ∈ S. □
Remark 2.4
Assume (A1), (A2), (A3), and (B1). From the proof of Theorem 2.2 we can see that ρ(L(γ)) is the unique positive eigenvalue of L(γ) restricted to interior(C+(S)).
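For a finite model, the eigenpair of Theorem 2.2 can be computed by power iteration on the (nonlinear) operator L(γ), since L(γ) is positively 1-homogeneous, order-preserving, and here strictly positive. A sketch on the toy 2-state, 2-action MDP with hypothetical numbers used earlier:

```python
import math

# A toy 2-state, 2-action MDP with hypothetical numbers (not from the paper):
# P[s][a][y] is the transition probability, R[s][a] the one-step reward.
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.3, 0.7], [0.8, 0.2]]]
R = [[1.0, 0.0], [2.0, 0.5]]

def bellman_op(f, gamma):
    """L^(gamma) f(x) = max_a exp(gamma*r(x,a)) * sum_y p(y|x,a) f(y)."""
    n = len(f)
    return [max(math.exp(gamma * R[s][a]) * sum(P[s][a][y] * f[y] for y in range(n))
                for a in range(len(R[s])))
            for s in range(n)]

def spectral_radius(gamma, iters=2000):
    """Power iteration for the positive eigenpair: L f = rho f with f >> 0."""
    f = [1.0] * len(R)
    rho = 1.0
    for _ in range(iters):
        g = bellman_op(f, gamma)
        rho = max(g)              # max-normalization; the scale factor -> rho
        f = [v / rho for v in g]
    return rho, f

rho, f_gamma = spectral_radius(0.5)
log_rho = math.log(rho)  # equals gamma * lambda(gamma) by Theorem 2.2
```

At convergence the eigen-equation L(γ)fγ = ρ fγ holds componentwise, and ρ ≥ e^{γ rm} as in the proof above.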
3 Risk-Sensitive Asymptotics of MDP
In this section, we shall apply the following variational formula for λ(γ) to prove (2.9).
where \(\mathcal {I}\) is defined by
and for \(\beta \in \mathcal {P}(S\times A\times S),\) the notations β0,β1,β2, and \(\beta ^{\prime }\) are defined by
Obviously, \(\mathcal {I}\) is nonempty and closed in \(\mathcal {P}(S\times A\times S)\). Notice that \(\mathcal {P}(S\times A\times S)\) is compact since S × A × S is compact. Hence, \(\mathcal {I}\) is compact, too. For \(\beta \in \mathcal {P}(S\times A\times S)\), β0 is the first 1-dimensional marginal of β, \(\beta ^{\prime }\) is the first 2-dimensional marginal of β, and β1 and β2 are the two successive conditional distributions of β. With these notations, \(\mathcal {I}\) is seen to be the set of probability measures β on S × A × S satisfying that β0 is invariant under \({\int \limits }_{A}\beta _{1}(da|x)\beta _{2}(dy|x,a)\). The validity of (3.1) will be verified in Theorems 3.4 and 3.7. At present, we will apply (3.1) to get the limit of λ(γ) as γ → 0.
Theorem 3.1
Assume (A1) and (A2). If (3.1) holds, then
To prove the theorem, we need the following
Lemma 3.2
If there exists a \(\beta \in \mathcal {I}\) satisfying that β2 = p, then
where \(\mathcal {I}\) is defined in (3.2).
Proof
Since β(S,A,dx) = β(dx,A,S), taking β0 as the initial distribution and using the policy \(\pi _{\beta }=(\beta _{1}(da|x))^{\infty }\), we see that
The third equality is due to the coincidence of the first and the third marginal of β ensured by (3.2). By induction, we have
Thus,
□
Now we are ready to prove Theorem 3.1.
Proof of Theorem 3.1
By Hölder’s inequality, for \(\gamma \geq \gamma ^{\prime }>0\), we have
and
Therefore, λ(γ) is non-decreasing in γ and \(\lim \limits _{\gamma \to 0}\lambda (\gamma )\geq v\). To prove (3.4), it suffices to verify that \(\lim \limits _{\gamma \to 0}\lambda (\gamma )\leq v\). To this end, we notice that it follows from (3.1) that for any ε > 0 and γ > 0, there exists \(\beta _{\gamma }^{\varepsilon }\in \mathcal {I}\) such that
Since \(\mathcal {I}\subseteq \mathcal {P}(S\times A\times S)\) is compact, we can find a sequence \(\{\gamma _{n}\}_{n\in \mathbb {N}}\) decreasing to 0 and a \(\beta ^{\varepsilon }\in \mathcal {I}\) such that
Therefore, from (A1), we know that
which is finite. Now we claim that \(\left (\beta ^{\varepsilon }\right )_{2}=p\). Indeed, we have
It follows from the (joint) lower semicontinuity of D(⋅∥⋅) and (A2) that
Thus if \(\left (\beta ^{\varepsilon }\right )_{2}\neq p\), then
Combining this with (3.5) and the fact that λ(γ) ≥ rm, we would have
This is impossible. Thus, \(\beta ^{\varepsilon }\in \mathcal {I}\) and \(\left (\beta ^{\varepsilon }\right )_{2}=p\). Recalling (3.6) and Lemma 3.2, we obtain that
(3.4) follows by letting ε → 0. □
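The conclusion of Theorem 3.1 can be observed numerically on a finite model: computing λ(γ) = log ρ(L(γ))/γ by power iteration for decreasing γ and comparing it with the optimal average reward v obtained by enumerating deterministic stationary policies. The sketch below reuses the toy 2-state, 2-action MDP with hypothetical numbers:

```python
import math

# Toy 2-state, 2-action MDP (hypothetical numbers).
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.3, 0.7], [0.8, 0.2]]]
R = [[1.0, 0.0], [2.0, 0.5]]

def bellman_op(f, gamma):
    return [max(math.exp(gamma * R[s][a]) * sum(P[s][a][y] * f[y] for y in (0, 1))
                for a in (0, 1)) for s in (0, 1)]

def lam(gamma, iters=5000):
    """lambda(gamma) = log rho(L^(gamma)) / gamma, via power iteration."""
    f, rho = [1.0, 1.0], 1.0
    for _ in range(iters):
        g = bellman_op(f, gamma)
        rho = max(g)
        f = [x / rho for x in g]
    return math.log(rho) / gamma

def avg_reward(d):
    """Long-run average reward of the deterministic stationary policy d=(d0,d1),
    using the explicit stationary distribution of a 2-state chain."""
    p01 = P[0][d[0]][1]
    p10 = P[1][d[1]][0]
    mu0 = p10 / (p01 + p10)
    return mu0 * R[0][d[0]] + (1 - mu0) * R[1][d[1]]

v = max(avg_reward((a0, a1)) for a0 in (0, 1) for a1 in (0, 1))
lams = [lam(g) for g in (1.0, 0.1, 0.01)]
```

The computed values decrease monotonically toward v as γ decreases, matching the Hölder monotonicity argument in the proof.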
The remainder of this section is devoted to verifying (3.1) under certain conditions. This is carried out first under assumptions including (B1), and then with (B1) removed. Our assumption (A2) is slightly weaker than its counterpart in [2]. In [2], it is required that the family of functions
is equicontinuous, while we assume that
is equicontinuous (Theorems 3.4 and 3.8). Moreover, it is worth mentioning that equicontinuity only plays a role in the existence of the positive eigenvalue and the eigenvector. Once L(γ) has a positive eigenvalue and a strictly positive eigenvector, only (A1) and (A2) are needed.
Proposition 3.3
Assume (A1) and (A2). If there exist ργ > 0 and fγ ∈ C(S) such that fγ >> 0 and L(γ)fγ = ργfγ, then (3.1) holds.
Proof
From the proof of Theorem 2.2, we know that \(\log \rho _{\gamma }=\gamma \lambda (x,\gamma )\) for any x ∈ S. Thus, for any \(\mu \in {\mathscr{M}}^{+}(S)\), we have
Therefore,
For any f >> 0, we also have
which means that
Since under (A1) and (A2), properties (P2) and (P3) hold for L(γ), we can apply Theorems A.2 and A.3 in the Appendix to deduce that
Thus,
Using the Gibbs variational formula (Proposition 1.4.2(a), pp. 33–34 in [15]), we see that
Since D(μ∥ν) is jointly convex and lower semicontinuous in (μ,ν) (Lemma 1.4.3, pp. 36–38 in [15]) and \(\mathcal {P}(S\times A), \mathcal {P}(S\times A\times S)\) are both compact, the minimax theorem (Theorem 4.2 in [25]) can be applied to get
Furthermore, by the chain rule for relative entropy (Theorem D.13, pp. 357–359 in [9]), we have that
Since D(μ∥ν) ≥ 0 and D(μ∥ν) = 0 iff μ = ν (Lemma 1.4.1 , pp. 33, in [15]), the supremum over \(\eta \in \mathcal {P}(S\times A)\) is attained at \(\eta =\beta ^{\prime }\). Moreover, notice that when \(\beta \in \mathcal {I}\), for any g ∈ C(S),
and for \(\beta \notin \mathcal {I}\),
we obtain that
□
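The Gibbs variational formula invoked in the proof above is an exact identity in the discrete case: log ∫ e^g dν equals ∫ g dμ* − D(μ*∥ν), where the maximizer μ* is the Gibbs measure with density proportional to e^g against ν. A short numerical check (inputs hypothetical):

```python
import math

def gibbs_check(g, nu):
    """Check log E_nu[e^g] = sup_mu { E_mu[g] - D(mu||nu) }, where the supremum
    is attained by the Gibbs measure mu*_i proportional to nu_i * exp(g_i)."""
    Z = sum(n * math.exp(gi) for gi, n in zip(g, nu))
    lhs = math.log(Z)
    mu = [n * math.exp(gi) / Z for gi, n in zip(g, nu)]
    kl = sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0.0)
    rhs = sum(m * gi for m, gi in zip(mu, g)) - kl
    return lhs, rhs

lhs, rhs = gibbs_check([1.0, -0.5, 2.0], [0.2, 0.5, 0.3])
```

Any other choice of μ, for instance μ = ν itself, gives a value no larger than the left-hand side, as Jensen's inequality predicts.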
Combining Theorem 2.2 and Proposition 3.3, we obtain the following theorem immediately.
Theorem 3.4
Assume (A1), (A2), (A3), and (B1). Then (3.1) holds.
To remove (B1), we use a perturbation argument. For each 𝜖 > 0, define a new MDP M𝜖 with the transition law and one-step reward given by
respectively, where \({\Gamma }\in \mathcal {P}(S)\) with full support. It is not hard to check that M𝜖 satisfies (A1), (A2), (A3), and (B1). Using \( E_{\epsilon ,\gamma ,x}^{\pi }\) to denote the corresponding expectation operator with initial state x and policy π, we define
and
By Theorem 2.2, λ𝜖(x,γ) depends only on γ, and the limit inferior is actually a limit. Hence, we write it as λ𝜖(γ). Without (B1), we will prove the variational formula by exploring properties of λ𝜖(γ) and then letting 𝜖 → 0.
Lemma 3.5
Assume (A1) and (A2). Then λ𝜖(γ) is non-decreasing in 𝜖 and \(\lim \limits _{\epsilon \to 0}\lambda _{\epsilon }(\gamma )\geq \lambda (\gamma )\).
Proof
From property (P1), we have
for any x ∈ S. Thus, for any 𝜖1 > 𝜖2 > 0, by (3.10), we obtain that
□
In order to write the variational formula in a form more convenient for the following arguments, we define for given 𝜖 > 0, γ > 0 and \(\beta \in \mathcal {I}\)
and
To prove \(\lambda (\gamma )\geq \lim \limits _{\epsilon \to 0}\sup \limits _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,\epsilon )\), we will show that
where λSM(x,γ) is defined by
with \(d^{\infty }\) denoting the stationary Markov policy whose decision rule at every time is the same d ∈ DM.
Lemma 3.6
Assume (A1) and (A2). Then \(\sup \limits _{x\in S}\lambda _{SM}(x,\gamma )\geq \sup \limits _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,0)\).
Proof
We need to prove that
for each \(\beta \in \mathcal {I}\). If \(\phi (\beta ,\gamma ,0)=-\infty \), the inequality holds trivially. Otherwise, \(\beta \in \mathcal {I}\) with \(\phi (\beta ,\gamma ,0)>-\infty \) implies that β2(⋅|x,a) << p(⋅|x,a) \(\beta ^{\prime }\)-a.s. Choosing the stationary Markov policy \(\pi _{\beta }=(\beta _{1}(da|x))^{\infty }\) and the initial distribution \(\beta _{0}\in \mathcal {P}(S)\), we see that
Define \(^{\beta }{E}_{\beta _{0}}^{\pi _{\beta }}\) as the expectation operator with respect to the probability measure determined by the initial distribution β0, the transition law β2(dy|x,a) for \(\{X_{t}\}_{t\in \mathbb {N}}\), and the policy πβ. Using the change of measure technique and Jensen’s inequality, we obtain that
Since \(\beta \in \mathcal {I}\), the same argument as in proving Lemma 3.2 shows that
Consequently,
□
Combining Theorem 3.4 and Lemmas 3.5 and 3.6, we obtain the following
Theorem 3.7
Assume (A1), (A2), and (A3). Then (3.1) holds.
Proof
From Lemmas 3.5 and 3.6, we see that
Hence, (3.1) will follow once we prove that \(\sup _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,0)\geq \lim \limits _{\epsilon \to 0}\lambda _{\epsilon }(\gamma )\). Since M𝜖 satisfies (A1), (A2), and (B1), by Theorems 2.2 and 3.4, we have
Therefore, given ξ > 0, for every 𝜖 > 0, there exists \(\beta _{\epsilon }^{\xi }\in \mathcal {I}\) such that
Since \(\mathcal {I}\) is compact, there exists a sequence \(\{\epsilon _{n}\}_{n\in \mathbb {N}}\) decreasing to 0 such that the weak limit \(\lim \limits _{n\to \infty }\beta _{\epsilon _{n}}^{\xi }=:\beta ^{\xi }\) exists and belongs to \(\mathcal {I}\). By (A1) and Dini’s Theorem, \(r^{(\gamma )}_{\epsilon }(\cdot ,\cdot )\) converges to γr(⋅,⋅) uniformly. Thus, we obtain that
Recalling the definition of β2 for \(\beta \in \mathcal {I}\), by the lower semicontinuity of D(⋅∥⋅), we see that
It then follows that
Thus,
Letting ξ → 0, we have \(\sup _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,0)\geq \lim _{\epsilon \to 0}\lambda _{\epsilon }(\gamma )\). Now (3.1) follows. □
Remark 3.1
The proof shows that the inequalities in (3.13) are actually equalities, which indicates that the supremum over Markov policies in a risk-sensitive MDP coincides with the supremum over stationary Markov policies; that is, one may search for an optimal policy among the stationary ones even without ergodicity of the transition probabilities.
Combining Theorems 3.1, 3.4, and 3.7, we obtain the main result immediately.
Theorem 3.8
Assume (A1), (A2), and (A3). Then
In addition, if (B1) holds, then
for any x ∈ S.
Remark 3.2
Recalling the proof of Theorem 3.1, we see that under (A1), (A2), and (A3), there exists \(\mu (dx,da)\in \mathcal {P}(S\times A)\) such that
A sufficient condition for the risk-neutral average optimal reward v(x) to be independent of the initial state x is the uniform ergodicity (2.7) (see Section 5.5 in [20]). We rewrite it as
-
(B2)
There exists δ < 1 such that
$$ \sup_{U\in \mathcal{B}(S)}\sup_{x,x^{\prime}\in S}\sup_{a,a^{\prime}\in A}[P(U|x,a)-P(U|x^{\prime},a^{\prime})]\leq\delta. $$(3.15)
We provide a brief proof of this.
Theorem 3.9
Assume (A1), (A2), and (B2). Then v(x) is independent of the initial state x.
Proof
Define an operator T on C(S) by
It is not hard to check that under (A1) and (A2), T maps C(S) into itself. Let
be the span norm on C(S). For f1,f2 ∈ C(S),x1,x2 ∈ S, and ε > 0, there exist a1,a2 ∈ A such that
Therefore, we obtain that
where E in the second-to-last inequality comes from the Hahn-Jordan decomposition of p(⋅|x1,a1) − p(⋅|x2,a2). Letting ε → 0, we see that T is a contraction mapping on (C(S),∥⋅∥sp). Thus, by the Banach Fixed-Point Theorem, there exists a unique (up to an additive constant) f0 ∈ C(S) such that ∥Tf0 − f0∥sp = 0, which means that Tf0(x) − f0(x) is a constant v0. It follows that for any x ∈ S,
Since \(r(x,a)+{\int \limits } p(dy|x,a)f_{0}(y)\) is in C(A) due to (A2), for each x ∈ S, there exists a d0(x) ∈ A such that
Letting \(\pi _{0}=(d_{0})^{\infty }\), we have
(3.16) and (3.17) imply that v0 = v(x) for any x ∈ S. □
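The fixed-point argument of Theorem 3.9 is, in the finite case, exactly relative value iteration: iterate f ← Tf − (Tf)(s0), so that the subtracted constant converges to v0 under the span contraction guaranteed by (B2). A sketch on the toy 2-state, 2-action MDP with hypothetical numbers (which satisfies (B2), since any two transition rows overlap):

```python
# Toy 2-state, 2-action MDP (hypothetical numbers).
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.3, 0.7], [0.8, 0.2]]]
R = [[1.0, 0.0], [2.0, 0.5]]

def T(f):
    """T f(x) = max_a [ r(x,a) + sum_y p(y|x,a) f(y) ]."""
    return [max(R[s][a] + sum(P[s][a][y] * f[y] for y in (0, 1))
                for a in (0, 1)) for s in (0, 1)]

def relative_value_iteration(iters=500):
    """Iterate f <- Tf - (Tf)(0); under (B2) the span contraction makes the
    subtracted constant converge to the optimal average reward v0."""
    f, v0 = [0.0, 0.0], 0.0
    for _ in range(iters):
        g = T(f)
        v0 = g[0]                  # the constant Tf(x) - f(x) at convergence
        f = [x - v0 for x in g]
    return v0, f

v0, f0 = relative_value_iteration()
```

At convergence Tf0 − f0 is the constant v0, so v0 solves the average-reward Bellman equation, with f0 normalized by f0(s0) = 0.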
Combining the above theorem with our main result (Theorem 3.8), we obtain the following
Corollary 3.10
Assume (A1), (A2), (A3), (B1), and (B2). Then
Furthermore, this limit is indeed independent of x ∈ S.
4 Risk-Sensitive Asymptotics of POMDP
This section applies the approach explored in the last section to the partially observable Markov decision process (POMDP). Francesca Albertini, Paolo Dai Pra, and Chiara Prior established such a limit in [1] for processes described by Xn+ 1 = f(Xn,An,Wn), Yn = h(Xn,Vn), where Xn, An, and Yn denote the state, control, and observation, respectively, and W,V are i.i.d. random variables. As for general POMDPs, Di Masi and Stettner proved the existence of the solution to the associated Bellman equation for cost-minimizing problems and stated that the limit as γ → 0 had not been proven (see Remark 2 in [11]). However, the method in [11] cannot be applied to reward-maximizing problems, since it requires the operator induced from the Bellman equation to preserve concavity, whereas in our case the operator is convexity-preserving. Nevertheless, we can prove that, given the existence of a solution to the Bellman equation, (3.4) holds for the maximal reward of POMDPs. A POMDP can be represented as a six-tuple MP = 〈S,A,O,p(⋅|⋅,⋅),q(⋅|⋅),r(⋅,⋅)〉. S is the space of real but unobserved states, A is the action space, and both are assumed to be compact metric spaces. The observation space O is a Polish space. As in MDPs, p is the transition kernel depending on actions, and \(r: S\times A\to \mathbb {R}\) is the reward function. q(⋅|x) denotes the observation probability when the system is in state x ∈ S. As mentioned in the introduction, a widely used technique for analyzing a POMDP is to transfer it into a completely observable MDP. We will also adopt this technique, which allows us to employ the analysis for MDPs in the last section to establish
where
and
The exact definitions of the set π of policies and the expectation operator \( E_{\theta }^{\pi }\) will be given after introducing the assumptions needed in this section.
In order to use the measure transformation technique, we first assume that
-
(C0)
There exists a \({\Lambda }\in \mathcal {P}(O)\) with full support such that for every x ∈ S, q(⋅|x) << Λ.
The corresponding density function is also denoted by q(⋅|⋅), i.e.,
To apply Theorem 3.1 and Proposition 3.3, we make the following assumptions to guarantee that the reward and the transition probability of the transferred MDP satisfy (A1) and (A2).
-
(C1)
r(⋅,⋅) ∈ C(S × A).
-
(C2)
q(y|⋅) ∈Lip(S). There exist qm > 0 and qM > 0 such that q(y|⋅) ≥ qm and \(\left \|q(y|\cdot )\right \|_{L}\leq q_{M}\) for every y ∈ O.
-
(C3)
There exists Kp > 0 such that
$$\|p(\cdot|x,a)-p(\cdot|x^{\prime},a^{\prime})\|_{KR}\leq K_{p}\left[\rho_{S}(x,x^{\prime})+\rho_{A}(a,a^{\prime})\right]$$for any \(x,x^{\prime }\in A\) and \(a,a^{\prime }\in S\), where ∥⋅∥KR denotes the Kantorovich-Rubinstein norm defined by (1.5) on \(\mathcal {P}(S)\), ρS denotes the metric on S, and ρA denotes the metric on A.
To define a probability space and a stochastic process with the desired mechanism, let \({\Omega }_{P}=S\times (A\times S\times O)^{\infty }\) and \({\mathscr{B}}({\Omega }_{P})\) be the product Borel σ-field. Given a sample path ω = (x1,a1,x2,y2,a2,x3,y3,...) ∈ΩP, define Xt := xt,At := at,t ≥ 1, and Yt := yt,t ≥ 2. At each time \(t\in \mathbb {N}\), the system MP occupies a state Xt, which is unobservable. When t = 1, we know the distribution of X1 and then choose an action A1. When t ≥ 2, we can observe a signal Yt generated by Xt and then choose an action At. The optimal policy in a POMDP is usually not a Markovian one due to the unavailability of real states when making decisions. Hence, we introduce the observed-history-dependent policy. Let \(\mathbb {H}_{t}\) denote the set of observed histories up to time \(t\in \mathbb {N}\). Then, \(\mathbb {H}_{1}=\mathcal {P}(S)\) (the set of all the initial state distributions) and \(\mathbb {H}_{t+1}=\mathbb {H}_{t}\times A\times O\). An observed-history-dependent decision rule at time t is a stochastic kernel \(d_{t}\in \mathcal {P}(A|\mathbb {H}_{t})\), where dt(B|ht) denotes the probability of taking an action in \(B\subseteq A\) when observing \(h_{t}=(\theta _{1},a_{1},y_{2},a_{2},y_{3},...,a_{t-1},y_{t})\in \mathbb {H}_{t}\). An observed-history-dependent policy π is a sequence of such decision rules at different times. Let Dt denote all the observed-history-dependent decision rules at time t, and \({\Pi }=\prod\limits_{t=1}^{\infty }D_{t}\) denote all the observed-history-dependent policies. Given an initial distribution of states \(\theta _{1}\in \mathcal {P}(S)\) and a policy π = (d1,d2,...) ∈ Π, a unique probability measure \(\text {P}_{\theta _{1}}^{\pi }\) and the corresponding expectation operator on \({\mathscr{B}}({\Omega }_{P})\) are defined by the Ionescu-Tulcea theorem, such that for each t ≥ 1,
The risk-sensitive criterion introduced in Section 2 is to optimize
while the typical optimal average reward is
Let \(\lambda _{P}(\gamma ):=\sup _{\theta \in \mathcal {P}(S)}\lambda _{P}(\theta , \gamma )\) and \(v_{P}:=\sup _{\theta \in \mathcal {P}(S)}v_{P}(\theta )\). We intend to apply Theorem 3.1 and Proposition 3.3 to prove the risk-sensitive asymptotics
It has already been shown that optimal control of a POMDP MP under the average reward criterion can be converted to the optimal control of a properly transferred MDP M0 (see, e.g., Section 7.2.1 in [3], and Section 5.3, pp. 157–159 in [5]), where the new states are the conditional distributions of real states given the observed history. The transition law p0 and one-step reward r0 of M0 are
where \({\Delta }^{(0)}_{a,y,\theta }\) is a measure on \(\mathcal {P}(S)\) defined by
and \(T^{*}_{a,y,0}\) is an operator on \({\mathscr{M}}^{+}(S)\) given by
In the case of risk-sensitive control, the transformed MDP is slightly different from the typical form of average reward control (see [4]). We present the transformation procedure with our notations. Assuming (C0), (C1), (C2), and (C3), we derive the new state and the corresponding transition mechanism of the transferred MDP first. For t ≥ 1, define two filters \(\mathcal {F}_{t}\) and \(\mathcal {G}_{t}\) by
respectively. Let Ht = (𝜃1,A1,Y2,...,At− 1,Yt) denote the observed history up to time t. Since \(\theta _{1}\in \mathcal {P}(S)\) is fixed, \(\mathcal {F}_{t}=\sigma (H_{t})\). Define another probability measure \(\widetilde {P}_{\theta _{1}}^{\pi }\) on \({\mathscr{B}}({\Omega }_{P})\) by
or equivalently,
Since S,A,O are all Polish spaces, ΩP is also Polish. Thus, the following regular conditional expectations on \(({\Omega }_{P},{\mathscr{B}}({\Omega }_{P}),\widetilde {P}_{\theta _{1}}^{\pi })\) for bounded Borel functions f on S
have regular versions. Therefore, we can define an \({\mathscr{M}}^{+}(S)\)-valued process \(\{\psi ^{(\gamma )}_{t}\}\) by
where f is bounded and measurable on S. For a ∈ A,y ∈ O, define Ta,y,γ as an operator on the space of bounded Borel functions on S by
Noticing that under \(\widetilde {P}_{\theta _{1}}^{\pi }\), Yt+ 1 is independent of Xs+ 1,As+ 1 and Ys with s ≤ t, and Xt+ 1 depends on σ(Gt ∪ σ(At,Yt+ 1)) only through Xt and At, we have
Therefore, we have
where \(T^{*}_{a,y,\gamma }\) is the adjoint operator of Ta,y,γ defined on \({\mathscr{M}}^{+}(S)\) by
From (C1), we know that \(\psi ^{(\gamma )}_{t}(S)\) is finite and strictly positive. Hence, we can define a new state process \(\{\theta ^{(\gamma )}_{t}\}\) taking values in \(\mathcal {P}(S)\) by
We call \(\{\theta ^{(\gamma )}_{t}\}\) the information state process since it represents the cumulative-reward-weighted conditional distribution of the real state given the observed history. The information state \(\theta ^{(\gamma )}_{1}\) at time t = 1 is still 𝜃1. Notice that the operator \(T^{*}_{a,y,\gamma }\) is positively 1-homogeneous. We have
which implies the transition mechanism of \(\theta ^{(\gamma )}_{t}\). As for the new reward function, we define
Then we have
It then follows that
Hence, we can consider Gγ as the new reward. Now, we can transfer MP to the following completely observable model \(M^{\prime }_{\gamma }\) with state space \(\mathcal {P}(S)\) and action space A:
-
1.
The initial information state is 𝜃1.
-
2.
At time t, given the current information state \(\theta ^{(\gamma )}_{t}\), we take action At according to a pre-specified policy. Then, the system generates Yt+ 1, which is independent of \(\theta ^{(\gamma )}_{s}, A_{s}\), and Ys with s ≤ t, and distributed according to the law Λ. The next information state \(\theta ^{(\gamma )}_{t+1}\) is determined by
$$ \theta^{(\gamma)}_{t+1}=\frac{T^{*}_{A_{t},Y_{t+1},\gamma}(\theta^{(\gamma)}_{t})}{T^{*}_{A_{t},Y_{t+1},\gamma}(\theta^{(\gamma)}_{t})(S)}. $$ -
3.
Once Yt+ 1 is generated, the next state \(\theta ^{(\gamma )}_{t+1}\) is then obtained according to (4.9), and simultaneously the system generates one-step reward \(G_{\gamma }(A_{t},Y_{t+1},\theta ^{(\gamma )}_{t})\).
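For a finite POMDP, steps 2-3 above can be sketched numerically. The exact form of \(T^{*}_{a,y,\gamma }\) comes from the displayed equations; the update below is one plausible finite-state rendering (a reward-tilted prediction step followed by observation-likelihood reweighting), so treat the indexing and weighting as assumptions rather than the paper's verbatim formulas:

```python
import numpy as np

def info_state_update(theta, a, y, gamma, p, q, r):
    """One step of the information-state recursion (4.9), finite-state sketch.
    theta : current information state, shape (|S|,)
    p     : transition tensor, p[i, a, j] = p(x_j | x_i, a)
    q     : observation densities w.r.t. Lambda, q[y, j] = q(y | x_j)
    r     : reward matrix, r[i, a]
    Unnormalized update: psi[j] = q[y, j] * sum_i theta[i] e^{gamma r[i, a]} p[i, a, j];
    the next information state is psi / psi(S).  With gamma = 0 this reduces to
    the ordinary Bayes filter of the average-reward transfer."""
    psi = q[y] * ((theta * np.exp(gamma * r[:, a])) @ p[:, a, :])
    return psi / psi.sum()

# toy two-state, one-action model (all numbers hypothetical)
p = np.zeros((2, 1, 2))
p[0, 0] = [0.7, 0.3]
p[1, 0] = [0.4, 0.6]
q = np.array([[0.9, 0.2], [0.1, 0.8]])   # rows: observations y1, y2
r = np.array([[0.0], [1.0]])
theta_next = info_state_update(np.array([0.5, 0.5]), 0, 0, 0.0, p, q, r)
```

Observing y1 (which is more likely under x1) shifts the information state toward the first state, as expected of a filter.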
Remark 4.1
The one-step reward Gγ in (4.10) depends not only on the state 𝜃 and the action A but also on an independent signal Y under \(\widetilde {\text {P}}_{\theta _{1}}^{\pi }\), which is slightly different from the typical form. We make the following changes to have a reward and a transition probability in a standard form in which assumptions (A1) and (A2) can be verified.
Define the completely observable Markov decision model Mγ with transition law pγ and one-step reward rγ by
where \({\Delta }^{(\gamma )}_{a,y,\theta }\) is a measure on \(\mathcal {P}(S)\) defined by
Use \( E_{\gamma ,\theta }^{\pi }\) to denote the expectation operator with respect to the transition probability pγ with initial state 𝜃 and policy π. Since Mγ is an MDP, we consider the Markov policies of Mγ, which consist of decision rules choosing actions only through the current information state 𝜃. Such policies are also called separated policies of the original model MP (see, e.g., [19]). We let DS denote the set of all the Markov decision rules of Mγ and \({\Pi }_{S}=(D_{S})^{\infty }\). Then for \(\pi _{S}\in {\Pi }_{S}\), by direct calculation, we have
for any bounded Borel function f on \(\mathcal {P}(S)\). ΠS is a subset of Π since 𝜃t is \(\mathcal {F}_{t}\)-adapted. Hence, from (4.12) and (4.15), we have for \(\pi _{S}\in {\Pi }_{S}\),
We define λS as the optimal value of separated policies, which is
We will show that under (C0), (C1), (C2), and (C3), λS = λP and (4.3) hold if there exists K > 0 such that for every γ ∈ (0,K), the Bellman equation
has a solution ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0. We first verify that rγ satisfies (A1) and pγ satisfies (A2), which implies that the corresponding operator \(L^{(\gamma )}_{P}\) on \(C(\mathcal {P}(S))\), defined by
maps \(C(\mathcal {P}(S))\) into itself.
Lemma 4.1
Assume (C1). Then rγ(𝜃,a) is continuous in (𝜃,a).
Proof
For 𝜃n → 𝜃 weakly and an → a, we have
The second term tends to 0 due to the weak convergence of 𝜃n while the first term tends to 0 because of the uniform continuity of r(⋅,⋅) on the compact set S × A. Hence, \(e^{r_{\gamma }(\theta ,a)}\) is continuous in (𝜃,a). Since \(e^{r_{\gamma }}\geq e^{r_{m}}>0\), we see that rγ(𝜃,a) is continuous in (𝜃,a). □
Lemma 4.2
Assume (C0), (C1), (C2), and (C3). Then \((\theta ,a)\mapsto {\int \limits }_{\mathcal {P}(S)}p_{\gamma }(d\theta ^{\prime }|\theta ,a)f(\theta ^{\prime })\) is continuous in (𝜃,a) for \(f\in C(\mathcal {P}(S))\).
Proof
Recall that rM and rm are the supremum and infimum of r, respectively. Fix \(f\in C(\mathcal {P}(S))\). From (4.13) and direct calculation, we see that
where \(e^{-r_{\gamma }(\theta ,a)}\) is continuous in (𝜃,a) by Lemma 4.1. It suffices to show that
is continuous in (𝜃,a). Once \(\left \{(\theta ,a)\mapsto T^{*}_{a,y,\gamma }(\theta ),y\in O\right \}\) is equicontinuous, then from the uniform continuity of \(f\in C(\mathcal {P}(S))\) (\(\mathcal {P}(S)\) is compact since S is compact) and the fact that
uniformly, we can see that
is equicontinuous, which proves this lemma. Now we prove that \(\left \{(\theta ,a)\mapsto T^{*}_{a,y,\gamma }(\theta ),y\in O\right \}\) is equicontinuous. Fix \(\theta \in \mathcal {P}(S)\) and a ∈ A. For 𝜃n → 𝜃 weakly and an → a, we have
The first term \(\left \|T^{*}_{a_{n},y,\gamma }(\theta _{n})-T^{*}_{a,y,\gamma }(\theta _{n})\right \|_{KR}\) is
Notice that ∥g(⋅)∥L ≤ 1 and ∥q(y|⋅)∥L ≤ qM imply ∥g(⋅)q(y|⋅)∥L ≤ 2qM. Thus, by (C1), (C2), and (C3), we have that
and
holds for every y ∈ O. Hence, from the uniform continuity of r(⋅,⋅), we know that
The second term \(\left \|T^{*}_{a,y,\gamma }(\theta _{n})-T^{*}_{a,y,\gamma }(\theta )\right \|_{KR}\) is
Define two finite measures \({\mu ^{a}_{n}}(dx):=\theta _{n}(dx)e^{\gamma r(x,a)}\) and μa(dx) := 𝜃(dx)eγr(x,a). Then from the continuity of r(⋅,a), we know that \({\mu ^{a}_{n}}\to \mu ^{a}\) weakly, which means that \(\lim \limits _{n\to \infty }\|{\mu ^{a}_{n}}-\mu ^{a}\|_{KR}=0\). Thus, it only remains to verify that, whenever ∥g∥L ≤ 1, the Lipschitz constant \(\left \|{\int \limits }_{S}q(y|x^{\prime })p(dx^{\prime }|x, a)g(x^{\prime })\right \|_{L}\), taken as a function of x, is bounded by a constant independent of y. First, it is obvious that \(\left |{\int \limits }_{S}q(y|x^{\prime })p(dx^{\prime }|x,a)g(x^{\prime })\right |\leq q_{M}\). As for the Lipschitz constant, we have for x1,x2 ∈ S,
Consequently,
(4.20) and (4.21) show that \(\left \{(\theta ,a)\mapsto T^{*}_{a,y,\gamma }(\theta ),y\in O\right \}\) is equicontinuous and the lemma is proved. □
Since rγ satisfies (A1) and pγ satisfies (A2), we can apply Proposition 3.3 to obtain the variational formula for λS.
Theorem 4.3
Assume (C0), (C1), (C2), and (C3). If there exist ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0 satisfying \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\), then
holds, where
and the notations β2 and \(\beta ^{\prime }\) are defined by (3.3).
Proof
Lemmas 4.1 and 4.2 imply that Mγ satisfies (A1) and (A2). Since \(\mathcal {P}(S)\) is compact, by Proposition 3.3, we have
Hence, (4.22) holds. □
By Hölder’s inequality, for \(\gamma \geq \gamma ^{\prime }>0\), we have \(\lambda _{P}(\gamma )\geq \lambda _{P}(\gamma ^{\prime })\geq v_{P}\). Therefore, λP(γ) is non-decreasing in γ and \(\lim \limits _{\gamma \to 0}\lambda _{P}(\gamma )\geq v_{P}\). To get the desired assertion, we need that λS = λP.
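The Hölder-inequality monotonicity used here is easy to see numerically: γ ↦ (1/γ) log E e^{γX} is nondecreasing and dominates E X. A quick Monte Carlo check for a standard normal reward, where the exact value is γ/2:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)   # rewards X ~ N(0, 1), so E X = 0

def risk_sensitive_value(gamma, xs):
    # (1/gamma) * log E[exp(gamma * X)], estimated from the samples
    return float(np.log(np.mean(np.exp(gamma * xs))) / gamma)

vals = [risk_sensitive_value(g, samples) for g in (0.1, 0.5, 1.0, 2.0)]
# exact values are gamma / 2 = 0.05, 0.25, 0.5, 1.0, nondecreasing in gamma
```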
Theorem 4.4
Assume (C0), (C1), (C2), and (C3). If there exist ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0 satisfying that \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\), then λS(γ) = λP(γ).
Proof
We first show that it is true in the finite-horizon case. Fix N > 0 and an observed-history-dependent policy π = (d1,...,dN,...). By (4.11), we have that
Hence
For every \(\psi \in {\mathscr{M}}^{+}(S)\) and ε > 0, there exists \(d_{N}^{*}(\psi )\in A\) such that
\(d_{N}^{*}\) actually depends only on \(\theta =\frac {\psi }{\psi (S)}\). Given \(\psi ^{\prime }\neq \psi \) with \(\frac {\psi ^{\prime }}{\psi ^{\prime }(S)}=\theta =\frac {\psi }{\psi (S)}\), we can assume that
since \(e^{(N-1)\gamma r_{m}}\leq \psi ^{(\gamma )}_{N}(S)\leq e^{(N-1)\gamma r_{M}}\). Thus, for any a ∈ A, we have
Hence,
Thus, we see that for every \(\psi \in {\mathscr{M}}^{+}(S)\), there exists \(d_{N}^{*}(\theta )\in A\) depending only on \(\theta =\frac {\psi }{\psi (S)}\) such that
Now, modifying the policy π by simply replacing dN with \(d^{*}_{N}\) to get a new policy \(\pi ^{*}_{N}=(d_{1},...,d_{N-1},d^{*}_{N},...)\), we obtain that
Continue this procedure by successively replacing dj with \(d^{*}_{j}\) for j = N − 1,⋯ ,1, with each \(d^{*}_{j}\) depending only on the information state and satisfying that
where \(\pi ^{*}_{n}=(d_{1},...,d_{n-1},d^{*}_{n},...,d^{*}_{N},...)\). In this way, with \(\pi _{1}^{*}=(d^{*}_{1},d^{*}_{2},...d^{*}_{N},...)\), we obtain that
Noticing that the decision rules after time N are irrelevant and recalling (4.12), we have proved that
for any ε > 0. Obviously,
Consequently,
Letting \(N\to \infty \), we obtain that
Thus, it suffices to verify that
Recall that by assumption, we have \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\) with ργ > 0 and fγ >> 0. From Lemmas 4.1 and 4.2, we know that
is continuous in a. Due to the compactness of A, for every \(\theta \in \mathcal {P}(S)\), there exists d∗(𝜃) ∈ A such that
Let \(\pi ^{*}=(d^{*})^{\infty }\). Similarly to the argument used to derive (2.17) in the proof of Theorem 2.2, we see that
Hence, the inequalities in (4.24) are equalities, which gives that λS(γ) = λP(γ). □
With Theorems 4.3 and 4.4, using a similar argument as in the proof of Theorem 3.1, we can now extend the risk-sensitive asymptotics to POMDPs, which is the main result of this section.
Theorem 4.5
Assume (C0), (C1), (C2), and (C3). If there exists K > 0 such that for every γ ∈ (0,K), there exist ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0 satisfying \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\), then
Before proving Theorem 4.5, we present a lemma to show that pγ converges to p0 weakly and uniformly, where p0 is defined in (4.4).
Lemma 4.6
Assume (C0), (C1), (C2), and (C3). Then pγ(⋅|𝜃,a) weakly converges to p0(⋅|𝜃,a), uniformly in \(\theta \in \mathcal {P}(S)\) and a ∈ A, i.e., if \(f\in C(\mathcal {P}(S)\times A\times \mathcal {P}(S))\), then
Proof
Fix an \(f\in C(\mathcal {P}(S)\times A\times \mathcal {P}(S))\). Then
Since
it follows that \(e^{r_{\gamma }(\theta ,a)}\) converges uniformly to 1 as γ → 0. Then, similarly as in the proof of Lemma 4.2, it suffices to verify that \(T^{*}_{a,y,\gamma }(\theta )\) converges weakly to \(T^{*}_{a,y,0}(\theta )\), uniformly in a ∈ A,y ∈ O, and \(\theta \in \mathcal {P}(S)\). In fact, recalling the definition of the Kantorovich-Rubinstein norm, we see that
Consequently,
and thus (4.28) follows. □
Proof of Theorem 4.5
We have already seen that \(\lim \limits _{\gamma \to 0}\lambda _{P}(\gamma )\geq v_{P}\). Hence, by Theorem 4.4, it suffices to verify that
From Theorem 4.3, we know that for any ε > 0 and γ > 0, there exists \(\beta _{\gamma }^{\varepsilon }\in \mathcal {I}\) such that
Recall that \(\mathcal {I}\subseteq \mathcal {P}(\mathcal {P}(S)\times A\times \mathcal {P}(S))\) is compact. We can find a sequence \(\{\gamma _{n}\}_{n\in \mathbb {N}}\) monotonically tending to 0 and a \(\beta ^{\varepsilon }\in \mathcal {I}\) such that
weakly. Since the relative entropy is non-negative, we obtain that
Monotonicity follows from Hölder’s inequality. Thus, we have as γn → 0 that
where r0 is defined in (4.4). Hence, by Dini’s theorem, \(\gamma _{n}^{-1}r_{\gamma _{n}}\) converges to r0 uniformly. From the weak convergence of \(\beta _{\gamma _{n}}^{\varepsilon }\), we then obtain that
Now, we claim that \(\left (\beta ^{\varepsilon }\right )_{2}=p_{0}\), where p0 is defined in (4.4). Notice that
From the lower semicontinuity of D(⋅∥⋅) and Lemma 4.6, we deduce that
Thus, if \(\left (\beta ^{\varepsilon }\right )_{2}\neq p_{0}\), then
and we would have
This is impossible, so \(\beta ^{\varepsilon }\in \mathcal {I}\) and \(\left (\beta ^{\varepsilon }\right )_{2}=p_{0}\). Now we can employ the same argument as used in the proof of Lemma 3.2 to derive that
Then (4.27) follows by letting ε → 0. □
Remark 4.2
From the proof of Theorem 4.5, we can see that the existence of a solution to the risk-sensitive Bellman equation guarantees the existence of the invariant probability measure for p0.
We end this section with a simple example.
Example 4.7
Consider a finite POMDP with S = {x1,x2},A = {a1,a2},O = {y1,y2}. The transition probability p, observation probability q, and reward r are described by
The state space of the transferred MDP is \(\mathcal {P}(S)\), which is isomorphic to [0,1]. We use (1 − t,t) to denote the probability distribution in \(\mathcal {P}(S)\), where 0 ≤ t ≤ 1 represents the probability assigned on x2. On the one hand, to apply Theorem 4.5, by straightforward calculations, we have for f ∈ C([0,1]),
Let
We can verify that \(L^{(\gamma )}_{P}f_{\gamma }=\rho _{\gamma }f_{\gamma }\). Then all the assumptions in Theorem 4.5 are fulfilled. Thus,
On the other hand, given an initial distribution t ∈ [0,1], we can see that the optimal average reward is vP(t) = t. Hence, \(v_{P}=\sup _{t\in [0,1]}v_{P}(t)=1\), which coincides with Theorem 4.5. Furthermore, this example illustrates that there are circumstances in which the optimal risk-sensitive reward is independent of the initial distribution while the optimal average reward is not.
5 A Portfolio Optimization Example
In this section, as an example of applications of the approach developed in the previous sections, we consider a portfolio optimization problem. Consider a market with m securities and k price-affecting factors. Let V (n) denote the portfolio’s value at time n. We assume that the portfolio dynamics are determined by
where X(n) = (X1(n),...,Xk(n)) denotes the factor process, which is a Markov chain with transition kernel \(P(dx^{\prime }|x)\), H(n) = (H1(n),...,Hm(n)) represents the portfolio strategy, i.e., the proportions of capital invested in the m securities at time n, and {W(n),n ≥ 1} is the i.i.d. random noise, which is independent of the factor process and has a common law η. F is a Borel measurable function. X(n) and H(n) take values in some compact subsets \(S\subset \mathbb {R}^{k}\) and \(A\subset \mathbb {R}^{m}\), respectively, while the noise W(n) takes values in a Polish space Z. This model was extensively studied in [26] for the dual relationship between maximizing the probability of outperforming a given benchmark and optimizing the long-term risk-sensitive reward. In this section, we will demonstrate that our approach guarantees the convergence of the optimal risk-sensitive reward to the optimal risk-neutral reward as the risk-sensitive factor tends to 0. As a consequence, we show that the optimal risk-neutral reward can be taken as a benchmark appearing in the duality mentioned above, complementing the studies of [26]. Given an initial state X(1) = x, we use Px to denote the corresponding probability measure on \({\mathscr{B}}((S\times Z)^{\infty })\) and Ex the expectation under Px. Let \(\mathcal {A}\) denote the set of all Markov portfolio strategies. Given a risk factor γ > 0, the risk-sensitive optimal value is
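Since the displayed dynamics are omitted here, take as a working assumption the standard form of this literature, in which F(X(n),H(n),W(n)) is the one-step log-growth of V. Then λ(γ,x) can be estimated by Monte Carlo for a fixed strategy; all numerical values below (two factor states, Gaussian noise, per-state drifts) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

P = np.array([[0.9, 0.1], [0.2, 0.8]])   # hypothetical factor-chain transitions
mu = np.array([0.05, -0.02])             # per-state mean log-growth under a fixed strategy
sigma = 0.1                              # noise scale

def growth_rate(gamma, n_steps=200, n_paths=20_000):
    """Monte Carlo estimate of (1 / (gamma * N)) log E_x exp(gamma * sum_n F_n),
    with F(x, h, w) = mu[x] + sigma * w standing in for the omitted reward."""
    state = np.zeros(n_paths, dtype=int)              # all paths start in factor state 0
    total = np.zeros(n_paths)
    for _ in range(n_steps):
        total += mu[state] + sigma * rng.normal(size=n_paths)
        state = (rng.random(n_paths) < P[state, 1]).astype(int)  # jump to state 1 w.p. P[x, 1]
    return float(np.log(np.mean(np.exp(gamma * total))) / (gamma * n_steps))
```

As γ decreases, the estimate approaches the risk-neutral average growth rate (here roughly 0.027, the stationary average of mu), in line with Theorem 5.1.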
In what follows, we state the two assumptions on F and P for fitting (A1), (A2), and (A3).
-
(H1)
For each w ∈ Z, F(⋅,⋅,w) ∈ C(S × A), and there is an η-integrable random variable g(w) such that
$$ e^{|F(x,h,w)|}\leq g(w)\ \ \forall x\in S,\ h\in A \text{ and } w\in Z; $$(5.3) -
(H2)
The family of functions \(\left \{x\mapsto {\int \limits } f(x^{\prime })P(dx^{\prime }|x), f\in C(S),\left \|f\right \|\leq 1\right \}\) is equicontinuous.
Remark 5.1
-
(1)
If (H1) holds, then |F(x,h,w)|≤ g(w) and hence \(\hat F(x,h):= {\int \limits } F(x,h,\cdot )d\eta \) is bounded and continuous in (x,h). Let Fm and FM be the infimum and supremum of \(\hat F\). Then from Jensen’s inequality, it follows that for γ > 0
$$ \log \int e^{\gamma F(x,h,\cdot)}d\eta\geq \gamma F_{m}. $$(5.4) -
(2)
A particular case in which (H2) is true is that \(P(dx^{\prime }|x)=Q(x^{\prime }|x){\Lambda }(dx^{\prime })\) with \(\{Q(x^{\prime }|\cdot ),x^{\prime }\in S\}\) equicontinuous and \({\Lambda }\in \mathcal {P}(S)\).
The one-step reward F in (5.1) depends not only on the state x and the action h but also on W, which is slightly different from the typical form. We make the following changes to get a reward in such a standard form. Define a new Markov decision model with the transition law p(γ) and the one-step reward r(γ) defined respectively by
By a direct calculation, we see that \(p^{(\gamma )}(dx^{\prime }|x,h)\) is actually \(P(dx^{\prime }|x)\), and for any N ≥ 1
Notice that the transition kernel of this MDP is still P, but the reward is r(γ) instead of γ ⋅ r. Assumption (H1) implies that r(γ)(x,h) is continuous in (x,h). Thus, with an extra discussion about the convergence of \(\frac {1}{\gamma }r^{(\gamma )}\) as γ → 0, we can obtain the limit with the same argument as the one in the proof of Theorem 3.1. In particular, it is not hard to check that (H1) and (H2) imply that r(γ) and P satisfy (A1), (A2), and (A3). Therefore, setting the risk-sensitive coefficient in Theorem 3.7 to be one and then dividing both sides by γ, we have the following variational formula for \(\lambda (\gamma )=\sup _{x\in S}\lambda (\gamma ,x)\).
where \(\mathcal {I}\) is defined in (3.2). Although Theorem 3.8 cannot be directly applied due to the difference between (5.7) and (3.1), the risk-neutral limit \(\lim _{\gamma \to 0}\lambda (\gamma )\) can still be derived by an argument similar to the one used in proving Theorem 3.1. To see this, we still use v to denote the average optimal return, i.e.,
Theorem 5.1
Assume (H1) and (H2). Then
Proof
By Hölder’s inequality, we see that λ(γ) is nondecreasing in γ and \(\liminf \limits _{\gamma \to 0}\lambda (\gamma )\geq v\). We will apply (5.7) to prove that \(\limsup \limits _{\gamma \to 0}\lambda (\gamma )\leq v\). Similarly to the argument in the proof of Theorem 3.1, for any ε > 0, we can find a sequence \(\{\gamma _{n}\}_{n\in \mathbb {N}}\) decreasing to 0 with \(\lim \limits _{\gamma \to 0}\lambda (\gamma )=\lim \limits _{n\to \infty }\lambda (\gamma _{n})\) and \(\beta _{\gamma _{n}}^{\varepsilon },\beta ^{\varepsilon }\in \mathcal {I}\) with \(\lim \limits _{n\to \infty }\beta _{\gamma _{n}}^{\varepsilon }=\beta ^{\varepsilon }\) weakly such that
Therefore,
Monotonicity follows from Hölder’s inequality, and thus we have as γn → 0 that
Therefore, it follows from Dini’s theorem that \(\frac {1}{\gamma _{n}}r^{(\gamma _{n})}\) converges to r(0) uniformly. Combining this fact with the weak convergence of \(\beta _{\gamma _{n}}^{\varepsilon }\), we obtain that
Now we claim that \(\left (\beta ^{\varepsilon }\right )_{2}=P\). Indeed, from the joint semicontinuity of the relative entropy, we see that
Thus, if \(\left (\beta ^{\varepsilon }\right )_{2}\neq P\), then
From assumption (H1), (5.4), (5.5), and (5.6), together with (5.2) and (5.10), we would have
This is impossible, so \(\beta ^{\varepsilon }\in \mathcal {I}\) and \(\left (\beta ^{\varepsilon }\right )_{2}=P\). Then it is routine to follow the same argument as that of Theorem 3.2 to check that \({\int \limits }_{S\times A}r^{(0)}(x,h)(\beta ^{\varepsilon })'(dx,dh)\leq v\). Consequently, (5.9) follows by letting ε → 0. □
As claimed in the introduction, it was shown in [18, 24], and [26] that the risk-sensitive portfolio optimization is a dual problem to the maximization of the outperformance probability (upside chance) when assuming differentiability for the optimal value. To describe this more precisely, for \(b\in \mathbb {R},\gamma >0,x\in S\), define Ix(b) by
Then by Chebyshev’s inequality,
Thus,
for a pre-specified K > 0. Let Λ(γ) := γλ(γ) for convenience. It has already been established in [26] that if Λ(γ) is differentiable on [0,K) and the limit
exists and does not depend on the initial state x, then the duality
holds whenever \(b\in \{{{\Lambda }^{\prime }}^{+}(\gamma ),\gamma \in [0,K)\}\) or \(b\leq {{\Lambda }^{\prime }}^{+}(0)\), where \({{\Lambda }^{\prime }}^{+}(\gamma )\) denotes the right-hand derivative of Λ(γ) (see Theorem 2.7 in [26]). In the meantime, our result shows that under (H1) and (H2),
which reveals the connection between the outperformance probability, the risk-neutral average return, and the risk-sensitive average growth rate. In order to guarantee the differentiability, we add the following assumptions for P and r(γ) in accordance with Theorem 3.1 in [26], which also implies that the transition law satisfies (B2).
-
(H3)
There exists δp < 1 such that
$$ \sup_{U\in \mathcal{B}(S)}\sup_{x,x^{\prime}\in S}[P(U|x)-P(U|x^{\prime})]\leq \delta_{p}. $$(5.15) -
(H4)
There exists a Kγ > 0 such that the mapping \(\gamma \mapsto \sup \limits _{h\in A}r^{(\gamma )}(x,h)\) is differentiable on [0,Kγ) for any x ∈ S.
Remark 5.2
Let Fm and FM be defined in Remark 5.1(1). Then (H1), (H2), (H3), and the condition \(\gamma \leq -\frac {\log \delta _{p}}{F_{M}-F_{m}}\) guarantee that the limit inferior in (5.2) is actually a limit and λ(γ,x) does not depend on x (see Theorem 1 in [13]). So the constant K in (5.12) can be determined.
Theorem 5.2
Assume (H1), (H2), (H3), and (H4). Let \(K=\min \limits \{-\frac {\log \delta _{p}}{F_{M}-F_{m}}, K_{\gamma }\}\). Then v(x) = v is a constant, and the duality (5.14) holds for every b ≤ v.
Proof
Combining Theorem 3.1 in [26] and the above remark, we see that Λ(γ) is differentiable on [0,K). Thus, by Theorem 2.7 in [26],
holds for every \(b\leq {{\Lambda }^{\prime }}^{+}(0)\). Theorem 3.7 implies that
where v(x) is indeed a constant due to (H3) and Lemma 3.9. This completes the proof. □
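The duality (5.14) can be checked in closed form in a toy i.i.d. case. With hypothetical normal log-returns N(mu_r, sigma_r²), the limit Λ(γ) = γ·mu_r + γ²σ²/2 is exact, and the Legendre-type supremum over γ recovers the Gaussian rate function (b − mu_r)²/(2σ²) for targets b above the mean:

```python
import numpy as np

mu_r, sigma_r = 0.02, 0.1        # hypothetical i.i.d. log-return parameters
b = 0.05                          # target growth rate above the mean

def Lambda(gamma):
    # limiting log-moment generating function for N(mu_r, sigma_r^2) increments
    return gamma * mu_r + 0.5 * (gamma * sigma_r) ** 2

gammas = np.linspace(1e-4, 20.0, 200_001)
rate_numeric = float(np.max(gammas * b - Lambda(gammas)))
rate_exact = (b - mu_r) ** 2 / (2.0 * sigma_r ** 2)   # = 0.045, attained at gamma = 3
```

The numeric supremum over the γ-grid matches the closed-form rate, illustrating how maximizing γb − Λ(γ) over the risk-sensitive factor yields the decay rate of the outperformance probability.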
We end this section with an illustrative example.
Example 5.3
Let S = {− 1,1}, A = {(h1,h2),hi ≥ 0, h1 + h2 = 1}, and {Wn,n ≥ 1} be i.i.d. with the standard normal distribution N(0,1). F is given by
where α ∈ (0,1/2) is a constant. Then \(e^{|F(x,h,w)|}\leq g(w)=e^{\alpha w^{2}}\) with g being η-integrable. Thus (H1), (H2), and (H4) are fulfilled, and (5.9) holds. If the transition probabilities are chosen to satisfy that
then (H3) is also satisfied; therefore, the assertions of Theorem 5.2 hold true.
Data Availability
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
References
Albertini F, Dai Pra P, Prior C. Small parameter limit for ergodic, discrete-time, partially observed, risk-sensitive control problems. Math Control Signals Syst 2001;14(1):1–28.
Anantharam V, Borkar V S. A variational formula for risk-sensitive reward. SIAM J Control Optim 2017;55(2):961–988.
Arapostathis A, Borkar V S, Fernández-Gaucherand E, Ghosh M K, Marcus S I. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J Control Optim 1993;31(2):282–344.
Baras J S, James M R. Robust and risk-sensitive output feedback control for finite state machines and hidden Markov models (summary). J Math Syst Estimation Control 1997;7:371–374.
Bäuerle N, Rieder U. Markov decision processes with applications to finance. Berlin: Springer; 2011.
Bäuerle N, Rieder U. Partially observable risk-sensitive Markov decision processes. Math Oper Res 2017;42(4):1180–1196.
Bogachev V I. Measure Theory, vol II. Berlin: Springer; 2007.
Cavazos-Cadena R, Hernández-Hernández D. Successive approximations in partially observable controlled Markov chains with risk-sensitive average criterion. Stochast: an Int J Probab Stochast Process 2005;77(6):537–568.
Dembo A, Zeitouni O. Large deviations techniques and applications. Stochastic modelling and applied probability, 38. Berlin: Springer; 2010.
Di Masi G B, Stettner L. Risk-sensitive control of discrete-time Markov processes with infinite horizon. SIAM J Control Optim 1999;38(1):61–78.
Di Masi G B, Stettner L. Risk sensitive control of discrete time partially observed Markov processes with infinite horizon. Stochast: an Int J Probab Stochast Process 1999;67(3-4):309–322.
Di Masi G B, Stettner L. Remarks on risk neutral and risk sensitive portfolio optimization. In From stochastic calculus to mathematical finance. pp 211-226. Berlin: Springer; 2006.
Di Masi G B. Infinite horizon risk sensitive control of discrete time Markov processes with small risk. Syst Control Lett 2000;40(1):15–20.
Donsker M D, Varadhan S S. On a variational formula for the principal eigenvalue for operators with maximum principle. Proc Natl Acad Sci 1975; 72(3):780–783.
Dupuis P, Ellis R S. A weak convergence approach to the theory of large deviations. New York: Wiley; 1997.
Fleming W H, Hernández-Hernández D. Risk-sensitive control of finite state machines on an infinite horizon I. SIAM J Control Optim 1997;35(5): 1790–1810.
Fleming W H, Hernández-Hernández D. Risk-sensitive control of finite state machines on an infinite horizon II. SIAM J Control Optim 1999;37(4): 1048–1069.
Hata H, Nagai H, Sheu S J. Asymptotics of the probability minimizing a “down-side” risk. Annals Appl Probab 2010;20(1):52–89.
Hernández-Hernández D. Partially observed control problems with multiplicative cost. Stochastic analysis, control, optimization and applications. pp 41-55. Birkhäuser, Boston; 1999.
Hernández-Lerma O, Lasserre J B. Discrete-time Markov control processes: basic optimality criteria. Vol 30. Berlin: Springer; 2012.
Howard R A, Matheson J E. Risk-sensitive Markov decision processes. Manag Sci 1972;18(7):356–369.
Jaśkiewicz A. Average optimality for risk-sensitive control with general state space. Ann Appl Probab 2007;17(2):654–675.
Ogiwara T. Nonlinear Perron-Frobenius problem on an ordered Banach space. Japan J Math 1995;21(1):43–103.
Pham H. A risk-sensitive control dual approach to a large deviations control problem. Syst Control Lett 2003;49(4):295–309.
Sion M. On general minimax theorems. Pac J Math 1958;8(1):171–176.
Stettner L. Duality and risk sensitive portfolio optimization. Contemp Math 2004;351:333–348.
Acknowledgements
This paper is part of the first author’s dissertation. The authors are grateful to the referee for the careful review of the manuscript and for the helpful comments and suggestions for improvement.
Funding
This work is supported by the NSFC 11671226.
Appendix: A nonlinear extension of the Kreĭn-Rutman theorem
The following theorem is an extension of the Kreĭn-Rutman theorem for nonlinear operators on Banach spaces (see Proposition 3.1.5, Lemma 3.1.3, and Lemma 3.1.7 in [23]). An ordered Banach space is a real Banach space (X,∥⋅∥) endowed with an order cone K, which means K is a closed convex cone with vertex at 0 such that K ∩ (−K) = {0}. Assume interior(K)≠∅ and \(\dim X\geq 2\). For x1,x2 ∈ X, we write x1 ≥ x2 if x1 − x2 ∈ K, x1 > x2 if x1 − x2 ∈ K∖{0}, and x1 >> x2 if x1 − x2 ∈ interior(K). For an operator L : K → K, define
Since \(\|L^{m+n}\|_{+}\leq \|L^{m}\|_{+}\|L^{n}\|_{+}\) for m,n ≥ 1, we can define
The following properties for operator L on K will be required:
-
1.
(Compactness) L is a compact operator, meaning that L maps any bounded subset into a relatively compact one.
-
2.
(Positive 1-homogeneity) c(Lx) = L(cx) for any x ∈ K,c > 0.
-
3.
(Order-preserving) Lx1 ≥ Lx2 for any x1 ≥ x2,x1,x2 ∈ K.
Theorem A.1
Let L : K → K be a compact, positively 1-homogeneous, and order-preserving operator. If ρ(L) > 0, then ρ(L) ∈ σ+(L) and \(\rho (L)=\max \limits \sigma ^{+}(L)\).
Theorem A.2
Let L : K → K be a positively 1-homogeneous and order-preserving operator. If there exist \(\rho ^{\prime }>0\) and f >> 0 such that \(Lf=\rho ^{\prime }f\), then \(\rho ^{\prime }=\rho (L)\).
Theorem A.3
Let L : K → K be a positively 1-homogeneous and order-preserving operator. If there exist \(\rho ^{\prime }\geq 0\) and f >> 0 such that \(Lf\leq \rho ^{\prime }f\), then \(\rho ^{\prime }\geq \rho (L)\).
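Theorems A.1-A.2 suggest a simple numerical scheme: iterate L on a positive vector, normalize, and read off ρ(L) once the iteration settles on an eigenvector f >> 0. The sketch below uses a Bellman-type operator L(f)_i = max_a (M_a f)_i on the cone of nonnegative vectors, which is compact, positively 1-homogeneous, and order-preserving; convergence of such normalized iterations generally needs extra conditions (e.g., primitivity in nonlinear Perron-Frobenius theory), so this is an illustration, not a general algorithm:

```python
import numpy as np

def bellman_op(f, Ms):
    """L(f)_i = max_a sum_j Ms[a][i, j] f_j: positively 1-homogeneous and
    order-preserving on the cone K of nonnegative vectors."""
    return np.max([M @ f for M in Ms], axis=0)

def nonlinear_power(Ms, n_iter=500):
    """Normalized iteration approximating rho(L) = lim ||L^n||_+^{1/n}."""
    f = np.ones(Ms[0].shape[0])
    rho = 0.0
    for _ in range(n_iter):
        g = bellman_op(f, Ms)
        rho = g.max()        # ||L f||_infty with ||f||_infty = 1
        f = g / rho
    return rho, f

# the first matrix is row-stochastic, so L(1) = 1: the constant vector 1 >> 0
# is an eigenvector with eigenvalue 1, and Theorem A.2 forces rho(L) = 1
Ms = [np.array([[0.5, 0.5], [0.2, 0.8]]), np.array([[0.9, 0.0], [0.1, 0.6]])]
rho, f = nonlinear_power(Ms)
```

This is exactly the mechanism the paper exploits: a strictly positive solution of the Bellman equation \(L^{(\gamma )}_{P}f_{\gamma }=\rho _{\gamma }f_{\gamma }\) pins down the principal eigenvalue.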
About this article
Cite this article
Dai, Y., Chen, J. Risk-Sensitivity Vanishing Limit for Controlled Markov Processes. J Dyn Control Syst 29, 1471–1508 (2023). https://doi.org/10.1007/s10883-023-09641-5
Keywords
- Markov decision processes
- Partially observable Markov decision process
- Risk-sensitive reward
- Variational formula
- Kreĭn-Rutman Theorem