Abstract
In this paper, we prove that the optimal risk-sensitive reward for Markov decision processes with compact state space and action space converges to the optimal average reward as the risk-sensitive factor tends to 0. In doing so, a variational formula for the optimal risk-sensitive reward is derived. An extension of the Kreĭn-Rutman Theorem to certain nonlinear operators is involved. Based on these results, partially observable Markov decision processes are also investigated. A portfolio optimization problem is presented as an application of the approach, in which a duality relation between maximizing the risk-sensitive reward and maximizing the upside chance of outperforming the optimal average reward is established.
1 Introduction
In this paper, we study a risk-sensitive control problem for Markov decision processes (MDPs). The risk-sensitive control of MDPs has been widely investigated (see [10, 12, 16, 21, 22] and the references cited therein). The basic goal is to find the optimal solution to the following control problem
where Xn is the state of the system at time n, x is the initial state, An is the decision made by the controller at time n, and π is the strategy for decision-making. The risk-sensitive factor γ represents the controller’s risk preference. Regarding r as a reward, we are concerned with the following maximization for γ > 0:
It is well known that γ = 0 corresponds to the risk-neutral case in which the performance is evaluated according to the following typical long-run average reward:
Notice that for any γ > 0, if E(eγX) and E(X) both exist, then
It is natural to ask whether the optimal risk-sensitive control converges to the optimal long-run average control as the risk-sensitive factor vanishes. The main purpose of this paper is to prove that
provided that both sides are well-defined (see the next section for an explicit description of this problem). This problem has been studied for minimizing risk-sensitive costs for MDPs; see the references cited above. A similar problem for optimal risk-sensitive portfolios has also been studied. Notice that in the framework of portfolio or other asset processes, maximizing rewards is a natural problem to consider. This maximization problem is essentially different from the minimization problem, but both are of fundamental importance for applications. It is interesting that maximizing the risk-sensitive reward is dual to maximizing the upside chance or minimizing the downside risk under some conditions (see [18, 24] and [26]). These results motivate us to study the asymptotics of optimal risk-sensitive rewards for MDPs. We shall show in this paper that for MDPs with compact state spaces and action spaces, under certain assumptions, the maximal risk-sensitive reward converges to the maximal long-run average reward as the risk-sensitive factor decreases to 0.
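As a purely numerical illustration of the limit discussed above (with a hypothetical discrete random variable, not taken from this paper), the following Python sketch shows that the risk-sensitive value \((1/\gamma)\log E(e^{\gamma X})\) dominates E(X), is non-decreasing in γ, and approaches E(X) as γ → 0:

```python
import math

# A hypothetical discrete random variable X (values and probabilities
# chosen for illustration only).
values = [0.0, 1.0, 3.0]
probs = [0.5, 0.3, 0.2]

def mean(vals, ps):
    return sum(v * q for v, q in zip(vals, ps))

def risk_sensitive_value(gamma, vals, ps):
    """(1/gamma) * log E[exp(gamma * X)] for a finitely supported X."""
    return math.log(sum(q * math.exp(gamma * v) for v, q in zip(vals, ps))) / gamma

m = mean(values, probs)
vals_by_gamma = {g: risk_sensitive_value(g, values, probs)
                 for g in (1.0, 0.1, 0.01, 0.001)}
```

The monotonicity in γ is a consequence of Hölder's inequality, the same argument used later in the proof of Theorem 3.1.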
We note that in the approach for deriving the asymptotics of the minimal risk-sensitive cost, besides a few necessary continuity assumptions, some conditions on contraction and strong ergodicity for the transition probabilities were imposed, based on which span contraction of some properly defined operator can be verified, which guarantees a solution to the corresponding Bellman equation. The strong ergodicity condition also makes it possible to apply some large deviation techniques (see (2.7) and (2.8) in Section 2 for the explicit conditions, with more explanations given in Remark 2.1). In this paper, we shall use a quite different approach: inspired by Anantharam and Borkar [2], we use a nonlinear extension of the Kreĭn-Rutman Theorem (see [23]) to find the eigenvalues of some properly defined operators on certain function spaces and characterize the optimal growth rate of the multiplicative reward with this eigenvalue. Using this characterization and a perturbation technique, we derive a variational formula for the optimal growth rate (Theorem 3.7) of the MDP without assuming ergodicity of the transition probabilities. This variational formula is similar to the Donsker-Varadhan formula (see [14]) and is of independent significance. The vanishing risk-sensitivity limit of the maximal reward of MDPs follows as an application of this formula (Theorems 3.1 and 3.8), and its proof implies that for a risk-sensitive control problem, the optimal policy can be taken to be a stationary one, even when the MDP is not communicating.
We also apply the approach to study the same problem for partially observable Markov decision processes (POMDPs) with compact state and action spaces. For POMDPs, see [4, 6, 8, 11, 17], and [19] and the references cited therein. A widely used approach for studying a POMDP is to transfer it into a completely observable MDP. However, the structure of the transferred MDP is usually much more complicated. Among the current results, such as those in the references mentioned above, few are on the risk-sensitivity vanishing limit. In [1], the limit of the minimal risk-sensitive cost as the risk-sensitive factor tends to 0 is derived for a class of POMDPs with a particular structure. This particular structure makes it possible to apply a large deviation approach. In [11], Di Masi and Stettner established the existence of the solution to the associated Bellman equation for cost-minimizing problems. However, they remarked that the limit as the risk-sensitive factor tends to 0 for general POMDPs had not been proven. Based on our investigation for MDPs, we prove that, as long as the solution to the associated Bellman equation exists, the maximal risk-sensitive reward converges to the maximal long-run average reward as the risk-sensitive factor tends to 0.
Finally, as an application of our approach to portfolio optimization, we establish a duality relation between maximizing the risk-sensitive reward and maximizing the chance of outperforming certain amounts of reward, with the range of the amounts characterized by the optimal average reward (Theorem 5.2).
The paper is organized as follows. At the end of this section, we introduce some notations that will be frequently used in this paper. In Section 2, we define the decision model and derive some properties of the operator corresponding to the Bellman equation. The risk-neutral limit for MDPs is given in Section 3, in which the variational formula mentioned above is established. Section 4 is devoted to POMDPs. The portfolio optimization problem is investigated in Section 5.
Here are some notations and preliminaries. Given a separable and complete metric space (also called a Polish space) \((\mathcal {X},\rho )\), let \({\mathscr{M}}(\mathcal {X})\) and \({\mathscr{M}}^{+}(\mathcal {X})\) denote the set of finite signed measures on \(\mathcal {X}\) and the set of finite measures on \(\mathcal {X}\), respectively. \(\mathcal {P}(\mathcal {X})\) is the space of probability measures on \(\mathcal {X}\), endowed with the weak topology. For \(p,q\in \mathcal {P}(\mathcal {X})\), we use p << q to denote that p is absolutely continuous with respect to q. As usual, δx(⋅) denotes the Dirac measure at the point \(x\in \mathcal {X}\). When \(\mathcal {X}\) is compact, \(C(\mathcal {X})\), the space of real-valued continuous functions on \(\mathcal {X}\), equipped with the supremum norm ∥⋅∥, is a Banach space. Let \(C^{+}(\mathcal {X})\) denote the set of non-negative functions in \(C(\mathcal {X})\). \(C^{+}(\mathcal {X})\) is a cone, which means that for any \(f,g\in C^{+}(\mathcal {X})\) and any c > 0, both f + g and cf are in \(C^{+}(\mathcal {X})\). \(C^{+}(\mathcal {X})\) is convex, closed and satisfies that \(C^{+}(\mathcal {X})\cap (-C^{+}(\mathcal {X}))=\{0\}\) and that \(interior(C^{+}(\mathcal {X}))\neq \emptyset \). We write f ≥ g, f > g, f >> g if \(f-g\in C^{+}(\mathcal {X}),f-g\in C^{+}(\mathcal {X})\backslash \{0\},f-g\in interior(C^{+}(\mathcal {X}))\), respectively. These facts form the basis for applying a nonlinear extension of the Kreĭn-Rutman theorem (see Appendix) to the operator corresponding to the Bellman equation.
For two probability measures \(p,q\in \mathcal {P}(\mathcal {X})\), the relative entropy of p with respect to q is defined by
which plays an important role in the variational formula for the optimal reward.
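For finitely supported measures, the relative entropy defined above reduces to a finite sum with the conventions 0 log 0 := 0 and D(p∥q) := +∞ when p is not absolutely continuous with respect to q. A minimal Python sketch (helper name hypothetical):

```python
import math

def relative_entropy(p, q):
    """D(p||q) = sum_i p_i * log(p_i / q_i) for discrete distributions;
    returns +inf unless p << q, with the convention 0*log 0 = 0."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue                      # 0 * log 0 = 0
        if qi == 0.0:
            return math.inf               # p is not absolutely continuous w.r.t. q
        d += pi * math.log(pi / qi)
    return d
```

As expected, D(p∥p) = 0 and D(p∥q) > 0 whenever p ≠ q, the property used repeatedly in Section 3.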
Let \(\text {Lip}(\mathcal {X})\) denote the space of real-valued, bounded, and Lipschitz continuous functions on \(\mathcal {X}\). Given \(f\in \text {Lip}(\mathcal {X})\), define its norm by
Then \((\text {Lip}(\mathcal {X}),\ \|\cdot \|_{L})\) is a Banach space when \(\mathcal {X}\) is compact. Given \(\mu \in {\mathscr{M}}(\mathcal {X})\), define the following Kantorovich-Rubinstein norm:
Then the weak topology on \(\mathcal {P}(\mathcal {X})\) is generated by the Kantorovich-Rubinstein metric d0(μ,ν) := ∥μ − ν∥0 (Theorem 8.3.2, pp. 193–194, in [7]). \(\mathcal {P}(\mathcal {X})\) endowed with the weak topology is a Polish space since \(\mathcal {X}\) is Polish. The space of Lipschitz functions and the Kantorovich-Rubinstein metric will be used in Section 4 to discuss the POMDPs.
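As an aside, on the real line the closely related Wasserstein-1 distance (the Kantorovich-Rubinstein metric with the Lipschitz seminorm alone) admits a simple CDF formula. A sketch for two distributions on a common sorted finite support (helper name and inputs hypothetical; this is not the exact bounded-Lipschitz norm ∥⋅∥0 used above, only its classical relative):

```python
def wasserstein1_discrete(xs, p, q):
    """W1 between distributions p, q supported on the sorted points xs,
    via W1 = sum_i |F(x_i) - G(x_i)| * (x_{i+1} - x_i)."""
    Fp = Fq = 0.0
    w = 0.0
    for i in range(len(xs) - 1):
        Fp += p[i]
        Fq += q[i]
        w += abs(Fp - Fq) * (xs[i + 1] - xs[i])
    return w
```

For instance, the distance between the Dirac measures δ0 and δ1 is the distance between their atoms.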
Finally, as usual, \(\mathbb {N}\) and \(\mathbb {R}\) denote the sets of non-negative integers and real numbers, respectively.
2 Solution to the Bellman Equation
A discrete-time MDP can be represented as a four-tuple M = 〈S,A,p(⋅|⋅,⋅),r(⋅,⋅)〉. S is the state space, A is the action space, and both are assumed to be compact metric spaces in the present paper. We assume for convenience that any action in A is admissible in any state. The transition kernel, which depends on actions, is denoted by \(p(E|x,a), E\subseteq S, x\in S, a\in A\). The last element in the tuple is the one-step reward function \(r: S\times A\to \mathbb {R}\). To define a probability space and a stochastic process with the desired mechanism, let \({\Omega }=(S\times A)^{\infty }\) and \({\mathscr{B}}({\Omega })\) be the product Borel σ-field. Given a sample path ω = (x1,a1,x2,a2,...) ∈Ω, define \(X_{t}:= x_{t},A_{t}:= a_{t},t\in \mathbb {N}\). At each time \(t\in \mathbb {N}\), the system M occupies a state Xt, based on which the controller chooses an action At, and then the system moves to the next state according to the law p(⋅|Xt,At). A Markov decision rule at time t is a stochastic kernel \(d_{t}\in \mathcal {P}(A|S)\), where dt(B|x) denotes the probability of taking an action in \(B\subseteq A\) when observing the current state Xt = x. A Markov policy π is a sequence of Markov decision rules. Let DM denote the set of all the Markov decision rules. \({\Pi }_{M}=(D_{M})^{\infty }\) is the set of all the Markov policies of M. Given an initial state x ∈ S and a policy π = (d1,d2,...) \(\in {\Pi }_{M}\), we can define a unique probability measure \(\text {P}_{x}^{\pi }\) on \({\mathscr{B}}({\Omega })\) by the Ionescu-Tulcea theorem, such that for each t ≥ 0,
The corresponding expectation operator is denoted by \( E_{x}^{\pi }\). Since we view r as a reward, a typical criterion for evaluating the optimal policy is to maximize the average reward, i.e., we are interested in the following function
which is risk-neutral. Informally, we notice that by the Taylor expansion, we see that for a small factor γ
Hence, we can use γ≠ 0 to evaluate the controller’s risk preference. This leads to the following risk-sensitive criterion for maximization of reward (see [10]):
where γ≠ 0 is a constant evaluating the controller’s risk preference. With “\(\sup \)” replaced by “\(\inf \)” in (2.3), we have the risk-sensitive criterion for minimization of cost. An interesting problem is the asymptotics of λ(x,γ) as γ → 0. Observe that if we define for N ≥ 1
then (2.2) implies that
which motivates us to derive
This is the main concern of this paper.
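For orientation, the finite-horizon quantity λN(x,γ) just defined can be estimated by straightforward Monte Carlo simulation. The sketch below uses a toy 2-state, 2-action MDP with hypothetical numbers (not from this paper) and a fixed stationary policy:

```python
import math
import random

# A toy 2-state, 2-action MDP with hypothetical numbers.
P = {0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
     1: {0: [0.3, 0.7], 1: [0.8, 0.2]}}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 2.0, 1: 0.5}}

def lambda_N(x, gamma, policy, N, n_paths=5000, seed=0):
    """Monte Carlo estimate of (1/(gamma*N)) * log E_x exp(gamma * sum_t r(X_t, A_t))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        s, reward_sum = x, 0.0
        for _ in range(N):
            a = policy(s)
            reward_sum += R[s][a]
            s = rng.choices((0, 1), weights=P[s][a])[0]
        total += math.exp(gamma * reward_sum)
    return math.log(total / n_paths) / (gamma * N)

# Estimate for the stationary policy that always plays action 0.
est = lambda_N(0, 0.1, policy=lambda s: 0, N=30)
```

Since the per-step reward under this policy lies between 1 and 2, so does the estimate, reflecting the general bound rm ≤ λN ≤ rM.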
Remark 2.1
When the \(\sup \) in (2.3) is replaced with an \(\inf \) to define \(\tilde \lambda (x,\gamma )\), the γ → 0 limit has already been established in [10]. There, the existence of a solution to the Bellman equation was proved under, in addition to some necessary continuity assumptions, the following two requirements:
and there exist an \(\eta \in \mathcal {P}(S)\) and a continuous density q(x,a,y) such that \(p(E|x,a)={\int \limits }_{E}q(x,a,y)\eta (dy)\) for \(E\in {\mathscr{B}}(S)\) and
These conditions guarantee that the operator defining the corresponding Bellman equation is span contractive, and hence a solution exists. (2.8) also implies strong ergodicity for the family of transition probabilities defining the MDP. A consequence is the applicability of large deviation techniques for ergodic Markov processes. Instead of such conditions, we will use the following assumption (B1) on communication among the family of transition probabilities to guarantee that the eigenvector of the Bellman equation is strictly positive, based on which the variational formula holds. Then we will remove (B1) by a perturbation technique, which means that the limit can hold for completely observable MDPs without any communication requirements on the transition probability.
-
(B1)
For any x1,x ∈ S and any open neighborhood U containing x, there exist an N > 0 and a1,...,aN ∈ A such that
$$ \begin{array}{@{}rcl@{}} \int \textbf{1}_{U}(x_{N+1})p(dx_{N+1}|x_{N},a_{N})...p(dx_{2}|x_{1},a_{1})>0, \end{array} $$
Remark 2.2
When the model is finite (i.e., both S and A are finite), (B1) is the classical communicating condition.
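For a finite model, the communicating condition of Remark 2.2 reduces to a reachability property of the directed graph whose edges are the positive-probability transitions under some action. A simplified Python check (helper names hypothetical; the start state is counted as trivially reachable, glossing over the N > 0 return-path subtlety in (B1)):

```python
def is_communicating(P):
    """Simplified finite-model check of (B1): every state reaches every state
    with positive probability under some action sequence.
    P[s][a] is the list of transition probabilities to next states."""
    n = len(P)
    # Edge s -> y iff some action moves s to y with positive probability.
    adj = [{y for a in P[s] for y in range(n) if P[s][a][y] > 0.0}
           for s in range(n)]

    def reachable(s):
        # Depth-first search from s.
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    return all(len(reachable(s)) == n for s in range(n))
```

A model with an absorbing state that cannot reach the rest of the state space fails the check.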
Now let
and
for γ > 0. The main objective of this paper is to show that
In order to do this, we make the following assumptions.
-
(A1)
r(x,a) is continuous in (x,a).
-
(A2)
\((x,a)\mapsto {\int \limits }_{S}p(dy|x,a)f(y)\) is continuous in (x,a) when f ∈ C(S).
-
(A3)
The family of functions
$$ \begin{array}{@{}rcl@{}} \left\{x\mapsto{\int}_{S}f(y)p(dy|x,a), f\in C(S),\left\|f\right\|\leq 1,a\in A\right\} \end{array} $$is equicontinuous.
Moreover, if (B1) holds, we will see that, independent of the choice of the initial state x ∈ S, the value of λ(x,γ) depends only on γ and
Remark 2.3
A concrete case in which (A3) is satisfied is that
with \({\Lambda }\in \mathcal {P}(S)\) and {q(y|⋅,a),y ∈ S,a ∈ A} equicontinuous. The equicontinuity assumption (A3) is only used to prove the compactness of the operator related to the Bellman equation. In particular, for every finite MDP, (A1), (A2), and (A3) hold automatically. Combined with (B1), this compactness yields the existence of a positive eigenvalue and an associated positive eigenvector. The continuity assumptions of the result regarding the Bellman equation in [10] are the same as (A1) and (A2). But the γ → 0 limit established in [10] requires that there exist an \(\eta \in \mathcal {P}(S)\) and a density q(x,a,y) > 0 such that \(p(E|x,a)={\int \limits }_{E}q(x,a,y)\eta (dy)\) for \(E\in {\mathscr{B}}(S)\) and (x,a,y) → q(x,a,y) is continuous, which is stricter than (A3) when the state space S and action space A are compact.
The Bellman equation mentioned above is
and the corresponding operator L(γ) on C(S) is defined by
Since
and
we see that
Assumptions (A1), (A2) and the compactness of S × A imply that when f ∈ C(S), L(γ)f also belongs to C(S). Combining these with (A3), we can prove that L(γ) is a compact operator, which is crucial to the existence of a positive eigenvalue, as we claimed before.
Proposition 2.1
Assume (A1), (A2), and (A3). Then L(γ) is a compact operator mapping C(S) into itself.
Proof
Notice that r(⋅,⋅) is bounded under assumption (A1) and the compactness of S × A. For convenience, we let rM and rm be the supremum and infimum of r, respectively. For any function f with ∥f∥≤ K, we have \(\sup _{x\in S}\left |L^{(\gamma )}f(x)\right |\leq Ke^{\gamma r_{M}}\). Thus, to apply the Arzelà-Ascoli theorem, we need to verify that the family {L(γ)f,f ∈ C(S),∥f∥≤ K} is equicontinuous.
To this end, let ρ denote the metric on S. According to (A3), for any ε > 0, there exists δ1 > 0 such that
for any x1,x2 with ρ(x1,x2) ≤ δ1. By the uniform continuity of eγr(⋅,⋅), there exists δ2 > 0 such that
whenever ρ(x1,x2) ≤ δ2. Consequently when ∥f∥≤ K, for x1,x2 with \(\rho (x_{1},x_{2})\leq \min \limits \{\delta _{1},\delta _{2}\}\), we have
□
L(γ) has the following properties, which will be used when applying the nonlinear Kreĭn-Rutman Theorem to prove the existence of a solution to (2.13).
-
(P1)
Assume (A1). Then
$$(L^{(\gamma)})^{N}f(x)=\sup\limits_{\pi\in{\Pi}_{M}} E_{x}^{\pi}\left[\exp\left( {\sum}_{t=1}^{N}\gamma r(X_{t},A_{t})\right)\cdot f(X_{N+1})\right],N\geq 1.$$This property can be proven by induction using the Markov feature of π ∈πM and the fact that \(e^{\gamma r(X_{i},A_{i})}\leq e^{\gamma r_{M}}\) (see Lemma 2.1 and its proof in [2]).
-
(P2)
(Positive 1-homogeneity) c(L(γ)f) = L(γ)(cf) for c ≥ 0 and f ∈ C(S).
-
(P3)
(Order-preserving) If f ≥ g, then L(γ)f ≥ L(γ)g.
The following theorem shows that the spectral radius of L(γ) is an eigenvalue. For an operator T : C(S) → C(S), define
It is not hard to check that
which implies that the limit
exists.
Theorem 2.2
Assume (A1), (A2), and (A3). Then ρ(L(γ)) > 0 and there exists an fγ ∈ C+(S) depending on γ with fγ≠ 0 such that
If in addition (B1) is satisfied, then fγ >> 0 and \(\gamma \lambda (x,\gamma )=\log \rho (L^{(\gamma )})\) is independent of x ∈ S.
Proof
From (P1), we see that \(\left \|(L^{(\gamma )})^{n}\textbf {1}\right \|\geq e^{n\gamma r_{m}}\), which implies that \(\rho (L^{(\gamma )})\geq e^{\gamma r_{m}}>0\). Since L(γ) is compact, positive 1-homogeneous and order-preserving, by Theorem A.1 in the Appendix, there exists an fγ ∈ C+(S) satisfying (2.14). Moreover, from (A1) and (A2), we know that \({\int \limits }_{S}e^{\gamma r(x,a)}f(y)p(dy|x,a)\) is continuous in a, which means that the supremum in (2.13) can be achieved. Hence, there exists a Markov decision rule d∗ such that
Let \(\pi ^{*}=(d^{*})^{\infty }\), then we have for \(N\in \mathbb {N}\)
Similarly, we have for any Markov policy π
Now, assume (B1). Since fγ ∈ C+(S) and fγ≠ 0, there exist x0 ∈ S, c0 > 0 and an open neighborhood U0 containing x0 such that \(f_{\gamma } |_{U_{0}}>c_{0}>0\). It follows from (B1) that for any x1 ∈ S, there exist a1,...,aM such that
Thus, fγ >> 0. Since S is compact, there are constants \(0<k_{\gamma }<K_{\gamma }<\infty \) such that kγ ≤ fγ ≤ Kγ. From (2.15) and (2.16), we see that for any x ∈ S
and for any Markov policy π
Taking the logarithm and letting \(N\to \infty \), we see that the limit
exists and
for any x ∈ S. □
Remark 2.4
Assume (A1), (A2), (A3), and (B1). From the proof of Theorem 2.2 we can see that ρ(L(γ)) is the unique positive eigenvalue of L(γ) restricted to interior(C+(S)).
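For a finite model, the eigenpair of Theorem 2.2 can be computed by power iteration on the (nonlinear) operator L(γ), since L(γ) is positively 1-homogeneous, order-preserving, and here strictly positive. A sketch on the toy 2-state, 2-action MDP with hypothetical numbers used earlier:

```python
import math

# A toy 2-state, 2-action MDP with hypothetical numbers (not from the paper):
# P[s][a][y] is the transition probability, R[s][a] the one-step reward.
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.3, 0.7], [0.8, 0.2]]]
R = [[1.0, 0.0], [2.0, 0.5]]

def bellman_op(f, gamma):
    """L^(gamma) f(x) = max_a exp(gamma*r(x,a)) * sum_y p(y|x,a) f(y)."""
    n = len(f)
    return [max(math.exp(gamma * R[s][a]) * sum(P[s][a][y] * f[y] for y in range(n))
                for a in range(len(R[s])))
            for s in range(n)]

def spectral_radius(gamma, iters=2000):
    """Power iteration for the positive eigenpair: L f = rho f with f >> 0."""
    f = [1.0] * len(R)
    rho = 1.0
    for _ in range(iters):
        g = bellman_op(f, gamma)
        rho = max(g)              # max-normalization; the scale factor -> rho
        f = [v / rho for v in g]
    return rho, f

rho, f_gamma = spectral_radius(0.5)
log_rho = math.log(rho)  # equals gamma * lambda(gamma) by Theorem 2.2
```

At convergence the eigen-equation L(γ)fγ = ρ fγ holds componentwise, and ρ ≥ e^{γ rm} as in the proof above.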
3 Risk-Sensitive Asymptotics of MDP
In this section, we shall apply the following variational formula for λ(γ) to prove (2.9).
where \(\mathcal {I}\) is defined by
and for \(\beta \in \mathcal {P}(S\times A\times S),\) the notations β0,β1,β2, and \(\beta ^{\prime }\) are defined by
Obviously, \(\mathcal {I}\) is nonempty and closed in \(\mathcal {P}(S\times A\times S)\). Notice that \(\mathcal {P}(S\times A\times S)\) is compact since S × A × S is compact. Hence, \(\mathcal {I}\) is compact, too. For \(\beta \in \mathcal {P}(S\times A\times S)\), β0 is the first 1-dimensional marginal of β, \(\beta ^{\prime }\) is the first 2-dimensional marginal of β, and β1 and β2 are the two successive conditional distributions of β. With these notations, \(\mathcal {I}\) is seen to be the set of probability measures β on S × A × S satisfying that β0 is invariant under \({\int \limits }_{A}\beta _{1}(da|x)\beta _{2}(dy|x,a)\). The validity of (3.1) will be verified in Theorems 3.4 and 3.7. At present, we will apply (3.1) to get the limit of λ(γ) as γ → 0.
Theorem 3.1
Assume (A1) and (A2). If (3.1) holds, then
To prove the theorem, we need the following
Lemma 3.2
If there exists a \(\beta \in \mathcal {I}\) satisfying that β2 = p, then
where \(\mathcal {I}\) is defined in (3.2).
Proof
Since β(S,A,dx) = β(dx,A,S), taking β0 as the initial distribution and using the policy \(\pi _{\beta }=(\beta _{1}(da|x))^{\infty }\), we see that
The third equality is due to the coincidence of the first and the third marginal of β ensured by (3.2). By induction, we have
Thus,
□
Now we are ready to prove Theorem 3.1.
Proof of Theorem 3.1
By Hölder’s inequality, for \(\gamma \geq \gamma ^{\prime }>0\), we have
and
Therefore, λ(γ) is non-decreasing in γ and \(\lim \limits _{\gamma \to 0}\lambda (\gamma )\geq v\). To prove (3.4), it suffices to verify that \(\lim \limits _{\gamma \to 0}\lambda (\gamma )\leq v\). To this end, we notice that it follows from (3.1) that for any ε > 0 and γ > 0, there exists \(\beta _{\gamma }^{\varepsilon }\in \mathcal {I}\) such that
Since \(\mathcal {I}\subseteq \mathcal {P}(S\times A\times S)\) is compact, we can find a sequence \(\{\gamma _{n}\}_{n\in \mathbb {N}}\) decreasing to 0 and a \(\beta ^{\varepsilon }\in \mathcal {I}\) such that
Therefore, from (A1), we know that
which is finite. Now we claim that \(\left (\beta ^{\varepsilon }\right )_{2}=p\). Indeed, we have
It follows from the (joint) lower semicontinuity of D(⋅∥⋅) and (A2) that
Thus if \(\left (\beta ^{\varepsilon }\right )_{2}\neq p\), then
Combining this with (3.5) and the fact that λ(γ) ≥ rm, we would have
This is impossible. Thus, \(\beta ^{\varepsilon }\in \mathcal {I}\) and \(\left (\beta ^{\varepsilon }\right )_{2}=p\). Recalling (3.6) and Lemma 3.2, we obtain that
(3.4) follows by letting ε → 0. □
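The conclusion of Theorem 3.1 can be observed numerically on a finite model: computing λ(γ) = log ρ(L(γ))/γ by power iteration for decreasing γ and comparing it with the optimal average reward v obtained by enumerating deterministic stationary policies. The sketch below reuses the toy 2-state, 2-action MDP with hypothetical numbers:

```python
import math

# Toy 2-state, 2-action MDP (hypothetical numbers).
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.3, 0.7], [0.8, 0.2]]]
R = [[1.0, 0.0], [2.0, 0.5]]

def bellman_op(f, gamma):
    return [max(math.exp(gamma * R[s][a]) * sum(P[s][a][y] * f[y] for y in (0, 1))
                for a in (0, 1)) for s in (0, 1)]

def lam(gamma, iters=5000):
    """lambda(gamma) = log rho(L^(gamma)) / gamma, via power iteration."""
    f, rho = [1.0, 1.0], 1.0
    for _ in range(iters):
        g = bellman_op(f, gamma)
        rho = max(g)
        f = [x / rho for x in g]
    return math.log(rho) / gamma

def avg_reward(d):
    """Long-run average reward of the deterministic stationary policy d=(d0,d1),
    using the explicit stationary distribution of a 2-state chain."""
    p01 = P[0][d[0]][1]
    p10 = P[1][d[1]][0]
    mu0 = p10 / (p01 + p10)
    return mu0 * R[0][d[0]] + (1 - mu0) * R[1][d[1]]

v = max(avg_reward((a0, a1)) for a0 in (0, 1) for a1 in (0, 1))
lams = [lam(g) for g in (1.0, 0.1, 0.01)]
```

The computed values decrease monotonically toward v as γ decreases, matching the Hölder monotonicity argument in the proof.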
The remainder of this section is devoted to verifying (3.1) under certain conditions. This is carried out first under assumptions including (B1), and then with (B1) removed. Our assumption (A2) is slightly weaker than its counterpart in [2]. In [2], it is required that the family of functions
is equicontinuous, while we assume that
is equicontinuous (Theorems 3.4 and 3.8). Moreover, it is worth mentioning that equicontinuity only plays a role in the existence of the positive eigenvalue and the eigenvector. Once L(γ) has a positive eigenvalue and a strictly positive eigenvector, only (A1) and (A2) are needed.
Proposition 3.3
Assume (A1) and (A2). If there exist ργ > 0 and fγ ∈ C(S) such that fγ >> 0 and L(γ)fγ = ργfγ, then (3.1) holds.
Proof
From the proof of Theorem 2.2, we know that \(\log \rho _{\gamma }=\gamma \lambda (x,\gamma )\) for any x ∈ S. Thus, for any \(\mu \in {\mathscr{M}}^{+}(S)\), we have
Therefore,
For any f >> 0, we also have
which means that
Since under (A1) and (A2), properties (P2) and (P3) hold for L(γ), we can apply Theorems A.2 and A.3 in the Appendix to deduce that
Thus,
Using the Gibbs variational formula (Proposition 1.4.2(a), pp. 33–34 in [15]), we see that
Since D(μ∥ν) is jointly convex and lower semicontinuous in (μ,ν) (Lemma 1.4.3, pp. 36–38 in [15]) and \(\mathcal {P}(S\times A), \mathcal {P}(S\times A\times S)\) are both compact, the minimax theorem (Theorem 4.2 in [25]) can be applied to get
Furthermore, by the chain rule for relative entropy (Theorem D.13, pp. 357–359 in [9]), we have that
Since D(μ∥ν) ≥ 0 and D(μ∥ν) = 0 iff μ = ν (Lemma 1.4.1 , pp. 33, in [15]), the supremum over \(\eta \in \mathcal {P}(S\times A)\) is attained at \(\eta =\beta ^{\prime }\). Moreover, notice that when \(\beta \in \mathcal {I}\), for any g ∈ C(S),
and for \(\beta \notin \mathcal {I}\),
we obtain that
□
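The Gibbs variational formula invoked in the proof above is an exact identity in the discrete case: log ∫ e^g dν equals ∫ g dμ* − D(μ*∥ν), where the maximizer μ* is the Gibbs measure with density proportional to e^g against ν. A short numerical check (inputs hypothetical):

```python
import math

def gibbs_check(g, nu):
    """Check log E_nu[e^g] = sup_mu { E_mu[g] - D(mu||nu) }, where the supremum
    is attained by the Gibbs measure mu*_i proportional to nu_i * exp(g_i)."""
    Z = sum(n * math.exp(gi) for gi, n in zip(g, nu))
    lhs = math.log(Z)
    mu = [n * math.exp(gi) / Z for gi, n in zip(g, nu)]
    kl = sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0.0)
    rhs = sum(m * gi for m, gi in zip(mu, g)) - kl
    return lhs, rhs

lhs, rhs = gibbs_check([1.0, -0.5, 2.0], [0.2, 0.5, 0.3])
```

Any other choice of μ, for instance μ = ν itself, gives a value no larger than the left-hand side, as Jensen's inequality predicts.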
Combining Theorem 2.2 and Proposition 3.3, we obtain the following theorem immediately.
Theorem 3.4
Assume (A1), (A2), (A3), and (B1). Then (3.1) holds.
To remove (B1), we use a perturbation argument. For each 𝜖 > 0, define a new MDP M𝜖 with the transition law and one-step reward given by
respectively, where \({\Gamma }\in \mathcal {P}(S)\) with full support. It is not hard to check that M𝜖 satisfies (A1), (A2), (A3), and (B1). Using \( E_{\epsilon ,\gamma ,x}^{\pi }\) to denote the corresponding expectation operator with initial state x and policy π, we define
and
By Theorem 2.2, λ𝜖(x,γ) depends only on γ, and the limit inferior is actually a limit. Hence, we write it as λ𝜖(γ). Without (B1), we will prove the variational formula by exploring properties of λ𝜖(γ) and then letting 𝜖 → 0.
Lemma 3.5
Assume (A1) and (A2). Then λ𝜖(γ) is non-decreasing in 𝜖 and \(\lim \limits _{\epsilon \to 0}\lambda _{\epsilon }(\gamma )\geq \lambda (\gamma )\).
Proof
From property (P1), we have
for any x ∈ S. Thus, for any 𝜖1 > 𝜖2 > 0, by (3.10), we obtain that
□
In order to write the variational formula in a form more convenient for the following arguments, we define for given 𝜖 > 0, γ > 0 and \(\beta \in \mathcal {I}\)
and
To prove \(\lambda (\gamma )\geq \lim \limits _{\epsilon \to 0}\sup \limits _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,\epsilon )\), we will show that
where λSM(x,γ) is defined by
with \(d^{\infty }\) denoting the stationary Markov policy whose decision rule at every time is the same d ∈ DM.
Lemma 3.6
Assume (A1) and (A2). Then \(\sup \limits _{x\in S}\lambda _{SM}(x,\gamma )\geq \sup \limits _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,0)\).
Proof
We need to prove that
for each \(\beta \in \mathcal {I}\). If \(\phi (\beta ,\gamma ,0)=-\infty \), the inequality holds trivially. Otherwise, \(\beta \in \mathcal {I}\) with \(\phi (\beta ,\gamma ,0)>-\infty \) implies that β2(⋅|x,a) << p(⋅|x,a) \(\beta ^{\prime }\)-a.s. Choosing the stationary Markov policy \(\pi _{\beta }=(\beta _{1}(da|x))^{\infty }\) and the initial distribution \(\beta _{0}\in \mathcal {P}(S)\), we see that
Define \(^{\beta }{E}_{\beta _{0}}^{\pi _{\beta }}\) as the expectation operator with respect to the probability measure determined by the initial distribution β0, the transition law β2(dy|x,a) for \(\{X_{t}\}_{t\in \mathbb {N}}\), and the policy πβ. Using the change of measure technique and Jensen’s inequality, we obtain that
Since \(\beta \in \mathcal {I}\), the same argument as in proving Lemma 3.2 shows that
Consequently,
□
Combining Theorem 3.4 and Lemmas 3.5 and 3.6, we obtain the following
Theorem 3.7
Assume (A1), (A2), and (A3). Then (3.1) holds.
Proof
From Lemmas 3.5 and 3.6, we see that
Hence, (3.1) will follow once we prove that \(\sup _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,0)\geq \lim \limits _{\epsilon \to 0}\lambda _{\epsilon }(\gamma )\). Since M𝜖 satisfies (A1), (A2), and (B1), by Theorems 2.2 and 3.4, we have
Therefore, given ξ > 0, for every 𝜖 > 0, there exists \(\beta _{\epsilon }^{\xi }\in \mathcal {I}\) such that
Since \(\mathcal {I}\) is compact, there exists a sequence \(\{\epsilon _{n}\}_{n\in \mathbb {N}}\) decreasing to 0 such that the weak limit \(\lim \limits _{n\to \infty }\beta _{\epsilon _{n}}^{\xi }=:\beta ^{\xi }\) exists and belongs to \(\mathcal {I}\). By (A1) and Dini’s Theorem, \(r^{(\gamma )}_{\epsilon }(\cdot ,\cdot )\) converges to γr(⋅,⋅) uniformly. Thus, we obtain that
Recalling the definition of β2 for \(\beta \in \mathcal {I}\), by the lower semicontinuity of D(⋅∥⋅), we see that
It then follows that
Thus,
Letting ξ → 0, we have \(\sup _{\beta \in \mathcal {I}}\phi (\beta ,\gamma ,0)\geq \lim _{\epsilon \to 0}\lambda _{\epsilon }(\gamma )\). Now (3.1) follows. □
Remark 3.1
The proof shows that the inequalities in (3.13) are actually equalities, which indicates that the supremum over Markov policies in a risk-sensitive MDP coincides with the supremum over stationary Markov policies; that is, one may search for an optimal policy among the stationary ones even without ergodicity of the transition probabilities.
Combining Theorems 3.1, 3.4, and 3.7, we obtain the main result immediately.
Theorem 3.8
Assume (A1), (A2), and (A3). Then
In addition, if (B1) holds, then
for any x ∈ S.
Remark 3.2
Recalling the proof of Theorem 3.1, we see that under (A1), (A2), and (A3), there exists \(\mu (dx,da)\in \mathcal {P}(S\times A)\) such that
A sufficient condition for the risk-neutral average optimal reward v(x) to be independent of the initial state x is the uniform ergodicity (2.7) (see Section 5.5 in [20]). We rewrite it as
-
(B2)
There exists δ < 1 such that
$$ \sup_{U\in \mathcal{B}(S)}\sup_{x,x^{\prime}\in S}\sup_{a,a^{\prime}\in A}[P(U|x,a)-P(U|x^{\prime},a^{\prime})]\leq\delta. $$(3.15)
We provide a brief proof of this.
Theorem 3.9
Assume (A1), (A2), and (B2). Then v(x) is independent of the initial state x.
Proof
Define an operator T on C(S) by
It is not hard to check that under (A1) and (A2), T maps C(S) into itself. Let
be the span norm on C(S). For f1,f2 ∈ C(S),x1,x2 ∈ S, and ε > 0, there exist a1,a2 ∈ A such that
Therefore, we obtain that
where E in the second-to-last inequality comes from the Hahn-Jordan decomposition of p(⋅|x1,a1) − p(⋅|x2,a2). Letting ε → 0, we see that T is a contraction mapping on (C(S),∥⋅∥sp). Thus, by the Banach Fixed-Point Theorem, there exists a unique (up to an additive constant) f0 ∈ C(S) such that ∥Tf0 − f0∥sp = 0, which means that Tf0(x) − f0(x) is a constant v0. It follows that for any x ∈ S,
Since \(r(x,a)+{\int \limits } p(dy|x,a)f_{0}(y)\) is in C(A) due to (A2), for each x ∈ S, there exists a d0(x) ∈ A such that
Letting \(\pi _{0}=(d_{0})^{\infty }\), we have
(3.16) and (3.17) imply that v0 = v(x) for any x ∈ S. □
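The fixed-point argument of Theorem 3.9 is, in the finite case, exactly relative value iteration: iterate f ← Tf − (Tf)(s0), so that the subtracted constant converges to v0 under the span contraction guaranteed by (B2). A sketch on the toy 2-state, 2-action MDP with hypothetical numbers (which satisfies (B2), since any two transition rows overlap):

```python
# Toy 2-state, 2-action MDP (hypothetical numbers).
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.3, 0.7], [0.8, 0.2]]]
R = [[1.0, 0.0], [2.0, 0.5]]

def T(f):
    """T f(x) = max_a [ r(x,a) + sum_y p(y|x,a) f(y) ]."""
    return [max(R[s][a] + sum(P[s][a][y] * f[y] for y in (0, 1))
                for a in (0, 1)) for s in (0, 1)]

def relative_value_iteration(iters=500):
    """Iterate f <- Tf - (Tf)(0); under (B2) the span contraction makes the
    subtracted constant converge to the optimal average reward v0."""
    f, v0 = [0.0, 0.0], 0.0
    for _ in range(iters):
        g = T(f)
        v0 = g[0]                  # the constant Tf(x) - f(x) at convergence
        f = [x - v0 for x in g]
    return v0, f

v0, f0 = relative_value_iteration()
```

At convergence Tf0 − f0 is the constant v0, so v0 solves the average-reward Bellman equation, with f0 normalized by f0(s0) = 0.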
Combining the above theorem with our main result (Theorem 3.8), we obtain the following
Corollary 3.10
Assume (A1), (A2), (A3), (B1), and (B2). Then
Furthermore, this limit is indeed independent of x ∈ S.
4 Risk-Sensitive Asymptotics of POMDP
This section applies the approach explored in the last section to the partially observable Markov decision process (POMDP). Francesca Albertini, Paolo Dai Pra, and Chiara Prior established such a limit in [1] for processes described by Xn+ 1 = f(Xn,An,Wn), Yn = h(Xn,Vn), where Xn, An, and Yn denote the state, control, and observation, respectively, and W,V are i.i.d. random variables. As for general POMDPs, Di Masi and Stettner proved the existence of the solution to the associated Bellman equation for cost-minimizing problems and stated that the limit as γ → 0 had not been proven (see Remark 2 in [11]). However, the method in [11] cannot be applied to reward-maximizing problems, since it requires the operator induced from the Bellman equation to preserve concavity, whereas in our case the operator is convexity-preserving. Nevertheless, we can prove that, given the existence of a solution to the Bellman equation, (3.4) holds for the maximal reward of POMDPs. A POMDP can be represented as a six-tuple MP = 〈S,A,O,p(⋅|⋅,⋅),q(⋅|⋅),r(⋅,⋅)〉. S is the space of real but unobserved states, A is the action space, and both are assumed to be compact metric spaces. The observation space O is a Polish space. As in MDPs, p is the transition kernel depending on actions, and \(r: S\times A\to \mathbb {R}\) is the reward function. q(⋅|x) denotes the observation probability when the system is in state x ∈ S. As mentioned in the introduction, a widely used technique for analyzing a POMDP is to transfer it into a completely observable MDP. We will also adopt this technique, which allows us to employ the analysis for MDPs in the last section to establish
where
and
The exact definitions of the set π of policies and the expectation operator \( E_{\theta }^{\pi }\) will be given after introducing the assumptions needed in this section.
In order to use the measure transformation technique, we first assume that
-
(C0)
There exists a \({\Lambda }\in \mathcal {P}(O)\) with full support such that for every x ∈ S, q(⋅|x) << Λ.
The corresponding density function is also denoted by q(⋅|⋅), i.e.,
To apply Theorem 3.1 and Proposition 3.3, we make the following assumptions to guarantee that the reward and the transition probability of the transferred MDP satisfy (A1) and (A2).
-
(C1)
r(⋅,⋅) ∈ C(S × A).
-
(C2)
q(y|⋅) ∈Lip(S). There exist qm > 0 and qM > 0 such that q(y|⋅) ≥ qm and \(\left \|q(y|\cdot )\right \|_{L}\leq q_{M}\) for every y ∈ O.
-
(C3)
There exists Kp > 0 such that
$$\|p(\cdot|x,a)-p(\cdot|x^{\prime},a^{\prime})\|_{KR}\leq K_{p}\left[\rho_{S}(x,x^{\prime})+\rho_{A}(a,a^{\prime})\right]$$for any \(x,x^{\prime }\in A\) and \(a,a^{\prime }\in S\), where ∥⋅∥KR denotes the Kantorovich-Rubinstein norm defined by (1.5) on \(\mathcal {P}(S)\), ρS denotes the metric on S, and ρA denotes the metric on A.
To define a probability space and a stochastic process with the desired mechanism, let \({\Omega }_{P}=S\times (A\times S\times O)^{\infty }\) and \({\mathscr{B}}({\Omega }_{P})\) be the product Borel σ-field. Given a sample path ω = (x1,a1,x2,y2,a2,x3,y3,...) ∈ΩP, define Xt := xt,At := at,t ≥ 1, and Yt := yt,t ≥ 2. At each time \(t\in \mathbb {N}\), the system MP occupies a state Xt, which is unobservable. When t = 1, we know the distribution of X1 and then choose an action A1. When t ≥ 2, we can observe a signal Yt generated by Xt and then choose an action At. The optimal policy in a POMDP is usually not a Markovian one due to the unavailability of real states when making decisions. Hence, we introduce the observed-history-dependent policy. Let \(\mathbb {H}_{t}\) denote the set of observed histories up to time \(t\in \mathbb {N}\). Then, \(\mathbb {H}_{1}=\mathcal {P}(S)\) (the set of all the initial state distributions) and \(\mathbb {H}_{t+1}=\mathbb {H}_{t}\times A\times O\). An observed-history-dependent decision rule at time t is a stochastic kernel \(d_{t}\in \mathcal {P}(A|\mathbb {H}_{t})\), where dt(B|ht) denotes the probability of taking an action in \(B\subseteq A\) when observing \(h_{t}=(\theta _{1},a_{1},y_{2},a_{2},y_{3},...,a_{t-1},y_{t})\in \mathbb {H}_{t}\). An observed-history-dependent policy π is a sequence of such decision rules at different times. Let Dt denote all the observed-history-dependent decision rules at time t, and \({\Pi }=\prod\limits_{t=1}^{\infty }D_{t}\) denote all the observed-history-dependent policies. Given an initial distribution of states \(\theta _{1}\in \mathcal {P}(S)\) and a policy π = (d1,d2,...) ∈ Π, a unique probability measure \(\text {P}_{\theta _{1}}^{\pi }\) and the corresponding expectation operator on \({\mathscr{B}}({\Omega }_{P})\) are defined by the Ionescu-Tulcea theorem, such that for each t ≥ 1,
The risk-sensitive criterion introduced in Section 2 is to optimize
while the typical optimal average reward is
Let \(\lambda _{P}(\gamma ):=\sup _{\theta \in \mathcal {P}(S)}\lambda _{P}(\theta , \gamma )\) and \(v_{P}:=\sup _{\theta \in \mathcal {P}(S)}v_{P}(\theta )\). We intend to apply Theorem 3.1 and Proposition 3.3 to prove the risk-sensitive asymptotics
It has already been shown that optimal control of a POMDP MP under the average reward criterion can be converted to the optimal control of a properly transferred MDP M0 (see, e.g., Section 7.2.1 in [3], and Section 5.3, pp. 157–159 in [5]), where the new states are the conditional distributions of real states given the observed history. The transition law p0 and one-step reward r0 of M0 are
where \({\Delta }^{(0)}_{a,y,\theta }\) is a measure on \(\mathcal {P}(S)\) defined by
and \(T^{*}_{a,y,0}\) is an operator on \({\mathscr{M}}^{+}(S)\) given by
In the case of risk-sensitive control, the transformed MDP is slightly different from the typical form of average reward control (see [4]). We present the transformation procedure with our notations. Assuming (C0), (C1), (C2), and (C3), we derive the new state and the corresponding transition mechanism of the transferred MDP first. For t ≥ 1, define two filters \(\mathcal {F}_{t}\) and \(\mathcal {G}_{t}\) by
respectively. Let Ht = (𝜃1,A1,Y2,...,At− 1,Yt) denote the observed history up to time t. Since \(\theta _{1}\in \mathcal {P}(S)\) is fixed, \(\mathcal {F}_{t}=\sigma (H_{t})\). Define another probability measure \(\widetilde {P}_{\theta _{1}}^{\pi }\) on \({\mathscr{B}}({\Omega }_{P})\) by
or equivalently,
Since S,A,O are all Polish spaces, ΩP is also Polish. Thus, the following regular conditional expectations on \(({\Omega }_{P},{\mathscr{B}}({\Omega }_{P}),\widetilde {P}_{\theta _{1}}^{\pi })\) for bounded Borel functions f on S
have regular versions. Therefore, we can define an \({\mathscr{M}}^{+}(S)\)-valued process \(\{\psi ^{(\gamma )}_{t}\}\) by
where f is bounded and measurable on S. For a ∈ A,y ∈ O, define Ta,y,γ as an operator on the space of bounded Borel functions on S by
Noticing that under \(\widetilde {P}_{\theta _{1}}^{\pi }\), Yt+ 1 is independent of Xs+ 1,As+ 1 and Ys with s ≤ t, and Xt+ 1 depends on σ(Gt ∪ σ(At,Yt+ 1)) only through Xt and At, we have
Therefore, we have
where \(T^{*}_{a,y,\gamma }\) is the adjoint operator of Ta,y,γ defined on \({\mathscr{M}}^{+}(S)\) by
From (C1), we know that \(\psi ^{(\gamma )}_{t}(S)\) is finite and strictly positive. Hence, we can define a new state process \(\{\theta ^{(\gamma )}_{t}\}\) taking values in \(\mathcal {P}(S)\) by
We call \(\{\theta ^{(\gamma )}_{t}\}\) the information state process since it represents the cumulative-reward-weighted conditional distribution of the real state given the observed history. The information state \(\theta ^{(\gamma )}_{1}\) at time t = 1 is still 𝜃1. Notice that the operator \(T^{*}_{a,y,\gamma }\) is positively 1-homogeneous. We have
which implies the transition mechanism of \(\theta ^{(\gamma )}_{t}\). As for the new reward function, we define
Then we have
It then follows that
Hence, we can consider Gγ as the new reward. Now, we can transfer MP to the following completely observable model \(M^{\prime }_{\gamma }\) with state space \(\mathcal {P}(S)\) and action space A:
-
1.
The initial information state is 𝜃1.
-
2.
At time t, given the current information state \(\theta ^{(\gamma )}_{t}\), we take action At according to a pre-specified policy. Then, the system generates Yt+ 1, which is independent of \(\theta ^{(\gamma )}_{s}, A_{s}\), and Ys with s ≤ t, and distributed according to the law Λ. The next information state \(\theta ^{(\gamma )}_{t+1}\) is determined by
$$ \theta^{(\gamma)}_{t+1}=\frac{T^{*}_{A_{t},Y_{t+1},\gamma}(\theta^{(\gamma)}_{t})}{T^{*}_{A_{t},Y_{t+1},\gamma}(\theta^{(\gamma)}_{t})(S)}. $$ -
3.
Once Yt+ 1 is generated, the next state \(\theta ^{(\gamma )}_{t+1}\) is then obtained according to (4.9), and simultaneously the system generates one-step reward \(G_{\gamma }(A_{t},Y_{t+1},\theta ^{(\gamma )}_{t})\).
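For a finite POMDP, steps 2-3 above can be sketched numerically. The exact form of \(T^{*}_{a,y,\gamma }\) comes from the displayed equations; the update below is one plausible finite-state rendering (a reward-tilted prediction step followed by observation-likelihood reweighting), so treat the indexing and weighting as assumptions rather than the paper's verbatim formulas:

```python
import numpy as np

def info_state_update(theta, a, y, gamma, p, q, r):
    """One step of the information-state recursion (4.9), finite-state sketch.
    theta : current information state, shape (|S|,)
    p     : transition tensor, p[i, a, j] = p(x_j | x_i, a)
    q     : observation densities w.r.t. Lambda, q[y, j] = q(y | x_j)
    r     : reward matrix, r[i, a]
    Unnormalized update: psi[j] = q[y, j] * sum_i theta[i] e^{gamma r[i, a]} p[i, a, j];
    the next information state is psi / psi(S).  With gamma = 0 this reduces to
    the ordinary Bayes filter of the average-reward transfer."""
    psi = q[y] * ((theta * np.exp(gamma * r[:, a])) @ p[:, a, :])
    return psi / psi.sum()

# toy two-state, one-action model (all numbers hypothetical)
p = np.zeros((2, 1, 2))
p[0, 0] = [0.7, 0.3]
p[1, 0] = [0.4, 0.6]
q = np.array([[0.9, 0.2], [0.1, 0.8]])   # rows: observations y1, y2
r = np.array([[0.0], [1.0]])
theta_next = info_state_update(np.array([0.5, 0.5]), 0, 0, 0.0, p, q, r)
```

Observing y1 (which is more likely under x1) shifts the information state toward the first state, as expected of a filter.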
Remark 4.1
The one-step reward Gγ in (4.10) depends not only on the state 𝜃 and the action A but also on an independent signal Y under \(\widetilde {\text {P}}_{\theta _{1}}^{\pi }\), which is slightly different from the typical form. We make the following changes to have a reward and a transition probability in a standard form in which assumptions (A1) and (A2) can be verified.
Define the completely observable Markov decision model Mγ with transition law pγ and one-step reward rγ by
where \({\Delta }^{(\gamma )}_{a,y,\theta }\) is a measure on \(\mathcal {P}(S)\) defined by
Use \( E_{\gamma ,\theta }^{\pi }\) to denote the expectation operator with respect to the transition probability pγ with initial state 𝜃 and policy π. Since Mγ is an MDP, we consider the Markov policies of Mγ, which consist of decision rules choosing actions only through the current information state 𝜃. Such policies are also called separated policies of the original model MP (see, e.g., [19]). We let DS denote the set of all the Markov decision rules of Mγ and \({\Pi }_{S}=(D_{S})^{\infty }\). Then for \(\pi _{S}\in {\Pi }_{S}\), by direct calculation, we have
for any bounded Borel function f on \(\mathcal {P}(S)\). ΠS is a subset of Π since 𝜃t is \(\mathcal {F}_{t}\)-adapted. Hence, from (4.12) and (4.15), we have for \(\pi _{S}\in {\Pi }_{S}\),
We define λS as the optimal value of separated policies, which is
We will show that under (C0), (C1), (C2), and (C3), λS = λP and (4.3) hold if there exists K > 0 such that for every γ ∈ (0,K), the Bellman equation
has a solution ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0. We first verify that rγ satisfies (A1) and pγ satisfies (A2), which implies that the corresponding operator \(L^{(\gamma )}_{P}\) on \(C(\mathcal {P}(S))\), defined by
maps \(C(\mathcal {P}(S))\) into itself.
Lemma 4.1
Assume (C1). Then rγ(𝜃,a) is continuous in (𝜃,a).
Proof
For 𝜃n → 𝜃 weakly and an → a, we have
The second term tends to 0 due to the weak convergence of 𝜃n while the first term tends to 0 because of the uniform continuity of r(⋅,⋅) on the compact set S × A. Hence, \(e^{r_{\gamma }(\theta ,a)}\) is continuous in (𝜃,a). Since \(e^{r_{\gamma }}\geq e^{r_{m}}>0\), we see that rγ(𝜃,a) is continuous in (𝜃,a). □
Lemma 4.2
Assume (C0), (C1), (C2), and (C3). Then \((\theta ,a)\mapsto {\int \limits }_{\mathcal {P}(S)}p_{\gamma }(d\theta ^{\prime }|\theta ,a)f(\theta ^{\prime })\) is continuous in (𝜃,a) for \(f\in C(\mathcal {P}(S))\).
Proof
Recall that rM and rm are the supremum and infimum of r, respectively. Fix \(f\in C(\mathcal {P}(S))\). From (4.13) and direct calculation, we see that
where \(e^{-r_{\gamma }(\theta ,a)}\) is continuous in (𝜃,a) by Lemma 4.1. It suffices to show that
is continuous in (𝜃,a). Once \(\left \{(\theta ,a)\mapsto T^{*}_{a,y,\gamma }(\theta ),y\in O\right \}\) is equicontinuous, then from the uniform continuity of \(f\in C(\mathcal {P}(S))\) (\(\mathcal {P}(S)\) is compact since S is compact) and the fact that
uniformly, we can see that
is equicontinuous, which proves this lemma. Now we prove that \(\left \{(\theta ,a)\mapsto T^{*}_{a,y,\gamma }(\theta ),y\in O\right \}\) is equicontinuous. Fix \(\theta \in \mathcal {P}(S)\) and a ∈ A. For 𝜃n → 𝜃 weakly and an → a, we have
The first term \(\left \|T^{*}_{a_{n},y,\gamma }(\theta _{n})-T^{*}_{a,y,\gamma }(\theta _{n})\right \|_{KR}\) is
Notice that ∥g(⋅)∥L ≤ 1 and ∥q(y|⋅)∥L ≤ qM imply ∥g(⋅)q(y|⋅)∥L ≤ 2qM. Thus, by (C1), (C2), and (C3), we have that
and
holds for every y ∈ O. Hence, from the uniform continuity of r(⋅,⋅), we know that
The second term \(\left \|T^{*}_{a,y,\gamma }(\theta _{n})-T^{*}_{a,y,\gamma }(\theta )\right \|_{KR}\) is
Define two finite measures \({\mu ^{a}_{n}}(dx):=\theta _{n}(dx)e^{\gamma r(x,a)}\) and μa(dx) := 𝜃(dx)eγr(x,a). Then from the continuity of r(⋅,a), we know that \({\mu ^{a}_{n}}\to \mu ^{a}\) weakly, which means that \(\lim \limits _{n\to \infty }\|{\mu ^{a}_{n}}-\mu ^{a}\|_{KR}=0\). Thus, it only remains to verify that, whenever ∥g∥L ≤ 1, the Lipschitz constant \(\left \|{\int \limits }_{S}q(y|x^{\prime })p(dx^{\prime }|x, a)g(x^{\prime })\right \|_{L}\), taken as a function of x, is bounded by a constant independent of y. First, it is obvious that \(\left |{\int \limits }_{S}q(y|x^{\prime })p(dx^{\prime }|x,a)g(x^{\prime })\right |\leq q_{M}\). As for the Lipschitz constant, we have for x1,x2 ∈ S,
Consequently,
(4.20) and (4.21) show that \(\left \{(\theta ,a)\mapsto T^{*}_{a,y,\gamma }(\theta ),y\in O\right \}\) is equicontinuous and the lemma is proved. □
Since rγ satisfies (A1) and pγ satisfies (A2), we can apply Proposition 3.3 to obtain the variational formula for λS.
Theorem 4.3
Assume (C0), (C1), (C2), and (C3). If there exist ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0 satisfying \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\), then
holds, where
and the notations β2 and \(\beta ^{\prime }\) are defined by (3.3).
Proof
Lemmas 4.1 and 4.2 imply that Mγ satisfies (A1) and (A2). Since \(\mathcal {P}(S)\) is compact, by Proposition 3.3, we have
Hence, (4.22) holds. □
By Hölder’s inequality, for \(\gamma \geq \gamma ^{\prime }>0\), we have \(\lambda _{P}(\gamma )\geq \lambda _{P}(\gamma ^{\prime })\geq v_{P}\). Therefore, λP(γ) is non-decreasing in γ and \(\lim \limits _{\gamma \to 0}\lambda _{P}(\gamma )\geq v_{P}\). To get the desired assertion, we need that λS = λP.
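The Hölder-inequality monotonicity used here is easy to see numerically: γ ↦ (1/γ) log E e^{γX} is nondecreasing and dominates E X. A quick Monte Carlo check for a standard normal reward, where the exact value is γ/2:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)   # rewards X ~ N(0, 1), so E X = 0

def risk_sensitive_value(gamma, xs):
    # (1/gamma) * log E[exp(gamma * X)], estimated from the samples
    return float(np.log(np.mean(np.exp(gamma * xs))) / gamma)

vals = [risk_sensitive_value(g, samples) for g in (0.1, 0.5, 1.0, 2.0)]
# exact values are gamma / 2 = 0.05, 0.25, 0.5, 1.0, nondecreasing in gamma
```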
Theorem 4.4
Assume (C0), (C1), (C2), and (C3). If there exist ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0 satisfying that \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\), then λS(γ) = λP(γ).
Proof
We first show that it is true in the finite-horizon case. Fix N > 0 and an observed-history-dependent policy π = (d1,...,dN,...). By (4.11), we have that
Hence
For every \(\psi \in {\mathscr{M}}^{+}(S)\) and ε > 0, there exists \(d_{N}^{*}(\psi )\in A\) such that
\(d_{N}^{*}\) actually depends only on \(\theta =\frac {\psi }{\psi (S)}\). Given \(\psi ^{\prime }\neq \psi \) with \(\frac {\psi ^{\prime }}{\psi ^{\prime }(S)}=\theta =\frac {\psi }{\psi (S)}\), we can assume that
since \(e^{(N-1)\gamma r_{m}}\leq \psi ^{(\gamma )}_{N}(S)\leq e^{(N-1)\gamma r_{M}}\). Thus, for any a ∈ A, we have
Hence,
Thus, we see that for every \(\psi \in {\mathscr{M}}^{+}(S)\), there exists \(d_{N}^{*}(\theta )\in A\) depending only on \(\theta =\frac {\psi }{\psi (S)}\) such that
Now, modifying the policy π by simply replacing dN with \(d^{*}_{N}\) to get a new policy \(\pi ^{*}_{N}=(d_{1},...,d_{N-1},d^{*}_{N},...)\), we obtain that
Continue this procedure by successively replacing dj with \(d^{*}_{j}\) for j = N − 1,⋯ ,1, with each \(d^{*}_{j}\) depending only on the information state and satisfying that
where \(\pi ^{*}_{n}=(d_{1},...,d_{n-1},d^{*}_{n},...,d^{*}_{N},...)\). In this way, with \(\pi _{1}^{*}=(d^{*}_{1},d^{*}_{2},...d^{*}_{N},...)\), we obtain that
Noticing that the decision rules after time N are irrelevant and recalling (4.12), we have proved that
for any ε > 0. Obviously,
Consequently,
Letting \(N\to \infty \), we obtain that
Thus, it suffices to verify that
Recall that by assumption, we have \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\) with ργ > 0 and fγ >> 0. From Lemmas 4.1 and 4.2, we know that
is continuous in a. Due to the compactness of A, for every \(\theta \in \mathcal {P}(S)\), there exists d∗(𝜃) ∈ A such that
Let \(\pi ^{*}=(d^{*})^{\infty }\). Similarly to the argument used to derive (2.17) in the proof of Theorem 2.2, we see that
Hence, the inequalities in (4.24) are equalities, which gives that λS(γ) = λP(γ). □
With Theorems 4.3 and 4.4, using a similar argument as in the proof of Theorem 3.1, we can now extend the risk-sensitive asymptotics to POMDPs, which is the main result of this section.
Theorem 4.5
Assume (C0), (C1), (C2), and (C3). If there exists K > 0 such that for every γ ∈ (0,K), there exist ργ > 0 and \(f_{\gamma }\in C(\mathcal {P}(S))\) with fγ >> 0 satisfying \(\rho _{\gamma } f_{\gamma }=L^{(\gamma )}_{P}f_{\gamma }\), then
Before proving Theorem 4.5, we present a lemma to show that pγ converges to p0 weakly and uniformly, where p0 is defined in (4.4).
Lemma 4.6
Assume (C0), (C1), (C2), and (C3). Then pγ(⋅|𝜃,a) weakly converges to p0(⋅|𝜃,a), uniformly in \(\theta \in \mathcal {P}(S)\) and a ∈ A, i.e., if \(f\in C(\mathcal {P}(S)\times A\times \mathcal {P}(S))\), then
Proof
Fix an \(f\in C(\mathcal {P}(S)\times A\times \mathcal {P}(S))\). Then
Since
it follows that \(e^{r_{\gamma }(\theta ,a)}\) converges uniformly to 1 as γ → 0. Then, similarly as in the proof of Lemma 4.2, it suffices to verify that \(T^{*}_{a,y,\gamma }(\theta )\) converges weakly to \(T^{*}_{a,y,0}(\theta )\), uniformly in a ∈ A,y ∈ O, and \(\theta \in \mathcal {P}(S)\). In fact, recalling the definition of the Kantorovich-Rubinstein norm, we see that
Consequently,
and thus (4.28) follows. □
Proof of Theorem 4.5
We have already seen that \(\lim \limits _{\gamma \to 0}\lambda _{P}(\gamma )\geq v_{P}\). Hence, by Theorem 4.4, it suffices to verify that
From Theorem 4.3, we know that for any ε > 0 and γ > 0, there exists \(\beta _{\gamma }^{\varepsilon }\in \mathcal {I}\) such that
Recall that \(\mathcal {I}\subseteq \mathcal {P}(\mathcal {P}(S)\times A\times \mathcal {P}(S))\) is compact. We can find a sequence \(\{\gamma _{n}\}_{n\in \mathbb {N}}\) monotonically tending to 0 and a \(\beta ^{\varepsilon }\in \mathcal {I}\) such that
weakly. Since the relative entropy is non-negative, we obtain that
Monotonicity follows from Hölder’s inequality. Thus, we have as γn → 0 that
where r0 is defined in (4.4). Hence, by Dini’s theorem, \(\gamma _{n}^{-1}r_{\gamma _{n}}\) converges to r0 uniformly. From the weak convergence of \(\beta _{\gamma _{n}}^{\varepsilon }\), we then obtain that
Now, we claim that \(\left (\beta ^{\varepsilon }\right )_{2}=p_{0}\), where p0 is defined in (4.4). Notice that
From the lower semicontinuity of D(⋅∥⋅) and Lemma 4.6, we deduce that
Thus, if \(\left (\beta ^{\varepsilon }\right )_{2}\neq p_{0}\), then
and we would have
This is impossible, so \(\beta ^{\varepsilon }\in \mathcal {I}\) and \(\left (\beta ^{\varepsilon }\right )_{2}=p_{0}\). Now we can employ the same argument as used in the proof of Lemma 3.2 to derive that
Then (4.27) follows by letting ε → 0. □
Remark 4.2
From the proof of Theorem 4.5, we can see that the existence of a solution to the risk-sensitive Bellman equation guarantees the existence of the invariant probability measure for p0.
We end this section with a simple example.
Example 4.7
Consider a finite POMDP with S = {x1,x2},A = {a1,a2},O = {y1,y2}. The transition probability p, observation probability q, and reward r are described by
The state space of the transferred MDP is \(\mathcal {P}(S)\), which is isomorphic to [0,1]. We use (1 − t,t) to denote the probability distribution in \(\mathcal {P}(S)\), where 0 ≤ t ≤ 1 represents the probability assigned on x2. On the one hand, to apply Theorem 4.5, by straightforward calculations, we have for f ∈ C([0,1]),
Let
We can verify that \(L^{(\gamma )}_{P}f_{\gamma }=\rho _{\gamma }f_{\gamma }\). Then all the assumptions in Theorem 4.5 are fulfilled. Thus,
On the other hand, given an initial distribution t ∈ [0,1], we can see that the optimal average reward is vP(t) = t. Hence, \(v_{P}=\sup _{t\in [0,1]}v_{P}(t)=1\), which coincides with Theorem 4.5. Furthermore, this example illustrates that there are circumstances in which the optimal risk-sensitive reward is independent of the initial distribution while the optimal average reward is not.
5 A Portfolio Optimization Example
In this section, as an example of applications of the approach developed in the previous sections, we consider a portfolio optimization problem. Consider a market with m securities and k price-affecting factors. Let V (n) denote the portfolio’s value at time n. We assume that the portfolio dynamics are determined by
where X(n) = (X1(n),...,Xk(n)) denotes the factor process, which is a Markov chain with transition kernel \(P(dx^{\prime }|x)\), H(n) = (H1(n),...,Hm(n)) represents the portfolio strategy, i.e., the proportions of capital invested in the m securities at time n, and {W(n),n ≥ 1} is the i.i.d. random noise, which is independent of the factor process and has a common law η. F is a Borel measurable function. X(n) and H(n) take values in some compact subsets \(S\subset \mathbb {R}^{k}\) and \(A\subset \mathbb {R}^{m}\), respectively, while the noise W(n) takes values in a Polish space Z. This model was extensively studied in [26] for the dual relationship between maximizing the probability of outperforming a given benchmark and optimizing the long-term risk-sensitive reward. In this section, we will demonstrate that our approach guarantees the convergence of the optimal risk-sensitive reward to the optimal risk-neutral reward as the risk-sensitive factor tends to 0. As a consequence, we show that the optimal risk-neutral reward can be taken as a benchmark appearing in the duality mentioned above, complementing the studies of [26]. Given an initial state X(1) = x, we use Px to denote the corresponding probability measure on \({\mathscr{B}}((S\times Z)^{\infty })\) and Ex the expectation under Px. Let \(\mathcal {A}\) denote the set of all Markov portfolio strategies. Given a risk factor γ > 0, the risk-sensitive optimal value is
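Since the displayed dynamics are omitted here, take as a working assumption the standard form of this literature, in which F(X(n),H(n),W(n)) is the one-step log-growth of V. Then λ(γ,x) can be estimated by Monte Carlo for a fixed strategy; all numerical values below (two factor states, Gaussian noise, per-state drifts) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

P = np.array([[0.9, 0.1], [0.2, 0.8]])   # hypothetical factor-chain transitions
mu = np.array([0.05, -0.02])             # per-state mean log-growth under a fixed strategy
sigma = 0.1                              # noise scale

def growth_rate(gamma, n_steps=200, n_paths=20_000):
    """Monte Carlo estimate of (1 / (gamma * N)) log E_x exp(gamma * sum_n F_n),
    with F(x, h, w) = mu[x] + sigma * w standing in for the omitted reward."""
    state = np.zeros(n_paths, dtype=int)              # all paths start in factor state 0
    total = np.zeros(n_paths)
    for _ in range(n_steps):
        total += mu[state] + sigma * rng.normal(size=n_paths)
        state = (rng.random(n_paths) < P[state, 1]).astype(int)  # jump to state 1 w.p. P[x, 1]
    return float(np.log(np.mean(np.exp(gamma * total))) / (gamma * n_steps))
```

As γ decreases, the estimate approaches the risk-neutral average growth rate (here roughly 0.027, the stationary average of mu), in line with Theorem 5.1.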
In what follows, we state the two assumptions on F and P for fitting (A1), (A2), and (A3).
-
(H1)
For each w ∈ Z, F(⋅,⋅,w) ∈ C(S × A), and there is an η-integrable random variable g(w) such that
$$ e^{|F(x,h,w)|}\leq g(w)\ \ \forall x\in S,\ h\in A \text{ and } w\in Z; $$(5.3) -
(H2)
The family of functions \(\left \{x\mapsto {\int \limits } f(x^{\prime })P(dx^{\prime }|x), f\in C(S),\left \|f\right \|\leq 1\right \}\) is equicontinuous.
Remark 5.1
-
(1)
If (H1) holds, then |F(x,h,w)|≤ g(w) and hence \(\hat F(x,h):= {\int \limits } F(x,h,\cdot )d\eta \) is bounded and continuous in (x,h). Let Fm and FM be the infimum and supremum of \(\hat F\). Then from Jensen’s inequality, it follows that for γ > 0
$$ \log \int e^{\gamma F(x,h,\cdot)}d\eta\geq \gamma F_{m}. $$(5.4) -
(2)
A particular case in which (H2) is true is that \(P(dx^{\prime }|x)=Q(x^{\prime }|x){\Lambda }(dx^{\prime })\) with \(\{Q(x^{\prime }|\cdot ),x^{\prime }\in S\}\) equicontinuous and \({\Lambda }\in \mathcal {P}(S)\).
The one-step reward F in (5.1) depends not only on the state x and the action h but also on W, which is slightly different from the typical form. We make the following changes to get a reward in such a standard form. Define a new Markov decision model with the transition law p(γ) and the one-step reward r(γ) defined respectively by
By a direct calculation, we see that \(p^{(\gamma )}(dx^{\prime }|x,h)\) is actually \(P(dx^{\prime }|x)\), and for any N ≥ 1
Notice that the transition kernel of this MDP is still P, but the reward is r(γ) instead of γ ⋅ r. Assumption (H1) implies that r(γ)(x,h) is continuous in (x,h). Thus, with an extra discussion about the convergence of \(\frac {1}{\gamma }r^{(\gamma )}\) as γ → 0, we can obtain the limit with the same argument as the one in the proof of Theorem 3.1. In particular, it is not hard to check that (H1) and (H2) imply that r(γ) and P satisfy (A1), (A2), and (A3). Therefore, setting the risk-sensitive coefficient in Theorem 3.7 to be one and then dividing both sides by γ, we have the following variational formula for \(\lambda (\gamma )=\sup _{x\in S}\lambda (\gamma ,x)\).
where \(\mathcal {I}\) is defined in (3.2). Although Theorem 3.8 cannot be directly applied due to the difference between (5.7) and (3.1), the risk-neutral limit \(\lim _{\gamma \to 0}\lambda (\gamma )\) can still be derived by an argument similar to the one used in proving Theorem 3.1. To see this, we still use v to denote the average optimal return, i.e.,
Theorem 5.1
Assume (H1) and (H2). Then
Proof
By Hölder’s inequality, we see that λ(γ) is nondecreasing in γ and \(\liminf \limits _{\gamma \to 0}\lambda (\gamma )\geq v\). We will apply (5.7) to prove that \(\limsup \limits _{\gamma \to 0}\lambda (\gamma )\leq v\). Similarly to the argument in the proof of Theorem 3.1, for any ε > 0, we can find a sequence \(\{\gamma _{n}\}_{n\in \mathbb {N}}\) decreasing to 0 with \(\lim \limits _{\gamma \to 0}\lambda (\gamma )=\lim \limits _{n\to \infty }\lambda (\gamma _{n})\) and \(\beta _{\gamma _{n}}^{\varepsilon },\beta ^{\varepsilon }\in \mathcal {I}\) with \(\lim \limits _{n\to \infty }\beta _{\gamma _{n}}^{\varepsilon }=\beta ^{\varepsilon }\) weakly such that
Therefore,
Monotonicity follows from Hölder’s inequality, and thus we have as γn → 0 that
Therefore, it follows from Dini’s theorem that \(\frac {1}{\gamma _{n}}r^{(\gamma _{n})}\) converges to r(0) uniformly. Combining this fact with the weak convergence of \(\beta _{\gamma _{n}}^{\varepsilon }\), we obtain that
Now we claim that \(\left (\beta ^{\varepsilon }\right )_{2}=P\). Indeed, from the joint semicontinuity of the relative entropy, we see that
Thus, if \(\left (\beta ^{\varepsilon }\right )_{2}\neq P\), then
From assumption (H1), (5.4), (5.5), and (5.6), together with (5.2) and (5.10), we would have
This is impossible, so \(\beta ^{\varepsilon }\in \mathcal {I}\) and \(\left (\beta ^{\varepsilon }\right )_{2}=P\). Then it is routine to follow the same argument as that of Theorem 3.2 to check that \({\int \limits }_{S\times A}r^{(0)}(x,h)(\beta ^{\varepsilon })'(dx,dh)\leq v\). Consequently, (5.9) follows by letting ε → 0. □
As claimed in the introduction, it was shown in [18, 24], and [26] that the risk-sensitive portfolio optimization is a dual problem to the maximization of the outperformance probability (upside chance) when assuming differentiability for the optimal value. To describe this more precisely, for \(b\in \mathbb {R},\gamma >0,x\in S\), define Ix(b) by
Then by Chebyshev’s inequality,
Thus,
for a pre-specified K > 0. Let Λ(γ) := γλ(γ) for convenience. It has already been established in [26] that if Λ(γ) is differentiable on [0,K) and the limit
exists and does not depend on the initial state x, then the duality
holds whenever \(b\in \{{{\Lambda }^{\prime }}^{+}(\gamma ),\gamma \in [0,K)\}\) or \(b\leq {{\Lambda }^{\prime }}^{+}(0)\), where \({{\Lambda }^{\prime }}^{+}(\gamma )\) denotes the right-hand derivative of Λ(γ) (see Theorem 2.7 in [26]). In the meantime, our result shows that under (H1) and (H2),
which reveals the connection between the outperformance probability, the risk-neutral average return, and the risk-sensitive average growth rate. In order to guarantee the differentiability, we add the following assumptions for P and r(γ) in accordance with Theorem 3.1 in [26], which also implies that the transition law satisfies (B2).
-
(H3)
There exists δp < 1 such that
$$ \sup_{U\in \mathcal{B}(S)}\sup_{x,x^{\prime}\in S}[P(U|x)-P(U|x^{\prime})]\leq \delta_{p}. $$(5.15) -
(H4)
There exists a Kγ > 0 such that the mapping \(\gamma \mapsto \sup \limits _{h\in A}r^{(\gamma )}(x,h)\) is differentiable on [0,Kγ) for any x ∈ S.
Remark 5.2
Let Fm and FM be defined in Remark 5.1(1). Then (H1), (H2), (H3), and the condition \(\gamma \leq -\frac {\log \delta _{p}}{F_{M}-F_{m}}\) guarantee that the limit inferior in (5.2) is actually a limit and λ(γ,x) does not depend on x (see Theorem 1 in [13]). So the constant K in (5.12) can be determined.
Theorem 5.2
Assume (H1), (H2), (H3), and (H4). Let \(K=\min \limits \{-\frac {\log \delta _{p}}{F_{M}-F_{m}}, K_{\gamma }\}\). Then v(x) = v is a constant, and the duality (5.14) holds for every b ≤ v.
Proof
Combining Theorem 3.1 in [26] and the above remark, we see that Λ(γ) is differentiable on [0,K). Thus, by Theorem 2.7 in [26],
holds for every \(b\leq {{\Lambda }^{\prime }}^{+}(0)\). Theorem 3.7 implies that
where v(x) is indeed a constant due to (H3) and Lemma 3.9. This completes the proof. □
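The duality (5.14) can be checked in closed form in a toy i.i.d. case. With hypothetical normal log-returns N(mu_r, sigma_r²), the limit Λ(γ) = γ·mu_r + γ²σ²/2 is exact, and the Legendre-type supremum over γ recovers the Gaussian rate function (b − mu_r)²/(2σ²) for targets b above the mean:

```python
import numpy as np

mu_r, sigma_r = 0.02, 0.1        # hypothetical i.i.d. log-return parameters
b = 0.05                          # target growth rate above the mean

def Lambda(gamma):
    # limiting log-moment generating function for N(mu_r, sigma_r^2) increments
    return gamma * mu_r + 0.5 * (gamma * sigma_r) ** 2

gammas = np.linspace(1e-4, 20.0, 200_001)
rate_numeric = float(np.max(gammas * b - Lambda(gammas)))
rate_exact = (b - mu_r) ** 2 / (2.0 * sigma_r ** 2)   # = 0.045, attained at gamma = 3
```

The numeric supremum over the γ-grid matches the closed-form rate, illustrating how maximizing γb − Λ(γ) over the risk-sensitive factor yields the decay rate of the outperformance probability.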
We end this section with an illustrative example.
Example 5.3
Let S = {− 1,1}, A = {(h1,h2),hi ≥ 0, h1 + h2 = 1}, and {Wn,n ≥ 1} be i.i.d. with the standard normal distribution N(0,1). F is given by
where α ∈ (0,1/2) is a constant. Then \(e^{|F(x,h,w)|}\leq g(w)=e^{\alpha w^{2}}\) with g being η-integrable. Thus (H1), (H2), and (H4) are fulfilled, and (5.9) holds. If the transition probabilities are chosen to satisfy that
then (H3) is also satisfied; therefore, the assertions of Theorem 5.2 hold true.
Data Availability
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
References
Albertini F, Dai Pra P, Prior C. Small parameter limit for ergodic, discrete-time, partially observed, risk-sensitive control problems. Math Control Signals Syst 2001;14(1):1–28.
Anantharam V, Borkar V S. A variational formula for risk-sensitive reward. SIAM J Control Optim 2017;55(2):961–988.
Arapostathis A, Borkar V S, Fernández-Gaucherand E, Ghosh M K, Marcus S I. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J Control Optim 1993;31(2):282–344.
Baras J S, James M R. Robust and risk-sensitive output feedback control for finite state machines and hidden Markov models (summary). J Math Syst Estimation Control 1997;7:371–374.
Bäuerle N, Rieder U. Markov decision processes with applications to finance. Berlin: Springer; 2011.
Bäuerle N, Rieder U. Partially observable risk-sensitive Markov decision processes. Math Oper Res 2017;42(4):1180–1196.
Bogachev V I. Measure Theory, vol II. Berlin: Springer; 2007.
Cavazos-Cadena R, Hernández-Hernández D. Successive approximations in partially observable controlled Markov chains with risk-sensitive average criterion. Stochast: an Int J Probab Stochast Process 2005;77(6):537–568.
Dembo A, Zeitouni O. Large deviations techniques and applications. Stochastic modelling and applied probability, 38. Berlin: Springer; 2010.
Di Masi G B, Stettner L. Risk-sensitive control of discrete-time Markov processes with infinite horizon. SIAM J Control Optim 1999;38(1):61–78.
Di Masi G B, Stettner L. Risk sensitive control of discrete time partially observed Markov processes with infinite horizon. Stochast: an Int J Probab Stochast Process 1999;67(3-4):309–322.
Di Masi G B, Stettner L. Remarks on risk neutral and risk sensitive portfolio optimization. In From stochastic calculus to mathematical finance. pp 211-226. Berlin: Springer; 2006.
Di Masi G B. Infinite horizon risk sensitive control of discrete time Markov processes with small risk. Syst Control Lett 2000;40(1):15–20.
Donsker M D, Varadhan S S. On a variational formula for the principal eigenvalue for operators with maximum principle. Proc Natl Acad Sci 1975; 72(3):780–783.
Dupuis P, Ellis R S. A weak convergence approach to the theory of large deviations. New York: Wiley; 1997.
Fleming W H, Hernández-Hernández D. Risk-sensitive control of finite state machines on an infinite horizon I. SIAM J Control Optim 1997;35(5): 1790–1810.
Fleming W H, Hernández-Hernández D. Risk-sensitive control of finite state machines on an infinite horizon II. SIAM J Control Optim 1999;37(4): 1048–1069.
Hata H, Nagai H, Sheu S J. Asymptotics of the probability minimizing a “down-side” risk. Annals Appl Probab 2010;20(1):52–89.
Hernández-Hernández D. Partially observed control problems with multiplicative cost. Stochastic analysis, control, optimization and applications. pp 41-55. Birkhäuser, Boston; 1999.
Hernández-Lerma O, Lasserre J B. Discrete-time Markov control processes: basic optimality criteria. Vol 30. Berlin: Springer; 2012.
Howard R A, Matheson J E. Risk-sensitive Markov decision processes. Manag Sci 1972;18(7):356–369.
Jaśkiewicz A. Average optimality for risk-sensitive control with general state space. Ann Appl Probab 2007;17(2):654–675.
Ogiwara T. Nonlinear Perron-Frobenius problem on an ordered Banach space. Japan J Math 1995;21(1):43–103.
Pham H. A risk-sensitive control dual approach to a large deviations control problem. Syst Control Lett 2003;49(4):295–309.
Sion M. On general minimax theorems. Pac J Math 1958;8(1):171–176.
Stettner L. Duality and risk sensitive portfolio optimization. Contemp Math 2004;351:333–348.
Acknowledgements
This paper is part of the first author’s dissertation. The authors are grateful to the referee for the careful review of the manuscript and for the helpful comments and suggestions for improvement.
Funding
This work is supported by the NSFC 11671226.
Appendix: A nonlinear extension of the Kreĭn-Rutman theorem
The following theorem is an extension of the Kreĭn-Rutman theorem for nonlinear operators on Banach spaces (see Proposition 3.1.5, Lemma 3.1.3, and Lemma 3.1.7 in [23]). An ordered Banach space is a real Banach space (X,∥⋅∥) endowed with an order cone K, which means K is a closed convex cone with vertex at 0 such that K ∩ (−K) = {0}. Assume interior(K)≠∅ and \(\dim X\geq 2\). For x1,x2 ∈ X, we write x1 ≥ x2 if x1 − x2 ∈ K, x1 > x2 if x1 − x2 ∈ K∖{0}, and x1 >> x2 if x1 − x2 ∈ interior(K). For an operator L : K → K, define
Since \(\|L^{m+n}\|_{+}\leq \|L^{m}\|_{+}\|L^{n}\|_{+}\) for m,n ≥ 1, we can define
The following properties for operator L on K will be required:
-
1.
(Compactness) L is a compact operator, meaning that L maps any bounded subset into a relatively compact one.
-
2.
(Positive 1-homogeneity) c(Lx) = L(cx) for any x ∈ K,c > 0.
-
3.
(Order-preserving) Lx1 ≥ Lx2 for any x1 ≥ x2,x1,x2 ∈ K.
Theorem A.1
Let L : K → K be a compact, positively 1-homogeneous, and order-preserving operator. If ρ(L) > 0, then ρ(L) ∈ σ+(L) and \(\rho (L)=\max \limits \sigma ^{+}(L)\).
Theorem A.2
Let L : K → K be a positively 1-homogeneous and order-preserving operator. If there exist \(\rho ^{\prime }>0\) and f >> 0 such that \(Lf=\rho ^{\prime }f\), then \(\rho ^{\prime }=\rho (L)\).
Theorem A.3
Let L : K → K be a positively 1-homogeneous and order-preserving operator. If there exist \(\rho ^{\prime }\geq 0\) and f >> 0 such that \(Lf\leq \rho ^{\prime }f\), then \(\rho ^{\prime }\geq \rho (L)\).
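Theorems A.1-A.2 suggest a simple numerical scheme: iterate L on a positive vector, normalize, and read off ρ(L) once the iteration settles on an eigenvector f >> 0. The sketch below uses a Bellman-type operator L(f)_i = max_a (M_a f)_i on the cone of nonnegative vectors, which is compact, positively 1-homogeneous, and order-preserving; convergence of such normalized iterations generally needs extra conditions (e.g., primitivity in nonlinear Perron-Frobenius theory), so this is an illustration, not a general algorithm:

```python
import numpy as np

def bellman_op(f, Ms):
    """L(f)_i = max_a sum_j Ms[a][i, j] f_j: positively 1-homogeneous and
    order-preserving on the cone K of nonnegative vectors."""
    return np.max([M @ f for M in Ms], axis=0)

def nonlinear_power(Ms, n_iter=500):
    """Normalized iteration approximating rho(L) = lim ||L^n||_+^{1/n}."""
    f = np.ones(Ms[0].shape[0])
    rho = 0.0
    for _ in range(n_iter):
        g = bellman_op(f, Ms)
        rho = g.max()        # ||L f||_infty with ||f||_infty = 1
        f = g / rho
    return rho, f

# the first matrix is row-stochastic, so L(1) = 1: the constant vector 1 >> 0
# is an eigenvector with eigenvalue 1, and Theorem A.2 forces rho(L) = 1
Ms = [np.array([[0.5, 0.5], [0.2, 0.8]]), np.array([[0.9, 0.0], [0.1, 0.6]])]
rho, f = nonlinear_power(Ms)
```

This is exactly the mechanism the paper exploits: a strictly positive solution of the Bellman equation \(L^{(\gamma )}_{P}f_{\gamma }=\rho _{\gamma }f_{\gamma }\) pins down the principal eigenvalue.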
About this article
Cite this article
Dai, Y., Chen, J. Risk-Sensitivity Vanishing Limit for Controlled Markov Processes. J Dyn Control Syst 29, 1471–1508 (2023). https://doi.org/10.1007/s10883-023-09641-5
Keywords
- Markov decision processes
- Partially observable Markov decision process
- Risk-sensitive reward
- Variational formula
- Kreĭn-Rutman Theorem