1 Introduction

This note is concerned with Markov decision chains evolving on a denumerable state space. The one-step cost function is bounded, and the performance of a control policy is measured by the average criterion associated with a risk-seeking decision maker. The structural conditions on the transition law ensure that the optimal average cost is constant, but do not guarantee that the optimality equation admits a solution. In this framework, the following problem is addressed:

  • To obtain convergent approximations to the optimal average cost, and to determine approximately optimal stationary policies using the fixed points of a family of contractive operators.

The main conclusions on this problem, which are stated in Theorem 3.1 of Sect. 3, represent an extension of the classical ‘discounted approach’ in the risk-neutral case (Hernández-Lerma 1989; Arapostathis et al. 1993), and extend to the present framework results established in Saucedo-Zul et al. (2020), where a risk-averse version of this problem was analyzed.

The study of Markov decision chains endowed with a risk-sensitive average criterion can be traced back, at least, to the seminal paper by Howard and Matheson (1972), where Markov decision chains with finite state space were analyzed, and the optimal average cost was characterized via an optimality equation. Interest in this topic has been motivated by applications, for instance, in finance (Bäuerle and Rieder 2011, 2014; Stettner 1999; Pitera and Stettner 2016), revenue management (Barz and Waldmann 2007), and the theory of large deviations (Borkar and Meyn 2002). Models with finite or denumerable state space are considered, for instance, in Sladký (2008, 2018) and Cavazos-Cadena (2009, 2018), whereas Markov decision chains on a Borel state space are analyzed in Di Masi and Stettner (1999, 2000, 2007), Jaśkiewicz (2007), Jaśkiewicz and Nowak (2014) and Shen et al. (2013).

Stochastic games with risk-sensitive criteria are studied in Basu and Ghosh (2014).

The remainder of the paper is organized as follows. In Sect. 2 the decision model is formally described, the average criterion is defined, and the main structural assumptions on the model are stated. In Sect. 3 a family of contractive operators is introduced, and the main result of the paper is stated as Theorem 3.1. The technical instruments that will be used to establish that result are developed in Sect. 4, and the proof of the main result is presented in Sect. 5, before the concluding remarks.

Notation

Throughout the remainder \(\mathbb {N}\) denotes the set of non-negative integers and, given a topological space S, the Banach space of all bounded functions \(H: S\rightarrow \mathbb {R}\) is denoted by \({\mathcal {B}}(S)\); the supremum norm of \(H\in {\mathcal {B}}(S) \) is denoted by \(\Vert H\Vert : = \sup _{x\in S} |H(x)|\). On the other hand, every (in)equality involving random variables holds almost surely with respect to the underlying probability measure.

2 Decision model

Let \({\mathcal {M}}:=(S, A, \{A(x)\}_{x\in S}, C, [p_{x, y}(a)])\) be a Markov decision chain, a model for a dynamical system whose components are as follows: The state space S is a denumerable set endowed with the discrete topology, the metric space A is the action set whereas, for each state \(x\in S\), \(A(x)\subset A\) is the class of admissible actions (controls) at state x. On the other hand, \(C: \mathbb {K}\rightarrow \mathbb {R}\) is the cost function, where \(\mathbb {K}=\{(x,a)\,|\, x\in S, a\in A(x)\}\) is the family of admissible pairs and, finally, \([p_{x, y}(a)]_{x, y \in S, a\in A(x)}\) is the controlled transition law. The interpretation of \({\mathcal {M}}\) is as follows: At each time \(t\in \mathbb {N}\) the decision maker observes the state of the system \(X_t=x\in S\), and then picks and applies an action \(A_t=a\in A(x)\). As a consequence of such an intervention, (i) a cost \(C(x,a)\) is incurred, and (ii) the system moves to a new state \(X_{t+1}\in S\) where, regardless of the previous states and actions, the event \([X_{t+1} = y]\) is observed with probability \(p_{x, y}(a)\), where \(\sum _{y\in S} p_{x, y}(a)= 1\); this is the Markov property of the decision process.

Assumption 2.1

  1. (i)

    For every \(x\in S\), A(x) is a compact subset of A.

  2. (ii)

    For each \(x, y\in S\), the mappings \(a\mapsto p_{x, y}(a)\) and \(a\mapsto C(x,a)\) are continuous in \(a\in A(x)\).

  3. (iii)

    The cost function is bounded, i.e., \(C\in {\mathcal {B}}(\mathbb {K})\).

Policies

A control policy is a rule for choosing actions, which at each decision time \(n\in \mathbb {N}\) may depend on the current state as well as on the previous states and actions. More formally, for each \(n\in \mathbb {N}\) define the space \(\mathbb {H}_n\) of possible histories up to time n by \(\mathbb {H}_0:=S\) and \(\mathbb {H}_n:=\mathbb {K}^n\times S\) for \(n=1,2,3,\ldots \); a generic element of \(\mathbb {H}_n\) is denoted by \(h_n=(x_0,a_0,x_1, a_1,\ldots , x_{n-1}, a_{n-1}, x_n)\), where \((x_k,a_k)\in \mathbb {K}\) for \(k< n\) and \(x_n \in S\). With this notation, a control policy \(\pi = \{\pi _n\}\) is a sequence of stochastic kernels \(\pi _n\) on A given \(\mathbb {H}_n\), satisfying \( \pi _n(A (x_n)|h_n) = 1\) for each \(h_n \in \mathbb {H}_n\) and \(n \in \mathbb {N}\). The family of all policies is denoted by \({\mathcal {P}}\). Next, set \(\mathbb {F}:=\prod _{x\in S}A(x)\), which is a compact metric space, by Assumption 2.1, and consists of all functions \(f: S\rightarrow A\) satisfying \(f(x)\in A(x)\) for every \(x\in S\). A policy \(\pi \in {\mathcal {P}}\) is stationary if there exists \(f\in \mathbb {F}\) such that the equality \(\pi _n(\{f(x_n)\}|h_n)=1\) always holds; the class of stationary policies is naturally identified with \(\mathbb {F}\), a convention that allows one to write \(\mathbb {F}\subset {\mathcal {P}}\). Given the initial state \(X_0= x\) and the policy \(\pi \in {\mathcal {P}}\) used to drive the system, the distribution of the state-action process \(\{(X_t, A_t)\}_{t\in \mathbb {N}}\) is uniquely determined and is denoted by \(P_x^\pi \) (Hernández-Lerma 1989; Arapostathis et al. 1993; Puterman 1994), whereas \(E_x^\pi \) stands for the corresponding expectation operator.

Throughout the sequel, the following notation will be used: For each \(n\in \mathbb {N}\) set

$$\begin{aligned} H_n: = (X_0, A_0,\ldots , X_{n-1}, A_{n-1}, X_n)\quad \hbox {and}\quad {\mathcal {F}}_n:= \sigma (H_n), \end{aligned}$$
(2.1)

whereas for each \(F\subset S\) the first return time to set F is defined by

$$\begin{aligned} T_F:= \min \{n\ge 1\,|\, X_n\in F\}; \end{aligned}$$
(2.2)

when \(F= \{x\}\) is a singleton the simpler notation

$$\begin{aligned} T_x \equiv T_{\{x\}} \end{aligned}$$
(2.3)

is used. Notice that \(T_F\) is a stopping time with respect to the filtration \(\{{\mathcal {F}}_n\}\), i.e., \([T_F = n]\in {\mathcal {F}}_n\) for every \(n\in \mathbb {N}.\)

Average criterion

Throughout the remainder it is supposed that the decision maker has a constant risk-sensitive coefficient \(\lambda \) which satisfies

$$\begin{aligned} \lambda <0. \end{aligned}$$

This means that the controller assesses a random cost Y via the expectation of \(U_\lambda (Y)\), where the (dis-)utility function \(U_{\lambda }: \mathbb {R}\rightarrow (-\infty ,0)\) is defined as follows

$$\begin{aligned} U_\lambda (x) = -e^{\lambda x},\quad x\in \mathbb {R}; \end{aligned}$$
(2.4)

notice that \(U_\lambda (\cdot )\) is strictly increasing and satisfies the relation

$$\begin{aligned} U_\lambda (a+ b) =e^{\lambda a} U_\lambda (b),\quad a, b \in \mathbb {R}. \end{aligned}$$
(2.5)

When the decision maker chooses between two random costs \(C_0\) and \(C_1\), the controller prefers \(C_0\) if \(E[U_{\lambda }(C_0)]<E[U_{\lambda }(C_1)]\), and is indifferent between both costs if \(E[U_{\lambda }(C_0)]=E[U_{\lambda }(C_1)]\). The certainty equivalent of a cost Y is denoted by \({\mathcal {E}}_\lambda [Y]\) and is determined by the equality \(U_{\lambda }({\mathcal {E}}_\lambda [Y])=E[U_{\lambda }(Y)]\), so that the controller is indifferent between paying the fixed amount \({\mathcal {E}}_\lambda (Y)\) or facing the random cost Y. Notice that \(U_\lambda (\cdot )\) is a concave function, so that Jensen’s inequality yields that \({\mathcal {E}}_\lambda (Y)\le E[Y]\). Now, observe that

$$\begin{aligned} {\mathcal {E}}_\lambda [Y] = U_\lambda ^{-1}(E[U_\lambda (Y)]) = {1\over \lambda }\log \left( E\left[ e^{\lambda Y}\right] \right) , \end{aligned}$$
(2.6)

an expression that immediately yields that

$$\begin{aligned} P[|Y| \le b]= 1 \implies |{\mathcal {E}}_\lambda (Y)|\le b. \end{aligned}$$
(2.7)
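As a numerical illustration of (2.6) and (2.7), the following sketch evaluates the certainty equivalent of a hypothetical two-point random cost; the coefficient \(\lambda \), the outcomes, and the probabilities are illustrative assumptions, not data from the paper.

```python
import math

def certainty_equivalent(lam, outcomes, probs):
    """Certainty equivalent (1/lam) * log E[exp(lam * Y)], as in (2.6)."""
    mgf = sum(p * math.exp(lam * y) for y, p in zip(outcomes, probs))
    return math.log(mgf) / lam

lam = -1.0                                 # risk-seeking coefficient, lam < 0
outcomes, probs = [0.0, 10.0], [0.5, 0.5]  # hypothetical two-point cost

ce = certainty_equivalent(lam, outcomes, probs)
mean = sum(p * y for y, p in zip(outcomes, probs))

# U_lam is concave for lam < 0, so Jensen's inequality gives CE <= E[Y];
# moreover |Y| <= 10 forces |CE| <= 10, matching (2.7).
print(ce, mean)
```

For \(\lambda < 0\) the certainty equivalent falls well below the mean, reflecting the risk-seeking attitude of the controller toward random costs.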

Next, assume that the controller chooses actions using policy \(\pi \in {\mathcal {P}}\) starting at \(x\in S\). The application of the first n actions \(A_0, A_1,\ldots , A_{n-1}\) generates the cost \(\sum _{k=0}^{n-1} C(X_k, A_k)\) and, by (2.6), the associated certainty equivalent is given by

$$\begin{aligned} J_n(\pi , x):= {1\over \lambda } \log \left( E_x^\pi \left[ e^{\lambda \sum _{t=0}^{n-1} C(X_t, A_t)}\right] \right) , \quad n=1,2,3,\ldots , \end{aligned}$$
(2.8)

so that \(J_n(\pi , x)/n\) represents an average cost per step. The (inferior-limit \(\lambda \)-sensitive) average performance index of policy \(\pi \in {\mathcal {P}}\) at state \(x\in S\) is given by

$$\begin{aligned} J(\pi ,x):= \liminf _{n\rightarrow \infty } {1\over n } J_n(\pi , x), \end{aligned}$$
(2.9)

and

$$\begin{aligned} J_*(x):= \inf _{\pi \in {\mathcal {P}}} J(\pi , x),\quad x\in S, \end{aligned}$$
(2.10)

is the corresponding optimal value function. A policy \(\pi _*\in {\mathcal {P}}\) is (\(\lambda \)-)average optimal if \(J(\pi _*,x)=J_*(x)\) for every \(x\in S\).
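When the policy is fixed, the n-stage index (2.8) can be computed without simulation, since \(E_x^\pi \left[ e^{\lambda \sum C}\right] \) satisfies a one-step backward recursion. The following sketch does this for a hypothetical uncontrolled two-state chain; all numbers are illustrative assumptions.

```python
import math

# Hypothetical uncontrolled two-state chain (illustrative numbers only).
P = [[0.9, 0.1],
     [0.2, 0.8]]     # transition probabilities p_{x,y}
C = [1.0, 3.0]       # state costs, so ||C|| = 3
lam = -0.5           # risk-seeking coefficient

def J_n(x, n):
    """n-stage certainty equivalent (2.8) via the backward recursion
    v_k(x) = exp(lam*C(x)) * sum_y p_{x,y} * v_{k-1}(y), with v_0 = 1.
    (For very large n the recursion should be renormalized to avoid underflow.)"""
    v = [1.0, 1.0]
    for _ in range(n):
        v = [math.exp(lam * C[s]) * sum(P[s][y] * v[y] for y in range(2))
             for s in range(2)]
    return math.log(v[x]) / lam

for n in (10, 100, 1000):
    print(n, J_n(0, n) / n)   # J_n(x)/n settles down, approximating (2.9)
```

By (2.7) the printed ratios stay within \([-\Vert C\Vert , \Vert C\Vert ]\), and their stabilization as n grows illustrates the limit defining (2.9).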

Recurrence-communication conditions

In the risk-neutral case, it is known that the simultaneous Doeblin condition, which is stated in Assumption 2.2(i) below, is sufficient to ensure that the optimal average cost is constant and is characterized via an optimality equation (Hernández-Lerma 1989; Arapostathis et al. 1993; Puterman 1994). In the present risk-sensitive context, the \(\lambda \)-sensitive average optimality equation is given by

$$\begin{aligned} U_\lambda (g +h(x)) =\inf _{a\in A(x)} \left[ \sum _{y\in S} p_{x, y}(a) U_\lambda (C(x,a)+ h(y))\right] ,\quad x\in S, \end{aligned}$$
(2.11)

where g is a real number and \(h: S\rightarrow \mathbb {R}\) is a function. When this equation admits a solution \((g, h(\cdot ))\) with \(h(\cdot )\) a bounded mapping, it is known that the optimal \(\lambda \)-average cost function \(J_*(\cdot ) \) is constant and equal to g, and if \(f\in \mathbb {F}\) is such that for each state x the action f(x) minimizes the term within brackets in (2.11), then f is \(\lambda \)-average optimal; see, for instance, Howard and Matheson (1972), Hernández-Hernández and Marcus (1996), or Cavazos-Cadena (2009). Notice that, via (2.4) and since multiplication by \(-1\) transforms the infimum in (2.11) into a supremum, the above optimality equation can be equivalently written as

$$\begin{aligned} e^{\lambda g + \lambda h(x)} =\sup _{a\in A(x)} \left[ e^{\lambda C(x,a)} \sum _{y\in S} p_{x, y}(a) e^{\lambda h(y)}\right] ,\quad x\in S. \end{aligned}$$
(2.12)

In contrast with the risk-neutral context, in the present framework, where the controller is risk-seeking, the simultaneous Doeblin condition is not sufficient even to ensure that the optimal average cost function is constant (Cavazos-Cadena and Fernández-Gaucherand 1999; Cavazos-Cadena 2009). For this reason, in this work the simultaneous Doeblin condition will be complemented with a communication requirement.

Assumption 2.2

There exists \(z\in S\) such that properties (i) and (ii) below hold:

  1. (i)

    [Simultaneous Doeblin Condition.] The first return time \(T_{z}\) satisfies

    $$\begin{aligned} \sup _{x\in S, f\in \mathbb {F}} E_x^f[T_{z}] <\infty . \end{aligned}$$
    (2.13)
  2. (ii)

    [Accessibility from z.] Under the action of any stationary policy, every state \(y\in S\) is accessible from z, that is,

    $$\begin{aligned} P_{z}^f[T_y <\infty ] > 0,\quad y\in S,\quad f\in \mathbb {F}. \end{aligned}$$
    (2.14)
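To make condition (2.13) concrete, note that for a finite chain the expected first-return times solve the linear system \(E_x[T_z] = 1 + \sum _{y\ne z} p_{x, y} E_y[T_z]\). The following sketch solves this system for a hypothetical two-state chain; the transition matrix is an illustrative assumption.

```python
# Hypothetical two-state chain (illustrative numbers); with z = 0, the
# expected hitting/return times solve E_x[T_z] = 1 + sum_{y != z} p_{x,y} E_y[T_z].
P = [[0.9, 0.1],
     [0.2, 0.8]]

E1 = 1.0 / P[1][0]        # E_1[T_0] = 1/(1 - p_{1,1}) = 1/p_{1,0}
E0 = 1.0 + P[0][1] * E1   # E_0[T_0] = 1 + p_{0,1} * E_1[T_0]

# Both values are finite, so sup_x E_x[T_0] < infinity: this (uncontrolled)
# chain satisfies the simultaneous Doeblin condition (2.13) trivially.
print(E0, E1)
```

In the controlled case the supremum in (2.13) runs over the stationary policies as well, so such a computation must hold uniformly in \(f\in \mathbb {F}\).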

Remark 2.1

Assumptions 2.1 and 2.2 imply the following properties (i) and (ii) below; for a proof see Theorem 4.1 in Cavazos-Cadena (2018).

  1. (i)

    For each \( y\in S\), there exists a finite constant \(M_{ y}\) such that

    $$\begin{aligned} E_x^\pi [T_y ]\le M_{ y},\quad x\in S,\quad \pi \in {\mathcal {P}}. \end{aligned}$$
    (2.15)
  2. (ii)

    If \(x, y\in S\) with \(x\ne y\), then \(P_x^\pi [T_y < T_x ] > 0\) for every \(\pi \in {\mathcal {P}}\).

Remark 2.2

Assumption 2.2 is, admittedly, very strong. However, in the denumerable case such a condition is presently the most general one under which a characterization of the optimal risk-sensitive average cost is available. The result in this direction can be seen in Cavazos-Cadena (2018) and involves an extension of the Collatz-Wielandt relations in the theory of positive matrices.

The problem

Under Assumptions 2.1 and 2.2 the optimal average cost function \(J_*(\cdot )\) is constant, but the optimality equation (2.11) does not necessarily admit a solution; an (uncontrolled) example illustrating this phenomenon was presented in Section 9 of Cavazos-Cadena (2018). This fact provides the motivation to analyze the following problem:

  • To obtain convergent approximations to the optimal average cost as well as ‘nearly optimal’ stationary policies via the fixed points of contractive operators.

An answer to this problem allows one to determine approximations to the optimal average cost, as well as a stationary policy whose average cost is ‘close’ to the optimal one, by solving the single equation characterizing the fixed point of a contractive operator. The main result on the above problem is stated in the following section, and represents an extension of the classical ‘discounted approach’ in the risk-neutral case (Hernández-Lerma 1989; Puterman 1994) to the present risk-seeking framework.

Throughout the remainder, even without explicit reference, Assumptions 2.1 and 2.2 are enforced.

3 Contractive approximations

In this section the main result of the paper will be stated in Theorem 3.1 below. To begin with, for each \(\alpha \in (0,1)\) define \(T_{\alpha }:{\mathcal {B}}(S) \rightarrow {\mathcal {B}}(S)\) as follows: For each \(W\in {\mathcal {B}}(S)\), \(T_\alpha [W]\) is implicitly determined by

$$\begin{aligned} U_\lambda (T_\alpha [W](x) ) =\inf _{a\in A(x)} \left[ \sum _{y\in S} p_{x, y}(a) U_\lambda (C(x,a)+ \alpha W(y))\right] ,\quad x\in S, \end{aligned}$$
(3.1)

an expression that via (2.4) leads to

$$\begin{aligned} T_\alpha [W](x):={1\over \lambda } \log \left( \sup _{a\in A(x)} \left[ e^{\lambda C(x, a)}\sum _{y\in S} p_{x, y}(a) e^{\lambda \alpha W(y)}\right] \right) , \quad x\in S. \end{aligned}$$
(3.2)

Using (2.7) it follows that \(\Vert T_\alpha [W]\Vert \le \Vert C\Vert + \alpha \Vert W\Vert \), so that \(T_\alpha \) maps \({\mathcal {B}}(S)\) into itself. Also, it is not difficult to verify that \(T_\alpha \) is a monotone and \(\alpha \)-homogeneous operator, i.e., for each \(W, V\in {\mathcal {B}}(S)\)

$$\begin{aligned} W \ge V\implies T_\alpha [W]\ge T_\alpha [V] \hbox { and } T_\alpha [V+ c ] = T_\alpha [V] + \alpha c,\quad c\in \mathbb {R}. \end{aligned}$$
(3.3)

Observing that \(V \le W + \Vert V-W\Vert \), these properties lead to \(T_\alpha [V]\le T_\alpha [W + \Vert V-W\Vert ] = T_\alpha [W] + \alpha \Vert V- W\Vert \), and interchanging the roles of V and W it follows that

$$\begin{aligned} \Vert T_\alpha [W]- T_\alpha [V]\Vert \le \alpha \Vert W- V\Vert ,\quad W, V\in {\mathcal {B}}(S), \end{aligned}$$
(3.4)

so that \(T_\alpha \) is a contractive operator on \({\mathcal {B}}(S)\). Since \({\mathcal {B}}(S)\) endowed with the supremum norm is a Banach space, there exists a unique \(V_\alpha \in {\mathcal {B}}(S)\) satisfying

$$\begin{aligned} V_\alpha = T_\alpha [V_\alpha ], \end{aligned}$$
(3.5)

an equation that, via (3.2), is equivalent to

$$\begin{aligned} e^{\lambda V_\alpha (x)} = \sup _{a\in A(x)} \left[ e^{\lambda C(x, a)}\sum _{y\in S} p_{x, y}(a) e^{\lambda \alpha V_\alpha (y)}\right] ,\quad x\in S. \end{aligned}$$
(3.6)

Additionally, from Assumption 2.1 it is not difficult to see that there exists \(f_\alpha \in \mathbb {F}\) such that, for every \(x\in S\), action \(f_\alpha (x)\) maximizes the term within brackets in the above display, so that

$$\begin{aligned} e^{\lambda V_\alpha (x)} = e^{\lambda C(x, f_\alpha (x))}\sum _{y\in S} p_{x, y}(f_\alpha (x)) e^{\lambda \alpha V_\alpha (y)},\quad x\in S. \end{aligned}$$
(3.7)
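A minimal sketch of the fixed-point computation (3.5): the operator \(T_\alpha \) of (3.2) is iterated on a hypothetical two-state, two-action model until the residual is negligible, as the contraction property (3.4) guarantees, and the resulting vector approximates \(V_\alpha \). All model data below are illustrative assumptions.

```python
import math

# Hypothetical two-state, two-action model; all numbers are assumptions.
lam, alpha = -0.5, 0.9
C = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 0.5}   # costs C(x, a)
P = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
     (1, 0): [0.2, 0.8], (1, 1): [0.6, 0.4]}               # p_{x,y}(a)

def T(W):
    """T_alpha[W](x) = (1/lam) log sup_a [ e^{lam C(x,a)} sum_y p_{x,y}(a) e^{lam alpha W(y)} ],
    i.e., the operator in (3.2)."""
    out = []
    for x in (0, 1):
        best = max(math.exp(lam * C[x, a]) *
                   sum(P[x, a][y] * math.exp(lam * alpha * W[y]) for y in (0, 1))
                   for a in (0, 1))
        out.append(math.log(best) / lam)
    return out

V = [0.0, 0.0]
for _ in range(500):       # Banach iteration; the error shrinks like alpha^n by (3.4)
    V = T(V)

residual = max(abs(a - b) for a, b in zip(V, T(V)))
print(V, residual)         # V approximates the unique fixed point V_alpha of (3.5)
```

The maximizing action at each state in the final iteration yields a policy playing the role of \(f_\alpha \) in (3.7).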

The normalized (\(\alpha \)-)cost and the (\(\alpha \)-)relative value functions are defined by

$$\begin{aligned} g_\alpha (x) := (1-\alpha ) V_\alpha (x), \quad h_\alpha (x) := \alpha [V_\alpha (x) - V_\alpha (w)],\quad x\in S, \end{aligned}$$
(3.8)

respectively, where, from this point onwards, \(w\in S\) is an arbitrary but fixed state. Direct calculations combining these definitions with the two previous displays yield that

$$\begin{aligned} e^{\lambda g_\alpha (x) + \lambda h_\alpha (x)} =\sup _{a\in A(x)} \left[ e^{\lambda C(x,a)} \sum _{y\in S} p_{x, y}(a) e^{\lambda h_\alpha (y)}\right] ,\quad x\in S, \end{aligned}$$
(3.9)

and

$$\begin{aligned} e^{\lambda g_\alpha (x) + \lambda h_\alpha (x)} = e^{\lambda C(x, f_\alpha (x))}\sum _{y\in S} p_{x, y}(f_\alpha (x)) e^{\lambda h_\alpha (y)},\quad x\in S. \end{aligned}$$
(3.10)

Notice that \(\Vert V_\alpha - T_\alpha [0] \Vert = \Vert T_\alpha [V_\alpha ] - T_\alpha [0]\Vert \le \alpha \Vert V_\alpha - 0\Vert = \alpha \Vert V_\alpha \Vert \), and then, observing that \(\Vert T_\alpha [0]\Vert \le \Vert C\Vert \), by (3.2), it follows that \(\Vert V_\alpha \Vert -\Vert C\Vert \le \Vert V_\alpha \Vert -\Vert T_\alpha [0]\Vert \le \Vert V_\alpha - T_\alpha [0] \Vert \le \alpha \Vert V_\alpha \Vert \), so that

$$\begin{aligned} \Vert g_\alpha \Vert = (1-\alpha )\Vert V_\alpha \Vert \le \Vert C\Vert . \end{aligned}$$
(3.11)
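To illustrate the behavior of the normalized cost \(g_\alpha = (1-\alpha )V_\alpha \) in (3.8), the following sketch computes \(g_\alpha \) for a hypothetical uncontrolled two-state chain at several discount factors; the entries stay within \(\Vert C\Vert \), as (3.11) requires, and flatten across states as \(\alpha \nearrow 1\). All numbers are illustrative assumptions.

```python
import math

# Hypothetical uncontrolled two-state chain (illustrative numbers only).
lam = -0.5
P = [[0.9, 0.1],
     [0.2, 0.8]]
C = [1.0, 3.0]   # ||C|| = 3

def V_alpha(alpha, iters=20000):
    """Fixed point of (3.2) with a single action, by successive approximation."""
    V = [0.0, 0.0]
    for _ in range(iters):
        V = [math.log(math.exp(lam * C[x]) *
                      sum(P[x][y] * math.exp(lam * alpha * V[y]) for y in (0, 1))) / lam
             for x in (0, 1)]
    return V

for a in (0.9, 0.99, 0.999):
    V = V_alpha(a)
    g = [(1 - a) * v for v in V]
    print(a, g)   # ||g_alpha|| <= ||C|| by (3.11); entries flatten as alpha -> 1
```

The flattening of \(g_\alpha (\cdot )\) across states previews the vanishing-discount limit of Theorem 3.1(i).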

The next theorem is the main result of this work.

Theorem 3.1

Let \(\lambda < 0\) be arbitrary but fixed. Under Assumptions 2.1 and 2.2 the following assertions (i) and (ii) hold.

  1. (i)

    The optimal average cost is constant, say \(g^*\), and \(\lim _{\alpha \nearrow 1} g_\alpha (x) = g^* = J_*(x) \) for every \(x\in S\).

  2. (ii)

    Given \(\varepsilon > 0\), for each \(x\in S\) there exists \(\alpha _{x, \varepsilon } \in (0,1) \) such that the policy \(f_\alpha \) in (3.7) is \(\varepsilon \)-optimal at x for \(\alpha \in (\alpha _{x, \varepsilon }, 1)\), that is,

    $$\begin{aligned} \alpha \in (\alpha _{x, \varepsilon }, 1)\implies g^* +\varepsilon \ge J(f_\alpha , x). \end{aligned}$$
    (3.12)

The proof of Theorem 3.1 will be presented in Sect. 5 after the preliminary results established in the following section.

4 Auxiliary tools

In this section the basic technical instruments that will be used to verify Theorem 3.1 are analyzed. Such preliminaries are established in Lemmas 4.1–4.3 below. The first one concerns boundedness properties of the family of relative value functions introduced in (3.8).

Lemma 4.1

  1. (i)

    For each \(\alpha \in (0, 1)\),

    $$\begin{aligned} h_\alpha (\cdot ) \le 2\Vert C\Vert M_w, \end{aligned}$$
    (4.1)

    where the finite constant \(M_w\) is as in (2.15).

  2. (ii)

    For each \(x\in S\), \(\liminf _{\alpha \nearrow 1} h_\alpha (x) > -\infty \).

Proof

  1. (i)

    Given \(\alpha \in (0, 1)\), define the sequence \(\{Y_n\}\) of random variables by \(Y_0 = e^{\lambda h_\alpha (X_0)}\) and \(Y_n = e^{\lambda \sum _{t=0} ^{n-1} (C(X_t, A_t) - g_\alpha (X_t)) + \lambda h_\alpha (X_n)}\) for \(n\ge 1\). Now, let \(x\in S\) be a fixed state, and observe that (3.10) implies that for every \(n \in \mathbb {N}\)

    $$\begin{aligned} e^{\lambda h_\alpha (X_n)}&= e^{\lambda (C(X_n, f_\alpha (X_n)) - g_\alpha (X_n)) }\sum _{y\in S} p_{X_n, y}(f_\alpha (X_n)) e^{\lambda h_\alpha (y)} \\&= E_x^{f_\alpha }\left. \left[ e^{\lambda (C(X_n, A_n) - g_\alpha (X_n)) + \lambda h_\alpha (X_{n+1})}\right| {\mathcal {F}}_n\right] , \quad P_x^{f_\alpha }\hbox {-a.\,s.}, \end{aligned}$$
    (4.2)

    where, using that the relation \(P_x^{f_\alpha }[A_t = f_\alpha (X_t)] = 1\) is always valid, the second equality is due to the Markov property. Observing that \(e^{\lambda \sum _{t=0} ^{n-1} (C(X_t, A_t) - g_\alpha (X_t))}\) is \({\mathcal {F}}_n\)-measurable, by (2.1), the previous display yields

    $$\begin{aligned} Y_n&= e^{\lambda \sum _{t=0} ^{n-1} (C(X_t, A_t) - g_\alpha (X_t))+ \lambda h_\alpha (X_n)} \\&= e^{\lambda \sum _{t=0} ^{n-1} (C(X_t, A_t) - g_\alpha (X_t)) } E_x^{f_\alpha }\left. \left[ e^{\lambda (C(X_n, A_n) - g_\alpha (X_n)) + \lambda h_\alpha (X_{n+1})}\right| {\mathcal {F}}_n\right] \\&= E_x^{f_\alpha }\left. \left[ e^{\lambda \sum _{t=0} ^{n} (C(X_t, A_t) - g_\alpha (X_t)) + \lambda h_\alpha (X_{n+1})}\right| {\mathcal {F}}_n\right] = E_x^{f_\alpha }\left. \left[ Y_{n+1}\right| {\mathcal {F}}_n\right] , \end{aligned}$$

    so that \(\{(Y_n, {\mathcal {F}}_n)\}\) is a martingale with respect to \(P_x^{f_\alpha }\); since \(P_x^{f_\alpha }[X_0= x] = 1\), the optional sampling theorem yields that, for every initial state x and \(n\in \mathbb {N}\),

    $$\begin{aligned} e^{\lambda h_{\alpha }(x)}&= E_x^{f_\alpha }[Y_0] \\&=E_x^{f_\alpha }[Y_{n\wedge T_w}] = E_x^{f_\alpha }\left[ e^{\lambda \sum _{t=0} ^{n\wedge T_w - 1 } (C(X_t, A_t) - g_\alpha (X_t)) + \lambda h_\alpha (X_{n\wedge T_w})}\right] . \end{aligned}$$

    Now, using (2.2) and (2.3), observe that \(h_\alpha (X_{T_w}) = h_\alpha (w) = 0\) on the event \([T_w < \infty ]\); since \(P_x^{f_\alpha }[T_w < \infty ] = 1\), by (2.15), it follows that

    $$\begin{aligned}&\lim _{n\rightarrow \infty } e^{\lambda \sum _{t=0} ^{n\wedge T_w-1} (C(X_t, A_t) - g_\alpha (X_t)) + \lambda h_\alpha (X_{n\wedge T_w})}\\&\qquad = e^{\lambda \sum _{t=0} ^{T_w-1} (C(X_t, A_t) - g_\alpha (X_t)) + \lambda h_\alpha (X_{T_w})}\\&\qquad = e^{\lambda \sum _{t=0} ^{T_w-1} (C(X_t, A_t) - g_\alpha (X_t)) },\quad P_x^{f_\alpha }\hbox {-a.\,s.}. \end{aligned}$$

    Via Fatou’s lemma and Jensen’s inequality, these two last displays together imply that

    $$\begin{aligned} e^{\lambda h_{\alpha }(x)}&= \liminf _{n\rightarrow \infty } E_x^{f_\alpha }\left[ e^{\lambda \sum _{t=0} ^{n\wedge T_w - 1 } (C(X_t, A_t) - g_\alpha (X_t)) + \lambda h_\alpha (X_{n\wedge T_w})}\right] \\&\ge E_x^{f_\alpha }\left[ e^{\lambda \sum _{t=0} ^{T_w-1} (C(X_t, A_t) - g_\alpha (X_t)) }\right] \ge e^{E_x^{f_\alpha }\left[ \lambda \sum _{t=0} ^{T_w-1} (C(X_t, A_t) - g_\alpha (X_t)) \right] }\\&\ge e^{E_x^{f_\alpha }\left[ -\sum _{t=0} ^{T_w-1} |\lambda (C(X_t, A_t) - g_\alpha (X_t))| \right] } \ge e^{2\lambda \Vert C\Vert E_x^{f_\alpha }\left[ T_w\right] }, \end{aligned}$$

    where (3.11) and the negativity of \(\lambda \) were used in the last step. It follows that \( \lambda h_{\alpha }(x) \ge 2\lambda \Vert C\Vert E_x^{f_\alpha }\left[ T_w\right] \), so that \(h_{\alpha }(x) \le 2 \Vert C\Vert E_x^{f_\alpha }\left[ T_w\right] \); since x was arbitrary in this argument, (4.1) follows via (2.15).

  2. (ii)

    Let \(\tilde{f}\in \mathbb {F}\) be fixed, and define the sequence \(\{S_k\}\) of subsets of the state space S by

    $$\begin{aligned} S_0&:= \{w\},\\S_k&:= \{y\in S: p_{x, y}(\tilde{f}(x)) > 0\hbox { for some } x\in S_{k-1}\},\quad k=1,2,3,\ldots \end{aligned}$$

    and notice that \(\bigcup _{k=0}^\infty S_k =S\), by Remark 2.1(ii). Thus, to establish part (ii) it is sufficient to show that, for every \(k\in \mathbb {N}\),

    $$\begin{aligned} \liminf _{\alpha \nearrow 1} h_\alpha (x) > -\infty ,\quad x\in S_k, \end{aligned}$$
    (4.3)

    a claim that will be verified by induction. To begin with, notice that (3.9) implies that

    $$\begin{aligned} e^{\lambda h_\alpha (x)}&\ge e^{\lambda (C(x,\tilde{f}(x))-g_\alpha (x))} \sum _{y\in S} p_{x, y}(\tilde{f}(x)) e^{\lambda h_\alpha (y)} \nonumber \\&\ge e^{2\lambda \Vert C\Vert } \sum _{y\in S} p_{x, y}(\tilde{f}(x)) e^{\lambda h_\alpha (y)} \end{aligned}$$
    (4.4)

    where the second inequality is due to (3.11) and the negativity of \(\lambda \). Now, using that \(S_0= \{w\}\) and \(h_\alpha (w) = 0\) for every \(\alpha \in (0,1)\), observe that assertion (4.3) clearly holds for \(k= 0\). Next, assume that (4.3) is valid for some \(k\in \mathbb {N}\) and let \(\tilde{y}\in S_{k+1} \) be arbitrary. Pick \(\tilde{x}\in S_k\) such that

    $$\begin{aligned} p_{\tilde{x}, \tilde{y}}(\tilde{f}(\tilde{x})) > 0 \end{aligned}$$

    and notice that (4.4) implies that \(e^{\lambda h_\alpha (\tilde{x})} \ge e^{2\lambda \Vert C\Vert } p_{\tilde{x}, \tilde{y}}(\tilde{f}(\tilde{x})) e^{\lambda h_\alpha (\tilde{y} )}\), so that

    $$\begin{aligned} h_\alpha (\tilde{x}) \le 2\Vert C\Vert + {1\over \lambda } \log ( p_{\tilde{x}, \tilde{y}}(\tilde{f}(\tilde{x}))) + h_\alpha (\tilde{y} ). \end{aligned}$$

    Since \(\tilde{x}\in S_k\), the induction hypothesis yields that \(\liminf _{\alpha \nearrow 1} h_\alpha (\tilde{x}) > -\infty \), and then the two last displays together imply that \(\liminf _{\alpha \nearrow 1} h_\alpha (\tilde{y}) > -\infty \). Recalling that \(\tilde{y}\in S_{k+1}\) is arbitrary, it follows that (4.3) holds with \(k+1\) instead of k, completing the induction argument. \(\square \)

In the subsequent development \(\{\alpha _n\}\subset (0, 1)\) is a fixed sequence such that

$$\begin{aligned} \alpha _n\nearrow 1 \text{ as } n\rightarrow \infty \end{aligned}$$
(4.5)

and, after taking a subsequence if necessary, without loss of generality it is assumed that the following limits exist:

$$\begin{aligned} g(x) := \lim _{n\rightarrow \infty } g_{\alpha _n} (x),\quad h^*(x) := \lim _{n\rightarrow \infty } h_{\alpha _n} (x), \quad x\in S \end{aligned}$$
(4.6)

where, for each \(x\in S\),

$$\begin{aligned} g(x)\in [-\Vert C\Vert , \Vert C\Vert ],\quad h^*(x) \in (-\infty , 2\Vert C\Vert M_w]; \end{aligned}$$
(4.7)

see (3.11) and Lemma 4.1.

The next lemma establishes fundamental properties of the mappings \(g(\cdot )\) and \(h^*(\cdot )\).

Lemma 4.2

With the notation in (4.5)–(4.7) assertions (i)–(iv) below hold.

  1. (i)

    The mapping \(g(\cdot )\) in (4.6) is constant, say \(g(x) = g^*\in \mathbb {R}\) for each \(x\in S\).

  2. (ii)

    For each \(x\in S\), \(e^{\lambda g^* + \lambda h^* (x)} \ge \sup _{a\in A(x)} \left[ e^{\lambda C(x,a)} \sum _{y\in S} p_{x, y}(a) e^{\lambda h^* (y)}\right] \).

  3. (iii)

    For each positive integer n,

    $$\begin{aligned} n g^* + h^* (x) - 2\Vert C\Vert M_w \le J_n(\pi , x), \quad x\in S, \quad \pi \in {\mathcal {P}}. \end{aligned}$$
  4. (iv)

    \(g ^* \le J_*(\cdot ).\)

Proof

  1. (i)

    Notice that (3.8) yields that \(\displaystyle g_{\alpha _n}(x)- g_{\alpha _n} (w) = {1-\alpha _n\over \alpha _n} h_{\alpha _n}(x)\) for every \(x\in S\). Taking the limit as n goes to \(\infty \), (4.6) and (4.7) together yield that \(g(x) = g(w)\) for every \(x\in S\).

  2. (ii)

    Let \((x, a)\in \mathbb {K}\) be arbitrary and notice that (3.9) implies that, for each \(n\in \mathbb {N}\),

    $$\begin{aligned} e^{\lambda g_{\alpha _n}(x) + \lambda h_{\alpha _n}(x)} \ge e^{\lambda C(x,a)} \sum _{y\in S} p_{x, y}(a) e^{\lambda h_{\alpha _n}(y)}. \end{aligned}$$

    Taking the inferior limit as n goes to \(\infty \) in both sides of this inequality, (4.6) and part (i) together imply that

    $$\begin{aligned} e^{\lambda g^* + \lambda h^* (x)}&\ge \liminf _{n\rightarrow \infty }e^{\lambda C(x,a)} \sum _{y\in S} p_{x, y}(a) e^{\lambda h_{\alpha _n}(y)}\\&\ge e^{\lambda C(x,a)} \sum _{y\in S} p_{x, y}(a) \liminf _{n\rightarrow \infty } e^{\lambda h_{\alpha _n}(y)} \end{aligned}$$

    where Fatou’s lemma was used to set the second inequality. Thus, (4.6) and the above display lead to

    $$\begin{aligned} e^{\lambda g^* + \lambda h^* (x)} \ge e^{\lambda C(x,a)} \sum _{y\in S} p_{x, y}(a) e^{\lambda h^* (y)},\quad (x, a)\in \mathbb {K}, \end{aligned}$$
    (4.8)

    establishing part (ii).

  3. (iii)

    An induction argument starting at (4.8) and using the Markov property yields that for every \(x\in S\), \(\pi \in {\mathcal {P}}\) and \(n\in \mathbb {N}\setminus \{0\}\),

    $$\begin{aligned} e^{\lambda n g^* + \lambda h^*(x)}\ge E_x^\pi \left[ e^{\lambda \sum _{t=0}^{n-1} C(X_t, A_t) + \lambda h^* (X_{n})}\right] . \end{aligned}$$

    From this relation, recalling that \(\lambda < 0\) and using (4.7) it follows that

    $$\begin{aligned} e^{\lambda n g^* + \lambda h^*(x)}\ge E_x^\pi \left[ e^{\lambda \sum _{t=0}^{n-1} C(X_t, A_t) + 2 \lambda \Vert C\Vert M_w}\right] = e^{\lambda J_n(\pi , x) + 2\lambda \Vert C\Vert M_w}, \end{aligned}$$

    where (2.8) was used to set the equality. Therefore, \(\lambda n g^* + \lambda h^*(x) \ge \lambda J_n(\pi , x) + 2\lambda \Vert C\Vert M_w\), and the conclusion follows, since \(\lambda \) is negative.

  4. (iv)

    Dividing both sides of the inequality in part (iii) by n and taking the inferior limit as \(n\nearrow \infty \) in the resulting inequality, (2.9) yields that \(g^* \le J(\pi , x)\) for each \(x\in S\) and \(\pi \in {\mathcal {P}}\). From this point, (2.10) leads to \(g^* \le J_*(\cdot )\). \(\square \)

The following result is the final step before proceeding to the proof of the main theorem.

Lemma 4.3

Given \(\alpha \in (0, 1) \), let the policy \(f_\alpha \in \mathbb {F}\) be such that (3.7) holds.

  1. (i)

    For each \(x\in S\),

    $$\begin{aligned} g_\alpha (x) \ge (1-\alpha )^2 \sum _{k=1}^\infty \alpha ^{k-1} J_k(f_\alpha , x). \end{aligned}$$
  2. (ii)

    Given \(\varepsilon > 0\) and \(x\in S\), there exists \(\tilde{\alpha }_{x, \varepsilon } \in (0, 1)\) such that

    $$\begin{aligned} g_{\alpha }(x) + \varepsilon /2 \ge J(f_{\alpha }, x),\quad \alpha \in (\tilde{\alpha }_{x, \varepsilon }, 1). \end{aligned}$$
  3. (iii)

    \(g^* \ge J_*(\cdot )\).

Proof

  1. (i)

    Let \(x\in S\) be arbitrary but fixed. Following ideas in Cavazos-Cadena and Salem-Silva (2010), it will be proved by induction that for every positive integer n

    $$\begin{aligned} e^{\lambda V_\alpha (x)} \le E_x^{f_\alpha } \left[ e^{\lambda \sum _{t=0}^{n-1} C(X_t, A_t) + \lambda V_\alpha (X_n)}\right] ^{\alpha ^n }\prod _{k=1}^ n e^{\lambda (1-\alpha ) \alpha ^{k-1} J_k( f_\alpha , x)}. \end{aligned}$$
    (4.9)

    To begin with, recall that the equality \(P_x^{f_ \alpha }[ A_t = f_\alpha (X_t)] = 1\) is always valid, so that the Markov property and (3.7) together yield that, for every \(x\in S\) and \(n\in \mathbb {N}\),

    $$\begin{aligned} e^{\lambda V_\alpha (X_{n})} = E_x^{f_\alpha } \left. \left[ e^{\lambda C(X_{n}, A_{n}) + \lambda \alpha V_\alpha (X_{n+1}) } \right| {\mathcal {F}}_{n}\right] ,\quad P_x^{f_\alpha }\hbox {-a.\,s.} \end{aligned}$$

    Setting \(n= 0\) in this relation and using that \(P_x^{f_\alpha }[X_0 = x] = 1\), it follows that

    $$\begin{aligned} e^{\lambda V_\alpha (x)}&= E_x^{f_\alpha } \left[ e^{\lambda C(X_0, A_0) + \lambda \alpha V_\alpha (X_1) }\right] \\&= E_x^{f_\alpha } \left[ \left( e^{\lambda C(X_0, A_0) + \lambda V_\alpha (X_1) }\right) ^\alpha \left( e^{\lambda C(X_0, A_0)}\right) ^{1-\alpha }\right] \\&\le E_x^{f_\alpha } \left[ e^{\lambda C(X_0, A_0) + \lambda V_\alpha (X_1) } \right] ^\alpha E_x^{f_\alpha } \left[ e^{\lambda C(X_0, A_0)}\right] ^{(1-\alpha )} \\&= E_x^{f_\alpha } \left[ e^{\lambda C(X_0, A_0) + \lambda V_\alpha (X_1) } \right] ^\alpha e^{\lambda J_1(f_\alpha , x) (1-\alpha )} \end{aligned}$$

    where Hölder’s inequality was used in the third step, and the last equality is due to (2.8). This shows that (4.9) holds for \(n= 1\). Next, assume that (4.9) is valid for a certain positive integer n. Observing that the equality \(A_t = f_\alpha (X_t)\) holds with probability one under \(f_\alpha \), and using that \(\sum _{t=0}^{n-1} C(X_t, A_t)\) is \({\mathcal {F}}_n\)-measurable, by (2.1), via the Markov property it follows that

    $$\begin{aligned}&E_x^{f_\alpha } \left. \left[ e^{\lambda \sum _{t=0}^{n-1} C(X_t, A_t) + \lambda V_\alpha (X_n)}\right| {\mathcal {F}}_n\right] \\&\qquad = e^{\lambda \sum _{t=0}^{n-1} C(X_t, A_t) } e^{\lambda V_\alpha (X_n)}\\&\qquad = e^{\lambda \sum _{t=0}^{n-1} C(X_t, A_t) } E_x^{f_\alpha } \left. \left[ e^{\lambda C(X_{n}, A_{n}) + \lambda \alpha V_\alpha (X_{n+1}) } \right| {\mathcal {F}}_{n}\right] \\&\qquad = E_x^{f_\alpha } \left. \left[ e^{\lambda \sum _{t=0}^{n} C(X_t, A_t) + \lambda \alpha V_\alpha (X_{n+1}) } \right| {\mathcal {F}}_{n}\right] . \end{aligned}$$

    Therefore, via Hölder’s inequality and (2.8) it follows that

    $$\begin{aligned}&E_x^{f_\alpha } \left[ e^{\lambda \sum _{t=0}^{n-1} C(X_t, A_t) + \lambda V_\alpha (X_n)}\right] \\&\qquad = E_x^{f_\alpha } \left[ e^{\lambda \sum _{t=0}^{n} C(X_t, A_t) + \lambda \alpha V_\alpha (X_{n+1}) } \right] \\&\qquad = E_x^{f_\alpha } \left[ \left( e^{\lambda \sum _{t=0}^{n} C(X_t, A_t) + \lambda V_\alpha (X_{n+1}) } \right) ^\alpha \left( e^{\lambda \sum _{t=0}^{n} C(X_t, A_t) } \right) ^{(1-\alpha )} \right] \\&\qquad \le E_x^{f_\alpha } \left[ e^{\lambda \sum _{t=0}^{n} C(X_t, A_t) + \lambda V_\alpha (X_{n+1}) } \right] ^\alpha E_x^{f_\alpha } \left[ e^{\lambda \sum _{t=0}^{n} C(X_t, A_t) } \right] ^{(1-\alpha )} \\&\qquad = E_x^{f_\alpha } \left[ e^{\lambda \sum _{t=0}^{n} C(X_t, A_t) + \lambda V_\alpha (X_{n+1}) } \right] ^\alpha \left( e^{\lambda J_{n+1}(f_\alpha , x) }\right) ^{(1-\alpha )} , \end{aligned}$$

    and then

    $$\begin{aligned}&E_x^{f_\alpha } \left[ e^{\lambda \sum _{t=0}^{n-1} C(X_t, A_t) + \lambda V_\alpha (X_n)}\right] ^{\alpha ^n} \\&\qquad \le E_x^{f_\alpha } \left[ e^{\lambda \sum _{t=0}^{n} C(X_t, A_t) + \lambda V_\alpha (X_{n+1}) } \right] ^{\alpha ^{n+1}} \left( e^{\lambda J_{n+1}(f_\alpha , x) }\right) ^{(1-\alpha )\alpha ^n} \\&\qquad = E_x^{f_\alpha } \left[ e^{\lambda \sum _{t=0}^{n} C(X_t, A_t) + \lambda V_\alpha (X_{n+1}) } \right] ^{\alpha ^{n+1}} e^{\lambda (1-\alpha )\alpha ^n J_{n+1}(f_\alpha , x) }. \end{aligned}$$

    Combining this relation with the induction hypothesis, it follows that (4.9) holds with \(n+1\) instead of n. Now, to establish part (i) notice that for \(n=1,2,3,\ldots \)

    $$\begin{aligned} \left| \sum _{t=0}^{n-1} C(X_t, A_t) + V_\alpha (X_n)\right| \le n\Vert C\Vert + \Vert V_\alpha (\cdot )\Vert \le \Vert C\Vert ( n + (1-\alpha )^{-1}), \end{aligned}$$

    so that \(E_x^{f_\alpha } \left[ e^{\lambda \sum _{t=0}^{n-1} C(X_t, A_t) + \lambda V_\alpha (X_n)}\right] \le e^{|\lambda | \Vert C\Vert ( n + (1-\alpha )^{-1})}\), and via (4.9) it follows that

    $$\begin{aligned} e^{\lambda V_\alpha (x)} \le e^{\alpha ^n|\lambda | \Vert C\Vert ( n + (1-\alpha )^{-1})}\prod _{k=1}^ n e^{\lambda (1-\alpha ) \alpha ^{k-1} J_k( f_\alpha , x)}, \end{aligned}$$

    an inequality that, recalling that \(\lambda < 0\), is equivalent to

    $$\begin{aligned} V_\alpha (x) \ge - \alpha ^n\Vert C\Vert ( n + (1-\alpha )^{-1}) + \sum _{k=1}^ n (1-\alpha ) \alpha ^{k-1} J_k( f_\alpha , x). \end{aligned}$$

    Multiplying both sides of this relation by \((1-\alpha )\) and using (3.8), it follows that

    $$\begin{aligned} g_\alpha (x) \ge - \alpha ^n(1-\alpha ) \Vert C\Vert ( n + (1-\alpha )^{-1}) + \sum _{k=1}^ n (1-\alpha )^2 \alpha ^{k-1} J_k( f_\alpha , x) \end{aligned}$$

    and the desired conclusion follows by taking the limit as n goes to \(\infty \).

  2. (ii)

    Let \(x\in S\) and \(\varepsilon > 0\) be arbitrary and, using (2.9), pick \(N_0(x,\varepsilon ) \in \mathbb {N}\) such that

    $$\begin{aligned} {1\over k} J_k(f_\alpha , x) \ge J(f_\alpha , x)-\varepsilon /4,\quad k\ge N_0(x,\varepsilon ). \end{aligned}$$

    Thus, observing that \(|J(f_\alpha , x)|, k^{-1} |J_k(f_\alpha , x)|\le \Vert C\Vert \), via part (i) it follows that

    $$\begin{aligned} g_\alpha (x)&\ge (1-\alpha )^2 \sum _{k=1}^\infty k \alpha ^{k-1} {J_k(f_\alpha , x)\over k}\\&= J(f_\alpha , x) + (1-\alpha )^2 \sum _{k=1}^\infty k \alpha ^{k-1} \left( {1\over k}J_k(f_\alpha , x) - J(f_\alpha , x)\right) \\&\ge J(f_\alpha , x) + (1-\alpha )^2 \sum _{k=1}^{N_0(x, \varepsilon )-1} k \alpha ^{k-1} \left( {1\over k}J_k(f_\alpha , x) - J(f_\alpha , x)\right) -\varepsilon /4\\&\ge J(f_\alpha , x) -2 (1-\alpha )^2 \Vert C\Vert \sum _{k=1}^{N_0(x, \varepsilon )-1} k \alpha ^{k-1} - \varepsilon /4, \end{aligned}$$

    where the choice of \(N_0(x, \varepsilon )\) in the previous display, combined with the fact that \((1-\alpha )^2\sum _{k=1}^\infty k\alpha ^{k-1} = 1\), was used to establish the third inequality. Finally, select \(\tilde{\alpha }_{x, \varepsilon }\in (0, 1)\) such that \((1-\alpha )^2 \sum _{k=1}^{N_0(x, \varepsilon )-1} k \alpha ^{k-1} \le \varepsilon (8 \Vert C\Vert +1)^{-1}\) when \( \alpha \in (\tilde{\alpha }_{x, \varepsilon }, 1)\) to conclude that

    $$\begin{aligned} g_{\alpha }(x) \ge J(f_{\alpha }, x)-\varepsilon /2, \quad \alpha \in (\tilde{\alpha }_{x, \varepsilon }, 1), \end{aligned}$$

    completing the proof of part (ii).

  3. (iii)

    Let \(x\in S\) be arbitrary. Given \(\varepsilon > 0\), let \(\tilde{\alpha }_{x, \varepsilon } \in (0, 1)\) be as in part (ii) and observe that (4.5) yields that there exists \(\tilde{N}(x, \varepsilon )\in \mathbb {N}\) such that \(\alpha _n > \tilde{\alpha }_{x, \varepsilon }\) if \(n > \tilde{N}(x,\varepsilon )\), and in this case (4.3) implies that \(g_{\alpha _n}(x) \ge J(f_{\alpha _n}, x) -\varepsilon /2\), so that

    $$\begin{aligned} g_{\alpha _n}(x) \ge J_*(x) -\varepsilon /2,\quad n> \tilde{N}(x, \varepsilon ). \end{aligned}$$

    Taking the limit as n goes to \(\infty \), this relation leads to \(g^* \ge J_*(x) -\varepsilon /2\), and the conclusion follows, since \(\varepsilon >0\) is arbitrary. \(\square \)
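For ease of reference, the estimate applied twice in the induction argument of the above proof is the following interpolation form of Hölder’s inequality: for nonnegative random variables U and W and \(\alpha \in (0,1)\),

$$\begin{aligned} E\left[ U^\alpha W^{1-\alpha }\right] \le E\left[ U\right] ^\alpha E\left[ W\right] ^{1-\alpha }, \end{aligned}$$

which follows from Hölder’s inequality with conjugate exponents \(p = 1/\alpha \) and \(q = 1/(1-\alpha )\) applied to the factors \(U^\alpha \) and \(W^{1-\alpha }\); in the proof it was used with \(U = e^{\lambda \sum _{t=0}^{n} C(X_t, A_t) + \lambda V_\alpha (X_{n+1})}\) and \(W = e^{\lambda \sum _{t=0}^{n} C(X_t, A_t)}\).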

5 Proof of the main result

After the preliminaries in the previous section, the main conclusions of the paper can be established as follows.

Proof of Theorem 3.1

Let \(\{\alpha _n\}_{n\in \mathbb {N}}\) be an arbitrary sequence satisfying (4.5) and, as before, by taking a subsequence if necessary, assume without loss of generality that (4.6) holds, so that \(\lim _{n\rightarrow \infty } g_{\alpha _n}(\cdot ) = g^*\in \mathbb {R}\), by Lemma 4.2(i).

  1. (i)

    Combining Lemma 4.2(iv) and Lemma 4.3(iii) it follows that \(J_*(\cdot ) = g^* = \lim _{n\rightarrow \infty } g_{\alpha _n}(x)\) for every \(x\in S\). Thus, since the sequence \(\{\alpha _n\}\) satisfying (4.5) is arbitrary, it follows that \(\lim _{\alpha \nearrow 1} g_{\alpha } (\cdot ) = J_*(\cdot ) = g^*\).

  2. (ii)

    Let \(x\in S\) be arbitrary but fixed. Given \(\varepsilon > 0\), using part (i) select \(\hat{\alpha }_{x, \varepsilon }\in (0, 1)\) such that

    $$\begin{aligned} g_{\alpha }(x) < g^* + \varepsilon /2,\quad \alpha \in (\hat{\alpha }_{x, \varepsilon }, 1). \end{aligned}$$

    Setting \(\alpha _{x, \varepsilon } = \max \{\hat{\alpha }_{x, \varepsilon }, \tilde{\alpha }_{x, \varepsilon }\}\), where \(\tilde{\alpha }_{x, \varepsilon }\) is as in Lemma 4.3(ii), this last display and Lemma 4.3(ii) together yield that (3.12) holds. \(\square \)

6 Conclusion

In this work, Markov decision chains on a denumerable state space were studied. It was assumed that the performance of a decision policy is measured by the average criterion as perceived by a risk-seeking controller with constant risk-sensitivity. Under conditions ensuring that the optimal average cost is constant, but not that the optimality equation admits a solution, the problems of approximating the optimal average cost and of determining a nearly optimal stationary policy via the fixed points of a family of contractive operators were studied. The results in this direction, which are stated in Theorem 3.1, extend to the present framework the classical discounted approach in the theory of Markov decision chains endowed with the risk-neutral average index. On the other hand, extending the conclusions of Theorem 3.1 to more general contexts, including unbounded costs or a more general state space, seems to be an interesting problem.
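As a numerical complement, the vanishing-discount scheme behind Theorem 3.1 can be illustrated on a toy model. The sketch below (in Python, with all model data — the two-state, two-action structure, the costs C, the transition law P and the risk parameter lam — invented for illustration, not taken from the paper) computes the fixed point \(V_\alpha \) of the \(\alpha \)-discounted risk-sensitive operator by successive approximations, and displays the normalized values \((1-\alpha ) V_\alpha (\cdot )\) for \(\alpha \) close to 1.

```python
import math

# Hypothetical two-state, two-action model (illustrative data only).
C = [[1.0, 2.0], [0.5, 1.5]]          # one-step costs C[x][a]
P = [[[0.8, 0.2], [0.3, 0.7]],        # transition law P[x][a][y]
     [[0.5, 0.5], [0.9, 0.1]]]
lam = -1.0                            # risk-seeking parameter, lambda < 0


def T(V, alpha):
    """alpha-discounted risk-sensitive operator:
    (T V)(x) = min_a (1/lam) log sum_y P(y|x,a) e^{lam (C(x,a) + alpha V(y))}."""
    new_V = []
    for x in range(2):
        certainty_equivalents = []
        for a in range(2):
            s = sum(P[x][a][y] * math.exp(lam * (C[x][a] + alpha * V[y]))
                    for y in range(2))
            certainty_equivalents.append(math.log(s) / lam)
        new_V.append(min(certainty_equivalents))
    return new_V


def V_alpha(alpha):
    """Fixed point of T(., alpha) via successive approximations; T is a
    contraction with modulus alpha, so the iteration count scales like
    1/(1 - alpha) to reach a fixed accuracy."""
    V = [0.0, 0.0]
    for _ in range(int(25.0 / (1.0 - alpha)) + 1):
        V = T(V, alpha)
    return V


for alpha in (0.9, 0.99, 0.999):
    g = [(1.0 - alpha) * v for v in V_alpha(alpha)]
    print(alpha, g)
```

As \(\alpha \nearrow 1\) the two components of \((1-\alpha )V_\alpha (\cdot )\) cluster around a common value, mirroring the conclusion of Theorem 3.1(i) that the normalized fixed points converge to the constant optimal average cost.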