
Dedicated to Onésimo Hernández-Lerma on the occasion of his 65th birthday.

1.1 Introduction

The policy iteration algorithm (PIA) for controlled Markov chains has been known since the fundamental work of Howard [2]. For controlled Markov chains on Borel state spaces, most studies of the PIA rely on blanket Lyapunov conditions [9]. A study of the PIA that treats the model of near-monotone costs can be found in [11], some ideas of which we follow closely. An analysis of the PIA for piecewise deterministic Markov processes has appeared in [6].

In this chapter we study the PIA for controlled diffusion processes X = { X t ,  t ≥ 0}, taking values in the d-dimensional Euclidean space \({\mathbb{R}}^{d}\), and governed by the Itô stochastic differential equation

$$\mathrm{d}{X}_{t} = b({X}_{t},{U}_{t})\,\mathrm{d}t + { \sigma }({X}_{t})\,\mathrm{d}{W}_{t}\,.$$
(1.1)

All random processes in (1.1) live in a complete probability space \((\Omega,\mathfrak{F}, \mathbb{P})\). The process W is a d-dimensional standard Wiener process independent of the initial condition X 0. The control process U takes values in a compact, metrizable set \(\mathbb{U}\), and U t (ω) is jointly measurable in (t, ω) ∈ [0, ∞) × Ω. Moreover, it is nonanticipative: for s < t, W t  − W s is independent of

$${\mathfrak{F}}_{s} := \text{ the completion of }\sigma \{{X}_{0},{U}_{r},{W}_{r},\;r \leq s\}\text{ relative to }(\mathfrak{F}, \mathbb{P})\,.$$

Such a process U is called an admissible control, and we let \(\mathfrak{U}\) denote the set of all admissible controls.

We impose the following standard assumptions on the drift b and the diffusion matrix σ to guarantee existence and uniqueness of solutions to (1.1).

  1. (A1)

    Local Lipschitz continuity: The functions

    $$b ={\bigl [ {b}^{1},\ldots,{b}^{d}\bigr ]}^{\mathsf{T}} : {\mathbb{R}}^{d} \times \mathbb{U}\mapsto {\mathbb{R}}^{d}\quad \text{ and}\quad { \sigma } ={\bigl [{ { \sigma }}^{ij}\bigr ]} : {\mathbb{R}}^{d}\mapsto {\mathbb{R}}^{d\times d}$$

    are locally Lipschitz in x with a Lipschitz constant K R depending on R > 0. In other words, if B R denotes the open ball of radius R centered at the origin in \({\mathbb{R}}^{d}\), then for all x, y ∈ B R and \(u \in \mathbb{U}\),

    $$\vert b(x,u) - b(y,u)\vert +\Vert { \sigma }(x) -{ \sigma }(y)\Vert \leq {K}_{R}\vert x - y\vert \,,$$

    where \(\Vert { \sigma }{\Vert }^{2} := \mathrm{trace}\,{ \sigma }{ \sigma }^{\mathsf{T}}\).

  2. (A2)

    Affine growth condition: b and σ satisfy a global growth condition of the form

    $$\vert b{(x,u)\vert }^{2} +\Vert { \sigma }{(x)\Vert }^{2} \leq {K}_{ 1}{\bigl (1 +\vert {x\vert }^{2}\bigr )}\,,\quad \forall (x,u) \in {\mathbb{R}}^{d} \times \mathbb{U}\,.$$
  3. (A3)

    Local nondegeneracy: For each R > 0, there exists a positive constant κ R such that

    $${\sum \limits _{i,j=1}^{d}}{a}^{ij}(x){\xi }_{ i}{\xi }_{j} \geq {\kappa }_{R}\vert {\xi \vert }^{2}\,,\quad \forall x \in {B}_{ R}\,,$$

    for all \(\xi = ({\xi }_{1},\ldots,{\xi }_{d}) \in {\mathbb{R}}^{d}\), where \(a := \frac{1} {2}{ \sigma }\,{ \sigma }^{\mathsf{T}}\).

We also assume that b is continuous in (x, u).

In integral form, (1.1) is written as

$${X}_{t} = {X}_{0} +{ \int \nolimits \nolimits }_{0}^{t}b({X}_{ s},{U}_{s})\,\mathrm{d}s +{ \int \nolimits \nolimits }_{0}^{t}{ \sigma }({X}_{ s})\,\mathrm{d}{W}_{s}\,.$$
(1.2)

The second term on the right-hand side of (1.2) is an Itô stochastic integral. We say that a process X = { X t (ω)} is a solution of (1.1) if it is \({\mathfrak{F}}_{t}\)-adapted, continuous in t, defined for all ω ∈ Ω and t ∈ [0, ∞), and satisfies (1.2) for all t ∈ [0, ∞) at once a.s.
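For intuition, the integral equation (1.2) can be approximated numerically by an Euler–Maruyama discretization. The sketch below is purely illustrative: the drift b(x, u) = u − x, the identity diffusion matrix, and the zero control are toy choices of ours, not part of the model assumptions.

```python
import numpy as np

def euler_maruyama(b, sigma, x0, U, T=1.0, dt=1e-3, seed=0):
    """Simulate dX = b(X,U) dt + sigma(X) dW by an Euler-Maruyama scheme."""
    rng = np.random.default_rng(seed)
    n = int(round(T / dt))
    x = np.asarray(x0, dtype=float)
    path = np.empty((n + 1, x.size))
    path[0] = x
    for k in range(n):
        u = U(k * dt, x)                       # nonanticipative feedback control
        dw = rng.normal(0.0, np.sqrt(dt), size=x.size)
        x = x + b(x, u) * dt + sigma(x) @ dw   # one discrete step of (1.2)
        path[k + 1] = x
    return path

# Toy data: d = 1, b(x,u) = u - x, sigma = identity, control U identically 0.
path = euler_maruyama(lambda x, u: u - x, lambda x: np.eye(1),
                      x0=[2.0], U=lambda t, x: 0.0)
```

The discretized control is evaluated from the current state only, mimicking the nonanticipativity requirement above.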

With \(u \in \mathbb{U}\) treated as a parameter, we define the family of operators \({L}^{u} : {\mathcal{C}}^{2}({\mathbb{R}}^{d})\mapsto \mathcal{C}({\mathbb{R}}^{d})\) by

$${L}^{u}f(x) ={ \sum \limits _{i,j}}{a}^{ij}(x) \frac{{\partial }^{2}f} {\partial {x}_{i}\partial {x}_{j}}(x) +{ \sum \limits _{i}}{b}^{i}(x,u) \frac{\partial f} {\partial {x}_{i}}(x)\,,\quad u \in \mathbb{U}\,.$$
(1.3)

We refer to L u as the controlled extended generator of the diffusion.
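For a concrete test function, take f(x) = |x|², whose gradient is 2x and whose Hessian is 2I; then (1.3) reduces to \(L^{u}f(x) = 2\,\mathrm{trace}\,a(x) + 2\langle b(x,u),x\rangle\). The snippet below evaluates this sum directly; the coefficients a = I ∕ 2 and b(x, u) = u − x are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

def Lu_f(a, b, x, u):
    """Evaluate (1.3) for f(x) = |x|^2: sum_ij a^ij d_ij f + sum_i b^i d_i f."""
    hess = 2.0 * np.eye(x.size)                 # Hessian of f(x) = |x|^2
    grad = 2.0 * x                              # gradient of f
    return np.einsum('ij,ij->', a(x), hess) + b(x, u) @ grad

# Illustrative coefficients: a = I/2 (unit diffusion), b(x,u) = u - x.
a = lambda x: 0.5 * np.eye(x.size)
b = lambda x, u: u - x
x = np.array([1.0, -2.0])
u = np.zeros(2)
val = Lu_f(a, b, x, u)    # 2*trace(I/2) + 2*<-x, x> = 2 - 2*5 = -8
```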

Of fundamental importance in the study of functionals of X is Itô’s formula. For \(f \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) and with L u as defined in (1.3),

$$f({X}_{t}) = f({X}_{0}) +{ \int \nolimits \nolimits }_{0}^{t}{L}^{{U}_{s} }f({X}_{s})\,\mathrm{d}s + {M}_{t}\,,\quad \mathrm{ a.s.},$$
(1.4)

where

$${M}_{t} :={ \int \nolimits \nolimits }_{0}^{t}{\bigl \langle\nabla f({X}_{ s}),{ \sigma }({X}_{s})\,\mathrm{d}{W}_{s}\bigr \rangle}$$

is a local martingale. Krylov's extension of the Itô formula [10, p. 122] generalizes (1.4) to functions f in the Sobolev space \({{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\).

Recall that a control is called stationary Markov if U t  = v(X t ) for a measurable map \(v : {\mathbb{R}}^{d}\mapsto \mathbb{U}\). Correspondingly, the equation

$${X}_{t} = {x}_{0} +{ \int \nolimits \nolimits }_{0}^{t}b{\bigl ({X}_{ s},v({X}_{s})\bigr )}\,\mathrm{d}s +{ \int \nolimits \nolimits }_{0}^{t}{ \sigma }({X}_{ s})\,\mathrm{d}{W}_{s}$$
(1.5)

is said to have a strong solution if given a Wiener process \(({W}_{t},{\mathfrak{F}}_{t})\) on a complete probability space \((\Omega,\mathfrak{F}, \mathbb{P})\), there exists a process X on \((\Omega,\mathfrak{F}, \mathbb{P})\), with \({X}_{0} = {x}_{0} \in {\mathbb{R}}^{d}\), which is continuous, \({\mathfrak{F}}_{t}\)-adapted, and satisfies (1.5) for all t at once, a.s. A strong solution is called unique, if any two such solutions X and X′ agree \(\mathbb{P}\)-a.s. when viewed as elements of \(\mathcal{C}{\bigl ([0,\infty ), {\mathbb{R}}^{d}\bigr )}\). It is well known that under Assumptions (A1)–(A3), for any stationary Markov control v, (1.5) has a unique strong solution [8].

Let \({\mathfrak{U}}_{\mathrm{SM}}\) denote the set of stationary Markov controls. Under \(v \in {\mathfrak{U}}_{\mathrm{SM}}\), the process X is strong Markov, and we denote its transition function by P v(t, x,  ⋅). It also follows from the work of [5, 12] that under \(v \in {\mathfrak{U}}_{\mathrm{SM}}\), the transition probabilities of X have densities which are locally Hölder continuous. Thus L v defined by

$${L}^{v}f(x) ={ \sum \nolimits }_{i,j}{a}^{ij}(x) \frac{{\partial }^{2}f} {\partial {x}_{i}\partial {x}_{j}}(x) +{ \sum \nolimits }_{i}{b}^{i}(x,v(x)) \frac{\partial f} {\partial {x}_{i}}(x)\,,\quad v \in {\mathfrak{U}}_{\mathrm{SM}}\,,$$

for \(f \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) is the generator of a strongly continuous semigroup on \({\mathcal{C}}_{b}({\mathbb{R}}^{d})\), which is strong Feller. We let \({\mathbb{P}}_{x}^{v}\) denote the probability measure and \({\mathbb{E}}_{x}^{v}\) the expectation operator on the canonical space of the process under the control \(v \in {\mathfrak{U}}_{\mathrm{SM}}\), conditioned on the process X starting from \(x \in {\mathbb{R}}^{d}\) at t = 0.

In Sect. 1.2 we define our notation. Section 1.3 reviews the ergodic control problem for near-monotone costs and the basic properties of the PIA. Section 1.4 is dedicated to the convergence of the algorithm.

1.2 Notation

The standard Euclidean norm in \({\mathbb{R}}^{d}\) is denoted by |  ⋅ | , and ⟨ ⋅,  ⋅⟩ stands for the inner product. The set of nonnegative real numbers is denoted by \({\mathbb{R}}_{+}\), \(\mathbb{N}\) stands for the set of natural numbers, and \(\mathbb{I}\) denotes the indicator function. We denote by τ(A) the first exit time of the process {X t } from the set \(A \subset {\mathbb{R}}^{d}\), defined by

$$\tau (A) :=\inf \;\{ t > 0 : {X}_{t}\not\in A\}\,.$$

The open ball of radius R in \({\mathbb{R}}^{d}\), centered at the origin, is denoted by B R , and we let \({\tau }_{R} := \tau ({B}_{R})\) and \(\breve{\tau }_{R} := \tau ({B}_{R}^{c})\).

The term domain in \({\mathbb{R}}^{d}\) refers to a nonempty, connected open subset of the Euclidean space \({\mathbb{R}}^{d}\). We introduce the following notation for spaces of real-valued functions on a domain \(D \subset {\mathbb{R}}^{d}\). The space \({\mathcal{L}}^{p}(D)\), p ∈ [1, ∞), stands for the Banach space of (equivalence classes of) measurable functions f satisfying ∫ D  | f(x) | p dx < ∞, and \({\mathcal{L}}^{\infty }(D)\) is the Banach space of functions that are essentially bounded in D. The space \({\mathcal{C}}^{k}(D)\) (\({\mathcal{C}}^{\infty }(D)\)) refers to the class of all functions whose partial derivatives up to order k (of any order) exist and are continuous, and \({\mathcal{C}}_{c}^{k}(D)\) is the space of functions in \({\mathcal{C}}^{k}(D)\) with compact support. The standard Sobolev space of functions on D whose generalized derivatives up to order k are in \({\mathcal{L}}^{p}(D)\), equipped with its natural norm, is denoted by \({\mathcal{W}}^{k,p}(D)\), k ≥ 0, p ≥ 1.

In general, if \(\mathcal{X}\) is a space of real-valued functions on D, \({\mathcal{X}}_{\mathrm{loc}}\) consists of all functions f such that \(f\varphi \in \mathcal{X}\) for every \(\varphi \in {\mathcal{C}}_{c}^{\infty }(D)\). In this manner we obtain the spaces \({\mathcal{L}}_{\mathrm{loc}}^{p}(D)\) and \({{\mathcal{W}}_{\mathrm{loc}}}^{2,p}(D)\).

Let \(h \in \mathcal{C}({\mathbb{R}}^{d})\) be a positive function. We denote by \(\mathcal{O}(h)\) the set of functions \(f \in \mathcal{C}({\mathbb{R}}^{d})\) having the property

$${\limsup}_{\vert x\vert \rightarrow \infty }\;\frac{\vert f(x)\vert } {h(x)} < \infty \,,$$
(1.6)

and by \(\mathfrak{o}(h)\) the subset of \(\mathcal{O}(h)\) over which the limit in (1.6) is zero.

We adopt the notation \({\partial }_{i} := \tfrac{\partial \ } {\partial {x}_{i}}\) and \({\partial }_{ij} := \tfrac{{\partial }^{2}\ } {\partial {x}_{i}\partial {x}_{j}}\). We often use the standard summation rule that repeated subscripts and superscripts are summed from 1 through d. For example,

$${a}^{ij}{\partial }_{ ij}\varphi + {b}^{i}{\partial }_{ i}\varphi :={ \sum \limits _{i,j=1}^{d}}{a}^{ij} \frac{{\partial }^{2}\varphi } {\partial {x}_{i}\partial {x}_{j}} +{ \sum \limits _{i=1}^{d}}{b}^{i} \frac{\partial \varphi } {\partial {x}_{i}}\,.$$

1.3 Ergodic Control and the PIA

Let \(c: {\mathbb{R}}^{d} \times \mathbb{U} \rightarrow \mathbb{R}\) be a continuous function bounded from below. As is well known, the ergodic control problem, in its almost sure (or pathwise) formulation, seeks to minimize a.s., over all admissible \(U \in \mathfrak{U}\), the average cost

$${\limsup \limits_{t\rightarrow \infty }}\;\frac{1} {t}{\int \nolimits \nolimits }_{0}^{t}c({X}_{ s},{U}_{s})\,\mathrm{d}s\,.$$
(1.7)

A weaker, average formulation seeks to minimize

$${\limsup \limits_{t\rightarrow \infty }}\;\frac{1} {t}{\int \nolimits \nolimits }_{0}^{t}{\mathbb{E}}^{U}{\bigl [c({X}_{ s},{U}_{s})\bigr ]}\,\mathrm{d}s\,.$$
(1.8)

We let ρ ∗  denote the infimum of (1.8) over all admissible controls. We assume that ρ ∗  < ∞.

We assume that the cost function \(c: {\mathbb{R}}^{d} \times \mathbb{U} \rightarrow {\mathbb{R}}_{+}\) is continuous and locally Lipschitz in its first argument uniformly in \(u \in \mathbb{U}\). More specifically, for some function \({K}_{c}: {\mathbb{R}}_{+} \rightarrow {\mathbb{R}}_{+}\),

$${\bigl \lvert c(x,u) - c(y,u)\bigr \rvert} \leq {K}_{c}(R)\vert x - y\vert \qquad \forall x,y \in {B}_{R}\,,\ \forall u \in \mathbb{U}\,,$$

and all R > 0.

An important class of running cost functions arising in practice, for which the ergodic control problem is well behaved, is that of near-monotone cost functions. Let \({M}^{{_\ast}}\in {\mathbb{R}}_{+} \cup \{\infty \}\) be defined by

$${M}^{{_\ast}} :={ \liminf \limits_{\vert x\vert \rightarrow \infty }}\;\min _{u\in \mathbb{U}}\;c(x,u)\,.$$

The running cost function c is called near-monotone if ρ ∗  < M  ∗ . Note that inf-compact functions c are always near-monotone.
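As a toy illustration (our choice, not the chapter's), the scalar cost c(x) = 1 − e^{−|x|}, which has no control dependence, satisfies M ∗  = 1: it is near-monotone whenever ρ ∗  < 1, even though it is not inf-compact.

```python
import numpy as np

# Illustrative cost with no control dependence: c(x) = 1 - exp(-|x|).
c = lambda x: 1.0 - np.exp(-np.abs(x))

# M* = liminf_{|x| -> oo} min_u c(x,u); here c is increasing in |x|, so the
# liminf is a genuine limit and equals 1.
radii = 10.0 ** np.arange(1, 7)
M_star_approx = c(radii)[-1]    # approaches M* = 1
```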

We adopt the following abbreviated notation. For a function \(g : {\mathbb{R}}^{d} \times \mathbb{U} \rightarrow \mathbb{R}\) and \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) we let

$${g}_{v}(x) := g{\bigl (x,v(x)\bigr )}\,,\quad x \in {\mathbb{R}}^{d}\,.$$

The ergodic control problem for near-monotone cost functions is characterized as follows.

Theorem 1.3.1.

There exists a unique function \(V \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) which is bounded below in \({\mathbb{R}}^{d}\) and satisfies V (0) = 0 and the Hamilton–Jacobi–Bellman (HJB) equation

$${ \min _{u\in \mathbb{U}}}\;{\bigl [{L}^{u}V (x) + c(x,u)\bigr ]} = {\rho }^{{_\ast}}\,,\quad x \in {\mathbb{R}}^{d}\,.$$

The control \({v}^{{_\ast}}\in {\mathfrak{U}}_{\mathrm{SM}}\) is optimal with respect to the criteria (1.7) and (1.8) if and only if it satisfies

$${ \min _{u\in \mathbb{U}}}\;\left [{\sum \limits _{i=1}^{d}}{b}^{i}(x,u) \frac{\partial V } {\partial {x}_{i}}(x) + c(x,u)\right ] ={ \sum \limits _{i=1}^{d}}{b}_{{ v}^{{_\ast}}}^{i}(x) \frac{\partial V } {\partial {x}_{i}}(x) + {c}_{{v}^{{_\ast}}}(x)$$

a.e. in \({\mathbb{R}}^{d}\). Moreover, with \(\breve{\tau }_{r} := \tau ({B}_{r}^{c})\), r > 0, we have

$$V (x) ={ \limsup \limits_{r\downarrow 0}}\;\inf _{v\in {\mathfrak{U}}_{\mathrm{SSM}}}\;{\mathbb{E}}_{x}^{v}\left [{\int \nolimits \nolimits }_{0}^{{\breve{\tau }}_{r} }{\bigl ({c}_{v}({X}_{t}) - {\rho }^{{_\ast}}\bigr )}\,\mathrm{d}t\right ]\,,\quad x \in {\mathbb{R}}^{d}\,.$$

A control \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) is called stable if the associated diffusion is positive recurrent. We denote the set of such controls by \({\mathfrak{U}}_{\mathrm{SSM}}\). Also we let μ v denote the unique invariant probability measure on \({\mathbb{R}}^{d}\) for the diffusion under the control \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\). Recall that \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) if and only if there exists an inf-compact function \(\mathcal{V}\in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\), a bounded domain \(D \subset {\mathbb{R}}^{d}\), and a constant ε > 0 satisfying

$${L}^{v}\mathcal{V}(x) \leq -\epsilon \qquad \forall x \in {D}^{c}\,.$$
(1.9)

It follows that the optimal control \({v}^{{_\ast}}\) in Theorem 1.3.1 is stable. For \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) we define

$${\rho }_{v} :={ \limsup _{t\rightarrow \infty }}\;\frac{1} {t}{\int \nolimits \nolimits }_{0}^{t}{\mathbb{E}}^{v}{\bigl [{c}_{ v}({X}_{s})\bigr ]}\,\mathrm{d}s\,.$$

A difficulty in synthesizing an optimal control \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) via the HJB equation lies in the fact that the optimal cost ρ ∗  is not known. The PIA provides an iterative procedure that yields the HJB equation in the limit. In order to describe the algorithm, we first need to review some properties of the Poisson equation

$${L}^{v}V (x) + {c}_{ v}(x) = \rho \,,\quad x \in {\mathbb{R}}^{d}\,.$$
(1.10)

We need the following definition.

Definition 1.3.1.

For \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\), and provided ρ v  < ∞, define

$${\Psi }^{v}(x) :{=\lim _{ r\downarrow 0}}\;{\mathbb{E}}_{x}^{v}\left [{\int \nolimits \nolimits }_{0}^{{\breve{\tau }}_{r} }{\bigl ({c}_{v}({X}_{t}) - {\rho }_{v}\bigr )}\,\mathrm{d}t\right ]\,,\quad x\neq 0\,.$$

For \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) and α > 0, let J α v denote the α-discounted cost

$${J}_{\alpha }^{v}(x) := {\mathbb{E}}_{ x}^{v}\left [{\int \nolimits \nolimits }_{0}^{\infty }\mathrm{{e}}^{-\alpha t}{c}_{ v}({X}_{t})\,\mathrm{d}t\right ]\,,\quad x \in {\mathbb{R}}^{d}.$$

We borrow the following result from [1, Lemma 7.4]. If \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) and ρ v  < ∞, then there exist a function \(V \in {{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\), for any p > 1, and a constant \(\rho \in \mathbb{R}\) which satisfy (1.10) a.e. in \({\mathbb{R}}^{d}\) and such that, as α ↓ 0, αJ α v(0) → ρ and J α v − J α v(0) → V uniformly on compact subsets of \({\mathbb{R}}^{d}\). Moreover,

$$\rho = {\rho }_{v}\quad \text{ and}\quad V (x) = {\Psi }^{v}(x).$$

We refer to the function \(V (x) = {\Psi }^{v}(x) \in {{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\) as the canonical solution of the Poisson equation \({L}^{v}V + {c}_{v} = {\rho }_{v}\) in \({\mathbb{R}}^{d}\).

It can be shown that the canonical solution V to the Poisson equation is the unique solution in \({{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\) which is bounded below and satisfies V (0) = 0. Note also that (1.9) implies that any control v satisfying ρ v  < M  ∗  is stable.
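The vanishing-discount passage can be mimicked on a finite-state analogue, replacing L v by the generator matrix Q of a continuous-time Markov chain; then J α solves (αI − Q)J α  = c, and as α ↓ 0 one observes αJ α (0) → ρ v and J α  − J α (0) → V. The two-state chain below (Q and c) is a made-up example, not derived from the diffusion model.

```python
import numpy as np

Q = np.array([[-1.0,  1.0],       # generator of a two-state chain:
              [ 2.0, -2.0]])      # rows sum to zero, off-diagonal rates >= 0
c = np.array([0.0, 3.0])          # running cost per state

def J(alpha):
    """alpha-discounted cost: the finite-state analogue solves (aI - Q) J = c."""
    return np.linalg.solve(alpha * np.eye(2) - Q, c)

# The stationary distribution pi solves pi Q = 0; here pi = (2/3, 1/3), so the
# average cost is rho = pi . c = 1, and the canonical solution (with V(0) = 0)
# of the Poisson equation Q V + c = rho 1 is V = (0, 1).
Ja = J(1e-6)
rho_approx = 1e-6 * Ja[0]          # alpha * J_alpha(0) -> rho = 1
V1_approx = Ja[1] - Ja[0]          # J_alpha(1) - J_alpha(0) -> V(1) = 1
```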

The PIA takes the following familiar form:

Algorithm (PIA).

  1.

    Initialization. Set k = 0 and select any \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SM}}\) such that \({\rho }_{{v}_{0}} < {M}^{{_\ast}}\).

  2.

    Value determination. Obtain the canonical solution \({V }_{k} = {\Psi }^{{v}_{k}} \in {{\mathcal{W}}_{\mathrm{ loc}}}^{2,p}({\mathbb{R}}^{d})\) , p > 1, to the Poisson equation

    $${L}^{{v}_{k} }{V }_{k} + {c}_{{v}_{k}} = {\rho }_{{v}_{k}}$$

    in \({\mathbb{R}}^{d}\).

  3.

    If \({v}_{k}(x) \in { Arg\,min}_{u\in \mathbb{U}}\left [{b}^{i}(x,u){\partial }_{i}{V }_{k}(x) + c(x,u)\right ]\) for a.e. \(x \in {\mathbb{R}}^{d}\), return v k .

  4.

    Policy improvement. Select an arbitrary \({v}_{k+1} \in {\mathfrak{U}}_{\mathrm{SM}}\) which satisfies

    $${v}_{k+1}(x) \in { Arg\,min _{u\in \mathbb{U}}}\;\left [{\sum \limits _{i=1}^{d}}{b}^{i}(x,u)\frac{\partial {V }_{k}} {\partial {x}_{i}} (x) + c(x,u)\right ]\,,\qquad x \in {\mathbb{R}}^{d}\,.$$
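Although the chapter's setting is a nondegenerate diffusion, the structure of the PIA is easiest to see on a finite-state, finite-action MDP, where value determination is a linear solve of the discrete Poisson equation and policy improvement is a pointwise argmin. The sketch below is such a discrete analogue; the transition matrices and costs are an illustrative unichain example of ours, not derived from (1.1).

```python
import numpy as np

def policy_iteration(P, c):
    """Average-cost PIA for a finite unichain MDP.

    P[u] is the transition matrix under action u, c[u, s] the running cost.
    Value determination solves (I - P_v) V + rho * 1 = c_v with V[0] = 0,
    the discrete analogue of the canonical Poisson solution; improvement
    picks a pointwise selector of the argmin."""
    A, S, _ = P.shape
    v = np.full(S, A - 1, dtype=int)          # arbitrary initial policy v_0
    while True:
        Pv = P[v, np.arange(S)]               # rows of P under policy v
        cv = c[v, np.arange(S)]
        M = np.zeros((S + 1, S + 1))
        M[:S, :S] = np.eye(S) - Pv
        M[:S, S] = 1.0                        # column multiplying rho
        M[S, 0] = 1.0                         # normalization V[0] = 0
        sol = np.linalg.solve(M, np.append(cv, 0.0))
        V, rho = sol[:S], sol[S]
        v_new = (c + P @ V).argmin(axis=0)    # policy improvement step
        if np.array_equal(v_new, v):
            return v, rho, V                  # equilibrium of the iteration
        v = v_new

# Two states, two actions: action u steers toward state u; state 1 costs 1,
# and using action 1 costs an extra 0.5 (all numbers are made up).
P = np.array([[[0.8, 0.2], [0.8, 0.2]],
              [[0.2, 0.8], [0.2, 0.8]]])
c = np.array([[0.0, 1.0], [0.5, 1.5]])
policy, rho, V = policy_iteration(P, c)
```

In this example the iteration stops at the policy that always plays action 0, whose average cost under the stationary distribution (0.8, 0.2) is 0.2.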

Since \({\rho }_{{v}_{0}} < {M}^{{_\ast}}\) it follows that \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SSM}}\). The algorithm is well defined, provided \({v}_{k} \in {\mathfrak{U}}_{\mathrm{SSM}}\) for all \(k \in \mathbb{N}\). This follows from the next lemma which shows that \({\rho }_{{v}_{k+1}} \leq {\rho }_{{v}_{k}}\), and in particular that \({\rho }_{{v}_{k}} < {M}^{{_\ast}}\), for all \(k \in \mathbb{N}\).

Lemma 1.3.1.

Suppose \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) satisfies \({\rho }_{v} < {M}^{{_\ast}}\) . Let \(V \in {{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\) , p > 1, be the canonical solution to the Poisson equation

$${L}^{v}V + {c}_{ v} = {\rho }_{v}\,,\quad \text{ in }{\mathbb{R}}^{d}\,.$$

Then any measurable selector \(\hat{v}\) from the minimizer

$${Arg\,min _{u\in \mathbb{U}}}\left [{b}^{i}(x,u){\partial }_{ i}V (x) + c(x,u)\right ]$$

satisfies \({\rho }_{\hat{v}} \leq {\rho }_{v}\) . Moreover, the inequality is strict unless v satisfies

$${L}^{v}V (x) + {c}_{ v}(x) {=\min _{u\in \mathbb{U}}}\;\left [{L}^{u}V (x) + c(x,u)\right ] = {\rho }_{ v},\quad \text{ for almost all }x.$$
(1.11)

Proof.

Let \(\mathcal{V}\) be a Lyapunov function satisfying \({L}^{v}\mathcal{V}(x) \leq {k}_{0} - g(x)\), for some inf-compact g such that \({c}_{v} \in \mathfrak{o}(g)\) (see [1, Lemma 7.1]). For \(n \in \mathbb{N}\), define

$$\hat{{v}}_{n}(x) = \left \{\begin{array}{@{}l@{\quad }l@{}} \hat{v}(x)\quad &\text{ if }x \in {B}_{n} \\ v(x)\quad &\text{ if }x \in {B}_{n}^{c}\,. \end{array} \right.$$

Clearly, \(\hat{{v}}_{n} \rightarrow \hat{ v}\) as n → ∞ in the topology of Markov controls (see [1, Section 3.3]). It is evident that \(\mathcal{V}\) is a stochastic Lyapunov function relative to \(\hat{{v}}_{n}\), i.e., there exist constants k n such that \({L}^{\hat{{v}}_{n}}\mathcal{V}(x) \leq {k}_{n} - g(x)\), for all \(n \in \mathbb{N}\). Since \(V \in \mathfrak{o}(\mathcal{V})\), it follows that (see [1, Lemma 7.1])

$$\frac{1} {t} {\mathbb{E}}_{x}^{\hat{{v}}_{n} }[V ({X}_{t})]\mathop{\longrightarrow}\limits_{t \rightarrow \infty }^{}0\,.$$
(1.12)

Let

$$h(x) := {\rho }_{v} {-\min _{u\in \mathbb{U}}}\;\left [{L}^{u}V (x) + c(x,u)\right ]\,,\quad x \in {\mathbb{R}}^{d}\,.$$

Also, by definition of \(\hat{{v}}_{n}\), for all m ≤ n, we have

$${L}^{\hat{{v}}_{n} }V (x) + {c}_{\hat{{v}}_{n}}(x) \leq {\rho }_{v} - h(x)\,{\mathbb{I}}_{{B}_{m}}(x)\,.$$
(1.13)

By Itô’s formula we obtain from (1.13) that

$$\begin{array}{rcl} & & \frac{1} {t}{\bigl ({\mathbb{E}}_{x}^{\hat{{v}}_{n} }[V ({X}_{t})] - V (x)\bigr )} + \frac{1} {t} {\mathbb{E}}_{x}^{\hat{{v}}_{n} }\left [{\int \nolimits \nolimits }_{0}^{t}{c}_{\hat{{ v}}_{n}}({X}_{s})\,\mathrm{d}s\right ] \\ & & \quad \leq {\rho }_{v} -\frac{1} {t} {\mathbb{E}}_{x}^{\hat{{v}}_{n} }\left [{\int \nolimits \nolimits }_{0}^{t}h({X}_{ s})\,{\mathbb{I}}_{{B}_{m}}({X}_{s})\,\mathrm{d}s\right ], \end{array}$$
(1.14)

for all m ≤ n. Taking limits in (1.14) as t → ∞ and using (1.12), we obtain

$${\rho }_{\hat{{v}}_{n}} \leq {\rho }_{v} -{\int \nolimits \nolimits }_{{\mathbb{R}}^{d}}h(x)\,{\mathbb{I}}_{{B}_{m}}(x)\,{\mu }_{\hat{{v}}_{n}}(\mathrm{d}x)\,.$$
(1.15)

Note that v↦ρ v is lower semicontinuous. Therefore, taking limits in (1.15) as n → ∞, we have

$${\rho }_{\hat{v}} \leq {\rho }_{v} -{ \limsup _{n\rightarrow \infty }}{\int \nolimits \nolimits }_{{\mathbb{R}}^{d}}h(x)\,{\mathbb{I}}_{{B}_{m}}(x)\,{\mu }_{\hat{{v}}_{n}}(\mathrm{d}x)\,.$$
(1.16)

Since c is near-monotone and \({\rho }_{\hat{{v}}_{n}} \leq {\rho }_{v} < {M}^{{_\ast}}\), there exist \(\hat{R} > 0\) and δ > 0 such that \({\mu }_{\hat{{v}}_{n}}({B}_{\hat{R}}) \geq \delta \) for all \(n \in \mathbb{N}\). Then, with \({\psi }_{\hat{{v}}_{n}}\) denoting the density of \({\mu }_{\hat{{v}}_{n}}\), Harnack's inequality [7, Theorem 8.20, p. 199] implies that for every \(R >\hat{ R}\) there exists a constant C H  = C H (R) such that, with | B R  | denoting the volume of \({B}_{R} \subset {\mathbb{R}}^{d}\),

$${ \inf _{{B}_{R}}}\;{\psi }_{\hat{{v}}_{n}} \geq \frac{\delta } {{C}_{H}\vert {B}_{R}\vert }\,,\quad \forall n \in \mathbb{N}.$$

By (1.16) this implies that \({\rho }_{\hat{v}} < {\rho }_{v}\) unless h = 0 a.e.

1.4 Convergence of the PIA

We start with the following lemma.

Lemma 1.4.2.

The sequence {V k } of the PIA has the following properties:

  1. (i)

    For some constant \({C}_{0} = {C}_{0}({\rho }_{{v}_{0}})\) we have \({\inf }_{{\mathbb{R}}^{d}}\;{V }_{k} > {C}_{0}\) for all k ≥ 0.

  2. (ii)

    Each V k attains its minimum on the compact set

    $$\mathcal{K}({\rho }_{{v}_{0}}) :={\bigl \{ x \in {\mathbb{R}}^{d} {:\min _{ u\in \mathbb{U}}}\;c(x,u) \leq {\rho }_{{v}_{0}}\bigr \}}\,.$$
  3. (iii)

    For any p > 1, there exists a constant \(\tilde{{C}}_{0} =\tilde{ {C}}_{0}(R,{\rho }_{{v}_{0}},p)\) such that

    $${\bigl \lVert {V}_{k}\bigr \rVert }_{{\mathcal{W}}^{2,p}({B}_{R})} \leq \tilde{ {C}}_{0}\qquad \forall R > 0\,.$$
  4. (iv)

    There exist positive numbers α k and β k , k ≥ 0, such that α k ↓ 1 and β k ↓ 0 as k →∞ and

    $${\alpha }_{k+1}{V }_{k+1}(x) + {\beta }_{k+1} \leq {\alpha }_{k}{V }_{k}(x) + {\beta }_{k}\quad \forall x \in {\mathbb{R}}^{d}\,,\ \forall k \geq 0\,.$$

Proof.

Parts (i) and (ii) follow directly from [3, Lemmas 3.6.1 and 3.6.4].

For part (iii) note first that the near-monotone assumption implies that

$${\mu }_{{v}_{k}}\left (\mathcal{K}{\Bigl (\tfrac{{M}^{{_\ast}}+{\rho }_{{ v}_{k}}} {2} \Bigr )}\right ) \geq \frac{{M}^{{_\ast}}- {\rho }_{{v}_{k}}} {{M}^{{_\ast}} + {\rho }_{{v}_{k}}}\qquad \forall k \geq 0\,.$$

Consequently,

$${\mu }_{{v}_{k}}\left (\mathcal{K}{\Bigl (\tfrac{{M}^{{_\ast}}+{\rho }_{{ v}_{0}}} {2} \Bigr )}\right ) \geq \frac{{M}^{{_\ast}}- {\rho }_{{v}_{0}}} {{M}^{{_\ast}} + {\rho }_{{v}_{0}}} \qquad \forall k \geq 0\,.$$

Hence, since \({J}_{\alpha }^{{v}_{k}} - {J}_{ \alpha }^{{v}_{k}}(0) \rightarrow {V }_{ k}\) as \(\alpha \downarrow 0\), uniformly on compact subsets of \({\mathbb{R}}^{d}\) and weakly in \({\mathcal{W}}^{2,p}({B}_{R})\) for any R > 0, (iii) follows from [3, Theorem 3.7.4].

Part (iv) follows as in [11, Theorem 4.4].

As the corollary below shows, the PIA always converges.

Corollary 1.4.1.

There exist a constant \(\hat{\rho }\) and a function \(\hat{V } \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) with \(\hat{V }(0) = 0\) , such that, as k →∞, \({\rho }_{{v}_{k}} \downarrow \hat{ \rho }\) and \({V }_{k} \rightarrow \hat{ V }\) weakly in \({\mathcal{W}}^{2,p}({B}_{R})\) , p > 1, for any R > 0. Moreover, \((\hat{V },\hat{\rho })\) satisfies the HJB equation

$${ \min _{u\in \mathbb{U}}}\;{\bigl [{L}^{u}\hat{V }(x) + c(x,u)\bigr ]} =\hat{ \rho }\,,\quad x \in {\mathbb{R}}^{d}\,.$$
(1.17)

Proof.

By Lemma 1.3.1, \({\rho }_{{v}_{k}}\) decreases monotonically in k and hence converges to some \(\hat{\rho } \geq {\rho }^{{_\ast}}\). By Lemma 1.4.2 (iii), the sequence V k is weakly compact in \({\mathcal{W}}^{2,p}({B}_{R})\), p > 1, for any R > 0, while by Lemma 1.4.2 (iv), any weakly convergent subsequence has the same limit \(\hat{V }\). Also, repeating the argument in the proof of Lemma 1.3.1, with

$${h}_{k}(x) := {\rho }_{{v}_{k-1}} {-\min _{u\in \mathbb{U}}}\;\left [{L}^{u}{V }_{ k-1}(x) + c(x,u)\right ]\,,\quad x \in {\mathbb{R}}^{d},$$

we deduce that for any R > 0 there exists some constant K(R) such that

$${\int \nolimits \nolimits }_{{B}_{R}}{h}_{k}(x)\,\mathrm{d}x \leq K(R){\bigl ({\rho }_{{v}_{k-1}} - {\rho }_{{v}_{k}}\bigr )}\qquad \forall k \in \mathbb{N}\,.$$

Therefore, h k  → 0 weakly in \({\mathcal{L}}^{1}(D)\) as k → ∞ for any bounded domain D. Taking limits in the equation

$${ \min _{u\in \mathbb{U}}}\;\left [{L}^{u}{V }_{ k-1}(x) + c(x,u)\right ] = {\rho }_{{v}_{k-1}} - {h}_{k}(x)$$

and using [3, Lemma 3.5.4] yields (1.17).

We call \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) an equilibrium of the PIA if it satisfies ρ v  < M  ∗  and

$${ \min _{u\in \mathbb{U}}}\;{\bigl [{L}^{u}{\Psi }^{v}(x) + c(x,u)\bigr ]} = {\rho }_{ v}\,,\quad x \in {\mathbb{R}}^{d}.$$
(1.18)

For one-dimensional diffusions, one can show that (1.18) has a unique solution, and hence this solution is optimal, with ρ v  = ρ ∗ . For higher dimensions, to the best of our knowledge, no such result is available. There is also the possibility that the PIA converges to some \(\hat{v} \in {\mathfrak{U}}_{\mathrm{SSM}}\) which is not an equilibrium. This happens if the limit in (1.17) satisfies

$${L}^{\hat{v}}\hat{V }(x) + {c}_{\hat{ v}}(x) {=\min _{u\in \mathbb{U}}}\;{\bigl [{L}^{u}\hat{V }(x) + c(x,u)\bigr ]} =\hat{ \rho } > {\rho }_{\hat{ v}},\quad x \in {\mathbb{R}}^{d}.$$
(1.19)

This is in fact the case with the example in [4]. In this example the controlled diffusion takes the form \(\mathrm{d}{X}_{t} = {U}_{t}\,\mathrm{d}t +\mathrm{ d}{W}_{t}\), with \(\mathbb{U} = [-1,1]\) and running cost \(c(x) = 1 -\mathrm{ {e}}^{-\vert x\vert }\). If we define

$${\xi }_{\rho } :=\log \frac{3} {2} +\log (1 - \rho )\,,\quad \rho \in \left [1/3,1\right )$$

and

$${V }_{\rho }(x) := 2{\int \nolimits \nolimits }_{-\infty }^{x}\mathrm{{e}}^{2\vert y-{\xi }_{\rho }\vert }\mathrm{d}y{\int \nolimits \nolimits }_{-\infty }^{y}\mathrm{{e}}^{-2\vert z-{\xi }_{\rho }\vert }{\bigl (\rho - c(z)\bigr )}\,\mathrm{d}z,\quad x \in \mathbb{R},$$

then direct computation shows that

$$\tfrac{1} {2}{V}_{\rho }^{\prime\prime}(x) -\vert {V}_{\rho }^{\prime}(x)\vert + c(x) = \rho \qquad \forall \rho \in \left [1/3,1\right )\,,$$

and so the pair (V ρ , ρ) satisfies the HJB equation. The stationary Markov control corresponding to this solution is \({w}_{\rho }(x) = -\mathrm{sign}(x - {\xi }_{\rho })\). The controlled process under w ρ has invariant probability density \({\varphi }_{\rho }(x) =\mathrm{ {e}}^{-2\vert x-{\xi }_{\rho }\vert }\). A simple computation shows that

$${\int \nolimits \nolimits }_{-\infty }^{\infty }c(x){\varphi }_{ \rho }(x)\,\mathrm{d}x = \rho -\tfrac{1} {2}(1 - \rho )(3\rho - 1) < \rho \,,\quad \forall \rho \in \left (1/3,1\right )\,.$$

Thus, if ρ > 1 ∕ 3, then V ρ is not a canonical solution of the Poisson equation corresponding to the stable control w ρ. Therefore, this example satisfies (1.19) and shows that in general we cannot preclude the possibility that the limiting value of the PIA is not an equilibrium of the algorithm.
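The strict inequality above is easy to confirm numerically: the sketch below integrates c ⋅ φ ρ on a truncated grid (the truncation radius and mesh are arbitrary numerical choices of ours) and checks that the average cost under w ρ falls below ρ.

```python
import numpy as np

def avg_cost(rho, R=30.0, n=600001):
    """Riemann-sum approximation of the average cost of w_rho, i.e. of the
    integral of c(x) * phi_rho(x) over the real line (truncated to [-R, R])."""
    x = np.linspace(-R, R, n)
    xi = np.log(1.5) + np.log(1.0 - rho)        # the threshold xi_rho
    c = 1.0 - np.exp(-np.abs(x))                # running cost
    phi = np.exp(-2.0 * np.abs(x - xi))         # invariant density under w_rho
    return np.sum(c * phi) * (x[1] - x[0])

# rho_{w_rho} < rho for every rho in (1/3, 1), so V_rho cannot be the
# canonical solution of the Poisson equation for w_rho when rho > 1/3.
gaps = [rho - avg_cost(rho) for rho in (0.4, 0.6, 0.8)]
```

At ρ = 1 ∕ 3 the gap closes, consistent with the boundary case ξ 1 ∕ 3 = 0.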

In [11, Theorem 5.2], a blanket Lyapunov condition is imposed to guarantee convergence of the PIA to an optimal control. Instead, we use Lyapunov analysis to characterize the domain of attraction of the optimal value.

We need the following definition.

Definition 1.4.2.

Let v  ∗  be an optimal control as characterized in Theorem 1.3.1. Let \(\mathfrak{V}\) denote the class of all nonnegative functions \(\mathcal{V}\in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) satisfying \({L}^{{v}^{{_\ast}} }\mathcal{V}(x) \leq {k}_{0} - h(x)\) for some nonnegative, inf-compact \(h \in \mathcal{C}({\mathbb{R}}^{d})\) and a constant k 0. We denote by \(\mathfrak{o}(\mathfrak{V})\) the class of inf-compact functions g satisfying \(g \in \mathfrak{o}(\mathcal{V})\) for some \(\mathcal{V}\in \mathfrak{V}\).

The theorem below asserts that if the PIA is initialized at a \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SSM}}\) whose associated canonical solution to the Poisson equation lies in \(\mathfrak{o}(\mathfrak{V})\), then the associated average costs converge to the optimal value \({\rho }^{{_\ast}}\).

Theorem 1.4.2.

If \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SSM}}\) satisfies \({\Psi }^{{v}_{0}} \in \mathfrak{o}(\mathfrak{V})\) , then \({\rho }_{{v}_{k}} \rightarrow {\rho }^{{_\ast}}\) as k →∞.

Proof.

The proof is straightforward. By Lemma 1.4.2 (iv), \(\hat{V } \in \mathfrak{o}(\mathfrak{V})\). Also by (1.17), we have

$${L}^{{v}^{{_\ast}} }\hat{V }(x) + {c}_{{v}^{{_\ast}}}(x) \geq \hat{ \rho }\,,\quad x \in {\mathbb{R}}^{d}\,,$$

and applying Dynkin’s formula, we obtain

$$\frac{1} {t}{\bigl ({\mathbb{E}}_{x}^{{v}^{{_\ast}} }{\bigl [\hat{V }({X}_{t})\bigr ]} -\hat{ V }(x)\bigr )} + \frac{1} {t} {\mathbb{E}}_{x}^{{v}^{{_\ast}} }\left [{\int \nolimits \nolimits }_{0}^{t}{c}_{{ v}^{{_\ast}}}({X}_{s})\,\mathrm{d}s\right ] \geq \hat{ \rho }\,.$$
(1.20)

Since \(\hat{V } \in \mathfrak{o}(\mathfrak{V})\), by [1, Lemma 7.1] we have

$$\frac{1} {t} {\mathbb{E}}_{x}^{{v}^{{_\ast}} }{\bigl [\hat{V }({X}_{t})\bigr ]}\mathop{\longrightarrow}\limits_{t \rightarrow \infty }^{}0$$

and thus taking limits as t → ∞ in (1.20), we obtain \({\rho }^{{_\ast}}\geq \hat{ \rho }\). Therefore, we must have \(\hat{\rho } = {\rho }^{{_\ast}}\).

1.5 Concluding Remarks

We have concentrated on the model of controlled diffusions with near-monotone running costs. The case of stable controls with a blanket Lyapunov condition is much simpler. If, for example, we impose the assumption that there exist a constant k 0 > 0 and a pair of nonnegative, inf-compact functions \((\mathcal{V},h) \in {\mathcal{C}}^{2}({\mathbb{R}}^{d}) \times \mathcal{C}({\mathbb{R}}^{d} \times \mathbb{U})\) satisfying \(1 + c \in \mathfrak{o}(h)\) and such that

$${L}^{u}\mathcal{V}(x) \leq {k}_{ 0} - h(x,u)\qquad \forall (x,u) \in {\mathbb{R}}^{d} \times \mathbb{U}\,,$$

then the PIA always converges to the optimal solution.