Abstract
The subject of this chapter is the policy iteration algorithm for nondegenerate controlled diffusions. The results parallel those in Meyn (IEEE Trans. Automat. Control 42:1663–1680, 1997) for discrete-time controlled Markov chains. The model in that work uses norm-like running costs, while we opt for the milder assumption of near-monotone costs. Also, instead of employing a blanket Lyapunov stability hypothesis, we provide a characterization of the region of attraction of the optimal control.
Keywords
- Policy Iteration Algorithm (PIA)
- Controlled Diffusions
- Ergodicity Criterion
- Discrete-time Controlled Markov Chains
- Stationary Markov Control
Dedicated to Onésimo Hernández-Lerma on the occasion of his 65th birthday.
1.1 Introduction
The policy iteration algorithm (PIA) for controlled Markov chains has been known since the fundamental work of Howard [2]. For controlled Markov chains on Borel state spaces, most studies of the PIA rely on blanket Lyapunov conditions [9]. A study of the PIA that treats the model of near-monotone costs can be found in [11], some ideas of which we follow closely. An analysis of the PIA for piecewise deterministic Markov processes has appeared in [6].
In this chapter we study the PIA for controlled diffusion processes X = { X t , t ≥ 0}, taking values in the d-dimensional Euclidean space \({\mathbb{R}}^{d}\), and governed by the Itô stochastic differential equation
$$\mathrm{d}{X}_{t} = b({X}_{t},{U}_{t})\,\mathrm{d}t +{ \sigma }({X}_{t})\,\mathrm{d}{W}_{t}\,.\qquad (1.1)$$
All random processes in (1.1) live in a complete probability space \((\Omega,\mathfrak{F}, \mathbb{P})\). The process W is a d-dimensional standard Wiener process independent of the initial condition X 0. The control process U takes values in a compact, metrizable set \(\mathbb{U}\), and U t (ω) is jointly measurable in (t, ω) ∈ [0, ∞) ×Ω. Moreover, it is nonanticipative: for s < t, W t − W s is independent of \({\mathfrak{F}}_{s} :=\) the completion of \(\sigma \{{X}_{0},{U}_{r},{W}_{r} : r \leq s\}\).
Such a process U is called an admissible control, and we let \(\mathfrak{U}\) denote the set of all admissible controls.
We impose the following standard assumptions on the drift b and the diffusion matrix σ to guarantee existence and uniqueness of solutions to (1.1).
-
(A1)
Local Lipschitz continuity: The functions
$$b ={\bigl [ {b}^{1},\ldots,{b}^{d}\bigr ]}\mathsf{T} : {\mathbb{R}}^{d} \times \mathbb{U}\mapsto {\mathbb{R}}^{d}\quad \text{ and}\quad { \sigma } ={\bigl [{ { \sigma }}^{ij}\bigr ]} : {\mathbb{R}}^{d}\mapsto {\mathbb{R}}^{d\times d}$$are locally Lipschitz in x with a Lipschitz constant K R depending on R > 0. In other words, if B R denotes the open ball of radius R centered at the origin in \({\mathbb{R}}^{d}\), then for all x, y ∈ B R and \(u \in \mathbb{U}\),
$$\vert b(x,u) - b(y,u)\vert +\Vert { \sigma }(x) -{ \sigma }(y)\Vert \leq {K}_{R}\vert x - y\vert \,,$$where ∥ σ ∥ 2 : = traceσσT.
-
(A2)
Affine growth condition:b and σ satisfy a global growth condition of the form
$$\vert b{(x,u)\vert }^{2} +\Vert { \sigma }{(x)\Vert }^{2} \leq {K}_{ 1}{\bigl (1 +\vert {x\vert }^{2}\bigr )}\,,\quad \forall (x,u) \in {\mathbb{R}}^{d} \times \mathbb{U}\,.$$ -
(A3)
Local nondegeneracy: For each R > 0, there exists a positive constant κ R such that
$${\sum \limits _{i,j=1}^{d}}{a}^{ij}(x){\xi }_{ i}{\xi }_{j} \geq {\kappa }_{R}\vert {\xi \vert }^{2}\,,\quad \forall x \in {B}_{ R}\,,$$for all \(\xi = ({\xi }_{1},\ldots,{\xi }_{d}) \in {\mathbb{R}}^{d}\), where \(a := \frac{1} {2}{ \sigma }\,{ \sigma }\mathsf{T}\).
We also assume that b is continuous in (x, u).
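As a quick numerical illustration of (A3), with a hypothetical diffusion matrix σ of our own choosing (not one from the chapter), one can estimate κ R by minimizing the smallest eigenvalue of \(a(x) = \frac{1}{2}\sigma(x)\sigma(x)^{\mathsf{T}}\) over a grid covering B R :

```python
import numpy as np

# Hypothetical sigma in d = 2: isotropic, state-dependent, and nonsingular
# for every x.  This is an illustrative choice, not from the chapter.
def sigma(x):
    return (1.0 + 0.5 * np.sin(np.linalg.norm(x))) * np.eye(2)

def a(x):
    """a(x) = (1/2) sigma(x) sigma(x)^T."""
    s = sigma(x)
    return 0.5 * s @ s.T

# Empirical kappa_R: smallest eigenvalue of a(x) over a grid of the ball B_R.
R = 2.0
pts = np.linspace(-R, R, 41)
grid = [np.array([u, v]) for u in pts for v in pts if u * u + v * v <= R * R]
kappa_R = min(np.linalg.eigvalsh(a(x)).min() for x in grid)
print(kappa_R)  # strictly positive, so (A3) holds on B_R for this sigma
```

For this σ the minimum is attained at the origin, where a(0) = ½ I, so the estimate equals 1 ∕ 2.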
In integral form, (1.1) is written as
$${X}_{t} = {X}_{0} +{ \int \nolimits }_{0}^{t}b({X}_{s},{U}_{s})\,\mathrm{d}s +{ \int \nolimits }_{0}^{t}{ \sigma }({X}_{s})\,\mathrm{d}{W}_{s}\,.\qquad (1.2)$$
The second term on the right-hand side of (1.2) is an Itô stochastic integral. We say that a process X = { X t (ω)} is a solution of (1.1) if it is \({\mathfrak{F}}_{t}\)-adapted, continuous in t, defined for all ω ∈ Ω and t ∈ [0, ∞), and satisfies (1.2) for all t ∈ [0, ∞) at once a.s.
With \(u \in \mathbb{U}\) treated as a parameter, we define the family of operators \({L}^{u} : {\mathcal{C}}^{2}({\mathbb{R}}^{d})\mapsto \mathcal{C}({\mathbb{R}}^{d})\) by
$${L}^{u}f(x) = {a}^{ij}(x)\,{\partial }_{ij}f(x) + {b}^{i}(x,u)\,{\partial }_{i}f(x)\,,\qquad (1.3)$$where repeated indices are summed from 1 through d (see Sect. 1.2).
We refer to L u as the controlled extended generator of the diffusion.
Of fundamental importance in the study of functionals of X is Itô’s formula. For \(f \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) and with L u as defined in (1.3),
$$f({X}_{t}) = f({X}_{0}) +{ \int \nolimits }_{0}^{t}{L}^{{U}_{s}}f({X}_{s})\,\mathrm{d}s + {M}_{t}\,,\qquad (1.4)$$
where
$${M}_{t} :={ \int \nolimits }_{0}^{t}{\bigl \langle \nabla f({X}_{s}),{ \sigma }({X}_{s})\,\mathrm{d}{W}_{s}\bigr \rangle }$$
is a local martingale. Krylov’s extension of Itô’s formula [10, p. 122] shows that (1.4) remains valid for functions f in the Sobolev space \({{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\).
Recall that a control is called stationary Markov if U t = v(X t ) for a measurable map \(v : {\mathbb{R}}^{d}\mapsto \mathbb{U}\). Correspondingly, the equation
$$\mathrm{d}{X}_{t} = b{\bigl ({X}_{t},v({X}_{t})\bigr )}\,\mathrm{d}t +{ \sigma }({X}_{t})\,\mathrm{d}{W}_{t}\qquad (1.5)$$
is said to have a strong solution if given a Wiener process \(({W}_{t},{\mathfrak{F}}_{t})\) on a complete probability space \((\Omega,\mathfrak{F}, \mathbb{P})\), there exists a process X on \((\Omega,\mathfrak{F}, \mathbb{P})\), with \({X}_{0} = {x}_{0} \in {\mathbb{R}}^{d}\), which is continuous, \({\mathfrak{F}}_{t}\)-adapted, and satisfies (1.5) for all t at once, a.s. A strong solution is called unique, if any two such solutions X and X′ agree \(\mathbb{P}\)-a.s. when viewed as elements of \(\mathcal{C}{\bigl ([0,\infty ), {\mathbb{R}}^{d}\bigr )}\). It is well known that under Assumptions (A1)–(A3), for any stationary Markov control v, (1.5) has a unique strong solution [8].
Let \({\mathfrak{U}}_{\mathrm{SM}}\) denote the set of stationary Markov controls. Under \(v \in {\mathfrak{U}}_{\mathrm{SM}}\), the process X is strong Markov, and we denote its transition function by P v(t, x, ⋅). It also follows from the work of [5, 12] that under \(v \in {\mathfrak{U}}_{\mathrm{SM}}\), the transition probabilities of X have densities which are locally Hölder continuous. Thus L v defined by
$${L}^{v}f(x) = {a}^{ij}(x)\,{\partial }_{ij}f(x) + {b}^{i}{\bigl (x,v(x)\bigr )}\,{\partial }_{i}f(x)$$
for \(f \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) is the generator of a strongly continuous semigroup on \({\mathcal{C}}_{b}({\mathbb{R}}^{d})\), which is strong Feller. We let \({\mathbb{P}}_{x}^{v}\) denote the probability measure and \({\mathbb{E}}_{x}^{v}\) the expectation operator on the canonical space of the process under the control \(v \in {\mathfrak{U}}_{\mathrm{SM}}\), conditioned on the process X starting from \(x \in {\mathbb{R}}^{d}\) at t = 0.
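A strong solution under a stationary Markov control can be approximated numerically by the Euler–Maruyama scheme. The sketch below is our own illustration in d = 1; the drift b, diffusion σ, and control v passed in are hypothetical choices, not taken from the chapter.

```python
import numpy as np

# Euler-Maruyama discretization of (1.5) in d = 1 under a stationary
# Markov control U_t = v(X_t).
def simulate(b, sigma, v, x0, T=10.0, dt=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n = int(round(T / dt))
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(n):
        u = v(x[k])                          # Markov control: U_t = v(X_t)
        dw = rng.normal(0.0, np.sqrt(dt))    # Wiener increment over [t, t+dt]
        x[k + 1] = x[k] + b(x[k], u) * dt + sigma(x[k]) * dw
    return x

# Example: dX_t = U_t dt + dW_t with v(x) = -sign(x), i.e., drift toward 0.
path = simulate(lambda x, u: u, lambda x: 1.0, lambda x: -np.sign(x), x0=5.0)
```

Started at x 0 = 5, the path drifts toward the origin and then fluctuates around it, as expected for this stabilizing control.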
In Sect. 1.2 we define our notation. Section 1.3 reviews the ergodic control problem for near-monotone costs and the basic properties of the PIA. Section 1.4 is dedicated to the convergence of the algorithm.
1.2 Notation
The standard Euclidean norm in \({\mathbb{R}}^{d}\) is denoted by | ⋅ | , and ⟨ ⋅, ⋅⟩ stands for the inner product. The set of nonnegative real numbers is denoted by \({\mathbb{R}}_{+}\), \(\mathbb{N}\) stands for the set of natural numbers, and \(\mathbb{I}\) denotes the indicator function. We denote by τ(A) the first exit time of the process {X t } from the set \(A \subset {\mathbb{R}}^{d}\), defined by
$$\tau (A) :=\inf \{ t > 0 : {X}_{t}\notin A\}\,.$$
The open ball of radius R in \({\mathbb{R}}^{d}\), centered at the origin, is denoted by B R , and we let τ R : = τ(B R ) and \(\breve{\tau }_{R} := \tau ({B}_{R}^{c})\).
The term domain in \({\mathbb{R}}^{d}\) refers to a nonempty, connected open subset of the Euclidean space \({\mathbb{R}}^{d}\). We introduce the following notation for spaces of real-valued functions on a domain \(D \subset {\mathbb{R}}^{d}\). The space \({\mathcal{L}}^{p}(D)\), p ∈ [1, ∞), stands for the Banach space of (equivalence classes) of measurable functions f satisfying ∫ D | f(x) | p dx < ∞, and \({\mathcal{L}}^{\infty }(D)\) is the Banach space of functions that are essentially bounded in D. The space \({\mathcal{C}}^{k}(D)\) (\({\mathcal{C}}^{\infty }(D)\)) refers to the class of all functions whose partial derivatives up to order k (of any order) exist and are continuous, and \({\mathcal{C}}_{c}^{k}(D)\) is the space of functions in \({\mathcal{C}}^{k}(D)\) with compact support. The standard Sobolev space of functions on D whose generalized derivatives up to order k are in \({\mathcal{L}}^{p}(D)\), equipped with its natural norm, is denoted by \({\mathcal{W}}^{k,p}(D)\), k ≥ 0, p ≥ 1.
In general if \(\mathcal{X}\) is a space of real-valued functions on D, \({\mathcal{X}}_{\mathrm{loc}}\) consists of all functions f such that \(f\varphi \in \mathcal{X}\) for every \(\varphi \in {\mathcal{C}}_{c}^{\infty }(D)\). In this manner we obtain the spaces \({\mathcal{L}}_{\mathrm{loc}}^{p}(D)\) and \({{\mathcal{W}}_{\mathrm{loc}}}^{2,p}(D)\).
Let \(h \in \mathcal{C}({\mathbb{R}}^{d})\) be a positive function. We denote by \(\mathcal{O}(h)\) the set of functions \(f \in \mathcal{C}({\mathbb{R}}^{d})\) having the property
$$\limsup _{\vert x\vert \rightarrow \infty }\;\frac{\vert f(x)\vert }{h(x)} < \infty \,,\qquad (1.6)$$
and by \(\mathfrak{o}(h)\) the subset of \(\mathcal{O}(h)\) over which the limit in (1.6) is zero.
We adopt the notation \({\partial }_{i} := \tfrac{\partial \ } {\partial {x}_{i}}\) and \({\partial }_{ij} := \tfrac{{\partial }^{2}\ } {\partial {x}_{i}\partial {x}_{j}}\). We often use the standard summation rule that repeated subscripts and superscripts are summed from 1 through d. For example,
$${a}^{ij}\,{\partial }_{ij}f =\sum \limits _{i,j=1}^{d}{a}^{ij}\, \frac{{\partial }^{2}f} {\partial {x}_{i}\partial {x}_{j}}\,.$$
1.3 Ergodic Control and the PIA
Let \(c: {\mathbb{R}}^{d} \times \mathbb{U} \rightarrow \mathbb{R}\) be a continuous function bounded from below. As is well known, the ergodic control problem, in its almost sure (or pathwise) formulation, seeks to a.s. minimize over all admissible \(U \in \mathfrak{U}\) the functional
$$\limsup _{T\rightarrow \infty }\; \frac{1} {T}{\int \nolimits }_{0}^{T}c({X}_{t},{U}_{t})\,\mathrm{d}t\,.\qquad (1.7)$$
A weaker, average formulation seeks to minimize
$$\limsup _{T\rightarrow \infty }\; \frac{1} {T}\,\mathbb{E}\left [{\int \nolimits }_{0}^{T}c({X}_{t},{U}_{t})\,\mathrm{d}t\right ].\qquad (1.8)$$
We let ρ ∗ denote the infimum of (1.8) over all admissible controls. We assume that ρ ∗ < ∞.
We assume that the cost function \(c: {\mathbb{R}}^{d} \times \mathbb{U} \rightarrow {\mathbb{R}}_{+}\) is continuous and locally Lipschitz in its first argument uniformly in \(u \in \mathbb{U}\). More specifically, for some function \({K}_{c}: {\mathbb{R}}_{+} \rightarrow {\mathbb{R}}_{+}\),
$$\vert c(x,u) - c(y,u)\vert \leq {K}_{c}(R)\,\vert x - y\vert \qquad \forall x,y \in {B}_{R}\,,\ \forall u \in \mathbb{U}\,,$$and all R > 0.
An important class of running cost functions arising in practice for which the ergodic control problem is well behaved is the near-monotone cost functions. Let \({M}^{{_\ast}}\in {\mathbb{R}}_{+} \cup \{\infty \}\) be defined by
$${M}^{{_\ast}} :=\liminf _{\vert x\vert \rightarrow \infty }\;\min _{u\in \mathbb{U}}\;c(x,u)\,.$$
The running cost function c is called near-monotone if ρ ∗ < M ∗ . Note that inf-compact functions c are always near-monotone.
We adopt the following abbreviated notation. For a function \(g : {\mathbb{R}}^{d} \times \mathbb{U} \rightarrow \mathbb{R}\) and \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) we let
$${g}_{v}(x) := g{\bigl (x,v(x)\bigr )}\,,\qquad x \in {\mathbb{R}}^{d}\,.$$
The ergodic control problem for near-monotone cost functions is characterized as follows.
Theorem 1.3.1.
There exists a unique function \(V \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) which is bounded below in \({\mathbb{R}}^{d}\) and satisfies V (0) = 0 and the Hamilton–Jacobi–Bellman (HJB) equation
$$\min _{u\in \mathbb{U}}\;{\bigl [{L}^{u}V (x) + c(x,u)\bigr ]} = {\rho }^{{_\ast}}\,,\qquad x \in {\mathbb{R}}^{d}\,.$$
The control \({v}^{{_\ast}}\in {\mathfrak{U}}_{\mathrm{SM}}\) is optimal with respect to the criteria (1.7) and (1.8) if and only if it satisfies
$${b}^{i}{\bigl (x,{v}^{{_\ast}}(x)\bigr )}\,{\partial }_{i}V (x) + c{\bigl (x,{v}^{{_\ast}}(x)\bigr )} =\min _{u\in \mathbb{U}}\;{\bigl [{b}^{i}(x,u)\,{\partial }_{i}V (x) + c(x,u)\bigr ]}$$
a.e. in \({\mathbb{R}}^{d}\). Moreover, with \(\breve{\tau }_{r} = \tau ({B}_{r}^{c})\), r > 0, we have
A control \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) is called stable if the associated diffusion is positive recurrent. We denote the set of such controls by \({\mathfrak{U}}_{\mathrm{SSM}}\). Also we let μ v denote the unique invariant probability measure on \({\mathbb{R}}^{d}\) for the diffusion under the control \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\). Recall that \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) if and only if there exists an inf-compact function \(\mathcal{V}\in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\), a bounded domain \(D \subset {\mathbb{R}}^{d}\), and a constant ε > 0 satisfying
$${L}^{v}\mathcal{V}(x) \leq -\varepsilon \qquad \forall x \in {D}^{c}\,.$$
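For a concrete instance of this stability criterion, take d = 1 with b(x) = −x and σ = 1 (so a = 1 ∕ 2), an illustrative choice not from the chapter. With \(\mathcal{V}(x) = x^{2}\) one has \({L}^{v}\mathcal{V}(x) = 1 - 2x^{2}\), which is at most −1 outside the bounded domain D = (−1, 1), so the criterion holds with ε = 1. A quick numerical confirmation:

```python
import numpy as np

# Drift-condition check for the illustrative example b(x) = -x, sigma = 1:
# with V(x) = x^2, L^v V(x) = a*V''(x) + b(x)*V'(x) = 1 - 2x^2.
def LV(x):
    a = 0.5
    b = -x
    return a * 2.0 + b * 2.0 * x

# sup of L^v V over the complement of D = (-1, 1), sampled on a grid
xs = np.linspace(-5.0, 5.0, 1001)
worst = LV(xs[np.abs(xs) >= 1.0]).max()
print(worst)  # about -1, so the criterion holds with eps = 1
```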
It follows that the optimal control \({v}^{{_\ast}}\) in Theorem 1.3.1 is stable. For \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) we define
$${\rho }_{v} :={ \int \nolimits }_{{\mathbb{R}}^{d}}{c}_{v}(x)\,{\mu }_{v}(\mathrm{d}x)\,.$$
A difficulty in synthesizing an optimal control \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) via the HJB equation lies in the fact that the optimal cost ρ ∗ is not known. The PIA provides an iterative procedure that yields the HJB equation in the limit. In order to describe the algorithm, we first need to review some properties of the Poisson equation
$${L}^{v}V + {c}_{v} = {\rho }_{v}\,.\qquad (1.10)$$
We need the following definition.
Definition 1.3.1.
For \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\), and provided ρ v < ∞, define
For \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) and α > 0, let J α v denote the α-discounted cost
$${J}_{\alpha }^{v}(x) :={ \mathbb{E}}_{x}^{v}\left [{\int \nolimits }_{0}^{\infty }\mathrm{{e}}^{-\alpha t}\,{c}_{v}({X}_{t})\,\mathrm{d}t\right ].$$
We borrow the following result from [1, Lemma 7.4]. If \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) and ρ v < ∞, then there exist a function \(V \in {{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\), for any p > 1, and a constant \(\rho \in \mathbb{R}\) which satisfy (1.10) a.e. in \({\mathbb{R}}^{d}\) and such that, as α ↓ 0, αJ α v(0) → ρ and J α v − J α v(0) → V uniformly on compact subsets of \({\mathbb{R}}^{d}\). Moreover,
We refer to the function \(V (x) = {\Psi }^{v}(x) \in {{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\) as the canonical solution of the Poisson equation \({L}^{v}V + {c}_{v} = {\rho }_{v}\) in \({\mathbb{R}}^{d}\).
It can be shown that the canonical solution V to the Poisson equation is the unique solution in \({{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\) which is bounded below and satisfies V (0) = 0. Note also that (1.9) implies that any control v satisfying ρ v < M ∗ is stable.
The PIA takes the following familiar form:
Algorithm (PIA).
-
1.
Initialization. Set k = 0 and select any \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SM}}\) such that \({\rho }_{{v}_{0}} < {M}^{{_\ast}}\).
-
2.
Value determination. Obtain the canonical solution \({V }_{k} = {\Psi }^{{v}_{k}} \in {{\mathcal{W}}_{\mathrm{ loc}}}^{2,p}({\mathbb{R}}^{d})\) , p > 1, to the Poisson equation
$${L}^{{v}_{k} }{V }_{k} + {c}_{{v}_{k}} = {\rho }_{{v}_{k}}$$in \({\mathbb{R}}^{d}\).
-
3.
Termination. If \({v}_{k}(x) \in { Arg\,min}_{u\in \mathbb{U}}\left [{b}^{i}(x,u){\partial }_{i}{V }_{k}(x) + c(x,u)\right ]\) x-a.e., return v k .
-
4.
Policy improvement. Select an arbitrary \({v}_{k+1} \in {\mathfrak{U}}_{\mathrm{SM}}\) which satisfies
$${v}_{k+1}(x) \in { Arg\,min _{u\in \mathbb{U}}}\;\left [{\sum \limits _{i=1}^{d}}{b}^{i}(x,u)\frac{\partial {V }_{k}} {\partial {x}_{i}} (x) + c(x,u)\right ]\,,\qquad x \in {\mathbb{R}}^{d}\,.$$
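The two alternating steps have a transparent linear-algebra form on a finite state space. The sketch below runs the average-cost analogue of the PIA on a finite controlled Markov chain; the grid model (a random walk whose drift is tilted by the action u, mimicking dX t = U t dt + dW t , with cost growing in the distance from a center state) is our own illustrative discretization, not a construction from the chapter.

```python
import numpy as np

def policy_iteration(P, c, max_iter=100):
    """Average-cost policy iteration on a finite MDP.

    P[u] is the transition matrix and c[u] the running-cost vector under
    action index u.  Value determination solves the discrete Poisson
    equation h + rho = c_v + P_v h (normalized by h[0] = 0); policy
    improvement minimizes c(x, u) + (P_u h)(x) pointwise, as in step 4.
    """
    n = P[0].shape[0]
    v = np.zeros(n, dtype=int)                   # initial policy v_0
    rho = np.inf
    for _ in range(max_iter):
        Pv = np.array([P[v[i]][i] for i in range(n)])
        cv = np.array([c[v[i]][i] for i in range(n)])
        A = np.eye(n) - Pv
        A[:, 0] = 1.0                            # unknowns: (rho, h[1:]), h[0] = 0
        sol = np.linalg.solve(A, cv)
        rho, h = sol[0], sol.copy()
        h[0] = 0.0
        q = np.array([c[u] + P[u] @ h for u in range(len(P))])
        v_new = v.copy()                         # keep current action on ties
        better = q[v, np.arange(n)] > q.min(axis=0) + 1e-12
        v_new[better] = q.argmin(axis=0)[better]
        if np.array_equal(v_new, v):
            return v, float(rho)
        v = v_new
    return v, float(rho)

# Grid chain on states 0..10 with "origin" at index 5: action u in {-1, +1}
# tilts the walk (up-probability 0.5 + 0.25*u), and the cost is |x - 5|.
n, actions = 11, (-1, +1)
P, c = [], []
for u in actions:
    M = np.zeros((n, n))
    pu = 0.5 + 0.25 * u
    for i in range(n):
        M[i, min(i + 1, n - 1)] += pu
        M[i, max(i - 1, 0)] += 1.0 - pu
    P.append(M)
    c.append(np.abs(np.arange(n) - 5).astype(float))
v_opt, rho_opt = policy_iteration(P, c)
```

As expected, the returned policy drifts toward the center state from both sides, and the average costs of the iterates decrease monotonically, in line with Lemma 1.3.1 below.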
Since \({\rho }_{{v}_{0}} < {M}^{{_\ast}}\) it follows that \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SSM}}\). The algorithm is well defined, provided \({v}_{k} \in {\mathfrak{U}}_{\mathrm{SSM}}\) for all \(k \in \mathbb{N}\). This follows from the next lemma which shows that \({\rho }_{{v}_{k+1}} \leq {\rho }_{{v}_{k}}\), and in particular that \({\rho }_{{v}_{k}} < {M}^{{_\ast}}\), for all \(k \in \mathbb{N}\).
Lemma 1.3.1.
Suppose \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) satisfies ρ v < M ∗ . Let \(V \in {{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\) , p > 1, be the canonical solution to the Poisson equation
$${L}^{v}V + {c}_{v} = {\rho }_{v}\,.$$
Then any measurable selector \(\hat{v}\) from the minimizer
$$\hat{v}(x) \in { Arg\,min _{u\in \mathbb{U}}}\;\left [{b}^{i}(x,u)\,{\partial }_{i}V (x) + c(x,u)\right ]$$
satisfies \({\rho }_{\hat{v}} \leq {\rho }_{v}\) . Moreover, the inequality is strict unless v is itself such a selector, i.e., unless
$${b}^{i}{\bigl (x,v(x)\bigr )}\,{\partial }_{i}V (x) + {c}_{v}(x) =\min _{u\in \mathbb{U}}\;{\bigl [{b}^{i}(x,u)\,{\partial }_{i}V (x) + c(x,u)\bigr ]}\quad \text{a.e.}$$
Proof.
Let \(\mathcal{V}\) be a Lyapunov function satisfying \({L}^{v}\mathcal{V}(x) \leq {k}_{0} - g(x)\), for some inf-compact g such that \({c}_{v} \in \mathfrak{o}(g)\) (see [1, Lemma 7.1]). For \(n \in \mathbb{N}\), define
Clearly, \(\hat{{v}}_{n} \rightarrow \hat{ v}\) as n → ∞ in the topology of Markov controls (see [1, Section 3.3]). It is evident that \(\mathcal{V}\) is a stochastic Lyapunov function relative to \(\hat{{v}}_{n}\), i.e., there exist constants k n such that \({L}^{\hat{{v}}_{n}}\mathcal{V}(x) \leq {k}_{n} - g(x)\), for all \(n \in \mathbb{N}\). Since \(V \in \mathfrak{o}(\mathcal{V})\), it follows that (see [1, Lemma 7.1])
Let
Also, by definition of \(\hat{{v}}_{n}\), for all m ≤ n, we have
By Itô’s formula we obtain from (1.13) that
for all m ≤ n. Taking limits in (1.14) as t → ∞ and using (1.12), we obtain
Note that v↦ρ v is lower semicontinuous. Therefore, taking limits in (1.15) as n → ∞, we have
Since c is near-monotone and \({\rho }_{\hat{{v}}_{n}} \leq {\rho }_{v} < {M}^{{_\ast}}\), there exist \(\hat{R} > 0\) and δ > 0 such that \({\mu }_{\hat{{v}}_{n}}({B}_{\hat{R}}) \geq \delta \) for all \(n \in \mathbb{N}\). Then, with \({\psi }_{\hat{{v}}_{n}}\) denoting the density of \({\mu }_{\hat{{v}}_{n}}\), Harnack’s inequality [7, Theorem 8.20, p. 199] implies that for every \(R >\hat{ R}\) there exists a constant C H = C H (R) such that, with | B R | denoting the volume of \({B}_{R} \subset {\mathbb{R}}^{d}\), it holds that
By (1.16) this implies that \({\rho }_{\hat{v}} < {\rho }_{v}\) unless h = 0 a.e.
1.4 Convergence of the PIA
We start with the following lemma.
Lemma 1.4.2.
The sequence {V k } of the PIA has the following properties:
-
(i)
For some constant \({C}_{0} = {C}_{0}({\rho }_{{v}_{0}})\) we have \({\inf }_{{\mathbb{R}}^{d}}\;{V }_{k} > {C}_{0}\) for all k ≥ 0.
-
(ii)
Each V k attains its minimum on the compact set
$$\mathcal{K}({\rho }_{{v}_{0}}) :={\bigl \{ x \in {\mathbb{R}}^{d} {:\min _{ u\in \mathbb{U}}}\;c(x,u) \leq {\rho }_{{v}_{0}}\bigr \}}\,.$$ -
(iii)
For any p > 1 and R > 0, there exists a constant \(\tilde{{C}}_{0} =\tilde{ {C}}_{0}(R,{\rho }_{{v}_{0}},p)\) such that
$${\bigl \lVert{V{ }_{k}\bigr \rVert}}_{{\mathcal{W}}^{2,p}({B}_{R})} \leq \tilde{ {C}}_{0}\qquad \forall k \geq 0\,.$$ -
(iv)
There exist positive numbers α k and β k , k ≥ 0, such that α k ↓ 1 and β k ↓ 0 as k →∞ and
$${\alpha }_{k+1}{V }_{k+1}(x) + {\beta }_{k+1} \leq {\alpha }_{k}{V }_{k}(x) + {\beta }_{k}\quad \forall x \in {\mathbb{R}}^{d}\,,\ \forall k \geq 0\,.$$
Proof.
Parts (i) and (ii) follow directly from [3, Lemmas 3.6.1 and 3.6.4].
For part (iii) note first that the near-monotone assumption implies that
Consequently,
uniformly on compact subsets of \({\mathbb{R}}^{d}\). Hence, since \({J}_{\alpha }^{{v}_{k}} - {J}_{ \alpha }^{{v}_{k}}(0) \rightarrow {V }_{ k}\) weakly in \({\mathcal{W}}^{2,p}({B}_{R})\) for any R > 0, (iii) follows from [3, Theorem 3.7.4].
Part (iv) follows as in [11, Theorem 4.4].
As the corollary below shows, the PIA always converges.
Corollary 1.4.1.
There exist a constant \(\hat{\rho }\) and a function \(\hat{V } \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) with \(\hat{V }(0) = 0\) , such that, as k →∞, \({\rho }_{{v}_{k}} \downarrow \hat{ \rho }\) and \({V }_{k} \rightarrow \hat{ V }\) weakly in \({\mathcal{W}}^{2,p}({B}_{R})\) , p > 1, for any R > 0. Moreover, \((\hat{V },\hat{\rho })\) satisfies the HJB equation
Proof.
By Lemma 1.3.1, \({\rho }_{{v}_{k}}\) decreases monotonically in k and hence converges to some \(\hat{\rho } \geq {\rho }^{{_\ast}}\). By Lemma 1.4.2 (iii), the sequence V k is weakly compact in \({\mathcal{W}}^{2,p}({B}_{R})\), p > 1, for any R > 0, while by Lemma 1.4.2 (iv), any weakly convergent subsequence has the same limit \(\hat{V }\). Also, repeating the argument in the proof of Lemma 1.3.1, with
we deduce that for any R > 0 there exists some constant K(R) such that
Therefore, h k → 0 weakly in \({\mathcal{L}}^{1}(D)\) as k → ∞ for any bounded domain D. Taking limits in the equation
and using [3, Lemma 3.5.4] yields (1.17).
It is evident that \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) is an equilibrium of the PIA if it satisfies ρ v < M ∗ and
For one-dimensional diffusions, one can show that (1.18) has a unique solution, which is therefore the optimal one, with ρ v = ρ ∗ . For higher dimensions, to the best of our knowledge, no such result is available. There is also the possibility that the PIA converges to some \(\hat{v} \in {\mathfrak{U}}_{\mathrm{SSM}}\) which is not an equilibrium. This happens if (1.17) satisfies
This is in fact the case with the example in [4]. In this example the controlled diffusion takes the form \(\mathrm{d}{X}_{t} = {U}_{t}\,\mathrm{d}t +\mathrm{ d}{W}_{t}\), with \(\mathbb{U} = [-1,1]\) and running cost \(c(x) = 1 -\mathrm{ {e}}^{-\vert x\vert }\). If we define
and
then direct computation shows that
and so the pair (V ρ, ρ) satisfies the HJB. The stationary Markov control corresponding to this solution of the HJB is \({w}_{\rho }(x) = -sign (x - {\xi }_{\rho })\). The controlled process under w ρ has invariant probability density \({\varphi }_{\rho }(x) =\mathrm{ {e}}^{-2\vert x-{\xi }_{\rho }\vert }\). A simple computation shows that
Thus, if ρ > 1 ∕ 3, then V ρ is not a canonical solution of the Poisson equation corresponding to the stable control w ρ. Therefore, this example satisfies (1.19) and shows that in general we cannot preclude the possibility that the limiting value of the PIA is not an equilibrium of the algorithm.
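The constant 1 ∕ 3 in this example can be checked numerically: \({\varphi }_{\rho }(x) =\mathrm{ {e}}^{-2\vert x-{\xi }_{\rho }\vert }\) is a probability density, and at ξ ρ = 0 the long-run average of c(x) = 1 − e − | x |  under w ρ equals 1 ∕ 3 (plain trapezoidal quadrature on a wide grid; the value is also available in closed form).

```python
import numpy as np

# Verify that phi(x) = exp(-2|x - xi|) integrates to 1, and that the
# average of c(x) = 1 - exp(-|x|) against phi at xi = 0 is 1/3.
x = np.linspace(-30.0, 30.0, 120001)

def trap(f):
    """Trapezoidal rule on the fixed grid x."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))

def avg_cost(xi):
    phi = np.exp(-2.0 * np.abs(x - xi))          # invariant density of w_rho
    return trap((1.0 - np.exp(-np.abs(x))) * phi)

mass = trap(np.exp(-2.0 * np.abs(x)))            # approximately 1
rho0 = avg_cost(0.0)                             # approximately 1/3
print(mass, rho0)
```

Moving ξ ρ away from the origin increases the average cost, consistent with 1 ∕ 3 being the threshold value.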
In [11, Theorem 5.2], a blanket Lyapunov condition is imposed to guarantee convergence of the PIA to an optimal control. Instead, we use Lyapunov analysis to characterize the domain of attraction of the optimal value.
We need the following definition.
Definition 1.4.2.
Let v ∗ be an optimal control as characterized in Theorem 1.3.1. Let \(\mathfrak{V}\) denote the class of all nonnegative functions \(\mathcal{V}\in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) satisfying \({L}^{{v}^{{_\ast}} }\mathcal{V}\leq {k}_{0} - h(x)\) for some nonnegative, inf-compact \(h \in \mathcal{C}({\mathbb{R}}^{d})\) and a constant k 0. We denote by \(\mathfrak{o}(\mathfrak{V})\) the class of inf-compact functions g satisfying \(g \in \mathfrak{o}(\mathcal{V})\) for some \(\mathcal{V}\in \mathfrak{V}\).
The theorem below asserts that if the PIA is initialized at a \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SSM}}\) whose associated canonical solution to the Poisson equation lies in \(\mathfrak{o}(\mathfrak{V})\), then it converges to an optimal \({v}^{{_\ast}}\in {\mathfrak{U}}_{\mathrm{SSM}}\).
Theorem 1.4.2.
If \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SSM}}\) satisfies \({\Psi }^{{v}_{0}} \in \mathfrak{o}(\mathfrak{V})\) , then \({\rho }_{{v}_{k}} \rightarrow {\rho }^{{_\ast}}\) as k →∞.
Proof.
The proof is straightforward. By Lemma 1.4.2 (iv), \(\hat{V } \in \mathfrak{o}(\mathfrak{V})\). Also by (1.17), we have
and applying Dynkin’s formula, we obtain
Since \(\hat{V } \in \mathfrak{o}(\mathfrak{V})\), by [1, Lemma 7.1] we have
and thus taking limits as t → ∞ in (1.20), we obtain \({\rho }^{{_\ast}}\geq \hat{ \rho }\). Therefore, we must have \(\hat{\rho } = {\rho }^{{_\ast}}\).
1.5 Concluding Remarks
We have concentrated on the model of controlled diffusions with near-monotone running costs. The case of stable controls with a blanket Lyapunov condition is much simpler. If, for example, we impose the assumption that there exist a constant k 0 > 0 and a pair of nonnegative, inf-compact functions \((\mathcal{V},h) \in {\mathcal{C}}^{2}({\mathbb{R}}^{d}) \times \mathcal{C}({\mathbb{R}}^{d})\) satisfying \(1 + c \in \mathfrak{o}(h)\) and such that
$${L}^{u}\mathcal{V}(x) \leq {k}_{0} - h(x)\qquad \forall (x,u) \in {\mathbb{R}}^{d} \times \mathbb{U}\,,$$
then the PIA always converges to the optimal solution.
References
Arapostathis, A., Borkar, V.S.: Uniform recurrence properties of controlled diffusions and applications to optimal control. SIAM J. Control Optim. 48(7), 152–160 (2010)
Arapostathis, A., Borkar, V.S., Fernández-Gaucherand, E., Ghosh, M.K., Marcus, S.I.: Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim. 31(2), 282–344 (1993)
Arapostathis, A., Borkar, V.S., Ghosh, M.K.: Ergodic control of diffusion processes, Encyclopedia of Mathematics and its Applications, vol. 143. Cambridge University Press, Cambridge (2011)
Bensoussan, A., Borkar, V.: Ergodic control problem for one-dimensional diffusions with near-monotone cost. Systems Control Lett. 5(2), 127–133 (1984)
Bogachev, V.I., Krylov, N.V., Röckner, M.: On regularity of transition probabilities and invariant measures of singular diffusions under minimal conditions. Comm. Partial Differential Equations 26(11–12), 2037–2080 (2001)
Costa, O.L.V., Dufour, F.: The policy iteration algorithm for average continuous control of piecewise deterministic Markov processes. Appl. Math. Optim. 62(2), 185–204 (2010)
Gilbarg, D., Trudinger, N.S.: Elliptic partial differential equations of second order, Grundlehren der Mathematischen Wissenschaften, vol. 224, second edn. Springer-Verlag, Berlin (1983)
Gyöngy, I., Krylov, N.: Existence of strong solutions for Itô’s stochastic equations via approximations. Probab. Theory Related Fields 105(2), 143–158 (1996)
Hernández-Lerma, O., Lasserre, J.B.: Policy iteration for average cost Markov control processes on Borel spaces. Acta Appl. Math. 47(2), 125–154 (1997)
Krylov, N.V.: Controlled diffusion processes, Applications of Mathematics, vol. 14. Springer-Verlag, New York (1980)
Meyn, S.P.: The policy iteration algorithm for average reward Markov decision processes with general state space. IEEE Trans. Automat. Control 42(12), 1663–1680 (1997)
Stannat, W.: (Nonsymmetric) Dirichlet operators on L 1: existence, uniqueness and associated Markov processes. Ann. Scuola Norm. Sup. Pisa Cl. Sci. (4) 28(1), 99–140 (1999)
Acknowledgement
This work was supported in part by the Office of Naval Research through the Electric Ship Research and Development Consortium.
© 2012 Springer Science+Business Media, LLC
Arapostathis, A. (2012). On the Policy Iteration Algorithm for Nondegenerate Controlled Diffusions Under the Ergodic Criterion. In: Hernández-Hernández, D., Minjárez-Sosa, J. (eds) Optimization, Control, and Applications of Stochastic Systems. Systems & Control: Foundations & Applications. Birkhäuser, Boston. https://doi.org/10.1007/978-0-8176-8337-5_1