Abstract
The subject of this chapter is the policy iteration algorithm for nondegenerate controlled diffusions. The results parallel those in Meyn (IEEE Trans. Automat. Control 42:1663–1680, 1997) for discrete-time controlled Markov chains. The model in that work uses norm-like running costs, while we opt for the milder assumption of near-monotone costs. Also, instead of employing a blanket Lyapunov stability hypothesis, we provide a characterization of the region of attraction of the optimal control.
Keywords
- Policy Iteration Algorithm (PIA)
- Controlled Diffusions
- Ergodicity Criterion
- Discrete-time Controlled Markov Chains
- Stationary Markov Control
Dedicated to Onésimo Hernández-Lerma on the occasion of his 65th birthday.
1.1 Introduction
The policy iteration algorithm (PIA) for controlled Markov chains has been known since the fundamental work of Howard [2]. For controlled Markov chains on Borel state spaces, most studies of the PIA rely on blanket Lyapunov conditions [9]. A study of the PIA that treats the model of near-monotone costs can be found in [11], some ideas of which we follow closely. An analysis of the PIA for piecewise deterministic Markov processes has appeared in [6].
In this chapter we study the PIA for controlled diffusion processes X = { X t , t ≥ 0}, taking values in the d-dimensional Euclidean space \({\mathbb{R}}^{d}\), and governed by the Itô stochastic differential equation
$$\mathrm{d}{X}_{t} = b({X}_{t},{U}_{t})\,\mathrm{d}t +{ \sigma }({X}_{t})\,\mathrm{d}{W}_{t}\,.\qquad (1.1)$$
All random processes in (1.1) live in a complete probability space \((\Omega,\mathfrak{F}, \mathbb{P})\). The process W is a d-dimensional standard Wiener process independent of the initial condition X 0. The control process U takes values in a compact, metrizable set \(\mathbb{U}\), and U t (ω) is jointly measurable in (t, ω) ∈ [0, ∞) ×Ω. Moreover, it is nonanticipative: for s < t, W t − W s is independent of \({\mathfrak{F}}_{s} :=\) the completion of \(\sigma \{{X}_{0},{U}_{r},{W}_{r} : r \leq s\}\).
Such a process U is called an admissible control, and we let \(\mathfrak{U}\) denote the set of all admissible controls.
We impose the following standard assumptions on the drift b and the diffusion matrix σ to guarantee existence and uniqueness of solutions to (1.1).
-
(A1)
Local Lipschitz continuity: The functions
$$b ={\bigl [ {b}^{1},\ldots,{b}^{d}\bigr ]}\mathsf{T} : {\mathbb{R}}^{d} \times \mathbb{U}\mapsto {\mathbb{R}}^{d}\quad \text{ and}\quad { \sigma } ={\bigl [{ { \sigma }}^{ij}\bigr ]} : {\mathbb{R}}^{d}\mapsto {\mathbb{R}}^{d\times d}$$are locally Lipschitz in x with a Lipschitz constant K R depending on R > 0. In other words, if B R denotes the open ball of radius R centered at the origin in \({\mathbb{R}}^{d}\), then for all x, y ∈ B R and \(u \in \mathbb{U}\),
$$\vert b(x,u) - b(y,u)\vert +\Vert { \sigma }(x) -{ \sigma }(y)\Vert \leq {K}_{R}\vert x - y\vert \,,$$where ∥ σ ∥ 2 : = traceσσT.
-
(A2)
Affine growth condition:b and σ satisfy a global growth condition of the form
$$\vert b{(x,u)\vert }^{2} +\Vert { \sigma }{(x)\Vert }^{2} \leq {K}_{ 1}{\bigl (1 +\vert {x\vert }^{2}\bigr )}\,,\quad \forall (x,u) \in {\mathbb{R}}^{d} \times \mathbb{U}\,.$$ -
(A3)
Local nondegeneracy: For each R > 0, there exists a positive constant κ R such that
$${\sum \limits _{i,j=1}^{d}}{a}^{ij}(x){\xi }_{ i}{\xi }_{j} \geq {\kappa }_{R}\vert {\xi \vert }^{2}\,,\quad \forall x \in {B}_{ R}\,,$$for all \(\xi = ({\xi }_{1},\ldots,{\xi }_{d}) \in {\mathbb{R}}^{d}\), where \(a := \frac{1} {2}{ \sigma }\,{ \sigma }\mathsf{T}\).
We also assume that b is continuous in (x, u).
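As a quick numerical illustration of (A3), with a hypothetical diffusion matrix σ of our own choosing (not one from the chapter), one can estimate κ R by minimizing the smallest eigenvalue of \(a(x) = \frac{1}{2}\sigma(x)\sigma(x)^{\mathsf{T}}\) over a grid covering B R :

```python
import numpy as np

# Hypothetical sigma in d = 2: isotropic, state-dependent, and nonsingular
# for every x.  This is an illustrative choice, not from the chapter.
def sigma(x):
    return (1.0 + 0.5 * np.sin(np.linalg.norm(x))) * np.eye(2)

def a(x):
    """a(x) = (1/2) sigma(x) sigma(x)^T."""
    s = sigma(x)
    return 0.5 * s @ s.T

# Empirical kappa_R: smallest eigenvalue of a(x) over a grid of the ball B_R.
R = 2.0
pts = np.linspace(-R, R, 41)
grid = [np.array([u, v]) for u in pts for v in pts if u * u + v * v <= R * R]
kappa_R = min(np.linalg.eigvalsh(a(x)).min() for x in grid)
print(kappa_R)  # strictly positive, so (A3) holds on B_R for this sigma
```

For this σ the minimum is attained at the origin, where a(0) = ½ I, so the estimate equals 1 ∕ 2.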
In integral form, (1.1) is written as
$${X}_{t} = {X}_{0} +{ \int \nolimits }_{0}^{t}b({X}_{s},{U}_{s})\,\mathrm{d}s +{ \int \nolimits }_{0}^{t}{ \sigma }({X}_{s})\,\mathrm{d}{W}_{s}\,.\qquad (1.2)$$
The second term on the right-hand side of (1.2) is an Itô stochastic integral. We say that a process X = { X t (ω)} is a solution of (1.1) if it is \({\mathfrak{F}}_{t}\)-adapted, continuous in t, defined for all ω ∈ Ω and t ∈ [0, ∞), and satisfies (1.2) for all t ∈ [0, ∞) at once a.s.
With \(u \in \mathbb{U}\) treated as a parameter, we define the family of operators \({L}^{u} : {\mathcal{C}}^{2}({\mathbb{R}}^{d})\mapsto \mathcal{C}({\mathbb{R}}^{d})\) by
$${L}^{u}f(x) = {a}^{ij}(x)\,{\partial }_{ij}f(x) + {b}^{i}(x,u)\,{\partial }_{i}f(x)\,,\qquad (1.3)$$where repeated indices are summed from 1 through d (see Sect. 1.2).
We refer to L u as the controlled extended generator of the diffusion.
Of fundamental importance in the study of functionals of X is Itô’s formula. For \(f \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) and with L u as defined in (1.3),
$$f({X}_{t}) = f({X}_{0}) +{ \int \nolimits }_{0}^{t}{L}^{{U}_{s}}f({X}_{s})\,\mathrm{d}s + {M}_{t}\,,\qquad (1.4)$$
where
$${M}_{t} :={ \int \nolimits }_{0}^{t}{\bigl \langle \nabla f({X}_{s}),{ \sigma }({X}_{s})\,\mathrm{d}{W}_{s}\bigr \rangle }$$
is a local martingale. Krylov’s extension of Itô’s formula [10, p. 122] shows that (1.4) remains valid for functions f in the Sobolev space \({{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\).
Recall that a control is called stationary Markov if U t = v(X t ) for a measurable map \(v : {\mathbb{R}}^{d}\mapsto \mathbb{U}\). Correspondingly, the equation
$$\mathrm{d}{X}_{t} = b{\bigl ({X}_{t},v({X}_{t})\bigr )}\,\mathrm{d}t +{ \sigma }({X}_{t})\,\mathrm{d}{W}_{t}\qquad (1.5)$$
is said to have a strong solution if given a Wiener process \(({W}_{t},{\mathfrak{F}}_{t})\) on a complete probability space \((\Omega,\mathfrak{F}, \mathbb{P})\), there exists a process X on \((\Omega,\mathfrak{F}, \mathbb{P})\), with \({X}_{0} = {x}_{0} \in {\mathbb{R}}^{d}\), which is continuous, \({\mathfrak{F}}_{t}\)-adapted, and satisfies (1.5) for all t at once, a.s. A strong solution is called unique, if any two such solutions X and X′ agree \(\mathbb{P}\)-a.s. when viewed as elements of \(\mathcal{C}{\bigl ([0,\infty ), {\mathbb{R}}^{d}\bigr )}\). It is well known that under Assumptions (A1)–(A3), for any stationary Markov control v, (1.5) has a unique strong solution [8].
Let \({\mathfrak{U}}_{\mathrm{SM}}\) denote the set of stationary Markov controls. Under \(v \in {\mathfrak{U}}_{\mathrm{SM}}\), the process X is strong Markov, and we denote its transition function by P v(t, x, ⋅). It also follows from the work of [5, 12] that under \(v \in {\mathfrak{U}}_{\mathrm{SM}}\), the transition probabilities of X have densities which are locally Hölder continuous. Thus L v defined by
$${L}^{v}f(x) = {a}^{ij}(x)\,{\partial }_{ij}f(x) + {b}^{i}{\bigl (x,v(x)\bigr )}\,{\partial }_{i}f(x)$$
for \(f \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) is the generator of a strongly continuous semigroup on \({\mathcal{C}}_{b}({\mathbb{R}}^{d})\), which is strong Feller. We let \({\mathbb{P}}_{x}^{v}\) denote the probability measure and \({\mathbb{E}}_{x}^{v}\) the expectation operator on the canonical space of the process under the control \(v \in {\mathfrak{U}}_{\mathrm{SM}}\), conditioned on the process X starting from \(x \in {\mathbb{R}}^{d}\) at t = 0.
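A strong solution under a stationary Markov control can be approximated numerically by the Euler–Maruyama scheme. The sketch below is our own illustration in d = 1; the drift b, diffusion σ, and control v passed in are hypothetical choices, not taken from the chapter.

```python
import numpy as np

# Euler-Maruyama discretization of (1.5) in d = 1 under a stationary
# Markov control U_t = v(X_t).
def simulate(b, sigma, v, x0, T=10.0, dt=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n = int(round(T / dt))
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(n):
        u = v(x[k])                          # Markov control: U_t = v(X_t)
        dw = rng.normal(0.0, np.sqrt(dt))    # Wiener increment over [t, t+dt]
        x[k + 1] = x[k] + b(x[k], u) * dt + sigma(x[k]) * dw
    return x

# Example: dX_t = U_t dt + dW_t with v(x) = -sign(x), i.e., drift toward 0.
path = simulate(lambda x, u: u, lambda x: 1.0, lambda x: -np.sign(x), x0=5.0)
```

Started at x 0 = 5, the path drifts toward the origin and then fluctuates around it, as expected for this stabilizing control.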
In Sect. 1.2 we define our notation. Section 1.3 reviews the ergodic control problem for near-monotone costs and the basic properties of the PIA. Section 1.4 is dedicated to the convergence of the algorithm.
1.2 Notation
The standard Euclidean norm in \({\mathbb{R}}^{d}\) is denoted by | ⋅ | , and ⟨ ⋅, ⋅⟩ stands for the inner product. The set of nonnegative real numbers is denoted by \({\mathbb{R}}_{+}\), \(\mathbb{N}\) stands for the set of natural numbers, and \(\mathbb{I}\) denotes the indicator function. We denote by τ(A) the first exit time of the process {X t } from the set \(A \subset {\mathbb{R}}^{d}\), defined by
$$\tau (A) :=\inf \{ t > 0 : {X}_{t}\notin A\}\,.$$
The open ball of radius R in \({\mathbb{R}}^{d}\), centered at the origin, is denoted by B R , and we let τ R : = τ(B R ) and \(\breve{\tau }_{R} := \tau ({B}_{R}^{c})\).
The term domain in \({\mathbb{R}}^{d}\) refers to a nonempty, connected open subset of the Euclidean space \({\mathbb{R}}^{d}\). We introduce the following notation for spaces of real-valued functions on a domain \(D \subset {\mathbb{R}}^{d}\). The space \({\mathcal{L}}^{p}(D)\), p ∈ [1, ∞), stands for the Banach space of (equivalence classes) of measurable functions f satisfying ∫ D | f(x) | p dx < ∞, and \({\mathcal{L}}^{\infty }(D)\) is the Banach space of functions that are essentially bounded in D. The space \({\mathcal{C}}^{k}(D)\) (\({\mathcal{C}}^{\infty }(D)\)) refers to the class of all functions whose partial derivatives up to order k (of any order) exist and are continuous, and \({\mathcal{C}}_{c}^{k}(D)\) is the space of functions in \({\mathcal{C}}^{k}(D)\) with compact support. The standard Sobolev space of functions on D whose generalized derivatives up to order k are in \({\mathcal{L}}^{p}(D)\), equipped with its natural norm, is denoted by \({\mathcal{W}}^{k,p}(D)\), k ≥ 0, p ≥ 1.
In general if \(\mathcal{X}\) is a space of real-valued functions on D, \({\mathcal{X}}_{\mathrm{loc}}\) consists of all functions f such that \(f\varphi \in \mathcal{X}\) for every \(\varphi \in {\mathcal{C}}_{c}^{\infty }(D)\). In this manner we obtain the spaces \({\mathcal{L}}_{\mathrm{loc}}^{p}(D)\) and \({{\mathcal{W}}_{\mathrm{loc}}}^{2,p}(D)\).
Let \(h \in \mathcal{C}({\mathbb{R}}^{d})\) be a positive function. We denote by \(\mathcal{O}(h)\) the set of functions \(f \in \mathcal{C}({\mathbb{R}}^{d})\) having the property
$$\limsup _{\vert x\vert \rightarrow \infty }\;\frac{\vert f(x)\vert }{h(x)} < \infty \,,\qquad (1.6)$$
and by \(\mathfrak{o}(h)\) the subset of \(\mathcal{O}(h)\) over which the limit in (1.6) is zero.
We adopt the notation \({\partial }_{i} := \tfrac{\partial \ } {\partial {x}_{i}}\) and \({\partial }_{ij} := \tfrac{{\partial }^{2}\ } {\partial {x}_{i}\partial {x}_{j}}\). We often use the standard summation rule that repeated subscripts and superscripts are summed from 1 through d. For example,
$${a}^{ij}\,{\partial }_{ij}f =\sum \limits _{i,j=1}^{d}{a}^{ij}\, \frac{{\partial }^{2}f} {\partial {x}_{i}\partial {x}_{j}}\,.$$
1.3 Ergodic Control and the PIA
Let \(c: {\mathbb{R}}^{d} \times \mathbb{U} \rightarrow \mathbb{R}\) be a continuous function bounded from below. As is well known, the ergodic control problem, in its almost sure (or pathwise) formulation, seeks to a.s. minimize over all admissible \(U \in \mathfrak{U}\) the functional
$$\limsup _{T\rightarrow \infty }\; \frac{1} {T}{\int \nolimits }_{0}^{T}c({X}_{t},{U}_{t})\,\mathrm{d}t\,.\qquad (1.7)$$
A weaker, average formulation seeks to minimize
$$\limsup _{T\rightarrow \infty }\; \frac{1} {T}\,\mathbb{E}\left [{\int \nolimits }_{0}^{T}c({X}_{t},{U}_{t})\,\mathrm{d}t\right ].\qquad (1.8)$$
We let ρ ∗ denote the infimum of (1.8) over all admissible controls. We assume that ρ ∗ < ∞.
We assume that the cost function \(c: {\mathbb{R}}^{d} \times \mathbb{U} \rightarrow {\mathbb{R}}_{+}\) is continuous and locally Lipschitz in its first argument uniformly in \(u \in \mathbb{U}\). More specifically, for some function \({K}_{c}: {\mathbb{R}}_{+} \rightarrow {\mathbb{R}}_{+}\),
$$\vert c(x,u) - c(y,u)\vert \leq {K}_{c}(R)\,\vert x - y\vert \qquad \forall x,y \in {B}_{R}\,,\ \forall u \in \mathbb{U}\,,$$and all R > 0.
An important class of running cost functions arising in practice for which the ergodic control problem is well behaved is the near-monotone cost functions. Let \({M}^{{_\ast}}\in {\mathbb{R}}_{+} \cup \{\infty \}\) be defined by
$${M}^{{_\ast}} :=\liminf _{\vert x\vert \rightarrow \infty }\;\min _{u\in \mathbb{U}}\;c(x,u)\,.$$
The running cost function c is called near-monotone if ρ ∗ < M ∗ . Note that inf-compact functions c are always near-monotone.
We adopt the following abbreviated notation. For a function \(g : {\mathbb{R}}^{d} \times \mathbb{U} \rightarrow \mathbb{R}\) and \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) we let
$${g}_{v}(x) := g{\bigl (x,v(x)\bigr )}\,,\qquad x \in {\mathbb{R}}^{d}\,.$$
The ergodic control problem for near-monotone cost functions is characterized as follows.
Theorem 1.3.1.
There exists a unique function \(V \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) which is bounded below in \({\mathbb{R}}^{d}\) and satisfies V (0) = 0 and the Hamilton–Jacobi–Bellman (HJB) equation
$$\min _{u\in \mathbb{U}}\;{\bigl [{L}^{u}V (x) + c(x,u)\bigr ]} = {\rho }^{{_\ast}}\,,\qquad x \in {\mathbb{R}}^{d}\,.$$
The control \({v}^{{_\ast}}\in {\mathfrak{U}}_{\mathrm{SM}}\) is optimal with respect to the criteria (1.7) and (1.8) if and only if it satisfies
$${b}^{i}{\bigl (x,{v}^{{_\ast}}(x)\bigr )}\,{\partial }_{i}V (x) + c{\bigl (x,{v}^{{_\ast}}(x)\bigr )} =\min _{u\in \mathbb{U}}\;{\bigl [{b}^{i}(x,u)\,{\partial }_{i}V (x) + c(x,u)\bigr ]}$$
a.e. in \({\mathbb{R}}^{d}\). Moreover, with \(\breve{\tau }_{r} = \tau ({B}_{r}^{c})\), r > 0, we have
A control \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) is called stable if the associated diffusion is positive recurrent. We denote the set of such controls by \({\mathfrak{U}}_{\mathrm{SSM}}\). Also we let μ v denote the unique invariant probability measure on \({\mathbb{R}}^{d}\) for the diffusion under the control \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\). Recall that \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) if and only if there exists an inf-compact function \(\mathcal{V}\in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\), a bounded domain \(D \subset {\mathbb{R}}^{d}\), and a constant ε > 0 satisfying
$${L}^{v}\mathcal{V}(x) \leq -\varepsilon \qquad \forall x \in {D}^{c}\,.$$
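For a concrete instance of this stability criterion, take d = 1 with b(x) = −x and σ = 1 (so a = 1 ∕ 2), an illustrative choice not from the chapter. With \(\mathcal{V}(x) = x^{2}\) one has \({L}^{v}\mathcal{V}(x) = 1 - 2x^{2}\), which is at most −1 outside the bounded domain D = (−1, 1), so the criterion holds with ε = 1. A quick numerical confirmation:

```python
import numpy as np

# Drift-condition check for the illustrative example b(x) = -x, sigma = 1:
# with V(x) = x^2, L^v V(x) = a*V''(x) + b(x)*V'(x) = 1 - 2x^2.
def LV(x):
    a = 0.5
    b = -x
    return a * 2.0 + b * 2.0 * x

# sup of L^v V over the complement of D = (-1, 1), sampled on a grid
xs = np.linspace(-5.0, 5.0, 1001)
worst = LV(xs[np.abs(xs) >= 1.0]).max()
print(worst)  # about -1, so the criterion holds with eps = 1
```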
It follows that the optimal control \({v}^{{_\ast}}\) in Theorem 1.3.1 is stable. For \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) we define
$${\rho }_{v} :={ \int \nolimits }_{{\mathbb{R}}^{d}}{c}_{v}(x)\,{\mu }_{v}(\mathrm{d}x)\,.$$
A difficulty in synthesizing an optimal control \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) via the HJB equation lies in the fact that the optimal cost ρ ∗ is not known. The PIA provides an iterative procedure that yields the HJB equation in the limit. In order to describe the algorithm, we first need to review some properties of the Poisson equation
$${L}^{v}V + {c}_{v} = {\rho }_{v}\,.\qquad (1.10)$$
We need the following definition.
Definition 1.3.1.
For \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\), and provided ρ v < ∞, define
For \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) and α > 0, let J α v denote the α-discounted cost
$${J}_{\alpha }^{v}(x) :={ \mathbb{E}}_{x}^{v}\left [{\int \nolimits }_{0}^{\infty }\mathrm{{e}}^{-\alpha t}\,{c}_{v}({X}_{t})\,\mathrm{d}t\right ].$$
We borrow the following result from [1, Lemma 7.4]. If \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) and ρ v < ∞, then there exist a function \(V \in {{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\), for any p > 1, and a constant \(\rho \in \mathbb{R}\) which satisfy (1.10) a.e. in \({\mathbb{R}}^{d}\) and such that, as α ↓ 0, αJ α v(0) → ρ and J α v − J α v(0) → V uniformly on compact subsets of \({\mathbb{R}}^{d}\). Moreover,
We refer to the function \(V (x) = {\Psi }^{v}(x) \in {{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\) as the canonical solution of the Poisson equation \({L}^{v}V + {c}_{v} = {\rho }_{v}\) in \({\mathbb{R}}^{d}\).
It can be shown that the canonical solution V to the Poisson equation is the unique solution in \({{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\) which is bounded below and satisfies V (0) = 0. Note also that (1.9) implies that any control v satisfying ρ v < M ∗ is stable.
The PIA takes the following familiar form:
Algorithm (PIA).
-
1.
Initialization. Set k = 0 and select any \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SM}}\) such that \({\rho }_{{v}_{0}} < {M}^{{_\ast}}\).
-
2.
Value determination. Obtain the canonical solution \({V }_{k} = {\Psi }^{{v}_{k}} \in {{\mathcal{W}}_{\mathrm{ loc}}}^{2,p}({\mathbb{R}}^{d})\) , p > 1, to the Poisson equation
$${L}^{{v}_{k} }{V }_{k} + {c}_{{v}_{k}} = {\rho }_{{v}_{k}}$$in \({\mathbb{R}}^{d}\).
-
3.
Termination. If \({v}_{k}(x) \in { Arg\,min}_{u\in \mathbb{U}}\left [{b}^{i}(x,u){\partial }_{i}{V }_{k}(x) + c(x,u)\right ]\) x-a.e., return v k .
-
4.
Policy improvement. Select an arbitrary \({v}_{k+1} \in {\mathfrak{U}}_{\mathrm{SM}}\) which satisfies
$${v}_{k+1}(x) \in { Arg\,min _{u\in \mathbb{U}}}\;\left [{\sum \limits _{i=1}^{d}}{b}^{i}(x,u)\frac{\partial {V }_{k}} {\partial {x}_{i}} (x) + c(x,u)\right ]\,,\qquad x \in {\mathbb{R}}^{d}\,.$$
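The two alternating steps have a transparent linear-algebra form on a finite state space. The sketch below runs the average-cost analogue of the PIA on a finite controlled Markov chain; the grid model (a random walk whose drift is tilted by the action u, mimicking dX t = U t dt + dW t , with cost growing in the distance from a center state) is our own illustrative discretization, not a construction from the chapter.

```python
import numpy as np

def policy_iteration(P, c, max_iter=100):
    """Average-cost policy iteration on a finite MDP.

    P[u] is the transition matrix and c[u] the running-cost vector under
    action index u.  Value determination solves the discrete Poisson
    equation h + rho = c_v + P_v h (normalized by h[0] = 0); policy
    improvement minimizes c(x, u) + (P_u h)(x) pointwise, as in step 4.
    """
    n = P[0].shape[0]
    v = np.zeros(n, dtype=int)                   # initial policy v_0
    rho = np.inf
    for _ in range(max_iter):
        Pv = np.array([P[v[i]][i] for i in range(n)])
        cv = np.array([c[v[i]][i] for i in range(n)])
        A = np.eye(n) - Pv
        A[:, 0] = 1.0                            # unknowns: (rho, h[1:]), h[0] = 0
        sol = np.linalg.solve(A, cv)
        rho, h = sol[0], sol.copy()
        h[0] = 0.0
        q = np.array([c[u] + P[u] @ h for u in range(len(P))])
        v_new = v.copy()                         # keep current action on ties
        better = q[v, np.arange(n)] > q.min(axis=0) + 1e-12
        v_new[better] = q.argmin(axis=0)[better]
        if np.array_equal(v_new, v):
            return v, float(rho)
        v = v_new
    return v, float(rho)

# Grid chain on states 0..10 with "origin" at index 5: action u in {-1, +1}
# tilts the walk (up-probability 0.5 + 0.25*u), and the cost is |x - 5|.
n, actions = 11, (-1, +1)
P, c = [], []
for u in actions:
    M = np.zeros((n, n))
    pu = 0.5 + 0.25 * u
    for i in range(n):
        M[i, min(i + 1, n - 1)] += pu
        M[i, max(i - 1, 0)] += 1.0 - pu
    P.append(M)
    c.append(np.abs(np.arange(n) - 5).astype(float))
v_opt, rho_opt = policy_iteration(P, c)
```

As expected, the returned policy drifts toward the center state from both sides, and the average costs of the iterates decrease monotonically, in line with Lemma 1.3.1 below.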
Since \({\rho }_{{v}_{0}} < {M}^{{_\ast}}\) it follows that \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SSM}}\). The algorithm is well defined, provided \({v}_{k} \in {\mathfrak{U}}_{\mathrm{SSM}}\) for all \(k \in \mathbb{N}\). This follows from the next lemma which shows that \({\rho }_{{v}_{k+1}} \leq {\rho }_{{v}_{k}}\), and in particular that \({\rho }_{{v}_{k}} < {M}^{{_\ast}}\), for all \(k \in \mathbb{N}\).
Lemma 1.3.1.
Suppose \(v \in {\mathfrak{U}}_{\mathrm{SSM}}\) satisfies ρ v < M ∗ . Let \(V \in {{\mathcal{W}}_{\mathrm{loc}}}^{2,p}({\mathbb{R}}^{d})\) , p > 1, be the canonical solution to the Poisson equation
$${L}^{v}V + {c}_{v} = {\rho }_{v}\,.$$
Then any measurable selector \(\hat{v}\) from the minimizer
$$\hat{v}(x) \in { Arg\,min _{u\in \mathbb{U}}}\;\left [{b}^{i}(x,u)\,{\partial }_{i}V (x) + c(x,u)\right ]$$
satisfies \({\rho }_{\hat{v}} \leq {\rho }_{v}\) . Moreover, the inequality is strict unless v is itself such a selector, i.e., unless
$${b}^{i}{\bigl (x,v(x)\bigr )}\,{\partial }_{i}V (x) + {c}_{v}(x) =\min _{u\in \mathbb{U}}\;{\bigl [{b}^{i}(x,u)\,{\partial }_{i}V (x) + c(x,u)\bigr ]}\quad \text{a.e.}$$
Proof.
Let \(\mathcal{V}\) be a Lyapunov function satisfying \({L}^{v}\mathcal{V}(x) \leq {k}_{0} - g(x)\), for some inf-compact g such that \({c}_{v} \in \mathfrak{o}(g)\) (see [1, Lemma 7.1]). For \(n \in \mathbb{N}\), define
Clearly, \(\hat{{v}}_{n} \rightarrow \hat{ v}\) as n → ∞ in the topology of Markov controls (see [1, Section 3.3]). It is evident that \(\mathcal{V}\) is a stochastic Lyapunov function relative to \(\hat{{v}}_{n}\), i.e., there exist constants k n such that \({L}^{\hat{{v}}_{n}}\mathcal{V}(x) \leq {k}_{n} - g(x)\), for all \(n \in \mathbb{N}\). Since \(V \in \mathfrak{o}(\mathcal{V})\), it follows that (see [1, Lemma 7.1])
Let
Also, by definition of \(\hat{{v}}_{n}\), for all m ≤ n, we have
By Itô’s formula we obtain from (1.13) that
for all m ≤ n. Taking limits in (1.14) as t → ∞ and using (1.12), we obtain
Note that v↦ρ v is lower semicontinuous. Therefore, taking limits in (1.15) as n → ∞, we have
Since c is near-monotone and \({\rho }_{\hat{{v}}_{n}} \leq {\rho }_{v} < {M}^{{_\ast}}\), there exist \(\hat{R} > 0\) and δ > 0 such that \({\mu }_{\hat{{v}}_{n}}({B}_{\hat{R}}) \geq \delta \) for all \(n \in \mathbb{N}\). Then, with \({\psi }_{\hat{{v}}_{n}}\) denoting the density of \({\mu }_{\hat{{v}}_{n}}\), Harnack’s inequality [7, Theorem 8.20, p. 199] implies that for every \(R >\hat{ R}\) there exists a constant C H = C H (R) such that, with | B R | denoting the volume of \({B}_{R} \subset {\mathbb{R}}^{d}\), it holds that
By (1.16) this implies that \({\rho }_{\hat{v}} < {\rho }_{v}\) unless h = 0 a.e.
1.4 Convergence of the PIA
We start with the following lemma.
Lemma 1.4.2.
The sequence {V k } of the PIA has the following properties:
-
(i)
For some constant \({C}_{0} = {C}_{0}({\rho }_{{v}_{0}})\) we have \({\inf }_{{\mathbb{R}}^{d}}\;{V }_{k} > {C}_{0}\) for all k ≥ 0.
-
(ii)
Each V k attains its minimum on the compact set
$$\mathcal{K}({\rho }_{{v}_{0}}) :={\bigl \{ x \in {\mathbb{R}}^{d} {:\min _{ u\in \mathbb{U}}}\;c(x,u) \leq {\rho }_{{v}_{0}}\bigr \}}\,.$$ -
(iii)
For any p > 1 and R > 0, there exists a constant \(\tilde{{C}}_{0} =\tilde{ {C}}_{0}(R,{\rho }_{{v}_{0}},p)\) such that
$${\bigl \lVert{V{ }_{k}\bigr \rVert}}_{{\mathcal{W}}^{2,p}({B}_{R})} \leq \tilde{ {C}}_{0}\qquad \forall k \geq 0\,.$$ -
(iv)
There exist positive numbers α k and β k , k ≥ 0, such that α k ↓ 1 and β k ↓ 0 as k →∞ and
$${\alpha }_{k+1}{V }_{k+1}(x) + {\beta }_{k+1} \leq {\alpha }_{k}{V }_{k}(x) + {\beta }_{k}\quad \forall x \in {\mathbb{R}}^{d}\,,\ \forall k \geq 0\,.$$
Proof.
Parts (i) and (ii) follow directly from [3, Lemmas 3.6.1 and 3.6.4].
For part (iii) note first that the near-monotone assumption implies that
Consequently,
uniformly on compact subsets of \({\mathbb{R}}^{d}\). Hence, since \({J}_{\alpha }^{{v}_{k}} - {J}_{ \alpha }^{{v}_{k}}(0) \rightarrow {V }_{ k}\) weakly in \({\mathcal{W}}^{2,p}({B}_{R})\) for any R > 0, (iii) follows from [3, Theorem 3.7.4].
Part (iv) follows as in [11, Theorem 4.4].
As the corollary below shows, the PIA always converges.
Corollary 1.4.1.
There exist a constant \(\hat{\rho }\) and a function \(\hat{V } \in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) with \(\hat{V }(0) = 0\) , such that, as k →∞, \({\rho }_{{v}_{k}} \downarrow \hat{ \rho }\) and \({V }_{k} \rightarrow \hat{ V }\) weakly in \({\mathcal{W}}^{2,p}({B}_{R})\) , p > 1, for any R > 0. Moreover, \((\hat{V },\hat{\rho })\) satisfies the HJB equation
Proof.
By Lemma 1.3.1, \({\rho }_{{v}_{k}}\) decreases monotonically in k and hence converges to some \(\hat{\rho } \geq {\rho }^{{_\ast}}\). By Lemma 1.4.2 (iii), the sequence V k is weakly compact in \({\mathcal{W}}^{2,p}({B}_{R})\), p > 1, for any R > 0, while by Lemma 1.4.2 (iv), any weakly convergent subsequence has the same limit \(\hat{V }\). Also, repeating the argument in the proof of Lemma 1.3.1, with
we deduce that for any R > 0 there exists some constant K(R) such that
Therefore, h k → 0 weakly in \({\mathcal{L}}^{1}(D)\) as k → ∞ for any bounded domain D. Taking limits in the equation
and using [3, Lemma 3.5.4] yields (1.17).
It is evident that \(v \in {\mathfrak{U}}_{\mathrm{SM}}\) is an equilibrium of the PIA if it satisfies ρ v < M ∗ and
For one-dimensional diffusions, one can show that (1.18) has a unique solution, which is therefore the optimal one, with ρ v = ρ ∗ . For higher dimensions, to the best of our knowledge, no such result is available. There is also the possibility that the PIA converges to some \(\hat{v} \in {\mathfrak{U}}_{\mathrm{SSM}}\) which is not an equilibrium. This happens if (1.17) satisfies
This is in fact the case with the example in [4]. In this example the controlled diffusion takes the form \(\mathrm{d}{X}_{t} = {U}_{t}\,\mathrm{d}t +\mathrm{ d}{W}_{t}\), with \(\mathbb{U} = [-1,1]\) and running cost \(c(x) = 1 -\mathrm{ {e}}^{-\vert x\vert }\). If we define
and
then direct computation shows that
and so the pair (V ρ, ρ) satisfies the HJB. The stationary Markov control corresponding to this solution of the HJB is \({w}_{\rho }(x) = -sign (x - {\xi }_{\rho })\). The controlled process under w ρ has invariant probability density \({\varphi }_{\rho }(x) =\mathrm{ {e}}^{-2\vert x-{\xi }_{\rho }\vert }\). A simple computation shows that
Thus, if ρ > 1 ∕ 3, then V ρ is not a canonical solution of the Poisson equation corresponding to the stable control w ρ. Therefore, this example satisfies (1.19) and shows that in general we cannot preclude the possibility that the limiting value of the PIA is not an equilibrium of the algorithm.
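The constant 1 ∕ 3 in this example can be checked numerically: \({\varphi }_{\rho }(x) =\mathrm{ {e}}^{-2\vert x-{\xi }_{\rho }\vert }\) is a probability density, and at ξ ρ = 0 the long-run average of c(x) = 1 − e − | x |  under w ρ equals 1 ∕ 3 (plain trapezoidal quadrature on a wide grid; the value is also available in closed form).

```python
import numpy as np

# Verify that phi(x) = exp(-2|x - xi|) integrates to 1, and that the
# average of c(x) = 1 - exp(-|x|) against phi at xi = 0 is 1/3.
x = np.linspace(-30.0, 30.0, 120001)

def trap(f):
    """Trapezoidal rule on the fixed grid x."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))

def avg_cost(xi):
    phi = np.exp(-2.0 * np.abs(x - xi))          # invariant density of w_rho
    return trap((1.0 - np.exp(-np.abs(x))) * phi)

mass = trap(np.exp(-2.0 * np.abs(x)))            # approximately 1
rho0 = avg_cost(0.0)                             # approximately 1/3
print(mass, rho0)
```

Moving ξ ρ away from the origin increases the average cost, consistent with 1 ∕ 3 being the threshold value.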
In [11, Theorem 5.2], a blanket Lyapunov condition is imposed to guarantee convergence of the PIA to an optimal control. Instead, we use Lyapunov analysis to characterize the domain of attraction of the optimal value.
We need the following definition.
Definition 1.4.2.
Let v ∗ be an optimal control as characterized in Theorem 1.3.1. Let \(\mathfrak{V}\) denote the class of all nonnegative functions \(\mathcal{V}\in {\mathcal{C}}^{2}({\mathbb{R}}^{d})\) satisfying \({L}^{{v}^{{_\ast}} }\mathcal{V}\leq {k}_{0} - h(x)\) for some nonnegative, inf-compact \(h \in \mathcal{C}({\mathbb{R}}^{d})\) and a constant k 0. We denote by \(\mathfrak{o}(\mathfrak{V})\) the class of inf-compact functions g satisfying \(g \in \mathfrak{o}(\mathcal{V})\) for some \(\mathcal{V}\in \mathfrak{V}\).
The theorem below asserts that if the PIA is initialized at a \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SSM}}\) whose associated canonical solution to the Poisson equation lies in \(\mathfrak{o}(\mathfrak{V})\), then it converges to an optimal \({v}^{{_\ast}}\in {\mathfrak{U}}_{\mathrm{SSM}}\).
Theorem 1.4.2.
If \({v}_{0} \in {\mathfrak{U}}_{\mathrm{SSM}}\) satisfies \({\Psi }^{{v}_{0}} \in \mathfrak{o}(\mathfrak{V})\) , then \({\rho }_{{v}_{k}} \rightarrow {\rho }^{{_\ast}}\) as k →∞.
Proof.
The proof is straightforward. By Lemma 1.4.2 (iv), \(\hat{V } \in \mathfrak{o}(\mathfrak{V})\). Also by (1.17), we have
and applying Dynkin’s formula, we obtain
Since \(\hat{V } \in \mathfrak{o}(\mathfrak{V})\), by [1, Lemma 7.1] we have
and thus taking limits as t → ∞ in (1.20), we obtain \({\rho }^{{_\ast}}\geq \hat{ \rho }\). Therefore, we must have \(\hat{\rho } = {\rho }^{{_\ast}}\).
1.5 Concluding Remarks
We have concentrated on the model of controlled diffusions with near-monotone running costs. The case of stable controls with a blanket Lyapunov condition is much simpler. If, for example, we impose the assumption that there exist a constant k 0 > 0 and a pair of nonnegative, inf-compact functions \((\mathcal{V},h) \in {\mathcal{C}}^{2}({\mathbb{R}}^{d}) \times \mathcal{C}({\mathbb{R}}^{d})\) satisfying \(1 + c \in \mathfrak{o}(h)\) and such that
$${L}^{u}\mathcal{V}(x) \leq {k}_{0} - h(x)\qquad \forall (x,u) \in {\mathbb{R}}^{d} \times \mathbb{U}\,,$$
then the PIA always converges to the optimal solution.
References
Arapostathis, A., Borkar, V.S.: Uniform recurrence properties of controlled diffusions and applications to optimal control. SIAM J. Control Optim. 48(7), 152–160 (2010)
Arapostathis, A., Borkar, V.S., Fernández-Gaucherand, E., Ghosh, M.K., Marcus, S.I.: Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim. 31(2), 282–344 (1993)
Arapostathis, A., Borkar, V.S., Ghosh, M.K.: Ergodic control of diffusion processes, Encyclopedia of Mathematics and its Applications, vol. 143. Cambridge University Press, Cambridge (2011)
Bensoussan, A., Borkar, V.: Ergodic control problem for one-dimensional diffusions with near-monotone cost. Systems Control Lett. 5(2), 127–133 (1984)
Bogachev, V.I., Krylov, N.V., Röckner, M.: On regularity of transition probabilities and invariant measures of singular diffusions under minimal conditions. Comm. Partial Differential Equations 26(11–12), 2037–2080 (2001)
Costa, O.L.V., Dufour, F.: The policy iteration algorithm for average continuous control of piecewise deterministic Markov processes. Appl. Math. Optim. 62(2), 185–204 (2010)
Gilbarg, D., Trudinger, N.S.: Elliptic partial differential equations of second order, Grundlehren der Mathematischen Wissenschaften, vol. 224, second edn. Springer-Verlag, Berlin (1983)
Gyöngy, I., Krylov, N.: Existence of strong solutions for Itô’s stochastic equations via approximations. Probab. Theory Related Fields 105(2), 143–158 (1996)
Hernández-Lerma, O., Lasserre, J.B.: Policy iteration for average cost Markov control processes on Borel spaces. Acta Appl. Math. 47(2), 125–154 (1997)
Krylov, N.V.: Controlled diffusion processes, Applications of Mathematics, vol. 14. Springer-Verlag, New York (1980)
Meyn, S.P.: The policy iteration algorithm for average reward Markov decision processes with general state space. IEEE Trans. Automat. Control 42(12), 1663–1680 (1997)
Stannat, W.: (Nonsymmetric) Dirichlet operators on L 1: existence, uniqueness and associated Markov processes. Ann. Scuola Norm. Sup. Pisa Cl. Sci. (4) 28(1), 99–140 (1999)
Acknowledgement
This work was supported in part by the Office of Naval Research through the Electric Ship Research and Development Consortium.
© 2012 Springer Science+Business Media, LLC
Arapostathis, A. (2012). On the Policy Iteration Algorithm for Nondegenerate Controlled Diffusions Under the Ergodic Criterion. In: Hernández-Hernández, D., Minjárez-Sosa, J. (eds) Optimization, Control, and Applications of Stochastic Systems. Systems & Control: Foundations & Applications. Birkhäuser, Boston. https://doi.org/10.1007/978-0-8176-8337-5_1