Introduction

The dynamic programming principle (DPP) is a fundamental tool in optimal control theory. It was largely developed by Richard Bellman in the 1950s (Bellman 1957) and has since been applied to a wide range of problems in deterministic and stochastic optimal control. The goal of optimal control is to determine the control function and the corresponding trajectory of a dynamical system which together optimize a given criterion, usually expressed as an integral along the trajectory (the cost functional) (Fleming and Rishel 1975; Macki and Strauss 1982). The function which associates with the initial condition of the dynamical system the optimal value of the cost functional over all possible trajectories is called the value function. The key point is that, via the dynamic programming principle, one can characterize the value function as the solution of a nonlinear partial differential equation (the Hamilton–Jacobi–Bellman equation) and then use it to synthesize a feedback control law. This is the major advantage over the approach based on the Pontryagin Maximum Principle (PMP) (Boltyanskii et al. 1956; Pontryagin et al. 1962): the PMP merely gives necessary conditions for the characterization of the open-loop optimal control and of the corresponding optimal trajectory. The DPP has also been used to construct approximation schemes for the value function, although this approach suffers from the “curse of dimensionality,” since one has to solve a nonlinear partial differential equation in a high-dimensional space. Despite the elegance of the DPP approach, its practical application is limited by this bottleneck, and many optimal control problems have instead been solved via the two-point boundary value problem associated with the PMP.

The Infinite Horizon Problem

Let us present the main ideas for the classical infinite horizon problem. Let a controlled dynamical system be given by

$$\displaystyle{ \left \{\begin{array}{@{}l@{\quad }l@{}} \dot{y}(s) = f(y(s),\alpha (s))\quad \\ y(t_{ 0}) = x_{0}. \quad \end{array} \right. }$$
(1)

where \(x_{0},y(s) \in \mathbb{R}^{d}\), and

$$\displaystyle{\alpha : [t_{0},T] \rightarrow A \subseteq \mathbb{R}^{m},}$$

with T finite or \(+\infty\). Under the assumption that the control is measurable, existence and uniqueness properties for the solution of (1) are ensured by the Carathéodory theorem:

Theorem 1 (Carathéodory)

Assume that:

  1. 1.

    f(⋅,⋅) is continuous.

  2. 2.

    There exists a positive constant \(L_{f} > 0\) such that

    $$\displaystyle{ \vert f(x,a) - f(y,a)\vert \leq L_{f}\vert x - y\vert, }$$

    for all \(x, y \in \mathbb{R}^{d}\) and \(a \in A\).

  3. 3.

    f(x,α(t)) is measurable with respect to t.

Then, there is a unique absolutely continuous function \(y : [t_{0},T] \rightarrow \mathbb{R}^{d}\) that satisfies

$$\displaystyle{ y(s) = x_{0} +\int _{t_{0}}^{s}f(y(\tau ),\alpha (\tau ))\,d\tau, }$$
(2)

which is interpreted as the solution of (1).

Note that the solution is continuous, but only a.e. differentiable, so it must be regarded as a weak solution of (1). By the theorem above, fixing a control in the set of admissible controls

$$\displaystyle{ \alpha \in \mathcal{A} :=\{\alpha : [t_{0},T] \rightarrow A,\ \text{measurable}\} }$$

yields a unique trajectory of (1) which is denoted by \(y_{x_{0},t_{0}}(s;\alpha )\). Changing the control policy generates a family of solutions of the controlled system (1) with index α. Since the dynamics (1) are “autonomous,” the initial time \(t_{0}\) can be shifted to 0 by a change of variable. So, to simplify the notation for autonomous dynamics, we can set \(t_{0} = 0\) and denote this family by \(y_{x_{0}}(s;\alpha )\) (or even write it as y(s) if no ambiguity over the initial state or control arises). It is customary in dynamic programming, moreover, to use the notations x and t instead of \(x_{0}\) and \(t_{0}\) (since x and t appear as variables in the Hamilton–Jacobi–Bellman equation).
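
As a concrete illustration (not taken from the original entry), the trajectory \(y_{x_{0}}(s;\alpha )\) can be approximated by a forward Euler discretization of the integral equation (2). The dynamics f, the control α, the initial state, and the step size in the sketch below are illustrative choices.

```python
import numpy as np

def simulate(f, alpha, x0, t_final, h=1e-2):
    """Forward Euler approximation of y(s) = x0 + int_0^s f(y(tau), alpha(tau)) dtau.

    f     : callable (state, control value) -> time derivative of the state
    alpha : callable s -> control value (a measurable control, sampled pointwise)
    x0    : initial state
    """
    n_steps = int(t_final / h)
    y = np.empty((n_steps + 1, np.size(x0)))
    y[0] = x0
    t = 0.0
    for k in range(n_steps):
        y[k + 1] = y[k] + h * f(y[k], alpha(t))
        t += h
    return y

# Example: scalar dynamics y' = -y + a with a piecewise-constant control in A = [-1, 1].
if __name__ == "__main__":
    f = lambda y, a: -y + a
    alpha = lambda s: 1.0 if s < 1.0 else -1.0   # piecewise-constant (hence measurable) control
    traj = simulate(f, alpha, x0=np.array([2.0]), t_final=3.0)
    print(traj[-1])  # approximate value of y(3)
```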

Optimal control problems require the introduction of a cost functional \(J : \mathcal{A}\rightarrow \mathbb{R}\) which is used to select the “optimal trajectory” for (1). In the case of the infinite horizon problem, we set \(t_{0} = 0\), \(x_{0} = x\), and this functional is defined as

$$\displaystyle\begin{array}{rcl} J_{x}(\alpha ) =\int _{ 0}^{\infty }g(y_{ x}(s,\alpha ),\alpha (s))e^{-\lambda s}ds& &{}\end{array}$$
(3)

for a given λ > 0. The function g represents the running cost, and λ is the discount factor, which accounts for the reduced value, at the initial time, of costs incurred in the future. From a technical point of view, the presence of the discount factor ensures that the integral is finite whenever g is bounded. Note that one can also consider the undiscounted problem (λ = 0), provided the integral remains finite. The goal of optimal control is to find an optimal pair \((y^{{\ast}},\alpha ^{{\ast}})\) that minimizes the cost functional. If we seek optimal controls in open-loop form, i.e., as functions of t, then the Pontryagin Maximum Principle furnishes necessary conditions for a pair \((y^{{\ast}},\alpha ^{{\ast}})\) to be optimal.
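
To make (3) concrete, the discounted cost of a fixed control can be approximated by truncating the integral at a finite horizon and discretizing it along an Euler approximation of the trajectory. The running cost, dynamics, discount factor, and truncation horizon below are assumptions chosen only for illustration.

```python
import numpy as np

def discounted_cost(f, g, alpha, x, lam, t_max=50.0, h=1e-2):
    """Approximate J_x(alpha) = int_0^inf g(y(s), alpha(s)) e^{-lam s} ds.

    The infinite horizon is truncated at t_max; since lam > 0 and g is bounded,
    the neglected tail is at most sup|g| * e^{-lam t_max} / lam.
    """
    y = np.atleast_1d(np.asarray(x, dtype=float))
    cost, t = 0.0, 0.0
    while t < t_max:
        a = alpha(t)
        cost += h * g(y, a) * np.exp(-lam * t)   # left-endpoint quadrature of the integrand
        y = y + h * f(y, a)                      # Euler step for the trajectory
        t += h
    return cost

# Example: g(y, a) = |y|^2 + a^2, dynamics y' = -y + a, constant control a = 0.
J = discounted_cost(f=lambda y, a: -y + a,
                    g=lambda y, a: float(y @ y) + a**2,
                    alpha=lambda s: 0.0, x=[1.0], lam=1.0)
print(J)  # roughly int_0^inf e^{-2s} e^{-s} ds = 1/3
```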

A major drawback of an open-loop control is that, being constructed as a function of time alone, it cannot take into account errors in the true state of the system, due, for example, to modeling errors or external disturbances, which may drive the evolution far from the predicted optimal trajectory. Another limitation of this approach is that a new computation of the control is required whenever the initial state changes.

For these reasons, we are interested in the so-called feedback controls, that is, controls expressed as functions of the state of the system. Under feedback control, if the system trajectory is perturbed, the system reacts by changing its control strategy according to the change in the state. One of the main motivations for using the DPP is that it yields solutions to optimal control problems in the form of feedback controls.

DPP for the Infinite Horizon Problem

The starting point of dynamic programming is to introduce an auxiliary function, the value function, which for our problem is

$$\displaystyle\begin{array}{rcl} v(x) =\inf _{\alpha \in \mathcal{A}}J_{x}(\alpha ),& &{}\end{array}$$
(4)

where, as above, x is the initial position of the system. The value function has a clear meaning: it is the optimal cost associated with the initial position x. This reference value can be used to assess the quality of a given control: if \(J_{x}(\bar{\alpha })\) is close to v(x), then \(\bar{\alpha }\) is nearly optimal.
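
As a rough illustration of definition (4), one can compute an upper bound for v(x) by minimizing the truncated, discretized cost only over a finite family of simple (here, constant) controls. This is a sketch under illustrative choices of f, g, and λ; constant controls are of course not optimal in general, so the result only bounds v(x) from above.

```python
import numpy as np

def value_over_constant_controls(f, g, x, lam, control_grid, t_max=50.0, h=1e-2):
    """Crude upper bound for v(x) = inf_alpha J_x(alpha):
    minimize the truncated, discretized cost over constant controls alpha(s) = a."""
    best = np.inf
    for a in control_grid:
        y, cost, t = float(x), 0.0, 0.0
        while t < t_max:
            cost += h * g(y, a) * np.exp(-lam * t)   # discounted running cost
            y += h * f(y, a)                          # Euler step
            t += h
        best = min(best, cost)
    return best

# Example: f(y, a) = a, g(y, a) = y^2 + a^2, A = [-1, 1], lam = 1.
v_upper = value_over_constant_controls(f=lambda y, a: a,
                                       g=lambda y, a: y**2 + a**2,
                                       x=1.0, lam=1.0,
                                       control_grid=np.linspace(-1, 1, 41))
print(v_upper)  # an upper bound for v(1)
```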

Bellman’s dynamic programming principle provides a first characterization of the value function.

Proposition 1 (DPP for the infinite horizon problem)

Under the assumptions of Theorem 1, for all \(x \in \mathbb{R}^{d}\) and τ > 0,

$$\displaystyle\begin{array}{rcl} v(x)& =& \inf _{\alpha \in \mathcal{A}}\left \{\int _{0}^{\tau }g(y_{ x}(s;\alpha ),\alpha (s))e^{-\lambda s}ds\right. \\ & & \quad \quad \left.\phantom{\int _{0}^{\tau }} + e^{-\lambda \tau }v(y_{ x}(\tau ;\alpha ))\right \}. {}\end{array}$$
(5)

Proof

Denote by \(\bar{v}(x)\) the right-hand side of (5). First, we remark that for any \(x \in \mathbb{R}^{d}\) and \(\bar{\alpha }\in \mathcal{A}\),

$$\displaystyle\begin{array}{rcl} J_{x}(\bar{\alpha })& =& \int _{0}^{\infty }g(\bar{y}(s),\bar{\alpha }(s))e^{-\lambda s}ds {}\\ & =& \int _{0}^{\tau }g(\bar{y}(s),\bar{\alpha }(s))e^{-\lambda s}ds {}\\ & & \quad +\int _{ \tau }^{\infty }g(\bar{y}(s),\bar{\alpha }(s))e^{-\lambda s}ds {}\\ & =& \int _{0}^{\tau }g(\bar{y}(s),\bar{\alpha }(s))e^{-\lambda s}ds + e^{-\lambda \tau } {}\\ & &\quad \times \int _{0}^{\infty }g(\bar{y}(s+\tau ),\bar{\alpha }(s+\tau ))e^{-\lambda s}ds {}\\ & \geq & \int _{0}^{\tau }g(\bar{y}(s),\bar{\alpha }(s))e^{-\lambda s}ds + e^{-\lambda \tau }v(\bar{y}(\tau )) {}\\ \end{array}$$

(here, \(y_{x}(s;\bar{\alpha })\) is abbreviated as \(\bar{y}(s)\)). The right-hand side of this inequality is bounded below by its infimum over all controls, which is precisely \(\bar{v}(x)\); taking then the infimum over \(\bar{\alpha }\) on the left-hand side yields

$$\displaystyle\begin{array}{rcl} v(x) \geq \bar{ v}(x)& &{}\end{array}$$
(6)

To prove the opposite inequality, we recall that \(\bar{v}\) is defined as an infimum, and so, for any \(x \in \mathbb{R}^{d}\) and \(\varepsilon> 0\), there exists a control \(\bar{\alpha }_{\varepsilon }\) (and the corresponding evolution \(\bar{y}_{\varepsilon }\)) such that

$$\displaystyle{ \bar{v}(x)+\varepsilon \geq \int _{0}^{\tau }g(\bar{y}_{\varepsilon }(s),\bar{\alpha }_{\varepsilon }(s))e^{-\lambda s}ds + e^{-\lambda \tau }v(\bar{y}_{\varepsilon }(\tau )). }$$
(7)

On the other hand, since the value function v is itself defined as an infimum, for the initial point \(\bar{y}_{\varepsilon }(\tau )\) and any \(\varepsilon > 0\), there exists a control \(\tilde{\alpha }_{\varepsilon }\) such that

$$\displaystyle{ v(\bar{y}_{\varepsilon }(\tau ))+\varepsilon \geq J_{\bar{y}_{\varepsilon }(\tau )}(\tilde{\alpha }_{\varepsilon }). }$$
(8)

Inserting (8) in (7), we get

$$\displaystyle\begin{array}{rcl} \bar{v}(x)& \geq & \int _{0}^{\tau }g(\bar{y}_{\varepsilon }(s),\bar{\alpha }_{\varepsilon }(s))e^{-\lambda s}ds \\ & & +e^{-\lambda \tau }J_{\bar{ y}_{\varepsilon }(\tau )}(\tilde{\alpha }_{\varepsilon }) - (1 + e^{-\lambda \tau })\varepsilon \\ & \geq & J_{x}(\hat{\alpha }) - (1 + e^{-\lambda \tau })\varepsilon \\ & \geq & v(x) - (1 + e^{-\lambda \tau })\varepsilon, {}\end{array}$$
(9)

where \(\hat{\alpha }\) is a control defined by

$$\displaystyle{ \hat{\alpha }(s) = \left \{\begin{array}{@{}l@{\quad }l@{}} \bar{\alpha }_{\varepsilon }(s) \quad &0 \leq s <\tau \\ \tilde{\alpha }_{\varepsilon }(s-\tau )\quad &s \geq \tau. \end{array} \right. }$$
(10)

(Note that \(\hat{\alpha }(\cdot )\) is still measurable.) Since \(\varepsilon\) is arbitrary, (9) finally yields \(\bar{v}(x) \geq v(x)\), which, combined with (6), concludes the proof.

We observe that this proof crucially relies on the fact that the control defined by (10) still belongs to \(\mathcal{A}\), being a measurable control. The possibility of obtaining an admissible control by joining together two different measurable controls is known as the concatenation property.
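
The concatenation (10) is easy to express programmatically; the following sketch (with an arbitrary switching time τ) simply joins two controls into a new function of time, mirroring the construction used in the proof.

```python
def concatenate(alpha_bar, alpha_tilde, tau):
    """Concatenation of two controls as in (10): follow alpha_bar on [0, tau),
    then restart alpha_tilde from time 0 at s = tau."""
    def alpha_hat(s):
        return alpha_bar(s) if s < tau else alpha_tilde(s - tau)
    return alpha_hat

# Example: switch from the constant control -1 to the constant control +1 at tau = 2.
alpha_hat = concatenate(lambda s: -1.0, lambda s: 1.0, tau=2.0)
print(alpha_hat(1.5), alpha_hat(2.5))  # -1.0 1.0
```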

The Hamilton–Jacobi–Bellman Equation

The DPP can be used to characterize the value function in terms of a nonlinear partial differential equation. In fact, let \(\alpha ^{{\ast}} \in \mathcal{A}\) be the optimal control, and \(y^{{\ast}}\) the associated evolution (to simplify, we are assuming that the infimum is a minimum). Then,

$$\displaystyle{ v(x) =\int _{ 0}^{\tau }g(y^{{\ast}}(s),\alpha ^{{\ast}}(s))e^{-\lambda s}ds + e^{-\lambda \tau }v(y^{{\ast}}(\tau )), }$$

that is,

$$\displaystyle{ v(x) - e^{-\lambda \tau }v(y^{{\ast}}(\tau )) =\int _{ 0}^{\tau }g(y^{{\ast}}(s),\alpha ^{{\ast}}(s))e^{-\lambda s}ds }$$

so that adding and subtracting \(e^{-\lambda \tau }v(x)\) and dividing by τ, we get

$$\displaystyle\begin{array}{rcl} & & e^{-\lambda \tau }\frac{\left (v(x) - v(y^{{\ast}}(\tau ))\right )} {\tau } + \frac{v(x)(1 - e^{-\lambda \tau })} {\tau } {}\\ & & \quad = \frac{1} {\tau } \int _{0}^{\tau }g(y^{{\ast}}(s),\alpha ^{{\ast}}(s))e^{-\lambda s}ds. {}\\ \end{array}$$

Assume now that v is regular. By passing to the limit as τ → 0+, we have

$$\displaystyle\begin{array}{rcl} & & \lim _{\tau \rightarrow 0^{+}} -\frac{v\left (y^{{\ast}}(\tau )\right ) - v(x)} {\tau } {}\\ & & \quad = -Dv(x) \cdot \dot{ y}^{{\ast}}(0) = -Dv(x) \cdot f(x,\alpha ^{{\ast}}(0)) {}\\ & & \lim _{\tau \rightarrow 0^{+}}v(x)\frac{(1 - e^{-\lambda \tau })} {\tau } =\lambda v(x) {}\\ & & \lim _{\tau \rightarrow 0^{+}} \frac{1} {\tau } \int _{0}^{\tau }g(y^{{\ast}}(s),\alpha ^{{\ast}}(s))e^{-\lambda s}ds = g(x,\alpha ^{{\ast}}(0)) {}\\ \end{array}$$

where we have assumed that \(\alpha ^{{\ast}}(\cdot )\) is continuous at 0. Then, we can conclude

$$\displaystyle{ \lambda v(x) - Dv(x) \cdot f(x,a^{{\ast}}) - g(x,a^{{\ast}}) = 0 }$$
(11)

where \(a^{{\ast}} =\alpha ^{{\ast}}(0)\). Similarly, using the equivalent form

$$\displaystyle\begin{array}{rcl} & & v(x) +\sup _{\alpha \in \mathcal{A}}\left \{-\int _{0}^{\tau }g(y(s),\alpha (s))e^{-\lambda s}ds\right. {}\\ & & \quad \left.-e^{-\lambda \tau }v(y(\tau ))\right \} = 0 {}\\ \end{array}$$

of the DPP, the inequality \(v(x) \leq \int _{0}^{\tau }g(y(s),\alpha (s))e^{-\lambda s}ds + e^{-\lambda \tau }v(y(\tau ))\) holds for every admissible control; repeating the limit argument above for any control \(\alpha \in \mathcal{A}\) continuous at 0 (in particular, for any constant control \(\alpha \equiv a\)) gives

$$\displaystyle\begin{array}{rcl} & & \lambda v(x) - Dv(x) \cdot f(x,a) - g(x,a) \\ & & \quad \leq 0,\quad \mbox{ for every }a \in A. {}\end{array}$$
(12)

Combining (11) and (12), we obtain the Hamilton–Jacobi–Bellman equation (or dynamic programming equation):

$$\displaystyle{ \lambda u(x) +\sup _{a\in A}\left \{-f(x,a) \cdot Du(x) - g(x,a)\right \} = 0, }$$
(13)

which characterizes the value function for the infinite horizon problem associated with minimizing (3). Note that, given x, the value of a achieving the supremum in (13) (assuming it exists) corresponds to the control \(a^{{\ast}} =\alpha ^{{\ast}}(0)\), and this makes it natural to interpret the argmax in (13) as the optimal feedback at x (see Bardi and Capuzzo Dolcetta (1997) for more details).
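
A standard way to exploit the DPP (5) and Eq. (13) numerically, mentioned again in the final section, is the so-called value iteration: discretize the DPP over a small time step h and a state grid and iterate the resulting fixed-point map, which is a contraction with factor \(e^{-\lambda h}\). The 1D dynamics, running cost, control set, and discretization parameters in the sketch below are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Illustrative 1D problem: dynamics f(x, a) = a, running cost g(x, a) = x^2 + a^2,
# controls A = [-1, 1] (discretized), discount lam = 1.  All choices are examples.
lam, h = 1.0, 0.1
xs = np.linspace(-2.0, 2.0, 201)            # state grid
controls = np.linspace(-1.0, 1.0, 21)       # discretized control set A
f = lambda x, a: a
g = lambda x, a: x**2 + a**2

v = np.zeros_like(xs)                       # initial guess for the value function
for _ in range(500):
    # Discrete DPP: v(x) = min_a { h g(x,a) + e^{-lam h} v(x + h f(x,a)) },
    # with v evaluated off-grid by linear interpolation (clamped at the boundary).
    candidates = np.empty((controls.size, xs.size))
    for i, a in enumerate(controls):
        x_next = np.clip(xs + h * f(xs, a), xs[0], xs[-1])
        candidates[i] = h * g(xs, a) + np.exp(-lam * h) * np.interp(x_next, xs, v)
    v_new = candidates.min(axis=0)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

# Approximate feedback law: the minimizing control at each grid point
# (the analogue of the argmax interpretation of (13)).
feedback = controls[candidates.argmin(axis=0)]
print(v[xs.size // 2], feedback[xs.size // 2])   # approximate value and feedback at x = 0
```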

In short, (13) can be written as

$$\displaystyle{ H(x,u,Du) = 0 }$$

with \(x \in \mathbb{R}^{d}\), and

$$\displaystyle{ H(x,u,p) =\lambda u(x) +\sup _{a\in A}\left \{-f(x,a) \cdot p - g(x,a)\right \}. }$$
(14)

Note that H(x, u, ⋅ ) is convex (being the supremum of a family of affine functions of p) and that H(x, ⋅ , p) is monotone increasing (since λ > 0). It is also easy to see that, in general, the solution u is not differentiable even when f and g are smooth functions (i.e., \(f,g \in C^{\infty }(\mathbb{R}^{d}\times A)\)), so we need to deal with weak solutions of the Bellman equation. This can be done in the framework of viscosity solutions, a theory initiated by Crandall and Lions in the 1980s which has been successfully applied in many areas such as optimal control, fluid dynamics, and image processing (see the books Barles (1994) and Bardi and Capuzzo Dolcetta (1997) for an extended introduction and numerous applications to optimal control). Typically, viscosity solutions are Lipschitz continuous and hence differentiable almost everywhere.

An Extension to the Minimum Time Problem

In the minimum time problem, we want to minimize the time of arrival of the state on a given target set \(\mathcal{T}\). We will assume that \(\mathcal{T} \subset \mathbb{R}^{d}\) is a closed set. Then our cost functional will be given by

$$\displaystyle{ J(x,\alpha ) = t_{x}(\alpha ) }$$

where

$$\displaystyle{ t_{x}(\alpha ) := \left \{\begin{array}{@{}l@{\quad }l@{}} \min \{t \geq 0 : y_{x}(t,\alpha ) \in \mathcal{T}\}\quad &\text{if}\ y_{x}(t,\alpha ) \in \mathcal{T}\ \text{for some}\ t \geq 0, \\ +\infty \quad &\text{otherwise.} \end{array} \right. }$$

The corresponding value function is called the minimum time function

$$\displaystyle{ T(x) :=\inf _{\alpha (\cdot )\in \mathcal{A}}t_{x}(\alpha (\cdot )). }$$
(15)

The main difference with respect to the previous problem is that now the value function T will be finite valued only on a subset \(\mathcal{R}\) which depends on the target, on the dynamics, and on the set of admissible controls.

Definition 1

The reachable set \(\mathcal{R}\) is defined by

$$\displaystyle{ \mathcal{R} := \cup _{t>0}\mathcal{R}(t) =\{ x \in \mathbb{R}^{d} : T(x) < +\infty \} }$$

where, for t > 0, \(\mathcal{R}(t) :=\{ x \in \mathbb{R}^{d} : T(x) < t\}\).

The meaning is clear: \(\mathcal{R}\) is the set of initial points which can be driven to the target in finite time. The system is said to be controllable on \(\mathcal{T}\) if, for all t > 0, \(\mathcal{T} \subset \mbox{ int}(\mathcal{R}(t))\) (here, int(D) denotes the interior of the set D). Assuming controllability in a neighborhood of the target, one gets the continuity of the minimum time function; moreover, under the assumptions made on f, A, and \(\mathcal{T}\), one can prove some interesting properties:

  1. (i)

    \(\mathcal{R}\) is open.

  2. (ii)

    T is continuous on \(\mathcal{R}\).

  3. (iii)

    \(\lim \limits _{x\rightarrow x_{0}}T(x) = +\infty\), for any \(x_{0} \in \partial \mathcal{R}\).

Now let us denote by \(\mathcal{X}_{D}\) the characteristic function of the set D. Using in \(\mathcal{R}\) arguments similar to those in the proof of the DPP in the previous section, one can obtain the following DPP:

Proposition 2 (DPP for the minimum time problem)

For any \(x \in \mathcal{R}\) , the value function satisfies

$$\displaystyle\begin{array}{rcl} T(x)& =& \inf _{\alpha \in \mathcal{A}}\{t \wedge t_{x}(\alpha ) + \mathcal{X}_{\{t\leq t_{x}(\alpha )\}}T(y_{x}(t,\alpha ))\} \\ & & \quad \mbox{ for any }t \geq 0 {}\end{array}$$
(16)

and

$$\displaystyle\begin{array}{rcl} T(x)& =& \inf _{\alpha \in \mathcal{A}}\{t + T(y_{x}(t,\alpha ))\} \\ & & \quad \mbox{ for any $t \in [0,T(x)]$}{}\end{array}$$
(17)

From the previous DPP, one can also obtain the following characterization of the minimum time function.

Proposition 3

Let \(\mathcal{R}\setminus \mathcal{T}\) be open and assume \(T \in C(\mathcal{R}\setminus \mathcal{T} )\). Then T is a viscosity solution of

$$\displaystyle{ \max _{a\in A}\{ - f(x,a) \cdot \nabla T(x)\} = 1\qquad x \in \mathcal{R}\setminus \mathcal{T} }$$
(18)

coupled with the natural boundary conditions

$$\displaystyle{ \left \{\begin{array}{@{}l@{\quad }l@{}} T(x) = 0 \quad &x \in \partial \mathcal{T}\\ \lim \limits _{ x\rightarrow \partial \mathcal{R}}T(x) = +\infty \quad \end{array} \right. }$$

By the change of variable \(v(x) = 1 - e^{-T(x)}\), one obtains a simpler problem which gets rid of the boundary condition on \(\partial \mathcal{R}\) (which is unknown). The new function v is the unique viscosity solution of an exterior Dirichlet problem (see Bardi and Capuzzo Dolcetta (1997) for more details), and the reachable set can be recovered a posteriori via the relation \(\mathcal{R} =\{ x \in \mathbb{R}^{d} : v(x) < 1\}\).
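
As an illustration of this change of variable, the following sketch iterates a discrete-time analogue of the DPP directly for the transformed function v (with v = 0 enforced on the target) and recovers T and \(\mathcal{R}\) a posteriori. The 1D dynamics, target, and discretization are assumptions made only for this example.

```python
import numpy as np

# Illustrative 1D minimum time problem: dynamics f(x, a) = a with A = [-1, 1],
# target T = [-0.1, 0.1].  All modeling choices are examples.
h = 0.05
xs = np.linspace(-2.0, 2.0, 401)
controls = np.linspace(-1.0, 1.0, 21)
on_target = np.abs(xs) <= 0.1

v = np.ones_like(xs)          # transformed value v = 1 - e^{-T}, start from the upper bound 1
v[on_target] = 0.0
for _ in range(2000):
    # Discrete DPP for the transformed problem:
    # v(x) = min_a { 1 - e^{-h} + e^{-h} v(x + h a) } outside the target, v = 0 on it.
    candidates = np.empty((controls.size, xs.size))
    for i, a in enumerate(controls):
        x_next = np.clip(xs + h * a, xs[0], xs[-1])
        candidates[i] = (1.0 - np.exp(-h)) + np.exp(-h) * np.interp(x_next, xs, v)
    v_new = np.where(on_target, 0.0, candidates.min(axis=0))
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

T_min = -np.log(np.maximum(1.0 - v, 1e-16))   # recover the minimum time T = -log(1 - v)
reachable = v < 1.0                            # recover R = {v < 1}
print(T_min[-1], reachable.all())  # minimum time from x = 2 (about 1.9); whole grid reachable here
```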

Further Extensions and Related Topics

The DPP has been extended from deterministic control problems to many other settings. In the framework of stochastic control problems, where the dynamics are given by a diffusion, the characterization of the value function obtained via the DPP leads to a second-order Hamilton–Jacobi–Bellman equation (Fleming and Soner 1993; Kushner and Dupuis 2001). Another interesting extension has been made to differential games, where the DPP is based on the delicate notion of nonanticipative strategies for the players and leads to a nonconvex nonlinear partial differential equation (the Isaacs equation) (Bardi and Capuzzo Dolcetta 1997). For a short introduction to numerical methods based on DP and exploiting the so-called “value iteration,” we refer the interested reader to Appendix A in Bardi and Capuzzo Dolcetta (1997) and to Kushner and Dupuis (2001) (see also the book by Howard (1960) for “policy iteration”).
