1 Introduction

Deep learning [5, 29, 41] has become a primary tool in many modern machine learning tasks, such as image classification and segmentation. Consequently, there is a pressing need to provide a solid mathematical framework to analyze various aspects of deep neural networks. The recent line of work on linking dynamical systems, optimal control and deep learning has suggested such a candidate [15,16,17, 31, 43,44,45, 48, 55, 60]. In this view, ResNet [32] can be regarded as a time discretization of a continuous-time dynamical system. Learning (usually in the empirical risk minimization form) is then recast as an optimal control problem, from which novel algorithms [43, 44] and network structures [15, 16, 31, 48] can be designed. An attractive feature of this approach is that the compositional structure, which is widely considered the essence of deep neural networks, is explicitly taken into account in the time evolution of the dynamical systems.

While most prior works on the dynamical systems viewpoint of deep learning have focused on algorithms and network structures, this paper aims to study the fundamental mathematical aspects of the formulation. Indeed, we show that the most general formulation of the population risk minimization problem can be regarded as a mean-field optimal control problem, in the sense that the optimal control parameters (or equivalently, the trainable weights) depend on the population distribution of input–target pairs. Our task is then to analyze the mathematical properties of this mean-field control problem. Mirroring the development of classical optimal control, we will proceed in two parallel, but interconnected ways, namely the dynamic programming formalism and the maximum principle formalism.

The paper is organized as follows. We discuss related work in Sect. 2 and introduce the basic mean-field optimal control formulation of deep learning in Sect. 3. In Sect. 4, following the classical dynamic programming approach [4], we introduce and study the properties of a value function for the mean-field control problem, whose state space is an appropriate Wasserstein space of probability measures. By defining an appropriate notion of derivative with respect to probability measures, we show that the value function is related to solutions of an infinite-dimensional Hamilton–Jacobi–Bellman (HJB) partial differential equation. Using the concept of viscosity solutions [19], we show in Sect. 5 that the HJB equation admits a unique viscosity solution and completely characterize the optimal loss function and the optimal control policy of the mean-field control problem. This establishes a concrete link between the learning problem, viewed as a variational problem, and the Hamilton–Jacobi–Bellman equation associated with that variational problem. It should be noted that the essential ideas in the proofs of Sects. 4 and 5 are not new, but we present a simplified treatment for this particular setting.

Next, in Sect. 6, we develop the more local theory based on Pontryagin’s maximum principle (PMP) [54]. We state and prove a mean-field version of the classical PMP that provides necessary conditions for optimal controls. Further, we study situations in which the mean-field PMP admits a unique solution, which then implies that it is also sufficient for optimality, provided that an optimal solution exists. We will see in Sect. 7 that, compared with the HJB approach, this additionally requires the time horizon of the learning problem to be sufficiently small. Finally, in Sect. 8 we study the relationship between the population risk minimization problem (cast as a mean-field control problem and characterized by a mean-field PMP) and its empirical risk minimization counterpart (cast as a classical control problem and characterized by a classical, sampled PMP). We prove that, under appropriate conditions, for every stable solution of the mean-field PMP there exist, with high probability, close-by solutions of the sampled PMP, and the latter converge in probability to the former, with explicit error estimates on both the distance between the solutions and the distance between their loss function values. This provides a type of a priori error estimate that has implications for the generalization ability of neural networks, an important and active area of machine learning research.

Note that it is not the purpose of this paper to prove the sharpest estimates under the most general conditions; thus, we have adopted the most convenient but reasonable assumptions, and the results presented could be sharpened with more technical work. In each of Sects. 4 to 8, we first present the mathematical results and then discuss their implications for deep learning. Furthermore, in this work we focus our analysis on the continuous idealization of deep residual networks, but we believe that much of the analysis presented also carries over to the discrete domain (i.e., discrete layers).

2 Related work

The connection between back-propagation and optimal control of dynamical systems has been known since early works on control and deep learning [3, 10, 40]. Recently, the dynamical systems approach to deep learning was proposed in [60] and explored in the direction of training algorithms based on the PMP and the method of successive approximations [43, 44]. In another vein, there are also studies on the continuum limit of neural networks [45, 55] and on designing network architectures for deep learning [15, 16, 31, 48] based on dynamical systems and differential equations. Rather than analyzing algorithms or architectures, the present paper focuses on the mathematical aspects of the control formulation itself and develops a mean-field theory that characterizes the optimality conditions and value functions using both PDE (HJB) and ODE (PMP) approaches. The overarching goal is to develop the mathematical foundations of the optimal control formulation of deep learning.

In the control theory literature, mean-field optimal control is an active area of research. Many works on mean-field games [6, 30, 33, 38], the control of McKean–Vlasov systems [39, 51, 52] and the control of Cucker–Smale systems [8, 12, 25] focus on deriving the limiting partial differential equations that characterize the optimal control as the number of agents goes to infinity. This is akin to the theory of the propagation of chaos [58]. Meanwhile, there are also works discussing the stochastic maximum principle for stochastic differential equations of mean-field type [1, 11, 14]. The present paper differs from all previous works in two aspects. First, in the context of continuous-time deep learning, the problem differs from these previous control formulations as the source of randomness is coupled input–target pairs (the latter determines the terminal loss function, which can now be regarded as a random function). On the other hand, a simplifying feature in our case is that the dynamics, given the input–target pair, are otherwise deterministic. Second, the dynamics of each random realization are independent of the distribution law of the population and are coupled only through the shared control parameters. This is to be contrasted with optimal control of McKean–Vlasov dynamics [14, 51, 52] or mean-field games [6, 30, 33, 38], where the population law directly enters the dynamical equations (and not just through the shared control). Thus, in this sense our dynamical equations are much simpler to analyze. Consequently, although some of our results can be deduced from more general mean-field analysis in the control literature, here we will present simplified derivations tailored to our setting. Note also that there are neural network structures (e.g., batch normalization) that can be considered to have explicit mean-field dynamics, and we defer this discussion to Sect. 9.

3 From ResNets to mean-field optimal control

Let us now present the optimal control formulation of deep learning as introduced in [43, 44, 60]. In the simplest form, the feed-forward propagation in a T-layer residual network can be represented by the difference equation:

$$\begin{aligned} x_{t+1} = x_{t} + f(x_{t}, \theta _{t}), \qquad t=0,\dots ,T-1, \end{aligned}$$
(1)

where \(x_0\) is the input (image, time series, etc.), \(x_T\) is the final output, and f is the transition dynamics between different layers. For instance, in the case of a simple feed-forward-type skip connection, f takes the form \(\sigma (\theta _tx_t)\), where \(\theta _t\) is the weight matrix, \(\sigma \) is the activation function, and the bias vector is neglected. The final output is then compared with some target \(y_0\) corresponding to \(x_0\) via some loss function. The goal of learning is to tune the trainable parameters \(\theta _0,\dots ,\theta _{T-1}\) so that \(x_T\) is close to \(y_0\). The only change in the continuous-time idealization of deep residual learning, which we will subsequently focus on, is that instead of the difference equation (1), the forward dynamics are now a differential equation. In fact, there is empirical evidence that the contribution of the residual part is small in ResNets, which heuristically justifies our approximation in terms of differential equations. Jastrzebski et al. [35] report that the ratio of the \(l^2\) norm of a residual block’s output to that of its input is small in the original ResNet with 50 layers. Veit et al. [59] show that removing paths from residual networks by deleting some layers, or corrupting some paths by reordering layers, has only a modest and smooth impact on performance, which also suggests that the contribution of some residual parts is marginal. In this sense, our continuous dynamical system becomes a reasonable idealization of deep residual learning.
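To make this correspondence concrete, here is a minimal numerical sketch (all function and variable names are ours, for illustration only): a residual network with step size \(h\) in (1) is precisely the forward Euler discretization of \(\dot{x}_t = f(x_t,\theta_t)\), and increasing the number of layers shrinks the discretization error at the expected first-order rate.

```python
import numpy as np

def resnet_forward(x0, theta, f, h):
    """Forward pass of a 'ResNet': x_{t+1} = x_t + h * f(x_t, theta_t).
    This is exactly the forward Euler discretization of dx/dt = f(x, theta)."""
    x = x0
    for th in theta:
        x = x + h * f(x, th)
    return x

# Toy dynamics: f(x, theta) = theta * x, with constant theta = -1.
# The continuous-time solution is x(T) = x0 * exp(-T).
f = lambda x, th: th * x
x0, T = 1.0, 1.0

errors = []
for layers in (10, 100, 1000):
    h = T / layers
    xT = resnet_forward(x0, [-1.0] * layers, f, h)
    errors.append(abs(xT - x0 * np.exp(-T)))

# The error shrinks roughly linearly in h, as expected for forward Euler.
print(errors)
```

For this linear toy dynamics the error decays like \(O(h)\), consistent with viewing a deep ResNet as an approximation of a continuous flow.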

Now, we introduce our formulation more precisely. Let \((\Omega ,\mathcal {F},\mathbb {P})\) be a fixed and sufficiently rich probability space so that all subsequently required random variables can be constructed. Suppose \(x_0\in \mathbb {R}^d\) and \(y_0\in \mathbb {R}^l\) are random variables jointly distributed according to \(\mu _0 :=\mathbb {P}_{(x_0,y_0)}\) (hereafter, for each random variable X we denote its distribution or law by \(\mathbb {P}_X\)). This represents the distribution of the input–target pairs, which we assume can be embedded in Euclidean spaces. Consider a set of admissible controls or training weights \(\Theta \subseteq \mathbb {R}^m\). In typical deep learning, \(\Theta \) is taken as the whole space \(\mathbb {R}^m\), but here we consider the more general case where \(\Theta \) can be constrained. Fix \(T>0\) (network “depth”) and let f (feed-forward dynamics), \(\Phi \) (terminal loss function) and L (regularizer) be functions

$$\begin{aligned} f: \mathbb {R}^d \times \Theta \rightarrow \mathbb {R}^d, \quad \Phi : \mathbb {R}^d \times \mathbb {R}^l \rightarrow \mathbb {R}, \quad L: \mathbb {R}^d \times \Theta \rightarrow \mathbb {R}. \end{aligned}$$

We define the state dynamics as the ordinary differential equation (ODE)

$$\begin{aligned} \dot{x}_t&= f(x_t, \theta _t) \end{aligned}$$
(2)

with initial condition equal to the random variable \(x_0\). Thus, this is a stochastic ODE whose only source of randomness is the initial condition. Consider the set of essentially bounded measurable controls \(L^{\infty }([0,T],\Theta )\). To improve clarity, we will reserve bold-faced letters for path-space quantities. For example, \({\varvec{\theta }}\equiv \{ \theta _t: 0\le t\le T \}\). In contrast, variables/functions taking values in finite-dimensional Euclidean spaces are not bold-faced.

The population risk minimization problem in deep learning can hence be posed as the following mean-field optimal control problem

$$\begin{aligned} \begin{aligned} \inf _{{\varvec{\theta }}\in L^{\infty }([0,T],\Theta )} J({\varvec{\theta }})&:=\mathbb {E}_{\mu _0} \left[ \Phi (x_T, y_0) + \int _{0}^{T} L(x_t, \theta _t) \mathrm{d}t \right] ,\\&\text {Subject to~(2)}. \end{aligned} \end{aligned}$$
(3)

The term “mean-field” highlights the fact that \({\varvec{\theta }}\) is shared by a whole population of input–target pairs, and the optimal control must depend on the law of the input–target random variables. Strictly speaking, the law of \({\varvec{x}}\) does not enter the forward equations explicitly (unlike e.g., McKean–Vlasov control [14]), and hence our forward dynamics are not explicitly in mean-field form. Nevertheless, we will use the term “mean-field” to emphasize the dependence of the control on the population distribution.

In contrast, if we were to perform empirical risk minimization, as is often the case in practice (and is the case analyzed by previous work on algorithms [43, 44]), we would first draw i.i.d. samples \(\{x_0^i,y_0^i\}_{i=1}^{N}\sim \mu _0\) and pose the sampled optimal control problem

$$\begin{aligned} \begin{aligned} \inf _{{\varvec{\theta }}\in L^{\infty }([0,T],\Theta )} J_N({\varvec{\theta }})&:=\frac{1}{N} \sum _{i=1}^N \left[ \Phi \left( x^{i}_T, y^i_0\right) + \int _{0}^{T} L\left( x^i_t, \theta _t\right) \mathrm{d}t \right] ,\\&\text {Subject to} \qquad \dot{x}^{i}_t = f(x^{i}_t, \theta _t), \qquad i=1,\dots ,N. \end{aligned} \end{aligned}$$
(4)

Thus, the solutions of sampled optimal control problems are typically random variables. We will mostly focus our analysis on the mean-field problem (3) and only later in Sect. 8 relate it to the sampled problem (4).
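To illustrate how the sampled objective (4) can be evaluated numerically, the following sketch (our own toy setup; the choices of \(f\), \(\Phi\), L, and all names are illustrative, not a prescribed implementation) propagates N sampled input–target pairs by forward Euler and averages the terminal and running costs:

```python
import numpy as np

rng = np.random.default_rng(0)

def J_N(theta, x0, y0, f, Phi, L, T):
    """Evaluate the sampled objective (4): propagate each sample x0^i
    through dx/dt = f(x, theta_t) by forward Euler, then average the
    terminal loss Phi plus the running regularizer L (left Riemann sum)."""
    h = T / len(theta)
    x = x0.copy()
    run_cost = 0.0
    for th in theta:
        run_cost += h * np.mean(L(x, th))
        x = x + h * f(x, th)
    return np.mean(Phi(x, y0)) + run_cost

# Toy regression task: scalar state, f(x, th) = th * x,
# Phi(x, y) = (x - y)^2, L(x, th) = 0.1 * th^2.
N, T, steps = 500, 1.0, 50
x0 = rng.normal(size=N)
y0 = 2.0 * x0                       # targets: scale inputs by 2
f   = lambda x, th: th * x
Phi = lambda x, y: (x - y) ** 2
L   = lambda x, th: 0.1 * th ** 2

# A constant control theta_t = log(2) makes x_T close to 2 * x_0, so the
# terminal loss is near zero and the objective is close to the
# regularization cost T * 0.1 * log(2)^2.
theta = [np.log(2.0)] * steps
print(J_N(theta, x0, y0, f, Phi, L, T))
```

The same routine with the population measure replaced by larger and larger sample sizes is the Monte Carlo picture behind relating (4) to (3) in Sect. 8.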

3.1 Additional notation

Throughout this paper, we always use w to denote the concatenated \((d+l)\)-dimensional variable (x, y), where \(x\in \mathbb {R}^d\) and \(y\in \mathbb {R}^l\). Correspondingly, \(\bar{f}(w,\theta ):=(f(x,\theta ),0)\) is the extended \((d+l)\)-dimensional feed-forward function, \(\bar{L}(w,\theta ):=L(x,\theta )\) is the extended \((d+l)\)-dimensional regularization loss, and \(\bar{\Phi }(w):=\Phi (x,y)\) still denotes the terminal loss function. We denote by \(x\cdot y\) the inner product of two Euclidean vectors x and y of the same dimension. The Euclidean norm is denoted by \(\Vert \cdot \Vert \) and the absolute value by \(\vert \cdot \vert \). Gradient operators on Euclidean spaces are denoted by \(\nabla \) with subscripts indicating the variable with respect to which the derivative is taken. In contrast, we use D to represent the Fréchet derivative on Banach spaces. Namely, if \(x\in U\) and \(F:U\rightarrow V\) is a mapping between two Banach spaces \((U,\Vert \cdot \Vert _U)\) and \((V,\Vert \cdot \Vert _V)\), then DF(x) is defined as the linear operator \(DF(x):U \rightarrow V\) such that

$$\begin{aligned} r(x,y) :=\frac{\Vert F(x + y) - F(x) - DF(x)y \Vert _V}{\Vert y \Vert _U} \rightarrow 0, \quad \text {as } \Vert y \Vert _U \rightarrow 0. \end{aligned}$$
(5)

For a matrix A, we use the symbol \(A \preceq 0\) to mean that A is negative semi-definite.
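In finite dimensions, the Fréchet derivative reduces to the ordinary Jacobian, and the defining ratio in (5) can be checked numerically. A small sketch with a quadratic map (our own example, chosen so the remainder is exactly quadratic in \(\Vert y\Vert\)):

```python
import numpy as np

def F(x):
    # A smooth map F: R^2 -> R^2.
    return np.array([x[0] ** 2, x[0] * x[1]])

def DF(x):
    # Its Jacobian, which is the Frechet derivative in finite dimensions.
    return np.array([[2 * x[0], 0.0],
                     [x[1], x[0]]])

x = np.array([1.0, 2.0])
ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    y = eps * np.array([1.0, -1.0])
    r = np.linalg.norm(F(x + y) - F(x) - DF(x) @ y) / np.linalg.norm(y)
    ratios.append(r)

# The ratio r(x, y) from (5) decays linearly in ||y|| for this map.
print(ratios)
```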

Let the Banach space \(L^\infty ([0,T],E)\) be the set of essentially bounded measurable functions from [0, T] to E, where E is a subset of a Euclidean space with the usual Lebesgue measure. The norm is \(\Vert {\varvec{x}}\Vert _{L^\infty ([0,T],E)} = \mathop {{{\,\mathrm{ess\,sup}\,}}}\nolimits _{t\in [0,T]} \Vert x(t) \Vert \), and we shall write for brevity \(\Vert \cdot \Vert _{L^\infty }\) in place of \(\Vert \cdot \Vert _{L^\infty ([0,T],E)}\). In this paper, E is often either \(\Theta \) or \(\mathbb {R}^d\), and the path-space variables we consider in this paper, such as the controls \({\varvec{\theta }}\), will mostly be defined in this space.

As this paper introduces a mean-field optimal control approach, we also need some notation for the random variables and their distributions. We use the shorthand \(L^2(\Omega , \mathbb {R}^{d+l})\) for \(L^2((\Omega ,\mathcal {F},\mathbb {P}), \mathbb {R}^{d+l})\), the set of \(\mathbb {R}^{d+l}\)-valued square-integrable random variables. We equip this Hilbert space with the norm \(\Vert X\Vert _{L^2}:=(\mathbb {E}\Vert X \Vert ^2)^{1/2}\) for \(X\in L^2(\Omega ,\mathbb {R}^{d+l})\). We denote by \(\mathcal {P}_2(\mathbb {R}^{d+l})\) the set of square-integrable probability measures on the Euclidean space \(\mathbb {R}^{d+l}\). Note that \(X\in L^2(\Omega ,\mathbb {R}^{d+l})\) if and only if \(\mathbb {P}_X \in \mathcal {P}_2(\mathbb {R}^{d+l})\). The space \(\mathcal {P}_2(\mathbb {R}^{d+l})\) is regarded as a metric space equipped with the 2-Wasserstein distance

$$\begin{aligned} W_2(\mu ,\nu ):=\inf \Big \{&\Big (\int _{\mathbb {R}^{d+l}\times \mathbb {R}^{d+l}}\Vert w-z\Vert ^2\pi (\mathrm{d}w,\mathrm{d}z)\Big )^{1/2}\,\Big |\, \\&\pi \in \mathcal {P}_2(\mathbb {R}^{d+l}\times \mathbb {R}^{d+l})\text { with marginals } \mu \text { and } \nu \Big \}\\ =\inf \Big \{&\Vert X-Y\Vert _{L^2}\,\Big |\,X,Y\in L^2(\Omega ,\mathbb {R}^{d+l}) \text { with } \mathbb {P}_X=\mu ,\,\mathbb {P}_Y=\nu \Big \}. \end{aligned}$$

For \(\mu \in \mathcal {P}_2(\mathbb {R}^{d+l})\), we also define \(\Vert \mu \Vert _{L^2}:=(\int _{\mathbb {R}^{d+l}}\Vert w\Vert ^2\mu (\mathrm{d}w))^{1/2}\).
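On the real line, the 2-Wasserstein distance between two empirical measures with equally many atoms admits a closed form: the optimal coupling matches samples in sorted order (the quantile coupling). A small sketch, assuming one-dimensional data (names are ours):

```python
import numpy as np

def w2_1d(xs, ys):
    """W_2 between two empirical measures on R with the same number of
    atoms: the optimal coupling pairs the sorted samples."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.sqrt(np.mean((xs - ys) ** 2))

# Translating a distribution by c moves it exactly c in W_2.
rng = np.random.default_rng(1)
a = rng.normal(size=1000)
print(w2_1d(a, a + 3.0))   # prints a value equal to 3.0 up to rounding
```

In higher dimensions no such sorting trick exists and one must solve an optimal transport problem, which is why the random-variable characterization of \(W_2\) above is the workhorse in the proofs that follow.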

Given a measurable function \(\psi : \mathbb {R}^{d+l}\rightarrow \mathbb {R}^q\) that is square integrable with respect to \(\mu \), we use the notation

$$\begin{aligned} \langle \psi (.),\,\mu \rangle :=\int _{\mathbb {R}^{d+l}}\psi (w)\mu (\mathrm{d}w). \end{aligned}$$

Now, we introduce some notation for the dynamical evolution of probabilities. Given \(\xi \in L^2(\Omega , \mathbb {R}^{d+l})\) and a control process \({\varvec{\theta }}\in L^\infty ([0,T],\Theta )\), we consider the following dynamical system for \(t\le s\le T\):

$$\begin{aligned} W_s^{t,\xi ,{\varvec{\theta }}}=\xi + \int _{t}^s \bar{f}\left( W_r^{t,\xi ,{\varvec{\theta }}},\theta _r\right) \,\mathrm{d}r. \end{aligned}$$

Note that \(W_s^{t,\xi ,{\varvec{\theta }}}\) is always square integrable, given that \(\bar{f}(w,\theta )\) is Lipschitz continuous with respect to w. Letting \(\mu = \mathbb {P}_{\xi }\in \mathcal {P}_2(\mathbb {R}^{d+l})\), we denote the law of \(W_s^{t,\xi ,{\varvec{\theta }}}\) for simplicity by

$$\begin{aligned} \mathbb {P}_s^{t,\mu ,{\varvec{\theta }}}:=\mathbb {P}_{W_s^{t,\xi ,{\varvec{\theta }}}}. \end{aligned}$$

This is valid since the law of \(W_s^{t,\xi ,{\varvec{\theta }}}\) depends only on the law of \(\xi \) and not on the random variable itself. This notation also allows us to write the flow, or semigroup, property of the dynamical system as

$$\begin{aligned} \mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}} = \mathbb {P}_s^{\hat{t},\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}},{\varvec{\theta }}}, \end{aligned}$$
(6)

for all \(0\le t\le \hat{t} \le s \le T,\,\mu \in \mathcal {P}_2(\mathbb {R}^{d+l}),\,{\varvec{\theta }}\in L^\infty ([0,T],\Theta )\).
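The flow property (6) can be sanity-checked on empirical measures: pushing samples from t to s in one pass agrees with stopping at an intermediate \(\hat t\) and restarting from the intermediate law, since the forward maps compose. In the sketch below (our own discretization; forward Euler on a shared step grid, so the composition is exact):

```python
import numpy as np

def propagate(w, theta, t_idx, s_idx, f, h):
    """Push samples w through dw/dt = f(w, theta_t) from step t_idx to
    s_idx by forward Euler -- a stand-in for the flow defining
    P_s^{t,mu,theta} on the empirical measure of w."""
    for k in range(t_idx, s_idx):
        w = w + h * f(w, theta[k])
    return w

rng = np.random.default_rng(2)
w0 = rng.normal(size=200)           # samples from the initial measure mu
steps, h = 40, 1.0 / 40
theta = rng.normal(size=steps)
f = lambda w, th: np.tanh(th * w)   # bounded and Lipschitz, as in (A1)

# One pass 0 -> 40 versus stopping at step 25 and restarting: the flow
# composes, so both yield the same pushforward of mu.
direct = propagate(w0, theta, 0, 40, f, h)
staged = propagate(propagate(w0, theta, 0, 25, f, h), theta, 25, 40, f, h)
print(np.max(np.abs(direct - staged)))   # identical float operations: prints 0.0
```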

Finally, throughout the results and proofs, we will use K or C with subscripts as names for generic constants, whose values may change from line to line when there is no need for them to be distinct. In general, these constants may implicitly depend on T and the ambient dimensions \(d, m\); for brevity, we omit this dependence in the rest of the paper.

4 Mean-field dynamic programming principle and HJB equation

We begin our analysis of (3) by formulating the dynamic programming principle and the Hamilton–Jacobi–Bellman equation. In this approach, the key idea is to define a value function that corresponds to the optimal loss of the control problem (3), but under a general starting time and starting state. One can then derive a partial differential equation (Hamilton–Jacobi–Bellman equation, or HJB equation) to be satisfied by such a value function, which characterizes both the optimal loss function value and the optimal control policy of the original control problem. Compared to the classical optimal control case corresponding to empirical risk minimization in learning, here the value function’s state argument is no longer a finite-dimensional vector, but an infinite-dimensional object corresponding to the joint distribution of the input–target pair. We shall interpret it as an element of a suitable Wasserstein space. The detailed mathematical definition of this value function and its basic properties are discussed in Sect. 4.1.

In the finite-dimensional case, the HJB equation is a classical partial differential equation. In contrast, since the state variables we are dealing with are probability measures rather than Euclidean vectors, we need a concept of derivative with respect to a probability measure, as introduced by Lions in his course at Collège de France [47]. We give a brief introduction of this concept in Sect. 4.2 and refer readers to the lecture notes [13] for more details. We then present the resulting infinite-dimensional HJB equation in Sect. 4.3.

Throughout this section and next section (Sect. 5), we assume

  1. (A1)

    \(f,L,\Phi \) are bounded; \(f,L,\Phi \) are Lipschitz continuous with respect to x, and the Lipschitz constants of f and L are independent of \(\theta \).

  2. (A2)

    \(\mu _0\in \mathcal {P}_2(\mathbb {R}^{d+l})\).

4.1 Value function and its properties

Adopting the viewpoint of taking probability measures \(\mu \in \mathcal {P}_2(\mathbb {R}^{d+l})\) as state variables, we can define a time-dependent objective functional

$$\begin{aligned} J(t,\mu ,{\varvec{\theta }})&:=~ \mathbb {E}_{(x_t,y_0)\sim \mu } \left[ \Phi (x_T, y_0)+ \int _{t}^{T} L(x_s, \theta _s) \mathrm{d}s \right] \text { (subject to~(2))} \nonumber \\&=~\langle \bar{\Phi }(.), \,\mathbb {P}_{T}^{t,\mu ,{\varvec{\theta }}} \rangle + \int _{t}^T \left\langle \bar{L}(.,\theta _s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}} \right\rangle \,\mathrm{d}s. \end{aligned}$$
(7)

The second line above is simply a rewriting of the first, based on the notation introduced earlier. Here, we abuse notation and reuse J from (3) for the new objective functional, which now has the additional arguments \(t,\mu \). Of course, \(J({\varvec{\theta }})\) in (3) corresponds to \(J(0,\mu _0,{\varvec{\theta }})\) in (7).

The value function \(v^*(t,\mu )\) is defined as a real-valued function on \([0,T]\times \mathcal {P}_2(\mathbb {R}^{d+l})\) through

$$\begin{aligned} v^*(t,\mu ) = \inf _{{\varvec{\theta }}\in L^\infty ([0,T],\Theta )} J(t,\mu ,{\varvec{\theta }}). \end{aligned}$$
(8)

If we assume \({\varvec{\theta }}^*\) attains the infimum in (3), then by definition

$$\begin{aligned} J({\varvec{\theta }}^*)=v^*(0,\mu _0). \end{aligned}$$

The following proposition shows the continuity of the value function.

Proposition 1

The function \((t,\mu )\mapsto J(t,\mu ,{\varvec{\theta }})\) is Lipschitz continuous on \([0,T]\times \mathcal {P}_2(\mathbb {R}^{d+l})\), uniformly with respect to \({\varvec{\theta }}\in L^\infty ([0,T],\Theta )\), and the value function \(v^*(t,\mu )\) is Lipschitz continuous on \([0,T]\times \mathcal {P}_2(\mathbb {R}^{d+l})\).

Proof

We first establish some elementary estimates based on the assumptions. Since \(\bar{L}\) is bounded by (A1), there exists a constant C such that

$$\begin{aligned} \langle \bar{L}(.,\theta ),\,\mu \rangle \le C. \end{aligned}$$
(9)

Let \(X,Y\in L^2(\Omega ,\mathbb {R}^{d+l})\) be such that \(\mathbb {P}_X=\mu \) and \(\mathbb {P}_Y=\hat{\mu }\). The Lipschitz continuity of \(\bar{L}\) gives us

$$\begin{aligned} |\langle \bar{L}(.,\theta ),\,\mu \rangle - \langle \bar{L}(.,\theta ),\,\hat{\mu }\rangle | = |\mathbb {E}[\bar{L}(X,\theta )-\bar{L}(Y,\theta )]| \le K_L \Vert X-Y\Vert _{L^2}. \end{aligned}$$

Note that in the preceding inequality the left-hand side does not depend on the choice of X, Y, while the right-hand side does. Hence, we can take the infimum over all joint choices of X, Y to get

$$\begin{aligned} |\langle \bar{L}(.,\theta ),\,\mu \rangle - \langle \bar{L}(.,\theta ),\,\hat{\mu }\rangle | \,&\le K_L\times \inf \Big \{\Vert X-Y\Vert _{L^2}\,\Big |\,X,Y\in L^2(\Omega ,\mathbb {R}^{d+l}) \text { with } \mathbb {P}_X=\mu ,\,\mathbb {P}_Y=\hat{\mu } \Big \} \nonumber \\ \,&= K_L W_2(\mu ,\hat{\mu }). \end{aligned}$$
(10)

The same argument applied to \(\bar{\Phi }\) gives us

$$\begin{aligned} |\langle \bar{\Phi }(.),\,\mu \rangle - \langle \bar{\Phi }(.),\,\hat{\mu }\rangle | \le K_LW_2(\mu , \hat{\mu }). \end{aligned}$$
(11)

For the deterministic ODE

$$\begin{aligned} \frac{\mathrm{d}w_t^{{\varvec{\theta }}}}{\mathrm{d}t}=\bar{f}\left( w_t^{{\varvec{\theta }}},\theta _t\right) , \quad w_0^{{\varvec{\theta }}} = w_0, \end{aligned}$$

define the induced flow map as

$$\begin{aligned} h(t,w_0,{\varvec{\theta }}) :=w_t^{{\varvec{\theta }}}. \end{aligned}$$

Using Gronwall’s inequality with the boundedness and Lipschitz continuity of \(\bar{f}\), we know

$$\begin{aligned}&\Vert h(t,w,{\varvec{\theta }}) - h(t,\hat{w},{\varvec{\theta }})\Vert \le K_L\Vert w-\hat{w}\Vert , \\&\Vert h(t,w,{\varvec{\theta }}) - h(\hat{t},w,{\varvec{\theta }})\Vert \le K_L|t-\hat{t}|. \end{aligned}$$

Therefore, we use the definition of Wasserstein distance to obtain

$$\begin{aligned} W_2\left( \mathbb {P}_s^{t,\mu , {\varvec{\theta }}}, \mathbb {P}_s^{t,\hat{\mu }, {\varvec{\theta }}}\right)&= \inf \Big \{\Vert X-Y\Vert _{L^2}\,\Big |\,X,Y\in L^2(\Omega ,\mathbb {R}^{d+l}) \text { with } \mathbb {P}_X=\mathbb {P}_s^{t,\mu , {\varvec{\theta }}},\,\mathbb {P}_Y=\mathbb {P}_s^{t,\hat{\mu }, {\varvec{\theta }}} \Big \} \nonumber \\&= \inf \Big \{\Vert h(s-t,X,{\varvec{\theta }})-h(s-t,Y,{\varvec{\theta }})\Vert _{L^2}\,\Big |\, X,Y\in L^2(\Omega ,\mathbb {R}^{d+l}) \text { with } \mathbb {P}_X=\mu ,\,\mathbb {P}_Y=\hat{\mu } \Big \} \nonumber \\&\le \, \inf \Big \{K_L\Vert X-Y\Vert _{L^2}\,\Big |\,X,Y\in L^2(\Omega ,\mathbb {R}^{d+l}) \text { with } \mathbb {P}_X=\mu ,\,\mathbb {P}_Y=\hat{\mu } \Big \} \nonumber \\&=\, K_LW_2(\mu ,\hat{\mu }) \end{aligned}$$
(12)

and similarly

$$\begin{aligned} W_2\left( \mathbb {P}_s^{t,\mu , {\varvec{\theta }}}, \mu \right)&\le K_L |s-t|. \end{aligned}$$
(13)

The flow property (6) and estimates (12), (13) together give us

$$\begin{aligned} W_2\left( \mathbb {P}_s^{t,\mu ,{\varvec{\theta }}},\mathbb {P}_s^{\hat{t},\hat{\mu },{\varvec{\theta }}}\right)&= W_2\left( \mathbb {P}_s^{\hat{t},\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}},{\varvec{\theta }}},\mathbb {P}_s^{\hat{t},\hat{\mu },{\varvec{\theta }}}\right) \nonumber \\&\le K_LW_2\left( \mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}},\hat{\mu }\right) \nonumber \\&\le K_L\left( |t-\hat{t}| + W_2(\mu ,\hat{\mu })\right) . \end{aligned}$$
(14)

Now, for all \(0\le t\le \hat{t}\le T\), \(\mu ,\hat{\mu }\in \mathcal {P}_2(\mathbb {R}^{d+l})\), \({\varvec{\theta }}\in L^\infty ([0,T],\Theta )\), we employ (9), (10), (11) and (14) to obtain

$$\begin{aligned} |J(t,\mu ,{\varvec{\theta }})-J(\hat{t},\hat{\mu },{\varvec{\theta }})|\,&\le \, \int _{t}^{\hat{t}} \left| \left\langle \bar{L}(.,\theta _s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}} \right\rangle \right| \,\mathrm{d}s + \int _{\hat{t}}^{T} \left| \left\langle \bar{L}(.,\theta _s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}}\right\rangle - \left\langle \bar{L}(.,\theta _s), \,\mathbb {P}_{s}^{\hat{t},\hat{\mu },{\varvec{\theta }}}\right\rangle \right| \,\mathrm{d}s \\&\quad \,+ \left| \left\langle \bar{\Phi }(.), \,\mathbb {P}_{T}^{t,\mu ,{\varvec{\theta }}} \right\rangle - \left\langle \bar{\Phi }(.), \,\mathbb {P}_{T}^{\hat{t},\hat{\mu },{\varvec{\theta }}} \right\rangle \right| \\&\le \, C|\hat{t}-t| + K_L\sup _{\hat{t}\le s\le T}W_2\left( \mathbb {P}_s^{t,\mu ,{\varvec{\theta }}},\mathbb {P}_s^{\hat{t},\hat{\mu },{\varvec{\theta }}}\right) \\&\le \, K_L(|t-\hat{t}| + W_2(\mu ,\hat{\mu })), \end{aligned}$$

which gives us the desired Lipschitz continuity property.

Finally, combining the fact that

$$\begin{aligned}&|v^*(t,\mu )-v^*(\hat{t},\hat{\mu })|\le \sup _{{\varvec{\theta }}\in L^\infty ([0,T],\Theta )}|J(t,\mu ,{\varvec{\theta }})-J(\hat{t},\hat{\mu },{\varvec{\theta }})|, \\&\forall ~t,\hat{t}\in [0,T], \,\mu ,\hat{\mu }\in \mathcal {P}_2(\mathbb {R}^{d+l}), \end{aligned}$$

with the fact that \(J(t,\mu ,{\varvec{\theta }})\) is Lipschitz continuous in \((t,\mu )\in [0,T]\times \mathcal {P}_2(\mathbb {R}^{d+l})\), uniformly with respect to \({\varvec{\theta }}\in L^\infty ([0,T],\Theta )\), we deduce that the value function \(v^*(t,\mu )\) is Lipschitz continuous on \([0,T]\times \mathcal {P}_2(\mathbb {R}^{d+l})\). \(\square \)

The important observation we now make is that the value function satisfies a recursive relation. This is known as the dynamic programming principle, which forms the basis of deriving the Hamilton–Jacobi–Bellman equation. Intuitively, the dynamic programming principle states that for any optimal trajectory, starting from any intermediate state in the trajectory, the remaining trajectory must again be optimal, starting from that time and state. We now state and prove this intuitive statement precisely.
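The recursion can be checked directly on a toy problem. The sketch below (all names are ours) specializes to a single deterministic trajectory, i.e., a Dirac initial measure, with discrete time and a finite control set, and verifies that backward induction over the value function reproduces the brute-force minimum over all control sequences:

```python
import itertools

# Tiny deterministic control problem on a time grid: the state evolves by
# x_{k+1} = x_k + h * f(x_k, th), with cost Phi(x_K) + h * sum_k L(x_k, th_k).
controls = (-1.0, 0.0, 1.0)
K, h = 4, 0.25
f   = lambda x, th: th
L   = lambda x, th: 0.5 * th ** 2
Phi = lambda x: (x - 0.5) ** 2

def cost(x0, seq):
    x, c = x0, 0.0
    for th in seq:
        c += h * L(x, th)
        x = x + h * f(x, th)
    return c + Phi(x)

# Brute force over all |controls|^K control sequences.
x0 = 0.0
brute = min(cost(x0, seq) for seq in itertools.product(controls, repeat=K))

# Dynamic programming: v_K = Phi and
# v_k(x) = min_th [ h * L(x, th) + v_{k+1}(x + h * f(x, th)) ].
def v(k, x):
    if k == K:
        return Phi(x)
    return min(h * L(x, th) + v(k + 1, x + h * f(x, th)) for th in controls)

print(brute, v(0, x0))   # the two agree, as the recursive relation predicts
```

In the mean-field setting the state x is replaced by a measure \(\mu\) and the running cost by \(\langle \bar L(.,\theta_s),\mathbb{P}_s^{t,\mu,{\varvec{\theta}}}\rangle\), but the backward structure is the same.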

Proposition 2

(Dynamic programming principle) For all \(0\le t \le \hat{t} \le T\), \(\mu \in \mathcal {P}_2(\mathbb {R}^{d+l})\), we have

$$\begin{aligned} v^*(t,\mu ) = \inf _{{\varvec{\theta }}\in L^\infty ([0,T],\Theta )}\left[ \int _{t}^{\hat{t}}\left\langle \bar{L}(.,\theta _s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}} \right\rangle \,\mathrm{d}s + v^*\left( \hat{t}, \mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}}\right) \right] . \end{aligned}$$
(15)

Proof

The proof is elementary, as in the context of deterministic control problems; we provide it for completeness.

(1) Given fixed \(t,\hat{t},\mu \) and any \({\varvec{\theta }}^1 \in L^\infty ([0,T],\Theta )\), we consider the probability measure \(\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}^1}\). Fix \(\varepsilon >0\); by the definition (8) of the value function, we can pick \({\varvec{\theta }}^2 \in L^\infty ([0,T],\Theta )\) satisfying

$$\begin{aligned} v^*\left( \hat{t},\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}^1}\right) + \varepsilon \ge \left\langle \bar{\Phi }(.), \,\mathbb {P}_{T}^{\hat{t},\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}^1},{\varvec{\theta }}^2} \right\rangle + \int _{\hat{t}}^T \left\langle \bar{L}(.,\theta ^2_s), \, \mathbb {P}_{s}^{\hat{t},\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}^1},{\varvec{\theta }}^2} \right\rangle \,\mathrm{d}s. \end{aligned}$$
(16)

Now, consider the control process \(\hat{{\varvec{\theta }}}\) defined as

$$\begin{aligned} \hat{\theta }_s = \mathbf {1}_{\{s < \hat{t}\}}\theta ^1_s + \mathbf {1}_{\{s\ge \hat{t}\}}\theta ^2_s. \end{aligned}$$

Thus, we can use (16) and the flow property (6) to deduce

$$\begin{aligned} ~v^*(t,\mu )&\le \int _{t}^{T}\left\langle \bar{L}(.,\hat{\theta }_s), \,\mathbb {P}_{s}^{t,\mu ,\hat{{\varvec{\theta }}}} \right\rangle \,\mathrm{d}s + \left\langle \bar{\Phi }(.), \,\mathbb {P}_{T}^{t,\mu ,\hat{{\varvec{\theta }}}} \right\rangle \\&= \int _{t}^{\hat{t}}\left\langle \bar{L}(.,\hat{\theta }_s), \,\mathbb {P}_{s}^{t,\mu ,\hat{{\varvec{\theta }}}} \right\rangle \,\mathrm{d}s + \int _{\hat{t}}^{T}\left\langle \bar{L}(.,\hat{\theta }_s), \,\mathbb {P}_{s}^{t,\mu ,\hat{{\varvec{\theta }}}} \right\rangle \,\mathrm{d}s + \left\langle \bar{\Phi }(.), \,\mathbb {P}_{T}^{t,\mu ,\hat{{\varvec{\theta }}}} \right\rangle \\&= \int _{t}^{\hat{t}}\left\langle \bar{L}(.,\hat{\theta }_s), \,\mathbb {P}_{s}^{t,\mu ,\hat{{\varvec{\theta }}}} \right\rangle \,\mathrm{d}s + \int _{\hat{t}}^{T}\left\langle \bar{L}(.,\theta ^2_s), \, \mathbb {P}_{s}^{\hat{t},\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}^1},{\varvec{\theta }}^2} \right\rangle \,\mathrm{d}s + \left\langle \bar{\Phi }(.), \,\mathbb {P}_{T}^{\hat{t},\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}^1},{\varvec{\theta }}^2} \right\rangle \\&\le \int _{t}^{\hat{t}}\left\langle \bar{L}(.,\hat{\theta }_s), \,\mathbb {P}_{s}^{t,\mu ,\hat{{\varvec{\theta }}}} \right\rangle \,\mathrm{d}s + v^*(\hat{t},\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}^1}) + \varepsilon \\&= \int _{t}^{\hat{t}}\left\langle \bar{L}(.,\theta ^1_s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}^1} \right\rangle \,\mathrm{d}s + v^*(\hat{t},\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}^1}) + \varepsilon . \end{aligned}$$

As \({\varvec{\theta }}^1\) and \(\varepsilon \) are both arbitrary, we have

$$\begin{aligned} v^*(t,\mu ) \le \inf _{{\varvec{\theta }}\in L^\infty ([0,T],\Theta )}\Big [ \int _{t}^{\hat{t}}\left\langle \bar{L}(.,\theta _s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}}\right\rangle \,\mathrm{d}s + v^*\left( \hat{t}, \mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}}\right) \Big ]. \end{aligned}$$

(2) Fix \(\varepsilon >0\) again and choose, by the definition of the value function, \({\varvec{\theta }}^3 \in L^\infty ([0,T],\Theta )\) such that

$$\begin{aligned} v^*(t,\mu ) + \varepsilon \ge \int _{t}^{T}\left\langle \bar{L}(.,\theta ^3_s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}^3} \right\rangle \,\mathrm{d}s +\left\langle \bar{\Phi }(.), \,\mathbb {P}_{T}^{t,\mu , {\varvec{\theta }}^3} \right\rangle . \end{aligned}$$

Using the flow property (6) and the definition of the value function again gives us the estimate

$$\begin{aligned} ~v^*(t,\mu ) + \varepsilon&\ge \int _{t}^{T}\left\langle \bar{L}(.,\theta ^3_s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}^3} \right\rangle \,\mathrm{d}s + \left\langle \bar{\Phi }(.), \,\mathbb {P}_{T}^{t,\mu , {\varvec{\theta }}^3} \right\rangle \\&= \int _{t}^{\hat{t}}\left\langle \bar{L}(.,\theta ^3_s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}^3} \right\rangle \,\mathrm{d}s + \int _{\hat{t}}^{T}\left\langle \bar{L}(.,\theta ^3_s), \,\mathbb {P}_{s}^{\hat{t},\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}^3},{\varvec{\theta }}^3} \right\rangle \,\mathrm{d}s +\left\langle \bar{\Phi }(.), \,\mathbb {P}_{T}^{\hat{t},\mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}^3},{\varvec{\theta }}^3} \right\rangle \\&\ge \int _{t}^{\hat{t}}\left\langle \bar{L}(.,\theta ^3_s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}^3} \right\rangle \,\mathrm{d}s + v^*\left( \hat{t}, \mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}^3}\right) \\&\ge \inf _{{\varvec{\theta }}\in L^\infty ([0,T],\Theta )}\Big [ \int _{t}^{\hat{t}}\left\langle \bar{L}(.,\theta _s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}}\right\rangle \,\mathrm{d}s + v^*\left( \hat{t}, \mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}}\right) \Big ]. \end{aligned}$$

Hence, we deduce

$$\begin{aligned} v^*(t,\mu ) \ge \inf _{{\varvec{\theta }}\in L^\infty ([0,T],\Theta )}\Big [ \int _{t}^{\hat{t}}\left\langle \bar{L}(.,\theta _s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}} \right\rangle \,\mathrm{d}s + v^*\left( \hat{t}, \mathbb {P}_{\hat{t}}^{t,\mu ,{\varvec{\theta }}}\right) \Big ]. \end{aligned}$$

Combining the inequalities in the two parts completes the proof. \(\square \)
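As a numerical illustration (not part of the proof), the dynamic programming principle can be checked on a particle discretization: approximate \(\mu \) by an empirical measure and restrict to piecewise-constant controls on two stages drawn from a finite grid. In this restricted class, minimizing the two stages jointly and minimizing the second stage first agree exactly, which is (15) in miniature. The dynamics \(\bar{f}(w,\theta )=-\theta w\), the quadratic costs, and all discretization choices below are illustrative assumptions, not the paper's model.

```python
import numpy as np

# Particle sketch of the dynamic programming principle (15): mu is approximated
# by an empirical measure of N particles, and controls are piecewise constant
# on two stages, each drawn from a finite grid. Dynamics, costs, and
# discretization choices are illustrative assumptions.
rng = np.random.default_rng(0)
w0 = rng.normal(size=2000)           # particles sampling mu
Theta = np.linspace(0.0, 2.0, 11)    # discretized compact control set
dt = 0.5                             # stage length; horizon T = 2 * dt

def stage(w, theta, steps=50):
    """Euler flow of the particles over one stage plus the running cost."""
    h = dt / steps
    cost = 0.0
    for _ in range(steps):
        cost += h * np.mean(w**2 + theta**2)   # <L_bar(., theta), mu_s>
        w = w + h * (-theta * w)               # dw/dt = f_bar(w, theta)
    return w, cost

# Left-hand side: minimize the total cost over both stages jointly.
best_joint = np.inf
for th1 in Theta:
    w1, c1 = stage(w0, th1)
    for th2 in Theta:
        w2, c2 = stage(w1, th2)
        best_joint = min(best_joint, c1 + c2 + np.mean(w2**2))

# Right-hand side: running cost on the first stage plus the value at t_hat.
best_dpp = np.inf
for th1 in Theta:
    w1, c1 = stage(w0, th1)
    tail = min(c2 + np.mean(w2**2)
               for w2, c2 in (stage(w1, th2) for th2 in Theta))
    best_dpp = min(best_dpp, c1 + tail)
```

The two minimizations enumerate the same costs in a different order, so they coincide; the proof above extends this identity from piecewise-constant grids to all of \(L^\infty ([0,T],\Theta )\).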

4.2 Derivative and chain rule in Wasserstein space

In classical finite-dimensional optimal control, the HJB equation can be formally derived from the dynamic programming principle by a Taylor expansion of the value function with respect to the state vector. However, in the current formulation, the state is a probability measure. To derive the corresponding HJB equation in this setting, it is essential to define a notion of derivative of the value function with respect to a probability measure. The basic idea is to regard probability measures on \(\mathbb {R}^{d+l}\) as laws of \(\mathbb {R}^{d+l}\)-valued random variables on the probability space \((\Omega ,\mathcal {F},\mathbb {P})\) and then use the corresponding Hilbert space of square-integrable random variables to define derivatives. This approach is outlined more extensively in [13].

Concretely, let us take any function \(u: \mathcal {P}_2(\mathbb {R}^{d+l})\rightarrow \mathbb {R}\). We now lift it into its “extension” U, a function defined on \(L^2(\Omega ,\mathbb {R}^{d+l})\) by

$$\begin{aligned} U(X)=u(\mathbb {P}_X),\quad \forall X\in L^2(\Omega ,\mathbb {R}^{d+l}). \end{aligned}$$
(17)

We say u is in \(C^1(\mathcal {P}_2(\mathbb {R}^{d+l}))\) if the lifted function U is Fréchet differentiable with continuous derivative. Since we can identify \(L^2(\Omega ,\mathbb {R}^{d+l})\) with its dual space, if the Fréchet derivative DU(X) exists, by the Riesz representation theorem one can view it as an element of \(L^2(\Omega ,\mathbb {R}^{d+l})\):

$$\begin{aligned} DU(X)(Y)=\mathbb {E}[DU(X)\cdot Y],\quad \forall Y\in L^2(\Omega ,\mathbb {R}^{d+l}). \end{aligned}$$

The important result one can prove is that the law of DU(X) does not depend on X but only on the law of X. Accordingly, we have the representation

$$\begin{aligned} DU(X) = \partial _\mu u(\mathbb {P}_X)(X), \end{aligned}$$

for some function \(\partial _\mu u(\mathbb {P}_X): \mathbb {R}^{d+l}\rightarrow \mathbb {R}^{d+l}\), which is called the derivative of u at \(\mu = \mathbb {P}_X\). Moreover, \(\partial _\mu u(\mu )\) is square integrable with respect to \(\mu \).
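A canonical example illustrates this construction. Take the linear function \(u(\mu )=\langle g,\mu \rangle \) for some \(g\in C^1(\mathbb {R}^{d+l})\) with Lipschitz gradient. Its lift is \(U(X)=\mathbb {E}[g(X)]\), and a direct computation gives

$$\begin{aligned} DU(X)(Y)=\mathbb {E}[\nabla g(X)\cdot Y],\quad \forall Y\in L^2(\Omega ,\mathbb {R}^{d+l}), \end{aligned}$$

so that \(DU(X)=\nabla g(X)\) and \(\partial _\mu u(\mu )(x)=\nabla g(x)\). As claimed, \(\partial _\mu u(\mu )\) does not depend on the choice of the representative X of \(\mu \).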

We next need a chain rule defined on \(\mathcal {P}_2(\mathbb {R}^{d+l})\). Consider the dynamical system

$$\begin{aligned} W_t=\xi + \int _{0}^t \bar{f}(W_s)\,\mathrm{d}s,\quad \xi \in L^2(\Omega ,\mathbb {R}^{d+l}), \end{aligned}$$

and \(u\in \mathcal {C}^1(\mathcal {P}_2(\mathbb {R}^{d+l}))\). Then, for all \(t\in [0,T]\), we have

$$\begin{aligned} u(\mathbb {P}_{W_t})=u(\mathbb {P}_{W_0})+\int _0^t\langle \partial _{\mu } u(\mathbb {P}_{W_s})(.)\cdot \bar{f}(.),\,\mathbb {P}_{W_s}\rangle \,\mathrm{d}s, \end{aligned}$$
(18)

or equivalently its lifted version

$$\begin{aligned} U({W_t})=U({W_0})+\int _0^t \mathbb {E}[DU(W_s)\cdot \bar{f}(W_s)] \,\mathrm{d}s. \end{aligned}$$
(19)
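The chain rule can be verified numerically with particles. The sketch below (the choices \(g=\sin \) and \(\bar{f}(w)=-w\), and the Euler discretization, are illustrative assumptions) takes \(u(\mu )=\langle g,\mu \rangle \), for which \(\partial _\mu u(\mu )=g'\) by the example above, and compares both sides of (18) along the flow.

```python
import numpy as np

# Monte Carlo check of the chain rule (18) for u(mu) = <g, mu>, whose
# Wasserstein derivative is d_mu u(mu)(w) = g'(w). The choices g = sin and
# f_bar(w) = -w are illustrative assumptions.
rng = np.random.default_rng(1)
N, T, steps = 5000, 1.0, 1000
h = T / steps

g, dg = np.sin, np.cos
f_bar = lambda w: -w

W = rng.normal(size=N)               # particles representing P_{W_0}
lhs0 = np.mean(g(W))                 # u(P_{W_0})
integral = 0.0
for _ in range(steps):
    # <d_mu u(P_{W_s})(.) . f_bar(.), P_{W_s}>, quadrature at the left endpoint
    integral += h * np.mean(dg(W) * f_bar(W))
    W = W + h * f_bar(W)             # Euler step of the flow
lhs_t = np.mean(g(W))                # u(P_{W_t})
# (18): u(P_{W_t}) = u(P_{W_0}) + integral, up to O(h) discretization error
```

Since both sides are evaluated on the same particles, the Monte Carlo noise cancels and the residual is purely the Euler discretization error.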

4.3 HJB equation in Wasserstein space

Guided by the dynamic programming principle (15) and formula (18), we are ready to formally derive the associated HJB equation as follows. Let \(\hat{t}=t+\delta t\) with \(\delta t\) being small. By performing a formal Taylor series expansion of (15), we have

$$\begin{aligned} 0&= \inf _{{\varvec{\theta }}\in L^\infty ([0,T],\Theta )}\Big [ v^*\left( t+\delta t, \mathbb {P}_{t+\delta t}^{t,\mu ,{\varvec{\theta }}}\right) - v^*(t,\mu ) + \int _{t}^{t+\delta t}\left\langle \bar{L}(.,\theta _s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}} \right\rangle \,\mathrm{d}s\Big ] \\&\approx \inf _{{\varvec{\theta }}\in L^\infty ([0,T],\Theta )}\Big [ \partial _t v(t,\mu )\delta t + \int _{t}^{t+\delta t}\langle \partial _\mu v(t,\mu )(.)\cdot \bar{f}(.,\theta _s) + \bar{L}(.,\theta _s), \,\mu \rangle \,\mathrm{d}s\Big ] \\&\approx \delta t \inf _{\theta \in \Theta }\Big [ \partial _t v(t,\mu ) + \langle \partial _\mu v(t,\mu )(.)\cdot \bar{f}(.,\theta ) + \bar{L}(.,\theta ), \,\mu \rangle \Big ]. \end{aligned}$$

Passing to the limit \(\delta t\rightarrow 0\), we obtain the following HJB equation

$$\begin{aligned} {\left\{ \begin{array}{ll} \displaystyle {\frac{\partial v}{\partial t} + \inf _{\theta \in \Theta }\left\langle \partial _\mu v(t,\mu )(.)\cdot \bar{f}(.,\theta )+ \bar{L}(.,\theta ),\,\mu \right\rangle = 0,}&{}\text {on~~} [0,T) \times \mathcal {P}_2(\mathbb {R}^{d+l}),\\ \displaystyle {v(T, \mu )=\langle \bar{\Phi }(.),\mu \rangle }, &{}\text {on~~} \mathcal {P}_2(\mathbb {R}^{d+l}), \end{array}\right. } \end{aligned}$$
(20)

which the value function should satisfy. Note that a similar infinite-dimensional PDE, the so-called master equation, can also be derived heuristically as the mean-field limit of the Nash equilibria of feedback games involving many players. We refer to [13, 27, 47] for further introduction and related results.

The rest of this section and the next is devoted to establishing the precise link between equation (20) and the value function (8). We now prove a verification result, which essentially says that a sufficiently smooth solution of the HJB equation (20) must be the value function. Moreover, the HJB equation allows us to identify the optimal control policy.

Proposition 3

Let v be a function in \(C^{1,1}([0,T]\times \mathcal {P}_2(\mathbb {R}^{d+l}))\). If v is a solution to (20) and there exists a mapping \(\theta ^{\dagger }:[0,T]\times \mathcal {P}_2(\mathbb {R}^{d+l})\rightarrow \Theta \), \((t,\mu )\mapsto \theta ^{\dagger }(t,\mu )\), attaining the infimum in (20), then \(v(t,\mu )=v^*(t,\mu )\), and \(\theta ^{\dagger }\) is an optimal feedback control policy, i.e., \({\varvec{\theta }}={\varvec{\theta }}^*\) is a solution of (3), where \(\theta ^*_t := \theta ^{\dagger }(t,\mathbb {P}_{w^*_t})\) with \(\mathbb {P}_{w^*_0}=\mu _0\) and \(\mathrm{d}w^*_t/\mathrm{d}t=\bar{f}(w^*_t,\theta ^*_t)\).

Proof

Given any control process \({\varvec{\theta }}\), one can apply formula (18) between \(s=t\) and \(s=T\) with explicit t dependence and obtain

$$\begin{aligned} v\left( T,\mathbb {P}_{T}^{t,\mu ,{\varvec{\theta }}}\right) = v(t,\mu ) + \int _{t}^T\frac{\partial v}{\partial t}\left( s,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}}\right) + \left\langle \partial _{\mu }v\left( s,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}}\right) (.)\cdot \bar{f}(.,\theta _s),\,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}}\right\rangle \,\mathrm{d}s. \end{aligned}$$

Equivalently, we have

$$\begin{aligned} v(t,\mu )&= v\left( T,\mathbb {P}_{T}^{t,\mu ,{\varvec{\theta }}}\right) - \int _{t}^T\frac{\partial v}{\partial t}\left( s,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}}\right) +\left\langle \partial _{\mu }v\left( s,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}}\right) (.)\cdot \bar{f}(.,\theta _s),\,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}}\right\rangle \,\mathrm{d}s\\&\le v\left( T,\mathbb {P}_{T}^{t,\mu ,{\varvec{\theta }}}\right) + \int _{t}^T \left\langle \bar{L}(.,\theta _s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}} \right\rangle \,\mathrm{d}s \\&=\left\langle \bar{\Phi }(.),\,\mathbb {P}_{T}^{t,\mu ,{\varvec{\theta }}} \right\rangle + \int _{t}^T \left\langle \bar{L}(.,\theta _s), \,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}} \right\rangle \,\mathrm{d}s \\&=J(t,\mu ,{\varvec{\theta }}), \end{aligned}$$

where the first inequality comes from the infimum condition in (20). Since the control process is arbitrary, we have

$$\begin{aligned} v(t,\mu )\le v^*(t,\mu ). \end{aligned}$$
(21)

Replacing the arbitrary control process with \({\varvec{\theta }}^*\), where \(\theta ^*_s = \theta ^{\dagger }(s,\mathbb {P}_{s}^{t,\mu ,{\varvec{\theta }}^*})\) is given by the optimal feedback policy, and repeating the above argument, in which the inequality becomes an equality since the infimum is attained, we have

$$\begin{aligned} v(t,\mu )=J(t,\mu ,{\varvec{\theta }}^*)\ge v^*(t,\mu ). \end{aligned}$$
(22)

Therefore, we obtain \(v(t,\mu )=v^*(t,\mu )\) and \(\theta ^{\dagger }\) defines an optimal feedback control policy. \(\square \)

Proposition 3 is an important statement that links smooth solutions of the HJB equation with solutions of the mean-field optimal control problem, and hence the population risk minimization problem in deep learning. Furthermore, by taking the infimum in (20), it allows us to identify an optimal control policy \(\theta ^{\dagger }:[0,T]\times \mathcal {P}_2(\mathbb {R}^{d+l}) \rightarrow \Theta \). This is in general a stronger characterization of the solution of the learning problem. In particular, it is in feedback, or closed-loop, form. An open-loop solution can then be obtained from the closed-loop control policy by sequentially setting \(\theta ^*_t = \theta ^{\dagger }(t, \mathbb {P}_{w^*_t})\), where \(w^*_t\) is the solution of the feed-forward ODE with \({\varvec{\theta }}={\varvec{\theta }}^*\) up to time t. Note, however, that in the common practice of deep learning, one usually finds the open-loop-type solution directly during training (as an optimization problem rather than a control problem) and uses it in inference. In other words, during inference the trained weights are fixed and do not depend on the distribution of the inputs encountered. Controls obtained from closed-loop control policies, on the other hand, are actively adjusted according to the distribution encountered. In this sense, the ability to generate an optimal control policy in the form of state-based feedback is an important feature of the dynamic programming approach. There is, however, a price to pay for obtaining such a feedback control: the HJB equation is generally difficult to solve numerically. We shall return to this point at the end of Sect. 5.
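The sequential construction \(\theta ^*_t = \theta ^{\dagger }(t,\mathbb {P}_{w^*_t})\) can be sketched as follows. The feedback policy \(\theta ^{\dagger }\) below is a hypothetical stand-in that reacts to the second moment of the measure; it is not derived from an HJB solve, and the dynamics are illustrative assumptions. The point is only the mechanics of freezing a feedback policy into an open-loop schedule.

```python
import numpy as np

# Unrolling an open-loop control from a feedback policy: at each time step,
# evaluate the policy at the current empirical measure and freeze the value.
# theta_dagger is a hypothetical stand-in, NOT the HJB minimizer.
rng = np.random.default_rng(3)

def theta_dagger(t, w_particles):
    """Hypothetical feedback policy: respond to the current second moment."""
    return 0.5 * np.mean(w_particles**2)

w = rng.normal(size=4000)            # particles sampling mu_0
T, steps = 1.0, 200
h = T / steps
open_loop = []
for k in range(steps):
    th = theta_dagger(k * h, w)      # theta*_t = theta_dagger(t, P_{w*_t})
    open_loop.append(th)
    w = w + h * (-th * w)            # dw*/dt = f_bar(w*, theta*) = -theta * w
open_loop = np.array(open_loop)
# open_loop is now a fixed schedule that can be replayed without re-estimating
# the measure; a distribution shift at inference time would not be reflected
# in it, unlike in the feedback policy itself.
```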

The limitation of Proposition 3 is that it assumes the value function \(v^*(t,\mu )\) is continuously differentiable, which is often not the case. In order to formulate a complete characterization, we would also like to deduce the statement in the other direction: A solution to (3) should also solve the PDE (20) in an appropriate sense. In the next section, we achieve this by giving a more flexible characterization of the value function as the viscosity solution of the HJB equation.

5 Viscosity solution of HJB equation

5.1 The concept of viscosity solutions

In general, one cannot expect to have smooth solutions to the HJB equation (20). Therefore, we need to extend the classical concept of PDE solutions to a type of weak solutions. As in the analysis of classical Hamilton–Jacobi equations, we shall introduce a notion of viscosity solution for the HJB equation in the Wasserstein space of probability measures. The key idea is again the lifting identification between measures and random variables, working in the Hilbert space \(L^2(\Omega ,\mathbb {R}^{d+l})\), instead of the Wasserstein space \(\mathcal {P}_2(\mathbb {R}^{d+l})\). Then, we can use the tools developed for viscosity solutions in Hilbert spaces. The techniques presented below have been employed in the study of well-posedness for general Hamilton–Jacobi equations in Banach spaces, see e.g., [20,21,22].

For convenience, we define the Hamiltonian \(\mathcal {H}:L^2(\Omega ,\mathbb {R}^{d+l})\times L^2(\Omega ,\mathbb {R}^{d+l})\rightarrow \mathbb {R}\) as

$$\begin{aligned} \mathcal {H}(\xi ,P):=\inf _{\theta \in \Theta }\mathbb {E}[P\cdot \bar{f}(\xi ,\theta )+\bar{L}(\xi ,\theta )]. \end{aligned}$$
(23)

Then, the “lifted” Bellman equation associated with (20), in terms of \(V(t,\xi ) = v(t,\mathbb {P}_\xi )\), can be written down as follows, with the state space enlarged to \(L^2(\Omega ,\mathbb {R}^{d+l})\):

$$\begin{aligned} {\left\{ \begin{array}{ll} \displaystyle {\frac{\partial V}{\partial t} + \mathcal {H}(\xi ,DV(t,\xi )) = 0,}&{}\text {on~~} [0,T) \times L^2(\Omega ,\mathbb {R}^{d+l}),\\ \displaystyle {V(T, \xi )=\mathbb {E}[\bar{\Phi }(\xi )]}, &{}\text {on}~~ L^2(\Omega ,\mathbb {R}^{d+l}). \end{array}\right. } \end{aligned}$$
(24)
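In this lifted form, the Hamiltonian (23) is amenable to a straightforward Monte Carlo approximation: represent \(\xi \) and P by samples, replace the expectation by a sample mean, and take the infimum over a discretization of the compact set \(\Theta \). In the sketch below, the choices of \(\bar{f}\), \(\bar{L}\), and the sampling distributions are hypothetical, chosen only so that \(\bar{f}\) is bounded as assumed in the paper.

```python
import numpy as np

# Monte Carlo sketch of the lifted Hamiltonian (23): random variables xi and P
# are represented by samples, the expectation by a sample mean, and the
# infimum over the compact set Theta by a grid search. f_bar, L_bar, and the
# sampling distributions are hypothetical.
rng = np.random.default_rng(2)
N = 10000
xi = rng.normal(size=N)
P = rng.normal(size=N)
Theta = np.linspace(-1.0, 1.0, 201)          # includes theta = 0

f_bar = lambda x, th: np.tanh(th * x)        # bounded dynamics
L_bar = lambda x, th: x**2 + th**2           # running cost

def hamiltonian(xi, P):
    """H(xi, P) = inf_theta E[P . f_bar(xi, theta) + L_bar(xi, theta)]."""
    return min(np.mean(P * f_bar(xi, th) + L_bar(xi, th)) for th in Theta)

H = hamiltonian(xi, P)
```

By construction the estimate is never larger than the value at any fixed \(\theta \) in the grid, mirroring the defining infimum.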

Definition 1

We say that a bounded, uniformly continuous function \(u:\,[0,T]\times \mathcal {P}_2(\mathbb {R}^{d+l})\rightarrow \mathbb {R}\) is a viscosity (sub, super) solution to (20) if the lifted function \(U:\,[0,T]\times L^2(\Omega ,\mathbb {R}^{d+l})\rightarrow \mathbb {R}\) defined by

$$\begin{aligned} U(t,\xi )=u(t,\mathbb {P}_\xi ) \end{aligned}$$

is a viscosity (sub, super) solution to the lifted Bellman equation (24), that is:

(i) \(U(T,\xi )\le \mathbb {E}[\bar{\Phi }(\xi )]\), and for any test function \(\psi \in C^{1,1}([0,T]\times L^2(\Omega ,\mathbb {R}^{d+l}))\) such that the map \(U-\psi \) has a local maximum at \((t_0,\xi _0)\in [0,T)\times L^2(\Omega ,\mathbb {R}^{d+l})\), one has

$$\begin{aligned} \partial _t \psi (t_0,\xi _0) + \mathcal {H}(\xi _0,D\psi (t_0,\xi _0)) \ge 0. \end{aligned}$$
(25)

(ii) \(U(T,\xi )\ge \mathbb {E}[\bar{\Phi }(\xi )]\), and for any test function \(\psi \in C^{1,1}([0,T]\times L^2(\Omega ,\mathbb {R}^{d+l}))\) such that the map \(U-\psi \) has a local minimum at \((t_0,\xi _0)\in [0,T)\times L^2(\Omega ,\mathbb {R}^{d+l})\), one has

$$\begin{aligned} \partial _t \psi (t_0,\xi _0) + \mathcal {H}(\xi _0,D\psi (t_0,\xi _0)) \le 0. \end{aligned}$$
(26)

Remark 1

Readers familiar with the concept of viscosity solution in the finite-dimensional case will readily find the above definition a natural extension to the infinite-dimensional case. Informally speaking, the HJB equation (20) encodes a sort of “monotonicity” for the value function: if a generic function is less than the value function everywhere, then its image under the Bellman operator [the right-hand side of (15) acting on \(v^*\)] remains less than the value function everywhere. This kind of monotonicity helps us characterize possibly non-differentiable solutions, as indicated in Definition 1. We refer to [18, 24] for further introduction to viscosity solutions.

5.2 Existence and uniqueness of viscosity solution

The main goal of introducing the concept of viscosity solutions is that in the viscosity sense, the HJB equation is well posed and the value function is the unique solution of the HJB equation. We show this in Theorems 1 and 2.

Theorem 1

The value function \(v^*(t,\mu )\) defined in (8) is a viscosity solution to the HJB equation (20).

Before proving Theorem 1, we first introduce a useful lemma regarding the continuity of \(\mathcal {H}(\xi ,P)\).

Lemma 1

The Hamiltonian \(\mathcal {H}(\xi ,P)\) defined in (23) satisfies the following continuity conditions:

$$\begin{aligned}&|\mathcal {H}(\xi ,P)-\mathcal {H}(\xi ,Q)|\le K_L \Vert P-Q\Vert _{L^2}, \end{aligned}$$
(27)
$$\begin{aligned}&|\mathcal {H}(\xi ,P)-\mathcal {H}(\zeta ,P)|\le K_L (1+\Vert P\Vert _{L^2})\Vert \xi -\zeta \Vert _{L^2}. \end{aligned}$$
(28)

Proof

For simplicity, we define

$$\begin{aligned} \hat{\mathcal {H}}(\xi ,P;\theta ):=\mathbb {E}[P\cdot \bar{f}(\xi ,\theta )+\bar{L}(\xi ,\theta )]. \end{aligned}$$

The boundedness and Lipschitz continuity of \(\bar{f}\) and \(\bar{L}\) give us

$$\begin{aligned}&|\hat{\mathcal {H}}(\xi ,P;\theta ) - \hat{\mathcal {H}}(\xi ,Q;\theta )| \le K_L\Vert P-Q\Vert _{L^2}, \end{aligned}$$
(29)
$$\begin{aligned}&|\hat{\mathcal {H}}(\xi ,P;\theta ) - \hat{\mathcal {H}}(\zeta ,P;\theta )| \le K_L(1+\Vert P\Vert _{L^2})\Vert \xi -\zeta \Vert _{L^2}. \end{aligned}$$
(30)

By definition, we know

$$\begin{aligned} \mathcal {H}(\xi ,P) :=\inf _{\theta \in \Theta } \hat{\mathcal {H}}(\xi ,P;\theta ). \end{aligned}$$

Let \(\theta _n\) satisfy

$$\begin{aligned} \hat{\mathcal {H}}(\xi ,Q;\theta _n) - \mathcal {H}(\xi ,Q) \le 1/n. \end{aligned}$$

Then,

$$\begin{aligned} \mathcal {H}(\xi ,P)-\mathcal {H}(\xi ,Q)&=\, (\mathcal {H}(\xi ,P)- \hat{\mathcal {H}}(\xi ,P;\theta _n)) + (\hat{\mathcal {H}}(\xi ,P;\theta _n) - \hat{\mathcal {H}}(\xi ,Q;\theta _n)) + (\hat{\mathcal {H}}(\xi ,Q;\theta _n) - \mathcal {H}(\xi ,Q)) \\&\le \, |\hat{\mathcal {H}}(\xi ,P;\theta _n) - \hat{\mathcal {H}}(\xi ,Q;\theta _n)| + 1/n \\&\le \, K_L\Vert P-Q\Vert _{L^2} + 1/n. \end{aligned}$$

Taking \(n\rightarrow \infty \), we have \(\mathcal {H}(\xi ,P)-\mathcal {H}(\xi ,Q) \le K_L\Vert P-Q\Vert _{L^2}\). A similar computation shows \(\mathcal {H}(\xi ,Q)-\mathcal {H}(\xi ,P) \le K_L\Vert P-Q\Vert _{L^2}\), which proves (27). Equation (28) can be proved analogously, using (30). \(\square \)
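The mechanism of this proof, that an infimum of functions which are all K-Lipschitz with the same constant K is itself K-Lipschitz, is elementary and can be seen in a toy computation. Below, the affine family \(h(p,\theta )=a(\theta )p+b(\theta )\) with \(|a(\theta )|\le K\) mimics \(\hat{\mathcal {H}}\) being \(K_L\)-Lipschitz in P through the boundedness of \(\bar{f}\); all concrete choices are illustrative.

```python
import numpy as np

# An infimum of uniformly K-Lipschitz functions is K-Lipschitz: the family
# h(p, theta) = a(theta) * p + b(theta) with |a| <= K stands in for H_hat,
# which is K_L-Lipschitz in P via the boundedness of f_bar. Illustrative only.
K = 1.0
thetas = np.linspace(0.0, 2.0 * np.pi, 400)
a = K * np.sin(thetas)               # slopes, bounded by K
b = np.cos(3.0 * thetas)             # offsets

def h_inf(p):
    """inf over the (discretized) family at the point p."""
    return np.min(p * a + b)

ps = np.linspace(-5.0, 5.0, 200)
vals = np.array([h_inf(p) for p in ps])
slopes = np.abs(np.diff(vals) / np.diff(ps))   # finite-difference slopes
# every finite-difference slope of the infimum stays bounded by K
```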

Proof of Theorem 1

We lift the value function \(v^*(t,\mu )\) to \([0,T]\times L^2(\Omega ,\mathbb {R}^{d+l})\) and denote it by \(V^*(t,\xi )\). Note that convergence \(\xi _n\rightarrow \xi \) in \(L^2(\Omega ,\mathbb {R}^{d+l})\) implies convergence \(\mathbb {P}_{\xi _n}\rightarrow \mathbb {P}_{\xi }\) in \(\mathcal {P}_2(\mathbb {R}^{d+l})\); thus, Proposition 1 guarantees that \(V^*(t,\xi )\) is continuous on \([0,T]\times L^2(\Omega ,\mathbb {R}^{d+l})\). By definition, \(V^*(t,\xi )\) is bounded and \(V^*(T,\xi )=\mathbb {E}[\bar{\Phi }(\xi )]\). It remains to show the viscosity sub- and supersolution properties of \(V^*(t,\xi )\). To proceed, we note that \(V^*(t,\xi )\) inherits the dynamic programming principle from \(v^*(t,\mu )\) (cf. Proposition 2), which can be represented as

$$\begin{aligned} V^*(t,\xi ) = \inf _{{\varvec{\theta }}\in L^\infty ([0,T],\Theta )}\left[ \int _{t}^{\hat{t}}\mathbb {E}\left[ \bar{L}\left( W_s^{t,\xi ,{\varvec{\theta }}},\theta _s\right) \right] \,\mathrm{d}s + V^*\left( \hat{t}, W_{\hat{t}}^{t,\xi ,{\varvec{\theta }}}\right) \right] . \end{aligned}$$
(31)

1. Subsolution property Suppose \(\psi \) is a test function in \(C^{1,1}([0,T]\times L^2(\Omega ,\mathbb {R}^{d+l}))\) and \(V^*-\psi \) has a local maximum at \((t_0,\xi _0)\in [0,T)\times L^2(\Omega ,\mathbb {R}^{d+l})\), which means

$$\begin{aligned} (V^*-\psi )(t,\xi )\le (V^*-\psi )(t_0,\xi _0) \text { for all } (t,\xi ) \text { satisfying } |t-t_0|+\Vert \xi -\xi _0\Vert _{L^2} < \delta . \end{aligned}$$

Let \(\theta _0\) be an arbitrary element in \(\Theta \) and define a control process \({\varvec{\theta }}^0 \in L^\infty ([0,T],\Theta )\) such that \(\theta ^0_s \equiv \theta _0,\,s\in [t_0,T]\). Let \(h\in (0,T-t_0)\) be small enough such that \(|s-t_0|+\Vert W_s^{t_0,\xi _0,{\varvec{\theta }}^0}-\xi _0\Vert _{L^2} < \delta \) for all \(s\in [t_0,t_0+h]\). This is possible by an argument similar to that in the proof of Proposition 1. From the dynamic programming principle (31), we have

$$\begin{aligned} V^*(t_0,\xi _0) \le \int _{t_0}^{t_0+h}\mathbb {E}\left[ \bar{L}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^0},\theta ^0_s\right) \right] \,\mathrm{d}s + V^*\left( t_0+h, W_{t_0+h}^{t_0,\xi _0,{\varvec{\theta }}^0}\right) . \end{aligned}$$

Using the condition of local maximality and chain rule (19), we have the inequality

$$\begin{aligned} 0&\le V^*\left( t_0+h, W_{t_0+h}^{t_0,\xi _0,{\varvec{\theta }}^0}\right) - V^*(t_0,\xi _0) + \int _{t_0}^{t_0+h}\mathbb {E}\left[ \bar{L}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^0},\theta ^0_s\right) \right] \,\mathrm{d}s \nonumber \\&\le \psi \left( t_0+h, W_{t_0+h}^{t_0,\xi _0,{\varvec{\theta }}^0}\right) - \psi (t_0,\xi _0) + \int _{t_0}^{t_0+h}\mathbb {E}\left[ \bar{L}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^0},\theta ^0_s\right) \right] \,\mathrm{d}s \nonumber \\&= \int _{t_0}^{t_0+h} \partial _t\psi \left( s,W_s^{t_0,\xi _0,{\varvec{\theta }}^0}\right) +\mathbb {E}\left[ D\psi (s,W_s^{t_0,\xi _0,{\varvec{\theta }}^0})\cdot \bar{f}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^0},\theta ^0_s\right) \right] \,\mathrm{d}s \nonumber \\&\quad + \int _{t_0}^{t_0+h}\mathbb {E}\left[ \bar{L}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^0},\theta ^0_s\right) \right] \,\mathrm{d}s. \end{aligned}$$
(32)

Since \(W_s^{t_0,\xi _0,{\varvec{\theta }}^0}\) is continuous in time with respect to the metric of \(L^2(\Omega ,\mathbb {R}^{d+l})\),

$$\begin{aligned} \partial _t\psi \left( s,W_s^{t_0,\xi _0,{\varvec{\theta }}^0}\right) + \mathbb {E}\left[ D\psi \left( s,W_s^{t_0,\xi _0,{\varvec{\theta }}^0}\right) \cdot \bar{f}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^0},\theta ^0_s\right) + \bar{L}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^0},\theta ^0_s\right) \right] \end{aligned}$$

is also continuous in time. Dividing the inequality (32) by h and taking the limit \(h\rightarrow 0\), we obtain

$$\begin{aligned} 0&\le \Big [\partial _t\psi \left( s,W_s^{t_0,\xi _0,{\varvec{\theta }}^0}\right) + \mathbb {E}\left[ D\psi \left( s,W_s^{t_0,\xi _0,{\varvec{\theta }}^0}\right) \cdot \bar{f}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^0},\theta ^0_s\right) + \bar{L}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^0},\theta ^0_s\right) \right] \Big ]\Big |_{s=t_0} \\&= \partial _t\psi (t_0,\xi _0) + \mathbb {E}\left[ D\psi (t_0,\xi _0)\cdot \bar{f}(\xi _0,\theta _0) + \bar{L}(\xi _0,\theta _0)\right] . \end{aligned}$$

Since \(\theta _0\) is arbitrary in \(\Theta \), we obtain the desired subsolution property (25).

2. Supersolution property Suppose \(\psi \) is a test function in \(C^{1,1}([0,T]\times L^2(\Omega ,\mathbb {R}^{d+l}))\) and \(V^*-\psi \) has a local minimum at \((t_0,\xi _0)\in [0,T)\times L^2(\Omega ,\mathbb {R}^{d+l})\), which means

$$\begin{aligned} (V^*-\psi )(t,\xi )\ge (V^*-\psi )(t_0,\xi _0) \text { for all } (t,\xi ) \text { satisfying } |t-t_0|+\Vert \xi -\xi _0\Vert _{L^2} < \delta _1. \end{aligned}$$

Given an arbitrary \(\varepsilon >0\), since Lemma 1 tells us \(\mathcal {H}\) is continuous, there exists \(\delta _2 >0\) such that

$$\begin{aligned} |\partial _t\psi (t,\xi )+\mathcal {H}(\xi ,D\psi (t,\xi )) - \partial _t\psi (t_0,\xi _0) - \mathcal {H}(\xi _0,D\psi (t_0,\xi _0))|<\varepsilon , \end{aligned}$$

for all \((t,\xi ) \text { satisfying } |t-t_0|+\Vert \xi -\xi _0\Vert _{L^2} < \delta _2\). Again as argued in the proof of Proposition 1, we can choose \(h\in (0,T-t_0)\) to be small enough such that \(|s-t_0|+\Vert W_s^{t_0,\xi _0,{\varvec{\theta }}}-\xi _0\Vert _{L^2} < \min \{\delta _1,\delta _2\}\) for all \(s\in [t_0,t_0+h]\), \({\varvec{\theta }}\in L^\infty ([0,T],\Theta )\).

From the dynamic programming principle (31), there exists \({\varvec{\theta }}^h\) such that

$$\begin{aligned} V^*(t_0,\xi _0) + \varepsilon h \ge \int _{t_0}^{t_0+h}\mathbb {E}\left[ \bar{L}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^h},\theta ^h_s\right) \right] \,\mathrm{d}s + V^*\left( t_0+h, W_{t_0+h}^{t_0,\xi _0,{\varvec{\theta }}^h}\right) . \end{aligned}$$

Again using the condition of local minimality, chain rule (19) and definition of \(\mathcal {H}\), we have the inequality

$$\begin{aligned} \varepsilon h&\ge V^*\left( t_0+h, W_{t_0+h}^{t_0,\xi _0,{\varvec{\theta }}^h}\right) - V^*(t_0,\xi _0) + \int _{t_0}^{t_0+h}\mathbb {E}\left[ \bar{L}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^h},\theta ^h_s\right) \right] \,\mathrm{d}s \nonumber \\&\ge \psi \left( t_0+h, W_{t_0+h}^{t_0,\xi _0,{\varvec{\theta }}^h}\right) - \psi (t_0,\xi _0) + \int _{t_0}^{t_0+h}\mathbb {E}\left[ \bar{L}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^h},\theta ^h_s\right) \right] \,\mathrm{d}s \nonumber \\&= \int _{t_0}^{t_0+h} \partial _t\psi \left( s,W_s^{t_0,\xi _0,{\varvec{\theta }}^h}\right) + \mathbb {E}\left[ D\psi \left( s,W_s^{t_0,\xi _0,{\varvec{\theta }}^h}\right) \cdot \bar{f}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^h},\theta ^h_s\right) \right] \,\mathrm{d}s \nonumber \\&\quad + \int _{t_0}^{t_0+h}\mathbb {E}\left[ \bar{L}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^h},\theta ^h_s\right) \right] \,\mathrm{d}s \nonumber \\&\ge \int _{t_0}^{t_0+h} \partial _t\psi \left( s,W_s^{t_0,\xi _0,{\varvec{\theta }}^h}\right) + \mathcal {H}\left( W_s^{t_0,\xi _0,{\varvec{\theta }}^h}, D\psi \left( s,W_s^{t_0,\xi _0,{\varvec{\theta }}^h}\right) \right) \,\mathrm{d}s \nonumber \\&\ge h \left( \partial _t\psi (t_0,\xi _0) + \mathcal {H}(\xi _0,D\psi (t_0,\xi _0)) - \varepsilon \right) . \end{aligned}$$
(33)

Dividing the inequality (33) by h and taking the limit \(\varepsilon \rightarrow 0\), we obtain the desired supersolution property (26). \(\square \)

Theorem 1 incidentally establishes the existence of a viscosity solution to the HJB equation (20), which we can identify as the value function of the mean-field control problem. We show below that this solution is in fact unique.

Theorem 2

Let \(u_1\) and \(u_2\) be two functions defined on \([0,T]\times \mathcal {P}_2(\mathbb {R}^{d+l})\) such that \(u_1\) and \(u_2\) are viscosity subsolution and supersolution to (20), respectively. Then, \(u_1\le u_2\). Consequently, the value function \(v^*(t,\mu )\) defined in (8) is the unique viscosity solution to the HJB equation (20).

Proof

The final assertion of the theorem follows immediately from Theorem 1. As before, we consider the lifted version \(U_1(t,\xi )=u_1(t,\mathbb {P}_{\xi })\), \(U_2(t,\xi )=u_2(t,\mathbb {P}_{\xi })\) on \([0,T]\times L^2(\Omega ,\mathbb {R}^{d+l})\). By definition, we know \(U_1\) and \(U_2\) are subsolution and supersolution to (24), respectively. By definition of viscosity solution, \(U_1\) and \(U_2\) are both bounded and uniformly continuous. We denote their moduli of continuity by \(\omega _1\) and \(\omega _2\), which satisfy

$$\begin{aligned} |U_i(t,\xi )-U_i(s,\zeta )|\le \omega _i(|t-s|+\Vert \xi -\zeta \Vert _{L^2}), \quad i=1,2 \end{aligned}$$

for all \(0\le t\le s\le T, \xi ,\zeta \in L^2(\Omega ,\mathbb {R}^{d+l})\), and \(\omega _i(r)\rightarrow 0\) as \(r\rightarrow 0^{+}\). To prove \(U_1\le U_2\), we assume

$$\begin{aligned} \delta :=\sup _{[0,T]\times L^2(\Omega ,\mathbb {R}^{d+l})} \left[ U_1(t,\xi )-U_2(t,\xi )\right] > 0, \end{aligned}$$
(34)

and proceed in five steps below to derive a contradiction.

(1) Let \(\sigma ,\varepsilon \in (0,1)\) and construct the auxiliary function

$$\begin{aligned} G(t,s,\xi ,\zeta )=U_1(t,\xi )-U_2(s,\zeta ) + \sigma (t+s)-\varepsilon \left( \Vert \xi \Vert ^2_{L^2}+\Vert \zeta \Vert ^2_{L^2}\right) -\frac{1}{\varepsilon ^2}\left( (t-s)^2+\Vert \xi -\zeta \Vert _{L^2}^2\right) , \end{aligned}$$
(35)

for \(t,s\in [0,T], \xi ,\zeta \in L^2(\Omega ,\mathbb {R}^{d+l})\). By Stegall’s theorem [56], there exist \(\eta _t,\eta _s\in \mathbb {R}\), \(\eta _{\xi },\eta _{\zeta }\in L^2(\Omega ,\mathbb {R}^{d+l})\) such that \(|\eta _t|,|\eta _s|,\Vert \eta _{\xi }\Vert _{L^2},\Vert \eta _{\zeta }\Vert _{L^2}\le \varepsilon \) and the function with linear perturbation

$$\begin{aligned} \tilde{G}(t,s,\xi ,\zeta ):=G(t,s,\xi ,\zeta )-\eta _t t -\eta _s s-\mathbb {E}[\eta _\xi \cdot \xi ] - \mathbb {E}[\eta _{\zeta }\cdot \zeta ] \end{aligned}$$
(36)

has a maximum over \([0,T]\times [0,T] \times L^2(\Omega ,\mathbb {R}^{d+l}) \times L^2(\Omega ,\mathbb {R}^{d+l})\) at \((t_0,s_0,\xi _0,\zeta _0)\).

(2) Since \(\tilde{G}(0,0,0,0) \le \tilde{G}(t_0,s_0,\xi _0,\zeta _0)\) and \(U_1,U_2\) are bounded, after rearranging terms we have

$$\begin{aligned} \varepsilon \left( \Vert \xi _0\Vert _{L^2}^2 + \Vert \zeta _0\Vert _{L^2}^2\right)&\le \, C + \sigma (t_0 + s_0) -\frac{1}{\varepsilon ^2}\left( (t_0-s_0)^2+\Vert \xi _0-\zeta _0\Vert _{L^2}^2\right) -\eta _t t_0 - \eta _s s_0 \nonumber \\&\quad - \mathbb {E}[\eta _\xi \cdot \xi _0] - \mathbb {E}[\eta _{\zeta }\cdot \zeta _0] \nonumber \\&\le \, C - \mathbb {E}[\eta _\xi \cdot \xi _0] - \mathbb {E}[\eta _{\zeta }\cdot \zeta _0] \nonumber \\&\le \, C + \sqrt{2}\,\varepsilon \left( \Vert \xi _0\Vert _{L^2}^2 + \Vert \zeta _0\Vert _{L^2}^2\right) ^{1/2}. \end{aligned}$$
(37)

Here and in the following, C denotes a generic positive constant, whose value may change from line to line but is always independent of \(\varepsilon \) and \(\sigma \). Solving the quadratic inequality above, we get

$$\begin{aligned} \left( \Vert \xi _0\Vert _{L^2}^2 + \Vert \zeta _0\Vert _{L^2}^2\right) ^{1/2} \le C(1+\varepsilon ^{-1/2}). \end{aligned}$$
(38)

Now, arguing in the same way as in (37) and combining with (38), we have

$$\begin{aligned} \frac{1}{\varepsilon ^2}\left( (t_0-s_0)^2+\Vert \xi _0-\zeta _0\Vert _{L^2}^2\right)&\le C - \mathbb {E}[\eta _\xi \cdot \xi _0] - \mathbb {E}[\eta _{\zeta }\cdot \zeta _0] \\&\le C + \sqrt{2}\,\varepsilon \left( \Vert \xi _0\Vert _{L^2}^2 + \Vert \zeta _0\Vert _{L^2}^2\right) ^{1/2} \\&\le C, \end{aligned}$$

or equivalently

$$\begin{aligned} |t_0-s_0| + \Vert \xi _0-\zeta _0\Vert _{L^2} \le C\varepsilon . \end{aligned}$$
(39)

(3) Equation (39) allows us to further sharpen the estimate of \((t_0-s_0)^2+\Vert \xi _0-\zeta _0\Vert _{L^2}^2\). Specifically, since \(\tilde{G}(t_0,t_0,\xi _0,\xi _0) \le \tilde{G}(t_0,s_0,\xi _0,\zeta _0)\), we have

$$\begin{aligned} \eta _{s}(s_0-t_0) + \mathbb {E}[\eta _{\zeta }\cdot (\zeta _0-\xi _0)]&\le \, U_2(t_0,\xi _0)-U_2(s_0,\zeta _0) + \sigma (s_0-t_0) + \varepsilon \left( \Vert \xi _0\Vert _{L^2}^2 - \Vert \zeta _0\Vert _{L^2}^2\right) \\&\quad - \frac{1}{\varepsilon ^2}\left( (t_0-s_0)^2+\Vert \xi _0-\zeta _0\Vert _{L^2}^2\right) . \end{aligned}$$

Rearranging the above inequality and using estimates (38), (39), and uniform continuity of \(U_2\), we obtain

$$\begin{aligned} \frac{1}{\varepsilon ^2}\left( (t_0-s_0)^2+\Vert \xi _0-\zeta _0\Vert _{L^2}^2\right)&\le \, \omega _2(|t_0-s_0|+\Vert \xi _0-\zeta _0\Vert _{L^2}) + C(|t_0-s_0|+\Vert \xi _0-\zeta _0\Vert _{L^2}) \\&\quad + \varepsilon \Vert \xi _0+\zeta _0\Vert _{L^2} \Vert \xi _0 - \zeta _0\Vert _{L^2} \\&\le \, \omega _2(|t_0-s_0|+\Vert \xi _0-\zeta _0\Vert _{L^2}) + C(|t_0-s_0|+\Vert \xi _0-\zeta _0\Vert _{L^2}) \\&\le \, \omega _2(C\varepsilon ) + C\varepsilon . \end{aligned}$$

By the property of the modulus of continuity, we conclude

$$\begin{aligned} |t_0-s_0| + \Vert \xi _0-\zeta _0\Vert _{L^2} = o(\varepsilon ). \end{aligned}$$
(40)

(4) From the definitions of \(\tilde{G}\) and \(\delta \), we can choose \(\varepsilon \) so small that

$$\begin{aligned} \sup _{[0,T]\times L^2(\Omega ,\mathbb {R}^{d+l})}\tilde{G}(t,t,\xi ,\xi )\ge \frac{\delta }{2}. \end{aligned}$$

Using estimates (38) and (40), we can furthermore choose \(\sigma , \varepsilon \) small enough such that

$$\begin{aligned} U_1(t_0,\xi _0)-U_2(s_0,\zeta _0)\,&\ge \, \tilde{G}(t_0,s_0,\xi _0,\zeta _0) - C\sigma - C\varepsilon \\&\ge \, \sup _{[0,T]\times L^2(\Omega ,\mathbb {R}^{d+l})}\tilde{G}(t,t,\xi ,\xi ) - \frac{\delta }{4} \\&\ge \, \frac{\delta }{4}. \end{aligned}$$

Noting the terminal condition \(U_1(T,\xi )\le U_2(T,\xi )\), we are ready to estimate \(|T-t_0|\) through

$$\begin{aligned} \frac{\delta }{4} \le \,&U_1(t_0,\xi _0)-U_2(s_0,\zeta _0) \\ \le \,&U_1(t_0,\xi _0)-U_1(T,\xi _0) + U_1(T,\xi _0)-U_2(T,\xi _0) \\&+ U_2(T,\xi _0)-U_2(t_0,\xi _0) + U_2(t_0,\xi _0)-U_2(s_0,\zeta _0) \\ \le \,&\omega _1(|T-t_0|) + \omega _2(|T-t_0|) + \omega _2(|t_0-s_0| + \Vert \xi _0-\zeta _0\Vert _{L^2}) \\ =\,&\omega _1(|T-t_0|) + \omega _2(|T-t_0|) + \omega _2(o(\varepsilon )). \end{aligned}$$

Therefore, when \(\varepsilon \) is small enough, we have

$$\begin{aligned} \omega _1(|T-t_0|) + \omega _2(|T-t_0|) \ge \frac{\delta }{8}, \end{aligned}$$

which implies

$$\begin{aligned} |T-t_0|\ge \lambda > 0, \end{aligned}$$

for some positive constant \(\lambda \), provided \(\sigma ,\varepsilon \) are small enough. The same argument also gives \(|T-s_0|\ge \lambda > 0\).

(5) The finite distances between \(t_0,s_0\) and T finally allow us to employ the viscosity property. Observe that the map \((t,\xi )\mapsto \tilde{G}(t,s_0,\xi ,\zeta _0)\) attains a maximum at \((t_0,\xi _0)\), i.e., \(U_1-\psi \) has a maximum at \((t_0,\xi _0)\) for

$$\begin{aligned} \psi (t,\xi ):=&\,U_2(s_0,\zeta _0) - \sigma (t + s_0) + \varepsilon \left( \Vert \xi \Vert _{L^2}^2+\Vert \zeta _0\Vert _{L^2}^2\right) +\frac{1}{\varepsilon ^2}\left( (t-s_0)^2+\Vert \xi -\zeta _0\Vert _{L^2}^2\right) \\&+ \eta _t t + \eta _s s_0 + \mathbb {E}[\eta _\xi \cdot \xi ] + \mathbb {E}[\eta _{\zeta }\cdot \zeta _0]. \end{aligned}$$

Since \(U_1\) is a viscosity subsolution, using the subsolution property (25), we have

$$\begin{aligned} -\sigma + \frac{2(t_0-s_0)}{\varepsilon ^2} + \eta _t + \mathcal {H}\left( \xi _0,2\varepsilon \xi _0 + \frac{2(\xi _0-\zeta _0)}{\varepsilon ^2} + \eta _{\xi }\right) \ge 0. \end{aligned}$$
(41)

In the same way, the map \((s,\zeta )\mapsto -\tilde{G}(t_0,s,\xi _0,\zeta )\) attains a minimum at \((s_0,\zeta _0)\), i.e., \(U_2-\psi \) has a minimum at \((s_0,\zeta _0)\) for

$$\begin{aligned} \psi (s,\zeta ):=&\,U_1(t_0,\xi _0) + \sigma (t_0 + s) - \varepsilon (\Vert \xi _0\Vert _{L^2}^2+\Vert \zeta \Vert _{L^2}^2) - \frac{1}{\varepsilon ^2}((t_0-s)^2+\Vert \xi _0-\zeta \Vert _{L^2}^2) \\&- \eta _t t_0 - \eta _s s - \mathbb {E}[\eta _\xi \cdot \xi _0] - \mathbb {E}[\eta _{\zeta }\cdot \zeta ]. \end{aligned}$$

Since \(U_2\) is a viscosity supersolution, using the supersolution property (26), we have

$$\begin{aligned} \sigma + \frac{2(t_0-s_0)}{\varepsilon ^2} - \eta _s + \mathcal {H}\left( \zeta _0,-2\varepsilon \zeta _0 + \frac{2(\xi _0-\zeta _0)}{\varepsilon ^2} - \eta _{\zeta }\right) \le 0. \end{aligned}$$
(42)

Subtracting (42) from (41) gives

$$\begin{aligned} -2\sigma + \eta _t + \eta _s + \mathcal {H}\left( \xi _0,2\varepsilon \xi _0 + \frac{2(\xi _0-\zeta _0)}{\varepsilon ^2} + \eta _{\xi }\right) - \mathcal {H}\left( \zeta _0,-2\varepsilon \zeta _0 + \frac{2(\xi _0-\zeta _0)}{\varepsilon ^2} - \eta _{\zeta }\right) \ge 0. \end{aligned}$$

Using estimates (38), (40) and Lemma 1, we have

$$\begin{aligned} 2\sigma&\le \eta _t + \eta _s + \mathcal {H}\left( \xi _0, 2\varepsilon \xi _0 + \frac{2(\xi _0-\zeta _0)}{\varepsilon ^2} + \eta _{\xi }\right) - \mathcal {H}\left( \zeta _0, -2\varepsilon \zeta _0 + \frac{2(\xi _0-\zeta _0)}{\varepsilon ^2} - \eta _{\zeta }\right) \\&\le 2\varepsilon + \left| \mathcal {H}\left( \zeta _0, -2\varepsilon \zeta _0 + \frac{2(\xi _0-\zeta _0)}{\varepsilon ^2} - \eta _{\zeta }\right) - \mathcal {H}\left( \zeta _0, 2\varepsilon \xi _0 + \frac{2(\xi _0-\zeta _0)}{\varepsilon ^2} + \eta _{\xi }\right) \right| \\&\quad + \left| \mathcal {H}\left( \zeta _0, 2\varepsilon \xi _0 + \frac{2(\xi _0-\zeta _0)}{\varepsilon ^2} + \eta _{\xi }\right) - \mathcal {H}\left( \xi _0,2\varepsilon \xi _0 + \frac{2(\xi _0-\zeta _0)}{\varepsilon ^2} + \eta _{\xi }\right) \right| \\&\le 2\varepsilon + K_L \Vert 2\varepsilon \xi _0 + 2\varepsilon \zeta _0 + \eta _{\xi } + \eta _{\zeta } \Vert _{L^2} \\&\quad + K_L\left( 1 + \Vert 2\varepsilon \xi _0 + \frac{2(\xi _0-\zeta _0)}{\varepsilon ^2} + \eta _{\xi }\Vert _{L^2}\right) \Vert \xi _0 - \zeta _0 \Vert _{L^2} \\&\le o(1) \quad (\varepsilon \rightarrow 0^+). \end{aligned}$$

Therefore, letting \(\varepsilon \rightarrow 0^+\) gives the contradiction \(0 < \sigma \le 0\), which completes the proof. \(\square \)

Theorems 1 and 2 establish the well-posedness, in the viscosity sense, of the HJB equation and identify the value function for the mean-field optimal control problem as its unique solution. Moreover, they provide us (through solving the infimum in (20) after solving for the value function) with an optimal control policy, from which we can synthesize an optimal control as the solution of our learning problem. In this sense, the HJB equation gives us a necessary and sufficient condition for optimality of the learning problem (3). This demonstrates an essential observation from the mean-field optimal control viewpoint of deep learning: the population risk minimization problem of deep learning can be viewed as a variational problem, whose solution can be characterized by a suitably defined Hamilton–Jacobi–Bellman equation. This very much parallels classical calculus of variations.

It is worth noting that the HJB equation is a global characterization of the value function, in the sense that it must in principle be solved over the entire space \(\mathcal {P}_2(\mathbb {R}^{d+l})\) of input–target distributions. Of course, we would not expect this to be feasible in practice for any non-trivial machine learning problem. However, if we can solve it locally around some trajectories generated by the initial condition \(\mu _0 \in \mathcal {P}_2(\mathbb {R}^{d+l})\), then we would expect the obtained feedback control policy to apply to nearby input–target distributions as well. This may provide a principled way to perform transfer or one-shot learning [28, 42, 50].

Finally, observe that if the Hamiltonian defined in (23) is attained by a unique minimizer \(\theta ^*\in \Theta \) given any \(\xi \in L^2(\Omega ,\mathbb {R}^{d+l})\) and \(P\in L^2(\Omega ,\mathbb {R}^{d+l})\), then the uniqueness of value function immediately implies the uniqueness of the open-loop optimal control, which is sometimes a desired property of the population risk minimization problem. The following example gives such an instance.

Example 1

Consider a specific type of residual network, where \(f(x,\theta ) = \theta \sigma (x)\) and \(L(x,\theta )\propto \Vert \theta \Vert ^2\). Here, \(\theta \in \mathbb {R}^{d\times d}\) is a matrix and \(\sigma \) is a smooth and bounded nonlinearity, e.g., tanh or sigmoid. This is similar to conventional residual neural networks except that the order of the affine transformation and the nonlinearity is swapped. In this case, the Hamiltonian defined in (23) admits a unique minimizer \(\theta ^*\) given any \(\xi \in L^2(\Omega ,\mathbb {R}^{d+l})\) and \(P\in L^2(\Omega ,\mathbb {R}^{d+l})\).
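To make the closed form concrete, write \(L(x,\theta )=\lambda \Vert \theta \Vert _F^2\) with \(\lambda >0\) the proportionality constant, and restrict attention to the x-components of \(\xi \) and P. The first-order condition for minimizing \(\mathbb {E}[P\cdot \theta \sigma (\xi )]+\lambda \Vert \theta \Vert _F^2\) over matrices \(\theta \) gives the unique minimizer \(\theta ^*=-\tfrac{1}{2\lambda }\mathbb {E}[P\,\sigma (\xi )^\top ]\). The following Monte Carlo sketch checks this; all concrete values are illustrative choices of ours, not from the paper:

```python
import numpy as np

# Monte Carlo check (illustrative, not the paper's code) that for
# f(x, theta) = theta @ sigma(x) and L(x, theta) = lam * ||theta||_F^2,
# the lifted Hamiltonian  E[P . theta sigma(x)] + lam ||theta||_F^2
# has the unique closed-form minimizer theta* = -E[P sigma(x)^T] / (2 lam).
rng = np.random.default_rng(0)
d, n_samples, lam = 3, 1000, 0.5
x = rng.normal(size=(n_samples, d))   # samples of the x-part of xi
p = rng.normal(size=(n_samples, d))   # samples of the x-part of the costate P
sigma = np.tanh                       # smooth bounded nonlinearity

def hamiltonian(theta):
    # Monte Carlo estimate of E[P . theta sigma(x)] + lam * ||theta||_F^2
    return np.mean(np.einsum('ni,ij,nj->n', p, theta, sigma(x))) \
        + lam * np.sum(theta ** 2)

# closed-form minimizer from the first-order condition
theta_star = -np.einsum('ni,nj->ij', p, sigma(x)) / (n_samples * 2 * lam)

# theta* strictly beats random competitors (strict convexity in theta)
for _ in range(5):
    theta_alt = theta_star + 0.1 * rng.normal(size=(d, d))
    assert hamiltonian(theta_star) < hamiltonian(theta_alt)
```

Since the objective is a strictly convex quadratic in \(\theta \), the minimizer is unique, which is exactly the property Example 1 asserts.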

6 Mean-field Pontryagin’s maximum principle

As discussed in the earlier sections, the HJB equation provides us with a complete characterization of the optimality conditions for the population risk minimization problem (3). However, it has the disadvantage that it is global in \(\mathcal {P}(\mathbb {R}^{d+l})\) (or its lifted version, in \(L^2(\Omega , \mathbb {R}^{d+l})\)) and hence difficult to handle in practice. The natural question is whether we can have a local characterization of optimality, by which we mean an optimality condition that does not depend on the whole space of input–target distributions. In this section, we provide such a characterization by proving a mean-field version of the celebrated Pontryagin's maximum principle (PMP) [7]. Although the two approaches may seem disparate at first, we will show in Sect. 6.1 that the maximum principle is intimately connected with the dynamic programming approach introduced earlier.

In classical optimal control, such a local characterization is given in the form of Pontryagin's maximum principle, where forward and backward Hamiltonian dynamics are coupled through a maximization condition. In the present formulation, a common control parameter is shared by all values \((x_0,y_0)\) that the input–target pair can take under the distribution \(\mu _0\). Thus, one expects a maximum principle to hold in an averaged sense. Let us state and prove such a maximum principle below. We modify the assumptions (A1), (A2) to

(A1\('\)):

The function f is bounded; \(f, L\) are continuous in \(\theta \); and \(f,L,\Phi \) are continuously differentiable with respect to x.

(A2\('\)):

The distribution \(\mu _0\) has bounded support in \(\mathbb {R}^d\times \mathbb {R}^l\), i.e., there exists \(M>0\) such that \(\mu _0(\{ (x,y) \in \mathbb {R}^d \times \mathbb {R}^l : \Vert x \Vert + \Vert y \Vert \le M \}) = 1\).

Theorem 3

(Mean-field PMP) Let (A1\('\)), (A2\('\)) be satisfied and \({\varvec{\theta }}^*\in L^\infty ([0,T],\Theta )\) be a solution of (3) in the sense that \(J({\varvec{\theta }}^*)\) attains the infimum. Then, there exist absolutely continuous stochastic processes \({\varvec{x}}^*,{\varvec{p}}^*\) such that

$$\begin{aligned}&\dot{x}^*_t = f(x^*_t, \theta ^*_t),&x^*_0 = x_0, \end{aligned}$$
(43)
$$\begin{aligned}&\dot{p}^*_t = - \nabla _x H(x^*_t, p^*_t, \theta ^*_t),&p^*_T = -\nabla _x \Phi (x^*_T, y_0), \end{aligned}$$
(44)
$$\begin{aligned}&\mathbb {E}_{\mu _0} H(x^*_t, p^*_t, \theta ^*_t) \ge \mathbb {E}_{\mu _0} H(x^*_t, p^*_t, \theta ),&\forall \,\theta \in \Theta , \quad a.e.\,t\in [0,T], \end{aligned}$$
(45)

where the Hamiltonian function \(H: \mathbb {R}^d \times \mathbb {R}^d \times \Theta \rightarrow \mathbb {R}\) is given by

$$\begin{aligned} H(x,p,\theta ) = p\cdot f(x, \theta ) - L(x, \theta ). \end{aligned}$$
(46)

Proof

To simplify the proof, we first make a substitution by introducing a new coordinate \(x^0\) satisfying the dynamics \(\dot{x}^0_t = L(x_t,\theta _t)\) with \(x^0_0=0\). Then, it is clear that the PMP above can be transformed into one without running loss by redefining

$$\begin{aligned} x \rightarrow (x^0, x), \quad f \rightarrow (L, f), \quad \Phi (x_T, y_0)&\rightarrow \Phi (x_T, y_0) + x^0_T. \end{aligned}$$

One can check that (A1\('\)), (A2\('\)) are preserved under this transformation, so we may consider, without loss of generality, the case \(L\equiv 0\).

Let some \(\tau \in (0,T]\) be a Lebesgue point of \(\hat{f}(t):=f(x^*_t,\theta ^*_t)\). By assumptions (A1\('\)) and (A2\('\)), these points are dense in [0, T]. Now, for \(\epsilon \in (0,\tau )\), define the family of perturbed controls

$$\begin{aligned} \theta ^{\tau ,\epsilon }_t = {\left\{ \begin{array}{ll} \omega &{} t\in [\tau -\epsilon , \tau ], \\ \theta ^*_t &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

where \(\omega \in \Theta \). This is a “needle” perturbation. Accordingly, define \(x^{\tau ,\epsilon }_t\) by

$$\begin{aligned} x^{\tau ,\epsilon }_t = x_0 + \int _{0}^{t} f(x^{\tau ,\epsilon }_s, \theta ^{\tau ,\epsilon }_s) \mathrm{d}s. \end{aligned}$$

i.e., the solution of the forward propagation equation under the perturbed control \(\theta ^{\tau ,\epsilon }\). It is clear that \(x^*_t = x^{\tau ,\epsilon }_t\) for every \(t<\tau -\epsilon \) and every \(x_0\), since the perturbation is not yet active. At \(t=\tau \), we have

$$\begin{aligned} \frac{1}{\epsilon } (x^{\tau ,\epsilon }_\tau - x^*_\tau )&= \frac{1}{\epsilon } \int _{\tau -\epsilon }^{\tau } f(x^{\tau ,\epsilon }_s,\omega ) - f(x^*_s,\theta ^*_s) \mathrm{d}s. \end{aligned}$$

Since \(\tau \) is a Lebesgue point of \(\hat{f}\), we have

$$\begin{aligned} v_\tau :=\lim _{\epsilon \downarrow 0} \frac{1}{\epsilon } (x^{\tau ,\epsilon }_\tau - x^*_\tau ) = f(x^{*}_\tau ,\omega ) - f(x^*_\tau ,\theta ^*_\tau ). \end{aligned}$$

Here, \(v_\tau \) represents the leading order perturbation on the state due to the “needle” perturbation introduced in the infinitesimal interval \([\tau -\epsilon ,\tau ]\). For the rest of the time interval \((\tau ,T]\), the dynamics remain the same since the controls are the same. It remains to compute how the perturbation \(v_\tau \) propagates. Define for \(t\ge \tau \), \(v_{t}^\epsilon :=\frac{1}{\epsilon } (x^{\tau ,\epsilon }_t - x^*_t)\) and \(v_t :=\lim _{\epsilon \downarrow 0} v^\epsilon _t\). By Theorem 2.3.1 of [9], we know that \(v_t\) is well defined for almost every t (all the Lebesgue points of the map \(t\mapsto x^*(t)\)) and satisfies the following linearized equation:

$$\begin{aligned} \begin{aligned} \dot{v}_t&= \nabla _x f(x^*_t, \theta ^*_t)^T v_t, \qquad t\in (\tau ,T], \\ v_\tau&= f(x^*_\tau ,\omega ) - f(x^*_\tau ,\theta ^*_\tau ). \end{aligned} \end{aligned}$$
(47)
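The role of the linearized equation (47) can be checked numerically: the finite-difference quotient \((x^{\tau ,\epsilon }_T-x^*_T)/\epsilon \) should agree with \(v_T\) obtained by propagating \(v_\tau \) through (47). The sketch below uses a toy scalar model \(f(x,\theta )=\tanh (\theta x)\) with forward Euler; every concrete value is an illustrative choice of ours, not from the paper:

```python
import numpy as np

# Toy check of the linearized ("variational") equation (47): the finite-
# difference needle perturbation (x_T^{tau,eps} - x_T^*) / eps should agree
# with v_T.  Scalar model f(x, theta) = tanh(theta x).
f = lambda x, th: np.tanh(th * x)
df_dx = lambda x, th: th / np.cosh(th * x) ** 2

T, dt = 1.0, 1e-3
steps = int(round(T / dt))
tau_k, m = steps // 2, 10             # needle ends at tau, width eps = m * dt
theta_star, omega, x_init = 1.0, 2.0, 0.5
eps = m * dt

def forward(needle):
    # forward Euler, optionally with the needle-perturbed control
    x = x_init
    for k in range(steps):
        th = omega if (needle and tau_k - m <= k < tau_k) else theta_star
        x += dt * f(x, th)
    return x

v_fd = (forward(True) - forward(False)) / eps

# propagate v_tau = f(x*_tau, omega) - f(x*_tau, theta*) through (47)
x, v = x_init, 0.0
for k in range(steps):
    if k == tau_k:
        v = f(x, omega) - f(x, theta_star)
    if k >= tau_k:
        v += dt * df_dx(x, theta_star) * v
    x += dt * f(x, theta_star)

assert abs(v_fd - v) < 0.1 * abs(v)   # agreement up to O(eps) + O(dt)
```

The mismatch between the two quantities vanishes as \(\epsilon \) and the step size shrink, which is exactly the content of (47).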

In particular, \(v_T\) represents the perturbation of the final state introduced by this control. By the optimality assumption on \({\varvec{\theta }}^*\), we must have

$$\begin{aligned} \mathbb {E}_{\mu _0} \Phi \left( x^{\tau ,\epsilon }_T, y_0\right) \ge \mathbb {E}_{\mu _0} \Phi \left( x^*_T, y_0\right) . \end{aligned}$$

Assumptions (A1\('\)) and (A2\('\)) imply \(\nabla _x \Phi \) is bounded, so by dominated convergence theorem,

$$\begin{aligned} \begin{aligned} 0&\le \lim _{\epsilon \downarrow 0} \frac{1}{\epsilon } \mathbb {E}_{\mu _0} \left[ \Phi \left( x^{\tau ,\epsilon }_T,y_0\right) - \Phi \left( x^*_T,y_0\right) \right] \\&= \mathbb {E}_{\mu _0} \frac{\mathrm{d}}{\mathrm{d}\epsilon } \Phi \left( x^{\tau ,\epsilon }_T, y_0\right) \Big \vert _{\epsilon =0^+} \\&= \mathbb {E}_{\mu _0} \nabla _x \Phi (x^*_T, y_0) \cdot v_T . \end{aligned} \end{aligned}$$
(48)

Now, let us define \({\varvec{p}}^*\) to be the solution of the adjoint of Eq. (47),

$$\begin{aligned} \dot{p}^*_t = - \nabla _x f(x^*_t, \theta ^*_t) p^*_t, \quad p^*_T = -\nabla _x \Phi (x^*_T,y_0). \end{aligned}$$

Then, (48) implies \(\mathbb {E}_{\mu _0} p^*_T \cdot v_T \le 0\). Moreover, we have

$$\begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t} (p^*_t \cdot v_t) = \dot{p}^*_t \cdot v_t + \dot{v}_t \cdot p^*_t = 0 \end{aligned}$$

for all \(t\in [\tau ,T]\). Thus, we must have \(\mathbb {E}_{\mu _0} p^*_t\cdot v_t = \mathbb {E}_{\mu _0} p^*_T\cdot v_T \le 0\) for all \(t\in [\tau ,T]\) and so for \(t=\tau \) [with initial condition in (47)],

$$\begin{aligned} \mathbb {E}_{\mu _0} p^*_\tau \cdot f(x^*_\tau ,\theta ^*_\tau ) \ge \mathbb {E}_{\mu _0} p^*_\tau \cdot f(x^*_\tau ,\omega ). \end{aligned}$$

Since \(\omega \in \Theta \) is arbitrary, this completes the proof by recalling that \(H(x,p,\theta ) = p\cdot f(x,\theta )\). \(\square \)
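The conservation step \(\frac{\mathrm{d}}{\mathrm{d}t}(p^*_t\cdot v_t)=0\) used above is easy to verify numerically; in fact, with forward Euler for v and the matching backward Euler for the adjoint, the product is conserved exactly at the discrete level. A sketch on a toy scalar model (all concrete choices are illustrative, not from the paper):

```python
import numpy as np

# Toy check (L = 0, scalar dynamics, illustrative constants) that
# p_t * v_t is constant in t when v solves the linearized equation (47)
# and p solves its adjoint.
f = lambda x, th: np.tanh(th * x)
df_dx = lambda x, th: th / np.cosh(th * x) ** 2

T, steps, theta = 1.0, 4000, 0.8
dt = T / steps
xs = [0.4]
for _ in range(steps):                         # forward state trajectory
    xs.append(xs[-1] + dt * f(xs[-1], theta))

vs = [1.0]                                     # v forward from arbitrary v_0
for k in range(steps):
    vs.append(vs[-1] + dt * df_dx(xs[k], theta) * vs[-1])

ps = [0.0] * (steps + 1)                       # p backward from p_T
ps[-1] = -2.0 * xs[-1]
for k in reversed(range(steps)):
    ps[k] = ps[k + 1] + dt * df_dx(xs[k], theta) * ps[k + 1]

prods = np.array(ps) * np.array(vs)
assert np.max(np.abs(prods - prods[0])) < 1e-9 * abs(prods[0])
```

The discrete exactness comes from the fact that \(p_k = (1+\Delta t\, a_k)p_{k+1}\) is precisely the adjoint of the forward step \(v_{k+1} = (1+\Delta t\, a_k)v_k\), so \(p_k v_k = p_{k+1} v_{k+1}\) identically: the discrete counterpart of \(\frac{\mathrm{d}}{\mathrm{d}t}(p^*_t\cdot v_t)=0\).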

Remark 2

In fact, one can show, under slightly stronger conditions (bounded first partial derivatives), that \(\mathbb {E}_{\mu _0} H(x^*_t,p^*_t,\theta ^*_t)\) is constant in time, using standard techniques (see e.g.,  Sec. 4.2.9 of [46]).

Let us now discuss the mean-field PMP. First, notice that it is a necessary condition and hence is much weaker than the HJB characterization. Also, the PMP refers only to the open-loop control process \({\varvec{\theta }}\), with no explicit reference to an optimal control policy. Since the PMP is a necessary condition, we should discuss its relationship with classical necessary conditions in optimization. Equation (43) is simply the feed-forward ODE (2) under the optimal parameters \({\varvec{\theta }}^*\). On the other hand, Eq. (44) defines the evolution of the co-state \(p_t^*\). To draw an analogy with constrained optimization, the co-state can be regarded as a Lagrange multiplier which enforces the ODE constraint (2). However, as in the proof of Theorem 3, it may be more natural to interpret it as the evolution of an adjoint variational condition backwards in time. The Hamiltonian maximization condition (45) is a distinctive feature of PMP-type statements, in that it does not characterize optimality in terms of the vanishing of first-order partial derivatives, as is the case in usual first-order optimality conditions. Instead, optimal solutions must globally maximize the Hamiltonian function. This feature allows greater applicability, since we can also deal with the case where the dynamics are not differentiable with respect to the controls/training weights, or where the optimal controls/training weights lie on the boundary of the set \(\Theta \). Moreover, the usual first-order optimality conditions and the celebrated back-propagation algorithm can be readily derived from the PMP; see [43]. We note that, compared with classical statements of the PMP [54], the main difference in our result is the presence of the expectation over \(\mu _0\) in the Hamiltonian maximization condition (45). This is to be expected, since the mean-field optimal control must depend on the distribution of input–target pairs.

We conclude the discussion by noting that the PMP above can be written more compactly as follows. For each control process \({\varvec{\theta }}\in L^\infty ([0,T],\Theta )\), denote by \({\varvec{x}}^{\varvec{\theta }}:=\{ x^{\varvec{\theta }}_t : 0\le t\le T\}\) and \({\varvec{p}}^{\varvec{\theta }}:=\{ p^{\varvec{\theta }}_t : 0\le t\le T\}\) the solution of the Hamilton’s equations (43) and (44) using this control with the random variables \((x_0,y_0)\sim \mu _0\), i.e.,

$$\begin{aligned} \begin{aligned}&\dot{x}^{\varvec{\theta }}_t = f(x^{\varvec{\theta }}_t, \theta _t), \qquad x^{\varvec{\theta }}_0 = x_0, \\&\dot{p}^{\varvec{\theta }}_t = -\nabla _x H(x^{\varvec{\theta }}_t, p^{\varvec{\theta }}_t, \theta _t), \qquad p^{\varvec{\theta }}_T = -\nabla _x \Phi (x^{\varvec{\theta }}_T, y_0). \end{aligned} \end{aligned}$$
(49)

Then, \({\varvec{\theta }}^*\) satisfies the PMP if and only if

$$\begin{aligned} \mathbb {E}_{\mu _0} H(x^{{\varvec{\theta }}^*}_t, p^{{\varvec{\theta }}^*}_t, \theta ^*_t) \ge \mathbb {E}_{\mu _0} H(x^{{\varvec{\theta }}^*}_t, p^{{\varvec{\theta }}^*}_t, \theta ), \quad \forall \,\theta \in \Theta , \quad a.e.\,t\in [0,T]. \end{aligned}$$
(50)

Furthermore, observe that the mean-field PMP derived above includes, as a special case, the necessary conditions for optimality for the sampled optimal control problem (4). To see this, simply define the empirical measure \(\mu ^N_0 :=\frac{1}{N} \sum _{i=1}^{N} \delta _{(x^i_0,y^i_0)}\) and apply the mean-field PMP (Theorem 3) with \(\mu ^N_0\) in place of \(\mu _0\) to give

$$\begin{aligned} \frac{1}{N} \sum _{i=1}^{N} H\left( x^{{\varvec{\theta }}^*,i}_t, p^{{\varvec{\theta }}^*,i}_t, \theta ^*_t\right) \ge \frac{1}{N}\sum _{i=1}^N H\left( x^{{\varvec{\theta }}^*,i}_t, p^{{\varvec{\theta }}^*,i}_t, \theta \right) , \quad \forall \,\theta \in \Theta , \end{aligned}$$
(51)

where \({\varvec{x}}^{{\varvec{\theta }},i}\) and \({\varvec{p}}^{{\varvec{\theta }},i}\) are defined as in (49), but with the input–target pair \((x^i_0,y^i_0)\). Of course, since \(\mu ^N_0\) is a random measure, this is a random equation whose solutions are random variables.
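Given sampled pairs, the two-point system (49) can be integrated directly: a forward sweep for \({\varvec{x}}\), then a backward sweep for \({\varvec{p}}\), after which the averaged Hamiltonian in (51) is a Monte Carlo mean. A minimal Euler sketch, with a toy model of our own choosing (\(f(x,\theta )=\tanh (\theta x)\), \(L\equiv 0\), \(\Phi (x,y)=(x-y)^2\); nothing below is from the paper):

```python
import numpy as np

# Toy Euler integration of the sampled Hamilton equations (49) over an
# empirical measure of N input-target pairs.
rng = np.random.default_rng(1)
N, T, steps = 256, 1.0, 200
dt = T / steps
x0 = rng.normal(size=N)              # input samples x_0^i
y0 = np.sin(x0)                      # target samples y_0^i
theta = 0.3 * np.ones(steps)         # a fixed (not necessarily optimal) control

f = lambda x, th: np.tanh(th * x)
df_dx = lambda x, th: th / np.cosh(th * x) ** 2

# forward sweep: x_{k+1} = x_k + dt * f(x_k, theta_k)
xs = np.empty((steps + 1, N)); xs[0] = x0
for k in range(steps):
    xs[k + 1] = xs[k] + dt * f(xs[k], theta[k])

# backward sweep: dp/dt = -p df/dx,  p_T = -grad_x Phi = -2 (x_T - y_0)
ps = np.empty((steps + 1, N)); ps[-1] = -2.0 * (xs[-1] - y0)
for k in reversed(range(steps)):
    ps[k] = ps[k + 1] + dt * ps[k + 1] * df_dx(xs[k], theta[k])

# sampled averaged Hamiltonian (1/N) sum_i H(x_t^i, p_t^i, theta), cf. (51)
avg_H = lambda k, th: np.mean(ps[k] * f(xs[k], th))
assert np.isfinite(avg_H(steps // 2, 0.3))
```

With the sweeps in hand, checking (51) for a candidate control amounts to maximizing `avg_H(k, .)` at each time index k.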

6.1 Connection between the HJB equation and the PMP

We now discuss some concrete connections between the HJB equation and the PMP, thus justifying our claim that the PMP can be understood as a local result compared to the global characterization of the HJB equation.

It should be noted that the Hamiltonian defined in Pontryagin’s maximum principle (46) is different from (23) in the HJB equation, due to different sign conventions in these two approaches of classical optimal control. We choose to keep this difference so that readers familiar with classical control theory can draw an analogy easily. Nevertheless, if one replaces \(p, L, f\) in (46) by \(-P,-\bar{L}, \bar{f}\), respectively, and takes the infimum over \(\Theta \) instead of the maximum condition in (45), one formally obtains the negative of (23).

Now, our goal is to show that the HJB and PMP are more intimately connected than it appears in the definition of Hamiltonian. The deeper connections originate from the link between Hamilton’s canonical equations (ODEs) and Hamilton–Jacobi equations (PDEs), of which we give an informal description as follows.

First, note that although the Hamiltonian dynamics (43) and (44) describe the trajectory of particular random variables (completely determined by \((x_0,y_0)\)), the optimality conditions do not depend on the particular representation of the probability measures by these random variables. In other words, we could also formulate a maximum principle whose Hamiltonian flow acts on measures in a Wasserstein space, of which the above PMP can be seen as a “lifting.” This approach would parallel the developments in the previous sections on the HJB equation. However, here we choose to establish and analyze the PMP in the lifted space due to the simplicity of having well-defined evolution equations. The corresponding evolution of measures would require more technical analysis while not being particularly more illuminating. Instead, we shall establish the connections by also lifting the HJB equation into \(L^2(\Omega ,\mathbb {R}^{d+l})\).

Consider the lifted HJB equation (24) in \(L^2(\Omega ,\mathbb {R}^{d+l})\). The key observation is that we can apply the method of characteristics (see e.g., Ch. 3.2 of [24] or [57]) by defining \(P_t=DV(t,\xi _t)\) and write down the characteristic evolution equations:

$$\begin{aligned} {\left\{ \begin{array}{ll} \dot{\xi }_t = D_{P}\mathcal {H}(\xi _t,P_t), \\ \dot{P}_t = -D_{\xi }\mathcal {H}(\xi _t,P_t). \end{array}\right. } \end{aligned}$$
(52)

Suppose this system has a solution satisfying boundary conditions \(\mathbb {P}_{\xi _0}=\mu _0, P_T=\nabla _w\bar{\Phi }(\xi _T)\), where the second condition comes from the terminal condition of (24). To avoid technicalities, we further assume that the infimum in (23) is attained at \(\theta ^{\dagger }(\xi ,P)\), which is always an interior point of \(\Theta \). Hence, (23) can be explicitly written down as

$$\begin{aligned} \mathcal {H}= \mathbb {E}[P\cdot \bar{f}(\xi ,\theta ^{\dagger }(\xi ,P))+\bar{L}(\xi ,\theta ^{\dagger }(\xi ,P))], \end{aligned}$$

and by first-order condition we have

$$\begin{aligned} \mathbb {E}\left[ \nabla _{\theta }\bar{f}(\xi ,\theta ^{\dagger }(\xi ,P))P + \nabla _{\theta }\bar{L}(\xi ,\theta ^{\dagger }(\xi ,P))\right] =0. \end{aligned}$$

Plugging the above two equalities into (52) gives us

$$\begin{aligned} {\left\{ \begin{array}{ll} \dot{\xi }_t = \bar{f}(\xi _t,\theta ^{\dagger }(\xi _t,P_t)), \\ \dot{P}_t = -\nabla _w\bar{f}(\xi _t,\theta ^{\dagger }(\xi _t,P_t))P_t - \nabla _w\bar{L}(\xi _t,\theta ^{\dagger }(\xi _t,P_t)). \end{array}\right. } \end{aligned}$$

Let \(\theta ^*_t=\theta ^{\dagger }(\xi _t,P_t)\). Note that \(w=(x,y)\) is the concatenated variable and the last l components of \(\bar{f}\) are zero. If we only consider the first d components, then we can deduce the d-dimensional dynamical system in \(L^2(\Omega ,\mathbb {R}^d)\):

$$\begin{aligned} {\left\{ \begin{array}{ll} \dot{x}_t = f(x_t,\theta _t^*), \\ \dot{p}_t = -\nabla _xf(x_t,\theta ^*_t)p_t - \nabla _x L(x_t,\theta ^*_t). \end{array}\right. } \end{aligned}$$
(53)

If we make the transformation \(p\rightarrow -p\) in Theorem 3, it is straightforward to see that the deduced dynamical system by Theorem 3 satisfies (53) in \(L^2(\Omega ,\mathbb {R}^d)\) and the boundary conditions are matched.

In summary, Hamilton's equations (53) in the PMP can be viewed as the characteristic equations of the HJB equation (24). Consequently, the PMP pinpoints the necessary conditions that a characteristic of the HJB equation originating from (a random variable with law) \(\mu _0\) must satisfy. This justifies the preceding claim that the PMP constitutes a local optimality condition, as compared with the HJB equation.

7 Small-time uniqueness

As discussed, the PMP constitutes necessary conditions for optimality. A natural question is when PMP solutions are also sufficient for optimality (see Ch. 8 of [9] for some discussion of sufficiency). One simple case where the PMP is sufficient, assuming an optimal solution exists, is when its equations admit a unique solution. In this section, we investigate the uniqueness properties of the PMP system.

Note that even if the Hamiltonian maximization \(\mathop {{{\,\mathrm{arg\,max}\,}}}\nolimits _\theta \mathbb {E}_{(x,p)\sim \nu }H(x,p,\theta )\) admits a unique solution \(\theta ^{\dagger }(\nu )\) for every joint law \(\nu =\mathbb {P}_{(x,p)}\), Eq. (49) is still a highly nonlinear two-point boundary value problem for \({\varvec{x}}^*, {\varvec{p}}^*\), further coupled with their laws. Even without the coupling to laws, such two-point boundary value problems are known not to have unique solutions in general (see e.g., Ch. 7 of [37]). In the following, we shall show that if T is sufficiently small and H is strongly concave, then the PMP admits a unique solution. Hereafter, we retain assumption (A2\('\)) and replace (A1\('\)) with a stronger assumption, which greatly simplifies our arguments:

(A1\(''\)):

f is bounded; \(f,L,\Phi \) are twice continuously differentiable with respect to both \(x,\theta \), with bounded and Lipschitz partial derivatives.

With an estimate of the difference in flow maps due to two different controls, we can prove a small-time uniqueness result for the PMP.

Theorem 4

Suppose that \(H(x,p,\theta )\) is strongly concave in \(\theta \), uniformly in \(x,p\in \mathbb {R}^d\), i.e., \(\nabla ^2_{\theta \theta }H(x,p,\theta ) + \lambda _0 I \preceq 0\) for some \(\lambda _0>0\). Then, for sufficiently small T, if \({\varvec{\theta }}^1\) and \({\varvec{\theta }}^2\) are solutions of the PMP (50), then \({\varvec{\theta }}^1 = {\varvec{\theta }}^2\).

Since we are considering the effects of T, in the rest of the estimates in this section the dependence of constants on T is tracked explicitly. We first estimate the difference of flow maps driven by two different controls.

Lemma 2

Let \({\varvec{\theta }}^1, {\varvec{\theta }}^2\in L^\infty ([0,T],\Theta )\). Then, there exists a constant \(T_0\) such that for all \(T\in [0,T_0)\), we have

$$\begin{aligned} \Vert {\varvec{x}}^{{\varvec{\theta }}^1} - {\varvec{x}}^{{\varvec{\theta }}^2} \Vert _{L^\infty } + \Vert {\varvec{p}}^{{\varvec{\theta }}^1} - {\varvec{p}}^{{\varvec{\theta }}^2} \Vert _{L^\infty } \le C(T) \Vert {\varvec{\theta }}^1 - {\varvec{\theta }}^2 \Vert _{L^\infty }, \end{aligned}$$

where \(C(T)>0\) satisfies \(C(T)\rightarrow 0\) as \(T\rightarrow 0\).

Proof

Denote \({\varvec{\delta }}{\varvec{\theta }}:={\varvec{\theta }}^1 - {\varvec{\theta }}^2\), \({\varvec{\delta }}{\varvec{x}}:={\varvec{x}}^{{\varvec{\theta }}^1} - {\varvec{x}}^{{\varvec{\theta }}^2}\) and \({\varvec{\delta }}{\varvec{p}}:={\varvec{p}}^{{\varvec{\theta }}^1} - {\varvec{p}}^{{\varvec{\theta }}^2}\). Since \(x^{{\varvec{\theta }}^1}_0=x^{{\varvec{\theta }}^2}_0=x_0\), integrating the respective ODEs and using (A1\(''\)) we have

$$\begin{aligned} \Vert \delta x_t \Vert \le \int _{0}^{t} \left\| f(x^{{\varvec{\theta }}^1}_s, \theta ^1_s) - f(x^{{\varvec{\theta }}^2}_s, \theta ^2_s) \right\| \mathrm{d}s \le K_L \int _{0}^{T} \Vert \delta x_s \Vert \mathrm{d}s + K_L \int _{0}^{T} \Vert \delta \theta _s \Vert \mathrm{d}s, \end{aligned}$$

and so

$$\begin{aligned} \Vert {\varvec{\delta }}{\varvec{x}}\Vert _{L^\infty } \le K_L T \Vert {\varvec{\delta }}{\varvec{x}}\Vert _{L^\infty } + K_L T \Vert {\varvec{\delta }}{\varvec{\theta }}\Vert _{L^\infty }. \end{aligned}$$

Now, if \(T<T_0:=1/K_L\), we then have

$$\begin{aligned} \Vert {\varvec{\delta }}{\varvec{x}}\Vert _{L^\infty } \le \frac{K_L T}{1 - K_L T} \Vert {\varvec{\delta }}{\varvec{\theta }}\Vert _{L^\infty }. \end{aligned}$$
(54)

Similarly,

$$\begin{aligned}&\Vert \delta p_t \Vert \le K_L \Vert \delta x_T \Vert + K_L \int _{t}^{T} \left( \Vert \delta x_s \Vert + \Vert \delta p_s \Vert + \Vert \delta \theta _s \Vert \right) \mathrm{d}s, \\&\Vert {\varvec{\delta }}{\varvec{p}}\Vert _{L^\infty } \le (K_L + K_L T) \Vert {\varvec{\delta }}{\varvec{x}}\Vert _{L^\infty } + K_L T \Vert {\varvec{\delta }}{\varvec{p}}\Vert _{L^\infty } + K_L T \Vert {\varvec{\delta }}{\varvec{\theta }}\Vert _{L^\infty }, \end{aligned}$$

and hence, for \(T<T_0\),

$$\begin{aligned} \Vert {\varvec{\delta }}{\varvec{p}}\Vert _{L^\infty } \le \frac{K_L(1+T)}{1-K_L T} \Vert {\varvec{\delta }}{\varvec{x}}\Vert _{L^\infty } + \frac{K_L T}{1-K_L T} \Vert {\varvec{\delta }}{\varvec{\theta }}\Vert _{L^\infty }. \end{aligned}$$
(55)

Combining (54) and (55) proves the claim. \(\square \)
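The estimate (54) can be sanity-checked numerically on a toy model with an explicit Lipschitz constant: for \(f(x,\theta )=\tanh (x+\theta )\) (an illustrative choice of ours, not from the paper), both partial derivatives are bounded by 1, so one may take \(K_L=1\):

```python
import numpy as np

# Numerical sanity check of (54) on a toy scalar model with explicit
# Lipschitz constant: f(x, theta) = tanh(x + theta) has |df/dx| <= 1 and
# |df/dtheta| <= 1, so K_L = 1.  All concrete choices are illustrative.
T, steps = 0.5, 2000
dt = T / steps
t_grid = np.linspace(0, T, steps)
theta1 = 0.5 * np.sin(6 * t_grid)
theta2 = theta1 + 0.2 * np.cos(10 * t_grid)

def flow(theta, x0=0.1):
    # forward Euler solution of x' = f(x, theta_t)
    xs = [x0]
    for k in range(steps):
        xs.append(xs[-1] + dt * np.tanh(xs[-1] + theta[k]))
    return np.array(xs)

K_L = 1.0
dx = np.max(np.abs(flow(theta1) - flow(theta2)))
dth = np.max(np.abs(theta1 - theta2))
# (54): ||delta x||_inf <= K_L T / (1 - K_L T) * ||delta theta||_inf
assert dx <= K_L * T / (1 - K_L * T) * dth + 1e-6
```

The Gronwall-type bound is not tight (the actual deviation here is smaller than the right-hand side), but it displays the key property used in Theorem 4: the constant in front of \(\Vert {\varvec{\delta }}{\varvec{\theta }}\Vert _{L^\infty }\) vanishes as \(T\rightarrow 0\).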

With the above estimate, we can now prove Theorem 4.

Proof of Theorem 4

By uniform strong concavity, the function \(\theta \mapsto \mathbb {E}_{\mu _0} H(x^{{\varvec{\theta }}^1}_t,p^{{\varvec{\theta }}^1}_t,\theta )\) is strongly concave. Thus, with \(\lambda _0>0\) as in the statement of the theorem, we have

$$\begin{aligned} \frac{\lambda _0}{2} \Vert \theta ^1_t - \theta ^2_t \Vert ^2 \le {\left[ \mathbb {E}_{\mu _0} \nabla H(x^{{\varvec{\theta }}^1}_t, p^{{\varvec{\theta }}^1}_t, \theta ^2_t) - \mathbb {E}_{\mu _0} \nabla H(x^{{\varvec{\theta }}^1}_t, p^{{\varvec{\theta }}^1}_t, \theta ^1_t) \right] } \cdot (\theta ^1_t - \theta ^2_t). \end{aligned}$$

A similar expression holds for \(\theta \mapsto \mathbb {E}_{\mu _0} H(x^{{\varvec{\theta }}^2}_t,p^{{\varvec{\theta }}^2}_t,\theta )\), and so combining them and using assumption (A1\(''\)) we have

$$\begin{aligned} \lambda _0 \Vert \theta ^1_t - \theta ^2_t \Vert ^2&\le {\left[ \mathbb {E}_{\mu _0} \nabla H(x^{{\varvec{\theta }}^1}_t, p^{{\varvec{\theta }}^1}_t, \theta ^2_t) - \mathbb {E}_{\mu _0} \nabla H(x^{{\varvec{\theta }}^1}_t, p^{{\varvec{\theta }}^1}_t, \theta ^1_t) \right] } \cdot (\theta ^1_t - \theta ^2_t) \\&\quad + {\left[ \mathbb {E}_{\mu _0} \nabla H(x^{{\varvec{\theta }}^2}_t, p^{{\varvec{\theta }}^2}_t, \theta ^1_t) - \mathbb {E}_{\mu _0} \nabla H(x^{{\varvec{\theta }}^2}_t, p^{{\varvec{\theta }}^2}_t, \theta ^2_t) \right] } \cdot (\theta ^1_t - \theta ^2_t) \\&\le \mathbb {E}_{\mu _0} \Vert \nabla H(x^{{\varvec{\theta }}^1}_t, p^{{\varvec{\theta }}^1}_t, \theta ^1_t) - \nabla H(x^{{\varvec{\theta }}^2}_t, p^{{\varvec{\theta }}^2}_t, \theta ^1_t) \Vert \Vert \theta ^1_t - \theta ^2_t \Vert \\&\quad +\mathbb {E}_{\mu _0} \Vert \nabla H(x^{{\varvec{\theta }}^1}_t, p^{{\varvec{\theta }}^1}_t, \theta ^2_t) - \nabla H(x^{{\varvec{\theta }}^2}_t, p^{{\varvec{\theta }}^2}_t, \theta ^2_t) \Vert \Vert \theta ^1_t - \theta ^2_t \Vert \\&\le K_L \Vert {\varvec{\delta }}{\varvec{\theta }}\Vert _{L^\infty } (\Vert {\varvec{\delta }}{\varvec{x}}\Vert _{L^\infty } + \Vert {\varvec{\delta }}{\varvec{p}}\Vert _{L^\infty }). \end{aligned}$$

Combining the above and Lemma 2, we have

$$\begin{aligned} \Vert {\varvec{\delta }}{\varvec{\theta }}\Vert _{L^\infty }^2 \le \frac{K_L}{\lambda _0} C(T) \Vert {\varvec{\delta }}{\varvec{\theta }}\Vert _{L^\infty }^2. \end{aligned}$$

But \(C(T)=o(1)\) and so we may take T sufficiently small so that \(K_L C(T) < \lambda _0\) to conclude that \(\Vert {\varvec{\delta }}{\varvec{\theta }}\Vert _{L^\infty } = 0\). \(\square \)
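The contraction underlying Theorem 4 can be seen in action with an illustrative fixed-point iteration (not part of the proof; in the spirit of successive-approximation methods): iterate \({\varvec{\theta }}\mapsto \arg \max _\theta \mathbb {E}_{\mu _0}H(x^{{\varvec{\theta }}}_t,p^{{\varvec{\theta }}}_t,\theta )\) on a toy model with strongly concave H. For small T, two different initializations converge to the same control path. All model and parameter choices below are our own:

```python
import numpy as np

# Illustrative fixed-point iteration for the model f(x, theta) = theta * tanh(x),
# L = lam * theta^2, Phi(x, y) = (x - y)^2, so H is strongly concave in theta.
rng = np.random.default_rng(2)
N, T, steps, lam = 512, 0.1, 50, 1.0
dt = T / steps
x0 = rng.normal(size=N)
y0 = 0.5 * x0

def sweep(theta):
    # forward state sweep
    xs = np.empty((steps + 1, N)); xs[0] = x0
    for k in range(steps):
        xs[k + 1] = xs[k] + dt * theta[k] * np.tanh(xs[k])
    # backward costate sweep: p' = -p * theta * sech^2(x), p_T = -2 (x_T - y0)
    ps = np.empty((steps + 1, N)); ps[-1] = -2.0 * (xs[-1] - y0)
    for k in reversed(range(steps)):
        ps[k] = ps[k + 1] + dt * ps[k + 1] * theta[k] / np.cosh(xs[k]) ** 2
    # pointwise-in-t maximizer of E[p theta tanh(x) - lam theta^2]
    return np.array([np.mean(ps[k] * np.tanh(xs[k]))
                     for k in range(steps)]) / (2 * lam)

def iterate(theta, n_iter=80):
    for _ in range(n_iter):
        theta = sweep(theta)
    return theta

th_a = iterate(np.zeros(steps))
th_b = iterate(np.ones(steps))
assert np.max(np.abs(th_a - th_b)) < 1e-6   # same PMP solution
```

For larger T the same iteration need not converge to a common limit, consistent with the small-time hypothesis of Theorem 4.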

In the context of machine learning, since f is bounded, small T roughly corresponds to the regime where the reachable set of the forward dynamics is small. This can be loosely interpreted as the case where the model has low capacity or expressive power. We note that the number of parameters is still infinite, since we only require \({\varvec{\theta }}\) to be essentially bounded and measurable in time. Hence, Theorem 4 can be interpreted as saying that when the model capacity is low, the optimal solution is unique, albeit with possibly high loss function values. Note that strong concavity of the Hamiltonian does not imply that the loss function J is strongly convex, or even convex; indeed, convexity is often an unrealistic assumption in deep learning. In fact, in the setting of Example 1, H is strongly concave, but the loss function J can be highly non-convex due to the nonlinear transformation \(\sigma \). Compared with the characterization using the HJB equation (Sect. 5), we observe that uniqueness of solutions of the PMP requires the small-T condition.

8 From mean-field PMP to sampled PMP

So far, we have focused our discussion on the mean-field control problem (3) and mean-field PMP (50). However, the solution of the mean-field PMP requires maximizing an expectation. Hence, in practice we must resort to solving a sampled version (51), which constitutes necessary conditions for the sampled optimal control problem (4).

The goal of this section is to draw some precise connections between the solutions of the mean-field PMP (50) and the sampled PMP (51). In particular, we show that under appropriate conditions, near any stable (to be defined precisely later) solution of the mean-field PMP (50), we can find with high probability a solution of the sampled PMP (51). This allows us to establish a concrete link, via the maximum principle, between solutions of the population risk minimization problem (3) and the empirical risk minimization problem (4). To proceed, the key observation is that interior solutions to both the mean-field and sampled PMPs can be written as solutions to algebraic equations on Banach spaces. Indeed, in view of the compact notation (50), let us suppose that \({\varvec{\theta }}^*\) is a solution of the PMP such that the maximization step attains a maximum in the interior of \(\Theta \) for a.e. \(t\in [0,T]\). Note that if \(\Theta \) is sufficiently large, e.g., \(\Theta =\mathbb {R}^m\), then this must be the case. We shall hereafter assume this holds. Consequently, the PMP solution satisfies (by the dominated convergence theorem)

$$\begin{aligned} {{\varvec{F}}({\varvec{\theta }}^*)}_t :=\mathbb {E}_{\mu _0} \nabla _\theta H\left( x^{{\varvec{\theta }}^*}_t, p^{{\varvec{\theta }}^*}_t , \theta ^*_t\right) = 0, \end{aligned}$$
(56)

for a.e. t, where \({\varvec{F}}: L^\infty ([0,T],\Theta ) \rightarrow L^\infty ([0,T],\mathbb {R}^m)\) is a Banach space mapping. Similarly, from (51) we know that an interior solution \({\varvec{\theta }}^N\) of the finite-sample PMP is a random variable which satisfies

$$\begin{aligned} {{\varvec{F}}_N({\varvec{\theta }}^N)}_t :=\frac{1}{N} \sum _{i=1}^{N} \nabla _\theta H\left( x^{{\varvec{\theta }}^N,i}_t, p^{{\varvec{\theta }}^N,i}_t , \theta ^N_t\right) = 0, \end{aligned}$$
(57)

for a.e. t. Now, \({\varvec{F}}_N\) is a random approximation of \({\varvec{F}}\) with \(\mathbb {E}{\varvec{F}}_N({\varvec{\theta }}) = {\varvec{F}}({\varvec{\theta }})\) for all \({\varvec{\theta }}\). In fact, \({\varvec{F}}_N \rightarrow {\varvec{F}}\) almost surely by the law of large numbers. Hence, the analysis of the approximation properties of the mean-field PMP by its sampled counterpart amounts to studying the approximation of zeros of \({\varvec{F}}\) by those of \({\varvec{F}}_N\).
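For illustration, the correspondence between zeros of \({\varvec{F}}\) and of \({\varvec{F}}_N\) can already be seen in a scalar caricature. The following Python sketch (with a toy g of our own choosing, not the Hamiltonian above) locates the zero of a Monte Carlo approximation \(F_N\) and shows that it approaches the zero of \(F\) as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar caricature of (56)-(57): F(theta) = E[g(X, theta)] and F_N its
# Monte Carlo approximation from N i.i.d. samples. Here g(x, theta) =
# theta - x, so F(theta) = theta - E[X] has the unique zero theta* = E[X],
# while the zero of F_N is simply the sample mean.
mu = 0.7  # E[X], so theta* = mu

def zero_of_F_N(N):
    x = rng.uniform(mu - 0.5, mu + 0.5, size=N)  # bounded samples
    # F_N(theta) = (1/N) sum_i (theta - x_i) = 0  =>  theta = mean(x_i)
    return x.mean()

for N in (10, 1000, 100000):
    print(N, abs(zero_of_F_N(N) - mu))  # error shrinks as N grows
```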

In view of this, we shall take a brief excursion to develop some theory on random approximations of zeros of Banach space mappings at an abstract level and then use these results to deduce properties of the PMP approximations. The techniques employed in the next section are reminiscent of classical numerical analysis results on finite difference approximation schemes [36], except that we work with random approximations.

8.1 Excursion: random approximations of zeros of Banach space mappings

Let \((U,\Vert \cdot \Vert _U),(V,\Vert \cdot \Vert _V)\) be Banach spaces and \(F: U \rightarrow V\) be a mapping. We first define a notion of stability, which shall be a primary condition that ensures existence of close-by zeros of approximations.

Definition 2

For \(\rho >0\) and \(x\in U\), define \(S_\rho (x):=\{y\in U: \Vert x-y \Vert _U \le \rho \}\). We say that the mapping F is stable on \(S_\rho (x)\) if there exists a constant \(K_\rho >0\) such that for all \(y,z \in S_\rho (x)\),

$$\begin{aligned} \Vert y - z \Vert _U \le K_\rho \Vert F(y) - F(z) \Vert _V . \end{aligned}$$

Note that if F is stable on \(S_\rho (x)\), then the equation \(F=0\) has at most one solution on \(S_\rho (x)\). If a solution does exist, say at \(x^*\), then it is necessarily isolated, i.e., if \(DF(x^*)\) exists, then it is non-singular. The following proposition establishes a stronger version of this: if DF(x) exists for any \(x\in S_{\rho }(x^*)\), then it is necessarily non-singular.

Proposition 4

Let F on \(S_\rho (x^*)\) be stable. Then, for any \(x\in S_\rho (x^*)\), if DF(x) exists, then it is non-singular, i.e., \(DF(x)y = 0\) implies \(y=0\).

Proof

Suppose for the sake of contradiction that \(DF(x)y = 0\) and \(\Vert y\Vert _U \ne 0\). Define \(z(\alpha ) :=x + \alpha y\) with \(\alpha \) sufficiently small so that \(z(\alpha )\in S_\rho (x^*)\). Then,

$$\begin{aligned} \alpha \Vert y \Vert _U&= \Vert x - z(\alpha ) \Vert _U \\&\le K_\rho \Vert F(x) - F(z(\alpha )) \Vert _V \\&\le K_\rho ( \alpha \Vert DF(x)y \Vert _V + \Vert F(x+\alpha y) - F(x) - DF(x) \alpha y \Vert _V ). \end{aligned}$$

But \(DF(x)y=0\), and so \(\alpha \Vert y \Vert _U \le K_\rho r(x,\alpha y) \alpha \Vert y \Vert _U\). By definition of the Fréchet derivative (5), \(r(x,\alpha y)\rightarrow 0\) as \(\alpha \rightarrow 0\). Thus, if \(\alpha \) is sufficiently small so that \(K_\rho r(x,\alpha y)<1\), then \(\Vert y \Vert _U=0\) and hence we arrive at a contradiction. \(\square \)

As the previous proposition suggests, a converse statement that establishes stability will require DF(x) to be non-singular on some neighborhood of \(x^*\). In fact, one requires more: DF must be Lipschitz. Note that for a linear operator \(A:U\rightarrow V\), we also use \(\Vert A \Vert _V\) to denote the usual induced norm, \(\Vert A \Vert _V = \sup _{\Vert y\Vert _U \le 1} \Vert Ay \Vert _V\).

Proposition 5

Suppose \(DF(x^*)\) is non-singular, DF(x) exists and \(\Vert DF(x) - DF(y) \Vert _V \le K_L \Vert x - y \Vert _U\) for all \(x,y\in S_{\rho }(x^*)\). Then, F is stable on \(S_{\rho _0}(x^*)\) for any \(0<\rho _0\le \min (\rho , \tfrac{1}{2}(K_L \Vert {DF(x^*)}^{-1} \Vert _U)^{-1})\) with stability constant

$$\begin{aligned} K_{\rho _0} = 2\Vert {DF(x^*)}^{-1} \Vert _U. \end{aligned}$$

Proof

Let \(\rho _0\le \rho \) and take \(x,y \in S_{\rho _0}(x^*)\). Using the mean value theorem, we can write \(F(x)-F(y) = R(x,y)(x-y)\) where

$$\begin{aligned} R(x,y) :=\int _{0}^{1} DF(s x + (1-s) y) \mathrm{d}s. \end{aligned}$$

But, using the Lipschitz condition we have

$$\begin{aligned} \Vert R(x,y) - DF(x^*) \Vert _V&\le \int _{0}^{1} \Vert DF(s x + (1-s) y) - DF(s x^* + (1-s) x^*) \Vert _V \mathrm{d}s \\&\le K_L \int _{0}^{1} \Vert s (x-x^*) + (1-s) (y-x^*) \Vert _U \mathrm{d}s \\&\le \rho _0 K_L. \end{aligned}$$

Since \(\rho _0 K_L \le \tfrac{1}{2} \Vert {DF(x^*)}^{-1} \Vert _U^{-1}\) by the choice of \(\rho _0\), the Banach lemma implies that \(R(x,y)\) is non-singular and \(\Vert {R(x,y)}^{-1} \Vert _U \le 2 \Vert {DF(x^*)}^{-1}\Vert _U\). The result follows since \(x-y = {R(x,y)}^{-1} (F(x) - F(y))\). \(\square \)
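To illustrate the constants in Proposition 5 numerically, consider the scalar example \(F(x)=x+x^2\) near \(x^*=0\) (an arbitrary choice for illustration, not drawn from the paper): \(DF(0)=1\) is non-singular and DF is Lipschitz with \(K_L=2\), so the proposition predicts stability on \(S_{1/4}(0)\) with constant \(K_{\rho _0}=2\). A short Python check:

```python
import numpy as np

# Scalar sketch of Proposition 5: F(x) = x + x^2 near x* = 0.
# DF(x) = 1 + 2x, so DF(x*) = 1 (non-singular) and DF is Lipschitz with
# K_L = 2. The proposition then guarantees stability on S_{rho_0}(x*) with
# rho_0 <= (1/2) * (K_L * |DF(x*)^{-1}|)^{-1} = 1/4 and K_{rho_0} = 2.
def F(x):
    return x + x * x

K_rho0, rho0 = 2.0, 0.25
pts = np.linspace(-rho0, rho0, 101)

# Worst-case ratio |y - z| / |F(y) - F(z)| over a grid of S_{rho_0}(0);
# stability asserts this never exceeds K_{rho_0}.
worst = max(
    abs(y - z) / abs(F(y) - F(z))
    for y in pts for z in pts if y != z
)
print("worst ratio =", worst)  # stays below the predicted constant 2
```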

Let us now introduce a family of random mappings \(F_N\) that approximate F. Let \((\Omega , \mathcal {F}, \mathbb {P})\) be a probability space and \(\{ F_N(\omega ): N \ge 1, \omega \in \Omega \}\) be a family of mappings from U to V such that \(\omega \mapsto F_N(\omega )(x)\) is \(\mathcal {F}\)-measurable for each x (we equip the Banach spaces U, V with the Borel \(\sigma \)-algebra). We make the following assumptions, which will allow us to relate the random solutions of \(F_N=0\) to those of \(F=0\) in Theorem 5.

  1. (B1)

    (Stability) There exists \(x^*\in U\) such that \(F(x^*)=0\) and F is stable on \(S_\rho (x^*)\) for some \(\rho >0\).

  2. (B2)

    (Uniform convergence in probability) For all \(N\ge 1\), DF(x) and \(DF_N(x)\) exist for all \(x\in S_\rho (x^*)\), \(\mathbb {P}\)-a.s., and

    $$\begin{aligned}&\mathbb {P}\left[ \Vert F(x) - F_N(x)\Vert _V \ge s \right] \le r_1(N,s), \\&\mathbb {P}\left[ \Vert DF(x) - DF_N(x)\Vert _V \ge s \right] \le r_2(N,s), \end{aligned}$$

    for some real-valued functions \(r_1,r_2\) such that \(r_1(N,s),r_2(N,s)\rightarrow 0\) as \(N\rightarrow \infty \).

  3. (B3)

    (Uniformly Lipschitz derivative) There exists \(K_L>0\) such that for all \(x,y \in S_\rho (x^*)\),

    $$\begin{aligned} \Vert DF_N(x) - DF_N(y) \Vert _V \le K_L \Vert x - y \Vert _U, \qquad \mathbb {P}\text {-a.s.} \end{aligned}$$

Theorem 5

Let (B1)–(B3) hold. Then, there exist positive constants \(s_0, \rho _1, C\) with \(\rho _1<\rho \) and U-valued random variables \(x_N\in S_{\rho _1}(x^*)\) satisfying

$$\begin{aligned}&\mathbb {P}[\Vert x_N - x^* \Vert _U \ge C s ] \le r_1(N, s) + r_2(N, s), \qquad s \in (0,s_0],\\&\mathbb {P}[F_N(x_N) \ne 0] \le r_1(N, s_0) + r_2(N, s_0). \end{aligned}$$

In particular, \(x_N \rightarrow x^*\) and \(F_N(x_N) \rightarrow 0\) in probability.

To establish Theorem 5, we first prove that for large N, with high probability \(DF_N(x^*)\) is non-singular and \(\Vert DF_N(x^*)^{-1} \Vert _U\) is uniformly bounded.

Lemma 3

Let (B1)–(B3) hold. Then, there exists a constant \(s_0>0\) such that for each \(s\in (0,s_0]\) and \(N\ge 1\), there exists a measurable \(A_N(s)\subset \Omega \) such that \(\mathbb {P}[A_N(s)] \ge 1 - r_1(N,s) - r_2(N,s)\) and for each \(\omega \in A_N(s)\),

$$\begin{aligned} \Vert F(x^*) - F_N(\omega )(x^*) \Vert _V < s. \end{aligned}$$

Moreover, \(DF_N(\omega )(x^*)\) is non-singular with

$$\begin{aligned} \Vert {DF_N(\omega )(x^*)}^{-1} \Vert _U \le 2 \Vert {DF(x^*)}^{-1}\Vert _U. \end{aligned}$$

In particular, \(F_N(\omega )\) is stable on \(S_{\rho _0}(x^*)\) with \(\rho _0\le \min (\rho , \tfrac{1}{4}{(K_L \Vert {DF(x^*)}^{-1} \Vert _U)}^{-1})\) and stability constant \(K_{\rho _0} = 4\Vert {DF(x^*)}^{-1}\Vert _U\).

Proof

For \(s>0\), set

$$\begin{aligned} A_N(s) :=\{&\omega \in \Omega : \Vert F(x^*) - F_N(\omega )(x^*) \Vert _V< s \\&\text {and } \Vert DF(x^*) - DF_N(\omega )(x^*) \Vert _V < s \}. \end{aligned}$$

Observe that \(A_N(s)\) is measurable as \(DF_N(\omega )(x^*)\) is measurable and assumption (B2) implies \(\mathbb {P}[A_N(s)]\ge 1 - r_1(N,s) - r_2(N,s)\). Now, take s sufficiently small so that \(s \le s_0 = \tfrac{1}{2} \Vert {DF(x^*)}^{-1}\Vert _U^{-1}\). Then, for each \(\omega \in A_N(s)\), the Banach lemma implies \(DF_N(\omega )(x^*)\) is non-singular and

$$\begin{aligned} \Vert DF_N(\omega )(x^*)^{-1} \Vert _U \le \frac{\Vert DF(x^*)^{-1} \Vert _U}{ 1 - \frac{1}{2}} = 2 \Vert DF(x^*)^{-1} \Vert _U. \end{aligned}$$

Finally, we use Proposition 5 to deduce stability of \(F_N(\omega )\). \(\square \)

Now, we are ready to prove Theorem 5 by constructing a uniform contraction mapping whose fixed point is a solution of \(F_N(x)=0\).

Proof of Theorem 5

Let \(s_0\), \(A_N(s)\) and \(\rho _0\) be those defined in Lemma 3. For each \(\omega \in A_N(s)\) with \(s\le s_0\), define the mapping

$$\begin{aligned} G_N(\omega )(x) :=x - {DF_N(\omega )(x^*)}^{-1} F_N(\omega )(x). \end{aligned}$$

We now show that this is in fact a uniform contraction on \(S_{\rho _1}(x^*)\) for sufficiently small \(\rho _1\). Let \(x,y\in S_{\rho _1}(x^*)\). By the mean value theorem, we have

$$\begin{aligned} G_N(\omega )(x) - G_N(\omega )(y)&=\, {DF_N(\omega )(x^*)}^{-1} [ DF_N(\omega )(x^*)(x-y) - (F_N(\omega )(x) - F_N(\omega )(y)) ] \\&=\, {DF_N(\omega )(x^*)}^{-1} [DF_N(\omega )(x^*) - R_N(\omega )(x,y)] (x - y), \end{aligned}$$

where \(R_N(\omega )(x,y) = \int _{0}^{1} DF_N(\omega )( s x + (1-s) y ) \mathrm{d}s\). The Lipschitz condition (B3) implies

$$\begin{aligned} \Vert DF_N(\omega )(x^*) - R_N(\omega )(x,y) \Vert _V \le \rho _1 K_L \end{aligned}$$

and hence by Lemma 3,

$$\begin{aligned} \Vert G_N(\omega )(x) - G_N(\omega )(y) \Vert _U \le \alpha \Vert x - y \Vert _U, \end{aligned}$$

where \(\alpha = 2 K_L \rho _1 \Vert DF(x^*)^{-1} \Vert _U\). We now pick \(\rho _1<\rho _0\) sufficiently small so that \(\alpha < 1\). It remains to show that the mapping \(G_N(\omega )\) maps \(S_{\rho _1}(x^*)\) into itself. Let \(x \in S_{\rho _1}(x^*)\); then, noting that \(F(x^*)=0\),

$$\begin{aligned} \Vert G_N(\omega )(x) - x^* \Vert _U&\le \Vert G_N(\omega )(x) - G_N(\omega )(x^*) \Vert _U + \Vert G_N(\omega )(x^*) - x^* \Vert _U \\&\le \alpha \rho _1 + 2\Vert {DF(x^*)}^{-1} \Vert _U \Vert F_N(\omega )(x^*) - F(x^*) \Vert _V . \end{aligned}$$

Using Lemma 3 again, we have

$$\begin{aligned} \Vert G_N(\omega )(x) - x^* \Vert _U&\le \alpha \rho _1 + 2 s \Vert {DF(x^*)}^{-1} \Vert _U. \end{aligned}$$

We now take \(s_0\) small enough so that \(2 s_0 \Vert {DF(x^*)}^{-1} \Vert _U < (1-\alpha )\rho _1\). Then, for all \(N\ge 1\) and \(s\in (0,s_0]\), \(G_N(\omega )\) is a contraction, uniform in N, on \(S_{\rho _1}(x^*)\), and hence by the Banach fixed point theorem, there exists a unique \(\tilde{x}_{N,s}(\omega ) \in S_{\rho _1}(x^*)\) such that \(G_N(\omega )(\tilde{x}_{N,s}(\omega ))=\tilde{x}_{N,s}(\omega )\), i.e., \(F_N(\omega )(\tilde{x}_{N,s}(\omega ))=0\) for all \(\omega \in A_N(s)\). Moreover, \(\tilde{x}_{N,s}(\omega ) = \lim _{k\rightarrow \infty } [G_N(\omega )]^{(k)}(y)\) for any \(y\in S_{\rho _1}(x^*)\). Define

$$\begin{aligned} x_{N,s}(\omega ) = \mathbf {1}_{A_N(s)}(\omega ) \tilde{x}_{N,s}(\omega ) + \mathbf {1}_{A_N(s)^c}(\omega ) x^*. \end{aligned}$$

Now, \(x_{N,s}\) is measurable, since \(A_N(s)\) is measurable and \(\tilde{x}_{N,s}\) is the limit of measurable random variables, hence measurable. Moreover, \(A_N(s)\subset \{ F_N(x_{N,s}) = 0 \}\) and so \(\mathbb {P}[F_N(x_{N,s})=0] \ge 1 - r_1(N,s) - r_2(N,s)\). Since \(x_{N,s} \in S_{\rho _1}(x^*)\) and \(\rho _1 < \rho _0\), using the stability of \(F_N(\omega )\) established in Lemma 3, and the fact that \(F_N(\omega )(x_{N,s})=F(x^*)=0\), we have for any \(\omega \in A_{N}(s)\)

$$\begin{aligned} \Vert x_{N,s}(\omega ) - x^* \Vert _U&\le K_{\rho _0} \Vert F_N(\omega )(x_{N,s}) - F_N(\omega )(x^*) \Vert _V \\&\le 4 \Vert {DF(x^*)}^{-1} \Vert _U \Vert F(x^*) - F_N(\omega )(x^*) \Vert _V \\&< 4 s \Vert {DF(x^*)}^{-1} \Vert _U, \end{aligned}$$

and so \(\mathbb {P}[\Vert x_{N,s}(\omega ) - x^* \Vert _U \ge C s] \le r_1(N,s)+r_2(N,s)\) with \(C = 4 \Vert {DF(x^*)}^{-1} \Vert _U\). At this point, it appears that \(x_{N,s}\) depends on s. However, notice that for all \(s\le s_0\), \(A_N(s)\subset A_N(s_0)\). But, \(x_{N,s}(\omega )\) is the unique solution of \(F_N(\omega )(\cdot )=0\) in \(S_{\rho _1}(x^*)\) for each \(\omega \in A_N(s) \subset A_N(s_0)\). Therefore, \(x_{N,s}(\omega ) = x_{N,s_0}(\omega )\) for all \(s\le s_0\). We can thus write \(x_N :=x_{N,s_0} \equiv x_{N,s}\).

Lastly, convergence in probability follows from the decay of the functions \(r_1,r_2\) as \(N\rightarrow \infty \). \(\square \)
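The map \(G_N\) above is a Newton-type iteration with the derivative frozen at \(x^*\). The following Python sketch (with a made-up scalar \(F_N\), purely illustrative) runs this iteration and recovers an exact zero of \(F_N\) near \(x^*\):

```python
import numpy as np

rng = np.random.default_rng(1)

# One-dimensional sketch of the construction in the proof of Theorem 5.
# F(x) = E[x + x^2 - W] with E[W] = 0, so x* = 0 solves F = 0. From N
# samples of W we form F_N and iterate the frozen-Newton map
#   G_N(x) = x - DF_N(x*)^{-1} * F_N(x),
# whose fixed point is a zero of F_N close to x*.
N = 10000
w = rng.normal(0.0, 0.1, size=N)

def F_N(x):
    return x + x**2 - w.mean()

DF_N_at_xstar = 1.0  # d/dx (x + x^2) evaluated at x* = 0

x = 0.2              # start inside a small ball around x*
for _ in range(50):
    x = x - F_N(x) / DF_N_at_xstar
print("fixed point x_N =", x, " F_N(x_N) =", F_N(x))
# x_N is a zero of F_N, and |x_N - x*| = O(|w.mean()|) -> 0 as N grows.
```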

8.2 Error estimate for sampled PMP

Now, our goal is to apply the theory developed in Sect. 8.1 to the PMP. We shall assume that \({\varvec{\theta }}^*\), the solution of the mean-field PMP, is such that \({\varvec{F}}({\varvec{\theta }}^*)=0\) (recall that this holds for interior maxima, e.g., when \(\Theta =\mathbb {R}^m\)). Suppose further that \({\varvec{F}}\) is stable on a neighborhood of \({\varvec{\theta }}^*\) (see Definition 2); later on, we shall give remarks on when this assumption is reasonably satisfied. We wish to show that for sufficiently large N, with high probability \({\varvec{F}}_N\) must have a solution \({\varvec{\theta }}^N\) close to \({\varvec{\theta }}^*\).

In view of Theorem 5, we only need to check that (B2)–(B3) are satisfied. This requires a few elementary estimates and an application of the infinite-dimensional Hoeffding’s inequality [53].

Lemma 4

There exist constants \(K_B, K_L>0\) such that for all \({\varvec{\theta }},{\varvec{\phi }}\in L^{\infty }([0,T],\Theta )\)

$$\begin{aligned} \Vert {\varvec{x}}^{\varvec{\theta }}\Vert _{L^\infty } + \Vert {\varvec{p}}^{\varvec{\theta }}\Vert _{L^\infty }&\le K_B,\\ \Vert {\varvec{x}}^{\varvec{\theta }}- {\varvec{x}}^{\varvec{\phi }}\Vert _{L^\infty } + \Vert {\varvec{p}}^{\varvec{\theta }}- {\varvec{p}}^{\varvec{\phi }}\Vert _{L^\infty }&\le K_L \Vert {\varvec{\theta }}- {\varvec{\phi }}\Vert _{L^\infty }. \end{aligned}$$

Proof

By the Lipschitz condition and Gronwall’s inequality, we have for a.e. t,

$$\begin{aligned} \Vert x^{{\varvec{\theta }}}_t - x^{{\varvec{\phi }}}_t \Vert&= \left\| \int _{0}^{t} f(x^{{\varvec{\theta }}}_s, \theta _s) - f(x^{{\varvec{\phi }}}_s, \phi _s) \mathrm{d}s \right\| \\&\le K_L \int _{0}^{t} \Vert x^{{\varvec{\theta }}}_s - x^{{\varvec{\phi }}}_s \Vert \mathrm{d}s + K_L \int _{0}^{t} \Vert \theta _s - \phi _s \Vert \mathrm{d}s \\&\le K_L T e^{K_L T} \Vert {\varvec{\theta }}- {\varvec{\phi }}\Vert _{L^\infty }. \end{aligned}$$

Similarly,

$$\begin{aligned} \Vert p^{{\varvec{\theta }}}_t - p^{{\varvec{\phi }}}_t \Vert&\le \Vert \nabla _x \Phi (x^{{\varvec{\theta }}}_T,y_0) - \nabla _x \Phi (x^{{\varvec{\phi }}}_T,y_0) \Vert \\&\quad + \left\| \int _{t}^{T} \nabla _x H(x^{{\varvec{\theta }}}_s, p^{\varvec{\theta }}_s, \theta _s) - \nabla _x H(x^{{\varvec{\phi }}}_s, p^{{\varvec{\phi }}}_s, \phi _s) \mathrm{d}s \right\| \\&\le K_L \Vert x^{{\varvec{\theta }}}_T - x^{{\varvec{\phi }}}_T\Vert + K_L \int _{t}^{T} \Vert x^{{\varvec{\theta }}}_s - x^{{\varvec{\phi }}}_s \Vert \mathrm{d}s + K_L \int _{t}^{T} \Vert p^{{\varvec{\theta }}}_s - p^{{\varvec{\phi }}}_s \Vert \mathrm{d}s \\&\le (K_L+T)K_L T e^{2 K_L T} \Vert {\varvec{\theta }}- {\varvec{\phi }}\Vert _{L^\infty }. \end{aligned}$$

\(\square \)

Notice that we can view \({\varvec{x}}^{{\varvec{\theta }}} \equiv {\varvec{x}}({\varvec{\theta }})\) as a Banach space mapping from \(L^\infty ([0,T],\Theta )\) to \(L^\infty ([0,T],\mathbb {R}^d)\), and similarly for \({\varvec{p}}^{{\varvec{\theta }}}\). Below, we establish some elementary estimates for the derivatives of these mappings with respect to \({\varvec{\theta }}\).

Lemma 5

There exist constants \(K_B, K_L>0\) such that for all \({\varvec{\theta }},{\varvec{\phi }}\in L^{\infty }([0,T],\Theta )\)

$$\begin{aligned} \Vert D {\varvec{x}}^{\varvec{\theta }}\Vert _{L^\infty } + \Vert D {\varvec{p}}^{\varvec{\theta }}\Vert _{L^\infty }&\le K_B, \\ \Vert D {\varvec{x}}^{\varvec{\theta }}- D {\varvec{x}}^{\varvec{\phi }}\Vert _{L^\infty } + \Vert D {\varvec{p}}^{\varvec{\theta }}- D {\varvec{p}}^{\varvec{\phi }}\Vert _{L^\infty }&\le K_L \Vert {\varvec{\theta }}- {\varvec{\phi }}\Vert _{L^\infty }. \end{aligned}$$

Proof

Let \({\varvec{\eta }}\in L^\infty ([0,T],\mathbb {R}^m)\) be such that \(\Vert {\varvec{\eta }}\Vert _{L^\infty }\le 1\). For brevity, let us also denote \(f^{\varvec{\theta }}_t :=f(x^{\varvec{\theta }}_t, \theta _t)\) and \(H^{\varvec{\theta }}_t :=H(x^{\varvec{\theta }}_t, p^{\varvec{\theta }}_t, \theta _t)\). Then, \((D{\varvec{x}}^{\varvec{\theta }}){\varvec{\eta }}\) satisfies the linearized ODE

$$\begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t} {[(D{\varvec{x}}^{\varvec{\theta }}){\varvec{\eta }}]}_t = \nabla _x f^{\varvec{\theta }}_t {[(D{\varvec{x}}^{\varvec{\theta }}){\varvec{\eta }}]}_t + \nabla _\theta f^{\varvec{\theta }}_t \eta _t, \qquad {[(D{\varvec{x}}^{\varvec{\theta }}){\varvec{\eta }}]}_0 = 0. \end{aligned}$$

Gronwall’s inequality and (A1\(''\)) immediately imply that \(\Vert {[(D{\varvec{x}}^{\varvec{\theta }}){\varvec{\eta }}]}_t \Vert \le K_B \Vert {\varvec{\eta }}\Vert _{L^\infty }\), and so \(\Vert D{\varvec{x}}^{\varvec{\theta }}\Vert _{L^\infty }\le K_B\). Next,

$$\begin{aligned} \Vert {[(D{\varvec{x}}^{\varvec{\theta }}){\varvec{\eta }}]}_t - {[(D{\varvec{x}}^{\varvec{\phi }}){\varvec{\eta }}]}_t \Vert \le&\int _{0}^{t} \Vert \nabla _x f^{\varvec{\theta }}_s \Vert \Vert {[(D{\varvec{x}}^{\varvec{\theta }}){\varvec{\eta }}]}_s - {[(D{\varvec{x}}^{\varvec{\phi }}){\varvec{\eta }}]}_s \Vert \mathrm{d}s \\&+ \int _{0}^{t} \Vert \nabla _x f^{\varvec{\theta }}_s - \nabla _x f^{\varvec{\phi }}_s \Vert \Vert {[(D{\varvec{x}}^{\varvec{\phi }}){\varvec{\eta }}]}_s \Vert \mathrm{d}s \\&+ \int _{0}^{t} \Vert \nabla _\theta f^{\varvec{\theta }}_s - \nabla _\theta f^{\varvec{\phi }}_s \Vert \Vert \eta _s \Vert \mathrm{d}s. \end{aligned}$$

But, using Lemma 4 and assumption (A1\(''\)), we have

$$\begin{aligned} \Vert \nabla _x f^{\varvec{\theta }}_s - \nabla _x f^{\varvec{\phi }}_s \Vert&\le K_L \Vert x^{\varvec{\theta }}_s - x^{\varvec{\phi }}_s \Vert + K_L \Vert \theta _s - \phi _s \Vert \\&\le K_L \Vert {\varvec{\theta }}- {\varvec{\phi }}\Vert _{L^\infty }. \end{aligned}$$

A similar calculation shows \(\Vert \nabla _\theta f^{\varvec{\theta }}_s - \nabla _\theta f^{\varvec{\phi }}_s \Vert \le K_L \Vert {\varvec{\theta }}- {\varvec{\phi }}\Vert _{L^\infty }\). Hence, Gronwall’s inequality gives

$$\begin{aligned} \Vert {[(D{\varvec{x}}^{\varvec{\theta }}){\varvec{\eta }}]}_t - {[(Dx^{\varvec{\phi }}){\varvec{\eta }}]}_t \Vert \le&K_L \Vert {\varvec{\eta }}\Vert _{L^\infty } \Vert {\varvec{\theta }}- {\varvec{\phi }}\Vert _{L^\infty }. \end{aligned}$$

Similarly, \((D{\varvec{p}}^{\varvec{\theta }}){\varvec{\eta }}\) satisfies the ODE

$$\begin{aligned}&\frac{\mathrm{d}}{\mathrm{d}t} {[(D{\varvec{p}}^{\varvec{\theta }}){\varvec{\eta }}]}_t = - \nabla ^2_{xx} H^{\varvec{\theta }}_t {[(D{\varvec{x}}^{\varvec{\theta }}){\varvec{\eta }}]}_t - \nabla ^2_{xp} H^{\varvec{\theta }}_t {[(Dp^{\varvec{\theta }}){\varvec{\eta }}]}_t - \nabla ^2_{x\theta } H^{\varvec{\theta }}_t \eta _t,\\&{[(D{\varvec{p}}^{\varvec{\theta }}){\varvec{\eta }}]}_T = -\nabla ^2_{xx} \Phi (x^{\varvec{\theta }}_T, y_0) {[(D{\varvec{x}}^{\varvec{\theta }}){\varvec{\eta }}]}_T. \end{aligned}$$

An analogous calculation with (A1\(''\)) shows that

$$\begin{aligned} \Vert {[(D{\varvec{p}}^{\varvec{\theta }}){\varvec{\eta }}]}_t - {[(D{\varvec{p}}^{\varvec{\phi }}){\varvec{\eta }}]}_t \Vert&\le K_L \Vert {\varvec{\eta }}\Vert _{L^\infty } \Vert {\varvec{\theta }}- {\varvec{\phi }}\Vert _{L^\infty }. \end{aligned}$$

\(\square \)

Lemma 6

Let \(h:\mathbb {R}^d\times \mathbb {R}^d \times \Theta \rightarrow \mathbb {R}^m\) have bounded and Lipschitz derivatives in all arguments, and define the mapping \({\varvec{\theta }}\mapsto {\varvec{G}}({\varvec{\theta }})\) by \({[{\varvec{G}}({\varvec{\theta }})]}_t = h(x^{\varvec{\theta }}_t, p^{\varvec{\theta }}_t, \theta _t)\). Then, \({\varvec{G}}\) is differentiable, and \(D{\varvec{G}}\) is bounded and Lipschitz \(\mu _0\)-a.s., i.e.,

$$\begin{aligned} \Vert D{\varvec{G}}({\varvec{\theta }}) \Vert _{L^\infty }&\le K_B, \\ \Vert D{\varvec{G}}({\varvec{\theta }}) - D{\varvec{G}}({\varvec{\phi }}) \Vert _{L^\infty }&\le K_L \Vert {\varvec{\theta }}- {\varvec{\phi }}\Vert _{L^\infty }, \end{aligned}$$

for some \(K_B, K_L>0\) and all \({\varvec{\theta }},{\varvec{\phi }}\in L^\infty ([0,T],\Theta )\).

Proof

Let \({\varvec{\eta }}\in L^\infty ([0,T],\mathbb {R}^m)\) be such that \(\Vert {\varvec{\eta }}\Vert _{L^\infty } \le 1\). By the assumptions on h and Lemmas 4 and 5, \(D{\varvec{G}}\) exists, and by the chain rule,

$$\begin{aligned} {[(D{\varvec{G}}({\varvec{\theta }})){\varvec{\eta }}]}_t = \nabla _x h^{\varvec{\theta }}_t {[(D{\varvec{x}}^{\varvec{\theta }}){\varvec{\eta }}]}_t + \nabla _p h^{\varvec{\theta }}_t {[(D{\varvec{p}}^{\varvec{\theta }}){\varvec{\eta }}]}_t + \nabla _\theta h^{\varvec{\theta }}_t {\eta }_t. \end{aligned}$$

Thus, \(\Vert {[(D{\varvec{G}}({\varvec{\theta }})){\varvec{\eta }}]}_t \Vert \le K_B \Vert {\varvec{\eta }}\Vert _{L^\infty }\) and

$$\begin{aligned} \Vert {[(D{\varvec{G}}({\varvec{\theta }})){\varvec{\eta }}]}_t - {[(D{\varvec{G}}({\varvec{\phi }})){\varvec{\eta }}]}_t \Vert&\le K_B \Vert \nabla _x h^{\varvec{\theta }}_t - \nabla _x h^{\varvec{\phi }}_t \Vert \\&\quad + K_L \Vert {[(Dx^{\varvec{\theta }}){\varvec{\eta }}]}_t - {[(Dx^{\varvec{\phi }}){\varvec{\eta }}]}_t \Vert \\&\quad + \cdots \end{aligned}$$

The other terms are split similarly and are omitted for brevity. Using the Lipschitz assumption on the derivatives of h and Lemmas 4 and 5, we obtain the result. \(\square \)

Applying Lemma 6 with \(h=H\) for each sample i and summing, we see that \(D{\varvec{F}}_N\) is bounded and Lipschitz almost surely, and so (B3) is satisfied. It remains to check (B2). Using Lemma 6 and (A1\(''\)), \(\Vert {\varvec{F}}_N \Vert _{L^\infty }\) and \(\Vert D{\varvec{F}}_N \Vert _{L^\infty }\) are almost surely bounded; hence, they satisfy standard concentration estimates. We have:

Lemma 7

There exist constants \(K_1, K_2>0\) such that for all \({\varvec{\theta }}\in L^\infty ([0,T],\Theta )\)

$$\begin{aligned}&\mathbb {P}[ \Vert {\varvec{F}}({\varvec{\theta }}) - {\varvec{F}}_N({\varvec{\theta }}) \Vert _{L^\infty } \ge s] \le 2 \exp {\left( -\frac{N s^2}{K_1 + K_2 s} \right) }, \\&\mathbb {P}[ \Vert D{\varvec{F}}({\varvec{\theta }}) - D{\varvec{F}}_N({\varvec{\theta }}) \Vert _{L^\infty } \ge s] \le 2 \exp {\left( -\frac{N s^2}{K_1 + K_2 s} \right) }. \end{aligned}$$

Proof

Since the summands \(\nabla _\theta H(x^{{\varvec{\theta }},i}_t, p^{{\varvec{\theta }},i}_t, \theta _t)\) are uniformly bounded by \(K_B\), we can apply the infinite-dimensional Hoeffding’s inequality ([53], Corollary 2) to obtain

$$\begin{aligned} \mathbb {P}[ \Vert {\varvec{F}}({\varvec{\theta }}) - {\varvec{F}}_N({\varvec{\theta }}) \Vert _{L^\infty } \ge s] \le 2 \exp {\left( -\frac{N s^2}{2 K_B^2 + (2/3)K_B s} \right) }, \end{aligned}$$

and similarly for \(D{\varvec{F}}_N\). \(\square \)
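As a finite-dimensional sanity check of the Bernstein-type bound above (a sketch of the scalar analogue only, not the infinite-dimensional statement of [53]), one can compare the empirical tail probability of a bounded sample mean against the bound:

```python
import numpy as np

rng = np.random.default_rng(2)

# Scalar analogue of the bound used in Lemma 7: for i.i.d. real variables
# bounded by K_B with mean zero,
#   P[|mean_N| >= s] <= 2 * exp(-N * s^2 / (2*K_B^2 + (2/3)*K_B*s)).
K_B, N, s, trials = 1.0, 500, 0.1, 5000
samples = rng.uniform(-K_B, K_B, size=(trials, N))  # bounded, mean zero
emp_prob = float(np.mean(np.abs(samples.mean(axis=1)) >= s))
bound = 2 * np.exp(-N * s**2 / (2 * K_B**2 + (2 / 3) * K_B * s))
print(f"empirical P = {emp_prob:.4f}  <=  bound = {bound:.4f}")
```

The empirical tail is far below the bound here, as expected, since Bernstein-type inequalities are not tight for moderate deviations.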

Given the above results, we can deduce Theorem 6 directly.

Theorem 6

Let \({\varvec{\theta }}^*\) be a solution of \({\varvec{F}}=0\) (defined in (56)) that is stable on \(S_\rho ({\varvec{\theta }}^*)\) for some \(\rho >0\). Then, there exist positive constants \(s_0,C,K_1,K_2\) and \(\rho _1<\rho \), and a random variable \({\varvec{\theta }}^N \in S_{\rho _1}({\varvec{\theta }}^*) \subset L^\infty ([0,T],\Theta )\), such that

$$\begin{aligned} \mathbb {P}[ \Vert {\varvec{\theta }}^* - {\varvec{\theta }}^N \Vert _{L^\infty } \ge C s]&\le 4 \exp {\left( -\frac{N s^2}{K_1 + K_2 s} \right) }, \qquad s\in (0,s_0], \\ \mathbb {P}[ {\varvec{F}}_N({\varvec{\theta }}^N) \ne 0 ]&\le 4 \exp {\left( -\frac{N s_0^2}{K_1 + K_2 s_0} \right) }. \end{aligned}$$

In particular, \({\varvec{\theta }}^N \rightarrow {\varvec{\theta }}^*\) and \({\varvec{F}}_N({\varvec{\theta }}^N)\rightarrow 0\) in probability.

Proof

Use Theorem 5 with estimates derived in Lemmas 6 and 7. \(\square \)

Theorem 6 describes the convergence of a stationary solution of the sampled PMP (i.e., a solution of the first-order condition) to the population solution of the mean-field PMP. Together with a local strong concavity condition, we show further in Corollary 1 that this stationary solution is in fact a local maximum of the sampled Hamiltonian. The corresponding convergence of loss function values is provided in Corollary 2.

Corollary 1

Let \({\varvec{\theta }}^*\) be a solution of the mean-field PMP such that there exists \(\lambda _0>0\) satisfying, for a.e. \(t\in [0,T]\), \(\mathbb {E}\nabla ^2_{\theta \theta }H(x_t^{{\varvec{\theta }}^*},p_t^{{\varvec{\theta }}^*},\theta ^*_t)+\lambda _0 I \preceq 0\). Then, the random variable \({\varvec{\theta }}^N\) defined in Theorem 6 satisfies, with probability at least \(1 - 6\exp {[ -{(N \lambda _0^2)}/{(K_1 + K_2 \lambda _0)}]}\), that \(\theta ^N_t\) is a strict local maximum of the sampled Hamiltonian \(\frac{1}{N}\sum _{i=1}^N H(x^{{\varvec{\theta }}^N,i}_t, p^{{\varvec{\theta }}^N,i}_t, \theta )\). In particular, if the sampled Hamiltonian has a unique local maximizer, then \({\varvec{\theta }}^N\) is a solution of the sampled PMP with the same high probability.

Proof

Let

$$\begin{aligned}&[{\varvec{I}}({\varvec{\theta }})]_t :=\mathbb {E}_{\mu _0} \nabla ^2_{\theta \theta }H\left( x_t^{{\varvec{\theta }}},p_t^{{\varvec{\theta }}},\theta _t\right) ,\\&[{\varvec{I}}_N({\varvec{\theta }})]_t :=\frac{1}{N} \sum _{i=1}^N \nabla ^2_{\theta \theta }H\left( x_t^{{\varvec{\theta }},i},p_t^{{\varvec{\theta }},i},\theta _t\right) . \end{aligned}$$

Given the assumption that the Hessian is negative definite at \(\theta ^*_t\):

$$\begin{aligned} {[}{\varvec{I}}({\varvec{\theta }}^*)]_t + \lambda _0 I \preceq 0, \end{aligned}$$

it suffices to show that, for sufficiently small \(c>0\),

$$\begin{aligned} \mathbb {P}[\Vert {\varvec{I}}_N({\varvec{\theta }}^N) - {\varvec{I}}({\varvec{\theta }}^*) \Vert _{L^\infty } \ge 2c\lambda _0] = o(1), \quad N\rightarrow \infty . \end{aligned}$$

Consider the following estimate:

$$\begin{aligned} \mathbb {P}[\Vert {\varvec{I}}_N({\varvec{\theta }}^N) - {\varvec{I}}({\varvec{\theta }}^*) \Vert _{L^\infty } \ge 2c\lambda _0] \,&\le \mathbb {P}[\Vert {\varvec{I}}_N({\varvec{\theta }}^N) - {\varvec{I}}_N({\varvec{\theta }}^*) \Vert _{L^\infty } \ge c\lambda _0 \text { or } \Vert {\varvec{I}}_N({\varvec{\theta }}^*) - {\varvec{I}}({\varvec{\theta }}^*) \Vert _{L^\infty } \ge c\lambda _0]\\&\le \mathbb {P}[\Vert {\varvec{I}}_N({\varvec{\theta }}^N) - {\varvec{I}}_N({\varvec{\theta }}^*) \Vert _{L^\infty } \ge c\lambda _0] + \mathbb {P}[\Vert {\varvec{I}}_N({\varvec{\theta }}^*) - {\varvec{I}}({\varvec{\theta }}^*) \Vert _{L^\infty } \ge c\lambda _0]. \end{aligned}$$

To bound the first term, we can use steps similar to those in the proof of Lemma 6, which give

$$\begin{aligned} \mathop {{{\,\mathrm{ess\,sup}\,}}}\limits _{t\in [0,T]} \left\| \nabla ^2_{\theta \theta }H\left( x^{{\varvec{\theta }}}_t,p^{{\varvec{\theta }}}_t,\theta _t\right) - \nabla ^2_{\theta \theta }H\left( x^{{\varvec{\phi }}}_t,p^{{\varvec{\phi }}}_t,\phi _t\right) \right\| \le K_L \Vert {\varvec{\theta }}-{\varvec{\phi }}\Vert _{L^\infty }. \end{aligned}$$

Hence, we have

$$\begin{aligned} \mathbb {P}[\Vert {\varvec{I}}_N({\varvec{\theta }}^N) - {\varvec{I}}_N({\varvec{\theta }}^*) \Vert _{L^\infty } \ge c\lambda _0]&\le \mathbb {P}[\Vert {\varvec{\theta }}^N - {\varvec{\theta }}^*\Vert _{L^\infty } \ge c\lambda _0/K_L] \\&\le 4 \exp {\left( -\frac{N \lambda _0^2}{K_1 + K_2 \lambda _0} \right) }. \end{aligned}$$

To bound the second term, note that \(\Vert {\varvec{I}}_N({\varvec{\theta }}) \Vert \) is uniformly bounded; hence, we can apply the infinite-dimensional Hoeffding’s inequality ([53], Corollary 2) to obtain

$$\begin{aligned} \mathbb {P}[\Vert {\varvec{I}}_N({\varvec{\theta }}^*) - {\varvec{I}}({\varvec{\theta }}^*) \Vert _{L^\infty } \ge c\lambda _0] \le 2 \exp {\left( -\frac{N \lambda _0^2}{K'_1 + K'_2 \lambda _0} \right) }. \end{aligned}$$

Combining the two estimates completes the proof. \(\square \)

Corollary 2

Let \({\varvec{\theta }}^N\) be as defined in Theorem 6. Then, there exist constants \(K_1, K_2\) such that

$$\begin{aligned} \mathbb {P}[|J({\varvec{\theta }}^N) - J({\varvec{\theta }}^*)| \ge s] \le 4 \exp {\left( -\frac{N s^2}{K_1 + K_2s} \right) }, \qquad s\in (0,s_0]. \end{aligned}$$

Proof

Note that \(J({\varvec{\theta }}) = \mathbb {E}_{\mu _0} [ \Phi (x^{\varvec{\theta }}_T,y_0) + \int _{0}^{T} L(x^{\varvec{\theta }}_t, \theta _t) \mathrm{d}t ]\). Using Lemma 4, we have

$$\begin{aligned} |J({\varvec{\theta }}^N) - J({\varvec{\theta }}^*)|&\le K_L \left\| x^{{\varvec{\theta }}^*}_T - x^{{\varvec{\theta }}^N}_T \right\| + K_L \int _{0}^{T} \left\| x^{{\varvec{\theta }}^*}_t - x^{{\varvec{\theta }}^N}_t \right\| + \left\| \theta ^*_t - \theta ^N_t \right\| \mathrm{d}t \\&\le K_L' \left\| {\varvec{\theta }}^N - {\varvec{\theta }}^*\right\| _{L^\infty }. \end{aligned}$$

Thus, using Theorem 6, we have

$$\begin{aligned} \mathbb {P}[ |J({\varvec{\theta }}^N) - J({\varvec{\theta }}^*)| \ge s ]&\le \mathbb {P}\left[ \left\| {\varvec{\theta }}^N - {\varvec{\theta }}^* \right\| _{L^\infty } \ge s/K_L'\right] \\&\le 4 \exp {\left( -\frac{N s^2}{K_1 + K_2s} \right) }. \end{aligned}$$

\(\square \)

Theorem 6 and Corollary 1 establish a rigorous connection between solutions of the mean-field PMP and its sampled version: when a solution \({\varvec{\theta }}^*\) of the mean-field PMP is stable, then for large N, with high probability we can find in its neighborhood a random variable \({\varvec{\theta }}^N\) that is a stationary solution of the sampled PMP (51). If, furthermore, the maximization is non-degenerate (the local concavity assumption in Corollary 1) and unique, then \(\theta ^N_t\) maximizes the sampled Hamiltonian with high probability. Note that this concavity condition is local in the sense that it only has to be satisfied along the paths corresponding to \({\varvec{\theta }}^*\), whereas the strong concavity condition required in Theorem 4 is stronger, as it is global. Of course, in the case where the Hamiltonian is quadratic in \(\theta \), i.e., when \(f(x,\theta )\) is linear in \(\theta \) and the regularization \(L(x,\theta )\) is quadratic in \(\theta \) (this is still a nonlinear network, see Example 1), all concavity assumptions in the preceding results are satisfied.
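The quadratic case in the last sentence can be checked directly. The following sketch assumes the convention \(H(x,p,\theta ) = p\cdot f(x,\theta ) - L(x,\theta )\) with \(f(x,\theta )=\theta \sigma (x)\) and \(L(x,\theta )=\lambda \Vert \theta \Vert ^2\) (the specific \(\sigma \), dimensions, and \(\lambda \) here are illustrative, not taken verbatim from Example 1); the Hessian of H in \(\theta \) is then \(-2\lambda I\) everywhere, confirming global strong concavity in \(\theta \):

```python
import numpy as np

# Illustrative check (assumed convention H = p . f(x, theta) - L(x, theta)):
# f(x, theta) = theta @ sigma(x) is linear in theta and L = lam*||theta||^2
# is quadratic, so grad^2_theta H = -2*lam*I regardless of (x, p), even
# though the loss J itself is non-convex.
d, lam = 3, 0.5
sigma = np.tanh
x, p = np.ones(d), np.full(d, 2.0)

def H(theta):
    return p @ (theta @ sigma(x)) - lam * np.sum(theta**2)

# Numerical Hessian of H in theta (flattened), by central differences.
m, eps = d * d, 1e-4
hess = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        def Hf(ti, tj):
            th = np.zeros(m)
            th[i] += ti
            th[j] += tj
            return H(th.reshape(d, d))
        hess[i, j] = (Hf(eps, eps) - Hf(eps, -eps)
                      - Hf(-eps, eps) + Hf(-eps, -eps)) / (4 * eps**2)
print(np.allclose(hess, -2 * lam * np.eye(m), atol=1e-5))
```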

The key assumption for the results in this section is the stability condition (cf. Definition 2). In general, this is different from the assumption that \(H(x^{{\varvec{\theta }}^*}_t,p^{{\varvec{\theta }}^*}_t,\theta ^*_t)\) is strongly concave point-wise in t. However, one can show, using the triangle inequality and the estimates in Lemma 5, that if H is strongly concave with a sufficiently large concavity parameter (\(\lambda _0\)), then the solution must be stable. Intuitively, the stability assumption ensures that we can find a small region around \({\varvec{\theta }}^*\) that is isolated from other solutions, which then allows us to find a nearby solution of the sampled problem that is close to this solution. On the other hand, if \(DF({\varvec{\theta }}^*)\) has a non-trivial kernel, then one cannot expect to construct a \({\varvec{\theta }}^N\) that is close to \({\varvec{\theta }}^*\) itself, or to any specific point in the kernel. However, one may still find a \({\varvec{\theta }}^N\) that is close to the kernel as a whole.

We also remark that in both the mean-field problem (3) and the sampled problem (4), the parameters at any time only affect the incremental change of the state (as in a residual connection), and the dimension of the system always stays the same in the continuous-time setup. Accordingly, although there is an infinite number of parameters, there is no permutation invariance among different rows of \(\theta _t\) or different components of \(x_t\). This stands in contrast to common over-parameterized neural networks, in which such permutation invariance is widespread when no residual connection is present. Hence, the stability assumption can reasonably hold even in an “over-parameterized” regime.

Corollary 2 is a simple consequence of the previous results and is effectively a statement about the generalization error of the learning model, since it quantifies the difference between the loss values attained by the population and the empirical risk minimization solutions. We mention an interesting feature of the optimal control framework, alluded to earlier, in the context of generalization. Since we have only assumed that the controls or weights \({\varvec{\theta }}\) are measurable and essentially bounded in time (and thus can be very discontinuous), we are always dealing with the case where the number of parameters is infinite. Even in this case, we can derive non-trivial generalization estimates. This is to be contrasted with classical generalization bounds based on measures of complexity [26], where the number of parameters adversely affects generalization. Note that many recent works take on such issues from varying angles, e.g., [2, 23, 49].
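As a toy illustration of this kind of generalization gap (not an instance of the control model in this paper), consider the scalar risk \(J(\theta ) = \mathbb {E}[(x-\theta )^2]\) with \(x \sim \mathcal {N}(0,1)\): the population minimizer is \(\theta ^* = 0\), the empirical minimizer \(\theta ^N\) is the sample mean, and the excess risk \(J(\theta ^N) - J(\theta ^*) = (\theta ^N)^2\) concentrates around \(1/N\).

```python
import random
import statistics

def excess_risk(N):
    """For J(theta) = E[(x - theta)^2], x ~ N(0,1), theta* = 0,
    the empirical minimizer is the sample mean, so the excess risk
    is the squared sample mean."""
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    m = statistics.fmean(xs)
    return m * m

random.seed(0)
print(excess_risk(100), excess_risk(10000))
```

Averaged over repeated draws, the excess risk decays like \(1/N\), consistent with the exponential tail bound of Corollary 2 at fixed deviation level \(s\).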

9 Conclusion

In this paper, we introduce the mathematical formulation of the population risk minimization problem of continuous-time deep learning in the context of mean-field optimal control. In this framework, the compositional structure of deep neural networks is explicitly taken into account as the evolution of the dynamical system in time. To analyze this mean-field optimal control problem, we proceed from two parallel but interrelated perspectives, namely the dynamic programming approach and the maximum principle approach. In the former, an infinite-dimensional Hamilton–Jacobi–Bellman (HJB) equation for the optimal loss function values is derived, with state variables being the joint distribution of input–target pairs. The viscosity solution of the derived HJB equation provides us with a complete characterization of the original population risk minimization problem, giving both the optimal loss function value and an optimal feedback control policy. In the latter approach, we prove a mean-field Pontryagin’s maximum principle that constitutes necessary conditions for optimality. This can be viewed as a local characterization of optimal trajectories, and indeed we formally show that the PMP can be derived from the HJB equation using the method of characteristics. Using the PMP, we study a sufficient condition under which the solution of the PMP is unique. Lastly, we prove an existence result for sampled PMP solutions near stable solutions of the mean-field PMP. We show how this result connects with the generalization error of deep learning and provides a new direction for obtaining generalization estimates in the case of an infinite number of parameters and a finite number of sample points. Overall, this work establishes a concrete mathematical framework from which novel ways to attack the pertinent problems in practical and theoretical deep learning may be further developed.

As a specific motivation for future work, notice that we have assumed here that the state dynamics f is independent of the distribution law of \(x_t\) and depends only on \(x_t\) itself and the control \(\theta _t\). There are more complex network structures used in practice that fall outside this assumption. Let us take batch normalization as an example [34]. A batch normalization step involves normalizing the inputs using some distribution \(\nu \) and then rescaling (and re-centering) the output using trainable variables, so that the representational capacity of the layer is preserved. This has been found empirically to have a beneficial regularization effect during training, but theoretical analysis of such effects is limited. In the present setting, we can write a batch normalization operation as

$$\begin{aligned} BN_{\gamma ,\beta }(x,\nu ):=\gamma \odot \frac{x-\int z\,\mathrm{d}\nu (z)}{\sqrt{\int \left( z-\int z'\,\mathrm{d}\nu (z')\right) ^2\,\mathrm{d}\nu (z) + \epsilon }} + \beta . \end{aligned}$$

Here, \(\gamma ,\beta \in \mathbb {R}^d\) are trainable parameters, \(\odot \) denotes element-wise multiplication, and \(\epsilon \) is a small constant that avoids division by zero. If we insert a batch normalization operation immediately after the skip connection, the corresponding state dynamics f becomes
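A minimal NumPy sketch of the operation \(BN_{\gamma ,\beta }(x,\nu )\) above, with \(\nu \) approximated by a finite batch of samples (the function name and argument layout are ours, for illustration only):

```python
import numpy as np

def batch_norm(x, samples, gamma, beta, eps=1e-5):
    """BN_{gamma,beta}(x, nu), with nu replaced by the empirical
    distribution of `samples`.

    x:       (d,) input (or (n, d) batch, via broadcasting)
    samples: (n, d) draws approximating nu
    gamma, beta: (d,) trainable scale and shift
    """
    mean = samples.mean(axis=0)   # int z dnu(z)
    var = samples.var(axis=0)     # int (z - mean)^2 dnu(z)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```

With \(\gamma = \mathbf {1}\) and \(\beta = \mathbf {0}\), applying this map to the batch itself yields outputs with (approximately) zero mean and unit variance in each coordinate, which is the normalization effect described above.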

$$\begin{aligned} f(x,\theta ) \rightarrow f(BN_{\gamma ,\beta }(x,\nu ),\theta ). \end{aligned}$$

By incorporating \(\gamma ,\beta \) into the parameter vector \(\theta \) and taking \(\nu \) to be the population distribution of the state, the state dynamics takes the following abstract form

$$\begin{aligned} \dot{x}_t=\tilde{f}(x_t,\theta ,\mathbb {P}_{x_t}). \end{aligned}$$
(58)

This is the more general formulation typically considered in the mean-field optimal control literature. The associated objective is very similar to (3), except for the state dynamics:

$$\begin{aligned} \begin{aligned} \inf _{{\varvec{\theta }}\in L^{\infty }([0,T],\Theta )} J({\varvec{\theta }})&:=\mathbb {E}_{\mu _0} \left[ \Phi (x_T, y_0) + \int _{0}^{T} L(x_t, \theta _t) \mathrm{d}t \right] ,\\&\text {subject to~(58)}. \end{aligned} \end{aligned}$$
(59)
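A minimal particle discretization of (58)–(59): forward Euler on N interacting particles, with the law \(\mathbb {P}_{x_t}\) approximated by the empirical distribution of the batch. The specific choices of \(\tilde{f}\), L, and \(\Phi \) below are illustrative stand-ins, not forms taken from the text; here the mean-field interaction enters only through the batch mean.

```python
import numpy as np

def simulate_cost(theta, x0, y0, dt=0.05):
    """Forward-Euler particle approximation of (58)-(59), with
    illustrative choices:
        tilde_f(x, th, mu) = th * (mean(mu) - x)   (mean-field drift)
        L(x, th)           = 0.5 * th**2           (running cost)
        Phi(x_T, y_0)      = (x_T - y_0)**2        (terminal cost)

    theta: (num_steps,) piecewise-constant control
    x0:    (N,) particle initial states (samples from mu_0)
    y0:    (N,) targets
    """
    x = x0.copy()
    J = 0.0
    for th in theta:
        m = x.mean()                # empirical stand-in for P_{x_t}
        x = x + dt * th * (m - x)   # Euler step of (58)
        J += dt * 0.5 * th**2       # accumulate running cost
    J += np.mean((x - y0) ** 2)     # terminal cost, averaged over mu_0
    return J
```

For targets clustered at the population mean, a nonzero contracting control can lower the terminal cost by more than it pays in running cost, which is exactly the trade-off the infimum in (59) resolves.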

The dynamic programming principle and the maximum principle are still applicable in this setting. For instance, the associated HJB equation can be derived as

$$\begin{aligned} {\left\{ \begin{array}{ll} \displaystyle {\frac{\partial v}{\partial t} + \inf _{\theta \in \Theta }\left\langle \partial _\mu v(t,\mu )(\cdot )\cdot \bar{f}(\cdot ,\theta ,\mu )+ \bar{L}(\cdot ,\theta ),\,\mu \right\rangle = 0,}&{}\text {on~~} [0,T) \times \mathcal {P}_2(\mathbb {R}^{d+l}),\\ \displaystyle {v(T, \mu )=\langle \bar{\Phi }(\cdot ),\mu \rangle }, &{}\text {on~~} \mathcal {P}_2(\mathbb {R}^{d+l}), \end{array}\right. } \end{aligned}$$

where \(\bar{f}(w,\theta ,\mu ):=(\tilde{f}(x,\theta ,\mu _x),0)\). Similarly, we expect the following mean-field PMP (in the lifted space) to hold under suitable conditions:

$$\begin{aligned}&\dot{x}^*_t = \tilde{f}(x^*_t, \theta ^*_t, \mathbb {P}_{x^*_t}),&x^*_0 = x_0, \\&\dot{p}^*_t = - \nabla _x H(x^*_t, p^*_t, \theta ^*_t, \mathbb {P}_{x^*_t}),&p^*_T = -\nabla _x \Phi (x^*_T, y_0), \\&\mathbb {E}_{\mu _0} H(x^*_t, p^*_t, \theta ^*_t, \mathbb {P}_{x^*_t}) \ge \mathbb {E}_{\mu _0} H(x^*_t, p^*_t, \theta , \mathbb {P}_{x^*_t}),&\forall \,\theta \in \Theta , \quad a.e.\,t\in [0,T], \end{aligned}$$

where the Hamiltonian function \(H: \mathbb {R}^d \times \mathbb {R}^d \times \Theta \times \mathcal {P}_2(\mathbb {R}^d) \rightarrow \mathbb {R}\) is given by

$$\begin{aligned} H(x,p,\theta ,\mu ) = p\cdot \tilde{f}(x, \theta , \mu ) - L(x, \theta ). \end{aligned}$$
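In the quadratic case discussed after Corollary 1 (\(\tilde{f}\) linear in \(\theta \), L quadratic in \(\theta \)), the Hamiltonian maximization step of the PMP has a closed form. A toy scalar sketch, with the assumed forms \(\tilde{f}(x,\theta ,\mu ) = \theta x\) (the \(\mu \)-dependence dropped for simplicity) and \(L(x,\theta ) = \tfrac{\lambda }{2}\theta ^2\):

```python
def H(x, p, theta, lam):
    """Toy scalar Hamiltonian: f(x, theta) = theta * x (linear in theta),
    L(x, theta) = 0.5 * lam * theta**2 (quadratic regularization)."""
    return p * (theta * x) - 0.5 * lam * theta**2

def argmax_theta(x, p, lam):
    # Stationarity: dH/dtheta = p*x - lam*theta = 0  =>  theta = p*x/lam.
    # H is strongly concave in theta (curvature -lam), so this is the max.
    return p * x / lam
```

In this regime the maximum condition of the PMP reduces to a pointwise-in-t formula for \(\theta ^*_t\), which is what makes all the concavity assumptions in the preceding results hold automatically.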

Thus, batch normalization can be viewed as an instance of general mean-field dynamics and can be treated in a principled way under the mean-field optimal control framework. We leave the study of the further implications of this connection for the theoretical understanding of batch normalization to future work.