1 Introduction

In this paper we are interested in the Pontryagin Maximum Principle for a class of general stochastic control problems with McKean–Vlasov dynamics:

$$\begin{aligned} \left\{ \begin{array}{lll} dX^u_t=b(t, \cdot , X^u_t,\mathbb {P}_{X^u_t}, u_t)dt+\sigma (t, \cdot , X^u_t,\mathbb {P}_{X^u_t}, u_t)dW_t, \\ X^u_0=x_0, \end{array}\right. \quad t\in [0,T], \end{aligned}$$
(1.1)

where \(W=\{W_t\}_{t\ge 0}\) is a one-dimensional Brownian motion, defined on a complete probability space \((\Omega ,\mathcal{F},\mathbb {P})\), \(\mathbb {P}_\xi :=\mathbb {P}\circ \xi ^{-1}\) denotes the law of the random variable \(\xi \), \(T>0\) is a given time horizon, and the coefficients \(b, \sigma : [0,T]\times \Omega \times \mathbb {R}^d\times {\mathscr {P}}_2(\mathbb {R}^d)\times U \mapsto \mathbb {R}^d\) are measurable functions with appropriate dimensions. Here \(\mathscr {P}_2(\mathbb {R}^d)\) is the space of all probability measures on \(\mathbb {R}^d\) with finite second moment, endowed with the 2-Wasserstein metric (see Sect. 2 for details).

The stochastic control problems with dynamics (1.1) have been used in the study of mean-field control/potential games, to describe the (Nash) equilibrium state of a symmetric game, or the limiting state of a large number of interacting players. We refer to [16] for the main ideas of mean-field games, and to [14, 11–13] and the references cited therein for the history and recent developments of stochastic control problems with controlled McKean–Vlasov dynamics and related mean-field games.

It is worth noting that the system (1.1) is very general, in that the dependence of the coefficients on the law of the solution \(\mathbb {P}_{X_t}\) could be genuinely nonlinear as an element of the space of probability measures. In fact, while all existing works can be put into the general framework of (1.1), many of them fall into one of the following two types:

$$\begin{aligned} \mathrm{(a)}\quad \varphi (t, \omega , X_{t}, \mathbb {P}_{X_t}, u_t)&={\widetilde{\varphi }}(t,\omega , X_{t}, \mathbb {E}[\psi (X_t)], u_t)\\&={\widetilde{\varphi }}\left( t, \omega ,X_{ t}, \int _\mathbb {R}\psi (y) \mathbb {P}_{X_t}(dy),u_t\right) , \end{aligned}$$

where \(\psi \) is a given function, \(\varphi =b, \sigma \), and \(\tilde{\varphi }\) denotes a different function corresponding to b or \(\sigma \) in an obvious way. Such a case resembles the mean-field interaction of scalar type in mean-field theory, and was used as the controlled dynamics in [4]. We note that in this case the role of the measure \(\mathbb {P}_{X_t}\) would be averaged out and therefore less essential in the analysis.

$$\begin{aligned} \mathrm{(b)}\quad \varphi (t, \omega , X_{ t}, \mathbb {P}_{X_t}, u_t)&=\mathbb {E}[{\widetilde{\varphi }}(t, x, X_t,z)]\big |_{x=X_t, z=u_t}\\&=\int _\mathbb {R}{\widetilde{\varphi }}(t,x, y, z)\mathbb {P}_{X_t}(dy)\big |_{x=X_t, z=u_t}, \quad \varphi =b,\sigma . \end{aligned}$$

This is the most common case in the literature, known as the mean-field interaction of order 1, which often comes as a consequence of the law of large numbers and arises as the limit of mean-field games (see, for example, [3, 11–13], among the aforementioned references). We note that in this case the dependence on \(\mathbb {P}_{X_t}\) is actually linear(!). Clearly, the SDE (1.1) covers higher order interactions as well.
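For instance, taking \({\widetilde{\varphi }}(t,x,y,z)=x-y\) in (b) yields \(\varphi (t, \omega , X_t, \mathbb {P}_{X_t},u_t)=X_t-\mathbb {E}[X_t]\); more generally, for fixed \((t,x,z)\) the dependence on the measure in (b) is only through the map

$$\begin{aligned} \mu \mapsto \int _\mathbb {R}{\widetilde{\varphi }}(t,x, y, z)\mu (dy), \end{aligned}$$

which is linear in \(\mu \); this is exactly the linearity alluded to above.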

We should note that the optimal control problem with general McKean–Vlasov dynamics has been studied along the lines of dynamic programming (see, e.g., [1, 2, 4]), under various conditions. In particular, either the convexity of the control set or of the Hamiltonian, or the existence of an optimal control, is assumed so as to facilitate the discussion of the dynamic programming principle. On the other hand, most of the works mentioned above focused on the necessary conditions of optimality, known as Pontryagin’s Maximum Principle. Again, the convexity of either the control region or the Hamiltonian (or the coefficients) played an important role in the discussion. In fact, to the best of our knowledge, the Stochastic Maximum Principle (SMP for short) for general controlled McKean–Vlasov dynamics with a non-convex control region remains an open problem to date, and it is the main objective of this paper.

The main technical issue in dealing with the Stochastic Maximum Principle without the convexity assumption on either the control region or the Hamiltonian, especially in the case when the control enters the diffusion coefficient \(\sigma \) (often referred to as Peng’s SMP due to his seminal work [18]), is the need to consider a second order variational equation, or equivalently, a second order Taylor expansion (see [18]), which naturally involves the second order derivatives of all spatial variables in the coefficients. This immediately leads to, among other things, the subtle difficulty in treating the desired second order derivatives with respect to the measures in the space \(\mathscr {P}_2(\mathbb {R}^d)\), along with some appropriate estimates.

An interesting observation is that by adding the tool of the derivatives with respect to measures, we can now treat more general mean-field cases for which the method of our previous work [4] would fail. A not-so-subtle example is the case where the coefficients are of the form

$$\begin{aligned} \phi (t,X_t, \mathbb {P}_{X_t}, u_t)=\phi (t, X_t, \mathbb {E}[\psi (X_t)], u_t), \qquad \phi =b, \sigma , f, \end{aligned}$$

where \(\psi \) is some general nonlinear function (see, e.g., [8]). The method of [4] would only work for the case \(\psi (y)=y\), whereas it is now clearly one of the simplest forms covered by our setting.
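Anticipating the notion of derivative with respect to probability measures recalled in Sect. 2, and assuming for illustration that \(\psi \in C^1_b(\mathbb {R})\), one can write \(\phi (t,x,\mu ,u)={\widehat{\phi }}(t,x,\int _\mathbb {R}\psi \,d\mu ,u)\) for a function \({\widehat{\phi }}\) of a scalar argument, and a direct computation with the lift gives

$$\begin{aligned} \partial _\mu \phi (t,x,\mu ,u; y)=\partial _z{\widehat{\phi }}(t,x,z,u)\big |_{z=\int _\mathbb {R}\psi \,d\mu }\,\psi '(y), \qquad y\in \mathbb {R}. \end{aligned}$$

In particular the measure derivative genuinely depends on \(y\) unless \(\psi \) is affine, which is consistent with the observation that [4] covers only the case \(\psi (y)=y\).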

The main purpose of this paper is to prove the following Stochastic Maximum Principle (SMP), along the by now well-known scheme of [18]. More precisely, consider the following Hamiltonian:

$$\begin{aligned} H(t, x,\mu , u,p,q)\mathop {\buildrel \Delta \over =}b(t,x,\mu ,u)\cdot p+\sigma (t,x,\mu ,u)\cdot q-f(t,x,\mu ,u), \end{aligned}$$
(1.2)

where \((t,x,\mu ,u,p,q)\in [0,T]\times \mathbb {R}^d\times \mathscr {P}_2(\mathbb {R}^d)\times U\times \mathbb {R}^d\times \mathbb {R}^d\). We are to show that, if \((u^*, X^*)\) is an optimal solution of the stochastic control problem, then there are two pairs of adapted processes \((p, q)\) and \((P, Q)\), known as the first and second order adjoint processes, respectively, such that for all \(u\in U\) and a.e. \(t\in [0,T]\), it holds \(\mathbb {P}\)-almost surely that

$$\begin{aligned}&H(t, X^*_{ t},\mathbb {P}_{X^*_t},u; p_t,q_t)-H(t,X^*_{t},\mathbb {P}_{X^*_t}, u^*_t; p_t,q_t)\\&\quad + \frac{1}{2}[\Delta \sigma ^{*, u}(t, \cdot )]^TP_t[\Delta \sigma ^{*, u}(t, \cdot )] \le 0, \end{aligned}$$

where \(\Delta \sigma ^{*, u}(t, \cdot ):= \sigma (t, X^*_{t},\mathbb {P}_{X^*_t},u)-\sigma (t, X^*_{t},\mathbb {P}_{X^*_t},u^*_t)\). Furthermore, the processes \((p, q)\) and \((P, Q)\) are the solutions of the first and second order adjoint equations, which are in the form of mean-field backward SDEs similar to those in [3], but whose parts involving second order derivatives require special attention. The key point, however, is to prove the following estimate:

$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|X^{\varepsilon }(t)-\left( X^*(t)+Y^{\varepsilon }(t)+Z^{\varepsilon }(t) \right) |^{2k}\right] \le \varepsilon ^{2k}\rho _k(\varepsilon ), \end{aligned}$$
(1.3)

where \(X^{\varepsilon }\mathop {\buildrel \Delta \over =}X^{u^{\varepsilon }}\) denotes the state process corresponding to the spike variation of \(u^*\): \(u^{\varepsilon }_t :=u_t \mathbf{1}_{E_\varepsilon }(t)+ u^*_t\mathbf{1}_{E_{\varepsilon }^c}(t)\), where \(E_{\varepsilon }\subset [0,T]\) is a Borel set with \(|E_{\varepsilon }|=\varepsilon \); \(Y^\varepsilon \), \(Z^\varepsilon \) are the solutions to the first and second order variational equations, respectively; and \(\rho _k\) is a positive function such that \(\rho _k(\varepsilon )\downarrow 0\) as \(\varepsilon \downarrow 0\).

It is worth noting that, while our analysis follows the idea of [17], there are some technical difficulties due to the presence of the second order derivatives with respect to measures in the Taylor expansion. In particular, as noted in [7], the second order Fréchet derivative with respect to \(L^2\)-random variables may fail to exist. It turns out, however, that such a difficulty can be overcome by some careful analysis and estimates on the first and second order variational processes, so as to show that the term potentially involving the second order derivative with respect to the measure is of higher order, and consequently only the mixed second order derivatives are relevant, as we observed in [7]. To the best of our knowledge, these estimates, especially those in Propositions 4.3 and 5.3 of this paper, have never been established before. Consequently, albeit technical, we show that the scheme of [18] can still be validated, and that \(1\le k\le \frac{3}{2}\) gives the desired order in (1.3).

The rest of the paper is organized as follows. In Sect. 2 we give all the necessary preparations on the technical tools, including the precise definition of the second order Fréchet derivative with respect to the measures \(\mathbb {P}_{X_t}\) in the coefficients. In Sect. 3 we formulate the problem and state the main theorem (SMP), and in Sect. 4 we study the (first and second order) variational equations. Finally in Sects. 5 and 6 we establish the main estimates and prove the main theorem.

2 Preliminaries

Throughout this paper we consider a complete, filtered probability space \((\Omega ,\mathcal{F},\mathbb {P};\mathbb {F})\), on which is defined a 1-dimensional \(\mathbb {F}\)-Brownian motion \(W=\{W_t\}_{t\in [0,T]}\), where \(T>0\) denotes an arbitrarily fixed time horizon, and \(\mathbb {F}=\{\mathcal{F}_t\}_{t\ge 0}\). For a generic Euclidean space \(\mathbb {X}\), we denote its inner product by \((\cdot , \cdot )\) (or simply “\(~\cdot ~\)”), its norm by \(|\cdot |\), and its Borel \(\sigma \)-field by \(\mathscr {B}(\mathbb {X})\). Also, for any sub-\(\sigma \)-field \(\mathcal{G}\subseteq \mathcal{F}\) and \(1\le p < \infty \), we denote

  • \(L^p({\mathcal{G}};\mathbb {X})\) to be all \(\mathbb {X}\)-valued, \(\mathcal{G}\)-measurable random variables \(\xi \) with \(\Vert \xi \Vert _p\mathop {\buildrel \Delta \over =} \mathbb {E}[|\xi |^p]^{1/p}<\infty \). In particular, \(L^2(\mathcal{G};{\mathbb R}^d)\) is a Hilbert space with inner product \((\xi ,\eta )_{2}\mathop {\buildrel \Delta \over =}\mathbb {E}[\xi \cdot \eta ]\), \( \xi \), \(\eta \in L^2(\mathcal{G}; {\mathbb R}^d)\), and norm \(\Vert \xi \Vert _{2}=\sqrt{(\xi ,\xi )_{2}}\).

  • \(L^p_\mathbb {F}([0,T];\mathbb {X})\) to be all \(\mathbb {X}\)-valued, \(\mathbb {F}\)-adapted processes \(\eta \) on [0, T], such that

    $$\begin{aligned} \Vert \eta \Vert _{p,T}\mathop {\buildrel \Delta \over =}\mathbb {E}\left[ \int _0^T|\eta _t|^pdt\right] ^{1/p}<\infty ; \end{aligned}$$
  • \({\mathscr {P}}_2(\mathbb {X})\) to be the space of all probability measures \(\mu \) on \((\mathbb {X},{\mathscr {B}}(\mathbb {X}))\) with finite second moment (i.e., \(\int _{\mathbb {X}}|x|^2\mu (dx)<\infty \)). In particular, we endow the space \({\mathscr {P}}_2({\mathbb R}^{d})\) with the following 2-Wasserstein metric: for \(\mu ,\ \nu \in {\mathscr {P}}_2({\mathbb R}^d)\),

    $$\begin{aligned} W_2(\mu ,\nu ) \mathop {\buildrel \Delta \over =}\text {inf}&\Bigg \{\Bigg [\int _{{\mathbb {R}}^{2d}}|x-y|^2\rho (dx,dy)\Bigg ]^{\frac{1}{2}}: \rho \in {\mathscr {P}}_2({\mathbb R}^{2d}),\nonumber \\&\quad \rho (\cdot , {\mathbb R}^d)=\mu ,~\rho ({\mathbb R}^d, \cdot ) =\nu \Bigg \}. \end{aligned}$$
    (2.1)

    Furthermore, for an \(\mathbb {X}\)-valued random variable \(\xi \) defined on \((\Omega , \mathcal{F}, \mathbb {P})\), we denote \(\mathbb {P}_\xi \mathop {\buildrel \Delta \over =}\mathbb {P}\circ \xi ^{-1}\), the law induced by \(\xi \) on \((\mathbb {X},{\mathscr {B}}(\mathbb {X}))\).
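Two elementary examples may help fix ideas: for Dirac masses and for one-dimensional Gaussian laws one has

$$\begin{aligned} W_2(\delta _x,\delta _y)=|x-y|, \qquad W_2\big (N(m_1,\sigma _1^2),N(m_2,\sigma _2^2)\big )=\sqrt{(m_1-m_2)^2+(\sigma _1-\sigma _2)^2}. \end{aligned}$$

Moreover, since \((\xi ,\eta )\) is itself a coupling of \((\mathbb {P}_\xi ,\mathbb {P}_\eta )\), one always has \(W_2(\mathbb {P}_\xi ,\mathbb {P}_\eta )\le \Vert \xi -\eta \Vert _2\), an inequality that is useful whenever a coefficient is Lipschitz continuous in its measure argument.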

We now recall briefly an important notion in mean-field theory: the differentiability with respect to probability measures. We shall follow the approach introduced in [15] and later detailed in [9] (see also [7] and [13] for more discussions). The main idea is to identify a distribution \(\mu \in \mathscr {P}_2(\mathbb {R}^d)\) with a random variable \(\vartheta \in L^2(\mathcal{F};{\mathbb R}^d)\) such that \(\mu =\mathbb {P}_\vartheta \). To be more precise, let us assume that the probability space \((\Omega , \mathcal{F}, \mathbb {P})\) is rich enough in the sense that for every \(\mu \in {\mathscr {P}}_2({\mathbb R}^d)\), there is a random variable \(\vartheta \in L^2({\mathcal{F}};{\mathbb R}^d)\) such that \(\mathbb {P}_\vartheta =\mu \). It is well-known that the probability space \(([0,1],{\mathscr {B}}([0,1]),dx)\), where “dx” is the Lebesgue measure, has this property.

Next, for any function \(f:{\mathscr {P}}_2({\mathbb R}^d)\rightarrow {\mathbb R}\), we induce a function \(f^\sharp : L^2(\mathcal{F};{\mathbb R}^d)\rightarrow {\mathbb R}\), such that \(f^\sharp (\vartheta ) := f(\mathbb {P}_\vartheta ),\ \vartheta \in L^2(\mathcal{F};{\mathbb R}^d)\). Clearly, the function \(f^\sharp \), often called the “lift” of f in the literature, depends only on the law of \(\vartheta \in L^2(\mathcal{F};\mathbb {R}^d)\), and is independent of the choice of the representative \(\vartheta \). Recall now from [9] (see also [10]), a function \(f:{\mathscr {P}}_2({\mathbb {R}}^d)\mapsto {\mathbb {R}}\) is said to be differentiable at \(\mu _0\in {\mathscr {P}}_2({\mathbb {R}}^d)\) if there exists \(\vartheta _0\in L^2(\mathcal{F}, \mathbb {R}^d)\) with \(\mathbb {P}_{\vartheta _0}=\mu _0\), such that its lift \(f^\sharp \) is Fréchet differentiable at \(\vartheta _0\). In other words, there exists a continuous linear functional \(Df^\sharp (\vartheta _0): L^2(\mathcal{F};{\mathbb R}^d)\rightarrow {\mathbb R}\) such that

$$\begin{aligned} f^\sharp (\vartheta _0+\eta )-f^\sharp (\vartheta _0)=\mathop {\langle }Df^\sharp (\vartheta _0) ,\eta \mathop {\rangle }+ o(\Vert \eta \Vert _{2})\mathop {\buildrel \Delta \over =}D_\eta f(\mu _0)+o(\Vert \eta \Vert _2), \end{aligned}$$
(2.2)

where \(\mathop {\langle }\cdot , \cdot \mathop {\rangle }\) is the “dual product” on \(L^2(\mathcal{F};{\mathbb R}^d)\), and we will refer to \({D_{\eta }} f({\mu _{0}})\) as the Fréchet derivative of f at \(\mu _0\), in the direction \(\eta \). Clearly, in this case we have

$$\begin{aligned} D_{\eta } f({\mu }_{0})={\mathop {\langle }} {Df}^{\sharp }({\vartheta _{0}}), {\eta }\mathop {\rangle }{\mathop {\buildrel \Delta \over =}} \frac{d }{dt}{f}^{\sharp }({\vartheta _{0}}+t\eta )|_{t=0}, \qquad {\mu _{0}}=\mathbb {P}_{\vartheta _0}. \end{aligned}$$
(2.3)

The second order derivative, however, is a much more subtle issue. For example, it is tempting to define \(D^2_{\eta \eta }f(\vartheta _0):= \frac{d^2}{dt^2}f^\sharp (\vartheta _0+t\eta )\Big |_{t=0}\). But it turns out that with such a definition even a function that is infinitely differentiable in the usual sense becomes nowhere twice differentiable(!) (see [7] for a counterexample). A more reasonable definition of the second order derivative has been given recently in [7] and [13], which we now briefly describe.

First note that by Riesz’ Representation Theorem, there is a (unique) random variable \(\Theta _0\in L^2(\mathcal{F};{\mathbb R}^d)\) such that \(\mathop {\langle }Df^\sharp (\vartheta _0),\eta \mathop {\rangle }=(\Theta _0, \eta )_{2}=\mathbb {E}[(\Theta _0,\eta )]\), \( \eta \in L^2(\mathcal{F};{\mathbb R}^d)\). It was shown (see [15] or [9]) that there exists a Borel function \(h[{\mu _0}]:{\mathbb R}^d\rightarrow {\mathbb R}^d\), depending only on the law \(\mu _0=\mathbb {P}_{\vartheta _0}\) but not on the particular choice of the representative \(\vartheta _0\), such that \(\Theta _0=h[\mu _0](\vartheta _0)\). Thus we can write (2.2) as

$$\begin{aligned} f(\mathbb {P}_\vartheta )-f(\mathbb {P}_{\vartheta _0})=(h[\mu _0](\vartheta _0), \vartheta -\vartheta _0)_2+o(\Vert \vartheta -\vartheta _0\Vert _{2}), \qquad \forall \vartheta \in L^2(\mathcal{F};{\mathbb R}^d).\nonumber \\ \end{aligned}$$
(2.4)

We shall denote \(\partial _\mu f(\mathbb {P}_{\vartheta _0},y)\mathop {\buildrel \Delta \over =}h[\mu _0](y),\ y\in {\mathbb R}^d\). Namely, we have the following identities:

$$\begin{aligned} Df^\sharp (\vartheta _0)=\Theta _0=h[\mathbb {P}_{\vartheta _0}](\vartheta _0)=\partial _\mu f(\mathbb {P}_{\vartheta _0}, \vartheta _0), \end{aligned}$$
(2.5)

and \(D_\eta f(\mathbb {P}_{\vartheta _0})=\mathop {\langle }\partial _\mu f(\mathbb {P}_{\vartheta _0}, \vartheta _0), \eta \mathop {\rangle }\), where \(\eta =\vartheta -\vartheta _0\). We note that for each \(\mu \in \mathscr {P}_2(\mathbb {R}^d)\), \(\partial _\mu f(\mathbb {P}_{\vartheta },\cdot )=h[ \mathbb {P}_\vartheta ](\cdot )\) is only defined in a \(\mathbb {P}_\vartheta (dy)\)-a.e. sense, where \(\mu =\mathbb {P}_\vartheta \).
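Two simple examples, assuming \(\psi \in C^1_b(\mathbb {R}^d)\) with Lipschitz continuous gradient, may help illustrate the notation. For \(f_1(\mu )=\int _{\mathbb {R}^d}\psi (x)\mu (dx)\) and, with \(d=1\), \(f_2(\mu )=\big (\int _\mathbb {R}x\,\mu (dx)\big )^2\), the lifts are \(f_1^\sharp (\vartheta )=\mathbb {E}[\psi (\vartheta )]\) and \(f_2^\sharp (\vartheta )=(\mathbb {E}[\vartheta ])^2\), and one easily checks that

$$\begin{aligned} \partial _\mu f_1(\mu ,y)=\nabla \psi (y), \qquad \partial _\mu f_2(\mu ,y)=2\int _\mathbb {R}x\,\mu (dx). \end{aligned}$$

In the first example the derivative depends on \(y\) but not on \(\mu \); in the second it depends on \(\mu \) but is constant in \(y\).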

Let us now assume that the function \(f:\mathscr {P}_2(\mathbb {R}^d)\mapsto \mathbb {R}\) is differentiable on the whole space \(\mathscr {P}_2(\mathbb {R}^d)\), and consider the derivative \(\partial _\mu f(\mathbb {P}_\vartheta , y)\). It can be shown (see, e.g. [13, Lemma 3.2]) that if the mapping \(\vartheta \mapsto Df^\sharp (\vartheta )=h[\mathbb {P}_\vartheta ](\vartheta )\) is Lipschitz continuous with a Lipschitz constant \(K>0\), then for all \(\vartheta \in L^2(\mathcal{F};{\mathbb R}^d)\) there is a \(\mathbb {P}_\vartheta \)-version of \(\partial _\mu f(\mathbb {P}_\vartheta , \cdot )\) such that the mapping \(y\mapsto \partial _\mu f(\mathbb {P}_\vartheta , y)=h[\mathbb {P}_\vartheta ](y)\) is Lipschitz continuous, with the same uniform Lipschitz constant. In what follows we shall always refer to such a version, without further specification. It is worth noting that while the law of \(Df^\sharp (\vartheta )=\partial _\mu f(\mathbb {P}_\vartheta , \vartheta )\) does not depend on the choice of \(\vartheta \) (cf. [9, Theorem 6.2]), as an \(L^2(\mathcal{F}; \mathbb {R}^d)\) random vector, \(Df^\sharp (\vartheta )\) does(!).

Definition 2.1

We say that \(f\in \mathbb {C}^{1,1}_b({\mathscr {P}}_2({\mathbb R}^d))\) if for all \(\vartheta \in L^2(\mathcal{F};\mathbb {R}^d)\), there exists a \(\mathbb {P}_\vartheta \)-modification of \(\partial _\mu f(\mathbb {P}_\vartheta , \cdot )\), still denoted by \(\partial _\mu f(\mathbb {P}_\vartheta , \cdot )\), such that \(\partial _\mu f: \mathscr {P}_2(\mathbb {R}^d)\times \mathbb {R}^d\mapsto \mathbb {R}^d\) is bounded and Lipschitz continuous. That is, for some constant \(C>0\), it holds that

  (i) \(|\partial _\mu f(\mu , x)|\le C\), \(\forall \mu \in \mathscr {P}_2(\mathbb {R}^d)\), \(x\in \mathbb {R}^d\);

  (ii) \(|\partial _\mu f(\mu ,y)-\partial _\mu f(\mu ',y')|\le C(W_2(\mu ,\mu ')+|y-y'|)\), \( \mu , \mu '\in {\mathscr {P}}_2({\mathbb R}^d)\), \( y, y'\in {\mathbb R}^d\).

Remark 2.2

We would like to point out that, if \(f\in \mathbb {C}_b^{1,1}({\mathscr {P}}_2 ({\mathbb {R}}^d))\), the version of \(\partial _\mu f(\mathbb {P}_{\vartheta },\cdot )\), \(\vartheta \in L^2(\mathcal{F};{\mathbb {R}}^d)\), indicated in Definition 2.1 is unique. Indeed, given \(\vartheta \in L^2(\mathcal{F};{\mathbb R}^d)\), let \(\eta \) be a d-dimensional vector of independent standard normal random variables, independent of \(\vartheta \). Then, since \(\partial _\mu f(\mathbb {P}_{\vartheta +\varepsilon \eta }, \vartheta +\varepsilon \eta )\) is \(\mathbb {P}\)-a.s. defined, \(\partial _\mu f(\mathbb {P}_{\vartheta +\varepsilon \eta }, y)\) is defined dy-a.e. From the Lipschitz continuity (ii) of \(\partial _\mu f\) in Definition 2.1 it then follows that \(\partial _\mu f(\mathbb {P}_{\vartheta +\varepsilon \eta }, y)\) is well-defined for all \(y\in {\mathbb R}^d\). Letting \(\varepsilon \downarrow 0\) then yields that \(\partial _\mu f(\mathbb {P}_{\vartheta }, y)\) is uniquely defined for all \(y\in {\mathbb R}^d\).

Now let \(f\in \mathbb {C}^{1,1}_b(\mathscr {P}_2(\mathbb {R}^d))\), and consider the mapping \(\partial _\mu f=([\partial _\mu f]_1, \cdots , [\partial _\mu f]_d)^T:\mathscr {P}_2(\mathbb {R}^d)\times \mathbb {R}^d\mapsto \mathbb {R}^d\). We are now ready to define the second order derivatives of f.

Definition 2.3

We say that \(f\in \mathbb {C}^{2,1}_b(\mathscr {P}_2({\mathbb R}^d))\) if \(f\in \mathbb {C}^{1,1}_b({\mathscr {P}}_2({\mathbb R}^d))\) and the following hold:

  (i) \([\partial _\mu f]_i(\cdot ,y)\in \mathbb {C}^{1,1}_b({\mathscr {P}}_2({\mathbb R}^d)),\) for all \(y\in {\mathbb R}^d,\, 1\le i\le d\);

  (ii) \(\partial _\mu f(\mu , \cdot ): {\mathbb R}^d\rightarrow {\mathbb R}^d\) is differentiable, for every \(\mu \in {\mathscr {P}}_2({\mathbb R}^d)\);

  (iii) \(\partial _{y}\partial _\mu f:{\mathscr {P}}_2({\mathbb R}^d)\times {\mathbb R}^d\rightarrow {\mathbb R}^d\otimes {\mathbb R}^d\) and \(\partial ^2_\mu f(\mathbb {P}_{\vartheta _0}, y, z):=\partial _\mu [\partial _\mu f(\cdot ,y)](\mathbb {P}_{\vartheta _0},z):{\mathscr {P}}_2({\mathbb R}^d)\times {\mathbb R}^d\times {\mathbb R}^d\rightarrow {\mathbb R}^d\otimes {\mathbb R}^d\) are bounded and Lipschitz continuous.
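As an illustration of Definition 2.3, let \(d=1\), let \(\psi \in C^2_b(\mathbb {R})\) have a Lipschitz continuous second derivative, and set \(f(\mu )=\big (\int _\mathbb {R}\psi \,d\mu \big )^2\). Then \(\partial _\mu f(\mu ,y)=2\big (\int _\mathbb {R}\psi \,d\mu \big )\psi '(y)\), and differentiating once more gives

$$\begin{aligned} \partial _{y}\partial _\mu f(\mu ,y)=2\Big (\int _\mathbb {R}\psi \,d\mu \Big )\psi ''(y), \qquad \partial ^2_\mu f(\mu ,y,z)=2\,\psi '(y)\psi '(z), \end{aligned}$$

so that \(f\in \mathbb {C}^{2,1}_b({\mathscr {P}}_2(\mathbb {R}))\) under these assumptions.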

To end this section we present a second order Taylor expansion that plays an essential role in our discussion. Let \(f\in \mathbb {C}^{2,1}_b({\mathscr {P}}_2({\mathbb R}^d))\), then for \(1\le i\le d\) we have

$$\begin{aligned} Df^\sharp _i(\vartheta _0+\eta )-Df^\sharp _i(\vartheta _0)&=[\partial _\mu f]_i (\mathbb {P}_{\vartheta _0+\eta }, \vartheta _0+\eta )-[\partial _\mu f]_i(\mathbb {P}_{\vartheta _0}, \vartheta _0)\nonumber \\&=\left[ [\partial _\mu f]_i (\mathbb {P}_{\vartheta _0+\eta }, y)-[\partial _\mu f]_i (\mathbb {P}_{\vartheta _0},y)\right] \big |_{y=\vartheta _0+\eta }\nonumber \\&\quad + \big [[\partial _\mu f]_i (\mathbb {P}_{\vartheta _0}, y)\big |_{y=\vartheta _0+\eta }-[\partial _\mu f]_i (\mathbb {P}_{\vartheta _0},y)\big |_{y=\vartheta _0}\big ]\nonumber \\&=\int _0^1\mathop {\langle }D[\partial _\mu f]^\sharp _i({\vartheta _0+\lambda \eta }, y), \eta \mathop {\rangle }d\lambda \big |_{y=\vartheta _0}\nonumber \\&\quad + ( \partial _y[\partial _\mu f]_i (\mathbb {P}_{\vartheta _0}, \vartheta _0), \eta ) +o(\Vert \eta \Vert _2). \end{aligned}$$
(2.6)

Here, again \(\mathop {\langle }\cdot ,\cdot \mathop {\rangle }\) is the dual product in \(L^2(\mathcal{F}; \mathbb {R}^d)\), \((\cdot ,\cdot )\) is the inner product of \(\mathbb {R}^d\), and, in view of (2.5),

$$\begin{aligned} D[\partial _\mu f]^\sharp _i({\vartheta _0}, y)=\partial _\mu [[\partial _\mu f]_i(\cdot , y)](\mathbb {P}_{\vartheta _0},\vartheta _0) =: [\partial ^2_{\mu }f]_i (\mathbb {P}_{\vartheta _0}, y, z)\big |_{z= \vartheta _0}\in \mathbb {R}^d,\nonumber \\ \end{aligned}$$
(2.7)

and the last line of (2.6) follows from the Lipschitz continuity of \([\partial _\mu ^2 f]_i\) and \(\partial _y[\partial _\mu f]_i\) (see Definition 2.3).

We should note that, similar to our previous work [7], in this paper we do not actually need \(\partial _\mu ^2 f\) for the formulation of our main result, Theorem 3.5, but rather the mixed derivative \(\partial _y\partial _\mu f\) defined in Definition 2.3. The following discussion is nevertheless necessary for the second order Taylor expansion, and is therefore interesting in its own right. Let \((\widetilde{\Omega },\widetilde{\mathcal{F}},\widetilde{\mathbb {P}})\) be a copy of the probability space \((\Omega , \mathcal{F}, \mathbb {P})\). For any pair of random variables \((\vartheta _0, \eta )\in [L^{2}(\mathcal{F}; {\mathbb {R}^{d}})]^{2}\), we let \(({\widetilde{\vartheta }}_{0}, {\widetilde{\eta }})\) be an independent copy of \((\vartheta _0, \eta )\) defined on \((\widetilde{\Omega },\widetilde{\mathcal{F}},\widetilde{\mathbb {P}})\). We then consider the product space \((\Omega \times \widetilde{\Omega }, \mathcal{F}\otimes \widetilde{\mathcal{F}},\mathbb {P}\otimes \widetilde{\mathbb {P}})\) and set \(({\widetilde{\vartheta }}_0, {\widetilde{\eta }})(\omega ,{\widetilde{\omega }})\mathop {\buildrel \Delta \over =}(\vartheta _0({\widetilde{\omega }}),\eta ({\widetilde{\omega }}))\), \(\forall (\omega ,{\widetilde{\omega }})\in {\Omega }\times \widetilde{\Omega }\).

For any \(\mu _0\in \mathscr {P}_2(\mathbb {R}^d)\) with \(\mu _0=\mathbb {P}_{\vartheta _0}\), and \(\eta \in L^2(\mathcal{F};\mathbb {R}^d)\), we can define the “second order derivative of f at \(\mu _0\in \mathscr {P}_2(\mathbb {R}^d)\), in the direction \(\eta \)” via (2.6):

$$\begin{aligned} D^2_{\eta }f(\mu _0):= & {} \mathop {\langle }\mathop {\langle }D[\partial _\mu f]^\sharp (\cdot , y)(\mathbb {P}_{\vartheta _0}, z)\big |_{z={\widetilde{\vartheta }}_0}, {\widetilde{\eta }}\mathop {\rangle }\big |_{y=\vartheta _0}, \eta \mathop {\rangle }+\mathop {\langle }\partial _y\partial _\mu f(\mathbb {P}_{\vartheta _0}, \vartheta _0)\eta , \eta \mathop {\rangle }\nonumber \\= & {} \mathbb {E}\{\widetilde{\mathbb {E}}\{\text {tr}[\partial _\mu ^2 f(\mathbb {P}_{\vartheta _0}, \vartheta _0, {\widetilde{\vartheta }}_0){\widetilde{\eta }}\otimes \eta ]\}\} +\mathbb {E}\{\text {tr}[\partial _y\partial _\mu f(\mathbb {P}_{\vartheta _0}, \vartheta _0)\eta \otimes \eta ]\}.\nonumber \\ \end{aligned}$$
(2.8)

Here, for \(x\in \mathbb {R}^d\), \(x\otimes x \mathop {\buildrel \Delta \over =}xx^T\in \mathbb {R}^d\otimes \mathbb {R}^d\), and the expectation \(\widetilde{\mathbb {E}}[\cdot ]\) acts only on random variables marked with a “ \(\widetilde{}\) ”. It is now easy to check that, with the notations \(D_\eta f\) and \(D^2_\eta f\) defined by (2.3) and (2.8), we have the following simple form of the Taylor expansion (cf. [7, Lemma 2.1])

$$\begin{aligned} f(\mathbb {P}_{\vartheta _0+\eta })-f(\mathbb {P}_{\vartheta _0})=D_\eta f(\mathbb {P}_{\vartheta _0})+\frac{1}{2} D^2_\eta f(\mathbb {P}_{\vartheta _0})+R(\eta ), \end{aligned}$$
(2.9)

where \(|R(\eta )|\le C\mathbb {E}(|\eta |^3\wedge |\eta |^2)=O(\Vert \eta \Vert ^2_2)\).
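As a quick sanity check of (2.8) and (2.9), take \(d=1\) and \(f(\mu )=\big (\int _\mathbb {R}x\,\mu (dx)\big )^2\), so that \(\partial _\mu f(\mu ,y)=2\int _\mathbb {R}x\,\mu (dx)\), \(\partial _y\partial _\mu f\equiv 0\) and \(\partial ^2_\mu f\equiv 2\). Then

$$\begin{aligned} D_\eta f(\mathbb {P}_{\vartheta _0})=2\mathbb {E}[\vartheta _0]\mathbb {E}[\eta ], \qquad D^2_\eta f(\mathbb {P}_{\vartheta _0})=\mathbb {E}\big \{\widetilde{\mathbb {E}}[2\,{\widetilde{\eta }}\,\eta ]\big \}=2(\mathbb {E}[\eta ])^2, \end{aligned}$$

and indeed \(f(\mathbb {P}_{\vartheta _0+\eta })-f(\mathbb {P}_{\vartheta _0})=2\mathbb {E}[\vartheta _0]\mathbb {E}[\eta ]+(\mathbb {E}[\eta ])^2\), so that (2.9) holds with \(R(\eta )=0\) in this (admittedly very simple) example.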

We should remark that the fact that the remainder \(R(\eta )\) is only of order \(O(\Vert \eta \Vert ^2_2)\), rather than \(o(\Vert \eta \Vert ^2_2)\) as one would hope, is the main reason that \(f(\mathbb {P}_\vartheta )\) may fail to be twice differentiable even though its lift \(f^\sharp \) is(!), as is shown by the example in [7].

3 Problem Formulation

Let us consider a complete probability space \((\Omega ,\mathcal{F},\mathbb {P})\) on which is defined an m-dimensional Brownian motion \(W=\{W_t\}_{t\ge 0}\), and let \(T>0\) be a given time horizon. We shall assume that there exists a sub-\(\sigma \)-field \(\mathcal{F}_0\subset \mathcal{F}\) that is independent of \(\mathbb {F}^W\), the filtration generated by W, and is “rich enough” in the sense described in the previous section. To wit, \({\mathscr {P}}_2({\mathbb {R}}^\ell )=\{\mathbb {P}_\vartheta ,\, \vartheta \in L^2(\mathcal{F}_0;{\mathbb {R}}^\ell )\}\), \(\ell \ge 1\). In the sequel we denote \(\mathbb {F}=\{\mathcal{F}^W_t\vee \mathcal{F}_0\}_{t\in [0,T]}\), with the standard augmentation.

We are interested in the following general controlled mean-field stochastic system:

$$\begin{aligned} \left\{ \begin{array}{lll} dX^u_t=b(t, X^u_t,\mathbb {P}_{X^u_t}, u_t)dt+\sigma (t,X^u_t,\mathbb {P}_{X^u_t}, u_t)dW_t, \\ X^u_0=x_0, \end{array}\right. \quad t\in [0,T], \end{aligned}$$
(3.1)

where the coefficients \((b, \sigma )\): \([0,T]\times \mathbb {R}^d\times {\mathscr {P}}_2(\mathbb {R}^d)\times \mathbb {R}^k \mapsto \mathbb {R}^d\times \mathbb {R}^{d\times m}\) are deterministic functions with appropriate dimensions, and \(u\in L^2_\mathbb {F}([0,T];\mathbb {R}^k)\) is the given “control” process.

Remark 3.1

In order not to overcomplicate the already notationally heavy presentation of this paper, in what follows we shall assume all processes are 1-dimensional (i.e., \(d=k=m=1\)). We should note that the higher dimensional cases can be argued along the same lines without substantial difficulties, except for even heavier notations. We leave the details to the interested reader.

Let \(U\subseteq \mathbb {R}\) be a non-empty subset. We say that a process \(u\in L^2_\mathbb {F}([0,T];\mathbb {R})\) is an admissible control if \(u_t\in U\) for all \(t\in [0, T]\), \(\mathbb {P}\)-a.s. We denote the set of all admissible controls by \(\mathscr {U}_{ad}\), and the goal of the optimal control problem is to minimize the following cost functional over \(\mathscr {U}_{ad}\):

$$\begin{aligned} J(u)=\mathbb {E}\left[ \int _0^Tf(t, X^u_t,\mathbb {P}_{X^u_t}, u_t)\,dt+ h(X^u_T,\mathbb {P}_{X^u_T})\right] , \end{aligned}$$
(3.2)

where \(f: [0,T]\times \mathbb {R}\times {\mathscr {P}}_2(\mathbb {R})\times U\rightarrow \mathbb {R}\), \(h: \mathbb {R}\times {\mathscr {P}}_2(\mathbb {R})\mapsto \mathbb {R}\) are deterministic functions.

A control \(u^*\in \mathscr {U}_{ad}\) satisfying

$$\begin{aligned} J(u^*)=\text {inf}_{u\in \mathscr {U}_{ad}}J(u) \end{aligned}$$
(3.3)

is called an optimal control. We denote by \(X^*:=X^{u^*}\) the corresponding (optimal) state process, namely, the solution of (3.1) with \(u=u^*\). The main objective of this paper is to prove the necessary conditions, also known as Pontryagin’s Maximum Principle, for the optimal control without the convexity assumption on the control set U.

We note that the temporal variable t in the coefficients b, \(\sigma \), f can be easily absorbed into the state process X by expanding its dimension, so in what follows we consider only the time-homogeneous coefficients for notational simplicity. We shall make use of the following Standing Assumptions.
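One standard, if slightly informal, way to carry out this reduction is to set \(\bar{X}^u_t\mathop {\buildrel \Delta \over =}(t, X^u_t)\) and, writing \(\pi _2\) for the projection onto the second coordinate, to use the time-homogeneous coefficients

$$\begin{aligned} \bar{b}\big ((s,x),\nu ,u\big )\mathop {\buildrel \Delta \over =}\big (1,\,b(s,x,\nu \circ \pi _2^{-1},u)\big ), \qquad \bar{\sigma }\big ((s,x),\nu ,u\big )\mathop {\buildrel \Delta \over =}\big (0,\,\sigma (s,x,\nu \circ \pi _2^{-1},u)\big ), \end{aligned}$$

so that \(d\bar{X}^u_t=\bar{b}(\bar{X}^u_t,\mathbb {P}_{\bar{X}^u_t},u_t)dt+\bar{\sigma }(\bar{X}^u_t,\mathbb {P}_{\bar{X}^u_t},u_t)dW_t\). Since the marginal map \(\nu \mapsto \nu \circ \pi _2^{-1}\) is 1-Lipschitz with respect to \(W_2\), Lipschitz-type assumptions in the measure variable are preserved by this reformulation.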

Assumption 3.2

The coefficients \(b, \sigma , f, h\) are measurable in all variables. Moreover, for all \(u\in U\), \(b(\cdot , \cdot , u), \sigma (\cdot , \cdot , u), f(\cdot , \cdot , u)\in \mathbb {C}^{1,1}_b({\mathbb {R}}\times {\mathscr {P}}_2({\mathbb {R}}); \mathbb {R})\), \(h(\cdot , \cdot )\in \mathbb {C}^{1,1}_b(\mathbb {R}\times {\mathscr {P}}_2(\mathbb {R});\mathbb {R})\). More precisely, for each \(u\in U\), denoting by \(\phi (x, \mu )\) any of \(b (x, \mu , u)\), \(\sigma (x, \mu , u)\), \(f(x, \mu , u)\), \(h(x, \mu )\), the function \(\phi (\cdot ,\cdot )\) enjoys the following properties:

  (i) for fixed \(x\in \mathbb {R}\), \(\phi (x, \cdot )\in \mathbb {C}^{1,1}_b({\mathscr {P}}_2({\mathbb R}))\);

  (ii) for fixed \(\mu \in {\mathscr {P}}_2({\mathbb {R}} )\), \(\phi (\cdot ,\mu )\in C^1_b(\mathbb {R})\);

  (iii) all the derivatives \(\partial _x \phi \) and \(\partial _\mu \phi \), \(\phi =b \), \(\sigma \), f, h, are bounded and Lipschitz continuous, with Lipschitz constants independent of \(u\in U\).

Assumption 3.3

The coefficients \(b, \sigma , f, h\) satisfy Assumption 3.2. Furthermore, for all \(u\in U\), \(b(\cdot , \cdot , u), \sigma (\cdot , \cdot , u), f(\cdot , \cdot , u)\in \mathbb {C}^{2,1}_b(\mathbb {R}\times {\mathscr {P}}_2(\mathbb {R}); \mathbb {R})\), \(h(\cdot , \cdot )\in \mathbb {C}^{2,1}_b(\mathbb {R}\times {\mathscr {P}}_2(\mathbb {R});\mathbb {R})\). More precisely, for each \(u\in U\), the derivatives of b, \(\sigma \), f, h, denoted by a generic function \(\phi (x,\mu )\), enjoy the following properties:

  (i) \(\partial _{x}\phi (\cdot , \cdot )\in \mathbb {C}^{1,1}_b(\mathbb {R}\times {\mathscr {P}}_2(\mathbb {R}))\);

  (ii) \(\partial _\mu \phi (\cdot , \cdot , \cdot )\in \mathbb {C}^{1,1}_b(\mathbb {R}\times {\mathscr {P}}_2(\mathbb {R})\times \mathbb {R})\);

  (iii) all the second order derivatives of \(b , \sigma ,f, h\) are bounded and Lipschitz continuous, with Lipschitz constants independent of \(u\in U\).

Remark 3.4

We should emphasize that, as one of the main features of Peng’s SMP, we do not require any differentiability of the coefficients with respect to the control variable u.

Clearly, under Assumption 3.2, for each \(u\in \mathscr {U}_{ad}\), SDE (3.1) admits a unique strong solution \(X^u\). Now let \(u^*\in \mathscr {U}_{ad}\) be an optimal control, and denote the optimal state by \(X^*=X^{u^*}\). To facilitate our presentation, we shall introduce some notations for the coefficients and their derivatives. These notations are slightly unusual, especially when they involve the derivatives defined in the previous section, so we shall describe them more carefully.

To begin with, we again let \(\phi (x, \mu , v)\), \((x,\mu ,v)\in \mathbb {R}\times \mathscr {P}_2(\mathbb {R})\times U\), be a generic function representing \(b, \sigma , f, h\), respectively. For any \(u\in \mathscr {U}_{ad}\), we denote \( \phi ^u(t)=\phi (X^u_t, \mathbb {P}_{X^u_t}, u_t)\), \(t\in [0,T]\); and define \(\phi (t):=\phi (X^*_t,\mathbb {P}_{X^*_t},u^*_t)\), where \(u^*\) is an optimal control. We denote

$$\begin{aligned} \left\{ \begin{array}{lll} \delta \phi (t):=\phi (X^*_t,\mathbb {P}_{X^*_t},u_t)-\phi (X^*_t,\mathbb {P}_{X^*_t}, u^*_t); \\ \displaystyle (\phi _x(t), \phi _{xx}(t)):=\Big (\frac{\partial \phi }{\partial x}, \frac{\partial ^ 2\phi }{\partial x^2}\Big )(X^*_t,\mathbb {P}_{X^*_t},u^*_t); \\ \displaystyle \phi _\mu (t, y)=\partial _\mu \phi (X^*_t, \mathbb {P}_{X^*_t}, u^*_t; y). \end{array}\right. \end{aligned}$$
(3.4)

Clearly, \(\phi _x(\cdot ), \phi _{xx}(\cdot )\) and \(\phi _\mu (\cdot , y)\), \(y\in \mathbb {R}\), are progressively measurable processes defined on \((\Omega , \mathcal{F}, \mathbb {P})\). Now we recall the product probability space \((\Omega \times {\widetilde{\Omega }}, \mathcal{F}\otimes {\widetilde{\mathcal{F}}}, \mathbb {P}\otimes {\widetilde{\mathbb {P}}})\) and denote all the processes defined on the space \(({\widetilde{\Omega }}, {\widetilde{\mathcal{F}}}, {\widetilde{\mathbb {P}}})\) with “\(~\widetilde{}~\)”. Let \((\widetilde{u}^*, {\widetilde{X}}^*)\) be an independent copy of \((u^*, X^*)\), so that \(\mathbb {P}_{X^*_t}={\widetilde{\mathbb {P}}}_{{\widetilde{X}}^*_t}\), \(t\in [0,T]\). We denote

$$\begin{aligned} \widetilde{\phi }_\mu (t):= \partial _\mu {\phi }(X^*_t,\mathbb {P}_{X^*_t}, u^*_t; {\widetilde{X}}^*_t);\quad {\widetilde{\phi }}^*_\mu (t):=\partial _\mu {\phi }({\widetilde{X}}^*_t,\mathbb {P}_{X^*_t}, {\widetilde{u}}^*_t; {X^*_t}), \quad t\in [0,T].\nonumber \\ \end{aligned}$$
(3.5)

Similarly, we can define the second derivative processes:

$$\begin{aligned} \left\{ \begin{array}{llll} \widetilde{\phi }_{\mu \mu }(t):={\partial ^2_{\mu }\phi }(X^*_t,\mathbb {P}_{X^*_t}, u^*_t;X^*_t, {\widetilde{X}}^*_t);\\ \phi _{x\mu }(t):={\partial _x}{\partial _\mu }\phi (X^*_t,\mathbb {P}_{X^*_t},u^*_t;X^*_t);\\ {\widetilde{\phi }}^*_{y\mu }(t):={\partial _y}{\partial _\mu }\phi (\widetilde{X}^*_t,\mathbb {P}_{X^*_t}, \widetilde{u}^*_t;X^*_t), \end{array}\right. \quad t\in [0,T], \end{aligned}$$
(3.6)

where \(\partial ^2_\mu \phi (X^*_t, \mathbb {P}_{X^*_t}, u^*_t; \cdot ,\cdot ):\mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) is the second order derivative, given \((X^*_t, \mathbb {P}_{X^*_t}, u^*_t)\). We note that all the derivative processes \(\widetilde{\phi }_\mu \), \({\widetilde{\phi }}^*_\mu \), \({\widetilde{\phi }}^*_{y\mu }\), and \({\widetilde{\phi }}_{\mu {\mu }}\) should all be understood as progressively measurable processes defined on the product space \(\Omega \times {\widetilde{\Omega }}\). Finally, we define

$$\begin{aligned} \left\{ \begin{array}{lll} \displaystyle \mathscr {L}_{xx}(t, \phi ,y):=\frac{1}{2} \partial _{x x}\phi (X^*_t,\mathbb {P}_{X^*_t},u^*_t)y^2, \\ \displaystyle \mathscr {L}_{y\mu }(t, {\widetilde{\phi }},y):=\frac{1}{2} \partial _y\partial _{\mu }\phi (X^*_t,\mathbb {P}_{X^*_t},u^*_t; \widetilde{X}^*_t) y^2, \end{array}\right. \quad t\in [0,T]. \end{aligned}$$
(3.7)
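To illustrate the notations (3.4)–(3.7) on a concrete, purely hypothetical coefficient (which need not satisfy all of Assumptions 3.2–3.3, but displays the structure), take \(\sigma (x,\mu ,v)=vx+\int _\mathbb {R}\psi \,d\mu \) with \(\psi \in C^2_b(\mathbb {R})\). Then

$$\begin{aligned} \delta \sigma (t)=(u_t-u^*_t)X^*_t, \quad \sigma _x(t)=u^*_t, \quad \sigma _{xx}(t)=0, \quad {\widetilde{\sigma }}_\mu (t)=\psi '({\widetilde{X}}^*_t), \quad {\widetilde{\sigma }}^*_\mu (t)=\psi '(X^*_t), \end{aligned}$$

while \({\widetilde{\sigma }}_{\mu \mu }(t)=0\), \(\sigma _{x\mu }(t)=0\), and \({\widetilde{\sigma }}^*_{y\mu }(t)=\psi ''(X^*_t)\).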

The Hamiltonian. We recall the Hamiltonian defined in (1.2) which, under our time-homogeneous assumption, now takes the following form: for any \((x, \mu , u, p, q)\in \mathbb {R}\times \mathscr {P}_2(\mathbb {R})\times \mathbb {R}\times \mathbb {R}\times \mathbb {R}\),

$$\begin{aligned} H(x,\mu ,u,p,q):=b(x,\mu ,u) p+\sigma (x, \mu ,u) q-f(x,\mu ,u). \end{aligned}$$
(3.8)

Now let \((u^*, X^*)\) be the optimal control-state pair, and let \((p, q)\) be a pair of adapted processes taking values in \(\mathbb {R}\times \mathbb {R}^{d\times d}\), respectively. We denote

$$\begin{aligned} H^{p,q}(t):=\mathscr {H}(X^*_t, u^*_t, p_t, q_t):=H(X^*_t, \mathbb {P}_{X^*_t}, u^*_t, p_t, q_t). \end{aligned}$$
(3.9)

In particular, if \((p, q)\) is the solution to the so-called “adjoint equation” (to be defined by (3.11) below), we shall simply denote \(H(t):=H^{p,q}(t)\).

Now using the notations of (3.4) for coefficients \(b,\sigma , f\), we define

$$\begin{aligned} \left\{ \begin{array}{lll} \delta H(t):= \delta b(t) p_t+\delta \sigma (t)\cdot q_t -\delta f(t);\\ H_x (t):= b_x (t)p_t+\sigma _x(t)q_t-f_x(t);\\ H_{xx}(t):= b_{xx}(t) p_t+ \sigma _{xx}(t) \otimes q_t-f_{xx}(t). \end{array}\right. \end{aligned}$$
(3.10)

First order adjoint equation. We are now ready to introduce two adjoint equations that will be the building blocks of the stochastic maximum principle. We first consider the first order adjoint equation, which is the following mean-field-type linear backward SDE:

$$\begin{aligned} \left\{ \begin{array}{ll} dp_t=-\Big \{b_x(t) p_t+\widetilde{\mathbb {E}}\big [\widetilde{b}^*_\mu (t)\cdot \widetilde{p}_t\big ]+\sigma _x(t) q_t+\widetilde{\mathbb {E}}\big [\widetilde{\sigma }^*_\mu (t)\widetilde{q}_t\big ]-f_x(t) \\ \qquad \quad \,\,-{\widetilde{\mathbb {E}}}\big [\widetilde{f}^*_\mu (t)\big ]\Big \}dt+q_t dW_t, \\ p_T=-h_x(T)-{\widetilde{\mathbb {E}}}\big [\widetilde{h}^*_\mu (T)\big ]. \end{array}\right. \end{aligned}$$
(3.11)

Here, we recall from (3.5) that \(\widetilde{\mathbb {E}}[{\widetilde{\phi }}^*_\mu (t)]:={\widetilde{\mathbb {E}}}[\partial _\mu \phi (\widetilde{X}^*_t, \mathbb {P}_{X^*_t}, \widetilde{u}^*_t;y)]\big | _{y=X^*_t}\), for \(\phi =b,\sigma , f, h\).

It is readily seen that the mean-field nature of BSDE (3.11) comes from the terms involving the Fréchet derivatives \(\partial _\mu b\), \(\partial _\mu {\sigma }\), and \(\partial _\mu f\); it reduces to a standard BSDE if the coefficients do not explicitly depend on the law of the solution. The well-posedness of BSDE (3.11) follows from [6, Theorem 3.1]. To be more precise, under Assumption 3.2, the BSDE (3.11) admits a unique \(\mathbb {F}\)-adapted solution \((p, q)\) such that

$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|p_t|^2+\int _0^T |q_t|^2\, dt\right] < +\infty . \end{aligned}$$
(3.12)
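In particular, if \(b, \sigma , f, h\) do not depend on the measure argument, all the terms involving \(\partial _\mu \) in (3.11) disappear and the equation reduces to the classical first order adjoint equation (with the sign convention of (3.8)):

$$\begin{aligned} \left\{ \begin{array}{ll} dp_t=-\big \{b_x(t) p_t+\sigma _x(t) q_t-f_x(t)\big \}dt+q_t\,dW_t, \\ p_T=-h_x(X^*_T). \end{array}\right. \end{aligned}$$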

Second order adjoint equation. We note that one of the main features of this paper, which distinguishes it from our previous work [4], is that we assume neither that the control set U is convex, nor that the coefficients have any differentiability in the control variable u. An important device in such a situation is the second order adjoint equation, initiated by Peng [18], which we now describe.

Consider the following \(\mathbb {R}\otimes \mathbb {R}\)-valued linear backward SDE:

$$\begin{aligned} \left\{ \begin{array}{lll} dP_t=-\big \{2\big (b_x(t)+\widetilde{\mathbb {E}}[\widetilde{b}^*_\mu (t)]\big )P_t+\big (\sigma _x(t) +\widetilde{\mathbb {E}}[\widetilde{\sigma }^*_\mu (t)]\big )^2P_t \\ \quad \qquad \quad \;+2\;\big (\sigma _x(t)+\widetilde{\mathbb {E}}[{\widetilde{\sigma }}^*_\mu (t)]\big )Q_t+\big (H_{xx}(t)+ \widetilde{\mathbb {E}}[\widetilde{H}^*_{\mu y}(t)]\big )\big \}dt+Q_t\,dW_t, \\ P_T=-\big (h_{xx}(T)+\widetilde{\mathbb {E}}[\widetilde{h}^*_{\mu y}(T)]\big ). \end{array}\right. \nonumber \\ \end{aligned}$$
(3.13)

We note that (3.13) is a standard linear BSDE, and it is well-known that it admits a unique \(\mathbb {F}\)-adapted solution \((P, Q)\) satisfying the following estimate:

$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|P_t|^2+\int _0^T|Q_t|^2 dt\right] <+\infty . \end{aligned}$$
(3.14)

We are now ready to state the main theorem of the paper.

Theorem 3.5

(Stochastic Maximum Principle) Suppose that Assumptions 3.2 and 3.3 are in force. Let \((u^*, X^*)\) be an optimal solution of the control problem (3.1)–(3.3). Then there are two pairs of \(\mathbb {F}\)-adapted processes \(\left( p,q\right) \) and \(\left( P,Q\right) \) satisfying the first and second order adjoint equations (3.11) and (3.13), together with the estimates (3.12) and (3.14), respectively, such that for all \(u\in U\) and a.e. \(t\in [0,T]\), it holds \(\mathbb {P}\)-almost surely that

$$\begin{aligned}&\mathscr {H}(X^*_t,u,p_t,q_t)-\mathscr {H}(X^*_t,u^*_t,p_t,q_t)\nonumber \\&\quad +\frac{1}{2}P_t \big [\sigma (X^*_t,\mathbb {P}_{X^*_t},u)-\sigma (X^*_t,\mathbb {P}_{X^*_t},u^*_t) \big ]^2 \le 0. \end{aligned}$$
(3.15)

Remark 3.6

(i) Note that, unlike the usual maximum principle, there is an extra term on the left-hand side of (3.15). This means that the inequality would generally be strict.

(ii) Theorem 3.5 can be extended to the higher dimensional cases without substantial difficulties. More precisely, if we consider the original system (3.1) with \(X_t\in \mathbb {R}^d\), \(W_t\in \mathbb {R}^m\), and \(u_t\in \mathbb {R}^k\), then in Theorem 3.5 the first order adjoint process \((p_t, q_t)\in (\mathbb {R}^d, \mathbb {R}^{d\times m})\), the second order adjoint process \((P_t, Q_t)\in (\mathbb {R}^{d\times d}, \mathbb {R}^{d\times d}\otimes \mathbb {R}^m)\), and the variational inequality (3.15) will read:

$$\begin{aligned} \mathscr {H}(X^*_t,u,p_t,q_t)-\mathscr {H}(X^*_t,u^*_t,p_t,q_t) +\frac{1}{2} [\Delta \sigma ^{*, u}(t, \cdot )]^TP_t[\Delta \sigma ^{*, u}(t, \cdot )] \le 0,\nonumber \\ \end{aligned}$$
(3.16)

where \(\Delta \sigma ^{*, u}(t, \cdot ):= \sigma (t, X^*_{t},\mathbb {P}_{X^*_t},u)-\sigma (t, X^*_{t},\mathbb {P}_{X^*_t},u^*_t)\).

4 Variational Equations

In this section we study an important ingredient of the stochastic maximum principle, that is, the “differentiation” of the state process by a perturbation of the optimal control. Since the control set U is not necessarily convex, we shall use the so-called spike variation, which we now describe.

As before, let \(u^*\in \mathscr {U}_{ad}\) denote an optimal control. For any \(\varepsilon >0\), we choose a Borel subset \(E_{\varepsilon }\subset [0,T]\) such that \(|E_{\varepsilon }|=\varepsilon \), where |A| denotes the Lebesgue measure of a set \(A\subseteq [0,T]\). Now for any \(u\in \mathscr {U}_{ad}\) we consider the following “spike variation” of \(u^*\): for \(t\in [0, T]\),

$$\begin{aligned} u^{\varepsilon }_t \mathop {\buildrel \Delta \over =}\left\{ \begin{array}{ll} u_t,\,\,\, t\in E_{\varepsilon },\\ u^*_t,\,\,\, t\in E_{\varepsilon }^c. \end{array}\right. \end{aligned}$$
(4.1)

We denote by \(X^{\varepsilon }\mathop {\buildrel \Delta \over =}X^{u^{\varepsilon }}\) the state process satisfying (3.1) with the control \(u^{\varepsilon }\), and consider the following two processes:

$$\begin{aligned} \Delta X^\varepsilon _t\mathop {\buildrel \Delta \over =}X^\varepsilon _t-X^*_t; \qquad \delta X^\varepsilon _t\mathop {\buildrel \Delta \over =}\frac{1}{\varepsilon }\Delta X^\varepsilon _t=\frac{1}{\varepsilon }[X^\varepsilon _t-X^*_t], \qquad t\in [0, T]. \end{aligned}$$
(4.2)

We will investigate the behavior of \(\Delta X^\varepsilon \) and \(\delta X^\varepsilon \) as \(\varepsilon \rightarrow 0\). Obviously, we expect that, as \(\varepsilon \rightarrow 0\), \(\Delta X^\varepsilon \rightarrow 0\); and hope that \(\delta X^\varepsilon \rightarrow Y\) for some continuous process Y, which satisfies the so-called first order variational equation. However, in the case when the diffusion term contains the control, and the control set is not convex, the aforementioned convergence and its speed are by no means obvious. In fact, the main idea of Peng [18] is to argue that, for each \(\varepsilon >0\), there exists a process \(Y^\varepsilon \), such that \(\delta X^\varepsilon -\frac{1}{\varepsilon }Y^\varepsilon = O(1)\), as \(\varepsilon \rightarrow 0\), i.e., \(\Delta X^\varepsilon -Y^\varepsilon =O(\varepsilon )\), and the process \(Y^\varepsilon \) satisfies, for each \(\varepsilon >0\), the following SDE:

$$\begin{aligned} \left\{ \begin{array}{lll} dY^{\varepsilon }_t=\big \{b_x(t)Y^{\varepsilon }_t+\widetilde{\mathbb {E}}\big [ \widetilde{b}_\mu (t)\widetilde{Y}^{\varepsilon }_t\big ]+\delta b(t)\mathbf{1}_{E_{\varepsilon }}(t)\big \}dt\\ \qquad \quad +\big \{\sigma _x(t)Y^{\varepsilon }_t+\widetilde{\mathbb {E}}\big [\widetilde{\sigma }_\mu (t) \widetilde{Y}^{\varepsilon }_t\big ]+\delta \sigma (t)\mathbf{1}_{E_{\varepsilon }}(t)\big \}dW_t, \\ Y^{\varepsilon }_0=0. \end{array}\right. \end{aligned}$$
(4.3)

In what follows we shall refer to equation (4.3) as the “first order variational equation”, and the process \(Y^\varepsilon \) is called the first order variational process.

A very important step in [18] is, in light of the Taylor expansion, to find a process \(Z^\varepsilon \) so that \(\Delta X^\varepsilon -Y^\varepsilon -Z^\varepsilon =o(\varepsilon )\), as \(\varepsilon \rightarrow 0\), and that the convergence is of an appropriate order. The process \(Z^\varepsilon \) is called the second order variational process, and we shall argue that in our case it satisfies the following SDE:

$$\begin{aligned} \left\{ \begin{array}{lll} dZ^{\varepsilon }_t=\big \{b_x(t)Z^{\varepsilon }_t+\widetilde{\mathbb {E}}\big [{\widetilde{b}_\mu }(t)\widetilde{Z}^{\varepsilon }_t\big ]+\mathscr {L}_{xx}(t,b,Y^{\varepsilon })+\mathscr {L}_{\mu y}(t,\widetilde{b},\widetilde{Y}^{\varepsilon })\big \} dt \\ \qquad \quad ~~+\,\big \{\sigma _x(t)Z^{\varepsilon }_t+\widetilde{\mathbb {E}}\big [{\widetilde{\sigma }_\mu }(t)\widetilde{Z}^{\varepsilon }_t\big ] +\mathscr {L}_{xx}(t,\sigma ,Y^{\varepsilon })+\mathscr {L}_{\mu y}(t,{\widetilde{\sigma }},\widetilde{Y}^{\varepsilon })\big \}dW_t \\ \qquad \quad ~~+\,\big \{\delta {b}_x(t)Y^{\varepsilon }_t+\widetilde{\mathbb {E}}\big [\delta {\widetilde{b}_\mu } (t)\widetilde{Y}^{\varepsilon }_t\big ]\big \}{} \mathbf{1}_{E_{\varepsilon }}(t)dt+\big \{\delta \sigma _x(t) Y^{\varepsilon }(t)\\ \qquad \quad ~~+\,\widetilde{\mathbb {E}}\big [\delta {\widetilde{\sigma }_\mu }(t)\widetilde{Y}^{\varepsilon }_t\big ]\big \} \mathbf{1}_{E_{\varepsilon }}(t)dW_t, \\ Z^{\varepsilon }_0=0. \end{array}\right. \end{aligned}$$
(4.4)

The equation (4.4) will be referred to as the second order variational equation. As expected, the adjoint processes \((p, q)\) and the variational processes \((Y^{\varepsilon },Z^{\varepsilon })\) are related by the following “duality relationship”, which is essential for the proof of the SMP.

Lemma 4.1

Let \((p, q)\) be the solution to the adjoint equation (3.11) satisfying (3.12), and let \(Y^\varepsilon \) and \(Z^\varepsilon \) be the solutions to the first and second order variational equations (4.3) and (4.4), respectively. Then the following duality relations hold:

$$\begin{aligned} \mathbb {E}[p_TY^\varepsilon _T]= & {} \mathbb {E}\Big [\int _0^T Y^{\varepsilon }_t\big (f_x(t) + \widetilde{\mathbb {E}}[{\widetilde{f}^*_\mu }(t)]\big )dt\Big ]\nonumber \\&+\, \mathbb {E}\Big [\int _0^T (\delta b(t) \cdot p_t+\delta \sigma (t)\cdot q_t)\mathbf{1}_{E_{\varepsilon }}(t)dt\Big ], \end{aligned}$$
(4.5)
$$\begin{aligned} \mathbb {E}[p_TZ^{\varepsilon }_T]= & {} \mathbb {E}\Big [\int _0^T Z^{\varepsilon }_t(f_x(t)+ \widetilde{\mathbb {E}}[{\widetilde{f}^*_\mu }(t)])dt\Big ]\nonumber \\&+\,\mathbb {E}\Big [\int _0^T(p(t)\delta b_x(t) + q_t\delta \sigma _x(t))Y^{\varepsilon }_t \mathbf{1}_{E_{\varepsilon }}(t)dt\Big ] \nonumber \\&+ \,\mathbb {E}\Big [\int _0^Tp_t\Big ({\mathscr {L}}_{xx}(t,b,Y^{\varepsilon })+\widetilde{\mathbb {E}}[ {\mathscr {L}}_{\mu y}(t,\widetilde{b},\widetilde{Y}^{\varepsilon })]\Big )dt\Big ]\nonumber \\&+\, \mathbb {E}\Big [\int _0^Tq_t\Big ({\mathscr {L}}_{xx}(t,\sigma ,Y^{\varepsilon })+\widetilde{\mathbb {E}}[ {\mathscr {L}}_{\mu y}(t,{\widetilde{\sigma }},\widetilde{Y}^{\varepsilon })]\Big )dt\Big ]\nonumber \\&+\,\mathbb {E}\Big [\int _0^T\Big (p_t\widetilde{\mathbb {E}}\big [\delta {\widetilde{b}_\mu }(t) \widetilde{Y}^{\varepsilon }_t\big ]+q_t\widetilde{\mathbb {E}}\big [\delta \widetilde{\sigma }_\mu (t)\widetilde{Y}^{\varepsilon }_t \big ]\Big )\mathbf{1}_{E_{\varepsilon }}(t)dt\Big ]. \end{aligned}$$
(4.6)

Proof

The proof of this lemma follows directly from a simple application of Itô’s formula and some direct computation using Fubini’s theorem (see, e.g., [5]); we leave it to the interested reader.\(\square \)
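Although we do not reproduce the full computation, let us record the key step behind (4.5), since it shows where the mean-field terms cancel. Applying Itô's formula to \(p_tY^\varepsilon _t\), the stochastic integrals vanish in expectation after a standard localization argument, and the \(dt\)-terms containing \(\partial _\mu b\) and \(\partial _\mu \sigma \) cancel in expectation thanks to Fubini's theorem and the symmetry

$$\begin{aligned} \mathbb {E}\big [Y^{\varepsilon }_t\,\widetilde{\mathbb {E}}[{\widetilde{\phi }}^*_\mu (t)\,\widetilde{r}_t]\big ] =\mathbb {E}\widetilde{\mathbb {E}}\big [Y^{\varepsilon }_t\,\partial _\mu \phi ({\widetilde{X}}^*_t,\mathbb {P}_{X^*_t},{\widetilde{u}}^*_t;X^*_t)\,\widetilde{r}_t\big ] =\mathbb {E}\big [r_t\,\widetilde{\mathbb {E}}[{\widetilde{\phi }}_\mu (t)\,\widetilde{Y}^{\varepsilon }_t]\big ], \end{aligned}$$

for \((\phi , r)=(b,p)\) and \((\sigma ,q)\), which is obtained by exchanging the roles of \(\omega \) and \(\widetilde{\omega }\) on the product space; what remains is exactly the right-hand side of (4.5).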

Our main task of this section is to substantiate the variational SDEs (4.3) and (4.4), and prove the desired convergence.

To this end, we first note that the processes \(X^{\varepsilon }:= X^{u^{\varepsilon }}\) and \(X^*:= X^{u^*}\) satisfy the SDEs:

$$\begin{aligned} \left\{ \begin{array}{lll} dX^{\varepsilon }_t=b( X^{\varepsilon }_t,\mathbb {P}_{X^{\varepsilon }_t},u_t^{\varepsilon })dt+\sigma (X^{\varepsilon }_t, \mathbb {P}_{X^{\varepsilon }_t}, u_t^{\varepsilon })dW_t, \quad X^{\varepsilon }_0=x_0;\\ dX^*_t=b(X^*_t,\mathbb {P}_{X^*_t}, u^*_t)dt+\sigma (X^*_t,\mathbb {P}_{X^*_t}, u^*_t)dW_t, \quad X^*_0=x_0, \end{array}\right. \end{aligned}$$
(4.7)

respectively. We shall establish some fundamental estimates that will play crucial roles in our discussion. We note that, unless otherwise specified, for each \(p\in \mathbb {R}_+\) we will denote by \(C_p>0\) a generic positive constant depending only on p and the constants appearing in Assumptions 3.2 and 3.3, which may vary from line to line.

Proposition 4.2

Assume that Assumption 3.2 is in force. Then, for any \(k\ge 1\), and \(\varepsilon >0\), the following estimates hold:

$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|X^{\varepsilon }_t-X^*_t|^{2k}\right] \le C_k \varepsilon ^k, \end{aligned}$$
(4.8)
$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|Y^{\varepsilon }_t|^{2k}\right] \le C_k \varepsilon ^k, \end{aligned}$$
(4.9)
$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|Z^{\varepsilon }_t|^{2k}\right] \le C_k \varepsilon ^{2k}, \end{aligned}$$
(4.10)
$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|X^\varepsilon _t- (X^*_t+Y^{\varepsilon }_t )|^{2k}\right] \le C_k \varepsilon ^{2k}. \end{aligned}$$
(4.11)

Proof

The estimates (4.8)–(4.10) are standard; in particular we remark that, thanks to Assumption 3.2, \(\displaystyle \mathbb {E}[\sup _{t\in [0, T]}|X_t^*|^{2k}]\le C_k\) and \(\displaystyle \mathbb {E}[\sup _{t\in [0, T]}|\int _0^t \delta \sigma (s)\mathbf{1}_{E_{\varepsilon }}(s)dW_s|^{2k}]\le C_k \varepsilon ^k\). A boundedness assumption on \(\delta \sigma (s)\) and \(\delta b(s)\) is not needed here. We shall only check (4.11). Without loss of generality we will assume \(b=0\). Define \(K_t^{\varepsilon }:= X^{\varepsilon }_t-X^*_t-Y^{\varepsilon }_t\); then from equations (4.7) and (4.3) we get

$$\begin{aligned} dK_t^{\varepsilon }&=\big \{\sigma (X^{\varepsilon }_t,\mathbb {P}_{X^{\varepsilon }_t},u_t^{\varepsilon })-\sigma (X^*_t, \mathbb {P}_{X^*_t},u^*_t)\nonumber \\&\quad -(\sigma _x(t)Y^{\varepsilon }_t+\widetilde{\mathbb {E}}\big [{\widetilde{\sigma }}_\mu (t)\widetilde{Y}^{\varepsilon }_t\big ]+\delta \sigma (t)\mathbf{1}_{E_{\varepsilon }}(t))\big \}dW_t \nonumber \\&=\big \{\beta _t^{\varepsilon }+\sigma _x(t)K_t^{\varepsilon }+\widetilde{\mathbb {E}}\big [{\widetilde{\sigma }}_\mu (t)\widetilde{K}^{\varepsilon }_t\big ]\big \}dW_t, \end{aligned}$$
(4.12)

where, recalling the notation \(\Delta X^\varepsilon _t:=X^{\varepsilon }_t-X^*_t\) introduced in (4.2),

$$\begin{aligned} \beta _t^{\varepsilon }&:=\sigma (X^{\varepsilon }_t,\mathbb {P}_{X^{\varepsilon }_t}, u_t^{\varepsilon })-\sigma (X^*_t,\mathbb {P}_{X^*_t}, u^*_t)\\&\quad -\left( \sigma _x(t)\Delta X^{\varepsilon }_t+ \widetilde{\mathbb {E}}[{\widetilde{\sigma }}_\mu (t)\Delta \widetilde{X}^{\varepsilon }_t]+\delta \sigma (t)\mathbf{1}_{E_{\varepsilon }}(t)\right) . \end{aligned}$$

Now recall (3.4) and the definition of \(u^\varepsilon \). We see that for the given \(u\in \mathscr {U}_{ad}\) and \(\varepsilon >0\), \(\delta \sigma (t)\mathbf{1}_{E_\varepsilon }(t) \equiv \sigma (X^*_t,\mathbb {P}_{X^*_t}, u^\varepsilon _t)-\sigma (X^*_t,\mathbb {P}_{X^*_t}, u^*_t)\), \(t\in [0,T]\). Thus we can write

$$\begin{aligned} \beta _t^{\varepsilon }= & {} \sigma (X^{\varepsilon }_t,\mathbb {P}_{X^{\varepsilon }_t}, u_t^{\varepsilon })-\sigma (X^*_t,\mathbb {P}_{X^*_t}, u_t^{\varepsilon })-\left( \sigma _x(t)\Delta X^{\varepsilon }_t+\widetilde{\mathbb {E}}[{\widetilde{\sigma }}_\mu (t)\Delta \widetilde{X}^{\varepsilon }_t]\right) \\= & {} \displaystyle \int _0^1\{\sigma _x(X^*_t+\theta \Delta X^{\varepsilon }_t, \mathbb {P}_{X^*_t+\theta \Delta X^{\varepsilon }_t}, u_t^{\varepsilon })-\sigma _x(t)\}d\theta \cdot \Delta X^{\varepsilon }_t\\&+\displaystyle \int _0^1\widetilde{\mathbb {E}}[(\partial _\mu {\sigma }(X^*_t+\theta \Delta X^{\varepsilon }_t, \mathbb {P}_{X^*_t+\theta \Delta X^{\varepsilon }_t}, u_t^{\varepsilon };\widetilde{X}^*_t+\theta \Delta \widetilde{X}^{\varepsilon }_t)-{\widetilde{\sigma }}_\mu (t))\Delta \widetilde{X}^{\varepsilon }_t]d\theta . \end{aligned}$$

We recall that, by Assumption 3.2, \(\sigma _x\) and \(\sigma _\mu \) are bounded and Lipschitz continuous (uniformly in \(u\)). Hence, we obtain that

$$\begin{aligned} |\beta _t^{\varepsilon }|\le C\big \{|\Delta X^{\varepsilon }_t|^2+\mathbb {E}[|\Delta X^{\varepsilon }_t|^2]+ \mathbf{1}_{E_{\varepsilon }}(t)\big (|\Delta X^{\varepsilon }_t|+(\mathbb {E}[|\Delta X^{\varepsilon }_t|^2])^{ \frac{1}{2}}\big )\big \}. \end{aligned}$$
(4.13)

Estimate (4.11) now follows easily from the Gronwall inequality and the estimates (4.8), (4.12), and (4.13). \(\square \)

To end this section, we give an important estimate involving the solution \(Y^\varepsilon \) of the first order variational equation and the first order derivatives of the coefficients. This estimate reflects some of the main technicalities when the derivatives with respect to the measures are present. We recall from (3.4) that for a function \(\phi (x, \mu )\), \(\phi _\mu (x,\mu , y)=\partial _\mu \phi (x,\mu ;y)\), and \({\widetilde{\phi }}_\mu (t)=\partial _\mu \phi (X^*_t, \mathbb {P}_{X^*_t};\widetilde{X}^*_t)\). Thus, \(\widetilde{\mathbb {E}}[{\widetilde{\phi }}_\mu (t)]=\widetilde{\mathbb {E}}[\phi _\mu (x,\mu ;\widetilde{X}^*_t)]\big |_{x=X^*_t, \mu =\mathbb {P}_{X^*_t}}\) is a random variable on \((\Omega , \mathcal{F}, \mathbb {P})\).

Proposition 4.3

Assume that Assumption 3.2 is in force. Let \(Y^\varepsilon \), \(\varepsilon >0\), be the solution to (4.3), and let \(\partial _\mu \widetilde{b}\), \(\partial _\mu {\widetilde{\sigma }}\), and \(\partial _\mu \widetilde{h}\) be defined as in (3.5). Then, for any \(\varepsilon >0\), the following estimates hold:

$$\begin{aligned}&\int _0^T \mathbb {E}\Big [\big |\widetilde{\mathbb {E}}[\partial _\mu {\widetilde{b}}(t)\widetilde{Y}^{\varepsilon }_t]\big |^4+\big |\widetilde{\mathbb {E}}[\partial _\mu {\widetilde{\sigma }}(t)\widetilde{Y}^{\varepsilon }_t]\big |^4\Big ]dt \le C\varepsilon ^2\rho (\varepsilon ),\end{aligned}$$
(4.14)
$$\begin{aligned}&\mathbb {E}\Big [\big |{\widetilde{\mathbb {E}}}[\partial _\mu \widetilde{h}(T)\widetilde{Y}^{\varepsilon }_T]\big |^4\Big ] \le C\varepsilon ^2\rho (\varepsilon ), \end{aligned}$$
(4.15)

where \(\rho (\cdot )\) is a positive function defined on \((0,\infty )\), such that \(\rho (\varepsilon )\rightarrow 0\) as \(\varepsilon \downarrow 0\).

Proof

We first prove (4.14). Since the functions b and \(\sigma \) have the same properties, we shall prove only the estimate for the term involving \(\sigma _\mu \). The term involving \(b_\mu \) can be argued similarly.

To begin with, we recall the dynamics (4.3) of \(Y^{\varepsilon }\). Since the SDE (4.3) is almost linear, we consider the stochastic exponential

$$\begin{aligned} \eta _t:=\exp {\left\{ -\int _0^t\sigma _x(s)dW_s-\int _0^t\left( b_x(s)-\frac{1}{2}| \sigma _x(s)|^2\right) ds\right\} },\qquad t\in [0,T], \end{aligned}$$

and its inverse

$$\begin{aligned} \rho (t):=\eta _t^{-1}:=\exp {\left\{ \int _0^t\sigma _x(s)dW_s+\int _0^t\left( b_x(s) -\frac{1}{2}|\sigma _x(s)|^2\right) ds \right\} },\qquad t\in [0,T]. \end{aligned}$$

Since \(b_x, \sigma _x\) are bounded, for all \(p\ge 1\) there exists a positive constant \(C_p\), such that

$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}(|\eta _t|^p+|\rho _t|^p)\right] \le C_p. \end{aligned}$$
(4.16)

Furthermore, by applying Itô’s formula to \(\eta _tY^{\varepsilon }(t)\), we can express \(Y^\varepsilon \) explicitly as

$$\begin{aligned} Y^{\varepsilon }(t)= & {} \rho _t\int _0^t\eta _s\left\{ \widetilde{\mathbb {E}}[ \partial _\mu {\widetilde{\sigma }}(s)\widetilde{Y}^{\varepsilon }_s]+\delta \sigma (s)\mathbf{1}_{E_{\varepsilon }}(s)\right\} d W_s\nonumber \\&+\rho _t\int _0^t\eta _s \left\{ \widetilde{\mathbb {E}}[\partial _\mu {\widetilde{b}}(s)\widetilde{Y}^{\varepsilon }_s] +\delta b(s)\mathbf{1}_{E_{\varepsilon }}(s)\right\} ds\\&-\rho _t\displaystyle \int _0^t\eta _s\left\{ \sigma _x(s)\widetilde{\mathbb {E}} [\partial _\mu {\widetilde{\sigma }}(s)\widetilde{Y}^{\varepsilon }_s]+\sigma _x(s)\delta \sigma (s)\mathbf{1}_{E_{\varepsilon }}(s)\right\} ds. \nonumber \end{aligned}$$
(4.17)
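
For the reader's convenience, here is a minimal sketch of the computation behind (4.17). Writing the drift and diffusion of (4.3) as \(b_x(t)Y^{\varepsilon }_t+\beta _t\) and \(\sigma _x(t)Y^{\varepsilon }_t+\alpha _t\), respectively, with \(\beta _t:=\widetilde{\mathbb {E}}[\partial _\mu {\widetilde{b}}(t)\widetilde{Y}^{\varepsilon }_t]+\delta b(t)\mathbf{1}_{E_{\varepsilon }}(t)\) and \(\alpha _t:=\widetilde{\mathbb {E}}[\partial _\mu {\widetilde{\sigma }}(t)\widetilde{Y}^{\varepsilon }_t]+\delta \sigma (t)\mathbf{1}_{E_{\varepsilon }}(t)\) (as can be read off from (4.17) itself), Itô's product rule gives

$$\begin{aligned} d(\eta _tY^{\varepsilon }_t)&=\eta _t\big \{(b_x(t)Y^{\varepsilon }_t+\beta _t)dt+(\sigma _x(t)Y^{\varepsilon }_t+\alpha _t)dW_t\big \}\\&\quad -\eta _tY^{\varepsilon }_t\big \{\sigma _x(t)dW_t+(b_x(t)-|\sigma _x(t)|^2)dt\big \}-\eta _t\sigma _x(t)\big (\sigma _x(t)Y^{\varepsilon }_t+\alpha _t\big )dt\\&=\eta _t\alpha _t\,dW_t+\eta _t\big (\beta _t-\sigma _x(t)\alpha _t\big )dt, \end{aligned}$$

so that, since \(Y^{\varepsilon }_0=0\), we have \(Y^{\varepsilon }_t=\rho _t\int _0^t\eta _s\alpha _s\,dW_s+\rho _t\int _0^t\eta _s(\beta _s-\sigma _x(s)\alpha _s)\,ds\), which is exactly (4.17).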

To facilitate our discussion, let us now introduce an intermediate probability space \(({\widehat{\Omega }}, {\widehat{\mathcal{F}}}, {\widehat{\mathbb {P}}})\) on which we can carry out some generic analysis without specifying the space \(\Omega \) or \({\widetilde{\Omega }}\). Every process \(\eta \) defined on \(\Omega \) will have a copy \(\widehat{\eta }\) on the space \({\widehat{\Omega }}\). Using (4.17) we now write

$$\begin{aligned} \widehat{\mathbb {E}}[\partial _\mu \widehat{{\sigma }}(t)\widehat{Y}^{\varepsilon }_t]:=\widehat{\mathbb {E}}[\partial _\mu \widehat{{\sigma }}(t)({\widehat{\rho }}_t\widehat{J}_1^{\varepsilon }(t)+\widehat{J}_2^{\varepsilon }(t))] :=I^{\sigma ,\varepsilon }_1(t)+I^{\sigma ,\varepsilon }_2(t),\qquad t\in [0, T],\nonumber \\ \end{aligned}$$
(4.18)

where

$$\begin{aligned} J_1^\varepsilon (t):= & {} \int _0^t\big (\eta _s\widetilde{\mathbb {E}}[\partial _\mu {\widetilde{\sigma }}(s) \widetilde{Y}_s^\varepsilon ]+\eta _s\delta \sigma (s)\mathbf{1}_{E_\varepsilon }(s)\big )dW_s, \nonumber \\ J_2^\varepsilon (t):= & {} \rho _t\int _0^t\big (\eta _s\widetilde{\mathbb {E}}[\partial _\mu \widetilde{b}(s) \widetilde{Y}_s^\varepsilon ]+\eta _s\delta b(s)\mathbf{1}_{E_\varepsilon }(s)\big )ds\nonumber \\&-\rho _t\int _0^t\big (\eta _s\sigma _x(s)\widetilde{\mathbb {E}}[\partial _\mu \widetilde{\sigma }(s)\widetilde{Y}_s^\varepsilon ]+\eta _s\sigma _x(s)\delta \sigma (s)\mathbf{1}_{E_\varepsilon }(s)\big )ds, \nonumber \\ I_1^{\sigma ,\varepsilon }(t):= & {} \widehat{\mathbb {E}}[\partial _\mu \widehat{\sigma }(t)\widehat{\rho }_t \widehat{J}_1^\varepsilon (t)], \quad I_2^{\sigma ,\varepsilon }(t):=\widehat{\mathbb {E}}[\partial _\mu \widehat{\sigma }(t)\widehat{J}_2^\varepsilon (t)]. \end{aligned}$$
(4.19)

We shall estimate \(I^{\sigma ,\varepsilon }_1\) and \(I^{\sigma , \varepsilon }_2\) separately. First, for any \((\bar{x}, u)\in \mathbb {R}\times U\), we consider the process \(\rho _t\sigma _\mu (t):=\rho _t \sigma _\mu (\bar{x}, \mathbb {P}_{X^*_t}, u; X^*_t)\). Since \(\mathbb {F}=\mathcal{F}_0\vee \mathbb {F}^W\), applying Itô’s (Martingale) Representation Theorem, for each \(t\in [0,T]\), there exists a unique \(\widehat{\gamma }_{\cdot ,t}\in L^2_{\mathbb {F}}[0,t]\) such that

$$\begin{aligned} \widehat{\rho }_t\partial _\mu \widehat{{\sigma }}(t)=\widehat{\rho }_t\partial _\mu \sigma (X^*_t,\mathbb {P}_{X^*_t}, u^*_t; \widehat{X}^*_t)=\widehat{\mathbb {E}}[{\widehat{\rho }}_t \partial _\mu \widehat{{\sigma }}(t)]+\int _0^t\widehat{\gamma }_{s,t}d\widehat{W}_s, \qquad \mathbb {P}\text {-a.s.}\nonumber \\ \end{aligned}$$
(4.20)

We note that, for \(p>1\), by virtue of (4.16), it follows from the Burkholder-Davis-Gundy inequality and Doob's maximal inequality that

$$\begin{aligned} \widehat{\mathbb {E}}\left[ \left( \int _0^t|\widehat{\gamma }_{s,t}|^2 \,ds\right) ^{p/2}\right]\le & {} C_p \widehat{\mathbb {E}}\left[ \sup _{s\in [0,t]}\Big |\int _0^s\widehat{\gamma }_{r,t} \,d\widehat{W}_r\Big |^{p}\right] \nonumber \\\le & {} C_p\left( \frac{p}{p-1}\right) ^p \widehat{\mathbb {E}}\Big [\Big |\int _0^t\widehat{\gamma }_{s,t}\, d\widehat{W}_s\Big |^{p} \Big ] \nonumber \\\le & {} C_p\widehat{\mathbb {E}}\Big [|\widehat{\rho }_t\partial _\mu \widehat{{\sigma }}(t)-\widehat{\mathbb {E}}[{ \widehat{\rho }}_t\partial _\mu \widehat{{\sigma }}(t)]|^p\Big ]\nonumber \\\le & {} C_p\widehat{\mathbb {E}}\Big [|\widehat{\rho }_t\partial _\mu \widehat{{\sigma }}(t)|^p\Big ] \le C_p \widehat{\mathbb {E}}\left[ \sup _{t\in [0,T]}|\widehat{\rho }_t\partial _\mu \widehat{\sigma }(t)|^p\right] \le C_p. \nonumber \\ \end{aligned}$$
(4.21)

It is clear that (4.21) also holds for \(p=1\), by simply applying Hölder’s inequality. We note that the constant \(C_p >0\) is independent of t.

Now, by combining (4.19) and (4.20) we have

$$\begin{aligned} I_1^{\sigma ,\varepsilon }(t)=\widehat{\mathbb {E}}\left[ \int _0^t\widehat{\gamma }(t,s)\big (\widehat{\eta }_s \widetilde{\mathbb {E}}[\partial _\mu \sigma (\widehat{X}_s^*,\mathbb {P}_{X_s^*},\widehat{u}_s^*; \widetilde{X}_s^*) \widetilde{Y}_s^\varepsilon ] +\widehat{\eta }_s{\delta \widehat{\sigma }(s)}\mathbf{1}_{E_\varepsilon }(s)\big )ds\right] .\nonumber \\ \end{aligned}$$
(4.22)

Hence, for \(t \in [0, T]\) we have, for some generic constant \(C > 0\), which may vary from line to line,

$$\begin{aligned} |I_1^{\sigma ,{\varepsilon }}(t)|^2&\le C\widehat{\mathbb {E}}\Bigg [\int _0^t|\widehat{\gamma }(t,s)|^2ds\cdot \sup _{s\in [0,T]}| \widehat{\eta }_s|^2\\&\qquad \times \int _0^t \Big (\widetilde{\mathbb {E}}[\partial _\mu {\sigma }(\widehat{X}_s^*,\mathbb {P}_{X_s^*},\widehat{u}_s^*;\widetilde{X}_s^*)\widetilde{Y}_s^\varepsilon ]\Big )^2ds\Bigg ]\\&\quad +C\widehat{\mathbb {E}}\left[ \int _{E_\varepsilon }|\widehat{\gamma }(t,s)|^2ds\cdot \sup _{s\in [0,T]}| \widehat{\eta }_s|^2\cdot \int _0^t\mathbf{1}_{E_\varepsilon }^2(s)ds \right] \\&\le C\left( {\widehat{\mathbb {E}}\left[ \left( \int _0^t|\widehat{\gamma }(t,s)|^2ds\right) ^3\right] } \right) ^{\frac{1}{3}}\\&\qquad \times \left( \widehat{\mathbb {E}}\left[ \int _0^t \left( \widetilde{\mathbb {E}}\left[ \partial _\mu {\sigma }(\widehat{X}_s^*,\mathbb {P}_{X_s^*},\widehat{u}_s^*;\widetilde{X}_s^*)\widetilde{Y}_s^\varepsilon \right] \right) ^4ds\right] \right) ^{\frac{1}{2}}\\&\quad +C\varepsilon \left( \widehat{\mathbb {E}}\left[ \left( \int _{E_\varepsilon }|\widehat{\gamma }(t,s)|^2ds\right) ^2\right] \right) ^{ \frac{1}{2}}. \end{aligned}$$

This, together with (4.21), yields that, for \( t\in [0, T]\),

$$\begin{aligned} |I_1^{\sigma ,\varepsilon }(t)|^4&\le C\int _0^t\widehat{\mathbb {E}} \left[ \left( \widetilde{ \mathbb {E}}\left[ \partial _\mu \sigma (\widehat{X}_s^*,\mathbb {P}_{X_s^*},\widehat{u}_s^*,\widetilde{X}_s^*) \widetilde{Y}_s^\varepsilon \right] \right) ^4\right] ds\\&\quad + C\varepsilon ^2\widehat{\mathbb {E}}\left[ \left( \int _{E_\varepsilon }|\widehat{\gamma } (t,s)|^2ds\right) ^2\right] . \end{aligned}$$

Therefore, for any \(r\in [0, T]\), we have

$$\begin{aligned} \int _0^r\mathbb {E}\left[ |I_1^{\sigma ,\varepsilon }(t)|^4\right] dt\le & {} C\int _0^r\int _0^t\widehat{\mathbb {E}}\Big [\Big (\widetilde{\mathbb {E}}[\partial _\mu \sigma ( \widehat{X}_s^*,\mathbb {P}_{X_s^*},\widehat{u}_s^*,\widetilde{X}_s^*)\widetilde{Y}_s^\varepsilon ]\Big )^4 \Big ]dsdt\nonumber \\&+\,C\varepsilon ^2\mathbb {E}\widehat{\mathbb {E}}\left[ \int _0^T\Big (\int _{E_\varepsilon }|\widehat{\gamma }(t,s) |^2ds\Big )^2dt\right] , \\\le & {} C\int _0^r\int _0^t\widehat{\mathbb {E}}\Big [\Big (\widetilde{\mathbb {E}}[\partial _\mu \sigma (\widehat{X}_s^*,\mathbb {P}_{X_s^*},\widehat{u}_s^*,\widetilde{X}_s^*)\widetilde{Y}_s^\varepsilon ] \Big )^4\Big ]dsdt\nonumber \\&+\,\varepsilon ^2\rho _1(\varepsilon ),\nonumber \end{aligned}$$
(4.23)

where \(\rho _1(\varepsilon ):=\mathbb {E}\big [\widehat{\mathbb {E}}\big [\int _0^T\big ( \int _{E_\varepsilon }|\widehat{\gamma }(t,s)|^2ds\big )^2dt\big ]\big ]\). Since \(\mathbb {E}\big [\widehat{\mathbb {E}}\big [\int _0^T\big (\int _0^t|\widehat{\gamma }(t,s)|^2ds \big )^2dt\big ]\big ]<\infty \), it follows from the Dominated Convergence Theorem that \(\rho _1(\varepsilon )\rightarrow 0\), as \(\varepsilon \rightarrow 0\).

We now estimate \(I^{\sigma ,\varepsilon }_2\). First we notice that

$$\begin{aligned} \mathbb {E}[| J_2^\varepsilon (t)|]&\le C\varepsilon +C\left( \int _0^t\mathbb {E}\big [\big |\widetilde{\mathbb {E}}[\partial _\mu \widetilde{b}(s)\widetilde{Y}_s^\varepsilon ]\big |^2\big ]ds\right) ^{\frac{1}{2}}\\&\quad +\, C\left( \int _0^t\mathbb {E}\Big [\Big |\widetilde{\mathbb {E}}[\partial _\mu \widetilde{\sigma }(s)\widetilde{Y}_s^\varepsilon ]\Big |^2\Big ]ds \right) ^{\frac{1}{2}}. \end{aligned}$$

Thus,

$$\begin{aligned} |\widehat{\mathbb {E}}[(\partial _\mu \widehat{\sigma })(t)\widehat{J}_2^\varepsilon (t)]|^4\le & {} \! C\left( \widehat{\mathbb {E}}[|\widehat{J}_2^\varepsilon (t)|]\right) ^4\\\le & {} \!C\varepsilon ^4\!+\!C\int _0^t\left( \!\mathbb {E}[|\widetilde{\mathbb {E}}[(\partial _\mu \widetilde{\sigma })(s)\widetilde{Y}_s^\varepsilon ]|^4]\! +\!\mathbb {E}[|\widetilde{\mathbb {E}}[(\partial _\mu \widetilde{b})(s)\widetilde{Y}_s^\varepsilon ]|^4]\right) \!ds. \end{aligned}$$

This, combined with (4.18) and (4.23), yields that for any \(t\in [0,T]\),

$$\begin{aligned}&\int _0^t\mathbb {E}\left[ |\widetilde{\mathbb {E}}[\partial _\mu \widetilde{\sigma }(r)\widetilde{Y}_r^\varepsilon ]|^4\right] dr\nonumber \\&\quad \le C\varepsilon ^2\big (\rho _1(\varepsilon )+\varepsilon ^2\big ) + C\int _0^t\left( \int _0^r\Big (\mathbb {E}[|\widetilde{\mathbb {E}}[ \partial _\mu \widetilde{\sigma }(s)\widetilde{Y}_s^\varepsilon ]|^4]+\mathbb {E}[|\widetilde{\mathbb {E}}[ \partial _\mu \widetilde{b}(s)\widetilde{Y}_s^\varepsilon ]|^4]\Big )ds\right) dr. \nonumber \\ \end{aligned}$$
(4.24)

Analogously, a similar estimate holds for the term involving \(\partial _\mu \widetilde{b}\).

Estimate (4.14) then follows from an application of Gronwall’s inequality.
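
In slightly more detail, setting \(g^{\varepsilon }(t):=\int _0^t\big (\mathbb {E}[|\widetilde{\mathbb {E}}[\partial _\mu {\widetilde{\sigma }}(s)\widetilde{Y}^{\varepsilon }_s]|^4]+\mathbb {E}[|\widetilde{\mathbb {E}}[\partial _\mu {\widetilde{b}}(s)\widetilde{Y}^{\varepsilon }_s]|^4]\big )ds\), the estimate (4.24) and its analogue for \(\partial _\mu \widetilde{b}\) add up to

$$\begin{aligned} g^{\varepsilon }(t)\le C\varepsilon ^2\rho (\varepsilon )+C\int _0^t g^{\varepsilon }(r)\,dr,\qquad t\in [0,T], \end{aligned}$$

for a generic \(\rho (\varepsilon )\rightarrow 0\), so that Gronwall's inequality yields \(g^{\varepsilon }(T)\le C\varepsilon ^2\rho (\varepsilon )e^{CT}\), which is (4.14).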

We now prove (4.15). We shall argue that, for any \(\xi \in L^{\infty -}(\Omega \times {\widetilde{\Omega }},\mathcal{F}_T\otimes \mathbb {F}^{\widetilde{W}}):=\bigcap \nolimits _{p>1}L^{p}(\Omega \times {\widetilde{\Omega }}, \mathcal{F}_T\otimes \mathbb {F}^{\widetilde{W}})\), it holds that

$$\begin{aligned} \mathbb {E}[|\widetilde{\mathbb {E}}[\xi \widetilde{Y}_T^\varepsilon ]|^4]\le C\varepsilon ^2\rho (\varepsilon ). \end{aligned}$$
(4.25)

We again use (4.18) to write

$$\begin{aligned} \widetilde{\mathbb {E}}[\xi \widetilde{Y}_T^\varepsilon ]=\widetilde{\mathbb {E}}[\xi \widetilde{\rho }_T\widetilde{J}_1^\varepsilon (T)] +\widetilde{\mathbb {E}}[\xi \widetilde{J}_2^\varepsilon (T)]. \end{aligned}$$
(4.26)

Following the previous argument, we first apply the Martingale Representation Theorem to obtain a unique \(\widetilde{\gamma }\in L^2_{\mathcal{F}_T\otimes \mathbb {F}^{\widetilde{W}}}[0,T]\) such that

$$\begin{aligned} \xi \widetilde{\rho }_T=\widetilde{\mathbb {E}}[\xi \widetilde{\rho }_T]+\int _0^T\widetilde{\gamma }_sd\widetilde{W}_s, \end{aligned}$$
(4.27)

with \(\widetilde{\mathbb {E}}\big [\big (\int _0^T\Vert \widetilde{\gamma }_s\Vert ^2ds\big )^p\big ]\le C_p\), whenever \(\widetilde{\mathbb {E}}[|\xi |^{2p}]\le C_p\), \(\mathbb {P}\)-a.s., for all p.

Then, by definition (4.19) we have

$$\begin{aligned} |\widetilde{\mathbb {E}}[\xi \widetilde{\rho }_T\widetilde{J}_1^\varepsilon (T)]|^2= & {} \Bigg |\widehat{\mathbb {E}} \Bigg [\int _0^T\widehat{\gamma }_s\Big (\widehat{\eta }_s{\widetilde{\mathbb {E}}\Big [\partial _\mu {\sigma }(\widehat{X}_s^*,P_{X_s^*},\widehat{u}_s^*;\widetilde{X}_s^*)\widetilde{Y}_s^\varepsilon \Big ]}\nonumber \\&+ \widehat{\eta }_s\widehat{\delta \sigma }(s)\mathbf{1}_{E_\varepsilon }(s)\Big )ds\Bigg ]\Bigg |^2 \nonumber \\\le & {} \widehat{\mathbb {E}}\Bigg [\Vert \widehat{\gamma }\Vert _{L^2([0,T])}^2 \Vert \widehat{\eta }\Vert _{C[0,T]}^2 \int _0^T\Big |\widetilde{\mathbb {E}}\Big [\partial _\mu {\sigma }(\widehat{X}_s^*,\mathbb {P}_{X_s^*}, \widehat{u}_s^*;\widetilde{X}_s^*)\widetilde{Y}_s^\varepsilon \Big ]\Big |^2ds\Bigg ]\nonumber \\&+C\varepsilon \widehat{\mathbb {E}}\left[ \left( \int _{E_\varepsilon }|\widehat{\gamma }_s|^2ds\right) \Vert \widehat{\eta }\Vert _{C[0,T]}^2\right] \\\le & {} C\varepsilon \left( \widehat{\mathbb {E}}\left[ \left( \int _{E_\varepsilon }|\widehat{\gamma }_s|^2ds\right) ^2\right] \right) ^{\frac{1}{2}}\nonumber \\&+C\widehat{\mathbb {E}}\left( \int _0^T\Big |\widetilde{\mathbb {E}}\left[ \partial _\mu { \sigma }(\widehat{X}_s^*,\mathbb {P}_{X_s^*},\widehat{u}_s^*;\widetilde{X}_s^*)\widetilde{Y}_s^\varepsilon \right] \Big |^4ds \right) ^{\frac{1}{2}}. \nonumber \end{aligned}$$
(4.28)

It then follows from (4.14) that

$$\begin{aligned} \mathbb {E}\Big [\big |\widetilde{\mathbb {E}}[\xi \widetilde{\rho }_T\widetilde{J}_1^\varepsilon (T)]\big |^4\Big ]\le & {} C\varepsilon ^2\mathbb {E}\left[ \widehat{\mathbb {E}}\left[ \left( \int _{E_\varepsilon }|\widehat{\gamma }_s|^2ds\right) ^2\right] \right] \\&+C\int _0^T\mathbb {E}\left[ \Big |\widetilde{\mathbb {E}}\Big [\partial _\mu {\widetilde{\sigma }}(s)\widetilde{Y}_s^\varepsilon \Big ] \Big |^4\right] ds\\\le & {} C\varepsilon ^2\rho _2(\varepsilon ). \end{aligned}$$

Similarly, one can argue that \(\mathbb {E}[|\widetilde{\mathbb {E}}[\xi \widetilde{J}_2^\varepsilon (T)]|^4]\le C\varepsilon ^2\rho _2(\varepsilon )\) as well, where \(\rho _2(\varepsilon )\rightarrow 0\) as \(\varepsilon \downarrow 0\), thanks to (4.14) and the same Dominated Convergence argument as before. This proves (4.15). \(\square \)

5 Main Estimates

In this section we establish the main estimate of this paper: a second order “Taylor expansion” in the following sense:

$$\begin{aligned} X^\varepsilon _t=X^*_t +Y^\varepsilon _t + Z^\varepsilon _t+o(\varepsilon ), \qquad t\in [0,T], \end{aligned}$$
(5.1)

where the convergence is in the \(L^p(\mathcal{F}; C[0,T])\) sense for \(p\in [2, p_0)\), \(p_0>2\). In other words, the first and second order variational processes \((Y^\varepsilon , Z^\varepsilon )\) can be considered as the first and second order approximations of \(X^\varepsilon -X^*\) corresponding to the perturbed control \(u^\varepsilon \).
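
To put the orders of magnitude in perspective: in the standard spike-variation hierarchy (cf. the estimates (4.9) and (4.10) invoked repeatedly below), one has

$$\begin{aligned} \mathbb {E}\Big [\sup _{t\in [0,T]}|Y^{\varepsilon }_t|^{2k}\Big ]\le C_k\varepsilon ^{k},\qquad \mathbb {E}\Big [\sup _{t\in [0,T]}|Z^{\varepsilon }_t|^{2k}\Big ]\le C_k\varepsilon ^{2k},\qquad k\ge 1, \end{aligned}$$

that is, \(Y^\varepsilon \) is of order \(\varepsilon ^{1/2}\) and \(Z^\varepsilon \) of order \(\varepsilon \) in every \(L^{2k}\)-norm; estimate (5.2) below thus states precisely that the remainder in (5.1) is of order \(o(\varepsilon )\).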

Our main result of this section is the following proposition.

Proposition 5.1

Assume that Assumptions 3.2 and 3.3 are in force. Then, for any \(1\le k\le \frac{3}{2}\),

$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|X^{\varepsilon }(t)-(X^*(t)+Y^{\varepsilon }(t)+Z^{\varepsilon }(t))|^{2k} \right] \le \varepsilon ^{2k}\rho _k(\varepsilon ), \end{aligned}$$
(5.2)

where \(\rho _k: (0,\infty )\rightarrow (0,\infty )\) is such that \(\rho _k(\varepsilon )\downarrow 0\) as \(\varepsilon \downarrow 0\).

Before we prove the proposition, let us make some simple observations. For notational convenience let us denote

$$\begin{aligned} \eta _t^{\varepsilon }:=X^{\varepsilon }_t-(X^*_t+Y^{\varepsilon }_t+Z^{\varepsilon }_t), \qquad t\in [0,T]. \end{aligned}$$
(5.3)

Then, by the estimates (4.10) and (4.11), it is readily seen that the following estimate holds:

$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|\eta ^{\varepsilon }_t|^{2k}\right] \le C_k\varepsilon ^{2k},\qquad k\ge 1. \end{aligned}$$
(5.4)

Comparing this to our desired estimate (5.2), we see that our main task is to sharpen the estimate by replacing the constant \(C_k\) by a function \(\rho _k(\varepsilon )\) that satisfies \(\lim _{\varepsilon \mathop {\downarrow }0}\rho _k(\varepsilon )=0\). We shall argue that this can be done for \(1\le k\le \frac{3}{2}\).

Our next observation is that the process \(\eta ^\varepsilon \) has the following dynamics:

$$\begin{aligned} d\eta _t^{\varepsilon }=\alpha _t^{\varepsilon }(b)dt+\alpha _t^{\varepsilon }(\sigma )dW_t, \end{aligned}$$
(5.5)

where, for \(\phi =b, \sigma , f\), respectively,

$$\begin{aligned} \alpha _t^{\varepsilon }(\phi ):= & {} \phi (X^{\varepsilon }_t, \mathbb {P}_{X^{\varepsilon }_t}, u_t^{\varepsilon })-\{\phi (X^*_t,\mathbb {P}_{X^*_t},u^*_t)+\phi _x(t)(Y^{\varepsilon }_t +Z^{\varepsilon }_t)\} \nonumber \\&-\big \{\widetilde{\mathbb {E}}[\widetilde{\phi }_\mu (t)(\widetilde{Y}^{\varepsilon }_t+\widetilde{Z}^{\varepsilon }_t)] +{\mathscr {L}}_{xx}(t, \phi ,Y^{\varepsilon }_t)+{\widetilde{\mathbb {E}}}[\mathscr {L}_{\mu y}(t, {\widetilde{\phi }}, \widetilde{Y}^\varepsilon _t)]\big \}\\&-\big \{\big ({\widetilde{\mathbb {E}}}[\delta {\widetilde{\phi }}_\mu (t)\widetilde{Y}^\varepsilon _t]+\delta \phi (t)+\delta \phi _x(t)Y^{\varepsilon }_t\big )\mathbf{1}_{E_{\varepsilon }}(t)\big \}. \nonumber \end{aligned}$$
(5.6)

A key element in the proof of Proposition 5.1 is the following estimate of \(\alpha _t^{\varepsilon }(\phi )\).

Lemma 5.2

Assume that Assumptions 3.2 and 3.3 are in force. Then there exists a constant \(C>0\), such that for any \(\varepsilon >0\) and any \(t\in [0,T]\), the following estimate holds:

$$\begin{aligned} \mathbb {E}\Big [\int _0^t|\alpha _s^{\varepsilon }(\phi )|^3ds\Big ]\le C\varepsilon ^{3}\rho (\varepsilon )+C\int _0^t \mathbb {E}[|\eta _s^{\varepsilon }|^{3}]ds,\qquad t\in [0, T], \end{aligned}$$
(5.7)

where \(\phi =b, \sigma , f\), respectively, and \(\rho (\cdot )\) is a positive function satisfying \(\lim _{\varepsilon \mathop {\downarrow }0}\rho (\varepsilon )=0\).

Proof

As before, let us denote \(\Delta X^\varepsilon =X^\varepsilon -X^*\). Then, for each \(t\in [0,T]\), we can write, for \(\phi =b, \sigma , f\),

$$\begin{aligned} \phi (X^{\varepsilon }_t, P_{X^{\varepsilon }_t}, u_t^{\varepsilon })-\phi (X^*_t, \mathbb {P}_{X^*_t},u_t^{\varepsilon })=\int _0^1 \{\phi _x^{\theta }(t)\Delta X^{\varepsilon }_t +\widetilde{\mathbb {E}}[\widetilde{\phi }_\mu ^\theta (t)\Delta \widetilde{X}^{\varepsilon }_t]\}d\theta , \end{aligned}$$
(5.8)

where, for \(\lambda \in [0, 1]\),

$$\begin{aligned} \left\{ \begin{array}{lll} \phi _x^{\lambda }(t):=\partial _x\phi (X^*_t+\lambda \Delta X^{\varepsilon }_t,\mathbb {P}_{X^*_t+\lambda \Delta X^{\varepsilon }_t},u_t^{\varepsilon });\\ \widetilde{\phi }_\mu ^\lambda (t):=\partial _\mu \phi (X^*_t+\lambda \Delta X^{\varepsilon }_t,\mathbb {P}_{X^*_t+\lambda \Delta X^{\varepsilon }_t}, u_t^{\varepsilon }; \widetilde{X}^*_t+\lambda \Delta \widetilde{X}^{\varepsilon }_t). \end{array}\right. \end{aligned}$$
(5.9)

Recalling from (3.4) that \(\phi _x(t)=\phi _x(X^*_t,\mathbb {P}_{X^*_t}, u_t^{*})\) and \(\widetilde{\phi }_\mu (t)=\partial _\mu \phi (X^*_t, \mathbb {P}_{X^*_t}, u_t^{*}; \widetilde{X}^*_t)\), we derive from (5.8) and (5.9) that, for all \(\theta \in [0, 1]\),

$$\begin{aligned} \begin{array}{lll} &{}&{}\displaystyle \phi (X^{\varepsilon }_t,\mathbb {P}_{X^{\varepsilon }_t},u_t^{\varepsilon })-\{\phi (X^*_t, \mathbb {P}_{X^*_t},u_t^{\varepsilon })+\phi _x(t)(Y^{\varepsilon }_t+Z^{\varepsilon }_t)+\widetilde{\mathbb {E}}[\widetilde{\phi }_\mu (t)(\widetilde{Y}^{\varepsilon }_t+\widetilde{Z}^{\varepsilon }_t)]\}\\ &{}&{}\quad =\displaystyle \int _0^1 \{\phi _x^{\theta }(t)\eta _t^{\varepsilon }+\widetilde{\mathbb {E}} [\widetilde{\phi }_\mu ^\theta (t)\widetilde{\eta }^{\varepsilon }_t]+\Delta \phi _x^{\theta }(t) (Y^{\varepsilon }_t+Z^{\varepsilon }_t)+\widetilde{\mathbb {E}}[\Delta {\widetilde{\phi }}_\mu ^{\theta }(t) (\widetilde{Y}^{\varepsilon }_t+\widetilde{Z}^{\varepsilon }_t)]\}d\theta , \end{array} \end{aligned}$$
(5.10)

where \(\Delta \phi _x^{\theta }(t):=\phi _x^{\theta }(t)-\phi _x(t)\) and \(\Delta \widetilde{\phi }_\mu ^{\theta }(t):=\widetilde{\phi }_\mu ^{\theta }(t)-\widetilde{\phi }_\mu (t)\). Similarly, we can further write

$$\begin{aligned} \Delta \phi _x^{\theta }(t)= & {} \phi _x^{\theta }(t)-\phi _x(t)=\theta \int _0^1 \{ \phi _{xx}^{\theta \gamma }(t)\Delta X^{\varepsilon }_t+\widetilde{\mathbb {E}}[\widetilde{\phi }_{x\mu }^{ \theta \gamma }(t)\Delta \widetilde{X}^{\varepsilon }_t]\}d\gamma +\delta \phi _x(t)\mathbf{1}_{E_{\varepsilon }}(t)\nonumber \\= & {} \theta \int _0^1\{\phi _{xx}^{\theta \gamma }(t)\eta _t^{\varepsilon }+\widetilde{\mathbb {E}} [\widetilde{\phi }_{x\mu }^{\theta \gamma }(t)\widetilde{\eta }_t^{\varepsilon }]\}d\gamma +\theta \int _0^1\phi _{xx}^{\theta \gamma }(t)(Y^{\varepsilon }_t+Z^{\varepsilon }_t)d\gamma \\&+\theta \int _0^1\widetilde{\mathbb {E}}[\widetilde{\phi }_{x\mu }^{\theta \gamma }(t)(\widetilde{Y}^{\varepsilon }_t +\widetilde{Z}^{\varepsilon }_t)]d\gamma +\delta \phi _x(t)\mathbf{1}_{E_{\varepsilon }}(t), \nonumber \end{aligned}$$
(5.11)

where \(\phi ^{\theta \gamma }_{xx}\) and \({\widetilde{\phi }}^{\theta \gamma }_{\mu x}\) are the second order derivative processes defined by (3.6). The expression of \(\Delta {\widetilde{\phi }}_\mu ^{\theta }(t)\), however, needs a little more attention, as it involves the second derivative \(\phi _{\mu \mu }\), which requires a third probability space, again denoted by \(\widehat{\Omega }\). Next, we define

$$\begin{aligned} \widehat{\widetilde{\phi }_{\mu \mu }^{\theta \gamma }}(t):= & {} \partial _\mu ^2\phi (\widehat{X}^*_t+ \theta \gamma \Delta \widehat{X}^{\varepsilon }_t,\mathbb {P}_{X^*_t+\theta \gamma \Delta {X}^{\varepsilon }_t}, \widehat{u}_t^{\varepsilon }; \widetilde{X}^*_t+\theta \gamma \Delta \widetilde{X}^{\varepsilon }_t);\\ \delta \widetilde{\phi }_\mu (t):= & {} \partial _\mu \phi (X^*_t,\mathbb {P}_{X^*_t},u_t^{\varepsilon }; \widetilde{X}^*_t)-\partial _\mu \phi (X^*_t,\mathbb {P}_{X^*_t}, u_t^{*}; \widetilde{X}^*_t). \nonumber \end{aligned}$$
(5.12)

Then, recalling the definition of \(\eta ^\varepsilon \) we have

$$\begin{aligned} \Delta {\widetilde{\phi }}_\mu ^{\theta }(t)&= \widetilde{\phi }_\mu ^{\theta }(t)-\widetilde{\phi }_\mu (t)\nonumber \\&=\theta \int _0^1 \Bigg \{ \widetilde{\phi }_{\mu {x}}^{\theta \gamma }(t) \Delta {X}^{\varepsilon }_t+\widehat{\mathbb {E}}\left[ \widehat{\widetilde{\phi }_{\mu \mu }^{\theta \gamma }}(t)\Delta \widehat{X}^{\varepsilon }_t\right] +\widetilde{\phi }_{\mu y}^{\theta \gamma }(t)\Delta \widetilde{X}^{\varepsilon }_t\Bigg \}d\gamma +\delta \widetilde{\phi }_\mu (t)\mathbf{1}_{E_{\varepsilon }}(t)\nonumber \\&=\theta \int _0^1\Bigg \{\widetilde{\phi }_{\mu {x}}^{\theta \gamma }(t)(Y^{\varepsilon }_t +Z^{\varepsilon }_t)+\widehat{\mathbb {E}}\left[ \widehat{\widetilde{\phi }_{\mu \mu }^{\theta \gamma }}(t)(\widehat{Y}^{\varepsilon }_t+ \widehat{Z}^{\varepsilon }_t)\right] \Bigg \}d\gamma \nonumber \\&\quad +\theta \int _0^1\left\{ \widetilde{\phi }_{\mu {x}}^{\theta \gamma }(t)\eta _t^{\varepsilon }+\widehat{E}\left[ \widehat{\widetilde{\phi }_{\mu \mu }^{\theta \gamma }}(t)\widehat{\eta }_t^{\varepsilon }\right] \right\} d\gamma \nonumber \\&\quad +\theta \int _0^1\Big \{\widetilde{\phi }_{\mu {y}}^{\theta \gamma } (t)(\widetilde{Y}^{\varepsilon }_t+\widetilde{Z}^{\varepsilon }_t)+ \widetilde{\phi }_{\mu {y}}^{\theta \gamma }(t) \widetilde{\eta }_t^{\varepsilon }\Big \}d\gamma +\delta \widetilde{\phi }_\mu (t)\mathbf{1}_{E_{\varepsilon }}(t). \end{aligned}$$
(5.13)

Now by definition (5.6), along with the expansions (5.8), (5.10), (5.11), and (5.13), we have the following decomposition of \(\alpha ^\varepsilon (\phi )\):

$$\begin{aligned} \alpha _t^{\varepsilon }(\phi )=\int _0^1\Big \{\phi ^\theta _x(t)\eta _t^{\varepsilon }+\widetilde{\mathbb {E}}[ \widetilde{\phi }^\theta _\mu (t)\widetilde{\eta }_t^{\varepsilon }]\Big \}d\theta +x_t^{1,\varepsilon }(\phi )+ x_t^{2,\varepsilon }(\phi )+x_t^{3,\varepsilon }(\phi )+x_t^{4,\varepsilon }(\phi ),\nonumber \\ \end{aligned}$$
(5.14)

where

$$\begin{aligned} x_t^{1,\varepsilon }(\phi )&=\int _0^1\int _0^1\theta \Bigg \{\phi _{xx}^{\theta \gamma } (t)\eta _t^{\varepsilon }(Y^{\varepsilon }_t+Z^{\varepsilon }_t)+\widetilde{\mathbb {E}}\left[ \widetilde{\phi }_{x\mu }^{\theta \gamma } (t)\widetilde{\eta }_t^{\varepsilon }\right] (Y^{\varepsilon }_t+Z^{\varepsilon }_t)\\&\quad +\widetilde{\mathbb {E}}\left[ \widetilde{\phi }_{\mu x} ^{\theta \gamma }(t)(\widetilde{Y}^{\varepsilon }_t+\widetilde{Z}^{\varepsilon }_t)\right] \eta _t^{\varepsilon } +\widehat{\mathbb {E}}\left[ \widetilde{\mathbb {E}}\left[ \widehat{\widetilde{\phi }_{\mu \mu }^{\theta \gamma }}(t)\widetilde{\eta }_t^{\varepsilon }\ (\widehat{Y}^{\varepsilon }_t+\widehat{Z}^{\varepsilon }_t)\right] \right] \\&\quad +\widetilde{\mathbb {E}}\left[ \widetilde{\phi }_{\mu {y}}^{\theta \gamma } (t)(\widetilde{Y}^{\varepsilon }_t+\widetilde{Z}^{\varepsilon }_t)\widetilde{\eta }_t^{\varepsilon }\right] \Bigg \}d\gamma d\theta ;\\ \end{aligned}$$
$$\begin{aligned} x_t^{2,\varepsilon }(\phi )&= \int _0^1\int _0^1\Big \{\phi _{xx}^{\theta \gamma }(t)\left( (Y^{\varepsilon }_t+ Z^{\varepsilon }_t)^2-(Y^{\varepsilon }_t)^2\right) \\&\quad +2\widetilde{\mathbb {E}}\left[ \widetilde{\phi }_{x\mu }^{\theta \gamma }(t) (\widetilde{Y}^{\varepsilon }_t +\widetilde{Z}^{\varepsilon }_t)\right] (Y^{\varepsilon }_t+Z^{\varepsilon }_t) \\&\quad +\widehat{\mathbb {E}}\left[ \widetilde{\mathbb {E}}\left[ \widehat{\widetilde{\phi }_{\mu \mu }^{\theta \gamma }}(t)(\widetilde{Y}^{\varepsilon }_t+ \widetilde{Z}^{\varepsilon }_t)(\widehat{Y}^{\varepsilon }_t+\widehat{Z}^{\varepsilon }_t)\right] \right] +\widetilde{\mathbb {E}}\left[ \widetilde{\phi }_{\mu y}^{\theta \gamma }(t)(\widetilde{Y}^{\varepsilon }_t+\widetilde{Z}^{\varepsilon }_t)^2\right] \Big \}d\gamma d\theta ; \\ x_t^{3,\varepsilon }(\phi )&=\int _0^1\int _0^1\theta \left( \phi _{xx}^{\theta \gamma }(t) -\phi _{xx}(t)\right) (Y^{\varepsilon }_t)^2d\gamma \,d\theta ;\\ x_t^{4,\varepsilon }(\phi )&=\left\{ \delta \phi _{x}(t)Z^{\varepsilon }_t+\widetilde{\mathbb {E}}[\delta \widetilde{\phi }_\mu (t) \widetilde{Z}^{\varepsilon }_t]\right\} \mathbf{1}_{E_{\varepsilon }}(t). \end{aligned}$$

We now estimate \(x^{i,\varepsilon }(\phi )\), \(i=1,\cdots ,4\), one by one. First, using estimate (5.4), as well as the estimates (4.9) and (4.10), one can easily derive

$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|x_t^{1,\varepsilon }(\phi )|^{2k}\right] \le C_k\varepsilon ^{3k},\qquad k\ge 1. \end{aligned}$$
(5.15)

Next, using (4.9) and (4.10) again, we also have

$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|(Y^{\varepsilon }_t+Z^{\varepsilon }_t)^2-(Y^{\varepsilon }_t)^2|^{2k}\right] \le C_k\varepsilon ^{3k},\quad k\ge 1. \end{aligned}$$
(5.16)

Note that by the definitions of \(\widetilde{\phi }_{x \mu }^{\theta \gamma }(t)\) and \(\widetilde{\phi }_{x \mu }(t)\) we have (recall \(\Delta X^\varepsilon =X_t^{\varepsilon }-X_t^{*}\))

$$\begin{aligned} |\widetilde{\phi }_{x \mu }^{\theta \gamma }(t)-\widetilde{\phi }_{x \mu }(t)|\le & {} C\Big [|\Delta X^\varepsilon |+|\Delta \widetilde{X}^\varepsilon |+\big (\mathbb {E}[|\Delta X^\varepsilon |^2]\big )^{\frac{1}{2}}+ \mathbf{1}_{E_{\varepsilon }}(t)\Big ]\nonumber \\\le & {} C[|\Delta X^\varepsilon |+|\Delta \widetilde{X}_t^{\varepsilon }|+\varepsilon ^{\frac{1}{2}}+\mathbf{1}_{E_{\varepsilon }}(t)]. \end{aligned}$$
(5.17)

It then follows that

$$\begin{aligned} \mathbb {E}\Big [|\widetilde{\mathbb {E}}[\widetilde{\phi }_{x\mu }^{\theta \gamma }(t)\widetilde{Y}^{\varepsilon }_t]-\widetilde{ \mathbb {E}}[\widetilde{\phi }_{x \mu }(t)\widetilde{Y}^{\varepsilon }_t]|^4\Big ]&\le \mathbb {E}\Big [\widetilde{\mathbb {E}}\big [| \widetilde{\phi }_{x \mu }^{\theta {\gamma }}(t)-\widetilde{\phi }_{x \mu }(t)|^2\big ]^2 \widetilde{\mathbb {E}}\big [|\widetilde{Y}^{\varepsilon }_t|^2\big ]^2\Big ]\nonumber \\&\le C{\varepsilon }^2\mathbb {E}\big [\widetilde{\mathbb {E}}[|\widetilde{\phi }_{x\mu }^{\theta {\gamma }}(t)-\widetilde{\phi }_{x \mu }(t)|^4]\big ]\\&\le C{\varepsilon }^2 \mathbb {E}\Big [\widetilde{\mathbb {E}}\Big (|\Delta X^\varepsilon |+|\Delta \widetilde{X}_t^{\varepsilon }|+ \varepsilon ^{\frac{1}{2}}+\mathbf{1}_{E_{\varepsilon }}(t)\Big )^4\Big ]\nonumber \\&\le C{\varepsilon }^2 \big (\varepsilon ^{2}+\mathbf{1}_{E_{\varepsilon }}(t)\big ).\nonumber \end{aligned}$$
(5.18)

Consequently, applying Proposition 4.3, especially estimate (4.25) (which holds for any \(\xi \in L^{\infty }(\mathcal{F}_T^{W, \widetilde{W}}; \mathbb {R})\)), with \(T\) replaced by \(t\) and \(\xi =\widetilde{\phi }_{x\mu }(t)\), we have

$$\begin{aligned}&\mathbb {E}\left[ \int _0^T\left( \widetilde{\mathbb {E}}[\widetilde{\phi }_{x\mu }^{\theta \gamma }(t) \widetilde{Y}^{\varepsilon }_t]|{Y}^{\varepsilon }_t|\right) ^3dt\right] \nonumber \\&\quad \le C\left( \int _0^T\mathbb {E}\left[ |\widetilde{\mathbb {E}}[\widetilde{\phi }_{x\mu }(t)\widetilde{Y}^{\varepsilon }_t]|^4\right] dt\right) ^{\frac{3}{4}}\left( E\left[ \sup _{t\in [0,T]}|{Y}^{\varepsilon }_t|^{12} \right] \right) ^{\frac{1}{4}}\\&\qquad +C\left( \int _0^T\mathbb {E}\big [|\widetilde{\mathbb {E}}[\widetilde{\phi }_{x\mu }^{\theta \gamma }(t) \widetilde{Y}^{\varepsilon }_t]-\widetilde{\mathbb {E}}[\widetilde{\phi }_{x\mu }(t)\widetilde{Y}^{\varepsilon }_t]|^4\big ]dt \right) ^{\frac{3}{4}}\left( \mathbb {E}\left[ \sup _{t\in [0,T]}|{Y}^{\varepsilon }_t|^{12}\right] \right) ^{\frac{1}{4}}\nonumber \\&\quad \le C {\varepsilon }^3 \rho (\varepsilon )+ C {\varepsilon }^3\left( \int _0^T\mathbf{1}_{E_{\varepsilon }}(t)dt \right) ^{\frac{3}{4}}\le C\varepsilon ^3\rho (\varepsilon ),\nonumber \end{aligned}$$
(5.19)

where \(\rho (\varepsilon )\rightarrow 0\) as \(\varepsilon \downarrow 0\). The same argument allows us to show that

$$\begin{aligned} \mathbb {E}\left[ \int _0^T\big (\big |\widetilde{\mathbb {E}}[\widetilde{\phi }_{x\mu }^{\theta {\gamma }}(t) (\widetilde{Y}^{\varepsilon }_t+\widetilde{Z}^{\varepsilon }_t)]\big ||\widetilde{Y}^{\varepsilon }_t+\widetilde{Z}^{\varepsilon }_t|\big )^3dt \right] \le C\varepsilon ^3\rho (\varepsilon ). \end{aligned}$$
(5.20)

Similarly to (5.19), we can show that

$$\begin{aligned} \left\{ \begin{array}{lll} \displaystyle \mathbb {E}\Big [\int _0^T\big |\widehat{\mathbb {E}}\big [\widetilde{\mathbb {E}}[\widehat{\widetilde{\phi }_{\mu {\mu }}^{\theta {\gamma }}} (t)(\widetilde{Y}^{\varepsilon }_t+\widetilde{Z}^{\varepsilon }_t)](\widehat{Y}^{\varepsilon }_t+\widehat{Z}^{\varepsilon }_t)\big ]\big |^3dt\Big ]\le C\varepsilon ^3\rho (\varepsilon );\\ \displaystyle \mathbb {E}\Big [\int _0^T\Big |{\widetilde{\mathbb {E}}}[{\widetilde{\phi }}^{\theta ,\gamma }_{\mu y}(t)(\widetilde{Y}^\varepsilon _t+\widetilde{Z}^\varepsilon _t)^2]\Big |dt\Big ]\le C\varepsilon ^3\rho (\varepsilon ). \end{array}\right. \end{aligned}$$
(5.21)

Combining (5.16), (5.19), (5.20), and (5.21) we get

$$\begin{aligned} \mathbb {E}\Big [\int _0^T|x_t^{2,\varepsilon }(\phi )|^{3}dt\Big ]\le \varepsilon ^{3}\rho (\varepsilon ). \end{aligned}$$
(5.22)

Furthermore, applying Hölder’s inequality, we get

$$\begin{aligned} \mathbb {E}\left[ \int _0^T|x_t^{3,\varepsilon }(\phi )|^3\,dt\right]\le & {} C\mathbb {E}\left[ \sup _{t\in [0,T]}|Y^{\varepsilon }_t|^{6} \int _0^T \int _0^1 \int _0^1|\phi _{xx}^{\theta \gamma }(t)-\phi _{xx}(t)|^{3} \,d\gamma \,d\theta \,dt\right] \nonumber \\\le & {} C\left( \mathbb {E}\left[ \sup _{t\in [0,T]}|Y^{\varepsilon }_t|^{12}\right] \right) ^{\frac{1}{2}} \left( \mathbb {E}\left[ \sup _{t\in [0,T]}|X^{\varepsilon }_t-X^{*}_t|^{6}\right] ^{\frac{1}{2}}+|E_\varepsilon |\right) \nonumber \\\le & {} C\varepsilon ^3\left( \varepsilon ^{\frac{3}{2}}+\varepsilon \right) \le C \varepsilon ^3 \rho (\varepsilon ). \end{aligned}$$
(5.23)

Similarly, one shows that

$$\begin{aligned} \mathbb {E}\left[ \int _0^T|x_t^{4,\varepsilon }(\phi )|^3\,dt\right] \le C \mathbb {E}\left[ \sup _{t\in [0,T]} |Z^{\varepsilon }_t|^{3}\right] |E_\varepsilon |\le C \varepsilon ^4. \end{aligned}$$
(5.24)

Finally, in light of (5.14), we see that (5.15), (5.22), (5.23) and (5.24) imply (5.7), proving the lemma. \(\square \)

Proof of Proposition 5.1

Recalling the dynamics (5.5) of \(\eta ^{\varepsilon }\) and using (5.7) for \(\phi =b\) and \(\phi =\sigma \), one can follow a standard argument via the Burkholder-Davis-Gundy and Gronwall inequalities to obtain the following estimate (see the sketch following the proof):

$$\begin{aligned} \mathbb {E}\left[ \sup _{t\in [0,T]}|\eta _t^{\varepsilon }|^{3}\right] \le \varepsilon ^{3}\rho (\varepsilon ), \end{aligned}$$
(5.25)

which gives (5.2) for \(k=\frac{3}{2}\); the case \(1\le k<\frac{3}{2}\) then follows from Jensen's inequality, since \(\mathbb {E}[\sup _{t\in [0,T]}|\eta ^{\varepsilon }_t|^{2k}]\le \big (\mathbb {E}[\sup _{t\in [0,T]}|\eta ^{\varepsilon }_t|^{3}]\big )^{\frac{2k}{3}}\). \(\square \)
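
In a bit more detail, the standard argument alluded to in the proof runs as follows: since \(\eta ^{\varepsilon }_0=0\), the dynamics (5.5), together with the Burkholder-Davis-Gundy and Hölder inequalities, give

$$\begin{aligned} \mathbb {E}\Big [\sup _{s\in [0,t]}|\eta ^{\varepsilon }_s|^{3}\Big ]\le C\,\mathbb {E}\Big [\int _0^t\big (|\alpha _s^{\varepsilon }(b)|^{3}+|\alpha _s^{\varepsilon }(\sigma )|^{3}\big )ds\Big ]\le C\varepsilon ^{3}\rho (\varepsilon )+C\int _0^t\mathbb {E}\Big [\sup _{r\in [0,s]}|\eta ^{\varepsilon }_r|^{3}\Big ]ds,\qquad t\in [0,T], \end{aligned}$$

where the second inequality is (5.7); Gronwall's inequality applied to \(t\mapsto \mathbb {E}[\sup _{s\in [0,t]}|\eta ^{\varepsilon }_s|^{3}]\) then yields (5.25).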

A direct consequence of Lemma 5.2 and Proposition 5.1 is the following corollary.

Corollary 5.3

Assume the same assumptions as in Lemma 5.2. Then the following estimate holds:

$$\begin{aligned} \mathbb {E}\big [|\alpha _T^{\varepsilon }(h)|^3\big ]\le \varepsilon ^3\rho (\varepsilon ), \quad \varepsilon >0, \end{aligned}$$
(5.26)

where \(\rho (\varepsilon )\rightarrow 0\) as \(\varepsilon \downarrow 0\). Recall that \(\alpha _T^{\varepsilon }(h)\) is defined by (5.14) for \(\phi =h\).

6 Proof of Theorem 3.5

We are now ready to prove Theorem 3.5. It is worth emphasizing that, while our analysis more or less follows the well-understood scheme initiated in [18], there are some hidden “road-blocks” in the argument in the general mean-field case, due to the presence of the second derivatives with respect to measures, especially the fact that the second order “Fréchet” derivative with respect to \(L^2\)-random variables may fail to exist. It turns out, however, that this difficulty can be naturally resolved by the special structure of the first and second order variational processes \(Y^\varepsilon \) and \(Z^\varepsilon \): with the help of the estimates established in Proposition 4.3 and Corollary 5.3, we can argue that the term that potentially involves the second order derivative with respect to the measure [\(\partial ^2_{\mu \mu }\), see (2.8)] is actually of higher order \(o(\varepsilon )\). As a consequence, only the mixed second order derivative \(\partial _{\mu y}\) will be effectively in use. Such a phenomenon was already displayed in an earlier work [7], regarding the relationship between the mean-field SDE (of type (a)) and the associated PDE, and it again turns out to be essential in our analysis.

Let \((u^*, X^*)\) be an optimal pair of control and state processes. For any \(\varepsilon >0\), we consider the spike variation \(u^\varepsilon \) of \(u^*\) defined by (4.1). Then, combining the usual Taylor expansion with the Taylor expansion with respect to measures (2.9), we have

$$\begin{aligned} 0 \le J(u^{\varepsilon })-J(u^*)= & {} \mathbb {E}\Big [ \int _0^T\big (f(t,X^{\varepsilon }_t,\mathbb {P}_{X^{\varepsilon }_t},u_t^{\varepsilon })-f(t,X^*_t, \mathbb {P}_{X^*_t},u^*_t)\big )dt\Big ]\nonumber \\&+\,\mathbb {E}\left[ h(X^{\varepsilon }_T,\mathbb {P}_{X^{\varepsilon }_T})-h(X^*_T, \mathbb {P}_{X^*_T})\right] \nonumber \\= & {} \mathbb {E}\left[ \int _0^T \big (f_x(t)(Y^{\varepsilon }_t+Z^{\varepsilon }_t)+\widetilde{\mathbb {E}}[\widetilde{f}_\mu (t) (\widetilde{Y}^{\varepsilon }_t+\widetilde{Z}^{\varepsilon }_t)]\big )dt\right] \nonumber \\&+\,\mathbb {E}\left[ h_x(T)(Y^{\varepsilon }_T+Z^{\varepsilon }_T)\right] +\mathbb {E}\left[ \widetilde{\mathbb {E}}\left[ \widetilde{h}_\mu (T)(\widetilde{Y}^{\varepsilon }_T +\widetilde{Z}^{\varepsilon }_T)\right] \right] \nonumber \\&+\,\mathbb {E}\Bigg [\int _0^T\big (\delta f(t)\mathbf{1}_{E_{\varepsilon }}(t)+{\mathscr {L}}_{xx}(t,f,Y^{\varepsilon }_t)\nonumber \\&+ \widetilde{\mathbb {E}}[{\mathscr {L}}_{\mu y}(t,\widetilde{f},\widetilde{Y}^{\varepsilon }_t)]\big )\,dt\Bigg ] +\mathbb {E}[{\mathscr {L}}_{xx}(T,h,Y^{\varepsilon }_T)]\\&+\,\mathbb {E}\big [\widetilde{\mathbb {E}}[{\mathscr {L}}_{\mu y}(T,\widetilde{h},\widetilde{Y}^{\varepsilon }_T)]\big ]+o(\varepsilon ). \nonumber \end{aligned}$$
(6.1)

We should remark that in (6.1) the terms involving the second order derivative \(\partial ^2_{\mu }f\) or \(\partial ^2_{\mu }h\) do not appear, which we briefly argue as follows. Note that in the Taylor expansion (2.9), or more precisely (2.8), the term involving \(\partial ^2_{\mu }f\) reads

$$\begin{aligned} \Theta ^\varepsilon :=\int _0^T\mathbb {E}\left[ \widetilde{\mathbb {E}}\left[ \partial ^2_{\mu }f(\cdots )(Y^\varepsilon _t+Z^\varepsilon _t)(\widetilde{Y}^\varepsilon _t+\widetilde{Z}^\varepsilon _t)\right] \right] dt. \end{aligned}$$

But by estimate (4.10) we see that

$$\begin{aligned} \Theta ^\varepsilon =\int _0^T\mathbb {E}\left[ \widetilde{\mathbb {E}}\left[ \partial ^2_{\mu }f(\cdots )Y^\varepsilon _t\widetilde{Y}^\varepsilon _t\right] \right] dt+o(\varepsilon ), \end{aligned}$$

and since (4.25) (applied with \(T=t\)), combined with (4.9), gives

$$\begin{aligned}&\displaystyle \int _0^T\Big |\mathbb {E}[\widetilde{\mathbb {E}}[\partial ^2_{\mu }f(\cdots )Y^\varepsilon _t\widetilde{Y}^\varepsilon _t]]\Big |dt\\&\displaystyle \quad \le \int _0^T \left( \mathbb {E}[|Y^\varepsilon _t|^2]\right) ^{\frac{1}{2}}\left( \mathbb {E}\Big [|\widetilde{\mathbb {E}}[ \partial ^2_{\mu }f(\cdots )\widetilde{Y}^\varepsilon _t]|^4\Big ]\right) ^{\frac{1}{4}}dt\\&\quad \le C\varepsilon \rho (\varepsilon ),\ \text{ for } \text{ some } \text{ positive } \text{ function }\ \rho :\mathbb {R}_+\rightarrow \mathbb {R}_+\ \text{ with }\ \rho (\varepsilon )\rightarrow 0\ \text{ as }\ \varepsilon \downarrow 0, \end{aligned}$$

we obtain that \(\Theta ^\varepsilon =o(\varepsilon )\). Here we have used that the function \(\rho \) constructed for (4.25) does not depend on the terminal time.

Similarly, using the expression (4.17) and applying (4.9), Proposition 4.3, and Corollary 5.3, the corresponding term involving \(\partial ^2_{\mu }h\) is also \(o(\varepsilon )\), proving our claim. Now, reorganizing (6.1), we get

$$\begin{aligned}&0 \le J(u^{\varepsilon })-J(u^*) \\&\quad = \mathbb {E}\Big [ \int _0^T\big (\delta f(t)\mathbf{1}_{E_{\varepsilon }}(t)+{\mathscr {L}}_{xx}(t,f,Y^{ \varepsilon }_t)+\widetilde{\mathbb {E}}[{\mathscr {L}}_{\mu y}(t,\widetilde{f},\widetilde{Y}^{\varepsilon }_t)]\big )dt\Big ]\\&\qquad +\mathbb {E}[{ \mathscr {L}}_{xx}(T,h,Y^{\varepsilon }_T)] +\mathbb {E}\big [\widetilde{\mathbb {E}}[{\mathscr {L}}_{\mu y}(T,\widetilde{h},\widetilde{Y}^{\varepsilon }_T)]\big ] \\&\qquad +\mathbb {E}\Big [\int _0^T(Y^{\varepsilon }_t+Z^{\varepsilon }_t)(f_x(t)+\widetilde{\mathbb {E}}[{\widetilde{f}^*}_\mu (t)])\, dt\Big ] \nonumber \\&\qquad -\mathbb {E}\big [(-h_x(T)-\widetilde{\mathbb {E}}[{\widetilde{h}^*}_\mu (T)])(Y^{\varepsilon }_T+Z^{ \varepsilon }_T)\big ]+o(\varepsilon ). \end{aligned}$$

Note that, by using Proposition 4.2 and the duality relations (4.5)–(4.6), we have

$$\begin{aligned} \mathbb {E}[p_T(Y^{\varepsilon }_T+Z^{\varepsilon }_T)]&=\mathbb {E}\Big [\int _0^T(Y^{\varepsilon }_t+Z^{\varepsilon }_t) (f_x(t)+\widetilde{\mathbb {E}} [{\widetilde{f}^*}_\mu (t)])dt\Big ]\nonumber \\&\quad +\mathbb {E}\Big [\int _0^T\big (p_t({\mathscr {L}}_{xx}(t,b,Y^{\varepsilon }_t)+\widetilde{\mathbb {E}}[ {\mathscr {L}}_{\mu y}(t,{\widetilde{b}},\widetilde{Y}^{\varepsilon }_t)]\big )\\&\qquad +q_t\big ({\mathscr {L}}_{xx}(t, \sigma ,Y^{\varepsilon }_t)+\widetilde{\mathbb {E}}[{\mathscr {L}}_{\mu y}(t,{\widetilde{\sigma }},\widetilde{Y}^{\varepsilon }_t)]) \big )dt\Big ]\nonumber \\&\quad +\mathbb {E}\Big [\int _0^T(p_t\delta b(t)+q_t\delta \sigma (t))\mathbf{1}_{E_{\varepsilon }}(t)\, dt\Big ]+R_{\varepsilon },\nonumber \end{aligned}$$
(6.2)

where

$$\begin{aligned} R_{\varepsilon }:= & {} \mathbb {E}\Big [\int _0^T\big (p_t\delta b_x(t)+q_t\delta \sigma _x(t)\big ) Y_t^{\varepsilon }\mathbf{1}_{E_{\varepsilon }}(t)dt\Big ]\\&+\mathbb {E}\Big [ \int _0^T\big (p_t\widetilde{\mathbb {E}}[\delta \widetilde{b}_\mu (t)\widetilde{Y}^{\varepsilon }_t] +q_t\widetilde{\mathbb {E}}[\delta \widetilde{\sigma }_\mu (t)\widetilde{Y}^{\varepsilon }_t]\big )\mathbf{1}_{E_{\varepsilon }}(t) dt\Big ]. \end{aligned}$$

We claim that \(|R_\varepsilon |\le C\varepsilon \rho (\varepsilon )\), where \(\rho (\varepsilon )\rightarrow 0\) as \(\varepsilon \downarrow 0\). Indeed, note that, in particular,

$$\begin{aligned} \Big |\mathbb {E}\Big [\int _0^T q_t\delta \sigma _x(t)Y_t^{\varepsilon }{} \mathbf{1}_{E_{\varepsilon }}(t)dt\Big ] \Big |\le & {} C \mathbb {E}\Big [|E_{\varepsilon }|^{\frac{1}{2}}\big (\int _{{E_{\varepsilon }}} |q_t|^2dt\big )^{\frac{1}{2}}\sup _{0\le t\le T}|{Y}^{\varepsilon }_t|\Big ]\\\le & {} C{\varepsilon }^{\frac{1}{2}}\left( \mathbb {E}\Big [\int _{{E_{\varepsilon }}}|q_t|^2dt\Big ] \right) ^{\frac{1}{2}}\left( \mathbb {E}\Big [\sup _{0\le t\le T}\Vert {Y}^{\varepsilon }_t\Vert ^2 \Big ]\right) ^{\frac{1}{2}} \\\le & {} C {\varepsilon }\left( \mathbb {E}\Big [\int _{{E_{\varepsilon }}}|q_t|^2dt\Big ]\right) ^{ \frac{1}{2}}=C {\varepsilon }\rho (\varepsilon ). \nonumber \end{aligned}$$

The other terms can be estimated similarly. Now applying (6.2) we can write (6.1) as

$$\begin{aligned} 0\le & {} J(u^{\varepsilon })-J(u^*)\nonumber \\= & {} \mathbb {E}\Big [ \int _0^T\big (\delta f(t)\mathbf{1}_{E_{\varepsilon }}(t)+{\mathscr {L}}_{xx}(t,f, Y^{\varepsilon }_t)+\widetilde{\mathbb {E}}[{\mathscr {L}}_{\mu y}(t,{\widetilde{f}},\widetilde{Y}^{\varepsilon }_t)] \big )dt\Big ]\nonumber \\&+\mathbb {E}[{\mathscr {L}}_{xx}(T,h,Y^{\varepsilon }_T)]+\mathbb {E}\big [\widetilde{\mathbb {E}}[{\mathscr {L}}_{\mu y}(T,\widetilde{h},\widetilde{Y}^{\varepsilon }_T)]\big ]\nonumber \\&-\mathbb {E}\big [\int _0^T\Big (p_t\big ({\mathscr {L}}_{xx}(t,b,Y^{\varepsilon }_t)+\widetilde{\mathbb {E}}\big [ {\mathscr {L}}_{\mu y}(t,{\widetilde{b}},\widetilde{Y}^{\varepsilon }_t)\big ]\big )\\&\quad +q_t\big ({\mathscr {L}}_{xx}(t,\sigma ,Y^{\varepsilon }_t)+\widetilde{\mathbb {E}} \big [{\mathscr {L}}_{\mu y}(t,{\widetilde{\sigma }},\widetilde{Y}^{\varepsilon }_t)\big ]\big )\Big ) dt\Big ]\nonumber \\&-\mathbb {E}\Big [\int _0^T\big (p_t\delta b(t)+q_t\delta \sigma (t)\big )\mathbf{1}_{E_{\varepsilon }}(t)dt\Big ]+o(\varepsilon ).\nonumber \end{aligned}$$
(6.3)

Now, in view of (3.10), we have

$$\begin{aligned} 0\le & {} J(u^{\varepsilon })-J(u^*)\nonumber \\= & {} -\mathbb {E}\Big [\int _0^T\delta H(t)\mathbf{1}_{E_{\varepsilon }}(t)\,dt\Big ]+\frac{1}{2} \mathbb {E}\Big [\big (h_{xx}(T)+\widetilde{\mathbb {E}}\big [{\widetilde{h}^*}_{\mu y}(T)\big ] \big ) (Y^{\varepsilon }_T)^2 \nonumber \\&-\int _0^T\big (H_{xx}(t)+ \widetilde{\mathbb {E}}[{\widetilde{H}^*}_{\mu y}(t)] \big )(Y^{\varepsilon }_t)^2 dt\Big ]+o(\varepsilon ). \end{aligned}$$
(6.4)

Applying Itô’s formula to \(P_t(Y^{\varepsilon }_t)^2\) and then taking expectation, we get from (4.9) and Proposition 4.3 that

$$\begin{aligned} \mathbb {E}[P_T(Y^{\varepsilon }_T)^2]&=-\mathbb {E}\left[ \int _0^T\big (H_{xx}(t)+ \widetilde{\mathbb {E}}[{\widetilde{H}^*}_{\mu y}(t)] \big )(Y^{\varepsilon }_t)^2 dt\right] \\&\quad +\mathbb {E}\left[ \int _0^TP_t(\delta \sigma (t))^2\mathbf{1}_{E_{\varepsilon }}(t)dt\right] +o(\varepsilon ). \end{aligned}$$

Noting the terminal condition of the adjoint process, \(P_T=-(h_{xx}(T)+\widetilde{\mathbb {E}}[{\widetilde{h}^*}_{\mu y}(T)])\) (see (3.13)), we obtain from (6.4) that

$$\begin{aligned} 0\le J(u^{\varepsilon })-J(u^*)=-\mathbb {E}\left[ \int _0^T\big (\delta H(t)+ \frac{1}{2}P_t(\delta \sigma (t))^2\big )\mathbf{1}_{E_{\varepsilon }} (t)dt\right] +o(\varepsilon ).\qquad \end{aligned}$$
(6.5)

We can now apply the Lebesgue differentiation theorem to deduce from (6.5) that, for all \(u\in U\) and a.e. \(t\in [0,T]\), it holds \(\mathbb {P}\)-almost surely that

$$\begin{aligned}&\mathscr {H}(X^*_t,u,p_t,q_t)-\mathscr {H}(X^*_t,u^*_t,p_t,q_t)\nonumber \\&\qquad +\frac{1}{2}P_t\left( \sigma (X^*_t,\mathbb {P}_{X^*_t},u)-\sigma (X^*_t,\mathbb {P}_{X^*_t},u^*_t) \right) ^2 \le 0, \end{aligned}$$
(6.6)

proving the theorem.
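
For completeness, we sketch this last step. Taking in the spike variation (4.1) the set \(E_{\varepsilon }=[s,s+\varepsilon ]\subset [0,T]\), with the perturbation equal to a fixed \(u\in U\) on \(E_{\varepsilon }\) (as is customary in spike-variation arguments), dividing (6.5) by \(\varepsilon \), and letting \(\varepsilon \downarrow 0\), the Lebesgue differentiation theorem gives, for a.e. \(s\in [0,T]\),

$$\begin{aligned} 0\le -\lim _{\varepsilon \downarrow 0}\frac{1}{\varepsilon }\mathbb {E}\Big [\int _s^{s+\varepsilon }\Big (\delta H(t)+\frac{1}{2}P_t(\delta \sigma (t))^2\Big )dt\Big ]=-\mathbb {E}\Big [\delta H(s)+\frac{1}{2}P_s(\delta \sigma (s))^2\Big ], \end{aligned}$$

that is, \(\mathbb {E}[\delta H(s)+\frac{1}{2}P_s(\delta \sigma (s))^2]\le 0\) for a.e. \(s\); a standard conditioning argument (perturbing \(u^*\) only on \(A\cap E_{\varepsilon }\) for arbitrary \(A\in \mathcal {F}_s\)) then upgrades this to the \(\mathbb {P}\)-a.s. pointwise inequality (6.6).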