1 Introduction

Adjoint methods have been widely used for the optimization of partial differential equations (PDEs), and especially for optimizing PDEs modeling engineering systems. Examples include [2, 4–6, 9–11, 16–18, 20, 24, 26–28] and [15].

Traditional adjoint algorithms are iterative: at each iteration, a new adjoint PDE must be solved in order to calculate the gradient descent step. During the course of the optimization, many adjoint PDEs must be solved, which in certain cases can be computationally costly. As an alternative, time-stepping and pseudo-time-stepping methods (often in combination with one-shot methods) have been proposed, where one views time-independent PDEs as stationary states of appropriate dynamical systems and studies the behavior of the latter in the long-time regime, i.e., after their transient phase. We refer the interested reader to [3, 7, 12–14, 19, 31, 32] and the references therein for a certainly non-exhaustive list of representative references.

In this paper, we couple a time-relaxed adjoint PDE with a continuous-time update equation for the variables that are being optimized. The time-relaxed adjoint PDE yields an estimate of the direction of steepest descent, and updates this estimate continuously in time. The optimization variables are also updated continuously in time using this online estimate of the direction of steepest descent. The focus of our paper is the mathematical analysis of this “online adjoint algorithm”. As \(t \rightarrow \infty \), the solution of the time-relaxed adjoint PDE asymptotically matches the exact direction of steepest descent. A crucial step in the convergence proof is a multi-scale analysis of the coupled system for the forward PDE, adjoint PDE, and the gradient descent ODE for the design variables.

We prove convergence and convergence rates for the online adjoint algorithm for a certain class of PDEs. Specifically, in our theoretical analysis, we consider the optimization problem where we seek to minimize the objective function:

$$\begin{aligned} J(\theta ) = \frac{1}{2} \int _{U} \bigg ( u^{*}(x) - h(x) \bigg )^2 dx + \frac{\gamma }{2} \left\Vert {\theta }\right\Vert ^2_2, \end{aligned}$$
(1.1)

where h is a target profile, \(\frac{\gamma }{2} \left\Vert {\theta }\right\Vert ^2_2\) is a regularization term with \(\gamma > 0\), and \(\left\Vert { \cdot }\right\Vert _2\) is the \(\ell _2\) norm. \(u^{*}\) satisfies the elliptic PDE

$$\begin{aligned} A u^{*}(x)&= f( x, \theta ), \quad x\in U\nonumber \\ u^{*}(x)&= 0,\quad x\in \partial U, \end{aligned}$$
(1.2)

where A is a standard second-order elliptic operator. Thus, we wish to select a parameter \(\theta \) such that the solution \(u^{*}\) of the PDE (1.2) is as close as possible to the target profile h.

If \(A^{\dagger }\) denotes the formal adjoint operator to A, then the adjoint PDE is

$$\begin{aligned} A^{\dagger } \hat{u}^{*}(x)&= u^{*}(x) - h(x), \quad x\in U\nonumber \\ \hat{u}^{*}(x)&= 0,\quad x\in \partial U \end{aligned}$$
(1.3)

The gradient of the objective function (1.1) can be evaluated using the solution \(\hat{u}^{*}\) to the adjoint PDE (1.3). By Lemma 2.4 we have that

$$\begin{aligned} \nabla _{\theta } J(\theta ) = \int _{U} \hat{u}^{*}(x) \nabla _{\theta } f(x, \theta ) dx + \gamma \theta . \end{aligned}$$
(1.4)

Thus, the adjoint PDE (1.3) can be used to evaluate the gradient of the objective function, which in turn can be used to optimize over the PDE (1.2). A key advantage of adjoint methods is that, no matter how large the dimension of \(\theta \) is, computing the full gradient requires solving only the single adjoint PDE (1.3), which has the same dimension as the original PDE (1.2).

1.1 The Online Adjoint Algorithm

The online adjoint algorithm optimizes the objective function \(J(\theta )\) via a continuous-time equation for the update of the parameter \(\theta (t)\); see also [14, 32] for related formulations. The direction of steepest descent is estimated using a time-relaxation of the adjoint PDE. The estimate and the optimization variables are both simultaneously updated continuously in time. An appropriately chosen learning rate parameter is introduced, which allows us to guarantee both well-posedness of the algorithm for all times (Theorem 2.8) and convergence as \(t\rightarrow \infty \) (Theorems 3.1 and 4.2).

The online adjoint algorithm satisfies the equations:

$$\begin{aligned} \frac{\partial u}{\partial t}(t,x)&= -A u(t,x) + f(x,\theta (t)), \quad x\in U, t>0 \nonumber \\ \frac{\partial \hat{u}}{\partial t}(t,x)&= -A^{\dagger } \hat{u}(t,x) +(u(t,x) - h(x)), \quad x\in U, t>0 \nonumber \\ \frac{d \theta }{dt}(t)&= - \alpha (t) \bigg ( \int _{U} \hat{u}(t,x) \nabla _{\theta } f (x, \theta (t) ) dx +\gamma \theta (t) \bigg )\nonumber \\ u(t,x)&=\hat{u}(t,x)=0, \quad x\in \partial U, t>0\nonumber \\ u(0,x)&=u_{0}(x), \hat{u}(0,x)=\hat{u}_{0}(x), \end{aligned}$$
(1.5)

where \(\alpha (t)\) is an appropriately chosen learning rate. The PDEs for u and \(\hat{u}\) can be viewed as time relaxations of the PDE (1.2) and its adjoint PDE (1.3). It is easy to see that \(\int _{U} \hat{u}(t,x) \nabla _{\theta } f (x, \theta (t) ) dx\) is an estimate of the direction of steepest descent \(\int _{U} \hat{u}^{*}(x) \nabla _{\theta } f (x, \theta (t) ) dx\); indeed, Lemmas 3.3 and 3.4 below show that the differences \(u(t,\cdot )-u^{*}(\cdot ;\theta (t))\) and \(\hat{u}(t,\cdot )-\hat{u}^{*}(\cdot ;\theta (t))\) vanish in \(L^{2}(U)\) as \(t\rightarrow \infty \).

Apart from the generic formulation of the online adjoint algorithm as presented in (1.5), our main contribution is twofold. First, we prove that as \(t\rightarrow \infty \), \(\left\Vert {\nabla J(\theta (t))}\right\Vert \rightarrow 0\). Namely, we prove that \(\theta (t)\) converges to a stationary point of \(J(\theta )\). We emphasize here that in order to do so, no assumptions on the convexity of J are needed. Second, if we further assume that \(J(\theta )\) is strongly convex, then we also prove a convergence rate of \(\theta (t)\) to the global minimum of \(J(\theta )\).

In practice, the online adjoint algorithm (1.5) is implemented by simultaneously solving the coupled ODE-PDE system using numerical methods such as finite-difference methods. Either explicit or implicit finite difference methods can be used. For example, an explicit finite difference method for implementing (1.5) would be:

  • Update u and \(\hat{u}\):

    $$\begin{aligned} u(t + \Delta , x)= & {} u(t,x) + \bigg ( - A u(t,x) + f(x,\theta (t)) \bigg ) \Delta , \nonumber \\ \hat{u}(t + \Delta ,x)= & {} \hat{u}(t,x) + \bigg ( -A^{\dagger } \hat{u}(t,x) + ( u(t,x) - h(x)) \bigg ) \Delta , \end{aligned}$$
    (1.6)

    where \(\Delta \) is the time-step size.

  • Then, update the parameter \(\theta \):

    $$\begin{aligned} \theta (t+\Delta ) = \theta (t) - \alpha (t) \bigg ( \int _{U} \hat{u}(t,x) \nabla _{\theta } f (x, \theta (t) ) dx + \gamma \theta (t) \bigg ) \Delta . \end{aligned}$$
    (1.7)

    The spatial domain U is discretized and a finite-difference method is used to approximate the operator A. The integrals are discretized as appropriate sums. A minimal code sketch of this scheme for a one-dimensional model problem is given below.
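
In the sketch, every modelling choice is an illustrative assumption rather than a choice made in the paper: we take \(A=-\frac{d^{2}}{dx^{2}}\) on \(U=(0,1)\) with zero Dirichlet boundary conditions, a two-parameter forcing \(f(x,\theta )=\theta _{1}\sin (\pi x)+\theta _{2}\sin (2\pi x)\), a target h generated from a “true” parameter, a small regularization weight \(\gamma \), and a learning rate \(\alpha (t)=C_{\alpha }/(1+t)\) whose magnitude \(C_{\alpha }\) is tuned to the conditioning of this model problem (cf. Assumption 4.1, where \(\alpha (t)=C_{\alpha }a(t)\)).

```python
import numpy as np

# 1D model problem (illustrative assumptions): A = -d^2/dx^2 on U = (0,1) with zero
# Dirichlet BCs, f(x, theta) = theta_1 sin(pi x) + theta_2 sin(2 pi x).
N = 49                                  # number of interior grid points
dx = 1.0 / (N + 1)
x = np.linspace(dx, 1.0 - dx, N)        # interior grid
d = 2                                   # dimension of theta
gamma = 1e-6                            # small regularization weight (gamma > 0)

# Second-order finite-difference approximation of A on the interior grid.
A = (2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / dx**2
A_adj = A.T                             # A is self-adjoint here, so A_adj = A

def f(theta):
    """Forcing term f(., theta) evaluated on the grid."""
    return sum(theta[i] * np.sin((i + 1) * np.pi * x) for i in range(d))

grad_f = np.stack([np.sin((i + 1) * np.pi * x) for i in range(d)])  # shape (d, N)

theta_true = np.array([1.0, -0.5])
h = np.linalg.solve(A, f(theta_true))   # target profile h = u*( . ; theta_true)

# Online adjoint algorithm (1.6)-(1.7): explicit Euler in time.
dt = 0.4 * dx**2                        # explicit-scheme stability restriction
T = 50.0
C_alpha = 2000.0                        # learning-rate magnitude, tuned to this problem
alpha = lambda t: C_alpha / (1.0 + t)

u = np.zeros(N)                         # u(0, .) = 0
u_hat = np.zeros(N)                     # u_hat(0, .) = 0
theta = np.zeros(d)

for n in range(int(T / dt)):
    t = n * dt
    # Online estimate of the steepest-descent direction: int_U u_hat grad_f dx + gamma*theta.
    descent = grad_f @ u_hat * dx + gamma * theta
    u_new = u + dt * (-A @ u + f(theta))
    u_hat_new = u_hat + dt * (-A_adj @ u_hat + (u - h))
    theta = theta - dt * alpha(t) * descent
    u, u_hat = u_new, u_hat_new

# theta should drift toward theta_true; the second (higher-frequency) mode is
# less well conditioned and therefore converges more slowly.
print("recovered theta:", theta)
print("true theta     :", theta_true)
```

An implicit treatment of the A-terms in the two PDE updates would remove the stability restriction on the time step that the explicit scheme requires, at the cost of a linear solve per step.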

The focus of our paper is to rigorously prove the convergence of the online adjoint algorithm for linear elliptic PDEs. In practice, real-world applications will typically require optimizing over nonlinear PDEs. The online adjoint algorithm can also be used to optimize over nonlinear PDEs. References [30] and [23] optimize over the Navier–Stokes equations using our online adjoint algorithm. Numerical optimization with pseudo-time-stepping adjoint methods has also been studied in [3, 7, 12–14, 19, 31, 32].

We demonstrate the online adjoint method below for a simple example of a nonlinear PDE. Consider the equation

$$\begin{aligned} 0 = - \theta _1 u \frac{\partial u}{\partial x} -\theta _2 u \frac{\partial u}{\partial y} +\frac{\partial ^2 u}{\partial x^2} +\frac{\partial ^2 u}{\partial y^2}, \end{aligned}$$
(1.8)

where \((x,y) \in [0,1] \times [0,1]\) and with boundary conditions \(u(0, y) = 1\), \(u(1,y) = -1\), \(u(x, 0) = 1\), and \(u(x,1) = -1\). The parameters to be optimized over are \(\theta = (\theta _1, \theta _2)\) and the objective function is (1.1) with \(\gamma = 0\). The target function h is the solution to (1.8) with \(\theta = (10, 10)\). That is, our goal is to solve the inverse problem of recovering the parameters in the PDE (1.8) given an observed solution.

The adjoint PDE for (1.8) is

$$\begin{aligned} 0= \theta _1 u \frac{\partial \hat{u} }{\partial x} +\theta _2 u \frac{\partial \hat{u} }{\partial y} +\frac{\partial ^2 \hat{u}}{\partial x^2} +\frac{\partial ^2 \hat{u}}{\partial y^2} + (u - h), \end{aligned}$$
(1.9)

where \((x,y) \in [0,1] \times [0,1]\) and with boundary conditions \(\hat{u}(0, y) = \hat{u}(1,y) = \hat{u}(x, 0) = \hat{u}(x,1) = 0\). The gradient of the objective function is given by the formula

$$\begin{aligned} \frac{\partial J(\theta ) }{\partial \theta _1}= & {} - \int _{0}^1 \int _0^1 \hat{u} u \frac{\partial u}{\partial x} (x,y) dx dy,\\ \frac{\partial J(\theta ) }{\partial \theta _2}= & {} - \int _{0}^1 \int _0^1 \hat{u} u \frac{\partial u}{\partial y} (x,y) dx dy. \end{aligned}$$

The online adjoint algorithm can be used to minimize the objective function \(J(\theta )\). In our numerical experiment, we use an explicit finite difference method for the numerical solution of the time-relaxed PDE and the time-relaxed adjoint PDE. The PDE variables are updated using an Euler scheme. (Although not implemented here, it is worth noting that higher-order accuracy in time could be achieved with a Runge–Kutta scheme.) Uniform mesh sizes for both time and space are chosen. The PDE operator is approximated using a second-order accurate finite-difference method. The parameter ODEs are also solved using an explicit Euler scheme on the same uniform time grid, and the spatial integrals are discretized as sums on the uniform spatial grid. Figure 1 demonstrates that the online adjoint algorithm converges to the correct value of the parameters \(\theta \) as \(t \rightarrow \infty \). The right display in Fig. 1 presents the numerical convergence rate, which is consistent with the theoretical convergence rate of \(t^{-\frac{1}{2}}\) that we prove in this paper for strongly convex objective functions and linear elliptic PDEs. (In fact, for this specific example, the numerical convergence rate turns out to be faster than \(t^{- \frac{1}{2}}\).)
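
To make concrete how the spatial derivatives and integrals in the gradient formulas above are discretized, the following is a minimal sketch of the discrete gradient assembly and the parameter update (1.7) for the example (1.8)–(1.9). The fields u and u_hat in it are hypothetical placeholders standing in for the time-relaxed forward and adjoint solutions on a uniform grid, and the learning rate and time step values are likewise illustrative.

```python
import numpy as np

# Hypothetical placeholder fields on a uniform grid of [0,1]^2; in the actual
# experiment u and u_hat come from the time-relaxed forward and adjoint solvers.
M = 51
xs = np.linspace(0.0, 1.0, M)
dxy = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs, indexing="ij")
u = np.cos(np.pi * X) * np.cos(np.pi * Y)        # placeholder for u(t, x, y)
u_hat = np.sin(np.pi * X) * np.sin(np.pi * Y)    # placeholder for u_hat(t, x, y)

# Second-order finite differences for the spatial derivatives of u.
du_dx = np.gradient(u, xs, axis=0)
du_dy = np.gradient(u, xs, axis=1)

# Discrete versions of dJ/dtheta_1 = -int int u_hat u u_x dx dy and
# dJ/dtheta_2 = -int int u_hat u u_y dx dy, as simple Riemann sums.
dJ_dtheta1 = -np.sum(u_hat * u * du_dx) * dxy**2
dJ_dtheta2 = -np.sum(u_hat * u * du_dy) * dxy**2

# One explicit Euler step (1.7) for theta (gamma = 0 here, as in the experiment);
# alpha_t and Delta are hypothetical values of the learning rate and time step.
theta = np.array([0.0, 0.0])
alpha_t, Delta = 1.0, 1e-3
theta = theta - alpha_t * Delta * np.array([dJ_dtheta1, dJ_dtheta2])
print("updated theta:", theta)
```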

Fig. 1

Left: solution for \(\theta \) using the online adjoint algorithm versus computational time. Right: numerical convergence rate for \(\left\Vert { \theta - \theta ^{*}}\right\Vert _2\) where \(\theta ^{*} = (10, 10)\)

1.2 Organization of the Proof

The rest of the paper is organized as follows. In Sect. 2 we state our assumptions, present the online adjoint algorithm in more detail and prove its well-posedness in Theorem 2.8. Convergence of \(\theta (t)\) to a stationary point of \(J(\theta )\) is proven in Sect. 3, Theorem 3.1. We emphasize that no convexity requirements on \(J(\theta )\) are needed in order to prove convergence. If, in addition, we assume that \(J(\theta )\) is strongly convex with a single stationary point, then one can prove a convergence rate, Theorem 4.2. The latter is the content of Sect. 4.

2 Assumptions, Notation and Well Posedness of the Online Adjoint Algorithm

Let U be an open, bounded subset of \({\mathbb {R}}^{n}\). We will denote by \((\cdot ,\cdot )\) the usual inner product in \(H=L^{2}(U)\). We shall assume that the operator A is uniformly elliptic, diagonalizable and dissipative, per Assumption 2.1.

Assumption 2.1

The operator A is uniformly elliptic. We also assume that A is diagonalizable and dissipative. Namely there exists a countable complete orthonormal basis \(\{e_{n}\}_{n\in {\mathbb {N}}}\subset H\) that consists of eigenvectors of A corresponding to a non-negative sequence \(\{\lambda _{n}\}_{n\in {\mathbb {N}}}\) of eigenvalues such that

$$\begin{aligned} (-A)e_{n}=-\lambda _{n}e_{n}, \quad n\in {\mathbb {N}} \end{aligned}$$

and such that the dissipativity condition \(\lambda = \displaystyle \inf _{n\in {\mathbb {N}}} \lambda _{n}>0\) holds.

An example of A is the second order elliptic operator, for definiteness taken to be in divergence form,

$$\begin{aligned} Au(x)=-\sum _{i,j=1}^{n}\left( a^{i,j}(x)u_{x_{i}}(x)\right) _{x_{j}} +\sum _{i=1}^{n}b^{i}(x)u_{x_{i}}(x)+c(x)u(x). \end{aligned}$$

The diagonalizability condition of Assumption 2.1 is automatically satisfied, for example, by self-adjoint operators; see [8, Theorem 8.8.37].
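
For instance, for the self-adjoint operator \(A=-\frac{d^{2}}{dx^{2}}\) on \(U=(0,1)\) with zero Dirichlet boundary conditions, one may take

$$\begin{aligned} e_{n}(x)=\sqrt{2}\sin (n\pi x), \quad \lambda _{n}=n^{2}\pi ^{2}, \quad n\in {\mathbb {N}}, \end{aligned}$$

so that \(\lambda =\inf _{n\in {\mathbb {N}}}\lambda _{n}=\pi ^{2}>0\) and Assumption 2.1 is satisfied.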

Before proceeding with the well-posedness of the online adjoint algorithm, let us recall a few basic results that will be useful for the analysis that follows.

Taking the domain of A to be \(D(A)=H^{1}_{0}(U)\cap H^{2}(U)\), we have that it is dense in \(H=L^{2}(U)\). Then, due to Assumption 2.1, elliptic regularity theory gives that the operator A is closed, and thus by the Hille–Yosida theorem \((-A)\) is the generator of an analytic, strongly continuous contraction semigroup \(\{S(t)\}_{t\ge 0}\) on H. The spectral assumption made in Assumption 2.1 guarantees that

$$\begin{aligned} \left\Vert {S(t)u}\right\Vert _{H}\le e^{-\lambda t}\left\Vert {u}\right\Vert _{H}. \end{aligned}$$
(2.1)

The latter also means that A is a coercive operator. In particular, we will heavily use the fact that for \(u\in D(A)\), we have

$$\begin{aligned} (A u, u)\ge \lambda \left\Vert {u}\right\Vert ^{2}_{H}. \end{aligned}$$
(2.2)
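
For completeness, (2.2) can be seen directly from the spectral decomposition in Assumption 2.1: expanding \(u=\sum _{n\in {\mathbb {N}}}(u,e_{n})e_{n}\) for \(u\in D(A)\), we have

$$\begin{aligned} (A u, u)=\sum _{n\in {\mathbb {N}}}\lambda _{n}(u,e_{n})^{2} \ge \lambda \sum _{n\in {\mathbb {N}}}(u,e_{n})^{2} =\lambda \left\Vert {u}\right\Vert ^{2}_{H}. \end{aligned}$$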

Notice now that because we are dealing with a real Hilbert space and because \(H^{1}_{0}(U)\cap H^{2}(U)\) is dense in \(L^{2}(U)\), we obtain that \(A^{\dagger }\) is also a coercive operator. Indeed, by definition of the adjoint operator \(A^{\dagger }\) we shall have that for \(u\in H^{1}_{0}(U)\cap H^{2}(U)\)

$$\begin{aligned} (A^{\dagger } u, u ) = (u, A u ) \ge \lambda ( u, u ). \end{aligned}$$
(2.3)

Notice now that under our assumptions the adjoint operator \((-A^{\dagger })\) will also generate an analytic strongly continuous semigroup \(\{S^{\dagger }(t)\}_{t\ge 0}\) on \(L^{2}(U)\). In particular, (2.3) implies that the adjoint semigroup \(S^{\dagger }(t)\) will also be exponentially stable. Indeed, by definition we have for \(u\in H^{1}_{0}(U)\cap H^{2}(U)\)

$$\begin{aligned} \frac{d}{dt}\left\Vert {S^{\dagger }(t)u}\right\Vert ^{2}_{H} = -2( A^{\dagger } S^{\dagger }(t)u,S^{\dagger }(t)u)\le -2\lambda \left\Vert {S^{\dagger }(t)u}\right\Vert ^{2}_{H}, \end{aligned}$$
(2.4)

which then, due to Gronwall's lemma, gives

$$\begin{aligned} \left\Vert {S^{\dagger }(t)u}\right\Vert _{H}\le e^{-\lambda t}\left\Vert {u}\right\Vert _{H}, \end{aligned}$$
(2.5)

proving the exponential stability of \(S^{\dagger }(t)\).

Then, if we assume that \(f\in L^{2}(U)\), the classical Lax–Milgram theorem (see for example Chap. 5.8 of [8]) implies that the elliptic boundary-value problem

$$\begin{aligned} A u^{*}(x)&= f( x ), \quad x\in U\nonumber \\ u^{*}(x)&= 0,\quad x\in \partial U \end{aligned}$$
(2.6)

has a unique weak solution \(u^{*}\in H^{1}_{0}(U)\). The same conclusion will also be true for the adjoint problem governed by the adjoint operator \(A^{\dagger }\).

Remark 2.2

We also recall here that by classical elliptic regularity results if for given \(m\in {\mathbb {N}}\), \(a^{i,j},b^{i},c\in {\mathcal {C}}^{m+1}(\bar{U})\) with \(i,j=1,\ldots ,n\) and \(\partial U\in {\mathcal {C}}^{m+2}\), then the unique solution \(u^{*}\) to (2.6) is such that \(u^{*}\in H^{m+2}(U)\). Clearly if \(a^{i,j},b^{i},c\in {\mathcal {C}}^{\infty }(\bar{U})\) with \(i,j=1,\ldots ,n\) and \(\partial U\in {\mathcal {C}}^{\infty }\), then \(u^{*}\in {\mathcal {C}}^{\infty }(\bar{U})\). We refer the interested reader to classical manuscripts, e.g., [8, Chap. 8], for more details.

Remark 2.3

For notational convenience and without loss of generality, we have assumed zero boundary conditions for the PDE (2.6). We can consider the PDE (2.6) with non-zero boundary data, say \(u^{*}(x)= g(x), x\in \partial U\), under the assumption \(g\in H^{1}(U)\) for unique solvability of the corresponding PDE (2.6). We refer the interested reader to classical manuscripts, e.g., [8, Chap. 8], for more details.

As briefly presented in the introduction, let \(f(\cdot ,\cdot ):{\mathbb {R}}^{n}\times {\mathbb {R}}^{d}\mapsto {\mathbb {R}}\) be such that for every \(\theta \in {\mathbb {R}}^{d}\), \(f(\cdot ,\theta )\in L^{2}(U)\). As with (2.6), the linear PDE

$$\begin{aligned} A u^{*}(x)&= f( x,\theta ), \quad x\in U\nonumber \\ u^{*}(x)&= 0,\quad x\in \partial U \end{aligned}$$
(2.7)

will have, for each given \(\theta \in {\mathbb {R}}^{d}\), a unique weak solution \(u^{*}\in H^{1}_{0}(U)\). We shall write \(u^{*}(x;\theta )\) when we want to emphasize the dependence on \(\theta \).

For a given target profile \(h\in L^{2}(U)\), the goal is to select \(\theta \) to minimize the objective function

$$\begin{aligned} J(\theta ) = \frac{1}{2} ( u^{*} - h, u^{*} - h ) +\frac{\gamma }{2} \left\Vert {\theta }\right\Vert ^2_2, \end{aligned}$$
(2.8)

where \(\gamma > 0\), \(\frac{\gamma }{2} \left\Vert {\theta }\right\Vert ^2_2\) is a regularization term, and \(\left\Vert { \cdot }\right\Vert _2\) is the \(\ell _2\) norm.

The adjoint PDE is

$$\begin{aligned} A^{\dagger } \hat{u}^{*}(x)&= u^{*}(x) - h(x), \quad x\in U\nonumber \\ \hat{u}^{*}(x)&= 0,\quad x\in \partial U \end{aligned}$$
(2.9)

and as with (2.6), Assumption 2.1 and the fact that \(u^{*} - h\in L^{2}(U)\) guarantee that (2.9) has a unique weak solution \(\hat{u}^{*}\in H^{1}_{0}(U)\). Notice that since \(u^{*}\) depends on \(\theta \), the same will also be true for \(\hat{u}^{*}\) and we shall write \(\hat{u}^{*}(x;\theta )\) when we want to emphasize that. The following lemma provides a useful representation for \(\nabla _{\theta } J(\theta )\), which will also motivate the form of the online adjoint algorithm.

Lemma 2.4

Let Assumption 2.1 hold and assume that \(h\in L^{2}(U)\). Then, we can write

$$\begin{aligned} \nabla _{\theta } J(\theta ) = ( \hat{u}^{*} , \nabla _{\theta } f(\theta )) + \gamma \theta . \end{aligned}$$
(2.10)

The proof of Lemma 2.4 is presented at the end of this section. In terms of the learning rate \(\alpha (t)\) we make the following assumption.

Assumption 2.5

We assume that the learning rate \(\alpha (t)\) is such that \(\displaystyle \lim _{t\rightarrow \infty }\alpha (t)=0\) and

  • \(\int _{0}^{\infty }\alpha (s)ds=\infty \) and \(\int _{0}^{\infty }\alpha ^{2}(s)ds<\infty \).

  • \(\displaystyle \sup _{t\ge 0} \int _{0}^{t}\alpha (s) e^{-\gamma \int _{s}^{t}\alpha (r)dr}ds<\infty \) and \(\displaystyle \lim _{t\rightarrow \infty }\frac{\alpha '(t)}{\alpha (t)}=0\).

The first part of Assumption 2.5 on the learning rate is classical and is the same as that used in discrete-time algorithms; see, for example, classical references such as [1, 21]. The second part of Assumption 2.5 comes up while proving that \(\left\Vert {\theta (t)}\right\Vert \) stays bounded for all times and later on in the convergence proof of \(\theta (t)\) to a stationary point of J. An example of a learning rate that satisfies both parts of Assumption 2.5 is \(\alpha (t)=\frac{1}{1+t}\).
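
Indeed, for \(\alpha (t)=\frac{1}{1+t}\) the second part of Assumption 2.5 can be checked directly: \(\frac{\alpha '(t)}{\alpha (t)}=-\frac{1}{1+t}\rightarrow 0\), and since \(\int _{s}^{t}\alpha (r)dr=\log \frac{1+t}{1+s}\),

$$\begin{aligned} \int _{0}^{t}\alpha (s) e^{-\gamma \int _{s}^{t}\alpha (r)dr}ds =\frac{1}{(1+t)^{\gamma }}\int _{0}^{t}(1+s)^{\gamma -1}ds =\frac{1-(1+t)^{-\gamma }}{\gamma }\le \frac{1}{\gamma }, \end{aligned}$$

while \(\int _{0}^{\infty }\alpha (s)ds=\infty \) and \(\int _{0}^{\infty }\alpha ^{2}(s)ds=1<\infty \) verify the first part.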

In terms of the parametric model \(f(\cdot ,\cdot )\) we make the following assumption.

Assumption 2.6

We assume the following conditions:

  • For each fixed \(\theta \in {\mathbb {R}}^{d}\), \(f(\cdot , \theta )\), \(\nabla _{\theta } f(\cdot , \theta )\) and \(\nabla ^{2}_{\theta } f(\cdot , \theta )\) are in \(L^{2}(U)\). For each fixed \(x\in {\mathbb {R}}^{n}\), \(f(x,\cdot )\), \(\nabla _{\theta } f(x, \cdot )\) and \(\nabla ^{2}_{\theta } f(x, \cdot )\) are bounded. In other words we assume that there exists \(C<\infty \) such that

    $$\begin{aligned} \sup _{\theta \in {\mathbb {R}}^{d}}\left( \left\Vert {f(\theta )}\right\Vert _{L^{2}(U)} +\left\Vert {\nabla _{\theta }f(\theta ) }\right\Vert _{L^{2}(U)}+\left\Vert {\nabla ^{2}_{\theta }f (\theta ) }\right\Vert _{L^{2}(U)}\right) \le C. \end{aligned}$$
  • For each \(x\in U\), \(f(x,\cdot )\) is globally Lipschitz in \(\theta \), with a Lipschitz constant that, as a function of x, belongs to \(L^{2}(U)\).

In terms of the associated cost function \(J(\cdot )\) we make the following Assumption 2.7.

Assumption 2.7

We assume that \(J(\cdot ) \in C^2\) and is globally Lipschitz.

We emphasize that Assumption 2.7 does not impose any convexity-type assumption on \(J(\theta )\). The online adjoint algorithm satisfies the time-dependent PDEs

$$\begin{aligned} \frac{\partial u}{\partial t}(t,x)&= -A u(t,x) + f(x,\theta (t)), \quad x\in U, t>0 \nonumber \\ \frac{\partial \hat{u}}{\partial t}(t,x)&= -A^{\dagger } \hat{u}(t,x) +(u(t,x) - h(x)), \quad x\in U, t>0 \nonumber \\ \frac{d \theta }{dt}(t)&= - \alpha (t) \bigg ( ( \nabla _{\theta } f (\theta (t) ), \hat{u}(t) ) + \gamma \theta (t) \bigg )\nonumber \\ u(t,x)&=\hat{u}(t,x)=0, \quad x\in \partial U, t>0\nonumber \\ u(0,x)&=u_{0}(x), \hat{u}(0,x)=\hat{u}_{0}(x) \end{aligned}$$
(2.11)

Notice that (2.11) is a non-local coupled system of PDEs. Theorem 2.8 is about the well-posedness of system (2.11). Also, with a slight abuse of notation, for \(\theta =\theta (t)\) from (2.11) we shall denote the solutions to (2.7) and (2.9) by \(u^{*}(t,x)\) and \(\hat{u}^{*}(t,x)\), respectively.

Theorem 2.8

Assume Assumptions 2.1, 2.5 and 2.6 and that \(u_{0}, \hat{u}_{0}, h\in L^{2}(U)\). There exists a unique mild solution \(u,\hat{u}\in {\mathcal {C}}((0,\infty ); W^{2,2}_{0}(U))\cap {\mathcal {C}}^{1}((0,\infty );L^{2}(U))\) and \(\theta \in {\mathcal {C}}^{1}((0,\infty ))\) to equation (2.11). If in addition \(h\in L^{\infty }\), then \(u,\hat{u}\in {\mathcal {C}}((0,\infty ); W^{2,p}_{0}(U))\cap {\mathcal {C}}^{1}((0,\infty );L^{p}(U))\) for any \(p\ge 2\) and if further we assume that \(u_{0},\hat{u}_{0}\in W^{2,p}(U)\), then \(u,\hat{u}\in {\mathcal {C}}([0,\infty ); W^{2,p}_{0}(U))\cap {\mathcal {C}}^{1}([0,\infty );L^{p}(U))\) for any \(p\ge 2\). In addition, we have that there exists some constant \(K<\infty \) such that

$$\begin{aligned} \sup _{t \ge 0} \left[ \left\Vert {u(t)}\right\Vert _{L^{2}(U)} +\left\Vert {\hat{u}(t)}\right\Vert _{L^{2}(U)}+\left\Vert {\theta (t)}\right\Vert _2 \right] < K. \end{aligned}$$
(2.12)

Remark 2.9

At this point we mention that even though in Assumption 2.6 we have assumed that \(\left\Vert {f(\theta )}\right\Vert _{L^{2}(U)}\) is uniformly bounded, an investigation of the proof of Theorem 2.8 shows that this assumption can be relaxed. In particular, at the expense of slightly more elaborate estimates, the results of this paper (which heavily rely on (2.12) being true) hold if we assume instead that \(\left\Vert {f(\theta )}\right\Vert _{L^{2}(U)}\) grows linearly in \(\left\Vert {\theta }\right\Vert _{\ell _{2}}\) with bounded derivatives, i.e., \(\left\Vert {f(\theta )}\right\Vert _{L^{2}}\le C(1+\left\Vert {\theta }\right\Vert _{\ell _{2}})\), still with \(\left( \left\Vert {\nabla _{\theta }f(\theta )}\right\Vert _{L^{2}} +\left\Vert {\nabla ^{2}_{\theta }f(\theta ) }\right\Vert _{L^{2}(U)}\right) <C\) and additionally that the regularization coefficient \(\gamma >0\) is large enough depending on the \(L^{2}\) norms of \(u_{0}(x)\) and \(\hat{u}_{0}(x)\). We have chosen to present the results for uniformly bounded \(\left\Vert {f(\theta )}\right\Vert _{L^{2}(U)}\) for presentation purposes and because in this case we do not need any additional restriction on the magnitude of \(\gamma \) other than being strictly positive.

Let us conclude this section with the proofs of Lemma 2.4 and Theorem 2.8.

Proof of Lemma 2.4

(2.10) can be derived using the definition of the adjoint PDE (2.9). Define \(\tilde{u} = \nabla _{\theta } u^{*}\). Differentiating (2.7) yields

$$\begin{aligned} A \tilde{u}(x)&= \nabla _{\theta } f( x,\theta ), \quad x\in U\\ \tilde{u}(x)&= 0,\quad x\in \partial U \end{aligned}$$

Integration by parts yields

$$\begin{aligned} (\hat{u}^{*}, A \tilde{u} ) = (A^{\dagger } \hat{u}^{*}, \tilde{u}). \end{aligned}$$

Since \(A\tilde{u}=\nabla _{\theta } f(\cdot ,\theta )\), this yields the equation

$$\begin{aligned} ( A^{\dagger } \hat{u}^{*}, \tilde{u} ) = (\hat{u}^{*}, \nabla _{\theta } f(\theta ) ). \end{aligned}$$

Using the definition of the adjoint PDE (2.9),

$$\begin{aligned} (u^{*} - h, \tilde{u}) = (\hat{u}^{*}, \nabla _{\theta } f(\theta )). \end{aligned}$$

Recalling the objective function (2.8), we then write

$$\begin{aligned} \nabla _{\theta } J(\theta )= & {} ( u^{*} - h, \tilde{u} ) +\gamma \theta \\= & {} ( \hat{u}^{*}, \nabla _{\theta } f(\theta ) ) + \gamma \theta , \end{aligned}$$

which yields (2.10). \(\square \)

Proof of Theorem 2.8

Let us define the index set \({\mathcal {G}}=\{1,2,3\}\) and the space \(\Theta =U\times {\mathcal {G}}\). Define the variable \(y=(x,\zeta )\in \Theta \) and the measure \(dn=dx\otimes d\iota \) on \(\Theta \) where \(d\iota \) denotes the counting measure on \({\mathcal {G}}\). Define now the Banach space \(X^{2}=L^{2}(\Theta ,dn)\). Similarly we denote by \(H^{2}(\Theta )=W^{2,2}(\Theta )\) the Banach space of functions f on \(\Theta \) such that for each \(\zeta \in {\mathcal {G}}\) we have \(f(\cdot ,\zeta )\in H^{2}(U)\) with norm \(\left\Vert {f}\right\Vert _{H^{2}(\Theta )} =\sum _{\zeta =1}^{3}\left\Vert {f(\cdot ,\zeta )}\right\Vert _{H^{2}(U)}\).

Setting \(v(t,x)=(u(t,x),\hat{u}(t,x),\theta (t))\) and \(\rho (t,y)=v_{\zeta }(t,x)\) (the \(\zeta \)-th component of the vector v) for \(y=(x,\zeta )\in U\times {\mathcal {G}}\), we consider the evolution equation on \(\Theta \) given by

$$\begin{aligned} \partial _{t}\rho (t,y) ={\mathcal {L}}[\rho ](t,y)+{\mathcal {R}}[\rho ] (t,y), y\in \Theta \end{aligned}$$
(2.13)

where

$$\begin{aligned} {\mathcal {L}}[\rho ](t,x,1)=-Au(t,x), \quad {\mathcal {L}}[\rho ] (t,x,2)=-A^{\dagger }\hat{u}(t,x),\quad {\mathcal {L}}[\rho ](t,x,3)=0 \end{aligned}$$

and

$$\begin{aligned}&{\mathcal {R}}[\rho ](t,x,1)=f(x,\theta (t)), \quad {\mathcal {R}}[\rho ](t,x,2)=u(t,x)-h(x),\\&{\mathcal {R}}[\rho ](t,x,3)=- \alpha (t) \bigg ( ( \nabla _{\theta } f ( \theta (t) ), \hat{u}(t) ) + \gamma \theta (t) \bigg ). \end{aligned}$$

We note that we have slightly abused notation here because \({\mathcal {L}}[\rho ](t,x,3)=0\). However, this notation is convenient because it allows us to describe the PDE in question in the form (2.13) as a single vector valued evolution equation.

Let us define the norm \(\left\Vert {w}\right\Vert _{2,T}=\sup _{t\in [0,T]} \left\Vert {w(t)}\right\Vert _{L^{2}(U)}\) if \(w=w(t,x)\) and \(\left\Vert {w}\right\Vert _{2,T}=\sup _{t\in [0,T]} \left\Vert {w(t)}\right\Vert _{\ell _{2}}\) if \(w=w(t)\). Here \(\ell _2\) denotes the standard Euclidean norm. In particular, for \(v(t,x)=(u(t,x),\hat{u}(t,x),\theta (t))\) we shall have

$$\begin{aligned} \left\Vert {v}\right\Vert _{2,T}=\left\Vert {u}\right\Vert _{2,T}+\left\Vert {\hat{u}}\right\Vert _{2,T}+\left\Vert {\theta }\right\Vert _{2,T}. \end{aligned}$$

where the first two components depend on (t, x) and the last component depends only on t.

We will be working with mild solutions. Due to Assumption 2.1, the operators \((-A)\) and \((-A^{\dagger })\) are generators of analytic contraction semigroups \(\{S(t)\}_{t\ge 0}\) and \(\{S^{\dagger }(t)\}_{t\ge 0}\), respectively, on \(L^{2}(U)\).

Then, we can write for the mild solution of (2.13) that

$$\begin{aligned} \rho (t,y) = H[\rho ](t,y), \end{aligned}$$

where

$$\begin{aligned} H[\rho ](t,y)&={\left\{ \begin{array}{ll} H[\rho ](t,x,1) \\ H[\rho ](t,x,2) \\ H[\rho ](t,x,3) \end{array}\right. } ={\left\{ \begin{array}{ll} S(t)u_{0}(x)+\int _{0}^{t}S(t-s)f(x,\theta (s))ds\\ S^{\dagger }(t)\hat{u}_{0}(x)+\int _{0}^{t}S^{\dagger }(t-s)(u(s,x)-h(x))ds\\ \theta _{0} - \int _{0}^{t}\alpha (s) \bigg ( ( \nabla _{\theta } f ( \theta (s)), \hat{u}(s) ) + \gamma \theta (s) \bigg ) ds. \end{array}\right. } \end{aligned}$$
(2.14)

Now the properties of the analytic contraction semigroups \(\{S(t)\}_{t\ge 0}\) and \(\{S^{\dagger }(t)\}_{t\ge 0}\) guarantee that there exists an increasing continuous function D(t) with \(\lim _{t\rightarrow 0}D(t)=0\) (possibly different from line to line below) such that

$$\begin{aligned} \left\Vert {H[\rho ](\cdot ,1)}\right\Vert _{2,T}&\le \left\Vert {u_{0}}\right\Vert _{L^{2}}+D(T)\nonumber \\ \left\Vert {H[\rho ](\cdot ,2)}\right\Vert _{2,T}&\le \left\Vert {\hat{u}_{0}}\right\Vert _{L^{2}}+D(T) (1+\left\Vert {u}\right\Vert _{2,T})\nonumber \\ \left\Vert {H[\rho ](\cdot ,3)}\right\Vert _{2,T}&\le \left\Vert {\theta _{0}}\right\Vert _{\ell _{2}}+D(T) (\left\Vert {\hat{u}}\right\Vert _{2,T}+\left\Vert {\theta }\right\Vert _{2,T}) \end{aligned}$$
(2.15)

and for \(\rho =(u_{1},\hat{u}_{1},\theta _{1})\) and \(q=(u_{2},\hat{u}_{2},\theta _{2})\)

$$\begin{aligned} \left\Vert {H[\rho ](\cdot ,1)-H[q](\cdot ,1)}\right\Vert _{2,T}&\le D(T) \left\Vert {\theta _{1}-\theta _{2}}\right\Vert _{2,T}\nonumber \\ \left\Vert {H[\rho ](\cdot ,2)-H[q](\cdot ,2)}\right\Vert _{2,T}&\le D(T) \left\Vert {u_{1}-u_{2}}\right\Vert _{2,T}\nonumber \\ \left\Vert {H[\rho ](\cdot ,3)-H[q](\cdot ,3)}\right\Vert _{2,T}&\le D(T) \left( \left\Vert {\hat{u}_{1}-\hat{u}_{2}}\right\Vert _{2,T} +\left\Vert {\theta _{1}-\theta _{2}}\right\Vert _{2,T} (1+\left\Vert {\hat{u}_{1}}\right\Vert _{2,T})\right) \end{aligned}$$
(2.16)

The linear growth bounds of the operator H as given by (2.15), together with the local Lipschitz continuity property demonstrated in (2.16), allow us to conclude via the classical Picard–Lindelöf theorem for Banach-space-valued ODEs that for every \(\rho _{0}\in X^{2}\) there exists a unique local mild solution \(\rho \in {\mathcal {C}}([0, T_0], X^{2})\) of (2.13) for some sufficiently small \(0<T_0<\infty \).

Next we want to show that this solution can be extended globally. To do so it is enough to establish a global bound on the \(X^{2}\) norm of the solution. Indeed, the analytic contraction semigroups \(\{S(t)\}_{t\ge 0}\) and \(\{S^{\dagger }(t)\}_{t\ge 0}\) guarantee that there is a constant \(K<\infty \) (independent of t) such that

$$\begin{aligned} \left\Vert {u(t)}\right\Vert _{L^{2}}\le & {} \left\Vert {S(t) u(0)}\right\Vert _{L^{2}} +\int _0^t \left\Vert {S(t-s) f(\theta (s)) }\right\Vert _{L^{2}} ds \nonumber \\\le & {} \left\Vert {u(0)}\right\Vert _{L^{2}} + C \int _0^t e^{- \lambda (t -s )} ds \nonumber \\\le & {} K. \end{aligned}$$
(2.17)

Analogously, for a potentially different constant \(K<\infty \) and using estimate (2.17)

$$\begin{aligned} \left\Vert {\hat{u}(t)}\right\Vert _{L^{2}}\le & {} \left\Vert {S^{\dagger }(t) \hat{u}(0)}\right\Vert _{L^{2}} + \int _0^t \left\Vert {S^{\dagger }(t-s) (u(s)-h) }\right\Vert _{L^{2}} ds \nonumber \\\le & {} \left\Vert {\hat{u}(0)}\right\Vert _{L^{2}} + \int _0^t e^{- \lambda (t -s )}(\left\Vert {u(s)}\right\Vert _{L^{2}} +\left\Vert {h}\right\Vert _{L^{2}}) ds \nonumber \\\le & {} K. \end{aligned}$$
(2.18)

Let us next show that \(\theta (t)\) is uniformly bounded in time. Define the quantity \(Q(t) = \left\Vert {(\nabla _{\theta } f (\theta (t)), \hat{u}(t) ) }\right\Vert ^{2}_2\), and notice that due to estimate (2.18) and the bound on \(\nabla _{\theta } f(\theta )\) by Assumption 2.6, we obtain that \(\sup _{t \ge 0} Q(t)<C\) for some appropriate constant \(C<\infty \).

By direct calculation, keeping in mind the equation that \(\theta (t)\) satisfies and the Hölder inequality, we obtain

$$\begin{aligned} \frac{d}{dt} \left\Vert {\theta (t)}\right\Vert ^{2}_2&=-2\gamma \alpha (t) \left\Vert {\theta (t)}\right\Vert ^{2}_2 -2\alpha (t) \theta (t)\cdot (\nabla _{\theta } f ( \theta (t) ), \hat{u}(t))\\&\le -\gamma \alpha (t)\left\Vert {\theta (t)}\right\Vert ^{2}_2 +\frac{4}{\gamma }\alpha (t) Q(t). \end{aligned}$$

By the comparison principle we then have that

$$\begin{aligned} \left\Vert {\theta (t)}\right\Vert ^{2}_2&\le e^{-\gamma \int _{0}^{t}\alpha (s)ds} \left\Vert {\theta (0)}\right\Vert ^{2}_2 +\frac{4}{\gamma } \int _{0}^{t}\alpha (s) e^{-\gamma \int _{s}^{t}\alpha (r)dr}Q(s)ds\nonumber \\&\le e^{-\gamma \int _{0}^{t}\alpha (s)ds}\left\Vert {\theta (0)}\right\Vert ^{2}_2 +C \int _{0}^{t}\alpha (s)e^{-\gamma \int _{s}^{t}\alpha (r)dr}ds \end{aligned}$$
(2.19)

and the result follows since, by Assumption 2.5, there exists \(C<\infty \) such that

$$\begin{aligned} \sup _{t\ge 0} e^{-\gamma \int _{0}^{t}\alpha (s)ds} +\sup _{t\ge 0} \int _{0}^{t}\alpha (s) e^{-\gamma \int _{s}^{t}\alpha (r)dr}ds\le C \end{aligned}$$

All in all we obtain the a-priori global estimate

$$\begin{aligned} \left\Vert {\rho }\right\Vert _{2,\infty } \le K \end{aligned}$$

for some finite constant \(K<\infty \). With this a-priori bound the solution can be extended indefinitely in time, i.e., there exists a unique global mild solution \(\rho \in {\mathcal {C}}([0, \infty ), X^{2})\).

Essentially the same argument as above shows that if the initial data are in \(L^{q}\) and \(h\in L^{q}\), for \(q>2\), then we will have that there is a unique global mild solution \(\rho \in {\mathcal {C}}([0, \infty ), L^{q}(\Theta ,dn))\).

Let us now discuss regularity. We will prove that for initial data and h in \(L^{q}\), we actually have that \(u,\hat{u}\in {\mathcal {C}}([0, \infty ), L^{p})\) for any \(p\in [q,\infty )\). We will use a bootstrap argument. Due to the Sobolev embedding theorem and the Riesz–Thorin interpolation theorem, the following \(L^{q}\rightarrow L^{p}\) estimate for the semigroup S(t) holds

$$\begin{aligned} \left\Vert {S(t)g}\right\Vert _{L^{p}(U)} \le C (t\wedge 1)^{-\frac{n}{2} \left( \frac{1}{q}-\frac{1}{p}\right) }\left\Vert {g}\right\Vert _{L^{q}(U)} \end{aligned}$$
(2.20)

where n is the spatial dimension, \(p\ge q\) and g is a test function. Then, let us consider \(p>q\) such that \(\frac{1}{n}=\frac{1}{q}-\frac{1}{p}\) and assume that we know \(\left\Vert {u}\right\Vert _{L^{q}}<\infty \). Consider an initial time \(t=\epsilon \) for some fixed \(\epsilon >0\); using (2.20) and the mild formulation of the solution, we have for u

$$\begin{aligned} \left\Vert {u(t+\epsilon )}\right\Vert _{L^{p}}&\le C\left[ t^{-\frac{1}{2}} \left\Vert {u(\epsilon )}\right\Vert _{L^{q}}+\int _{0}^{t}(t-s)^{-\frac{1}{2}} \left\Vert {u(s+\epsilon )}\right\Vert _{L^{q}}ds\right] \\&\le C\left[ t^{-\frac{1}{2}}\left\Vert {u(\epsilon )}\right\Vert _{L^{q}} +t^{\frac{1}{2}}\sup _{s\in [\epsilon ,\epsilon +t]} \left\Vert {u(s)}\right\Vert _{L^{q}}\right] . \end{aligned}$$

Next consider the solution u starting at time \(t=2\epsilon \) with initial data \(u(2\epsilon )\in L^{p}\). Then, we will have that \(u\in {\mathcal {C}}([2\epsilon ,\infty ), L^{p})\). Notice now that \(\epsilon >0\) is arbitrary. Thus we obtain that \(u\in {\mathcal {C}}([0,\infty ),L^{p})\) for \(p\ge q\) such that \(\frac{1}{n}=\frac{1}{q}-\frac{1}{p}\). Using this argument inductively, first with \(q=2\) and \(p>2\) such that \(\frac{1}{n}=\frac{1}{2}-\frac{1}{p}\) we then get that \(u\in {\mathcal {C}}([0,\infty ), L^{p})\) for any \(p\in [2,\infty )\).

Following exactly the same process and using that \(u\in {\mathcal {C}}([0,\infty ), L^{p})\) for any \(p\in [2,\infty )\) we then obtain that if \(h\in L^{p}\) for all \(p\ge 2\), then \(\hat{u}\in {\mathcal {C}}([0,\infty ), L^{p})\) for any \(p\in [2,\infty )\) as well.

Next, we notice that the forcing term \({\mathcal {R}}\) in (2.13) is in \(L^{p}(U)\), so by the parabolic estimates in Sect. IV.3 of [22], we get that \(u,\hat{u}\in {\mathcal {C}}((0,\infty ); W^{2,p}_{0}(U))\cap {\mathcal {C}}^{1}((0,\infty );L^{p}(U))\) and if the initial data \(u_{0},\hat{u}_{0}\in W^{2,p}(U)\), then \(u,\hat{u}\in {\mathcal {C}}([0,\infty ); W^{2,p}_{0}(U))\cap {\mathcal {C}}^{1}([0,\infty );L^{p}(U))\) for any \(p\ge 2\). This concludes the proof of the theorem. \(\square \)

3 Convergence to a Stationary Point

The main result of this section is the convergence result for \(\theta (t)\). It says that as \(t\rightarrow \infty \), \(\theta (t)\) converges to a stationary point of the cost function \(J(\theta )\).

Theorem 3.1

Assume Assumptions 2.1, 2.5, 2.6 and 2.7. Then, we have that

$$\begin{aligned} \lim _{t \rightarrow \infty } \left\Vert {\nabla J( \theta (t))}\right\Vert _2 = 0. \end{aligned}$$

The proof of Theorem 3.1 is a consequence of a series of lemmas. In Sect. 3.1 we establish necessary decay rates for the solution to (1.5). These results are then used in Sect. 3.2 to characterize the behavior of \(\theta (t)\) for large times and eventually prove Theorem 3.1.

3.1 Decay Rates for the Online Adjoint Algorithm (1.5)

In this subsection, we establish some necessary decay rates for the online adjoint algorithm (1.5).

First, Lemma 3.2 provides a critical bound on \(\left\Vert {\frac{\partial u^{*}}{\partial t } }\right\Vert _{H}\) and on \(\left\Vert {\frac{\partial \hat{u}^{*}}{\partial t } }\right\Vert _{H}\).

Lemma 3.2

Under Assumptions 2.1 and 2.6, there exists a constant \(C<\infty \) such that

$$\begin{aligned} \left\Vert {\frac{\partial u^{*}}{\partial t } }\right\Vert _{H} +\left\Vert {\frac{\partial \hat{u}^{*}}{\partial t}}\right\Vert _{H}< C \alpha (t) \end{aligned}$$

and consequently \(\displaystyle \lim _{t \rightarrow \infty } \left\Vert { \frac{\partial u^{*}}{\partial t } }\right\Vert _{H}= \displaystyle \lim _{t \rightarrow \infty } \left\Vert { \frac{\partial \hat{u}^{*}}{\partial t }}\right\Vert _{H}=0\).

Proof

First, we study \(\left\Vert { \frac{\partial u^{*}}{\partial t } }\right\Vert _{H}\). We see that \(\frac{\partial u^{*}}{\partial t}(t,x)\) satisfies the PDE

$$\begin{aligned}&A\frac{\partial u^{*}}{\partial t }(t,x) = \nabla _{\theta } f(x, \theta (t) )^{\top } \frac{d \theta }{dt}\nonumber \\&\qquad \qquad \qquad = -\alpha (t) \nabla _{\theta } f(x, \theta (t) )^{\top } \bigg (\left( \nabla _{\theta } f (\theta (t) ), \hat{u}(t) \right) +\gamma \theta (t) \bigg ),\quad x\in U, t>0 \nonumber \\&\qquad \qquad \frac{\partial u^{*}}{\partial t }=0, \quad x\in \partial U, t>0 \end{aligned}$$
(3.1)

By the coercivity of A, guaranteed by Assumption 2.1, we subsequently obtain

$$\begin{aligned} \left\Vert {\frac{\partial u^{*}}{\partial t }}\right\Vert ^{2}_{H}&\le \frac{1}{\lambda } \left( A \frac{\partial u^{*}}{\partial t}, \frac{\partial u^{*}}{\partial t } \right) \\&\le \frac{1}{\lambda } \alpha (t) \left| \left( \nabla _{\theta } f( \theta (t))^{\top } \bigg ( \left( \nabla _{\theta } f (\theta (t) ), \hat{u}(t) \right) + \gamma \theta (t) \bigg ), \frac{\partial u^{*}}{\partial t } \right) \right| \\&\le \frac{1}{\lambda } \alpha (t) \left\Vert {\nabla _{\theta } f( \theta (t) )^{\top } \bigg ( \left( \nabla _{\theta } f (\theta (t) ), \hat{u}(t) \right) + \gamma \theta (t) \bigg )}\right\Vert _{H} \left\Vert { \frac{\partial u^{*}}{\partial t } }\right\Vert _{H} \end{aligned}$$

and the result follows directly by Assumption 2.6 on \(\nabla _{\theta }f\) and estimate (2.12).

Let us now turn our attention to \(\left\Vert { \frac{\partial \hat{u}^{*}}{\partial t } }\right\Vert _{H}\). By differentiation, we obtain that \(\frac{\partial \hat{u}^{*}}{\partial t }(t,x) \) satisfies the PDE

$$\begin{aligned} A^{\dagger }\frac{\partial \hat{u}^{*}}{\partial t }(t,x)&= \frac{\partial u^{*}}{\partial t }(t,x),\quad x\in U, t>0 \nonumber \\ \frac{\partial \hat{u}^{*}}{\partial t }&=0, \quad x\in \partial U, t>0 \end{aligned}$$
(3.2)

The result then follows by the coercivity condition on \(A^{\dagger }\) by Assumption 2.1 and due to the fact that \(\left\Vert {\frac{\partial u^{*}}{\partial t } }\right\Vert _{H}\le C \alpha (t)\). This completes the proof of the lemma. \(\square \)

Let us consider the difference

$$\begin{aligned} \phi (t,x) = u(t,x) - u^{*}(t,x). \end{aligned}$$
(3.3)

\(\phi (t)\) satisfies the PDE

$$\begin{aligned} \frac{\partial \phi }{\partial t}(t,x)&= - A \phi (t,x) -\frac{\partial u^{*}}{\partial t}(t,x),\quad x\in U, t>0\nonumber \\ \phi (t,x)&=0, \quad x\in \partial U, t>0\nonumber \\ \phi (0,x)&= u_{0}(x)-u^{*}(0,x), \quad x\in U \end{aligned}$$
(3.4)

Lemma 3.3

Under Assumptions 2.1, 2.5 and 2.6 we have that

$$\begin{aligned} \lim _{t \rightarrow \infty } \left\Vert { \phi (t) }\right\Vert _{H}&= 0, \end{aligned}$$
(3.5)

and there is some finite \(T^{*}<\infty \) such that for all \(t\ge T^{*}\)

$$\begin{aligned} \left\Vert { \phi (t) }\right\Vert _{H}\le C\left( e^{-\lambda t}+ \alpha (t)\right) , \end{aligned}$$
(3.6)

where \(C<\infty \) is an unimportant constant.

Proof

We begin by proving that \(\frac{\partial u^{*}}{\partial t}(t)\) is globally Lipschitz in time. Differentiating twice with respect to t the elliptic PDE that \(u^{*}\) satisfies yields, for \(x\in U\) and \(t\ge 0\),

$$\begin{aligned} A\frac{\partial ^2 u^{*}}{\partial t^2 }= & {} -\alpha '(t) \nabla _{\theta } f( \theta (t) )^{\top } \bigg (\left( \nabla _{\theta } f ( \theta (t) ), \hat{u}(t) \right) + \gamma \theta (t) \bigg ) \\&- \alpha (t) \frac{\partial }{\partial t} \bigg [ \nabla _{\theta } f( \theta (t) )^{\top } \bigg ( \left( \nabla _{\theta } f ( \theta (t) ), \hat{u}(t) \right) + \gamma \theta (t) \bigg ) \bigg ]. \end{aligned}$$

Therefore, using the bounds on \(f(\theta ), \nabla f(\theta ), \theta (t), u(t),\) and \(\hat{u}(t)\), we can show, using the coercivity Assumption 2.1 and the Cauchy-Schwarz inequality as in Lemma 3.2, that \(\displaystyle \sup _{t<\infty } \left\Vert {\frac{\partial ^2 u^{*}}{\partial t^2 }}\right\Vert _{H}\le C\). We provide the necessary calculations below for completeness.

$$\begin{aligned} \left\Vert { \frac{\partial ^2 u^{*}}{\partial t^2 } }\right\Vert ^2_H\le & {} \frac{1}{\lambda } ( A\frac{\partial ^2 u^{*}}{\partial t^2 } , \frac{\partial ^2 u^{*}}{\partial t^2 } ) \le \frac{1}{\lambda } \left\Vert { A\frac{\partial ^2 u^{*}}{\partial t^2 } }\right\Vert _H \left\Vert { \frac{\partial ^2 u^{*}}{\partial t^2 } }\right\Vert _H \\\le & {} \frac{K}{\lambda } \left\Vert { \frac{\partial ^2 u^{*}}{\partial t^2 } }\right\Vert _H, \end{aligned}$$

where K is a constant. Re-arranging yields, for \(t \ge 0\),

$$\begin{aligned} \left\Vert {\frac{\partial ^2 u^{*}}{\partial t^2 } }\right\Vert _H \le C, \end{aligned}$$

where C is a constant. This then gives

$$\begin{aligned} \left\Vert {\frac{\partial u^{*}}{\partial t }(t) -\frac{\partial u^{*}}{\partial t }(s) }\right\Vert _{H} = \left\Vert { \int _s^t \frac{\partial ^2 u^{*}(\rho )}{\partial \rho ^2 } d\rho }\right\Vert _{H} \le \int _s^t \left\Vert { \frac{\partial ^2 u^{*} (\rho )}{\partial \rho ^2 } }\right\Vert _H d\rho \le C |t - s |. \end{aligned}$$
(3.7)

Therefore, we can write

$$\begin{aligned} \phi (t) = S(t) \phi (0) - \int _0^t S(t-\tau ) \frac{\partial u^{*}}{\partial \tau } d \tau , \end{aligned}$$

where \(\left\Vert {S(t)}\right\Vert _{H} \le e^{- \lambda t }\) with \(\lambda > 0\) by Assumption 2.1.

Due to Lemma 3.2, for any \(\epsilon > 0\), there exists an s such that \(\left\Vert { \frac{\partial u^{*}}{\partial \tau } }\right\Vert _H < \epsilon \) for \(\tau > s\). By the triangle inequality,

$$\begin{aligned} \left\Vert {\phi (t)}\right\Vert _H\le & {} \left\Vert {S(t)\phi (0)}\right\Vert _H +\int _0^t \left\Vert { S(t-\tau ) \frac{\partial u^{*}}{\partial \tau }}\right\Vert _H d \tau \\\le & {} e^{- \lambda t } \left\Vert { \phi (0)}\right\Vert _H +C_2 \int _0^t e^{- \lambda (t - \tau )} \alpha (\tau ) d \tau . \end{aligned}$$

Let us now define \(I(t)=\int _0^t e^{- \lambda (t - \tau ) } \alpha (\tau ) d \tau \). Next, it is easy to show that \(\lim _{t\rightarrow \infty }I(t)=0\) and in particular that the integral term goes to zero at the rate of \(\alpha (t)\) in the sense that \(\lim _{t\rightarrow \infty }\frac{I(t)}{\alpha (t)} =\frac{1}{\lambda }\). These observations imply that there is a finite \(T^{*}<\infty \) such that for all \(t\ge T^{*}\), we have that \(I(t)\le C \alpha (t)\) for some constant \(C<\infty \).
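
The claim \(\lim _{t\rightarrow \infty }\frac{I(t)}{\alpha (t)}=\frac{1}{\lambda }\) can be verified, for instance, with L'Hôpital's rule: noting that \(e^{\lambda t}\alpha (t)\rightarrow \infty \) (since \(\alpha '(t)/\alpha (t)\rightarrow 0\) by Assumption 2.5), we have

$$\begin{aligned} \lim _{t\rightarrow \infty }\frac{I(t)}{\alpha (t)} =\lim _{t\rightarrow \infty }\frac{\int _{0}^{t}e^{\lambda \tau }\alpha (\tau )d\tau }{e^{\lambda t}\alpha (t)} =\lim _{t\rightarrow \infty }\frac{e^{\lambda t}\alpha (t)}{e^{\lambda t}\left( \lambda \alpha (t)+\alpha '(t)\right) } =\lim _{t\rightarrow \infty }\frac{1}{\lambda +\frac{\alpha '(t)}{\alpha (t)}} =\frac{1}{\lambda }. \end{aligned}$$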

Hence we indeed get that both (3.5) and (3.6) hold, concluding the proof of the lemma. \(\square \)

Define \(\Psi (t,x) = \hat{u} (t,x) - \hat{u}^{*}(t,x)\). A similar lemma can also be proven for the limit of \(\Psi (t)\).

Lemma 3.4

Under Assumptions 2.1, 2.5 and 2.6 we have that

$$\begin{aligned} \lim _{t \rightarrow \infty } \left\Vert { \Psi (t) }\right\Vert _{H} = 0, \end{aligned}$$
(3.8)

and there is some finite \(T^{*}<\infty \) such that for all \(t\ge T^{*}\)

$$\begin{aligned} \left\Vert { \Psi (t) }\right\Vert _{H}&\le C\left( e^{-\lambda t}t+ \alpha (t)\right) , \end{aligned}$$
(3.9)

where \(C<\infty \) is an unimportant constant.

Proof

\(\Psi (t)\) satisfies the PDE

$$\begin{aligned} \frac{\partial \Psi }{\partial t}(t,x) = - A^{\dagger } \Psi (t,x) +\phi (t,x) - \frac{\partial \hat{u}^{*}}{\partial t}(t,x). \end{aligned}$$
(3.10)

Exactly as was done in Lemma 3.3, we can show that \(t\mapsto \frac{\partial \hat{u}^{*}}{\partial t}(t)\) is globally Lipschitz. Together with Assumption 2.1 we write

$$\begin{aligned} \Psi (t) = S^{\dagger }(t) \Psi (0) + \int _0^t S^{\dagger }(t-\tau ) \left( \phi (\tau )-\frac{\partial \hat{u}^{*}}{\partial t}(\tau ) \right) d \tau , \end{aligned}$$

where \(S^{\dagger }(t)\) is the analytic contraction semigroup generated by \((-A^{\dagger })\) satisfying \(\left\Vert {S^{\dagger }(t)}\right\Vert _{H} \le e^{- \lambda t }\) with \(\lambda > 0\). Using the same reasoning as in Lemma 3.3, and the fact that \(\lim _{t\rightarrow \infty }\left( \left\Vert {\phi (t)}\right\Vert _{H}+\left\Vert {\frac{\partial \hat{u}^{*}}{\partial t}(t)}\right\Vert _{H}\right) =0\), we can prove (3.8). The decay rates for \(\left\Vert {\phi (t)}\right\Vert _{H}\) and \(\left\Vert {\frac{\partial \hat{u}^{*}}{\partial t}(t)}\right\Vert _{H}\) from Lemmas 3.3 and 3.2, respectively, prove (3.9) in the same way that (3.6) was proven. This concludes the proof of the lemma. \(\square \)

3.2 Proof of Theorem 3.1

Let us now return to the equation for \(\theta (t)\).

$$\begin{aligned} \frac{d \theta }{dt}= & {} - \alpha (t) \bigg ( \left( \nabla _{\theta } f ( \theta (t) ), \hat{u}(t) \right) + \gamma \theta (t) \bigg ) \nonumber \\= & {} - \alpha (t) \nabla _{\theta } J(\theta (t) ) - \alpha (t) \bigg (\left( \nabla _{\theta } f ( \theta (t) ), \hat{u}(t) \right) -\left( \nabla _{\theta } f ( \theta (t) ), \hat{u}^{*}(t) \right) \bigg ) \nonumber \\= & {} - \alpha (t) \nabla _{\theta } J(\theta (t) ) - \alpha (t) \left( \nabla _{\theta } f ( \theta (t) ), \Psi (t) \right) . \end{aligned}$$
(3.11)

The second term on the RHS will converge to zero as \(t \rightarrow \infty \) due to (3.8). This means that asymptotically \(\theta \) will be updated in the direction of steepest descent. Theorem 3.1 rigorously proves this along with proving convergence of \(\theta (t)\) to a critical point of \(J(\theta )\).

The structure of the proof proceeds in a spirit similar to [29] with certain differences that will be highlighted below as needed. For completeness, we present the whole argument with the proper adjustments.

Let \(\epsilon >0\) be given and let \(\mu =\mu (\epsilon )>0\) be chosen later on. We define the following cycle of times

$$\begin{aligned} 0=\sigma _0 \le \tau _1 \le \sigma _1 \le \tau _2 \le \sigma _2 \le \cdots \end{aligned}$$

where for \(k=1,2,\ldots \)

$$\begin{aligned} \tau _{k}&=\inf \left\{ t>\sigma _{k-1}: \Vert \nabla J(\theta (t))\Vert \ge \epsilon \right\} ,\nonumber \\ \sigma _{k}&=\sup \left\{ t>\tau _{k}: \frac{\Vert \nabla J(\theta (\tau _{k}))\Vert }{2}\le \Vert \nabla J(\theta (s))\Vert \le 2\Vert \nabla J(\theta (\tau _{k}))\Vert \text { for all }s\in [\tau _{k},t] \right. \nonumber \\&\qquad \qquad \left. \text { and } \int _{\tau _{k}}^{t}\alpha (s)ds\le \mu \right\} . \end{aligned}$$

Essentially, the sequence of times \(\{\sigma _k\}_{k\in {\mathbb {N}}}\) and \(\{\tau _k\}_{k\in {\mathbb {N}}}\) keep track of the times in which \(\left\Vert {\nabla J(\theta (t))}\right\Vert \) is within a ball of radius \(\epsilon \) and away from it.

Next, let us define the corresponding intervals of time \(J_{k}=[\sigma _{k-1},\tau _{k})\) and \(I_{k}=[\tau _{k},\sigma _{k})\). Clearly, when \(t\in J_{k}\), we have that \( \Vert \nabla J(\theta (t))\Vert <\epsilon \).

Let us now go back to (3.11) and define the integral term

$$\begin{aligned} \Delta _{s,t}=\int _{s}^{t}\alpha (\rho ) \left( \nabla _{\theta } f (\theta (\rho ) ), \Psi (\rho ) \right) d\rho . \end{aligned}$$
(3.12)

We first show that \(\Delta _{\tau _{k},\sigma _{k}}\) decays to zero. In particular, we have the following lemma.

Lemma 3.5

Assume that Assumptions 2.1, 2.5 and 2.6 hold. Let us fix some \(\eta >0\). Then, we have that \(\lim _{k\rightarrow \infty } \left\Vert {\Delta _{\tau _k,\sigma _{k}+\eta }}\right\Vert _{2}=0\).

Proof of Lemma 3.5

We notice that for some constant \(C<\infty \) that may change from line to line

$$\begin{aligned} \sup _{t>0}\left\Vert {\Delta _{0,t}}\right\Vert _{2}&\le \int _{0}^{\infty } \alpha (\rho ) \left\Vert { \left( \nabla _{\theta } f ( \theta (\rho )), \Psi (\rho ) \right) }\right\Vert _{2}d\rho \\&\le \int _{0}^{\infty }\alpha (\rho ) \left\Vert { \nabla _{\theta } f (\theta (\rho ) )}\right\Vert _{H} \left\Vert {\Psi (\rho ) }\right\Vert _{H}d\rho \\&\le C \int _{0}^{\infty }\alpha (\rho ) \left\Vert {\Psi (\rho )}\right\Vert _{H}d\rho \\&\le C \int _{0}^{\infty }\left[ \alpha ^{2}(\rho ) +\alpha (\rho )\rho e^{-\lambda \rho }\right] d\rho \\&\le C<\infty , \end{aligned}$$

where the boundedness of \(\left\Vert { \nabla _{\theta } f ( \theta )}\right\Vert _{H}\) together with the decay rates from Lemma 3.4 were used. In particular, the integrand is integrable on \([0,\infty )\), so that \(\left\Vert {\Delta _{\tau _k,\sigma _{k}+\eta }}\right\Vert _{2}\le \int _{\tau _{k}}^{\infty }\alpha (\rho ) \left\Vert { \left( \nabla _{\theta } f ( \theta (\rho )), \Psi (\rho ) \right) }\right\Vert _{2}d\rho \rightarrow 0\) as \(k\rightarrow \infty \). This proves that \(\lim _{k\rightarrow \infty } \left\Vert {\Delta _{\tau _k,\sigma _{k}+\eta }}\right\Vert _{2}=0\), concluding the proof of the lemma. \(\square \)

Lemma 3.6

Assume that Assumptions 2.1, 2.5 and 2.6 hold. Denote by \(L_{\nabla J}\) the Lipschitz constant of \(\nabla J\). For given \(\epsilon >0\), let \(\mu \) be such that \(3\mu +\frac{\mu }{8\epsilon }=\frac{1}{2 L_{\nabla J}}\). Then, for k large enough and for \(\eta >0\) small enough (potentially depending on k), one has \(\int _{\tau _{k}}^{\sigma _{k}+\eta }\alpha (s)ds>\mu \). In addition, we also have \(\frac{\mu }{2}\le \int _{\tau _{k}}^{\sigma _{k}} \alpha (s)ds\le \mu \).

Proof of Lemma 3.6

The proof proceeds by contradiction. Let us assume that \(\int _{\tau _{k}}^{\sigma _{k}+\eta }\alpha (s)ds\le \mu \) and let \(\delta >0\) be such that \(\delta <\mu /8\). In addition, without loss of generality, we can assume that for the given k, \(\eta \) is so small such that for any \(s\in [\tau _{k},\sigma _{k}+\eta ]\) one has \(\left\Vert {\nabla J(\theta (s))}\right\Vert _{2}\le 3\left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2}\).

Then, invoking (3.11) we have

$$\begin{aligned} \left\Vert {\theta (\sigma _{k}+\eta )-\theta (\tau _{k})}\right\Vert _{2}&\le \int _{\tau _{k}}^{\sigma _{k}+\eta } \alpha (t) \left\Vert {\nabla _{\theta } J(\theta (t) )}\right\Vert _{2}dt +\left\Vert {\Delta _{\tau _k,\sigma _{k}+\eta }}\right\Vert _{2}\\&\le 3\left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2} \mu +\left\Vert {\Delta _{\tau _k,\sigma _{k}+\eta }}\right\Vert _{2}. \end{aligned}$$

By Lemma 3.5 we have that for k large enough, \(\left\Vert {\Delta _{\tau _k,\sigma _{k}+\eta }}\right\Vert _{2}\le \delta <\mu /8\). In addition, we also have by definition that \(\frac{\epsilon }{\left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2}}\le 1\). The combination of these two results gives

$$\begin{aligned} \left\Vert {\theta (\sigma _{k}+\eta )-\theta (\tau _{k})}\right\Vert \le \left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2} \left( 3 \mu +\frac{\mu }{8\epsilon }\right) \le \left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2} \frac{1}{2 L_{\nabla J}}. \end{aligned}$$

This means that

$$\begin{aligned} \left\Vert {\nabla J(\theta (\sigma _{k}+\eta ))-\nabla J( \theta (\tau _{k}))}\right\Vert _{2} \le L_{\nabla J} \left\Vert {\theta (\sigma _{k}+\eta )-\theta (\tau _{k})}\right\Vert _{2} \le \frac{1}{2} \left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2}. \end{aligned}$$

Then, this would yield

$$\begin{aligned} \frac{1}{2} \left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2} \le \left\Vert {\nabla J(\theta (\sigma _{k}+\eta ))}\right\Vert _{2} \le 2 \left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2}. \end{aligned}$$

However, this leads to a contradiction: together with the assumption \(\int _{\tau _{k}}^{\sigma _{k}+\eta }\alpha (s)ds\le \mu \), the last display would imply that \(\sigma _{k}+\eta \) satisfies the conditions in the definition of \(\sigma _{k}\), i.e., that \(\sigma _{k}+\eta \in [\tau _{k},\sigma _{k}]\), which cannot happen because \(\eta >0\). Hence \(\int _{\tau _{k}}^{\sigma _{k}+\eta }\alpha (s)ds>\mu \). This concludes the proof of the first part of the lemma. The proof of the second part of the lemma goes as follows. By its own definition, we have that \(\int _{\tau _{k}}^{\sigma _{k}} \alpha (s)ds\le \mu \). Next, we show that \(\int _{\tau _{k}}^{\sigma _{k}} \alpha (s)ds\ge \mu /2\). We have shown that \(\int _{\tau _{k}}^{\sigma _{k}+\eta }\alpha (s)ds> \mu \). For k large enough and \(\eta \) small enough we can ensure that \(\int _{\sigma _{k}}^{\sigma _{k}+\eta }\alpha (s)ds\le \mu /2\). The conclusion then follows. This concludes the proof of the lemma. \(\square \)

Lemma 3.7

Assume that Assumptions 2.1, 2.5 and 2.6 hold. Assume that there exists an infinite number of intervals \(I_{k}=[\tau _{k},\sigma _{k})\). Then, there is a fixed \(\zeta _{1}>0\) that depends on \(\epsilon \) such that for k large enough

$$\begin{aligned} J(\theta _{\sigma _{k}})-J(\theta _{\tau _{k}})\le -\zeta _{1}. \end{aligned}$$

Proof of Lemma 3.7

By chain rule we have that

$$\begin{aligned} J(\theta (\sigma _{k})) - J (\theta (\tau _{k}))&= -\int _{\tau _{k}}^{\sigma _{k}} \alpha (\rho ) \left\Vert { \nabla J (\theta (\rho )) }\right\Vert _2^{2} d\rho \nonumber \\&\quad -\int _{\tau _{k}}^{\sigma _{k}} \alpha (\rho ) \nabla J (\theta (\rho ))\cdot \left( \nabla _{\theta } f (\theta (\rho ) ), \Psi (\rho ) \right) d\rho \nonumber \\&=M_{1,k}+M_{2,k}. \end{aligned}$$
(3.13)

Let us first consider \(M_{1,k}=- \int _{\tau _{k}}^{\sigma _{k}} \alpha (\rho ) \left\Vert { \nabla J (\theta (\rho )) }\right\Vert _2^{2} d\rho \). For \(\rho \in [\tau _k,\sigma _k]\) we have that \( \frac{\left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2}}{2}\le \left\Vert {\nabla J(\theta (\rho ))}\right\Vert _{2}\le 2\left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2} \). Thus, for sufficiently large k, we have by Lemma 3.6

$$\begin{aligned} M_{1,k} \le - \frac{ \left\Vert {\nabla J (\theta (\tau _k))}\right\Vert _{2}^{2}}{4} \int _{\tau _{k}}^{\sigma _{k}} \alpha (\rho ) d\rho \le -\frac{ \left\Vert {\nabla J (\theta (\tau _k))}\right\Vert _{2}^2 }{8} \mu . \end{aligned}$$

Next, we address \(M_{2,k}=-\int _{\tau _{k}}^{\sigma _{k}} \alpha (\rho ) \nabla J (\theta (\rho ))\cdot \left( \nabla _{\theta } f (\theta (\rho )), \Psi (\rho ) \right) d\rho \). Let us define

$$\begin{aligned} \hat{M}_{s,t}=\int _{s}^{t} \alpha (\rho ) \nabla J (\theta (\rho )) \cdot \left( \nabla _{\theta } f ( \theta (\rho )), \Psi (\rho )\right) d\rho . \end{aligned}$$

Clearly, we have that \(M_{2,k}=-\hat{M}_{\tau _{k},\sigma _{k}}\).

We claim that \(\sup _{\rho \ge 0} \left\Vert {\nabla J (\theta (\rho ))}\right\Vert _{2}<\infty \). For this purpose, we shall use the representation of \(\nabla J(\theta )\) by (2.10) together with the a-priori H norm bounds for \(\hat{u}^{*}\), \(\nabla _{\theta } f(x,\theta )\) as well as the uniform bound on \(\sup _{\rho \ge 0}\left\Vert {\theta (\rho )}\right\Vert \) by (2.12). These imply that indeed \(\sup _{\rho \ge 0} \left\Vert {\nabla J (\theta (\rho ))}\right\Vert _{2}<\infty \).

Then, we have for some constant \(C<\infty \) that may change from line to line

$$\begin{aligned} \sup _{t\ge 0} \left| \hat{M}_{0,t}\right| &\le C\int _{0}^{\infty } \alpha (\rho ) \left\Vert {\nabla J (\theta (\rho ))}\right\Vert _{2}\left\Vert {\nabla _{\theta } f(\theta (\rho ))}\right\Vert _{H}\left\Vert {\Psi (\rho )}\right\Vert _{H} d\rho \\&\le C\int _{0}^{\infty } \alpha (\rho ) \left\Vert {\Psi (\rho )}\right\Vert _{H} d\rho \\&\le C\int _{0}^{\infty } \left( \alpha ^{2}(\rho ) +\alpha (\rho )e^{-\lambda \rho }\rho \right) d\rho \\&\le C<\infty , \end{aligned}$$

where we used the decay rate bound from Lemma 3.4. In particular, the integrand is integrable on \([0,\infty )\), so the tail integrals \(\left| \hat{M}_{\tau _{k},\sigma _{k}}\right| \) vanish as \(k\rightarrow \infty \), and therefore \(M_{2,k}\rightarrow 0\).

Putting the above together, we get for k large enough such that \(|M_{2,k}|\le \delta <\frac{\mu }{16}\epsilon ^{2}\)

$$\begin{aligned} J(\theta (\sigma _{k})) - J (\theta (\tau _{k}))&\le -\frac{\left\Vert {\nabla J (\theta (\tau _k))}\right\Vert _{2}^2}{8}\mu +\delta \\&\le -\frac{\mu }{8}\epsilon ^{2}+ \frac{\mu }{16}\epsilon ^{2} =-\frac{\mu }{16}\epsilon ^{2}. \end{aligned}$$

Setting \(\zeta _1=\frac{\mu }{16}\epsilon ^{2}\) we conclude the proof of the lemma. \(\square \)

Lemma 3.8

Assume that Assumptions 2.1, 2.5 and 2.6 hold. Assume that there exists an infinite number of intervals \(I_{k}=[\tau _{k},\sigma _{k})\). Then, there is a fixed \(0<\zeta _{2}<\zeta _{1}\) such that for k large enough

$$\begin{aligned} J(\theta _{\tau _{k}})-J(\theta _{\sigma _{k-1}})\le \zeta _{2}. \end{aligned}$$

Proof of Lemma 3.8

Recall that \(\left\Vert {\nabla J(\theta (t))}\right\Vert _{2}\le \epsilon \) for \(t\in J_{k}=[\sigma _{k-1},\tau _{k}]\). By chain rule we have

$$\begin{aligned} J(\theta (\tau _{k})) - J (\theta (\sigma _{k-1}))&= -\int _{\sigma _{k-1}}^{\tau _{k}} \alpha (\rho ) \left\Vert { \nabla J (\theta (\rho )) }\right\Vert _2^{2} d\rho \nonumber \\&\quad +\int _{\sigma _{k-1}}^{\tau _{k}} \alpha (\rho ) \nabla J (\theta (\rho ))\cdot \left( \nabla _{\theta } f ( \theta (\rho ) ), \Psi (\rho ) \right) d\rho \\&\le \int _{\sigma _{k-1}}^{\tau _{k}} \alpha (\rho ) \nabla J (\theta (\rho ))\cdot \left( \nabla _{\theta } f (\theta (\rho ) ), \Psi (\rho ) \right) d\rho . \end{aligned}$$

Arguing as in the proof of Lemma 3.7, for k large enough the right-hand side of the last display can be made smaller than any prescribed \(\zeta _{2}\in (0,\zeta _{1})\). This concludes the proof of the lemma. \(\square \)

Now we can conclude the proof of Theorem 3.1.

Proof of Theorem 3.1

Fix an \(\epsilon >0\). If there are only finitely many \(\tau _{k}\), then there is a finite \(T^{*}\) such that \(\left\Vert {\nabla J(\theta (t))}\right\Vert _{2}<\epsilon \) for \(t\ge T^{*}\), which proves the theorem. Hence, it suffices to show that there can only be finitely many \(\tau _{k}\). Assume, for the sake of contradiction, that there are infinitely many instances of \(\tau _{k}\). By Lemmas 3.7 and 3.8 we have for sufficiently large k that

$$\begin{aligned} J(\theta (\sigma _{k}))-J(\theta (\tau _{k}))&\le -\zeta _{1},\\ J(\theta (\tau _{k}))-J(\theta (\sigma _{k-1}))&\le \zeta _{2}, \end{aligned}$$

with \(0<\zeta _2<\zeta _1\). Let N be large enough so that the above relations hold for all \(k\ge N\). Then we have

$$\begin{aligned} J(\theta (\tau _{n+1})) - J (\theta (\tau _{N}))= & {} \sum _{k = N}^n \bigg [ J(\theta (\sigma _{k})) - J (\theta (\tau _{k})) +J(\theta (\tau _{k+1})) - J (\theta (\sigma _{k}))\bigg ] \\\le & {} \sum _{k = N}^n ( - \zeta _1 + \zeta _2 )=-(n-N+1)(\zeta _{1}-\zeta _{2})<0 . \end{aligned}$$

Letting \(n \rightarrow \infty \), we get that \(J(\theta (\tau _{n+1})) \rightarrow - \infty \), which is a contradiction, since by definition \(J(\theta ) \ge 0\). Thus, there can only be finitely many \(\tau _k\). This concludes the proof of the theorem. \(\square \)

4 Convergence Rate in the Strongly Convex Case

We will now prove a convergence rate for \(\theta (t)\). First, we need to strengthen the assumptions on the objective function \(J(\cdot )\): namely, we now assume that \(J(\theta )\) is strongly convex in the sense of Assumption 4.1 below.

Assumption 4.1

We assume the following conditions:

  • \(J(\cdot ) \in C^2\).

  • There exists a unique global minimum \(\theta ^{*}\) where \(\nabla _{\theta } J(\theta ^{*}) = 0\).

  • \(H(\theta ) =\nabla _{\theta \theta } J(\theta )\) is globally Lipschitz and \(H(\theta ^{*})\) is positive definite. That is, there exists a constant \(q > 0\) such that, for all \(\xi \in {\mathbb {R}}^{d}\),

    $$\begin{aligned} \xi ^{\top } H (\theta ^{*})\xi \ge q \left\Vert {\xi }\right\Vert _{2}^2. \end{aligned}$$
    (4.1)
  • The learning rate satisfies \(\alpha (t) =C_{\alpha } a(t)\) where the learning rate magnitude \(C_{\alpha }\) is selected such that \(C_{\alpha } q > 1\). The learning rate function a(t) satisfies \(\displaystyle \lim _{t\rightarrow \infty } a(t)=0\) and

    $$\begin{aligned} \int _{0}^{\infty } a(s)ds&=\infty \text { and } \int _{0}^{\infty } a^{2}(s)ds<\infty .\nonumber \\ \sup _{t\ge 0} \int _{0}^{t} a(s)e^{-\gamma \int _{s}^{t}a(r)dr}ds&<\infty \text { and } \lim _{t\rightarrow \infty }\frac{a'(t)}{a(t)}=0. \end{aligned}$$

    An example is \(a(t) = \frac{1}{1 + t}\).
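For the example \(a(t) = \frac{1}{1 + t}\) these conditions can be checked directly: \(\int _{0}^{\infty }\frac{ds}{1+s}=\infty \), \(\int _{0}^{\infty }\frac{ds}{(1+s)^{2}}=1<\infty \), \(\frac{a'(t)}{a(t)}=-\frac{1}{1+t}\rightarrow 0\), and, for \(\gamma >0\),

$$\begin{aligned} \int _{0}^{t} \frac{1}{1+s} e^{-\gamma \int _{s}^{t}\frac{dr}{1+r}}ds =\frac{1}{(1+t)^{\gamma }}\int _{0}^{t}(1+s)^{\gamma -1}ds =\frac{1-(1+t)^{-\gamma }}{\gamma }\le \frac{1}{\gamma }, \end{aligned}$$

which is bounded uniformly in t.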

In this section we will assume without loss of generality that the learning rate is \(\alpha (t) = \frac{1}{1+t}\). Even though this specific choice of the learning rate is not necessary for the results to hold, it will simplify the derivation of the convergence rate. The main result of this section is Theorem 4.2.

Theorem 4.2

Let us assume that \(\alpha (t)=1/(1+t)\). In addition, assume that Assumptions 2.1, 2.6 and 4.1 hold. Then there exist a time \(0<t_{0}<\infty \) and a finite constant \(C<\infty \) such that for all \(t\ge t_{0}\)

$$\begin{aligned} \left\Vert { \theta (t) - \theta ^{*} }\right\Vert _{2} \le C t^{-1/2}. \end{aligned}$$
(4.2)

The proof of this theorem will be a consequence of a series of lemmas. Let us recall the function \(\phi (t,x) = u(t,x) -u^{*}(t,x)\) satisfying (3.4)

$$\begin{aligned} \frac{\partial \phi }{\partial t}(t,x) = - A \phi (t,x) -\frac{\partial u^{*}}{\partial t}(t,x) \end{aligned}$$

where, by Lemma 3.2, \(t\mapsto \frac{\partial u^{*}}{\partial t}\) is globally Lipschitz and \(\left\Vert {\frac{\partial u^{*}}{\partial t}}\right\Vert _{H}\le \frac{C}{1+t}< \frac{C}{t}\).
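Heuristically, this decay rate is inherited from the learning rate: differentiating the state equation \(A u^{*}(t,x) = f(x,\theta (t))\) formally in time gives

$$\begin{aligned} A \frac{\partial u^{*}}{\partial t}(t,x) = \nabla _{\theta } f(x,\theta (t))\cdot \frac{d \theta }{dt}(t), \end{aligned}$$

and, since the update equation together with the a-priori bounds implies \(\left\Vert {\frac{d \theta }{dt}(t)}\right\Vert _{2}\le C\alpha (t)\), elliptic estimates yield \(\left\Vert {\frac{\partial u^{*}}{\partial t}(t)}\right\Vert _{H}\le \frac{C}{1+t}\).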

Then, we have the following lemma.

Lemma 4.3

Consider the setting of Theorem 4.2. We have that there is a finite constant \(C<\infty \) such that

$$\begin{aligned} \limsup _{t \rightarrow \infty } t^2 \left( \phi (t), \phi (t) \right) \le C. \end{aligned}$$
(4.3)

In addition, there exist a \(t_0 >0\) and a finite constant \(K<\infty \) such that for all \(t \ge t_0\) and any \(0< p < 1\),

$$\begin{aligned} \left( \phi (t), \phi (t) \right) \le K t^{-2p}. \end{aligned}$$

Proof of Lemma 4.3

For notational convenience we set below \(Y(t) = \left( \phi (t), \phi (t) \right) \). First we calculate

$$\begin{aligned} \frac{d Y}{d t} (t)= - 2 \left( \phi (t), A \phi (t) \right) -2 \left( \phi (t), \frac{\partial u^{*}}{\partial t}(t)\right) . \end{aligned}$$

For \(\epsilon >0\) to be chosen later on and using the coercivity Assumption 2.1, we then have the inequality (omitting the argument t for notational convenience)

$$\begin{aligned} \frac{d Y}{d t}\le & {} - 2 \left( \phi , A \phi \right) + 2 \left| \left( \epsilon \phi , \frac{1}{\epsilon } \frac{\partial u^{*}}{\partial t} \right) \right| \\\le & {} - 2 \lambda ( \phi , \phi ) + \epsilon ^2 ( \phi , \phi ) + \frac{1}{\epsilon ^{2}} \left( \frac{\partial u^{*}}{\partial t}, \frac{\partial u^{*}}{\partial t} \right) \\\le & {} -2 (\lambda - \frac{\epsilon ^2}{2} )Y +\frac{C}{\epsilon ^{2}} t^{-2} \\= & {} -2 b_{\epsilon } Y + C_{\epsilon } t^{-2}, \end{aligned}$$

where we have used Young's inequality, with \(b_{\epsilon } =\lambda - \frac{\epsilon ^2}{2} \) and \(C_{\epsilon } =\frac{C}{\epsilon ^{2}}\). We can select \(\epsilon \) small enough such that \(b_{\epsilon } > 0\).

Denoting now for notational convenience \(b=b_{\epsilon }\) and, with some abuse of notation, setting \(C=C_{\epsilon }\), let's construct the ODE

$$\begin{aligned} \frac{d v }{d t}= & {} - 2 b v + C t^{-2}, \\ v(1)= & {} Y(1). \end{aligned}$$

Define \(\xi = Y - v\). Then, we have that \(\xi (1) = 0\) and for \(t \ge 1\),

$$\begin{aligned} \frac{d \xi }{dt}= & {} \frac{d Y}{dt} - \frac{dv }{dt} \\\le & {} -2 b Y + C t^{-2} - \bigg ( - 2 b v + C t^{-2} \bigg ) \\= & {} - 2 b ( Y - v ) \\= & {} - 2b \xi . \end{aligned}$$
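Since \(\xi (1)=0\), the Gronwall step here is explicit:

$$\begin{aligned} \frac{d}{dt}\left( e^{2bt}\xi (t)\right) =e^{2bt}\left( \frac{d \xi }{dt}(t)+2b \xi (t)\right) \le 0 \text { for } t\ge 1, \qquad \text { so } \qquad e^{2bt}\xi (t)\le e^{2b}\xi (1)=0. \end{aligned}$$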

Hence \(\xi \le 0\) and therefore \(Y \le v\). If we can establish a convergence rate for v, we then have a convergence rate for Y.

The solution v is

$$\begin{aligned} v(t) = e^{-2 b (t-1)} Y(1) + C e^{-2 b t} \int _1^t e^{2 b s} s^{-2} ds. \end{aligned}$$

Since the first term decays exponentially, an application of L'Hôpital's rule to the second term gives

$$\begin{aligned} \lim _{t \rightarrow \infty } t^2 v(t)= \frac{C}{2 b}. \end{aligned}$$
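Indeed, writing the second term of v as a quotient and differentiating numerator and denominator,

$$\begin{aligned} \lim _{t \rightarrow \infty }\frac{C\int _1^t e^{2 b s} s^{-2} ds}{t^{-2}e^{2 b t}} =\lim _{t \rightarrow \infty }\frac{C e^{2 b t} t^{-2}}{ e^{2 b t}\left( 2 b t^{-2}-2 t^{-3}\right) } =\frac{C}{2 b}, \end{aligned}$$

while \(t^{2} e^{-2 b (t-1)} Y(1)\rightarrow 0\) as \(t \rightarrow \infty \).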

Therefore, for a finite constant \(C<\infty \) we have that

$$\begin{aligned} \limsup _{t \rightarrow \infty } t^2 Y \le \limsup _{t \rightarrow \infty } t^2 v \le C. \end{aligned}$$

Consequently, there exist a \(t_0 >0\) and a finite constant \(K<\infty \) such that for all \(t \ge t_0\) and any \(0< p < 1\),

$$\begin{aligned} | v(t) | \le K t^{-2p}. \end{aligned}$$

Therefore, for all \(t \ge t_0\),

$$\begin{aligned} Y(t) \le v(t) \le K t^{-2p}, \end{aligned}$$

concluding the proof of the lemma. \(\square \)

Let us now recall that \(\Psi (t,x) = \hat{u} (t,x) - \hat{u}^{*}(t,x)\). Next, let’s prove a convergence rate for \(\Psi \).

Lemma 4.4

Consider the setting of Theorem 4.2. Then, we have that

$$\begin{aligned} \limsup _{t \rightarrow \infty } t^2 \left( \Psi (t), \Psi (t) \right) \le C, \end{aligned}$$
(4.4)

for some finite constant \(C<\infty \).

Proof of Lemma 4.4

We first calculate that

$$\begin{aligned} \frac{\partial \Psi }{\partial t }(t,x) = - A^{\dagger } \Psi (t,x) +\phi (t,x) - \frac{\partial \hat{u}^{*}}{\partial t}(t,x). \end{aligned}$$
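Formally, this follows by subtracting the exact adjoint equation \(A^{\dagger } \hat{u}^{*}(t,x) = u^{*}(t,x) - h(x)\) (cf. (1.3)) from the relaxed adjoint equation, assuming, as in Section 2, that the latter takes the form \(\frac{\partial \hat{u}}{\partial t} = -A^{\dagger } \hat{u} + u - h\):

$$\begin{aligned} \frac{\partial \Psi }{\partial t}&=\frac{\partial \hat{u}}{\partial t}-\frac{\partial \hat{u}^{*}}{\partial t} =-A^{\dagger }\hat{u}+\left( u-h\right) -\frac{\partial \hat{u}^{*}}{\partial t}\\&=-A^{\dagger }\Psi -\left( u^{*}-h\right) +\left( u-h\right) -\frac{\partial \hat{u}^{*}}{\partial t} =-A^{\dagger }\Psi +\phi -\frac{\partial \hat{u}^{*}}{\partial t}. \end{aligned}$$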

Define \(W(t) = t^2 \left( \Psi (t), \Psi (t) \right) \). Then, omitting for notational convenience the time argument, we have

$$\begin{aligned} \frac{d W}{dt}= & {} 2t \left( \Psi , \Psi \right) + 2 t^2 \left( \Psi , \frac{\partial \Psi }{\partial t} \right) \\= & {} \frac{2}{t} W - 2 t^2 \left( \Psi , A^{\dagger } \Psi \right) + 2 t^2 \left( \Psi , \phi \right) - 2 t^2 \left( \Psi , \frac{\partial \hat{u}^{*}}{\partial t} \right) \\\le & {} \frac{2}{t} W - 2 \lambda W + 2 t^2 \left( \Psi , \phi \right) - 2 t^2 \left( \Psi , \frac{\partial \hat{u}^{*}}{\partial t}\right) , \end{aligned}$$

where we used the assumed coercivity of \(A^{\dagger }\) (a consequence of Assumption 2.1). Let's select a \(t_0 > 1\) such that \(t_0^{-1}< \delta \ll \lambda \). Then, for \(t \ge t_0\) and for a constant \(C<\infty \) that may change from line to line,

$$\begin{aligned} \frac{d W}{dt}\le & {} - 2 (\lambda - \delta ) W + 2 t^2 \left( \Psi , \phi \right) + 2 t^2 \left| \left( \Psi , \frac{\partial \hat{u}^{*}}{\partial t} \right) \right| \\\le & {} - 2 (\lambda - \delta ) W + C \epsilon ^2 t^2 \left( \Psi , \Psi \right) + C \epsilon ^{-2} t^2 \left( \phi , \phi \right) + C \epsilon ^{-2} t^2 \left( \frac{\partial \hat{u}^{*}}{\partial t}, \frac{\partial \hat{u}^{*}}{\partial t} \right) \\\le & {} - b W + C t^2 \left( \phi , \phi \right) + C t^2 \left( \frac{\partial \hat{u}^{*}}{\partial t}, \frac{\partial \hat{u}^{*}}{\partial t} \right) . \end{aligned}$$

where we have chosen an \(\epsilon > 0\) such that \(b = \lambda -\delta - C \epsilon ^2 > 0\).

Let’s construct the ODE

$$\begin{aligned} \frac{d \hat{q}}{dt}= & {} - b \hat{q} + C t^2 \left\Vert {\phi }\right\Vert ^2_{H} +C t^2 \left\Vert {\frac{\partial \hat{u}^{*}}{\partial t}}\right\Vert ^2_{H}, \quad t \ge t_0, \\ \hat{q}(t_0)= & {} W(t_0). \end{aligned}$$

which then, by (4.3) and Lemma 3.2, satisfies

$$\begin{aligned} \hat{q}(t)= & {} e^{-b (t-t_{0})} \hat{q}(t_0) + C e^{-bt} \int _{t_0}^t e^{b s} s^2 \left( \left\Vert {\phi (s)}\right\Vert ^2_{H} + \left\Vert {\frac{\partial \hat{u}^{*}}{\partial t}(s)}\right\Vert ^2_{H}\right) ds \\\le & {} e^{-b (t-t_{0})} \hat{q}(t_0) + C e^{-bt} \int _{t_0}^t e^{b s} ds \\\le & {} C. \end{aligned}$$

Therefore, using the same ODE comparison principle as before, \(W(t) \le \hat{q}(t)\le C\) for \(t\ge t_{0}\), and hence

$$\begin{aligned} \limsup _{t \rightarrow \infty } t^2 \left( \Psi (t), \Psi (t) \right) \le C, \end{aligned}$$

which concludes the proof of the lemma. \(\square \)

We now present the proof of Theorem 4.2 on the convergence rate for \(\theta \).

Proof of Theorem 4.2

Recall that \(H(\theta ) = \nabla _{\theta \theta } J(\theta )\) is the Hessian matrix. At the stationary point \(\theta ^{*}\), \(H(\theta ^{*})\) is positive definite, i.e. there exists some constant \(q > 0\) such that \(\xi ^{\top } H (\theta ^{*})\xi \ge q \left\Vert {\xi }\right\Vert _{2}^2\).

By Theorem 3.1, \(\left\Vert {\nabla J(\theta (t))}\right\Vert _{2}\rightarrow 0\) as \(t\rightarrow \infty \); since \(\theta (t)\) is uniformly bounded and \(\theta ^{*}\) is the unique point at which \(\nabla _{\theta } J\) vanishes (Assumption 4.1), it follows that \(\displaystyle \lim _{t \rightarrow \infty } \theta (t) = \theta ^{*}\). The parameter updates satisfy

$$\begin{aligned} \frac{d \theta }{dt}&= - \alpha (t) \nabla _{\theta } J(\theta (t)) -\alpha (t) \left( \nabla _{\theta } f ( \theta (t) ), \Psi (t) \right) \\&= - \alpha (t) H(\theta ^{*}) ( \theta (t) - \theta ^{*})\\&\quad -\alpha (t) \nabla _{\theta } H(\bar{\theta }(t))_{j,k} (\theta (t) - \theta ^{*} )_k ( \theta (t) -\theta ^{*})_j - \alpha (t) \left( \nabla _{\theta } f ( \theta (t) ), \Psi (t) \right) , \end{aligned}$$

where \(\bar{\theta }(t) \in [ \theta (t), \theta ^{*} ]\), \(\nabla _{\theta } H(\bar{\theta }(t))_{j,k} \in {\mathbb {R}}^d\), and \(\nabla _{\theta } H(\bar{\theta }(t))_{j,k} ( \theta (t) -\theta ^{*} )_k ( \theta (t) - \theta ^{*} )_j = \displaystyle \sum _{k,j = 1}^d \nabla _{\theta } H(\bar{\theta }(t))_{j,k} (\theta (t) - \theta ^{*} )_k ( \theta (t) - \theta ^{*} )_j\). \(H(\theta )\) is the Hessian matrix and \(H(\theta )_{j,k}\) is the (j, k)-th element of the matrix.

Define

$$\begin{aligned} V(t) = \left\Vert {\theta (t) - \theta ^{*} }\right\Vert _2^2. \end{aligned}$$

V satisfies the ODE

$$\begin{aligned} \frac{d V}{dt}&= 2 (\theta (t) - \theta ^{*})^{\top } \frac{d \theta }{dt } \\&= 2 (\theta (t) - \theta ^{*})^{\top } \\&\qquad \bigg [ -\alpha (t) H(\theta ^{*}) ( \theta (t) - \theta ^{*} ) - \alpha (t) \nabla _{\theta } H(\bar{\theta }(t))_{j,k} ( \theta (t) -\theta ^{*})_k ( \theta (t) - \theta ^{*} )_j\\&\qquad \quad - \alpha (t) \left( \nabla _{\theta } f ( \theta (t) ), \Psi (t) \right) \bigg ] \\&\le - 2 \alpha (t) q V(t) + 2 \alpha (t) C \left\Vert { \theta (t) - \theta ^{*} }\right\Vert V(t) - 2 \alpha (t) (\theta (t) - \theta ^{*})^{\top } \left( \nabla _{\theta } f ( \theta (t) ), \Psi (t) \right) . \end{aligned}$$

Here \(C\) is a constant depending on the Lipschitz constant of \(H(\cdot )\) from Assumption 4.1. Since \(\displaystyle \lim _{t \rightarrow \infty } \theta (t) =\theta ^{*}\), there exists a \(t_0\) such that for all \(t \ge t_0\)

$$\begin{aligned} q - C \left\Vert { \theta (t) - \theta ^{*} }\right\Vert> b_0 > 0. \end{aligned}$$

Therefore, for \(t\ge t_{0}\) large enough, we have

$$\begin{aligned} \frac{ d V}{dt}&\le - 2\alpha (t) b_0 V + 2\alpha (t) \left| (\theta (t) - \theta ^{*})^{\top } \left( \nabla _{\theta } f ( \theta (t) ), \Psi (t) \right) \right| \\&= - 2\alpha (t) b_0 V + 2\epsilon \alpha (t) \left| (\theta (t) -\theta ^{*})^{\top } \left( \nabla _{\theta } f ( \theta (t) ), \frac{\Psi (t)}{\epsilon } \right) \right| \\&\le - \alpha (t) b_0 V + \epsilon ^2 \alpha (t)^2 V + C\epsilon ^{-2} \left( \Psi , \Psi \right) , \end{aligned}$$

where we have used Young's inequality and the Cauchy–Schwarz inequality. Here we have set \(C=\sup _{\theta \in {\mathbb {R}}^{d}}\left\Vert {\nabla _{\theta } f ( \theta ) }\right\Vert ^{2}_{L^{2}(U)}\), which is finite by Assumption 2.6.

Let’s select an \(\epsilon \) small enough such that \(b = b_0 -\epsilon ^2 > 0\). Then,

$$\begin{aligned} \frac{ d V}{dt} \le - \alpha (t) b V + C \left( \Psi , \Psi \right) . \end{aligned}$$

Let \(\hat{q}\) satisfy the ODE

$$\begin{aligned} \frac{ d \hat{q}}{dt}= & {} - \alpha (t) b \hat{q} + C \left( \Psi , \Psi \right) , \quad t \ge t_0, \\ \hat{q}(t_0)= & {} V(t_0). \end{aligned}$$

The following comparison principle holds

$$\begin{aligned} V(t) \le \hat{q}(t). \end{aligned}$$
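For the learning rate \(\alpha (t)=\frac{1}{1+t}\) fixed above, the integrating factor for the \(\hat{q}\)-equation is \(e^{b\int _{t_{0}}^{t}\frac{dr}{1+r}}=\left( \frac{1+t}{1+t_{0}}\right) ^{b}\), so

$$\begin{aligned} \hat{q}(t)=\left( \frac{1+t_{0}}{1+t}\right) ^{b}\hat{q}(t_{0})+C\int _{t_{0}}^{t}\left( \frac{1+s}{1+t}\right) ^{b}\left( \Psi (s),\Psi (s)\right) ds. \end{aligned}$$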

Since \((1+t)^{-b}\le t^{-b}\), \((1+s)^{b}\le 2^{b} s^{b}\) for \(s\ge 1\), and \(\displaystyle \limsup _{t \rightarrow \infty } t^2 \left( \Psi (t), \Psi (t) \right) \le C\) by Lemma 4.4, we have that

$$\begin{aligned} \hat{q}(t)\le & {} C_1 t^{-b} + C_2 t^{-b} \int _{t_0}^t s^{b} \left( \Psi (s), \Psi (s) \right) ds \\\le & {} C_1 t^{-b} + C_2 t^{-b} \int _{t_0}^t s^{b-2} s^2 \left( \Psi (s), \Psi (s) \right) ds \\\le & {} C_1 t^{-b} + C_2 t^{-b} \int _{t_0}^t s^{b-2} ds \\= & {} C_1 t^{-b} + C_2 t^{-b} \bigg ( t^{b-1} - t_0^{b-1} \bigg )\\\le & {} C t^{-1}, \end{aligned}$$

where the constant \(C_2\) may change from line to line and we have used that \(C_{\alpha } b > 1\) (recall that \(C_{\alpha } q>1\) by Assumption 4.1 and that \(b\) can be taken arbitrarily close to \(q\) by choosing \(t_{0}\) large and \(\epsilon \) small). Since \(V(t)=\left\Vert {\theta (t) - \theta ^{*} }\right\Vert _2^2\), this proves (4.2) and concludes the convergence rate proof for \(\theta (t)\). \(\square \)