Abstract
We present and mathematically analyze an online adjoint algorithm for the optimization of partial differential equations (PDEs). Traditional adjoint algorithms would typically solve a new adjoint PDE at each optimization iteration, which can be computationally costly. In contrast, an online adjoint algorithm updates the design variables in continuous-time and thus constantly makes progress towards minimizing the objective function. The online adjoint algorithm we consider is similar in spirit to the pseudo-time-stepping, one-shot method which has been previously proposed. Motivated by the application of such methods to engineering problems, we mathematically study the convergence of the online adjoint algorithm. The online adjoint algorithm relies upon a time-relaxed adjoint PDE which provides an estimate of the direction of steepest descent. The algorithm updates this estimate continuously in time, and it asymptotically converges to the exact direction of steepest descent as \(t \rightarrow \infty \). We rigorously prove that the online adjoint algorithm converges to a critical point of the objective function for optimizing the PDE. Under appropriate technical conditions, we also prove a convergence rate for the algorithm. A crucial step in the convergence proof is a multi-scale analysis of the coupled system for the forward PDE, adjoint PDE, and the gradient descent ODE for the design variables.
1 Introduction
Adjoint methods have been widely used for the optimization of partial differential equations (PDEs), and especially for optimizing PDEs modeling engineering systems. Examples include [2, 4,5,6, 9,10,11, 16,17,18, 20, 24, 26,27,28], and [15].
Traditional adjoint algorithms consist of an iteration where at each iteration a new adjoint PDE must be solved to calculate the gradient descent step. During the course of optimization, many adjoint PDEs must be solved, which in certain cases can be computationally costly. As an alternative, time-stepping and pseudo-time-stepping methods (often in combination with one-shot methods) have been proposed, where one views time-independent PDEs as stationary states of appropriate dynamical systems and studies the behavior of the latter in the long-time regime, i.e., after their transient phase. We refer the interested reader to [3, 7, 12,13,14, 19, 31, 32] and the references therein for a certainly non-exhaustive list of representative references.
In this paper, we couple a time-relaxed adjoint PDE with a continuous-time update equation for the variables that are being optimized. The time-relaxed adjoint PDE yields an estimate of the direction of steepest descent, and updates this estimate continuously in time. The optimization variables are also updated continuously in time using this online estimate of the direction of steepest descent. The focus of our paper is the mathematical analysis of this “online adjoint algorithm”. As \(t \rightarrow \infty \), the solution of the time-relaxed adjoint PDE asymptotically matches the exact direction of steepest descent. A crucial step in the convergence proof is a multi-scale analysis of the coupled system for the forward PDE, adjoint PDE, and the gradient descent ODE for the design variables.
We prove convergence and convergence rates for the online adjoint algorithm for a certain class of PDEs. Specifically, in our theoretical analysis, we consider the optimization problem where we seek to minimize the objective function:
where h is a target profile. \( \gamma \left\Vert {\theta }\right\Vert ^2_2\) is a regularization term where \(\gamma > 0\) and \(\left\Vert { \cdot }\right\Vert _2\) is the \(\ell _2\) norm. \(u^{*}\) satisfies the elliptic PDE
where A is a standard second-order elliptic operator. Thus, we wish to select a parameter \(\theta \) such that the solution \(u^{*}\) of the PDE (1.2) is as close as possible to the target profile h.
If \(A^{\dagger }\) denotes the formal adjoint operator to A, then the adjoint PDE is
The gradient of the objective function (1.1) can be evaluated using the solution \(\hat{u}^{*}\) to the adjoint PDE (1.3). By Lemma 2.4 we have that
Thus, the adjoint PDE (1.3) can be used to evaluate the gradient of the objective function, which in turn can be used to optimize over the PDE (1.2). A key advantage of adjoint methods is that, no matter how large the dimension of \(\theta \) is, the adjoint PDE (1.3) is the same dimension as the original PDE (1.2).
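Since the display equations are not reproduced in this text, the chain of identities behind the gradient formula (1.4) can be sketched schematically as follows; the computation is standard, and the regularization term is normalized so that it contributes \(\gamma \theta \), matching the update (1.7).

```latex
% Schematic derivation of (1.4).  With A u^{*} = f(\cdot,\theta) and
% A^{\dagger}\hat{u}^{*} = u^{*} - h, differentiating the forward PDE in
% \theta gives
A\,\nabla _{\theta }u^{*} = \nabla _{\theta }f(\cdot ,\theta ),
% and therefore, writing (\cdot,\cdot) for the L^{2}(U) inner product,
\nabla _{\theta }J(\theta )
  = \big( u^{*}-h,\;\nabla _{\theta }u^{*}\big) + \gamma \theta
  = \big( A^{\dagger }\hat{u}^{*},\;\nabla _{\theta }u^{*}\big) + \gamma \theta
  = \big( \hat{u}^{*},\;A\,\nabla _{\theta }u^{*}\big) + \gamma \theta
  = \int _{U}\hat{u}^{*}(x)\,\nabla _{\theta }f(x,\theta )\,dx + \gamma \theta .
```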
1.1 The Online Adjoint Algorithm
The online adjoint algorithm optimizes the objective function \(J(\theta )\) via a continuous-time equation for the update of the parameter \(\theta (t)\); see also [14, 32] for related formulations. The direction of steepest descent is estimated using a time-relaxation of the adjoint PDE. The estimate and the optimization variables are both simultaneously updated continuously in time. An appropriately chosen learning rate parameter is introduced, which allows us to guarantee both well-posedness of the algorithm for all times (Theorem 2.8) and convergence as \(t\rightarrow \infty \) (Theorems 3.1 and 4.2).
The online adjoint algorithm satisfies the equations:
where \(\alpha (t)\) is an appropriately chosen learning rate. The PDEs for u and \(\hat{u}\) can be viewed as time relaxations of the PDE (1.2) and its adjoint PDE (1.3). It is easy to see that \(\int _{U} \hat{u}(t,x) \nabla _{\theta } f (x, \theta (t) ) dx\) is an estimate for the direction of steepest descent \(\int _{U} \hat{u}^{*}(x) \nabla _{\theta } f (x, \theta (t) ) dx\).
Apart from the generic formulation of the online adjoint algorithm as presented in (1.5), our main contribution is two-fold. First, we prove that as \(t\rightarrow \infty \), \(\left\Vert {\nabla J(\theta (t))}\right\Vert \rightarrow 0\). Namely, we prove that \(\theta (t)\) converges to a stationary point of \(J(\theta )\). We emphasize here that, in order to do so, no assumptions on convexity of J are needed. Second, if we further assume that \(J(\theta )\) is strongly convex, then we also prove a convergence rate of \(\theta (t)\) to the global minimum of \(J(\theta )\).
In practice, the online adjoint algorithm (1.5) is implemented by simultaneously solving the coupled ODE-PDE system using numerical methods such as finite-difference methods. Either explicit or implicit finite difference methods can be used. For example, an explicit finite difference method for implementing (1.5) would be:
-
Update u and \(\hat{u}\):
$$\begin{aligned} u(t + \Delta , x)= & {} u(t,x) + \bigg ( - A u(t,x) + f(x,\theta (t)) \bigg ) \Delta , \nonumber \\ \hat{u}(t + \Delta ,x)= & {} \hat{u}(t,x) + \bigg ( -A^{\dagger } \hat{u}(t,x) + ( u(t,x) - h(x)) \bigg ) \Delta , \end{aligned}$$(1.6)
where \(\Delta \) is the time-step size.
-
Then, update the parameter \(\theta \):
$$\begin{aligned} \theta (t+\Delta ) = \theta (t) - \alpha (t) \bigg ( \int _{U} \hat{u}(t,x) \nabla _{\theta } f (x, \theta (t) ) dx + \gamma \theta (t) \bigg ) \Delta . \end{aligned}$$(1.7)
The spatial domain x is discretized and a finite-difference method is used to approximate the operator A. The integrals are discretized as appropriate sums.
The focus of our paper is to rigorously prove the convergence of the online adjoint algorithm for linear elliptic PDEs. In practice, real-world applications will typically require optimizing over nonlinear PDEs. The online adjoint algorithm can also be used to optimize over nonlinear PDEs. References [30] and [23] optimize over the Navier–Stokes equation using our online adjoint algorithm. Numerical optimization with pseudo-time-stepping adjoint methods has also been studied in [3, 7, 12,13,14, 19, 31, 32].
We demonstrate the online adjoint method below for a simple example of a nonlinear PDE. Consider the equation
where \((x,y) \in [0,1] \times [0,1]\) and with boundary conditions \(u(0, y) = 1\), \(u(1,y) = -1\), \(u(x, 0) = 1\), and \(u(x,1) = -1\). The parameters to be optimized over are \(\theta = (\theta _1, \theta _2)\) and the objective function is (1.1) with \(\gamma = 0\). The target function h is the solution to (1.8) with \(\theta = (10, 10)\). That is, our goal is to solve the inverse problem of recovering the parameters in the PDE (1.8) given an observed solution.
The adjoint PDE for (1.8) is
where \((x,y) \in [0,1] \times [0,1]\) and with boundary conditions \(\hat{u}(0, y) = \hat{u}(1,y) = \hat{u}(x, 0) = \hat{u}(x,1) = 0\). The gradient of the objective function is given by the formula
The online adjoint algorithm can be used to minimize the objective function \(J(\theta )\). In our numerical experiment, we use an explicit finite difference method for the numerical solution of the time-relaxed PDE and the time-relaxed adjoint PDE. The PDE variables are updated using an Euler scheme. (Although not implemented here, it is worthwhile noting that higher-order accuracy in time could be achieved with a Runge–Kutta scheme.) Uniform mesh sizes for both time and space are chosen. The PDE operator is approximated using a second-order accurate finite-difference method. The parameter ODEs are also solved using an explicit Euler scheme on the same uniform time grid and the spatial integrals are also discretized as sums using the uniform spatial grid. Figure 1 demonstrates that the online adjoint algorithm converges to the correct value for the parameters \(\theta \) as \(t \rightarrow \infty \). The right display in Fig. 1 presents the numerical convergence rate, which satisfies the theoretical convergence rate of \(t^{-\frac{1}{2}}\) which we prove for strongly convex objective functions for linear elliptic PDEs in this paper. (In fact, for this specific example, the numerical convergence rate turns out to be faster than \(t^{- \frac{1}{2}}\).)
1.2 Organization of the Proof
The rest of the paper is organized as follows. In Sect. 2 we state our assumptions, present the online adjoint algorithm in more detail and prove its well-posedness in Theorem 2.8. Convergence of \(\theta (t)\) to a stationary point of \(J(\theta )\) is proven in Sect. 3, Theorem 3.1. We emphasize that no convexity requirements on \(J(\theta )\) are needed in order to prove convergence. If, in addition, we assume that \(J(\theta )\) is strongly convex with a single stationary point, then one can prove a convergence rate, Theorem 4.2. The latter is the content of Sect. 4.
2 Assumptions, Notation and Well Posedness of the Online Adjoint Algorithm
Let U be an open, bounded subset of \({\mathbb {R}}^{n}\). We will denote by \((\cdot ,\cdot )\) the usual inner product in \(H=L^{2}(U)\). We shall assume that the operator A is uniformly elliptic, diagonalizable and dissipative, per Assumption 2.1.
Assumption 2.1
The operator A is uniformly elliptic. We also assume that A is diagonalizable and dissipative. Namely, there exists a countable complete orthonormal basis \(\{e_{n}\}_{n\in {\mathbb {N}}}\subset H\) consisting of eigenvectors of A corresponding to a non-negative sequence \(\{\lambda _{n}\}_{n\in {\mathbb {N}}}\) of eigenvalues, i.e., \(A e_{n} = \lambda _{n} e_{n}\), such that the dissipativity condition \(\lambda = \displaystyle \inf _{n\in {\mathbb {N}}} \lambda _{n}>0\) holds.
An example of A is the second order elliptic operator, for definiteness taken to be in divergence form,
The diagonalizability condition of Assumption 2.1 is automatically satisfied, for example, by self-adjoint operators; see [8, Theorem 8.8.37].
Before proceeding with the well-posedness of the online adjoint algorithm, let us recall a few basic results that will be useful for the analysis that follows.
Taking the domain of A to be \(D(A)=H^{1}_{0}(U)\cap H^{2}(U)\), we have that it is dense in \(H=L^{2}(U)\). Then, due to Assumption 2.1, elliptic regularity theory gives that the operator A is closed, and thus by the Hille–Yosida theorem \((-A)\) is the generator of a strongly continuous analytic contraction semigroup \(\{S(t)\}_{t\ge 0}\) on H. The spectral assumption made in Assumption 2.1 guarantees that
The latter also means that A is a coercive operator. In particular, we will heavily use the fact that for \(u\in D(A)\), we have
Notice now that because we are dealing with a real Hilbert space and because \(H^{1}_{0}(U)\cap H^{2}(U)\) is dense in \(L^{2}(U)\), we obtain that \(A^{\dagger }\) is also a coercive operator. Indeed, by definition of the adjoint operator \(A^{\dagger }\) we shall have that for \(u\in H^{1}_{0}(U)\cap H^{2}(U)\)
Notice now that under our assumptions the adjoint operator \((-A^{\dagger })\) also generates an analytic strongly continuous semigroup \(\{S^{\dagger }(t)\}_{t\ge 0}\) on \(L^{2}(U)\). In particular, (2.3) implies that the adjoint semigroup \(S^{\dagger }(t)\) is also exponentially stable. Indeed, by definition we have for \(u\in H^{1}_{0}(U)\cap H^{2}(U)\)
which then, due to Gronwall's lemma, gives
proving the exponential stability of \(S^{\dagger }(t)\).
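The display equations of this computation are not reproduced above; under Assumption 2.1, the standard chain of estimates behind the exponential stability can be sketched as follows.

```latex
% Sketch of the exponential-stability computation (standard; constants and
% signs follow Assumption 2.1).  For u \in D(A), coercivity of A^{\dagger} gives
\frac{d}{dt}\left\Vert S^{\dagger }(t)u\right\Vert _{H}^{2}
  = -2\big( A^{\dagger }S^{\dagger }(t)u,\;S^{\dagger }(t)u\big)
  \le -2\lambda \left\Vert S^{\dagger }(t)u\right\Vert _{H}^{2},
% and Gronwall's lemma then yields
\left\Vert S^{\dagger }(t)u\right\Vert _{H}\le e^{-\lambda t}\left\Vert u\right\Vert _{H}.
```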
Then, if we assume that \(f\in L^{2}(U)\), the classical Lax–Milgram theorem (see, for example, Chap. 5.8 of [8]) implies that the elliptic boundary-value problem
has a unique weak solution \(u^{*}\in H^{1}_{0}(U)\). The same conclusion will also be true for the adjoint problem governed by the adjoint operator \(A^{\dagger }\).
Remark 2.2
We also recall here that by classical elliptic regularity results if for given \(m\in {\mathbb {N}}\), \(a^{i,j},b^{i},c\in {\mathcal {C}}^{m+1}(\bar{U})\) with \(i,j=1,\ldots ,n\) and \(\partial U\in {\mathcal {C}}^{m+2}\), then the unique solution \(u^{*}\) to (2.6) is such that \(u^{*}\in H^{m+2}(U)\). Clearly if \(a^{i,j},b^{i},c\in {\mathcal {C}}^{\infty }(\bar{U})\) with \(i,j=1,\ldots ,n\) and \(\partial U\in {\mathcal {C}}^{\infty }\), then \(u^{*}\in {\mathcal {C}}^{\infty }(\bar{U})\). We refer the interested reader to classical manuscripts, e.g., [8, Chap. 8], for more details.
Remark 2.3
For notational convenience and without loss of generality, we have assumed zero boundary conditions for the PDE (2.6). We can consider the PDE (2.6) with non-zero boundary data, say \(u^{*}(x)= g(x), x\in \partial U\), under the assumption \(g\in H^{1}(U)\) for unique solvability of the corresponding PDE (2.6). We refer the interested reader to classical manuscripts, e.g., [8, Chap. 8], for more details.
As briefly presented in the introduction, let \(f(\cdot ,\cdot ):{\mathbb {R}}^{n}\times {\mathbb {R}}^{d}\rightarrow {\mathbb {R}}\) be such that for every \(\theta \in {\mathbb {R}}^{d}\), \(f(\cdot ,\theta )\in L^{2}(U)\). As with (2.6), the linear PDE
will have, for each given \(\theta \in {\mathbb {R}}^{d}\), a unique weak solution \(u^{*}\in H^{1}_{0}(U)\). We shall write \(u^{*}(x;\theta )\) when we want to emphasize the dependence on \(\theta \).
For a given target profile \(h\in L^{2}(U)\), the goal is to select \(\theta \) to minimize the objective function
where \(\gamma > 0\). \( \gamma \left\Vert {\theta }\right\Vert ^2_2\) is a regularization term and \(\left\Vert { \cdot }\right\Vert _2\) is the \(\ell _2\) norm.
The adjoint PDE satisfies
and as with (2.6), Assumption 2.1 and the fact that \(u^{*} - h\in L^{2}(U)\) guarantee that (2.9) has a unique weak solution \(\hat{u}^{*}\in H^{1}_{0}(U)\). Notice that since \(u^{*}\) depends on \(\theta \), the same will also be true for \(\hat{u}^{*}\) and we shall write \(\hat{u}^{*}(x;\theta )\) when we want to emphasize that. The following lemma provides a useful representation for \(\nabla _{\theta } J(\theta )\), which will also motivate the form of the online adjoint algorithm.
Lemma 2.4
Let Assumption 2.1 hold and assume that \(h\in L^{2}(U)\). Then, we can write
The proof of Lemma 2.4 is presented at the end of this section. In terms of the learning rate \(\alpha (t)\) we make the following assumption.
Assumption 2.5
We assume that the learning rate \(\alpha (t)\) is such that \(\displaystyle \lim _{t\rightarrow \infty }\alpha (t)=0\) and
-
\(\int _{0}^{\infty }\alpha (s)ds=\infty \) and \(\int _{0}^{\infty }\alpha ^{2}(s)ds<\infty \).
-
\(\displaystyle \sup _{t\ge 0} \int _{0}^{t}\alpha (s) e^{-\gamma \int _{s}^{t}\alpha (r)dr}ds<\infty \) and \(\displaystyle \lim _{t\rightarrow \infty }\frac{\alpha '(t)}{\alpha (t)}=0\).
The first part of Assumption 2.5 on the learning rate is classical and is the same as that used in discrete-time algorithms; see, for example, classical references such as [1, 21]. The second part of Assumption 2.5 comes up when proving that \(\left\Vert {\theta (t)}\right\Vert \) stays bounded for all times, and later on in the convergence proof of \(\theta (t)\) to a stationary point of J. An example of a learning rate that satisfies both parts of Assumption 2.5 is \(\alpha (t)=\frac{1}{1+t}\).
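As a quick illustrative check (not from the paper), one can verify numerically that \(\alpha (t)=\frac{1}{1+t}\) satisfies both parts of Assumption 2.5; here we take \(\gamma = 1\), and the horizons and quadrature grids below are arbitrary.

```python
import numpy as np

# Illustrative numerical check that alpha(t) = 1/(1+t) satisfies both parts
# of Assumption 2.5 (gamma = 1 here; all grids and horizons are ad hoc).

def trap(y, x):
    # simple trapezoidal rule (avoids NumPy version differences in the name)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

gamma = 1.0
alpha = lambda t: 1.0 / (1.0 + t)

# int_0^infty alpha(s)^2 ds = 1 < infty: quadrature over a long window.
s = np.linspace(0.0, 1e4, 400_001)
sq = trap(alpha(s) ** 2, s)

# I(t) = int_0^t alpha(s) exp(-gamma int_s^t alpha(r) dr) ds.  For this alpha
# the inner integral is log((1+t)/(1+s)), so the weight is ((1+s)/(1+t))^gamma.
def I(t, n=100_001):
    u = np.linspace(0.0, t, n)
    w = ((1.0 + u) / (1.0 + t)) ** gamma
    return trap(alpha(u) * w, u)

sup_I = max(I(t) for t in (1.0, 10.0, 100.0, 1000.0))  # stays below 1/gamma

# alpha'(t)/alpha(t) = -1/(1+t) -> 0 as t -> infinity.
ratio = -1.0 / (1.0 + 1e6)

print(sq, sup_I, ratio)
```

Note that \(\int _{0}^{t}\alpha (s)ds=\log (1+t)\) still diverges, so the first condition of Assumption 2.5 holds as well; only the finite quantities are checked numerically above.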
In terms of the parametric model \(f(\cdot ,\cdot )\) we make the following assumption.
Assumption 2.6
We assume the following conditions:
-
For each fixed \(\theta \in {\mathbb {R}}^{d}\), \(f(\cdot , \theta )\), \(\nabla _{\theta } f(\cdot , \theta )\) and \(\nabla ^{2}_{\theta } f(\cdot , \theta )\) are in \(L^{2}(U)\). For each fixed \(x\in {\mathbb {R}}^{n}\), \(f(x,\cdot )\), \(\nabla _{\theta } f(x, \cdot )\) and \(\nabla ^{2}_{\theta } f(x, \cdot )\) are bounded. In other words we assume that there exists \(C<\infty \) such that
$$\begin{aligned} \sup _{\theta \in {\mathbb {R}}^{d}}\left( \left\Vert {f(\theta )}\right\Vert _{L^{2}(U)} +\left\Vert {\nabla _{\theta }f(\theta ) }\right\Vert _{L^{2}(U)}+\left\Vert {\nabla ^{2}_{\theta }f (\theta ) }\right\Vert _{L^{2}(U)}\right) \le C. \end{aligned}$$
-
For each \(x\in U\), \(f(x,\cdot )\) is globally Lipschitz in \(\theta \), with a Lipschitz constant that, as a function of x, belongs to \(L^{2}(U)\).
In terms of the associated cost function \(J(\cdot )\) we make the following Assumption 2.7.
Assumption 2.7
We assume that \(J(\cdot ) \in C^2\) and that it is globally Lipschitz.
We emphasize that Assumption 2.7 does not impose any convexity-type assumptions on \(J(\theta )\). The online adjoint algorithm satisfies the time-dependent PDEs
Notice that (2.11) is a non-local coupled system of PDEs. Theorem 2.8 establishes the well-posedness of system (2.11). Also, with a slight abuse of notation, for \(\theta =\theta (t)\) from (2.11) we shall denote the solutions to (2.7) and (2.9) by \(u^{*}(t,x)\) and \(\hat{u}^{*}(t,x)\), respectively.
Theorem 2.8
Assume Assumptions 2.1, 2.5 and 2.6 and that \(u_{0}, \hat{u}_{0}, h\in L^{2}(U)\). There exists a unique mild solution \(u,\hat{u}\in {\mathcal {C}}((0,\infty ); W^{2,2}_{0}(U))\cap {\mathcal {C}}^{1}((0,\infty );L^{2}(U))\) and \(\theta \in {\mathcal {C}}^{1}((0,\infty ))\) to equation (2.11). If in addition \(h\in L^{\infty }\), then \(u,\hat{u}\in {\mathcal {C}}((0,\infty ); W^{2,p}_{0}(U))\cap {\mathcal {C}}^{1}((0,\infty );L^{p}(U))\) for any \(p\ge 2\) and if further we assume that \(u_{0},\hat{u}_{0}\in W^{2,p}(U)\), then \(u,\hat{u}\in {\mathcal {C}}([0,\infty ); W^{2,p}_{0}(U))\cap {\mathcal {C}}^{1}([0,\infty );L^{p}(U))\) for any \(p\ge 2\). In addition, we have that there exists some constant \(K<\infty \) such that
Remark 2.9
At this point we mention that even though in Assumption 2.6 we have assumed that \(\left\Vert {f(\theta )}\right\Vert _{L^{2}(U)}\) is uniformly bounded, an investigation of the proof of Theorem 2.8 shows that this assumption can be relaxed. In particular, at the expense of slightly more elaborate estimates, the results of this paper (which heavily rely on (2.12) being true) hold if we assume instead that \(\left\Vert {f(\theta )}\right\Vert _{L^{2}(U)}\) grows linearly in \(\left\Vert {\theta }\right\Vert _{\ell _{2}}\) with bounded derivatives, i.e., \(\left\Vert {f(\theta )}\right\Vert _{L^{2}}\le C(1+\left\Vert {\theta }\right\Vert _{\ell _{2}})\), still with \(\left( \left\Vert {\nabla _{\theta }f(\theta )}\right\Vert _{L^{2}} +\left\Vert {\nabla ^{2}_{\theta }f(\theta ) }\right\Vert _{L^{2}(U)}\right) <C\) and additionally that the regularization coefficient \(\gamma >0\) is large enough depending on the \(L^{2}\) norms of \(u_{0}(x)\) and \(\hat{u}_{0}(x)\). We have chosen to present the results for uniformly bounded \(\left\Vert {f(\theta )}\right\Vert _{L^{2}(U)}\) for presentation purposes and because in this case we do not need any additional restriction on the magnitude of \(\gamma \) other than being strictly positive.
Let us conclude this section with the proofs of Lemma 2.4 and Theorem 2.8.
Proof of Lemma 2.4
(2.10) can be derived using the definition of the adjoint PDE (2.9). Define \(\tilde{u} = \nabla _{\theta } u^{*}\). Differentiating (2.7) yields
Integration by parts yields
Due to (2.7), this yields the equation
Using the definition of the adjoint PDE (2.9),
Recalling the objective function (2.8), we then write
which yields (2.10). \(\square \)
Proof of Theorem 2.8
Let us define the index set \({\mathcal {G}}=\{1,2,3\}\) and the space \(\Theta =U\times {\mathcal {G}}\). Define the variable \(y=(x,\zeta )\in \Theta \) and the measure \(dn=dx\otimes d\iota \) on \(\Theta \) where \(d\iota \) denotes the counting measure on \({\mathcal {G}}\). Define now the Banach space \(X^{2}=L^{2}(\Theta ,dn)\). Similarly we denote by \(H^{2}(\Theta )=W^{2,2}(\Theta )\) the Banach space of functions f on \(\Theta \) such that for each \(\zeta \in {\mathcal {G}}\) we have \(f(\cdot ,\zeta )\in H^{2}(U)\) with norm \(\left\Vert {f}\right\Vert _{H^{2}(\Theta )} =\sum _{\zeta =1}^{3}\left\Vert {f(\cdot ,\zeta )}\right\Vert _{H^{2}(U)}\).
Setting \(v(t,x)=(u(t,x),\hat{u}(t,x),\theta (t))\) and \(\rho (t,y)=v_{\zeta }(t,x)\) (the \(\zeta \)-th component of the vector v) for \(y=(x,\zeta )\in U\times {\mathcal {G}}\), we consider the evolution equation on \(\Theta \) given by
where
and
We note that we have slightly abused notation here because \({\mathcal {L}}[\rho ](t,x,3)=0\). However, this notation is convenient because it allows us to describe the PDE in question in the form (2.13) as a single vector valued evolution equation.
Let us define the norm \(\left\Vert {w}\right\Vert _{2,T}=\sup _{t\in [0,T]} \left\Vert {w(t)}\right\Vert _{L^{2}(U)}\) if \(w=w(t,x)\) and \(\left\Vert {w}\right\Vert _{2,T}=\sup _{t\in [0,T]} \left\Vert {w(t)}\right\Vert _{\ell _{2}}\) if \(w=w(t)\). Here \(\ell _2\) denotes the standard Euclidean norm. In particular, for \(v(t,x)=(u(t,x),\hat{u}(t,x),\theta (t))\) we shall have
where the first two components depend on (t, x) and the last component depends only on t.
We will be working with mild solutions. Due to Assumption 2.1, the operators A and \(A^{\dagger }\) are generators of analytic contraction semigroups \(\{S(t)\}_{t\ge 0}\) and \(\{S^{\dagger }(t)\}_{t\ge 0}\) respectively on \(L^{2}(U)\).
Then, we can write for the mild solution of (2.13) that
where
Now the properties of the analytic contraction semigroups \(\{S(t)\}_{t\ge 0}\) and \(\{S^{\dagger }(t)\}_{t\ge 0}\) guarantee that there exists an increasing continuous function D(t) with \(\lim _{t\rightarrow 0}D(t)=0\) (possibly different from line to line below) such that
and for \(\rho =(u_{1},\hat{u}_{1},\theta _{1})\) and \(q=(u_{2},\hat{u}_{2},\theta _{2})\)
The linear growth bounds of the operator H as given by (2.15), together with the local Lipschitz continuity property demonstrated in (2.16), allow us to conclude via the classical Picard–Lindelöf theorem for Banach space-valued ODEs that for every \(\rho _{0}\in X^{2}\) there exists a unique local mild solution \(\rho \in {\mathcal {C}}([0, T_0], X^{2})\) of (2.13) for some sufficiently small \(0<T_0<\infty \).
Next we want to show that this solution can be extended globally. To do so it is enough to establish a global bound on the \(X^{2}\)-norm of solutions. Indeed, the analytic contraction semigroups \(\{S(t)\}_{t\ge 0}\) and \(\{S^{\dagger }(t)\}_{t\ge 0}\) guarantee that there is a constant \(K<\infty \) (independent of t) such that
Analogously, for a potentially different constant \(K<\infty \) and using estimate (2.17)
Let us next show that \(\theta (t)\) is uniformly bounded in time. Define the quantity \(Q(t) = \left\Vert {(\nabla _{\theta } f (\theta (t)), \hat{u}(t) ) }\right\Vert ^{2}_2\), and notice that due to estimate (2.18) and the bound on \(\nabla _{\theta } f(\theta )\) by Assumption 2.6, we obtain that \(\sup _{t \ge 0} Q(t)<C\) for some appropriate constant \(C<\infty \).
By direct calculation, keeping in mind the equation that \(\theta (t)\) satisfies and Hölder's inequality, we obtain
By the comparison principle we then have that
and the result follows by requiring that for some \(C<\infty \) (see Assumption 2.5)
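Since the displayed inequalities are not reproduced above, the comparison-principle step can be sketched schematically as follows; the constants are illustrative, with \(C\) standing for \(\sup _{t\ge 0}Q(t)\).

```latex
% Schematic form of the comparison argument for the bound on theta(t).
% Hoelder and Young's inequality (2 a b <= gamma a^2 + b^2/gamma) give
\frac{d}{dt}\left\Vert \theta (t)\right\Vert _{2}^{2}
  \le 2\alpha (t)\sqrt{Q(t)}\,\left\Vert \theta (t)\right\Vert _{2}
      -2\gamma \alpha (t)\left\Vert \theta (t)\right\Vert _{2}^{2}
  \le \frac{\alpha (t)\,Q(t)}{\gamma }
      -\gamma \alpha (t)\left\Vert \theta (t)\right\Vert _{2}^{2},
% so that, by comparison and \sup_{t\ge 0} Q(t) \le C,
\left\Vert \theta (t)\right\Vert _{2}^{2}
  \le e^{-\gamma \int _{0}^{t}\alpha (r)\,dr}\left\Vert \theta (0)\right\Vert _{2}^{2}
      +\frac{C}{\gamma }\int _{0}^{t}\alpha (s)\,
        e^{-\gamma \int _{s}^{t}\alpha (r)\,dr}\,ds ,
% which is uniformly bounded in t by the second part of Assumption 2.5.
```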
All in all, we obtain the a priori global estimate
for some finite constant \(K<\infty \). With this a priori bound, the solution can be extended indefinitely in time; that is, there exists a unique global mild solution \(\rho \in {\mathcal {C}}([0, \infty ), X^{2})\).
Essentially the same argument as above shows that if the initial data are in \(L^{q}\) and \(h\in L^{q}\), for \(q>2\), then we will have that there is a unique global mild solution \(\rho \in {\mathcal {C}}([0, \infty ), L^{q}(\Theta ,dn))\).
Let us now discuss regularity. We will prove that for initial data and h in \(L^{q}\), we actually have that \(u,\hat{u}\in {\mathcal {C}}([0, \infty ), L^{p})\) for any \(p\in [q,\infty )\). We will use a bootstrap argument. Due to the Sobolev embedding theorem and the Riesz–Thorin interpolation theorem, the following \(L^{q}\rightarrow L^{p}\) estimate for the semigroup S(t) holds
where n is the spatial dimension, \(p\ge q\) and g is a test function. Then, let us consider \(p>q\) such that \(\frac{1}{n}=\frac{1}{q}-\frac{1}{p}\) and assume that we know \(\left\Vert {u}\right\Vert _{L^{q}}<\infty \). Consider an initial time \(t=\epsilon \) for some fixed \(\epsilon >0\); using (2.20) we have for u (we use the mild formulation of the solution)
Next consider the solution u starting at time \(t=2\epsilon \) with initial data \(u(2\epsilon )\in L^{p}\). Then, we will have that \(u\in {\mathcal {C}}([2\epsilon ,\infty ), L^{p})\). Notice now that \(\epsilon >0\) is arbitrary. Thus we obtain that \(u\in {\mathcal {C}}([0,\infty ),L^{p})\) for \(p\ge q\) such that \(\frac{1}{n}=\frac{1}{q}-\frac{1}{p}\). Using this argument inductively, first with \(q=2\) and \(p>2\) such that \(\frac{1}{n}=\frac{1}{2}-\frac{1}{p}\) we then get that \(u\in {\mathcal {C}}([0,\infty ), L^{p})\) for any \(p\in [2,\infty )\).
Following exactly the same process and using that \(u\in {\mathcal {C}}([0,\infty ), L^{p})\) for any \(p\in [2,\infty )\) we then obtain that if \(h\in L^{p}\) for all \(p\ge 2\), then \(\hat{u}\in {\mathcal {C}}([0,\infty ), L^{p})\) for any \(p\in [2,\infty )\) as well.
Next, we notice that the forcing term \({\mathcal {R}}\) in (2.13) is in \(L^{p}(U)\), so by the parabolic estimates in Sect. IV.3 of [22], we get that \(u,\hat{u}\in {\mathcal {C}}((0,\infty ); W^{2,p}_{0}(U))\cap {\mathcal {C}}^{1}((0,\infty );L^{p}(U))\) and if the initial data \(u_{0},\hat{u}_{0}\in W^{2,p}(U)\), then \(u,\hat{u}\in {\mathcal {C}}([0,\infty ); W^{2,p}_{0}(U))\cap {\mathcal {C}}^{1}([0,\infty );L^{p}(U))\) for any \(p\ge 2\). This concludes the proof of the theorem. \(\square \)
3 Convergence to a Stationary Point
The main result of this section is the convergence result for \(\theta (t)\). It says that as \(t\rightarrow \infty \), \(\theta (t)\) converges to a stationary point of the cost function \(J(\theta )\).
Theorem 3.1
Assume Assumptions 2.1, 2.5, 2.6 and 2.7. Then, we have that
The proof of Theorem 3.1 is a consequence of a series of lemmas. In Sect. 3.1 we establish necessary decay rates for the solution to (1.5). These results are then used in Sect. 3.2 to characterize the behavior of \(\theta (t)\) for large times and eventually prove Theorem 3.1.
3.1 Decay Rates for the Online Adjoint Algorithm (1.5)
In this subsection, we establish some necessary decay rates for the online adjoint algorithm (1.5).
First, Lemma 3.2 provides a critical bound on \(\left\Vert {\frac{\partial u^{*}}{\partial t } }\right\Vert _{H}\) and on \(\left\Vert {\frac{\partial \hat{u}^{*}}{\partial t } }\right\Vert _{H}\).
Lemma 3.2
Under Assumptions 2.1 and 2.6, there exists a constant \(C<\infty \) such that
and consequently \(\displaystyle \lim _{t \rightarrow \infty } \left\Vert { \frac{\partial u^{*}}{\partial t } }\right\Vert _{H}= \displaystyle \lim _{t \rightarrow \infty } \left\Vert { \frac{\partial \hat{u}^{*}}{\partial t }}\right\Vert _{H}=0\).
Proof
First, we study \(\left\Vert { \frac{\partial u^{*}}{\partial t } }\right\Vert _{H}\). We see that \(\frac{\partial u^{*}}{\partial t}(t,x)\) satisfies the PDE
By the coercivity of A from Assumption 2.1, we subsequently obtain
and the result follows directly by Assumption 2.6 on \(\nabla _{\theta }f\) and estimate (2.12).
Let us now turn our attention to \(\left\Vert { \frac{\partial \hat{u}^{*}}{\partial t } }\right\Vert _{H}\). By differentiation, we obtain that \(\frac{\partial \hat{u}^{*}}{\partial t }(t,x) \) satisfies the PDE
The result then follows by the coercivity condition on \(A^{\dagger }\) by Assumption 2.1 and due to the fact that \(\left\Vert {\frac{\partial u^{*}}{\partial t } }\right\Vert _{H}\le C \alpha (t)\). This completes the proof of the lemma. \(\square \)
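Since the displayed equations of this proof are not reproduced above, the coercivity estimate behind the bound \(\left\Vert {\partial _{t} u^{*}}\right\Vert _{H}\le C\alpha (t)\) can be sketched schematically as follows.

```latex
% Schematic coercivity estimate for Lemma 3.2 (constants illustrative).
% Differentiating A u^{*}(t,x) = f(x,\theta(t)) in t gives
A\,\frac{\partial u^{*}}{\partial t}(t,x)
   = \nabla _{\theta }f(x,\theta (t))\cdot \frac{d\theta }{dt}(t),
% and pairing with \partial_t u^{*} in H, coercivity from Assumption 2.1 yields
\lambda \left\Vert \frac{\partial u^{*}}{\partial t}\right\Vert _{H}^{2}
  \le \Big( A\frac{\partial u^{*}}{\partial t},\,
            \frac{\partial u^{*}}{\partial t}\Big)
  \le \left\Vert \nabla _{\theta }f(\theta (t))\right\Vert _{L^{2}(U)}
      \Big| \frac{d\theta }{dt}(t)\Big|
      \left\Vert \frac{\partial u^{*}}{\partial t}\right\Vert _{H}.
% Since |d\theta/dt| \le C\alpha(t) by (2.11) and the bound (2.12), this gives
\left\Vert \frac{\partial u^{*}}{\partial t}\right\Vert _{H}\le C\,\alpha (t).
```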
Let us consider the difference \(\phi (t,x) = u(t,x) - u^{*}(t,x)\).
\(\phi (t)\) satisfies the PDE
Lemma 3.3
Under Assumptions 2.1, 2.5 and 2.6 we have that
and there is some finite \(T^{*}<\infty \) such that for all \(t\ge T^{*}\)
where \(C<\infty \) is an unimportant constant.
Proof
We begin by proving that \(\frac{\partial u^{*}}{\partial t}(t)\) is globally Lipschitz in time. Differentiating the elliptic PDE that \(u^{*}\) satisfies twice with respect to t yields for \(x\in U\) and \(t\ge 0\)
Therefore, using the bounds on \(f(\theta ), \nabla f(\theta ), \theta (t), u(t),\) and \(\hat{u}(t)\), together with the coercivity in Assumption 2.1 and the Cauchy–Schwarz inequality as in Lemma 3.2, we can show that \(\displaystyle \sup _{t<\infty } \left\Vert {\frac{\partial ^2 u^{*}}{\partial t^2 }}\right\Vert _{H}\le C\). We provide the necessary calculations below for completeness.
where K is a constant. Re-arranging yields, for \(t \ge 0\),
where C is a constant. This, then gives
Therefore, we can write
where \(\left\Vert {S(t)}\right\Vert _{H} \le e^{- \lambda t }\) with \(\lambda > 0\) by Assumption 2.1.
Due to Lemma 3.2, for any \(\epsilon > 0\), there exists an s such that \(\left\Vert { \frac{\partial u^{*}}{\partial \tau } }\right\Vert _H < \epsilon \) for \(\tau > s\). By the triangle inequality,
Let us now define \(I(t)=\int _0^t e^{- \lambda (t - \tau ) } \alpha (\tau ) d \tau \). Next, it is easy to show that \(\lim _{t\rightarrow \infty }I(t)=0\) and in particular that the integral term goes to zero at the rate of \(\alpha (t)\) in the sense that \(\lim _{t\rightarrow \infty }\frac{I(t)}{\alpha (t)} =\frac{1}{\lambda }\). These observations imply that there is a finite \(T^{*}<\infty \) such that for all \(t\ge T^{*}\), we have that \(I(t)\le C \alpha (t)\) for some constant \(C<\infty \).
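The claimed limit \(\lim _{t\rightarrow \infty }\frac{I(t)}{\alpha (t)}=\frac{1}{\lambda }\) can be sketched via L'Hôpital's rule, using \(\frac{\alpha '(t)}{\alpha (t)}\rightarrow 0\) from Assumption 2.5.

```latex
% Sketch of the limit I(t)/alpha(t) -> 1/lambda.  Write
\frac{I(t)}{\alpha (t)}
   = \frac{\int _{0}^{t}e^{\lambda \tau }\alpha (\tau )\,d\tau }
          {e^{\lambda t}\alpha (t)},
% where both numerator and denominator diverge as t -> infinity, so that
% L'Hopital's rule applies:
\lim _{t\rightarrow \infty }\frac{I(t)}{\alpha (t)}
   = \lim _{t\rightarrow \infty }
     \frac{e^{\lambda t}\alpha (t)}
          {e^{\lambda t}\big(\lambda \alpha (t)+\alpha '(t)\big)}
   = \lim _{t\rightarrow \infty }\frac{1}{\lambda +\alpha '(t)/\alpha (t)}
   = \frac{1}{\lambda }.
```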
Hence we indeed get that both (3.5) and (3.6) hold, concluding the proof of the lemma. \(\square \)
Define \(\Psi (t,x) = \hat{u} (t,x) - \hat{u}^{*}(t,x)\). A similar lemma can also be proven for the limit of \(\Psi (t)\).
Lemma 3.4
Under Assumptions 2.1, 2.5 and 2.6 we have that
and there is some finite \(T^{*}<\infty \) such that for all \(t\ge T^{*}\)
where \(C<\infty \) is an unimportant constant.
Proof
\(\Psi (t)\) satisfies the PDE
Exactly as in Lemma 3.3, we can show that \(t\mapsto \frac{\partial \hat{u}^{*}}{\partial t}(t)\) is globally Lipschitz. Together with Assumption 2.1, we write
where \(S^{\dagger }(t)\) is the analytic contraction semigroup generated by \((-A^{\dagger })\), satisfying \(\left\Vert {S^{\dagger }(t)}\right\Vert _{H} \le e^{- \lambda t }\) with \(\lambda > 0\). Using the same reasoning as in Lemma 3.3, and the fact that \(\lim _{t\rightarrow \infty }\left( \left\Vert {\phi (t)}\right\Vert _{H}+\left\Vert {\frac{\partial \hat{u}^{*}}{\partial t}(t)}\right\Vert _{H}\right) =0\), we can prove (3.8). The decay rates for \(\left\Vert {\phi (t)}\right\Vert _{H}\) and \(\left\Vert {\frac{\partial \hat{u}^{*}}{\partial t}(t)}\right\Vert _{H}\) from Lemmas 3.3 and 3.2, respectively, prove (3.9) in the same way that (3.6) was proven. This concludes the proof of the lemma. \(\square \)
3.2 Proof of Theorem 3.1
Let us now return to the equation for \(\theta (t)\).
The second term on the RHS will converge to zero as \(t \rightarrow \infty \) due to (3.8). This means that asymptotically \(\theta \) will be updated in the direction of steepest descent. Theorem 3.1 rigorously proves this along with proving convergence of \(\theta (t)\) to a critical point of \(J(\theta )\).
The structure of the proof proceeds in a spirit similar to [29] with certain differences that will be highlighted below as needed. For completeness, we present the whole argument with the proper adjustments.
Let \(\epsilon >0\) be given and let \(\mu =\mu (\epsilon )>0\) to be chosen later on. We define the following cycle of times
where for \(k=1,2,\ldots \)
Essentially, the sequences of times \(\{\sigma _k\}_{k\in {\mathbb {N}}}\) and \(\{\tau _k\}_{k\in {\mathbb {N}}}\) keep track of the times at which \(\left\Vert {\nabla J(\theta (t))}\right\Vert \) exits and re-enters a ball of radius \(\epsilon \).
Next, let us define the corresponding intervals of time \(J_{k}=[\sigma _{k-1},\tau _{k})\) and \(I_{k}=[\tau _{k},\sigma _{k})\). Clearly, when \(t\in J_{k}\) we have \( \Vert \nabla J(\theta (t))\Vert <\epsilon \).
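Since the displayed definitions of \(\tau _{k}\) and \(\sigma _{k}\) are abbreviated here, the following sketch implements one plausible discrete version of this bookkeeping on a sampled trajectory; the exit/re-entry rule and the \(\mu \)-budget cutoff are our own reading of the construction (informed by Lemma 3.6, which bounds \(\int _{\tau _{k}}^{\sigma _{k}}\alpha (s)ds\) by \(\mu \)), not the paper's exact definitions:

```python
def stopping_times(t_grid, grad_norms, alphas, eps, mu):
    """Extract the cycle of times sigma_0 = 0 <= tau_1 < sigma_1 <= tau_2 < ...
    from a sampled trajectory.  tau_k is the first time after sigma_{k-1} at
    which ||grad J|| leaves the ball of radius eps; sigma_k is the first later
    time at which either ||grad J|| re-enters the ball or the learning-rate
    budget  int_{tau_k}^t alpha(s) ds  reaches mu."""
    taus, sigmas = [], []
    k, n = 0, len(t_grid)
    while k < n:
        # advance to the next exit time tau_k
        while k < n and grad_norms[k] < eps:
            k += 1
        if k == n:
            break
        taus.append(t_grid[k])
        budget = 0.0
        # advance to the next re-entry (or budget-exhaustion) time sigma_k
        while k + 1 < n and grad_norms[k] >= eps and budget < mu:
            budget += alphas[k] * (t_grid[k + 1] - t_grid[k])
            k += 1
        sigmas.append(t_grid[k])
    return taus, sigmas

# synthetic trajectory: the gradient norm leaves the eps-ball twice
ts = list(range(10))
gn = [0.5, 1.5, 1.2, 0.4, 0.3, 2.0, 1.8, 0.2, 0.1, 0.05]
taus, sigmas = stopping_times(ts, gn, [0.1] * 10, eps=1.0, mu=10.0)
assert taus == [1, 5] and sigmas == [3, 7]
```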
Let us now go back to (3.11) and define the integral term
We first show that \(\Delta _{\tau _{k},\sigma _{k}}\) decays to zero. In particular, we have the following lemma.
Lemma 3.5
Assume that Assumptions 2.1, 2.5 and 2.6 hold. Let us fix some \(\eta >0\). Then, we have that \(\lim _{k\rightarrow \infty } \left\Vert {\Delta _{\tau _k,\sigma _{k}+\eta }}\right\Vert _{2}=0\).
Proof of Lemma 3.5
We notice that for some constant \(C<\infty \) that may change from line to line
where the boundedness of \(\left\Vert { \nabla _{\theta } f ( \theta )}\right\Vert _{H}\) together with the decay rates from Lemma 3.4 were used. This immediately proves that \(\lim _{k\rightarrow \infty } \left\Vert {\Delta _{\tau _k,\sigma _{k}+\eta }}\right\Vert _{2}=0\) concluding the proof of the lemma. \(\square \)
Lemma 3.6
Assume that Assumptions 2.1, 2.5 and 2.6 hold. Let \(L_{\nabla J}\) denote the Lipschitz constant of \(\nabla J\). For given \(\epsilon >0\), let \(\mu \) be such that \(3\mu +\frac{\mu }{8\epsilon }=\frac{1}{2 L_{\nabla J}}\). Then, for k large enough and for \(\eta >0\) small enough (potentially depending on k), one has \(\int _{\tau _{k}}^{\sigma _{k}+\eta }\alpha (s)ds>\mu \). In addition, we also have \(\frac{\mu }{2}\le \int _{\tau _{k}}^{\sigma _{k}} \alpha (s)ds\le \mu \).
Proof of Lemma 3.6
The proof proceeds by contradiction. Let us assume that \(\int _{\tau _{k}}^{\sigma _{k}+\eta }\alpha (s)ds\le \mu \) and let \(\delta >0\) be such that \(\delta <\mu /8\). In addition, without loss of generality, we can assume that for the given k, \(\eta \) is small enough that for any \(s\in [\tau _{k},\sigma _{k}+\eta ]\) one has \(\left\Vert {\nabla J(\theta (s))}\right\Vert _{2}\le 3\left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2}\).
Then, invoking (3.11) we have
By Lemma 3.5 we have that for k large enough, \(\left\Vert {\Delta _{\tau _k,\sigma _{k}+\eta }}\right\Vert _{2}\le \delta <\mu /8\). In addition, we also have by definition that \(\frac{\epsilon }{\left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2}}\le 1\). The combination of these two results gives
This means that
Then, this would yield
However, this yields a contradiction: it would imply \(\sigma _{k}+\eta \in [\tau _{k},\sigma _{k}]\), which is impossible since \(\eta >0\). Hence \(\int _{\tau _{k}}^{\sigma _{k}+\eta }\alpha (s)ds>\mu \), which proves the first part of the lemma. For the second part, by definition we have \(\int _{\tau _{k}}^{\sigma _{k}} \alpha (s)ds\le \mu \), so it remains to show that \(\int _{\tau _{k}}^{\sigma _{k}} \alpha (s)ds\ge \mu /2\). We have shown that \(\int _{\tau _{k}}^{\sigma _{k}+\eta }\alpha (s)ds> \mu \), and for k large enough and \(\eta \) small enough we can ensure that \(\int _{\sigma _{k}}^{\sigma _{k}+\eta }\alpha (s)ds\le \mu /2\). The conclusion then follows. This concludes the proof of the lemma. \(\square \)
Lemma 3.7
Assume that Assumptions 2.1, 2.5 and 2.6 hold. Assume that there exists an infinite number of intervals \(I_{k}=[\tau _{k},\sigma _{k})\). Then, there is a fixed \(\zeta _{1}>0\) that depends on \(\epsilon \) such that for k large enough
Proof of Lemma 3.7
By the chain rule we have that
Let us first consider \(M_{1,k}=- \int _{\tau _{k}}^{\sigma _{k}} \alpha (\rho ) \left\Vert { \nabla J (\theta (\rho )) }\right\Vert _2 d\rho \). For \(\rho \in [\tau _k,\sigma _k]\) we have that \( \frac{\left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2}}{2}\le \left\Vert {\nabla J(\theta (\rho ))}\right\Vert _{2}\le 2\left\Vert {\nabla J(\theta (\tau _{k}))}\right\Vert _{2} \). Thus, for sufficiently large k, we have by Lemma 3.6
Next, we address \(M_{2,k}=\int _{\tau _{k}}^{\sigma _{k}} \alpha (\rho ) \nabla J (\theta (\rho ))\cdot \left( \nabla _{\theta } f (\theta (\rho )), \Psi (\rho ) \right) d\rho \). Let us define
Clearly, we have that \(M_{2,k}=\hat{M}_{\tau _{k},\sigma _{k}}\).
We claim that \(\sup _{\rho \ge 0} \left\Vert {\nabla J (\theta (\rho ))}\right\Vert _{2}<\infty \). For this purpose, we shall use the representation of \(\nabla J(\theta )\) by (2.10) together with the a priori \(H\)-norm bounds for \(\hat{u}^{*}\) and \(\nabla _{\theta } f(x,\theta )\), as well as the uniform bound on \(\sup _{\rho \ge 0}\left\Vert {\theta (\rho )}\right\Vert \) by (2.12). These imply that indeed \(\sup _{\rho \ge 0} \left\Vert {\nabla J (\theta (\rho ))}\right\Vert _{2}<\infty \).
Then, we have for some constant \(C<\infty \) that may change from line to line
where we used the decay rate bound from Lemma 3.4. The latter then implies that \(M_{2,k}\rightarrow 0\) as \(k\rightarrow \infty \).
Putting the above together, we get for k large enough such that \(|M_{2,k}|\le \delta <\frac{\mu }{16}\epsilon ^{2}\)
Setting \(\zeta _1=\frac{\mu }{8}\epsilon ^{2}\) we conclude the proof of the lemma. \(\square \)
Lemma 3.8
Assume that Assumptions 2.1, 2.5 and 2.6 hold. Assume that there exists an infinite number of intervals \(I_{k}=[\tau _{k},\sigma _{k})\). Then, there is a fixed \(0<\zeta _{2}<\zeta _{1}\) such that for k large enough
Proof of Lemma 3.8
Recall that \(\left\Vert {\nabla J(\theta (t))}\right\Vert _{2}\le \epsilon \) for \(t\in J_{k}=[\sigma _{k-1},\tau _{k})\). By the chain rule we have
As in the proof of Lemma 3.7 we get that for k large enough, the right hand side of the last display can be arbitrarily small. This concludes the proof of the lemma. \(\square \)
Now we can conclude the proof of Theorem 3.1.
Proof of Theorem 3.1
Fix an \(\epsilon >0\). If there are only finitely many \(\tau _{k}\), then there is a finite \(T^{*}\) such that \(\left\Vert {\nabla J(\theta (t))}\right\Vert _{2}<\epsilon \) for \(t\ge T^{*}\), which proves the theorem. Hence, it suffices to prove that there can only be finitely many \(\tau _{k}\). Assume, for the sake of contradiction, that there are infinitely many \(\tau _{k}\). By Lemmas 3.7 and 3.8 we have for sufficiently large k that
with \(0<\zeta _2<\zeta _1\). Let N be large enough that the above relations hold simultaneously. Then we have
Letting \(n \rightarrow \infty \), we get that \(J(\theta (\tau _{n+1})) \rightarrow - \infty \), which is a contradiction, since by definition \(J(\theta ) \ge 0\). Thus, there must be at most finitely many \(\tau _k\). This concludes the proof of the theorem. \(\square \)
4 Convergence Rate in the Strongly Convex Case
We will now prove a convergence rate for \(\theta (t)\). First, we need to strengthen the assumptions on the cost function \(J(\cdot )\). Namely, we now assume that \(J(\theta )\) is strongly convex.
Assumption 4.1
We assume the following conditions:
-
\(J(\cdot ) \in C^2\).
-
There exists a unique global minimum \(\theta ^{*}\) where \(\nabla _{\theta } J(\theta ^{*}) = 0\).
-
\(H(\theta ) =\nabla _{\theta \theta } J(\theta )\) is globally Lipschitz and \(H(\theta ^{*})\) is positive definite. That is, there exists a constant \(q > 0\) such that
$$\begin{aligned} \xi ^{\top } H (\theta ^{*})\xi \ge q \left\Vert {\xi }\right\Vert _{2}^2. \end{aligned}$$(4.1) -
The learning rate satisfies \(\alpha (t) =C_{\alpha } a(t)\) where the learning rate magnitude \(C_{\alpha }\) is selected such that \(C_{\alpha } q > 1\). The learning rate function a(t) satisfies \(\displaystyle \lim _{t\rightarrow \infty } a(t)=0\) and
$$\begin{aligned} \int _{0}^{\infty } a(s)ds&=\infty \text { and } \int _{0}^{\infty } a^{2}(s)ds<\infty .\nonumber \\ \sup _{t\ge 0} \int _{0}^{t} a(s)e^{-\gamma \int _{s}^{t}a(r)dr}ds&<\infty \text { and } \lim _{t\rightarrow \infty }\frac{a'(t)}{a(t)}=0. \end{aligned}$$An example is \(a(t) = \frac{1}{1 + t}\).
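For the example \(a(t)=\frac{1}{1+t}\), the condition \(\sup _{t\ge 0}\int _{0}^{t} a(s)e^{-\gamma \int _{s}^{t}a(r)dr}ds<\infty \) can even be verified in closed form: since \(\int _{s}^{t}a(r)dr=\log \frac{1+t}{1+s}\), one obtains \(\int _{0}^{t} a(s)e^{-\gamma \int _{s}^{t}a(r)dr}ds=\frac{1-(1+t)^{-\gamma }}{\gamma }\le \frac{1}{\gamma }\). The following sketch (with the illustrative choice \(\gamma =1\)) confirms this numerically:

```python
import math

def a(t):
    """Learning-rate function a(t) = 1/(1+t)."""
    return 1.0 / (1.0 + t)

def weighted_integral(t, gamma, n=20000):
    """Numerically evaluate  int_0^t a(s) * exp(-gamma * int_s^t a(r) dr) ds
    for a(t) = 1/(1+t); here int_s^t a(r) dr = log((1+t)/(1+s))."""
    h = t / n
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * h                      # midpoint rule
        inner = math.log((1.0 + t) / (1.0 + s))
        total += a(s) * math.exp(-gamma * inner) * h
    return total

gamma = 1.0
vals = [weighted_integral(t, gamma) for t in (1.0, 10.0, 100.0, 1000.0)]
# closed form: (1 - (1+t)^(-gamma)) / gamma, so the sup over t is 1/gamma
assert all(v <= 1.0 / gamma + 1e-6 for v in vals)
assert abs(vals[-1] - (1.0 - 1.0 / 1001.0)) < 1e-3
```

The remaining conditions hold analytically for this choice: \(\int _{0}^{\infty }\frac{ds}{1+s}=\infty \), \(\int _{0}^{\infty }\frac{ds}{(1+s)^{2}}=1<\infty \), and \(a'(t)/a(t)=-\frac{1}{1+t}\rightarrow 0\).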
In this section we will assume without loss of generality that the learning rate is \(\alpha (t) = \frac{1}{1+t}\). Even though this specific choice of the learning rate is not necessary for the results to hold, it will simplify the derivation of the convergence rate. The main result of this section is Theorem 4.2.
Theorem 4.2
Let us assume that \(\alpha (t)=1/(1+t)\). In addition, assume that Assumptions 2.1, 2.6 and 4.1 hold. Then we have that there exists a time \(0<t_{0}<\infty \), such that for all \(t\ge t_{0}\)
The proof of this theorem will be a consequence of a series of lemmas. Let us recall the function \(\phi (t,x) = u(t,x) -u^{*}(t,x)\) satisfying (3.4)
Recall also from Lemma 3.2 that \(t\mapsto \frac{\partial u^{*}}{\partial t}\) is globally Lipschitz and that \(\left\Vert {\frac{\partial u^{*}}{\partial t}}\right\Vert _{H}\le C\frac{1}{1+t}<C\frac{1}{t}\).
Then, we have the following lemma.
Lemma 4.3
Consider the setting of Theorem 4.2. We have that there is a finite constant \(C<\infty \) such that
In addition, there exists a \(t_0 >0\) such that for all \(t \ge t_0\) and any \(0< p < 1\),
Proof of Lemma 4.3
For notational convenience we set below \(Y(t) = \left( \phi (t), \phi (t) \right) \). First we calculate
For \(\epsilon >0\) to be chosen later on and using the coercivity Assumption 2.1, we then have the inequality (omitting the argument t for notational convenience)
where we have used Young’s inequality. Here \(b_{\epsilon } =\lambda - \frac{\epsilon ^2}{2} \) and \(C_{\epsilon } =\frac{C}{2\epsilon ^{2}}\). We can select \(\epsilon \) such that \(b_{\epsilon } > 0\).
Denoting now for notational convenience \(b=b_{\epsilon }\) and with some abuse of notation setting \(C=C_{\epsilon }\), let’s construct the ODE
Define \(\xi = Y - v\). Then, we have that \(\xi (1) = 0\) and for \(t \ge 1\),
By Gronwall’s inequality \(\xi \le 0\) and therefore \(Y \le v\). If we can establish a convergence rate for v, we then have a convergence rate for Y.
The solution v is
We know that
Therefore, for a finite constant \(C<\infty \) we have that
We also consequently know that there exists a \(t_0 >0\) such that for all \(t \ge t_0\) and any \(0< p < 1\),
Therefore, for all \(t \ge t_0\),
concluding the proof of the lemma. \(\square \)
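The ODE comparison step used above (a differential inequality \(Y' \le -bY + g\) is dominated by the solution of the corresponding equality with the same initial data) can be checked numerically. In the sketch below, the constant \(b\), the forcing \(C/t^{2}\), and the nonnegative slack term are illustrative choices of our own, not the constants of the lemma:

```python
def euler_compare(b=0.5, C=1.0, t0=1.0, t1=200.0, dt=1e-3):
    """Integrate  Y' = -b*Y + C/t**2 - slack(t)  with slack >= 0 (so that Y
    satisfies the differential *inequality*  Y' <= -b*Y + C/t**2) alongside
    the comparison solution  v' = -b*v + C/t**2  with v(t0) = Y(t0), by
    explicit Euler.  Returns (max over the grid of Y - v, final t**2 * v)."""
    Y = v = 1.0
    t = t0
    max_gap = float("-inf")
    n = int((t1 - t0) / dt)
    for _ in range(n):
        forcing = C / t**2
        slack = 0.3 * forcing          # any nonnegative slack works
        Y += dt * (-b * Y + forcing - slack)
        v += dt * (-b * v + forcing)
        t += dt
        max_gap = max(max_gap, Y - v)
    return max_gap, t**2 * v

gap, scaled_v = euler_compare()
assert gap <= 1e-12    # Y never exceeds the comparison solution v
assert scaled_v < 10.0  # v decays at roughly the quasi-static C/(b*t**2) rate
```

As in the proof, a convergence rate for \(v\) then transfers directly to \(Y\).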
Let us now recall that \(\Psi (t,x) = \hat{u} (t,x) - \hat{u}^{*}(t,x)\). Next, let’s prove a convergence rate for \(\Psi \).
Lemma 4.4
Consider the setting of Theorem 4.2. Then, we have that
for some finite constant \(C<\infty \).
Proof of Lemma 4.4
We first calculate that
Define \(W(t) = t^2 \left( \Psi (t), \Psi (t) \right) \). Then, omitting for notational convenience the time argument, we have
where we used the assumed coercivity of \(A^{\dagger }\) (consequence of Assumption 2.1). Let’s select a \(t_0 > 1\) such that \(t_0^{-1}< \delta \ll \lambda \). Then, for \(t \ge t_0\) and for a constant \(C<\infty \) that may change from line to line,
where we have chosen an \(\epsilon > 0\) such that \(b = \lambda -\delta - C \epsilon ^2 > 0\).
Let’s construct the ODE
which then, by (4.3) and Lemma 3.2, satisfies
Therefore, using the same ODE comparison principle as before, we have the bound
which concludes the Proof of the Lemma. \(\square \)
We now present the Proof of Theorem 4.2 on the convergence rate for \(\theta \).
Proof of Theorem 4.2
Recall that \(H(\theta ) = \nabla _{\theta \theta } J(\theta )\) is the Hessian matrix. At the stationary point \(\theta ^{*}\), \(H(\theta ^{*})\) is positive definite, i.e. there exists some constant \(q > 0\) such that \(\xi ^{\top } H (\theta ^{*})\xi \ge q \left\Vert {\xi }\right\Vert _{2}^2\).
Since convergence has already been established in Theorem 3.1, we know that \(\displaystyle \lim _{t \rightarrow \infty } \theta (t) = \theta ^{*}\), where \(\nabla J(\theta ^{*})=0\). The parameter updates satisfy
where \(\bar{\theta }(t) \in [ \theta (t), \theta ^{*} ]\), \(\nabla _{\theta } H(\bar{\theta }(t))_{j,k} \in {\mathbb {R}}^d\), and the summation convention \(\nabla _{\theta } H(\bar{\theta }(t))_{j,k} ( \theta (t) -\theta ^{*} )_k ( \theta (t) - \theta ^{*} )_j = \displaystyle \sum _{k,j = 1}^d \nabla _{\theta } H(\bar{\theta }(t))_{j,k} (\theta (t) - \theta ^{*} )_k ( \theta (t) - \theta ^{*} )_j\) is used. Here \(H(\theta )\) is the Hessian matrix and \(H(\theta )_{j,k}\) its (j, k)-th element.
Define
V satisfies the ODE
Since \(\displaystyle \lim _{t \rightarrow \infty } \theta (t) =\theta ^{*}\), there exists a \(t_0\) such that for all \(t \ge t_0\)
Therefore, for \(t\ge t_{0}\) large enough, we have
where we have used Young’s inequality and the Cauchy–Schwarz inequality. Here we have set \(C=\sup _{\theta \in {\mathbb {R}}^{d}}\left\Vert {\nabla _{\theta } f ( \theta ) }\right\Vert ^{2}_{L^{2}(U)}<\infty \) by Assumption 2.6.
Let’s select an \(\epsilon \) small enough such that \(b = b_0 -\epsilon ^2 > 0\). Then,
Let \(\hat{q}\) satisfy the ODE
The following comparison principle holds
Using an integrating factor and \(\displaystyle \limsup _{t \rightarrow \infty } t^2 \left( \Psi , \Psi \right) \le C\) by Lemma 4.4, we have that
where the constant \(C_2\) may change from line to line and we have used the assumption \(C_{\alpha } b > 1\). This concludes the convergence rate proof of \(\theta (t)\). \(\square \)
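To make the rate of Theorem 4.2 concrete, the following toy simulation (our own illustration: a scalar quadratic objective, so \(\theta ^{*}=0\), with an artificial \(1/(1+t)\) perturbation standing in for the decaying adjoint error \(\Psi \)) integrates the parameter update by explicit Euler and checks that \((1+t)\Vert \theta (t)-\theta ^{*}\Vert _{2}^{2}\) stays bounded:

```python
def online_adjoint_toy(theta0=1.0, q=1.0, C_alpha=2.0, eps_pert=0.1,
                       t_end=100.0, dt=1e-3):
    """Gradient flow  theta' = -alpha(t) * (J'(theta) + b(t))  for the toy
    objective J(theta) = q*theta**2/2 (so theta* = 0), with learning rate
    alpha(t) = C_alpha/(1+t), C_alpha*q > 1, and perturbation
    b(t) = eps_pert/(1+t) mimicking the decaying adjoint error Psi.
    Returns the scaled squared error (1 + t_end) * |theta(t_end) - theta*|**2."""
    theta, t = theta0, 0.0
    n = int(t_end / dt)
    for _ in range(n):
        alpha = C_alpha / (1.0 + t)
        grad = q * theta + eps_pert / (1.0 + t)   # exact gradient + error
        theta -= dt * alpha * grad
        t += dt
    return (1.0 + t) * theta**2

# the scaled error stays bounded (here, small), consistent with the
# O(1/t) rate of Theorem 4.2
assert online_adjoint_toy() < 0.05
```

Note the role of \(C_{\alpha } q > 1\): with \(C_{\alpha }=2\), \(q=1\), the homogeneous part decays like \((1+t)^{-2}\), so the perturbation-driven term \(O(1/(1+t))\) dominates, giving exactly the \(O(1/t)\) rate for the squared error.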
References
Benveniste, A., Metivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations. Springer, New York (2012)
Brandenburg, C., Lindemann, F., Ulbrich, M., Ulbrich, S.: A continuous adjoint approach to shape optimization for Navier Stokes flow. In: Optimal Control of Coupled Systems of Partial Differential Equations, Internat. Ser. Numer. Math., vol. 158, pp. 35–56. Birkhäuser Verlag, Basel (2009)
Bosse, T., Gauger, N.R., Griewank, A., Günther, S., Schulz, V.: One-shot approaches to design optimization. Int. Ser. Numer. Math. 165, 43–66 (2014)
Bueno-Orovio, A., Castro, C.C., Palacios, F., Zuazua, E.: Continuous adjoint approach for the Spalart-Allmaras model in aerodynamic optimization. AIAA J. 50(3), 631–646 (2012)
Cagnetti, F., Gomes, D., Tran, H.: Adjoint methods for obstacle problems and weakly coupled systems of PDE. ESAIM 19(3), 754–779 (2013)
Duta, M., Giles, M., Campobasso, M.: The harmonic adjoint approach to unsteady turbomachinery design. Int. J. Numer. Methods Fluids 40(3–4), 323–332 (2002)
Gauger, N., Griewank, A., Hamdi, A., Kratzenstein, C., Özkaya, E., Slawig, T.: Automated extension of fixed point PDE solvers for optimal design with bounded retardation. In: Leugering, G., Engell, S., Griewank, A., Hinze, M., Rannacher, R., Schulz, V., Ulbrich, M., Ulbrich, S. (eds.) Constrained Optimization and Optimal Control for Partial Differential Equations, pp. 99–122. Springer, Basel (2012)
Gilbarg, D., Trudinger, N.S.: Elliptic Partial Differential Equations of Second Order, 2nd edn. Springer, Berlin (2001)
Giles, M., Pierce, N.: An introduction to the adjoint approach to design. Flow Turbul. Combust. 65, 393–415 (2000)
Giles, M., Ulbrich, S.: Convergence of linearized and adjoint approximations for discontinuous solutions of conservation laws. Part 1: Linearized approximations and linearized output functionals. SIAM J. Numer. Anal. 48(3), 882–904 (2010)
Giles, M., Ulbrich, S.: Convergence of linearized and adjoint approximations for discontinuous solutions of conservation laws. Part 2: Adjoint approximations and extensions. SIAM J. Numer. Anal. 48(3), 905–921 (2010)
Günther, S., Gauger, N.R., Wang, Q.: Simultaneous single-step one-shot optimization with unsteady PDEs. J. Comput. Appl. Math. 294, 12–22 (2016)
Hazra, S.B.: Direct treatment of state constraints in aerodynamic shape optimization using simultaneous pseudo-time-stepping. AIAA J. 45(8), 1988–1997 (2007)
Hazra, S.B., Schulz, V.: Simultaneous pseudo-timestepping for PDE-model based optimization problems. Bit Numer. Math. 44(3), 457–472 (2004)
Hinze, M., Pinnau, R., Ulbrich, M., Ulbrich, S.: Optimization with PDE constraints. In: Mathematical Modelling: Theory and Applications, vol. 23. Springer, New York (2009)
Jameson, A.: Aerodynamic Shape Optimization Using the Adjoint Method. Lectures at the Von Karman Institute, Brussels (2003)
Jameson, A., Martinelli, L., Pierce, N.: Optimum aerodynamic design using the Navier-Stokes equations. Theoret. Comput. Fluid Dyn. 10(1), 213–237 (1998)
Jameson, A., Kim, S.: Reduction of the adjoint gradient formula in the continuous limit. In: 41st Aerospace Sciences Meeting and Exhibit, p. 40 (2003)
Kaland, L., De Los Reyes, J.C., Gauger, N.R.: One-shot methods in function space for PDE-constrained optimal control problems. Optim. Methods Softw. 29(2), 376–405 (2014)
Knopoff, D.A., Fernández, D.R., Torres, G.A., Turner, C.V.: Adjoint method for a tumor growth PDE-constrained optimization problem. Comput. Math. Appl. 66, 1104–1119 (2013)
Kushner, H., Yin, G.: Stochastic Approximation and Recursive Algorithms and Applications, 2nd edn. Springer, New York (2003)
Ladyženskaja, O.A., Solonnikov, V.A., Ural’ceva, N.N.: Linear and quasilinear equations of parabolic type. American Mathematical Society, Providence (1968). (Translations of Mathematical Monographs, vol. 23, by S. Smith)
MacArt, J.F., Sirignano, J., Panesi, M.: Deep learning closure of the Navier-Stokes equations for transitional flows. In: AIAA SciTech (2021)
Nadarajah, S., Jameson, A.: A comparison of the continuous and discrete adjoint approach to automatic aerodynamic optimization. In: 38th Aerospace Sciences Meeting and Exhibit, p. 667 (2000)
Pazy, A.: Semigroups of Linear Operators and Applications to Partial Differential Equations. Springer, New York (1983)
Pierce, N., Giles, M.: Adjoint recovery of superconvergent functionals from PDE approximations. SIAM Rev. 42(2), 247–264 (2000)
Protas, B.: Adjoint-based optimization of PDE systems with alternative gradients. J. Comput. Phys. 227(13), 6490–6510 (2008)
Reuther, J., Jameson, A., Farmer, J., Martinelli, L., Saunders, D.: Aerodynamic shape optimization of complex aircraft configurations via an adjoint formulation. In: 34th Aerospace Sciences Meeting and Exhibit, p. 94 (1996)
Sirignano, J., Spiliopoulos, K.: Stochastic gradient descent in continuous time. SIAM J. Financ. Math. 8(1), 933–961 (2017)
Sirignano, J., MacArt, J.F., Spiliopoulos, K.: PDE-constrained models with neural network terms: optimization and global convergence, arXiv:2105.08633 (2021)
Ta’asan, S.: One shot methods for optimal control of distributed parameter systems 1: finite dimensional control. Technical report, DTIC Document (1991)
Ta’asan, S.: Pseudo-time methods for constrained optimization problems governed by PDE, ICASE Report No. 95-32 (1995)
Funding
The authors have not disclosed any funding.
Ethics declarations
Conflict of interest
The authors have not disclosed any competing interests.
K. Spiliopoulos was partially supported by the National Science Foundation (DMS 1550918, DMS 2107856) and Simons Foundation Award 672441.
Cite this article
Sirignano, J., Spiliopoulos, K. Online Adjoint Methods for Optimization of PDEs. Appl Math Optim 85, 18 (2022). https://doi.org/10.1007/s00245-022-09852-5