Introduction

In this section, we study two methods for obtaining second derivatives of a functional subject to a state equation. We introduce analytic formulas which can then be used in combination with a non-global application of an Automatic Differentiation (AD) tool like TAPENADE [7]. This second step is not addressed here but will be presented in detail in the chapter “Use of Automatic Differentiation Tools at the Example of TAPENADE”.

More precisely, we recall that we are interested in obtaining the first and second derivatives of a functional j depending on \(\gamma \in \mathbb {R}^n\) and expressed in terms of a state \(W \in \mathbb {R}^N\) as follows:

$$\begin{aligned} \left\{ \begin{aligned} \psi (\gamma )&= \varPsi (\gamma ,W(\gamma )) = 0 \\ j(\gamma )&= J(\gamma ,W(\gamma )). \end{aligned} \right. \end{aligned}$$
(1)

We observe that \(\gamma \rightarrow W(\gamma )\) is a function implicitly defined by the state equation \(\varPsi (\gamma ,W(\gamma ))=0\), and that the functional \(j(\gamma ) = J(\gamma ,W(\gamma ))\) is evaluated from the solution of the state equation corresponding to \(\gamma \). We then distinguish two different points of view:

  • Implicit differentiation: it consists in differentiating directly the implicit function j as a function of the control variable \(\gamma \). This means that the entire program, including the solution algorithm for the state equation, is differentiated as a whole.

  • Differentiation of explicit parts: the second point of view is to apply differentiation only to the routines which compute explicit functions, that is the functions \(\varPsi \) and J, and then use one of the available composite derivative formulas in order to get \(d j / d \gamma \).

Also, two different composite derivative formulas can be applied. In both cases, we consider a program computing a function \(\varPhi = \phi _p \circ \phi _{p-1} \circ \cdots \circ \phi _1\) from an input u, with intermediate results \(w_0 = u\) and \(w_k = \phi _k(w_{k-1})\), so that the output is \(v = \varPhi (u) = w_p\):

  • The Tangent mode applied to a program computing \(\varPhi \) produces another routine which computes, from u and from an arbitrary direction \(\dot{u}\) (of the same dimension as u), the derivative in the direction \(\dot{u}\):

    $$\begin{aligned} u , \dot{u} \mapsto \frac{\partial {\varPhi }}{\partial {u}}(u) \dot{u} =\phi '_p(w_{p-1}) \phi '_{p-1}(w_{p-2}) \cdots \phi '_1(w_0) \dot{u}. \end{aligned}$$
    (2)
  • The Reverse mode, when applied to the same program computing \(\varPhi \), produces a routine which computes, from u and from an arbitrary direction \(\bar{v}\) (of the same dimension as v), the following product, of the same dimension as u:

    $$\begin{aligned} u , \bar{v} \mapsto \biggl ( \frac{\partial {\varPhi }}{\partial {u}}(u) \biggr )^{*} \bar{v} =\phi '^{*}_1(w_0) \phi '^{*}_2(w_{1}) \cdots \phi '^{*}_p(w_{p-1}) \bar{v}. \end{aligned}$$
    (3)
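As a minimal numerical illustration of formulas (2) and (3), the following Python sketch builds a three-factor composition (the maps \(\phi _1, \phi _2, \phi _3\), their dimensions and their Jacobians are invented for the example), propagates a direction forward and a weight backward, and checks both against the assembled Jacobian:

```python
# Tangent (2) and Reverse (3) products for Phi = phi3 o phi2 o phi1.
# All maps and dimensions below are invented for this illustration.
import numpy as np

rng = np.random.default_rng(0)
A1 = rng.standard_normal((4, 3))
A2 = rng.standard_normal((5, 4))
A3 = rng.standard_normal((2, 5))

phi1 = lambda w: np.tanh(A1 @ w)   # w0 -> w1
phi2 = lambda w: np.tanh(A2 @ w)   # w1 -> w2
phi3 = lambda w: A3 @ w            # w2 -> v

def jac_phi1(w):                   # Jacobian phi1'(w)
    return (1.0 - np.tanh(A1 @ w) ** 2)[:, None] * A1

def jac_phi2(w):
    return (1.0 - np.tanh(A2 @ w) ** 2)[:, None] * A2

def jac_phi3(w):
    return A3

u = rng.standard_normal(3)
w0, w1, w2 = u, phi1(u), phi2(phi1(u))

# Tangent mode, Eq. (2): propagate a direction u_dot forward.
u_dot = rng.standard_normal(3)
v_dot = jac_phi3(w2) @ (jac_phi2(w1) @ (jac_phi1(w0) @ u_dot))

# Reverse mode, Eq. (3): propagate a weight v_bar backward (adjoints in reverse order).
v_bar = rng.standard_normal(2)
u_bar = jac_phi1(w0).T @ (jac_phi2(w1).T @ (jac_phi3(w2).T @ v_bar))

# Both are consistent with the assembled Jacobian J = phi3' phi2' phi1'.
J = jac_phi3(w2) @ jac_phi2(w1) @ jac_phi1(w0)
assert np.allclose(v_dot, J @ u_dot) and np.allclose(u_bar, J.T @ v_bar)
```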

If we choose implicit differentiation, differentiating the entire routine computing j can be performed with either the Tangent or the Reverse mode of the AD tool; see chapter “Surrogate Model-Based Approaches to UQ and Their Range of Applicability”. This permits obtaining a differentiated program directly, in a black-box manner, with the risk that this program is rather reliable but not sufficiently efficient.

To analyze this issue, we observe that, since the routine computing the functional j contains the iterative solution method for the state equation, the differentiated routines will contain this solver in differentiated form. We assume that \(n_\text {iter}\) iterations are needed to obtain the nonlinear solution, and that each iteration has unit cost.

The Tangent mode produces a program that we need to apply n times in order to compute the entire gradient. The cost is \(n (n_\text {iter} \alpha _T )\), where \(\alpha _T\) is the overhead of the differentiated code with respect to the original one. Usually \(1< \alpha _T < 4\); see for example [5]. Further, the memory requirements will be about the same as for the undifferentiated code.

With the Reverse mode, we are able to obtain the entire gradient with a single evaluation of the differentiated routine. However, the usual Reverse mode produces a code which involves two successive parts:

  • a forward sweep, close to the original code, and

  • a backward sweep, performed in the reverse order of the original code.

The difficulty is that the backward sweep needs data computed in the forward sweep, but in the opposite order, that is from the last data computed in the forward sweep to the first. One way to solve this is the Store-All (SA) strategy, in which these data are stored during the forward sweep.

The total cost (in terms of CPU time and memory) strongly depends on the strategy applied by the AD tool to handle this inverse-order differentiation of the original routine. With the SA strategy, the CPU cost will be \(n_\text {iter} \alpha _R \), with \(1< \alpha _R < 5\), i.e., \(\alpha _R \) times the cost of the undifferentiated code, but the required memory will be about \(n_\text {iter}\) times greater. With a Recompute-All (RA) strategy, the CPU cost will be \(n_\text {iter}^2 \alpha _R \), i.e., \(n_\text {iter} \alpha _R \) times the cost of the nonlinear solution, but the memory will be the same as for the undifferentiated routine. For real, large programs, neither the SA nor the RA strategy is workable; an efficient storage/recomputation trade-off using checkpoints is needed (see [7]). Obviously, with checkpointing, the CPU cost will be greater than with the SA strategy, and it can be shown [6] that the cost of the differentiated code can be of the order of \(\root s \of {n_\text {iter}}\) (where s is the number of snapshots available).
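The following Python sketch makes the above cost estimates concrete for the two extreme strategies; the numerical values of \(n_\text {iter}\) and \(\alpha _R\) are illustrative assumptions, and checkpointing lies between these two extremes in both CPU time and memory.

```python
# Rough cost model for the two extreme reverse-mode strategies (all figures
# in multiples of the cost of one nonlinear iteration; inputs are illustrative).
def store_all_cost(n_iter, alpha_R):
    cpu = n_iter * alpha_R       # alpha_R times the undifferentiated solver
    memory = n_iter              # roughly one stored state per iteration
    return cpu, memory

def recompute_all_cost(n_iter, alpha_R):
    cpu = n_iter ** 2 * alpha_R  # n_iter * alpha_R times the nonlinear solution
    memory = 1                   # same memory as the undifferentiated solver
    return cpu, memory

print(store_all_cost(n_iter=500, alpha_R=3.0))      # (1500.0, 500)
print(recompute_all_cost(n_iter=500, alpha_R=3.0))  # (750000.0, 1)
```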

In many cases, these strategies (SA, RA, or a mixed SA/RA) are compulsory. This is the case for unsteady nonlinear systems, in which the main iteration is time advancing and for which we do not know a strategy that works without the intermediate-time values of the state variable. Checkpointing can be applied quasi-automatically by the Automatic Differentiator or introduced by hand-coding.

When the iterative algorithm is a fixed-point one, with sufficient convergence to the fixed point, only the final state variable is needed for the backward sweep. Some intervention of the programmer is then useful to avoid storing unnecessary intermediate values of the state variable.

Lastly, the behavior (efficiency, stability) of differentiated solution algorithms is questionable, while a natural and conservative tendency is to re-use the algorithm applied to the state (preconditioner, iterator) for solving the adjoint systems.

From the previous arguments, in the case of simple explicit solvers, implicit differentiation is a good option, but for more sophisticated solvers (GMRES, ...) we recommend the differentiation of the explicit parts/routines, followed by an assembly of these derivatives by the programmer.

We now present this latter strategy in more detail.

Computation of First Derivative

Using the chain rule, the gradient of the functional \(j(\gamma ) = J(\gamma ,W(\gamma ))\) is given by

$$\begin{aligned} \frac{d {j}}{d {\gamma }} = \frac{\partial {J}}{\partial {\gamma }} + \frac{\partial {J}}{\partial {W}} \frac{d {W}}{d {\gamma }} \end{aligned}$$

where the derivatives of the state variables \(W(\gamma )\) are obtained by solving the linear system

$$\begin{aligned} \frac{d {\psi }}{d {\gamma }} = \frac{\partial {\varPsi }}{\partial {\gamma }} + \frac{\partial {\varPsi }}{\partial {W}} \frac{d {W}}{d {\gamma }} = 0. \end{aligned}$$

Two strategies can be applied.

Direct Differentiation (Tangent Mode)

It consists in computing the Gâteaux or directional derivatives with respect to each component direction (\(e_i=(0,\dots ,0,1,0,\dots ,0)^T\), where the 1 is at the i-th component):

$$\begin{aligned} \frac{d {j}}{d {\gamma _i}} = \frac{d {j}}{d {\gamma }} e_i = \frac{\partial {J}}{\partial {\gamma _i}} + \frac{\partial {J}}{\partial {W}} \frac{d {W}}{d {\gamma _i}} \end{aligned}$$
(4)

with:

$$\begin{aligned} \frac{\partial {\varPsi }}{\partial {W}} \frac{d {W}}{d {\gamma _i}} = - \frac{\partial {\varPsi }}{\partial {\gamma }} e_i \end{aligned}$$
(5)

This has to be applied to each component of \(\gamma \), i.e., n times, and the cost amounts to solving n linearised N-dimensional systems. If we choose to solve the single system (5) with an iterative matrix-free method, and the solution is obtained after \(n_{\text {iter},T}\) steps, the total cost will be of the order of \(\alpha _T n_{\text {iter},T}\), i.e., \(n_{\text {iter},T}\) evaluations of the matrix-by-vector operation \(\bigl ( \frac{\partial \varPsi }{\partial W} \bigr ) x\), where each evaluation costs \(\alpha _T\) times the evaluation of the state residual \(\varPsi (\gamma ,W)\) (the cost of one state residual evaluation is taken as the reference, equal to 1). Therefore, the cost of the full gradient will be \(n \alpha _T n_{\text {iter},T}\).
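A minimal matrix-free sketch of this Tangent mode gradient computation, Eqs. (4)–(5), is given below. The routines psi_dot and j_dot, which return the directional derivatives of \(\varPsi \) and J, are assumed to be available (for instance from tangent AD of the residual and functional routines); their names and signatures are assumptions of this sketch, not actual TAPENADE output.

```python
# Tangent-mode gradient, Eqs. (4)-(5), with a matrix-free iterative solver.
# Assumed building blocks (names/signatures are hypothetical):
#   psi_dot(gamma, W, gamma_dot, W_dot) = (dPsi/dgamma) gamma_dot + (dPsi/dW) W_dot
#   j_dot(gamma, W, gamma_dot, W_dot)   = (dJ/dgamma) gamma_dot + (dJ/dW) W_dot
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def tangent_gradient(gamma, W, psi_dot, j_dot, n, N):
    grad = np.zeros(n)
    # Linearised state operator x -> (dPsi/dW) x, obtained by zeroing gamma_dot.
    A = LinearOperator((N, N), matvec=lambda x: psi_dot(gamma, W, np.zeros(n), x))
    for i in range(n):
        e_i = np.zeros(n); e_i[i] = 1.0
        rhs = -psi_dot(gamma, W, e_i, np.zeros(N))   # -(dPsi/dgamma) e_i
        dW_i, info = gmres(A, rhs)                   # solve Eq. (5)
        if info != 0:
            raise RuntimeError("linearised state solve did not converge")
        grad[i] = j_dot(gamma, W, e_i, dW_i)         # Eq. (4)
    return grad
```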

Inverse Differentiation (Reverse Mode)

The complete gradient is given by the equation

$$\begin{aligned} \biggl ( \frac{d {j}}{d {\gamma }} \biggr )^{*} = \biggl ( \frac{\partial {J}}{\partial {\gamma }} \biggr )^{*} - \biggl ( \frac{\partial {\varPsi }}{\partial {\gamma }} \biggr )^{*} \Pi _0 \end{aligned}$$
(6)

where \(\Pi _0\) is the solution of the linear system

$$\begin{aligned} \biggl ( \frac{\partial \varPsi }{\partial W} \biggr )^{*} \Pi = \biggl (\frac{\partial J}{\partial W}\biggr )^{*}. \end{aligned}$$
(7)

This computation needs only one extra linearised N-dimensional system, the adjoint system (some methods for the calculation of adjoint solutions are described in [4]). If we choose to solve the adjoint system (7) with an iterative matrix-free method, we can apply the same estimate as in the case of the Tangent mode differentiation, but this time the overhead of the matrix-by-vector operation \(\bigl ( \frac{\partial \varPsi }{\partial W} \bigr )^{*} x\) with respect to the state residual evaluation will be \(\alpha _R\) (usually \(\alpha _R > \alpha _T\)), and the number of iterations \(n_{\text {iter},R}\) needed for convergence could differ from the \(n_{\text {iter},T}\) of the previous case (although the asymptotic rate of convergence is the same as for the original linear system \(\bigl ( \frac{\partial \varPsi }{\partial W} \bigr ) x = b\), see [4]). Therefore, the cost for the gradient will be \(\alpha _R n_{\text {iter},R}\), and the Reverse mode differentiation for the gradient computation is cheaper than the Tangent mode if \(n \gg 1\).
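A corresponding matrix-free sketch of the adjoint gradient, Eqs. (6)–(7), could look as follows. The routine psi_bar, returning both transposed products \(\bigl ( \frac{\partial \varPsi }{\partial \gamma } \bigr )^{*} \bar{W}\) and \(\bigl ( \frac{\partial \varPsi }{\partial W} \bigr )^{*} \bar{W}\), and the partial gradients dJ_dgamma and dJ_dW are assumed building blocks; their names and signatures are not prescribed by any AD tool.

```python
# Reverse-mode gradient, Eqs. (6)-(7), with a matrix-free adjoint solve.
# Assumed building blocks (names/signatures are hypothetical):
#   psi_bar(gamma, W, W_bar) -> ((dPsi/dgamma)^* W_bar, (dPsi/dW)^* W_bar)
#   dJ_dgamma(gamma, W) -> (dJ/dgamma)^*,  dJ_dW(gamma, W) -> (dJ/dW)^*
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def reverse_gradient(gamma, W, psi_bar, dJ_dgamma, dJ_dW, n, N):
    # Adjoint operator x -> (dPsi/dW)^* x.
    A_star = LinearOperator((N, N), matvec=lambda x: psi_bar(gamma, W, x)[1])
    Pi0, info = gmres(A_star, dJ_dW(gamma, W))       # adjoint system (7)
    if info != 0:
        raise RuntimeError("adjoint solve did not converge")
    # Gradient (6): (dj/dgamma)^* = (dJ/dgamma)^* - (dPsi/dgamma)^* Pi0
    return dJ_dgamma(gamma, W) - psi_bar(gamma, W, Pi0)[0]
```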

For second derivatives, we have different possibilities.

Second Derivative by Tangent on Tangent

These methods were initially investigated in [12], along with various other algorithms, but the publication does not go into the implementation details for a generic fluid dynamics code. Here, we present the mathematical background behind the idea and the efficient AD implementation of Ghate and Giles [3], but with a different analysis of the computational cost.

Main Property

Starting from the derivative (4), we perform another differentiation, with respect to the variable \(\gamma _k\), obtaining

$$\begin{aligned} \frac{d^2 j}{d \gamma _i d \gamma _k} = D^2_{i,k} J + \frac{\partial {J}}{\partial {W}} \frac{d^2 W}{d \gamma _i d \gamma _k} \end{aligned}$$
(8)

where

$$\begin{aligned} D^2_{i,k} J = \frac{\partial {}}{\partial {\gamma }} \biggl ( \frac{\partial {J}}{\partial {\gamma }} e_i \biggr ) e_k + \frac{\partial {}}{\partial {W}} \biggl ( \frac{\partial {J}}{\partial {\gamma }} e_i \biggr ) \frac{d {W}}{d {\gamma _k}} + \frac{\partial {}}{\partial {W}} \biggl ( \frac{\partial {J}}{\partial {\gamma }} e_k \biggr ) \frac{d {W}}{d {\gamma _i}} + \frac{\partial {}}{\partial {W}} \biggl ( \frac{\partial {J}}{\partial {W}} \frac{d {W}}{d {\gamma _i}} \biggr ) \frac{d {W}}{d {\gamma _k}}. \end{aligned}$$

Differentiating Eq. (5), we get

$$\begin{aligned} D^2_{i,k} \varPsi + \frac{\partial {\varPsi }}{\partial {W}} \frac{d^2 W}{d \gamma _i d \gamma _k} = 0 \end{aligned}$$
(9)

where

$$\begin{aligned} D^2_{i,k} \varPsi = \frac{\partial {}}{\partial {\gamma }} \biggl ( \frac{\partial {\varPsi }}{\partial {\gamma }} e_i \biggr ) e_k + \frac{\partial {}}{\partial {W}} \biggl ( \frac{\partial {\varPsi }}{\partial {\gamma }} e_i \biggr ) \frac{d {W}}{d {\gamma _k}} + \frac{\partial {}}{\partial {W}} \biggl ( \frac{\partial {\varPsi }}{\partial {\gamma }} e_k \biggr ) \frac{d {W}}{d {\gamma _i}} + \frac{\partial {}}{\partial {W}} \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \frac{d {W}}{d {\gamma _i}} \biggr ) \frac{d {W}}{d {\gamma _k}}. \end{aligned}$$

Substituting into Eq. (8) the second derivatives of the state with respect to the control variables, \(\frac{d^2 W}{d \gamma _i d \gamma _k}\), obtained from Eq. (9), we get

$$\begin{aligned} \begin{aligned} \frac{d^2 j}{d \gamma _i d \gamma _k}&= D^2_{i,k} J - \frac{\partial {J}}{\partial {W}} \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggr )^{-1} D^2_{i,k} \varPsi \\&= D^2_{i,k} J -\Pi ^{*}_0 D^2_{i,k} \varPsi \end{aligned} \end{aligned}$$
(10)

where \(\Pi _0\) is the solution of the adjoint system (7) evaluated at the point \((\gamma ,W(\gamma ))\) solution of the state equation \(\varPsi (\gamma ,W)=0\).

The n derivatives \(\displaystyle \frac{d {W}}{d {\gamma _i}}\) must be computed (and stored) using tangent mode differentiation of the nonlinear solver algorithm, and each derivative costs \(n_{\text {iter},T} \alpha _T\). If we need the full Hessian matrix, we have to evaluate the quantity (10) \(n(n+1)/2\) times; i.e., we have to evaluate the terms \(D^2_{i,k} \varPsi \) and \(D^2_{i,k} J\) for \(i=1,\dots ,n\) and \(k=i,\dots ,n\), thanks to the symmetry of the Hessian, and each evaluation of \(D^2_{i,k} \varPsi \) costs \(\alpha _T^2\) (the evaluation of \(D^2_{i,k} J\) is negligible with respect to \(D^2_{i,k} \varPsi \)). Therefore, the full Hessian costs \(n \alpha _T [ n_{\text {iter},T} + (n+1) \alpha _T /2]\). With similar arguments, if we only want the diagonal part of the Hessian, the cost is \(n \alpha _T [ n_{\text {iter},T} + \alpha _T]\).
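The following sketch illustrates how one entry of the Hessian, and by symmetry the full matrix, could be assembled from formula (10). The second-order directional routines d2_psi and d2_j, returning \(D^2_{i,k} \varPsi \) and \(D^2_{i,k} J\), are assumed to come from differentiating the tangent code once more; their names and signatures are assumptions of this sketch.

```python
# Tangent-on-Tangent Hessian, Eq. (10).  Assumed building blocks
# (names/signatures are hypothetical):
#   d2_j(gamma, W, e_i, dW_i, e_k, dW_k)   -> D^2_{i,k} J   (scalar)
#   d2_psi(gamma, W, e_i, dW_i, e_k, dW_k) -> D^2_{i,k} Psi (N-vector)
# Pi0 solves the adjoint system (7); dW[i] = dW/dgamma_i solves Eq. (5).
import numpy as np

def hessian_entry_tot(gamma, W, Pi0, dW, d2_psi, d2_j, i, k, n):
    e_i = np.zeros(n); e_i[i] = 1.0
    e_k = np.zeros(n); e_k[k] = 1.0
    D2_J = d2_j(gamma, W, e_i, dW[i], e_k, dW[k])
    D2_Psi = d2_psi(gamma, W, e_i, dW[i], e_k, dW[k])
    return D2_J - Pi0 @ D2_Psi                      # Eq. (10)

def full_hessian_tot(gamma, W, Pi0, dW, d2_psi, d2_j, n):
    H = np.zeros((n, n))
    for i in range(n):
        for k in range(i, n):                       # exploit symmetry
            H[i, k] = H[k, i] = hessian_entry_tot(gamma, W, Pi0, dW,
                                                  d2_psi, d2_j, i, k, n)
    return H
```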

Global ToT Algorithm

Solve

$$\begin{aligned} \begin{array}{l|c} \varPsi (\gamma ,W) = 0 &{} n_{\text {iter}} \varPsi \\ \displaystyle \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggr )^{*} \Pi _0 = \biggl (\frac{\partial {J}}{\partial {W}} \biggr )^{*} &{} n_{\text {iter}} \alpha _R \varPsi + \alpha _R J \end{array} \end{aligned}$$

For \(i=1,\dots ,n\) solve

$$\begin{aligned} \begin{array}{l|c} \displaystyle \frac{\partial {\varPsi }}{\partial {W}} \frac{d {W}}{d {\gamma _i}} = - \frac{\partial {\varPsi }}{\partial {\gamma }} e_i&(n_{\text {iter}}+1) \alpha _T \varPsi \end{array} \end{aligned}$$

For \(i=1,\dots ,n\) and \(j=1,\dots ,i\)

$$\begin{aligned} \begin{array}{l|c} \displaystyle \frac{\partial {}}{\partial {\gamma }} \biggl ( \frac{\partial {J}}{\partial {\gamma }} e_i \biggr ) e_j &{} \alpha _T^2 J \\ \displaystyle \frac{\partial {}}{\partial {\gamma }} \biggl ( \frac{\partial {J}}{\partial {W}} \frac{d {W}}{d {\gamma _j}} \biggr ) e_i &{} \alpha _T^2 J \\ \displaystyle \frac{\partial {}}{\partial {W}} \biggl ( \frac{\partial {J}}{\partial {W}} \frac{d {W}}{d {\gamma _j}} \biggr ) \frac{d {W}}{d {\gamma _i}} &{} \alpha _T^2 J \\ \\ \hline \\ \displaystyle \frac{\partial {}}{\partial {\gamma }} \biggl ( \frac{\partial {\varPsi }}{\partial {\gamma }} e_i \biggr ) e_j &{} \alpha _T^2 \varPsi \\ \displaystyle \frac{\partial {}}{\partial {\gamma }} \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \frac{d {W}}{d {\gamma _j}} \biggr ) e_i &{} \alpha _T^2 \varPsi \\ \displaystyle \frac{\partial {}}{\partial {W}} \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \frac{d {W}}{d {\gamma _j}} \biggr ) \frac{d {W}}{d {\gamma _i}} &{} \alpha _T^2 \varPsi \\ \end{array} \end{aligned}$$

Second Derivative by Tangent on Reverse

This consists in the direct differentiation, in each direction \(e_i\), \(i=1,\dots ,n\), of the (non-scalar) function:

$$\begin{aligned} \biggl ( \frac{\partial {j}}{\partial {\gamma }} \biggr )^{*}(\gamma ,W(\gamma )) = \biggl ( \frac{\partial {J}}{\partial {\gamma }} \biggr )^{*}(\gamma ,W(\gamma )) - \biggl ( \frac{\partial {\varPsi }}{\partial {\gamma }} \biggr )^{*} \Pi (\gamma ,W(\gamma )) \end{aligned}$$

where \(W(\gamma )\) and \(\Pi (\gamma ,W(\gamma ))\) are the solutions of the state equation (1) and of the adjoint system (7), respectively.

Main Property

The main computations to apply are summed up in the following lemma:

Lemma

The second derivatives can be computed as:

$$\begin{aligned} \boxed { \begin{aligned} \frac{\partial {}}{\partial {\gamma _i}} \biggl ( \frac{\partial {j}}{\partial {\gamma }} \biggr )^*&= \biggl ( \frac{\partial ^2 {j}}{\partial {\gamma }^2} \biggr ) e_i = \frac{\partial { }}{\partial {\gamma }}\biggl ( \frac{\partial {J}}{\partial {\gamma }} \biggl )^{*} e_i + \frac{\partial { }}{\partial { W}} \biggl ( \frac{\partial {J}}{\partial {\gamma }} \biggl )^{*} \frac{d {W}}{d {\gamma _i}} \\&\quad -\frac{\partial {}}{\partial {\gamma }} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial {\gamma }} \biggl )^{*} \Pi _0 \biggr ] e_i - \frac{\partial {}}{\partial { W}} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial {\gamma }} \biggl )^{*} \Pi _0 \biggr ] \frac{d {W}}{d {\gamma _i}} - \biggl ( \frac{\partial { \varPsi }}{\partial {\gamma }} \biggl )^{*} \lambda _i \end{aligned} } \end{aligned}$$
(11)

For each \(i=1,\dots ,n\), Eq. (11) needs \(\Pi _0\), the solution of the adjoint system

$$\begin{aligned} \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggr )^{*} \Pi = \biggl ( \frac{\partial {J}}{\partial {W}} \biggr )^{*} \end{aligned}$$
(12)

and two perturbed N-dimensional linear systems:

$$\begin{aligned} \left\{ \begin{aligned} \frac{\partial {\varPsi }}{\partial {W}} \frac{d {W}}{d {\gamma _i}}&= - \frac{\partial {\varPsi }}{\partial {\gamma }} e_i \\ \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggl )^{*} \lambda _i&= \frac{\partial { }}{\partial {\gamma }} \biggl (\frac{\partial {J}}{\partial {W}} \biggl )^{*} e_i + \frac{\partial { }}{\partial {W}} \biggl ( \frac{\partial {J}}{\partial {W}} \biggl )^{*} \frac{d {W}}{d {\gamma _i}} \\&\quad - \frac{\partial {}}{\partial {\gamma }} \biggl [ \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggl )^{*} \Pi _0 \biggr ] e_i - \frac{\partial {}}{\partial {W}} \biggl [ \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggl )^{*} \Pi _0 \biggr ] \frac{d {W}}{d {\gamma _i}} \end{aligned} \right. \end{aligned}$$
(13)

where all the functions in Eqs. (11)–(13) are evaluated at the final state (so that \(\varPsi (\gamma ,W(\gamma )) = 0\) is satisfied). In addition, the second linear system in (13) is of the same type as the adjoint system (12), but with a different right-hand side, so we can use the same algorithm (with different initial data) for both equations.

It is useful to note that Eq. (11) gives us an entire column (or row, by symmetry) of the Hessian matrix, whereas the Tangent-on-Tangent approach (10) gives us a single element. Another interesting point is that the Tangent-on-Reverse method does not require knowing in advance all the derivatives of the state variables with respect to the control; i.e., we can compute the derivative \(\frac{d {W}}{d {\gamma _i}}\) for a given index i, use it in Eqs. (11) and (13) to obtain the i-th column (or row) of the Hessian, and then discard it before treating the next index \(i+1\).

The cost associated with the full Hessian evaluated with the Tangent-on-Reverse approach is

  • \(n(\alpha _T n_{\text {iter},T})\) for computing the derivatives of the state variables with respect to the control, \(\frac{d {W}}{d {\gamma _i}}\) (\(i=1,\dots ,n\)),

  • \(n \alpha _R \alpha _T\) for evaluating the quantities on the right-hand sides of Eqs. (11) and (13),

  • \(n(\alpha _R n_{\text {iter},R})\) for solving the n adjoint systems in (13).

Therefore, the full Hessian costs \(n [\alpha _T n_{\text {iter},T} + \alpha _R \alpha _T + \alpha _R n_{\text {iter},R}]\).
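As an illustration, one column of the Hessian could be assembled from Eqs. (11)–(13) as sketched below. The building blocks tangent_state_solve, adjoint_solve, tor_rhs and psi_gamma_star are assumed wrappers around AD-generated routines and linear solvers; their names and signatures are assumptions of this sketch.

```python
# Tangent-on-Reverse: i-th Hessian column from Eqs. (11)-(13).
# Assumed building blocks (names/signatures are hypothetical):
#   tangent_state_solve(gamma, W, e_i) -> dW/dgamma_i                (Eq. (5))
#   adjoint_solve(gamma, W, rhs)       -> solution of (dPsi/dW)^* x = rhs
#   tor_rhs(gamma, W, Pi0, e_i, dW_i)  -> (mixed terms of (11), rhs of (13))
#   psi_gamma_star(gamma, W, lam)      -> (dPsi/dgamma)^* lam
import numpy as np

def hessian_column_tor(gamma, W, Pi0, i, n,
                       tangent_state_solve, adjoint_solve, tor_rhs, psi_gamma_star):
    e_i = np.zeros(n); e_i[i] = 1.0
    dW_i = tangent_state_solve(gamma, W, e_i)          # dW/dgamma_i
    rhs_11, rhs_13 = tor_rhs(gamma, W, Pi0, e_i, dW_i)
    lam_i = adjoint_solve(gamma, W, rhs_13)            # second system of (13)
    # Eq. (11): i-th column (= row, by symmetry) of the Hessian.
    return rhs_11 - psi_gamma_star(gamma, W, lam_i)
# dW_i and lam_i can be discarded after each i: no need to keep all n tangents.
```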

Details of Calculations

First of all, we note that the tangent derivative along the direction \(\delta \) of an (n-dimensional) function \(f(\gamma )=F(\gamma ,W)\) is

$$\begin{aligned} \begin{aligned} \frac{d {f}}{d {\gamma }} \delta&= \frac{\partial { F}}{\partial { \gamma }} \delta + \frac{\partial { F}}{\partial { W}} \frac{d {W}}{d {\gamma }} \delta \\&= \frac{\partial { F}}{\partial { \gamma }} \delta - \frac{\partial { F}}{\partial { W}} \biggl ( \frac{\partial { \varPsi }}{\partial { W}} \biggr )^{-1} \frac{\partial { \varPsi }}{\partial { \gamma }} \delta \\&= \frac{\partial { F}}{\partial { \gamma }} \delta + \frac{\partial { F}}{\partial { W}} \theta \end{aligned} \end{aligned}$$

where \(\theta \) is the solution of the linear system

$$\begin{aligned} \frac{\partial { \varPsi }}{\partial { W}} \theta = - \frac{\partial { \varPsi }}{\partial { \gamma }} \delta . \end{aligned}$$

Now, we can perform the tangent derivative (along the direction \(\delta \)) of \(\bigl ( \frac{d {j}}{d {\gamma }} \bigl )^{*}\)

$$\begin{aligned} \begin{aligned} \biggl ( \frac{d^2 {j}}{d {\gamma }^2} \biggr ) \delta&= \frac{d {}}{d {\gamma }} \biggl ( \frac{d {j}}{d {\gamma }} \biggl )^{*} \delta = \frac{d {}}{d {\gamma }} \biggl ( \frac{\partial { J}}{\partial { \gamma }} \biggl )^{*} \delta - \frac{d {}}{d {\gamma }} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial { \gamma }} \biggl )^{*} \Pi \biggr ] \delta . \end{aligned} \end{aligned}$$

The first term is

$$\begin{aligned} \frac{d {}}{d {\gamma }} \biggl ( \frac{\partial { J}}{\partial { \gamma }} \biggl )^{*} \delta = \frac{\partial { }}{\partial { \gamma }}\biggl ( \frac{\partial { J}}{\partial { \gamma }} \biggl )^{*} \delta + \frac{\partial { }}{\partial { W}} \biggl ( \frac{\partial { J}}{\partial { \gamma }} \biggl )^{*} \theta . \end{aligned}$$

The second one is a little more complex

$$\begin{aligned} \begin{aligned} \frac{d {}}{d {\gamma }} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial { \gamma }} \biggl )^{*} \Pi \biggr ] \delta&= \biggl [ \frac{d {}}{d {\gamma }} \biggl ( \frac{\partial { \varPsi }}{\partial { \gamma }} \biggl )^{*} \Pi \biggr ] \delta + \biggl ( \frac{\partial { \varPsi }}{\partial { \gamma }} \biggl )^{*} \frac{d {\Pi }}{d {\gamma }} \delta \\&= \frac{d {}}{d {\gamma }} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial { \gamma }} \biggl )^{*} \Pi _0 \biggr ] \delta + \biggl ( \frac{\partial { \varPsi }}{\partial { \gamma }} \biggl )^{*} \lambda \\&= \frac{\partial {}}{\partial { \gamma }} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial { \gamma }} \biggl )^{*} \Pi _0 \biggr ] \delta + \frac{\partial {}}{\partial { W}} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial { \gamma }} \biggl )^{*} \Pi _0 \biggr ] \theta + \biggl ( \frac{\partial { \varPsi }}{\partial { \gamma }} \biggl )^{*} \lambda \end{aligned} \end{aligned}$$

where \(\Pi _0\) is the solution of the linear system (solved at the final state)

$$\begin{aligned} \biggl ( \frac{\partial { \varPsi }}{\partial { W}} \biggr )^{*} \Pi = \biggl ( \frac{\partial { J}}{\partial { W}} \biggr )^{*} \end{aligned}$$
(14)

and \(\displaystyle \lambda = \frac{d {\Pi }}{d {\gamma }} \delta \). Now, we perform a tangent derivative of the adjoint Eq. (14)

$$\begin{aligned} \begin{aligned} \frac{d {}}{d {\gamma }} \biggl [ \biggl ( \frac{\partial {\varPsi }}{\partial { W}} \biggr )^{*} \Pi \biggr ] \delta&= \frac{d {}}{d {\gamma }} \biggl [ \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggr )^{*} \Pi _0 \biggr ] \delta + \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggr )^{*} \frac{d {\Pi }}{d {\gamma }} \delta \\&=\frac{\partial {}}{\partial {\gamma }} \biggl [ \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggl )^{*} \Pi _0 \biggr ] \delta + \frac{\partial {}}{\partial {W}} \biggl [ \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggl )^{*} \Pi _0 \biggr ] \theta + \biggl ( \frac{\partial {\varPsi }}{\partial { W}} \biggl )^{*} \lambda \\&=\frac{\partial {}}{\partial {\gamma }}\biggl ( \frac{\partial {J}}{\partial {W}} \biggl )^{*} \delta + \frac{\partial {}}{\partial {W}} \biggl ( \frac{\partial {J}}{\partial {W}} \biggl )^{*} \theta . \end{aligned} \end{aligned}$$

So \(\lambda \) is the solution of the linear system

$$\begin{aligned} \begin{aligned} \biggl ( \frac{\partial { \varPsi }}{\partial { W}} \biggl )^{*} \lambda&= \frac{\partial { }}{\partial { \gamma }}\biggl ( \frac{\partial { J}}{\partial { W}} \biggl )^{*} \delta + \frac{\partial { }}{\partial { W}} \biggl ( \frac{\partial { J}}{\partial { W}} \biggl )^{*} \theta \\&\quad - \frac{\partial {}}{\partial { \gamma }} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial { W}} \biggl )^{*} \Pi _0 \biggr ] \delta - \frac{\partial {}}{\partial { W}} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial { W}} \biggl )^{*} \Pi _0 \biggr ] \theta . \end{aligned} \end{aligned}$$

Reassembling all the terms, we obtain that the product of the Hessian with a direction \(\delta \) is given by

$$\begin{aligned} \boxed { \begin{aligned} \biggl ( \frac{d^2 {j}}{d {\gamma }^2} \biggr ) \delta&= \frac{\partial {}}{\partial {\gamma }}\biggl ( \frac{\partial {J}}{\partial {\gamma }} \biggl )^{*} \delta + \frac{\partial {}}{\partial {W}} \biggl ( \frac{\partial {J}}{\partial {\gamma }} \biggl )^{*} \theta \\&\quad -\frac{\partial {}}{\partial {\gamma }} \biggl [ \biggl ( \frac{\partial {\varPsi }}{\partial {\gamma }} \biggl )^{*} \Pi _0 \biggr ] \delta - \frac{\partial {}}{\partial {W}} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial {\gamma }} \biggl )^{*} \Pi _0 \biggr ] \theta - \biggl ( \frac{\partial {\varPsi }}{\partial { \gamma }} \biggl )^{*} \lambda \end{aligned} } \end{aligned}$$

where

$$\begin{aligned} \left\{ \begin{aligned} \biggl ( \frac{\partial { \varPsi }}{\partial { W}} \biggr )^{*} \Pi _0&= \biggl ( \frac{\partial { J}}{\partial { W}} \biggr )^{*} \\ \frac{\partial { \varPsi }}{\partial { W}} \theta&= - \frac{\partial { \varPsi }}{\partial { \gamma }} \delta \\ \biggl ( \frac{\partial { \varPsi }}{\partial { W}} \biggl )^{*} \lambda&= \frac{\partial { }}{\partial { \gamma }}\biggl ( \frac{\partial { J}}{\partial { W}} \biggl )^{*} \delta + \frac{\partial { }}{\partial { W}} \biggl ( \frac{\partial { J}}{\partial { W}} \biggl )^{*} \theta \\&\quad - \frac{\partial {}}{\partial { \gamma }} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial { W}} \biggl )^{*} \Pi _0 \biggr ] \delta - \frac{\partial {}}{\partial { W}} \biggl [ \biggl ( \frac{\partial { \varPsi }}{\partial { W}} \biggl )^{*} \Pi _0 \biggr ] \theta . \end{aligned} \right. \end{aligned}$$

Global ToR Algorithm

Solve

$$\begin{aligned} \begin{array}{l|c} \varPsi (\gamma ,W) = 0 &{} n_{\text {iter}} \varPsi \\ \displaystyle \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggr )^{*} \Pi _0 = \biggl (\frac{\partial {J}}{\partial {W}} \biggr )^{*} &{} n_{\text {iter}} \alpha _R \varPsi + \alpha _R J \end{array} \end{aligned}$$

For \(i=1,\dots ,n\)

$$\begin{aligned} \begin{array}{l|c} \text {solve } \quad \displaystyle \frac{\partial {\varPsi }}{\partial {W}} \frac{d {W}}{d {\gamma _i}} = - \frac{\partial {\varPsi }}{\partial {\gamma }} e_i&(n_{\text {iter}}+1) \alpha _T \varPsi \end{array} \end{aligned}$$
$$\begin{aligned} \begin{array}{l|c} \dot{\overline{J}}_W &{} \alpha _T \alpha _R J \\ \dot{\overline{J}}_{\gamma } &{} \alpha _T \alpha _R J \\ \dot{\overline{\varPsi }}_W &{} \alpha _T \alpha _R \varPsi \\ \dot{\overline{\varPsi }}_{\gamma } &{} \alpha _T \alpha _R \varPsi \\ \end{array} \end{aligned}$$
$$\begin{aligned} \begin{array}{l|c} \text {solve } \quad \displaystyle \biggl ( \frac{\partial {\varPsi }}{\partial {W}} \biggr )^{*} \lambda _i = \dot{\overline{J}}_W - \dot{\overline{\varPsi }}_W&n_{\text {iter}} \alpha _R \varPsi \end{array} \end{aligned}$$
$$\begin{aligned} \begin{array}{l|c} \displaystyle \dot{\overline{J}}_{\gamma } - \dot{\overline{\varPsi }}_{\gamma } - \biggl ( \frac{\partial {\varPsi }}{\partial {\gamma }} \biggr )^{*} \lambda _i&\alpha _R \varPsi \end{array} \end{aligned}$$

Comparison Between ToT and ToR

At this point, the natural question arising from the previous analysis is which of the two methods is less expensive for a given problem. We have seen that the algorithms for the ToT and ToR approaches have some common parts (the solution of the adjoint system and the evaluation of the state derivatives), so the comparison of the two approaches should be based on the cost of the parts that characterize each algorithm:

  • \(\displaystyle \frac{n (n+1)}{2} \alpha _T^2\) for ToT

  • \(n \alpha _R [\alpha _T + n_{\text {iter},R}]\) for ToR

For a very large number of variables (hundreds in practical cases), the ToR can become more efficient, since its extra cost is not quadratic in n. In practice, the ToT is more performant.
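A small numerical comparison of the two characteristic costs above shows where the crossover lies; the values of \(\alpha _T\), \(\alpha _R\) and \(n_{\text {iter},R}\) below are illustrative assumptions.

```python
# Characteristic extra costs of ToT and ToR (in state-residual evaluations).
def tot_extra_cost(n, alpha_T):
    return 0.5 * n * (n + 1) * alpha_T ** 2

def tor_extra_cost(n, alpha_R, alpha_T, n_iter_R):
    return n * alpha_R * (alpha_T + n_iter_R)

alpha_T, alpha_R, n_iter_R = 2.0, 4.0, 200
for n in (10, 100, 1000):
    print(n, tot_extra_cost(n, alpha_T), tor_extra_cost(n, alpha_R, alpha_T, n_iter_R))
# ToT grows like n^2 while ToR grows like n * n_iter_R: ToT remains cheaper
# roughly while (n + 1) * alpha_T**2 / 2 < alpha_R * (alpha_T + n_iter_R).
```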

Remark

If we are interested in a (scalar!) functional depending on the gradient, then it can be interesting to apply a second inverse differentiation. We do not consider this direction here. \(\square \)

Concluding Remarks

This section has presented the main lines of the second differentiation of a functional subject to a state equation.

The next step in obtaining a second-order Taylor surrogate software model is the effective implementation of these second derivatives. That step is presented in detail in the chapter “Use of Automatic Differentiation Tools at the Example of TAPENADE”.

Much additional information concerning the method and its application to robust optimization can be found in [1, 2, 8, 9, 10, 11].