1 Introduction

Derivatives permeate mathematics and engineering right from the first steps of calculus, which together with the Taylor series expansion is a central tool in designing models and methods of modern mathematics. Despite this, successful methods for automatically calculating derivatives of \(n\)-dimensional functions are a relatively recent development. Perhaps the most notable amongst recent methods is the advent of automatic differentiation (AD), which has the remarkable achievement of the “cheap gradient principle”, wherein the cost of evaluating the gradient is proportional to that of the underlying function [1]. This success of AD is not limited to the gradient: there also exist a number of efficient AD algorithms for calculating Jacobian [2, 3] and Hessian matrices [4, 5] that can accommodate large-dimensional sparse instances. The same success has not been observed in calculating higher-order derivatives.

The assumed cost of calculating high-order derivatives drives the design of methods, typically favouring the use of lower-order methods. In the optimization community it is generally assumed that calculating any third-order information is too costly, so the design of methods revolves around using first and second-order information. We will show that third-order information can be used at a cost proportional to the cost of calculating the Hessian. This has an immediate application in third-order nonlinear optimization methods such as the Chebyshev–Halley family [6], which require calculating the directional derivative of the Hessian matrix \(D^3 f(x) \cdot d\), for a given \(x,d \in \mathbb {R}^n\) and \(f \in C^3(\mathbb {R}^n, \mathbb {R}).\) Though much theory has been developed on the semi-local convergence of members of the Halley–Chebyshev family [7–10], it is still unclear how its domain of convergence compares to that of Newton’s method. On one-dimensional real and complex equations, where high-order derivatives become trivial, tests have pointed to a larger basin of convergence to roots when applying Halley–Chebyshev methods as compared to Newton’s method [11, 12]. For finding the solution of nonlinear systems in \(\mathbb {R}^n\), there have been a number of successful, albeit limited, tests of the Halley–Chebyshev family; see [13–15] for tests of a modern implementation. These tests indicate that there exist unconstrained problems for which these third-order methods converge with a lower runtime, despite the fact that the entire third-order derivative tensor \(D^3 f(x)\) is calculated at each iteration. We present a method that calculates the directional derivative \(D^3 f(x) \cdot d\) using only matrix arithmetic, without the need to form or store the entire third-order tensor. For problems of large dimension, this is a fundamental improvement over calculating the entire tensor. Furthermore, there is a lack of reports of a significant battery of tests affirming any practical gain in using this third-order family over Newton’s method. Efficient automatic third-order differentiation tools would greatly facilitate such tests.

Still within optimization, third-order derivatives are being used in robust aerodynamic design with promising results [16, 17]. With the development of efficient automatic tools for calculating high-order derivatives, this success could be carried over to optimal design of other mechanical structures [18]. Third-order derivatives of financial options are also being considered to design optimal hedging portfolios [19].

The need for higher-order differentiation also finds applications in calculating quadratures [20, 21], bifurcations and periodic orbits [22]. In the fields of numerical integration and the solution of PDEs, a lot of attention has been given to refining and adapting meshes in order to use first and second-order approximations over these meshes. An alternative paradigm would be to use fixed coarse meshes and higher-order approximations. With the capacity to efficiently calculate high-order derivatives, this approach could become competitive and lift the fundamental deterrent to higher-order methods.

Current methods for calculating derivatives of order three or higher in the AD community typically propagate univariate Taylor series [23] or repeatedly apply the tangent and adjoint operations [24]. In these methods, each element of the desired derivative is calculated separately. If AD has taught us anything, it is that we should not treat elements of derivatives separately, for their computation can be highly interlaced. The cheap gradient principle illustrates this well, for calculating the elements of the gradient separately yields a time complexity of \(n\) times that of simultaneously calculating all entries. This same principle should be carried over to higher-order methods, that is, be wary of overlapping calculations in individual elements. Another alternative for calculating high-order derivatives is the use of forward differentiation [25]. The drawback of forward propagation is that it calculates the derivatives of all intermediate functions, with respect to the independent variables, even when these do not contribute to the desired end result. For these reasons, we look at calculating high-order derivatives as a whole and focus on reverse AD methods.

An efficient alternative to AD is for end users to hand-code their derivatives. With the advent of ever more complicated models, however, this task becomes increasingly error-prone, tedious, and hard to code efficiently. This approach also rules out methods that use high-order derivatives, for no one can expect the end user to code the total and directional derivatives of high-order tensors.

The article flows as follows. First we develop algorithms that calculate derivatives in a more general setting, wherein our function is described as a sequence of compositions of maps (Sect. 2). We then use Griewank and Walther’s [1] state transformations in Sect. 3 to translate a composition of maps into an AD setting and an efficient implementation. Numerical tests are presented in Sect. 4, followed by our conclusions in Sect. 5.

2 Derivatives of sequences of maps

In preparation for the AD setting, we first develop algorithms for calculating derivatives of functions that can be broken into a composition of operators

$$\begin{aligned} {\mathrm {F}}(x) = \Psi ^{\ell } \circ \Psi ^{\ell -1} \circ \cdots \circ \Psi ^{1}(x), \end{aligned}$$
(1)

for \(\Psi ^i\)’s of varying dimension: \(\Psi ^1(x) \in C^2(\mathbb {R}^{n},\mathbb {R}^{m_1})\) and \(\Psi ^i(x)\in C^2(\mathbb {R}^{m_{i-1}},\mathbb {R}^{m_i})\), each \(m_i \in \mathbb {N}\) and for \(i=2,\ldots , \ell \), so that \({\mathrm {F}}: \mathbb {R}^{n} \rightarrow \mathbb {R}^{m_{\ell }}\). From this we define a functional \(f(x) = y^T {\mathrm {F}}(x),\) where \(y\in \mathbb {R}^{m_{\ell }}\), and develop methods for calculating the gradient \(\nabla f(x) = y^T D{\mathrm {F}}(x)\), the Hessian \(D^2 f(x) = y^T D^2{\mathrm {F}}(x)\) and the tensor \(D^3f(x) = y^T D^3{\mathrm {F}}(x)\).

For a given \(d \in \mathbb {R}^n\), we also develop methods for the directional derivative \(D {\mathrm {F}}(x)\cdot d\), \(D^2{\mathrm {F}}(x)\cdot d\), the Hessian-vector product \(D^2f(x)\cdot d = y^T D^2 {\mathrm {F}}(x)\cdot d\) and the tensor-vector product \(D^3 f(x) \cdot d = y^T D^3 {\mathrm {F}}(x)\cdot d\). Notation will be gradually introduced and clarified as is required, including the definition of the preceding directional derivatives.

2.1 First-order derivatives

Taking the derivative of \({\mathrm {F}}\), Eq. (1), and recursively applying the chain rule, we get

$$\begin{aligned} y^T D {\mathrm {F}} = y^T D\Psi ^{\ell }D\Psi ^{\ell -1} \cdots D\Psi ^{1}. \end{aligned}$$
(2)

Note that \(y^T D {\mathrm {F}}\) is the transpose of the gradient \(\nabla (y^T {\mathrm {F}} ) \). For simplicity’s sake, the argument of each function is omitted in (2), but it should be noted that \(D\Psi ^i\) is evaluated at the argument \((\Psi ^{i-1}\circ \cdots \circ \Psi ^{1})(x)\), for each \(i\) from \(1\) to \(\ell \). In Algorithm 1, we record each of these arguments and then calculate the gradient of \(y^T{\mathrm {F}}(x)\) with what is called a reverse sweep. Reverse, because it traverses the maps from the last \(\Psi ^{\ell }\) to the first \(\Psi ^{1}\), the opposite direction to that in which (1) is evaluated. The intermediate stages of the gradient calculation are accumulated in the vector \(\overline{\mathrm {v}}\), whose dimension changes from one iteration to the next. This will be a recurring fact in the matrices and vectors used to store the intermediate phases of the archetype algorithms presented in this article.

[Algorithm 1: Archetype Reverse Gradient (pseudocode figure)]
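To fix ideas, the following sketch renders Algorithm 1 in NumPy. It is only an illustration: the callables psis[i] and jacs[i], returning \(\Psi ^{i+1}\) and its Jacobian \(D\Psi ^{i+1}\), are assumed interfaces and not part of any AD package.

```python
import numpy as np

def archetype_reverse_gradient(psis, jacs, x, y):
    """Sketch of Algorithm 1: returns y^T D F(x) for F = Psi^ell o ... o Psi^1.
    Assumes psis[i](v) evaluates Psi^{i+1}(v) and jacs[i](v) returns the
    Jacobian D Psi^{i+1}(v) as a dense array (hypothetical interface)."""
    # forward sweep: record the argument v^{i-1} fed to each map
    v = [np.asarray(x, dtype=float)]
    for psi in psis:
        v.append(np.asarray(psi(v[-1]), dtype=float))
    # reverse sweep: vbar^T <- vbar^T D Psi^i(v^{i-1}), for i = ell down to 1
    vbar = np.asarray(y, dtype=float)
    for i in reversed(range(len(psis))):
        vbar = jacs[i](v[i]).T @ vbar
    return vbar   # the gradient of f(x) = y^T F(x)
```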

For convenience, we define the operator \(D_i\) as the partial derivative with respect to the \(i\)-th argument. This way, for any function \(g(z)\), the directional derivative of \(g(z)\) in the direction \(d\) becomes

$$\begin{aligned} Dg(z)\cdot d = D_i g(z) d_i , \end{aligned}$$
(3)

where we have omitted the summation symbol for \(i\), and instead, use Einstein notation where a repeated index implies summation over that index.

We use this notation throughout the article unless otherwise stated. Again using the chain rule and (1), we find

$$\begin{aligned} D{\mathrm {F}} \cdot d = D\Psi ^{\ell } D\Psi ^{\ell -1} \cdots D\Psi ^{1}\cdot d, \end{aligned}$$

where we have omitted the arguments. This can be efficiently calculated using a forward sweep, as shown in Algorithm 2; this algorithm is a direct application of the chain rule.

[Algorithm 2: Archetype 1st Order Directional Derivative (pseudocode figure)]
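A matching sketch of Algorithm 2, under the same assumed interfaces as above, propagates the tangent forward through the maps; no intermediate arguments need to be stored.

```python
import numpy as np

def archetype_forward_directional(psis, jacs, x, d):
    """Sketch of Algorithm 2: returns D F(x) . d by a single forward sweep."""
    v = np.asarray(x, dtype=float)
    vdot = np.asarray(d, dtype=float)
    for psi, jac in zip(psis, jacs):
        vdot = jac(v) @ vdot            # chain rule: vdot <- D Psi^i(v^{i-1}) . vdot
        v = np.asarray(psi(v), dtype=float)
    return vdot
```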

2.2 Second-order derivatives

Here we develop a reverse algorithm for calculating the Hessian \(D^2 (y^T {\mathrm {F}}(x))\). First we determine the Hessian for \({\mathrm {F}}\) as a composition of two maps, then we use induction to design a method for when \({\mathrm {F}}\) is a composition of \(\ell \) maps.

For \({\mathrm {F}}(x) = \Psi ^2 \circ \Psi ^1(x) \) and \(\ell =2\), we find the Hessian by differentiating with respect to the \(j\)-th and \(k\)-th coordinates,

$$\begin{aligned} D_{jk}( y_i F_i ) = (y_i D_{rs}\Psi _i^2 ) D_j\Psi _r^1 D_k\Psi _s^1 + (y_i D_r\Psi ^2_i )D_{jk}\Psi ^1_r, \end{aligned}$$
(4)

where the arguments have been omitted. Thus the \((j,k)\) component of the Hessian is \([ D^2(y^T {\mathrm {F}}) ]_{jk}= D_{jk}(y^T {\mathrm {F}})\). The higher the order of the derivative, the messier component notation becomes. A way around this issue is to use a tensor notation, where for every function \(g(z)\) we define

$$\begin{aligned} D^2 { g}(z)\cdot ( v, w) := D_{jk} g(z) v_j w_k, \nonumber \end{aligned}$$

and

$$\begin{aligned} (D^2 g(z)\cdot w) \cdot v := D^2 g(z)\cdot ( v, w) , \end{aligned}$$
(5)

for any function \(g(z)\) and compatible vectors \(v\) and \(w\). In general,

$$\begin{aligned}{}[ D^2 g(z) \cdot ( \triangle , \square ) ]_{t_2 \cdots t_q s_2 \cdots s_p} := D_{t_1 s_1} g(z) \triangle _{t_1 t_2 \cdots t_q} \square _{s_1 s_2 \cdots s_p}, \end{aligned}$$
(6)

and

$$\begin{aligned} ( D^2 g(z)\cdot \square ) \cdot \triangle := D^2 g(z)\cdot ( \triangle , \square ) \end{aligned}$$
(7)

for any compatible \(\triangle \) and \(\square \). Using matrix notation for a composition of maps can be aesthetically unpleasant; using this tensor notation, the Hessian of \(y^T {\mathrm {F}}\), see Eq. (4), becomes

$$\begin{aligned} y^T D^2 {\mathrm {F}} = y^T D^2\Psi ^{2} \cdot \left( D\Psi ^{1},D\Psi ^{1}\right) + y^T D\Psi ^{2}\cdot D^2\Psi ^{1}. \end{aligned}$$
(8)

We recursively use the identity (8) to design Algorithm 3, which calculates the Hessian of a function \(y^T{\mathrm {F}}(x)\) composed of \(\ell \) maps, as defined in Eq. (1). Algorithm 3 corresponds to a very particular way of nesting the derivatives of the maps \(\Psi ^i\), as detailed in [4]. Though the proof that Algorithm 3 produces the desired Hessian matrix is rather convoluted in that setting, following Algorithm 3 we present a far simpler proof.

[Algorithm 3: Archetype Reverse Hessian (pseudocode figure)]
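The following sketch renders Algorithm 3 with one further assumed callable, vhps[i](v, w), returning the vector-weighted Hessian \(\sum _r w_r D^2\Psi ^{i+1}_r(v)\) of the \((i+1)\)-th map. Note that the update of \({\mathrm W}\) uses \(\overline{\mathrm {v}}\) before \(\overline{\mathrm {v}}\) itself is updated.

```python
import numpy as np

def archetype_reverse_hessian(psis, jacs, vhps, x, y):
    """Sketch of Algorithm 3: returns W = y^T D^2 F(x)."""
    # forward sweep: record the argument of each map
    v = [np.asarray(x, dtype=float)]
    for psi in psis:
        v.append(np.asarray(psi(v[-1]), dtype=float))
    vbar = np.asarray(y, dtype=float)
    W = np.zeros((vbar.size, vbar.size))
    for i in reversed(range(len(psis))):
        J = jacs[i](v[i])
        W = J.T @ W @ J + vhps[i](v[i], vbar)   # W <- W.(DPsi, DPsi) + vbar^T D^2 Psi
        vbar = J.T @ vbar                       # vbar^T <- vbar^T D Psi
    return W
```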

Proof of Algorithm 3

We will use induction on the number of compositions \(\ell \). For \(\ell =1\) the output is \({\mathrm {W}} = y^T D^2\Psi ^{1}(x)\). Now we assume that algorithm 3 is correct for functions composed of \(\ell -1\) maps, and use this assumption to show that for \(\ell \) maps the output is \({\mathrm {W}} = y^T D^2 {\mathrm {F}}(x)\). Let

$$\begin{aligned} y^T X = y^T \Psi ^{\ell } \circ \cdots \circ \Psi ^2, \end{aligned}$$

so that \(y^T {\mathrm {F}} = y^T X \circ \Psi ^1.\) Then at the end of the iteration \(i=2\), by the chain rule, \(\overline{\mathrm {v}}^{T} = y^T D X(\mathrm {v}^1)\) and, by induction, \({\mathrm W} =y^T D^2 X(\mathrm {v}^1)\). This way, at termination, or after the iteration \(i=1\), we get

$$\begin{aligned} {\mathrm W}&= y^T D^2 X( \mathrm {v}^1 ) \cdot (D \Psi ^1(x) ,D \Psi ^1(x)) + y^T D X( \mathrm {v}^1 ) \cdot D^2 \Psi ^1 (x)\\&= y^T D^2(X \circ \Psi ^{1}(x)) \quad \quad \left[ \hbox {Eq. (8)}\right] \\&= y^T D^2 {\mathrm {F}}(x). \end{aligned}$$

\(\square \)

2.3 Third-order methods

We define the directional derivative as

$$\begin{aligned} \lim _{t \rightarrow 0}&\frac{d}{dt} D^2 g(z+td) = D_{jkm}g(z)d_m =: D^3 g(z)\cdot d, \end{aligned}$$
(9)

for any function \(g\) and compatible vector \(d\). This tensor notation facilitates working with third-order derivatives, as using matrix notation would lead to confusing equations and possibly hinder intuition. The notation conventions from before carry over naturally to third-order derivatives, with

$$\begin{aligned} ( D^3 g(z) \cdot ( \triangle ,\square ,\lozenge ) )_{t_2 \ldots t_q s_2 \ldots s_p l_2 \ldots l_r} := D_{t_1 s_1 l_1} g(z) \triangle _{t_1 \ldots t_q} \square _{s_1\ldots s_p} \lozenge _{l_1\ldots l_r}, \end{aligned}$$
(10)

and

$$\begin{aligned} D^3 g(z) \cdot ( \triangle ,\square ,\lozenge ) = (D^3 g(z) \cdot \lozenge ) \cdot ( \triangle ,\square ) = (( D^3 g(z) \cdot \lozenge ) \cdot \square ) \cdot \triangle , \end{aligned}$$
(11)

for any compatible \(\triangle , \square \) and \(\lozenge \). The Hessian of a composition of two maps \({\mathrm {F}} = \Psi ^2 \circ \Psi ^1\) is given by Eq. (8); we can use the above to calculate the directional derivative of this Hessian,

$$\begin{aligned} y^T D^3 {\mathrm {F}} \cdot d&= D \left( y^T D^2 \Psi ^2 \cdot (D\Psi ^1,D\Psi ^1) \right) \cdot d + D \left( ( y^T D\Psi ^2 ) \cdot D^2\Psi ^1\right) \cdot d\\&= (y^T D^3\Psi ^2 \cdot D\Psi ^1 \cdot d)\cdot (D\Psi ^1,D\Psi ^1) \!+\! ( y^T D^2\Psi ^2)\cdot (D\Psi ^1,D^2\Psi ^1 \cdot d)\\&\quad +( y^T D^2\Psi ^2)\cdot (D^2\Psi ^1 \cdot d,D\Psi ^1) + (y^T D\Psi ^2)\cdot D^3\Psi ^1 \cdot d\\&\quad + (y^T D^2\Psi ^2 \cdot D\Psi ^1 \cdot d )\cdot D^2\Psi ^1, \end{aligned}$$

which after some rearrangement gives

$$\begin{aligned} y^T D^3 {\mathrm {F}} \cdot d&= y^T D^3\Psi ^2 \cdot (D\Psi ^1,D\Psi ^1,D\Psi ^1 \cdot d) + y^T D\Psi ^2\cdot D^3\Psi ^1 \cdot d\nonumber \\&\quad + y^T D^2\Psi ^2 \cdot \left( (D\Psi ^1,D^2\Psi ^1\cdot d) + (D^2\Psi ^1\cdot d,D\Psi ^1) \right. \nonumber \\&\left. \quad + (D^2\Psi ^1,D\Psi ^1 \cdot d) \right) . \end{aligned}$$
(12)

As usual, we have omitted all arguments of the maps. The above identity, applied recursively, gives us the Reverse Hessian Directional Derivative Algorithm 4, or RevHedir for short. To prove that RevHedir produces the desired output, we use induction with \(\mathrm {X}^m = y^T \Psi ^{\ell } \circ \cdots \circ \Psi ^m\) and work backwards from \(m=\ell \) towards \(m=1\) to calculate \(y^T D^3 {\mathrm {F}}(x)\cdot d\).

[Algorithm 4: RevHedir, Archetype Reverse Hessian Directional Derivative (pseudocode figure)]
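As an illustration of how Algorithm 4 can be carried out with matrix arithmetic only, the sketch below adds two more assumed callables: jacdots[i](v, u), returning the directional derivative of the Jacobian \(D^2\Psi ^{i+1}(v)\cdot u\) as a matrix, and tdirs[i](v, w, u), returning the matrix with entries \(\sum _{r,p} w_r D_{jkp}\Psi ^{i+1}_r(v)\, u_p\). Within each iteration, \({\mathrm {Td}}\) is updated before \({\mathrm W}\) and \(\overline{\mathrm {v}}\), mirroring Eq. (12); the interfaces are hypothetical.

```python
import numpy as np

def archetype_revhedir(psis, jacs, vhps, jacdots, tdirs, x, y, d):
    """Sketch of Algorithm 4 (RevHedir): returns Td = y^T D^3 F(x) . d."""
    # forward sweep: record arguments and tangents of each map
    v = [np.asarray(x, dtype=float)]
    vdot = [np.asarray(d, dtype=float)]
    for i, psi in enumerate(psis):
        vdot.append(jacs[i](v[-1]) @ vdot[-1])
        v.append(np.asarray(psi(v[-1]), dtype=float))
    vbar = np.asarray(y, dtype=float)
    W = np.zeros((vbar.size, vbar.size))
    Td = np.zeros((vbar.size, vbar.size))
    for i in reversed(range(len(psis))):            # reverse sweep
        J = jacs[i](v[i])
        M = jacdots[i](v[i], vdot[i])               # directional derivative of the Jacobian
        Td = (J.T @ Td @ J                          # Td.(DPsi, DPsi)
              + J.T @ W @ M                         # W.(DPsi, D^2Psi.vdot)
              + M.T @ W @ J                         # W.(D^2Psi.vdot, DPsi)
              + vhps[i](v[i], W @ (J @ vdot[i]))    # W.(D^2Psi, DPsi.vdot)
              + tdirs[i](v[i], vbar, vdot[i]))      # vbar^T D^3Psi . vdot
        W = J.T @ W @ J + vhps[i](v[i], vbar)       # as in Algorithm 3
        vbar = J.T @ vbar
    return Td
```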

Proof of Algorithm 4

Our induction hypothesis is that at the end of the \(i = m\) iteration \({\mathrm {Td}} = y^T D^3 \mathrm X^{m}(\mathrm {v}^{m-1}) \cdot \dot{\mathrm {v}}^{m-1}\). After the first iteration \(i=\ell \) of the second loop, paying attention to the initialization of the variables, we have that

$$\begin{aligned} {\mathrm {Td}} = \overline{\mathrm {v}}^T D^3\Psi ^{\ell }(\mathrm {v}^{\ell -1}) \cdot \dot{\mathrm {v}}^{\ell -1} = y^T D^3 \mathrm X^{\ell }(\mathrm {v}^{\ell -1}) \cdot \dot{\mathrm {v}}^{\ell -1}. \end{aligned}$$

Now suppose the hypothesis is true for iterations \(\ell \) down to \(m+1\), so that at the beginning of the \(i =m\) iteration \({\mathrm {Td}} =y^T D^3 {\mathrm {X}}^{m+1}(\mathrm {v}^{m}) \cdot \dot{\mathrm {v}}^{m}.\) To prove the hypothesis we need the following results: at the end of the \(i=m\) iteration

$$\begin{aligned} \overline{\mathrm {v}}^T&= y^T D {\mathrm {X}}^{m}(\mathrm {v}^{m-1}) \quad \text {and} \quad {\mathrm W} = y^T D^2 {\mathrm {X}}^{m}(\mathrm {v}^{m-1}), \end{aligned}$$
(13)

both of which are demonstrated in the proof of Algorithm 3. Now we are equipped to examine \({\mathrm {Td}}\) at the end of the \(i=m\) iteration,

$$\begin{aligned} {\mathrm {Td}} \leftarrow&{\mathrm {Td}} \cdot ( D\Psi ^m (\mathrm {v}^{m-1}) ,D\Psi ^m (\mathrm {v}^{m-1})) + {\mathrm W} \cdot (D\Psi ^m (\mathrm {v}^{m-1}),D^2\Psi ^m (\mathrm {v}^{m-1}) \cdot \dot{\mathrm {v}}^{m-1}) \\&+ {\mathrm W} \cdot (D^2\Psi ^m (\mathrm {v}^{m-1}) \cdot \dot{\mathrm {v}}^{m-1},D\Psi ^m (\mathrm {v}^{m-1}) ) \\&+ {\mathrm W} \cdot (D^2\Psi ^m (\mathrm {v}^{m-1}),D\Psi ^m (\mathrm {v}^{m-1})\cdot \dot{\mathrm {v}}^{m-1}) + \overline{\mathrm {v}}^T D^3\Psi ^m (\mathrm {v}^{m-1})\cdot \dot{\mathrm {v}}^{m-1}. \end{aligned}$$

Now we use the induction hypothesis followed by property (11) to get

$$\begin{aligned}&{\mathrm {Td}} \cdot ( D\Psi ^m (\mathrm {v}^{m-1}) ,D\Psi ^m (\mathrm {v}^{m-1}) )\nonumber \\&\quad = y^T D^3 {\mathrm {X}}^{m+1}(\mathrm {v}^{m}) \cdot \dot{\mathrm {v}}^{m} \cdot ( D\Psi ^m (\mathrm {v}^{m-1}) ,D\Psi ^m (\mathrm {v}^{m-1}) )\\&\quad = y^T D^3 {\mathrm {X}}^{m+1}(\mathrm {v}^{m}) \cdot \left( D\Psi ^m (\mathrm {v}^{m-1}) , D\Psi ^m (\mathrm {v}^{m-1}) ,\dot{\mathrm {v}}^{m}\right) , \end{aligned}$$

where we have used that \(\dot{\mathrm {v}}^{m}= D\Psi ^m (\mathrm {v}^{m-1}) \cdot \dot{\mathrm {v}}^{m-1} \). Then using Eq. (13) to substitute \({\mathrm W}\) and \(\overline{\mathrm {v}}^T\), we arrive at

$$\begin{aligned} \mathrm {Td}&= y^T D^3 {\mathrm {X}}^{m+1}(\mathrm {v}^m ) \cdot \left( D\Psi ^m (\mathrm {v}^{m-1}) , D\Psi ^m (\mathrm {v}^{m-1}) ,D\Psi ^m (\mathrm {v}^{m-1}) \cdot \dot{\mathrm {v}}^{m-1} \right) \\&\quad \,+ y^T D^2 {\mathrm {X}}^m (\mathrm {v}^{m-1}) \cdot (D\Psi ^m (\mathrm {v}^{m-1}) ,D^2\Psi ^m (\mathrm {v}^{m-1}) \cdot \dot{\mathrm {v}}^{m-1}) \\&\quad \,+ y^T D^2 {\mathrm {X}}^m (\mathrm {v}^{m-1}) \cdot (D^2\Psi ^m (\mathrm {v}^{m-1}) \cdot \dot{\mathrm {v}}^{m-1},D\Psi ^m (\mathrm {v}^{m-1}) )\\&\quad \,+ y^T D^2 {\mathrm {X}}^m (\mathrm {v}^{m-1}) \cdot (D^2\Psi ^m (\mathrm {v}^{m-1}),D\Psi ^m (\mathrm {v}^{m-1}) \cdot \dot{\mathrm {v}}^{m-1})\\&\quad \,+ \left( y^T D {\mathrm {X}}^m (\mathrm {v}^{m-1}) \right) D^3\Psi ^m (\mathrm {v}^{m-1}) \cdot \dot{\mathrm {v}}^{m-1}\\&= y^T D^3 {\mathrm {X}}^m (\mathrm {v}^{m-1}) \cdot \dot{\mathrm {v}}^{m-1} \quad [\hbox {Using Eq. (12)}]. \end{aligned}$$

Finally, after iteration \(i=1\), we have

$$\begin{aligned} {\mathrm {Td}} = y^T D^3 {\mathrm {X}}^1(x) \cdot \dot{\mathrm {v}}^0 = y^T D^3 {\mathrm {F}}(x)\cdot d. \end{aligned}$$

\(\square \)

As is to be expected, in the computation of the tensor-vector product, only 2-dimensional tensor arithmetic, or matrix arithmetic, is used, and it is not necessary to form a \(3\)-dimensional tensor. This is different from current hand-coded implementations of this tensor-vector product used in high-order methods [13–15]. In these articles, the entire third-order tensor \(y^T D^3 {\mathrm {F}}(x)\) is formed and then contracted with a vector \(d\).

If the entire \(y^T D^3 {\mathrm {F}}(x)\) tensor is required, then \(3\)-dimensional arithmetic is unavoidable. A reverse method for calculating the entire third-order tensor \(y^T D^3 {\mathrm {F}}(x)\) is given in the final archetype algorithm. For this, we want an expression for the third-order derivative such that

$$\begin{aligned} \lim _{t \rightarrow 0} \frac{d}{dt} y^T D^2 {\mathrm {F}}(x+td) = y^T D^3 {\mathrm {F}}(x)\cdot d, \end{aligned}$$
(14)

for any vector \(d\). From Eq. (12) we see that \(d\) is contracted with the last coordinate in every term except one. To account for this term, we need a switching tensor \(S\) such that

$$\begin{aligned} y^T D^2\Psi ^2 \cdot (D^2\Psi ^1 \cdot d,D\Psi ^1) = y^T D^2\Psi ^2 \cdot (D^2\Psi ^1 ,D\Psi ^1) \cdot S \cdot d, \end{aligned}$$

in other words we define \(S\) as

$$\begin{aligned} S\cdot (v,w,z) = (v,z,w) \quad \text {or} \quad S_{abcijk}v_i w_j z_k = v_a z_b w_c, \end{aligned}$$
(15)

for any vectors \(v,\, w\) and \(z\). This implies that \(S\)’s components are \(S_{abcijk} = \delta _{ai}\delta _{cj}\delta _{bk}\), where \(\delta _{nm} = 1\) if \(n=m\) and \(0\) otherwise. Then for \({\mathrm {F}} = \Psi ^2 \circ \Psi ^1\) we use Eq. (12) to reach

$$\begin{aligned} y^T&D^3 {\mathrm {F}} \cdot d= \left( y^T D^3\Psi ^2\cdot (D\Psi ^1,D\Psi ^1, D\Psi ^1) + y^T D\Psi ^2 \cdot D^3\Psi ^1 \nonumber \right. \\&\quad \quad +\left. y^T D^2\Psi ^2 \cdot \left( (D\Psi ^1,D^2\Psi ^1 ) + (D^2\Psi ^1 ,D\Psi ^1)\cdot S +(D^2\Psi ^1, D\Psi ^1) \right) \right) \cdot d, \end{aligned}$$
(16)

as this is true for every \(d\), we can remove \(d\) from both sides to arrive at \(y^T D^3 {\mathrm {F}}\). With this notation we have, as expected, \((y^T D^3 {\mathrm {F}})_{ijk} = y^T D_{ijk}{\mathrm {F}} \). We can now use this result to build a recurrence for \(D^3 \mathrm X^m\), with \(\mathrm X^m = y^T \Psi ^{\ell } \circ \cdots \circ \Psi ^m\), working from \(m=\ell \) backwards towards \(m=1\) to calculate \(y^T D^3 {\mathrm {F}}(x)\), as is done in Algorithm 5.
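In an implementation the switching tensor \(S\) never needs to be formed: by Eq. (15) it merely permutes the last two indices of the 3-tensor it acts on, so it can be realized as an axis transpose. The following small NumPy check illustrates this reading (a hypothetical snippet, not part of our package):

```python
import numpy as np

n = 4
T = np.random.rand(n, n, n)        # stands in for  y^T D^2 Psi^2 . (D^2 Psi^1, D Psi^1)
d = np.random.rand(n)
lhs = np.einsum('jpk,p->jk', T, d)       # contract d with the middle index of T
TS = np.transpose(T, (0, 2, 1))          # apply S: (T.S)_{abc} = T_{acb}
rhs = np.einsum('jkp,p->jk', TS, d)      # now d contracts with the last index
assert np.allclose(lhs, rhs)
```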

[Algorithm 5: Archetype Reverse Third-Order Tensor (pseudocode figure)]

Proof of Algorithm 5

The demonstration can be carried out in a fashion analogous to the proof of Algorithm 4. \(\square \)

The method presented for calculating second and third-order derivatives can be extended to design algorithms of arbitrarily high order. To do so would require a closed-form expression for the derivatives, of any order, of a composition of two maps (\(\Psi ^2 \circ \Psi ^1 (x)\)), found in [26], which is too long to reproduce here. We can see from this closed expression in [26] that the number of terms that need to be calculated grows combinatorially in the order of the derivative, thus posing a lasting computational challenge.

3 Implementing through state transformations

When coding a function, the user would not write a composition of maps such as in the previous sections, see Eq. (1). Instead, users implement functions in a number of different ways. AD packages standardize these hand-written functions, through compiler tools and operator overloading, into an evaluation that fits the format of Algorithm 6. As an example, consider the function \(f(x_1,x_2,x_3) = x_1x_2\sin (x_3),\) and its evaluation for a given \((x_1,x_2,x_3)\) through the following list of commands

$$\begin{aligned} v_{-2}&= x_1\\ v_{-1}&= x_2\\ v_{0}&= x_3\\ v_{1}&= v_{-2}v_{-1}\\ v_{2}&= \sin (v_{0}) \\ v_{3}&= v_2 v_1. \end{aligned}$$

By naming the functions \(\phi _1(v_{-2},v_{-1}) := v_{-2}v_{-1}\), \(\phi _2(v_{0}) := \sin (v_0) \) and \(\phi _3(v_{2},v_{1}) :=v_2 v_1\), this evaluation is the same as done by Algorithm 6.
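For illustration, this trace can be written out directly in code, with dictionary keys mimicking the negative indexing of the independent variables (a hypothetical rendering, not produced by an AD tool):

```python
import math

def f_trace(x1, x2, x3):
    """Evaluation trace of f(x1, x2, x3) = x1*x2*sin(x3), as in Algorithm 6."""
    v = {}
    v[-2], v[-1], v[0] = x1, x2, x3      # copy the independent variables
    v[1] = v[-2] * v[-1]                 # phi_1(v_{-2}, v_{-1})
    v[2] = math.sin(v[0])                # phi_2(v_0)
    v[3] = v[2] * v[1]                   # phi_3(v_2, v_1)
    return v[3]
```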

In general, each \(\phi _i\) is an elemental function such as addition, multiplication, \(\sin (\cdot )\), \(\exp (\cdot )\), etc, which together with their derivatives are already coded in the AD package. In order, the algorithm first copies the independent variables \(x_i\) into internal intermediate variables \(v_{i-n},\) for \(i=1, \ldots , n\). Following convention, we use negative indexes for elements that relate to independent variables. For consistency, we will shift all indexes of vectors and matrices by \(-n\) from here on, e.g., the components of \(x \in \mathbb {R}^n\) are \(x_{i-n}\) for \(i =1 \ldots n.\)

The next step in Algorithm 6 calculates the value \(v_1\) that only depends on the intermediate variables \(v_{i-n},\) for \(i =1, \ldots ,n\). In turn, the value \(v_2\) may now depend on \(v_{i-n},\) for \(i=1, \ldots ,n+1\), then \(v_3\) may depend on \(v_{i-n},\) for \(i=1, \ldots ,n+2\) and so on for all \(\ell \) intermediate variables. Each \(v_i\) is calculated using only one elemental function \(\phi _i\).

At each step, this procedure establishes dependencies among the intermediate variables \(v_{i}\), for \(i=1, \ldots , \ell \). We say that \(j\) is a predecessor of \(i\) if \(v_j\) is needed to calculate \(v_i\), that is, if \(v_j\) is an argument of \(\phi _i\). This way, we let \(P(i)\) be the set of predecessors of \(i\), and \(v_{P(i)}\) the vector of their values, so that \(\phi _i(v_{P(i)}) = v_i\). Note that \(j \in P(i)\) implies that \(j <i\). Analogously, we define \(S(i)\) as the set of successors of \(i\).

[Algorithm 6: Function Evaluation (pseudocode figure)]

We can bridge this algorithmic description of a function with that of compositions of maps (1) using Griewank and Walther’s [1] state-transformations

$$\begin{aligned} \Phi ^i&:\mathbb {R}^{n+\ell }\rightarrow \mathbb {R}^{n+\ell },\nonumber \\&\mathrm {v} \mapsto (v_{1-n},\ldots ,v_{i-1},\phi _i(v_{P(i)}),v_{i+1},\ldots , v_\ell )^T, \end{aligned}$$
(17)

for \(i =1, \ldots , \ell \), where \(\mathrm v\) is a vector with components \(v_i\). In components,

$$\begin{aligned} \Phi ^i_r ( \mathrm v) = v_r(1 - \delta _{ri}) + \delta _{ri} \phi _i(v_{P(i)}), \end{aligned}$$
(18)

where here, and in the remainder of this article, we abandon Einstein’s convention of summing over repeated indices, because keeping the limits of summation explicit is important when implementing. With this, the value \(f(x)\) given by Algorithm 6 can be written as

$$\begin{aligned} f(x) = e_{\ell +n}^T \Phi ^{\ell } \circ \Phi ^{\ell -1} \circ \cdots \circ \Phi ^{1} \circ ( P^{T} x), \end{aligned}$$
(19)

where \(e_{\ell +n}\) is the \((\ell + n)\)th canonical vector and \(P\) is the immersion matrix \(\left[ I \, \, 0\right] \) with \(I \in \mathbb {R}^{n \times n}\) and \(0 \in \mathbb {R}^{n\times \ell }.\) The Jacobian of the \(i\)th state transformation \(\Phi ^i\), in coordinates, is simply

$$\begin{aligned} D_j \Phi ^i_r(\mathrm {v})= \delta _{rj}(1 - \delta _{ri}) + \delta _{ri}\frac{\partial \phi _i}{\partial v_j} (v_{P(i)}). \end{aligned}$$
(20)

With the state transformations and the structure of their derivatives, we look again at a few of the archetype algorithms in Sect. 2 and build corresponding implementable versions. Our final goal is to implement the RevHedir Algorithm 4, for which we need implementations of the reverse gradient and Hessian algorithms.

3.1 First-order derivatives

To design an algorithm to calculate the gradient of \(f(x)\), given in Eq. (19), we turn to the Archetype Reverse Gradient Algorithm 1 and identify the \(\Phi ^i\)’s in place of the \(\Psi ^i\)’s. Using (20) we find that \(\overline{\mathrm {v}}^T \leftarrow \overline{\mathrm {v}}^T D{\Phi ^i}\) becomes

$$\begin{aligned} \bar{v}_j \leftarrow \bar{v}_j (1 - \delta _{ij}) + \bar{v}_i \frac{\partial \phi _i}{\partial v_j}(v_{P(i)}) \quad \forall j \in \{1-n, \ldots , \ell \} \end{aligned}$$
(21)

where \(\bar{v}_i\) is the \(i\)-th component of \(\overline{\mathrm {v}}\), also known as the \(i\)-th adjoint in the AD literature. Note that if \(j\ne i\) in the above, then this step will only alter \(\bar{v}_j\) if \(j \in P(i).\) Otherwise, if \(j=i\), then this update is equivalent to setting \(\bar{v}_i =0.\) We can disregard this update, as \(\bar{v}_i\) will not be used in subsequent iterations. This is because \(i \not \in P(m)\), for \(m \le i\). With these considerations, we arrive at Algorithm 7, the component-wise version of Algorithm 1. Note how we have used the abbreviated operation \(a+=b\) to mean \(a\leftarrow a+b.\) Furthermore, the last step \(\nabla f \leftarrow \overline{\mathrm {v}}^T P^T\) selects the adjoints corresponding to the independent variables.

An abuse of notation that we will employ throughout is that whenever we refer to \(\bar{v}_i\) in the body of the text, we are referring to the value of \(\bar{v}_i\) after iteration \(i\) of the Reverse Gradient algorithm has finished.

[Algorithm 7: Component-Wise Reverse Gradient (pseudocode figure)]
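A sketch of Algorithm 7 in code, assuming two hypothetical containers filled during the function evaluation: preds[i], holding the predecessor set \(P(i)\), and dphi[i][j], holding \(\partial \phi _i/\partial v_j\) evaluated at \(v_{P(i)}\):

```python
def reverse_gradient_sweep(n, ell, preds, dphi, y=1.0):
    """Sketch of Algorithm 7: component-wise reverse gradient.
    Adjoints are indexed from 1-n to ell; the gradient is the slice of
    adjoints belonging to the independent variables v_{1-n}, ..., v_0."""
    vbar = {j: 0.0 for j in range(1 - n, ell + 1)}
    vbar[ell] = y                            # seed the adjoint of the output node
    for i in range(ell, 0, -1):              # reverse sweep
        for j in preds[i]:
            vbar[j] += vbar[i] * dphi[i][j]  # Eq. (21)
    grad = [vbar[j] for j in range(1 - n, 1)]
    return grad, vbar                        # vbar is reused by later algorithms
```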

Similarly, by using (20) again, each iteration \(i\) of the Archetype 1st Order Directional Derivative Algorithm 2 can be reduced to a coordinate form

$$\begin{aligned} \dot{v}_r \leftarrow (1 - \delta _{ri}) \dot{v}_r + \delta _{ri} \sum _{j \in P(i)}\dot{v}_j \frac{\partial \phi _i}{\partial v_j}(v_{P(i)}), \end{aligned}$$

where \(\dot{v}_j\) is the \(j\)-th component of \(\dot{\mathrm {v}}\). If \(r \ne i\) in the above, then \(\dot{v}_r\) remains unchanged, while if \(r=i\) then we have

$$\begin{aligned} \dot{v}_i \leftarrow \sum _{j \in P(i)}\dot{v}_j \frac{\partial \phi _i}{\partial v_j}(v_{P(i)}). \end{aligned}$$
(22)

We implement this update by sweeping through the successors of each intermediate variable and adding one term at a time to the sum on the right-hand side of (22), see Algorithm 8. It is crucial to observe that the \(i\)-th component of \(\dot{\mathrm {v}}\) will remain unaltered after the \(i\)-th iteration.

Again, when we refer to \(\dot{v}_i\) in the body of the text from this point on, we are referring to the value of \(\dot{v}_i\) after iteration \(i\) has finished in Algorithm 8.

Though we have included the explicit argument \(v_{P(i)}\) of each \(\phi _i\) function and each derivative of \(\phi _i\) in this section, we omit this argument from now on to avoid cluttered notation.

[Algorithm 8: Component-Wise First-Order Directional Derivative (pseudocode figure)]
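Algorithm 8 admits an equally short sketch with the same assumed containers; the direction \(d\) seeds the tangents of the independent variables:

```python
def tangent_sweep(n, ell, preds, dphi, d):
    """Sketch of Algorithm 8: component-wise first-order directional derivative."""
    vdot = {j - n: d[j - 1] for j in range(1, n + 1)}   # seed vdot_{1-n},...,vdot_0 with d
    for i in range(1, ell + 1):
        vdot[i] = sum(vdot[j] * dphi[i][j] for j in preds[i])   # Eq. (22)
    return vdot
```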

3.2 Second-order derivatives

Just by substituting the \(\Phi ^i\)’s for the \(\Psi ^i\)’s in the Archetype Reverse Hessian, Algorithm 3, we can quickly reach a very efficient component-wise algorithm for calculating the Hessian of \(f(x)\), given in Eq. (19). This component-wise algorithm is also known as edge_pushing, and has already been detailed in Gower and Mello [4]. Here we use a different notation which leads to a more concise presentation. Furthermore, the results below form part of the calculations needed for the third-order methods.

There are two steps of Algorithm 3 we must investigate, for we already know how to update \(\overline{\mathrm {v}}\) from the above section. For these two steps, we need to substitute

$$\begin{aligned} D_{jk}\Phi ^i_r(\mathrm {v}) = \frac{\partial ^2 \Phi ^i_r}{\partial v_j \partial v_k }(\mathrm {v}) = \delta _{ri}\frac{\partial ^2 \phi _i}{\partial v_j \partial v_k}(v_{P(i)}), \end{aligned}$$
(23)

and \(D\Phi ^i\), Eq. (20), in \({\mathrm W} \leftarrow {\mathrm W} \cdot (D\Phi ^{i},D\Phi ^{i})+ \overline{\mathrm {v}}^T D^2\Phi ^{i}\), resulting in

$$\begin{aligned} {W}_{jk} \leftarrow&\sum _{s,t=1-n}^{\ell }\frac{\partial \Phi ^{i}_s}{\partial v_j} W_{st} \frac{\partial \Phi ^{i}_t}{\partial v_k} + \sum _{s=1-n}^{\ell } \bar{v}_{s} \frac{\partial ^2 \Phi ^{i}_s}{\partial v_j \partial v_k} \nonumber \\ =&(1 - \delta _{ji}) W_{jk} (1 - \delta _{ki})+ \frac{\partial \phi _{{i}}}{\partial v_j} W_{ii} \frac{\partial \phi _{i}}{\partial v_k} \nonumber \\&+ \frac{\partial \phi _{i}}{\partial v_j} W_{ik} (1 - \delta _{ki}) + (1 - \delta _{ji}) W_{ji} \frac{\partial \phi _{i}}{\partial v_k} \end{aligned}$$
(24)
$$\begin{aligned}&+ \bar{v}_{i} \frac{\partial ^2 \phi _{i}}{\partial v_j \partial v_k}, \end{aligned}$$
(25)

where \(W_{jk}\) is the \(jk\) component of \(\mathrm W\). Before translating these updates into an algorithm, we need a crucial result: at the beginning of iteration \(i-1\), the element \(W_{jk}\) is zero if \(j \ge i\) for all \(k\). We show this by using induction on the iterations of Algorithm 3. Note that \({\mathrm W}\) is initially set to zero, so for the first iteration \(i = \ell \) both (24) and (25) reduce to

$$\begin{aligned} W_{jk} \leftarrow \bar{v}_{\ell } \frac{\partial ^2 \phi _{\ell }}{\partial v_j \partial v_k}, \end{aligned}$$

which is zero for \(j = \ell \) because \(\ell \notin P(\ell )\). Now we assume the induction hypothesis holds at the beginning of the iteration \(i\), so that \(W_{jk} = 0\) for \(j \ge i+1\). So letting \(j \ge i+1\) and executing the iteration \(i\) we get from the updates (24) and (25)

$$\begin{aligned} W_{jk} \leftarrow W_{jk}+ W_{ji} \frac{\partial \phi _{i}}{\partial v_k}, \end{aligned}$$

because \(j \notin P(i)\), so \(\partial \phi _i /\partial v_j =0\). Together with our hypothesis that \(W_{jk} = 0\) and \(W_{ji}=0\), we see that \(W_{jk}\) remains zero. While if \(j=i\), then (24) and (25) set \(W_{jk} \leftarrow 0\) because \(i \not \in P(i)\). Hence at the beginning of iteration \(i-1\) we have that \(W_{jk}=0\) for \(j\ge i\), and this completes the induction.

Furthermore, \({\mathrm W}\) is symmetric at the beginning of each iteration \(i\) because it is initialized to \({\mathrm W} =\mathrm 0\) and each iteration preserves symmetry. Consequently, \(W_{jk}\) is only updated when both \(j,k \le i.\) We make use of this symmetry to avoid unnecessary calculations on symmetric counterparts.

Let \(W_{\{kj\}}= W_{\{jk\}}\) denote both \(W_{jk}\) and \(W_{kj}\). We rewrite updates (24) and (25) considering only \(j,k <i\), for \(W_{jk}\) only gets updated if \(j,k \le i\), and when \(j=i\) or \(k=i\) we know that \(W_{jk} =0\) at the end of iteration \(i\). The first two terms of update (24) become

$$\begin{aligned} W_{\{jk\}}+= \frac{\partial \phi _{i}}{\partial v_j} W_{\{ii\}} \frac{\partial \phi _{i}}{\partial v_k}. \end{aligned}$$

The next two terms can be written as

$$\begin{aligned} W_{\{jk\}} += \frac{\partial \phi _{i}}{\partial v_j} W_{\{ik\}} \; \text { and } \; W_{\{kj\}} += \frac{\partial \phi _{i}}{\partial v_k} W_{\{ij\}}. \end{aligned}$$
(26)

Note that these two updates are the same with \(j\) and \(k\) interchanged. If \(j \not = k\) we can replace both of these operations with just

$$\begin{aligned} W_{\{jk\}} += \frac{\partial \phi _{i}}{\partial v_j} W_{\{ik\}}, \end{aligned}$$
(27)

if we apply this update for every \(j,k<i\) and consider that \(W_{\{jk\}}\) and \(W_{\{kj\}}\) represent the same number.

If \(j = k\) these two operations become

$$\begin{aligned} W_{\{jj\}} += 2 \frac{\partial \phi _{i}}{\partial v_j} W_{\{ik\}}. \end{aligned}$$

The updates (24) and (25) have been implemented with the above considerations in the Pushing and Creating steps of Algorithm 9. The names of these steps, Creating and Pushing, allude to a graph interpretation [4].
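A compact sketch of edge_pushing assembled from the updates above may help fix ideas. It reuses the hypothetical containers of the earlier sketches, together with d2phi[i][(j,k)] for the nonzero second partials of \(\phi _i\) (keys with \(j\le k\)) and the adjoints vbar produced by the reverse gradient sweep; \({\mathrm W}\) is stored once per unordered index pair.

```python
def edge_pushing(ell, preds, dphi, d2phi, vbar):
    """Sketch of Algorithm 9 (edge_pushing). Returns the symmetric sparse W as a
    dict over index pairs (j, k) with j <= k; at the end, the entries indexed by
    the independent variables 1-n, ..., 0 form the Hessian of f."""
    W = {}

    def add(j, k, val):
        key = (j, k) if j <= k else (k, j)
        W[key] = W.get(key, 0.0) + val

    for i in range(ell, 0, -1):
        P = preds[i]
        # collect and remove the entries of W incident to node i
        w_ii = W.pop((i, i), 0.0)
        incident = {}
        for (a, b) in list(W.keys()):
            if b == i:
                incident[a] = W.pop((a, b))
        # Pushing: distribute the incident weight onto the predecessors of i
        for a, w_ai in incident.items():
            for j in P:
                if j == a:
                    add(j, j, 2.0 * dphi[i][j] * w_ai)   # doubled term on the diagonal
                else:
                    add(j, a, dphi[i][j] * w_ai)
        if w_ii != 0.0:
            for idx, j in enumerate(P):
                for k in P[idx:]:                        # each unordered pair once
                    add(j, k, dphi[i][j] * w_ii * dphi[i][k])
        # Creating: add the adjoint-weighted second derivative of phi_i
        for (j, k), h in d2phi[i].items():
            add(j, k, vbar[i] * h)
    return W
```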

3.3 Third-order derivatives

The final algorithm that we translate to implementation is the Hessian directional derivative, the RevHedir Algorithm 4. This implementation has an immediate application in the Halley–Chebyshev class of third-order optimization methods, for at each step of these algorithms, such a directional derivative is required.

[Algorithm 9: edge_pushing (pseudocode figure)]

Identifying each \(\Psi ^i\) with \(\Phi ^i\), we address each of the five operations on the matrix \(\mathrm {Td}\) in Algorithm 4 separately, pointing out how each one preserves the symmetry of \(\mathrm {Td}\) and how to perform the component-wise calculations.

First, given that \(\mathrm {Td}\) is symmetric, the 2D pushing update

$$\begin{aligned} {\mathrm {Td}}\leftarrow {\mathrm {Td}} \cdot \left( D\Phi ^i,D\Phi ^i\right) , \end{aligned}$$
(28)

is exactly as detailed in (24) and the surrounding comments. While the update 3D creating

$$\begin{aligned} {\mathrm {Td}}\leftarrow {\mathrm {Td}} + \overline{\mathrm {v}}^T D^3\Phi ^i \cdot \dot{\mathrm {v}}^{i-1}, \end{aligned}$$

can be written in coordinate form as

$$\begin{aligned} Td_{jk}&\leftarrow Td_{jk}+ \sum _{r,p=1-n}^{\ell } \overline{v}_r D_{jkp}\Phi ^i_r \dot{{v}}_p^{i-1} \nonumber \\&= Td_{jk}+ \sum _{p \in P(i)}\overline{v}_i \frac{\partial ^3 \phi _i}{\partial v_j \partial v_k \partial v_p} \dot{v}_p, \end{aligned}$$
(29)

where \( \dot{{v}}_p^{i-1}\) is the \(p\)-th component of \(\dot{\mathrm {v}}^{i-1}\), \(\dot{v}_p\) is the output of Algorithm 8 and \(Td_{jk}\) is the \(jk\) component of \(\mathrm {Td}\). Note that \(\dot{{v}}_p^{i-1} = \dot{v}_p\) for \(p \in P(i)\), because \(p \le i-1\), so on iteration \(i-1\) of Algorithm 8 the calculation of \(\dot{v}_p\) will already have been finalized. Another trick we employ is that, since the above calculation is performed on iteration \(i\), we know that \(\bar{v}_i\) has already been calculated. These substitutions involving the \(\bar{v}_i\)’s and \(\dot{v}_i\)’s will be carried out in the rest of the text with little or no comment. The update (29) also preserves the symmetry of \({\mathrm {Td}}\).

To examine the update,

$$\begin{aligned} {\mathrm {Td}}\leftarrow {\mathrm {Td}}+ {\mathrm W} \cdot \left( D\Phi ^i, D^2\Phi ^i\cdot \dot{\mathrm {v}}^{i-1} \right) , \end{aligned}$$
(30)

we use (20) and (23) to obtain the coordinate form

$$\begin{aligned} Td_{jk}&\leftarrow Td_{jk} + \sum _{r,s=1-n}^{\ell }W_{rs}\left( \delta _{rj}(1-\delta _{ri})+ \delta _{ri}\frac{\partial \phi _i}{\partial v_j} \right) \delta _{si}\frac{\partial ^2 \phi _i}{\partial v_k \partial v_p} \dot{v}_p \nonumber \\&= Td_{jk}+ W_{ji}(1-\delta _{ji})\frac{\partial ^2 \phi _i}{\partial v_k \partial v_p} \dot{v}_p+ W_{ii}\frac{\partial \phi _i}{\partial v_j} \frac{\partial ^2 \phi _i}{\partial v_k \partial v_p} \dot{v}_p. \end{aligned}$$
(31)

Upon inspection, the update

$$\begin{aligned} {\mathrm {Td}}\leftarrow {\mathrm {Td}}+ {\mathrm W} \cdot \left( D^2\Phi ^i\cdot \dot{\mathrm {v}}^{i-1}, D\Phi ^i \right) \end{aligned}$$

is the transpose of (31) due to the symmetry of \({\mathrm W}\). So it can be written in coordinate form as

$$\begin{aligned} Td_{jk}&\leftarrow Td_{jk}+ W_{ik}(1-\delta _{ki})\frac{\partial ^2 \phi _i}{\partial v_j \partial v_p} \dot{v}_p+ W_{ii} \frac{\partial \phi _i}{\partial v_k} \frac{\partial ^2 \phi _i}{\partial v_j \partial v_p} \dot{v}_p. \end{aligned}$$
(32)

Thus update (32) together with (31) gives a symmetric contribution to \({\mathrm {Td}}\).

Last we translate

$$\begin{aligned} {\mathrm {Td}} \leftarrow {\mathrm {Td}}+ {\mathrm W} \cdot \left( D^2\Phi ^i, D\Phi ^i \cdot \dot{\mathrm {v}}^{i-1} \right) , \end{aligned}$$
(33)

to its coordinate form

$$\begin{aligned} Td_{jk}&\leftarrow Td_{jk}+\sum _{r,s=1-n}^{\ell } W_{rs}\delta _{ri} D_{jk}\Phi ^i_r \left( \delta _{sp}(1-\delta _{si})+ \delta _{si}\frac{\partial \phi _i}{ \partial v_p}\right) \dot{v}_p \nonumber \\&= Td_{jk}+ W_{ip}\frac{\partial ^2 \phi _i}{\partial v_j \partial v_k} (1-\delta _{pi})\dot{v}_p + W_{ii}\frac{\partial ^2 \phi _i}{\partial v_j \partial v_k} D_{p} \phi _i \dot{v}_p. \end{aligned}$$
(34)

No change is effected by interchanging the indices \(j\) and \(k\) on the right-hand side of (34), so once again \({\mathrm {Td}}\) remains symmetric. For convenience of computing, we group updates (31), (32) and (34) into a set of updates called 2D Connecting. The name indicates that these updates “connect” objects that contain second-order derivative information.

More than being just symmetric, a close inspection of these operations shows that the sparsity structure of \({\mathrm {Td}}\) is contained in the sparsity structure of \({\mathrm W}\). This remains true even after execution, at which point \({\mathrm {Td}}= D^3f(x)\cdot d\) and \({\mathrm W} = D^2f(x)\), where, for each \(j,k,p \in \{1-n, \ldots , 0\}\), we have

$$\begin{aligned} D_{jk}f(x) = 0 \implies D_{jkp}f(x)d_p = 0. \end{aligned}$$

This fact should be exploited when implementing the method, in that the data structure of \({\mathrm {Td}}\) should imitate that of \({\mathrm W}\).

3.3.1 Implementing third-order directional derivative

The matrices \({\mathrm {Td}}\) and \({\mathrm W}\) are symmetric, and based on the assumption that they will be sparse, we will represent them using a symmetric sparse data structure. Thus we now identify each pair \((W_{jk},W_{kj})\) and \((Td_{jk},Td_{kj})\) with the element \(W_{\{jk\}}\) and \(Td_{\{jk\}}\), respectively. Much like in edge_pushing, Algorithm 9, we want an efficient implementation of the updates to \(Td_{\{jk\}}\) that only takes the contributions from nonzero elements of \(Td_{\{ik\}}\) and \(W_{\{ik\}}\), and does not repeat unnecessary calculations.

We must take care when updating our symmetric representation of \({\mathrm {Td}}\), both for the 2D pushing update (28) and for the redundant symmetric counterparts (31) and (32) which “double-up” on the diagonal, much like in the Pushing operations of Algorithm 9. Each operation  (31), (32) and (34) depends on a diagonal element \(W_{\{ii\}}\) and an off-diagonal element \(W_{\{ik\}}\) of \({\mathrm W}\), for \(k \ne i\). Grouping together all terms that involve \(W_{\{ii\}}\) we get the resulting update

$$\begin{aligned} Td_{\{jk\}}&+= W_{\{ii\}} \sum _{p \in P(i)}\dot{v}_p\left( \frac{\partial \phi _i}{\partial v_j} \frac{\partial ^2 \phi _i}{\partial v_k \partial v_p} + \frac{\partial \phi _i}{\partial v_k } \frac{\partial ^2 \phi _i}{\partial v_j \partial v_p} + \frac{\partial \phi _i}{ \partial v_p} \frac{\partial ^2 \phi _i}{\partial v_j \partial v_k}\right) . \end{aligned}$$
(35)

Similar to how the two updates in (26) were replaced by the single update (27), here, by appropriately renaming the indices in (31), (32) and (34), each nonzero off-diagonal element \(W_{ik}\) results in the updates (36), (37) and (38), respectively.

$$\begin{aligned} Td_{jk}&+= \sum _{p \in P(i)} \dot{v}_p \frac{\partial ^2 \phi _i}{\partial v_j \partial v_p} W_{ik}, \quad \forall j \in P(i) \end{aligned}$$
(36)
$$\begin{aligned} Td_{kj}&+= \sum _{p \in P(i)} \dot{v}_p \frac{\partial ^2 \phi _i}{\partial v_j \partial v_p} W_{ki}, \quad \forall j \in P(i) \end{aligned}$$
(37)
$$\begin{aligned} Td_{jp}&+= \dot{v}_k \frac{\partial ^2 \phi _i}{\partial v_j \partial v_p} W_{ik}, \quad \forall j, p \in P(i) \end{aligned}$$
(38)

Note that (36) and (37) are symmetric updates, and when \(j=k\) these two operations “double-up” resulting in the update

$$\begin{aligned} Td_{jj} += 2 \sum _{p \in P(i)} \dot{v}_p \frac{\partial ^2 \phi _i}{\partial v_j \partial v_p} W_{ij}. \end{aligned}$$

Passing to our symmetric notation, both (37) and (36) can be accounted for by using (36) to update \(Td_{\{jk\}}\) over every \(j\) and \(k\), where \(Td_{\{jk\}} = Td_{\{kj\}}\), and with an exception for this doubling effect in Algorithm 10. Finally we can eliminate redundant symmetric calculations performed in (38) by only performing this operation for each pair \(\{j,p\}\). All these considerations relating to 2D connecting have been factored into our implementation of the RevHedir Algorithm 10.

Performing the 3D Creating update (29) using this symmetric representation is simply a matter of not repeating the obvious symmetric counterpart, but instead performing these operations on \(Td_{\{jk\}}\) once for each appropriate pair \(\{j,k\}\); see 3D Creating in Algorithm 10.

[Algorithm 10: RevHedir implementation (pseudocode figure)]
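The kernel of Algorithm 10 can be sketched per node. The function below performs the 2D Connecting updates (35)–(38) and the 3D Creating update (29) for a single node \(i\); the 2D Pushing of \({\mathrm {Td}}\), Eq. (28), mirrors the Pushing step of edge_pushing and is omitted here. It assumes the same hypothetical containers as before, now including d3phi[i] for the nonzero third partials of \(\phi _i\) (keyed by sorted index triples), the tangents vdot of Algorithm 8, and the symmetric sparse \({\mathrm W}\) as held by edge_pushing at iteration \(i\), with entries incident to \(i\) stored under keys \((a,i)\).

```python
def connecting_and_creating(i, preds, dphi, d2phi, d3phi, vbar, vdot, W, add_td):
    """Sketch of the 2D Connecting and 3D Creating steps of Algorithm 10 at node i.
    add_td(j, k, val) increments the symmetric sparse element Td_{jk} = Td_{kj}."""
    P = preds[i]
    def d2(j, k):    return d2phi[i].get(tuple(sorted((j, k))), 0.0)
    def d3(j, k, p): return d3phi[i].get(tuple(sorted((j, k, p))), 0.0)
    pairs = [(j, k) for a, j in enumerate(P) for k in P[a:]]   # unordered pairs in P(i)

    # 3D Creating, Eq. (29)
    for j, k in pairs:
        add_td(j, k, vbar[i] * sum(d3(j, k, p) * vdot[p] for p in P))

    # 2D Connecting with the diagonal element W_{ii}, Eq. (35)
    w_ii = W.get((i, i), 0.0)
    if w_ii != 0.0:
        for j, k in pairs:
            s = sum(vdot[p] * (dphi[i][j] * d2(k, p)
                               + dphi[i][k] * d2(j, p)
                               + dphi[i][p] * d2(j, k)) for p in P)
            add_td(j, k, w_ii * s)

    # 2D Connecting with off-diagonal elements W_{ik}, Eqs. (36)-(38)
    for (a, b), w in list(W.items()):
        if b != i or a == i:
            continue                         # keep only off-diagonal entries incident to i
        k = a
        for j in P:                          # Eqs. (36)-(37), doubled when j == k
            val = w * sum(vdot[p] * d2(j, p) for p in P)
            add_td(j, k, 2.0 * val if j == k else val)
        for j, p in pairs:                   # Eq. (38), once per unordered pair {j, p}
            add_td(j, p, w * vdot[k] * d2(j, p))
```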

4 Numerical experiment

We have implemented the RevHedir Algorithm 10 as an additional driver of ADOL-C, a well-established AD library coded in C and C++ [27]. We used version ADOL-C-2.4.0, the most recent available. The tests were carried out on a personal laptop with a 1.70 GHz dual-core Intel Core i5-3317U processor, 4 GB of RAM, and the Ubuntu 13.0 operating system.

For those interested in replicating our implementation, we used a sparse undirected weighted graph data structure to represent the matrices \({\mathrm W}\) and \({\mathrm {Td}}\). The data structure is an array of weighted neighbourhood sets, one for each node, where each neighbourhood set is a dynamic array that resizes when needed. Each neighbourhood set is maintained in order and the method used to insert or increment the weight of an edge is built around a binary search.
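A rough Python rendering of this structure is given below; the actual C++ implementation differs in detail, but the idea of ordered neighbourhood arrays with binary-search insertion is the same.

```python
import bisect

class SymSparse:
    """Symmetric sparse matrix as an array of ordered weighted neighbourhoods."""
    def __init__(self, num_nodes):
        self.nbrs = [[] for _ in range(num_nodes)]   # sorted neighbour indices
        self.wts  = [[] for _ in range(num_nodes)]   # matching edge weights

    def add(self, i, j, val):
        """Increment the weight of edge {i, j}, stored once on min(i, j)."""
        a, b = (i, j) if i <= j else (j, i)
        pos = bisect.bisect_left(self.nbrs[a], b)
        if pos < len(self.nbrs[a]) and self.nbrs[a][pos] == b:
            self.wts[a][pos] += val
        else:
            self.nbrs[a].insert(pos, b)
            self.wts[a].insert(pos, val)
```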

We have hand-picked fourteen problems from the CUTE collection [28], augmlagn from [29], and toiqmerg (Toint quadratic merging problem) and chainros_trigexp (chained Rosenbrock function with trigonometric and exponential constraints) from [30] for the experiments. We have also created a function

$$\begin{aligned} \hbox {heavy}\_\hbox {band}(x,band) = \sum _{i=1}^{n-band} \sin \left( \sum _{j=1}^{band} x_{i+j} \right) . \end{aligned}$$

For our experiments, we tested heavy_band(\(x,20\)). The problems were selected based on the sparsity pattern of \(D^3f(x)\cdot d\), dimension scalability and sparsity. Our goal was to cover a variety of patterns, to easily change the dimension of the function and work with sparse matrices.
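For reference, heavy_band can be transcribed directly; in the Python version below, x is a 0-based list standing for \(x_1,\ldots ,x_n\), and the function is included only to document the test problem.

```python
import math

def heavy_band(x, band):
    """heavy_band(x, band) = sum_{i=1}^{n-band} sin( sum_{j=1}^{band} x_{i+j} )."""
    n = len(x)
    return sum(math.sin(sum(x[i + j - 1] for j in range(1, band + 1)))
               for i in range(1, n - band + 1))
```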

In Table 1, the “Pattern” column indicates the type of sparsity pattern: bandwidth of value \(x\) (B \(x\)), arrow, frame, number of diagonals (D \(x\)), or irregular pattern. The “nnz/n” column gives the number of nonzeros in \(D^3f(x)\cdot d\) divided by the dimension \(n\), which serves as a measure of density. For each problem, we applied the RevHedir and edge_pushing Algorithms 10 and 9 to the objective function \(f: \mathbb {R}^n \rightarrow \mathbb {R}\), with \(x_i = i\) and \(d_i=1\), for \(i =1,\ldots , n\), and give the runtime of each method for dimension \(n=10^6\) in Table 1. Note that all of these matrices are very sparse, partly due to the “thinning out” caused by the high-order differentiation. This probably contributed to the relatively low runtimes, for in these tests the runtimes have a \(0.75\) correlation with the density measure “nnz/n”. This leads us to believe that the actual pattern is not a decisive factor in runtime.

Table 1 Description of problem set together with the execution time in seconds of edge_push and RevHedir for \(n=10^6\)

We did not benchmark our results against an alternative algorithm, for we could not find a known AD package that is capable of efficiently calculating such directional derivatives for such high dimensions. For small dimensions, we used the tensor_eval routine of ADOL-C to calculate the entire tensor using univariate forward Taylor series propagation [23], and then contracted the resulting tensor with the vector \(d\). This was useful to check that our implementation was correct, though it would struggle with dimensions over \(n=100\), and was thus not an appropriate comparison.

Remarkably, the time spent by RevHedir to calculate \(D^3f(x)\cdot d\) was, on average, 108% of that spent calculating \(D^2f(x)\) in the above tests. This means the user could gain third-order information for approximately the same cost as calculating the Hessian. The code for these tests can be downloaded as part of a package called HighOrderReverse from the Edinburgh Research Group in Optimization website: http://www.maths.ed.ac.uk/ERGO/.

5 Conclusion

Our contribution boils down to a framework for designing high-order reverse methods, and an efficient implementation of the directional derivative of the Hessian called RevHedir. The framework paves the way to obtaining a reverse method for all orders once and for all. Such an achievement could cause a paradigm shift in numerical method design, wherein, instead of increasing the number of steps or refining the mesh, increasing the order of the local approximations becomes conceivable. We have also shed light on existing AD methods, providing a concise proof of the edge_pushing [4] and the reverse gradient directional derivative [31] algorithms.

The novel Algorithms 4 and 5, for calculating the contraction of the third-order derivative with a vector and the entire third-order derivative, respectively, fulfil what we set out to achieve: they accumulate the desired derivative “as a whole”, thus taking advantage of overlapping calculations amongst individual components. This is in contrast with what is currently being used, e.g., univariate Taylor expansions [23] and repeated tangent/adjoint operations [24]. These algorithms can also make use of symmetry, as illustrated in our implementation of the RevHedir Algorithm 10, wherein all operations are carried out only on a lower triangular matrix.

We implemented and tested RevHedir, with two noteworthy results. The first is its capacity to calculate sparse derivatives of functions with up to a million variables. The second is that the time spent by RevHedir to calculate the directional derivative \(D^3f(x)\cdot d\) was very similar to that spent by edge_pushing to calculate the Hessian. We believe this is true in general and plan on confirming it in future work through complexity analysis. Should this be confirmed, it would have an immediate consequence in the context of nonlinear optimization, in that the third-order Halley–Chebyshev methods could be used to solve large dimensional problems with an iteration cost proportional to that of a Newton step. In more detail, at each step the Halley–Chebyshev methods require the Hessian matrix and its directional derivative. The descent direction is then calculated by solving the Newton system and an additional system with the same sparsity pattern as the Newton system. If it is confirmed that solving these systems costs the same, in terms of complexity, then the cost of a Halley–Chebyshev iteration will be proportional to that of a Newton step. Though this comparison only holds if one uses these AD procedures to calculate the derivatives in both methods.

The CUTE functions used to test both edge_pushing and RevHedir are rather limited, and further tests on real-world problems should be carried out. Also, complexity bounds need to be developed for both algorithms.

A current limitation of reverse AD procedures, such as the ones we have presented, is their memory usage. All floating point values of the intermediate variables must be recorded on the forward sweep and kept for use in the reverse sweep. This can be a very substantial amount of memory, and can be prohibitive for large-scale functions [32]. As an example, when we used dimensions of \(n=10^7\), most of our above test cases exhausted the available memory of the personal laptop used. A possible solution is to allow a trade-off between runtime and memory usage by reversing only parts of the procedure at a time. This method is called checkpointing [32, 33].