1 Introduction

The solution of moderate- to high-dimensional PDEs (more than four dimensions) comes with a high demand for computational power. This is due to the curse of dimensionality: very large computational grids are required even for moderate accuracy, since the grid size grows exponentially with the dimension of the problem. Regular grids are thus not feasible even when future exascale systems are to be utilized. Fortunately, hierarchical discretization schemes come to the rescue. So-called sparse grids [53] mitigate the curse of dimensionality to a large extent.

Nonetheless, the need for HPC resources remains. The aim of two recent projects, one (EXAHD) within the German priority program “Software for exascale computing” and one supported through an Australian Linkage grant and Fujitsu Laboratories of Europe, has been to study the sparse grid combination technique for the solution of moderate-dimensional PDEs which arise in plasma physics for the simulation of hot fusion plasmas. The combination technique is well-suited for such large-scale simulations on future exascale systems, as it adds a second level of parallelism which admits scalability. Furthermore, its hierarchical principle can be used to support algorithm-based fault tolerance [38, 46]. In this work, we focus on recent developments with respect to the theory and application of the underlying methodology, the sparse grid combination technique.

The sparse grid combination technique utilizes numerical solutions \(u(\gamma )\) of partial differential equations computed for selected values of the parameter vector \(\gamma\) which controls the underlying grids. As the name suggests, the method then proceeds by computing a linear combination of the component solutions \(u(\gamma )\):

$$\displaystyle{ u_{I} =\sum _{\gamma \in I}c_{\gamma }\,u(\gamma )\;. }$$
(1)

Computationally, the combination technique thus consists of a reduction operation which evaluates the linear combination of the computationally independent components \(u(\gamma )\). A similar structure is commonly found in data analytic problems and is exploited by the MapReduce method. Since the inception of the combination technique, parallel algorithms have been studied which make use of this computational structure [15, 18, 19]. The current work is based on the same principles as these earlier works, see [2, 24, 25, 28, 31, 32, 38–40, 48].

The combination technique computes a sparse grid approximation without having to implement complex sparse grid data structures. The result is a proper sparse grid function. In the case of the interpolation problem one typically obtains the exact sparse grid interpolant, but for other problems (like finite element solutions) one obtains an approximating sparse grid function. Mathematically, the combination technique is an extrapolation method, and the accuracy is established using error expansions, see [5, 44, 45]. However, specific error expansions are only known for simple cases. Some recent work on errors of the sparse grid combination technique can be found in [16, 20, 21, 47]. The scarcity of theoretical results, however, has not hindered its popularity in applications. Examples include partial differential equations in fluid dynamics, the advection and advection-diffusion equation, the Schrödinger equation, financial mathematics, and machine learning, see, e.g., [8–13, 17, 41, 51]. However, as the combination technique is an extrapolation method, it is inherently unstable and large errors may occur if the error expansions do not hold. This is further discussed in [30], where a stabilized approach, the so-called Opticom method, is also analyzed. Several new applications based on this stabilized approach are discussed in [1, 7, 23, 26, 35, 51, 52]. Other non-standard combination approximations are considered in [4, 35, 37, 43].

The main application considered in the following deals with the solution of the gyrokinetic equations by the software code GENE [14]. These equations are an approximation for the case of a small Larmor-radius of the Vlasov equations for densities \(f_{s}\) of plasmas,

$$\displaystyle{ \frac{\partial f_{s}} {\partial t} +\mathbf{ v} \cdot \frac{\partial f_{s}} {\partial \mathbf{x}} + \frac{q_{s}} {m_{s}}(\mathbf{E} +\mathbf{ v} \times \mathbf{ B}) \cdot \frac{\partial f_{s}} {\partial \mathbf{v}} = 0\;. }$$
(2)

The densities are distribution functions over the state space, \(\mathbf{E}\) and \(\mathbf{B}\) are the electric and magnetic fields (both external and induced by the plasma), \(\mathbf{v}\) is the velocity and \(\mathbf{x}\) the location. The fields \(\mathbf{E}\) and \(\mathbf{B}\) are then the solution of the Maxwell equations for the charge and current densities defined by

$$\displaystyle{ \rho (\mathbf{x},t) =\sum _{s}q_{s}\int f_{s}(\mathbf{x},\mathbf{v},t)\,dv,\quad \text{and}\quad \mathbf{j}(\mathbf{x},t) =\sum _{s}q_{s}\int f_{s}(\mathbf{x},\mathbf{v},t)\mathbf{v}dv\;. }$$
(3)

While the state space has 6 dimensions (3 space and 3 velocity), the gyrokinetic equations reduce this to 5 dimensions. The index \(s\) numbers the different species (ions and electrons). The numerical scheme uses both finite differences and spectral approximations. As complex Fourier transforms are used, the densities \(f_{s}\) are complex.
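
To make the role of the velocity moments in Eq. (3) concrete, the following small Python sketch (not part of GENE or any code discussed here; the grid sizes, the Maxwellian test distribution and the charge are purely illustrative) computes the charge and current densities of a single species by numerical quadrature over the velocity grid.

```python
import numpy as np

# toy 1D-in-x, 1D-in-v distribution function on a tensor grid (illustrative only)
x = np.linspace(0.0, 1.0, 64)
v = np.linspace(-5.0, 5.0, 201)
X, V = np.meshgrid(x, v, indexing="ij")
f = (1.0 + 0.1 * np.sin(2 * np.pi * X)) * np.exp(-V**2 / 2) / np.sqrt(2 * np.pi)

q = 1.0                              # species charge (arbitrary units)
dv = v[1] - v[0]
rho = q * f.sum(axis=1) * dv         # charge density, cf. Eq. (3)
j = q * (f * V).sum(axis=1) * dv     # current density, cf. Eq. (3)
print(rho[:3], abs(j).max())         # j is ~0 for this v-symmetric distribution
```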

In Sect. 2 a general combination technique suitable for our application is discussed. In this setting the set \(I\) occurring in the combination formula (1) uniquely determines the combination coefficients \(c_{\gamma }\) in that formula. Some parallel algorithms and data structures supporting the sparse grid combination technique are presented in Sect. 3. In order to stabilize the combination technique, the combination coefficients need to be modified and even chosen dependent on the solution; this is covered in Sect. 4. An important application area, eigenvalue problems, is treated in Sect. 5, where we discuss the particular challenges and algorithms for this problem class.

2 A Class of Combination Techniques

Here we call a combination technique any method which is obtained by setting some of the hierarchical surpluses to zero. This includes the traditional sparse grid combination technique [19], the truncated combination technique [4], dimension adaptive variants [10, 29] and even some of the fault tolerant methods [24]. The motivation for this larger class is that the basic error splitting assumption (which can be viewed as an assumption about the surplus) often does not hold in these cases. We will now formally define this combination technique.

We assume that we have at our disposal a computer code which is able to produce approximations of some real or complex number, some vector or some function. We denote the quantity of interest by \(u\) and assume that the space of all possible \(u\) is a Euclidean vector space (including the numbers) or a Hilbert space of functions. The computer codes are assumed to compute a very special class of approximations \(u(\gamma )\) which in some way are associated with regular \(d\)-dimensional grids with step size \(h_{i} = 2^{-\gamma _{i}}\) in the \(i\)-th coordinate. For simplicity we will assume that in principle our code can compute \(u(\gamma )\) for any \(\gamma \in \mathbb{N}_{0}^{d}\). Furthermore, \(u(\gamma ) \in V (\gamma )\) where the spaces \(V (\gamma ) \subset V\) are hierarchical, such that \(V (\alpha ) \subset V (\beta )\) when \(\alpha \leq \beta\) (i.e. where \(\alpha _{i} \leq \beta _{i}\) for all \(i = 1,\ldots,d\)). For example, if \(V = \mathbb{R}\) then all \(V (\gamma ) = \mathbb{R}\) as well. Another example is the space of functions with bounded (in \(L_{2}\)) mixed derivatives \(V = H_{\mathop{\mathrm{mix}}\nolimits }^{1}\left ([0,1]^{d}\right )\). In this case one may choose \(V (\gamma )\) to be appropriate spaces of multilinear functions.

The quantities of interest include solutions of partial differential equations, minima of convex functionals and eigenvalues and eigenfunctions of differential operators. They may also be functions or functionals of solutions of partial differential equations. They may be moments of some particle densities which themselves are solutions to some Kolmogorov, Vlasov, or Boltzmann equations. The computer codes may be based on finite difference and finite element solvers, least squares and Ritz solvers, but could also just be interpolants or projections. In all these cases, the combination technique is a method which combines multiple approximations \(u(\gamma )\) to get more accurate approximations. Of course, the way the underlying \(u(\gamma )\) are computed will have some impact on the final combination approximation.

The combination technique is fundamentally tied to the concept of the hierarchical surplus [53] which was used to introduce the sparse grids. However, there is a subtle difference between the surplus used to define the sparse grids and the one at the foundation of the combination technique. The surplus used for sparse grids is based on the representation of functions as a series of multiples of hierarchical basis functions. In contrast, the combination technique is based on a more general decomposition. It is obtained from the following result, which follows from two lemmas in Chapter 4 of [22].

Proposition 1 (Hierarchical surplus)

Let \(V (\gamma )\) be linear spaces with \(\gamma \in \mathbb{N}_{0}^{d}\) such that \(V (\alpha ) \subset V (\beta )\) if \(\alpha \leq \beta\) and let \(u(\gamma ) \in V (\gamma )\). Then there exist \(w(\alpha ) \in V (\alpha )\) such that

$$\displaystyle{ \sum _{\alpha \leq \gamma }w(\alpha ) = u(\gamma )\;. }$$
(4)

Moreover, the \(w(\alpha )\) are uniquely determined and one has

$$\displaystyle{ w(\alpha ) =\sum _{\gamma \in B(\alpha )}(-1)^{\vert \alpha -\gamma \vert }\,u(\gamma ) }$$
(5)

where \(B(\alpha ) =\{\gamma \geq 0\mid \alpha -1 \leq \gamma \leq \alpha \}\) and \(1 = (1,\ldots,1) \in \mathbb{N}^{d}\).

The set of \(\gamma\) is countable and the proposition is proved by induction over this set. Note that the equations are cumulative sums and the solution is given in the form of a finite difference. For the case of \(d = 2\) and \(\gamma \leq (2,2)\) one gets the following system of equations:

$$\displaystyle{ \left [\begin{array}{*{10}c} u(2,2) \\ u(1,2) \\ u(2,1) \\ u(0,2) \\ u(1,1) \\ u(2,0) \\ u(0,1) \\ u(1,0) \\ u(0,0) \end{array} \right ] = \left [\begin{array}{*{10}c} 1&1&1&1&1&1&1&1&1\\ &1 & &1 &1 & &1 &1 &1 \\ & &1& &1&1&1&1&1\\ & & &1 & & &1 & &1 \\ & & & &1& &1&1&1\\ & & & & &1 & &1 &1 \\ & & & & & &1& &1\\ & & & & & & &1 &1 \\ & & & & & & & &1 \end{array} \right ]\left [\begin{array}{*{10}c} w(2,2) \\ w(1,2) \\ w(2,1) \\ w(0,2) \\ w(1,1) \\ w(2,0) \\ w(0,1) \\ w(1,0) \\ w(0,0) \end{array} \right ]\;. }$$
(6)

Note that all the components of the right hand side and the solution are elements of linear spaces. The vector of \(w(\alpha )\) is for the example:

$$\displaystyle{ \left [\begin{array}{*{10}c} w(2,2) \\ w(1,2) \\ w(2,1) \\ w(0,2) \\ w(1,1) \\ w(2,0) \\ w(0,1) \\ w(1,0) \\ w(0,0) \end{array} \right ] = \left [\begin{array}{*{10}c} +1&-1&-1& &+1& & & &\\ &+1 & &-1 &-1 & &+1 && \\ & &+1& &-1&-1& &+1&\\ & & &+1 & & &-1 && \\ & & & &+1& &-1&-1&+1\\ & & & & &+1 & &-1 & \\ & & & & & &+1& &-1\\ & & & & & & &+1 &-1 \\ & & & & & & & &+1 \end{array} \right ]\left [\begin{array}{*{10}c} u(2,2) \\ u(1,2) \\ u(2,1) \\ u(0,2) \\ u(1,1) \\ u(2,0) \\ u(0,1) \\ u(1,0) \\ u(0,0) \end{array} \right ]\;. }$$
(7)
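
The following Python sketch (illustrative only, not taken from [22]) implements the finite difference formula (5) for the surpluses and verifies the cumulative-sum identity (4) for \(d = 2\); the function u below is a stand-in for an arbitrary computer code producing the \(u(\gamma )\).

```python
import itertools

def surplus(alpha, u):
    """Hierarchical surplus w(alpha) via Eq. (5): a signed sum of u(gamma)
    over B(alpha) = {gamma >= 0 : alpha - 1 <= gamma <= alpha}."""
    w = 0.0
    for offset in itertools.product(*[(0, 1) if a > 0 else (0,) for a in alpha]):
        gamma = tuple(a - o for a, o in zip(alpha, offset))
        w += (-1) ** sum(offset) * u(gamma)
    return w

def u(gamma):
    """Stand-in for a computer code: an 'approximation' improving with the level."""
    prod = 1.0
    for g in gamma:
        prod *= 1.0 - 4.0 ** (-(g + 1))
    return prod

# verify the cumulative-sum identity (4) for gamma = (2, 2)
gamma = (2, 2)
total = sum(surplus(alpha, u)
            for alpha in itertools.product(range(gamma[0] + 1), range(gamma[1] + 1)))
print(abs(total - u(gamma)))   # zero up to rounding
```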

For any set of indices \(I \subset \mathbb{N}_{0}^{d}\) we now define the combination technique as any method delivering the approximation

$$\displaystyle{ u_{I} =\sum _{\alpha \in I}w(\alpha )\;. }$$
(8)

In practice, the approximation \(u_{I}\) is computed directly from the \(u(\gamma )\). The combination formula is directly obtained from Proposition 1 and one has

Proposition 2

Let \(u_{I} =\sum _{\alpha \in I}w(\alpha )\) where \(w(\alpha )\) is the hierarchical surplus for the approximations \(u(\gamma )\). Then there exist a subset \(I^{{\prime}}\) of the smallest downset which contains the set \(I\) and coefficients \(c_{\gamma } \in \mathbb{Z}\) for \(\gamma \in I^{{\prime}}\) such that

$$\displaystyle{ u_{I} =\sum _{\gamma \in I^{{\prime}}}c_{\gamma }\,u(\gamma )\;. }$$
(9)

Furthermore, one has

$$\displaystyle{ c_{\gamma } =\sum _{\alpha \in C(\gamma )}(-1)^{\vert \gamma -\alpha \vert }\chi _{ I}(\alpha ) }$$
(10)

where \(C(\gamma ) =\{\alpha \mid \gamma \leq \alpha \leq \gamma +1\}\) and where \(\chi _{I}(\alpha )\) is the characteristic function of \(I\).

The proof of this result is a direct application of Proposition 1, see also [22]. For the example \(d = 2\) and \(n = 2\) one gets

$$\displaystyle{ u_{n}^{C} = u(0,2) + u(1,1) + u(2,0) - u(0,1) - u(1,0)\;. }$$
(11)

Note that the coefficients are \(c_{\gamma } = 1\) for the finest grids, \(c_{\gamma } = -1\) for some slightly coarser grids and \(c_{\gamma } = 0\) for all other grids. There are both positive and negative coefficients. Indeed, the results above can also be shown to be a consequence of the inclusion-exclusion principle. One can show that if \(0 \in I\) then \(\sum _{\gamma \in I}c_{\gamma } = 1\).
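
A minimal Python sketch of formula (10) is given below (illustrative only); for the downset \(I\) of the \(d = 2\), \(n = 2\) example, with levels starting at 0, it reproduces exactly the coefficients of Eq. (11).

```python
import itertools

def combination_coefficients(I, d):
    """Combination coefficients via Eq. (10):
    c_gamma = sum_{alpha in C(gamma)} (-1)^{|gamma - alpha|} chi_I(alpha),
    with C(gamma) = {alpha : gamma <= alpha <= gamma + 1}.
    I is assumed to be a downset, so gamma ranges over I itself."""
    I = set(map(tuple, I))
    coeffs = {}
    for gamma in I:
        c = 0
        for offset in itertools.product((0, 1), repeat=d):
            alpha = tuple(g + o for g, o in zip(gamma, offset))
            if alpha in I:
                c += (-1) ** sum(offset)
        if c != 0:
            coeffs[gamma] = c
    return coeffs

# the d = 2, n = 2 example (levels starting at 0): I = {alpha : |alpha| <= n}
n, d = 2, 2
I = [a for a in itertools.product(range(n + 1), repeat=d) if sum(a) <= n]
print(dict(sorted(combination_coefficients(I, d).items())))
# -> {(0, 1): -1, (0, 2): 1, (1, 0): -1, (1, 1): 1, (2, 0): 1}, i.e. Eq. (11)
```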

An implementation of the combination technique will thus compute a linear combination of a potentially large number of component solutions \(u(\gamma )\). This requires two steps: first the independent computation of the components \(u(\gamma )\), and then the reduction to the combination \(u_{I}\). The computations can therefore be carried out on a collection of loosely connected computational clusters. This is a great advantage on HPC systems, as the need for global communication is reduced to the loose coupling of the reduction step.

Many variants of the combination technique are obtained using the technique introduced above. They differ by their choice of the summation set \(I\). The classical combination technique utilizes

$$\displaystyle{ I =\{\alpha \mid \vert \alpha \vert \leq n + d - 1\}\;. }$$
(12)

Many variants are subsets of this set. This includes the truncated sparse grids [3, 4] defined by

$$\displaystyle{ I =\downarrow \{\alpha \mid \vert \alpha \vert \leq n + d - 1,\;\alpha \geq \beta \} }$$
(13)

where \(\downarrow \) is the operator producing the smallest downset containing the operand. Basically the same class is considered in [49] (there called partial sparse grids):

$$\displaystyle{ I =\downarrow \{\alpha \mid \vert \alpha \vert \leq n + \vert \beta \vert - 1,\;\alpha \geq \beta \} }$$
(14)

for some \(\beta \geq 1\). Sparse grids with faults [24] include sets of the form

$$\displaystyle{ I =\{\alpha \mid \vert \alpha \vert \leq n + d - 1,\alpha \neq \beta \} }$$
(15)

for some \(\beta\) with \(\vert \beta \vert = n\). Finally, one may consider the two-scale combination with

$$\displaystyle{ I =\bigcup _{ k=1}^{d}\{\alpha \mid \alpha \leq n_{ 0}1 + n_{k}e_{k}\} }$$
(16)

where \(e_{k}\) is the standard \(k\)-th basis vector in \(\mathbb{R}^{d}\). This has been considered in [3] for the case of \(n_{0} = n_{k} = n\). Another popular choice is

$$\displaystyle{ I =\{\alpha \mid \vert \mathop{\mathrm{supp}}\nolimits \alpha \vert \leq k\}\;. }$$
(17)

This corresponds to a truncated ANOVA-type decomposition. An alternative ANOVA decomposition is obtained by choosing \(\beta ^{(k)}\) with \(\vert \mathop{\mathrm{supp}}\nolimits \beta ^{(k)}\vert = k\) and setting

$$\displaystyle{ I =\bigcup _{ k=1}^{d}\{\alpha \mid \alpha \leq \beta ^{(k)}\}\;. }$$
(18)

The sets \(I\) are usually downsets, i.e., such that \(\beta \in I\) if there exists an \(\alpha \in I\) such that \(\beta \leq \alpha\). Note that any non-empty downset \(I\) in particular contains the zero vector. The corresponding vector space \(V (0)\) typically contains the set of constant functions.
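
As a small illustration of these constructions (a sketch, not code from the projects), the following Python fragment generates the classical set (12) and the truncated set (13), using a helper for the downset operator \(\downarrow\).

```python
import itertools

def downset(S, d):
    """Smallest downset containing S: all alpha in N_0^d with alpha <= beta for some beta in S."""
    closure = set()
    for beta in S:
        closure.update(itertools.product(*[range(b + 1) for b in beta]))
    return closure

def classical_set(n, d):
    """Classical combination set, Eq. (12): {alpha in N_0^d : |alpha| <= n + d - 1}."""
    return {a for a in itertools.product(range(n + d), repeat=d) if sum(a) <= n + d - 1}

def truncated_set(n, d, beta):
    """Truncated sparse grid set, Eq. (13): downset of {|alpha| <= n + d - 1, alpha >= beta}."""
    S = {a for a in classical_set(n, d)
         if all(ai >= bi for ai, bi in zip(a, beta))}
    return downset(S, d)

print(sorted(classical_set(2, 2)))
print(sorted(truncated_set(2, 2, (1, 1))))
```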

We will now consider errors. First we reconsider the error of the \(u(\gamma )\). In terms of the surpluses, one has from the surplus decomposition of \(u(\gamma )\) that

$$\displaystyle{ e(\gamma ) = u - u(\gamma ) =\sum _{\alpha \not\leq \gamma }w(\alpha )\;. }$$
(19)

Let \(I_{s}(\gamma ) =\{\alpha \mid \alpha _{s} >\gamma _{s}\}.\) Then one has

$$\displaystyle{ \{\alpha \not\leq \gamma \} =\bigcup _{ s=1}^{d}I_{ s}(\gamma ) }$$
(20)

as any \(\alpha\) which is not less than or equal to \(\gamma\) has at least one component \(\alpha _{s} >\gamma _{s}\). We now define

$$\displaystyle{ I(\gamma;\sigma ) =\bigcap _{s\in \sigma }I_{s}(\gamma ) }$$
(21)

for any non-empty subset \(\sigma \subseteq \{ 1,\ldots,d\}\). A direct application of the inclusion-exclusion principle then leads to the error splitting

$$\displaystyle{ e(\gamma ) =\sum _{\emptyset \neq \sigma \subseteq \{1,\ldots,d\}}(-1)^{\vert \sigma \vert -1}z(\gamma,\sigma ) }$$
(22)

where

$$\displaystyle{ z(\gamma,\sigma ) =\sum _{\alpha \in I(\gamma;\sigma )}w(\alpha )\;. }$$
(23)

This is an ANOVA decomposition of the approximation error of \(u(\gamma )\). From this one gets the result

Proposition 3

Let \(u_{I} =\sum _{\gamma \in I^{{\prime}}}c_{\gamma }\,u(\gamma )\) and the combination coefficients \(c_{\gamma }\) be such that \(\sum _{\gamma \in I^{{\prime}}}c_{\gamma } = 1\). Then

$$\displaystyle{ u - u_{I} =\sum _{\emptyset \neq \sigma \subseteq \{1,\ldots,d\}}(-1)^{\vert \sigma \vert -1}\sum _{ \gamma \in I^{{\prime}}}c_{\gamma }\,z(\gamma,\sigma )\;. }$$
(24)

Proof

This follows from the discussion above and because \(0 \in I\) one has

$$\displaystyle{ u - u_{I} =\sum _{\gamma \in I^{{\prime}}}c_{\gamma }\,e(\gamma )\;. }$$
(25)

 □ 

An important point to note here is that this error formula holds for any coefficients \(c_{\gamma }\), not just the ones defined by the general combination technique. This suggests a different way to choose the combination coefficients such that the resulting error is small. We will further discuss such choices in Sect. 4. Note that for the general combination technique the coefficients are uniquely determined by the set \(I\). In this case one has a complete description of the error using the hierarchical surplus

$$\displaystyle{ e_{I} =\sum _{\alpha \not\in I}w(\alpha )\;. }$$
(26)

In summary, we have now two strategies to design a combination approximation: one may choose either

  • the set \(I\) which contains all the \(w(\alpha )\) which are larger than some threshold

  • or the combination coefficients such that the sums \(\sum _{\gamma \in I^{{\prime}}}c_{\gamma }\,z(\gamma,\sigma )\) are small.

One approach is to select the \(w(\alpha )\) adaptively, based on their size so that

$$\displaystyle{ I =\{\alpha \mid \Vert w(\alpha )\Vert \geq \epsilon \}\;. }$$
(27)

Such an approach is sometimes called dimension adaptive to distinguish it from the spatially adaptive approach where grids are refined locally. One may be interested in finding an approximation for some \(u(\gamma )\), for example, for \(\gamma = (n,\ldots,n)\). In this case one considers

$$\displaystyle{ I =\{\alpha \leq \gamma \mid \Vert w(\alpha )\Vert \geq \epsilon \} }$$
(28)

and one has the following error bound:

Proposition 4

Let \(I =\{\alpha \leq \gamma \mid \Vert w(\alpha )\Vert \geq \epsilon \}\) and \(u(\gamma ) - u_{I}\) be the error of the combination approximation based on the set \(I\) relative to \(u(\gamma )\). Then one has the bound

$$\displaystyle{ \Vert u(\gamma ) - u_{I}\Vert \leq \prod _{i=1}^{d}(\gamma _{ i} + 1)\,\epsilon \;. }$$
(29)

The result is a simple application of the triangle inequality and the fact that

$$\displaystyle{ \vert I\vert =\prod _{ i=1}^{d}(\gamma _{ i} + 1)\;. }$$
(30)

In particular, one has if all \(\gamma _{i} = n\):

$$\displaystyle{ \Vert u(\gamma ) - u_{I}\Vert \leq (n + 1)^{d}\epsilon \;. }$$
(31)

While this bound is very simple, it is asymptotically (in \(n\) and \(d\)) tight due to the concentration of measure. Note also that a similar bound is not available for the spatially adaptive method. An important point is that this error bound always holds, independently of how well the surpluses approximate the exact result. For \(\gamma = (n,\ldots,n)\) one can combine the estimate of Proposition 4 with a bound on \(u - u(\gamma )\) to obtain

$$\displaystyle{ \Vert u - u_{I}\Vert \leq \Vert u - u(\gamma )\Vert +\Vert u(\gamma ) - u_{I}\Vert \leq K\,4^{-n} + (n + 1)^{d}\epsilon \;. }$$
(32)

One can then choose \(n\) which minimizes this for a given \(\epsilon\) by balancing the two terms. Conversely, for a given \(n\) the corresponding \(\epsilon\) is given by \(\epsilon _{n} = (n + 1)^{-d}K4^{-n}\). In Fig. 1 we plot \(\epsilon _{n}/K\) against \(\Vert u - u_{I}\Vert /K\) for several different \(d\) to demonstrate how the error changes with the threshold.
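
The following short Python sketch evaluates this balancing for an assumed constant \(K = 1\) and dimension \(d = 5\) (both values are only illustrative): for each \(n\) it computes the threshold \(\epsilon _{n} = (n + 1)^{-d}K4^{-n}\) and the resulting bound (32).

```python
# balancing the two terms of the error bound (32); K and d are assumed values
K, d = 1.0, 5
for n in range(1, 11):
    eps_n = (n + 1) ** (-d) * K * 4.0 ** (-n)        # threshold that balances the two terms
    bound = K * 4.0 ** (-n) + (n + 1) ** d * eps_n   # right-hand side of Eq. (32) = 2 K 4^{-n}
    print(f"n={n:2d}  eps_n={eps_n:.3e}  bound={bound:.3e}")
```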

Fig. 1: Scaled error against threshold

While the combination approximation is the sum of the surpluses \(w(\alpha )\) over all \(\alpha \in I\), the result only depends on a small number of \(u(\gamma )\) close to the maximal elements of \(I\). In particular, any errors in the values \(u(\alpha )\) for small \(\alpha\) have no effect on approximations based on larger \(\alpha\). Thus, when doing an adaptive approximation, the earlier errors are forgotten.

Finally, if one has a model for the hierarchical surplus, for example, if it is of the form

$$\displaystyle{ \Vert w(\alpha )\Vert \leq 4^{-\vert \alpha \vert }y(\alpha ) }$$
(33)

for some bounded \(y(\alpha )\) then one can get specific error bounds for the combination technique, in particular the well-known bounds for the classical sparse grid technique. In this case one gets \(\Vert w(\alpha )\Vert \leq K4^{-\vert \alpha \vert }\) if one chooses \(\vert \alpha \vert \geq n\) as for the classical combination technique. One can show that the terms in the error formula for the components \(u(\gamma )\) satisfy

$$\displaystyle{ \Vert z(\gamma,\sigma )\Vert \leq \left (\frac{4} {3}\right )^{d}4^{-\sum _{s=1}^{\vert \sigma \vert }(\gamma _{\sigma _{ s}}+1)}K\;. }$$
(34)

3 Algorithms and Data Structures

In this section we consider the parallel implementation of the combination technique for partial differential equation solvers. For large-scale simulations, for example the target of the second phase of the EXAHD project, even a single component grid (together with the data structures to solve the underlying PDE on it) will no longer fit into the memory of a single node. Furthermore, the storage of a full grid representation of a sparse grid will exceed the predicted RAM of a whole exascale machine. In addition, the communication overhead across a whole HPC system's network cannot be neglected. In this section we will assume that the component grids \(u(\gamma )\) are implemented as distributed regular grids. In a first stage we consider the case where the combined solution \(u_{I}\) is also a distributed regular grid. Later we will then discuss distributed sparse grid data structures.

The combination technique is a reduction operation combining the components according to Eq. (1). This reduction is based on the update \(u^{{\prime}}\leftarrow u^{{\prime}} + c\,u\), which adds a component \(u\) (we omit the parameter \(\gamma\) for simplicity) to the resulting combination \(u^{{\prime}}\) (or \(u_{I}\)). Assume that \(u\) and \(u^{{\prime}}\) are distributed over \(P\) and \(P^{{\prime}}\) processors, respectively.

In the direct SGCT algorithm, each component process sends all its points of \(u\) to the respective combination process; this is denoted as the gather stage. In a second stage, each combination process first interpolates the gathered points to the combination grid \(u^{{\prime}}\) before adding them. In a third stage, the scatter stage, the data on each combination process is sampled and the samples are sent to the corresponding component processes, see Fig. 2.
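
A one-dimensional toy version of these stages (purely illustrative, assuming nested dyadic grids with \(2^{\gamma }+1\) points including the boundaries; the actual implementation works with distributed multi-dimensional grids) can be sketched in Python as follows: the gather stage interpolates a component onto the finer combination grid, the scatter stage samples the combined grid at the component grid points.

```python
import numpy as np

def grid(level):
    """1D dyadic grid of level gamma: 2**gamma + 1 equidistant points in [0, 1]."""
    return np.linspace(0.0, 1.0, 2 ** level + 1)

def gather(u_coarse, level_coarse, level_fine):
    """Gather step: interpolate a component solution onto the finer combination grid."""
    return np.interp(grid(level_fine), grid(level_coarse), u_coarse)

def scatter(u_fine, level_fine, level_coarse):
    """Scatter step: sample the combined solution at the component grid points.
    For nested dyadic grids this is plain injection with a stride."""
    return u_fine[::2 ** (level_fine - level_coarse)]

# toy 1D round trip (the combination coefficients here are purely illustrative)
u2, u3 = np.sin(np.pi * grid(2)), np.sin(np.pi * grid(3))
combined = 0.5 * gather(u2, 2, 4) + 0.5 * gather(u3, 3, 4)
print(scatter(combined, 4, 3).shape, scatter(combined, 4, 2).shape)   # (9,), (5,)
```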

Fig. 2: Gather and scatter steps

In the direct SGCT algorithm, the components and the combination are represented by the function values on the grid points, i.e., the coefficients of the nodal basis. We have also considered a hierarchical SGCT algorithm which is based on the coefficients of the hierarchical basis, i.e., a hierarchical surplus representation. When the direct SGCT algorithm is applied to these hierarchical surpluses there is no need for interpolation, and the sizes of the corresponding surplus vectors are exactly the same for both the components and the combination. However, for performance, it is necessary to coalesce the combination of surpluses as described in [49]. As the largest surpluses only occur for one component, they do not need to be communicated. Despite the savings in the hierarchical algorithm, we found that the direct algorithm is always faster than the hierarchical one, and it scales better with \(n\), \(d\) and the number of processes (cores). This does, however, require that the representation of the combined grid \(u'\) is sparse, as described below. We also found that the formation of the hierarchical surpluses (and its inverse) took a relatively small amount of time, and concluded that, even when the data is originally stored in hierarchical form, it is faster to dehierarchize it, apply the direct algorithm and hierarchize it again [49].

New adapted algorithms and implementations have been developed with optimal communication overhead, see Fig. 3 (left) and the corresponding paper in these proceedings [27]. The gather–scatter steps described above have to be invoked multiple times for the solution of time-dependent PDEs. (We found that for eigenvalue problems it is often sufficient to call the gather–scatter only once, see Sect. 5.) In any case, the gather–scatter step is the only remaining global communication of the combination technique and thus has to be examined carefully. In previous work [31] we have therefore analyzed communication schemes required for the combination step in the framework of BSP models and developed new algorithmic variants with communication that is optimal up to constant factors. This way, the maximal communicated volume, which determines the makespan, can be drastically reduced at the cost of a slightly increased number of messages that have to be sent.

Fig. 3: Distributed hierarchical combination with optimal communication overhead (left) and run-time results on Hazel Hen (right) for different sizes of process groups (local parallelism with nprocs processors). The results measure only the communication (local and global) and the distributed hierarchization, not the computation. The saturation for large numbers of component grids is due to the logarithmic scaling of the global reduce step for large numbers of process groups and up to 180,224 processors in total. For comparison, the time for a single time step with GENE for a process group size of 4096 is shown; see [27] in these proceedings for further details

A distributed sparse grid data structure is described in [49]. The index set \(I\) for this case is a variant of a truncated sparse grid set, see Eq. (14). Recall that the sparse grid points are obtained by taking the union of all the component grid points. As the number of sparse grid points is much smaller than the number of full grid points, it makes sense to compute the combination only at the sparse grid points. A sparse grid data structure has been developed which is similar to the CSR data structure used for sparse matrices. In this case one stores both the value of \(u\) at a grid point and the location of the grid point. Due to the regularity of the sparse grid this can be done efficiently.
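
The following Python sketch (a toy container, not the distributed implementation of [49]) illustrates the idea of storing sorted grid point keys together with a value array, analogous to CSR storage, and accumulating weighted components at the sparse grid points only; all point indices and values are illustrative.

```python
import numpy as np

class SparseGridStorage:
    """Toy sparse grid container: sorted point keys plus a value array,
    loosely analogous to CSR storage (keys ~ column indices, values ~ entries)."""

    def __init__(self, points):
        # points: iterable of d-dimensional multi-indices identifying grid points
        self.keys = sorted(map(tuple, points))
        self.pos = {k: i for i, k in enumerate(self.keys)}      # point -> position in value array
        self.values = np.zeros(len(self.keys), dtype=complex)   # GENE data are complex

    def axpy(self, c, component_points, component_values):
        """Add c * u(gamma), given as (points, values), at the sparse grid points."""
        for p, v in zip(map(tuple, component_points), component_values):
            i = self.pos.get(p)
            if i is not None:
                self.values[i] += c * v

# usage: accumulate two components on the union of their points
pts_a = [(0, 0), (1, 0), (2, 0)]
pts_b = [(0, 0), (0, 1), (0, 2)]
sg = SparseGridStorage(set(pts_a) | set(pts_b))
sg.axpy(+1.0, pts_a, [1, 2, 3])
sg.axpy(-1.0, pts_b, [1, 1, 1])
print(dict(zip(sg.keys, sg.values)))
```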

With optimal communication, distributed data structures and corresponding algorithms, excellent scaling can be obtained for large numbers of process groups, as shown in Fig. 3 (right) for Hazel Hen; the timings include the local algorithmic work to hierarchize, local communication and global communication. See the corresponding paper in these proceedings [27].

4 Modified Combination Coefficients

Here we consider approximations which are based on a vector \((u(\gamma ))_{\gamma \in I}\) of numerical results. It has been seen, however, that the standard way to choose the combination coefficients is not optimal and may lead to large errors. In fact, one may interpret the truncated combination technique as a variant where some of the coefficients have been chosen to be zero and the rest adapted. In the following we provide a more radical approach to choosing the coefficients \(c_{\gamma }\). An advantage of this approach is that it does not depend so much on properties of the index set \(I\); in fact, this set does not even need to be a downset.

A first method was considered in [30, 52] for convex optimization problems. Here, let the component approximations be

$$\displaystyle{ u(\gamma ) =\mathop{ \mathrm{argmin}}\nolimits \{J(v)\mid v \in V (\gamma )\}\;. }$$
(35)

Then the Opticom method, a Ritz approximation over the span of the given \(u(\gamma )\), computes

$$\displaystyle{ u^{O} =\mathop{ \mathrm{argmin}}\nolimits \left \{J(v)\mid v =\sum _{\gamma \in I}c_{\gamma }\,u(\gamma )\right \}\;. }$$
(36)

Computationally, the Opticom method consists of the minimization of a convex function \(\varPhi (c)\) of \(\vert I\vert \) variables of the form

$$\displaystyle{ \varPhi (c) = J\left (\sum _{\gamma \in I}c_{\gamma }\,u(\gamma )\right ) }$$
(37)

to get the combination coefficients. Once they have been determined, the approximation \(u^{O}\) is then computed as in Sects. 2 and 3. By design, one has \(J(u^{O}) \leq J(u(\gamma ))\) for all \(\gamma \in I\). If \(I\) gives rise to a combination approximation \(u^{C}\) then one also has \(J(u^{O}) \leq J(u^{C})\). A whole family of other convex functions \(\varPhi (c)\) for the combination coefficients was considered in [30]. Using properties of the Bregman divergence, one can derive error bounds and quasi-optimality criteria for the Opticom method, see [52].
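
For a quadratic functional \(J(v) = \frac{1}{2}\langle Av,v\rangle -\langle b,v\rangle\) the minimization of \(\varPhi (c)\) reduces to a small Galerkin system for the coefficients. The following Python sketch demonstrates this special case with a synthetic operator and synthetic component solutions (all data are illustrative and not related to the applications above).

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50                                   # dimension of the ambient space V (toy)
A = np.diag(np.arange(1.0, m + 1))       # SPD operator of the quadratic functional J
b = rng.standard_normal(m)

def J(v):
    return 0.5 * v @ A @ v - b @ v

def component(k):
    """Synthetic u(gamma): Galerkin minimizer of J over a random k-dimensional subspace."""
    B = rng.standard_normal((m, k))
    coeff = np.linalg.solve(B.T @ A @ B, B.T @ b)
    return B @ coeff

U = np.column_stack([component(k) for k in (5, 8, 12)])   # columns are the u(gamma)

# Opticom: minimize Phi(c) = J(U c)  =>  (U^T A U) c = U^T b
c = np.linalg.solve(U.T @ A @ U, U.T @ b)
u_opt = U @ c

print("J of components:", [J(U[:, j]) for j in range(U.shape[1])])
print("J of Opticom combination:", J(u_opt))   # never larger than any component value
```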

A similar approach was suggested for the determination of combination coefficients for faulty sets \(I\). Let \(I\) be any set and \(I^{{\prime}}\) be the smallest downset which contains \(I\). Then let the \(w(\alpha )\) be the surpluses computed from the set of all \(u(\gamma )\) for \(\gamma \in I\) and \(\alpha \in I^{{\prime}}\). Finally, let the regular combination technique be defined as

$$\displaystyle{ u^{R} =\sum _{\alpha \in I^{{\prime}}}w(\alpha ) }$$
(38)

and let for any \(c_{\gamma }\) a combination technique be

$$\displaystyle{ u^{C} =\sum _{\gamma \in I}c_{\gamma }u(\gamma )\;. }$$
(39)

Then the difference between the new combination technique and the regular combination technique is

$$\displaystyle{ \begin{array}{ll} u^{C} - u^{R}& =\sum _{\gamma \in I}c_{\gamma }\,u(\gamma ) -\sum _{\alpha \in I^{{\prime}}}w(\alpha ) \\ & =\sum _{\gamma \in I}c_{\gamma }\sum _{\alpha \leq \gamma }w(\alpha ) -\sum _{\alpha \in I^{{\prime}}}w(\alpha ) \\ & =\sum _{\alpha \in I^{{\prime}}}w(\alpha )\left (\sum _{\gamma \in I(\alpha )}c_{\gamma } - 1\right )\\ \end{array} }$$
(40)

where \(I(\alpha ) =\{\gamma \in I\mid \gamma \geq \alpha \}\). Using the triangle inequality one obtains

$$\displaystyle{ \Vert u^{C} - u^{R}\Vert \leq \varPhi (c) }$$
(41)

with

$$\displaystyle{ \varPhi (c) =\sum _{\alpha \in I^{{\prime}}}\theta (\alpha )\left \vert \sum _{\gamma \in I(\alpha )}c_{\gamma } - 1\right \vert \;, }$$
(42)

where \(\theta\) is such that \(\Vert w(\alpha )\Vert \leq \theta (\alpha )\). Minimizing \(\varPhi (c)\) thus seems to lead to a good choice of combination coefficients, and this is confirmed by experiments [22]. The resulting combination technique forms the basis for a new fault-tolerant approach which has been discussed in [24].
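
Since \(\varPhi (c)\) is a weighted sum of absolute values, its minimization can, for example, be written as a small linear program. The following Python sketch (illustrative only; the index set, the fault and the surplus bound \(\theta (\alpha ) = 4^{-\vert \alpha \vert }\) are assumptions, and this is not the algorithm of [24]) computes coefficients for a small downset with one lost component grid.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

d, n = 2, 3
full_I = [a for a in itertools.product(range(n + 1), repeat=d) if sum(a) <= n]
fault = (2, 1)                      # a lost ("faulty") component grid, cf. Eq. (15)
I = [a for a in full_I if a != fault]
I_prime = full_I                    # smallest downset containing I

theta = np.array([4.0 ** (-sum(alpha)) for alpha in I_prime])   # assumed ||w(alpha)|| <= theta(alpha)
M = np.array([[1.0 if all(g >= a for g, a in zip(gamma, alpha)) else 0.0
               for gamma in I] for alpha in I_prime])           # M[alpha, gamma] = [gamma >= alpha]

# minimize Phi(c) = sum_alpha theta(alpha) * |(M c)_alpha - 1| as a linear program:
# variables x = (c, t), minimize theta^T t subject to -t <= M c - 1 <= t
nI, nA = len(I), len(I_prime)
obj = np.concatenate([np.zeros(nI), theta])
A_ub = np.block([[ M, -np.eye(nA)],
                 [-M, -np.eye(nA)]])
b_ub = np.concatenate([np.ones(nA), -np.ones(nA)])
res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * nI + [(0, None)] * nA, method="highs")
print(dict(zip(I, np.round(res.x[:nI], 3))), " Phi =", round(res.fun, 6))
```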

5 Computing Eigenvalues and Eigenvectors

Here we consider the eigenvalue problem in \(V\) where one would like to compute complex eigenvalues \(\lambda\) and the corresponding eigenvectors \(u\) such that

$$\displaystyle{ L\,u =\lambda u\;, }$$
(43)

where \(L\) is a given linear operator defined on \(V\). We assume we have a code which computes approximations \(\lambda (\gamma ) \in \mathbb{C}\) and \(u_{\lambda }(\gamma ) \in V (\gamma )\) of the eigenvalues \(\lambda\) and the corresponding eigenvectors \(u\). We have chosen to discuss the eigenvalue problem separately as it exhibits particular challenges which do not appear for initial and boundary value problems.

Consider now the determination of the eigenvalues \(\lambda\). Note that one typically has a large number of eigenvalues for any given operator \(L\). First one needs to decide which eigenvalue to compute. For example, if one is interested in the stability of a system, one would like to determine the eigenvalue with the largest real part. It is possible to use the general combination technique; however, one needs to make sure that the (non-zero) combination coefficients \(c_{\gamma }\) used are such that the eigenvectors of \(L(\gamma )\) contain approximations of the eigenvector \(u\) which is of interest. Computing the surpluses \(\nu (\alpha )\) of the eigenvalues \(\lambda (\gamma )\) and including all those which satisfy \(\vert \nu (\alpha )\vert \geq \epsilon\) for some \(\epsilon\) is a good way to make sure that we obtain a good result. Furthermore, the error bound given in Sect. 2 does hold here. As any surplus \(\nu (\alpha )\) only depends on the values \(\lambda (\gamma )\) for \(\gamma\) close to \(\alpha\), any earlier \(\lambda (\gamma )\) with a large error will not influence the final result. Practical computations confirmed the effectiveness of this approach, see [34]. If one knows which spaces \(V (\gamma )\) produce reasonable approximations for the eigenvector corresponding to some eigenvalue \(\lambda\) then one can define a set \(I(\lambda )\) containing only those \(\gamma\). Combinations over \(I(\lambda )\) will then provide good approximations of \(\lambda\). (However, as stated above, the combination technique is asymptotically stable against wrong or non-existing eigenvectors.)

Computing the eigenvectors faces the same problems as computing the eigenvalues. In addition, however, one has an extra challenge as the eigenvectors are only determined up to some complex factor. In particular, if one uses the eigenvectors \(u(\gamma )\) to compute the surplus functions \(w(\alpha )\), one may get severely wrong results. One way to deal with this is to first normalize the eigenvectors. For this one needs a functional \(s \in V ^{{\ast}}\). One then replaces the \(u(\gamma )\) by \(u(\gamma )/\langle s,u(\gamma )\rangle\) when computing the surplus, i.e., one solves the surplus equations

$$\displaystyle{ \sum _{\alpha \leq \gamma }w(\alpha ) = \frac{u(\gamma )} {\langle s,u(\gamma )\rangle } }$$
(44)

and computes the combination approximation as

$$\displaystyle{ u_{I} =\sum _{\gamma \in I}c_{\gamma }\, \frac{u(\gamma )} {\langle s,u(\gamma )\rangle }\;. }$$
(45)

In practice, this did give good results, and it appears reasonable that bounds on the surpluses computed in this way provide a foundation for the error analysis. In any case, the error bound of the adaptive method holds. Actually, this bound even holds when the eigenvectors are not normalized. The advantage of the normalization is that the number of surpluses which need to be included is much smaller, i.e., it is a computational advantage. Practical experiments also confirmed this. It remains to be shown that error splitting assumptions are typically invariant under the scaling described above.
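
The following Python sketch illustrates the normalization step of Eq. (45) with synthetic eigenvector approximations that carry random complex phases (all data, the functional \(s\) and the combination coefficients are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
m = 40
s = rng.standard_normal(m)                     # normalization functional s

u_exact = rng.standard_normal(m)
def component(level):
    """Synthetic eigenvector approximation: random complex phase times (exact vector + noise)."""
    phase = np.exp(2j * np.pi * rng.random())  # eigenvectors are only defined up to a complex factor
    return phase * (u_exact + 2.0 ** (-level) * rng.standard_normal(m))

coeffs = {(0, 2): 1, (1, 1): 1, (2, 0): 1, (0, 1): -1, (1, 0): -1}   # coefficients of Eq. (11)

u_I = np.zeros(m, dtype=complex)
for gamma, c in coeffs.items():
    u = component(sum(gamma))
    u_I += c * u / (s @ u)                     # Eq. (45): normalize by <s, u(gamma)> before combining
print(np.linalg.norm(u_I / (s @ u_I) - u_exact / (s @ u_exact)))
```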

5.1 An Opticom Approach for Solving the Eigenvalue Problem

An approach to solving the eigenvalue problem which does not require scaling has been proposed and investigated by Kowitz and collaborators [34, 36]. The approach is based on a minimization problem which determines combination coefficients in a similar manner as the Opticom method in Sect. 4. It is assumed that \(I\) is given and that the \(u(\gamma )\) for \(\gamma \in I\) have been computed and solve \(L(\gamma )u(\gamma ) =\lambda (\gamma )u(\gamma )\). Let the matrix \(G = \left [u(\gamma )\right ]_{\gamma \in I}\) and the vector \(c = [c_{\gamma }]_{\gamma \in I}^{T}\); then the combination approximation for the eigenvector can be written as the matrix-vector product

$$\displaystyle{ Gc =\sum _{\gamma \in I}c_{\gamma }\,u(\gamma )\;. }$$
(46)

This eigenvalue problem can be solved by computing

$$\displaystyle{ (c,\lambda ) =\mathop{ \mathrm{argmin}}\nolimits _{c,\lambda }\Vert LGc -\lambda Gc\Vert }$$
(47)

with the normal equations

$$\displaystyle{ (LG -\lambda G)^{{\ast}}(LG -\lambda G)c = 0 }$$
(48)

for the solution \(c\). Osborne et al. [33, 42] solved this by considering the problem

$$\displaystyle{ \left (\begin{array}{*{10}c} K(\lambda )& t\\ s^{{\ast} } &0 \end{array} \right )\left (\begin{array}{*{10}c} c\\ \beta \end{array} \right ) = \left (\begin{array}{*{10}c} 0\\ 1 \end{array} \right ) }$$
(49)

with \(K(\lambda ) = (LG -\lambda G)^{{\ast}}(LG -\lambda G)\). Here \(\lambda\) is a parameter. One obtains the solution

$$\displaystyle{ \beta (\lambda ) = -\langle s^{{\ast}},\,K(\lambda )^{-1}t\rangle ^{-1} }$$
(50)

for which one then uses Newton’s method to solve \(\beta (\lambda ) = 0\) with respect to \(\lambda\). With \(\beta (\lambda ) = 0\) it follows that \(K(\lambda )c = 0\) and \(\langle s^{{\ast}},c\rangle = 1\). Thus one obtains a normalized solution of the nonlinear eigenvalue problem (i.e., where \(\lambda\) occurs in a nonlinear way in \(K(\lambda )\)).
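
A dense toy version of this procedure (illustrative only: a small random symmetric operator instead of a GENE discretization, a finite-difference derivative inside the Newton iteration, and a set-up chosen so that an exact eigenvector lies in the span of \(G\)) can be sketched in Python as follows.

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 30, 4
A = rng.standard_normal((m, m))
L = (A + A.T) / 2                              # symmetric toy operator: real eigenvalues

# build G so that one exact eigenvector of L lies in its span; then the least-squares
# problem (47) has a zero-residual solution and beta(lambda) = 0 at that eigenvalue
evals, evecs = np.linalg.eigh(L)
j = np.argmin(np.abs(evals))                   # target eigenvalue (smallest magnitude)
G = np.column_stack([evecs[:, j], rng.standard_normal((m, k - 1))])
s = rng.standard_normal(k)                     # normalization vector (acting on the coefficients c)
t = rng.standard_normal(k)

def beta(lam):
    """beta(lambda) = -<s, K(lambda)^{-1} t>^{-1}, K(lambda) = (LG - lam G)^T (LG - lam G), Eq. (50)."""
    R = L @ G - lam * G
    return -1.0 / (s @ np.linalg.solve(R.T @ R, t))

# Newton iteration on beta(lambda) = 0 with a finite-difference derivative
lam, h = evals[j] + 0.1, 1e-7
for _ in range(50):
    step = beta(lam) * h / (beta(lam + h) - beta(lam))
    lam -= step
    if abs(step) < 1e-6:
        break

# recover normalized coefficients from the bordered system (49): K(lam) c = 0, <s, c> = 1
R = L @ G - lam * G
M = np.block([[R.T @ R, t[:, None]], [s[None, :], np.zeros((1, 1))]])
c = np.linalg.solve(M, np.concatenate([np.zeros(k), [1.0]]))[:k]
u = G @ c
print("eigenvalue error:", abs(lam - evals[j]),
      " relative residual:", np.linalg.norm(L @ u - lam * u) / np.linalg.norm(u))
```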

Another approach for obtaining the least squares solution is its interpretation as an overdetermined eigenvalue problem. Das et al. [6] developed an algorithm based on the QZ decomposition which allows the computation of the eigenvalue and the eigenvector in \(\mathcal{O}(mn)\) complexity, where \(n = \vert I\vert \) and \(m = \vert V \vert \).

The approaches have both been investigated for a simple test problem (see left of Fig. 4) and for large eigenvalue computations with GENE (see right of Fig. 4). The combination approximations (though computed serially here) can usually be obtained faster than the full grid approximations. Note that the run-times here have been obtained with a prototypical implementation before the development of the scalable algorithms described in Sect. 3. For large problems, the combination approximation can be expected to be significantly faster, as the combination technique exhibits a better parallel scalability than the full grid solution. For further details, see [34, 36].

Fig. 4: The convergence of the Newton iteration towards the root of \(\beta (\lambda )\) for the simple test problem (left) and the time \(t^{(c)}\) for obtaining the combination approximation compared to the time \(t^{(\mathrm{ref})}\) to compute an eigenpair on a full grid of similar accuracy for linear GENE computations (right)

5.2 Iterative Refinement and Iterative Methods

Besides the adaptation of the combination coefficients, the combination technique for eigenvalue problems can also be improved by refining the \(u(\gamma )\) iteratively. Based on the iterative refinement procedure introduced by Wilkinson [50], the approximate eigenvalue \(\lambda _{I}\) and the corresponding approximate eigenvector \(u_{I}\) can be improved towards \(\lambda\) and \(u\) with corrections \(\varDelta \lambda\) and \(\varDelta u\) by

$$\displaystyle{ u = u_{I} +\varDelta u\qquad \lambda =\lambda _{I} +\varDelta \lambda \;. }$$
(51)

Putting this into \(0 = Lu -\lambda u\), the corrections can be obtained by solving

$$\displaystyle{ 0 = Lu_{I} -\lambda _{I}u_{I} -\varDelta \lambda u_{I} + L\varDelta u -\lambda _{I}\varDelta u, }$$
(52)

where the quadratic term \(\varDelta \lambda \varDelta u\) is neglected. This system is underdetermined. An additional scaling condition \(\langle s^{{\ast}},\varDelta u\rangle = 0\) with \(s \in V\) ensures that the correction \(\varDelta u\) does not change the magnitude of \(u_{I}\). Solving the linear system

$$\displaystyle{ \left (\begin{array}{*{10}c} L -\lambda _{I}I &u_{I}\\ s^{{\ast} } & 0 \end{array} \right )\left (\begin{array}{*{10}c} \varDelta u\\ \varDelta \lambda \end{array} \right ) = \left (\begin{array}{*{10}c} \lambda _{I}u_{I} - Lu_{I}\\ 0 \end{array} \right )\;, }$$
(53)

we obtain the corrections \(\varDelta \lambda\) and \(\varDelta u\). The linear operator \(L\) has a large rank and its inversion is generally infeasible in high-dimensional settings. Nevertheless, computing a single matrix-vector product \(Lu_{I}\) is feasible, so that the right-hand side is easily computed. In the framework of the combination technique the corrections \(\varDelta u\) and \(\varDelta \lambda\) are computed on each subspace \(V (\gamma )\). Therefore, the residual \(r = Lu_{I} -\lambda _{I}u_{I}\) and the initial combination approximation \(u_{I}\) are projected onto \(V (\gamma )\) using suitable prolongation operators [18]. The corrections \(\varDelta u(\gamma )\) and \(\varDelta \lambda (\gamma )\) are computed on each subspace \(V (\gamma )\) by solving

$$\displaystyle{ \left (\begin{array}{*{10}c} L(\gamma ) -\lambda _{I}I &u_{I}(\gamma ) \\ s^{{\ast}}(\gamma ) & 0 \end{array} \right )\left (\begin{array}{*{10}c} \varDelta u(\gamma ) \\ \varDelta \lambda (\gamma ) \end{array} \right ) = \left (\begin{array}{*{10}c} -r(\gamma ) \\ 0 \end{array} \right )\;. }$$
(54)

Here, the significantly smaller rank of \(L(\gamma )\) allows the solution of the linear system with feasible effort. The corrections from each subspace \(V (\gamma )\) are then combined using the standard combination coefficients \(c_{\gamma }\) by

$$\displaystyle{ \varDelta u_{I} =\sum _{\gamma \in I}c_{\gamma }\varDelta u_{I}(\gamma )\qquad \varDelta \lambda _{I} =\sum _{\gamma \in I}c_{\gamma }\varDelta \lambda _{I}(\gamma )\;. }$$
(55)

After adding the corrections to \(u_{I}\) and \(\lambda _{I}\), the process can be repeated until \(\varDelta \lambda _{I}\) and \(\varDelta u_{I}\) become negligible.
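
A dense toy sketch of one such refinement loop (illustrative only; it applies Eq. (53) directly to a small symmetric matrix and omits the projection onto the subspaces \(V(\gamma )\) and the combination of the corrections (54)–(55)) is given below.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 60
A = rng.standard_normal((m, m))
L = (A + A.T) / 2                              # symmetric toy operator, not GENE

# exact eigenpair and a perturbed starting guess (u_I, lam_I)
evals, evecs = np.linalg.eigh(L)
u_true, lam_true = evecs[:, 0], evals[0]
u_I = u_true + 1e-3 * rng.standard_normal(m)
lam_I = lam_true + 1e-3
s = u_I.copy()                                 # scaling functional: <s, du> = 0 preserves the magnitude of u_I

for it in range(4):                            # a few Wilkinson refinement steps, Eq. (53)
    r = L @ u_I - lam_I * u_I
    M = np.block([[L - lam_I * np.eye(m), u_I[:, None]],
                  [s[None, :],            np.zeros((1, 1))]])
    sol = np.linalg.solve(M, np.concatenate([-r, [0.0]]))
    du, dlam = sol[:m], sol[m]
    u_I, lam_I = u_I + du, lam_I + dlam
    print(it, abs(lam_I - lam_true), np.linalg.norm(du))
```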

Instead of using the standard combination coefficients \(c_{\gamma }\), we can also adapt the combination coefficients in order to minimize the residual \(r\). The minimizer

$$\displaystyle{ (\varDelta u,\varDelta \lambda ) =\mathop{ \mathrm{argmin}}\nolimits _{c}\left \|r -\lambda _{I}u_{I} + L\varDelta u_{I} -\lambda _{I}\varDelta u_{I} -\varDelta \lambda _{I}u_{I}\right \| }$$
(56)

is then the best combination of the corrections. Both approaches have been tested for the Poisson problem as well as GENE simulations. For details see [34].

6 Conclusions

Early work on the combination technique revealed that it leads to a suitable method for the solution of simple boundary value problems on computing clusters. The work presented here demonstrated that, if combined with strongly scalable solvers for the components, one can develop an approach which is suitable for exascale architectures. This was investigated for the plasma physics code GENE, which was used to solve initial value and eigenvalue problems and to compute stationary solutions. In addition to the two levels of parallelism exhibited by the combination technique, the flexibility in the choice of the combination coefficients led to a completely new approach to algorithm-based fault tolerance which further enhanced the scalability of the approach.