
1 Introduction

The solution of higher-dimensional problems, especially higher-dimensional partial differential equations (PDEs) that require the joint discretization of more than the usual three spatial dimensions plus time, is one of the grand challenges in current and future high-performance computing (HPC). Resolving the simulation domain as finely as the physical problem requires is in many cases not feasible due to the exponential growth of the number of unknowns—the so-called curse of dimensionality. A resolution of, for example, 1000 grid points in each dimension would result in \(10^{15}\) grid points in five dimensions. Note that this is not an exaggerated example: simulations of physical phenomena such as turbulent flows require the resolution of length scales that span many orders of magnitude, from meters down to less than a millimeter.

This can be observed in the simulation of plasma turbulence in a fusion reactor with the code GENE [9], which solves the five-dimensional gyrokinetic equations. With classical discretization techniques, the resolutions necessary for physically accurate turbulence simulations of a large fusion device, such as the world's flagship fusion experiment ITER, quickly exceed the computational limits of even today's largest HPC systems.

Sparse grids are a hierarchical approach that mitigates the curse of dimensionality to a large extent by drastically reducing the number of unknowns, while preserving an accuracy similar to that of classical discretization techniques working on regular grids [4]. However, due to their recursive and hierarchical structure and the resulting global coupling of basis functions, the direct sparse grid approach is not feasible for large-scale distributed-memory parallelization.

A scalable approach to solve higher-dimensional problems is the sparse grid combination technique [11]. It is based on an extrapolation scheme and decomposes a single large problem (i.e., discretized with a fine resolution) into multiple moderately sized problems with coarse and anisotropic resolutions. This introduces a second level of parallelism, enabling one to compute the partial problems in parallel, independently and asynchronously of each other. It thus breaks the demand for full global communication and synchronization, which is expected to be one of the limiting factors for classical discretization techniques in achieving scalability on future exascale systems. Furthermore, by mitigating the curse of dimensionality, it offers the means to tackle problem sizes that would be out of scope for classical discretization approaches. This allows us to significantly push the computational limits of plasma turbulence simulations and other higher-dimensional problems and is the driving motivation for our project EXAHD [21]. Additionally, we investigate novel approaches to enable fault tolerance on the algorithmic level based on the combination technique [19, 20], which is a key issue in ongoing exascale research [5].

We are currently preparing our combination technique software framework to handle large-scale global computations with GENE. The key part that had been missing until now was a set of efficient and scalable algorithms for the recombination of distributed component grids, which we present in this work. Experiments demonstrate the scalability of our algorithms on up to 180,225 cores on Germany's Tier-0/1 supercomputer Hazel Hen. To obtain meaningful results, we used problem sizes for our scaling experiments that match those we aim to use in our future large-scale production runs. The experiments not only demonstrate the scalability of our recombination algorithm, but also show that it would even be fast enough to be applied after each GENE time step, if this were necessary.

1.1 Sparse Grid Combination Technique

The sparse grid combination technique [4, 11] computes the sparse grid approximation of a function f by a linear combination of component solutions \(f_{\mathbf{l}}\). Each \(f_{\mathbf{l}}\) is an approximation of f that has been computed on a coarse and anisotropic Cartesian component grid \(\varOmega _{\mathbf{l}}\). In our case f is the solution of the gyrokinetic equations, a higher-dimensional PDE, and the corresponding approximation \(f_{\mathbf{l}}\) is the result of a simulation with the application code GENE (see Sect. 1.2) computed on the grid \(\varOmega _{\mathbf{l}}\). In general this can be any kind of function which fulfills certain smoothness conditions.

The discretization of each d-dimensional component grid \(\varOmega _{\mathbf{l}}\) is defined by the level vector \(\mathbf{l} = (l_{1},\ldots,l_{d})^{T}\), which determines the uniform mesh width \(2^{-l_{i}}\) in dimension i. The number of grid points of a component grid is \(\vert \varOmega _{\mathbf{l}}\vert =\prod _{i=1}^{d}(2^{l_{i}} \pm 1)\) (+1 if the grid has boundary points in dimension i and −1 if not).
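For illustration, the number of grid points follows directly from the level vector and the boundary flags; a minimal sketch (function and variable names are our own, not part of the framework) could look as follows:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of grid points |Omega_l| = prod_i (2^{l_i} +/- 1) of a component
// grid with level vector l; hasBoundary[i] states whether dimension i
// carries boundary points. Illustrative helper, not a framework function.
std::uint64_t numGridPoints(const std::vector<int>& l,
                            const std::vector<bool>& hasBoundary) {
  std::uint64_t n = 1;
  for (std::size_t i = 0; i < l.size(); ++i) {
    std::uint64_t pointsInDim = std::uint64_t(1) << l[i];  // 2^{l_i}
    pointsInDim = hasBoundary[i] ? pointsInDim + 1 : pointsInDim - 1;
    n *= pointsInDim;
  }
  return n;
}
```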

In order to retrieve a sparse grid approximation \(f_{\mathbf{n}}^{(c)} \approx f\) one can combine the partial solutions \(f_{\mathbf{l}}(\mathbf{x})\) as

$$\displaystyle{ f_{\mathbf{n}}^{(c)}(\mathbf{x}) =\sum _{\mathbf{ l}\in \mathcal{I}}c_{\mathbf{l}}f_{\mathbf{l}}(\mathbf{x})\,, }$$
(1)

where \(c_{\mathbf{l}}\) are the combination coefficients and \(\mathcal{I}\) is the set of level vectors used for the combination. \(\mathbf{n}\) denotes the maximum discretization level in each dimension. It also defines the discretization of the corresponding full grid solution \(f_{\mathbf{n}}\) on \(\varOmega _{\mathbf{n}}\). Figure 1 shows a two-dimensional example.

Fig. 1
figure 1

The classical combination technique with \(\mathbf{n} = (4,4)\) and \(\mathbf{l}_{\text{min}} = (1,1)\). Seven component grids are combined to obtain a sparse grid approximation (on the sparse grid \(\varOmega _{(4,4)}^{(c)}\)) to the full grid solution on the grid \(\varOmega _{(4,4)}\)

There exist different approaches to determine the combination coefficients \(c_{\mathbf{l}}\) and the index set \(\mathcal{I}\) [14, 15]. Usually,

$$\displaystyle{ f_{\mathbf{n}}^{(c)}(\mathbf{x}) =\sum _{ q=0}^{d-1}(-1)^{q}\binom{d-1}{q}\sum _{\mathbf{l}\in \mathcal{I}_{\mathbf{n},q}}f_{\mathbf{l}}(\mathbf{x}) }$$
(2)

is referred to as the classical combination technique with the index set [11]

$$\displaystyle{ \mathcal{I}_{\mathbf{n},q} =\{\mathbf{l} \in \mathbb{N}^{d}: \vert \mathbf{l}\vert _{1} = \vert \mathbf{l}_{\text{min}}\vert _{1} + c - q,\;\mathbf{n} \geq \mathbf{l} \geq \mathbf{l}_{\text{min}}\}\,, }$$
(3)

where \(\mathbf{l}_{\text{min}} =\mathbf{n} - c \cdot \mathbf{e}\), with \(c \in \mathbb{N}_{0}\) such that \(\mathbf{l}_{\text{min}} \geq \mathbf{e}\), specifies a minimal resolution level in each direction, \(\mathbf{e} = (1,\ldots,1)^{T}\) and \(\mathbf{l} \geq \mathbf{j}\) if \(l_{i} \geq j_{i}\ \forall i\). The computational effort (with respect to the number of unknowns) decreases from \(\mathcal{O}(2^{nd})\) for the full grid solution \(f_{\mathbf{n}}\) on \(\varOmega _{\mathbf{n}}\) to \(\mathcal{O}(dn^{d-1})\) partial solutions of size \(\mathcal{O}(2^{n})\) each. If f fulfills certain smoothness conditions, the approximation quality only deteriorates from \(\mathcal{O}(2^{-2n})\) for \(f_{\mathbf{n}}\) to \(\mathcal{O}(2^{-2n}n^{d-1})\) for \(f_{\mathbf{n}}^{(c)}\). The minimum level \(\mathbf{l}_{\text{min}}\) has been introduced in order to exclude component grids from the combination whose resolution is too coarse, which in some cases could lead to numerically unstable or even physically meaningless results.
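To make Eqs. (2) and (3) concrete, the following sketch enumerates the level vectors of the classical combination technique together with their coefficients \((-1)^{q}\binom{d-1}{q}\). It assumes nothing beyond the definitions above; all identifiers are illustrative and not taken from our framework.

```cpp
#include <functional>
#include <map>
#include <vector>

// Level vectors and coefficients of the classical combination technique:
// all l with lmin <= l <= n and |l|_1 = |lmin|_1 + c - q, q = 0,...,d-1,
// with coefficient (-1)^q * binom(d-1, q), where lmin = n - c*e.
std::map<std::vector<int>, double> classicalCombination(
    const std::vector<int>& n, int c) {
  const int d = static_cast<int>(n.size());
  std::vector<int> lmin(d);
  for (int i = 0; i < d; ++i) lmin[i] = n[i] - c;

  std::map<std::vector<int>, double> coeffs;
  std::vector<int> l(lmin);
  std::function<void(int)> recurse = [&](int dim) {
    if (dim == d) {
      int sum = 0, sumMin = 0;
      for (int i = 0; i < d; ++i) { sum += l[i]; sumMin += lmin[i]; }
      const int q = sumMin + c - sum;            // from |l|_1 = |lmin|_1 + c - q
      if (q >= 0 && q <= d - 1) {
        double binom = 1.0;                      // binom(d-1, q)
        for (int k = 1; k <= q; ++k) binom *= static_cast<double>(d - k) / k;
        coeffs[l] = ((q % 2 == 0) ? 1.0 : -1.0) * binom;
      }
      return;
    }
    for (l[dim] = lmin[dim]; l[dim] <= n[dim]; ++l[dim]) recurse(dim + 1);
  };
  recurse(0);
  return coeffs;
}
```

For \(\mathbf{n} = (4,4)\) and \(c = 3\) this yields the seven component grids of Fig. 1: four with coefficient +1 and three with coefficient −1.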

In the case of time-dependent initial value computations, as we have them in GENE, advancing the combined solution \(f_{\mathbf{n}}^{(c)}(t)\) in time requires recombining the component solutions every few time steps. This is necessary to guarantee convergence and stability of the combined solution. Recombination means to combine the component solutions \(f_{\mathbf{l}}(t_{i})\) according to Eq. (1) and to set the corresponding sparse grid solution \(f_{\mathbf{n}}^{(c)}(t_{i})\) as the new initial value for each component grid. After that, the independent computations are continued until the next recombination point \(t_{i+1}\), where \(t_{i} = t_{0} + i\Delta t\), as illustrated in Fig. 2 (\(t_{0}\) is the time of a given initial value \(f(t_{0})\) and the corresponding approximation \(f_{\mathbf{l}}(t_{0})\)).

Fig. 2
figure 2

Concept of recombination. On each component grid, the PDE is solved for a certain time interval \(\Delta t\). Then the component grids are combined in the sparse grid space, and the combined solution is set as the new initial value for each component grid for the computation of the next time interval

This does not necessarily mean that all component grids must be computed with the same time step size \(\Delta t\). If it is desirable to use individual time step sizes for the component grids, or even adaptive time stepping, \(\Delta t\) can be understood as the time interval between two recombination steps. If the component grids are not recombined, or if the recombination interval is too long, the component solutions could diverge and thereby destroy the combined solution. Even in cases where this is not an issue, it might be desirable to compute the combined solution every few time steps in order to trace certain physical properties of the solution field over time. Since the recombination is the only remaining step that involves global communication, an efficient implementation is crucial for the overall performance of computations with the combination technique.

The function space \(V _{\mathbf{l}}\) of a component grid is spanned by classical nodal basis functions (e.g. piece-wise linear). \(V _{\mathbf{l}}\) can be uniquely decomposed into hierarchical increment spaces \(W_{{\boldsymbol l'}}\) (also called hierarchical subspaces in the following) where \(V _{\mathbf{l}} = \oplus _{{\boldsymbol l'}\leq \mathbf{l}}W_{{\boldsymbol l'}}\). We refer to the operation of decomposing a component grid into its hierarchical subspaces as hierarchization and the inverse operation as dehierarchization. The sparse grid which corresponds to a particular combination is a union of the hierarchical subspaces of the component grids that contribute to the combination. Figure 3 shows the component grids of the combination technique with \(\mathbf{n} = (4,4)\) and the corresponding sparse grid. The component grids can be combined according to Eq. (1) in the function space of the sparse grid by adding up the hierarchical subspaces of the component grids. This is explained in detail in [17].

Fig. 3
figure 3

Component grids of the combination technique (left) with \(\mathbf{l}_{\text{min}} = (1,1)\) and \(\mathbf{n} = (4,4)\), the corresponding sparse grid (middle) and its hierarchical increment spaces (right). The hierarchical increment spaces, as well as the corresponding grid points in the sparse grid, of the component grid with \(\mathbf{l} = (2,3)\) are marked in green

The naive way to perform the recombination would be to interpolate each partial solution onto the full grid \(\varOmega _{\mathbf{n}}\) and to obtain the combined solution by adding them up grid point by grid point, weighted with the corresponding combination coefficients. However, for the large-scale computations with GENE that we aim for, this is not feasible: the full grid would be so large that even storing it would be out of scope. The only efficient (or even feasible) way is to recombine the component grids in the corresponding sparse grid space. For large-scale setups, where the component grids are distributed onto several thousand processes, an efficient and scalable implementation of the recombination is crucial for the overall performance, because this is the only remaining step that requires global communication. In [16, 17] we have already presented and analyzed different algorithms for the recombination when a component grid is stored on a single node. In this work we present new, scalable algorithms for the recombination step with component grids that are distributed over a large number of nodes of an HPC system.

1.2 Large Scale Plasma Turbulence Simulations with GENE

Microinstabilities that arise during the fusion process are a limiting factor for the generation of clean, sustainable energy from plasma fusion reactors [8]. Simulation codes like GENE play an important role in understanding the mechanisms of the resulting anomalous transport phenomena. The combination technique has already been successfully applied to eigenvalue computations in GENE [18]. It has also been used to study fault tolerance on the algorithmic level [2, 19].

The dynamics of hot plasmas is described by the five-dimensional gyrokinetic equation,

$$\displaystyle{ \frac{\partial f_{s}} {\partial t} +\left (v_{\|}\mathbf{b}_{0} + \frac{B_{0}} {B_{0\|}^{{\ast}}}(\mathbf{v}_{E_{\chi }} +\mathbf{ v}_{\nabla B_{0}} +\mathbf{ v}_{c})\right )\cdot \left (\nabla f_{s} + \frac{1} {m_{s}v_{\|}}\left (q\bar{\mathbf{E}}_{1} -\mu \nabla \left (B_{0} +\bar{ B}_{1\|}\right )\right )\frac{\partial f_{s}} {\partial v_{\|}} \right ) = 0\;, }$$
(4)

where \(f_{s} \equiv f_{s}(\mathbf{x},v_{\|},\mu;t)\) is (5+1)-dimensional due to the gyrokinetic approximation (see [3, 6] for a thorough description of the model and an explanation of all identifiers) and s denotes the species (electron, ion, etc.). If we consider \(g_{s}\) (the perturbation of \(f_{s}\) with respect to the Maxwellian background distribution) instead of \(f_{s}\), Eq. (4) can be written as a sum of a (nonsymmetric) linear and a nonlinear operator, namely

$$\displaystyle{ \frac{\partial g} {\partial t} = \mathcal{L}[g] + \mathcal{N}[g]\;, }$$
(5)

where \(g\) is a vector including \(g_{s}\) for all species.

There are different simulation modes in GENE to solve (4). Local (or flux-tube) simulations treat the x and y coordinates in a pseudo-spectral way, and background quantities like density or temperature (and their gradients) are kept constant in the simulation domain. The simulation domain is essentially only parallelized in four dimensions, because no domain decomposition is used in the radial direction. In global runs, only the y direction is treated in a spectral way, and all 5 dimensions are parallelized, with background quantities varying radially according to given profiles.

Additionally, GENE handles three main physical scenarios: multiscale problems (local mode, typical grids of size 1024 × 512 × 24 × 48 × 16 in \((x,y,z,v_{\|},\mu)\) with two species), stellarator problems (local mode, typical grids of size 128 × 64 × 512 × 64 × 16 with two species), and global simulations (expected grid size 8192 × 64 × 32 × 128 × 64 or higher, with up to 4 species). Especially for the global simulations, there exist certain scenarios that require such high resolutions that they are too expensive to compute on current machines.

GENE makes use of highly sophisticated domain decomposition schemes that allow it to scale very well on massively parallel systems. To name an extreme example: the largest run performed so far was carried out on the JUGENE supercomputer in Jülich using up to 262k cores [7]. Thus, GENE is able to efficiently exploit the first fine level of parallelism. By adding a second level of parallelization with the combination technique we can ensure the scalability of GENE on future machines and make it possible to compute problem sizes that would otherwise be out of scope on current machines.

2 Software Framework for Large-Scale Computations with the Combination Technique

We are developing a general software framework for large-scale computations with the combination technique as part of our sparse grid library SG++ [1]. In order to distribute the component grids over the available compute resources, the framework implements the manager–worker pattern. A similar concept has already been successfully used for the combination technique in [10]. The available compute resources, in the form of MPI processes, are arranged in process groups, except for one process, which acts as the dedicated manager. In the framework, the computation of and access to the component grids is abstracted by so-called compute tasks. The manager distributes these tasks to the process groups, as illustrated in Fig. 4. The actual application code is then executed by all processes of a process group in order to compute the task. Note that this concept implements the two-level parallelism of the combination technique. Apart from the recombination step, there is no communication between the process groups, and they can compute their tasks independently and asynchronously of each other. The communication between the manager and the dedicated master process of each process group to orchestrate the compute tasks can be neglected, as it only consists of a few small messages necessary to coordinate the process groups.

Fig. 4
figure 4

The manager–worker pattern used by the software framework. The manager distributes compute tasks over the process groups. In each process group a dedicated master process is responsible for the communication with the manager and the coordination of the group. In the example, each process group consists of three nodes with four processes each.

Our framework uses a Task class (see Fig. 5) as an abstract and flexible interface that can be conveniently adapted to the application code. This class hides the application’s implementation details from the framework. A task is specified by the level vector \(\mathbf{l}\) and the combination coefficient \(c_{\mathbf{l}}\). The Task base class already contains all the necessary functionalities which are independent of the actual application code. Only a few absolutely necessary functions have to be provided by the user in a derived class.

Fig. 5
figure 5

The essential functions of the task interface (the actual file contains more functions). Only run and getDistributedFullGrid have to be implemented in the derived class by the user

The run method takes care of everything that is necessary to start the application in the process group with a discretization that is compatible with the combination technique and specified by \(\mathbf{l}\). Furthermore, the specialization of the Task class has to provide access to the application data, i.e., the actual component grid corresponding to the task, by the method getDistributedFullGrid. Many parallel applications in science and engineering, like GENE, typically use MPI parallelization and computational grids that are distributed onto a large number of processes of an HPC system. Such a distributed component grid is represented by the class DistributedFullGrid, a parallel version of the combination-technique component grids introduced in Sect. 1.1. Since our component grids have a regular Cartesian structure, they are parallelized by arranging the processes on a Cartesian grid as well. A DistributedFullGrid is defined by the d-dimensional level vector \(\mathbf{l}\) and a d-dimensional parallelization vector \(\mathbf{p}\), which specifies the number of processes in each dimension. Furthermore, information about the actual domain decomposition must be provided.
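The following sketch gives an idea of what a user-provided specialization might look like. The base class here is a heavily simplified stand-in: the method names run and getDistributedFullGrid follow the text, but all signatures, members and the class MyAppTask are illustrative assumptions, not the actual SG++ interface.

```cpp
#include <mpi.h>
#include <memory>
#include <utility>
#include <vector>

struct DistributedFullGrid { /* local block of a distributed component grid */ };

// Simplified stand-in for the framework's Task base class.
class Task {
 public:
  Task(std::vector<int> l, double coeff) : l_(std::move(l)), coeff_(coeff) {}
  virtual ~Task() = default;
  virtual void run(MPI_Comm groupComm) = 0;                  // start the solver
  virtual DistributedFullGrid& getDistributedFullGrid() = 0;
  const std::vector<int>& getLevelVector() const { return l_; }
  double getCoefficient() const { return coeff_; }
 private:
  std::vector<int> l_;   // level vector of the component grid
  double coeff_;         // combination coefficient c_l
};

// Hypothetical user specialization: only run and getDistributedFullGrid
// have to be provided.
class MyAppTask : public Task {
 public:
  MyAppTask(std::vector<int> l, double coeff, std::vector<int> p)
      : Task(std::move(l), coeff), p_(std::move(p)),
        dfg_(std::make_unique<DistributedFullGrid>()) {}

  void run(MPI_Comm /*groupComm*/) override {
    // set up the application with mesh widths 2^{-l_i} and parallelization p_,
    // then advance it until the next recombination point
  }

  DistributedFullGrid& getDistributedFullGrid() override { return *dfg_; }

 private:
  std::vector<int> p_;                        // processes per dimension
  std::unique_ptr<DistributedFullGrid> dfg_;  // view on the application data
};
```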

The actual way in which the underlying component grid of the application is accessed depends on the application. In the best case, the application code can directly work with our data structure. This is the most efficient way, because no additional data transformations are necessary for the recombination step. However, this is not possible for most application codes, because they use their own data structures. In this case, the interface has to make sure that the application data is converted into the format of DistributedFullGrid. If access to the application's data at runtime is not possible (which might be the case for some proprietary codes), it would even be possible to read the data from a file. Although this would be the most inefficient way and we doubt it would be feasible for large-scale problems, we mention this option to emphasize that users of our framework have maximal flexibility in how they provide access to their application.

Furthermore, a custom load model tailored to the application can be specified by the user, which can be used to improve the assignment of tasks to the process groups with respect to load balancing. In [13] we have presented such a load model for GENE. If no such model is available, the default model estimates the cost of a task based on the number of grid points of the corresponding component grid.
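A default model of this kind is essentially a one-liner over the level vector; the sketch below (class and method names are illustrative, not the framework's interface) estimates the cost as the number of grid points:

```cpp
#include <cstdint>
#include <vector>

// Default load model: cost of a task ~ number of grid points of its
// component grid (boundary points ignored for simplicity). A user-supplied
// model, e.g. one tailored to GENE [13], would override estimate() with a
// more accurate prediction.
class LoadModel {
 public:
  virtual ~LoadModel() = default;
  virtual double estimate(const std::vector<int>& l) const {
    double points = 1.0;
    for (int li : l) points *= static_cast<double>(std::uint64_t(1) << li);
    return points;
  }
};
```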

3 Scalable Algorithms for the Combination Step with Distributed Component Grids

The recombination step is the only step of the combination technique which requires global communication between the process groups. Especially for time-dependent simulations, such as the initial value problems in GENE, which require a high number of recombination steps (in the worst case after each time step), an efficient and scalable implementation of this step is crucial for the overall performance and the feasibility of the combination technique. Efficient and scalable in this context means that the time for the recombination must not consume more than a reasonable fraction of the overall run time and that this fraction must not scale worse than the run time of the application. In our context we refer to global communication as the communication between processes of different process groups, whereas local communication happens between processes which belong to the same process group. As indicated in Sect. 2, we neglect communication between the process groups and the manager process, because no data is transferred apart from very few small messages that are necessary for the coordination of the process groups.

In [16, 17] we presented a detailed analysis of different communication strategies for the recombination step with component grids that exist on a single node. Note, however, that any approach which is based on gathering a distributed component grid on a single node cannot scale for large problem sizes. On the one hand, such approaches limit the problem size to the main memory of a single node. On the other hand, the network bandwidth of a node is limited and the time to gather a distributed grid on a single node does not decrease with increasing number of processes.

Figure 6 shows the substeps of the distributed recombination. The starting point is a set of process groups, each holding one or more component grids. Each component grid is distributed over all processes in its group. First, each process group transfers all its component grids from the nodal basis representation into the hierarchical basis of the sparse grid by distributed hierarchization. Then, the hierarchical coefficients of each component grid are multiplied by the combination coefficient and added to a temporary distributed sparse grid data structure. Note that each component grid only contributes to a subset of the hierarchical subspaces of the sparse grid. The component grid visualized in Fig. 6 has discretization level \(\mathbf{l} = (3,2)\); hence, it only contributes to the hierarchical subspaces \(W_{{\boldsymbol l'}}\) with \({\boldsymbol l'} \leq (3,2)\). In the following we refer to the operation of adding up all the component grids (multiplied by their combination coefficients) in the sparse grid data structure as reduction. The reduce operation is done in a local and a global substep. First, each process group locally reduces its component grids. Then, the individual distributed sparse grids of the process groups are globally reduced to the combined solution, which afterwards exists in the distributed sparse grid of each process group. As a next step, for each component grid the relevant hierarchical coefficients are extracted from the combined solution. This operation happens inside each process group and is the inverse of the local reduction; we refer to it as the scatter operation. Afterwards, the combined solution is available on each component grid in the hierarchical basis and is transferred back into the nodal basis by dehierarchization. In the following sections we discuss the substeps of the distributed recombination step in detail.
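The sketch below summarizes these substeps from the perspective of one process group. The data structures and method names are simplified placeholders for illustration only; they do not reproduce the framework's actual API.

```cpp
#include <mpi.h>
#include <utility>
#include <vector>

// Minimal stand-ins for the framework's data structures.
struct DistributedFullGrid { /* local block of one component grid */ };

struct DistributedSparseGrid {
  void setZero() {}
  void addDistributedFullGrid(const DistributedFullGrid&, double /*coeff*/) {}
  void extractToDistributedFullGrid(DistributedFullGrid&) {}
  void allreduce(MPI_Comm) {}  // MPI_Allreduce on the subspace coefficients
};

void hierarchize(DistributedFullGrid&) {}    // see Sect. 3.1
void dehierarchize(DistributedFullGrid&) {}

// One recombination step for the component grids held by one process group.
// Each grid is paired with its combination coefficient c_l; globalReduceComm
// connects the processes with equal local rank across all process groups.
void recombine(std::vector<std::pair<DistributedFullGrid*, double>>& grids,
               DistributedSparseGrid& dsg, MPI_Comm globalReduceComm) {
  dsg.setZero();
  for (auto& g : grids) {                        // hierarchization + local reduce
    hierarchize(*g.first);                       // nodal -> hierarchical basis
    dsg.addDistributedFullGrid(*g.first, g.second);
  }
  dsg.allreduce(globalReduceComm);               // global reduce across groups
  for (auto& g : grids) {                        // scatter + dehierarchization
    dsg.extractToDistributedFullGrid(*g.first);  // inverse of the local reduce
    dehierarchize(*g.first);                     // hierarchical -> nodal basis
  }
}
```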

Fig. 6
figure 6

The recombination step. Each component grid is hierarchized and then added to a temporary distributed sparse grid. After the distributed sparse grids in each process group have been globally reduced to the combined solution, the hierarchical coefficients of each component grid are extracted and the component grid is brought back into the nodal basis representation by dehierarchization

3.1 Distributed Hierarchization/Dehierarchization

In [12] we presented a new, scalable algorithm to efficiently hierarchize large, distributed component grids, which we summarize in the following. Figure 7 shows a component grid distributed onto eight MPI processes. The corresponding dependency graph for the hierarchization in the \(x_{2}\)-direction illustrates that an exchange of data between the processes is necessary. Our algorithm follows the uni-directional principle, which means that we hierarchize the d dimensions of the component grid one after the other. The hierarchization in each dimension consists of two steps:

  1. Each process determines from the dependency graph the values it has to exchange with other processes. The dependency graph can be deduced from the discretization and the domain partitioning and is known to each process without additional communication.

  2. The data is exchanged using non-blocking MPI calls. Each process stores the received remote data in the form of (d − 1)-dimensional grids.

Fig. 7
figure 7

Left: Component grid distributed over eight processes and the corresponding dependency graph for the hierarchization in the \(x_{2}\) direction. Right: Distributed hierarchization with remote data from the local view of one process

After the communication step, each process locally performs the hierarchization. The local subgrid is traversed with one-dimensional poles (see Fig. 7). If, during the update of each point, we had to determine whether its dependencies are contained in the local or in the remote data, we would obtain poor performance due to branching and address calculations. In order to avoid this, we copy the local and the remote data to a temporary one-dimensional array which has the full global size of the current dimension. We then perform the hierarchization of the current pole in the temporary array using an efficient 1-d hierarchization kernel. Afterwards, the updated values are copied back to the subgrid. For reasonable problem sizes the maximum global size per dimension is not much more than a few thousand grid points, so the temporary array easily fits into the cache, even for very large grids. The poles traverse the local subgrid and the remote data in a cache-optimal order. Thus, no additional main memory transfers are required. With this approach we achieve high single-core performance due to cache-optimal access patterns. Furthermore, only the absolutely necessary data is exchanged in the communication step.
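The 1-d kernel that operates on the temporary array is a standard in-place hierarchization. A minimal version for a pole of \(2^{n}+1\) points (boundary points included, piecewise-linear basis functions) could look as follows; in the distributed algorithm it is applied to the temporary array containing the local and the remote data of one pole.

```cpp
#include <cstddef>
#include <vector>

// In-place hierarchization of one pole u with 2^n + 1 points (boundary
// points included). The hierarchical surplus of each point is its nodal
// value minus the mean of its two hierarchical parents.
void hierarchize1D(std::vector<double>& u, int n) {
  std::size_t stride = 1;
  for (int level = n; level >= 1; --level) {   // finest to coarsest level
    // points of this level sit at indices stride, 3*stride, 5*stride, ...
    for (std::size_t i = stride; i + stride < u.size(); i += 2 * stride) {
      u[i] -= 0.5 * (u[i - stride] + u[i + stride]);
    }
    stride *= 2;
  }
}
```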

Figure 8 shows strong and weak scaling results for 5-dimensional component grids of different sizes. The experiments were performed on 32 to 32,768 cores of the supercomputer Hornet. The grid sizes used for the strong scaling experiments would correspond to the sizes of component grids of very large GENE simulations. The largest grid used for the weak scaling experiments had a total size of 36 TByte distributed over 32,768 cores. The experiments show that our distributed hierarchization algorithm scales very well. More detailed information on the distributed hierarchization experiments can be found in [12], where further issues such as different dimensionalities, anisotropic discretizations and anisotropic domain decompositions of the component grids are also investigated.

Fig. 8
figure 8

Strong (left) and weak (right) scaling results for the distributed hierarchization of 5-dimensional grids with different grid sizes. For strong scaling the numbers denote the total grid size. For weak scaling the numbers denote the size of the (roughly) constant local portion of the grid for each process

3.2 Local Reduction/Scatter of Component Grids Inside the Process Group

For the local reduction step inside the process group we use two variants of a distributed sparse grid. In the first variant, the hierarchical subspaces are distributed over the processes of the group, so that each subspace is assigned to exactly one process. In the second variant, each hierarchical subspace is geometrically decomposed in the same way as the distributed component grids, and each process stores its part of the domain of each hierarchical subspace. Unlike in the first variant, no communication between the processes is required for the local reduction step. However, this variant can only be applied if all component grids on the different process groups have exactly the same geometrical domain partitioning.

In the following we will present the two variants of the distributed sparse grid and discuss their advantages and disadvantages for the local reduction step. Note that the scatter step, where the hierarchical coefficients of the component grids are extracted from the combined solution, is just the inverse operation of the reduction step. Thus, we will only discuss the reduction here.

3.2.1 Variant 1: General Reduction of Distributed Component Grids

In the first variant of the distributed sparse grid, each hierarchical subspace is stored on exactly one process. A sensible rule is to assign the hierarchical subspaces to the processes such that the number of grid points per process is balanced. This is important for the global reduction step, which has minimal cost when each process in a process group holds an equal number of grid points. We use a straightforward assignment: first, we sort the subspaces by size in descending order. Then we assign the subspaces to the compute nodes in a round-robin fashion. As a last step, we distribute the subspaces assigned to each node to its processes in the same way. A balanced distribution over the compute nodes is more important than over the processes, because the processes on each node share main memory and network bandwidth. In this way, if the number of hierarchical subspaces is large enough (several thousand or tens of thousands in higher dimensions), a balanced distribution can easily be achieved.
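A minimal sketch of this assignment strategy is given below; the only inputs are the subspace sizes and the node and process counts, and all names are illustrative rather than the framework's.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Assign each hierarchical subspace (given by its size) to a process:
// sort by size in descending order, then deal the subspaces out
// round-robin over the nodes and, per node, round-robin over its processes.
// Returns for each subspace the owning rank within the process group.
std::vector<int> assignSubspaces(const std::vector<std::size_t>& subspaceSizes,
                                 int numNodes, int procsPerNode) {
  std::vector<int> order(subspaceSizes.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    return subspaceSizes[a] > subspaceSizes[b];   // largest subspaces first
  });

  std::vector<int> owner(subspaceSizes.size());
  std::vector<int> nextProcOnNode(numNodes, 0);
  int node = 0;
  for (int s : order) {
    const int proc = nextProcOnNode[node];
    owner[s] = node * procsPerNode + proc;        // rank within the group
    nextProcOnNode[node] = (proc + 1) % procsPerNode;
    node = (node + 1) % numNodes;                 // next node, round-robin
  }
  return owner;
}
```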

The advantage of this sparse grid variant is its generality. It can be conveniently used in the local reduction step to add up distributed component grids that have different parallelizations. Furthermore, with this variant the global reduction step can easily be extended to process groups of different sizes. Adding a distributed component grid to a distributed sparse grid means adding up the coefficients of each hierarchical subspace common to both the component grid and the sparse grid. The grid points of each hierarchical subspace in the distributed full grid are geometrically distributed over the processes of the process group. Hence, to add them to the corresponding subspace in the distributed sparse grid, they have to be gathered on the process to which the subspace is assigned. This is illustrated in Fig. 9 (left). The communication overhead that comes with these additional gather operations is the major disadvantage of this sparse grid variant.

Fig. 9
figure 9

Each hierarchical subspace of the distributed sparse grid is stored on exactly one process. Gathering the hierarchical subspaces yields the shown communication pattern (right). Each of the \(m\) processes contains \(N/m\) grid points of the distributed full grid (of total size \(N\)) and of the distributed sparse grid. Each process sends \(a = N/m^{2}\) grid points to each of the \(m - 1\) other processes

In the following, we analyze the communication costs incurred when a distributed component grid is added to a distributed sparse grid. These costs resemble the costs of (almost) completely redistributing the grid points of the distributed full grid: we start with a distributed full grid with \(N/m\) grid points per process (\(N\) grid points distributed over \(m\) processes) and end up with a distributed sparse grid with \(N/m\) grid points per process. In a simple communication model, which only considers the number of grid points to be sent, this means each process sends \(N/m^{2}\) grid points to each of the \((m - 1)\) other processes (see Fig. 9). Thus, in total each process has to send (and receive)

$$\displaystyle{ \frac{N} {m^{2}}(m - 1) = \frac{N} {m} - \frac{N} {m^{2}} \approx \frac{N} {m} }$$
(6)

grid points, which is approximately \(N/m\) for large \(m\). Hence, the amount of data to be transferred by each process (or node) scales with the number of processes. However, on an actual system the time to send a message to another process does not only depend on its size, but is bounded from below by a constant latency. A common model for the message time is \(t = t_{\text{lat}} + K/B\), with latency \(t_{\text{lat}}\), message size \(K\) and bandwidth \(B\). The message latency impedes the scalability of the redistribution step: with increasing \(m\), the size of the messages becomes so small that the time to send a message is dominated by the latency, while the number of messages increases with \(m\).

We have implemented this redistribution operation as follows: first, each process locally collects the data it has to send to each other process in a per-process send buffer; likewise, a receive buffer is created for each of the other processes. Then, non-blocking MPI_Isend and MPI_Irecv calls are used to transfer the data. In theory, this allows the system to optimize the order of the messages, so that an optimal bandwidth usage can be achieved. Afterwards, the received values have to be copied (or added) to the right place in the underlying data structures. Figure 10 shows the communication time (neglecting any local operations) for adding a large component grid. We can observe a small decrease in run time from 512 to 1024 processes, but after this point the run time increases almost linearly. We tried different measures to reduce the number of messages at the price of a higher communicated data volume, for example by using collective operations like MPI_Reduce, but could not observe a significant reduction in communication time.
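Stripped of the buffer assembly, the core of this redistribution is a plain non-blocking point-to-point exchange. The sketch below (with double-valued buffers and illustrative names; GENE data would actually be complex-valued) indicates the structure; sendBufs[r] and recvBufs[r] are assumed to be pre-sized with the data for and from rank r.

```cpp
#include <mpi.h>
#include <vector>

// Exchange pre-assembled buffers with all other processes of the group
// using non-blocking sends and receives; afterwards the received
// coefficients are copied (or added) into the sparse grid data structure.
void redistribute(std::vector<std::vector<double>>& sendBufs,
                  std::vector<std::vector<double>>& recvBufs, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  std::vector<MPI_Request> requests;
  for (int r = 0; r < size; ++r) {
    if (r == rank) continue;
    MPI_Request req;
    if (!recvBufs[r].empty()) {
      MPI_Irecv(recvBufs[r].data(), static_cast<int>(recvBufs[r].size()),
                MPI_DOUBLE, r, 0, comm, &req);
      requests.push_back(req);
    }
    if (!sendBufs[r].empty()) {
      MPI_Isend(sendBufs[r].data(), static_cast<int>(sendBufs[r].size()),
                MPI_DOUBLE, r, 0, comm, &req);
      requests.push_back(req);
    }
  }
  MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
              MPI_STATUSES_IGNORE);
}
```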

Fig. 10
figure 10

Communication time necessary to add a component grid with \(\mathbf{l} = (10,4,6,6,6)\) to a distributed sparse grid of variant 1 for different numbers of processes

3.2.2 Variant 2: Communication-Free Local Reduction of Uniformly Parallelized Component Grids

For the special case where all the component grids share the same domain decomposition, this variant of the distributed sparse grid can be used to add up the distributed full grids in the local reduction operation without any communication. Here, each hierarchical subspace of the distributed sparse grid is geometrically decomposed and distributed over the processes in the same way as the component grids (see Fig. 11 (left)). The assignment of a grid point to a process depends on its coordinate in the simulation domain. This assignment is equal for the distributed sparse grid and for the distributed full grid. This is visualized for a one-dimensional example in Fig. 11 (right): although the component grids have different numbers of grid points, grid points with the same coordinates are always assigned to the same process. This is due to the nested structure of the component grids.

Fig. 11
figure 11

Left: Each process stores for each hierarchical subspace only its local part of the domain. Right: One-dimensional component grids of different levels distributed over four processes. Grid points with the same coordinates are always on the same process due to the geometrical domain decomposition (and due to the nested structure of the component grids)

Compared to variant 1, this has the advantage that no redistribution of the hierarchical coefficients is necessary and therefore no data has to be exchanged between processes. When a distributed full grid is added to the distributed sparse grid, each process adds the hierarchical coefficients of its local part of the distributed full grid to its local part of the distributed sparse grid. With a balanced number of grid points per process \(N/m\) in the distributed full grid, the time for the local reduction scales with \(m\). The major drawback of this approach is that it only works for the special case of uniformly parallelized component grids. However, uniform parallelization might not be possible for all applications; in this case only variant 1 remains. Also, the chosen parallelization might not be optimal for every component grid. For applications where both variants are possible, we face a trade-off between the computation time for the component grids and the time for the local reduction step, which heavily depends on the frequency of the recombination. If the recombination has to be done after each time step, variant 2 is probably the better choice. If the combination has to be done only once at the end, or rather rarely, variant 1 might be more desirable.
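For variant 2 the local reduction thus boils down to a scaled accumulation of local coefficients. A sketch, with a flat index map standing in for the framework's subspace bookkeeping, is:

```cpp
#include <cstddef>
#include <vector>

// Communication-free local reduction (variant 2): every process adds its
// local hierarchical coefficients, scaled by the combination coefficient
// c_l, to its local part of the distributed sparse grid. dfgToDsgIndex maps
// each local coefficient of the component grid to its position in the local
// sparse grid storage (an illustrative layout, not the framework's).
void localReduce(const std::vector<double>& localDfgCoeffs,
                 const std::vector<std::size_t>& dfgToDsgIndex, double coeff,
                 std::vector<double>& localDsgCoeffs) {
  for (std::size_t i = 0; i < localDfgCoeffs.size(); ++i) {
    localDsgCoeffs[dfgToDsgIndex[i]] += coeff * localDfgCoeffs[i];
  }
}
```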

3.3 Global Reduction of the Combination Solution

In this work we discuss the global reduction step for process groups with equal numbers of processes. For a well-scaling application like GENE this is a sensible assumption, because it reduces the complexity of the global reduction. For applications with worse scaling behavior, it might be advantageous in terms of overall efficiency to use process groups of different sizes, especially when the recombination step is performed only rarely. A detailed discussion of the distributed recombination step for process groups of different sizes is, however, out of scope for this work.

The global reduction step is essentially identical for both variants of the distributed sparse grid: an MPI_Allreduce operation is performed by all processes that have the same position (the same local rank) in their process group (see Fig. 12). These are the processes which store the same coefficients of the distributed sparse grid in each process group. The algorithm is rather simple: first, an MPI communicator is created which contains all processes that have the same local rank across the process groups. This communicator only has to be created once and can then be reused for subsequent recombination steps. Second, each process copies its local partition of the distributed sparse grid into a buffer; the order of coefficients in the buffer is equal on all processes. If a particular hierarchical subspace does not exist in a process group, the corresponding entries in the buffer are filled with zeros. This procedure is identical for both variants of the distributed sparse grid. Next, an MPI_Allreduce operation is executed on the buffer. In a last step, each process extracts the coefficients from the buffer and copies them back to its local part of the distributed sparse grid. Now, the combined solution is available in the distributed sparse grid of each process group.
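A minimal sketch of this global reduction could look as follows; the buffer assembly and extraction are omitted, names are illustrative, and the communicator of processes with equal local rank is built once with MPI_Comm_split and then reused.

```cpp
#include <mpi.h>
#include <vector>

// Global reduction across process groups: all processes that share the
// same local rank reduce their sparse grid buffers with an in-place
// MPI_Allreduce. worldComm is assumed to contain all worker processes
// (the manager is excluded).
void globalReduce(std::vector<double>& sparseGridBuffer, int localRank,
                  MPI_Comm worldComm) {
  static MPI_Comm reduceComm = MPI_COMM_NULL;   // created once, then reused
  if (reduceComm == MPI_COMM_NULL) {
    MPI_Comm_split(worldComm, /*color=*/localRank, /*key=*/0, &reduceComm);
  }
  MPI_Allreduce(MPI_IN_PLACE, sparseGridBuffer.data(),
                static_cast<int>(sparseGridBuffer.size()), MPI_DOUBLE, MPI_SUM,
                reduceComm);
}
```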

Fig. 12
figure 12

Communication graph of the global reduction step. The example shows six process groups. Each process group consists of three nodes with 4 processes each

In the following we analyze the cost of the global reduction step. For m processes per process group, the global reduction step is m-way parallel: it consists of m MPI_Allreduce operations that can be performed in parallel and independently of each other. We can estimate the time for each MPI_Allreduce by

$$\displaystyle{ t_{ar} = 2t_{l}\log (ngroups) + 2\frac{N_{SG}/m} {B/M} \log (ngroups)\;. }$$
(7)

Here, \(t_{l}\) is the message latency, ngroups is the number of process groups and \(N_{SG}/m\) is the size of the buffer (the size of the local part of the distributed sparse grid per process). Since \(M\) processes per node share the node's network bandwidth \(B\), the effective bandwidth per process is \(B/M\). Thus, the global reduction scales with the process group size, but increases logarithmically with the number of process groups. However, for our large-scale experiments with GENE, there will be several thousand processes per process group, but one or two orders of magnitude fewer process groups. We have already used the same model in [17] to analyze the costs of the recombination step for non-distributed component grids. It is based on the assumption that MPI_Allreduce is performed in a reduce and a broadcast step using binomial trees. Although this does not necessarily hold for actual implementations [22], it would only change the communication model by a constant factor.

4 Results

We investigated the scalability of our distributed combination algorithm on up to 180,225 (out of 185,088) cores of Germany's Tier-0/1 supercomputer Hazel Hen (rank 8 on the Top500 list of November 2015). It consists of 7712 dual-socket nodes, each with 24 cores (Intel Xeon E5-2680 v3, Haswell) and 128 GByte of main memory.

We chose the problem size in the range that we aim to use for future large-scale experiments with GENE, \(\mathbf{n} = (14,6,6,8,7)\) and \(\mathbf{l}_{\text{min}} = (9,4,4,6,4)\). This results in 182 component grids. The 76 largest component grids have a size between 74 and 83 GByte (complex-valued data). Our experiments correspond to the recombination of one species; the time for the recombination, as well as the computation time of GENE, multiplies with the number of species. We used a distributed sparse grid of variant 2 and the same parallelization for each component grid. Directly computing a global simulation with the chosen \(\mathbf{n}\) in GENE would result in a resolution of 16384 × 64 × 64 × 256 × 128. This problem would be 32 times larger than what GENE users already consider to be infeasible (compare the numbers in Sect. 1.2). We used process groups of size nprocs = 1024, 2048, 4096 and 8192. The number of process groups ngroups ranges between 1 and 128 (only powers of two were used). Furthermore, 22, 44, 88, and 176 process groups were used for the largest experiments with 180,225 processes (one process for the manager). We were not able to perform the experiments with fewer than 4096 cores, because the memory necessary to store the component grids exceeded the nodes' main memory.

Figure 13 shows the individual times for the hierarchization, the local reduction and the global reduction, as well as the total time of these three steps. We do not present results for the scatter step and the dehierarchization here, because they behave so similarly to their inverse counterparts, reduction and hierarchization, that this would not bring any further insight. The run times presented are the average times per process group. For hierarchization and local reduction they are the accumulated times of all component grids assigned to the process group. The abscissa of the plots gives the total number of processes used, ngroups × nprocs (the manager process is neglected).

Fig. 13
figure 13

Timings for the individual steps of the distributed recombination step for different sizes of the process groups. The last plot (bottom) shows the total time of hierarchization, local reduction and global reduction. We also included a rough estimate of the computation time for one time step of GENE with process groups of size 4096

The hierarchization scales well with the total number of processes. However, we can observe a small increase in run time when larger process groups are used. The distributed hierarchization consists of two steps, a communication step and a computation step. The time for the latter depends only on the number of grid points per process; thus it scales perfectly with the total number of processes. The time for the communication step scales as well (see Fig. 8), but not perfectly. This explains the small increase in run time when larger process groups are used.

The local reduction scales perfectly with the total number of processes. Since it does not require any communication between the processes, the run time depends only on the number of grid points per process. The number of grid points per process halves when the size of the process group doubles. Likewise, when twice as many process groups are used, the number of component grids per group halves, and with it the number of grid points per process.

The time for the global reduction increases with the number of process groups. Even though we cannot see a strictly logarithmic increase of the run time with the number of process groups, as predicted by Eq. (7), the formula is still sensible as a rough guide. As already discussed in [17], actual implementations of MPI_Allreduce might use other algorithms than binomial trees. Also, the actual network bandwidth between two distinct compute nodes depends on many factors, such as their position in the network topology and the current overall load on the network. Furthermore, we did not observe a worse than logarithmic increase in run time. For example, with nprocs = 1024 the run time increased from 4 process groups (4096 processes) to 176 process groups (180,224 processes) by a factor of 2.2, whereas log(176)/log(4) ≈ 3.7. The run time decreases for larger process groups, because with fewer grid points per process the message size for the MPI_Allreduce decreases. The strong increase from one to two process groups for nprocs = 4096 and nprocs = 8192 comes from the fact that with only one process group, MPI_Allreduce is essentially a no-op and no actual communication happens.

The total time for hierarchization, local reduction and global reduction scales until it is dominated by the time for the global reduction. For small total process counts, the time for the hierarchization dominates. The time for the local reduction is so small in comparison to the time for the hierarchization that it can be neglected. Strong scalability by adding more and more process groups can only be achieved until the point when the global reduction time dominates. This point, and thus the strong scalability limit of the recombination step, can be shifted to higher process counts by using larger process groups. However, enlarging the process groups is only sensible until the application’s scalability limit is reached.

In Fig. 13 we included a rough estimate of the average time per process group to advance the GENE simulation on each component grid by one time step. It corresponds to the computation time per species on process groups of size 4096. For the largest component grids it was not possible to use fewer than 4096 processes per species because of main memory limitations. Note that GENE computations use considerably more main memory than is necessary to store the computational grid. Therefore, for our future experiments we will rather use 8192 or 16384 processes per species for such large component grids. However, a meaningful investigation of the scalability of the global reduction step requires a sufficiently large number of process groups; thus, we also chose smaller process groups in our experiments. We did not measure the individual times for all 182 component grids; instead, we measured one component grid of each of the five different grid sizes used in the combination, as specified by \(\vert \mathbf{l}\vert _{1}\), cf. Eq. (1). We estimated the total time under the assumption that all component grids with the same level sum have the same run time. In general this is not true, because the run time does not only depend on the size of the component grids, but is also significantly influenced by the actual discretization and parallelization [13]. However, for the purpose of giving a rough guide of how the run time of the actual GENE computations relates to the time of the recombination step, these numbers are accurate enough.

It is not yet clear whether it is necessary to recombine the component grids after each time step, or whether it would be sufficient to recombine only every 10 or even every 100 time steps. This depends on the application and problem at hand, and investigating it for the different GENE operation modes will be part of our future research. However, the results show that the run time of the new and optimized recombination algorithms is now below the computation time for one time step, so even recombining after each time step would be feasible. Furthermore, this is only a first implementation of the recombination step and there is still optimization potential in all substeps. For the distributed hierarchization it is possible to further reduce the communication effort and to speed up the local computations (see [12]). For the global communication, an improved algorithm based on the communication patterns presented in [17] can be used.

5 Conclusion and Future Work

In this work we presented new, scalable algorithms to recombine distributed component grids. We provided a detailed analysis of the individual steps of the recombination: hierarchization, local reduction and global reduction. Additionally, we presented experimental results that show the scalability of a first implementation of the distributed recombination on up to 180,225 cores of the supercomputer Hazel Hen. The experiments demonstrate that our recombination algorithms lead to highly parallel and scalable implementations of the remaining global communication for the solution of higher-dimensional PDEs with the sparse grid combination technique. It would even be possible to recombine the partial solutions very frequently if required.

As a next step we plan to perform actual GENE simulations in order to investigate the effect of the recombination (i.e., the length of the recombination interval) on the error of the combined solution, compared to the full grid solution or to experimental results, if available. At the time of writing this was not possible, because a parallel algorithm for the adaptation of GENE's boundary conditions (which has to be performed after each recombination) had not yet been realized in our software framework.

We did not perform the large scaling experiments with distributed sparse grids of variant 1, because a scalable implementation for the local reduction/scatter of component grids with non-uniform domain decomposition does not exist yet. The reasons are discussed in Sect. 3.2. Although for many applications this is not a severe restriction, in order to improve the generality of our method, finding a better algorithm for the reduction of component grids with non-uniform domain decomposition will be a topic of future work.