Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Classical Algebraic Multigrid (AMG) methods were originally designed for scalar partial differential equations (PDEs) and usually assume that the nullspace of the operator is one-dimensional and constant. This assumption does not hold for many systems of PDEs. For elasticity problems in particular, the nullspace consists of three (in 2D) or six (in 3D) rigid body modes (RBMs), which comprise translations and rotations. Classical AMG methods, including standard approaches modified to handle systems of PDEs, e.g., the unknown approach [23], interpolate translations but not rotations. This limitation will typically result in a loss of optimality and scalability for these approaches when applied to systems problems.

Different approaches to handle linear elasticity problems with AMG methods have been suggested in the last decades, e.g., smoothed aggregation [7, 27], unsmoothed aggregation [3, 4, 8, 1921], AMGe [6], element-free AMGe [15], local optimization problems to incorporate the RBMs in the interpolation [13], or the global matrix (GM) and local neighborhood (LN) approaches [2].

In this paper, we provide a brief overview of AMG methods and AMG for systems in Sects. 2 and 3. In Sects. 4 and 5, we describe the GM and LN approaches, which were first introduced in [2]. These two approaches explicitly incorporate given smooth error vectors into the AMG interpolation in order to handle the correction of these error components in the coarse grid correction. We note that the descriptions of the AMG methods and interpolations in this paper are based on both [2] (which only considered sequential AMG) and on Chap. 4 of the dissertation [18]. In Sect. 6, we compare the performance of AMG approaches for systems of PDEs and show that the GM and LN approaches can improve convergence and scalability for elasticity problems. The parallel numerical results on up to half a million MPI processes presented in Sect. 6 are new and have not been published elsewhere (as [2] contained only serial results for small problems).

2 Algebraic Multigrid

We first give a brief overview of AMG. Consider the linear system Au = f, which is often generated from the discretization of a scalar PDE. We denote with u i the i-th entry of u. As in geometric multigrid, one needs to define a hierarchy of coarser grids or levels, adequate smoothers or relaxation schemes for each level and restriction, and interpolation operators to move between levels. However, unlike geometric multigrid, algebraic multigrid methods are applied to the linear system without any geometrical or mesh-related information.

Because grid information is not given, one needs to use the linear system to define a “grid”. The variables u i are now the grid points and the non-zero entries a ij of matrix A define the connections between the grid points. Because not all variables are equally important, one defines the concept of strong dependence. For a threshold 0 < θ ≤ 1, a variable u i strongly depends on the variable u j if

$$\displaystyle{ -a_{ij} \geq \theta \max _{k\neq i}(-a_{ik})\;. }$$
(1)

To determine the coarse-grid variables, which are a subset of the variables u i , one first eliminates all connections that do not fulfill (1). Then one applies a coarsening algorithm to the remaining “grid”. For brevity, we do not describe any coarsening algorithms here, but note that descriptions of several common coarsening strategies and an investigation of their parallel performance can be found in [28] (e.g., Ruge-Stüben [23, 25], HMIS [11] and Falgout [16]).

In AMG, errors are reduced by two separate operations: the smoothing or relaxation steps and the coarse grid correction. For an optimal AMG method, the coarse grid correction and the relaxation strategy must be chosen carefully to complement each other. While simple pointwise relaxation methods such as Jacobi or Gauß-Seidel rapidly reduce errors in the directions of eigenvectors associated with large eigenvalues, the reduction in directions of eigenvectors associated with small eigenvalues is less optimal; see [6] for details. Errors that are poorly reduced by the smoothing steps are referred to as smooth errors. More precisely, algebraic smooth errors can be characterized by Ae ≈ 0, where e is an eigenvector associated with a small eigenvalue. For an effective AMG method, the smooth error must be reduced by the coarse grid correction. Therefore, an interpolation operator P needs to be defined in such a way that the smooth errors are approximately in the range of P [6]. For additional details on interpolation operators, we refer the reader to various publications, e.g. [12, 23, 25, 26, 29]. The restriction operator R is often defined to be the transposed operator P T, so that in the case of a symmetric positive definite matrix A, the coarse grid operator RAP is also symmetric positive definite. After interpolation, restriction, and coarse grid operators have been defined and a relaxation strategy has been determined, the solve phase can be performed.

For simplicity, consider the two-level case with one fine grid and one coarse grid. For an approximate solution u and the exact solution u of the system Au  = f on the fine grid, we have the relationship Ae = r, where e: = u u is the error vector and r: = fAu is the residual. One AMG cycle to correct (or update) u is as follows:

$$\displaystyle{\begin{array}{l} \textbf{(1) Smooth }\nu _{1}\textbf{ times on: }Au = f \\ \textbf{(2) Compute the residual: }r = f - Au \\ \textbf{(3) Solve on the coarse grid: }RAPe_{c} = Rr \\ \textbf{(4) Correct u: } u = u + Pe_{c} \\ \textbf{(5) Smooth }\nu _{2}\textbf{ times on: }Au = f. \end{array} }$$

T obtain a full multi-level AMG V-cycle, one needs to apply this algorithm recursively, as depicted in Fig. 1. For more details on classical AMG methods, see, e.g., [23, 25].

Fig. 1
figure 1

One AMG V-cycle. Smoothing on the fine grid → Restricting to the coarsest grid → Solving on coarsest grid → Interpolating to the finest grid (Figure from [18])

3 Algebraic Multigrid for Systems of PDEs

We now consider a linear system of equations Au = f derived from the discretization of a system of PDEs with p scalar functions or unknowns. Now, each variable or degree of freedom (dof) of the linear system describes one physical quantity in a grid point or node. For example, in linear or nonlinear elasticity, we have one dof describing one spatial direction in each node. For simplicity, we restrict our presentation here to the two-dimensional case and consider an elasticity problem with two unknowns, x and y, representing the two spatial directions. A detailed three-dimensional description can be found in [2].

For algebraic multigrid methods, the two common approaches to treating systems of PDEs such as Au = f are the unknown approach (U-AMG) and the nodal approach, e.g., [1, 10, 14, 22, 23, 25]. While U-AMG completely separates the different physical quantities, the nodal approach considers all unknowns belonging to the same node at once and thus acts on a nodal basis.

We first take a brief look at the U-AMG. Here, we assume an unknown-related ordering of the system matrix (i.e., first all dofs related to the unknown x followed by those associated with y):

$$\displaystyle{ A = \left [\begin{array}{cc} A_{xx}&A_{xy} \\ A_{yx}&A_{yy} \end{array} \right ]\;. }$$
(2)

One now applies classical AMG coarsening and interpolation strategies to the different variables separately, i.e., only to the diagonal blocks A xx and A yy . Note that this strategy ignores couplings between unknowns x and y, which are contained in A xy and A yx , and leads to an AMG interpolation matrix P that has the diagonal block structure

$$\displaystyle{ P = \left [\begin{array}{cc} P_{x}& 0 \\ 0 &P_{y} \end{array} \right ]\;. }$$
(3)

In general, U-AMG is often used to handle systems of PDEs and is quite effective for problems with weak coupling between the different unknowns. Of course, performance also strongly depends on the general quality of the chosen coarsening, interpolation, and smoothing techniques for the diagonal blocks A xx and A yy .

We now describe the nodal approach, which is often a more effective approach for problems with a stronger coupling between the different physical quantities. If we block all unknowns that share the same node and consider a node-related ordering, then the system matrix A can be written as

$$\displaystyle{ A = \left [\begin{array}{cccc} A_{11} & A_{12} & \cdots & A_{1N} \\ A_{21} & A_{22} & \cdots & A_{2N}\\ \vdots & \vdots & \ddots & \vdots \\ A_{N1} & A_{N2} & \ldots & A_{NN} \end{array} \right ]\;, }$$
(4)

where the 2 × 2 blocks A ij connect nodes i and j. Note that if we define N as the number of nodes or grid points, then A is a N × N block matrix. With the nodal approach, we consider strong dependence between two nodes i and j, instead of between two variables. Therefore, we now have to compare block entries, such as A ji or A jj . This comparison typically involves an appropriate norm such as the Frobenius norm | | ⋅ | |  F or the row-sum norm | | ⋅ | |  . Applying the norm to the blocks of the system matrix A results in a condensed N × N matrix with scalar entries

$$\displaystyle{ C = \left [\begin{array}{cccc} c_{11} & c_{12} & \cdots & c_{1N} \\ c_{21} & c_{22} & \cdots & c_{2N}\\ \vdots & \vdots & \ddots & \vdots \\ c_{N1} & c_{N2} & \ldots & c_{NN} \end{array} \right ]:= \left [\begin{array}{cccc} \vert \vert A_{11}\vert \vert &\vert \vert A_{12}\vert \vert &\cdots & \vert \vert A_{1N}\vert \vert \\ \vert \vert A_{21}\vert \vert &\vert \vert A_{22}\vert \vert &\cdots & \vert \vert A_{2N}\vert \vert \\ \vdots & \vdots & \ddots & \vdots \\ \vert \vert A_{N1}\vert \vert &\vert \vert A_{N2}\vert \vert & \ldots &\vert \vert A_{NN}\vert \vert \end{array} \right ]\;. }$$
(5)

The definition of strong dependence in Eq. (1) is based on A or C being an M-matrix, i.e., a matrix whose off-diagonal elements have the opposite sign of the diagonal elements. Therefore, we change the diagonal elements c ii of C to c ii  = − | | A ii  | | or

$$\displaystyle{ c_{ii} = -\sum _{j=1,j\neq i}^{N}\vert \vert A_{ ij}\vert \vert. }$$
(6)

This approach as well as additional options for defining C are further discussed in [10]. In our experiments, we found the latter approach (i.e., the row-sum norm) to give better convergence, and we are using Eq. (6) in the numerical results presented in Sect. 6. The AMG coarse grids are now obtained by applying classical AMG coarsening techniques to the condensed matrix C. In the nodal coarsening approach, all unknowns on one grid point share the same set of coarse grids. Note the contrast with the unknown approach, which can result in completely different coarse meshes for each unknown. The interpolation matrix in the nodal approach can be obtained by applying scalar AMG interpolation techniques to the blocks (e.g., [14]). Another option, used in our experiments in Sect. 6, is to combine nodal coarsening with unknown-based interpolation. We call this approach the hybrid approach (H-AMG).

4 The Global Matrix Approach

As mentioned in Sect. 2, smooth error vectors should be in the range of the interpolation operator. In the case of linear elasticity, the nullspace of the matrix A consists of the rigid body modes (RBMs), i.e., all rotations and translations of the domain. Since classical AMG interpolation P already interpolates constant vectors exactly, we only have to take care of rotations (i.e., in two dimensions, the single rotation s(x, y): = [ y, −x]). A possible approach to incorporate an exact interpolation of smooth error vectors in the AMG interpolation is, as already mentioned, the global matrix (GM) approach, introduced in [2]. The following description is restricted to two levels. A generalization to the multilevel-case can be found in [2].

The GM approach is based on the idea of augmenting a given global AMG interpolation matrix P with several matrices Q j. Each matrix Q j is chosen to exactly interpolate a specified smooth error vector s j . We designate the rotation s: = [ y, −x] in two dimensions as the smooth error vector. We define s C as the restriction of s onto the coarse grid and define a new interpolation matrix \(\tilde{P }\) by augmenting P:

$$\displaystyle{ \widetilde{P}:= [P\quad Q],\;\text{such that }s \in \mathrm{range}(\widetilde{P})\;. }$$
(7)

There are several possibilities to define a matrix Q fulfilling Eq. (7) and also retain the sparsity of P. We will consider both variants suggested in [2]. For Variant 1 or GM1 we define \(\widetilde{P}\) such that

$$\displaystyle{ \widetilde{P}\left [\begin{array}{c} 0\\ 1 \end{array} \right ] = s\;, }$$
(8)

whereas for Variant 2 or GM2, \(\widetilde{P}\) is defined such that

$$\displaystyle{ \widetilde{P}\left [\begin{array}{c} s_{C} \\ 1 \end{array} \right ] = s\;. }$$
(9)

For GM1, the coefficients Q ij of Q, where i is the index of a fine grid point and j the index of a coarse grid point, are then defined as

$$\displaystyle{ Q_{ij}:= P_{ij}\Big( \frac{s_{i}} {\sum \limits _{k\in C_{i}}P_{ik}}\Big)\;, }$$
(10)

where C i is the set of coarse points in the direct neighborhood of i, i.e., the indices of the columns with non-zero entries in row i of the interpolation P. For GM2, the entries Q ij , are given by

$$\displaystyle{ Q_{ij}:= P_{ij}\Big( \frac{s_{i}} {(\sum \limits _{k\in C_{i}}P_{ik})} - (s_{C})_{j}\Big)\;. }$$
(11)

The unknown-based GM interpolation in two dimensions can then be written as

$$\displaystyle{\widetilde{P} = \left [\begin{array}{ccc} P_{x}& 0 &Q_{x} \\ 0 &P_{y}&Q_{y} \end{array} \right ]\;,}$$

where Q x and Q y can be computed independently and have the same sparsity as P x and P y . Note that this leads to a coarse grid space with a larger number of degrees of freedom than the coarse grid space generated by the unknown-based or the hybrid approach. The increase in degrees of freedom is even further exacerbated in three dimensions, where one needs to add three rigid body modes. So, while we expect improved convergence, the new method is potentially significantly more expensive, and the increased complexities could prevent better performance. Therefore, to mitigate the increase in complexities, we also truncate the Q matrices (see also our numerical results in Sect. 6). Truncation of Q needs to be done independently from truncation of P, because P-truncation is normalized to interpolate constants whereas the truncated Q matrices need to interpolate the rotations. When truncating Q to \(\tilde{Q}\), we adjust the weights of \(\tilde{Q}\) so that the row sums of \(\tilde{Q}\) equal those of Q.

Interestingly enough, the application of both variants beyond the first level leads to very different algorithms. GM1 needs to only interpolate constants after the first level, whereas GM2 needs to continue to interpolate coarser versions of the rigid body modes, thus requiring the storage of coarse grid versions of the rigid body modes as well as additional computations. More details are available in [2]. However, note that GM2 leads to coefficients of similar size, which is not the case for GM1. It is therefore much more difficult to effectively truncate the Q matrices generated in GM1. This difficulty will become evident in Sect. 6.

5 The Local Neighborhood Approach

We now consider an approach where the rigid body modes are incorporated locally. Because exact local interpolation leads to exact global interpolation, this approach should work at least as well as the global matrix approach. This approach requires looking at interpolation from a different angle. Assume that the error at the fine points, e F , is interpolated by the error at the coarse points, e C , such that

$$\displaystyle{ e_{F} = W_{FC}e_{C}\;. }$$
(12)

Let \(\tilde{C}\) be the set of new coarse points that have been introduced by adding new degrees of freedom to the coarse nodes. Further, s is a rigid body mode, s C is s at the original coarse grid points, and s F is s at the fine grid points. The idea for the local neighborhood approach is then to exactly interpolate the rigid body mode using an extension operator

$$\displaystyle{ e_{F} = W_{FC}e_{C} + W_{F \tilde{C} } e_{\tilde{C} } \ \ \ s.t.\ \ \ s_{F} = W_{FC}s_{C} + W_{F \tilde{C} } s_{\tilde{C} } \;, }$$
(13)

where \(s_{\tilde{C} } = 1\) at the new degrees of freedom in \(\tilde{C}\). The LN interpolation matrix needs to be defined by harmonic extension based on the local extension \(\tilde{W } _{FC} = [W_{FC},W_{F \tilde{C} } ]\). Let D s be the matrix with diagonal s. Because W FC interpolates constants, the following definition, which is similar to GM2, satisfies Eq. (13):

$$\displaystyle{ W_{F \tilde{C} } = [D_{s}^{F}W_{ FC} - W_{FC}D_{s}^{C}]\;. }$$
(14)

To allow for an arbitrary interpolation matrix P, the implementation of this approach performs a preprocessing step (cf. “iterative weight refinement” [9]) that results in \(\bar{P}\) where

$$\displaystyle{ \bar{P}_{ij} = - \frac{1} {a_{ii}}\Big(a_{ij} +\sum _{k\in F_{i}}a_{ik}w_{kj}\Big)\;, }$$
(15)

where F i is the fine neighborhood of point i and

$$\displaystyle{ w_{kj} = \frac{P_{kj}} {\sum _{n\in C_{i}}P_{kn}}\;. }$$
(16)

Now that \(\bar{P}\) is based on harmonic extension, Q can be determined using the following formula

$$\displaystyle{ Q_{ij} = - \frac{1} {a_{ii}}\sum _{k\in F_{i}}a_{ik}w_{kj}(s_{k} - s_{j})\;. }$$
(17)

For k rigid body modes s 1, , s k , the new LN interpolation operator is given by

$$\displaystyle{ \tilde{P }= [\ \ \bar{P}\ \ Q^{1}\ \ \ldots \ \ Q^{\,k}\ \ ]\;. }$$
(18)

Note that this approach assumes that As = 0. However, the unknown-based interpolation is not generated from A, but from the block diagonal matrix A D with block diagonals A xx and A yy in 2D (as well as A zz in 3D). In this situation it is important to modify Eq. (17) by incorporating the non-zero residual. We refer to [2] for further details. In addition, like GM2, the LN approach requires the generation of Q on all coarse levels.

6 Numerical Results

In this section, we present numerical results that compare the performance of the previously described AMG approaches. AMG is here used as a preconditioner to either GMRES or CG. The parallel experiments were conducted on the Vulcan supercomputer (LLNL), except for those presented in Table 6, which were computed on the JUQUEEN supercomputer (JSC) [24]. JUQUEEN and Vulcan were ranked 11th and 12th respectively on the TOP500 list of the world’s fastest supercomputers of November 2015. JUQUEEN is a 28,672 node 6 Petaflop Blue Gene/Q system at Jülich Supercomputing Center (JSC, Germany), with a total number of 458, 752 processor cores. Vulcan is a 24, 576 node 5 Petaflop Blue Gene/Q production system at Lawrence Livermore National Laboratory (USA) with a total number of 393, 216 processor cores. Both Blue Gene/Q systems use a Power BQC 16C 1.6 GHz processor with 16 cores and 16 GB memory per node.

Table 1 Weak scalability of the 2D beam problem with E = 210 and ν = 0. 3; iterative solver: preconditioned GMRES; stopping tolerance for the relative residual: 1e-8; quadratic triangular finite elements; Preconditioner denotes the AMG approach (one V-cycle); Pmax/Q-th denotes the truncation of the interpolation operators for P (max non-zeros per row) and Q (absolute threshold); It. denotes the number of GMRES iterations and (Cop) the operator complexity; Time GMRES denotes the runtime of the AMG-GMRES solve phase; Time BoomerSetup denotes the time spent in the BoomerAMG setup; Setup + Solve denotes the total solution time spent in the AMG setup (BoomerSetup) and the AMG-GMRES (Time GMRES) solve. The fastest variant is marked in bold face

We use BoomerAMG [16], the unstructured algebraic multigrid solver in hypre version 2.10.0b [17], which now provides an efficient parallel implementation of the GM and the LN approaches. In hypre version 2.10.0b, the user now simply has to provide the smooth error vectors on the fine grid in addition to the linear system. In our case, we provide the rotations s j , one in 2D, three in 3D. In order to make efficient use of the hardware threads for the 3D results in Tables 3 and 6, we use oversubscription with 2 MPI ranks for each core of the Power BQC processor. Note that no parallel results are given in [2], as a parallel implementation was not available at that time. To ensure a fair comparison of the different methods, we chose an AMG setup such that all components have shown the potential to scale up to large scales. In particular, for all methods, we use HMIS coarsening, introduced in [11], the extended+i interpolation method described in [12, 29] and symmetric SOR/Jacobi smoothing in a V(1,1)-cycle.

Table 2 Same problem setup and notation as in Table 1, but larger problem sizes
Table 3 Weak scalability of the 3D beam problem with E = 210 and ν = 0. 3; iterative solver: preconditioned CG; stopping tolerance for the relative residual: 1e-6; linear tetrahedral finite elements; 2 MPI ranks per Blue Gene/Q core are used; Preconditioner denotes the AMG approach (one V-cycle); Pmax/Q-th denotes the truncation of the interpolation operators for P (max non-zeros per row) and Q (absolute threshold); It. denotes the number of CG iterations and (Cop) the operator complexity; Time CG denotes the runtime of the AMG-CG solve phase; Time BoomerSetup denotes the time spent in the BoomerAMG setup; Setup + Solve denotes the total solution time spent in the AMG setup and the AMG-CG solve

We consider the compressible linear elasticity problem

$$\displaystyle{-2\mu \;\mathrm{div}(\epsilon (u)) -\lambda \mathrm{grad}(\mathrm{div}(u)) = f\;,}$$

where u is the unknown displacement and ε(u) is the strain. The parameters are \(\lambda = \frac{\nu E} {(1+\nu )(1-2\nu )},\mu = \frac{E} {2(1+\nu )}\) (cf. [5]), where the Young modulus is E = 210, and we vary the Poisson ratio ν between 0. 3 and 0. 49.

More detailed descriptions of the various model problems in two and three dimensions are given in subsequent subsections. The finite element assembly is performed in PETSc, and we also use PETSc’s GMRES/CG implementation. In all tables we use the abbreviations U-AMG for the unknown approach, H-AMG for the hybrid approach with the nodal coarsening strategy in Eq. (6) and the row-sum norm, and H-AMG-GM1/GM2/LN for the interpolation approaches GM1, GM2, and LN, respectively. Cop denotes the operator complexity, which is defined as the sum of the non-zeros of all matrices A i on all levels divided by the number of non-zeros of the original matrix A. Operator complexity is an indication of memory usage and the number of flops per iteration and also affects setup times. In order to reduce Cop, we truncate P to at most Pmax non-zero elements per row and use a truncation factor of Q-th (absolute threshold) to truncate Q. In the tables, we mark the fastest time (for the sum of setup and solve) as well as the lowest number of iterations in bold face. As a baseline for our weak scalability tests, in order to avoid cache effects, we use the smallest problem which still makes use of at least a single full node.

6.1 Results in Two Dimensions

If a Dirichlet boundary condition is applied to a large portion of the boundary, standard nodal or unknown approaches are known to perform well, and we do not expect any additional benefit from the GM or LN approach. Therefore, we consider an elasticity problem on a rectangular domain [0, 8] × [0, 1] in 2D, fixed on one of the short sides. A volume force orthogonal to the longer sides is applied. We refer to this problem as 2D beam, and a solution for a linear elastic material is presented in Fig. 2. We use piecewise quadratic finite elements on triangles in all experiments in two dimensions, and, by reordering the unknowns, we ensure that each MPI rank holds a portion of the beam of favorable shape, i.e., close to a square. We present weak scalability results for the 2D beam in Tables 1 and 2, comparing the unknown approach U-AMG, the hybrid approach H-AMG, and, representing the interpolation approaches, the GM2 approach. The GM1 and LN approaches performed similarly to or worse than GM2 here, but are included in a more detailed discussion on the results in three dimensions, where differences between the approaches are more interesting.

Fig. 2
figure 2

Solution of the 2D beam considering linear elasticity with E = 210 and ν = 0. 3. The color represents the norm of the displacement

In the weak scaling results in Table 1, the number of GMRES iterations for the unknown approach increases from 23 to 59 iterations, resulting in a noticeable increase in the iteration time as well. However, both the hybrid and GM2 approaches achieve good weak scalability. Comparing the hybrid and the GM2 approaches, the AMG setup times are slightly higher with the GM2 approach. This increased computational cost is expected due to the exact interpolation of the rotation. Since iteration counts and thus the iteration times are lower, the GM2 approach is always the fastest approach in this comparison.

Table 2 also contains weak scaling results for the 2D beam, but the problem sizes are approximately 2.6 times larger per core. Results are similar to the results in Table 1, but here, for the largest problem with 6. 7 billion degrees of freedom, the hybrid approach needs 52 compared to only 21 GMRES iterations for the GM2 approach. This improvement leads to a much faster convergence for GM2; see also Fig. 3 for a visualization.

Fig. 3
figure 3

Weak scalability of total solution time for the two-dimensional beam with ν = 0. 3 and E = 210; cf. Table 2

We can conclude that with our settings, all three approaches work well for smaller problems. For larger problems (and larger numbers of cores), the GM2 approach remains scalable whereas U-AMG and H-AMG experience an increase in the number of iterations. The setup cost for the GM2 approach is slightly higher, compared to the other two approaches, but the setup time is scalable and amortized in the iteration phase; see also Fig. 3.

6.2 Results in Three Dimensions

Now we present results for several three-dimensional domains. In particular, we first investigate weak scalability for a 3D beam problem. We also investigate the effect of a higher Poisson ratio ν on the 3D beam, showing scalability results and presenting a small study that increases ν. Second, we examine doubling the beam length. And for a third model problem, we consider a heterogeneous material with different boundary condition, called the 3D cuboid.

6.2.1 3D Beam Problem

Similar to the 2D beam, the 3D beam problem is defined on the domain [0, 8] × [0, 1] × [0, 1] for ν = 0. 3, ν = 0. 45 and ν = 0. 49. First, we present weak scalability results in Table 3 for the 3D beam with ν = 0. 3 for all approaches. For the 262K MPI ranks case, we also include a larger problem to show the effect of increasing problem size on performance at large scale.

From the results in Table 3 (see also Figs. 4 and 5), we conclude that for smaller problems, a set of parameters can be found for all approaches such that the results are satisfactory with respect to the numbers of iterations and the solution times. However, for the larger problems (e.g., 262K MPI ranks), the AMG approaches adapted specifically for elasticity, i.e., GM1, GM2, and LN, result in smaller numbers of CG iterations. Note that in the case of the GM1 approach, the low numbers of iterations come at the expense of high complexities because GM1 suffers from the lack of a suitable truncation strategy. As a result, the H-AMG approach is actually faster than GM1. The GM2 and LN approaches achieve the fastest overall total times (with a slight advantage for the LN approach) due to their low iteration counts and acceptable complexities. These considerations also hold when viewing the results for 262K MPI ranks and the increased problem size of 6.3 billion unknowns in Table 3.

Fig. 4
figure 4

Weak scalability of the BoomerAMG Setup (left) and the time spent in the AMG-CG solve phase (right) for the three-dimensional beam with ν = 0. 3 and E = 210; cf. Table 3

Fig. 5
figure 5

Weak scalability of total solution time for the three-dimensional beam with ν = 0. 3 and E = 210; cf. Table 3

Now we increase the Poisson ratio to ν = 0. 45 for the 3D beam. The results in Table 4 (see also Figs. 6 and 7) show that all approaches suffer from a higher number of iterations compared to the case of ν = 0. 3. The GM2 and LN approaches remain superior as a result of combining low numbers of iterations with acceptable complexities. For U-AMG and H-AMG, depending on the choice of parameters, either the numbers of iterations are high or the complexities increase substantially. The times are visualized in Figs. 6 and 7. Since GM1 with Pmax=3 requires too much memory, we use it here with Pmax=2. Note that GM1 fails for the largest problem considered.

Fig. 6
figure 6

Weak scalability of the BoomerAMG Setup (left) and the time spent in the AMG-CG solve phase (right) for the three-dimensional beam with ν = 0. 45 and E = 210; cf. Table 4

Fig. 7
figure 7

Weak scalability of total solution time for the three-dimensional beam with ν = 0. 45 and E = 210; cf. Table 4

Table 4 Same problem setup and notation as in Table 3, but larger problem sizes, ν = 0. 45. On 32 K MPI ranks H-AMG-GM1 hits the maximal iteration number of 1000 (marked with max It.)

Next, in Table 5, the effect of the Poisson ratio on the different AMG approaches is studied. We see that H-AMG does not converge within the limit of 1000 iterations for ν = 0. 49. For the other approaches, the convergence rate suffers from an increasing value of ν towards almost incompressibility. This deterioration is also the case for the AMG approaches which are especially adapted for (compressible) elasticity problems, i.e., GM1, GM2, and LN, but which are based on H-AMG. For ν = 0. 49, U-AMG, while exhibiting the highest Cop, is the fastest variant in terms of total time.

Table 5 Same problem setup and notation as in Table 3. Investigation of the effect of an increasing ν; Setup + Solve denotes the total solution time spent in the AMG setup and the AMG-CG solve; H-AMG hits the maximal iteration number of 1000 (marked with max It.)

6.2.2 3D Beam Problem with Double Length

For ν = 0. 3, we examine the effect of doubling the length of the 3D beam such that the domain is [0, 16] × [0, 1] × [0, 1]. Table 6 lists the results obtained for the 3D beam with double the length, using up to 16 of the total 28 racks of the JUQUEEN supercomputer. Again, these experiments show the clear advantage of the GM2 and LN approaches for this problem over the standard methods. The largest three dimensional problem with approximately 13 billion unknowns is solved in less than 81 s using the LN approach. Here, the solve phase time of LN is twice as fast as that of the fastest standard approach H-AMG.

Table 6 Weak scalability of the larger [0, 16] × [0, 1] × [0, 1] 3D beam problem with E = 210 and ν = 0. 3. Same notation as in Table 3. Computations carried out on JUQUEEN BlueGene/Q at Jülich Supercomputing Centre (JSC)

6.2.3 3D Cuboid Problem

Finally, we consider a 3D cuboid problem. The cuboid has the same form and size as the original 3D beam, but is fixed on the two opposite sides with x = 0 and x = 8. We then compress the cuboid to 95 % of its length. Note that for the 3D cuboid, we have a core material with E = 210 and ν = 0. 45 in the part of the cuboid where 0. 25 < y < 0. 75 and 0. 25 < z < 0. 75. Here (x, y, z) denotes the coordinates in the undeformed reference configuration of the cuboid. In the remaining hull, we have E = 210 and ν = 0. 3.

The results for the 3D cuboid problem in Table 7 show that the AMG approaches benefit from the larger Dirichlet boundary as compared to the 3D beam. However, the GM2 and LN approaches show the best numerical scalability, i.e., the numbers of iterations only increase from 29 to 44 for GM2 and from 24 to 39 for LN when scaling weakly from 64 to 262 K MPI ranks. For this problem, the H-AMG approach remains competitive as well for the largest number of ranks with regard to total times as a result of its low setup time.

Table 7 Weak scalability results for the 3D cuboid problem; notation as in Table 3

6.3 Parallel Problem Assembly and Reordering Process

Although the focus of this paper is on the parallel performance of AMG for linear elasticity problems, we also comment on the parallel problem assembly and setup, presenting timing results in Table 8. In order to assemble the global elasticity problems in two and three dimensions, we first decompose the domain into nonoverlapping parts of equal size, one for each MPI rank. We then assemble local stiffness matrices corresponding to these local parts. These computations are completely local to the ranks and thus perfectly scalable. The local assembly process is denoted as Local Asm. in Table 8. To assemble the local stiffness matrices to one global and parallel stiffness matrix, some global communication is necessary. This global assembly process is denoted as Global Asm. in Table 8. This process scales fine up to 32 K ranks. Scaling further, the amount of communication and synchronization slows the global assembly down. A classical lexicographical ordering of the global indices is often not optimal for the convergence, especially using hybrid approaches, and we therefore reorder the indices. After the reordering process, each rank holds a portion of the global stiffness matrix which has a shape close to a square in two dimensions and a cube in three dimensions. The implementation of the index reordering step is very fast (see Table 8) but makes use of the same communication patterns as the global assembly process leading to the same deterioration on more than 32 K cores.

Table 8 Presentation of problem assembly and setup timings, which are independent of the chosen AMG preconditioner. Values are averages over the measured values in all runs presented in Table 3. The total runtime of the complete 3D beam application can be obtained by adding these three times to the time Setup + Solve from Table 3

7 Conclusions

We investigated the performance of hypre’s AMG variants for elasticity for several 2D and 3D linear elasticity problems with varying Poisson ratios ν. We compared the unknown and hybrid approaches, which use prolongation operators that only interpolate the translations, with three approaches, GM1, GM2 and LN, that are based on the hybrid approach and also incorporate the rotations. In all cases, GM1, GM2 and LN showed improved convergence over the hybrid approach when using the same truncation for P. For ν = 0. 3, all hybrid approaches scaled better than the unknown approach, and the GM2 and LN approaches were overall faster for very large problems. For the largest problem in three dimensions with 14 billion unknowns and using the largest number of processes considered, i.e., 524, 288 processes, the LN approach was 40 % faster than the standard approaches. For ν = 0. 45, GM2 and LN clearly scale better than the other approaches and are more than twice as fast on 32, 768 processes with better complexities and five times as fast as the hybrid approach with the same operator complexity.

We also found that the unknown approach was more robust with regard to an increase in ν than the other approaches, solving the problem with ν = 0. 49 faster than any of the other approaches, but generally needed larger complexities. While the hybrid approach did not converge within 1000 iterations for ν = 0. 49, GM1, GM2 and LN were able to solve the problem in less than 200 iterations.

Overall, our study shows that the inclusion of the rigid body modes into AMG interpolation operators is generally beneficial, especially at large scale. We conclude that, for elasticity problems, using enhancements of the interpolation, parallel AMG methods are able to scale to the largest supercomputers currently available.