
1 Introduction

There are few modern methods for solving large linear systems that can be considered mainstream. These include multigrid methods and domain decomposition methods. Among the former, Algebraic Multigrid (AMG) methods are one of the most popular choices. These methods (generally) require no information beyond the matrix itself, are relatively memory thrifty and, if correctly constructed, can be as efficient as geometric multigrid (GMG) methods. All this makes AMG a very attractive “black box” solver for particular classes of problems (positive definite, positive semi-definite and M-matrices). Any multigrid solver has two stages: setup and solve. The setup stage performs the necessary computations and prepares the operators used by the solve phase. The solve phase actually solves the linear system. In AMG methods, the setup phase contributes substantially to both the convergence and the wall time. We refer the reader to [6, 8, 11, 16] for more detailed information on AMG methods. In this work, we mainly focus on the setup phase.

To construct the matrices and the smoothing, prolongation and restriction multigrid operators on all levels, one uses the entries of the main system matrix, as well as, possibly, some external information. The resulting set of operators is called the AMG hierarchy. The classical AMG approach [11] constructs the hierarchy by dividing the nodes (variables) into coarse and fine ones so that the smoothed error varies slowly in the direction of large matrix coefficients. The coarse nodes are used to construct the lower level, and the prolongation operator is defined by interpolation. The restriction operator is usually defined either as one-to-one restriction (injection) or as the transpose of the prolongation operator. Smoothing operators are built based on the nodes of each level. This approach often results in relatively good hierarchies that guarantee grid independent convergence. However, to achieve this quality, classical coarsening often leads to large coarse level matrices and requires a substantial amount of memory, see [10] for more information. In addition, the original classical coarsening algorithm is essentially serial. Other variants of the algorithm, such as PMIS, HMIS, CLJP, etc. (see [15]), do not produce hierarchies of such quality.

Another variant considered here is the aggregation AMG method [14]. These methods rely on grouping fine level nodes into aggregates to form the coarse matrix on the lower level. Smoothing operators are again constructed on top of the matrices formed on each level. The restriction operator usually acts as an averaging of the variables inside each group, and the prolongation operator is the transposed restriction operator. Such an approach is much more economical in terms of memory consumption on the coarse levels; however, plain aggregation is rarely used on its own, since it generally cannot provide grid independent convergence rates [13]. To overcome this problem, smoothed aggregation was proposed [13, 14]. However, smoothed aggregation AMG results in greater memory requirements, and grid independent convergence is still not guaranteed for some classes of problems. For a comparison of different approaches see [12, 15].

Having described the problem of selecting an appropriate setup procedure, one faces another problem, namely, how to speed up the setup phase. For aggregation-based AMG this means applying a parallel aggregation algorithm in all of its variants, including smoothed aggregation (since none of them is universal). Modern high-performance computing systems are de facto equipped with Graphics Processing Units (GPUs). Such an architecture is efficient, if properly programmed, and more environmentally friendly. In addition, a modern desktop can fit up to 4 (or even 8) powerful GPUs capable of solving relatively large-scale problems. It would be unwise to deprive the users of such desktops of the ability to solve middle-sized problems of academic or engineering orientation. Hence, the implementation of a fully GPU-accelerated AMG solver is an important task.

We tried to utilize the AMGX solver for our problems on GPUs, but failed, see [5]. To verify our implementation of GPU-accelerated setup procedures, we used the AMGCL library [3] by D. Demidov. It is a C++ header-only library that heavily relies on template metaprogramming and has GPU support via CUDA and OpenCL. It is efficient and has been tested in many applications. However, the library is explicitly designed in such a way that the setup process is performed on CPUs only, accelerated by either OpenMP or MPI; GPU support is confined to the solve stage. Moreover, if the system matrix is formed on GPUs, the library performs a CPU\(\leftrightarrow \)GPU memcpy. The author's rationale is that the setup is executed only a limited number of times, and if the solver is applied many times, say in Newton's method, the matrix of the linear system can be reused by calling the rebuild process (also performed on the CPU), see [4] for details. However, if the problem being solved is complicated and the stationary point is not easily found (see, for example, [2]), then this strategy may lead to unsatisfactory results, e.g. a substantial decrease of the CFL number in implicit methods. In this paper, we would like to overcome this flaw. We apply the parallel Maximal Independent Set of distance K (MIS(K)) algorithm on the GPU, as described in [1, 7], with modifications that make the resulting aggregates closer to those of the serial version. The method is implemented in CUDA C++ using templates.

The paper is laid out as follows. First, the aggregation method, the modified MIS(K) method and its application to the AMG setup are described. Next, the modifications introduced into the AMGCL library to implement this method in the GPU-only approach are outlined. Numerical experiments on several publicly available and generated sparse matrices are then presented, and the performance and convergence of the modified and original AMGCL library are measured. The paper closes with a conclusion.

2 Aggregation AMG on the GPU

The initial approach adopted for the AMG hierarchy build process in AMGCL is based on constructing aggregates as noted in the introduction. Aggregates are unions of nodes (variables) on the fine level. After aggregation, each aggregate corresponds to one and only one node on the coarse level. Let n be the number of nodes on the current (fine) level. We can mathematically describe the aggregate structure by the array of numbers \(a_i\), where \(i\in \{0,...,n-1\}\) and \(a_i\in \{0,...,n_c-1\}\), and \(n_c\) is, in turn, the number of nodes on the next (coarse) level.

Transfer operators (prolongation and restriction) are fully determined by the aggregate structure. Regular (non-smoothed) aggregation builds the restriction operator (matrix) R in the following way:

$$ R_{j,i} = \begin{cases} \dfrac{1}{|\{k \mid a_k = j\}|}, & \text{if } a_i = j,\\ 0, & \text{otherwise.} \end{cases} $$

The prolongation matrix is defined as the transpose of the restriction matrix: \(P = R^T\). The coarse operator matrix is defined according to the Galerkin projection: \(A_c=RAP\), where A is the fine level matrix. For smoothed aggregation, the restriction matrix is defined as the product of the restriction matrix defined above and the smoothing matrix \(I + \omega A^F\), where \(\omega \) is the relaxation parameter and \(A^F\) is a specially filtered version of the matrix A. Further levels are constructed recursively.
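For illustration, a minimal host-side sketch of assembling the non-smoothed restriction matrix in CSR format from the aggregate array is given below; the csr container and the function name are ours and do not correspond to the AMGCL API.

#include <cstddef>
#include <vector>

// Minimal CSR container used only for this illustration.
struct csr {
    std::vector<std::ptrdiff_t> ptr, col;
    std::vector<double> val;
    std::ptrdiff_t rows = 0, cols = 0;
};

// Build R (n_c x n) from the aggregate array a: R(j,i) = 1/|aggregate j| if a[i] == j.
csr build_restriction(const std::vector<std::ptrdiff_t>& a, std::ptrdiff_t n_c) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
    csr R;
    R.rows = n_c;
    R.cols = n;
    R.ptr.assign(n_c + 1, 0);

    // Count the size of each aggregate (row lengths of R).
    for (std::ptrdiff_t i = 0; i < n; ++i) ++R.ptr[a[i] + 1];
    for (std::ptrdiff_t j = 0; j < n_c; ++j) R.ptr[j + 1] += R.ptr[j];

    R.col.resize(R.ptr.back());
    R.val.resize(R.ptr.back());

    // Fill rows: fine node i contributes the entry 1/|aggregate a[i]| to row a[i].
    std::vector<std::ptrdiff_t> pos(R.ptr.begin(), R.ptr.end() - 1);
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const std::ptrdiff_t j = a[i];
        const double w = 1.0 / static_cast<double>(R.ptr[j + 1] - R.ptr[j]);
        R.col[pos[j]] = i;
        R.val[pos[j]] = w;
        ++pos[j];
    }
    return R;
}

The prolongation is then obtained by transposing this matrix, and the coarse level operator follows from two sparse matrix-matrix products.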

One can see that in this formalism the overall aggregation algorithm is fully determined by the method of constructing aggregates. Regardless of the choice of a particular algorithm, a strong connections graph first needs to be constructed. There are several variations of the strong connection criterion. The one used in AMGCL is described, for example, in [14]. We denote the adjacency matrix of the strong connections graph by C: \(C_{i,j} = 1\) means that the node with the number i is strongly connected to the node with the number j, while \(C_{i,j} = 0\) means the absence of such a connection. Note that the matrix C is assumed to be symmetric in the algorithm described below.
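A commonly used criterion of this type (see, e.g., [14]) declares node j strongly connected to node i if

$$ |A_{i,j}| \ge \varepsilon _{str} \sqrt{|A_{i,i}\, A_{j,j}|}, $$

where \(\varepsilon _{str}\) is a small user-defined threshold; C is then the symmetrized 0/1 adjacency matrix of this relation.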

The initial algorithm in AMGCL (called plain aggregation) uses an essentially serial approach that exploits a given order of nodes for their grouping. Our task, on the other hand, was to implement a fully GPU workflow for the setup phase, so a parallelizable algorithm had to be used. A common choice for constructing aggregates in parallel is the Maximal Independent Set algorithm, see, for example, [1]. A parallel version of this algorithm uses random seeds to construct MIS(K). MIS(K) is a subset of fine level nodes such that the shortest path length in the graph C between any two of its nodes is larger than K. “Maximal” means that adding any other node to MIS(K) breaks this property. Usually \(K = 2\) is used in the context of AMG.

Our version of the MIS(K) algorithm for constructing aggregates is presented here as Algorithm 1. Note that there are two parts that differ from the original version of MIS(K), highlighted in colour in the Algorithm. The first one is in the node weights (the second element of the tuples \(T_i\)). While originally only the random numbers \(v_i\) were used as the weights, we added an extra term \(n_i W_{nb}\), where \(W_{nb}\) is a global algorithm parameter. \(W_{nb}=0\) falls back to the initial version, while \(W_{nb}=1\) or \(W_{nb}=-1\) can be used to adjust the behavior of aggregate construction. We noted that \(W_{nb}=0\) usually resulted in a lower number of aggregates compared to the original AMGCL plain aggregation. This leads to a lower convergence rate, thus usually slowing down the solve phase. The \(W_{nb}=-1\) choice increases the number of aggregates, thereby partially fixing the convergence problem. However, for some matrices (not considered in the current paper), \(W_{nb}=1\) may be the best option, since the reduced number of variables on the coarse levels speeds up the computational wall time.

Algorithm 1. The modified MIS(K) algorithm for constructing aggregates.

The second difference is in the post-processing part. One can consider it as a reconnection procedure. While the original plain aggregation implementation manually connects all closest strong neighbors to newly created aggregate centers, the MIS(K) algorithm can produce highly skewed aggregates. To overcome this problem, an additional regrouping step was added after the main iteration cycle. The point is to reconnect nodes initially connected to a “far” aggregate center to a closer one, if any. Together, these two improvements reduce the convergence rate drop to an acceptable level of approximately 10% in the worst case among the systems under consideration. Moreover, the initial plain aggregation algorithm can produce different results depending on matrix ordering, while the randomized algorithm demonstrates robustness to this factor.
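For illustration, the modified weight computation can be written as the following CUDA kernel; this is a sketch under the assumption that \(n_i\) denotes the number of strong neighbors of node i (the row length of C), and all names are ours rather than those of the actual implementation.

// Sketch: compute the modified MIS(K) node weights w[i] = v[i] + n_i * W_nb,
// where n_i is taken as the number of strong neighbours of node i
// (the row length of the strong connections matrix C in CSR format).
template <class Index, class Real>
__global__ void compute_weights(Index n,
                                const Index* __restrict__ C_ptr, // CSR row pointers of C
                                const Real*  __restrict__ v,     // random values in [0,1)
                                Real W_nb,                       // global parameter: -1, 0 or +1
                                Real* __restrict__ w)            // output weights
{
    Index i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Real n_i = static_cast<Real>(C_ptr[i + 1] - C_ptr[i]); // number of strong neighbours
    w[i] = v[i] + n_i * W_nb;  // W_nb = 0 recovers the original randomized weight
}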

All other parts of the AMG hierarchy construction (transfer operators, Galerkin projections) are naturally parallelizable for both CPUs and GPUs. The most time-consuming part is the sparse matrix-matrix product, and it will be addressed further in more detail.

3 Implementation

Algorithm 1 can be readily implemented on GPUs. Loops with the comment “for each node in parallel” turn into CUDA kernel calls, since there are no data dependencies inside them. There are, however, some technical issues. First, it was not initially clear which data layout was the best for the tuples T. Experiments showed that the “structure of arrays” layout was still preferable, despite the fact that the size of one tuple is 16 bytes in our implementation. Second, the tuple initialization and the maximum-tuple iteration loops were merged together using the CUB reduce_by_key algorithm, which resulted in better performance.
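A minimal illustration of the two layouts is given below; the field names and types are ours and only indicate the idea of splitting the 16-byte tuples into separate arrays.

// "Array of structures": one 16-byte tuple per node, stored contiguously.
struct tuple_aos {
    int   state;  // node state (e.g. undecided / MIS / removed)
    float weight; // randomized weight v_i + n_i * W_nb
    int   index;  // node index propagated during the max-tuple iterations
    int   pad;    // padding up to 16 bytes
};

// "Structure of arrays": each field is a separate device array,
// which gives coalesced memory access per field inside the kernels.
struct tuples_soa {
    int*   state;
    float* weight;
    int*   index;
};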

From the performance point of view, sparse matrix-matrix multiplication was found to be the bottleneck. We first used the legacy cusparseXcsrgemm2 operation from the cuSparse library of the CUDA toolkit; however, this already deprecated routine performed poorly. The new version introduced in the latest CUDA toolkit showed a huge speedup of this operation, but was unsatisfactory in terms of extra memory consumption. Finally, we tried the SpECK library [9], which showed the best results in both memory consumption and performance. The only disadvantage of this library is the absence of support for Compute Capabilities lower than 6.1. The legacy cuSparse implementation was kept as a fallback for older hardware.
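The choice between the two code paths can be made at runtime by querying the compute capability of the device, as in the following self-contained helper (the actual SpGEMM calls are omitted):

#include <cuda_runtime.h>

// SpECK requires Compute Capability >= 6.1; on older devices we fall back to
// the legacy cuSparse csrgemm2 path. This helper only performs the check;
// the actual SpGEMM calls are made elsewhere.
bool use_speck_backend()
{
    int device = 0;
    cudaGetDevice(&device);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    return (prop.major > 6) || (prop.major == 6 && prop.minor >= 1);
}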

Another important aspect of the fully GPU AMG stack is the implementation of the smoothers (“relaxation” in terms of AMGCL) without any CPU calls. We ported the setups of three of them: spai0 (sparse approximate inverse), ilu0 (incomplete LU factorization) and damped Jacobi. Although in the algorithmic sense there are no problems with their implementation on the GPU, some problems arise from the AMGCL architecture. It was not initially designed for such GPU-only usage; therefore, we needed to introduce a new setup constructor pipeline from the top-level make_solver class down to the coarsening and smoother initialization methods.
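As an example, the spai0 setup reduces to a single value per matrix row; a minimal CUDA sketch is given below, assuming the standard SPAI-0 definition of the smoother as the diagonal matrix minimizing \(\Vert I - MA\Vert _F\) (the kernel name and signature are ours).

// Sketch of a GPU-side spai0 setup: for each row i of A (in CSR format) compute
// m[i] = a_ii / sum_j a_ij^2, i.e. the diagonal matrix M minimizing ||I - M A||_F.
template <class Index, class Real>
__global__ void spai0_setup(Index n,
                            const Index* __restrict__ ptr,
                            const Index* __restrict__ col,
                            const Real*  __restrict__ val,
                            Real* __restrict__ m)
{
    Index i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Real diag = 0, norm2 = 0;
    for (Index k = ptr[i]; k < ptr[i + 1]; ++k) {
        Real a = val[k];
        norm2 += a * a;                 // squared norm of row i
        if (col[k] == i) diag = a;      // diagonal entry a_ii
    }
    m[i] = (norm2 > 0) ? diag / norm2 : 0;
}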

4 Numerical Experiments

All experiments are conducted on symmetric matrices. The following hardware is used in all experiments: CPU – 2\(\times \)Intel Xeon Gold 6248R with 48 cores (96 threads) in total and 512 GB of ECC host memory; GPU – Nvidia Tesla V100 with 32 GB of ECC device memory. Double precision is used in all calculations. The AMGCL setup is the same for all experiments: the conjugate gradient method is used as the main solver, preconditioned by a single AMG V-cycle. The ilu0 smoother is used on each level of the multigrid, and an exact solver is used on the coarsest level. All problems converge; the target relative residual is set to \(1.0 \cdot 10^{-14}\). For each matrix, the results are presented as three plots – wall time of the setup phase, wall time of the solve phase and speedup – together with a table that contains the minimum wall time over all runs for each implementation. It should be noted once again that the original AMGCL implementation uses a multi-threaded CPU setup phase in both its CPU and GPU variants. Thus, the obtained speedup for the setup phase depends on the number of OpenMP threads.
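For reference, such a configuration is expressed in AMGCL roughly as follows; this is a sketch for the built-in CPU backend with plain aggregation, while the actual experiments use our modified GPU backend, so the exact types and parameters may differ.

#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/backend/builtin.hpp>
#include <amgcl/coarsening/aggregation.hpp>
#include <amgcl/relaxation/ilu0.hpp>
#include <amgcl/solver/cg.hpp>

typedef amgcl::backend::builtin<double> Backend;

// CG preconditioned by an AMG V-cycle with ilu0 relaxation on each level.
typedef amgcl::make_solver<
    amgcl::amg<Backend, amgcl::coarsening::aggregation, amgcl::relaxation::ilu0>,
    amgcl::solver::cg<Backend>
    > Solver;

// Typical usage with a CSR matrix (n, ptr, col, val), sketched as comments:
//   Solver::params prm;
//   prm.solver.tol = 1e-14;                          // target relative residual
//   Solver solve(std::tie(n, ptr, col, val), prm);   // setup phase
//   std::tie(iters, error) = solve(rhs, x);          // solve phase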

The first experiment is conducted with the matrix available in the AMGCL examples folder, i.e. a small matrix from the discretization of a Poisson equation called poisson3Db. Its size is 8.56E4, and the number of nonzero elements is 2.37E6. The results are presented in Fig. 1 and Table 1.

Fig. 1. Results for the poisson3Db matrix depending on the number of OpenMP threads: setup phase - left, solve phase - center, speedup - right.

Table 1. Minimum wall times, mean iterations and attained residuals for the poisson3Db matrix.

It is observed that for such matrix sizes the speedup is negligible or even turns into a slowdown. The setup of the AMG hierarchy using the MIS(K) algorithm on the CPU is slower than the original algorithm. The GPU setup phase is faster than the single-threaded execution of the original algorithm; however, it is slower than the runs with 48, 64 and 80 threads. Hence, using GPUs for such small matrices is not recommended.

The next experiment uses the matrix parabolic_fem taken from the sparse matrix market. Its size is 5.26E5, and it has 3.67E6 nonzero elements. The results are presented in Fig. 2 and Table 2.

Fig. 2. Results for the parabolic_fem matrix: setup phase - left, solve phase - center, speedup - right.

Table 2. Minimum wall times, mean iterations and attained residuals for the parabolic_fem matrix.

The results again indicate that the host variant of MIS(K) is slower than the original variant in both the setup and solve phases. The GPU MIS(K) setup phase is 2.4 times faster than the single-threaded original AMGCL implementation and about as fast as the 48-threaded version. The solve phase is slightly slower for the GPU implementation (0.96 times on average). The convergence is slower by 1–2 iterations. The results indicate that the CPU implementation can be used instead of the GPU implementation with a slight penalty in wall time. These two matrices are too small to be used efficiently on GPUs.

Fig. 3. Results for the thermal2 matrix: setup phase - left, solve phase - center, speedup - right.

The next matrix from the sparse matrix market is called thermal2; its size is 1.23E6 with 8.58E6 nonzero elements. The results, presented in Fig. 3 and Table 3, show that the CPU MIS(K) version is slightly faster in the solve phase for 48 threads or more. However, it is almost twice as slow in the setup phase.

Table 3. Minimum wall times, mean iterations and attained residuals for the thermal2 matrix.

The GPU MIS(K) variant is 6 times faster than the single-threaded setup phase of the original implementation. The minimum speedup of the GPU MIS(K) implementation compared to the best multi-threaded original GPU implementation is 1.85 times for the setup phase. The solve phase for the GPU MIS(K) implementation is about 1.5 times faster due to the difference in the obtained AMG hierarchy.

Fig. 4. Results for the generated finite difference Laplace operator: setup phase - left, solve phase - center, speedup - right; upper row - one thread, lower row - best times over all OpenMP threads.

A set of parameterized matrices was generated to perform the analysis in terms of matrix size. A finite difference 7-point 3D Laplace operator was generated and used for the Poisson equation with Neumann boundary conditions and a single Dirichlet condition. The cubic domain was discretized with 50, 100, 150, 200 and 250 grid points in each direction, respectively. The results of the solution of this problem are presented in Fig. 4 and Table 4.
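A sketch of such a generator is given below; it is our own illustration rather than the code used in the experiments. The Neumann conditions are imposed by dropping the missing neighbours, and the single Dirichlet condition is imposed at node 0 by retaining the contribution of one eliminated boundary neighbour in the diagonal, which keeps the matrix symmetric positive definite.

#include <cstddef>
#include <vector>

// Minimal CSR container for this illustration.
struct csr {
    std::vector<std::ptrdiff_t> ptr, col;
    std::vector<double> val;
};

// Assemble the 7-point finite difference Laplacian on an N x N x N grid with
// homogeneous Neumann boundary conditions (the diagonal equals the number of
// existing neighbours) and a single Dirichlet condition at node 0.
csr poisson7pt(std::ptrdiff_t N) {
    auto idx = [N](std::ptrdiff_t i, std::ptrdiff_t j, std::ptrdiff_t k) {
        return (k * N + j) * N + i;
    };

    csr A;
    A.ptr.push_back(0);

    for (std::ptrdiff_t k = 0; k < N; ++k)
    for (std::ptrdiff_t j = 0; j < N; ++j)
    for (std::ptrdiff_t i = 0; i < N; ++i) {
        const std::ptrdiff_t row = idx(i, j, k);

        auto add = [&A](std::ptrdiff_t c, double v) {
            A.col.push_back(c);
            A.val.push_back(v);
        };

        double diag = (row == 0) ? 1.0 : 0.0; // extra 1 from the single Dirichlet condition

        if (k > 0)     { add(idx(i, j, k - 1), -1.0); diag += 1.0; }
        if (j > 0)     { add(idx(i, j - 1, k), -1.0); diag += 1.0; }
        if (i > 0)     { add(idx(i - 1, j, k), -1.0); diag += 1.0; }

        const std::size_t diag_pos = A.col.size();
        add(row, 0.0); // diagonal placeholder, filled below

        if (i + 1 < N) { add(idx(i + 1, j, k), -1.0); diag += 1.0; }
        if (j + 1 < N) { add(idx(i, j + 1, k), -1.0); diag += 1.0; }
        if (k + 1 < N) { add(idx(i, j, k + 1), -1.0); diag += 1.0; }

        A.val[diag_pos] = diag;
        A.ptr.push_back(static_cast<std::ptrdiff_t>(A.col.size()));
    }
    return A;
}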

First, we analyze the behavior for the case when a single thread is used in the original AMGCL implementation. The CPU variant of our implementation is clearly inferior compared to the original AMGCL implementation. The GPU variant, on the other hand, is efficient. For this case, we obtain a substantial speedup starting from a linear matrix size of 150. The maximum speedup in the setup phase of about 7 times is achieved for the largest matrix.

Next, we analyze the behavior against the best multi-threaded variant of the original AMGCL GPU implementation. In this case, a speedup of about 3.51 times is achieved in the setup phase for the largest matrix. The solve phase is approximately the same for both implementations: its speedup fluctuates around one, see Table 4.

Table 4. Minimum wall times for all generated Poisson problem matrices and all considered OpenMP threads.

5 Conclusion

In this research, we presented an implementation of the AMG framework that targets GPUs. The AMGCL header-only C++ library, designed and implemented by D. Demidov, was used as the base framework and was subject to deep modifications. The whole process (both setup and solve phases) was implemented and tested on multiple symmetric matrices, both generated and taken from real applications. It is concluded that for small matrices, e.g. poisson3Db, the usage of our implementation is not recommended: the GPU load is insufficient to deliver any speedup against modern CPUs. We also do not recommend using the CPU variant of MIS(K) aggregates, since it is clearly inferior in all numerical experiments. On the other hand, for intermediate and large matrices (matrix sizes starting from \(\sim 1E6\)) we obtained a speedup of the setup phase of about 7 times against the single-threaded AMGCL GPU implementation and of about 3.5 times against the best multi-threaded variant. The speedup of the solve phase depends on the problem (since aggregates are generated differently by the original AMGCL and by our MIS(K) implementation) and fluctuates between 0.95 and 1.15. The suggested GPU-only implementation can be recommended if matrices are generated on GPUs, for stationary problems with a time-consuming setup phase, as well as for hard transient problems, when a matrix rebuild is required on each time step.

Variations of classical AMG coarsening algorithms, as well as support for multiple GPUs, remain to be implemented on GPUs in future work.