1 Introduction

In fields such as science, engineering, and economics, many problems can be modeled as the following linear system:

$$\begin{aligned} Ax=b, \ \ x, b \in R^{n}, A \in R^{n \times n}. \end{aligned}$$
(1)

Here A is a large, sparse, and nonsingular matrix, and x and b are the unknown and known vectors, respectively. For solving the linear system in Eq. (1), iterative methods such as the generalized minimal residual method (GMRES) (Saad and Schultz 1986) and the biconjugate gradient stabilized method (BICGSTAB) (van der Vorst 1992) have been widely applied. Furthermore, with a left or right preconditioner M, the original problem in Eq. (1) can be transformed into a more tractable form:

$$\begin{aligned} MAx = Mb\ \ \text {or} \ \ AMy = b, x = My. \end{aligned}$$
(2)

A good preconditioner M should be easy to construct and effective in reducing the iteration count of iterative methods. Popular preconditioners include the incomplete factorization preconditioners (Saad 2003; Gao et al. 2014; Anzt et al. 2017), the sparse approximate inverse (SPAI) preconditioners based on Frobenius norm minimization (Cosgrove et al. 1992; Grote and Huckle 1997; Chow 2000; Jia and Zhu 2009), the factorized sparse approximate inverse (FSAI) preconditioners (Kolotilina and Yeremin 1993; Benzi et al. 1996, 2000; Ferronato et al. 2014; Bernaschi et al. 2016), and the preconditioners that consist of an incomplete factorization followed by an approximate inversion of the incomplete factors (van der Vorst 1982; Duin 1999).

The SPAI preconditioner based on Frobenius norm minimization uses an approximation of \(A^{-1}\) as the preconditioner M. As shown in Chow (2000), M is constructed to minimize \(\Vert AM-E\Vert\) in the Frobenius norm:

$$\begin{aligned} \min \Vert AM-E \Vert _{F}^{2}. \end{aligned}$$
(3)

Here E is the \(n\times n\) identity (unit) matrix. The columns of M are independent of each other, and thus Eq. (3) can be decoupled into the following n independent least squares problems:

$$\begin{aligned} \min \Vert Am_{k}-e_{k} \Vert _{2}^{2}, \ \ k=1, 2, \ldots , n, \end{aligned}$$
(4)

where \(m_{k}\) and \(e_{k}\) denote the kth columns of M and of the unit matrix, respectively. Obviously, the construction of the SPAI preconditioner is easily parallelized. Compared with the incomplete factorization preconditioners, SPAI preconditioners require only a few sparse matrix-vector multiplications instead of triangular solves when applied. Compared with FSAI preconditioners, SPAI preconditioners are suitable for general matrices, not just symmetric positive definite ones. Therefore, SPAI preconditioners have attracted considerable attention.
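As an illustration of the column-wise least squares problems in Eq. (4), the following sketch builds a static SPAI with a prescribed pattern using dense NumPy least squares (the matrix, the pattern, and the function name are hypothetical; the paper's implementations work on sparse local submatrices instead):

```python
import numpy as np

def static_spai(A, pattern):
    """Build a sparse approximate inverse M column by column.

    pattern[k] is the prescribed index set J_k of allowed nonzeros
    in column k (a static SPAI: the pattern is fixed a priori).
    """
    n = A.shape[0]
    M = np.zeros((n, n))
    for k in range(n):
        J = pattern[k]
        e_k = np.zeros(n); e_k[k] = 1.0
        # Restrict A to the columns in J and solve min ||A[:, J] m - e_k||_2
        m, *_ = np.linalg.lstsq(A[:, J], e_k, rcond=None)
        M[J, k] = m  # scatter the local solution into column k of M
    return M

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
# Diagonal-plus-neighbors pattern (hypothetical choice)
M = static_spai(A, {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]})
print(np.linalg.norm(A @ M - np.eye(3)))
```

Each column is solved independently, which is what makes the construction embarrassingly parallel.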

Considering the construction method of M, SPAI preconditioners can be categorized into static and dynamic types. If the sparsity pattern of M is prescribed a priori, the preconditioning procedure is static; if the pattern is determined adaptively during construction, it is dynamic. Because constructing an SPAI preconditioner is generally time-consuming for large matrices, over the past ten years, with the advent of graphics processing units (GPUs), many researchers have attempted to accelerate the construction on the GPU architecture. There is some work on accelerating the construction of static SPAI preconditioners on GPUs (Dehnavi et al. 2013; Gao et al. 2017; Rupp et al. 2016; He et al. 2020; Gao et al. 2021). However, in many applications, the difficulty lies in determining a good sparsity structure of the approximate inverse in advance when a static SPAI preconditioner is applied. Thus, dynamic SPAI algorithms that search for the sparsity pattern of M automatically have been proposed. Compared with static SPAI algorithms, the advantage of dynamic SPAI algorithms is that the sparsity pattern of M is exploited automatically; the disadvantage is that a great number of iterations is required to explore the nonzero entries of M. Moreover, research on accelerating the construction of dynamic SPAI preconditioners on GPUs is scarce. Rupp et al. (2016) present a parallel dynamic SPAI implementation on GPU in the ViennaCL library, but it works only for small matrices.

Based on the above motivation, in this study, we present a heuristic SPAI preconditioning algorithm on GPU, called HeuriSPAI. In HeuriSPAI, we first present a heuristic method that, in each loop, supplies potential candidate indices of the nonzero entries of M in advance to guide the selection of new indices, and thus improves the quality of the obtained M. Second, besides \(\Vert r_k\Vert \leqslant \varepsilon\), the loop also terminates when \(l \geqslant l_{\max }\) (\(l_{\max }\) is a small integer) or \(|J_k| > \alpha \cdot n2_k\) (\(\alpha\) is a small real number and \(n2_k\) is the number of nonzeros of the kth column of A), which guarantees the sparsity of the preconditioner. Third, a parallel framework for constructing the heuristic SPAI preconditioner is presented. Finally, each component of the construction, such as sparse matrix-matrix multiplication, finding \(\tilde{J}\), reducing \(\tilde{J}\), determining \(\tilde{I}\), QR decomposition, and computing \(m_k\) and \(r_k\), is computed in parallel inside a group of threads. HeuriSPAI fuses the advantages of static and dynamic SPAI preconditioning algorithms, and alleviates the drawback of existing dynamic SPAI preconditioning algorithms on GPU, which are not suitable for large matrices. Experimental results show that HeuriSPAI is effective, and outperforms several popular preconditioned algorithms on GPU: CSRILU0 in the CUSPARSE library (NVIDIA 2021), the incomplete SPAI preconditioning algorithm in the MAGMA library (Anzt et al. 2018), a parallel static SPAI preconditioning algorithm and a parallel dynamic SPAI preconditioning algorithm in the ViennaCL library (Rupp et al. 2016), and a recent parallel static SPAI preconditioning algorithm (He et al. 2020).

The rest of this paper is organized as follows. In the second section, a new heuristic SPAI preconditioning algorithm is proposed. In the third section, a heuristic SPAI preconditioning algorithm on GPU is presented. Experimental evaluation and analysis are presented in the fourth section. The fifth section contains our conclusions and points out our future research directions.

2 A Heuristic SPAI algorithm

Assuming that each entry of A is greater than or equal to 0, the characteristic polynomial of A (via the Cayley-Hamilton theorem) yields

$$\begin{aligned} A^{-1} = \alpha _0E+\alpha _1A + \cdots + \alpha _{n-1}A^{n-1}. \end{aligned}$$
(5)

Therefore, the pattern of \(A^{-1}\) (denoted by \(S(A^{-1})\)) is contained in the pattern \(\cup _{j=0}^{n-1}S(A^j)\). Thus we can obtain

$$\begin{aligned} S(A^{-1}) \subseteq S((E+A)^{n-1}). \end{aligned}$$
(6)

Similarly, the Neumann representation

$$\begin{aligned} A^{-1}=\beta \sum _{j=0}^{\infty }(E-\beta A)^j \end{aligned}$$
(7)

for small \(\beta\) shows that numerically \(S((E + A)^j)\) is nearly contained in \(S(A^{-1})\) for all j. Considering Eqs. (6) and (7), we can obtain

$$\begin{aligned} S(A^{-1}) \simeq S((E+A)^{n-1}). \end{aligned}$$
(8)
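The containment in Eqs. (6) and (8) can be checked numerically on a small example (a hypothetical 4×4 tridiagonal matrix; Boolean arithmetic is used for the patterns):

```python
import numpy as np

# Illustration of Eqs. (6) and (8): the pattern of (E + A)^j grows toward
# the pattern of A^{-1} as j increases.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 1.0, 0.0],
              [0.0, 1.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])
n = A.shape[0]
S_inv = np.abs(np.linalg.inv(A)) > 1e-12   # pattern of A^{-1} (dense here)

B = (np.eye(n) + np.abs(A)) > 0            # Boolean pattern of E + |A|
P = np.eye(n, dtype=bool)
for j in range(n - 1):
    P = (P.astype(float) @ B.astype(float)) > 0   # pattern of (E+A)^{j+1}
    print(f"j={j+1}: covers S(A^-1)? {bool(np.all(P[S_inv]))}")
```

For this tridiagonal example the inverse is fully dense, and the pattern of \((E+A)^{n-1}\) covers it, while lower powers do not.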

Let us assume that we have already computed an optimal solution \(m_k\), \(k=1,2,\dots , n\), with the residual \(r_k\) of the least squares problem relative to an initial index set \(J_k^0=\{k\}\). Next, for each l, \(l=1,2,\cdots\), utilizing the idea of Eq. (8), we use

$$\begin{aligned} C_k^l = (E + |A |) C_k^{l-1} \end{aligned}$$
(9)

to generate the candidate indices that might be added to \(J^{l}_k\), where \(C_k^0=J_k^0\), and \(|A|\) denotes the entrywise absolute value of A. Let \(\tilde{J}^l_k\) be the set of indices that appear in \(C_k^l\) but not in \({J}^{l-1}_k\). For each \(j\in \tilde{J}^l_k\), we consider the following one-dimensional minimization problem (Grote and Huckle 1997):

$$\begin{aligned} \underset{\mu _j\in R}{\min }\Vert r_k + \mu _j Ae_j\Vert = :\rho _j. \end{aligned}$$
(10)

For every j, \(\mu _j=-{r_k^TAe_j}/{\Vert Ae_j\Vert _2^2}\) and thus \(\rho _j^2=\Vert r_k\Vert _2^2-(r_k^TAe_j)^2/\Vert Ae_j\Vert _2^2\).
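A quick numerical check of the closed-form \(\mu_j\) and \(\rho_j\) (random data of a hypothetical size; the grid scan is used only for verification):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
r_k = rng.standard_normal(n)   # current residual (hypothetical)
j = 2
Ae_j = A[:, j]

# Closed-form minimizer of ||r_k + mu * A e_j||_2 from Eq. (10)
mu_j = -(r_k @ Ae_j) / (Ae_j @ Ae_j)
rho_sq = r_k @ r_k - (r_k @ Ae_j) ** 2 / (Ae_j @ Ae_j)

# Check against a brute-force scan over mu centered at mu_j
mus = np.linspace(mu_j - 1, mu_j + 1, 2001)
vals = [np.sum((r_k + mu * Ae_j) ** 2) for mu in mus]
print(min(vals), rho_sq)
```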

Obviously, indices with \((r_k^TAe_j)^2=0\) lead to no improvement in the one-dimensional minimization. We reduce \(\tilde{J}^l_k\) to the set of the most profitable indices j, i.e., those with the smallest \(\rho _j\), and add them to \({J}^l_k\). Using the augmented set of indices \({J}^l_k\), we solve the least squares problem in Eq. (4) again. We denote by \(\tilde{I}^l_k\) the set of new indices, which correspond to the nonzero rows of \(A(.,{J}^{l-1}_k\cup \tilde{J}^{l}_k)\) not contained in \({I}^{l-1}_k\), and by \(\tilde{n}_1\) and \(\tilde{n}_2\) the numbers of indices in \(\tilde{I}^{l}_k\) and \(\tilde{J}^{l}_k\), respectively, and then have

$$\begin{aligned} \begin{array}{ll} A({I}^{l-1}_k\cup \tilde{I}^{l}_k,{J}^{l-1}_k\cup \tilde{J}^{l}_k) = \begin{pmatrix} \widehat{A}&{}A({I}^{l-1}_k,\tilde{J}^{l}_k)\\ 0&{}A(\tilde{I}^{l}_k,\tilde{J}^{l}_k)\\ \end{pmatrix}\\ = \begin{pmatrix} Q&{}\\ &{}E_{\tilde{n}_1}\\ \end{pmatrix} \begin{pmatrix} R&{}Q_1^TA({I}^{l-1}_k,\tilde{J}^{l}_k)\\ 0&{}Q_2^TA({I}^{l-1}_k,\tilde{J}^{l}_k)\\ 0&{}A(\tilde{I}^{l}_k,\tilde{J}^{l}_k)\\ \end{pmatrix}.\\ \end{array} \end{aligned}$$
(11)

Here \(\widehat{A} \in R^{n_1 \times n_2}\) is the submatrix obtained by eliminating all zero rows in \(A(.,{J}^{l-1}_k)\), Q and R are the matrices obtained by the QR decomposition of \(\widehat{A}\), and \(Q_1\) and \(Q_2\) are the first \(n_2\) columns and the last \((n_1-n_2)\) columns of Q, respectively. Note that the modified Gram-Schmidt method (Brandes et al. 2012) is used for the QR decompositions in this study. We only need to compute the QR decomposition of \(B=\begin{pmatrix} Q_2^TA({I}^{l-1}_k,\tilde{J}^{l}_k)\\ A(\tilde{I}^{l}_k,\tilde{J}^{l}_k)\\ \end{pmatrix}\). Utilizing the QR decomposition, we obtain the solution of the least squares problem in Eq. (4). If \(r_k\) satisfies the loop-stopping condition, the algorithm stops; otherwise, we set \({I}^{l}_k={I}^{l-1}_k\cup \tilde{I}^{l}_k\), \(C^{l}_k=J_k^l\), and \(l=l+1\), and continue the loop.

In order to decrease the computational complexity, the loop stops as soon as \(\Vert r_k\Vert \leqslant \varepsilon\), or \(l \geqslant l_{\max }\) (\(l_{\max }\) is a small integer), or \(|J_k| > \alpha \cdot n2_k\) (\(\alpha\) is a small real number and \(n2_k\) is the number of nonzeros of the kth column of A). The sequential version of our proposed heuristic SPAI algorithm is summarized in Algorithm 1.

Algorithm 1: Heuristic SPAI algorithm

Input: A, a tolerance \(\varepsilon\), the maximum number of heuristic steps \(l_{\max }\), and \(\alpha\)

Output: M

For every column \(m_k\) of M:

1) Set \(l=1\) and \(C_k^0 =\{k\}\), and choose the initial sparsity \(J_k^0=\{k\}\).

2) Solve Eq. (4) to obtain \(m_k\), and compute \(r_k = e_k - Am_k\).

While \(\Vert r_k\Vert _2>\varepsilon\) and \(l<l_{\max }\) and \(|J_k|\leqslant \alpha \cdot n2_k\):

3) \(C_k^{l} = (E + |A |) C_k^{l-1}\).

4) Let \(\tilde{J}^l_k\) be the set of indices that appear in \(C_k^l\) but not in \({J}^{l-1}_k\).

5) For every \(j\in \tilde{J}^l_k\), compute \(\rho _j^2=\Vert r_k\Vert _2^2-(r_k^TAe_j)^2/\Vert Ae_j\Vert _2^2\), and delete from \(\tilde{J}^l_k\) all but the most profitable indices.

6) Determine the new indices \(\tilde{I}^l_k\), and execute the QR decomposition of B.

7) Solve the new least squares problem in Eq. (4) to obtain \(m_k\), and compute the new residual \(r_k = e_k - Am_k\).

8) Set \({I}^{l}_k={I}^{l-1}_k\cup \tilde{I}^{l}_k\), \({J}^{l}_k={J}^{l-1}_k\cup \tilde{J}^{l}_k\), \(C^{l}_k=J_k^l\), and \(l=l+1\).

It is observed that, as compared to the popular dynamic SPAI preconditioning algorithm in Grote and Huckle (1997), our proposed heuristic SPAI algorithm has two main differences: (1) a heuristic method is proposed to supply potential candidate indices; (2) besides \(\Vert r_k\Vert \leqslant \varepsilon\), the loop-stopping condition also includes \(l \geqslant l_{\max }\) and \(|J_k| > \alpha \cdot n2_k\), which better maintains the sparsity level of the preconditioner. The computational complexity of the proposed heuristic SPAI algorithm is roughly \(O(maxI\times maxJ \times n)\), and the two most time-consuming operations in each iteration are the sparse matrix-matrix multiplication and the QR decomposition.
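For reference, the loop of Algorithm 1 can be sketched sequentially as follows. This is a simplified CPU illustration: the incremental QR update of Eq. (11) is condensed into a dense least squares solve, the number of retained indices per step (`keep`) is a hypothetical parameter, and A is assumed to have no zero columns:

```python
import numpy as np

def heuri_spai(A, eps=0.1, l_max=5, alpha=3.0, keep=2):
    """Sequential sketch of Algorithm 1 (heuristic SPAI)."""
    n = A.shape[0]
    pattern_A = np.abs(A) > 0
    B = np.eye(n, dtype=bool) | pattern_A          # pattern of E + |A|
    M = np.zeros((n, n))
    for k in range(n):
        e_k = np.zeros(n); e_k[k] = 1.0
        J = [k]
        C = np.eye(n, dtype=bool)[:, k]            # C_k^0 = {k}
        n2_k = int(pattern_A[:, k].sum())
        m, *_ = np.linalg.lstsq(A[:, J], e_k, rcond=None)
        r = e_k - A[:, J] @ m
        l = 1
        while np.linalg.norm(r) > eps and l < l_max and len(J) <= alpha * n2_k:
            C = (B.astype(float) @ C.astype(float)) > 0            # step 3
            J_tilde = [j for j in np.flatnonzero(C) if j not in J]  # step 4
            if not J_tilde:
                break
            # Step 5: keep the indices with the smallest rho_j
            rho = []
            for j in J_tilde:
                Ae_j = A[:, j]
                rho.append(r @ r - (r @ Ae_j) ** 2 / (Ae_j @ Ae_j))
            J += [J_tilde[i] for i in np.argsort(rho)[:keep]]
            # Steps 6-7 (QR + solve), condensed into one least squares solve
            m, *_ = np.linalg.lstsq(A[:, J], e_k, rcond=None)
            r = e_k - A[:, J] @ m
            l += 1                                                  # step 8
        M[J, k] = m
    return M

A = 2 * np.eye(8) + np.diag(np.ones(7), 1) + np.diag(np.ones(7), -1)
M = heuri_spai(A, eps=0.2, l_max=4, alpha=4.0, keep=2)
print(np.linalg.norm(A @ M - np.eye(8)))
```

Augmenting the index set can only decrease each column's least squares residual, so the result is never worse than the diagonal-pattern starting point.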

Fig. 1 Parallel framework of HeuriSPAI

3 A heuristic SPAI preconditioning algorithm on GPU

In this section, we present a parallel heuristic sparse approximate inverse preconditioning algorithm on GPU, called HeuriSPAI. Table 1 shows the main arrays used in HeuriSPAI. The parallel framework of HeuriSPAI is shown in Fig. 1, which includes three stages: the Init-HeuriSPAI stage, the Compute-HeuriSPAI stage, and the Post-HeuriSPAI stage.

Table 1 Arrays used in HeuriSPAI

3.1 Init-HeuriSPAI stage

In the Init-HeuriSPAI stage, GPU global memory for A is first allocated. A is stored in the CSC (Compressed Sparse Column) storage format, and M is also stored by columns. Second, when computing \(m_k\) (one column of M), \(k=1,2,\cdots ,n\), the dimensions (\(n1_k\), \(n2_k\)) of the local submatrices \(\widehat{A}_k\) are usually distinct for different k. To simplify memory accesses and enhance coalescence, the dimensions of all local submatrices are uniformly defined as (maxI, maxJ), where \(maxJ = \max \limits _k \{\lceil \alpha \cdot n2_k\rceil \}\) and \(maxI = \nu \cdot maxJ\) for an integer \(\nu\). Utilizing maxI and maxJ, the main arrays used in HeuriSPAI (see Table 1) are defined, and GPU global memory for them is allocated. Third, the initial values of \(m_{k}\) and \(r_{k}\) are obtained by setting \(J_k^{0}=\{k\}\) on the GPU.
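The uniform submatrix dimensions can be illustrated with a small arithmetic example (the per-column nonzero counts and parameter values are hypothetical):

```python
import math

# maxJ = max_k ceil(alpha * n2_k), maxI = nu * maxJ, as in the Init stage.
n2 = [3, 5, 2, 4]        # nonzero counts of the columns of A (hypothetical)
alpha, nu = 1.5, 4
maxJ = max(math.ceil(alpha * n2_k) for n2_k in n2)
maxI = nu * maxJ
print(maxJ, maxI)  # prints: 8 32
```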

3.2 Compute-HeuriSPAI stage

In the Init-HeuriSPAI stage, the initial values of \(m_{k}\) and \(r_{k}\) are obtained. The aim of the Compute-HeuriSPAI stage is to obtain better values of \(m_{k}\), \(k=1,2,\cdots ,n\), by iteration. Each \(m_k\) vector is computed by one warp (32 threads in a block), and many \(m_k\) vectors are computed simultaneously by warps executing in parallel. Parallelism is also exploited within a warp, by computing one \(m_k\) vector in parallel using the 32 threads inside the warp.

Fig. 2 Main procedure of sparse matrix-matrix multiplication

Sparse matrix-matrix multiplication: This step computes \(C^{l}_k=(E+|A |)C^{l-1}_k\), \(k=1,2,\cdots , n\). In fact, each warp performs a sparse matrix-sparse vector multiplication. Here we present a novel sparse matrix-matrix multiplication on GPU, whose main procedure is shown in Fig. 2. In a warp, the sparse matrix-sparse vector multiplication, e.g., \((E+|A |)C^{l-1}_k\), is computed as follows. First, the row indices of the first column referenced in \(CIndex_k\) are loaded into \(I_k\); the row indices of the successive columns referenced by \(CIndex_k\) are then compared in parallel with the values in \(I_k\), and new indices are appended to \(I_k\) using atomic operations. Second, each thread computes one row whose index is in \(I_k\), and the values are saved to \(\widehat{A}_k\). Finally, the threads in a warp read \(CData_k\) into the shared memory sCData in parallel; each thread then computes one row of \(\widehat{A}_k\cdot sCData\), and saves the values to \(CData_k\) and the corresponding indices to \(CIndex_k\).
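A sequential CPU sketch of the per-warp sparse matrix-sparse vector product (the function name is hypothetical, and a Python accumulator dictionary stands in for the atomic index insertion on the GPU):

```python
import numpy as np
from scipy.sparse import csc_matrix

def spmspv(A_csc, c_idx, c_val):
    """Sparse matrix-sparse vector product y = (E + |A|) c.

    c_idx/c_val hold the indices and values of the sparse vector c;
    returns the indices and values of the sparse result, mirroring the
    per-warp computation C_k^l = (E + |A|) C_k^{l-1}.
    """
    acc = {}
    for idx, val in zip(c_idx, c_val):
        acc[idx] = acc.get(idx, 0.0) + val          # the E (identity) part
        start, end = A_csc.indptr[idx], A_csc.indptr[idx + 1]
        for row, a in zip(A_csc.indices[start:end],
                          np.abs(A_csc.data[start:end])):
            acc[row] = acc.get(row, 0.0) + a * val  # the |A| part
    idx_out = sorted(acc)
    return idx_out, [acc[i] for i in idx_out]

A = csc_matrix(np.array([[2.0, -1.0, 0.0],
                         [1.0,  2.0, 0.0],
                         [0.0,  1.0, 2.0]]))
idx, val = spmspv(A, [0], [1.0])  # c = e_0
print(idx, val)
```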

Finding \(\tilde{J}\): Each warp finds a subset of \(\tilde{J}\) in this step. In a warp, a subset of \(\tilde{J}\), e.g., \(\tilde{J}_k\), is computed by the following procedure: the indices in \(CIndex_k\) are compared in parallel with the values in \(J_k\), and the new indices are written into \(\tilde{J}_k\).

Reducing \(\tilde{J}\): In this step, all but the most profitable indices are deleted from each subset of \(\tilde{J}\) (e.g., \(\tilde{J}_k\)). Each subset \(\tilde{J}_k\) is reduced by one warp in the following three stages. In the first stage, the threads in a warp compute \(\rho _j\), \(j\in \tilde{J}_k\), in parallel, and save them to shared memory. In the second stage, the values in shared memory are sorted in ascending order. In the third stage, the threads in the warp read, in parallel, the \(\rho _j\) that are smaller than \(\eta\) from shared memory, and write the corresponding indices back to \(\tilde{J}_k\).
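The reduction step can be sketched as a sort-and-threshold on \(\rho_j\) (the index set, the \(\rho_j\) values, and the cutoff \(\eta\) are hypothetical):

```python
import numpy as np

# Keep only the most profitable candidates: sort rho_j in ascending
# order and retain the indices whose rho_j falls below the cutoff eta.
J_tilde = np.array([4, 7, 9, 12])
rho = np.array([0.9, 0.2, 0.5, 0.8])
eta = 0.6
order = np.argsort(rho)                  # ascending rho_j
kept = J_tilde[order][rho[order] < eta]  # indices with rho_j below eta
print(kept)
```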

Fig. 3 Main procedure of decomposing \(\widehat{A}\) into QR

Determining \(\tilde{I}\) and QR decomposition: This step determines \(\tilde{I}\) and decomposes the local submatrix into QR using the modified Gram-Schmidt method. Each warp determines one set of \(\tilde{I}\), e.g., \(\tilde{I}_k\). For each \(j\in \tilde{J}_k\), all threads inside a warp search the row indices in the jth column of A in parallel to find the indices that are not included in \({I}_k\), and then write them to \(\tilde{I}_k\) using the atomic operation. \(\tilde{I}_k\) is then sorted in parallel in ascending order. In addition, each warp is also responsible for one QR decomposition in this step, whose main procedure is exhibited in Fig. 3. In a warp, the QR decomposition of the local submatrix, e.g., \(\widehat{A}_k\), consists of three steps at each iteration i. First, all threads compute the ith row of the upper triangular matrix \(R_k\) in parallel and put it into the shared memory sR. Second, the threads in the warp concurrently normalize column i of \(Q_k\) and compute the projection factors into \(R_k\) and sR. Third, the values of all remaining columns of \(Q_k\) are updated in parallel using the shared memory sR and column i of \(Q_k\).
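A plain sequential version of the modified Gram-Schmidt QR parallelized here (the example submatrix is hypothetical):

```python
import numpy as np

def mgs_qr(A):
    """QR decomposition by modified Gram-Schmidt: at step i, normalize
    column i, compute the projection factors (row i of R), and update
    the remaining columns."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = np.linalg.norm(A[:, i])
        Q[:, i] = A[:, i] / R[i, i]        # normalize column i
        for j in range(i + 1, n):
            R[i, j] = Q[:, i] @ A[:, j]    # projection factor
            A[:, j] -= R[i, j] * Q[:, i]   # update remaining columns
    return Q, R

A_hat = np.array([[3.0, 1.0],
                  [4.0, 2.0],
                  [0.0, 2.0]])
Q, R = mgs_qr(A_hat)
print(np.allclose(Q @ R, A_hat))
```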

Computing \(m_k\) and \(r_k\): This step computes \(m_k\) and \(r_k\). \(m_k\) is obtained by scattering \(\widehat{m}_k\), and \(r_k=e_k-Am_k\). Therefore, the key task of this step is to compute \(\widehat{m}_k\) by solving \(R_k\widehat{m}_k=Q_k^T\widehat{e}_k\). Each warp is responsible for computing one \(\widehat{m}_k\), in two steps. In the first step, all threads inside the warp compute \(Q_k^T\widehat{e}_k\) in parallel and save the values to the shared memory xE. In the second step, the values of \(\widehat{m}_k\) are obtained by solving the upper triangular linear system \(R_k\widehat{m}_k=xE\) in parallel using shared memory.
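The upper triangular solve of the second step can be sketched as back substitution (the small system is hypothetical):

```python
import numpy as np

def solve_upper(R, y):
    """Back substitution for the upper triangular system R x = y,
    as used to recover m_hat_k from R_k m_hat_k = Q_k^T e_hat_k."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

R = np.array([[2.0, 1.0],
              [0.0, 3.0]])
y = np.array([5.0, 6.0])
x = solve_upper(R, y)
print(x)  # [1.5 2. ]
```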

3.3 Post-HeuriSPAI stage

Fig. 4 Assembling M

The Post-HeuriSPAI stage assembles M in the CSC storage format, storing it in the MPtr, MIndex, and MData arrays. This stage includes the following steps:

1) On the GPU, MPtr is assembled using JPTR, as shown in Fig. 4.

2) MData and MIndex are assembled using \(\widehat{m}_k\) and J: each warp assembles one \(\widehat{m}_k\) into MData and one \(J_k\) into MIndex in parallel.

Obviously, the MPtr, MIndex, and MData arrays are generated in GPU memory and do not need to be transferred to the CPU.
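The assembly of the CSC arrays can be sketched on the CPU as a prefix sum over the per-column counts followed by concatenation (the index sets and values below are hypothetical):

```python
import numpy as np

# Assemble M in CSC format from per-column index sets J and values m_hat:
# MPtr is the prefix sum of the per-column nonzero counts (cf. Fig. 4).
J     = [[0, 1], [1], [1, 2]]              # J_k per column (hypothetical)
m_hat = [[0.5, -0.1], [0.4], [-0.2, 0.6]]  # m_hat_k per column

counts = [len(Jk) for Jk in J]
MPtr = np.concatenate(([0], np.cumsum(counts)))   # column pointers
MIndex = np.concatenate(J)                        # row indices
MData = np.concatenate(m_hat)                     # values
print(MPtr, MIndex, MData)
```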

4 Evaluation and analysis

Table 2 Overview of GPUs
Table 3 Descriptions of test matrices

We evaluate the performance of HeuriSPAI in this section. Table 2 gives an overview of the NVIDIA GPUs used in the performance evaluation. The test matrices are selected from the SuiteSparse Matrix Collection (Davis and Hu 2011). Table 3 summarizes the sparse matrices, including the name, kind, number of rows, and total number of nonzeros. These matrices are chosen because they have been widely used in previous work (Grote and Huckle 1997; Dehnavi et al. 2013; He et al. 2020; Gao et al. 2021). The source codes are compiled and executed using the CUDA toolkit 11.1 (NVIDIA 2021). Note that in the following experiments, all algorithms use double-precision floating point numbers in all computations.

4.1 Effectiveness analysis

First, we test the effectiveness of the approximate inverse matrices obtained by HeuriSPAI. For each matrix, both GPUBICGSTAB and GPUPBICGSTAB are called to solve \(Ax=b\), where all elements of b are 1, the produced M is used as the preconditioner, and the initial guess is \(x_0=b\). GPUBICGSTAB and GPUPBICGSTAB are parallel implementations of BICGSTAB and preconditioned BICGSTAB on GPU using the CUBLAS 11.1 (NVIDIA 2021) and CUSPARSE 11.1 (NVIDIA 2021) libraries, and stop when the residual error, defined as \(\frac{\Vert b-Ax\Vert _2}{\Vert b-Ax_0\Vert _2}\), is less than \(10^{-7}\) or the number of iterations exceeds 10,000. Tables 4 and 5 show the number of iterations and the execution time after convergence of GPUBICGSTAB and GPUPBICGSTAB on the GTX1070 and RTX3090, respectively. The time unit is second (s). Note that the execution time of GPUPBICGSTAB in Tables 4 and 5 includes the execution time of HeuriSPAI used to obtain the preconditioner; "/" means that the execution time of the algorithm is not reported because its iteration count exceeds 10,000.

Table 4 Iterations and execution time of two algorithms on GTX1070
Table 5 Iterations and execution time of two algorithms on RTX3090

From Tables 4 and 5, we observe that on both GPUs, without a preconditioner, GPUBICGSTAB cannot converge to the \(10^{-7}\) residual error within 10,000 iterations for af25600, Zhao2, imagesensor, venkat01, nv2, G3_circuit, ss, and stokes, while GPUPBICGSTAB with HeuriSPAI can. For apache2, t2em, thermal2, and atmosmodd, GPUBICGSTAB converges within 10,000 iterations, but the number of iterations decreases dramatically with the preconditioner. GPUPBICGSTAB has smaller execution time than GPUBICGSTAB for these matrices except atmosmodd. These observations validate the effectiveness of the approximate inverse matrices obtained by HeuriSPAI.

Second, we test the effectiveness of HeuriSPAI by comparing it with a recent static SPAI algorithm suggested in He et al. (2020) (denoted by SSPAI) and the popular dynamic SPAI algorithm by Grote and Huckle (1997) (denoted by DSPAI) from the viewpoint of accelerating convergence. The first eleven small matrices in Table 3 are used in this test. The small matrices are chosen for the following two reasons: (1) DSPAI is not suitable for large matrices; (2) they are the same as those in Grote and Huckle (1997). As in Grote and Huckle (1997), the preconditioned BICGSTAB is called to solve \(Ax=b\), and stops when the residual error is less than \(10^{-8}\) or the number of iterations exceeds 10,000. Table 6 shows the convergence results: the second, third, fourth, and fifth columns give the results without a preconditioner, with the preconditioner obtained by SSPAI, with the preconditioner obtained by DSPAI, and with the preconditioner obtained by HeuriSPAI, respectively.

Table 6 Convergence results of all algorithms

Compared with SSPAI, for all test cases, the preconditioned BICGSTAB with the HeuriSPAI preconditioner requires fewer iterations than with the SSPAI preconditioner. In particular, for sherman2 and pores_2, the preconditioned BICGSTAB with SSPAI cannot converge to the \(10^{-8}\) residual error within 10,000 iterations while the preconditioned BICGSTAB with HeuriSPAI can. This verifies that HeuriSPAI is better than SSPAI. Compared with DSPAI, the preconditioned BICGSTAB with the HeuriSPAI preconditioner requires fewer iterations than with the DSPAI preconditioner for all test matrices except sherman3 and pores_2. In particular, for sherman2, the preconditioned BICGSTAB with DSPAI cannot converge to the \(10^{-8}\) residual error within 10,000 iterations while the preconditioned BICGSTAB with HeuriSPAI can. This means that HeuriSPAI is effective.

Fig. 5 The fraction of total time spent in the Init-HeuriSPAI, Compute-HeuriSPAI, and Post-HeuriSPAI stages

Fig. 6 Ratio of the execution time on CPU to the execution time on GPU

4.2 Performance analysis

In this section, we first use the GTX1070 to investigate the fraction of the total time spent in the Init-HeuriSPAI, Compute-HeuriSPAI, and Post-HeuriSPAI stages, shown in Fig. 5. We observe that for all matrices, the fractions of the Init-HeuriSPAI and Post-HeuriSPAI stages are at most \(\frac{1}{10}\) and \(\frac{1}{20}\), respectively. This verifies that the time of HeuriSPAI is mainly attributed to the Compute-HeuriSPAI stage. Second, for the Compute-HeuriSPAI stage, we explore the ratio of its execution time on the CPU to its execution time on the GPU, as shown in Fig. 6. The ratios of the CPU execution time to the execution time on the GTX1070 range roughly from 41.38 to 63.33 for the 12 test matrices, with an average of 51.05; the ratios of the CPU execution time to the execution time on the RTX3090 range roughly from 53.43 to 79.44, with an average of 63.89. These results show that computing the preconditioner with our proposed HeuriSPAI has high parallelism.

4.3 Performance comparison

Table 7 Execution time of all preconditioning algorithms and GPUPBICGSTAB on GTX1070
Table 8 Execution time of all preconditioning algorithms and GPUPBICGSTAB on RTX3090

We evaluate the performance of HeuriSPAI by comparing it with several popular preconditioning algorithms: CSRILU0 in the CUSPARSE 11.1 library (denoted by CSRILU) (NVIDIA 2021), the incomplete SPAI preconditioning algorithm in the MAGMA 2.6.2 library (denoted by ISAI) (Anzt et al. 2018), a static SPAI preconditioning algorithm (denoted by S-VCL) and a dynamic SPAI preconditioning algorithm (denoted by D-VCL) in the ViennaCL 1.7.1 library (Rupp et al. 2016), and a recent sparse approximate inverse preconditioning algorithm (denoted by SSPAI) (He et al. 2020). We choose CSRILU because CUSPARSE is a popular open library for NVIDIA GPUs, and ILU0 is a classic incomplete factorization method that has been widely applied as a preconditioner. ISAI is chosen because MAGMA is a popular open library for GPU and multicore architectures, and the ISAI preconditioner is a recent one. S-VCL, D-VCL and SSPAI are chosen because D-VCL is the only existing dynamic sparse approximate inverse preconditioning algorithm on GPU, and S-VCL and SSPAI are among the latest static sparse approximate inverse preconditioning algorithms on GPU. GPUPBICGSTAB with CSRILU, with S-VCL/D-VCL, and with ISAI are implemented using the functions in CUBLAS and CUSPARSE, ViennaCL, and MAGMA, respectively. GPUPBICGSTAB with SSPAI is implemented based on CUBLAS and CUSPARSE. The last 12 large matrices in Table 3 are used for this test. Tables 7 and 8 show the comparison results of all algorithms on the GTX1070 and RTX3090, respectively. In each table, for each matrix and preconditioner, the first row is the execution time of the preconditioning algorithm, the second and third rows are the execution time of GPUPBICGSTAB and the number of iterations when GPUPBICGSTAB converges to the \(10^{-7}\) residual error within 10,000 iterations, and the fourth row is the total execution time of the preconditioning algorithm and GPUPBICGSTAB. If the number of iterations of GPUPBICGSTAB exceeds 10,000 for a matrix, the corresponding rows are denoted by "/", except that the third row is denoted by ">10000". If GPUPBICGSTAB encounters the error that the size of the system is too large for ISAI L, a floating point exception, or an out-of-memory error, the four rows are denoted by "N/A". The time unit is s.

From Tables 7 and 8, we observe that on the two GPUs, HeuriSPAI has smaller execution time than CSRILU for the chosen 12 large matrices except venkat01, ss, and stokes. Furthermore, the total execution time of HeuriSPAI and GPUPBICGSTAB is less than that of CSRILU and GPUPBICGSTAB for all 12 large matrices except ss. This verifies that HeuriSPAI is in general better than CSRILU for the test cases. Compared with ISAI, the total time of HeuriSPAI and GPUPBICGSTAB is less than that of ISAI and GPUPBICGSTAB, and GPUPBICGSTAB with HeuriSPAI requires fewer iterations than GPUPBICGSTAB with ISAI for the 12 large matrices except af23560. In particular, GPUPBICGSTAB with ISAI encounters the error that the size of the system is too large for ISAI L for venkat01, nv2, ss, and stokes, and cannot converge within 10,000 iterations for Zhao2, while GPUPBICGSTAB with HeuriSPAI converges to the \(10^{-7}\) residual error within 10,000 iterations. This shows that HeuriSPAI usually behaves better than ISAI for the test cases. GPUPBICGSTAB with D-VCL is not applicable for the 12 large matrices because of the out-of-memory error, while GPUPBICGSTAB with HeuriSPAI converges within 10,000 iterations; this further validates that HeuriSPAI alleviates the drawback of D-VCL. Because D-VCL always encounters the out-of-memory error for the 12 large matrices, its results are not shown in Tables 7 and 8. Compared with S-VCL, HeuriSPAI outperforms S-VCL in terms of both the number of iterations and the total time of the preconditioner and GPUPBICGSTAB. Compared with SSPAI, GPUPBICGSTAB with HeuriSPAI converges to the \(10^{-7}\) residual error within 10,000 iterations for all test cases, whereas for the five matrices af23560, Zhao2, nv2, ss, and stokes, GPUPBICGSTAB with SSPAI cannot. For venkat01, imagesensor, apache2, t2em, thermal2, atmosmodd, and G3_circuit, GPUPBICGSTAB with HeuriSPAI requires far fewer iterations than GPUPBICGSTAB with SSPAI, and although the total time of HeuriSPAI and GPUPBICGSTAB is more than that of SSPAI and GPUPBICGSTAB, the differences are slight. Therefore, we conclude that, compared with SSPAI, HeuriSPAI can in general decrease the iteration count of iterative solvers significantly, and alleviates the drawback that SSPAI does not converge for some matrices.

5 Conclusion

In this paper, we present a parallel heuristic dynamic sparse approximate inverse (SPAI) preconditioning algorithm on GPU, called HeuriSPAI. HeuriSPAI fuses the advantages of static and dynamic SPAI preconditioning algorithms, and alleviates the drawback of existing dynamic SPAI preconditioning algorithms on GPU, which can encounter the out-of-memory error for large matrices. Experimental results validate the effectiveness and high parallelism of the proposed HeuriSPAI.

In future work, we will apply the proposed HeuriSPAI to more practical problems and further improve it.