1 Introduction

Let us consider a sparse linear system

$$\begin{aligned} Ax=b \end{aligned}$$
(1)

where A is the coefficient matrix, b is the right-hand side vector and x is the solution vector of the linear system. Preconditioned Krylov subspace iterative methods are amongst the most widely used iterative methods. The effectiveness of these iterative methods relies on the use of effective preconditioning schemes that reduce the number of required iterations and in many cases ensure convergence, cf. [3, 15, 16, 18, 22–24]. Approximate inverses have been extensively used as preconditioners with iterative methods, cf. [3, 7, 8, 10, 13, 15, 16, 18]. Approximate inverses possess inherent parallelism, cf. [10, 13, 15, 16], and thus can be used effectively on parallel systems. Recently, a generic class of approximate inverses has been proposed that can handle any sparsity pattern of the coefficient matrix, cf. [14]. By redesigning the Generic Approximate Banded Inverse algorithm, cf. [14], and utilizing approximate inverse sparsity patterns derived from powers of sparsified matrices (PSMs), cf. [7, 8], the generic approximate sparse inverse (GenAspI) algorithm as well as the generic factored approximate sparse inverse (GenFAspI) algorithm were proposed, cf. [10].

Multilevel techniques have been proposed in recent years by many researchers, cf. [4, 5, 23, 24]. These methods utilize reordering schemes and techniques from domain decomposition methods, cf. [4, 5, 19, 21, 23, 24, 27]. Representative examples include the Nested Grids ILU decomposition (NGILU), cf. [4], the multilevel recursive incomplete LU factorization (MRILU), cf. [5], the algebraic recursive multilevel solver (ARMS), cf. [23], and the multilevel algebraic recursive generic approximate inverse solver (MARGAIS), cf. [10].

Herewith, parallel schemes are proposed for shared memory parallel systems, using the OpenMP environment, cf. [6]. The parallel modified generic factored approximate sparse inverse (PMGenFAspI) method is proposed, which is a parallel version of the MGenFAspI method, cf. [10], using vector units, cf. [17]. The proposed parallel multilevel solver, namely the parallel multilevel algebraic recursive generic approximate inverse solver (PMARGAIS), utilizes a modified reordering scheme based on block breadth first search (BBFS), cf. [23]. The coefficient matrix of the system is reordered such that the upper left block is a block diagonal matrix. The inversion of the block diagonal matrix is performed in parallel, utilizing the SVD method, cf. [11, 12], for each block. The modified reordering scheme (MBBFS) ensures that the dimensions of each block remain small, resulting in reduced memory requirements and balanced computational work during the inversion of the block diagonal matrix. The Schur complement is formed explicitly to be used as the coefficient matrix of the next level. This process is repeated until the linear system of the last level is small enough to be solved efficiently. The explicit preconditioned bi-conjugate gradient stabilized (EPBiCGSTAB) method, cf. [28], is parallelized for shared memory parallel systems in conjunction with AVX vector units, cf. [17], and the resulting PEPBiCGSTAB method is used to solve the reduced-order linear system of the last level. The explicit formation of the Schur complement and the exact inversion of the block diagonal matrix lead to a hybrid direct-iterative method.

The AVX units are vector units used for carrying out concurrent computations on multiple data following the SIMD model, cf. [17]. Other researchers have concentrated their efforts on facilitating efficient processing of problems that involve matrix and vector computations at the hardware level, cf. [1, 25, 26]. These efforts involve the design of specialized units, based on reversible logic synthesis, to carry out such concurrent computations efficiently, cf. [1, 25, 26].

In Sect. 2, the PMGenFAspI method is presented along with implementation details. In Sect. 3, the PMARGAIS method based on the modified reordering scheme (MBBFS) is presented, along with discussions on the performance improvements. Furthermore, implementation details for the parallel inversion of the block diagonal matrix and the computation of the Schur complement are given, and the parallel EPBiCGSTAB method based on AVX units is presented. In Sect. 4, numerical results presenting the performance and applicability of the proposed schemes are given.

2 Parallel modified generic factored approximate sparse inverse (PMGenFAspI)

The MGenFAspI matrix, cf. [10], can be computed using the following decomposition, cf. [2, 12, 20, 22]:

$$\begin{aligned} A=LU\Leftrightarrow A^{-1}=U^{-1}L^{-1}\Leftrightarrow M=GH \end{aligned}$$
(2)

where \(M=A^{-1}\), \(G=U^{-1}\) and \(H=L^{-1}\). The factors L and U are obtained from the Incomplete LU decomposition, cf. [2, 20, 22]. The sparsity patterns of the factors G and H are computed by sparsifying the triangular factors L and U using a predetermined drop tolerance (drptol). The resulting sparsified matrix is then raised to a predefined power, or level of fill (lfill). The sparsity pattern is based on powers of sparsified matrices (PSMs), cf. [7, 8]. Hence, to compute the elements of the G and H factors, the following systems have to be solved, cf. [10]:

$$\begin{aligned} \left\{ {{\begin{array}{l} {UG_{\mathrm{drptol}}^{\mathrm{lfill}} =I} \\ {LH_{\mathrm{drptol}}^{\mathrm{lfill}} =I} \\ \end{array} }} \right\} \Leftrightarrow \left\{ {{\begin{array}{l} {Ug_{:,j} =e_{:,j} } \\ {Lh_{:,j} =e_{:,j} } \\ \end{array} }} \right\} ,\quad 1\le j\le n \end{aligned}$$
(3)

where lfill is the level of fill used to compute the sparsity pattern of the approximate inverse and drptol is the threshold for retaining elements in the initial sparsity pattern of the approximate inverse, cf. [7, 8, 10], while \(g_{:,j}\) and \(h_{:,j}\) are the elements of the jth column of the triangular factors of the approximate inverse and \(e_{:,j}\) are the elements of the jth column of the identity matrix. During the computation, the elements \(g_{:,j}\) and \(h_{:,j}\) are stored in a dense vector iw to avoid column searches for elements in the sparse-format matrices G and H. The elements of the jth column of the identity matrix \(e_{:,j}\) are stored in a dense vector e. Each nonzero element of the H and G factors of the MGenFAspI method can be computed by the following equations:

$$\begin{aligned} H(k,j)= & {} \frac{I(k,j)-L(k,1:k-1)*H(1:k-1,j)^{T}}{L(k,k)},\quad k=j,\ldots ,n,\ (k,j)\in H_{\mathrm{drptol}}^{\mathrm{lfill}} \end{aligned}$$
(4)
$$\begin{aligned} G(k,j)= & {} \frac{I(k,j)-U(k,k+1:n)^{T}*G(k+1:n,j)}{U(k,k)},\quad k=j,\ldots ,1,\ (k,j)\in G_{\mathrm{drptol}}^{\mathrm{lfill}} . \end{aligned}$$
(5)

where I is the identity matrix.

The parallel computation of the MGenFAspI process can be performed efficiently because each column of the factors G and H is computed independently of the other columns. Each processor is responsible for computing a group of columns of the approximate inverse without any communication. The PMGenFAspI method has been further modified to utilize AVX units, accelerating the computation of the involved inner products. Initially, the values of a register are set to zero. Then, the values of the involved vectors residing in memory are transferred in groups of four values to two registers. The products are computed concurrently and the respective results are accumulated in the register retaining the partial sums. The procedure is repeated until all the elements have been accumulated, resulting in four partial sums. The inner product is obtained by adding the four partial sums. In case the number of nonzero elements is not a multiple of four, the remaining elements are accumulated separately. The parallel modified generic factored approximate sparse inverse scheme in conjunction with AVX units is described by the following algorithmic scheme:

figure a

where fmadd(xr3, xr1, xr2) is the fused multiply add operation \({xr3}={xr3}+{xr1}*{xr2}\), where xr1, xr2 and xr3 are vectors consisting of four double-precision floating point numbers. The iw vector is a work vector used to temporarily store the elements of the column of each factor of the approximate inverse currently being computed, while the vector e holds the corresponding column of the identity matrix.
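As an illustration of the AVX-accelerated inner product described above, a minimal sketch using Intel AVX/FMA intrinsics is given below; the function name and interface are illustrative, not those of the actual implementation.

```cpp
#include <immintrin.h>
#include <cstddef>

// Inner product of two double-precision arrays using 256-bit registers and
// fused multiply-add; the remainder (n mod 4) is accumulated separately.
double dot_avx(const double* x, const double* y, std::size_t n) {
    __m256d acc = _mm256_setzero_pd();          // four partial sums set to zero
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d xr1 = _mm256_loadu_pd(x + i);   // load four values of x
        __m256d xr2 = _mm256_loadu_pd(y + i);   // load four values of y
        acc = _mm256_fmadd_pd(xr1, xr2, acc);   // acc = acc + xr1 * xr2
    }
    double partial[4];
    _mm256_storeu_pd(partial, acc);
    double sum = partial[0] + partial[1] + partial[2] + partial[3];
    for (; i < n; ++i)                          // remaining elements
        sum += x[i] * y[i];
    return sum;
}
```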

It should be noted that the length of the vectors iw and e is multiplied by the number of processors used, so that each processor uses a different part of the vectors. The elements of the vectors are determined as follows:

$$\begin{aligned}&\mathbf{iw}\left( \mathbf{cur\_tid}*\mathbf{n}+\mathbf{k} \right) :\ iw(k)\hbox { of processor cur\_tid} \end{aligned}$$
(6)
$$\begin{aligned}&\mathbf{e}\left( \mathbf{cur\_tid}*\mathbf{n}+\mathbf{k} \right) :\ e(k)\hbox { of processor cur\_tid}. \end{aligned}$$
(7)

where n is the number of rows of the factors G and H, \(\mathrm{cur\_tid}=0,\ldots ,\mathrm{nprocs}-1\) is the id number of each processor and nprocs is the total number of processors.
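The per-thread partitioning of the work vectors in Eqs. (6)-(7) can be sketched as follows with OpenMP; the loop body is a placeholder for the column solves of Eq. (3) and the names are illustrative only.

```cpp
#include <omp.h>
#include <vector>

// Each thread owns the slice [cur_tid*n, (cur_tid+1)*n) of the shared work
// vectors iw and e, so the column computations need no synchronization.
void compute_columns(int n /* order of the factors G and H */) {
    const int nprocs = omp_get_max_threads();
    std::vector<double> iw(static_cast<std::size_t>(n) * nprocs, 0.0);
    std::vector<double> e (static_cast<std::size_t>(n) * nprocs, 0.0);

    #pragma omp parallel for schedule(static)
    for (int j = 0; j < n; ++j) {
        const int cur_tid = omp_get_thread_num();
        double* iw_local = iw.data() + static_cast<std::size_t>(cur_tid) * n;
        double* e_local  = e.data()  + static_cast<std::size_t>(cur_tid) * n;
        e_local[j] = 1.0;   // jth column of the identity matrix
        // ... solve U g_{:,j} = e_{:,j} and L h_{:,j} = e_{:,j} into iw_local ...
        e_local[j] = 0.0;   // reset before this thread handles its next column
        (void)iw_local;
    }
}
```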

3 Parallel multilevel algebraic recursive generic approximate inverse solver (PMARGAIS)

3.1 Modified block breadth first search (MBBFS)

Independent Sets are a crucial component of multilevel methods. An Independent Set is composed of unknowns that are decoupled from each other and can be handled independently or simultaneously without affecting other unknowns of the linear system, cf. [23, 24]. Recently, a reordering scheme was introduced for the algebraic recursive multilevel solver (ARMS), cf. [23], based on the Breadth First Search algorithm, utilizing a threshold parameter that prevents unknowns with low relative diagonal dominance in their respective rows from joining the independent set, cf. [23]. Block Independent Sets are a generalization of Independent Sets, in which groups of unknowns are decoupled. The unknowns belonging to the Block Independent Sets are numbered first and the interface unknowns are numbered last. Hence, the upper left block of the permuted coefficient matrix has a block diagonal structure.

The modified block breadth first search (MBBFS) algorithm reduces the memory requirements of the block diagonal inverse by restricting the dimension of the (dense) blocks to a predefined small number. In the BBFS algorithm, the grouping of nodes into an independent set stops only after a whole level of neighbors has been added and the block size has exceeded the predetermined block size. This results in blocks of much larger size than the predetermined one, especially for matrices with a large number of nonzero elements per row. In contrast, the MBBFS scheme stops the insertion of unvisited nodes into a block independent set as soon as the predetermined block size is reached. This results in blocks of size equal to the predetermined one, or smaller when there are not enough neighboring nodes, thus imposing an upper limit on the memory requirements for computing the inverse of the block diagonal matrix. The MBBFS scheme also improves the balance of computational work among the CPUs, since the dimensions of the blocks are almost the same.

It should be stated that the algorithm also returns a vector s storing the starting and ending point of each block, to handle the blocks independently:

$$\begin{aligned} s(\mathrm{block})=\sum _{j=1}^{\mathrm{block}} \mathrm{block\_size}(j),\quad \mathrm{block}=1,\ldots ,\mathrm{number\_of\_blocks} \end{aligned}$$
(8)

where \(\mathrm{block}\_\mathrm{size}(j)\) is the size of each block and \(s(0)=0\). Utilizing the vector s, the following information can be obtained:

  • s(i-1): the row and the column of the block diagonal matrix where the first row and column of the ith block are located.

  • s(i)-1: the row and the column of the block diagonal matrix where the last row and column of the ith block are located.

  • s(i)-s(i-1): the order of the ith block.
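As a concrete illustration, the vector s can be built as a running sum of the block sizes, Eq. (8); this is a minimal sketch with illustrative container names.

```cpp
#include <vector>

// Build s as the running sum of the block sizes, Eq. (8), with s(0) = 0.
// Block i then occupies rows/columns s[i-1] .. s[i]-1 of the block diagonal
// matrix and has order s[i] - s[i-1].
std::vector<int> build_block_offsets(const std::vector<int>& block_size) {
    std::vector<int> s(block_size.size() + 1, 0);
    for (std::size_t j = 0; j < block_size.size(); ++j)
        s[j + 1] = s[j] + block_size[j];
    return s;
}
```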

The modified block breadth first search algorithm is described by the following algorithmic scheme:

figure b

Append the excluded nodes at the end of R

It should be mentioned that in case the number of vertices grouped in a block is smaller than the prescribed value (bsize) and no more neighboring vertices exist, then the block is retained as is.
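The size-capped grouping that distinguishes MBBFS from BBFS can be sketched as follows; this is a simplified illustration in which the adjacency representation is assumed and the diagonal-dominance threshold test of [23] is omitted, so it is not the exact algorithm of figure b.

```cpp
#include <queue>
#include <vector>

// Group the vertices of the adjacency graph adj into block independent sets
// of at most bsize vertices: a breadth first search that stops inserting
// unvisited neighbours as soon as the current block would exceed bsize.
void mbbfs(const std::vector<std::vector<int>>& adj, int bsize,
           std::vector<int>& order, std::vector<int>& block_size) {
    const int n = static_cast<int>(adj.size());
    std::vector<char> visited(n, 0);
    for (int start = 0; start < n; ++start) {
        if (visited[start]) continue;
        std::queue<int> q;
        q.push(start);
        visited[start] = 1;
        int count = 0;                                   // vertices placed in this block
        while (!q.empty()) {
            const int v = q.front(); q.pop();
            order.push_back(v);
            ++count;
            for (int w : adj[v])
                if (!visited[w] && count + static_cast<int>(q.size()) < bsize) {
                    visited[w] = 1;                      // reserve a slot in this block
                    q.push(w);
                }
        }
        block_size.push_back(count);                     // count <= bsize by construction
    }
}
```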

3.2 Parallel multilevel approximate inverse solver

Let us consider the reordering matrix P computed with the MBBFS algorithm. The reordered block matrix K is as follows, cf. [10]:

$$\begin{aligned} K=PAP^{T}=\left[ {{\begin{array}{ll} B &{} C \\ D &{} E \\ \end{array} }} \right] =\left[ {{\begin{array}{ll} I &{} 0 \\ {DB^{-1}} &{} I \\ \end{array} }} \right] \left[ {{\begin{array}{ll} B &{} 0 \\ 0 &{} S \\ \end{array} }} \right] \left[ {{\begin{array}{ll} I &{} {B^{-1}C} \\ 0 &{} I \\ \end{array} }} \right] \end{aligned}$$
(9)

where \(S=E-DB^{-1}C\) is the Schur complement of the block matrix K and B is a block diagonal matrix. The inverse of K can be computed as follows:

$$\begin{aligned} K^{-1}=\left[ {{\begin{array}{ll} B &{} C \\ D &{} E \\ \end{array} }} \right] ^{-1}=\left[ {{\begin{array}{ll} I &{} {-B^{-1}C} \\ 0 &{} I \\ \end{array} }} \right] \left[ {{\begin{array}{ll} {B^{-1}} &{} 0 \\ 0 &{} {S^{-1}} \\ \end{array} }} \right] \left[ {{\begin{array}{ll} I &{} 0 \\ {-DB^{-1}} &{} I \\ \end{array} }} \right] . \end{aligned}$$
(10)

The inverse of the block matrix can be computed by inverting the matrices B and S. Matrix B is a block diagonal matrix, thus its inverse \(B^{-1}\) is also block diagonal. The order of the blocks is small, and the exact inverse of each block can be computed separately using the SVD method; the matrix \(B^{-1}\) computed with this technique is therefore the exact inverse of matrix B. The inverse of each block can be computed independently, thus each processor is responsible for inverting a different group of blocks. The number of nonzero elements is small compared to the order of matrix B, thus the inversion of the blocks does not require substantial computational work or memory.

The inverse of the Schur complement is less expensive to compute, due to its reduced order compared to the matrix K, and can be obtained approximately with the PMGenFAspI method. This process leads to the computation of a two-level approximate inverse. The two-level process could be applied recursively to invert the resulting Schur complement, leading to a multilevel scheme. The scheme is applied until the Schur complement is sufficiently small to be inverted efficiently, or until no more independent sets exist. The last Schur complement can be approximately inverted using the PMGenFAspI method.

In practice, it is inefficient to compute an approximate inverse explicitly using this multilevel technique, because the matrices involved in the operations tend to have an increasing number of nonzero elements, cf. [10]. Instead, the linear system (1) can be solved in block form.

Fig. 1
figure 1

Multilevel solution of a linear system through recursive solution of continuously smaller Schur complement linear systems along with the linear systems corresponding to the block independent sets

The equivalent expression for the reordered system is of the following form:

$$\begin{aligned} \left( PAP^{T}\right) Px=Pb\Leftrightarrow Kx^{\prime }=b^{\prime }\Leftrightarrow x^{\prime }=K^{-1}b^{\prime } \end{aligned}$$
(11)

where \(K=PAP^{\mathrm{T}}\), \(x{^\prime }=Px\) and \(b{^\prime }=Pb\). Using Eq. (10) we derive the following:

$$\begin{aligned} \left[ {{\begin{array}{l} {x_i } \\ {x_r } \\ \end{array} }} \right] =K^{-1}\left[ {{\begin{array}{l} {b_i } \\ {b_r } \\ \end{array} }} \right] =\left[ {{\begin{array}{ll} I &{} {-B^{-1}C} \\ 0 &{} I \\ \end{array} }} \right] \left[ {{\begin{array}{ll} {B^{-1}} &{} 0 \\ 0 &{} {S^{-1}} \\ \end{array} }} \right] \left[ {{\begin{array}{ll} I &{} 0 \\ {-DB^{-1}} &{} I \\ \end{array} }} \right] \left[ {{\begin{array}{l} {b_i } \\ {b_r } \\ \end{array} }} \right] \end{aligned}$$
(12)

where the subscript i denotes the solution and the right-hand side corresponding to the nodes associated with the independent sets and the subscript r denotes the solution and the right-hand side corresponding to the rest of the nodes. The equivalent expression of (12) is as follows:

$$\begin{aligned} \left[ {{\begin{array}{l} {x_i } \\ {x_r } \\ \end{array} }} \right] =\left[ {{\begin{array}{l} {B^{-1}b_i -B^{-1}CS^{-1}(-DB^{-1}b_i +b_r )} \\ {S^{-1}(-DB^{-1}b_i +b_r )} \\ \end{array} }} \right] \end{aligned}$$
(13)

or equivalently

$$\begin{aligned} x_i= & {} B^{-1}b_i -B^{-1}Cx_r , \end{aligned}$$
(14)
$$\begin{aligned} x_r= & {} S^{-1}(-DB^{-1}b_i +b_r ). \end{aligned}$$
(15)

The solution vector \(x_{i}\) can be computed directly, since \(B^{-1}\) is known explicitly and \(x_{r}\) is computed by (15). Hence, \(x_{i}\) is computed with three matrix-vector multiplications and one vector subtraction. These computations are executed in parallel using AVX units. Nevertheless, the solution vector \(x_{r}\) is computed by solving the linear system:

$$\begin{aligned} Sx_r =(-DB^{-1}b_i +b_r ). \end{aligned}$$
(16)

This linear system can be solved with the PEPBiCGSTAB method in conjunction with the PMGenFAspI method. This process leads to a two-level hybrid direct-iterative scheme for solving linear systems. The two-level process can be applied recursively to the linear system (16), leading to a multilevel scheme for the computation of the solution vector x. The multilevel scheme is depicted in Fig. 1. The solution vector \(x_{i}\) of the last level is computed using the vector \(x_{r}\). The vector \(x=P^{T}[x_{i}^{T}\ x_{r}^{T}]^{T}\) is returned as the solution vector \(x_{r}\) of the previous level.
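A dense illustration of the two-level solve of Eqs. (14)-(16) is sketched below; dense matrices and a generic Schur solver callback are used purely for clarity, whereas the actual implementation operates on sparse structures and applies the scheme recursively.

```cpp
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<std::vector<double>>;   // dense, row-major (illustration only)

static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

// Two-level solve: Binv is the exact block diagonal inverse and solveS stands
// for the (recursive or PEPBiCGSTAB) solve with S = E - D Binv C, Eq. (16).
template <class SchurSolver>
void two_level_solve(const Mat& Binv, const Mat& C, const Mat& D,
                     const Vec& b_i, const Vec& b_r,
                     SchurSolver solveS, Vec& x_i, Vec& x_r) {
    const Vec Binv_bi = matvec(Binv, b_i);       // B^{-1} b_i
    Vec rhs = b_r;                               // b_r - D B^{-1} b_i
    const Vec DBb = matvec(D, Binv_bi);
    for (std::size_t k = 0; k < rhs.size(); ++k) rhs[k] -= DBb[k];
    x_r = solveS(rhs);                           // x_r = S^{-1}(-D B^{-1} b_i + b_r)
    const Vec Cxr = matvec(C, x_r);
    const Vec BinvCxr = matvec(Binv, Cxr);       // B^{-1} C x_r
    x_i = Binv_bi;
    for (std::size_t k = 0; k < x_i.size(); ++k) x_i[k] -= BinvCxr[k];  // Eq. (14)
}
```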

In the case where the singular values of B are close to machine precision, the method might converge in more than a single iteration, due to the rounding errors. Moreover, singular values close to the machine precision are set explicitly to zero. This is also true in the case where the prescribed tolerance is close to machine precision. In such cases, the multilevel process is used as a preconditioner to the Richardson iterative method,

$$\begin{aligned} x_{k+1} =x_k +\mathrm{PMARGAIS}(A,r_k ) \end{aligned}$$
(17)

where \(r_{k}=b-Ax_{k}\), \(k=0,1,2,{\ldots }\) is the residual vector. The scheme is repeated until the solution of the linear system is obtained to the prescribed tolerance. The multilevel scheme is a hybrid direct-iterative method.
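A minimal sketch of the Richardson scheme of Eq. (17) is given below; apply_A and pmargais are assumed callbacks standing in for the matrix-vector product and the multilevel solver described above.

```cpp
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Richardson iteration x_{k+1} = x_k + PMARGAIS(A, r_k), r_k = b - A x_k,
// stopped on the relative residual norm (zero initial guess assumed).
Vec richardson(const std::function<Vec(const Vec&)>& apply_A,
               const std::function<Vec(const Vec&)>& pmargais,
               const Vec& b, double tol, int maxit) {
    auto nrm = [](const Vec& v) {
        double s = 0.0;
        for (double vi : v) s += vi * vi;
        return std::sqrt(s);
    };
    Vec x(b.size(), 0.0);
    const double r0 = nrm(b);                   // residual norm of the zero guess
    for (int k = 0; k < maxit; ++k) {
        const Vec Ax = apply_A(x);
        Vec r(b.size());
        for (std::size_t i = 0; i < b.size(); ++i) r[i] = b[i] - Ax[i];
        if (nrm(r) < tol * r0) break;           // prescribed tolerance reached
        const Vec dx = pmargais(r);             // multilevel scheme as preconditioner
        for (std::size_t i = 0; i < x.size(); ++i) x[i] += dx[i];
    }
    return x;
}
```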

The parallel multilevel algebraic recursive generic approximate inverse solver (PMARGAIS) can then be described by the following algorithmic scheme:

figure c

The PEPBiCGSTAB method, cf. [13, 28], using AVX units, is presented in the “Appendix”.

It should be noted that the parallel computations of the method use the cache blocking technique. Thus, each computation is tiled into segments such that eight double-precision floating point numbers are transferred to the cache memory, since the L1 cache line of the systems used is 512 bits (64 bytes).

3.3 Parallel inversion of block diagonal matrix

For the exact inversion of the block diagonal matrix B, each block of the matrix is inverted using the SVD method. The dimensions of each block are kept small, thus the computational work and memory required for the inversion remain low. Using the SVD method, each block to be inverted is decomposed as follows, cf. [11, 12]:

$$\begin{aligned} {B}_{{\mathrm{sub}}} ={U}\Sigma {V}^{{T}} \end{aligned}$$
(18)

where U and V are orthogonal matrices, thus their inverses are equal to their transposes, and \(\Sigma \) is a diagonal matrix, cf. [11, 12]. Hence, the inverse of the block \(B_{\mathrm{sub}}\) can be computed as follows:

$$\begin{aligned} \left( {{B}_{\mathrm{sub}} } \right) ^{-1}={V}\Sigma ^{-1}{U}^{{T}}. \end{aligned}$$
(19)

It should be stated that in case the dimension of a block is equal to one, the inversion is trivial and does not require the SVD method. After the inverse of each block has been computed, the inverses are stored in \(B^{-1}\), assembling the complete inverse matrix. The mapping from local to global positions in the inverse is realized through the vector s, which is obtained by the MBBFS method. The exact inverse of B is then computed by the following compact algorithmic scheme:

figure d
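A sketch of assembling a block inverse from its SVD, Eq. (19), is given below; the svd() routine is assumed to be supplied by a library, and singular values near machine precision are treated as zero as discussed above.

```cpp
#include <limits>
#include <vector>

using Mat = std::vector<std::vector<double>>;   // small dense block (illustration)

// Assumed helper returning U, sigma, V with B_sub = U * diag(sigma) * V^T;
// any SVD routine can be substituted here.
void svd(const Mat& B_sub, Mat& U, std::vector<double>& sigma, Mat& V);

// (B_sub)^{-1} = V * diag(1/sigma) * U^T, Eq. (19); tiny singular values are
// skipped, i.e. their reciprocals are set to zero.
Mat invert_block(const Mat& B_sub) {
    const std::size_t m = B_sub.size();
    Mat U, V;
    std::vector<double> sigma;
    svd(B_sub, U, sigma, V);
    const double eps = std::numeric_limits<double>::epsilon();
    Mat inv(m, std::vector<double>(m, 0.0));
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < m; ++j)
            for (std::size_t k = 0; k < m; ++k)
                if (sigma[k] > eps)
                    inv[i][j] += V[i][k] * (1.0 / sigma[k]) * U[j][k];
    return inv;
}
```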

The blocks of matrix B have arbitrary form, thus their inverses are generally dense. To form the matrix \(B^{-1}\), the memory requirements have to be computed a priori. The vector ss retains the memory requirements for each block and its elements are computed as follows:

$$\begin{aligned} \mathrm{ss}(\mathrm{block})=\sum _{j=1}^{\mathrm{block}} [s(j)-s(j-1)]^{2},\quad \mathrm{block}=1,\ldots ,\mathrm{number\_of\_blocks} \end{aligned}$$
(20)

where ss(0) \(=\) 0. The values of the vector ss denote the cumulative number of nonzero elements stored in the inverse matrix \(B^{-1}\) up to and including each block. The block matrix \(B^{-1}\) is stored in the compressed sparse row (three vector variant) storage format using the values of the vector ss, the values of the inverses of the blocks, as well as the column indices computed from the s vector, as depicted in Fig. 2.
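The assembly of \(B^{-1}\) in the three-vector CSR format can be sketched as follows; the dense per-block inverses are assumed to be available from the SVD step, s follows Eq. (8), and the cumulative counts of Eq. (20) correspond to the resulting row pointer offsets.

```cpp
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Three-vector CSR layout: row pointers, column indices and values.
struct CSR { std::vector<int> row_ptr, col_ind; std::vector<double> val; };

// Block blk occupies rows s[blk] .. s[blk+1]-1 and contributes a dense
// (s[blk+1]-s[blk])^2 square of entries to the block diagonal inverse.
CSR assemble_block_inverse(const std::vector<Mat>& block_inv,
                           const std::vector<int>& s) {
    CSR Binv;
    const int n = s.back();
    Binv.row_ptr.assign(n + 1, 0);
    for (std::size_t blk = 0; blk + 1 < s.size(); ++blk) {
        const int first = s[blk], last = s[blk + 1];
        for (int i = first; i < last; ++i) {
            Binv.row_ptr[i + 1] = Binv.row_ptr[i] + (last - first);
            for (int j = first; j < last; ++j) {
                Binv.col_ind.push_back(j);           // global column index
                Binv.val.push_back(block_inv[blk][i - first][j - first]);
            }
        }
    }
    return Binv;
}
```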

The inverse of each block of the matrix B is computed as the triple product \(V\Sigma ^{-1}U^{T}\) of Eq. (19), where U, \(\Sigma \) and V are obtained from the SVD decomposition. The triple matrix multiplication involves dense computations that can be parallelized with AVX units. Parallelization with AVX units is fine grained, thus the inner summation loop is vectorized. The dense matrix multiplication with AVX units can be described by the following algorithmic procedure:

Fig. 2
figure 2

Compressed sparse row storage format (three vector variant) of \(B^{-1}\)

figure e

where fmadd(xr3, xr1, xr2) is the fused multiply add operation \({xr3}={xr3}+{xr1}*{xr2}\), where xr1, xr2 and xr3 are vectors consisting of four doubles.
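A sketch of the vectorized inner kernel of the dense product is given below; the loop structure is illustrative of the fused multiply-add accumulation and not the exact algorithm of figure e.

```cpp
#include <immintrin.h>
#include <cstddef>

// C = A * B for small dense row-major m x m blocks; the summation over k is
// carried out four doubles at a time with fmadd, the remainder separately.
void dense_matmul_avx(const double* A, const double* B, double* C, std::size_t m) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            __m256d acc = _mm256_setzero_pd();
            double bcol[4];
            std::size_t k = 0;
            for (; k + 4 <= m; k += 4) {
                __m256d xr1 = _mm256_loadu_pd(A + i * m + k);             // A(i, k..k+3)
                for (int t = 0; t < 4; ++t) bcol[t] = B[(k + t) * m + j]; // B(k..k+3, j)
                __m256d xr2 = _mm256_loadu_pd(bcol);
                acc = _mm256_fmadd_pd(xr1, xr2, acc);                     // acc += xr1 * xr2
            }
            double partial[4];
            _mm256_storeu_pd(partial, acc);
            double sum = partial[0] + partial[1] + partial[2] + partial[3];
            for (; k < m; ++k)
                sum += A[i * m + k] * B[k * m + j];
            C[i * m + j] = sum;
        }
}
```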

The Schur complement is computed by the following equation:

$$\begin{aligned} S=E-DB^{-1}C. \end{aligned}$$
(21)

The computation of the Schur complement requires sparse matrix operations, which are computationally intensive. The required sparse matrix multiplications can be parallelized efficiently, since each processor computes its assigned rows independently. Sparse matrix multiplication is a two-step operation: first, the possible number of nonzero elements is evaluated, and then the values of the resulting nonzero elements are computed. It should be mentioned that the computation of each row is performed using a dense work vector (w) and a list, which stores the nonzero values as well as their positions. In the case of multiple processors, the dimension of this vector is computed by multiplying the number of columns of the first sparse matrix with the number of processors. Hence, each processor uses a different part of the vector.
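The row-by-row numeric step of such a sparse product can be sketched as follows; the marker array is an implementation detail of the sketch, and the symbolic step that sizes the result as well as the per-processor offsetting of the work vector are omitted.

```cpp
#include <vector>

struct CSR { std::vector<int> row_ptr, col_ind; std::vector<double> val; };

// Numeric pass for row i of C = A * B in CSR format: the dense work vector w
// accumulates the values and cols records the touched positions; marker must
// be initialized to -1 and have as many entries as w.
void spgemm_row(const CSR& A, const CSR& B, int i,
                std::vector<double>& w, std::vector<int>& marker,
                std::vector<int>& cols) {
    cols.clear();
    for (int pa = A.row_ptr[i]; pa < A.row_ptr[i + 1]; ++pa) {
        const int k = A.col_ind[pa];
        const double a_ik = A.val[pa];
        for (int pb = B.row_ptr[k]; pb < B.row_ptr[k + 1]; ++pb) {
            const int j = B.col_ind[pb];
            if (marker[j] != i) {                 // first contribution to column j
                marker[j] = i;
                w[j] = 0.0;
                cols.push_back(j);
            }
            w[j] += a_ik * B.val[pb];
        }
    }
    // The pairs (j, w[j]) for j in cols form row i of the result.
}
```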

4 Numerical results

In this section, the performance of the parallel multilevel algebraic recursive generic approximate inverse solver (PMARGAIS) is examined for solving various problems. The execution time is given in ss.hh (seconds.hundredths) and the overall gain corresponds to the time of the serial execution divided by the time of the serial or parallel execution that includes the use of AVX units. The problems ATMOSMODD, tmt_unsym and cage13 were obtained from the University of Florida Sparse Matrix Collection, cf. [9]. The Poisson problem in two and three dimensions can be described by the following PDE:

$$\begin{aligned} -\Delta u=f, (x_{1},x_{2},{\ldots },x_{d})\in \Omega =[0,1]^{d} \end{aligned}$$
(22)
$$\begin{aligned} u=0, (x_{1},x_{2},\ldots ,x_{d})\in \partial \Omega \end{aligned}$$
(22.a)

where \(\partial \Omega \) denotes the boundary of \(\Omega \) and d denotes the number of space variables. The above PDE is discretized with the five-point stencil finite difference method in two space variables and with the seven-point stencil finite difference method in three space variables. The right-hand side of the linear system derived from PDE (22)-(22.a) was computed as the product of the coefficient matrix A with the solution vector, whose components are all equal to unity.
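For reference, a minimal sketch of assembling the 2D five-point stencil in CSR form and forming the right-hand side as the product of A with the all-ones vector is given below; the grid ordering and scaling are illustrative assumptions.

```cpp
#include <vector>

struct CSR { std::vector<int> row_ptr, col_ind; std::vector<double> val; };

// Five-point finite difference Laplacian on an m x m interior grid of the unit
// square with homogeneous Dirichlet boundary conditions (row-wise ordering).
CSR poisson2d(int m) {
    CSR A;
    A.row_ptr.push_back(0);
    auto idx = [m](int i, int j) { return i * m + j; };
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < m; ++j) {
            if (i > 0)     { A.col_ind.push_back(idx(i - 1, j)); A.val.push_back(-1.0); }
            if (j > 0)     { A.col_ind.push_back(idx(i, j - 1)); A.val.push_back(-1.0); }
            A.col_ind.push_back(idx(i, j)); A.val.push_back(4.0);
            if (j < m - 1) { A.col_ind.push_back(idx(i, j + 1)); A.val.push_back(-1.0); }
            if (i < m - 1) { A.col_ind.push_back(idx(i + 1, j)); A.val.push_back(-1.0); }
            A.row_ptr.push_back(static_cast<int>(A.col_ind.size()));
        }
    return A;
}

// Right-hand side b = A * e, with e the vector of all ones, as used in the tests.
std::vector<double> rhs_of_ones(const CSR& A) {
    std::vector<double> b(A.row_ptr.size() - 1, 0.0);
    for (std::size_t i = 0; i + 1 < A.row_ptr.size(); ++i)
        for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p)
            b[i] += A.val[p];
    return b;
}
```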

4.1 Parallel modified generic factored approximate sparse inverse (PMGenFAspI) using AVX units

The performance, the speedup and the overall gain of the PMGenFAspI method parallelized for shared memory parallel systems with AVX units is examined for the 3D Poisson problem. The experimental results were obtained using an Intel Core-i7 4700MQ 2.4 GHz with 8 GB of RAM memory, running Windows 10 Pro.

It should be noted that the software was developed in C++ without the use of scientific libraries. The software was compiled using the Visual Studio 2010 with OpenMP v2.0 and the maximum speed optimizations flag (/O2). The use of AVX units was realized through software libraries offered by the corresponding development environment.

Table 1 Performance and speedups of PMGenFAspI algorithm, using AVX units, for the 3D Poisson problem for various number of threads and values of the lfill parameter
Table 2 Overall gain of PMGenFAspI algorithm, using AVX units, for the 3D Poisson problem for various number of threads and values of the lfill parameter

In Table 1, the performance and speedup of the PMGenFAspI algorithm, using AVX units, for the 3D Poisson problem of order \(n=1{,}000{,}000\) for various number of threads and values of the lfill parameter are presented. In Table 2, the overall gain of the PMGenFAspI algorithm, using AVX units, for the 3D Poisson problem of order \(n=1{,}000{,}000\) for various number of threads and values of the lfill parameter is given.

It should be noted that the speedup of the PMGenFAspI method tends to the theoretical maximum as the number of threads is increased. The performance is increased using AVX units.

4.2 Parallel multilevel algebraic recursive generic approximate inverse solver

The performance of PMARGAIS is examined for solving various problems. Experimental results obtained from various systems and for various values of the parameters of the method are presented, to assess the behavior of the scheme. It should be stated that the following parameters were used for all the executions: dtol \(=\) 0.0, lfill \(=\) 2, droptol \(=\) 0.1, ILUfill \(=\) 10 and ILUtol \(=\) 0.001, cf. [10]. The stopping criterion for the PMARGAIS method was \(\left\| {r_k } \right\| <10^{-10}\left\| {r_0 } \right\| \), where \(r_{k}\) is the residual vector at the kth iteration. The stopping criterion for the PEPBiCGSTAB method used in the last level was \(\left\| {r_k } \right\| <10^{-12}\left\| {r_0 } \right\| \).

Table 3 Performance and speedups of PMARGAIS method for the 2D Poisson problem for various values of the order (n), number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme
Fig. 3
figure 3

Escalation of PMARGAIS method for the 2D Poisson problem for various values of the order (n) and number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme

Table 4 Performance and speedups of PMARGAIS method for the 3D Poisson problem for various values of the order (n) and number of threads with Bsize \(=\) 1, levels \(=\) 2 and the MBBFS reordering scheme
Fig. 4
figure 4

Escalation of PMARGAIS method for the 3D Poisson problem for various values of the order (n) and number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme

Table 5 Performance and speedups of PMARGAIS method for various problems, values of the Block size, number of threads and reordering schemes
Table 6 Performance and speedups of PMARGAIS method for various problems, values of the Block size, number of levels, number of threads and reordering schemes

System 1 Numerical results were obtained using an AMD Phenom(tm) II X4 955 Processor 3.20 GHZ with 4 GB of RAM memory, running Ubuntu 12.04 LTS. It should be noted that the software was developed in C++ without the use of scientific libraries. The software was compiled with g++ 4.6.3 with OpenMP v3.0 and the maximum optimizations flag (\(-\)O3).

In Table 3, the performance and speedup of PMARGAIS method for the 2D Poisson problem for various values of the order (n) and number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme are given. In Fig. 3, the escalation of PMARGAIS method for the 2D Poisson problem for various values of the order (n) and the number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme is depicted. In Table 4, the performance and speedups of PMARGAIS method for the 3D Poisson problem for various values of the order (n) and number of threads with Bsize \(=\) 1, levels \(=\) 2 and the MBBFS reordering scheme are presented. In Fig. 4, the escalation of PMARGAIS method for the 3D Poisson problem for various values of the order (n) and the number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme is depicted. In Table 5, the performance and speedups of PMARGAIS method for various problems, values of the Block size, number of threads and reordering schemes are presented.

It can be easily seen that the speedups, presented in Tables 3 and 4, do not increase uniformly as the order (n) increases, because the number of block independent sets is not a multiple of the number of processors. Thus, the computational work is not balanced among the processors, and the execution time of the parallel scheme is governed by the processor that is assigned the largest number of independent sets. The imbalance affects parallel performance significantly, especially when the block size is large and the number of independent sets is not a multiple of the number of processors.

System 2 Numerical results were obtained using an Intel Core-i7 4700MQ 2.4 GHz with 8 GB of RAM memory, running Windows 10 Pro.

Table 7 Performance and speedup of PMARGAIS method, using AVX units, for various problems, values of the Block size and number of threads with levels \(=\) 2 and MBBFS reordering scheme
Table 8 Overall gain of PMARGAIS method, using AVX units, for various problems, values of the Block size and number of threads, with levels \(=\) 2 and the MBBFS reordering scheme

It should be noted that the software was developed in C++ without the use of scientific libraries. The software was compiled using the Visual Studio 2010 with OpenMP v2.0 and the maximum speed optimizations flag (/O2). The use of AVX units was realized through software libraries offered by the corresponding development environment.

In Table 6, the performance and speedups of the PMARGAIS method for various problems, values of the Block size, number of levels, number of threads and reordering schemes are given. In Table 7, the performance and speedup of the PMARGAIS method, using AVX units, for various problems, values of the Block size and number of threads with levels \(=\) 2 and the MBBFS reordering scheme are presented. In Table 8, the overall gain of the PMARGAIS method, using AVX units, for various problems, values of the Block size and number of threads with levels \(=\) 2 and the MBBFS reordering scheme is given.

It should be mentioned that the use of the modified reordering scheme (MBBFS) instead of the BBFS scheme ensures load balancing and better control over the size of each block, especially for problems where the size of the blocks increases rapidly, such as matrices whose corresponding graph contains many vertices of large degree. Hence, the MBBFS method reduces the memory requirements for storing the block diagonal inverse, imposing an upper limit on the memory requirements of the method.

It should also be noted that the results can be further improved by modifying the MBBFS method to produce a number of blocks that is a multiple of the number of processing units.

5 Conclusion

The proposed parallel multilevel algebraic recursive generic approximate inverse solver (PMARGAIS) is efficient for solving a class of problems resulting in large sparse linear systems. The hybrid direct-iterative PMARGAIS method involves dense computations that can be parallelized efficiently using AVX units. Dense computations are parallelized more efficiently than sparse computations, leading to increased performance and better scalability. Moreover, the proposed scheme is based on the modified reordering scheme (MBBFS), which retains balanced computational work, thus enhancing performance; the corresponding speedups are close to the theoretical maximum. The PMARGAIS method is based on the PEPBiCGSTAB method in conjunction with the PMGenFAspI matrix, using AVX units. The use of AVX units in conjunction with multicore systems for computing the PMGenFAspI matrix results in increased speedups that surpass the number of available processors. Moreover, the PEPBiCGSTAB method has been parallelized for multicore systems with AVX units, resulting in improved performance. Future work will focus on the implementation of the method on hybrid distributed memory systems.