1 Introduction

Let us consider a sparse linear system

$$\begin{aligned} Ax=b \end{aligned}$$
(1)

where A is the coefficient matrix, b is the right-hand side vector and x is the solution vector of the linear system. Preconditioned Krylov subspace iterative methods are amongst the most widely used iterative methods. The effectiveness of these iterative methods relies on the use of effective preconditioning schemes that reduce the number of required iterations and in many cases ensure convergence, cf. [3, 15, 16, 18, 22–24]. Approximate inverses have been extensively used as preconditioners with iterative methods, cf. [3, 7, 8, 10, 13, 15, 16, 18]. Approximate inverses possess inherent parallelism, cf. [10, 13, 15, 16], and thus can be used effectively on parallel systems. Recently, a generic class of approximate inverses has been proposed that can handle any sparsity pattern of the coefficient matrix, cf. [14]. By redesigning the Generic Approximate Banded Inverse algorithm, cf. [14], and utilizing approximate inverse sparsity patterns derived from powers of sparsified matrices (PSMs), cf. [7, 8], the generic approximate sparse inverse (GenAspI) algorithm as well as the generic factored approximate sparse inverse (GenFAspI) algorithm were proposed, cf. [10].

Multilevel techniques have been proposed in recent years by many researchers, cf. [4, 5, 23, 24]. These methods utilize reordering schemes and techniques from domain decomposition methods, cf. [4, 5, 19, 21, 23, 24, 27]. Representative examples include the Nested Grids ILU decomposition (NGILU), cf. [4], the multilevel recursive incomplete LU factorization (MRILU), cf. [5], the algebraic recursive multilevel solver (ARMS), cf. [23], and the multilevel algebraic recursive generic approximate inverse solver (MARGAIS), cf. [10].

Herewith, parallel schemes are proposed for shared memory parallel systems, using the OpenMP environment, cf. [6]. The parallel modified generic factored approximate sparse inverse (PMGenFAspI) method is proposed, which is a parallel version of the MGenFAspI method, cf. [10], using vector units, cf. [17]. The proposed parallel multilevel solver, namely the parallel multilevel algebraic recursive generic approximate inverse solver (PMARGAIS), utilizes a modified reordering scheme based on block breadth first search (BBFS), cf. [23]. The coefficient matrix of the system is reordered such that the upper left block is a block diagonal matrix. The inversion of the block diagonal matrix is performed in parallel, utilizing the SVD method, cf. [11, 12], for each block. The modified reordering scheme (MBBFS) ensures that the dimensions of each block remain small, resulting in reduced memory requirements and balanced computational work during the inversion of the block diagonal matrix. The Schur complement is formed explicitly to be used as the coefficient matrix of the next level. This process is repeated until the linear system of the last level is small enough to be solved efficiently. The explicit preconditioned bi-conjugate gradient stabilized (EPBiCGSTAB) method, cf. [28], is parallelized for shared memory parallel systems in conjunction with AVX vector units, cf. [17], and the resulting PEPBiCGSTAB method is used to solve the reduced-order linear system of the last level. The explicit formation of the Schur complement and the exact inversion of the block diagonal matrix lead to a hybrid direct-iterative method.

The AVX units are vector units used for carrying out concurrent computations on multiple data following the SIMD model, cf. [17]. Other researchers have concentrated their efforts on facilitating efficient processing of problems that involve matrix and vector computations at the hardware level, cf. [1, 25, 26]. These efforts involve the design of specialized units, based on reversible logic synthesis, to carry out such concurrent computations efficiently, cf. [1, 25, 26].

In Sect. 2, the PMGenFAspI method is presented along with implementation details. In Sect. 3, the PMARGAIS method based on the modified reordering scheme (MBBFS) is presented, along with discussions on the performance improvements. Furthermore, implementation details for the parallel inversion of the block diagonal matrix and the computation of the Schur complement are given, and the parallel EPBiCGSTAB method based on AVX units is presented. In Sect. 4, numerical results presenting the performance and applicability of the proposed schemes are given.

2 Parallel modified generic factored approximate sparse inverse (PMGenFAspI)

The MGenFAspI matrix, cf. [10], can be computed using the following decomposition, cf. [2, 12, 20, 22]:

$$\begin{aligned} A=LU\Leftrightarrow A^{-1}=U^{-1}L^{-1}\Leftrightarrow M=GH \end{aligned}$$
(2)

where \(M=A^{-1}\), \(G=U^{-1}\) and \(H=L^{-1}\). The factors L and U are obtained from the Incomplete LU decomposition, cf. [2, 20, 22]. The sparsity patterns of the factors G and H are computed by sparsifying the triangular factors L and U using a predetermined drop tolerance (drptol). The resulting sparsified matrix is then raised to a predefined power, or level of fill (lfill). The sparsity pattern is based on powers of sparsified matrices (PSMs), cf. [7, 8]. Hence, to compute the elements of the G and H factors, the following systems have to be solved, cf. [10]:

$$\begin{aligned} \left\{ {{\begin{array}{l} {UG_{\mathrm{drptol}}^{\mathrm{lfill}} =I} \\ {LH_{\mathrm{drptol}}^{\mathrm{lfill}} =I} \\ \end{array} }} \right\} \Leftrightarrow \left\{ {{\begin{array}{l} {Ug_{:,j} =e_{:,j} } \\ {Lh_{:,j} =e_{:,j} } \\ \end{array} }} \right\} ,\quad 1\le j\le n \end{aligned}$$
(3)

where lfill is the level of fill used to compute the sparsity pattern of the approximate inverse and drptol is the threshold for retaining elements in the initial sparsity pattern of the approximate inverse, cf. [7, 8, 10], while \(g_{:,j}\) and \(h_{:,j}\) are the elements of the jth column of the triangular factors of the approximate inverse and \(e_{:,j}\) are the elements of the jth column of the identity matrix. During the computation, the elements \(g_{:,j}\) and \(h_{:,j}\) are stored in a dense vector iw to avoid column searches for elements in the sparse-format matrices G and H. The elements of the jth column of the identity matrix \(e_{:,j}\) are stored in a dense vector e. Each nonzero element of the H and G factors of the MGenFAspI method can be computed by the following equations:

$$\begin{aligned} H(k,j)= & {} \frac{I(k,j)-L(k,1:k-1)*H(1:k-1,j)^{T}}{L(k,k)},\quad k=j,\ldots ,n,\ (k,j)\in H_{\mathrm{drptol}}^{\mathrm{lfill}} \end{aligned}$$
(4)
$$\begin{aligned} G(k,j)= & {} \frac{I(k,j)-U(k,k+1:n)^{T}*G(k+1:n,j)}{U(k,k)},\quad k=j,\ldots ,1,\ (k,j)\in G_{\mathrm{drptol}}^{\mathrm{lfill}} . \end{aligned}$$
(5)

where I is the identity matrix.

The parallel computation of the MGenFAspI process can be performed efficiently because each column of the factors G and H is computed independently of the other columns. Each processor is responsible for computing a group of columns of the approximate inverse without any communication. The PMGenFAspI method has been further modified to utilize AVX units, accelerating the computation of the involved inner products. Initially, the values of a register are set to zero. Then, the values of the involved vectors residing in memory are transferred in groups of four values to two registers. The products are computed concurrently and the respective results are accumulated in the register retaining the partial sums. The procedure is repeated until all the elements have been accumulated, resulting in four partial sums. The inner product is obtained by adding the four partial sums. In case the number of nonzero elements is not a multiple of four, the remaining elements are accumulated separately. The parallel modified generic factored approximate sparse inverse scheme in conjunction with AVX units is described by the following algorithmic scheme:

figure a

where fmadd(xr3, xr1, xr2) is the fused multiply add operation \({xr3}={xr3}+{xr1}*{xr2}\), where xr1, xr2 and xr3 are vectors consisting of four double-precision floating point numbers. The iw vector is a work vector used to temporarily store the elements of the column of each factor of the approximate inverse currently being computed, while the vector e holds the corresponding column of the identity matrix.
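As an illustration of the AVX-accelerated inner product described above, a minimal sketch using Intel AVX/FMA intrinsics is given below; the function name and interface are illustrative, not those of the actual implementation.

```cpp
#include <immintrin.h>
#include <cstddef>

// Inner product of two double-precision arrays using 256-bit registers and
// fused multiply-add; the remainder (n mod 4) is accumulated separately.
double dot_avx(const double* x, const double* y, std::size_t n) {
    __m256d acc = _mm256_setzero_pd();          // four partial sums set to zero
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d xr1 = _mm256_loadu_pd(x + i);   // load four values of x
        __m256d xr2 = _mm256_loadu_pd(y + i);   // load four values of y
        acc = _mm256_fmadd_pd(xr1, xr2, acc);   // acc = acc + xr1 * xr2
    }
    double partial[4];
    _mm256_storeu_pd(partial, acc);
    double sum = partial[0] + partial[1] + partial[2] + partial[3];
    for (; i < n; ++i)                          // remaining elements
        sum += x[i] * y[i];
    return sum;
}
```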

It should be noted that the length of the vectors iw and e is multiplied by the number of processors used, so that each processor uses a different part of the vectors. The elements of the vectors are determined as follows:

$$\begin{aligned}&\mathbf{iw}\left( \mathbf{cur\_tid}*\mathbf{n}+\mathbf{k} \right) :\ iw(k)\hbox { of processor cur\_tid} \end{aligned}$$
(6)
$$\begin{aligned}&\mathbf{e}\left( \mathbf{cur\_tid}*\mathbf{n}+\mathbf{k} \right) :\ e(k)\hbox { of processor cur\_tid}. \end{aligned}$$
(7)

where n is the number of rows of the factors G and H, \(\mathrm{cur\_tid}=0,\ldots ,\mathrm{nprocs}-1\) is the id number of each processor and nprocs is the total number of processors.
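The per-thread partitioning of the work vectors in Eqs. (6)-(7) can be sketched as follows with OpenMP; the loop body is a placeholder for the column solves of Eq. (3) and the names are illustrative only.

```cpp
#include <omp.h>
#include <vector>

// Each thread owns the slice [cur_tid*n, (cur_tid+1)*n) of the shared work
// vectors iw and e, so the column computations need no synchronization.
void compute_columns(int n /* order of the factors G and H */) {
    const int nprocs = omp_get_max_threads();
    std::vector<double> iw(static_cast<std::size_t>(n) * nprocs, 0.0);
    std::vector<double> e (static_cast<std::size_t>(n) * nprocs, 0.0);

    #pragma omp parallel for schedule(static)
    for (int j = 0; j < n; ++j) {
        const int cur_tid = omp_get_thread_num();
        double* iw_local = iw.data() + static_cast<std::size_t>(cur_tid) * n;
        double* e_local  = e.data()  + static_cast<std::size_t>(cur_tid) * n;
        e_local[j] = 1.0;   // jth column of the identity matrix
        // ... solve U g_{:,j} = e_{:,j} and L h_{:,j} = e_{:,j} into iw_local ...
        e_local[j] = 0.0;   // reset before this thread handles its next column
        (void)iw_local;
    }
}
```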

3 Parallel multilevel algebraic recursive generic approximate inverse solver (PMARGAIS)

3.1 Modified block breadth first search (MBBFS)

Independent Sets are a crucial component of multilevel methods. An Independent Set is composed of unknowns that are decoupled from each other and can be handled independently or simultaneously without affecting other unknowns of the linear system, cf. [23, 24]. Recently, a reordering scheme was introduced for the algebraic recursive multilevel solver (ARMS), cf. [23], based on the Breadth First Search algorithm, utilizing a threshold parameter that prevents unknowns with low relative diagonal dominance in their respective rows from joining the independent set, cf. [23]. Block Independent Sets are a generalization of Independent Sets, in which groups of unknowns are decoupled. The unknowns belonging to the Block Independent Sets are numbered first and the interface unknowns are numbered last. Hence, the upper left block of the permuted coefficient matrix has a block diagonal structure.

The modified block breadth first search (MBBFS) algorithm reduces the memory requirements of the block diagonal inverse by restricting the dimension of the (dense) blocks to a predefined small number. In the BBFS algorithm, the grouping of nodes into an independent set stops only after a whole level of neighbors has been added and the block size has exceeded the predetermined block size. This results in blocks of much larger size than the predetermined one, especially for matrices with a large number of nonzero elements per row. In contrast, the MBBFS scheme stops the insertion of unvisited nodes into a block independent set as soon as the predetermined block size is reached. This results in blocks of size equal to the predetermined one, or smaller when there are not enough neighboring nodes, thus imposing an upper limit on the memory requirements for computing the inverse of the block diagonal matrix. The MBBFS scheme also improves the balance of computational work among the CPUs, since the dimensions of the blocks are almost the same.

It should be stated that the algorithm also returns a vector s storing the starting and ending point of each block, to handle the blocks independently:

$$\begin{aligned} s(\mathrm{block})=\sum _{j=1}^{\mathrm{block}} \mathrm{block\_size}(j),\quad \mathrm{block}=1,\ldots ,\mathrm{number\_of\_blocks} \end{aligned}$$
(8)

where \(\mathrm{block}\_\mathrm{size}(j)\) is the size of each block and \(s(0)=0\). Utilizing the vector s, the following information can be obtained:

  • s(i-1): the row and the column of the block diagonal matrix where the first row and column of the ith block are located.

  • s(i)-1: the row and the column of the block diagonal matrix where the last row and column of the ith block are located.

  • s(i)-s(i-1): the order of the ith block.
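As a concrete illustration, the vector s can be built as a running sum of the block sizes, Eq. (8); this is a minimal sketch with illustrative container names.

```cpp
#include <vector>

// Build s as the running sum of the block sizes, Eq. (8), with s(0) = 0.
// Block i then occupies rows/columns s[i-1] .. s[i]-1 of the block diagonal
// matrix and has order s[i] - s[i-1].
std::vector<int> build_block_offsets(const std::vector<int>& block_size) {
    std::vector<int> s(block_size.size() + 1, 0);
    for (std::size_t j = 0; j < block_size.size(); ++j)
        s[j + 1] = s[j] + block_size[j];
    return s;
}
```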

The modified block breadth first search algorithm is described by the following algorithmic scheme:

figure b

Append the excluded nodes at the end of R

It should be mentioned that in case the number of vertices grouped in a block is smaller than the prescribed value (bsize) and no more neighboring vertices exist, then the block is retained as is.
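The size-capped grouping that distinguishes MBBFS from BBFS can be sketched as follows; this is a simplified illustration in which the adjacency representation is assumed and the diagonal-dominance threshold test of [23] is omitted, so it is not the exact algorithm of figure b.

```cpp
#include <queue>
#include <vector>

// Group the vertices of the adjacency graph adj into block independent sets
// of at most bsize vertices: a breadth first search that stops inserting
// unvisited neighbours as soon as the current block would exceed bsize.
void mbbfs(const std::vector<std::vector<int>>& adj, int bsize,
           std::vector<int>& order, std::vector<int>& block_size) {
    const int n = static_cast<int>(adj.size());
    std::vector<char> visited(n, 0);
    for (int start = 0; start < n; ++start) {
        if (visited[start]) continue;
        std::queue<int> q;
        q.push(start);
        visited[start] = 1;
        int count = 0;                                   // vertices placed in this block
        while (!q.empty()) {
            const int v = q.front(); q.pop();
            order.push_back(v);
            ++count;
            for (int w : adj[v])
                if (!visited[w] && count + static_cast<int>(q.size()) < bsize) {
                    visited[w] = 1;                      // reserve a slot in this block
                    q.push(w);
                }
        }
        block_size.push_back(count);                     // count <= bsize by construction
    }
}
```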

3.2 Parallel multilevel approximate inverse solver

Let us consider the reordering matrix P computed with the MBBFS algorithm. The reordered block matrix K is as follows, cf. [10]:

$$\begin{aligned} K=PAP^{T}=\left[ {{\begin{array}{ll} B &{} C \\ D &{} E \\ \end{array} }} \right] =\left[ {{\begin{array}{ll} I &{} 0 \\ {DB^{-1}} &{} I \\ \end{array} }} \right] \left[ {{\begin{array}{ll} B &{} 0 \\ 0 &{} S \\ \end{array} }} \right] \left[ {{\begin{array}{ll} I &{} {B^{-1}C} \\ 0 &{} I \\ \end{array} }} \right] \end{aligned}$$
(9)

where \(S=E-DB^{-1}C\) is the Schur complement of the block matrix K and B is a block diagonal matrix. The inverse of K can be computed as follows:

$$\begin{aligned} K^{-1}=\left[ {{\begin{array}{ll} B &{} C \\ D &{} E \\ \end{array} }} \right] ^{-1}=\left[ {{\begin{array}{ll} I &{} {-B^{-1}C} \\ 0 &{} I \\ \end{array} }} \right] \left[ {{\begin{array}{ll} {B^{-1}} &{} 0 \\ 0 &{} {S^{-1}} \\ \end{array} }} \right] \left[ {{\begin{array}{ll} I &{} 0 \\ {-DB^{-1}} &{} I \\ \end{array} }} \right] . \end{aligned}$$
(10)

The inverse of the block matrix can be computed by inverting the matrices B and S. Matrix B is a block diagonal matrix, thus its inverse \(B^{-1}\) is also block diagonal. The order of the blocks is small, and the exact inverse of each block can be computed separately using the SVD method; the matrix \(B^{-1}\) computed with this technique is therefore the exact inverse of matrix B. The inverse of each block can be computed independently, thus each processor is responsible for inverting a different group of blocks. The number of nonzero elements is small compared to the order of matrix B, thus the inversion of the blocks does not require substantial computational work or memory.

The inverse of the Schur complement is less expensive to compute, due to its reduced order compared to the matrix K, and can be obtained approximately with the PMGenFAspI method. This process leads to the computation of a two-level approximate inverse. The two-level process could be applied recursively to invert the resulting Schur complement, leading to a multilevel scheme. The scheme is applied until the Schur complement is sufficiently small to be inverted efficiently, or until no more independent sets exist. The last Schur complement can be approximately inverted using the PMGenFAspI method.

In practice, it is inefficient to compute an approximate inverse explicitly using this multilevel technique, because the matrices involved in the operations tend to have an increasing number of nonzero elements, cf. [10]. Instead, the linear system (1) can be solved in block form.

Fig. 1
figure 1

Multilevel solution of a linear system through recursive solution of continuously smaller Schur complement linear systems along with the linear systems corresponding to the block independent sets

The equivalent expression for the reordered system is of the following form:

$$\begin{aligned} \left( PAP^{T}\right) Px=Pb\Leftrightarrow Kx^{\prime }=b^{\prime }\Leftrightarrow x^{\prime }=K^{-1}b^{\prime } \end{aligned}$$
(11)

where \(K=PAP^{\mathrm{T}}\), \(x{^\prime }=Px\) and \(b{^\prime }=Pb\). Using Eq. (10) we derive the following:

$$\begin{aligned} \left[ {{\begin{array}{l} {x_i } \\ {x_r } \\ \end{array} }} \right] =K^{-1}\left[ {{\begin{array}{l} {b_i } \\ {b_r } \\ \end{array} }} \right] =\left[ {{\begin{array}{ll} I &{} {-B^{-1}C} \\ 0 &{} I \\ \end{array} }} \right] \left[ {{\begin{array}{ll} {B^{-1}} &{} 0 \\ 0 &{} {S^{-1}} \\ \end{array} }} \right] \left[ {{\begin{array}{ll} I &{} 0 \\ {-DB^{-1}} &{} I \\ \end{array} }} \right] \left[ {{\begin{array}{l} {b_i } \\ {b_r } \\ \end{array} }} \right] \end{aligned}$$
(12)

where the subscript i denotes the solution and the right-hand side corresponding to the nodes associated with the independent sets and the subscript r denotes the solution and the right-hand side corresponding to the rest of the nodes. The equivalent expression of (12) is as follows:

$$\begin{aligned} \left[ {{\begin{array}{l} {x_i } \\ {x_r } \\ \end{array} }} \right] =\left[ {{\begin{array}{l} {B^{-1}b_i -B^{-1}CS^{-1}(-DB^{-1}b_i +b_r )} \\ {S^{-1}(-DB^{-1}b_i +b_r )} \\ \end{array} }} \right] \end{aligned}$$
(13)

or equivalently

$$\begin{aligned} x_i= & {} B^{-1}b_i -B^{-1}Cx_r , \end{aligned}$$
(14)
$$\begin{aligned} x_r= & {} S^{-1}(-DB^{-1}b_i +b_r ). \end{aligned}$$
(15)

The solution vector \(x_{i}\) can be computed directly, since \(B^{-1}\) is known explicitly and \(x_{r}\) is computed by (15). Hence, \(x_{i}\) is computed with three matrix-vector multiplications and one vector subtraction. These computations are executed in parallel using AVX units. Nevertheless, the solution vector \(x_{r}\) is computed by solving the linear system:

$$\begin{aligned} Sx_r =(-DB^{-1}b_i +b_r ). \end{aligned}$$
(16)

This linear system can be solved with the PEPBiCGSTAB method in conjunction with the PMGenFAspI method. This process leads to a two-level hybrid direct-iterative scheme for solving linear systems. The two-level process can be applied recursively to the linear system (16), leading to a multilevel scheme for the computation of the solution vector x. The multilevel scheme is depicted in Fig. 1. The solution vector \(x_{i}\) of the last level is computed using the vector \(x_{r}\). The vector \(x=P^{T}[x_{i}^{T}\ x_{r}^{T}]^{T}\) is returned as the solution vector \(x_{r}\) of the previous level.
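A dense illustration of the two-level solve of Eqs. (14)-(16) is sketched below; dense matrices and a generic Schur solver callback are used purely for clarity, whereas the actual implementation operates on sparse structures and applies the scheme recursively.

```cpp
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<std::vector<double>>;   // dense, row-major (illustration only)

static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

// Two-level solve: Binv is the exact block diagonal inverse and solveS stands
// for the (recursive or PEPBiCGSTAB) solve with S = E - D Binv C, Eq. (16).
template <class SchurSolver>
void two_level_solve(const Mat& Binv, const Mat& C, const Mat& D,
                     const Vec& b_i, const Vec& b_r,
                     SchurSolver solveS, Vec& x_i, Vec& x_r) {
    const Vec Binv_bi = matvec(Binv, b_i);       // B^{-1} b_i
    Vec rhs = b_r;                               // b_r - D B^{-1} b_i
    const Vec DBb = matvec(D, Binv_bi);
    for (std::size_t k = 0; k < rhs.size(); ++k) rhs[k] -= DBb[k];
    x_r = solveS(rhs);                           // x_r = S^{-1}(-D B^{-1} b_i + b_r)
    const Vec Cxr = matvec(C, x_r);
    const Vec BinvCxr = matvec(Binv, Cxr);       // B^{-1} C x_r
    x_i = Binv_bi;
    for (std::size_t k = 0; k < x_i.size(); ++k) x_i[k] -= BinvCxr[k];  // Eq. (14)
}
```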

In the case where the singular values of B are close to machine precision, the method might converge in more than a single iteration, due to the rounding errors. Moreover, singular values close to the machine precision are set explicitly to zero. This is also true in the case where the prescribed tolerance is close to machine precision. In such cases, the multilevel process is used as a preconditioner to the Richardson iterative method,

$$\begin{aligned} x_{k+1} =x_k +\mathrm{PMARGAIS}(A,r_k ) \end{aligned}$$
(17)

where \(r_{k}=b-Ax_{k}\), \(k=0,1,2,{\ldots }\) is the residual vector. The scheme is repeated until the solution of the linear system is obtained to the prescribed tolerance. The multilevel scheme is a hybrid direct-iterative method.
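A minimal sketch of the Richardson scheme of Eq. (17) is given below; apply_A and pmargais are assumed callbacks standing in for the matrix-vector product and the multilevel solver described above.

```cpp
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Richardson iteration x_{k+1} = x_k + PMARGAIS(A, r_k), r_k = b - A x_k,
// stopped on the relative residual norm (zero initial guess assumed).
Vec richardson(const std::function<Vec(const Vec&)>& apply_A,
               const std::function<Vec(const Vec&)>& pmargais,
               const Vec& b, double tol, int maxit) {
    auto nrm = [](const Vec& v) {
        double s = 0.0;
        for (double vi : v) s += vi * vi;
        return std::sqrt(s);
    };
    Vec x(b.size(), 0.0);
    const double r0 = nrm(b);                   // residual norm of the zero guess
    for (int k = 0; k < maxit; ++k) {
        const Vec Ax = apply_A(x);
        Vec r(b.size());
        for (std::size_t i = 0; i < b.size(); ++i) r[i] = b[i] - Ax[i];
        if (nrm(r) < tol * r0) break;           // prescribed tolerance reached
        const Vec dx = pmargais(r);             // multilevel scheme as preconditioner
        for (std::size_t i = 0; i < x.size(); ++i) x[i] += dx[i];
    }
    return x;
}
```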

The parallel multilevel algebraic recursive generic approximate inverse solver (PMARGAIS) can then be described by the following algorithmic scheme:

figure c

The PEPBiCGSTAB method, cf. [13, 28], using AVX units, is presented in the “Appendix”.

It should be noted that the parallel computations of the method use the cache blocking technique. Thus, each computation is tiled into segments such that eight double-precision floating point numbers are transferred to the cache memory, since the L1 cache line of the systems used is 512 bits (64 bytes).

3.3 Parallel inversion of block diagonal matrix

For the exact inversion of the block diagonal matrix B, each block of the matrix is inverted using the SVD method. The dimensions of each block are kept small, thus the computational work and memory required for the inversion remain low. Using the SVD method, each block to be inverted is decomposed as follows, cf. [11, 12]:

$$\begin{aligned} {B}_{{\mathrm{sub}}} ={U}\Sigma {V}^{{T}} \end{aligned}$$
(18)

where U and V are orthogonal matrices, thus their inverses are equal to their transposes, and \(\Sigma \) is a diagonal matrix, cf. [11, 12]. Hence, the inverse of the block \(B_{\mathrm{sub}}\) can be computed as follows:

$$\begin{aligned} \left( {{B}_{\mathrm{sub}} } \right) ^{-1}={V}\Sigma ^{-1}{U}^{{T}}. \end{aligned}$$
(19)

It should be stated that in case the dimension of a block is equal to one, the inversion is trivial and does not require the SVD method. After the inverse of each block has been computed, the inverses are stored in \(B^{-1}\), assembling the complete inverse matrix. The mapping from local to global positions in the inverse is realized through the vector s, which is obtained by the MBBFS method. The exact inverse of B is then computed by the following compact algorithmic scheme:

figure d
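A sketch of assembling a block inverse from its SVD, Eq. (19), is given below; the svd() routine is assumed to be supplied by a library, and singular values near machine precision are treated as zero as discussed above.

```cpp
#include <limits>
#include <vector>

using Mat = std::vector<std::vector<double>>;   // small dense block (illustration)

// Assumed helper returning U, sigma, V with B_sub = U * diag(sigma) * V^T;
// any SVD routine can be substituted here.
void svd(const Mat& B_sub, Mat& U, std::vector<double>& sigma, Mat& V);

// (B_sub)^{-1} = V * diag(1/sigma) * U^T, Eq. (19); tiny singular values are
// skipped, i.e. their reciprocals are set to zero.
Mat invert_block(const Mat& B_sub) {
    const std::size_t m = B_sub.size();
    Mat U, V;
    std::vector<double> sigma;
    svd(B_sub, U, sigma, V);
    const double eps = std::numeric_limits<double>::epsilon();
    Mat inv(m, std::vector<double>(m, 0.0));
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < m; ++j)
            for (std::size_t k = 0; k < m; ++k)
                if (sigma[k] > eps)
                    inv[i][j] += V[i][k] * (1.0 / sigma[k]) * U[j][k];
    return inv;
}
```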

The blocks of matrix B have arbitrary form, thus their inverses are generally dense. To form the matrix \(B^{-1}\), the memory requirements have to be computed a priori. The vector ss retains the memory requirements for each block and its elements are computed as follows:

$$\begin{aligned} \mathrm{ss}(\mathrm{block})=\sum _{j=1}^{\mathrm{block}} [s(j)-s(j-1)]^{2},\quad \mathrm{block}=1,\ldots ,\mathrm{number\_of\_blocks} \end{aligned}$$
(20)

where ss(0) \(=\) 0. The values of the vector ss denote the cumulative number of nonzero elements stored in the inverse matrix \(B^{-1}\) up to and including each block. The block matrix \(B^{-1}\) is stored in the compressed sparse row (three vector variant) storage format using the values of the vector ss, the values of the inverses of the blocks, as well as the column indices computed from the s vector, as depicted in Fig. 2.
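The assembly of \(B^{-1}\) in the three-vector CSR format can be sketched as follows; the dense per-block inverses are assumed to be available from the SVD step, s follows Eq. (8), and the cumulative counts of Eq. (20) correspond to the resulting row pointer offsets.

```cpp
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Three-vector CSR layout: row pointers, column indices and values.
struct CSR { std::vector<int> row_ptr, col_ind; std::vector<double> val; };

// Block blk occupies rows s[blk] .. s[blk+1]-1 and contributes a dense
// (s[blk+1]-s[blk])^2 square of entries to the block diagonal inverse.
CSR assemble_block_inverse(const std::vector<Mat>& block_inv,
                           const std::vector<int>& s) {
    CSR Binv;
    const int n = s.back();
    Binv.row_ptr.assign(n + 1, 0);
    for (std::size_t blk = 0; blk + 1 < s.size(); ++blk) {
        const int first = s[blk], last = s[blk + 1];
        for (int i = first; i < last; ++i) {
            Binv.row_ptr[i + 1] = Binv.row_ptr[i] + (last - first);
            for (int j = first; j < last; ++j) {
                Binv.col_ind.push_back(j);           // global column index
                Binv.val.push_back(block_inv[blk][i - first][j - first]);
            }
        }
    }
    return Binv;
}
```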

The inverse of each block of the matrix B is computed as the triple product \(V\Sigma ^{-1}U^{T}\) of Eq. (19), where U, \(\Sigma \) and V are obtained from the SVD decomposition. The triple matrix multiplication involves dense computations that can be parallelized with AVX units. Parallelization with AVX units is fine grained, thus the inner summation loop is vectorized. The dense matrix multiplication with AVX units can be described by the following algorithmic procedure:

Fig. 2
figure 2

Compressed sparse row storage format (three vector variant) of \(B^{-1}\)

figure e

where fmadd(xr3, xr1, xr2) is the fused multiply add operation \({xr3}={xr3}+{xr1}*{xr2}\), where xr1, xr2 and xr3 are vectors consisting of four doubles.
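A sketch of the vectorized inner kernel of the dense product is given below; the loop structure is illustrative of the fused multiply-add accumulation and not the exact algorithm of figure e.

```cpp
#include <immintrin.h>
#include <cstddef>

// C = A * B for small dense row-major m x m blocks; the summation over k is
// carried out four doubles at a time with fmadd, the remainder separately.
void dense_matmul_avx(const double* A, const double* B, double* C, std::size_t m) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            __m256d acc = _mm256_setzero_pd();
            double bcol[4];
            std::size_t k = 0;
            for (; k + 4 <= m; k += 4) {
                __m256d xr1 = _mm256_loadu_pd(A + i * m + k);             // A(i, k..k+3)
                for (int t = 0; t < 4; ++t) bcol[t] = B[(k + t) * m + j]; // B(k..k+3, j)
                __m256d xr2 = _mm256_loadu_pd(bcol);
                acc = _mm256_fmadd_pd(xr1, xr2, acc);                     // acc += xr1 * xr2
            }
            double partial[4];
            _mm256_storeu_pd(partial, acc);
            double sum = partial[0] + partial[1] + partial[2] + partial[3];
            for (; k < m; ++k)
                sum += A[i * m + k] * B[k * m + j];
            C[i * m + j] = sum;
        }
}
```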

The Schur complement is computed by the following equation:

$$\begin{aligned} S=E-DB^{-1}C. \end{aligned}$$
(21)

The computation of the Schur complement requires sparse matrix operations, which are computationally intensive. The required sparse matrix multiplications can be parallelized efficiently, since each processor computes its assigned rows independently. Sparse matrix multiplication is a two-step operation: first, the possible number of nonzero elements is evaluated, and then the values of the resulting nonzero elements are computed. It should be mentioned that the computation of each row is performed using a dense work vector (w) and a list, which stores the nonzero values as well as their positions. In the case of multiple processors, the dimension of this vector is computed by multiplying the number of columns of the first sparse matrix with the number of processors. Hence, each processor uses a different part of the vector.
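The row-by-row numeric step of such a sparse product can be sketched as follows; the marker array is an implementation detail of the sketch, and the symbolic step that sizes the result as well as the per-processor offsetting of the work vector are omitted.

```cpp
#include <vector>

struct CSR { std::vector<int> row_ptr, col_ind; std::vector<double> val; };

// Numeric pass for row i of C = A * B in CSR format: the dense work vector w
// accumulates the values and cols records the touched positions; marker must
// be initialized to -1 and have as many entries as w.
void spgemm_row(const CSR& A, const CSR& B, int i,
                std::vector<double>& w, std::vector<int>& marker,
                std::vector<int>& cols) {
    cols.clear();
    for (int pa = A.row_ptr[i]; pa < A.row_ptr[i + 1]; ++pa) {
        const int k = A.col_ind[pa];
        const double a_ik = A.val[pa];
        for (int pb = B.row_ptr[k]; pb < B.row_ptr[k + 1]; ++pb) {
            const int j = B.col_ind[pb];
            if (marker[j] != i) {                 // first contribution to column j
                marker[j] = i;
                w[j] = 0.0;
                cols.push_back(j);
            }
            w[j] += a_ik * B.val[pb];
        }
    }
    // The pairs (j, w[j]) for j in cols form row i of the result.
}
```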

4 Numerical results

In this section, the performance of the parallel multilevel algebraic recursive generic approximate inverse solver (PMARGAIS) is examined for solving various problems. The execution time is given in ss.hh (seconds.hundredths) and the overall gain corresponds to the time of the serial execution divided by the time of the serial or parallel execution that includes the use of AVX units. The problems ATMOSMODD, tmt_unsym and cage13 were obtained from the University of Florida Sparse Matrix Collection, cf. [9]. The Poisson problem in two and three dimensions can be described by the following PDE:

$$\begin{aligned} -\Delta u=f, (x_{1},x_{2},{\ldots },x_{d})\in \Omega =[0,1]^{d} \end{aligned}$$
(22)
$$\begin{aligned} u=0, (x_{1},x_{2},\ldots ,x_{d})\in \partial \Omega \end{aligned}$$
(22.a)

where \(\partial \Omega \) denotes the boundary of \(\Omega \) and d denotes the number of space variables. The above PDE is discretized with the five-point stencil finite difference method in two space variables and with the seven-point stencil finite difference method in three space variables. The right-hand side of the linear system derived from PDE (22)-(22.a) was computed as the product of the coefficient matrix A with the solution vector, whose components are all equal to unity.
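For reference, a minimal sketch of assembling the 2D five-point stencil in CSR form and forming the right-hand side as the product of A with the all-ones vector is given below; the grid ordering and scaling are illustrative assumptions.

```cpp
#include <vector>

struct CSR { std::vector<int> row_ptr, col_ind; std::vector<double> val; };

// Five-point finite difference Laplacian on an m x m interior grid of the unit
// square with homogeneous Dirichlet boundary conditions (row-wise ordering).
CSR poisson2d(int m) {
    CSR A;
    A.row_ptr.push_back(0);
    auto idx = [m](int i, int j) { return i * m + j; };
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < m; ++j) {
            if (i > 0)     { A.col_ind.push_back(idx(i - 1, j)); A.val.push_back(-1.0); }
            if (j > 0)     { A.col_ind.push_back(idx(i, j - 1)); A.val.push_back(-1.0); }
            A.col_ind.push_back(idx(i, j)); A.val.push_back(4.0);
            if (j < m - 1) { A.col_ind.push_back(idx(i, j + 1)); A.val.push_back(-1.0); }
            if (i < m - 1) { A.col_ind.push_back(idx(i + 1, j)); A.val.push_back(-1.0); }
            A.row_ptr.push_back(static_cast<int>(A.col_ind.size()));
        }
    return A;
}

// Right-hand side b = A * e, with e the vector of all ones, as used in the tests.
std::vector<double> rhs_of_ones(const CSR& A) {
    std::vector<double> b(A.row_ptr.size() - 1, 0.0);
    for (std::size_t i = 0; i + 1 < A.row_ptr.size(); ++i)
        for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p)
            b[i] += A.val[p];
    return b;
}
```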

4.1 Parallel modified generic factored approximate sparse inverse (PMGenFAspI) using AVX units

The performance, the speedup and the overall gain of the PMGenFAspI method parallelized for shared memory parallel systems with AVX units is examined for the 3D Poisson problem. The experimental results were obtained using an Intel Core-i7 4700MQ 2.4 GHz with 8 GB of RAM memory, running Windows 10 Pro.

It should be noted that the software was developed in C++ without the use of scientific libraries. The software was compiled using the Visual Studio 2010 with OpenMP v2.0 and the maximum speed optimizations flag (/O2). The use of AVX units was realized through software libraries offered by the corresponding development environment.

Table 1 Performance and speedups of PMGenFAspI algorithm, using AVX units, for the 3D Poisson problem for various number of threads and values of the lfill parameter
Table 2 Overall gain of PMGenFAspI algorithm, using AVX units, for the 3D Poisson problem for various number of threads and values of the lfill parameter

In Table 1, the performance and speedup of the PMGenFAspI algorithm, using AVX units, for the 3D Poisson problem of order \(n=1{,}000{,}000\) for various number of threads and values of the lfill parameter are presented. In Table 2, the overall gain of the PMGenFAspI algorithm, using AVX units, for the 3D Poisson problem of order \(n=1{,}000{,}000\) for various number of threads and values of the lfill parameter is given.

It should be noted that the speedup of the PMGenFAspI method tends to the theoretical maximum as the number of threads is increased. The performance is increased using AVX units.

4.2 Parallel multilevel algebraic recursive generic approximate inverse solver

The performance of PMARGAIS is examined for solving various problems. Experimental results obtained from various systems and for various values of the parameters of the method are presented, to assess the behavior of the scheme. It should be stated that the following parameters were used for all the executions: dtol \(=\) 0.0, lfill \(=\) 2, droptol \(=\) 0.1, ILUfill \(=\) 10 and ILUtol \(=\) 0.001, cf. [10]. The stopping criterion for the PMARGAIS method was \(\left\| {r_k } \right\| <10^{-10}\left\| {r_0 } \right\| \), where \(r_{k}\) is the residual vector at the kth iteration. The stopping criterion for the PEPBiCGSTAB method used in the last level was \(\left\| {r_k } \right\| <10^{-12}\left\| {r_0 } \right\| \).

Table 3 Performance and speedups of PMARGAIS method for the 2D Poisson problem for various values of the order (n), number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme
Fig. 3
figure 3

Escalation of PMARGAIS method for the 2D Poisson problem for various values of the order (n) and number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme

Table 4 Performance and speedups of PMARGAIS method for the 3D Poisson problem for various values of the order (n) and number of threads with Bsize \(=\) 1, levels \(=\) 2 and the MBBFS reordering scheme
Fig. 4
figure 4

Escalation of PMARGAIS method for the 3D Poisson problem for various values of the order (n) and number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme

Table 5 Performance and speedups of PMARGAIS method for various problems, values of the Block size, number of threads and reordering schemes
Table 6 Performance and speedups of PMARGAIS method for various problems, values of the Block size, number of levels, number of threads and reordering schemes

System 1 Numerical results were obtained using an AMD Phenom(tm) II X4 955 Processor 3.20 GHZ with 4 GB of RAM memory, running Ubuntu 12.04 LTS. It should be noted that the software was developed in C++ without the use of scientific libraries. The software was compiled with g++ 4.6.3 with OpenMP v3.0 and the maximum optimizations flag (\(-\)O3).

In Table 3, the performance and speedup of PMARGAIS method for the 2D Poisson problem for various values of the order (n) and number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme are given. In Fig. 3, the escalation of PMARGAIS method for the 2D Poisson problem for various values of the order (n) and the number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme is depicted. In Table 4, the performance and speedups of PMARGAIS method for the 3D Poisson problem for various values of the order (n) and number of threads with Bsize \(=\) 1, levels \(=\) 2 and the MBBFS reordering scheme are presented. In Fig. 4, the escalation of PMARGAIS method for the 3D Poisson problem for various values of the order (n) and the number of threads with Block size \(=\) 50, levels \(=\) 2 and the BBFS reordering scheme is depicted. In Table 5, the performance and speedups of PMARGAIS method for various problems, values of the Block size, number of threads and reordering schemes are presented.

It can be easily seen that the speedups, presented in Tables 3 and 4, do not increase uniformly as the order (n) increases, because the number of block independent sets is not a multiple of the number of processors. Thus, the computational work is not balanced among the processors, and the execution time of the parallel scheme is governed by the processor that is assigned the largest number of independent sets. The imbalance affects parallel performance significantly, especially when the block size is large and the number of independent sets is not a multiple of the number of processors.

System 2 Numerical results were obtained using an Intel Core-i7 4700MQ 2.4 GHz with 8 GB of RAM memory, running Windows 10 Pro.

Table 7 Performance and speedup of PMARGAIS method, using AVX units, for various problems, values of the Block size and number of threads with levels \(=\) 2 and MBBFS reordering scheme
Table 8 Overall gain of PMARGAIS method, using AVX units, for various problems, values of the Block size and number of threads, with levels \(=\) 2 and the MBBFS reordering scheme

It should be noted that the software was developed in C++ without the use of scientific libraries. The software was compiled using the Visual Studio 2010 with OpenMP v2.0 and the maximum speed optimizations flag (/O2). The use of AVX units was realized through software libraries offered by the corresponding development environment.

In Table 6, the performance and speedups of the PMARGAIS method for various problems, values of the Block size, number of levels, number of threads and reordering schemes are given. In Table 7, the performance and speedup of the PMARGAIS method, using AVX units, for various problems, values of the Block size and number of threads with levels \(=\) 2 and the MBBFS reordering scheme are presented. In Table 8, the overall gain of the PMARGAIS method, using AVX units, for various problems, values of the Block size and number of threads with levels \(=\) 2 and the MBBFS reordering scheme is given.

It should be mentioned that the use of the modified reordering scheme (MBBFS) instead of the BBFS scheme ensures load balancing and better control over the size of each block, especially for problems where the size of the blocks increases rapidly, such as matrices whose corresponding graph contains many vertices of large degree. Hence, the MBBFS method reduces the memory requirements for storing the block diagonal inverse, imposing an upper limit on the memory requirements of the method.

It should also be noted that the results can be further improved by modifying the MBBFS method to produce a number of blocks that is a multiple of the number of processing units.

5 Conclusion

The proposed parallel multilevel algebraic recursive generic approximate inverse solver (PMARGAIS) is efficient for solving a class of problems resulting in large sparse linear systems. The hybrid direct-iterative PMARGAIS method involves dense computations that can be parallelized efficiently using AVX units. Dense computations are parallelized more efficiently than sparse computations, leading to increased performance and better scalability. Moreover, the proposed scheme is based on the modified reordering scheme (MBBFS), which retains balanced computational work, thus enhancing performance; the corresponding speedups are close to the theoretical maximum. The PMARGAIS method is based on the PEPBiCGSTAB method in conjunction with the PMGenFAspI matrix, using AVX units. The use of AVX units in conjunction with multicore systems for computing the PMGenFAspI matrix results in increased speedups that surpass the number of available processors. Moreover, the PEPBiCGSTAB method has been parallelized for multicore systems with AVX units, resulting in improved performance. Future work will focus on the implementation of the method on hybrid distributed memory systems.