
1 Introduction

Chip producers can no longer efficiently increase the clock frequency of a processor. Moore's law, which originally stated that the number of transistors doubles every 2 years [18], is nevertheless still valid; today it can be read as the number of cores doubling every 2 years. As a result of this paradigm shift, researchers have been working on parallelizing existing sequential algorithms in order to use all the cores of a processor effectively. A more innovative approach is to design a completely new algorithm with parallelism in mind from the start.

Solution of sparse linear systems is required by many applications in science and engineering. Often, the linear solution step is the main performance bottleneck. There are two main classes of linear system solvers, namely direct and iterative methods. While iterative solvers are not as robust as direct solvers [19], they are considered to scale better on parallel computers. There are many parallel sparse direct solver implementations; some of the most well known are SuperLU [5, 15], MUMPS [1], PARDISO [23-26], and WSMP [7, 8]. All of these direct solvers, however, are based on some form of the classical LU factorization, performed in parallel. After factorization, the system is solved using the computed factors via forward and backward sweeps. For most sparse systems there are dependencies between unknowns, which limit the parallelism in the factorization phase, and, due to fill-in, this is even more pronounced during the triangular solution phase.

To alleviate these drawbacks of existing direct solvers, we have developed a new general parallel sparse direct solver [2] based on the sparse DS factorization [3]. The idea of the DS factorization was first introduced in [20-22] for banded linear systems, where it is called the SPIKE algorithm due to the structure of the S matrix. A recursive banded DS factorization is introduced in [21], which applies the recursion to the S matrix. Our approach, on the other hand, is to apply the recursion to the smaller reduced system. A generalization of the banded DS factorization to sparse linear systems and its hybrid (direct/iterative) implementation, in which the reduced system is solved iteratively, is given in [16, 17].

Given a banded or sparse linear system of equations and a number of blocks,

$$\displaystyle{ Ax = f }$$
(1)

The DS decomposition factors A into two matrices D and S such that,

$$\displaystyle{ A = DS }$$
(2)

where D is just the block diagonal of A. This corresponds to the splitting A = D + R, where R contains the remaining block off-diagonal entries of A. No computation is needed to obtain D. The S matrix, on the other hand, is obtained as $S = D^{-1}A$ or, taking advantage of the fact that the block diagonal of S is the identity,

$$\displaystyle{ S = D^{-1}(D + R). }$$
(3)

We obtain S = I + G, which involves solving independent linear systems in parallel to obtain $G = D^{-1}R$. After obtaining the DS factorization, if we multiply both sides of Equation (1) by $D^{-1}$ from the left,

$$\displaystyle{ D^{-1}Ax = D^{-1}f, }$$
(4)

we obtain the new system

$$\displaystyle{ Sx = g }$$
(5)

where $g = D^{-1}f$. The new system in Equation (5) contains a smaller subsystem that does not exist in the original system of equations. This reduced system is obtained by identifying the indices, c, of the columns that contain nonzeros in the R matrix. The reduced system is then formed simply by selecting the rows and columns c from the S matrix (i.e., S(c, c)).
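To make the construction concrete, the following sketch (our illustration, not the authors' code) builds the block diagonal D, the remainder R, computes $G = D^{-1}R$ one block at a time, and extracts the reduced system S(c, c). SciPy's sparse LU stands in for the per-block PARDISO solves, and the partition boundaries are an assumed input.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu


def ds_reduced_system(A, block_starts):
    """Form D, R, G = D^{-1}R, the index set c, and the reduced system S(c,c).

    block_starts: partition boundaries, e.g. [0, k1, ..., n].
    SciPy's sparse LU is used per diagonal block as a stand-in for PARDISO.
    """
    n = A.shape[0]
    A = sp.csc_matrix(A)

    # D: block diagonal of A; R: remaining block off-diagonal entries.
    D = sp.lil_matrix((n, n))
    for b0, b1 in zip(block_starts[:-1], block_starts[1:]):
        D[b0:b1, b0:b1] = A[b0:b1, b0:b1].toarray()
    D = D.tocsc()
    R = (A - D).tocsc()

    # Indices c of the columns of R that contain nonzeros.
    c = np.unique(R.nonzero()[1])

    # G = D^{-1} R, one independent block solve per partition
    # (these solves run in parallel in the actual solver).
    G = sp.lil_matrix((n, n))
    for b0, b1 in zip(block_starts[:-1], block_starts[1:]):
        lu = splu(sp.csc_matrix(A[b0:b1, b0:b1]))
        G[b0:b1, :] = lu.solve(R[b0:b1, :].toarray())
    G = G.tocsc()

    # S = I + G; the reduced system is S restricted to rows and columns c.
    S = sp.identity(n, format="csc") + G
    S_red = sp.csc_matrix(S[c, :][:, c])
    return D, R, G, c, S_red
```

On the small example of Figure 1, block_starts would mark the four partitions and S_red would correspond to the reduced matrix shown in Figure 2.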

For a small example, the sparsity structures of A, D, R, S, and S(c, c) are given in Figures 1(a), 1(b), 1(c), 1(d), and 2, respectively.

Fig. 1 The coefficient matrix and the corresponding D, R, and S matrices using 4 partitions.

Fig. 2 Extracted reduced system S(c, c).

The reduced system can then be written as

$$\displaystyle{ S_{(c,c)}x_{(c)} = g_{(c)}. }$$
(6)

The smaller reduced subsystem can be solved independently, and the complete solution can be retrieved, also in parallel, using one of the following two methods:

$$\displaystyle{ x = g - G_{(:,c)}x_{(c)} }$$
(7)

or equivalently

$$\displaystyle{ x = g - D^{-1}(R_{ (:,c)}x_{(c)}). }$$
(8)

The difference between these approaches is that the first one requires the G matrix to be formed completely, while in the second approach we only need to compute the part of G that is required to form S(c, c). The second approach, however, involves additional triangular solves, which are independent and completely parallel.
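As an illustration of the two retrieval formulas, the following sketch (again ours, not the authors' code) reuses the quantities returned by the ds_reduced_system sketch above, with SciPy in place of PARDISO.

```python
from scipy.sparse.linalg import spsolve


def retrieve_solution(D, R, G, g, c, x_c, use_full_G=True):
    """Recover the full solution x from the reduced solution x_c = x(c)."""
    if use_full_G:
        # Equation (7): x = g - G(:,c) x(c); requires G formed explicitly.
        return g - G[:, c] @ x_c
    # Equation (8): x = g - D^{-1}(R(:,c) x(c)); avoids forming G fully at
    # the cost of additional independent (hence parallel) solves with D.
    return g - spsolve(D, R[:, c] @ x_c)
```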

Note that the size of the reduced system depends strongly on the initial structure of the matrix. In fact, sparse matrix partitioning tools that are designed to minimize the communication volume and balance the load in parallel sparse matrix-vector multiplication can be used in the sparse DS factorization. The objective of the partitioning in the DS factorization is to decrease the size of the reduced system and, hence, to improve the parallel scalability. Furthermore, for the factorization to exist, D must be nonsingular. To achieve this, one can apply a nonsymmetric permutation that strengthens the diagonal entries.

The rest of the chapter is organized as follows. In Section 2, we introduce the new recursive sparse solver and its variations. Programming and computing environments are described and numerical results on sparse linear systems obtained from the University of Florida Sparse Matrix Collection are given in Section 3. Finally, we conclude and summarize the results in Section 4.

2 The Recursive Sparse DS Factorization

Before we apply the recursive algorithm to the linear system, we apply the symmetric and nonsymmetric permutations mentioned above. In the following pseudocode, the linear system Ax = f is assumed to be the permuted system.

The pseudocode of the recursive DS factorization is given as follows:

Algorithm 1 The Algorithm

Two options are indicated with (a) and (b). They are mathematically equivalent but computationally different. If we choose option (a), the G matrix needs to be formed explicitly, which is expensive since linear systems with multiple right-hand sides and coefficient matrix D need to be solved in parallel (Line 4). Obtaining the final solution in option (a) is then just a matrix-vector multiplication (Line 12). Option (b), on the other hand, requires a matrix-vector multiplication followed by a parallel triangular solve with a single right-hand side (Line 13), and G no longer needs to be computed (Line 5). In our implementation, in all variations, we use the Pardiso direct solver (sequentially) for each of the diagonal blocks of D. Even if we choose option (b), we still need to solve the reduced system, and the reduced system is formed using only certain entries of G. The system we solve to obtain G has the R matrix as its right-hand side. This allows us to use the feature of Pardiso that computes only certain entries of the solution vector when the right-hand side is sparse. Therefore, to keep the computational cost low, if we choose option (b) we use the sparse right-hand side feature of Pardiso and compute only those entries of G that are required to form the reduced system. Before factorization, we reorder and partition the initial matrix once; since the smaller reduced system maintains a sparsity structure similar to that of the reordered original system, we do not need to repartition at every recursive call.
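To make the recursion itself concrete, here is a minimal sketch of a recursive DS solve along the lines described above. It is not Algorithm 1 from the chapter: it assumes equal-sized partitions chosen on the fly, uses option (b) for the retrieval step, computes G fully rather than exploiting a sparse right-hand side feature, halves the number of partitions at each recursive call, and uses SciPy's sparse solver in place of PARDISO. It reuses the ds_reduced_system sketch given earlier.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve


def recursive_ds_solve(A, f, nparts, min_parts=4):
    """Sketch of a recursive DS solve; ds_reduced_system is the earlier sketch."""
    A = sp.csc_matrix(A)
    n = A.shape[0]

    # Base case: too few partitions (or too small a system) to recurse further.
    if nparts < min_parts or n < 2 * nparts:
        return spsolve(A, f)

    # Equal-sized partitions for illustration; the real solver partitions once
    # up front with METIS and reuses that structure at every recursive call.
    block_starts = list(np.linspace(0, n, nparts + 1, dtype=int))
    D, R, G, c, S_red = ds_reduced_system(A, block_starts)

    g = spsolve(D, f)                                    # g = D^{-1} f
    x_c = recursive_ds_solve(S_red, g[c], nparts // 2)   # recurse on S(c,c)
    return g - spsolve(D, R[:, c] @ x_c)                 # option (b) retrieval
```

A call such as recursive_ds_solve(A, np.ones(A.shape[0]), nparts=8) would mirror the setup used in the experiments, where the right-hand side is a vector of all ones and the number of partitions equals the number of threads.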

3 Numerical Results

We implement our algorithms using Intel Cilk Plus, which is an extension of the C and C++ languages [11]. In our work it is used to obtain efficient multithreading with recursion. Intel MKL is a library of mathematical subroutines such as BLAS, LAPACK, ScaLAPACK, and sparse linear system solvers [12]. In our implementation, all BLAS and Sparse BLAS operations are called from the available routines in Intel MKL version 10.3 [10]. As the direct solver we use Pardiso version 4.1.2.

For symmetric and nonsymmetric permutations we use METIS and HSL MC64, respectively. METIS is a collection of algorithms for graph partitioning, finite element mesh partitioning, and fill-reducing ordering of sparse matrices [13, 14]. It is used in our work to gather as many of the nonzero values as possible into the diagonal blocks by setting it to minimize the communication volume. METIS finds a symmetric permutation and works on undirected graphs. In order to obtain a partitioning for nonsymmetric matrices, we use a symmetric matrix that matches the structure of our nonsymmetric matrix (i.e., $|A|^T + |A|$). Version 5.1.0 of METIS is used in our work. Some of the matrices used in our experiments have zero values on their main diagonals. Since having even one zero value on the main diagonal means that the matrix is indefinite and the diagonal blocks could be singular, we apply a nonsymmetric permutation in such cases. HSL MC64 is a collection of Fortran codes that finds a column permutation ensuring the matrix has only nonzero entries on its main diagonal [9]. The permutation vector created by HSL MC64 is used if the matrix is indefinite.
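For the structural symmetrization mentioned above, a short sketch (our illustration, not the authors' code) of how the pattern handed to a graph partitioner such as METIS might be built:

```python
import scipy.sparse as sp


def symmetrized_pattern(A):
    """Return the structurally symmetric pattern |A|^T + |A|.

    Only the sparsity pattern matters; the result can be passed to a graph
    partitioner such as METIS to partition a nonsymmetric matrix A.
    """
    B = abs(sp.csr_matrix(A))
    return (B.T + B).tocsr()
```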

In addition to the two recursive variations of the DS factorization based sparse solver, we have also implemented two nonrecursive variations in which the reduced system is solved directly via Pardiso. Although many parallel sparse direct solvers are available for comparison, in this chapter we compare our results with the multithreaded Pardiso direct solver. For many problem types, Pardiso has been shown to be one of the most efficient direct solvers available today [6]. Furthermore, Pardiso is provided as a part of Intel MKL.

In summary, we implemented 4 variations of the DS factorization based sparse solver:

  • Nonrecursive DS factorization using the sparse right-hand side feature of PARDISO in its computations (DS-NR-SP)

  • Nonrecursive DS factorization without using the sparse right-hand side feature of PARDISO (DS-NR-NS)

  • Recursive DS factorization using the sparse right-hand side feature of PARDISO (DS-RE-SP)

  • Recursive DS factorization without using the sparse right-hand side feature of PARDISO (DS-RE-NS)

In our naming convention RE, NR, SP, and NS stand for recursive algorithm, nonrecursive algorithm, using the sparse right-hand side feature (i.e., not computing the G matrix explicitly) and not using the sparse right-hand side feature (i.e., computing the G matrix), respectively.

For all runs using the sparse DS factorization based solver, we set the number of partitions to be equal to the number of threads. The matrices used for testing are retrieved from the University of Florida Sparse Matrix Collection [4]. The properties of the matrices are given in Table 1. We use a right-hand side vector that consists of all ones.

Table 1 Properties of test matrices, n, nnz, and dd stand for the number of rows and columns and the number of nonzeros, respectively.

For all numerical experiments, we use a single node of the Nar cluster. Nar is the High Performance Computing Facility of the Middle East Technical University Department of Computer Engineering. A single node of Nar contains two Intel Xeon E5430 Quad-Core CPUs (2.66 GHz, 12 MB L2 Cache, 1333 MHz FSB) and 16 GB of memory. Nar runs Scientific Linux v5.2 (64-bit), an open source Linux distribution developed by the Fermi National Accelerator Laboratory (Fermilab) and the European Organization for Nuclear Research (CERN), as its operating system. Since each node has 8 cores, we run the algorithms using up to 8 threads.

In Table 2, the speed improvements of the recursive DS factorization algorithm (RDS) and of multithreaded Pardiso are given relative to single-threaded Pardiso runs. Timings include the total time to reorder, factorize, and solve the given linear system. We ran all variations of the RDS algorithm; in the table, due to limited space, all RDS runs presented use the DS-RE-SP variation of the algorithm except for three cases. For matrix #6 we use DS-RE-NS, and for matrices #10 and #15 we use the DS-NR-SP variation, since DS-RE-SP does not give a relative residual comparable to multithreaded Pardiso. Also note that the recursive versions of the solver are defined only when the number of partitions is four or greater; hence, we use the nonrecursive DS-NR-SP in all cases where the number of threads is 2. In Table 3, the 2-norms of the final relative residuals are given. Based on these results, the proposed algorithm is faster than Pardiso using 8 threads for 10 cases out of 16 while obtaining comparable relative residuals. We note that for the cases where RDS performs worse than multithreaded Pardiso, the sequential solution time is very short and hence not much improvement could be expected to begin with.

Table 2 Speedup of the multithreaded Pardiso and RDS algorithms compared to the sequential Pardiso time for all test matrices using 2, 4, and 8 threads.
Table 3 Relative residual norms for RDS and Pardiso using 2, 4, and 8 threads.

In Figure 3(a), the average speed improvement of RDS is compared against the average speed improvement of Pardiso over all problems. The improvement of RDS is more pronounced as the number of threads (i.e., cores) is increased from 4 to 8.

Fig. 3 The average and the best speed improvements obtained using RDS and Pardiso compared to the sequential time with Pardiso.

In Figure 3(b), we plot the best speed improvement for both RDS and Pardiso. Again, the improvement is more pronounced as the number of threads increases.

4 Conclusions

We present a recursive sparse DS factorization based direct solver. The results show that in a multicore environment the scalability of the proposed algorithm is better than that of classical LU factorization based solvers in most examples. The improvement is more pronounced as the number of cores grows.