1 Introduction

Solving large sparse linear systems is at the heart of many application problems arising in computational science and engineering. Advances in combinatorial methods, in combination with modern computer architectures, have massively influenced the design of state-of-the-art direct solvers that can handle ever larger systems efficiently in a computational environment with rapidly increasing memory resources and numbers of cores. Among these advances are novel combinatorial algorithms for improving diagonal dominance, which pave the way to a static pivoting approach and thus improve the efficiency of the factorization phase dramatically. Another advance is partitioning and reordering the system so that the reduction of fill-in and a high level of parallel concurrency are achieved simultaneously. While these achievements already improve the factorization phase significantly, modern computer architectures require as many operations as possible to be computed in the cache of the CPU. This in turn can be achieved when dense subblocks that show up during the factorization are grouped together into dense submatrices which are handled by multithreaded and cache-optimized dense matrix kernels using level-3 BLAS and LAPACK [3].

This chapter reviews some of the basic technologies together with the latest developments in sparse direct solution methods that have led to state-of-the-art LU decomposition methods. The chapter is organized as follows. In Sect. 2 we start with maximum weighted matchings, one of the key tools in combinatorial optimization to dramatically improve the diagonal dominance of the underlying system. Next, Sect. 3 reviews multilevel nested dissection as a combinatorial method to reorder a system symmetrically such that fill-in and parallelization can be improved simultaneously, once pivoting can be more or less ignored. After that, we review established graph-theoretical approaches in Sect. 4, in particular the elimination tree, from which most of the properties of the LU factorization can be deduced. Among these properties is the prediction of dense submatrices in the factorization; in this way several subsequent columns of the factors L and U T are collected in a single dense block. This is the basis for the use of dense matrix kernels based on optimized level-3 BLAS and for exploiting fast computation within the cache hierarchy, which is discussed in Sect. 5. Finally, we show in Sect. 6 how the ongoing developments in parallel sparse direct solution methods have advanced integrated circuit simulations. We assume that the reader is familiar with elementary graph theory, see, e.g., [15, 21], and with simple computational algorithms based on graphs [1].

2 Maximum Weight Matching

In modern sparse elimination methods the key to success is the ability to work with efficient data structures and their underlying numerical templates. If we can increase the size of the diagonal entries as much as possible in advance, pivoting during Gaussian elimination can often be bypassed, so that we may work with static data structures and the numerical method is significantly accelerated. A popular method to achieve this goal is the maximum weight matching method [16, 37], which permutes, e.g., the rows of a given nonsingular matrix \(A\in \mathbb {R}^{n,n}\) by a permutation matrix \(\Pi \in \mathbb {R}^{n,n}\) such that ΠTA has a nonzero diagonal. Moreover, it maximizes the product of the absolute diagonal values and yields diagonal scaling matrices \(D_r, D_c\in \mathbb {R}^{n,n}\) such that \(\tilde A=\Pi ^TD_rAD_c\) satisfies \(|\tilde a_{ij}|\leqslant 1\) and \(|\tilde a_{ii}|=1\) for all i, j = 1, …, n. The original idea on which these nonsymmetric permutations and scalings are based is to find a maximum weighted matching of a bipartite graph. Finding a maximum weighted matching is a well-known assignment problem in operations research and combinatorial analysis.

Definition 1

A graph G = (V, E) with vertices V  and edges E ⊂ V 2 is called bipartite if V  can be partitioned into two sets V r and V c, such that no edge e = (v 1, v 2) ∈ E has both ends v 1, v 2 in V r or both ends v 1, v 2 in V c. In this case we denote G by G b = (V r, V c, E).

Definition 2

Given a matrix A, then we can associate with it a canonical bipartite graph G b(A) = (V r, V c, E) by assigning the labels of V r = {r 1, …, r n} with the row indices of A and V c = {c 1, …, c n} being labeled by the column indices. In this case E is defined via E = {(r i, c j)| a ij ≠ 0}.

For the bipartite graph G b(A) we see immediately that if a ij ≠ 0, then we have that r i ∈ V r from the row set is connected by an edge (r i, c j) ∈ E to the column c j ∈ V c, but neither rows are connected with each other nor do the columns have interconnections.

Definition 3

A matching \(\mathcal {M}\) of a given graph G = (V, E) is a subset of edges of E no two of which share the same vertex.

If \(\mathcal {M}\) is a matching of a bipartite graph G b(A), then each edge \(e=(r_i,c_j) \in \mathcal {M}\) corresponds to a row i and a column j, and there exists no other edge \(\hat e=(r_k,c_l) \in \mathcal {M}\) sharing a vertex with e, i.e., neither r k = r i nor c l = c j.

Definition 4

A matching \(\mathcal {M}\) of G = (V, E) is called maximal, if no other edge from E can be added to \(\mathcal {M}\).

If for an n × n matrix A a matching \(\mathcal {M}\) of G b(A) with maximum cardinality n is found, then by definition the edges must be (i 1, 1), …, (i n, n) with i 1, …, i n being the numbers 1, …, n in a suitable order and therefore we obtain \(a_{i_1,1}\neq 0\), …\(a_{i_n,n}\neq 0\). In this case we have established that the matrix A is at least structurally nonsingular and we can use a row permutation matrix ΠT associated with row ordering i 1, …, i n to place a nonzero entry on each diagonal location of ΠTA.

Definition 5

A perfect matching is a maximal matching with cardinality n.

It can be shown that for a structurally nonsingular matrix A there always exists a perfect matching \(\mathcal {M}\).

Perfect Matching In Fig. 1, the set of edges \(\mathcal {M}= \{(1,2), (2,4), (3,5), (4,1), (5,3), (6,6) \}\) represents a perfect maximum matching of the bipartite graph G b(A).

Fig. 1
figure 1

Perfect matching. Left side: original matrix A. Middle: bipartite representation G b(A) = (V r, V c, E) of the matrix A and perfect matching \(\mathcal {M}\). Right side: permuted matrix ΠTA

The most efficient combinatorial methods for finding maximum matchings in bipartite graphs make use of an augmenting path. We will introduce some graph terminology for the construction of perfect matchings.

Definition 6

If an edge e = (u, v) in a graph G = (V, E) joins the vertices u, v ∈ V , then we denote it as uv. A path then consists of edges u 1u 2, u 2u 3, u 3u 4, …, u k−1u k, where each (u i, u i+1) ∈ E, i = 1, …, k − 1.

If G b = (V r, V c, E) is a bipartite graph, then by definition of a path, any path is alternating between the vertices of V r and V c, e.g., paths in G b could be such as r 1c 2, c 2r 3, r 3c 4, ….

Definition 7

Given a graph G = (V, E), a vertex is called free if it is not incident to any edge in a matching \(\mathcal {M}\) of G. An alternating path relative to a matching \(\mathcal {M}\) is a path P = u 1u 2, u 2u 3, …, u s−1u s whose edges alternate between \(E \setminus \mathcal {M}\) and \(\mathcal {M}\). An augmenting path relative to a matching \(\mathcal {M}\) is an alternating path of odd length both of whose vertex endpoints are free.

Augmenting Path Consider Fig. 1. To better distinguish between row and column vertices, different symbols are used in the figure for the rows and for the columns. A non-perfect but maximal matching and an augmenting path alternating between row and column vertices are indicated there; both vertex endpoints of this augmenting path are free.

In a bipartite graph G b = (V r, V c, E) one vertex endpoint of any augmenting path must be in V r, whereas the other one must be in V c. The symmetric difference, A ⊕ B of two edge sets A, B, is defined to be (A ∖ B) ∪ (B ∖ A).

Using these definitions and notations, the following theorem [5] gives a constructive algorithm for finding perfect matchings in bipartite graphs.

Theorem 1

If \(\mathcal {M}\) is a non-maximum matching of a bipartite graph G b = (V r, V c, E), then there exists an augmenting path P relative to \(\mathcal {M}\) such that \(P=\tilde {\mathcal {M}} \oplus \mathcal {M}\) and \(\tilde {\mathcal {M}}\) is a matching with cardinality \(|\mathcal {M}|+1\).

According to this theorem, a combinatorial method for finding a perfect matching in a bipartite graph is to repeatedly seek augmenting paths.
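To make the augmenting-path idea concrete, the following Python sketch implements the classical augmenting-path (Kuhn-type) algorithm for maximum bipartite matching. The adjacency-list input, the function names, and the toy pattern are our own illustration and do not correspond to any particular solver package.

```python
# Sketch of maximum bipartite matching via augmenting paths (Kuhn's algorithm).
# adj[r] lists the column vertices adjacent to row vertex r, i.e. the columns
# of the nonzeros in row r of A.  Returns match_col with match_col[c] = matched
# row of column c (or -1 if c stays free) and the cardinality of the matching.

def maximum_matching(adj, n_rows, n_cols):
    match_col = [-1] * n_cols

    def try_augment(r, visited):
        for c in adj[r]:
            if c in visited:
                continue
            visited.add(c)
            # c is free, or the row currently matched to c can be re-matched
            if match_col[c] == -1 or try_augment(match_col[c], visited):
                match_col[c] = r
                return True
        return False

    size = 0
    for r in range(n_rows):
        if try_augment(r, set()):   # grow the matching by one augmenting path
            size += 1
    return match_col, size

# Toy 6x6 pattern containing just the matched edges of Fig. 1 (0-based);
# size == 6 certifies a perfect matching and hence structural nonsingularity.
adj = {0: [1], 1: [3], 2: [4], 3: [0], 4: [2], 5: [5]}
print(maximum_matching(adj, 6, 6))
```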

The perfect matching as discussed so far only takes the nonzero structure of the matrix into account. For its use as a static pivoting method prior to the LU decomposition, one additionally requires that the absolute value of the product of the diagonal entries be maximized. This is referred to as maximum weighted matching. In this case a permutation π has to be found which maximizes

$$\displaystyle \begin{aligned} \prod_{i=1}^n |a_{\pi(i)i}|. {} \end{aligned} $$
(1)

The maximization of this product is transferred into a minimization of a sum as follows. We define a matrix C = (c ij) via

$$\displaystyle \begin{aligned} c_{ij} = \begin{cases} \log a_j - \log |a_{ij}| & a_{ij} \neq 0 \\ \infty & \mbox{otherwise}, \end{cases} \end{aligned}$$

where \(a_j =\max_i|a_{ij}|\) is the maximum absolute value in column j of the matrix A. A permutation π which minimizes the sum

$$\displaystyle \begin{aligned} \sum_{i=1}^n c_{\pi(i)i} \end{aligned}$$

also maximizes the product (1). The minimization problem is known as the linear-sum assignment problem or bipartite weighted matching problem in combinatorial optimization. The problem is solved by a sparse variant of the Hungarian method. The complexity is \(\mathcal {O}(n \tau \log n )\) for sparse matrices with τ entries. For matrices whose associated graphs fulfill special requirements, this bound can be reduced further to \(\mathcal {O}(n^\alpha (\tau + n \log n))\) with α < 1. All graphs arising from finite-difference or finite element discretizations meet these conditions [24]. As before, we finally obtain a perfect matching, which in turn defines a nonsymmetric permutation.
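As a small illustration of this log-transformation (not of the sparse Hungarian codes used in practice, such as MC64), the following sketch builds the cost matrix C for a tiny dense example and solves the assignment problem with SciPy's linear_sum_assignment; a large finite penalty stands in for the infinite cost of zero entries.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Turn "maximize the product of matched |a_ij|", Eq. (1), into a linear-sum
# assignment problem via the cost matrix C defined above.  A small dense toy
# matrix is used; BIG is a finite stand-in for the infinite cost of zeros.
A = np.array([[0.0, 2.0, 0.0],
              [4.0, 1.0, 0.0],
              [0.0, 0.5, 3.0]])

col_max = np.abs(A).max(axis=0)              # a_j = max_i |a_ij|
BIG = 1e30
with np.errstate(divide="ignore"):
    C = np.where(A != 0.0, np.log(col_max) - np.log(np.abs(A)), BIG)

rows, cols = linear_sum_assignment(C)        # dense Hungarian-type solver
perm = np.empty(A.shape[0], dtype=int)
perm[cols] = rows                            # pi(i): row matched to column i
print("pi:", perm)
print("maximized product:", np.prod(np.abs(A[perm, np.arange(A.shape[0])])))
```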

When solving the assignment problem, two dual vectors u = (u i) and v = (v i) are computed which satisfy

$$\displaystyle \begin{aligned} u_i + v_j & = c_{ij} \qquad (i,j) \in \mathcal{M}, {} \end{aligned} $$
(2)
$$\displaystyle \begin{aligned} u_i + v_j & \leq c_{ij} \qquad \text{otherwise}. {} \end{aligned} $$
(3)

Using the exponential function these vectors can be used to scale the initial matrix. To do so define two diagonal matrices D r and D c through

$$\displaystyle \begin{aligned} D_r & = \text{diag}(d_1^r,d_2^r,\dots,d_n^r), \qquad d_i^r = \exp(u_i), \end{aligned} $$
(4)
$$\displaystyle \begin{aligned} D_c & = \text{diag}(d_1^c,d_2^c,\dots,d_n^c), \qquad d_j^c = \exp(v_j)/a_j. \end{aligned} $$
(5)

Using Eqs. (2) and (3) and the definition of C, it immediately follows that \(\tilde A = \Pi ^T D_r A D_c\) satisfies

$$\displaystyle \begin{aligned} |\tilde a_{ii}| & = 1, {} \end{aligned} $$
(6)
$$\displaystyle \begin{aligned} |\tilde a_{ij}| & \le 1. {} \end{aligned} $$
(7)

The permuted and scaled system \(\tilde A\) has been observed to have significantly better numerical properties when used in direct methods or in preconditioned iterative methods, cf., e.g., [4, 16]. Olschowka and Neumaier [37] introduced these scalings and permutations for reducing pivoting in Gaussian elimination of full matrices. The first implementation for sparse matrix problems is due to Duff and Koster [16]. For symmetric matrices A, these nonsymmetric matchings can be converted to a symmetric permutation P and a symmetric scaling D s = (D rD c)1∕2 such that P TD sAD sP consists mostly of diagonal blocks of size 1 × 1 and 2 × 2 satisfying a condition similar to (6) and (7), where in practice it rarely happens that the 1 × 1 blocks are identical to 0 [17]. Recently, successful parallel approaches to compute maximum weighted matchings have been proposed [28, 29].

Example 1: Maximum Weight Matching To conclude this section we demonstrate the effectiveness of maximum weight matchings using the simple sample matrix “west0479” from the SuiteSparse Matrix Collection. The matrix can also be loaded directly in MATLAB using load west0479. In Fig. 2 we display the matrix before and after applying maximum weighted matchings. To illustrate the improved diagonal dominance we further compute \(r_i=|a_{ii}|/\sum _{j=1}^n|a_{ij}|\), i = 1, …, n, for each row of A and of \(\tilde A=\Pi ^TD_rAD_c\). r i can be read as the relative diagonal dominance of row i and yields a number between 0 and 1. Moreover, whenever \(r_i>\frac 12\), the row is strictly diagonally dominant, i.e., \(|a_{ii}| > \sum_{j\neq i}|a_{ij}|\). In Fig. 3 we display r i for both matrices by sorting the values in increasing order and taking \(\frac 12\) as reference line. We can see the dramatic impact of maximum weighted matchings in improving the diagonal dominance of the given matrix, thus paving the way to a static pivoting approach in incomplete or complete LU decomposition methods.
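The row-wise dominance ratios r i used in this example are straightforward to compute; the following Python/SciPy sketch (with our own helper name row_dominance) shows the computation for an arbitrary square sparse matrix, leaving the loading of “west0479” itself aside.

```python
import numpy as np
import scipy.sparse as sp

def row_dominance(A):
    """r_i = |a_ii| / sum_j |a_ij| per row; r_i > 1/2 means row i is strictly
    diagonally dominant.  A may be any square scipy.sparse matrix."""
    A = sp.csr_matrix(A)
    diag = np.abs(A.diagonal())
    rowsum = np.asarray(np.abs(A).sum(axis=1)).ravel()
    return np.divide(diag, rowsum, out=np.zeros_like(diag), where=rowsum > 0)

# Sorting the ratios and comparing against 1/2 reproduces the kind of plot
# shown in Fig. 3, once for A and once for the matched and rescaled matrix.
# r = np.sort(row_dominance(A))
```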

Fig. 2
figure 2

Maximum weight matching. Left side: original matrix A. Right side: permuted and rescaled matrix \(\tilde A=\Pi ^TD_rAD_c\)

Fig. 3
figure 3

Diagonal dominance. Left side: r i for A. Right side: r i for \(\tilde A=\Pi ^TD_rAD_c\)

3 Symbolic Symmetric Reordering Techniques

When dealing with large sparse matrices, a crucial factor that determines the computation time is the amount of fill produced during the factorization of the underlying matrix. To reduce this complexity there exist many, mostly symmetric, reordering techniques that attempt to reduce the fill-in heuristically. Here we will present only one of these methods, the so-called nested dissection method. The main reason for selecting this method is that it can easily be used for parallel computations.

3.1 Multilevel Nested Dissection

Recursive multilevel nested dissection methods for direct decomposition methods were first introduced in the context of multiprocessing. If parallel direct methods are used to solve a sparse system of equations, then a graph partitioning algorithm can be used to compute a fill-reducing ordering that leads to a high degree of concurrency in the factorization phase.

Definition 8

For a matrix \(A\in \mathbb {R}^{n,n}\) we define the associated (directed) graph G d(A) = (V, E), where V = {1, …, n} and the set of edges \(E=\left \{(i,j)|\, a_{ij}\neq 0\right \}\). The (undirected) graph is given by G d(|A| + |A|T) and is denoted simply by G(A).

In graph terminology for a sparse matrix A we simply have a directed edge (i, j) for any nonzero entry a ij in G d(A), whereas the orientation of the edge is ignored in G(A).

The research on graph partitioning methods in the mid-nineties has resulted in high-quality software packages, e.g., METIS [25]. These methods often compute orderings that on the one hand lead to small fill-in for (incomplete) factorization methods, while on the other hand they provide a high level of concurrency. We will briefly review the main idea of multilevel nested dissection in terms of graph partitioning.

Definition 9

Let \(A\in \mathbb {R}^{n,n}\) and consider its graph G(A) = (V, E). A k-way graph partitioning consists of partitioning V  into k disjoint subsets V 1, V 2, …, V k such that V i ∩ V j = ∅ for i ≠ j and ⋃ i V i = V . The subset E s = E ∩⋃ i≠j(V i × V j) is called edge separator.

Typically we want a k-way partitioning to be balanced, i.e., each V i should satisfy |V i|≈ n∕k. The edge separator E s refers to the edges that have to be removed from the graph in order to obtain k separate subgraphs associated with V 1, …, V k; the number of elements of E s is usually referred to as the edge-cut.

Definition 10

Given \(A\in \mathbb {R}^{n,n}\), a vertex separator V s of G(A) = (V, E) is a set of vertices such that there exists a k-way partitioning V 1, V 2, …, V k of V ∖ V s having no edge e ∈ V i × V j for i ≠ j.

A useful vertex separator V s should not only separate G(A) into k independent subgraphs associated with V 1, …, V k; it is also intended that the number of edges \(\sum _{i=1}^{k} |\{ (v,s)\in E \mid v \in V_i,\ s \in V_s\}|\) connecting the parts V i with the separator V s is small.

Nested dissection recursively splits a graph G(A) = (V, E) into almost equal parts by constructing a vertex separator V s until the desired number k of partitions is obtained. If k is a power of 2, then a natural way of obtaining a vertex separator is to first compute a 2-way partitioning of the graph, a so-called graph bisection, with its associated edge separator E s. After that a vertex separator V s is computed from E s, which gives a 2-way partitioning V 1, V 2 of V ∖ V s. This process is then repeated separately for the subgraphs associated with V 1, V 2 until eventually a k = 2l-way partitioning is obtained. For the reordering of the underlying matrix A, the vertices associated with V 1 are taken first, followed by V 2 and V s. This reordering is repeated analogously during the repeated bisection of each V i; a sketch of the resulting ordering scheme is given below. In general, vertex separators of small size result in low fill-in.
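The following Python sketch summarizes this ordering scheme. The function bisect is a placeholder for any 2-way partitioner returning (V1, V2, Vs) with vertex separator Vs; production codes such as METIS or SCOTCH realize it by the multilevel bisection discussed next.

```python
# Sketch of the nested dissection ordering scheme described above.  The
# function bisect(G, nodes) is a placeholder for any 2-way partitioner that
# returns (V1, V2, Vs) with vertex separator Vs; production codes (METIS,
# SCOTCH) realize it by multilevel bisection.

def nested_dissection_order(G, nodes, bisect, min_size=4):
    if len(nodes) <= min_size:
        return list(nodes)          # small sets: any (fill-reducing) local order
    V1, V2, Vs = bisect(G, nodes)
    # order the two halves recursively and place the separator vertices last
    return (nested_dissection_order(G, V1, bisect, min_size)
            + nested_dissection_order(G, V2, bisect, min_size)
            + list(Vs))
```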

Example 2: Vertex Separators To illustrate vertex separators, we consider the reordered matrix ΠTA from Fig. 1 after a matching is applied. In Fig. 4 we display its graph G( ΠTA) ignoring the orientation of the edges. A 2-way partitioning is obtained with V 1 = {3, 5}, V 2 = {2, 6}, and a vertex separator V s = {1, 4}. The associated reordering refers to taking the rows and the columns of ΠTA in the order 3, 5, 2, 6, 1, 4.

Fig. 4
figure 4

A 2-way partition with vertex separator V s = {1, 4} and the associated reordered matrix placing the two rows and columns associated with V s to the end

Since a naive approach to computing a recursive graph bisection is typically computationally expensive, combinatorial multilevel graph bisection has been used to accelerate the process. The basic structure is simple. The multilevel approach consists of three phases: at first there is a coarsening phase which compresses the given graph successively, level by level, by about half of its size. When the coarsest graph with about a few hundred vertices is reached, the second phase, the so-called partitioning phase, applies a high-quality bisection algorithm. After that, during the uncoarsening phase, the given bisection is successively refined as it is prolongated towards the original graph.

3.1.1 Coarsening Phase

The initial graph G 0 = (V 0, E 0) = G(A) of \(A\in \mathbb {R}^{n,n}\) is transformed during the coarsening phase into a sequence of graphs G 1, G 2, …, G m of decreasing size such that |V 0|≫|V 1|≫|V 2|≫⋯ ≫|V m|. Given the graph G i = (V i, E i), the next coarser graph G i+1 is obtained from G i by collapsing adjacent vertices. This can be done, e.g., by using a maximal matching \(\mathcal {M}_i\) of G i (cf. Definitions 3 and 4). Using \(\mathcal {M}_i\), the next coarser graph G i+1 is constructed from G i by collapsing the matched vertices into multinodes, i.e., the elements of \(\mathcal {M}_i\) together with the unmatched vertices of G i become the new vertices V i+1 of G i+1. The new edges E i+1 are the remaining edges from E i connected with the collapsed vertices. There are various strategies for constructing such maximal matchings [9, 25]. One of the most popular and efficient methods is heavy edge matching [25], sketched below.
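A minimal sketch of the greedy heavy-edge-matching step follows; this is our own simplified formulation, ignoring the vertex-visit ordering, tie-breaking, and randomization used in real partitioners.

```python
# Sketch of one coarsening step by greedy heavy edge matching.  adj[v] is a
# dict {neighbor: edge_weight}; each unmatched vertex is matched with the
# unmatched neighbor connected by the heaviest edge, and matched pairs then
# form the multinodes of the next coarser graph.

def heavy_edge_matching(adj):
    mate = {}
    for v in adj:
        if v in mate:
            continue
        free = [(w, u) for u, w in adj[v].items() if u not in mate]
        if free:
            _, u = max(free)        # heaviest edge to an unmatched neighbor
            mate[v], mate[u] = u, v
        else:
            mate[v] = v             # v stays a singleton multinode
    return mate
```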

3.1.2 Partitioning Phase

At the coarsest level m, a 2-way partitioning \(V_{m,1}\dot {\cup }V_{m,2}=V_m\) of G m = (V m, E m) is computed, each of them containing about half of the vertices of G m. This specific partitioning of G m can be obtained by using various algorithms such as spectral bisection [19] or combinatorial methods based on Kernighan–Lin variants [18, 27]. It is demonstrated in [25] that for the coarsest graph, combinatorial methods typically compute smaller edge-cut separators compared with spectral bisection methods. However, since the size of the coarsest graph G m is small (typically |V m| < 100), this step is negligible with respect to the total amount of computation time.

3.1.3 Uncoarsening Phase

Suppose that at the coarsest level m an edge separator E m,s of G m, associated with the 2-way partitioning, has been computed that has led to a sufficiently small edge-cut of G m with V m,1, V m,2 of almost equal size. Then E m,s is prolongated to G m−1 by reversing the process of collapsing matched vertices. This leads to an initial edge separator E m−1,s for G m−1. But since G m−1 is finer, E m−1,s is sub-optimal and one usually decreases the edge-cut of the partitioning by local refinement heuristics such as the Kernighan–Lin partitioning algorithm [27] or the Fiduccia–Mattheyses method [18]. Repeating this refinement procedure level by level, we obtain a sequence of edge separators E m,s, E m−1,s, …, E 0,s and eventually an edge separator E s = E 0,s of the initial graph G(A). If one seeks a vertex separator V s of G(A), then one usually computes V s from E s at the end.

There have been a number of methods that are used for graph partitioning, e.g., METIS [25], a parallel MPI version PARMETIS [26], or a recent multithreaded approach MT-METIS [30]. Another example for a parallel partitioning algorithm is SCOTCH [9].

Multilevel Nested Dissection We continue Example 1 using the matrix \(\tilde A=\Pi ^TD_rAD_c\) that has been rescaled and permuted using maximum weight matching. We illustrate in Fig. 5 how multilevel nested dissection changes the pattern of \(\hat A=P^T \tilde A P\), where P refers to the permutation matrix associated with the partitioning of \(G(\tilde A)\).

Fig. 5
figure 5

Application of multilevel nested dissection after the matrix is already rescaled and permuted using maximum weight matching

3.2 Other Reordering Methods

One of the first methods to reorder the system was the reverse Cuthill–McKee (RCM) method [10, 34], which attempts to reduce the bandwidth of a given matrix. Though this algorithm is still attractive for sequential methods and incomplete factorization methods, its use for direct solvers is considered obsolete. An attractive alternative to nested dissection as a reordering method for direct factorization methods is the multiple minimum degree algorithm (MMD) [20, 40] and its recent variants, in particular the approximate minimum degree algorithm (AMD) [2, 12] with or without constraints. The main objective of the minimum degree algorithm is to simulate the Gaussian elimination process symbolically by investigating the update process \(a_{ij}\to a_{ij}-a_{ik}a_{kk}^{-1}a_{kj}\) by means of graph theory, at least in the case of the undirected graph. The name-giving degree refers to the number of edges connected to a vertex and to how the graph, and therefore the degrees of its vertices, change during the factorization process. Over the years this has led to an evolution of the underlying minimum degree algorithm using the so-called external degree for selecting vertices as pivots and further techniques like incomplete degree update, element absorption, and multiple elimination, as well as data structures based on cliques. For an overview see [20]. One of the most costly parts of the minimum degree algorithm is the update of the degrees. Instead of computing the exact external degree, the approximate minimum degree algorithm [2] computes an approximate external degree, which saves significant time while producing comparable fill-in for the LU decomposition.

We would like to conclude this section by mentioning that if nested dissection is used to produce a vertex separator V s and a related k-way partitioning V 1, …, V k of the remaining vertices V ∖ V s of G(A) = (V, E), which allows for parallel computations, then the entries of each V i, i = 1, …, k, could be taken in any order. Certainly, inside each V i one could use nested dissection as well, which is the default choice in multilevel nested dissection methods. However, as soon as the coarsest graph G m is small enough (typically about 100 vertices), not only is the separator computed, but in addition the remaining entries of G m are reordered to yield a fill-reducing ordering. In both cases, for G m as well as for V 1, …, V k, one could alternatively use different reordering methods such as variants of the minimum degree algorithm. Indeed, for G m this is what the METIS software does. Furthermore, a reordering method such as the constrained approximate minimum degree algorithm is also suitable as a local reordering for V 1, …, V k as an alternative to nested dissection, taking into account the edges connected with V s (also referred to as the HALO structure), see, e.g., [38].

4 Sparse LU Decomposition

In this section we will assume that the given matrix \(A\in \mathbb {R}^{n,n}\) is nonsingular and that it can be factorized as A = LU, where L is a lower triangular matrix with unit diagonal and U is an upper triangular matrix. It is well known [21] that if A = LU with lower triangular L and upper triangular U, then in the generic case we will have G d(L + U) ⊇ G d(A), i.e., we will only get additional edges unless some entries cancel by “accident” during the elimination. In the sequel we will ignore such cancellations. Throughout this section we will always assume that the diagonal entries of A are nonzero as well. We also assume that G d(A) is connected.

In the preceding sections we have argued that maximum weight matching often leads to a rescaled and reordered matrix such that static pivoting is likely to be enough, i.e., pivoting is restricted to some dense blocks inside the LU factorization. Furthermore, reordering strategies such as multilevel nested dissection have further symmetrically permuted the system such that the fill-in that occurs during Gaussian elimination is acceptable and even parallel approaches could be drawn from this reordering. Thus assuming that A does not need further reordering and a factorization A = LU exists is a realistic scenario in what follows.

4.1 The Elimination Tree

The fill-in in the triangular factors L and U produced as a by-product of Gaussian elimination can be characterized as follows (see [23] and the references therein).

Theorem 2

Given A = LU with the aforementioned assumptions, there exists an edge (i, j) in G d(L + U) if and only if there exists a path

$$\displaystyle \begin{aligned} ix_1,\ x_1x_2,\ \dots,\ x_kj \end{aligned}$$

in G d(A) such that \(x_1,\dots ,x_k<\min (i,j)\).

In other words, during Gaussian elimination we obtain a fill edge (i, j) for every path from i to j through vertices less than \(\min (i,j)\).
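For small examples, Theorem 2 can be checked directly by a breadth-first search that only passes through vertices numbered below min(i, j); the following Python sketch is purely illustrative (symbolic factorization codes do not proceed this way) and assumes an adjacency-list representation of G d(A).

```python
from collections import deque

def fill_edge(adj, i, j):
    """Check the criterion of Theorem 2 by breadth-first search: is there a
    path from i to j in G_d(A) whose intermediate vertices are all smaller
    than min(i, j)?  adj[v] lists the direct successors of v in G_d(A)."""
    bound = min(i, j)
    seen, queue = {i}, deque([i])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w == j:
                return True          # (i, j) is an edge of G_d(L + U)
            if w < bound and w not in seen:
                seen.add(w)
                queue.append(w)
    return False
```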

Fill-in We will use the matrix ΠTA from Example 2 and sketch the fill-in obtained during Gaussian elimination in Fig. 6.

Fig. 6
figure 6

Fill-in with respect to L + U is denoted by ×

The fastest known method for predicting the filled graph G d(L + U) is Gaussian elimination itself. The situation is simplified if the graph is undirected. In the sequel we ignore the orientation of the edges and simply consider the undirected graphs G(A) and G(L + U), respectively.

Definition 11

The undirected graph G(L + U) that is derived from the undirected graph G(A) by applying Theorem 2 is called the filled graph and it will be denoted by G f(A).

Fill-in with Respect to the Undirected Graph When we consider the undirected graph G(A) from Example 4.1, i.e., the pattern of | ΠTA| + | ΠTA|T, its filled graph G f(A) now equals G(A) except for the additional fill positions (5, 4) and (4, 5) (cf. Fig. 7).

Fig. 7
figure 7

Entries of G(A) are denoted by filled circles, fill-in is denoted by ×

The key tool to predict the fill-in easily for the undirected graph is the elimination tree [33].

Recall that an undirected and connected graph is called a tree if it does not contain any cycle. Furthermore, one vertex is identified as the root. As usual we call a vertex j the parent of i if there exists an edge (i, j) in the tree such that j is closer to the root; in this case i is called a child of j. The subtree rooted at vertex j is denoted by T(j) and the vertices of this subtree are called descendants of j, whereas j is called their ancestor. Initially we will define the elimination tree algorithmically using the depth-first-search algorithm [1]. Later we will state a much simpler algorithm.

Definition 12

Given the filled graph G f(A) the elimination tree T(A) is defined by the following algorithm.

Perform a depth-first-search in G f(A) starting from vertex n.

When vertex m is visited, choose from its unvisited neighbors i 1, …, i k the index j with the largest number \(j=\max \{i_1,\dots ,i_k\}\) and continue the search with j.

A leaf of the tree is reached when all neighbors have already been visited.

We would like to point out that the application of the depth-first-search to G f(A) starting at vertex n behaves significantly differently than on general graphs. By Theorem 2 it follows that as soon as we visit a vertex m, all its neighbors j > m must have been visited prior to vertex m. Thus the labels of the vertices are strictly decreasing until we reach a leaf node.

Depth-First-Search We illustrate the depth-first-search using the (filled) graph in Fig. 8 and the pattern from Example 4.1. The extra fill edge is marked by a bold line.

Fig. 8
figure 8

Filled graph (left) and elimination tree (right)

The ongoing depth-first-search visits the vertices in the order 6 → 5 → 4 → 3. Since at vertex 3, all neighbors of 3 are visited (and indeed have a larger number), the algorithm backtracks to 4 and to 5 and continues the search in the order 5 → 2. Again all neighbors of vertex 2 are visited (and have larger number), thus the algorithm backtracks to 5 and to 6 and continues by 6 → 1. Then the algorithm terminates.

Remark 1

It follows immediately from the construction of T(A) and Theorem 2 that additional edges of G f(A) which are not covered by the elimination tree can only show up between a vertex and some of its ancestors (referred to as “back-edges”). In contrast to that, “cross-edges” between unrelated vertices do not exist.

Remark 2

One immediate consequence of Remark 1 is that triangular factors can be computed independently starting from the leaves until the vertices meet a common parent, i.e., column j of L and U T only depend on those columns s of L and U T such that s is a descendant of j in the elimination tree T(A).

Elimination Tree We use the matrix “west0479” from Example 3.1.3, after maximum weight matching and multilevel nested dissection have been applied. We use MATLAB’s etreeplot to display its elimination tree (see Fig. 9). The elimination tree displays the high level of concurrency that is induced by nested dissection, since by Remark 2 the computations can be executed independently at each leaf node towards the root until a common parent vertex is reached.

Fig. 9
figure 9

Elimination tree of “west0479” after maximum weight matching and nested dissection are applied

Further conclusions can be easily derived from the elimination tree, in particular Remark 2 in conjunction with Theorem 2.

Remark 3

Consider some k ∈{1, …, n}. Then there exists a (fill) edge (j, k) with j < k if and only if there exists a common descendant i of k, j in T(A) such that a ik ≠ 0. This follows from the fact that once a ik ≠ 0, by Theorem 2 this induces (fill) edges (j, k) in the filled graph G f(A) for all nodes j between i and k in the elimination tree T(A), i.e., for all ancestors of i that are also descendants of k. This way, i propagates fill-edges along the branch from i to k in T(A) and the information a ik ≠ 0 can be used as path compression to advance from i towards k along the elimination tree.

Path Compression Consider the graph and the elimination tree from Fig. 8. Since the edge (3, 5) exists in G(A), another (fill) edge (4, 5) must exist. Similarly, the same conclusion can be drawn from the existence of the edge (4, 6) (here not a fill edge, but a regular edge).

The elimination tree itself can easily be described by a vector p of length n such that for any i < n, p i denotes the parent node, while p n = 0 corresponds to the root. Consider some step k with a ik ≠ 0 for some i < k. By Remark 3, i must be a descendant of k, and there could be further ancestors j of i which are also descendants of k. Possibly not all ancestors of i have been assigned a parent node so far. Thus we can replace i by j = p i until we end up with p j = 0 or \(p_j\geqslant k\). This way we traverse T(A) from i towards k until we have found the child node j of k. If the parent of j has not been assigned yet, then p j = 0 and k must be the parent of j (if some l < k were the parent of j, we would already have assigned it in the earlier step l); in this case we set p j ← k. Otherwise, if \(p_j\geqslant k\), then j’s parent has already been assigned and nothing remains to be done.

Computation of Parent Nodes Consider the elimination tree T(A) from Fig. 8. Before step k = 4, no parents have been assigned, i.e., p i = 0 for all i.

Now for k = 4 we have a 34 ≠ 0, and since p 3 = 0 we set a 3 = p 3 ← 4.

For k = 5, a 25 ≠ 0 and again p 2 = 0 requires setting a 2 = p 2 ← 5. Next, a 35 ≠ 0; path compression sets a 3 ← 5 and after another loop we obtain a 4 = p 4 ← 5.

Finally, for k = 6, we have a 16 ≠ 0 and immediately obtain a 1 = p 1 ← 6. Since a 46 ≠ 0, path compression is applied, which yields a 4 ← 6, and in the next step we set a 5 = p 5 ← 6. At last, a 56 ≠ 0 does not cause further changes.

In total we have p = [6, 5, 4, 5, 6, 0] which perfectly reveals the parent properties of the elimination trees in Fig. 8.

By Remark 3 (cf. [12, 43]), we can also make use of path compression. Since our goal is to traverse the branch of the elimination tree from i to k as fast as possible, any ancestor j = a i of i would be sufficient. With the same argument as before, an ancestor a j = 0 refers to a vertex that does not have a parent yet; in this case we can again set p j ← k. Moreover, k is always an ancestor of a i, which justifies overwriting a i with k along the way (path compression).

The algorithm including path compression can be summarized as follows (see also [12, 33]).

Computation of the Elimination Tree
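The algorithm listing itself is not reproduced here; the following Python sketch of the parent computation with path compression follows the description above (cf. [12, 33]). It uses 0-based indices and a callback lower_nonzeros(k), our own notation, that yields the row indices i < k of the nonzeros in column k of the undirected pattern.

```python
# Sketch of the elimination tree computation with path compression described
# above (cf. [12, 33]).  Vertices are 0-based; lower_nonzeros(k) yields the
# indices i < k with a_ik != 0 in the undirected pattern of A; parent[v] == -1
# marks a vertex whose parent has not been assigned yet (the root keeps -1).

def elimination_tree(n, lower_nonzeros):
    parent = [-1] * n      # p_i: parent of i in T(A)
    ancestor = [-1] * n    # a_i: some already known ancestor of i
    for k in range(n):
        for i in lower_nonzeros(k):
            # climb from i towards k, redirecting ancestor pointers to k
            while ancestor[i] != -1 and ancestor[i] != k:
                nxt = ancestor[i]
                ancestor[i] = k    # path compression
                i = nxt
            if ancestor[i] == -1:  # i had no parent yet: k becomes its parent
                ancestor[i] = k
                parent[i] = k
    return parent
```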

4.2 The Supernodal Approach

We have already seen that the elimination tree reveals information about concurrency. It is further useful for determining the fill-in in L and U T. This information can be computed from the elimination tree T(A) together with G(A). The basis for determining the fill-in in each column is again Remark 3. Suppose we are interested in the nonzero entries of column j of L and U T. Then for every descendant i of j, i.e., every node of the subtree T(j) rooted at vertex j, a nonzero entry a ik ≠ 0 with k > j also implies l kj ≠ 0. Thus, starting at any leaf i, we obtain its fill from all a ik ≠ 0 such that k > i, and when we move forward from i to its parent j, vertex j inherits the fill from node i for all k > j plus the nonzero entries given by a jk ≠ 0 such that k > j. When we reach a common parent node k with multiple children, the same argument applies using the union of the fill-in with indices greater than k from its children together with the nonzero entries a kl ≠ 0 such that l > k. We summarize this result in a very simple algorithm.

Computation of Fill-in
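A minimal sketch of this fill computation follows, assuming the parent vector of the elimination tree from above and a callback upper_nonzeros(j), our own notation, that yields the indices k > j with a jk ≠ 0 (again 0-based and purely illustrative).

```python
# Sketch of the symbolic fill computation described above.  parent is the
# elimination tree vector; upper_nonzeros(j) yields the k > j with a_jk != 0
# in the undirected pattern (0-based).  P[j] collects the row indices of the
# nonzeros of column j of L (below the diagonal).

def column_patterns(n, parent, upper_nonzeros):
    P = [set(upper_nonzeros(j)) for j in range(n)]
    for j in range(n):             # children are processed before their parent
        p = parent[j]
        if p != -1:
            P[p].update(k for k in P[j] if k > p)   # parent inherits fill > p
    return P
```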

Algorithm 4.2 only deals with the fill pattern. One additional aspect that raises efficiency and speeds up the numerical factorization significantly is the detection of dense submatrices in the factorization. Block structures allow collecting parts of the matrix in dense blocks and treating them jointly using dense matrix kernels such as level-3 BLAS and LAPACK [13, 14].

Dense blocks can be read off from the elimination tree employing Algorithm 4.2.

Definition 13

Denote by \(\mathcal {P}_j\) the set of row indices of the nonzeros of column j of L as computed by Algorithm 4.2. A sequence k, k + 1, …, k + s − 1 is called a supernode of size s if \(\mathcal {P}_{j}=\mathcal {P}_{j+1}\cup \{j+1\}\) for all j = k, …, k + s − 2.

In simple words, Definition 13 states that for a supernode of size s, the s subsequent columns can be grouped together into one dense block with a triangular diagonal block and a dense subdiagonal block, since they perfectly match the associated trapezoidal shape. We can thus easily supplement Algorithm 4.2 with a supernode detection.

Computation of Fill-in and Supernodes
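Correspondingly, the supernode detection of Definition 13 can be sketched as a single pass over the patterns computed above; the following Python fragment (0-based indices, our own illustrative formulation) records the first column of each supernode.

```python
# Sketch of the supernode detection of Definition 13 (0-based indices): the
# patterns P[j] computed above are scanned once, and a new supernode is opened
# whenever the trapezoidal-block condition P[j] = P[j+1] ∪ {j+1} fails.

def find_supernodes(P):
    starts = [0]                    # first column of each supernode
    for j in range(len(P) - 1):
        if P[j] != (P[j + 1] | {j + 1}):
            starts.append(j + 1)    # column j+1 opens a new supernode
    return starts

# For the pattern of Fig. 10 this yields three scalar supernodes followed by
# one supernode of size three.
```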

Supernode Computation To illustrate the use of supernodes, we consider the matrix pattern from Fig. 7 and show the underlying dense block structure in Fig. 10. The supernodes are the columns 1, 2, 3 as scalar columns, as well as columns 4–6 forming one single supernode.

Fig. 10
figure 10

Supernodes in the triangular factor

Supernodes form the basis of several improvements; e.g., a supernode can be stored as one or two dense matrices. Besides this dense storage scheme, the nonzero row indices for these blocks need only be stored once. Moreover, the use of dense submatrices allows the usage of dense matrix kernels based on level-3 BLAS [13, 14].

Supernodes We use the matrix “west0479” from Example 3.1.3, after maximum weight matching and multilevel nested dissection have been applied. We use its undirected graph to compute the supernodal structure. Certainly, since the matrix is nonsymmetric, the block structure is only sub-optimal. We display the supernodal structure for the associated Cholesky factor, i.e., for the Cholesky factor of a symmetric positive definite matrix with same undirected graph as our matrix (see left part of Fig. 11). Furthermore, we display the supernodal structure for the factors L and U computed from the nonsymmetric matrix without pivoting (see right part of Fig. 11).

Fig. 11
figure 11

Supernodal structure. Left: vertical lines display the blocking of the supernodes with respect to the associated Cholesky factor. Right: vertical and horizontal lines display the blocking of the supernodes applied to L and U

While the construction of supernodes is fairly easy in the symmetric case, its generalization for the nonsymmetric case is significantly harder, since one has to deal with pivoting in each step of Gaussian elimination. In this case one uses the column elimination tree [22].

5 Sparse Direct Solvers—Supernodal Data Structures

High-performance sparse solver libraries have been a very important part of scientific and engineering computing for years, and their importance continues to grow as microprocessor architectures become more complex and software libraries become better designed to integrate easily within applications. Despite the variety of science and engineering applications, the underlying algorithms typically have remarkable similarities, especially those algorithms that are most challenging to implement well in parallel. It is not too strong a statement to say that these software libraries are essential to the broad success of scalable high-performance computing in the computational sciences. In this section we demonstrate the benefit of supernodal data structures within the sparse solver package PARDISO [42], using the triangular solution process as an illustration. The forward and backward substitutions are performed column-wise with respect to the columns of L, starting with the first column, as depicted in Fig. 12. The data dependencies here allow storing the vectors y, z, b, and x in only one vector r. When column j is reached, r j contains the solution for y j. All other elements of L in this column, i.e., L ij with i = j + 1, …, N, are used to update the remaining entries in r by

$$\displaystyle \begin{aligned} r_i = r_i - r_j L_{ij}. {} \end{aligned} $$
(8)

The backward substitution with L T is performed row-wise with respect to L T; since we store L, this again amounts to a column-wise sweep over L, as shown in the lower part of Fig. 12. In contrast to the forward substitution, the iteration over the columns starts at the last column N and proceeds to the first one. When column j is reached, r j, which contains the j-th component of the solution vector, x j, is computed by subtracting from it the dot-product of the remaining elements L ij of the column and the corresponding elements r i with i = j + 1, …, N:

$$\displaystyle \begin{aligned} r_j = r_j - \sum_{i=j+1}^{N} L_{ij}\, r_i . {} \end{aligned} $$
(9)

After all columns have been processed, r contains the required solution x. It is important to note that line 5 of the algorithms represents, for the two substitutions, an indexed DAXPY and an indexed DDOT kernel operation, respectively, that has to be computed while streaming the vector r and column j of the numerical factor L. As we are dealing with sparse matrices, it makes no sense to store the lower triangular matrix L as a dense matrix. Hence, PARDISO uses its own data structure to store L, as shown in Fig. 13.
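For reference, here is a minimal Python sketch of the two column-wise substitutions (8) and (9) for a unit lower triangular L in a plain compressed sparse column layout (the array names colptr, rowind, val are our own); PARDISO itself operates on the panel structure of Fig. 13 instead.

```python
# Minimal sketch of the column-wise substitutions (8) and (9).  L is unit
# lower triangular; colptr/rowind/val store only its strictly lower triangular
# part in compressed sparse column (CSC) form, and r is overwritten in place.

def forward_substitution(n, colptr, rowind, val, r):
    for j in range(n):                        # r[j] is final once column j is reached
        for p in range(colptr[j], colptr[j + 1]):
            r[rowind[p]] -= r[j] * val[p]     # indexed DAXPY, cf. Eq. (8)

def backward_substitution(n, colptr, rowind, val, r):
    for j in range(n - 1, -1, -1):            # solve with L^T, last column first
        s = 0.0
        for p in range(colptr[j], colptr[j + 1]):
            s += val[p] * r[rowind[p]]        # indexed DDOT, cf. Eq. (9)
        r[j] -= s
```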

Fig. 12
figure 12

Sparse triangular substitution in CSC format based on indexed DAXPY/DDOT kernel operations

Fig. 13
figure 13

Sparse matrix data structures in PARDISO. Adjacent columns of L exhibiting the same structure form panels also known as supernodes. Groups of panels which touch independent elements of the right-hand side r are parts. The last part in the lower triangular matrix L is called separator

Algorithm 1 Forward substitution in PARDISO. Note that in case of serial execution separated updates to temporary arrays in Lines 10–13 are not necessary and can be handled via the loop in Lines 6–9

Adjacent columns exhibiting the same row sparsity structure form a panel, also known as a supernode. A panel’s column count is called the panel size n p. The columns of a panel are stored consecutively in memory, excluding the zero entries. Note that the columns of a panel are padded in the front with zeros so that they get the same length as the first column inside their panel. This padding is of utmost importance for the performance of the PARDISO solver, since it enables the use of level-3 BLAS and LAPACK functionality [41]. Furthermore, the panels are stored consecutively in the l array. Row and column information is stored in accompanying arrays. The xsuper array stores for each panel the index of its first column. Also note that here column indices are the running count of nonzero columns. Column indices are used as indices into the xl array to look up the start of the column in the l array, which contains the numerical values of the factor L. To determine the row index of a column’s element, an additional array id is used, which holds the row indices for each panel. The start of a panel inside id is found via the xid array; the first row index of panel p is id[xid[p]]. For serial execution this information is sufficient. However, during parallel forward/backward substitution, concurrent updates to the same entry of r must be avoided. The parts structure contains the start (and end) indices of the panels which can be updated independently, as they do not touch the same entries of r. Two parts, colored blue and orange, are shown in Fig. 13. The last part in the bottom right corner of L is special; it is called the separator and is colored green. Parts which would touch entries of r in the range of the separator perform their updates into separate temporary arrays t. Before the separator is then updated serially, the results of the temporary arrays are gathered back into r. The backward substitution works analogously, just in reverse order, except that updates into separate temporary arrays are not required. The complete forward and backward substitutions are listed in Algorithms 1 and 2.
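To illustrate how the arrays described above fit together, the following much simplified serial Python sketch traverses the panels during forward substitution. It mirrors the description of l, xl, xsuper, id, and xid given here, but omits the BLAS calls as well as the parallel handling of parts, separator, and temporary arrays, and should not be read as PARDISO's actual implementation.

```python
# Much simplified serial sketch of a panel-wise forward substitution over the
# layout of Fig. 13.  xsuper[p] is the first column of panel p, xl[j] the start
# of column j inside the value array l, and id_[xid[p]:xid[p+1]] the row
# indices shared by all columns of panel p.  Padding entries are simply
# skipped; BLAS calls and the parallel parts/separator logic are omitted.

def panel_forward_substitution(nsuper, xsuper, xl, l, xid, id_, r):
    for p in range(nsuper):
        first, last = xsuper[p], xsuper[p + 1]      # columns of panel p
        rows = id_[xid[p]:xid[p + 1]]               # row indices of the panel
        for j in range(first, last):
            col = l[xl[j]:xl[j + 1]]                # padded column j of L
            off = j - first                         # number of leading padding zeros
            for k in range(off + 1, len(col)):      # below the (unit) diagonal
                r[rows[k]] -= r[j] * col[k]         # cf. Eq. (8)
    return r
```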

Algorithm 2 Backward substitution in PARDISO. Separator (sep.), parts, and panels are iterated over in reversed (rev.) order

6 Application—Circuit Simulation

In this section we demonstrate how these developments in sparse direct linear solvers have advanced integrated circuit simulations. Integrated circuits are composed of interconnected transistors. The interconnects are modeled primarily with resistors, capacitors, and inductors. The interconnects route signals through the circuit, and also deliver power. Circuit equations arise out of Kirchhoff’s current law, applied at each node, and are generally nonlinear differential-algebraic equations. In transient simulation of the circuit, the differential portion is handled by discretizing the time derivative of the node charge by an implicit integration formula. The associated set of nonlinear equations is handled through use of quasi-Newton methods or continuation methods, which change the nonlinear problem into a series of linear algebraic solutions. Each component in the circuit contributes only to a few equations. Hence, the resulting systems of linear algebraic equations are extremely sparse, and most reliably solved by using direct sparse matrix techniques. Circuit simulation matrices are peculiar in the universe of matrices, having the following characteristics [11]:

  • they are nonsymmetric, although often nearly structurally symmetric;

  • they have a few dense rows and columns (e.g., power and ground connections);

  • they are very sparse, and the straightforward usage of BLAS routines (as in SuperLU [32]) may be ineffective;

  • their LU factors remain sparse if well-ordered;

  • they can have high fill-in if ordered with typical strategies;

  • and being unstructured, the highly irregular memory access causes the factorization to proceed at only a few percent of the peak flop-rate.

Circuit simulation matrices also vary from being positive definite to being extremely ill-conditioned, making pivoting for stability important also. As circuit size increases, and depending on how much of the interconnect is modeled, sparse matrix factorization is the dominant cost in the transient analysis.

To overcome the complexity of matrix factorization a new class of simulators arose in the 1990s, called fast-SPICE [39]. These simulators partition the circuit into subcircuits and use a variety of techniques, including model order reduction and multirate integration, to overcome the matrix bottleneck. However, the resulting simulation methods generally incur unacceptable errors for analog and tightly coupled circuits. As accuracy demands increase, these techniques become much slower than traditional SPICE methods. Even so, since much of the research effort was directed at fast-SPICE simulators, it brought some relief from impossibly slow simulations when some accuracy trade-off was acceptable. Because these simulators partitioned the circuit, and did not require the simultaneous solution of the entire system of linear equations at any given time, they did not push the state of the art in sparse matrix solvers.

Starting in the mid-2000s, increasing demands on accuracy, due to advancing semiconductor technology, brought attention back to traditional SPICE techniques. This was aided by the proliferation of multicore CPUs. Parallel circuit simulation, an area of much research focus in the 1980s and 1990s, but not particularly in practice, received renewed interest as a way to speed up simulation without sacrificing accuracy. Along with improved implementations to avoid cache misses, rearchitecture of code for parallel computing, and better techniques for exploitation of circuit latency, improved sparse matrix solvers, most notably the release of KLU [11], played a crucial role in expanding the utility of SPICE.

Along with the ability to simulate ever larger circuits with full SPICE accuracy came the opportunity to further improve sparse matrix techniques. A sparse matrix package for transient simulation needs to have the following features:

  • must be parallel;

  • fast matrix reordering;

  • incremental update of the L and U factors when only a few nonzeros change;

  • fast computation of the diagonal entries of the inverse matrix;

  • fast computation of Schur-complements for a submatrix;

  • allow for multiple LU factors of the same structure to be stored;

  • use the best-in-class method across the spectrum of sparsity;

  • use iterative solvers with fast construction of sparse preconditioners;

  • run on various hardware platforms (e.g., GPU acceleration).

Some of these features must be available in a single package. Others, such as iterative solvers and construction of preconditioners, can be implemented with a combination of different packages. The PARDISO solver stands out as a package that does most of these very well. Here we touch on a few of these features.

When applied in the simulation of very large circuits, the difference between a “good” and a “bad” matrix ordering can be the difference between seconds and days. PARDISO offers AMD and nested dissection methods for matrix ordering, as well as permitting user-defined orderings. Because the matrix reordering method which has been used most often in circuit simulation is due to Markowitz [35], and because modern sparse matrix packages do not include this ordering method, we briefly describe it here. The Markowitz method is quite well adapted to circuit simulation. Some desirable aspects of the typical implementation of the Markowitz method, as opposed to the MD variants, are that it works for nonsymmetric matrices and combines the pivot choice with the numerical decomposition, such that a chosen pivot is numerically “good” and, in a local sense, generates the least fill-in at that step of the decomposition. Choosing pivots based on the Markowitz score often produces very good results: near-minimal fill-in, unfortunately at the cost of an O(n 3) algorithm (for dense blocks). Even though the Markowitz algorithm has some good properties when applied to circuit matrices, the complexity of the algorithm has become quite burdensome. When SPICE [36] was originally conceived, a hundred-node circuit was huge and the Markowitz algorithm was not a problem. Now we routinely see netlists with hundreds of thousands of nodes and post-layout netlists with millions of elements. As matrix order and element counts increase, Markowitz reordering time can become an obstruction. Even as improved implementations of the Markowitz method have extended its reach, AMD and nested dissection have become the mainstay of simulation of large denser-than-usual matrices.
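For illustration, the Markowitz criterion can be sketched as follows; the threshold-pivoting test and the data structures are our own simplifications and are not meant to reflect any particular simulator's implementation.

```python
# Illustrative sketch of Markowitz pivot selection: among the numerically
# acceptable candidates of the active submatrix, pick the entry (i, j) that
# minimizes (r_i - 1) * (c_j - 1), where r_i and c_j count the remaining
# nonzeros in row i and column j.

def markowitz_pivot(active, row_counts, col_counts, tol=1e-3):
    """active: dict {(i, j): value} of the remaining submatrix."""
    col_max = {}
    for (i, j), v in active.items():
        col_max[j] = max(col_max.get(j, 0.0), abs(v))
    best, best_score = None, None
    for (i, j), v in active.items():
        if abs(v) < tol * col_max[j]:               # threshold test for stability
            continue
        score = (row_counts[i] - 1) * (col_counts[j] - 1)
        if best_score is None or score < best_score:
            best, best_score = (i, j), score
    return best
```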

Next we turn our attention to parallel performance. While KLU remains a benchmark for serial solvers, for parallel solvers MKL-PARDISO is often cited as the benchmark [6, 8]. To give the reader a sense of the progress in parallel sparse matrix methods, in Fig. 14 we compare KLU and PARDISO (version 6.2) to MKL-PARDISO on up to 16 cores of an Intel Xeon E7-4880 architecture with 2.5 GHz processors.

Fig. 14
figure 14

Performance improvements of PARDISO 6.2 against Intel MKL PARDISO (one thread) for various circuit simulation matrices

Some of the matrices here can be obtained from the SuiteSparse Matrix Collection, and arise in transistor level full-chip and memory array simulations. It is clear that implementation of sparse matrix solvers has improved significantly over the years.

Exploiting latency in all parts of the SPICE algorithm is very important for enabling accurate circuit simulation, especially as the circuit size increases. By latency we mean that only a few entries in the matrix change from one Newton iteration to the next, and from one timepoint to the next. As the matrix depends on the time-step, some simulators hold the time-step constant as long as feasible to allow increased reuse of matrix factorizations. The nonzero entries of the matrix change only when the transistors and other nonlinear devices change their operating point. In most circuits, very few devices change state from one iteration to the next and from one time-step to the next. Nonzeros contributed by entirely linear components do not change value during the simulation. This makes incremental LU factorization a very useful feature of any matrix solver used in circuit simulation. As of April 2019, PARDISO 6.2 provides a very efficient incremental LU factorization, both serial and parallel. In Fig. 15 we show that PARDISO scales linearly with the number of updated columns, and also scales well with the number of cores. Here, the series of matrices was obtained from a full simulation of a post-layout circuit that includes all interconnects, power and ground networks. The factorization time is plotted against the number of columns that changed compared to the previous factorization. The scatter plot shows the number of rank-k updates and the corresponding factorization time in milliseconds. The regression analysis clearly demonstrates a linear trend both for the single-core and the multi-core versions. The dashed line shows the time for the full factorization.

Fig. 15
figure 15

Regression analysis on the rank-k update LU factorization in PARDISO

Another recent useful feature in PARDISO is the parallel selective inverse matrix computation, as demonstrated in Table 1. In circuit simulation, the diagonal of the inverse matrix is the driving point impedance. It is often required to flag nodes in the circuit with very high driving point impedance; such nodes would indicate failed interfaces between different subcircuits, leading to undefined states and high current leakage and power dissipation. A naive approach is to solve for the driving point impedance, the diagonal of the inverse matrix, by N triangular solves. This is sometimes unacceptably expensive even when exploiting the sparsity of the right-hand side and minimizing the number of entries needed in the diagonal of the inverse. To bypass this complexity, heuristics to compute the impedance of connected components are used, but this is error prone, with many false positives and also false negatives. For the circuit Freescale, e.g., PARDISO finished the required impedance calculations in 11.9 s, compared to the traditional computation that consumed 162.9 h.

Table 1 Details of the benchmark matrices

The productivity gap in simulation continues to grow, and challenges remain. Signoff simulations demand a 10× speedup in sparse matrix factorization. Simply using more cores does not help unless the matrices are very large and complex. For a majority of simulations, scaling beyond eight cores is difficult. As a result, some of these simulations can take a few months to complete, making them essentially impossible. Some of the problems in parallelizing sparse matrix operations for circuit simulation are fundamental; others may be related to implementation. Research on sparse matrix factorization for circuit simulation continues to draw attention, especially in the area of acceleration with Intel’s many integrated core (MIC) architecture [6] and GPUs [7, 31]. Other techniques for acceleration include improved preconditioners for iterative solvers [44]. We are presently addressing the need for runtime selection of optimal factorization strategies, and also GPU acceleration. Given that circuits present a wide spectrum of matrices, no matter how we categorize them, it is possible to obtain a solver that is 2–10× better on a given problem. Improvements in parallel sparse matrix factorization targeted at circuit simulation are more necessary today than ever and will continue to drive the applicability of traditional SPICE simulation methods. The availability of sparse matrix packages such as PARDISO that completely satisfy the needs of various circuit simulation methods is necessary for continued performance gains.