
1 Introduction

An essential task in several scientific and engineering applications is the solution of systems of linear equations in the form \(Ax= b\), where A is an \(n\times n\) large-scale sparse matrix, x is the unknown n-vector solution that is sought, and b is a known n-vector. Solving such systems is generally the part of a simulation that demands a high processing time [1].

Modern hierarchical memory architectures and paging policies favor programs that take locality of reference into consideration [1, 2]. Thus, spatial locality (i.e., a sequence of recent memory references clustered locally rather than scattered randomly across the memory address space) is relevant when designing an algorithm in this context.

For the low-cost solution of large and sparse linear systems, an appropriate vertex labeling of the corresponding graph is desirable to ensure that the coefficient matrix A has narrow bandwidth and small profile. Specifically, heuristics for bandwidth or profile reduction return a sequence of graph vertices with spatial locality. Thus, these heuristics are used to achieve low computational costs when solving large sparse systems of linear equations [1, 2].

The bandwidth reduction problem consists of labeling the vertices of a graph with positive integer labels so as to minimize the maximum absolute difference between labels of adjacent vertices. It is equivalent to the problem of reordering the rows and columns of a symmetric matrix so that the non-null coefficients lie as close as possible to the main diagonal [3]. Let \(A = [a_{ij}]\) be an \(n \times n\) symmetric matrix associated with a connected undirected graph \(G=(V,E)\) composed of a set of vertices V and a set of edges E. The bandwidth of row i is \(\beta _i(A) = i - \min \limits _{1 \le j \le i}[j : a_{ij} \ne 0]\), for \(a_{ii} \ne 0\). The bandwidth \(\beta (A)\) is the largest distance between a non-null coefficient of the lower triangular part and the main diagonal over all rows of the matrix, that is, \(\beta (A) = \max \limits _{1 \le i \le n}[\beta _i(A)]\). Equivalently, the bandwidth of G for a vertex labeling \(S=\{s(v_1), s(v_2), \cdots , s(v_{|V|})\}\) (i.e., a bijective mapping \(s:V\rightarrow \{1,2,\cdots ,|V|\}\)) is \(\beta (G)=\max \limits _{\{v,u\}\in E}|s(v)-s(u)|\), where s(v) and s(u) are the labels of vertices v and u, respectively. The profile of A can be defined as \(profile(A)=\sum _{i=1}^n \beta _i(A)\) or, equivalently, \(profile(G)=\sum \limits _{v\in V}\max \limits _{\{v,u\}\in E}[|s(v)-s(u)|]\). The bandwidth and profile minimization problems are NP-hard [4, 5] and, since the mid-1960s, several heuristics have been proposed to solve the bandwidth and profile reduction problems [2, 6,7,8] due to the extensive connections between these problems and a large number of other scientific and engineering problems [1].
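For concreteness, the sketch below (an illustrative C++ fragment, not the implementation used in our experiments) computes \(\beta (G)\) and profile(G) from an edge list and a vertex labeling s; all names are hypothetical.

```cpp
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <utility>
#include <vector>

// Computes the bandwidth beta(G) and profile(G) of a connected undirected
// graph under a labeling s (a permutation of 1..|V|, indexed by 0-based vertex).
// edges holds pairs {v, u} of 0-based vertex indices.
// Illustrative sketch only; assumes every vertex appears in at least one edge.
std::pair<int, long long> bandwidth_and_profile(
    int num_vertices,
    const std::vector<std::pair<int, int>>& edges,
    const std::vector<int>& s) {
  int bandwidth = 0;
  // max_diff[v] = max |s(v) - s(u)| over neighbors u of v.
  std::vector<int> max_diff(num_vertices, 0);
  for (const auto& e : edges) {
    const int diff = std::abs(s[e.first] - s[e.second]);
    bandwidth = std::max(bandwidth, diff);
    max_diff[e.first] = std::max(max_diff[e.first], diff);
    max_diff[e.second] = std::max(max_diff[e.second], diff);
  }
  long long profile = 0;
  for (int v = 0; v < num_vertices; ++v) profile += max_diff[v];
  return {bandwidth, profile};
}

int main() {
  // Path graph 0-1-2-3 with the identity labeling 1, 2, 3, 4.
  const std::vector<std::pair<int, int>> edges = {{0, 1}, {1, 2}, {2, 3}};
  const std::vector<int> s = {1, 2, 3, 4};
  const auto [beta, prof] = bandwidth_and_profile(4, edges, s);
  std::cout << "beta = " << beta << ", profile = " << prof << "\n";  // beta = 1, profile = 4
  return 0;
}
```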

A prominent method for solving large sparse systems of linear equations whose coefficient matrices are symmetric and positive definite is the preconditioned conjugate gradient method (CGM). The number of iterations of the CG method depends on the structure of the matrix A, its condition number, the accuracy to be achieved, and the preconditioner used [1]. Thus, the most important result of a reordering algorithm used in conjunction with a preconditioner is to return a matrix A with a reduced condition number (or, equivalently, a better clustering of eigenvalues). The reordering algorithm should also provide spatial locality. Moreover, the application should use a data structure that depends neither on the bandwidth nor on the profile of the matrix A, as is the case in the experiments presented in this paper. Consequently, the resulting linear system is much easier to solve than the original linear system, even when the linear system is composed of multiple right-hand side vectors [1].
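The classical a priori error bound for the (preconditioned) conjugate gradient method makes this dependence on the condition number explicit. Writing \(\kappa \) for the spectral condition number of the (preconditioned) matrix, \(x_k\) for the k-th iterate, and \(\Vert \cdot \Vert _A\) for the energy norm, the bound is

\(\Vert x - x_k\Vert _A \le 2\left( \frac{\sqrt{\kappa }-1}{\sqrt{\kappa }+1}\right) ^{k} \Vert x - x_0\Vert _A .\)

Hence, a combination of reordering and preconditioning that reduces \(\kappa \) (or clusters the eigenvalues) tends to reduce the number of iterations required to reach a given accuracy.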

The choice of preconditioner to speed up the conjugate gradient method depends on the problem at hand. Among the various preconditioners proposed for the conjugate gradient method (e.g., see [9]), the incomplete Cholesky (IC) factorization is especially useful [1]. The IC(l) preconditioner with \(l>0\) obtains a better approximation to A than IC(0), at the cost of increased memory requirements and processing times. Thus, we used the IC(0) preconditioner. The zero-fill incomplete Cholesky-preconditioned conjugate gradient method (ICCGM) has been employed with promising results in the solution of various problems (see [1] and references therein).
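To illustrate the zero-fill idea, the following didactic sketch computes an IC(0) factor of a dense-stored symmetric positive definite matrix: the factor L keeps exactly the sparsity pattern of the lower triangle of A, and any fill-in is discarded. This is an illustrative reference version (without breakdown safeguards), not the CSR-based implementation used in our experiments.

```cpp
#include <cmath>
#include <vector>

// Zero-fill incomplete Cholesky, IC(0), on a dense-stored SPD matrix A (n x n).
// Returns L (lower triangular) such that L*L^T approximates A and
// L[i][j] == 0 wherever A[i][j] == 0 (i > j): fill-in entries are discarded.
// Didactic sketch only: no safeguards against pivot breakdown.
std::vector<std::vector<double>> ic0(const std::vector<std::vector<double>>& A) {
  const int n = static_cast<int>(A.size());
  std::vector<std::vector<double>> L(n, std::vector<double>(n, 0.0));
  for (int j = 0; j < n; ++j) {
    double diag = A[j][j];
    for (int k = 0; k < j; ++k) diag -= L[j][k] * L[j][k];
    L[j][j] = std::sqrt(diag);  // may break down (diag <= 0) for some matrices
    for (int i = j + 1; i < n; ++i) {
      if (A[i][j] == 0.0) continue;  // zero fill: stay inside the pattern of A
      double s = A[i][j];
      for (int k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
      L[i][j] = s / L[j][j];
    }
  }
  return L;
}
```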

One can reduce the computational cost of the conjugate gradient method by applying an adequate ordering to the vertices of the graph associated with A, thereby improving cache hit rates. Such an ordering can be obtained using a heuristic for bandwidth reduction (see [1] and references therein).
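For reference, the fragment below sketches the core of the reverse Cuthill-McKee idea (omitting the George-Liu pseudo-peripheral starting vertex used by the RCM-GL method); the adjacency-list representation and the names are illustrative.

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Reverse Cuthill-McKee ordering of a connected undirected graph given as
// adjacency lists. Returns a permutation order such that order[k] is the
// vertex placed at position k. Sketch only: the starting vertex is supplied
// by the caller (RCM-GL would instead pick a pseudo-peripheral vertex via
// the George-Liu algorithm).
std::vector<int> rcm(const std::vector<std::vector<int>>& adj, int start) {
  const int n = static_cast<int>(adj.size());
  std::vector<int> order;
  order.reserve(n);
  std::vector<bool> visited(n, false);
  std::queue<int> bfs_queue;
  bfs_queue.push(start);
  visited[start] = true;
  while (!bfs_queue.empty()) {
    const int v = bfs_queue.front();
    bfs_queue.pop();
    order.push_back(v);
    // Enqueue the unnumbered neighbors of v in increasing degree order.
    std::vector<int> neighbors;
    for (int u : adj[v])
      if (!visited[u]) neighbors.push_back(u);
    std::sort(neighbors.begin(), neighbors.end(),
              [&adj](int a, int b) { return adj[a].size() < adj[b].size(); });
    for (int u : neighbors) {
      visited[u] = true;
      bfs_queue.push(u);
    }
  }
  std::reverse(order.begin(), order.end());  // the "reverse" in RCM
  return order;
}
```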

Systematic reviews [2, 6,7,8] identified almost 140 heuristics for bandwidth and profile reductions, and this number increases each year [1, 10, 11]. Therefore, an important decision in this context is to choose the best reordering algorithm among several alternatives for the problem at hand. In previous publications [10,11,12,13], various heuristics for bandwidth and profile reductions were evaluated with the intention of reducing the execution cost of the Jacobi-preconditioned conjugate gradient method. Additionally, a recent publication [1] evaluated several heuristics for bandwidth and profile reductions and identified the most promising ones for several application areas when reducing the computational cost of the ICCG method. Among the 55 instances (arising from several application areas) used in that publication [1], only three linear systems originated from computational fluid dynamics (CFD) problems. Thus, the present work aims to evaluate the same 14 heuristics for bandwidth and profile reductions to speed up the ICCG method when applied to several instances arising from this application area.

To provide more specific details, systematic reviews [1, 2, 6,7,8] reported 12 promising low-cost heuristics for bandwidth (Burgess-Lai [14], FNCHC [15], GGPS [16], VNS-band [17], KP-band [3], CSS-band [18], and RBFS-GL [1]) and profile (Snay [19], Sloan [20], MPG [21], NSloan [22], and Sloan-MGPS [23]) reductions. Additionally, the publication [1] evaluated the reverse Cuthill-McKee method with the pseudo-peripheral vertex given by the George-Liu algorithm (RCM-GL) [24] and the GPS algorithm [25] in both contexts of bandwidth and profile reductions, so that 14 heuristics are evaluated in this work. Thus, our work provides an extensive computational experiment employing heuristics for bandwidth and profile reductions aiming at reducing the computational time of the ICCG method applied to 23 instances originating from CFD problems. To be more precise, the main contribution of our work is the comparison of the results obtained when applying 14 heuristics for bandwidth and profile reductions to symmetric matrices contained in linear systems (with sizes of up to nearly 300,000 unknowns) arising from CFD problems solved by the ICCG method.

The remainder of this manuscript is organized as follows. Section 2 describes how this paper conducted the simulations in this computational experiment. Section 3 presents and analyzes the results. Finally, Sect. 4 provides the conclusions.

2 Description of the Tests

This computational experiment uses a 64-bit executable program of the VNS-band heuristic, kindly provided by one of the heuristic's authors. Additionally, one of the authors of the FNCHC heuristic kindly provided its source code in the C programming language, which we converted to the C++ programming language. The 12 other heuristics were also implemented in the C++ programming language so that the computational costs of the heuristics could be compared. Furthermore, we used a data structure based on the Compressed Row Storage, Compressed Column Storage, and Skyline Storage Scheme data structures to implement the ICCG method. Previous publications described details of our implementations, testing, and calibration of the heuristics [1, 2].
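For context, a minimal Compressed Row Storage layout for the lower triangle of a symmetric matrix, together with the corresponding symmetric matrix-vector product, could look like the sketch below. This is an illustrative fragment, not the hybrid structure used in our implementation, which also borrows from the Compressed Column Storage and Skyline Storage Scheme formats [1, 2].

```cpp
#include <vector>

// Compressed Row Storage (CSR) for the lower triangle of a symmetric sparse
// matrix. Row i occupies positions row_ptr[i] .. row_ptr[i+1]-1 of col_idx
// and values. Illustrative sketch only.
struct CsrLowerMatrix {
  int n = 0;                   // matrix dimension
  std::vector<int> row_ptr;    // size n + 1
  std::vector<int> col_idx;    // column index of each stored coefficient
  std::vector<double> values;  // stored coefficients a_ij with j <= i
};

// y = A * x for a symmetric matrix stored by its lower triangle in CSR.
std::vector<double> symmetric_matvec(const CsrLowerMatrix& A,
                                     const std::vector<double>& x) {
  std::vector<double> y(A.n, 0.0);
  for (int i = 0; i < A.n; ++i) {
    for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
      const int j = A.col_idx[p];
      y[i] += A.values[p] * x[j];
      if (j != i) y[j] += A.values[p] * x[i];  // mirror the upper triangle
    }
  }
  return y;
}
```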

The three datasets (composed of real symmetric instances) and the machines (with Intel\(^{\circledR }\) Core\(^{\text {TM}}\) processors; Intel, Santa Clara, CA, United States) on which the simulations were carried out are: (i) systems of linear equations (ranging from 7,332 to 277,118 unknowns) arising from finite-volume discretizations of the Laplace equation [26], originally stored with a random ordering, executed on a workstation containing an i3-2120 CPU at 3.30 GHz, 3 MB of cache, and 8 GB of DDR3 1.333 GHz main memory, running Linux kernel 3.13.0-39-generic; (ii) linear systems (ranging from 4,846 to 232,052 unknowns) originating from finite-volume discretizations of the heat conduction equation with meshes generated using Voronoi diagrams (and Delaunay triangulations) [27], executed on a workstation containing an i3-550 CPU at 3.20 GHz, 4 MB of cache, and 16 GB of DDR3 1.333 GHz main memory, running Linux kernel 3.13.0-39-generic; (iii) 11 instances (ranging from 1,733 to 81,920 unknowns) contained in the SuiteSparse matrix (SSM) collection [28], executed on a workstation containing an i7-4790K CPU at 4.00 GHz, 8 MB of cache, and 12 GB of DDR3 1.6 GHz main memory, running Linux kernel 3.19.0-31-generic. The Ubuntu 14.04 LTS 64-bit operating system was used in all simulations. The assignment of each dataset to a machine was arbitrary.

The convergence criterion used in the ICCG method was a reduction of the computed residual \(\Vert b-Ax\Vert \) (i.e., a final backward error) to less than \(10^{-16}\). Thus, the final attainable accuracy in our numerical experiments is related to this precision. Three sequential runs were performed for each instance, both for the reordering algorithm and for the ICCG method. The present work employs the GNU Multiple Precision Floating-point Computations with Correct-Rounding (GNU MPFR) library to obtain high precision in the computations.
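To illustrate where this stopping test enters the solver, the fragment below sketches a generic preconditioned conjugate gradient loop; apply_A and apply_M_inv are hypothetical callbacks for the matrix-vector product and the preconditioner application (for IC(0), forward/backward triangular solves with L and \(L^T\)), and tol plays the role of the \(10^{-16}\) criterion above.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Schematic preconditioned conjugate gradient loop. Illustrative sketch only.
std::vector<double> pcg(
    const std::function<std::vector<double>(const std::vector<double>&)>& apply_A,
    const std::function<std::vector<double>(const std::vector<double>&)>& apply_M_inv,
    const std::vector<double>& b, double tol, int max_iter) {
  const std::size_t n = b.size();
  auto dot = [](const std::vector<double>& u, const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < u.size(); ++i) s += u[i] * v[i];
    return s;
  };
  std::vector<double> x(n, 0.0), r = b;  // x0 = 0, so r0 = b - A*x0 = b
  std::vector<double> z = apply_M_inv(r), p = z;
  double rz = dot(r, z);
  // Iterate until the residual norm drops below tol or max_iter is reached.
  for (int k = 0; k < max_iter && std::sqrt(dot(r, r)) >= tol; ++k) {
    const std::vector<double> Ap = apply_A(p);
    const double alpha = rz / dot(p, Ap);
    for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
    z = apply_M_inv(r);
    const double rz_new = dot(r, z);
    const double beta = rz_new / rz;
    rz = rz_new;
    for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
  }
  return x;
}
```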

3 Results and Analysis

This section presents and analyzes the results obtained in simulations using the ICCG method computed after executing heuristics for bandwidth and profile reductions. Sections 3.1 and 3.2 show the results obtained from the solutions of linear systems arising from finite-volume discretizations of the heat conduction equation (with instances ranging from 4,846 to 232,052 unknowns) and of the Laplace equation (with instances ranging from 7,322 to 277,118 unknowns), respectively. Section 3.3 shows the results obtained from the solutions of 11 systems of linear equations (ranging from 1,733 to 81,920 unknowns) contained in the SuiteSparse matrix collection [28].

The tables in this section show the number of unknowns (n), the name of the reordering algorithm applied, the bandwidth and profile results, the computational times of the heuristics, in seconds (s), and their memory requirements, in mebibytes (MiB). These tables provide the computational times of the IC(0) preconditioner and of the conjugate gradient method separately. We distinguish these costs because the renumbering produced by the reordering algorithms also affects the computational cost of the IC(0) preconditioner. These tables also present the number of iterations and the total computational times, in seconds, of the ICCG method. Moreover, in spite of the small number of executions for each heuristic on each instance, these tables provide the standard deviation \(\sigma \) and the coefficient of variation \(C_v=\sigma /\mu \) (where \(\mu \) is the mean of the results analyzed), referring to the total execution cost of the ICCG method. Additionally, the first line of each instance in these tables, indicated by "—", shows the results for the system of linear equations solved using the ICCG method without applying a reordering algorithm. With this result, one can check the speed-up (or speed-down) of the ICCG method provided by a reordering algorithm (i.e., the time of the ICCG method without applying a reordering algorithm divided by the time of the ICCG method executed in conjunction with a reordering algorithm), which is exhibited in the last column of the tables below. Numbers in boldface are the best results. In addition, several figures in this section are presented as line charts for clarity.
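In symbols, the speed-up reported in the last column of each table is

\(\text {speed-up} = t_{\text {ICCG}}^{\text {(no reordering)}} \,/\, t_{\text {ICCG}}^{\text {(with reordering)}},\)

so that values above 1 indicate a gain and values below 1 a speed-down.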

3.1 Instances Arising from Finite-Volume Discretizations of the Heat Conduction Equation

Tables 1 and 2 and Fig. 1, built from the ample collection of heuristics for bandwidth and profile reductions that composes this work, show the average results obtained from applying the ICCG method to instances derived from finite-volume discretizations of the heat conduction equation. These instances arise from meshes generated using Voronoi diagrams (and Delaunay triangulations) [27].

Table 1. Results from the solution of systems of linear equations (up to 23,367 unknowns and derived from finite-volume discretizations of the heat conduction equation [27]) using the ICCG method and vertices labeled by heuristics for bandwidth and profile reductions (continued on Table 2).
Table 2. Results from the solution of systems of linear equations (ranging from 50,592 to 232,052 unknowns and derived from finite-volume discretizations of the heat conduction equation) using the ICCG method and vertices labeled by heuristics for bandwidth and profile reductions (continued from Table 1).
Fig. 1. Speed-ups/downs of the ICCG method obtained when using 14 heuristics for bandwidth and profile reductions applied to matrices contained in systems of linear equations derived from finite-volume discretizations of the heat conduction equation (see Tables 1 and 2).

Fig. 2. Memory requirements of 10 heuristics for bandwidth and profile reductions applied to matrices contained in systems of linear equations derived from finite-volume discretizations of the heat conduction equation (see Tables 1 and 2).

Increasing the runtime of the VNS-band heuristic does not reduce the total cost of the whole simulation [10, 11]. Table 2 and Fig. 1b show that several heuristics increased the processing time of the ICCG method, or yielded only marginal speed-ups, when applied to various instances contained in this dataset (e.g., see the results related to the linear system composed of 50,592 unknowns in Table 2). Moreover, although Sloan's heuristic [20] reached the best results when applied to the smallest instances contained in this dataset (see Table 1), the RCM-GL method [24] yielded results similar to Sloan's heuristic [20] on these instances and obtained, in general, the best results on the largest linear system (see Table 2). Therefore, the RCM-GL method achieved, on average, the best results in reducing the computational time of the ICCG method when applied to the systems of linear equations contained in this dataset.

Figure 2 shows the memory requirements of 10 heuristics for bandwidth and profile reductions. The RCM-GL [24], Burgess-Lai [14], KP-band [3], and RBFS-GL [1] heuristics are in-place algorithms (provided that the label is implemented as an attribute of a vertex).

Fig. 3. Speed-ups/downs of the ICCG method obtained when applying 12 heuristics for bandwidth and profile reductions (see Tables 3 and 4) to matrices contained in linear systems with random order originating from finite-volume discretizations of the Laplace equation.

Table 3. Results from the solution of systems of linear equations (up to 34,238 unknowns, derived from finite-volume discretizations of the Laplace equation and composed of matrices with random order) using the ICCG method and vertices labeled by heuristics for bandwidth and profile reductions (continued on Table 4).
Table 4. Results from the solution of systems of linear equations (ranging from 75,542 to 277,118 unknowns, derived from finite-volume discretizations of the Laplace equation and composed of matrices with random order) using the ICCG method and vertices labeled by heuristics for bandwidth and profile reductions (continued from Table 3).

3.2 Instances Originating from Finite-Volume Discretizations of the Laplace Equation

This section shows simulations using seven linear systems ranging from 7,322 to 277,118 unknowns. In particular, these instances (arising from finite-volume discretizations of the Laplace equation) were ordered randomly because irregular triangulations were employed to generate them [26].

Tables 3 and 4 and Fig. 3 show that the RCM-GL method [24] dominated the other heuristics with respect to the speed-up of the ICCG method applied to these instances. In particular, Fig. 3 does not show the results obtained with the CSS-band and Burgess-Lai heuristics because these two heuristics performed less favorably than the other heuristics on the instances contained in this dataset (see Tables 3 and 4). Snay's heuristic [19] was dominated by the other heuristics, so we did not apply this algorithm to instances larger than 34,238 unknowns contained in this dataset.

Fig. 4. Memory requirements of eight reordering algorithms applied to matrices contained in systems of linear equations with a random order (see Tables 3 and 4) originating from finite-volume discretizations of the Laplace equation.

Figure 4 shows the memory requirements of eight heuristics for bandwidth and profile reductions applied to matrices contained in linear systems arising from finite-volume discretizations of the Laplace equation. This figure does not show the results for the VNS-band and CSS-band heuristics because these two heuristics showed larger storage costs than the other heuristics when applied to the instances contained in this dataset (see Tables 3 and 4).

3.3 Experiments Using Instances Contained in the SSM Collection

Tables 1, 2, 3, 4 and Figs. 1, 2, 3, 4 show that several heuristics dominated Snay’s [19], NSloan [22], and CSS-band [18] heuristics. Moreover, these tables and figures show that the GPS [25], Burgess-Lai [14], FNCHC [15], GGPS [16], and VNS-band [17] algorithms were dominated by the RCM-GL [24], KP-band [3], and RBFS-GL [1] orderings. Hence, we applied six heuristics for bandwidth and profile reductions to a dataset composed of 11 systems of linear equations contained in the SSM collection: the RCM-GL [24], Sloan’s [20], MPG [21], Sloan-MGPS [23], KP-band [3], and RBFS-GL [1] heuristics.

Fig. 5. Speed-ups of the ICCG method and memory usage obtained when using heuristics for bandwidth and profile reductions (see Tables 5 and 6) applied to 11 systems of linear equations arising from computational fluid dynamics problems contained in the SSM collection [28].

Tables 5 and 6 show the name and density (d.) of each instance. Moreover, these tables and Fig. 5 show the average results obtained by the ICCG method applied to 11 systems of linear equations contained in the SSM collection [28]. Tables 5 and 6 also show the 2-norm condition number \(\kappa (A)\) (or a 1-norm estimate) of each instance. Specifically, Tables 5 and 6 show that the ICCG method converges only after approximately n iterations, where n is the size of the linear system, when applied to ill-conditioned instances (i.e., instances with large condition numbers). Among the information presented in Table 5, in particular, this table shows a small number of iterations for the ICCG method when applied to the Pres_Poisson instance, whose condition number is smaller than those of the other seven linear systems contained in this table. Furthermore, Table 6 shows that the execution times of the ICCG method for solving the shallow_water1 and shallow_water2 instances (composed of 81,920 unknowns and with small condition numbers) are approximately four times lower than the execution time for the cfd1 instance (composed of 70,656 unknowns and with a large condition number).

Table 5. Results from the solution of eight systems of linear equations (arising from CFD problems) contained in the SuiteSparse matrix collection [28] employing the zero-fill incomplete Cholesky-preconditioned conjugate gradient method and vertices labeled using reordering algorithms (continued on Table 6).
Table 6. Results from the solution of three systems of linear equations (arising from CFD problems) contained in the SSM collection [28] employing the ICCG method and vertices labeled using reordering algorithms (continued from Table 5).

The MPG heuristic [21] obtained the best results when applied to the ex33, ex10, ex10hs, and ex15 instances, and the KP-band heuristic [3] when applied to the ex9 and Pres_Poisson instances. The KP-band heuristic [3] also obtained the best results when applied to the ex13 and shallow_water1 instances, and Sloan's heuristic [20] when applied to the cfd1 and shallow_water2 instances, but these gains are marginal.

In exploratory investigations using the standard double-precision floating-point format with the same precision setting, the ICCG method converged in a number of iterations similar to that obtained with high-precision floating-point arithmetic. Thus, in these experiments, we observed that high-precision floating-point arithmetic does not mitigate delays in the convergence of the ICCG method. Figure 5 also shows the memory usage obtained when using three heuristics for bandwidth and profile reductions applied to the 11 instances arising from computational fluid dynamics problems contained in the SSM collection [28].

4 Conclusions

Systematic reviews [1, 2, 6,7,8] reported the most promising low-cost heuristics for bandwidth and profile reductions. Thus, our computational experiment compared the results of the implementations of 14 heuristics for bandwidth and profile reductions when applied to 23 instances arising from CFD problems.

Table 5 shows six promising heuristics for bandwidth and profile reductions for reducing the computational times of the ICCG method. In particular, in experiments using instances (with sizes of up to almost 300,000 unknowns) from three datasets, three out of the 14 heuristics for bandwidth and profile reductions evaluated in this computational experiment showed the best overall results in reducing the processing cost of the ICCG method. Specifically, the RCM-GL [24], MPG [21], and KP-band [3] heuristics obtained the most promising results when used to reduce the processing cost of the ICCG method applied to instances arising from computational fluid dynamics problems. Future studies will reveal whether this finding extends to other preconditioned conjugate gradient variants or even to other preconditioned iterative solvers. In particular, the RCM-GL method [24] achieved the best results when applied to instances arising from finite-volume discretizations of the heat conduction and Laplace equations [26, 27] (see Tables 2, 3, 4 and Figs. 1a and 3). Thus, the in-place low-cost RCM-GL method [24] remains the state of the practice in bandwidth reduction when applied to instances originating from CFD problems. On the other hand, the MPG [21] (in four linear systems) and KP-band [3] (in two linear systems) heuristics reached the largest numbers of best results in the simulations with the 11 instances contained in the SuiteSparse matrix collection [28].

As a continuation of this work, we intend to implement and evaluate other preconditioners in conjunction with the conjugate gradient method. We also plan to implement parallel approaches of the above algorithms in future investigations.