
1 Introduction

In several scientific and engineering fields, such as finite element analysis, computational fluid mechanics, and structural engineering, a fundamental task is the solution of large sparse linear systems of the form \(Ax = b\), where A is an \(n \times n\) sparse, symmetric, positive-definite matrix, b is a vector of length n, and x is the unknown vector of length n. The solution of these linear systems generally dominates the computational cost of the simulation, and storing and solving them demands substantial memory and processing resources. To lower these costs, a heuristic for bandwidth or profile reduction is often applied so that the coefficient matrix A attains a narrow bandwidth and a small profile [14, 17]. In particular, profile reduction is employed to reduce the storage costs of applications that use the skyline data structure [9] to represent large-scale matrices.
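
To make the storage connection concrete, the following is a minimal sketch of skyline (envelope) storage for the lower triangle of a symmetric matrix; the struct layout and accessor are illustrative assumptions, not the representation used by the codes evaluated here. Row i stores the contiguous entries from its leftmost nonzero column up to the diagonal, so the value array holds exactly \(profile(A) + n\) entries; reducing the profile directly shrinks this array.

```cpp
// Minimal sketch of skyline (envelope) storage; layout is an assumption.
#include <vector>

struct Skyline {
    std::vector<double> val;        // rows of the envelope stored back to back
    std::vector<int>    row_start;  // offset of row i's first stored entry in val
    std::vector<int>    first_col;  // column index of row i's first stored entry
};

// Returns a_ij for i >= j (lower triangle); zero outside the envelope.
double get(const Skyline& s, int i, int j) {
    if (j < s.first_col[i]) return 0.0;
    return s.val[s.row_start[i] + (j - s.first_col[i])];
}
```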

Let \(A = [a_{ij}]\) be a symmetric sparse \(n \times n\) matrix. The bandwidth of row i is \(\beta_i(A) = i - \min\{j : 1 \le j < i,\ a_{ij} \ne 0\}\). The bandwidth of A is defined as \(\beta(A) = \max\{\beta_i(A) : 1 \le i \le n\}\). The profile of A is defined as \(profile(A) = \sum_{i=1}^n \beta_i(A)\). The bandwidth and profile minimization problems are NP-hard [28, 31]. Since these problems are connected to an extensive variety of other problems in scientific and engineering disciplines, several heuristics have been proposed for reordering the rows and columns of sparse matrices so as to reduce their bandwidth and profile.
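
These definitions translate into a single linear-time scan over the nonzero pattern. The following is a minimal sketch (the per-row adjacency layout and 0-based indices are assumptions for illustration; a row with no nonzero to the left of the diagonal contributes \(\beta_i = 0\)):

```cpp
// Minimal sketch: computing beta(A) and profile(A) from the nonzero pattern.
#include <algorithm>
#include <cstddef>
#include <vector>

// rows[i] holds the column indices j of the nonzeros a_ij of row i (0-based).
void bandwidth_profile(const std::vector<std::vector<int> >& rows,
                       long long& bandwidth, long long& profile) {
    bandwidth = 0;
    profile = 0;
    for (int i = 0; i < static_cast<int>(rows.size()); ++i) {
        int min_j = i;  // yields beta_i = 0 when no nonzero has j < i
        for (std::size_t k = 0; k < rows[i].size(); ++k)
            if (rows[i][k] < i) min_j = std::min(min_j, rows[i][k]);
        const long long beta_i = i - min_j;       // row bandwidth beta_i(A)
        bandwidth = std::max(bandwidth, beta_i);  // beta(A) = max_i beta_i(A)
        profile += beta_i;                        // profile(A) = sum_i beta_i(A)
    }
}
```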

A prominent algorithm for solving large-scale sparse linear systems is the Conjugate Gradient Method (CGM) [21, 26]. Duff and Meurant [8] showed that a local ordering of the vertices of the graph associated with A can improve cache hit rates, thereby reducing the computational cost of the CGM. Burgess and Giles [3] and Das et al. [6] showed that such a local ordering can be attained by using a heuristic for bandwidth reduction. Moreover, one should employ an ordering that does not increase the number of iterations of the CGM when a preconditioner is applied [15].
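
For reference, the following is a minimal double-precision sketch of the Jacobi-preconditioned CG iteration studied in this paper (the experiments themselves ran in 256- and 512-bit MPFR arithmetic, as described in Sect. 3; the CSR layout, tolerance, and iteration cap are illustrative assumptions):

```cpp
// Minimal sketch of the Jacobi-preconditioned Conjugate Gradient Method.
#include <cmath>
#include <cstddef>
#include <vector>

struct CSRMatrix {                 // compressed sparse row storage (assumed)
    int n;
    std::vector<int> row_ptr, col;
    std::vector<double> val;
};

static std::vector<double> spmv(const CSRMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.n, 0.0);
    for (int i = 0; i < A.n; ++i)
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[i] += A.val[k] * x[A.col[k]];
    return y;
}

static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solves Ax = b with M = diag(A); returns the number of iterations performed.
int jpcg(const CSRMatrix& A, const std::vector<double>& b,
         std::vector<double>& x, double tol = 1e-12, int max_it = 10000) {
    std::vector<double> inv_diag(A.n, 0.0);   // Jacobi preconditioner M^{-1}
    for (int i = 0; i < A.n; ++i)
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            if (A.col[k] == i) inv_diag[i] = 1.0 / A.val[k];

    std::vector<double> r = b, z(A.n), Ax = spmv(A, x);
    for (int i = 0; i < A.n; ++i) r[i] -= Ax[i];             // r = b - Ax
    for (int i = 0; i < A.n; ++i) z[i] = inv_diag[i] * r[i]; // z = M^{-1} r
    std::vector<double> p = z;
    double rz = dot(r, z);

    for (int it = 0; it < max_it; ++it) {
        if (std::sqrt(dot(r, r)) < tol) return it;           // stopping criterion
        std::vector<double> Ap = spmv(A, p);
        const double alpha = rz / dot(p, Ap);
        for (int i = 0; i < A.n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        for (int i = 0; i < A.n; ++i) z[i] = inv_diag[i] * r[i];
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;                     // next search direction
        rz = rz_new;
        for (int i = 0; i < A.n; ++i) p[i] = z[i] + beta * p[i];
    }
    return max_it;
}
```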

In this work, we analyze cases where selected heuristics for bandwidth or profile reduction may not reduce the computational times of the Jacobi-preconditioned CGM (JPCGM). In previous publications [14, 16], we presented preliminary results; based on this experience and on systematic reviews [2, 5, 15, 17], 13 heuristics were selected as the most promising methods in this field. Thus, the main objective of this work is to analyze the results of these 13 potentially state-of-the-art low-cost heuristics for bandwidth and profile reductions when executed to reduce the computational cost of the JPCGM using high-precision floating-point arithmetic.

Section 2 describes the systematic reviews conducted to identify the potentially best low-cost heuristics for bandwidth and profile reductions. Section 3 describes how the numerical experiments were conducted in this study. Section 4 shows the results. Finally, Sect. 5 presents the conclusions.

2 Systematic Reviews

As described, since the bandwidth and profile reduction problems are connected to a wide range of other problems in scientific and engineering disciplines, a large number of heuristics for bandwidth and profile reductions have been proposed. In systematic reviews [2, 5, 15, 17], 133 heuristics for bandwidth and/or profile reduction were identified, published between the 1960s and the present day, including a recently proposed heuristic for bandwidth and profile reductions [13]. From the analysis performed, seven heuristics for bandwidth reduction and six for profile reduction were selected as potentially the best low-cost methods to be evaluated in this computational experiment: Burgess-Lai [4], FNCHC [27], GGPS [38], VNS-band [30], hGPHH [24], and CSS-band [23] for bandwidth reduction, and Snay [35], Sloan [34], Medeiros-Pimenta-Goldenberg (MPG) [29], NSloan [25], and Sloan-MGPS [32] for profile reduction. The Reverse Cuthill-McKee method with the starting pseudo-peripheral vertex given by the George-Liu algorithm (RCM-GL) [10] completes both lists, as it was selected in both systematic reviews; in particular, the RCM-GL method [10] is included in the Matlab software [36]. Therefore, of the 133 identified heuristics, 12 were selected for evaluation because, in the papers analyzed, no simulation or comparison showed that these 12 heuristics could be superseded by any other heuristic with respect to bandwidth or profile reduction when the computational costs of the heuristics were also considered. Thus, these 12 heuristics can be deemed the most promising low-cost heuristics for these problems.
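
To give the flavor of these low-cost methods, the following is a minimal sketch of the Cuthill-McKee breadth-first step with the final reversal that characterizes RCM (an adjacency-list graph with a single connected component is assumed; the starting vertex would be a pseudo-peripheral vertex, e.g. one provided by the George-Liu algorithm sketched in Sect. 3):

```cpp
// Minimal sketch of Reverse Cuthill-McKee for a connected graph (assumed).
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

std::vector<int> rcm(const std::vector<std::vector<int> >& adj, int start) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> order;
    order.reserve(n);
    std::vector<bool> visited(n, false);
    std::queue<int> q;
    visited[start] = true;
    q.push(start);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        order.push_back(u);
        std::vector<int> nbrs;  // unvisited neighbors, to be sorted by degree
        for (std::size_t k = 0; k < adj[u].size(); ++k)
            if (!visited[adj[u][k]]) nbrs.push_back(adj[u][k]);
        std::sort(nbrs.begin(), nbrs.end(),
                  [&adj](int a, int b) { return adj[a].size() < adj[b].size(); });
        for (std::size_t k = 0; k < nbrs.size(); ++k) {
            visited[nbrs[k]] = true;
            q.push(nbrs[k]);
        }
    }
    std::reverse(order.begin(), order.end());  // the "reverse" in RCM
    return order;  // order[k] = old label of the vertex newly labeled k
}
```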

The GPS heuristic [12] was not selected in these systematic reviews. Nevertheless, it was also implemented, and its results were compared with those of the 12 selected heuristics in this computational experiment, because it is one of the classic low-cost heuristics evaluated in the field for both bandwidth and profile reductions. Thus, 13 heuristics were implemented and/or evaluated in this work.

3 Description of the Tests, Implementation of the Heuristics, Testing, and Calibration

A 64-bit executable program of the VNS-band heuristic (kindly provided by one of the heuristic's authors) was used. This executable only runs on instances with up to 500,000 vertices.

The FNCHC-heuristic source code was also kindly provided by one of the heuristic's authors. This code was then adapted and reimplemented in the C++ programming language for the present work.

We asked the authors of the other 11 heuristics for the source code and/or executables of their algorithms. Some authors informed us that they no longer had the source code or executable, some did not answer, and others explained that the programs could not be provided. Hence, the other 11 heuristics were also implemented in the C++ programming language so that the computational costs of the heuristics could be properly compared [15]. Specifically, the g++ compiler, version 4.8.2, was used.

The IEEE 754 double-precision binary floating-point format is composed of 11 bits of exponent (covering magnitudes between roughly \(10^{-307}\) and \(10^{307}\)) and a significand of 53 bits, which provides approximately 16 decimal digits. Nowadays, this double-precision floating-point arithmetic is sufficiently accurate for most scientific computing applications. Nonetheless, for some scientific applications, 64-bit IEEE arithmetic is no longer adequate for today's large-scale numerical simulations, and high-precision floating-point computations are required. High-precision floating-point arithmetic is used in applications where the execution time of the arithmetic is not a limiting factor, or where accurate results with many digits in the significand are needed. Some of these applications demand a significand of 64 bits or more to reach numerically useful results. They arise in numerous fields, such as climate modeling, computational fluid dynamics (CFD) problems (e.g., vortex roll-up simulations), computational geometry, mesh generation, computational number theory, Coulomb N-body atomic system simulations, experimental mathematics, large-scale physical simulations performed on highly parallel supercomputers (e.g., studies of the fine-structure constant of physics), and quantum theory [1]. In particular, mesh generation, contour mapping, and other computational geometry applications rely heavily on highly precise arithmetic, mostly when the domain is the unit cube, because small numerical errors can induce geometrically questionable results. Such problems are latent in the mathematics of the formulas commonly used in these computations and cannot be repaired without considerable effort [1]. Specifically, in the applications mentioned, portions of the code normally contain numerically sensitive computations. When using double-precision floating-point arithmetic, these applications may return results with questionable precision, depending on the stopping criteria used, and these imprecise results may in turn cause larger errors. It is normally cheaper and more reliable to use high-precision floating-point arithmetic to overcome these problems [1].

In this computational experiment, we used instances derived from meshes generated in discretizations of partial differential equations (that govern CFD problems) by finite volumes [19, 20]. Hence, our numerical experiments focus on high-precision floating-point arithmetic. We used the GNU MPFR library (Multiple-Precision Floating-point computations with correct Rounding) with 256-bit precision for the instances originating from discretizations of the Laplace equation and 512-bit precision for the instances contained in the University of Florida sparse matrix collection.
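
As a concrete illustration, the following is a minimal sketch of a 256-bit dot product written against the MPFR C API (the kernel from which an MPFR-based JPCGM is built); the variable names, the printed digit count, and the link line are illustrative assumptions:

```cpp
// Minimal sketch: a dot product in 256-bit MPFR arithmetic.
// Build (assumed): g++ dot.cpp -lmpfr -lgmp
#include <cstdio>   // must precede mpfr.h so mpfr_printf is declared
#include <mpfr.h>

int main() {
    const mpfr_prec_t prec = 256;  // bits in the significand
    mpfr_t sum, term;
    mpfr_init2(sum, prec);
    mpfr_init2(term, prec);
    mpfr_set_d(sum, 0.0, MPFR_RNDN);

    const double x[] = {1.0, 2.0, 3.0};
    const double y[] = {4.0, 5.0, 6.0};
    for (int i = 0; i < 3; ++i) {
        mpfr_set_d(term, x[i], MPFR_RNDN);
        mpfr_mul_d(term, term, y[i], MPFR_RNDN);  // term = x[i] * y[i]
        mpfr_add(sum, sum, term, MPFR_RNDN);      // sum += term
    }
    mpfr_printf("dot = %.40Rf\n", sum);           // 32.0000... to 40 digits

    mpfr_clear(sum);
    mpfr_clear(term);
    return 0;
}
```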

Many of the heuristics evaluated here depend strongly on the starting vertex. Since Koohestani and Poli [24] did not specify which pseudo-peripheral vertex finder was used, the George-Liu algorithm [11] for computing a pseudo-peripheral vertex was employed in this computational experiment. Hence, we refer to this heuristic as hGPHH-GL.
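
For reference, the following is a minimal sketch of the George-Liu pseudo-peripheral vertex search (an adjacency-list graph with a single connected component is assumed): build a breadth-first level structure, restart from a smallest-degree vertex of the deepest level, and stop when the depth (the eccentricity estimate) no longer increases.

```cpp
// Minimal sketch of the George-Liu pseudo-peripheral vertex algorithm.
#include <cstddef>
#include <queue>
#include <vector>

// Fills level[v] with the BFS level of v and returns the deepest level.
static int bfs_levels(const std::vector<std::vector<int> >& adj, int root,
                      std::vector<int>& level) {
    level.assign(adj.size(), -1);
    std::queue<int> q;
    level[root] = 0;
    q.push(root);
    int depth = 0;
    while (!q.empty()) {
        int u = q.front(); q.pop();
        if (level[u] > depth) depth = level[u];
        for (std::size_t k = 0; k < adj[u].size(); ++k) {
            int w = adj[u][k];
            if (level[w] < 0) { level[w] = level[u] + 1; q.push(w); }
        }
    }
    return depth;
}

int george_liu(const std::vector<std::vector<int> >& adj, int start) {
    std::vector<int> level;
    int v = start;
    int depth = bfs_levels(adj, v, level);
    for (;;) {
        int next = -1;  // minimum-degree vertex in the deepest level
        for (int u = 0; u < static_cast<int>(adj.size()); ++u)
            if (level[u] == depth &&
                (next < 0 || adj[u].size() < adj[next].size()))
                next = u;
        std::vector<int> level2;
        int depth2 = bfs_levels(adj, next, level2);
        if (depth2 <= depth) return v;  // eccentricity stopped increasing
        v = next;
        level.swap(level2);
        depth = depth2;
    }
}
```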

It was not our objective that the C++ versions of the heuristics supersede all the results of the original implementations. Our objective was to code reasonably efficient implementations of the heuristics evaluated so as to make an adequate comparison of the results of the 13 heuristics possible. Nevertheless, we tested and calibrated the C++ versions of the heuristics against the codes used by their original proposers, to ensure that our implementations were comparable to the algorithms as originally proposed, and compared their results with the results presented in the original publications. A previous publication [15] details how the heuristics were implemented, tested, and calibrated; the C++ implementations obtained bandwidth and profile reductions similar to the results presented in the original publications (see [15]).

Table 1 shows the characteristics of the five workstations used to perform the simulations. In particular, the Ubuntu 14.04 LTS 64-bit operating system was used.

Table 1. Characteristics of the machines used to perform the simulations.

Three sequential runs, each comprising a reordering algorithm followed by the JPCGM, were carried out with each instance. In addition, for this experimental analysis of 13 low-cost heuristics for bandwidth and profile reductions, we followed the suggestions given by Johnson [22], with the aim of reducing the computational cost of the JPCGM.

4 Numerical Experiments and Analysis

This section shows the results obtained in simulations using the JPCGM, executed after applying heuristics for bandwidth and profile reductions. Section 4.1 shows the results of the resolutions of linear systems arising from the discretization of the Laplace equation by finite volumes [19]. Section 4.2 shows the results of the resolutions of linear systems contained in the University of Florida sparse matrix collection [7].

The tables in this section show the dimension n of the coefficient matrix of each linear system (or, equivalently, the number of vertices of the associated graph, or the name of the instance in the University of Florida sparse matrix collection), the reordering algorithms applied, the resulting bandwidth (\(\beta\)) and profile values, the average computational cost of the heuristics, in seconds (s), and their memory requirements, in mebibytes (MiB). In addition, these tables show the number of iterations and the total computational cost, in seconds, of the JPCGM. Despite the small number of executions of each heuristic on each instance, the tables also report the standard deviation (\(\sigma\)) and the coefficient of variation (\(C_v\)) of the total computational cost of the JPCGM. A "–" in the first row of the set of simulations performed with each instance means that no reordering algorithm was used; against this baseline, one can read the speed-down of the JPCGM incurred when using a heuristic for bandwidth or profile reduction, shown in the last columns of these tables. In the tables below, numbers in bold face are the best results (up to two occurrences) in the \(\beta\), profile, t(s), and m. (MiB) columns. Figures in this section are presented as line charts for clarity.
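
For clarity, the dispersion statistics can be written as follows, with \(k = 3\) runs per instance and \(t_r\) the total JPCGM time of run r; the speed-down shown in the last columns is read here as the ratio of the mean JPCGM time after reordering to the mean JPCGM time under the original ordering (an assumption about the exact timing entering the numerator):

\[
\bar{t} = \frac{1}{k}\sum_{r=1}^{k} t_r, \qquad
\sigma = \sqrt{\frac{1}{k}\sum_{r=1}^{k}\left(t_r - \bar{t}\right)^2}, \qquad
C_v = \frac{\sigma}{\bar{t}}, \qquad
\text{speed-down} = \frac{\bar{t}_{\text{reordered}}}{\bar{t}_{\text{original}}}.
\]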

4.1 Instances Originating from the Discretization of the Laplace Equation by Finite Volumes

This section shows the results of the resolutions of linear systems arising from the discretization of the Laplace equation by finite volumes [19]. These linear systems are divided into two datasets: seven linear systems ranging from 7,322 to 277,118 unknowns, composed of matrices in random order [see Fig. 1 and Tables 2 and 3 (executions performed on the M1 machine)], and eight linear systems ranging from 16,922 to 1,115,004 unknowns, composed of matrices originally ordered using a sequence given by the Sierpiński-like curve [18, 37] [see Fig. 2 and Tables 4 and 5 (executions performed on the M2 machine)].

Fig. 1.

Speed-downs of the JPCGM obtained using several heuristics for bandwidth and profile reductions applied to linear systems originating from the discretization of the Laplace equation by finite volumes and composed of matrices with random order (see Tables 2 and 3).

Fig. 2.

Speed-downs of the JPCGM obtained using several heuristics for bandwidth and profile reductions applied to linear systems originating from the discretization of the Laplace equation by finite volumes and composed of matrices ordered using a sequence given by the Sierpiński-like curve (see Tables 4 and 5).

Table 2. Resolution of three linear systems (derived from the discretization of the Laplace equation by finite volumes and composed of matrices with random order) using the JPCGM and vertices labeled by heuristics for bandwidth and profile reductions.
Table 3. Resolution of four linear systems (derived from the discretization of the Laplace equation by finite volumes and composed of matrices with random order) using the JPCGM and vertices labeled by heuristics for bandwidth and profile reductions.
Table 4. Resolution of linear systems (ranging from 16,922 to 105,764 unknowns, derived from the discretization of the Laplace equation by finite volumes and composed of matrices ordered using a sequence determined by the Sierpiński-like curve) using the JPCGM and vertices labeled by heuristics for bandwidth and profile reductions.
Table 5. Resolution of four linear systems (derived from the discretization of the Laplace equation by finite volumes and composed of matrices originally ordered using a sequence determined by the Sierpiński-like curve) using the JPCGM and vertices labeled by heuristics for bandwidth and profile reductions.

Tables 2, 3, 4 and 5 show that Sloan's heuristic almost always obtained the best profile results in these datasets. These tables also show that the FNCHC heuristic achieved, in general, the best bandwidth results, closely followed by the RCM-GL and hGPHH-GL heuristics, which presented much lower computational costs. Nevertheless, no speed-up of the JPCGM was attained when using these heuristics. In particular, the FNCHC heuristic presented a much higher computational cost than the RCM-GL, Sloan's, MPG, NSloan, Sloan-MGPS, and hGPHH-GL heuristics.

A slight speed-up of the JPCGM was reached when applying the hGPHH-GL heuristic to the linear system composed of 16,922 unknowns (see Table 4), but this gain is marginal. Moreover, a speed-down of the JPCGM was obtained when using the other heuristics for bandwidth and profile reductions applied to the other linear systems.

Tables 4 and 5 do not show results of Snay's heuristic [35] for linear systems larger than 68,414 unknowns. Snay's heuristic obtained better results (in terms of reducing the JPCGM computational cost) than the CSS-band [23] and NSloan [25] heuristics when applied to the linear systems composed of 39,716 and 68,414 unknowns. However, it performed less favorably than the other heuristics when applied to the linear system comprised of 16,922 unknowns (see Table 4).

The GPS [12], Burgess-Lai [4], GGPS [38], and CSS-band [23] heuristics presented higher computational costs than the other heuristics [see the t(s) (Heuristic) column in Tables 3 and 5]. Consequently, Table 5 does not show the results of these four heuristics applied to the linear systems composed of 750,446 and 1,015,004 unknowns; recall also that the VNS-band executable only runs on instances with up to 500,000 unknowns. Furthermore, Table 5 does not show the results of the NSloan [25] and Sloan-MGPS [32] heuristics applied to the linear system composed of 1,015,004 unknowns because these two heuristics performed less favorably than the five other heuristics when applied to linear systems in this dataset.

Table 6. Eleven linear systems (composed of symmetric and positive-definite matrices) contained in the University of Florida sparse matrix collection.

4.2 Instances Contained in the University of Florida Sparse Matrix Collection

Table 6 provides the characteristics of 11 linear systems (composed of symmetric and positive-definite matrices) contained in the University of Florida sparse matrix collection [7]. Tables 2, 3, 4 and 5 show that the RCM-GL [10], Sloan's [34], MPG [29], NSloan [25], Sloan-MGPS [32], and hGPHH-GL [24] heuristics presented much lower computational costs than the other heuristics evaluated in this computational experiment. Hence, these six low-cost heuristics for bandwidth or profile reduction were applied to the dataset presented in Table 6.

Tables 7 and 8 and Fig. 3 show the results of the resolutions of 11 linear systems contained in the University of Florida sparse matrix collection using the JPCGM, with vertices labeled by heuristics for bandwidth and profile reductions. The hGPHH-GL heuristic obtained the best speed-up of the JPCGM when applied to the Pres_Poisson instance (see Table 7). On the other hand, speed-downs of the JPCGM were obtained when applying these six heuristics for bandwidth and profile reductions to the 10 other linear systems from the University of Florida sparse matrix collection used here.

Among the heuristics evaluated, the Sloan-MGPS and Sloan's heuristics almost always obtained the best profile results, and the RCM-GL heuristic the best bandwidth results, when applied to the instances in this dataset. Nevertheless, speed-downs of the JPCGM were obtained when using these heuristics (except in the simulation using the Pres_Poisson instance).

Table 7. Resolution of seven linear systems contained in the University of Florida sparse matrix collection using the JPCGM and vertices labeled using heuristics for bandwidth and profile reductions.
Table 8. Resolution of four linear systems contained in the University of Florida sparse matrix collection using the JPCGM and vertices labeled using several heuristics for bandwidth and profile reductions.

5 Conclusions

This paper described the results of applying 13 heuristics for bandwidth and profile reductions to reduce the computational cost of solving three datasets of linear systems with the Jacobi-preconditioned Conjugate Gradient Method in high-precision floating-point arithmetic. These heuristics were selected from systematic reviews [2, 5, 15, 17].

In experiments using three datasets composed of large-scale linear systems, the hGPHH-GL heuristic performed best at reducing the computational cost of the JPCGM when applied to a single linear system (see Table 7). On the other hand, speed-downs of the JPCGM were obtained when applying these 13 heuristics for bandwidth and profile reductions to the other linear systems used in this computational experiment. Thus, the results show that, in certain cases, no heuristic for bandwidth or profile reduction can reduce the computational cost of the Jacobi-preconditioned Conjugate Gradient Method when using high-precision numerical computations.

Concerning the set of linear systems arising from the discretization of the Laplace equation by finite volumes and comprised of matrices with random order, each vertex of the associated graph has exactly three adjacencies [19]. Probably because of this, relabeling the vertices did not improve cache hit rates.

Regarding the set of linear systems originating from the discretization of the Laplace equation by finite volumes and comprised of matrices originally ordered using a sequence given by the Sierpiński-like curve [19], a large number of cache misses may have occurred after applying heuristics for bandwidth and profile reductions. The probable reason is that a space-filling curve already provides adequate memory-data locality, so that a reordering algorithm is not useful in such cases. We applied these 13 heuristics to large-scale linear systems, and cache memory is a relevant factor in the execution times of these simulations. The evidence from the experiments described in this paper supports the assertion that a linear system should be studied carefully before using a heuristic for bandwidth or profile reduction with the aim of reducing the computational cost of the JPCGM (and probably of any preconditioned CGM or other iterative linear system solver).

Fig. 3.

Speed-downs of the JPCGM obtained using six heuristics for bandwidth and profile reductions applied to 11 linear systems contained in the University of Florida sparse matrix collection (see Tables 7 and 8).

As a continuation of this work, we intend to implement and evaluate the following preconditioners: algebraic multigrid, incomplete Cholesky factorization, threshold-based incomplete LU (ILUT), Successive Over-Relaxation (SOR), Symmetric SOR, and Gauss-Seidel. More specifically, we intend to study the effectiveness of these strategies when using incomplete or approximate factorization-based preconditioners as well as approximate inverse preconditioners. These techniques shall be used as preconditioners of the Conjugate Gradient Method and of the Generalized Minimal Residual (GMRES) method [33] to evaluate their computational performance in conjunction with heuristics for bandwidth and profile reductions. Parallel versions of the above algorithms will also be studied.

Extended (256-bit and 512-bit) precision was employed in this work. This reduces rounding errors, but it increases the execution times by a large factor and may be impractical when solving certain real-world problems. We intend to examine what occurs in double-precision arithmetic in future studies.