1 Introduction

The resolution of large sparse linear systems \(Ax = b\) is fundamental in many science and engineering applications and is generally the most computationally expensive part of a simulation. The principal origin of large-scale matrices is the discretization of elliptic or parabolic partial differential equations (PDEs) (Benzi 2002). The finite element, finite difference, and finite volume methods are among the most common numerical methods for solving problems modeled by PDEs, and their application generates large sparse linear systems. In addition, large sparse linear systems also originate from problems that are not modeled by PDEs, such as chemical engineering processes, design and analysis of integrated circuits, and power system networks (Benzi 2002). Storing and solving these large-scale linear systems requires a considerable amount of memory and a high processing cost.

Modern hierarchical memory architectures and paging policies favor programs that take locality of reference into account. Thus, cache coherence (that is, a sequence of recent memory references clustered locally rather than randomly in the memory address space) should be an important consideration when designing a new algorithm. To solve large sparse linear systems at low cost, and to reduce the memory space required, an adequate nodal renumbering is desirable so that the corresponding coefficient matrix A has a narrow bandwidth and a small profile. Thus, one way of designing an algorithm that returns a sequence of graph vertices with cache coherence is to use heuristics for bandwidth reduction. Therefore, heuristics for bandwidth and profile reduction are used to achieve low computational and storage costs when solving large sparse linear systems (Gonzaga de Oliveira and Chagas 2015). In particular, profile reduction is also important for reducing the storage cost of applications that use the skyline data structure (Felippa 1975) to represent large-scale matrices.

Let \(A = [a_{ij}]\) be an \(n \times n\) symmetric matrix (corresponding to an undirected graph \(G=(V,E)\), composed of a set of vertices V and a set of edges E). The bandwidth of row i is \(\beta_i(A) = i - \min\{j : 1 \le j < i,\ a_{ij} \ne 0\}\), with \(\beta_i(A) = 0\) when row i has no nonzero coefficient to the left of the main diagonal. The bandwidth \(\beta(A)\) is the largest distance between a nonzero coefficient of the lower triangular part and the main diagonal, considering all rows of the matrix, that is, \(\beta(A) = \max_{1 \le i \le n} \beta_i(A)\). Equivalently, the bandwidth of G for a labeling \(S=\{s(v_1), s(v_2),\ldots,s(v_{|V|})\}\) (i.e., a bijective mapping from V to the set \(\{1,2,\ldots,|V|\}\)) is \(\beta(G)=\max\{|s(v_i)-s(v_j)| : (v_i,v_j)\in E\}\). The profile of A can be defined as \(\mathrm{profile}(A) = \sum_{i=1}^n \beta_i(A)\).
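These definitions translate directly into code. The sketch below (in Python, with 0-based indices, a dense list-of-lists matrix, and the convention that \(\beta_i(A)=0\) when row i has no nonzero to the left of the diagonal) is an illustration only, not the implementation used in this work:

```python
def bandwidth_and_profile(A):
    """Return (beta(A), profile(A)) for a symmetric matrix A,
    given as a dense list of lists with 0-based indices."""
    n = len(A)
    betas = []
    for i in range(n):
        # column of the leftmost nonzero in row i, up to the diagonal;
        # falls back to i (i.e., beta_i = 0) when no earlier nonzero exists
        j_min = next((j for j in range(i + 1) if A[i][j] != 0), i)
        betas.append(i - j_min)
    return max(betas), sum(betas)

A = [[4, 1, 0, 0],
     [1, 4, 0, 1],
     [0, 0, 4, 1],
     [0, 1, 1, 4]]
beta, profile = bandwidth_and_profile(A)  # beta = 2, profile = 0 + 1 + 0 + 2 = 3
```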

Analogously, let \(A_u = [a_{ij}]\) be an \(n \times n\) asymmetric matrix. The bandwidth of row i is \(\beta_i(A_u) = \max({\beta_l}_i(A_u), {\beta_u}_i(A_u))\), where \({\beta_l}_i(A_u) = i - \min\{j : 1 \le j < i,\ a_{ij} \ne 0\}\) and \({\beta_u}_i(A_u) = \max\{j : i < j \le n,\ a_{ij} \ne 0\} - i\), each taken as 0 when the corresponding set is empty. The bandwidth \(\beta(A_u)\) is the largest distance between a nonzero coefficient of the lower or the upper triangular part and the main diagonal, considering all rows of the matrix, that is, \(\beta(A_u) = \max_{1 \le i \le n} \max({\beta_l}_i(A_u), {\beta_u}_i(A_u))\). The profile of \(A_u\) can be defined as \(\mathrm{profile}(A_u) = \sum_{i=1}^n({\beta_l}_i(A_u) + {\beta_u}_i(A_u))\).
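For the asymmetric case, a minimal sketch under the same assumptions (dense 0-based matrix; \({\beta_l}_i\) and \({\beta_u}_i\) taken as 0 when the corresponding set is empty) could be:

```python
def bandwidth_and_profile_asym(A):
    """Return (beta(A_u), profile(A_u)) for an asymmetric matrix A_u,
    given as a dense list of lists with 0-based indices."""
    n = len(A)
    band, profile = 0, 0
    for i in range(n):
        # beta_l: distance from the diagonal to the leftmost nonzero (j < i)
        jl = next((j for j in range(i) if A[i][j] != 0), i)
        # beta_u: distance from the diagonal to the rightmost nonzero (j > i)
        ju = next((j for j in range(n - 1, i, -1) if A[i][j] != 0), i)
        bl, bu = i - jl, ju - i
        profile += bl + bu
        band = max(band, bl, bu)
    return band, profile

A_u = [[1, 0, 2],
       [0, 1, 0],
       [0, 3, 1]]
# row 0 contributes beta_u = 2 and row 2 contributes beta_l = 1,
# so beta(A_u) = 2 and profile(A_u) = 3
```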

The problems of bandwidth and profile minimization are NP-hard (Papadimitriou 1976; Lin and Yuan 1994). Thus, several heuristics have been proposed to solve the bandwidth and/or profile reduction problems since the mid-1960s. The large number of methods available means that the user faces an arduous task in determining which method to employ. Various comparisons among methods are available in the literature, but each covers only a few of them. Additionally, few reviews have been published in this field. Cuthill (1972) performed a comparative study of the results of the heuristics known up to 1971. Gibbs et al. (1976) compared the results of six heuristics. Benzi (2002) reviewed preconditioning techniques for the iterative solution of large sparse linear systems; this review focused on techniques to improve the performance and reliability of Krylov subspace methods, however, and was not strictly a review of heuristics for bandwidth or profile reduction (Gonzaga de Oliveira and Chagas 2015). Thus, the literature lacks a comparison of a large variety of heuristics proposed for bandwidth and profile reductions.

Disregarding computational costs, the Variable Neighborhood Search for bandwidth reduction (VNS-band) heuristic (Mladenovic et al. 2010) may be the method that represents the state of the art with respect to the problem of bandwidth reduction. Mladenovic et al. (2010) established a VNS-band timeout of 500 s to solve the problem in 113 instances of the Harwell-Boeing sparse-matrix collection (http://math.nist.gov/MatrixMarket/data/Harwell-Boeing) (Duff et al. 1989a). Even though the heuristic may find a solution quickly, a 500-s timeout for searching for better solutions can be considered high, since this is the time the user waits for the results. On the other hand, the heuristic based on dual-representation simulated annealing (DRSA-band) of Torres-Jimenez et al. (2015) obtained, in tests conducted by these authors, slightly better bandwidth results than the VNS-band heuristic (Mladenovic et al. 2010). In general, however, the DRSA-band is slower than the VNS-band in the results presented by Torres-Jimenez et al. (2015). Thus, the DRSA-band heuristic was not considered potentially the best low-cost heuristic with significant bandwidth reduction because its computational cost is higher than that of the VNS-band heuristic; apart from significantly reducing the bandwidth, a heuristic must also have a low computational cost, i.e., it cannot be slow compared with other heuristics. Many papers in this field (these are only two examples) evaluate their heuristics on 113 instances of the Harwell-Boeing dataset, in which the number of rows/columns of the matrices varies from 30 to 1104. Although the Harwell-Boeing sparse-matrix collection has been widely used for testing heuristics for bandwidth reduction, the matrices in this dataset are too small by today's standards. The results of such papers, with the largest case having only 1104 rows, offer little insight into how the tested heuristics compare on the large instances that are of more practical interest. Hence, additional datasets including much larger matrices are covered in our evaluation.

Thus, one of the main objectives of this study is to verify whether, with a low timeout, the VNS-band heuristic still achieves better results than the possible best low-cost heuristics for bandwidth or profile reduction identified in systematic reviews. Therefore, the main contribution of this work is the comparison of results obtained using 14 heuristics for bandwidth and profile reductions in symmetric and asymmetric matrices (up to 100,196 vertices). These 14 heuristics were selected in systematic reviews as the possible best low-cost methods to solve the problems (Chagas and Gonzaga de Oliveira 2015; Bernardes et al. 2015; Gonzaga de Oliveira and Chagas 2015).

Only potentially low-cost heuristics for bandwidth and profile reductions were selected in the systematic reviews. The reason for this decision is that a local reordering of the vertices of the graph associated with the matrix of the linear system may contribute to reducing the computational cost of an iterative solver (Duff and Meurant 1989b). Although linear systems of reduced order can be solved efficiently with a multifrontal direct method, a prominent method for solving large-scale sparse linear systems is the conjugate gradient method (CGM) (Hestenes and Stiefel 1952; Lanczos 1952). A local ordering of the vertices of the graph corresponding to A (Duff and Meurant 1989b), which can be obtained by applying a heuristic for bandwidth reduction, reduces the computational cost of this method by improving cache hit rates (Das et al. 1992; Burgess and Giles 1997). On the other hand, bandwidth and profile reductions are not directly proportional to the reduction in computational cost obtained when linear systems are solved using an iterative method. An important issue is to have an ordering that leads to a small number of iterations when a preconditioner is applied (and this depends on the structure of the instance). Moreover, what is to be minimized is the total computing time, including the reordering time (at least when only a single linear system is to be solved); thus, the reordering of vertices must be performed at low cost. Additionally, Benzi et al. (1999) showed that heuristics for bandwidth reduction can have a positive effect on the computational cost of the generalized minimal residual (GMRES) method (Saad and Schultz 1986).
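To make the role of such reorderings concrete, the following pure-Python sketch implements the reverse Cuthill–McKee idea: a breadth-first numbering that visits neighbors in increasing-degree order and then reverses the result. Starting from a minimum-degree vertex is used here as a simple stand-in for the George–Liu pseudo-peripheral vertex finder of RCM–GL, so this is an illustration rather than the implementation evaluated in this paper:

```python
from collections import deque

def rcm(adj):
    """Reverse Cuthill-McKee sketch: adj maps each vertex to the set of
    its neighbors; returns order, where order[k] is the old label of the
    vertex that receives new label k."""
    order, visited = [], set()
    # start each connected component from a minimum-degree vertex
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # enqueue unvisited neighbors in increasing-degree order
            for w in sorted(adj[v] - visited, key=lambda u: len(adj[u])):
                visited.add(w)
                queue.append(w)
    order.reverse()  # the "reverse" step, which improves the profile
    return order

# A path graph 0-3-1-4-2 with scrambled labels: original bandwidth 3
adj = {0: {3}, 3: {0, 1}, 1: {3, 4}, 4: {1, 2}, 2: {4}}
pos = {v: k for k, v in enumerate(rcm(adj))}
new_bandwidth = max(abs(pos[u] - pos[w]) for u in adj for w in adj[u])
# new_bandwidth == 1
```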

The remainder of this paper is organized as follows. Section 2 explains the systematic reviews performed to identify the potential best low-cost heuristics for bandwidth and profile reductions. Section 3 presents how the simulations were conducted in this study. Section 4 shows the results. Finally, Sect. 5 addresses the conclusions.

2 Systematic reviews

Systematic reviews (Chagas and Gonzaga de Oliveira 2015; Bernardes et al. 2015; Gonzaga de Oliveira and Chagas 2015) report 73 and 74 heuristics for bandwidth and profile reductions, respectively, that had been published in the period of time spanning the 1960s to the present. Most of the heuristics were considered surpassed by other heuristics in these systematic reviews. Consequently, eight heuristics in each case were selected as potentially being the best low-cost heuristics for bandwidth [RCM–GL (George and Liu 1981), Burgess–Lai (BL) (Burgess and Lai 1986), WBRA (Esposito et al. 1998), FNCHC (Lim et al. 2003, 2004, 2007), GGPS (Wang et al. 2009), VNS-band (Mladenovic et al. 2010), hGPHH (Koohestani and Poli 2011), CSS-band (Kaveh and Sharafi 2012)] or profile [Snay (1976), RCM–GL (George and Liu 1981), RCM–GL–FL (Fenves and Law 1983), Sloan (1989), MPG (Medeiros et al. 1993), NSloan (Kumfert and Pothen 1997), Sloan–MGPS (Reid and Scott 1999), Hu and Scott (2001)] reduction.

Among the heuristics identified in the systematic reviews, 17 heuristics had been applied to both bandwidth and profile reductions. In addition, the Reverse Cuthill–McKee method with the pseudo-peripheral vertex given by the George–Liu algorithm (RCM–GL) (George and Liu 1981) was selected in both systematic reviews. Thus, 130 heuristics for bandwidth and profile reductions were identified across the two systematic reviews, and 15 heuristics were selected because no simulation or comparison in the articles analyzed showed that they could be superseded by any other heuristic in terms of bandwidth or profile reduction when computational costs are also considered.

The Gibbs–Poole–Stockmeyer (GPS) algorithm (Gibbs et al. 1976), although outperformed by metaheuristic-based heuristics, is one of the classic low-cost heuristics in the field for both bandwidth and profile reductions. Hence, the GPS algorithm was also implemented and its results were compared with those of the other heuristics implemented in this computational experiment.

On the other hand, Fenves and Law's RCM–GL (RCM–GL–FL) method (Fenves and Law 1983), despite being selected in the systematic review of heuristics for profile reduction, was not implemented in this work because it is a specific application of the RCM–GL method to finite element discretizations. Additionally, although its authors presented no computational costs, Hu and Scott's heuristic (Hu and Scott 2001) was selected for its profile-reduction results when compared with the heuristics tested by Hu and Scott (2001). However, the Hu–Scott heuristic was not implemented here because it was considered a high-cost heuristic, mainly because it performs matrix multiplications.

Another version of the VNS-band was proposed by Wang et al. (2014). This heuristic is outperformed by the fast node centroid hill climbing (FNCHC) heuristic (Lim et al. 2007), both in solution quality and in the computational cost required to produce an approximate solution. One can verify this by examining the results of the original VNS-band heuristic (Mladenovic et al. 2010) and of the VNS-band proposed by Wang et al. (2014) and comparing them with the results presented below. It should be noted that the dataset used in the tests shown by Wang et al. (2014) is a subset of a dataset presented below.

Thus, Table 1 shows 14 heuristics that can be considered as the most promising low-cost heuristics to solve the problems. These 14 heuristics were implemented and tested in this work.

Table 1 Low-cost heuristics for bandwidth and profile reductions that were selected from systematic reviews

3 Description of the tests

Appendix A (Appendices A–D are available at http://www.dcc.ufla.br/~sanderson/app_coam16) shows the testing and calibration performed to compare our implementations with the codes used by the original proposers of the 14 heuristics, to ensure that our codes were comparable to the algorithms originally proposed. To evaluate the bandwidth and profile reductions provided by the 14 selected heuristics, two datasets commonly used in this field were employed: 113 (50 symmetric and 63 asymmetric) instances of the Harwell-Boeing sparse-matrix collection (http://math.nist.gov/MatrixMarket/data/Harwell-Boeing) (Duff et al. 1989a) and 22 (17 symmetric and 5 asymmetric) instances of the University of Florida sparse-matrix collection (http://www.cise.ufl.edu/research/sparse/matrices/index.html) (Davis and Hu 2011). These instances represent a wide spectrum of scientific and engineering application matrices (i.e., standard test matrices arising from problems in linear systems, least squares, and eigenvalue calculations) and were employed here because many other researchers have used them for comparisons and evaluations. Specifically, the 113 instances of the Harwell-Boeing sparse-matrix collection were divided into two subsets: (i) 33 instances, ranging from 30 to 237 vertices according to the collection, each of which (used as counter-examples to hypotheses in sparse-matrix research) has fewer than 200 vertices when vertices without adjacencies are disregarded; and (ii) 80 instances, ranging from 207 to 1104 vertices. This subdivision of the 113 instances of the Harwell-Boeing sparse-matrix collection into two sets is common in the field; see, for example, Martí et al. (2001), Lim et al. (2004, 2007), Piñana et al. (2004), Rodriguez-Tello et al. (2008), Mladenovic et al. (2010), and Torres-Jimenez et al. (2015). Additionally, each set of instances was divided into symmetric and asymmetric matrices, as shown in the tables below.

To apply the heuristics implemented in this work to asymmetric matrices, we used a strategy proposed by Reid and Scott (2006). In this strategy, the asymmetric matrix \(A_u\) is added to its transpose \(A_u^\text{T}\), i.e., \(A_u + A_u^\text{T}\), resulting in a symmetric matrix. Subsequently, the heuristic is applied to the graph associated with the resulting matrix, and the reordering attained by the heuristic is used to reorder the rows of the original asymmetric matrix. According to the results presented by Reid and Scott (2006), this strategy produced the poorest bandwidth reductions for asymmetric matrices using the RCM heuristic (George 1971) when compared with the two other strategies proposed by Reid and Scott (2006). Nevertheless, the strategy of computing \(A_u + A_u^\text{T}\) was applied in our approach because it is the one used along with the RCM–GL method (George and Liu 1981) in the MATLAB software (MATLAB 2016). In addition, it is the simplest strategy and the one that presents the lowest storage and computational costs among those proposed by Reid and Scott (2006).
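Only the nonzero pattern of \(A_u + A_u^\text{T}\) matters for the reordering, so it suffices to build the adjacency structure in which vertices i and j are connected whenever \(a_{ij} \ne 0\) or \(a_{ji} \ne 0\). A minimal sketch, assuming a dense 0-based matrix (not the data structures actually used in this work):

```python
def symmetrized_adjacency(A):
    """Adjacency structure of the graph of A_u + A_u^T: vertices i, j
    are neighbors whenever A[i][j] != 0 or A[j][i] != 0 (i != j)."""
    n = len(A)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i != j and (A[i][j] != 0 or A[j][i] != 0):
                adj[i].add(j)
    return adj

A_u = [[1, 0, 5],
       [0, 1, 0],
       [0, 2, 1]]
adj = symmetrized_adjacency(A_u)
# adj == {0: {2}, 1: {2}, 2: {0, 1}}: each edge now appears in both
# directions; the reordering found on this graph is then used to
# permute the rows of the original asymmetric matrix A_u
```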

The workstations used in the execution of the simulations with the instances of the Harwell-Boeing and University of Florida sparse-matrix collections contained an Intel® Core™ i3-2120 (3 MB cache, CPU 3.30 GHz\(\times \)4, 8 GB of DDR3 1333 MHz main memory) and an Intel® Xeon™ E5620 (12 MB cache, CPU 2.40 GHz\(\times \)8, 24 GB of DDR3 1333 MHz main memory) (Intel; Santa Clara, CA, United States), respectively. The Ubuntu 14.04 LTS 64-bit operating system with Linux kernel version 3.13.0-39-generic was used. It should be noted that we followed the recommendations given by Johnson (2002) for this experimental analysis of the 14 heuristics for bandwidth and profile reductions.

A metric used by some authors, for example Kumfert and Pothen (1997), Mladenovic et al. (2010), and Koohestani and Poli (2011), is to sum, for each heuristic, the bandwidths or profiles obtained. A similar metric, used for example by Martí et al. (2001), Piñana et al. (2004), Lim et al. (2007), and Rodriguez-Tello et al. (2008), is to calculate the average bandwidth over the instances in each set. However, these metrics are not reasonable for a set of instances with very different sizes, as is the case in the tests presented in this paper, because, for example, a heuristic with very bad results on a small instance would not be penalized appropriately.

In many works, including Burgess and Lai (1986), Medeiros et al. (1993), Esposito et al. (1998), Mladenovic et al. (2010), and Koohestani and Poli (2011), the authors compared heuristics by counting the number of times each heuristic obtained the lowest bandwidth or profile on the instances of the Harwell-Boeing sparse-matrix collection. This kind of metric is also shown in the following tables. However, counting wins can also mislead: a heuristic may reasonably be considered the best of an entire set even when it attains only the second-best result in many instances, while different heuristics are the best for different individual instances. Thus, to evaluate the quality of the results obtained using the heuristics tested, for each heuristic h in each set of instances, \(\rho _p=\frac{\mathrm{profile}_h-\mathrm{profile}_{\mathrm{min}}}{\mathrm{profile}_{\mathrm{min}}}\) was calculated for each instance, where \(\mathrm{profile}_h\) is the profile obtained using heuristic h and \(\mathrm{profile}_{\mathrm{min}}\) is the lowest profile obtained for that instance by any heuristic tested. The same is carried out in relation to both bandwidth reduction (\(\rho _{\beta }\)) and execution times (\(\rho _t\)). To the best of our knowledge, this paper is the first published instance of this approach being used in the field.
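This normalization can be sketched as follows; the heuristic names and values here are illustrative only, not taken from the tables below:

```python
def rho(results):
    """Normalized gap for one instance: maps each heuristic h to
    (value_h - value_min) / value_min, so the best heuristic scores 0
    regardless of the size of the instance."""
    best = min(results.values())
    return {h: (v - best) / best for h, v in results.items()}

# hypothetical profiles obtained by three heuristics on one instance
gaps = rho({'RCM-GL': 120, 'NSloan': 100, 'MPG': 110})
# gaps == {'RCM-GL': 0.2, 'NSloan': 0.0, 'MPG': 0.1}
```

Summing these gaps over all instances of a set yields the \(\sum \rho \) values reported in the figures.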

4 Results and analysis

This section presents the results obtained by the heuristics in relation to bandwidth and profile reductions in 113 (Sect. 4.1) and 22 (Sect. 4.2) instances of the Harwell-Boeing and University of Florida sparse-matrix collections, respectively. The tables in this section show the instance name and size (n), the initial bandwidth (\(\beta _0\)) or profile (\(\mathrm{profile}_0\)) of the instance, and the average values of bandwidth, profile, and runtime obtained using each heuristic over 10 executions on each instance. Section 4.3 analyzes the results obtained.

4.1 Results of 14 heuristics applied to instances of the Harwell-Boeing sparse-matrix collection

This section presents the bandwidth and profile results obtained using 14 heuristics applied to 113 instances of the Harwell-Boeing sparse-matrix collection. Sections 4.1.1–4.1.4 present the results of the 14 heuristics for bandwidth and profile reductions applied to the sets composed of 15 and 35 symmetric matrices and of 18 and 45 asymmetric matrices of the Harwell-Boeing sparse-matrix collection, respectively.

4.1.1 Results of 14 heuristics applied to the set composed of 15 symmetric instances of the Harwell-Boeing sparse-matrix collection

In this set composed of 15 symmetric instances, the FNCHC (Lim et al. 2003, 2004, 2007) and NSloan (Kumfert and Pothen 1997) heuristics were the best for reducing bandwidth and profile when considering the \(\rho _{\beta }\) and \(\rho _p\) metrics, respectively [see Fig. 1a, b, and Table 9 (Tables 9–21 are available in Appendix B)]. These two heuristics also obtained the largest number of best results in bandwidth (in 12 instances) and profile (in 8 instances) in this dataset (see Fig. 1c, d).

Fig. 1
figure 1

Results (a \(\sum \rho _b\), b \(\sum \rho _p\)) of 14 heuristics applied to reduce a bandwidth and b profile, and number of best results of several heuristics applied to reduce c bandwidth and d profile of 15 symmetric instances of the Harwell-Boeing sparse-matrix collection

The VNS-band heuristic was tested with a timeout of 500 s because this was the time set by its authors (Mladenovic et al. 2010). It should be noted that 500 s on the machine used in this study corresponds to more computation than 500 s on the machine used by Mladenovic et al. (2010). In spite of this, the VNS-band heuristic was dominated by the FNCHC and NSloan heuristics in relation to bandwidth and profile reductions, respectively, for this set of instances.

The hGPHH-GL and RCM–GL methods were the fastest among the heuristics tested according to the \(\rho _t\) metric, with the RCM–GL method obtaining the lowest computational cost in the largest number of instances (six) (see Fig. 2; Table 10). In particular, the WBRA (Esposito et al. 1998), FNCHC (Lim et al. 2003, 2004, 2007), VNS-band (Mladenovic et al. 2010), and CSS-band (Kaveh and Sharafi 2012) heuristics showed higher execution times than the 10 other heuristics tested. Moreover, the hGPHH-GL method is as fast as the RCM–GL method because the only difference between them is the order in which the vertices adjacent to the current vertex are numbered.

Fig. 2
figure 2

Results (\(\sum \rho _t\)) of 10 heuristics applied to reduce bandwidth of 15 symmetric instances of the Harwell-Boeing sparse-matrix collection

4.1.2 Results of 14 heuristics applied to the set composed of 35 symmetric instances of the Harwell-Boeing sparse-matrix collection

In this set composed of 35 symmetric instances, the VNS-band (8 s) (Mladenovic et al. 2010) and MPG (Medeiros et al. 1993) heuristics were the best ones for reducing bandwidth and profile when considering the \(\rho _b\) and \(\rho _p\) metrics, respectively (see Fig. 3a, b; Tables 11 and 12). These two heuristics also obtained the largest number of best results of bandwidth (in 24 instances) and profile (in 11 instances) in this set of instances, respectively (see Fig. 3c, d). In addition, Sloan’s (1989) algorithm was the fastest method among the heuristics tested (see Fig. 4; Table 13).

Fig. 3
figure 3

Results (a \(\sum \rho _b\), b \(\sum \rho _p\)) of 14 heuristics applied to reduce a bandwidth and b profile, and number of best results of several heuristics applied to reduce c bandwidth and d profile of 35 symmetric instances of the Harwell-Boeing sparse-matrix collection

Fig. 4
figure 4

Results (\(\sum \rho _t\)) of 10 heuristics applied to reduce bandwidth of 35 symmetric instances of the Harwell-Boeing sparse-matrix collection

4.1.3 Results of 14 heuristics applied to the set composed of 18 asymmetric instances of the Harwell-Boeing sparse-matrix collection

In this set composed of 18 asymmetric instances, the FNCHC and MPG heuristics were the best ones for reducing bandwidth and profile when considering the \(\rho _b\) and \(\rho _p\) metrics, respectively (see Fig. 5a, b; Table 14). The RCM–GL method was the fastest method among the heuristics tested (see Fig. 6; Table 15).

Fig. 5
figure 5

Results (a \(\sum \rho _b\), b \(\sum \rho _p\)) of 14 heuristics applied to reduce a bandwidth and b profile of 18 asymmetric instances of the Harwell-Boeing sparse-matrix collection

Fig. 6
figure 6

Results (\(\sum \rho _t\)) of 10 heuristics applied to reduce bandwidth of 18 asymmetric instances of the Harwell-Boeing sparse-matrix collection

4.1.4 Results of 14 heuristics applied to the set composed of 45 asymmetric instances of the Harwell-Boeing sparse-matrix collection

In this set composed of 45 asymmetric instances, the VNS-band (12 s) (Mladenovic et al. 2010) and MPG (Medeiros et al. 1993) heuristics were the best ones for reducing bandwidth and profile when considering the \(\rho _b\) and \(\rho _p\) metrics, respectively (see Fig. 7a, b; Tables 16 and 17). These two heuristics also obtained the largest number of best results of bandwidth (in 34 instances) and profile (in 17 instances) in this set of instances, respectively (see Fig. 7c, d). The hGPHH-GL heuristic was the fastest method among the heuristics tested (see Fig. 8; Table 18).

Fig. 7
figure 7

Results (a \(\sum \rho _b\), b \(\sum \rho _p\)) of 14 heuristics applied to reduce a bandwidth and b profile, and number of best results of several heuristics applied to reduce c bandwidth and d profile of 45 asymmetric instances of the Harwell-Boeing sparse-matrix collection

Fig. 8
figure 8

Results (\(\sum \rho _t\)) of 10 heuristics applied to reduce bandwidth of 45 asymmetric instances of the Harwell-Boeing sparse-matrix collection

4.2 Results of 10 heuristics applied to instances of the University of Florida sparse-matrix collection

The results of Snay's (1976), Burgess and Lai's (1986), WBRA (Esposito et al. 1998), and CSS-band (Kaveh and Sharafi 2012) heuristics were dominated by those of the 10 other heuristics when applied to the 113 instances of the Harwell-Boeing sparse-matrix collection (see Sect. 4.1), so these four heuristics were not applied to the 22 instances of the University of Florida sparse-matrix collection. Hence, Sects. 4.2.1 and 4.2.2 present the results of 10 heuristics for bandwidth and profile reductions applied to the sets composed of 17 symmetric and 5 asymmetric matrices of the University of Florida sparse-matrix collection, respectively.

4.2.1 Results of 10 heuristics applied to the set composed of 17 symmetric instances of the University of Florida sparse-matrix collection

In this set composed of 17 symmetric instances, the FNCHC and NSloan heuristics were the best ones for reducing bandwidth and profile when considering the \(\rho _b\) and \(\rho _p\) metrics, respectively (see Fig. 9a, b; Table 19). These two heuristics also obtained the largest number of best results of bandwidth (in eight instances) and profile (in ten instances) in this set of instances, respectively (see Fig. 9c, d). The RCM–GL method was the fastest method among the heuristics tested (see Fig. 10; Table 20).

Fig. 9
figure 9

Results (a \(\sum \rho _b\), b \(\sum \rho _p\)) of 10 heuristics applied to reduce a bandwidth and b profile, and number of best results of several heuristics applied to reduce c bandwidth and d profile of 17 symmetric instances of the University of Florida sparse-matrix collection

Fig. 10
figure 10

Results (\(\sum \rho _t\)) of 10 heuristics applied to reduce bandwidth of 17 symmetric instances of the University of Florida sparse-matrix collection

We established the VNS-band timeout at 500 s to solve the problem in these 17 symmetric instances of the University of Florida sparse-matrix collection. The VNS-band heuristic would probably achieve better results with a higher timeout, but our intention is to investigate low timeouts for the heuristics. In the tests shown in Table 20, one can note that the VNS-band heuristic achieved its results at computational costs one and five orders of magnitude higher than those of the FNCHC and RCM–GL heuristics, respectively.

4.2.2 Results of 10 heuristics applied to the set composed of five asymmetric instances of the University of Florida sparse-matrix collection

In this set composed of five asymmetric instances, one can verify that the GPS and MPG heuristics were the best ones for reducing bandwidth and profile when considering the \(\rho _b\) and \(\rho _p\) metrics, respectively (see Fig. 11a, b; Table 21). These two heuristics also obtained the largest number of best results of bandwidth (in two instances) and profile (in three instances) in this set of instances, respectively. The RCM–GL method was the fastest method among the heuristics tested (see Fig. 12; Table 21).

Fig. 11
figure 11

Results (a \(\sum \rho _b\), b \(\sum \rho _p\)) of 10 heuristics applied to reduce a bandwidth and b profile of five asymmetric instances of the University of Florida sparse-matrix collection

Fig. 12
figure 12

Results (\(\sum \rho _t\)) of 6 heuristics applied to reduce bandwidth of five asymmetric instances of the University of Florida sparse-matrix collection

Similar to the executions performed with the 17 symmetric instances of the University of Florida sparse-matrix collection, we established the VNS-band timeout at 500 s to solve the problem in these five asymmetric instances of this dataset. Likewise, the VNS-band heuristic would probably achieve better results with a higher timeout, but our intention is to investigate low timeouts for the heuristics. In the tests performed on these five asymmetric matrices, the VNS-band heuristic achieved its results at computational costs one and five orders of magnitude higher than those of the GPS and RCM–GL algorithms, respectively (see Table 21).

Table 2 The most promising heuristics (for ten application areas) for reducing bandwidth and profile of symmetric matrices
Table 3 The most promising heuristics (for 14 application areas) for reducing bandwidth and profile of asymmetric matrices
Table 4 Heuristics that showed the best overall performance on the six sets of test matrices

4.3 Best heuristics applied to four sets of instances of the Harwell-Boeing and two sets of instances of the University of Florida sparse-matrix collections

The sets composed of 15 symmetric and 18 asymmetric instances contain much smaller matrices than the sets composed of 35 symmetric and 45 asymmetric instances, respectively. Moreover, the instances of the University of Florida sparse-matrix collection are larger than the instances of the Harwell-Boeing sparse-matrix collection. It may not be appropriate to give a small instance the same weight as a much larger instance. Thus, the sets of instances of the University of Florida sparse-matrix collection were the main ones considered to identify the best low-cost heuristics for bandwidth and profile reductions. On the other hand, the sets composed of instances of the Harwell-Boeing sparse-matrix collection are also considered, as described below.

We divided the sets of test problems by application area and then identified the most promising heuristics for each specific application. Tables 2 (symmetric matrices) and 3 (asymmetric matrices) show the most promising heuristics per application area when applied to the four sets of instances of the Harwell-Boeing (http://math.nist.gov/MatrixMarket/data/Harwell-Boeing (Duff et al. 1989a)) and the two sets of instances of the University of Florida sparse-matrix (http://www.cise.ufl.edu/research/sparse/matrices/index.html (Davis and Hu 2011)) collections with respect to bandwidth and profile reductions (see Tables 22 and 23 in Appendix C). Appendix D analyzes the heuristics that showed the best bandwidth and profile reductions for symmetric and asymmetric matrices. Table 4 shows the heuristics with the best overall performance on these six sets of test matrices.

Figure 13 [WBRA (Esposito et al. 1998), FNCHC (Lim et al. 2003, 2004, 2007), VNS-band (Mladenovic et al. 2010), and CSS-band (Kaveh and Sharafi 2012) heuristics], Figure 14 [GPS (Gibbs et al. 1976), Snay’s (1976), Burgess and Lai (1986), and GGPS (Wang et al. 2009) heuristics], and Figure 15 [GPS (Gibbs et al. 1976), FNCHC (Lim et al. 2003, 2004, 2007), GGPS (Wang et al. 2009), and VNS-band (Mladenovic et al. 2010) heuristics] show the execution times of the most time-consuming heuristics tested, applied to 113 and 22 instances of the Harwell-Boeing and University of Florida sparse-matrix collections, respectively.

As expected, the FNCHC (Lim et al. 2003, 2004, 2007), VNS-band (Mladenovic et al. 2010), and CSS-band (Kaveh and Sharafi 2012) heuristics showed higher computational costs than most of the methods tested here because these are metaheuristic-based heuristics.

Fig. 13

Average execution times of the WBRA (Esposito et al. 1998), FNCHC (Lim et al. 2003, 2004, 2007), VNS-band (8 s) (Mladenovic et al. 2010), and CSS-band (Kaveh and Sharafi 2012) heuristics applied to 103 instances (ranging from 100 to 1104 vertices) of the Harwell-Boeing sparse-matrix collection

Fig. 14

Average execution times of the GPS (Gibbs et al. 1976), Snay’s (1976), Burgess and Lai (1986), and GGPS (Wang et al. 2009) heuristics applied to 113 instances of the Harwell-Boeing sparse-matrix collection

Fig. 15

Average execution times of the GPS (Gibbs et al. 1976), FNCHC (Lim et al. 2003, 2004, 2007), GGPS (Wang et al. 2009), and VNS-band (Mladenovic et al. 2010) heuristics applied to 17 symmetric and 5 asymmetric instances of the University of Florida sparse-matrix collection

The WBRA heuristic (Esposito et al. 1998) shows a high computational cost because it builds a rooted level structure (RLS) for each vertex of the graph, and these structures are renumbered based on a bottleneck linear assignment. This heuristic for bandwidth reduction of symmetric matrices did not show competitive results when compared with other heuristics, such as the VNS-band, FNCHC, and GPS heuristics. The WBRA heuristic employs a function that moves vertices from the level of maximum width to an adjacent level (nearer level 0) in the level structure (which may no longer be rooted at a vertex). This may result in a worse distribution of the vertices than in the RLS created by the GPS algorithm, for example.
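To make the discussion concrete: a rooted level structure is simply the partition of the vertices into breadth-first levels around a root, and its width is the size of the largest level. A minimal sketch follows; the adjacency-list representation and function names are our own and are not taken from the WBRA paper:

```python
def rooted_level_structure(adj, root):
    """Partition the vertices reachable from `root` into BFS levels."""
    levels, seen = [[root]], {root}
    while True:
        # gather unvisited neighbors of the last level, preserving order
        frontier = [w for v in levels[-1] for w in adj[v] if w not in seen]
        frontier = list(dict.fromkeys(frontier))  # drop duplicates
        if not frontier:
            return levels
        seen.update(frontier)
        levels.append(frontier)

def width(levels):
    """Width of a level structure: the size of its largest level."""
    return max(len(level) for level in levels)
```

WBRA builds one such structure per vertex of the graph, which already explains its cost on large instances.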

The results of the CSS-band heuristic may be related to the random initial solution suggested by its authors (we strictly followed the proposed algorithms when implementing these heuristics). By contrast, the VNS-band and FNCHC heuristics employ a variation of the breadth-first search procedure to provide an initial solution. Moreover, the CSS-band heuristic converged slowly.

Although based on the RCM and GPS (Gibbs et al. 1976) algorithms, Burgess and Lai’s heuristic (Burgess and Lai 1986) contains a stage in which three RLSs are modified, and it showed higher execution times than the GPS algorithm. More precisely, Burgess and Lai’s heuristic (Burgess and Lai 1986) attempts to reduce the width of each level of the three RLSs built in its first step. The heuristic moves vertices one level up or down until a minimum width is reached. Thus, the algorithm may process a level for a long time before reaching this minimum width.
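The width-reduction stage can be illustrated by the following simplified greedy sketch. It is not the exact Burgess–Lai procedure, but it captures the move that procedure iterates: take the widest level and shift a vertex to an adjacent level whenever the move shrinks the width and keeps every edge between vertices of the same or adjacent levels (representation and names are illustrative):

```python
def reduce_level_width(adj, levels):
    """Greedily move vertices out of the widest level of a level
    structure into an adjacent, strictly smaller level."""
    level_of = {v: i for i, level in enumerate(levels) for v in level}
    moved = True
    while moved:
        moved = False
        widest = max(range(len(levels)), key=lambda i: len(levels[i]))
        for v in list(levels[widest]):
            neighbor_levels = [level_of[w] for w in adj[v]]
            for target in (widest - 1, widest + 1):
                # move only if the target stays smaller than the widest
                # level and no edge would span more than one level
                if (0 <= target < len(levels)
                        and len(levels[target]) + 1 < len(levels[widest])
                        and all(abs(l - target) <= 1 for l in neighbor_levels)):
                    levels[widest].remove(v)
                    levels[target].append(v)
                    level_of[v] = target
                    moved = True
                    break
    return levels
```

Each accepted move strictly decreases the sum of squared level sizes, so this sketch terminates; the original heuristic instead iterates until a minimum width is reached, which is why it may spend a long time on a single level.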

Fig. 16

Average execution times of the RCM–GL (George and Liu 1981), Sloan’s (1989), MPG (Medeiros et al. 1993), NSloan (Kumfert and Pothen 1997), Sloan–MGPS (Reid and Scott 1999), and hGPHH-GL (Koohestani and Poli 2011) heuristics applied to 113 asymmetric instances of the Harwell-Boeing sparse-matrix collection

Snay’s heuristic (Snay 1976) is similar to the RCM method in that both are based on a breadth-first search procedure. The RCM method labels the vertices adjacent to the current vertex (sorting them in ascending order of degree), whereas Snay’s heuristic (Snay 1976) also takes into account vertices in the second level of adjacency of the vertices already labeled (sorting them in ascending order of degree and considering only adjacencies to vertices that are neither labeled nor candidate vertices). Snay’s heuristic (Snay 1976) performs this task for 10 pseudo-peripheral vertices, which explains its high computational cost. Although designed for profile reduction (by allowing vertices of small degree to be labeled before other vertices), Snay’s heuristic did not produce results competitive with those of the other heuristics for profile reduction, such as Sloan’s algorithm (and its variations).
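The underlying RCM procedure that both heuristics build on can be sketched in a few lines: a breadth-first search that visits the unlabeled neighbors of each vertex in ascending order of degree, reversing the final sequence. The representation is illustrative, and production implementations also select a pseudo-peripheral starting vertex rather than taking one as a parameter:

```python
from collections import deque

def reverse_cuthill_mckee(adj, start):
    """Cuthill-McKee ordering from `start`, reversed (RCM)."""
    order, labeled = [start], {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        # label the unlabeled neighbors in ascending order of degree
        for w in sorted(adj[v], key=lambda u: len(adj[u])):
            if w not in labeled:
                labeled.add(w)
                order.append(w)
                queue.append(w)
    return order[::-1]
```

Snay’s heuristic differs in that it also looks one adjacency level further ahead before committing to a label, multiplying the work per labeled vertex; repeated over 10 pseudo-peripheral starting vertices, this accounts for the cost observed above.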

In addition to the GPS (Gibbs et al. 1976), MPG (Medeiros et al. 1993), NSloan (Kumfert and Pothen 1997), and FNCHC (Lim et al. 2003, 2004, 2007) heuristics for bandwidth and/or profile reductions, it should also be noted that the RCM–GL (George and Liu 1981) and hGPHH-GL (Koohestani and Poli 2011) methods showed very low computational costs on the six datasets tested. Figure 16 shows the execution times of the RCM–GL (George and Liu 1981), Sloan’s (1989), MPG (Medeiros et al. 1993), NSloan (Kumfert and Pothen 1997), Sloan–MGPS (Reid and Scott 1999), and hGPHH-GL (Koohestani and Poli 2011) heuristics, the lowest-cost heuristics tested, applied to 113 asymmetric instances of the Harwell-Boeing sparse-matrix collection. In particular, the 14 heuristics showed very high computational costs on the MBEACXC, MBEAFLW, and MBEAUSE instances (from economic modeling). These instances have many connected components and isolated vertices, and the heuristics are applied to each connected component of a disconnected graph.
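Handling instances such as MBEACXC therefore begins by decomposing the graph; a minimal sketch of that decomposition follows (names are illustrative), with each resulting component then reordered independently:

```python
def connected_components(adj):
    """Connected components of an undirected graph (adjacency list).

    Isolated vertices yield singleton components; a reordering heuristic
    is then run once per component, which is why instances with many
    components and isolated vertices are comparatively expensive.
    """
    seen, components = set(), []
    for v in adj:
        if v in seen:
            continue
        component, stack = [], [v]
        seen.add(v)
        while stack:  # iterative depth-first traversal of one component
            u = stack.pop()
            component.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return components
```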

Figure 17 shows the average execution times of the RCM–GL (George and Liu 1981), Sloan’s (1989), MPG (Medeiros et al. 1993), NSloan (Kumfert and Pothen 1997), Sloan–MGPS (Reid and Scott 1999), and hGPHH-GL (Koohestani and Poli 2011) heuristics applied to 22 instances of the University of Florida sparse-matrix collection. In general, the MPG heuristic (Medeiros et al. 1993) showed a lower computational cost than the Sloan (1989), NSloan (Kumfert and Pothen 1997), and Sloan–MGPS (Reid and Scott 1999) heuristics. In particular, the MPG heuristic (Medeiros et al. 1993) limits the number of candidate vertices examined in each iteration.

Fig. 17

Average execution times of the RCM–GL (George and Liu 1981), Sloan’s (1989), MPG (Medeiros et al. 1993), NSloan (Kumfert and Pothen 1997), Sloan–MGPS (Reid and Scott 1999), and hGPHH-GL (Koohestani and Poli 2011) heuristics applied to 22 instances of the University of Florida sparse-matrix collection

5 Conclusions

Among 132 heuristics for bandwidth and/or profile reduction identified in the literature, 14 were selected as the most promising low-cost heuristics for these problems. In experiments with 113 and 22 instances (with up to 100,196 vertices) of the Harwell-Boeing and University of Florida sparse-matrix collections, respectively, the FNCHC (Lim et al. 2003, 2004, 2007) and GPS (Gibbs et al. 1976) heuristics showed the best results with respect to bandwidth reduction of symmetric and asymmetric matrices, respectively. Additionally, the NSloan (Kumfert and Pothen 1997) and MPG (Medeiros et al. 1993) heuristics may be seen as the most promising low-cost algorithms for profile reduction of symmetric and asymmetric matrices, respectively.

Heuristics for reordering vertices help provide adequate memory locality and, hence, improve cache hit rates (Das et al. 1992; Burgess and Giles 1997). Moreover, as the computational experiments described in this paper show, the choice of heuristic is highly dependent on the specific instance. Thus, in addition to the GPS (Gibbs et al. 1976), MPG (Medeiros et al. 1993), NSloan (Kumfert and Pothen 1997), and FNCHC (Lim et al. 2003, 2004, 2007) heuristics, the RCM–GL (George and Liu 1981) and hGPHH-GL (Koohestani and Poli 2011) algorithms showed very low computational costs for bandwidth and/or profile reductions on the six sets of instances tested. These six heuristics were identified as the most promising low-cost heuristics for bandwidth and profile reductions, so that the computational cost of solving large-scale linear systems by iterative methods can be reduced (depending on the preconditioner used to reduce the number of iterations of the solver). In addition, Sloan’s, GGPS, and Sloan–MGPS heuristics can be applied to instances from specific application areas (see Tables 2, 3).

Additional datasets containing much larger matrices will be included in a future evaluation, to offer insight into how the tested heuristics compare on very large instances of more practical interest. Thus, we intend to apply the 14 heuristics tested in this work, in particular the six heuristics selected, to reorder vertices and reduce the computational cost of iterative methods for solving large-scale linear systems, and thereby to verify the best method(s) in this context.