
1 Introduction

Although supercomputers have been considered the de facto architectures for the execution of high-performance computing (HPC) applications, their operation and maintenance costs have increased at higher rates [9] than those of other cluster architectures. These increasing costs have led the scientific community to develop emerging, low-cost non-HPC clusters, such as clusters of workstations (COW), clusters of desktops (COD), and clusters of virtual machines (COV), which are commonly found in research laboratories and institutions.

In the context of non-HPC clusters, researchers have sought answers to the following questions. First, what kind of cluster best fits the needs of a given scientific application (HPC or non-HPC)? Second, a common and recurring question among scientists considering the execution of scientific applications on these clusters: which configuration (nodes and processors) of these clusters renders the best performance for a given scientific application? To answer these research questions, we propose the cluster performance profile, which summarizes the performance obtained for a variety of scientific applications in strong scaling benchmarking experiments. From these performance measurements, we derive further cluster-specific metrics, named cluster overhead and cluster coupling, based on a previously proposed methodology [14]. The cluster performance profile, which includes the application performance metrics, the cluster overhead, and the cluster coupling, enables the characterization of both non-HPC and HPC clusters by quantifying their strengths and weaknesses when executing well-known scientific computing kernels such as the seven dwarfs of scientific computing. These dwarfs exhibit a variety of communication and computation patterns common to many scientific applications, enabling further characterization of the capacity of clusters under different kinds of workloads.

These research questions are addressed in this paper as follows. Section 2 presents related work on characterizing scientific computing clusters and evaluating their capacity for the execution of scientific applications. Section 3 presents background on cluster overhead, cluster coupling, and the proposed cluster performance profile. Section 4 presents the performance evaluation of four clusters and the calculation of their cluster performance profiles. Section 5 discusses our findings on these cluster profiles and provides answers to the proposed research questions. Conclusions on the use of the cluster performance profile and future directions are drawn in Sect. 6.

2 Related Work

The quantification of the capacity of scientific computing clusters for the execution of high-performance computing applications is a common and ongoing research problem, known as performance evaluation and benchmarking. These evaluations provide, to some extent, an estimate of the performance that computers can deliver for specific applications. On non-HPC clusters, related work has been conducted on virtual clusters built on top of containers or virtual machines over cloud [1, 6, 8, 10, 12], IoT [3, 11], workstation [2], and desktop [4] infrastructures. Bare-metal cluster deployments on top of workstations [7] and desktops [13] have also been considered. Furthermore, these related works usually estimate capacity in terms of the metrics delivered by traditional compute-intensive benchmarks such as High-Performance Linpack (HPL), the NAS Parallel Benchmarks (NPB), or the HPC Challenge Benchmark (HCB).

Most related works that estimate the capacity of virtual and containerized clusters concentrate on determining the computation and communication overhead these technologies impose on the performance of HPC applications. These works usually compare the capacity (given in FLOPS, latency, bandwidth, and related metrics) of the bare-metal host system with that of the same system when hosting containers or virtual machines. On the other hand, related works that estimate the capacity of bare-metal deployments, such as clusters of workstations and desktops, use the capacity estimated on supercomputers as a baseline to determine whether the system under study has a satisfactory capacity.

In this work we extend the state of the art in performance evaluation and benchmarking of scientific clusters in the following directions: (i) we extend the performance analysis commonly conducted in strong scaling studies with the cluster performance profile, which comprises the traditional performance metric and our previously proposed metrics (cluster overhead and cluster coupling [14]); and (ii) we provide a quantitative characterization of four small-size scientific computing clusters and demonstrate the validity of this characterization, summarizing it in a cluster performance profile per cluster.

3 Background

Here we define the concepts that are relevant for the quantitative characterization of scientific computing clusters; namely, cluster overhead, cluster coupling, and the cluster performance profile.

3.1 Cluster Overhead and Coupling

Cluster distributed systems have been categorized as loosely- or tightly-coupled according to the storage, interconnection, and processing technology, and to the component packing strategy employed in their development. In addition, the speed and reliability of the interconnection channel have been considered as criteria for this classification [5]. Nevertheless, the loosely- and tightly-coupled classification does not provide quantitative information about how coupled these different computing systems are. Accordingly, in [14] we proposed a methodology to quantitatively estimate the coupling of clusters using a metric we called cluster overhead.

The cluster overhead is estimated by determining how similar a given cluster is to its tightly-coupled counterpart, assumed to be a single node. Figure 1 depicts the performance of a single node of a cluster (\(P_{h}(1)\)) and the performance of the same cluster with n computing nodes (\(P_{h}(n)\)) for the high-performance computing application building block h. The figure also depicts \(P'_{h}(1)\) and \(P'_{h}(n)\), which serve as linear approximations of \(P_{h}(1)\) and \(P_{h}(n)\), respectively. The cluster overhead with respect to h is then given by the following formula.

$$\begin{aligned} \alpha = \measuredangle \overline{P'_{h}(1)}\, \overline{P'_{h}(n)} \end{aligned}$$
(1)

Although computers’ performance does not behave linearly, linear approximations allow us to derive properties of lines that enable a further understanding of a cluster’s performance. For instance, in the formula, the angle (\(\alpha \)) between the segments \(\overline{P'_{h}(1)}\) and \(\overline{P'_{h}(n)}\) stands for the performance loss, also known as the cluster overhead, measured in degrees. Note that there is an inverse relationship between the cluster overhead and the coupling, as shown in Fig. 2: large values of \(\alpha \) stand for a large overhead, resulting in poor similarity between a node and the whole cluster and hence a loosely coupled cluster, whereas small values of \(\alpha \) stand for a small overhead, resulting in higher similarity between a node and the whole cluster and hence a tightly coupled cluster. The cluster coupling is then defined by the following formula, where \(\alpha \) stands for the cluster overhead.

$$\begin{aligned} c = \frac{1}{\alpha } \end{aligned}$$
(2)
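For concreteness, the following minimal Python sketch shows one way to compute Eqs. (1) and (2) from strong-scaling measurements, assuming each segment is obtained by a least-squares linear fit of performance versus processor count and that the overhead is taken as the signed difference between the fitted angles (baseline minus cluster). The function names and sample numbers are illustrative only, not measured values.

```python
import math

import numpy as np


def fitted_angle(procs, perf):
    """Slope and angle (degrees, w.r.t. the x-axis) of a least-squares
    linear approximation of performance (MOP/s) vs. processor count."""
    slope, _intercept = np.polyfit(procs, perf, 1)
    return slope, math.degrees(math.atan(slope))


def overhead_and_coupling(baseline, cluster):
    """Cluster overhead (alpha, Eq. 1) and coupling (c = 1/alpha, Eq. 2),
    taking the single-node measurements as the tightly-coupled baseline.
    Each argument is a (processor_counts, performances) pair."""
    _, theta_base = fitted_angle(*baseline)
    _, theta_clus = fitted_angle(*cluster)
    alpha = theta_base - theta_clus            # signed angle between segments
    return alpha, (math.inf if alpha == 0 else 1.0 / alpha)


# Illustrative values only: a single node vs. the whole n-node cluster.
alpha, c = overhead_and_coupling(([1, 2, 4], [10.0, 19.0, 36.0]),
                                 ([4, 8, 16], [30.0, 44.0, 60.0]))
print(f"alpha = {alpha:.4f} deg, coupling = {c:.4f}")
```

With this convention, a negative \(\alpha \) can arise when a multi-node segment grows faster than the single-node baseline, which is exactly the CG case discussed in Sect. 4.4.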
Fig. 1. Cluster overhead

Fig. 2. Cluster coupling

Note that the performance measure used for the calculation of the coupling and overhead metrics varies according to the specific application used for the measurement. For instance, the NAS Parallel Benchmarks report performance in MOP/s (millions of operations per second), where the type of operation varies according to the specific benchmark: FT, MG, and CG count floating-point operations, whereas EP counts random numbers generated.

3.2 Cluster Performance Profile

Table 1 presents the cluster performance profile for a hypothetical cluster. This profile summarizes the properties of segments such as \(\overline{P'_{h}(1)}\) and \(\overline{P'_{h}(n)}\) that are of interest for cluster characterization. These properties are: the slope (m), the angle between a segment and the x-axis (\(\theta \)), the cluster overhead (\(\alpha \)), which is calculated with respect to another segment, and the coupling (c), which is derived from the cluster overhead. In the cluster performance profile, m and \(\theta \) describe how fast the performance increases along a segment established between the points \((p,P'_{h}(p))\) and \((p+1,P'_{h}(p+1))\), where p is the number of processors considered and \(P'_{h}(p)\) is the performance of h when considering p processors of the cluster. In terms of a line segment, m is the segment’s slope and \(\theta \) is the angle formed between the x-axis and the segment; m and \(\theta \) are directly related. In a similar way, \(\alpha \) describes how close the performance of a cluster, represented by one segment, is to that of a tightly-coupled instance, represented by another segment.

The baseline for the cluster performance profile is a single node. Since a single node is assumed to be a tightly-coupled instance, its cluster overhead with respect to itself is defined to be zero, \(\alpha = 0\). As the number of computational resources (processors, nodes) used in computing h increases, performance will tend to drop due to the parallel overhead. This may also lower the performance growth rate estimated by m and \(\theta \), increase the cluster overhead \(\alpha \), and decrease the system coupling. Although this is the expected behavior, different computation and communication patterns might deviate from it for different settings of nodes and processors. These differences are intended to be captured by the cluster performance profile.

Table 1. Cluster performance profile
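As an illustration of how a profile such as Table 1 can be assembled, the sketch below (under the same assumptions as the previous one) produces one row (m, \(\theta \), \(\alpha \), c) per node setting, with the single-node fit as the baseline; the input dictionary only shows the expected shape of the measurements and contains made-up numbers.

```python
import math

import numpy as np


def performance_profile(measurements, baseline="1 node"):
    """Build the rows of a cluster performance profile.

    `measurements` maps a node setting to a (processor_counts, MOP/s) pair.
    Returns {setting: (m, theta, alpha, c)}."""
    angles = {}
    for setting, (procs, perf) in measurements.items():
        m, _ = np.polyfit(procs, perf, 1)               # slope of the fit
        angles[setting] = (m, math.degrees(math.atan(m)))

    theta_base = angles[baseline][1]
    profile = {}
    for setting, (m, theta) in angles.items():
        alpha = theta_base - theta                      # 0 for the baseline
        c = math.inf if alpha == 0 else 1.0 / alpha     # undefined -> inf
        profile[setting] = (m, theta, alpha, c)
    return profile


rows = performance_profile({
    "1 node":  ([1, 2, 4], [10.0, 19.0, 36.0]),         # made-up numbers
    "2 nodes": ([2, 4, 8], [18.0, 30.0, 44.0]),
    "4 nodes": ([4, 8, 16], [28.0, 40.0, 52.0]),
})
for setting, (m, theta, alpha, c) in rows.items():
    print(f"{setting:8s} m={m:7.3f} theta={theta:8.4f} alpha={alpha:8.4f} c={c}")
```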

4 Performance Evaluation

Five scientific computing dwarfs (spectral methods, sparse linear algebra, unstructured meshes, structured meshes, and Monte Carlo), represented by four NAS Parallel Benchmarks (FT, CG, MG, and EP), are used to evaluate the performance of three non-HPC clusters, \(C_{w_1}\), \(C_{w_2}\), and \(C_{cov}\), and one HPC cluster, \(C_{hpc}\), which is used for validation purposes. The performance delivered by the applications in a strong scaling evaluation is then used to elaborate a cluster performance profile per cluster, according to the methodology proposed in Sect. 3.2. The findings on these profiles are further discussed in Sect. 5.

4.1 Experimental Setup

The experiment comprises a strong scaling performance evaluation on the dedicated clusters \(C_{w_1}\), \(C_{w_2}\), \(C_{cov}\), and \(C_{hpc}\), whose technical details are shown in Table 2. \(C_{hpc}\) is the university high-performance computing system; \(C_{cov}\) is a Microsoft Hyper-V-based cluster of virtual machines whose computing nodes were deployed on different servers at the university datacenter (although the virtual machines are dedicated, the datacenter is not); \(C_{w_1}\) and \(C_{w_2}\) are clusters of workstations differing in their network bandwidth and latency capabilities. The FT, CG, MG, and EP benchmarks are executed under the MPI-only execution scheme, increasing both the number of MPI processes and the number of nodes. The Class C problem size of the NAS Parallel Benchmarks is used and remains fixed throughout the experiments. The MPI process mappings considered for the experiment are described in Table 3, where ppn stands for MPI processes per node and tp for the total number of processes in the cluster.

Table 2. Clusters specifications
Table 3. Processes mappings
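To make the execution scheme concrete, the following sketch enumerates the runs of one strong scaling sweep and prints the corresponding command lines. It assumes Open MPI's mpirun with the -np, -npernode, and -hostfile options and NPB 3.4-style binary names (e.g., ft.C.x); the binary names, the hostfile path, and the listed (nodes, ppn) pairs are placeholders, since the actual mappings are those of Table 3.

```python
import itertools

BENCHMARKS = ["ft", "cg", "mg", "ep"]   # NPB kernels used in this study
NPB_CLASS = "C"                         # fixed problem size (Class C)
NODES = [1, 2, 4]                       # node settings
PPN = [1, 2, 4]                         # MPI processes per node (placeholder)


def run_commands(hostfile="hosts.txt"):
    """Yield one mpirun command line per (benchmark, nodes, ppn) setting."""
    for bench, nodes, ppn in itertools.product(BENCHMARKS, NODES, PPN):
        tp = nodes * ppn                                 # total MPI processes
        binary = f"./bin/{bench}.{NPB_CLASS}.x"          # NPB 3.4-style name
        yield (f"mpirun -np {tp} -npernode {ppn} "
               f"-hostfile {hostfile} {binary}")


for cmd in run_commands():
    print(cmd)
```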

4.2 Threats to Validity

The performance exhibited by a cluster is susceptible to a countless number of software parameters, to name a few: the problem size, the parallelization scheme, the supporting numerical libraries, and the algorithms. Given the actual difficulty of providing an accurate measure of overall computer performance, our estimation considers a simplified version of the problem: small-scale clusters and well-known scientific computing kernels with fixed problem sizes and execution schemes, in order to validate our methodology. Regarding the execution scheme, on the HPC cluster we do not use all the physical cores, so that only powers-of-two numbers of processors are used; this must be considered when performing comparisons between clusters. Also, although the experiments were executed three times, we observed no significant variations that would lead to errors in our performance estimates. In this regard, the results are well supported by theory.

Finally, note that performance is a compound metric that measures the application and the computing system together, not the computing system in isolation; this is what we mean when referring to the performance of the cluster.

4.3 Results

Figures 3, 4, 5 and 6 describe the performance achieved for clusters \(C_{hpc}\), \(C_{cov}\), \(C_{w_1}\) and \(C_{w_2}\) in the FT, CG, MG and EP benchmarks, respectively. These figures compare the performance achieved in the clusters for the different settings of nodes and processors considered for the computations.

Fig. 3. FT - Fast Fourier Transform - spectral methods

In FT, Fig. 3, \(C_{hpc}\) achieves strong scalability when compared to the other clusters, as expected. However, this cluster exhibits a particular behavior: the performance of the different node settings diverges from 16 processors onward. This divergence is explained by the NUMA memory architecture commonly present in high-performance computing cluster nodes. The performance attained for one, two, and four nodes is similar until the number of processors used per node exceeds the capacity of a single socket or NUMA domain, which suggests intra-node communication issues that degrade node performance. Note that these issues disappear when the 16 processes are distributed over two or four nodes, that is, under the \(2n*8p\) and \(4n*4p\) settings, respectively. Regarding the non-HPC computing systems, although \(C_{cov}\), \(C_{w_1}\), and \(C_{w_2}\) achieve strong single-node scaling, \(C_{cov}\) outperforms the multi-node performance of \(C_{w_1}\) and \(C_{w_2}\). Finally, even though \(C_{cov}\) exhibits better inter-node communication capabilities, better overall performance is achieved on a single node of either the \(C_{w_1}\) or \(C_{w_2}\) cluster.

Fig. 4. CG - Conjugate Gradient - sparse linear algebra and unstructured meshes

In CG, Fig. 4, \(C_{hpc}\) also demonstrates an upward trend in performance when increasing the number of nodes and processors; however, the NUMA memory architecture substantially hurts the performance of this computing kernel. For example, with 32 processors, the \(4n*8p\) nodes-processors setting outperforms the \(2n*16p\) setting, since in the former the eight processors per node do not cross the boundaries of a single node socket. The NUMA effect is also seen with 64 processors, where the increasing performance tendency drops dramatically. Concerning the non-HPC computing systems, all sustain scalable performance on a single node; however, \(C_{cov}\) exceeds the multi-node performance of its counterparts and achieves the best overall performance with four nodes. If we compare the maximum performance achieved by the whole \(C_{cov}\) cluster with the maximum performance achieved by a single \(C_{hpc}\) node, \(C_{cov}\) reaches just \(54.29\%\) of the \(C_{hpc}\) single-node capacity. Moreover, if we compare the maximum performance achieved in \(C_{w_2}\) against that achieved in \(C_{cov}\), a single \(C_{w_2}\) node reaches \(74.79\%\) of the whole \(C_{cov}\) cluster capacity.

Fig. 5. MG - Multi Grid - structured meshes

In MG, Fig. 5, \(C_{hpc}\) demonstrates the same upward scaling pattern seen in the FT computing kernel. This similarity suggests the same communication and computation pattern: both kernels solve the Poisson equation and employ short- and long-distance communication operations, but MG, unlike FT, is a memory-intensive kernel [14]. On the non-HPC computing systems, this kernel shows poor scalability on \(C_{cov}\) and no scalability on the \(C_{w_1}\) and \(C_{w_2}\) clusters. In particular, the lack of scalability on \(C_{w_1}\) and \(C_{w_2}\) might be attributed to the memory-intensive nature of MG, suggesting a memory bandwidth issue. For the MG computing kernel, \(C_{cov}\) outperforms clusters \(C_{w_1}\) and \(C_{w_2}\) in all node settings, namely one, two, and four nodes. Finally, when compared to \(C_{hpc}\), the whole \(C_{cov}\) cluster reaches only \(41.75\%\) of a single \(C_{hpc}\) node capacity.

Fig. 6. EP - Embarrassingly Parallel - Monte Carlo

In EP, Fig. 6, clusters \(C_{hpc}\), \(C_{cov}\), and \(C_{w_1}\) demonstrate strong scalability, as expected. Note that \(C_{w_2}\) was not considered since it would achieve roughly the same performance as \(C_{w_1}\): the computing nodes are the same and only the inter-node interconnection differs, which the embarrassingly parallel EP kernel barely exercises. When considering the non-HPC computing systems, the overall performance achieved in cluster \(C_{w_1}\) exceeds that achieved in \(C_{cov}\); here, \(C_{cov}\) reaches just \(69.40\%\) of the overall performance of \(C_{w_1}\).

4.4 Clusters Performance Profiles

Tables 4, 5, 6 and 7 provide four performance-derived metrics for the clusters under study; namely slope, \(\theta \), \(\alpha \), and coupling. These metrics characterize the clusters under study with respect to the four fundamental computing kernels: FT, MG, CG, and EP. Note that unlike the previous section where application performance was analyzed per node and per processing unit, here the performance is described for the clusters as whole entities.

Table 4 depicts the characterization computed for the \(C_{hpc}\) cluster. The slope, as mentioned in Sect. 3, describes the growth rate of the performance delivered by an application for a given processors-and-nodes setting. A downward trend is expected in this rate as the number of computing nodes and processors climbs, due to the parallel overhead. This pattern is exhibited by FT, MG, and EP. CG, on the contrary, shows a particular behavior: the rate attained with a two-node \(C_{hpc}\) cluster is higher than that achieved with one- and four-node \(C_{hpc}\) clusters. This is explained by the increasing performance tendency seen for two nodes in Fig. 4a, which is contrary to that observed for one and four nodes. Note that the slope also tells us the performance cost of increasing the number of computational resources (processors and nodes). In the case of \(C_{hpc}\), this cost tends to increase linearly for all applications except CG.

\(C_{hpc}\) shows consistency between the cluster overhead and coupling and the expectation that the overhead tends to increase as more computing resources are added to the computation; as a consequence, the resulting coupling decreases to the same extent. This expectation holds for all applications except CG, which shows a negative cluster overhead of \(-0.0501\). This means that, instead of incurring overhead, the application exhibits a performance rate (slope) that surpasses the single-node rate. Since the trend seen for the two-node \(C_{hpc}\) cluster surpasses that of the single node, which was taken as the tightly coupled instance for the computation of coupling, the value \(-19.9600\) cannot represent the coupling of the two-node \(C_{hpc}\) cluster: a single node is no longer a tightly coupled instance of the cluster. Another particular behavior seen in \(C_{hpc}\) is exhibited by FT, whose performance rate decreases significantly from two to four nodes; this suggests scalability issues of FT for many computing nodes. Lastly, in general, \(C_{hpc}\) exhibits a cluster overhead close to zero and a high degree of coupling, which is a characteristic of high-performance computing systems. However, cluster overhead and coupling only become truly meaningful when compared with the values obtained for other clusters.
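As an arithmetic check of this reading, and assuming the overhead is the signed difference between the baseline angle and the segment angle, the reported CG values are mutually consistent:

$$\begin{aligned} \alpha = \theta _{1n} - \theta _{2n} = -0.0501^{\circ } \quad \Rightarrow \quad c = \frac{1}{\alpha } = \frac{1}{-0.0501} \approx -19.9600, \end{aligned}$$

so the negative coupling is merely the arithmetic consequence of the two-node slope exceeding the single-node slope, not a meaningful coupling value.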

Table 4. Cluster performance profile \(C_{hpc}\)

Table 5 depicts the characterization computed for the \(C_{cov}\) cluster. Here, the slope decreases significantly, describing a polynomial behavior, except for CG, where the decrease is linear. Regarding cluster overhead, this cluster is consistent with the expectation that the overhead tends to increase as more computing resources are added to the computation; consequently, the resulting coupling is also consistent. Note that the values for cluster overhead and coupling of this cluster can be compared to those achieved by \(C_{hpc}\); however, when doing this comparison, we need to consider that \(C_{hpc}\) was evaluated with up to 16 processors while \(C_{cov}\) only considered four. If we take \(C_{hpc}\) as a baseline, we can conclude that \(C_{cov}\) exhibits the characteristics of a loosely coupled computing system, namely a higher cluster overhead and a lower coupling.

Table 5. Cluster performance profile \(C_{cov}\)

Table 6 shows the performance-derived metrics computed for the \(C_{w_1}\) cluster. Although we expected a downward trend in slope as we increased the number of computing nodes and processors, \(C_{w_1}\) depicts an unexpected trend: negative slopes for the communication-intensive computing kernels, these being rates of performance degradation. We also observe that for the FT and MG computing kernels the degradation rate is greater in the two-node setting than in the four-node setting, whereas for CG the largest degradation rate takes place, as expected, in the four-node setting. We attribute the higher degradation in the two-node setting to the poor network capacity of this cluster; this degradation seems to be compensated by the number of processors used in the four-node setting. The equivalent behavior seen in FT and MG follows from the similarity these computing kernels have in terms of computation and communication. Finally, the negative slopes render higher cluster overheads and thus poor cluster coupling.

Table 6. Cluster performance profile \(C_{w_1}\)
Table 7. Cluster performance profile \(C_{w_2}\)

Table 7 contains the performance-derived metrics computed for the \(C_{w_2}\) cluster. This cluster demonstrates the same performance degradation pattern seen in \(C_{w_1}\), since the computing nodes are the same except for the node interconnection. Although the negative slopes, cluster overhead, and cluster coupling metrics are similar on \(C_{w_1}\) and \(C_{w_2}\) for FT and MG, there are subtle improvements in these metrics due to the improved interconnection network in the \(C_{w_2}\) cluster. In addition, the network enhancement improved the coupling of the cluster in terms of the CG computing kernel.

5 Discussion

A scalability analysis and an analysis focused on the cluster overhead and coupling metrics were conducted on four scientific computing clusters in order to quantitatively characterize their capacity to execute scientific applications. The scalability analysis suggests memory bandwidth issues in \(C_{w_1}\) and \(C_{w_2}\) that might prevent the scalability of memory-intensive computing kernels such as MG. In addition, the analysis demonstrates the negative effect of the NUMA memory architecture present in \(C_{hpc}\) cluster nodes, which can slightly affect the scalability of scientific applications exhibiting computing patterns similar to FT and MG (spectral methods and structured meshes), but substantially hurts the scalability of sparse linear algebra and unstructured mesh computations such as those exhibited by CG on the \(C_{hpc}\) cluster. Finally, when considering the non-HPC clusters, \(C_{w_1}\) and \(C_{w_2}\) demonstrate the best overall performance for FT on one node and for EP on four nodes, whilst \(C_{cov}\) demonstrates the best overall performance for CG and MG in the four-node setting.

The cluster performance profile reveals numerous cluster-specific behaviors that might be considered, first, for the selection of clusters for specific scientific applications and, second, to guide architectural design decisions in the development of these clusters. On the first matter, under the experimental conditions, the workstation-based clusters \(C_{w_1}\) and \(C_{w_2}\) best fit the needs of FT and EP workloads for the given problem size and the MPI-only parallelization scheme, whereas the virtual-machine-based cluster \(C_{cov}\) best fits the needs of CG and MG workloads.

On the second matter, for instance, the performance rate of most computing kernels executed on \(C_{hpc}\) decreases linearly when increasing the number of computing resources. In contrast, \(C_{cov}\) exhibits a polynomial decrease in this rate, and \(C_{w_1}\) and \(C_{w_2}\) exhibit negative performance rates. Note that cluster designers can improve these performance rates through targeted enhancements, for example in the node interconnection, as demonstrated when improving the interconnection of \(C_{w_1}\) in \(C_{w_2}\).

6 Conclusion

In this work we proposed the cluster performance profile, which comprises performance-related metrics for specific computing kernels and cluster-specific metrics derived from the performance exhibited on these kernels. This profile was introduced to support both researchers running scientific applications on HPC and non-HPC clusters and cluster designers, mainly those developing low-cost scientific computing clusters.

In this regard, the profile delivers two main benefits for researchers. First, it serves as a guide for non-HPC experts to determine which kinds of scientific applications would strong-scale in a given cluster when increasing the number of computing resources, and which might not. Second, since the profile is based on well-known building blocks found in many scientific applications, it can serve as a first approximation when determining the appropriate cluster for the execution of applications built by combining these building blocks. In addition, the cluster performance profile also serves as a guideline for cluster designers, who will be able to perform improvements on scientific computing clusters guided by metrics such as the cluster overhead and coupling.

Future work may involve the use of the cluster performance profile on large-scale scientific computing clusters to fulfill two main objectives: first, to determine the validity of the proposed quantitative characterization in this type of cluster; and second, to determine how this profile, constructed from well-known building blocks executed on large-scale clusters, can anticipate the performance achievable by applications built from combinations of these building blocks.