
1 Introduction

Desmos is a supercomputer targeted at molecular dynamics (MD) calculations that was installed at JIHT RAS in December 2016. Desmos is the first application of the Angara interconnect to a GPU-based MPP system [1, 2].

Modern MPP systems can combine up to \(10^5\) nodes for solving one computational problem. For this purpose, MPI is the most widely used programming model. The architecture of the individual nodes can differ significantly and is usually selected (co-designed) to match the main type of workload the MPP system is intended for. The most important component of an MPP system is the interconnect: its properties have a major influence on the scalability of any MPI-based parallel algorithm. In this work, we describe the Desmos supercomputer, which is based on inexpensive 1CPU+1GPU nodes connected by the original Angara interconnect.

The Angara interconnect is a Russian-designed communication network with a torus topology. The interconnect ASIC was developed by JSC NICEVT and manufactured by TSMC using a 65 nm process. The Angara architecture uses some principles of both the IBM Blue Gene L/P and the Cray Seastar2/Seastar2+ torus interconnects. The torus interconnect developed by EXTOLL is a similar project [3]. The Angara chip supports deadlock-free adaptive routing based on bubble flow control [4], direction-ordered routing [5, 6], and initial and final hops for fault tolerance [5].

Earlier benchmarks confirmed the high efficiency of commodity GPU hardware for MD simulations [2]. Scaling tests for electronic structure calculations also showed the high efficiency of MPI exchanges over the Angara network.

In this paper, we bring together the results of performance and efficiency analyses of the Desmos supercomputer. These results pave the way for optimizations of supercomputer efficiency and could be relevant for other HPC systems.

2 Related Work

Job scheduling determines the efficiency of the practical deployment of a supercomputer and is a very important topic for parallel systems (see, e.g., [7]). The everyday work of supercomputer centers shows the need to separate cloud-like jobs (which do not require a high-bandwidth, low-latency interconnect between nodes) from regular parallel jobs. Such a separation is a way to increase the efficiency of supercomputer deployment [8]. There have been some attempts at statistical analysis of supercomputer operation in Russian HPC centers (see, e.g., [9]).

The increase in power consumption and heat generation of computing platforms is a significant problem. The measurement and presentation of performance-test results for parallel computer systems are becoming more and more evidence-based [10], including the measurement of energy consumption, which is crucial for the development of exascale supercomputers [11].

Nowadays, the partial use of single precision in MD calculations on consumer-grade GPUs cannot be regarded as a novelty. The results of projects such as Folding@Home have confirmed the broad applicability of this approach. Recent developments in optimized MD algorithms include the validation of single-precision solvers (see, e.g., [12]). In 2015, the authors of [13] gave very instructive guidelines for achieving the best performance at minimal cost.

The success of the TeraChem package [14] illustrates the amazing prospects of GPU usage for quantum chemistry.

The ongoing increase in the amount of data generated by HPC calculations leads to the requirement of a parallel file system for rapid I/O operations. However, benchmarking a parallel file system is a complicated (and usually expensive!) task, which is why accurate results of particular case studies are quite rare (see, e.g., [15]).

3 Statistical Data of Desmos Deployment

The batch system for user job scheduling on Desmos is based on Slurm, an open-source workload manager designed for Linux clusters of any size [16]. It is used in many HPC centers worldwide (the paper [16] has been cited more than 500 times). Slurm has the following main features:

  • allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so that they can perform work;

  • provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes;

  • arbitrates conflicting requests for resources by managing a queue of pending work.

Fig. 1. Job running time vs. job size. Each point corresponds to one job (Color figure online)

Fig. 2. Job waiting time vs. job running time. Each point corresponds to one job. The percent values shown in the red boxes correspond to the share of each region in the supercomputer total workload (Color figure online)

The SlurmDB daemon runs on the management node and stores the accounting data in a MySQL database. In September 2017, the SlurmDB database was activated on Desmos, giving us the possibility of a detailed analysis of the supercomputer load statistics. The default Slurm tool sreport has quite limited functionality, which is why we use SQL queries against the SlurmDB database for the statistical analysis. For example, the duration of each allocated job can be retrieved and computed with a single query.

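A minimal sketch of such a query is given below; it assumes the standard SlurmDB schema, in which each cluster has its own job table (here assumed to be named desmos_job_table) with the Unix-timestamp columns time_submit, time_start and time_end:

  SELECT id_job,
         nodes_alloc,
         FROM_UNIXTIME(time_start)           AS started,
         (time_end - time_start) / 3600.0    AS run_hours
  FROM   desmos_job_table
  WHERE  time_end > 0          -- completed jobs only
  ORDER  BY run_hours DESC;

Analogous queries aggregate the allocated node counts, running times and waiting times that underlie Figs. 1 and 2.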

Figure 1 shows the distribution of jobs over the number of cores used and over the running time \(t_{R}\). GPU floating-point performance is not taken into account when drawing the iso-levels of constant \(R_{\text {peak}}*t_{R}\). This quantity corresponds to the total number of floating-point operations that the CPUs allocated to a particular job are theoretically able to execute during the time \(t_{R}\).

Parallel algorithms can be executed either slowly on a modest number of cores (nodes) or quickly, provided their parallel scalability justifies the efficient use of a large number of processing elements. Two iso-levels separate three regions of the total number of floating-point operations corresponding to individual jobs: less than 10 PFlop, between 10 and 100 PFlop, and above 100 PFlop. The percent values shown in the blue boxes correspond to the share of each region in the total workload of Desmos since the beginning of SlurmDB logging. We see that most of the jobs executed on Desmos have been essentially supercomputer-type jobs.
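As a rough illustration of these iso-levels (using the per-node double-precision CPU peak of about 0.33 TFlops that follows from the figures quoted in Sect. 5.2), a job occupying all 32 nodes for one hour accumulates

\[ R_{\text{peak}}\, t_R \approx 32 \times 0.33~\text{TFlops} \times 3600~\text{s} \approx 38~\text{PFlop}, \]

so a full-machine job crosses the 100 PFlop iso-level after roughly 2.6 hours of running time, whereas a single-node job would need more than three days to do so.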

At the same time, we see that there are jobs that were executed on six cores or fewer, i.e., on a single node. Jobs of this type can easily be moved away from the supercomputer, either to a cloud or to a personal workstation.

The efficiency of the supercomputer job scheduling policy can be evaluated with graphs of this type: the more points appear on the right side of the graph, the more efficiently the end users collectively deploy the supercomputer. Users should be motivated to use scalable codes and to choose a larger number of nodes to speed up their calculations. The following Slurm batch system partitions have been created on the Desmos supercomputer:

  • test: max time = 15 min, any number of nodes;

  • max1n: max time = 1440 min, min/max number of nodes = 1;

  • max8n: max time = 1440 min, min/max number of nodes = 4/8;

  • max16n: max time = 720 min, min/max number of nodes = 4/16;

  • max32n: max time = 360 min, min/max number of nodes = 4/32.

This policy motivates users to deploy larger numbers of nodes and prevents the supercomputer from being overloaded with small one-node or two-node jobs.

Another aspect of job scheduling is the time a job waits to start after being submitted to the queue. Figure 2 shows the correlation between the job running time \(t_R\) and the job waiting time \(t_W\) (i.e., the time between the moment a job is submitted to the batch queue and the moment its execution starts). Three levels of the ratio \(t_W/t_R\) are shown in Fig. 2. Fortunately, the majority of jobs (66%) fall into the category with \(t_W < t_R\). Obviously, jobs with \(t_W > t_R\) should be regarded as inefficient, and reducing their number is another criterion of efficient supercomputer usage.
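The share of jobs with \(t_W < t_R\) can be extracted directly from SlurmDB; a sketch of such a query, under the same schema assumptions as above (a per-cluster job table assumed to be named desmos_job_table, with Unix-timestamp columns), is:

  SELECT 100.0 * SUM(CASE WHEN (time_start - time_submit) < (time_end - time_start)
                          THEN 1 ELSE 0 END) / COUNT(*)   AS share_tw_lt_tr
  FROM   desmos_job_table
  WHERE  time_end > time_start;   -- completed jobs with a valid start time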

4 Energy Consumption Optimization: The VASP Case Study

Our recent studies of energy consumption under CPU frequency variation [17,18,19] show that varying the CPU frequency can reduce the energy consumption of memory-bound algorithms. Here we extend this type of analysis from the level of a single CPU to the level of a whole supercomputer.

Fig. 3. Power variation and consumed energy variation on the same VASP test benchmark at different CPU frequencies

VASP 5.4.1 was compiled for Desmos using gfortran 4.8 and Intel MPI and linked with Intel MKL for BLAS, LAPACK and FFTW calls. The model represents a GaAs crystal consisting of 80 atoms in the supercell. All 32 nodes are used for the benchmark runs. Each run corresponds to one complete electron-density optimisation iteration consisting of 35 steps. We use the digital logging capabilities of the UPS to sample the power consumed during the benchmark runs.

The results of the CPU frequency variation from 3.5 GHz down to 1.2 GHz are presented in Fig. 3. The consumed power decreases together with the CPU frequency; at the same time, the time-to-solution increases.

The lower graph shows the variation of the total energy consumed in two cases:

  • The real measurements on Desmos show that no energy-saving regime can be obtained from CPU frequency variation alone.

  • The hypothetical case with all chassis fans switched off shows a shallow minimum of the total energy consumed. The total power consumption of the chassis fans working at full speed is about 4 kW; this fan-determined power draw is subtracted from the measured total power level (such a situation would arise, e.g., if liquid immersion cooling were used). The minimum corresponds to saving about 3.4% of energy at the cost of about 3.8% longer calculations (see the simple model after this list).
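A simple model illustrates why the minimum appears only in the hypothetical case (the symbols are introduced here purely for illustration). For a run of duration \(t(f)\) at CPU frequency \(f\),

\[ E_{\text{real}}(f) = P_{\text{total}}(f)\, t(f), \qquad E_{\text{no fans}}(f) = \bigl(P_{\text{total}}(f) - P_{\text{fans}}\bigr)\, t(f), \]

with \(P_{\text{fans}} \approx 4\) kW. Lowering the frequency saves energy only while the relative decrease in the power factor outweighs the relative increase in the running time. Subtracting the constant fan term makes the relative power decrease larger, so only \(E_{\text{no fans}}\) develops a (shallow) minimum.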

5 Case Studies of GPU Efficiency

NVIDIA CUDA technology was released in 2007, and the past decade has been a time of gradual adoption of this programming paradigm. Nowadays, the CUDA-enabled software ecosystem is quite mature. GPU usage in HPC is motivated not only by energy efficiency but by cost efficiency as well. Consumer cards with teraflops-level single-precision performance represent an attractive option for cheap computational acceleration. The deployment of such commodity GPU accelerators in the Desmos supercomputer was a carefully planned decision [2]. However, the lack of fast double-precision capabilities narrows the spectrum of potential problems that can be solved with this hardware.

Fig. 4. Comparison of the supercomputers Desmos and IRUS17 on two biomolecular benchmarks (RIB: 2 million atoms, MEM: 82 thousand atoms; see [13])

In this context, we present benchmarks showing the efficiency of the Desmos supercomputer for certain workloads.

5.1 Classical Molecular Dynamics with Gromacs

Classical molecular dynamics is an important modern scientific tool (see, e.g., [20,21,22,23,24]). In Fig. 4, we can see the results of the Intel Xeon-based supercomputer IRUS17 and the Desmos supercomputer on two detailed biomolecular benchmarks [13] from the Gromacs package.

The cost of each node in Fig. 4 includes the price of the computational resources and the corresponding infrastructure, excluding the cost of the interconnect. The prices of single nodes are estimated according to the price lists of the ThinkMate.com website at the end of November 2017.

A Desmos node without a GPU costs about $2600, while a Desmos node with one GTX 1070 costs $3100. An IRUS17 node with two Intel Xeon E5-2698 v4 CPUs costs $11,000 (IRUS17 consists of dual-node blades in an enclosure), and an IRUS17 node with two Intel Xeon E5-2699 v4 CPUs costs $13,000. The labels in Fig. 4 show the number of nodes; the node cost is multiplied by the corresponding number of nodes.
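Purely as an arithmetic illustration of how the cost axis in Fig. 4 is constructed (the particular node counts are chosen here only as examples), a 32-node GPU-equipped Desmos configuration corresponds to \(32 \times \$3100 = \$99{,}200\), whereas a single dual-socket IRUS17 node with E5-2699 v4 processors already enters the plot at \(\$13{,}000\).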

We see that Desmos is ahead of IRUS17 for these benchmarks, in terms of both maximum attainable speed of calculation (ns/day) and cost-efficiency.

5.2 Quantum Molecular Dynamics with TeraChem and GAMESS-US

Quantum chemistry and electronic structure calculations are among the major consumers of HPC resources worldwide (see, e.g., [25,26,27,28,29,30]). The TeraChem package is a rare example of CUDA-based software that deploys the single-precision floating-point operations of NVIDIA GPU accelerators very efficiently. In this work, we compare the performance of TeraChem with that of the well-known quantum chemistry package GAMESS-US.

The test model is the ab initio DFT molecular dynamics of a malondialdehyde molecule, CH\(_2\)(CHO)\(_2\). The 6-31G basis set is used together with the B3LYP exchange-correlation functional.

TeraChem is not MPI-parallelized and runs on a single node of Desmos (on a single core with a GTX 1070 accelerator). This hardware gives 0.5 s per MD step in this test benchmark for TeraChem. We see the same level of performance in the CPU-only, MPI-parallelized GAMESS-US calculation on 12 Desmos nodes (0.5 s per MD step).

It is instructive to compare the peak performance of the hardware deployed in these two tests. Twelve Desmos nodes have 4 TFlops of double-precision peak performance and 540 GB/s of total DRAM memory bandwidth. One GTX 1070 accelerator has 6 TFlops of single-precision peak performance and 256 GB/s of DRAM memory bandwidth. These numbers allow us to conclude that, with respect to GAMESS-US, the Desmos supercomputer is equivalent to a 128-TFlop supercomputer (\(= 32~\text {nodes} \times 4~\text {TFlops}\)) based on Intel Xeon Broadwell CPUs.
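The equivalence follows from a simple chain of estimates: in this benchmark, one GPU-equipped node matches twelve CPU-only nodes, i.e. about 4 TFlops of double-precision CPU peak performance, so that

\[ 1~\text{GPU node} \approx 12~\text{CPU-only nodes} \approx 4~\text{TFlops (DP)} \quad\Rightarrow\quad 32~\text{nodes} \approx 32 \times 4~\text{TFlops} = 128~\text{TFlops}. \]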

6 Parallel File System Benchmarks

Many scientific HPC codes generate huge amounts of data. For example, in classical MD, system sizes can reach trillions of atoms [31]. Desmos allows for GPU-accelerated modeling of MD systems with up to 100 million atoms. On-the-fly methods of data processing help considerably but cannot completely substitute for post-processing. Another unavoidable requirement is the saving of checkpoint (restart) files during or at the end of a calculation.

Fig. 5. The schemes of the supercomputers Desmos and Angara-K1

Fig. 6. Parallel output benchmarks based on the LAMMPS test model for the Angara-K1 and Desmos supercomputers

All 32 nodes of Desmos have been equipped with fast SSD drives, and the BeeGFS parallel file system has been installed in order to use all these disks as a single distributed storage volume.

For comparison, we consider the Angara-K1 supercomputer, located at JSC NICEVT. This cluster is based on the Angara interconnect as well. Angara-K1 has a dedicated storage server (hardware RAID-adapter Adaptec 5405z, RAID level: 6 Reed-Solomon, HDD: \(8\times 2\) TB SATA2, FS: Lustre 2.10.1, FS type: ext3).

The schemes of the Desmos and Angara-K1 supercomputers and relevant parameters are given in Fig. 5.

The standard Lennard-Jones benchmark was run with the LAMMPS molecular dynamics package (the benchmark is based on the “melt” example from the LAMMPS distribution; the model corresponds to an f.c.c. crystal of Lennard-Jones particles and was replicated to 16 million particles).

LAMMPS provides two variants for the output of large amounts of data: the standard output methods and the MPI-IO capability.

Figure 6 depicts the results of the benchmarks for different sizes of the MD model. We see that the absolute values of the calculation time are higher for Angara-K1 than for Desmos. However, the performance degradation due to storing large files is more pronounced for Desmos. The MPI-IO output, on the other hand, clearly benefits from the distributed storage of Desmos.

7 Conclusions

The paper presents the results of efficiency and performance analyses of the Desmos supercomputer.

  • The job accounting statistics of the Desmos supercomputer were reviewed. Two methods of quantitative efficiency monitoring were proposed.

  • A variation of the CPU frequency was attempted for energy optimization. A reduction in energy consumption does indeed show up, but the figures promise no practical benefit.

  • GPU-accelerated classical MD with Gromacs runs faster and is more cost-effective on supercomputers similar to Desmos than on widespread supercomputers based on expensive Intel Xeon multi-core CPUs.

  • GPU-accelerated quantum MD can be computed effectively on Desmos nodes using single precision. A comparison with the GAMESS-US package shows that TeraChem is able to efficiently substitute single-precision GPU performance for double-precision CPU performance when solving ab initio problems.

  • It is shown that BeeGFS effectively combines the distributed storage units located on the Desmos nodes into a single volume. MPI-IO shows very good speed when storing data from the LAMMPS MD calculation on the Desmos parallel file system. However, LAMMPS MPI-IO shows no benefit in the case of the conventional storage benchmarked on the Angara-K1 supercomputer.