
1 Introduction

The rapid development of parallel computational methods and supercomputer hardware provides great benefits for atomistic simulation methods. These mathematical models and computational codes are no longer only tools of fundamental research but are used more and more intensively for diverse applied problems [1]. For classical molecular dynamics (MD), the current limits of system size and simulated time are trillions of atoms [2] and milliseconds [3] (i.e. \(10^{12}\) steps with a typical MD step of 1 fs).

There are two mainstream ways to accelerate MD. The first is the use of distributed-memory massively parallel processing (MPP) systems. For MD calculations, domain decomposition is a natural technique to distribute both the computational load and the data across the nodes of an MPP system (e.g. [4]).
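As a minimal illustration of this technique (a generic sketch, not taken from any particular MD code; the function name, box parameters and process-grid parameters are ours), the following C fragment assigns an atom to the MPI rank owning the sub-domain that contains it:

    /* Generic sketch of spatial domain decomposition: the simulation box of
     * size lx x ly x lz is split into a px x py x pz grid of sub-domains and
     * each atom belongs to the rank owning its sub-domain. */
    static int owner_rank(double x, double y, double z,      /* atom position */
                          double lx, double ly, double lz,   /* box lengths   */
                          int px, int py, int pz)            /* process grid  */
    {
        int ix = (int)(x / lx * px);
        int iy = (int)(y / ly * py);
        int iz = (int)(z / lz * pz);
        if (ix == px) ix = px - 1;   /* atoms sitting exactly on the box edge */
        if (iy == py) iy = py - 1;
        if (iz == pz) iz = pz - 1;
        return (ix * py + iy) * pz + iz;   /* row-major rank numbering */
    }

Production MD codes additionally maintain halo (ghost) regions around each sub-domain and migrate atoms between ranks as they move.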

The second possibility is to increase the computing capability of individual nodes of MPP systems. Multi-CPU and multi-core shared-memory node architectures provide substantial acceleration. However, the scalability of shared-memory systems is limited by their cost and by the DRAM access speed of multi-socket and/or multi-core nodes. It is the development of GPGPU computing that has given shared-memory systems their largest performance boost.

This year marks the 10th anniversary of Nvidia's CUDA technology, which was introduced in 2007 and provided a convenient framework for GPU programming. Many algorithms have been rewritten and thoroughly optimized to use GPU capabilities. However, the majority of them exploit only a fraction of the theoretical GPU performance even after careful tuning (see e.g. [5,6,7]). Sustained performance is usually limited by the memory-bound nature of the algorithms.

Among GPU-aware MD software, one can point out GROMACS [8] as, perhaps, the most computationally efficient MD tool and LAMMPS [9] as one of the most versatile and flexible for the creation of MD models. Several GPU off-loading schemes have been implemented in LAMMPS [10,11,12,13]. GROMACS provides a highly optimized GPU scheme as well [14].

There are other ways to increase the performance of individual nodes: using GPU accelerators with OpenCL, using Intel Xeon Phi accelerators, or even using custom-built chips like MDGRAPE [15] or ANTON [3]. Currently, general-purpose Nvidia GPUs provide the most cost-effective option for MD calculations [16].

Modern MPP systems can unite up to \(10^5\) nodes for solving one computational problem. For this purpose, MPI is the most widely used programming model. The architecture of the individual nodes can differ significantly and is usually selected (co-designed) for the main type of deployment of the MPP system. The most important component of an MPP system is the interconnect, whose properties determine the scalability of any MPI-based parallel algorithm.

In this work, we describe the Desmos computing cluster, which is based on inexpensive 1 CPU + 1 GPU nodes connected by the original Angara interconnect with torus topology. We describe this interconnect (for the first time in English) and the resulting performance of the cluster for MD models in GROMACS and LAMMPS and for DFT calculations in VASP.

2 Related Work

Torus interconnect topologies have several attractive aspects in comparison with fat-tree topologies. In the 1990s, the development of MPP systems reached a peak with the remarkable success of the Cray T3E systems based on a 3D torus interconnect topology [17]; the Cray T3E was the first supercomputer to provide 1 TFlops of sustained performance. In June 1998, Cray T3E systems occupied four of the top five positions of the Top500 list. In 2004, after several years of dominance of Beowulf clusters, a custom-built torus interconnect appeared in the IBM BlueGene/L supercomputer [18]. Subsequent Cray and IBM supercomputers had torus interconnects as well (with the exception of the latest Cray XC series). Fujitsu designed the K Computer based on the Tofu torus interconnect [19]. The AURORA Booster and GREEN ICE Booster supercomputers are based on the EXTOLL torus interconnect [20].

Among recent developments of original supercomputer interconnects in Russia, we can mention the MVS-Express interconnect based on the PCI-Express bus [21] and the FPGA prototypes of the SKIF-Aurora [22] and Pautina [23] torus interconnects. The Angara interconnect has by now evolved all the way from an FPGA prototype [24, 25] to an ASIC-based card [26].

Fig. 1. The Desmos cluster.

Torus topology is believed to be beneficial for the strong scaling of many parallel algorithms. However, accurate data verifying this assumption are quite rare. Probably the most extensive study was done by Fabiano Corsetti, who compared torus and fat-tree topologies using the SIESTA electronic structure code on six large-scale supercomputers belonging to the PRACE Tier-0 network [27]. The author concluded that machines with torus topologies demonstrated better scalability to large system sizes than those with fat-tree topologies. A comparison of benchmark data for CP2K shows a similar trend [6].

3 The Desmos Cluster

The hardware was selected to maximize the number of nodes and the efficiency of a single node for MD workloads. Each node is a Supermicro SuperServer 1018GR-T with an Intel Xeon E5-1650v3 CPU (6 cores, 3.5 GHz), an Nvidia GeForce GTX 1070 GPU (8 GB GDDR5) and 16 GB of DDR4-2133 memory.

In contrast to professional accelerators, the Nvidia GeForce GTX 1070 cards have no error-correcting code (ECC) memory. For this reason, it was necessary to make sure that no GPU had hardware memory errors. Each GPU was tested with MemtestG80 [28] for more than 4 hours per card. No errors were detected for the 32 cards considered.

The cooling of the GPU cards required special attention. We used ASUS GeForce GTX 1070 8 GB Turbo Edition cards. Each card was partially disassembled prior to installation into the 1U chassis: the plastic cover and the dual-ball-bearing fan were removed, which made the card suitable for the horizontal airflow cooling inside the chassis.

The nodes are connected by Gigabit Ethernet and by the Angara interconnect (a 4D torus \(4\times 2\times 2\times 2\) over copper Samtec cables). Due to budget limitations, we did not use all available ports to build a fully connected 4D torus. The currently implemented topology is a 4D torus \(4\times 2\times 2\times 2\) \((X \times Y \times Z \times K)\), but along each of the Y, Z and K dimensions a node is connected to its neighbour by a single link only.
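For illustration, the physical \(4\times 2\times 2\times 2\) topology can be mirrored at the MPI level with a periodic Cartesian communicator. The sketch below uses only standard MPI calls and is not a feature of the Angara MPI stack itself; it assumes 32 ranks, one per node.

    /* Illustrative only: expressing the Desmos 4x2x2x2 torus as a periodic MPI
     * Cartesian communicator; standard MPI, run with 32 ranks (one per node). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int dims[4]    = {4, 2, 2, 2};   /* X, Y, Z, K extents of the torus */
        int periods[4] = {1, 1, 1, 1};   /* wrap-around links in every dimension */
        MPI_Comm torus;
        MPI_Cart_create(MPI_COMM_WORLD, 4, dims, periods, 1 /* reorder */, &torus);

        if (torus != MPI_COMM_NULL) {
            int rank, coords[4], left, right;
            MPI_Comm_rank(torus, &rank);
            MPI_Cart_coords(torus, rank, 4, coords);
            MPI_Cart_shift(torus, 0, 1, &left, &right);  /* neighbours along X */
            printf("rank %d at (%d,%d,%d,%d): X neighbours %d and %d\n",
                   rank, coords[0], coords[1], coords[2], coords[3], left, right);
            MPI_Comm_free(&torus);
        }
        MPI_Finalize();
        return 0;
    }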

There is a front-end node with the same configuration as the 32 compute nodes of the cluster (the front-end node is connected via Gigabit Ethernet only). The cluster runs SLES 11 SP4 with Angara MPI (based on MPICH 3.0.4).

The cluster energy consumption is 6.5 kW in the idle state and 14.4 kW under full load.

Table 1. Comparison of the Desmos cluster with the Polytekhnik cluster used as a reference for MPI benchmarks.

4 The Angara Interconnect

The Angara interconnect is a Russian-designed communication network with torus topology. The interconnect ASIC was developed by JSC NICEVT and manufactured by TSMC using a 65 nm process.

The Angara architecture uses some principles of the IBM BlueGene/L/P and Cray SeaStar2/SeaStar2+ torus interconnects. The torus interconnect developed by EXTOLL is a similar project [20]. The Angara chip supports deadlock-free adaptive routing based on bubble flow control [29], direction-ordered routing [17, 18], and initial and final hops for fault tolerance [17].

Each node has a dedicated memory region available for remote access from other nodes (read, write and atomic operations) to support OpenSHMEM and PGAS languages. Multiple programming models are supported, including MPI and OpenMP.
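The standard MPI-3 one-sided operations map naturally onto such remotely accessible memory regions. The following sketch (plain MPI, shown only to illustrate the access pattern; it does not use the Angara low-level API and needs at least two ranks) writes data directly into a window exposed by another rank:

    /* Sketch of MPI-3 one-sided communication: a remote write into an exposed
     * window, i.e. the access pattern an RDMA-capable network can serve directly. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double region[4] = {0};             /* memory exposed for remote access */
        MPI_Win win;
        MPI_Win_create(region, sizeof(region), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);              /* open an access epoch */
        if (rank == 0) {
            double payload[4] = {1.0, 2.0, 3.0, 4.0};
            /* write payload into the window of rank 1 without its participation */
            MPI_Put(payload, 4, MPI_DOUBLE, 1, 0, 4, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);              /* complete all outstanding RMA ops */

        if (rank == 1)
            printf("rank 1 received %.1f %.1f %.1f %.1f\n",
                   region[0], region[1], region[2], region[3]);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }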

The network adapter is a PCI Express card that is connected to adjacent nodes by up to 6 cables (or up to 8 with an extension board). The supported topologies are a ring and 2D, 3D and 4D tori.

To provide more insight into the communication behaviour of Angara, we present a performance comparison of the Desmos cluster and the Polytekhnik cluster, which uses a Mellanox InfiniBand 4x FDR interconnect with 2:1 blocking. Table 1 compares the two systems. Both are equipped with Haswell CPUs, but the processor characteristics differ significantly.

Figure 2a shows the latency results obtained with the OSU Micro-Benchmarks. Angara has an extremely low latency of 0.85 \(\upmu \)s for 16-byte messages and outperforms the 4x FDR interconnect for all small message sizes.
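The measured quantity is the usual one-way ping-pong latency. A bare-bones version of such a test is sketched below (the real osu_latency additionally performs warm-up iterations and a sweep over message sizes; the 16-byte size matches Fig. 2a, the iteration count is arbitrary):

    /* Bare-bones ping-pong between two ranks, in the spirit of osu_latency. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { MSG = 16, ITERS = 10000 };   /* 16-byte messages, as in Fig. 2a */
        char buf[MSG] = {0};

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; ++i) {
            if (rank == 0) {
                MPI_Send(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double latency = (MPI_Wtime() - t0) / (2.0 * ITERS);   /* one-way time */

        if (rank == 0)
            printf("one-way latency: %.2f us\n", 1e6 * latency);
        MPI_Finalize();
        return 0;
    }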

We use the Intel MPI Benchmarks to evaluate MPI_Barrier and MPI_Alltoall operation times for different numbers of nodes of the Desmos and Polytekhnik clusters. The superiority of Angara for small messages explains the better results for MPI_Barrier (Fig. 2b) and MPI_Alltoall with 16-byte messages (Fig. 2c). For large messages (256 KB), the Desmos results are worse than those of the Polytekhnik cluster (Fig. 2d). This can be explained by the loose connectivity of the Desmos torus topology (2.5 links per node) and by the performance limitations of the current Angara MPI implementation.
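A stripped-down timing loop of the same kind gives an idea of what is being measured for MPI_Alltoall (the Intel MPI Benchmarks add warm-up phases, message-length sweeps and statistics on top of this; the message size and iteration count below are arbitrary choices):

    /* Stripped-down MPI_Alltoall timing loop, illustration only. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int msg   = 16;     /* bytes sent to every other rank */
        const int iters = 1000;
        char *sendbuf = calloc((size_t)msg * nprocs, 1);
        char *recvbuf = calloc((size_t)msg * nprocs, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i)
            MPI_Alltoall(sendbuf, msg, MPI_CHAR, recvbuf, msg, MPI_CHAR,
                         MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / iters;

        if (rank == 0)
            printf("MPI_Alltoall, %d ranks, %d B per pair: %.2f us\n",
                   nprocs, msg, 1e6 * t);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }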

We have considered a heuristic algorithm for topology-aware mapping of MPI processes onto the physical topology of the Desmos cluster. The algorithm distributes processes over CPU cores so as to minimize exchange times [30]. Preliminary experimental results for the NAS Parallel Benchmarks show about a 50% performance improvement.
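For illustration only (this is not the heuristic of [30], and the function name is ours), the per-pair cost that such a mapping tries to minimize for heavily communicating processes is essentially the hop count on the periodic torus:

    /* Hop count between two nodes of a periodic 4D torus: the per-pair cost a
     * topology-aware mapping heuristic weights by communication volume. */
    #include <stdlib.h>

    static int torus_hops(const int a[4], const int b[4], const int dims[4])
    {
        int hops = 0;
        for (int d = 0; d < 4; ++d) {
            int diff = abs(a[d] - b[d]);
            int wrap = dims[d] - diff;      /* distance going the other way round */
            hops += (diff < wrap) ? diff : wrap;
        }
        return hops;
    }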

Fig. 2. The OSU latency between two adjacent nodes (a); times for MPI_Barrier with 4 processes per node (b); times for MPI_Alltoall with 4 processes per node and message sizes of 16 B (c) and 256 KB (d).

Fig. 3. Left: time per atom per timestep for the LJ liquid and C\(_{30}\)H\(_{62}\) oil LAMMPS MD models of different sizes on different hardware combinations (the numbers near the symbols show the number of MPI processes used). Right: hardware cost vs. achieved performance for the MEM and RIB benchmarks; the dashed lines show ideal scaling. The Desmos results are compared with the published data of [16]. (Color figure online)

5 Classical MD Benchmarks

Nowadays, the partial use of single precision in MD calculations on consumer-grade GPUs is nothing new. The results of such projects as Folding@Home have confirmed the broad applicability of this approach. Recent developments of optimized MD algorithms include the validation of single-precision solvers (e.g. [31]). In this study, we do not consider questions of accuracy and limit ourselves to benchmarks of computational efficiency.

The first set of benchmark data is shown in Fig. 3. It illustrates the efficiency of plain LAMMPS, LAMMPS with the GPU package (mixed precision) and LAMMPS with the USER-INTEL package running on one compute node for different numbers of atoms in the MD model. The USER-INTEL package provides SIMD-optimized versions of certain interatomic potentials in LAMMPS. As expected, the CPU + GPU pair shows the highest performance for sufficiently large system sizes. The results for a Desmos node are compared with the results for a two-socket node with 14-core Haswell CPUs (the MVS1P5 cluster).

We see that for the simple Lennard-Jones (LJ) liquid benchmark, the GPU version of LAMMPS on one Desmos node gives a time-to-solution about two times longer than SIMD-optimized LAMMPS on two 14-core Haswell CPUs. However, for the liquid C\(_{30}\)H\(_{62}\) oil benchmark, the GPU version of LAMMPS is faster starting from 100 thousand atoms per node. For comparison, we also present the results for the same benchmark on a professional-grade Nvidia Tesla K80.

The work [16] gives very instructive guidelines for achieving the best performance at the minimal price as of 2015. The authors compared different cluster configurations using two biological benchmarks: a membrane channel protein embedded in a lipid bilayer surrounded by water (MEM, \(\sim \)100 k atoms) and a bacterial ribosome in water with ions (RIB, \(\sim \)2 M atoms). The GROMACS package was used for all tests.

We compare the results obtained on the Desmos cluster with the best choice of [16]: nodes consisting of two-socket Xeon E5-2670 v2 CPUs with two 780Ti GPUs, connected via IB QDR. In the right panel of Fig. 3, the hardware cost in Euros is displayed on the Y axis (excluding the cost of the interconnect, as in [16]) and the performance in ns/day is shown on the X axis. The numbers show the number of nodes used.

Desmos (red) shows better strong scaling for the MEM benchmark than the best system configuration of Kutzner et al. (blue) [16]. The major reason is, of course, the newer Pascal GPU architecture. Saturation is reached beyond 20 nodes, which corresponds to a small number of atoms per node, below the GPU efficiency threshold. In the case of the RIB benchmark, Desmos demonstrates ideal scaling even beyond 16 nodes, which shows that performance can be increased further with a larger number of nodes.

We should mention that the scaling could be improved further after the implementation of topology-aware Cartesian MPI communicators in Angara MPI.

Fig. 4. The scaling of the GaAs crystal model in VASP.

6 Benchmarks with the Electronic Structure Code VASP

The main computing power of the Desmos cluster resides in its GPUs and is aimed at classical MD calculations; each node has only one 6-core CPU. Nevertheless, it is methodologically interesting to perform scaling tests for DFT electronic structure calculations. In contrast to classical MD models, DFT calculations depend strongly on the speed of collective all-to-all exchanges. We use VASP, one of the most widely used DFT packages; according to current estimates [32, 33], calculations carried out with VASP consume up to 20% of the total computing time in the world.

A GaAs crystal model consisting of 80 atoms is used for the VASP test [34]. The results are shown in Fig. 4: the time for one iteration of the self-consistent electronic density calculation is plotted as a function of the peak performance. Data for different clusters are presented: the MVS10P and MVS1P5 supercomputers of the Joint Supercomputer Centre of the Russian Academy of Sciences and the Boreasz IBM 775 system of the University of Warsaw. The numbers near the symbols are the numbers of nodes used for each test. The lower of the two points showing the computing time on 32 Desmos nodes corresponds to additional parallelization over k-points.

The results obtained show that the Angara network unites the cluster nodes very effectively. The supercomputers used for comparison have two-socket nodes (the IBM 775 even has four-socket nodes). Nevertheless, in terms of resulting performance, MPI exchanges over the Angara network give the same result as MPI exchanges within shared memory.

7 Conclusions

In this paper, we have described the cost-effective Desmos cluster targeted at MD calculations. The results of this work confirm the high efficiency of commodity GPU hardware for MD simulations. The scaling tests for electronic structure calculations also show the high efficiency of MPI exchanges over the Angara network. The Desmos cluster is the first application of the Angara interconnect in a GPU-based MPP system. The features of the Angara interconnect provided a high level of efficiency for the MPP system considered, and the MPI benchmarks presented confirm that this network is competitive for HPC applications.

The results of this work were obtained using the computational resources of the Peter the Great Saint-Petersburg Polytechnic University Supercomputing Center. The authors acknowledge the Joint Supercomputer Centre of the Russian Academy of Sciences for access to the MVS-10P supercomputer.