
1 Introduction

In 2012, the Intel Many Integrated Core (MIC) architecture [1] was introduced as an answer to the mounting challenges of building scalable and efficient high-performance computing systems. To achieve energy-efficient computation, Intel MIC provides more than 60 computational cores, each capable of executing 512-bit SIMD vector instructions. This new hardware requires new levels of parallelization and vectorization from application software to achieve efficient performance.

Quantum chemistry algorithms have been adapted to parallel hardware for decades. However, most popular codes do not demonstrate good performance efficiency on the Intel MIC hardware platform "as is". In most cases the code is not vectorized, and the required level of thread parallelism is not achieved. For example, the GAMESS(US) package [2, 3] has been parallelized over decades, but its code lacks vectorization and sufficient thread-level parallelism in the important parts of the algorithm, even for the widely used Hartree-Fock and Density Functional Theory calculations. The Intel Xeon Phi 5120D requires as many as 240 threads to achieve the best performance in many algorithms [4]. Attempts to run a few hundred GAMESS processes instead of more lightweight threads overwhelm the memory subsystem and dramatically decrease performance.

Intel MIC set a new performance-per-watt level for x86-compatible systems. While MIC is available as Intel Xeon Phi PCI Express co-processor cards, it supports a “native” mode of application execution, in which each Xeon Phi appears to the application as an independent manycore machine. The next generation of Intel MIC technology – Intel Knights Landing [5] – will be a self-sufficient, bootable manycore system. The already existing RSC PetaStream architecture [6] leverages efficient co-processor-to-co-processor communication, providing a realistic model of future Intel KNL supercomputers, where the “native” mode of parallelization is the most natural and effective. Each node runs its own Linux-based OS image, and a Linux OS also runs on the host. The majority of PetaStream computational power comes from the Xeon Phi chips, so it makes sense to run the application on the Intel Xeon Phi cards and use the host CPU for support and service functions; the application then runs on a uniform field of Xeon Phi nodes – at least one MPI rank per node – compatible with the “native” mode for Xeon Phi. If offload-like work sharing is efficient for an application, it is also possible to harness both CPU and MIC nodes. The RSC PetaStream system uses Intel Node Manager Technology to control and monitor the power consumption of every node; this mechanism can be used to implement flexible power and energy optimization strategies that help HPC sites save power and reduce operational costs. An example of a supercomputing system where both types of nodes co-exist in the same fabric is the St. Petersburg Polytechnic University supercomputing center, where over 800 nodes based on Intel Xeon E5-2697 v3 (Haswell) share an InfiniBand FDR fabric with 256 nodes based on Intel Xeon Phi 5120D.

The most common approaches in quantum chemistry are the Hartree-Fock method (HF) and density functional theory (DFT). The major steps of these methods are the construction and diagonalization of the Fock (Kohn-Sham) matrix [7]. For systems of practical interest, the required computational power is usually at supercomputer scale. The computational effort of the first step is dominated by the calculation of two-electron integrals corresponding to the Coulomb repulsion of electron pairs (therefore frequently called electron repulsion integrals, ERIs) and, in the case of DFT, also by the numerical quadrature of the exchange-correlation contribution to the energy. The two-electron integral calculation has theoretical O(N^4) computational complexity, where N is the number of basis functions used to describe the system. However, many of these integrals are small enough to be neglected. Using cutoffs and some approximations, especially for very large systems, the number of operations can be reduced to O(N^2)–O(N^3). In that case the speed of the Fock (Kohn-Sham) matrix diagonalization, which scales as O(N^3), significantly affects the performance of the HF (DFT) method. However, for the majority of practically important molecular systems the construction of the Fock (Kohn-Sham) matrix dominates the overall computational cost. Moreover, matrix diagonalization is pure linear algebra with good scalability, so an efficient two-electron integral code is crucial to achieving performance in the HF and DFT methods. We therefore targeted the Fock matrix two-electron contribution code to demonstrate the applicability of the Intel MIC platform to classical quantum chemistry problems.
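As a rough illustration of these scales, the following sketch counts the symmetry-unique ERIs for basis sizes similar to those used in our benchmarks (Sect. 3.1); the counts are purely combinatorial and ignore screening.

```cpp
#include <cstdio>

// Purely combinatorial count of symmetry-unique ERIs (i,j|k,l):
// unique pairs (i >= j), then unique pairs of pairs.  Screening, which
// brings the effective scaling below O(N^4), is deliberately ignored here.
int main() {
    for (long long n : {540LL, 900LL}) {                // basis sizes of the Sect. 3.1 benchmarks
        long long pairs    = n * (n + 1) / 2;           // unique (i,j)
        long long quartets = pairs * (pairs + 1) / 2;   // unique ((i,j),(k,l))
        std::printf("N = %lld: %lld unique ERIs (~%.2e)\n",
                    n, quartets, static_cast<double>(quartets));
    }
    return 0;
}
```

Even with the eightfold permutation symmetry, N = 900 yields roughly 8 × 10^10 unique integrals before screening, which motivates both integral screening and recomputation ("direct") strategies.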

The goal of the present work is to enable migration of the GAMESS(US) quantum chemistry code [2, 3] to the novel Intel MIC hardware technology. GAMESS is widely used by the scientific community, with thousands of citations each year. We intend to minimize code modification and to optimize for the future-proof “native” mode of Intel Xeon Phi.

2 Basics of the Hartree-Fock Method

2.1 Electron Repulsion Integrals (ERIs)

ERIs are integrals of the form:

$$ I_{ijkl} = (i,j|k,l) = \iint \frac{\chi_i(\mathbf{r}_1)\,\chi_j(\mathbf{r}_1)\,\chi_k(\mathbf{r}_2)\,\chi_l(\mathbf{r}_2)}{\left| \mathbf{r}_1 - \mathbf{r}_2 \right|}\, \mathrm{d}\mathbf{r}_1\, \mathrm{d}\mathbf{r}_2 $$
(1)

where χ denotes basis functions; i, j, k, l are their indices; r1 and r2 are the coordinates of the first and second electrons. An important property of ERIs is their eightfold permutation symmetry with respect to the i, j, k, l indices. Commonly, Cartesian Gaussians are used as basis functions:

$$ \chi(\mathbf{r}) = (x - A_x)^{a_x}\,(y - A_y)^{a_y}\,(z - A_z)^{a_z}\, e^{-\alpha (\mathbf{r} - \mathbf{A})^2} $$
(2)

where A and α are the center and exponent of the basis function, respectively, and a = (ax, ay, az) is its angular momentum. Gaussians have the practically important property that a product of two Gaussians is another Gaussian (see [8] for the equation). A Gaussian of form (2) is also called a “primitive”. Typically, linear combinations of Gaussian primitives that share the same center and angular momentum (“contracted” functions) are used as basis functions. A contracted ERI is a sum of integrals over its primitives:

$$ (i,j|k,l) = \sum_{a}^{M} \sum_{b}^{N} \sum_{c}^{O} \sum_{d}^{P} C_{ai}\, C_{bj}\, C_{ck}\, C_{dl}\, (ab|cd) $$
(3)

where C is the matrix of contraction coefficients and M, N, O, P are the degrees of contraction. A set of (possibly contracted) basis functions that share the same center and the same set of exponents is termed a “shell”. Grouping basis functions into shells reduces to some extent the number of expensive floating-point operations and improves the efficiency of integral screening. Primitive integrals are calculated numerically. Among the most popular approaches are the McMurchie-Davidson [9], Obara-Saika [10], and Dupuis-Rys-King (DRK) [11] schemes. The effectiveness of the different schemes varies greatly between integral types, so quantum chemistry codes often implement several algorithms and switch between them to improve performance. In this study we used only the DRK integral scheme due to its numerical stability, relative simplicity, and uniformity across different kinds of integrals.
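As an illustration of Eq. (3) – a minimal sketch, not the actual GAMESS routine – a contracted ERI can be assembled from primitive integrals as follows; the primitive_eri callback stands in for whichever primitive scheme (DRK in this study) evaluates (ab|cd):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Contracted ERI (i,j|k,l) as in Eq. (3): a quadruple sum over the primitives
// of the four contracted functions, weighted by the contraction coefficients.
// primitive_eri(a,b,c,d) is a placeholder for the primitive-integral scheme.
double contracted_eri(const std::vector<double>& Ci, const std::vector<double>& Cj,
                      const std::vector<double>& Ck, const std::vector<double>& Cl,
                      const std::function<double(int, int, int, int)>& primitive_eri)
{
    double eri = 0.0;
    for (std::size_t a = 0; a < Ci.size(); ++a)
        for (std::size_t b = 0; b < Cj.size(); ++b)
            for (std::size_t c = 0; c < Ck.size(); ++c)
                for (std::size_t d = 0; d < Cl.size(); ++d)
                    eri += Ci[a] * Cj[b] * Ck[c] * Cl[d]
                         * primitive_eri(static_cast<int>(a), static_cast<int>(b),
                                         static_cast<int>(c), static_cast<int>(d));
    return eri;
}
```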

2.2 The Hartree-Fock Algorithm

The Hartree-Fock method finds an approximate wavefunction and energy of the model system. It is based on the generalized eigenvalue problem:

$$ \mathbf{F}\mathbf{C} = \boldsymbol{\epsilon}\, \mathbf{S}\mathbf{C}, $$
(4)

where F is the Fock matrix, S the overlap matrix, C the matrix of molecular orbital coefficients, and ϵ the diagonal matrix of orbital energies. Since F depends on C, Eq. (4) has to be solved self-consistently. The matrix F incorporates contributions from the electron-electron (Vee) and electron-nuclei (Ven) electrostatic interactions as well as the kinetic energy of the electrons (Te). It is usually represented as a sum of the one-electron Hamiltonian (h), Coulomb (J), and exchange (K) matrices:

$$ {\mathbf{F}} = {\mathbf{h}} + {\mathbf{J}} - \frac{1}{2}{\mathbf{K}} $$
(5)
$$ \mathbf{h} = \mathbf{V}_{\mathrm{en}} + \mathbf{T}_{\mathrm{e}}; \qquad J_{ij} = \sum\nolimits_{kl} D_{kl}\, I_{ijkl}; \qquad K_{ij} = \sum\nolimits_{kl} D_{kl}\, I_{ikjl} $$
(6)

where D is the density matrix, which is calculated from the molecular orbital coefficients. The matrix h depends on one-electron integrals, and its computation scales quadratically with system size. The Fock matrix construction requires the calculation of all symmetry-unique ERIs and has theoretical O(N^4) complexity. It is worth noting that numerous ERIs are very small and their contribution to the Fock matrix is negligible; they can be avoided by applying screening techniques. This vastly reduces the number of ERIs that must be calculated, down to O(N^2)–O(N^3), depending on the geometrical size of the molecular system and the nature of the atomic basis set used.
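The index algebra of Eq. (6) is summarized by the following naive sketch, which assumes the full ERI tensor is available in memory; production codes such as GAMESS never materialize this tensor and instead loop over symmetry-unique, screened shell quartets, but the mapping from integrals and density elements to J and K is the same:

```cpp
#include <cstddef>
#include <vector>

// Naive build of the Coulomb (J) and exchange (K) matrices from Eq. (6),
// given a full ERI tensor I (flat, n^4 entries) and the density matrix D
// (flat, n^2 entries).  For illustration of the index pattern only.
void build_JK(int n, const std::vector<double>& I, const std::vector<double>& D,
              std::vector<double>& J, std::vector<double>& K)
{
    auto eri = [&](int i, int j, int k, int l) {
        return I[((static_cast<std::size_t>(i) * n + j) * n + k) * n + l];
    };
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double Jij = 0.0, Kij = 0.0;
            for (int k = 0; k < n; ++k)
                for (int l = 0; l < n; ++l) {
                    Jij += D[k * n + l] * eri(i, j, k, l);   // J_ij = sum_kl D_kl (ij|kl)
                    Kij += D[k * n + l] * eri(i, k, j, l);   // K_ij = sum_kl D_kl (ik|jl)
                }
            J[i * n + j] = Jij;
            K[i * n + j] = Kij;
        }
}
```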

Different schemes have been proposed to calculate the Fock matrix. The conventional algorithm requires all ERIs to be calculated once and stored on disk. However, it is not very efficient for large systems because of the large amount of disk space required for integral storage and the relatively slow speed of disk operations. The advantage of this method is that each ERI is calculated only once. In the alternative approach (“direct” HF), the integrals are recalculated every time they are needed.

3 Implementation of the Hartree-Fock Method in GAMESS

The algorithm of the direct HF method implemented in GAMESS is presented in Fig. 1. The implementation of the main loop over shells corresponds to the so-called “triple-sort” order [12], in which up to three symmetry-unique integrals are calculated at each cycle step. The alternative is the canonical order with a slightly different index ordering, in which only one integral is calculated at each cycle step. The disadvantage of the triple-sort order is its coarser granularity, which may be important on highly parallel systems.

Fig. 1. Simplified algorithm of the Hartree-Fock implementation in GAMESS. NSH is the number of shells; NSH ≤ 1000 for typical workloads.
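The canonical loop ordering discussed above can be sketched as follows (loop structure only; screening, load balancing, and the ERI evaluation itself are omitted):

```cpp
// Canonical ordering of shell quadruples: each symmetry-unique quartet
// (ish >= jsh, ksh >= lsh, (ish,jsh) >= (ksh,lsh)) is visited exactly once,
// i.e. exactly one integral block per loop-nest step.  The triple-sort
// ordering instead visits up to three unique quartets per step.
long long visit_shell_quartets(int nsh) {
    long long nquartets = 0;
    for (int ish = 0; ish < nsh; ++ish)
        for (int jsh = 0; jsh <= ish; ++jsh)
            for (int ksh = 0; ksh <= ish; ++ksh) {
                const int lmax = (ksh == ish) ? jsh : ksh;   // enforces (ij) >= (kl)
                for (int lsh = 0; lsh <= lmax; ++lsh) {
                    // compute the ERIs of (ish,jsh|ksh,lsh) and digest them (elided)
                    ++nquartets;
                }
            }
    return nquartets;   // equals npair*(npair+1)/2 with npair = nsh*(nsh+1)/2
}
```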

GAMESS uses MPI parallelization to split the workload during ERI calculation. The work is split over the ish and jsh loops, with both static and dynamic load balancing implemented. The main drawback of this implementation is a huge memory footprint on multicore architectures, because each MPI rank holds its own copy of the density matrix and a partial contribution to the Fock matrix, both of which scale quadratically with job size. A straightforward OpenMP implementation inherits this drawback as well; however, the density matrix is read-only during the ERI computation cycle and can be shared between threads. The Fock matrix is constantly updated in this cycle, and in the simplest case it is replicated. This is not a big problem when a large amount of memory is available. A replicated-memory MPI/OpenMP version of GAMESS was previously reported to work on Cray XT5 and later on the K computer [13]. In this algorithm each thread has its own copy of the Fock matrix. Even in this case the amount of required memory is reduced by up to a factor of two compared to the original GAMESS implementation. Co-processors like MIC have a large number of cores and a limited amount of on-chip memory, so the maximum job size is limited by the amount of available memory. A possible solution to this problem is to use distributed-memory libraries like Global Arrays [14] or DDI [15]. This approach makes calculations possible even for extremely large jobs, when none of these matrices fits in a single node's memory, at the expense of some internode communication overhead. Distributed-memory algorithms are based on the fact that at any given moment only a small amount of data from the density and Fock matrices is required for the computation; in fact, only three rows of the Fock matrix are updated in the innermost loop of the ERI calculation cycle. The drawback of this implementation is the increased interprocess communication, which may be quite expensive at runtime. In this study we focus on the straightforward (replicated-memory) variant.
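The replicated-memory OpenMP variant used in this work can be outlined as follows (a simplified sketch rather than the actual GAMESS source): the density matrix is shared between threads, each thread accumulates into a private Fock copy, and the private copies are merged at the end of the parallel region.

```cpp
#include <omp.h>
#include <vector>

// Two-electron Fock contribution with a shared, read-only density matrix D and
// one private Fock accumulator per thread, merged into the shared F at the end.
// The body of the quartet computation is elided; in GAMESS it evaluates the
// shell-quartet ERIs and scatters them into the Fock matrix using D.
void fock_two_electron(int nsh, int n2,                 // n2 = N*N (matrix size)
                       const std::vector<double>& D,    // shared between threads
                       std::vector<double>& F)          // shared result, zeroed by caller
{
    #pragma omp parallel
    {
        std::vector<double> Fpriv(n2, 0.0);             // per-thread Fock copy
        #pragma omp for schedule(dynamic)               // quartet costs are very uneven
        for (int ish = 0; ish < nsh; ++ish)
            for (int jsh = 0; jsh <= ish; ++jsh)
                for (int ksh = 0; ksh <= ish; ++ksh)
                    for (int lsh = 0; lsh <= ((ksh == ish) ? jsh : ksh); ++lsh) {
                        // evaluate the ERIs of (ish,jsh|ksh,lsh) and accumulate
                        // their contributions into Fpriv using D (elided)
                    }
        #pragma omp critical                            // merge private copies into F
        for (int p = 0; p < n2; ++p) F[p] += Fpriv[p];
    }
}
```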

First we tried both the triple-sort and the canonical integral ordering. They show nearly identical performance; however, the canonical order is slightly faster on medium-size problems due to its finer granularity. In the following we always used the canonical order of shells in the two-electron integral computational loop. It also has the advantage of a rectangular structure of the second and third loops in the nest, which can be used to improve load balance between threads.

The straightforward OpenMP version of the GAMESS Fock matrix two-electron contribution shows quite good performance on Xeon Phi (see Tables 1, 2, and 3). This implementation still has a considerable memory footprint (the Fock matrix is local to each thread), but it is two times lower than for the pure MPI implementation because the density matrix is now shared. We observe nearly perfect parallelization when up to 60 threads per MIC are used. A further increase of the number of threads per MIC improves performance only slightly. The same effect is observed on the Xeon E5 CPU when more than 8 cores per socket are used.

Table 1. Performance of the OpenMP parallelized Fock matrix two-electron contribution code for the C60 (6-31G) benchmark (KMP_AFFINITY = balanced).
Table 2. Thread affinity impact on the performance of the OpenMP parallelized Fock matrix two-electron contribution code for the C60 (6-31G) benchmark.
Table 3. Performance of the hybrid MPI/OpenMP parallelized Fock matrix two-electron contribution code on multiple Xeon Phi modules.

One of the reasons for this effect is poor cache utilization when multiple threads are tied to one physical core. Indeed, the implementation of the DRK algorithm for ERI calculation in GAMESS operates on large arrays of data (about the size of the L2 cache) with a nontrivial access pattern. The sizes of these arrays are set at compile time and depend on the maximum possible angular momentum of the basis functions. The scalability of the code notably improves if we manually decrease the maximum angular momentum the code can handle from L = 7 (the GAMESS default) to L = 4; at the same time, the per-core performance changes only slightly. Another reason for the scalability degradation is the poor vectorization of the ERI code in GAMESS.
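The effect of the compile-time maximum angular momentum on working-set size can be estimated from the number of Cartesian components per shell, (L+1)(L+2)/2. The short sketch below (our own estimate, not measured GAMESS buffer sizes) shows that a single shell-quartet block grows from well under 1 MiB at L = 4 to over 10 MiB at L = 7, far beyond the 512 KiB L2 cache of a Xeon Phi core.

```cpp
#include <cstdio>

// Cartesian components per shell of angular momentum L and the resulting
// maximum shell-quartet block size; illustrates why compile-time buffers
// dimensioned for L = 7 are much larger than those needed for L = 4.
int main() {
    for (int L : {4, 7}) {
        long long ncart = static_cast<long long>(L + 1) * (L + 2) / 2;  // 15 for L=4, 36 for L=7
        long long block = ncart * ncart * ncart * ncart;                // components in one quartet
        std::printf("L = %d: %lld components/shell, %lld doubles (~%.1f MiB) per quartet block\n",
                    L, ncart, block, block * 8.0 / (1024.0 * 1024.0));
    }
    return 0;
}
```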

It is worth noting that the scalability of the code is unaltered if we consider benchmarks with similar thread/core affinity (Table 2). Therefore, further improvement of single-core performance would increase the overall performance as well.

The code in Fig. 2 can also be straightforwardly parallelized over the top loop of the nest across MPI processes. The performance of the hybrid MPI/OpenMP version is presented in Table 3. The heaviest MPI communication task is the Fock matrix reduction, which is performed only once per HF iteration. We observe quite a small (~1 % of execution time) synchronization and communication overhead in the multi-MIC runs.

Fig. 2. Algorithm of the OpenMP parallelization of the calculation of the two-electron contribution to the Fock matrix in GAMESS.
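The hybrid layer can be outlined as follows (a sketch under the simplifying assumption of a static round-robin distribution of the outermost shell loop; the actual GAMESS implementation also supports dynamic load balancing): each MPI rank accumulates a partial Fock matrix with the OpenMP scheme above, and a single MPI_Allreduce per HF iteration combines the partial results.

```cpp
#include <mpi.h>
#include <vector>

// Hybrid MPI/OpenMP outline: each rank handles a subset of the outermost
// shell-loop iterations (OpenMP threading inside, as in Fig. 2) and the
// partial Fock matrices are summed with one MPI_Allreduce per HF iteration.
void fock_two_electron_hybrid(int nsh, int n2, const std::vector<double>& D,
                              std::vector<double>& F, MPI_Comm comm)
{
    int rank = 0, nranks = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);

    std::vector<double> Fpart(n2, 0.0);             // this rank's partial Fock matrix
    for (int ish = rank; ish < nsh; ish += nranks)  // simple static round-robin distribution
    {
        // OpenMP-parallel work over (jsh, ksh, lsh) for this ish,
        // accumulating into Fpart using the shared density matrix D (elided)
    }
    // The only heavy communication step: one global reduction per iteration.
    MPI_Allreduce(Fpart.data(), F.data(), n2, MPI_DOUBLE, MPI_SUM, comm);
}
```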

3.1 Details of Benchmarks

As benchmark systems we used the fullerene (C60) molecule with two basis sets (6-31G and 6-31G(d)). The basis set sizes for these systems are 540 and 900 functions, respectively. The Xeon Phi benchmarks were conducted on the RSC PetaStream platform. The MIC results were compared to those of the RSC Tornado platform based on a dual-socket Xeon E5-2690 server. The configurations of the test systems are summarized in Table 4.

Table 4. Configurations of the test systems

4 Related Work

GAMESS [2, 3] is one of the most widely used software packages for quantum chemistry calculations. The existing parallelization in GAMESS is sophisticated [16]; it has dynamic load balancing and distributed shared-memory features.

Advances in GPU technology [17–19] created an opportunity to take advantage of this new hardware. The NWChem code was initially rewritten for GPUs [17] using CUDA technology, and its implementation on Xeon Phi [20] uses offload mode to harness the Xeon Phi computational power. In this paper, the implementation uses the Xeon Phi native mode for a better fit to next-generation architectures, and performance is demonstrated on multiple Xeon Phi devices. The existing GAMESS adaptation to GPUs does not affect the algorithms most widely used by computational chemists and is limited to a PCM model implementation. In the more general context, only profiling work has been reported [20]. In this respect, this paper constitutes an important contribution to the development of software tools used by practicing researchers.

5 Conclusions

In this paper we present the design of a parallelization scheme for the GAMESS(US) quantum chemistry code, namely for the Hartree-Fock and Density Functional Theory (DFT) algorithms. The current work demonstrates the applicability of Xeon Phi co-processors to quantum chemistry problems. We demonstrate the scalability of the current implementation across the cores of a single Xeon Phi as well as across multiple Xeon Phi chips running in native mode (OpenMP+MPI parallelization). Future work includes a more thorough performance characterization and additional vectorization of the ERI calculation.