Structure-Aware Calculation of Many-Electron Wave Function Overlaps on Multicore Processors

Davidović, Davor; Quintana-Ortí, Enrique S.

doi:10.1007/978-3-030-43229-4_2

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12043))

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics

930 Accesses

Abstract

We introduce a new algorithm that exploits the relationship between the determinants of a sequence of matrices that appear in the calculation of many-electron wave function overlaps, yielding a considerable reduction of the theoretical cost. The resulting enhanced algorithm is embarrassingly parallel and our comparison against the (embarrassingly parallel version of) original algorithm, on a computer node with 40 physical cores, shows acceleration factors which are close to 7 for the largest problems, consistent with the theoretical difference.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Hybrid Parallelization and Performance Optimization of the FLEUR Code: New Possibilities for All-Electron Density Functional Theory

Advances in the Parallelisation of Software for Quantum Chemistry Applications

High-Performance Computing in Solving the Electron Correlation Problem

Keywords

1 Introduction

Many-electron wave function (MEWF) overlaps are extensively used in the nonadiabatic dynamics and have significant importance in photochemical studies. Concretely, the overlap functions provide a straightforward mechanism to record the electronic states along different nuclear geometries. Therefore, they can be leveraged for constructing multi-state, multi-dimensional potential energy surfaces for quantum dynamics [6]. In the context of nonadiabatic dynamics simulations, for example, MEWF overlaps yield an approximation of time-derivative couplings (TDCs) in fewest-switch surface hopping (FSSH) calculations [5, 11]. The main drawback of the approaches based on MEWF overlaps lies in their high computational complexity and poor scaling with the system size. Thus, accelerating the calculation of the overlap functions can significantly augment the dimension of the systems to which the FSSH method can be applied.

In this work, we improve the algorithm presented in [10] for computing the overlaps between excited states using CIS-type wave functions. In particular, we optimize the algorithm denoted there as OL2M –an approach based on the level-2 minors obtained from Laplace’s recursive formula, or in other words, the minors obtained by removing two rows and two columns from the input referent matrix. In rough detail, our optimization targets the part of the OL2M algorithm in which the determinants of all the level-2 minors are computed and stored, introducing a structure-aware variant of the method that reduces the theoretical cost of that part of the algorithm by an order of magnitude via an update of the LU factorization; see, e.g., [4, 8]. This enhancement results in significantly shorter execution times for large problems solved in parallel on a multicore processor.

The rest of the paper is structured as follows. In Sect. 2 we offer a brief introduction to the computations of MEWF overlaps. The columnwise structure-aware algorithm and its parallelization are described in Sect. 3; and the parallel experimental results are presented in Sect. 4. Finally, a few concluding remarks and future research directions close the paper in Sect. 5.

2 Problem Definition

Given two many-electron wave-functions, denoted by $\varPsi _{I}$ and $\varPsi _{J}$, the overlap between these functions is expressed as:

$$\begin{aligned} S_{IJ} = \left\langle \varPsi _{I} | \varPsi _{J} \right\rangle , \end{aligned}$$

(1)

using bra-ket $\left\langle * | * \right\rangle $ notation [2], a common notation used in quantum mechanics to describe quantum spaces. The $S_{IJ}$ is the (I, J) element of the overlap matrix S, $N_{A}$ is the number of states, and the indices $I,J \in \{1,2,\ldots , N_{A}\}$. The $N_{A}$ states are described by CIS-type wave functions and, therefore, they can be expanded using Slater determinants. In the Slater determinants expansion, the electrons are divided into $n_\sigma $ and $m_\sigma $ respectively, representing the number of occupied and virtual orbitals for each spin $\sigma $.

The mathematical-physics problem can be reformulated into a matrix representation as that in Fig. 1. The main computational problem consists in obtaining the determinants of all the matrices constructed such that a row i and column j from the referent matrix $A_{ref}$ are replaced with the contents of row $i_\beta $ and column $j_\alpha $, respectively, from matrices $WF_\beta $ and $WF_\alpha $, for all possible combinations of $i, j, i_{\beta }, j_{\alpha }$.

The total number of possible matrices is $n_\sigma ^2 m_\sigma ^2$ for each spin $\sigma \in \left\{ \alpha , \beta \right\} $. Furthermore, if an LU factorization [4] is applied to compute these determinants, which requires $2/3 n_\sigma ^3$ flops (floating-point operations), this operation becomes the main bottleneck of the entire MEWF, yielding a total cost of $O(n_\sigma ^5 m_\sigma ^2)$ flops.

It is worth to note that, for a fixed row and column (i, j) of $A_{ref}$, numerous rows and columns from $WF_\beta $ and $WF_\alpha $ are to be tested, while all other rows and columns in the referent matrix remain unchanged. Thus, the matrices, for which we have to compute the determinants differ only in one row and one column. In order to decrease the computational cost, the challenge is to reuse the unchanged parts of referent matrix for computing other determinants.

The first approach to exploit this property was presented in [7]. There, the authors reported a significant speedup by exploiting similarities between the consecutive matrices differ in only one row/column. Concretely, the referent matrix $A_{ref}$ was expanded by columns using the Laplace transformation. The determinants of the minors were then stored and reused to compute determinants of matrices which differ only in one column from the baseline determinant. With this strategy, replacing any column for $WF_\alpha $ is simply a linear combination of factors from $j_\alpha $ with pre-computed determinants of the minors.

The work presented in [10] improved the idea in [7] by computing the determinants of the minors obtained from the second-level recursive Laplace expansion. There, the minors were constructed by expanding the referent matrix along both rows and columns; then the LU factorization was repeatedly applied to obtain the determinants; and finally these results were reused to compute the overlap of the corresponding Slater determinants. Hereafter, we will refer to this algorithmic approach as DL2M. Although this strategy yields a considerable reduction in the computational complexity compared with the method in [7], obtaining the determinants of the minors remains the major computational bottleneck.

A related topic is that of computing the LU factorization of a large collection of small matrices. This problem has been tackled, on graphics processors, as part of an effort to develop a batched version of the Basic Linear Algebra Subprograms as well as in the framework of computing a block-Jacobi preconditioner for the iterative solution of sparse linear systems [1, 3]. However, in those applications the matrices are independent and no structure-aware exploitation of the problem is possible.

3 Algorithms for DL2M Calculations

3.1 Original

Consider the matrix $A \in \mathbb {R}^{n \times n}$ representing the referent matrix $A_{ref}$ and assume, for simplicity, that $n=n_{\alpha }=n_{\beta }=m_{\alpha }=m_{\beta }$. The calculation of the overlap between any two MEWFs, described in Sect. 2, requires the computation of the determinants for all possible submatrices of A where any two rows/columns have been eliminated from the matrix. Let us denote by $A_{{r_1,r_2\vert |c_1,c_2}}$ the submatrix that results from eliminating rows $r_1,r_2$ and columns $c_1,c_2$ from A. The straightforward solution to obtain the determinants is to explicitly construct all possible submatrices with $m=n-2$ rows/columns of A, and then compute the LU factorization (with partial pivoting) of each submatrix, as shown in the naive algorithm (NA) in Listing 1.1. For brevity, hereafter we employ pseudo-Matlab notation in the presentation of the algorithms, and we do not consider the effect of the row permutations obtained from the LU factorization on the determinant sign. All our actual realizations of the algorithms include partial pivoting to ensure numerical stability in practice.

The computational cost of the NA realization is, approximately,

$$ \sum _{r_1=1}^n \sum _{r_2=1}^{r_1-1} \sum _{c_1=1}^n \sum _{c_2=1}^{c_1-1} 2n^3/3 \approx n^7/6~\text {flops}. $$

3.2 Columwise Re-utilization

The naive algorithm in Listing 1.1 exposes that, between any two iterations of the two inner loops (that is, those indexed by c1, c2), the matrix that needs to be factorized only differs in two columns. A natural question is thus how to exploit the fact that all other matrix columns remain the same between these two iterations. To illustrate the response, let us define $A_r = A_{{r_1,r_2\vert |-}} \in \mathbb {R}^{m \times n}$ as the submatrix with $m=n-2$ rows and n columns that results from eliminating only rows $r_1,r_2$ from A. Next, consider the LU factorization of this submatrix:

$$\begin{aligned} L^{-1} P A_r ~=~ L^{-1} P \left[ a_1,a_2,\ldots , a_n \right] ~=~ \left[ u_1,u_2,\ldots , u_n \right] ~=~ U, \end{aligned}$$

(2)

where $L \in \mathbb {R}^{m \times m}$ is a unit lower triangular comprising the Gauss transforms that are applied to annihilate the subdiagonal entries of the matrix, P is the $m \times m$ permutation matrix due to the application of partial pivoting during the factorization, and $U \in \mathbb {R}^{m \times n}$ is the resulting upper triangular factor [4]. The answer we are searching for should state the relationship between the factorization of $A_r$ in (2) and that of the submatrix

$$\begin{aligned} A_{{r_1,r_2\vert |c_1,c_2}} = \left[ a_1,a_2,\ldots , a_{c_2-1},a_{c_2+1},\ldots , a_{c_1-1},a_{c_1+1},\ldots , a_n \right] , \end{aligned}$$

(3)

for any two column values $c_1,c_2$. Since a Gauss transform, applied to a matrix from the left, simply performs an independent linear combination of each matrix column, the application of the factors L and P from the LU factorization in (2) to $A_{{r_1,r_2\vert |c_1,c_2}}$ results in:

$$\begin{aligned} L^{-1}P A_{{r_1,r_2\vert |c_1,c_2}} = \left[ u_1,u_2,\ldots , u_{c_2-1},u_{c_2+1},\ldots , u_{c_1-1},u_{c_1+1},\ldots , u_n \right] ; \end{aligned}$$

(4)

which corresponds to the columns of U in (2), except for $u_{c_1}$ and $u_{c_2}$, which have disappeared. The result is thus already upper triangular in the leftmost $c_2-1$ columns, but it contains zeros only below the first and second subdiagonals in columns $\left[ u_{c_2+1},u_{c_2+2},\ldots ,u_{c_1-1}\right] $ and $\left[ u_{c_1+1},\ldots ,u_{n}\right] $, respectively; see the example in Fig. 2.

In consequence, in order to obtain the desired upper triangular matrix, which yields the determinant for $A_{{r_1,r_2\vert |c_1,c_2}}$ we need to apply Gauss transforms (or, alternatively, any type of orthogonal transform [4]) to eliminate the nonzero entries below the main diagonal of this matrix. Let us consider the partitioning of the quasi-upper triangular factor in (4) as follows:

(5)

where $U_{11}$ is upper triangular; and $U_{22}$/$U_{33}$ contain zeros below the first/second subdiagonal. Furthermore, consider $ U_{23} = \left[ \begin{array}{c} U_{T} \\ \hline u_B \end{array} \right] , $ where $u_B$ corresponds to the bottom row of $U_{23}$. The algorithm that exploits the relationship between the matrices factorized in the inner two loops is shown in Listing 1.2. Naturally, the LU factorizations involving the blocks $\left[ U_{22},U_{23}\right] $ and $U_{33}$ leverage the special quasi-upper triangular structure of these submatrices to reduce the cost of this alternative method.

The cost of the columnwise “structure-aware” realization is approximately given by

$$ \sum _{r_1=1}^n \sum _{r_2=1}^{r_1-1} \left( \underbrace{2m^3/3}_{\text {LU of } A_{r}} + \sum _{c_1=1}^n \sum _{c_2=1}^{c_1-1} \left( \underbrace{\sum _{j_1=c_2+1}^{c_1-1} 6(n-j_1)}_{\text {LU of } \left[ U_{22},U_{23}\right] } + \underbrace{\sum _{j_2=c_1+1}^{n-1} 12(n-j_2) }_{\text {LU of } \left[ u_B; U_{33}\right] } \right) \right) \approx n^6 / 2 ~\text {flops}, $$

where the sum for $j_1$ corresponds to the cost of the LU factorization for $\left[ U_{22},~U_{23}\right] $ and the sum for $j_2$ to that for $U_{33}$. Taking into account that $m\approx n$, and neglecting the lower order terms, the cost for this approach is $n^6/2$ flops. Compared with NA, this represents a reduction in the cost of one order of magnitude.

3.3 Parallel Implementation

Both the original NA and the columnwise structure-aware algorithm (CSA), described in the previous subsections, are embarrassingly parallel. An analysis of dependencies and concurrency is, therefore, trivial. From Listing 1.1, we observe that the LU factorizations of the minors (and the accumulation of the diagonal elements of the resulting triangular factors), in the inner-most loop, are completely independent. In other words, each minor can be constructed independently of other minors, and the corresponding calculations can proceed in parallel. Although this approach exhibits a much larger memory footprint, because the minors are explicitly constructed, it accommodates a straight-forward parallelization. In contrast, in real-world use cases, the dimension of the initial matrix is up to $n=200$, and exploiting only the parallelism intrinsic to the operations involved in a single LU factorization will surely exhibit very low performance. Therefore, our approach computes single-threaded (i.e. sequential) LU factorizations, but combines this with a parallelization of the outer loops around these calculations to compute several decompositions concurrently.

For the parallelization on multicore processors, in this work we leverage OpenMP. Concretely, in the DL2M NA case, in Listing 1.1, an OpenMP parallel for pragma is applied to the outer-most loop of the algorithm – that is, the loop over index $r_1$ – which corresponds to the first row to be removed from the initial matrix. The work corresponding to the three inner loops (over $r_2$, $c_1$, and $c_2$) is then executed in parallel for each iteration of the row index $r_1$.

For the CSA variant, in Listing 1.2, the inner-most LU factorizations and the products of diagonal elements of the factors $U_{\{11,2,3\}}$ are independent for each combination of the columns $c_1$ and $c_2$. They require only the U factor obtained from the two outer-most loops (a unique combination of $r_1$ and $r_2$), as in Line 3 of Listing 1.2. In consequence, the parallelization of DL2M with columnwise reuse is done by distributing the iteration space across loop $c_1$; that is over the first column to be removed from the minor (without rows $r_1$ and $r_2$). For this variant, it is also possible to parallelize across the outer-most loop $r_1$ like it was done for L2M NA. That approach can be followed, for example, to parallelize the outer-most loop across multiple nodes, while the parallelization over $c_1$ can be applied at the node level. This multi-level parallel alternative for clusters is part of ongoing work.

4 Experimental Results

In this section we illustrate the gains in performance attained by the DL2M structure-aware algorithm, with columnwise re-utilization, for computing the determinants of the level-2 minors of the referent matrix. For this purpose, the new algorithm, CSA, is compared against the original NA realization, described in Subsect. 3.1. In addition, in the final part of this section, we compare the overall performance of the MEWF algorithms for computing the overlap of the wave functions, using our DL2M algorithm with columnwise re-utilization, with the algorithm OL2M, described in [10].

Set Up. The experiments in this section were obtained on the Juwels cluster from Juelich Supercomputing Center. Each node consists of two Intel Xeon Platinum 8168 processors, running at 2.7 GHz, with 24 cores each and 96 GB of RAM. The code was written in Fortran and C programming languages, compiled with GCC 8.3.0 and linked against the Intel MKL 2019.3.199 (sequential) library. All experiments are run on a single node and employ double precision arithmetic.

Algorithmic Improvements. The first experiment assesses the speed up gains achieved only by introducing algorithmic changes, without any parallelisation strategies. The columnwise reutilization strategy yields a significant performance gain up to $2.5\times $ and $5\times $ compared with the OL2M and NA variants, respectively, even when executed sequentially, as illustrated in Fig. 3. Note that the difference between CSA and other variants increases with the problem size. That is because of much lower computational cost of CSA compared to NA and OL2M variants. As an example, for $n=100$ the flops count for NA is $10^14/6$ while CSA exhibits $10^12/2$ flops, that is approximately $33\times $ less flops for CSA variant. However, in NA and OL2M variants more flops are “fast” flops based on LU and BLAS-3 operations, while in CSA, a part of flops, due to columnwise reutilization strategy, are based on BLAS-1 operations (the update step).

NA vs CSA. This experiment compares the DL2M CSA variant (Listing 1.2) and the DL2M NA implementation (Listing 1.1). The input matrix for this test was generated with random entries and a number of columns/rows ranging from $n=10$ to 180. The lower theoretical cost of the new CSA approach (of an order of magnitude) becomes evident in Fig. 4(left), which shows a reduction of the execution time by approximately one order of magnitude for sufficiently large matrices. When increasing the matrix size though, the speedup compared to NA significantly increases, because of the much lower computational cost of the CSA variant. Figure 4(right) shows that the speed up of CSA (with respect to NA) is consistent, independently of the number of cores in the test, and roughly grows by a factor of up to 7 for the largest test matrices.

We can observe that the CSA algorithm is superior in performance to the naive version for all configurations and matrix sizes, as presented in Fig. 4(right). It can been seen that the speedup is similar, no matter how many cores we used in test configuration, and is increasing up to more than $7 \times $ for larger test matrix sizes.

OL2M vs CSA. In Fig. 5(left) we compare the new DL2M CSA with the OL2M algorithm [10] (considering only the part that computes the determinants of the minors). In OL2M, the complete LU is computed inside the $c_1$ loop, and the accumulated U factors are reused only inside the $c_2$ loop. By computing the LU before the start of the $c_1$ loop, the CSA variant offers a speed up of up to $3.3 \times $ for the largest problems, Fig. 5(right).

Note that the speedup of CSA w.r.t. OL2M is also independent on the number of cores (as in the case of NA) and that is almost the same for different test configurations and larger test matrix sizes.

Scalability. Although CSA re-uses the columns from the U factor for computing determinants of the subsequent minors, the algorithm achieves a good scalability with increasing number of cores, as presented in Fig. 6.

Table 1. Execution times (in seconds) of DL2M CSA and OL2M for Alanine-100 and Alanine-195 on varying #cores.

Full size table

Real Use-Case. The final experiment compares DL2M CSA and the OL2M algorithm on real test cases corresponding to the excited state wave functions of poly-Alanine systems (with 100 and 195 occupied orbitals, i.e. the sizes of the referent matrices) obtained using different basis sets. The results of this test in Table 1 show that DL2M CSA offers higher scalability and achieves a speed up factor above $5 \times $ over DL2M, on 48 cores, which is aligned with the results reported in Fig. 5(right). For the Alanine-100 use-case, by increasing the number of cores over 40 a drop in performance is occured. That is expected since the problem becomes to small for the given number of cores in which the communication overhead, due the increased parallelism, becomes more significant in the total execution time.

5 Conclusion and Future Work

In this work, we leverage the connection between the level-2 minors of the referent matrix appearing in MEWFs to save a considerable part of the computations required to obtain the corresponding determinants. For that purpose, we use an updating technique for the LU factorization, in order to reduce the cost by an order of magnitude. Our tests show that the new approach considerably accelerates the computation of the MEWF overlaps, by a factor of up to 7, in principle allowing the solution of larger problems.

As part of future work, we plan to explore the parallelization of this algorithm using different approaches and/or tools (e.g., to exploit task-level parallelism, or to combine a cluster-level parallelization with a finer grain concurrent execution). In addition, we will investigate how to exploit the connection between the level-2 minors across other dimensions to further reduce the theoretical cost of this type of computations.

The source code is available at [9] and is part of the cto-nto library for computing natural transition orbitals for CIS type wave functions.

References

Anzt, H., Dongarra, J., Flegar, G., Quintana-Ortí, E.S.: Variable-size batched LU for small matrices and its integration into block-Jacobi preconditioning. In: 2017 46th International Conference on Parallel Processing (ICPP), pp. 91–100 (2017). https://doi.org/10.1109/ICPP.2017.18
Dirac, P.A.M.: A new notation for quantum mechanics. Math. Proc. Cambridge Philos. Soc. 35(3), 416–418 (1939). https://doi.org/10.1017/S0305004100021162
Article MathSciNet MATH Google Scholar
Dong, T., Haidar, A., Luszczek, P., Harris, J.A., Tomov, S., Dongarra, J.: Lu factorization of small matrices: accelerating batched DGETRF on the GPU. In: 2014 IEEE International Conference on High Performance Computing and Communications, 2014 IEEE 6th Internatinal Symposium on Cyberspace Safety and Security, 2014 IEEE 11th International Conference on Embedded Software and Systems (HPCC, CSS, ICESS), pp. 157–160, August 2014. https://doi.org/10.1109/HPCC.2014.30
Golub, G.H., Loan, C.F.V.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)
MATH Google Scholar
Hammes-Schiffer, S., Tully, J.C.: Proton transfer in solution: molecular dynamics with quantum transitions. J. Chem. Phys. 101(6), 4657–4667 (1994). https://doi.org/10.1063/1.467455
Article Google Scholar
Li, S.L., Truhlar, D.G., Schmidt, M.W., Gordon, M.S.: Model space diabatization for quantum photochemistry. J. Chem. Phys. 142(6), 064106 (2015)
Article Google Scholar
Plasser, F., Ruckenbauer, M., Mai, S., Oppel, M., Marquetand, P., González, L.: Efficient and flexible computation of many-electron wave function overlaps. J. Chem. Theory Comput. 12(3), 1207–1219 (2016)
Article Google Scholar
Quintana-Ortí, E.S., Van De Geijn, R.A.: Updating an LU factorization with pivoting. ACM Trans. Math. Softw. 35(2), 11:1–11:16 (2008)
Article MathSciNet Google Scholar
Sapunar, M.: Natural transition orbitals for CIS type wave functions. https://github.com/marin-sapunar/cis_nto. Accessed 24 Oct 2019
Sapunar, M., Piteša, T., Davidović, D., Došlić, N.: Highly efficient algorithms for CIS type excited state wave function overlaps. J. Chem. Theory Comput. 15, 3461–3469 (2019)
Article Google Scholar
Tully, J.C.: Molecular dynamics with electronic transitions. J. Chem. Phys. 93(2), 1061–1071 (1990). https://doi.org/10.1063/1.459170
Article Google Scholar

Download references

Acknowledgement

This research was performed under project HPC-EUROPA3 (INFRAIA-2016-1-730897) and supported by Croatian Science Fundation under grant HRZZ IP-2016-06-1142, the Foundation of the Croatian Academy of Science and Arts, and the European Regional Development Fund under grant KK.01.1.1.01.0009 (DATACROSS). Enrique S. Quintana-Ortí was supported by project TIN2017-82972-R of the MINECO and FEDER. The authors gratefully acknowledge the computer resources provided by the Juelich Supercomputing center, and to BSC where the initial testings and the code development were performed.

Author information

Authors and Affiliations

Centre for Informatics and Computing, Ruđer Bošković Institute, Bijenička cesta 54, 10000, Zagreb, Croatia
Davor Davidović
Depto. de Informática de Sistemas y Computadores, Universitat Politècnica de València, 46022, València, Spain
Enrique S. Quintana-Ortí

Authors

Davor Davidović
View author publications
You can also search for this author in PubMed Google Scholar
Enrique S. Quintana-Ortí
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Davor Davidović .

Editor information

Editors and Affiliations

Czestochowa University of Technology, Czestochowa, Poland
Roman Wyrzykowski
University of Southern California, Marina del Rey, CA, USA
Ewa Deelman
University of Tennessee, Knoxville, TN, USA
Jack Dongarra
Czestochowa University of Technology, Czestochowa, Poland
Konrad Karczewski

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Davidović, D., Quintana-Ortí, E.S. (2020). Structure-Aware Calculation of Many-Electron Wave Function Overlaps on Multicore Processors. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science(), vol 12043. Springer, Cham. https://doi.org/10.1007/978-3-030-43229-4_2

Download citation

DOI: https://doi.org/10.1007/978-3-030-43229-4_2
Published: 19 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43228-7
Online ISBN: 978-3-030-43229-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics