1 Introduction

Most of the computational work in Jacobi-Davidson [7], an iterative method for large scale eigenvalue problems, is spent on a so-called correction equation. To reduce wall clock time and local memory requirements for this correction equation, [3, 5] proposed a domain decomposition strategy that was further improved in [4] (Sects. 2 and 3).

Here we investigate practical aspects of the parallel performance of this strategy by means of scaling experiments on supercomputers (Sect. 4). This is of interest for large scale eigenvalue problems that require a massively parallel treatment.

2 Domain Decomposition

In [3, 5] a domain decomposition preconditioning technique for the (approximate) solution of the correction equation was proposed. This technique is based on a nonoverlapping additive Schwarz method with locally optimized coupling parameters by Tan and Borsboom [8, 9] (belonging to the class of optimized Schwarz methods [2]).

Fig. 1 Decomposition in one (left picture) and two dimensions (right picture)

For some partial differential equation (PDE) defined on a domain Ω with appropriate boundary conditions, Ω is covered by a grid \(\hat{\varOmega }\) and the PDE is discretized accordingly, with unknowns defined on the grid points, yielding the linear system

$$\displaystyle{ \mathbf{B}\,\mathbf{y} = \mathbf{d}. }$$
(1)

Now, the domain decomposition technique

  1. Enhances the linear system (1) into \(\mathbf{B}_{C}\,\mathbf{y}_{\approx } = \mathbf{d}_{0}\) with the following structure

    (2)

    in the case of a decomposition into two subdomains (the generalization is straightforward). Here Ω is decomposed into two nonoverlapping subdomains Ω 1 and Ω 2 with interface (or internal boundary) Γ (see Fig. 1). The subdomains are covered by subgrids \(\hat{\varOmega }_{1}\) and \(\hat{\varOmega }_{2}\) with additional grid points located just outside the subdomain near the interface Γ (the open bullets “∘” in Fig. 1), such that no splitting of the original discretized operator (or stencil) has to be made. For B, the labels 1, 2, ℓ, and r refer to operations on data from/to subdomain Ω 1, subdomain Ω 2, and the left and right side of the interface Γ, respectively. For y and d, the labels 1, 2, ℓ, and r refer to data in subdomain Ω 1, subdomain Ω 2, and on the left and right side of the interface Γ, respectively. Subvector \(y_{\ell}\) (\(y_{r}\), respectively) contains those unknowns on the left (right) of Γ that are coupled by the stencil both with unknowns in Ω 1 (Ω 2) and with unknowns on the right (left) of Γ. Subvector \(\tilde{y}_{r}\) (\(\tilde{y}_{\ell}\), respectively) contains the unknowns at the additional grid points of the subgrid for Ω 1 (Ω 2) on the right (left) of Γ. For the unknowns at the additional grid points, additional equations are provided with the requirement that the submatrix (the interface coupling matrix)

    $$\displaystyle{ C \equiv \left [\!\!\begin{array}{cc} C_{\ell\ell} & C_{\ell r} \\ C_{r\ell}&C_{rr}\end{array} \!\!\right ] }$$
    (3)

    is nonsingular: for nonsingular C the solution \(\mathbf{y}_{\approx }\) of (2) is unique, with \(\tilde{y}_{\ell} = y_{\ell}\) and \(\tilde{y}_{r} = y_{r}\), and the restriction of \(\mathbf{y}_{\approx }\) to y is the unique solution of the original linear system (1) ([9, Theorem 1], [8, Theorem 1.2.1]).

  2. Splits the matrix \(\mathbf{B}_{C} = \mathbf{M}_{C} -\mathbf{N}_{C}\) into a part \(\mathbf{M}_{C}\), the boxed parts in (2), that does not map elements from one subgrid to the other subgrid, and a remaining part \(\mathbf{N}_{C}\) that couples the subgrids via the discretized interface with a relatively small number of nonzero elements. (Therefore matrix-vector multiplication with \(\mathbf{B}_{C}\) can be implemented efficiently on distributed memory computers.)

  3. Tunes the interface coupling matrix C defined in (3) such that error components due to domain decomposition are damped in the Richardson iteration

    $$\displaystyle{ \mathbf{y}_{\approx }^{\:(i+1)} = \mathbf{y}_{ \approx }^{\:(i)} + \mathbf{M}_{ C}^{\,-1}\,(\mathbf{d}_{ 0} -\mathbf{B}_{C}\,\mathbf{y}_{\approx }^{\:(i)}). }$$
    (4)

    Note that \(\mathbf{M}_{C}^{-1}\mathbf{B}_{C} = \mathbf{I} -\mathbf{M}_{C}^{-1}\mathbf{N}_{C}\); therefore error components are propagated by \(\mathbf{M}_{C}^{-1}\mathbf{N}_{C}\).

  4. Computes a solution of the enhanced linear system from (4) or with a more general Krylov method like GMRES [6] based on the Krylov subspace \(\mathcal{K}_{m}(\mathbf{M}_{C}^{-1}\,\mathbf{B}_{C},\mathbf{M}_{C}^{-1}\,\mathbf{d}_{0}) \equiv \mbox{ span}(\mathbf{M}_{C}^{-1}\,\mathbf{d}_{0},\mathbf{M}_{C}^{-1}\,\mathbf{B}_{C}\,\mathbf{M}_{C}^{-1}\,\mathbf{d}_{0},\ldots,(\mathbf{M}_{C}^{-1}\,\mathbf{B}_{C})^{m-1}\,\mathbf{M}_{C}^{-1}\,\mathbf{d}_{0})\). (A small numerical sketch of these four steps follows this list.)
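To make these four steps concrete, the following minimal Python sketch (an illustrative construction for a 1-D Laplace problem, not the implementation used in this paper) enhances the linear system with one additional unknown per subdomain next to the interface (step 1), forms the splitting \(\mathbf{B}_{C} = \mathbf{M}_{C} -\mathbf{N}_{C}\) (step 2), picks a simple interface coupling C as a stand-in for the tuning of step 3, and computes the solution with the Richardson iteration (4) (step 4). The block ordering, the parametrization C = [[a, 1], [1, b]], and the particular values of a and b are assumptions for this toy problem only; they satisfy the requirements stated above but are not the tuned coupling derived in [3, 5].

```python
# Minimal 1-D sketch (illustrative, not the authors' code) of the enhancement, the
# splitting B_C = M_C - N_C and the Richardson iteration (4).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, m = 40, 20                       # n interior grid points, m of them in subdomain 1
B = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
d = np.ones(n)

# Enhanced ordering: [ y_1 .. y_l | ~y_r | ~y_l | y_r .. y_2 ], so two extra unknowns.
N = n + 2
il, irt, ilt, ir = m - 1, m, m + 1, m + 2       # positions of y_l, ~y_r, ~y_l, y_r
BC = sp.lil_matrix((N, N))
d0 = np.zeros(N)
for i in range(m):                  # rows of subdomain 1; the stencil at y_l uses ~y_r
    BC[i, i] = 2.0
    if i > 0:
        BC[i, i - 1] = -1.0
    BC[i, i + 1] = -1.0             # for i = m-1 this column is the ghost unknown ~y_r
    d0[i] = d[i]
for i in range(m, n):               # rows of subdomain 2; the stencil at y_r uses ~y_l
    k = i + 2
    BC[k, k] = 2.0
    BC[k, k - 1] = -1.0             # for i = m this column is the ghost unknown ~y_l
    if i < n - 1:
        BC[k, k + 1] = -1.0
    d0[k] = d[i]
# Coupling rows: C [y_l - ~y_l ; ~y_r - y_r] = 0 with C = [[a, 1], [1, b]]; C is
# nonsingular iff a*b != 1, and then the restriction of the enhanced solution solves (1).
# The values below act as an absorbing coupling for this toy problem with exact
# subdomain solves (illustrative choice, not the tuning of [3, 5]).
a, b = -(n - m) / (n - m + 1.0), -m / (m + 1.0)
for j, v in zip([il, irt, ilt, ir], [a, 1.0, -a, -1.0]):
    BC[irt, j] = v
for j, v in zip([il, irt, ilt, ir], [-1.0, -b, 1.0, b]):
    BC[ilt, j] = v

# Splitting B_C = M_C - N_C: N_C holds only the four cross-subgrid interface entries,
# so M_C is block diagonal over the two subgrids (perfectly parallel solves).
NC = sp.lil_matrix((N, N))
NC[irt, ilt], NC[irt, ir] = a, 1.0
NC[ilt, il], NC[ilt, irt] = 1.0, b
BC, NC = BC.tocsc(), NC.tocsc()
MC = BC + NC

# Richardson iteration (4); with the absorbing coupling the error is gone after two steps.
solve_MC = spla.splu(MC).solve
y = np.zeros(N)
for it in range(4):
    r = d0 - BC @ y
    print(it, np.linalg.norm(r))
    y = y + solve_MC(r)

y_ref = spla.spsolve(B, d)          # direct solve of the original system (1)
print(np.linalg.norm(np.concatenate([y[:m], y[ir:]]) - y_ref))   # ~ machine precision
```

With this coupling the residual drops to machine precision after two iterations, illustrating the almost uncoupled subproblems discussed below; any other nonsingular C leaves the solution of the enhanced system unchanged but damps the interface error modes less effectively.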

The key idea is to use the degrees of freedom that we have created by the introduction of additional unknowns near the interface for damping the error components. For this purpose, the spectral properties of \(\mathbf{M}_{C}^{-1}\mathbf{N}_{C}\) for the specific underlying PDE are analyzed. With the results of this analysis, optimal coupling parameters can be estimated, i.e. the interface coupling matrix C defined in (3) can be tuned. In this way error components due to the splitting are damped “as much as possible”; optimal choices result in a coupling that annihilates the outflow from one subdomain to the other: absorbing boundary conditions. This effectively leads to almost uncoupled subproblems on the subdomains. As a consequence, the number of iterations required for convergence is minimal, with minimal communication overhead between subdomains (only in the explicit step with \(\mathbf{N}_{C}\)): an ideal situation for implementation on parallel computers and/or distributed memory.

3 Jacobi-Davidson

For a standard eigenvalue problem Ax = λx, each iteration of Jacobi-Davidson [7]

  1. Extracts an approximate eigenpair (θ, u) ≈ (λ, x) from a search subspace V: construct H ≡ V∗AV, solve Hs = θs, and compute u = Vs.

  2. Corrects the approximate eigenvector u with a correction t ⊥ u that is computed from the correction equation:

    $$\displaystyle{ \mathbf{P}\,\mathbf{B}\,\mathbf{P}\,\mathbf{t} = \mathbf{r}\:\:\:\:\mbox{ where}\:\:\:\:\mathbf{P} \equiv \mathbf{I} -\frac{\mathbf{u}\,\mathbf{u}^{{\ast}}} {\mathbf{u}^{{\ast}}\mathbf{u}},\mathbf{B} \equiv \mathbf{A} -\theta \,\mathbf{I},\:\:\:\:\mbox{ and}\:\:\:\:\mathbf{r} \equiv -\mathbf{B}\,\mathbf{u}. }$$
    (5)
  3. Expands the search subspace with the correction t: \(\mathbf{V}_{\mathit{new}}\,=\,\left [\mathbf{V}\left.\right \vert \mathbf{t}^{\perp }\right ]\) where \(\mathbf{t}^{\perp } \perp \mathbf{V}\). (A small sketch of these three steps follows this list.)
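The following compact, dense Python sketch of these three steps is purely illustrative (the experiments in Sect. 4 use a parallel Fortran 77 code); the test matrix, the target value, and the start vector are assumptions, and the correction equation is solved "exactly" with a least-squares solve, whereas in practice it is solved only approximately with a preconditioned Krylov method as described below.

```python
# Dense sketch of the Jacobi-Davidson cycle (illustrative only, assumed test problem).
import numpy as np

def jacobi_davidson(A, v0, sigma=0.0, tol=1e-9, maxit=50):
    n = A.shape[0]
    V = (v0 / np.linalg.norm(v0)).reshape(n, 1)          # initial search subspace
    theta, u = sigma, V[:, 0]
    for _ in range(maxit):
        # 1. extraction: Rayleigh-Ritz, select the Ritz value closest to the target sigma
        H = V.T @ A @ V
        w, S = np.linalg.eigh(H)
        j = np.argmin(np.abs(w - sigma))
        theta, u = w[j], V @ S[:, j]
        r = A @ u - theta * u                            # = B u, so the RHS of (5) is -r
        if np.linalg.norm(r) < tol:
            break
        # 2. correction: solve P B P t = -B u; the minimum-norm solution satisfies t ⊥ u
        P = np.eye(n) - np.outer(u, u)
        B = A - theta * np.eye(n)
        t = np.linalg.lstsq(P @ B @ P, -r, rcond=None)[0]
        # 3. expansion: orthonormalize t against V and append it to the search subspace
        t -= V @ (V.T @ t)
        V = np.hstack([V, t[:, None] / np.linalg.norm(t)])
    return theta, u

# Usage: eigenvalue of a 1-D Laplacian closest to zero, started (as in Sect. 4) with a
# parabola-shaped vector.
n = 100
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = np.arange(1, n + 1) / (n + 1.0)
theta, u = jacobi_davidson(A, x * (1.0 - x))
print(theta, 2.0 - 2.0 * np.cos(np.pi / (n + 1)))        # the two values agree
```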

The linear system described by the correction equation (5) may be highly indefinite and is given in an unusual manner, so the application of the domain decomposition technique needed further development and special attention.

Similar to the enhancement of the linear system (1) into (2) in Sect. 2, the following components of the correction equation are enhanced: the matrix B ≡ A − θI to \(\mathbf{B}_{C}\), the correction vector t to \(\mathbf{t}_{\approx }\), and the vectors u and r to \(\mathbf{u}_{0}\) and \(\mathbf{r}_{0}\). With these enhancements, a correction \(\mathbf{t}_{\approx } \perp \mathbf{u}_{0}\) is computed from the following enhanced correction equation [3, Sect. 3.3.2]:

$$\displaystyle{ \mathbf{P}\,\mathbf{B}_{C}\,\mathbf{P}\,\mathbf{t}_{\approx } = \mathbf{r}_{0}\quad \mbox{ with}\quad \mathbf{P} \equiv \mathbf{I} -\frac{\mathbf{u}_{0}\mathbf{u}_{0}^{{\ast}}} {\mathbf{u}_{0}^{{\ast}}\mathbf{u}_{0}}. }$$
(6)

The preconditioner \(\mathbf{M}_{C}\) for \(\mathbf{B}_{C}\) is constructed in the same way as in the ordinary linear system case, namely from the boxed parts in (2). However, because of the indefiniteness, for the correction equation the matrices \(\mathbf{B}_{C}\) and \(\mathbf{M}_{C}\) are accompanied by projections. For both left and right preconditioning the projection is as follows:

$$\displaystyle{ \mathbf{P}' \equiv \mathbf{I} -\frac{\mathbf{M}_{C}^{-1}\,\mathbf{u}_{0}\,\mathbf{u}_{0}^{{\ast}}} {\mathbf{u}_{0}^{{\ast}}\,\mathbf{M}_{C}^{-1}\,\mathbf{u}_{0}}. }$$
(7)

In case of left preconditioning (for right preconditioning see [3, Sect. 3.3.3]) we compute approximate solutions to the correction equation from

$$\displaystyle{ \mathbf{P}'\,\mathbf{M}_{C}^{-1}\,\mathbf{B}_{ C}\,\mathbf{P}'\,\mathbf{t}_{\approx } = \mathbf{P}'\,\mathbf{M}_{C}^{-1}\,\mathbf{r}_{ 0}. }$$
(8)
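For a Krylov solver, each application of the operator in (8) consists of a multiplication with \(\mathbf{B}_{C}\), a solve with \(\mathbf{M}_{C}\) (subdomain-wise, hence parallel), and two projections with P′ from (7). The sketch below (real arithmetic; the names BC, solve_MC, u0, and r0 are hypothetical and assumed to be available, for instance from an enhancement like the one sketched in Sect. 2) shows how this can be wrapped for a serial test; it is not the authors' implementation.

```python
# Sketch of the left-preconditioned, projected operator P' M_C^{-1} B_C P' of (8).
import scipy.sparse.linalg as spla

def projected_operator(BC, solve_MC, u0):
    """Return t -> P' M_C^{-1} B_C P' t as a LinearOperator, together with P' itself."""
    z = solve_MC(u0)                     # M_C^{-1} u_0
    gamma = u0 @ z                       # u_0^* M_C^{-1} u_0 (real arithmetic assumed)

    def P_prime(x):                      # P' x = x - (M_C^{-1} u_0) (u_0^* x) / gamma
        return x - z * (u0 @ x) / gamma

    def matvec(t):
        return P_prime(solve_MC(BC @ P_prime(t)))

    n = BC.shape[0]
    return spla.LinearOperator((n, n), matvec=matvec), P_prime

# Usage with a Krylov method, cf. item 4 of Sect. 2:
# op, P_prime = projected_operator(BC, solve_MC, u0)
# t_approx, info = spla.gmres(op, P_prime(solve_MC(r0)))   # approximate solution of (8)
```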

However, there is more to gain. When the correction equation is solved approximately with a preconditioned Krylov method, the Jacobi-Davidson method is an accelerated inexact Newton method that consists of two nested iterative solvers. In the inner loop of Jacobi-Davidson a search subspace for the (approximate) solution of the correction equation is built up by powers of \(\mathbf{M}_{C}^{-1}\,\left (\,\mathbf{A} -\theta \,\mathbf{I}\,\right )\) for fixed θ. In the outer loop a search subspace for the (approximate) solution of the eigenvalue problem is built up by powers of \(\mathbf{M}_{C}^{-1}\,\left (\,\mathbf{A} -\theta \,\mathbf{I}\,\right )\) for variable θ. As θ varies only slightly in succeeding outer iterations, one may take advantage of the nesting by applying the domain decomposition technique to the outer loop, which was the subject of [4]. This effectively led to two different processes:

  • Jacobi-Davidson with enhanced inner loop: enhancement at the intermediate level with the enhanced correction equation (6), and

  • Jacobi-Davidson with enhanced outer loop: enhancement at the highest level with a slightly different correction equation

$$\displaystyle{ \mathbf{P}\,\mathbf{B}_{C}\,\mathbf{P}\,\mathbf{t}_{\approx } = \mathbf{r}_{\approx }\quad \mbox{ with}\quad \mathbf{P} \equiv \mathbf{I} -\frac{\mathbf{u}_{0}\mathbf{u}_{0}^{{\ast}}} {\mathbf{u}_{0}^{{\ast}}\mathbf{u}_{0}}. }$$
(9)

The amount of work per outer iteration is almost the same for both processes. However, Jacobi-Davidson with enhanced outer loop turned out to be faster, as it damps error components remaining from the previous outer iteration in the next one.

4 Scaling Experiments

For the two processes, different eigenvalue problems were considered in [4, Sect. 5.1], including problems with variable coefficients and large jumps in the coefficients. Here, to investigate practical aspects of the parallel performance, we consider the eigenvalue problem for the Laplace operator, as results for different numbers of subdomains show more regular behavior (see for instance Fig. 3 in [4]). Except for the first experiment about different decompositions, in all experiments we take for the domain Ω the unit square, decompose Ω into p square subdomains, and cover each subdomain by a 256×256 subgrid. Jacobi-Davidson is started with the parabola-shaped vector \(x\,(\,1 - x\,)\,y\,(\,1 - y\,)\) for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1 (see also [3, Sect. 3.5.1]) to compute the most global eigenvector (the one whose eigenvalue is closest to zero) of the two-dimensional Laplace operator on Ω, until the residual norm of the approximate eigenpair is less than 10^{-9}. We apply right preconditioning in the enhanced correction equation and solve exactly with the preconditioner (i.e. exact subdomain solves) to enable a Schur complement approach. The preconditioner \(\mathbf{M}_{C}\) is constructed only once, at the first Jacobi-Davidson outer iteration. The remaining linear system is solved with GMRES [6].

The implementation is in Fortran 77 with calls to BLAS, LAPACK, and MPI. Note, however, that the Fortran compiler and the BLAS, LAPACK, and MPI versions differ per machine, which influences the (parallel) performance. Scaling experiments are performed on the following hardware:

  • Curie linux-cluster (nodes with two eight-core 2.7 GHz Intel Xeon E5-2680 processors, InfiniBand QDR, Intel Fortran 12, BLAS/LAPACK from MKL, Bull X MPI),

  • H4+ linux-cluster (nodes with one quad-core 3.4 GHz Intel Core i7-2600 processor, Gigabit Ethernet (1 Gbit/s), Intel Fortran 11, MPICH2),

  • IBM POWER5+ system Huygens (nodes with 16 single-core 1.9 GHz IBM POWER5+ processors, 1.2 GB/s InfiniBand, XL Fortran 10, BLAS from ESSL, MPI from IBM PE),

  • IBM POWER6 system Huygens (nodes with 16 dual-core 4.7 GHz IBM POWER6 processors, 160 GB/s InfiniBand, XL Fortran 12, BLAS from ESSL, MPI from IBM PE),

  • Lisa 2008 linux-cluster (nodes with one Intel Xeon 3.4E GHz core, 800 MB/s InfiniBand, GFortran, MPICH2),

  • Lisa 2012 linux-cluster (nodes with two eight-core 1.8 GHz Intel Xeon E5-2650L processors, Intel Fortran 12, BLAS/LAPACK from MKL, OpenMPI).

On the H4+ and Lisa 2008 linux-clusters one subdomain is assigned to one node. On the other hardware one subdomain is assigned to one core. Results presented here are averages of three measured wall-clock times.

Fig. 2 Residual norm of the approximate eigenpair as a function of the Jacobi-Davidson outer iteration for the different decompositions with GMRES(8) (top) and GMRES(4) (bottom)

Fig. 3 Residual norm of the approximate eigenpair as a function of the wall clock time for the different decompositions. Shown are both the enhanced inner loop and the enhanced outer loop for the Lisa 2012 and H4+ linux-clusters and a fixed number of 8 and 4 inner iterations with GMRES

First we study different decompositions for a fixed number of subdomains and the same (discretized) eigenvalue problem. We keep the overall grid fixed at a size of 1024×1024 grid points and consider configurations with a 1×16, 2×8, 4×4, 8×2, and 16×1 decomposition, respectively (resulting in subgrids of size 1024×64, 512×128, 256×256, 128×512, and 64×1024, respectively). So the number of subdomains is 16 with 65,536 unknowns per subdomain in all configurations, but the subdomains differ in shape. Figure 2 shows the residual norm of the approximate eigenpair as a function of the Jacobi-Davidson outer iteration for the different decompositions. Shown are both the enhanced inner loop and the enhanced outer loop for a fixed number of 8 (top) and 4 (bottom) inner iterations with GMRES. As expected, the convergence histories for configurations that are mirrored (for instance 2×8 and 8×2) coincide. Decomposition in only one direction requires the fewest outer iterations for convergence: for the tuning of the coupling between the subdomains we only took the one-dimensional character of the error modes into account, and for decompositions in two directions error modes have a two-dimensional character and are therefore harder to damp. Figure 3 shows the residual norm of the approximate eigenpair as a function of wall clock time for the different decompositions. Shown are both the enhanced inner loop and the enhanced outer loop for the Lisa 2012 and H4+ linux-clusters and a fixed number of 8 and 4 inner iterations with GMRES. By comparing the mirrored configurations it can be observed that the grid ordering may significantly lower the performance. This occurs mainly in the construction of the preconditioner with LAPACK (the initial horizontal lines in the figure). Although the processors of the H4+ linux-cluster are faster, use of the MKL implementation of LAPACK resulted in a faster construction of the preconditioner on the Lisa 2012 linux-cluster. After the construction of the preconditioner, the process on the H4+ linux-cluster runs faster than on the Lisa 2012 linux-cluster. On the H4+ linux-cluster communication takes place between 16 nodes over a relatively slow network, whereas on the Lisa 2012 linux-cluster communication is fast, inside a 16-core node with shared memory. So we may conclude that the process is dominated by computational work. This confirms the remarks at the end of Sect. 2 about the minimal communication overhead.

Fig. 4 Massively parallel behavior on different hardware

For the massively parallel behavior, we first extend Fig. 6 from [4] with results from (weak) scaling experiments on more recent hardware (IBM POWER6 system Huygens, Curie, and H4+). In Fig. 4 it can be observed that the trend holds, but now with lower wall clock times, as processor speed has increased further for the more recent hardware.

To further investigate the weak scaling we start with a decomposition into 16 subdomains (on 1 node with 16 cores) on the Curie linux-cluster and repeatedly increase the number of subdomains in both directions by a factor of 2. This gives 16, 64, 256, 1,024, 4,096, and 16,384 subdomains (cores), resulting in up to more than 10^{9} unknowns. For an efficient overall method, we now use (see [1, Sect. 4])

$$\displaystyle{ \|\mathbf{r}^{(i)}\|_{ 2} <2^{-j}\,\|\mathbf{r}^{(0)}\|_{ 2} }$$
(10)

as a stopping criterion for the inner iterations (GMRES) at the j-th Jacobi-Davidson outer iteration. Here r^{(0)} is the residual at the start of the inner iterations and r^{(i)} the residual at the i-th inner iteration. Figure 5 shows the results for Jacobi-Davidson with enhanced outer loop. Note that in this figure we choose a quadratic scaling of the x-axis to give a better impression. The figure indicates that for a large number of subdomains the wall clock time doubles when the number of subdomains is increased by a factor of 2 in both directions. This can be explained by the local behavior of the error modes due to the domain decomposition: they are mainly one-dimensional near the interfaces. The additional work needed to damp these error modes effectively depends on this local behavior.
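In code, criterion (10) is a relative test on the inner residual that is tightened by a factor of two with every outer iteration; a minimal sketch with assumed names (not the code used for the experiments):

```python
# Stopping test (10) for the inner GMRES iterations at the j-th outer iteration.
import numpy as np

def inner_converged(r_inner, r_start, j):
    """True once ||r^(i)||_2 < 2^(-j) ||r^(0)||_2."""
    return np.linalg.norm(r_inner) < 2.0 ** (-j) * np.linalg.norm(r_start)
```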

Fig. 5 Massively parallel behavior on the Curie linux-cluster (quadratic scaling of the x-axis)