1 Introduction

Most of the computational work in Jacobi-Davidson [7], an iterative method for large scale eigenvalue problems, is spent on a so-called correction equation. To reduce wall clock time and local memory requirements for this correction equation, [3, 5] proposed a domain decomposition strategy that was further improved in [4] (Sects. 2 and 3).

Here we investigate practical aspects of the parallel performance of this strategy by means of scaling experiments on supercomputers (Sect. 4). This is of interest for large scale eigenvalue problems that require a massively parallel treatment.

2 Domain Decomposition

In [3, 5] a domain decomposition preconditioning technique for the (approximate) solution of the correction equation was proposed. This technique is based on a nonoverlapping additive Schwarz method with locally optimized coupling parameters by Tan and Borsboom [8, 9] (belonging to the class of optimized Schwarz methods [2]).

Fig. 1 Decomposition in one (left picture) and two dimensions (right picture)

For some partial differential equation (PDE) defined on a domain Ω with appropriate boundary conditions, Ω is covered by a grid \(\hat{\varOmega }\) and the PDE is discretized accordingly, with unknowns defined on the grid points, yielding the linear system

$$\displaystyle{ \mathbf{B}\,\mathbf{y} = \mathbf{d}. }$$
(1)

Now, the domain decomposition technique

  1. Enhances the linear system (1) into \(\mathbf{B}_{C}\,\mathbf{y}_{\approx } = \mathbf{d}_{0}\) with the following structure

    (2)

    in the case of a decomposition into two subdomains (the generalization is straightforward). Here Ω is decomposed into two nonoverlapping subdomains Ω 1 and Ω 2 with interface (or internal boundary) Γ (see Fig. 1). The subdomains are covered by subgrids \(\hat{\varOmega }_{1}\) and \(\hat{\varOmega }_{2}\) with additional grid points located just outside the subdomain near the interface Γ (the open bullets “∘” in Fig. 1), such that no splitting of the original discretized operator (or stencil) has to be made. For B, the labels 1, 2, ℓ, and r refer to operations on data from/to subdomain Ω 1, subdomain Ω 2, and the left and right side of the interface Γ, respectively. For y and d, the labels 1, 2, ℓ, and r refer to data in subdomain Ω 1, subdomain Ω 2, and on the left and right side of the interface Γ, respectively. Subvector \(y_{\ell}\) (\(y_{r}\), respectively) contains those unknowns on the left (right) of Γ that are coupled by the stencil both with unknowns in Ω 1 (Ω 2) and with unknowns on the right (left) of Γ. Subvector \(\tilde{y}_{r}\) (\(\tilde{y}_{\ell}\), respectively) contains the unknowns at the additional grid points of the subgrid for Ω 1 (Ω 2) on the right (left) of Γ. For the unknowns at the additional grid points, additional equations are provided with the requirement that the submatrix (the interface coupling matrix)

    $$\displaystyle{ C \equiv \left [\!\!\begin{array}{cc} C_{\ell\ell} & C_{\ell r} \\ C_{r\ell}&C_{rr}\end{array} \!\!\right ] }$$
    (3)

    is nonsingular: for nonsingular C the solution \(\mathbf{y}_{\approx }\) of (2) is unique, with \(\tilde{y}_{\ell} = y_{\ell}\) and \(\tilde{y}_{r} = y_{r}\), and the restriction of \(\mathbf{y}_{\approx }\) to y is the unique solution of the original linear system (1) ([9, Theorem 1], [8, Theorem 1.2.1]).

  2. Splits the matrix \(\mathbf{B}_{C} = \mathbf{M}_{C} -\mathbf{N}_{C}\) into a part \(\mathbf{M}_{C}\), the boxed parts in (2), that does not map elements from one subgrid to the other subgrid, and a remaining part \(\mathbf{N}_{C}\) that couples the subgrids via the discretized interface with a relatively small number of nonzero elements. (Therefore matrix-vector multiplication with \(\mathbf{B}_{C}\) can be implemented efficiently on distributed memory computers.)

  3. Tunes the interface coupling matrix C defined in (3) such that error components due to domain decomposition are damped in the Richardson iteration

    $$\displaystyle{ \mathbf{y}_{\approx }^{\:(i+1)} = \mathbf{y}_{ \approx }^{\:(i)} + \mathbf{M}_{ C}^{\,-1}\,(\mathbf{d}_{ 0} -\mathbf{B}_{C}\,\mathbf{y}_{\approx }^{\:(i)}). }$$
    (4)

    Note that \(\mathbf{M}_{C}^{-1}\mathbf{B}_{C} = \mathbf{I} -\mathbf{M}_{C}^{-1}\mathbf{N}_{C}\); therefore error components are propagated by \(\mathbf{M}_{C}^{-1}\mathbf{N}_{C}\).

  4. Computes a solution of the enhanced linear system from (4) or with a more general Krylov method like GMRES [6] based on the Krylov subspace \(\mathcal{K}_{m}(\mathbf{M}_{C}^{-1}\,\mathbf{B}_{C},\mathbf{M}_{C}^{-1}\,\mathbf{d}_{0}) \equiv \mbox{ span}(\mathbf{M}_{C}^{-1}\,\mathbf{d}_{0},\mathbf{M}_{C}^{-1}\,\mathbf{B}_{C}\,\mathbf{M}_{C}^{-1}\,\mathbf{d}_{0},\ldots,(\mathbf{M}_{C}^{-1}\,\mathbf{B}_{C})^{m-1}\,\mathbf{M}_{C}^{-1}\,\mathbf{d}_{0})\). (A small numerical sketch of these four steps follows this list.)
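To make these four steps concrete, the following minimal Python sketch (an illustrative construction for a 1-D Laplace problem, not the implementation used in this paper) enhances the linear system with one additional unknown per subdomain next to the interface (step 1), forms the splitting \(\mathbf{B}_{C} = \mathbf{M}_{C} -\mathbf{N}_{C}\) (step 2), picks a simple interface coupling C as a stand-in for the tuning of step 3, and computes the solution with the Richardson iteration (4) (step 4). The block ordering, the parametrization C = [[a, 1], [1, b]], and the particular values of a and b are assumptions for this toy problem only; they satisfy the requirements stated above but are not the tuned coupling derived in [3, 5].

```python
# Minimal 1-D sketch (illustrative, not the authors' code) of the enhancement, the
# splitting B_C = M_C - N_C and the Richardson iteration (4).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, m = 40, 20                       # n interior grid points, m of them in subdomain 1
B = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
d = np.ones(n)

# Enhanced ordering: [ y_1 .. y_l | ~y_r | ~y_l | y_r .. y_2 ], so two extra unknowns.
N = n + 2
il, irt, ilt, ir = m - 1, m, m + 1, m + 2       # positions of y_l, ~y_r, ~y_l, y_r
BC = sp.lil_matrix((N, N))
d0 = np.zeros(N)
for i in range(m):                  # rows of subdomain 1; the stencil at y_l uses ~y_r
    BC[i, i] = 2.0
    if i > 0:
        BC[i, i - 1] = -1.0
    BC[i, i + 1] = -1.0             # for i = m-1 this column is the ghost unknown ~y_r
    d0[i] = d[i]
for i in range(m, n):               # rows of subdomain 2; the stencil at y_r uses ~y_l
    k = i + 2
    BC[k, k] = 2.0
    BC[k, k - 1] = -1.0             # for i = m this column is the ghost unknown ~y_l
    if i < n - 1:
        BC[k, k + 1] = -1.0
    d0[k] = d[i]
# Coupling rows: C [y_l - ~y_l ; ~y_r - y_r] = 0 with C = [[a, 1], [1, b]]; C is
# nonsingular iff a*b != 1, and then the restriction of the enhanced solution solves (1).
# The values below act as an absorbing coupling for this toy problem with exact
# subdomain solves (illustrative choice, not the tuning of [3, 5]).
a, b = -(n - m) / (n - m + 1.0), -m / (m + 1.0)
for j, v in zip([il, irt, ilt, ir], [a, 1.0, -a, -1.0]):
    BC[irt, j] = v
for j, v in zip([il, irt, ilt, ir], [-1.0, -b, 1.0, b]):
    BC[ilt, j] = v

# Splitting B_C = M_C - N_C: N_C holds only the four cross-subgrid interface entries,
# so M_C is block diagonal over the two subgrids (perfectly parallel solves).
NC = sp.lil_matrix((N, N))
NC[irt, ilt], NC[irt, ir] = a, 1.0
NC[ilt, il], NC[ilt, irt] = 1.0, b
BC, NC = BC.tocsc(), NC.tocsc()
MC = BC + NC

# Richardson iteration (4); with the absorbing coupling the error is gone after two steps.
solve_MC = spla.splu(MC).solve
y = np.zeros(N)
for it in range(4):
    r = d0 - BC @ y
    print(it, np.linalg.norm(r))
    y = y + solve_MC(r)

y_ref = spla.spsolve(B, d)          # direct solve of the original system (1)
print(np.linalg.norm(np.concatenate([y[:m], y[ir:]]) - y_ref))   # ~ machine precision
```

With this coupling the residual drops to machine precision after two iterations, illustrating the almost uncoupled subproblems discussed below; any other nonsingular C leaves the solution of the enhanced system unchanged but damps the interface error modes less effectively.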

The key idea is to use the degrees of freedom that we have created by the introduction of additional unknowns near the interface for damping the error components. For this purpose, the spectral properties of \(\mathbf{M}_{C}^{-1}\mathbf{N}_{C}\) for the specific underlying PDE are analyzed. With the results of this analysis, optimal coupling parameters can be estimated, i.e. the interface coupling matrix C defined in (3) can be tuned. In this way error components due to the splitting are damped “as much as possible”; optimal choices result in a coupling that annihilates the outflow from one subdomain to the other: absorbing boundary conditions. This effectively leads to almost uncoupled subproblems on the subdomains. As a consequence, the number of iterations required for convergence is minimal, with minimal communication overhead between subdomains (only in the explicit step with \(\mathbf{N}_{C}\)): an ideal situation for implementation on parallel computers and/or distributed memory.

3 Jacobi-Davidson

For a standard eigenvalue problem Ax = λx, each iteration of Jacobi-Davidson [7]

  1. Extracts an approximate eigenpair (θ, u) ≈ (λ, x) from a search subspace V: construct H ≡ V∗AV, solve Hs = θs, and compute u = Vs.

  2. Corrects the approximate eigenvector u with a correction t ⊥ u that is computed from the correction equation:

    $$\displaystyle{ \mathbf{P}\,\mathbf{B}\,\mathbf{P}\,\mathbf{t} = \mathbf{r}\:\:\:\:\mbox{ where}\:\:\:\:\mathbf{P} \equiv \mathbf{I} -\frac{\mathbf{u}\,\mathbf{u}^{{\ast}}} {\mathbf{u}^{{\ast}}\mathbf{u}},\mathbf{B} \equiv \mathbf{A} -\theta \,\mathbf{I},\:\:\:\:\mbox{ and}\:\:\:\:\mathbf{r} \equiv -\mathbf{B}\,\mathbf{u}. }$$
    (5)
  3. Expands the search subspace with the correction t: \(\mathbf{V}_{\mathit{new}}\,=\,\left [\mathbf{V}\left.\right \vert \mathbf{t}^{\perp }\right ]\) where \(\mathbf{t}^{\perp } \perp \mathbf{V}\). (A small sketch of these three steps follows this list.)
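The following compact, dense Python sketch of these three steps is purely illustrative (the experiments in Sect. 4 use a parallel Fortran 77 code); the test matrix, the target value, and the start vector are assumptions, and the correction equation is solved "exactly" with a least-squares solve, whereas in practice it is solved only approximately with a preconditioned Krylov method as described below.

```python
# Dense sketch of the Jacobi-Davidson cycle (illustrative only, assumed test problem).
import numpy as np

def jacobi_davidson(A, v0, sigma=0.0, tol=1e-9, maxit=50):
    n = A.shape[0]
    V = (v0 / np.linalg.norm(v0)).reshape(n, 1)          # initial search subspace
    theta, u = sigma, V[:, 0]
    for _ in range(maxit):
        # 1. extraction: Rayleigh-Ritz, select the Ritz value closest to the target sigma
        H = V.T @ A @ V
        w, S = np.linalg.eigh(H)
        j = np.argmin(np.abs(w - sigma))
        theta, u = w[j], V @ S[:, j]
        r = A @ u - theta * u                            # = B u, so the RHS of (5) is -r
        if np.linalg.norm(r) < tol:
            break
        # 2. correction: solve P B P t = -B u; the minimum-norm solution satisfies t ⊥ u
        P = np.eye(n) - np.outer(u, u)
        B = A - theta * np.eye(n)
        t = np.linalg.lstsq(P @ B @ P, -r, rcond=None)[0]
        # 3. expansion: orthonormalize t against V and append it to the search subspace
        t -= V @ (V.T @ t)
        V = np.hstack([V, t[:, None] / np.linalg.norm(t)])
    return theta, u

# Usage: eigenvalue of a 1-D Laplacian closest to zero, started (as in Sect. 4) with a
# parabola-shaped vector.
n = 100
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = np.arange(1, n + 1) / (n + 1.0)
theta, u = jacobi_davidson(A, x * (1.0 - x))
print(theta, 2.0 - 2.0 * np.cos(np.pi / (n + 1)))        # the two values agree
```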

The linear system described by the correction equation (5) may be highly indefinite and is given in an unusual manner, so the application of the domain decomposition technique needed further development and special attention.

Similar to the enhancement of the linear system (1) into (2) in Sect. 2, the following components of the correction equation are enhanced: the matrix B ≡ A − θI to \(\mathbf{B}_{C}\), the correction vector t to \(\mathbf{t}_{\approx }\), and the vectors u and r to \(\mathbf{u}_{0}\) and \(\mathbf{r}_{0}\). With these enhancements, a correction \(\mathbf{t}_{\approx } \perp \mathbf{u}_{0}\) is computed from the following enhanced correction equation [3, Sect. 3.3.2]:

$$\displaystyle{ \mathbf{P}\,\mathbf{B}_{C}\,\mathbf{P}\,\mathbf{t}_{\approx } = \mathbf{r}_{0}\quad \mbox{ with}\quad \mathbf{P} \equiv \mathbf{I} -\frac{\mathbf{u}_{0}\mathbf{u}_{0}^{{\ast}}} {\mathbf{u}_{0}^{{\ast}}\mathbf{u}_{0}}. }$$
(6)

The preconditioner \(\mathbf{M}_{C}\) for \(\mathbf{B}_{C}\) is constructed in the same way as in the ordinary linear system case, namely from the boxed parts in (2). However, because of the indefiniteness, for the correction equation the matrices \(\mathbf{B}_{C}\) and \(\mathbf{M}_{C}\) are accompanied by projections. For both left and right preconditioning the projection is as follows:

$$\displaystyle{ \mathbf{P}' \equiv \mathbf{I} -\frac{\mathbf{M}_{C}^{-1}\,\mathbf{u}_{0}\,\mathbf{u}_{0}^{{\ast}}} {\mathbf{u}_{0}^{{\ast}}\,\mathbf{M}_{C}^{-1}\,\mathbf{u}_{0}}. }$$
(7)

In case of left preconditioning (for right preconditioning see [3, Sect. 3.3.3]) we compute approximate solutions to the correction equation from

$$\displaystyle{ \mathbf{P}'\,\mathbf{M}_{C}^{-1}\,\mathbf{B}_{ C}\,\mathbf{P}'\,\mathbf{t}_{\approx } = \mathbf{P}'\,\mathbf{M}_{C}^{-1}\,\mathbf{r}_{ 0}. }$$
(8)
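For a Krylov solver, each application of the operator in (8) consists of a multiplication with \(\mathbf{B}_{C}\), a solve with \(\mathbf{M}_{C}\) (subdomain-wise, hence parallel), and two projections with P′ from (7). The sketch below (real arithmetic; the names BC, solve_MC, u0, and r0 are hypothetical and assumed to be available, for instance from an enhancement like the one sketched in Sect. 2) shows how this can be wrapped for a serial test; it is not the authors' implementation.

```python
# Sketch of the left-preconditioned, projected operator P' M_C^{-1} B_C P' of (8).
import scipy.sparse.linalg as spla

def projected_operator(BC, solve_MC, u0):
    """Return t -> P' M_C^{-1} B_C P' t as a LinearOperator, together with P' itself."""
    z = solve_MC(u0)                     # M_C^{-1} u_0
    gamma = u0 @ z                       # u_0^* M_C^{-1} u_0 (real arithmetic assumed)

    def P_prime(x):                      # P' x = x - (M_C^{-1} u_0) (u_0^* x) / gamma
        return x - z * (u0 @ x) / gamma

    def matvec(t):
        return P_prime(solve_MC(BC @ P_prime(t)))

    n = BC.shape[0]
    return spla.LinearOperator((n, n), matvec=matvec), P_prime

# Usage with a Krylov method, cf. item 4 of Sect. 2:
# op, P_prime = projected_operator(BC, solve_MC, u0)
# t_approx, info = spla.gmres(op, P_prime(solve_MC(r0)))   # approximate solution of (8)
```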

However, there is more to gain. When the correction equation is solved approximately with a preconditioned Krylov method, the Jacobi-Davidson method is an accelerated inexact Newton method that consists of two nested iterative solvers. In the inner loop of Jacobi-Davidson a search subspace for the (approximate) solution of the correction equation is built up by powers of \(\mathbf{M}_{C}^{-1}\,\left (\,\mathbf{A} -\theta \,\mathbf{I}\,\right )\) for fixed θ. In the outer loop a search subspace for the (approximate) solution of the eigenvalue problem is built up by powers of \(\mathbf{M}_{C}^{-1}\,\left (\,\mathbf{A} -\theta \,\mathbf{I}\,\right )\) for variable θ. As θ varies only slightly in succeeding outer iterations, one may take advantage of the nesting by applying the domain decomposition technique to the outer loop, which was the subject of [4]. This effectively led to two different processes:

  • Jacobi-Davidson with enhanced inner loop: enhancement at the intermediate level with the enhanced correction equation (6), and

  • Jacobi-Davidson with enhanced outer loop: enhancement at the highest level with a slightly different correction equation

$$\displaystyle{ \mathbf{P}\,\mathbf{B}_{C}\,\mathbf{P}\,\mathbf{t}_{\approx } = \mathbf{r}_{\approx }\quad \mbox{ with}\quad \mathbf{P} \equiv \mathbf{I} -\frac{\mathbf{u}_{0}\mathbf{u}_{0}^{{\ast}}} {\mathbf{u}_{0}^{{\ast}}\mathbf{u}_{0}}. }$$
(9)

The amount of work per outer iteration is almost the same for both processes. However, Jacobi-Davidson with enhanced outer loop turned out to be faster, as it damps error components remaining from the previous outer iteration in the next one.

4 Scaling Experiments

For the two processes, different eigenvalue problems were considered in [4, Sect. 5.1], including problems with variable coefficients and large jumps in the coefficients. Here, to investigate practical aspects of the parallel performance, we consider the eigenvalue problem for the Laplace operator, as results for different numbers of subdomains show more regular behavior (see for instance Fig. 3 in [4]). Except for the first experiment about different decompositions, in all experiments we take for the domain Ω the unit square, decompose Ω into p square subdomains, and cover each subdomain by a 256×256 subgrid. Jacobi-Davidson is started with the parabola-shaped vector \(x\,(\,1 - x\,)\,y\,(\,1 - y\,)\) for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1 (see also [3, Sect. 3.5.1]) to compute the most global eigenvector (the one whose eigenvalue is closest to zero) of the two-dimensional Laplace operator on Ω, until the residual norm of the approximate eigenpair is less than 10^{-9}. We apply right preconditioning in the enhanced correction equation and solve exactly with the preconditioner (i.e. exact subdomain solves) to enable a Schur complement approach. The preconditioner \(\mathbf{M}_{C}\) is constructed only once, at the first Jacobi-Davidson outer iteration. The remaining linear system is solved with GMRES [6].

The implementation is in Fortran 77 with calls to BLAS, LAPACK, and MPI. Note, however, that the Fortran compiler and the BLAS, LAPACK, and MPI versions differ per machine, which influences the (parallel) performance. Scaling experiments are performed on the following hardware:

  • Curie linux-cluster (nodes with two eight-core 2.7 GHz Intel Xeon E5-2680 processors, InfiniBand QDR, Intel Fortran 12, BLAS/LAPACK from MKL, Bull X MPI),

  • H4+ linux-cluster (nodes with one quad-core 3.4 GHz Intel Core i7-2600 processor, Gigabit Ethernet (1 Gbit/s), Intel Fortran 11, MPICH2),

  • IBM POWER5+ system Huygens (nodes with 16 single-core 1.9 GHz IBM POWER5+ processors, 1.2 GB/s InfiniBand, XL Fortran 10, BLAS from ESSL, MPI from IBM PE),

  • IBM POWER6 system Huygens (nodes with 16 dual-core 4.7 GHz IBM POWER6 processors, 160 GB/s InfiniBand, XL Fortran 12, BLAS from ESSL, MPI from IBM PE),

  • Lisa 2008 linux-cluster (nodes with one Intel Xeon 3.4E GHz core, 800 MB/s InfiniBand, GFortran, MPICH2),

  • Lisa 2012 linux-cluster (nodes with two eight-core 1.8 GHz Intel Xeon E5-2650L processors, Intel Fortran 12, BLAS/LAPACK from MKL, OpenMPI).

On the H4+ and Lisa 2008 linux-clusters one subdomain is assigned to one node. On the other hardware one subdomain is assigned to one core. Results presented here are averages of three measured wall-clock times.

Fig. 2 Residual norm of the approximate eigenpair as a function of the Jacobi-Davidson outer iteration for the different decompositions with GMRES(8) (top) and GMRES(4) (bottom)

Fig. 3 Residual norm of the approximate eigenpair as a function of the wall clock time for the different decompositions. Shown are both the enhanced inner loop and the enhanced outer loop for the Lisa 2012 and H4+ linux-clusters and a fixed number of 8 and 4 inner iterations with GMRES

First we study different decompositions for a fixed number of subdomains and the same (discretized) eigenvalue problem. We keep the overall grid fixed at a size of 1024×1024 grid points and consider configurations with a 1×16, 2×8, 4×4, 8×2, and 16×1 decomposition, respectively (resulting in subgrids of size 1024×64, 512×128, 256×256, 128×512, and 64×1024, respectively). So the number of subdomains is 16 with 65,536 unknowns per subdomain in all configurations, but the subdomains differ in shape. Figure 2 shows the residual norm of the approximate eigenpair as a function of the Jacobi-Davidson outer iteration for the different decompositions. Shown are both the enhanced inner loop and the enhanced outer loop for a fixed number of 8 (top) and 4 (bottom) inner iterations with GMRES. As expected, the convergence histories for configurations that are mirrored (for instance 2×8 and 8×2) coincide. Decomposition in only one direction requires the fewest outer iterations for convergence: for the tuning of the coupling between the subdomains we only took the one-dimensional character of the error modes into account, and for decompositions in two directions error modes have a two-dimensional character and are therefore harder to damp. Figure 3 shows the residual norm of the approximate eigenpair as a function of wall clock time for the different decompositions. Shown are both the enhanced inner loop and the enhanced outer loop for the Lisa 2012 and H4+ linux-clusters and a fixed number of 8 and 4 inner iterations with GMRES. By comparing the mirrored configurations it can be observed that the grid ordering may significantly lower the performance. This occurs mainly in the construction of the preconditioner with LAPACK (the initial horizontal lines in the figure). Although the processors of the H4+ linux-cluster are faster, use of the MKL implementation of LAPACK resulted in a faster construction of the preconditioner on the Lisa 2012 linux-cluster. After the construction of the preconditioner, the process on the H4+ linux-cluster runs faster than on the Lisa 2012 linux-cluster. On the H4+ linux-cluster communication takes place between 16 nodes over a relatively slow network, whereas on the Lisa 2012 linux-cluster communication is fast, inside a 16-core node with shared memory. So we may conclude that the process is dominated by computational work. This confirms the remarks at the end of Sect. 2 about the minimal communication overhead.

Fig. 4 Massively parallel behavior on different hardware

For the massively parallel behavior, we first extend Fig. 6 from [4] with results from (weak) scaling experiments on more recent hardware (IBM POWER6 system Huygens, Curie, and H4+). In Fig. 4 it can be observed that the trend holds, but now with lower wall clock times, as processor speed has increased further for the more recent hardware.

To further investigate the weak scaling we start with a decomposition into 16 subdomains (on 1 node with 16 cores) on the Curie linux-cluster and repeatedly increase the number of subdomains in both directions by a factor of 2. This gives 16, 64, 256, 1,024, 4,096, and 16,384 subdomains (cores), resulting in up to more than 10^{9} unknowns. For an efficient overall method, we now use (see [1, Sect. 4])

$$\displaystyle{ \|\mathbf{r}^{(i)}\|_{ 2} <2^{-j}\,\|\mathbf{r}^{(0)}\|_{ 2} }$$
(10)

as a stopping criterion for the inner iterations (GMRES) at the j-th Jacobi-Davidson outer iteration. Here r^{(0)} is the residual at the start of the inner iterations and r^{(i)} the residual at the i-th inner iteration. Figure 5 shows the results for Jacobi-Davidson with enhanced outer loop. Note that in this figure we choose a quadratic scaling of the x-axis to give a better impression. The figure indicates that for a large number of subdomains the wall clock time doubles when the number of subdomains is increased by a factor of 2 in both directions. This can be explained by the local behavior of the error modes due to the domain decomposition: they are mainly one-dimensional near the interfaces. The additional work needed to damp these error modes effectively depends on this local behavior.
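In code, criterion (10) is a relative test on the inner residual that is tightened by a factor of two with every outer iteration; a minimal sketch with assumed names (not the code used for the experiments):

```python
# Stopping test (10) for the inner GMRES iterations at the j-th outer iteration.
import numpy as np

def inner_converged(r_inner, r_start, j):
    """True once ||r^(i)||_2 < 2^(-j) ||r^(0)||_2."""
    return np.linalg.norm(r_inner) < 2.0 ** (-j) * np.linalg.norm(r_start)
```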

Fig. 5 Massively parallel behavior on the Curie linux-cluster (quadratic scaling of the x-axis)