1 Introduction

The Helmholtz equation describes time-harmonic wave scattering in the frequency domain. It is widely studied in computational electromagnetics, with applications in seismic exploration, sonar, antennas, and medical imaging. To solve the Helmholtz equation numerically, we discretize it and obtain a linear system \(Ax=b\). The resulting matrix is sparse, symmetric, non-Hermitian, and indefinite [1]. Iterative methods and parallel computing are commonly employed for the large-scale linear systems that arise from practical problems. However, the indefiniteness of the linear system poses a great challenge to numerical solution methods, especially for large wavenumbers: the convergence rate of many iterative solvers deteriorates significantly as the wavenumber increases. The research question in this field is therefore how to solve these systems efficiently and economically while maintaining high accuracy by minimizing the pollution error.

Many efforts have been made to solve the problem accurately and with high performance. The Complex Shifted Laplace Preconditioner (CSLP) [3, 4], originally derived from [2] and now an industry standard, shows good properties for medium wavenumbers. Nevertheless, the small eigenvalues of the preconditioned system shift towards the origin as the wavenumber increases, and these near-zero eigenvalues slow down the convergence of Krylov-based iterative solvers. Recently, a higher-order approximation scheme for constructing the deflation vectors was proposed to obtain close to wavenumber-independent convergence [5].

The development of scalable parallel Helmholtz solvers is also ongoing. One approach is to parallelize existing advanced algorithms. Kononov and Riyanti [6, 7] first developed a parallel version of Bi-CGSTAB preconditioned by multigrid-based CSLP. Gordon and Gordon [8] parallelized their CARP-CG algorithm (Conjugate Gradient acceleration of CARP) blockwise; the block-parallel CARP-CG algorithm shows improved scalability as the wavenumber increases. Calandra et al. [9] proposed a geometric two-grid preconditioner for 3D Helmholtz problems, which shows strong scaling properties in a massively parallel setup. Another approach is the Domain Decomposition Method (DDM), which originates from the early Schwarz methods. DDM, mostly used as a preconditioner, has been widely employed to develop parallel solution methods for Helmholtz problems. For comprehensive surveys, we refer the reader to [10,11,12,13,14] and references therein.

This work describes parallel versions of Krylov subspace methods, such as the Generalized Minimal Residual method (GMRES), Bi-CGSTAB, and IDR(s), preconditioned by the multigrid-based CSLP for the Helmholtz equation. We consider the CSLP preconditioner because it is the first and most popular preconditioner for which the number of iterations scales roughly linearly with the wavenumber in the medium-wavenumber regime. Based on a blockwise domain decomposition and a matrix-free implementation, our parallel framework provides robust parallel CSLP-preconditioned Krylov solvers for Helmholtz problems and forms a basis for scalable parallel computing. Numerical experiments show that, compared to [5, 6], which assemble matrices, the matrix-free framework allows us to solve the Helmholtz problem on larger grids, reducing the pollution error related to grid resolution. The parallel efficiency is up to \(70\%\). The weak scaling behaviour means that a proportionally larger problem can be solved in about the same amount of time, provided the number of tasks increases accordingly.

The rest of this paper is organized as follows. Section 2 describes the mathematical model that we will discuss. All numerical methods we use are given in Sect. 3. The numerical performance is explored in Sect. 4. Finally, Sect. 5 contains our conclusions.

2 Mathematical Model

We consider the 2D Helmholtz equation on a rectangular domain \(\varOmega \) with boundary \(\varGamma = \partial \varOmega \). The equation reads

$$\begin{aligned} -\varDelta u(x,y) - k^2 u(x,y) = b(x,y),\ \text {on} \ \varOmega \end{aligned}$$
(1)

supplied with Dirichlet boundary conditions \(u(x,y)=g(x,y)\) or first-order Sommerfeld boundary conditions \(\frac{\partial u(x,y)}{\partial \vec{n}}-\text {i} k u(x,y) = 0,\ \text {on} \ \partial \varOmega \). Here \(\text {i}\) is the imaginary unit, \(\vec{n}\) is the outward normal, and \(g(x,y)\) is the given boundary data. \(b(x,y)\) is the source function and k is the wavenumber, which is related to the frequency f and the speed of propagation c by \(k = \frac{2 \pi f}{c}\).

3 Numerical Methods

3.1 Discretization

Structured vertex-centered grids are used to discretize the computational domain. Suppose the mesh widths in the x- and y-directions are both h. A second-order finite difference scheme is used. The discrete Helmholtz operator \(A_h\) is obtained by adding the diagonal matrix \(-k^2 I_h\) to the discrete Laplacian \(-\varDelta _h\), i.e. \(A_h =-\varDelta _h-k^2 I_h\). The stencil of the discrete Helmholtz operator is therefore

$$\begin{aligned} \left[ A_h \right] = \frac{1}{h^2} \left[ \begin{array}{ccc} 0 & -1 & 0 \\ -1 & 4 - k^2 h^2 & -1 \\ 0 & -1 & 0 \end{array} \right] \end{aligned}$$
(2)

In the case of the Sommerfeld radiation condition, ghost points located just outside the domain can be introduced for the boundary points. For instance, suppose \(u_{0,j}\) is a ghost point to the left of \(u_{1,j}\); the boundary condition can then be approximated by \(\frac{\partial u}{\partial \vec{n}} -\text {i} k u =\frac{u_{0,j}-u_{2,j}}{2h}-\text {i} k u_{1,j}= 0 \). We can rewrite this as \(u_{0,j} = u_{2,j} + 2h\text {i} k u_{1,j}\), which is used to eliminate the ghost point from the computational stencil above. The discretization of the first-order Sommerfeld boundary conditions results in a complex-valued linear system.
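
Substituting this ghost-point relation into stencil (2) at a left-boundary point \(u_{1,j}\) gives, for illustration, the modified boundary equation

$$\begin{aligned} \frac{1}{h^2}\left[ \left( 4 - k^2h^2 - 2\text {i}kh \right) u_{1,j} - 2u_{2,j} - u_{1,j-1} - u_{1,j+1} \right] = b_{1,j}, \end{aligned}$$

so only the diagonal entry and the weight of the interior neighbour change.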

3.2 Preconditioned Krylov Subspace Methods

Among the representative Krylov subspace methods, GMRES and Bi-CGSTAB are suitable choices for the Helmholtz equation, as they can handle the resulting non-Hermitian, indefinite linear systems. Also, IDR(s), developed by Sonneveld and van Gijzen [17], is an efficient alternative to Bi-CGSTAB for Helmholtz problems. Compared with full GMRES, Bi-CGSTAB and IDR(s) have short recurrences and are easily parallelizable.

As for the preconditioner, we will focus on the CSLP due to its satisfactory performance and easy setup. The CSLP is defined by

$$\begin{aligned} M_h =-\varDelta _h-\left( \beta _1+\beta _2 \text {i} \right) k^2 I_h \end{aligned}$$
(3)

In the preconditioned Krylov-based algorithms, we need to apply the inverse of the preconditioner \(M_h\). It is usually too costly to invert a preconditioner like CSLP exactly. One idea is to solve the preconditioner system approximately with a multigrid method [4]. It is necessary to choose a proper complex shift [18], since a small complex shift may affect the convergence of the multigrid method. In the numerical experiments of this paper, \(\beta _1=1, \beta _2=0.5\) are used.

A multigrid method involves several components that require careful design to achieve good convergence. In this paper, a damped Jacobi smoother with relaxation factor \(\omega = 0.8\) is used. The full-weighting restriction operator and the bilinear interpolation operator are employed for the inter-grid transfer operations. The coarse-grid operator \(M_{2h}\) is constructed by re-discretizing on the coarse mesh in the same way that the operator \(M_h\) is obtained on the fine mesh; this is known as the discretization coarse-grid operator (DCG). The classical multigrid V-cycle is performed. Instead of solving the coarsest-grid problem directly, we solve it with full GMRES.
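
To make the structure concrete, the following Fortran sketch outlines one V-cycle applied to \(M_h\); the routine names (smooth_jacobi, cslp_residual, restrict_fw, prolong_bilinear, coarse_gmres), the argument lists, and the recursion control are illustrative placeholders under our own assumptions, not the authors' actual code.

```fortran
! Structural sketch of one multigrid V-cycle for M_h (Eq. (3)); only the
! control flow is meant to be representative, the called routines are
! placeholders (stencil-based smoother, residual, transfer operators).
recursive subroutine vcycle(nx, ny, h, k, beta1, beta2, u, b)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer,     intent(in)    :: nx, ny
  real(dp),    intent(in)    :: h, k, beta1, beta2
  complex(dp), intent(inout) :: u(:,:)
  complex(dp), intent(in)    :: b(:,:)
  complex(dp), allocatable   :: r(:,:), r2h(:,:), e2h(:,:)
  integer :: nxc, nyc

  if (nx <= 9 .or. ny <= 9) then
     call coarse_gmres(nx, ny, h, k, beta1, beta2, u, b)    ! coarsest grid: full GMRES
     return
  end if

  call smooth_jacobi(nx, ny, h, k, beta1, beta2, u, b, 1, 0.8_dp)  ! pre-smoothing, omega = 0.8
  allocate(r(nx,ny))
  call cslp_residual(nx, ny, h, k, beta1, beta2, u, b, r)          ! r = b - M_h u (stencil-based)

  nxc = (nx + 1)/2;  nyc = (ny + 1)/2                              ! vertex-centered coarsening
  allocate(r2h(nxc,nyc), e2h(nxc,nyc))
  call restrict_fw(nx, ny, r, nxc, nyc, r2h)                       ! full-weighting restriction
  e2h = (0.0_dp, 0.0_dp)
  call vcycle(nxc, nyc, 2.0_dp*h, k, beta1, beta2, e2h, r2h)       ! DCG: re-discretize with mesh width 2h
  call prolong_bilinear(nxc, nyc, e2h, nx, ny, u)                  ! u <- u + bilinear interpolation of e2h

  call smooth_jacobi(nx, ny, h, k, beta1, beta2, u, b, 1, 0.8_dp)  ! post-smoothing
end subroutine vcycle
```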

Suppose a problem with N unknowns is solved by the CSLP-preconditioned Krylov subspace method with assembled matrices. According to the complexity analysis in [5, 19], in addition to the solution and auxiliary vectors, extra memory is needed to store the sparse matrix \(A_h\) with 5N non-zero elements, \(M_h\) with 5N non-zero elements, \(M_{2h}\) with \(\frac{9N}{4}\), the inter-grid transfer operator Z with \(\frac{9N}{4}\), etc. To reduce memory limitations and solve real-world large-scale problems, we implement the preconditioned Krylov subspace methods in a matrix-free way instead of constructing the coefficient matrices explicitly.
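
Summing the figures quoted above gives a rough estimate of the extra matrix storage required by an assembled implementation:

$$\begin{aligned} 5N + 5N + \tfrac{9}{4}N + \tfrac{9}{4}N = 14.5\,N \ \text {non-zero (complex) entries}, \end{aligned}$$

all of which the matrix-free implementation avoids.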

3.3 Matrix-Free Parallel Implementation

Matrix-vector multiplications, such as \(A_h \textbf{u}_h\) in the outer iterations, \(M_h \textbf{u}_h\) in the smoother, and \(M_{2h} \textbf{u}_{2h}\) in the preconditioner solver, can be replaced by stencil computations. For example, \(A_h \textbf{u}_h\) can be computed as in Algorithm 1. The inter-grid transfer operations can also be performed in a matrix-free way according to the linear interpolation/restriction polynomials.

Algorithm 1. Matrix-free \(\mathbf {v_h}=A_h\mathbf {u_h}\).
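
Since the algorithm figure is not reproduced here, the following Fortran sketch illustrates the idea of Algorithm 1; the array layout (one halo layer around the local block) and the boundary treatment are simplifying assumptions for illustration, not the authors' exact implementation.

```fortran
! Sketch of a matrix-free 5-point stencil product v_h = A_h*u_h on one subdomain.
! u carries one halo layer (indices 0 and nx+1 / ny+1), assumed to be filled
! beforehand by a halo exchange or by the boundary conditions.
subroutine helmholtz_matvec(nx, ny, h, k, u, v)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer,     intent(in)  :: nx, ny
  real(dp),    intent(in)  :: h, k
  complex(dp), intent(in)  :: u(0:nx+1, 0:ny+1)
  complex(dp), intent(out) :: v(1:nx, 1:ny)
  integer  :: i, j
  real(dp) :: h2
  h2 = h*h
  do j = 1, ny
     do i = 1, nx
        ! stencil (2): (4 - k^2 h^2) at the centre, -1 on the four neighbours
        v(i,j) = ( (4.0_dp - k*k*h2)*u(i,j) &
                   - u(i-1,j) - u(i+1,j) - u(i,j-1) - u(i,j+1) ) / h2
     end do
  end do
end subroutine helmholtz_matvec
```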

For parallel computing, the standard MPI library is employed for data communication among the processors. Based on the MPI Cartesian topology, we partition the computational domain blockwise. The partition is carried out between two grid points. One layer of overlapping grid points is introduced outward at each interface boundary to represent the adjacent grid points of the neighbouring subdomain. In our method, the grid unknowns are stored as a two-dimensional array based on the grid ordering (i,j) instead of a column vector based on x-line lexicographic ordering.
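
The sketch below illustrates this setup for the x-direction: a Cartesian communicator (assumed to have been created with MPI_Cart_create) is queried for the left/right neighbours and one halo column is exchanged with each; the y-direction is handled analogously. The routine name and buffer layout are our own illustrative assumptions.

```fortran
! Minimal sketch of a halo exchange in the x-direction on a 2D Cartesian
! process grid (mpi_f08); the y-direction exchange is analogous.
subroutine halo_exchange_x(nx, ny, u, comm2d)
  use mpi_f08
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer,        intent(in)    :: nx, ny
  complex(dp),    intent(inout) :: u(0:nx+1, 0:ny+1)
  type(MPI_Comm), intent(in)    :: comm2d
  integer     :: left, right
  complex(dp) :: sendl(ny), sendr(ny), recvl(ny), recvr(ny)

  call MPI_Cart_shift(comm2d, 0, 1, left, right)   ! ranks of left/right neighbours

  sendl = u(1,  1:ny)      ! first interior column -> left neighbour
  sendr = u(nx, 1:ny)      ! last interior column  -> right neighbour
  call MPI_Sendrecv(sendl, ny, MPI_DOUBLE_COMPLEX, left,  0, &
                    recvr, ny, MPI_DOUBLE_COMPLEX, right, 0, &
                    comm2d, MPI_STATUS_IGNORE)
  call MPI_Sendrecv(sendr, ny, MPI_DOUBLE_COMPLEX, right, 1, &
                    recvl, ny, MPI_DOUBLE_COMPLEX, left,  1, &
                    comm2d, MPI_STATUS_IGNORE)
  if (right /= MPI_PROC_NULL) u(nx+1, 1:ny) = recvr   ! fill right halo layer
  if (left  /= MPI_PROC_NULL) u(0,    1:ny) = recvl   ! fill left halo layer
end subroutine halo_exchange_x
```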

We implement the parallel multigrid iteration based on the original global grid. According to the relationship between the fine grid and the coarse grid, the parameters of the coarse grid are determined by the grid parameters of the fine one. For example, point \((i_c,j_c)\) in the coarse grid corresponds to point \((2i_c-1, 2j_c-1)\) in the fine grid. In a V-cycle, once a manually predefined coarsest grid size is reached, the coarsening stops and the coarsest problem is solved by GMRES in parallel, which may incur some efficiency loss. In this paper, the predefined coarsest global grid size is \( nc_x \times nc_y = 9 \times 9\), as the maximum number of processors we use is \(4 \times 4\).
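
As an example of a matrix-free inter-grid transfer that uses this index correspondence, a full-weighting restriction over the interior coarse-grid points could look as follows; the routine name, argument list, and the omitted boundary/inter-process handling are illustrative assumptions.

```fortran
! Sketch of matrix-free full-weighting restriction r_2h = R r_h, using the
! correspondence (ic,jc) on the coarse grid <-> (2*ic-1, 2*jc-1) on the fine grid.
subroutine restrict_fw(nxf, nyf, rf, nxc, nyc, rc)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer,     intent(in)  :: nxf, nyf, nxc, nyc
  complex(dp), intent(in)  :: rf(nxf, nyf)
  complex(dp), intent(out) :: rc(nxc, nyc)
  integer :: ic, jc, i, j
  rc = (0.0_dp, 0.0_dp)
  do jc = 2, nyc - 1
     do ic = 2, nxc - 1
        i = 2*ic - 1;  j = 2*jc - 1
        ! full-weighting weights: 1/4 centre, 1/8 edge neighbours, 1/16 corners
        rc(ic,jc) = 0.25_dp   *  rf(i,j)                                          &
                  + 0.125_dp  * (rf(i-1,j) + rf(i+1,j) + rf(i,j-1) + rf(i,j+1))   &
                  + 0.0625_dp * (rf(i-1,j-1) + rf(i+1,j-1) + rf(i-1,j+1) + rf(i+1,j+1))
     end do
  end do
end subroutine restrict_fw
```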

4 Numerical Experiments

The solver is developed in Fortran, compiled with GNU Fortran, and runs on a Linux compute node with Intel(R) Xeon(R) Gold 6152 (2.10 GHz) CPUs. For the outer iterations, the \(L_2\)-norm of the preconditioned relative residual is reduced to \(10^{-6}\). According to preliminary experiments, the stopping criterion for the coarse-grid preconditioner solver should be 2–3 orders of magnitude smaller than that of the outer iteration; we use \(10^{-8}\). The wall-clock time for the preconditioned Krylov-based solver to reach the stopping criterion is denoted by \(t_w\). The speedup \(S_p\) is defined by \(S_p=\frac{t_1}{t_p}\), where \(t_1\) and \(t_p\) are the wall-clock times for sequential and parallel computation, respectively. The parallel efficiency \(E_p\) is given by \(E_p=\frac{S_p}{np} \times 100\%\), where np is the number of processors.

First, we consider a model problem (MP-1) with a point source described by the Dirac delta function \(\delta \left( {x,y} \right) \), imposed at the center \((x_0,y_0)=(0.5,0.5)\). The wave propagates outward from the center of the domain. Dirichlet boundary conditions are imposed. The analytical solution for this problem is given in [15].

Compared to the analytical solution given in [15], our parallel preconditioned GMRES gives a fair approximation of the exact solution, with relative errors (RErr.) less than \(5 \times 10^{-6}\). The parallel partitioning has no effect on the results. The main differences in the amplitude of the waves are caused by the finite-difference approximation of the Dirac delta function. As shown in Table 1, if we simultaneously and proportionally scale the problem size and the number of processors (in bold), the wall-clock time stays almost constant, which shows that our parallel framework has weak scalability. A parallel efficiency of up to 75% is obtained, which is satisfactory.

Table 1. Parallel performance of CSLP-preconditioned GMRES for MP-1 with wavenumber \(k=100\).

Most physical problems in geophysical seismic imaging involve a heterogeneous medium. The so-called Wedge problem (MP-2) is a typical problem with simple heterogeneity. It mimics three layers with different velocities and hence different wavenumbers. The rectangular domain \(\varOmega =\left[ 0,600 \right] \times \left[ -1000,0 \right] \) is split into three layers, where the wave velocity c is constant within each layer but differs between layers. A point source is located at \(\left( x_0, y_0 \right) = \left( 300, 0\right) \). The wavenumber is \(k(x,y)=\frac{2\pi f}{c(x,y)}\), where f is the frequency. The distribution of the wave velocity \(c(x,y)\) is taken from [6]. First-order Sommerfeld boundary conditions are imposed on all boundaries.

The 2D wedge problem is used to evaluate the performance of our parallel solution method for a simple heterogeneous medium. Moreover, the matrix-free parallel framework is not limited to the GMRES algorithm; all its ingredients can be directly generalized to other Krylov methods such as Bi-CGSTAB and IDR(s). In Table 2 we give the wall-clock time and the required number of matrix-vector multiplications of different CSLP-preconditioned Krylov methods for the wedge problem. The results illustrate that our matrix-free parallel CSLP-preconditioned method remains suitable for heterogeneous Helmholtz problems, and that satisfactory scalability is maintained if we increase the number of processors correspondingly while refining the grid.

Table 2. MP-2: CPU time consumed by different parallel CSLP-preconditioned Krylov methods while refining the grid, \(f={40\,\mathrm{\text {Hz}}}\). The number of matrix-vector multiplications is in parentheses.
Table 3. MP-3: parallel performance of different parallel CSLP-preconditioned Krylov methods, \(f={40\,\mathrm{\text {Hz}}}\), grid size \(2945 \times 961\).

The so-called Marmousi problem [16] is a well-known benchmark problem (MP-3). It contains 158 horizontal layers in the depth direction, making it highly heterogeneous; see [6] for an illustration. In our numerical experiments, first-order Sommerfeld boundary conditions are imposed on all boundaries. The source frequency is \(f={40\,\mathrm{\text {Hz}}}\) and the grid size is \(2945 \times 961\), which gives \(kh \le 0.54\) and guarantees more than 10 grid points per wavelength. In [6], the Marmousi problem with grid size \(2501\times 751\) had to be solved on at least two cores due to memory limitations. The matrix-free framework allows us to solve larger problems even on a single core.

Table 3 presents the required number of matrix-vector multiplications (denoted by #Matvec), the CPU time, and the relative speedup of different CSLP-preconditioned Krylov methods for the Marmousi problem. One can see that a large number of iterations is required. GMRES needs the fewest matrix-vector multiplications but the most CPU time, reducing the parallel efficiency to \(50\%\). This is because the Arnoldi process in GMRES requires many dot products, which need global communication in parallel computing. IDR(4) and Bi-CGSTAB exhibit higher parallel efficiency than GMRES. The results illustrate that the matrix-free parallel CSLP-preconditioned method also works for highly heterogeneous Helmholtz problems.

5 Conclusions

In this paper, we studied a matrix-free parallel solution method using preconditioned Krylov methods for Helmholtz problems. The Complex Shifted Laplace Preconditioner is used and is approximately inverted by multigrid iterations. The matrix-free parallel framework is suitable for different Krylov methods. Numerical experiments on model problems demonstrate the robustness, satisfactory parallel performance, and weak scalability of our matrix-free parallel solution method. It allows us to solve larger problems in parallel and thereby obtain more accurate numerical solutions.