1 Introduction

The Helmholtz equation plays an important role in the mathematical modeling of wave phenomena like the propagation of sound and light. Because of its importance, the Helmholtz equation is of great interest for both analytical and numerical research. In this work, we deal with the efficient solution of linear systems of equations resulting from standard conforming finite element discretizations of the Helmholtz equation.

According to [11, 20], there are several reasons why the numerical solution of these systems is a challenging task. One reason is that the solutions of the Helmholtz equation oscillate on a scale of 1/k, where k denotes the wavenumber, which is proportional to the frequency of the simulated waves. As a consequence, a large number of mesh nodes is required to resolve high-frequency waves. On closer examination, it turns out that the required number of mesh nodes is proportional to \(k^2\). Moreover, low-order methods suffer from pollution effects, which implies that \(\mathcal {O}(k^2)\) mesh nodes are not sufficient to bound the discretization error as the wavenumber k increases. Hence, very large systems of equations have to be solved if large wavenumbers are considered. A further difficulty is that for large wavenumbers, these linear systems can be strongly indefinite, such that classical iterative solvers perform poorly [12]. For instance, standard multigrid methods yield unsatisfactory results, since it can be shown that the smoothers as well as the coarse grid corrections cause growing error components [8, 14]. These observations motivate the design of more effective iterative solvers, and great effort has been made to achieve this goal. An overview of different solution methods that have been tested in this context can be found in [17, 21]. Since classical iterative solvers like the Jacobi method and multigrid methods are not appropriate for a direct application to the Helmholtz problem [10, 38], Krylov subspace methods like GMRES [35] or BiCGSTAB [16] have attracted more attention. This is motivated by the fact that these Krylov subspace methods converge even in the case of indefinite matrices. However, their convergence can be very slow without a sophisticated preconditioner [29, 31, 37, 40].

It turns out that a simple modification of the original Helmholtz problem forms the basis for deriving an efficient preconditioner for a Krylov subspace method. This is achieved, e.g., by adding a complex shift to the square of the wavenumber, resulting in a new partial differential equation (PDE). This PDE is referred to as the shifted Laplacian problem or the shifted Laplacian preconditioner [2, 6, 10, 20, 28, 39]. In the remainder of this work, we will use the term shifted Laplacian problem.

A crucial issue in this context is to determine the optimal shift, denoted by \(\varepsilon \) [10, 20]. It can be shown that for \(\varepsilon \in \mathcal {O}(k)\), the shifted Laplacian problem is a preconditioner for the GMRES method with wavenumber-independent convergence. On the other hand, using a shift \(\varepsilon \in \mathcal {O}(k^2)\) [10, 15], the standard multigrid method shows optimal convergence for the discrete counterpart of the shifted Laplacian problem. This means that there is a gap between the choices \(\varepsilon \in \mathcal {O}(k)\) and \(\varepsilon \in \mathcal {O}(k^2)\). From this we conclude that a shift \(\varepsilon \in \mathcal {O}(k^{\sigma }), \, \sigma \in [1,2]\), should yield a solver for the unshifted Helmholtz problem that can be regarded as a compromise between fast multigrid convergence and a good preconditioner which reduces the total number of required iterations. One cannot expect, nor is it the goal, that this approach yields wavenumber-independent iteration numbers. Obviously, it is of great interest to determine \(\varepsilon \) not only qualitatively but also quantitatively as a function of the wavenumber k such that the convergence rate is minimal. However, there is no analytical formula for the optimal exponent \(\sigma \), and to the best of our knowledge, no article has been published yet which provides a-priori optimal shifts for problems with general heterogeneous wavenumbers.

The goal of this paper is to design a Krylov subspace method using a near optimal shift. As the outer iterative solver for the standard Helmholtz equation, we choose the FGMRES method [36]. As the preconditioner for the outer solver, we use the standard twogrid solver [41] applied to the shifted problem. The twogrid solver uses the damped Jacobi method as a smoother. This allows for a semi matrix-free implementation of the iterative solver, meaning that most of the required matrix-vector products can be realized without accessing a stored global sparse matrix. The Helmholtz equation as well as the shifted Laplacian problem are discretized using standard \(Q_p,\;p \in \left\{ 1,2,3\right\} \), finite elements. In order to circumvent the lack of knowledge on how to choose an optimal shift, we use a data-driven approach as in [23, 24, 32]. For a given constant wavenumber k and a mesh size h, the near optimal complex shift is determined by an optimization method. For this purpose, we use the golden-section method [25]. This procedure is repeated for a large number of samples with respect to k and h. Having for each sample the optimal exponent \(\sigma _{opt}\) at hand, a map is constructed such that a near optimal shift can be obtained by just evaluating the map for a given wavenumber and mesh size. By means of such a map, an optimal FGMRES solver can be constructed in an efficient way, particularly if a highly heterogeneous distribution of the wavenumber has to be considered. In order to support our numerical findings, we perform a Local Fourier Analysis (LFA) for the \(Q_1\)-discretization in two space dimensions. Similar to [12, 15], we determine for a given wavenumber and mesh size the shift exponents for which the twogrid solver is convergent. LFA is a standard tool for analyzing the convergence behavior of multigrid solvers and has a long tradition in the multigrid context. Details on LFA can be found, e.g., in [7, 27, 41, 43].

The rest of this paper is organized as follows: Sect. 2 contains the problem setting as well as the variational formulations of the Helmholtz equation and the shifted Laplacian equation. Furthermore, the standard finite element discretizations and theoretical results from the literature are recalled. Section 3 focuses on the twogrid preconditioner. Using an LFA, we study the convergence behavior of the twogrid solver. Then, in Sect. 4, the FGMRES solver combined with the twogrid solver is introduced. In Sect. 5, the generation of the training data that are used to estimate the optimal shift is described. The numerical results are presented and discussed in Sect. 6. For the two-dimensional settings, the optimal shifts are related to the results of the LFA in Sect. 3. Finally, in Sect. 7, we summarize our main findings and give an outlook.

2 Problem Setting and Variational Formulations

The basic equation for modeling wave phenomena like the propagation of sound and light is the linear wave equation

$$\begin{aligned} w_{tt} - c^2 \varDelta w = \tilde{f}, \end{aligned}$$
(1)

where the solution variable w represents, e.g., the intensity of sound or light. The term c denotes the propagation speed in a specific medium, and \(\tilde{f}\) incorporates external source terms. Considering only time-harmonic solutions

$$\begin{aligned} w(\textbf{x},t) = u(\textbf{x}) e^{-\textrm{i}\omega t}, \end{aligned}$$

with an angular frequency \(\omega \in \mathbb {R}\), (1) can be transformed into a stationary equation which is known as the Helmholtz equation [18]:

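$$\begin{aligned} -\varDelta u - k^2 u&= f \quad \text {in } \varOmega , \\ \partial _{\textbf{n}} u - \textrm{i}k u&= g \quad \text {on } \partial \varOmega . \end{aligned}$$
(2)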

The parameter k is given by \(k=\omega /c\) and is referred to as the wavenumber. In the remainder of this work, we assume that it is given by a spatially varying function \(k :\varOmega \rightarrow \mathbb {R}\), where \(\varOmega \subset \mathbb {R}^d,\;d \in \left\{ 2,3\right\} \), represents a tensorial bounded domain. The source term f results from the transformation of \(\tilde{f}\). The Helmholtz equation is equipped with an impedance boundary condition, i.e., a first-order absorbing boundary condition with \(g :\partial \varOmega \rightarrow \mathbb {C}\). Using this notation, the shifted Laplacian problem can be defined as follows [10, 20]:

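$$\begin{aligned} -\varDelta u - (k^2 + \textrm{i}\varepsilon ) u&= f \quad \text {in } \varOmega , \\ \partial _{\textbf{n}} u - \textrm{i}k u&= g \quad \text {on } \partial \varOmega . \end{aligned}$$
(3)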

Here, the parameter \(\varepsilon \in \mathbb {R}\) represents the imaginary shift with respect to the Helmholtz equation (2). For \(\varepsilon \ge 0\), \(f\in L^2( \varOmega )\), \(g \in L^2( \partial \varOmega )\), and \(k \in L^\infty (\varOmega )\), the standard variational formulation of (3) reads as follows: Find \(u \in H^1(\varOmega ,\mathbb {C})\) such that

$$\begin{aligned} a_\varepsilon (u,v) = F(v),\; \text { for all } v \in H^1(\varOmega ,\mathbb {C}). \end{aligned}$$
(4)

The space \(H^1(\varOmega ,\mathbb {C})\) consists of complex valued functions, whose real and imaginary parts are in the real valued space \(H^1(\varOmega )\). The sesquilinear form and the linear form of the variational formulation are given by:

$$\begin{aligned} a_\varepsilon (u,v)&= \int _{\varOmega }^{} \nabla u \cdot \nabla \overline{v} \,\textrm{d}\varvec{x} -\int _{\varOmega }^{} (k^2 + \textrm{i}\varepsilon ) u \overline{v} \,\textrm{d}\varvec{x} -\textrm{i}\int _{\partial \varOmega }^{} k\, u \overline{v} \,\textrm{d}\varvec{x}\\&= a(u,v) - m(u,v; k, \varepsilon ) -\textrm{i}b(u,v;k) \end{aligned}$$

and

$$\begin{aligned} F(v)&= \int _{\varOmega }^{} f \overline{v} \,\textrm{d}\varvec{x} + \int _{\partial \varOmega }^{} g \overline{v} \,\textrm{d}\varvec{x}. \end{aligned}$$

Note that the sesquilinear form for the original Helmholtz problem is given by \(a_0(\cdot ,\cdot )\). According to [33, Prop. 8.1.3] the variational formulation (4) is well-posed.

In order to solve this equation numerically, standard \(Q_p\)-elements, \(p \in \left\{ 1,2,3\right\} \), are used. Our assumptions on \(\varOmega \) guarantee that we can easily construct suitable meshes yielding an exact representation of the boundary \(\partial \varOmega \). The mesh size of the tensorial grid used is denoted by h. Further, let \(V_h\) be the finite dimensional subspace of \(H^1(\varOmega ,\mathbb {C})\) defined by conforming \(Q_p\)-elements. Using this notation, the discrete version of (4) has the following form: Find \(u_h \in V_h\) such that:

$$\begin{aligned} a_\varepsilon (u_h,v_h) = a(u_h,v_h) - m(u_h,v_h;k,\varepsilon ) -\textrm{i}b(u_h,v_h;k) = F(v_h) \text { for all } v_h \in V_h. \end{aligned}$$
(5)

As it has been shown in [19, Prop. 2.1], the discrete variational Helmholtz formulation has a unique solution. Taking \(u_h = \sum _{i=1}^N \textsf{u}_i \phi _i\), where \(\{\phi _i\}_{i=1}^N\) is a basis for \(V_h\), problem (5) induces the following matrix equation for the coefficient vector \(\textsf{u}= [\textsf{u}_1,\textsf{u}_2,\ldots ,\textsf{u}_N]^\textsf{T}\):

$$\begin{aligned} \textsf{K}\textsf{u}- \textsf{M}(k,\varepsilon ) \textsf{u}- \textrm{i}\textsf{B}(k) \textsf{u}= {\textsf{F}}, \end{aligned}$$
(6)

where \(\textsf{K}_{ij} = a(\phi _j,\phi _i)\), \(\textsf{M}_{ij}(k,\varepsilon ) = m(\phi _j,\phi _i;k,\varepsilon )\), \(\textsf{B}_{ij}(k) = b(\phi _j,\phi _i;k)\), and \(\textsf{F}_i = \int _{\varOmega }^{} f \overline{\phi _i} \,\textrm{d}\varvec{x} + \int _{\partial \varOmega }^{} g \overline{\phi _i} \,\textrm{d}\varvec{x}\). We define the system matrix as

$$\begin{aligned} \textsf{A}(k,\varepsilon ) = \textsf{K}- \textsf{M}(k,\varepsilon ) - \textrm{i}\textsf{B}(k). \end{aligned}$$
(7)

3 Local Fourier Analysis for the Twogrid Solver in Two Space Dimensions

In this section, we study the convergence behavior of the twogrid solver applied to the discrete equation (5). Only the two-dimensional problem and the \(Q_1\)-discretization are considered. With respect to the reference element \(\left( 0,1\right) ^2\), we use the following basis functions:

$$\begin{aligned} \varphi _1 \left( x,y \right)&= x \cdot y,\; \varphi _2 \left( x,y \right) = \left( 1-x\right) \cdot y, \\ \varphi _3 \left( x,y \right)&= \left( 1-y\right) \cdot x \text { and } \varphi _4 \left( x,y \right) = \left( 1-x\right) \cdot \left( 1-y\right) . \end{aligned}$$

Furthermore, we investigate only the local convergence of the twogrid solver, i.e., we neglect the boundary conditions and consider only the following matrix:

$$\begin{aligned} \textsf{A}(k,\varepsilon ) = \textsf{K}- (k^2 + \textrm{i}\varepsilon ) \textsf{M}, \end{aligned}$$
(8)

where \(\textsf{M}_{ij} = m(\phi _j,\phi _i;1,0)\). Of particular interest is the region of convergence of the twogrid solver. To analyze its convergence behavior, we use a standard technique referred to as Local Fourier Analysis (LFA) [7, 12, 27, 41, 43]. In this work, it is used to determine for which shifts \(\varepsilon \) the twogrid solver converges. We restrict ourselves to shifts of the form \(\varepsilon = k^\sigma \), where k is the wavenumber and \(\sigma \in \left[ 1,2\right] \) denotes an exponent. Therefore, we consider in the remainder of this work the system matrices \(\textsf{A}(k,k^\sigma ),\;\sigma \in \left[ 1,2\right] \). The matrices \(\textsf{A}(k,k^\sigma )\) in (8) can be represented in a simplified way by means of the stencil notation:

$$\begin{aligned} L_{h}( \sigma ) = \frac{1}{3} \left[ \begin{matrix} -1 &{} -1 &{} -1 \\ -1 &{} 8 &{} -1 \\ -1 &{} -1 &{} -1 \end{matrix} \right] -( k^2 + \textrm{i}k^\sigma ) \frac{h^2}{36} \left[ \begin{matrix} 1 &{} 4 &{} 1 \\ 4 &{} 16 &{} 4 \\ 1 &{} 4 &{} 1 \end{matrix} \right] . \end{aligned}$$

The first part of the stencil \(L_{h}( \sigma )\) corresponds to the standard stiffness matrix, while the second part represents the standard mass matrix. Let \(\textbf{G}_h\) be the grid given by

$$\begin{aligned} \textbf{G}_h = \left\{ \textbf{x} = ( x_1,x_2 ) = ( l_1h,l_2h ),\; \textbf{l} = ( l_1,l_2 ) \in \mathbb {Z}^2 \right\} , \end{aligned}$$

and V the index set of compact stencils

$$\begin{aligned} V = \left\{ \left. \kappa = (l_1,l_2) \right| l_1,\;l_2 \in \left\{ -1,0,1\right\} \right\} \subset \mathbb {Z}^2. \end{aligned}$$

For simplicity, we have assumed that the \(Q_1\)-discretization is based on a mesh consisting of squares with an edge length of h. A stencil S applied to a grid function \(w_h\) works as follows [41, Chapter 4]:

$$\begin{aligned} S w_h( \textbf{x} ) = \sum _{\kappa \in V} s_{\kappa } w_h( \textbf{x} +h\kappa ),\; \textbf{x} \in \textbf{G}_h. \end{aligned}$$
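To make the stencil notation concrete, the following minimal sketch builds the coefficients of \(L_h(\sigma )\) from the stencil given above and applies them to a grid function according to the rule just stated. The function names `helmholtz_stencil`, `apply_stencil`, and `symbol` are hypothetical and chosen for illustration only; the Fourier symbol \(\sum _{\kappa } s_\kappa e^{\textrm{i}\theta \cdot \kappa }\) is the standard LFA quantity and is included here as an assumption about how the stencil enters the analysis.

```python
import cmath


def helmholtz_stencil(k, sigma, h):
    """Coefficients s_kappa of L_h(sigma), keyed by kappa = (l1, l2)."""
    lam = k**2 + 1j * k**sigma  # lambda = k^2 + i k^sigma
    stencil = {}
    for l1 in (-1, 0, 1):
        for l2 in (-1, 0, 1):
            # Stiffness part: (1/3) * [-1 ... 8 ... -1]
            stiff = 8.0 / 3.0 if (l1, l2) == (0, 0) else -1.0 / 3.0
            # Mass part: 1D weights are 4 (centre) and 1 (neighbour),
            # the 2D weights are their products, divided by 36.
            mass = (4 - 3 * abs(l1)) * (4 - 3 * abs(l2)) / 36.0
            stencil[(l1, l2)] = stiff - lam * h**2 * mass
    return stencil


def apply_stencil(stencil, w, x, h):
    """Evaluate (S w_h)(x) = sum_kappa s_kappa * w_h(x + h * kappa)."""
    return sum(s * w((x[0] + h * l1, x[1] + h * l2))
               for (l1, l2), s in stencil.items())


def symbol(stencil, theta):
    """Fourier symbol sum_kappa s_kappa * exp(i * theta . kappa)."""
    return sum(s * cmath.exp(1j * (theta[0] * l1 + theta[1] * l2))
               for (l1, l2), s in stencil.items())
```

Applied to a Fourier mode \(e^{\textrm{i}\theta \cdot \textbf{x}/h}\), `apply_stencil` returns the mode multiplied by `symbol(stencil, theta)`; at \(\theta = (0,0)\) the stiffness part cancels and the symbol reduces to \(-(k^2 + \textrm{i}k^\sigma ) h^2\).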

The values \(s_\kappa \in \mathbb {C}\) are the coefficients of a stencil S. In a next step, we introduce the notation for the operator of the twogrid solver \(T_h^{2h}\) [9]. Applying \(T_h^{2h}\) to the error function \(e_h^l\) of the l-th iteration yields:

$$\begin{aligned} e_h^{l+1} = T_h^{2h} e_h^l,\; T_h^{2h} = S_h^{\nu _2} K_h^{2h} S_h^{\nu _1},\; K_h^{2h} = I_h - I_{2h}^h L_{2h}^{-1} I_h^{2h} L_{h}. \end{aligned}$$

Here, \(S_h\) denotes the smoothing operator. In our work, we use the damped \(\omega \)-Jacobi smoother, where \(\omega \) is the damping factor and \(\nu _j,\;j\in \left\{ 1,2 \right\} \), denote the numbers of pre- and postsmoothing steps. It is well known that the operator for the damped Jacobi method is given by

$$\begin{aligned} S_h( \sigma ) = I_h - \omega D_h^{-1}( \sigma ) L_h( \sigma ), \end{aligned}$$

where \(I_h\) is the identity operator, and \(D_h( \sigma )\) corresponds to the diagonal of the matrix in (8). A straightforward computation yields the following stencil for \(S_h( \sigma )\):

$$\begin{aligned} S_h( \sigma ) = \frac{\omega }{8 - \lambda h^2 \frac{4}{3}}\left[ \begin{matrix} 1 + \lambda h^2 \frac{1}{12} &{} &{} 1 + \lambda h^2 \frac{1}{3} &{} &{} 1 + \lambda h^2 \frac{1}{12} \\ &{} &{} &{} &{} \\ 1 + \lambda h^2 \frac{1}{3} &{} &{} \left( \frac{1}{\omega }-1 \right) \left( 8 - \lambda h^2 \frac{4}{3} \right) &{} &{}1 + \lambda h^2 \frac{1}{3} \\ &{} &{} &{}&{}\\ 1 + \lambda h^2 \frac{1}{12} &{} &{} 1 + \lambda h^2 \frac{1}{3} &{} &{} 1 + \lambda h^2 \frac{1}{12} \end{matrix} \right] , \end{aligned}$$

where we have used the abbreviation \(\lambda = k^2 + \textrm{i}k^\sigma \). Combining the smoothing operator with the correction operator \(K_h^{2h}\) results in the twogrid operator \(T_h^{2h}\). \(K_h^{2h}\) itself is given by a combination of \(L_{2h}\), \(I_h^{2h}\), and \(I_{2h}^h\). The operators

$$\begin{aligned} I_h^{2h}: \textbf{G}_h \rightarrow \textbf{G}_{2h} \text { and } I_{2h}^h: \textbf{G}_{2h} \rightarrow \textbf{G}_{h} \end{aligned}$$

stand for the restriction and prolongation where \(\textbf{G}_{2h}\) denotes the coarse grid.
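The closed-form stencil of the damped Jacobi smoother \(S_h( \sigma )\) stated above can be checked numerically against its definition \(I_h - \omega D_h^{-1} L_h\). The sketch below does this entry by entry; the function names are hypothetical and the parameter values in any call are arbitrary illustration values, not taken from the experiments.

```python
def jacobi_stencil_direct(lam, h, omega):
    """S_h = I_h - omega * D_h^{-1} L_h, computed directly from L_h."""
    def L(l1, l2):
        stiff = 8.0 / 3.0 if (l1, l2) == (0, 0) else -1.0 / 3.0
        mass = (4 - 3 * abs(l1)) * (4 - 3 * abs(l2)) / 36.0
        return stiff - lam * h**2 * mass

    diag = L(0, 0)  # stencil entry belonging to the matrix diagonal
    out = {}
    for l1 in (-1, 0, 1):
        for l2 in (-1, 0, 1):
            identity = 1.0 if (l1, l2) == (0, 0) else 0.0
            out[(l1, l2)] = identity - omega * L(l1, l2) / diag
    return out


def jacobi_stencil_closed(lam, h, omega):
    """Closed-form stencil of S_h with q = lambda * h^2."""
    q = lam * h**2
    denom = 8.0 - 4.0 * q / 3.0
    pref = omega / denom
    out = {}
    for l1 in (-1, 0, 1):
        for l2 in (-1, 0, 1):
            n = abs(l1) + abs(l2)
            if n == 0:      # centre entry
                out[(l1, l2)] = pref * (1.0 / omega - 1.0) * denom
            elif n == 1:    # edge neighbours
                out[(l1, l2)] = pref * (1.0 + q / 3.0)
            else:           # corner neighbours
                out[(l1, l2)] = pref * (1.0 + q / 12.0)
    return out
```

Both functions agree up to rounding, which confirms that the centre entry is simply \(1-\omega \) and that only the neighbour entries depend on \(\lambda h^2\).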

According to [41, Chapters 2 and 4], the stencils for the prolongation, restriction and identity operator are given by:

$$\begin{aligned} I_{2h}^h = \frac{1}{4} \left] \begin{matrix} 1 &{} 2 &{} 1 \\ 2 &{} 4 &{} 2 \\ 1 &{} 2 &{} 1 \end{matrix} \right[,\; I_h^{2h} = \frac{1}{4} \left[ \begin{matrix} 1 &{} 2 &{} 1 \\ 2 &{} 4 &{} 2 \\ 1 &{} 2 &{} 1 \end{matrix} \right] \text {, and } I_h = \left[ \begin{matrix} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 \end{matrix} \right] . \end{aligned}$$

\(L_{2h}\) in stencil notation has a similar shape as \(L_{h}\); the only difference is that h has to be replaced by 2h, and \(\textbf{x}\) is an element of the coarse grid \(\textbf{G}_{2h}\). The inverted brackets for the stencil of the prolongation operator \(I_{2h}^h\) indicate that this stencil has to be applied in a different way than the remaining stencils, see [41, Chapter 2] for more details on the notation. Applying \(I_{2h}^h\) to a coarse grid function \(w_{2h}\) yields a fine grid function \(w_{h}\) using the following rule [41, Chapter 2]:

$$\begin{aligned} w_{h}( \textbf{x} +h\kappa ) = (I_{2h}^h )_{\kappa } w_{2h}( \textbf{x} ),\;\textbf{x} \in \textbf{G}_{2h}, \end{aligned}$$

where \((I_{2h}^h )_{\kappa }\) is the entry of the stencil \(I_{2h}^h\) belonging to the index \(\kappa \in V\). By means of an LFA, we determine for the twogrid operator \(T_h^{2h}\) a local convergence factor \(\rho _{loc}\left( \sigma \right) \) depending on the exponent \(\sigma \). We follow the steps described in [12] and [41, Chapter 4]. For more detailed information, the interested reader is referred to “Appendix A”.
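The distribution rule for the prolongation stencil can be sketched as follows. The sketch assumes a finite square patch of the coarse grid stored as a nested list and assumes that contributions reaching the same fine grid point from neighbouring coarse points are summed, which is the usual reading of the inverted-bracket notation in [41, Chapter 2]. The function name `prolongate` is hypothetical.

```python
def prolongate(w2h):
    """Distribute an n x n coarse grid function to the (2n-1) x (2n-1) fine grid.

    Each coarse value w2h[i][j] is sent to the fine point (2i + l1, 2j + l2)
    with weight (I_{2h}^h)_kappa, kappa = (l1, l2); contributions are summed.
    """
    n = len(w2h)
    m = 2 * n - 1
    wh = [[0.0] * m for _ in range(m)]
    # Stencil (1/4) * [1 2 1; 2 4 2; 1 2 1] written as a product of 1D weights
    weights = {(l1, l2): (2 - abs(l1)) * (2 - abs(l2)) / 4.0
               for l1 in (-1, 0, 1) for l2 in (-1, 0, 1)}
    for i in range(n):
        for j in range(n):
            for (l1, l2), p in weights.items():
                a, b = 2 * i + l1, 2 * j + l2
                if 0 <= a < m and 0 <= b < m:  # stay inside the patch
                    wh[a][b] += p * w2h[i][j]
    return wh
```

On such a patch, the rule reproduces bilinear interpolation: coarse grid values are copied to coinciding fine grid points, and the remaining fine grid values are averages of their two or four coarse neighbours.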

Once calculated, the local convergence factor \(\rho _{loc}\left( \sigma \right) \) helps us to determine the minimal exponent \(\sigma _c\) for the shift \(\varepsilon \) separating the interval \(I_{\sigma } = \left[ 1,2\right] \) into two subsets \(I_{conv}\) and \(I_{div} = I_{\sigma } {\setminus } I_{conv}\), where \(I_{conv} = \left[ \sigma _c, 2 \right] \text { and } \sigma _c = \min \left\{ \sigma \in I_{\sigma } :\rho _{loc}( \sigma ) < 1 \right\} .\) In other words: \(I_{conv} \subset I_{\sigma }\) contains the exponents \(\sigma \) for which the twogrid method converges. Varying the smoothing steps \(\nu _1\) and \(\nu _2\) as well as the damping factor \(\omega \) for \(kh<0.75\) and a fixed mesh size h has only little impact on the graph of \(\sigma _c\). On the other hand, varying the mesh size h while keeping the remaining parameters fixed has a significant impact on \(\sigma _c\) (see Fig. 1). On closer examination, it can be observed that for a fixed wavenumber k, a smaller mesh size h enlarges the interval \(I_{conv}\). Reduced intervals \(I_{conv}\) arise for a fixed h and a sufficiently large k.

Fig. 1 Minimal exponent \(\sigma _c\) with respect to kh for different mesh sizes \(h \in \left\{ 2^{-4},\ldots ,2^{-10}\right\} \)

4 Semi Matrix-Free FGMRES with Twogrid Shifted Laplacian Preconditioner

In this section, we turn our attention to the twogrid method from the previous section being used as a preconditioner within a Krylov subspace method. To this end, we briefly describe the building blocks of the iterative solver used for the numerical experiments in the subsequent sections. For the outer iterations of our solver, we use the preconditioned FGMRES Krylov subspace method [36] to solve the system

$$\begin{aligned} \textsf{A}_0 \textsf{P}_\varepsilon ^{-1} (\textsf{P}_\varepsilon \textsf{u}) = \textsf{F}\end{aligned}$$

for a matrix \(\textsf{A}_0\), a preconditioner \(\textsf{P}_\varepsilon \), and a right-hand side \(\textsf{F}\). In our applications, the system matrix \(\textsf{A}_0\) is given by (7) without a complex shift, i.e., \(\textsf{A}_0 = \textsf{A}(k, 0)\) and the corresponding right-hand side \(\textsf{F}\) is as in (6). The preconditioner \(\textsf{P}_\varepsilon ^{-1}\) is a single twogrid \((\nu ,\nu )\)-cycle with \(\nu \in \mathbb {N}\) pre- and postsmoothing steps applied to the preconditioner system matrix of the shifted problem \(\textsf{A}_\varepsilon = \textsf{A}(k,\varepsilon )\).

Let \(V_1 \subset V_{2} = V_h\) be a given hierarchy of two discrete subspaces of \(V_h\). To each of these subspaces, we relate a preconditioner system matrix \(\textsf{A}_\varepsilon ^{(\ell )}\) stemming from the discretization of (5) with \(V_\ell \) for \(\ell = 1,2\). By \(\textsf{I}\), we denote the canonical interpolation or prolongation matrix, mapping a discrete function from \(V_1\) to \(V_2\). Moreover, we define a damped Jacobi smoother \(\textsf{S}_\varepsilon \) with damping factor \(\omega = \frac{2}{3}\). The Jacobi smoother may be divergent [38], but we choose it because of its straightforward matrix-free implementation in which only the inverse diagonal of the system matrix needs to be stored in an additional vector. Additionally, the multiplication by the inverse diagonal can be easily parallelized. Convergent smoothers like the damped Kaczmarz-like smoother [11] require the conjugate transpose of the system matrix, which is not available in a matrix-free method.

The application of \(\textsf{P}_\varepsilon ^{-1}\) is implemented by performing a twogrid \((\nu ,\nu )\)-cycle to approximately solve the system \(\textsf{P}_\varepsilon \textsf{u}= \textsf{f}\) for \(\textsf{u}\), given some right-hand side vector \(\textsf{f}\). This is achieved by calling the function cycle\((\textsf{u}, \textsf{f})\) given in Algorithm 1.

Algorithm 1 Twogrid \((\nu ,\nu )\)-cycle cycle\((\textsf{u}, \textsf{f})\)
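The structure of a twogrid \((\nu ,\nu )\)-cycle with damped Jacobi smoothing and a direct coarse solve can be illustrated by the following minimal sketch. It is not the MFEM implementation used in this work: it assumes a one-dimensional \(Q_1\) analogue on \((0,1)\) with homogeneous Dirichlet instead of impedance boundary conditions, a constant wavenumber, and the transposed prolongation as restriction. All function names are hypothetical.

```python
import math


def helmholtz_1d(n, h, lam):
    """Tridiagonal Q1 matrix K - lam * M on n interior nodes: (off, diag, n)."""
    return -1.0 / h - lam * h / 6.0, 2.0 / h - lam * 4.0 * h / 6.0, n


def matvec(A, x):
    """Matrix-free product using only the two stencil entries."""
    off, diag, n = A
    return [off * ((x[i - 1] if i > 0 else 0.0) + (x[i + 1] if i < n - 1 else 0.0))
            + diag * x[i] for i in range(n)]


def direct_solve(A, b):
    """Thomas algorithm, standing in for the stored coarse LU factorization."""
    off, diag, n = A
    c, d = [0j] * n, [0j] * n
    c[0], d[0] = off / diag, b[0] / diag
    for i in range(1, n):
        m = diag - off * c[i - 1]
        c[i], d[i] = off / m, (b[i] - off * d[i - 1]) / m
    x = [0j] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def prolongate(ec):
    """Linear interpolation from n_c coarse to 2 * n_c + 1 fine nodes."""
    ef = [0j] * (2 * len(ec) + 1)
    for j, v in enumerate(ec):
        ef[2 * j] += 0.5 * v
        ef[2 * j + 1] += v
        ef[2 * j + 2] += 0.5 * v
    return ef


def restrict(rf):
    """Transpose of the prolongation, applied to a fine grid residual."""
    return [0.5 * rf[2 * j] + rf[2 * j + 1] + 0.5 * rf[2 * j + 2]
            for j in range((len(rf) - 1) // 2)]


def smooth(A, u, f, nu, omega=2.0 / 3.0):
    """nu steps of damped Jacobi: u <- u + omega * D^{-1} (f - A u)."""
    diag = A[1]
    for _ in range(nu):
        u = [ui + omega * (fi - ai) / diag for ui, fi, ai in zip(u, f, matvec(A, u))]
    return u


def cycle(A_fine, A_coarse, u, f, nu=3):
    """One twogrid (nu, nu)-cycle for the shifted system A_fine u = f."""
    u = smooth(A_fine, u, f, nu)                             # presmoothing
    r = [fi - ai for fi, ai in zip(f, matvec(A_fine, u))]    # fine residual
    e = direct_solve(A_coarse, restrict(r))                  # coarse direct solve
    u = [ui + ei for ui, ei in zip(u, prolongate(e))]        # coarse grid correction
    return smooth(A_fine, u, f, nu)                          # postsmoothing


def norm(v):
    return math.sqrt(sum(abs(x) ** 2 for x in v))
```

A call such as `cycle(helmholtz_1d(31, h, lam), helmholtz_1d(15, 2 * h, lam), u, f)` with `lam = k**2 + 1j * k**sigma` performs one application of the preconditioner in this simplified setting; only the coarse matrix is solved directly, while the fine level is accessed through `matvec` alone.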

An important aspect of using this iterative method is that on the fine level \(\ell = 2\), the matrices \(\textsf{A}_\varepsilon ^{(2)}\), \(\textsf{S}_\varepsilon ^{(2)}\), and \(\textsf{I}\) never need to be formed explicitly. Only the result of their action on a vector is required. More precisely, the action of \(\textsf{A}_\varepsilon ^{(2)}\) is implemented in a matrix-free fashion by summing the results of the underlying matrices \(\textsf{K}\) and \(\textsf{M}(k,\varepsilon )\). Due to software limitations, the boundary matrix \(\textsf{B}(k)\) was not implemented in a matrix-free way, but is assembled instead. However, since the integration in the sesquilinear form \(b(\cdot ,\cdot ;k)\) is performed on the boundary, the number of non-zeros in \(\textsf{B}(k)\) is of lower order \(\mathcal {O}(h^{-d+1})\) compared to the other terms with \(\mathcal {O}(h^{-d})\) non-zeros. Furthermore, the system matrix \(\textsf{A}_\varepsilon ^{(1)}\) on the coarse level \(\ell = 1\) is assembled and its LU factorization is stored in memory. Because of this mix of matrix-free operations and stored matrices, we denote this method as semi matrix-free.

Using such a semi matrix-free approach effectively saves the memory required for storing the global sparse matrices on finer grids, which in turn allows solving problems for which the matrices and their factorizations do not fit in memory. The direct solve in line 4 of Algorithm 1 is only performed on the coarse grid and thus requires fewer floating point operations and less memory storage. Note that the smaller memory consumption is not the only advantage of a matrix-free approach, since it also significantly reduces the memory traffic. It can be shown that using matrix-free methods instead of matrix-based approaches is beneficial for the performance and may even outperform matrix–vector multiplications with stored matrices [13, 26].

In the remainder of this article, \(V_2\) is formed by uniformly subdividing each element in the mesh corresponding to \(V_1\). The mesh size corresponding to \(V_1\) is therefore 2h. Additionally, we do not consider any restarts in the FGMRES method.

5 Data Generation and Processing

In this section, we present our approach to obtain the near optimal complex shift exponents \(\hat{\sigma }\) depending on the wavenumber k and the discretization parameters \(\ell \in \mathbb {N}\) with \(h = 2^{-\ell }\) and \(p \in \{1,2,3\}\). The analysis in Sect. 3 justifies the existence of a complex shift exponent \(\sigma \) for which the twogrid method converges, but this exponent is not necessarily the optimal one for which the FGMRES method converges fastest. The ultimate goal is to construct a map which maps the input parameters to the near optimal shift exponent \(\hat{\sigma }\), i.e., \((k,\ell ,p) \mapsto \hat{\sigma }\). In order to compute this map, we consider a data-based approach in which we generate a set of samples \((k,\ell ,p,\hat{\sigma })\) that are used in a subsequent nonlinear regression step. The number of generated samples is listed in Table 1. We first generated samples for mesh sizes down to \(h = 2^{-7}\). After this initial sample generation, more samples were obtained for levels \(h < 2^{-7}\). Since the sample generation on finer meshes takes more time than on coarser meshes, fewer samples were collected for \(h < 2^{-7}\), which explains the smaller number of samples for these levels. However, this disparity in the number of samples per level is compensated for by our special choice of objective function. The following steps describe the process of generating a single data point:

  1. Choose an order p from the set \(\{1,2,3\}\).

  2. Choose a mesh size \(h = 2^{-\ell }\) by selecting an \(\ell \) from the set \(\{4,\ldots ,10\}\).

  3. Choose a wavenumber from the interval \([\frac{3p}{16h}, \frac{3p}{4h}]\).

  4. Generate a right-hand side vector \(\textsf{f}\) with components uniformly distributed in the interval \([-1,1]\).

  5. Find the optimal exponent \(\hat{\sigma } \in [1,2]\) of the complex shift \(k^{\hat{\sigma }}\) using the gradient-free golden-section search [25] with a tolerance of \(10^{-2}\). This involves repeatedly solving (2) on \(\varOmega = (0,s)^2\) with \(s = 1\) and \(g = 0\) using the parameters \((k,\ell ,p)\) and \(\textsf{f}\). The term optimal indicates that we are looking for exponents that minimize the number of outer iterations of the FGMRES method.

  6. Store the tuple \((k,\ell ,p,\hat{\sigma })\) and go to Step 1.
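Steps 1 to 3 and 5 above can be sketched as follows. The sketch is a generic golden-section search on \([1,2]\) and makes several assumptions: the objective passed to `golden_section` is a cheap stand-in for the convergence rate of the FGMRES solve, the parameters are drawn uniformly (the text does not specify the distribution), and the function names are hypothetical.

```python
import math
import random


def draw_parameters(rng):
    """Steps 1-3: draw an order p, a level ell, and a wavenumber k.

    Uniform sampling is an assumption made for this sketch.
    """
    p = rng.choice([1, 2, 3])
    ell = rng.randint(4, 10)          # h = 2^{-ell}, ell in {4, ..., 10}
    h = 2.0 ** (-ell)
    k = rng.uniform(3 * p / (16 * h), 3 * p / (4 * h))
    return k, ell, p


def golden_section(objective, a=1.0, b=2.0, tol=1e-2):
    """Step 5: gradient-free minimization of a unimodal objective on [a, b]."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0   # 1 / golden ratio
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    fc, fd = objective(c), objective(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; reuse c as the new right probe
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = objective(c)
        else:
            # Minimum lies in [c, b]; reuse d as the new left probe
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = objective(d)
    return 0.5 * (a + b)
```

In the actual data generation, each evaluation of the objective corresponds to one FGMRES solve for the given \((k,\ell ,p)\) and \(\textsf{f}\), so reusing one of the two probe points per iteration, as golden-section search does, directly saves one solve per step.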

Let \(r_0\) be the residual norm at the beginning of a solve in Step 5, i.e., \(r_0 = \Vert \textsf{A}_0\textsf{u}^{(0)} - \textsf{f}\Vert _2\). In this process, we solve until either a relative residual reduction of \(10^{-8}\) is obtained or a maximum of 50 iterations is reached. Let \(r_{\textrm{end}} = \Vert \textsf{A}_0\textsf{u}^{(\textrm{end})} - \textsf{f}\Vert _2\) be the final residual at termination after \(\textrm{end}\) iterations. The objective function for the minimization in Step 5 is then given by \(\rho = (\frac{r_{\textrm{end}}}{r_0})^{\frac{1}{\textrm{end}}}\). In the case without convergence after 50 iterations, the obtained sample is still included in the data, but the corresponding \(\rho \) may be close to one. Note that the actual residual norm and not the residual norm of the preconditioned system is used, since it reflects the error of the underlying physical problem more closely. Repeating the above process N times generates a set of training data \(\mathcal {D}= \{(k_i,\ell _i,p_i,\hat{\sigma }_i)\}_{i=1}^{N}\). Plots of the sample points for this particular case are presented in Fig. 2 for different values of p. The corresponding average convergence rates \(\rho \) for the near optimal complex shifts are illustrated in Fig. 3. The idea of this approach is to approximate this map by a smooth function which can be evaluated for any reasonable input \((k,\ell ,p)\). Although these data are obtained only for a constant wavenumber over the whole domain, we will use the approximated map to obtain optimal complex shifts for inhomogeneous wavenumbers in the numerical experiments in Sect. 6.
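As a small worked example of the objective \(\rho \), the following sketch evaluates the average convergence rate from the initial and final residual norms and the iteration count. The function name `average_rate` is hypothetical, and the numbers in the comment are illustration values rather than measured results.

```python
def average_rate(r0, r_end, n_iter):
    """Average residual reduction per iteration, rho = (r_end / r0)^(1 / n_iter)."""
    return (r_end / r0) ** (1.0 / n_iter)


# Illustration: reaching the relative tolerance 1e-8 in 8 iterations corresponds
# to an average reduction by a factor of ten per iteration.
rho = average_rate(1.0, 1e-8, 8)
```

A solve that stops at the iteration limit with little residual reduction yields a value of \(\rho \) close to one, which is how non-converged samples enter the minimization.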

Table 1 Number of samples that have been generated for the different polynomial degrees p and mesh sizes h

Fig. 2 Optimized shift exponents \(\hat{\sigma }\) for decreasing h and \(p \in \{1,2,3\}\) from left to right

Fig. 3 Average convergence rate \(\rho \) using the optimized exponential shift for decreasing h and \(p \in \{1,2,3\}\) from left to right

In this and the following section, we fix the number of smoothing steps \(\nu \) to 3, but ultimately the parameters \(\nu \) and \(\omega \) can also be included in the data generation in order to further augment the data set. However, \(\nu \) should be restricted to small values since the Jacobi smoother may be divergent. Let \(N_{\ell ,p} = |\{(k_i,\ell _i,p_i,\hat{\sigma }_i) \in \mathcal {D}:\ell _i = \ell ,\; p_i = p\}|\) and define the weighting coefficient as \(w_{\ell ,p} = \frac{N_{\ell ,p}}{|\mathcal {D}|}\). For a vector of coefficients \((\hat{k}_{c,0}, \hat{k}_{c,1}, \hat{\alpha }_0, \hat{\alpha }_1)^\textsf{T}\in \mathbb {R}^4\), let the approximated map \(\sigma _p\) be defined as

$$\begin{aligned} k_c(\ell )&= \hat{k}_{c,1} \cdot \exp (\hat{k}_{c,0} \cdot \ell ),\nonumber \\ \alpha (\ell )&= \hat{\alpha }_1 \cdot \exp (\hat{\alpha }_0 \cdot \ell ),\nonumber \\ \beta (k,\ell )&= 2 - \exp (-\alpha (\ell ) \cdot (k - k_c(\ell ))),\nonumber \\ \sigma _p(k,\ell )&= \min (\max (\beta (k,\ell ), 1), 2). \end{aligned}$$
(9)

The objective function of our regression for a fixed p is defined as

$$\begin{aligned} \textrm{loss}_p(\hat{k}_{c,0}, \hat{k}_{c,1}, \hat{\alpha }_0, \hat{\alpha }_1) = \sum _{(k_i,\ell _i,p_i,\hat{\sigma }_i) \in \mathcal {D}, p_i = p} w_{\ell _i,p}^{-2} \cdot (\hat{\sigma }_{i} - \sigma _p(k_i, \ell _i))^2. \end{aligned}$$

Each summand is multiplied by the inverse weighting coefficient squared in order to compensate for the disparity in the number of samples per level. For each \(p \in \{1,2,3\}\), we train a map \(\sigma _p\) with the data obtained in the previous step. This is achieved by optimizing the vector of coefficients \((\hat{k}_{c,0}, \hat{k}_{c,1}, \hat{\alpha }_0, \hat{\alpha }_1)^\textsf{T}\in \mathbb {R}^4\). For this purpose, we use PyTorch [34] together with the ADAM optimizer and a learning rate of \(10^{-3}\). The weights are initialized with \((0.1, 1.0, -0.5, 1.0)^\textsf{T}\), and the final weights after \(50\,000\) epochs are listed in Table 2 for each p. The approximated optimal shift maps together with the sampling points are illustrated in Fig. 4. These optimized values are used in the MFEM [4] solver code described in the next section.
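The map (9) and the weighted objective function can be evaluated as in the following sketch. It uses plain Python instead of PyTorch and does not perform the ADAM optimization; the coefficient vector in the usage line is the initial guess quoted above, not the trained weights from Table 2. The function names, the overflow guard, and the optional argument `n_total` for \(|\mathcal {D}|\) are assumptions made for this sketch.

```python
import math
from collections import Counter


def sigma_p(k, ell, coeffs):
    """Evaluate the map (9) for coefficients (k_c0, k_c1, alpha_0, alpha_1)."""
    kc0, kc1, a0, a1 = coeffs
    k_c = kc1 * math.exp(kc0 * ell)
    alpha = a1 * math.exp(a0 * ell)
    # Guard against overflow for k far below k_c (the result is clamped anyway)
    beta = 2.0 - math.exp(min(-alpha * (k - k_c), 700.0))
    return min(max(beta, 1.0), 2.0)


def weighted_loss(samples, coeffs, n_total=None):
    """Weighted least-squares objective for one fixed polynomial degree p.

    samples: list of (k, ell, sigma_hat) with p_i = p.
    n_total: size of the full data set |D|; defaults to len(samples).
    """
    n_total = len(samples) if n_total is None else n_total
    counts = Counter(ell for _, ell, _ in samples)   # N_{ell,p}
    loss = 0.0
    for k, ell, sigma_hat in samples:
        w = counts[ell] / n_total                    # w_{ell,p}
        loss += w ** -2 * (sigma_hat - sigma_p(k, ell, coeffs)) ** 2
    return loss


# Usage with the initial weights (placeholders, not the trained values):
initial = (0.1, 1.0, -0.5, 1.0)
exponent = sigma_p(20.0, 6, initial)
```

By construction, \(\sigma _p\) equals 1 for \(k \le k_c(\ell )\) and approaches 2 as k grows, so the clamped exponential interpolates between the two limiting shift orders \(\mathcal {O}(k)\) and \(\mathcal {O}(k^2)\).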

Table 2 Optimal weights in function \(\sigma _p\) for \(p \in \{1,2,3\}\)

Fig. 4 Sampling points in orange and approximated optimal shift map \(\sigma _p\) in blue for \(p \in \{1,2,3\}\) from left to right

This process of training data generation and learning of the parameters in (9) may be considered as offline computations which only need to be done once. Although we intend users to take our pretrained parameters from Table 2 and plug them directly into (9) to obtain the optimal shift exponent, we briefly assess these offline costs. Obtaining each training data sample requires solving the Helmholtz problem several times in order to find the optimal shift. We assume that generating a single sample for \(\ell = 10\) and \(p = 1\) requires solving the Helmholtz problem 5 times on average. On the machine used in Sect. 6, this requires in the worst case \(5 \cdot {3.5}\text { min} = {17.5}\text { min}\). Generating the 62 samples on the finest mesh for \(p=1\) therefore took about 18 h in total. However, since each sample can be generated independently, this process can be trivially parallelized if enough resources are available. The final learning process requires far fewer resources because of the relatively small number of training data and parameters. On a state-of-the-art laptop, the parameter identification required only about one minute using just the CPU.

Next, we compare the minimal exponents \(\sigma _c\) from Sect. 3 to the optimal shifts for the FGMRES solver. For this purpose, we gradually enlarge the computational domain in order to exclude the influence of the boundary conditions. By means of this comparison, we want to theoretically support the choice of the shift \(k^\sigma \) obtained by the optimization procedure described in the first part of this section.

From the LFA, we obtain the shift exponent for which it is guaranteed that the two-grid method applied to the shifted Laplacian problem converges, provided that the domain is very large and boundary effects do not play a role. This shift exponent, however, does not need to be the optimal one when used in conjunction with the outer FGMRES solver. Since our data generation provides us with near optimal shift exponents of the whole solver, we can compare them to the values from the LFA. We expect that, asymptotically, the near optimal shift exponents are larger than the LFA exponents, since the latter are only the minimal exponents for which convergence of the two-grid method is guaranteed.

The LFA yields results on an infinitely large domain; therefore, we need to make the near optimal shifts from the numerical experiments comparable. For this purpose, we perform two different systematic comparisons. In the first setup, we fix \(p=1\) and set \(h \in \{2^{-5}, 2^{-6}, 2^{-7}\}\). In the second setup, we consider growing square domains \(\varOmega _s = (0, s)^2\) for \(s \in \{2^i: i \in \{0,1,\ldots ,5\}\}\) and keep the mesh size fixed. For both setups, the number of degrees of freedom increases. More precisely, in Fig. 5 the number of degrees of freedom increases within each picture from the blue to the brown color as well as from the left to the right picture within each color group.
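To illustrate how the number of degrees of freedom grows in both setups, the following minimal sketch counts the vertices of a uniform grid on \(\varOmega _s\). It assumes a structured mesh whose \(p=1\) degrees of freedom coincide with the mesh vertices; the function name `num_nodes_p1` is hypothetical, and the meshes actually used in the experiments may differ.

```python
def num_nodes_p1(s, h):
    """Vertex count of a uniform grid with mesh size h on the square (0, s)^2.

    Assumption: structured, uniform mesh with one p = 1 degree of freedom
    per mesh vertex. This is an illustration, not the paper's mesh generator.
    """
    cells_per_side = round(s / h)
    return (cells_per_side + 1) ** 2


# Domain sizes s = 2^i, i = 0, ..., 5, and the three mesh sizes from the text.
domain_sizes = [2**i for i in range(6)]
mesh_sizes = [2.0**-5, 2.0**-6, 2.0**-7]

# For each fixed h, the count grows with s; for each fixed s, it grows as h shrinks.
counts = {h: [num_nodes_p1(s, h) for s in domain_sizes] for h in mesh_sizes}
```

Under these assumptions, the count grows roughly like \((s/h)^2\), so doubling s or halving h increases the number of unknowns by about a factor of four.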

We illustrate the results in Fig. 5 by plotting the required shift obtained from LFA alongside the near optimal shifts obtained by the data driven approach on different domains \(\varOmega _s\) and for different fixed mesh sizes h. We observe that the near optimal shift exponents are in fact larger than the required shift exponent if the domain is enlarged.

Fig. 5 Required shift exponent \(\sigma _c\) obtained from the LFA and numerically sampled near optimal shifts on different domains \(\varOmega _s\) for \(p=1\) and a fixed mesh size \(h \in \{2^{-5}, 2^{-6}, 2^{-7}\}\)

6 Numerical Results

In this section, we demonstrate the effectiveness of using the approximated near optimal shift exponents when applied to solving (2). For this purpose, we consider a set of synthetic and actual scenarios with heterogeneous wavenumbers. In particular, we solve (2) on \(\varOmega = (0,1)^d\), \(d \in \{2,3\}\), with a source term

$$\begin{aligned} f(\varvec{x}) = 2 \cdot \exp (-1000 \cdot \Vert \varvec{x}- \varvec{s}\Vert ^2_2), \end{aligned}$$

where the source location \(\varvec{s}\in \mathbb {R}^d\) is specified in each of the scenarios. The additional boundary term g is set to zero, i.e., \(g = 0\). In each of the following scenarios, we normalize the given velocity profile such that its values lie in the interval [0, 1] and denote this scaled profile by \(\mu :\varOmega \rightarrow [0,1]\). In the following, \(\mu \) is called the velocity profile. Moreover, we choose a \(k_{\textrm{max}} \in \mathbb {R}\) and define the final heterogeneous wavenumber profile as \(k(\varvec{x}) = k_{\textrm{max}} \cdot \mu (\varvec{x})\).
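The source term and the wavenumber profile defined above can be sketched in a few lines. This is a minimal illustration, not the MFEM driver code: the function names are hypothetical, the raw velocity values are made up, and the normalization by the maximum value is an assumption, since the text only states that the scaled values lie in [0, 1].

```python
import math


def source_term(x, s):
    """Gaussian source f(x) = 2 * exp(-1000 * ||x - s||_2^2)."""
    dist_sq = sum((xi - si) ** 2 for xi, si in zip(x, s))
    return 2.0 * math.exp(-1000.0 * dist_sq)


def normalize_profile(values):
    """Scale positive raw velocity values into (0, 1].

    Assumption: division by the maximum value; other normalizations
    onto [0, 1] are equally compatible with the text.
    """
    v_max = max(values)
    return [v / v_max for v in values]


def wavenumber_profile(mu_values, k_max):
    """Heterogeneous wavenumber k(x) = k_max * mu(x), evaluated pointwise."""
    return [k_max * mu for mu in mu_values]


# Hypothetical raw velocities at three sample points (illustration only).
mu = normalize_profile([1500.0, 3000.0, 4500.0])
k = wavenumber_profile(mu, k_max=450.0)

# The source attains its maximum value 2 at the source location s.
f_peak = source_term((0.5, 0.55), (0.5, 0.55))
```

With this scaling, the largest value of \(k(\varvec{x})\) equals \(k_{\textrm{max}}\), which matches the role of \(k_{\textrm{max}}\) as the maximum wavenumber in the scenarios below.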

The solver with the two-grid preconditioner described in Sect. 4 is implemented using the MFEM modular finite element library [4] with its support for multigrid operators. The operators on the fine grid are implemented in a matrix-free fashion by using the partial assembly approach in which only the values at quadrature points are stored. The operator applications reuse these precomputed data to compute the action of the operators on-the-fly. The coarse grid problem is solved by computing the LU decomposition with MUMPS [3] through the PETSc [5] interface within the first FGMRES iteration. Subsequent iterations reuse the factorization for an efficient inversion of the coarse grid matrix. The linear systems in each of the scenarios are solved without restarts and until a relative residual reduction of \(10^{-8}\) is achieved or the maximum number of 500 iterations is reached. The data and source code for the nonlinear regression in Sect. 5 as well as the MFEM driver source code are available in the Zenodo archive (see Footnote 1).

All run-time measurements for the 2D examples are obtained on a machine equipped with two Intel Xeon Gold 6136 processors with a nominal base frequency of 3.0 GHz. Each processor has 12 physical cores which results in a total of 24 physical cores. The total available memory of 251 GB is split into two NUMA domains, one for each socket. All available 24 physical cores are used for each computation.

The measurements for the 3D examples are conducted on the SuperMUC-NG system equipped with Skylake nodes. The following values were taken from [30]. Each node has two Intel Xeon Platinum 8174 processors with a nominal clock rate of 3.1 GHz. Each processor has 24 physical cores which results in 48 cores per node. Each core has a dedicated L1 (data) cache of size 32 kB and a dedicated L2 cache of size 1024 kB. Each of the two processors has an L3 cache of size 33 MB shared across all its cores. The total main memory of 94 GB is split into equal parts across two NUMA domains with one processor each. We use the native Intel 19.0 compiler together with the Intel 2019 MPI library.

In the following subsections, we consider scenarios with velocity profiles stemming from different sources: an artificial wedge example and three profiles from synthetic and non-synthetic geological cross-sections. Lastly, we consider a 3D scenario in which one of the 2D cross-sections is extruded to 3D. For each scenario, we compare the outer FGMRES iteration numbers for different values of the complex shift \(\varepsilon \in \{0, k, k^{\frac{3}{2}}, k^2, k^{\sigma _p}\}\). By \(k^{\sigma _p}\), we denote the case in which the optimal shift exponent map \(\sigma _p\) from (9) is used with the respective coefficients from Table 2. Note that for heterogeneous k, the shift exponent \(\sigma _p = \sigma _p(\varvec{x})\) depends on the spatial location \(\varvec{x}\). The speed-ups are calculated with respect to the case without a shift. In particular, let \(t_{ref}\) be the time-to-solution for \(\varepsilon = 0\) and \(t_{new}\) the time-to-solution for one of the methods with \(\varepsilon \ne 0\). The respective speed-up is then calculated as \(100\cdot (\frac{t_{ref}}{t_{new}} - 1)\%\).
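The speed-up measure and the set of compared shifts can be written down directly. The sketch below is illustrative only: the function names are hypothetical, the exponent `sigma` is passed in as a plain number because the map (9) is not reproduced here, and the timings in the example are made up rather than taken from the tables.

```python
def speedup_percent(t_ref, t_new):
    """Speed-up 100 * (t_ref / t_new - 1) in percent, relative to the unshifted run."""
    if t_ref <= 0.0 or t_new <= 0.0:
        raise ValueError("timings must be positive")
    return 100.0 * (t_ref / t_new - 1.0)


def candidate_shifts(k, sigma):
    """Shift magnitudes {0, k, k^(3/2), k^2, k^sigma} for a local wavenumber k.

    Assumption: sigma is the value of the trained exponent map at this point,
    supplied by the caller.
    """
    return [0.0, k, k**1.5, k**2, k**sigma]


# Made-up timings: a run that halves the time-to-solution has a speed-up of 100 %.
example = speedup_percent(t_ref=10.0, t_new=5.0)
```

With this definition, a value of 0% means no change, and negative values indicate that the shifted variant is slower than the run without a shift.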

6.1 Wedge Example

In the first scenario, we consider an artificial velocity profile \(\mu \) with three distinct values as illustrated in the left of Fig. 6. The maximum value of the source term is located at \(\varvec{s}= (0.5, 0.55)^\textsf{T}\). We collect the required number of FGMRES iterations and the respective compute times for \(p \in \{1,2,3\}\) and the maximum wavenumbers \(k_{\textrm{max}} \in \{450, 1100, 1800\}\) in Table 3. Additionally, in the center and right of Fig. 6, we present the real and imaginary parts of the solution in the case of \(p=2\). We observe that using the near optimal shift exponent results in the smallest number of iterations throughout all the considered cases. However, due to a faster LU factorization, using a shift \(k^{\frac{3}{2}}\) still results in a shorter compute time for \(p=2\) even though five more iterations are performed.

Fig. 6 Wedge example velocity profile with source location (left) and real and imaginary part of the solution in the case \(p=2\) (center and right)

Table 3 Parameter values, iteration numbers, and compute times for the 2D wedge example

6.2 Marmousi Model

In a second scenario, we consider the velocity profile stemming from the synthetic Marmousi model devised by the Institut Français du Pétrole [42]. The corresponding scaled velocity profile is illustrated in the left of Fig. 7. Here, the maximum value of the source term is located at \(\varvec{s}= (0.5421, 0.8946)^\textsf{T}\). We collect the required number of FGMRES iterations and the respective compute times for \(p \in \{1,2,3\}\) and the maximum wavenumbers \(k_{\textrm{max}} \in \{600, 800, 1250, 1900\}\) in Table 4. Additionally, in the center and right of Fig. 7, we present the real and imaginary parts of the solution in the case of \(p=2\). We observe that using the near optimal shift exponent results in the smallest number of iterations for large wavenumbers. For a smaller wavenumber \(k_{\textrm{max}} = 800\) and \(p=2\), the trained shift still yields the minimal number of required iterations when compared to the other shifts. This means that using the trained shift does not worsen the performance if the wavenumber is not in the critical regime. Using a shift \(k^{\frac{3}{2}}\) results in a slightly shorter compute time for \(p=1\) even though eight more iterations are performed, since the LU factorization in the near optimal shift case required more time. This cannot be observed for the higher order cases with \(p > 1\).

Fig. 7 Marmousi example velocity profile with source location (left) and real and imaginary part of the solution in the case \(p=2\) (center and right)

Table 4 Parameter values, iteration numbers, and compute times for the 2D Marmousi example

6.3 Migration from Topography

In a third scenario, we consider the velocity profile stemming from the migration from topography model representing a cross section through the foothills of the Canadian Rockies; courtesy of Amoco and BP [22]. The corresponding scaled velocity profile is illustrated in the left of Fig. 8. Here, the maximum value of the source term is located at \(\varvec{s}= (0.2159, 0.6054)^\textsf{T}\). We collect the required number of FGMRES iterations and the respective compute times for \(p \in \{1,2,3\}\) and the maximum wavenumbers \(k_{\textrm{max}} \in \{500, 1100, 1900\}\) in Table 5. Additionally, in the center and right of Fig. 8, we present the real and imaginary parts of the solution in the case of \(p=2\). We observe that using the optimal shift exponent results in the smallest number of iterations and the shortest compute time throughout all the considered examples. The choice \(k^{\frac{3}{2}}\) yields similarly small iteration numbers as using no shift or a shift of k, but using the trained shift yields the fastest solution without requiring a manual choice of the shift.

Fig. 8 Migration from topography velocity profile with source location (left) and real and imaginary part of the solution in the case \(p=2\) (center and right)

Table 5 Parameter values, iteration numbers, and compute times for the 2D migration from topography example

6.4 BP Statics Benchmark Model

In a fourth scenario, we consider the velocity profile stemming from the synthetic BP statics benchmark model created by Mike O’Brien and Carl Regone, courtesy of Amoco and BP [1]. The corresponding scaled velocity profile is illustrated in the left of Fig. 9. Here, the maximum value of the source term is located at \(\varvec{s}= (0.4368, 0.6852)^\textsf{T}\). We collect the required number of FGMRES iterations and the respective compute times for \(p \in \{1,2,3\}\) and the maximum wavenumbers \(k_{\textrm{max}} \in \{450, 1100, 1900\}\) in Table 6. Additionally, in the center and right of Fig. 9, we present the real and imaginary parts of the solution in the case of \(p=2\). We observe that using the optimal shift exponent results in the smallest number of iterations and the shortest compute time throughout all the considered examples. Notably, for \(p=2\) and \(p=3\), the choice of \(k^{\frac{3}{2}}\) results in a larger number of required iterations and a longer compute time than the case without a shift.

Fig. 9 BP statics benchmark model velocity profile with source location (left) and real and imaginary part of the solution in the case \(p=2\) (center and right)

Table 6 Parameter values, iteration numbers, and compute times for the 2D BP statics benchmark model example

6.5 Marmousi 3D Model

In the last scenario, we consider again the velocity profile stemming from the synthetic Marmousi model devised by the Institut Français du Pétrole [42], now extruded along a third dimension. The corresponding scaled velocity profile is illustrated in the left of Fig. 10. Here, the maximum value of the source term is located at \(\varvec{s}= (0.5421, 0.8946, 0.5)^\textsf{T}\). We collect the required number of FGMRES iterations and the respective compute times for \(p \in \{1,2\}\) and the maximum wavenumber \(k_{\textrm{max}} = 150\) in Table 7. Additionally, in the center and right of Fig. 10, we present the real and imaginary parts of the solution for \(p=1\). The timings were obtained on the SuperMUC-NG cluster described above by using 384 cores across 16 compute nodes, i.e., 24 cores per compute node. We observe that for \(p=1\), the near optimal shift yields the smallest number of outer FGMRES iterations, but as in the 2D Marmousi case, using a shift of \(k^{\frac{3}{2}}\) is faster even though six more iterations are performed. For \(p=2\), the number of iterations using the near optimal shift was larger by one compared to the smallest number of iterations obtained by using no shift or a shift by k, but the compute time was still smaller. These discrepancies in the run-time and number of iterations may be explained by the time deviations of the parallel LU decomposition performed in the first application of the preconditioner.

Fig. 10 Marmousi 3D model velocity profile (left) and isosurfaces at value 0 of the real and imaginary part of the solution in the case \(p=1\) (center and right)

Table 7 Parameter values, iteration numbers, and compute times for the 3D Marmousi model example with \(p=1\)

7 Conclusion

In this work, we have presented a preconditioner for the Helmholtz equation obtained from a data driven approach. The preconditioner uses near optimal complex shifts in the shifted Laplacian problem, which is used as a preconditioner of the Helmholtz equation by applying a two-grid V-cycle to the discrete problem. The near optimal shifts were obtained by generating training data for different mesh sizes h, wavenumbers k, and discretization orders p, and subsequently performing a nonlinear regression to construct a near optimal shift map. Using such an approximated optimal shift map allows users to obtain near optimal shifts automatically without having to tune the required complex shifts manually. In order to solidify this approach, we have performed theoretical considerations based on a local Fourier analysis which justify this data driven approach, and we have related the theoretical results to experimental data. Additionally, the two-grid method has been implemented in a semi matrix-free fashion, which saves the memory storage and traffic that are usually required for matrices corresponding to the finer grids. Furthermore, we have used these near optimal shifts on a set of numerical benchmarks with heterogeneous wavenumbers in 2D and 3D. We observed that using these near optimal shifts yielded the smallest FGMRES iteration numbers throughout almost all the examples, with speed-ups of up to 582%.

In the data generation and the numerical experiments, we have restricted ourselves to a single V(3, 3) two-grid cycle and a damped Jacobi smoother with damping factor \(\omega = \frac{2}{3}\). Moreover, the complex shift was always of the form \(\textrm{i}k^\sigma \). Possible further work could take into account more levels of subspaces, resulting in a multigrid solver for which, e.g., the number of smoothing steps per level may be optimized. Likewise, wavenumber coefficients of the form \(\beta _1 k^{\sigma _1} + \textrm{i}\beta _2 k^{\sigma _2}\) with \(\beta _1, \beta _2, \sigma _1, \sigma _2 \in \mathbb {R}\) are of interest as well.