
1 Introduction

Full-waveform inversion (FWI) is a powerful method based on nonlinear optimization techniques in the area of seismic imaging. FWI was proposed by [1,2,3] back in the early 1980s for reconstructing high-resolution images of the subsurface structure from local measurements of the seismic wavefield by minimizing the distance between the predicted and the recorded data [4,5,6]. Since then, many numerical studies have been carried out and new algorithmic implementations have been developed [7, 8].

In this study, we investigate two algorithms, Gauss-Newton and L-BFGS, for solving frequency-domain FWI as proposed in [7]. We compare these algorithms in terms of their robustness and speed of convergence on a realistic synthetic model with a marine exploration seismic setting. We also implement Tikhonov regularization to assist convergence.

2 Problem Formulation

We formulate the FWI problem in the frequency domain as proposed by Pratt [7]. Consider the slowness-squared model parameters \(\mathbf{m} \in \mathbb {R}^{n_{grid}}\) and the measurement vector \(\mathbf{d} \in \mathbb {C}^{n_{data}}\), which are related through a known but nonlinear relationship denoted as

$$\begin{aligned} \mathbf{d} = F(\mathbf{m} ) + \epsilon , \end{aligned}$$
(1)

where \(\epsilon \sim \mathcal {N}(0,\mathbf{C} _{D})\) is additive, normally distributed noise with zero mean and covariance \(\mathbf{C} _{D} \in \mathbb {C}^{n_{data} \times n_{data}}\).

The nonlinear forward modeling map \(F(\mathbf{m})\) can be described as

$$\begin{aligned} F(\mathbf{m}) = \mathbf{P}\mathbf{A}(\mathbf{m})^{-1}\mathbf{q}, \end{aligned}$$
(2)

where \(\mathbf{q} \in \mathbb {C}^{n_{grid}}\) is the discretized source term, which is considered known. The operator \(\mathbf{A}(\mathbf{m}) \in \mathbb {C}^{n_{grid} \times n_{grid}}\) represents the discretized Helmholtz operator \((\nabla ^{2} + \omega ^{2}\mathbf{m})\), where \(\omega = 2\pi f\) is the angular frequency. The operator \(\mathbf{P} \in \mathbb {R}^{n_{data} \times n_{grid}}\) denotes the sampling operator, which samples the data \(\mathbf{d}\) from the field vector \(\mathbf{u}\), the solution of the Helmholtz equation \(\mathbf{u} = \mathbf{A}(\mathbf{m})^{-1}\mathbf{q}\).
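To make the forward map concrete, the following is a minimal sketch (not the authors' implementation) of \(F(\mathbf{m})\) for a single frequency, assuming a 2-D five-point finite-difference Laplacian with Dirichlet boundaries; absorbing boundary conditions are omitted for brevity, and all names and sizes are illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def laplacian_2d(nz, nx, h):
    """Five-point Laplacian on an nz-by-nx grid with Dirichlet boundaries."""
    lap1d = lambda n: sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n))
    return (sp.kron(lap1d(nz), sp.identity(nx))
            + sp.kron(sp.identity(nz), lap1d(nx))) / h**2

def helmholtz(m, omega, h, nz, nx):
    """Discretized Helmholtz operator A(m) = Laplacian + omega^2 diag(m)."""
    return sp.csc_matrix(laplacian_2d(nz, nx, h) + omega**2 * sp.diags(m),
                         dtype=complex)

def forward(m, omega, h, nz, nx, q, P):
    """F(m) = P A(m)^{-1} q: solve the Helmholtz system, sample at receivers."""
    u = spla.spsolve(helmholtz(m, omega, h, nz, nx), q)
    return P @ u
```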

By choosing the matrix \(\mathbf {L}\) as the first-order finite-difference operator, commonly referred to as a roughening matrix, we can define the least-squares misfit function with Tikhonov regularization as

$$\begin{aligned} V(\mathbf {m}) = \frac{1}{2}\Big |\Big | F(\mathbf {m}) - \mathbf {d}\Big |\Big |^{2}_{2} + \frac{\alpha }{2}\Big |\Big | \mathbf {L}\mathbf {m}\Big |\Big |^{2}_{2}, \end{aligned}$$
(3)

where \(\alpha \) is the regularization coefficient. The optimal model \(\mathbf{m}\) can be sought by minimizing the misfit function \(V(\mathbf {m})\) in Eq. 3. The resulting optimization problem is typically solved using a gradient-based method, which generates iterates of the form

$$\begin{aligned} \mathbf {m}_{k+1} = \mathbf {m}_k - B_k \nabla V(\mathbf {m}_k), \end{aligned}$$
(4)

where \(B_k\) includes appropriate scaling/smoothing of the gradient. In this study, the matrix \(B_k\) is either the inverse of the Gauss-Newton approximation of the Hessian or the L-BFGS approximation of the inverse Hessian, as explained in detail in the following sections. The gradient of the misfit function can be computed through the adjoint-state method [9], and its explicit formula reads

$$\begin{aligned} \nabla V(\mathbf {m}) = \mathbf {J}^{T}(F(\mathbf {m}) - \mathbf {d}) + \alpha (\mathbf {L}^{T}\mathbf {L})\mathbf {m}, \end{aligned}$$
(5)

with \(\mathbf {J}\) the Jacobian of \(F(\mathbf {m})\).
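As an illustration, here is a sketch of Eqs. 3 and 5 for a single frequency, reusing the hypothetical helpers above; `Lreg` plays the role of the roughening matrix \(\mathbf{L}\), and the data term of the gradient is obtained with one extra adjoint Helmholtz solve, as in the adjoint-state method.

```python
def misfit_and_gradient(m, omega, h, nz, nx, q, P, d, Lreg, alpha):
    """Tikhonov-regularized misfit V(m) (Eq. 3) and its gradient (Eq. 5)."""
    A = helmholtz(m, omega, h, nz, nx)
    u = spla.spsolve(A, q)                    # forward field u = A^{-1} q
    r = P @ u - d                             # data residual F(m) - d
    V = 0.5 * np.linalg.norm(r)**2 + 0.5 * alpha * np.linalg.norm(Lreg @ m)**2
    v = spla.spsolve(sp.csc_matrix(A.conj().T), P.T @ r)   # adjoint field
    # J^T r = -omega^2 conj(u) * v; the real part is the model-space gradient
    g = np.real(-omega**2 * np.conj(u) * v) + alpha * (Lreg.T @ (Lreg @ m))
    return V, g
```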

3 Gauss-Newton Method

The Gauss-Newton method is derived from Newton's method for solving nonlinear optimization problems. The issue with Newton's method in nonlinear optimization, and especially in FWI, is the computation of the full Hessian. In Eq. 4, Newton's method takes \(B_k\) as the inverse of the Hessian, which has two terms and can be presented as

$$\begin{aligned} \mathbf {H} = \mathbf {J}^{T}\mathbf {J} + \frac{\partial \mathbf {J}}{\partial \mathbf {m}}(F(\mathbf {m}) - \mathbf {d}). \end{aligned}$$
(6)

Commonly, the computation of the second term is avoided because of its tedious calculation, and because it should in any case be small under the assumption that the problem is approximately linear, which, in practice, means that the starting model is sufficiently close to the true model. This is where the Gauss-Newton method is derived from: the difference between the Newton and Gauss-Newton methods is the neglect of the second term in the Hessian computation. Based on [7, 10], we can safely drop the second term in Eq. 6 because its value is small; it is only important if changes in the parameters cause a change in the partial derivative of the Helmholtz equation's solution.

The Gauss-Newton method and its approximation of the Hessian can be presented as

$$\begin{aligned} \mathbf {m}_{k+1} = \mathbf {m}_k - \mathbf {H}_{GN}^{-1}\nabla V(\mathbf {m}_k), \end{aligned}$$
(7)
$$\begin{aligned} \mathbf {H}_{GN} = \mathbf {J}^{T}\mathbf {J}, \end{aligned}$$
(8)

where the matrix \(\mathbf {H}_{GN}\) is assumed to have full rank, and is thus invertible. See [11] for more details regarding this algorithm.
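For a toy problem where the Jacobian fits in memory, one Gauss-Newton update might be sketched as follows, reusing the hypothetical helpers above; the Tikhonov term \(\alpha \mathbf{L}^{T}\mathbf{L}\) is also added to \(\mathbf{H}_{GN}\) here (a common choice, consistent with the regularized gradient in Eq. 5, though not part of Eq. 8), and Sect. 5 describes the matrix-free variant used in practice.

```python
def gauss_newton_step(m, omega, h, nz, nx, q, P, d, Lreg, alpha):
    """One update m_{k+1} = m_k - H_GN^{-1} grad V (Eqs. 7-8), dense toy version."""
    A = helmholtz(m, omega, h, nz, nx)
    u = spla.spsolve(A, q)
    # Build J = -omega^2 P A^{-1} diag(u): one adjoint solve per receiver.
    W = spla.splu(sp.csc_matrix(A.conj().T)).solve(
        np.asarray(P.T.todense(), dtype=complex))      # n_grid x n_data
    J = -omega**2 * W.conj().T * u                     # n_data x n_grid
    _, g = misfit_and_gradient(m, omega, h, nz, nx, q, P, d, Lreg, alpha)
    H = np.real(J.conj().T @ J) + alpha * (Lreg.T @ Lreg).toarray()
    return m - np.linalg.solve(H, g)
```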

4 L-BFGS Method

The limited-memory BFGS method (L-BFGS) is a quite successful modification of the quasi-Newton methods [11, 12]. In this method, no Hessian approximation is ever actually formed; rather, a collection of the last several \((s_{k},y_{k})\) pairs is stored and used to compute the step. Let m, the memory size, be the number of \((s, y)\) pairs stored. Then, given an initial matrix \(H_{0}\), the matrix \(H_{k}\) can be defined as follows:

[Algorithm a: recursive definition of the L-BFGS inverse Hessian approximation \(H_{k}\) from the stored \((s_{i},y_{i})\) pairs; see [11, 12].]

The notation is simplified by eliminating the iteration counter k and choosing to store the most recent value of s, that is, \(s_{k-1}\), in \(s_{m-1}\) and the oldest value, \(s_{k-m}\), in \(s_{0}\). The vectors \(y_{i}\), \(i = 0,\ldots ,m-1\), are stored similarly. With these values, it can be shown that the search direction in Eq. 4 can be represented as

$$\begin{aligned} B_k \nabla V(\mathbf {m}_k) = H_k \nabla V(\mathbf {m}_k), \end{aligned}$$
(9)

where the matrix \(H_k\) is the L-BFGS approximation to the inverse Hessian and can be computed through the algorithm presented above.
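In practice, this computation is typically realized as the standard two-loop recursion [11]; a minimal sketch (with illustrative names: `S` and `Y` hold the stored pairs oldest-first, and \(H_{0} = \gamma I\) uses the usual scaling \(\gamma = s^{T}y / y^{T}y\)) is:

```python
def lbfgs_direction(grad, S, Y):
    """Two-loop recursion: returns H_k @ grad without ever forming H_k."""
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(S, Y)]
    alphas = []
    for s, y, rho in reversed(list(zip(S, Y, rhos))):   # newest to oldest
        a = rho * np.dot(s, q)
        alphas.append(a)
        q -= a * y
    gamma = np.dot(S[-1], Y[-1]) / np.dot(Y[-1], Y[-1]) if S else 1.0
    r = gamma * q                                       # apply H_0 = gamma * I
    for (s, y, rho), a in zip(zip(S, Y, rhos), reversed(alphas)):  # oldest first
        b = rho * np.dot(y, r)
        r += (a - b) * s
    return r        # search direction: m_{k+1} = m_k - lbfgs_direction(grad, S, Y)
```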

5 Numerical Examples

In these numerical examples, we illustrate the performance of the Gauss-Newton and L-BFGS algorithms by solving the frequency-domain FWI problem. We solve two FWI problems with two different velocity models, with the objective of comparing the two algorithms in reconstructing the velocity models from the recorded data.

For the first numerical example, we use a homogeneous velocity model with an inclusion in the centre, which acts as a reflector, depicted in Fig. 1a. A standard finite-difference method is used to solve the Helmholtz equation. The grid size is \(100 \times 100\), and the grid spacing is \(10 \times 10\) m. In this numerical example we consider a collocated sources-receivers setting, with sources and receivers located every 20 m. We use frequencies from 5 to 25 Hz with a frequency sampling of 3.33 Hz.

Fig. 1. Numerical example 1: reconstruction of the homogeneous velocity model with an inclusion in the centre.

Fig. 2. Numerical example 1: misfit and norm of gradient values at each iteration.

In the second numerical example, we use the Marmousi model, depicted in Fig. 3a, to perform the numerical studies. A standard finite-difference method is used to solve the Helmholtz equation. The grid size is \(61 \times 220\), and the grid spacing is \(50 \times 50\) m. 50 shots every 100 m and 100 receivers every 50 m are used in this numerical example. This sources-receivers setting resembles a marine exploration seismic setting. We use frequencies from 0.5 Hz to 3.95 Hz with a frequency sampling of 0.5 Hz.

Fig. 3. Numerical example 2: reconstruction of the Marmousi model.

Fig. 4. Numerical example 2: misfit and norm of gradient values at each iteration.

For both numerical examples, we performed 100 Gauss-Newton and L-BFGS iterations each, starting from the initial models depicted in Figs. 1b and 3b, respectively, to obtain the optimal model \(\mathbf {m}\) shown in the bottom rows of Figs. 1 and 3. As regularization, we use the Tikhonov regularization method with the regularization operator \(\mathbf {L}\) as a first-order derivative operator and the regularization parameter \(\alpha \) equal to 0.01.

In practice, the Hessian is not stored explicitly in memory and only its matrix-vector products are computed. Thus, for the Gauss-Newton iterations, we solve a system of linear equations at each iteration using the preconditioned conjugate gradient (PCG) method to estimate the descent direction.
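A sketch of this matrix-free step, under the same assumptions and with the same hypothetical helpers as the earlier snippets: each application of the regularized Gauss-Newton Hessian \(\mathbf{J}^{T}\mathbf{J} + \alpha \mathbf{L}^{T}\mathbf{L}\) costs two extra Helmholtz solves with a reused factorization of \(\mathbf{A}(\mathbf{m})\), and plain CG stands in for PCG since the preconditioner is implementation-specific.

```python
def gn_hessian_operator(m, omega, h, nz, nx, q, P, Lreg, alpha):
    """Matrix-free (J^T J + alpha L^T L) as a scipy LinearOperator."""
    A = helmholtz(m, omega, h, nz, nx)
    lu = spla.splu(A)                            # factorization reused below
    luH = spla.splu(sp.csc_matrix(A.conj().T))
    u = lu.solve(q.astype(complex))
    def hvp(v):
        Jv = -omega**2 * (P @ lu.solve(u * v))                      # J v
        JtJv = np.real(-omega**2 * np.conj(u) * luH.solve(P.T @ Jv))
        return JtJv + alpha * (Lreg.T @ (Lreg @ v))                 # + Tikhonov
    return spla.LinearOperator((m.size, m.size), matvec=hvp)

# One Gauss-Newton iteration: solve H dm = -grad with (unpreconditioned) CG.
H = gn_hessian_operator(m, omega, h, nz, nx, q, P, Lreg, alpha)
dm, _ = spla.cg(H, -grad, maxiter=20)
m = m + dm
```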

6 Discussions

Based on the two numerical results, both algorithms perform well, showing good convergence of the misfit values and of the \(l_{2}\)-norm of the misfit function gradient, as illustrated in Figs. 2 and 4, respectively. As we can observe, the misfit values of L-BFGS are lower than those of the Gauss-Newton algorithm, yet the \(l_{2}\)-norm of the misfit function gradient for the Gauss-Newton algorithm is lower than that of the L-BFGS algorithm. In practice, we should consider the \(l_{2}\)-norm of the misfit function gradient, as it indicates how close the solution is to an optimum: the true solution is obtained when the misfit function gradient equals zero, or, in practice, when the \(l_{2}\)-norm of the misfit function gradient is close to zero. By this criterion, the Gauss-Newton algorithm performs better than L-BFGS because of its lower \(l_{2}\)-norm of the misfit function gradient.

Here we should also discuss the feasibility of each algorithm. The Gauss-Newton algorithm needs the matrix-vector product between the inverse of its approximated Hessian and the gradient at each iteration in order to obtain the descent direction. This computation is computationally intensive, so it takes a longer time per iteration to solve the optimization problem. Meanwhile, in the L-BFGS algorithm no Hessian approximation is ever actually formed; rather, a collection of the last several \((s_{k},y_{k})\) pairs is stored and used to compute the step. This makes the L-BFGS algorithm computationally efficient compared to the Gauss-Newton algorithm.

7 Conclusion

In conclusion, both algorithms, L-BFGS and Gauss-Newton, are comparable in terms of performance. The Gauss-Newton algorithm gives a better result in the sense of convergence of the \(l_{2}\)-norm of the misfit function gradient, yet it is computationally intensive. Meanwhile, the performance of L-BFGS is comparable to that of Gauss-Newton, and in terms of computational efficiency and feasibility, L-BFGS outperforms Gauss-Newton for large-scale optimization problems, especially in FWI.