1 Introduction

Vibrations generated by earthquakes, explosions or similar phenomena and propagated within the Earth or along its surface can yield information about the Earth and its subsurface structure. Such a knowledge, called Earth imaging, is of major interest for economy, environment, and science. Geologists have developed several methods for Earth imaging using seismic wave information. Acoustic full-waveform inversion (FWI) is one of such procedures and it attempts to derive high-resolution quantitative models of the subsurface using the full information of acoustic waves (Virieux and Operto 2009). Following (Tarantola 2005), a description of the problem can be given as follows. During the propagation, waves interfere with the environment and the total wavefield is recorded through a certain number of receivers (called hydrophones or geophones). Since the waves are affected by the physical properties of the subsurface, they are carrying information about the environment that can be retrieved by an inversion process. The propagation waves are generated by sources situated in the domain of consideration (see Fig. 1 for a simple illustration).

Fig. 1
figure 1

Acoustic waves propagated by a source are reflected by a reflective layer (in white) and are detected by the geophones. The reflective layer represents a salt dome in this example

For many years acoustic full-waveform inversion has been almost exclusively employed by academic researchers. Only recently it has been adopted by practitioners in industry. Nevertheless, its computational cost is still large compared to other existing methods in seismic exploration. The attractiveness of the approach is the promise of deriving high-fidelity Earth models for seismic imaging. Since our ability to both understand and manage complex nonlinear inversions has improved and since the available computing power has grown at the same time, full-waveform inversion has become more and more practical.

It is known that the acoustic full-waveform inversion, formulated as a nonlinear optimization least-squares minimization problem, can be efficiently solved if the starting propagation velocity model is accurate enough [in the sense of explaining the data at a low frequency and still being a smooth version of the true velocity model, see Brossier (2009), Virieux and Operto (2009)]. Otherwise the inversion procedure suffers from stalled convergence to spurious local minima due to the oscillatory nature of the data (Mulder and Plessix 2008). Thus, a crucial step related to full-waveform inversion in seismic imaging consists of finding a good starting model (or point, in an optimization context) without the need of sophisticated a priori knowledge on the velocity model. In industry, first- arrival travel-time tomography (Nolet 1987) is the most popular method to generate an accurate initial propagation velocity model. More recent methods such as stereo tomography (Lambaré 2008) and inversion in the Laplace domain (Shin and Cha 2008) are being investigated in academia. It must be also mentioned here that the use of multiscale strategies can mitigate the nonlinearity and reduce the dependence on the starting velocity model for FWI (Pratt and Worthington 1990; Sirgue and Pratt 2004).

In this paper we propose a novel approach to find an initial smooth velocity model for the full-waveform inversion problem without any a priori physical knowledge. We are motivated by the recent availability of massively parallel computing platforms [see Rauber and Rünger (2013)]. First we introduce a new parametrization of the problem to reduce the number of parameters needed to describe a velocity model and therefore the objective function of our optimization problem. Then, we show how to adapt an evolution strategy (ES) to take advantage of such a model or space reduction. ES’s are a class of evolutionary algorithms designed for searching the global minimum of a function in a continuous space. We are motivated by the modifications in ES’s recently proposed in (Diouane et al. 2015) to ensure some form of rigorous convergence and a better computational performance under moderate budgets of function evaluations. Thirdly, based on one of the modified ES’s given in (Diouane et al. 2015), we propose a highly parallel ES adapted to the full-waveform inversion setting. By combining model reduction and ES’s in a parallel environment, we aim at solving realistic instances of the problem. In fact the numerical results obtained along this direction will show the appropriateness and promise of our approach.

We note that global optimization heuristics have already been already employed to solve related inverse problems. A first attempt to invert the ocean bottom properties (Collins and Kuperman 1992) has been made through simulated annealing. Later Gerstoft (1994) has applied genetic algorithms to invert seismoacoustic data. Training a neural network to compute a reliable estimate of a one-dimensional velocity model has been also proposed in (Röth 1994). In all these applications, the heuristics were either applied to problems where the objective function was cheap to evaluate or where a very simple parametrization of the velocity model was used.

The paper is organized as follows. We start by describing in Sect. 2 the large-scale Earth imaging optimization problem of interest to us. In Sect. 3, we detail our proposed methodology to reduce the number of unknowns, while representing the full search space as faithfully as possible. In Sect. 4 we describe a parallel ES adapted to the specificity of the given application. Numerical results for a realistic large-scale public domain inversion problem are then presented and discussed in Sect. 5. Finally, we draw some conclusions and describe future lines of research in Sect. 6.

2 Full-waveform inversion

Estimating the subsurface velocity from seismic recordings is the main goal of the full-waveform inversion procedure. One uses the recorded wavefields to guess the physical properties of the medium through which the wavefield has propagated. Two formulations (either time-domain or frequency-domain based) are traditionally used for finding the solution of this inverse problem. Relevant details on both approaches can be found in, e.g., (Brossier 2009; Pratt and Worthington 1990; Sirgue and Pratt 2004; Virieux and Operto 2009). Since the frequency-domain approach is regarded as more advantageous when solving the full-waveform inversion in the multiple frequency case (Virieux and Operto 2009), we will exclusively consider this approach in our paper. Below we briefly detail the full seismic wavefield problem (forward problem) and the associated problem used to recover the velocity model (inverse problem). At the end of the section, we will also introduce the three-dimensional public domain velocity model used in all our numerical illustrations and experiments.

2.1 The forward problem

Given the medium properties (e.g., the subsurface velocity), the forward problem consists of modeling the full seismic wavefield in a three-dimensional parallelipedic domain \(\Omega \subset {\mathbb {R}}^3\) at a given frequency. The wave propagation is usually controlled by a partial differential equation, whose formulation depends on the characteristics of the propagation model (Cohen 2002; Pinel 2010). In the frequency domain, the acoustic propagation of a pressure field u(xyz) at the position \((x,y,z) \in \Omega \) in a heterogeneous medium is governed by the Helmholtz equation defined as:

$$ -\Delta u(x,y,z) - k^2(x,y,z) u(x,y,z) \; =\; s(x,y,z), $$
(1)

where \(k(x,y,z)= 2\pi f/ m(x,y,z)\) is the wavenumber, \(f \in {\mathbb {R}^{+}}\) is the frequency in Hz, and m(xyz) is the variable acoustic-wave velocity model in m/s. In Eq. (1), \(\Delta \) represents the standard Laplace operator and s(xyz) a Dirac source term. In a more realistic scenario, the source excitation is estimated by solving an inverse problem [see Sjogreen and Petersson (2014)]. The wavelength is given by m(xyz) / f. We consider absorbing boundary conditions: First, a popular approach, called perfectly matched layer formulation (PML) (Berenger 1994, 1996), is used to obtain a satisfactory near boundary solution, without many artificial reflections; Second, this artificial boundary layer is used to absorb outgoing waves at any incidence angle as shown in (Berenger 1994). The acoustic full-waveform inversion requires the solution of three-dimensional Helmholtz problems at various locations of the Dirac sources and thus leads to multiple right-hand side problems (Sourbier et al. 2009a, b).

In this paper, we consider a standard second-order accurate seven point finite-difference discretization of the Helmholtz equation (1) on an uniform equidistant Cartesian grid of size \(N_x \times N_y \times N_z\). For later use, we define \(N = N_x \times N_y \times N_z\), h the corresponding mesh grid size, and \(\Omega _h\) the discrete computational domain. After discretization, the acoustic full-wave inversion leads to the following linear system with p multiple right-hand sides:

$$ A \, U \; = \; S $$
(2)

where \(S \in {\mathbb {C}^{N \times p}}\) and \(A \in {\mathbb {C}^{N \times N}}\) is a sparse complex matrix (nonhermitian and nonsymmetric due to the PML formulation). We note that the matrix A embeds the properties of the subsurface and depends on the propagation velocity model m that we want to quantify. Since a stability condition has to be satisfied to correctly represent the wave propagation phenomena (Cohen 2002), we consider numerical discretization schemes with 10 points per wavelength. Consequently, at a given frequency f in Hz, we deduce the mesh grid size h in m as

$$ h \; = \; \frac{\displaystyle \min _{(x,y,z) \in \Omega _h}m(x,y,z)}{10 \, f}. $$
(3)

In practice this last relation imposes the solution of very large systems of equations (see Sect. 5, where \(N \approx 10^6\)) at the frequencies of interest for the geophysicists. Such a task may be too computationally and memory expensive when solving linear systems by sparse direct methods. Due to their indefiniteness, these systems are known to be very challenging for iterative methods (Ernst and Gander 2012). Based on previous studies (Calandra et al. 2013, 2012; Lago 2013), we consider a recently proposed block flexible Krylov subspace method [BFGMRES-S (Calandra et al. 2013, Algorithm3)] for the solution of the linear system with multiple right-hand sides (2). In (Calandra et al. 2013) the authors have shown the relevance of the BFGMRES-S algorithm combined with a variable two-level preconditioner to address the solution of such large-scale acoustic forward problems in a distributed memory parallel environment. We refer the reader to (Calandra et al. 2012, Algorithm 5) for a complete description of the geometric two-grid preconditioner and to (Pinel 2010) for additional theoretical properties in relation with Krylov subspace methods. Applications to large-scale forward acoustic problems have been considered in detail in (Lago 2013, Chap. 4).

Figure 2 depicts a graphical representation of the SEGFootnote 1/EAGEFootnote 2 salt dome velocity model (described later in Sect. 2.3) and the real part of the wavefield at frequency \(12\,{{\mathrm{Hz}}}\) obtained after solving the Helmholtz equation in the case of a single source. The numerical solution has been obtained using BFGMRES-S. We note that the wave propagation is affected by the properties of the velocity model. The interference of the waves with the reflected layer generates reflection waves. The latter are recorded at different time steps using geophones to generate the so-called seismograms, i.e., the observed data for the associated inverse problem detailed next.

Fig. 2
figure 2

3D SEG/EAGE salt dome velocity model: problem geometry with velocity distribution (left) and real part of numerical solution at \(12\,{{\mathrm{Hz}}}\) (right). Figures from (Lago 2013, Chap. 4)

2.2 Full-waveform inversion as a least-squares global optimization problem

The standard formulation of acoustic full-waveform inversion (at a given frequency f) aims at minimizing the following least-squares misfit function (Tarantola 2005):

$$ {\mathcal {J}}(m) \; = \; \frac{1}{2} \sum ^{p}_{i=1} (d^i(m) - d^i_{{{\mathrm{obs}}}})^\dagger W^i (d^i(m) - d^i_{{{\mathrm{obs}}}}), $$
(4)

where \(\dagger \) denotes the adjoint operator (transpose conjugate). The weighting matrices \(W^i\) are in general used to include a priori data information. The misfit vector \(d^i(m) - d^i_{{{\mathrm{obs}}}} \in {\mathbb {R}}^n\), related to the i-th source, is computed as the difference at the receiver positions between the recorded seismic data \(d^i_{{{\mathrm{obs}}}}\) (i.e. seismograms) and the modeled seismic one \(d^i(m)\). The latter one corresponds to the modeled seismic wavefield \(u^i\) [computed as the i-th column of the U solution of (2)], projected using the operator \({\mathcal {P}}_{{{\mathrm{data}}}}\), which extracts the values of the wavefield at the receiver positions for each source, i.e., \(d^i(m) = {\mathcal {P}}_{{{\mathrm{data}}}}(u^i)\). The use of this projection operator makes the full-waveform inversion an ill-posed problem, meaning that an infinite number of velocity models matches the data, leading to the same objective function value. Therefore, an additional regularization term is classically added to the inversion problem to make it well posed (Tarantola 2005). In addition to the velocity model, the source excitation is generally unknown and must be included as an unknown of the problem (Pratt 1999). Provided that a good starting velocity model \(m_s\) is available (good in the sense that smoothly represents the structure of the true velocity model), the minimization of the objective function (4) is in practice solved using a Newton type method [see Virieux and Operto (2009) and references therein].

2.3 The SEG/EAGE salt dome velocity model

We have conducted our numerical studies on the acoustic inverse problem using a three-dimensional public domain velocity model known in the geophysics community as the SEG/EAGE salt dome velocity model.

Fig. 3
figure 3

Visualization of the 3D SEG/EAGE salt dome velocity model using ParaView (Henderson 2007). The geophysical domain is of size \(20 \times 20 \times 5 \,{{\mathrm{km}}}^3\). The seismic waves propagate in water and in the salt dome at the minimal and maximal velocities of 1500 and 4418 m/s, respectively. The occurrence of a salt dome in the subsurface of Earth abruptly increases the velocity of propagation of the waveslabelfig

This velocity model (depicted in Fig. 3) is based on a typical US Golf coast salt structure, and special care was taken to ensure that it is geologically feasible. Hence it is widely accepted as an adequate benchmark model for seismic imaging in the geophysics community. Next, we will introduce a parametrization procedure to compute an appropriate basis for the velocity models, which will then allows to compute an accurate and smooth representation of the SEG/EAGE salt dome velocity model using a reduced number of parameters.

3 Search space reduction

Evolution Strategies are heuristic methods designed for the solution of global optimization problems (with continuous variables) that have performed well in terms of the quality of the final point computed [see Auger et al. (2013, 2009), Hansen et al. (2010), Rios and Sahinidis (2013)]. However, like any other method for global optimization, ES’s suffer from the curse of dimensionality, meaning that their performance is satisfactory on low dimensional problems, but deteriorates as the dimensionality of the search space increases (Nabi and Xiaodong 2010). For realistic simulations of full-waveform inversion (Lago 2013; Pinel 2010), the typical size N of the velocity models exceeds in general \(10^6\), and thus trying to solve directly the problem using an ES is ruled out.

However, our purpose is not to solve the full-waveform inversion problem but rather to find a good starting velocity model \(m_s\) which can be later improved using a local gradient based method. The initial velocity model \(m_s\) is only required to represent the general structure of the true model, and such a representation is often smooth and can be expressed using only a few parameters (Virieux and Operto 2009). Given an appropriate and efficient procedure to represent a velocity model using a reduced number of parameters, the ES method will be applied to try to find the values of the parameters that lead to a smooth representation of the unknown velocity model to be inverted.

The problem of search space reduction has been investigated over the past years using subspace approaches (Kennett et al. 1988; Oldenburg et al. 1993; Skilling and Bryan 1984). In our context, a velocity model perturbation \(\tilde{m} \in {\mathbb {R}}^N\) can be restricted to lie in an n-dimensional subspace of \({\mathbb {R}}^N\), spanned by the vectors \(\{ v_i\}_{i=1,\ldots ,n}\), with \(n \le N\) (as with N, n will also later have a 3-D interpretation). The model perturbation can be then written as follows:

$$ \tilde{m} \; = \; \sum ^{n}_{i=1} g_i v_i \; = \; V g, $$

where \(g \in {\mathbb {R}}^n\) are the new parameters to invert, and \(V=[v_1,\ldots ,v_n] \in {\mathbb {R}}^{N \times n}\) is the so-called reduction basis. Subspace approaches lead to an important simplification of the problem (Kennett et al. 1988; Skilling and Bryan 1984), but are unfortunately very sensitive to the choice of the reduction basis. In fact, by restricting the search space to directions in a subspace, the neglected ones could be the vectors which are important in finding a local or global minimum of the objective function \({\mathcal {J}}\) given in (4). Often researchers use a sinusoidal basis as a reduction basis and try to find a vector g of parameters which produces an acceptable agreement to the observation (Oldenburg et al. 1993). The existing subspace methods, previously cited, use gradient information of the objective function, thus the reduction process is problem dependent. In our case, no information about the objective function will be used, thus our subspace technique can be applied regardless of the problem.

Inspired by the methodology used in image compressing, we propose in this work a new procedure to construct this basis using a combination of sinusoidal and rectangular basis functions, more specifically a discrete cosine transform (DCT) (Britanak et al. 2006) and a step function transform. In fact, the step function used to magnify the vector parameter \(g \in {\mathbb {R}}^n\), so that it fits the original space \({\mathbb {R}}^N\), usually leads to a pixelization effect. Thus a DCT is then applied to produce a smooth velocity model, reducing such pixelization. To simplify the exposition, we will first explain our approach in the one-dimensional case, and then give a generalization to cover the realistic 3D geometry of interest. We refer the reader to (Diouane 2014, Chap. 7) for a complete description of these procedures.

3.1 One-dimensional space reduction

There are three main procedures in our space reduction scheme: a reduction, a duplication, and a magnification.

In the reduction procedure, given a vector \(m \in {\mathbb {R}}^N\) discretizing a velocity vector, n subdivisions are first created. Then, it is taken the mean value the parameters included in each subdivision. In our context, the reduction operation will be only used to estimate how efficient is our magnification procedure.

The duplication procedure consists of building a vector \(m \in {\mathbb {R}}^N\) using a small-size vector \(g \in {\mathbb {R}}^n\) with \(n < N\). We first construct an empty vector m of size N, considering n subdivisions of indices \([x_i, x_{i+1}]\). Each subdivision contains around \(\delta =[N/n]\) parameters and thus \(\delta = x_{i+1} - x_i\) and \(x_i = (i-1)\delta +1\). The n parameters of the velocity vector g are distributed over the n subdivisions. The value associated to each subdivision is then duplicated over the \(\delta \) parameters of m assigned for that subdivision. The duplication procedure, as presented here, introduces a pixelization effect in the constructed vector m. A DCT can then be applied to improve the quality of the duplication and to remove the subdivision discontinuities.

A magnification procedure aims in general at removing noise or producing a less pixelated image. The most used smoothing algorithms are based on Gaussian smoothing (Aditya et al. 2012), bilateral filters (Tomasi and Manduchi 1998), and sinusoidal based approaches (Britanak et al. 2006). As a smoothing procedure, we choose to work with a sinusoidal basis since it is one of the most popular subspace approaches for FWI to generate a smooth approximation vector using few coefficients (Oldenburg et al. 1993). We will smooth the pixelization effect in the magnified vector using a DCT (Britanak et al. 2006).

As we have seen before, when duplicating a velocity vector from a vector of smaller size, the value in each subdivision is constant and thus it can be seen as a mean of all the subdivision values. Such a property will be imposed as well on the magnified velocity vector m in the sense that it corresponds to a model \(m(\cdot )\) such that

$$ \frac{1}{x_i - x_{i+1}} \int ^{x_i}_{x_{i+1}} m(x) dx \; = \; g(i), \quad i = 1,\ldots ,n. $$
(5)

In turn, this velocity model \(m(\cdot )\) is expressed using a discrete cosine basis in the following way:

$$ m(x) \; = \; \displaystyle \sum _{j=1}^{n} a_j \cos \bigg (\frac{(j-1)\pi }{N}(x-1)\bigg ), $$
(6)

where \(a=(a_j)_{1 \le j \le n} \in {\mathbb {R}}^n \). By incorporating the expression (6) into the equations (5), we can obtain the vector a by solving a linear system of the form

$$ C a \; = \; g, $$
(7)

where \(C \in {\mathbb {R}}^{n \times n}\) is a matrix with the following coefficients:

$$\begin{aligned} C_{ij} \; = \; {\left\{ \begin{array}{ll} 1 &{} {\text{ if }} \; j=1,\\ \frac{2N}{(j-1)\pi \delta } \cos \bigg (\frac{\pi }{N}(j-1) (i-\frac{1}{2})\delta \bigg ) \sin \bigg (\frac{\delta \pi }{2N}(j-1)\bigg ) &{} {\text{ otherwise. }} \end{array}\right. } \end{aligned}$$

The matrix C is nonsingular (see the proof in the appendix). The one-dimensional smoothed vector m is then built by evaluating (8) for all \(i \in \{ 1,\ldots , N \}\), which amounts to

$$ m_i \; = \; \displaystyle \sum _{j=1}^{n} a_j \cos \bigg (\frac{(j-1)(i-1)\pi }{N}\bigg ), $$

or, equivalently, to

$$ m \; = \; Mg, $$
(8)

where \(M=KC^{-1} \in {\mathbb {R}}^{N \times n}\) and \(K \in {\mathbb {R}}^{N \times n}\) is the matrix defined by \(K_{ij}=\cos (\frac{(j-1)(i-1)\pi }{N})\). Equation (8) shows that the magnification procedure corresponds to the application of a linear operator. The magnification cost is negligible compared to an objective function evaluation. In fact, the magnification is accomplished by first computing \(C^{-1}g \in {\mathbb {R}}^n\) and then multiplying it by \(K \in {\mathbb {R}}^{N\times n}\), resulting in \(K (C^{-1}g) \in {\mathbb {R}}^{N}\), while an evaluation of the objective function requires the solution of a larger linear system of size \(N\times N\) with p multiple right-hand sides [see (2)].

3.2 Three-dimensional space reduction

A multi-dimensional transform (known as fast direct multi-dimensional DCT) can be carried out by using a composition of the one-dimensional magnification procedure along each dimension (Britanak et al. 2006). Equation (8) can then be immediately extended to 2D or 3D velocity models. A detailed description of the extension of Eq. (8) to higher dimensions is given in (Britanak et al. 2006). In the case of three-dimensional data, given a 3D velocity model G of \(n=n_x \times n_y \times n_z\) parameters, we ought to build a magnified 3D velocity model m of size \(N=N_x \times N_y \times N_z \gg n\) parameters. The magnification procedure is obtained by applying (8) first to the x axis, then to y, and finally to z as follows (using Matlab notation):

$$\begin{aligned} V(:,:,k)\,=\, & {} M_x G(:,:,k), \quad k = 1,\ldots ,n_z, \\ T(:,:,k)\,=\, & {} V(:,:,k) M_y^\top , \quad k = 1,\ldots ,n_z, \\ m(i,:,:)\,=\, & {} T(i,:,:) M_z^\top , \quad i = 1,\ldots ,N_x, \end{aligned}$$

where \(M_x \in {\mathbb {R}}^{N_x \times n_x}\), \(M_y \in {\mathbb {R}}^{N_y \times n_y}\), and \(M_z \in {\mathbb {R}}^{N_z \times n_z}\) are the one-dimensional smoothing matrices defined in (8) along the axes x, y, and z, respectively.

3.3 Application to the SEG/EAGE salt dome velocity model

To illustrate numerically the performance of the three-dimensional approximation procedure, we have used the SEG/EAGE salt dome velocity model introduced in Sect. 2.3.

Fig. 4
figure 4

Illustration using 1D SEG/EAGE salt dome velocity profiles

Figure 4 outlines three one-dimensional SEG/EAGE salt dome velocity profiles (with respect to the x, y, and z axes, respectively). Reduction procedures were then applied to these velocity profiles to create vectors for duplication and magnification. Following the x axis (resp. y axis), the velocity profile has been selected at the position \(y=8.9\, {{\mathrm{km}}}\) and \(z=2.5\,{{\mathrm{km}}}\) (resp. \(x=8.9\, {{\mathrm{km}}}\) and \(z=2.5\,{{\mathrm{km}}}\)). In both cases the reduced velocity vector is built using \(n = 8\) parameters only, while \(N = 225\) parameters are required for the true velocity vector. Following the z axis, the velocity profile (selected at the position \(x=8.9\, {{\mathrm{km}}}\) and \(y=8.9\,{{\mathrm{km}}}\)) is reduced using \(n=5\) parameters compared to the true \(N=70\) ones. From the results obtained, we observe that the smoothing effect of the DCT transform improves the quality of the duplicated velocity profiles as a representation of the original ones, and that the space reduction approach can work relatively well with a reduced number of parameters.

Fig. 5
figure 5

3D duplicated and magnified models of the SEG/EAGE salt dome velocity model. The velocity models are built using \(n=8 \times 8 \times 5=320\) parameters, while \(N= 225 \times 225 \times 70=3543750\) parameters are required for the true velocity model

For the three dimensional case, we have found that the true velocity model can be relatively well approximated using \(n=8 \times 8 \times 5=320\) parameters instead of the original \(N= 225 \times 225 \times 70=3543750\) ones, in the sense of still representing the main structure (i.e., the salt dome) of the true velocity model. Figure 5 outlines an illustration of the obtained results. As expected, the magnification procedure using DCT (see Fig. 5g–i) gives better results compared to the duplication procedure which is based on the step function transform (see Fig. 5d–f). Although we use only a few parameters to represent the velocity model, our smooth magnification procedure preserves the main specificity of the true model, in particular the salt dome.

4 A parallel ES for acoustic full-waveform inversion

In this section we start by briefly reviewing the existing methods to compute a satisfactory initial velocity model for seismic inversion. Then we explain how to apply ES’s for this purpose when using the space reduction introduced before. A parallel implementation of the resulting ES’s is also proposed.

4.1 Existing methods

The acoustic full-waveform inversion problem introduced in Sect. 2.2 is nonconvex, and thus its solution by optimization algorithms crucially depends on the starting velocity model \(m_s\). In fact, it is known that the inversion procedure converges to satisfactory results only if the starting velocity model is situated not far from a global minimizer (Virieux and Operto 2009). Hence, before applying the full-waveform inversion, a starting model is generally built. To do this, the most common techniques are first-arrival travel-time tomography (FATT) (Nolet 1987), stereotomography (Lambaré 2008) or, more recently, inversion in the Laplace domain (Shin and Cha 2008). For many years, FATT has proven to be stable in generating smooth velocity models of the subsurface. Some examples of application of FWI to real data using a starting model built by FATT are described in (Operto et al. 2006; Ravaut et al. 2004). Similarly, the stereotomography is regarded as one of the most promising methods for building a smooth velocity model. It exploits the arrival time of locally coherent events within an automatic procedure to select a seismogram collection (Lambaré 2008). Some applications to synthetic and real data sets are shown in (Billette et al. 2003; Billette and Lambaré 1998). Finally, the inversion in the Laplace domain can be viewed as a frequency domain inversion using a pure imaginary complex valued frequency that controls the time damping of the seismic wavefield. Applications of Laplace domain FWI to synthetic and real data are proposed in (Shin and Cha 2008; Shin and Ha 2008, 2012).

4.2 Evolution strategies and CMA–ES

Evolutions strategies are a class of evolutionary algorithms designed for the optimization of a possibly nonconvex function in a continuous domain without using derivatives. It has been originally developed in (Rechenberg 1973) for the unconstrained optimization of a function, \(\min _{v \in {\mathbb {R}}^n} f(v)\), and has been extensively investigated and tested [see, e.g., Beyer and Schwefel (2002), Hansen et al. (to appear)] and the references therein]. We are interested in a large class of ES’s denoted by \((\mu ,\lambda )\)–ES, with \(\mu \) and \(\lambda \) integers such that \(1 < \mu \le \lambda \), where a certain number \(\lambda \) of points are randomly generated in each iteration, among which \(\mu \) of them are selected as the best in terms of the objective function f.

CMA–ES (Hansen et al. 1995) (where CMA stands for covariance matrix adaptation) is regarded as one of the best in the class \((\mu ,\lambda )\)–ES in terms of numerical performance (Auger et al. 2013, 2009; Hansen et al. 2010; Rios and Sahinidis 2013). More precisely, CMA–ES belongs to the ES family denoted by \((\mu /\mu _W,\lambda )\)–ES, where the subscript ‘W’ indicates the use of ‘recombination’ via weights. Broadly speaking, at iteration k, a candidate minimizer \(\bar{x}_k\) is used to produce a generation of \(\lambda \) ‘offspring’, each consisting of adding to \(\bar{x}_k\) a random direction multiplied by a parameter controlling the length or size of the steps; the best \(\mu \) of these are retained as ‘parents’ in a ‘selection’, and \(\bar{x}_{k+1}\) is taken as a weighted combination (‘recombination’) of these parents. The CMA component considers a Gaussian distribution of mean zero for the random generation of the directions and provides a scheme for updating the corresponding covariance matrix as well as the step length.

4.3 A modified CMA–ES

CMA–ES has exhibited robust performance for difficult ill-conditioned, non-separable, and highly multi-modal problems (Auger et al. 2013, 2009; Hansen et al. 2010; Rios and Sahinidis 2013). Its main drawback is, however, that a large budget is required to provide outstanding results. Recently, the authors in (Diouane et al. 2015) have proposed modifications to the class of algorithms in \((\mu /\mu _W,\lambda )\)–ES to make them enjoying a favorable convergence property and performing better for smaller budgets. The modifications have been essentially the imposition of a sufficient decrease on the objective function values to accept new iterates and the reduction of the step size when such a condition is not satisfied. Under such modifications, these ES’s can converge globally (meaning independently of the starting point) for a first-order stationary point. Algorithm 1 shows an adaptation of the globally convergent ES’s proposed in (Diouane et al. 2015) to the context of full-waveform inversion.

The authors in (Diouane et al. 2015) have proposed three different globally convergent ES versions named mean/mean, max/max, and max/mean. On a large data set of problems, the mean/mean version has performed numerically the best. However, the incorporation of the mean/mean sufficient decrease condition requires an extra objective function evaluation \({\mathcal {J}}(m_{k+1}^{trial})\) at each iteration, where \(m_{k+1}^{trial}\) is the trial mean parent computed as the mean of the best \(\mu \) generated velocity models. The mean/mean version would therefore corrupt the parallel nature of ES’s. In fact, if one supposes that the evaluation of \({\mathcal {J}}\) at the offspring is performed at the same time using synchronized parallel clusters, the mean parent evaluation \({\mathcal {J}}(m_{k+1}^{trial})\) will force all these clusters to wait until the end of such an evaluation to be able to restart a new offspring generation. Alternatively, the max/max version has shown good performance (not as good as the mean/mean version) without the need of any extra objective function evaluation to impose the corresponding sufficient decrease condition. Consequently, we have adopted the max/max version in Algorithm 1.

We also note that the update of the weights to enforce the sufficient decrease condition, originally proposed in (Diouane et al. 2015), has not been activated in our setting since we aim at the least amount of changes in the original ES’s and since such an update did not seem to have a real impact on the results for the max/max version [see Diouane et al. (2015)].

Moreover, in the evaluation procedure of the objective function, one needs to satisfy the relation (3). Hence we have imposed a lower bound on the velocity equal to a known minimum value \(m_{min}=1500\,{\rm{m/s}}\) (the velocity model value of the water). A maximum value on the velocity model of \(m_{max}=4500\,{\rm{m/s}}\) has been also imposed to avoid propagation by meaningless velocity models. Both requirements were guaranteed by projecting the offspring models onto the feasible domain defined by the bounds, an approach that has been shown to be globally convergent in (Diouane et al. 2015). Since there are no other constraints rather than simple bounds on the variables one can use the simple orthogonal \(\ell _2\)-projection.

4.4 A parallel implementation

The proposed ES implementation consists of a synchronized parallel optimizer composed of \(\lambda \) clusters (typically, the population size). Each cluster is composed of a group of processors, which is assigned to evaluate the objective function (4). At a given iteration k, the clusters are synchronized and not activated until the new mean parent \(m_{k+1}\) is defined, which in turn depends on the iteration state (successful or not). Figure 6 sketches in detail a given iteration of our proposed parallel implementation of Algorithm 1.

Fig. 6
figure 6

A graphical illustration for a given iteration of the parallel evolution strategy

The proposed parallel implementation can be described as follows. In addition, at each iteration, \(\lambda \) clusters, represented in Fig. 6 by the components [Generate \(m^i\)], are launched in a synchronous manner. Each of these clusters generates a reduced velocity model based on the ES parameters and strategies. Once the velocity model is generated, the related cluster evokes the component [Propagate \(m^i\)].

The wave propagation simulation related to each velocity model considers the p source terms (i.e., p right-hand sides) at once. The [Propagate \(m^i\)] component is based on MPI (Message Passing Interface) and is responsible for discretizing and generating the linear system related to the forward problem (see Sect. 2.1) and to provide the information needed to evaluate the objective function \({\mathcal {J}}\) given in (4). The last component will just return the value of the objective function to the unique master processor [Master]. Once the master processor has received results from the \(\lambda \) clusters, it will choose the best, decide whether the iteration is successful or not, and update the mean parent accordingly. The [Master] component updates also the ES parameters (e.g., the distribution and the step length) and repeats the loop until a convergence criterion is achieved.

The propagation in itself behaves as a black box process, hiding from the ES the complexity of the discretization and of the forward solver. Changing the discretization and/or forward solver parameters will not incur in any rewriting of the ES implementation. MPI-2 has been used with the MPI_COMM_SPAWN primitive which allows an MPI process to spawn a number of clusters. Each newly spawned cluster has a specific MPI_COMM_WORLD intracommunicator that allows us to launch easily the propagation simulations over a number of CPU cores. We note that the proposed ES implementation is portable and that the propagator itself can be a standalone client. When the available cluster number is less than \(\lambda \), one can launch many propagation simulations on the same cluster until we obtain the needed function evaluations.

5 Numerical experiments

In this section we first describe the validation scenario and detail the parameters of the global optimization algorithm. Then we analyze the numerical results obtained for the inversion procedure at two frequencies. The numerical simulations have been performed on CURIE Footnote 3, a parallel cluster located at TGCC, France (two eight-cores Intel Sandy Bridge EP (E5-2680) at 2.7 GHz and 64 GB RAM per computing node with InfiniBand QDR Full Fat Tree interconnect) using a Fortran 2003 implementation with MPI in single precision arithmetic. The code has been compiled by the Intel ifort compiler suite with standard compiling options and linked with the MKL library. The numerical experiments have been performed on a fixed number of cores (2048, i.e., 128 computing nodes) with a maximal allowed elapsed computing time of 24 hours.

5.1 Problem and algorithm specifications

We have used a simple scenario where the p source terms are supposed to be Dirac functions and where the observed data \(d^i_{{{\mathrm{obs}}}}\) [i.e. seismograms (Tarantola 2005)] are generated from the propagating velocity model that we are trying to invert (see Sect. 2.3). The p sources (with \(p=16\)) are uniformly distributed in a survey plan located at 500 meters of depth (\(10\%\) of the exploration depth).

In this scenario we consider the acoustic full waveform inversion procedure at two different frequencies (\(1\,{{\mathrm{Hz}}}\) and \(2\,{{\mathrm{Hz}}}\) respectively) with only 320 parameters for the initial velocity model (see Sect. 3). Table 1 reports the dimensions \(N_x \times N_y \times N_z\) of the forward problem in agreement with the stability condition (3), the number of clusters, as well as the population size \(\lambda \) versus the frequency. As the frequency increases, the number of cores dedicated to the objective function evaluation becomes larger. In fact, the forward problem gets more complicated to solve as the frequency f increases. The initial iterate \(\bar{x}_0\), for the parallel ES algorithm in the case of 1 Hz, is built using the magnification procedure based on two given velocity values (3000 m/s and 1500 m/s, distributed as follows \(2= 1 \times 1 \times 2\)). Both the nonlinearity and the ill-posedness of the FWI problem are in practice tackled in the frequency domain using a multi-scale approach where one starts the inversion procedure in a low frequency range to mitigate the nonlinearity of the inversion, and then incorporate progressively higher frequencies (Sirgue 2006; Virieux and Operto 2009). Hence the velocity model solution obtained in the case of \(1\,{{\mathrm{Hz}}}\) will serve as a warm-starting point for the \(2\,{{\mathrm{Hz}}}\) case.

In the context of CMA–ES, the choice \(\lambda = {\text{ floor }}(4+3\log (n))\) (where \({\text{ floor }}(\cdot )\) rounds to the nearest integer no larger than the number given) has been shown to provide a good compromise between quality of the solution found and effort in determining it [see Hansen (2011)]. However, given the strong need for global exploration at \(1\,{{\mathrm{Hz}}}\), we have chosen an even larger value, equal to \(\lambda =512\), to better take advantage of the number of cores and cores per cluster, as we explain next. Since doubling the frequency doubles the number of discretization points in each direction, the number of cores per cluster is multiplied by at least a factor of eight when considering the case of \(2\,{{\mathrm{Hz}}}\). A factor of 16 has been retained in these numerical experiments. Given that the total number of cores is fixed to 2048 in both experiments, we note that the number of clusters at \(2\,{{\mathrm{Hz}}}\) is reduced from \(2048/8=256\) to \(2048/128=16\). Since the population size \(\lambda \) is chosen as twice the number of available clusters, we had to run two evaluations on the same cluster to get the entire offspring population (for both frequencies).

Table 1 The distribution of the clusters and the population size versus frequency

The parameters of the optimization algorithm are chosen similarly to those of CMA–ES for unconstrained optimization [see Hansen (2011)]: \(\mu = {\text{ floor }}(\lambda /2)\) and \(\omega _0^i = a_i/(a_1+\cdots + a_\mu )\) with \(a_i = \log (\lambda /2+1/2)-\log (i)\), \(i=1,\ldots ,\mu \). The choices of the distribution \({\mathcal {C}}_k\) and of the update of \(\sigma _k^{{{\mathrm{ES}}}}\) also followed CMA–ES for unconstrained optimization [see Hansen (2011)]. The selected forcing function was \(\rho (\sigma )=10^{-4} \sigma ^2\). To reduce the step length in unsuccessful iterations we have used \(\sigma _{k+1} = 0.5 \sigma _k\) which corresponds to set \(\beta _1 = \beta _2 = 0.5\). Finally, the initial step size \(\sigma _0\) is set to half of the difference between the velocity value on the bottom and the one on the top, given (in SEG/EAGE salt dome velocity model) by 3000 and 1500 m/s, respectively.

5.2 Numerical results for full-waveform inversion

Figure 7 reports a graphical representation of the inverted velocity model at \(1\,{{\mathrm{Hz}}}\). These numerical results are found to be satisfactory since a smooth version of the true velocity model is obtained. We can indeed consider that we are able to invert the general structure of the regarded velocity model, since in particular the salt dome structure is recovered. We also remark that such a model can be considered as a good starting point when using gradient based methods (Virieux and Operto 2009).

Figure 9a depicts the values of the objective function at the best population point at each iteration for the case \(1\,{{\mathrm{Hz}}}\). The variation of the objective function is more significant only at the early stages of the inversion process. Such a behavior is due to the sufficient decrease condition which monitors the quality of the sampling procedure. We note that the objective function value decreases from 2042.113 to 575.8082, and that after 278 iterations the inversion procedure is stopped due to the maximal elapsed computational time.

Fig. 7
figure 7

Inversion results for the salt dome velocity model using \(n=320\) parameters. The \(1\,{{\mathrm{Hz}}}\) case

Figure 8 shows the results obtained for \(f = 2\,{{\mathrm{Hz}}}\). Even though the optimization procedure is started from the inverted velocity model obtained at \(1\,{{\mathrm{Hz}}}\), we note that the inversion result is getting less accurate and rather far from being a good approximation of the targeted velocity model.

Fig. 8
figure 8

Inversion results for the salt dome velocity model using \(n=320\) parameters. The \(2\,{{\mathrm{Hz}}}\) case

Figure 9b shows that the optimization procedure is converging to a fixed velocity model with an objective function value of 123.2841 after 1220 iterations. Unlike the \(1\,{{\mathrm{Hz}}}\) case, this plot at \(2\,{{\mathrm{Hz}}}\) indicates that the inversion process is getting stacked at a local minimum. The explanation of such results is that the objective function becomes more and more noisy and multi-modal as far as the frequency increases (Sirgue 2006). A possible way to overcome such a difficulty is to increase the population size of the ES in order to encourage the global exploration.

Fig. 9
figure 9

Objective function values at the best population point during the application of Algorithm 1 for FWI

6 Conclusion

In this paper we have presented a numerical method based on evolution strategies for the solution of the acoustic full-waveform inversion problem arising in Earth imaging in geophysics. Considering that each parameter in the velocity model to be inverted is an additional variable to optimize, we have proposed a new parametrization of the problem, reducing the number of parameters needed to faithfully representing the velocity models. We have illustrated the efficiency of our parametrization on the public domain SEG/EAGE salt dome velocity model, where we were able to reconstruct a rather satisfactory approximation using very few parameters.

We have then adapted the chosen ES’s to a reduction of the original space, in particular to the full-waveform inversion setting when using reduced parameterized velocity models. The main purpose of the incorporation of ES’s in the inversion procedure is to find a good starting velocity model without the need of sophisticated a priori knowledge on the background velocity model.

Given the high cost of the problem function evaluations, a highly parallel scheme for such ES’s has been derived and validated on parallel distributed memory platforms. The parallel implementation has been tested in combination with the new model parametrization. The obtained numerical results have shown that a great improvement can be obtained in the automation of the inversion procedure.

The testing scenario used for the acoustic full-waveform inversion problem is a simple one where the source excitations are known and the observed data (i.e. seismograms) is generated using the propagating velocity model which we are trying to invert. We note that such a scenario is not truly realistic for the following two reasons: (a) the source excitation is generally unknown and must be rather included as an unknown of the problem, and (b) the observed data is generally given by geophones located at the surface of the exploration domain. Thus, a more realistic test case has to be investigated in the future to fully understand the potential of the proposed approach.

Finally we remark that our model reduction procedure can certainly be generalized to cover other fields of applications. Similarly, the developed parallel approach can also be applied to other optimization problems in geosciences (e.g. well placement in reservoir modeling or analysis of permeability of rocks in petrophysics).