
In Chap. 5, we considered methods for solving linear discrete inverse problems using the deterministic approach based on Tikhonov regularization. However, there exists an alternative approach based on the ideas of probability theory. Therefore, this chapter presents several methods of inverse problem solution using the probabilistic approach, following Zhdanov (1993, 2002, 2015).

1 Maximum Likelihood Method

As discussed in Chap. 2, the probability distribution can be described by a very complicated function in general cases. However, according to the central limit theorem, a large sample of a random variable tends to a very simple distribution, the so-called Gaussian (or normal) distribution, as the size of the random sample increases.

The joint distribution of two independent Gaussian variables is just the product of two univariate distributions. When the data forming a vector \( \textbf{d}\) are correlated (with mean \(\langle \textbf{d}\rangle \) and covariance \({\boldsymbol{\sigma }} =[\sigma _{ij}]\)), the appropriate distribution turns out to be as follows (Menke 2018):

$$\begin{aligned} P(\textbf{d})=\frac{\mid {\boldsymbol{\sigma }}\mid ^{-\frac{1}{2}}}{(2\pi )^{\frac{N}{2}}}\exp \left[ -\frac{1}{2}(\textbf{d}-\langle \textbf{d}\rangle )^{T}{\boldsymbol{\sigma }}^{-1}(\textbf{d}-\langle \textbf{d}\rangle )\right] . \end{aligned}$$
(6.1)
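
As a simple numerical illustration, the following Python sketch (with assumed values for the mean, covariance, and data vector) evaluates the density (6.1) directly from the formula and checks it against SciPy's multivariate normal density.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed (illustrative) mean, covariance, and observed data vector
mean = np.array([1.0, 2.0])
cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])      # data covariance sigma = [sigma_ij]
d = np.array([1.1, 1.9])

N = d.size
r = d - mean
# Explicit evaluation of Eq. (6.1)
p_formula = (np.linalg.det(cov) ** -0.5 / (2 * np.pi) ** (N / 2)
             * np.exp(-0.5 * r @ np.linalg.inv(cov) @ r))
# Reference value from SciPy
p_scipy = multivariate_normal.pdf(d, mean=mean, cov=cov)
assert np.isclose(p_formula, p_scipy)
```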

The idea that the model and data are related by an explicit relationship,

$$\begin{aligned} \textbf{Am}=\textbf{d,} \end{aligned}$$
(6.2)

can now be reinterpreted in the sense that this relationship holds only for the mean data:

$$\begin{aligned} \textbf{Am}=\langle \textbf{d}\rangle . \end{aligned}$$
(6.3)

Substituting (6.3) into (6.1), we can rewrite the distribution of the data as follows:

$$ P(\textbf{d})=\frac{\mid {\boldsymbol{\sigma }}\mid ^{-\frac{1}{2}}}{(2\pi )^{\frac{N}{2}}}\exp \left[ -\frac{1}{2}(\textbf{d}-\textbf{Am})^{T}{\boldsymbol{\sigma }}^{-1}(\textbf{d}-\textbf{Am})\right] $$
$$\begin{aligned} =\frac{\mid {\boldsymbol{\sigma }}\mid ^{-\frac{1}{2}}}{(2\pi )^{\frac{N}{2}}}\exp \left[ -\frac{1}{2}f_{\sigma }(\textbf{m})\right] , \end{aligned}$$
(6.4)

where

$$ f_{\sigma }(\textbf{m})={(\textbf{d}-\textbf{Am})}^{T}{\boldsymbol{\sigma }}^{-1}(\textbf{d}-\textbf{Am}). $$

Under this assumption, we can say that the optimum values of the model parameters are those that maximize the probability that the observed data are, in fact, observed. Thus, the method of maximum likelihood is based on the maximization of the probability function (6.4):

$$\begin{aligned} P(\textbf{d})=\max .\ \end{aligned}$$
(6.5)

Clearly, the maximum of \(P(\textbf{d})\) occurs when the argument of the exponential function reaches its maximum, i.e., when \(f_{\sigma }(\textbf{m})\) reaches its minimum:

$$\begin{aligned} f_{\sigma }(\textbf{m})={(\textbf{d}-\textbf{Am})}^{T}{\boldsymbol{\sigma }}^{-1}(\textbf{d}-\textbf{Am})=\min .\ \end{aligned}$$
(6.6)

Let us calculate the first variation of the functional \(f_{\sigma }\):

$$ \delta f_{\sigma }(\textbf{m})=-(\delta {\textbf{Am}})^{T}{\boldsymbol{\sigma }}^{-1}(\textbf{d}-\textbf{Am})-{(\textbf{d}-\textbf{Am})}^{T}{\boldsymbol{\sigma }}^{-1}(\delta {\textbf{Am}}).\ $$

It can be shown that, for the symmetric matrix \({\boldsymbol{\sigma }}^{-1}\), the following equality holds:

$$ \textbf{a}^{T}{\boldsymbol{\sigma }}^{-1}\textbf{b}=\textbf{b}^{T}{\boldsymbol{\sigma }}^{-1}\textbf{a}, $$

where \(\textbf{a}\) and \(\textbf{b}\) are two arbitrary column vectors. Therefore, we can write the necessary condition for the functional \( f_{\sigma }\) to have a minimum as follows:

$$\begin{aligned} \delta f_{\sigma }(\textbf{m})=-2(\textbf{A}\delta \textbf{m})^{T}{\boldsymbol{\sigma }}^{-1}({\textbf{d}-\textbf{Am}})=-2(\delta \textbf{m})^{T}\textbf{A}^{T}{\boldsymbol{\sigma }}^{-1}({\textbf{d}-\textbf{Am}})=0.\ \end{aligned}$$
(6.7)

From Eq. (6.7), we obtain at once the following equation:

$$ \textbf{A}^{T}{\boldsymbol{\sigma }}^{-1}({\textbf{d}-\textbf{Am}})=0.\ $$

The last formula provides the following normal system of equations for the “pseudo-solution” of the minimization problem (6.6):

$$\begin{aligned} \textbf{A}^{T}{\boldsymbol{\sigma }}^{-1}{\textbf{Am}}=\textbf{A}^{T}{\boldsymbol{\sigma }}^{-1}{\textbf{d}}.\ \end{aligned}$$
(6.8)

If the matrix \(\left( \textbf{A}^{T}{\boldsymbol{\sigma }}^{-1}{\textbf{A}}\right) \) is non-singular, then we can multiply both sides of the normal system (6.8) by the inverse matrix, \((\textbf{A}^{T}{\boldsymbol{\sigma }}^{-1}{\textbf{A}})^{-1},\) and write the pseudo-solution of the minimization problem (6.6) in explicit form as follows:

$$\begin{aligned} {\textbf{m}}_{0}=(\textbf{A}^{T}{\boldsymbol{\sigma }}^{-1}{\textbf{A}})^{-1}\textbf{A}^{T}{\boldsymbol{\sigma }}^{-1}{\textbf{d}}.\ \end{aligned}$$
(6.9)
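
A minimal numerical sketch of the estimate (6.9) is given below; the matrix A, the true model, and the data covariance are illustrative assumptions. In practice, the normal system (6.8) is solved directly rather than by forming the inverse matrix explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem: N = 20 data, L = 3 model parameters
A = rng.normal(size=(20, 3))
m_true = np.array([1.0, -2.0, 0.5])
sigma = np.diag(rng.uniform(0.01, 0.1, size=20))     # assumed data covariance
d = A @ m_true + rng.multivariate_normal(np.zeros(20), sigma)

sigma_inv = np.linalg.inv(sigma)
# Solve the normal system (6.8): (A^T sigma^-1 A) m = A^T sigma^-1 d
m0 = np.linalg.solve(A.T @ sigma_inv @ A, A.T @ sigma_inv @ d)
print(m0)   # maximum likelihood (weighted least-squares) estimate, Eq. (6.9)
```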

Comparing the last formula with the corresponding equation for the weighted least-squares method (5.31), we see that we have obtained exactly the same result if we substitute matrix \(\textbf{W}^{2}\) for \({\boldsymbol{\sigma }}^{-1}\):

$$\begin{aligned} \textbf{W}^{2}={\boldsymbol{\sigma }}^{-1}.\ \end{aligned}$$
(6.10)

Note that if the data happen to be uncorrelated, then the covariance matrix becomes diagonal:

$$\begin{aligned} {\boldsymbol{\sigma }}=[\textbf{diag}(\sigma _{i}^{2})], \end{aligned}$$
(6.11)

and the elements of the main diagonal are the variances of the data. In this case, the weights are given by the following formula:

$$\begin{aligned} w_{i}^{2}={\frac{1}{{\sigma _{i}^{2}}}}.\ \end{aligned}$$
(6.12)

The functional

$$\begin{aligned} f_{w}(\textbf{m})=\chi ^{2}(\textbf{m})=\sum _{i=1}^{N}\left( \frac{r_{i}}{\sigma _{i}}\right) ^{2}=\sum _{i=1}^{N}\left( \frac{d_{i}-d_{i}^{p}}{\sigma _{i}}\right) ^{2} \end{aligned}$$
(6.13)

is called a “chi-square.”

In the cases where the measurement errors are normally distributed, the quantity \(\chi ^{2}\) is a sum of N squares of normally distributed variables, each normalized to its variance. Thus, by applying the weighted least-squares method, we assign smaller weights to data with larger standard deviations (less accurate data) and larger weights to data with smaller standard deviations (more certain data). If the data have equal variances, \(\sigma _{0}^{2},\) then the weighting matrix becomes scalar:

$$ \textbf{W}^{2}={\boldsymbol{\sigma }}^{-1}={\frac{1}{{\sigma _{0}^{2}}}}\textbf{I}, $$

and the chi-square functional becomes equal to the conventional misfit functional up to the constant factor \(1/\sigma _{0}^{2}\).
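
As an illustration, the short sketch below (with assumed observed data, predicted data, and standard deviations) evaluates the chi-square functional (6.13) and verifies that it coincides with the weighted misfit built from the weights (6.12).

```python
import numpy as np

# Assumed (illustrative) observed data, predicted data, and standard deviations
d_obs = np.array([1.2, 0.8, 2.1, 1.9])
d_pred = np.array([1.0, 1.0, 2.0, 2.0])
sigma_i = np.array([0.2, 0.1, 0.3, 0.1])

# Chi-square functional, Eq. (6.13)
chi2 = np.sum(((d_obs - d_pred) / sigma_i) ** 2)

# Equivalent weighted misfit with w_i^2 = 1 / sigma_i^2, Eq. (6.12)
w2 = 1.0 / sigma_i ** 2
chi2_weighted = np.sum(w2 * (d_obs - d_pred) ** 2)

assert np.isclose(chi2, chi2_weighted)
```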

2 The Maximum a Posteriori Estimation Method (The Bayes Estimation)

Let us consider the regularization technique from the point of view of probability theory (Tarantola 1987). First of all, we introduce the following (normally distributed) densities of probability:

(1) \(P(\textbf{d}/\textbf{m})\) is a conditional density of probability of the data \(\textbf{d}\), given the model \(\textbf{m}\); that is, it is the probability density of the theoretical data \(\textbf{d}\) expected from a given model \(\textbf{m}\).

(2) \(P(\textbf{m}/\textbf{d})\) is a conditional density of probability of a model \(\textbf{m}\), given the data \(\textbf{d}\).

According to Bayes' theorem, the following equation holds:

$$\begin{aligned} P(\textbf{m}/\textbf{d})={\frac{{P(\textbf{d}/\textbf{m})P(\textbf{m})}}{{P(\textbf{d})}}}, \end{aligned}$$
(6.14)

where \(P(\textbf{d})\) and \(P(\textbf{m})\) are unconditional probability densities for data and model parameters, respectively. It is assumed that

$$ \langle \textbf{m}\rangle =\textbf{m}_{apr},\ $$

where \(\textbf{m}_{apr}\) is an a priori constrained expectation of the model, and

$$ [\text {cov}(m_{i},m_{j})]={\boldsymbol{\sigma }}_{m}.\ $$

Thus, considering normally distributed parameters, we have the following probability distribution of the model, \(\textbf{m}\):

$$\begin{aligned} P(\textbf{m})=\frac{\mid {\boldsymbol{\sigma }}_{m}\mid ^{-\frac{1}{2}}}{(2\pi )^{\frac{L}{2}}}\exp \left[ -\frac{1}{2}(\textbf{m}-\textbf{m}_{apr})^{T}{\boldsymbol{\sigma }}_{m}^{-1}(\textbf{m}-\textbf{m}_{apr})\right] . \end{aligned}$$
(6.15)

Analogously, it is assumed that

$$ [\text {cov}(d_{i},d_{j})]={\boldsymbol{\sigma }}_{d} $$

and we can write the conditional density of probability of the data \(\textbf{d}\) as follows:

$$\begin{aligned} P(\textbf{d}/\textbf{m})=\frac{\mid {\boldsymbol{\sigma }}_{d}\mid ^{-\frac{1}{2}}}{(2\pi )^{\frac{N}{2}}}\exp \left[ -\frac{1}{2}(\textbf{d}-\textbf{Am})^{T}{\boldsymbol{\sigma }}_{d}^{-1}(\textbf{d}-\textbf{Am})\right] . \end{aligned}$$
(6.16)

The maximum likelihood method can now be used to find the model \(\textbf{m} _{0}\) which maximizes the conditional probability of a model, \(P(\textbf{m}/ \textbf{d})\):

$$ P(\textbf{m}/\textbf{d})=\frac{\mid {\boldsymbol{\sigma }}_{d}\mid ^{-\frac{1}{2}}}{(2\pi )^{\frac{N}{2}}}\exp \left[ -\frac{1}{2}(\textbf{d}-\textbf{Am})^{T}{\boldsymbol{\sigma }}_{d}^{-1}(\textbf{d}-\textbf{Am})\right] \times $$
$$\begin{aligned} \times \frac{\mid {\boldsymbol{\sigma }}_{m}\mid ^{-\frac{1}{2}}}{(2\pi )^{\frac{L}{2}}}\exp \left[ -\frac{1}{2}(\textbf{m}-\textbf{m}_{apr})^{T}{\boldsymbol{\sigma }}_{m}^{-1}(\textbf{m}-\textbf{m}_{apr})\right] P^{-1}(\textbf{d}).\ \end{aligned}$$
(6.17)

It is evident that, to maximize \(P(\textbf{m}/\textbf{d})\), we have to minimize the sum of the expressions in the exponential factors in Eq. (6.17):

$$\begin{aligned} f_{Bayes}={(\textbf{d}-\textbf{Am})^{T}}{\boldsymbol{\sigma }}_{d}^{-1}({\textbf{d}-\textbf{Am}})+{(\textbf{m}-\textbf{m}_{apr})^{T}}{\boldsymbol{\sigma }}_{m}^{-1}(\textbf{m}-\textbf{m}_{apr}).\ \end{aligned}$$
(6.18)

Note that the minimization of the first term in the above equation gives the classical maximum likelihood or weighted least-squares method.

Let us calculate the first variation of \(f_{Bayes}\):

$$ \delta f_{Bayes}=-2({\textbf{A}}\delta {\textbf{m}})^{T}{\boldsymbol{\sigma }}_{d}^{-1}({\textbf{d}-\textbf{Am}})+2(\delta {\textbf{m}})^{T}{\boldsymbol{\sigma }}_{m}^{-1}(\textbf{m}-\textbf{m}_{apr})=0.\ $$

From the last equation, we have

$$ (\delta {\textbf{m}})^{T}[{\textbf{A}}^{T}{\boldsymbol{\sigma }}_{d}^{-1}({\textbf{d}-\textbf{Am}})-{\boldsymbol{\sigma }}_{m}^{-1}(\textbf{m}-\textbf{m}_{apr})]=0.\ $$

Thus, the normal system of equations for minimization of \(f_{Bayes}\) can be written as follows:

$$ {\textbf{A}}^{T}{\boldsymbol{\sigma }}_{d}^{-1}({\textbf{d}-\textbf{Am}})-{\boldsymbol{\sigma }}_{m}^{-1}(\textbf{m}-\textbf{m}_{apr})=0. $$

From the last formula, we have at once the following equation:

$$\begin{aligned} ({\textbf{A}}^{T}{\boldsymbol{\sigma }}_{d}^{-1}{\textbf{A}}+{\boldsymbol{\sigma }}_{m}^{-1}){\textbf{m}}={\textbf{A}}^{T}{\boldsymbol{\sigma }}_{d}^{-1}{\textbf{d}}+{\boldsymbol{\sigma }}_{m}^{-1}\textbf{m}_{apr}. \end{aligned}$$
(6.19)

We can write the solution of Eq. (6.19) in the closed form as follows:

$$\begin{aligned} {\textbf{m}}_{0}=({\textbf{A}}^{T}{\boldsymbol{\sigma }}_{d}^{-1}{\textbf{A}}+{\boldsymbol{\sigma }}_{m}^{-1})^{-1}({\textbf{A}}^{T}{\boldsymbol{\sigma }}_{d}^{-1}{\textbf{d}}+{\boldsymbol{\sigma }}_{m}^{-1}\textbf{m}_{apr}).\ \end{aligned}$$
(6.20)
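
The sketch below illustrates the Bayes (MAP) estimate (6.20) for an assumed linear problem with Gaussian data and model covariances; it solves the normal system (6.19) directly instead of inverting the matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative problem: N = 20 data, L = 3 model parameters
N, L = 20, 3
A = rng.normal(size=(N, L))
m_apr = np.zeros(L)                  # a priori model
sigma_d = 0.05 * np.eye(N)           # assumed data covariance
sigma_m = 2.0 * np.eye(L)            # assumed a priori model covariance
d = A @ np.array([1.0, -2.0, 0.5]) + rng.multivariate_normal(np.zeros(N), sigma_d)

Sd_inv = np.linalg.inv(sigma_d)
Sm_inv = np.linalg.inv(sigma_m)
# Normal system (6.19): (A^T Sd^-1 A + Sm^-1) m = A^T Sd^-1 d + Sm^-1 m_apr
m0 = np.linalg.solve(A.T @ Sd_inv @ A + Sm_inv,
                     A.T @ Sd_inv @ d + Sm_inv @ m_apr)
print(m0)    # MAP estimate, Eq. (6.20)
```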

By comparing Eqs. (6.20) and (5.36), we see that

$$\begin{aligned} {\boldsymbol{\sigma }}_{m}^{-1}=\alpha \textbf{W}_{m}^{2},\ \end{aligned}$$
(6.21)

so that \({\boldsymbol{\sigma }}_{m}^{-1}\) plays the role of both the regularization parameter and the model parameter weighting matrix.

Let us assume now that we have uncorrelated data with equal variances,

$$ {\boldsymbol{\sigma }}_{d}=\sigma _{d}^{2}\textbf{I},\ $$

and similarly for the a priori covariance of the model,

$$ {\boldsymbol{\sigma }}_{m}=\sigma _{m}^{2}\textbf{I}. $$

Then Eq. (6.20) takes the following form:

$$\begin{aligned} {\textbf{m}}_{0}=({\textbf{A}}^{T}{\textbf{A}}+k\textbf{I})^{-1}({\textbf{A}}^{T}{\textbf{d}}+k\textbf{m}_{apr}),\ \end{aligned}$$
(6.22)

where

$$\begin{aligned} k={\frac{{\sigma _{d}^{2}}}{{\sigma _{m}^{2}}}}=\alpha \ \end{aligned}$$
(6.23)

plays the role of the regularization parameter.
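
A quick numerical check (with assumed values of the operator, data, a priori model, and variances) confirms that, for scalar covariances, the Bayes estimate (6.20) reduces to the regularized form (6.22) with k given by (6.23).

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed (illustrative) operator, data, a priori model, and variances
N, L = 15, 4
A = rng.normal(size=(N, L))
d = rng.normal(size=N)
m_apr = rng.normal(size=L)
sd2, sm2 = 0.04, 2.0
k = sd2 / sm2                        # regularization parameter, Eq. (6.23)

# Bayes estimate (6.20) with sigma_d = sd2*I and sigma_m = sm2*I
m_bayes = np.linalg.solve(A.T @ A / sd2 + np.eye(L) / sm2,
                          A.T @ d / sd2 + m_apr / sm2)
# Regularized form (6.22)
m_ridge = np.linalg.solve(A.T @ A + k * np.eye(L),
                          A.T @ d + k * m_apr)
assert np.allclose(m_bayes, m_ridge)
```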

We can see from formula (6.23) that large values of the variance \({\sigma _{m}^{2}}\) of the model parameters correspond to a small regularization parameter \(\alpha ,\) and, vice versa, large values of \(\alpha \) correspond to a small variance \({\sigma _{m}^{2}}\). This means that, without regularization (\(\alpha \) close to zero), the uncertainty in determining the inverse model is large, while with regularization it becomes smaller. The last formula illustrates once again the close connection between the probabilistic (Tarantola 1987) and deterministic (Tikhonov and Arsenin 1977) approaches to regularization.

3 Stochastic Methods of Inversion

We have already discussed in this and previous chapters that there are two different major points of view in addressing the inverse problem:

(a) the algebraic (deterministic) point of view, dating back to the works of Lanczos (1961), Backus and Gilbert (1967), Backus (1970a, b, c), Marquardt (1963, 1970), Tikhonov and Arsenin (1977), etc.;

(b) the probabilistic (stochastic) point of view, formulated in the pioneering papers of Foster (1961), Franklin (1970), Jackson (1972), Tarantola and Valette (1982), Tarantola (1987, 2005), etc.

The stochastic point of view is widely used in the literature because it is closely associated with the statistical nature of the noise in the data. At the same time, it has been demonstrated in many publications (e.g., the classical work by Sabatier (1977)) that in many cases both points of view result in similar computational algorithms (see Sects. 6.1 and 6.2).

The Monte Carlo inversion methods represent a general approach based on the stochastic point of view (Metropolis and Ulam 1949; Metropolis et al. 1953). They are named after the famous casino in Monaco. There are two major types of Monte Carlo methods. The first one is based on an extensive random search in the space M of the model parameters for a solution that generates predicted data, from the data space D, close to the observed data, thus realizing the global minimum of the corresponding misfit functional \( f\left( \textbf{m}\right) \) (e.g., Cary and Chapman 1988; Khan et al. 2000; Khan and Mosegaard 2001). This method is suitable for problems with misfit functionals having multiple local minima, where conventional gradient-type minimization methods may have difficulties escaping from a “deep” local minimum (see Chap. 7). The second type of Monte Carlo method uses an optimization algorithm in order to reduce the number of steps required by the random search methods. The most effective global optimization algorithms have been developed based on known physical or biological rules of evolving toward the best solution. For example, the simulated annealing (SA) algorithm (Kirkpatrick et al. 1983; Corana et al. 1987) comes from annealing in metallurgy, a technique involving heating and controlled cooling of a material. It is known from physics that, in order to minimize the final lattice energy, one should apply a very slow cooling process. The SA method uses an analogy between the minimization of lattice energy in the framework of the physical process of annealing and the numerical problem of determining the global minimum of a misfit functional, \( f\left( \textbf{m}\right) \).
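
The following Python sketch shows the basic structure of a simulated annealing search for the minimum of a misfit functional with multiple local minima. The misfit functional, step size, and cooling schedule are illustrative assumptions, not a prescription from the text.

```python
import numpy as np

rng = np.random.default_rng(3)

def misfit(m):
    # Hypothetical multimodal misfit functional f(m)
    return np.sum(m ** 2) + 2.0 * np.sum(np.sin(3.0 * m) ** 2)

m = rng.uniform(-2.0, 2.0, size=2)       # random starting model
f = misfit(m)
T = 1.0                                  # initial "temperature"
for it in range(5000):
    m_trial = m + 0.1 * rng.normal(size=m.size)   # random perturbation
    f_trial = misfit(m_trial)
    # Always accept a better model; accept a worse one with probability exp(-df/T)
    if f_trial < f or rng.random() < np.exp(-(f_trial - f) / T):
        m, f = m_trial, f_trial
    T *= 0.999                           # slow "cooling" of the temperature
print(m, f)
```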

The genetic algorithm (GA) (Holland 1975; Goldberg 1989; Michalewicz and Schoenauer 1996; Whitley 1994; Mosegaard and Sambridge 2002) is a heuristic search method that mimics the process of natural evolution. In a pure genetic algorithm, a population of candidate solutions (individuals) of an optimization problem is evolved toward better solutions. Traditionally, the solutions are coded in binary form as strings of 0s and 1s to be mutated and altered. The evolution starts from a population of randomly generated solutions from the search space and proceeds as an iterative process. The population in each iteration is called a generation. In each generation, the fitness of every individual is evaluated by an objective functional (e.g., a misfit functional \(f\left( \textbf{m}\right) \)). The individuals with low misfits are stochastically selected from the current population and used to form a new generation by applying genetic operations (crossover and mutation). These steps are repeated until the inversion process meets the termination conditions.
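
A bare-bones genetic algorithm, operating on real-valued models rather than binary strings, is sketched below. The population size, selection rule, crossover, and mutation step are illustrative assumptions; practical GA implementations include many refinements.

```python
import numpy as np

rng = np.random.default_rng(4)

def misfit(m):
    # Hypothetical multimodal misfit functional f(m)
    return np.sum(m ** 2) + 2.0 * np.sum(np.sin(3.0 * m) ** 2)

pop = rng.uniform(-2.0, 2.0, size=(50, 2))     # initial generation of 50 models
for generation in range(200):
    fitness = np.array([misfit(m) for m in pop])
    parents = pop[np.argsort(fitness)[:25]]    # select the fittest half
    # Crossover: average randomly paired parents to form offspring
    pairs = rng.integers(0, 25, size=(25, 2))
    children = 0.5 * (parents[pairs[:, 0]] + parents[pairs[:, 1]])
    # Mutation: add small random perturbations
    children += 0.05 * rng.normal(size=children.shape)
    pop = np.vstack([parents, children])
best = pop[np.argmin([misfit(m) for m in pop])]
print(best, misfit(best))
```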

A detailed overview of the simulated annealing and genetic algorithms can be found, for example, in Zhdanov (2015).

The Monte Carlo methods are considered an effective optimization technique for many inverse problems where general gradient-type methods fail. They can be applied to optimization problems with continuous or discrete parameters and with small sample intervals; there is no need to calculate derivatives; and the global minimization problem can be solved for misfit functionals with multiple local minima.

The Monte Carlo methods were first applied to the solutions of earth science problems by Keilis-Borok and Yanovskaya (1967) and Press (1968, 1970a, b). The paper by Sambridge and Mosegaard (2002) provides an excellent review of applications of the Monte Carlo methods to solving geophysical inverse problems.

References and Recommended Reading to This Chapter

Backus GE (1970a) Inference from inadequate and inaccurate data, I. Proceedings of the National Academy of Sciences, vol 65, pp 1–7

Backus GE (1970b) Inference from inadequate and inaccurate data, II. Proceedings of the National Academy of Sciences, vol 65, pp 281–287

Backus GE (1970c) Inference from inadequate and inaccurate data, III. Proceedings of the National Academy of Sciences, vol 67, pp 282–289

Backus GE, Gilbert TI (1967) Numerical applications of a formalism for geophysical inverse problems. Geophys J R Astr Soc 13:247–276

Cary PW, Chapman CH (1988) Automatic 1D waveform inversion of marine seismic refraction data. Geophys J 93:527–46

Corana A, Marchesi M, Martini C, Ridella S (1987) Minimising multimodal functions of continuous variables with the “Simulated Annealing” Algorithm. ACM Trans Math Soft 13:262–280

Foster M (1961) An application of the Wiener-Kolmogorov smoothing theory to matrix inversion. J Soc Ind Appl Math 9:387–392

Franklin JN (1970) Well-posed stochastic extensions of ill-posed linear problems. J Math Anal Appl 31:682–716

Goldberg DE (1989) Genetic Algorithms in search, optimization, and machine learning. Addison-Wesley

Holland JH (1975) Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press

Jackson DD (1972) Interpretation of inaccurate, insufficient and inconsistent data. Geophys J R Astronom Soc 28:97–110

Keilis-Borok VI, Yanovskaya TB (1967) Inverse problems of seismology. Geophys J 13:223–234

Khan A, Mosegaard K, Rasmussen KL (2000) A new seismic velocity model for the Moon from a Monte Carlo inversion of the Apollo Lunar seismic data. Geophys Res Lett 27:1591–1594

Khan A, Mosegaard K (2001) New information on the deep lunar interior from an inversion of lunar free oscillation periods. Geophys Res Lett 28:1791

Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220:671–680

Lanczos C (1961) Linear differential operators. D van Nostrand Co

Marquardt DW (1963) An algorithm for least-squares estimation of nonlinear parameters. J Soc Ind Appl Math 11:431–441

Marquardt DW (1970) Generalized inverses, ridge regression, biased linear estimation, and nonlinear estimation. Technometrics 12:591–612

Menke W (2018) Geophysical data analysis: discrete inverse theory, 4th ed. Academic Press, 330 pp

Metropolis N, Ulam SM (1949) The Monte Carlo method. J Am Stat Assoc 44:335–341

Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21:1087–1092

Michalewicz Z, Schoenauer M (1996) Evolutionary algorithms for constrained parameter optimization problems. Evol Comput 4(1):1–32

Mosegaard K, Sambridge M (2002) Monte Carlo analysis of inverse problems. Inverse Probl 18:R29–R54

Press F (1968) Earth models obtained by Monte Carlo inversion. J Geophys Res 73:5223–34

Press F (1970a) Earth models consistent with geophysical data. Phys Earth Planet Inter 3:3–22

Press F (1970b) Regionalized Earth models. J Geophys Res 75:6575–81

Sabatier PC (1977) On geophysical inverse problems and constraints. J Geophys Res 43:115–137

Sambridge M, Mosegaard K (2002) Monte Carlo methods in geophysical inverse problems. Rev Geophys 40(3):1–29

Tarantola A (1987) Inverse problem theory. Elsevier, Amsterdam, Oxford, New York, Tokyo, 613 pp

Tarantola A (2005) Inverse problem theory and methods for model parameter estimation. SIAM, 344 pp

Tarantola A, Valette B (1982) Generalized nonlinear inverse problem solved using the least squares criterion. Rev Geophys Space Phys 20:219–232

Tikhonov AN, Arsenin VY (1977) Solution of ill-posed problems. W H Winston and Sons

Whitley DL (1994) A genetic algorithm tutorial. Stat Comput 4:65–85

Zhdanov MS (1993) Tutorial: regularization in inversion theory. CWP-136, Colorado School of Mines, 47 pp

Zhdanov MS (2002) Geophysical inverse theory and regularization problems. Elsevier, 609 pp

Zhdanov MS (2015) Inverse theory and applications in geophysics. Elsevier, 704 pp