1 Introduction

The concept of regularization (also termed damping) is central to solving many classes of inverse problems, and especially those involving generalizations of the least-squares principle (Levenberg 1944). Instabilities caused by incomplete data coverage, which would otherwise arise during the inversion process, are damped through the addition of prior information that quantifies expectations about the behavior of the solution. Given properly chosen prior information, a unique and well-behaved solution can be determined even with noisy and incomplete data.

Prior information can be implemented in two interrelated, but conceptually distinct, ways:

The first approach is as an equation that looks just like a data equation, except that it is not based on any actual observations. This type of prior information is often referred to as a constraint. For instance, the prior information that two model parameters differ by an amount \(h_{1}\) is expressed by the constraint equation \(\Delta m \equiv m_{2} - m_{1} = h_{1}\). Constraint equations can contradict the data and for that reason are understood to be only approximate. The strength of the constraint, relative to the data, is expressed by a parameter ɛ.

The second approach treats the model parameters as random variables described by a probability density function \(p(m_{1} ,m_{2} )\). The prior information is expressed as the requirement that this probability density function have certain features. Returning to the example above, we infer that the constraint equation \(\Delta m \approx h_{1}\) is probable only when \(m_{1}\) and \(m_{2}\) are strongly and positively correlated, with probability concentrated near the line \(m_{2} = m_{1} + h_{1}\). Thus, a constraint implies that the probability density function has a particular covariance (and vice versa). Furthermore, if we view the constraint equation as holding up to some variance \(\sigma_{h}^{2}\) [that is, \(\Delta m = h_{1} \pm 2\sigma_{h} \;(95\,\% )\)], then we expect this variance to scale inversely with the strength of the constraint (that is, \(\sigma_{h} \propto \varepsilon^{ - 1}\)). These considerations strongly suggest that the two approaches are interrelated.

In fact, these interrelationships are well known in least-squares theory. Suppose that the prior information equation is linear and of the form \({\mathbf{Hm}} = {\mathbf{h}}\), where m is the vector of unknown model parameters and H and h are known. Alternatively, suppose that the model parameters are normally distributed random variables with mean \(\langle {\mathbf{m}}\rangle\) and covariance matrix \({\mathbf{C}}_{\text{h}}\). As we will review below, a detailed analysis of the least-squares principle reveals that \({\mathbf{H}} = {\mathbf{C}}_{\text{h}}^{ - 1/2}\) and \({\mathbf{h}} = {\mathbf{C}}_{\text{h}}^{ - 1/2} \langle {\mathbf{m}}\rangle\) (Tarantola and Valette 1982a, b). Thus, one can translate between the two viewpoints by “simple” matrix operations.

Regularization can be applied to the general linear inverse problem \({\mathbf{Gm}} = {\mathbf{d}}\) (where d is data and G is the data kernel, which encodes the theory) to implement the qualitative notion of smoothness. This type of prior information is extremely important when the inverse problem is underdetermined, meaning that some aspects of the solution are not determined by the data. The prior information acts to fill in the data gaps and produce a final product that is “complete” and “useful.” However, the result is also at least somewhat dependent upon the way in which smoothness is quantified. A very simple form of smoothness occurs when spatially adjacent model parameters have similar values, which implies the same constraint equations as discussed previously (with \(h_{1} = 0\)): \(m_{2} - m_{1} \approx 0\), \(m_{3} - m_{2} \approx 0\), \(m_{4} - m_{3} \approx 0\), etc. These equations are equivalent to the condition that the first spatial derivative is small; that is, \({\text{d}}m/{\text{d}}x \approx 0\). This smoothness condition, often termed gradient or first-derivative regularization, is widely used in global seismic imaging (e.g., Ekstrom et al. 1997; Boschi and Dziewonski 1999; Nettles and Dziewonski 2008). Another popular form of smoothing is Laplacian or second-derivative regularization (e.g., Trampert and Woodhouse 1995; Laske and Masters 1996; Zha et al. 2014), where the constraint equations are \(m_{3} - 2m_{2} + m_{1} \approx 0\), \(m_{4} - 2m_{3} + m_{2} \approx 0\), etc., being equivalent to the condition that the second spatial derivative is small; that is, \({\text{d}}^{2} m/{\text{d}}x^{2} \approx 0\).
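As a purely illustrative sketch (not part of the original text), the two regularization schemes can be encoded as constraint matrices acting on a discretized model vector; the function names, the uniform grid spacing dx, and the use of NumPy are our own choices:

```python
import numpy as np

def first_derivative_matrix(M, dx=1.0):
    """Gradient regularization: rows encode (m[i+1] - m[i]) / dx ~ 0, i.e. dm/dx ~ 0."""
    L = np.zeros((M - 1, M))
    for i in range(M - 1):
        L[i, i] = -1.0 / dx
        L[i, i + 1] = 1.0 / dx
    return L

def second_derivative_matrix(M, dx=1.0):
    """Laplacian regularization: rows encode (m[i+2] - 2 m[i+1] + m[i]) / dx^2 ~ 0, i.e. d2m/dx2 ~ 0."""
    L = np.zeros((M - 2, M))
    for i in range(M - 2):
        L[i, i] = 1.0 / dx**2
        L[i, i + 1] = -2.0 / dx**2
        L[i, i + 2] = 1.0 / dx**2
    return L
```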

That these two regularization schemes produce somewhat different results has long been recognized (Boschi and Dziewonski 1999). Numerical tests indicate that second-derivative regularization leads to greater suppression of short-wavelength features in the solution. However, while this issue can be approached empirically, we show here that a more theoretical approach has value, too, because it allows us to discern what regularization does to the structure of inverse problems in general. Such a treatment can provide insight into how the results (and side-effects) of a regularization scheme change as the underlying inverse problem is modified, for example, when in tomographic imaging a simple ray-based data kernel (Aki et al. 1976; Humphreys et al. 1984; see also Menke 2005) is replaced by a more complicated one that includes diffraction effects (e.g., a banana-doughnut kernel calculated using adjoint methods) (Tromp et al. 2005).

An important question is whether regularization works by smoothing the observations (making the data smoother) or by smoothing the data kernel (making the theory smoother). Our analysis, presented later in this paper, shows that it does both. Two important practical issues are how to choose a C h or an H to embody an intuitive form of smoothness, and how to assess the consequences of one choice over another. We show that the simple data smoothing problem is key to understanding these issues.

By data smoothing, we mean finding a set of model parameters that are a smoothed version of the data. This approach reduces the data kernel to its simplest form (G = I) and highlights the role of prior information in determining the solution. Even with this simplification, the relationships between \({\mathbf{C}}_{\text{h}}\) and H, and their effect on the solution, are still rather opaque. Surprisingly, an analysis of the continuum limit, where the number of model parameters becomes infinite and vectors become functions, provides considerable clarity. We are able to derive simple analytic formulae that relate \({\mathbf{C}}_{\text{h}}\) and H, as well as the smoothing kernels that relate the unsmoothed and smoothed data. The latter is of particular importance, because it allows assessment of whether or not the mathematical measure of smoothing corresponds to the intuitive one.

Finally, we show that the effect of regularization on the general inverse problem can be understood by decomposing it into the part equivalent to a simple data smoothing problem and the deviatoric part controlled by the nontrivial part of the data kernel. This decomposition allows us to investigate the respective effects of the smoothing constraints and the data constraints (via some theory, represented by the data kernel) on the solution. The former blurs the data (in the literal sense of the word), but we show that the data kernel is blurred in exactly the same way. Regularization partly works by smoothing the theory.

2 Background and Definitions

Generalized least squares (Levenberg 1944; Lawson and Hansen 1974; Tarantola and Valette 1982a, b; see also Menke 1984, 2012; Menke and Menke 2011) is built around a data equation, \({\mathbf{Gm}} = {\mathbf{d}}^{\text{obs}}\), which describes the relationship between unknown model parameters, m, and observed data, \({\mathbf{d}}^{\text{obs}}\), and a prior information equation, \({\mathbf{Hm}} = {\mathbf{h}}^{\text{pri}}\), which quantifies prior expectations (or “constraints”) about the behavior of the model parameters. The errors in the data equation and the prior information equation are assumed to be normally distributed with zero mean and covariances of \({\mathbf{C}}_{\text{d}}\) and \({\mathbf{C}}_{\text{h}}\), respectively.

The generalized error Φ is a measure of how well a given solution m satisfies the data and prior information:

$$\varPhi \left( {\mathbf{m}} \right) = \left[ {{\mathbf{d}}^{\text{obs}} - {\mathbf{Gm}}} \right]^{\text{T}} {\mathbf{C}}_{\text{d}}^{ - 1} \left[ {{\mathbf{d}}^{\text{obs}} - {\mathbf{Gm}}} \right] + \left[ {{\mathbf{h}}^{\text{pri}} - {\mathbf{Hm}}} \right]^{\text{T}} {\mathbf{C}}_{\text{h}}^{ - 1} \left[ {{\mathbf{h}}^{\text{pri}} - {\mathbf{Hm}}} \right].$$
(1)

Here \({\mathbf{d}}^{\text{obs}}\) are the observed data and \({\mathbf{h}}^{\text{pri}}\) is the specified prior information. The first term on the right-hand side represents the sum of squared errors in the observations, weighted by their certainty (that is, the reciprocal of their variance), and the second represents the sum of squared errors in the prior information, weighted by their certainty. The generalized least-squares principle asserts that the best estimate of the solution is the one that minimizes this combination of errors.

Suppose now that \({\mathbf{C}}_{\text{d}}^{ - 1} = {\mathbf{Q}}_{\text{d}}^{\text{T}} {\mathbf{Q}}_{\text{d}}\) and \({\mathbf{C}}_{\text{h}}^{ - 1} = {\mathbf{Q}}_{\text{h}}^{\text{T}} {\mathbf{Q}}_{\text{h}}\), for some matrices \({\mathbf{Q}}_{\text{d}}\) and \({\mathbf{Q}}_{\text{h}}\). We can rearrange Eq. (1) into the form \(\varPhi = \left[ {{\mathbf{f}} - {\mathbf{Fm}}} \right]^{\text{T}} {\mathbf{C}}_{\text{f}}^{ - 1} \left[ {{\mathbf{f}} - {\mathbf{Fm}}} \right]\) by defining

$${\mathbf{F}} = \left[ {\begin{array}{c} {{\mathbf{Q}}_{\text{d}} {\mathbf{G}}} \\ {{\mathbf{Q}}_{\text{h}} {\mathbf{H}}} \\ \end{array} } \right]\quad {\text{and}}\quad {\mathbf{f}} = \left[ {\begin{array}{c} {{\mathbf{Q}}_{\text{d}} {\mathbf{d}}^{\text{obs}} } \\ {{\mathbf{Q}}_{\text{h}} {\mathbf{h}}^{\text{pri}} } \\ \end{array} } \right]\quad {\text{and}}\quad {\mathbf{C}}_{\text{f}} = {\mathbf{I}}.$$
(2)

This is the form of a simple least-squares minimization of the error associated with the combined equation \({\mathbf{Fm}} = {\mathbf{f}}\). The matrices \({\mathbf{Q}}_{\text{d}}\) and \({\mathbf{Q}}_{\text{h}}\) have the interpretation of weighting matrices, with the top rows of \({\mathbf{Fm}} = {\mathbf{f}}\) being weighted by \({\mathbf{Q}}_{\text{d}}\) and the bottom rows by \({\mathbf{Q}}_{\text{h}}\). The least-squares equation and its solution are

$$\left[ {{\mathbf{F}}^{\text{T}} {\mathbf{F}}} \right]{\mathbf{m}}^{\text{est}} = {\mathbf{F}}^{\text{T}} {\mathbf{f}}\quad {\text{and}}\quad {\mathbf{m}}^{\text{est}} = {\mathbf{F}}^{ - g} {\mathbf{f}}\quad {\text{with}}\quad {\mathbf{F}}^{ - g} \equiv \left[ {{\mathbf{F}}^{\text{T}} {\mathbf{F}}} \right]^{ - 1} {\mathbf{F}}^{\text{T}}.$$
(3a,b)

Here \({\mathbf{m}}^{\text{est}}\) is the best estimate of the solution and the symbol \({\mathbf{F}}^{ - g}\) is used to denote the generalized inverse of the matrix F, that is, the matrix that “inverts” the relationship \({\mathbf{Fm}} = {\mathbf{f}}\).
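As a hedged sketch of how Eqs. (2)–(3) might be assembled numerically (the function and variable names are our own; this is not code from the paper):

```python
import numpy as np

def generalized_least_squares(G, d_obs, H, h_pri, Qd, Qh):
    """Stack the weighted data and prior-information equations into F m = f (Eq. 2)
    and solve the normal equations F^T F m = F^T f (Eq. 3a,b)."""
    F = np.vstack([Qd @ G, Qh @ H])
    f = np.concatenate([Qd @ d_obs, Qh @ h_pri])
    m_est, *_ = np.linalg.lstsq(F, f, rcond=None)  # numerically equivalent to (F^T F)^-1 F^T f
    return m_est
```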

An obvious choice of weighting matrices is \({\mathbf{Q}}_{\text{d}} = {\mathbf{C}}_{\text{d}}^{ - 1/2}\) and \({\mathbf{Q}}_{\text{h}} = {\mathbf{C}}_{\text{h}}^{ - 1/2}\), where \({\mathbf{C}}_{\text{d}}^{ - 1/2}\) and \({\mathbf{C}}_{\text{h}}^{ - 1/2}\) are symmetric square roots. However, any matrices that satisfy \({\mathbf{Q}}_{\text{d}}^{\text{T}} {\mathbf{Q}}_{\text{d}} = {\mathbf{C}}_{\text{d}}^{ - 1}\) and \({\mathbf{Q}}_{\text{h}}^{\text{T}} {\mathbf{Q}}_{\text{h}} = {\mathbf{C}}_{\text{h}}^{ - 1}\) are acceptable, even nonsymmetric ones. In fact, if \({\mathbf{T}}_{\text{d}}\) and \({\mathbf{T}}_{\text{h}}\) are arbitrary unitary matrices satisfying \({\mathbf{T}}_{\text{d}}^{\text{T}} {\mathbf{T}}_{\text{d}} = {\mathbf{I}}\) and \({\mathbf{T}}_{\text{h}}^{\text{T}} {\mathbf{T}}_{\text{h}} = {\mathbf{I}}\), then \({\mathbf{Q}}_{\text{d}} = {\mathbf{T}}_{\text{d}} {\mathbf{C}}_{\text{d}}^{ - 1/2}\) and \({\mathbf{Q}}_{\text{h}} = {\mathbf{T}}_{\text{h}} {\mathbf{C}}_{\text{h}}^{ - 1/2}\) are acceptable choices, too, since the unitary matrices cancel from the product \({\mathbf{Q}}_{\text{h}}^{\text{T}} {\mathbf{Q}}_{\text{h}}\). A nonsymmetric matrix \({\mathbf{Q}}_{\text{h}}\), with singular value decomposition \({\mathbf{U}}{\varvec{\Lambda}}{\mathbf{V}}^{\text{T}}\), can be transformed into a symmetric matrix \({\mathbf{Q}}_{\text{h}}^{\prime } = {\mathbf{C}}_{\text{h}}^{ - 1/2}\) by the transformation \({\mathbf{T}}_{\text{h}} = {\mathbf{VU}}^{\text{T}}\), since \({\mathbf{T}}_{\text{h}} {\mathbf{Q}}_{\text{h}} = {\mathbf{VU}}^{\text{T}} {\mathbf{U}}{\varvec{\Lambda}}{\mathbf{V}}^{\text{T}} = {\mathbf{V}}{\varvec{\Lambda}}{\mathbf{V}}^{\text{T}}\) is symmetric and since \({\mathbf{VU}}^{\text{T}}\), as the product of two unitary matrices, is itself unitary. For reasons that will become apparent later in the paper, we give \({\mathbf{Q}}_{\text{h}}^{ - 1}\) its own name, \({\mathbf{P}}_{\text{h}}\), so that \({\mathbf{C}}_{\text{h}} = {\mathbf{P}}_{\text{h}}^{\text{T}} {\mathbf{P}}_{\text{h}}\).
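The freedom in choosing the weighting matrices can be checked numerically. The following sketch is our own (the 2 × 2 covariance is an arbitrary illustration); it builds a symmetric square root, a nonsymmetric Cholesky alternative, and the SVD-based symmetrization described above:

```python
import numpy as np
from scipy.linalg import sqrtm, cholesky

Ch = np.array([[2.0, 0.5],
               [0.5, 1.0]])                         # an illustrative covariance matrix

Qh_sym = np.real(sqrtm(np.linalg.inv(Ch)))          # symmetric square root Ch^{-1/2}
Qh_chol = cholesky(np.linalg.inv(Ch), lower=False)  # nonsymmetric, but equally valid

# Both satisfy Qh^T Qh = Ch^{-1}.
assert np.allclose(Qh_sym.T @ Qh_sym, np.linalg.inv(Ch))
assert np.allclose(Qh_chol.T @ Qh_chol, np.linalg.inv(Ch))

# Symmetrize the nonsymmetric factor: Qh = U L V^T, Th = V U^T, so Th Qh = V L V^T.
U, lam, Vt = np.linalg.svd(Qh_chol)
Th = Vt.T @ U.T
Qh_prime = Th @ Qh_chol
assert np.allclose(Qh_prime, Qh_prime.T)
```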

Two other important quantities in inverse theory are the covariance \({\mathbf{C}}_{\text{m}}\) and resolution \({\mathbf{R}}\) of the estimated model parameters \({\mathbf{m}}^{\text{est}}\). The covariance expresses how errors in the data and prior information propagate into errors in the estimated model parameters. The resolution expresses the degree to which a given model parameter can be uniquely determined (Backus and Gilbert 1968, 1970; Wiggins 1972). These quantities are given by

$${\mathbf{C}}_{\text{m}} = {\mathbf{F}}^{ - g} {\mathbf{C}}_{\text{f}} {\mathbf{F}}^{ - g{\text{T}}} = \left[ {{\mathbf{F}}^{\text{T}} {\mathbf{F}}} \right]^{ - 1} {\mathbf{F}}^{\text{T}} {\mathbf{I}}\,{\mathbf{F}}\left[ {{\mathbf{F}}^{\text{T}} {\mathbf{F}}} \right]^{ - 1} = \left[ {{\mathbf{F}}^{\text{T}} {\mathbf{F}}} \right]^{ - 1},$$
(4)
$${\mathbf{R}} = {\mathbf{G}}^{ - g} {\mathbf{G}}\quad {\text{with}}\quad {\mathbf{G}}^{ - g} \equiv \left[ {{\mathbf{F}}^{\text{T}} {\mathbf{F}}} \right]^{ - 1} {\mathbf{G}}^{\text{T}} {\mathbf{C}}_{\text{d}}^{ - 1}.$$
(5)

Here the symbol \({\mathbf{G}}^{ - g}\) is used to denote the generalized inverse of the data kernel G, that is, the matrix that inverts the relationship \({\mathbf{Gm}} = {\mathbf{d}}\).
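A minimal sketch of Eqs. (4) and (5) in discrete form follows (the naming is our own; any factorization satisfying \({\mathbf{Q}}^{\text{T}} {\mathbf{Q}} = {\mathbf{C}}^{ - 1}\) would serve):

```python
import numpy as np

def covariance_and_resolution(G, H, Cd, Ch):
    """Posterior model covariance Cm = (F^T F)^-1 (Eq. 4) and resolution
    R = G^-g G with G^-g = (F^T F)^-1 G^T Cd^-1 (Eq. 5)."""
    Qd = np.linalg.cholesky(np.linalg.inv(Cd)).T   # Qd^T Qd = Cd^-1
    Qh = np.linalg.cholesky(np.linalg.inv(Ch)).T   # Qh^T Qh = Ch^-1
    F = np.vstack([Qd @ G, Qh @ H])
    FtF_inv = np.linalg.inv(F.T @ F)
    Cm = FtF_inv
    G_gen_inv = FtF_inv @ G.T @ np.linalg.inv(Cd)
    R = G_gen_inv @ G
    return Cm, R
```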

The foregoing will have been familiar to those who have taken a linear algebraic approach to inverse theory. We will take the continuum limit, replacing \({\mathbf{d}}^{\text{obs}}\) and \({\mathbf{m}}^{\text{est}}\) with the functions d(x) and m(x), where x is an independent variable (e.g., position). The matrix G becomes the linear operator \({\mathcal{G}}\), its transpose \({\mathbf{G}}^{\text{T}}\) becomes the adjoint \({\mathcal{G}}^{\dag }\) of the operator \({\mathcal{G}}\), and its inverse \({\mathbf{G}}^{ - 1}\) becomes the inverse \({\mathcal{G}}^{ - 1}\) of the operator \({\mathcal{G}}\). Depending upon context, we will interpret the identity matrix either as multiplication by 1 or convolution by the Dirac delta function, δ(x).

2.1 Formulation of the Simplified Data Smoothing Problem

In order to understand the role of prior information in determining the solution, we consider a simplified problem with \({\mathbf{G}} = {\mathbf{I}}\), \({\mathbf{C}}_{\text{d}} = \sigma_{\text{d}}^{2} {\mathbf{I}}\), \({\mathbf{Q}}_{\text{d}} = \sigma_{\text{d}}^{ - 1} {\mathbf{I}}\), and \({\mathbf{h}}^{\text{pri}} = 0\). These choices define a data smoothing problem, when m is viewed as a discretized version of a continuous function m(x). The model parameters \({\mathbf{m}}^{\text{est}}\) represent a smoothed version of the data \({\mathbf{d}}^{\text{obs}}\). We multiply Eq. (2) by \(\sigma_{\text{d}}\) so that the data equation is \({\mathbf{Gm}} = {\mathbf{d}}\) and the prior information equation, which quantifies just in what sense the data are smooth, is \(\sigma_{\text{d}} {\mathbf{Q}}_{\text{h}} {\mathbf{Hm}} = 0\). The matrices \({\mathbf{Q}}_{\text{h}}\) and H appear only as a product in Eq. (2), so we define \({\mathbf{L}} = \sigma_{\text{d}} {\mathbf{Q}}_{\text{h}} {\mathbf{H}}\). This implies that we can understand the prior information equation \({\mathbf{Lm}} = 0\) either as an equation of the form \({\mathbf{Hm}} = 0\) with nontrivial \({\mathbf{H}} \propto {\mathbf{L}}\) but trivial weighting \({\mathbf{Q}}_{\text{h}} = {\mathbf{I}}\), or as the equation \({\mathbf{Q}}_{\text{h}} {\mathbf{m}} = 0\) with the trivial \({\mathbf{H}} = {\mathbf{I}}\) but with nontrivial weighting \({\mathbf{Q}}_{\text{h}} \propto {\mathbf{L}}\). The effect is the same but, as was highlighted in the “Introduction” section, the interpretation is different. Subsequently, when we refer to \({\mathbf{Q}}_{\text{h}}\) (or \({\mathbf{C}}_{\text{h}}\) or \({\mathbf{P}}_{\text{h}}\)) it will be with the presumption that we are adopting the \({\mathbf{H}} = {\mathbf{I}}\) viewpoint. The combined equation is then

$$\sigma_{\text{d}} {\mathbf{Fm}} = \sigma_{\text{d}} {\mathbf{f}} \equiv \left[ {\begin{array}{c} {\mathbf{I}} \\ {\mathbf{L}} \\ \end{array} } \right]{\mathbf{m}} = \left[ {\begin{array}{c} {{\mathbf{d}}^{\text{obs}} } \\ 0 \\ \end{array} } \right]$$
(6)

with solution \({\mathbf{m}}^{\text{est}}\) obeying

$$\left( {{\mathbf{L}}^{\text{T}} {\mathbf{L}} + {\mathbf{I}}} \right) {\mathbf{m}}^{\text{est}} = {\mathbf{A}} {\mathbf{m}}^{\text{est}} = {\mathbf{d}}^{\text{obs}}.$$
(7)

Here A is an abbreviation for \(\left( {{\mathbf{L}}^{\text{T}} {\mathbf{L}} + {\mathbf{I}}} \right)\). In the continuum limit, this equation becomes

$$\left( {{\mathcal{L}}^{\dag } {\mathcal{L}} + 1} \right) m\left( x \right) = {\mathcal{A}}\left( x \right)m\left( x \right) = d\left( x \right).$$
(8)

Here \({\mathcal{A}}\left( x \right)\) is an abbreviation for \(\left( {{\mathcal{L}}^{\dag } {\mathcal{L}} + 1} \right)\). Finally, we mention that, when two prior information equations are available, say \({\mathbf{L}}_{\text{A}} {\mathbf{m}} = 0\) and \({\mathbf{L}}_{\text{B}} {\mathbf{m}} = 0\), Eq. (7) becomes

$$\left[ {\begin{array}{c} {\mathbf{I}} \\ {{\mathbf{L}}_{\text{A}} } \\ {{\mathbf{L}}_{\text{B}} } \\ \end{array} } \right] {\mathbf{m}} = \left[ {\begin{array}{c} {{\mathbf{d}}^{\text{obs}} } \\ 0 \\ 0 \\ \end{array} } \right]$$
(9)

and the discrete and continuum solutions satisfy the equations

$$\begin{aligned} \left( {{\mathbf{L}}_{\text{A}}^{\text{T}} {\mathbf{L}}_{\text{A}}^{} + {\mathbf{L}}_{\text{B}}^{\text{T}} {\mathbf{L}}_{\text{B}}^{} + {\mathbf{I}}} \right) {\mathbf{m}}^{\text{est}} = {\mathbf{d}}^{\text{obs}} \hfill \\ {\text{and}}\,\,\,\left( {{\mathcal{L}}_{A}^{\dag } {\mathcal{L}}_{A}^{ } + {\mathcal{L}}_{B}^{\dag } {\mathcal{L}}_{B}^{ } + 1} \right) m(x) = d(x). \hfill \\ \end{aligned}$$
(10a,b)
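For concreteness, here is a sketch of the discrete smoothing problem of Eqs. (7) and (10a); the grid size, noise level, and the choice of a first-derivative operator are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def smooth_data(d_obs, L_list):
    """Solve (sum_i L_i^T L_i + I) m_est = d_obs, i.e. Eq. (7) or (10a) with G = I."""
    A = np.eye(len(d_obs))
    for L in L_list:
        A = A + L.T @ L
    return np.linalg.solve(A, d_obs)

# Example: flatness (first-derivative) smoothing with strength eps.
M, eps = 200, 5.0
x = np.arange(M)
d_obs = np.sin(2 * np.pi * x / 50) + 0.2 * np.random.default_rng(0).standard_normal(M)
L = eps * np.diff(np.eye(M), axis=0)   # rows are eps * (m[i+1] - m[i])
m_est = smooth_data(d_obs, [L])
```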

2.2 Data Smoothing in the Continuum Limit

Equation (8) has the form of a linear differential equation with inhomogeneous source term d(x), and can therefore be solved using the method of Green functions. The Green function a(x, x′) satisfies the equation with an impulsive source,

$${\mathcal{A}} \,a(x,{x^{\prime}}) = \delta (x - {x^{\prime}}).$$
(11)

Here, δ(x − x′) is the Dirac delta function, that is, a single, spiky datum located at position x′. The Green function a(x, x′) represents the response of the smoothing process to this datum—the smoothing kernel. Once Eq. (11) has been solved for a particular choice of the operator \({\mathcal{A}}\), the solution for arbitrary data d(x) is given by the Green function integral:

$$m\left( x \right) = {\mathcal{A}}^{ - 1} d\left( x \right) \equiv \int a\left( {x,x^{\prime} } \right) d\left( {x^{\prime} } \right)\,{\text{d}}x^{\prime} \equiv \left\{ {a,d} \right\}.$$
(12)

Here we have introduced the inner product symbol {., .} for notational simplicity; it is just shorthand for the integral. The quantity \({\mathcal{A}}^{ - 1} (x)\) has the interpretation of a smoothing operator with kernel a(x, x′). In problems with translational invariance, Eq. (12) is equivalent to convolution by the function a(x); that is, \({\mathcal{A}}^{ - 1} d\left( x \right) = a\left( x \right)*d(x),\) where * denotes convolution.

Smoothing kernels are localized functions that typically have a maximum at the central point x′ and decline in both directions away from it. One example, which we will discuss in more detail later in this paper, is the two-sided declining exponential function \(a\left( x \right) = 1/2\;\varepsilon^{ - 1} \exp \left( { - \varepsilon^{ - 1} \left| {x - x'} \right|} \right)\), which smooths the data over a scale length ɛ.

Equations for the covariance of the estimated solution and the resolution can be constructed by taking the continuum limit of Eqs. (4) and (5), after making the simplification G = I:

$${\mathcal{A}} C_{m} \left( {x,x^{\prime}} \right) = \sigma_{d}^{2} \delta \left( {x - x^{\prime}} \right)\quad {\text{so}}\quad C_{m} \left( {x,x^{\prime}} \right) = \sigma_{d}^{2} a(x,x^{\prime}),$$
(13)
$${\mathcal{A}} R\left( {x,x^{\prime}} \right) = \delta \left( {x - x^{\prime}} \right)\quad {\text{so}}\quad R\left( {x,x^{\prime}} \right) = a(x,x^{\prime}).$$
(14)

Similarly, the relationship between the functions \(C_{h} (x,x^{\prime})\) and \(P_{h} (x,x^{\prime})\), which are the analogues of \({\mathbf{C}}_{\text{h}}\) and \({\mathbf{P}}_{\text{h}}\), can be constructed by taking the continuum limit of the equation \({\mathbf{C}}_{\text{h}} = {\mathbf{P}}_{\text{h}}^{\text{T}} {\mathbf{P}}_{\text{h}}\):

$$C_{h} = \left\{ {P_{h}^{\dag } ,P_{h} } \right\}.$$
(14)

These two functions satisfy the equations

$$\begin{aligned} \sigma_{d}^{ - 2} \left( {{\mathcal{L}}^{\dag } {\mathcal{L}}} \right) C_{h} \left( {x,x'} \right) = \delta \left( {x - x'} \right) \hfill \\ {\text{and}}\quad \sigma_{d}^{ - 1} {\mathcal{L}}^{\dag } P_{h} (x,x') = \delta (x - x'). \hfill \\ \end{aligned}$$
(15a,b)

Equation (15a) is derived by first taking the continuum limit of the equation \({\mathbf{C}}_{\text{h}} = \sigma_{\text{d}}^{2} \left[ {{\mathbf{L}}^{\text{T}} {\mathbf{L}}} \right]^{ - 1}\), which implies that \(\left\{ {C_{h} ,m} \right\}\) is the inverse operator of \(\sigma_{d}^{ - 2} \left( {{\mathcal{L}}^{\dag } {\mathcal{L}}} \right)m\). Then \(\left\{ {C_{h} , \sigma_{d}^{ - 2} \left( {{\mathcal{L}}^{\dag } {\mathcal{L}}} \right)m} \right\} = m = \left\{ {\sigma_{d}^{ - 2} \left( {{\mathcal{L}}^{\dag } {\mathcal{L}}} \right)^{\dag } C_{h} ,m} \right\} =\) \(\left\{ {\sigma_{d}^{ - 2} \left( {{\mathcal{L}}^{\dag } {\mathcal{L}}} \right)C_{h} ,m} \right\} = \left\{ {\delta ,m} \right\}\), so \(\sigma_{d}^{ - 2} \left( {{\mathcal{L}}^{\dag } {\mathcal{L}}} \right) C_{h} = \delta\). Equation (15b) is derived by first taking the continuum limit of \({\mathbf{P}}_{\text{h}} = \sigma_{\text{d}} {\mathbf{L}}^{ - 1}\), which implies that \(\left\{ {P_{h} ,m} \right\}\) is the inverse of \(\sigma_{d}^{ - 1} {\mathcal{L}}m\). Then \(\left\{ {P_{h} ,\sigma_{d}^{ - 1} {\mathcal{L}}m} \right\} = m = \left\{ {\sigma_{d}^{ - 1} {\mathcal{L}}^{\dag } P_{h} ,m} \right\} = \left\{ {\delta ,m} \right\}\), so \(\sigma_{d}^{ - 1} {\mathcal{L}}^{\dag } P_{h} = \delta\).

We will derive smoothing kernels for particular choices of prior information, \({\mathcal{L}}\), later in this paper. However, we first apply these ideas to the general inverse problem.

2.3 Smoothing within the General Problem

We examine the effect of regularization on an inverse problem with an arbitrary data kernel \({\mathbf{G}} \ne {\mathbf{I}}\). With the simplifications that the data are uncorrelated and of uniform variance (\({\mathbf{C}}_{\text{d}} = \sigma_{\text{d}}^{2} {\mathbf{I}}\)) and that the prior model is zero (\({\mathbf{h}}^{\text{pri}} = 0\)), Eq. (3a) becomes

$$\left( {{\mathbf{G}}^{\text{T}} {\mathbf{G}} + {\mathbf{L}}^{\text{T}} {\mathbf{L}}} \right) {\mathbf{m}} = {\mathbf{G}}^{\text{T}} {\mathbf{d}} \equiv {\tilde{\mathbf{m}}}\quad {\text{with}} \quad {\mathbf{L}} \equiv \sigma_{\text{d}} {\mathbf{C}}_{\text{h}}^{ - 1/2} {\mathbf{H}}.$$
(16)

We have introduced the abbreviation \({\tilde{\mathbf{m}}} \equiv {\mathbf{G}}^{\text{T}} {\mathbf{d}}\) to emphasize that the model m does not depend directly upon the data d, but rather on their back-projection G T d. In the continuum limit, this equation becomes

$$\left( {{\mathcal{G}}^{\dag } {\mathcal{G}} + { \mathcal{L}}^{\dag } {\mathcal{L}}} \right) m = {\mathcal{G}}^{\dag } d = \tilde{m}$$
(17)

with \({\mathcal{G}}\) the linear operator corresponding to the data kernel G. As before, \(\tilde{m} = {\mathcal{G}}^{\dag } d\) is the back-projected data. Now consider the special case where \({\mathcal{G}}^{\dag } {\mathcal{G}}\) is close to the identity operator 1, so that we can write

$$\left( {{\mathcal{G}}^{\dag } {\mathcal{G}} + { \mathcal{L}}^{\dag } {\mathcal{L}}} \right) m = \left[ {\left( {{\mathcal{L}}^{\dag } {\mathcal{L}} + 1} \right) + \left( {{\mathcal{G}}^{\dag } {\mathcal{G}} - 1} \right)} \right] m = \left( {{\mathcal{A}} + \omega {\mathcal{B}}} \right) m = \tilde{m},$$
(18)

where \({\mathcal{A}} \equiv ( {{\mathcal{L}}^{\dag } {\mathcal{L}} + 1})\), \(\omega {\mathcal{B}} \equiv ({\mathcal{G}}^{\dag } {\mathcal{G}} - 1)\), and where, by hypothesis, ω is a small parameter. We call \(\omega {\mathcal{B}}\) the deviatoric theory. It represents the “interesting” or “nontrivial” part of the inverse problem. The parameter ω is small either when \({\mathcal{G}}\) is close to the identity operator, or when it is close to being unitary. These restrictions can be understood by considering the special case where \({\mathcal{G}}\) corresponds to convolution with a function g(x). The first restriction implies g(x) ≈ δ(x); that is, g(x) is spiky. The second restriction implies that \(g(x){ \star }g(x) \approx \delta (x)\); that is, g(x) is sufficiently broadband that its autocorrelation is spiky. The latter condition is less restrictive than the former.

We now assume that the smoothing operator \({\mathcal{A}}^{ - 1}\) is known (e.g., by solving Eq. 8) and construct the inverse of \({\mathcal{A}} + \omega {\mathcal{B}}\) using perturbation theory (see Menke and Abbott 1989, their Problem 2.1). We first propose that the solution can be written as a power series in ω:

$$m = m_{0} + \omega m_{1} + \omega^{2} m_{2} + \cdots$$

(where the \(m_{i}\) are yet to be determined). Inserting this form of m into the inverse problem yields

$$\left( {{\mathcal{A}} + \omega {\mathcal{B}}} \right)\left( {m_{0} + \omega m_{1} + \omega^{2} m_{2} + \cdots } \right) = \tilde{m}.$$
(19)

By equating terms of equal powers in ω, we find that \(m_{0} = {\mathcal{A}}^{ - 1} \tilde{m}\), \(m_{1} = -{\mathcal{A}}^{ - 1} {\mathcal{B}\mathcal{A}}^{ - 1} \tilde{m}\), etc. The solution is then

$$m = \left( {1 + \mathop \sum \limits_{n = 1}^{\infty } \left( { - {\mathcal{A}}^{ - 1} \omega {\mathcal{B}}} \right)^{n} } \right) {\mathcal{A}}^{ - 1} \tilde{m},$$
(20)

and it follows from Eq. (18) that

$$m = \left( {{\mathcal{A}} + \omega {\mathcal{B}}} \right)^{ - 1} \tilde{m},$$
$${\text{so }} \left( {{\mathcal{A}} + \omega {\mathcal{B}}} \right)^{ - 1} = {\mathcal{A}}^{ - 1} - \left( {{\mathcal{A}}^{ - 1} \omega {\mathcal{B}}} \right){\mathcal{A}}^{ - 1} + \left( {{\mathcal{A}}^{ - 1} \omega {\mathcal{B}}} \right)\left( {{\mathcal{A}}^{ - 1} \omega {\mathcal{B}}} \right){\mathcal{A}}^{ - 1} - \cdots$$
(21)

Since \({\mathcal{A}}^{ - 1}\) represents a smoothing operator, that is, convolution by a smoothing kernel, say a(x), the solution can be rewritten

$$m = {\mathcal{G}}^{ - g} \left( {a * \tilde{m}} \right)$$
$${\text{with}}\quad {\mathcal{G}}^{ - g} = \left( {1 - \left( {a * \omega {\mathcal{B}}} \right) + \left( {a*\omega {\mathcal{B}}} \right) \left( {a*\omega {\mathcal{B}}} \right) - \cdots } \right).$$
(22)

Here we have introduced the abbreviation \({\mathcal{G}}^{ - g}\) to emphasize that the solution contains a quantity that can be considered a generalized inverse. The quantity \(\left( {a*\tilde{m}} \right)\) represents the smoothing of the back-projected data \(\tilde{m}\) by the smoothing kernel a, with the result that these data become smoother. The repeated occurrence of the quantity \(\left( {a*\omega {\mathcal{B}}} \right)\) in the expression for \({\mathcal{G}}^{ - g}\) represents the smoothing of the deviatoric theory \(\omega {\mathcal{B}}\) by the smoothing kernel a, with the result that the theory becomes smoother. The effect of smoothing on the theory is entirely contained in the interaction \(a*\omega {\mathcal{B}}\), so examining it is crucial for developing a sense of how a particular smoothing kernel affects the theory. Higher-order terms in the series for \({\mathcal{G}}^{ - g}\) involve repeated applications of the smoothing operator (a \(*\)), implying that they are preferentially smoothed. The number of terms in the expansion that are required to approximate the true \({\mathcal{G}}^{ - g}\) is clearly a function of the size of \((a*\omega {\mathcal{B}})\), such that, if \(\parallel a*\omega {\mathcal{B}}\parallel_{2} / \parallel\omega {\mathcal{B}}\parallel_{2}\) is small, the higher terms in the approximation rapidly become insignificant.
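The series can be evaluated without explicitly forming \({\mathcal{G}}^{ - g}\), by repeatedly applying \(- {\mathcal{A}}^{ - 1} \omega {\mathcal{B}}\) to the leading term. A discrete sketch (our own naming and structure, not code from the paper):

```python
import numpy as np

def series_solution(G, L, d_obs, n_terms=2):
    """Approximate the damped solution of Eq. (16) using the series of Eq. (20):
    m = sum_n (-A^-1 wB)^n A^-1 m_tilde, with A = L^T L + I and wB = G^T G - I."""
    M = G.shape[1]
    A = L.T @ L + np.eye(M)
    wB = G.T @ G - np.eye(M)
    m_tilde = G.T @ d_obs                      # back-projected data
    term = np.linalg.solve(A, m_tilde)         # n = 0 term: A^-1 m_tilde
    m = term.copy()
    for _ in range(1, n_terms):
        term = -np.linalg.solve(A, wB @ term)  # one more application of (-A^-1 wB)
        m = m + term
    return m
```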

We apply Eq. (22) to two exemplary inverse problems, chosen to demonstrate the range of behaviors that result from different types of theories. In the first, the deviatoric theory is especially rich in short-wavelength features, so smoothing has a large effect. In the second, the deviatoric theory is already very smooth, so the additional smoothing associated with the regularization has little effect.

Our first example is drawn from communication theory, and consists of the problem of “undoing” convolution by a code signal. Here the operator \({\mathcal{G}}\) corresponds to convolution by the code signal g(x), chosen to be a function that is piecewise constant in small intervals of length Δx, with a randomly assigned (but known) value in each interval. This function is very complicated and unlocalized (in contrast to spiky); it is a case where \({\mathcal{G}}\) is far from 1. However, because it is very broadband, its cross-correlation \(g\left( x \right){ \star }g\left( x \right)\) is spiky; it is a case where \({\mathcal{G}}^{\dag } {\mathcal{G}} \approx 1\). The deviatoric theory, which consists of the cross-correlation minus its central peak, \(\omega b\left( x \right) = g\left( x \right){ \star }g\left( x \right) - \delta (x)\), consists of short-wavelength oscillations around zero, so we expect that the smoothing \(a*\omega b\) will have a large effect on it. A numerical test with 100 intervals of Δx = 1 indicates that the decrease is about a factor of two: \(\parallel a*\omega b \parallel_{2} / \parallel \omega b \parallel_{2} \approx \tfrac{1}{2}\); that is, the ratio is significantly less than unity. This is a case where smoothing has a large effect on the theory. The test also shows that the exact and approximate solutions match closely, even when only the first two terms of the series are included in the approximation (Fig. 1). This latter result demonstrates the practical usefulness of Eq. (22) in simplifying an inverse problem.
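The ratio test described above can be reproduced in outline as follows. This is a rough sketch under our own assumptions (one sample per interval and an illustrative smoothing strength), so the printed ratio is indicative rather than an exact reproduction of the quoted factor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Piecewise-constant random code signal: 100 intervals of width dx = 1.
g = rng.standard_normal(100)
g /= np.linalg.norm(g)                  # so that the central peak of (g star g) equals 1

# Deviatoric theory wb = (g star g) - delta: the cross-correlation minus its central spike.
wb = np.correlate(g, g, mode="full")
wb[len(g) - 1] -= 1.0

# First-derivative smoothing (Case 1): a * wb is obtained by solving (I + L^T L) y = wb.
eps = 3.0                               # illustrative smoothing strength
N = len(wb)
L = eps * np.diff(np.eye(N), axis=0)
A = np.eye(N) + L.T @ L
a_wb = np.linalg.solve(A, wb)

print("||a*wb||_2 / ||wb||_2 =", np.linalg.norm(a_wb) / np.linalg.norm(wb))
```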

Fig. 1 Telegraph signal inverse problem. a The true model \(m^{\text{true}} (x)\) is a spike. b The observed data \(d^{\text{obs}} (x)\) are the true data \(g(x)*m^{\text{true}} (x)\) plus random noise. c An undamped inversion yields an estimated model \(m^{\text{est}} (x)\). d A damped inversion with \({\mathcal{L}} = \varepsilon \,{\text{d}}/{\text{d}}x\) and ɛ = 0.1 yields a smoother estimated model. e The first two terms of the series approximation for the generalized inverse yield a solution substantially similar to the one in d

Our second example is drawn from potential field theory and consists of the problem of determining the density m(x) of a linear arrangement of masses (e.g., seamount chain) from the vertical component \(f_{\text{v}} (x)\) of the gravitational field measured a distance h above them. Because gravitational attraction is a localized and smooth interaction, this is an example of the \({\mathcal{G}} \approx 1\) case. According to Newton’s law, the field due to a unit point mass at the origin is

$$f_{\text{v}} \left( x \right) = \gamma h(x^{2} + h^{2} )^{ - 3/2},$$
(23)

where γ is the gravitational constant. The scaled data \(d\left( x \right) = \tfrac{1}{2}\gamma^{ - 1} hf_{\text{v}} \left( x \right)\) then satisfy the equation

$${\mathcal{G} }m = g*m = d$$
$${\text{with}}\quad g\left( x \right) = \tfrac{1}{2} h^{2} (x^{2} + h^{2} )^{ - 3/2}.$$
(24)

Here, the scaling is chosen so that the gravitational response function g(x) has unit area, thus satisfying g(x) ≈ δ(x) for small h. The function g(x) is everywhere positive and decreases only slowly as |x| → ∞, so \(g\left( x \right){ \star }g\left( x \right)\) is everywhere positive and slowly decreasing, as well. Consequently, the regularization does not significantly smooth the deviatoric theory. A numerical test, with h = 2, indicates that \( ||a*\omega b||_{2} / ||\omega b||_{2} \approx 0.98 \); that is, it is not significantly less than unity. A relatively large number of terms (about 20) of the series are needed to achieve an acceptable match between the approximate and exact solutions (Fig. 2). In this case, Eq. (22) correctly describes the inverse problem, but cannot be used to simplify it.

Fig. 2 Gravity inverse problem. a The true model \(m^{\text{true}} (x)\) represents density. b The observed data \(d^{\text{obs}} (x)\) are the true data predicted by Newton’s law, plus random noise. c An undamped inversion yields an estimated model \(m^{\text{est}} (x)\) that is very noisy. d A damped inversion with \({\mathcal{L}} = \varepsilon \,{\text{d}}/{\text{d}}x\) and ɛ = 0.1 suppresses the noise, yielding an improved estimated model. e The first 20 terms of the series approximation for the generalized inverse yield a solution substantially similar to the one in d

These lessons, when applied to the issue of seismic imaging, suggest that regularization has a weaker smoothing effect on a banana-doughnut kernel than on a ray-based data kernel, because the former is already very smooth (which is generally good news). However, a stronger effect will occur in cases when the scale length of the ripples in the banana-doughnut kernel is similar to that of the side-lobes of the smoothing kernel. This problem can be avoided by using a smoothing kernel without side-lobes (which we will describe below).

Irrespective of the form of \({\mathcal{G}}\), regularization has the effect of smoothing the back-projected data \(\tilde{m}\), which leads to a smoother solution m. Further smoothing occurs for some data kernels (those with an oscillatory deviatoric theory), since the regularization also leads to a smoother generalized inverse. Smoothing of \(\tilde{m}\), which can be viewed as an approximate form of the solution, is arguably the intent of regularization. Smoothing of the deviatoric theory is arguably an undesirable side-effect. This second kind of smoothing is of particular concern when the smoothing kernel a(x) has side-lobes, since spurious structure can be introduced into the theory, or when \(a(x)\) has less than unit area, since structure can be suppressed. In the case studies below, we derive analytic formulae for a(x) for four common choices of prior information and analyze their properties to address these concerns. As we will put forward in more detail in the “Discussion and Conclusions” section, our overall opinion is that prior information that leads to a smoothing kernel with unit area and without side-lobes is the preferred choice, unless some compelling reason, specific to the particular inverse problem under consideration, indicates otherwise.

2.4 Four Case Studies

We discuss four possible ways of quantifying the intuitive notion of a function being smooth. In all cases, we assume that the smoothing is uniform over x, which corresponds to the case where \({\mathcal{L}}\) has translational invariance, so smoothing is by convolution with a kernel a(x). In Case 1, a smooth function is taken to be one with a small first derivative, a choice motivated by the notion that a function that changes only slowly with position is likely to be smooth. In Case 2, a smooth function is taken as one with large positive correlations that decay with distance for points separated by less than some specified scale length. This choice is motivated by the notion that the function must be approximately constant, which is to say smooth, over that scale length. In Case 3, a smooth function is taken to be one with small second derivative, a choice motivated by the notion that this derivative is large at peaks and troughs, so that a function with small second derivative is likely to be smooth. Finally, in Case 4, a smooth function is taken to be one that is similar to its localized average. This choice is motivated by the notion that averaging smooths a function, so that any function that is approximately equal to its own localized average is likely to be smooth. All four of these cases are plausible ways of quantifying smoothness. As we will show below, they all do lead to smooth solutions, but solutions that are significantly different from one another. Furthermore, several of these cases have unanticipated side-effects. We summarize the smoothing kernels for each of these choices in Table 1.

Table 1 Comparison of smoothing kernels for the different choices of smoothing scheme for the four cases considered

Case 1 We take flatness (small first derivative) as a measure of smoothness. The prior information equation is \(\varepsilon\, {\text{d}}m/{\text{d}}x = 0\), where \(\varepsilon = \sigma_{\text{d}} /\sigma_{h}\), so that \({\mathcal{L}} = \varepsilon \,{\text{d}}/{\text{d}}x\). The parameter ɛ quantifies the strength by which the flatness constraint is imposed. The smoothing kernel for this operator is (see "Appendix")

$$a\left( x \right) = \frac{{\varepsilon^{ - 1} }}{2}\exp \left( { - \varepsilon^{ - 1} \left| x \right|} \right).$$
(25)

The solution (Fig. 3) is well behaved, in the sense that the data are smoothed over a scale length ɛ without any change in their mean value [since a(x) has unit area]. Furthermore, the smoothing kernel monotonically decreases towards zero, without any side-lobes, so that the smoothing creates no extraneous features. The covariance and resolution of the estimated solution are

$$C_{m} (x) = \sigma_{d}^{2} a\left( x \right)\,\quad {\text{and}}\quad R(x) = a\left( x \right).$$
(26)
Fig. 3 The data smoothing problem implemented using each of the four cases, with ɛ = 3 and α = 0.4. a The true model \(m^{\text{true}} (x) = \sin \left( {A\pi x^{2} } \right)\) (black line) has noise added with standard deviation 0.2 to produce the hypothetical data \(d^{\text{obs}} (x)\) (black circles), to which the different smoothing solutions are applied to produce estimated models (colored lines). For Cases 1–4, the smoothed solutions have posterior r.m.s. errors of 0.10, 0.44, 0.07, and 0.19, respectively. b–e Numerical (grey) and analytic (colored) versions of the smoothing kernels, a(x), for each of the four smoothing schemes considered. The two versions agree closely

Note that the variance and resolution trade off, in the sense that the size of the variance is proportional to \(\varepsilon^{ - 1}\), whereas the width of the resolution is proportional to ɛ; as the strength of the flatness constraint is increased, the size of the variance decreases and the width of the resolution increases.
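Equation (25) can be checked numerically by discretizing \({\mathcal{A}} = 1 - \varepsilon^{2} {\text{d}}^{2} /{\text{d}}x^{2}\) and applying its inverse to a grid representation of the delta function, in the spirit of the numerical kernels shown in Fig. 3. This is a sketch; the grid spacing and ɛ value are our own choices:

```python
import numpy as np

M, dx, eps = 401, 0.1, 3.0
x = (np.arange(M) - M // 2) * dx

# Discrete A = I + L^T L with L = eps d/dx (first differences divided by dx).
D = np.diff(np.eye(M), axis=0) / dx
A = np.eye(M) + (eps * D).T @ (eps * D)

# Smoothing kernel: response to a unit-area spike at the center (Eq. 11).
delta = np.zeros(M)
delta[M // 2] = 1.0 / dx
a_numeric = np.linalg.solve(A, delta)

a_analytic = 0.5 / eps * np.exp(-np.abs(x) / eps)   # Eq. (25)
print("max |numeric - analytic| =", np.max(np.abs(a_numeric - a_analytic)))
```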

The autocorrelation of the data, \(X_{d} \left( x \right) = d\left( x \right){ \star }d\left( x \right)\), where \({ \star }\) signifies cross-correlation, quantifies the scale lengths present in the observations. In general, the autocorrelation of the model parameters, \(X_{m} \left( x \right) = m\left( x \right){ \star }m\left( x \right)\), will be different, because of the smoothing. The two are related by convolution with the autocorrelation of the smoothing kernel:

$$X_{m} \left( x \right) = \left[ {a\left( x \right)*d\left( x \right)} \right]{ \star }\left[ {a\left( x \right)*d\left( x \right)} \right] = X_{a} \left( x \right)*X_{d} \left( x \right),$$
(27)

where \(X_{a} \left( x \right) = a\left( x \right){ \star }a\left( x \right)\) (see Menke and Menke 2011, their Equation 9.24). The reader may easily verify (by direct integration) that the autocorrelation of Eq. (25) is

$$X_{a} \left( x \right) = \frac{{\varepsilon^{ - 2} }}{4} \left( {\left| x \right| + \varepsilon } \right)\exp \left( { - \varepsilon^{ - 1} \left| x \right|} \right).$$
(28)

This is a monotonically declining function of |x| with a maximum (without a cusp) at the origin. The smoothing broadens the autocorrelation (or auto-covariance) of the data in a well-behaved way.

The covariance function C h associated with this choice of smoothing is (see Eq. 47)

$$C_{h} \left( x \right) = \sigma_{d}^{2} \frac{{\varepsilon^{ - 2} }}{2}\left( {C_{0} - \left| x \right|} \right)\quad {\text{with}}\, C_{0} {\text{ arbitrary}}.$$
(29)

Note that the product \(\sigma_{d}^{2} \varepsilon^{ - 2}\) equals the prior variance \(\sigma_{h}^{2}\).

Case 2 In Case 1, we worked out the consequences of imposing a specific prior information equation \({\mathcal{L}}m = 0 ,\) among which was the equivalent covariance C h . Now we take the opposite approach, imposing C h and solving for, among other quantities, the equivalent prior information equation \({\mathcal{L}}m = 0.\) We use a two-sided declining exponential function:

$$C_{h} \left( {x - x'} \right) = \sigma_{d}^{2 } \varepsilon^{ - 2} \exp \left( { - \eta \left| {x - x'} \right|} \right) = \sigma_{d}^{2 } \frac{{2\varepsilon^{ - 2} }}{\eta } \frac{\eta }{2}\exp \left( { - \eta \left| {x - x'} \right|} \right).$$
(30)

This form of prior covariance was introduced by Abers (1994). Here \(\eta^{ - 1}\) is a scale factor that controls the decrease of covariance with separation distance (x − x′). The smoothing kernel is given by

$$a\left( x \right) = \gamma^{ - 2} \frac{\beta \gamma }{2}\exp \left( { - \beta \gamma \left| x \right|} \right),$$
(31)

where γ and β are functions of the smoothing weight ɛ and scale length \(\eta^{ - 1}\) (see Eq. 52) and \(\gamma^{ - 2} = \mathop \smallint \nolimits_{ - \infty }^{\infty } a\left( x \right) {\text{d}}x\). This smoothing kernel (Fig. 3) has the form of a two-sided, decaying exponential and so is identical in form to the one encountered in Case 1. As the variance of the prior information is made very large, \(\varepsilon^{ - 2} \to \infty\) and \(\gamma^{ - 2} \to 1\), implying that the area under the smoothing kernel approaches unity—a desirable behavior for a smoothing function. However, as the variance is decreased, \(\varepsilon^{ - 2} \to 0\) and \(\gamma^{ - 2} \to 0\), implying that the smoothing kernel tends toward zero area—an undesirable behavior, because it reduces the amplitude of the smoothed function, as shown in Fig. 3.

The behavior of the smoothing kernel at small variance can be understood by viewing the prior information as consisting of two equations: a flatness constraint of the form \({\mathcal{L}}_{A} m = \beta^{ - 1} {\text{d}}m/{\text{d}}x = 0\) (the same condition as in Case 1) and an additional smallness constraint of the form \({\mathcal{L}}_{B} m = \mu m = 0\), with \(\mu^{2} = \gamma^{2} - 1\) by construction. When combined via Eq. (10b), the two equations lead to the same differential operator as in Case 1 (see Eq. 51):

$$\left( {{\mathcal{L}}_{\text{A}}^{\dag } {\mathcal{L}}_{\text{A}} + {\mathcal{L}}_{\text{B}}^{\dag } {\mathcal{L}}_{\text{B}} + 1} \right)a(x) = \gamma^{2} \left( { - \beta^{ - 2} \gamma^{ - 2} \frac{{{\text{d}}^{2} }}{{{\text{d}}x^{2} }} + 1} \right)a(x) = \delta \left( x \right).$$
(32)

Note that the strength of the smallness constraint is proportional to \(\mu = \varepsilon \left( {\frac{\eta }{2}} \right)^{\frac{1}{2}}\), which depends on both η and ɛ. The smallness constraint leads to a smoothing kernel with less than unit area, since it causes the solution m(x) to approach zero as ɛ → ∞ and μ → ∞. No combination of ɛ and η can eliminate the smallness constraint while still preserving the two-sided declining exponential form of the smoothing kernel.

Case 3 We quantify the smoothness of m(x) by the smallness of its second derivative. The prior information equation is \(\varepsilon \,{\text{d}}^{2} m/{\text{d}}x^{2} = 0\), implying \({\mathcal{L}} = \varepsilon\,{\text{d}}^{2} /{\text{d}}x^{2}\). Since the second derivative is self-adjoint, we have

$${\mathcal{L}}^{\dag } {\mathcal{L}} = \varepsilon^{2} \frac{{{\text{d}}^{4} }}{{{\text{d}}x^{4} }}.$$
(33)

This differential equation yields the smoothing kernel

$$a(x) = V { \exp }\left( { - \left| x \right|/\lambda } \right) \left\{ {\cos \left( {\left| x \right|/\lambda } \right) + \sin \left( {\left| x \right|/\lambda } \right)} \right\}.$$
(34)

See Eq. 56 for the definition of the constants V and λ. The covariance function C h is given by (see Eq. 58)

$$C_{h} \left( x \right) = - \sigma_{d}^{ - 2 } \frac{{\varepsilon^{ - 2} }}{12}\left( {C_{0} - \left| {x^{3} } \right|} \right)\quad {\text{with}} \;C_{0} {\text{ arbitrary}}.$$
(35)

This smoothing kernel arises in civil engineering, where it represents the deflection a(x) of an elastic beam with flexural rigidity \(\varepsilon^{2}\) floating on a fluid foundation, due to a point load at the origin (Hetenyi 1979). In our example, the model m(x) is analogous to the deflection of the beam and the data to the load; that is, the model is a smoothed version of the data just as a beam’s deflection is a smoothed version of its applied load. Furthermore, variance is analogous to the reciprocal of flexural rigidity. The beam will take on a shape that exactly mimics the load only in the case when it has no rigidity, that is, infinite variance. For any finite rigidity, the beam will take on a shape that is a smoothed version of the load, where the amount of smoothing increases with \(\varepsilon^{2}\).

The area under this smoothing kernel can be determined by computing its Fourier transform, since the area equals the zero-wavenumber value. Transforming position x to wavenumber k in Eq. (8), with \({\mathcal{L}}^{\dag } {\mathcal{L}}\) given by Eq. (33), gives \((\varepsilon^{2} k^{4} + 1)\,a(k) = 1\), which implies a(k = 0) = 1; that is, the smoothing kernel has unit area. This is a desirable property. However, the smoothing kernel (Fig. 3) also has small undesirable side-lobes.
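The wavenumber-domain argument suggests a quick numerical check: synthesize \(a(k) = (\varepsilon^{2} k^{4} + 1)^{ - 1}\) on a grid, inverse transform it, and inspect the area and side-lobes. The grid values below are our own illustrative choices:

```python
import numpy as np

M, dx, eps = 4096, 0.05, 3.0
k = 2 * np.pi * np.fft.fftfreq(M, d=dx)

a_hat = 1.0 / (eps**2 * k**4 + 1.0)                      # from (eps^2 k^4 + 1) a(k) = 1
a = np.fft.fftshift(np.real(np.fft.ifft(a_hat))) / dx    # back to a(x) on the grid

print("area under a(x):", a.sum() * dx)                  # equals a(k=0) = 1 (unit area)
print("most negative value of a(x):", a.min())           # small negative side-lobes
```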

Case 4 The prior information equation is that m(x) is close to its localized average s(x) * m(x), where s(x) is a localized smoothing kernel. We use the same two-sided declining exponential form as in Case 1 (Eq. 25) to perform the averaging:

$$s\left( x \right) = \frac{\eta }{2}\exp \left\{ { - \eta \left| x \right|} \right\}.$$
(36)

The prior information equation is then

$${\mathcal{L}}m = \varepsilon \left[ {\delta \left( x \right) - s(x)} \right]*m = 0.$$
(37)

Both \(s\left( x \right)\) and the Dirac delta function are symmetric, so the operator \({\mathcal{L}}\) is self-adjoint. The smoothing kernel for this case is

$$a\left( x \right) = \left( {1 - AD} \right) \delta \left( x \right) - A \left\{ {S\;\sin \left( {\eta q\left| x \right|/r} \right) - C {\text{cos}}\left( {\eta q\left| x \right|/r} \right)} \right\} {\text{exp}}\left( { - \eta p\left| x \right|/r} \right).$$
(38)

See Eq. 63 for the definition of the constants A, D, S, C, q, and r. The smoothing kernel a(x) (Fig. 3) consists of the sum of a Dirac delta function and a spatially distributed function reminiscent of the elastic beam solution in Case 3. Thus, the function m(x) is a weighted sum of the data d(x) and a smoothed version of that same data. Whether this solution represents a useful type of smoothing is debatable; it serves to illustrate that peculiar behaviors can arise out of seemingly innocuous forms of prior information. The area under this smoothing kernel (see Eq. 64) is unity, a desirable property. However, like Case 3, the solution also has small undesirable side-lobes.
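As a hedged sketch (our own construction, not the paper's Eq. 63), the Case 4 kernel can also be examined in the wavenumber domain: since \({\mathcal{L}}\) is convolution by ɛ[δ(x) − s(x)], the kernel satisfies \(\left( {1 + \varepsilon^{2} \left( {1 - s(k)} \right)^{2} } \right)a(k) = 1\), where \(s(k) = \eta^{2} /(\eta^{2} + k^{2} )\) is the Fourier transform of the exponential averaging function of Eq. (36). The parameter values below are illustrative:

```python
import numpy as np

M, dx, eps, eta = 4096, 0.05, 3.0, 0.4
k = 2 * np.pi * np.fft.fftfreq(M, d=dx)

s_hat = eta**2 / (eta**2 + k**2)                    # transform of s(x) = (eta/2) exp(-eta |x|)
a_hat = 1.0 / (1.0 + eps**2 * (1.0 - s_hat)**2)
a = np.fft.fftshift(np.real(np.fft.ifft(a_hat))) / dx

print("area under a(x):", a.sum() * dx)             # a(k=0) = 1, i.e. unit area
# The nonzero limit a_hat -> 1/(1+eps^2) as |k| -> infinity shows up as the delta-function
# part of a(x): a grid spike of height about (1/(1+eps^2))/dx at x = 0.
```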

3 Discussion and Conclusions

The main result of this paper is to show that the consequences of particular choices of regularization in inverse problems can be understood in considerable detail by analyzing the data smoothing problem in its continuum limit. This limit converts the usual matrix equations of generalized least squares into differential equations. Even though matrix equations are easy to solve using a computer, they usually defy simple analysis. Differential equations, on the other hand, often can be solved exactly, allowing the behavior of their solutions to be probed analytically.

A key result is that the solution to the general inverse problem depends on a smoothed version of the back-projected data \(\tilde{m}\) and a smoothed version of the theory, as quantified by the deviatoric theory \(\omega {\mathcal{B}}\) (Eq. 22). The leading-order term reproduces the behavior of the simple \({\mathcal{G}} = 1\) data smoothing problem (considered in the case studies); that is, m 0 is just a smoothed version of the back-projected data \(\tilde{m}\). However, in the general \({\mathcal{G}} \ne 1\) case, regularization (damping) also adds smoothing inside the generalized inverse \({\mathcal{G}}^{ - g}\), making it in some sense “simpler.” Furthermore, the higher-order terms, which are important when \({\mathcal{G}}^{\dag } {\mathcal{G}}\) is dissimilar from 1, are preferentially smoothed. In all cases, the smoothing is through convolution with the smoothing kernel a(x), the solution to the simple \(( {1 + {\mathcal{L}}^{\dag } {\mathcal{L}}}) a = \delta\) problem. Thus, the solution to the simple problem controls the way smoothing occurs in the more general one.

We have also developed the link between prior information expressed as a constraint equation of the form \({\mathbf{Hm}} = {\mathbf{h}}\) and that same prior information expressed as a covariance matrix \({\mathbf{C}}_{\text{h}}\). Starting with a particular H or \({\mathbf{C}}_{\text{h}}\), we have worked out the corresponding \({\mathbf{C}}_{\text{h}}\) or H, as well as the smoothing kernel. This smoothing kernel is precisely equivalent to the Green function, or to the generalized inverse familiar from the classic, linear algebraic approach.

An interesting result is that prior information implemented as a prior covariance with the form of a two-sided declining exponential function is exactly equivalent to a pair of constraint equations, one of which suppresses the first derivative of the model parameters and another that suppresses their size. In this case, the smoothing kernel is a two-sided declining exponential with an area less than or equal to unity; that is, it both smooths and reduces the amplitude of the observations.

Our results allow us to address the question of which form of regularization best implements an intuitive notion of smoothing. There is, of course, no authoritative answer to this question. Each of the four cases we have considered, and many others besides, implements a reasonable form of smoothing; any one of them might arguably be best for a specific problem. Yet simpler is often better. We put forward first-derivative regularization as an extremely simple and effective choice, with few drawbacks. The corresponding smoothing kernel has the key attributes of unit area and no side-lobes. The scale length of the smoothing depends on a single parameter, ɛ. Its only drawback is that it possesses a cusp at the origin, which implies that it suppresses higher wavenumbers relatively slowly, as \(k^{ - 2}\). Its autocorrelation, on the other hand, has a simple maximum (without a cusp) at the origin, indicating that it widens the auto-covariance of the observations in a well-behaved fashion.

Furthermore, first-derivative regularization has a straightforward generalization to higher dimensions. One merely writes a separate first-derivative equation for each independent variable (say, \(x, y, z\)):

$${\mathcal{L}}_{\text{A}} m = \varepsilon \frac{\partial }{\partial x}m = 0\quad {\text{and}}\quad {\mathcal{L}}_{\text{B}} m = \varepsilon \frac{\partial }{\partial y}m = 0\quad {\text{and}}\quad {\mathcal{L}}_{\text{C}} m = \varepsilon \frac{\partial }{\partial z}m = 0.$$
(39)

The least-squares minimization will suppress the sum of squared errors of these equations, which is to say, the Euclidean length of the gradient vector ∇m. According to Eq. (11), the smoothing kernel satisfies the screened Poisson equation,

$$\left( {\nabla^{2} - \varepsilon^{ - 2} } \right) a\left( {\mathbf{x}} \right) = - \varepsilon^{ - 2} \delta ({\mathbf{x}}),$$
(40)

which has two- and three-dimensional solutions (Wikipedia 2014),

$$a_{{2{\text{D}}}} \left( {\mathbf{x}} \right) = \frac{{\varepsilon^{ - 2} }}{2\pi }{{K}}_{0} \left( {\varepsilon^{ - 1} r} \right)\quad {\text{and}}\quad a_{{3{\text{D}}}} \left( {\mathbf{x}} \right) = \frac{{\varepsilon^{ - 2} }}{4\pi r}{ \exp }\left( { - \varepsilon^{ - 1} r} \right)\quad {\text{with}}\quad r = \left| {\mathbf{x}} \right|.$$
(41)

Here, \(K_{0}\) is the modified Bessel function of the second kind. Both of these multidimensional smoothing kernels, like the 1D version examined in Case 1, have unit area and no side-lobes, indicating that first-derivative regularization will be effective when applied to these higher-dimensional problems.
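As a final numerical check (our own sketch, using SciPy's Bessel function and an illustrative ɛ), the unit-area property of the kernels in Eq. (41) can be verified by radial integration:

```python
import numpy as np
from scipy.special import k0
from scipy.integrate import quad

eps = 3.0

# 2-D kernel of Eq. (41): area = integral of a_2D(r) * 2*pi*r dr; the integrand
# simplifies to eps^-2 * r * K0(r/eps) (taken as 0 at r = 0).
area_2d, _ = quad(lambda r: eps**-2 * r * k0(r / eps) if r > 0 else 0.0, 0, np.inf)

# 3-D kernel of Eq. (41): area = integral of a_3D(r) * 4*pi*r^2 dr = eps^-2 * r * exp(-r/eps).
area_3d, _ = quad(lambda r: eps**-2 * r * np.exp(-r / eps), 0, np.inf)

print(area_2d, area_3d)   # both should be close to 1
```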