1 Introduction

This paper is a contribution to the approximation of noisy scattered data on the d-dimensional unit sphere \(\mathbb {S}^d\). The most important case in practice is \(d=2\), where numerous geophysical problems require the fitting of noisy data available only at scattered points. Examples include gravitational potential, magnetic field intensity, topography, and ground or sea level temperature or pressure (see for example [8]). Often a series of independent local approximations on patches will suffice, but here we seek a single global approximation.

Our ultimate approximation consists of radial basis functions centered at the given data points, thus allowing a finer approximation wherever the data points are concentrated, perhaps because of greater variability of the physical quantity in that region. The radial basis functions are generated by a single strictly positive definite kernel.

The approximation is defined (see Sect. 3) to be the minimizer in the native space for that kernel of a certain quadratic functional. This functional is the sum of the squared \(\ell _2\) norm of the discrepancy between the values of the approximation and the noisy data values at the data points, and a smoothing term (or regularization term) consisting of a smoothing parameter (or regularization parameter) \(\lambda \) multiplied by the squared native space norm of the approximation.

The main results (see Theorem 4.1) concern the choice of that smoothing parameter. The choice should depend on the nature and magnitude of the noise. The statistical and simulation literatures contain a large number of strategies for choosing the smoothing parameter for ill-posed problems and variable selection, see for example [4, 32]. In this paper the data errors are considered to be deterministic rather than random. By this we mean that we assume that the data errors at the N data sites \(\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\in \mathbb {S}^d\) are given by an error vector (or noise vector) \({\varvec{\varepsilon }}=(\varepsilon _1,\varepsilon _2,\ldots ,\varepsilon _N)^T\), with the assumption that only \(\Vert {\varvec{\varepsilon }}\Vert _2\) is known. (Here, for \(k=1,2,\ldots ,N\), \(\varepsilon _k\) is the data error at \(\mathbf {x}_k\), and \(\Vert {\varvec{\varepsilon }}\Vert _2\) is the Euclidean norm of the noise vector.) We do not assume that this noise has a known or unknown stochastic distribution, and our error estimates are deterministic rather than in terms of expectation values.

For a given level of data error, we consider four strategies for choosing the smoothing parameter \(\lambda \), namely Morozov’s discrepancy principle (which is an a posteriori strategy) and three a priori strategies. In all four cases we obtain, under the requirement that the native space is equivalent to a Sobolev space \(H^s\) with \(s>d/2\), \(L_2\) bounds on the approximation error (see Theorem 4.1) that are a sum of two terms: one of order \(h_X^s\) and one of order \(h_X^{d/2}\Vert {\varvec{\varepsilon }}\Vert _2\), where \(h_X\) is the mesh norm of the set \(X=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\) of data sites.

The theoretical results, which make essential use of an \(L_2\) sampling inequality for the sphere, slightly favor the discrepancy principle.

The theory developed in this paper is especially relevant to the compactly supported radial basis functions of Wendland (see [28,29,30]), for which the native space is in every case equivalent to a Sobolev space. The compactness of the support allows the use of sparse matrix technology in solving the resulting linear systems, greatly increasing the efficiency of the approximation scheme.

The structure of the paper is as follows: In Sect. 2 we introduce the required notation and background material on the sphere, function spaces on the sphere, and radial basis functions. In Sect. 3 the approximation is introduced and discussed in the context of penalized least squares approximation (see, for example, [1, 25] for a general discussion of penalized least squares approximation, [2] for a particular polynomial variant, and [6, 24] for this topic in the context of learning theory). In Sect. 4 we present the main result, Theorem 4.1, on \(L_2\) error estimates for the smoothing approximation from noisy scattered data on the sphere using radial basis functions. In Sect. 5 we state the \(L_2\) sampling inequality for the sphere and sketch its proof: using the method from [15], the \(L_2\) sampling inequality of [3, Theorem 4.1] can be lifted to the sphere by the use of charts. This inequality is an essential ingredient of the proof of our main result. In Sect. 6 we give the proof of the main result, Theorem 4.1. In Sect. 7 we give a numerical illustration of the different strategies and discuss their practical implementation.

2 Notation and background

In this section, the general notation and background material are introduced.

Throughout the paper generic positive constants are denoted by \(c,\tilde{c},\ldots \), and may have different values at different places. They may depend on the sphere dimension d, the Sobolev space index s, and the choice of the radial basis function in our approximation scheme.

For two sequences \(\{a_n\}\) and \(\{b_n\}\), the notation \(a_n\asymp b_n\) means that there exist positive constants c and \(\tilde{c}\) such that \(c a_n \le b_n \le \tilde{c} a_n\) for all n.

2.1 The sphere, spherical geometry, and the mesh norm

For \(\mathbf {x},\mathbf {y}\in \mathbb {R}^{d+1}\), let \(\mathbf {x}\cdot \mathbf {y}\) denote the Euclidean inner product and let \(\Vert \mathbf {x}\Vert _2=\sqrt{\mathbf {x}\cdot \mathbf {x}}\) denote the Euclidean norm. The unit sphere \(\mathbb {S}^d\) in \(\mathbb {R}^{d+1}\) is given by

$$\begin{aligned} \mathbb {S}^d := \left\{ \mathbf {x}\in \mathbb {R}^{d+1} \,:\, \Vert \mathbf {x}\Vert _2 = 1 \right\} . \end{aligned}$$

The unit sphere \(\mathbb {S}^d\) has the (Lebesgue) surface area

$$\begin{aligned} |\mathbb {S}^d| = \frac{2 \pi ^{(d+1)/2}}{\varGamma \big ((d+1)/2\big )}. \end{aligned}$$

The separation between any two points \(\mathbf {x}\) and \(\mathbf {y}\) on \(\mathbb {S}^d\) is measured by the geodesic distance \(\mathrm {dist}(\mathbf {x},\mathbf {y})\in [0,\pi ]\), defined by

$$\begin{aligned} \mathrm {dist}(\mathbf {x},\mathbf {y}) := \arccos (\mathbf {x}\cdot \mathbf {y}). \end{aligned}$$

Let \(X=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\subset \mathbb {S}^d\) denote a set of N distinct points on \(\mathbb {S}^d\). The mesh norm of X is defined by

$$\begin{aligned} h_X := \sup _{\mathbf {y}\in \mathbb {S}^d} \min _{\mathbf {x}_j\in X} \mathrm {dist}(\mathbf {y},\mathbf {x}_j). \end{aligned}$$
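The mesh norm can be estimated numerically by replacing the supremum over \(\mathbb {S}^d\) by a maximum over a large number of uniformly random test points; the following minimal sketch (in Python with NumPy; the function names and the Monte Carlo approximation are ours, not part of the theory) produces a lower bound for \(h_X\) that approaches \(h_X\) as the number of test points grows.

```python
import numpy as np

def geodesic_dist(X, Y):
    """dist(x, y) = arccos(x . y) for rows of X and Y (unit vectors)."""
    # clip guards against round-off pushing inner products outside [-1, 1]
    return np.arccos(np.clip(X @ Y.T, -1.0, 1.0))

def mesh_norm_estimate(X, d=2, n_test=100_000, seed=0):
    """Monte Carlo lower bound for h_X = sup_y min_j dist(y, x_j) on S^d."""
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n_test, d + 1))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # uniform points on S^d
    return geodesic_dist(Y, X).min(axis=1).max()    # sup_y min_j dist(y, x_j)
```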

2.2 Function spaces on the sphere

The space \(L_2 = L_2(\mathbb {S}^d)\) is the usual Hilbert space of square-integrable functions on \(\mathbb {S}^d\) endowed with the inner product

$$\begin{aligned} (f,g)_{L_2} := \int _{\mathbb {S}^d} f(\mathbf {x}) g(\mathbf {x}) \,\mathrm {d}\omega _d(\mathbf {x}), \end{aligned}$$

and the induced norm is \(\Vert f\Vert _{L_2}:=\sqrt{(f,f)_{L_2}}\). Here \(\,\mathrm {d}\omega _d\) denotes the (Lebesgue) surface measure on \(\mathbb {S}^d\). The space of continuous functions on \(\mathbb {S}^d\) is denoted by \( C = C(\mathbb {S}^d)\) and is endowed with the supremum norm

$$\begin{aligned} \Vert f\Vert _{C} := \sup _{\mathbf {x}\in \mathbb {S}^d} |f(\mathbf {x})|. \end{aligned}$$

The space \(\mathbb {P}_L = \mathbb {P}_L(\mathbb {S}^d)\) of all spherical polynomials on \(\mathbb {S}^d\) of degree \(\le L\) consists of the restrictions to \(\mathbb {S}^d\) of all polynomials on \(\mathbb {R}^{d+1}\) of degree \(\le L\). The dimension of \(\mathbb {P}_L\) is given by

$$\begin{aligned} d_L = \dim \big (\mathbb {P}_L\big ) = \frac{(2L+d) \varGamma (L+d)}{\varGamma (d+1) \varGamma (L+1)} \asymp (L+1)^d . \end{aligned}$$

A spherical harmonic of degree \(\ell \in \mathbb {N}_0\) is the restriction to \(\mathbb {S}^d\) of a harmonic homogeneous polynomial of exact degree \(\ell \) on \(\mathbb {R}^{d+1}\) (see [7, Section 11.2]). The space \(\mathbb {H}_\ell = \mathbb {H}_\ell (\mathbb {S}^d)\) of spherical harmonics of degree \(\ell \in \mathbb {N}_0\) (together with the zero function) has the dimension \(Z(d,\ell ) = \dim (\mathbb {H}_\ell )\), given by

$$\begin{aligned} Z(d,0) = 1; \quad Z(d,\ell ) = \frac{(2\ell +d-1) \varGamma (\ell +d-1)}{\varGamma (d) \varGamma (\ell +1)}, \quad \ell \in \mathbb {N}. \end{aligned}$$

Note that \(Z(d,\ell )\asymp (\ell +1)^{d-1}\). In this paper, for any \(\ell \in \mathbb {N}_0\), the set

$$\begin{aligned} \left\{ Y_{\ell ,k} \,:\, k=1,2,\ldots ,Z(d,\ell ) \right\} \end{aligned}$$
(2.1)

denotes a real \(L_2\)-orthonormal basis of \(\mathbb {H}_\ell \). Furthermore, \(\mathbb {P}_L=\bigoplus _{\ell =0}^L \mathbb {H}_\ell \), and, in particular, the union over \(\ell =0,1,\ldots ,L\) of the sets (2.1) forms an \(L_2 \)-orthonormal basis of \(\mathbb {P}_L\).

For fixed \(\alpha ,\beta >-1\), \(P_\ell ^{(\alpha ,\beta )}\) denotes the Jacobi polynomial of degree \(\ell \) with indices \(\alpha \) and \(\beta \) (see [23, Chapter IV]). The Jacobi polynomials \(\{P_\ell ^{(\alpha ,\beta )}\}_{\ell \in \mathbb {N}_0}\) are orthogonal on the interval \([-1,1]\) with respect to the weighted inner product

$$\begin{aligned} (f,g)_{L_2^{(\alpha ,\beta )}([-1,1])} := \int _{-1}^1 f(t) g(t) (1-t)^\alpha (1+t)^\beta \,\mathrm {d}t, \end{aligned}$$

and the induced norm is \(\Vert f\Vert _{L_2^{(\alpha ,\beta )}([-1,1])} = \sqrt{(f,f)_{L_2^{(\alpha ,\beta )}([-1,1])}}\), where the normalization is such that (see [23, (4.1.1)])

$$\begin{aligned} P_\ell ^{(\alpha ,\beta )}(1) = \frac{\varGamma (\ell +\alpha +1)}{\varGamma (\alpha +1) \varGamma (\ell +1)} \end{aligned}$$
(2.2)

and (see [23, (4.3.3)])

$$\begin{aligned} \big \Vert P_\ell ^{(\alpha ,\beta )}\big \Vert _{L_2^{(\alpha ,\beta )}([-1,1])}^2 = \frac{2^{\alpha +\beta +1}}{(2\ell +\alpha +\beta +1)} \frac{\varGamma (\ell +\alpha +1) \varGamma (\ell +\beta +1)}{ \varGamma (\ell +1) \varGamma (\ell +\alpha +\beta +1)}. \end{aligned}$$
(2.3)

We denote the space of all measurable functions f on \([-1,1]\) for which \(\Vert f\Vert _{L_2^{(\alpha ,\beta )}([-1,1])}<\infty \) by \(L_2^{(\alpha ,\beta )}([-1,1])\). The Jacobi polynomials \(\{P_{\ell }^{(\alpha ,\beta )}\}_{\ell \in \mathbb {N}_0}\) form a complete orthogonal system in \(L_2^{(\alpha ,\beta )}([-1,1])\).

The spherical harmonics on \(\mathbb {S}^d\) of degree \(\ell \) satisfy the addition theorem (see [7, Section 11.4] and [20, Section 4.1, Lemma 4.5 and Theorem 4.7]): for any \(L_2\)-orthonormal basis (2.1) of \(\mathbb {H}_\ell \) we have

$$\begin{aligned} \sum _{k=1}^{Z(d,\ell )} Y_{\ell ,k}(\mathbf {x}) Y_{\ell ,k}(\mathbf {y}) = \frac{Z(d,\ell )}{|\mathbb {S}^d|} \frac{P_\ell ^{((d-2)/2,(d-2)/2)}(\mathbf {x}\cdot \mathbf {y})}{P_\ell ^{((d-2)/2,(d-2)/2)}(1)}. \end{aligned}$$
(2.4)
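For \(d=2\) the identity (2.4) can be spot-checked numerically: \(Z(2,\ell )=2\ell +1\), \(|\mathbb {S}^2|=4\pi \), and \(P_\ell ^{(0,0)}/P_\ell ^{(0,0)}(1)\) is the Legendre polynomial. A minimal sketch, assuming SciPy's complex spherical harmonics scipy.special.sph_harm and its angle convention (azimuthal angle first, polar angle second); since the addition theorem holds for any orthonormal basis of \(\mathbb {H}_\ell \), the complex basis can be used provided one factor is conjugated:

```python
import numpy as np
from scipy.special import sph_harm, eval_legendre

rng = np.random.default_rng(1)
x, y = rng.standard_normal((2, 3))
x /= np.linalg.norm(x); y /= np.linalg.norm(y)   # two random points on S^2

def to_angles(v):
    # SciPy convention: first angle azimuthal in [0, 2*pi), second polar
    return np.arctan2(v[1], v[0]) % (2 * np.pi), np.arccos(v[2])

ell, m = 5, np.arange(-5, 6)
tx, px = to_angles(x); ty, py = to_angles(y)
# left-hand side of (2.4), with a conjugate since the basis is complex
lhs = np.sum(np.conj(sph_harm(m, ell, tx, px)) * sph_harm(m, ell, ty, py)).real
# right-hand side of (2.4) for d = 2
rhs = (2 * ell + 1) / (4 * np.pi) * eval_legendre(ell, x @ y)
assert abs(lhs - rhs) < 1e-12
```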

The union over all \(\ell \in \mathbb {N}_0\) of the \(L_2\)-orthonormal bases (2.1) of \(\mathbb {H}_\ell \) forms a complete orthonormal system for \(L_2\). Thus any function \(f\in L_2\) can be expanded into a Fourier series (or Laplace series) with respect to this orthonormal system: in the \(L_2\) sense

$$\begin{aligned} f = \sum _{\ell =0}^\infty \sum _{k=1}^{Z(d,\ell )} \widehat{f}_{\ell ,k} Y_{\ell ,k}, \end{aligned}$$

with the Fourier coefficients given by

$$\begin{aligned} \widehat{f}_{\ell ,k} = \big (f,Y_{\ell ,k}\big )_{L_2} = \int _{\mathbb {S}^d} f(\mathbf {x}) Y_{\ell ,k}(\mathbf {x}) \,\mathrm {d}\omega _d(\mathbf {x}). \end{aligned}$$

The Sobolev space \( H^s = H^s(\mathbb {S}^d)\), where \(s\ge 0\), is defined as the space of those functions in \(L_2\) for which the norm

$$\begin{aligned} \Vert f\Vert _{H^s} := \left( \sum _{\ell =0}^\infty (\ell +1)^{2s} \sum _{k=1}^{Z(d,\ell )} |\widehat{f}_{\ell ,k}|^{2} \right) ^{1/2} \end{aligned}$$
(2.5)

is finite (see [9, Sections 5.1 and 5.2] for the case of \(\mathbb {S}^2\) and [20, Section 6.1, Definition 6.2 and Theorem 6.3]). The space \(H^s\) is a Hilbert space with the inner product

$$\begin{aligned} (f,g)_{H^s} = \sum _{\ell =0}^\infty (\ell +1)^{2s} \sum _{k=1}^{Z(d,\ell )} \widehat{f}_{\ell ,k} \widehat{g}_{\ell ,k}, \end{aligned}$$

which induces the norm (2.5).

For \(s>d/2\), the space \(H^s\) is embedded into C, that is, \(H^s\subset C\) and there exists a positive constant c such that \(\Vert f\Vert _{C} \le c \Vert f\Vert _{H^s}\) for all \(f\in H^s\). For \(s>d/2\), the space \(H^s\) is a reproducing kernel Hilbert space, that is, there exists a kernel \(K_s:\mathbb {S}^d\times \mathbb {S}^d\rightarrow \mathbb {R}\) such that (i) \(K_s(\mathbf {x},\mathbf {y}) = K_s(\mathbf {y},\mathbf {x})\) for all \(\mathbf {x},\mathbf {y}\in \mathbb {S}^d\), (ii) \(K_s(\cdot ,\mathbf {x})\in H^s\) for any fixed \(\mathbf {x}\in \mathbb {S}^d\), and (iii) the reproducing property holds

$$\begin{aligned} \big (f,K_s(\cdot ,\mathbf {x})\big )_{H^s} = f(\mathbf {x}) \quad \text{ for } \text{ all }\ \mathbf {x}\in \mathbb {S}^d\ \text{ and } \text{ all } \ f\in H^s . \end{aligned}$$

It is easily verified that the reproducing kernel of \(H^s\) is

$$\begin{aligned} K_s(\mathbf {x},\mathbf {y})=\sum _{\ell =0}^\infty \sum _{k=1}^{Z(d,\ell )} \frac{Y_{\ell ,k}(\mathbf {x}) Y_{\ell ,k}(\mathbf {y})}{(\ell +1)^{2s}}, \quad \mathbf {x},\mathbf {y}\,\in \mathbb {S}^d . \end{aligned}$$

2.3 Strictly positive definite functions and the native space

A function \(\varPhi \in L_2^{((d-2)/2,(d-2)/2)}([-1,1])\) has the Jacobi series expansion

$$\begin{aligned} \varPhi = \sum _{\ell =0}^\infty a_\ell \frac{Z(d,\ell )}{|\mathbb {S}^d|} \frac{P_\ell ^{((d-2)/2,(d-2)/2)}}{P_\ell ^{((d-2)/2,(d-2)/2)}(1)}, \end{aligned}$$

with the coefficients (see (2.2) and (2.3) for the normalization)

$$\begin{aligned} a_\ell := 2 \pi ^{d/2} \frac{\varGamma (\ell +1)}{\varGamma (\ell +d/2)} \int _{-1}^1 \varPhi (t) P_\ell ^{((d-2)/2,(d-2)/2)}(t) (1-t^2)^{(d-2)/2} \,\mathrm {d}t. \end{aligned}$$

Then the kernel \(\phi :\mathbb {S}^d\times \mathbb {S}^d\rightarrow \mathbb {R}\), defined by

$$\begin{aligned} \phi (\mathbf {x},\mathbf {y}) := \varPhi (\mathbf {x}\cdot \mathbf {y})= & {} \sum _{\ell =0}^\infty a_\ell \frac{Z(d,\ell )}{|\mathbb {S}^d|} \frac{P_\ell ^{((d-2)/2,(d-2)/2)}(\mathbf {x}\cdot \mathbf {y})}{P_\ell ^{((d-2)/2,(d-2)/2)}(1)} \nonumber \\= & {} \sum _{\ell =0}^\infty a_\ell \sum _{k=1}^{Z(d,\ell )} Y_{\ell ,k}(\mathbf {x}) Y_{\ell ,k}(\mathbf {y}), \end{aligned}$$
(2.6)

is a zonal function, that is, its values depend only on the inner product of \(\mathbf {x}\) and \(\mathbf {y}\) (or equivalently, on the geodesic distance \(\mathrm {dist}(\mathbf {x},\mathbf {y})\)). The last representation in (2.6) follows from the addition theorem (2.4).
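As a concrete instance of (2.6) for \(d=2\), the following sketch evaluates a truncated zonal kernel with the illustrative choice \(a_\ell =(\ell +1)^{-2s}\), for which (as noted at the end of this subsection) the native space is \(H^s\); compare also the kernel \(K_s\) of Sect. 2.2. The truncation degree L and the function name are ours, chosen for illustration only.

```python
import numpy as np
from scipy.special import eval_legendre

def zonal_kernel_S2(t, s=1.5, L=200):
    """Truncated series (2.6) on S^2 with a_ell = (ell + 1)^(-2s).

    t = x . y; for d = 2 we have Z(2, ell) = 2*ell + 1, |S^2| = 4*pi,
    and P_ell^{(0,0)}(t) / P_ell^{(0,0)}(1) = P_ell(t) (Legendre).
    """
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    for ell in range(L + 1):
        a_ell = (ell + 1.0) ** (-2.0 * s)
        out += a_ell * (2 * ell + 1) / (4.0 * np.pi) * eval_legendre(ell, t)
    return out
```

Since every \(a_\ell \) is positive, the untruncated kernel satisfies the conditions for strict positive definiteness stated below; the truncation at degree L is in general only guaranteed to be positive semi-definite.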

If \(a_\ell \ge 0\) for all \(\ell \in \mathbb {N}_0\), and if \(a_\ell >0\) for infinitely many even indices \(\ell \) and for infinitely many odd indices \(\ell \), then (see [5, 21, 31]) the kernel \(\phi \), defined by (2.6), is strictly positive definite, that is, for any \(N\in \mathbb {N}\), for any set \(X=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\subset \mathbb {S}^d\) of N distinct points, and for any \({\varvec{\alpha }}=(\alpha _1,\alpha _2,\ldots ,\alpha _N)^T\in \mathbb {R}^N\),

$$\begin{aligned} \sum _{j=1}^N \sum _{k=1}^N \alpha _j \alpha _k \phi (\mathbf {x}_j,\mathbf {x}_k) \ge 0, \end{aligned}$$

and equality occurs only if \({\varvec{\alpha }}=\mathbf {0}=(0,0,\ldots ,0)^T\). In other words, the kernel \(\phi \) is strictly positive definite if for any \(N\in \mathbb {N}\) and for any set \(X=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\subset \mathbb {S}^d\) of N distinct points the matrix \(\big [\phi (\mathbf {x}_j,\mathbf {x}_k)\big ]_{j,k=1,2,\ldots ,N}\) is positive definite.

A strictly positive definite zonal function \(\phi \), given by (2.6), is also called a radial basis function (RBF) (for the sphere) or a spherical (radial) basis function, since the value \(\phi (\mathbf {x},\mathbf {y})\) depends only on the geodesic distance between \(\mathbf {x}\) and \(\mathbf {y}\) on \(\mathbb {S}^d\).

For a continuous function \(f:\mathbb {S}^d\rightarrow \mathbb {R}\), and a given set \(X=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\) of N distinct points on \(\mathbb {S}^d\), an RBF approximant of f is a function of the form

$$\begin{aligned} \sum _{j=1}^N \alpha _j \phi (\cdot ,\mathbf {x}_j), \end{aligned}$$
(2.7)

where the coefficients \(\alpha _1,\alpha _2,\ldots ,\alpha _N\in \mathbb {R}\) are suitably chosen. In this paper, we approximate a function f given in the form of noisy scattered data on a finite point set \(X=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\) by an RBF approximant (2.7), that is, the centers \(\mathbf {x}_j\) in (2.7) are at the points where the data of f is given.

Let \(\phi \) be a strictly positive definite zonal kernel, given by (2.6), and assume that the coefficients \(a_\ell \) satisfy \(a_\ell >0\) for all \(\ell \in \mathbb {N}_0\) and that \(\sum _{\ell =0}^\infty a_\ell Z(d, \ell ) < \infty \). Define the linear space

$$\begin{aligned} \mathscr {F}_\phi := \left\{ \sum _{j=1}^N \alpha _j \phi (\cdot ,\mathbf {x}_j) \,:\, \alpha _j\in \mathbb {R},\ \mathbf {x}_j\in \mathbb {S}^d,\ j=1,2,\ldots ,N, \ \text{ and }\ N\in \mathbb {N}\right\} . \end{aligned}$$

The native space \(\mathscr {N}_\phi \) is defined to be the closure of \(\mathscr {F}_\phi \) with respect to the norm \(\Vert f\Vert _{\phi }:=\sqrt{(f,f)_\phi }\), \(f\in \mathscr {F}_\phi \), induced by the inner product

$$\begin{aligned} \left( \sum _{j=1}^N \alpha _j \phi (\cdot ,\mathbf {x}_j) , \sum _{k=1}^M \beta _k \phi (\cdot ,\mathbf {y}_k) \right) _\phi := \sum _{j=1}^N \sum _{k=1}^M \alpha _j \beta _k \phi (\mathbf {x}_j,\mathbf {y}_k). \end{aligned}$$

(We note that \((\cdot ,\cdot )_\phi \) is indeed an inner product for \(\mathscr {F}_\phi \) because of the strict positive definiteness of \(\phi \).) The space \(\mathscr {N}_\phi \) is a Hilbert space with the inner product

$$\begin{aligned} (f,g)_\phi = \sum _{\ell =0}^\infty \frac{1}{a_\ell } \sum _{k=1}^{Z(d,\ell )} \widehat{f}_{\ell ,k} \widehat{g}_{\ell ,k} \end{aligned}$$

and the induced norm

$$\begin{aligned} \Vert f\Vert _\phi = \sqrt{(f,f)_\phi } = \left( \sum _{\ell =0}^\infty \frac{1}{a_\ell } \sum _{k=1}^{Z(d,\ell )} |\widehat{f}_{\ell ,k}|^2 \right) ^{1/2}, \end{aligned}$$
(2.8)

where \(a_\ell , \ell \in \mathbb {N}_0\), are the coefficients of the kernel \(\phi \), see (2.6).

The native space \(\mathscr {N}_\phi \) is a reproducing kernel Hilbert space, with the reproducing kernel \(\phi \), since \(\phi \) is symmetric and \(\phi (\cdot ,\mathbf {x})\in \mathscr {N}_\phi \) for every fixed \(\mathbf {x}\in \mathbb {S}^d\), and

$$\begin{aligned} \big (f,\phi (\cdot ,\mathbf {x})\big )_\phi = f(\mathbf {x}) \quad \text{ for } \text{ all }\ \mathbf {x}\in \mathbb {S}^d \ \text{ and } \text{ all }\ f\in \mathscr {N}_\phi . \end{aligned}$$

For \( s > d/2\) we see from (2.8) and (2.5) that \(\mathscr {N}_\phi \) can be identified with \(H^s\) if \(a_\ell \asymp (\ell +1)^{-2s}\), and that the norms \(\Vert \cdot \Vert _\phi \) and \(\Vert \cdot \Vert _{H^s}\) are then equivalent.

3 Smoothing approximation

In this section the smoothing approximation is introduced. We assume that (inexact) real number data values \(F_1,F_2,\ldots ,F_N\) are given at points \(\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N \in \mathbb {S}^d\). Also given is a radial basis function \(\phi \). The smoothing approximation is the uniquely determined minimizer in the native space \(\mathscr {N}_\phi \) of a certain quadratic functional. This functional contains two terms: the first one (the discrepancy) measures how well the data values are fitted, and the second one is a smoothing term, given by the square of the native space norm multiplied by a smoothing parameter \(\lambda \ge 0\). Strategies for choosing \(\lambda \) are considered in the next section; for the present \(\lambda \) is assumed given. The next theorem shows that the function in the native space \(\mathscr {N}_\phi \) that minimizes the quadratic functional is a radial basis function approximant, with the RBFs centered at the data points. The characterization of a smoothing approximation as the minimizer of a quadratic functional is, of course, well known, see [26]. See also [1, 25] for a discussion of penalized least squares approximation, which is another name for the smoothing approximation approach.

Theorem 3.1

Let \(\phi \) be a strictly positive definite zonal kernel, given by (2.6) with \(a_\ell >0\) for all \(\ell \in \mathbb {N}_0\) and such that \(\sum _{\ell =0}^\infty a_\ell Z(d,\ell ) < \infty \), and let \(\mathscr {N}_\phi \) denote the native space of \(\phi \) with norm \(\Vert \cdot \Vert _\phi \). Let \(\lambda \in \mathbb {R}_0^+\), let \(N\in \mathbb {N}\), and let \(X=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\subset \mathbb {S}^d\) be a set of N distinct points. Assume that approximate data values at X, namely \(\mathbf {F}=(F_1,F_2,\ldots ,F_N)^T\in \mathbb {R}^N\), are given. Define the quadratic functional

$$\begin{aligned} \mu _\lambda (g) := \sum _{i=1}^N \big [ g(\mathbf {x}_i) - F_i \big ]^2 + \lambda \Vert g\Vert _\phi ^2, \quad g\in \mathscr {N}_\phi . \end{aligned}$$
(3.1)

If \(\lambda >0\), then \(\mu _\lambda \) has a unique minimizer \(f_\lambda \) in \(\mathscr {N}_\phi \), given by

$$\begin{aligned} f_\lambda = \sum _{j=1}^N \alpha _j \phi (\cdot ,\mathbf {x}_j) , \end{aligned}$$
(3.2)

with coefficients uniquely determined by the conditions \(f_\lambda (\mathbf {x}_i) + \lambda \alpha _i = F_i\), \(i=1,\ldots ,N\), that is,

$$\begin{aligned} \sum _{j=1}^N\phi (\mathbf {x}_i,\mathbf {x}_j)\alpha _j + \lambda \alpha _i = F_i, \quad i=1,\ldots ,N. \end{aligned}$$
(3.3)

If \(\lambda =0\), then there exists a unique minimizer \(f_0\) in \(\mathscr {N}_\phi \) of \(\mu _0\) that has minimal norm \(\Vert \cdot \Vert _\phi \). This minimizer is of the form (3.2) and is uniquely determined by the interpolation conditions \(f_0(\mathbf {x}_i)=F_i\) for \(i=1,2,\ldots ,N\) (that is, \(f_0(\mathbf {x}_i) = \sum _{j=1}^N\phi (\mathbf {x}_i,\mathbf {x}_j)\alpha _j= F_i\), \(i=1,\ldots ,N\)).

The nature of the functional \(\mu _\lambda \) is that if \(\lambda \) is close to zero then much weight is given to fitting the data, that is, we expect \(f_\lambda (\mathbf {x}_i)\approx F_i\), \(i=1,2,\ldots ,N\), and little importance is given to keeping the norm \(\Vert f_\lambda \Vert _\phi \) small. We are then close to the interpolation scenario. As \(\lambda \) increases, data fitting becomes less and less important, and more importance is given to keeping the norm \(\Vert f_\lambda \Vert _\phi \) small.

From (3.3) the coefficient vector \({\varvec{\alpha }}=(\alpha _1,\alpha _2,\ldots ,\alpha _N)^T\) satisfies the linear system

$$\begin{aligned} \big (\mathbf {A}+\lambda \mathbf {I}\big ) {\varvec{\alpha }}= \mathbf {F}, \end{aligned}$$
(3.4)

where \(\mathbf {F}=(F_1,F_2,\ldots ,F_N)^T\), \(\mathbf {I}\) is the \(N\times N\) identity matrix, and

$$\begin{aligned} \mathbf {A}:= \big [\phi (\mathbf {x}_i,\mathbf {x}_j)\big ]_{i,j=1,2,\ldots ,N}\,. \end{aligned}$$
(3.5)

The linear system (3.4) has a unique solution for all right hand sides \(\mathbf {F}\), since \(\mathbf {A}+\lambda \mathbf {I}\) is positive definite for all \(\lambda \ge 0\).
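In matrix form the computation of \(f_\lambda \) is thus a single symmetric positive definite solve. A minimal sketch (assuming a zonal kernel routine such as the illustrative zonal_kernel_S2 above; the helper name is ours):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def smoothing_approximant(X, F, lam, kernel):
    """Solve (A + lam I) alpha = F, cf. (3.4)-(3.5), and return alpha
    together with an evaluator for f_lambda in (3.2).

    X : (N, d+1) array of data sites (unit vectors), F : (N,) data values,
    lam >= 0, kernel : zonal kernel as a function of t = x . y.
    """
    A = kernel(X @ X.T)                          # A[i, j] = phi(x_i, x_j)
    chol = cho_factor(A + lam * np.eye(len(F)))  # SPD for all lam >= 0
    alpha = cho_solve(chol, F)

    def f_lam(Y):                                # rows of Y: points on S^d
        return kernel(Y @ X.T) @ alpha           # sum_j alpha_j phi(y, x_j)

    return alpha, f_lam
```

For the compactly supported Wendland functions mentioned in Sect. 1, the matrix \(\mathbf {A}\) is sparse, and a sparse factorization would replace the dense Cholesky step.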

In the next section we consider the case \(F_i=f(\mathbf {x}_i)+\varepsilon _i\), \(i=1,2,\ldots ,N\), for some (unknown) function \(f\in \mathscr {N}_\phi \) and errors (noise) \(\varepsilon _i\) in the data, and explore the question of choosing an appropriate value of \(\lambda \) for a given level of noise.

Given Theorem 3.1, it is clearly sufficient to restrict our attention to functions g of the form \(g=\sum _{j=1}^N \alpha _j \phi (\cdot ,\mathbf {x}_j)\). In this case \(g(\mathbf {x}_i) = (\mathbf {A}{\varvec{\alpha }})_i\) and \(\Vert g\Vert _{\phi }^2 = {\varvec{\alpha }}^T \mathbf {A}{\varvec{\alpha }}\), and for such g the functional (3.1) can be written as

$$\begin{aligned} \tilde{\mu }_\lambda ({\varvec{\alpha }}) := \mu _{\lambda }(g) = \Vert \mathbf {A}{\varvec{\alpha }}- \mathbf {F}\Vert _2^2 + \lambda \Vert {\varvec{\alpha }}\Vert _{\mathbf {A}}^2, \end{aligned}$$
(3.6)

where the \(\mathbf {A}\) norm is defined by \(\Vert {\varvec{\alpha }}\Vert _{\mathbf {A}} := \sqrt{{\varvec{\alpha }}^T \mathbf {A}{\varvec{\alpha }}}\), and the corresponding inner product is defined by \(({\varvec{\alpha }},{\varvec{\beta }})_\mathbf {A}:={\varvec{\alpha }}^T\mathbf {A}{\varvec{\beta }}\), for \({\varvec{\alpha }},{\varvec{\beta }}\in \mathbb {R}^N\). From (3.6) it is clear that the functional \(\mu _\lambda \) restricted to the finite dimensional approximation space \(V_X = \mathrm {span}\{\phi (\cdot ,\mathbf {x}_j) \,:\, j=1,2,\ldots ,N \}\) can be interpreted as a Tikhonov functional (see [13, Section 2.2]), and hence strategies for the choice of the regularization parameter in Tikhonov regularization can be applied to determine \(\lambda \). Indeed, the four choices for \(\lambda \) in Sect. 4 are all motivated by parameter choice strategies for Tikhonov regularization.

The next lemma shows that the first term of \(\mu _\lambda (f_\lambda )\) (the discrepancy) is a strictly monotonic increasing function of \(\lambda \). This is needed in the next section. The lemma is inspired by [13, Sections 2.2 and 2.5].

Lemma 3.1

Let the assumptions and notation be the same as in Theorem 3.1, and assume in addition that \(\mathbf {F}\ne \mathbf {0}\). Then the discrepancy function \(J: \mathbb {R}_0^+ \rightarrow \mathbb {R}_0^+\) defined by

$$\begin{aligned} J(\lambda ) := \sum _{i=1}^N \big [ f_\lambda (\mathbf {x}_i) - F_i \big ]^2, \quad \lambda \in \mathbb {R}_0^+ , \end{aligned}$$

is continuous and strictly monotonic increasing, with the range of J given by \(\big [0, \Vert \mathbf {F}\Vert _2^2 \big )\).

Proof

We write temporarily \({\varvec{\alpha }}^\lambda \in \mathbb {R}^N\) for the coefficient vector of \(f_\lambda \), satisfying (3.4), so that \({\varvec{\alpha }}^\lambda \) satisfies

$$\begin{aligned} \big (\mathbf {A}+\lambda \mathbf {I}\big ) {\varvec{\alpha }}^{\lambda } = \mathbf {F}. \end{aligned}$$
(3.7)

We note that \({\varvec{\alpha }}^\lambda \ne \mathbf {0}\), since \({\varvec{\alpha }}^\lambda =\mathbf {0}\) would contradict \(\mathbf {F}\ne \mathbf {0}\).

Since the matrix \(\mathbf {A}+ \lambda \mathbf {I}\) is invertible and continuous in \(\lambda \) for all non-negative \(\lambda \), the vector \({\varvec{\alpha }}^\lambda \) depends continuously on \(\lambda \). It follows that so too does

$$\begin{aligned} J(\lambda ) = \sum _{i=1}^N \big [ f_\lambda (\mathbf {x}_i) - F_i \big ]^2 = \big \Vert \mathbf {A}{\varvec{\alpha }}^\lambda - \mathbf {F}\big \Vert _2^2. \end{aligned}$$
(3.8)

To study the monotonicity of \(J(\lambda )\) it is convenient to write, from (3.1) and (3.6),

$$\begin{aligned} \mu _\lambda (f_\lambda )= J(\lambda ) + \lambda K (\lambda ) , \end{aligned}$$
(3.9)

where

$$\begin{aligned} K(\lambda ) := \big \Vert f_\lambda \big \Vert _\phi ^2 = ({\varvec{\alpha }}^\lambda )^T \mathbf {A}{\varvec{\alpha }}^\lambda =\Vert {\varvec{\alpha }}^\lambda \Vert _\mathbf {A}^2. \end{aligned}$$
(3.10)

Now let \( 0 \le \lambda _1 <\lambda _2\). We first show that \({\varvec{\alpha }}^{\lambda _1}\ne {\varvec{\alpha }}^{\lambda _2}\). For this purpose we write (3.7) first with \(\lambda =\lambda _1\) and then with \(\lambda =\lambda _2\), that is

$$\begin{aligned} (\mathbf {A}+ \lambda _1 \mathbf {I}) {\varvec{\alpha }}^{\lambda _1} = \mathbf {F}\quad \text{ and }\quad (\mathbf {A}+ \lambda _2 \mathbf {I}) {\varvec{\alpha }}^{\lambda _2} = \mathbf {F}, \end{aligned}$$

and subtract the two equations to obtain

$$\begin{aligned} \mathbf {A}({\varvec{\alpha }}^{\lambda _1} - {\varvec{\alpha }}^{\lambda _2}) + \lambda _1 {\varvec{\alpha }}^{\lambda _1} - \lambda _2 {\varvec{\alpha }}^{\lambda _2} = \mathbf {0}, \end{aligned}$$

or equivalently

$$\begin{aligned} \mathbf {A}({\varvec{\alpha }}^{\lambda _1}-{\varvec{\alpha }}^{\lambda _2}) + \lambda _1( {\varvec{\alpha }}^{\lambda _1} - {\varvec{\alpha }}^{\lambda _2}) = (\lambda _2 - \lambda _1){\varvec{\alpha }}^{\lambda _2} . \end{aligned}$$
(3.11)

If \({\varvec{\alpha }}^{\lambda _1}={\varvec{\alpha }}^{\lambda _2}\) then this reduces to \((\lambda _1-\lambda _2){\varvec{\alpha }}^{\lambda _2}=\mathbf {0}\), and since \({\varvec{\alpha }}^{\lambda _2}\ne \mathbf {0}\) this gives \( \lambda _1=\lambda _2\), a contradiction.

We next show that \(K(\lambda )\) is strictly monotonic decreasing. Multiplying (3.11) from the left by \([\mathbf {A}({\varvec{\alpha }}^{\lambda _1} - {\varvec{\alpha }}^{\lambda _2})]^T\), we obtain

$$\begin{aligned} \big \Vert \mathbf {A}({\varvec{\alpha }}^{\lambda _1}-{\varvec{\alpha }}^{\lambda _2}) \big \Vert _2^2 + \lambda _1 \big \Vert {\varvec{\alpha }}^{\lambda _1}-{\varvec{\alpha }}^{\lambda _2} \big \Vert _\mathbf {A}^2 = (\lambda _2-\lambda _1) \big ( {\varvec{\alpha }}^{\lambda _1} - {\varvec{\alpha }}^{\lambda _2}, {\varvec{\alpha }}^{\lambda _2} \big )_\mathbf {A}. \end{aligned}$$
(3.12)

Since the left-hand side of (3.12) is positive (because \(\mathbf {A}({\varvec{\alpha }}^{\lambda _1}-{\varvec{\alpha }}^{\lambda _2}) \ne \mathbf {0}\) due to \({\varvec{\alpha }}^{\lambda _1} \ne {\varvec{\alpha }}^{\lambda _2}\) and the positive definiteness of \(\mathbf {A}\)) and \(\lambda _2 > \lambda _1\), it follows that

$$\begin{aligned} ({\varvec{\alpha }}^{\lambda _1} - {\varvec{\alpha }}^{\lambda _2}, {\varvec{\alpha }}^{\lambda _2})_\mathbf {A}> 0 , \end{aligned}$$

and hence

$$\begin{aligned} \Vert {\varvec{\alpha }}^{\lambda _2}\Vert _\mathbf {A}^2 < ({\varvec{\alpha }}^{\lambda _1}, {\varvec{\alpha }}^{\lambda _2})_\mathbf {A}\le \Vert {\varvec{\alpha }}^{\lambda _1}\Vert _\mathbf {A}\Vert {\varvec{\alpha }}^{\lambda _2} \Vert _\mathbf {A}, \end{aligned}$$

and on canceling \(\Vert {\varvec{\alpha }}^{\lambda _2}\Vert _\mathbf {A}\),

$$\begin{aligned} \Vert {\varvec{\alpha }}^{\lambda _2}\Vert _\mathbf {A}< \Vert {\varvec{\alpha }}^{\lambda _1} \Vert _\mathbf {A}\Leftrightarrow K(\lambda _2) < K(\lambda _1) . \end{aligned}$$

Thus \(K(\lambda )\) is strictly monotonic decreasing.

Next, we deduce that \(J(\lambda )\) is strictly monotonic increasing. For suppose, to the contrary, that for some \(0< \lambda _1 < \lambda _2\) we have \(J(\lambda _2) \le J(\lambda _1)\). Then we have, from (3.1) and the fact that \(K(\lambda )\) is strictly monotonic decreasing,

$$\begin{aligned} \mu _{\lambda _1} (f_{\lambda _2}) = J(\lambda _2) + \lambda _1 K (\lambda _2) < J(\lambda _1) + \lambda _1 K(\lambda _1) = \mu _{\lambda _1} (f_{\lambda _1}). \end{aligned}$$

Since \(f_{\lambda _1}\) is the unique minimizer of \(\mu _{\lambda _1}\), the strict inequality \(\mu _{\lambda _1} (f_{\lambda _2}) < \mu _{\lambda _1} (f_{\lambda _1})\) is impossible. Thus \(J(\lambda )\) is strictly monotonic increasing for all \(\lambda >0\). The monotonicity then extends to all \(\lambda \ge 0\) by continuity.

It only remains to establish the range of J. For \(\lambda =0\) we see from (3.3) that

$$\begin{aligned} J(0) = \sum _{i=1}^N [f_0 (\mathbf {x}_i) - F_i]^2 =0 . \end{aligned}$$

Since J is continuous and strictly monotonic increasing, we need only establish the limit of \(J(\lambda )\) as \(\lambda \rightarrow \infty \). As \(\lambda \) increases to \(\infty \) we expect the minimizer \(f_\lambda \) of (3.1) to become increasingly close to zero. We now show that

$$\begin{aligned} {\varvec{\alpha }}^\lambda \rightarrow \mathbf {0}\quad \text{ as } \quad \lambda \rightarrow \infty , \end{aligned}$$

from which it will follow using (3.2) that \(f_\lambda \rightarrow 0\) as \(\lambda \rightarrow \infty \) and thus \(J(\lambda )\rightarrow \Vert \mathbf {F}\Vert _2^2\) as \(\lambda \rightarrow \infty \). Using (3.8), (3.9) and (3.10) we can write \(\mu _\lambda (f_\lambda )\) as

$$\begin{aligned} \mu _\lambda (f_\lambda ) = J(\lambda ) + \lambda K(\lambda ) = \big \Vert \mathbf {A}{\varvec{\alpha }}^\lambda - \mathbf {F}\big \Vert _2^2 + \lambda \Vert {\varvec{\alpha }}^\lambda \Vert _\mathbf {A}^2 \le \big \Vert \mathbf {F}\big \Vert _2^2, \end{aligned}$$

where the upper bound in the last step follows from the fact that \(f_\lambda \) is the (unique) minimizer of \(\mu _\lambda \), which justifies the replacement of \({\varvec{\alpha }}^\lambda \) by \(\mathbf {0}\) to get an upper bound. It follows that \(\lambda \Vert {\varvec{\alpha }}^\lambda \Vert _\mathbf {A}^2 \le \Vert \mathbf {F}\Vert _2^2\), and in turn, since \(\Vert \cdot \Vert _\mathbf {A}\) is a norm on \(\mathbb {R}^N\), that \({\varvec{\alpha }}^\lambda \rightarrow \mathbf {0}\) as \(\lambda \rightarrow \infty \). This completes the proof. \(\square \)
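Lemma 3.1 is precisely what makes Morozov's discrepancy principle (considered in Sect. 4) computable: J is continuous and strictly increasing with range \(\big [0, \Vert \mathbf {F}\Vert _2^2\big )\), so for \(0<\Vert {\varvec{\varepsilon }}\Vert _2<\Vert \mathbf {F}\Vert _2\) the equation \(J(\lambda )=\Vert {\varvec{\varepsilon }}\Vert _2^2\) has exactly one root, which any bracketing root finder locates. A sketch (our own helper, built directly on the linear system (3.4)):

```python
import numpy as np
from scipy.optimize import brentq

def morozov_lambda(X, F, eps_norm, kernel, lam_hi=1.0):
    """Unique lam > 0 with J(lam) = eps_norm**2 (discrepancy principle).

    J(lam) = ||A alpha^lam - F||_2^2 is continuous and strictly
    increasing with limit ||F||_2^2 (Lemma 3.1), so once the bracket
    [0, lam_hi] straddles the target, brentq finds the root.
    """
    A = kernel(X @ X.T)
    I = np.eye(len(F))

    def J(lam):
        alpha = np.linalg.solve(A + lam * I, F)
        return np.sum((A @ alpha - F) ** 2)

    target = eps_norm ** 2     # requires 0 < eps_norm < ||F||_2, cf. Sect. 4
    while J(lam_hi) < target:  # grow the bracket; terminates by Lemma 3.1
        lam_hi *= 10.0
    return brentq(lambda lam: J(lam) - target, 0.0, lam_hi)
```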

4 Error of the approximation for four strategies for choosing \(\lambda \)

Now we can formulate the main result of this paper, Theorem 4.1. The theorem gives an \(L_2\) error estimate of the smoothing approximation for four different strategies for choosing the parameter \(\lambda \), given values of a function \(f\in \mathscr {N}_\phi \) corrupted by (deterministic) noise. It was inspired by a similar result in [27] for the case of thin plate splines on a bounded domain in \(\mathbb {R}^2\) and by [14] for the third parameter choice strategy.

Theorem 4.1

Let \(d\ge 2\) and \(s > d/2\), let \(\phi \) be a strictly positive definite zonal kernel given by (2.6) with \(a_\ell \asymp (\ell +1)^{-2s}\) for all \(\ell \in \mathbb {N}_0\), and let \(\mathscr {N}_\phi \) denote the native space of \(\phi \) with norm \(\Vert \cdot \Vert _\phi \). For a point set \(X=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\subset \mathbb {S}^d\) of N distinct points, a function \(f\in \mathscr {N}_\phi \) with values \(\mathbf {f}= (f(\mathbf {x}_1),f(\mathbf {x}_2),\ldots ,f(\mathbf {x}_N))^T\) on X, a vector of deterministic noise \({\varvec{\varepsilon }}=(\varepsilon _1,\varepsilon _2,\ldots ,\varepsilon _N)^T\), and \(\lambda \ge 0\), let \(f_\lambda ^{\varvec{\varepsilon }}\) denote, for \(\lambda >0\), the uniquely determined minimizer in \(\mathscr {N}_\phi \) of the quadratic functional

$$\begin{aligned} \mu _\lambda (g) := \sum _{i=1}^N \big [ g(\mathbf {x}_i) - \big (f(\mathbf {x}_i)+\varepsilon _i\big ) \big ]^2 + \lambda \big \Vert g\big \Vert _\phi ^2, \quad g\in \mathscr {N}_\phi , \end{aligned}$$
(4.1)

and for \(\lambda =0\) the uniquely determined minimizer in \(\mathscr {N}_\phi \) of (4.1) with minimal norm. Then there exist positive constants \(\hat{c}_s\) and \(c_s'\) (dependent on d and s) such that, if the point set X satisfies \(h_X \le \hat{c}_s\), then the following results hold true for all \(f\in \mathscr {N}_\phi \) and all \({\varvec{\varepsilon }}\in \mathbb {R}^N\):

  1. (i)

    A priori parameter choice: Let \(\lambda =\Vert {\varvec{\varepsilon }}\Vert _2^2\). Then

    $$\begin{aligned} \big \Vert f_\lambda ^{\varvec{\varepsilon }}- f\big \Vert _{L_2} \le 2 c_s' \left[ h_X^{s} \left( \frac{1}{2} + \big \Vert f\big \Vert _\phi \right) + h_X^{d/2} \Vert {\varvec{\varepsilon }}\Vert _2 \left( 1 + \frac{1}{\sqrt{2}} \big \Vert f\big \Vert _\phi \right) \right] . \end{aligned}$$
  2. (ii)

    A priori parameter choice: Let \(\lambda = \dfrac{\Vert {\varvec{\varepsilon }}\Vert _2^2}{\Vert f\Vert _\phi ^2}\). Then

    $$\begin{aligned} \big \Vert f_\lambda ^{\varvec{\varepsilon }}- f\big \Vert _{L_2} \le c_s' \left[ (1+\sqrt{2})\; h_X^{s} \big \Vert f \big \Vert _\phi + \sqrt{6}\; h_X^{d/2} \Vert {\varvec{\varepsilon }}\Vert _2 \right] . \end{aligned}$$
  3. (iii)

    A priori parameter choice: Let \(\lambda = \frac{1}{\sqrt{2}}\; h_X^{s-d/2}\; \dfrac{\Vert {\varvec{\varepsilon }}\Vert _2}{\Vert f\Vert _\phi }\). Then

    $$\begin{aligned} \big \Vert f_\lambda ^{\varvec{\varepsilon }}- f\big \Vert _{L_2}\le & {} 2c_s' \left( h_X^s \Vert f\Vert _\phi + 2^{1/4} h_X^{s/2+d/4} \Vert f\Vert _\phi ^{1/2} \Vert {\varvec{\varepsilon }}\Vert _2^{1/2} + h_X^{d/2} \Vert {\varvec{\varepsilon }}\Vert _2 \right) \nonumber \\\le & {} (1+2^{-3/4})2c_s' \left( h_X^{s} \Vert f\Vert _\phi + h_X^{d/2} \Vert {\varvec{\varepsilon }}\Vert _2 \right) . \end{aligned}$$
    (4.2)
  4. (iv)

    Morozov’s discrepancy principle: Assume that

    $$\begin{aligned} 0< \Vert {\varvec{\varepsilon }}\Vert _2 < \big \Vert \mathbf {f}+{\varvec{\varepsilon }}\big \Vert _2 . \end{aligned}$$
    (4.3)

    Then there exists a unique \(\lambda >0\) such that

    $$\begin{aligned} \sum _{i=1}^N \big [f_\lambda ^{\varvec{\varepsilon }}(\mathbf {x}_i) - \big (f(\mathbf {x}_i)+\varepsilon _i\big )\big ]^2 = \Vert {\varvec{\varepsilon }}\Vert _2^2. \end{aligned}$$
    (4.4)

    For this choice of \(\lambda \), the following \(L_2\) error estimate holds:

    $$\begin{aligned} \big \Vert f_\lambda ^{\varvec{\varepsilon }}- f\big \Vert _{L_2} \le 2 c_s' \left[ h_X^{s} \big \Vert f \big \Vert _\phi + h_X^{d/2} \Vert {\varvec{\varepsilon }}\Vert _2 \right] . \end{aligned}$$

Since by assumption in Theorem 4.1 we have \(a_\ell \asymp (\ell +1)^{-2s}\), the native space \(\mathscr {N}_\phi \) can be identified with the Sobolev space \(H^s \). Thus the results hold also for all \(f\in H^s\), and the norm \(\Vert f\Vert _\phi \) in the upper bounds can be replaced by \(c\Vert f\Vert _{H^s}\).

The theorem is proved in Sect. 6.

The four choices of \(\lambda \) considered in the theorem give similar bounds, with Morozov’s discrepancy principle giving marginally the best bound. Section 7 gives a numerical experiment illustrating these parameter choices and discusses their practical calculation.
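In code the three a priori choices are one-line formulas, while (iv) requires the root find sketched after Lemma 3.1; the following hedged summary assumes \(\Vert {\varvec{\varepsilon }}\Vert _2\), \(\Vert f\Vert _\phi \) and \(h_X\) are known or estimated (in practice \(\Vert f\Vert _\phi \) must typically be estimated).

```python
import numpy as np

def a_priori_lambdas(eps_norm, f_norm_phi, h_X, s, d):
    """The three a priori choices of Theorem 4.1, parts (i)-(iii).

    Part (iv), Morozov's discrepancy principle, is the root of
    J(lam) = eps_norm**2 and needs the data; see morozov_lambda above.
    """
    return {
        "i":   eps_norm ** 2,
        "ii":  eps_norm ** 2 / f_norm_phi ** 2,
        "iii": h_X ** (s - d / 2.0) * eps_norm / (np.sqrt(2.0) * f_norm_phi),
    }
```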

Remark 4.1

The given \(L_2\) error estimates for the parameter choices (ii) to (iv) are order-optimal in the following sense:

  • The first term \(h_X^s \Vert f\Vert _{H^s}\) in the \(L_2\) error estimates has the correct order in the mesh norm \(h_X\) for an estimate of the \(L_2\) norm in terms of the \(H^s\) norm.

  • In the second term, we have

    $$\begin{aligned} h_X^{d/2} \Vert {\varvec{\varepsilon }}\Vert _2 \le h_X^{d/2} \sqrt{N} \max _{j=1,2,\ldots ,N} |\varepsilon _j|. \end{aligned}$$

    Now we consider a ‘well distributed’ point set \(X = \{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\), such as for example the subsets of given point sets constructed in [18, Proposition 3.2]. For such a point set we have (see [18, Proposition 3.2]) \(N \asymp h_X^{-d}\), and hence for such a ‘well distributed’ point set

    $$\begin{aligned} h_X^{d/2} \Vert {\varvec{\varepsilon }}\Vert _2 \le h_X^{d/2} \sqrt{N} \max _{j=1,2,\ldots ,N} |\varepsilon _j| \asymp \max _{j=1,2,\ldots ,N} |\varepsilon _j|. \end{aligned}$$

    As the error of the approximation can never be smaller than the error in the data, this estimate shows that the second term in the \(L_2\) error estimates is also order-optimal.

5 The \(L_2\) sampling inequality for the sphere

The proof of Theorem 4.1 uses the following \(L_2\) sampling inequality for the sphere.

Theorem 5.1

Let \(d\ge 2\) and let \(s>d/2\). There exist positive constants \(\hat{c}_s\) and \(\tilde{c}_s\) (dependent on d and s) such that for any finite point set \(X=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\) of N distinct points on \(\mathbb {S}^d\) with \(h_X \le \hat{c}_s\) and for any function \(g\in H^s\)

$$\begin{aligned} \Vert g\Vert _{L_2} \le \tilde{c}_s \left[ h_{X}^s \,\Vert g\Vert _{H^s} + h_{X}^{d/2} \left( \sum _{j=1}^N \big [g(\mathbf {x}_j)\big ]^2 \right) ^{1/2} \right] . \end{aligned}$$
(5.1)

Remark 5.1

The result in [15, Theorem 3.3] is at first sight similar to Theorem 5.1 above, except that in the second term on the right-hand side of (5.1) the \(\ell _2\) norm of g is replaced by the \(\ell _\infty \) norm of g. There is, however, a critical difference, in that the factor \(h_X^{d/2}\) in the second term is missing in [15, Theorem 3.3]. For certain classes of kernels discussed in [11], the paper [10, Proposition 3.6] provides error estimates that at first glance are similar to those in Theorem 5.1 above. However, these estimates do not hold (as Theorem 5.1 does) for all functions in the native space, but only for functions from a finite-dimensional approximation space (one of a sequence of such spaces). More importantly, the estimates in [10, Proposition 3.6] lack the factor \(h_{X}^{d/2}\). For these two reasons [10, Proposition 3.6] cannot be used in place of Theorem 5.1 in the proof of Theorem 4.1.

Theorem 5.1 follows from [3, Theorem 4.1], which contains, as a special case, the analogous result for bounded connected open domains in \(\mathbb {R}^d\) with a Lipschitz-continuous boundary. Using the ideas from [15] we can lift this result to the sphere and obtain the theorem above. However, as pointed out in Remark 5.1, the sampling inequality in [15] is not the one that we need and is not adequate for the proof of Theorem 4.1.

Since we did not find a direct reference for Theorem 5.1 we sketch a proof for the reader’s convenience.

First we state without proof a special case of the rather general sampling inequality in [3, Theorem 4.1]. The bounded open domain \(\varOmega \subset \mathbb {R}^d\) with Lipschitz continuous boundary in Theorem 5.2 below automatically satisfies the required cone property for some radius \(\rho \) and angle \(\theta \) (see [3, page 185] for details). We denote the mesh norm of a finite point set \(V=\{\mathbf {v}_1,\mathbf {v}_2,\ldots ,\mathbf {v}_N\}\) in \(\overline{\varOmega }\) by

$$\begin{aligned} h_{V,\overline{\varOmega }} = \sup _{\mathbf {u}\in \varOmega } \min _{\mathbf {v}_j\in V} \Vert \mathbf {u}-\mathbf {v}_j\Vert _2. \end{aligned}$$

Theorem 5.2

(Arcangéli et al. [3]) Let \(d\ge 2\), let \(\varOmega \subset \mathbb {R}^d\) be a bounded connected open domain with Lipschitz-continuous boundary, and let \(s\in \mathbb {R}\) satisfy \(s>d/2\). Then there exist positive constants \(\hat{\hat{c}}_s\) (dependent on \(\theta \), \(\rho \), d and s) and \(\tilde{\tilde{c}}_s\) (dependent on \(\varOmega \), d and s) such that for any finite point set \(V=\{\mathbf {v}_1,\mathbf {v}_2,\ldots ,\mathbf {v}_N\}\) in \(\overline{\varOmega }\) with \(h_{V,\overline{\varOmega }} \le \hat{\hat{c}}_s\) and any function \(g\in W^s_2(\varOmega )\)

$$\begin{aligned} \Vert g\Vert _{L_2(\varOmega )} \le \tilde{\tilde{c}}_s \left[ h_{V,\overline{\varOmega }}^s \Vert g\Vert _{W_2^s(\varOmega )} + h_{V,\overline{\varOmega }}^{d/2} \left( \sum _{j=1}^N \big [g(\mathbf {v}_j)\big ]^2 \right) ^{1/2} \right] . \end{aligned}$$
(5.2)

In (5.2) \(\Vert \cdot \Vert _{L_2(\varOmega )}\) is the usual \(L_2\)-norm on \(\varOmega \) and, for integer order \(s\), \(W_2^s(\varOmega )\) is the Sobolev space of those functions in \(L_2(\varOmega )\) whose distributional derivatives up to (and including) order s are in \(L_2(\varOmega )\). For the precise definition of the norm and for the definition of \(W_2^s(\varOmega )\) with fractional order s we refer to [3, Section 2.2]. Note that in [3, Theorem 4.1] the first term on the right-hand side contains a semi-norm for \(W_2^s(\varOmega )\), which we have estimated above by the corresponding norm. These Sobolev spaces are the same as the ones considered in [15] and [17]. It is important that for \(g\in W_2^s(\mathbb {R}^d)\) we have \(g|_\varOmega \in W_2^s(\varOmega )\), and the norm \(\Vert g\Vert _{W_2^s(\varOmega )}\) satisfies

$$\begin{aligned} \Vert g\Vert _{W_2^s(\varOmega )} \le c_s \Vert g\Vert _{W_2^s(\mathbb {R}^d)}, \end{aligned}$$
(5.3)

for some \(c_s>0\) (independent of g).

Proof of Theorem 5.1

In order to use Theorem 5.2 we make use of the fact that the Sobolev spaces \(H^s=H^s(\mathbb {S}^d)\) can also be defined with the help of charts, giving the same space with an equivalent norm (see [16, Chapter 7.3]). For a better distinction we will write here \(W_2^s(\mathbb {S}^d)\) (instead of \(H^s(\mathbb {S}^d)\)), a space equipped with the equivalent norm \(\Vert \cdot \Vert _{W_2^s(\mathbb {S}^d)}\). We will only sketch the proof of Theorem 5.1; for any missing details we refer to [15, Section 3].

The sphere \(\mathbb {S}^d\) is a compact d-dimensional differentiable manifold without boundary equipped with the atlas \(\big \{C\big (\mathbf {p}_i;\frac{3\pi }{5}\big ),\psi _i\big \}_{i=1,2}\), where the chart \(\psi _i: C\big (\mathbf {p}_i;\frac{3\pi }{5}\big )\rightarrow B(\mathbf {0};1)\) is defined as in [15, p. 129] via the stereographic projection, which maps the open spherical cap

$$\begin{aligned} C\big (\mathbf {p}_i;\tfrac{3\pi }{5}\big ) := \big \{ \mathbf {x}\in \mathbb {S}^d \,\big |\, \mathbf {x}\cdot \mathbf {p}_i > \cos \big (\tfrac{3\pi }{5}\big ) \big \} = \big \{ \mathbf {x}\in \mathbb {S}^d \,\big |\, \mathrm {dist}(\mathbf {x},\mathbf {p}_i) < \tfrac{3\pi }{5} \big \} \end{aligned}$$

onto the open ball \(B(\mathbf {0};1)\subset \mathbb {R}^d\) with radius 1 and center at the origin \(\mathbf {0}\). The centers \(\mathbf {p}_1\) and \(\mathbf {p}_2\) are the north pole \(\mathbf {p}_1 = (0,\ldots ,0,1)^\mathrm{{T}}\) and the south pole \(\mathbf {p}_2 = (0,\ldots ,0,-1)^\mathrm{{T}}\), respectively. We define for functions \(g:\mathbb {S}^d \rightarrow \mathbb {R}\) the projections \(\pi _i(g):\mathbb {R}^d \rightarrow \mathbb {R}\) by

$$\begin{aligned} \pi _i(g)(\mathbf {x}) := \left\{ \begin{array}{l@{\quad }l} (g \circ \psi _i^{-1})(\mathbf {x}) &{} \text{ for } \mathbf {x}\in B(\mathbf {0};1), \\ 0 &{} \text{ otherwise }. \end{array}\right. \end{aligned}$$

Using a partition of unity \(\{\chi _i:\mathbb {S}^d\rightarrow \mathbb {R}\}_{i=1,2}\) subordinate to the atlas \(\big \{C\big (\mathbf {p}_i;\frac{3\pi }{5}\big ),\psi _i\big \}_{i=1,2}\), we define the Sobolev spaces \(W_2^s(\mathbb {S}^d)\) for \(s\ge 0\) by

$$\begin{aligned} W_2^s(\mathbb {S}^d) = \left\{ \left. g \in L_2(\mathbb {S}^d) \,\right| \, \pi _i(\chi _i g)\in W_2^s(\mathbb {R}^d) \text{ for } i=1,2 \right\} , \end{aligned}$$

equipped with the norm

$$\begin{aligned} \Vert g\Vert _{W_2^s(\mathbb {S}^d)} = \left( \Vert \pi _1(\chi _1 g)\Vert _{W_2^s(\mathbb {R}^d)}^2 + \Vert \pi _2(\chi _2 g)\Vert _{W_2^s(\mathbb {R}^d)}^2 \right) ^{1/2}. \end{aligned}$$

As mentioned before, \(W_2^s(\mathbb {S}^d)= H^s(\mathbb {S}^d)= H^s\), and the norms \(\Vert \cdot \Vert _{H^s}\) and \(\Vert \cdot \Vert _{W_2^s(\mathbb {S}^d)}\) are equivalent. For \(s=0\) we obtain \(W_2^0(\mathbb {S}^d) = L_2(\mathbb {S}^d)\), equipped with the norm

$$\begin{aligned} \Vert g\Vert _{W_2^0(\mathbb {S}^d)} = \left( \Vert \pi _1(\chi _1 g)\Vert _{L_2(\mathbb {R}^d)}^2 + \Vert \pi _2(\chi _2 g)\Vert _{L_2(\mathbb {R}^d)}^2 \right) ^{1/2}. \end{aligned}$$
(5.4)

From [15, Lemma 3.1] we have for \(i=1,2\)

$$\begin{aligned} \Vert \psi _i(\mathbf {x}) - \psi _i(\mathbf {y})\Vert _2 \le c \;\mathrm {dist}(\mathbf {x},\mathbf {y}) \quad \text{ for } \text{ all } \mathbf {x},\mathbf {y}\in C\big (\mathbf {p}_i;\tfrac{3\pi }{5}\big ), \end{aligned}$$

which implies for a finite point set \(X= \{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\) on \(\mathbb {S}^d\) that (see [15, Proposition 3.2])

$$\begin{aligned} h_{\psi _i(X\cap C(\mathbf {p}_i;\frac{3\pi }{5})),\overline{B(\mathbf {0};1)}} \le c h_{X}, \quad i=1,2. \end{aligned}$$
(5.5)

From (5.4) it follows that

$$\begin{aligned} \Vert g\Vert _{W_2^0(\mathbb {S}^d)}\le & {} \Vert \pi _1(\chi _1 g)\Vert _{L_2(\mathbb {R}^d)} + \Vert \pi _2(\chi _2 g)\Vert _{L_2(\mathbb {R}^d)} \nonumber \\= & {} \Vert \pi _1(\chi _1 g)\Vert _{L_2(B(\mathbf {0};1))} + \Vert \pi _2(\chi _2 g)\Vert _{L_2(B(\mathbf {0};1))}, \end{aligned}$$
(5.6)

as the functions \(\pi _1(\chi _1 g)\) and \(\pi _2(\chi _2 g)\) have compact support in \(B(\mathbf {0};1)\).

Now let \(g\in W_2^s(\mathbb {S}^d)\) with \(s>d/2\). Then \(\pi _i(\chi _i g) \in W_2^s(\mathbb {R}^d)\) and hence \(\pi _i(\chi _i g)|_{B(\mathbf {0};1)} \in W_2^s(B(\mathbf {0};1))\), \(i=1,2\). We now consider a term on the right-hand side of (5.6) and apply Theorem 5.2 with \(\varOmega =B(\mathbf {0};1) = \psi _i\big (C(\mathbf {p}_i;\frac{3\pi }{5})\big )\) and \(V= V_i :=\psi _i(X\cap C(\mathbf {p}_i;\frac{3\pi }{5}))\):

$$\begin{aligned}&\Vert \pi _i(\chi _i g)\Vert _{L_2(B(\mathbf {0};1))} \\&\quad \le \tilde{\tilde{c}}_s \left[ h_{V_i,\overline{B(\mathbf {0};1)}}^s \Vert \pi _i(\chi _i g)\Vert _{W_2^s(B(\mathbf {0};1))} + h_{V_i,\overline{B(\mathbf {0};1)}}^{d/2} \left( \sum _{j=1}^{N_i} \big [\big (\pi _i(\chi _i g)\big )(\mathbf {v}_{i,j})\big ]^2 \right) ^{1/2} \right] , \end{aligned}$$

where \(V_i = \big \{\mathbf {v}_{i,1},\mathbf {v}_{i,2},\ldots ,\mathbf {v}_{i,N_i}\big \}\) and where the point set \(X \subset \mathbb {S}^d\) is such that \(h_{V_i,\overline{B(\mathbf {0};1)}} \le \hat{\hat{c}}_s\).

Using (5.5) and the relationship between the \(V_i\) and \(X\cap C\big (\mathbf {p}_i;\frac{3\pi }{5}\big )\subset X\) yields

$$\begin{aligned} \Vert \pi _i(\chi _i g)\Vert _{L_2(B(\mathbf {0};1))} \le c' \tilde{\tilde{c}}_s \left[ h_{X}^s \Vert \pi _i(\chi _i g)\Vert _{W_2^s(B(\mathbf {0};1))} + h_{X}^{d/2} \left( \sum _{j=1}^N \big [(\chi _i g)(\mathbf {x}_j)\big ]^2 \right) ^{1/2} \right] \nonumber \\ \end{aligned}$$
(5.7)

with \(c':=\max \{c^s,c^{d/2}\}\). Applying (5.7) in (5.6) yields for \(g\in W_2^s(\mathbb {S}^d)\)

$$\begin{aligned} \Vert g\Vert _{W_2^0(\mathbb {S}^d)}\le & {} c' \tilde{\tilde{c}}_s \left( h_{X}^s \big [ \Vert \pi _1(\chi _1 g)\Vert _{W_2^s(B(\mathbf {0};1))} + \Vert \pi _2(\chi _2 g)\Vert _{W_2^s(B(\mathbf {0};1))} \big ] \right. \\&\left. +\, h_{X}^{d/2} \left[ \left( \sum _{j=1}^N \big [(\chi _1 g)(\mathbf {x}_j)\big ]^2 \right) ^{1/2} + \left( \sum _{j=1}^N \big [(\chi _2 g)(\mathbf {x}_j)\big ]^2 \right) ^{1/2} \right] \right) . \end{aligned}$$

Now we make use of \(|(\chi _i g)(\mathbf {x})| \le |g(\mathbf {x})|\) for all \(\mathbf {x}\in \mathbb {S}^d\) for \(i=1,2\), and \((a+b)^2 \le 2\,(a^2+b^2)\) for all \(a,b\ge 0\). Using (5.3), we find

$$\begin{aligned}&\Vert g\Vert _{W_2^0(\mathbb {S}^d)} \\&\quad \le c' \tilde{\tilde{c}}_s \left[ \sqrt{2} c_s h_{X}^s \left( \Vert \pi _1(\chi _1 g)\Vert _{W_2^s(\mathbb {R}^d)}^2 + \Vert \pi _2(\chi _2 g)\Vert _{W_2^s(\mathbb {R}^d)}^2 \right) ^{1/2} + 2 h_{X}^{d/2} \left( \sum _{j=1}^N \big (g(\mathbf {x}_j)\big )^2 \right) ^{1/2} \right] \\&\quad \le 2 c' \tilde{\tilde{c}}_s \max \{c_s,1\} \left[ h_{X}^s \Vert g\Vert _{W_2^s(\mathbb {S}^2)} + h_{X}^{d/2} \left( \sum _{j=1}^N \big (g(\mathbf {x}_j)\big )^2 \right) ^{1/2} \right] . \end{aligned}$$

Since \(W_2^0(\mathbb {S}^d)\) and \(L_2=L_2(\mathbb {S}^d)\), and \(W_2^s(\mathbb {S}^d)\) and \(H^s=H^s(\mathbb {S}^d)\), respectively, are the same spaces, equipped with equivalent norms, we obtain the estimate in Theorem 5.1. The conditions \(h_{V_i,\overline{B(\mathbf {0};1)}} \le \hat{\hat{c}}_s\) for \(i=1,2\) translate, by (5.5), into \(h_{X} \le \hat{c}_s\) for some (usually small) constant \(\hat{c}_s >0\). \(\square \)

6 Proof of Theorem 4.1

Proof of Theorem 4.1

As an initial step we establish the first result in part (iv) of the theorem. From Lemma 3.1, the inequality (4.3) guarantees that \(\Vert {\varvec{\varepsilon }}\Vert _2^2\) is in the range of J, where

$$\begin{aligned} J(\lambda ) := \sum _{i=1}^N \big [f_\lambda ^{\varvec{\varepsilon }}(\mathbf {x}_i) - \big (f(\mathbf {x}_i)+\varepsilon _i\big )\big ]^2. \end{aligned}$$

Since, by Lemma 3.1, J is continuous and strictly monotonically increasing, there exists exactly one \(\lambda >0\) with \(J(\lambda )=\Vert {\varvec{\varepsilon }}\Vert _2^2\), that is, there is exactly one \(\lambda \) satisfying (4.4).

Returning now to general \(\lambda \), we assume that the point set X satisfies \(h_X\le \hat{c}_s\), where \(\hat{c}_s\) is as in Theorem 5.1. From the assumptions on \(\phi \), the native space \(\mathscr {N}_\phi \) can be identified with \(H^s\), and the norms \(\Vert \cdot \Vert _\phi \) and \(\Vert \cdot \Vert _{H^s}\) are equivalent. Thus from applying Theorem 5.1 with the function \(g = f_\lambda ^{\varvec{\varepsilon }}- f\) in \(\mathscr {N}_\phi \) and the point set X and using the equivalence of \(\Vert \cdot \Vert _\phi \) and \(\Vert \cdot \Vert _{H^s}\), we obtain

$$\begin{aligned} \big \Vert f_\lambda ^{\varvec{\varepsilon }}- f\big \Vert _{L_2} \le c_s' \left[ h_X^s \Vert f_\lambda ^{\varvec{\varepsilon }}- f\Vert _\phi + h_X^{d/2} \left( \sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - f(\mathbf {x}_i)\big ]^2 \right) ^{1/2} \right] . \end{aligned}$$
(6.1)

We estimate the two terms in (6.1) separately. In the second term we add and subtract \(\varepsilon _i\) to obtain

$$\begin{aligned} \sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - f(\mathbf {x}_i)\big ]^2= & {} \sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - \big (f(\mathbf {x}_i)+\varepsilon _i\big ) + \varepsilon _i \big ]^2 \nonumber \\\le & {} 2 \left( \sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - \big (f(\mathbf {x}_i)+\varepsilon _i\big ) \big ]^2 + \sum _{i=1}^N |\varepsilon _i|^2 \right) .\nonumber \\ \end{aligned}$$
(6.2)

For estimating the right-hand side of (6.2) we proceed differently for the four parameter choices, beginning with the three a priori choices. In these three cases, the right-hand side in (6.2) is estimated by making use of the optimal nature of \(f_\lambda ^{\varvec{\varepsilon }}\): from (4.1)

$$\begin{aligned}&\sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - \big (f(\mathbf {x}_i) +\varepsilon _i\big ) \big ]^2 + \sum _{i=1}^N |\varepsilon _i|^2 \le \mu _\lambda (f_\lambda ^{\varvec{\varepsilon }}) + \sum _{i=1}^N |\varepsilon _i|^2 \nonumber \\&\quad \le \mu _\lambda (f) + \sum _{i=1}^N |\varepsilon _i|^2 = \sum _{i=1}^N |\varepsilon _i|^2 + \lambda \big \Vert f\big \Vert _\phi ^2 + \sum _{i=1}^N |\varepsilon _i|^2 = 2 \Vert {\varvec{\varepsilon }}\Vert _2^2 + \lambda \big \Vert f\big \Vert _\phi ^2. \nonumber \\ \end{aligned}$$
(6.3)

For \(\lambda = \Vert {\varvec{\varepsilon }}\Vert _2^2\) we obtain from (6.3)

$$\begin{aligned} \sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - \big (f(\mathbf {x}_i) +\varepsilon _i\big ) \big ]^2 + \sum _{i=1}^N |\varepsilon _i|^2 \le \Vert {\varvec{\varepsilon }}\Vert _2^2 \Big (2 + \big \Vert f\big \Vert _\phi ^2 \Big ). \end{aligned}$$
(6.4)

Hence, from (6.2) and (6.4), for \(\lambda = \Vert {\varvec{\varepsilon }}\Vert _2^2\),

$$\begin{aligned} \sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - f(\mathbf {x}_i)\big ]^2 \le 2 \Vert {\varvec{\varepsilon }}\Vert _2^2 \Big ( 2 + \big \Vert f\big \Vert _\phi ^2 \Big ) \le 4 \Vert {\varvec{\varepsilon }}\Vert _2^2 \Big ( 1 + \frac{1}{\sqrt{2}}\big \Vert f\big \Vert _\phi \Big )^2. \end{aligned}$$
(6.5)

For the second a priori choice \(\lambda = \frac{\Vert {\varvec{\varepsilon }}\Vert _2^2}{\Vert f \Vert _\phi ^2}\), (6.3) gives

$$\begin{aligned} \sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - \big (f(\mathbf {x}_i) +\varepsilon _i\big ) \big ]^2 + \sum _{i=1}^N |\varepsilon _i|^2 \le 3 \Vert {\varvec{\varepsilon }}\Vert _2^2 . \end{aligned}$$
(6.6)

Hence, from (6.2) and (6.6), for \(\lambda = \frac{\Vert {\varvec{\varepsilon }}\Vert _2^2}{\Vert f \Vert _\phi ^2}\),

$$\begin{aligned} \sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - f(\mathbf {x}_i)\big ]^2 \le 6 \Vert {\varvec{\varepsilon }}\Vert _2^2 . \end{aligned}$$
(6.7)

For the third a priori choice we do not yet insert the value for \(\lambda \) but obtain from (6.2) and (6.3) that

$$\begin{aligned} \sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - f(\mathbf {x}_i)\big ]^2 \le 4 \left( \Vert {\varvec{\varepsilon }}\Vert _2^2 + \frac{\lambda }{2} \big \Vert f\big \Vert _\phi ^2 \right) \le 4 \left( \Vert {\varvec{\varepsilon }}\Vert _2 + \sqrt{\frac{\lambda }{2}} \big \Vert f\big \Vert _\phi \right) ^2. \end{aligned}$$
(6.8)

If instead we choose \(\lambda \) by Morozov’s discrepancy principle, then

$$\begin{aligned} \sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - \big (f(\mathbf {x}_i)+\varepsilon _i\big ) \big ]^2 + \sum _{i=1}^N |\varepsilon _i|^2 = \Vert {\varvec{\varepsilon }}\Vert _2^2 + \sum _{i=1}^N |\varepsilon _i|^2 = 2\Vert {\varvec{\varepsilon }}\Vert _2^2 . \end{aligned}$$
(6.9)

Then using (6.9) to estimate the right-hand side in (6.2) we obtain

$$\begin{aligned} \sum _{i=1}^N \big [f_\lambda ^{{\varvec{\varepsilon }}}(\mathbf {x}_i) - f(\mathbf {x}_i)\big ]^2 \le 4 \Vert {\varvec{\varepsilon }}\Vert _2^2. \end{aligned}$$
(6.10)

For all four parameter choice strategies we now estimate the first term in (6.1). From the triangle inequality we have

$$\begin{aligned} \big \Vert f_\lambda ^{\varvec{\varepsilon }}- f\big \Vert _\phi \le \big \Vert f_\lambda ^{\varvec{\varepsilon }}\big \Vert _\phi + \big \Vert f\big \Vert _\phi , \end{aligned}$$
(6.11)

in which the first term can be estimated using the optimality of \(f^{{\varvec{\varepsilon }}}_\lambda \) as follows:

$$\begin{aligned} \big \Vert f_\lambda ^{\varvec{\varepsilon }}\big \Vert _\phi ^2 &= \frac{1}{\lambda } \left( \lambda \big \Vert f_\lambda ^{\varvec{\varepsilon }}\big \Vert _\phi ^2 \right) = \frac{1}{\lambda } \left( \mu _\lambda (f_\lambda ^{\varvec{\varepsilon }}) - \sum _{i=1}^N \big [f_\lambda ^{\varvec{\varepsilon }}(\mathbf {x}_i) - \big (f(\mathbf {x}_i)+\varepsilon _i\big )\big ]^2 \right) \\ &\le \frac{1}{\lambda } \left( \mu _\lambda (f) - \sum _{i=1}^N \big [f_\lambda ^{\varvec{\varepsilon }}(\mathbf {x}_i) - \big (f(\mathbf {x}_i)+\varepsilon _i\big )\big ]^2 \right) \\ &= \frac{1}{\lambda } \left( \Vert {\varvec{\varepsilon }}\Vert _2^2 - \sum _{i=1}^N \big [f_\lambda ^{\varvec{\varepsilon }}(\mathbf {x}_i) - \big (f(\mathbf {x}_i)+\varepsilon _i\big )\big ]^2 \right) + \big \Vert f\big \Vert _\phi ^2. \end{aligned}$$
(6.12)

For the three a priori parameter choices the negative term in the last line of (6.12) can simply be dropped. For the a priori parameter choice \(\lambda =\Vert {\varvec{\varepsilon }}\Vert _2^2\) this gives

$$\begin{aligned} \big \Vert f_\lambda ^{\varvec{\varepsilon }}\big \Vert _\phi \le \left( 1 + \big \Vert f\big \Vert _\phi ^2\right) ^{1/2}\le 1+\big \Vert f\big \Vert _\phi . \end{aligned}$$
(6.13)

The result for the first a priori choice then follows from (6.1), (6.5), (6.11), and (6.13).

For the a priori parameter choice \(\lambda =\frac{\Vert {\varvec{\varepsilon }}\Vert _2^2}{\Vert f \Vert _\phi ^2}\), omitting the negative term in (6.12) gives

$$\begin{aligned} \big \Vert f_\lambda ^{\varvec{\varepsilon }}\big \Vert _\phi \le \sqrt{2} \Vert f \Vert _\phi . \end{aligned}$$
(6.14)

The result for the second a priori choice then follows from (6.1), (6.7), (6.11), and (6.14).

For the third a priori choice, as we do not yet insert the value for \(\lambda \), the inequality (6.12) in this case yields

$$\begin{aligned} \big \Vert f_\lambda ^{\varvec{\varepsilon }}\big \Vert _\phi \le \left( \frac{1}{\lambda } \Vert {\varvec{\varepsilon }}\Vert _2^2 + \Vert f\Vert _\phi ^2\right) ^{1/2} \le \frac{1}{\sqrt{\lambda }} \Vert {\varvec{\varepsilon }}\Vert _2 + \Vert f\Vert _\phi . \end{aligned}$$
(6.15)

We then find from (6.1), (6.8), (6.11), and (6.15)

$$\begin{aligned} \Vert f_\lambda ^{{\varvec{\varepsilon }}} - f\Vert _{L_2} \le 2 c_s' \left[ h_X^s \left( \frac{1}{2\sqrt{\lambda }} \Vert {\varvec{\varepsilon }}\Vert _2 + \Vert f\Vert _\phi \right) + h_X^{d/2} \left( \Vert {\varvec{\varepsilon }}\Vert _2 + \sqrt{\frac{\lambda }{2}} \big \Vert f\big \Vert _\phi \right) \right] . \end{aligned}$$
(6.16)

It is easy to see that the parameter choice in part (iii) of the theorem minimizes the right-hand side of (6.16), and that the resulting minimum value is

$$\begin{aligned} \Vert f_{\lambda }^{{\varvec{\varepsilon }}} - f\Vert _{L_2} &\le 2 c_s' \left( h_X^s \Vert f\Vert _\phi + 2^{1/4} h_X^{s/2+d/4} \Vert f\Vert _\phi ^{1/2} \Vert {\varvec{\varepsilon }}\Vert _2^{1/2} + h_X^{d/2} \Vert {\varvec{\varepsilon }}\Vert _2 \right) \\ &\le (1+2^{-3/4})\, 2c_s' \left( h_X^{s} \Vert f\Vert _\phi + h_X^{d/2} \Vert {\varvec{\varepsilon }}\Vert _2 \right) , \end{aligned}$$

where the last inequality follows from the basic inequality \(2ab\le a^2+b^2\).
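To verify the claim about the minimizing parameter, write the right-hand side of (6.16) as \(2c_s' \big (h_X^s \Vert f\Vert _\phi + h_X^{d/2} \Vert {\varvec{\varepsilon }}\Vert _2 + p\,\lambda ^{-1/2} + q\,\lambda ^{1/2}\big )\), where \(p = \tfrac{1}{2} h_X^s \Vert {\varvec{\varepsilon }}\Vert _2\) and \(q = \tfrac{1}{\sqrt{2}} h_X^{d/2} \Vert f\Vert _\phi \). Setting the derivative with respect to \(\lambda \) to zero gives

$$\begin{aligned} -\frac{p}{2}\,\lambda ^{-3/2} + \frac{q}{2}\,\lambda ^{-1/2} = 0 \quad \Longleftrightarrow \quad \lambda = \frac{p}{q} = \frac{1}{\sqrt{2}}\, h_X^{s-d/2}\, \frac{\Vert {\varvec{\varepsilon }}\Vert _2}{\Vert f\Vert _\phi }, \end{aligned}$$

which is the choice in part (iii) of the theorem; at this \(\lambda \) the two \(\lambda \)-dependent terms combine to \(p\,\lambda ^{-1/2} + q\,\lambda ^{1/2} = 2\sqrt{pq} = 2^{1/4} h_X^{s/2+d/4} \Vert f\Vert _\phi ^{1/2} \Vert {\varvec{\varepsilon }}\Vert _2^{1/2}\), the middle term in the bound above.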

For Morozov’s discrepancy principle, the negative term in (6.12) cancels \(\Vert {\varvec{\varepsilon }}\Vert _2^2\), giving

$$\begin{aligned} \big \Vert f_\lambda ^{\varvec{\varepsilon }}\big \Vert _\phi \le \big \Vert f\big \Vert _\phi . \end{aligned}$$
(6.17)

The result for Morozov’s discrepancy principle follows now from (6.1), (6.10), (6.11), and (6.17). \(\square \)

7 Numerical results

In this section we present a numerical test on \(\mathbb {S}^2\) that illustrates the performance of the method, and we discuss the practical evaluation of the four parameter choices in Theorem 4.1 as well as the L-curve heuristic [12].

The function to be approximated is given in Cartesian coordinates by

$$\begin{aligned} f(x,y,z) = e^{x+y+z} + 50 \big (y - \cos (\pi /3)\big )_+^3, \quad (x,y,z)^T\in \mathbb {R}^3, \end{aligned}$$
(7.1)

where \(r_+\) is defined to be r if \(r\ge 0\) and to be zero otherwise. It can be shown that this function is in \(H^s\) for any \(s<3.5\). The function is plotted on the polar coordinate grid in Fig. 1.

Fig. 1: The approximated function (7.1)

The point set \(X=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N\}\) was a set of \(N = 6084\) points chosen to have a polynomial basis matrix with large determinant (see [22]). This provides a uniformly distributed point set on \(\mathbb {S}^2\) with mesh norm \(h_X = 0.0372\) and separation distance \(\min _{i\ne j} \; \mathrm {dist}(\mathbf {x}_i, \mathbf {x}_j) = 0.0413\). Noise was added by drawing independent samples from the standard normal distribution, scaling them by the factor 0.1, and adding them to the exact function values on X. For our particular sample we obtained \(\Vert {\varvec{\varepsilon }}\Vert _2^2 = 61.8275\).
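As a concrete illustration, the following minimal sketch (in Python with NumPy; not the implementation behind the reported results) evaluates the test function (7.1) and generates noisy data in the manner just described. The determinant-maximizing point set of [22] is replaced by random uniform points on the sphere, so the mesh quantities and the value of \(\Vert {\varvec{\varepsilon }}\Vert _2^2\) will differ from those above.

```python
import numpy as np

def test_function(x, y, z):
    """The test function (7.1); (.)_+ is realized with np.maximum."""
    return np.exp(x + y + z) + 50.0 * np.maximum(y - np.cos(np.pi / 3.0), 0.0) ** 3

rng = np.random.default_rng(0)
# Stand-in for the point set of [22]: random uniform points on S^2.
X = rng.standard_normal((6084, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)

f_exact = test_function(X[:, 0], X[:, 1], X[:, 2])
eps = 0.1 * rng.standard_normal(len(X))  # independent scaled standard normal errors
data = f_exact + eps                     # the noisy data supplied to the method
noise_norm_sq = float(eps @ eps)         # ||eps||_2^2, the only noise information used
```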

Table 1 RBF smoothing approximation for a range of smoothing parameters \(\lambda \)

In the implementation, we used the compactly supported Wendland radial basis function (see [28, 30]) \(\phi (\mathbf {x},\mathbf {y}) =\psi (\Vert \mathbf {x}-\mathbf {y}\Vert _2)\), where

$$\begin{aligned} \psi (r) = (4r+1) (1-r)_+^4, \quad r\ge 0. \end{aligned}$$

The Fourier coefficients of this RBF behave asymptotically as \(a_\ell \asymp (\ell +1)^{-5}\) (see [19]), and hence the native space can be identified with \(H^{2.5}\). Since f belongs to a still smoother space, our theory applies with \(s=2.5\). Note that on the unit sphere the support of each unscaled radial basis function covers one quarter of the surface area of the sphere, which determines the sparsity of \(\mathbf {A}\). For the chosen set of \(N = 6084\) well distributed points and this Wendland function, the proportion of nonzero entries of \(\mathbf {A}\) is 24.99%, while the 2-norm condition number is \(\kappa _2(\mathbf {A}) = 5.9 \times 10^5\).
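Continuing the sketch above, the sparse kernel matrix \(\mathbf {A}\) with entries \(\phi (\mathbf {x}_i,\mathbf {x}_j)\) can be assembled so that only entries inside the support radius are ever formed; here scipy.spatial.cKDTree supplies the point pairs with \(\Vert \mathbf {x}_i-\mathbf {x}_j\Vert _2 \le 1\).

```python
from scipy import sparse
from scipy.spatial import cKDTree

def psi(r):
    """Wendland function psi(r) = (4r + 1)(1 - r)_+^4."""
    return (4.0 * r + 1.0) * np.maximum(1.0 - r, 0.0) ** 4

def kernel_matrix(X):
    """Sparse A with A_ij = psi(||x_i - x_j||_2); entries beyond distance 1 vanish."""
    tree = cKDTree(X)
    D = tree.sparse_distance_matrix(tree, max_distance=1.0).tocoo()
    A = sparse.csr_matrix((psi(D.data), (D.row, D.col)), shape=D.shape)
    A.setdiag(1.0)  # psi(0) = 1; set explicitly in case zero self-distances were dropped
    return A

A = kernel_matrix(X)
fill = A.nnz / (A.shape[0] * A.shape[1])  # about 0.25, matching the quarter-area support
```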

The RBF smoothing approximant was computed for a sequence of increasing smoothing parameters, namely \(\lambda = 0\) and \(\lambda = 2^{\zeta }\) for \(\zeta =-15,-14.9,\ldots ,6.9,7\). Table 1 lists the subset corresponding to \(\lambda = 0\) and \(\lambda = 2^j\) for \(j = -15,-14,\ldots ,6,7\), while Fig. 2 plots the full results. For each value of the smoothing parameter \(\lambda \) we calculate the discrepancy

$$\begin{aligned} J(\lambda ) = \sum _{j=1}^N \big [ f_\lambda ^{\varvec{\varepsilon }}(\mathbf {x}_j) - \big (f(\mathbf {x}_j) + \varepsilon _j\big )\big ]^2, \end{aligned}$$

the squared native space norm \(\Vert f_\lambda ^{\varvec{\varepsilon }}\Vert _\phi ^2 = {\varvec{\alpha }}^T \mathbf {A}{\varvec{\alpha }}\), where \((\mathbf {A}+ \lambda \mathbf {I}) {\varvec{\alpha }}= \mathbf {f}+ {\varvec{\varepsilon }}\), and estimates of the \(L_2\) and \(L_\infty \) norms of the error \(f_\lambda ^{\varvec{\varepsilon }}- f\). The \(L_2\) error was estimated using a 16474-point spherical 181-design (an equal weight quadrature rule exact for all spherical polynomials of degree at most 181). The \(L_\infty \) norm was estimated by a local maximization routine started from the circumcenter of each triangle in a Delaunay triangulation of the RBF centers, and checked against the pointwise maximum of the error at the points of the spherical design.
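The per-\(\lambda \) quantities just listed reduce to one sparse linear solve and two inner products. A sketch, reusing A, data and noise_norm_sq from above (the \(L_2\) and \(L_\infty \) error estimates via the spherical design and local maximization are omitted):

```python
from scipy.sparse.linalg import spsolve

def smoothing_coefficients(A, data, lam):
    """Coefficients alpha of f_lambda, solving (A + lam I) alpha = f + eps."""
    N = A.shape[0]
    return spsolve((A + lam * sparse.identity(N)).tocsc(), data)

def discrepancy(A, data, alpha):
    """J(lam): since f_lambda(x_j) = (A alpha)_j, this is ||A alpha - data||_2^2."""
    r = A @ alpha - data
    return float(r @ r)

def native_norm_sq(A, alpha):
    """Squared native space norm ||f_lambda||_phi^2 = alpha^T A alpha."""
    return float(alpha @ (A @ alpha))
```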

Fig. 2: Variation of \(J(\lambda )\), \(\Vert f_\lambda ^{\varvec{\varepsilon }}- f\Vert _{L_2}\) and \(\Vert f_\lambda ^{\varvec{\varepsilon }}- f\Vert _{C}\)

The smallest \(L_2\) error of \(6.9841 \times 10^{-2}\) is obtained for \(\lambda = 0.574\), while the smallest \(L_\infty \) error of \(6.8190 \times 10^{-2}\) is obtained for \(\lambda = 0.758\), as illustrated in Fig. 2.

The a priori parameter choice \(\lambda =\Vert {\varvec{\varepsilon }}\Vert _2^2 = 61.8275\) gives an over-smoothed approximation, with both the \(L_2\) and the \(L_\infty \) error exceeding 2. A disadvantage of this choice of \(\lambda \) is that it depends on the scaling (units) used for the data.

Fig. 3: L-curve method: log-log plot of \(\Vert f_\lambda ^{\varvec{\varepsilon }}\Vert _\phi ^2\) against \(J(\lambda )\)

The second and third a priori choices \(\lambda = \frac{\Vert {\varvec{\varepsilon }}\Vert _2^2}{\Vert f\Vert _\phi ^2}\) and \(\lambda = \frac{1}{\sqrt{2}} h_X^{s-d/2} \frac{\Vert {\varvec{\varepsilon }}\Vert _2}{\Vert f\Vert _\phi }\) in Theorem 4.1 have a major disadvantage in practice: they require knowledge of the true norm \(\Vert f \Vert _\phi \), which is generally not available. In this test example, where we do know f, we can estimate the \(\phi \) norm with reasonable accuracy by the \(\phi \) norm of the minimal-norm interpolant of the exact function values at the data points. (In the language of Sect. 3 we set \(\lambda =0\) and \({\varvec{\varepsilon }}=\mathbf {0}\), and use \(\Vert f_0 \Vert _\phi \) to approximate \(\Vert f \Vert _\phi \).) Using the value \(\Vert f_0 \Vert _\phi = 15.79\), the second a priori choice gives \(\lambda = \frac{\Vert {\varvec{\varepsilon }}\Vert _2^2}{\Vert f_0\Vert _\phi ^2} = 0.248\), with an \(L_2\) error of \(7.58 \times 10^{-2}\), only slightly larger than the minimal \(L_2\) error (see the second plot in Fig. 2).

The third a priori choice gives \(\lambda = 2.5 \times 10^{-3}\) and an \(L_2\) error of \(1.81\times 10^{-1}\). This value of \(\lambda \) is lower than the values giving the minimal \(L_2\) error, resulting in an undersmoothed approximation and again a relatively large error. Moreover, as \(\Vert f_0 \Vert _\phi \le \Vert f \Vert _\phi \), using the true norm \(\Vert f \Vert _\phi \) would give a still smaller value of \(\lambda \), making the undersmoothing worse (see Fig. 2). So far, however, the calculation of the second and third a priori choices is not at all practical, since it assumes knowledge of the exact function values at the data points. In practice perhaps the best a user could do is to approximate \(\Vert f \Vert _\phi \) by the \(\phi \) norm of a smoothed approximant \(f^{\varvec{\varepsilon }}_\lambda \) for some reasonable choice of \(\lambda \).
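In code, the norm proxy and the two resulting a priori parameters read as follows (a sketch reusing the helpers above; the values in the comments are those quoted for the paper's point set, and will differ for the random stand-in points):

```python
# lambda = 0, eps = 0: interpolate the exact data and take its phi norm.
alpha0 = spsolve(A.tocsc(), f_exact)
f0_phi = np.sqrt(native_norm_sq(A, alpha0))  # about 15.79 for the paper's point set

h_X, s, d = 0.0372, 2.5, 2                   # values from the text
lam_second = noise_norm_sq / f0_phi**2       # second a priori choice (0.248 above)
lam_third = (h_X ** (s - d / 2) * np.sqrt(noise_norm_sq)
             / (np.sqrt(2.0) * f0_phi))      # third a priori choice (2.5e-3 above)
```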

Fig. 4: Plots of the test function, the noisy data, the RBF smoothing approximation and the error for \(\lambda = 1.516\)

Morozov’s discrepancy principle requires in practice that we compute \(f_\lambda ^{{\varvec{\varepsilon }}}\) for a selection of parameters \(\lambda \) until we find a value of \(\lambda \) for which (4.4) is satisfied accurately enough. More specifically, the discrepancy principle requires \(J(\lambda ) = \Vert {\varvec{\varepsilon }}\Vert _2^2 = 61.8275\), which for this example is satisfied at \(\lambda = 1.516\), as illustrated in the first plot of Fig. 2. This value of \(\lambda \) gives an \(L_2\) error of \(9.32\times 10^{-2}\), around \(35\%\) larger than the minimal \(L_2\) error of \(6.98\times 10^{-2}\). Thus the \(\lambda \) predicted by the discrepancy principle is a good choice, even though it is slightly too large and slightly over-smooths.
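The search for the discrepancy-principle parameter can be automated with a bracketing root-finder, since \(J(\lambda )\) increases with \(\lambda \) (larger \(\lambda \) smooths more and fits the data less). A sketch, reusing the helpers above and assuming the bracket contains the crossing:

```python
from scipy.optimize import brentq

def morozov_lambda(A, data, noise_norm_sq, lam_lo=2.0**-15, lam_hi=2.0**7):
    """Solve J(lam) = ||eps||_2^2; each evaluation costs one regularized solve."""
    def excess(lam):
        alpha = smoothing_coefficients(A, data, lam)
        return discrepancy(A, data, alpha) - noise_norm_sq
    return brentq(excess, lam_lo, lam_hi)  # about 1.516 for the paper's data
```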

A heuristic strategy for determining the smoothing parameter is the L-curve method (see [12]): for a range of values of \(\lambda \), the two parts \(\Vert f^{{\varvec{\varepsilon }}}_{\lambda }\Vert _\phi ^2\) and \(J(\lambda )\) of the minimized functional (4.1) are plotted against each other in a log-log plot (see Fig. 3). The resulting curve \(x(\lambda ) = \log J(\lambda )\), \(y(\lambda ) = \log \Vert f^{{\varvec{\varepsilon }}}_{\lambda }\Vert _\phi ^2\) should then be L-shaped, as is clearly the case in Fig. 3. The smoothing parameter is chosen as the value of \(\lambda \) corresponding to the corner of the L, defined as the point of maximal curvature \(| x' y'' - y' x''| / (x'^2 + y'^2)^{3/2}\). We used the full range of values \(\lambda =2^\zeta \), \(\zeta =-15,-14.9,\ldots ,6.9,7\), for the L-curve, with finite differences to estimate the derivatives \(x', x'', y', y''\) and hence the curvature, giving the parameter \(\lambda = 0.287\) and an \(L_2\) error of \(7.41 \times 10^{-2}\). From Table 1 it is clear that this is a fairly good prediction, but one which, unlike the methods considered in this paper, lacks theoretical support.
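The corner detection can likewise be scripted; a sketch, assuming arrays lams (excluding \(\lambda = 0\)), J_vals and norm_vals have been filled over the \(\lambda \) grid with the helpers above:

```python
def lcurve_corner(lams, J_vals, norm_vals):
    """Corner of the L-curve: maximize |x'y'' - y'x''| / (x'^2 + y'^2)^(3/2)."""
    t = np.log(lams)                # the grid lambda = 2^zeta is uniform in log lambda
    x, y = np.log(J_vals), np.log(norm_vals)
    xp, yp = np.gradient(x, t), np.gradient(y, t)
    xpp, ypp = np.gradient(xp, t), np.gradient(yp, t)
    kappa = np.abs(xp * ypp - yp * xpp) / (xp**2 + yp**2) ** 1.5
    return lams[np.argmax(kappa)]   # about 0.287 for the paper's data
```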

Finally Fig. 4 gives plots of the test function (7.1), the noisy data, the RBF smoothing approximation and the error for \(\lambda = 1.516\), the value predicted by the discrepancy principle.

To summarise, for approximation from noisy scattered data on the sphere, regularized by the square of a native space norm, we have established \(L_2\) error bounds for four different choices of the smoothing parameter. Of the four choices the best result (by a small margin theoretically), and the most practical, is the parameter choice given by the discrepancy principle. The actual numerical cost of the discrepancy principle depends on the number of values of \(\lambda \) for which the approximation has to be computed before a sufficiently accurate estimate of the discrepancy-principle parameter is obtained.