1 Introduction

Sparse regularization, and in particular convex approaches based on \(\ell ^1\) minimization, is one of the workhorses for solving ill-posed linear inverse problems. It finds numerous applications ranging from signal processing [19] to machine learning [56]. It is thus also the method of choice to solve the compressed sensing (CS) problem [17, 29], which is an inverse problem where the linear operator is random. Randomness of the linear operator makes the recovery possible as soon as the number of observations is of the order (up to log factors) of the number of nonzero elements in the vector to recover. This theory and the associated numerical solvers are fundamentally discrete, which does not comply with most imaging scenarios, where CS needs to be adapted to deal with physical constraints. Purely discrete random operators are idealizations, and studying random operators obtained by random sampling of continuous operators (e.g., Fourier measurements) requires the study of the so-called Rayleigh limit. This is the minimum separation distance between the individual elements forming the object of interest (in the following, Dirac masses) required to ensure that a stable recovery is possible. Furthermore, extending CS to continuous domains and sub-sampled continuous operators is of both practical and theoretical interest. It avoids gridding the parameter space, thus enabling more efficient solvers and a sharper theoretical analysis.

The natural continuous extension of the \(\ell ^1\) approach encodes the positions and amplitudes of the sought-after solution into a Radon measure. The \(\ell ^1\) norm is then replaced by the total variation (total mass) of the measure, and a measure is naturally said to be “sparse” when it is a sum of Diracs at the desired positions and amplitudes. The corresponding infinite dimensional optimization problem is called BLASSO in [25] following theoretical works on spectral extrapolation [7]. This setting of optimization on measures has also been considered in the inverse problems community [9]. Successful examples of applications of such “off-the-grid methods” include single-molecule fluorescent imaging [8], spike sorting in neurosciences [33], mixture model estimation [37] and training shallow neural networks [5]. Previous theoretical works on “off-the-grid” CS are, however, focused on domains which are either the whole space (\({\mathbb {R}}^d\)) or the periodic torus \({\mathbb {T}}^d\) and consider translation-invariant measurements (such as random Fourier measurements or sub-sampled convolutions). In this article, we provide a sharp analysis of a general class of operators over arbitrary domains.

1.1 Sparse Spikes Recovery Using the BLASSO

1.1.1 Observation Model

We consider the general problem of estimating a complex-valued unknown Radon measure \(\mu _0 \in {\mathcal {M}}({\mathcal {X}})\) defined over some metric space \({\mathcal {X}}\) from a small number m of randomized linear observations \(y \in {\mathbb {C}}^m\). In this paper, \({\mathcal {X}}\) will either be a connected bounded open subset of \({\mathbb {R}}^d\) or the d-dimensional torus \({\mathbb {T}}^d\), even though some of our results extend beyond this case. We define the inner product between a complex-valued continuous function \(f \in {\mathscr {C}}^{}({\mathcal {X}})\) and complex-valued measure \(\mu \in {\mathcal {M}}({\mathcal {X}})\) as \(\langle f,\,\mu \rangle _{\mathcal {M}} {\mathop {=}\limits ^{\text {def.}}}\int _{\mathcal {X}}\overline{f(x)} \mathrm {d} \mu (x)\). The (forward) measurement operator \( \varPhi : {\mathcal {M}}({\mathcal {X}}) \mapsto {\mathbb {C}}^m\) that we consider in this paper is of the form

$$\begin{aligned} \varPhi \mu {\mathop {=}\limits ^{\text {def.}}}\tfrac{1}{\sqrt{m}} \left( \langle \varphi _{\omega _k},\,\mu \rangle _{\mathcal {M}} \right) _{k=1}^m \end{aligned}$$
(1)

where \((\omega _1,\ldots ,\omega _m)\) are parameters independently and identically distributed according to a probability distribution \(\varLambda (\omega )\) over some space \(\varOmega \), and \(\varphi _\omega : {\mathcal {X}}\rightarrow {\mathbb {C}}\) are smooth functions parameterized by \(\omega \). We further assume that \(\varphi _\omega \) is normalized, that is \( {\mathbb {E}}_{\omega \sim \varLambda }[\left|\varphi _\omega (x)\right|^2] = 1\) for all \(x\in {\mathcal {X}}\). Our observations are of the form

$$\begin{aligned} y = \varPhi (\mu _0 + {\tilde{\mu }}_0) + w\, , \end{aligned}$$
(2)

where \(\mu _0=\sum _{i=1}^s a_i \delta _{x_i}\) with \( (x_i,a_i) \in {\mathcal {X}}\times {\mathbb {C}}\) is the s-sparse measure we are interested in, \({\tilde{\mu }}_0\in {\mathcal {M}}({\mathcal {X}})\) accounts for modeling error, and \(w\in {\mathbb {C}}^m\) is measurement noise. In the rest of the paper, we naturally assume that the support of \({\tilde{\mu }}_0\) does not include any of the \(x_i\)’s.

1.1.2 BLASSO

An increasingly popular method to estimate such a sparse measure corresponds to solving an infinite-dimensional analogue of the Lasso regression problem with regularization parameter \(\lambda >0\),

$$\begin{aligned} \min _{\mu \in {\mathcal {M}}({\mathcal {X}})} \; \lambda |\mu |({\mathcal {X}}) + \frac{1}{2}\left\| \varPhi \mu - y \right\| ^2 . \end{aligned}$$
(\({\mathcal {P}}_\lambda (y)\))

Following [25], we call this method the BLASSO (for Beurling-LASSO). Here, \(|\mu |({\mathcal {X}})\) is the so-called total variation (or total mass) of the measure \(\mu \) and is defined as

$$\begin{aligned} |\mu |({\mathcal {X}}) {\mathop {=}\limits ^{\text {def.}}}\sup \left\{ \mathrm {Re}\left( \langle f,\,\mu \rangle _{\mathcal {M}}\right) \;;\; f \in {\mathscr {C}}^{}({\mathcal {X}}), \left\| f \right\| _{\infty } \leqslant 1 \right\} . \end{aligned}$$

Note that on unbounded \({\mathcal {X}}\), one needs to impose that f vanishes at infinity. If \({\mathcal {X}}=\{x_i\}_i\) is a finite space, then (\({\mathcal {P}}_\lambda (y)\)) corresponds to the classical finite-dimensional LASSO problem [56], because \(|\mu |({\mathcal {X}})=\left\| a \right\| _1 {\mathop {=}\limits ^{\text {def.}}}\sum _i |a_i|\) where \(a_i=\mu (\{x_i\})\). Similarly, when \({\mathcal {X}}\) is possibly infinite but \(\mu =\sum _i a_i \delta _{x_i}\), one also has that \(|\mu |({\mathcal {X}})=\left\| a \right\| _1\). The noiseless limit of (\({\mathcal {P}}_\lambda (y)\)) as \(\lambda \rightarrow 0\) is

$$\begin{aligned} \min _{\mu \in {\mathcal {M}}({\mathcal {X}})} \; |\mu |({\mathcal {X}}) \quad \text {subject to} \quad \varPhi \mu = y . \end{aligned}$$
(\({\mathcal {P}}_0(y)\))
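To make the connection with the finite-dimensional LASSO concrete, the following minimal sketch (illustrative only, not part of the analysis of this paper) solves (\({\mathcal {P}}_\lambda (y)\)) restricted to a fixed grid, i.e., an \(\ell ^1\)-regularized least squares with complex amplitudes, by proximal gradient descent (ISTA); the grid, random Fourier features, step size and number of iterations are arbitrary illustrative choices.

```python
import numpy as np

def blasso_on_grid(Phi, y, lam, n_iter=2000):
    """min_a 0.5*||Phi a - y||^2 + lam*||a||_1 over complex amplitudes a,
    i.e. the BLASSO restricted to a fixed grid, solved by proximal gradient (ISTA)."""
    m, n = Phi.shape
    a = np.zeros(n, dtype=complex)
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2        # 1/L, L = Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = a - step * (Phi.conj().T @ (Phi @ a - y))
        mag = np.abs(z)                              # complex soft-thresholding (prox of lam*|.|_1)
        a = z / np.maximum(mag, 1e-16) * np.maximum(mag - step * lam, 0.0)
    return a

# toy example: random Fourier measurements of 3 spikes living on a grid of the torus
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 200, endpoint=False)
freqs = rng.integers(-20, 21, size=30)
Phi = np.exp(-2j * np.pi * np.outer(freqs, grid)) / np.sqrt(len(freqs))
a0 = np.zeros(200, dtype=complex); a0[[40, 90, 160]] = [1.0, -0.7, 0.5j]
y = Phi @ a0 + 0.01 * rng.standard_normal(30)
a_hat = blasso_on_grid(Phi, y, lam=0.02)
print("largest recovered amplitudes at grid indices:", np.argsort(-np.abs(a_hat))[:3])
```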

1.2 Previous Works

The initial development of CS [17, 29] considered only discrete problems, which corresponds to imposing that \({\mathcal {X}}\) is a fixed discrete space. The use of “structured” measurements, typically random Fourier frequencies, requires this grid to be uniform [17]. This forbids the recovered Diracs from being closer than the grid resolution, thus implicitly imposing a Rayleigh limit. These initial works have been extended to “continuous” domains, typically making use of continuous Fourier measurements up to frequency \(f_c\). Without random sub-sampling, the main result of Candès and Fernandez-Granda [15] shows that in this noiseless setting with \(y = \varPhi \mu _0\), (\({\mathcal {P}}_0(y)\)) exactly recovers \(\mu _0\) under a so-called Rayleigh criterion: the minimum distance between two spikes \(\min _{i\ne j}\left\| x_i - x_j \right\| \) must be at least of the order of \(1/f_c\). Note that this limit is consistent with the initial discrete analysis, since in this case \(1/f_c\) is equal to the grid spacing. This result has been extended to provide robustness to noise [4, 14, 30, 35] and to cope with more general measurement operators [6]. The CS setup is then obtained by randomly sub-sampling the Fourier frequencies. The first work in this compressed sensing direction is by Tang et al. [55], who showed that the recovery guarantees of [15] remain valid with high probability when only a small number of (Fourier) measurements are randomly selected, of the order (up to log factors) of the sparsity of the underlying measure. All these previous theoretical works, however, strongly rely on the translation invariance of the linear operator (Fourier measurements or convolutions) and of the underlying domain (either Euclidean space or the periodic torus). Applying these results directly to spatially varying operators (such as imaging with non-stationary point spread functions) generally leads to overly pessimistic minimum separation conditions. The goal of this paper is thus to study the CS problem on arbitrary domains and with arbitrary operators, which necessitates replacing the Euclidean distance by an intrinsic metric induced by the operator \(\varPhi \): the Fisher geodesic distance.

Note that this “Rayleigh criterion” is critical to the performance of (\({\mathcal {P}}_\lambda (y)\)): it is shown in [30] that two spikes of opposite signs cannot be recovered if their separation is smaller than \(1/f_c\). Although it is not the topic of this paper, let us note that lifting the minimum separation condition requires imposing positivity of the spikes [25, 51], and the price to pay is an explosion of instabilities as the spikes cluster together [27]. Another point to note is that existing theoretical results on continuous CS are only valid under a random signs assumption on the amplitudes of the sought-after Dirac masses. This forbids in particular imposing positive signs on the solution. This random sign hypothesis is a well-known assumption in classical discrete compressed sensing [17, 29] but appears somewhat unrealistic. Our analysis does not impose such a random sign constraint, which requires different proof techniques, in particular extending the so-called golfing scheme method to this continuous setting.

1.2.1 Numerical Solvers and Alternative Approaches

The focus of this paper is on the theoretical analysis of the performance of the BLASSO method, not on the development and analysis of efficient numerical solvers. Although the BLASSO problem is infinite dimensional, there are efficient numerical solvers that use the fact that the sought-after sparse solution is parameterized by a small number of parameters (positions and amplitudes of the spikes). This opens the door to algorithms which do not scale with some grid size and hence can scale beyond 1-D and 2-D problems. Let us mention in particular: (i) discretization on a grid [31, 54], (ii) semi-definite programming (SDP) relaxation using Lasserre hierarchy [15, 26], (iii) Frank–Wolfe and its variants [8, 9, 37] (a minimal sketch of such an iteration is given below), (iv) non-convex particle flows [20].
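As an illustration of point (iii), here is a minimal sketch of a plain Frank–Wolfe-type loop for (\({\mathcal {P}}_\lambda (y)\)); it is only indicative: the new spike position is located by brute force over a fine grid, and the amplitude refit is a simple least squares rather than the LASSO or “sliding” steps used in the variants cited above. All names and parameters are illustrative.

```python
import numpy as np

def frank_wolfe_blasso(phi, y, lam, domain_grid, max_spikes=10):
    """Greedy Frank-Wolfe-type loop for the BLASSO.
    phi(x) returns the m-dimensional vector Phi delta_x for a position x."""
    positions, amps = [], np.zeros(0, dtype=complex)
    A = np.zeros((len(y), 0), dtype=complex)
    for _ in range(max_spikes):
        residual = y - A @ amps
        # eta = Phi^*(y - Phi mu) evaluated on the grid; its argmax gives the next spike
        eta = np.array([np.vdot(phi(x), residual) for x in domain_grid])
        k = int(np.argmax(np.abs(eta)))
        if np.abs(eta[k]) <= lam:                    # optimality: no spike worth adding
            break
        positions.append(domain_grid[k])
        A = np.column_stack([A, phi(domain_grid[k])])
        amps, *_ = np.linalg.lstsq(A, y, rcond=None) # refit amplitudes on the current support
    return positions, amps

# toy usage with 1-D random Fourier features on the torus
rng = np.random.default_rng(1)
freqs = rng.integers(-20, 21, size=40)
phi = lambda x: np.exp(-2j * np.pi * freqs * x) / np.sqrt(len(freqs))
y = 1.0 * phi(0.21) - 0.6 * phi(0.63)
print(frank_wolfe_blasso(phi, y, lam=1e-3, domain_grid=np.linspace(0, 1, 1000, endpoint=False)))
```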

1.2.2 Other Approaches

The BLASSO is by no means the only method for estimating sparse measures in an off-the-grid setup. One of the first known methods for recovering a sum of Diracs (from Fourier measurements) is Prony’s method [46], which aims to recover the Dirac positions by finding the zeros of some polynomial whose coefficients are derived from the measurements y. This approach is non-variational and non-convex. Several extensions (with improved robustness to noise) have also been proposed, such as MUSIC and ESPRIT [48, 52]. We refer to [41] for a theoretical analysis of these methods. In practice, when the noise is small and the spikes tend to cluster so that the minimum separation condition does not hold, these methods often surpass the BLASSO in terms of performance. However, they are relevant only for spectral (Fourier-type) measurements, and the extension to the multivariate setting is non-trivial; see, for instance, [40, 50]. A rule of thumb is that \(\ell ^1\)-regularization nonetheless remains a good baseline, which benefits from both efficient and stable numerical solvers and an in-depth theoretical analysis that leverages the convexity of the problem.
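For concreteness, here is a minimal sketch of the Prony approach mentioned above in 1-D (noiseless, illustrative only): from \(2s\) consecutive Fourier coefficients \(c_k=\sum _j a_j e^{-2\mathrm {i}\pi k x_j}\), a Hankel system yields the coefficients of a polynomial whose roots encode the positions \(x_j\).

```python
import numpy as np

def prony(c, s):
    """Recover s spike positions on the torus from 2s Fourier coefficients
    c_k = sum_j a_j exp(-2i*pi*k*x_j), k = 0, ..., 2s-1 (noiseless Prony)."""
    # Hankel system for the coefficients of the annihilating polynomial
    H = np.array([[c[k + l] for l in range(s)] for k in range(s)])
    b = np.array([c[k + s] for k in range(s)])
    p = np.linalg.solve(H, -b)
    roots = np.roots(np.concatenate(([1.0], p[::-1])))   # z^s + p_{s-1} z^{s-1} + ... + p_0
    x = np.mod(-np.angle(roots) / (2 * np.pi), 1.0)      # positions from the root phases
    # amplitudes from the Vandermonde system c_k = sum_j a_j z_j^k
    V = np.vander(roots, N=2 * s, increasing=True).T
    a, *_ = np.linalg.lstsq(V, c, rcond=None)
    return np.sort(x), a

# toy usage
x_true, a_true = np.array([0.2, 0.5, 0.8]), np.array([1.0, -0.7, 0.3])
k = np.arange(6)
c = (a_true * np.exp(-2j * np.pi * np.outer(k, x_true))).sum(axis=1)
print(prony(c, s=3)[0])   # ~ [0.2, 0.5, 0.8]
```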

1.3 Contributions

Our main result is informally stated in Theorem 1 and stated in full detail in Theorem 3. It ensures sharp compressed recovery guarantees for a large class of measurement operators over general domains. A salient feature of this statement is that, contrary to previous works such as [55], it does not require randomness of the signs of the sought-after measure. This is achieved by extending the so-called golfing scheme [16, 39] to the infinite-dimensional setting. At the heart of this result is the definition of an intrinsic distance over the parameter domain, the so-called Fisher geodesic distance. It is defined by the metric tensor associated with the covariance kernel of the measurement operator. This definition is crucial both to define the Rayleigh limit of the problem and to quantify the recovery error using the optimal transport distance induced over the space of measures by the Fisher distance.

We now give a more precise exposition of these results. We define the limit covariance kernel as

$$\begin{aligned} K(x,x') {\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}_\omega \overline{\varphi _\omega (x)}\varphi _\omega (x'), \end{aligned}$$

which measures how much two Diracs at x and \(x'\) interact with each other in the large-sample limit \(m\rightarrow \infty \), and assume that K is real-valued (primary examples include the Gaussian kernel and the so-called Jackson kernel used in [15]). Define the metric tensor \({{\mathfrak {g}}}_x {\mathop {=}\limits ^{\text {def.}}}\nabla _1\nabla _2 K(x,x) \in {\mathbb {R}}^{d\times d}\), where \(\nabla _i\) indicates the gradient with respect to the ith variable, and assume that for all \(x\in {\mathcal {X}}\), it is a positive definite matrix. Finally, define the associated geodesic distance \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') = \inf _\gamma \int _0^1 \sqrt{\gamma '(t)^\top {{\mathfrak {g}}}_{\gamma (t)}\gamma '(t)}\mathrm {d}t\), where the infimum is taken over all continuous paths \(\gamma :[0,1] \rightarrow {\mathcal {X}}\) such that \(\gamma (0)=x\) and \(\gamma (1)=x'\). (More details about this geodesic distance are given in Sect. 3.1.) Denote by \({\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x;r)\) the ball of radius r centered at x for the metric \({\mathfrak {d}}_{{\mathfrak {g}}}\). The main result of the paper, here stated in an informal way, is the following.

Theorem 1

(Main result, informal) Let \(R_{\mathcal {X}}{\mathop {=}\limits ^{\text {def.}}}\sup _{x,x'\in {\mathcal {X}}} {\mathfrak {d}}_{{\mathfrak {g}}}(x,x')\). Under some assumptions on the kernel \(K\) (see Assumption 1 in Sect. 4) and features \(\varphi _\omega \) (see Assumption 2 in Sect. 5), there are constants \(r, \varDelta >0\) that only depend on \(K\), and \(C_1, C_2>0\) which depend on K and the regularity of \(\varphi _{\omega _k}\) (up to 2nd order), such that the following holds. Suppose that y is of the form (2) with \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j) \geqslant \varDelta \) and

$$\begin{aligned} m\geqslant C_1 \cdot s \cdot \log (s) \log ((C_2 R_{\mathcal {X}})^d/\rho ). \end{aligned}$$
(3)

Then, with probability \(1-\rho \), when \(\left\| w \right\| \leqslant \delta \) and \(\lambda \sim \frac{\delta }{\sqrt{s}}\), any solution \({\hat{\mu }}\) to (\({\mathcal {P}}_\lambda (y)\)) satisfies

$$\begin{aligned}&{\mathcal {T}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}^2\left( \sum _{j=1}^s {\hat{A}}_j \delta _{x_j}, \; \left|{\hat{\mu }}\right| \right) \nonumber \\&\quad \lesssim \sqrt{s}\delta + \left|{\tilde{\mu }}_0\right|({\mathcal {X}}) \quad \text {and} \quad \max _{j=1}^s\left|{\hat{a}}_j - a_j\right|\lesssim \sqrt{s}\delta + \left|{\tilde{\mu }}_0\right|({\mathcal {X}}), \end{aligned}$$
(4)

where \({\hat{A}}_j {\mathop {=}\limits ^{\text {def.}}}\left|{\hat{\mu }}\right|({\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_j;r))\), \({\hat{a}}_j {\mathop {=}\limits ^{\text {def.}}}{\hat{\mu }}({\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_j;r))\), and \({\mathcal {T}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}\) is the partial optimal transport distance associated with \({\mathfrak {d}}_{{\mathfrak {g}}}\) (see Definition 1).

Let us comment on this result. From an inverse problem perspective (i.e., when no sub-sampling is used, or equivalently when letting \(m \rightarrow +\infty \)), Theorem 1 is already informative, since it defines and proves a Rayleigh limit in terms of a new intrinsic distance (the Fisher geodesic distance), which extends results previously available only in the translation-invariant case. From a compressed sensing perspective, the most salient feature of Theorem 1 is that, up to log factors, the bound (3) is linear in the sparsity of the underlying measure. This improves over the best known result of Tang et al. [55], since we do not require the random signs assumption.

The assumptions on the kernel \(K(x,x')\) mainly state that it must decrease sufficiently fast when x and \(x'\) are far apart, or, in other words, that the coherence between \(\varPhi \delta _x\) and \(\varPhi \delta _{x'}\) must be low. The main novelty of our approach is that we measure this separation in terms of the geodesic metric \({\mathfrak {d}}_{{\mathfrak {g}}}\), which makes it possible to account for non-translation-invariant kernels in an intrinsic and natural manner. The relationship between the decay of the kernel and separation is made explicit in our kernel width definition in Definition 3. As mentioned previously, separation is crucial for stability, and our definition of kernel width can be seen as an extension to the continuous setting of the Babel function in compressed sensing [57], which links sparsity and separation to the well-conditioning of the corresponding covariance matrix. We refer to the discussion following Theorem 2 for further details. The assumptions on the features \(\varphi _\omega \), which are more technical in nature, relate to their regularity and the boundedness of their various derivatives.

Concerning the recovery bound (4), the first part states that the measure \({\hat{\mu }}\) concentrates around the true positions of the Diracs, while the second part guarantees that the complex amplitudes of \({\hat{\mu }}\) around the Diracs are close to their true values. The discrepancy in the first part is measured in terms of a partial optimal transport distance associated with \({\mathfrak {d}}_{{\mathfrak {g}}}\) (Definition 1 in Sect. 3). Although our error bound is linear with respect to the noise level \(\delta \), we do not expect the \(\sqrt{s}\) factor to be sharp; it is rather an artifact of our proof techniques. For instance, in [14], where sub-sampling is not considered, one obtains bounds of the form \(\sum _{j=1}^s \left|{\hat{a}}_j - a_j\right| \lesssim \delta \). We refer to the discussion after Proposition 1 for further remarks and links to previous works, but simply mention here that the existing proof techniques which lead to sharper bounds cannot be readily extended to the case of randomized measurements.

Finally, the constants \(C_1,C_2\) that appear in (3) can depend (generally polynomially) on the dimension d but not on the sparsity s. As we will see in Sect. 5 and in the detailed version of Theorem 1 (Theorem 3), the bound (3) is actually valid when the features \(\varphi _\omega \) and their derivatives are assumed uniformly bounded over all x and \(\omega \). When this is not the case, we relax this assumption in a way similar to the notion of stochastic incoherence [16] in compressed sensing. As a result, m can actually appear in \(C_1,C_2\), generally in a logarithmic form (see examples in Sect. 2), which only adds logarithmic terms in s and d to the final number of measurements.

1.3.1 Outline of the Paper

The paper is organized as follows. In Sect. 2, we give example applications of Theorem 1, including non-translation-invariant frameworks such as Laplace measurements used in microscopy [28]. In Sect. 3, we introduce our Riemannian geometry framework and prove intermediate recovery results based on the existence of a so-called non-degenerate dual certificate, which is known in the literature to be the key object in the analysis of the BLASSO model. In Sect. 4, we study in more detail the relationship between the minimal separation condition and the covariance kernel. We prove that, under some conditions on \(K\), in the limit \(m\rightarrow \infty \), one can indeed prove the existence of a non-degenerate dual certificate when minimal separation is imposed with respect to \({\mathfrak {d}}_{{\mathfrak {g}}}\). Finally, in Sect. 5, we state our main result for a finite number of measurements m (Theorem 3, which is a detailed version of Theorem 1). Section 6 is dedicated to its proof using an infinite-dimensional extension of the celebrated golfing scheme [16], with technical computations in the appendix.

1.3.2 Relationship to Our Previous Work

This article is a substantially extended version of the conference publication [44]. The results of Sect. 4 are for the most part already published (under slightly more restrictive assumptions) in this conference paper. The remainder of the paper is, however, entirely novel. We remove the random signs assumption of [44] thanks to a new proof technique based on the golfing scheme. Furthermore, the results in [44] are restricted to the small noise setting and focus on exact support stability, while we study here arbitrary noise levels and establish more general stability bounds in terms of optimal transport distances.

1.3.3 Notations

Given \(n\in {\mathbb {N}}\), we denote by \([n]{\mathop {=}\limits ^{\text {def.}}}\{1,2,\ldots , n\}\) the first n integers. We write \(1_n\) to denote the vector of length n whose entries are all 1’s, and \(0_n\) to denote the vector of length n whose entries are all 0’s. Given two matrices A and B, we write \(A\prec B\) to mean that \(B-A\) is positive definite and \(A\preceq B\) to mean that \(B-A\) is positive semi-definite. Given two positive numbers ab, we write \(a\lesssim b\) to mean that there exists some universal constant \(C>0\) so that \(a \leqslant C b\). Given \(({\mathcal {X}}, {\mathfrak {d}})\) a metric space, \(x\in {\mathcal {X}}\) and \(r>0\), we define \({\mathcal {B}}_{\mathfrak {d}}(x;r){\mathop {=}\limits ^{\text {def.}}} \left\{ z\in {\mathcal {X}} \;;\; {\mathfrak {d}}(x,z)<r \right\} \) the ball centered on x of radius r, or just \({\mathcal {B}}_{\left\| \cdot \right\| }(r) = \left\{ z\in {\mathcal {X}} \;;\; \left\| z \right\| <r \right\} \) the ball centered on 0 for a norm \(\left\| \cdot \right\| \).

We write \(\left\| \cdot \right\| _p\) to denote the \(\ell _p\) norm, and \(\left\| \cdot \right\| \) without any subscript denotes the spectral norm for matrices or \(\ell _2\) norm for vectors. For any norm \(\left\| \cdot \right\| _X\) on vectors, the corresponding matrix norm is \(\left\| A \right\| _{X \rightarrow Y} = \sup _{\left\| x \right\| _X = 1} \left\| Ax \right\| _Y\) and \(\left\| A \right\| _X = \left\| A \right\| _{X \rightarrow X}\) for short.

Given a vector \(x \in {\mathbb {C}}^{sd}\) decomposed in blocks \(x = [x_1^\top ,\ldots ,x_s^\top ]^\top \) with \(x_i \in {\mathbb {C}}^d\), where s and d will always be defined without ambiguity, we define the block norm \(\left\| x \right\| _{\mathrm {block}}{\mathop {=}\limits ^{\text {def.}}}\max _{1\leqslant i\leqslant s} \left\| x_i \right\| \). Given a vector \(x\in {\mathbb {C}}^{s(d+1)}\) decomposed as \(x = [x_0^\top , X_1^\top , \ldots , X_s^\top ]^\top \) where \(x_0 \in {\mathbb {C}}^s\) and \(X_j\in {\mathbb {C}}^d\), we define \( \left\| x \right\| _{\mathrm {Block}} {\mathop {=}\limits ^{\text {def.}}}\max \left( \left\| x_0 \right\| _\infty , \max _{j=1}^s\left\| X_j \right\| _2 \right) \).

For a complex number a, its sign is denoted by \({{\,\mathrm{sign}\,}}(a) = \frac{a}{\left|a\right|}\). Given a complex-valued measure \(\mu \in {\mathcal {M}}({\mathcal {X}})\) and complex-valued continuous function \(f \in {\mathscr {C}}^{}({\mathcal {X}})\), we recall that \(\langle f,\,\mu \rangle _{\mathcal {M}} {\mathop {=}\limits ^{\text {def.}}}\int _{\mathcal {X}}\overline{f(x)} \mathrm {d} \mu (x)\). For two complex vectors v and w, \(\langle v,\,w\rangle _2 {\mathop {=}\limits ^{\text {def.}}}v^* w\), where \(v^* = {\overline{v}}^\top \) denotes conjugate transpose.

2 Examples

In this section, we illustrate Theorem 1 for some special cases of practical interest in imaging and machine learning. The following statements are obtained by bounding the constants in Theorem 3 in Sect. 5 (the detailed version of Theorem 1). These computations, which can be somewhat verbose, are deferred to Appendices C, D and E.

2.1 Off-the-Grid Compressed Sensing

Off-the-grid compressed sensing, initially introduced in the special case of 1-D Fourier measurements on \({\mathcal {X}}={\mathbb {T}}={\mathbb {R}}/{\mathbb {Z}}\) by [55], corresponds to Fourier measurements of the form (1). This is a “continuous” analogue of the celebrated compressed sensing line of works [17, 29]. We give a multi-dimensional version below.

Let \(f_c \in {\mathbb {N}}\) with \(f_c\geqslant 128\) (for simplicity) and \({\mathcal {X}}= {\mathbb {T}}^d\) the d-dimensional torus. Let \(\varphi _\omega (x) {\mathop {=}\limits ^{\text {def.}}}e^{\mathrm {i} 2\pi \omega ^\top x}\), \(\varOmega {\mathop {=}\limits ^{\text {def.}}} \left\{ \omega \in {\mathbb {Z}}^d \;;\; \left\| \omega \right\| _\infty \leqslant f_c \right\} \), and \(\varLambda (\omega ) = \prod _{j=1}^d g(\omega _j)\) where

$$\begin{aligned} g(j) = \frac{1}{f_c } \sum _{k=\max (j-f_c ,-f_c )}^{\min (j+f_c ,f_c )}(1-\left|k/f_c \right|)(1-\left|(j-k)/f_c \right|). \end{aligned}$$

The distribution \(\varLambda \) concentrates at lower frequencies, and the corresponding kernel is the Jackson kernel (the Dirichlet kernel raised to the power of 4): \( K(x,x') = \prod _{i=1}^d \kappa (x_i-x_i'), \) where

$$\begin{aligned} \kappa (x){\mathop {=}\limits ^{\text {def.}}}\left( \frac{\sin \left( \left( \tfrac{f_c }{2}+1 \right) \pi x \right) }{\left( \tfrac{f_c }{2}+1 \right) \sin (\pi x)} \right) ^4. \end{aligned}$$

Note that if \(\varLambda \) is chosen to be the uniform distribution, then the corresponding kernel is the Dirichlet kernel. The choice of \(\varLambda \) here is purely technical: the Jackson kernel leads to an easier analysis due to its faster decay as \(\left\| x-x' \right\| \) increases and has been considered in many previous works such as [14, 55]. In this case, the Fisher metric is, up to a constant C, the Euclidean metric: \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') = C f_c \left\| x-x' \right\| \). Provided that \(\min _{i\ne j}\left\| x_i-x_j \right\| \gtrsim \frac{d^{\frac{1}{2}} s^{\frac{1}{4}}}{f_c}\), stable recovery is guaranteed with

$$\begin{aligned} m\gtrsim d^2 s \left( \log (s) \log \left( \frac{s}{\rho } \right) + \log \left( \frac{(s f_c d)^d}{\rho } \right) \right) . \end{aligned}$$

The bound on m directly extends the (univariate) main result of [55] to the multivariate setting, whilst removing the unrealistic assumption that the signs of the underlying amplitudes are drawn i.i.d. from the uniform distribution. Note that, compared to the unidimensional case in [55], the minimal separation \(\varDelta \) depends on s in general. However, as we explain in the appendix, when the dimension is such that \(d<4\), this bound can effectively be replaced by one that is exponential in d but does not depend on the sparsity s, and hence yields an extension of the result from [55]. Indeed, during the proof, one must bound a quantity of the form \(\sum _{i=2}^s \left\| x_1-x_i \right\| ^{-4}\) for \(\varDelta \)-separated Diracs. Since in one dimension only 2 Diracs can be situated at distance \(k\varDelta \) from \(x_1\) for each integer \(k>0\), this can easily be bounded by a global bound \(\varDelta ^{-4}\sum _{k=1}^\infty k^{-4}\) that does not depend on s. In the multidimensional case, however, \({\mathcal {O}}(k^{d})\) Diracs spaced \(\varDelta \) apart can be packed into the ball of radius \(k\varDelta \) around \(x_1\), and this can be handled by the polynomial decay of the kernel \(K(x,x')\) (which decays as \(\left\| x-x' \right\| ^{-4}\) when \(f_c\) is sufficiently large) only when \(d<4\).

2.2 Continuous Sampling of the Fourier Transform

For most imaging problems, imposing periodic boundary conditions on a square domain is not satisfying. Considering instead Fourier frequencies over the whole space \({\mathbb {R}}^d\) is more natural and can cope, for instance, with rotation-invariant sampling strategies such as a Gaussian distribution of the frequencies. Let \({\mathcal {X}}\subset {\mathbb {R}}^d\) be a bounded open subset of \({\mathbb {R}}^d\). The space of frequencies is \(\varOmega = {\mathbb {R}}^d\), \(\varphi _\omega (x) {\mathop {=}\limits ^{\text {def.}}}e^{\mathrm {i} \omega ^\top x}\), and \(\varLambda (\omega ) = {\mathcal {N}}(0,\varSigma ^{-1})\) for some known symmetric positive definite matrix \(\varSigma \). Note that, for simplicity, the frequencies are drawn according to a Gaussian with precision matrix \(\varSigma \) (the inverse of the covariance matrix), so that the kernel \(K\) is the classical Gaussian kernel \(K(x,x') = e^{-\frac{1}{2} \left\| \varSigma ^{-\frac{1}{2}}(x-x') \right\| ^2}\). The Fisher metric is \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') = \left\| \varSigma ^{-\frac{1}{2}}(x-x') \right\| \). In this case, provided that \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j) \gtrsim \sqrt{\log (s)}\), stable recovery is guaranteed with

$$\begin{aligned} m\gtrsim s \left( L \log (s) \log \left( \frac{s}{\rho } \right) + L^2 \log \left( \frac{(s L {\mathcal {R}}_{\mathcal {X}})^d}{\rho } \right) \right) . \end{aligned}$$

where \(L = d + \log ^2\left( \frac{dm}{\rho } \right) \). Note that \(\log (m)\) appears in L on the r.h.s. of the expression above, which only incurs additional logarithmic terms in the bound on m, as mentioned in the introduction.
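As a quick sanity check of the claim above (illustrative, not from the paper), one can verify numerically that drawing \(\omega _k \sim {\mathcal {N}}(0,\varSigma ^{-1})\) and averaging \(\overline{\varphi _{\omega _k}(x)}\varphi _{\omega _k}(x')\) over the \(m\) samples reproduces the Gaussian kernel \(e^{-\frac{1}{2} \left\| \varSigma ^{-\frac{1}{2}}(x-x') \right\| ^2}\) up to a Monte Carlo error of order \(1/\sqrt{m}\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 2, 200000
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])          # matrix defining the geometry
omegas = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Sigma), size=m)  # precision Sigma

def phi(omega, x):                                   # Fourier feature e^{i w.x}
    return np.exp(1j * omega @ x)

x, xp = np.array([0.3, -0.1]), np.array([0.9, 0.4])
K_hat = np.mean(np.conj(phi(omegas, x)) * phi(omegas, xp))          # empirical kernel, m samples
diff = x - xp
K_lim = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)           # Gaussian limit kernel
print(abs(K_hat - K_lim))   # small, O(1/sqrt(m))
```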

2.3 Learning of Gaussian Mixtures with Fixed Covariances

An original framework for continuous sparsity is sketched learning of mixture models [37], and in particular Gaussian mixture models (GMM), for which we can exploit the computations of the previous case of Fourier measurements sampled according to a Gaussian distribution. Assume that we have data samples \(z_1,\ldots ,z_n \in {\mathbb {R}}^d\) drawn i.i.d. from a mixture of Gaussians \(\xi {\mathop {=}\limits ^{\text {def.}}}\sum _{i=1}^s a_i {\mathcal {N}}(x_i, \varSigma )\) with known covariance \(\varSigma \). The means \(x_1,\ldots ,x_s \in {\mathcal {X}}\subset {\mathbb {R}}^d\) and weights \(a_1,\ldots ,a_s>0\) are the objects which we want to estimate. We then sample frequencies \(\omega _1,\ldots ,\omega _m \in {\mathbb {R}}^d\) i.i.d. from a Gaussian \(\varLambda = {\mathcal {N}}(0, \varSigma ^{-1}/d)\) and construct the following linear sketch [22] of the data:

$$\begin{aligned} y = \frac{C}{n} \sum _{i=1}^n (e^{-\mathrm {i} \omega _k^\top z_i})_{k=1}^m \end{aligned}$$
(5)

where the constant \(C = (1+\frac{2}{d})^{\frac{d}{4}} \leqslant e^\frac{1}{2}\) is here for normalization purposes. Linear sketches are mainly used for computational gain: they are easy to compute in a streaming or distributed context and take much less memory to store than the whole database [22, 37]. It is easy to see that the sketch can be reformulated as (1), by writing

$$\begin{aligned} y \approx {\mathbb {E}}_z (Ce^{-\mathrm {i} \omega _k^\top z})_{k=1}^m= \varPhi \mu _0 \end{aligned}$$
(6)

where \(\mu _0 = \sum _i a_i \delta _{x_i}\), and \(\varPhi \) is defined using the feature functions

$$\begin{aligned} \varphi _\omega (x) = {\mathbb {E}}_{z\sim {\mathcal {N}}(x,\varSigma )} C e^{\mathrm {i} \omega ^\top z} = Ce^{\mathrm {i} \omega ^\top x}e^{-\frac{1}{2} \omega ^\top \varSigma \omega }. \end{aligned}$$

The “noise” \(w{\mathop {=}\limits ^{\text {def.}}}y - {\mathbb {E}}_z (Ce^{-\mathrm {i} \omega _k^\top z})_{k=1}^m\) is simply the difference between empirical and true expectations, and using simple concentration inequalities that we skip here for simplicity, it is possible to show that with high probability, \(\left\| w \right\| \leqslant {\mathcal {O}}\left( n^{-\frac{1}{2}} \right) \). Applying the previous computations we obtain the following result: provided that \(\min _{i\ne j} \left\| \varSigma ^{-\frac{1}{2}}(x_i-x_j) \right\| _2 \gtrsim \sqrt{d \log (s)}\), stable recovery of \(\mu _0\) is guaranteed when

$$\begin{aligned} m\gtrsim s \left( d \log (s) \log \left( \frac{s}{\rho } \right) + d^2 \log \left( \frac{(s d {\mathcal {R}}_{\mathcal {X}})^d}{\rho } \right) \right) \end{aligned}$$

and the concentration in the recovery bound (4) is given by \(\delta =\left\| w \right\| ={\mathcal {O}}\left( n^{-\frac{1}{2}} \right) \).
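The following short sketch (illustrative constants and sizes, not the paper's implementation) forms the sketch (5) from synthetic GMM data and compares it with its expectation \(\sum _i a_i \overline{\varphi _{\omega _k}(x_i)}\) built from the features above (equal to \(\varPhi \mu _0\) up to the \(1/\sqrt{m}\) normalization of \(\varPhi \)), illustrating that the residual \(w\) is indeed of order \(n^{-1/2}\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 2, 50000, 60
Sigma = 0.05 * np.eye(d)                             # known component covariance
means = np.array([[0.2, 0.3], [0.8, 0.7]])           # unknown means to recover
weights = np.array([0.6, 0.4])

# data from the mixture
labels = rng.choice(2, size=n, p=weights)
Z = means[labels] + rng.multivariate_normal(np.zeros(d), Sigma, size=n)

# random frequencies and the empirical sketch (5)
omegas = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Sigma) / d, size=m)
C = (1 + 2 / d) ** (d / 4)
y = C / n * np.exp(-1j * Z @ omegas.T).sum(axis=0)   # one entry per omega_k

# features phi_omega(x) = C * exp(i w.x) * exp(-0.5 w^T Sigma w)
def phi(x):
    return C * np.exp(1j * omegas @ x) * np.exp(-0.5 * np.einsum('kd,de,ke->k', omegas, Sigma, omegas))

y_model = sum(w * np.conj(phi(x)) for w, x in zip(weights, means))  # expectation of the sketch
print(np.linalg.norm(y - y_model))   # small "noise" w, of order 1/sqrt(n)
```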

2.4 Gaussian Mixtures with Varying Covariances

The case of simultaneously recovering both the means and covariance matrices is an interesting avenue for future research. We simply describe here the associated metric and distance in the univariate case. The geodesic distance between univariate Gaussian distributions is well known [23]: given \(x=(m,u)\) and \( x'=(n,v)\) with \(m,n\in {\mathbb {R}}\) and \(u,v\in {\mathbb {R}}_+\), let \(\varphi (x) {\mathop {=}\limits ^{\text {def.}}}\frac{1}{\root 4 \of {\pi } \sqrt{u}} e^{-(m-\cdot )^2/(2u^2)}\), then the covariance kernel is

$$\begin{aligned} K_0(x,x') = \langle \varphi (x),\,\varphi (x')\rangle _{L^2} = \frac{\sqrt{2uv}}{\sqrt{u^2+v^2}} e^{-\frac{(m-n)^2}{2(u^2+v^2)}}. \end{aligned}$$

The associated metric at \(x=(m,u)\) is \({{\mathfrak {g}}}_x=\frac{1}{2u^2} \mathrm {Id}_2\), and the Fisher–Rao distance is the Poincaré half-plane distance

$$\begin{aligned} {\mathfrak {d}}_0(x,x') = 2\mathrm {arsinh}\left( \frac{\left\| x-x' \right\| }{2\sqrt{uv}} \right) , \quad \text {where} \quad \mathrm {arsinh}(x) {\mathop {=}\limits ^{\text {def.}}}\ln \left( x+\sqrt{x^2+1} \right) . \end{aligned}$$
(7)

Consider now the case of a Gaussian mixture \(\xi {\mathop {=}\limits ^{\text {def.}}}\sum _{i=1}^s a_i {\mathcal {N}}(x_i, v_i^2)\), where the unknowns are \(a_i>0\), \(x_i\in {\mathbb {R}}\) and \(v_i>0\). We are given data \( \left\{ z_i \right\} _{i=1}^n\) drawn i.i.d. from \(\xi \), and we construct the linear sketch (5) as before, where the \(\omega _k \in {\mathbb {R}}\) are i.i.d. from \({\mathcal {N}}(0,\sigma ^2)\). This corresponds to the normalized random features

$$\begin{aligned} \varphi _\omega (m,u) = \left( 2 u^2 \sigma ^2 +1 \right) ^{\frac{1}{4}} e^{-\mathrm {i}m \omega } e^{-\frac{1}{2} u^2 \omega ^2}, \end{aligned}$$

and

$$\begin{aligned} K((m,u), (n,v)) = \frac{\sqrt{2u_\sigma v_\sigma }}{\sqrt{u_\sigma ^2 + {v_\sigma }^2}} e^{-\frac{(m-n)^2}{2(u_\sigma ^2 + {v_\sigma }^2)}} \end{aligned}$$
(8)

where \(u_\sigma ^2 = \frac{1}{2\sigma ^2}+u^2\) and \(v_\sigma ^2 = \frac{1}{2\sigma ^2}+v^2\). The metric at \(x=(m,u)\) is \({{\mathfrak {g}}}_x=\frac{1}{2u_\sigma ^2} \mathrm {Id}_2\). Note that since (8) also corresponds to the kernel between Gaussian distributions with mean and standard deviation as \(x_\sigma {\mathop {=}\limits ^{\text {def.}}}(m,u_\sigma )\) and \(x_\sigma '{\mathop {=}\limits ^{\text {def.}}}(n,v_\sigma )\), the associated geodesic distance is therefore \({\mathfrak {d}}_0(x_\sigma ,x_\sigma ')\) where \({\mathfrak {d}}_0\) is the Poincaré half-plane distance described in (7). (As mentioned in (16), geodesic distances on random features and parameter space are equivalent.)
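For reference, a small illustrative helper (not from the paper) evaluating the distance (7) and the geodesic distance induced by (8), obtained by applying (7) to the smoothed widths \(u_\sigma ,v_\sigma \) (recall \(\mathrm {arsinh}(x)=\ln (x+\sqrt{x^2+1})\)):

```python
import numpy as np

def fisher_rao_gaussian(m1, u1, m2, u2):
    """Fisher-Rao (Poincare half-plane) distance (7) between N(m1, u1^2) and N(m2, u2^2)."""
    return 2 * np.arcsinh(np.hypot(m1 - m2, u1 - u2) / (2 * np.sqrt(u1 * u2)))

def geodesic_varying_cov(m1, u1, m2, u2, sigma):
    """Geodesic distance induced by the kernel (8): apply (7) to the smoothed widths u_sigma."""
    us = np.sqrt(1 / (2 * sigma**2) + u1**2)
    vs = np.sqrt(1 / (2 * sigma**2) + u2**2)
    return fisher_rao_gaussian(m1, us, m2, vs)

print(fisher_rao_gaussian(0.0, 1.0, 1.0, 1.0))       # two unit-width Gaussians, shifted means
print(geodesic_varying_cov(0.0, 1.0, 1.0, 1.0, 2.0))
```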

2.5 Sampling the Laplace Transform

In some fluorescence microscopy applications (see [28] and the references therein), depth measurements are obtained from the Laplace transform of the signal. Contrary to Fourier measurements, this gives rise to a non-translation-invariant kernel \(K\) and was therefore not covered by existing theory. Using the proposed Riemannian geometry framework, we can cover this setting.

Let \({\mathcal {X}}= (0,1)^d \subset {\mathbb {R}}^d_+\). Let \(\varOmega = {\mathbb {R}}_+^d\). Define for \(x\in {\mathcal {X}}\) and \(\omega \in \varOmega \),

$$\begin{aligned} \varphi _\omega (x) {\mathop {=}\limits ^{\text {def.}}}\exp \left( -x^\top \omega \right) \prod _{i=1}^d \sqrt{\frac{x_i+\alpha _i}{\alpha _i}} \quad \text {and} \quad \varLambda (\omega ) = \exp (-2\alpha ^\top \omega ) \prod _{i=1}^d(2\alpha _i). \end{aligned}$$

where the \(\alpha _i \sim d\) are positive and distinct real numbers (of the order of the dimension d). The sampling of \(\omega \) here typically corresponds to observations at random discrete time points.

The Fisher metric is

$$\begin{aligned} {\mathfrak {d}}_{{\mathfrak {g}}}(x,x') = \sqrt{\sum _{i=1}^d \left|\log \left( \frac{x_i+\alpha _i}{x_i'+\alpha _i} \right) \right|^2}, \end{aligned}$$

and provided that \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j) \gtrsim d + \log (d^{3/2}s)\), stable recovery is guaranteed with

$$\begin{aligned} m \gtrsim s \left( C \log (s) \log \left( \frac{s}{\rho } \right) +C^2 \log \left( \frac{C^d}{\rho } \right) \right) \end{aligned}$$

where \(C {\mathop {=}\limits ^{\text {def.}}}d^2 \left( d + \log ^2(m) + \log ^2\left( \frac{d}{\rho } \right) \right) \). Similarly to the Gaussian example, \(\log (m)\) appears in C.
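A minimal numerical sketch of this setting (illustrative values of \(d\) and \(\alpha \)): the frequency vectors have independent coordinates, each exponential with rate \(2\alpha _i\), and one can check the normalization \({\mathbb {E}}_\omega \left|\varphi _\omega (x)\right|^2=1\) as well as evaluate the geodesic distance above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
alpha = np.array([2.0, 2.5])                 # positive, distinct, of the order of d

def sample_omega(m):
    # Lambda is a product of exponential densities with rates 2*alpha_i
    return rng.exponential(scale=1 / (2 * alpha), size=(m, d))

def phi(omega, x):
    # normalized Laplace features: E_omega |phi_omega(x)|^2 = 1
    return np.exp(-omega @ x) * np.prod(np.sqrt((x + alpha) / alpha))

def fisher_dist(x, xp):
    return np.sqrt(np.sum(np.log((x + alpha) / (xp + alpha)) ** 2))

x, xp = np.array([0.2, 0.6]), np.array([0.7, 0.1])
omegas = sample_omega(100000)
print(np.mean(np.abs(phi(omegas, x)) ** 2))   # ~ 1 (normalization check)
print(fisher_dist(x, xp))
```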

3 Stability and the Fisher Information Metric

In this section, we introduce the proposed Riemannian geometry framework and give intermediate recovery guarantees which constitute the first building block of our main result. Namely, we introduce so-called dual certificates, which are known to be key objects in the study of the BLASSO, and show how they lead to sparse recovery guarantees in our Riemannian framework.

3.1 Fisher and Optimal Transport Distances

Let us first introduce the proposed Riemannian geometry framework and define objects related to it.

3.1.1 The Covariance Kernel and the Fubini–Study Metric

A natural property to analyze in our problem is the way two Diracs interact with each other, which is linked to the well-known notion of coherence (or, rather, incoherence) between measurements in compressive sensing [36]. This is done through what we refer to as the covariance kernel \({\hat{K}}:{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {C}}\), defined as

$$\begin{aligned} {\hat{K}}(x,x') {\mathop {=}\limits ^{\text {def.}}}\langle \varPhi \delta _x,\,\varPhi \delta _{x'}\rangle _2 = \frac{1}{m}\sum _{k=1}^m \overline{\varphi _{\omega _k}(x)} \varphi _{\omega _k}(x'), \qquad \forall x,x'\in {\mathcal {X}}\, . \end{aligned}$$
(9)

In the limit case \(m \rightarrow \infty \), the law of large numbers states that \({\hat{K}}\) converges almost surely to the limit covariance kernel:

$$\begin{aligned} K(x,x') {\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}_\omega \overline{\varphi _{\omega }(x)} \varphi _{\omega }(x') \end{aligned}$$
(10)

where we recall that \(\omega \sim \varLambda \). This object naturally governs the geometry of the space, and we use it to define our Riemannian metric, which as we will see is linked to a notion of Fisher information metric. In the rest of the paper, we assume throughout that \(K\) is real-valued, even though \({\hat{K}}\) may be complex-valued.

Given the normalization \({\mathbb {E}}_\omega \left|\varphi _\omega (x)\right|^2 = 1\) for all \(x\in {\mathcal {X}}\), \(\varphi _\omega (x)\) can be interpreted as a complex-valued probability amplitude with respect to \(\omega \) (parameterized by x), a classical notion in quantum mechanics (see [38]). When x varies, a natural metric between probability amplitudes is the so-called Fubini–Study metric, which is the complex equivalent of the well-known Fisher information metric. Writing \(\varphi _\omega (x) = \sqrt{p(\omega ,x)} e^{\mathrm {i}\alpha (\omega ,x)}\) where \(p(\omega ,x) {\mathop {=}\limits ^{\text {def.}}}\left|\varphi _\omega (x)\right|^2 \) and \(\alpha (\omega ,x) {\mathop {=}\limits ^{\text {def.}}}\mathrm {arg}(\varphi _\omega (x))\), the Fubini–Study metric is defined by the following metric tensor in \({\mathbb {C}}^{d\times d}\) [34]:

$$\begin{aligned} \begin{aligned} {{\mathfrak {g}}}_x&{\mathop {=}\limits ^{\text {def.}}}\frac{1}{4} {\mathbb {E}}_p[\nabla _x \log (p) \nabla _x \log (p)^\top ]+ {\mathbb {E}}_p[\nabla _x\alpha \nabla _x\alpha ^\top ] - {\mathbb {E}}_p[\nabla _x \alpha ] {\mathbb {E}}_p[\nabla _x \alpha ]^\top \\&\quad - \frac{\mathrm {i}}{2} {\mathbb {E}}_p[\nabla _x \log (p) \nabla _x \alpha ^\top - \nabla _x \alpha \nabla _x \log (p)^\top ]. \end{aligned} \end{aligned}$$
(11)

where we use the notation \({\mathbb {E}}_p[f] = \int f(\omega ) p(\omega , x) \mathrm {d}\varLambda (\omega )\). If \(\varphi _\omega \) is real-valued, then \(\alpha = 0\) and this is indeed the Fisher metric up to a factor of \(\frac{1}{4}\). The following simple lemma shows the link between this metric and the derivatives of the covariance kernel \(K\).

Lemma 1

For any kernel \(K(x,x') {\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}_\omega \overline{\varphi _\omega (x)}\varphi _\omega (x')\), the Fubini–Study metric defined in (11) satisfies

$$\begin{aligned} {{\mathfrak {g}}}_x = \nabla _1 \nabla _2 K(x,x) - {\mathbb {E}}_p[\nabla _x \alpha ] {\mathbb {E}}_p[\nabla _x \alpha ]^\top \end{aligned}$$
(12)

If furthermore \(K(x,x')\) is assumed real-valued, then \({\mathbb {E}}_p[\nabla _x \alpha ] = 0\), and \({{\mathfrak {g}}}_x = \nabla _1 \nabla _2 K(x,x)\).

Proof

Using \(p = \left|\varphi _\omega \right|^2\) and \(\nabla \varphi _\omega = \left( \frac{\nabla p}{2p} + i \nabla \alpha \right) \varphi _\omega \), a direct computation shows that

$$\begin{aligned} \nabla _x \log (p) = \frac{2}{p}\mathrm {Re}\left( \overline{\varphi _\omega }\nabla \varphi _\omega \right) \quad \text {and} \quad \nabla _x \alpha = \frac{1}{p} \mathrm {Im}\left( \overline{\varphi _\omega }\nabla \varphi _\omega \right) \end{aligned}$$
(13)

Therefore,

$$\begin{aligned}&\frac{1}{4}{\mathbb {E}}_p[\nabla _x \log (p) \nabla _x \log (p)^\top ]+ {\mathbb {E}}_p[\nabla _x\alpha \nabla _x\alpha ^\top ] \\&\quad = \int \frac{1}{p^2} \left( \mathrm {Re}\left( \overline{\varphi _\omega }\nabla \varphi _\omega \right) \mathrm {Re}\left( \overline{\varphi _\omega }\nabla \varphi _\omega \right) ^\top \right. \\&\left. \qquad + \mathrm {Im}\left( \overline{\varphi _\omega }\nabla \varphi _\omega \right) \mathrm {Im}\left( \overline{\varphi _\omega }\nabla \varphi _\omega \right) ^\top \right) p \mathrm {d}\varLambda \\&\quad =\int \frac{1}{p} \mathrm {Re}\left( \left|\varphi _\omega \right|^2 \overline{\nabla \varphi _\omega } \nabla \varphi _\omega ^\top \right) \mathrm {d}\varLambda \\&\quad =\int \mathrm {Re}\left( \overline{\nabla \varphi _\omega } \nabla \varphi _\omega ^\top \right) \mathrm {d}\varLambda = \mathrm {Re}\left( \nabla _1\nabla _2K(x,x)\right) \end{aligned}$$

Similarly,

$$\begin{aligned}&- \frac{\mathrm {i}}{2} {\mathbb {E}}_p[\nabla _x \log (p) \nabla _x \alpha - \nabla _x \alpha \nabla _x \log (p)^\top ] \\&\quad = -\mathrm {i}\int \frac{1}{p^2} \left( \mathrm {Re}\left( \overline{\varphi _\omega }\nabla \varphi _\omega \right) \mathrm {Im}\left( \overline{\varphi _\omega }\nabla \varphi _\omega \right) ^\top \right. \\&\left. \qquad + \mathrm {Im}\left( \overline{\varphi _\omega }\nabla \varphi _\omega \right) \mathrm {Re}\left( \overline{\varphi _\omega }\nabla \varphi _\omega \right) ^\top \right) p \mathrm {d}\varLambda \\&\quad =-\mathrm {i}\int \frac{1}{p} \mathrm {Im}\left( \left|\varphi _\omega \right|^2 \overline{\nabla \varphi _\omega } \nabla \varphi _\omega ^\top \right) \mathrm {d}\varLambda \\&\quad =\mathrm {i}\int \mathrm {Im}\left( \overline{\nabla \varphi _\omega } \nabla \varphi _\omega ^\top \right) \mathrm {d}\varLambda = \mathrm {i}\cdot \mathrm {Im}\left( \nabla _1\nabla _2K(x,x)\right) \end{aligned}$$

which proves the first claim. The second claim is immediate by noticing from (13) that \({\mathbb {E}}_p[\nabla _x \alpha ] = {{\,\mathrm{Im}\,}}\left( \nabla _2 K(x,x) \right) \), which vanishes when \(K(x,x')\) is real (in particular in a neighborhood around \(x=x'\)). \(\square \)
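For instance, for the Fourier features \(\varphi _\omega (x) = e^{\mathrm {i}\omega ^\top x}\) with \(\omega \sim {\mathcal {N}}(0,\varSigma ^{-1})\) from Sect. 2.2, one has \(p(\omega ,x)\equiv 1\) and \(\alpha (\omega ,x)=\omega ^\top x\), so that (11) reduces to

$$\begin{aligned} {{\mathfrak {g}}}_x = {\mathbb {E}}[\omega \omega ^\top ] - {\mathbb {E}}[\omega ]{\mathbb {E}}[\omega ]^\top = \varSigma ^{-1}, \end{aligned}$$

while \(K(x,x') = e^{-\frac{1}{2}\left\| \varSigma ^{-\frac{1}{2}}(x-x') \right\| ^2}\) gives \(\nabla _1\nabla _2 K(x,x) = \varSigma ^{-1}\) and \({\mathbb {E}}_p[\nabla _x \alpha ] = {\mathbb {E}}[\omega ] = 0\), in agreement with (12).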

Since in this paper the limit covariance kernel (10) is assumed real-valued, the previous lemma justifies the definition \({{\mathfrak {g}}}_x = \nabla _1\nabla _2 K(x,x)\) that we adopt in the rest of the paper. For two vectors \(u,v \in {\mathbb {C}}^d\), we define the corresponding inner product

$$\begin{aligned}&\langle u,\,v\rangle _x {\mathop {=}\limits ^{\text {def.}}}{u}^* {{\mathfrak {g}}}_x v \quad \text {and} \quad \left\| u \right\| _x {\mathop {=}\limits ^{\text {def.}}}\sqrt{\langle u,\,u\rangle _x} \end{aligned}$$
(14)

As described in the introduction, this induces a geodesic distance on \({\mathcal {X}}\):

$$\begin{aligned}&{\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \nonumber \\&\quad {\mathop {=}\limits ^{\text {def.}}}\inf \left\{ \int _0^1 \left\| \gamma '(t) \right\| _{\gamma (t)} \mathrm {d} t \;;\; \gamma : [0,1] \rightarrow {\mathcal {X}}\text { smooth},~\gamma (0)=x,~\gamma (1)=x' \right\} \nonumber \\ \end{aligned}$$
(15)

and in the case where \(\varphi _\omega (x)\) is real-valued, this coincides with the “Fisher-Rao” geodesic distance [47] which is used extensively in information geometry for estimation and learning problems on parametric families of distributions [3].
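In practice, for a kernel given only through (10), \({{\mathfrak {g}}}_x\) and \({\mathfrak {d}}_{{\mathfrak {g}}}\) can be approximated numerically. The following minimal sketch (illustrative only; it uses central finite differences and a straight-line path, which merely upper-bounds the geodesic length for a non-constant metric) recovers \(\left\| \varSigma ^{-\frac{1}{2}}(x-x') \right\| \) for the Gaussian kernel of Sect. 2.2, whose metric is the constant tensor \(\varSigma ^{-1}\):

```python
import numpy as np

def metric_tensor(K, x, h=1e-4):
    """Approximate g_x = grad_1 grad_2 K(x, x) by central finite differences."""
    d = len(x)
    g = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            g[i, j] = (K(x + ei, x + ej) - K(x + ei, x - ej)
                       - K(x - ei, x + ej) + K(x - ei, x - ej)) / (4 * h * h)
    return g

def path_length(K, x, xp, n=200):
    """Length of the straight segment from x to xp in the metric g (upper bound on d_g)."""
    ts = np.linspace(0, 1, n + 1)
    pts = np.outer(1 - ts, x) + np.outer(ts, xp)
    v = (xp - x) / n                                  # constant velocity of the segment
    return sum(np.sqrt(v @ metric_tensor(K, p) @ v) for p in pts[:-1])

Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
P = np.linalg.inv(Sigma)
Kgauss = lambda x, xp: np.exp(-0.5 * (x - xp) @ P @ (x - xp))
x, xp = np.array([0.1, 0.2]), np.array([0.7, -0.3])
print(metric_tensor(Kgauss, x))                       # ~ inv(Sigma)
print(path_length(Kgauss, x, xp),                     # ~ ||Sigma^{-1/2}(x - xp)||
      np.sqrt((x - xp) @ P @ (x - xp)))
```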

Remark 1

(As a distance on the feature space) The geodesic distance induced by \({{\mathfrak {g}}}\) is the natural distance between the random features \(\varphi _{\cdot }(x)\). Indeed, as discussed in [11], the manifold \(({\mathcal {X}}, {{\mathfrak {g}}})\) can be seen as an embedded sub-manifold of the sphere of the Hilbert space \(L_2(\mathrm {d}\varLambda )\) via the embedding \(x\mapsto \varphi _{\cdot }(x)\), and given any \(x,x'\in {\mathcal {X}}\), we have

$$\begin{aligned} \inf _{\gamma \in \varGamma _{x,x'}} \int _0^1 \left\| \gamma '(t) \right\| _{L_2(\mathrm {d}\varLambda )} \mathrm {d}t = {\mathfrak {d}}_{{\mathfrak {g}}}(x,x'), \end{aligned}$$
(16)

where \(\varGamma _{x,x'}\) consists of all piecewise smooth paths \(\gamma : [0,1]\rightarrow \left\{ \varphi (x) \;;\; x\in {\mathcal {X}} \right\} \) with \(\gamma (0) = \varphi (x)\) and \(\gamma (1) = \varphi (x')\).

Remark 2

(Fisher metric and invariances) The Fisher–Rao metric \({\mathfrak {d}}_{{\mathfrak {g}}}\) is “canonical” in the sense that it is the only (up to scalar multiples) geodesic distance which satisfies the natural invariances of the BLASSO problem. Indeed, the solutions to (\({\mathcal {P}}_\lambda (y)\)), in the large sample limit \(m \rightarrow +\infty \), are (i) invariant by the multiplication of \(\varphi (x) {\mathop {=}\limits ^{\text {def.}}}(\varphi _\omega (x))_{\omega \in \varOmega }\) by an arbitrary orthogonal transform U (orthogonality on \(L_2(\mathrm {d}\varLambda )\)), i.e., invariance to \(\varphi (x) \mapsto U \varphi (x)\), (ii) covariance under any change of variable \(\varphi \mapsto \varphi \circ h\) where h is a diffeomorphism between two d-dimensional parameter spaces. The covariance (ii) means that if \(\mu = \sum _i a_i \delta _{x_i}\) is a solution associated with \(\varphi \), then the push-forward measure \((h^{-1})_\sharp \mu {\mathop {=}\limits ^{\text {def.}}}\sum _i a_i \delta _{h^{-1}(x_i)}\) is a solution associated with \(\varphi \circ h\). Note that the invariance (i) is different from the usual invariance under “Markov morphisms” considered in information theory [13, 18]. When considering \({\mathfrak {d}}_{{{\mathfrak {g}}}} = {\mathfrak {d}}_{{{\mathfrak {g}}}_\varphi }\) as a Riemannian distance depending solely on \(\varphi \), the invariance under any diffeomorphism h reads

$$\begin{aligned} {\mathfrak {d}}_{{{\mathfrak {g}}}_\varphi }(x,x') = {\mathfrak {d}}_{{{\mathfrak {g}}}_{\varphi \circ h}}(h^{-1}(x),h^{-1}(x')). \end{aligned}$$
(17)

Assuming for simplicity that \(\varphi \) is injective, this invariance (17) is equivalent to the fact that the formula

$$\begin{aligned} \forall \,(q,q') \in {\mathcal {M}}^2, \quad d_{\mathcal {M}}(q,q') {\mathop {=}\limits ^{\text {def.}}}{\mathfrak {d}}_{{{\mathfrak {g}}}_\varphi }(\varphi ^{-1}(q),\varphi ^{-1}(q')) \end{aligned}$$

defines a proper (i.e., parameterization-independent) Riemannian distance \(d_{\mathcal {M}}\) on the embedded manifold \({\mathcal {M}}{\mathop {=}\limits ^{\text {def.}}}( \varphi (x) )_x \subset L_2(\mathrm {d}\varLambda )\). Among all possible such Riemannian metrics on \({\mathcal {M}}\), the only ones being invariant by orthogonal transforms \(\varphi \mapsto U \varphi \) are scalar multiples of the Hermitian positive tensor \(\partial \varphi (x)^* \partial \varphi (x) \in {\mathbb {C}}^{d \times d}\), which is equal to \({{\mathfrak {g}}}_\varphi \) (here \(\partial \varphi (x)^*\) refers to the adjoint in \(L_2(\mathrm {d}\varLambda )\) for the inner product defined by the measure \(\varLambda (\omega )\)).

Remark 3

(Tangent spaces) Formally, in Riemannian geometry, one would use the notion of tangent space \({\mathcal {T}}_x\), and, for instance, the inner product \(\langle \cdot ,\,\cdot \rangle _x\) would only be defined between vectors belonging to \({\mathcal {T}}_x\). However, in our case, since the considered ambient “manifold” is just \({\mathbb {R}}^d\), in the sense that \({\mathcal {X}}\) is not a low-dimensional sub-manifold of \({\mathbb {R}}^d\) but an open set of \({\mathbb {R}}^d\), each tangent space can be identified with \({\mathbb {R}}^d\), and we extend the definitions to complex vectors for our needs.

3.1.2 Optimal Transport Metric

In order to state quantitative performance bounds, one needs to consider a geometric distance between measures. The canonical way to “lift” a ground distance \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x')\) between parameters to a distance between measures is to use optimal transport distances [49].

Definition 1

(Wasserstein distance) Given \(\mu ,\nu \in {\mathcal {M}}_+({\mathcal {X}})\) with \(\left|\mu \right|({\mathcal {X}}) = \left|\nu \right|({\mathcal {X}})\), the Wasserstein distance between \(\mu \) and \(\nu \) relative to the metric \({\mathfrak {d}}\) on \({\mathcal {X}}\) is defined by

$$\begin{aligned} W_{\mathfrak {d}}^2(\mu ,\nu ) {\mathop {=}\limits ^{\text {def.}}}\inf _{\gamma \in \varPi (\mu ,\nu )} \int _{{\mathcal {X}}^2} {\mathfrak {d}}(x,x')\mathrm {d}\gamma (x,x'), \end{aligned}$$

where \(\varPi (\mu ,\nu ) \subset {\mathcal {M}}_+({\mathcal {X}}^2)\) is the set of all transport plans with marginals \(\mu \) and \(\nu \). Given \(\mu ,\nu \in {\mathcal {M}}_+({\mathcal {X}})\) (not necessarily of equal total mass), the partial optimal transport distance between \(\mu \) and \(\nu \) is defined as

$$\begin{aligned} {\mathcal {T}}_{\mathfrak {d}}^2(\mu ,\nu ) {\mathop {=}\limits ^{\text {def.}}}\inf _{{\tilde{\mu }},{\tilde{\nu }}} \left\{ W_{\mathfrak {d}}^2({\tilde{\mu }},{\tilde{\nu }}) + \left|\mu - {\tilde{\mu }}\right|({\mathcal {X}}) + \left|{\tilde{\nu }} -\nu \right|({\mathcal {X}}) \right\} . \end{aligned}$$

Note that the distance \({\mathcal {T}}_{\mathfrak {d}}(\mu ,\nu )\) is actually a hybridization (an inf-convolution) between the classical Wasserstein distance between probability distributions and the total variation norm. It is often called “partial optimal transport” in the literature (see, for instance, [12]) and belongs to the larger class of unbalanced optimal transport distances [21, 42].
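To give a concrete feel for \({\mathcal {T}}_{\mathfrak {d}}\), here is a small sketch for discrete measures (illustrative; it assumes the standard linear-programming reformulation of such TV-penalized transport problems, in which one optimizes over sub-couplings whose marginals are dominated by \(\mu \) and \(\nu \) and pays the total-variation terms on the untransported mass):

```python
import numpy as np
from scipy.optimize import linprog

def partial_ot(a, X, b, Y, dist):
    """T^2_d between mu = sum_i a_i delta_{X_i} and nu = sum_j b_j delta_{Y_j} (a, b >= 0),
    via the LP over sub-couplings gamma >= 0 with gamma 1 <= a, gamma^T 1 <= b:
        min  sum_ij gamma_ij d(X_i, Y_j) + (sum a - sum gamma) + (sum b - sum gamma)."""
    n, p = len(a), len(b)
    D = np.array([[dist(x, y) for y in Y] for x in X])
    c = (D - 2.0).ravel()                      # each transported unit saves the two TV terms
    A_rows = np.kron(np.eye(n), np.ones(p))    # row-sum constraints  gamma 1      <= a
    A_cols = np.kron(np.ones(n), np.eye(p))    # column-sum constraints gamma^T 1  <= b
    res = linprog(c, A_ub=np.vstack([A_rows, A_cols]), b_ub=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun + a.sum() + b.sum()

# toy usage: one spike slightly displaced, one spike missing
X, a = np.array([[0.0], [1.0]]), np.array([1.0, 0.5])
Y, b = np.array([[0.1]]), np.array([1.0])
print(partial_ot(a, X, b, Y, dist=lambda x, y: np.linalg.norm(x - y)))
# ~ 0.1 (transported mass) + 0.5 (unmatched spike paid in total variation)
```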

3.2 Non-degenerate Certificates, Uniqueness and Stability for Sparse Measures

We now introduce the notion of a dual certificate and prove recovery guarantees under certain non-degeneracy conditions, which is the first step toward our main result.

3.2.1 Dual Certificates

The minimization problem (\({\mathcal {P}}_\lambda (y)\)) is a convex optimization problem, and a natural way of studying its solutions is via the corresponding Fenchel dual problem. It is well known that, in the limit as \(\lambda \rightarrow 0\), its solutions cluster in a weak-* sense around minimizers of

$$\begin{aligned} \min _{\mu \in {\mathcal {M}}({\mathcal {X}})} \; |\mu |({\mathcal {X}}) \quad \text {subject to} \quad \varPhi \mu = y , \end{aligned}$$
(\({\mathcal {P}}_0(y)\))

and that properties of the dual solutions to (\({\mathcal {P}}_0(y)\)) with \(y = \varPhi \mu _0\) can be used to derive stability estimates for (\({\mathcal {P}}_\lambda (y)\)) under noisy measurements. In this section, we recall some of these results (see [9, 30] for further details). The (pre)dual of (\({\mathcal {P}}_\lambda (y)\)) is

$$\begin{aligned} \sup _{p \in {\mathbb {C}}^m, \; \left\| \varPhi ^* p \right\| _{\infty } \leqslant 1} \; \mathrm {Re}\left( \langle y,\,p\rangle _2\right) - \frac{\lambda }{2}\left\| p \right\| ^2 , \end{aligned}$$
(\({\mathcal {D}}_\lambda (y)\))

where we remark that the adjoint operator \(\varPhi ^*:{\mathbb {C}}^m \rightarrow {\mathscr {C}}^{}({\mathcal {X}})\) is defined by \((\varPhi ^* p)(x) = \frac{1}{\sqrt{m}}\sum _{i=1}^m p_i \varphi _{\omega _i}(x)\). Note that for \(\lambda >0\), this is the projection of \(y/\lambda \) onto the closed convex set \( \left\{ p \;;\; \left\| \varPhi ^* p \right\| _\infty \leqslant 1 \right\} \) and the solution \(p_\lambda \) is hence unique. The dual solution \(p_\lambda \) is related to any primal solution \(\mu _\lambda \) of (\({\mathcal {P}}_\lambda (y)\)) by the condition

$$\begin{aligned} \varPhi ^* p_\lambda \in \partial \left|\mu _\lambda \right|({\mathcal {X}}) \quad \text {and} \quad p_\lambda =\frac{1}{\lambda }\left( y - \varPhi \mu _\lambda \right) . \end{aligned}$$
(18)

Conversely, any pair \(p_\lambda \) and \(\mu _\lambda \) which satisfy this equation (18) are necessarily dual and primal solutions of (\({\mathcal {D}}_\lambda (y)\)) and (\({\mathcal {P}}_\lambda (y)\)), respectively. In the case where \(\lambda = 0\), a dual solution need not be unique, although existence is guaranteed (since in our setting, the dual variable belongs to a finite dimensional space). In this case, \(p_0\) and \(\mu _0\) solve (\({\mathcal {D}}_\lambda (y)\)) with \(\lambda =0\) and (\({\mathcal {P}}_0(y)\)), respectively, if and only if

$$\begin{aligned} \varPhi \mu _0 = y \quad \text {and} \quad \varPhi ^* p_0 \in \partial \left|\mu _0\right|({\mathcal {X}}). \end{aligned}$$
(19)

Following the literature, we call any element \(\eta \in {{\,\mathrm{Im}\,}}(\varPhi ^*) \cap \partial \left|\mu _0\right|({\mathcal {X}})\) a dual certificate for \(\mu _0\). For \(\mu _0 = \sum _{j=1}^s a_j \delta _{x_j}\), the condition \(\eta \in \partial \left|\mu _0\right|({\mathcal {X}})\) imposes that \(\eta (x_j) = {{\,\mathrm{sign}\,}}(a_j)\) and \(\left\| \eta \right\| _\infty \leqslant 1\). Furthermore, it is known that in the noiseless case, \(\mu _0\) is the unique solution to (\({\mathcal {P}}_0(y)\)) if: the operator \(\varPhi _x: {\mathbb {C}}^s \rightarrow {\mathbb {C}}^m\) defined by \(\varPhi _x b = \sum _{j=1}^s b_j \varPhi \delta _{x_j}\) is injective, and there exists \(\eta \in {{\,\mathrm{Im}\,}}(\varPhi ^*) \cap \partial \left|\mu _0\right|({\mathcal {X}})\) such that \(\left|\eta (x)\right|<1\) for all \(x\not \in \{x_j\}\). In order to quantify the latter constraint and provide quantitative stability bounds, we impose even stronger conditions on \(\eta \) and make the following definition.

Definition 2

(Non-degenerate dual certificate) Given \((a_i,x_i)_{i=1}^s\), we say that \(\eta \in {{\,\mathrm{Im}\,}}(\varPhi ^*)\) is an \((\varepsilon _0,\varepsilon _2,r)\)-non-degenerate dual certificate if:

  (i) \(\eta (x_i) = {{\,\mathrm{sign}\,}}(a_i)\) for all \(i=1,\ldots ,s\),

  (ii) \(\left|\eta (x)\right| \leqslant 1- \varepsilon _0\) for all \(x\in {\mathcal {X}}^{\mathrm {far}}\),

  (iii) \(\left|\eta (x)\right| \leqslant 1- \varepsilon _2 {\mathfrak {d}}_{{\mathfrak {g}}}(x,x_i)^2\) for all \(x\in {\mathcal {X}}^{\mathrm {near}}_i\),

where \({\mathcal {X}}^{\mathrm {near}}_i {\mathop {=}\limits ^{\text {def.}}}{\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_i;r)\) and \({\mathcal {X}}^{\mathrm {far}}{\mathop {=}\limits ^{\text {def.}}}{\mathcal {X}}\setminus \bigcup _i {\mathcal {X}}^{\mathrm {near}}_i\).

In other words, there are neighborhoods of the \(x_j\) such that, outside of these neighborhoods, \(\eta \) is strictly bounded away from 1, and inside, \(\left|\eta \right|\) has quadratic decay. In the next section, we prove stable recovery results from the existence of non-degenerate dual certificates.
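As an illustration (a hypothetical helper, not part of the paper's pipeline), the conditions of Definition 2 can be checked numerically on a discretization of \({\mathcal {X}}\), given a candidate \(p\in {\mathbb {C}}^m\), by evaluating \(\eta = \varPhi ^* p\) through the map \(x \mapsto \varPhi \delta _x\):

```python
import numpy as np

def check_certificate(p, phi, spikes, signs, grid, dist, eps0, eps2, r, tol=1e-6):
    """Check conditions (i)-(iii) of Definition 2 for eta = Phi^* p on a finite grid.
    phi(x): the m-dim vector Phi delta_x, so that eta(x) = <Phi delta_x, p>_2."""
    eta = lambda x: np.vdot(phi(x), p)
    # (i) interpolation of the signs at the spike locations
    interp = all(abs(eta(x) - sgn) <= tol for x, sgn in zip(spikes, signs))
    near_ok, far_ok = True, True
    for x in grid:
        dists = np.array([dist(x, xi) for xi in spikes])
        i = int(np.argmin(dists))
        if dists[i] < r:      # (iii) quadratic decay on the near region X_i^near
            near_ok &= abs(eta(x)) <= 1 - eps2 * dists[i] ** 2 + tol
        else:                 # (ii) uniform gap on the far region X^far
            far_ok &= abs(eta(x)) <= 1 - eps0 + tol
    return interp, near_ok, far_ok
```

In practice \(p\) could be, for instance, the dual solution \(p_\lambda \) of (\({\mathcal {D}}_\lambda (y)\)); the helper only verifies non-degeneracy on a finite grid, which is of course weaker than the statements used in the proofs.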

3.2.2 Stable Recovery Bounds

The following two propositions describe stability guarantees under the non-degeneracy condition. Proposition 1 quantifies how the recovered measure is approximated by a sparse measure supported on \(\{x_j\}_j\), and Proposition 2 describes the error in measure around small neighborhoods of the points \(\{x_j\}_j\).

Proposition 1

(Stability away from the sparse support) Suppose that there exist \(\varepsilon _0,\varepsilon _2>0\) and \(\eta {\mathop {=}\limits ^{\text {def.}}}\varPhi ^* p\) for some \(p\in {\mathbb {C}}^m\) such that \(\eta \) is \((\varepsilon _0,\varepsilon _2,r)\)-non-degenerate.

Assuming the measurement model (2), any minimizer \({\hat{\mu }}\) of (\({\mathcal {P}}_\lambda (y)\)), with \(\left\| w \right\| \leqslant \delta \) and \(\lambda \sim \delta /\left\| p \right\| \), is approximately sparse: defining \({\hat{A}}_j = \left|{\hat{\mu }}\right|({\mathcal {X}}^{\mathrm {near}}_j)\), we have

$$\begin{aligned} {\mathcal {T}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}^2\left( \left|{\hat{\mu }}\right|, \sum _{j=1}^s {\hat{A}}_j \delta _{x_j} \right) \lesssim \frac{1}{\min \left( \varepsilon _0,\varepsilon _2 \right) } \left( \left|{\tilde{\mu }}_0\right|({\mathcal {X}})+ \delta \left\| p \right\| \right) . \end{aligned}$$
(20)

Proof

To prove this proposition, we first establish the following bound

$$\begin{aligned} \varepsilon _0 \left|{\hat{\mu }}\right|({\mathcal {X}}^{\mathrm {far}}) + \varepsilon _2 \sum _{i=1}^s \int _{{\mathcal {X}}^{\mathrm {near}}_i} {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,x_i)^2 \mathrm {d}\left|{\hat{\mu }}\right|(x) \lesssim \delta \left\| p \right\| + \left|{\tilde{\mu }}_0\right|({\mathcal {X}}). \end{aligned}$$
(21)

As we will see, the optimal partial transport bound above is then a consequence of this bound.

For \(i=1,\ldots ,s\), let \({\mathcal {X}}^{\mathrm {near}}_i\subset {\mathcal {X}}\) and \({\mathcal {X}}^{\mathrm {far}}= {\mathcal {X}}\setminus \bigcup _{j=1}^s {\mathcal {X}}^{\mathrm {near}}_j\) be as in Definition 2. Recall the measurement model \(y = \varPhi (\mu _0 + {\tilde{\mu }}_0) +w\), and define \({\bar{\mu }}_0 = \mu _0 + {\tilde{\mu }}_0\) for simplicity. We first adapt the proof of [10, Thm. 2] to derive an upper bound for \(\left|{\hat{\mu }}\right| - \left|{\bar{\mu }}_0\right| - \mathrm {Re}\left( \langle \eta ,\,{\hat{\mu }}-{\bar{\mu }}_0\rangle _{\mathcal {M}}\right) \). By minimality of \({\hat{\mu }}\) and since \(\left\| w \right\| \leqslant \delta \),

$$\begin{aligned} \lambda \left|{\hat{\mu }}\right|({\mathcal {X}}) + \frac{1}{2}\left\| \varPhi {\hat{\mu }} - y \right\| ^2 \leqslant \lambda \left|{\bar{\mu }}_0\right|({\mathcal {X}}) + \frac{1}{2}\left\| \varPhi {\bar{\mu }}_0 - y \right\| ^2 \leqslant \lambda \left|{\bar{\mu }}_0\right|({\mathcal {X}}) + \frac{\delta ^2}{2} \end{aligned}$$

Using \(\eta = \varPhi ^* p\), and by adding and subtracting \(\mathrm {Re}\left( \langle \eta ,\,{\hat{\mu }} - {\bar{\mu }}_0\rangle _{\mathcal {M}}\right) = \mathrm {Re}\left( \langle p,\,\varPhi {\hat{\mu }} - y\rangle _2\right) + \mathrm {Re}\left( \langle p,\,w\rangle _2\right) \), we obtain

$$\begin{aligned} \begin{aligned}&\lambda \left( \left|{\hat{\mu }}\right|({\mathcal {X}}) - \left|{\bar{\mu }}_0\right|({\mathcal {X}}) - \mathrm {Re}\left( \langle \eta ,\,{\hat{\mu }} - {\bar{\mu }}_0\rangle _{\mathcal {M}}\right) \right) + \mathrm {Re}\left( \langle \lambda p,\,\varPhi ({\hat{\mu }} - {\bar{\mu }}_0)\rangle _2\right) \\&\qquad + \frac{1}{2}\left\| \varPhi {\hat{\mu }} - y \right\| ^2 \leqslant \frac{\delta ^2}{2} \\&\qquad \implies \lambda \left( \left|{\hat{\mu }}\right|({\mathcal {X}}) - \left|{\bar{\mu }}_0\right|({\mathcal {X}}) - \mathrm {Re}\left( \langle \eta ,\,{\hat{\mu }} - {\bar{\mu }}_0\rangle _{\mathcal {M}}\right) \right) + \frac{1}{2}\left\| \varPhi {\hat{\mu }} - y + \lambda p \right\| ^2 \\&\quad \leqslant \frac{\delta ^2}{2}+ \frac{\lambda ^2 \left\| p \right\| ^2}{2} - \mathrm {Re}\left( \langle \lambda p,\,w\rangle _2\right) \\&\qquad \implies \left|{\hat{\mu }}\right|({\mathcal {X}}) - \left| {\bar{\mu }}_0\right|({\mathcal {X}}) - \mathrm {Re}\left( \langle \eta ,\,{\hat{\mu }} - {\bar{\mu }}_0\rangle _{\mathcal {M}}\right) \\&\quad \leqslant \frac{1}{2\lambda }\left( \delta + \lambda \left\| p \right\| \right) ^2 \lesssim \delta \left\| p \right\| \end{aligned} \end{aligned}$$
(22)

using \(\lambda \sim \delta / \left\| p \right\| \). We now derive a lower bound for \(\left|{\hat{\mu }}\right|({\mathcal {X}}) - \left|{\bar{\mu }}_0\right|({\mathcal {X}}) - \mathrm {Re}\left( \langle \eta ,\,{\hat{\mu }}-{\bar{\mu }}_0\rangle _{\mathcal {M}}\right) \). Since \(\eta \) is a dual certificate, we have \(\langle \eta ,\,\mu _0\rangle _{\mathcal {M}} = \left|\mu _0\right|({\mathcal {X}})\) and \(\left|\langle \eta ,\,\mu \rangle _{\mathcal {M}}\right| \leqslant \left|\mu \right|({\mathcal {X}})\) for any measure \(\mu \). By further exploiting the non-degeneracy assumptions (ii) and (iii) on \(\eta \), we have

$$\begin{aligned}&\left|{\hat{\mu }}\right|({\mathcal {X}}) - \left|{\bar{\mu }}_0\right|({\mathcal {X}}) - \mathrm {Re}\left( \langle \eta ,\,{\hat{\mu }}-{\bar{\mu }}_0\rangle \right) \geqslant \left|{\hat{\mu }}\right|({\mathcal {X}}) - \mathrm {Re}\left( \langle \eta ,\,{\hat{\mu }}\rangle \right) - 2\left|{\tilde{\mu }}_0\right|({\mathcal {X}})\\&\quad \geqslant \left|{\hat{\mu }}\right|({\mathcal {X}}) - \sum _{i} \int _{{\mathcal {X}}^{\mathrm {near}}_i} \left|\eta \right|\mathrm {d}\left|{\hat{\mu }}\right| - \int _{{\mathcal {X}}^{\mathrm {far}}} \left|\eta \right|\mathrm {d}\left|{\hat{\mu }}\right| - 2\left|{\tilde{\mu }}_0\right|({\mathcal {X}})\\&\quad \geqslant \left|{\hat{\mu }}\right|({\mathcal {X}}) - \sum _i \int _{{\mathcal {X}}^{\mathrm {near}}_i} \left( 1-\varepsilon _2 {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,x_i)^2 \right) \mathrm {d}\left|{\hat{\mu }}\right|(x) - (1-\varepsilon _0) \left|{\hat{\mu }}\right|\left( {\mathcal {X}}^{\mathrm {far}} \right) \\&- 2\left|{\tilde{\mu }}_0\right|({\mathcal {X}})\\&\quad = \varepsilon _0 \left|{\hat{\mu }}\right|\left( {\mathcal {X}}^{\mathrm {far}} \right) + \varepsilon _2 \sum _i \int _{{\mathcal {X}}^{\mathrm {near}}_i} {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,x_i)^2\mathrm {d}\left|{\hat{\mu }}\right|(x) - 2\left|{\tilde{\mu }}_0\right|({\mathcal {X}}) \end{aligned}$$

which proves (21). Note also that by combining this with (22), we obtain the following bound that we will use later:

$$\begin{aligned} \left\| \varPhi {\hat{\mu }} - y + \lambda p \right\| ^2&\leqslant (\delta + \lambda \left\| p \right\| )^2 + 4 \lambda \left|{\tilde{\mu }}_0\right|({\mathcal {X}}) \implies \left\| \varPhi {\hat{\mu }} - y \right\| \nonumber \\&\leqslant \delta + 2\lambda \left\| p \right\| + 2\sqrt{\lambda \left|{\tilde{\mu }}_0\right|({\mathcal {X}})} \end{aligned}$$
(23)

It remains to show that the bound (21) yields an upper bound on the partial optimal transport distance between the recovered measure \(\left|{\hat{\mu }}\right|\) and \(\rho {\mathop {=}\limits ^{\text {def.}}}\sum _i \left|{\hat{\mu }}\right|({\mathcal {X}}^{\mathrm {near}}_i) \delta _{x_i}\), its “projection” onto the positions \(\{x_j\}_j\). To see this, first note that the Kantorovich dual formulation [49] of the Wasserstein distance in Def. 1 is

$$\begin{aligned} \sup \left\{ \int _{\mathcal {X}}\varphi \mathrm {d}\mu + \int _{\mathcal {X}}\psi \mathrm {d}\nu \;;\; \varphi , \psi \in C_b({\mathcal {X}}), \;\forall x,y\in {\mathcal {X}}, \; \varphi (x)+\psi (y) \leqslant {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,y)^2 \right\} \end{aligned}$$

Given any \(\varphi , \psi \in C_b({\mathcal {X}})\) satisfying \(\varphi (x)+\psi (y) \leqslant {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,y)^2\) for all \(x,y\in {\mathcal {X}}\), we have

$$\begin{aligned}&W^2_{{\mathfrak {g}}}(\rho , \left|{\hat{\mu }}\right|_{{\mathcal {X}}^{\mathrm {near}}}) \leqslant \int \varphi \mathrm {d}\left|{\hat{\mu }}\right|_{{\mathcal {X}}^{\mathrm {near}}} + \int \psi \mathrm {d}\rho \\&\quad = \sum _j \left( \int _{{\mathcal {X}}^{\mathrm {near}}_j} ( \varphi (x) + \psi (x_j)) \mathrm {d}\left|{\hat{\mu }}\right|(x) - \psi (x_j) \right. \\&\quad \left. \quad \int _{{\mathcal {X}}^{\mathrm {near}}_j} \mathrm {d}\left|{\hat{\mu }}\right|(x) + \psi (x_j) \left|{\hat{\mu }}\right|({\mathcal {X}}^{\mathrm {near}}_j) \right) \\&\quad = \sum _j \int _{{\mathcal {X}}^{\mathrm {near}}_j} ( \varphi (x) + \psi (x_j)) \mathrm {d}\left|{\hat{\mu }}\right|(x) \leqslant \sum _j \int _{{\mathcal {X}}^{\mathrm {near}}_j} {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,x_j)^2 \mathrm {d}\left|{\hat{\mu }}\right|(x) \end{aligned}$$

So,

$$\begin{aligned} \varepsilon _2W^2_{{\mathfrak {g}}}(\rho , \left|{\hat{\mu }}\right|_{{\mathcal {X}}^{\mathrm {near}}}) \lesssim \left|{\tilde{\mu }}_0\right|({\mathcal {X}})+ \delta \left\| p \right\| \end{aligned}$$

Combining this with the bound \(\varepsilon _0\left|{\hat{\mu }}\right|_{{\mathcal {X}}^{\mathrm {far}}}({\mathcal {X}}) \lesssim \left|{\tilde{\mu }}_0\right|({\mathcal {X}})+ \delta \left\| p \right\| \), which also follows from (21), we obtain

$$\begin{aligned} {\mathcal {T}}^2_{{{\mathfrak {g}}}}(\left|{\hat{\mu }}\right|, \rho ) \lesssim \frac{1}{\min \left( \varepsilon _0,\varepsilon _2 \right) } \left( \left|{\tilde{\mu }}_0\right|({\mathcal {X}})+ \delta \left\| p \right\| \right) . \end{aligned}$$

\(\square \)

We now give stability bounds around the sparse support, under some additional assumptions.

Proposition 2

(Stability around the sparse support) Under the assumptions of Proposition 1, let \({\hat{\mu }}\) be a solution of (\({\mathcal {P}}_\lambda (y)\)), and let \({\hat{a}} = ({\hat{\mu }}({\mathcal {X}}^{\mathrm {near}}_j))_{j=1}^s\). Suppose in addition that for \(j=1,\ldots , s\), there exists \(\eta _j = \varPhi ^* p_j\) which satisfies

  1. (i)

    \(\eta _j(x_j) = 1\) and \(\eta _j(x_\ell ) = 0\) for all \(\ell \ne j\)

  2. (ii)

    \(\left|1-\eta _j(x)\right| \leqslant \varepsilon _2 {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,x_j)^2\) for all \(x\in {\mathcal {X}}^{\mathrm {near}}_j\),

  3. (iii)

    \(\left|\eta _j(x)\right| \leqslant \varepsilon _2 {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,x_\ell )^2\) for all \(x\in {\mathcal {X}}^{\mathrm {near}}_\ell \) and \(\ell \ne j\),

  4. (iv)

    \(\left|\eta _j(x)\right| \leqslant 1-\varepsilon _0\) for all \(x\in {\mathcal {X}}^{\mathrm {far}}\).

Then,

$$\begin{aligned} \forall j=1,\ldots ,s,\quad \left|{\hat{a}}_j - a_j\right| \lesssim \left\| p_j \right\| \left( \delta + \lambda \left\| p_j \right\| \right) + \varepsilon _0^{-1}\left( \delta \left\| p \right\| + \left|{\tilde{\mu }}_0\right|({\mathcal {X}}) \right) \nonumber \\ \end{aligned}$$
(24)

where p is as in Proposition 1.

Proof

First observe that, writing \(\nu = {\hat{\mu }}-\mu _0\), we have \({\hat{a}}_j - a_j = \nu ({\mathcal {X}}^{\mathrm {near}}_j)\). Splitting \(\int _{\mathcal {X}}\eta _j \mathrm {d}\nu \) over \({\mathcal {X}}^{\mathrm {far}}\) and the sets \({\mathcal {X}}^{\mathrm {near}}_\ell \), and using properties (ii)–(iv) of \(\eta _j\), we have

$$\begin{aligned} \left|{\hat{a}}_j - a_j\right| \leqslant \left| \int _{{\mathcal {X}}}\eta _j(x)\mathrm {d}\nu (x)\right| + \varepsilon _2\sum _{\ell =1}^s \int _{{\mathcal {X}}^{\mathrm {near}}_\ell } {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,x_\ell )^2 \mathrm {d}\left|\nu \right|(x) + \left|\nu \right|({\mathcal {X}}^{\mathrm {far}}). \end{aligned}$$

It remains to bound the three terms on the right-hand side.

Using (21), we have \( \left|\nu \right|({\mathcal {X}}^{\mathrm {far}}) = \left|{\hat{\mu }}\right|({\mathcal {X}}^{\mathrm {far}}) \lesssim \varepsilon _0^{-1}\left( \delta \left\| p \right\| + \left|{\tilde{\mu }}_0\right|({\mathcal {X}}) \right) \) and

$$\begin{aligned} \varepsilon _2\sum _{j=1}^s \int _{{\mathcal {X}}^{\mathrm {near}}_j} {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,x_j)^2 \mathrm {d}\left|\nu \right|(x) = \varepsilon _2\sum _{j=1}^s \int _{{\mathcal {X}}^{\mathrm {near}}_j} {{\mathfrak {d}}_{{\mathfrak {g}}}}(x,x_j)^2 \mathrm {d}\left|{\hat{\mu }}\right|(x) \leqslant \delta \left\| p \right\| + \left|{\tilde{\mu }}_0\right|({\mathcal {X}}), \end{aligned}$$

where the equality holds because, on \({\mathcal {X}}^{\mathrm {near}}_j\), \(\nu \) and \({\hat{\mu }}\) differ only by the atom \(a_j\delta _{x_j}\), at which the weight \({{\mathfrak {d}}_{{\mathfrak {g}}}}(\cdot ,x_j)^2\) vanishes.

Finally, by (23),

$$\begin{aligned} \left| \int _{{\mathcal {X}}}\eta _j(x)\mathrm {d}\nu (x)\right|&\leqslant \left| \langle \eta _j,\,{\hat{\mu }} - {\bar{\mu }}_0\rangle _{\mathcal {M}}\right| + \left|{\tilde{\mu }}_0\right|({\mathcal {X}}) \\&\leqslant \left\| p_j \right\| \left\| \varPhi ({\hat{\mu }} - {\bar{\mu }}_0) \right\| + \left|{\tilde{\mu }}_0\right|({\mathcal {X}})\\&\leqslant \left\| p_j \right\| (\delta + \left\| \varPhi {\hat{\mu }} - y \right\| )+ \left|{\tilde{\mu }}_0\right|({\mathcal {X}}) \\&\leqslant \left\| p_j \right\| \left( 2\delta + 2\lambda \left\| p \right\| + 2\sqrt{ \lambda \left|{\tilde{\mu }}_0\right|({\mathcal {X}})} \right) \\&\quad + \left|{\tilde{\mu }}_0\right|({\mathcal {X}})\\&\leqslant {2\delta \left\| p_j \right\| + 2\lambda \left\| p \right\| \left\| p_j \right\| + \lambda \left\| p_j \right\| ^2 + 2\left|{\tilde{\mu }}_0\right|({\mathcal {X}})} \end{aligned}$$

using \(\sqrt{ab}\leqslant (a+b)/2\). Therefore, we obtain

$$\begin{aligned} \left|{\hat{a}}_j - a_j\right| \lesssim \left\| p_j \right\| \left( \delta + \lambda \left\| p_j \right\| \right) + \varepsilon _0^{-1}\left( \delta \left\| p \right\| + \left|{\tilde{\mu }}_0\right|({\mathcal {X}}) \right) \end{aligned}$$

\(\square \)

Additional Certificates Proposition 2 assumes the construction of additional functions \(\eta _j \in {{\,\mathrm{Im}\,}}(\varPhi ^*)\), which are essentially similar to non-degenerate certificates but with all the “signs” to be interpolated set to 0 except for one. As we will see, they are even simpler to construct than \(\eta \): indeed, the reason one has to resort to the random signs assumption (as in [55]) or to the golfing scheme (as in this paper) is that the Euclidean norm of the vector of signs \(({{\,\mathrm{sign}\,}}(a_i))_{i=1}^s\) appears in the proof, which results in a spurious term \(\sqrt{s}\). When constructing the \(\eta _j\), this problem does not occur, since only one sign is nonzero.

Relation to Previous Works Note that (21) and (24), without the inexact sparsity term \( \left|{\tilde{\mu }}_0\right|({\mathcal {X}})\), were previously presented in [35] in the context of sampling Fourier coefficients and in a more general setting in [4]. However, the statement in [4] is given in terms of orthonormal systems and relies on the so-called Bernstein Isolation Property, which imposes that \(\left|P'(x)\right| \leqslant C m^2 \left\| P \right\| _\infty \) for all \(P\in {{\,\mathrm{Im}\,}}(\varPhi ^*)\). These conditions can be difficult to check in our setting of random sampling and were imposed only to ensure the existence of non-degenerate dual certificates and to have explicit control on the constant C. For completeness, we still present the proof of (21) under non-degeneracy assumptions, and we later establish that these non-degeneracy assumptions hold under appropriate separation conditions imposed via \({\mathfrak {d}}_{{\mathfrak {g}}}\).

In [14], one could also obtain bounds \(\sum _{j=1}^s \left|{\hat{a}}_j - a_j\right| \lesssim \delta \) in the case of Fourier sampling; however, to prove such a statement, one is required to construct a trigonometric function (a dual certificate) which interpolates arbitrary sign patterns. In the case of sub-sampling, such an approach cannot lead to sharp dependency on s, since in the real setting, one is then required to show the existence of \(2^s\) random polynomials corresponding to all possible sign patterns. We therefore settle for the bound (24) in this paper. We remark that being able to construct dual functions which interpolate arbitrary sign patterns leads to Wasserstein-1 error bounds, as opposed to the Wasserstein-2 error bounds presented here.

Finally, we mention the more recent work of [32] which presents stability bounds for the sparse spikes problem where one restricts to positive measures and where the sampling functions form a T-system. Under a positivity constraint (rather than total variation penalization), they derive stability bounds in terms of optimal partial transport distances. We stress that since we consider more general measurement operators than T-systems in this work, we consider transport distances under the Fisher metric as opposed to the Euclidean metric. Another difference is that our error bounds use the Wasserstein-2 distance, whereas they use the Wasserstein-1 distance; the reason is that, since they do not consider random sub-sampling, their proofs in fact follow the work of [14] to construct dual certificates which interpolate arbitrary sign patterns.

4 Non-degenerate Limit Certificates

In this section, we provide the second building block of our main theorem: a generic way to ensure the existence of, and to construct, non-degenerate dual certificates when \(m\rightarrow \infty \) and the sought-after Diracs satisfy a minimal separation condition with respect to the metric \({\mathfrak {d}}_{{\mathfrak {g}}}\).

4.1 Notions of Differential Geometry

We start with additional definitions from Riemannian differential geometry. All these notions can be found in the textbook [1], to which we refer the reader for further details. In many instances, we extend classical definitions to the complex case in a natural way.

4.1.1 Riemannian Gradient and Hessian

Let \(f: {\mathbb {R}}^d \rightarrow {\mathbb {C}}\) be a smooth function. The Riemannian gradient \(grad f(x) \in {\mathbb {C}}^d\) and Riemannian Hessian \(Hess f(x) : {\mathbb {C}}^d \rightarrow {\mathbb {C}}^d\), which is a linear mapping, can be defined as:

$$\begin{aligned} grad f(x)&= {{\mathfrak {g}}}_x^{-1} \nabla f(x) \\ \langle Hess f(x)[e_i],\,e_j\rangle _x&= \partial _i \partial _j f(x) - \varGamma _{ij}(x)^\top \nabla f(x) \end{aligned}$$

where \(\nabla , \partial _i\) are the classical Euclidean gradient and partial derivatives, and the \(\{e_i\}\) are the canonical basis of \({\mathbb {R}}^d\). The \(\varGamma _{ij}(x) = [\varGamma _{ij}^k(x)]_k \in {\mathbb {R}}^d\) are the Christoffel symbols, here equal to:

$$\begin{aligned} \varGamma _{ij}^k(x) = \frac{1}{2} \sum _\ell g^{k\ell }(x)\left( \partial _i g_{\ell j}(x) + \partial _j g_{\ell i}(x) - \partial _\ell g_{i j}(x) \right) \, , \end{aligned}$$

where \(g_{ij}(x) = [{{\mathfrak {g}}}_x]_{ij}\) and \(g^{ij}(x) = [{{\mathfrak {g}}}_x^{-1}]_{ij}\). Finally, we denote by \(H f(x)\in {\mathbb {C}}^{d\times d}\) the matrix that contains these terms: \(H f(x) {\mathop {=}\limits ^{\text {def.}}}\Big (\langle Hess f(x)[e_i],\,e_j\rangle _x\Big )_{ij}\).

For \(r=0,1,2\), the “covariant derivatives” \(\mathrm {D}_{r}\left[ f\right] (x): ({\mathbb {C}}^d)^r \rightarrow {\mathbb {C}}\) are multilinear mappings (or a scalar in the case \(r=0\)) defined as:

$$\begin{aligned} \mathrm {D}_{0}\left[ f\right] (x)&{\mathop {=}\limits ^{\text {def.}}}f(x) \\ \mathrm {D}_{1}\left[ f\right] (x)[v]&{\mathop {=}\limits ^{\text {def.}}}\langle v,\,grad f(x)\rangle _x = v^* \nabla f(x) \\ \mathrm {D}_{2}\left[ f\right] (x)[v,v']&{\mathop {=}\limits ^{\text {def.}}}\langle Hess f(x)[v],\,v'\rangle _x = v^* H f(x) v' \end{aligned}$$

We define associated operator norms

$$\begin{aligned} \left\| \mathrm {D}_{1}\left[ f\right] (x) \right\| _x&{\mathop {=}\limits ^{\text {def.}}}\sup _{\left\| v \right\| _x = 1} \mathrm {D}_{1}\left[ f\right] (x)[v] = \left\| {{\mathfrak {g}}}_x^{-\frac{1}{2}} \nabla f(x) \right\| _2 \\ \left\| \mathrm {D}_{2}\left[ f\right] (x) \right\| _x&{\mathop {=}\limits ^{\text {def.}}}\sup _{\left\| v \right\| _x = 1, \left\| v' \right\| _x = 1} \mathrm {D}_{2}\left[ f\right] (x)[v,v'] = \left\| {{\mathfrak {g}}}_x^{-\frac{1}{2}} H f(x) {{\mathfrak {g}}}_x^{-\frac{1}{2}} \right\| _{2} \end{aligned}$$

where we recall that \(\left\| \cdot \right\| _x\) is defined by (14).
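To make these formulas concrete, here is a minimal numerical sketch (ours, not from the paper) that evaluates the Christoffel symbols, the matrix \(H f(x)\) and the operator norms \(\left\| \mathrm {D}_{1}\left[ f\right] (x) \right\| _x\) and \(\left\| \mathrm {D}_{2}\left[ f\right] (x) \right\| _x\) by finite differences, for a user-supplied metric field metric(x) (returning a \(d\times d\) SPD matrix) and a smooth real-valued function fun; all function names are ours.

```python
import numpy as np

# Sketch (ours): Riemannian gradient, H f(x) and the operator norms above,
# computed by central finite differences for a metric field `metric` and a
# smooth real-valued function `fun`.

def fd_grad(fun, x, h=1e-5):
    d = x.size
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = h
        g[i] = (fun(x + e) - fun(x - e)) / (2 * h)
    return g

def fd_hess(fun, x, h=1e-4):
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d); e[i] = h
        H[i] = (fd_grad(fun, x + e) - fd_grad(fun, x - e)) / (2 * h)
    return 0.5 * (H + H.T)

def christoffel(metric, x, h=1e-5):
    # Gamma[i, j, k] = Gamma^k_{ij}(x), from the formula above
    d = x.size
    g_inv = np.linalg.inv(metric(x))
    dg = np.zeros((d, d, d))                      # dg[l, i, j] = partial_l g_{ij}
    for l in range(d):
        e = np.zeros(d); e[l] = h
        dg[l] = (metric(x + e) - metric(x - e)) / (2 * h)
    Gamma = np.zeros((d, d, d))
    for i in range(d):
        for j in range(d):
            for k in range(d):
                Gamma[i, j, k] = 0.5 * np.sum(
                    g_inv[k] * (dg[i, :, j] + dg[j, :, i] - dg[:, i, j]))
    return Gamma

def inv_sqrt(M):
    # M^{-1/2} for a symmetric positive definite matrix
    w, V = np.linalg.eigh(M)
    return (V / np.sqrt(w)) @ V.T

def riemannian_grad(fun, metric, x):
    return np.linalg.solve(metric(x), fd_grad(fun, x))        # g_x^{-1} grad f(x)

def H_matrix(fun, metric, x):
    # (H f)_{ij} = partial_i partial_j f(x) - Gamma_{ij}(x)^T grad f(x)
    return fd_hess(fun, x) - np.einsum('ijk,k->ij',
                                       christoffel(metric, x), fd_grad(fun, x))

def D1_norm(fun, metric, x):
    return np.linalg.norm(inv_sqrt(metric(x)) @ fd_grad(fun, x))

def D2_norm(fun, metric, x):
    S = inv_sqrt(metric(x))
    return np.linalg.norm(S @ H_matrix(fun, metric, x) @ S, 2)
```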

4.1.2 Covariant Derivatives of the Kernel

Recall the definition of the limit covariance kernel (10). Given \(0\leqslant i,j \leqslant 2\), let \(K^{(ij)}(x,x')\) be a “bi”-multilinear map, defined for \(Q\in ({\mathbb {C}}^d)^i\) and \(V\in ({\mathbb {C}}^d)^j\) as

$$\begin{aligned}{}[Q]K^{(ij)}(x,x')[V] {\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}[ \overline{\mathrm {D}_{i}\left[ \varphi _\omega \right] (x)[Q]} {\mathrm {D}_{j}\left[ \varphi _\omega \right] (x')[V]}]. \end{aligned}$$
(25)

In the case \(i,j \leqslant 1\), note that these admit simplified expressions: \(K^{(00)}(x,x') = K(x,x')\), \([v]K^{(10)}(x,x') = v^\top \nabla _1 K(x,x')\) and \([v]K^{(11)}(x,x')[v'] = v^\top \nabla _1 \nabla _2 K(x,x') \overline{v'}\). Define the operator norm of \(K^{(ij)}(x,x')\) as

$$\begin{aligned} \left\| K^{(ij)}(x,x') \right\| _{x,x'} {\mathop {=}\limits ^{\text {def.}}}\sup _{Q,V}\left|[Q]{K^{(ij)}(x,x')}[V]\right| \end{aligned}$$
(26)

where the supremum is over all \(Q = [q_1,\ldots , q_i]\) with \(\left\| q_\ell \right\| _{x}\leqslant 1\) for all \(\ell \in [i]\), and all \(V = [v_1,\ldots , v_j]\) with \(\left\| v_\ell \right\| _{x'}\leqslant 1\) for all \(\ell \in [j]\). We will sometimes overload the notation and write \(\left\| \cdot \right\| _{x}\) when the dependence is only on x, i.e., for \(K^{(ij)}\) where \(j = 0\). Note that, in particular,

$$\begin{aligned} \begin{aligned}&\left\| K^{(10)}(x,x') \right\| _x = \left\| {{\mathfrak {g}}}_x^{-\frac{1}{2}}\nabla _1 K(x,x') \right\| _2, \\&\left\| K^{(11)}(x,x') \right\| _{x,x'} = \left\| {{\mathfrak {g}}}_x^{-\frac{1}{2}}\nabla _1 \nabla _2 K(x,x'){{\mathfrak {g}}}_{x'}^{-\frac{1}{2}} \right\| _2\\&\quad \text {and} \quad \left\| K^{(20)}(x,x') \right\| _{x} = \left\| {{\mathfrak {g}}}_x^{-\frac{1}{2}}H [K(\cdot ,x')](x){{\mathfrak {g}}}_{x}^{-\frac{1}{2}} \right\| _2 \end{aligned} \end{aligned}$$
(27)

All these definitions are naturally extended to the covariance kernel \({\hat{K}}\) by replacing the expectation \({\mathbb {E}}\) in (25) by an empirical expectation over \(\omega _1,\ldots , \omega _m\).
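As an illustration, the following toy sketch (ours; Euclidean metric, so the \(\mathrm {D}_{i}\) derivatives reduce to ordinary gradients) estimates \({\hat{K}}\), \({\hat{K}}^{(10)}\) and \({\hat{K}}^{(11)}\) by an empirical average over sampled frequencies for the Fourier features \(\varphi _\omega (x) = e^{\mathrm {i}\omega ^\top x}\) with \(\omega \sim {\mathcal {N}}(0,\mathrm {Id})\), for which the limit kernel is the Gaussian \(K(x,x') = e^{-\left\| x-x' \right\| ^2/2}\).

```python
import numpy as np

# Toy sketch (ours): empirical kernel and its first derivatives for Fourier
# features phi_omega(x) = exp(i omega . x), omega ~ N(0, Id), Euclidean metric.
rng = np.random.default_rng(0)
d, m = 2, 2000
omegas = rng.standard_normal((m, d))               # omega_1, ..., omega_m

def phi(x):                                        # all features at x, shape (m,)
    return np.exp(1j * omegas @ x)

def K_hat(x, xp):                                  # K_hat(x, x')
    return np.mean(np.conj(phi(x)) * phi(xp))

def K_hat_10(x, xp):                               # nabla_1 K_hat(x, x'), shape (d,)
    vals = np.conj(phi(x)) * phi(xp)               # d/dx conj(phi) = -i omega conj(phi)
    return np.mean((-1j * omegas) * vals[:, None], axis=0)

def K_hat_11(x, xp):                               # nabla_1 nabla_2 K_hat(x, x'), (d, d)
    vals = (np.conj(phi(x)) * phi(xp))[:, None, None]
    return np.mean((-1j * omegas)[:, :, None] * (1j * omegas)[:, None, :] * vals, axis=0)

# sanity check against the limit kernel K(x, x') = exp(-|x - x'|^2 / 2)
x, xp = np.zeros(d), np.array([0.3, -0.1])
print(K_hat(x, xp).real, np.exp(-0.5 * np.sum((x - xp) ** 2)))
```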

4.2 Non-Degenerate Dual Certificate with \(m \rightarrow \infty \)

Recall the definition of the covariance kernel (9). Following [15], a natural approach toward constructing a dual certificate is by interpolating the sign vector \({{\,\mathrm{sign}\,}}(a_j)\) using the functions \({\hat{K}}(x_j,\cdot )\) and \( {\hat{K}}^{(10)}(x_j,\cdot )\), since we have

$$\begin{aligned} \left\{ \eta {\mathop {=}\limits ^{\text {def.}}}\sum _{j=1}^s \alpha _{1,j} {\hat{K}}(x_j,\cdot ) + \sum _{j=1}^s [\alpha _{2,j}] {\hat{K}}^{(10)} (x_j,\cdot ) \;;\; \alpha _{1,j} \in {\mathbb {C}},~\alpha _{2,j} \in {\mathbb {C}}^d \right\} \subset {{\,\mathrm{Im}\,}}(\varPhi ^*)\, . \end{aligned}$$

Using the gradients of the kernel additionally allows one to impose that \(\nabla \eta (x_i) = 0\), which is a necessary (but not sufficient) condition for the dual certificate to reach its maximum amplitude at \(x_i\). Usual proofs then show that, under minimal separation, applying this strategy indeed yields a non-degenerate dual certificate.

We first consider the case where one has access to arbitrarily many measurements (\(m \rightarrow \infty \)), and to this end, we consider the limit covariance kernel \(K\) defined in (10). Let us introduce some handy notations that will be particularly useful in later proofs (Sect. 6). Our aim is to find coefficients \((\alpha _{1,j})_{j=1}^s \in {\mathbb {C}}^s\) and \((\alpha _{2,j})_{j=1}^s \in ({\mathbb {C}}^{d})^s\) such that

$$\begin{aligned} \eta {\mathop {=}\limits ^{\text {def.}}}\sum _{j=1}^s \alpha _{1,j} K(x_j,\cdot ) + \sum _{j=1}^s [\alpha _{2,j}] K^{(10)} (x_j,\cdot ) \end{aligned}$$
(28)

satisfies \( \eta (x_j) = {{\,\mathrm{sign}\,}}(a_j)\) and \( \nabla \eta (x_j) = 0\) for all \(j=1,\ldots ,s\). Note that these \(s(d+1)\) constraints can be written as the linear system

$$\begin{aligned} \varUpsilon \left( {\begin{array}{c}\alpha _1\\ \alpha _2\end{array}}\right) = \left( {\begin{array}{c}({{\,\mathrm{sign}\,}}(a_i))_{i=1}^s\\ 0_{sd}\end{array}}\right) {\mathop {=}\limits ^{\text {def.}}}{\mathbf {u}}_s\, , \end{aligned}$$
(29)

where \(\varUpsilon \in {\mathbb {R}}^{s(d+1) \times s(d+1)}\) is a real symmetric matrix defined as

$$\begin{aligned} \varUpsilon {\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}_\omega [{\gamma (\omega )\gamma (\omega )^*}] \in {\mathbb {C}}^{s(d+1) \times s(d+1)}, \end{aligned}$$
(30)

with the vector \(\gamma (\omega )\in {\mathbb {C}}^{s(d+1)}\) defined as

$$\begin{aligned} \gamma (\omega ) {\mathop {=}\limits ^{\text {def.}}}\left( \left( \varphi _\omega (x_i) \right) _{i=1}^s,\left( \nabla \varphi _\omega (x_i)^\top \right) _{i=1}^s \right) ^\top . \end{aligned}$$
(31)

Assuming that \(\varUpsilon \) is invertible, we can therefore rewrite (28) as \(\eta (x) = (\varUpsilon ^{-1} {\mathbf {u}}_s)^\top {\mathbf {f}}(x)\), where

$$\begin{aligned} {\mathbf {f}}(x) {\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}_\omega [\overline{\gamma (\omega )} \varphi _\omega (x)] = \left( \left( K(x_i,x) \right) _{i=1}^s,\left( \nabla _1 K(x_i,x)^\top \right) _{i=1}^s \right) ^\top \in {\mathbb {R}}^{s(d+1)}\, .\nonumber \\ \end{aligned}$$
(32)

We also define the block diagonal normalization matrix \(D_{{\mathfrak {g}}}\in {\mathbb {R}}^{s(d+1)\times s(d+1)}\) as

$$\begin{aligned} D_{{\mathfrak {g}}}{\mathop {=}\limits ^{\text {def.}}}\begin{pmatrix} \mathrm {Id}_{s} \\ &{} {{\mathfrak {g}}}_{x_1}^{-\frac{1}{2}} \\ &{}&{}\ddots \\ &{}&{}&{} {{\mathfrak {g}}}_{x_s}^{-\frac{1}{2}} \end{pmatrix} \end{aligned}$$
(33)

so that \({\tilde{\varUpsilon }} = D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}}\) has constant value 1 along its diagonal.
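To illustrate the construction, here is a minimal sketch (ours, not the paper's code) that assembles the linear system (29) and evaluates the resulting \(\eta \) of (28) for the Gaussian kernel \(K(x,x') = e^{-\left\| x-x' \right\| ^2/2}\) in the Euclidean metric, so that \({{\mathfrak {g}}}_x = \mathrm {Id}\) and \(D_{{\mathfrak {g}}}= \mathrm {Id}\); the spike positions and signs are arbitrary toy values.

```python
import numpy as np

# Sketch (ours): candidate certificate eta of (28) for a Gaussian kernel in the
# Euclidean metric, obtained by solving the interpolation system (29).

def K(x, xp):
    return np.exp(-0.5 * np.sum((x - xp) ** 2))

def grad1_K(x, xp):                      # nabla_1 K(x, x')
    return -(x - xp) * K(x, xp)

def grad1_grad2_K(x, xp):                # nabla_1 nabla_2 K(x, x'), d x d
    d = x.size
    return (np.eye(d) - np.outer(x - xp, x - xp)) * K(x, xp)

def build_upsilon_and_f(xs):
    """Upsilon of (30) (here already normalized, since g = Id) and x -> f(x) of (32)."""
    s, d = xs.shape
    U = np.zeros((s * (d + 1), s * (d + 1)))
    for i in range(s):
        for j in range(s):
            U[i, j] = K(xs[i], xs[j])
            U[s + i * d: s + (i + 1) * d, j] = grad1_K(xs[i], xs[j])
            U[j, s + i * d: s + (i + 1) * d] = grad1_K(xs[i], xs[j])
            U[s + i * d: s + (i + 1) * d,
              s + j * d: s + (j + 1) * d] = grad1_grad2_K(xs[i], xs[j])
    def f(x):
        return np.concatenate([np.array([K(xi, x) for xi in xs])]
                              + [grad1_K(xi, x) for xi in xs])
    return U, f

# two well-separated spikes with signs +1 and -1
xs = np.array([[0.0, 0.0], [4.0, 0.0]])
signs = np.array([1.0, -1.0])
s, d = xs.shape
U, f = build_upsilon_and_f(xs)
u_s = np.concatenate([signs, np.zeros(s * d)])   # right-hand side of (29)
alpha = np.linalg.solve(U, u_s)

eta = lambda x: alpha @ f(np.asarray(x, float))
print(eta(xs[0]), eta(xs[1]))                    # ~ +1 and -1 by construction
print(abs(eta([2.0, 0.0])))                      # strictly below 1 in between
```

By construction, solving (29) enforces \(\eta (x_j) = {{\,\mathrm{sign}\,}}(a_j)\) and \(\nabla \eta (x_j) = 0\); whether \(\left|\eta \right|\) stays below 1 away from the spikes is exactly what Theorem 2 below guarantees under minimal separation.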

We will prove in Theorem 2 that \(\eta \) of the form (28) is indeed non-degenerate, provided that there is sufficient curvature on \(K(x,\cdot )\) in a small neighborhood around x and \(\min _{k\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_j,x_k)\geqslant \varDelta \), where \(\varDelta \) is the distance at which the kernel and its partial derivatives are sufficiently small (to allow for interpolation with \(K(\cdot ,x_j)\) with minimal interference between the point sources). To do so, we need the following definition.

Definition 3

Given \(r>0\), the local curvature constants \({\bar{\varepsilon }}_0(r)\) and \({\bar{\varepsilon }}_2(r)\) of \(K\) are defined as

$$\begin{aligned} {\bar{\varepsilon }}_0(r)&{\mathop {=}\limits ^{\text {def.}}}\sup \left\{ \varepsilon \;;\; K(x,x')\leqslant 1-\varepsilon , \; \forall x,x'\in {\mathcal {X}}\text { s.t. } {\mathfrak {d}}_{{\mathfrak {g}}}(x,x')\geqslant r \right\} \\ {\bar{\varepsilon }}_2(r)&{\mathop {=}\limits ^{\text {def.}}}\sup \left\{ \varepsilon \;;\; -K^{(02)}(x',x)[v,v] \geqslant \varepsilon \left\| v \right\| _{x}^2, \;\forall x,x'\in {\mathcal {X}}\text { s.t. }{\mathfrak {d}}_{{\mathfrak {g}}}(x,x') <r, \forall v \in {\mathbb {R}}^d \right\} \end{aligned}$$

Given \(h>0\) and \(s\in {\mathbb {N}}\), the kernel width of \(K\) is defined as

$$\begin{aligned}&\varDelta (h,s) {\mathop {=}\limits ^{\text {def.}}}\inf \\&\quad \left\{ \varDelta \;;\; \sum _{k=2}^{s} \left\| K^{(ij)}(x_1,x_k) \right\| _{x_1,x_k} \leqslant h, \; (i,j) \in \left\{ 0,1 \right\} \times \left\{ 0,2 \right\} ,\; \{x_k\}_{k=1}^s \in {\mathcal {S}}_\varDelta \right\} \end{aligned}$$

where \({\mathcal {S}}_\varDelta {\mathop {=}\limits ^{\text {def.}}} \left\{ (x_k)_{k=1}^s\in {\mathcal {X}}^s \;;\; {\mathfrak {d}}(x_k,x_\ell )\geqslant \varDelta ,\; \forall k\ne \ell \right\} \) is the set of s-tuples of \(\varDelta \)-separated points. We define \(\inf \emptyset {\mathop {=}\limits ^{\text {def.}}}+\infty \).

Intuitively, these notions are similar to those appearing in the definition of non-degenerate dual certificates (and will ultimately serve in the proof of existence of such certificates): r is a neighborhood size, \({\bar{\varepsilon }}_0\) represents the distance to 1 of the kernel away from \(x=x'\), and \({\bar{\varepsilon }}_2\) is the “curvature” of the kernel when \(x \approx x'\). Finally, \(\varDelta \) is the “minimal separation” under which s Diracs have minimal interference between them, or, in other words, the covariance kernel and its derivatives have low value. We formalize this in the following assumption.
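For a concrete feel of these quantities, here is a rough numerical sketch (ours; Euclidean metric and Gaussian kernel \(K(x,x') = e^{-\left\| x-x' \right\| ^2/2}\), a toy setting in which closed forms are easy to derive) that estimates \({\bar{\varepsilon }}_0(r)\) and \({\bar{\varepsilon }}_2(r)\) on a grid of separations.

```python
import numpy as np

# Rough sketch (ours): grid estimates of the curvature constants of Def. 3 for
# the Gaussian kernel K(x,x') = exp(-|x-x'|^2/2) with the Euclidean metric,
# compared with the closed forms 1 - exp(-r^2/2) and (1 - r^2) exp(-r^2/2)
# (the latter valid for r < 1).

def eps0(r, ts=np.linspace(0.0, 10.0, 2001)):
    # largest eps with K <= 1 - eps whenever the separation is >= r
    return 1.0 - np.exp(-0.5 * ts[ts >= r] ** 2).max()

def eps2(r, ts=np.linspace(0.0, 1.0, 2001)):
    # smallest eigenvalue of -Hess_2 K(x', x) = (I - dd^T) K over |d| < r
    ds = ts[ts < r]
    return ((1.0 - ds ** 2) * np.exp(-0.5 * ds ** 2)).min()

r = 0.5
print(eps0(r), 1 - np.exp(-0.5 * r ** 2))
print(eps2(r), (1 - r ** 2) * np.exp(-0.5 * r ** 2))
```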

Assumption 1

(Assumptions on the kernel.) Suppose that K is a real-valued kernel. For \(i,j\leqslant 2\) and \(i+j \leqslant 3\), assume that \(B_{ij} {\mathop {=}\limits ^{\text {def.}}}\sup _{x,x'\in {\mathcal {X}}}\left\| K^{(ij)}(x,x') \right\| _{x,x'}< \infty \) and denote \(B_i{\mathop {=}\limits ^{\text {def.}}}B_{0i} + B_{1i}+1\). Assume that K has positive curvature constants \({\bar{\varepsilon }}_0\) and \({\bar{\varepsilon }}_2\) at radius \(0< {r_{\mathrm {near}}}< B_{02}^{-\frac{1}{2}}\). Let \(s \in {\mathbb {N}}\) be such that \(\varDelta {\mathop {=}\limits ^{\text {def.}}}\varDelta (h,s)<\infty \) with \(h \leqslant \frac{1}{64}\min \left( \frac{{\bar{\varepsilon }}_0}{B_0},\frac{{\bar{\varepsilon }}_2}{B_2} \right) \).

Under this assumption, the following theorem, which is the main result of this section, proves that a limit non-degenerate dual certificate can be constructed under minimal separation.

Theorem 2

Under Assumption 1, for all \(\{x_k\}_{k=1}^s\) with \(\min _{k\ne \ell }{\mathfrak {d}}_{{\mathfrak {g}}}(x_k,x_\ell ) \geqslant \varDelta \), there exists a unique function \(\eta \) of the form (28) which is \((\frac{{\bar{\varepsilon }}_0}{2},\frac{{\bar{\varepsilon }}_2}{4}, {r_{\mathrm {near}}})\)-non-degenerate. Moreover,

$$\begin{aligned} \left\| \overline{{{\,\mathrm{sign}\,}}(a_j)}\mathrm {D}_{2}\left[ \eta \right] (x) - K^{(02)}(x_j,x) \right\| _x \leqslant \frac{{\bar{\varepsilon }}_2}{16} \qquad \forall x\in {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_j;{r_{\mathrm {near}}}). \end{aligned}$$

We delay the (slightly lengthy) proof of this result to the next subsection. Before that, we make a few comments.

4.2.1 Dependency on s

As we have seen in the examples of Sect. 2, for a constant h we generally let the minimal separation \(\varDelta = \varDelta (h,s)\) depend on s. Indeed, in dimension d, it is well known that one can pack \(C^d\) \(\varDelta \)-separated points in a ball of radius \(2\varDelta \) for some constant C. (This is related to the kissing number.) Hence, there exist s \(\varDelta \)-separated points such that

$$\begin{aligned} \sum _{k=2}^{s} \left\| K^{(ij)}(x_1,x_k) \right\| _{x_1,x_k} \geqslant \min \left( C^d, s \right) \sup _{{\mathfrak {d}}(x,x')\geqslant \varDelta }\left\| K^{(ij)}(x,x') \right\| _{x,x'}\, . \end{aligned}$$

Therefore, while the kernel width can be independent of s in low dimensions (the trick is then to upper bound the sum by its limit as \(s\rightarrow \infty \), assuming the sum on the l.h.s. converges), as d increases, the dependence on s becomes inevitable; avoiding it would require \(\varDelta \) to depend exponentially on d.

4.2.2 Babel Function

The attentive reader might recognize the similarity between the definition of the kernel width \(\varDelta (h,s)\) and the Babel function from compressed sensing [57], if we restrict the definition to \((i,j) = (0,0)\) and recall that \(K(x,x') = {\mathbb {E}}_\omega [\overline{\varphi _\omega (x)}\varphi _\omega (x')]\). The Babel function of an \(m\times N\) matrix A with columns \({\mathbf {a}}_j\) is defined as

$$\begin{aligned} \mu (s) = \max _{i\in [N]} \max \left\{ \sum _{j\in S}\left|\langle {\mathbf {a}}_i,\,{\mathbf {a}}_j\rangle \right| \;;\; S\subset [N], \left|S\right| = s, i\notin S \right\} , \end{aligned}$$

and a small value of \(\mu (s)\) ensures that the sub-matrix \(A_S^* A_S\), where \(A_S\) is the matrix A restricted to an index set S with \(\left|S\right|\leqslant s\), is well conditioned and invertible. Furthermore, recovery guarantees for Basis Pursuit and Orthogonal Matching Pursuit can be stated in terms of \(\mu (s)\). In Theorem 2, sufficient separation with respect to the kernel width similarly ensures that \(\varPhi _x^* \varPhi _x\) is well conditioned, thereby providing performance guarantees for the BLASSO.
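As a side note, the Babel function itself is straightforward to compute; here is a small sketch (ours) for a matrix with unit-norm columns.

```python
import numpy as np

# Sketch (ours): the Babel function mu(s) of a matrix A with unit-norm columns,
# computed exactly by sorting each row of the Gram matrix.

def babel(A, s):
    """mu(s) = max_i max_{|S| = s, i not in S} sum_{j in S} |<a_i, a_j>|."""
    G = np.abs(A.conj().T @ A)          # |<a_i, a_j>|
    np.fill_diagonal(G, 0.0)            # exclude j = i
    # for each i, the worst S of size s picks the s largest off-diagonal entries
    worst = np.sort(G, axis=1)[:, ::-1][:, :s].sum(axis=1)
    return worst.max()

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
A /= np.linalg.norm(A, axis=0)          # normalize columns
print([round(babel(A, s), 3) for s in (1, 2, 4, 8)])
```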

4.3 Proof of Theorem 2

Before proving Theorem 2, we illustrate the link between curvature of the kernel as represented by \({\bar{\varepsilon }}_2\) in Def. 3 and the quadratic decay condition \(\left|\eta \right| \leqslant 1-\varepsilon {\mathfrak {d}}_{{\mathfrak {g}}}(x_i,\cdot )^2\) that we used in the definition of non-degenerate certificates (Def. 2). The resulting condition (35) is the one that we are actually going to prove in practice. The following lemma is based on a generalized second-order Taylor expansion.

Lemma 2

Let \(x_0 \in {\mathcal {X}}\) and \(a \in {\mathbb {C}}\) with \(\left|a\right| = 1\). Suppose that for some \(\varepsilon >0\), \(B>0\) and \(0<r\leqslant B^{-\frac{1}{2}}\) we have: for all \(x \in {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_0;r)\) and \(v \in {\mathbb {C}}^d\), it holds that \(-K^{(02)}(x_0,x)[v,v] \geqslant \varepsilon \left\| v \right\| _{x}^2\) and \(\left\| K^{(02)}(x_0,x) \right\| _x \leqslant B\). Let \(\eta :{\mathcal {X}}\rightarrow {\mathbb {C}}\) be a smooth function.

  1. (i)

    If \(\eta (x_0) = 0, \nabla \eta (x_0) = 0\) and

    $$\begin{aligned} \left\| \mathrm {D}_{2}\left[ \eta \right] (x) \right\| _x \leqslant \delta \qquad \forall x\in {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_0;r) \end{aligned}$$
    (34)

    then \(\left|\eta (x)\right| \leqslant \delta {\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)^2\) for all \(x\in {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_0;r)\).

  2. (ii)

    If \(\eta (x_0) = a\), \(\nabla \eta (x_0) = 0\) and

    $$\begin{aligned} \left\| {\overline{a}}\mathrm {D}_{2}\left[ \eta \right] (x) - K^{(02)}(x_0,x) \right\| _x \leqslant \delta \qquad \forall x\in {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_0;r) \end{aligned}$$
    (35)

    for some \(\delta < \frac{\varepsilon }{2}\), then for all \(x\in {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_0;r)\) we have \(\left|\eta (x)\right| \leqslant 1- \varepsilon ' {\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)^2\) with \(\varepsilon ' = \frac{\varepsilon - 2\delta }{2}\).

Proof

We prove (ii); the proof of (i) is similar and simpler. Using (35) and the assumption on \(K^{(02)}\), we can deduce that for all \(v\in {\mathbb {R}}^d\) we have

$$\begin{aligned} \mathrm {Re}\left( {\overline{a}} \mathrm {D}_{2}\left[ \eta \right] (x)[v,v]\right) \leqslant -(\varepsilon -\delta )\left\| v \right\| _x^2 \quad \text {and} \quad \left|\mathrm {Im}\left( {\overline{a}} \mathrm {D}_{2}\left[ \eta \right] (x)[v,v]\right) \right| \leqslant \delta \left\| v \right\| _x^2 \end{aligned}$$

Given a geodesic \(\gamma :[0,1] \rightarrow {\mathbb {R}}^d\), it is a well-known property that \( \ddot{\gamma } + \sum _{i,j} \varGamma _{ij}(\gamma ) \dot{\gamma _i} \dot{\gamma _j} = 0 \) where we recall that \(\varGamma _{ij} \in {\mathbb {R}}^d\) are the Christoffel symbols. Therefore, we have

$$\begin{aligned} \frac{\mathrm {d}^2}{\mathrm {d}t^2}\eta (\gamma (t))&= {\dot{\gamma }}(t)^\top \nabla ^2 \eta (\gamma (t)) {\dot{\gamma }}(t) + \nabla \eta (\gamma (t))^\top \ddot{\gamma }(t)\\&= {\dot{\gamma }}(t)^\top \nabla ^2 \eta (\gamma (t)) {\dot{\gamma }}(t) - \nabla \eta (\gamma (t))^\top \left( \sum _{i,j} \varGamma _{ij}(\gamma (t)) {\dot{\gamma }}_i(t) {\dot{\gamma }}_j(t) \right) \\&= {\dot{\gamma }}(t)^\top H \eta (\gamma (t)) {\dot{\gamma }}(t) = \mathrm {D}_{2}\left[ \eta \right] (\gamma (t)) [{\dot{\gamma }}(t), {\dot{\gamma }}(t)] \end{aligned}$$

So, given any geodesic path with \(\gamma (0) = x_0\) and \(\gamma (1) = x\), since of course we have \({\mathfrak {d}}_{{\mathfrak {g}}}(x_0,\gamma (t))\leqslant {\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)\leqslant r\), by applying the inequalities above:

$$\begin{aligned} \begin{aligned} \mathrm {Re}\left( {\overline{a}} \eta (x)\right)&= \mathrm {Re}\left( {\overline{a}} \left( \eta (x_0) + \nabla \eta (x_0)^\top {\dot{\gamma }}(0) + \int _0^1 (1-t) \frac{\mathrm {d}^2}{\mathrm {d}t^2}\eta (\gamma (t)) \mathrm {d} t \right) \right) \\&= 1 + \int _0^1 (1-t) \mathrm {Re}\left( {\overline{a}} \mathrm {D}_{2}\left[ \eta \right] (\gamma (t)) [{\dot{\gamma }}(t), {\dot{\gamma }}(t)]\right) \mathrm {d} t\\&\leqslant 1- \left( \varepsilon -\delta \right) \int _0^1 (1-t) \left\| {\dot{\gamma }}(t) \right\| _{\gamma (t)}^2 \mathrm {d}t = 1- \frac{\left( \varepsilon -\delta \right) }{2}{\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)^2. \end{aligned} \end{aligned}$$
(36)

where the last line follows because \(\left\| {\dot{\gamma }}(t) \right\| _{\gamma (t)}\) is constant for all \(t\in [0,1]\). Similarly, we can show that \(\mathrm {Re}\left( {\overline{a}} \eta (x)\right) \geqslant 1-\frac{B+\delta }{2} {\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)^2 \geqslant 0\) since \(r\leqslant B^{-\frac{1}{2}}\), and \( \left|\mathrm {Im}\left( {\overline{a}} \eta (x)\right) \right| \leqslant \frac{\delta }{2}{\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)^2 \), from which we get \(\left|\eta (x)\right| \leqslant \mathrm {Re}\left( {\overline{a}} \eta (x)\right) + \left|\mathrm {Im}\left( {\overline{a}} \eta (x)\right) \right| \leqslant 1- \frac{\varepsilon - 2\delta }{2} {\mathfrak {d}}_{{\mathfrak {g}}}(x_0,x)^2\). \(\square \)

We can now proceed with the proof of Theorem 2.

Proof of Theorem 2

Recall the block diagonal matrix \(D_{{\mathfrak {g}}}\) from (33). The system (29) is equivalent to

$$\begin{aligned} {\tilde{\varUpsilon }} \left( {\begin{array}{c}{\tilde{\alpha }}_1\\ {\tilde{\alpha }}_2\end{array}}\right) = {\mathbf {u}}_s. \end{aligned}$$
(37)

where \({\tilde{\varUpsilon }} = D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}}\) and \({\tilde{\alpha }} = D_{{\mathfrak {g}}}^{-1} \alpha \). So, if \({\tilde{\varUpsilon }}\) is invertible, then we can write \(\eta = \left( {\tilde{\varUpsilon }}^{-1} {\mathbf {u}}_s \right) ^\top D_{{\mathfrak {g}}}{\mathbf {f}}= \left( \varUpsilon ^{-1} {\mathbf {u}}_s \right) ^\top {\mathbf {f}}\). Therefore, we will proceed as follows: First, prove that \({\tilde{\varUpsilon }}\) is invertible. Second, bound the coefficients \(\alpha _1\) and \(\alpha _2\). Third, prove that \(\eta \) is non-degenerate.

We first prove that the matrix \({\tilde{\varUpsilon }}\) is invertible. To this end, we decompose it into blocks

$$\begin{aligned} {\tilde{\varUpsilon }} = \left( \begin{matrix} \varUpsilon _0 &{} \varUpsilon _1^\top \\ \varUpsilon _1 &{} \varUpsilon _2 \end{matrix} \right) \end{aligned}$$
(38)

where \(\varUpsilon _0\in {\mathbb {C}}^{s\times s}\), \(\varUpsilon _1 \in {\mathbb {C}}^{sd\times s}\) and \(\varUpsilon _2 \in {\mathbb {C}}^{sd \times sd}\) are defined as

$$\begin{aligned}&\varUpsilon _0 {\mathop {=}\limits ^{\text {def.}}}(K(x_i,x_j))_{i,j=1}^s, \qquad \varUpsilon _1 {\mathop {=}\limits ^{\text {def.}}}({{\mathfrak {g}}}_{x_i}^{-\frac{1}{2}} \nabla _1K(x_i,x_j))_{i,j=1}^s, \quad \varUpsilon _2 \\&\quad {\mathop {=}\limits ^{\text {def.}}}({{\mathfrak {g}}}_{x_i}^{-\frac{1}{2}} \nabla _1\nabla _2K(x_i,x_j) {{\mathfrak {g}}}_{x_j}^{-\frac{1}{2}})_{i,j=1}^s. \end{aligned}$$

To prove the invertibility of \({\tilde{\varUpsilon }}\), it suffices to prove that \(\varUpsilon _2\) and its Schur complement \(\varUpsilon _S {\mathop {=}\limits ^{\text {def.}}}\varUpsilon _0 - \varUpsilon _1 \varUpsilon _2^{-1} \varUpsilon _1^\top \) are both invertible. To show that \(\varUpsilon _2\) is invertible, we define \(A_{ij} = {{\mathfrak {g}}}_{x_i}^{-\frac{1}{2}}\nabla _1\nabla _2K(x_i,x_j) {{\mathfrak {g}}}_{x_j}^{-\frac{1}{2}}\), so that \(\varUpsilon _2\) is the \(s\times s\) block matrix with blocks \(A_{ij}\) and diagonal blocks \(A_{ii} = \mathrm {Id}\). By Lemma 5 in Appendix A.1, Assumption 1 and (27), we have

$$\begin{aligned} \left\| \mathrm {Id}- \varUpsilon _2 \right\| _{\mathrm {block}} \leqslant&~ \max _{i} \sum _{j\ne i} \left\| A_{ij} \right\| _2 =\max _{i} \sum _{j\ne i} \left\| K^{(11)}(x_i,x_j) \right\| _{x_i,x_j} \leqslant h \leqslant 1/32. \end{aligned}$$

Since \(\left\| \mathrm {Id}- \varUpsilon _2 \right\| _{\mathrm {block}} <1\), \(\varUpsilon _2\) is invertible, and we have \(\left\| \varUpsilon _2^{-1} \right\| _{\mathrm {block}} \leqslant \frac{1}{1 - \left\| \mathrm {Id}- \varUpsilon _2 \right\| _{\mathrm {block}}} \leqslant \frac{4}{3}\). Next, again with Lemma 5, we can bound

$$\begin{aligned} \left\| \mathrm {Id}- \varUpsilon _0 \right\| _\infty&=~ \max _{i} \sum _{j\ne i}\left|K(x_i,x_j)\right| \leqslant h\\ \left\| \varUpsilon _1 \right\| _{\infty \rightarrow block }&\leqslant ~\max _{i}\sum _j \left\| {{\mathfrak {g}}}_{x_i}^{-\frac{1}{2}}\nabla _1K(x_i,x_j) \right\| _2= \max _{i}\sum _j \left\| K^{(10)}(x_i,x_j) \right\| _{x_i} \leqslant h \end{aligned}$$

since \(K^{(10)}(x,x)=0\). Hence, we have

$$\begin{aligned}&\left\| \mathrm {Id}- \varUpsilon _S \right\| _\infty \leqslant \left\| \mathrm {Id}-\varUpsilon _0 \right\| _\infty + \left\| \varUpsilon _1^\top \right\| _{block \rightarrow \infty }\nonumber \\&\quad \left\| \varUpsilon _2^{-1} \right\| _{\mathrm {block}} \left\| \varUpsilon _1 \right\| _{\infty \rightarrow block } \leqslant h + \frac{4}{3}h^2 \leqslant 2h{\mathop {=}\limits ^{\text {def.}}}h' <1. \end{aligned}$$
(39)

Therefore, the Schur complement of \({\tilde{\varUpsilon }}\) is invertible and so is \({\tilde{\varUpsilon }}\). Moreover, \(\left\| \varUpsilon _S^{-1} \right\| _\infty \leqslant \frac{1}{1-h'}\).

We can now define:

$$\begin{aligned} {\tilde{\alpha }} = {\tilde{\varUpsilon }}^{-1} {\mathbf {u}}_s = \left( \begin{matrix}{\tilde{\alpha }}_1 \\ {\tilde{\alpha }}_2 \end{matrix} \right) \end{aligned}$$

and, as described above, \(\alpha = D_{{\mathfrak {g}}}^{-1} {\tilde{\alpha }}\). The Schur complement of \({\tilde{\varUpsilon }}\) allows us to express \(\alpha _1\) and \(\alpha _2\) as

$$\begin{aligned} \left( {\begin{array}{c}{\tilde{\alpha }}_1\\ {\tilde{\alpha }}_2\end{array}}\right) =\left( {\begin{array}{c} \varUpsilon _S^{-1} {{\,\mathrm{sign}\,}}(a)\\ -\varUpsilon _2^{-1} \varUpsilon _1 \varUpsilon _S^{-1} {{\,\mathrm{sign}\,}}(a)\end{array}}\right) \end{aligned}$$
(40)

and therefore we can bound

$$\begin{aligned} \left\| \alpha _1 \right\| _\infty&\leqslant \left\| \varUpsilon _S^{-1} \right\| _\infty \leqslant \frac{1}{1-h'} \\ \max _i \left\| \alpha _{2,i} \right\| _{x_i}&= \left\| {\tilde{\alpha }}_2 \right\| _{\mathrm {block}} \leqslant \left\| \varUpsilon _2^{-1} \right\| _{\mathrm {block}} \left\| \varUpsilon _1 \right\| _{\infty \rightarrow block } \left\| \varUpsilon _S^{-1} \right\| _\infty \leqslant 4h \end{aligned}$$

Moreover, we have

$$\begin{aligned} \left\| \alpha _1 - {{\,\mathrm{sign}\,}}(a) \right\| _\infty \leqslant \left\| \mathrm {Id}- \varUpsilon _S^{-1} \right\| _\infty \leqslant \left\| \varUpsilon _S^{-1} \right\| _\infty \left\| \mathrm {Id}-\varUpsilon _S \right\| _\infty \leqslant \frac{h'}{1-h'} \end{aligned}$$
(41)

We can now prove that \(\eta \) is non-degenerate. For any x such that \({\mathfrak {d}}_{{\mathfrak {g}}}(x, x_i)\geqslant {r_{\mathrm {near}}}\) for all \(x_i\)’s, there exists at most one index i such that \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x_i) < \varDelta /2\) and so, for all \(j \ne i\), we have \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x_j)\geqslant \varDelta /2\). Therefore,

$$\begin{aligned} \left|\eta (x)\right|&= \Bigg |\alpha _{1,i} K(x_i,x) + \sum _{j \ne i} \alpha _{1,j} K(x_j,x) \Bigg . \\&\Bigg . + [\alpha _{2,i}]K^{(10)}(x_i,x) + \sum _{j \ne i}[\alpha _{2,j}] K^{(10)}(x_j,x) \Bigg | \\&\leqslant \left\| \alpha _{1} \right\| _\infty \left( \left|K(x_i,x)\right| + \sum _{j \ne i} \left|K(x_j,x)\right| \right) + \max _i \left\| \alpha _{2,i} \right\| _{x_i} \left( \left\| K^{(10)}(x_i,x) \right\| _{x_i}\right. \\&\left. \quad + \sum _{j \ne i}\left\| K^{(10)}(x_j,x) \right\| _{x_j} \right) \\&\leqslant \frac{1}{1-h'}\left( 1-{\bar{\varepsilon }}_0 + h \right) + 4h\left( B_{10} + h \right) \leqslant 1- \frac{{\bar{\varepsilon }}_0}{2}. \end{aligned}$$

Now, let x be such that \({\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x)\leqslant {r_{\mathrm {near}}}\). Similarly, for all \(j \ne i\) we have \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x_j)\geqslant \varDelta /2\). Observe that

$$\begin{aligned} \overline{{{\,\mathrm{sign}\,}}(a_i)} \mathrm {D}_{2}\left[ \eta \right] (x)&= K^{(02)}(x_i,x) + \left( \overline{{{\,\mathrm{sign}\,}}(a_i)} \alpha _{1,i}-1 \right) K^{(02)}(x_i,x)\\&\quad + \overline{{{\,\mathrm{sign}\,}}(a_i)} \Bigg [ \sum _{j \ne i} \alpha _{1,j} K^{(02)}(x_j,x) + [ \alpha _{2,i}]K^{(12)}(x_i,x) \\&\quad + \sum _{j \ne i} [ \alpha _{2,j}]K^{(12)}(x_j,x)\Bigg ] \end{aligned}$$

So,

$$\begin{aligned}&\left\| \overline{{{\,\mathrm{sign}\,}}(a_i)} \mathrm {D}_{2}\left[ \eta \right] (x) - K^{(02)}(x_i,x) \right\| _x\\&\quad \leqslant \left\| \left( \overline{{{\,\mathrm{sign}\,}}(a_i)} \alpha _{1,i}-1 \right) K^{(02)}(x_i,x) \right. \\&\quad + \overline{{{\,\mathrm{sign}\,}}(a_i)} \Bigg [ \sum _{j \ne i} \alpha _{1,j} K^{(02)}(x_j,x) + [ \alpha _{2,i}] K^{(12)}(x_i,x) \\&\left. \quad + \sum _{j \ne i} [ \alpha _{2,j}]K^{(12)}(x_j,x)\Bigg ] \right\| _x\\&\quad \leqslant \frac{h'}{1-h'} B_{02} + h \left\| \alpha _1 \right\| _\infty + \max _i \left\| \alpha _{2,i} \right\| _{{x_i}} \left( B_{12} + h \right) \\&\quad \leqslant \frac{h'}{1-h'} B_{02} + \frac{h}{1-h'} + 4h B_{12} + 4h^2 \leqslant \frac{{\bar{\varepsilon }}_2}{16} \end{aligned}$$

We conclude using Lemma 2 and \(\frac{{\bar{\varepsilon }}_2 - 2 {\bar{\varepsilon }}_2/16}{2} \geqslant {\bar{\varepsilon }}_2/4\). \(\square \)

5 Sparse Recovery

In this section, we formulate our main contribution, Theorem 3, which is a detailed version of Theorem 1. In previous sections, we have shown that the existence of a non-degenerate dual certificate implies sparse recovery guarantees, and that in the limit case \(m\rightarrow \infty \), a minimal separation assumption implies the existence of such a certificate. Our main theorem is obtained by bounding the deviations from the limit case when m is finite. We do so by extending the celebrated golfing scheme [39] to the infinite-dimensional case. We begin with our assumptions on the feature functions \(\varphi _\omega \).

5.1 Almost Bounded Random Features

In order to bound the variation between \(K\) and \({\hat{K}}\), we would ideally like the features \(\varphi _\omega \) and their derivatives to be uniformly bounded for all \(\omega \). However, this may not be the case: think of \(e^{i\omega ^\top x}\), which does not have a uniformly bounded gradient when the support of the distribution \(\varLambda \) is not bounded. On the other hand, if \(\varLambda (\omega )\) has sufficient decay as \(\left\| \omega \right\| \) increases, one could argue that the selected random features and their derivatives are uniformly bounded with high probability. For \(r\in \{0,1,2\}\), we define the random variables

$$\begin{aligned} L_r(\omega ) {\mathop {=}\limits ^{\text {def.}}}\sup _{x\in {\mathcal {X}}} \left\| \mathrm {D}_{r}\left[ \varphi _\omega \right] (x) \right\| _{x}. \end{aligned}$$
(42)

Note that \(L_r(\omega )<\infty \) for each \(\omega \) since \({\mathcal {X}}\) is a bounded domain and \(\varphi _\omega \) is smooth.

Since \(\left|\varphi _\omega (x) - \varphi _\omega (x')\right| = \left|\int _{0}^1 \frac{d}{dt}\varphi _\omega (\gamma (t))\mathrm {d}t\right| = \left|\int _{0}^1 \mathrm {D}_{1}\left[ \varphi _\omega \right] (\gamma (t))[{\dot{\gamma }}(t)]\mathrm {d}t\right|\) for a smooth path from x to \(x'\), it is easy to see that

$$\begin{aligned} \left|\varphi _\omega (x) - \varphi _\omega (x')\right| \leqslant L_1(\omega ) {\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \end{aligned}$$
(43)

We will also require \(\mathrm {D}_{2}\left[ \varphi _\omega \right] (x)\) to be Lipschitz; to this end, we assume that for all \(x,x'\in {\mathcal {X}}\), there exists an isometric isomorphism \(\tau _{x\rightarrow x'}:{\mathbb {C}}^d \rightarrow {\mathbb {C}}^d\) with respect to the metric, that is, such that \(\langle u,\,v\rangle _{x} = \langle \tau _{x\rightarrow x'}u,\,\tau _{x\rightarrow x'}v\rangle _{x'}\), and such that for all \(\omega \):

$$\begin{aligned}&L_3(\omega ) {\mathop {=}\limits ^{\text {def.}}}\inf \\&\left\{ L>0 \;;\; \sup _{{\mathfrak {d}}_{{\mathfrak {g}}}(x,x')\leqslant {r_{\mathrm {near}}}} \frac{\left\| \mathrm {D}_{2}\left[ \varphi _\omega \right] (x) - \mathrm {D}_{2}\left[ \varphi _\omega \right] (x')[\tau _{x\rightarrow x'} \cdot , \tau _{x\rightarrow x'} \cdot ] \right\| _x}{{\mathfrak {d}}_{{\mathfrak {g}}}(x,x')} \leqslant L \right\} {<} \infty . \end{aligned}$$

where naturally

$$\begin{aligned}&\left\| \mathrm {D}_{2}\left[ \varphi _\omega \right] (x) - \mathrm {D}_{2}\left[ \varphi _\omega \right] (x')[\tau _{x\rightarrow x'} \cdot , \tau _{x\rightarrow x'} \cdot ] \right\| _x\\&\quad =\sup _{\left\| u \right\| _x \leqslant 1, \left\| v \right\| _x \leqslant 1} \mathrm {D}_{2}\left[ \varphi _\omega \right] (x)[u,v] - \mathrm {D}_{2}\left[ \varphi _\omega \right] (x')[\tau _{x\rightarrow x'} u, \tau _{x\rightarrow x'} v] \end{aligned}$$

and \({r_{\mathrm {near}}}\) comes from Assumption 1. One possible choice of \(\tau _{x\rightarrow x'}\) is to choose the parallel transport along the unique geodesic connecting x and \(x'\). Another possible choice is to simply choose \(\tau _{x\rightarrow x'}:v \mapsto {{\mathfrak {g}}}_{x'}^{-\frac{1}{2}} {{\mathfrak {g}}}_x^{\frac{1}{2}}v\). The latter choice implies

$$\begin{aligned}&\left\| \mathrm {D}_{2}\left[ \varphi _\omega \right] (x) - \mathrm {D}_{2}\left[ \varphi _\omega \right] (x')[\tau _{x\rightarrow x'} \cdot , \tau _{x\rightarrow x'} \cdot ] \right\| _x \nonumber \\&= \left\| {{\mathfrak {g}}}_{x'}^{-\frac{1}{2}} H \varphi _\omega (x'){{\mathfrak {g}}}_{x'}^{-\frac{1}{2}} - {{\mathfrak {g}}}_{x}^{-\frac{1}{2}} H \varphi _\omega (x){{\mathfrak {g}}}_{x}^{-\frac{1}{2}} \right\| . \end{aligned}$$
(44)

which is a more convenient expression that we will use in the examples.

Finally, we let \(F_r: [0,\infty ) \rightarrow [0,1]\) be decaying tail functions such that

$$\begin{aligned} {\mathbb {P}}_\omega \left( L_r(\omega ) > t \right) \leqslant F_r(t). \end{aligned}$$
(45)

Our sample complexity will depend on the decay of these tail functions, so that the derivatives of the selected random features are bounded with high probability. A similar idea of stochastic incoherence was exploited in [16] for deriving compressed sensing bounds.
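As a toy illustration (ours, not from the paper), for Fourier features \(\varphi _\omega (x) = e^{\mathrm {i}\omega ^\top x}\) with the Euclidean metric one has \(L_1(\omega ) = \left\| \omega \right\| \), independently of x, so a valid tail function is \(F_1(t) = {\mathbb {P}}(\left\| \omega \right\| >t)\); for \(\omega \sim {\mathcal {N}}(0,\mathrm {Id}_2)\) this tail is exactly \(e^{-t^2/2}\), which the following Monte Carlo check confirms.

```python
import numpy as np

# Toy sketch (ours): samples of L_1(omega) = |omega| for Fourier features with
# a standard Gaussian frequency distribution in dimension 2, and its exact tail.
rng = np.random.default_rng(0)
d, n = 2, 200_000
L1_samples = np.linalg.norm(rng.standard_normal((n, d)), axis=1)
for t in (1.0, 2.0, 3.0):
    print(t, np.mean(L1_samples > t), np.exp(-t ** 2 / 2))   # empirical vs exact
```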

5.2 Main Result

Our main result is valid under the following assumption, which links the tail probabilities of the bounds on the feature functions and the final number of measurements m.

Assumption 2

(Assumption on the features and the sample complexity) For \(\rho >0\), suppose that \(m\in {\mathbb {N}}\) and constants \(\{{{\bar{L}}}_i\}_{i=0}^3 \in {\mathbb {R}}_+^4\) are chosen such that

$$\begin{aligned} \begin{aligned} \sum _{j=0}^3 F_j({{\bar{L}}}_j)&\leqslant \frac{\min ({\bar{\varepsilon }}_0,{\bar{\varepsilon }}_2,\rho )}{m} \quad \text {and} \quad \max _{j=0}^3 \left( {{\bar{L}}}_j^2 \sum _{i=0}^3 F_i({{{\bar{L}}}_i}) + 6\int _{{{\bar{L}}}_j}^\infty t F_j({t}) \mathrm {d}t \right) \\&\leqslant \frac{\min \left( {\bar{\varepsilon }}_0,{\bar{\varepsilon }}_2 \right) }{m} \end{aligned} \end{aligned}$$
(46)

and

$$\begin{aligned} m \gtrsim s \left( C_1 \log (s) \log \left( \frac{s}{\rho } \right) +C_2 \log \left( \frac{(sN)^d}{\rho } \right) \right) \end{aligned}$$
(47)

where \(N{\mathop {=}\limits ^{\text {def.}}}\frac{{\mathcal {R}}_{\mathcal {X}}{{\bar{L}}}_1}{{\bar{\varepsilon }}_0} +\frac{{r_{\mathrm {near}}}{{\bar{L}}}_{3}{{\bar{L}}}_0 + {{\bar{L}}}_2}{{\bar{\varepsilon }}_2}\), \(C_1{\mathop {=}\limits ^{\text {def.}}}({{\bar{L}}}_{0}^2+{{\bar{L}}}_1^2) \sum _{r=0,2}\frac{B_r^2}{{\bar{\varepsilon }}_r^2}\), and \(C_2{\mathop {=}\limits ^{\text {def.}}}\frac{B_{22}{{\bar{L}}}_{01}^2}{B_2^2} + \sum _{r=0,2}\left( \frac{{{\bar{L}}}_r^2}{{\bar{\varepsilon }}_r^2} + \frac{{{\bar{L}}}_{01}{{\bar{L}}}_r}{{\bar{\varepsilon }}_r} \right) \) with \({{\bar{L}}}_{ij} = \sqrt{{{\bar{L}}}_i^2+{{\bar{L}}}_j^2}\).

The constants \({{\bar{L}}}_r\) play the role of “stochastic” Lipschitz constants: for \(r=0,1,2\), with high probability on \(\omega _j\), \(\mathrm {D}_{r}\left[ \varphi _\omega \right] (x)\) will be \({{\bar{L}}}_r\)-bounded and \({{\bar{L}}}_{r+1}\)-Lipschitz. The condition (46) ensures that this is true with probability \(1-\rho \), that is, with the same desired probability of failure. The entire proof is then carried out conditionally on these bounds holding.

Note also that, generally, \(\{{{\bar{L}}}_r\}\) depend on m, through (46). However, all our examples fall under two categories (see Sec. 2):

  1. (i)

either \(\left\| \mathrm {D}_{r}\left[ \varphi _\omega \right] (x) \right\| _x\) is already uniformly bounded, in which case \({{\bar{L}}}_r\) can be chosen independently of \(\rho \) and m; this is, for instance, the case for discrete Fourier sampling;

  2. (ii)

    or the \(F_r(t)\) are exponentially decaying, in which case we can show that \({{\bar{L}}}_r = {\mathcal {O}}\left( \log \left( \frac{m}{\rho } \right) ^p \right) \) for some \(p>0\), which only incurs additional logarithmic terms in the bound (47). This occurs in the case of sampling the Laplace transform or sampling the Fourier transform with respect to a Gaussian distribution.
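To illustrate point (ii) above, here is a toy computation (ours): for a Gaussian-type tail \(F_r(t) = e^{-t^2/2}\), choosing \({{\bar{L}}}_r\) of order \(\sqrt{\log (m/\rho )}\) makes \(F_r({{\bar{L}}}_r)\) smaller than \(\rho /m\); handling the other constants appearing in (46) only changes the argument of the logarithm.

```python
import numpy as np

# Toy sketch (ours): size of L_bar needed for a Gaussian-type tail
# F(t) = exp(-t^2/2) to fall below rho/m, illustrating the logarithmic growth
# mentioned in point (ii).
def Lbar(m, rho):
    return np.sqrt(2.0 * np.log(m / rho))   # smallest L with exp(-L^2/2) = rho/m

rho = 1e-2
for m in (10 ** 3, 10 ** 4, 10 ** 5):
    L = Lbar(m, rho)
    print(m, round(L, 2), np.exp(-L ** 2 / 2), rho / m)
```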

We are now ready to state the detailed version of Theorem 1, which is the main result of this paper.

Theorem 3

Suppose that Assumptions 1 and 2 hold. Let y be as in (2) with \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j)\geqslant \varDelta \) and \(\left\| w \right\| \leqslant \delta \). Then, with probability at least \(1-\rho \), any solution \({\hat{\mu }}\) of (\({\mathcal {P}}_\lambda (y)\)) with \(\lambda \sim \frac{\delta }{\sqrt{s}}\) satisfies

$$\begin{aligned} {\mathcal {T}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}^2 \left( \left|{\hat{\mu }}\right|, \sum _{i=1}^s {\hat{A}}_i \delta _{x_i} \right) \leqslant e \quad \text {and} \quad \max _{i=1}^s \left|{\hat{a}}_i - a_i\right| \leqslant e \end{aligned}$$

where \({\hat{A}}_i = \left|{\hat{\mu }}\right|\left( {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_i;{r_{\mathrm {near}}}) \right) \), \({\hat{a}}_i = {\hat{\mu }}\left( {\mathcal {B}}_{{\mathfrak {d}}_{{\mathfrak {g}}}}(x_i;{r_{\mathrm {near}}}) \right) \) and \(e\lesssim \frac{1}{\min ({\bar{\varepsilon }}_0,{\bar{\varepsilon }}_2)}\left( \left|{\tilde{\mu }}_0\right|({\mathcal {X}}) + \delta \sqrt{s} \right) \).

The next section is dedicated to the proof of Theorem 3 using an infinite-dimensional golfing scheme. Appendix A is dedicated to the proof of some technical lemmas. Appendix B gathers all the concentration inequalities that we use in the golfing scheme, which are essentially variants of Bernstein’s inequality. Finally, Appendices C, D and E are dedicated to the computation of all the constants in Assumptions 1 and 2 for the examples described in Sect. 2, which can be quite verbose.

6 Proof of Theorem 3

The main step toward proving Theorem 3 is to prove the existence of a dual certificate satisfying the properties described in Proposition 1. More precisely, we are going to prove the following theorem.

Theorem 4

Suppose that Assumptions 1 and 2 hold. Let \( \left\{ x_j \right\} _{j=1}^s\) be such that \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j)\geqslant \varDelta \). Then, with probability at least \(1-\rho \), there exists \(p\in {\mathbb {C}}^m\) with \(\left\| p \right\| \lesssim \sqrt{s}\) such that \({\hat{\eta }}= \varPhi ^* p\) is \((\frac{{\bar{\varepsilon }}_0}{8}, \frac{3{\bar{\varepsilon }}_2}{8}, {r_{\mathrm {near}}})\)-non-degenerate.

6.1 Outline of the Proof

The construction of the non-degenerate certificate involves several intermediate steps. As usual in this type of proof, we will first prove these properties on a finite \(\varepsilon \)-net that covers \({\mathcal {X}}\), then extend them to the whole space by regularity. Here we work with several nets \({\mathcal {G}}^{\mathrm {near}}_j\subset {\mathcal {X}}^{\mathrm {near}}_j\) and \({\mathcal {G}}^{\mathrm {far}}\subset {\mathcal {X}}^{\mathrm {far}}\) whose precision will be adjusted later. The principle of the golfing scheme is to work with an “approximate” dual certificate \(\eta ^{\mathrm {app}}\) (which is actually not a dual certificate at all), then to “correct” it to obtain the desired true certificate. In detail, we will go through the following steps:

  1. 1.

    First, show that with probability at least \(1-\rho \), there is an approximate certificate \(\eta ^{\mathrm {app}}\in {{\,\mathrm{Im}\,}}(\varPhi ^*)\) such that for some constant \(c_0\) that will be adjusted later,

    $$\begin{aligned} {\left\{ \begin{array}{ll} \sum _{j=1}^s \left|\eta ^{\mathrm {app}}(x_j) - {{\,\mathrm{sign}\,}}(a_j)\right|^2 + \left\| \mathrm {D}_{1}\left[ \eta ^{\mathrm {app}}\right] (x_j) \right\| _{{x_j}}^2\leqslant c_0^2 &{}\text {for all } j=1,\ldots , s \\ \left|\eta ^{\mathrm {app}}(x)\right| \leqslant 1-\frac{{\bar{\varepsilon }}_0}{4} &{} \text {for all }x\in {\mathcal {G}}^{\mathrm {far}}\\ \left\| {\overline{{{\,\mathrm{sign}\,}}(a_j)}\mathrm {D}_{2}\left[ \eta ^{\mathrm {app}}\right] (x) - K^{(02)}(x_j,x)} \right\| _x \leqslant \frac{7{\bar{\varepsilon }}_2}{64} &{}\text {for all }j=1,\ldots , s, x\in {\mathcal {G}}^{\mathrm {near}}_j \end{array}\right. } \end{aligned}$$
    (48)

In other words, we relax the conditions \(\eta (x_j) = {{\,\mathrm{sign}\,}}(a_j)\) and \(\nabla \eta (x_j) = 0\), and replace them with the first equation above.

  2. 2.

    Second, correct the approximate certificate to obtain a function \({\hat{\eta }}\in {{\,\mathrm{Im}\,}}(\varPhi ^*)\) such that:

    $$\begin{aligned} {\left\{ \begin{array}{ll} {\hat{\eta }}(x_j) = {{\,\mathrm{sign}\,}}(a_j) \quad \text {and} \quad \nabla {\hat{\eta }}(x_j) = 0 &{}\text {for all }j=1,\ldots , s \\ \left|{\hat{\eta }}(x)\right| \leqslant 1-\frac{3{\bar{\varepsilon }}_0}{16} &{} \text {for all }x\in {\mathcal {G}}^{\mathrm {far}}\\ \left\| {\overline{{{\,\mathrm{sign}\,}}(a_j)}\mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x) - K^{(02)}(x_j,x)} \right\| _x \leqslant \frac{15{\bar{\varepsilon }}_2}{128} &{}\text {for all }j=1,\ldots , s, x\in {\mathcal {G}}^{\mathrm {near}}_j \end{array}\right. } \end{aligned}$$
    (49)

    That is, \({\hat{\eta }}\) satisfies all the properties we want, but only on the finite nets \({\mathcal {G}}^{\mathrm {far}}, {\mathcal {G}}^{\mathrm {near}}_j\).

  3. 3.

    Third, bound the norm of the \(p\in {\mathbb {C}}^m\) corresponding to \({\hat{\eta }}= \varPhi ^* p\).

  4. 4.

    Then, use Assumption 2 on the feature functions and the bound on \(\left\| p \right\| \) to show that, in fact, the \({\hat{\eta }}\) constructed above satisfies:

    $$\begin{aligned} {\left\{ \begin{array}{ll} {\hat{\eta }}(x_j) = {{\,\mathrm{sign}\,}}(a_j) \quad \text {and} \quad \nabla {\hat{\eta }}(x_j) = 0 &{}\text {for all }j=1,\ldots , s \\ \left|{\hat{\eta }}(x)\right| \leqslant 1-\frac{{\bar{\varepsilon }}_0}{8} &{} \text {for all }x\in {\mathcal {X}}^{\mathrm {far}}\\ \left\| {\overline{{{\,\mathrm{sign}\,}}(a_j)}\mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x) - K^{(02)}(x_j,x)} \right\| _x \leqslant \frac{{\bar{\varepsilon }}_2}{8} &{}\text {for all }j=1,\ldots , s, x\in {\mathcal {X}}^{\mathrm {near}}_j \end{array}\right. } \end{aligned}$$
    (50)

    which, by Lemma 2, will imply that \({\hat{\eta }}\) is non-degenerate with the desired constants, concluding the proof of Theorem 4.

  5.

    In a fifth and final step, prove the existence of the s additional certificates \({\hat{\eta }}_j\) as they appear in Proposition 2. Combined with the existence of \({\hat{\eta }}\) and with Propositions 1 and 2, this concludes the proof of Theorem 3.

We dedicate a subsection to each step of the proof. Before that, we start in the next subsection with some technical preliminaries and notations.

6.2 Preliminaries

Let us introduce some notation and establish a few technical bounds that will be useful later. Recall the definitions of the sign vector \({\mathbf {u}}_s\) from (29), of \(\gamma \), \(\varUpsilon \) and \({\mathbf {f}}\) from (31), (30) and (32), and of \(D_{{\mathfrak {g}}}\) from (33). We have the following additional bounds, whose proof, given in Appendix A.2, follows arguments similar to those of Theorem 2.

Lemma 3

Under Assumption 1, \(\varUpsilon \) and \({\mathbf {f}}\) defined as in (30) and (32) satisfy the following.

  (i)

    \(\varUpsilon \) is invertible and satisfies

    $$\begin{aligned} \left\| \mathrm {Id}- D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}} \right\| _2\leqslant \frac{1}{2} \quad \text {and} \quad \left\| \mathrm {Id}-D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}} \right\| _{\mathrm {Block}}\leqslant \frac{1}{2}. \end{aligned}$$
    (51)
  (ii)

    For any vector \(q \in {\mathbb {C}}^{s(d+1)}\) and any \(x\in {\mathcal {X}}^{\mathrm {far}}\), we have

    $$\begin{aligned} \left\| D_{{\mathfrak {g}}}{\mathbf {f}}(x) \right\| _2 \leqslant B_0 \quad \text {and} \quad \left|q^\top {\mathbf {f}}(x)\right| \leqslant B_0 \left\| D_{{\mathfrak {g}}}^{-1} q \right\| _{\mathrm {Block}} \end{aligned}$$
    (52)
  (iii)

    For any vector \(q \in {\mathbb {C}}^{s(d+1)}\) and any \(x \in {\mathcal {X}}^{\mathrm {near}}\), we have the bound:

    $$\begin{aligned}&\left\| \mathrm {D}_{2}\left[ q^\top {\mathbf {f}}(.)\right] (x) \right\| _x \leqslant \left\| D_{{\mathfrak {g}}}^{-1} q \right\| B_2 \nonumber \\&\quad \text {and} \quad \left\| \mathrm {D}_{2}\left[ q^\top {\mathbf {f}}(.)\right] (x) \right\| _x \leqslant \left\| D_{{\mathfrak {g}}}^{-1} q \right\| _{\mathrm {Block}}B_2 \qquad \end{aligned}$$
    (53)

Now, for \(\omega _1,\ldots ,\omega _m\), denote the empirical versions of \(\varUpsilon \) and \({\mathbf {f}}\) by:

$$\begin{aligned} {\hat{\varUpsilon }}{\mathop {=}\limits ^{\text {def.}}}\frac{1}{m} \sum _{k=1}^m {\gamma (\omega _k)\gamma (\omega _k)^*} \quad \text {and} \quad \hat{{\mathbf {f}}}(x) {\mathop {=}\limits ^{\text {def.}}}\frac{1}{m} \sum _{k=1}^m \overline{\gamma (\omega _k)} \varphi _{\omega _k}(x). \end{aligned}$$
(54)
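To fix ideas, the following short Python sketch (not part of the proof) shows how the empirical quantities in (54) could be assembled in a toy one-dimensional Fourier setting. The helper `gamma` below simply stacks the values and first derivatives of \(\varphi _\omega \) at the spikes; it is a simplified stand-in for the definition of \(\gamma \) in (31) (in particular, the scaling by \(D_{{\mathfrak {g}}}\) is omitted), and all numerical parameters are arbitrary.

```python
import numpy as np

# Toy setting (illustration only): 1-D Fourier features phi_omega(x) = exp(i*omega*x),
# s = 3 spikes, d = 1. "gamma" is a simplified stand-in for (31): it stacks the values
# and first derivatives of phi_omega at the spikes (no D_g scaling).
rng = np.random.default_rng(0)
s, m = 3, 200
spikes = np.array([0.5, 2.5, 4.5])            # x_1, ..., x_s in [0, 2*pi)
omegas = rng.integers(-20, 21, size=m)        # sampled frequencies omega_1, ..., omega_m

def phi(omega, x):
    return np.exp(1j * omega * x)

def gamma(omega):
    vals = phi(omega, spikes)                 # phi_omega(x_1), ..., phi_omega(x_s)
    grads = 1j * omega * vals                 # phi_omega'(x_1), ..., phi_omega'(x_s)
    return np.concatenate([vals, grads])      # length s*(d+1) = 2s

# Empirical versions (54).
G = np.stack([gamma(w) for w in omegas])      # rows gamma(omega_k), shape (m, 2s)
hat_Upsilon = G.T @ G.conj() / m              # (1/m) sum_k gamma(w_k) gamma(w_k)^*

def hat_f(x):
    # hat_f(x) = (1/m) sum_k conj(gamma(w_k)) * phi_{w_k}(x)
    return (G.conj() * phi(omegas, x)[:, None]).sum(axis=0) / m

print(hat_Upsilon.shape, hat_f(1.0).shape)    # (2s, 2s) and (2s,)
```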

Recall the definition of \(L_j(\omega )\) and \({{\bar{L}}}_j\) in Assumption 2. Let the event \({{{\bar{E}}}}\) be defined by

$$\begin{aligned} {{{\bar{E}}}}{\mathop {=}\limits ^{\text {def.}}}\bigcap _{k=1}^m E_{\omega _k} \quad \text {where} \quad E_\omega {\mathop {=}\limits ^{\text {def.}}} \left\{ L_j(\omega )\leqslant {{\bar{L}}}_j \;;\; j=0,1,2,3 \right\} . \end{aligned}$$
(55)

Since by Assumption 2, Eq. (46), we have \({\mathbb {P}}({{{\bar{E}}}}^c) \leqslant \rho \), a non-degenerate dual certificate can be constructed with probability at least \((1-\rho )^2 \geqslant 1-2\rho \) provided that, conditional on event \({{{\bar{E}}}}\), a non-degenerate dual certificate can be constructed with probability at least \(1-\rho \).

We therefore assume for the rest of this proof that event \({{{\bar{E}}}}\) holds and establish the probability conditional on \({{{\bar{E}}}}\) that a non-degenerate dual certificate exists. To control this probability, we will need to control the deviation of \(\hat{{\mathbf {f}}}\) and \({\hat{\varUpsilon }}\) from their conditional expectations \({\mathbf {f}}_{{{\bar{E}}}}= {\mathbb {E}}_{{{\bar{E}}}}[\hat{{\mathbf {f}}}]\) and \(\varUpsilon _{{{\bar{E}}}}{\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}_{{{\bar{E}}}}[{\hat{\varUpsilon }}]\), where we denote \({\mathbb {E}}_{{{\bar{E}}}}[\cdot ] {\mathop {=}\limits ^{\text {def.}}}{\mathbb {E}}[\cdot | {{{\bar{E}}}}]\). The following lemma, proved in Appendix A.3, bounds the deviations between these.

Lemma 4

Under Assumptions 1 and 2, we have:

  (i)

    \(\left\| D_{{\mathfrak {g}}}(\varUpsilon - \varUpsilon _{{{\bar{E}}}})D_{{\mathfrak {g}}} \right\| _2\leqslant 4\frac{(s+1)\min ({\bar{\varepsilon }}_0, {\bar{\varepsilon }}_2)}{m}\) and \(\left\| D_{{\mathfrak {g}}}(\varUpsilon - \varUpsilon _{{{\bar{E}}}})D_{{\mathfrak {g}}} \right\| _{\mathrm {Block}}\leqslant 8\frac{(s+1)\min ({\bar{\varepsilon }}_0, {\bar{\varepsilon }}_2)}{m}\)

  (ii)

    for all \(x \in {\mathcal {X}}^{\mathrm {far}}\), \(\left\| D_{{\mathfrak {g}}}({\mathbf {f}}(x) - {\mathbf {f}}_{{{\bar{E}}}}(x)) \right\| _2 \leqslant \frac{(B_0 + 2 \sqrt{s}) \min ({\bar{\varepsilon }}_0, {\bar{\varepsilon }}_2)}{m}\)

  (iii)

    for all \(x \in {\mathcal {X}}^{\mathrm {near}}\), \(\sup _{\left\| q \right\| _2\leqslant 1}\left\| \mathrm {D}_{2}\left[ ({\mathbf {f}}- {\mathbf {f}}_{{{\bar{E}}}})^\top D_{{\mathfrak {g}}}q\right] (x) \right\| _x \leqslant \frac{(B_2 + 2 \sqrt{s}) \min ({\bar{\varepsilon }}_0, {\bar{\varepsilon }}_2)}{m}\)

6.3 Step 1: Construction of an Approximate Certificate with the Golfing Scheme

The first step is to construct an approximate certificate \(\eta ^{\mathrm {app}}\) using the so-called golfing scheme. The golfing scheme was introduced in [39] and successfully used in compressed sensing, for instance, in [16]. It can be intuitively explained as follows. Recall that the certificate constructed in Theorem 2 in the case \(m\rightarrow \infty \) is of the form \(\eta = (\varUpsilon ^{-1} {\mathbf {u}})^{\top }{\mathbf {f}}\). It is therefore natural to try to show directly that \({\hat{\eta }}{\mathop {=}\limits ^{\text {def.}}}({\hat{\varUpsilon }}^{-1} {\mathbf {u}})^{\top }\hat{{\mathbf {f}}}\) is also non-degenerate by bounding the deviation between \(\eta \) and \({\hat{\eta }}\). This is the strategy adopted by Tang et al. [55] and in our previous work [44]. However, as mentioned before, this proof technique requires the random signs assumption; otherwise, a sub-optimal bound on m is obtained. To address this, the golfing scheme starts by writing the following Neumann expansion: assuming that \({\hat{\varUpsilon }}\) is invertible, we have

$$\begin{aligned} \begin{aligned} {\hat{\eta }}&= ({\hat{\varUpsilon }}^{-1} {\mathbf {u}})^{\top }\hat{{\mathbf {f}}}= ( \varUpsilon ^{-1} ({\hat{\varUpsilon }}\varUpsilon ^{-1})^{-1} {\mathbf {u}})^{\top }\hat{{\mathbf {f}}}\\&= \sum _{\ell =1}^\infty \left( \varUpsilon ^{-1} \left( \mathrm {Id}- {\hat{\varUpsilon }}\varUpsilon ^{-1} \right) ^{\ell -1} {\mathbf {u}} \right) ^\top \hat{{\mathbf {f}}}= \sum _{\ell =1}^\infty (\varUpsilon ^{-1} q_{\ell -1})^\top \hat{{\mathbf {f}}}\end{aligned} \end{aligned}$$
(56)

where \(q_{\ell } {\mathop {=}\limits ^{\text {def.}}}\left( \mathrm {Id}- {\hat{\varUpsilon }}\varUpsilon ^{-1} \right) q_{\ell -1} \) and \(q_0 {\mathop {=}\limits ^{\text {def.}}}{\mathbf {u}}\). By truncating the sum above after a finite number of terms, one effectively obtains an approximate certificate that must later be corrected. However, there is an additional difficulty in analyzing the sum: for each summand, \(\hat{{\mathbf {f}}}\) and \(\varUpsilon ^{-1} q_{\ell -1}\) are random variables which are not mutually independent. The idea of [16, 39] is to decouple the random variables by partitioning the indices \(\{1,\ldots , m\}\) into \(J\) disjoint blocks \({\mathcal {B}}_\ell \) of size \(m_\ell \) with \(\sum _{\ell =1}^J m_\ell = m\), for some \(J\) and \(m_\ell \) that are adjusted below. Denote by \({\hat{\varUpsilon }}_\ell \) and \(\hat{{\mathbf {f}}}_\ell \) the empirical versions of \(\varUpsilon \) and \({\mathbf {f}}\) over the \(m_\ell \) random variables included in \({\mathcal {B}}_\ell \), that is:

$$\begin{aligned} {\hat{\varUpsilon }}_\ell {\mathop {=}\limits ^{\text {def.}}}\frac{1}{m_\ell } \sum _{k\in {\mathcal {B}}_\ell } {\gamma (\omega _k)\gamma (\omega _k)^*} \quad \text {and} \quad \hat{{\mathbf {f}}}_\ell (x) {\mathop {=}\limits ^{\text {def.}}}\frac{1}{m_\ell } \sum _{k\in {\mathcal {B}}_\ell } \overline{\gamma (\omega _k)} \varphi _{\omega _k}(x). \end{aligned}$$

Then, instead of (56), we consider

$$\begin{aligned} \eta ^{\mathrm {app}}= \sum _{\ell =1}^{J} (\varUpsilon ^{-1} q_{\ell -1})^{\top }\hat{{\mathbf {f}}}_\ell \end{aligned}$$

where \(q_{\ell } {\mathop {=}\limits ^{\text {def.}}}\left( \mathrm {Id}- {\hat{\varUpsilon }}_\ell \varUpsilon ^{-1} \right) q_{\ell -1}\), \(q_0 {\mathop {=}\limits ^{\text {def.}}}{\mathbf {u}}\). Note that this can be rewritten as:

$$\begin{aligned} q_\ell = {\mathbf {u}}_s - \sum _{p=1}^\ell {\hat{\varUpsilon }}_p \varUpsilon ^{-1} q_{p-1} \end{aligned}$$
(57)

Now, the idea is that one can control each term \((\varUpsilon ^{-1} q_{\ell -1})^{\top }\hat{{\mathbf {f}}}_\ell \) conditionally on \(q_{\ell -1}\), and for appropriate choices of the block sizes \(m_\ell \), \(\eta ^{\mathrm {app}}\) can be shown to be approximately non-degenerate with high probability. Each additional term in the sum brings the certificate “closer” to its desired properties, hence the name “golfing” scheme.
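To make the recursion concrete, here is a minimal numerical sketch of the golfing iteration in the same toy one-dimensional Fourier setting as above (hypothetical helpers `gamma` and `phi`, arbitrary block sizes, and a limit matrix \(\varUpsilon \) approximated by a large sample); it is an illustration only and does not reproduce the parameter choices made in the proof.

```python
import numpy as np

# Golfing recursion (illustration only):
#   q_0 = u,   q_ell = (Id - hatUpsilon_ell Upsilon^{-1}) q_{ell-1},
#   eta_app(x) = sum_ell (Upsilon^{-1} q_{ell-1})^T hat_f_ell(x).
rng = np.random.default_rng(1)
s, d = 3, 1
spikes = np.array([0.5, 2.5, 4.5])

def phi(omega, x):
    return np.exp(1j * omega * x)

def gamma(omega):                              # simplified stand-in for (31)
    vals = phi(omega, spikes)
    return np.concatenate([vals, 1j * omega * vals])

# "Limit" matrix Upsilon, approximated here with a large sample (illustration only).
big = rng.integers(-20, 21, size=20000)
Gb = np.stack([gamma(w) for w in big])
Upsilon = Gb.T @ Gb.conj() / len(big)

u = np.concatenate([np.ones(s), np.zeros(s * d)])     # sign vector (all signs = +1)
blocks = [rng.integers(-20, 21, size=mk) for mk in (200, 200, 150, 150)]  # J = 4 blocks

q = u.astype(complex)
coeffs = []                                    # stores (block, Upsilon^{-1} q_{ell-1})
for omegas_ell in blocks:
    G = np.stack([gamma(w) for w in omegas_ell])
    hatUps_ell = G.T @ G.conj() / len(omegas_ell)
    v = np.linalg.solve(Upsilon, q)            # Upsilon^{-1} q_{ell-1}
    coeffs.append((omegas_ell, v))
    q = q - hatUps_ell @ v                     # q_ell
    print("||q_ell|| =", np.linalg.norm(q))    # typically shrinks as ell grows

def eta_app(x):
    val = 0j
    for omegas_ell, v in coeffs:
        hat_f_ell = sum(gamma(w).conj() * phi(w, x) for w in omegas_ell) / len(omegas_ell)
        val += v @ hat_f_ell                   # (Upsilon^{-1} q_{ell-1})^T hat_f_ell(x)
    return val

print("eta_app(x_1) =", eta_app(spikes[0]))    # typically close to 1 after contraction
```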

6.3.1 Parameters and Intermediate Assumptions

We set the error \(c_0\) that appears in (48) as

$$\begin{aligned} c_0 = C_0\min \left( \frac{{\bar{\varepsilon }}_0}{B_0},\frac{{\bar{\varepsilon }}_2}{B_2}, 1 \right) \end{aligned}$$

for some universal constant \(C_0\). We define the parameters of our golfing scheme as follows:

We now formulate an intermediate set of assumptions, and proceed to show that: first, they imply the desired properties on \(\eta ^{\mathrm {app}}\), and second, they are valid with high probability. For \(1\leqslant \ell \leqslant J\), we define:

(\(\hbox {I}_\ell \)):

\(\left\| D_{{\mathfrak {g}}}q_\ell \right\| _{\mathrm {Block}} \leqslant c_\ell \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}}\),

(\(\hbox {II}_\ell \)):

For all \(x\in {\mathcal {G}}^{\mathrm {far}}\), \( \left|(\varUpsilon ^{-1} q_{\ell -1})^\top \hat{{\mathbf {f}}}_\ell (x)\right| \leqslant t_\ell \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}}\),

(\(\hbox {III}_\ell \)):

If \(\ell =1\): for all \(j=1,\ldots ,s\), \(x\in {\mathcal {G}}^{\mathrm {near}}_j\), \(\left\| \overline{ {{\,\mathrm{sign}\,}}(a_j)} \mathrm {D}_{2}\left[ (\varUpsilon ^{-1} {\mathbf {u}}_s)^\top \hat{{\mathbf {f}}}_1\right] (x) - K^{(02)}(x_j, x) \right\| _x \leqslant b_1\); and if \(\ell \geqslant 2\): for all \(x\in {\mathcal {G}}^{\mathrm {near}}\), \(\left\| { \mathrm {D}_{2}\left[ (\varUpsilon ^{-1} q_{\ell -1})^\top \hat{{\mathbf {f}}}_\ell \right] (x)} \right\| _x \leqslant b_\ell \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}}\).

Let us now assume that (\(\hbox {I}_\ell \)), (\(\hbox {II}_\ell \)) and (\(\hbox {III}_\ell \)) hold for all \(\ell \) and show that \(\eta ^{\mathrm {app}}\) then satisfies the desired properties. We define \(\varPsi : {\mathscr {C}}^{}({\mathcal {X}}) \rightarrow {\mathbb {C}}^{s(d+1)}\) by

$$\begin{aligned} \varPsi f {\mathop {=}\limits ^{\text {def.}}}\left[ f(x_1), \ldots , f(x_s), \nabla {f}(x_1)^\top , \ldots , \nabla {f}(x_s)^\top \right] ^\top . \end{aligned}$$
(58)

In words, \(\varPsi \) evaluates a function and its first derivative at the points \(\{x_j\}_{j=1}^s\). Note that for any vector \(v \in {\mathbb {C}}^{s(d+1)}\), by definition we have \(\varPsi (v^\top \hat{{\mathbf {f}}}_\ell ) = {\hat{\varUpsilon }}_\ell v\). Using this, we have

$$\begin{aligned}&\sqrt{\sum _{j=1}^s \left|\eta ^{\mathrm {app}}(x_j) - {{\,\mathrm{sign}\,}}(a_j)\right|^2 + \left\| \mathrm {D}_{1}\left[ \eta ^{\mathrm {app}}\right] (x_j) \right\| _{{x_j}}^2} \\&\quad = \left\| {\mathbf {u}}_s - D_{{\mathfrak {g}}}\varPsi \eta ^{\mathrm {app}} \right\| \leqslant \sqrt{2s} \left\| D_{{\mathfrak {g}}}\left( {\mathbf {u}}_s - \varPsi \eta ^{\mathrm {app}} \right) \right\| _{\mathrm {Block}} \\&\quad =\sqrt{2s} \left\| D_{{\mathfrak {g}}}\left( {\mathbf {u}}_s - \varPsi \left( \sum _{\ell =1}^{J} (\varUpsilon ^{-1} q_{\ell -1})^{\top }\hat{{\mathbf {f}}}_\ell \right) \right) \right\| _{\mathrm {Block}} \\&\quad =\sqrt{2s} \left\| D_{{\mathfrak {g}}}\left( {\mathbf {u}}_s - \sum _{\ell =1}^{J} {\hat{\varUpsilon }}_\ell \varUpsilon ^{-1} q_{\ell -1} \right) \right\| _{\mathrm {Block}}\\&\quad {\mathop {=}\limits ^{(57)}} \sqrt{2s}\left\| D_{{\mathfrak {g}}}q_J \right\| _{\mathrm {Block}} \leqslant \sqrt{s} \prod _{\ell =1}^Jc_\ell {\mathop {\leqslant }\limits ^{\text {(I)}}} \frac{\sqrt{2s}c_0^{J}}{16 \log (s)} \leqslant c_0\, , \end{aligned}$$

since by adjusting \(C_0\) we can have \(c_0 \leqslant \left( \frac{1}{\sqrt{6}} \right) ^{\frac{1}{\log (3)-1}} \leqslant \left( \frac{1}{\sqrt{2s}} \right) ^{\frac{1}{\log (s)-1}}\), where the last inequality is valid for all s and follows from an elementary study of the function involved. This proves the first part of (48). Next, for all \(x\in {\mathcal {G}}^{\mathrm {far}}\),

$$\begin{aligned} \left|\eta ^{\mathrm {app}}(x)\right|&\leqslant \sum _{\ell =1}^J\left|(\varUpsilon ^{-1} q_{\ell -1})^\top \hat{{\mathbf {f}}}_\ell (x)\right| {\mathop {\leqslant }\limits ^{\text {(II)}}} \sum _{\ell =1}^Jt_\ell \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}} {\mathop {\leqslant }\limits ^{\text {(I)}}} \sum _{\ell =1}^Jt_\ell \prod _{p=1}^{\ell -1} c_p\\&\leqslant 1-\frac{{\bar{\varepsilon }}_0}{2} + \frac{{\bar{\varepsilon }}_0}{8} + B_0 c_0 + \frac{B_0}{4} \sum _{\ell =2}^{J-1} c_0^\ell \leqslant 1- \frac{{\bar{\varepsilon }}_0}{2}\\&\quad + \frac{{\bar{\varepsilon }}_0}{8} + B_0 c_0 + \frac{B_0c_0^2}{4(1-c_0)} \leqslant 1-\frac{{\bar{\varepsilon }}_0}{4}. \end{aligned}$$

since by our choice of \(c_0\) and adjusting \(C_0\), \( B_0 c_0 + \frac{B_0c_0^2}{4(1-c_0)}\leqslant \frac{{\bar{\varepsilon }}_0}{8}\). Similarly, for all \(x\in {\mathcal {G}}^{\mathrm {near}}_j\),

$$\begin{aligned}&\left\| {\overline{{{\,\mathrm{sign}\,}}(a_j)} \mathrm {D}_{2}\left[ \eta ^{\mathrm {app}}\right] (x) - K^{(02)}(x_j,x)} \right\| _x\\&\quad \leqslant \left\| {\overline{{{\,\mathrm{sign}\,}}(a_j)} \mathrm {D}_{2}\left[ (\varUpsilon ^{-1} {\mathbf {u}}_s)^\top \hat{{\mathbf {f}}}_1\right] (x)- K^{(02)}(x_j,x) } \right\| _x \\&\quad + \sum _{\ell =2}^J\left\| {\mathrm {D}_{2}\left[ (\varUpsilon ^{-1} q_{\ell -1})^\top \hat{{\mathbf {f}}}_\ell \right] (x)} \right\| _x\\&\quad \leqslant \frac{3{\bar{\varepsilon }}_2}{32} + \sum _{\ell =2}^Jb_\ell \prod _{p=1}^{\ell -1} c_p = \frac{3{\bar{\varepsilon }}_2}{32} + B_2 c_0 +\frac{ B_2}{4} \sum _{\ell =2}^{J-1}c_0^\ell \\&\quad \leqslant \frac{3{\bar{\varepsilon }}_2}{32} + B_2 c_0 + \frac{B_2 c_0^2 }{4(1-c_0)} \leqslant \frac{7{\bar{\varepsilon }}_2}{64} \end{aligned}$$

since similarly, \( B_2 c_0 + \frac{B_2 c_0^2 }{4(1-c_0)} \leqslant \frac{{\bar{\varepsilon }}_2}{64}\). Hence, (\(\hbox {I}_\ell \)), (\(\hbox {II}_\ell \)) and (\(\hbox {III}_\ell \)) indeed imply (48). Next, we derive a condition on m under which they hold with probability \(1-\rho \) (conditionally on the event \({{{\bar{E}}}}\)).

6.3.2 Probability of Successful Construction

Let us now prove that (\(\hbox {I}_\ell \)), (\(\hbox {II}_\ell \)) and (\(\hbox {III}_\ell \)) are indeed valid with the desired probability. Let \(p_1(\ell )\), \(p_2(\ell )\) and \(p_3(\ell )\) be the probabilities, conditional on the event \({{{\bar{E}}}}\), that (\(\hbox {I}_\ell \)), (\(\hbox {II}_\ell \)) and (\(\hbox {III}_\ell \)) fail, respectively. By a union bound, our goal is to derive a bound on m such that \(\sum _{k=1}^3 \sum _{\ell =1}^Jp_k(\ell ) \leqslant \rho \). We do so by applying variants of Bernstein’s concentration inequality, which are all detailed in Appendix B. As we mentioned before, a crucial feature of the golfing scheme is that, at each step, \(q_{\ell -1}\) and \(\hat{{\mathbf {f}}}_\ell \) are mutually independent, so that we can reason conditionally on \(q_{\ell -1}\) and treat it as a fixed vector when bounding the probabilities with respect to \(\hat{{\mathbf {f}}}_\ell \) and \({\hat{\varUpsilon }}_\ell \).

We define \({\bar{q}}_\ell {\mathop {=}\limits ^{\text {def.}}}D_{{\mathfrak {g}}}^{-1} \varUpsilon ^{-1} q_\ell \) for short. To bound \(p_1(\ell )\), we first observe the recurrence relation \(D_{{\mathfrak {g}}}q_\ell = D_{{\mathfrak {g}}}(\mathrm {Id}- {\hat{\varUpsilon }}_\ell \varUpsilon ^{-1}) q_{\ell -1} = D_{{\mathfrak {g}}}(\varUpsilon - {\hat{\varUpsilon }}_\ell ) D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1}\). Moreover, by Lemma 3 we have \(\left\| D_{{\mathfrak {g}}}^{-1}\varUpsilon ^{-1}D_{{\mathfrak {g}}}^{-1} \right\| _{\mathrm {Block}} \leqslant \frac{1}{1- \left\| \mathrm {Id}- D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}} \right\| _{\mathrm {Block}}} \leqslant 2\), and therefore \( \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}} \geqslant \frac{1}{\left\| D_{{\mathfrak {g}}}^{-1}\varUpsilon ^{-1}D_{{\mathfrak {g}}}^{-1} \right\| _{\mathrm {Block}}} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \geqslant \frac{1}{2} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \). Finally, by Lemma 4 and our assumptions we have in particular that \(\left\| D_{{\mathfrak {g}}}(\varUpsilon _{{{\bar{E}}}}- \varUpsilon )D_{{\mathfrak {g}}} \right\| _{\mathrm {Block}} \leqslant \min _\ell c_\ell /4\). Therefore,

$$\begin{aligned} p_1(\ell )&= {\mathbb {P}}_{{{\bar{E}}}}\left( \left\| D_{{\mathfrak {g}}}q_\ell \right\| _{\mathrm {Block}} \geqslant c_\ell \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}} \right) \leqslant {\mathbb {P}}_{{{\bar{E}}}}\left( \left\| D_{{\mathfrak {g}}}(\varUpsilon - {\hat{\varUpsilon }}_\ell ) D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \right. \\&\left. \geqslant \frac{c_\ell }{2} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \right) \\&\leqslant {\mathbb {P}}_{{{\bar{E}}}}\left( \left\| D_{{\mathfrak {g}}}(\varUpsilon _{{{\bar{E}}}}- {\hat{\varUpsilon }}_\ell ) D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \geqslant \frac{c_\ell }{4} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \right) \end{aligned}$$

Finally, applying Lemma 14, for some \(\rho _\ell \) that we adjust later we obtain that

$$\begin{aligned} {\mathbb {P}}_{{{\bar{E}}}}\left( \left\| D_{{\mathfrak {g}}}(\varUpsilon _{{{\bar{E}}}}- {\hat{\varUpsilon }}_\ell ) D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \geqslant \frac{c_\ell }{4} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \right) \leqslant \rho _\ell \end{aligned}$$

if \(m_\ell \gtrsim \frac{s {{\bar{L}}}_{01}^2}{c_\ell ^2} \log \left( \frac{s}{\rho _\ell } \right) \).

For \(p_2(\ell )\), we have

$$\begin{aligned} \left|(\varUpsilon ^{-1} q_{\ell -1})^\top \hat{{\mathbf {f}}}_\ell (x)\right| = \left|({\bar{q}}_{\ell -1})^\top D_{{\mathfrak {g}}}\hat{{\mathbf {f}}}_\ell (x)\right|&\leqslant \left|({\bar{q}}_{\ell -1})^\top D_{{\mathfrak {g}}}(\hat{{\mathbf {f}}}_\ell (x) - {\mathbf {f}}(x))\right| \\&\quad + \left|({\bar{q}}_{\ell -1})^\top D_{{\mathfrak {g}}}{\mathbf {f}}(x)\right|\\&\leqslant \left|({\bar{q}}_{\ell -1})^\top D_{{\mathfrak {g}}}(\hat{{\mathbf {f}}}_\ell (x) - {\mathbf {f}}(x))\right| \\&\quad + {\left\{ \begin{array}{ll} B_0 \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} &{}\ell \geqslant 2\\ 1-\frac{{\bar{\varepsilon }}_0}{2} &{}\ell =1 \end{array}\right. } \end{aligned}$$

by Lemma 3 for the case \(\ell \geqslant 2\) and Theorem 2 for the case \(\ell =1\). Hence,

$$\begin{aligned} p_2(\ell )&= {\mathbb {P}}_{{{\bar{E}}}}\left( \exists x\in {\mathcal {G}}^{\mathrm {far}}, \; \left|(\varUpsilon ^{-1} q_{\ell -1})^\top \hat{{\mathbf {f}}}_\ell (x)\right|> t_\ell \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}} \right) \\&\leqslant {\mathbb {P}}_{{{\bar{E}}}}\left( \exists x\in {\mathcal {G}}^{\mathrm {far}}, \; \left|(\varUpsilon ^{-1} q_{\ell -1})^\top \hat{{\mathbf {f}}}_\ell (x)\right|> \frac{t_\ell }{2} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \right) \\&\leqslant {\mathbb {P}}_{{{\bar{E}}}}\left( \exists x\in {\mathcal {G}}^{\mathrm {far}}, \; \left|({\bar{q}}_{\ell -1})^\top D_{{\mathfrak {g}}}(\hat{{\mathbf {f}}}_\ell (x) - {\mathbf {f}}(x))\right| > {\tilde{t}}_\ell \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \right) \\&\quad \quad \text {where} \quad {\tilde{t}}_\ell {\mathop {=}\limits ^{\text {def.}}}{\left\{ \begin{array}{ll} \left( \frac{t_\ell }{2} - B_0 \right) &{}\ell \geqslant 2\\ \frac{{\bar{\varepsilon }}_0}{16} &{} \ell =1 \end{array}\right. }. \end{aligned}$$

Since by Lemma 4 we have in particular

$$\begin{aligned}&\left|({\bar{q}}_{\ell -1})^\top D_{{\mathfrak {g}}}({\mathbf {f}}_{{{\bar{E}}}}(x) - {\mathbf {f}}(x))\right|\\&\quad \leqslant \sqrt{2s}\left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \left\| D_{{\mathfrak {g}}}({\mathbf {f}}_{{{\bar{E}}}}(x) - {\mathbf {f}}(x)) \right\| \leqslant \frac{{\tilde{t}}_\ell }{2}\left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}}\, , \end{aligned}$$

by Lemma 8 and a union bound we have

$$\begin{aligned} p_2(\ell )\leqslant {\mathbb {P}}_{{{\bar{E}}}}\left( \exists x\in {\mathcal {G}}^{\mathrm {far}}, \; \left|({\bar{q}}_{\ell -1})^\top D_{{\mathfrak {g}}}(\hat{{\mathbf {f}}}_\ell (x) - {\mathbf {f}}_{{{\bar{E}}}}(x))\right| > \frac{{\tilde{t}}_\ell }{2} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \right) \leqslant \rho _\ell \end{aligned}$$

provided that \(m_\ell \gtrsim s \left( \frac{{{\bar{L}}}_0^2}{{\tilde{t}}_\ell ^{2}} + \frac{{{\bar{L}}}_{01} {{\bar{L}}}_0}{{\tilde{t}}_\ell } \right) \log \left( \frac{\left|{\mathcal {G}}^{\mathrm {far}}\right|}{\rho _\ell } \right) \).

For \(p_3(\ell )\), fix j and let \(x \in {\mathcal {G}}^{\mathrm {near}}_j\) be arbitrary: in the case \(\ell \geqslant 2\), by Lemma 3,

$$\begin{aligned} \left\| \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1})^\top \hat{{\mathbf {f}}}_\ell \right] (x) \right\| _x&\leqslant \left\| \left( \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1})^\top (\hat{{\mathbf {f}}}_\ell - {\mathbf {f}})\right] (x) \right) \right\| _x + \left\| \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1})^\top {\mathbf {f}}\right] (x) \right\| _x\\&\leqslant \left\| \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1})^\top (\hat{{\mathbf {f}}}_\ell - {\mathbf {f}})\right] (x) \right\| _x + B_2 \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \end{aligned}$$

and for \(\ell =1\), by Theorem 2,

$$\begin{aligned}&\left\| {\overline{{{\,\mathrm{sign}\,}}(a_j)} \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{0})^\top \hat{{\mathbf {f}}}_1\right] (x) - K^{(02)}(x_j,x)} \right\| _x\\&\quad \leqslant \left\| {\overline{{{\,\mathrm{sign}\,}}(a_j)} \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{0})^\top {\mathbf {f}}\right] (x) - K^{(02)}(x_j,x)} \right\| _x + \left\| \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{0})^\top (\hat{{\mathbf {f}}}_1 - {\mathbf {f}})\right] (x) \right\| _x \\&\quad \leqslant \frac{{\bar{\varepsilon }}_2}{16} + \left\| \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{0})^\top (\hat{{\mathbf {f}}}_1 - {\mathbf {f}})\right] (x) \right\| _x. \end{aligned}$$

Therefore, by the same computation as before,

$$\begin{aligned}&p_3(\ell ) \leqslant {\mathbb {P}}_{{{\bar{E}}}}\left( \exists x\in {\mathcal {G}}^{\mathrm {near}}, \; \left\| \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1})^\top (\hat{{\mathbf {f}}}_\ell - {\mathbf {f}})\right] (x) \right\| _x > {\tilde{b}}_\ell \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \right) ,\\&\quad \text { where } {\tilde{b}}_\ell {\mathop {=}\limits ^{\text {def.}}}{\left\{ \begin{array}{ll} \left( \frac{b_\ell }{2} - B_2 \right) &{}\ell \geqslant 2 \\ \frac{{\bar{\varepsilon }}_2}{64} &{}\ell =1. \end{array}\right. } \end{aligned}$$

Again using Lemma 4, we bound \(\left\| \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1})^\top ({\mathbf {f}}_{{{\bar{E}}}}- {\mathbf {f}})\right] (x) \right\| _x \leqslant \frac{{\tilde{b}}_\ell }{2} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}}\) and

$$\begin{aligned} p_3(\ell ) \leqslant {\mathbb {P}}_{{{\bar{E}}}}\left( \exists x\in {\mathcal {G}}^{\mathrm {near}}, \; \left\| \mathrm {D}_{2}\left[ (D_{{\mathfrak {g}}}{\bar{q}}_{\ell -1})^\top (\hat{{\mathbf {f}}}_\ell - {\mathbf {f}}_{{{\bar{E}}}})\right] (x) \right\| _x > \frac{{\tilde{b}}_\ell }{2} \left\| {\bar{q}}_{\ell -1} \right\| _{\mathrm {Block}} \right) \leqslant \rho _\ell \end{aligned}$$

by Lemma 10 and a union bound, provided that \(m_\ell \gtrsim s\left( \frac{{{\bar{L}}}_2^2}{{\tilde{b}}_\ell ^{2}} + \frac{{{\bar{L}}}_2{{\bar{L}}}_{01}}{{\tilde{b}}_\ell } \right) \log \left( \frac{\left|{\mathcal {G}}^{\mathrm {near}}\right|}{\rho _\ell } \right) \).

Choosing \(\rho _1 = \rho _2 = \rho /9\) and \(\rho _\ell = \rho /(9J)\) for \(\ell \geqslant 3\), recalling that obviously \({\bar{\varepsilon }}_r \leqslant B_r\) for \(r=0,2\) and denoting \(N_0 = \left|{\mathcal {G}}^{\mathrm {far}}\right|\) and \(N_2 = \left|{\mathcal {G}}^{\mathrm {near}}\right|\) for short, we have \( \sum _{k=1}^3 \sum _{\ell =1}^Jp_k(\ell ) \leqslant \rho \) provided that

$$\begin{aligned} m_1 = m_2&\gtrsim s \sum _{r=0,2} \left( {{\bar{L}}}_{01}^2 \frac{B_r^2}{{\bar{\varepsilon }}_r^2} \log (s) \log \left( \frac{s}{\rho } \right) + \left( \frac{{{\bar{L}}}_r^2}{{\bar{\varepsilon }}_r^2} + \frac{{{\bar{L}}}_{01}{{\bar{L}}}_r}{{\bar{\varepsilon }}_r} \right) \log \left( \frac{N_r}{\rho } \right) \right) \end{aligned}$$

and for \(\ell \geqslant 3\),

$$\begin{aligned} m_\ell \gtrsim s \sum _{r=0,2} \left( {{\bar{L}}}_{01}^2 \frac{B_r^2}{{\bar{\varepsilon }}_r^2} \log \left( \frac{s\log (s)}{\rho } \right) + \left( \frac{{{\bar{L}}}_r^2}{B_r^2 \log ^2(s)} + \frac{{{\bar{L}}}_{01}{{\bar{L}}}_r}{B_r \log (s)} \right) \log \left( \frac{N_r\log (s)}{\rho } \right) \right) \end{aligned}$$

Therefore, conditionally on \({{{\bar{E}}}}\), \(\eta ^{\mathrm {app}}\) can be constructed with probability at least \(1-\rho \) if \(m \gtrsim m_1+m_2+ Jm_3\), for which it is sufficient that

$$\begin{aligned} m&\gtrsim s \sum _{r=0,2} \left( {{\bar{L}}}_{01}^2 \frac{B_r^2}{{\bar{\varepsilon }}_r^2} \log (s) \log \left( \frac{s}{\rho } \right) + \left( \frac{{{\bar{L}}}_r^2}{{\bar{\varepsilon }}_r^2} + \frac{{{\bar{L}}}_{01}{{\bar{L}}}_r}{{\bar{\varepsilon }}_r} \right) \log \left( \frac{N_r \log (s)}{\rho } \right) \right) \end{aligned}$$
(59)

6.4 Step 2: Correcting the Approximate Certificate

The second step of our proof is to “correct” the previously constructed approximate certificate \(\eta ^{\mathrm {app}}\) to obtain a certificate \({\hat{\eta }}\in {{\,\mathrm{Im}\,}}(\varPhi ^*)\) satisfying (49). Recalling the definition (58) of \(\varPsi \), let \(e{\mathop {=}\limits ^{\text {def.}}}\varPsi \eta ^{\mathrm {app}}- {\mathbf {u}}_s\) be the error made by \(\eta ^{\mathrm {app}}\) and define

$$\begin{aligned} {\hat{\eta }}{\mathop {=}\limits ^{\text {def.}}}\eta ^{\mathrm {app}}- \eta ^{\mathrm {e}}, \quad \text {where} \quad \eta ^{\mathrm {e}}{\mathop {=}\limits ^{\text {def.}}}({\hat{\varUpsilon }}^{-1} e)^\top \hat{{\mathbf {f}}}. \end{aligned}$$

Then,

$$\begin{aligned} \varPsi {\hat{\eta }}= \varPsi \eta ^{\mathrm {app}}- e = {\mathbf {u}}_s\, , \end{aligned}$$

and we have indeed that \({\hat{\eta }}(x_i) = {{\,\mathrm{sign}\,}}(a_i)\) and \(\nabla {\hat{\eta }}(x_i) = 0\). We will now bound the deviations of \({\hat{\eta }}\) on the grids \({\mathcal {G}}^{\mathrm {far}}\) and \({\mathcal {G}}^{\mathrm {near}}\), using the fact that e has a small norm. Note that there is a subtlety here: e itself is random, and not independent of \(\hat{{\mathbf {f}}}\) or \({\hat{\varUpsilon }}\). So we must use “uniform” concentration bounds.
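Before turning to those bounds, the following short sketch (same hypothetical toy setting as above) illustrates the correction step numerically: whatever the approximate function is, subtracting \(({\hat{\varUpsilon }}^{-1} e)^\top \hat{{\mathbf {f}}}\) restores exact interpolation, since \(\varPsi (v^\top \hat{{\mathbf {f}}}) = {\hat{\varUpsilon }}v\). The coefficient vector `c` below is chosen at random purely for illustration; in the proof, \(\eta ^{\mathrm {app}}\) is produced by the golfing scheme.

```python
import numpy as np

# Correction step (illustration only):
#   e = Psi(eta_app) - u_s,   hat_eta = eta_app - (hatUpsilon^{-1} e)^T hat_f,
# which gives Psi(hat_eta) = u_s exactly.
rng = np.random.default_rng(2)
s, m = 3, 200
spikes = np.array([0.5, 2.5, 4.5])
omegas = rng.integers(-20, 21, size=m)
u_s = np.concatenate([np.ones(s), np.zeros(s)])       # all signs = +1, zero gradients

def gamma(w):                                          # equals Psi applied to phi_w here
    vals = np.exp(1j * w * spikes)
    return np.concatenate([vals, 1j * w * vals])

G = np.stack([gamma(w) for w in omegas])
hatUps = G.T @ G.conj() / m                            # empirical Upsilon, cf. (54)

def Psi_coeff(c):
    # For eta = c^T hat_f, Psi(eta) = hatUpsilon c (the identity used in the text).
    return hatUps @ c

# An arbitrary function eta_app = c^T hat_f (random c, for illustration only).
c = rng.standard_normal(2 * s) + 1j * rng.standard_normal(2 * s)
e = Psi_coeff(c) - u_s                                 # interpolation error of eta_app
c_hat = c - np.linalg.solve(hatUps, e)                 # coefficients of hat_eta

print(np.allclose(Psi_coeff(c_hat), u_s))              # True: exact interpolation
```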

Using Lemma 3 in combination with Lemma 4 and Lemma 12, we have that with probability at least \(1-\rho \):

$$\begin{aligned}&\left\| \mathrm {Id}- D_{{\mathfrak {g}}}{\hat{\varUpsilon }}D_{{\mathfrak {g}}} \right\| \leqslant \left\| \mathrm {Id}- D_{{\mathfrak {g}}}\varUpsilon D_{{\mathfrak {g}}} \right\| + \left\| D_{{\mathfrak {g}}}(\varUpsilon - \varUpsilon _{{{\bar{E}}}}) D_{{\mathfrak {g}}} \right\| \nonumber \\&\quad + \left\| D_{{\mathfrak {g}}}(\varUpsilon _{{{\bar{E}}}}- {\hat{\varUpsilon }}) D_{{\mathfrak {g}}} \right\| \leqslant \frac{1}{2} + \frac{1}{8} + \frac{1}{8} = \frac{3}{4} \end{aligned}$$
(60)

and therefore

$$\begin{aligned} \left\| D_{{\mathfrak {g}}}^{-1} {\hat{\varUpsilon }}^{-1} D_{{\mathfrak {g}}}^{-1} \right\| \leqslant 4\, . \end{aligned}$$
(61)

By Lemmas 3, 4 and 9, used respectively to bound each term in the following triangle inequality, and a union bound, with probability \(1-\rho \) we have

$$\begin{aligned} \forall x \in {\mathcal {G}}^{\mathrm {far}},\quad \left\| D_{{\mathfrak {g}}}\hat{{\mathbf {f}}}(x) \right\| \leqslant \left\| D_{{\mathfrak {g}}}{\mathbf {f}}(x) \right\| + \left\| D_{{\mathfrak {g}}}({\mathbf {f}}_{{{\bar{E}}}}(x)-{\mathbf {f}}(x)) \right\| + \left\| D_{{\mathfrak {g}}}(\hat{{\mathbf {f}}}(x)-{\mathbf {f}}_{{{\bar{E}}}}(x)) \right\| \leqslant 2 B_0 \end{aligned}$$

if \(m \gtrsim B_0^{-2} \log \left( \frac{\left|{\mathcal {G}}^{\mathrm {far}}\right|}{\rho } \right) (s {{\bar{L}}}_{01}^2 + \sqrt{s} {{\bar{L}}}_{01} {{\bar{L}}}_0)\). Then, for all \(x\in {\mathcal {G}}^{\mathrm {far}}\), since by adjusting \(C_0\) we can have in particular \(\left\| D_{{\mathfrak {g}}}e \right\| \leqslant \frac{c_0}{\sqrt{s}} \leqslant c_0 \leqslant \frac{1}{128} \min \left( \frac{{\bar{\varepsilon }}_2}{B_2} , \frac{{\bar{\varepsilon }}_0}{ B_0} \right) \), we have

$$\begin{aligned} \left|{\hat{\eta }}(x)\right| \leqslant \left|\eta ^{\mathrm {app}}(x)\right| + \left\| D_{{\mathfrak {g}}}\hat{{\mathbf {f}}}(x) \right\| \left\| D_{{\mathfrak {g}}}^{-1} {\hat{\varUpsilon }}^{-1} D_{{\mathfrak {g}}}^{-1} \right\| \left\| D_{{\mathfrak {g}}}e \right\| \leqslant 1-\frac{3{\bar{\varepsilon }}_0}{16}. \end{aligned}$$

Similarly, by Lemmas 3 and 4, with probability \(1-\rho \) we have for all \(x \in {\mathcal {G}}^{\mathrm {near}}\) and \(q \in {\mathbb {C}}^{s(d+1)}\),

$$\begin{aligned} \left\| \mathrm {D}_{2}\left[ \hat{{\mathbf {f}}}^\top D_{{\mathfrak {g}}}q\right] (x) \right\| _x&\leqslant \left\| \mathrm {D}_{2}\left[ {\mathbf {f}}^\top D_{{\mathfrak {g}}}q\right] (x) \right\| _x + \left\| \mathrm {D}_{2}\left[ ({\mathbf {f}}_{{{\bar{E}}}}- {\mathbf {f}})^\top D_{{\mathfrak {g}}}q\right] (x) \right\| _x \\&\quad + \left\| \mathrm {D}_{2}\left[ ({\mathbf {f}}_{{{\bar{E}}}}- \hat{{\mathbf {f}}})^\top D_{{\mathfrak {g}}}q\right] (x) \right\| _x \\&\leqslant (B_2 + B_2/2) \left\| q \right\| + \left\| q \right\| \sup _{\left\| v \right\| _x \leqslant 1} \left\| \frac{1}{m} \sum _{k=1}^m D_{{\mathfrak {g}}}\overline{\gamma (\omega _k)} g_{\omega _k}(v)\right. \\&\left. \quad - {\mathbb {E}}_{{{\bar{E}}}}D_{{\mathfrak {g}}}\overline{\gamma (\omega )} g_{\omega }(v) \right\| \end{aligned}$$

where \(g_\omega (v) {\mathop {=}\limits ^{\text {def.}}}\mathrm {D}_{2}\left[ \varphi _\omega \right] (x)[v,v]\). By Lemma 11 and a union bound, for all \(x\in {\mathcal {G}}^{\mathrm {near}}\),

$$\begin{aligned} \sup _{\left\| v \right\| _x \leqslant 1} \left\| \frac{1}{m} \sum _{k=1}^m D_{{\mathfrak {g}}}\overline{\gamma (\omega _k)} g_{\omega _k}(v) - {\mathbb {E}}_{{{\bar{E}}}}D_{{\mathfrak {g}}}\overline{\gamma (\omega )} g_{\omega }(v) \right\| \leqslant B_2 \end{aligned}$$

if \( m \gtrsim \frac{s B_{22} {{\bar{L}}}_{01}^2 + \sqrt{s} {{\bar{L}}}_{01} {{\bar{L}}}_2 B_2}{B_2^2} \left( \log \left( \frac{\left|{\mathcal {G}}^{\mathrm {near}}\right|}{\rho } \right) + d \log \left( \frac{s {{\bar{L}}}_{01} {{\bar{L}}}_{2}}{B_2} \right) \right) \). Using this property with \(q {\mathop {=}\limits ^{\text {def.}}}D_{{\mathfrak {g}}}^{-1} {\hat{\varUpsilon }}^{-1} e\) such that \(\left\| q \right\| \leqslant 4 c_0\), and by adjusting \(C_0\), we obtain: for all \(x\in {\mathcal {G}}^{\mathrm {near}}_j\),

$$\begin{aligned} \left\| \overline{{{\,\mathrm{sign}\,}}(a_j)} \mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x) - K^{(02)}(x_j,x) \right\| _x&\leqslant \left\| \overline{{{\,\mathrm{sign}\,}}(a_j)} \mathrm {D}_{2}\left[ \eta ^{\mathrm {app}}\right] (x) - K^{(02)}(x_j,x) \right\| _x \\&\quad + \left\| \mathrm {D}_{2}\left[ \hat{{\mathbf {f}}}^\top D_{{\mathfrak {g}}}q\right] (x) \right\| _x \\&\leqslant \frac{7{\bar{\varepsilon }}_2}{64} + \frac{{\bar{\varepsilon }}_2}{128} = \frac{15{\bar{\varepsilon }}_2}{128} \end{aligned}$$

which concludes the second step of our proof. By combining the bounds on m that we obtained with (59), after simplification we still obtain

$$\begin{aligned} m \gtrsim s \sum _{r=0,2} \left( {{\bar{L}}}_{01}^2 \frac{B_r^2}{{\bar{\varepsilon }}_r^2} \log (s) \log \left( \frac{s}{\rho } \right) + \left( \frac{{{\bar{L}}}_r^2}{{\bar{\varepsilon }}_r^2} + \frac{{{\bar{L}}}_{01}{{\bar{L}}}_r}{{\bar{\varepsilon }}_r} + \frac{B_{22}}{B_2^2}{{\bar{L}}}_{01}^2 \right) \log \left( \frac{N'_r \log (s)}{\rho } \right) \right) \nonumber \\ \end{aligned}$$
(62)

with \(N'_0 = N_0 = \left|{\mathcal {G}}^{\mathrm {far}}\right|\) but \(N'_2 = \left|{\mathcal {G}}^{\mathrm {near}}\right| + (s{{\bar{L}}}_{01} {{\bar{L}}}_2/B_2)^d\).

6.5 Step 3: Bounding the Norm \(\left\| p \right\| \)

In this section, we upper bound \(\left\| p \right\| \) where \(\varPhi ^* p = {\hat{\eta }}\), for the \({\hat{\eta }}\) that we have constructed in the previous section. We recall that \(\varPhi ^* p = \frac{1}{\sqrt{m}} \sum _{k=1}^m p_k \varphi _{\omega _k}(\cdot )\), and

$$\begin{aligned} \eta ^{\mathrm {app}}= \sum _{\ell =1}^J(\varUpsilon ^{-1}q_{\ell -1})^\top \hat{{\mathbf {f}}}_\ell = \frac{1}{\sqrt{m}}\sum _\ell \frac{\sqrt{m}}{m_\ell } \sum _{k\in {\mathcal {B}}_\ell } (\varUpsilon ^{-1}q_{\ell -1})^\top \overline{\gamma (\omega _k)} \varphi _{\omega _k} = \varPhi ^* p^{\mathrm {app}}, \end{aligned}$$

where \(p^{\mathrm {app}}{\mathop {=}\limits ^{\text {def.}}}(p_\ell )_{\ell =1}^J\in {\mathbb {C}}^m\) and \(p_\ell {\mathop {=}\limits ^{\text {def.}}}\frac{\sqrt{m}}{m_\ell } \left( \gamma (\omega _k)^* \varUpsilon ^{-1} q_{\ell -1} \right) _{k\in {\mathcal {B}}_\ell } \in {\mathbb {C}}^{m_\ell }\). So, \(\left\| p^{\mathrm {app}} \right\| ^2 = \sum _{\ell =1}^J\left\| p_\ell \right\| ^2_2\). To upper bound this, for each \(\ell =1,\ldots , J\),

$$\begin{aligned}&\frac{m_\ell }{m}\left\| p_\ell \right\| ^2_2 = \frac{1}{m_\ell } \sum _{k\in {\mathcal {B}}_\ell } q_{\ell -1}^* \varUpsilon ^{-1} \gamma (\omega _k) \gamma (\omega _k)^* \varUpsilon ^{-1} q_{\ell -1} = q_{\ell -1}^* \varUpsilon ^{-1} {\hat{\varUpsilon }}_\ell \varUpsilon ^{-1} q_{\ell -1} \\&= q_{\ell -1}^* \varUpsilon ^{-1} ({\hat{\varUpsilon }}_\ell \varUpsilon ^{-1} -\mathrm {Id}) q_{\ell -1} + q_{\ell -1}^* \varUpsilon ^{-1} q_{\ell -1} = q_{\ell -1}^* \varUpsilon ^{-1} q_{\ell } + q_{\ell -1}^* \varUpsilon ^{-1} q_{\ell -1} \\&\leqslant \left\| D_{{\mathfrak {g}}}^{-1} \varUpsilon ^{-1} D_{{\mathfrak {g}}}^{-1} \right\| \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| \left( \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| + \left\| D_{{\mathfrak {g}}}q_{\ell } \right\| \right) \\&\leqslant 4s \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}} \left( \left\| D_{{\mathfrak {g}}}q_\ell \right\| _{\mathrm {Block}} + \left\| D_{{\mathfrak {g}}}q_{\ell -1} \right\| _{\mathrm {Block}} \right) \leqslant 4s \left( c_\ell + 1 \right) \prod _{i=1}^{\ell -1} c_i^2. \end{aligned}$$

where we have used \(\left\| D_{{\mathfrak {g}}}^{-1} \varUpsilon ^{-1} D_{{\mathfrak {g}}}^{-1} \right\| \leqslant 2\) by Lemma 3, \(\left\| \cdot \right\| \leqslant \sqrt{2s}\left\| \cdot \right\| _{\mathrm {Block}}\), and the preceding computation for \(\left\| D_{{\mathfrak {g}}}q_\ell \right\| _{\mathrm {Block}}\). For \(\ell =1,2\), \( \frac{m}{m_\ell } = {\mathcal {O}}(1) \), and \(\frac{m}{m_3} = {\mathcal {O}}( \log (s))\). Also, for \(\ell \geqslant 3\),

$$\begin{aligned} \left( c_\ell + 1 \right) \prod _{i=1}^{\ell -1} c_i^2 = \left( 1+c_0 \right) \frac{c_0^{\ell -1}}{16\log (s)} \end{aligned}$$

Therefore,

$$\begin{aligned} \left\| p^{\mathrm {app}} \right\| ^2 \lesssim 4s \left( 1 + \frac{c_0}{4\sqrt{\log (s)}} + \frac{c_0^2}{16 \log (s)} + (1+c_0) \frac{c_0^2}{16(1-c_0)} \right) \lesssim s. \end{aligned}$$

On the other hand, \(\eta ^{\mathrm {e}}= \varPhi ^* p^\mathrm {e}\) where \(p^\mathrm {e}=\frac{1}{\sqrt{m}}\left( \gamma (\omega _k)^* \varUpsilon ^{-1} e \right) _{k=1}^m\). So,

$$\begin{aligned} \left\| p^\mathrm {e} \right\| ^2 = e^* \varUpsilon ^{-1} {\hat{\varUpsilon }}\varUpsilon ^{-1} e \leqslant 8\left\| D_{{\mathfrak {g}}}e \right\| ^2 \lesssim 1. \end{aligned}$$

Therefore, \({\hat{\eta }}= \varPhi ^* p \) with \(\left\| p \right\| ^2 \lesssim s\).
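The computations of this step all rest on the fact that a function of the form \(v^\top \hat{{\mathbf {f}}}\) belongs to \({{\,\mathrm{Im}\,}}(\varPhi ^*)\) with an explicit coefficient vector whose squared norm is the quadratic form \(v^* {\hat{\varUpsilon }}v\). The following sketch checks this identity numerically in the toy setting used above (all names and parameters are hypothetical).

```python
import numpy as np

# Check (illustration only): for p_k = gamma(w_k)^* v / sqrt(m),
#   v^T hat_f(x) = (Phi^* p)(x) = (1/sqrt(m)) sum_k p_k phi_{w_k}(x),
#   ||p||^2      = v^* hatUpsilon v,
# which is the identity used repeatedly to bound ||p^app|| and ||p^e|| in Step 3.
rng = np.random.default_rng(3)
s, m = 3, 200
spikes = np.array([0.5, 2.5, 4.5])
omegas = rng.integers(-20, 21, size=m)

def gamma(w):
    vals = np.exp(1j * w * spikes)
    return np.concatenate([vals, 1j * w * vals])

G = np.stack([gamma(w) for w in omegas])
hatUps = G.T @ G.conj() / m
v = rng.standard_normal(2 * s) + 1j * rng.standard_normal(2 * s)

p = (G.conj() @ v) / np.sqrt(m)                   # p_k = gamma(w_k)^* v / sqrt(m)

x = 1.2                                           # arbitrary evaluation point
phi_x = np.exp(1j * omegas * x)
lhs = v @ (G.conj().T @ phi_x) / m                # v^T hat_f(x)
rhs = (p @ phi_x) / np.sqrt(m)                    # (Phi^* p)(x)
print(np.isclose(lhs, rhs))                                           # True
print(np.isclose(np.vdot(p, p).real, (v.conj() @ hatUps @ v).real))   # True
```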

6.6 Step 4: Non-degeneracy on the Entire Domain

We conclude by showing that the \({\hat{\eta }}\) constructed in the previous sections is indeed non-degenerate on the entire domain. For this, we simply need to control the Lipschitz constants of \({\hat{\eta }}\) and its Hessian, which are in fact directly related to \(\left\| p \right\| \). Let \(x \in {\mathcal {X}}^{\mathrm {far}}\) be arbitrary and let \(x'\in {\mathcal {G}}^{\mathrm {far}}\) be the point in the grid closest to it. Under \({{{\bar{E}}}}\), we have

$$\begin{aligned} \left|{\hat{\eta }}(x)\right|&\leqslant 1-\frac{3{\bar{\varepsilon }}_0}{16} + \left|{\hat{\eta }}(x) -{\hat{\eta }}(x')\right| = 1-\frac{3{\bar{\varepsilon }}_0}{16} + \left|(\varPhi ^* p)(x) -(\varPhi ^*p)(x')\right| \\&\leqslant 1-\frac{3{\bar{\varepsilon }}_0}{16} + \left\| p \right\| \sqrt{\frac{1}{m} \sum _{k=1}^m \left|\varphi _{\omega _k}(x) - \varphi _{\omega _k}(x')\right|^2} \leqslant 1-\frac{3{\bar{\varepsilon }}_0}{16} + {{\bar{L}}}_{1} \left\| p \right\| {\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \end{aligned}$$

Hence, we obtain the bound on \({\mathcal {X}}^{\mathrm {far}}\) in (50) by choosing \({\mathcal {G}}^{\mathrm {far}}\) such that \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \leqslant \frac{{\bar{\varepsilon }}_0}{16{{\bar{L}}}_{1} \left\| p \right\| }\), since then \(1-\frac{3{\bar{\varepsilon }}_0}{16} + {{\bar{L}}}_{1} \left\| p \right\| \,{\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \leqslant 1-\frac{{\bar{\varepsilon }}_0}{8}\). This choice of precision results in

$$\begin{aligned} \left|{\mathcal {G}}^{\mathrm {far}}\right| = \left( \frac{C{\mathcal {R}}_{\mathcal {X}}{{\bar{L}}}_1 \left\| p \right\| }{{\bar{\varepsilon }}_0} \right) ^d \end{aligned}$$

for an appropriate constant C.

Now, for any \(x \in {\mathcal {X}}^{\mathrm {near}}_j\), and \(x'\in {\mathcal {G}}^{\mathrm {near}}_j\) closest to it, we write

$$\begin{aligned} \begin{aligned}&\left\| \overline{{{\,\mathrm{sign}\,}}(a_j)} \mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x) - K^{(02)}(x_j,x) \right\| _x \\&\quad \leqslant \left\| \mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x) - \mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] \right\| _x \\&\qquad + \left\| \overline{{{\,\mathrm{sign}\,}}(a_j)}\mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] -K^{(02)}(x_j,x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] \right\| _x \\&\qquad + \left\| K^{(02)}(x_j,x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] - K^{(02)}(x_j,x) \right\| _x \end{aligned} \end{aligned}$$
(63)

We bound each of these terms. For the first, under \({{{\bar{E}}}}\) we have

$$\begin{aligned}&\left\| \mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x) - \mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] \right\| _x \leqslant \left\| p \right\| \\&\quad \sqrt{\frac{1}{m} \sum _{k=1}^m \left\| \mathrm {D}_{2}\left[ \varphi _{\omega _k}\right] (x) - \mathrm {D}_{2}\left[ \varphi _{\omega _k}\right] (x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] \right\| _x^2} \leqslant {{\bar{L}}}_3 \left\| p \right\| {\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \end{aligned}$$

For the second term in (63), we have

$$\begin{aligned}&\left\| \overline{{{\,\mathrm{sign}\,}}(a_j)}\mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] -K^{(02)}(x_j,x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] \right\| _x \\&\quad = \left\| \overline{{{\,\mathrm{sign}\,}}(a_j)}\mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x') -K^{(02)}(x_j,x') \right\| _{x'} \leqslant \frac{15{\bar{\varepsilon }}_2}{128} \end{aligned}$$

from what we have proved in the previous section.

Finally, for the third term in (63), we naturally introduce \(K_{{{\bar{E}}}}^{(ij)}\), defined as \(K^{(ij)}\) in (25) but with \({\mathbb {E}}\) replaced by the conditional expectation \({\mathbb {E}}_{{{\bar{E}}}}\). From Lemma 4, the deviation between \(K^{(02)}\) and \(K_{{{\bar{E}}}}^{(02)}\) can be bounded by

$$\begin{aligned} \forall x\in {\mathcal {X}}^{\mathrm {near}},~\left\| K_{{{\bar{E}}}}^{(02)}(x_j,x) - K^{(02)}(x_j,x) \right\| _x = \left\| \mathrm {D}_{2}\left[ ({\mathbf {f}}_{{{\bar{E}}}}- {\mathbf {f}})^\top D_{{\mathfrak {g}}}u_j\right] (x) \right\| _x \leqslant \frac{{\bar{\varepsilon }}_2}{512} \end{aligned}$$

where \(u_j\) is the jth canonical vector of \({\mathbb {C}}^{s(d+1)}\). Moreover, by Assumption 2 it is easy to see that

$$\begin{aligned} \left\| K_{{{\bar{E}}}}^{(02)}(x_j,x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] - K_{{{\bar{E}}}}^{(02)}(x_j,x) \right\| _x \leqslant {{\bar{L}}}_0 {{\bar{L}}}_3 {\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \end{aligned}$$

Hence, by the triangle inequality we have

$$\begin{aligned}&\left\| K^{(02)}(x_j,x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] - K^{(02)}(x_j,x) \right\| _x\\&\quad \leqslant \left\| K^{(02)}(x_j,x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] - K_{{{\bar{E}}}}^{(02)}(x_j,x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] \right\| _x \\&\qquad +\left\| K_{{{\bar{E}}}}^{(02)}(x_j,x')[\tau _{x\rightarrow x'} \cdot ,\tau _{x\rightarrow x'} \cdot ] - K_{{{\bar{E}}}}^{(02)}(x_j,x) \right\| _x\\&\qquad +\left\| K_{{{\bar{E}}}}^{(02)}(x_j,x) - K^{(02)}(x_j,x) \right\| _x \leqslant \frac{{\bar{\varepsilon }}_2}{256} + {{\bar{L}}}_0 {{\bar{L}}}_3 {\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \end{aligned}$$

Therefore, (63) becomes

$$\begin{aligned} \left\| \overline{{{\,\mathrm{sign}\,}}(a_j)} \mathrm {D}_{2}\left[ {\hat{\eta }}\right] (x) - K^{(02)}(x_j,x) \right\| _x \leqslant {{\bar{L}}}_3({{\bar{L}}}_0 + \left\| p \right\| ){\mathfrak {d}}_{{\mathfrak {g}}}(x,x') + \frac{15{\bar{\varepsilon }}_2}{128} + \frac{{\bar{\varepsilon }}_2}{256}\nonumber \\ \end{aligned}$$
(64)

We prove the desired property on \(\mathrm {D}_{2}\left[ {\hat{\eta }}\right] \) by choosing \({\mathfrak {d}}_{{\mathfrak {g}}}(x,x') \leqslant \frac{{\bar{\varepsilon }}_2}{256{{\bar{L}}}_3({{\bar{L}}}_0 + \left\| p \right\| )}\), which yields

$$\begin{aligned} \left|{\mathcal {G}}^{\mathrm {near}}\right| = s \left|{\mathcal {G}}^{\mathrm {near}}_j\right| = s\left( \frac{C {r_{\mathrm {near}}}{{\bar{L}}}_0{{\bar{L}}}_3 \left\| p \right\| }{{\bar{\varepsilon }}_2} \right) ^d \end{aligned}$$

for an appropriate constant C. Gathering everything with (62), we finally obtain

$$\begin{aligned} m \gtrsim s \sum _{r=0,2} \left( {{\bar{L}}}_{01}^2 \frac{B_r^2}{{\bar{\varepsilon }}_r^2} \log (s) \log \left( \frac{s}{\rho } \right) + \left( \frac{{{\bar{L}}}_r^2}{{\bar{\varepsilon }}_r^2} + \frac{{{\bar{L}}}_{01}{{\bar{L}}}_r}{{\bar{\varepsilon }}_r} \right) \log \left( \frac{{\bar{N}}_r^d}{\rho } \right) \right) \end{aligned}$$
(65)

with \({\bar{N}}_0 = \frac{s {\mathcal {R}}_{\mathcal {X}}{{\bar{L}}}_1}{{\bar{\varepsilon }}_0}\), \({\bar{N}}_2 = \frac{s ({r_{\mathrm {near}}}{{\bar{L}}}_0{{\bar{L}}}_3 + {{\bar{L}}}_2)}{{\bar{\varepsilon }}_2}\).

6.7 Step 5: Additional Certificates

Non-degeneracy of \({\hat{\eta }}\) directly allows us to apply Proposition 1 to deduce stability away from the sparse support in the reconstructed measure. In order to apply Proposition 2, we need to construct s additional certificates \(\eta _j\), which are, however, “simpler” to construct, since they only need to interpolate a “sign vector” with a single nonzero coordinate and do not require the golfing scheme.

For each \(j=1,\ldots ,s\), let \(u_j\) be the vector of length \(s(d+1)\) whose \(j^{th}\) entry is one and all other entries are zero. Define the functions

$$\begin{aligned} \eta _j^+ {\mathop {=}\limits ^{\text {def.}}}\left( \varUpsilon ^{-1} \left( {\begin{array}{c}1_s\\ 0_{sd}\end{array}}\right) \right) ^\top {\mathbf {f}}(x) \quad \text {and} \quad \eta _j^- {\mathop {=}\limits ^{\text {def.}}}\left( \varUpsilon ^{-1} \left( 2 u_j- \left( {\begin{array}{c}1_s\\ 0_{sd}\end{array}}\right) \right) \right) ^\top {\mathbf {f}}(x), \end{aligned}$$

and

$$\begin{aligned} \eta _j{\mathop {=}\limits ^{\text {def.}}}\frac{1}{2}(\eta _j^+ + \eta _j^-) = \left( \varUpsilon ^{-1} u_j \right) ^\top {\mathbf {f}}(x). \end{aligned}$$

By Theorem 2, \(\eta _j^+\) and \(\eta _j^-\) are non-degenerate (limit) dual certificates with respect to signs \(1_s\) and \(-1_s + 2 u_j\), respectively, and \(\eta _j\) satisfies, for all \(\ell \ne j\):

$$\begin{aligned} \begin{aligned}&\eta _j(x_j) = 1,\quad \nabla \eta _j(x_j) = 0 \quad \text {and} \quad \eta _j(x_\ell ) = 0,\quad \nabla \eta _j(x_\ell ) = 0 \\&\left|\eta _j(x)\right| \leqslant \frac{1}{2}\left( \left|\eta _j^+(x)\right| + \left|\eta _j^-(x)\right| \right) \leqslant 1-\frac{{\bar{\varepsilon }}_0}{4},\quad \forall x \in {\mathcal {X}}^{\mathrm {far}}\\&\left\| \mathrm {D}_{2}\left[ \eta _j\right] (x) - K^{(02)}(x_j,x) \right\| _x \leqslant \frac{{\bar{\varepsilon }}_2}{16},\quad \forall x \in {\mathcal {X}}^{\mathrm {near}}_j \\&\left\| \mathrm {D}_{2}\left[ \eta _j\right] (x) \right\| _x \\&\qquad \leqslant \frac{1}{2}\left( \left\| \mathrm {D}_{2}\left[ \eta _j^+\right] (x) - K^{(02)}(x_\ell ,x) \right\| _x + \left\| -\mathrm {D}_{2}\left[ \eta _j^-\right] (x) - K^{(02)}(x_\ell ,x) \right\| _x \right) \\&\quad \leqslant \frac{{\bar{\varepsilon }}_2}{16},\quad \forall x \in {\mathcal {X}}^{\mathrm {near}}_\ell \end{aligned} \end{aligned}$$
(66)

Thus, using Lemma 2 to translate the last two conditions into quadratic decay, we conclude that \(\eta _j\) satisfies the conditions of Proposition 2.

To conclude, we will show that

$$\begin{aligned} {\hat{\eta }}_j{\mathop {=}\limits ^{\text {def.}}}\left( {\hat{\varUpsilon }}^{-1} u_j \right) ^\top \hat{{\mathbf {f}}}\in {{\,\mathrm{Im}\,}}{\varPhi ^*} \end{aligned}$$

does not deviate too much from \(\eta _j\) and satisfies the conditions of Proposition 2. Note that by construction, \({\hat{\eta }}_j(x_j) = 1\), \({\hat{\eta }}_j(x_\ell ) = 0\) for all \(\ell \ne j\), and \(\nabla {\hat{\eta }}_j(x_\ell ) = 0\) for all \(\ell \). It therefore remains to control the deviation of \({\hat{\eta }}_j\) from \(\eta _j\) on \({\mathcal {X}}^{\mathrm {far}}\) and \(\mathrm {D}_{2}\left[ {\hat{\eta }}_j\right] \) from \(\mathrm {D}_{2}\left[ \eta _j\right] \) on \({\mathcal {X}}^{\mathrm {near}}\).
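Before stating the corresponding result, here is a small numerical illustration of why the interpolation conditions hold by construction in the toy setting used above (hypothetical parameters): since \(\varPsi ({\hat{\eta }}_j) = {\hat{\varUpsilon }}{\hat{\varUpsilon }}^{-1} u_j = u_j\), each \({\hat{\eta }}_j\) takes the value 1 at \(x_j\), vanishes at the other spikes and has zero gradient at all of them.

```python
import numpy as np

# Additional certificates (illustration only): hat_eta_j = (hatUpsilon^{-1} u_j)^T hat_f
# interpolates exactly by construction, since Psi(v^T hat_f) = hatUpsilon v.
rng = np.random.default_rng(4)
s, m = 3, 200
spikes = np.array([0.5, 2.5, 4.5])
omegas = rng.integers(-20, 21, size=m)

def gamma(w):
    vals = np.exp(1j * w * spikes)
    return np.concatenate([vals, 1j * w * vals])

G = np.stack([gamma(w) for w in omegas])
hatUps = G.T @ G.conj() / m

def hat_eta(j, x):
    coeff = np.linalg.solve(hatUps, np.eye(2 * s)[j])      # hatUpsilon^{-1} u_j
    hat_f_x = (G.conj().T @ np.exp(1j * omegas * x)) / m   # hat_f(x)
    return coeff @ hat_f_x

for j in range(s):
    # Row j reads approximately [0, ..., 1, ..., 0]: value 1 at x_j, 0 at the others
    # (the zero-gradient conditions hold as well but are not printed here).
    print([round(abs(hat_eta(j, x_l)), 6) for x_l in spikes])
```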

Proposition 3

Under Assumptions 1 and 2, suppose that \(\min _{i\ne j}{\mathfrak {d}}_{{\mathfrak {g}}}(x_i,x_j)\geqslant \varDelta \). Then, with probability at least \(1-\rho \), for all \(j=1,\ldots , s\), there exists \({\hat{\eta }}_j = \varPhi ^* p_j\) where \(\left\| p_j \right\| \leqslant 4\) which satisfies, for all \(\ell \ne j\):

$$\begin{aligned} \begin{aligned}&{\hat{\eta }}_j(x_j) = 1,\quad \nabla {\hat{\eta }}_j(x_j) = 0 \quad \text {and} \quad {\hat{\eta }}_j(x_\ell ) = 0,\quad \nabla {\hat{\eta }}_j(x_\ell ) = 0 \\&\left|{\hat{\eta }}_j(x)\right| \leqslant 1-\frac{{\bar{\varepsilon }}_0}{8},\quad \forall x \in {\mathcal {X}}^{\mathrm {far}}\\&\left\| \mathrm {D}_{2}\left[ {\hat{\eta }}_j\right] (x) - K^{(02)}(x_j,x) \right\| _x \leqslant \frac{{\bar{\varepsilon }}_2}{8},~ \forall x \in {\mathcal {X}}^{\mathrm {near}}_j,\quad \left\| \mathrm {D}_{2}\left[ {\hat{\eta }}_j\right] (x) \right\| _x \leqslant \frac{{\bar{\varepsilon }}_2}{8},~ \forall x \in {\mathcal {X}}^{\mathrm {near}}_\ell \end{aligned} \end{aligned}$$
(67)

The proof controls the deviation between \({\hat{\eta }}_j\) and \(\eta _j\) on a fine grid using Bernstein’s concentration inequalities and extends the bound to the entire domain using Lipschitz properties of \({\hat{\eta }}_j\). As mentioned above, this result is conceptually simpler to prove than the deviation bounds on \(\eta ^{\mathrm {app}}\) since \(\left\| u_j \right\| = 1\); we therefore defer its proof to Appendix B.5. Using Lemma 2, we have thus constructed the additional certificates needed to apply Proposition 2, which concludes the proof of Theorem 3.

7 Conclusion and Outlooks

In this paper, we have presented a unifying geometric view of the problem of recovering sparse measures from random measurements. This theoretical analysis highlights the key role played by the invariant Fisher metric in defining a precise notion of Rayleigh limit in the case of possibly non-translation-invariant measurement kernels. We analyzed several examples, including Laplace measurements in imaging, and left partially open some other important examples such as one-hidden-layer neural networks. Analyzing the super-resolution regime (going below the Rayleigh limit) requires stringent assumptions, such as positivity of the measures. Beyond the 1-D case, this is still mostly an open question, and we refer to [45] for some partial results.