1 Introduction

This paper is concerned with the following fundamental reconstruction task:

Problem 1

Consider a set of signals \(\mathcal {X}\subset \mathbb {R}^p\) and let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be a collection of measurement vectors. Moreover, let \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) be a scalar output function. Under what conditions can the following recovery problem be solved uniformly for all \(\varvec{x}\in \mathcal {X}\): Assume that \(\varvec{x}\) is observed in the form

$$\begin{aligned} y_i {:=}F(\varvec{a}_i, \varvec{x}) + \nu _i, \quad i = 1, \dots , m, \end{aligned}$$
(1.1)

where \(\nu _1,\dots , \nu _m \in \mathbb {R}\) denote scalar noise. Given \(\{(\varvec{a}_i, y_i)\}_{i = 1}^m\), is it possible to recover the underlying signal \(\varvec{x}\) efficiently?

The prototypical instance of Problem 1 is the approach of compressed sensing. Dating back to the seminal works of Candès, Romberg, Tao, and Donoho [10, 11, 19], the traditional setup of compressed sensing focuses on the case of noisy linear observations, i.e., we have \(y_i = \langle \varvec{a}_i, \varvec{x}\rangle + \nu _i\) for \(i = 1, \dots , m\). In this regime, Problem 1 is nowadays fairly well understood, underpinned by various real-world applications, efficient algorithmic methods, and a rich theoretical foundation; see [21] for a comprehensive overview. In a nutshell, compressed sensing has shown that signal recovery is still feasible when \(m \ll p\), provided that the signal set \(\mathcal {X}\) carries some low-dimensional structure, e.g., a variant of sparsity, while the measurement ensemble \(\{\varvec{a}_i\}_{i = 1}^m\) follows an appropriate random design. The uniformity over \(\mathcal {X}\) plays an important role in this context, since the measurement device—determined by \(\{\varvec{a}_i\}_{i = 1}^m\) in our case—is typically fixed in applications and should allow for the reconstruction of all (or most) signals in \(\mathcal {X}\). Apart from this practical relevance, the above quest for uniform recovery is an interesting mathematical problem in its own right. It is significantly more involved than its non-uniform counterpart, where the \(\{\varvec{a}_i\}_{i = 1}^m\) may vary for each \(\varvec{x}\in \mathcal {X}\).

The present work is devoted to Problem 1 in the much less explored situation of nonlinear output functions. In fact, many concepts from compressed sensing theory, such as the restricted isometry or nullspace property, are tailored to linear models and do not carry over directly to the nonlinear case. As we will see in the next subsection, the presence of nonlinear measurement components is not merely an academic concern but affects many problems in signal processing and high-dimensional statistics.

1.1 Prior Art

There exist two branches of research that are particularly relevant to this work. The first one is rooted in the field of (memoryless) quantized compressed sensing, which deals with the fact that analog (linear) measurements often need to be quantized before further processing in practice. A common scenario in this respect is 1-bit compressed sensing, where only a single bit of information is retained, e.g., when (1.1) yields observations of the form \(y_i = {{\,\mathrm{sign}\,}}(\langle \varvec{a}_i, \varvec{x}\rangle )\). Due to this considerable loss of information, it may come as a surprise that tractable recovery methods are still available and Problem 1 is relatively well understood in this situation. A solid theoretical basis as well as efficient algorithms have been developed over the last few years, including significant progress on lower bounds [18, 34], robustness [17, 49], advanced quantization schemes [3, 36, 37, 44, 65], and non-Gaussian measurements [1, 16, 17, 18]. While this list of references is certainly incomplete, the surveys of Dirksen [14] and Boufounos et al. [6] provide nice overviews of the field of quantized compressed sensing.

The above achievements have also given rise to several important mathematical tools. One of the most notable breakthroughs is the use of quantized embedding results for signal recovery—a highly geometric argument based on uniform random hyperplane tessellations, e.g., see [17, 34, 45, 49, 50]. Closer to the original ideas of compressed sensing are reconstruction guarantees relying on variants of the restricted isometry property, e.g., see [16, 20, 33, 65]. However, these techniques are strongly tailored to quantized measurements, and it remains unclear how they could be extended to other instances of Problem 1.

The second branch of related literature is much less restrictive with respect to the underlying observation model. It allows (1.1) to take the form \(y_i = f(\langle \varvec{a}_i, \varvec{x}\rangle ) + \nu _i\), where \(f:\mathbb {R}\rightarrow \mathbb {R}\) can be nonlinear and random. A pioneering work on these so-called single-index models is the one of Plan and Vershynin [51] (inspired by ideas of Brillinger [7]), who study the generalized Lasso as a reconstruction method:

$$\begin{aligned} \min _{\varvec{z}\in K} \ \tfrac{1}{m} \sum _{i = 1}^m (y_i - \langle \varvec{a}_i, \varvec{z}\rangle )^2. \end{aligned}$$
(\({\mathsf {P}}_{K,\varvec{y}}\))

Here, \(K\subset \mathbb {R}^p\) is a convex constraint set, serving as an appropriate relaxation of the actual signal set \(\mathcal {X}\) and making (\({\mathsf {P}}_{K,\varvec{y}}\)) tractable in many situations of interest. Although this “linearization” strategy might appear very coarse, it was shown to produce satisfactory outcomes for Gaussian measurements, even when \(f\) is highly nonlinear. A key benefit of (\({\mathsf {P}}_{K,\varvec{y}}\)) is that it does not require any (explicit) knowledge of the observation model, which enables various applications to signal processing and statistical estimation, e.g., see [23, Chap. 3 & 4]. The article of Plan and Vershynin [51] is just an example of a whole line of research with many related and follow-up works, e.g., see [22, 28, 29, 47, 52, 56, 60, 62]; remarkably, this approach also extends to phase-retrieval-like problems where \(f\) is an even function [25, 61, 66]. For a more detailed discussion of the literature, we refer to [52, Sec. 6] and [23, Sec. 4.2].
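For concreteness, the following Python sketch (our own illustration, not taken from the cited works) implements (\({\mathsf {P}}_{K,\varvec{y}}\)) via projected gradient descent for the hypothetical choice of \(K\) as a scaled \(\ell ^{1}\)-ball and applies it to simulated single-index observations \(y_i = \tanh (\langle \varvec{a}_i, \varvec{x}\rangle )\); only the projection step would change for a different convex constraint set.

```python
# Minimal sketch (our own): projected gradient descent for the generalized
# Lasso (P_{K,y}) with the hypothetical constraint K = {z : ||z||_1 <= R}.
import numpy as np

def project_l1_ball(z, radius):
    """Euclidean projection onto the l1-ball of the given radius."""
    if np.sum(np.abs(z)) <= radius:
        return z
    u = np.sort(np.abs(z))[::-1]                    # sorted magnitudes (descending)
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, z.size + 1) > css - radius)[0][-1]
    tau = (css[k] - radius) / (k + 1)
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def generalized_lasso(A, y, radius, n_steps=500):
    """Minimize (1/m) * ||y - A z||_2^2 subject to ||z||_1 <= radius."""
    m, p = A.shape
    step = m / (2 * np.linalg.norm(A, 2) ** 2)      # 1 / Lipschitz constant of the gradient
    z = np.zeros(p)
    for _ in range(n_steps):
        grad = (2.0 / m) * A.T @ (A @ z - y)
        z = project_l1_ball(z - step * grad, radius)
    return z

# Single-index observations y_i = f(<a_i, x>) with f = tanh; the estimator
# itself uses no knowledge of f (cf. [51]).
rng = np.random.default_rng(0)
p, m, s = 400, 150, 5
x = np.zeros(p); x[:s] = 1.0 / np.sqrt(s)           # s-sparse signal on the unit sphere
A = rng.standard_normal((m, p))
y = np.tanh(A @ x)

g = rng.standard_normal(10**6)
mu = np.mean(np.tanh(g) * g)                        # mu = E[f(g) g]
z_hat = generalized_lasso(A, y, radius=mu * np.sum(np.abs(x)))  # oracle radius, for illustration only
print(np.linalg.norm(z_hat - mu * x))               # error relative to the rescaled signal mu * x
```

In line with the linearization viewpoint of [51], the recovered vector approximates the rescaled signal \(\mu \varvec{x}\) rather than \(\varvec{x}\) itself.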

At first sight, the aforementioned works seem to provide a general solution to Problem 1, but they lack a crucial feature, namely uniformity. Indeed, most of these results are based on concentration inequalities over high-dimensional signal sets, exploiting their “local geometry” around a fixed \(\varvec{x}\in \mathcal {X}\). This strategy naturally leads to non-uniform recovery guarantees, and we are not aware of a uniform extension in that regard. Finally, we point out that the two research areas discussed above are not independent but exhibit certain overlaps in terms of their main achievements and proof techniques, for instance, see [28, 62].

1.2 Contributions and Overview

The primary objective of this work is to fill the striking gap between the lines of research discussed in the previous subsection. On the one hand, we show that uniform recovery is not limited to linear and quantized measurement schemes but applies to a much larger family of observation models. On the other hand, it turns out that under mild assumptions, known non-uniform guarantees for single-index and related models naturally extend to the uniform case. Apart from a unification, our theoretical approach will contribute to each of these research directions through new and improved results.

On the conceptual side, we stick to the methodology suggested by the second branch of the literature from Sect. 1.1, analyzing the performance of the generalized Lasso (\({\mathsf {P}}_{K,\varvec{y}}\)). Our main result, Theorem 2 in Sect. 2.3, establishes a uniform error bound for (\({\mathsf {P}}_{K,\varvec{y}}\)) under very general conditions, including all observation models mentioned above. This finding implies that there exists a “universal” recovery strategy, which often yields satisfactory outcomes. In this context, it is worth pointing out that we have focused on (\({\mathsf {P}}_{K,\varvec{y}}\)) mainly because of its simplicity and popularity, but similar results could be shown for other reconstruction methods as well (cf. Remark 3(3)). Therefore, the present paper is a contribution to Problem 1 in the first place, whereas the estimator of choice plays a subordinate role.

The recovery guarantee of Theorem 2 is formulated in a quite abstract setting, which requires some technical preparation. In order to familiarize the reader with the results of this paper, we now state a special case for single-index models with a Lipschitz continuous output function and Gaussian measurements (see Sect. 8.4 for a proof). Note that parts of the notation from Sect. 1.4 are already used here; the complexity parameters \(w_{t}(\cdot )\) and \(w_{}(\cdot )\) correspond to the common notion of (local) mean width, which is formally introduced in Definition 2.

Theorem 1

There exist universal constants \(c, C > 0\) for which the following holds.

Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a standard Gaussian random vector \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). Let \(f:\mathbb {R}\rightarrow \mathbb {R}\) be \(\gamma \)-Lipschitz, and for \(g\sim \mathsf {N}(0, 1)\), we set

$$\begin{aligned} \mu {:=}\mathbb {E}[f(g)g] \quad \text {and} \quad r{:=}\Vert f(g) - \mu g\Vert _{\psi _2}. \end{aligned}$$

Moreover, let \(\mathcal {X}\subset \mathbb {S}^{p-1}\) and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\mu \mathcal {X}\subset K\). For \(u\ge 1\) and a desired reconstruction accuracy \(t \ge 0\), we assume that

$$\begin{aligned} m \ge C \cdot \Big ( (1 + t^{-2} r^2 ) \cdot \big (w_{t}^2(K- \mu \mathcal {X})+ u^2 \big ) + t^{-2} (\mu + \gamma )^2 \cdot w_{}^2(\mathcal {X}) \Big ), \end{aligned}$$
(1.2)

where \(K- \mu \mathcal {X}{:=}\{ \varvec{z}- \mu \varvec{x}\ :\varvec{z}\in K, \varvec{x}\in \mathcal {X}\}\). Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}= (y_1, \dots , y_m) \in \mathbb {R}^m\) be given by

$$\begin{aligned} y_i = f(\langle \varvec{a}_i, \varvec{x}\rangle ) + \nu _i, \quad i = 1, \dots , m, \end{aligned}$$

such that \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \mu \varvec{x}\Vert _{2} \le t\).

Theorem 1 can be seen as a uniform version of Plan and Vershynin's main result on single-index models [51, Thm. 1.9]. Most notably, the sampling-rate condition (1.2) implies that the dependence on the desired accuracy t is the same as in the non-uniform case. For more details on the interplay between m, t, and the complexity parameters, we refer to the discussion of Theorem 2 in Sect. 2.3. To the best of our knowledge, Theorem 1 is a new result in its own right, and in particular, it is not a consequence of the approaches outlined in the previous subsection. A novel argument is required to prove this type of recovery guarantee; see Sect. 1.3 for an overview of the major technical challenge of our work.

Our main result, Theorem 2, extends Theorem 1 in several important respects, in line with our key achievements:

  1. Discontinuous output functions. The Lipschitz assumption in Theorem 1 prohibits popular models like quantized observations. It turns out that an extension in that regard is subtle. To this end, we introduce an approximative increment condition on the observation variable, which is satisfied for many discontinuous nonlinearities; see Sect. 2.2 for more details and Sects. 4.1–4.4 for applications to nonlinear compressed sensing.

  2. Beyond Gaussian measurements. Compared to the previous point, it is relatively straightforward to allow for i.i.d. sub-Gaussian measurement vectors in Theorem 1. However, this relaxation leads to an additional constraint term, called the target mismatch, whose size depends on the distribution of the measurements and the output function; see Definition 1 in Sect. 2.1.

  3. Beyond single-index models. In Sect. 2, we study the more general situation of Problem 1, where \(y_i\) is not necessarily representable in terms of the dot product \(\langle \varvec{a}_i, \varvec{x}\rangle \). While this requires an additional step of abstraction, it allows for more complicated nonlinear models as well as refined recovery tasks, e.g., support estimation (see Sect. 4.5). For this purpose, we introduce so-called target functions in Sect. 2.1, transforming a signal \(\varvec{x}\in \mathcal {X}\) in such a way that it becomes compatible with the solution vector of (\({\mathsf {P}}_{K,\varvec{y}}\)) under observations of the form (1.1). Note that in the setup of Theorem 1, this transformation simply corresponds to rescaling \(\varvec{x}\) by the scalar factor \(\mu \).

  4. Outlier robustness. The \(\ell ^{2}\)-noise constraint \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\) in Theorem 1 is standard for linear models and also remains useful in the situation of Lipschitz nonlinearities. However, such a condition is overly restrictive for quantized measurement schemes, where the noise is due to (few) wrong bits instead of (small) real-valued perturbations. Theorem 2 therefore involves an important relaxation in this respect, permitting a portion of “gross” outliers in the observation vector.

Based on our findings in Theorem 2, we will also address two points of independent interest:

  5. Uniform vs. non-uniform recovery. Section 3 isolates characteristics that are specifically attributable to uniformity. While the presence of nonlinear output functions may only affect the oversampling rate (see also Sect. 4), our focus here is on the quite implicit complexity parameter \(w_{t}(K- \mathcal {X})\), cf. (1.2). We show that it can always be bounded by two much more informative expressions, namely a “fully localized” mean width and its global counterpart (see Proposition 1). This important simplification enables us to reuse known (non-uniform) results from the literature. We illustrate for two examples (\(\ell ^{1}\)-constraints and total variation) that there is no considerable gap between the uniform and non-uniform regime.

  6. Plug-and-play via embeddings. Section 5 demonstrates how to incorporate known random embedding results into our theoretical framework. In this way, for instance, it is possible to obtain near-optimal error decay rates for (\({\mathsf {P}}_{K,\varvec{y}}\)) with quantized measurements. This achievement is due to a more general guarantee stated in Theorem 3.

The general purpose of our methodology is to enable a more systematic study of Problem 1 than before, especially for nonlinear observation models. To a certain degree, this paper promotes an alternative path to compressed sensing theory that can do without common tools like the restricted isometry property or quantized embedding results. Despite clear overlaps, each of these approaches comes along with its own strengths and (in-)accessible regimes; see Sect. 6 for a more detailed discussion.

1.3 Technical Challenges

This subsection highlights the main technical achievements of this work. Our principal procedure is in fact relatively standard: We decompose the excess loss associated with the estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) into separate empirical processes and aim at suitable lower/upper bounds for each (cf. [17, 23, 25, 41, 51]); see Sect. 2.2 for an overview of this argument.

However, the quest for uniform recovery as prescribed by Problem 1 makes this approach a challenging endeavor. In a nutshell, the key difficulty is that the common multiplier term turns into an empirical product process of the following type:

$$\begin{aligned} \tfrac{1}{m} \sum _{i=1}^m \xi _{i}(\varvec{x}) \cdot \langle \varvec{a}_i, \varvec{v}\rangle , \qquad {\varvec{x}\in \mathcal {X},\quad \varvec{v}\in (K- T\varvec{x}) \cap t\mathbb {S}^{p-1},} \end{aligned}$$
(1.3)

where the first factor takes the form \(\xi _{i}(\varvec{x}) = \langle \varvec{a}_i, T\varvec{x}\rangle - F(\varvec{a}_i, \varvec{x})\) and \(T:\mathcal {X}\rightarrow K\) is a fixed function (think of a scalar multiplication for now). Since we require a uniform upper bound for (1.3) over both \(\varvec{x}\) and \(\varvec{v}\), the regularity of the stochastic process \(\{\xi _{i}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) plays an important role, in the sense that it should fulfill a sub-Gaussian increment condition (see (1.4)). Such a regularity assumption unfortunately does not even hold for standard models like 1-bit quantization \(F(\varvec{a}_i, \varvec{x}) = {{\,\mathrm{sign}\,}}(\langle \varvec{a}_i, \varvec{x}\rangle )\). Therefore, existing chaining-based results for product processes are not directly applicable to our situation. This may also explain why there is no unified theory for nonlinear observations available so far and why known uniform recovery guarantees are restricted to special cases (see Sect. 1.1).

The present article provides a general remedy for this foundational problem. Our starting point is that a given output function \(F\) can often be approximated by a more regular one, say \(F_t\), where \(t > 0\) controls the approximation accuracy. As such, this strategy can yield a more benign multiplier \(\xi _{t,i}(\varvec{x}) = \langle \varvec{a}_i, T\varvec{x}\rangle -F_t(\varvec{a}_i, \varvec{x})\), but it simply shifts the original problem to the resulting approximation error \(F(\varvec{a}_i, \varvec{x}) - F_t(\varvec{a}_i, \varvec{x})\). Hence, a fundamental insight in our proof (see Step 3 in Sect. 7) is that it suffices to consider the absolute error \(\varepsilon _{t,i}(\varvec{x}) = |F(\varvec{a}_i, \varvec{x}) - F_t(\varvec{a}_i, \varvec{x})|\) instead. Indeed, the process \(\{\varepsilon _{t,i}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) fulfills an increment condition in all relevant model situations that we are aware of; see Fig. 2 in Sect. 4 for an illustration of the above argument based on the sign function. With this at hand, and apart from other technicalities, we can take advantage of a recent concentration inequality due to Mendelson [42, Thm. 1.13].
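To make this concrete, the following Python sketch (our own illustration) considers the sign nonlinearity together with a hypothetical clipped-ramp surrogate \(F_t\), which is \(\tfrac{1}{t}\)-Lipschitz in \(\langle \varvec{a}, \varvec{x}\rangle \); the absolute error \(\varepsilon _{t,i}(\varvec{x})\) is then supported on the transition region \(|\langle \varvec{a}_i, \varvec{x}\rangle | \le t\) and is small on average.

```python
# Sketch (our own) of the surrogate idea for F(a, x) = sign(<a, x>): the
# hypothetical clipped ramp f_t below is (1/t)-Lipschitz, and the absolute
# approximation error |sign(s) - f_t(s)| vanishes whenever |s| >= t.
import numpy as np

def f_t(s, t):
    """Clipped-ramp surrogate of sign(s), linear on [-t, t]."""
    return np.clip(s / t, -1.0, 1.0)

rng = np.random.default_rng(0)
p, m, t = 100, 5000, 0.1
x = rng.standard_normal(p); x /= np.linalg.norm(x)
A = rng.standard_normal((m, p))
s = A @ x

eps_t = np.abs(np.sign(s) - f_t(s, t))              # absolute approximation error
print(np.mean(eps_t != 0))                          # fraction of affected samples, ~ P(|g| < t)
print(np.mean(eps_t * np.abs(A[:, 0])))             # empirical analogue of E[eps_t(x) * |<a, e_1>|], small (scales with t)
```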

The intuitive idea described in the previous paragraph is placed into a rigorous framework by means of our central Assumption 2. In this context, it is worth noting that the actual main result, Theorem 2, couples the approximation accuracy t with the desired reconstruction error. This might appear somewhat peculiar at first sight because the processes \(\{\xi _{t,i}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) and \(\{\varepsilon _{t,i}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) usually become less regular as t decreases. However, we demonstrate that this trade-off can be balanced out well and only affects the achievable oversampling rate.

Finally, our general technique to control empirical product processes of the form (1.3) may find application beyond signal recovery problems. For example, it could also be used to derive a new generation of nonlinear embedding results. In fact, one of the earliest works on binary random embeddings [50] has pointed out the usefulness of approximating nonlinearities, though strongly tailored to 1-bit quantization. With our systematic approach, their result could not only be reproduced but also further improved (cf. [58, Thm. 3.7]).

1.4 Overview and Notation

The rest of this article is organized as follows. Section 2 presents our findings in full generality, while the proof of Theorem 2 is postponed to Sect. 7. In Sect. 3, we elaborate on aspects that are specifically due to uniform recovery and highlight potential gaps to the non-uniform regime. Sections 4 and 5 are then devoted to applications and concrete examples, including a series of new guarantees for nonlinear compressed sensing and high-dimensional estimation; the corresponding proofs are provided in Sects. 8 and 9, respectively. Our concluding discussion can be found in Sect. 6.

Before proceeding, let us fix some standard notations and conventions that are commonly used in this work. The letters c and C are reserved for (positive) constants, whose values could change from time to time. We speak of a universal constant if its value does not depend on any other involved parameter. If an inequality holds up to a universal constant C, we usually write \(A \lesssim B\) instead of \(A \le C \cdot B\). The notation \(A \asymp B\) is a shortcut for \(A \lesssim B \lesssim A\).

For \(p \in \mathbb {N}\), we set \([p] {:=}\{1, \dots , p\}\). The cardinality of an index set \(\mathcal {S}\subset [p]\) is denoted by \(|\mathcal {S}|\) and its set complement in [p] is \(\mathcal {S}^c {:=}[p] \setminus \mathcal {S}\). Vectors and matrices are denoted by lower- and uppercase boldface letters, respectively. The j-th entry of a vector \(\varvec{v}\in \mathbb {R}^p\) is denoted by \((\varvec{v})_j\), or simply by \(v_j\) if there is no danger of confusion. The support of \(\varvec{v}\in \mathbb {R}^p\) is defined by \({{\,\mathrm{supp}\,}}(\varvec{v}) {:=}\{ j \in [p] \ :v_j \ne 0 \}\) and \(\Vert \varvec{v}\Vert _{0} {:=}|{{\,\mathrm{supp}\,}}(\varvec{v})|\) denotes its sparsity. We write \(\varvec{I}_{p} \in \mathbb {R}^{p \times p}\) and \(\varvec{0}\in \mathbb {R}^p\) for the identity matrix and the zero vector in \(\mathbb {R}^p\), respectively. For \(1 \le q \le \infty \), we denote the \(\ell ^{q}\)-norm on \(\mathbb {R}^p\) by \(\Vert \cdot \Vert _{q}\) and the associated unit ball by \(\mathbb {B}_q^p\). The Euclidean unit sphere is given by \(\mathbb {S}^{p-1} {:=}\{ \varvec{v}\in \mathbb {R}^p \ :\Vert \varvec{v}\Vert _{2} = 1 \}\). The spectral matrix norm is denoted by \(\Vert \cdot \Vert _{2\rightarrow 2}\).

Let \(H, H' \subset \mathbb {R}^p\) and \(\varvec{v}\in \mathbb {R}^p\). We write \({{\,\mathrm{{\text {span}}}\,}}(H)\) for the linear hull of \(H\) and the linear cone generated by \(H\) (not necessarily convex) is denoted by \({\text {cone}}(H) {:=}\{s {\tilde{\varvec{v}}} \ :{\tilde{\varvec{v}}} \in H, s \ge 0 \}\). By \(P_{H} :\mathbb {R}^p \rightarrow \mathbb {R}^p\), we denote the Euclidean projection onto the closure of \(H\) if it is well defined, and we use the shortcut \(P_{\varvec{v}} {:=}P_{{{\,\mathrm{{\text {span}}}\,}}(\{\varvec{v}\})}\). We write \({H}^\perp \) for the orthogonal complement of \(H\). The Minkowski difference between \(H\) and \(H'\) is defined by \(H- H' {:=}\{ \varvec{v}_1 -\varvec{v}_2 \ :\varvec{v}_1 \in H, \varvec{v}_2 \in H' \}\), and we use the shortcut \(H- \varvec{v}{:=}H- \{\varvec{v}\}\). For \(\varepsilon > 0\), we denote the covering number of \(H\) at scale \(\varepsilon \) with respect to the Euclidean norm by \(\mathcal {N}(H, \varepsilon )\).

The \(L^q\)-norm of a real-valued random variable a is given by \(\Vert a\Vert _{L^q} {:=}(\mathbb {E}[|a|^q])^{1/q}\). We call a sub-Gaussian if

$$\begin{aligned} \Vert a\Vert _{\psi _2} {:=}\inf \big \{v > 0 \ :\mathbb {E}[\exp (|a|^2 / v^2)] \le 2 \big \} < \infty , \end{aligned}$$

and \(\Vert \cdot \Vert _{\psi _2}\) is called the sub-Gaussian norm. A random vector \(\varvec{a}\) in \(\mathbb {R}^p\) is called sub-Gaussian if \(\Vert \varvec{a}\Vert _{\psi _2} {:=}{\mathrm{sup}}_{\varvec{v}\in \mathbb {S}^{p-1}} \Vert \langle \varvec{a}, \varvec{v}\rangle \Vert _{\psi _2} < \infty \). We say that \(\varvec{a}\) is centered if \(\mathbb {E}[\varvec{a}] = \varvec{0}\), and it is isotropic if \(\mathbb {E}[\varvec{a}\varvec{a}^\mathsf {T}] = \varvec{I}_{p}\). We write \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) if \(\varvec{a}\) is a standard Gaussian random vector in \(\mathbb {R}^p\). For a more detailed introduction to sub-Gaussian random variables and their properties, see [64, Chap. 2 & 3]. Now, let \((\mathcal {H},d)\) be a pseudo-metric space and consider a real-valued stochastic process \(\{a_h\}_{h \in \mathcal {H}}\) on \(\mathcal {H}\). Then, \(\{a_h\}_{h \in \mathcal {H}}\) has sub-Gaussian increments with respect to d if

$$\begin{aligned} \Vert a_h - a_{h'}\Vert _{\psi _2} \le d(h, h') \qquad \text {for all } h, h' \in \mathcal {H}. \end{aligned}$$
(1.4)

We say that a function \(f:\mathbb {R}^p \rightarrow \mathbb {R}^{p'}\) is \(\gamma \)-Lipschitz if it is Lipschitz continuous with respect to the Euclidean metric with Lipschitz constant \(\gamma \ge 0\). The sign of \(v \in \mathbb {R}\) is denoted by \({{\,\mathrm{sign}\,}}(v)\), with the convention that \({{\,\mathrm{sign}\,}}(0) {:=}1\). If \({{\,\mathrm{sign}\,}}(\cdot )\) is applied to a vector, the operation is understood entrywise. The ceiling and floor functions of \(v \in \mathbb {R}\) are denoted by \(\lceil v\rceil \) and \(\lfloor v\rfloor \), respectively.

2 Main Result

As preliminary steps, we introduce the formal model setup in Sect. 2.1, followed by our increment conditions on the observation variable in Sect. 2.2. The main result of Theorem 2 is then stated and discussed in Sect. 2.3, while its proof can be found in Sect. 7.

2.1 Model Setup

The following model assumption fixes the notation for the remainder of this section and specifies the instance of Problem 1 that we will study from now on:

Assumption 1

  (a) Components of the observation model:

    • Measurement vector: a centered, isotropic, sub-Gaussian random vector \(\varvec{a}\in \mathbb {R}^p\) such that \(\Vert \varvec{a}\Vert _{\psi _2} \le L\) for some \(L> 0\).

    • Signal set: a set \(\mathcal {X}\) (not necessarily a subset of \(\mathbb {R}^p\)).

    • Output function: a scalar function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) that may be random (not necessarily independent of \(\varvec{a}\)).

    • Observation variable: \(\tilde{y}(\varvec{x}) {:=}F(\varvec{a}, \varvec{x})\) for \(\varvec{x}\in \mathcal {X}\).

  (b) Components of the measurement process:

    • Measurement ensemble: \(\{(\varvec{a}_i, F_i)\}_{i = 1}^m\) are independent copies of the measurement model \((\varvec{a}, F)\).

    • Observations: \(\tilde{y}_i(\varvec{x}) {:=}F_i(\varvec{a}_i, \varvec{x})\) for \(i = 1, \dots , m\) and \(\varvec{x}\in \mathcal {X}\). The observation vector of \(\varvec{x}\) is denoted by \(\tilde{\varvec{y}}(\varvec{x}) {:=}(\tilde{y}_1(\varvec{x}), \dots , \tilde{y}_m(\varvec{x})) \in \mathbb {R}^m\).

  (c) Components of the recovery procedure:

    • Constraint set: a convex subset \(K\subset \mathbb {R}^p\) (not necessarily bounded).

    • Target function: a map \(T:\mathcal {X}\rightarrow K\), such that \(T\mathcal {X}\subset K\) is bounded.

Part (a) of Assumption 1 establishes the statistical “template” of our measurement model, from which i.i.d. observations are drawn according to part (b). Importantly, this defines an entire class of observation variables \(\{\tilde{y}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\), which corresponds to a real-valued stochastic process. Assumption 1(c) is arguably the most abstract part, but forms a crucial ingredient of our main result: Theorem 2 states an error bound for \(T\varvec{x}\in K\) instead of the actual signal \(\varvec{x}\in \mathcal {X}\) (which might not even belong to \(\mathbb {R}^p\)). The basic linearization strategy behind (\({\mathsf {P}}_{K,\varvec{y}}\)) typically calls for an appropriate transformation of the signal domain. A good example is the special case of Theorem 1, where \(T\) amounts to a rescaling of \(\varvec{x}\) by the model-dependent factor \(\mu \). Such a simple correction is sufficient for most applications considered in Sect. 4; but there exist more complicated situations, such as the variable selection problem in Sect. 4.5, where \(\mathcal {X}\) represents support sets on [p]. At the present level of abstraction, it is useful to think of \(T\varvec{x}\) as a parameterized version of \(\varvec{x}\in \mathcal {X}\) that is compatible with the estimation procedure of (\({\mathsf {P}}_{K,\varvec{y}}\)) and its constraint set \(K\subset \mathbb {R}^p\).

In principle, \(T\) can be arbitrary in Assumption 1, but there often exists a canonical choice that is driven by the size of the following model parameter:

Definition 1

Let Assumption 1 be satisfied. Then, we define the target mismatch of \(\varvec{x}\in \mathcal {X}\) by

$$\begin{aligned} \rho _{}(\varvec{x}) {:=}\big \Vert \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}] - T\varvec{x}\big \Vert _{2}. \end{aligned}$$
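As a simple numerical illustration (our own; the 1-bit model is treated in detail in Sect. 4.1), the following sketch estimates \(\mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}]\) by Monte Carlo for \(\tilde{y}(\varvec{x}) = {{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )\) with \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) and \(\Vert \varvec{x}\Vert _{2} = 1\); choosing \(T\varvec{x}= \sqrt{2/\pi } \cdot \varvec{x}\) then makes the target mismatch vanish up to sampling error.

```python
# Monte Carlo check (our own) of the target mismatch for 1-bit observations
# y~(x) = sign(<a, x>): one has E[y~(x) a] = sqrt(2/pi) * x for ||x||_2 = 1,
# so the choice T x = sqrt(2/pi) * x yields rho(x) = 0.
import numpy as np

rng = np.random.default_rng(0)
p, n = 20, 200_000
x = rng.standard_normal(p); x /= np.linalg.norm(x)

A = rng.standard_normal((n, p))
y_tilde = np.sign(A @ x)
expected = (y_tilde[:, None] * A).mean(axis=0)      # Monte Carlo estimate of E[y~(x) a]
Tx = np.sqrt(2 / np.pi) * x
print(np.linalg.norm(expected - Tx))                # target mismatch rho(x), ~ 0 up to MC error
```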

The meaning of this definition becomes clearer when taking the perspective of statistical learning for a moment. Assuming that \(y_i = \tilde{y}_i(\varvec{x})\), the Lasso estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) can be viewed as an empirical loss minimization problem with “data” \(\{(\varvec{a}_i, y_i)\}_{i = 1}^m\). A central question is then under what conditions its solution can approximate (in a proper sense) the associated expected loss minimization problem (obtained in the infinite sample limit \(m \rightarrow \infty \)):

$$\begin{aligned} \min _{\varvec{z}\in K} \ \mathbb {E}[(\tilde{y}(\varvec{x}) - \langle \varvec{a}, \varvec{z}\rangle )^2]. \end{aligned}$$
(2.1)

It is not hard to see that the only critical point of the objective function in (2.1) is the vector \(\varvec{x}^*{:=}\mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}]\): since \(\varvec{a}\) is isotropic, the gradient of the objective equals \(2(\varvec{z}- \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}])\), which vanishes precisely at \(\varvec{z}= \varvec{x}^*\); and if \(\varvec{x}^*\in K\), this is the global optimum. Therefore, and as its name suggests, \(\rho _{}(\varvec{x})\) measures the mismatch between the target vector \(T\varvec{x}\) and the (global) expected loss minimizer \(\varvec{x}^*\).

In the context of Theorem 2, the target mismatch can be seen as an upper bound for the asymptotic error of estimating \(T\varvec{x}\) via (\({\mathsf {P}}_{K,\varvec{y}}\)). Consequently, a general rule of thumb is to select \(T\varvec{x}\) such that \(\rho _{}(\varvec{x})\) vanishes or becomes sufficiently small. However, simply setting \(T\varvec{x}{:=}\mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}]\) (if contained in \(K\)) is not necessarily consistent with our wish for signal recovery, e.g., in the application in Sect. 4.2. Therefore, we have left \(T\) unspecified in Assumption 1, even though a proper choice will be obvious for all considered examples.

We close this subsection with a short remark about the noise model considered in this work:

Remark 1

(Noise model) We will adopt the adversarial (or worst-case) noise model, which is the common setting in the field of compressed sensing, cf. [21]. More specifically, when interested in recovery of \(\varvec{x}\in \mathcal {X}\), the actual input \(\varvec{y}= (y_1, \dots , y_m)\) for (\({\mathsf {P}}_{K,\varvec{y}}\)) need not exactly correspond to the observation vector \(\tilde{\varvec{y}}(\varvec{x}) = (\tilde{y}_1(\varvec{x}), \dots , \tilde{y}_m(\varvec{x}))\) introduced in Assumption 1(b). Instead, arbitrary perturbations are allowed as long as \(d(\varvec{y}, \tilde{\varvec{y}}(\varvec{x})) \lesssim t\) for an appropriate noise metric \(d(\cdot ,\cdot )\) and a desired accuracy t. In Theorem 1, for instance, we have \(\tilde{y}_i(\varvec{x}) = f(\langle \varvec{a}_i, \varvec{x}\rangle )\) and \(y_i = \tilde{y}_i(\varvec{x}) + \nu _i\), while \(d(\cdot ,\cdot )\) corresponds to the normalized \(\ell ^{2}\)-norm. Thus, our overall goal is to show that, given \(\{(\varvec{a}_i, F_i)\}_{i = 1}^m\), recovery is possible for any \(\varvec{x}\in \mathcal {X}\) and any moderate perturbation of its observation \(\tilde{\varvec{y}}(\varvec{x})\).

In this context, it is worth noting that Assumption 1(b) does not require the variables \(\tilde{y}_i(\varvec{x})\) to be deterministic when conditioned on \(\varvec{a}_i\), since \(F_i\) itself may be random. For example, one could model additive random noise in that way, as common in statistical estimation theory. In principle, such a type of noise is also admissible in our main result, Theorem 2, but it is not entirely compatible with the aforementioned goal of uniform recovery. Indeed, the ensemble \(\{(\varvec{a}_i, F_i)\}_{i = 1}^m\) is only drawn once, so that a single noise realization has to be considered for all \(\varvec{x}\in \mathcal {X}\). We therefore mostly stick to the adversarial perspective; see Remark 4(2) in Sect. 3 for a more detailed comparison of both noise models in the situation of linear measurements. Finally, we point out that randomized output functions are nevertheless very useful in the uniform regime, such as in the design of dithering variables (see Sect. 4.2).

2.2 Analysis Strategy and Increment Conditions

From now on, let Assumption 1 be satisfied. In this subsection, we introduce the key condition for our uniform recovery guarantee, namely that the class of observation variables \(\{\tilde{y}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\), or at least an approximation thereof, has sub-Gaussian increments. To better understand the relevance of this condition, it is insightful to first take a closer look at our basic analysis strategy for the generalized Lasso:

$$\begin{aligned} \min _{\varvec{z}\in K} \ \tfrac{1}{m} \sum _{i = 1}^m (y_i - \langle \varvec{a}_i, \varvec{z}\rangle )^2. \end{aligned}$$
(\({\mathsf {P}}_{K,\varvec{y}}\))

Note that we hide the dependency on the measurement vectors \(\{\varvec{a}_i\}_{i = 1}^m\) when referring to (\({\mathsf {P}}_{K,\varvec{y}}\)), as this will be always clear from the context. At the present stage, the vector \(\varvec{y}= (y_1, \dots , y_m) \in \mathbb {R}^m\) in (\({\mathsf {P}}_{K,\varvec{y}}\)) is unspecified, but taking the viewpoint of our main result, Theorem 2, is already useful: If \(\varvec{x}\in \mathcal {X}\) is to be recovered, then \(\varvec{y}\) is a noisy version of the observation vector \(\tilde{\varvec{y}}(\varvec{x}) = (\tilde{y}_1(\varvec{x}), \dots , \tilde{y}_m(\varvec{x}))\) as defined in Assumption 1(b); see also Remark 1.

We now derive a simple, yet important criterion for an error bound for a (fixed) signal \(\varvec{x}\in \mathcal {X}\). To this end, let \(\bar{\mathcal {L}}_{\varvec{x}}(\varvec{v}) {:=}\tfrac{1}{m} \sum _{i = 1}^m (y_i - \langle \varvec{a}_i, \varvec{v}+ T\varvec{x}\rangle )^2\) denote the empirical loss of \(\varvec{v}\in \mathbb {R}^p\) over \(\varvec{x}\). For the sake of notational convenience, we have fixed the anchor point \(T\varvec{x}\) here, so that (\({\mathsf {P}}_{K,\varvec{y}}\)) is equivalent to minimizing \(\bar{\mathcal {L}}_{\varvec{x}}(\cdot )\) over \(K- T\varvec{x}\). The excess loss of \(\varvec{v}\in \mathbb {R}^p\) over \(\varvec{x}\) is then defined by

$$\begin{aligned} \mathcal {E}_{\varvec{x}}(\varvec{v}) {:=}\bar{\mathcal {L}}_{\varvec{x}}(\varvec{v}) - \bar{\mathcal {L}}_{\varvec{x}}(\varvec{0}). \end{aligned}$$

It measures how much the empirical loss changes when moving from \(T\varvec{x}\) in the direction of \(\varvec{v}\). The following fact is an immediate consequence of the convexity of \(K\) and \(\bar{\mathcal {L}}_{\varvec{x}}\); see Fig. 1 for an illustration.

Fig. 1 Illustration of the localization argument behind Fact 1. We assume that \(\mathcal {E}_{\varvec{x}}(\varvec{v}) > 0\) for all \(\varvec{v}\in K_{\varvec{x},t}\), or equivalently, \(\mathcal {E}_{\varvec{x}}(\varvec{z}- T\varvec{x}) > 0\) for all \(\varvec{z}\in K_{\varvec{x},t} + T\varvec{x}\) (see red arc). Since \(\mathcal {E}_{\varvec{x}}(\varvec{0}) = 0\), the convexity of \(K\) and \(\mathcal {E}_{\varvec{x}}(\cdot )\) implies that a minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) cannot lie outside the dark gray intersection. In other words, we have that \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le t\).

Fact 1

For \(\varvec{x}\in \mathcal {X}\) and a desired reconstruction accuracy \(t > 0\), we set

$$\begin{aligned} K_{\varvec{x},t} {:=}\{\varvec{v}\in K- T\varvec{x}\ :\Vert \varvec{v}\Vert _{2} = t \} = (K- T\varvec{x}) \cap t\mathbb {S}^{p-1}. \end{aligned}$$

If \(\mathcal {E}_{\varvec{x}}(\varvec{v}) > 0\) for all \(\varvec{v}\in K_{\varvec{x},t}\), then every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le t\).

For a fixed accuracy \(t > 0\), Fact 1 implies uniform recovery for all those \(\varvec{x}\in \mathcal {X}\) satisfying \(\inf _{\varvec{v}\in K_{\varvec{x},t}} \mathcal {E}_{\varvec{x}}(\varvec{v}) > 0\). This motivates us to establish a uniform lower bound for the excess loss for all \(\varvec{x}\in \mathcal {X}\) and \(\varvec{v}\in K_{\varvec{x},t}\). In particular, such a task is considerably more difficult than showing a bound on \(K_{\varvec{x},t}\) for a fixed \(\varvec{x}\in \mathcal {X}\), which would be in accordance with the non-uniform results discussed in the second part of Sect. 1.1. To get a better sense of this challenge, let us consider the following basic decomposition of the excess loss:

$$\begin{aligned} \mathcal {E}_{\varvec{x}}(\varvec{v}) = \underbrace{\tfrac{1}{m} \sum _{i=1}^m\langle \varvec{a}_i, \varvec{v}\rangle ^2}_{{=:}\mathcal {Q}_{}(\varvec{v})} - \underbrace{\tfrac{2}{m} \sum _{i=1}^m (y_i - \tilde{y}_i(\varvec{x})) \langle \varvec{a}_i, \varvec{v}\rangle }_{{=:}\mathcal {N}_{\varvec{x}}(\varvec{v})} + \underbrace{\tfrac{2}{m} \sum _{i=1}^m(\langle \varvec{a}_i, T\varvec{x}\rangle - \tilde{y}_i(\varvec{x})) \langle \varvec{a}_i, \varvec{v}\rangle }_{{=:}\mathcal {M}_{\varvec{x}}(\varvec{v})}. \end{aligned}$$
(2.2)

The quadratic term \(\mathcal {Q}_{}(\varvec{v})\) and noise term \(\mathcal {N}_{\varvec{x}}(\varvec{v})\) are rather unproblematic and can be controlled by a recent matrix deviation inequality (see Step 1 and Step 2 in Sect. 7). Much more intricate is the multiplier term \(\mathcal {M}_{\varvec{x}}(\varvec{v})\), since the underlying multiplier variable \(\xi (\varvec{x}) {:=}\langle \varvec{a}, T\varvec{x}\rangle - \tilde{y}(\varvec{x})\) depends on \(\varvec{x}\), so that we actually have to deal with an empirical product process. A major difficulty is that known concentration results for such product processes do not apply directly, since the class \(\{\xi (\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) need not have sub-Gaussian increments with respect to an appropriate (pseudo-)metric. For example, this would happen if we dropped the Lipschitz assumption on \(f\) in Theorem 1. The key idea of our approach is to approximate \(\tilde{y}(\varvec{x})\) by a more “regular” observation variable \(\tilde{y}_t(\varvec{x})\) in such a way that the resulting multiplier variable and the approximation error both have sub-Gaussian increments. This is made precise by the following assumption:
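Since (2.2) is a purely algebraic identity, it can be verified directly in code; the following sketch (our own, using a hypothetical 1-bit model with \(T\varvec{x}= \sqrt{2/\pi }\cdot \varvec{x}\)) evaluates the three terms for a random direction \(\varvec{v}\) and confirms that they add up to the excess loss.

```python
# Numerical check (our own) of the excess-loss decomposition (2.2) for a
# hypothetical 1-bit model y~(x) = sign(<a, x>) with T x = sqrt(2/pi) * x.
import numpy as np

rng = np.random.default_rng(0)
p, m = 30, 200
x = rng.standard_normal(p); x /= np.linalg.norm(x)
Tx = np.sqrt(2 / np.pi) * x

A = rng.standard_normal((m, p))
y_tilde = np.sign(A @ x)                            # noiseless observations y~_i(x)
y = y_tilde.copy(); y[:5] *= -1                     # a few flipped bits as (adversarial) noise
v = rng.standard_normal(p); v *= 0.3 / np.linalg.norm(v)   # direction with ||v||_2 = t = 0.3

def empirical_loss(w):                              # empirical loss, anchored at Tx
    return np.mean((y - A @ (w + Tx)) ** 2)

E = empirical_loss(v) - empirical_loss(np.zeros(p)) # excess loss E_x(v)
Q = np.mean((A @ v) ** 2)                           # quadratic term
N = 2 * np.mean((y - y_tilde) * (A @ v))            # noise term
M = 2 * np.mean((A @ Tx - y_tilde) * (A @ v))       # multiplier term
print(np.isclose(E, Q - N + M))                     # True: (2.2) holds exactly
```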

Assumption 2

We define a pseudo-metric on \(\mathcal {X}\) by \(d_T(\varvec{x},\varvec{x}') {:=}\Vert T\varvec{x}- T\varvec{x}'\Vert _{2}\) for \(\varvec{x},\varvec{x}' \in \mathcal {X}\). For \(t \ge 0\), assume that there exists a class of observation variables \(\{\tilde{y}_t(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) such that the following properties hold:

  (a) Approximation error: Setting \(\varepsilon _t(\varvec{x}) {:=}|\tilde{y}(\varvec{x}) - \tilde{y}_t(\varvec{x})|\), we assume that \(\mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |] \le \tfrac{t}{64}\) for all \(\varvec{x}\in \mathcal {X}\) and \(\varvec{z}\in \mathbb {S}^{p-1}\).

  (b) Multiplier increments: Set \(\xi _t(\varvec{x}) {:=}\langle \varvec{a}, T\varvec{x}\rangle - \tilde{y}_t(\varvec{x})\) for \(\varvec{x}\in \mathcal {X}\). We assume that there exist \(r\ge 0\) and \(L_{t}\ge 0\) such that

    $$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2} \le L_{t}\cdot d_T(\varvec{x},\varvec{x}') \quad {\text {and}}\quad \Vert \xi _t(\varvec{x})\Vert _{\psi _2} \le r\qquad \text {for all } \varvec{x}, \varvec{x}' \in \mathcal {X}. \end{aligned}$$
  (c) Error increments: We assume that there exist \(\hat{r}\ge 0\) and \(\hat{L}_{t}\ge 0\) such that

    $$\begin{aligned} \Vert \varepsilon _t(\varvec{x}) - \varepsilon _t(\varvec{x}')\Vert _{\psi _2} \le \hat{L}_{t}\cdot d_T(\varvec{x},\varvec{x}') \quad {\text {and}}\quad \Vert \varepsilon _t(\varvec{x})\Vert _{\psi _2} \le \hat{r}\qquad \text {for all } \varvec{x}, \varvec{x}' \in \mathcal {X}. \end{aligned}$$

Assumption 2(b) and (c) imply that \(\{\xi _t(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) and \(\{\varepsilon _t(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) have sub-Gaussian increments with respect to \(L_{t}\cdot d_T\) and \(\hat{L}_{t}\cdot d_T\), respectively. Similarly, \(r\) and \(\hat{r}\) bound the (sub-Gaussian) diameter of the respective classes. A convenient interpretation of Assumption 2 is as follows: Choose an approximation \(\tilde{y}_t(\varvec{x})\) for every \(\varvec{x}\in \mathcal {X}\) such that the error does not become too large in the sense of part (a); at the same time, ensure that the increment conditions of part (b) and (c) are satisfied such that the parameters \(L_{t}\) and \(\hat{L}_{t}\) do not grow too fast as t becomes smaller. A remarkable conclusion from our applications to quantized compressed sensing in Sects. 4.1–4.3 is that this strategy can even succeed for observation variables with discontinuous output functions.

Finally, we emphasize that the approximation error \(\varepsilon _t(\varvec{x})\) includes taking the absolute value, which is crucial to our proof (see Step 3 in Sect. 7). In particular, Assumption 2 does not necessarily imply that the original class \(\{\xi (\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) has sub-Gaussian increments with respect to \({\tilde{L}}\cdot d_T\) for some \({\tilde{L}} \ge 0\). On the other hand, it is certainly possible that \(\{\xi (\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) already has sub-Gaussian increments, such as in Theorem 1. In this case, we can simply choose \(\tilde{y}_t(\varvec{x}) {:=}\tilde{y}(\varvec{x})\), so that \(\varepsilon _t(\varvec{x}) = 0\) and Assumption 2(a) and (c) are trivially fulfilled.

2.3 Uniform Recovery Guarantee

In order to formulate the main result of this work, we require the notion of Gaussian mean width. This geometric parameter has proven to be a useful complexity measure in high-dimensional signal recovery, e.g., see [2, 12, 43, 54, 57] for pioneering works in that direction. Our approach is no exception and makes use of the following localized version.

Definition 2

Let \(H\subset \mathbb {R}^p\) and \(\varvec{g}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). The (Gaussian) mean width of \(H\) is given by

$$\begin{aligned} w_{}(H) {:=}\mathbb {E}\Big [\mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in H} \langle \varvec{g}, \varvec{v}\rangle \Big ]. \end{aligned}$$

For \(t \ge 0\), we define the local mean width of \(H\) (at scale t) by

$$\begin{aligned} w_{t}(H) {:=}{\left\{ \begin{array}{ll} w_{}(\tfrac{1}{t}H\cap \mathbb {S}^{p-1}), &{} t > 0, \\ w_{}({\text {cone}}(H) \cap \mathbb {S}^{p-1}), &{} t = 0. \end{array}\right. } \end{aligned}$$
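For intuition, the (global) mean width can be estimated by Monte Carlo; the sketch below (our own) does so for the hypothetical set \(H\) of s-sparse unit vectors in \(\mathbb {R}^p\), for which \({\mathrm{sup}}_{\varvec{v}\in H} \langle \varvec{g}, \varvec{v}\rangle \) equals the \(\ell ^{2}\)-norm of the s largest entries of \(\varvec{g}\) in magnitude, so that \(w_{}^2(H)\) is of the order \(s \log (2p/s)\).

```python
# Monte Carlo estimate (our own) of the Gaussian mean width from Definition 2
# for the hypothetical set H = {s-sparse unit vectors in R^p}: here
# sup_{v in H} <g, v> is the l2-norm of the s largest entries of g in magnitude.
import numpy as np

def mean_width_sparse(p, s, n_draws=2000, seed=0):
    g = np.random.default_rng(seed).standard_normal((n_draws, p))
    top = np.sort(np.abs(g), axis=1)[:, -s:]        # s largest magnitudes per draw
    return np.mean(np.linalg.norm(top, axis=1))     # estimate of E[ sup_{v in H} <g, v> ]

p, s = 1000, 10
w = mean_width_sparse(p, s)
print(w ** 2, s * np.log(2 * p / s))                # same order of magnitude (up to constants)
```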

As a final preparatory step, we introduce two expressions that capture the size of measurement noise: For \(m_0\in \{0,1,\dots ,m\}\) and \(\varvec{w}= (w_1, \dots , w_m) \in \mathbb {R}^m\), let

$$\begin{aligned} \Vert \varvec{w}\Vert _{[m_0]} {:=}\Big ( \sum _{i = 1}^{m_0} |w_i^*|^2 \Big )^{1/2} \quad \text { and } \quad \sigma _{m_0}(\varvec{w})_{2} {:=}\Big ( \sum _{i = m_0+1}^{m} |w_i^*|^2 \Big )^{1/2}, \end{aligned}$$

where \((w_1^*,\dots ,w_m^*)\) is the non-increasing rearrangement of \((|w_1|, \dots , |w_m|)\). Obviously, we have that \(\Vert \varvec{w}\Vert _{[m_0]}^2 + \sigma _{m_0}(\varvec{w})_{2}^2 = \Vert \varvec{w}\Vert _{2}^2\), and in particular, \(\Vert \varvec{w}\Vert _{[0]} = 0\) and \(\sigma _{0}(\varvec{w})_{2} = \Vert \varvec{w}\Vert _{2}\). Moreover, note that \(\Vert \cdot \Vert _{[m_0]}\) simply corresponds to the \(\ell ^{2}\)-norm of the \(m_0\)-largest entries, while \(\sigma _{m_0}(\cdot )_{2}\) is commonly known as the \(\ell ^{2}\)-error of the best \(m_0\)-term approximation.
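In code, both noise functionals amount to sorting the noise magnitudes; the following helper (our own) computes them.

```python
# Small helper (our own) for the two noise functionals used in Theorem 2:
# ||w||_[m0], the l2-norm of the m0 largest entries in magnitude, and
# sigma_m0(w)_2, the l2-error of the best m0-term approximation.
import numpy as np

def partial_norms(w, m0):
    mags = np.sort(np.abs(w))[::-1]                 # non-increasing rearrangement
    head = np.linalg.norm(mags[:m0])                # ||w||_[m0]
    tail = np.linalg.norm(mags[m0:])                # sigma_m0(w)_2
    return head, tail

w = np.array([0.1, -3.0, 0.2, 0.05, 2.5])
print(partial_norms(w, 2))                          # gross outliers vs. remaining noise
```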

We are now ready to state our main recovery guarantee, which forms the basis of all applications presented in Sect. 4; see Sect. 7 for a complete proof.

Theorem 2

There exist universal constants \(c, C > 0\) for which the following holds.

Let Assumptions 1 and 2 be satisfied for a fixed accuracy \(t \ge 0\), and let \(m_0\in \{0, 1, \dots , m\}\). For \(\varDelta >0\), \(u\ge 1\), and \(u_0 \ge \sqrt{m_0\log (em/m_0)}\), we assume that

$$\begin{aligned} m&\ge C \cdot L^2 \cdot \max \Big \{ \log L\cdot \big (w_{t}^2(K- T\mathcal {X})+ u^2 \big ), \nonumber \\&\quad L^2\varDelta ^2 \cdot \big (w_{t}^2(K- T\mathcal {X})+ u_0^2 \big ),\nonumber \\&\quad t^{-2} (r^2+\hat{r}^2) \cdot \big (w_{t}^2(K- T\mathcal {X})+ u^2 \big ) + t^{-2} (L_{t}^2+ \hat{L}_{t}^2) \cdot w_{}^2(T\mathcal {X}) \Big \}. \end{aligned}$$
(2.3)

Then with probability at least \(1 - \exp (-c u^2) - \exp (-c u_0^2)\) on the random draw of \(\{(\varvec{a}_i, F_i)\}_{i = 1}^m\), the following holds uniformly for every \(\varvec{x}\in \mathcal {X}\) with \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\): Let \(\varvec{y}\in \mathbb {R}^m\) be any input vector such that

$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[m_0]}\le \varDelta t \quad \text { and } \quad \tfrac{1}{\sqrt{m}}\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2}\le \tfrac{t}{20}. \end{aligned}$$
(2.4)

Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le t\).

The constraints of (2.4) imply that the estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) is robust against (adversarial) perturbations and outliers in the observation vector \(\tilde{\varvec{y}}(\varvec{x})\). Here, a central role is played by the fine-tuning parameter \(m_0\), controlling which coefficients of the noise vector \(\varvec{y}- \tilde{\varvec{y}}(\varvec{x})\) are captured by the first and second condition in (2.4), respectively. Enlarging \(m_0\) helps to suppress “gross” outliers (that may appear in the \(m_0\) largest coefficients of \(\varvec{y}- \tilde{\varvec{y}}(\varvec{x})\) in magnitude). Indeed, by adjusting the free parameter \(\varDelta \), the first condition of (2.4) can be made less restrictive than the second one (which captures the remaining coefficients of \(\varvec{y}- \tilde{\varvec{y}}(\varvec{x})\)). On the other hand, choosing \(m_0\) and/or \(\varDelta \) too large may have a negative effect on the sample complexity in the second branch of (2.3). The benefit of permitting outliers will become clearer in the context of quantization, where the noise is due to wrong bits in \(\varvec{y}\) instead of real-valued perturbations (see Sects. 4.1–4.3). Note that the most common noise model is obtained for \(m_0= 0\): (2.4) then simply corresponds to the baseline condition \(\tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{2} \le \tfrac{t}{20}\), which is consistent with standard noise bounds from the literature. We refer to Remark 4(2) in the next section for a more detailed comparison of adversarial and statistical noise, as well as aspects of optimality.

The second important constraint of Theorem 2 is that it does not apply to those \(\varvec{x}\in \mathcal {X}\) with \(\rho _{}(\varvec{x}) > \tfrac{t}{32}\). Hence, the target mismatch \(\rho _{}(\varvec{x})\) can be seen as an upper bound for the asymptotic error when estimating \(T\varvec{x}\) via (\({\mathsf {P}}_{K,\varvec{y}}\)). In particular, the above error bound does not allow for arbitrarily high precision unless \(\rho _{}(\varvec{x}) = 0\).

Let us now turn to the key assumption (2.3), which relates the number of measurements m to the desired accuracy t. The above formulation views m as a function of t, specifying how fast m grows if t decreases. The inverse relationship is also of interest, since it describes the reconstruction error as a function of m. However, this can only be made explicit in specific situations where the parameters at the right-hand side of (2.3) are known or can be well estimated. Of special importance in that respect are the increment parameters \(L_{t}\) and \(\hat{L}_{t}\) from Assumption 2, whose size has a significant impact on the required (over-)sampling rate. For 1-bit observations as studied in Sect. 4.1, for example, we achieve \(L_{t}, \hat{L}_{t}\lesssim t^{-1}\), so that (2.3) would yield an error decay rate of \(O(m^{-1/4})\)—or conversely, an oversampling rate of \(O(t^{-4})\). Remarkably, we will derive a corollary of Theorem 2 in Sect. 5 that can bypass the restrictions of Assumption 2 at the price of a stronger (local) stability condition on the observation model. This allows us to combine our approach with known embedding results from the literature and thereby to obtain near-optimal decay rates in special cases.
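In more detail, inserting \(L_{t}, \hat{L}_{t}\lesssim t^{-1}\) into the third branch of (2.3) and treating \(L\), \(r\), \(\hat{r}\), \(u\), and \(u_0\) as constants leads to a condition of the form

$$\begin{aligned} m \gtrsim t^{-2} \cdot w_{t}^2(K- T\mathcal {X}) + t^{-4} \cdot w_{}^2(T\mathcal {X}), \end{aligned}$$

so that for small t, the second summand dominates and the achievable accuracy indeed scales as \(t \lesssim (w_{}^2(T\mathcal {X}) / m)^{1/4}\).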

The dependence of (2.3) on the complexity parameters \(w_{t}(K- T\mathcal {X})\) and \(w_{}(T\mathcal {X})\) is quite natural. Similar expressions have already appeared in several of the articles discussed in Sect. 1.1, e.g., see [17, 49, 52]. A detailed discussion, including concrete examples and possible simplifications, can be found in Sect. 3. A distinctive feature of Theorem 2 is that the considered version of local mean width \(w_{t}(K- T\mathcal {X})\) “compares” the parameter vectors in \(K\) only to those in the transformed signal set \(T\mathcal {X}\subset K\). This refinement of the common quantity \(w_{t}(K- K)\) can lead to improved guarantees in certain scenarios, namely when \(T\mathcal {X}\) is much smaller than \(K\) (see Example 1 for an illustration of this aspect).

Remark 2

  (1) Inversion of \(T\). In principle, the target function \(T\) neither has to be injective nor does it have to be explicitly known to solve (\({\mathsf {P}}_{K,\varvec{y}}\)). However, in order to obtain a practicable statement from Theorem 2, the actual signal \(\varvec{x}\in \mathcal {X}\) should be extractable from an approximation \(\hat{\varvec{z}}\) of \(T\varvec{x}\). An appropriate implementation of \(T\) can differ considerably from situation to situation. In the case of single-index models (see Theorem 1), for example, this would involve a rescaling of \(\hat{\varvec{z}}\), which requires a (rough) knowledge of the nonlinearity \(f\). In contrast, for variable selection (see Sect. 4.5), it can be sufficient to perform a simple hard-thresholding step on \(\hat{\varvec{z}}\) to obtain a good estimate of the underlying support.

  (2) Optimality. The best possible error decay rate that can result from Theorem 2 is \(O(m^{-1/2})\), provided that \(L_{t}, \hat{L}_{t}\lesssim 1\). This rate cannot be improved in general, or in other words, the exponent \(-\tfrac{1}{2}\) is not an artifact of our proof but corresponds to a fundamental statistical barrier (see [52, Sec. 4]). However, it is possible to break through this barrier in specific model setups, e.g., noiseless 1-bit observations [34, 36]. Such superior rates are usually not achieved by (\({\mathsf {P}}_{K,\varvec{y}}\)), but may require a more sophisticated estimator instead.

Remark 3

(Possible extensions) For the sake of clarity, we did not present the most general version of Theorem 2 that we could have derived. There are several variants and generalizations that we expect to be practicable:

  (1) Anisotropic measurements. Based on an observation from [51, Rmk. 1.7] and [22, Sec. II.D], the isotropy of the measurement vectors in Assumption 1(a) can be relaxed to any positive definite covariance matrix \(\varSigma \in \mathbb {R}^{p \times p}\). To see this, let us assume that \(\varvec{a}\) is centered and sub-Gaussian with \(\mathbb {E}[\varvec{a}\varvec{a}^\mathsf {T}] = \varSigma \). Then, we may write \(\varvec{a}= \sqrt{\varSigma } {\bar{\varvec{a}}}\) where \({\bar{\varvec{a}}}\) is isotropic, centered, and sub-Gaussian. If \({\bar{\varvec{a}}}_1, \dots , {\bar{\varvec{a}}}_m\) are i.i.d. copies of \({\bar{\varvec{a}}}\), a simple reformulation of (\({\mathsf {P}}_{K,\varvec{y}}\)) yields

    $$\begin{aligned} \mathop {{{\,\mathrm{argmin}\,}}}\limits _{\varvec{z}\in K} \ \tfrac{1}{m} \sum _{i = 1}^m (y_i - \langle \varvec{a}_i, \varvec{z}\rangle )^2 = \sqrt{\varSigma }^{-1} \cdot \mathop {{{\,\mathrm{argmin}\,}}}\limits _{{\bar{\varvec{z}}} \in \sqrt{\varSigma }K} \ \tfrac{1}{m} \sum _{i = 1}^m (y_i - \langle {\bar{\varvec{a}}}_i, {\bar{\varvec{z}}}\rangle )^2. \end{aligned}$$

    In other words, (\({\mathsf {P}}_{K,\varvec{y}}\)) is equivalent to a modified estimator that operates with an isotropic measurement ensemble. Therefore, we may apply Theorem 2 directly to the latter one, when replacing \(K\) and \(T(\cdot )\) by \(\sqrt{\varSigma }K\) and \(\sqrt{\varSigma } T(\cdot )\) in Assumption 1, respectively. Note that these adaptations are of a purely theoretical nature, and in practice, no explicit knowledge of \(\varSigma \) is required. In particular, this procedure still implies an error bound for minimizers \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)):

    $$\begin{aligned} \Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le \Vert \sqrt{\varSigma }^{-1}\Vert _{2\rightarrow 2} \cdot \Vert \sqrt{\varSigma }\hat{\varvec{z}}- \sqrt{\varSigma }T\varvec{x}\Vert _{2} \le \Vert \sqrt{\varSigma }^{-1}\Vert _{2\rightarrow 2} \cdot t. \end{aligned}$$

    Similarly, one can conveniently control the adapted complexity parameters (see [51, Rmk. 1.7], [22, Sec. II.D]):

    $$\begin{aligned} w_{t}(\sqrt{\varSigma }K- \sqrt{\varSigma }T\mathcal {X})&\le \max \big \{1, \Vert \sqrt{\varSigma }^{-1}\Vert _{2\rightarrow 2}\big \} \cdot \Vert \sqrt{\varSigma }\Vert _{2\rightarrow 2} \cdot w_{}(\tfrac{1}{t}(K- T\mathcal {X}) \cap \mathbb {B}_2^p), \\ w_{}(\sqrt{\varSigma }T\mathcal {X})&\lesssim \Vert \sqrt{\varSigma }\Vert _{2\rightarrow 2} \cdot w_{}(T\mathcal {X}). \end{aligned}$$

    Overall, we can conclude that accurate (uniform) recovery is also possible with anisotropic measurements, as long as the (unknown) covariance matrix is well conditioned.

  (2) Even output functions. There exist certain types of nonlinear models that are not directly compatible with the estimator (\({\mathsf {P}}_{K,\varvec{y}}\)). For example, if \(f:\mathbb {R}\rightarrow \mathbb {R}\) is an even function in Theorem 1, it turns out that \(\mu = \mathbb {E}_{g\sim \mathsf {N}(0, 1)}[f(g)g] = 0\). In other words, (\({\mathsf {P}}_{K,\varvec{y}}\)) would simply approximate the zero vector, which is clearly not what one is aiming at. Notably, this scenario includes the important phase-retrieval problem \(f(\cdot ) = |\cdot |\). Fortunately, the linearization idea behind (\({\mathsf {P}}_{K,\varvec{y}}\)) is still applicable in these situations, when combined with the phase-lifting trick [9]:

    $$\begin{aligned} \min _{\varvec{Z}\in K} \ \tfrac{1}{m} \sum _{i = 1}^m (y_i - \langle \varvec{a}_i\varvec{a}_i^\mathsf {T}, \varvec{Z}\rangle _F)^2, \end{aligned}$$
    (\({\mathsf {P}}_{K,\varvec{y}}^{\text {lifted}}\))

    where \(\langle \cdot , \cdot \rangle _F\) denotes the Hilbert–Schmidt inner product and \(K\supset \{\varvec{x}\varvec{x}^\mathsf {T}\ :\varvec{x}\in \mathcal {X}\}\) is a convex subset of the positive semi-definite cone in \(\mathbb {R}^{p \times p}\). The recovery performance of (\({\mathsf {P}}_{K,\varvec{y}}^{\text {lifted}}\)) for even output functions has been recently analyzed by the first author in [25, Sec. 3.5], inspired by earlier findings of Thrampoulidis and Rawat [61]. A key challenge in this regard is that the lifted measurement vectors/matrices \(\varvec{a}_i\varvec{a}_i^\mathsf {T}- \mathbb {E}[\varvec{a}_i\varvec{a}_i^\mathsf {T}]\) are heavier tailed than \(\varvec{a}_i\), which in turn has required deeper insights from generic chaining theory. We expect that our results are extendable in this direction, but with a technical effort that would go beyond the scope of this work.

  (3) Different metrics. An extension that is specific to uniform recovery concerns the choice of pseudo-metric in Assumption 2. In principle, \(d_T\) could be replaced by an arbitrary pseudo-metric d on the signal set \(\mathcal {X}\). The claim and proof of Theorem 2 would still hold true if the (global) complexity of \(\mathcal {X}\) is captured by \(\gamma _2(\mathcal {X}, d)\) instead of \(w_{}(T\mathcal {X})\); see Step 3 in Sect. 7. While this would make our approach slightly more flexible, we note that Talagrand’s \(\gamma _2\)-functional is difficult to control in general. Apart from that, the \(\ell ^{2}\)-norm as error metric in Theorem 2 could be replaced by an arbitrary semi-norm \(\Vert \cdot \Vert \). Specifically, this step would lead to an error bound in terms of \(\Vert \cdot \Vert \) in Fact 1, but it also entails an adaptation of the spherical intersection in the local mean width (see Definition 2). Finally, different convex loss functions are feasible for (\({\mathsf {P}}_{K,\varvec{y}}\)) as well [22], although such an extension would require several additional technical assumptions.

3 (Non-)Uniformity and Signal Complexity

In this section, we highlight important aspects that are specifically attributable to uniform recovery, compared to existing non-uniform results. This includes a more detailed discussion of the local mean width \(w_{t}(K- T\mathcal {X})\), which appears as the central complexity parameter in our approach.

Except for the inclusion \(T\mathcal {X}\subset K\), Theorem 2 does not impose any restrictions on the signal set \(\mathcal {X}\). If \(T\) is the identity function, the choice of \(\mathcal {X}\) may range from a singleton \(\mathcal {X}= \{\varvec{x}\}\) to the entire constraint set \(\mathcal {X}= K\). The latter scenario corresponds to uniform recovery of all signals in \(K\), while the former means that one is interested in a specific \(\varvec{x}\in K\)—which is nothing other than non-uniform recovery. In particular, due to \(w_{}(\{\varvec{x}\}) = 0\), Theorem 2 is consistent with previously known non-uniform guarantees, e.g., see [23, Thm. 3.6] and [51, Thm. 1.9]. Our main result therefore also indicates what additional expenses may arise in the case of uniform recovery.

Let us first focus on the role of nonlinear observations, whose impact is reflected by the third branch of (2.3) in Theorem 2. The key quantities in this respect are the increment parameters \(L_{t}\) and \(\hat{L}_{t}\). Their size strongly depends on the type of output function \(F\) considered in Assumption 1. For instance, a jump discontinuity in \(F\) can cause a weaker oversampling rate, in the sense that \(L_{t}, \hat{L}_{t}\lesssim t^{-1}\); we refer to Sects. 4.1–4.4 for concrete examples and Sect. 5 for a possible improvement. In the non-uniform case (\(\mathcal {X}= \{\varvec{x}\}\)), on the other hand, this issue becomes irrelevant because one can simply set \(L_{t}= \hat{L}_{t}= 0\) (since Assumption 2(b) and (c) need only be satisfied for a single point). Thus, we can draw the following informal conclusion:

The transition to uniform recovery with nonlinear output functions may result in a worse oversampling rate (with respect to t).

The second foundational aspect of uniformity concerns the geometric complexity of the constraint set \(K\) and its interplay with the actual signal set \(\mathcal {X}\). To keep our exposition as clear as possible, we restrict ourselves to the situation of linear observations (though most arguments naturally carry over to nonlinear models). Applying Theorem 2 to this special case leads to the following recovery guarantee (see Sect. 10 for a proof). Note that we intentionally allow for two different types of noise here, based on the adversarial (\(\sim \nu _i\)) and statistical model (\(\sim \tau _i\)); see Remark 4(2) for further discussion.

Corollary 1

There exist universal constants \(c, C > 0\) for which the following holds.

Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic, sub-Gaussian random vector \(\varvec{a}\in \mathbb {R}^p\) with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). Let \(\tau _1, \dots , \tau _m\) be independent copies of \(\tau \sim \mathsf {N}(0, \sigma ^2)\) for some \(\sigma \ge 0\) (also independent of \(\{\varvec{a}_i\}_{i=1}^m\)). Moreover, let \(\mathcal {X}\subset \mathbb {R}^p\) be a bounded subset and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\mathcal {X}\subset K\). For \(u\ge 1\) and \(t \ge 0\), we assume that

$$\begin{aligned} m \ge C \cdot L^2 \cdot (\log L+ t^{-2} \sigma ^2 ) \cdot \big (w_{t}^2(K- \mathcal {X}) + u^2 \big ). \end{aligned}$$
(3.1)

Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{(\varvec{a}_i, \tau _i)\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}= (y_1, \dots , y_m) \in \mathbb {R}^m\) be given by

$$\begin{aligned} y_i = \langle \varvec{a}_i, \varvec{x}\rangle + \tau _i + \nu _i, \quad i = 1, \dots , m, \end{aligned}$$

such that \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \varvec{x}\Vert _{2} \le t\).

Except for the \(\sigma \)-dependent summand in (3.1), the above statement does not involve any oversampling factors and it turns into an exact recovery guarantee for \(t = 0\) and \(\sigma = 0\). Although Corollary 1 bears resemblance to standard results from compressed sensing, it comes with a distinctive feature: The needed number of measurements in (3.1) is determined by the local mean width \(w_{t}(K- \mathcal {X})\). As such, this parameter is quite implicit, so that an informative (upper) bound is required for any specific choice of \(K\) and \(\mathcal {X}\). The remainder of this section is devoted to this quest.

Arguably the most popular low-complexity model is sparsity in conjunction with an \(\ell ^{1}\)-relaxation. In our context, this corresponds to the following scenario:

$$\begin{aligned} K= R\mathbb {B}_1^p \quad \text {and} \quad \mathcal {X}= \{\varvec{x}\in \mathbb {R}^p \ :\Vert \varvec{x}\Vert _{0} \le s, \Vert \varvec{x}\Vert _{1} = R \}, \end{aligned}$$
(3.2)

where \(s \in [p]\) and \(R > 0\); note that the constraint \(\Vert \varvec{x}\Vert _{1} = R\) is a specific tuning condition for the estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) that could be further relaxed (see Remark 4(1)). In this special case, one can derive a convenient bound for the local mean width:

$$\begin{aligned} w_{t}(K- \mathcal {X}) \le w_{0}(K- \mathcal {X}) \lesssim \sqrt{s \cdot \log (2p/s)}. \end{aligned}$$
(3.3)

Remarkably, this observation dates back to one of the earliest applications of Gordon’s escape through the mesh theorem [30] to signal recovery [54]; see Sect. 10 for a proof of (3.3). A combination of (3.3) and Corollary 1 ensures that uniform (exact) reconstruction is feasible with \(O(s \cdot \log (2p/s))\) Gaussian measurements. Therefore, our approach is indeed well in line with standard sparse recovery guarantees in compressed sensing, which are typically based on restricted isometry (cf. [21]).
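
To make this concrete, the following minimal Python sketch (NumPy only; the solver, function names, and parameter choices are ours and purely illustrative, not part of the formal results) implements the constrained least-squares program behind (\({\mathsf {P}}_{K,\varvec{y}}\)) for \(K= R\mathbb {B}_1^p\) by projected gradient descent and tests it on noiseless Gaussian measurements with \(m \asymp s \cdot \log (2p/s)\), in the spirit of Corollary 1 combined with (3.3). The reported error should be close to zero, up to the accuracy of the first-order solver.

```python
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection onto the l1-ball of radius R (Duchi et al. 2008)."""
    if np.sum(np.abs(v)) <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - R) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - R) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def generalized_lasso_l1(A, y, R, n_iter=5000):
    """Projected gradient descent for min_z (1/2m)*||A z - y||_2^2 s.t. ||z||_1 <= R."""
    m, p = A.shape
    step = m / np.linalg.norm(A, 2) ** 2          # inverse Lipschitz constant of the gradient
    z = np.zeros(p)
    for _ in range(n_iter):
        z = project_l1_ball(z - step * (A.T @ (A @ z - y)) / m, R)
    return z

rng = np.random.default_rng(0)
p, s = 500, 10
m = int(4 * s * np.log(2 * p / s))                # m ~ s * log(2p/s), cf. (3.3)
x = np.zeros(p)
x[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
A = rng.standard_normal((m, p))                   # Gaussian measurement vectors
y = A @ x                                         # noiseless linear observations
z_hat = generalized_lasso_l1(A, y, R=np.sum(np.abs(x)))   # tuning: R = ||x||_1
print("relative l2 error:", np.linalg.norm(z_hat - x) / np.linalg.norm(x))
```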

Beyond this classical example, the situation becomes considerably more challenging and forms an important research subject in its own right. A particular obstacle due to uniformity is that \(w_{t}(K- \mathcal {X})\) heavily depends on the size and shape of the signal set \(\mathcal {X}\). On a technical level, we have to deal with a (possibly uncountable) union of spherical intersections:

$$\begin{aligned} w_{t}(K- \mathcal {X}) = w_{}(\textstyle \bigcup _{\varvec{x}\in \mathcal {X}}[\tfrac{1}{t}(K- \varvec{x}) \cap \mathbb {S}^{p-1}]). \end{aligned}$$

Fortunately, this cumbersome expression can be controlled through a much more convenient upper bound, as shown in the following proposition. To the best of our knowledge, this is a new result, which could be of independent interest. Its proof can be found in Sect. 10 and is based on a covering argument.

Proposition 1

Let \(K, \mathcal {X}\subset \mathbb {R}^p\) and \(t > 0\). Then,

$$\begin{aligned} w_{t}(K- \mathcal {X}) \lesssim \mathop {{\mathrm{sup}}}\limits _{\varvec{x}\in \mathcal {X}} \ \underbrace{w_{}(\tfrac{1}{2t}(K- \varvec{x}) \cap \mathbb {B}_2^p)}_{{=:}\tilde{w}_{t}(K- \varvec{x})} {} + {} t^{-1} \cdot w_{}(\mathcal {X}). \end{aligned}$$
(3.4)

Moreover, for \(\varvec{x}\in \mathcal {X}\) and \(K\) convex, we have that \(\tilde{w}_{t}(K- \varvec{x}) \lesssim w_{0}(K- {\tilde{\varvec{x}}}) + 1\) for every \({\tilde{\varvec{x}}} \in K\) with \(\Vert \varvec{x}- {\tilde{\varvec{x}}}\Vert _{2} \le t\).

The upper bound in (3.4) brings a significant simplification: The first term is fully localized, measuring the complexity of \(K\) with respect to each individual point \(\varvec{x}\in \mathcal {X}\), whereas the second term accounts for the global size of \(\mathcal {X}\) (independently of the constraint set \(K\)). In other words, the effect of \(K\) and \(\mathcal {X}\) is now decoupled. A second notable fact is that the local mean width \(\tilde{w}_{t}(K- \varvec{x})\) and its conic counterpart \(w_{0}(K- \varvec{x})\) are well-known complexity parameters from non-uniform recovery results, e.g., see [2, 12, 23, 51, 63]. Hence, Proposition 1 allows us to transfer any corresponding bound from the literature to the uniform regime. The presence of the global mean width \(w_{}(\mathcal {X})\) as an additional expense for uniformity appears natural to us. However, this expression also entails an oversampling factor of \(t^{-1}\), which prevents perfect reconstruction (when applied to Corollary 1). Given the \(\ell ^{1}\)-special case in (3.3), we suspect that this is an artifact of our proof, and it remains an open question whether such a factor can be removed in general. Our overall conclusion is as follows:

Regarding signal complexity, the transition to uniform recovery requires control over the total size of the set \(\mathcal {X}\), measured by \(w_{}(\mathcal {X})\). The constraint set \(K\) only appears in terms of local complexity, precisely as in the non-uniform case.
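
To give the mean width a more tangible meaning, the following small Monte Carlo sketch (NumPy only; the set and all parameters are our illustrative choices, and we adopt the common convention \(w_{}(\mathcal {X}) = \mathbb {E}_{\varvec{g}\sim \mathsf {N}(\varvec{0},\varvec{I}_{p})}[\sup _{\varvec{x}\in \mathcal {X}}\langle \varvec{g}, \varvec{x}\rangle ]\)) estimates the global parameter \(w_{}(\mathcal {X})\) for the set of s-sparse vectors in the unit Euclidean ball. In this case, the supremum has a closed form, namely the \(\ell ^{2}\)-norm of the s largest entries of \(\varvec{g}\) in magnitude, and the estimate can be compared with the order \(\sqrt{s \cdot \log (2p/s)}\) that appears throughout this section.

```python
import numpy as np

rng = np.random.default_rng(1)
p, s, n_trials = 1000, 20, 200

# X = { x : ||x||_0 <= s, ||x||_2 <= 1 }.  For fixed g, the supremum of <g, x> over X
# is attained by aligning x with the s largest-magnitude entries of g, so it equals
# the l2-norm of those entries.
sups = np.empty(n_trials)
for k in range(n_trials):
    g = rng.standard_normal(p)
    sups[k] = np.sqrt(np.sort(g ** 2)[-s:].sum())

print("Monte Carlo estimate of w(X):", sups.mean())
print("sqrt(s * log(2p/s))         :", np.sqrt(s * np.log(2 * p / s)))
```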

As the above argumentation is fairly abstract, it is insightful to illustrate our approach by a concrete example. In this context, we will also highlight the importance of carefully “designed” signal sets, in the sense that \(\mathcal {X}\) (and even its convex hull) is much smaller than \(K\).

Example 1

(Total variation in 1D) We consider the situation of s-gradient-sparse signals in one spatial dimension, i.e., for \(s \in [p-1]\) fixed, it is assumed that \(\Vert \nabla \varvec{x}\Vert _{0} \le s\), where

$$\begin{aligned} \nabla {:=}\begin{bmatrix} -1 &{} 1 &{} 0 &{} \dots &{} 0 \\ 0 &{} -1 &{} 1 &{} &{} 0\\ \vdots &{} &{} \ddots &{}\ddots &{} \vdots \\ 0 &{} \dots &{} 0 &{} -1 &{} 1 \end{bmatrix} \in \mathbb {R}^{(p-1) \times p} \end{aligned}$$

is a discrete gradient operator. Geometrically, this condition simply means that the vector \(\varvec{x}\) is piecewise constant with at most s jump discontinuities. An \(\ell ^{1}\)-relaxation of gradient sparsity then leads to the common total variation (TV) model [55], which in our case corresponds to the constraint set \(K= \{ \varvec{x}\in \mathbb {R}^p \ :\Vert \nabla \varvec{x}\Vert _{1} \le R \}\) for some \(R > 0\).

A recent work of the first author [27] has demonstrated that the reconstruction capacity of the TV model depends not only on the number of jump discontinuities but also strongly on their positions. More specifically, we say that an s-gradient-sparse signal \(\varvec{x}\) is \(\varDelta \)-separated for some \(\varDelta \in (0, 1]\) if

$$\begin{aligned} \min _{j \in [s+1]} \frac{|\nu _j - \nu _{j-1}|}{p} \ge \frac{\varDelta }{s+1}, \end{aligned}$$

where \({{\,\mathrm{supp}\,}}(\nabla \varvec{x}) = \{\nu _1,\dots ,\nu _{s}\}\) with \(0 {=:}\nu _0< \nu _1< \dots< \nu _{s} < \nu _{s+1} {:=}p\). Intuitively, the constant \(\varDelta \) measures how much the jump positions in \(\varvec{x}\) deviate from an equidistant pattern (where one would have \(\varDelta = 1\)). Assuming that \(\varDelta \in (0, 1]\) is fixed (independently of p and s), we say that \(\varvec{x}\) is well separated; see [27, Sec. 2.1] for more details. This motivates us to consider the following signal set:

$$\begin{aligned} \mathcal {X}= \big \{\varvec{x}\in \mathbb {R}^p \ :\Vert \nabla \varvec{x}\Vert _{0} \le s, \ \varvec{x}\text { is } \varDelta \text {-separated}, \Vert \nabla \varvec{x}\Vert _{1} = R \big \} \cap \mathbb {B}_2^p. \end{aligned}$$
(3.5)

Similarly to (3.2), the constraint \(\Vert \nabla \varvec{x}\Vert _{1} = R\) results from a tuning condition for (\({\mathsf {P}}_{K,\varvec{y}}\)) (see Remark 4(1)).
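
For concreteness, the following NumPy sketch (all positions and values are our illustrative choices) constructs the discrete gradient operator \(\nabla \), a piecewise constant signal with s jumps, and evaluates the separation constant \(\varDelta \) according to the above definition.

```python
import numpy as np

p, s = 300, 4

# Discrete gradient operator (forward differences), shape (p-1) x p.
nabla = np.zeros((p - 1, p))
nabla[np.arange(p - 1), np.arange(p - 1)] = -1.0
nabla[np.arange(p - 1), np.arange(1, p)] = 1.0

# Piecewise constant signal with s = 4 jumps at hand-picked positions.
jumps = np.array([50, 110, 190, 250])
levels = np.array([0.0, 1.0, -0.5, 0.8, 0.2])              # s + 1 constant pieces
x = np.repeat(levels, np.diff(np.concatenate(([0], jumps, [p]))))

grad_x = nabla @ x
print("gradient sparsity ||grad x||_0:", np.count_nonzero(np.abs(grad_x) > 1e-12))

# Separation constant: Delta = (s + 1) * min_j (nu_j - nu_{j-1}) / p,
# where nu_0 = 0, nu_{s+1} = p and nu_1 < ... < nu_s are the jump positions.
nu = np.concatenate(([0], np.flatnonzero(np.abs(grad_x) > 1e-12) + 1, [p]))
delta = (s + 1) * np.min(np.diff(nu)) / p
print("separation constant Delta     :", delta)            # 1 would mean equidistant jumps
```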

A deep result from [27, Thm. 2.10] now yields the following estimate for the conic mean width:

$$\begin{aligned} w_{0}(K- \varvec{x}) \lesssim \sqrt{\varDelta ^{-1} \cdot s \cdot \log ^2(p)} \qquad \text {for all } \varvec{x}\in \mathcal {X}. \end{aligned}$$
(3.6)

For the global mean width, we can make use of a bound derived by Krahmer et al. [38, Sec. 3.2]:

$$\begin{aligned} w_{}(\mathcal {X}) \lesssim \sqrt{s \cdot \log (e p/ s)}. \end{aligned}$$
(3.7)

Note that (3.7) would also hold without the \(\varDelta \)-separation condition, but it exploits that \(\mathcal {X}\subset \mathbb {B}_2^p\). Combining (3.6) and (3.7) with Proposition 1, we finally obtain

$$\begin{aligned} w_{t}(K- \mathcal {X}) \lesssim (1 + t^{-1}) \sqrt{s \cdot \log ^2(p)}. \end{aligned}$$

Therefore, according to Corollary 1, uniform recovery of well-separated s-gradient-sparse signals becomes feasible with \(O(s \cdot \log ^2(p))\) Gaussian measurements. In particular, we have shown that there is no qualitative gap to the non-uniform guarantee derived in [27].

Let us emphasize that, although the previous bounds look quite simple, their proofs build on fundamental insights into the TV model. For example, relying only on the basic estimate \(w_{t}(K- \mathcal {X}) \lesssim t^{-1} \cdot w_{}(K)\) would not help much, since it is unclear how \(w_{}(K)\) scales compared to \(w_{}(\mathcal {X})\), cf. [38, Sec. 3.3]. Even more notable is the fact that the above argument would break down if the \(\varDelta \)-separation condition were omitted. Indeed, Cai and Xu [8] have shown that uniform recovery of all s-gradient-sparse signals (including pathological examples) via TV minimization is impossible with fewer than \(\varOmega (\sqrt{s \cdot p})\) Gaussian measurements. Thus, a striking gap emerges between the uniform and non-uniform regime in this scenario. Remarkably, the technical approach of [8] is based on an adapted nullspace property, for which it is not obvious how to incorporate additional signal structure such as \(\varDelta \)-separation. The presented example on the TV model therefore also underscores the merits of a geometric complexity analysis as proposed in our work. \(\square \)

While gradient sparsity seems to be only slightly more involved than standard sparsity (with respect to an orthonormal basis), the above considerations have revealed several subtleties. To a certain extent, Example 1 is just the “tip of the iceberg” and similar phenomena apply in the more general context of the analysis- and synthesis-\(\ell ^{1}\)-model. An in-depth discussion would go beyond the scope of the present paper, and we refer the interested reader to the related works [26, 40]. The actual novelty of Proposition 1 is that these previous findings may also extend to uniform recovery—a step that had not been addressed so far.

We close this section with a remark on stable and robust recovery:

Remark 4

  1. (1)

    Tuning and stability. The equality constraint for \(\mathcal {X}\) in (3.2) and (3.5) ensures that every \(\varvec{x}\in \mathcal {X}\) lies on the boundary of \(K\). Such a technical step is important for bounds that are based on the conic mean width, such as (3.3) and (3.6); if \(\varvec{x}\) were to lie in the interior of \(K\), then \({\text {cone}}(K- \varvec{x}) = \mathbb {R}^p\) and therefore \(w_{0}(K- \varvec{x}) \asymp \sqrt{p}\). From an algorithmic perspective, this can be seen as a tuning condition for the Lasso estimator (\({\mathsf {P}}_{K,\varvec{y}}\)). The “moreover”-part of Proposition 1 presents a convenient relaxation in this respect: When interested in the local complexity of some point \(\varvec{x}\in \mathcal {X}\subset K\), one may instead consider a nearby point \({\tilde{\varvec{x}}}\), whose conic mean width \(w_{0}(K- {\tilde{\varvec{x}}})\) is smaller (non-trivial). The price for this simplification is a recovery error on the order of \(\Vert \varvec{x}- {\tilde{\varvec{x}}}\Vert _{2}\). This trade-off is closely related to the idea of stability (or compressibility) in compressed sensing theory [21].

  2. (2)

    Robustness and optimality. The statement of Corollary 1 allows for a comparison of the statistical and adversarial noise model (cf. Remark 1). In the absence of worst-case perturbations (i.e., \(\nu _i = 0\)), the condition (3.1) can be translated into an error decay rate of \(O(\sigma \cdot m^{-1/2})\). Thus, (\({\mathsf {P}}_{K,\varvec{y}}\)) becomes a consistent estimator of \(\varvec{x}\) in the ordinary sense of statistics, i.e., \(\Vert \hat{\varvec{z}}- \varvec{x}\Vert _{2} \rightarrow 0\) in probability as \(m \rightarrow \infty \). Remarkably, the asymptotic error scaling with respect to the noise parameter \(\sigma \) and sample size m cannot be further improved in general—it is minimax optimal for linear observations, e.g., see [52, Sec. 4]. Compared to random noise, the adversarial model is more general, since the \(\nu _i\) may encode any type of perturbation, even deterministic ones. Given the \(\ell ^{2}\)-constraint in Corollary 1,

    $$\begin{aligned} \big (\textstyle \tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}, \end{aligned}$$
    (3.8)

    it turns out that a recovery error of order \(O(t)\) is essentially optimal. Indeed, there exist instances of Corollary 1 where the \(\nu _i\) can be selected such that (3.8) holds with high probability (over \(\{\varvec{a}_i\}_{i = 1}^m\)) and one has that \(\Vert \hat{\varvec{z}}- \varvec{x}\Vert _{2} \asymp t\) (see footnote 10). The difference from the statistical regime becomes clearer when considering random noise from an adversarial perspective, i.e., \(\nu _i = \tau _i \sim \mathsf {N}(0, \sigma ^2)\). Then, it is not hard to see that with high probability, the constraint (3.8) is only attainable with \(t \gtrsim \sigma \), which prevents arbitrarily high precision in terms of t. In particular, Corollary 1 would not certify the consistency of (\({\mathsf {P}}_{K,\varvec{y}}\)) anymore, contrary to the argument in the previous paragraph. Let us emphasize that uniformity does not play a special role in these matters, although the statistical noise model appears somewhat unnatural in this regime (see Remark 1). In principle, our conclusions also carry over to the abstract setting of Theorem 2, but the interpretation of the term “noise” becomes more subtle there. The reason is that the estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) basically treats nonlinear distortions as if they were uncorrelated statistical noise (cf. [51]), even if the output function is completely deterministic. Seen from this angle, better decay rates than \(O(\sigma \cdot m^{-1/2})\) are possible, which, however, strongly depends on the specific model and estimator (see Remark 2(2)).
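
For the reader's convenience, the rate \(O(\sigma \cdot m^{-1/2})\) claimed above can be read off from (3.1) as follows (an informal sketch that treats \(L\), \(u\), and the mean width as constants): since \(K\) is convex with \(\mathcal {X}\subset K\), we have \(w_{t}(K- \mathcal {X}) \le w_{0}(K- \mathcal {X})\) (cf. (3.3)), so that (3.1) is met whenever

$$\begin{aligned} m \ge C \cdot L^2 \cdot (\log L+ t^{-2} \sigma ^2 ) \cdot \big (w_{0}^2(K- \mathcal {X}) + u^2 \big ), \end{aligned}$$

and once \(m \gtrsim L^2 \log L\cdot (w_{0}^2(K- \mathcal {X}) + u^2)\), this rearranges to the admissible accuracy

$$\begin{aligned} t \gtrsim \sigma \cdot L\cdot \sqrt{\frac{w_{0}^2(K- \mathcal {X}) + u^2}{m}} = O(\sigma \cdot m^{-1/2}). \end{aligned}$$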

4 Applications and Examples

This section demonstrates how to derive “out-of-the-box” guarantees from Theorem 2 for specific observation models. Sects. 4.1–4.3 are devoted to applications to quantized compressed sensing. Sect. 4.4 then revisits the case of single-index models, including new applications to modulo measurements and coordinate-wise distortions. Finally, Sect. 4.5 is concerned with a conceptually different example on the problem of variable selection. Note that all proofs are deferred to Sect. 8.

4.1 1-Bit Observations

As already indicated in Sect. 1.1, the most basic version of 1-bit compressed sensing asks for the reconstruction of signals \(\varvec{x}\in \mathcal {X}\subset \mathbb {R}^p\) from binary observations \(\varvec{y}\in \{-1,1\}^m\) of the form

$$\begin{aligned} y_i = {{\,\mathrm{sign}\,}}(\langle \varvec{a}_i, \varvec{x}\rangle ) + \nu _i, \quad i = 1, \dots , m. \end{aligned}$$
(4.1)

Here, the noise variables \(\varvec{\nu }{:=}(\nu _1, \dots , \nu _m)\) may take values in \(\{-2, 0, 2\}\), modeling possible distortions of the linear measurement process before quantization and/or bit flips during quantization. Importantly, the magnitude of \(\varvec{x}\) gets lost in (4.1) due to the scaling invariance of the \({{\,\mathrm{sign}\,}}\)-function. Thus, the best one can hope for is to reconstruct the direction of \(\varvec{x}\).

The model of (4.1) is particularly compatible with Gaussian measurement vectors—a related result on sub-Gaussian measurements can be found in the next subsection. Indeed, choosing the target function \(T\varvec{x}\) proportional to \(\varvec{x}/ \Vert \varvec{x}\Vert _{2}\), one can show that the target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\). Hence, uniform recovery is possible up to arbitrarily high precision. The following guarantee makes this claim precise and is an application of Theorem 2 to Gaussian 1-bit observations. Its proof in Sect. 8.1 will demonstrate the usefulness of the approximation condition from Assumption 2; see also Fig. 2 for an illustration of the underlying argument.

Corollary 2

There exist universal constants \(c, c_0, C' > 0\) for which the following holds.

Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a standard Gaussian random vector \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). Let \(\mathcal {X}\subset \mathbb {R}^p\) and define \(T\varvec{x}{:=}\sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\) for \(\varvec{x}\in \mathcal {X}\). Moreover, let \(K\subset \mathbb {R}^p\) be a convex set such that \(T\mathcal {X}\subset K\). For \(u\ge 1\) and \(t\in (0,1]\), we assume that

$$\begin{aligned} m \ge C' \cdot t^{-2}\cdot \Big (w_{t}^2(K- T\mathcal {X}) + t^{-2} \cdot w_{}^2(T\mathcal {X}) + u^2 \Big ). \end{aligned}$$
(4.2)

Finally, let \(\beta \in [0,1]\) be such that \(\beta \sqrt{\log (e/\beta )}\le c_0 t\). Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \{-1,1\}^m\) be given by (4.1) such that \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \beta \). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\big \Vert \hat{\varvec{z}}- \sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\big \Vert _{2} \le t\).

Fig. 2

Illustration of the approximation strategy used in the proof of Corollary 2. The basic idea is to approximate the jump discontinuity of \({{\,\mathrm{sign}\,}}(s)\) (plotted in black) by a linear segment whose slope is inverse proportional to the accuracy t. The resulting function \(\psi _t(s)\) (plotted in blue) and the absolute value of the error \(\phi _t(s) = |\psi _t(s) - {{\,\mathrm{sign}\,}}(s)|\) (plotted in dashed red) are then both \(\tfrac{128}{t}\)-Lipschitz. This is already enough to fulfill Assumption 2 with \(L_{t}\lesssim 1 + t^{-1}\) and \(\hat{L}_{t}\lesssim t^{-1}\); note that the factor 128 is just an appropriate constant for the proof (Color figure online)

Corollary 2 is in line with some of the early achievements in 1-bit compressed sensing [48, 49]. Remarkably, the condition (4.2) translates into an error decay rate of \(O(m^{-1/4})\) in the uniform case, which improves the original rate of \(O(m^{-1/12})\) established by Plan and Vershynin in [49, Thm. 1.3]. In fact, we are not aware of any result in the literature that implies the statement of Corollary 2. But we stress that the above oversampling rate is still not optimal (see [34]) and can be further improved with a more specialized argument relying on random hyperplane tessellations (see Corollary 8 in Sect. 5). The noise constraint of Corollary 2 simply means that the fraction of wrong input bits \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1}\) must not exceed \(\beta \), while the latter may be on the order of t (up to a log factor). This condition again significantly improves on [49, Thm. 1.3] and is a particular consequence of the outlier robustness established in Theorem 2; see also [17] for a similar achievement in the situation of dithered observations.
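
The role of the rescaled target \(T\varvec{x}= \sqrt{2/\pi }\,\varvec{x}/\Vert \varvec{x}\Vert _{2}\) can also be observed numerically. The following NumPy sketch (illustrative only; it computes the plain backprojection \(\tfrac{1}{m}\sum _{i} y_i \varvec{a}_i\) underlying the linearization idea, not the constrained estimator (\({\mathsf {P}}_{K,\varvec{y}}\)) itself) exploits that \(\mathbb {E}[{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )\varvec{a}] = \sqrt{2/\pi }\,\varvec{x}/\Vert \varvec{x}\Vert _{2}\) for standard Gaussian \(\varvec{a}\), so that the direction of \(\varvec{x}\) is preserved while its magnitude is lost.

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 200, 20000

x = rng.standard_normal(p)
x *= 3.7 / np.linalg.norm(x)            # arbitrary magnitude; it cannot be recovered
target = np.sqrt(2 / np.pi) * x / np.linalg.norm(x)

A = rng.standard_normal((m, p))         # standard Gaussian measurement vectors
y = np.sign(A @ x)                      # noiseless 1-bit observations (4.1)

backprojection = A.T @ y / m            # ~ E[sign(<a, x>) a] = sqrt(2/pi) * x / ||x||_2
cosine = backprojection @ x / (np.linalg.norm(backprojection) * np.linalg.norm(x))
print("l2 distance to target :", np.linalg.norm(backprojection - target))
print("cosine similarity to x:", cosine)
```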

4.2 1-Bit Observations with Dithering

According to Ai et al. [1], the conclusion of Corollary 2 cannot be extended to sub-Gaussian measurements in general, regardless of the considered reconstruction method. A practicable remedy is the technique of dithering, which, in its most basic form, corresponds to a random shift of the quantization threshold. Originating from quantized signal processing, e.g., see [13, 31, 32], its benefits have recently also emerged in compressed sensing theory [3, 17, 18, 37, 62, 65]. For more background information, we refer to [14] and the references therein.

Extending the original 1-bit model (4.1) by an additional dithering step leads to observations \(\varvec{y}\in \{-1,1\}^m\) of the following form:

$$\begin{aligned} y_i = {{\,\mathrm{sign}\,}}(\langle \varvec{a}_i, \varvec{x}\rangle + \tau _i) + \nu _i, \quad i = 1, \dots , m, \end{aligned}$$
(4.3)

where \(\varvec{\nu }{:=}(\nu _1, \dots , \nu _m)\in \{-2,0,2\}^m\) again models noise. The dithering variables \(\tau _i\) are independent copies of a random variable \(\tau \) that is uniformly distributed on an interval \([-\lambda , \lambda ]\). Note that a major difference between dithering and additive noise is that the parameter \(\lambda > 0\) is known and adjustable in practice (while the \(\tau _i\) could in principle be unknown). The following corollary of Theorem 2 is based on the fact that a careful choice of \(\lambda \) allows us to control the size of the (non-vanishing) target mismatch \(\rho _{}(\varvec{x})\), where \(T\varvec{x}{:=}\lambda ^{-1} \varvec{x}\).

Corollary 3

There exist universal constants \(c, c_0, {\tilde{C}}, C' > 0\) for which the following holds.

Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic, sub-Gaussian random vector \(\varvec{a}\in \mathbb {R}^p\) with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). Let \(\tau _1, \dots , \tau _m\) be independent copies of a random variable \(\tau \) that is uniformly distributed on \([-\lambda , \lambda ]\) for a parameter \(\lambda >0\). In addition, suppose that \(\{\varvec{a}_i\}_{i=1}^m\) and \(\{\tau _i\}_{i=1}^m\) are independent. Let \(\mathcal {X}\subset R \mathbb {B}_2^p\) for some \(R > 0\), and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\lambda ^{-1}\mathcal {X}\subset K\). For \(u\ge 1\) and \(t\in (0, 1]\), we assume that

$$\begin{aligned} \lambda&\ge {\tilde{C}} \cdot R \cdot L\cdot \sqrt{\log (e / t)}, \nonumber \\ m&\ge C' \cdot L^2 \cdot \Big ( (\log L+ t^{-2} )\cdot \big (w_{t}^2(K- \lambda ^{-1}\mathcal {X})+ u^2 \big ) + L^2 t^{-4} \lambda ^{-2} \cdot w_{}^2(\mathcal {X}) \Big ). \end{aligned}$$
(4.4)

Finally, let \(\beta \in [0,1]\) be such that \(\beta \sqrt{\log (e/\beta )}\le c_0 L^{-2} t\). Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{(\varvec{a}_i, \tau _i)\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \{-1,1\}^m\) be given by (4.3) such that \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \beta \). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \lambda ^{-1}\varvec{x}\Vert _{2} \le t\).

Corollary 3 exhibits many common features of known recovery guarantees based on dithering. In particular, it can be seen as a uniform version of a recent result on the generalized Lasso by Thrampoulidis and Rawat [62, Thm. IV.1]. The most notable improvement over Corollary 2 is that it is now possible to recover the actual signal \(\varvec{x}\in \mathcal {X}\) up to arbitrarily high precision, and not only its direction vector \(\varvec{x}/ \Vert \varvec{x}\Vert _{2}\).

Similarly to Corollary 2, we are not aware of any result in the literature that implies the statement of Corollary 3. Nevertheless, the oversampling rate of \(O(m^{-1/4})\) implied by (4.4) can be further improved with a proof strategy that is specifically tailored to 1-bit observations with dithering (see Corollary 9 in Sect. 5).
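
In the same illustrative spirit, the mechanism behind the choice \(T\varvec{x}= \lambda ^{-1}\varvec{x}\) can be checked numerically: for \(\tau \) uniformly distributed on \([-\lambda ,\lambda ]\) and \(|\langle \varvec{a}, \varvec{x}\rangle | \le \lambda \), one has \(\mathbb {E}_\tau [{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )] = \lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \), so the backprojection \(\tfrac{1}{m}\sum _i y_i\varvec{a}_i\) now approximates \(\lambda ^{-1}\varvec{x}\) itself, magnitude included. As before, this is only the linearization exploited by (\({\mathsf {P}}_{K,\varvec{y}}\)), not the estimator itself, and all parameter choices below are ours.

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 64, 150000
lam = 6.0                                # dithering level, chosen large relative to ||x||_2

x = rng.standard_normal(p)
x *= 2.0 / np.linalg.norm(x)             # now ||x||_2 = 2, so |<a, x>| <= lam holds w.h.p.
A = rng.standard_normal((m, p))
tau = rng.uniform(-lam, lam, size=m)     # dithering variables
y = np.sign(A @ x + tau)                 # dithered 1-bit observations (4.3)

backprojection = A.T @ y / m             # ~ E[sign(<a, x> + tau) a] ~ x / lam
print("relative error of lam * backprojection:",
      np.linalg.norm(lam * backprojection - x) / np.linalg.norm(x))
```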

4.3 Multi-Bit Observations

While 1-bit measurements are an important extreme case of quantized compressed sensing, a considerable part of the literature deals with multi-bit observation models; once again, see [6, 14] for a good introduction to this subject. In this subsection, we illustrate our approach in the prototypical situation of uniform quantization: For a fixed \(\delta > 0\), the goal is to reconstruct a signal \(\varvec{x}\in \mathcal {X}\subset \mathbb {R}^p\) from observations \(\varvec{y}\in \delta \mathbb {Z}^m\) of the form

$$\begin{aligned} y_i = q_\delta (\langle \varvec{a}_i, \varvec{x}\rangle + \tau _i) + \nu _i, \quad i = 1, \dots , m, \end{aligned}$$
(4.5)

where \(q_\delta (v) {:=}(2\lceil \tfrac{v}{2\delta }\rceil -1)\delta \) is a uniform quantizer on the grid \(\delta (2\mathbb {Z}-1)^m\) with resolution \(\delta >0\). The dithering variables \(\tau _i\) are independent copies of a random variable \(\tau \) that is uniformly distributed on \([-\delta , \delta ]\). As before, \(\varvec{\nu }{:=}(\nu _1, \dots , \nu _m)\) can describe any type of adversarial noise, but now takes values in \(\delta \mathbb {Z}^m\). Theorem 2 yields the following uniform recovery guarantee for multi-bit observations with sub-Gaussian measurements and dithering.

Corollary 4

There exist universal constants \(c, c_0, C' > 0\) for which the following holds.

Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic, sub-Gaussian random vector \(\varvec{a}\in \mathbb {R}^p\) with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). Let \(\tau _1, \dots , \tau _m\) be independent copies of a random variable \(\tau \) that is uniformly distributed on \([-\delta , \delta ]\) for a fixed parameter \(\delta >0\). In addition, suppose that \(\{\varvec{a}_i\}_{i=1}^m\) and \(\{\tau _i\}_{i=1}^m\) are independent. Let \(\mathcal {X}\subset \mathbb {R}^p\) be a bounded subset, and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\mathcal {X}\subset K\). For \(u\ge 1\) and \(t > 0\), we assume that

$$\begin{aligned} m \ge C' \cdot L^2 \cdot \Big ( (\log L+ t^{-2} \delta ^2) \cdot \big (w_{t}^2(K- \mathcal {X})+ u^2 \big ) + L^2 t^{-4} \delta ^2\cdot w_{}^2(\mathcal {X}) \Big ). \end{aligned}$$
(4.6)

Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{(\varvec{a}_i, \tau _i)\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \delta \mathbb {Z}^m\) be given by (4.5) such that at least one of the following conditions is fulfilled:

$$\begin{aligned}&\text {(a) } \tfrac{1}{\sqrt{m}}\Vert \varvec{\nu }\Vert _{2}\le \tfrac{t}{40} \quad \text { or} \nonumber \\&\text {(b) } \tfrac{1}{\sqrt{m}}\Vert \varvec{\nu }\Vert _{2}\le \tfrac{c_0\sqrt{\delta t}}{L^2\sqrt{\max \{1, \log ( \delta e/t)\}}} \text { and } \tfrac{1}{m} \Vert \varvec{\nu }\Vert _{0} \le \tfrac{t}{\delta }. \end{aligned}$$
(4.7)

Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \varvec{x}\Vert _{2} \le 2t\).

Corollary 4 can be considered a uniform version of a recent result by Thrampoulidis and Rawat [62, Thm. III.1]. For a fixed quantizer resolution \(\delta >0\), the condition (4.6) translates into an error decay rate of \(O(\sqrt{\delta }\cdot m^{-1/4})\). On the other hand, if \(\delta \ll t\), then (4.6) is satisfied as soon as \(m \ge C \cdot L^2 \log L\cdot (w_{t}^2(K- \mathcal {X})+ u^2 )\). In other words, if the quantizer resolution is much higher than the desired reconstruction accuracy, the performance of (\({\mathsf {P}}_{K,\varvec{y}}\)) is the same as if the input vector consisted of linear measurements (cf. Corollary 1). This behavior is perfectly consistent with the fact that the uniformly quantized observations (4.5) become linear as \(\delta \rightarrow 0\). It is also worth pointing out that the constraint (b) in (4.7) implies a certain outlier robustness: If \(t < \delta \) and only a fraction of \(t / \delta \) bits are corrupted, then the normalized \(\ell ^{2}\)-noise error may scale on the order of \(\sqrt{\delta t}\) instead of t.
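
The consistency with the linear case can also be checked directly: for \(q_\delta \) as in (4.5) and \(\tau \) uniformly distributed on \([-\delta , \delta ]\), the dithered quantizer is unbiased in the sense that \(\mathbb {E}_\tau [q_\delta (v + \tau )] = v\) for every \(v \in \mathbb {R}\), i.e., conditionally on \(\varvec{a}_i\), the observation (4.5) is an unbiased version of the linear measurement. A minimal NumPy check (our illustrative values):

```python
import numpy as np

rng = np.random.default_rng(4)
delta = 0.5

def q_delta(v, delta):
    """Uniform quantizer q_delta(v) = (2 * ceil(v / (2 * delta)) - 1) * delta, cf. (4.5)."""
    return (2 * np.ceil(v / (2 * delta)) - 1) * delta

# Monte Carlo check of E_tau[q_delta(v + tau)] = v for tau ~ U[-delta, delta].
tau = rng.uniform(-delta, delta, size=1_000_000)
for v in [-1.3, 0.0, 0.2, 0.77, 3.1]:
    print(f"v = {v:5.2f}   E[q_delta(v + tau)] ~ {np.mean(q_delta(v + tau, delta)):6.3f}")
```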

To the best of our knowledge, Corollary 4 is a new result. But we stress that there exist uniform recovery guarantees for other programs than (\({\mathsf {P}}_{K,\varvec{y}}\)) in the literature, which imply better oversampling rates, e.g., see [36, Thm. 3] or [65] in the case of structured signal sets. Nevertheless, we expect that Corollary 4 could be easily improved in that regard, using the strategy of Sect. 5 in conjunction with known uniform embedding results.

4.4 Single-Index Models and Beyond

The situation of single-index models was already considered in Theorem 1 in Sect. 1.2, as an appetizer for our more general approach. As pointed out there, this guarantee can be seen as an upgrade of an earlier non-uniform result by Plan and Vershynin [51, Thm. 1.9], thereby demonstrating that uniform recovery is possible beyond quantized measurement schemes. In view of Assumption 2, the Lipschitz continuity of \(f\) in Theorem 1 is certainly not necessary. However, it demonstrates that a more “regular” observation variable can lead to a near-optimal error decay rate (see Remark 2(2)). Apart from that, we emphasize that (Lipschitz) continuous nonlinearities are not only of academic interest but also appear in practical applications, for instance, as power amplifiers in sensor networks [24].

The remainder of this subsection is devoted to two new applications of our general framework, which extend Theorem 1 in different directions.

Modulo measurements. The first scenario is inspired by a recent work of Bhandari et al. [4] on “unlimited sampling.” The practical motivation there is to prevent a saturation of analog-to-digital converters by applying a modulo operator in the measurement process. Tailored to the setup of the present article, we consider modulo measurements of the form

$$\begin{aligned} y_i = {\mathfrak {m}}_\lambda (\langle \varvec{a}_i, \varvec{x}\rangle ) + \nu _i, \quad i = 1, \dots , m, \end{aligned}$$
(4.8)

where the modulo function is given by \({\mathfrak {m}}_\lambda (v){:=}v - \big \lfloor \tfrac{v+\lambda }{2\lambda }\big \rfloor \cdot 2\lambda \) for a fixed parameter \(\lambda >0\), and \(\varvec{\nu }{:=}(\nu _1, \dots , \nu _m) \in \mathbb {R}^m\) models noise. Clearly, the nonlinearity \({\mathfrak {m}}_\lambda \) neither corresponds to a quantization (cf. Sects. 4.1–4.3) nor is it Lipschitz continuous (cf. Theorem 1). The following corollary of Theorem 2 provides a uniform recovery guarantee for the observation model (4.8):

Corollary 5

There exist universal constants \(c, C, C' > 0\) for which the following holds.

Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a standard Gaussian random vector \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). For \(\lambda >0\), we set \(\mu _\lambda {:=}\mathbb {E}[{\mathfrak {m}}_\lambda (g)g]\) with \(g\sim \mathsf {N}(0, 1)\). Let \(\mathcal {X}\subset \mathbb {S}^{p-1}\), and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\mu _\lambda \mathcal {X}\subset K\). For \(u\ge 1\) and \(t\in (0,\lambda ]\), we assume that \(\lambda \ge C\) and

$$\begin{aligned} m \ge C'\cdot \Big ((1+t^{-2})\cdot \big (w_{t}^2(K- \mu _\lambda \mathcal {X})+u^2\big )+\lambda ^2t^{-4}\cdot w_{}^2(\mathcal {X})\Big ). \end{aligned}$$

Then, \(\mu _\lambda \in [\tfrac{1}{2},1]\) and with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \mathbb {R}^m\) be given by (4.8) such that \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \mu _\lambda \varvec{x}\Vert _{2} \le t\).

Similarly to our applications to quantized compressed sensing above, the (infinitely many) discontinuities of the modulo function \({\mathfrak {m}}_\lambda \) imply an error decay rate of \(O(m^{-1/4})\). To the best of our knowledge, Corollary 5 is a new result in its own right. In particular, we are not aware of alternative proof techniques that would allow us to derive a comparable statement. Finally, it is worth noting that the condition \(\lambda \ge C\) in Corollary 5 could be further relaxed to \(\lambda > 0\). This would come at the cost of a weaker lower bound on \(\mu _\lambda \), since we have that \(\mu _\lambda \rightarrow 0\) as \(\lambda \rightarrow 0\) (the nonlinearity \({\mathfrak {m}}_\lambda \) converges uniformly to 0 as \(\lambda \rightarrow 0\)).
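
To give a feel for the parameter \(\mu _\lambda \), the following NumPy sketch (illustrative; the values of \(\lambda \) are our choices) implements \({\mathfrak {m}}_\lambda \) and estimates \(\mu _\lambda = \mathbb {E}[{\mathfrak {m}}_\lambda (g)g]\) by Monte Carlo. One observes values of \(\mu _\lambda \) in \([\tfrac{1}{2}, 1]\) for the larger choices of \(\lambda \) and a rapid degeneration toward 0 as \(\lambda \) decreases, in line with the discussion above.

```python
import numpy as np

rng = np.random.default_rng(5)

def modulo(v, lam):
    """Centered modulo m_lambda(v) = v - floor((v + lam) / (2 * lam)) * 2 * lam, cf. (4.8)."""
    return v - np.floor((v + lam) / (2 * lam)) * 2 * lam

g = rng.standard_normal(2_000_000)
for lam in [0.1, 0.5, 1.0, 2.0, 3.0, 5.0]:
    mu = np.mean(modulo(g, lam) * g)     # Monte Carlo estimate of mu_lambda = E[m_lambda(g) g]
    print(f"lambda = {lam:3.1f}   mu_lambda ~ {mu:6.3f}")
```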

Coordinate-wise distortions. The common single-index model (see Theorem 1) assumes that a nonlinear output function perturbs the linear observations \(\langle \varvec{a}_i, \varvec{x}\rangle \). Instead, one can also imagine distortions that affect the computation of the dot product \(\langle \varvec{a}_i, \varvec{x}\rangle \) directly. More specifically, we are interested in observations of the form

$$\begin{aligned} y_i = f(\varvec{a}_i \circ \varvec{x}) + \nu _i, \quad i = 1, \dots , m, \end{aligned}$$
(4.9)

where \(\varvec{a}\circ \varvec{b}\in \mathbb {R}^p\) denotes the coordinate-wise product of vectors \(\varvec{a}, \varvec{b}\in \mathbb {R}^p\) and \(f:\mathbb {R}^p\rightarrow \mathbb {R}\) can be represented as

$$\begin{aligned} f(\varvec{z}){:=}\sum _{j=1}^p f_j(z_j), \quad \varvec{z}\in \mathbb {R}^p. \end{aligned}$$

Here, we assume that the functions \(f_j:\mathbb {R}\rightarrow \mathbb {R}\) are odd and \(\gamma \)-Lipschitz, while satisfying the growth conditions

$$\begin{aligned} \alpha (v-v')\le f_j(v)-f_j(v'), \quad \text {for all } v, v'\in \mathbb {R}\text { with } v'\le v, \end{aligned}$$
(4.10)

and

$$\begin{aligned} \beta _1 v\le f_j(v)\le \beta _2 v, \quad \text {for all } v\ge 0, \end{aligned}$$
(4.11)

with parameters \(\alpha > 0\) and \(\beta _2 \ge \beta _1 > 0\). One can think of (4.9) as computing a “nonlinear dot product” of \(\varvec{a}_i\) and \(\varvec{x}\). While models of this type are not covered by the original approach of [51], an application of Theorem 2 allows us to deal with them under appropriate assumptions:

Corollary 6

There exist universal constants \(c, C > 0\) for which the following holds.

Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic random vector \(\varvec{a}= (a_1, \dots , a_p)\) which has independent, symmetric, and sub-Gaussian coordinates such that \(\max _{j \in [p]}\Vert a_j\Vert _{\psi _2} \le L\) for some \(L> 0\). Let \(\mathcal {X}\subset R\mathbb {B}_2^p\) and define \(T:\mathcal {X}\rightarrow \mathbb {R}^p\) by \(T\varvec{x}{:=}\mathbb {E}[f(\varvec{a}\circ \varvec{x})\varvec{a}]\). Let \(K\subset \mathbb {R}^p\) be a convex set such that \(T\mathcal {X}\subset K\). For \(u\ge 1\) and \(t\ge 0\), we assume that

$$\begin{aligned} m \ge C \cdot L^4\cdot \Big (\big (1+t^{-2}R^2(\beta _2^2+\gamma ^2)\big )\cdot \big (w_{t}^2(K- T\mathcal {X})+ u^2\big )+ t^{-2}(\tfrac{\gamma }{\alpha })^2\cdot w_{}^2(T\mathcal {X})\Big ). \end{aligned}$$

Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \mathbb {R}^m\) be given by (4.9) such that \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le t\). Furthermore, for every \(j\in [p]\), we have that

$$\begin{aligned} \beta _1x_j&\le (T\varvec{x})_j\le \beta _2x_j, \qquad \text {for } x_j\ge 0,\nonumber \\ \beta _2x_j&\le (T\varvec{x})_j\le \beta _1x_j, \qquad \text {for } x_j<0. \end{aligned}$$
(4.12)

The above result shows that estimation via (\({\mathsf {P}}_{K,\varvec{y}}\)) is even possible in situations where instead of linear measurements \(\langle \varvec{a}_i, \varvec{x}\rangle =\sum _{j=1}^p(\varvec{a}_i)_jx_j\) one observes distorted dot products \(f(\varvec{a}_i\circ \varvec{x})=\sum _{j=1}^pf_j((\varvec{a}_i)_jx_j)\). Note that the generalized Lasso may not precisely recover a scalar multiple of \(\varvec{x}\) here, but rather a vector \(T\varvec{x}\) that lies in a hyperrectangle defined by \(\varvec{x}\) and the parameters \(\beta _1\) and \(\beta _2\). The closer the functions \(f_j\) are to the identity, the closer \(\beta _1\) and \(\beta _2\) are to 1, which implies that \(T\varvec{x}\approx \varvec{x}\) according to (4.12).
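
As a concrete instance (our choice, not part of the corollary): \(f_j(v) = v + \tfrac{1}{2}\sin (v)\) is odd and \(\gamma \)-Lipschitz with \(\gamma = \tfrac{3}{2}\), and it satisfies (4.10) and (4.11) with \(\alpha = \beta _1 = \tfrac{1}{2}\) and \(\beta _2 = \tfrac{3}{2}\). The following NumPy sketch estimates \(T\varvec{x}= \mathbb {E}[f(\varvec{a}\circ \varvec{x})\varvec{a}]\) by Monte Carlo for standard Gaussian \(\varvec{a}\) and checks the coordinate-wise sandwich (4.12).

```python
import numpy as np

rng = np.random.default_rng(6)
p, n_mc = 10, 500_000
beta1, beta2 = 0.5, 1.5

def f_j(v):
    """Odd, (3/2)-Lipschitz distortion with alpha = beta_1 = 1/2 and beta_2 = 3/2."""
    return v + 0.5 * np.sin(v)

x = rng.uniform(0.5, 2.0, size=p) * rng.choice([-1, 1], size=p)   # signal with entries away from 0

# (T x)_j = E[ f(a o x) a_j ] = E[ f_j(a_j x_j) a_j ], since all cross terms have mean zero.
a = rng.standard_normal((n_mc, p))
Tx = np.mean(f_j(a * x) * a, axis=0)

lower = np.where(x >= 0, beta1 * x, beta2 * x)    # coordinate-wise sandwich (4.12)
upper = np.where(x >= 0, beta2 * x, beta1 * x)
print("sandwich (4.12) satisfied:", bool(np.all((lower <= Tx) & (Tx <= upper))))
print("ratios (Tx)_j / x_j      :", np.round(Tx / x, 3))   # all lie between beta1 and beta2
```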

4.5 Variable Selection

This subsection is devoted to an instance of Assumption 1 that is substantially different from the previous ones: For a fixed integer \(s \le p\) and an output function \(f:\mathbb {R}^p \rightarrow \mathbb {R}\), we ask for estimation of an index set \(\mathcal {S}\subset [p]\) with \(|\mathcal {S}| \le s\) from observations \(\varvec{y}\in \mathbb {R}^m\) of the form

$$\begin{aligned} y_i = f(\varvec{a}_{i,\mathcal {S}}) + \nu _i, \quad i = 1, \dots , m. \end{aligned}$$
(4.13)

Here, \(\varvec{a}_{i,\mathcal {S}} \in \mathbb {R}^p\) is the coordinate projection of \(\varvec{a}_i\) onto \(\mathcal {S}\) and \(\nu _i \in \mathbb {R}\) models additive noise. Most notably, the signals of interest are not parameter vectors in \(\mathbb {R}^p\) anymore but correspond to (small) index sets, specifying those coefficients of a measurement vector that contribute to the observation. From a statistical perspective, this can be seen as a variable selection model, where \(\mathcal {S}\subset [p]\) determines the set of active variables among all feature variables in \(\varvec{a}_i\). In the context of uniform recovery, this leads to the following problem: Given a collection of sample data \(\{\varvec{a}_i\}_{i = 1}^m\), can we retrieve any possible index set \(\mathcal {S}\subset [p]\) with \(|\mathcal {S}| \le s\) from (nonlinear) observations of the form (4.13)? The following corollary of Theorem 2 provides a result in that direction under natural conditions on the output function \(f\).

Corollary 7

There exist universal constants \(c, C > 0\) for which the following holds.

Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic random vector \(\varvec{a}= (a_1, \dots , a_p)\) which has independent sub-Gaussian coordinates such that \(\max _{j \in [p]}\Vert a_j\Vert _{\psi _2} \le L\) for some \(L> 0\). For \(s \le p\) and an output function \(f:\mathbb {R}^p \rightarrow \mathbb {R}\), let \(\mathcal {X}\subset \{\mathcal {S}\subset [p] \ :|\mathcal {S}|\le s\}\) and define \(T:\mathcal {X}\rightarrow \mathbb {R}^p\) by \(T\mathcal {S}{:=}\mathbb {E}[f(\varvec{a}_{\mathcal {S}})\varvec{a}]\), where \(\varvec{a}_{\mathcal {S}} \in \mathbb {R}^p\) denotes the coordinate projection of \(\varvec{a}\) onto \(\mathcal {S}\). Moreover, we assume that there exist parameters \(\alpha , \beta , \gamma , \kappa > 0\) such that

$$\begin{aligned} \frac{\alpha }{\sqrt{s}}\le |(T\mathcal {S})_j| \le \frac{\beta }{\sqrt{s}}\qquad \text {for all } \mathcal {S}\in \mathcal {X}\text { and } j\in \mathcal {S}\end{aligned}$$
(4.14)

and

$$\begin{aligned} \Vert f(\varvec{a}_{\mathcal {S}}) - f(\varvec{a}_{\mathcal {S}'})\Vert _{\psi _2} \le \gamma \sqrt{\frac{|\mathcal {S}\bigtriangleup \mathcal {S}'|}{s}} \quad {\text {and}}\quad \Vert f(\varvec{a}_{\mathcal {S}})\Vert _{\psi _2} \le \kappa \qquad \text {for all } \mathcal {S}, \mathcal {S}'\in \mathcal {X},\nonumber \\ \end{aligned}$$
(4.15)

where \(\mathcal {S}\bigtriangleup \mathcal {S}' \subset [p]\) denotes the symmetric difference between \(\mathcal {S}\) and \(\mathcal {S}'\). Let \(K\subset \mathbb {R}^p\) be a convex set such that \(T\mathcal {X}\subset K\). For \(u\ge 1\) and \(t\ge 0\), we assume that

$$\begin{aligned} m\ge & {} C \cdot L^2 \cdot \Big ( \big (\log L+ t^{-2} (L^2 \beta ^2 + \kappa ^2)\big )\cdot \big (w_{t}^2(K- T\mathcal {X})+ u^2 \big ) \nonumber \\&+ t^{-2} (L^2 + \gamma ^2\alpha ^{-2})\cdot w_{}^2(T\mathcal {X}) \Big ). \end{aligned}$$
(4.16)

Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\mathcal {S}\in \mathcal {X}\): Let \(\varvec{y}\in \mathbb {R}^m\) be given by (4.13) such that \(\big (\tfrac{1}{m}\sum _{i = 1}^m \nu _i^2 \big )^{1/2} \le \tfrac{t}{20}\). Then, \({{\,\mathrm{supp}\,}}(T\mathcal {S}) = \mathcal {S}\) and every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\mathcal {S}\Vert _{2} \le t\).

The assumptions (4.14) and (4.15) can be seen as natural balancing properties of the underlying observation model: (4.14) requires that the coefficients of each target vector \(T\mathcal {S}\in K\) are uniformly bounded below and above on \(\mathcal {S}\) (and in particular \(\alpha \le \Vert T\mathcal {S}\Vert _{2} \le \beta \)). The increment condition (4.15) ensures that the distance between two observation variables can be controlled in terms of the symmetric difference of their associated index sets in \(\mathcal {X}\).

With this in mind, Corollary 7 suggests the following simple procedure for variable selection: First perform a hard-thresholding step on \(\hat{\varvec{z}}\) to extract its largest entries in magnitude; then use the corresponding indices to estimate the set of active variables \(\mathcal {S}= {{\,\mathrm{supp}\,}}(T\mathcal {S})\). Note that this does not require explicit knowledge of the output function \(f\). However, the (guaranteed) success of such a strategy strongly depends on the size of \(\alpha \) and the accuracy t. In the worst case, t would have to be on the order of \(\alpha / \sqrt{s}\) for perfect recovery of \(\mathcal {S}\), which would lead to an undesirable factor of s in (4.16). It is certainly possible to show refined versions of Corollary 7. We suspect that sharper error bounds could be obtained by considering a different error measure than the \(\ell ^{2}\)-norm (see Remark 3(3)). Nevertheless, a detailed elaboration would go beyond the scope of this article, and we confine ourselves to the above proof of concept.
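
To illustrate the suggested procedure (with choices that are ours, not from the corollary): take \(f(\varvec{a}_{\mathcal {S}}) = \tfrac{1}{\sqrt{s}}\sum _{j \in \mathcal {S}} {{\,\mathrm{sign}\,}}(a_j)\), which for standard Gaussian coordinates satisfies (4.14) with \(\alpha = \beta = \sqrt{2/\pi }\) as well as the increment condition (4.15) with absolute constants. The NumPy sketch below estimates \(T\mathcal {S}= \mathbb {E}[f(\varvec{a}_{\mathcal {S}})\varvec{a}]\) by a simple backprojection (standing in for the Lasso output \(\hat{\varvec{z}}\)) and selects the variables by keeping the s largest entries in magnitude.

```python
import numpy as np

rng = np.random.default_rng(7)
p, s, n_mc = 40, 5, 200_000

S = np.sort(rng.choice(p, size=s, replace=False))   # true set of active variables

def f(a_S):
    """f(a_S) = (1/sqrt(s)) * sum_{j in S} sign(a_j); satisfies (4.14)/(4.15) for Gaussian a."""
    return np.sum(np.sign(a_S), axis=-1) / np.sqrt(s)

a = rng.standard_normal((n_mc, p))
y = f(a[:, S])                                       # noiseless observations (4.13)
TS_hat = a.T @ y / n_mc                              # backprojection ~ T S = E[f(a_S) a]

S_hat = np.sort(np.argsort(np.abs(TS_hat))[-s:])     # hard thresholding: s largest entries
print("true S       :", S)
print("selected S   :", S_hat)
print("entries on S :", np.round(TS_hat[S], 3),
      " (theory: sqrt(2/pi)/sqrt(s) =", round(float(np.sqrt(2 / np.pi) / np.sqrt(s)), 3), ")")
```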

5 Uniform Recovery Without Increment Conditions

We have seen in Sects. 4.1–4.3 that the increment conditions of Assumption 2 can lead to inferior error decay rates for quantizing output functions, due to their points of discontinuity. In this section, we present a workaround that can do without any (sub-Gaussian) increment conditions and thereby enables significantly better rates. The basic idea is to cover the signal set \(\mathcal {X}\) by an \(\varepsilon \)-net \(\mathcal {X}_\varepsilon \) for some \(\varepsilon > 0\) and to apply the non-uniform version of Theorem 2 to each \(\varvec{x}\in \mathcal {X}_\varepsilon \) separately. Taking a union bound then yields uniform recovery on \(\mathcal {X}_\varepsilon \). The final, and most crucial, ingredient that allows us to pass over to the entire signal set \(\mathcal {X}\) is the following local stability condition on the observation model. It is based on the (outlier) noise bounds in (2.4) and essentially requires that close points in \(T\mathcal {X}\) imply close observation vectors.

Assumption 3

Let Assumption 1 be satisfied, and let \(t > 0\) and \(\varepsilon > 0\). For \(m_0\in \{0, 1, \dots , \lfloor \tfrac{m}{2}\rfloor \}\), \(\varDelta >0\), and \(\eta \in [0,1]\), we assume that the following holds with probability at least \(1-\eta \):

$$\begin{aligned}&{\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x},\, \varvec{x}'\in \mathcal {X}\\ \Vert T\varvec{x}-T\varvec{x}'\Vert _{2}\le \varepsilon \end{array}}}\tfrac{1}{\sqrt{m}}\Vert \tilde{\varvec{y}}(\varvec{x}) - \tilde{\varvec{y}}(\varvec{x}')\Vert _{[2m_0]}\le \tfrac{1}{2}\varDelta t \quad \text { and } \nonumber \\&\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x},\, \varvec{x}'\in \mathcal {X}\\ \Vert T\varvec{x}-T\varvec{x}'\Vert _{2}\le \varepsilon \end{array}}\tfrac{1}{\sqrt{m}}\sigma _{m_0}(\tilde{\varvec{y}}(\varvec{x}) - \tilde{\varvec{y}}(\varvec{x}'))_{2}\le \tfrac{t}{40}. \end{aligned}$$
(5.1)

The above strategy leads to the following general uniform recovery guarantee; see Sect. 9 for a detailed proof.

Theorem 3

There exist universal constants \(c, C, C_0 > 0\) for which the following holds.

Let Assumptions 1 and 3 be satisfied, let \(r{:=}{{\mathrm{sup}}}_{\varvec{x}\in \mathcal {X}}\Vert \langle \varvec{a}, T\varvec{x}\rangle -\tilde{y}(\varvec{x})\Vert _{\psi _2}\), and assume \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\) for every \(\varvec{x}\in \mathcal {X}\). For \(u\ge 1\) and \(u_0 \ge \sqrt{2m_0\log (em/2m_0)}\), we assume that

$$\begin{aligned} m\ge & {} C \cdot L^2 \cdot \Big ((\log L+t^{-2} r^2) \cdot \big (\mathop {{\mathrm{sup}}}\limits _{\varvec{x}\in \mathcal {X}}w_{t}^2(K- T\varvec{x}) + u^2 \big ) \nonumber \\&+ L^2\varDelta ^2 \cdot \big (\mathop {{\mathrm{sup}}}\limits _{\varvec{x}\in \mathcal {X}}w_{t}^2(K- T\varvec{x})+ u_0^2 \big )\Big ) \end{aligned}$$
(5.2)

and

$$\begin{aligned} \min \{u^2,u_0^2\} \ge C_0\cdot \log \mathcal {N}(T\mathcal {X}, \varepsilon ). \end{aligned}$$
(5.3)

Then with probability at least \(1 - \exp (-c u^2) - \exp (-c u_0^2)-\eta \) on the random draw of \(\{(\varvec{a}_i, F_i)\}_{i = 1}^m\), the following holds uniformly for every \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \mathbb {R}^m\) be any input vector such that

$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[2m_0]}\le \tfrac{1}{2}\varDelta t \quad \text { and } \quad \tfrac{1}{\sqrt{m}}\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2}\le \tfrac{t}{40}. \end{aligned}$$

Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le t + \varepsilon \).

The statement of Theorem 3 strongly resembles the one of Theorem 2 with the important difference that the former does not rely on Assumption 2. In particular, the condition (5.2) does not depend on the increment parameters \(L_{t}\) and \(\hat{L}_{t}\) anymore, making it less restrictive than (2.3). It is worth pointing out that the global complexity of \(\mathcal {X}\) is now measured in terms of the covering number \(\mathcal {N}(T\mathcal {X}, \varepsilon )\) in (5.3) instead of the mean width \(w_{}(T\mathcal {X})\). Furthermore, (5.2) establishes a refined local complexity measure, due to \({\mathrm{sup}}_{\varvec{x}\in \mathcal {X}} w_{t}^2(K- T\varvec{x}) \le w_{t}^2(K- T\mathcal {X})\); see also Proposition 1. Nevertheless, the gain of Theorem 3 is obviously linked to the verification of Assumption 3, which is usually a highly non-trivial task.

We now present two applications of Theorem 3 in the prototypical case of 1-bit observations. In this specific situation, it turns out that Assumption 3 is compatible with uniform bounds for binary embeddings, which allows us to make use of related results from the literature. Our first application is based on the following embedding guarantee by Oymak and Recht [45] for noiseless, Gaussian 1-bit measurements (cf. Sect. 4.1). Hereafter, we agree on the shortcut notation \([H]_\varepsilon {:=}(\tfrac{1}{\varepsilon }H- \tfrac{1}{\varepsilon }H) \cap \mathbb {B}_2^p\) for any subset \(H\subset \mathbb {R}^p\) and \(\varepsilon > 0\); also note that the parameter \(w_{}([H]_\varepsilon )\) is virtually the same as the local mean width \(w_{\varepsilon }(H- H)\) except that the former intersects with \(\mathbb {B}_2^p\) instead of \(\mathbb {S}^{p-1}\).

Theorem 4

([45, Thm. 3.2]) There exist universal constants \(c, {\bar{c}}, C > 0\) for which the following holds.

Let \(H\subset \mathbb {S}^{p-1}\), and let \(\varvec{A}\in \mathbb {R}^{m\times p}\) be a random matrix with independent standard Gaussian row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\). For \(\beta \in (0,1)\) and \(\varepsilon \le {\bar{c}}\beta / \sqrt{\log (e/\beta )}\), we assume that

$$\begin{aligned} m\ge C\cdot \Big (\varepsilon ^2 \beta ^{-3} \cdot w_{}^2([H]_\varepsilon ) + \beta ^{-1} \cdot \log \mathcal {N}(H, \varepsilon )\Big ). \end{aligned}$$

Then, the following holds with probability at least \(1-\exp (-cm\beta )\):

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x},\, \varvec{x}'\in H\\ \Vert \varvec{x}- \varvec{x}'\Vert _{2}\le \varepsilon \end{array}}\tfrac{1}{2m}\Vert {{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x})-{{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x}')\Vert _{1}\le \beta . \end{aligned}$$

Combining Theorem 4 with Theorem 3 yields an improved version of Corollary 2 (see Sect. 9 for a proof):

Corollary 8

There exist universal constants \(c, c', c_0, C', C_0 > 0\) for which the following holds.

Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a standard Gaussian random vector \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). Let \(\mathcal {X}\subset \mathbb {R}^p\) and define \(T\varvec{x}{:=}\sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\) for \(\varvec{x}\in \mathcal {X}\). Moreover, let \(K\subset \mathbb {R}^p\) be a convex set such that \(T\mathcal {X}\subset K\). Fix \(t\in (0,1)\) and \(\varepsilon \le c't / \log (e/t)\). For \(u^2\ge \max \{1, C_0 \cdot \log \mathcal {N}(T\mathcal {X}, \varepsilon )\}\), we assume that

$$\begin{aligned} m&\ge C' \cdot \Big ( t^{-2} \cdot \big (\mathop {{\mathrm{sup}}}\limits _{\varvec{x}\in \mathcal {X}}w_{t}^2(K- T\varvec{x}) + u^2 \big )+ \varepsilon ^2 t^{-3}\log ^{3/2}(e/t)\cdot w_{}^2([T\mathcal {X}]_\varepsilon )\Big ). \end{aligned}$$
(5.4)

Finally, let \(\beta \in [0,1]\) be such that \(\beta \sqrt{\log (e/\beta )}\le c_0 t\). Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{\varvec{a}_i\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \{-1,1\}^m\) be given by (4.1) such that \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1}\le \tfrac{\beta }{2}\). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\big \Vert \hat{\varvec{z}}- \sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\big \Vert _{2} \le 2t\).

To better understand the oversampling behavior of this result, let us consider the situation where \(\varepsilon \asymp t / \log (e/t)\). Due to Sudakov minoration, we have that \(\log \mathcal {N}(T\mathcal {X}, \varepsilon ) \lesssim \varepsilon ^{-2} \cdot w_{}^2(T\mathcal {X})\) (e.g., see [64, Thm. 7.4.1]). Hence, in the worst case, Corollary 8 would imply an error decay rate of \(O(m^{-1/4})\) up to log factors. However, for many low-dimensional signal sets, such as subspaces or sparse vectors, it is possible to establish a much stronger bound of the form \(\log \mathcal {N}(T\mathcal {X}, \varepsilon ) \lesssim \log (\varepsilon ^{-1}) \cdot w_{}^2(T\mathcal {X})\), e.g., see [45, Sec. 2] (cf. footnote 11). In this case, Corollary 8 would yield an error decay rate of \(O(m^{-1/2})\) up to log factors. This is superior to Corollary 2 and matches the best possible rate that can be expected for (\({\mathsf {P}}_{K,\varvec{y}}\)) in general (see Remark 2(2)).
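
To sketch the calculation behind these two rates (informal; constants, the term \(w_{}^2([T\mathcal {X}]_\varepsilon )\), and the local mean widths are ignored): with \(\varepsilon \asymp t / \log (e/t)\), the dominating requirement in (5.4) is \(m \gtrsim t^{-2} u^2 \gtrsim t^{-2} \log \mathcal {N}(T\mathcal {X}, \varepsilon )\), so that

$$\begin{aligned} \log \mathcal {N}(T\mathcal {X}, \varepsilon ) \lesssim \varepsilon ^{-2} \cdot w_{}^2(T\mathcal {X}) \quad&\Rightarrow \quad m \gtrsim t^{-4} \log ^2(e/t) \cdot w_{}^2(T\mathcal {X}), \\ \log \mathcal {N}(T\mathcal {X}, \varepsilon ) \lesssim \log (\varepsilon ^{-1}) \cdot w_{}^2(T\mathcal {X}) \quad&\Rightarrow \quad m \gtrsim t^{-2} \log (e/t) \cdot w_{}^2(T\mathcal {X}), \end{aligned}$$

which corresponds to \(t \lesssim m^{-1/4}\) and \(t \lesssim m^{-1/2}\) up to logarithmic factors, respectively.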

Our second application of Theorem 3 concerns the setup of sub-Gaussian 1-bit observations with dithering, as considered in Sect. 4.2. In this case, the stability condition of Assumption 3 can be related to a recent embedding result of Dirksen and Mendelson [17], which is based on hyperplane tessellations for sub-Gaussian vectors.

Theorem 5

([17, Thm. 2.9]) For every \(L> 0\), there exist constants \(c, {\bar{c}}, C, {\tilde{C}}>0\) only depending on \(L\) for which the following holds.

Let \(\mathcal {X}\subset R\mathbb {B}_2^p\) for some \(R > 0\), and let \(\varvec{A}\in \mathbb {R}^{m\times p}\) be a random matrix with independent, isotropic, sub-Gaussian row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\). Moreover, assume that \(\max _{i\in [m]}\Vert \varvec{a}_i\Vert _{\psi _2}\le L\). Let \(\varvec{\tau }{:=}(\tau _1, \dots , \tau _m) \in \mathbb {R}^m\) be a random vector with independent entries that are uniformly distributed on \([-\lambda , \lambda ]\) for a parameter \(\lambda \ge {\tilde{C}}\cdot R\). In addition, suppose that \(\varvec{A}\) and \(\varvec{\tau }\) are independent. For \(\beta \in (0,1]\) and \(\varepsilon \le {\bar{c}}\beta / \sqrt{\log (e/\beta )}\), we assume that

$$\begin{aligned} m \ge C \cdot \Big (\varepsilon ^2 \beta ^{-3} \cdot w_{}^2([(2\lambda )^{-1}\mathcal {X}]_\varepsilon ) + \beta ^{-1} \cdot \log \mathcal {N}(\lambda ^{-1}\mathcal {X}, \varepsilon )\Big ). \end{aligned}$$

Then, the following holds with probability at least \(1-\exp (-cm\beta )\):

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x},\, \varvec{x}'\in \mathcal {X}\\ \lambda ^{-1}\Vert \varvec{x}- \varvec{x}'\Vert _{2}\le \varepsilon \end{array}}\tfrac{1}{2m}\Vert {{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x}+\varvec{\tau })-{{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x}'+\varvec{\tau })\Vert _{1}\le \beta . \end{aligned}$$
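In words, with the stated probability, any two signals \(\varvec{x}, \varvec{x}'\in \mathcal {X}\) whose rescaled versions \(\lambda ^{-1}\varvec{x}\) and \(\lambda ^{-1}\varvec{x}'\) are \(\varepsilon \)-close generate dithered 1-bit observation vectors that disagree in at most a \(\beta \)-fraction of their \(m\) entries. This is precisely the type of local stability required by Assumption 3.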

Combining Theorem 5 with Theorem 3 now yields an improved version of Corollary 3 (see Sect. 9 for a proof):

Corollary 9

For every \(L> 0\), there exist universal constants \(c_0, C_0>0\) and constants \(c, c', {\tilde{C}}, C'>0\) only depending on \(L\) for which the following holds.

Let \(\varvec{a}_1, \dots , \varvec{a}_m \in \mathbb {R}^p\) be independent copies of a centered, isotropic, sub-Gaussian random vector \(\varvec{a}\in \mathbb {R}^p\) with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). Let \(\tau _1, \dots , \tau _m\) be independent copies of a random variable \(\tau \) that is uniformly distributed on \([-\lambda , \lambda ]\) for a parameter \(\lambda >0\). In addition, suppose that \(\{\varvec{a}_i\}_{i=1}^m\) and \(\{\tau _i\}_{i=1}^m\) are independent. Let \(\mathcal {X}\subset R \mathbb {B}_2^p\) for some \(R > 0\), and let \(K\subset \mathbb {R}^p\) be a convex set such that \(\lambda ^{-1}\mathcal {X}\subset K\). Fix \(t\in (0,1]\) and \(\varepsilon \le c't / \log (e/t)\). For \(u^2\ge \max \{1, C_0\cdot \log \mathcal {N}(\lambda ^{-1}\mathcal {X}, \varepsilon )\}\), we assume that

$$\begin{aligned} \lambda&\ge {\tilde{C}} \cdot R \cdot \sqrt{\log (e/t)}, \ \nonumber \\ m&\ge C' \cdot \Big ( t^{-2} \cdot \big (\mathop {{\mathrm{sup}}}\limits _{\varvec{x}\in \mathcal {X}}w_{t}^2(K- \lambda ^{-1}\varvec{x})+ u^2 \big ) + \varepsilon ^2 t^{-3} \log ^{3/2}(e/t)\cdot w_{}^2([(2\lambda )^{-1}\mathcal {X}]_\varepsilon ) \Big ). \end{aligned}$$
(5.5)

Finally, let \(\beta \in [0,1]\) be such that \(\beta \sqrt{\log (e/\beta )}\le c_0 L^{-2} t\). Then with probability at least \(1 - \exp (-c u^2)\) on the random draw of \(\{(\varvec{a}_i, \tau _i)\}_{i = 1}^m\), the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\): Let \(\varvec{y}\in \{-1,1\}^m\) be given by (4.3) such that \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1}\le \tfrac{\beta }{2}\). Then, every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \lambda ^{-1}\varvec{x}\Vert _{2} \le 2t\).
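In particular, rescaling by \(\lambda \) shows that the original signal is recovered up to the error \(\Vert \lambda \hat{\varvec{z}}- \varvec{x}\Vert _{2} \le 2\lambda t\).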

Analogously to noiseless 1-bit observations, Corollary 9 implies a decay rate of \(O(m^{-1/4})\) up to log factors in the worst case, which may be improved to \(O(m^{-1/2})\) for structured signal sets. This is in line with a recent finding of Dirksen and Mendelson [17, Thm. 1.7], who have analyzed a different estimator.

We close this section by pointing out that the error bounds achieved by Corollaries 8 and 9 still do not match the information-theoretically optimal rate of \(O(m^{-1})\), e.g., see [18, 34]. We suspect that this gap is not an artifact of our proof, but rather reflects a fundamental performance limit of the generalized Lasso (\({\mathsf {P}}_{K,\varvec{y}}\)). A potential remedy is to consider a more specialized reconstruction method, e.g., see [36].

6 Conclusion

This section highlights several key aspects of this work and discusses them in the light of our initial objectives as well as remaining challenges.

  1. (1)

    Observation models. Probably the greatest benefit of our approach to Problem 1 is its flexibility. The setup of Assumption 1 allows us to implement and analyze virtually every nonlinear observation model that is conceivable for the program (\({\mathsf {P}}_{K,\varvec{y}}\)). The examples that we have seen in Sect. 4 are only a selection of possible applications. Importantly, each of these results is accompanied by a conceptually simple proof that does not require any deeper insight into the underlying model mechanisms; see also the proof template at the beginning of Sect. 8. Therefore, our methodology may form a general path toward competitive benchmark guarantees. However, the resulting oversampling rates do not always match the best possible rates from the literature, due to the increment condition of Assumption 2. In Sect. 5, we have presented a workaround for this issue, based on an additional local stability assumption. This strategy can indeed lead to near-optimal error bounds, but it relies on the availability of a strong embedding result in each considered case.

  2. (2)

    Non-Gaussian measurements. Our main result reveals the impact of non-Gaussian measurements, in particular, when consistent recovery via (\({\mathsf {P}}_{K,\varvec{y}}\)) can be expected and when not. While relaxing the conditions on isotropy and sub-Gaussian tails in Assumption 1 is certainly practicable (see Remark 3(1)), our analysis is inherently limited to independent measurement vectors. This excludes more structured measurement ensembles, such as partial random circulant matrices or lossless expanders [21, Chaps. 12 & 13]. On the other hand, such schemes typically satisfy a variant of the restricted isometry property, which is the basis for some recent advances in quantized compressed sensing, e.g., see [16, 20, 33, 65]. These findings provide evidence that at least parts of our results remain valid for a much larger family of measurement vectors. Thus, any guarantee with structured measurements that meets the generality of Theorem 2 would be significant.

  3. (3)

    Complexity parameters. The research on high-dimensional signal estimation has shown that the (local) Gaussian mean width is a natural, yet accurate complexity measure for convex programs such as (\({\mathsf {P}}_{K,\varvec{y}}\)). A distinctive feature of our approach is the role played by the (not necessarily convex) signal set \(\mathcal {X}\). As indicated in Sect. 3, \(\mathcal {X}\) can be viewed as a characteristic of the observation model that allows us to study any situation between non-uniform and (fully) uniform recovery. Note that this refinement is also a crucial ingredient of Proposition 1 and Theorem 3. Nevertheless, it is important to bear in mind that the Gaussian mean width is not trivial to estimate in specific situations, e.g., see (3.3) and Example 1.

  4. (4)

    (Non-)convexity and data-driven priors. Remarkably, Fact 1 is the only argument in our proof that relies on the convexity of the constraint set \(K\). In principle, it is possible to drop this assumption on \(K\) by investigating the projected gradient descent method as an algorithmic implementation of (\({\mathsf {P}}_{K,\varvec{y}}\)), e.g., see [46, 47, 56]; a schematic sketch of this method is given after this list. However, the feasibility of this approach depends on the existence of an efficient projection onto \(K\).Footnote 12 A modern line of research on signal processing with non-convex optimization advocates the use of data-driven priors—a natural consequence of many recent advances in generative models in machine learning research, e.g., see [5, 39] and the references therein. Although the algorithmic strategy behind these methods bears resemblance to (\({\mathsf {P}}_{K,\varvec{y}}\)), we believe that the complexity of learned signal priors leads to a more difficult mathematical problem, whose understanding is still in its infancy. This is particularly underpinned by the fact that even the situation of convex priors is not sufficiently well understood yet (see [26, 27, 40]).
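To illustrate the algorithmic strategy mentioned in item (4), the following is a minimal sketch of projected gradient descent for (\({\mathsf {P}}_{K,\varvec{y}}\)); it is not taken from [46, 47, 56] and assumes that the objective of (\({\mathsf {P}}_{K,\varvec{y}}\)) is the constrained least-squares loss \(\varvec{z}\mapsto \tfrac{1}{2m}\Vert \varvec{A}\varvec{z}- \varvec{y}\Vert _{2}^2\) and that an efficient Euclidean projection onto \(K\) is available. For concreteness, \(K\) is instantiated as a Euclidean ball; all function names are hypothetical.

```python
import numpy as np

def project_onto_ball(z, radius):
    """Euclidean projection onto the ball {z : ||z||_2 <= radius}."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else (radius / norm) * z

def projected_gradient_descent(A, y, radius, step=None, iters=500):
    """Approximately solve min_{||z||_2 <= radius} (1/2m) * ||A z - y||_2^2."""
    m, p = A.shape
    if step is None:
        # reciprocal of a Lipschitz bound on the gradient of the loss
        step = m / (np.linalg.norm(A, 2) ** 2)
    z = np.zeros(p)
    for _ in range(iters):
        grad = A.T @ (A @ z - y) / m  # gradient of the least-squares loss
        z = project_onto_ball(z - step * grad, radius)
    return z
```

For a non-convex constraint set, only the projection step would have to be replaced by the corresponding (possibly non-convex or approximate) projection, which is exactly the computational bottleneck alluded to in item (4).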

7 Proof of the Main Result (Theorem 2)

Recall that according to Fact 1, it suffices to show that for all \(\varvec{x}\in \mathcal {X}\) and \(\varvec{y}\in \mathbb {R}^m\) satisfying \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\) and (2.4), we have that

$$\begin{aligned} \inf _{\varvec{v}\in K_{\varvec{x},t}}\mathcal {E}_{\varvec{x}}(\varvec{v}) > 0. \end{aligned}$$

To this end, we make use of the decomposition in (2.2), i.e.,

$$\begin{aligned} \mathcal {E}_{\varvec{x}}(\varvec{v}) = \mathcal {Q}_{}(\varvec{v}) + \mathcal {N}_{\varvec{x}}(\varvec{v}) + \mathcal {M}_{\varvec{x}}(\varvec{v}), \end{aligned}$$

and continue by showing separate bounds for all three terms, where each bound holds uniformly for all \(\varvec{x}\in \mathcal {X}\) and \(\varvec{v}\in K_{\varvec{x},t}\). Unless stated otherwise, we assume throughout this section that the hypotheses of Theorem 2 are fulfilled and that \(t > 0\); the case \(t = 0\) is in fact much simpler, see Step 4b. Moreover, we will use the notation \(K_{\mathcal {X},t} {:=}\cup _{\varvec{x}\in \mathcal {X}}K_{\varvec{x},t}=(K- T\mathcal {X}) \cap t\mathbb {S}^{p-1}\).

Step 1: Bounding the quadratic term. Let \(\varvec{A}\in \mathbb {R}^{m\times p}\) denote the matrix with row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\). In order to control the random variable

$$\begin{aligned} \mathcal {Q}_{}(\varvec{v})=\tfrac{1}{m} \sum _{i=1}^m\langle \varvec{a}_i, \varvec{v}\rangle ^2=\big \Vert \tfrac{1}{\sqrt{m}}\varvec{A}\varvec{v}\big \Vert _{2}^2 \end{aligned}$$

uniformly for all \(\varvec{v}\in K_{\mathcal {X},t}\), we make use of the following recent matrix deviation inequality for sub-Gaussian matrices.

Theorem 6

([35, Cor. 1.2]) There exists a universal constant \(C_{\mathcal {Q}_{}} > 0\) for which the following holds.

Let \(H\subset \mathbb {R}^p\), and let \(\varvec{A}\in \mathbb {R}^{m\times p}\) be a random matrix with independent, isotropic, sub-Gaussian row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\). Moreover, assume that \(\max _{i\in [m]}\Vert \varvec{a}_i\Vert _{\psi _2}\le L\) for some \(L> 0\). Then for every \(u>0\), we have the following with probability at least \(1-3\exp (-u^2)\):

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in H} \Big | \big \Vert \tfrac{1}{\sqrt{m}}\varvec{A}\varvec{v}\big \Vert _{2} - \Vert \varvec{v}\Vert _{2}\Big |\le C_{\mathcal {Q}_{}} \cdot L\sqrt{\log L} \cdot \frac{w_{}(H) +u\cdot {{\mathrm{sup}}}_{\varvec{v}\in H}\Vert \varvec{v}\Vert _{2}}{\sqrt{m}}. \end{aligned}$$

We now apply this result to \(H{:=}K_{\mathcal {X},t} \subset t \mathbb {S}^{p-1}\) within the setting of Theorem 2: According to the first branch of the assumption (2.3), we have that \(m \ge C \cdot L^2 \log L\cdot \big ( w_{t}^2(K- T\mathcal {X}) + u^2 \big )\). Hence, by adjusting the universal constant C (only depending on \(C_{\mathcal {Q}_{}}\)), Theorem 6 implies that the following holds with probability at least \(1-3\exp (-u^2)\):

$$\begin{aligned} \tfrac{3t^2}{2}\ge \mathcal {Q}_{}(\varvec{v})\ge \tfrac{t^2}{2} \qquad \text {for all } \varvec{v}\in K_{\mathcal {X},t}. \end{aligned}$$
(7.1)
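To spell out the elementary step behind (7.1): since every \(\varvec{v}\in K_{\mathcal {X},t}\) satisfies \(\Vert \varvec{v}\Vert _{2} = t\), the two-sided bound follows as soon as the right-hand side of the deviation inequality in Theorem 6 (applied to \(H{:=}K_{\mathcal {X},t}\)) does not exceed \((\sqrt{3/2}-1)\cdot t\), because then \(\big \Vert \tfrac{1}{\sqrt{m}}\varvec{A}\varvec{v}\big \Vert _{2}\in \big [(2-\sqrt{3/2})\cdot t,\ \sqrt{3/2}\cdot t\big ]\) and \((2-\sqrt{3/2})^2\ge \tfrac{1}{2}\).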

Step 2: Bounding the noise term. In order to control the noise term \(\mathcal {N}_{\varvec{x}}(\varvec{v})\), we require the following uniform upper bound for subsums of the quadratic term.

Theorem 7

([17, Thm. 2.10]) There exist universal constants \(c, C_{\mathcal {N}_{}}> 0\) for which the following holds.

Let \(H\subset \mathbb {R}^p\), and let \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\) be independent, isotropic, sub-Gaussian random vectors such that \(\max _{i\in [m]}\Vert \varvec{a}_i\Vert _{\psi _2}\le L\) for some \(L> 0\). Then for every \(m_0\in \{0,1, \dots , m\}\) and \(u_0 \ge \sqrt{m_0\log (em/m_0)}\), we have the following with probability at least \(1-2\exp (-cu_0^2)\):

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in H} \ \max _{\begin{array}{c} \mathcal {I}\subset [m]\\ |\mathcal {I}|\le m_0 \end{array}} \Big (\tfrac{1}{m}\sum _{i\in \mathcal {I}}|\langle \varvec{a}_i, \varvec{v}\rangle |^2\Big )^{1/2} \le C_{\mathcal {N}_{}} \cdot L^2\cdot \frac{w_{}(H) + u_0 \cdot {{\mathrm{sup}}}_{\varvec{v}\in H}\Vert \varvec{v}\Vert _{2}}{\sqrt{m}}. \end{aligned}$$

For \(\varvec{w}\in \mathbb {R}^m\), we denote by \(\mathcal {I}_{m_0}(\varvec{w})\subset [m]\) any (possibly non-unique) index set that corresponds to the \(m_0\) largest entries of \(\varvec{w}\) in magnitude, i.e., for all \(i\in \mathcal {I}_{m_0}(\varvec{w})\) and \(i' \in \mathcal {I}_{m_0}(\varvec{w})^c\), we have that \(|w_i|\ge |w_{i'}|\); note that for \(m_0=0\), we simply have \(\mathcal {I}_{m_0}(\varvec{w})=\emptyset \). With this notation at hand, observe that

$$\begin{aligned} \Vert \varvec{w}\Vert _{[m_0]} = \Big (\sum _{i \in \mathcal {I}_{m_0}(\varvec{w})} |w_i|^2 \Big )^{1/2} \quad \text {and} \quad \sigma _{m_0}(\varvec{w})_2 = \Big (\sum _{i \in \mathcal {I}_{m_0}(\varvec{w})^c} |w_i|^2 \Big )^{1/2}. \end{aligned}$$
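For instance (a purely illustrative example), if \(m = 4\), \(m_0 = 2\), and \(\varvec{w}= (3, -1, 2, 0)\), then one may take \(\mathcal {I}_{m_0}(\varvec{w}) = \{1, 3\}\), so that \(\Vert \varvec{w}\Vert _{[m_0]} = \sqrt{13}\) and \(\sigma _{m_0}(\varvec{w})_2 = 1\).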

Now, consider the specific choice \(\varvec{w}{:=}\varvec{y}-\tilde{\varvec{y}}(\varvec{x})\). The Cauchy–Schwarz inequality then implies

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in K_{\varvec{x},t}}|\mathcal {N}_{\varvec{x}}(\varvec{v})|&\le \tfrac{2}{m} \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in K_{\varvec{x},t}} \ \sum _{i\in \mathcal {I}_{m_0}(\varvec{w})} |w_i| \cdot |\langle \varvec{a}_i, \varvec{v}\rangle |\\&\quad + \tfrac{2}{m} \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in K_{\varvec{x},t}} \ \sum _{i\in \mathcal {I}_{m_0}(\varvec{w})^c} |w_i| \cdot |\langle \varvec{a}_i, \varvec{v}\rangle |\\&\le \tfrac{2}{\sqrt{m}} \Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[m_0]} \cdot \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in K_{\mathcal {X},t}} \ \max _{\begin{array}{c} \mathcal {I}\subset [m]\\ |\mathcal {I}|\le m_0 \end{array}}\Big (\tfrac{1}{m}\sum _{i\in \mathcal {I}}|\langle \varvec{a}_i, \varvec{v}\rangle |^2\Big )^{1/2}\\&\quad + \tfrac{2}{\sqrt{m}} \sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_2 \cdot \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in K_{\mathcal {X},t}} \sqrt{\mathcal {Q}_{}(\varvec{v})}. \end{aligned}$$

Let us now estimate the first summand of this bound. According to Theorem 7, there exist universal constants \(c, C_{\mathcal {N}_{}} > 0\) such that for every \(u_0 \ge \sqrt{m_0\log (em/m_0)}\), we have the following with probability at least \(1-2\exp (-c u_0^2)\):

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in K_{\mathcal {X},t}} \ \max _{\begin{array}{c} \mathcal {I}\subset [m]\\ |\mathcal {I}|\le m_0 \end{array}}\Big (\tfrac{1}{m}\sum _{i\in \mathcal {I}}|\langle \varvec{a}_i, \varvec{v}\rangle |^2\Big )^{1/2}\le & {} C_{\mathcal {N}_{}} \cdot L^2\cdot \frac{w_{}(K_{\mathcal {X},t}) + u_0 \cdot t}{\sqrt{m}} \\\le & {} \tfrac{1}{16} \varDelta ^{-1} t , \end{aligned}$$

where the last inequality follows from the second branch of (2.3). Hence, taking a union bound with the event of (7.1), it follows that with probability at least \(1-3\exp (-u^2) - 2\exp (-cu_0^2)\), the following holds uniformly for every \(\varvec{x}\in \mathcal {X}\) and every \(\varvec{y}\in \mathbb {R}^m\) satisfying the condition (2.4):

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in K_{\varvec{x},t}}|\mathcal {N}_{\varvec{x}}(\varvec{v})|\le 2 \cdot \tfrac{t}{16} \cdot t + 2 \cdot \sqrt{\tfrac{3}{2}} \cdot t \cdot \tfrac{t}{20}\le \tfrac{t^2}{4}. \end{aligned}$$
(7.2)
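Here, the first summand equals \(\tfrac{t^2}{8}\), while the second one is bounded by \(\tfrac{\sqrt{6}}{20}\cdot t^2\le \tfrac{t^2}{8}\) since \(\sqrt{6}\le \tfrac{5}{2}\); together, this yields the stated bound of \(\tfrac{t^2}{4}\).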

Step 3: Bounding the multiplier term. Our goal is to bound the random variable

$$\begin{aligned} \tfrac{1}{2}\mathcal {M}_{\varvec{x}}(\varvec{v}) = \tfrac{1}{m} \sum _{i=1}^m(\langle \varvec{a}_i, T\varvec{x}\rangle - \tilde{y}_i(\varvec{x})) \langle \varvec{a}_i, \varvec{v}\rangle \end{aligned}$$

uniformly from below for all \(\varvec{v}\in K_{\varvec{x},t}\) and \(\varvec{x}\in \mathcal {X}\) that satisfy \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\). Let us begin by adding and subtracting the expected value:

$$\begin{aligned} \tfrac{1}{2}\mathcal {M}_{\varvec{x}}(\varvec{v})\ge \Big (\inf _{\varvec{v}\in K_{\varvec{x},t}}\tfrac{1}{2}\mathbb {E}[\mathcal {M}_{\varvec{x}}(\varvec{v})]\Big ) - \tfrac{1}{2}\big |\mathcal {M}_{\varvec{x}}(\varvec{v})- \mathbb {E}[\mathcal {M}_{\varvec{x}}(\varvec{v})]\big |. \end{aligned}$$
(7.3)

Since \(\varvec{a}\) is isotropic, we observe that \(\tfrac{1}{2}\mathbb {E}[\mathcal {M}_{\varvec{x}}(\varvec{v})]=\langle T\varvec{x}- \mathbb {E}[\tilde{y}(\varvec{x}) \varvec{a}], \varvec{v}\rangle \). Combining this with Definition 1, we obtain a lower bound in terms of the target mismatch:

$$\begin{aligned} \inf _{\varvec{v}\in K_{\varvec{x},t}}\tfrac{1}{2}\mathbb {E}[\mathcal {M}_{\varvec{x}}(\varvec{v})]&\ge t \cdot \inf _{\varvec{v}\in \mathbb {S}^{p-1}}\langle T\varvec{x}- \mathbb {E}[\tilde{y}(\varvec{x}) \varvec{a}], \varvec{v}\rangle \nonumber \\&= - t \cdot \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in \mathbb {S}^{p-1}}\langle \mathbb {E}[\tilde{y}(\varvec{x}) \varvec{a}] - T\varvec{x}, \varvec{v}\rangle \nonumber \\&= - t \cdot \big \Vert \mathbb {E}[\tilde{y}(\varvec{x}) \varvec{a}] - T\varvec{x}\big \Vert _{2} = - t \cdot \rho _{}(\varvec{x}). \end{aligned}$$
(7.4)

We now turn to the centered multiplier term in (7.3). To this end, recall Assumption 2 and let \(\{\tilde{y}_{t,i}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) be independent copies of the class \(\{\tilde{y}_{t}(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) for \(i = 1, \dots , m\).Footnote 13 Inserting these observation variables and applying the triangle inequality then yield

$$\begin{aligned}&\tfrac{1}{2}|\mathcal {M}_{\varvec{x}}(\varvec{v})- \mathbb {E}[\mathcal {M}_{\varvec{x}}(\varvec{v})]| \nonumber \\&\quad \le \tfrac{1}{m} \Big |\sum _{i=1}^m\underbrace{(\langle \varvec{a}_i, T\varvec{x}\rangle - \tilde{y}_{t,i}(\varvec{x}))}_{{=:}\xi _{t,i}(\varvec{x})} \langle \varvec{a}_i, \varvec{v}\rangle - \mathbb {E}[(\langle \varvec{a}_i, T\varvec{x}\rangle - \tilde{y}_{t,i}(\varvec{x})) \langle \varvec{a}_i, \varvec{v}\rangle ]\Big |\nonumber \\&\qquad + \mathbb {E}{}[\underbrace{|\tilde{y}(\varvec{x}) - \tilde{y}_{t}(\varvec{x})|}_{= \varepsilon _t(\varvec{x})} {}\cdot {} |\langle \varvec{a}, \varvec{v}\rangle |] + \tfrac{1}{m} \sum _{i=1}^m \underbrace{|\tilde{y}_{t,i}(\varvec{x}) - \tilde{y}_i(\varvec{x})|}_{{=:}\varepsilon _{t,i}(\varvec{x})} {}\cdot {} |\langle \varvec{a}_i, \varvec{v}\rangle |. \end{aligned}$$
(7.5)

Note that the second empirical product process in (7.5) is not centered yet due to pulling in the absolute values—a crucial step to ensure that the resulting factors have sub-Gaussian increments. Adding and subtracting the expected value of the last summand in (7.5) and using Assumption 2(a), we obtain the following upper bound:

$$\begin{aligned}&\tfrac{1}{2}|\mathcal {M}_{\varvec{x}}(\varvec{v})- \mathbb {E}[\mathcal {M}_{\varvec{x}}(\varvec{v})]| \le \tfrac{1}{m} \Big |\sum _{i=1}^m\xi _{t,i}(\varvec{x}) \langle \varvec{a}_i, \varvec{v}\rangle - \mathbb {E}[\xi _{t,i}(\varvec{x}) \langle \varvec{a}_i, \varvec{v}\rangle ]\Big |\\&\quad + \tfrac{1}{m} \Big |\sum _{i=1}^m \varepsilon _{t,i}(\varvec{x}) \cdot |\langle \varvec{a}_i, \varvec{v}\rangle | - \mathbb {E}[\varepsilon _{t,i}(\varvec{x}) \cdot |\langle \varvec{a}_i, \varvec{v}\rangle |]\Big | + \underbrace{2 \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{v}\rangle |]}_{\le \tfrac{t^2}{32}}. \end{aligned}$$

This estimate in conjunction with (7.4) implies that the following holds uniformly for all \(\varvec{x}\in \mathcal {X}\) with \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\):

$$\begin{aligned} \inf _{\varvec{v}\in K_{\varvec{x},t}}\mathcal {M}_{\varvec{x}}(\varvec{v})\ge & {} - \frac{t^2}{8} - 2\mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in K_{\varvec{x},t}} \tfrac{1}{m} \Big |\sum _{i=1}^m\xi _{t,i}(\varvec{x}) \langle \varvec{a}_i, \varvec{v}\rangle - \mathbb {E}[\xi _{t,i}(\varvec{x}) \langle \varvec{a}_i, \varvec{v}\rangle ]\Big |\nonumber \\&- 2\mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in K_{\varvec{x},t}} \tfrac{1}{m} \Big |\sum _{i=1}^m \varepsilon _{t,i}(\varvec{x}) \cdot |\langle \varvec{a}_i, \varvec{v}\rangle | - \mathbb {E}[\varepsilon _{t,i}(\varvec{x}) \cdot |\langle \varvec{a}_i, \varvec{v}\rangle |]\Big |\nonumber \\\ge & {} - \frac{t^2}{8} - 2{\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x}\in \mathcal {X}\\ \varvec{v}\in K_{\mathcal {X},t} \end{array}}} \tfrac{1}{m} \Big |\sum _{i=1}^m\xi _{t,i}(\varvec{x}) \langle \varvec{a}_i, \varvec{v}\rangle - \mathbb {E}[\xi _{t,i}(\varvec{x}) \langle \varvec{a}_i, \varvec{v}\rangle ]\Big |\nonumber \\&- 2{\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x}\in \mathcal {X}\\ \varvec{v}\in K_{\mathcal {X},t} \end{array}}} \tfrac{1}{m} \Big |\sum _{i=1}^m \varepsilon _{t,i}(\varvec{x}) \cdot |\langle \varvec{a}_i, \varvec{v}\rangle | - \mathbb {E}[\varepsilon _{t,i}(\varvec{x}) \cdot |\langle \varvec{a}_i, \varvec{v}\rangle |]\Big |. \qquad \end{aligned}$$
(7.6)

By Assumption 2(b) and (c), all factors of the product processes

$$\begin{aligned} \bigg \{\tfrac{1}{m}\sum _{i=1}^m\xi _{t,i}(\varvec{x}) \langle \varvec{a}_i, \varvec{v}\rangle \bigg \}_{\varvec{x}\in \mathcal {X},\ \varvec{v}\in K_{\mathcal {X},t}}, \quad \bigg \{\tfrac{1}{m}\sum _{i=1}^m \varepsilon _{t,i}(\varvec{x}) \cdot |\langle \varvec{a}_i, \varvec{v}\rangle |\bigg \}_{\varvec{x}\in \mathcal {X},\ \varvec{v}\in K_{\mathcal {X},t}} \end{aligned}$$
(7.7)

have sub-Gaussian increments. A key ingredient for controlling these empirical processes is a powerful concentration inequality due to Mendelson [42]. The following result is adapted from [42, Thm. 1.13].

Theorem 8

There exist universal constants \(c, C_{\mathcal {M}_{}} > 0\) for which the following holds.

Let \(\{g_{a}\}_{a\in \mathcal {A}}\) and \(\{h_{b}\}_{b\in \mathcal {B}}\) be stochastic processes indexed by two sets \(\mathcal {A}\) and \(\mathcal {B}\), both defined on a common probability space \((\varOmega , \mathsf {A}, \mathbb {P})\). We assume that there exist \(r_{\mathcal {A}}, r_{\mathcal {B}} \ge 0\) and pseudo-metrics \(d_{\mathcal {A}}\) on \(\mathcal {A}\) and \(d_{\mathcal {B}}\) on \(\mathcal {B}\) such that

$$\begin{aligned}&\Vert g_{a} - g_{a'}\Vert _{\psi _2}\le d_{\mathcal {A}}(a, a') \quad {\text {and}} \quad \Vert g_{a}\Vert _{\psi _2}\le r_{\mathcal {A}} \qquad \text {for all } a, a' \in \mathcal {A},\\&\Vert h_{b} - h_{b'}\Vert _{\psi _2}\le d_{\mathcal {B}}(b, b') \quad {\text {and}} \quad \Vert h_{b}\Vert _{\psi _2}\le r_{\mathcal {B}} \qquad \text {for all } b, b' \in \mathcal {B}. \end{aligned}$$

Finally, let \(X_1, \dots , X_m\) be independent copies of a random variable \(X\sim \mathbb {P}\). Then for every \(u\ge 1\), we have the following with probability at least \(1- 2 \exp (-cu^2)\):

$$\begin{aligned}&\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} a\in \mathcal {A}\\ b\in \mathcal {B} \end{array}} \ \tfrac{1}{m}\Big |\sum _{i=1}^m g_{a}(X_i) h_{b}(X_i) - \mathbb {E}[g_{a}(X_i) h_{b}(X_i)]\Big |\\&\quad \le C_{\mathcal {M}_{}} \cdot \bigg (\frac{\big (\gamma _2(\mathcal {A}, d_{\mathcal {A}}) + u\cdot r_{\mathcal {A}}\big )\cdot \big (\gamma _2(\mathcal {B}, d_{\mathcal {B}}) + u\cdot r_{\mathcal {B}}\big )}{m} \\&\qquad + \frac{r_{\mathcal {A}} \cdot \gamma _2(\mathcal {B}, d_{\mathcal {B}}) + r_{\mathcal {B}} \cdot \gamma _2(\mathcal {A}, d_{\mathcal {A}}) + u\cdot r_{\mathcal {A}} r_{\mathcal {B}}}{\sqrt{m}} \bigg ), \end{aligned}$$

where \(\gamma _2(\cdot ,\cdot )\) denotes Talagrand’s \(\gamma _2\)-functional (e.g., see [42, Def. 1.2]).

Let us now apply Theorem 8 to the first product process in (7.7). To this end, we observe that the class \(\{\langle \varvec{a}, \varvec{v}\rangle \}_{\varvec{v}\in K_{\mathcal {X},t}}\) has sub-Gaussian increments with respect to the pseudo-metric induced by \(L\Vert \cdot \Vert _{2}\), while Assumption 2(b) implies that \(\{\xi _t(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) has sub-Gaussian increments with respect to \(L_{t}\cdot d_T\). Hence, Theorem 8 implies that the following holds with probability at least \(1- 2 \exp (-cu^2)\):Footnote 14

$$\begin{aligned}&\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x}\in \mathcal {X}\\ \varvec{v}\in K_{\mathcal {X},t} \end{array}} \tfrac{1}{m} \Big |\sum _{i=1}^m\xi _{t,i}(\varvec{x}) \langle \varvec{a}_i, \varvec{v}\rangle - \mathbb {E}[\xi _{t,i}(\varvec{x}) \langle \varvec{a}_i, \varvec{v}\rangle ]\Big |\\&\quad \le C_{\mathcal {M}_{}} \cdot \bigg (\frac{\big (L_{t}\gamma _2(\mathcal {X}, d_T) + u\cdot r\big )\cdot \big (L\gamma _2(K_{\mathcal {X},t}, \Vert \cdot \Vert _{2}) + u\cdot Lt\big )}{m} \\&\qquad + \frac{r\cdot L\gamma _2(K_{\mathcal {X},t}, \Vert \cdot \Vert _{2}) + Lt \cdot L_{t}\gamma _2(\mathcal {X}, d_T) + u\cdot rLt}{\sqrt{m}} \bigg ). \end{aligned}$$

According to Talagrand’s majorizing measure theorem [59, Thm. 2.4.1], we have that

$$\begin{aligned} w_{}(H) \asymp \gamma _2(H, \Vert \cdot \Vert _{2}) \qquad \text {for every } H\subset \mathbb {R}^p. \end{aligned}$$

Moreover, since \(d_T\) is precisely the pseudo-metric induced by \(T\), i.e., \(d_T(\varvec{x},\varvec{x}')=\Vert T\varvec{x}- T\varvec{x}'\Vert _{2}\), it is not hard to see that \(\gamma _2(\mathcal {X}, d_T) = \gamma _2(T\mathcal {X}, \Vert \cdot \Vert _{2})\). Consequently, the following bound holds with probability at least \(1- 2 \exp (-cu^2)\):

$$\begin{aligned}&\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x}\in \mathcal {X}\\ \varvec{v}\in K_{\mathcal {X},t} \end{array}} \tfrac{1}{m} \Big |\sum _{i=1}^m\xi _{t,i}(\varvec{x}) \langle \varvec{a}_i, \varvec{v}\rangle - \mathbb {E}[\xi _{t,i}(\varvec{x}) \langle \varvec{a}_i, \varvec{v}\rangle ]\Big |\nonumber \\&\quad \le C_{\mathcal {M}_{}}' \cdot t \cdot L\cdot \bigg (\frac{\big (L_{t}w_{}(T\mathcal {X}) + u\cdot r\big ) \cdot \big ( w_{t}(K- T\mathcal {X}) + u\big )}{m} \nonumber \\&\qquad + \frac{r\cdot w_{t}(K- T\mathcal {X}) + L_{t}w_{}(T\mathcal {X}) + u\cdot r}{\sqrt{m}} \bigg ). \end{aligned}$$
(7.8)

The second product process in (7.7) can be treated similarly: The class \(\{|\langle \varvec{a}, \varvec{v}\rangle |\}_{\varvec{v}\in K_{\mathcal {X},t}}\) has sub-Gaussian increments with respect to \(L\Vert \cdot \Vert _{2}\), while Assumption 2(c) implies that \(\{\varepsilon _t(\varvec{x})\}_{\varvec{x}\in \mathcal {X}}\) has sub-Gaussian increments with respect to \(\hat{L}_{t}\cdot d_T\). An analogous application of Theorem 8 shows that with probability at least \(1- 2 \exp (-cu^2)\), we have that

$$\begin{aligned}&\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x}\in \mathcal {X}\\ \varvec{v}\in K_{\mathcal {X},t} \end{array}} \tfrac{1}{m} \Big |\sum _{i=1}^m \varepsilon _{t,i}(\varvec{x}) \cdot |\langle \varvec{a}_i, \varvec{v}\rangle | - \mathbb {E}[\varepsilon _{t,i}(\varvec{x}) \cdot |\langle \varvec{a}_i, \varvec{v}\rangle |]\Big |\nonumber \\&\quad \le C_{\mathcal {M}_{}}' \cdot t \cdot L\cdot \bigg (\frac{\big (\hat{L}_{t}w_{}(T\mathcal {X}) + u\cdot \hat{r}\big ) \cdot \big ( w_{t}(K- T\mathcal {X}) + u\big )}{m} \nonumber \\&\qquad + \frac{\hat{r}\cdot w_{t}(K- T\mathcal {X}) + \hat{L}_{t}w_{}(T\mathcal {X}) + u\cdot \hat{r}}{\sqrt{m}} \bigg ). \end{aligned}$$
(7.9)

We are ready to prove our main result.

Step 4a: Proof of Theorem 2 (\(t > 0\)). We now assume that the events corresponding to (7.1), (7.2), (7.8), and (7.9) have occurred jointly; by a union bound, this happens with probability at least \(1- 7\exp (-cu^2) - 2\exp (-cu_0^2)\) for an appropriate constant \(c > 0\), and the factors 7 and 2 can be removed by slightly adjusting \(c\). Combining these bounds with (7.6), the following holds uniformly for every \(\varvec{x}\in \mathcal {X}\) with \(\rho _{}(\varvec{x}) \le \tfrac{t}{32}\) and every \(\varvec{y}\in \mathbb {R}^m\) satisfying the condition (2.4):

$$\begin{aligned} \inf _{\varvec{v}\in K_{\varvec{x},t}}\mathcal {E}_{\varvec{x}}(\varvec{v})&\ge \inf _{\varvec{v}\in K_{\mathcal {X},t}}\mathcal {Q}_{}(\varvec{v}) - \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in K_{\varvec{x},t}}|\mathcal {N}_{\varvec{x}}(\varvec{v})| + \inf _{\varvec{v}\in K_{\varvec{x},t}}\mathcal {M}_{\varvec{x}}(\varvec{v})\\&\ge \tfrac{t^2}{2} - \tfrac{t^2}{4} - \tfrac{t^2}{8} \\&\quad - C_{\mathcal {M}_{}}' \cdot t \cdot L\cdot \bigg (\frac{( r+\hat{r}) \cdot w_{t}(K- T\mathcal {X}) + (L_{t}+ \hat{L}_{t}) \cdot w_{}(T\mathcal {X}) + u\cdot (r+\hat{r}) }{\sqrt{m}}\bigg ) \\&\quad - C_{\mathcal {M}_{}}' \cdot t \cdot L\cdot \bigg (\frac{\big ((L_{t}+ \hat{L}_{t}) \cdot w_{}(T\mathcal {X}) + u\cdot ( r+ \hat{r})\big ) \cdot \big ( w_{t}(K- T\mathcal {X}) + u\big )}{m}\bigg ). \end{aligned}$$

If we could show that this lower bound is strictly positive, the claim of Theorem 2 would follow directly from Fact 1. To conclude this argument, it is enough to have that

$$\begin{aligned} m&\ge C' \cdot L^2 t^{-2} \cdot \Big ( (r+\hat{r}) \cdot w_{t}(K- T\mathcal {X}) + (L_{t}+ \hat{L}_{t}) \cdot w_{}(T\mathcal {X}) + u\cdot (r+\hat{r})\Big )^2, \end{aligned}$$
(7.10)
$$\begin{aligned} m&\ge C' \cdot Lt^{-1} \cdot \big ((L_{t}+ \hat{L}_{t}) \cdot w_{}(T\mathcal {X}) + u\cdot ( r+\hat{r})\big ) \cdot \big (w_{t}(K- T\mathcal {X}) + u\big ), \end{aligned}$$
(7.11)

where \(C' > 0\) is an appropriate universal constant. Indeed, both (7.10) and (7.11) are consequences of (2.3): The bound of (7.10) is equivalent to the third branch of (2.3), while (7.11) follows by multiplying the first and third branches of (2.3) and taking the square root. Note that the previous argument also makes use of the basic fact that \((v + w)^2 \ge v^2 + w^2 \ge \tfrac{1}{2}(v + w)^2\) for all \(v, w \ge 0\), and that \(L^2 \gtrsim L\), since the isotropy of \(\varvec{a}\) enforces \(L\gtrsim 1\).

Step 4b: Proof of Theorem 2 (\(t = 0\)). We may assume that \(r= \hat{r}= L_{t}= \hat{L}_{t}= 0\), \(m_0= 0\), \(\varDelta = L^{-1}\sqrt{\log L}\), and \(u= u_0\), while only considering those \(\varvec{x}\in \mathcal {X}\) and \(\varvec{y}\in \mathbb {R}^m\) with \(\varvec{y}= \tilde{\varvec{y}}(\varvec{x})\) and \(\rho _{}(\varvec{x}) = 0\). Consequently, we are in the situation of a noiseless linear model, i.e., \(\tilde{y}_i(\varvec{x}) = \langle \varvec{a}_i, T\varvec{x}\rangle \) for \(i = 1, \dots , m\), so that \(\varvec{y}= \tilde{\varvec{y}}(\varvec{x}) = \varvec{A}(T\varvec{x})\). It follows that \(\mathcal {N}_{\varvec{x}}(\cdot ) = \mathcal {M}_{\varvec{x}}(\cdot ) = 0\). Analogously to (7.1), we can conclude that with probability at least \(1- 3\exp (-u^2)\), the quadratic term satisfies \(\mathcal {Q}_{}(\varvec{v}) \ge \tfrac{1}{2}\) for all \(\varvec{v}\in K_{\mathcal {X}, 0} {:=}{\text {cone}}(K- T\mathcal {X}) \cap \mathbb {S}^{p-1}\). On this event, let \(\hat{\varvec{z}}\in K\) be any minimizer of (\({\mathsf {P}}_{K,\varvec{y}}\)) with input \(\varvec{y}= \tilde{\varvec{y}}(\varvec{x})\) for some \(\varvec{x}\in \mathcal {X}\). If exact recovery failed, i.e., \(\hat{\varvec{v}} {:=}\hat{\varvec{z}}- T\varvec{x}\ne \varvec{0}\), then

$$\begin{aligned} \mathcal {E}_{\varvec{x}}(\hat{\varvec{v}}) = \mathcal {Q}_{}(\hat{\varvec{v}}) = \Vert \hat{\varvec{v}}\Vert _{2}^2 \cdot \mathcal {Q}_{}\Big (\tfrac{\hat{\varvec{v}}}{\Vert \hat{\varvec{v}}\Vert _{2}}\Big ) \ge \tfrac{1}{2} \Vert \hat{\varvec{v}}\Vert _{2}^2 > 0, \end{aligned}$$
(7.12)

where we have used that \(\hat{\varvec{v}}/\Vert \hat{\varvec{v}}\Vert _{2}\in K_{\mathcal {X}, 0}\); this contradicts the fact that \(\hat{\varvec{z}}\) is a solution to (\({\mathsf {P}}_{K,\varvec{y}}\)). \(\square \)

8 Proofs for Sect. 4

Each result in Sect. 4 is an application of Theorem 2 and follows the same proof template:

  • Step 1. How the prerequisites of Theorem 2 are met:

    • Step 1a. Implementation of Assumption 1.

    • Step 1b. Controlling the target mismatch \(\rho _{}(\varvec{x})\) for every \(\varvec{x}\in \mathcal {X}\).

    • Step 1c. Controlling the increment parameters of Assumption 2.

  • Step 2. Proof of the actual statement via Theorem 2.

  • Step 3. Verification of Step 1b.

  • Step 4. Verification of Step 1c.

8.1 Proofs for Sect. 4.1

Proof of Corollary 2

We follow the proof template from the beginning of Sect. 8:

  • Step 1a. The model setup of Corollary 2 fits into Assumption 1 as follows:

    • We have that \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) and therefore \(\Vert \varvec{a}\Vert _{\psi _2} \lesssim 1\). The signal set \(\mathcal {X}\) is an arbitrary subset of \(\mathbb {R}^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )\).

    • The target function \(T:\mathcal {X}\rightarrow K\) corresponds to the (scaled) normalization \(T\varvec{x}{:=}\sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')=\sqrt{\tfrac{2}{\pi }}\big \Vert \tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}} - \tfrac{\varvec{x}'}{\Vert \varvec{x}'\Vert _{2}}\big \Vert _{2}\).

  • Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).

  • Step 1c. There exists an approximation \(\tilde{y}_t(\varvec{x})\) of the observation variable \(\tilde{y}(\varvec{x})\) such that the conditions of Assumption 2 are fulfilled with

    $$\begin{aligned} L_{t}\lesssim 1 + t^{-1},\quad \hat{L}_{t}\lesssim t^{-1},\quad r\lesssim 1,\quad \hat{r}\lesssim 1. \end{aligned}$$
    (8.1)
  • Step 2. The first and third branches of the condition (2.3) are implied by (4.2) for a sufficiently large universal constant \(C'>0\). Since \(\phi (\beta ) {:=}\beta \sqrt{\log (e/\beta )}\) defines a continuous and non-decreasing function on [0, 1] with \(\phi (1)=1\) and \(\phi (0){:=}0\) (by continuous extension), we may assume without loss of generality that \(\beta \sqrt{\log (e/\beta )} = c_0 t \in (0,1]\). Now, we set

    $$\begin{aligned} \varDelta ^2 {:=}\frac{1}{t\sqrt{\log (e/\beta )}}, \quad m_0{:=}\lfloor \beta m\rfloor , \quad u_0 {:=}\sqrt{2 m \beta \log (e/\beta )}, \end{aligned}$$

    implying that \(u_0\ge \sqrt{m_0\log (em/m_0)}\) and \(u_0^2\ge 2 c_0 t m\). Combining the latter inequality with (4.2) for \(C'\) sufficiently large in particular implies that \(u_0^2 \ge u^2\). Furthermore, observe that \(\varDelta ^2 u_0^2=2c_0 m\) and \(\varDelta ^2 \cdot w_{t}^2(K- T\mathcal {X})\le t^{-1} \cdot w_{t}^2(K- T\mathcal {X})\). Hence, the second branch of the condition (2.3) is satisfied if \(c_0\) is chosen small enough and \(C'\) in (4.2) large enough. Next, we show that every \(\varvec{y}\in \{-1,1\}^m\) given by (4.1) with \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \beta \) also satisfies (2.4). Since \(\varvec{\nu }= \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\in \{-2,0,2\}^m\) and \(\tfrac{1}{2}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{1} \le \beta m\), it follows that \(|\{i\in [m] \ :y_i \ne \tilde{y}_i(\varvec{x})\}|\le \lfloor \beta m\rfloor \). Consequently, \(\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2}=0\) and

    $$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[m_0]}= \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{2}\le 2\sqrt{\beta }= 2 \Big (\tfrac{c_0 t}{\sqrt{\log (e/\beta )}}\Big )^{1/2}\le \varDelta t, \end{aligned}$$

    where the last inequality holds for \(c_0\le \frac{1}{4}\). Theorem 2 now implies the claim of Corollary 2.

  • Step 3. Let \(\varvec{x}\in \mathcal {X}\) and consider the orthogonal decomposition of the standard Gaussian random vector \(\varvec{a}\) along \(\varvec{x}\):

    $$\begin{aligned} \varvec{a}= \langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle {{\bar{\varvec{x}}}} + P_{{\varvec{x}}^\perp }(\varvec{a}), \end{aligned}$$

    where \({{\bar{\varvec{x}}}} {:=}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\) and \(P_{{\varvec{x}}^\perp } {:=}P_{{\{\varvec{x}\}}^\perp }\). Since \({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )\) is centered, \(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle \sim \mathsf {N}(0, 1)\) is independent of \(P_{{\varvec{x}}^\perp }(\varvec{a})\), and \(\mathbb {E}[|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |] = \sqrt{\tfrac{2}{\pi }}\), we have that

    $$\begin{aligned} \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}]&=\mathbb {E}[{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) (\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle {{\bar{\varvec{x}}}} + P_{{\varvec{x}}^\perp }(\varvec{a}))]\\&=\mathbb {E}[|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |]\cdot {{\bar{\varvec{x}}}} + \mathbb {E}[{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )] \cdot \mathbb {E}[P_{{\varvec{x}}^\perp }(\varvec{a})] = \sqrt{\tfrac{2}{\pi }} \cdot {{\bar{\varvec{x}}}} = T\varvec{x}, \end{aligned}$$

    which implies that \(\rho _{}(\varvec{x})=0\).

  • Step 4. Using the shortcut \({{\bar{\varvec{x}}}} {:=}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\) again, we approximate the observation variable \(\tilde{y}(\varvec{x}) = {{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )\) by \(\tilde{y}_t(\varvec{x}) {:=}\psi _t(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle )\) for \(t\in (0, 1]\) and \(\varvec{x}\in \mathcal {X}\) where

    $$\begin{aligned} \psi _t(s) {:=}{\left\{ \begin{array}{ll} \tfrac{128}{t}\cdot s, &{} |s|\le \tfrac{t}{128},\\ {{\,\mathrm{sign}\,}}(s), &{}\text {otherwise,} \end{array}\right. } \quad s \in \mathbb {R}. \end{aligned}$$

    See also Fig. 2 for an illustration. Since \(\tilde{y}(\varvec{x}) = {{\,\mathrm{sign}\,}}(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle )\), the absolute value of the approximation error then takes the form

    $$\begin{aligned} \varepsilon _t(\varvec{x}) = |{{\,\mathrm{sign}\,}}(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle ) - \psi _t(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle )|\cdot \chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |), \end{aligned}$$

    where we have used \(t^\circ {:=}\frac{t}{128}\) for the sake of notational convenience. We now show that for this choice of approximation, the conditions of Assumption 2 are indeed fulfilled with (8.1):

    On Assumption 2(a). Let \(\varvec{x}\in \mathcal {X}\), \(\varvec{z}\in \mathbb {S}^{p-1}\), and consider the orthogonal decomposition

    $$\begin{aligned} \varvec{z}= \langle \varvec{z}, {{\bar{\varvec{x}}}}\rangle {{\bar{\varvec{x}}}} + P_{{\varvec{x}}^\perp }(\varvec{z}). \end{aligned}$$

    This implies

    $$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]&\le \mathbb {E}[\chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]\\&\le |\langle \varvec{z}, {{\bar{\varvec{x}}}}\rangle |\cdot \mathbb {E}[\chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |) \cdot |\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |] \\&\quad + \mathbb {E}[\chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |) \cdot |\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]. \end{aligned}$$

    Clearly, \(|\langle \varvec{z}, {{\bar{\varvec{x}}}}\rangle |\le 1\) and \(\mathbb {E}[\chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |) \cdot |\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |]\le t^\circ \). Since the random variables \(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle \) and \(\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle \) are independent, and \(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle \) is standard Gaussian, we obtain

    $$\begin{aligned} \mathbb {E}[\chi _{[0, t^\circ ]}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |) \cdot |\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]= & {} \mathbb {P}(|\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle |\le t^\circ ) \cdot \mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |] \\\le & {} t^\circ \cdot \mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]. \end{aligned}$$

    Moreover, by Jensen’s inequality and the isotropy of \(\varvec{a}\), it follows that

    $$\begin{aligned} \mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]\le (\mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |^2])^{1/2} = \Vert P_{{\varvec{x}}^\perp }(\varvec{z})\Vert _{2}\le 1. \end{aligned}$$

    Putting everything together, we arrive at \(\mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]\le 2t^\circ = \tfrac{t}{64}\), which shows that Assumption 2(a) is satisfied.

    On Assumption 2(b). Since \(\psi _t\) is \(\tfrac{128}{t}\)-Lipschitz, the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\) (with \({{\bar{\varvec{x}}}}' {:=}\tfrac{\varvec{x}'}{\Vert \varvec{x}'\Vert _{2}}\)):

    $$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \sqrt{\tfrac{2}{\pi }}\Vert \langle \varvec{a}, {{\bar{\varvec{x}}}}-{{\bar{\varvec{x}}}}'\rangle \Vert _{\psi _2} + \Vert \psi _t(\langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle ) - \psi _t(\langle \varvec{a}, {{\bar{\varvec{x}}}}'\rangle )\Vert _{\psi _2}\\&\le \big (\sqrt{\tfrac{2}{\pi }} + \tfrac{128}{t}\big ) \cdot \Vert \langle \varvec{a}, {{\bar{\varvec{x}}}}-{{\bar{\varvec{x}}}}'\rangle \Vert _{\psi _2} \lesssim (1 + t^{-1}) \cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$

    This implies \(L_{t}\lesssim 1 + t^{-1}\). Furthermore, observe that \(|\tilde{y}_t(\varvec{x})| \le 1\), so that

    $$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}\le \sqrt{\tfrac{2}{\pi }}\Vert \langle \varvec{a}, {{\bar{\varvec{x}}}}\rangle \Vert _{\psi _2} + \Vert \tilde{y}_t(\varvec{x})\Vert _{\psi _2}\lesssim 1 \end{aligned}$$

    for every \(\varvec{x}\in \mathcal {X}\). This shows \(r\lesssim 1\).

    On Assumption 2(c). We observe that the function \(s\mapsto \phi _t(s) {:=}|{{\,\mathrm{sign}\,}}(s)-\psi _t(s)|\) is \(\tfrac{128}{t}\)-Lipschitz. Therefore, for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\), we obtain

    $$\begin{aligned} \Vert \varepsilon _t(\varvec{x}) - \varepsilon _t(\varvec{x}')\Vert _{\psi _2}\le \tfrac{128}{t} \cdot \Vert \langle \varvec{a}, {{\bar{\varvec{x}}}}-{{\bar{\varvec{x}}}}'\rangle \Vert _{\psi _2}\lesssim t^{-1} \cdot d_T(\varvec{x},\varvec{x}'), \end{aligned}$$

    which implies \(\hat{L}_{t}\lesssim t^{-1}\). Finally, since \(|\varepsilon _t(\varvec{x})|\le 1\) for every \(\varvec{x}\in \mathcal {X}\), it follows that \(\hat{r}\lesssim 1\).

\(\square \)

8.2 Proofs for Sect. 4.2

Proof of Corollary 3

We follow the proof template from the beginning of Sect. 8:

  • Step 1a. The model setup of Corollary 3 fits into Assumption 1 as follows:

    • The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered, isotropic, and sub-Gaussian with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). The signal set \(\mathcal {X}\) satisfies \(\mathcal {X}\subset R \mathbb {B}_2^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle +\tau )\), where \(\tau \) is uniformly distributed on \([-\lambda , \lambda ]\) and independent of \(\varvec{a}\). In particular, \(F\) is a random function.

    • The target function \(T:\mathcal {X}\rightarrow K\) corresponds to rescaling by a factor of \(\lambda ^{-1}\), i.e., \(T\varvec{x}{:=}\lambda ^{-1}\varvec{x}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')= \lambda ^{-1}\Vert \varvec{x}- \varvec{x}'\Vert _{2}\).

  • Step 1b. There exists an absolute constant \({\tilde{C}} \ge e\) such that if

    $$\begin{aligned} \lambda \ge {\tilde{C}}\cdot R \cdot L\cdot \sqrt{\log (e / t)}, \end{aligned}$$
    (8.2)

    the target mismatch satisfies \(\rho _{}(\varvec{x})\le \tfrac{t}{32}\) for every \(\varvec{x}\in \mathcal {X}\).

  • Step 1c. There exists an approximation \(\tilde{y}_t(\varvec{x})\) of the observation variable \(\tilde{y}(\varvec{x})\) such that the conditions of Assumption 2 are fulfilled with

    $$\begin{aligned} L_{t}\le L\cdot (1 + 64t^{-1}),\quad \hat{L}_{t}\le 64 Lt^{-1},\quad r\lesssim R L\lambda ^{-1} + 1, \quad \hat{r}\lesssim 1. \end{aligned}$$
    (8.3)
  • Step 2. The first and third branches of the condition (2.3) are implied by (4.4) for a sufficiently large universal constant \(C'>0\). Since \(\phi (\beta ) {:=}\beta \sqrt{\log (e/\beta )}\) defines a continuous and non-decreasing function on [0, 1] with \(\phi (1)=1\) and \(\phi (0){:=}0\) (by continuous extension), we may assume without loss of generality that \(\beta \sqrt{\log (e/\beta )}= c_0 L^{-2} t \in (0,1]\). Now, we set

    $$\begin{aligned} \varDelta ^2 {:=}\frac{1}{tL^2\sqrt{\log (e/\beta )}}, \quad m_0{:=}\lfloor \beta m\rfloor , \quad u_0 {:=}\sqrt{2 m \beta \log (e/\beta )}, \end{aligned}$$

    implying that \(u_0\ge \sqrt{m_0\log (em/m_0)}\) and \(u_0^2\ge 2 c_0 m L^{-2} t\). Combining the latter inequality with (4.4) for \(C'\) sufficiently large in particular implies that \(u_0^2 \ge u^2\). Furthermore, we may assume that \(c_0\le \frac{1}{4}C^{-1}\) and that (4.4) holds with \(C'\ge 2C\), where \(C>0\) denotes the universal constant from (2.3). Then, the second branch of the condition (2.3) is satisfied, since

    $$\begin{aligned} CL^4\varDelta ^2u_0^2= \frac{CL^2}{t\sqrt{\log (e/\beta )}} \cdot 2 \beta m \log (e/\beta )= 2c_0C\cdot m\le \tfrac{m}{2} \end{aligned}$$

    and

    $$\begin{aligned} CL^4\varDelta ^2 \cdot w_{t}^2(K- \lambda ^{-1}\mathcal {X})= & {} \frac{CL^2}{t\sqrt{\log (e/\beta )}} \cdot w_{t}^2(K- \lambda ^{-1}\mathcal {X})\\\le & {} \tfrac{1}{2} C' \cdot L^2 t^{-1} \cdot w_{t}^2(K- \lambda ^{-1}\mathcal {X})\le \tfrac{m}{2}. \end{aligned}$$

    Next, we show that every \(\varvec{y}\in \{-1,1\}^m\) given by (4.3) with \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \beta \) also satisfies (2.4). Since \(\varvec{\nu }=\varvec{y}- \tilde{\varvec{y}}(\varvec{x})\in \{-2,0,2\}^m\) and \(\tfrac{1}{2}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{1}\le \beta m\), it follows that \(|\{i\in [m] \ :y_i \ne \tilde{y}_i(\varvec{x})\}|\le \lfloor \beta m\rfloor \). Consequently, \(\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2}=0\) and

    $$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[m_0]}= \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{2}\le 2\sqrt{\beta }= 2\Big (\tfrac{c_0 t}{L^2 \sqrt{\log (e/\beta )}} \Big )^{1/2} \le \varDelta t, \end{aligned}$$

    where the last inequality holds for \(c_0\le \frac{1}{4}\). Theorem 2 now implies the claim of Corollary 3.

  • Step 3. Since \(\varvec{a}\) is isotropic, we have that

    $$\begin{aligned} \rho _{}(\varvec{x}) = \big \Vert \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}] - \lambda ^{-1} \varvec{x}\big \Vert _{2} \le \big \Vert \mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle \big )\varvec{a}\big ]\big \Vert _{2}. \end{aligned}$$

    Therefore, it suffices to show that the following holds for all \(\varvec{x}\in \mathcal {X}\) and \(\varvec{z}\in \mathbb {S}^{p-1}\):

    $$\begin{aligned} \mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]\le \tfrac{t}{32}. \end{aligned}$$

    The following identity explains why adding a uniformly distributed dither \(\tau \in [-\lambda , \lambda ]\) before quantization is useful:

    $$\begin{aligned} \mathbb {E}_{\tau }[{{\,\mathrm{sign}\,}}(s + \tau )] = \lambda ^{-1}s \cdot \chi _{[-\lambda , \lambda ]}(s) + {{\,\mathrm{sign}\,}}(s) \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(s), \quad s \in \mathbb {R}. \end{aligned}$$

    Indeed, for \(|s|\le \lambda \), we have \(\mathbb {E}_{\tau }[{{\,\mathrm{sign}\,}}(s + \tau )] = \mathbb {P}(\tau > -s) - \mathbb {P}(\tau < -s) = \tfrac{\lambda + s}{2\lambda } - \tfrac{\lambda - s}{2\lambda } = \lambda ^{-1}s\), whereas \({{\,\mathrm{sign}\,}}(s + \tau ) = {{\,\mathrm{sign}\,}}(s)\) if \(|s| > \lambda \). In other words, for \(|s|\) small enough, integrating over the dithering variable \(\tau \) allows us to “smooth out” the discontinuity of the sign function. As \(\tau \) and \(\varvec{a}\) are independent, we can apply the above identity as follows:

    $$\begin{aligned}&\mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]\\&\quad = \mathbb {E}_{\varvec{a}}\mathbb {E}_{\tau }\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]\\&\quad = \mathbb {E}\big [\big (\lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \cdot \chi _{[-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle ) \\&\qquad + {{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle )-\lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]\\&\quad = \mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) - \lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \big ) \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle )\cdot \langle \varvec{a}, \varvec{z}\rangle \big ]. \end{aligned}$$

    Using the Cauchy–Schwarz inequality twice, the isotropy of \(\varvec{a}\), and the triangle inequality, we now obtain

    $$\begin{aligned}&\mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) - \lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \big ) \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle )\cdot \langle \varvec{a}, \varvec{z}\rangle \big ]\\&\quad \le \Big (\mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) - \lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \big )^2 \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle )\big ] \Big )^{1/2} \cdot \big (\mathbb {E}[\langle \varvec{a}, \varvec{z}\rangle ^2]\big )^{1/2}\\&\quad = \Big (\mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) - \lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \big )^2 \cdot \chi _{\mathbb {R}\setminus [-\lambda , \lambda ]}(\langle \varvec{a}, \varvec{x}\rangle )\big ] \Big )^{1/2}\\&\quad \le \Vert {{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle ) - \lambda ^{-1}\langle \varvec{a}, \varvec{x}\rangle \Vert _{L^4}\cdot \big (\mathbb {P}(|\langle \varvec{a}, \varvec{x}\rangle |>\lambda )\big )^{1/4}\\&\quad \le \big (1+ \lambda ^{-1}\Vert \langle \varvec{a}, \varvec{x}\rangle \Vert _{L^4}\big ) \cdot \big (\mathbb {P}(|\langle \varvec{a}, \varvec{x}\rangle |>\lambda )\big )^{1/4}. \end{aligned}$$

    Since \(\varvec{a}\) is sub-Gaussian with \(\Vert \varvec{a}\Vert _{\psi _2}\le L\), there exist absolute constants \(C''\ge 1\), \(c''>0\) such that \(\Vert \langle \varvec{a}, \varvec{x}\rangle \Vert _{L^4}\le C''\cdot \Vert \langle \varvec{a}, \varvec{x}\rangle \Vert _{\psi _2}\le C''\cdot L\cdot \Vert \varvec{x}\Vert _{2}\) and \(\mathbb {P}(|\langle \varvec{a}, \varvec{x}\rangle |>\lambda )\le 2\exp \big (-c''\lambda ^2 (L\Vert \varvec{x}\Vert _{2})^{-2}\big )\). Finally, using that every \(\varvec{x}\in \mathcal {X}\) satisfies \(\Vert \varvec{x}\Vert _{2}\le R\), we can conclude that

    $$\begin{aligned} \mathbb {E}\big [\big ({{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]\le & {} C'' \cdot (1+ R L\lambda ^{-1}) \cdot \exp \big (-c'' \lambda ^2(RL)^{-2}\big ) \\\le & {} \tfrac{t}{32}, \end{aligned}$$

    where the last estimate is due to (8.2) for \({\tilde{C}}\ge e\) large enough.

  • Step 4. We approximate the observation variable \(\tilde{y}(\varvec{x}) = {{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau )\) by \(\tilde{y}_t(\varvec{x}) {:=}\psi _t(\langle \varvec{a}, \varvec{x}\rangle +\tau )\) for \(t\in (0, 1]\) and \(\varvec{x}\in \mathcal {X}\) where

    $$\begin{aligned} \psi _t(s) = {\left\{ \begin{array}{ll} \tfrac{64}{t}\cdot \lambda ^{-1}s, &{}|s|\le \tfrac{t}{64} \cdot \lambda ,\\ {{\,\mathrm{sign}\,}}(s), &{}\text {otherwise,} \end{array}\right. } \quad s \in \mathbb {R}. \end{aligned}$$

    The absolute value of the approximation error then takes the form

    $$\begin{aligned} \varepsilon _t(\varvec{x}) = |{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle + \tau ) - \psi _t(\langle \varvec{a}, \varvec{x}\rangle + \tau )|\cdot \chi _{[0, t^\circ \lambda ]}(|\langle \varvec{a}, \varvec{x}\rangle + \tau |), \end{aligned}$$

    where we have used \(t^\circ {:=}\frac{t}{64}\) for the sake of notational convenience. We now show that for this choice of approximation, the conditions of Assumption 2 are indeed fulfilled with (8.3):

    On Assumption 2(a). For \(\varvec{x}\in \mathcal {X}\) and \(\varvec{z}\in \mathbb {S}^{p-1}\), we have that

    $$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]&\le \mathbb {E}[\chi _{[0, t^\circ \lambda ]}(|\langle \varvec{a}, \varvec{x}\rangle + \tau |) \cdot |\langle \varvec{a}, \varvec{z}\rangle |] \\&\le \mathbb {E}_{\varvec{a}}\big [|\langle \varvec{a}, \varvec{z}\rangle | \cdot \mathbb {E}_{\tau } [\chi _{[0, t^\circ \lambda ]}(|\langle \varvec{a}, \varvec{x}\rangle + \tau |)]\big ]. \end{aligned}$$

    Since \(\mathbb {E}_{\tau }[\chi _{[0, t^\circ \lambda ]}(|\langle \varvec{a}, \varvec{x}\rangle + \tau |)]\le t^\circ \), it follows that

    $$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]\le t^\circ \cdot \mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |]\le t^\circ \cdot (\mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |^2])^{1/2}=t^\circ , \end{aligned}$$

    implying the condition of Assumption 2(a).

    On Assumption 2(b). Since \(\psi _t\) is \(\tfrac{64}{t \lambda }\)-Lipschitz, the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\):

    $$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \lambda ^{-1}\Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} + \Vert \psi _t(\langle \varvec{a}, \varvec{x}\rangle + \tau ) - \psi _t(\langle \varvec{a}, \varvec{x}'\rangle + \tau )\Vert _{\psi _2}\\&\le \big (\lambda ^{-1} + \tfrac{64}{t \lambda }\big ) \cdot \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} \le \lambda ^{-1} \cdot (1 + \tfrac{64}{t}) \cdot L\cdot \Vert \varvec{x}- \varvec{x}'\Vert _{2} \\&= L\cdot (1 + 64 t^{-1}) \cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$

    This implies \(L_{t}\le L\cdot (1 + 64t^{-1})\). Furthermore, observe that \(|\tilde{y}_t(\varvec{x})| \le 1\), so that

    $$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}\le \lambda ^{-1}\Vert \langle \varvec{a}, \varvec{x}\rangle \Vert _{\psi _2} + \Vert \tilde{y}_t(\varvec{x})\Vert _{\psi _2}\lesssim R L\lambda ^{-1} + 1 \end{aligned}$$

    for every \(\varvec{x}\in \mathcal {X}\). This shows \(r\lesssim R L\lambda ^{-1} + 1\).

    On Assumption 2(c). We observe that the function \(s\mapsto \phi _t(s) {:=}|{{\,\mathrm{sign}\,}}(s)-\psi _t(s)|\) is \(\tfrac{64}{t\lambda }\)-Lipschitz. Therefore, for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\), we obtain

    $$\begin{aligned} \Vert \varepsilon _t(\varvec{x}) - \varepsilon _t(\varvec{x}')\Vert _{\psi _2}\le \tfrac{64}{t\lambda } \cdot \Vert \langle \varvec{a}, \varvec{x}- \varvec{x}'\rangle \Vert _{\psi _2}\le 64 Lt^{-1} \cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$

    Hence, \(\hat{L}_{t}\le 64 Lt^{-1}\). Finally, since \(|\varepsilon _t(\varvec{x})|\le 1\) for every \(\varvec{x}\in \mathcal {X}\), it follows that \(\hat{r}\lesssim 1\).

\(\square \)

8.3 Proofs for Sect. 4.3

Proof of Corollary 4

We follow the proof template from the beginning of Sect. 8:

  • Step 1a. The model setup of Corollary 4 fits into Assumption 1 as follows:

    • The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered, isotropic, and sub-Gaussian with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). The signal set \(\mathcal {X}\) is a bounded subset of \(\mathbb {R}^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )\), where \(\tau \) is uniformly distributed on \([-\delta , \delta ]\) and independent of \(\varvec{a}\). In particular, \(F\) is a random function. Moreover, the observation vector of \(\varvec{x}\) is given by \(\tilde{\varvec{y}}(\varvec{x}) = q_\delta (\varvec{A}\varvec{x}+\varvec{\tau })\), where \(\varvec{A}\in \mathbb {R}^{m\times p}\) denotes the sub-Gaussian random matrix with row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\) and \(\varvec{\tau }{:=}(\tau _1, \dots , \tau _m)\).

    • The target function \(T:\mathcal {X}\rightarrow K\) is the canonical embedding into \(K\), i.e., \(T\varvec{x}{:=}\varvec{x}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')= \Vert \varvec{x}- \varvec{x}'\Vert _{2}\).

  • Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).

  • Step 1c. If \(t\le 128\delta \), there exists an approximation \(\tilde{y}_t(\varvec{x})\) of the observation variable \(\tilde{y}(\varvec{x})\) such that the conditions of Assumption 2 are fulfilled with

    $$\begin{aligned} L_{t}\lesssim L\delta t^{-1},\quad \hat{L}_{t}\lesssim L\delta t^{-1},\quad r\lesssim \delta ,\quad \hat{r}\lesssim \delta . \end{aligned}$$
    (8.4)

    If \(t> 128\delta \), we choose any maximal \(\tfrac{t}{256}\)-packing \(\mathcal {X}_t\) for \(\mathcal {X}\) (i.e., a maximal subset of \(\mathcal {X}\) such that \(\Vert \varvec{x}-\varvec{x}'\Vert _{2}> \tfrac{t}{256}\) for all \(\varvec{x}, \varvec{x}'\in \mathcal {X}_t\) with \(\varvec{x}\ne \varvec{x}'\)) and show that for the trivial choice \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\), the conditions of Assumption 2 are fulfilled on \(\mathcal {X}_t\) with

    $$\begin{aligned} L_{t}\lesssim \delta t^{-1},\quad \hat{L}_{t}=0,\quad r\lesssim \delta ,\quad \hat{r}= 0. \end{aligned}$$
    (8.5)
  • Step 2. We first note that for \(t\ge \delta \), the inequality \(c_0 L^{-2} \sqrt{\delta t} / \sqrt{\max \{1, \log ( \delta e/t)\}}\le \tfrac{t}{40}\) holds if \(c_0\le \tfrac{1}{40}\). Furthermore, the condition \(\tfrac{1}{m}\Vert \varvec{\nu }\Vert _{0} \le t \delta ^{-1}\) is trivially fulfilled. Consequently, it suffices to prove the statement of Corollary 4 for all those input vectors \(\varvec{y}\in \delta \mathbb {Z}^m\) that satisfy the condition (a) in (4.7), and all those \(\varvec{y}\in \delta \mathbb {Z}^m\) that satisfy condition (b) in (4.7) in the case \(t\le \delta \). We now distinguish between the two cases \(t\le 128\delta \) and \(t>128\delta \): The case \(t\le 128\delta \). From (8.4), it follows that the first and third branches of the condition (2.3) are implied by (4.6) for a sufficiently large universal constant \(C'>0\). For the choice \(\varDelta {:=}L^{-1} \sqrt{\log L}\), \(m_0{:=}0\), and \(u{:=}u_0\), we observe that the second branch of (2.3) follows from the first one. Therefore, the claim of Corollary 4 follows from Theorem 2 for all those input vectors \(\varvec{y}\in \delta \mathbb {Z}^m\) that satisfy the condition (a) in (4.7). Next, let us assume that \(t\le \delta \) and consider the choice

    $$\begin{aligned} \varDelta ^2 {:=}\frac{\delta }{4C tL^4 \log (\delta e/ t)}, \quad m_0{:=}\lfloor m t \delta ^{-1}\rfloor , \quad u_0 {:=}\sqrt{2m t \delta ^{-1}\log (\delta e / t)}, \end{aligned}$$

    where \(C>0\) denotes the universal constant from (2.3). This implies \(u_0\ge \sqrt{m_0\log (em/m_0)}\) as well as \(u_0^2\ge u^2\) if the universal constant \(C'\) in (4.6) is large enough. Furthermore, the second branch of (2.3) is satisfied: We have that \(CL^4\varDelta ^2u_0^2= \tfrac{m}{2}\), and assuming that (4.6) holds with \(C'\ge \frac{1}{2}\), it follows that

    $$\begin{aligned} CL^4\varDelta ^2 \cdot w_{t}^2(K- T\mathcal {X})\le \tfrac{1}{4}\delta t^{-1} \cdot w_{t}^2(K- \mathcal {X})\le \tfrac{1}{4} L^2 \delta ^2 t^{-2} \cdot w_{t}^2(K- \mathcal {X})\le \tfrac{m}{2}. \end{aligned}$$

    It remains to show that every \(\varvec{y}\in \delta \mathbb {Z}^m\) given by (4.5) such that

    $$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{\nu }\Vert _{2}\le \tfrac{c_0\sqrt{\delta t}}{L^2\sqrt{\log (\delta e/t)}}\quad \text {and} \quad \tfrac{1}{m}\Vert \varvec{\nu }\Vert _{0}\le \tfrac{t}{\delta } \end{aligned}$$

    also satisfies the condition (2.4). Indeed, since \(\varvec{\nu }=\varvec{y}- \tilde{\varvec{y}}(\varvec{x})\in \delta \mathbb {Z}^m\) and \(\Vert \varvec{\nu }\Vert _{0}\le m t \delta ^{-1}\), it follows that \(|\{i\in [m] \ :y_i \ne \tilde{y}_i(\varvec{x})\}|\le \lfloor m t \delta ^{-1}\rfloor \). Consequently, \(\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2}=0\) and

    $$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[m_0]}= \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{2}\le \tfrac{c_0\sqrt{\delta t}}{L^2\sqrt{\log (\delta e/t)}}\le \varDelta t, \end{aligned}$$

    where the last inequality holds for \(c_0>0\) sufficiently small. Theorem 2 now implies the claim of Corollary 4 for all those input vectors \(\varvec{y}\in \delta \mathbb {Z}^m\) that satisfy the condition (b) in (4.7). The case \(t> 128\delta \). Setting \(m_0{:=}0\), \(\varDelta {:=}L^{-1} \sqrt{\log L}\), and \(u{:=}u_0\), the second branch of (2.3) is implied by the first one. Due to (8.5), Theorem 2 applied to the signal set \(\mathcal {X}_t\) now implies that there exist universal constants \(c_1,C_1>0\) such that if

    $$\begin{aligned} m \ge C_1 \cdot L^2 \cdot \Big ( (\log L+ t^{-2} \delta ^2) \cdot \big (w_{t}^2(K- \mathcal {X}_t)+ u^2 \big ) + t^{-4} \delta ^2 \cdot w_{}^2(\mathcal {X}_t) \Big ),\nonumber \\ \end{aligned}$$
    (8.6)

    the following event \({\mathcal {A}}_1\) occurs with probability at least \(1 - \exp (-c_1 u^2)\): For every \(\varvec{x}' \in \mathcal {X}_t\) and \(\varvec{y}\in \delta \mathbb {Z}^m\) such that \(\tfrac{1}{\sqrt{m}}\Vert \varvec{y}-q_{\delta }(\varvec{A}\varvec{x}'+\varvec{\tau })\Vert _{2} \le \tfrac{t}{20}\), every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \varvec{x}'\Vert _{2} \le t\). Moreover, according to Theorem 6, there exists a universal constant \(C_2>0\) such that if

    $$\begin{aligned} m\ge C_2\cdot L^2 \log L\cdot \big (w_{t}^2(\mathcal {X}-\mathcal {X}) + u^2\big ), \end{aligned}$$
    (8.7)

    the following event \({\mathcal {A}}_2\) occurs with probability at least \(1-3\exp (-u^2)\):

    $$\begin{aligned} {\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x}\in \mathcal {X},\, \varvec{x}'\in \mathcal {X}_t\\ \Vert \varvec{x}-\varvec{x}'\Vert _{2}\le \tfrac{t}{256} \end{array}}} \big \Vert \tfrac{1}{\sqrt{m}}\varvec{A}(\varvec{x}-\varvec{x}')\big \Vert _{2} \le \tfrac{t}{128}. \end{aligned}$$

    We claim that Corollary 4 holds on the intersection of the events \({\mathcal {A}}_1\) and \({\mathcal {A}}_2\). To this end, let \(\varvec{x}\in \mathcal {X}\) be arbitrary and assume that \(\varvec{y}\in \delta \mathbb {Z}^m\) satisfies \(\tfrac{1}{\sqrt{m}}\Vert \varvec{y}-q_{\delta }(\varvec{A}\varvec{x}+\varvec{\tau })\Vert _{2} \le \tfrac{t}{40}\). Since \(\mathcal {X}_t\) is a maximal \(\tfrac{t}{256}\)-packing for \(\mathcal {X}\), there exists \(\varvec{x}'\in \mathcal {X}_t\) with \(\Vert \varvec{x}-\varvec{x}'\Vert _{2}\le \tfrac{t}{256}\) (otherwise \(\mathcal {X}_t\) would not be maximal). On the event \({\mathcal {A}}_2\) and by the triangle inequality, we obtain (the last step below also uses that \(2\delta < \tfrac{t}{64}\), which holds since \(t>128\delta \))

    $$\begin{aligned} \tfrac{1}{\sqrt{m}}\big \Vert \varvec{y}-q_{\delta }(\varvec{A}\varvec{x}'+\varvec{\tau })\big \Vert _{2}&\le \tfrac{1}{\sqrt{m}}\big \Vert \varvec{y}-q_{\delta }(\varvec{A}\varvec{x}+\varvec{\tau })\big \Vert _{2} \\&\quad + \tfrac{1}{\sqrt{m}}\big \Vert q_{\delta }(\varvec{A}\varvec{x}+\varvec{\tau })-q_{\delta }(\varvec{A}\varvec{x}'+\varvec{\tau })\big \Vert _{2} \\&\le \tfrac{t}{40} + \tfrac{1}{\sqrt{m}}\big \Vert (\varvec{A}\varvec{x}+ \varvec{\tau }) -(\varvec{A}\varvec{x}' + \varvec{\tau }) \big \Vert _{2} \\&\quad + \tfrac{1}{\sqrt{m}}\big \Vert q_\delta (\varvec{A}\varvec{x}+ \varvec{\tau }) - (\varvec{A}\varvec{x}+ \varvec{\tau })\big \Vert _{2} \\&\quad + \tfrac{1}{\sqrt{m}}\big \Vert q_\delta (\varvec{A}\varvec{x}' + \varvec{\tau }) - (\varvec{A}\varvec{x}' + \varvec{\tau })\big \Vert _{2} \\&\le \tfrac{t}{40} + \tfrac{t}{128}+ 2\delta \le \tfrac{t}{20}. \end{aligned}$$

    On the event \({\mathcal {A}}_1\), every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \varvec{x}'\Vert _{2} \le t\), and therefore

    $$\begin{aligned} \Vert \hat{\varvec{z}}- \varvec{x}\Vert _{2} \le \Vert \hat{\varvec{z}}- \varvec{x}'\Vert _{2} + \Vert \varvec{x}' - \varvec{x}\Vert _{2} \le t + \tfrac{t}{256}\le 2t. \end{aligned}$$

    Finally, for \(C'>0\) sufficiently large, we conclude that the condition (4.6) implies both (8.6) and (8.7). Hence, the events \({\mathcal {A}}_1\) and \({\mathcal {A}}_2\) occur jointly with probability at least \(1 - \exp (-c u^2)\), provided that \(c>0\) is chosen small enough.

  • Step 3. Let \(\varvec{x}\in \mathcal {X}\). Since \(\varvec{a}\) is isotropic, we have that

    $$\begin{aligned} \rho _{}(\varvec{x}) = \big \Vert \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}] - \varvec{x}\big \Vert _{2} \le \big \Vert \mathbb {E}\big [\big (q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \varvec{x}\rangle \big )\varvec{a}\big ]\big \Vert _{2}, \end{aligned}$$

    so that it suffices to show that

    $$\begin{aligned} \mathbb {E}\big [\big (q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]=0 \qquad \text {for all } \varvec{z}\in \mathbb {S}^{p-1}. \end{aligned}$$

    It is straightforward to check that

    $$\begin{aligned} \mathbb {E}[q_\delta (s + \tau )] = s \qquad \text {for all } s\in \mathbb {R}. \end{aligned}$$
    (8.8)

    In other words, integrating over the dithering variable \(\tau \) allows us to “eliminate” the quantizer. For the sake of completeness, we show the identity (8.8) for \(s = k \cdot (2\delta ) + s'\) where \(k\in \mathbb {Z}\) and \(s'\in [0,\delta ]\); the case \(s = k \cdot (2\delta ) + s'\) where \(k\in \mathbb {Z}\) and \(s'\in (\delta , 2\delta )\) can be treated analogously. First, observe that

    $$\begin{aligned} q_\delta (s + \tau )=(2\lceil \tfrac{s + \tau }{2\delta }\rceil -1)\delta = (2\lceil k + \tfrac{s' + \tau }{2\delta }\rceil -1)\delta . \end{aligned}$$

    Since \(\tfrac{s'+\tau }{2\delta } \in [-\tfrac{1}{2}, 1]\), it follows that \(q_\delta (s + \tau )\in \{(2k-1)\delta , (2k+1)\delta \}\) and more precisely,

    $$\begin{aligned}&\tau \in [-\delta -s',-s']&\quad \Rightarrow \quad q_\delta (s + \tau ) =(2k-1)\delta ,\\&\tau \in (-s', 2\delta -s']&\quad \Rightarrow \quad q_\delta (s + \tau ) =(2k+1)\delta . \end{aligned}$$

    Therefore, we obtain

    $$\begin{aligned} \mathbb {E}[q_\delta (s + \tau )]&= \mathbb {P}(\tau \in [-\delta -s',-s']) \cdot (2k-1)\delta \\&\qquad + \mathbb {P}(\tau \in (-s', 2\delta -s']) \cdot (2k+1)\delta \\&= \mathbb {P}(\tau \in [-\delta ,-s']) \cdot (2k-1)\delta + \mathbb {P}(\tau \in (-s', \delta ]) \cdot (2k+1)\delta \\&= \big (\tfrac{-s'+\delta }{2\delta }\big ) \cdot (2k-1)\delta + \big (\tfrac{\delta + s'}{2\delta }\big ) \cdot (2k+1)\delta \\&=\tfrac{2k-1}{2}(\delta -s') + \tfrac{2k+1}{2}(\delta + s') = 2k\delta + s' = s. \end{aligned}$$

    Since \(\tau \) and \(\varvec{a}\) are independent, we can apply (8.8) as follows:

    $$\begin{aligned} \mathbb {E}\big [\big (q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]&=\mathbb {E}_{\varvec{a}}\mathbb {E}_\tau \big [\big (q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )-\langle \varvec{a}, \varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ]\\&=\mathbb {E}\big [\big (\langle \varvec{a}, \varvec{x}\rangle -\langle \varvec{a}, \varvec{x}\rangle \big )\langle \varvec{a}, \varvec{z}\rangle \big ] = 0. \end{aligned}$$
  • Step 4. Before distinguishing between the cases \(t\le 128\delta \) and \(t>128\delta \) according to Step 1c, let us analyze the action of the quantizer \(q_\delta \) in more detail (a numerical sanity check of the construction below is sketched after this proof). To this end, we partition the real axis \(\mathbb {R}\) into half-open intervals \({\mathcal {I}}_k\) of length \(2\delta \) given by \({\mathcal {I}}_k {:=}(e_{k-1}, e_k]\) for \(k\in \mathbb {Z}\) with \(e_k {:=}k\cdot 2\delta \). Then, \(q_\delta \) maps every point in the interval \({\mathcal {I}}_k\) to its center point:

    $$\begin{aligned} q_\delta (s)= \frac{e_{k-1}+e_k}{2}, \quad s\in {\mathcal {I}}_k. \end{aligned}$$

    In particular, \(q_\delta (s)\) is discontinuous exactly at the points \(\{ e_k \ :k\in \mathbb {Z}\}\). The basic idea is now to approximate the quantizer \(q_\delta \) by a Lipschitz continuous function by “cutting out” intervals of a certain radius \(t^\circ > 0\) around each point \(e_k\) and inserting a straight line that connects both quantization values. Assuming that \(t^\circ \le \delta \), the resulting function takes the form

    $$\begin{aligned} \psi _{t^{\circ }}(s) {:=}{\left\{ \begin{array}{ll} \tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ }) , &{}\text {if } \exists k\in \mathbb {Z}: s\in {\mathcal {E}}_{k,t^\circ },\\ q_\delta (s), &{}\text {otherwise,} \end{array}\right. } \quad s \in \mathbb {R}, \end{aligned}$$

    where \({\mathcal {E}}_{k,t^\circ } {:=}[e_k-t^\circ , e_k+t^\circ ]\); note that if \(t^\circ =\delta \), then \(\psi _{t^{\circ }}\) is just the identity function. We also define

    $$\begin{aligned} \phi _{t^{\circ }}(s){:=}|q_\delta (s)-\psi _{t^{\circ }}(s)|,\quad s\in \mathbb {R}. \end{aligned}$$

    Let us now show that \(\psi _{t^{\circ }}\) and \(\phi _{t^{\circ }}\) are both \(\tfrac{\delta }{t^\circ }\)-Lipschitz. Since \(q_\delta \) is locally constant on \(\mathbb {R}\setminus \{e_k \ :k\in \mathbb {Z}\}\) and \(s\mapsto \tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })\) is clearly \(\tfrac{\delta }{t^\circ }\)-Lipschitz, it is sufficient to show that \(\psi _{t^{\circ }}\) is continuous, i.e., \(\tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })=q_\delta (s)\) for \(s \in \mathbb {R}\) with \(|s-e_k|=t^\circ \) for some \(k \in \mathbb {Z}\). If \(s=e_k-t^\circ \), then \(\tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })=e_k-\delta \), and we also have that

    $$\begin{aligned} q_\delta (s)=\big (2\big \lceil \tfrac{k\cdot (2\delta ) - t^\circ }{2\delta }\big \rceil -1\big )\delta =(2\lceil k - \tfrac{t^\circ }{2\delta }\rceil -1)\delta . \end{aligned}$$

    Since \(t^\circ \in (0,\delta ]\), it follows that \(\lceil k - \tfrac{t^\circ }{2\delta }\rceil =k\), and therefore, \(q_\delta (s)=(2k-1)\delta = e_k-\delta \). The case \(s=e_k+t^\circ \) works analogously. Thus, \(\psi _{t^{\circ }}\) is indeed \(\tfrac{\delta }{t^\circ }\)-Lipschitz. For \(\phi _{t^{\circ }}\), it is clearly sufficient to show \(\tfrac{\delta }{t^\circ }\)-Lipschitz continuity on \({\mathcal {E}}_{k,t^\circ }\) for every \(k\in \mathbb {Z}\). In the case \(s\in [e_k-t^\circ , e_k]\), we observe that \(q_\delta (s)=e_k-\delta \) and \(\tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })\ge e_k-\delta \), implying that

    $$\begin{aligned} \phi _{t^{\circ }}(s) = \tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ }) - (e_k-\delta )= \delta - \tfrac{\delta }{t^\circ }(e_k-s). \end{aligned}$$

    Thus, \(\phi _{t^{\circ }}\) is \(\tfrac{\delta }{t^\circ }\)-Lipschitz continuous on \([e_k-t^\circ , e_k]\). On the other hand, if \(s\in (e_k, e_k+t^\circ ]\), we have that \(q_\delta (s)=e_k+\delta \) and \(e_k+\delta \ge \tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })\), implying that

    $$\begin{aligned} \phi _{t^{\circ }}(s) = e_k+\delta - \big (\tfrac{\delta }{t^\circ }\cdot s + e_k(1-\tfrac{\delta }{t^\circ })\big ) = \delta - \tfrac{\delta }{t^\circ }(s-e_k). \end{aligned}$$

    Thus, \(\phi _{t^{\circ }}\) is \(\tfrac{\delta }{t^\circ }\)-Lipschitz continuous on \((e_k, e_k+t^\circ ]\). Moreover, \(\lim _{s\rightarrow e_k}\phi _{t^{\circ }}(s)=\delta =\phi _{t^{\circ }}(e_k)\), which shows that \(\phi _{t^{\circ }}\) is indeed \(\tfrac{\delta }{t^\circ }\)-Lipschitz continuous on \({\mathcal {E}}_{k,t^\circ }\). Note that the above calculations also allow us to write

    $$\begin{aligned} \phi _{t^{\circ }}(s) = {\left\{ \begin{array}{ll} \delta - \tfrac{\delta }{t^\circ }|s-e_k|, &{}\text {if } \exists k\in \mathbb {Z}: s\in [e_k-t^\circ , e_k+t^\circ ],\\ 0, &{}\text {otherwise,} \end{array}\right. } \quad s \in \mathbb {R}. \end{aligned}$$

    In particular, we have that \(0\le \phi _{t^{\circ }}(s)\le \delta \cdot \chi _{{\mathcal {E}}_{t^{\circ }}}(s)\), where \({\mathcal {E}}_{t^{\circ }}{:=}\cup _{k\in \mathbb {Z}}{\mathcal {E}}_{k,t^\circ }\). We now distinguish between the two cases \(t\le 128\delta \) and \(t > 128\delta \): The case \(t\le 128\delta \). Using the above notation, we choose \(t^\circ {:=}\frac{t}{128}\) and approximate \(\tilde{y}(\varvec{x}) = q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )\) by \(\tilde{y}_t(\varvec{x}) {:=}\psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle + \tau )\). The absolute value of the approximation error is then given by \(\varepsilon _t(\varvec{x}) = \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle +\tau )\). We now show that for this choice of approximation, the conditions of Assumption 2 are indeed fulfilled with (8.4): On Assumption 2(a). We first observe that the indicator function \(\chi _{{\mathcal {E}}_{t^\circ }}\) is \(2\delta \)-periodic on \(\mathbb {R}\). Since the random variable \(s + \tau \) is uniformly distributed on \([s-\delta , s+\delta ]\) for every \(s \in \mathbb {R}\), this implies

    $$\begin{aligned} \mathbb {E}_{\tau }[\chi _{{\mathcal {E}}_{t^\circ }}(s + \tau )] = \mathbb {E}_{\tau }[\chi _{{\mathcal {E}}_{t^\circ }}(\tau )] = \mathbb {E}_{\tau }[\chi _{{\mathcal {E}}_{0, t^\circ }}(\tau )]=\mathbb {E}_{\tau }[\chi _{[-t^\circ ,t^\circ ]}(\tau )] = t^\circ \delta ^{-1}. \end{aligned}$$
    (8.9)

    Now, let \(\varvec{x}\in \mathcal {X}\) and \(\varvec{z}\in \mathbb {S}^{p-1}\). Using the independence of \(\varvec{a}\) and \(\tau \) in conjunction with the identity (8.9), we obtain

    $$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]&=\mathbb {E}[\phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle +\tau ) \cdot |\langle \varvec{a}, \varvec{z}\rangle |] \\&\le \delta \cdot \mathbb {E}[\chi _{{\mathcal {E}}_{t^\circ }}(\langle \varvec{a}, \varvec{x}\rangle + \tau ) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]\\&\le \delta \cdot \mathbb {E}_{\varvec{a}} \big [|\langle \varvec{a}, \varvec{z}\rangle | \cdot \mathbb {E}_{\tau } [\chi _{{\mathcal {E}}_{t^\circ }}(\langle \varvec{a}, \varvec{x}\rangle + \tau )]\big ] \\&\le t^\circ \cdot \mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |] \le \tfrac{t}{64} \cdot \mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |]. \end{aligned}$$

    From Jensen’s inequality and the isotropy of \(\varvec{a}\), it follows that

    $$\begin{aligned} \mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |]\le (\mathbb {E}[|\langle \varvec{a}, \varvec{z}\rangle |^2])^{1/2}=\Vert \varvec{z}\Vert _{2}=1, \end{aligned}$$

    which shows that Assumption 2(a) is satisfied. On Assumption 2(b). Since \(\psi _{t^\circ }\) is \(\tfrac{\delta }{t^\circ }\)-Lipschitz, the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\):

    $$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} + \Vert \psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle + \tau ) - \psi _{t^\circ }(\langle \varvec{a}, \varvec{x}'\rangle + \tau )\Vert _{\psi _2}\\&\le (1 + \tfrac{\delta }{t^\circ }) \cdot \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} \le (1 + \tfrac{\delta }{t^\circ }) \cdot L\cdot \Vert \varvec{x}- \varvec{x}'\Vert _{2}\\&= L\cdot (1 + \tfrac{128 \delta }{t}) \cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$

    This implies \(L_{t}\lesssim L\delta t^{-1}\). Furthermore, we clearly have that \({{\mathrm{sup}}}_{s\in \mathbb {R}}|s-\psi _{t^\circ }(s + \tau )|\lesssim \delta \), and therefore

    $$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}=\Vert \langle \varvec{a}, \varvec{x}\rangle -\psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle + \tau )\Vert _{\psi _2}\lesssim \delta \end{aligned}$$

    for every \(\varvec{x}\in \mathcal {X}\). This shows \(r\lesssim \delta \). On Assumption 2(c). Since \(\phi _{t^\circ }\) is \(\tfrac{\delta }{t^\circ }\)-Lipschitz, the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\):

    $$\begin{aligned} \Vert \varepsilon _t(\varvec{x}) - \varepsilon _t(\varvec{x}')\Vert _{\psi _2}&=\Vert \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle +\tau ) - \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}'\rangle +\tau ) \Vert _{\psi _2}\\&\le \tfrac{\delta }{t^\circ } \Vert \langle \varvec{a}, \varvec{x}- \varvec{x}'\rangle \Vert _{\psi _2} \le L\cdot \tfrac{128\delta }{t} \cdot d_T(\varvec{x},\varvec{x}'), \end{aligned}$$

    and therefore, \(\hat{L}_{t}\lesssim L\delta t^{-1}\). Finally, since \(0\le \phi _{t^\circ }(s)\le \delta \) for every \(s\in \mathbb {R}\), it follows that \(\hat{r}\lesssim \delta \). The case \(t> 128\delta \). Let \(\mathcal {X}_t\subset \mathcal {X}\) be any maximal \(\tfrac{t}{256}\)-packing for \(\mathcal {X}\). We show that for the choice \(\tilde{y}_t(\varvec{x}) {:=}\tilde{y}(\varvec{x})\), the conditions of Assumption 2 are fulfilled on \(\mathcal {X}_t\) with (8.5). Since \(\varepsilon _t(\varvec{x}) = 0\) for all \(\varvec{x}\in \mathcal {X}_t\), Assumption 2(a) and (c) are trivially fulfilled with \(\hat{L}_{t}=\hat{r}=0\). Furthermore, observing that \(\xi _t(\varvec{x})=\langle \varvec{a}, \varvec{x}\rangle -q_\delta (\langle \varvec{a}, \varvec{x}\rangle + \tau )\) and \({{\mathrm{sup}}}_{s\in \mathbb {R}} |s - q_\delta (s + \tau )|\le 2\delta \), we conclude that \({{\mathrm{sup}}}_{\varvec{x}\in \mathcal {X}_t}\Vert \xi _t(\varvec{x})\Vert _{\psi _2}\lesssim \delta \). Finally, using that \(\Vert \varvec{x}-\varvec{x}'\Vert _{2}> \tfrac{t}{256}\) for all \(\varvec{x}, \varvec{x}' \in \mathcal {X}_t\) with \(\varvec{x}\ne \varvec{x}'\), we can bound the sub-Gaussian norm of the multiplier increments as follows:

    $$\begin{aligned} \Vert \xi _t(\varvec{x})-\xi _t(\varvec{x}')\Vert _{\psi _2}\le & {} \Vert \xi _t(\varvec{x})\Vert _{\psi _2} + \Vert \xi _t(\varvec{x}')\Vert _{\psi _2} \lesssim \delta < \tfrac{256\delta }{t}\Vert \varvec{x}-\varvec{x}'\Vert _{2}\\= & {} \tfrac{256\delta }{t}\cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$

    This shows that Assumption 2(b) is satisfied on \(\mathcal {X}_t\) with \(L_{t}\lesssim \delta t^{-1}\) and \(r\lesssim \delta \).

\(\square \)
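
Numerical sanity check (not part of the proof). The following Python sketch implements the quantizer \(q_\delta (s)=(2\lceil \tfrac{s}{2\delta }\rceil -1)\delta \) and the Lipschitz surrogate \(\psi _{t^\circ }\) from Step 4, and verifies by simulation the dithering identity (8.8) as well as the bounds \(|q_\delta -\psi _{t^\circ }|\le \delta \) and the \(\tfrac{\delta }{t^\circ }\)-Lipschitz continuity of \(\psi _{t^\circ }\). The values of \(\delta \) and \(t^\circ \) are arbitrary illustrative choices with \(t^\circ \le \delta \), as required in the construction.

    import numpy as np

    rng = np.random.default_rng(0)
    delta = 0.25            # quantizer resolution (illustrative choice)
    t_circ = delta / 3.0    # "cut-out" radius t° <= delta (illustrative choice)

    def q(s):
        # uniform quantizer q_delta(s) = (2*ceil(s/(2*delta)) - 1)*delta
        return (2.0 * np.ceil(s / (2.0 * delta)) - 1.0) * delta

    def psi(s):
        # Lipschitz surrogate: linear interpolation on [e_k - t°, e_k + t°] around each jump e_k = 2*k*delta
        e_k = 2.0 * delta * np.round(s / (2.0 * delta))   # nearest discontinuity of q_delta
        linear = (delta / t_circ) * s + e_k * (1.0 - delta / t_circ)
        return np.where(np.abs(s - e_k) <= t_circ, linear, q(s))

    # dithering identity (8.8): E_tau[q_delta(s + tau)] = s for tau ~ Unif[-delta, delta]
    s = rng.uniform(-3.0, 3.0, size=20)
    tau = rng.uniform(-delta, delta, size=(200_000, 1))
    print(np.max(np.abs(q(s + tau).mean(axis=0) - s)))    # ~ 0 up to Monte Carlo error

    # |q_delta - psi| <= delta everywhere, and psi has slope at most delta/t°
    grid = np.linspace(-3.0, 3.0, 1_000_001)
    q_vals, psi_vals = q(grid), psi(grid)
    print(np.max(np.abs(q_vals - psi_vals)))              # <= delta
    print(np.max(np.abs(np.diff(psi_vals) / np.diff(grid))), delta / t_circ)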

8.4 Proofs for Sect. 4.4

Proof of Theorem 1

We follow the proof template from the beginning of Sect. 8:

  • Step 1a. The model setup of Theorem 1 fits into Assumption 1 as follows:

    • We have that \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) and therefore \(\Vert \varvec{a}\Vert _{\psi _2} \lesssim 1\). The signal set \(\mathcal {X}\) is an arbitrary subset of \(\mathbb {S}^{p-1}\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}f(\langle \varvec{a}, \varvec{x}\rangle )\).

    • The target function \(T:\mathcal {X}\rightarrow K\) corresponds to rescaling by a factor of \(\mu = \mathbb {E}[f(g)g]\) with \(g\sim \mathsf {N}(0, 1)\), i.e., \(T\varvec{x}{:=}\mu \varvec{x}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')= \mu \Vert \varvec{x}- \varvec{x}'\Vert _{2}\).

  • Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).

  • Step 1c. Let \(g\sim \mathsf {N}(0, 1)\). For the trivial choice \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\), the conditions of Assumption 2 are fulfilled with

    $$\begin{aligned} L_{t}\lesssim 1 + \gamma \mu ^{-1},\quad \hat{L}_{t}= 0,\quad r= \Vert f(g) - \mu g\Vert _{\psi _2},\quad \hat{r}= 0. \end{aligned}$$
  • Step 2. Setting \(m_0{:=}0\), \(\varDelta {:=}1\), and \(u_0 {:=}u\), the claim of Theorem 1 follows directly from Theorem 2.

  • Step 3. Let \(\varvec{x}\in \mathcal {X}\subset \mathbb {S}^{p-1}\) and consider the orthogonal decomposition of the standard Gaussian random vector \(\varvec{a}\) along \(\varvec{x}\):

    $$\begin{aligned} \varvec{a}= \langle \varvec{a}, \varvec{x}\rangle \varvec{x}+ P_{{\varvec{x}}^\perp }(\varvec{a}), \end{aligned}$$

    where \(P_{{\varvec{x}}^\perp } {:=}P_{{\{\varvec{x}\}}^\perp }\). Since \(P_{{\varvec{x}}^\perp }(\varvec{a})\) is centered and \(\langle \varvec{a}, \varvec{x}\rangle \sim \mathsf {N}(0, 1)\) is independent of \(P_{{\varvec{x}}^\perp }(\varvec{a})\), we have that

    $$\begin{aligned} \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}]&=\mathbb {E}[f(\langle \varvec{a}, \varvec{x}\rangle ) (\langle \varvec{a}, \varvec{x}\rangle \varvec{x}+ P_{{\varvec{x}}^\perp }(\varvec{a}))]\\&=\mu \varvec{x}+ \mathbb {E}[f(\langle \varvec{a}, \varvec{x}\rangle )] \cdot \mathbb {E}[P_{{\varvec{x}}^\perp }(\varvec{a})] =\mu \varvec{x}= T\varvec{x}, \end{aligned}$$

    which implies that \(\rho _{}(\varvec{x})=0\). (This identity is sanity-checked numerically in the sketch after this proof.)

  • Step 4. We simply set \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\). Then, \(\varepsilon _t(\varvec{x}) = 0\), implying that Assumption 2(a) and (c) are trivially fulfilled with \(\hat{L}_{t}=\hat{r}=0\). Furthermore, the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\):

    $$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \mu \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} + \Vert f(\langle \varvec{a}, \varvec{x}\rangle ) - f(\langle \varvec{a}, \varvec{x}'\rangle )\Vert _{\psi _2}\\&\lesssim \mu \Vert \varvec{x}- \varvec{x}'\Vert _{2} + \gamma \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2} \\&\lesssim (\mu + \gamma ) \cdot \Vert \varvec{x}- \varvec{x}'\Vert _{2} = (1 + \gamma \mu ^{-1}) \cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$

    This implies \(L_{t}\lesssim 1 + \gamma \mu ^{-1}\). Since \(\langle \varvec{a}, \varvec{x}\rangle \sim \mathsf {N}(0, 1)\) for every \(\varvec{x}\in \mathcal {X}\subset \mathbb {S}^{p-1}\), we can also conclude that

    $$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}=\Vert \mu \langle \varvec{a}, \varvec{x}\rangle - f(\langle \varvec{a}, \varvec{x}\rangle )\Vert _{\psi _2}=\Vert f(g) - \mu g\Vert _{\psi _2}, \quad g\sim \mathsf {N}(0, 1), \end{aligned}$$

    which shows that \(r{:=}\Vert f(g) - \mu g\Vert _{\psi _2}\) is a valid choice. Hence, Assumption 2(b) is satisfied as well.

\(\square \)
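
Numerical sanity check (not part of the proof). The identity \(\mathbb {E}[f(\langle \varvec{a}, \varvec{x}\rangle )\varvec{a}]=\mu \varvec{x}\) from Step 3 can also be observed in simulation. In the following Python sketch, the nonlinearity \(f=\tanh \), the dimension, and the sample size are arbitrary illustrative choices; they are not prescribed by Theorem 1.

    import numpy as np

    rng = np.random.default_rng(1)
    p, n = 8, 1_000_000                    # dimension and Monte Carlo sample size (illustrative)
    f = np.tanh                            # illustrative single-index nonlinearity

    x = rng.standard_normal(p)
    x /= np.linalg.norm(x)                 # signal on the unit sphere, as in Theorem 1

    g = rng.standard_normal(n)
    mu = np.mean(f(g) * g)                 # mu = E[f(g) g], the rescaling factor of T x = mu * x

    a = rng.standard_normal((n, p))        # standard Gaussian measurement vectors (rows)
    t_hat = a.T @ f(a @ x) / n             # empirical E[f(<a, x>) a]
    print(np.linalg.norm(t_hat - mu * x))  # ~ 0 up to Monte Carlo error, i.e., the target mismatch vanishes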

Proof of Corollary 5

We follow the proof template from the beginning of Sect. 8:

  • Step 1a. The model setup of Corollary 5 fits into Assumption 1 as follows:

    • We have that \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) and therefore \(\Vert \varvec{a}\Vert _{\psi _2} \lesssim 1\). The signal set \(\mathcal {X}\) is an arbitrary subset of \(\mathbb {S}^{p-1}\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}{\mathfrak {m}}_\lambda (\langle \varvec{a}, \varvec{x}\rangle )\).

    • The target function \(T:\mathcal {X}\rightarrow K\) corresponds to rescaling by a factor of \(\mu _\lambda = \mathbb {E}[{\mathfrak {m}}_\lambda (g)g]\) with \(g\sim \mathsf {N}(0, 1)\), i.e., \(T\varvec{x}{:=}\mu _\lambda \varvec{x}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')= \mu _\lambda \Vert \varvec{x}- \varvec{x}'\Vert _{2}\).

  • Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).

  • Step 1c. There exists an approximation \(\tilde{y}_t(\varvec{x})\) of the observation variable \(\tilde{y}(\varvec{x})\) such that the conditions of Assumption 2 are fulfilled with

    $$\begin{aligned} L_{t}\lesssim \tfrac{\lambda }{t},\quad \hat{L}_{t}\lesssim \tfrac{\lambda }{t},\quad r\lesssim 1,\quad \hat{r}\lesssim 1. \end{aligned}$$
    (8.10)
  • Step 2. We begin by showing that if \(\lambda \ge C\) for a sufficiently large absolute constant \(C>0\), then \(\mu _\lambda \ge \tfrac{1}{2}\). Indeed, for \(g\sim \mathsf {N}(0, 1)\), we have that

    $$\begin{aligned} \mu _\lambda = \mathbb {E}[{\mathfrak {m}}_\lambda (g)g]&= \mathbb {E}[\chi _{[-\lambda , \lambda )}(g)g^2] + \mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g){\mathfrak {m}}_\lambda (g)g]\\&\ge \mathbb {E}[\chi _{[-\lambda , \lambda )}(g)g^2] - \mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g)g^2]\\&=\mathbb {E}[g^2] - 2\mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g)g^2]\\&=1-2\mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g)g^2]. \end{aligned}$$

    The claim now follows by observing that \(\mathbb {E}[g^2]<\infty \), which implies \(\lim _{\lambda \rightarrow \infty }\mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g)g^2]=0\). Moreover,

    $$\begin{aligned} \mu _\lambda&= \mathbb {E}[\chi _{[-\lambda , \lambda )}(g)g^2] + \mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g){\mathfrak {m}}_\lambda (g)g]\\&\le \mathbb {E}[\chi _{[-\lambda , \lambda )}(g)g^2] + \mathbb {E}[\chi _{\mathbb {R}\backslash [-\lambda , \lambda )}(g)\lambda |g|] \le \mathbb {E}[g^2] =1. \end{aligned}$$

    Setting \(m_0{:=}0\), \(\varDelta {:=}1\), and \(u_0 {:=}u\), the claim of Corollary 5 follows directly from Theorem 2.

  • Step 3. Let \(\varvec{x}\in \mathcal {X}\subset \mathbb {S}^{p-1}\) and consider the orthogonal decomposition of the standard Gaussian random vector \(\varvec{a}\) along \(\varvec{x}\):

    $$\begin{aligned} \varvec{a}= \langle \varvec{a}, \varvec{x}\rangle \varvec{x}+ P_{{\varvec{x}}^\perp }(\varvec{a}), \end{aligned}$$

    where \(P_{{\varvec{x}}^\perp } {:=}P_{{\{\varvec{x}\}}^\perp }\). Since \(P_{{\varvec{x}}^\perp }(\varvec{a})\) is centered and \(\langle \varvec{a}, \varvec{x}\rangle \sim \mathsf {N}(0, 1)\) is independent of \(P_{{\varvec{x}}^\perp }(\varvec{a})\), we have that

    $$\begin{aligned} \mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}]&=\mathbb {E}[{\mathfrak {m}}_\lambda (\langle \varvec{a}, \varvec{x}\rangle ) (\langle \varvec{a}, \varvec{x}\rangle \varvec{x}+ P_{{\varvec{x}}^\perp }(\varvec{a}))]\\&=\mu _\lambda \varvec{x}+ \mathbb {E}[{\mathfrak {m}}_\lambda (\langle \varvec{a}, \varvec{x}\rangle )] \cdot \mathbb {E}[P_{{\varvec{x}}^\perp }(\varvec{a})] =\mu _\lambda \varvec{x}= T\varvec{x}, \end{aligned}$$

    which implies that \(\rho _{}(\varvec{x})=0\).

  • Step 4. Let \(t\in (0, \lambda )\). We define the half-open intervals \({\mathcal {I}}_{k,t}{:=}[k\lambda -(\lambda -t), k\lambda +(\lambda -t))\) for \(k \in \mathbb {Z}\) even and \({\mathcal {I}}_{k,t}{:=}[k\lambda -t, k\lambda +t)\) for \(k \in \mathbb {Z}\) odd. Moreover, we set

    $$\begin{aligned} {\mathcal {I}}_{\text {even},t}{:=}\bigcup _{\begin{array}{c} k\in \mathbb {Z}\\ k \text { even} \end{array}}{\mathcal {I}}_{k,t}, \quad {\mathcal {I}}_{\text {odd},t}{:=}\bigcup _{\begin{array}{c} k\in \mathbb {Z}\\ k \text { odd} \end{array}}{\mathcal {I}}_{k,t} \end{aligned}$$

    and observe that the intervals \({\mathcal {I}}_{k,t}\), \(k\in \mathbb {Z}\), form a partition of \(\mathbb {R}\). Finally, we introduce the functions

    $$\begin{aligned} g_t(s){:=}-\big (\tfrac{\lambda -t}{t}\big )\cdot s, \qquad s\in \mathbb {R}, \end{aligned}$$

    and

    $$\begin{aligned} \psi _t(s) {:=}{\left\{ \begin{array}{ll} {\mathfrak {m}}_\lambda (s), &{}s\in {\mathcal {I}}_{k,t},\; k \text { even},\\ g_t(s-k\lambda ), \quad &{}s\in {\mathcal {I}}_{k,t},\; k \text { odd}, \end{array}\right. } \end{aligned}$$

    and

    $$\begin{aligned} \phi _t(s){:=}|{\mathfrak {m}}_\lambda (s)-\psi _t(s)|, \qquad s\in \mathbb {R}. \end{aligned}$$

    Let us show that both \(\psi _t\) and \(\phi _t\) are \(\tfrac{\lambda }{t}\)-Lipschitz continuous. We begin by verifying that \(\psi _t\) is continuous. Clearly, \(\psi _t\) is continuous on \({\mathcal {I}}_{k,t}\) for every odd integer k. Now let k be an even integer. For every \(s\in {\mathcal {I}}_{k,t}=[k\lambda -(\lambda -t), k\lambda +(\lambda -t))\), we have \(\big \lfloor \tfrac{s+\lambda }{2\lambda }\big \rfloor =\tfrac{k}{2}\), and therefore \({\mathfrak {m}}_\lambda (s)=s-k\lambda \). This shows that \(\psi _t\) is continuous on every interval \({\mathcal {I}}_{k,t}\). Let us now write \({\mathcal {I}}_{k,t}=[u_k, v_k)\) for \(k\in \mathbb {Z}\). To show that \(\psi _t\) is continuous, it remains to check that \(\lim _{s\uparrow v_k}\psi _t(s)=\psi _t(u_{k+1})\) for every \(k\in \mathbb {Z}\). For k even, we have that

    $$\begin{aligned} \lim _{s\uparrow v_k}\psi _t(s) = \lim _{s\uparrow v_k}{\mathfrak {m}}_\lambda (s)=\lim _{s\uparrow v_k} s-k\lambda = v_k-k\lambda = \lambda - t. \end{aligned}$$

    Furthermore, it holds that \(u_{k+1}=(k+1)\lambda -t\in {\mathcal {I}}_{k+1,t}\), which implies

    $$\begin{aligned} \psi _t(u_{k+1})=g_t(u_{k+1}-(k+1)\lambda )=g_t(-t)=\lambda -t. \end{aligned}$$

    On the other hand, for k odd, we have that

    $$\begin{aligned} \lim _{s\uparrow v_k}\psi _t(s)=\lim _{s\uparrow v_k}g_t(s-k\lambda )=\lim _{s\uparrow v_k} -\big (\tfrac{\lambda -t}{t}\big )(s-k\lambda )=-\big (\tfrac{\lambda -t}{t}\big )(v_k-k\lambda )=-(\lambda -t). \end{aligned}$$

    Furthermore, it holds that \(u_{k+1}=k\lambda +t\in {\mathcal {I}}_{k+1,t}\), which implies

    $$\begin{aligned} \psi _t(u_{k+1})={\mathfrak {m}}_\lambda (u_{k+1})=u_{k+1}-(k+1)\lambda =-(\lambda -t). \end{aligned}$$

    This shows that \(\psi _t\) is continuous. Since \(\psi _t\) is continuous and piecewise linear, it is \(\max \{1, \tfrac{\lambda -t}{t}\}\)-Lipschitz. Therefore, \(\psi _t\) is also \(\tfrac{\lambda }{t}\)-Lipschitz. Next, we show that \(\phi _t\) is \(\tfrac{\lambda }{t}\)-Lipschitz as well. First note that

    $$\begin{aligned} \phi _t(s) = |{\mathfrak {m}}_\lambda (s)-\psi _t(s)|\cdot \chi _{{\mathcal {I}}_{\text {odd},t}}(s),\qquad s\in \mathbb {R}. \end{aligned}$$

    Let \(s\in {\mathcal {I}}_{k,t}=[k\lambda -t, k\lambda + t)\) for k an odd integer. Let us show that

    $$\begin{aligned} |{\mathfrak {m}}_\lambda (s)-\psi _t(s)|=\lambda - \tfrac{\lambda }{t}|k\lambda -s|. \end{aligned}$$

    To this end, we distinguish the cases \(s\in [k\lambda -t, k\lambda )\) and \(s\in [k\lambda , k\lambda +t)\). First, let \(s\in [k\lambda -t, k\lambda )\). It follows \(\big \lfloor \tfrac{s+\lambda }{2\lambda }\big \rfloor =\tfrac{k-1}{2}\), which implies \({\mathfrak {m}}_\lambda (s)=s-k\lambda +\lambda \). Furthermore, \(g_t(s-k\lambda )=s-k\lambda +\tfrac{\lambda }{t}(k\lambda -s)\). It follows that \({\mathfrak {m}}_\lambda (s)\ge g_t(s-k\lambda )\) and

    $$\begin{aligned} |{\mathfrak {m}}_\lambda (s)-\psi _t(s)| = {\mathfrak {m}}_\lambda (s) - g_t(s-k\lambda ) = \lambda - \tfrac{\lambda }{t}|k\lambda -s|. \end{aligned}$$

    The case \(s\in [k\lambda , k\lambda +t)\) can be treated analogously. Altogether, this shows that for any \(s\in \mathbb {R}\), we have that

    $$\begin{aligned} \phi _t(s) = \sum _{\begin{array}{c} k\in \mathbb {Z}\\ k\text { odd} \end{array}}\chi _{{\mathcal {I}}_{k,t}}(s)\cdot (\lambda - \tfrac{\lambda }{t}|k\lambda -s|). \end{aligned}$$

    In particular, \(\phi _t\) is \(\tfrac{\lambda }{t}\)-Lipschitz. We approximate the observation variable \(\tilde{y}(\varvec{x})={\mathfrak {m}}_\lambda (\langle \varvec{a}, \varvec{x}\rangle )\) by \(\tilde{y}_t(\varvec{x})=\psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle )\) for \(t^{\circ }{:=}\tfrac{t}{56\cdot 64}\). Observe that \(t^\circ < \tfrac{\lambda }{2}\). Let us show that the conditions of Assumption 2 are indeed fulfilled with (8.10): On Assumption 2(a). Let \(\varvec{x}\in \mathcal {X}\) and \(\varvec{z}\in \mathbb {S}^{p-1}\), and consider the orthogonal decomposition

    $$\begin{aligned} \varvec{z}= \langle \varvec{z}, \varvec{x}\rangle \varvec{x}+ P_{{\varvec{x}}^\perp }(\varvec{z}). \end{aligned}$$

    This implies

    $$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]\le |\langle \varvec{z}, \varvec{x}\rangle |\cdot \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{x}\rangle |] + \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]. \end{aligned}$$

    Clearly, \(|\langle \varvec{z}, \varvec{x}\rangle |\le 1\) and \(\varepsilon _t(\varvec{x})=\phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle )\le \lambda \cdot \chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )\), which implies

    $$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |]\le & {} \underbrace{\lambda \cdot \mathbb {E}[\chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )\cdot |\langle \varvec{a}, \varvec{x}\rangle |]}_{=:S_1} {} \\&+ {} \underbrace{\lambda \cdot \mathbb {E}[\chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )\cdot |\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]}_{=:S_2}. \end{aligned}$$

    Let us estimate \(S_1\) and \(S_2\) separately, starting with \(S_2\). Since \(\langle \varvec{a}, \varvec{x}\rangle \) and \(\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle \) are independent, we observe that

    $$\begin{aligned} \mathbb {E}[\chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )\cdot |\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |] = \mathbb {E}[\chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )]\cdot \mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]. \end{aligned}$$

    Using Jensen’s inequality, we obtain

    $$\begin{aligned} \mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |]\le \Big (\mathbb {E}[|\langle \varvec{a}, P_{{\varvec{x}}^\perp }(\varvec{z})\rangle |^2]\Big )^{1/2}=\Vert P_{{\varvec{x}}^\perp }(\varvec{z})\Vert _{2}\le \Vert \varvec{z}\Vert _{2}=1. \end{aligned}$$

    Therefore, it follows that \(S_2 \le \lambda \cdot \mathbb {P}(\langle \varvec{a}, \varvec{x}\rangle \in {\mathcal {I}}_{\text {odd},t^\circ })\). Since \(\langle \varvec{a}, \varvec{x}\rangle \sim \mathsf {N}(0, 1)\), it holds that

    $$\begin{aligned} \tfrac{1}{2}\mathbb {P}(\langle \varvec{a}, \varvec{x}\rangle \in {\mathcal {I}}_{\text {odd},t^\circ }) = \tfrac{1}{\sqrt{2\pi }} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }e^{-x^2/2}\, dx \le \tfrac{2t^\circ }{\sqrt{2\pi }} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} e^{-(k\lambda -t^\circ )^2/2}. \end{aligned}$$

    For every \(k\in \mathbb {N}\),

    $$\begin{aligned} 2\lambda e^{-((k+2)\lambda -t^\circ )^2/2}\le \int _{k\lambda -t^\circ }^{(k+2)\lambda -t^\circ }e^{-x^2/2}\, dx. \end{aligned}$$

    Therefore,

    $$\begin{aligned} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} e^{-(k\lambda -t^\circ )^2/2}&= e^{-(\lambda -t^\circ )^2/2} + \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} e^{-((k+2)\lambda -t^\circ )^2/2}\\&\le \tfrac{1}{\lambda -t^\circ }\int _{0}^{\lambda -t^\circ }e^{-x^2/2}\, dx + \tfrac{1}{2\lambda } \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}}\int _{k\lambda -t^\circ }^{(k+2)\lambda -t^\circ }e^{-x^2/2}\, dx\\&\le \tfrac{1}{\lambda -t^\circ }\int _{0}^{\infty }e^{-x^2/2}\,dx. \end{aligned}$$

    Using the identity \(\tfrac{1}{\sqrt{2\pi }}\int _{0}^{\infty }e^{-x^2/2}\,dx=\tfrac{1}{2}\), we obtain

    $$\begin{aligned} S_2 \le \tfrac{2 t^{\circ }\lambda }{\lambda -t^{\circ }}\le 4t^{\circ }, \end{aligned}$$

    where the last inequality follows from \(t^\circ <\tfrac{\lambda }{2}\). Let us now estimate \(S_1\). Since \(\langle \varvec{a}, \varvec{x}\rangle \sim \mathsf {N}(0, 1)\) and \({\mathcal {I}}_{\text {odd},t^\circ }\) is symmetric about the origin, we have that

    $$\begin{aligned} \mathbb {E}[\chi _{{\mathcal {I}}_{\text {odd}, t^{\circ }}}(\langle \varvec{a}, \varvec{x}\rangle )\cdot |\langle \varvec{a}, \varvec{x}\rangle |] = \tfrac{2}{\sqrt{2\pi }} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }xe^{-x^2/2}\, dx. \end{aligned}$$

    It holds that

    $$\begin{aligned}&\sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }xe^{-x^2/2}\, dx\\&\quad \le 2t^\circ \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} (k\lambda +t^\circ )e^{-(k\lambda -t^\circ )^2/2}\\&\quad = 2t^\circ \Big ((\lambda +t^\circ )\cdot e^{-(\lambda -t^\circ )^2/2} + \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}}((k+2)\lambda +t^\circ )e^{-((k+2)\lambda -t^\circ )^2/2}\Big )\\&\quad \le 2t^\circ \Big ( \tfrac{1}{\lambda -t^\circ }\int _0^{\lambda -t^\circ }(x+\lambda +t^\circ )\cdot e^{-x^2/2}\, dx \\&\qquad + \tfrac{1}{2\lambda }\sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}}\int _{k\lambda -t^\circ }^{(k+2)\lambda -t^\circ }(x+2\lambda +2t^\circ )\cdot e^{-x^2/2}\, dx\Big ) \\&\quad \le \tfrac{4t^\circ }{\lambda }\int _0^\infty (x+3\lambda )\cdot e^{-x^2/2}\, dx\\&\quad =\tfrac{4t^\circ }{\lambda }\cdot (1+3\lambda \sqrt{\tfrac{\pi }{2}}). \end{aligned}$$

    Therefore, if \(\lambda \le 2\), then

    $$\begin{aligned} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }xe^{-x^2/2}\, dx\le \tfrac{4t^\circ }{\lambda }\cdot (1+6\sqrt{\tfrac{\pi }{2}})\le \tfrac{52t^\circ }{\lambda }. \end{aligned}$$

    If \(\lambda >2\), then \(\lambda -t^\circ >1\). Since the function \(x\mapsto x\exp (-x^2/2)\) is monotonically decreasing for \(x\in [1,\infty )\), we obtain

    $$\begin{aligned}&\sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }xe^{-x^2/2}\, dx\\&\quad \le 2t^\circ \cdot \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} (k\lambda -t^\circ )\cdot e^{-(k\lambda -t^\circ )^2/2}\\&\quad =2t^\circ \cdot \Big ( (\lambda -t^\circ )\cdot \exp (-(\lambda -t^\circ )^2/2)+\sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} ((k+2)\lambda -t^\circ )\cdot e^{-((k+2)\lambda -t^\circ )^2/2}\Big ). \end{aligned}$$

    Since \(\lambda -t^\circ >1\), it follows \((\lambda -t^\circ )\cdot \exp (-(\lambda -t^\circ )^2/2)\le \tfrac{1}{\lambda -t^\circ }\). Furthermore,

    $$\begin{aligned} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} ((k+2)\lambda -t^\circ )\cdot e^{-((k+2)\lambda -t^\circ )^2/2} \le \tfrac{1}{2\lambda }\sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda - t^\circ }^{(k+2)\lambda -t^\circ } xe^{-x^2/2}\,dx \le \tfrac{1}{2\lambda }. \end{aligned}$$

    Therefore, if \(\lambda >2\), then

    $$\begin{aligned} \sum _{\begin{array}{c} k\in \mathbb {N}\\ k\text { odd} \end{array}} \int _{k\lambda -t^\circ }^{k\lambda +t^\circ }xe^{-x^2/2}\, dx\le \tfrac{8t^\circ }{\lambda }. \end{aligned}$$

    This shows \(S_1\le 52 t^\circ \). Putting everything together, we obtain

    $$\begin{aligned} \mathbb {E}[\varepsilon _t(\varvec{x}) \cdot |\langle \varvec{a}, \varvec{z}\rangle |] \le S_1 + S_2\le 56t^\circ =\tfrac{t}{64}, \end{aligned}$$

    which implies that Assumption 2(a) is satisfied. On Assumption 2(b). For \(\varvec{x}, \varvec{x}'\in \mathcal {X}\), we have that

    $$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \mu _\lambda \cdot \Vert \langle \varvec{a}, \varvec{x}-\varvec{x}'\rangle \Vert _{\psi _2}+\Vert \psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle )-\psi _{t^\circ }(\langle \varvec{a}, \varvec{x}'\rangle )\Vert _{\psi _2}\\&\lesssim \mu _\lambda \cdot \Vert \varvec{x}-\varvec{x}'\Vert _{2} + \tfrac{\lambda }{t^\circ } \cdot \Vert \varvec{x}-\varvec{x}'\Vert _{2}, \end{aligned}$$

    where we have used that \(\psi _{t^\circ }\) is \(\tfrac{\lambda }{t^\circ }\)-Lipschitz. Since \(\mu _\lambda \ge \tfrac{1}{2}\), \(t\le \lambda \), and \(t^\circ =\tfrac{t}{56\cdot 64}\), it follows that

    $$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}\lesssim \tfrac{\lambda }{t}\cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$

    Let \(\varvec{x}\in \mathcal {X}\subset \mathbb {S}^{p-1}\). Since \(\mu _\lambda \le 1\) and \(|\psi _{t^\circ }(s)|\le |s|\) for all \(s\in \mathbb {R}\), we observe that

    $$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}\le \mu _\lambda \cdot \Vert \langle \varvec{a}, \varvec{x}\rangle \Vert _{\psi _2} + \Vert \psi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle )\Vert _{\psi _2}\lesssim 1. \end{aligned}$$

    Therefore, Assumption 2(b) is satisfied with \(L_{t}\lesssim \tfrac{\lambda }{t}\) and \(r\lesssim 1\). On Assumption 2(c). For \(\varvec{x}, \varvec{x}'\in \mathcal {X}\),

    $$\begin{aligned} \Vert \varepsilon _t(\varvec{x})-\varepsilon _t(\varvec{x}')\Vert _{\psi _2}=\Vert \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle ) - \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}'\rangle )\Vert _{\psi _2}\lesssim \tfrac{\lambda }{t^\circ }\cdot \Vert \varvec{x}-\varvec{x}'\Vert _{2}, \end{aligned}$$

    where we have used that \(\phi _{t^\circ }\) is \(\tfrac{\lambda }{t^\circ }\)-Lipschitz. Since \(\mu _\lambda \ge \tfrac{1}{2}\) and \(t^\circ =\tfrac{t}{56\cdot 64}\), it follows that

    $$\begin{aligned} \Vert \varepsilon _t(\varvec{x})-\varepsilon _t(\varvec{x}')\Vert _{\psi _2}\lesssim \tfrac{\lambda }{t}\cdot d_T(\varvec{x},\varvec{x}'). \end{aligned}$$

    Let \(\varvec{x}\in \mathcal {X}\subset \mathbb {S}^{p-1}\). Since \(|\phi _{t^\circ }(s)|\le |s|\) for all \(s\in \mathbb {R}\), we observe that

    $$\begin{aligned} \Vert \varepsilon _t(\varvec{x})\Vert _{\psi _2}=\Vert \phi _{t^\circ }(\langle \varvec{a}, \varvec{x}\rangle )\Vert _{\psi _2}\lesssim 1. \end{aligned}$$

    Therefore, Assumption 2(c) is satisfied with \(\hat{L}_{t}\lesssim \tfrac{\lambda }{t}\) and \(\hat{r}\lesssim 1\). (The construction of \(\psi _{t^\circ }\) and \(\phi _{t^\circ }\) from this step is sanity-checked numerically in the sketch after this proof.)

\(\square \)
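
Numerical sanity check (not part of the proof). The following Python sketch illustrates the construction of Step 4. It assumes the centered modulo representation \({\mathfrak {m}}_\lambda (s)=s-2\lambda \lfloor \tfrac{s+\lambda }{2\lambda }\rfloor \in [-\lambda ,\lambda )\), which is consistent with the floor identities used above, and checks numerically that \(\psi _t\) is continuous with slope at most \(\max \{1,\tfrac{\lambda -t}{t}\}\le \tfrac{\lambda }{t}\), while \(\phi _t=|{\mathfrak {m}}_\lambda -\psi _t|\) is bounded by \(\lambda \). The values of \(\lambda \) and \(t\) are illustrative; for simplicity, the sketch locates the relevant interval via the nearest multiple of \(\lambda \), which is valid as long as \(t\le \tfrac{\lambda }{2}\) (cf. the regime \(t^\circ <\tfrac{\lambda }{2}\) used in the proof).

    import numpy as np

    lam = 1.0        # modulo parameter lambda (illustrative choice)
    t = 0.3          # approximation scale, here t <= lambda/2 (illustrative choice)

    def mod_centered(s):
        # assumed representation: m_lambda(s) = s - 2*lam*floor((s + lam)/(2*lam)), with values in [-lam, lam)
        return s - 2.0 * lam * np.floor((s + lam) / (2.0 * lam))

    def psi(s):
        # keep m_lambda on the "even" intervals; use the connecting line g_t(s - k*lam) around odd multiples of lam
        k = np.round(s / lam)                                 # nearest multiple of lam (valid since t <= lam/2)
        odd = (np.mod(k, 2) != 0) & (np.abs(s - k * lam) < t)
        g = -((lam - t) / t) * (s - k * lam)
        return np.where(odd, g, mod_centered(s))

    grid = np.linspace(-4.0, 4.0, 1_000_001)
    m_vals, psi_vals = mod_centered(grid), psi(grid)

    # psi_t is continuous and piecewise linear with slope at most max(1, (lam - t)/t) <= lam/t ...
    print(np.max(np.abs(np.diff(psi_vals) / np.diff(grid))), lam / t)

    # ... while the error phi_t = |m_lambda - psi_t| is supported on the odd intervals and bounded by lam
    print(np.max(np.abs(m_vals - psi_vals)))                  # <= lam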

Proof of Corollary 6

We follow the proof template from the beginning of Sect. 8:

  • Step 1a. The model setup of Corollary 6 fits into Assumption 1 as follows:

    • The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered, isotropic, and it has independent, symmetric, and sub-Gaussian coordinates with \(\max _{j \in [p]}\Vert a_j\Vert _{\psi _2} \le L\); in particular, we have that \(\Vert \varvec{a}\Vert _{\psi _2}\lesssim L\) (e.g., see [64, Lem. 3.4.2]). The signal set \(\mathcal {X}\) is a subset of \(R\mathbb {B}_2^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}f(\varvec{a}\circ \varvec{x})\), where \(f:\mathbb {R}^p\rightarrow \mathbb {R}\) is given by \(f(\varvec{z}){:=}\sum _{j=1}^p f_j(z_j)\), with \(f_j:\mathbb {R}\rightarrow \mathbb {R}\) odd, \(\gamma \)-Lipschitz, and satisfying the conditions (4.10) and (4.11).

    • The target function \(T:\mathcal {X}\rightarrow \mathbb {R}^p\) is defined by \(T\varvec{x}{:=}\mathbb {E}[f(\varvec{a}\circ \varvec{x})\varvec{a}]\).

  • Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).

  • Step 1c. For the trivial choice \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\), the conditions of Assumption 2 are fulfilled with

    $$\begin{aligned} L_{t}\lesssim L\cdot \tfrac{\gamma }{\alpha },\quad \hat{L}_{t}= 0,\quad r\lesssim L(\beta _2+\gamma )R,\quad \hat{r}= 0. \end{aligned}$$
  • Step 2. We begin by showing that

    $$\begin{aligned} (T\varvec{x})_j = \mathbb {E}[f_j(a_j x_j)a_j], \qquad \text {for all } j\in [p], \end{aligned}$$

    where \(a_j\) is the j-th coordinate of the measurement vector \(\varvec{a}\in \mathbb {R}^p\). Indeed, we have that

    $$\begin{aligned} (T\varvec{x})_j&= (\mathbb {E}[f(\varvec{a}\circ \varvec{x})\varvec{a}])_j = \mathbb {E}[f(\varvec{a}\circ \varvec{x})a_j] \\&= \sum _{j'=1}^p \mathbb {E}[f_{j'}(a_{j'}x_{j'})a_j] = \mathbb {E}[f_j(a_jx_j)a_j], \end{aligned}$$

    where for the last equality we have used the independence of the random variables \(a_1, \ldots , a_p\) and \(\mathbb {E}[a_j]=0\). Next, let us show that \((T\varvec{x})_j\in [\beta _1x_j, \beta _2x_j]\) for \(x_j\ge 0\) and \((T\varvec{x})_j\in [\beta _2x_j, \beta _1x_j]\) for \(x_j<0\). Let us first assume that \(x_j\ge 0\). Then, clearly \(\mathbb {E}[f_j(a_jx_j)a_j\chi _{[0,\infty )}(a_j)]\le \beta _2 x_j \mathbb {E}[a_j^2\chi _{[0,\infty )}(a_j)]\). Furthermore,

    $$\begin{aligned} \mathbb {E}[f_j(a_jx_j)a_j\chi _{(-\infty ,0)}(a_j)]&= \mathbb {E}[f_j(-a_jx_j)(-a_j)\chi _{(-\infty ,0)}(a_j)] \\&\le \beta _2x_j \mathbb {E}[a_j^2\chi _{(-\infty ,0)}(a_j)], \end{aligned}$$

    where we have used that \(f_j\) is an odd function for the first equality. Therefore,

    $$\begin{aligned} \mathbb {E}[f_j(a_jx_j)a_j]&= \mathbb {E}[f_j(a_jx_j)a_j\chi _{[0,\infty )}(a_j)] + \mathbb {E}[f_j(a_jx_j)a_j\chi _{(-\infty ,0)}(a_j)] \\&\le \beta _2x_j \mathbb {E}[a_j^2] = \beta _2x_j, \end{aligned}$$

    where \(\mathbb {E}[a_j^2]=1\) follows from the assumption that \(\varvec{a}\) is isotropic. Similarly, we observe that

    $$\begin{aligned} \mathbb {E}[f_j(a_jx_j)a_j\chi _{[0,\infty )}(a_j)] \ge \beta _1x_j \mathbb {E}[a_j^2\chi _{[0,\infty )}(a_j)] \end{aligned}$$

    and

    $$\begin{aligned} \mathbb {E}[f_j(a_jx_j)a_j\chi _{(-\infty ,0)}(a_j)]&= \mathbb {E}[f_j(-a_jx_j)(-a_j)\chi _{(-\infty ,0)}(a_j)] \\&\ge \beta _1x_j\mathbb {E}[a_j^2\chi _{(-\infty ,0)}(a_j)], \end{aligned}$$

    which implies

    $$\begin{aligned} \mathbb {E}[f_j(a_jx_j)a_j]\ge \beta _1x_j\mathbb {E}[a_j^2]=\beta _1x_j. \end{aligned}$$

    This shows that \(\beta _1x_j\le (T\varvec{x})_j\le \beta _2x_j\) for \(x_j\ge 0\). Analogously, one can show that \(\beta _2x_j\le (T\varvec{x})_j\le \beta _1x_j\) for \(x_j<0\). Therefore, \(|(T\varvec{x})_j|\le \beta _2|x_j|\), which implies \(\Vert T\varvec{x}\Vert _{2}\le \beta _2\Vert \varvec{x}\Vert _{2}\) for every \(\varvec{x}\in \mathcal {X}\). As a consequence, it holds that \(T\mathcal {X}\subset \beta _2R\mathbb {B}_2^p\). Setting \(m_0{:=}0\), \(\varDelta {:=}1\), and \(u_0 {:=}u\), the claim of Corollary 6 follows directly from Theorem 2.

  • Step 3. By definition of the target function it follows that \(\rho _{}(\varvec{x})=0\) for every \(\varvec{x}\in \mathcal {X}\).

  • Step 4. We simply set \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\). Then, \(\varepsilon _t(\varvec{x}) = 0\), implying that Assumption 2(a) and (c) are trivially fulfilled with \(\hat{L}_{t}=\hat{r}=0\). Since \(\xi _t(\varvec{x})=\langle \varvec{a}, T\varvec{x}\rangle -f(\varvec{a}\circ \varvec{x})\), the following holds for every \(\varvec{x}, \varvec{x}'\in \mathcal {X}\):

    $$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\le \Vert \langle \varvec{a}, T\varvec{x}-T\varvec{x}'\rangle \Vert _{\psi _2} + \Vert f(\varvec{a}\circ \varvec{x})-f(\varvec{a}\circ \varvec{x}')\Vert _{\psi _2}. \end{aligned}$$

    Clearly, we have that \(\Vert \langle \varvec{a}, T\varvec{x}-T\varvec{x}'\rangle \Vert _{\psi _2}\lesssim L\cdot d_T(\varvec{x},\varvec{x}')\) and

    $$\begin{aligned} \Vert f(\varvec{a}\circ \varvec{x})-f(\varvec{a}\circ \varvec{x}')\Vert _{\psi _2} = \Vert \sum _{j=1}^pf_j(a_jx_j)-f_j(a_jx_j')\Vert _{\psi _2}. \end{aligned}$$

    Let \(j\in [p]\). Using that \(f_j\) is \(\gamma \)-Lipschitz, we obtain

    $$\begin{aligned} \Vert f_j(a_jx_j)-f_j(a_jx_j')\Vert _{\psi _2} \lesssim \gamma \cdot \Vert a_jx_j-a_jx_j'\Vert _{\psi _2} \le L\cdot \gamma \cdot |x_j-x_j'|. \end{aligned}$$

    Furthermore, since \(a_j\) is a symmetric random variable and \(f_j\) is odd, it holds \(\mathbb {E}[f_j(a_jx)]=0\) for every \(x\in \mathbb {R}\). In particular, the random variable \(f_j(a_jx_j)-f_j(a_jx_j')\) is centered. Since the random variables \(a_1,\ldots , a_p\) are independent, Hoeffding’s inequality yields

    $$\begin{aligned} \Vert \sum _{j=1}^p f_j(a_jx_j)-f_j(a_jx_j') \Vert _{\psi _2}^2\lesssim & {} \sum _{j=1}^p \Vert f_j(a_jx_j)-f_j(a_jx_j') \Vert _{\psi _2}^2 \\ {}\lesssim & {} L^2 \cdot \gamma ^2 \cdot \big \Vert \varvec{x}-\varvec{x}'\big \Vert _{2}^2. \end{aligned}$$

    Therefore,

    $$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\lesssim L\cdot d_T(\varvec{x},\varvec{x}') + L\cdot \gamma \cdot \Vert \varvec{x}-\varvec{x}'\Vert _{2}. \end{aligned}$$

    We now show that \(d_T(\varvec{x},\varvec{x}')\ge \alpha \Vert \varvec{x}-\varvec{x}'\Vert _{2}\). For \(j\in [p]\), we have that

    $$\begin{aligned} |(T\varvec{x})_j - (T\varvec{x}')_j|&= |\mathbb {E}[(f_j(a_jx_j)-f_j(a_jx_j'))a_j]|\\&= |\mathbb {E}[(f_j(a_jx_j)-f_j(a_jx_j'))a_j\chi _{[0,\infty )}(a_j)] \\&\quad + \mathbb {E}[(f_j(a_jx_j)-f_j(a_jx_j'))a_j\chi _{(-\infty ,0)}(a_j)]|. \end{aligned}$$

    We may assume that \(x_j\ge x_j'\). If \(a_j\ge 0\), then \(a_jx_j\ge a_jx_j'\), which implies \(f_j(a_jx_j)-f_j(a_jx_j')\ge \alpha (a_jx_j-a_jx_j')\). Therefore, \((f_j(a_jx_j)-f_j(a_jx_j'))a_j\ge \alpha (a_jx_j-a_jx_j')a_j\), which implies

    $$\begin{aligned} \mathbb {E}[(f_j(a_jx_j)-f_j(a_jx_j'))a_j\chi _{[0,\infty )}(a_j)]&\ge \mathbb {E}[\alpha (a_jx_j-a_jx_j')a_j\chi _{[0,\infty )}(a_j)]\\&=\alpha (x_j-x_j')\mathbb {E}[a_j^2\chi _{[0,\infty )}(a_j)] \ge 0. \end{aligned}$$

    If \(a_j<0\), then \(a_jx_j\le a_jx_j'\), which implies \(f_j(a_jx_j')-f_j(a_jx_j)\ge \alpha (a_jx_j'-a_jx_j)\). Therefore, \(f_j(a_jx_j)-f_j(a_jx_j')\le \alpha (a_jx_j-a_jx_j')\), which implies \((f_j(a_jx_j)-f_j(a_jx_j'))a_j\ge \alpha (a_jx_j-a_jx_j')a_j\). This shows

    $$\begin{aligned} \mathbb {E}[(f_j(a_jx_j)-f_j(a_jx_j'))a_j\chi _{(-\infty ,0)}(a_j)]&\ge \mathbb {E}[\alpha (a_jx_j-a_jx_j')a_j\chi _{(-\infty ,0)}(a_j)]\\&=\alpha (x_j-x_j')\mathbb {E}[a_j^2\chi _{(-\infty ,0)}(a_j)] \ge 0. \end{aligned}$$

    It follows that \(|(T\varvec{x})_j - (T\varvec{x}')_j|\ge \alpha (x_j-x_j')=\alpha |x_j-x_j'|\), and therefore \(d_T(\varvec{x},\varvec{x}')\ge \alpha \Vert \varvec{x}-\varvec{x}'\Vert _{2}\). Due to this inequality, we obtain

    $$\begin{aligned} \Vert \xi _t(\varvec{x}) - \xi _t(\varvec{x}')\Vert _{\psi _2}&\lesssim L\cdot d_T(\varvec{x},\varvec{x}') + L\cdot \gamma \cdot \Vert \varvec{x}-\varvec{x}'\Vert _{2}\\&\le L\cdot d_T(\varvec{x},\varvec{x}') + L\cdot \tfrac{\gamma }{\alpha }\cdot d_T(\varvec{x},\varvec{x}')\\&\le 2\cdot L\cdot \tfrac{\gamma }{\alpha }\cdot d_T(\varvec{x},\varvec{x}'), \end{aligned}$$

    where the last inequality follows from \(\alpha \le \gamma \). Now, let \(\varvec{x}\in \mathcal {X}\). It holds that

    $$\begin{aligned} \Vert \xi _t(\varvec{x})\Vert _{\psi _2}&=\Vert \langle \varvec{a}, T\varvec{x}\rangle -f(\varvec{a}\circ \varvec{x})\Vert _{\psi _2} \le \Vert \langle \varvec{a}, T\varvec{x}\rangle \Vert _{\psi _2} + \Vert f(\varvec{a}\circ \varvec{x})\Vert _{\psi _2}\\*&\lesssim L\cdot \Vert T\varvec{x}\Vert _{2} + \Vert f(\varvec{a}\circ \varvec{x})\Vert _{\psi _2} \le L\cdot \beta _2\cdot R+ \Vert f(\varvec{a}\circ \varvec{x})\Vert _{\psi _2}. \end{aligned}$$

    Since \(f_j\) is \(\gamma \)-Lipschitz and odd for every \(j\in [p]\), we obtain

    $$\begin{aligned} \Vert f_j(a_jx_j)\Vert _{\psi _2}=\Vert f_j(a_jx_j)-f_j(0)\Vert _{\psi _2}\lesssim \gamma \cdot \Vert a_jx_j\Vert _{\psi _2}\le L\cdot \gamma \cdot |x_j|. \end{aligned}$$

    Using Hoeffding’s inequality, we conclude that

    $$\begin{aligned} \Vert f(\varvec{a}\circ \varvec{x})\Vert _{\psi _2}=\Vert \sum _{j=1}^p f_j(a_jx_j)\Vert _{\psi _2} \lesssim \Big ( \sum _{j=1}^p\Vert f_j(a_jx_j)\Vert _{\psi _2}^2\Big )^{1/2} \le L\cdot \gamma \cdot \Vert \varvec{x}\Vert _{2}. \end{aligned}$$

    Hence, Assumption 2(b) is satisfied with \(L_{t}\lesssim L\cdot \tfrac{\gamma }{\alpha }\) and \(r\lesssim L(\beta _2+\gamma )R\). (The coordinate-wise structure of the target \(T\) derived in Step 2 is sanity-checked numerically in the sketch after this proof.)

\(\square \)
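
Numerical sanity check (not part of the proof). The coordinate-wise identity \((T\varvec{x})_j=\mathbb {E}[f_j(a_jx_j)a_j]\) from Step 2 only uses that the coordinates of \(\varvec{a}\) are independent and centered. The following Python sketch verifies it by simulation for Rademacher coordinates and the illustrative choice \(f_j=\tanh \) (odd and 1-Lipschitz), for which \(\mathbb {E}[f_j(a_jx_j)a_j]=\tanh (x_j)\) in closed form; whether this \(f_j\) satisfies the quantitative conditions (4.10) and (4.11) is irrelevant for the identity being checked.

    import numpy as np

    rng = np.random.default_rng(2)
    p, n = 6, 1_000_000                                  # dimension and sample size (illustrative)
    x = np.array([0.8, -0.5, 0.3, 0.0, -1.2, 0.6])       # an arbitrary illustrative signal

    a = rng.choice([-1.0, 1.0], size=(n, p))             # Rademacher coordinates: centered, isotropic, symmetric, independent
    y = np.tanh(a * x).sum(axis=1)                       # f(a o x) = sum_j f_j(a_j x_j) with f_j = tanh

    t_mc = a.T @ y / n                                   # empirical T x = E[f(a o x) a]
    print(np.max(np.abs(t_mc - np.tanh(x))))             # ~ 0: (T x)_j = E[f_j(a_j x_j) a_j] = tanh(x_j)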

8.5 Proofs for Sect. 4.5

Proof of Corollary 7

We follow the proof template from the beginning of Sect. 8:

  • Step 1a. The model setup of Corollary 7 fits into Assumption 1 as follows:

    • The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered, isotropic, and has independent sub-Gaussian coordinates with \(\max _{j \in [p]}\Vert a_j\Vert _{\psi _2} \le L\); in particular, we have that \(\Vert \varvec{a}\Vert _{\psi _2}\lesssim L\) (e.g., see [64, Lem. 3.4.2]). The signal set \(\mathcal {X}\) is an arbitrary subset of \(\{\mathcal {S}\subset [p] \ :|\mathcal {S}|\le s\}\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \mathcal {S}) {:=}f(\varvec{a}_{\mathcal {S}})\), where \((\varvec{a}_{\mathcal {S}})_j = a_j\) for \(j\in \mathcal {S}\) and \((\varvec{a}_{\mathcal {S}})_j =0\) for \(j\in \mathcal {S}^c\).

    • The target function \(T:\mathcal {X}\rightarrow K\) is defined by \(T\mathcal {S}{:=}\mathbb {E}[f(\varvec{a}_{\mathcal {S}})\varvec{a}]\).

  • Step 1b. The target mismatch \(\rho _{}(\mathcal {S})\) vanishes for every \(\mathcal {S}\in \mathcal {X}\).

  • Step 1c. For the trivial choice \(\tilde{y}_t(\mathcal {S}){:=}\tilde{y}(\mathcal {S})\), the conditions of Assumption 2 are fulfilled with

    $$\begin{aligned} L_{t}\le L+ \gamma \alpha ^{-1},\quad \hat{L}_{t}=0,\quad r\le L\beta + \kappa ,\quad \hat{r}=0. \end{aligned}$$
  • Step 2. Setting \(m_0{:=}0\), \(\varDelta {:=}L^{-1} \sqrt{\log L}\), and \(u{:=}u_0\), the claim of Corollary 7 follows directly from Theorem 2.

  • Step 3. This is clear by the definition of the target function.

  • Step 4. We simply set \(\tilde{y}_t(\mathcal {S}){:=}\tilde{y}(\mathcal {S})\). Then, \(\varepsilon _t(\mathcal {S}) = 0\), implying that Assumption 2(a) and (c) are trivially fulfilled with \(\hat{L}_{t}=\hat{r}=0\). To see that Assumption 2(b) holds as well, we first recall that the coordinates of \(\varvec{a}\) are centered and independent, so that

    $$\begin{aligned} (T\mathcal {S})_j=\mathbb {E}[f(\varvec{a}_{\mathcal {S}})a_j]= \mathbb {E}[f(\varvec{a}_{\mathcal {S}})]\cdot \mathbb {E}[a_j]=0 \qquad \text {for all } \mathcal {S}\in \mathcal {X}\text { and } j\in \mathcal {S}^c. \end{aligned}$$

    Together with the lower bound in (4.14), it follows that \({{\,\mathrm{supp}\,}}(T\mathcal {S}) = \mathcal {S}\) for all \(\mathcal {S}\in \mathcal {X}\). Using the lower bound in (4.14) again, we obtain the following estimate for all \(\mathcal {S}, \mathcal {S}'\in \mathcal {X}\) (see also the sketch at the end of this proof):

    $$\begin{aligned} d_T(\mathcal {S},\mathcal {S}')= \Vert T\mathcal {S}-T\mathcal {S}'\Vert _{2}\ge \Big (\sum _{j\in \mathcal {S}\setminus \mathcal {S}'} (T\mathcal {S})_j^2 + \sum _{j\in \mathcal {S}'\setminus \mathcal {S}} (T\mathcal {S}')_j^2 \Big )^{1/2}\ge \tfrac{\alpha }{\sqrt{s}}\sqrt{|\mathcal {S}\bigtriangleup \mathcal {S}'|}. \end{aligned}$$

    Combining this with the assumption (4.15), it follows that

    $$\begin{aligned} \Vert \xi _t(\mathcal {S}) - \xi _t(\mathcal {S}')\Vert _{\psi _2}&\le \Vert \langle \varvec{a}, T\mathcal {S}- T\mathcal {S}'\rangle \Vert _{\psi _2} + \Vert f(\varvec{a}_{\mathcal {S}}) - f(\varvec{a}_{\mathcal {S}'})\Vert _{\psi _2}\\&\le L\cdot d_T(\mathcal {S},\mathcal {S}') + \gamma \alpha ^{-1} \cdot d_T(\mathcal {S},\mathcal {S}') = (L+ \gamma \alpha ^{-1}) \cdot d_T(\mathcal {S},\mathcal {S}'). \end{aligned}$$

    This implies \(L_{t}\le L+ \gamma \alpha ^{-1}\). Since \({{\,\mathrm{supp}\,}}(T\mathcal {S}) = \mathcal {S}\) and \(|\mathcal {S}|\le s\) for \(\mathcal {S}\in \mathcal {X}\), we also have that \(|{{\,\mathrm{supp}\,}}(T\mathcal {S})|\le s\). The upper bound in (4.14) therefore yields \(\Vert T\mathcal {S}\Vert _{2}\le \beta \) for every \(\mathcal {S}\in \mathcal {X}\). Combining this estimate with (4.15), we obtain the following upper bound for the sub-Gaussian norm of \(\xi _t(\mathcal {S})\):

    $$\begin{aligned} \Vert \xi _t(\mathcal {S})\Vert _{\psi _2}=\Vert \langle \varvec{a}, T\mathcal {S}\rangle - f(\varvec{a}_{\mathcal {S}}) \Vert _{\psi _2} \le L\Vert T\mathcal {S}\Vert _{2} + \Vert f(\varvec{a}_{\mathcal {S}})\Vert _{\psi _2} \le L\beta + \kappa . \end{aligned}$$

    Hence, \(r{:=}L\beta + \kappa \) is a valid choice for Assumption 2(b).
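As referenced in Step 4, the chain of estimates for \(d_T(\mathcal {S},\mathcal {S}')\) can be sketched as follows: the first inequality uses that \((T\mathcal {S}')_j=0\) for \(j\in \mathcal {S}\setminus \mathcal {S}'\) and \((T\mathcal {S})_j=0\) for \(j\in \mathcal {S}'\setminus \mathcal {S}\), and simply drops the (nonnegative) contributions of all remaining coordinates. For the second inequality, assuming that the lower bound in (4.14) provides \(|(T\mathcal {S})_j|\ge \tfrac{\alpha }{\sqrt{s}}\) for all \(j\in {{\,\mathrm{supp}\,}}(T\mathcal {S})=\mathcal {S}\) (this is the reading consistent with how (4.14) is used above), each of the \(|\mathcal {S}\bigtriangleup \mathcal {S}'|\) summands is bounded from below by \(\tfrac{\alpha ^2}{s}\), so that

$$\begin{aligned} \sum _{j\in \mathcal {S}\setminus \mathcal {S}'} (T\mathcal {S})_j^2 + \sum _{j\in \mathcal {S}'\setminus \mathcal {S}} (T\mathcal {S}')_j^2 \ge |\mathcal {S}\bigtriangleup \mathcal {S}'|\cdot \tfrac{\alpha ^2}{s} = \Big (\tfrac{\alpha }{\sqrt{s}}\sqrt{|\mathcal {S}\bigtriangleup \mathcal {S}'|}\Big )^2. \end{aligned}$$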

\(\square \)

9 Proofs for Sect. 5

Proof of Theorem 3

Let \(\mathcal {X}_\varepsilon \) be a minimal subset of \(\mathcal {X}\) such that \(T\mathcal {X}_\varepsilon \) is a minimal \(\varepsilon \)-net for \(T\mathcal {X}\); in particular, we have that \(|\mathcal {X}_\varepsilon | = |T\mathcal {X}_\varepsilon |=\mathcal {N}(T\mathcal {X}, \varepsilon )\). The assumptions (5.2) and (5.3) allow us to first apply Theorem 2 to every \(\varvec{x}' \in \mathcal {X}_\varepsilon \), viewed as a singleton signal set \(\{\varvec{x}'\}\) (for which Assumption 2 is trivially fulfilled), and then to take a union bound over \(\mathcal {X}_\varepsilon \). Consequently, there exist universal constants \(c, C > 0\) such that the following event \({\mathcal {A}}_1\) holds with probability at least \(1 - \exp (-c u^2) - \exp (-c u_0^2)\): For every \(\varvec{x}' \in \mathcal {X}_\varepsilon \) with \(\rho _{}(\varvec{x}') \le \tfrac{t}{32}\) and every \(\varvec{y}\in \mathbb {R}^m\) such that

$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x}')\Vert _{[2m_0]}\le \varDelta t \quad \text { and } \quad \tfrac{1}{\sqrt{m}}\sigma _{2m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}'))_{2}\le \tfrac{t}{20}, \end{aligned}$$

any minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}'\Vert _{2} \le t\). Note that we have applied the robustness conditions of (2.4) with \(2m_0\) in place of \(m_0\) here, which is possible since \(2m_0\le m\).
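The union bound itself can be accounted for as follows (a sketch, assuming that, in line with (5.3), both \(u^2\) and \(u_0^2\) dominate \(C_0\cdot \log \mathcal {N}(T\mathcal {X}, \varepsilon )\) for a sufficiently large constant \(C_0\)): if a single application of Theorem 2 fails with probability at most \(\exp (-c'u^2)+\exp (-c'u_0^2)\), then summing over the \(\mathcal {N}(T\mathcal {X}, \varepsilon )\) elements of \(\mathcal {X}_\varepsilon \) yields a failure probability of at most

$$\begin{aligned} \exp \big (\log \mathcal {N}(T\mathcal {X}, \varepsilon )-c'u^2\big )+\exp \big (\log \mathcal {N}(T\mathcal {X}, \varepsilon )-c'u_0^2\big )\le \exp \big (-\tfrac{c'}{2}u^2\big )+\exp \big (-\tfrac{c'}{2}u_0^2\big ), \end{aligned}$$

where the last step uses \(\log \mathcal {N}(T\mathcal {X}, \varepsilon )\le \tfrac{c'}{2}\min \{u^2,u_0^2\}\), which holds once \(C_0\ge 2/c'\); adjusting the constant \(c\) then gives the probability bound stated above.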

Let \({\mathcal {A}}_2\) denote the event of (5.1) in Assumption 3. We now show that on the event \({\mathcal {A}}_1\cap {\mathcal {A}}_2\), which occurs with probability at least \(1 - \exp (-c u^2) - \exp (-c u_0^2)-\eta \), the conclusion of Theorem 3 holds: Let \(\varvec{x}\in \mathcal {X}\) be fixed and consider any input vector \(\varvec{y}\in \mathbb {R}^m\) satisfying

$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[2m_0]}\le \tfrac{1}{2}\varDelta t \quad \text { and } \quad \tfrac{1}{\sqrt{m}}\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2}\le \tfrac{t}{40}. \end{aligned}$$

By the definition of \(\mathcal {X}_\varepsilon \), there exists \(\varvec{x}'\in \mathcal {X}_\varepsilon \) such that \(\Vert T\varvec{x}-T\varvec{x}'\Vert _{2}\le \varepsilon \). According to the event \({\mathcal {A}}_2\), we conclude that

$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x}')\Vert _{[2m_0]}\le \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[2m_0]} + \tfrac{1}{\sqrt{m}}\Vert \tilde{\varvec{y}}(\varvec{x}) - \tilde{\varvec{y}}(\varvec{x}')\Vert _{[2m_0]} \le \varDelta t \end{aligned}$$

and

$$\begin{aligned} \tfrac{1}{\sqrt{m}}\sigma _{2m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}'))_{2}\le \tfrac{1}{\sqrt{m}}\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2} + \tfrac{1}{\sqrt{m}}\sigma _{m_0}(\tilde{\varvec{y}}(\varvec{x}) - \tilde{\varvec{y}}(\varvec{x}'))_{2}\le \tfrac{t}{20}. \end{aligned}$$
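Both estimates rely on elementary properties of the norms involved. The first one is the triangle inequality for \(\Vert \cdot \Vert _{[2m_0]}\), under the usual reading of \(\Vert \cdot \Vert _{[k]}\) as the \(\ell ^2\)-norm of the \(k\) entries largest in magnitude (which indeed defines a norm). The second one is the following splitting bound for the best-term approximation errors, sketched here with \(\varvec{u}{:=}\varvec{y}- \tilde{\varvec{y}}(\varvec{x})\), \(\varvec{v}{:=}\tilde{\varvec{y}}(\varvec{x}) - \tilde{\varvec{y}}(\varvec{x}')\), and with \(I, J\subset [m]\) denoting index sets of the \(m_0\) largest-magnitude entries of \(\varvec{u}\) and \(\varvec{v}\), respectively (writing \(\varvec{u}_{I^c}\) for the restriction of \(\varvec{u}\) to the coordinates in \(I^c\)):

$$\begin{aligned} \sigma _{2m_0}(\varvec{u}+\varvec{v})_{2}\le \Vert (\varvec{u}+\varvec{v})_{(I\cup J)^c}\Vert _{2}\le \Vert \varvec{u}_{(I\cup J)^c}\Vert _{2}+\Vert \varvec{v}_{(I\cup J)^c}\Vert _{2}\le \Vert \varvec{u}_{I^c}\Vert _{2}+\Vert \varvec{v}_{J^c}\Vert _{2}=\sigma _{m_0}(\varvec{u})_{2}+\sigma _{m_0}(\varvec{v})_{2}, \end{aligned}$$

where the first inequality uses \(|I\cup J|\le 2m_0\).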

Finally, according to the event \({\mathcal {A}}_1\), every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}'\Vert _{2} \le t\) and therefore \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2}\le \Vert \hat{\varvec{z}}- T\varvec{x}'\Vert _{2} + \Vert T\varvec{x}' - T\varvec{x}\Vert _{2}\le t + \varepsilon \). \(\square \)

Proof of Corollary 8

Analogously to Corollary 2, the model setup of Corollary 8 fits into Assumption 1 as follows:

  • We have that \(\varvec{a}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\) and therefore \(\Vert \varvec{a}\Vert _{\psi _2} \lesssim 1\). The signal set \(\mathcal {X}\) is an arbitrary subset of \(\mathbb {R}^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}{{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle )\). In particular, the resulting observation vector of \(\varvec{x}\) is given by \(\tilde{\varvec{y}}(\varvec{x}) = {{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x})\) where \(\varvec{A}\in \mathbb {R}^{m\times p}\) is a standard Gaussian random matrix with row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\).

  • The target function \(T:\mathcal {X}\rightarrow K\) corresponds to the (scaled) normalization \(T\varvec{x}{:=}\sqrt{\tfrac{2}{\pi }}\tfrac{\varvec{x}}{\Vert \varvec{x}\Vert _{2}}\).

As already shown in Sect. 8.1 (Step 3), the target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\). Moreover, we clearly have that \(r={\mathrm{sup}}_{\varvec{x}\in \mathcal {X}}\Vert \langle \varvec{a}, T\varvec{x}\rangle - \tilde{y}(\varvec{x})\Vert _{\psi _2}\lesssim 1\). Since \(\phi (\beta ) {:=}\beta \sqrt{\log (e/\beta )}\) defines a continuous and non-decreasing function on [0, 1] with \(\phi (1)=1\) and \(\phi (0){:=}0\) (by continuous extension), we may assume without loss of generality that

$$\begin{aligned} \beta \sqrt{\log (e/\beta )} = c_0 t \in (0,1]. \end{aligned}$$
(9.1)

Now, we set

$$\begin{aligned} \varDelta ^2 {:=}\frac{1}{t\sqrt{\log (e/\beta )}}, \quad m_0{:=}\big \lfloor \tfrac{\beta }{2} m\big \rfloor , \quad u_0 {:=}\sqrt{2 m \beta \log (e/\beta )}, \end{aligned}$$

implying that \(u_0 \ge \sqrt{2m_0\log (em/2m_0)}\), \(m_0\in \{0, 1, \dots , \lfloor \tfrac{m}{2}\rfloor \}\), and \(u_0^2\ge 2c_0t m\). The latter inequality again implies that \(u_0^2 \ge u^2\), and since \(u^2\ge C_0 \cdot \log \mathcal {N}(T\mathcal {X}, \varepsilon )\), the condition of (5.3) is fulfilled. Similarly, it is not hard to see that the condition of (5.2) follows from (5.4) for \(C'\) sufficiently large (cf. Sect. 8.1 (Step 2)).
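The first of these implications can be verified directly; a short sketch, using that \(x\mapsto x\log (\tfrac{em}{x})\) is non-decreasing on \((0, m]\) and that \(2m_0\le \beta m\le m\):

$$\begin{aligned} 2m_0\log \big (\tfrac{em}{2m_0}\big )\le \beta m\log \big (\tfrac{em}{\beta m}\big )=\beta m\log (e/\beta )=\tfrac{u_0^2}{2}\le u_0^2 \end{aligned}$$

(for \(m_0=0\), the claim is trivial). The same computation applies verbatim to the analogous choice in the proof of Corollary 9 below.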

According to Theorem 4 with \(H{:=}\sqrt{\tfrac{\pi }{2}}T\mathcal {X}\subset \mathbb {S}^{p-1}\), there exist universal constants \(c, {\bar{c}}, C > 0\) (possibly slightly different from those in Theorem 4) such that if

$$\begin{aligned} m\ge C\cdot \Big (\varepsilon ^2 \beta ^{-3} \cdot w_{}^2([T\mathcal {X}]_\varepsilon ) + \beta ^{-1} \cdot \log \mathcal {N}(T\mathcal {X}, \varepsilon )\Big ) \end{aligned}$$
(9.2)

for \(\varepsilon \le {\bar{c}}\beta / \sqrt{\log (e/\beta )}\), the following holds with probability at least \(1-\exp (-c m \beta )\):

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x},\, \varvec{x}'\in \mathcal {X}\\ \Vert T\varvec{x}-T\varvec{x}'\Vert _{2}\le \varepsilon \end{array}}\tfrac{1}{2m}\Vert {{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x})-{{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x}')\Vert _{1} \le \tfrac{\beta }{2}. \end{aligned}$$

On this event, the observation vectors \(\tilde{\varvec{y}}(\varvec{x})\) and \(\tilde{\varvec{y}}(\varvec{x}')\) of any two signals \(\varvec{x}, \varvec{x}'\in \mathcal {X}\) with \(\Vert T\varvec{x}-T\varvec{x}'\Vert _{2}\le \varepsilon \) differ in at most \(\tfrac{\beta }{2}m\) coordinates, hence in at most \(\lfloor \tfrac{\beta }{2}m\rfloor = m_0\) coordinates, each difference having magnitude 2. By the above choice of \(m_0\), we therefore have that

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x},\, \varvec{x}'\in \mathcal {X}\\ \Vert T\varvec{x}-T\varvec{x}'\Vert _{2}\le \varepsilon \end{array}}\tfrac{1}{\sqrt{m}}\sigma _{m_0}(\tilde{\varvec{y}}(\varvec{x}) - \tilde{\varvec{y}}(\varvec{x}'))_{2}=0 \end{aligned}$$

and

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x},\, \varvec{x}'\in \mathcal {X}\\ \Vert T\varvec{x}-T\varvec{x}'\Vert _{2}\le \varepsilon \end{array}}\tfrac{1}{\sqrt{m}}\Vert \tilde{\varvec{y}}(\varvec{x}) - \tilde{\varvec{y}}(\varvec{x}')\Vert _{[2m_0]}\le \sqrt{2\beta }= \Big (\tfrac{2c_0t}{\sqrt{\log (e/\beta )}}\Big )^{1/2}\le \tfrac{1}{2}\varDelta t, \end{aligned}$$

where the last inequality holds for \(c_0>0\) small enough. Consequently, Assumption 3 holds with \(\eta {:=}\exp (-c m \beta )\), provided that (9.2) is satisfied under the hypotheses of Corollary 8. To verify this, we first note that the relationship (9.1) implies that there exists a universal constant \(c_0'>0\) such that \(\beta \ge c_0't/\sqrt{\log (e/t)}\). In particular, this leads to the following estimates:

$$\begin{aligned} \beta \ge c_0' t^2 \qquad \text {and} \qquad \frac{\beta }{\sqrt{\log (e/\beta )}}\ge \frac{(c_0')^2t}{c_0\log (e/t)}. \end{aligned}$$

Combining the first one with (5.4), it follows that \(m \beta \ge c_0' m t^2 \gtrsim u^2\), while the second one yields

$$\begin{aligned} \varepsilon \le \frac{c't}{\log (e/t)} \le \frac{{\bar{c}}\beta }{\sqrt{\log (e/\beta )}} \end{aligned}$$

if \(c'\) is chosen sufficiently small. Hence, (9.2) is a consequence of (5.4) and the assumption \(u^2\ge C_0\cdot \log \mathcal {N}(T\mathcal {X}, \varepsilon )\).
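For completeness, here is one way to obtain the bound \(\beta \ge c_0' t/\sqrt{\log (e/t)}\) from (9.1); this is a sketch with unoptimized constants, under the additional assumptions \(c_0\le 1\) and \(t\le 1\). Abbreviating \(s{:=}c_0t\in (0,1]\) and using \(\log (e/s)\ge 1\), we have

$$\begin{aligned} \phi \Big (\tfrac{s}{2\sqrt{\log (e/s)}}\Big )=\tfrac{s}{2\sqrt{\log (e/s)}}\cdot \sqrt{\log \Big (\tfrac{2e\sqrt{\log (e/s)}}{s}\Big )}\le \tfrac{s}{2\sqrt{\log (e/s)}}\cdot 2\sqrt{\log (e/s)}=s=\phi (\beta ). \end{aligned}$$

Since \(\phi \) is in fact strictly increasing on (0, 1] (its derivative equals \((\log (e/\beta )-\tfrac{1}{2})/\sqrt{\log (e/\beta )}>0\)), this yields \(\beta \ge \tfrac{s}{2\sqrt{\log (e/s)}}=\tfrac{c_0t}{2\sqrt{\log (e/(c_0t))}}\), and the claim follows from \(\log (e/(c_0t))\le (1+\log (1/c_0))\cdot \log (e/t)\) with \(c_0'{:=}\tfrac{c_0}{2\sqrt{1+\log (1/c_0)}}\). The same reasoning applies to the relationship (9.5) in the proof of Corollary 9 below.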

Since all assumptions of Theorem 3 are satisfied, the following holds with probability at least \(1 - \exp (-c u^2)\) uniformly for every \(\varvec{x}\in \mathcal {X}\): For any input vector \(\varvec{y}\in \mathbb {R}^m\) such that

$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[2m_0]}\le \tfrac{1}{2}\varDelta t \quad \text { and } \quad \tfrac{1}{\sqrt{m}}\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2}\le \tfrac{t}{40}, \end{aligned}$$
(9.3)

every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- T\varvec{x}\Vert _{2} \le t + \varepsilon \le 2t\). The claim of Corollary 8 now follows from the fact that any input vector \(\varvec{y}\in \{-1,1\}^m\) given by (4.1) with \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \tfrac{\beta }{2}\) satisfies (9.3). \(\square \)

Proof of Corollary 9

Analogously to Corollary 3, the model setup of Corollary 9 fits into Assumption 1 as follows:

  • The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered, isotropic, and sub-Gaussian with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). The signal set \(\mathcal {X}\) satisfies \(\mathcal {X}\subset R \mathbb {B}_2^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x})={{\,\mathrm{sign}\,}}(\langle \varvec{a}, \varvec{x}\rangle +\tau )\), where \(\tau \) is uniformly distributed on \([-\lambda , \lambda ]\) and independent of \(\varvec{a}\). In particular, \(F\) is a random function. Moreover, the observation vector of \(\varvec{x}\) is given by \(\tilde{\varvec{y}}(\varvec{x}) = {{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x}+\varvec{\tau })\), where \(\varvec{A}\in \mathbb {R}^{m\times p}\) denotes the sub-Gaussian random matrix with row vectors \(\varvec{a}_1, \dots , \varvec{a}_m\in \mathbb {R}^p\) and \(\varvec{\tau }{:=}(\tau _1, \dots , \tau _m)\).

  • The target function \(T:\mathcal {X}\rightarrow K\) corresponds to rescaling by a factor of \(\lambda ^{-1}\), i.e., \(T\varvec{x}{:=}\lambda ^{-1}\varvec{x}\).

As already shown in Sect. 8.2 (Step 3), there exists a universal constant \({\tilde{C}}'\ge e\) such that if

$$\begin{aligned} \lambda \ge {\tilde{C}}' \cdot R\cdot L\cdot \sqrt{\log (e/t)}, \end{aligned}$$
(9.4)

the target mismatch satisfies \(\rho _{}(\varvec{x})\le \tfrac{t}{32}\) for every \(\varvec{x}\in \mathcal {X}\). Moreover, we clearly have that

$$\begin{aligned} \Vert \langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle - \tilde{y}(\varvec{x})\Vert _{\psi _2}\lesssim \lambda ^{-1} \Vert \langle \varvec{a}, \varvec{x}\rangle \Vert _{\psi _2} + 1 \lesssim RL\lambda ^{-1} +1 \end{aligned}$$

for every \(\varvec{x}\in \mathcal {X}\), and together with (9.4), it follows that \(r={\mathrm{sup}}_{\varvec{x}\in \mathcal {X}}\Vert \langle \varvec{a}, \lambda ^{-1}\varvec{x}\rangle - \tilde{y}(\varvec{x})\Vert _{\psi _2}\lesssim 1\). Since \(\phi (\beta ) {:=}\beta \sqrt{\log (e/\beta )}\) defines a continuous and non-decreasing function on [0, 1] with \(\phi (1)=1\) and \(\phi (0){:=}0\) (by continuous extension), we may assume without loss of generality that

$$\begin{aligned} \beta \sqrt{\log (e/\beta )}=c_0 L^{-2} t \in (0,1]. \end{aligned}$$
(9.5)

Now, we set

$$\begin{aligned} \varDelta ^2 {:=}\frac{1}{tL^2\sqrt{\log (e/\beta )}}, \quad m_0{:=}\big \lfloor \tfrac{\beta }{2} m\big \rfloor , \quad u_0 {:=}\sqrt{2 m \beta \log (e/\beta )}, \end{aligned}$$

implying that \(u_0 \ge \sqrt{2m_0\log (em/2m_0)}\), \(m_0\in \{0, 1, \dots , \lfloor \tfrac{m}{2}\rfloor \}\), and \(u_0^2\ge 2c_0mL^{-2} t\). Combining the latter inequality with (5.5), where \(C' \gtrsim L^2\) is sufficiently large, we obtain \(u_0^2 \ge u^2\), and since \(u^2\ge C_0 \cdot \log \mathcal {N}(\lambda ^{-1}\mathcal {X}, \varepsilon )\), the condition of (5.3) is fulfilled. Similarly, it is not hard to see that the condition of (5.2) follows from (5.5) for \(C' \gtrsim L^2 \log L\) sufficiently large (cf. Sect. 8.2 (Step 2)).

According to Theorem 5, there exist constants \(c, {\bar{c}}, C, {\tilde{C}} > 0\) only depending on \(L\) (possibly slightly different from those in Theorem 5) such that if \(\lambda \ge {\tilde{C}}\cdot R\) and

$$\begin{aligned} m \ge C \cdot \Big (\varepsilon ^2\beta ^{-3}\cdot w_{}^2([(2\lambda )^{-1}\mathcal {X}]_\varepsilon ) + \beta ^{-1} \cdot \log \mathcal {N}(\lambda ^{-1}\mathcal {X}, \varepsilon )\Big ) \end{aligned}$$
(9.6)

for \(\varepsilon \le {\bar{c}}\beta / \sqrt{\log (e/\beta )}\), the following holds with probability at least \(1-\exp (-c m\beta )\):

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x},\, \varvec{x}'\in \mathcal {X}\\ \lambda ^{-1} \Vert \varvec{x}-\varvec{x}'\Vert _{2}\le \varepsilon \end{array}}\tfrac{1}{2m}\Vert {{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x}+\varvec{\tau })-{{\,\mathrm{sign}\,}}(\varvec{A}\varvec{x}'+\varvec{\tau })\Vert _{1}\le \tfrac{\beta }{2}. \end{aligned}$$

On this event, the observation vectors \(\tilde{\varvec{y}}(\varvec{x})\) and \(\tilde{\varvec{y}}(\varvec{x}')\) of any two signals \(\varvec{x}, \varvec{x}'\in \mathcal {X}\) with \(\lambda ^{-1}\Vert \varvec{x}-\varvec{x}'\Vert _{2}\le \varepsilon \) differ in at most \(\tfrac{\beta }{2}m\) coordinates, hence in at most \(\lfloor \tfrac{\beta }{2}m\rfloor = m_0\) coordinates, each difference having magnitude 2. By the above choice of \(m_0\), we therefore have that

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x},\, \varvec{x}'\in \mathcal {X}\\ \lambda ^{-1}\Vert \varvec{x}-\varvec{x}'\Vert _{2}\le \varepsilon \end{array}}\tfrac{1}{\sqrt{m}}\sigma _{m_0}(\tilde{\varvec{y}}(\varvec{x}) - \tilde{\varvec{y}}(\varvec{x}'))_{2}=0 \end{aligned}$$

and

$$\begin{aligned} \mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x},\, \varvec{x}'\in \mathcal {X}\\ \lambda ^{-1}\Vert \varvec{x}-\varvec{x}'\Vert _{2}\le \varepsilon \end{array}} \tfrac{1}{\sqrt{m}}\Vert \tilde{\varvec{y}}(\varvec{x}) - \tilde{\varvec{y}}(\varvec{x}')\Vert _{[2m_0]} \le \sqrt{2\beta }= \Big (\tfrac{2c_0t}{L^2\sqrt{\log (e/\beta )}}\Big )^{1/2}\le \tfrac{1}{2}\varDelta t, \end{aligned}$$

where the last inequality holds for \(c_0>0\) small enough. Consequently, Assumption 3 holds with \(\eta {:=}\exp (-c m \beta )\), provided that (9.6) is satisfied under the hypotheses of Corollary 9. To verify this, we first note that the relationship (9.5) implies that there exists a universal constant \(c_0'>0\) such that \(\beta \ge c_0'tL^{-2}/\sqrt{\log (eL^2/t)}\). In particular, this leads to the following estimates:

$$\begin{aligned} \beta \ge c_0' t^2 L^{-2} \qquad \text {and} \qquad \frac{\beta }{\sqrt{\log (e/\beta )}}\ge \frac{(c_0')^2t}{c_0L^2\log (eL^2/t)}. \end{aligned}$$

Combining the first one with (5.5), it follows that \(m \beta \ge c_0' m t^2 L^{-2} \gtrsim u^2\) for \(C' \gtrsim L^2\) sufficiently large, while the second one yields

$$\begin{aligned} \varepsilon \le \frac{c't}{\log (e/t)} \le \frac{{\bar{c}}\beta }{\sqrt{\log (e/\beta )}} \end{aligned}$$

if \(c'\) is chosen sufficiently small (depending on \(L\)). Hence, (9.6) is a consequence of (5.5) and the assumption \(u^2\ge C_0\cdot \log \mathcal {N}(\lambda ^{-1}\mathcal {X}, \varepsilon )\).

Since all assumptions of Theorem 3 are satisfied, the following holds with probability at least \(1 - \exp (-c u^2)\) uniformly for every \(\varvec{x}\in \mathcal {X}\): For any input vector \(\varvec{y}\in \mathbb {R}^m\) such that

$$\begin{aligned} \tfrac{1}{\sqrt{m}}\Vert \varvec{y}- \tilde{\varvec{y}}(\varvec{x})\Vert _{[2m_0]}\le \tfrac{1}{2}\varDelta t \quad \text { and } \quad \tfrac{1}{\sqrt{m}}\sigma _{m_0}(\varvec{y}- \tilde{\varvec{y}}(\varvec{x}))_{2}\le \tfrac{t}{40}, \end{aligned}$$
(9.7)

every minimizer \(\hat{\varvec{z}}\) of (\({\mathsf {P}}_{K,\varvec{y}}\)) satisfies \(\Vert \hat{\varvec{z}}- \lambda ^{-1}\varvec{x}\Vert _{2} \le t + \varepsilon \le 2t\). The claim of Corollary 9 now follows from the fact that any input vector \(\varvec{y}\in \{-1,1\}^m\) given by (4.3) with \(\tfrac{1}{2m}\Vert \varvec{\nu }\Vert _{1} \le \tfrac{\beta }{2}\) satisfies (9.7). \(\square \)

10 Proofs for Sect. 3

Proof of Corollary 1

We follow the proof template from the beginning of Sect. 8:

  • Step 1a. The model setup of Corollary 1 fits into Assumption 1 as follows:

    • The measurement vector \(\varvec{a}\in \mathbb {R}^p\) is centered, isotropic, and sub-Gaussian with \(\Vert \varvec{a}\Vert _{\psi _2} \le L\). The signal set \(\mathcal {X}\) is a bounded subset of \(\mathbb {R}^p\). The output function \(F:\mathbb {R}^p \times \mathcal {X}\rightarrow \mathbb {R}\) takes the form \(F(\varvec{a}, \varvec{x}) {:=}\langle \varvec{a}, \varvec{x}\rangle + \tau \), where \(\tau \sim \mathsf {N}(0, \sigma ^2)\).

    • The target function \(T:\mathcal {X}\rightarrow K\) is the canonical embedding into \(K\), i.e., \(T\varvec{x}{:=}\varvec{x}\). In particular, we have that \(d_T(\varvec{x},\varvec{x}')= \Vert \varvec{x}- \varvec{x}'\Vert _{2}\).

  • Step 1b. The target mismatch \(\rho _{}(\varvec{x})\) vanishes for every \(\varvec{x}\in \mathcal {X}\).

  • Step 1c. For the trivial choice \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\), the conditions of Assumption 2 are fulfilled with

    $$\begin{aligned} L_{t}= 0,\quad \hat{L}_{t}= 0,\quad r\lesssim \sigma , \quad \hat{r}= 0. \end{aligned}$$
  • Step 2. Setting \(m_0{:=}0\), \(\varDelta {:=}L^{-1} \sqrt{\log L}\), and \(u{:=}u_0\), the claim of Corollary 1 follows directly from Theorem 2.

  • Step 3. Let \(\varvec{x}\in \mathcal {X}\). By the isotropy of \(\varvec{a}\), we have that \(\mathbb {E}[\tilde{y}(\varvec{x})\varvec{a}] = \mathbb {E}[\langle \varvec{a}, \varvec{x}\rangle \varvec{a}]=\varvec{x}= T\varvec{x}\), and therefore \(\rho _{}(\varvec{x})=0\).

  • Step 4. We simply set \(\tilde{y}_t(\varvec{x}){:=}\tilde{y}(\varvec{x})\). Then, \(\varepsilon _t(\varvec{x}) = 0\), implying that Assumption 2(a) and (c) are trivially fulfilled with \(\hat{L}_{t}=\hat{r}=0\). Furthermore, since \(\xi _t(\varvec{x})= \langle \varvec{a}, \varvec{x}\rangle -\tilde{y}_t(\varvec{x})= -\tau \), Assumption 2(b) holds with \(L_{t}=0\) and \(r{:=}\Vert \tau \Vert _{\psi _2} \lesssim \sigma \), since \(\tau \sim \mathsf {N}(0, \sigma ^2)\).

\(\square \)

Proof of Inequality (3.3)

Since \(\Vert \varvec{x}\Vert _{1} = R\) for every \(\varvec{x}\in \mathcal {X}\), we may use [54, Lem. 4.5] to observe that

$$\begin{aligned} {\text {cone}}(R\mathbb {B}_1^p - \varvec{x}) \cap \mathbb {S}^{p-1} \subset H{:=}2 {\text {conv}}\big (\{\varvec{v}\in \mathbb {S}^{p-1} \ :\ \Vert \varvec{v}\Vert _{0} \le s \}\big ), \end{aligned}$$

where \({\text {conv}}(\cdot )\) denotes the convex hull. Moreover, [49, Lem. 2.3] yields \(w_{}(H) \lesssim \sqrt{s \cdot \log (2p/s)}\). Therefore, we obtain

$$\begin{aligned} w_{t}(K- \mathcal {X})&\le w_{0}(K- \mathcal {X}) \\&= w_{}({\text {cone}}(K- \mathcal {X}) \cap \mathbb {S}^{p-1}) \\&= w_{}(\textstyle \bigcup _{\varvec{x}\in \mathcal {X}}{\text {cone}}(R\mathbb {B}_1^p - \varvec{x}) \cap \mathbb {S}^{p-1}) \\&\le w_{}(H) \lesssim \sqrt{s \cdot \log (2p/s)}, \end{aligned}$$

as claimed in (3.3). \(\square \)

Proof of Proposition 1

We begin by writing out the definition of the local mean width as follows:

$$\begin{aligned} w_{t}(K- \mathcal {X}) \le \tfrac{1}{t} \cdot w_{}((K- \mathcal {X}) \cap t \mathbb {B}_2^p) = \tfrac{1}{t} \cdot \mathbb {E}\Big [\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{z}\in K, \varvec{x}\in \mathcal {X}\\ \Vert \varvec{z}- \varvec{x}\Vert _{2} \le t \end{array}} \langle \varvec{g}, \varvec{z}- \varvec{x}\rangle \Big ], \end{aligned}$$

where \(\varvec{g}\sim \mathsf {N}(\varvec{0}, \varvec{I}_{p})\). Now, let \(\mathcal {X}_t\) be a minimal \(t\)-net for \(\mathcal {X}\); in particular, we have that \(|\mathcal {X}_t| = \mathcal {N}(\mathcal {X}, t)\). Hence, for every \(\varvec{z}\in K\) and \(\varvec{x}\in \mathcal {X}\) with \(\Vert \varvec{z}- \varvec{x}\Vert _{2} \le t\), there exists \(\varvec{v}_{\varvec{x}} \in \mathcal {X}_t\) such that \(\Vert \varvec{v}_{\varvec{x}} - \varvec{x}\Vert _{2} \le t\) and, by the triangle inequality, \(\Vert \varvec{v}_{\varvec{x}} - \varvec{z}\Vert _{2} \le 2t\). Therefore, we obtain

$$\begin{aligned} \mathbb {E}\Big [\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{z}\in K, \varvec{x}\in \mathcal {X}\\ \Vert \varvec{z}- \varvec{x}\Vert _{2} \le t \end{array}} \langle \varvec{g}, \varvec{z}- \varvec{x}\rangle \Big ]&= \mathbb {E}\Big [\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{z}\in K, \varvec{x}\in \mathcal {X}\\ \Vert \varvec{z}- \varvec{x}\Vert _{2} \le t \end{array}} \langle \varvec{g}, \varvec{z}- \varvec{v}_{\varvec{x}}\rangle + \langle \varvec{g}, \varvec{v}_{\varvec{x}} - \varvec{x}\rangle \Big ] \\&\le \mathbb {E}\Big [\mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{z}\in K, \varvec{v}\in \mathcal {X}_t \\ \Vert \varvec{z}- \varvec{v}\Vert _{2} \le 2t \end{array}} \langle \varvec{g}, \varvec{z}- \varvec{v}\rangle + \mathop {{\mathrm{sup}}}\limits _{\begin{array}{c} \varvec{x}\in \mathcal {X},\varvec{v}\in \mathcal {X}_t \\ \Vert \varvec{x}- \varvec{v}\Vert _{2} \le t \end{array}} \langle \varvec{g}, \varvec{v}- \varvec{x}\rangle \Big ] \\&= w_{}((K- \mathcal {X}_t) \cap 2t \mathbb {B}_2^p) + w_{}((\mathcal {X}_t - \mathcal {X}) \cap t \mathbb {B}_2^p) \\&= w_{}({\textstyle \bigcup _{\varvec{v}\in \mathcal {X}_t} (K- \varvec{v}) \cap 2t \mathbb {B}_2^p}) + \underbrace{w_{}((\mathcal {X}_t - \mathcal {X}) \cap t \mathbb {B}_2^p)}_{\le w_{}(\mathcal {X}_t - \mathcal {X}) \le 2 \cdot w_{}(\mathcal {X})}. \end{aligned}$$

Next, we make use of the fact that \(\mathcal {X}_t\) is a finite set. Indeed, an application of [15, Lem. 6.1] allows us to “pull out” the union over \(\mathcal {X}_t\) in the following way:

$$\begin{aligned} w_{}({\textstyle \bigcup _{\varvec{v}\in \mathcal {X}_t} (K- \varvec{v}) \cap 2t \mathbb {B}_2^p}) \lesssim \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in \mathcal {X}_t} \ w_{}((K- \varvec{v}) \cap 2t \mathbb {B}_2^p) + t \cdot \sqrt{\log \mathcal {N}(\mathcal {X}, t)}. \end{aligned}$$

By Sudakov minoration (e.g., see [64, Thm. 7.4.1]), we have that \(t \cdot \sqrt{\log \mathcal {N}(\mathcal {X}, t)} \lesssim w_{}(\mathcal {X})\). Hence, overall, we have that

$$\begin{aligned} w_{t}(K- \mathcal {X})\lesssim & {} \mathop {{\mathrm{sup}}}\limits _{\varvec{v}\in \mathcal {X}_t} \ \underbrace{w_{}(\tfrac{1}{2t}(K- \varvec{v}) \cap \mathbb {B}_2^p)}_{= \tilde{w}_{t}(K- \varvec{v})} {}+{} t^{-1} \cdot w_{}(\mathcal {X}) \\\lesssim & {} \mathop {{\mathrm{sup}}}\limits _{\varvec{x}\in \mathcal {X}} \ \tilde{w}_{t}(K- \varvec{x}) + t^{-1} \cdot w_{}(\mathcal {X}). \end{aligned}$$
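To make the bookkeeping in this final chain explicit, one divides the estimate obtained above by \(t\) (cf. the first display of this proof) and uses the positive homogeneity of the mean width, \(w_{}(c\,\mathcal {A})=c\cdot w_{}(\mathcal {A})\) for \(c>0\): for every \(\varvec{v}\in \mathcal {X}_t\),

$$\begin{aligned} \tfrac{1}{t}\cdot w_{}((K- \varvec{v}) \cap 2t \mathbb {B}_2^p) = 2\cdot w_{}\big (\tfrac{1}{2t}(K- \varvec{v}) \cap \mathbb {B}_2^p\big ) = 2\cdot \tilde{w}_{t}(K- \varvec{v}), \end{aligned}$$

while the remaining terms are controlled by \(\tfrac{1}{t}\cdot \big (t \cdot \sqrt{\log \mathcal {N}(\mathcal {X}, t)} + 2\cdot w_{}(\mathcal {X})\big )\lesssim t^{-1}\cdot w_{}(\mathcal {X})\), using Sudakov’s bound from the previous step.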

The “moreover”-part of Proposition 1 follows directly from a stability bound derived in [23, Prop. 2.6]. \(\square \)