
1 Introduction

1.1 Motivation

When dealing with high-dimensional data, a general principle is that the curse of dimensionality can be efficiently fought if one assumes the data points to lie on a structure of smaller intrinsic dimensionality, typically a manifold. Some well-known methods to discover such a lower dimensional structure include Isomap (Tenenbaum et al. 2000), LLE (Roweis and Saul 2000), and Laplacian Eigenmaps (Belkin and Niyogi 2003).

In this work, our main interest is not in visualizing the underlying structure of the observed data points, nor in representing it by an explicit mapping; rather, we want to represent or estimate efficiently a real-valued function on these points. More specifically, we focus on the following denoising problem: assuming we observe a noisy version of the function f, \(y_i = f(x_i) + \varepsilon_i\), at the points \((x_1, \ldots, x_n)\), we would like to recover the values of f at these points. An important step for solving this problem is to find a dictionary of functions to represent the signal f which is adapted to the structure of the data. Ideally, we would like this dictionary to exhibit the features of a wavelet basis. In traditional signal processing on a flat space, with data points on a regular grid, orthogonal wavelet bases offer a very powerful tool to sparsely represent signals with inhomogeneous regularity (such as a signal that is very smooth everywhere except at a few singular points where it is discontinuous). Such bases are in particular well suited to the denoising task. Can this be generalized to irregularly scattered data on a manifold?

We present such a method: we construct a so-called Parseval frame of functions exhibiting wavelet-like properties while adapting to the intrinsic geometry of the data, and we use this dictionary for the denoising task via a simple coefficient thresholding method.

This work is organized as follows. In the next section, we discuss the relationship to previous work on which the present chapter builds, and point out our new contributions. In Sect. 20.2, we recall important notions of frame theory as well as of neighborhood graphs needed for our construction. The construction of the frame and its properties are presented in Sect. 20.3. In Sect. 20.4, we develop a coefficient thresholding strategy for the denoising problem. In Sect. 20.5, we present numerical results and a method comparison on testbed data.

1.2 Relation to Previous Work

Regression methods that adapt to an underlying lower-dimensional structure of the data have been considered by Bickel and Li (2007), Kpotufe and Dasgupta (2012), and Kpotufe (2011), using local polynomial estimates, random projection trees, and nearest neighbors, respectively. However, these methods are not constructed to adapt to an inhomogeneous regularity of the target function: in all three cases, the smoothing scale (determined by the smoothing kernel bandwidth, the tree partition’s average data diameter, or the number of neighbors, respectively) is fixed globally. In the experimental Sect. 20.5, for data lying on a smooth manifold but a target function exhibiting a sharp discontinuity, we demonstrate the advantage of our method over kernel smoothing.

Based on motivations similar to ours, a method for constructing a wavelet-like basis on scattered data was proposed by Gavish et al. (2010). It is based on a hierarchical tree partition of the data, on which a Haar-like basis of 0–1 functions is constructed. However, the performance of that method is then tied to the geometry of the tree, in the sense that the distance between two points is measured through tree path distance. This can strongly distort the original distance: two points that are close in the original distance can end up in well-separated subtrees.

The construction proposed here, based on a transform of the spectral decomposition of the graph Laplacian, follows closely the ideas of Hammond et al. (2011). Two important contributions brought forth in the present work are that we construct a Parseval (or tight) frame, rather than a general frame; and we consider an explicit thresholding method for the denoising problem. The former point is crucial to obtain sharp bounds for the thresholding method, and also eliminates the computational problem of signal reconstruction from the frame coefficients, since Parseval frames enjoy a reconstruction formula similar to that of an orthonormal basis. The choice of multiscale bandpass filter functions leading to the tight frame is inspired by the recent work of Coulhon et al. (2012), where the spectral decomposition principle is also studied, albeit in the setting of a quite general metric space.

2 Notation and Basics

2.1 Setting

We consider a sample of n points \(x_i\in \mathbb {R}^d\). These points are assumed to belong to an unknown low-dimensional submanifold \(\mathscr {M}\subset \mathbb {R}^d\). We denote the design by \(\mathfrak {D}=\{{x_{1},\ldots ,x_{n}}\}\subset \mathscr {M}\). Furthermore, we observe on these points the (noisy) value of a function \(f: \mathfrak {D} \rightarrow \mathbb {R}\). Since \(\mathfrak {D}\) is finite, we can represent the function f as a vector \(f=(f(x_1),\ldots ,f(x_n))^t\in \mathbb {R}^n\). The space of all (square-integrable) functions f defined on \(\mathfrak {D}\) is denoted \(L^2(\mathfrak {D})\) and endowed with the usual Euclidean inner product.

We denote by \(y_i = f(x_i) + \epsilon_i\) the noisy observation of f at \(x_i\), where the \(\epsilon_i\) are independent, identically distributed, centered random variables. The problem we consider in this work is that of denoising, that is, recovering the underlying values of the function f at the points \(x_i\).

While the existence of a low-dimensional supporting manifold \(\mathscr {M}\) for the design points motivates the construction of the proposed method, we underline (again) that \(\mathscr {M}\) is not known to the user and the method only uses the knowledge of the design points. In such a setting, a key idea to recover implicitly some information on the geometry of \(\mathscr {M}\) is to construct a neighborhood graph based on the design points (see Sect. 20.2.3 for details).

2.2 Frames

For the construction in Sect. 20.3, we rely on the notion of a vector frame, for which we recall here some important properties (see, e.g., Casazza et al. 2013; Han 2007; Christensen 2008). A frame is an overcomplete dictionary with particular properties allowing it to act almost as a basis.

Definition 1

Let \(\mathscr {H}\) be a Hilbert space. Then a countable set \(\{z_i\}_{i\in I}\subset \mathscr {H}\) is a frame with frame bounds A and B for \(\mathscr {H}\) if there exist constants \(0 < A \leq B < \infty\) such that

$$\displaystyle \begin{aligned} \forall z\in\mathscr{H}:~\; A \left\lVert z \right\rVert^2 \leq \sum_{i\in I}\left\lvert \left\langle z,z_i \right\rangle \right\rvert^2\leq B \left\lVert z \right\rVert^2. \end{aligned} $$
(20.1)

A frame is called tight if A = B; in particular, the frame is called Parseval if A = B = 1.

In the remainder of this work we consider the case of a Euclidean space \(\mathscr {H}=\mathbb {R}^n\), and assume that \(\{z_i\}_{i\in I}\) is a frame with a finite number of elements. Two important operators associated to the frame are the analysis operator

$$\displaystyle \begin{aligned}T:\mathbb{R}^n \rightarrow \mathbb{R}^I,~ Tz:=(\left\langle z,z_i \right\rangle)_{i\in I}\end{aligned} $$
(20.2)

(sequence of frame coefficients), and its adjoint the synthesis operator:

$$\displaystyle \begin{aligned}T^*: \mathbb{R}^I\rightarrow \mathbb{R}^n,~ T^*a=T^* (a_i)_{i\in I}^t=\sum_{i\in I} a_i z_i.\end{aligned} $$
(20.3)

Further, the frame operator is defined as \(S = T^*T\):

$$\displaystyle \begin{aligned}S:\mathbb{R}^n \rightarrow \mathbb{R}^n, ~Sz=T^*Tz=\sum_{i\in I}\left\langle z,z_i \right\rangle z_i,\end{aligned} $$
(20.4)

and finally the Gramian operator as \(U = TT^*\),

$$\displaystyle \begin{aligned}U: \mathbb{R}^I\rightarrow \mathbb{R}^I,~ Ua=TT^*a= \left\{\left\langle \sum_{i\in I} a_i z_i,z_k \right\rangle\right\}_{k\in I}.\end{aligned} $$
(20.5)

In matrix form, the columns of \(T^*\) are the vectors \(z_i\), \(i \in I\); T is its transpose; and \(U_{ij} = \left \langle z_i,z_j \right \rangle \).

The definition of a frame implies that S is invertible, and it is possible to reconstruct any z from its frame coefficients by \(z=\sum _{i\in I}\left \langle z,z_i \right \rangle z^*_i=\sum _{i\in I}\left \langle z,z^*_i \right \rangle z_i\), where \(z^*_i:=S^{-1}z_i\), \(i\in I\), is called the canonical dual frame of \((z_i)_{i\in I}\).

We recall some properties of finite Parseval frames over Euclidean spaces (see, e.g., Han 2007, chapter 3).

Theorem 1 (Properties of Parseval frames)

Let \(\mathscr {H}\) be a Hilbert space with \(\dim \mathscr {H}=n< \infty \) . The following statements are equivalent:

  1. \(\{z_i\}_{1\leq i \leq k}\subset \mathscr {H}\) is a Parseval frame.

  2. \(\forall y\in \mathscr {H}: ~~y=\sum _{i=1}^{k}\left \langle \,y,z_i \right \rangle z_i\).

  3. The frame operator S is the identity on \(\mathbb {R}^n\).

  4. The Gramian operator U is an orthogonal projector of rank n in \(\mathbb {R}^k\).

Furthermore, if \(\{z_i\}_{1\leq i\leq k}\subset \mathscr {H}\) is a Parseval frame, then

  • \(\left \lVert z_i \right \rVert \leq 1\) for i ∈{1, …, k};

  • \(\dim \mathscr {H}=n=\sum _{i=1}^{k}\left \lVert z_i \right \rVert ^2;\)

  • the canonical dual frame is the frame itself.

For the present work, the two most important points of this theory are the following: first, the reconstruction formula (point two above), where we see that a Parseval frame acts similarly to an orthonormal basis; secondly, if we construct a vector \(v = T^*a = \sum_i a_i z_i\) from an arbitrary vector of coefficients \((a_i)\), then

$$\displaystyle \begin{aligned} \left\lVert \sum_i a_i z_i \right\rVert^2 = \left\langle T^*a,T^*a\right\rangle = \left\langle a,Ua\right\rangle = \left\lVert Ua \right\rVert^2 \leq \left\lVert a \right\rVert^2, \end{aligned} $$
(20.6)

which follows from property 4 above.
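To make these properties concrete, the following minimal sketch checks the statements of Theorem 1 and inequality (20.6) numerically on a classical toy example, the three-element “Mercedes-Benz” frame in \(\mathbb{R}^2\) (the example and all variable names are ours, for illustration only):

```python
import numpy as np

# Mercedes-Benz frame: k = 3 equiangular vectors in R^2, rescaled by
# sqrt(n/k) = sqrt(2/3) so that the resulting frame is Parseval.
angles = 2 * np.pi * np.arange(3) / 3
Z = np.sqrt(2 / 3) * np.stack([np.cos(angles), np.sin(angles)])  # column i is z_i

S = Z @ Z.T                        # frame operator S = T*T  (n x n)
U = Z.T @ Z                        # Gramian operator U = TT* (k x k)

assert np.allclose(S, np.eye(2))   # property 3: S is the identity
assert np.allclose(U @ U, U)       # property 4: U is a projector ...
assert np.isclose(np.trace(U), 2)  # ... of rank n = 2 = sum_i ||z_i||^2

# property 2 (reconstruction formula): y = sum_i <y, z_i> z_i
y = np.array([0.3, -1.2])
assert np.allclose(Z @ (Z.T @ y), y)

# inequality (20.6): ||T* a||^2 <= ||a||^2 for arbitrary coefficients a
a = np.array([1.0, 2.0, -0.5])
assert np.linalg.norm(Z @ a)**2 <= np.linalg.norm(a)**2 + 1e-12
```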

2.3 Neighborhood Graphs

In order to exploit the structure and geometry of the unknown submanifold \(\mathscr {M}\) on which the sample \(\mathfrak {D}\) is supposed to lie, a powerful idea is to use a graph-based representation of the data \(\mathfrak {D}\) through a neighborhood graph. The points in \(\mathfrak {D}\) correspond to the vertices of the graph, and two vertices of the graph are joined by an edge when the two corresponding points are neighbors (in some appropriate sense) in \(\mathbb {R}^d\). The underlying idea is that the local geometry of \(\mathbb {R}^d\) is reflected in the local connectivity of the graph, while the long-range geometry of the graph reflects the geometrical properties of the manifold \(\mathscr {M}\), rather than those of \(\mathbb {R}^d\).

Formally, a finite graph G = (V, E) is given by a finite set of vertices V and a set of edges E ⊂ V × V. The \(|V|\times|V|\) adjacency matrix A of the graph is defined by \(A_{i,j} = 1\) if \((v_i, v_j) \in E\) and \(A_{i,j} = 0\) otherwise. An undirected graph is one whose adjacency matrix is symmetric.

The graph is called weighted if every edge e ∈ E has a positive weight \(w(e)\in \mathbb {R}_{+}\). In this case the notion of adjacency matrix is extended to \(A_{i,j} = w((v_i, v_j))\) if \((v_i, v_j) \in E\) and \(A_{i,j} = 0\) otherwise. The degree of a vertex \(v_i\) in a (possibly weighted) graph is defined as \(d_i=d(i)=\sum _{j=1}^{|V|}A_{i,j}\).

As announced, we focus on geometric graphs, which (can) approximate the structure of the unknown \(\mathscr {M}\). Each point x i is represented by a vertex, say v i. An edge between two vertices represents a small distance, or a high similarity, of the two associated points. The weight of an edge can quantify the similarity more finely.

We use the Euclidean distance \(d(x_i,x_j)=\left \lVert x_i-x_j \right \rVert .\) We recall three usual ways to construct the edges of a neighborhood graph:

  • (undirected) k-nearest-neighbor graph: an undirected edge connects the two vertices \(v_i\) and \(v_j\) iff \(x_i\) belongs to the k nearest neighbors of \(x_j\), or \(x_j\) belongs to the k nearest neighbors of \(x_i\) (“the k-NN graph”).

  • \(\epsilon\)-graph: an undirected edge connects two vertices \(v_i\) and \(v_j\) iff \(d(x_i, x_j) \leq \epsilon\).

  • complete weighted neighborhood graph: for each pair of vertices there exists an undirected edge with a weight depending on the distance/similarity of the two vertices.

A k-NN graph or an \(\epsilon\)-graph can be made weighted by additionally assigning weights to the edges depending on \(d(x_i, x_j)\), for instance by choosing Gaussian weights \(w(\left \{i,j\right \})= \exp ( - d^2(x_i,x_j)/2\lambda ^2)\).
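For concreteness, here is a minimal sketch of the k-NN construction with optional Gaussian weights; the helper name `knn_graph` and its defaults are illustrative choices of ours, not prescribed above:

```python
import numpy as np

def knn_graph(X, k=7, weighted=True, lam=None):
    """Symmetric k-NN adjacency matrix for points X of shape (n, d).

    An edge {i, j} is present if i is among the k nearest neighbors of j
    or vice versa (the "or" rule above); weights, if requested, are
    Gaussian in the pairwise Euclidean distance.
    """
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    nn = np.argsort(D, axis=1)[:, 1:k + 1]   # k nearest neighbors (self excluded)
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    A = np.maximum(A, A.T)                   # symmetrize: "or" rule
    if weighted:
        if lam is None:
            lam = np.median(D[D > 0])        # crude default bandwidth
        A *= np.exp(-D**2 / (2 * lam**2))    # Gaussian edge weights
    return A
```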

2.4 Spectral Graph Theory

If one considers real-valued functions \(f:\mathscr {M}\rightarrow \mathbb {R}\) defined on a submanifold \(\mathscr {M}\subset \mathbb {R}^d\), it is known that under some regularity assumptions on the submanifold \(\mathscr {M}\), the eigenfunctions of the Laplace-Beltrami operator give a basis of the space of square-integrable functions on \(\mathscr {M}\). Since \(\mathscr {M}\) is unknown in our setting, the principle of the Laplacian Eigenmaps method (Belkin and Niyogi 2003) is to use a discrete analogue, namely the graph Laplace operator L on a neighborhood graph.

Given a finite weighted undirected graph with adjacency matrix A (n × n) and vertex degrees \((d_i)_i\), as introduced in the previous section, we will use either the unnormalized graph Laplace operator \(L^{u}\) or the normalized (symmetric) graph Laplace operator \(L^{\mathrm{norm}}\), defined by

$$\displaystyle \begin{aligned} L^{u} &= D-A, \\ L^{\mathrm{norm}} &= {\mathbf{I}}_{n}-D^{-1/2}AD^{-1/2}, \end{aligned} $$
(20.7)

where \(D={\mathrm {diag}\left (d_1,\ldots , d_n\right )}\) is the diagonal matrix with entries \(d_i\) on the diagonal. By construction \(L^{u}\) and \(L^{\mathrm{norm}}\) are symmetric matrices. The positive semidefiniteness follows from

$$\displaystyle \begin{aligned}f^t L^{u}f=\frac{1}{2} \sum_{(i,j)}A_{i,j} (f_i-f_j)^2 \quad\text{and}\quad f^t L^{\mathrm{norm}}f=\frac{1}{2} \sum_{(i,j)}A_{i,j} \Big(\frac{f_i}{\sqrt{d_i}}-\frac{f_j}{\sqrt{d_j}}\Big)^2,\end{aligned} $$

respectively. The spectral theorem for matrices indicates that the normalized eigenvectors Φ i of the graph Laplace operator L (L u resp. L norm) form an orthonormal basis of \(\mathbb {R}^n\) and all eigenvalues are nonnegative. Furthermore the number of components of the graph is given by the number of eigenvalues equal to 0.

3 Construction and Properties

3.1 Construction of a Tight Graph Frame

As discussed earlier, the principle of Laplacian Eigenmaps is to use the basis \((\varPhi_i)_{1\leq i\leq n}\) to represent and process the data. An important advantage of this basis as compared with the natural basis of \(\mathbb {R}^n\) is that it is adapted to the geometry of the underlying submanifold \(\mathscr {M}\) supporting the data distribution. For instance, in the denoising problem, a reasonable estimator of f could be a truncated expansion of the noisy vector of observations Y in the basis \((\varPhi_i)_{1\leq i\leq n}\).

On the other hand, a disadvantage of this basis is that it is not spatially localized. To get an intuitive view, consider the simple case of the interval [0, 1] with uniformly distributed data. In the population view, the eigenbasis of the Laplacian is the Fourier basis. While a truncated expansion in this basis is well-adapted to represent functions that are uniformly regular, it is not well-suited for functions exhibiting locally varying regularity (as an extreme example, a signal that is very smooth everywhere except at a few singular points where it is discontinuous). By contrast, wavelet bases, because they are localized both in space and frequency, allow for an efficient (i.e., sparse) representation of signals with locally varying regularity.

If we now think of data supported on a one-dimensional submanifold (curve) of \(\mathbb {R}^d\), we can expect that the Laplacian eigenmaps method will discover a warped Fourier basis following the curve; and, for a more general submanifold \(\mathscr {M}\), “harmonics” on \(\mathscr {M}\).

In order to go from this basis to a spatially localized dictionary, following ideas of Coulhon et al. (2012) and Hammond et al. (2011), we use the principle of the Littlewood-Paley decomposition.

Let G be an undirected geometric neighborhood graph with adjacency matrix A constructed from \(\mathfrak {D}\), and let L be an associated symmetric graph Laplace operator with increasing eigenvalues \(0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n\) and normalized eigenvectors \(\varPhi _i\in \mathbb {R}^n\), \(i=1,\ldots,n\).

We first define a set of vectors using a decomposition of unity and a splitting operation and we will show that this vector set is a Parseval frame.

Definition 2

Let \(\{\zeta _k\}_{k\in \mathbb {N}}\) be a sequence of functions \(\zeta _k: \mathbb {R}_{+}\rightarrow [0,1]\) satisfying

(DoU):

\(\sum_{j\geq 0} \zeta_j(x) = 1\) for all x ≥ 0;

(FD):

\(\#\{\zeta_k : \zeta_k(\lambda_i) \neq 0\} < \infty\) for i = 1, …, n.

Then we define the set of column vectors \(\{\varPsi _{kl}\in \mathbb {R}^n , 0\leq k \leq Q, 1 \leq l \leq n\}\) by

$$\displaystyle \begin{aligned} \varPsi_{kl}=\sum_{i=1}^{n}\sqrt{\zeta_k(\lambda_i)}\varPhi_i(x_l) \varPhi_i, \end{aligned} $$
(20.8)

with \(Q:=\max \{k: \exists i \in \left \{1,\ldots ,n\right \} \text{with}\, \zeta _k(\lambda _i)>0\}\).
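In matrix form, the frame vectors at a fixed scale k are the columns of the symmetric matrix \(\varPhi \,\mathrm{diag}\big(\sqrt{\zeta_k(\lambda_1)},\ldots,\sqrt{\zeta_k(\lambda_n)}\big)\, \varPhi^t\), which suggests the following sketch (`build_frame` is an illustrative name of ours; any filters satisfying (DoU) and (FD) may be passed in):

```python
import numpy as np

def build_frame(L, zetas):
    """Frame vectors Psi_{kl} of (20.8) from a symmetric graph Laplacian L.

    `zetas` is a list of filter functions zeta_k; the result is a list of
    n x n arrays, one per scale k, whose l-th column is Psi_{kl}.
    """
    lam, Phi = np.linalg.eigh(L)        # ascending eigenvalues, orthonormal Phi
    frames = []
    for zeta in zetas:
        w = np.sqrt(np.clip(zeta(lam), 0.0, None))  # sqrt(zeta_k(lambda_i)) >= 0
        frames.append((Phi * w) @ Phi.T)            # Phi diag(w) Phi^t
    return frames
```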

Theorem 2

\(\{\varPsi_{kl}\}_{k,l}\) is a Parseval frame for \(\mathscr {H}=\mathbb {R}^n\), that is, for all \(x\in \mathbb {R}^n\):

$$\displaystyle \begin{aligned} \sum_{k,l}\left\lvert \left\langle x,\varPsi_{kl} \right\rangle \right\rvert^2 = \left\lVert x \right\rVert^2\,. {} \end{aligned} $$
(20.9)

Proof

If we can show that \(\sum _{(k,l)}\varPsi _{kl}^{} \varPsi _{kl}^t={\mathbf {I}}_{n}\), we get immediately

$$\displaystyle \begin{aligned} y={\mathbf{I}}_{n} y=\left(\sum_{(k,l)}\varPsi_{kl}^{} \varPsi_{kl}^t\right) y \stackrel{}{=} \sum_{(k,l)} \left\langle\,y,\varPsi_{kl} \right\rangle\varPsi_{kl}, \end{aligned} $$
(20.10)

for \(y\in \mathbb {R}^n\). According to Theorem 1 this equation is equivalent to the condition (20.1) with A = B = 1. So we are done. It remains to show \(\sum _{k,l}\varPsi _{kl}^{} \varPsi _{kl}^t={\mathbf {I}}_{n}\). We have (since we sum over a finite number of elements)

$$\displaystyle \begin{aligned} \begin{array}{rcl} \sum_{(k,l)} \varPsi_{kl}^{} \varPsi_{kl}^t &\displaystyle = &\displaystyle \sum_{k,l,i,j} \sqrt{\zeta_k(\lambda_i)} \sqrt{\zeta_k(\lambda_{j})} \varPhi_i(x_l) \varPhi_{j}(x_l) \varPhi_i \varPhi_{j}^t \\ &\displaystyle =&\displaystyle \sum_{i=1}^{n} \sum_{k=0}^{Q} \zeta_k(\lambda_i) \, \varPhi_i \varPhi_i^t \\ &\displaystyle = &\displaystyle \sum_{i=1}^{n} \varPhi_i \varPhi_i^t={\mathbf{I}}_{n}. \end{array} \end{aligned} $$
(20.11)

For the second equality, we have used that \(\sum _l \varPhi _i(x_l) \varPhi _{j}(x_l) = \left \langle \varPhi _i,\varPhi _j\right \rangle = {\mathbf {1}\{i=j\}}\), since {Φ i}i is an orthonormal basis (onb). For the third equality, we used (DoU), and for the last again the onb property. □
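The identity \(\sum _{(k,l)}\varPsi _{kl}^{} \varPsi _{kl}^t={\mathbf {I}}_{n}\) established in the proof is easy to verify numerically for any decomposition of unity; the sketch below (which relies on the illustrative helpers `knn_graph`, `graph_laplacians`, and `build_frame` from the previous sketches) uses a crude two-band split of the spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 3))              # toy point cloud in R^3
A = knn_graph(X, k=7)                    # sketch from Sect. 20.2.3
L, _ = graph_laplacians(A)               # sketch from Sect. 20.2.4

# any decomposition of unity works for the check; indicator filters are exact
zetas = [lambda x: (x <= 1.0).astype(float),
         lambda x: (x > 1.0).astype(float)]
frames = build_frame(L, zetas)

# sum over (k, l) of Psi_kl Psi_kl^t, as in (20.11)
S_total = sum(Pk @ Pk.T for Pk in frames)
assert np.allclose(S_total, np.eye(n), atol=1e-8)
```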

We now choose a special sequence of functions satisfying the decomposition of unity (DoU) condition while also ensuring (a) a spectral localization property for the frame elements and (b) a multiscale decomposition interpretation of the resulting decomposition. This construction follows Coulhon et al. (2012), and is known in the context of functional analysis as a smooth Littlewood-Paley decomposition.

Definition 3 (Multiscale Bandpass Filter)

Let \(g\in C^{\infty }(\mathbb {R}_{+})\) with \(\operatorname{supp} g \subset [0, 1]\), 0 ≤ g ≤ 1, and g(u) = 1 for u ∈ [0, 1∕b] (for some constant b > 1). For \(k\in \mathbb {N}=\{0,1,\ldots \}\) the functions \(\zeta _k: \mathbb {R}_{+}\rightarrow [0,1]\) are defined by

$$\displaystyle \begin{aligned} \zeta_k(x):= \begin{cases}g(x) & \text{if}\, k=0\\ g(b^{-k}x)-g(b^{-k+1}x) & \text{if}\, k>0\\ \end{cases} \end{aligned} $$
(20.12)

The sequence \(\{\zeta_k\}_{k\geq 0}\) is called a multiscale bandpass filter.

This definition leads to the following properties: \(\zeta_k(x) = \zeta_1(b^{1-k}x)\) for k ≥ 1 (multiscale decomposition); \(\zeta _k\in C^{\infty }(\mathbb {R}_{+})\) with \(0 \leq \zeta_k \leq 1\); \(\operatorname{supp}\zeta_0 \subset [0, 1]\) and \(\operatorname{supp}\zeta_k \subset [b^{k-2}, b^{k}]\) for k ≥ 1 (spectral localization property). Moreover, one readily checks that

$$\displaystyle \begin{aligned}\sum_{j\geq 0}\zeta_j(x)=1,\end{aligned} $$
(20.13)

i.e., the (DoU) condition holds. In practice, we use a dyadic bandpass filter, that is, b = 2. The functions \(\zeta_0, \ldots, \zeta_5\) with b = 2 are displayed in Fig. 20.1b. By construction, the parameter k in \(\varPsi_{kl}\) is naturally a spectral scale parameter, while l is a spatial localization parameter: the frame element \(\varPsi_{kl}\) is localized around the point \(x_l\), as we discuss next.
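A direct transcription of Definition 3 is given below; `make_filters` is an illustrative helper of ours, and its default g is a merely continuous placeholder ramp (the \({\mathscr{C}}^3\) plateau function actually used in the experiments is described in Sect. 20.5). Note that with K scales the (DoU) condition holds on \([0, b^{K-2}]\), so K must be chosen large enough to cover the spectrum of L:

```python
import numpy as np

def make_filters(num_scales, b=2.0, g=None):
    """Multiscale bandpass filter {zeta_k} of Definition 3 (a sketch)."""
    if g is None:
        # placeholder plateau: 1 on [0, 1/b], linear ramp down to 0 at 1
        g = lambda x: np.clip((1.0 - np.asarray(x)) / (1.0 - 1.0 / b), 0.0, 1.0)
    zetas = [g]                                        # zeta_0 = g
    for k in range(1, num_scales):
        # zeta_k(x) = g(b^-k x) - g(b^-(k-1) x); k=k pins the scale in the closure
        zetas.append(lambda x, k=k: g(b**(-k) * x) - g(b**(-k + 1) * x))
    return zetas
```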

Fig. 20.1 Littlewood-Paley on \(L^2(0, 1)\): (a) eigenfunctions; (b) multiscale bandpass filter; (c) frame elements

3.2 Spatial Localization

By construction, the elements of the frame are band-limited, i.e., localized in spectral scale, in the sense that for a fixed k, the frame elements \(\varPsi_{kl}\) \((l = 1, \ldots, n)\) are linear combinations of the eigenvectors of the graph Laplacian (“graph harmonics”) corresponding to eigenvalues in the range \([b^{k-2}, b^{k}]\) only.

From our initial motivations, it is desirable that, in contrast with the eigenfunctions of the Laplace operator, the frame elements \(\varPsi_{kl}\) are spatially localized functions. In the classical Littlewood-Paley construction for the usual Laplacian on the interval [0, 1], this is a well-known fact: the use of linear combinations of the trigonometric functions \(y \mapsto \sin(k x_l)\sin(ky)\), with smooth multiscale bandpass filter weights as described in Definition 3, gives rise to strongly localized functions (as illustrated in Fig. 20.1).

Regarding the corresponding discrete construction based on the graph Laplacian, this localization property is certainly observed in practice (as illustrated in Figs. 20.2 and 20.3, see Sect. 20.5 for the setup of the numerical experiments).

Fig. 20.2 Swiss roll data: top: eigenvectors \(\varPhi_j\) for j = 10, 30, 50, 100; bottom: frame elements \(\varPsi_{kl}\) for l fixed and k = 0, 2, 5, 7 (construction from actual swiss roll data, then “unrolled” for clearer graphical representation)

Fig. 20.3 Sphere data: frame elements \(\varPsi_{kl}\) for l fixed and k = 1, 5, 7 (color encodes the value of the function)

Concerning the theoretical perspective, we first briefly review the existing results of Hammond et al. (2011); denote by d the shortest-path distance in the graph. Theorem 5.5 of Hammond et al. (2011) gives the following localization result for graph frames:

$$\displaystyle \begin{aligned} \frac{\varPsi_{kl}(x)}{\left\lVert \varPsi_{kl} \right\rVert_2} \leq C b^{-k}\,, \end{aligned} $$
(20.14)

for all x with \(d(x, x_l) \geq K\), under the assumption that the scaling function \(\zeta_1\) is K-times differentiable with vanishing first (K − 1) derivatives at 0 and non-vanishing K-th derivative, and that the scale parameter k is big enough. This says that \(\varPsi_{kl}\) is “localized” around the point \(x_l\). Unfortunately, this result is not informative in our framework for two reasons: first, we chose a function \(\zeta_1\) (see (20.12)) vanishing in a neighborhood of zero, so that all its derivatives vanish at the origin, contradicting one of the above assumptions. Secondly, and independently of this first issue, the condition “k is big enough” and the factor C depend on the size n of the graph and on the largest eigenvalue of the Laplacian. As a consequence, it is unclear whether this bound covers any interesting part of the spectrum (for k too large, the spectral support \([b^{k-2}, b^{k}]\) does not contain any eigenvalues, so that \(\varPsi_{kl}\) is trivial). Finally, for fixed k the bound also gives no information on the behavior of \(\varPsi_{kl}(x)\) when the path distance of x to \(x_l\) becomes very large.

On the other hand, the form of the scaling function \(\zeta_1\) used in the present work is based on Coulhon et al. (2012), where a theory of multiscale frame analysis is developed on very general metric spaces under certain geometrical assumptions. In a nutshell, it is proved there that using this construction, the obtained frame functions \(\varPsi_{kl}(x)\) are upper bounded by \(O\big((d(x, x_l)/b^{-k})^{-\nu}\big)\) for ν arbitrarily large. We observe that this type of localization estimate is sharper than (20.14) for fixed x and growing k, as well as for fixed scale k and varying x. We conjecture that these theoretical results apply meaningfully in the discrete setting considered here, under the assumption that \(x_1, \ldots, x_n\) are i.i.d. from a sufficiently regular distribution \(P_0\) on a regular manifold \(\mathscr {M}\), but it is out of the intended scope of the present chapter to establish this formally. In particular, “meaningfully” means that the constants involved in the bounds should be independent of the graph size (otherwise the bounds could be devoid of interest for any particular graph, as pointed out above), a question that we are currently investigating.

4 Denoising

We consider the regression model for fixed design points \(\mathfrak {D}=\left \{x_i, i=1,\ldots,n\right \}\) and observations \(y_i = f(x_i) + \epsilon_i\), where the \(\epsilon_i\) are independent and identically distributed random variables with \({\mathbf {E}\left (\epsilon _i\right )}=0\) and \({\mathbf {Var}\left (\epsilon_i\right )}=\sigma ^2\). The aim of denoising is to recover the function \(f:\mathfrak {D}\rightarrow \mathbb {R}\) at the design points themselves. We will use the proposed Parseval frame in order to define an estimate \({\widehat {f}}\) of the function f. In what follows, since \(\mathfrak {D}\) is fixed, we identify f with the vector \((f(x_1), \ldots, f(x_n))\) and denote \(y = (y_1, \ldots, y_n)\).

Given the frame \(\mathscr {F}\) associated to the data points \(\mathfrak {D}\) with a multiscale bandpass filter as in Definitions 2 and 3, we denote the frame coefficients \(a_{kl}=\left \langle \varPsi _{kl},f \right \rangle \) for f and \(b_{kl}=\left \langle \varPsi _{kl},y \right \rangle \) for y. Due to the linearity of the inner product we get \(a_{kl}=b_{kl}-\left \langle \varPsi _{kl},\epsilon \right \rangle .\) We estimate the unknown coefficients \(a_{kl}\) from the observed coefficients \(b_{kl}\) by soft-thresholding:

$$\displaystyle \begin{aligned} S_s\left(z, c\right)={\mathrm{sgn}}(z)\, (\left\lvert z \right\rvert-c)_+. \end{aligned} $$
(20.15)

In order to take into account that the frame elements \(\varPsi_{kl}\) are not normalized, and generally have different norms, we use element-adapted thresholds of the form \(c_{kl}=\sigma \left \lVert \varPsi _{kl} \right \rVert t\), which depend on the variance of \(\left \langle \epsilon ,\varPsi _{kl} \right \rangle \) and on a global parameter t. Equivalently, this corresponds to first normalizing the observed coefficients \(b_{kl}\) by dividing by their standard deviation, then applying a global threshold to the normalized coefficients, and finally inverting the normalization.

The estimator of f is then the plug-in estimator

$$\displaystyle \begin{aligned} {\widehat{f}}_{S_{s}} = \sum_{k,l} {S_{s}}\left(b_{kl}, c_{kl}\right) \varPsi_{kl} {} = T^* {S_{s}}(b,c), \end{aligned} $$
(20.16)

where \(S_s(b, c)\) denotes the vector of thresholded coefficients, and \(T^*\) is the synthesis operator of the frame as introduced in Sect. 20.2.2.
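Putting (20.15) and (20.16) together, the denoising step reduces to a few lines; the sketch below assumes the list of frame matrices from the sketch after Definition 2, and the function names are ours:

```python
import numpy as np

def soft_threshold(z, c):
    """S_s(z, c) = sgn(z) (|z| - c)_+ as in (20.15), applied coordinatewise."""
    return np.sign(z) * np.maximum(np.abs(z) - c, 0.0)

def frame_denoise(y, frames, sigma, t):
    """Plug-in estimator (20.16): threshold frame coefficients, resynthesize.

    `frames` is a list of n x n matrices whose l-th column is Psi_{kl};
    thresholds are the element-adapted c_kl = sigma * ||Psi_kl|| * t.
    """
    f_hat = np.zeros(len(y))
    for Pk in frames:
        b = Pk.T @ y                         # analysis: b_kl = <Psi_kl, y>
        norms = np.linalg.norm(Pk, axis=0)   # ||Psi_kl|| for each l
        a_hat = soft_threshold(b, sigma * norms * t)
        f_hat += Pk @ a_hat                  # synthesis: sum_l a_kl Psi_kl
    return f_hat
```

With the universal choice \(t=\sqrt{2\log n}\) of Theorem 3 below, this realizes the theoretical estimator; Sect. 20.5 instead scans t over a range of values.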

To measure the performance of this estimator, we use the risk measure

$$\displaystyle \begin{aligned}Risk({\widehat{f}},f)={{\mathbf{E}}_{\epsilon}\left({\left\lVert {\widehat{f}}-f \right\rVert^2}\right)},\end{aligned} $$
(20.17)

that is, the expected squared Euclidean norm at the sampled points (where \(\left \lVert f \right \rVert ^2=\sum _{i=1}^n f(x_i)^2\) is the squared Euclidean norm of the vector f on the observation points), for the performance analysis of an estimator \({\widehat {f}}\in \mathbb {R}^n\).

For bounding the risk of the thresholding estimator \({\widehat {f}}_{S_{s}}\), rather than assuming specific regularity properties of the function f, it is useful to compare the performance of \({\widehat {f}}_{S_{s}}\) to that of a family of reference estimators. This is called the oracle approach (Candès 2006; Donoho and Johnstone 1994): can the proposed estimator perform (almost) as well as the best estimator (for this specific f) in a reference family, that is to say, as well as if an oracle had given us advance knowledge of which reference estimator is best for this function f? We briefly review some important results.

A suitable class of simple reference estimators consists of “keep or kill” (or diagonal projection) estimators, which keep unchanged the observed coefficients \(b_{kl}\) for (k, l) in some subset I, and set to zero the coefficients with indices outside of I:

$$\displaystyle \begin{aligned} {\widehat{f}}_I := \sum_{(k,l) \in I} b_{kl} \varPsi_{kl} = T^* {\widehat{a}}^I_{kl}, \end{aligned} $$
(20.18)

where \({\widehat {a}}^I_{kl}=b_{kl} {\mathbf {1}\{(k,l)\in I\}}\). Now using the frame reconstruction formula and (20.6), we obtain

$$\displaystyle \begin{aligned} \begin{array}{rcl} {{\mathbf{E}}_{\epsilon}\left({\left\lVert {\widehat{f}}_I-f \right\rVert^2}\right)} &\displaystyle = &\displaystyle {{\mathbf{E}}_{\epsilon}\left(\left\lVert T^*(a - {\widehat{a}}^I) \right\rVert^2\right)}\\ &\displaystyle \leq &\displaystyle {{\mathbf{E}}_{\epsilon}\left({\left\lVert a-{\widehat{a}}^I \right\rVert^2}\right)}\\ &\displaystyle = &\displaystyle \sum_{(k,l)} \big( a_{kl}^2 {\mathbf{1}\{(k,l)\not\in I\}} + \sigma^2 \left\lVert \varPsi_{kl} \right\rVert^2 {\mathbf{1}\{(k,l) \in I\}}\big)\,. \end{array} \end{aligned} $$
(20.19)

Therefore, the optimal (oracle) choice of the index set I obtained by minimizing the above upper bound is given by

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle (k,l)\in I^* ~~&\displaystyle \Leftrightarrow ~~\left\langle\,f,\varPsi_{kl} \right\rangle^2 \geq \sigma^2 \left\lVert \varPsi_{kl} \right\rVert^2 ~\mbox{ ~(keep) ~}\\ &\displaystyle (k,l)\notin I^*~~&\displaystyle \Leftrightarrow ~~\left\langle\,f,\varPsi_{kl} \right\rangle^2 \leq \sigma^2 \left\lVert \varPsi_{kl} \right\rVert^2 ~\mbox{ ~(kill) ~}. \end{array} \end{aligned} $$
(20.20)

One deduces from this that

$$\displaystyle \begin{aligned} \inf_I{{\mathbf{E}}_{\epsilon}\left({\left\lVert {\widehat{f}}_I-f \right\rVert^2}\right)} \leq \sum_{(k,l)} \min\left(\left\langle\,f,\varPsi_{kl} \right\rangle^2, \sigma^2\left\lVert \varPsi_{kl} \right\rVert^2\right) =: OB(f)\,.\end{aligned} $$
(20.21)
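When f is known, as on testbed data, the oracle bound OB(f) of (20.21) can be evaluated directly; a sketch (with our own naming, reusing the frame matrices from the sketch in Sect. 20.3) follows:

```python
import numpy as np

def oracle_bound(f, frames, sigma):
    """OB(f): sum over (k, l) of min(<f, Psi_kl>^2, sigma^2 ||Psi_kl||^2)."""
    ob = 0.0
    for Pk in frames:
        a = Pk.T @ f                        # true coefficients a_kl
        norms2 = (Pk**2).sum(axis=0)        # ||Psi_kl||^2 for each l
        ob += np.minimum(a**2, sigma**2 * norms2).sum()
    return ob
```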

The relation of soft thresholding estimators to the collection of keep-or-kill estimators on a Parseval frame is captured by the following oracle-type inequality (see Candès 2006, Section 9):

Theorem 3

Let \(\{\varPsi_{kl}\}_{k,l}\) be a Parseval frame and consider the denoising observation model with Gaussian noise. Let \({\widehat {f}}_{S_s}=\sum _{k,l} S_s\left (\left \langle \,y,\varPsi _{kl} \right \rangle ,t_{kl}\right ) \varPsi _{kl}\) be the soft-threshold frame estimator from (20.16). Then with \(t_{kl}=\sigma \left \lVert \varPsi _{kl} \right \rVert \sqrt {2 \log (n)}\) the following inequality holds:

$$\displaystyle \begin{aligned} {{\mathbf{E}}_{\epsilon}\left({\left\lVert {\widehat{f}}_{S_s}-f \right\rVert^2}\right)} \leq \left(2\log(n)+1\right) \left( \sigma^2 +OB(f)\right). \end{aligned} $$
(20.22)

To interpret this result, observe that if we renormalize the squared norm by \(\frac {1}{n}\), so that it represents the averaged squared error per point, we expect (depending on the regularity of f) the order of magnitude of \(n^{-1} OB(f)\) to be typically a polynomial rate \(O(n^{-\nu})\) for some ν < 1. Then the term \(\sigma^2/n\) is negligible in comparison, and the oracle inequality states that the performance of \({\widehat {f}}_{S_s}\) is worse only by a logarithmic factor than the performance obtained with the optimal, f-dependent choice of I in a keep-or-kill estimator.

For this tight oracle inequality to hold, it is particularly important that a Parseval frame is used. While thresholding strategies can also be applied to the coefficients of a frame that is not Parseval, the reconstruction step is less straightforward (the canonical dual frame must be computed for reconstruction from the thresholded coefficients, see Sect. 20.2.2); furthermore, an additional factor B∕A enters the bound (A ≤ 1 ≤ B being the frame bounds from definition (20.1)) (see, for instance, Haltmeier and Munk 2014, Prop. 3.10). Therefore, the performance of simple thresholding estimates deteriorates when used with a non-Parseval frame.

5 Numerical Experiments

We investigate the performance of the proposed method for denoising on two testbed datasets where the ground truth is known and the design points are drawn i.i.d. from a distribution on a manifold. More precisely, we first consider an example where the design points \(\mathfrak {D}\) are drawn uniformly (n = 500) on the unit square, which is then rolled up into a “swiss roll” shape in 3D. We consider a very simple target function, represented (on the original unit square) as a piecewise constant function (with values 5 and −3) on two triangles, displaying a sharp discontinuity along one diagonal of the square while being very smooth elsewhere. This function is observed with additional Gaussian noise of variance \(\sigma^2 = 1\). In the second example, the design points \(\mathfrak {D}\) are drawn uniformly (n = 500) on the unit sphere in \(\mathbb {R}^3\). The target function is again piecewise constant, defined on the two parts of the sphere obtained by intersecting it with a chosen plane, and is likewise observed with additional Gaussian noise of variance \(\sigma^2 = 1\). For both the swiss roll and the sphere example, one sample consisting of design points and noisy function values is displayed in Fig. 20.4.

Fig. 20.4 Left: noisy function on swiss roll data (top) and sphere data (bottom), graph representation. Right: MSE for two representative settings (weighted ε-graph and k-NN graph) as a function of the threshold level; red is thresholding in the original Laplacian Eigenmaps ONB, blue is thresholding of the frame coefficients

In each example, we consider the different types of neighborhood graphs described in Sect. 20.2.3. Following usual heuristics, for the construction of the k-NN graph we take \(k=7 \approx \log n\); for the ε-graph, we take ε to be the average distance to the (k = 7)-th nearest neighbor; and for weighted graphs we take Gaussian weights, with the bandwidth λ calibrated so that points at the distance ε defined above receive weight 0.5 (this calibration is sketched below).
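Solving \(\exp(-\epsilon^2/2\lambda^2) = 0.5\) for λ gives the closed form \(\lambda = \epsilon/\sqrt{2\ln 2}\); a minimal sketch of this calibration (with `calibrate` as our illustrative helper name) is:

```python
import numpy as np

def calibrate(X, k=7):
    """Heuristic graph parameters: eps and Gaussian bandwidth lambda.

    eps is the average distance to the k-th nearest neighbor; lam is chosen
    so that points at distance eps receive Gaussian weight exactly 0.5.
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    eps = np.sort(D, axis=1)[:, k].mean()   # column 0 is the point itself
    lam = eps / np.sqrt(2 * np.log(2))      # exp(-eps^2 / (2 lam^2)) = 0.5
    return eps, lam
```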

After constructing the (weighted or unweighted) graph Laplacian, we compute explicitly its eigendecomposition. For the construction of the frame via the multiscale bandpass filter, we use a \({\mathscr {C}}^3\) piecewise polynomial plateau function g satisfying the support constraints of Definition 3 for b = 2 (i.e., constant equal to 1 for x ≤ 0.5, and zero for x ≥ 1). While this function is not \({\mathscr {C}}^\infty \), it has the advantage of fast computation.
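The chapter does not spell out the polynomial; one possible \({\mathscr{C}}^3\) plateau function meeting the stated constraints, built from the degree-7 “smoothstep” polynomial, is sketched below (an assumption on our part, not necessarily the exact function used in the experiments):

```python
import numpy as np

def g_plateau(x):
    """A C^3 piecewise-polynomial plateau: 1 on [0, 1/2], 0 on [1, inf).

    On [1/2, 1] it descends via s(t) = 35 t^4 - 84 t^5 + 70 t^6 - 20 t^7,
    whose first three derivatives vanish at t = 0 and t = 1 (indeed
    s'(t) = 140 t^3 (1 - t)^3), so the glued function is C^3.  The support
    constraints of Definition 3 are met for b = 2.
    """
    t = np.clip(2.0 * np.asarray(x, dtype=float) - 1.0, 0.0, 1.0)
    s = t**4 * (35 - 84*t + 70*t**2 - 20*t**3)
    return 1.0 - s
```

Passed as `g` to the filter sketch of Sect. 20.3 (e.g. `make_filters(10, b=2.0, g=g_plateau)`), this yields a multiscale bandpass filter of the kind used here.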

We compare the denoising performance of the following competitors: the Parseval frame with soft thresholding, soft thresholding applied to the Laplacian Eigenmaps orthonormal basis, and a truncated expansion in the Laplacian Eigenmaps basis (only the k coefficients corresponding to the first eigenvalues are kept, without thresholding). The latter method is in the spirit of Belkin and Niyogi (2002). It is well known (from the regular grid case) that the “universal” theoretical threshold \(\sigma \sqrt {\log n}\) is often too conservative in practice. For a fair comparison, we therefore compute the mean squared error (MSE) of both thresholding methods for varying threshold t (still modulated by \(\left \lVert \varPsi _{kl} \right \rVert \) for the Parseval frame). A comparison of the MSE for one sample across the t-range for two particular settings is plotted in Fig. 20.4. For all studied settings (different graph and graph Laplacian types), we observed that, at the same threshold level t, the frame-based method systematically shows a noticeable improvement.

In Table 20.1 we report the minimum MSEs and their standard error (averaged over m = 50 samples of design points and independent noise) for different methods over the possible range of the parameter (threshold level t, resp. number of coefficients for truncated expansion), both for the swissroll and for the sphere example. We observe an improvement of 20–25% across the different settings (the best overall results being obtained with weighted graphs and the unnormalized Laplacian). We also compared to the more traditional methods of kernel smoothing (Nadaraya-Watson estimator) and kernel ridge regression, using a Gaussian kernel (also with optimal choices of bandwidth and regularization parameter), and observed a comparable performance improvement. While it is not realistic to assume that the optimal parameter choice is known in practice, it is fair to compare all methods under their respective optimal parameter settings, as parameter selection methods will induce a comparable performance hit with respect to the best setting.

Table 20.1 MSE performance under optimal parameter choice

6 Outlook

Following the recently introduced idea of generalizing the Littlewood-Paley spectral decomposition, we explicitly constructed a Parseval frame of functions on a neighborhood graph formed on the data points. We demonstrated that a thresholding strategy on the frame coefficients has superior performance for the denoising problem as compared to usual spectral and non-spectral approaches. Future developments include the extension of this methodology to the semi-supervised learning setting and a stronger theoretical basis for spatial localization.