
1 Introduction

We are interested in developing tools for harmonic analysis and processing of large data sets that arise in a wide variety of applications, such as sounds, images (RGB or hyperspectral, [16]), gene arrays, EEG signals [9], and manifold-valued data [44], to name a few. These data sets are often modeled as samples from a probability distribution in ℝ^D, but it is sometimes assumed that the support of such a probability distribution is in fact a set of low intrinsic dimension, perhaps with some nice geometric properties, for example, those of a smooth manifold.

Approximating and learning functions in high-dimensional spaces is hard because of the curse of dimensionality, so it is natural to try to exploit the intrinsic low dimensionality of the data: this idea has attracted wide interest across different scientific disciplines and various applications. One example of exploiting low intrinsic dimension is to map the data to a low-dimensional space while preserving salient properties of the data [3, 19, 21, 27, 28, 30, 31, 46, 52, 54]. Another example is the construction of dictionaries of functions supported on the data [7, 17, 18, 38–40, 49, 50]. Yet another possibility is modeling the data as a union of low-dimensional subspaces, which is related to the ideas of sparse representations and dictionary learning ([1, 2, 10, 11, 51] and references therein).

When performing dimensionality reduction/manifold learning, the objective is to map data to a low-dimensional space. The maps used are often nonlinear, and at least two problems arise: that of extending the map from a training data set to new data points and that of inverting such a map, i.e., going from a low-dimensional representation of a data point back to its higher-dimensional original representation. Both problems seem to be rather hard (depending of course on the map used) and to require some form of high-dimensional interpolation/extrapolation.

We will work directly in the high-dimensional space, taking advantage of the assumed low intrinsic dimensionality of the data and of its geometry. One advantage of this approach is that while our representations will be low-dimensional, we will not have to produce inverse maps from low dimensions to high dimensions. We construct a geometric multi-resolution analysis (GMRA) for analyzing intrinsically low-dimensional data in high-dimensional spaces, modeled as samples from a d-dimensional set ℳ (in particular, a manifold) embedded in ℝ^D, in the regime d ≪ D. Data may be sampled from a class of signals of interest; in harmonic analysis, a linear infinite-dimensional function space ℱ often models the class of signals of interest, and linear representations of the form f = ∑_i α_i ϕ_i, for f ∈ ℱ, in terms of a dictionary of atoms Φ := {ϕ_i}_i ⊆ ℱ are studied. Such dictionaries may be bases or frames and are constructed so that the sequence of coefficients {α_i}_i has desirable properties, such as some form of sparsity, or a distribution highly concentrated at zero. Several such dictionaries have been constructed for function classes modeling one- and two-dimensional signals of interest [8, 12, 14, 20, 22, 47] and are proven to provide optimal representations (in a suitably defined sense) for certain function spaces and/or for operators on such spaces. A more recent trend [1, 12, 41–43, 51, 55], motivated by the desire to model classes of signals that are not well modeled by the linear structure of function spaces, has been that of constructing data-adapted dictionaries: an algorithm is allowed to see samples from a class of signals ℱ (not necessarily a linear function space) and constructs a dictionary Φ := {ϕ_i}_i that optimizes some functional, such as the sparsity of the coefficients for signals in ℱ.

There are several parameters in this problem: given training data from ℱ, one seeks Φ with I elements, such that every element in the training set may be represented, up to a certain precision ε, by at most m elements of the dictionary. The smaller I and m are, for a given ε, the better the dictionary.

Several current approaches may be summarized as follows [42]: consider a finite training set of signals X_n = {x_i}_{i=1}^n ⊂ ℝ^D, which we may represent by a D × n matrix, and optimize the cost function

$$f_{n}(\Phi) = \frac{1}{n}\sum_{i=1}^{n}\ell(x_{i},\Phi),$$

where Φ ∈ ℝ^{D×I} is the dictionary and ℓ is a loss function, for example,

$$\ell(x,\Phi) := \min_{\alpha\in\mathbb{R}^{I}}\frac{1}{2}\|x - \Phi\alpha\|_{\mathbb{R}^{D}}^{2} + \lambda\|\alpha\|_{1},$$

where λ is a regularization parameter. This is basis pursuit [12] or the lasso [53]. One typically adds constraints on the size of the columns of Φ, for example, ‖ϕ_i‖_{ℝ^D} ≤ 1 for all i, which we can write as \(\Phi \in \mathcal{C}\) for some convex set \(\mathcal{C}\). The overall problem may then be written as a matrix factorization problem with a sparsity penalty:

$$\min_{\Phi\in\mathcal{C},\,\alpha\in\mathbb{R}^{I\times n}}\frac{1}{2}\|X_{n} - \Phi\alpha\|_{F}^{2} + \lambda\|\alpha\|_{1,1},$$

where \(\|\alpha\|_{1,1} := \sum_{i_1,i_2}|\alpha_{i_1,i_2}|\). We refer the reader to [42] and references therein for techniques for attacking this optimization problem.
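For concreteness, the inner ℓ¹-penalized least-squares problem defining ℓ(x, Φ) can be solved by iterative soft-thresholding (ISTA); the NumPy sketch below is only an illustration with arbitrary dimensions, step counts, and variable names of our own choosing, not the solvers discussed in [42].

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_loss(x, Phi, lam, n_iter=500):
    """Approximately evaluate l(x, Phi) = min_a 0.5*||x - Phi a||^2 + lam*||a||_1
    via ISTA, returning the loss value and the sparse coefficients."""
    alpha = np.zeros(Phi.shape[1])
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ alpha - x)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    loss = 0.5 * np.linalg.norm(x - Phi @ alpha) ** 2 + lam * np.abs(alpha).sum()
    return loss, alpha

# Toy usage with random data (D = 20 ambient dimensions, I = 50 atoms).
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 50))
Phi /= np.linalg.norm(Phi, axis=0)           # enforce ||phi_i|| <= 1
x = Phi[:, :3] @ rng.standard_normal(3)      # a 3-sparse signal
loss, alpha = lasso_loss(x, Phi, lam=0.1)
```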

In this chapter we make additional assumptions on the data, specifically that it is well approximated by a smooth low-dimensional manifold, and we exploit this geometric assumption to construct data-dependent dictionaries. We use a multiscale approach that leads to a GMRA of the data: it is inspired not only by quantitative geometric analysis techniques in geometric measure theory (see, e.g., [25, 32]) but also by multiscale approximation of functions in high dimensions [5, 6]. These dictionaries are structured in a multiscale fashion and, under suitable assumptions on the data, can be computed efficiently; the expansion of a data point on the dictionary elements is guaranteed to have a certain degree of sparsity m and may also be computed by fast algorithms; and the growth of the number of dictionary elements I as a function of ε is controlled by geometric properties of the data. This may be thought of as a wavelet analysis for data sets rather than for functions, in which the geometry of a set of points, rather than a single function, is approximated.

2 Geometric Multi-resolution Analysis

Let μ be a probability measure in ℝ^D and ℳ its support. In this chapter we will consider the case in which ℳ is endowed with the structure of a Riemannian manifold, but the examples will show that the construction is robust enough to extend and be useful when this assumption is severely violated. In this setting we have a Riemannian metric g and a volume measure dvol. The geodesic distance on ℳ associated with g will be denoted by ρ. We shall assume that dμ is absolutely continuous with respect to dvol, with dμ/dvol bounded above and below. We are interested in the case when the "dimension" d of ℳ is much smaller than the dimension D of the ambient space. While d is typically unknown in practice, efficient (multiscale, geometric) algorithms for its estimation are available (see [37], which also contains many references to previous work on this problem), under additional assumptions on the geometry of ℳ.

2.1 Dyadic Cubes

We start by constructing dyadic cubes on ℳ. This may be thought of as an analogue of dyadic cubes in Euclidean space. It is a collection of (measurable) subsets \(\{Q_{j,k}\}_{k\in \mathcal{K}_{j},\,j\geq J_{0}}\) of ℳ with the following properties [13, 23, 24]:

  • For every j ∈ ℤ, \(\mu (\mathcal{M}\setminus {\cup }_{k\in {\mathcal{K}}_{j}}{Q}_{j,k}) = 0\).

  • For j′ ≥ j and \(k{^\prime} \in {\mathcal{K}}_{j{^\prime}}\), either Q j′, k′  ⊆ Q j, k or μ(Q j′, k′  ∩ Q j, k ) = 0.

  • For j < j′ and \(k{^\prime} \in {\mathcal{K}}_{j{^\prime}}\), there exists a unique \(k \in {\mathcal{K}}_{j}\) such that Q j′, k′  ⊆ Q j, k .

  • Each Q j, k contains a point c j, k such that \({B}_{{c}_{1}\cdot {2}^{-j}}^{\mathcal{M}}({c}_{j,k}) \subseteq {Q}_{j,k} \subseteq {B}_{{2}^{-j}}^{\mathcal{M}}({c}_{j,k}),\) for a constant c 1 depending on intrinsic geometric properties of ℳ. Here \({B}_{r}^{\mathcal{M}}(x)\) is the ρ-ball inside ℳ of radius r > 0 centered at x ∈ ℳ. In particular, we have μ(Q j, k ) ∼ 2^{−dj}.

Let \(\mathcal{T}\) be the tree structure associated to the decomposition above: for any j ∈ ℤ and \(k \in {\mathcal{K}}_{j}\), we let \(\mathrm{ch}(j,k) = \left \{k{^\prime} \in {\mathcal{K}}_{j+1} : {Q}_{j+1,k{^\prime}} \subseteq {Q}_{j,k}\right \}\). We use the notation (j, x) to represent the unique pair \((j,k(x))\), with \(k(x) \in {\mathcal{K}}_{j}\), such that x ∈ Q_{j,k(x)}.

2.2 Multiscale SVD and Intrinsic Dimension Estimation

An introduction to the use of the ideas we present for the estimation of the intrinsic dimension of point clouds is in [37] and references therein (see [35, 36] for previous short accounts). These types of constructions are motivated by ideas in both multiscale geometric measure theory [24, 26, 33] and adaptive approximation of functions in high dimensions [5, 6].

In each dyadic cell Q j, k we consider the mean

$$m_{j,k} := \mathbb{E}_{\mu}[x \mid x \in Q_{j,k}] = \frac{1}{\mu(Q_{j,k})}\int_{Q_{j,k}} x\,\mathrm{d}\mu(x) \in \mathbb{R}^{D}$$
(13.1)

and the local covariance

$$\mathrm{cov}_{j,k} = \mathbb{E}_{\mu}\left[(x - m_{j,k})(x - m_{j,k})^{*} \mid x \in Q_{j,k}\right] \in \mathbb{R}^{D\times D},$$
(13.2)

where vectors in ℝ^D are regarded as D-dimensional column vectors. Let the rank-d singular value decomposition (SVD) [29] of cov_{j,k} be

$$\mathrm{{cov}}_{j,k} \approx {\Phi }_{j,k}{\Sigma }_{j,k}{\Phi }_{j,k}^{{_\ast}},$$
(13.3)

where Φ_{j,k} is an orthonormal D × d matrix and Σ_{j,k} is a diagonal d × d matrix. Let

$$\mathbb{V}_{j,k} := V_{j,k} + m_{j,k},\qquad V_{j,k} = \langle \Phi_{j,k}\rangle,$$
(13.4)

where ⟨A⟩ denotes the span of the columns of A, so that \({\mathbb{V}}_{j,k}\) is the affine subspace of dimension d parallel to V_{j,k} and passing through m_{j,k}. It is an approximate tangent space to ℳ at location m_{j,k} and scale 2^{−j}; in fact, by the properties of the SVD it provides the best d_{j,k}-dimensional planar approximation to ℳ in the least squares sense:

$${\mathbb{V}}_{j,k} =\mathop{ argmin}\limits_{\Pi }{\int \nolimits \nolimits }_{{Q}_{j,k}}\vert \vert x - {\mathbb{P}}_{\Pi }(x)\vert {\vert }^{2}\,\mathrm{d}\mu (x),$$
(13.5)

where Π ranges over the set of all affine d_{j,k}-dimensional planes and ℙ_Π is the orthogonal projection onto the affine plane Π. Let ℙ_{j,k} be the associated affine projection

$${\mathbb{P}}_{j,k}(x) := {\mathbb{P}}_{{\mathbb{V}}_{j,k}}(x) = {\Phi }_{j,k}{\Phi }_{j,k}^{{_\ast}}(x - {m}_{ j,k}) + {m}_{j,k},\quad x \in {Q}_{j,k}.$$
(13.6)
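The following NumPy sketch implements Eqs. (13.1)–(13.6) for a single cell; the function name and the convention that the points of Q_{j,k} are stored as the rows of an array are our own choices, not part of the construction above.

```python
import numpy as np

def local_affine_projector(X_cell, d):
    """Given the points of a cell Q_{j,k} as rows of X_cell (n_cell x D),
    return the center m_{j,k}, the D x d basis Phi_{j,k}, the top d
    eigenvalues of cov_{j,k} (1/n normalization), and a callable
    implementing the affine projection of Eq. (13.6)."""
    m = X_cell.mean(axis=0)                        # m_{j,k}, Eq. (13.1)
    Y = X_cell - m
    # Rank-d SVD of the centered data yields the top d eigenvectors of cov_{j,k}.
    _, S, Vt = np.linalg.svd(Y, full_matrices=False)
    Phi = Vt[:d].T                                 # D x d orthonormal basis
    eigvals = (S[:d] ** 2) / X_cell.shape[0]       # empirical spectrum of cov_{j,k}

    def project(x):
        """Affine projection P_{j,k}(x) of Eq. (13.6)."""
        return Phi @ (Phi.T @ (x - m)) + m

    return m, Phi, eigvals, project
```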

The behavior of the singular values in the matrix Σ_{j,x} in Eq. (13.3) as a function of the scale j, for x fixed, contains a lot of useful information about the geometry of the data around x. In particular, they may be used to detect the intrinsic dimension of the data in a neighborhood of x. We need to introduce several definitions before stating some results. Because of space constraints, we will consider here the case when ℳ is a manifold of co-dimension one, leaving the discussion of the general case (ℳ with arbitrary co-dimension and not a manifold) to [2, 37]. Let

$$\lambda = \frac{d}{d + 2},\qquad \kappa = \frac{d}{(d + 2)^{2}(d + 4)}\left[\frac{d + 1}{2}\sum_{i=1}^{d}\kappa_{i}^{2} - \sum_{i<j}\kappa_{i}\kappa_{j}\right],$$

where the κ_i's are the sectional curvatures of the manifold. We refer the reader to [37] for an extended discussion of these quantities, which arise naturally in the study of multiscale SVD of manifolds. When ℳ has co-dimension larger than 1, more complicated functions of the curvatures arise [similar to those in Eq. (13.18)]; in the non-manifold case a notion of L²-curvature that generalizes the above may be used [37]. Suppose we sample n points \({x}_{1},\ldots,{x}_{n}\) i.i.d. from the volume measure on the manifold, and each is perturbed by an i.i.d. realization of white Gaussian noise in ℝ^D with covariance σ²I_D. We denote by \(\tilde{{X}}_{n,\tilde{z},r}\) the set of noisy samples x_i + η_i that lie in \({B}_{z+{\eta }_{z}}(r)\), where η_z is the noise corresponding to the data point z: this is the observed data, sampled and noisy, at the disposal of the algorithm. We denote by X_{z,r} a random variable distributed in B_z(r) ∩ ℳ according to the volume measure: this is the ideal data, uncorrupted by noise and sampling. Finally, we let \(r_{=}^{2} := r^{2} - 2\sigma^{2}D\).

Let r = 2^{−j} and X_{z,r} = ℳ ∩ B_z(r). The behavior of the ideal covariance of X_{z,r} (which is comparable to cov_{j,k}) as a function of r reveals interesting properties of the data; for example, it may be used to measure the intrinsic dimension and the L²-curvature of ℳ around a point z, since the d largest singular values grow quadratically in r, while the remaining ones measure L²-curvatures. In particular, for r small the largest gap between these singular values will be the dth gap, leading to an estimator of intrinsic dimension. However, since we do not have access to X_{z,r}, we are interested in the behavior of the empirical covariance matrix of the noisy samples \(\tilde{{X}}_{n,\tilde{z},r}\) as a function of r. In particular, we ask how close it is to cov(X_{z,r}) and when the dth gap of \(\mathrm{cov}(\tilde{{X}}_{n,\tilde{z},r})\) is the largest, so that we may use it to estimate the intrinsic dimension of ℳ. Observe that while we would like to choose r small, since then the difference in behavior between the top d singular values and the remaining ones is largest, we are no longer free to do so: having only n samples forces a lower bound on r, since in small balls we will have too few samples to estimate the covariances. Moreover, the presence of noise also puts a lower bound on the interesting range of r: since the expected length of a noise vector is \(\sigma \sqrt{D}\), and the covariance of the noise has norm σ², we expect that r should be larger than a function of these quantities in order for \(\mathrm{cov}(\tilde{{X}}_{n,\tilde{z},r})\) to provide meaningful information about the geometry of ℳ.
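To make the preceding discussion concrete, here is a bare-bones caricature of the multiscale-SVD dimension estimate: for each radius r on a grid, it computes the spectrum of the empirical covariance of the noisy points in B_z(r) and records the position of the largest gap. All names are ours, and it omits the safeguards (the sampling- and noise-driven bounds on r) analyzed in [37].

```python
import numpy as np

def largest_gap_dimension(X, z, radii, min_points=10):
    """For each radius r, compute the singular values of the empirical
    covariance of the rows of X falling in B_z(r) and return the index of
    the largest spectral gap, a crude estimate of the intrinsic dimension."""
    estimates = []
    for r in radii:
        ball = X[np.linalg.norm(X - z, axis=1) <= r]
        if len(ball) < min_points:        # too few samples to estimate covariance
            estimates.append(None)
            continue
        C = np.cov(ball, rowvar=False)
        s = np.linalg.svd(C, compute_uv=False)   # sorted in decreasing order
        gaps = s[:-1] - s[1:]
        estimates.append(int(np.argmax(gaps)) + 1)   # d-th gap largest -> estimate d
    return estimates
```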

Here and in what follows C, C 1, and C 2 will denote numerical constants whose value may change with each occurrence.

Theorem 1 (n →∞). 

Fix z ∈ℳ; assume D ≥ C, \(\sigma \sqrt{D} \leq \frac{\sqrt{d}} {2\sqrt{2}\kappa }\),

$$r \in \left ({R}_{\min } + 4\sigma \sqrt{D} + \frac{1} {6\kappa },{R}_{\max } - \sigma \sqrt{D} - \frac{1} {6\kappa }\right ) \cap \left (3\sigma \left (\sqrt{D} \vee d\right ), \frac{\sqrt{d}} {\kappa } \right ).$$
(13.7)

Then for n large enough, with probability at least \(1 - C{e}^{-C\sqrt{D}}\) , we have

$$\left\|\mathrm{cov}(\tilde{X}_{n,\tilde{z},r}) - \mathrm{cov}(X_{z,r_{=}})\right\| \leq C\left(\frac{\kappa^{2}r_{=}^{4}}{d} + \sigma^{2} + \frac{\lambda\kappa r_{=}^{3}}{d}\left(\frac{\lambda\kappa r_{=}}{\lambda^{2} - C\kappa^{2}r_{=}^{2}} \wedge 1\right)\right).$$
(13.8)

Moreover, in the range of scales

$${C}_{1} \frac{\sigma \sqrt{d}} {\sqrt{{\lambda }^{2 } - {\delta }^{2}}} \leq {r}_{=} \leq {C}_{2}\frac{{\lambda }^{2} - {\delta }^{2}} {\lambda \kappa },$$
(13.9)

\({\Delta }_{k}({\tilde{X}}_{n,\tilde{z},r})\) is the largest gap, with the same probability as above.

Theorem 1 essentially says that if B_z(r) contains O(d log d) sample points, and the noise level σ is not too large compared to the curvature, then the largest gap in the spectrum of the empirical covariance matrix of the data in B_z(r) is the dth gap, with high probability, for r in the range:

$${C}_{1}{\sigma }^{2} \leq \frac{{r}^{2}} {d} \leq {C}_{2}\ \frac{{\lambda }^{2}} {{\kappa }^{2}d}.$$

The upper bound \(\frac{\lambda } {\kappa }\) is dictated by curvature, while the lower bound \(\sigma \sqrt{D}\) is forced by the noise level: the lower bound is comparable to the size of the covariance of the noise, the upper bound is comparable to the size of the covariance of the data computed at the largest radius λ ∕ κ where the curvature is not too large, and the term in the middle is comparable to the size of the data along the local approximating plane.

Fig. 13.1

We consider 4,000 points uniformly sampled on a 16-dimensional unit sphere, embedded in ℝ^100, with Gaussian noise \(\eta \sim 0.1\mathcal{N}(0,{I}_{100})\) added to each point. We plot the empirical mean (over data points \(\tilde{z}\)) of the squared singular values of the empirical covariance matrix \(\mathrm{cov}({\tilde{X}}_{n,\tilde{z},r})\), as a function of r: in the "reasonable" range of scales, above the size of the noise, we see 16 squared singular values corresponding to the approximate tangent planes, a 17th squared singular value corresponding to curvature, and all the other squared singular values of size comparable to the energy of the noise, 10^{−2}. The algorithm detects a range of scales, above the scale of the noise, where the 16th gap between the squared singular values is largest, i.e., the noise is small compared to the curvature, which is in turn small compared to the elongation along the tangent plane. It is remarkable, albeit predicted by our results, that only 4,000 points (typically considered a small number in 16, let alone 100, dimensions), perturbed by large noise (note that 𝔼[‖η‖] ∼ 1), are enough to obtain accurate geometric information

Our second theorem explores the regime where the ambient dimension D goes to infinity while the number of samples n stays fixed, depending only on the intrinsic dimension. While n samples certainly lie in an n-dimensional affine subspace, because of the ambient noise such a subspace is unreliable at small scales, and this regime captures the range of scales where we have independence from the ambient dimension and an essentially linear dependence on d of the minimal number of points needed in B_z(r_=).

Theorem 2 (D →∞, \(\sigma \sqrt{D} = O(1)\)). 

Fix z ∈ℳ. Let the assumptions of Theorem  1 and the restriction (13.7) hold. Fix t ∈ (C,Cd) and assume \(\epsilon := {\epsilon }_{{r}_{=},n,t} \leq \frac{1} {2}\) . Then for D ≥ C and m ≤ D, and σ 0 constant, for r in the range of scales (13.7) intersected with

$$r \in \left(\frac{4\sigma_{0}\left(1 \vee \frac{d}{\sqrt{D}} \vee \lambda_{\max}\epsilon\right)}{\lambda_{\min}^{2} - \delta^{2}\lambda_{\max}\epsilon - \frac{\epsilon^{2}}{\lambda_{\min}^{2}}\left(\frac{C\sigma_{0}d}{r} \vee \frac{1}{m}\right) - \frac{\sigma_{0}\kappa}{t}},\ \frac{\frac{\lambda_{\max}}{4} \wedge \sqrt{d}}{\kappa}\right),$$

the following hold, with probability at least \(1 - C{e}^{-C{t}^{2} }\):

  1. (i)

    \({\Delta }_{k}(\mathrm{cov}({\tilde{X}}_{n,\tilde{z},r}))\) is the largest gap of \(\mathrm{cov}({\tilde{X}}_{n,\tilde{z},r})\).

  2. (ii)

    \(\left\|\mathrm{cov}(\tilde{X}_{n,\tilde{z},r}) - \mathrm{cov}(X_{z,r_{=}}) - \sigma^{2}I_{D}\right\| \leq \left(\sigma_{0}^{2}\epsilon + \lambda_{\max}\sigma_{0}r + \left(\lambda_{\max} + 2\sigma_{0}\kappa + \frac{\epsilon}{m}\right)r^{2} + O\left(\frac{r^{3}}{\epsilon}\right)\right)\frac{\epsilon}{d}\).

These bounds, and in fact the finer bounds of [37], may of course be used directly to obtain perturbation bounds for the empirical noisy tangent spaces estimated from the top d singular vectors of the empirical covariance matrix of the data in \({B}_{\tilde{z}}(r)\) (a slightly better approach, taken in [37], uses Wielandt's lemma before applying the usual sin Θ theorems). It turns out that, since the noise essentially "to first order" adds only a multiple of the identity matrix, the approximate tangent space computed in this fashion is very stable, even in the regime of Theorem 2 [37].

This is the subject of [37], where it is shown that under rather general conditions on the geometry of the data (much more general than the manifold case) and under sampling and ambient noise, one may use these multiscale singular values to estimate the intrinsic dimension of the data. Moreover, under suitable assumptions, the number of samples in a ball around x required to do so is linear in the intrinsic dimension and independent of the ambient dimension. We refer the reader to [10, 35–37]. We now proceed by using not only the information in the singular values but also that in the singular vectors of the SVD in Eq. (13.3).

2.3 Geometric Scaling Functions

The set ℙ_{j,k}(Q_{j,k}) is then the projection of Q_{j,k} onto the local linear approximation, i.e., onto the affine subspace of Eq. (13.4). The fact that these linear subspaces are affine will have various implications in our construction, creating mild nonlinearities and forcing us to construct a different transform and data representation, which is not simply a linear combination of certain atoms. On the other hand, this is an extremely natural construction: not only will the nonlinearities involved cause no conceptual or computational overhead, but in fact we shall obtain algorithms that are faster than those needed to compute sparse linear representations in the standard dictionary learning setting. The bases \(\{{\Phi }_{j,k}\}_{k\in {\mathcal{K}}_{j}}\) are the geometric analogue of a family of scaling functions at scale j, and therefore we call them geometric scaling functions. They "span" an approximate piecewise linear manifold at scale j,

$$\mathcal{M}_{j} := \{\mathbb{P}_{j,k}(Q_{j,k})\}_{k\in \mathcal{K}_{j}}.$$
(13.10)

Under general conditions, ℳ_j → ℳ in the Hausdorff distance as j → +∞. It is natural to define the nonlinear projection of ℳ onto ℳ_j by

$${x}_{j} \equiv {P}_{{\mathcal{M}}_{j}}(x) := {\mathbb{P}}_{j,k}(x),\qquad x \in {Q}_{j,k}.$$
(13.11)

Note that in general ℳ_j is not contained in ℳ_{j+1}, due to the nonlinearity of the underlying manifold ℳ. This will matter in the next section, where we encode "the difference" between ℳ_j and ℳ_{j+1}.

Fig. 13.2

An illustration of the geometric wavelet decomposition. The centers m_{j,x} are represented as lying on ℳ, while in fact they are only close to ℳ, and the corresponding planes \({\mathbb{V}}_{j,x}\) are represented as tangent planes, albeit they are only an approximation to them. Art courtesy of E. Monson

2.4 Geometric Wavelets

In wavelet analysis, wavelets span the difference between scaling function spaces and are contained in the finer-scale scaling function space. In our setting this would correspond to encoding the difference needed to go from ℳ_j to ℳ_{j+1}: for a fixed x ∈ ℳ, x_{j+1} − x_j ∈ ℝ^D, but it is in general not contained in ℳ_{j+1}, due to the nonlinearity of ℳ_j and ℳ_{j+1}. The main observation is that, nevertheless, the collection of vectors x_{j+1} − x_j, for x varying in Q_{j+1,x}, is in fact contained in a low-dimensional subspace and may therefore be encoded efficiently in terms of a basis of that subspace. We proceed as follows: for j ≤ J − 1 we let

$$\begin{aligned}
Q_{\mathcal{M}_{j+1}}(x) &:= x_{j+1} - x_{j} = x_{j+1} - \mathbb{P}_{j,x}(x_{j+1}) + \mathbb{P}_{j,x}(x_{j+1}) - \mathbb{P}_{j,x}(x)\\
&= (I - P_{j,x})(x_{j+1} - c_{j,x}) + P_{j,x}(x_{j+1} - x)\\
&= (I - P_{j,x})\big(\underbrace{x_{j+1} - c_{j+1,x}}_{\in V_{j+1,x}} + c_{j+1,x} - c_{j,x}\big) - P_{j,x}(x - x_{j+1}).
\end{aligned}$$
(13.12)

Let W_{j+1,x} := (I − P_{j,x})V_{j+1,x}, let Q_{j+1,x} be the orthogonal projection onto W_{j+1,x}, and let Ψ_{j+1,x} be an orthonormal basis for W_{j+1,x}, which we will call a geometric wavelet basis. Observe that dim W_{j+1,x} ≤ dim V_{j+1,x} = d_{j+1,x}. We define several quantities below:

$$t_{j+1,x} := c_{j+1,x} - c_{j,x},\qquad w_{j+1,x} := (I - P_{j,x})\,t_{j+1,x},\qquad \mathbb{Q}_{j+1,x}(x) := Q_{j+1,x}(x - c_{j+1,x}) + w_{j+1,x}.$$

Then we may rewrite Eq. (13.12) as

$$\begin{aligned}
Q_{\mathcal{M}_{j+1}}(x) &= \underbrace{Q_{j+1,x}(x_{j+1} - c_{j+1,x})}_{\in W_{j+1,x}} + w_{j+1,x} - P_{j,x}\left(x - x_{J} + \sum_{l=j+1}^{J-1}(x_{l+1} - x_{l})\right)\\
&= \mathbb{Q}_{j+1,x}(x_{j+1}) - P_{j,x}\sum_{l=j+1}^{J-1}(x_{l+1} - x_{l}) - P_{j,x}(x - x_{J})\\
&= \mathbb{Q}_{j+1,x}(x_{j+1}) - P_{j,x}\sum_{l=j+1}^{J-1}Q_{\mathcal{M}_{l+1}}(x) - P_{j,x}(x - x_{J}),
\end{aligned}$$
(13.13)

where J ≥ j + 1 is the index of the finest scale (and the last term vanishes as J → +∞, under general conditions). Note that this multiscale expansion contains terms that involve not only the current scale j + 1 and the previous scale j, but also terms from finer scales, all the way down to the finest scale J. This is once again due to the nonlinearity of ℳ and of the whole construction: knowing P_{ℳ_{j+1}}(x) is not enough to construct P_{ℳ_j}(x), since the whole local nonlinear structure of ℳ determines the locally optimal projection P_{ℳ_j}(x). In [2] we describe a variation of the transform where optimality is relaxed and a "two-scale equation" is obtained.

In terms of the geometric scaling functions and wavelets, the above may be written as

$$\begin{aligned}
x_{j+1} - x_{j} &= \Psi_{j+1,x}\Psi_{j+1,x}^{*}\left(x_{j+1} - m_{j+1,x}\right) + w_{j+1,x} - \Phi_{j,x}\Phi_{j,x}^{*}\sum_{l=j+1}^{J-1}Q_{\mathcal{M}_{l+1}}(x)\\
&\quad - \Phi_{j,x}\Phi_{j,x}^{*}\left(x - x_{J}\right).
\end{aligned}$$
(13.14)

This shows that the difference x_{j+1} − x_j can be expressed as the sum of a component in W_{j+1,x}; a second component that depends only on the cell (j + 1, x) (and not on the point x itself), which accounts for the translation of centers and lies in V_{j,x}^⊥ (but not necessarily in W_{j+1,x}); and a sum of projections onto V_{j,x} of the differences x_{l+1} − x_l at finer scales. By construction we have the two-scale equation

$${P}_{{\mathcal{M}}_{j+1}}(x) = {P}_{{\mathcal{M}}_{j}}(x) + {Q}_{{\mathcal{M}}_{j+1}}(x),\quad x \in \mathcal{M}$$
(13.15)

which can be iterated across scales, leading to a multiscale decomposition along low-dimensional subspaces, with efficient encoding and algorithms. We think of P j, k as being attached to the node (j, k) of \(\mathcal{T}\) and the Q j + 1, k′ as being attached to the edge connecting the node (j + 1, k′) to its parent.
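In code, the wavelet subspace W_{j+1,x} and a basis Ψ_{j+1,x} for it can be obtained from the scaling bases by projecting off the parent plane and re-orthonormalizing. The sketch below assumes the parent and child scaling bases and centers are available as arrays (all names are ours), with a singular-value cutoff standing in for the precision ε.

```python
import numpy as np

def geometric_wavelet(Phi_parent, c_parent, Phi_child, c_child, tol=1e-10):
    """Compute an orthonormal basis Psi for W_{j+1,x} = (I - P_{j,x}) V_{j+1,x}
    and the translation w_{j+1,x} = (I - P_{j,x})(c_{j+1,x} - c_{j,x})."""
    P = Phi_parent @ Phi_parent.T                 # projection onto V_{j,x}
    W = Phi_child - P @ Phi_child                 # (I - P_{j,x}) applied to the child basis
    U, S, _ = np.linalg.svd(W, full_matrices=False)
    Psi = U[:, S > tol]                           # rank(Psi) = d^w_{j+1,x} <= d_{j+1,x}
    t = c_child - c_parent                        # t_{j+1,x}
    w = t - P @ t                                 # w_{j+1,x}
    return Psi, w
```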

Fig. 13.3

We represent in this table the triangular array summarizing the geometric wavelet expansion of a term in the first column in terms of geometric wavelets, according to the multiscale relations (13.15) and the equalities in Eq. (13.13)

We say that the set of multiscale piecewise affine operators {P j } and {Q j + 1} form a geometric multi-resolution analysis or GMRA for short.

2.5 Approximation for Manifolds

We analyze the error of approximation to a d-dimensional manifold ℳ in ℝ^D by the geometric wavelet representation. Our analysis fully explains the examples in Sect. 13.4.1. We have the following theorem from [2]:

Theorem 3.

Let (ℳ,ρ,μ) be a compact \({\mathcal{C}}^{1+\alpha }\) Riemannian manifold of dimension d isometrically embedded in ℝ^D, with α ∈ (0,1], and μ absolutely continuous with respect to the volume measure on ℳ. Let {P j ,Q j+1 } be a GMRA for (ℳ,ρ,μ). For any x ∈ℳ, there exists a scale j 0 = j 0 (x) such that for any j ≥ j 0 and any p > 0, if we let \(d{\mu }_{j,x} := \mu ({Q}_{j,x})^{-1}\,d\mu \vert _{{Q}_{j,x}}\),

$$\begin{aligned}
\left\|\,\left\|z - P_{\mathcal{M}_{j}}(z)\right\|_{\mathbb{R}^{D}}\right\|_{L^{p}(Q_{j,x},\,d\mu_{j,x}(z))} &= \left\|\,\left\|z - P_{\mathcal{M}_{j_{0}}}(z) - \sum_{l=j_{0}}^{j-1}Q_{\mathcal{M}_{l+1}}(z)\right\|_{\mathbb{R}^{D}}\right\|_{L^{p}(Q_{j,x},\,d\mu_{j,x}(z))}\\
&\leq \|\kappa\|_{L^{\infty}(Q_{j,x})}\,2^{-(1+\alpha)j} + o\!\left(2^{-(1+\alpha)j}\right).
\end{aligned}$$
(13.16)

If α < 1, κ(x) depends on the \({\mathcal{C}}^{1+\alpha }\) norm of a coordinate chart from T x (ℳ) to Q j,x ⊆ℳ and on \({\left \| \frac{d\mu } {d\mathrm{vol}}\right \|}_{{L}^{\infty }({Q}_{j,x})}\).

If α = 1,

$$\kappa (x) ={ \left \| \frac{d\mu } {d\mathrm{vol}}\right \|}_{{L}^{\infty }({Q}_{j,x})}\min ({\kappa }_{1}(x),{\kappa }_{2}(x)),$$
(13.17)

with

$$\begin{aligned}
\kappa_{1}(x) &:= \frac{1}{2}\max_{i\in\{1,\ldots,D-d\}}\|H_{i}(x)\|;\\
\kappa_{2}^{2}(x) &:= \max_{w\in\mathbb{S}^{D-d}}\frac{d(d+1)}{4(d + 2)(d + 4)}\left[\left\|\sum_{l=1}^{D-d}w_{l}H_{l}(x)\right\|_{F}^{2} - \frac{1}{d + 2}\left(\sum_{l=1}^{D-d}w_{l}\,\mathrm{Tr}(H_{l}(x))\right)^{2}\right],
\end{aligned}$$
(13.18)

and the D − d matrices H l (x) are the d-dimensional Hessians of ℳ at x.

Observe that κ_2 can be smaller than κ_1 (by a constant factor) or larger (by factors depending on d^2), depending on the spectral properties and commutativity relations between the Hessians H_l. κ_2^2 may be unexpectedly small, in the sense that it may scale as d^{−2}r^4 as a function of d and r, as observed in [37]. For the proof we refer the reader to [2].

3 Algorithms

We present in this section algorithms implementing the construction of the GMRA and the corresponding geometric wavelet transform (GWT).

3.1 Construction of Geometric Multi-resolution Analysis

The first step in the construction of the geometric wavelets is to compute a nested geometric partition of the data set, organized in a tree structure. To this end, one may consider various methods, listed below:

  • (I) Use of METIS [34]: a multiscale variation of iterative spectral partitioning. We construct a weighted graph as done for the construction of diffusion maps [15, 21]: we add an edge between each data point and its k nearest neighbors and assign to any such edge between x_i and x_j the weight \(e^{-\|x_i - x_j\|^2/\sigma}\). Here k and σ are parameters whose selection we do not discuss here (but see [45] for a discussion in the context of molecular dynamics data). In practice, we choose k between 10 and 50 and choose σ adaptively at each point x_i as the distance between x_i and its ⌊k/2⌋-th nearest neighbor.

  • (II) Use of cover trees [4].

  • (III) Use of iterated PCA: at scale 1, compute the top d principal components of the data and partition the data according to the sign of the (d + 1)-st singular vector. Repeat on each of the two resulting partitions.

  • (IV) Iterated k-means: at scale 1, partition the data based on k-means clustering, then iterate on each of the elements of the partition.

Each construction has pros and cons, in terms of performance and guarantees. For (I) we refer the reader to [34], for (II) to [4] (which also discusses several other constructions), and for (III) and (IV) to [48]. Only (II) provides the properties needed for the cells Q_{j,k}. However constructed, we denote by {Q_{j,k}} the family of resulting dyadic cells and let \(\mathcal{T}\) be the associated tree structure, as in Section 2.1.
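As an illustration, option (III) above can be sketched in a few lines; this is a hypothetical helper, not the pseudo-code of Fig. 13.4, and it makes no attempt to enforce the ball-containment properties of Sect. 2.1.

```python
import numpy as np

def iterated_pca_tree(X, d, min_points=20):
    """Binary tree of cells {Q_{j,k}} built by recursively splitting the rows
    of X according to the sign of the (d+1)-st principal component.
    Returns a dict mapping (j, k) to the indices of the points in that cell."""
    X = np.asarray(X)
    cells = {}

    def split(idx, j, k):
        cells[(j, k)] = idx
        if len(idx) < min_points:
            return
        Y = X[idx] - X[idx].mean(axis=0)
        _, _, Vt = np.linalg.svd(Y, full_matrices=False)
        if Vt.shape[0] <= d:                    # not enough directions to split on
            return
        side = Y @ Vt[d] >= 0                   # sign of the (d+1)-st singular vector
        if side.all() or not side.any():        # degenerate split: stop
            return
        split(idx[side], j + 1, 2 * k)
        split(idx[~side], j + 1, 2 * k + 1)

    split(np.arange(len(X)), 0, 0)
    return cells
```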

In Fig. 13.4 we display pseudo-code for the GMRA of a data set X_n, given a precision ε > 0 and a method τ_0 for choosing local dimensions (e.g., using thresholds or a fixed dimension). The code first constructs a family of multiscale dyadic cells (with local centers c_{j,k} and bases Φ_{j,k}) and then computes the geometric wavelets Ψ_{j,k} and translations w_{j,k} at all scales. In practice, we use METIS [34] to construct a dyadic (not 2^d-adic) tree \(\mathcal{T}\) and the associated cells Q_{j,k}.

Fig. 13.4

Pseudo-code for the construction of geometric wavelets

3.2 The Fast Geometric Wavelet Transform and Its Inverse

For simplicity of presentation, we shall assume x = x J ; otherwise, we may first project x onto the local linear approximation of the cell Q J, x and use x J instead of x from now on. That is, we will define x j; J  = P j (x J ), for all j < J, and encode the differences x j + 1; J  − x j; J using the geometric wavelets. Note also that \(\|{x}_{j;J} - {x}_{j}\| \leq \| x - {x}_{J}\|\) at all scales.

The geometric scaling and wavelet coefficients {p_{j,x}}, {q_{j+1,x}}, for j ≥ 0, of a point x ∈ ℳ are chosen to satisfy the equations

$$P_{\mathcal{M}_{j}}(x) = \Phi_{j,x}p_{j,x} + m_{j,x};$$
(13.19)
$$Q_{\mathcal{M}_{j+1}}(x) = \Psi_{j+1,x}q_{j+1,x} + w_{j+1,x} - P_{j,x}\sum_{l=j+1}^{J-1}Q_{\mathcal{M}_{l+1}}(x).$$
(13.20)

The computation of the coefficients, from fine to coarse, is simple and fast: since we assume x = x J , we have

$$p_{j,x} = \Phi_{j,x}^{*}(x_{J} - c_{j,x}) = \Phi_{j,x}^{*}(\Phi_{J,x}p_{J,x} + c_{J,x} - c_{j,x}) = \left(\Phi_{j,x}^{*}\Phi_{J,x}\right)p_{J,x} + \Phi_{j,x}^{*}(c_{J,x} - c_{j,x}).$$
(13.21)

Moreover the wavelet coefficients q j + 1, x [defined in Eq. (13.20)] are obtained from Eq. (13.14):

$$q_{j+1,x} = \Psi_{j+1,x}^{*}(x_{j+1} - c_{j+1,x}) = \left(\Psi_{j+1,x}^{*}\Phi_{j+1,x}\right)p_{j+1,x}.$$
(13.22)

Note that Φ j, x  ∗  Φ J, x and Ψ j + 1, x  ∗  Φ j + 1, x are both small matrices (at most d j, x ×d j, x ) and are the only matrices we need to compute and store (once for all, and only up to a specified precision) in order to compute all the wavelet coefficients q j + 1, x and the scaling coefficients p j, x , given p J, x at the finest scale.
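In code, Eqs. (13.21) and (13.22) reduce to a couple of small matrix products per scale; the fragment below is a minimal sketch with our own argument names, not the pseudo-code of Fig. 13.5.

```python
import numpy as np

def scaling_coefficients(p_J, Phi_j, Phi_J, c_j, c_J):
    """Eq. (13.21): p_{j,x} from the finest-scale coefficients p_{J,x}.
    Only the small matrix Phi_j^* Phi_J and the vector Phi_j^*(c_J - c_j)
    need to be stored to apply this map repeatedly."""
    return (Phi_j.T @ Phi_J) @ p_J + Phi_j.T @ (c_J - c_j)

def wavelet_coefficients(p_j1, Psi_j1, Phi_j1):
    """Eq. (13.22): q_{j+1,x} from p_{j+1,x}, via the small matrix Psi^* Phi."""
    return (Psi_j1.T @ Phi_j1) @ p_j1
```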

In Figs. 13.5 and 13.6 we display pseudo-code for the computation of the forward and inverse geometric wavelet transforms (FGWT/IGWT). The input to the FGWT is a GMRA object, as returned by GeometricMultiResolutionAnalysis, and a point x ∈ ℳ. Its output is the set of wavelet coefficients of the point x at all scales, which are then used by the IGWT to reconstruct the point at all scales.

Fig. 13.5

Pseudo-code for the forward geometric wavelet transform

Fig. 13.6

Pseudo-code for the inverse geometric wavelet transform

For any x ∈ ℳ_J, the set of coefficients

$${q}_{x} = \left ({q}_{J,x};{q}_{J-1,x};\ldots ;{q}_{1,x};{p}_{0,x}\right )$$
(13.23)

is called the discrete GWT of x. Letting d_{j,x}^{w} = rank(Ψ_{j+1,x}), the length of the transform is d + ∑_{j>0} d_{j,x}^{w}, which is bounded by (J + 1)d in the case of samples from a d-dimensional manifold (due to d_{j,x}^{w} ≤ d).

4 Examples

We conduct numerical experiments in this section to demonstrate the performance of the algorithms (i.e., those in Figs. 13.4–13.6).

4.1 Low-Dimensional Smooth Manifolds

To illustrate the construction presented so far, we consider simple synthetic data sets: a SwissRoll, an S-Manifold, and an Oscillating2DWave, all two-dimensional manifolds embedded in ℝ^50 (see Fig. 13.7). We apply the algorithm to construct the GMRA and compute the FGWT of the sampled data (10,000 points, without noise); see Fig. 13.8. We use the manifold dimension d_{j,k} = d = 2 at each node of the tree when constructing the scaling functions and choose the smallest finest scale that achieves an absolute precision of 0.001 in each case. We compute the average magnitude of the wavelet coefficients at each scale and plot it as a function of scale in Fig. 13.8. The reconstructions of the manifolds obtained by the inverse geometric wavelet transform (at selected scales) are shown in Fig. 13.9, together with a plot of the relative approximation error,

$$\mathcal{E}_{j,2}^{\mathrm{rel}} = \frac{1}{\sqrt{\mathrm{Var}(X_{n})}}\sqrt{\frac{1}{n}\sum_{x\in X_{n}}\left(\frac{\|x - P_{j,x}(x)\|}{\|x\|}\right)^{2}},$$
(13.24)

where X n is the training data of n samples. Both the approximation error and the magnitude of the wavelet coefficients decrease quadratically with respect to scale as expected.
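For reference, the error (13.24) can be computed as follows; we read Var(X_n) as the total variance of the training set (sum over coordinates), which is our interpretation of the normalization.

```python
import numpy as np

def relative_error(X, X_j):
    """Eq. (13.24): relative approximation error at scale j.  X holds the
    training points (rows) and X_j their GMRA approximations at scale j;
    Var(X_n) is taken here to be the total variance over all coordinates."""
    ratios = np.linalg.norm(X - X_j, axis=1) / np.linalg.norm(X, axis=1)
    return np.sqrt(np.mean(ratios ** 2)) / np.sqrt(X.var(axis=0).sum())
```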

Fig. 13.7

Toy data sets for the following examples of GMRA

Fig. 13.8

Top row: wavelet coefficients obtained by the algorithm for the three data sets in Fig. 13.7. The horizontal axis indexes the points (arranged according to the tree), and the vertical axis multi-indexes the wavelet coefficients, from coarse (top) to fine (bottom) scales: the block of entries (x, j), x ∈ Q j, k displays log10 | q j, x  | , where q j, x is the vector of geometric wavelet coefficients of x at scale j (see Sect. 13.3). In particular, each row indexes multiple wavelet elements, one for each \(k \in {\mathcal{K}}_{j}\). Bottom row: magnitude of wavelet coefficients decreasing quadratically as a function of scale

Fig. 13.9

Top and middle: reconstructions by the algorithm of the three toy data sets in Fig. 13.7 at two selected scales. Bottom: reconstruction errors as a function of scale

We threshold the wavelet coefficients to study their compressibility and the rate of change of the approximation errors (using compressed wavelet coefficients). To this end, we use a smaller precision, 10^{−5}, so that the algorithm can examine a larger interval of thresholds. We threshold the wavelet coefficients of the Oscillating2DWave data at the level 0.01 and plot in Fig. 13.10 the reduced matrix of wavelet coefficients and the corresponding best reconstruction of the manifold (i.e., at the finest scale).

Fig. 13.10

The wavelet coefficients of the Oscillating2DWave data may be thresholded, leading to adaptive approximation. Left: after sorting the points so that the x-axis orders them from left to right along the manifold, we see that where the manifold oscillates more, larger wavelet coefficients arise at fine scales. By thresholding at the level 0.01 and pruning the dyadic tree accordingly, we reconstruct the manifold at the corresponding precision (right)

4.2 Data Sets

4.2.1 MNIST Handwritten Digits

We first consider the MNIST data set of images of handwritten digits, each of size 28 × 28. We use the digits 0 and 1 and randomly sample 3,000 images per digit from the database. We apply the algorithm to construct the geometric wavelets and show the wavelet coefficients and the reconstruction errors at all scales in Fig. 13.11. We select the local dimensions of the scaling functions by keeping 50% and 95% of the variance at the nonleaf and leaf nodes, respectively. We observe that the magnitudes of the coefficients stop decaying after a certain scale. This indicates that the data is not on a smooth manifold. We expect that optimizing the tree and the wavelet dimensions, in future work, will lead to a more efficient representation in this case.

Fig. 13.11

From left to right: geometric wavelet representation of the MNIST digits data set for 1 and 0. As usual, the vertical axis multi-indexes the wavelet coefficients, from coarse (top) to fine (bottom) scales: the block of entries at (x, j), x ∈ Q_{j,k}, is log₁₀|q_{j,x}|, where q_{j,x} is the vector of geometric wavelet coefficients of x at scale j (see Sect. 13.3). In particular, each row indexes multiple wavelet elements, one for each \(k \in {\mathcal{K}}_{j}\). Top right: dimensions of the wavelet subspaces (with the same convention as in the previous plot): even though the data lies in 784 dimensions, the approximating planes used have mostly dimension 1–6, except for some planes at the leaf nodes. Rightmost inset: reconstruction error as a function of scale. The decay is nonlinear, and not what we would expect from a manifold structure

We then fix a data point (or equivalently an image), for each digit, and show in Fig. 13.12 its reconstructed coordinates at all scales and the corresponding dictionary elements (all of which are also images). We see that at every scale we have a handwritten digit, which is an approximation to the fixed image, and those digits are refined successively to approximate the original data point. The elements of the dictionary quickly fix the orientation and the thickness, and then they add other distinguishing features of the image being approximated.

Fig. 13.12

Left: in each figure we plot coarse-to-fine geometric wavelet approximations of the original data point (represented in the last image). Right: elements of the wavelet dictionary (ordered from coarsest to finest scales) used in the expansion of the data point on the left

4.3 Example: A Connection to Fourier Analysis and FFT

Consider band-limited functions of band B:

$$BF_{B} = \{f : \mathrm{supp}\,\hat{f} \subseteq [-B\pi, B\pi]\}.$$

Classes of smooth functions (e.g., W^{k,2}) are essentially characterized by their L²-energy in dyadic spectral bands of the form [−2^{j+1}π, −2^{j}π] ∪ [2^{j}π, 2^{j+1}π], i.e., by the L²-size of their projection onto \(B{F}_{{2}^{j+1}} \ominus B{F}_{{2}^{j}}\) (some care is of course needed with smooth frequency cutoffs, but this issue is not relevant for our purposes here). We generate random smooth (band-limited!) functions as follows:

$$f(x) ={ \sum \nolimits }_{j=0}^{J}{a}_{ j}(\omega )\cos (jx)$$

where the a_j are random Gaussian (or bounded) coefficients with mean \({2}^{-\lfloor \frac{j} {J}\rfloor \alpha }\) and standard deviation \({2}^{-\lfloor \frac{j} {J}\rfloor \alpha } \cdot \frac{1} {5}\). The GMRA associated with a random sample of this family of functions takes advantage of the multiscale nature of the data and organizes this family of functions in a Littlewood–Paley type of decomposition: the scaling function subspace at scale j roughly corresponds to \(B{F}_{{2}^{j+1}} \ominus B{F}_{{2}^{j}}\), and the GMRA of a point is essentially a block Fourier transform, where coefficients in the same dyadic band are grouped together. Observe that the cost of computing the GMRA of a point f is comparable to the cost of the fast Fourier transform.
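A sample from this family can be generated in a few lines; the grid size, J, α, and the random seed below are arbitrary placeholder choices (not those behind Fig. 13.13), and the mean/standard deviation formula is copied from the text verbatim.

```python
import numpy as np

def random_bandlimited(n_samples, J=64, alpha=2.0, n_grid=256, seed=0):
    """Sample functions f(x) = sum_{j=0}^J a_j cos(j x) on a uniform grid of
    [0, 2*pi), with a_j Gaussian of mean 2^{-floor(j/J)*alpha} and standard
    deviation one fifth of the mean (decay profile as stated in the text)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    j = np.arange(J + 1)
    mean = 2.0 ** (-np.floor(j / J) * alpha)
    a = mean + (mean / 5.0) * rng.standard_normal((n_samples, J + 1))
    return a @ np.cos(np.outer(j, x))   # one sampled function per row
```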

5 Data Representation, Compression, and Computational Considerations

A set of n points in ℝ^D can trivially be stored in space Dn; if it lies, up to a least squares error ε, in a linear subspace of dimension d_ε ≪ D, we could encode the n points in space d_ε(D + n) (the cost of encoding a basis for the linear subspace, plus that of encoding the coefficients of the points in that basis). This is much less than the trivial encoding for d_ε ≪ D. It can be shown [2] that the cost of encoding with a GMRA a \({\mathcal{C}}^{2}\) manifold of dimension d sampled at n points, for a fixed precision ε > 0 and n large, is \(O(\epsilon^{-\frac{d}{2}}dD + nd\log_{2}\epsilon^{-\frac{1}{2}})\).

Fig. 13.13

We construct an orthogonal geometric multi-resolution analysis (see [2]) on a random sample of 10, 000 band-limited functions. Left: dimension of the GMRA wavelet subspaces. Center: approximation error as a function of scale. Right: dominant frequency in each GMRA subspace, showing that frequencies are sorted from low (top, coarse GMRA scales) to high (bottom, fine GMRA scales). This implies that the geometric scaling function subspaces roughly correspond to a Littlewood–Paley decomposition, and the GWT of a function f corresponds to a rough standard wavelet transform

Also, the cost of the algorithm is

$$O(nD(\log (n) + {d}^{2})) + {O}_{ d,D}(n\log n),$$

while the cost of performing the FGWT of a point is

$$O({2}^{d}D\log n + dD + {d}^{2}\log {\epsilon }^{-\frac{1} {2} }).$$

The cost of the IGWT is similar but without the first term.

6 Multiscale Models of Densities

We present a simple example of how our techniques may be used to model measures supported on low-dimensional sets which are well approximated by the multiscale planes we constructed; results from more extensive investigations will be reported in an upcoming publication.

We sample n training points from a point cloud and, for a fixed scale j, we consider the coarse approximation ℳ_j [defined in Eq. (13.10)], and on each local linear approximating plane V_{j,k} we use the training set to construct a multifactor Gaussian model on Q_{j,k}: let π_{j,k} be the estimated distribution. We also estimate from the training data the probability π_j(k) that a given point in ℳ belongs to Q_{j,k} (recall that j is fixed, so this is a probability distribution over the \(\vert {\mathcal{K}}_{j}\vert \) labels of the planes at scale j). We may then generate new data points by drawing a \(k \in {\mathcal{K}}_{j}\) according to π_j and then drawing a point in V_{j,k} from the distribution π_{j,k}: this defines a probability distribution supported on ℳ_j that we denote by p_j.
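A sketch of the sampler for p_j just described: fit a Gaussian in the local scaling coordinates of each cell at scale j, pick a cell according to its empirical mass, and map a Gaussian draw back to ℝ^D through the affine plane of that cell. A single Gaussian per cell stands in for the multifactor Gaussian model used in our experiments, all names are ours, and each cell is assumed to contain more than d ≥ 2 points.

```python
import numpy as np

def fit_cell_models(X, cells_j, d):
    """cells_j maps a label k to the row indices of the training points in
    Q_{j,k}.  Fit the plane (m_{j,k}, Phi_{j,k}) and a Gaussian on the d
    local coordinates; also return the empirical masses pi_j."""
    models, counts = {}, {}
    for k, idx in cells_j.items():
        m = X[idx].mean(axis=0)
        _, _, Vt = np.linalg.svd(X[idx] - m, full_matrices=False)
        Phi = Vt[:d].T
        coords = (X[idx] - m) @ Phi                     # coordinates on V_{j,k}
        models[k] = (m, Phi, coords.mean(axis=0), np.cov(coords, rowvar=False))
        counts[k] = len(idx)
    w = np.array(list(counts.values()), dtype=float)
    return models, list(counts.keys()), w / w.sum()

def sample_pj(models, keys, pi_j, n, seed=0):
    """Draw n points from the multiscale model p_j."""
    rng = np.random.default_rng(seed)
    out = []
    for k in rng.choice(len(keys), size=n, p=pi_j):     # pick a cell ~ pi_j
        m, Phi, mu, Sigma = models[keys[k]]
        out.append(m + Phi @ rng.multivariate_normal(mu, Sigma))
    return np.vstack(out)
```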

In this way we may generate new data points which are consistent with both the geometry of the approximating planes V_{j,k} and the distribution of the data on each such plane. In Fig. 13.14 we display the result of such modeling on a simple manifold. In Fig. 13.15 we construct p_j by training on 2,000 handwritten 2s and 3s from the MNIST database; on the same training set we also train a comparison model, based on projecting the data onto the first a_j principal components, where a_j is chosen so that the cost of encoding the projection and the projected data equals the cost of encoding the GMRA up to scale j and the GMRA of the data, and then running the same multifactor Gaussian model used above for generating π_{j,k}. This leads to a probability distribution that we denote by \({p}_{SV {D}_{j}}\). In order to test the quality of these models, we consider the following two measures. The first is simply the Hausdorff distance between 2,000 samples drawn from each model and the training set: this measures how close the generated samples are to the training set. The second quantifies whether the model captures the variability of the true data and is computed by generating multiple point clouds of 2,000 points each from a fixed model and looking at the pairwise Hausdorff distances between such point clouds, which we call the within-model Hausdorff distance variability.
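Both diagnostics rest on the Hausdorff distance between finite point clouds, which can be computed with a k-d tree; the helper below (using SciPy's cKDTree) is a generic utility, not code from our implementation, and the commented usage refers to the sampler sketched above.

```python
from scipy.spatial import cKDTree

def hausdorff(A, B):
    """Two-sided Hausdorff distance between the point clouds A and B (rows)."""
    d_ab = cKDTree(B).query(A)[0].max()   # sup over A of the distance to B
    d_ba = cKDTree(A).query(B)[0].max()   # sup over B of the distance to A
    return max(d_ab, d_ba)

# Within-model variability: draw several clouds from the same model and look
# at their pairwise Hausdorff distances, e.g.
# clouds = [sample_pj(models, keys, pi_j, 2000, seed=s) for s in range(5)]
# dists = [hausdorff(clouds[a], clouds[b]) for a in range(5) for b in range(a + 1, 5)]
```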

Fig. 13.14

Approximations of the probability distribution concentrated on an S-shaped two-dimensional manifold within the GMRA framework. From left to right, top to bottom: 4,000 samples drawn from our approximate distribution, constructed at scales 4, 6, and 8, respectively, from 2,000 training samples. Bottom right: as a function of scale, the Hausdorff distance between points generated by the SVD and GWT models and the training data, as well as the Hausdorff distance variability of the generated data and the true data. We see that p_j has small distance to the training set, and decreasingly so for models constructed at finer scales, while \({p}_{SV {D}_{j}}\), being a model in the ambient space, generates points farther from the distribution. Looking at the plots of the in-model Hausdorff distance variability, we see that this measure increases for p_j as a function of j (reflecting the increasing expressive power of the model). Samples from the SVD model look like a Gaussian point cloud, as the kernel density estimator did not have enough training samples to adapt to the low-dimensional manifold structure

Fig. 13.15

Left and right columns: a training set of 2,000 digits 2 (respectively, 3) from the MNIST data set is used to train probability models with GMRA (p_j, one for each scale j in the GMRA of the training set) and SVD (\({p}_{SV {D}_{j}}\), one for each GMRA scale; see text). Left: 32 digits drawn from p_5 and \({p}_{SV {D}_{5}}\): the quality of p_5 is qualitatively better than that of \({p}_{SV {D}_{5}}\). Center: plots of the Hausdorff distance to the training set and of the in-model Hausdorff distance variability. Right: a similar experiment with a training set of 2,000 points from a SwissRoll-shaped manifold with no noise: the finest-scale GMRA-based models perform best (in terms of both approximation and variability; the SVD-based models are once again unable to take advantage of the low intrinsic dimension)

The bias–variance trade-off in the models p_j is the following: as j increases, the planes better model the geometry of the data (under our usual assumptions), so the bias of the model (and the approximation error) decreases as j increases; on the other hand, the sampling requirements for correctly estimating the density of Q_{j,k} projected onto V_{j,k} increase with j, as fewer and fewer training points fall in Q_{j,k}. A greedy pruning algorithm that selects, in each region of the data, the scale achieving the correct bias–variance trade-off, depending on the samples and the geometry of the data, similar in spirit to what has been studied in the case of multiscale approximation of functions, will be presented in a forthcoming publication. It should be remarked that such a model would be very complicated in the wavelet domain, as one would need to model very complex dependencies among wavelet coefficients, across both space and scale.