Abstract
Large data sets arise in a wide variety of applications and are often modeled as samples from a probability distribution in high-dimensional space. It is sometimes assumed that the support of such probability distribution is well approximated by a set of low intrinsic dimension, perhaps even a low-dimensional smooth manifold. Samples are often corrupted by high-dimensional noise. We are interested in developing tools for studying the geometry of such high-dimensional data sets. In particular, we present here a multiscale transform that maps high-dimensional data as above to a set of multiscale coefficients that are compressible/sparse under suitable assumptions on the data. We think of this as a geometric counterpart to multi-resolution analysis in wavelet theory: whereas wavelets map a signal (typically low dimensional, such as a one-dimensional time series or a two-dimensional image) to a set of multiscale coefficients, the geometric wavelets discussed here map points in a high-dimensional point cloud to a multiscale set of coefficients. The geometric multi-resolution analysis (GMRA) we construct depends on the support of the probability distribution, and in this sense it fits with the paradigm of dictionary learning or data-adaptive representations, albeit the type of representation we construct is in fact mildly nonlinear, as opposed to standard linear representations. Finally, we apply the transform to a set of synthetic and real-world data sets.
1 Introduction
We are interested in developing tools for harmonic analysis and processing of large data sets that arise in a wide variety of applications, such as sounds, images (RGB or hyperspectral, [16]), gene arrays, EEG signals [9], and manifold-valued data [44], to name a few. These data sets are often modeled as samples from a probability distribution in ℝ D, but it is sometimes assumed that the support of this probability distribution is in fact a set of low intrinsic dimension, perhaps with some nice geometric properties, for example, those of a smooth manifold.
Approximating and learning functions in high-dimensional spaces is hard because of the curse of dimensionality, so it is natural to try to exploit the intrinsic low dimensionality of the data: this idea has attracted wide interest across different scientific disciplines and various applications. One example of exploiting low intrinsic dimension is mapping the data to a low-dimensional space while preserving salient properties of the data [3, 19, 21, 27, 28, 30, 31, 46, 52, 54]. Another example is the construction of dictionaries of functions supported on the data [7, 17, 18, 38–40, 49, 50]. Yet another possibility is modeling the data as a union of low-dimensional subspaces, which is related to the ideas of sparse representations and dictionary learning ([1, 2, 10, 11, 51] and references therein).
When performing dimensionality reduction/manifold learning, the objective is to map data to a low-dimensional space. The maps used are often nonlinear, and at least two problems arise: that of extending the map from a training data set to new data points, and that of inverting such a map, i.e., going from a low-dimensional representation of a data point back to its higher-dimensional original representation. Both problems seem to be rather hard (depending, of course, on the map used) and to require some form of high-dimensional interpolation/extrapolation.
We will work directly in the high-dimensional space, taking advantage of the assumed low intrinsic dimensionality of the data and of its geometry. One advantage of this approach is that while our representations will be low-dimensional, we will not need to produce inverse maps from low dimensions to high dimensions. We construct geometric multi-resolution analysis (GMRA) for analyzing intrinsically low-dimensional data in high-dimensional spaces, modeled as samples from a d-dimensional set ℳ (in particular, a manifold) embedded in ℝ D, in the regime d ≪ D. Data may be sampled from a class of signals of interest; in harmonic analysis, a linear infinite-dimensional function space ℱ often models the class of signals of interest, and linear representations in the form f = ∑ i α i ϕ i , for f ∈ ℱ, in terms of a dictionary of atoms Φ : = { ϕ i } ⊆ ℱ are studied. Such dictionaries may be bases or frames and are constructed so that the sequence of coefficients {α i } i has desirable properties, such as some form of sparsity, or a distribution highly concentrated at zero. Several such dictionaries have been constructed for function classes modeling one- and two-dimensional signals of interest [8, 12, 14, 20, 22, 47] and are proven to provide optimal representations (in a suitably defined sense) for certain function spaces and/or for operators on such spaces. A more recent trend [1, 12, 41–43, 51, 55], motivated by the desire to model classes of signals that are not well modeled by the linear structure of function spaces, has been that of constructing data-adapted dictionaries: an algorithm is allowed to see samples from a class of signals ℱ (not necessarily a linear function space) and constructs a dictionary Φ : = { ϕ i } i that optimizes some functional, such as the sparsity of the coefficients for signals in ℱ.
There are several parameters in this problem: given training data from ℱ, one seeks Φ with I elements, such that every element in the training set may be represented, up to a certain precision ε, by at most m elements of the dictionary. The smaller I and m are, for a given ε, the better the dictionary.
Several current approaches may be summarized as follows [42]: consider a finite training set of signals \({X}_{n} = \{{x}_{i}\}_{i=1}^{n} \subset {\mathbb{R}}^{D}\), which we may represent by a ℝ D ×n matrix, and optimize the cost function

\({f}_{n}(\Phi ) := \frac{1}{n}{\sum }_{i=1}^{n}\ell ({x}_{i},\Phi ),\)

where Φ ∈ ℝ D ×I is the dictionary, and ℓ a loss function, for example,

\(\ell (x,\Phi ) :={ \min }_{\alpha \in {\mathbb{R}}^{I}}\frac{1}{2}\vert \vert x - \Phi \alpha \vert {\vert }_{2}^{2} + \lambda \vert \vert \alpha \vert {\vert }_{1},\)
where λ is a regularization parameter. This is basis pursuit [12] or lasso [53]. One typically adds constraints on the size of the columns of Φ, for example, | | ϕ i | | ℝ D ≤ 1 for all i, which we can write as \(\Phi \in \mathcal{C}\) for some convex set \(\mathcal{C}\). The overall problem may then be written as a matrix factorization problem with a sparsity penalty:

\({\min }_{\Phi \in \mathcal{C},\,\alpha \in {\mathbb{R}}^{I\times n}}\frac{1}{2}\vert \vert {X}_{n} - \Phi \alpha \vert {\vert }_{F}^{2} + \lambda \vert \vert \alpha \vert {\vert }_{1,1},\)

where \(\vert \vert \alpha \vert {\vert }_{1,1} :={ \sum }_{{i}_{1},{i}_{2}}\vert {\alpha }_{{i}_{1},{i}_{2}}\vert \). We refer the reader to [42] and references therein for techniques for attacking this optimization problem.
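To make the inner optimization concrete, the sparse-coding step above (computing α for a fixed dictionary Φ) can be sketched with a few lines of iterative soft-thresholding (ISTA); this is an illustrative sketch of that subproblem, not the algorithm of [42]:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (componentwise shrinkage).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, Phi, lam=0.1, n_iter=500):
    """Minimize 0.5 * ||x - Phi @ alpha||_2^2 + lam * ||alpha||_1 by ISTA."""
    # Step size 1/L, with L the Lipschitz constant of the smooth part's gradient.
    L = np.linalg.norm(Phi, 2) ** 2
    alpha = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ alpha - x)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

# Toy dictionary: with Phi the identity, the minimizer is the shrinkage of x.
Phi = np.eye(5)
x = np.array([1.0, 0.0, 0.0, 0.5, 0.0])
alpha = sparse_code(x, Phi, lam=0.1)
```

For an orthonormal Φ the iteration converges to the componentwise shrinkage of Φ*x, which makes the sparsity of the resulting coefficients easy to see.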
In this chapter we make additional assumptions on the data, specifically that it is well approximated by a smooth low-dimensional manifold, and we exploit this geometric assumption to construct data-dependent dictionaries. We use a multiscale approach that will lead to a GMRA of the data: this is inspired not only by quantitative geometric analysis techniques in geometric measure theory (see, e.g., [25, 32]) but also by multiscale approximation of functions in high dimensions [5, 6]. These dictionaries are structured in a multiscale fashion and, under suitable assumptions on the data, are computed efficiently; the expansion of a data point on the dictionary elements is guaranteed to have a certain degree of sparsity, m, and may also be computed by fast algorithms; the growth of the number of dictionary elements I as a function of ε is controlled depending on geometric properties of the data. This may be thought of as a wavelet analysis for data sets rather than for functions, where the geometry of a set of points is approximated, rather than a single function.
2 Geometric Multi-resolution Analysis
Let μ be a probability measure in ℝ D and ℳ its support. In this chapter we will consider the case in which ℳ is endowed with the structure of a Riemannian manifold, but the examples will show that the construction is robust enough to extend and be useful when this assumption is severely violated. In this setting we have a Riemannian metric g and a volume measure dvol. The geodesic distance on ℳ associated with g will be denoted by ρ. We shall assume that dμ is absolutely continuous with respect to dvol, with dμ ∕ dvol bounded above and below. We are interested in the case when the “dimension” d of ℳ is much smaller than the dimension of the ambient space ℝ D. While d is typically unknown in practice, efficient (multiscale, geometric) algorithms for its estimation are available (see [37], which also contains many references to previous work on this problem), under additional assumptions on the geometry of ℳ.
2.1 Dyadic Cubes
We start by constructing dyadic cubes on ℳ. This may be thought of as an analogue of dyadic cubes in Euclidean space. It is a collection of (measurable) subsets \(\{{Q{}_{j,k}\}}_{k\in {\mathcal{K}}_{j},j\geq {J}_{0}}\) of ℳ with the following properties [13, 23, 24]:
- For every j ∈ ℤ, \(\mu (\mathcal{M}\setminus {\cup }_{k\in {\mathcal{K}}_{j}}{Q}_{j,k}) = 0\).
- For j′ ≥ j and \(k{^\prime} \in {\mathcal{K}}_{j{^\prime}}\), either Q j′, k′ ⊆ Q j, k or μ(Q j′, k′ ∩ Q j, k ) = 0.
- For j < j′ and \(k{^\prime} \in {\mathcal{K}}_{j{^\prime}}\), there exists a unique \(k \in {\mathcal{K}}_{j}\) such that Q j′, k′ ⊆ Q j, k .
- Each Q j, k contains a point c j, k such that \({B}_{{c}_{1}\cdot {2}^{-j}}^{\mathcal{M}}({c}_{j,k}) \subseteq {Q}_{j,k} \subseteq {B}_{{2}^{-j}}^{\mathcal{M}}({c}_{j,k}),\) for a constant c 1 depending on intrinsic geometric properties of ℳ. Here \({B}_{r}^{\mathcal{M}}(x)\) is the ρ-ball inside ℳ of radius r > 0 centered at x ∈ ℳ. In particular, we have μ(Q j, k ) ∼ 2 − dj.
Let \(\mathcal{T}\) be the tree structure associated to the decomposition above: for any j ∈ ℤ and \(k \in {\mathcal{K}}_{j}\), we let \(\mathrm{ch}(j,k) = \left \{k{^\prime} \in {\mathcal{K}}_{j+1} : {Q}_{j+1,k{^\prime}} \subseteq {Q}_{j,k}\right \}\). We use the notation (j, x) to represent the unique \((j,k(x)),k(x) \in {\mathcal{K}}_{j}\) such that x ∈ Q j, k(x).
2.2 Multiscale SVD and Intrinsic Dimension Estimation
An introduction to the use of the ideas we present for the estimation of intrinsic dimension of point clouds is in [37] and references therein (see [35, 36] for previous short accounts). These types of constructions are motivated by ideas in both multiscale geometric measure theory [24, 26, 33] and adaptive approximation of functions in high dimensions [5, 6].
In each dyadic cell Q j, k we consider the mean

\({m}_{j,k} := \mathbb{E}\left [x\,\vert \,x \in {Q}_{j,k}\right ]\)

and the local covariance

\({\mathrm{cov}}_{j,k} := \mathbb{E}\left [(x - {m}_{j,k}){(x - {m}_{j,k})}^{{_\ast}}\,\vert \,x \in {Q}_{j,k}\right ],\)

where vectors in ℝ D are considered D-dimensional column vectors. Let the rank-d singular value decomposition (SVD) [29] of cov j, k be

\({\mathrm{cov}}_{j,k} = {\Phi }_{j,k}{\Sigma }_{j,k}{\Phi }_{j,k}^{{_\ast}},\)

where Φ j, k is an orthonormal D ×d matrix and Σ j, k is a diagonal d ×d matrix. Let

\({V}_{j,k} := \langle {\Phi }_{j,k}\rangle, \qquad {\mathbb{V}}_{j,k} := {m}_{j,k} + {V}_{j,k},\)

where ⟨A⟩ denotes the span of the columns of A, so that \({\mathbb{V}}_{j,k}\) is the affine subspace of dimension d parallel to V j, k and passing through m j, k . It is an approximate tangent space to ℳ at location m j, k and scale 2 − j; in fact, by the properties of the SVD, it provides the best d j, k -dimensional planar approximation to ℳ in the least squares sense:

\({\mathbb{V}}_{j,k} = \mathrm{argmi{n}}_{\Pi }\,\mathbb{E}\left [\vert \vert x - {\mathbb{P}}_{\Pi }(x)\vert {\vert }^{2}\,\vert \,x \in {Q}_{j,k}\right ],\)

where the minimization is over the set of all affine d j, k -planes Π and ℙ Π is the orthogonal projection onto the affine plane Π. Let ℙ j, k be the associated affine projection

\({\mathbb{P}}_{j,k}(x) := {\Phi }_{j,k}{\Phi }_{j,k}^{{_\ast}}(x - {m}_{j,k}) + {m}_{j,k}.\)
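In code, the computation at a single cell amounts to a truncated SVD of the centered data. The following illustrative sketch (not the chapter's pseudo-code; notation chosen to match the text) computes the mean, the basis Φ j,k, and the affine projection ℙ j,k:

```python
import numpy as np

def local_affine_projector(points, d):
    """Best-fit d-dimensional affine plane for a cell Q_{j,k}, via the rank-d
    SVD of the centered points; returns the center m, the basis Phi (D x d),
    and the affine projection P_{j,k}(x) = Phi Phi^T (x - m) + m."""
    m = points.mean(axis=0)                        # cell mean m_{j,k}
    _, _, Vt = np.linalg.svd(points - m, full_matrices=False)
    Phi = Vt[:d].T                                 # top-d right singular vectors
    def project(x):
        return Phi @ (Phi.T @ (x - m)) + m
    return m, Phi, project

# Points on a 2-plane in R^5 (first two coordinates free, last three fixed):
rng = np.random.default_rng(1)
coef = rng.standard_normal((50, 2))
points = np.hstack([coef, np.zeros((50, 3))]) + np.array([0.0, 0.0, 1.0, 1.0, 1.0])
m, Phi, project = local_affine_projector(points, d=2)
```

A point already on the plane is reproduced exactly, while an off-plane point is mapped to its closest point on the plane, as the least-squares characterization above predicts.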
The behavior of the singular values in the matrix Σ j, x in Eq. (13.3), as a function of the scale j for x fixed, contains a lot of useful information about the geometry of the data around x. In particular, they may be used to detect the intrinsic dimension of the data in a neighborhood of x. We need to introduce several definitions before stating some results. Because of space constraints, we will consider here the case when ℳ is a manifold of co-dimension one, leaving the discussion of the general case (ℳ of arbitrary co-dimension, and ℳ not a manifold) to [2, 37]. Let
where the κ i ’s are the sectional curvatures of the manifold. We refer the reader to [37] for an extended discussion of these quantities, which arise naturally in the study of multiscale SVD of manifolds. When ℳ has co-dimension larger than 1, more complicated functions of the curvatures arise [similar to those in Eq. (13.18)]; in the non-manifold case a notion of L 2-curvature that generalizes the above may be used [37]. Suppose we sample n points \({x}_{1},\ldots,{x}_{n}\) i.i.d. from the volume measure on the manifold, and each is perturbed by an i.i.d. realization of white Gaussian noise in ℝ D with covariance σ 2 I D . We denote by \(\tilde{{X}}_{n,\tilde{z},r}\) the set of noisy samples x i + η i that are in \({B}_{z+{\eta }_{z}}(r)\), where η z is the noise corresponding to the data point z: this is the data being observed, sampled and noisy, at the disposal of an algorithm. We denote by X z, r a random variable distributed in B z (r) ∩ ℳ according to the volume measure: this is the ideal data, uncorrupted by noise and sampling. Finally, we let \({r}_{=}^{2} := {r}^{2} - 2{\sigma }^{2}D\).
Let r = 2 − j and X z, r = ℳ ∩ B z (r). The behavior of the ideal covariance of X z, r (which is comparable to cov j, k ) as a function of r reveals interesting properties of the data; for example, it may be used to measure the intrinsic dimension and the L 2-curvature of ℳ around a point z, since the d largest singular values grow quadratically in r, while the remaining ones measure L 2-curvatures. In particular, for r small the largest gap between these singular values is the dth gap, leading to an estimator of intrinsic dimension. However, since we do not have access to X z, r , we are interested in the behavior of the empirical covariance matrix of the noisy samples \(\tilde{{X}}_{n,\tilde{z},r}\) as a function of r. In particular, we ask how close it is to cov(X z, r ), and when the dth gap of \(\mathrm{cov}(\tilde{{X}}_{n,\tilde{z},r})\) is the largest, so that we may use it to estimate the intrinsic dimension of ℳ. Observe that while we would like to choose r small, since then the difference in behavior between the top d singular values and the remaining ones is largest, we are no longer free to do so: having only n samples forces a lower bound on r, since small balls contain too few samples to estimate the covariances. Moreover, the presence of noise also puts a lower bound on the interesting range of r: since the expected length of a noise vector is \(\sigma \sqrt{D}\), and the covariance of the noise has norm σ 2, we expect that r should be larger than a function of these quantities in order for \(\mathrm{cov}(\tilde{{X}}_{n,\tilde{z},r})\) to provide meaningful information about the geometry of ℳ.
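As a toy illustration of the largest-gap principle (a caricature of, and not a substitute for, the multiscale estimator of [37]), the following sketch recovers the dimension of a noisy two-dimensional patch in ℝ 10:

```python
import numpy as np

def dim_from_largest_gap(X, z, r):
    """Estimate intrinsic dimension at z as the index of the largest gap in the
    singular values of the empirical covariance of the samples in B_z(r)."""
    ball = X[np.linalg.norm(X - z, axis=1) <= r]
    s = np.linalg.svd(np.cov(ball.T), compute_uv=False)
    gaps = s[:-1] - s[1:]
    return int(np.argmax(gaps)) + 1

# A 2-dimensional flat patch in R^10, corrupted by small ambient Gaussian noise.
rng = np.random.default_rng(0)
n, D, sigma = 2000, 10, 1e-3
X = np.zeros((n, D))
X[:, :2] = rng.uniform(-1.0, 1.0, (n, 2))
X += sigma * rng.standard_normal((n, D))
d_hat = dim_from_largest_gap(X, z=np.zeros(D), r=1.0)
```

In this easy regime (flat patch, tiny noise, many samples) the second gap dominates and d_hat = 2; the theorems below quantify the range of r for which this remains true under curvature and larger noise.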
Here and in what follows C, C 1, and C 2 will denote numerical constants whose value may change with each occurrence.
Theorem 1 (n → ∞).
Fix z ∈ℳ; assume D ≥ C, \(\sigma \sqrt{D} \leq \frac{\sqrt{d}} {2\sqrt{2}\kappa }\),
Then for n large enough, with probability at least \(1 - C{e}^{-C\sqrt{D}}\) , we have
Moreover, in the range of scales
\({\Delta }_{d}({\tilde{X}}_{n,\tilde{z},r})\) is the largest gap, with probability as above.
Theorem 1 essentially says that if we have O(d log d) points in B z (r), and the noise level σ is not too large compared to curvature, then the largest gap in the spectrum of the empirical covariance matrix of the data in B z (r) is the dth gap, with high probability, for r in the range:
The upper bound \(\frac{\lambda } {\kappa }\) is dictated by curvature, while the lower bound \(\sigma \sqrt{D}\) is forced by the noise level: the lower bound is comparable to the size of the covariance of the noise, the upper bound is comparable to the size of the covariance of the data computed at the largest radius λ ∕ κ where the curvature is not too large, and the term in the middle is comparable to the size of the data along the local approximating plane.
Our second theorem explores the regime where the ambient dimension D goes to infinity while the number of samples n stays fixed, depending only on the intrinsic dimension. While n samples certainly lie in an n-dimensional affine subspace, because of the ambient noise such a subspace is unreliable at small scales, and this regime captures the range of scales where we have independence from the ambient dimension and an essentially linear dependence on d of the minimal number of points needed in \({B}_{z}({r}_{=})\).
Theorem 2 (D → ∞,\(\sigma \sqrt{D} = O(1)\)).
Fix z ∈ℳ. Let the assumptions of Theorem 1 and the restriction (13.7) hold. Fix t ∈ (C,Cd) and assume \(\epsilon := {\epsilon }_{{r}_{=},n,t} \leq \frac{1} {2}\) . Then for D ≥ C and m ≤ D, and σ 0 constant, for r in the range of scales (13.7) intersected with
the following hold, with probability at least \(1 - C{e}^{-C{t}^{2} }\):
- (i) \({\Delta }_{d}(\mathrm{cov}({\tilde{X}}_{n,\tilde{z},r}))\) is the largest gap of \(\mathrm{cov}({\tilde{X}}_{n,\tilde{z},r})\).
- (ii) \(\vert \vert \mathrm{cov}({\tilde{X}}_{n,\tilde{z},r})-\mathrm{cov}({{X}_{z,r}}_{=})-{\sigma }^{2}{I}_{D}\vert \vert \leq \Big{(}{\sigma }_{0}^{2}\epsilon +{\lambda }_{\max }{\sigma }_{0}r+\left ({\lambda }_{\max } + 2{\sigma }_{0}\kappa + \frac{\epsilon } {m}\right ){r}^{2}+O\left (\frac{{r}^{3}} {\epsilon } \right )\Big{)}\frac{\epsilon } {d}\).
These bounds, and in fact the finer bounds of [37], may of course be used to obtain perturbation bounds for the empirical noisy tangent spaces estimated from the top d singular vectors of the empirical covariance matrix of the data in \({B}_{\tilde{z}}(r)\) (a slightly better approach, taken in [37], uses Wielandt's lemma before applying the usual sin Θ theorems). It turns out that, since to first order the noise adds only a multiple of the identity matrix to the covariance, the approximate tangent space computed in this fashion is very stable, even in the regime of Theorem 2 [37].
This is the subject of [37], where it is shown that under rather general conditions on the geometry of the data (much more general than the manifold case) and under sampling and ambient noise, one may use these multiscale singular values to estimate the intrinsic dimension of the data. Moreover, under suitable assumptions, the number of samples in a ball around x required in order to do so is linear in the intrinsic dimension and independent of the ambient dimension. We refer the reader to [10, 35–37]. We now proceed by using not only the information in the singular values but also in the singular vectors in the SVD decomposition in Eq. (13.3).
2.3 Geometric Scaling Functions
The set ℙ j, k (Q j, k ) is the projection of Q j, k onto the local linear approximation given by the affine subspace in Eq. (13.4). The fact that these linear subspaces are affine has various implications for our construction, creating mild nonlinearities and forcing us to construct a different transform and data representation, one that is not simply a linear combination of certain atoms. On the other hand, the construction is extremely natural: not only do the nonlinearities involved cause no conceptual or computational overhead, but in fact we shall obtain algorithms which are faster than those needed to compute sparse linear representations in the standard dictionary learning setting. \(\{{\Phi {}_{j,k}\}}_{k\in {\mathcal{K}}_{j}}\) are the geometric analogue of a family of scaling functions at scale j, and therefore we call them geometric scaling functions. They “span” an approximate piecewise linear manifold at scale j,

\({\mathcal{M}}_{j} :={ \cup }_{k\in {\mathcal{K}}_{j}}{\mathbb{P}}_{j,k}({Q}_{j,k}).\)
Under general conditions, ℳ j → ℳ in the Hausdorff distance as j → + ∞. It is natural to define the nonlinear projection of ℳ onto ℳ j by

\({P}_{{\mathcal{M}}_{j}}(x) := {\mathbb{P}}_{j,x}(x) =: {x}_{j},\qquad x \in \mathcal{M}.\)
Note that in general ℳ j is not contained in ℳ j + 1, due to the nonlinearity of the underlying manifold ℳ. This is important as we move into the next section when we will encode “the difference” between ℳ j and ℳ j + 1.
2.4 Geometric Wavelets
In wavelet analysis, wavelets span the difference between scaling function spaces and are contained in the finer scale scaling function space. In our setting this corresponds to encoding the difference needed to go from ℳ j to ℳ j + 1: for a fixed x ∈ ℳ, x j + 1 − x j ∈ ℝ D, but this difference is in general not contained in ℳ j + 1, due to the nonlinearity of ℳ j and ℳ j + 1. The main observation is that the collection of vectors x j + 1 − x j , for x varying in Q j + 1, x , is nevertheless contained in a low-dimensional subspace and may therefore be encoded efficiently in terms of a basis of that subspace. We proceed as follows: for j ≤ J − 1 we let

\({Q}_{{\mathcal{M}}_{j+1}}(x) := {P}_{{\mathcal{M}}_{j+1}}(x) - {P}_{{\mathcal{M}}_{j}}(x) = {x}_{j+1} - {x}_{j}.\)
Let \({W}_{j+1,x} := (I - {P}_{j,x}){V }_{j+1,x}\), let Q j + 1, x be the orthogonal projection onto W j + 1, x , and let Ψ j + 1, x be an orthonormal basis for W j + 1, x , which we will call a geometric wavelet basis. Observe that dim W j + 1, x ≤ dim V j + 1, x = d j + 1, x . We define several quantities below:
Then we may rewrite Eq. (13.12) as
where J ≥ j + 1 is the index of the finest scale (and the last term vanishes as J → + ∞, under general conditions). Note that this multiscale expansion contains terms that involve not only the current scale j + 1 and the previous scale j but terms from finer scales as well, all the way to the finest scale J. This is once again due to the nonlinearity of ℳ and of the whole construction: knowing P ℳ j + 1(x) is not enough to construct P ℳ j (x), since the whole local nonlinear structure of ℳ determines the locally optimal projection P ℳ j (x). In [2] we describe a variation of the transform where optimality is relaxed and a “two-scale equation” is obtained.
In terms of the geometric scaling functions and wavelets, the above may be written as
This shows that the difference x j + 1 − x j can be expressed as the sum of a component in W j + 1, x , a second component that depends only on the cell (j + 1, x) (but not on the point x itself), which accounts for the translation of centers and lies in V j, x ⊥ (but not necessarily in W j + 1, x ), and a sum of projections onto V j, x of the differences x l + 1 − x l at finer scales. By construction we have the two-scale equation

\({P}_{{\mathcal{M}}_{j}}(x) = {P}_{{\mathcal{M}}_{j+1}}(x) - {Q}_{{\mathcal{M}}_{j+1}}(x),\)
which can be iterated across scales, leading to a multiscale decomposition along low-dimensional subspaces, with efficient encoding and algorithms. We think of P j, k as being attached to the node (j, k) of \(\mathcal{T}\) and the Q j + 1, k′ as being attached to the edge connecting the node (j + 1, k′) to its parent.
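In code, constructing a wavelet basis for W j+1,x from the scaling bases of a parent and child cell is a projection followed by an orthonormalization; the following is an illustrative sketch (notation as in the text, toy data):

```python
import numpy as np

def geometric_wavelet_basis(Phi_parent, Phi_child, tol=1e-10):
    """Orthonormal basis Psi_{j+1,x} for W_{j+1,x} = (I - P_{j,x}) V_{j+1,x}:
    the directions of the child's plane not already captured by the parent's."""
    # (I - P_j) applied to the child's basis, without forming the D x D projector.
    residual = Phi_child - Phi_parent @ (Phi_parent.T @ Phi_child)
    U, s, _ = np.linalg.svd(residual, full_matrices=False)
    return U[:, s > tol]          # dim W_{j+1,x} <= dim V_{j+1,x}

# Parent plane = span{e1}; child plane = span{e1, e2} in R^3:
Phi_parent = np.array([[1.0], [0.0], [0.0]])
Phi_child = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
Psi = geometric_wavelet_basis(Phi_parent, Phi_child)
```

Here the child's plane adds exactly one new direction (e2), so Ψ has a single column, illustrating that dim W j+1,x can be strictly smaller than d j+1,x.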
We say that the set of multiscale piecewise affine operators {P ℳ j } and {Q ℳ j + 1} form a geometric multi-resolution analysis or GMRA for short.
2.5 Approximation for Manifolds
We analyze the error in approximating a d-dimensional manifold in ℝ D by the geometric wavelet representation. Our analysis fully explains the examples in Sect. 13.4.1. We have the following theorem from [2]:
Theorem 3.
Let (ℳ,ρ,μ) be a compact \({\mathcal{C}}^{1+\alpha }\) Riemannian manifold of dimension d isometrically embedded in ℝ D , with α ∈ (0,1], and μ absolutely continuous with respect to the volume measure on ℳ. Let {P ℳ j ,Q ℳ j+1 } be a GMRA for (ℳ,ρ,μ). For any x ∈ℳ, there exists a scale j 0 = j 0 (x) such that for any j ≥ j 0 and any p > 0, if we let \(d{\mu }_{j,x} := \mu {({Q}_{j,x})}^{-1}\,d\mu \),
If α < 1, κ(x) depends on the \({\mathcal{C}}^{1+\alpha }\) norm of a coordinate chart from T x (ℳ) to Q j,x ⊆ℳ and on \({\left \| \frac{d\mu } {d\mathrm{vol}}\right \|}_{{L}^{\infty }({Q}_{j,x})}\).
If α = 1,
with
and the D − d matrices H l (x) are the d-dimensional Hessians of ℳ at x.
Observe that κ 2 can be smaller than κ 1 (by a constant factor) or larger (by factors depending on d 2), depending on the spectral properties and commutativity relations of the Hessians H l . \({\kappa }_{2}^{2}\) may be unexpectedly small, in the sense that it may scale as \({d}^{-2}{r}^{4}\) as a function of d and r, as observed in [37]. For the proof we refer the reader to [2].
3 Algorithms
We present in this section algorithms implementing the construction of the GMRA and the corresponding geometric wavelet transform (GWT).
3.1 Construction of Geometric Multi-resolution Analysis
The first step in the construction of the geometric wavelets is to build a nested geometric partition of the data set, forming a tree structure. To this end, one may consider any of the methods listed below:
- (I) Use of METIS [34]: a multiscale variation of iterative spectral partitioning. We construct a weighted graph as done for the construction of diffusion maps [15, 21]: we add an edge between each data point and its k nearest neighbors and assign to any such edge between x i and x j the weight \({e}^{-\vert \vert {x}_{i}-{x}_{j}\vert {\vert }^{2}/\sigma }\). Here k and σ are parameters whose selection we do not discuss here (but see [45] for a discussion in the context of molecular dynamics data). In practice, we choose k between 10 and 50 and choose σ adaptively at each point x i as the distance between x i and its ⌊k ∕ 2⌋-th nearest neighbor.
- (II) Use of cover trees [4].
- (III) Use of iterated PCA: at scale 1, compute the top d principal components of the data and partition the data based on the sign of the (d + 1)-st singular vector. Repeat on each of the two resulting partitions.
- (IV) Iterated k-means: at scale 1, partition the data based on k-means clustering, then iterate on each of the elements of the partition.
Each construction has pros and cons, in terms of performance and guarantees. For (I) we refer the reader to [34], for (II) to [4] (which also discusses several other constructions), and for (III) and (IV) to [48]. Only (II) provably yields the properties required of the cells Q j, k . However constructed, we denote by {Q j, k } the family of resulting dyadic cells and let \(\mathcal{T}\) be the associated tree structure, as in Sect. 2.1.
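For concreteness, the iterated-PCA variant (III) can be sketched in a few lines; this is an illustrative sketch, not the code of [48] or the pseudo-code of Fig. 13.4:

```python
import numpy as np

def pca_split_cells(X, d=2, min_size=32):
    """Variant (III): recursively split X by the sign of its component along
    the (d+1)-st singular vector of the centered data; returns the leaf cells."""
    if len(X) < 2 * min_size:
        return [X]
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    side = Xc @ Vt[min(d, len(Vt) - 1)] >= 0
    if side.all() or not side.any():
        return [X]                      # degenerate split: stop
    return pca_split_cells(X[side], d, min_size) + pca_split_cells(X[~side], d, min_size)

# A 2-d "Swiss roll"-like surface sampled in R^3:
rng = np.random.default_rng(0)
t = rng.uniform(1.0, 3.0 * np.pi, 1000)
X = np.column_stack([t * np.cos(t), rng.uniform(0.0, 5.0, 1000), t * np.sin(t)])
cells = pca_split_cells(X, d=2)
```

The leaves partition the point cloud, and the recursion tree plays the role of \(\mathcal{T}\); unlike (II), no guarantee on the cell diameters μ(Q j,k ) ∼ 2 − dj is offered by this simple variant.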
In Fig. 13.4 we display pseudo-code for the GMRA of a data set X n given a precision ε > 0 and a method τ 0 for choosing local dimensions (e.g., using thresholds or a fixed dimension). The code first constructs a family of multiscale dyadic cells (with local centers c j, k and bases Φ j, k ) and then computes the geometric wavelets Ψ j, k and translations w j, k at all scales. In practice, we use METIS [34] to construct a dyadic (not \({2}^{d}\)-adic) tree \(\mathcal{T}\) and the associated cells Q j, k .
3.2 The Fast Geometric Wavelet Transform and Its Inverse
For simplicity of presentation, we shall assume x = x J ; otherwise, we may first project x onto the local linear approximation of the cell Q J, x and use x J instead of x from now on. That is, we will define x j; J = P ℳ j (x J ), for all j < J, and encode the differences x j + 1; J − x j; J using the geometric wavelets. Note also that \(\|{x}_{j;J} - {x}_{j}\| \leq \| x - {x}_{J}\|\) at all scales.
The geometric scaling and wavelet coefficients {p j, x }, {q j + 1, x }, for j ≥ 0, of a point x ∈ ℳ are chosen to satisfy the equations
The computation of the coefficients, from fine to coarse, is simple and fast: since we assume x = x J , we have
Moreover the wavelet coefficients q j + 1, x [defined in Eq. (13.20)] are obtained from Eq. (13.14):
Note that \({\Phi }_{j,x}^{{_\ast}}{\Phi }_{J,x}\) and \({\Psi }_{j+1,x}^{{_\ast}}{\Phi }_{j+1,x}\) are both small matrices (at most d j, x ×d j, x ) and are the only matrices we need to compute and store (once and for all, and only up to a specified precision) in order to compute all the wavelet coefficients q j + 1, x and the scaling coefficients p j, x , given p J, x at the finest scale.
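Schematically (with notation assumed from the text, since Eqs. (13.14) and (13.20) fix the exact recursion), scaling coefficients at a coarser scale follow from the finest-scale ones through such small matrices alone:

```python
import numpy as np

def scaling_coeffs(x, c, Phi):
    # p_{j,x} = Phi_{j,x}^* (x - c_{j,x}): coefficients on the local plane.
    return Phi.T @ (x - c)

def coarse_from_fine(p_J, Phi_j, Phi_J, c_j, c_J):
    # Chained computation using only the small matrix Phi_j^* Phi_J and the
    # centers, never the ambient D-dimensional point:
    # p_{j,x} = (Phi_j^* Phi_J) p_J + Phi_j^* (c_J - c_j).
    return (Phi_j.T @ Phi_J) @ p_J + Phi_j.T @ (c_J - c_j)

# Consistency check on a toy pair of planes in R^6 (hypothetical cells):
rng = np.random.default_rng(0)
Phi_J, _ = np.linalg.qr(rng.standard_normal((6, 2)))
Phi_j, _ = np.linalg.qr(rng.standard_normal((6, 2)))
c_J, c_j = rng.standard_normal(6), rng.standard_normal(6)
p_J = np.array([0.7, -1.2])
x_J = Phi_J @ p_J + c_J                 # point on the fine-scale plane
p_j_direct = scaling_coeffs(x_J, c_j, Phi_j)
p_j_chained = coarse_from_fine(p_J, Phi_j, Phi_J, c_j, c_J)
```

The identity Φ j ∗ (x J − c j ) = (Φ j ∗ Φ J ) p J + Φ j ∗ (c J − c j ) is exact, which is what makes the fine-to-coarse sweep independent of D once the small matrices are precomputed.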
In Figs. 13.5 and 13.6 we display pseudo-codes for the computation of the forward and inverse geometric wavelet transforms (F/IGWT). The input to FGWT is a GMRA object, as returned by GeometricMultiResolutionAnalysis, and a point x ∈ ℳ. Its output is the wavelet coefficients of the point x at all scales, which are then used by IGWT for reconstruction of the point at all scales.
For any x ∈ ℳ J , the set of coefficients
is called the discrete GWT of x. Letting \({d}_{j,x}^{w} := \mathrm{rank}({\Psi }_{j+1,x})\), the length of the transform is \(d +{ \sum }_{j>0}{d}_{j,x}^{w}\), which is bounded by (J + 1)d in the case of samples from a d-dimensional manifold (since \({d}_{j,x}^{w} \leq d\)).
4 Examples
We conduct numerical experiments in this section to demonstrate the performance of the algorithm (i.e., Figs. 13.4–13.6).
4.1 Low-Dimensional Smooth Manifolds
To illustrate the construction presented so far, we consider simple synthetic data sets: a SwissRoll, an S-Manifold, and an Oscillating2DWave, all two-dimensional manifolds embedded in ℝ 50 (see Fig. 13.7). We apply the algorithm to construct the GMRA and obtain the FGWT of the sampled data (10,000 points, without noise) in Fig. 13.8. We use the manifold dimension d j, k = d = 2 at each node of the tree when constructing scaling functions and choose the smallest finest scale achieving an absolute precision of 0.001 in each case. We compute the average magnitude of the wavelet coefficients at each scale and plot it as a function of scale in Fig. 13.8. The reconstructed manifolds obtained by the inverse geometric wavelet transform (at selected scales) are shown in Fig. 13.9, together with a plot of relative approximation errors,
where X n is the training data of n samples. Both the approximation error and the magnitude of the wavelet coefficients decrease quadratically with respect to scale as expected.
To study the compressibility of the wavelet coefficients and the rate of change of the approximation errors (using compressed wavelet coefficients), we threshold the wavelet coefficients. To this end, we use a smaller precision, 10 − 5, so that the algorithm can examine a larger interval of thresholds. We threshold the wavelet coefficients of the Oscillating2DWave data at the level 0.01 and plot in Fig. 13.10 the reduced matrix of wavelet coefficients and the corresponding best reconstruction of the manifold (i.e., at the finest scale).
4.2 Data Sets
4.2.1 MNIST Handwritten Digits
We first consider the MNIST data set of images of handwritten digits, each of size 28 ×28. We use the digits 0 and 1 and randomly sample 3,000 images per digit from the database. We apply the algorithm to construct the geometric wavelets and show the wavelet coefficients and the reconstruction errors at all scales in Fig. 13.11. We select local dimensions for the scaling functions by keeping 50 % and 95 % of the variance at the nonleaf and leaf nodes, respectively. We observe that the magnitudes of the coefficients stop decaying after a certain scale. This indicates that the data is not on a smooth manifold. We expect optimization of the tree and of the wavelet dimensions, in future work, to lead to a more efficient representation in this case.
We then fix a data point (or equivalently an image), for each digit, and show in Fig. 13.12 its reconstructed coordinates at all scales and the corresponding dictionary elements (all of which are also images). We see that at every scale we have a handwritten digit, which is an approximation to the fixed image, and those digits are refined successively to approximate the original data point. The elements of the dictionary quickly fix the orientation and the thickness, and then they add other distinguishing features of the image being approximated.
4.3 Example: A Connection to Fourier Analysis and FFT
Consider band-limited functions of band B:

\(B{F}_{B} :=\{ f \in {L}^{2}(\mathbb{R}) : \mathrm{supp}\,\hat{f} \subseteq [-B,B]\}.\)
Classes of smooth functions (e.g., W k, 2) are essentially characterized by their L 2-energy in dyadic spectral bands of the form \([-{2}^{j+1}\pi,-{2}^{j}\pi ] \cup [{2}^{j}\pi,{2}^{j+1}\pi ]\), i.e., by the L 2-size of their projection onto \(B{F}_{{2}^{j+1}} \ominus B{F}_{{2}^{j}}\) (some care is of course needed with smooth frequency cutoffs, but this issue is not relevant for our purposes here). We generate random smooth (band-limited!) functions as follows:
with a j random Gaussian (or bounded) with mean \({2}^{-\lfloor \frac{j} {J}\rfloor \alpha }\) and standard deviation \({2}^{-\lfloor \frac{j} {J}\rfloor \alpha } \cdot \frac{1} {5}\). The GMRA associated with a random sample of this family of functions takes advantage of the multiscale nature of the data and organizes this family of functions in a Littlewood–Paley type of decomposition: the scaling function subspace at scale j roughly corresponds to \(B{F}_{{2}^{j+1}} \ominus B{F}_{{2}^{j}}\), and the GMRA of a point is essentially a block Fourier transform, where coefficients in the same dyadic band are grouped together. Observe that the cost of the GMRA of a point f is comparable to the cost of the fast Fourier transform.
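One plausible instantiation of this family (the exact formula and parameter choices of the chapter are not reproduced here, so the specific form below is an assumption made for illustration) together with a check of its Littlewood–Paley-type band structure:

```python
import numpy as np

J, alpha, n_grid = 4, 2.0, 512
x_grid = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)

def random_bandlimited(rng):
    # Assumed form: f(x) = sum_{j=1}^{2^J} a_j sin(j x), with a_j Gaussian of
    # mean 2^{-floor(j/J) alpha} and standard deviation one fifth of the mean.
    f = np.zeros(n_grid)
    for j in range(1, 2 ** J + 1):
        scale = 2.0 ** (-np.floor(j / J) * alpha)
        f += rng.normal(scale, scale / 5.0) * np.sin(j * x_grid)
    return f

# Average spectral energy in the dyadic frequency bands [2^k, 2^{k+1}):
rng = np.random.default_rng(0)
F = np.stack([random_bandlimited(rng) for _ in range(200)])
power = np.abs(np.fft.rfft(F, axis=1)) ** 2
band_energy = [power[:, 2 ** k : 2 ** (k + 1)].sum(axis=1).mean() for k in range(1, 4)]
```

The energy per dyadic band decays with the band index, which is exactly the structure the GMRA organizes into its scaling function subspaces.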
5 Data Representation, Compression, and Computational Considerations
A set of n points in ℝ D can trivially be stored in space Dn. If it lies, up to a least-squares error ε, in a linear subspace of dimension d ε ≪ D, we can instead encode the n points in space d ε(D + n): the cost of encoding a basis for the subspace plus that of encoding the coefficients of the points in that basis. This is much less than the trivial encoding when d ε ≪ D. It can be shown [2] that the cost of encoding with a GMRA a \({\mathcal{C}}^{2}\) manifold ℳ of dimension d sampled at n points, for a fixed precision ε > 0 and n large, is \(O({\epsilon }^{-\frac{d}{2}}dD + nd{\log }_{2}{\epsilon }^{-\frac{1}{2}})\).
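The single-subspace count d ε(D + n) can be checked directly; a minimal sketch of such an encoder, on synthetic data lying exactly on a d-dimensional subspace (the function names are illustrative, not from the text):

```python
import numpy as np

def subspace_encode(X, d):
    """Rank-d encoding: store a D x d basis plus an n x d coefficient
    matrix (and the mean) -- about d*(D + n) numbers instead of D*n."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:d].T               # D x d orthonormal basis of the subspace
    coeffs = (X - mean) @ basis    # n x d coefficients of the points
    return basis, coeffs, mean

def subspace_decode(basis, coeffs, mean):
    return coeffs @ basis.T + mean

# Synthetic data: n points lying exactly on a d-dimensional subspace of R^D
rng = np.random.default_rng(2)
n, D, d = 1000, 100, 3
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, D))
basis, coeffs, mean = subspace_encode(X, d)
err = np.linalg.norm(X - subspace_decode(basis, coeffs, mean)) / np.linalg.norm(X)
print(f"storage: {d * (D + n)} vs {D * n} numbers, relative error {err:.1e}")
```

The GMRA encoding quoted above replaces the single global subspace by the tree of local planes, trading the one-off basis cost dD for the \({\epsilon }^{-\frac{d}{2}}dD\) cost of all the local bases.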
Also, the cost of the algorithm is
while the cost of performing the FGWT of a point is
The cost of the IGWT is similar but without the first term.
6 Multiscale Models of Densities
We present a simple example of how our techniques may be used to model measures supported on low-dimensional sets which are well approximated by the multiscale planes we constructed; results from more extensive investigations will be reported in an upcoming publication.
We sample n training points from a point cloud ℳ and, for a fixed scale j, we consider the coarse approximation ℳ j [defined in Eq. (13.10)], and on each local linear approximating plane V j, k we use the training set to construct a multifactor Gaussian model on Q j, k : let π j, k be the estimated distribution. We also estimate from the training data the probability π j (k) that a given point in ℳ belongs to Q j, k (recall that j is fixed, so this is a probability distribution over the \(\vert {\mathcal{K}}_{j}\vert \) labels of the planes at scale j). We may then generate new data points by drawing a \(k \in {\mathcal{K}}_{j}\) according to π j and then drawing a point in V j, k from the distribution π j, k : this defines a probability distribution supported on ℳ j that we denote by p ℳ j .
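A minimal numerical sketch of this sampling scheme, in which a hypothetical angular partition of the unit circle stands in for the GMRA cells Q j, k (the function names and the toy data are ours, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_multiscale_density(X, labels, d):
    """Per cell k: fit a Gaussian to the d-dim PCA coefficients of the cell
    (the role of pi_{j,k}); cell frequencies estimate pi_j(k)."""
    models = {}
    for k in np.unique(labels):
        Xk = X[labels == k]
        mu = Xk.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xk - mu, full_matrices=False)
        basis = Vt[:d]                    # local approximating plane V_{j,k}
        C = (Xk - mu) @ basis.T           # coefficients on that plane
        cov = np.atleast_2d(np.cov(C, rowvar=False))
        models[k] = (mu, basis, C.mean(axis=0), cov)
    pi = np.bincount(labels) / len(labels)
    return pi, models

def sample_model(pi, models, m):
    """Draw a cell k from pi_j, then a point from that cell's Gaussian,
    mapped back to ambient space through the cell's local plane."""
    out = []
    for k in rng.choice(len(pi), size=m, p=pi):
        mu, basis, cmean, ccov = models[k]
        c = rng.multivariate_normal(cmean, ccov)
        out.append(mu + c @ basis)
    return np.array(out)

# Toy "manifold": the unit circle, partitioned into 16 angular cells
theta = rng.uniform(0, 2 * np.pi, 1000)
X = np.c_[np.cos(theta), np.sin(theta)]
labels = (theta // (np.pi / 8)).astype(int)
pi, models = fit_multiscale_density(X, labels, d=1)
Y = sample_model(pi, models, 500)
print(np.abs(np.linalg.norm(Y, axis=1) - 1).max())  # small: samples hug the circle
```

In the GMRA proper, the partition comes from the multiscale tree at scale j rather than from a hand-made angular binning, but the two-stage draw (cell, then Gaussian on the cell's plane) is the same.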
In this way we may generate new data points which are consistent both with the geometry of the approximating planes V j, k and with the distribution of the data on each such plane. In Fig. 13.14 we display the result of such modeling on a simple manifold. In Fig. 13.15 we construct p ℳ j by training on 2,000 handwritten 2s and 3s from the MNIST database; on the same training set we also train a model based on projecting the data onto the first a j principal components, where a j is chosen so that the cost of encoding the projection and the projected data equals the cost of encoding the GMRA up to scale j and the GMRA of the data, and then running the same multifactor Gaussian model used above for generating π j, k . This leads to a probability distribution we denote by \({p}_{SV {D}_{j}}\). To test the quality of these models, we consider two measures. The first is simply the Hausdorff distance between 2,000 samples drawn from each model and the training set: it measures how close the generated samples are to the training set. The second quantifies whether a model captures the variability of the true data: for a fixed model we generate multiple point clouds of 2,000 points each and compute the pairwise Hausdorff distances between these point clouds, which we call the within-model Hausdorff distance variability.
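The Hausdorff distance between point clouds can be computed directly; a minimal brute-force version (fine for a few thousand points; a k-d tree would scale better):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between point clouds A and B (rows = points):
    the largest nearest-neighbor distance from either cloud to the other."""
    D2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # squared pairwise distances
    return np.sqrt(max(D2.min(axis=1).max(), D2.min(axis=0).max()))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0], [0.0, 2.0]])
print(hausdorff(A, B))  # -> 2.0: the point (0, 2) is at distance 2 from A
```

The within-model variability is then obtained by calling `hausdorff` on every pair of point clouds generated from the same model.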
The bias–variance trade-off in the models p ℳ j is the following: as j increases, the planes better model the geometry of the data (under our usual assumptions), so the bias of the model (and the approximation error) decreases; on the other hand, the sampling requirements for correctly estimating the density of Q j, k projected onto V j, k increase with j, as fewer and fewer training points fall in Q j, k . A greedy pruning algorithm that selects, in each region of the data, the scale achieving the correct bias–variance trade-off, depending on the samples and the geometry of the data, similar in spirit to what has been studied in the case of multiscale approximation of functions, will be presented in a forthcoming publication. It should be remarked that such a model would be very complicated in the wavelet domain, as one would need to model very complex dependencies among wavelet coefficients, in both space and scale.
Notes
- 1.
Available at http://yann.lecun.com/exdb/mnist/
References
Aharon, M., Elad, M., Bruckstein, A.: K-SVD: Design of dictionaries for sparse representation. In: Proceedings of SPARS 05’, pp. 9–12 (2005)
Allard, W.K., Chen, G., Maggioni, M.: Multi-scale geometric methods for data sets II: Geometric multi-resolution analysis. Appl. Computat. Harmonic Analysis 32, 435–462 (2012)
Belkin, M., Niyogi, P.: Using manifold structure for partially labelled classification. Advances in NIPS, vol. 15. MIT Press, Cambridge (2003)
Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In: ICML, pp. 97–104 (2006)
Binev, P., Cohen, A., Dahmen, W., Devore, R., Temlyakov, V.: Universal algorithms for learning theory part i: Piecewise constant functions. J. Mach. Learn. 6, 1297–1321 (2005)
Binev, P., Devore, R.: Fast computation in adaptive tree approximation. Numer. Math. 97, 193–217 (2004)
Bremer, J., Coifman, R., Maggioni, M., Szlam, A.: Diffusion wavelet packets. Appl. Comp. Harm. Anal. 21, 95–112 (2006) (Tech. Rep. YALE/DCS/TR-1304, 2004)
Candès, E., Donoho, D.L.: Curvelets: A surprisingly effective nonadaptive representation of objects with edges. In: Schumaker, L.L., et al. (eds.) Curves and Surfaces. Vanderbilt University Press, Nashville (1999)
Causevic, E., Coifman, R., Isenhart, R., Jacquin, A., John, E., Maggioni, M., Prichep, L., Warner, F.: QEEG-based classification with wavelet packets and microstate features for triage applications in the ER, vol. 3. ICASSP Proc., May 2006 10.1109/ICASSP.2006.1660859
Chen, G., Little, A., Maggioni, M., Rosasco, L.: Wavelets and Multiscale Analysis: Theory and Applications. Springer (2011). Submitted March 12, 2010
Chen, G., Maggioni, M.: Multiscale geometric wavelets for the analysis of point clouds. In: Proc. 44th Annual Conference on Information Sciences and Systems (CISS), IEEE (2010)
Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20, 33–61 (1998)
Christ, M.: A T(b) theorem with remarks on analytic capacity and the Cauchy integral. Colloq. Math. 60–61, 601–628 (1990)
Christensen, O.: An introduction to frames and Riesz bases. Applied and Numerical Harmonic Analysis. Birkhäuser, Boston (2003)
Coifman, R., Lafon, S.: Diffusion maps. Appl. Comp. Harm. Anal. 21, 5–30 (2006)
Coifman, R., Lafon, S., Maggioni, M., Keller, Y., Szlam, A., Warner, F., Zucker, S.: Geometries of sensor outputs, inference, and information processing. In: Athale, R.A. (ed.) Proc. SPIE, J. C. Z. E. Intelligent Integrated Microsystems, vol. 6232, p. 623209, May 2006
Coifman, R., Maggioni, M.: Diffusion wavelets. Appl. Comp. Harm. Anal. 21, 53–94 (2006) (Tech. Rep. YALE/DCS/TR-1303, Yale Univ., Sep. 2004).
Coifman, R., Maggioni, M.: Multiscale data analysis with diffusion wavelets. In: Proc. SIAM Bioinf. Workshop, Minneapolis (2007)
Coifman, R., Maggioni, M.: Geometry analysis and signal processing on digital data, emergent structures, and knowledge building. SIAM News, November 2008
Coifman, R., Meyer, Y., Quake, S., Wickerhauser, M.V.: Signal processing and compression with wavelet packets. In: Progress in Wavelet Analysis and Applications (Toulouse, 1992), pp. 77–93. Frontières, Gif (1993)
Coifman, R.R., Lafon, S., Lee, A.B., Maggioni, M., Nadler, B., Warner, F., Zucker, S.W.: Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. PNAS 102, 7426–7431 (2005)
Daubechies, I.: Ten lectures on wavelets. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (1992) ISBN: 0-89871-274-2.
David, G.: Wavelets and singular integrals on curves and surfaces. In: Lecture Notes in Mathematics, vol. 1465. Springer, Berlin (1991)
David, G.: Wavelets and Singular Integrals on Curves and Surfaces. Springer, Berlin (1991)
David, G., Semmes, S.: Analysis of and on uniformly rectifiable sets. Mathematical Surveys and Monographs, vol. 38. American Mathematical Society, Providence (1993)
David, G., Semmes, S.: Uniform Rectifiability and Quasiminimizing Sets of Arbitrary Codimension. American Mathematical Society, Providence (2000)
Donoho, D.L., Grimes, C.: When does isomap recover natural parameterization of families of articulated images? Tech. Rep. 2002–2027, Department of Statistics, Stanford University, August 2002
Donoho, D.L., Grimes, C.: Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Proc. Nat. Acad. Sciences 100, 5591–5596 (2003)
Golub, G., Loan, C.V.: Matrix Computations. Johns Hopkins University Press, Baltimore (1989)
Jones, P., Maggioni, M., Schul, R.: Manifold parametrizations by eigenfunctions of the Laplacian and heat kernels. Proc. Nat. Acad. Sci. 105, 1803–1808 (2008)
Jones, P., Maggioni, M., Schul, R.: Universal local manifold parametrizations via heat kernels and eigenfunctions of the Laplacian. Ann. Acad. Scient. Fen. 35, 1–44 (2010) http://arxiv.org/abs/0709.1975
Jones, P.W.: Rectifiable sets and the traveling salesman problem. Invent. Math. 102, 1–15 (1990)
Jones, P.W.: The traveling salesman problem and harmonic analysis. Publ. Mat. 35, 259–267 (1991) Conference on Mathematical Analysis (El Escorial, 1989)
Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 359–392 (1999)
Little, A., Jung, Y.-M., Maggioni, M.: Multiscale estimation of intrinsic dimensionality of data sets. In: Proc. A.A.A.I. (2009)
Little, A., Lee, J., Jung, Y.-M., Maggioni, M.: Estimation of intrinsic dimensionality of samples from noisy low-dimensional manifolds in high dimensions with multiscale SVD. In: Proc. S.S.P. (2009)
Little, A., Maggioni, M., Rosasco, L.: Multiscale geometric methods for data sets I: Estimation of intrinsic dimension, submitted (2010)
Maggioni, M., Bremer, J. Jr., Coifman, R., Szlam, A.: Biorthogonal diffusion wavelets for multiscale representations on manifolds and graphs. SPIE, vol. 5914, p. 59141M (2005)
Maggioni, M., Mahadevan, S.: Fast direct policy evaluation using multiscale analysis of markov diffusion processes. In: ICML 2006, pp. 601–608 (2006)
Mahadevan, S., Maggioni, M.: Proto-value functions: A spectral framework for solving markov decision processes. JMLR 8, 2169–2231 (2007)
Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online dictionary learning for sparse coding. In: ICML, p. 87 (2009)
Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res. 11, 19–60 (2010)
Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Res. 37, 3311–3325 (1997)
Rahman, I.U., Drori, I., Stodden, V.C., Donoho, D.L.: Multiscale representations for manifold-valued data. SIAM J. Multiscale Model. Simul. 4, 1201–1232 (2005).
Rohrdanz, M.A., Zheng, W., Maggioni, M., Clementi, C.: Determination of reaction coordinates via locally scaled diffusion map. J. Chem. Phys. 134, 124116 (2011)
Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000)
Starck, J.L., Elad, M., Donoho, D.: Image decomposition via the combination of sparse representations and a variational approach. IEEE Trans. Image Process. 14, 1570–1582 (2004)
Szlam, A.: Asymptotic regularity of subdivisions of euclidean domains by iterated PCA and iterated 2-means. Appl. Comp. Harm. Anal. 27, 342–350 (2009)
Szlam, A., Maggioni, M., Coifman, R., Bremer, J. Jr.: Diffusion-driven multiscale analysis on manifolds and graphs: top-down and bottom-up constructions. SPIE, vol. 5914(1), p. 59141D (2005)
Szlam, A., Maggioni, M., Coifman, R.: Regularization on graphs with function-adapted diffusion processes. J. Mach. Learn. Res. 9, 1711–1739 (2008) (YALE/DCS/TR1365, Yale Univ, July 2006)
Szlam, A., Sapiro, G.: Discriminative k-metrics. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1009–1016 (2009)
Tenenbaum, J.B., Silva, V.D., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc. B 58, 267–288 (1996)
Zhang, Z., Zha, H.: Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM J. Sci. Comput. 26, 313–338 (2002)
Zhou, M., Chen, H., Paisley, J., Ren, L., Sapiro, G., Carin, L.: Non-parametric Bayesian dictionary learning for sparse image representations. In: Neural and Information Processing Systems (NIPS) (2009)
Acknowledgements
The authors thank E. Monson for useful discussions. AVL was partially supported by NSF and ONR. GC was partially supported by DARPA, ONR, NSF CCF, and NSF/DHS FODAVA program. MM is grateful for partial support from DARPA, NSF, ONR, and the Sloan Foundation.
Copyright information
© 2013 Birkhäuser Boston
Chen, G., Little, A.V., Maggioni, M. (2013). Multi-Resolution Geometric Analysis for Data in High Dimensions. In: Andrews, T., Balan, R., Benedetto, J., Czaja, W., Okoudjou, K. (eds) Excursions in Harmonic Analysis, Volume 1. Applied and Numerical Harmonic Analysis. Birkhäuser, Boston. https://doi.org/10.1007/978-0-8176-8376-4_13
Publisher Name: Birkhäuser, Boston
Print ISBN: 978-0-8176-8375-7
Online ISBN: 978-0-8176-8376-4