Abstract
We survey our latest results on the development and analysis of adaptive approximation algorithms for sparse data representation, where special emphasis is placed on the Easy Path Wavelet Transform (EPWT), nonlinear dimensionality reduction (NDR) methods, and their application to signal separation and detection.
Keywords
- Independent Component Analysis
- Filter Bank
- Dimensionality Reduction Method
- Short Time Fourier Transform
14.1 Introduction
During the last few years there has been an increasing interest in efficient (i.e., sparse) representation and denoising of high-dimensional signals. We have focussed our research on the development and analysis of adaptive approximation algorithms for high-dimensional signals, especially (a) scattered data denoising by wavelet transforms; (b) nonlinear dimensionality reduction relying on geometrical and topological concepts. This contribution reviews our recent research results on (a) and (b).
For (a), we present a general framework for the Easy Path Wavelet Transform (EPWT) for sparse representation and denoising of scattered data taken from high-dimensional signals (in Sect. 14.2). As regards (b), we continue our research on nonlinear dimensionality reduction (NDR) methods (cf. Sect. 14.3), where we combine recent NDR methods with non-negative matrix factorization (NNMF), for the purpose of separating sources from a mixture of signals without a prior knowledge about the mixing process. More details on dimensionality reduction and NNMF, along with our recent results on signal separation, are discussed in Sect. 14.4.
The presented results are based on our papers [7–9, 11, 13, 17–21, 25] and have been achieved in the project “Adaptive approximation algorithms for sparse data representation” of the German Research Foundation’s priority program DFG-SPP 1324.
14.2 The Easy Path Wavelet Transform
Let Ω be a connected domain in \(\mathbb{R}^{d}\) and let Γ be a large finite set of points in Ω. We let \(h_{\varGamma }:=\max _{y\in \varOmega }\min _{x\in \varGamma }\|y - x\|_{2}\) be the fill distance of Γ in Ω and \(g_{\varGamma }:=\min _{x,x^{{\prime}}\in \varGamma,x\neq x^{{\prime}}}\|x - x^{{\prime}}\|_{2}\) its grid distance. We say that the set Γ is quasi-uniform if h Γ < 2g Γ . Further, let \(f:\varOmega \rightarrow \mathbb{R}\) be a piecewise smooth function that is sampled at Γ, i.e., the values f(x), x ∈ Γ, are given. We are now interested in an efficient approximation of f using a meshless multiscale approach called the Easy Path Wavelet Transform (EPWT). For applications, we usually assume that Γ approximates a smooth manifold in \(\mathbb{R}^{d}\). For example, our approach covers the efficient approximation of digital images, see [16, 20], where Γ is chosen to be a set of regular grid points in a rectangle Ω, and the approximation of piecewise smooth functions on the sphere, see [18], where \(\varOmega = \mathbb{S}^{2}\) and Γ is a suitably chosen quasi-uniform point set on the sphere \(\mathbb{S}^{2}\).
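The fill and grid distances above are straightforward to estimate numerically. The following sketch (an illustration, not taken from the papers under review; all function names are ours) approximates \(h_{\varGamma}\) on the unit square by probing a dense grid, and computes \(g_{\varGamma}\) exactly from the pairwise distances:

```python
import numpy as np

def separation_distance(points):
    # g_Gamma: smallest pairwise distance within the point set
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min()

def fill_distance(points, probes):
    # h_Gamma: largest distance from any probe point of the domain to the set
    d = np.linalg.norm(probes[:, None, :] - points[None, :, :], axis=-1)
    return d.min(axis=1).max()

rng = np.random.default_rng(0)
gamma = rng.random((200, 2))              # scattered points in the unit square
gx, gy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
probes = np.column_stack([gx.ravel(), gy.ravel()])

h = fill_distance(gamma, probes)
g = separation_distance(gamma)
quasi_uniform = h < 2 * g                 # the quasi-uniformity test from the text
```

For random points, quasi-uniformity typically fails (clusters make \(g_{\varGamma}\) tiny); regular grids satisfy it by construction.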
Similar approaches have also been proposed for generalizing the wavelet transform to data defined on weighted graphs, see [22]. In this section, we extend the EPWT proposed in [16, 18, 24] to the case of high-dimensional data approximation.
14.2.1 The General EPWT Algorithm for Sparse Approximation
Let us shortly recall the notions of a biorthogonal wavelet filter bank of perfect reconstruction. To this end, let \(\varphi\) be a sufficiently smooth, compactly supported, one-dimensional scaling function, \(\tilde{\varphi }\) the corresponding biorthogonal compactly supported scaling function, and \(\psi,\,\tilde{\psi }\) the corresponding pair of biorthogonal compactly supported wavelets, see, e.g., [2, 15]. These functions provide us with a filter bank of perfect reconstruction with sequences \((h_{n})_{n\in \mathbb{Z}}\), \((\tilde{h}_{n})_{n\in \mathbb{Z}}\) of low-pass filter coefficients and \((g_{n})_{n\in \mathbb{Z}}\), \((\tilde{g}_{n})_{n\in \mathbb{Z}}\) of high-pass filter coefficients.
Assume that the number N = | Γ | of given points \(x \in \varGamma \subset \mathbb{R}^{d}\) is a power of 2, \(N = 2^{J}\), where J ≫ 1. We denote \(\varGamma ^{J}:=\varGamma\) and its elements by \(x_{k}^{J} = x_{k}\), k = 1, …, N, i.e., we fix some ordering of the points in \(\varGamma ^{J}\). Now the EPWT works as follows. In a first step, we seek a suitable permutation p J of the indices of the points in Γ J by determining a path of length N through all points \(x_{k}^{J}\) such that consecutive data points \((x_{p^{J}(k)}^{J},f(x_{p^{J}(k)}^{J}))\) and \((x_{p^{J}(k+1)}^{J},f(x_{p^{J}(k+1)}^{J}))\) in the path strongly “correlate”. In the second step, we apply the one-dimensional wavelet filter bank to the sequence of function values \((f(x_{p^{J}(k)}^{J}))_{k=1}^{N}\), and simultaneously a low-pass filter to the points \((x_{p^{J}(k)}^{J})_{k=1}^{N}\), where we consider each of the d components separately. The significant high-pass coefficients corresponding to the function values are stored. The N∕2 low-pass data are processed further at the next level of the EPWT. In particular, we denote the set of the N∕2 points obtained by low-pass filtering and downsampling of \((x_{p^{J}(k)}^{J})_{k=1}^{N}\) by \(\varGamma ^{J-1}\), and relate the low-pass function coefficients to these points. Again, we seek a permutation p J−1 of the indices of the points in Γ J−1 to obtain an appropriate ordering of the data and apply the one-dimensional wavelet filter bank to the ordered low-pass function data. We iterate this procedure and obtain a sparse representation of the original data by applying a hard thresholding procedure to the high-pass coefficients of the function value components. The complete procedure is summarized in Algorithm 9.
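One level of this decomposition can be sketched as follows, here with the orthogonal Haar filter bank for the function values and the “lazy” filter for the points (a minimal illustration under these filter choices; the path is assumed to be given):

```python
import numpy as np

def epwt_level(points, values, path):
    # Reorder data along the path, then apply one level of the Haar
    # filter bank to the values and the "lazy" filter to the points.
    v = values[path]
    x = points[path]
    low = (v[0::2] + v[1::2]) / np.sqrt(2)    # low-pass (approximation) part
    high = (v[0::2] - v[1::2]) / np.sqrt(2)   # high-pass (wavelet) coefficients
    next_points = x[0::2]                     # lazy filter: keep every other point
    return next_points, low, high
```

A well-chosen path makes consecutive values similar, so most high-pass coefficients are small and vanish under hard thresholding.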
Algorithm 9 Decomposition
By construction, many high-pass values \(d_{k}^{\ell}\) will vanish. An optimal storage of the path vectors p ℓ depends on the original distribution of the points \(x_{k}^{J}\) and on the applied filter \(\tilde{h}_{p}\). Employing a “lazy” filter, we have \(x_{k}^{\ell}:= x_{p^{\ell+1}(2k)}^{\ell+1}\), so that at each level the new point set is just a subset of that of the preceding level, of half cardinality.
Algorithm 10 Reconstruction
14.2.2 Construction of Path Vectors
The main challenge for the application of the EPWT to sparse data representation is to construct path vectors through the point sets Γ ℓ, ℓ = J, …, J − L + 1. This step is crucial for the performance of the data compression. The path construction is based on determining a suitable correlation measure that takes the local distance of the scattered points \(x_{k}^{\ell}\) into account, on the one hand, and the difference of the corresponding low-pass values \(f_{k}^{\ell}\), on the other hand. In the following, we present some strategies for path construction and comment on their advantages and drawbacks.
14.2.2.1 Path Construction with Fixed Local Distances
One suitable strategy for path construction [16, 24] is based on a priori fixed local \(\varepsilon\)-neighborhoods of the points \(x_{k}^{\ell}\). In \(\mathbb{R}^{d}\), we consider a neighborhood of the form
where \(\varepsilon > 2^{J/d}\,g_{\varGamma }\) depends on the distribution of the original point set Γ = Γ J. For example, starting with a regular rectangular grid in \(\mathbb{R}^{2}\) with mesh size \(g_{\varGamma } = 2^{-J/2}\) (with J even) in both directions, one may choose a constant \(\varepsilon\) with \(\sqrt{ 2} \leq \varepsilon < 2\), such that each inner grid point has eight neighbors.
For path construction at level ℓ of the EPWT, we choose a first point x ℓ ∈ Γ ℓ randomly, and put \(x_{p^{\ell}(1)}^{\ell}:= x^{\ell}\). Let now \(P_{j}^{\ell}:=\{ x_{p^{\ell}(1)}^{\ell},\ldots,x_{p^{\ell}(j)}^{\ell}\}\) be the set of points that have already been taken in the path. Now, we determine the (j + 1)-th point by
i.e., we choose the point x in the neighborhood of the point \(x_{p^{\ell}(j)}^{\ell}\) with minimal absolute difference of the corresponding function values. This measure has been applied in the rigorous EPWT of [16, 18]. The advantage of fixing the local neighborhood in spatial domain lies in the reduced storage costs for the path vector that needs to be kept to ensure a reconstruction. The drawback of this measure is that the set of “admissible points” \(N_{\varepsilon }(x_{p^{\ell}(j)}^{\ell})\setminus P_{j}^{\ell}\) may be empty. In this case a different rule for finding the next path entry has to be applied.
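A sketch of this rigorous strategy follows, with a deterministic start and a simple fallback rule (jump to the nearest unvisited point) when an interruption occurs; this fallback is our choice for illustration, one of several possible rules:

```python
import numpy as np

def rigorous_path(points, values, eps):
    # Greedy path: among unvisited points within distance eps, pick the
    # one with the smallest function-value difference to the current point.
    n = len(points)
    visited = np.zeros(n, dtype=bool)
    path = [0]                           # fixed start instead of a random one
    visited[0] = True
    for _ in range(n - 1):
        j = path[-1]
        dist = np.linalg.norm(points - points[j], axis=1)
        cand = np.where(~visited & (dist <= eps))[0]
        if cand.size == 0:               # "interruption": jump to nearest unvisited
            cand = np.where(~visited)[0]
            nxt = cand[np.argmin(dist[cand])]
        else:
            nxt = cand[np.argmin(np.abs(values[cand] - values[j]))]
        path.append(nxt)
        visited[nxt] = True
    return np.array(path)
```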
A special measure occurs if one tries to mimic the one-dimensional wavelet transform. In order to exploit the piecewise smoothness of the function f to be approximated, one should prefer to construct path vectors, where locally three consecutive points \(x_{p^{\ell}(j-1)}^{\ell}\), \(x_{p^{\ell}(j)}^{\ell}\), \(x_{p^{\ell}(j+1)}^{\ell}\) lie (almost) on a straight line. This consideration leads to the following measure: We fix a threshold μ for the function values. For finding the next point in the path, we compute
and then let
where \(\langle \cdot,\cdot \rangle\) denotes the usual scalar product in \(\mathbb{R}^{d}\). Note that in (14.3) the cosine of the angle between the vectors \(x_{p^{\ell}(j-1)}^{\ell} - x_{p^{\ell}(j)}^{\ell}\) and \(x_{p^{\ell}(j)}^{\ell} - x\) is minimized if \(x_{p^{\ell}(j-1)}^{\ell},\,x_{p^{\ell}(j)}^{\ell}\) and x are collinear. This approach is taken in [16, 24] for images (the relaxed EPWT), and in [10] for scattered data denoising.
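The relaxed selection can be sketched as follows. Since the precise form of (14.2)–(14.3) is not reproduced in this text, the straightness of the path continuation is scored here by the cosine of the angle between consecutive path directions; that scoring, and the fallback to the rigorous rule, are assumptions of this illustration:

```python
import numpy as np

def next_point_relaxed(points, values, path, visited, eps, mu):
    # Among unvisited eps-neighbors whose function difference is below mu,
    # pick the point that continues the path as straight as possible.
    j = path[-1]
    jm1 = path[-2] if len(path) > 1 else path[-1]
    dist = np.linalg.norm(points - points[j], axis=1)
    cand = np.where(~visited & (dist <= eps) & (dist > 0))[0]
    if cand.size == 0:
        return None                                  # interruption
    close = cand[np.abs(values[cand] - values[j]) <= mu]
    if close.size == 0 or len(path) < 2:
        # fall back to the rigorous rule: smallest function difference
        return cand[np.argmin(np.abs(values[cand] - values[j]))]
    d_in = points[j] - points[jm1]
    d_out = points[close] - points[j]
    cos = d_out @ d_in / (np.linalg.norm(d_out, axis=1)
                          * np.linalg.norm(d_in) + 1e-12)
    return close[np.argmax(cos)]                     # straightest continuation
```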
Remark 14.1.
The idea of preferring path vectors for which the angle between three consecutive points in the path is as large as possible can be theoretically validated in different ways. Assume that the given wavelet decomposition filter \(\tilde{g} = (\tilde{g}_{k})_{k\in \mathbb{Z}}\) in the filter bank satisfies the moment conditions \(\sum _{k\in \mathbb{Z}}\tilde{g}_{k} = 0\) and \(\sum _{k\in \mathbb{Z}}k\tilde{g}_{k} = 0\). Then we simply observe that for a constant function f(x) = c for x ∈ Γ and \(c \in \mathbb{R}\) by
all wavelet coefficients vanish, while for a linear function of the form f(x) = a T x + b with \(a \in \mathbb{R}^{d}\) and \(b \in \mathbb{R}\) we have
Consequently, these coefficients vanish only if the points in the sequence \((x_{p^{J}(k)}^{J})_{k\in \mathbb{Z}}\) are collinear and equidistant, see [10]. A second validation for choosing the path vector via the criterion (14.3) is given by the so-called path smoothness condition in [17], see also Subsection 14.2.4, Remark 14.4.
Remark 14.2.
Our numerical results in Sect. 14.2.5 show that the relaxed path construction proposed in (14.2)–(14.3) is far superior to the rigorous path construction (14.1), since it produces fewer “interruptions”, i.e., cases where \(N_{\varepsilon }(x_{p^{\ell}(j)})\setminus P_{j}^{\ell} = \varnothing \) and a new path entry has to be taken that is no longer locally correlated to the preceding point, which usually leads to large wavelet coefficients and a higher effort in path coding (see [16, 24]).
14.2.2.2 Path Construction with Global Distances
We now present a second path construction using a global weight function. Considering the vectors \(y_{k}^{\ell} = y(x_{k}^{\ell}):= ((x_{k}^{\ell})^{T},f_{k}^{\ell})^{T} \in \mathbb{R}^{d+1}\) at each level, we define a symmetric weight matrix \(W^{\ell} = (w(y_{k}^{\ell},y_{k^{{\prime}}}^{\ell}))_{ k,k^{{\prime}}=1}^{2^{\ell}}\), where the weight is written as
Now the weights for the scattered points \(x_{k}^{\ell}\) can be chosen differently from the weights for the (low-pass) function values \(f_{k}^{\ell}\). A possible weight function used already in the context of bilateral filtering [25] is
where \(\eta _{1}\) and \(\eta _{2}\) need to be chosen appropriately. The normalization constant \(2^{2(J-\ell)/d}\) in the weight w 1 is due to the reduction of the points x ∈ Γ ℓ by a factor of 2 at each level, so that the distances between the points grow. The normalization constant \(2^{J-\ell}\) in the weight w 2 arises from the usual amplification of the low-pass coefficients in the wavelet transform with filters \(\tilde{h}\) satisfying \(\sum _{k\in \mathbb{Z}}\tilde{h}_{k} = \sqrt{2}\).
Having computed the weight matrix \(W^{\ell} = (w(y_{k}^{\ell},y_{k^{{\prime}}}^{\ell}))_{ k,k^{{\prime}}=1}^{2^{\ell}}\), we simply compute the path vector as follows. We choose the first component \(x_{p^{\ell}(1)}^{\ell}\) randomly from Γ ℓ. Using again the notation \(P_{j}^{\ell}:=\{ x_{p^{\ell}(1)}^{\ell},\ldots,x_{p^{\ell}(j)}^{\ell}\}\) for the set of points in Γ ℓ that are already contained in the path vector, we now determine the next point as
where uniqueness can be achieved by fixing a rule in case the maximum is attained at more than one point. The advantage of this path construction is that no “interruptions” occur. The essential drawback is the higher storage cost for the path vectors, since we can no longer rely on direct local neighborhood properties of consecutive points in the path vector. Further, computing the full weight matrix W ℓ is very expensive. The costs can be reduced by cutting off the spatial weight at a suitable distance, defining
with D chosen appropriately to ensure a sufficiently large spatial neighborhood.
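A sketch of the bilateral-style weight matrix and the resulting greedy path follows. The placement of the normalization constants and the roles of η1, η2, and the cutoff D follow the description above, but the exact form used in [25] is not reproduced here:

```python
import numpy as np

def bilateral_weights(points, values, level, J, d, eta1, eta2, D=np.inf):
    # w = w1(x, x') * w2(f, f'), with level-dependent normalizations as in
    # the text; the spatial weight is truncated at distance D for sparsity.
    dx = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    w1 = np.exp(-dx**2 / (2 ** (2 * (J - level) / d) * eta1))
    w1[dx > D] = 0.0
    df = np.abs(values[:, None] - values[None, :])
    w2 = np.exp(-df**2 / (2 ** (J - level) * eta2))
    return w1 * w2

def weight_path(W):
    # Greedy path: always move to the unvisited point of maximal weight.
    n = W.shape[0]
    visited = np.zeros(n, dtype=bool)
    path = [0]
    visited[0] = True
    for _ in range(n - 1):
        w = np.where(visited, -np.inf, W[path[-1]])
        nxt = int(np.argmax(w))
        path.append(nxt)
        visited[nxt] = True
    return path
```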
Remark 14.3.
This approach has been used in [10] for random path construction, where the compactly supported weight function \(w_{1}(x_{k}^{\ell},x_{k^{{\prime}}}^{\ell})\) above is employed. Taking the weight function
and \(w_{2}(f_{k}^{\ell},f_{k^{{\prime}}}^{\ell}) =\exp \left (\frac{-\vert f_{k}^{\ell}-f_{ k^{{\prime}}}^{\ell}\vert ^{2}} {2^{J-\ell}\eta _{2}} \right )\) we obtain a distance measure that is equivalent to (14.1).
14.2.3 EPWT for Scattered Data Denoising
The EPWT can also be used for denoising of scattered data. Let us again assume that Γ = { x 1, …, x N } is a set of scattered points in \(\mathbb{R}^{d}\) and let \(f: \mathbb{R}^{d} \rightarrow \mathbb{R}\) be a smooth function sampled on \(\varGamma \subset \varOmega\). For the measured data \(\tilde{f}(x_{j})\), we suppose that
where z j denotes additive Gaussian noise with zero mean and an unknown variance σ 2. For the distribution of the points in Ω we assume quasi-uniformity as before.
We now apply the EPWT, Algorithms 9 and 10 in Sect. 14.2.1, for data denoising. Note that in the case of noisy function values, the construction of path vectors (being based on the correlation of function values at points with small spatial distance) is itself influenced by the noise. To improve the denoising performance, we adapt the “cycle spinning” method (see [3]), which works as follows. One applies the (tensor product) wavelet shrinkage not only to the image itself, but also to the images obtained by up to seven cyclic shifts in x- and y-direction. After un-shifting, one takes the average of the 64 reconstructed images, thereby greatly improving the denoising result.
Employing the EPWT algorithm, we use Algorithms 9 and 10, applying them 64 times using different starting values \(x_{p^{J}(1)}\) as a first path component each time. For the path construction, we utilize one of the two methods described in Sect. 14.2.2. After reconstruction of the 64 data sets, we take the average in order to obtain the denoising result. Similarly as for wavelet denoising, the threshold parameter θ in Algorithm 9 needs to be selected carefully depending on the noise level.
In [10] we have employed two different path constructions for image denoising. The first one is very similar to the path construction in Sect. 14.2.2.1. The second one is based on a weight matrix resembling that in Sect. 14.2.2.2. Here, the next component in the path vector is chosen randomly according to a probability distribution based on the weight matrix.
For images, the proposed denoising procedure strongly outperforms the usual tensor-product wavelet shrinkage with cycle spinning, see [10]. Moreover, the procedure is not restricted to rectangular grids, but can be used in a much more general context for denoising of functions on manifolds. Numerical examples of the EPWT-based denoising scheme are given in Sect. 14.2.5.
14.2.4 Optimal Image Representation by the EPWT
In this subsection we restrict ourselves to the EPWT on digital images on a domain Ω = [0, 1)^2. For cartoon models, where the image is piecewise Hölder continuous or even Hölder smooth, we can prove that the EPWT leads to optimally sparse image representations, see [17, 19]. To explain this, let F ∈ L^2(Ω) be a piecewise Hölder continuous image. More precisely, let {Ω i }1 ≤ i ≤ K be a finite set of regions forming a disjoint partition of Ω whose boundaries are continuous and of finite length. In each region Ω i , F is assumed to be Hölder continuous of order α ∈ (0, 1],
where C > 0 does not depend on i. For given samples \(\{(F(2^{-J/2}n))\}_{n\in I_{J}}\), the function F can be approximated by the piecewise constant function
where the index set \(I_{J}:=\{ n = (n_{1},n_{2}) \in \mathbb{N}^{2}:\, 0 \leq n_{1} \leq 2^{J/2} - 1,\,0 \leq n_{2} \leq 2^{J/2} - 1\}\) is of cardinality \(2^{J}\). In this special case α ∈ (0, 1], we can rely on the orthogonal Haar wavelet filter bank in Algorithms 9 and 10. An optimal image representation strongly depends on an appropriate path construction. As shown in [19], we need to satisfy the following two conditions.
Region condition. At each level ℓ of the EPWT, we need to choose the path vector, such that it contains at most R 1 K discontinuities which are incurred by crossing over from one region Ω i to another region, or by jumping within one region Ω i . Here R 1 does not depend on J or ℓ, and K is the number of regions.
Diameter condition. At each level ℓ of the EPWT, we require
for almost all points \(x_{p^{\ell}(k)}^{\ell}\), k = 1, …, \(2^{\ell} - 1\), where D 1 does not depend on J or ℓ. The number of path components which do not satisfy the diameter condition is bounded by a constant independent of ℓ and J.
The region condition suggests that for path construction, we should first collect all points belonging to one region Ω i before passing to the next region. The diameter condition ensures that the remaining points in Γ ℓ are quasi-uniformly distributed at each level ℓ of the EPWT. If the path vectors satisfy these two conditions, then, as shown in [19, Corollary 3.1], the M-term approximation F M reconstructed from the M most significant EPWT wavelet coefficients satisfies the asymptotically optimal error estimate
with a constant \(\tilde{C}\) and the Hölder exponent α ∈ (0, 1] in (14.5).
Remark 14.4.
Observe that at each level of the EPWT the path vector \((p^{\ell}(j))_{j=1}^{2^{\ell} }\) determines a planar curve that interpolates \(f_{p^{\ell}(j)}^{\ell}\) at the points \(x_{p^{\ell}(j)}^{\ell}\), j = 1, …, \(2^{\ell}\). By definition, this curve is only piecewise linear. A generalization of the optimal M-term approximation result (14.6) to piecewise Hölder smooth images with Hölder exponent α > 1 has been developed in [17]. In this case, one needs to generalize the idea of a piecewise linear path vector curve to a smooth path function that satisfies, besides the region condition and the diameter condition, a third condition called the path smoothness condition, see [17]. More precisely, let us consider a domain \(\varOmega \subset [0,1]^{2}\) with a sufficiently smooth boundary and a disjoint partition Ω i of Ω with smooth boundaries of finite length. Further, instead of (14.5), we assume that F ∈ L^2(Ω) is a piecewise smooth bivariate function that is Hölder smooth of order α > 1 in each region Ω i , i = 1, …, K. In order to show the optimal error estimate (14.6) also for α > 1, we need to employ a path function approximating the values \(f_{p^{\ell}(j)}^{\ell}\) at the points \(x_{p^{\ell}(j)}^{\ell}\): a planar curve that is not only piecewise smooth but smooth of order α inside each region Ω i , with suitably bounded derivatives, see [17], Section 3.2. In particular, this condition suggests that one should avoid “small angles” in the path curve.
14.2.5 Numerical Results
We shortly illustrate the performance of the proposed EPWT algorithm for sparse data representation and data denoising. In Fig. 14.1, we illustrate the application of the EPWT for sparse image representation, see also [16, 24]. The three considered images are of size 256 × 256. In Algorithm 9, we have used the 7–9 biorthogonal filter bank for the function values and the lazy filter bank for the grid points, i.e., at each level of the EPWT, we have kept only every other grid point. The path construction from Sect. 14.2.2.1 is taken, where in (14.2) the parameters \(\varepsilon = \sqrt{2}\) and μ = 5 are employed. The threshold parameter θ in Algorithm 9 is chosen such that the 1,000 most significant EPWT wavelet coefficients are kept for the clock image, 700 coefficients for the Lenna image, and 200 coefficients for the sail image. Figure 14.1 shows the reconstructed images, where we compare the results of a tensor-product wavelet compression with the 7–9 biorthogonal filter bank to the results of the EPWT reconstruction, using the same number of wavelet coefficients in both cases.
In a second example we study the denoising behavior of the EPWT approach as described in Sect. 14.2.3. In Fig. 14.2, we present the noisy pepper image with a PSNR of 19.97 and compare the denoising results of different methods. In particular, we have used the four-pixel denoising scheme based on anisotropic diffusion by Welk et al. [28] with 76 iterations and step size 0.001, providing a PSNR of 28.26. Further, we apply the 7–9 wavelet shrinkage with a PSNR of 24.91 and the 7–9 wavelet shrinkage with cycle spinning using 64 shifts of the image, yielding a PSNR of 28.11. Our EPWT denoising approach employing the relaxed path construction described in Sect. 14.2.2.1 achieves a PSNR of 29.01, while a random path construction based on the ideas in Sect. 14.2.2.2 yields a PSNR of 27.96. Note that the repeated application of the EPWT shrinkage method can be done in parallel. While our proposed EPWT denoising is (due to the path constructions) more expensive than the tensor-product wavelet shrinkage, its application is not restricted to rectangular regular grids.
The third example shows EPWT denoising applied to a triangular domain, using the approach of Sect. 14.2.3, see Fig. 14.3. We use the 7–9 biorthogonal filter bank for the function values, the lazy filter bank for the grid points, and the path construction from Sect. 14.2.2.1 with \(\varepsilon = 1.3\), μ = 89 and threshold θ = 89.
14.3 Dimensionality Reduction on High-Dimensional Signal Data
To explain basic concepts on dimensionality reduction, we regard point cloud data as a finite family of vectors
contained in an n-dimensional Euclidean space. The fundamental assumption is that X lies in \(\mathcal{M}\), a low dimensional (topological) space embedded in \(\mathbb{R}^{n}\). Therefore, \(X \subset \mathcal{M}\subset \mathbb{R}^{n}\) with \(p:= \mbox{ dim}(\mathcal{M}) \ll n\). Another ingredient is a parameter domain Ω for \(\mathcal{M}\), where Ω is assumed to be embedded in a low dimensional space \(\mathbb{R}^{d}\) with p ≤ d < n. Moreover, we assume the existence of a homeomorphism (diffeomorphism)
so that Ω is a homeomorphic (diffeomorphic) copy of \(\mathcal{M}\). This concept can then be used for signal analysis in a low dimensional environment. In practice, we can only approximate Ω by a projection
where Ω ′ is a homeomorphic copy of Ω. The low dimensional structure representing X is the reduced data \(Y =\{ y_{i}\}_{i=1}^{m} \subset \varOmega ^{{\prime}}\subset \mathbb{R}^{d}\), according to the following diagram.
Principal component analysis (PCA) is a classical linear projection method. Dimensionality reduction by PCA can be described as an eigenvalue problem, so that PCA can be applied by using the singular value decomposition (SVD). More precisely, in PCA we consider centered data X (i.e., X has zero mean) in matrix form \(X \in \mathbb{R}^{n\times m}\). Now the concept of PCA is to construct a linear projection \(P: \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}\), for rank(P) = p < n, with minimal error \(\mbox{ err}(P,X) =\sum _{ k=1}^{m}\|x_{k} - P(x_{k})\|\), or, equivalently, with maximal variance \(\mbox{ var}(P,X) =\sum _{ k=1}^{m}\|P(x_{k})\|^{2}\). These conditions can in turn be reformulated as an eigenvalue problem, where the p largest eigenvalues of the covariance matrix \(\mathit{XX}^{T} \in \mathbb{R}^{n\times n}\) are sought, cf. [14].
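A minimal PCA sketch via the SVD of the centered data matrix (columns are data points), following the description above; the function name and interface are ours:

```python
import numpy as np

def pca_project(X, p):
    # Columns of X are points in R^n; center, then project onto the span
    # of the p leading left singular vectors (the principal directions).
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    Up = U[:, :p]
    return Up @ (Up.T @ Xc), Up   # rank-p projection of the centered data
```

Using the SVD of the centered data avoids forming the covariance matrix \(XX^{T}\) explicitly, which is numerically preferable.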
Another classical linear dimensionality reduction method is multidimensional scaling (MDS), which also relies on an eigendecomposition of the data \(X \in \mathbb{R}^{n\times m}\). In contrast to PCA, the MDS method constructs a low dimensional configuration of X without using an explicit projection map. More precisely, on an input matrix \(X \in \mathbb{R}^{n\times m}\), MDS works with the distance matrix \(D = (d_{ij})_{i,j=1}^{m}\) of the points in X to compute an optimal configuration of points \(Y = (y_{1},\cdots \,,y_{m}) \in \mathbb{R}^{p\times m}\), with p ≤ n, minimizing the error \(\mbox{ err}(Y,D) =\sum _{ i,j=1}^{m}(d_{\mathit{ij}} -\| y_{i} - y_{j}\|)^{2}\). In other words, the low dimensional configuration of points Y approximately preserves the distances of the higher dimensional dataset X.
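Classical MDS can be sketched in a few lines: double-center the squared distance matrix and embed via the leading eigenpairs of the resulting Gram matrix. This is the standard construction, not necessarily the exact variant used in the work cited above:

```python
import numpy as np

def classical_mds(D, p):
    # Double-center the squared distances to obtain a Gram matrix B,
    # then embed into R^p via the top-p eigenpairs of B.
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * J @ (D**2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:p]
    L = np.sqrt(np.maximum(vals[idx], 0))   # clip tiny negative eigenvalues
    return (vecs[:, idx] * L).T             # p x m configuration
```

For exact Euclidean distance matrices, this recovers the configuration up to a rigid motion.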
In the construction of nonlinear dimensionality reduction (NDR) methods, we are especially interested in their interaction with signal processing tools, e.g., convolution transforms. When applying signal transforms to the dataset X, one important task is the analysis of the incurred geometrical deformation. To this end, we propose the concept of modulation maps and modulation manifolds for the construction of particular datasets which are relevant in signal processing and NDR, especially since we are interested in numerical methods for analyzing geometrical properties of the modulation manifolds, with a particular focus on their scalar and mean curvature.
We define a modulation manifold by employing a homeomorphism (or diffeomorphism) \(\mathcal{A}:\varOmega \rightarrow \mathcal{M}\), for a specific manifold Ω, as used in signal processing. The basic objective is to understand how the geometry of Ω is distorted when we transform Ω using a modulation map \(\mathcal{A}\). More explicitly, let \(\{\phi _{k}\}_{k=1}^{d} \subset \mathcal{H}\) be a set of vectors in a Euclidean space \(\mathcal{H}\), and \(\{s_{k}:\varOmega \rightarrow \mathcal{C}_{\mathcal{H}}(\mathcal{H})\}_{k=1}^{d}\) a family of smooth maps from a manifold Ω to \(\mathcal{C}_{\mathcal{H}}(\mathcal{H})\) (the continuous functions from \(\mathcal{H}\) into \(\mathcal{H}\)). We say that a manifold \(\mathcal{M}\subset \mathcal{H}\) is a \(\{\phi _{k}\}_{k=1}^{d}\)-modulated manifold if
In this case, the map \(\mathcal{A}:\varOmega \rightarrow \mathcal{M}\), \(\alpha \mapsto \sum _{k=1}^{d}s_{k}(\alpha )\phi _{k}\), is called modulation map.
As a prototypical example (cf. [7]), we regard a map of the form
for a set of band-limited functions \(\{\phi _{k}\}_{k=1}^{d}\) in combination with a finite set of uniform samples \(\{t_{i}\}_{i=1}^{n} \subset [0,1]\).
Now we use the same notation for the band-limited functions ϕ k and the above-mentioned vectors of sample values \(\{\phi _{k}(t_{i})\}_{i=1}^{n}\), as this is justified by the Whittaker-Shannon interpolation formula as follows.
As the support of the band-limited functions ϕ k is located in [0, 1], the Whittaker-Shannon interpolation formula allows us to reconstruct each ϕ k exactly from the finite samples \((\phi _{k}(t_{i}))_{i=1}^{n} \in \mathbb{R}^{n}\). This in turn gives a one-to-one relation between the band-limited functions \(\phi _{k}: [0,1] \rightarrow \mathbb{R}\) and the vectors \((\phi _{k}(t_{i}))_{i=1}^{n} \in \mathbb{R}^{n}\). Note that the maps s k (α) are in our example given by \(s_{k}(\alpha )\phi _{k}(t_{i}) =\phi _{k}(\alpha _{k}t_{i})\). In other words, we use the (continuous) map s k (α), \(f(t)\mapsto f(\alpha _{k}t)\), as the scaling by factor α k , being the k-th coordinate of vector \(\alpha \in \varOmega \subset \mathbb{R}^{d}\).
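The scaling-type modulation map of this example can be sketched as follows, with stand-in choices for the functions ϕ k (the true atoms are band-limited; sin, cos and sinc here are merely illustrative, and the function name is ours):

```python
import numpy as np

def modulation_map(alpha, phis, t):
    # A(alpha) = sum_k s_k(alpha) phi_k, where s_k scales the argument
    # of phi_k by alpha_k, i.e., s_k(alpha) phi_k (t_i) = phi_k(alpha_k t_i).
    return sum(phi(a * t) for a, phi in zip(alpha, phis))

t = np.linspace(0, 1, 128, endpoint=False)        # uniform samples t_i in [0, 1)
phis = [np.sin, np.cos, np.sinc]                  # illustrative stand-in atoms
x = modulation_map(np.array([2.0, 3.0, 0.5]), phis, t)
```

Sweeping α over a two-dimensional parameter domain Ω produces a point cloud on a modulation manifold in \(\mathbb{R}^{n}\).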
To explain our analysis of the geometric distortions incurred by \(\mathcal{A}\), we restrict ourselves to the case d = 3 and \(\varOmega \subset \mathbb{R}^{3}\) with dim(Ω) = 2. We compute the scalar curvature of \(\mathcal{M}\) from the parametrization of Ω and the modulation map \(\mathcal{A}\) by the following algorithm [7].
Algorithm 11
For further details concerning the construction of Algorithm 11, we refer to [7].
14.4 Audio Signal Separation and Signal Detection
In many relevant applications of signal processing there is an increasing demand for effective methods to estimate the components of a mixture of acoustic signals. In recent years, different decomposition techniques have been developed to this end, including independent subspace analysis (ISA), based on independent component analysis (ICA), see [1, 5, 26], and non-negative matrix factorization (NNMF), see [6, 23, 27]. The computational complexity of these methods, however, may be very large, in particular for real-time computations on audio signals. In signal separation, dimensionality reduction methods are used to first reduce the dimension of the data obtained from a time-frequency transform, e.g., the short time Fourier transform (STFT), before the reduced data is decomposed into different components, each assigned to one of the source signals. For the application of dimensionality reduction in combination with NNMF, however, non-negative dimensionality reduction methods are essentially required to guarantee non-negative output data from non-negative input data (e.g., a non-negative spectrogram from the STFT). For the special case of PCA, a suitable rotation map is constructed in [12] for the purpose of back-projecting the reduced data to the positive orthant of the Cartesian coordinate system, where the sought rotation is given by the solution of a constrained optimization problem in a linear subspace of orthogonal matrices.
In this section, we evaluate different decomposition methods for signal separation in combination with the non-negative PCA projection from [12]. The basic steps of our method are illustrated in Fig. 14.5.
To explain how we use PCA, let \(U \in \mathbb{R}^{D\times d}\) be an orthogonal projection with Y = U T X, obtained as the solution of the minimization problem
The solution of (14.7) is given by the maximizer of the variance var(Y ) of Y, as given by the trace of YY T. This observation allows us to reformulate the minimization problem in (14.7) as an equivalent maximization problem,
where the maximizer U of var(Y ) is given by a matrix U whose d columns contain the eigenvectors of the d largest eigenvalues of the covariance matrix XX T.
For further processing the data in a subsequent decomposition by NNMF, the data matrix Y is essentially required to be non-negative. Note, however, that even if the data matrix X (obtained e.g., by STFT) may be non-negative, this is not necessarily the case for the components of the reduced data matrix Y. Therefore, we reformulate the maximization problem in (14.8) by adding a non-negativity constraint:
Note that this additional restriction transforms the simple PCA problem (14.8) into a much more difficult non-convex optimization problem (14.9) with many local solutions, for which (in general) none of the solutions is known analytically.
We tackle this fundamental problem as follows. We make use of the fact that the input data set X is non-negative before it is projected onto a linear subspace, with the expectation that there exists a rotation of the low-dimensional data set Y into the non-negative orthant. Indeed, as proven in [12], such a rotation map exists, which motivates us to split the non-negative PCA (NNPCA) problem (14.9) into a PCA part and a rotation part. This, in turn, leads to the general problem of constructing a rotation matrix W satisfying WU T X ≥ 0.
To further explain our splitting approach, recall that we already know the solution U of the PCA part. Since the rotation matrix W is orthogonal, it does not affect the value of the NNPCA cost functional. Now, in order to determine the rotation matrix W, we solve an auxiliary optimization problem on the set of rotation matrices SO(d), i.e., we minimize the cost functional
\[J(W) = \tfrac{1}{2}\,\|Y - W^{T}[WY]_{+}\|_{F}^{2},\qquad(14.10)\]
where \([\,\cdot\,]_{+}\) denotes the componentwise positive part, so that J(W) = 0 if and only if \(WY \geq 0\),
as proposed in [21] in the context of ICA. However, we cannot solve this optimization problem directly by an additive update algorithm, since the set of rotation matrices SO(d) is not closed under addition. An elegant way to minimize the cost functional J in (14.10) instead uses the Lie-group structure of SO(d) to transfer the problem into an optimization problem on the Lie algebra of skew-symmetric matrices \(\mathfrak{s}\mathfrak{o}(d)\). Due to the vector space property of \(\mathfrak{s}\mathfrak{o}(d)\), standard methods can be applied to find the minimum (see [9, 11, 21] for details).
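The Lie-algebra idea can be illustrated in the simplest case d = 2, where \(\mathfrak{s}\mathfrak{o}(2)\) is one-dimensional: parametrize W = expm(B(θ)) with skew-symmetric B and run plain gradient descent on θ. This is an illustrative sketch, not the chapter's algorithm; in particular, the cost below (the energy of the negative part of WY) is our stand-in for the functional minimized in the text, and the gradient is approximated by central differences.

```python
import numpy as np
from scipy.linalg import expm

# sketch: optimization over SO(2) via its Lie algebra so(2); the data and the
# stand-in cost are synthetic, constructed so that a feasible rotation exists
rng = np.random.default_rng(2)
Y_pos = rng.random((2, 100))                      # non-negative reference data
W_true = expm(np.array([[0.0, -0.7], [0.7, 0.0]]))
Y = W_true.T @ Y_pos                              # reduced data, partly negative

def rot(theta):
    # the matrix exponential of a skew-symmetric matrix lies in SO(2)
    return expm(np.array([[0.0, -theta], [theta, 0.0]]))

def J(theta):
    # energy of the negative part of W Y; zero iff W Y >= 0
    return 0.5 * np.sum(np.minimum(rot(theta) @ Y, 0.0) ** 2)

theta, step, h = 0.0, 0.05, 1e-6
J0 = J(theta)
for _ in range(300):                              # descent in the vector space so(2)
    grad = (J(theta + h) - J(theta - h)) / (2.0 * h)
    theta -= step * grad

W = rot(theta)
print(J0, J(theta))                               # the cost should drop sharply
```

Since \(\mathfrak{s}\mathfrak{o}(2)\) is a vector space, the additive update on θ is legitimate, while W = expm(B(θ)) stays exactly on SO(2) throughout.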
14.4.1 Decomposition Techniques
There are different methods for the decomposition of the (reduced) spectrogram Y. Among them, independent component analysis (ICA) and non-negative matrix factorization (NNMF) are commonly used. In either case, for the application of ICA or NNMF, we assume the input data Y to be a linear mixture of source terms \(s_i\), i.e.,
\[Y = AS,\qquad(14.11)\]
where \(A \in \mathbb{R}^{d\times r}\) and \(S \in \mathbb{R}^{r\times n}\) are unknown. To estimate A and S, we need specific additional assumptions to compensate for the imbalance between equations and unknowns in the factorization problem (14.11).
14.4.1.1 Independent Component Analysis (ICA)
The basic assumption of ICA is that the source signals are statistically independent. Furthermore, the data matrix Y is assumed to result from n realizations of a d-dimensional random vector. In order to estimate S, a random variable \(\mathcal{S}\) is constructed, whose n realizations yield the columns of the source matrix S. The components of \(\mathcal{S}\) are chosen to be as statistically independent as possible, where the statistical independence can be measured by the Kullback-Leibler distance [4].
In practice, the number of sources is usually unknown. Therefore, we may detect more independent components than there are actual sources. In this case, two or more of the separated components belong to the same source, i.e., the sources are combinations of the independent components. In a subsequent step, the components are grouped into independent subspaces, each corresponding to one source. Finally, the sources are reconstructed from these multi-component subspaces [1]. This procedure is called independent subspace analysis (ISA). The main difficulty of ISA is to identify the components belonging to the same multi-component subspace.
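The separation principle behind ICA can be demonstrated with a minimal two-channel sketch (not the chapter's algorithm): whiten the mixtures, then search over rotations for the outputs that are "least Gaussian", measured here by squared excess kurtosis. The sources, mixing matrix, and contrast function are all our illustrative choices.

```python
import numpy as np

# minimal 2x2 ICA sketch: whitening + rotation search with a kurtosis contrast
n = 4000
t = np.linspace(0.0, 8.0, n)
S = np.vstack([np.sin(2 * np.pi * t),
               np.sign(np.sin(3 * np.pi * t + 0.5))])  # two independent sources
A = np.array([[1.0, 0.5], [0.7, 1.0]])                  # mixing matrix
Y = A @ S                                               # observed mixtures

Yc = Y - Y.mean(axis=1, keepdims=True)
w, V = np.linalg.eigh(np.cov(Yc))
Z = np.diag(w ** -0.5) @ V.T @ Yc                       # whitened mixtures

def kurt(x):
    # excess kurtosis; far from 0 means "far from Gaussian"
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

best_a, best_val = 0.0, -np.inf
for a in np.linspace(0.0, np.pi / 2, 181):              # contrast is 90-deg periodic
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    val = kurt((R @ Z)[0]) ** 2 + kurt((R @ Z)[1]) ** 2
    if val > best_val:
        best_a, best_val = a, val

R = np.array([[np.cos(best_a), -np.sin(best_a)], [np.sin(best_a), np.cos(best_a)]])
S_hat = R @ Z                            # estimated sources (up to sign/scale/order)
C = np.abs(np.corrcoef(np.vstack([S, S_hat])))[:2, 2:]
print(C.max(axis=1))                     # each entry should be close to 1
```

As the example shows, ICA recovers sources only up to sign, scale, and permutation, which is why a subsequent grouping step (ISA) is needed when the number of components exceeds the number of sources.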
14.4.1.2 Non-negative Matrix Factorization (NNMF)
The factorization of the given data Y into a mixing matrix A and source components S, i.e., Y = AS, can be computed by matrix factorization methods. The data we use for signal separation are obtained by taking the modulus of the signal's STFT, so that the input data is non-negative. Since the source components are assumed to be spectrograms as well, we require them to be non-negative, too. Therefore, non-negative matrix factorization (NNMF) is a suitable tool for this decomposition.
There are different NNMF algorithms available, all of which rely on the non-negativity Y, A, S ≥ 0, where different measures d(Y, AS) for the reconstruction error have been proposed [6, 23, 27]. We consider the generalized Kullback-Leibler distance (proposed in [13] and used for decomposing signal data in [27]):
\[d(Y, AS) = \sum_{i,j}\Big(Y_{ij}\log\frac{Y_{ij}}{(AS)_{ij}} - Y_{ij} + (AS)_{ij}\Big).\]
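This distance can be minimized with the multiplicative update rules of Lee and Seung [13], which preserve non-negativity of the factors and decrease the distance monotonically. A minimal sketch on synthetic data (sizes, initialization, and iteration count are our choices, not the chapter's):

```python
import numpy as np

# sketch: NNMF by the Lee-Seung multiplicative updates for the generalized
# Kullback-Leibler distance; synthetic data of exact low rank
rng = np.random.default_rng(4)
d, n, r = 20, 100, 4
Y = rng.random((d, r)) @ rng.random((r, n))     # non-negative data to factorize

A = rng.random((d, r)) + 0.1                    # non-negative initial factors
S = rng.random((r, n)) + 0.1
eps = 1e-12                                     # guards against division by zero

def kl_dist(Y, B):
    return np.sum(Y * np.log((Y + eps) / (B + eps)) - Y + B)

kl0 = kl_dist(Y, A @ S)
for _ in range(500):
    # multiplicative updates keep A, S >= 0 by construction
    S *= (A.T @ (Y / (A @ S + eps))) / (A.sum(axis=0)[:, None] + eps)
    A *= ((Y / (A @ S + eps)) @ S.T) / (S.sum(axis=1)[None, :] + eps)

print(kl0, kl_dist(Y, A @ S))                   # the distance drops sharply
```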
14.4.2 Numerical Results
We present one numerical example comparing the decomposition strategies ICA and NNMF. We consider a mixture \(f = f_1 + f_2\) of acoustic transient signals, where \(f_1\) is a sequence of castanets and \(f_2\) is a cymbal signal, shown in Fig. 14.6, where the combination \(f = f_1 + f_2\) of the two signals is also displayed. The spectrograms in these figures are generated by an STFT with a Hamming window. Since \(f_2\) is a high-energy signal, f has a complex frequency characteristic. Therefore, the extraction of the castanets signal \(f_1\), which is active only at a few time steps, is a challenging task.
The obtained separations, resulting from the two different decomposition methods using NNPCA and PCA, respectively, are displayed in Fig. 14.7. Note that both methods, NNMF and ICA, manage to reproduce the characteristic peaks of the castanets quite well. However, in the case of NNMF, strong artifacts of the castanets are visible in the cymbal signal, whereas the separation by ICA is almost perfect.
Likewise, for the reconstruction of the reduced signal, the combination of PCA and ICA provides an almost complete reproduction of the original signal f (see Fig. 14.8). Only at time steps where a high amplitude of the cymbal exactly matches a peak of the castanets is a correct separation not quite achieved. As for the NNMF, the spectrogram in Fig. 14.8 shows that information is lost.
We finally remark that for signal separation without dimensionality reduction, NNMF is competitive with ICA (see, e.g., [27]). This indicates that our use of NNPCA in combination with NNMF could be improved. Further improvements could be achieved by the use of more sophisticated (nonlinear) dimensionality reduction methods. On the other hand, this would lead to a much more complicated construction of the inverse transform, as required for the back-projection of the data. We defer these points to future research. Nevertheless, although PCA is only a linear projection method, the numerical results of this section, especially those obtained by the combination of PCA and ICA, are already quite promising.
References
Casey, M., Westner, A.: Separation of mixed audio sources by independent subspace analysis. In: Proceedings of the International Computer Music Conference, San Francisco, pp. 154–161 (2000)
Cohen, A., Daubechies, I., Feauveau, J.C.: Biorthogonal bases of compactly supported wavelets. Commun. Pure Appl. Math. 45(5), 485–560 (1992)
Coifman, R., Donoho, D.: Translation-invariant de-noising. In: Wavelets and Statistics, pp. 125–150. Springer, New York (1995)
Comon, P.: Independent component analysis, a new concept? Signal Process. 36(3), 287–314 (1994)
FitzGerald, D., Coyle, E., Lawlor, B.: Sub-band independent subspace analysis for drum transcription. In: Proceedings of the 5th International Conference on Digital Audio Effects (DAFX’02), Hamburg, pp. 65–69 (2002)
FitzGerald, D., Cranitch, M., Coyle, E.: Non-negative tensor factorisation for sound source separation. In: Proceedings of Irish Signals and Systems Conference, Dublin, pp. 8–12 (2005)
Guillemard, M., Iske, A.: Curvature analysis of frequency modulated manifolds in dimensionality reduction. Calcolo 48(1), 107–125 (2011)
Guillemard, M.: Some Geometrical and Topological Aspects of Dimensionality Reduction in Signal Analysis. PhD thesis, University of Hamburg (2011). ftp://ftp.math.tu-berlin.de/pub/numerik/guillem/prj2/mgyDiss.pdf
Hall, B.C.: Lie Groups, Lie Algebras, and Representations. An Elementary Introduction. Springer, New York (2003)
Heinen, D., Plonka, G.: Wavelet shrinkage on paths for denoising of scattered data. Result. Math. 62(3–4), 337–354 (2012)
Iserles, A., Munthe-Kaas, H., Nørsett, S., Zanna, A.: Lie-group methods. Acta Numer. 9, 215–365 (2000)
Krause-Solberg, S., Iske, A.: Non-negative dimensionality reduction in signal separation. University of Hamburg (2014, preprint)
Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems, vol. 13, pp. 556–562. MIT, Cambridge, Massachusetts (U.S.A.) (2000)
Lee, J.A., Verleysen, M.: Nonlinear Dimensionality Reduction. Springer, New York (2007)
Mallat, S.: A Wavelet Tour of Signal Processing. Academic, San Diego (1999)
Plonka, G.: The easy path wavelet transform: a new adaptive wavelet transform for sparse representation of two-dimensional data. Multiscale Model. Simul. 7(3), 1474–1496 (2009)
Plonka, G., Iske, A., Tenorth, S.: Optimal representation of piecewise Hölder smooth bivariate functions by the easy path wavelet transform. J. Approx. Theory 176, 42–67 (2013)
Plonka, G., Roşca, D.: Easy path wavelet transform on triangulations of the sphere. Math. Geosci. 42(7), 839–855 (2010)
Plonka, G., Tenorth, S., Iske, A.: Optimally sparse image representation by the easy path wavelet transform. Int. J. Wavelets Multiresolut. Inf. Process. 10(1), 1250007, 20 pp. (2012)
Plonka, G., Tenorth, S., Roşca, D.: A hybrid method for image approximation using the easy path wavelet transform. IEEE Trans. Image Process. 20(2), 372–381 (2011)
Plumbley, M.D.: Geometrical methods for non-negative ICA: manifolds, Lie groups and toral subalgebras. Neurocomputing 67, 161–197 (2005)
Ram, I., Elad, M., Cohen, I.: Generalized tree-based wavelet transform. IEEE Trans. Signal Process. 59(9), 4199–4209 (2011)
Smaragdis, P., Brown, J.: Non-negative matrix factorization for polyphonic music transcription. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, pp. 177–180 (2003)
Tenorth, S.: Adaptive Waveletmethoden zur Approximation von Bildern. PhD thesis, University of Göttingen (2011). http://hdl.handle.net/11858/00-1735-0000-0006-B3E7-A
Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Proceedings of 6th International Conference on Computer Vision (ICCV ’98), Bombay, pp. 839–846. IEEE Computer Society (1998)
Uhle, C., Dittmar, C., Sporer, T.: Extraction of drum tracks from polyphonic music using independent subspace analysis. In: Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), Nara, pp. 843–848 (2003)
Virtanen, T.: Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. Trans. Audio Speech Lang. Proc. 15(3), 1066–1074 (2007)
Welk, M., Weickert, J., Steidl, G.: A four-pixel scheme for singular differential equations. In: Scale-Space and PDE Methods in Computer Vision, pp. 610–621. Springer, Berlin (2005)
© 2014 Springer International Publishing Switzerland
Guillemard, M., Heinen, D., Iske, A., Krause-Solberg, S., Plonka, G. (2014). Adaptive Approximation Algorithms for Sparse Data Representation. In: Dahlke, S., et al. Extraction of Quantifiable Information from Complex Systems. Lecture Notes in Computational Science and Engineering, vol 102. Springer, Cham. https://doi.org/10.1007/978-3-319-08159-5_14
Print ISBN: 978-3-319-08158-8
Online ISBN: 978-3-319-08159-5