
14.1 Introduction

During the last few years there has been an increasing interest in efficient (i.e., sparse) representation and denoising of high-dimensional signals. We have focussed our research on the development and analysis of adaptive approximation algorithms for high-dimensional signals, especially (a) scattered data denoising by wavelet transforms; (b) nonlinear dimensionality reduction relying on geometrical and topological concepts. This contribution reviews our recent research results on (a) and (b).

For (a), we present a general framework for the Easy Path Wavelet Transform (EPWT) for sparse representation and denoising of scattered data taken from high-dimensional signals (in Sect. 14.2). As regards (b), we continue our research on nonlinear dimensionality reduction (NDR) methods (cf. Sect. 14.3), where we combine recent NDR methods with non-negative matrix factorization (NNMF), for the purpose of separating sources from a mixture of signals without prior knowledge of the mixing process. More details on dimensionality reduction and NNMF, along with our recent results on signal separation, are discussed in Sect. 14.4.

The presented results are based on our papers [7–9, 11, 13, 17–21, 25] and have been achieved in the project “Adaptive approximation algorithms for sparse data representation” of the German Research Foundation’s priority program DFG-SPP 1324.

14.2 The Easy Path Wavelet Transform

Let Ω be a connected domain in \(\mathbb{R}^{d}\) and let Γ be a large finite set of points in Ω. We let \(h_{\varGamma }:=\max _{y\in \varOmega }\min _{x\in \varGamma }\|y - x\|_{2}\) be the fill distance of Γ in Ω, and \(g_{\varGamma }:=\min _{x,x^{{\prime}}\in \varGamma,x\neq x^{{\prime}}}\|x - x^{{\prime}}\|_{2}\) its grid distance. We say that the set Γ is quasi-uniform if \(h_{\varGamma } < 2g_{\varGamma }\). Further, let \(f:\varOmega \rightarrow \mathbb{R}\) be a piecewise smooth function that is sampled at Γ, i.e., the values f(x), x ∈ Γ, are given. We are now interested in an efficient approximation of f using a meshless multiscale approach called the Easy Path Wavelet Transform (EPWT). For applications, we usually assume that Γ approximates a smooth manifold in \(\mathbb{R}^{d}\). For example, our approach covers the efficient approximation of digital images, see [16, 20], where Γ is chosen to be a set of regular grid points in a rectangle Ω, and the approximation of piecewise smooth functions on the sphere, see [18], where \(\varOmega = \mathbb{S}^{2}\) and Γ is a suitably chosen quasi-uniform point set on the sphere \(\mathbb{S}^{2}\).

Similar approaches have also been proposed for generalizing the wavelet transform to data defined on weighted graphs, see [22]. In this section, we extend the EPWT proposed in [16, 18, 24] to the case of high-dimensional data approximation.

14.2.1 The General EPWT Algorithm for Sparse Approximation

Let us shortly recall the notions of a biorthogonal wavelet filter bank of perfect reconstruction. To this end, let \(\varphi\) be a sufficiently smooth, compactly supported, one-dimensional scaling function, \(\tilde{\varphi }\) the corresponding biorthogonal compactly supported scaling function, and \(\psi,\,\tilde{\psi }\) the corresponding pair of biorthogonal compactly supported wavelets, see, e.g., [2, 15]. These functions provide us with a filter bank of perfect reconstruction with sequences \((h_{n})_{n\in \mathbb{Z}}\), \((\tilde{h}_{n})_{n\in \mathbb{Z}}\) of low-pass filter coefficients and \((g_{n})_{n\in \mathbb{Z}}\), \((\tilde{g}_{n})_{n\in \mathbb{Z}}\) of high-pass filter coefficients.

Assume that the number N = |Γ| of given points \(x \in \varGamma \subset \mathbb{R}^{d}\) is a power of 2, \(N = 2^{J}\) with J ≫ 1. We denote \(\varGamma ^{J}:= \varGamma\) and its elements by \(x_{k}^{J} = x_{k}\), k = 1, …, N, i.e., we fix some ordering of the points in \(\varGamma ^{J}\). Now the EPWT works as follows. In a first step, we seek a suitable permutation \(p^{J}\) of the indices of the points in \(\varGamma ^{J}\) by determining a path of length N through all points \(x_{k}^{J}\) such that consecutive data points \((x_{p^{J}(k)}^{J},f(x_{p^{J}(k)}^{J}))\) and \((x_{p^{J}(k+1)}^{J},f(x_{p^{J}(k+1)}^{J}))\) in the path strongly “correlate”. In the second step, we apply the one-dimensional wavelet filter bank to the sequence of function values \((f(x_{p^{J}(k)}^{J}))_{k=1}^{N}\), and simultaneously a low-pass filter to the points \((x_{p^{J}(k)}^{J})_{k=1}^{N}\), where we consider each of the d components separately. The significant high-pass coefficients corresponding to the function values are stored. The N∕2 low-pass data are processed further at the next level of the EPWT. In particular, we denote the set of the N∕2 points obtained by low-pass filtering and downsampling of \((x_{p^{J}(k)}^{J})_{k=1}^{N}\) by \(\varGamma ^{J-1}\), and relate the low-pass function coefficients to these points. Again, we start by seeking a permutation \(p^{J-1}\) of the indices of the points in \(\varGamma ^{J-1}\) to obtain an appropriate ordering of the data and apply the one-dimensional wavelet filter bank to the ordered low-pass function data. We iterate this procedure and obtain a sparse representation of the original data by applying a hard thresholding procedure to the high-pass coefficients of the function value components. The complete procedure is summarized by Algorithm 9.

Algorithm 9 Decomposition

By construction, many high-pass values \(d_{k}^{\ell}\) will vanish. An optimal storage of the path vectors \(p^{\ell}\) depends on the original distribution of the points \(x_{k}^{J}\) and on the applied filter \(\tilde{h}_{p}\). Employing a “lazy” filter, we have \(x_{k}^{\ell}:= x_{p^{\ell+1}(2k)}^{\ell+1}\), such that at each level the new point set is just a subset of that of the preceding level, of half its cardinality.
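As an illustration, a minimal Python sketch of one decomposition level might look as follows; it uses the orthogonal Haar filter pair for the function values and the “lazy” filter for the points, and the function name and interface are illustrative rather than those of Algorithm 9.

```python
import numpy as np

def epwt_level(points, values, path, theta):
    """One level of the EPWT decomposition described above (illustrative sketch):
    Haar analysis of the function values along the given path, hard thresholding
    of the detail coefficients, and the "lazy" filter for the points, i.e. every
    other point along the path is kept for the next level."""
    v = values[path]                          # function values ordered along the path
    low = (v[0::2] + v[1::2]) / np.sqrt(2.0)  # low-pass part, processed further
    detail = (v[0::2] - v[1::2]) / np.sqrt(2.0)
    detail[np.abs(detail) < theta] = 0.0      # keep only the significant coefficients
    next_points = points[path[0::2]]          # lazy filter: subset of half cardinality
    return next_points, low, detail
```

Iterating this step on the returned points and low-pass values, with a new path vector constructed at each level as in Sect. 14.2.2, and collecting the thresholded detail coefficients yields the sparse representation described above.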

Algorithm 10 Reconstruction

14.2.2 Construction of Path Vectors

The main challenge for the application of the EPWT to sparse data representation is to construct path vectors through the point sets \(\varGamma ^{\ell}\), ℓ = J, …, J − L + 1. This step is crucial for the performance of the data compression. The path construction is based on determining a suitable correlation measure that takes the local distance of the scattered points \(x_{k}^{\ell}\) into account, on the one hand, and the difference of the corresponding low-pass values \(f_{k}^{\ell}\), on the other hand. In the following, we present some strategies for path construction and comment on their advantages and drawbacks.

14.2.2.1 Path Construction with Fixed Local Distances

One suitable strategy for path construction [16, 24] is based on a priori fixed local \(\varepsilon\)-neighborhoods of the points \(x_{k}^{\ell}\). In \(\mathbb{R}^{d}\), we consider a neighborhood of the form

$$\displaystyle{N_{\varepsilon }(x_{k}^{\ell}) =\{ x \in \varGamma ^{\ell}\setminus \{x_{ k}^{\ell}\}:\,\| x_{ k}^{\ell} - x\|_{ 2} \leq 2^{-\ell/d}\varepsilon \},}$$

where \(\varepsilon > 2^{J/d}\,g_{\varGamma }\) depends on the distribution of the original point set \(\varGamma = \varGamma ^{J}\). For example, starting with a regular rectangular grid in \(\mathbb{R}^{2}\) with mesh size \(g_{\varGamma } = 2^{-J/2}\) (with J even) in both directions, one may think of a constant \(\varepsilon\) with \(\sqrt{ 2} \leq \varepsilon < 2\), such that each inner grid point has eight neighbors.

For path construction at level ℓ of the EPWT, we choose a first point \(x^{\ell} \in \varGamma ^{\ell}\) randomly, and put \(x_{p^{\ell}(1)}^{\ell}:= x^{\ell}\). Let now \(P_{j}^{\ell}:=\{ x_{p^{\ell}(1)}^{\ell},\ldots,x_{p^{\ell}(j)}^{\ell}\}\) be the set of points that have already been taken into the path. Now, we determine the (j + 1)-th point by

$$\displaystyle{ x_{p^{\ell}(j+1)}^{\ell}:=\mathop{ \mathrm{argmin}}_{ x\in N_{\varepsilon }(x_{p^{\ell}(j)}^{\ell})\setminus P_{j}^{\ell}}\vert f(x) - f(x_{p^{\ell}(j)}^{\ell})\vert, }$$
(14.1)

i.e., we choose the point x in the neighborhood of \(x_{p^{\ell}(j)}^{\ell}\) with minimal absolute difference of the corresponding function values. This measure has been applied in the rigorous EPWT of [16, 18]. The advantage of fixing the local neighborhood in the spatial domain lies in the reduced storage costs for the path vector that needs to be kept to ensure reconstruction. The drawback of this measure is that the set of “admissible points” \(N_{\varepsilon }(x_{p^{\ell}(j)}^{\ell})\setminus P_{j}^{\ell}\) may be empty. In this case a different rule for finding the next path entry has to be applied, as in the sketch below.
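A minimal sketch of this rigorous path construction might read as follows; the level-dependent factor is assumed to be absorbed into the radius eps, and the fallback for interruptions (the nearest unvisited point) is only one possible rule.

```python
import numpy as np

def greedy_path(points, values, eps, start=0):
    """Rigorous EPWT path (14.1): starting from `start`, repeatedly move to the
    not-yet-visited point within distance `eps` whose function value differs least.
    If no admissible neighbour is left, fall back to the nearest unvisited point
    (one possible "interruption" rule)."""
    n = len(values)
    visited = np.zeros(n, dtype=bool)
    path = [start]
    visited[start] = True
    for _ in range(n - 1):
        cur = path[-1]
        dist = np.linalg.norm(points - points[cur], axis=1)
        admissible = (~visited) & (dist <= eps)
        if admissible.any():
            cand = np.flatnonzero(admissible)
            nxt = cand[np.argmin(np.abs(values[cand] - values[cur]))]
        else:                                 # interruption: no admissible neighbour
            cand = np.flatnonzero(~visited)
            nxt = cand[np.argmin(dist[cand])]
        path.append(int(nxt))
        visited[nxt] = True
    return np.array(path)
```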

A special measure arises if one tries to mimic the one-dimensional wavelet transform. In order to exploit the piecewise smoothness of the function f to be approximated, one should prefer path vectors where, locally, three consecutive points \(x_{p^{\ell}(j-1)}^{\ell}\), \(x_{p^{\ell}(j)}^{\ell}\), \(x_{p^{\ell}(j+1)}^{\ell}\) lie (almost) on a straight line. This consideration leads to the following measure: we fix a threshold μ for the function values. For finding the next point in the path, we compute

$$\displaystyle{ N_{\varepsilon,\mu }(x_{p^{\ell}(j)}^{\ell}):=\{ x \in N_{\varepsilon }(x_{ p^{\ell}(j)}^{\ell})\setminus P_{ j}^{\ell}:\, \vert f(x) - f(x_{ p^{\ell}(j)}^{\ell})\vert \leq \mu \}, }$$
(14.2)

and then let

$$\displaystyle{ x_{p^{\ell}(j+1)}^{\ell}:=\mathop{ \mathrm{argmin}}\limits _{ x\in N_{\varepsilon,\mu }(x_{p^{\ell}(j)}^{\ell})} \frac{\langle x_{p^{\ell}(j)}^{\ell} - x_{p^{\ell}(j-1)}^{\ell},x_{p^{\ell}(j)}^{\ell} - x\rangle } {\|x_{p^{\ell}(j)}^{\ell} - x_{p^{\ell}(j-1)}^{\ell}\|_{2}\,\|x_{p^{\ell}(j)}^{\ell} - x\|_{2}}, }$$
(14.3)

where \(\langle \cdot,\cdot \rangle\) denotes the usual scalar product in \(\mathbb{R}^{d}\). Note that in (14.3) the cosine of the angle between the vectors \(x_{p^{\ell}(j)}^{\ell} - x_{p^{\ell}(j-1)}^{\ell}\) and \(x_{p^{\ell}(j)}^{\ell} - x\) attains its minimal value −1 precisely when \(x_{p^{\ell}(j-1)}^{\ell},\,x_{p^{\ell}(j)}^{\ell}\) and x are collinear and x continues the direction of the path. This approach is taken in [16, 24] for images (called relaxed EPWT), and in [10] for scattered data denoising; a small sketch of the corresponding selection rule is given below.
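The relaxed selection rule can be sketched as follows; again the level-dependent factor is assumed to be absorbed into eps, the path is assumed to contain at least two points already, and returning None signals an interruption that has to be handled by a separate rule.

```python
import numpy as np

def next_point_relaxed(points, values, path, visited, eps, mu):
    """Relaxed selection rule (14.2)-(14.3): among the unvisited points within
    distance eps whose function value differs by at most mu from the current one,
    choose the point minimizing the cosine in (14.3), i.e. the point that best
    continues the current path direction."""
    prev, cur = path[-2], path[-1]
    dist = np.linalg.norm(points - points[cur], axis=1)
    admissible = (~visited) & (dist <= eps) & (np.abs(values - values[cur]) <= mu)
    cand = np.flatnonzero(admissible)
    if cand.size == 0:
        return None                           # interruption
    u = points[cur] - points[prev]            # current path direction
    w = points[cur] - points[cand]            # vectors x_{p(j)} - x as in (14.3)
    cos = (w @ u) / (np.linalg.norm(u) * np.linalg.norm(w, axis=1))
    return int(cand[np.argmin(cos)])
```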

Remark 14.1.

The idea to prefer path vectors where the angle between three consecutive points in the path is as large as possible can be theoretically validated in different ways. Assume that the given wavelet decomposition filter \(\tilde{g} = (\tilde{g}_{k})_{k\in \mathbb{Z}}\) in the filter bank satisfies the moment conditions \(\sum _{k\in \mathbb{Z}}\tilde{g}_{k} = 0\) and \(\sum _{k\in \mathbb{Z}}k\tilde{g}_{k} = 0\). Then we simply observe that for a constant function f(x) = c, x ∈ Γ, with \(c \in \mathbb{R}\), by

$$\displaystyle{d_{n}^{J} =\sum _{ k\in \mathbb{Z}}\tilde{g}_{k-2n+1}f(x_{p^{J}(k)}^{J}) = c\sum _{ k\in \mathbb{Z}}\tilde{g}_{k-2n+1} = 0}$$

all wavelet coefficients vanish, while for a linear function of the form \(f(x) = a^{T}x + b\) with \(a \in \mathbb{R}^{d}\) and \(b \in \mathbb{R}\) we have

$$\displaystyle{d_{n}^{J} =\sum _{ k\in \mathbb{Z}}\tilde{g}_{k-2n+1}f(x_{p^{J}(k)}^{J}) = a^{T}\sum _{ k\in \mathbb{Z}}\tilde{g}_{k-2n+1}x_{p^{J}(k)}^{J} + b\sum _{ k\in \mathbb{Z}}\tilde{g}_{k-2n+1}.}$$

Consequently, these coefficients only vanish if the points in the sequence \((x_{p^{J}(k)}^{J})_{k\in \mathbb{Z}}\) are collinear and equidistant, see [10]. A second validation for choosing the path vector using the criterion (14.3) is given by the so-called path smoothness condition in [17], see also Sect. 14.2.4, Remark 14.4. A small numerical illustration is given below.
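The following small numerical check (not taken from the cited papers) uses the high-pass filter (1, −2, 1)∕2, which satisfies both moment conditions, and compares a collinear equidistant path with a path containing a kink:

```python
import numpy as np

# High-pass filter with vanishing zeroth and first moments.
g = np.array([1.0, -2.0, 1.0]) / 2.0
assert abs(g.sum()) < 1e-12 and abs((np.arange(3) * g).sum()) < 1e-12

a, b = np.array([2.0, -1.0]), 0.5             # linear function f(x) = a^T x + b
t = np.arange(8.0)
x_collinear = np.outer(t, np.array([1.0, 1.0]))   # equidistant points on a line
x_bent = x_collinear.copy()
x_bent[4:, 1] += 1.0                          # same path with a kink

for x in (x_collinear, x_bent):
    f = x @ a + b
    d = np.array([g @ f[2 * n:2 * n + 3] for n in range(3)])  # d_n = sum_k g_k f_{k+2n}
    print(np.round(d, 12))    # all zeros for the collinear path, nonzero at the kink
```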

Remark 14.2.

Our numerical results in Sect. 14.2.5 show that the relaxed path construction proposed in (14.2)–(14.3) is far superior to the rigorous path construction (14.1), since it produces fewer “interruptions”, i.e., cases where \(N_{\varepsilon }(x_{p^{\ell}(j)}^{\ell})\setminus P_{j}^{\ell} = \varnothing \) and a new path entry has to be taken that is no longer locally correlated to the preceding point, which usually leads to large wavelet coefficients and a higher effort in path coding (see [16, 24]).

14.2.2.2 Path Construction with Global Distances

We want to present a second path construction using a global weight function. Considering the vectors \(y_{k}^{\ell} = y(x_{k}^{\ell}):= ((x_{k}^{\ell})^{T},f_{k}^{\ell})^{T} \in \mathbb{R}^{d+1}\) at each level, we define a symmetric weight matrix \(W^{\ell} = (w(y_{k}^{\ell},y_{k^{{\prime}}}^{\ell}))_{ k,k^{{\prime}}=1}^{2^{\ell}}\), where the weight is written as

$$\displaystyle{w(y_{k}^{\ell},y_{ k^{{\prime}}}^{\ell }) = w_{1}(x_{k}^{\ell},x_{ k^{{\prime}}}^{\ell }) \cdot w_{2}(f_{k}^{\ell},f_{ k^{{\prime}}}^{\ell }).}$$

Now the weights for the scattered points \(x_{k}^{\ell}\) can be chosen differently from the weights for the (low-pass) function values \(f_{k}^{\ell}\). A possible weight function used already in the context of bilateral filtering [25] is

$$\displaystyle{w(y_{k}^{\ell},y_{ k^{{\prime}}}^{\ell }) =\exp \left (\frac{-\Vert x_{k}^{\ell} - x_{k^{{\prime}}}^{\ell}\Vert _{ 2}^{2}} {2^{2(J-\ell)/d}\eta _{1}} \right ) \cdot \exp \left (\frac{-\vert f_{k}^{\ell} - f_{k^{{\prime}}}^{\ell}\vert ^{2}} {2^{J-\ell}\eta _{2}} \right ),}$$

where \(\eta _{1}\) and \(\eta _{2}\) need to be chosen appropriately. The normalization constant \(2^{2(J-\ell)/d}\) in the weight \(w_{1}\) is due to the reduction of the point set \(\varGamma ^{\ell}\) by a factor of 2 at each level, so that the distances between the points grow. The normalization constant \(2^{J-\ell}\) in the weight \(w_{2}\) arises from the usual amplification of the low-pass coefficients in the wavelet transform with filters \(\tilde{h}\) satisfying \(\sum _{k\in \mathbb{Z}}\tilde{h}_{k} = \sqrt{2}\).

Having computed the weight matrix \(W^{\ell} = (w(y_{k}^{\ell},y_{k^{{\prime}}}^{\ell}))_{ k,k^{{\prime}}=1}^{2^{\ell}}\), we simply compute the path vector as follows. We choose the first component \(x_{p^{\ell}(1)}^{\ell}\) randomly from Γ . Using again the notation \(P_{j}^{\ell}:=\{ x_{p^{\ell}(1)}^{\ell},\ldots,x_{p^{\ell}(j)}^{\ell}\}\) for the set of points in Γ that are already contained in the path vector, we now determine the next point as

$$\displaystyle{x_{p^{\ell}(j+1)}^{\ell}:=\mathop{ \mathrm{argmax}}_{ x\in \varGamma ^{\ell}\setminus P_{j}^{\ell}}w(y(x),y(x_{p^{\ell}(j)}^{\ell})),}$$

where uniqueness can be achieved by fixing a rule if the maximum is attained at more than one point. The advantage of this path construction is that no “interruptions” occur. The essential drawback is the higher storage cost for the path vectors, since we can no longer rely on direct local neighborhood properties of consecutive points in the path vector. Further, computing the full weight matrix \(W^{\ell}\) is very expensive. The costs can be reduced by cutting off the spatial weight at a suitable distance, defining

$$\displaystyle{ w_{1}(x_{k}^{\ell},x_{ k^{{\prime}}}^{\ell }) = \left \{\begin{array}{ll} \exp (-\|x_{k}^{\ell} - x_{k^{{\prime}}}^{\ell}\|_{ 2}^{2}/2^{2(J-\ell)/d}\eta _{ 1})&\quad \mbox{ for }\|x_{k}^{\ell} - x_{ k^{{\prime}}}^{\ell}\|_{ 2} \leq 2^{-\ell/d}D, \\ 0 &\quad \mbox{ for }\|x_{k}^{\ell} - x_{k^{{\prime}}}^{\ell}\|_{ 2} > 2^{-\ell/d}D, \end{array} \right. }$$
(14.4)

with D chosen appropriately to ensure a sufficiently large spatial neighborhood.
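A sketch of this weight-based path construction is given below; the level-dependent normalizations from the text are assumed to be absorbed into the parameters eta1, eta2 and radius, and ties are broken by simply taking the first maximizer.

```python
import numpy as np

def weight_path(points, values, eta1, eta2, radius, start=0):
    """Path construction with global weights (Sect. 14.2.2.2, sketch): the spatial
    Gaussian weight w1 is cut off at `radius` as in (14.4), w2 weights the
    (low-pass) function values, and the next path entry maximizes w1*w2 over all
    points not yet contained in the path."""
    n = len(values)
    dist2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    w1 = np.where(np.sqrt(dist2) <= radius, np.exp(-dist2 / eta1), 0.0)
    w2 = np.exp(-np.subtract.outer(values, values) ** 2 / eta2)
    W = w1 * w2                               # symmetric weight matrix
    path = [start]
    used = np.zeros(n, dtype=bool)
    used[start] = True
    for _ in range(n - 1):
        w = W[path[-1]].copy()
        w[used] = -np.inf                     # exclude points already in the path
        nxt = int(np.argmax(w))               # fixed tie-breaking rule: first maximizer
        path.append(nxt)
        used[nxt] = True
    return np.array(path)
```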

Remark 14.3.

This approach has been used in [10] for random path construction, where the compactly supported weight function \(w_{1}(x_{k}^{\ell},x_{k^{{\prime}}}^{\ell})\) above is employed. Taking the weight function

$$\displaystyle{w_{1}(x_{k}^{\ell},x_{ k^{{\prime}}}^{\ell }) = \left \{\begin{array}{ll} 1&\quad \mbox{ for }\|x_{k}^{\ell} - x_{k^{{\prime}}}^{\ell}\|_{ 2} \leq 2^{-\ell/d}D, \\ 0&\quad \mbox{ for }\|x_{k}^{\ell} - x_{k^{{\prime}}}^{\ell}\|_{ 2} > 2^{-\ell/d}D, \end{array} \right.}$$

and \(w_{2}(f_{k}^{\ell},f_{k^{{\prime}}}^{\ell}) =\exp \left (\frac{-\vert f_{k}^{\ell}-f_{ k^{{\prime}}}^{\ell}\vert ^{2}} {2^{J-\ell}\eta _{2}} \right )\) we obtain a distance measure that is equivalent to (14.1).

14.2.3 EPWT for Scattered Data Denoising

The EPWT can also be used for denoising of scattered data. Let us again assume that \(\varGamma =\{ x_{1},\ldots,x_{N}\}\) are scattered points in \(\mathbb{R}^{d}\) and let \(f: \mathbb{R}^{d} \rightarrow \mathbb{R}\) be a smooth function sampled on \(\varGamma \subset \varOmega\). For the measured data \(\tilde{f}(x_{j})\), we suppose that

$$\displaystyle{\tilde{f}(x_{j}) = f(x_{j}) + z_{j},}$$

where z j denotes additive Gaussian noise with zero mean and an unknown variance σ 2. For the distribution of the points in Ω we assume quasi-uniformity as before.

We now apply the EPWT, Algorithms 9 and 10 in Sect. 14.2.1, for data denoising. Note that in the case of noisy function values, the construction of path vectors (being based on the correlation of function values at points with small spatial distance) is now influenced by the noise. To improve the denoising performance, we mimic the “cycle spinning” method (see [3]), which works as follows: the (tensor product) wavelet shrinkage is applied not only to the image itself, but also to the images obtained by up to seven cyclic shifts in x- and y-direction. After un-shifting, one takes the average of the 64 reconstructed images, thereby greatly improving the denoising result.

For the EPWT, we apply Algorithms 9 and 10 64 times, each time with a different starting value \(x_{p^{J}(1)}\) as the first path component. For the path construction, we utilize one of the two methods described in Sect. 14.2.2. After reconstruction of the 64 data sets, we take the average in order to obtain the denoising result; a compact sketch is given below. Similarly as for wavelet denoising, the threshold parameter θ in Algorithm 9 needs to be selected carefully depending on the noise level.
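The following compact sketch of the averaging procedure uses, for brevity, a single Haar shrinkage level along the path (the full method applies Algorithms 9 and 10 over several levels) and reuses the greedy_path sketch from Sect. 14.2.2.1.

```python
import numpy as np

def haar_shrink_along_path(order, values, theta):
    """One Haar analysis/synthesis step along the path `order`
    with hard thresholding of the detail coefficients."""
    v = values[order]
    a = (v[0::2] + v[1::2]) / np.sqrt(2.0)
    d = (v[0::2] - v[1::2]) / np.sqrt(2.0)
    d[np.abs(d) < theta] = 0.0
    rec = np.empty_like(v)
    rec[0::2] = (a + d) / np.sqrt(2.0)
    rec[1::2] = (a - d) / np.sqrt(2.0)
    out = np.empty_like(values)
    out[order] = rec
    return out

def epwt_denoise(points, noisy, theta, eps, n_runs=64, seed=None):
    """Average n_runs EPWT shrinkage reconstructions, each started from a
    different random first path component (the analogue of cycle spinning)."""
    rng = np.random.default_rng(seed)
    acc = np.zeros_like(noisy)
    for _ in range(n_runs):
        start = int(rng.integers(len(noisy)))
        order = greedy_path(points, noisy, eps, start)   # sketch from Sect. 14.2.2.1
        acc += haar_shrink_along_path(order, noisy, theta)
    return acc / n_runs
```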

In [10] we have employed two different path constructions for image denoising. The first one is very similar to the path construction in Sect. 14.2.2.1. The second one is based on a weight matrix resembling that in Sect. 14.2.2.2. Here, the next component in the path vector is chosen randomly according to a probability distribution based on the weight matrix.

For images, the proposed denoising procedure strongly outperforms the usual tensor-product wavelet shrinkage with cycle spinning, see [10]. Moreover, the procedure is not restricted to rectangular grids, but can be used in a much more general context for denoising of functions on manifolds. Numerical examples of the EPWT-based denoising scheme are given in Sect. 14.2.5.

14.2.4 Optimal Image Representation by the EPWT

In this subsection we restrict ourselves to the EPWT on digital images on a domain Ω = [0, 1)2. For cartoon models, where the image is piecewise Hölder continuous or even Hölder smooth, we can prove that the EPWT leads to optimally sparse image representations, see [17, 19]. To explain this, let F ∈ L 2(Ω) be a piecewise Hölder continuous image. More precisely, let {Ω i }1 ≤ i ≤ K be a finite set of regions forming a disjoint partition of Ω whose boundaries are continuous and of finite length. In each region Ω i , F is assumed to be Hölder continuous of order α ∈ (0, 1],

$$\displaystyle{ \vert F(x) - F(x + h)\vert \leq C\|h\|_{2}^{\alpha },\qquad x,\,x + h \in \varOmega _{ i}, }$$
(14.5)

where C > 0 does not depend on i. For given samples \(\{(F(2^{-J/2}n))\}_{n\in I_{J}}\), the function F can be approximated by the piecewise constant function

$$\displaystyle{F^{J}(x) =\sum _{ n\in I_{J}}F(2^{-J/2}n)\chi _{ [0,1)^{2}}(2^{J/2}x - n),\qquad x \in [0,1)^{2},}$$

where the index set \(I_{J}:=\{ n = (n_{1},n_{2}) \in \mathbb{N}^{2}:\, 0 \leq n_{1} \leq 2^{J/2} - 1,\,0 \leq n_{2} \leq 2^{J/2} - 1\}\) is of cardinality \(2^{J}\). In this special case, α ∈ (0, 1], we can rely on the orthogonal Haar wavelet filter bank in Algorithms 9 and 10. An optimal image representation relies strongly on an appropriate path construction. As shown in [19], we need to satisfy the following two conditions.

Region condition. At each level ℓ of the EPWT, we need to choose the path vector such that it contains at most \(R_{1}K\) discontinuities which are incurred by crossing over from one region Ω i to another region, or by jumping within one region Ω i . Here \(R_{1}\) does not depend on J or ℓ, and K is the number of regions.

Diameter condition. At each level ℓ of the EPWT, we require

$$\displaystyle{\|x_{p^{\ell}(k)} - x_{p^{\ell}(k+1)}\|_{2} \leq D_{1}2^{-\ell/2},}$$

for almost all points \(x_{p^{\ell}(k)}^{\ell}\), k = 1, …, \(2^{\ell} - 1\), where \(D_{1}\) does not depend on J or ℓ. The number of path components which do not satisfy the diameter condition is bounded by a constant that is independent of ℓ and J.

The region condition suggests that for path construction, we should first collect all points that belong to one region Ω i before transferring to the next region. The diameter condition ensures that the remaining points in \(\varGamma ^{\ell}\) are quasi-uniformly distributed at each level ℓ of the EPWT. If these two conditions on the path vectors are satisfied, we have shown in [19], Corollary 3.1, that the M-term approximation \(F_{M}\), reconstructed from the M most significant EPWT wavelet coefficients, satisfies the asymptotically optimal error estimate

$$\displaystyle{ \|F - F_{M}\|_{2}^{2} \leq \tilde{ C}\,M^{-\alpha } }$$
(14.6)

with a constant \(\tilde{C}\) and the Hölder exponent α ∈ (0, 1] in (14.5).

Remark 14.4.

Observe that at each level ℓ of the EPWT the path vector \((p^{\ell}(j))_{j=1}^{2^{\ell}}\) determines a planar curve that interpolates \(f_{p^{\ell}(j)}^{\ell}\) at the points \(x_{p^{\ell}(j)}^{\ell}\), j = 1, …, \(2^{\ell}\). By definition, this curve is only piecewise linear. A generalization of the optimal M-term approximation result (14.6) to piecewise Hölder smooth images with Hölder exponent α > 1 has been developed in [17]. In this case, one needs to generalize the idea of a piecewise linear path vector curve to a smooth path function that satisfies, besides the region condition and the diameter condition, a third condition called the path smoothness condition, see [17]. More precisely, let us consider a domain \(\varOmega \subset [0,1]^{2}\) with a sufficiently smooth boundary and a disjoint partition Ω i of Ω with smooth boundaries of finite length. Further, instead of (14.5), we assume that F ∈ L 2(Ω) is a piecewise smooth bivariate function that is Hölder smooth of order α > 1 in each region Ω i , i = 1, …, K. In order to show the optimal error estimate (14.6) also for α > 1, we need to employ a path function that approximates the values \(f_{p^{\ell}(j)}^{\ell}\) at the points \(x_{p^{\ell}(j)}^{\ell}\) and is a planar curve that is not only piecewise smooth but smooth of order α inside each region Ω i , with suitably bounded derivatives, see [17], Section 3.2. In particular, this condition suggests that one should avoid “small angles” in the path curve.

14.2.5 Numerical Results

We briefly illustrate the performance of the proposed EPWT algorithm for sparse data representation and data denoising. In Fig. 14.1, we illustrate the application of the EPWT for sparse image representation, see also [16, 24]. The three considered images are of size 256 × 256. In Algorithm 9, we have used the 7–9 biorthogonal filter bank for the function values and the lazy filter bank for the grid points, i.e., at each level of the EPWT, we have kept only every other grid point. The path construction from Sect. 14.2.2.1 is used, where in (14.2) the parameters \(\varepsilon = \sqrt{2}\) and μ = 5 are employed. The threshold parameter θ in Algorithm 9 is chosen such that the 1,000 most significant EPWT wavelet coefficients are kept for the clock image, 700 coefficients for the Lenna image, and 200 coefficients for the sail image. Figure 14.1 shows the reconstructed images, where we compare the results of a tensor-product wavelet compression with the 7–9 biorthogonal filter bank to the results of the EPWT reconstruction, using the same number of wavelet coefficients for the reconstruction in both cases.

Fig. 14.1

Top row: Reconstruction by tensor-product wavelet compression using the 7–9 biorthogonal filter bank with 1,000 wavelet coefficients for test image clock (PSNR 29.93), 700 coeffs for Lenna (PSNR 24.28), and 200 coeffs for sail (PSNR 19.58). Bottom row: Reconstruction by EPWT wavelet transform using the 7–9 biorthogonal filter bank with 1,000 wavelet coefficients for clock (PSNR 33.55), 700 coeffs for Lenna (PSNR 30.46), 200 coeffs for sail (PSNR 27.19)

In a second example we study the denoising behavior of the EPWT approach as described in Sect. 14.2.3. In Fig. 14.2, we present the noisy pepper image with a PSNR of 19.97 and compare the denoising results of different methods. In particular, we have used the four-pixel denoising scheme based on anisotropic diffusion by Welk et al. [28] with 76 iterations and step size 0.001, providing a PSNR of 28.26. Further, we apply the 7–9 wavelet shrinkage, yielding a PSNR of 24.91, and the 7–9 wavelet shrinkage with cycle spinning using 64 shifts of the image, yielding a PSNR of 28.11. Our EPWT denoising approach employing the relaxed path construction described in Sect. 14.2.2.1 achieves a PSNR of 29.01, while a random path construction based on the ideas in Sect. 14.2.2.2 yields a PSNR of 27.96. Note that the repeated application of the EPWT shrinkage method can be carried out in parallel. While our proposed EPWT denoising is (due to the path constructions) more expensive than the tensor-product wavelet shrinkage, its application is not restricted to rectangular regular grids.

Fig. 14.2

Top row: Peppers with additive white Gaussian noise with σ = 25 (PSNR 19.97) and reconstruction by the four-pixel scheme [28] (PSNR 28.26). Middle row: Reconstruction by 2d tensor-product wavelet transform using the 7–9 biorthogonal filter bank without (PSNR 24.91) and with cycle spinning (PSNR 28.11). Bottom row: Reconstruction by our approach described in Sect. 14.2.3 using a relaxed path construction with fixed local distances in (14.2) (PSNR 29.01) and a random path construction based on (14.4) (PSNR 27.96)

The third example applies the EPWT denoising approach of Sect. 14.2.3 to data on a triangular domain, see Fig. 14.3. We use the 7–9 biorthogonal filter bank for the function values, the lazy filter bank for the grid points, and the path construction from Sect. 14.2.2.1 with \(\varepsilon = 1.3\), μ = 89 and threshold θ = 89.

Fig. 14.3

Cameraman. Data with additive white Gaussian noise with σ = 25 (PSNR 19.98), and EPWT reconstruction using the approach in Sect. 14.2.3 (PSNR 26.31)

14.3 Dimensionality Reduction on High-Dimensional Signal Data

To explain basic concepts on dimensionality reduction, we regard point cloud data as a finite family of vectors

$$\displaystyle{X =\{ x_{i}\}_{i=1}^{m} \subset \mathbb{R}^{n}}$$

contained in an n-dimensional Euclidean space. The fundamental assumption is that X lies in \(\mathcal{M}\), a low dimensional (topological) space embedded in \(\mathbb{R}^{n}\). Therefore, \(X \subset \mathcal{M}\subset \mathbb{R}^{n}\) with \(p:= \mbox{ dim}(\mathcal{M}) \ll n\). Another ingredient is a parameter domain Ω for \(\mathcal{M}\), where Ω is assumed to be embedded in a low dimensional space \(\mathbb{R}^{d}\) with p ≤ d < n. Moreover, we assume the existence of a homeomorphism (diffeomorphism)

$$\displaystyle{\mathcal{A}:\varOmega \rightarrow \mathcal{M},}$$

so that Ω is a homeomorphic (diffeomorphic) copy of \(\mathcal{M}\). This concept can then be used for signal analysis in a low dimensional environment. In practice, we can only approximate Ω by a projection

$$\displaystyle{P: \mathcal{M}\rightarrow \varOmega ^{{\prime}},}$$

where Ω ′ is a homeomorphic copy of Ω. The low dimensional structure representing X is the reduced data \(Y =\{ y_{i}\}_{i=1}^{m} \subset \varOmega ^{{\prime}}\subset \mathbb{R}^{d}\), obtained by applying P to the points of X.

Principal component analysis (PCA) is a classical linear projection method. Dimensionality reduction by PCA can be described as an eigenvalue problem, so that PCA can be computed by using the singular value decomposition (SVD). More precisely, in PCA we consider centered data X (i.e., X has zero mean) in matrix form \(X \in \mathbb{R}^{n\times m}\). Now the concept of PCA is to construct a linear projection \(P: \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}\), with rank(P) = p < n, with minimal error \(\mbox{ err}(P,X) =\sum _{ k=1}^{m}\|x_{k} - P(x_{k})\|^{2}\), or, equivalently, with maximal variance \(\mbox{ var}(P,X) =\sum _{ k=1}^{m}\|P(x_{k})\|^{2}\). These conditions can in turn be reformulated as an eigenvalue problem, where the p largest eigenvalues of the covariance matrix \(\mathit{XX}^{T} \in \mathbb{R}^{n\times n}\) are sought, cf. [14].
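As an illustration, the following sketch computes the PCA projection of centered data via the SVD; the column convention (data points as columns) and the function name are ours.

```python
import numpy as np

def pca_project(X, p):
    """PCA of data X (columns are the m data points in R^n): the optimal rank-p
    projection is spanned by the left singular vectors belonging to the p largest
    singular values of the centered data, i.e. the top eigenvectors of XX^T."""
    Xc = X - X.mean(axis=1, keepdims=True)    # center the data
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    Up = U[:, :p]                             # basis of the principal subspace
    Y = Up.T @ Xc                             # reduced data in R^{p x m}
    return Y, Up
```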

Another classical linear dimensionality reduction method is multidimensional scaling (MDS), which also relies on an eigendecomposition of the data \(X \in \mathbb{R}^{n\times m}\). In contrast to PCA, the MDS method constructs a low dimensional configuration of X without using an explicit projection map. More precisely, for an input matrix \(X \in \mathbb{R}^{n\times m}\), MDS works with the distance matrix D = (d ij ) i, j = 1, ⋯ , m of the points in X to compute an optimal configuration of points \(Y = (y_{1},\cdots \,,y_{m}) \in \mathbb{R}^{p\times m}\), with p ≤ n, minimizing the error \(\mbox{ err}(Y,D) =\sum _{ i,j=1}^{m}(d_{\mathit{ij}} -\| y_{i} - y_{j}\|)^{2}\). In other words, the low dimensional configuration of points Y approximately preserves the distances of the higher dimensional dataset X.
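A sketch of the classical (Torgerson) variant of MDS is given below; it recovers an exact configuration whenever D stems from points in Euclidean space, and iterative stress minimization could then be used to reduce err(Y, D) further.

```python
import numpy as np

def classical_mds(D, p):
    """Classical MDS (sketch): from the matrix D of pairwise Euclidean distances,
    double-center the squared distances and use the top p eigenpairs of the
    resulting Gram matrix to obtain a p-dimensional configuration Y."""
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # Gram matrix of the centered points
    vals, vecs = np.linalg.eigh(B)            # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:p]          # indices of the p largest eigenvalues
    L = np.sqrt(np.maximum(vals[idx], 0.0))
    Y = (vecs[:, idx] * L).T                  # configuration as a p x m matrix
    return Y
```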

In the construction of nonlinear dimensionality reduction (NDR) methods, we are especially interested in their interaction with signal processing tools, e.g., convolution transforms. When applying signal transforms to the dataset X, one important task is the analysis of the incurred geometrical deformation. To this end, we propose the concept of modulation maps and modulation manifolds for the construction of particular datasets which are relevant in signal processing and NDR, especially since we are interested in numerical methods for analyzing geometrical properties of the modulation manifolds, with a particular focus on their scalar and mean curvature.

We define a modulation manifold by employing a homeomorphism (or diffeomorphism) \(\mathcal{A}:\varOmega \rightarrow \mathcal{M}\), for a specific manifold Ω, as used in signal processing. The basic objective is to understand how the geometry of Ω is distorted when we transform Ω using a modulation map \(\mathcal{A}\). More explicitly, let \(\{\phi _{k}\}_{k=1}^{d} \subset \mathcal{H}\) be a set of vectors in a Euclidean space \(\mathcal{H}\), and \(\{s_{k}:\varOmega \rightarrow \mathcal{C}_{\mathcal{H}}(\mathcal{H})\}_{k=1}^{d}\) a family of smooth maps from a manifold Ω to \(\mathcal{C}_{\mathcal{H}}(\mathcal{H})\) (the continuous functions from \(\mathcal{H}\) into \(\mathcal{H}\)). We say that a manifold \(\mathcal{M}\subset \mathcal{H}\) is a \(\{\phi _{k}\}_{k=1}^{d}\)-modulated manifold if

$$\displaystyle{\mathcal{M} = \left \{\sum _{k=1}^{d}s_{ k}(\alpha )\phi _{k},\;\alpha \in \varOmega \right \}.}$$

In this case, the map \(\mathcal{A}:\varOmega \rightarrow \mathcal{M}\), \(\alpha \mapsto \sum _{k=1}^{d}s_{k}(\alpha )\phi _{k}\), is called modulation map.

As a prototypical example (cf. [7]), we consider a map of the form

$$\displaystyle{\mathcal{A}(\alpha )(t_{i}) =\sum _{ k=1}^{d}\phi _{ k}(\alpha _{k}t_{i}),\quad \alpha = (\alpha _{1},\ldots,\alpha _{d}) \in \varOmega,\quad \{t_{i}\}_{i=1}^{n} \subset [0,1],}$$

for a set of band-limited functions \(\{\phi _{k}\}_{k=1}^{d}\) in combination with a finite set of uniform samples \(\{t_{i}\}_{i=1}^{n} \subset [0,1]\).

Now we use the same notation for the band-limited functions ϕ k and the above mentioned vector of sampling values \(\{\phi _{k}(t_{i})\}_{i=1}^{n}\), as this is justified by the Whittaker-Shannon interpolation formula as follows.

As the support of the band-limited functions ϕ k is located in [0, 1], the Whittaker-Shannon interpolation formula allows us to reconstruct each ϕ k exactly from the finite samples \((\phi _{k}(t_{i}))_{i=1}^{n} \in \mathbb{R}^{n}\). This in turn gives a one-to-one relation between the band-limited functions \(\phi _{k}: [0,1] \rightarrow \mathbb{R}\) and the vectors \((\phi _{k}(t_{i}))_{i=1}^{n} \in \mathbb{R}^{n}\). Note that the maps s k (α) are in our example given by \(s_{k}(\alpha )\phi _{k}(t_{i}) =\phi _{k}(\alpha _{k}t_{i})\). In other words, we use the (continuous) map s k (α), \(f(t)\mapsto f(\alpha _{k}t)\), which is the scaling by the factor α k , the k-th coordinate of the vector \(\alpha \in \varOmega \subset \mathbb{R}^{d}\).
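As an illustration, the following sketch generates a point cloud on such a modulated manifold; the choice of the (essentially) band-limited functions ϕ k , the sampling and the parameter range for Ω are purely illustrative.

```python
import numpy as np

def modulation_dataset(alphas, t, phis):
    """Sketch of the modulation map A(alpha)(t_i) = sum_k phi_k(alpha_k * t_i):
    each parameter vector alpha in Omega is mapped to a sampled signal in R^n,
    yielding a point cloud X on the modulated manifold M = A(Omega)."""
    X = np.array([sum(phi(a_k * t) for a_k, phi in zip(alpha, phis))
                  for alpha in alphas])
    return X                                  # shape (m, n): one sampled signal per row

# Illustrative choice of functions, uniform samples and parameters (d = 3).
t = np.linspace(0.0, 1.0, 128)
phis = [lambda s: np.sin(2 * np.pi * 5 * s),
        lambda s: np.sin(2 * np.pi * 9 * s),
        lambda s: np.cos(2 * np.pi * 13 * s)]
alphas = np.random.default_rng(0).uniform(0.5, 1.5, size=(200, 3))
X = modulation_dataset(alphas, t, phis)
```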

To explain our analysis of the geometric distortions incurred by \(\mathcal{A}\), we restrict ourselves to the case d = 3 and \(\varOmega \subset \mathbb{R}^{3}\) with dim(Ω) = 2. We compute the scalar curvature of \(\mathcal{M}\) from the parametrization of Ω and the modulation map \(\mathcal{A}\) by the following algorithm [7].

Algorithm 11

For further details concerning the construction of Algorithm 11, we refer to [7].

Fig. 14.4

(a) A sphere Ω whose colors represent the scalar curvature of \(\mathcal{M} = \mathcal{A}(\varOmega )\), (b) PCA projection of \(\mathcal{M} = \mathcal{A}(\varOmega )\) with Gaussian curvature represented by colors

14.4 Audio Signal Separation and Signal Detection

In many relevant applications of signal processing there is an increasing demand for effective methods to estimate the components from a mixture of acoustic signals. In recent years, different decomposition techniques were developed to do so, including independent subspace analysis (ISA), based on independent component analysis (ICA), see [1, 5, 26], and non-negative matrix factorization (NNMF), see [6, 23, 27]. The computational complexity of these methods, however, may be very large, in particular for real-time computations on audio signals. In signal separation, dimensionality reduction methods are used to first reduce the dimension of the data obtained from a time-frequency transform, e.g., the short time Fourier transform (STFT), before the reduced data is decomposed into different components, each assigned to one of the source signals. For the application of dimensionality reduction in combination with NNMF, however, non-negative dimensionality reduction methods are essentially required to guarantee non-negative output data from non-negative input data (e.g., a non-negative spectrogram from the STFT). For the special case of PCA, a suitable rotation map is constructed in [12] for the purpose of back-projecting the reduced data to the positive orthant of the Cartesian coordinate system, where the sought rotation is given by the solution of a constrained optimization problem in a linear subspace of orthogonal matrices.

In this section, we evaluate different decomposition methods for signal separation in combination with the non-negative PCA projection from [12]. The basic steps of our method are illustrated in Fig. 14.5.

Fig. 14.5

Signal separation with dimensionality reduction

To explain how we use PCA, let \(U \in \mathbb{R}^{D\times d}\) be a matrix with orthonormal columns defining the reduced data \(Y = U^{T}X\), obtained as the solution of the minimization problem

$$\displaystyle{ \min \limits _{\tilde{U}^{T}\tilde{U}=I}\sum \limits _{k=1}^{n}\left \|x_{ k} -\tilde{ U}\tilde{U}^{T}x_{ k}\right \|_{2}^{2}. }$$
(14.7)

The solution of (14.7) is given by the maximizer of the variance var(Y ) of Y, as given by the trace of YY T. This observation allows us to reformulate the minimization problem in (14.7) as an equivalent maximization problem,

$$\displaystyle{ \max \limits _{\tilde{U}^{T}\tilde{U}=I}\mathrm{tr}(\tilde{U}^{T}\mathit{XX}^{T}\tilde{U}), }$$
(14.8)

where the maximizer U of var(Y ) is given by a matrix whose d columns are eigenvectors corresponding to the d largest eigenvalues of the covariance matrix XX T.

For further processing the data in a subsequent decomposition by NNMF, the data matrix Y is essentially required to be non-negative. Note, however, that even if the data matrix X (obtained e.g., by STFT) may be non-negative, this is not necessarily the case for the components of the reduced data matrix Y. Therefore, we reformulate the maximization problem in (14.8) by adding a non-negativity constraint:

$$\displaystyle{ \max \limits _{{ \tilde{U}^{T}\tilde{U}=I \atop \tilde{U}^{T}X\geq 0} }\mathrm{tr}(\tilde{U}^{T}\mathit{XX}^{T}\tilde{U}). }$$
(14.9)

Note that this additional restriction transforms the simple PCA problem (14.8) into a much more difficult non-convex optimization problem (14.9) with many local solutions, for which (in general) none of the solutions is known analytically.

We tackle this fundamental problem as follows. We exploit the fact that the input data set X is non-negative before it is projected onto the linear subspace, expecting that there exists a rotation of the low-dimensional data set Y into the non-negative orthant. Indeed, as proven in [12], such a rotation map exists, which motivates us to split the non-negative PCA (NNPCA) problem (14.9) into a PCA part and a rotation part. This, in turn, leads us to seek a general construction of a rotation matrix W satisfying WU T X ≥ 0.

To further explain our splitting approach, recall that we already know the solution U of the PCA part. Since the rotation matrix W is orthogonal, it does not affect the value of the NNPCA cost functional. Now, in order to determine the rotation matrix W, we consider an auxiliary optimization problem on the set of rotation matrices SO(d), i.e., we minimize the cost functional

$$\displaystyle{ J(\tilde{W}) = \frac{1} {2}\sum \limits _{i,j}\left [\left (\tilde{W}U^{T}X\right )_{ -}\right ]_{\mathit{ij}}^{2}\qquad \mbox{ where }\left [Z_{ -}\right ]_{\mathit{ij}} = \left \{\begin{array}{ll} z_{\mathit{ij}} & \mbox{ if }z_{\mathit{ij}} < 0, \\ 0 &\mbox{ otherwise, } \end{array} \right. }$$
(14.10)

as this was proposed in [21] in the context of ICA. However, we cannot solve this optimization problem directly by an additive update algorithm, since the set of rotation matrices SO(d) is not invariant under additions. An elegant way to minimize the cost functional J in (14.10) is to use the Lie-group structure of SO(d) to transfer the problem into an optimization problem on the Lie algebra of skew-symmetric matrices \(\mathfrak{s}\mathfrak{o}(d)\). Due to the vector space property of \(\mathfrak{s}\mathfrak{o}(d)\), standard methods can be applied to find the minimum (see [9, 11, 21] for details); a gradient-based sketch is given below.
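In the following sketch the Euclidean gradient of J is projected onto the skew-symmetric matrices and the iterate is updated along the corresponding geodesic, so that it stays in SO(d); the plain gradient iteration, step size and iteration count are illustrative choices rather than the method of [9, 11, 21].

```python
import numpy as np
from scipy.linalg import expm

def nonnegative_rotation(Y, n_steps=500, tau=1e-3):
    """Sketch of the rotation part of the NNPCA splitting: minimize the cost J(W)
    in (14.10) over W in SO(d) by gradient steps taken in the Lie algebra so(d)."""
    d = Y.shape[0]
    W = np.eye(d)
    for _ in range(n_steps):
        Z = W @ Y
        neg = np.minimum(Z, 0.0)              # (WY)_- as in (14.10)
        G = neg @ Y.T                         # Euclidean gradient of J at W
        A = G @ W.T - W @ G.T                 # skew-symmetric element of so(d)
        W = expm(-tau * A) @ W                # geodesic update, stays in SO(d)
    return W

# Usage: with Y = U.T @ X from the PCA part, W = nonnegative_rotation(Y);
# W @ Y should then be (approximately) non-negative and can be passed to the NNMF.
```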

14.4.1 Decomposition Techniques

There are different methods for the decomposition of the (reduced) spectrogram Y. Among them, independent component analysis (ICA) and non-negative matrix factorization (NNMF) are commonly used. In either case, for the application of ICA or NNMF, we assume the input data Y to be a linear mixture of source terms s i , i.e.,

$$\displaystyle{ Y = \mathit{AS}, }$$
(14.11)

where \(A \in \mathbb{R}^{d\times r}\) and \(S \in \mathbb{R}^{r\times n}\) are unknown. For the estimation of A and S we need specific additional assumptions to balance the disproportion of equations and unknowns in the factorization problem (14.11).

14.4.1.1 Independent Component Analysis (ICA)

The basic assumption of ICA is that the source signals are statistically independent. Furthermore, the data matrix Y is assumed to result from n realizations of a d-dimensional random vector. In order to estimate S, a random variable \(\mathcal{S}\) is constructed, whose n realizations yield the columns of the source matrix S. The components of \(\mathcal{S}\) are chosen to be as stochastically independent as possible, where the stochastical independence can be measured by the Kullback-Leibler distance [4].

In practice, the number of sources is usually unknown. Therefore, we may detect more independent components than the true number of sources. In this case, two or more of the separated components belong to the same source, and the sources are combinations of the independent components. In a subsequent step, the components are grouped into independent subspaces, each corresponding to one source. Finally, the sources are reconstructed from these multi-component subspaces [1]. This procedure is called independent subspace analysis (ISA). The main difficulty of ISA is to identify components belonging to the same multi-component subspace.
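For illustration, the estimation step (before the grouping into subspaces) could be carried out with an existing ICA implementation, for instance scikit-learn's FastICA; the interface below is a hedged example, not the setup used for the experiments in Sect. 14.4.2.

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_decompose(Y, r):
    """Estimate r independent components from the reduced spectrogram Y (d x n),
    treating the n time frames as samples; the grouping of components into
    independent subspaces (ISA) is a separate, subsequent step."""
    ica = FastICA(n_components=r, max_iter=1000)
    S = ica.fit_transform(Y.T).T              # estimated sources, r x n
    A = ica.mixing_                           # estimated mixing matrix, d x r
    return A, S
```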

14.4.1.2 Non-negative Matrix Factorization (NNMF)

The factorization of the given data Y into a mixing matrix A and the source signals (source components) S, i.e., Y = AS, can also be obtained by matrix factorization. The data we use for signal separation are obtained by taking the modulus of the signal’s STFT, so the input data are non-negative. Since the source components are assumed to be spectrograms, too, we assume them to be non-negative as well. Therefore, non-negative matrix factorizations (NNMF) are suitable tools for the decomposition.

There are different NNMF algorithms available, all of which rely on the non-negativity Y, A, S ≥ 0, where different measures d(Y, AS) for the reconstruction error have been proposed [6, 23, 27]. We use the generalized Kullback-Leibler distance (proposed in [13] and used for decomposing signal data in [27]):

$$\displaystyle{d(Y,\mathit{AS}) =\sum _{i,j}Y _{\mathit{ij}}\log \frac{Y _{\mathit{ij}}} {(\mathit{AS})_{\mathit{ij}}} - Y _{\mathit{ij}} + (\mathit{AS})_{\mathit{ij}}.}$$
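A sketch of the corresponding multiplicative update rules of Lee and Seung for this distance reads as follows; the random initialization, iteration count and the small constant eps are illustrative choices.

```python
import numpy as np

def nnmf_kl(Y, r, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for NNMF with the generalized Kullback-Leibler
    distance d(Y, AS) given above: A and S stay non-negative throughout."""
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    A = rng.random((d, r)) + eps
    S = rng.random((r, n)) + eps
    for _ in range(n_iter):
        R = Y / (A @ S + eps)                 # element-wise ratio Y_ij / (AS)_ij
        S *= (A.T @ R) / (A.sum(axis=0)[:, None] + eps)
        R = Y / (A @ S + eps)
        A *= (R @ S.T) / (S.sum(axis=1)[None, :] + eps)
    return A, S
```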

14.4.2 Numerical Results

We present one numerical example comparing the decomposition strategies ICA and NNMF. We consider a mixture f = f 1 + f 2 of acoustic transient signals, where f 1 is a sequence of castanets and f 2 a cymbal signal, shown in Fig. 14.6, where also the combination f = f 1 + f 2 of the two signals is displayed. The spectrograms in these figures are generated with an STFT using a Hamming window. Since f 2 is a high-energy signal, f has a complex frequency characteristic. Therefore, the extraction of the castanets signal f 1, which is active only at a few time steps, is a challenging task.

Fig. 14.6

Two acoustic signals: castanets f 1 (top left), cymbal f 2 (top right), and corresponding spectrograms (second row). Signal f = f 1 + f 2 and spectrogram (third row)

The obtained separations, resulting from the two different decomposition methods using NNPCA and PCA, respectively, are displayed in Fig. 14.7. Note that both methods, NNMF and ICA, manage to reproduce the characteristic peaks of the castanets quite well. However, in the case of NNMF, strong artifacts of the castanets are visible in the cymbal signal, whereas the separation by ICA is almost perfect.

Fig. 14.7

Signal separation by NNPCA & NNMF (left column); PCA & ICA (right column)

Likewise, for the reconstruction of the reduced signal, the combination of PCA and ICA provides an almost complete reproduction of the original signal f (see Fig. 14.8). Only at time steps where a high amplitude of the cymbal exactly matches the peaks of the castanets is a correct separation not quite achieved. For the NNMF, the spectrogram in Fig. 14.8 shows that some information is lost.

Fig. 14.8

Reconstruction of f as sum of the decomposed f i by using NNPCA & NNMF (left column) and by using PCA & ICA (right column)

We finally remark that for signal separation without dimensionality reduction, NNMF is competitive with ICA (see e.g. [27]). This indicates that our use of NNPCA in combination with NNMF could be improved. Further improvements could be achieved by the use of more sophisticated (nonlinear) dimensionality reduction methods. On the other hand, this would lead to a much more complicated construction of the inverse transform, as required for the back-projection of the data. We defer these points to future research. Nevertheless, although PCA is only a linear projection method, our numerical results in this section, especially those obtained by the combination of PCA and ICA, are already quite promising.