
1 Introduction

Blind source separation (BSS) has been investigated during the last two decades; many algorithms have been developed and applied in a wide range of areas, including biomedical engineering, medical imaging, speech processing, astronomical imaging, and communication systems. Typically, a linear mixture model is assumed where the mixtures \(\varvec{Z}\in \mathbb {R}^{r\times N}\) are described as \(\varvec{Z}=\varvec{A}\varvec{S}+\varvec{V}\). Each row of \(\varvec{S}\in \mathbb {R}^{s\times N}\) is a source, and \(\varvec{A}\in \mathbb {R}^{r\times s}\) models the linear combinations of the sources. The matrix \(\varvec{V}\in \mathbb {R}^{r\times N}\) represents additive noise or interference introduced during mixture acquisition and transmission.

Usually in the BSS problem, the only known information is the mixtures \(\varvec{Z}\) and the number of sources. One needs to determine both the mixing matrix \(\varvec{A}\) and the sources \(\varvec{S}\), i.e., mathematically, one needs to solve

$$ \underset{\varvec{A},\varvec{S}}{\mathrm{min}}\left\| \varvec{Z}-\varvec{A}\varvec{S}\right\| _{F}^{2}. $$

It is clear that such a problem has an infinite number of solutions, i.e., the problem is ill-posed. In order to find the true sources and the mixing matrix (subject to scale and permutation ambiguities), it is often required to add extra constraints to the problem formulation. For example, a well-known method called independent component analysis (ICA) [1] assumes that the original sources are statistically independent. This has led to some widely used approaches such as Infomax [2], maximum likelihood estimation [3], maximum a posteriori (MAP) estimation [4], and FastICA [1].

Sparsity is another prior that can be exploited for BSS. Most natural signals are sparse under some dictionary. The mixtures, viewed as superpositions of sources, are in general less sparse than the original sources. Based on this fact, the sparsity prior has been used in solving the BSS problem from various perspectives since 2001, e.g., sparse ICA (SPICA) [5] and sparse component analysis (SCA) [6]. In this approach, there is typically no requirement that the original sources be independent. As a result, these algorithms are capable of dealing with highly correlated sources, for example, separating two superposed copies of the same speech signal, where one is a version of the other delayed by a few samples. Jourjine et al. proposed an SCA-based algorithm in [7] aimed at the anechoic separation problem. SCA algorithms look for a sparse representation under predefined bases such as the discrete cosine transform (DCT), wavelets, curvelets, etc. Morphological component analysis (MCA) [8] and its extensions to the multichannel case, multichannel MCA (MMCA) [9] and generalized MCA (GMCA) [10], are also based on the assumption that the original sources are sparse in different predefined bases rather than in explicitly constructed dictionaries. However, these algorithms often do not exhibit outstanding performance, since in most cases the predefined dictionaries are too general to capture the fine details of the sources in the sparse representation.

A way to address this problem is to learn data-specific dictionaries. In [11], the authors proposed to train a dictionary from the mixtures (corrupted images) and then decompose it into a few dictionaries according to prior knowledge of the main components of the different sources. This algorithm was used for separating images with different dominant frequency components (e.g., cartoon and texture images) and obtained satisfactory results in image denoising. Starck et al. proposed in [12] to learn a dictionary for each source from a set of exemplar images. Xu et al. [13] proposed an algorithm that allows the dictionaries to be learned from the sources or the mixtures. In most BSS problems, however, dictionaries learned from the mixtures or from similar exemplar images rarely represent the original sources well.

To obtain more accurate separation results, the dictionaries should be adapted to the unknown sources. The motivation is clear from the assumption that the sources are sparsely represented by some dictionaries. The initial idea of learning dictionaries while separating the sources was suggested by Abolghasemi et al. [14], who proposed a two-stage iterative process in which each source is equipped with a dictionary that is learned in each iteration, right after the preceding mixture learning stage. Since the dictionaries are much larger than the mixing matrix, the main computational cost lies in the dictionary learning stage. This two-stage procedure was further developed by Zhao et al. [15]. The method, termed SparseBSS, employs a joint optimization framework based on the idea of the SimCO dictionary update algorithm [16]. By studying the optimization problem encountered in dictionary learning, the singularity phenomenon in dictionary update was discovered for the first time. Furthermore, from the viewpoint of dictionary redundancy, SparseBSS uses only one dictionary to represent all the sources and is therefore computationally much more efficient than using multiple dictionaries as in [14]. This joint dictionary learning and source separation framework is the focus of this chapter. The framework can potentially be extended to convolutive or underdetermined models, e.g., by applying clustering methods to solve the ill-posed inverse problem in the underdetermined case [13]; however, such extensions are beyond the scope of this chapter. Here we focus on the overdetermined and exactly determined models.

The remainder of this chapter is organized as follows. Section 2.2 describes the framework of the BSS problem based on dictionary learning; the recently proposed SparseBSS algorithm is introduced and compared in detail with the related benchmark algorithm BMMCA. In Sect. 2.3, we briefly introduce the background of dictionary learning algorithms and then discuss the important observation of the singularity issue, which is a major reason for the failure of dictionary learning algorithms and hence of dictionary learning-based BSS algorithms; two available approaches to address this problem are then presented. Section 2.4 reports numerical results on practical applications. In Sect. 2.5, we conclude our work and discuss some possible extensions.

2 Framework of Dictionary Learning-Based BSS Problem

We consider the following linear and instantaneous mixing model. Suppose there are \(s\) source signals of the same length, denoted by \(\varvec{s}_{1},\;\varvec{s}_{2},\ldots ,\varvec{s}_{s}\), respectively, where \(\varvec{s}_{i}\in \mathbb {R}^{1\times N}\) is a row vector denoting the \(i\)th source. Assume that these sources are linearly mixed into \(r\) observation signals denoted by \(\varvec{z}_{1},\;\varvec{z}_{2},\ldots ,\varvec{z}_{r}\), respectively, where \(\varvec{z}_{j}\in \mathbb {R}^{1\times N}\). In matrix form, denote \(\varvec{S}=\left[ \varvec{s}_{1}^{T},\varvec{s}_{2}^{T},\ldots ,\varvec{s}_{s}^{T}\right] ^{T}\in \mathbb {R}^{s\times N}\) and \(\varvec{Z}=\left[ \varvec{z}_{1}^{T},\varvec{z}_{2}^{T},\ldots ,\varvec{z}_{r}^{T}\right] ^{T}\in \mathbb {R}^{r\times N}\). Then the mixing model is given by

$$\begin{aligned} \varvec{Z}=\varvec{A}\varvec{S}+\varvec{V}, \end{aligned}$$
(2.1)

where \(\varvec{A}\in \mathbb {R}^{r\times s}\) is the mixing matrix and \(\varvec{V}\in \mathbb {R}^{r\times N}\) denotes zero-mean additive Gaussian noise. We also assume that \(r\ge s\), i.e., the underdetermined case will not be discussed here.
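To make the model concrete, the following minimal sketch generates synthetic data according to (2.1); the toy source signals, dimensions, and noise level are our own illustrative choices, not taken from the chapter's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
s, r, N = 2, 3, 10_000                     # sources, mixtures, samples (r >= s)

# Two toy sources (rows of S): a sinusoid and a sawtooth-like signal.
t = np.arange(N) / N
S = np.vstack([np.sin(2 * np.pi * 40 * t),
               2 * (t * 7 % 1) - 1])

A = rng.standard_normal((r, s))
A /= np.linalg.norm(A, axis=0)             # column-normalized mixing matrix
V = 0.01 * rng.standard_normal((r, N))     # zero-mean Gaussian noise

Z = A @ S + V                              # observed mixtures, shape (r, N)
```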

2.1 Separation with Dictionaries Known in Advance

For some BSS algorithms, such as MMCA [9], orthogonal dictionaries \(\varvec{D}_{i}\)’s are required to be known a priori. Each source \(\varvec{s}_{i}\) is assumed to be sparsely represented by a different \(\varvec{D}_{i}\). Hence, we have \(\varvec{s}_{i}=\varvec{D}_{i}\varvec{x}_{i}\) with \(\varvec{x}_{i}\)’s being sparse. Given the observation \(\varvec{Z}\) and the dictionaries \(\varvec{D}_{i}\)’s, MMCA [9] aims to estimate the mixing matrix and sources, based on the following form:

$$\begin{aligned} \underset{\varvec{A},\varvec{S}}{\mathrm{min}}\,\left\| \varvec{Z}-\varvec{A}\varvec{S}\right\| _{F}^{2}+\sum _{i=1}^{s}\lambda _{i}\left\| \varvec{s}_{i}\varvec{D}_{i}^{\dagger }\right\| _{1}. \end{aligned}$$
(2.2)

Here \(\lambda _{i}>0\) is a weighting parameter determined by the noise standard deviation \(\sigma \), \(\left\| \cdot \right\| _{F}\) represents the Frobenius norm, \(\left\| \cdot \right\| _{1}\) is the \(\ell _{1}\) norm, and \(\varvec{D}_{i}^{\dagger }\) denotes the pseudo-inverse of \(\varvec{D}_{i}\). Predefined dictionaries generated from typical mathematical transforms, e.g., DCT, wavelets, and curvelets, do not target particular sources, and thus do not always provide sufficiently accurate reconstruction and separation results. Elad et al. [11] designed a method that first trains a redundant dictionary with the K-SVD algorithm and then decomposes it into a few dictionaries, one for each source. This method works well when the original sources have components that differ substantially from each other under some mathematical transformation (e.g., cartoon and texture images under the DCT transform). Otherwise, the found dictionaries may not be appropriate in the sense that they may fit the mixtures better than the sources.

2.2 Separation with Unknown Dictionaries

2.2.1 SparseBSS Algorithm Framework

To the authors' knowledge, BMMCA and SparseBSS are the two most recent BSS algorithms that implement the idea of performing source separation and dictionary learning simultaneously. Due to space constraints, we focus on SparseBSS in this chapter. In SparseBSS, one assumes that all the sources can be sparsely represented under the same dictionary. In order to obtain enough training samples for dictionary learning, multiple overlapped segments (patches) of the sources are taken. To extract small overlapped patches from the source image \(\varvec{s}_{i}\), a binary matrix \(\varvec{P}_{k}\in \mathbb {R}^{n\times N}\) is defined as a patching operator [15]. The product \(\varvec{P}_{k}\varvec{s}_{i}^{T}\in \mathbb {R}^{n\times 1}\) extracts and vectorizes the \(k\)th patch of size \(\sqrt{n}\times \sqrt{n}\) from image \(\varvec{s}_{i}\). Denote \(\varvec{P}=[\varvec{P}_{1},\ldots ,\varvec{P}_{K}]\in \mathbb {R}^{n\times KN}\), where \(K\) is the number of patches taken from each image. Then the patch extraction from all the sources \(\varvec{S}\) is defined as \(\mathcal{P}\varvec{S}=[\varvec{P}_{1},\ldots ,\varvec{P}_{K}]\cdot (\left[ \varvec{s}_{1}^{T},\varvec{s}_{2}^{T},\ldots ,\varvec{s}_{s}^{T}\right] \otimes \varvec{I}_{K})=\varvec{P}\cdot (\varvec{S}^{T}\otimes \varvec{I}_{K})\in \mathbb {R}^{n\times Ks}\), where the symbol \(\otimes \) denotes the Kronecker product and \(\varvec{I}_{K}\) is the \(K\times K\) identity matrix. The computational cost of converting images to patches is low. Each column of \(\mathcal{P}\varvec{S}\) represents one vectorized patch. We sparsely represent \(\mathcal{P}\varvec{S}\) using a single dictionary \(\varvec{D}\in \mathbb {R}^{n\times d}\) and a sparse coefficient matrix \(\varvec{X}\in \mathbb {R}^{d\times Ks}\), i.e., \(\mathcal{P}\varvec{S}\approx \varvec{D}\varvec{X}\). This is different from BMMCA, where multiple dictionaries are used for multiple sources.
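As an illustration of the patching operator, the sketch below extracts overlapping patches from an image and stacks them as columns, which plays the role of \(\mathcal{P}\varvec{S}\); the patch size, step, and the helper name `extract_patches` are our own assumptions rather than the authors' implementation.

```python
import numpy as np

def extract_patches(image, patch=8, step=4):
    """Extract overlapping patch x patch blocks (step controls the overlap)
    and return them as columns of an (patch*patch) x K matrix."""
    H, W = image.shape
    cols = []
    for i in range(0, H - patch + 1, step):
        for j in range(0, W - patch + 1, step):
            cols.append(image[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(cols, axis=1)          # shape (n, K), with n = patch**2

# Stacking the patch matrices of all estimated sources side by side plays
# the role of P(S) in the text: one column per vectorized patch, e.g.
# PS = np.hstack([extract_patches(s_img) for s_img in source_images])
```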

With these notations, the BSS problem is formulated as the following joint optimization problem:

$$\begin{aligned} \underset{\varvec{A},\varvec{S},\varvec{D},\varvec{X}}{\mathrm{min}}\,\lambda \left\| \varvec{Z}-\varvec{A}\varvec{S}\right\| _{F}^{2}+\left\| \mathcal{P}^{\dagger }\left( \varvec{D}\varvec{X}\right) -\varvec{S}\right\| _{F}^{2}. \end{aligned}$$
(2.3)

The parameter \(\lambda \) is introduced to balance the measurement error and the sparse approximation error, and \(\varvec{X}\) is assumed to be sparse.

To find a solution of the above problem, we propose a joint optimization algorithm that iteratively updates the two pairs of variables \(\{\varvec{D},\varvec{X}\}\) and \(\{\varvec{A},\varvec{S}\}\) over two stages until a (local) minimizer is found (a schematic sketch of this alternating procedure follows the two stages below). Note that in each stage only one pair of variables is updated simultaneously while the other pair is kept fixed.

  • Dictionary learning stage

    $$\begin{aligned} \underset{\varvec{D},\varvec{X}}{\mathrm{min}}\,\left\| \varvec{D}\varvec{X}-\mathcal{P}\varvec{S}\right\| _{F}^{2}, \end{aligned}$$
    (2.4)
  • Mixture learning stage

    $$\begin{aligned} \underset{\varvec{A},\varvec{S}}{\mathrm{min}}\,\lambda \left\| \varvec{Z}-\varvec{A}\varvec{S}\right\| _{F}^{2}+\left\| \varvec{D}\varvec{X}-\mathcal{P}\varvec{S}\right\| _{F}^{2}. \end{aligned}$$
    (2.5)
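A schematic outline of this alternating procedure is sketched below; `patch_matrix`, `sparse_code`, `update_dictionary`, and `update_mixing` are placeholder names for the operations described in the text (patch extraction, OMP/SP sparse coding, and the SimCO-style updates), not functions from a released implementation.

```python
import numpy as np

def sparse_bss(Z, D, A, n_iter=50):
    """Schematic alternating minimization for problem (2.3).
    Z: mixtures; D: initial dictionary; A: initial (column-normalized) mixing matrix."""
    S = np.linalg.pinv(A) @ Z            # crude initial source estimate
    for _ in range(n_iter):
        # Dictionary learning stage (2.4): fix {A, S}, update {D, X}.
        PS = patch_matrix(S)             # placeholder: build P(S), one patch per column
        X = sparse_code(D, PS)           # placeholder: e.g. OMP/SP, see the sketch below
        D = update_dictionary(D, X, PS)  # placeholder: SimCO-style update, unit-norm columns
        # Mixture learning stage (2.5): fix {D, X}, update {A, S}.
        A, S = update_mixing(Z, D, X, A) # placeholder: least squares in S, A kept normalized
    return A, S, D, X
```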

Although not explicit in (2.3), a sparse coding process is involved, where greedy algorithms such as orthogonal matching pursuit (OMP) [17] and subspace pursuit (SP) [18] are used to solve

$$ \underset{\varvec{X}}{\min }\;\left\| \varvec{X}\right\| _{0},\;\mathrm{s.t.}\;\left\| \varvec{D}\varvec{X}-\mathcal {P}\left( \varvec{S}\right) \right\| _{F}^{2}\le \epsilon , $$

where \(\left\| \varvec{X}\right\| _{0}\) counts the number of nonzero elements in \(\varvec{X}\), the dictionary \(\varvec{D}\) is assumed fixed, and \(\epsilon >0\) is an upper bound on the sparse approximation error.
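For illustration, a minimal OMP routine for a single training vector might look as follows; this is a textbook-style sketch using a residual-norm stopping rule, not the exact implementation referenced in [17].

```python
import numpy as np

def omp(D, y, max_atoms, tol=1e-6):
    """Greedy orthogonal matching pursuit: build a sparse x with D x close to y.
    D is assumed to have unit-norm columns."""
    n, d = D.shape
    x = np.zeros(d)
    residual = y.copy()
    support = []
    while len(support) < max_atoms and np.linalg.norm(residual) > tol:
        # Pick the codeword most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Re-fit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef if support else 0.0
    return x
```

Applied column by column to \(\mathcal{P}\varvec{S}\), such a routine produces the sparse coefficient matrix \(\varvec{X}\) and hence the sparsity pattern \(\varOmega\) used in the subsequent dictionary update.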

During the optimization, further constraints are made on the matrices \(\varvec{A}\) and \(\varvec{D}\). Consider the dictionary learning stage. Since the performance is invariant to scaling and permutations of the dictionary codewords (columns of \(\varvec{D}\)), we follow the convention in the literature, e.g., [16], and enforce the dictionary to be updated on the set

$$\begin{aligned} \mathcal {D}=\left\{ \varvec{D}\in \mathbb {R}^{n\times d}:\;\left\| \varvec{D}_{:,i}\right\| _{2}=1,\;1\le i\le d\right\} , \end{aligned}$$
(2.6)

where \(\varvec{D}_{:,i}\) stands for the \(i\)th column of \(\varvec{D}\). A detailed description of the advantages of adding this constraint can be found in [16]. Sparse coding, once performed, provides information about which elements of \(\varvec{X}\) are zero and which are nonzero. Define the sparsity pattern \(\varOmega =\left\{ \left( i,j\right) :\;\varvec{X}_{i,j}\ne 0\right\} \), i.e., the index set of the nonzero elements of \(\varvec{X}\). Define \(\mathcal {X}_{\varOmega }\) as the set of all matrices conforming to the sparsity pattern \(\varOmega \); this is the feasible set of the matrix \(\varvec{X}\). The optimization problem for the dictionary learning stage can be written as

$$\begin{aligned} \underset{\varvec{D}\in \mathcal {D}}{\mathrm{min}}f_{\mu }\left( \varvec{D}\right) =&\underset{\varvec{D}\in \mathcal {D}}{\mathrm{min}}\; \quad \underset{\varvec{X}\in \mathcal {X}_{\varOmega }}{\mathrm{min}}\;\left\| \varvec{D}\varvec{X}-\mathcal{P}\left( \varvec{S}\right) \right\| _{F}^{2}+\mu \left\| \varvec{X}\right\| _{F}^{2},\nonumber \\ =&\underset{\varvec{D}\in \mathcal {D}}{\mathrm{min}}\; \quad \underset{\varvec{X}\in \mathcal {X}_{\varOmega }}{\mathrm{min}}\;\left\| \left[ \begin{array}{c} \mathcal{P}\left( \varvec{S}\right) \\ \varvec{0} \end{array}\right] -\left[ \begin{array}{c} \varvec{D}\\ \sqrt{\mu }\varvec{I} \end{array}\right] \varvec{X}\right\| _{F}^{2}. \end{aligned}$$
(2.7)

The term \(\mu \left\| \varvec{X}\right\| _{F}^{2}\) introduces a penalty to alleviate the singularity issue. See more details in Sect. 2.3.3.

In the mixture learning stage, similar to the dictionary learning stage, we constrain the mixing matrix \(\varvec{A}\) in the set

$$\begin{aligned} \mathcal {A}=\left\{ \varvec{A}\in \mathbb {R}^{r\times s}:\;\left\| \varvec{A}_{:,i}\right\| _{2}=1,\;1\le i\le s\right\} . \end{aligned}$$
(2.8)

This constraint is necessary. Otherwise, if the mixing matrix \(\varvec{A}\) is scaled by a constant \(c>1\) and the source \(\varvec{S}\) is inversely scaled by \(c^{-1}\) (with \(\varvec{X}\) rescaled by \(c^{-1}\) accordingly, which is allowed since the sparsity pattern is unchanged), then for any \(\{\varvec{A},\varvec{S}\}\) we can always find a solution \(\{c\varvec{A},c^{-1}\varvec{S}\}\) that further decreases the objective function (2.3) from \(\lambda \left\| \varvec{Z}-\varvec{A}\varvec{S}\right\| _{F}^{2}+\left\| \varvec{D}\varvec{X}-\mathcal{P}\varvec{S}\right\| _{F}^{2}\) to \(\lambda \left\| \varvec{Z}-\varvec{A}\varvec{S}\right\| _{F}^{2}+c^{-2}\left\| \varvec{D}\varvec{X}-\mathcal{P}\varvec{S}\right\| _{F}^{2}\). Now view the sources \(\varvec{S}\in \mathbb {R}^{s\times N}\) as a “sparse” matrix with the sparsity pattern \(\varOmega ^{\prime }=\left\{ \left( i,j\right) :\;1\le i\le s,\;1\le j\le N\right\} \). Then the optimization problem for the mixture learning stage takes exactly the same form as that for the dictionary learning stage:

$$\begin{aligned} \underset{\varvec{A}\in \mathcal {A}}{\mathrm{min}}f_{\lambda }\left( \varvec{A}\right) =\,&\underset{\varvec{A}\in \mathcal {A}}{\min }\;\underset{\varvec{S}\in \mathbb {R}^{s\times N}}{\min }\;\lambda \left\| \varvec{Z}-\varvec{A}\varvec{S}\right\| _{F}^{2}+\left\| \mathcal{P}^{\dagger }\left( \varvec{D}\varvec{X}\right) -\varvec{S}\right\| _{F}^{2}\nonumber \\ =\,&\underset{\varvec{A}\in \mathcal {A}}{\min }\;\underset{\varvec{S}\in \mathcal {X}_{\varOmega ^{\prime }}}{\min }\;\left\| \left[ \begin{array}{c} \sqrt{\lambda }\varvec{Z}\\ \mathcal {P}^{\dagger }\left( \varvec{D}\varvec{X}\right) \end{array}\right] -\left[ \begin{array}{c} \sqrt{\lambda }\varvec{A}\\ \varvec{I} \end{array}\right] \varvec{S}\right\| _{F}^{2}, \end{aligned}$$
(2.9)

where the fact that \(\mathbb {R}^{s\times N}=\mathcal {X}_{\varOmega ^{\prime }}\) has been used. As a result, the SimCO mechanism can be directly applied. Here, we do not require prior knowledge of the scaling matrix in front of the true mixing matrix [10], which is otherwise required by the MMCA and GMCA algorithms.

To conclude this section, we emphasize the following treatment of the optimization problems (2.7) and (2.9). Both involve a joint optimization over two variables, i.e., \(\varvec{D}\) and \(\varvec{X}\) for (2.7) and \(\varvec{A}\) and \(\varvec{S}\) for (2.9). Note that if \(\varvec{D}\) and \(\varvec{A}\) are fixed, then the optimal \(\varvec{X}\) and \(\varvec{S}\) can be easily computed by solving the corresponding least squares problems. Motivated by this fact, we write (2.7) and (2.9) as \(\underset{\varvec{D}\in \mathcal {D}}{\mathrm{min}}f_{\mu }\left( \varvec{D}\right) \) and \(\underset{\varvec{A}\in \mathcal {A}}{\mathrm{min}}f_{\lambda }\left( \varvec{A}\right) \), respectively, with \(f_{\mu }\left( \varvec{D}\right) \) and \(f_{\lambda }\left( \varvec{A}\right) \) defined in (2.7) and (2.9). In this way the optimization problems, at least on the surface, involve only one variable each. This viewpoint facilitated the discovery of the singularity issue and the development of methods to handle it. See Sect. 2.3 for details.

2.2.2 Implementation Details in SparseBSS

Most optimization methods are based on line search strategies. The dictionaries at the beginning and the end of the \(k\)th iteration, denoted by \(\varvec{D}^{\left( k\right) }\) and \(\varvec{D}^{\left( k+1\right) }\), respectively, are related by \(\varvec{D}^{\left( k+1\right) }=\varvec{D}^{\left( k\right) }+\alpha ^{\left( k\right) }\varvec{\eta }^{\left( k\right) }\), where \(\alpha ^{\left( k\right) }\) is an appropriately chosen step size and \(\varvec{\eta }^{\left( k\right) }\) is the search direction. The step size \(\alpha ^{\left( k\right) }\) can be determined by the Armijo condition or by golden section search, as presented in [19]. The search direction \(\varvec{\eta }^{\left( k\right) }\) can be determined by a variety of gradient methods [19, 20]. The choice of \(\varvec{\eta }^{\left( k\right) }\) plays the key role, as it directly affects the convergence rate of the whole algorithm. Generally speaking, a Newton direction is preferable to the gradient descent direction [19]. In many cases, however, direct computation of the Newton direction is computationally prohibitive, and iterative methods can be used to approximate it. Take the Newton conjugate gradient (Newton CG) method as an example. It starts with the gradient descent direction \(\varvec{\eta }_{0}\) and iteratively refines it toward the Newton direction. Denote the gradient of \(f_{\mu }\left( \varvec{D}\right) \) by \(\nabla f_{\mu }\left( \varvec{D}\right) \), and denote by \(\nabla _{\varvec{\eta }}\left( \nabla f_{\mu }\left( \varvec{D}\right) \right) \) the directional derivative of \(\nabla f_{\mu }\left( \varvec{D}\right) \) along \(\varvec{\eta }\) [21]. In each line search step of the Newton CG method, instead of computing the Hessian \(\nabla ^{2}f_{\mu }\left( \varvec{D}\right) \in \mathbb {R}^{md\times md}\) explicitly, one only needs to compute \(\nabla _{\varvec{\eta }}\left( \nabla f_{\mu }\left( \varvec{D}\right) \right) \in \mathbb {R}^{m\times d}\). The required computational and storage resources are therefore much reduced.
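For concreteness, a generic Armijo backtracking rule for the step size \(\alpha^{(k)}\) can be sketched as follows; `f` and `grad_f` are assumed callables for the objective and its gradient, and the renormalization of the dictionary columns onto the constraint set \(\mathcal{D}\) is omitted for brevity, so this is a standard textbook procedure rather than the specific scheme of [19].

```python
import numpy as np

def armijo_step(f, grad_f, D, eta, alpha0=1.0, beta=0.5, c=1e-4):
    """Backtracking line search: shrink alpha until the sufficient-decrease
    condition f(D + a*eta) <= f(D) + c*a*<grad f(D), eta> holds."""
    f0 = f(D)
    slope = np.sum(grad_f(D) * eta)       # directional derivative along eta (< 0 for descent)
    alpha = alpha0
    while f(D + alpha * eta) > f0 + c * alpha * slope:
        alpha *= beta
        if alpha < 1e-12:                 # safeguard against a stalled search
            break
    return alpha
```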

When applying Newton CG to minimize \(f_{\mu }\left( \varvec{D}\right) \) in (2.7), the key computations are summarized below. Denote \(\tilde{\varvec{D}}=\left[ \begin{array}{cc} \varvec{D}^{T}&\sqrt{\mu }\varvec{I}\end{array}\right] ^{T}\) and let \(\varOmega \left( :,i\right) \) be the index set of nonzero elements in \(\varvec{X}_{:,i}\). We consider the sub-dictionary \(\tilde{\varvec{D}}_{i}=\tilde{\varvec{D}}_{:,\varOmega \left( :,i\right) }\), a tall matrix (more rows than selected columns) with full column rank. We denote

$$ f_{i}(\tilde{\varvec{D}}_{i})=\underset{\varvec{x}_{i}}{\mathrm{min}}\Vert \varvec{y}_{i}-\tilde{\varvec{D}}_{i}\varvec{x}_{i}\Vert _{2}^{2} $$

and the optimal

$$ \varvec{x}_{i}^{*}=\mathrm{arg}\,\underset{\varvec{x}_{i}}{\mathrm{min}}\Vert \varvec{y}_{i}-\tilde{\varvec{D}}_{i}\varvec{x}_{i}\Vert _{2}^{2}. $$

Denote \(\tilde{\varvec{D}}_{i}^{\dagger }\) as the pseudo-inverse of \(\tilde{\varvec{D}}_{i}\). Then we have \(\frac{\partial f}{\partial \varvec{x}_{i}}\mid _{\varvec{x}_{i}^{*}}=\mathbf {0}\), where \(\varvec{x}_{i}^{*}=\tilde{\varvec{D}}_{i}^{\dagger }\varvec{y}_{i}\), and \(\nabla f_{i}(\tilde{\varvec{D}}_{i})\) can be written as

$$\begin{aligned} \nabla f_{i}(\tilde{\varvec{D}}_{i})=\frac{\partial f}{\partial \tilde{\varvec{D}}_{i}}+\frac{\partial f}{\partial \varvec{x}_{i}}\frac{\partial \varvec{x}_{i}}{\partial \tilde{\varvec{D}}_{i}}=-2(\varvec{y}_{i}-\tilde{\varvec{D}}_{i}\varvec{x}_{i}^{*})\varvec{x}_{i}^{*T}+\mathbf {0} \end{aligned}$$
(2.10)

To compute \(\nabla _{\varvec{\eta }}\left( \nabla f_{i}(\tilde{\varvec{D}}_{i})\right) \), we have

$$\begin{aligned} \nabla _{\varvec{\eta }}\left( \nabla f_{i}(\tilde{\varvec{D}}_{i})\right)&=2\nabla _{\varvec{\eta }}\left( \tilde{\varvec{D}}_{i}\varvec{x}_{i}^{*}-\varvec{y}_{i}\right) \varvec{x}_{i}^{*T}+2\left( \tilde{\varvec{D}}_{i}\varvec{x}_{i}^{*}-\varvec{y}_{i}\right) \nabla _{\varvec{\eta }}\varvec{x}_{i}^{*T}\nonumber \\&=2\nabla _{\varvec{\eta }}\tilde{\varvec{D}}_{i}\varvec{x}_{i}^{*}\varvec{x}_{i}^{*T}+2\tilde{\varvec{D}}_{i}\nabla _{\varvec{\eta }}\varvec{x}_{i}^{*}\varvec{x}_{i}^{*T}+2\left( \tilde{\varvec{D}}_{i}\varvec{x}_{i}^{*}-\varvec{y}_{i}\right) \nabla _{\varvec{\eta }}\varvec{x}_{i}^{*T}\nonumber \\&=2\varvec{\eta }\varvec{x}_{i}^{*}\varvec{x}_{i}^{*T}+2\tilde{\varvec{D}}_{i}\nabla _{\varvec{\eta }}\varvec{x}_{i}^{*}\varvec{x}_{i}^{*T}+2\left( \tilde{\varvec{D}}_{i}\varvec{x}_{i}^{*}-\varvec{y}_{i}\right) \nabla _{\varvec{\eta }}\varvec{x}_{i}^{*T}, \end{aligned}$$
(2.11)

where \(\nabla _{\varvec{\eta }}\varvec{x}^{*}\) is relatively easy to obtain,

$$\begin{aligned} \nabla _{\varvec{\eta }}x^{*}=-\Big (\tilde{\varvec{D}}^{T}\tilde{\varvec{D}}\Big )^{-1}\left( \left( \tilde{\varvec{D}}^{T}\varvec{\eta }+\varvec{\eta }^{T}\tilde{\varvec{D}}\right) \tilde{\varvec{D}}^{\dagger }-\varvec{\eta }^{T}\right) \varvec{y}. \end{aligned}$$
(2.12)

From the definition of \(\tilde{\varvec{D}}_{i}\), \(\varvec{D}_{i}\) is a submatrix of \(\tilde{\varvec{D}}_{i}\), therefore \(\nabla f_{i}(\varvec{D}_{i})\) and \(\nabla _{\varvec{\eta }}\left( \nabla f_{i}(\varvec{D}_{i})\right) \) are also, respectively, submatrices of \(\nabla f_{i}(\tilde{\varvec{D}}_{i})\) and \(\nabla _{\varvec{\eta }}\left( \nabla f_{i}(\tilde{\varvec{D}}_{i})\right) \), i.e., \(\nabla f_{i}(\varvec{D}_{i})=\left( \nabla f_{i}\Big (\tilde{\varvec{D}}_{i}\Big )\right) _{1:m,:}\) and \(\nabla _{\varvec{\eta }}\left( \nabla f_{i}(\varvec{D}_{i})\right) =\left( \nabla _{\varvec{\eta }}\left( \nabla f_{i}(\tilde{\varvec{D}}_{i})\right) \right) _{1:m,:}\).
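Expressions such as (2.10) can be sanity-checked numerically against finite differences; the sketch below does this for the atomic gradient \(-2(\varvec{y}-\varvec{D}\varvec{x}^{*})\varvec{x}^{*T}\) with \(\varvec{x}^{*}=\varvec{D}^{\dagger}\varvec{y}\). It is our own verification aid with arbitrary dimensions, not part of the published algorithm.

```python
import numpy as np

def atomic_f(D, y):
    """f(D) = min_x ||y - D x||_2^2, attained at x* = pinv(D) y."""
    x_star = np.linalg.pinv(D) @ y
    return np.sum((y - D @ x_star) ** 2)

def atomic_grad(D, y):
    """Closed-form gradient -2 (y - D x*) x*^T, cf. (2.10)."""
    x_star = np.linalg.pinv(D) @ y
    return -2.0 * np.outer(y - D @ x_star, x_star)

rng = np.random.default_rng(1)
D = rng.standard_normal((10, 4))        # tall, full-column-rank sub-dictionary
y = rng.standard_normal(10)
eta = rng.standard_normal(D.shape)      # arbitrary search direction
eps = 1e-6

fd = (atomic_f(D + eps * eta, y) - atomic_f(D - eps * eta, y)) / (2 * eps)
an = np.sum(atomic_grad(D, y) * eta)    # <grad, eta>, the directional derivative
print(abs(fd - an))                     # should be small (finite-difference accuracy)
```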

It is also worth noting that the SparseBSS model, which uses one dictionary to sparsely represent all the sources, achieves almost the same performance as using multiple dictionaries of the same size when the dictionary redundancy \(d/n\) is large enough. As a result, it is reasonable to train only one dictionary for all the sources. An obvious advantage of using one dictionary is that the computational cost does not increase as the number of sources increases.

2.3 Blind MMCA and Its Comparison to SparseBSS

BMMCA [14] is another recently proposed BSS algorithm based on adaptive dictionary learning. Without knowing the dictionaries in advance, the BMMCA algorithm also trains dictionaries from the observed mixtures \(\varvec{Z}\). Inspired by the hierarchical scheme used in MMCA and the update method in K-SVD, the separation model in BMMCA is made up of a few rank-1 approximation problems, each of which targets the estimation of one particular source:

$$\begin{aligned} \underset{\varvec{A}_{:,i},\varvec{s}_{i},\varvec{D}_{i},\varvec{X}_{i}}{\mathrm{min}}\,\lambda \left\| \varvec{E}_{i}-\varvec{A}_{:,i}\varvec{s}_{i}\right\| _{F}^{2}+\left\| \varvec{D}_{i}\varvec{X}_{i}-\mathcal {R}\varvec{s}_{i}\right\| _{2}^{2}+\mu \left\| \varvec{X}_{i}\right\| _{0}. \end{aligned}$$
(2.13)

Different from the operator \(\mathcal{P}\) defined earlier for the SparseBSS algorithm, the operator \(\mathcal{R}\) in BMMCA takes patches from only one estimated image \(\varvec{s}_{i}\). \(\varvec{D}_{i}\) is the dictionary trained to represent source \(\varvec{s}_{i}\), and \(\varvec{E}_{i}\) is the residual, which can be written as

$$\begin{aligned} \varvec{E}_{i}=\varvec{Z}-\sum _{j\ne i}\varvec{A}_{:,j}\varvec{s}_{j}. \end{aligned}$$
(2.14)

Despite the similar problem formulations, BMMCA and SparseBSS differ in whether the sources share a single dictionary in dictionary learning. In the SparseBSS algorithm, only one dictionary provides sparse representations for all sources, whereas BMMCA requires multiple dictionaries, one for each source. In the mixing matrix update, BMMCA imitates the K-SVD algorithm by splitting the update and normalization steps. Such a two-step approach does not guarantee optimality over \(\varvec{A}\in \mathcal{A}\), thereby giving less accurate estimates, while SparseBSS keeps \(\varvec{A}\in \mathcal{A}\) throughout the optimization process. In BMMCA, the ratio between the parameter \(\lambda \) and the noise standard deviation \(\sigma \) is fixed to \(30\), which does not guarantee good estimation results at all noise levels.

3 Dictionary Learning and the Singularity Issue

As is clear from the previous discussions, dictionary learning plays an essential role in solving the BSS problem when the sparsity prior is used, and hence is the focus of this section. We first briefly introduce the relevant background, then discuss an interesting phenomenon, the singularity issue in the dictionary update stage, and finally present two approaches to handle it. Readers who are more interested in the SparseBSS algorithm itself may consider this section optional and skip to Sect. 2.4.

3.1 Brief Introduction to Dictionary Learning Algorithms

One of the earliest dictionary learning algorithms is the method of optimal directions (MOD) [22] proposed by Engan et al. The main idea is as follows: in each iteration, one first fixes the dictionary and uses OMP [17] or FOCUSS [23] to update the sparse coefficients, and then fixes the obtained sparse coefficients and updates the dictionary. MOD was later modified into the iterative least squares algorithm (ILS-DLA) [24] and the recursive least squares algorithm (RLS-DLA) [25]. Aharon et al. developed the K-SVD algorithm [26], which can be viewed as a generalization of the K-means algorithm. In each iteration, the first step updates the sparse coefficients in the same way as in MOD. In the second step, one fixes the sparsity pattern and updates the dictionary and the nonzero coefficients simultaneously. In particular, the codewords in the dictionary are selected sequentially: the selected codeword and the corresponding row of the sparse coefficients are updated simultaneously using the singular value decomposition (SVD). More recently, Dai et al. [16] considered the dictionary learning problem from a new perspective. They formulated dictionary learning as an optimization problem on manifolds and developed the simultaneous codeword optimization (SimCO) algorithm. In each iteration, SimCO allows multiple codewords of the dictionary to be updated jointly with the corresponding rows of the sparse coefficients. This algorithm can be viewed as a generalization of both MOD and K-SVD. Other dictionary learning algorithms have also been developed in the past decade, targeting various settings. For example, based on stochastic approximations, Mairal et al. [27] proposed an online algorithm to address problems with large data sets.

Theoretical and in-depth analyses of the dictionary learning problem have progressed in the meantime as well. Gribonval et al. [28], Geng et al. [29], and Jenatton et al. [30] studied the stability and robustness of the objective function under different probabilistic modeling assumptions. In addition, Dai et al. observed in [16] that the dictionary update procedure may fail to converge to a minimizer. This is a common phenomenon in MOD, K-SVD, and SimCO. Dai et al. further observed that ill-conditioned dictionaries, rather than non-optimal stationary points, are the major reason for this failure of convergence. To alleviate this problem, Regularized SimCO was proposed in [16], and empirical performance improvement was observed. The same approach was also considered in [31], however, without a detailed discussion of the singularity issue. More recently, the fundamental drawback of Regularized SimCO was demonstrated using an artificial example [32]. To further handle the singularity issue, Smoothed SimCO [33] was proposed, which adds multiplicative terms rather than additive regularization terms to the objective function.

3.2 Singularity Issue and Its Impacts

In the dictionary update stage of existing mainstream algorithms, singularity has been observed as the major reason for failures [16, 33]. Simulations in [16] suggest that the mainstream algorithms fail mainly because of singular points in the objective function rather than non-optimal stationary points. As dictionary learning is an essential part of the aforementioned SparseBSS, the singularity issue also has a negative impact on the overall performance of BSS. To explain the singularity issue in dictionary update, we first formally define singular dictionaries.

Definition 1

A dictionary \(\varvec{D}\in \mathbb {R}^{m\times d}\) is singular under a given sparsity pattern \(\varOmega \) if there exists an \(i\in \left[ n\right] \) such that the corresponding sub-dictionary \(\varvec{D}_{i}\triangleq \varvec{D}_{:,\varOmega (:,i)}\) is column rank deficient. Or equivalently, the minimum singular value of \(\varvec{D}_{i}\), denoted as \(\lambda _{\mathrm{min}}\left( \varvec{D}_{i}\right) \), is zero.

A dictionary \(\varvec{D}\in \mathbb {R}^{m\times d}\) is said to be ill-conditioned under a given sparsity pattern \(\varOmega \) if there exists an \(i\in \left[ n\right] \) such that the condition number of the sub-dictionary \(\varvec{D}_{i}\) is large, or equivalently \(\lambda _{\mathrm{min}}\left( \varvec{D}_{i}\right) \) is close to zero.

Definition 2

[16] Define the condition number of a dictionary \(\varvec{D}\) as:

$$ \kappa \left( \varvec{D}\right) =\underset{i\in [n]}{\mathrm{max}}\frac{\lambda _{\mathrm{max}}\left( \varvec{D}_{i}\right) }{\lambda _{\mathrm{min}}\left( \varvec{D}_{i}\right) }, $$

where \(\lambda _{\mathrm{max}}\left( \varvec{D}_{i}\right) \) and \(\lambda _{\mathrm{min}}\left( \varvec{D}_{i}\right) \) represent the maximum and the minimum singular value of the sub-dictionary \(\varvec{D}_{i}\) respectively.
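Definition 2 translates directly into code; the helper below computes \(\kappa(\varvec{D})\) for a dictionary and a sparsity pattern supplied as a boolean mask (an illustrative sketch of ours; the mask representation of \(\varOmega\) is an assumption).

```python
import numpy as np

def dictionary_condition_number(D, Omega):
    """kappa(D) = max_i lambda_max(D_i) / lambda_min(D_i),
    where D_i = D[:, Omega[:, i]] and Omega is a boolean (d x n) mask."""
    kappa = 0.0
    for i in range(Omega.shape[1]):
        Di = D[:, Omega[:, i]]
        if Di.shape[1] == 0:
            continue                         # column of X with no nonzero entries
        svals = np.linalg.svd(Di, compute_uv=False)
        smin = svals[-1]
        kappa = np.inf if smin == 0.0 else max(kappa, svals[0] / smin)
    return kappa                             # large (or infinite) values flag ill-conditioned/singular D
```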

The word “singular” comes from the fact that \(f\left( \varvec{D}\right) =\underset{\varvec{X}\in \mathcal {X}_{\varOmega }}{\mathrm{min}}\Vert \varvec{Y}-\varvec{DX}\Vert _{F}^{2}\) is not continuous at a singular dictionary and the corresponding

$$ \varvec{X}\left( \varvec{D}\right) \triangleq \mathrm{arg}\,\underset{\varvec{X}\in \mathcal {X}_{\varOmega }}{\mathrm{min}}\Vert \varvec{Y}-\varvec{DX}\Vert _{F}^{2} $$

is not unique. The singularity of \(f\left( \varvec{D}\right) \) leads to convergence problems. Benchmark dictionary update procedures may fail to find a globally optimal solution. Instead they converge to a singular point of \(f\left( \varvec{D}\right) \), i.e., a singular dictionary.

Ill-conditioned dictionaries lie in the neighborhood of singular ones. Algorithmically, when one of the \(\lambda _{\mathrm{min}}\left( \varvec{D}_{i}\right) \)'s is close to zero, the curvature of \(f\left( \varvec{D}\right) \) becomes very large and the gradient fluctuates dramatically. This seriously affects the convergence rate of the dictionary update process.

Furthermore, ill-conditioned dictionaries also have a negative effect on the sparse coding stage. Denote by \(\varvec{y}_{i}\) and \(\varvec{x}_{i}\) the \(i\)th columns of \(\varvec{Y}\) and \(\varvec{X}\), respectively. Consider a summand of the formulation in the sparse coding stage [16, 26], i.e.,

$$ \underset{\varvec{x}_{i}}{\mathrm{min}}\,\Vert \varvec{y}_{i}-\varvec{Dx}_{i}\Vert _{2}^{2}+\left\| \varvec{x}_{i}\right\| _{0}. $$

An ill-conditioned \(\varvec{D}\) corresponds to a very large condition number, which violates the restricted isometry property (RIP) [34] and results in unstable solutions: with small perturbations added to the training samples \(\varvec{Y}\), the solution \(\varvec{X}\) deviates significantly.
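The instability can be seen in a two-atom toy example: when two codewords are nearly parallel (an ill-conditioned sub-dictionary), a tiny perturbation of the training vector changes the least-squares coefficients drastically. The numbers below are our own illustration.

```python
import numpy as np

theta = 1e-4                                   # angle between the two atoms: nearly parallel
D = np.column_stack(([1.0, 0.0],
                     [np.cos(theta), np.sin(theta)]))  # condition number on the order of 1/theta
y = np.array([1.0, 0.5 * theta])
dy = np.array([0.0, 1e-6])                     # tiny perturbation of the training vector

x1 = np.linalg.lstsq(D, y, rcond=None)[0]
x2 = np.linalg.lstsq(D, y + dy, rcond=None)[0]
# The coefficient change is several orders of magnitude larger than the data change.
print(np.linalg.norm(dy), np.linalg.norm(x2 - x1))
```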

3.3 Regularized SimCO

The main idea of Regularized SimCO lies in the use of an additive penalty term to avoid singularity. Consider the objective function \(f_{\mu }\left( \varvec{D}\right) \) in (2.7),

$$\begin{aligned} f_{\mu }\left( \varvec{D}\right) =&\underset{\varvec{X}\in \mathcal {X}_{\varOmega }}{\mathrm{min}}\;\left\| \varvec{D}\varvec{X}-\mathcal{P}\left( \varvec{S}\right) \right\| _{F}^{2}+\mu \left\| \varvec{X}\right\| _{F}^{2},\nonumber \\ =&\underset{\varvec{X}\in \mathcal {X}_{\varOmega }}{\mathrm{min}}\;\left\| \left[ \begin{array}{c} \mathcal{P}\left( \varvec{S}\right) \\ \varvec{0} \end{array}\right] -\left[ \begin{array}{c} \varvec{D}\\ \sqrt{\mu }\varvec{I} \end{array}\right] \varvec{X}\right\| _{F}^{2}. \end{aligned}$$
(2.15)

As long as \(\mu >0\), the block \(\sqrt{\mu }\varvec{I}\) guarantees that \(\tilde{\varvec{D}}=\left[ \begin{array}{cc} \varvec{D}^{T}&\sqrt{\mu }\varvec{I}\end{array}\right] ^{T}\) has full column rank. Therefore, with the modified objective function \(f_{\mu }\left( \varvec{D}\right) \), there is no singular point, so gradient descent methods converge only to stationary points.

This regularization technique is also applicable to MOD [16]. It has been verified that the technique effectively mitigates the occurrence of ill-conditioned dictionaries, although some new stationary points might be introduced at the same time. To alleviate this problem, one can gradually decrease the regularization parameter \(\mu \) during the optimization process [16]; in the end \(\mu \) decreases to zero. Nevertheless, convergence to a global minimum is still not guaranteed. The explicit example constructed in [32] shows a failure case of Regularized SimCO. As a result, another method to address the singularity issue is introduced below.

3.4 Smoothed SimCO

Also aiming at handling the singularity issue, Smoothed SimCO [33] removes the singularity effect by introducing multiplicative modulation functions. The intuition is explained as follows. Write \(f\left( \varvec{D}\right) \) as a summation of atomic functions

$$\begin{aligned} f(\varvec{D})&=\underset{\varvec{X}\in \mathcal {X}_{\varOmega }}{\mathrm{min}}\,\Vert \varvec{Y}-\varvec{DX}\Vert _{F}^{2}\nonumber \\&=\underset{i}{\sum }\,\underset{\varvec{X}_{\varOmega (:,i),i}}{\mathrm{min}}\,\Vert \varvec{Y}_{:,i}-\varvec{D}_{i}\varvec{X}_{\varvec{\varOmega }(:,i),i}\Vert _{2}^{2}\\&=\underset{i}{\sum }\, f_{i}(\varvec{D}_{i}),\nonumber \end{aligned}$$
(2.16)

where each \(f_{i}(\varvec{D}_{i})\) is termed as an atomic function and \(\varvec{D}_{i}\) is defined in Definition 1. Let \(\mathcal{I}\) be the index set corresponding to the \(\varvec{D}_{i}\)’s of full column rank. Define an indicator function \(\mathcal{X}_{\mathcal{I}}\) s.t. \(\mathcal{X}_{\mathcal{I}}\left( i\right) =1\) if \(i\in \mathcal{I}\) and \(\mathcal{X}_{\mathcal{I}}\left( i\right) =0\) if \(i\in \mathcal{I}^{c}\). Use \(\mathcal{X}_{\mathcal{I}}\left( i\right) \) as a multiplicative modulation function and apply it to each \(f_{i}\left( \varvec{D}_{i}\right) \). Then one obtains

$$\begin{aligned} \bar{f}(\varvec{D})=\underset{i}{\sum }f_{i}(\varvec{D}_{i})\mathcal{X}_{\mathcal{I}}\left( i\right) =\underset{i\in \mathcal{I}}{\sum }f_{i}(\varvec{D}_{i}). \end{aligned}$$
(2.17)

This new function \(\bar{f}\) is actually the best possible lower semi-continuous approximation of \(f\), and no new stationary points are created.

Motivated by the above, we define

$$\begin{aligned} \tilde{f}(\varvec{D})=\underset{i}{\sum }f_{i}(\varvec{D}_{i})g\left( \lambda _{\mathrm{min}}\left( \varvec{D}_{i}\right) \right) , \end{aligned}$$
(2.18)

where the shape of \(g\) is given in Fig. 2.1. The function \(g\) has the following properties: (1) \(g\left( \lambda _{\mathrm{min}}\right) =0\) for all \(\lambda _{\mathrm{min}}\le 0\); (2) \(g\left( \lambda _{\mathrm{min}}\right) =1\) for all \(\lambda _{\mathrm{min}}\ge \delta \), where \(\delta >0\) is a threshold; (3) \(g\) is monotonically increasing; (4) \(g\) is second order differentiable. When \(\lambda _{\mathrm{min}}\left( \varvec{D}_{i}\right) \) is used as the input variable of \(g\) and the positive threshold \(\delta \rightarrow 0\), \(g\left( \lambda _{\mathrm{min}}\left( \varvec{D}_{i}\right) \right) \) becomes an indicator of whether \(\varvec{D}_{i}\) has full column rank, i.e.,

$$ {\left\{ \begin{array}{ll} g\left( \lambda _{\mathrm{min}}\left( \varvec{D}_{i}\right) \right) =1 &{}\quad \mathrm{if}\,\varvec{D}_{i}\,\mathrm{has\, full\, column\, rank;}\\ g\left( \lambda _{\mathrm{min}}\left( \varvec{D}_{i}\right) \right) =0 &{}\quad \mathrm{otherwise}. \end{array}\right. } $$
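One concrete choice of \(g\) satisfying properties (1)–(4) is a quintic "smoothstep" ramp, sketched below; this is merely an example of a valid modulation function and is not necessarily the one used in [33].

```python
import numpy as np

def g(lmbda_min, delta=1e-2):
    """Smooth ramp: 0 for lmbda_min <= 0, 1 for lmbda_min >= delta,
    monotonically increasing and twice continuously differentiable in between."""
    t = np.clip(lmbda_min / delta, 0.0, 1.0)
    return t**3 * (10.0 - 15.0 * t + 6.0 * t**2)   # 10 t^3 - 15 t^4 + 6 t^5
```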

The modulated objective function \(\tilde{f}\) has several good properties that are not exhibited by the regularized objective function (2.15). In particular, we have the following theorems.

Theorem 1

Consider the smoothed objective function \(\tilde{f}\) and the original objective function \(f\) defined in (2.18) and (2.16), respectively.

  1. When \(\delta >0\), \(\tilde{f}(\varvec{D})\) is continuous.

  2. Consider the limit case where \(\delta \rightarrow 0\) with \(\delta >0\). The following statements hold:

     a. \(\tilde{f}(\varvec{D})\) and \(f(\varvec{D})\) differ only at the singular points.

     b. \(\tilde{f}(\varvec{D})\) is the best possible lower semi-continuous approximation of \(f(\varvec{D})\).

Theorem 2

Consider the smoothed objective function \(\tilde{f}\) and the original objective function \(f\) defined in (2.18) and (2.16), respectively. For any \(a\in \mathbb {R}\), define the lower level set \(\mathcal{D}_{f}\left( a\right) =\left\{ \varvec{D}:\, f(\varvec{D})\le a\right\} \). It is provable that when \(\delta \rightarrow 0\), \(\mathcal{D}_{\tilde{f}}\left( a\right) \) is the closure of \(\mathcal{D}_{f}\left( a\right) \).

Fig. 2.1 A shape of the function \(g\left( \cdot \right) \)

In practice, we always choose a \(\delta >0\). The effect of a positive \(\delta \), roughly speaking, is to remove the barriers created by singular points and replace them with “tunnels”, whose widths are controlled by \(\delta \), allowing the optimization process to pass through. The smaller \(\delta \) is, the better \(\tilde{f}\) approximates \(f\), but the narrower the tunnels are and the slower the convergence will be. As a result, the threshold \(\delta \) should be chosen properly; a detailed discussion on choosing \(\delta \) is presented in [32]. Compared with the choice of the parameter \(\mu \) in Regularized SimCO [16], the choice of the smoothing threshold \(\delta \) is easier: one can simply choose a small \(\delta >0\) without decreasing it during the process.

As final remarks, Smoothed SimCO has several theoretical advantages over Regularized SimCO. However, the computation of the \(\lambda _{\mathrm{min}}\left( \varvec{D}_{i}\right) \)'s introduces extra cost. The choice between these two methods therefore depends on the size of the problem under consideration.

4 Algorithm Testing on Practical Applications

In this section, we present numerical results of the SparseBSS method compared with some other mainstream algorithms. We first focus on speech separation, where an exactly determined case is considered. Then, we show an example of blind image separation, where we consider an overdetermined case.

In the speech separation case, two mixtures of two audio sources are used. Two male utterances in different languages are selected as the sources. The sources are mixed by a \(2\times 2\) random matrix \(\varvec{A}\) (with normalized columns). For the noisy case, 20 dB Gaussian noise was added to the mixtures. See Fig. 2.2 for the sources and mixtures.

Fig. 2.2 Two speech sources and the corresponding noisy mixtures (20 dB Gaussian noise)

We compare SparseBSS with two benchmark algorithms, FastICA and QJADE [35]. The BSSEVAL toolbox [36] is used for the performance measurement. In particular, an estimated source \(\hat{\varvec{s}}\) is decomposed as \(\hat{\varvec{s}}=\varvec{s}_{\mathrm{target}}+\varvec{e}_{\mathrm{interf}}+\varvec{e}_{\mathrm{noise}}+\varvec{e}_{\mathrm{artif}}\), where \(\varvec{s}_{\mathrm{target}}\) is the true source signal, \(\varvec{e}_{\mathrm{interf}}\) denotes the interference from other sources, \(\varvec{e}_{\mathrm{noise}}\) represents the deformation caused by the noise, and \(\varvec{e}_{\mathrm{artif}}\) includes all other artifacts introduced by the separation algorithm. Based on this decomposition, three performance criteria can be defined: the source-to-distortion ratio \(\mathrm{SDR}=10\log _{10}\frac{\left\| \varvec{s}_{\mathrm{target}}\right\| ^{2}}{\left\| \varvec{e}_{\mathrm{interf}}+\varvec{e}_{\mathrm{noise}}+\varvec{e}_{\mathrm{artif}}\right\| ^{2}}\), the source-to-artifact ratio \(\mathrm{SAR}=10\log _{10}\frac{\left\| \varvec{s}_{\mathrm{target}}+\varvec{e}_{\mathrm{interf}}+\varvec{e}_{\mathrm{noise}}\right\| ^{2}}{\left\| \varvec{e}_{\mathrm{artif}}\right\| ^{2}}\), and the source-to-interference ratio \(\mathrm{SIR}=10\log _{10}\frac{\left\| \varvec{s}_{\mathrm{target}}\right\| ^{2}}{\left\| \varvec{e}_{\mathrm{interf}}\right\| ^{2}}\). Among them, the SDR measures the overall quality of the separation, and the SIR focuses on the interference rejection. We investigate the gains in SDR, SAR, and SIR from the mixtures to the estimated sources. For example, \(\triangle \mathrm{SDR}=\mathrm{SDR}_{\mathrm{out}}-\mathrm{SDR}_{\mathrm{in}}\), where \(\mathrm{SDR}_{\mathrm{out}}\) is calculated from the definition above and \(\mathrm{SDR}_{\mathrm{in}}\) is obtained by letting \(\hat{\varvec{s}}=\varvec{Z}\) in the same equation. The results (in dB) are summarized in Table 2.1.
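Given the BSSEVAL-style decomposition of an estimated source, the three criteria are simple energy ratios, as in the sketch below; computing the decomposition itself is done by the BSSEVAL toolbox [36] and is not reproduced here.

```python
import numpy as np

def bss_criteria(s_target, e_interf, e_noise, e_artif):
    """SDR, SIR, SAR in dB from the decomposition
    s_hat = s_target + e_interf + e_noise + e_artif."""
    def db(num, den):
        return 10.0 * np.log10(np.sum(num**2) / np.sum(den**2))
    sdr = db(s_target, e_interf + e_noise + e_artif)
    sir = db(s_target, e_interf)
    sar = db(s_target + e_interf + e_noise, e_artif)
    return sdr, sir, sar
```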

Table 2.1 Separation performance of the SparseBSS algorithm as compared to FastICA and QJADE

The selection of \(\lambda \) is an important practical issue since it is related to the noise level and largely affects the algorithm performance. From the optimization formulation (2.3), it is clear that with a fixed SNR, different choices of \(\lambda \) may give different separation performance. To show this, we use the estimation error \(\left\| \varvec{A}_{\mathrm{true}}-\hat{\varvec{A}}\right\| _{F}^{2}\) of the mixing matrix to measure the separation performance, where \(\varvec{A}_{\mathrm{true}}\) and \(\hat{\varvec{A}}\) are the true and estimated mixing matrices, respectively. The simulation results are presented in Fig. 2.3. Consistent with the intuition, simulations suggest that the smaller the noise level, the larger the optimal value of \(\lambda \). The results in Fig. 2.3 help in setting \(\lambda \) when the noise level is known a priori.

Fig. 2.3 Relation of the parameter \(\lambda \) to the estimation error of the mixing matrix under different noise levels. The signal-to-noise ratio (SNR) is defined as \(\rho =10\log _{10}\left( \left\| \varvec{A}\varvec{S}\right\| _{F}^{2}/\left\| \varvec{V}\right\| _{F}^{2}\right) \) dB

Next, we show an example of blind image separation, where we consider an overdetermined case. The mixed images are generated from two source images using a \(4\times 2\) full-rank, column-normalized mixing matrix \(\varvec{A}\) whose elements are drawn randomly from a Gaussian distribution. The mean squared error (MSE) is used to compare the reconstruction performance of the candidate algorithms when no noise is added. The MSE is defined as \(MSE=(1/N)\left\| \chi -\tilde{\chi }\right\| _{F}^{2}\), where \(\chi \) is the source image and \(\tilde{\chi }\) is the reconstructed image; the lower the MSE, the better the reconstruction. Table 2.2 lists the results of the four tested algorithms. For the noisy case, Gaussian white noise with \(\sigma =10\) is added to the four mixtures. We use the peak signal-to-noise ratio (PSNR) to measure the reconstruction quality, defined as \(PSNR=20\mathrm{log}_{10}(MAX/\sqrt{MSE})\), where MAX is the maximum possible pixel value of the image (e.g., \(MAX=255\) for an 8-bit image); higher PSNR indicates better quality. The noisy observations are illustrated in Fig. 2.4b.
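The MSE and PSNR used here follow directly from their definitions; the sketch below is a direct transcription, with \(MAX=255\) assumed for 8-bit images.

```python
import numpy as np

def mse(src, rec):
    """Mean squared error (1/N) * ||src - rec||_F^2, with N the number of pixels."""
    return np.mean((src.astype(float) - rec.astype(float)) ** 2)

def psnr(src, rec, max_val=255.0):
    """Peak signal-to-noise ratio 20*log10(MAX / sqrt(MSE)) in dB."""
    return 20.0 * np.log10(max_val / np.sqrt(mse(src, rec)))
```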

Table 2.2 Achieved MSEs of the algorithms in a noiseless case
Fig. 2.4 Two classic images, Lena and Boat, were selected as the source images, shown in (a). The mixtures are shown in (b). The separation results are shown in (c)–(f). We compared SparseBSS with other benchmark algorithms: FastICA [37], GMCA [10], and BMMCA [14]. The overlap percentage was set to 50 % for both BMMCA and SparseBSS. The source images recovered by SparseBSS tend to be less blurred than those of the other three algorithms

Finally, we show another example of blind image separation to demonstrate the importance of the singularity-aware process. In this example, we use the two classic images Lena and Texture as the source images (Fig. 2.6a). Four noiseless mixtures were generated from the sources, and the separation results are shown in Fig. 2.6b and c. Note that images like Texture concentrate much of their energy in particular frequency components. Hence, an initial dictionary with more codewords corresponding to those frequencies may perform better for estimating such images. Motivated by this, in Fig. 2.6b the initial dictionary is generated from an over-complete DCT dictionary but contains more high-frequency codewords. Such a choice can lead to better separation results. At the same time, the very similar dictionary codewords introduce the risk of the singularity issue (Fig. 2.5).

Fig. 2.5 Comparison of the performance in estimating the mixing matrix for all the methods under different noise standard deviations \(\sigma \). In this experiment, \(\sigma \) varies from 2 to 20. GMCA performs better than FastICA. The curve for BMMCA is not available, as its parameter settings are too involved and inconsistent across different \(\sigma \) to obtain a good result. SparseBSS outperforms the compared algorithms

Fig. 2.6 The two source images, Lena and Texture, are shown in (a). The separation results are shown in (b) and (c). The comparison demonstrates the importance of the singularity-aware process

The major difference between Fig. 2.6b and c is that in Fig. 2.6b the Regularized SimCO process (\(\mu =0.05\)) is used, while in Fig. 2.6c there is no regularization term in the dictionary learning stage. As one can see from the numerical results, the result in Fig. 2.6b is much better than that in Fig. 2.6c. When the regularization term is not used (\(\mu =0\)), the condition number stays at a high level, as expected (larger than 40 in this example). This confirms the necessity of considering the singularity issue in BSS and the effectiveness of the singularity-aware approach.

5 Conclusions and Prospective Extensions

In conclusion, we briefly reviewed the development of blind source separation algorithms based on dictionary learning. In particular, we focused on the SparseBSS algorithm and its optimization procedures, and on the singularity issue that might lead to the failure of such algorithms. At the same time, there are still some open questions to be addressed.

In dictionary learning, it remains open how to find an optimal choice of the redundancy factor \(\tau =d/n\) of the over-complete dictionary. A higher redundancy factor leads to a sparser representation or a more precise reconstruction. On the other hand, one has to consider the computational capabilities when implementing the algorithms, and from this point of view it is better to keep the redundancy factor low. In the simulations, we used a 64-by-256 dictionary, which gives a redundancy factor of \(\tau =256/64=4\). This choice is empirical: the sparse representation results are good and the computational cost is limited. A rigorous analysis of the selection of \(\tau \) is still missing.

The relation between the parameters \(\lambda \) and \(\epsilon \) and the noise standard deviation \(\sigma \) is also worth investigating. As presented in the first experiment on blind audio separation, the relation between \(\lambda \) and \(\sigma \) was examined with the error bound \(\epsilon \) fixed in the sparse coding stage, so one can roughly estimate the value of \(\lambda \) when the noise level is known a priori. A similar investigation was undertaken in [14], where the authors report that with \(\lambda \thickapprox \sigma /30\) the algorithm achieves similar reconstruction performance under various \(\sigma \)'s. From another perspective, the error bound \(\epsilon \) is proportional to the noise standard deviation. Once a well-approximated relation between \(\epsilon \) and \(\sigma \) is obtained, one may estimate \(\lambda \) more precisely rather than keeping \(\epsilon \) fixed. This analysis is therefore another open question.