1 Introduction

In signal and image processing, we frequently need to address the problem of recovering unknown sources from their mixtures, that is, the blind source separation (BSS) problem, which can be described as the following linear mixing model

$$\begin{aligned} \varvec{Y}=\varvec{\mathrm{AX}}+\varvec{V} \end{aligned}$$
(1)

where \(\varvec{Y}=[\varvec{y}_{1},\varvec{y}_{2},\ldots ,\varvec{y}_{m}]^{T}\in \mathbb {R}^{m\times N}\) and \(\varvec{X}=[\varvec{x}_{1},\varvec{x}_{2},\ldots ,\varvec{x}_{n}]^{T} \in \mathbb {R}^{n\times N}\) denote the observed matrix and the source matrix, respectively, \(\varvec{A}\in {\mathbb {R}^{m\times n}}\) is the mixing matrix, and \(\varvec{V}\in {\mathbb {R}^{m\times N}}\) denotes the additive Gaussian noise. Notice that \(\varvec{X}\) and \(\varvec{Y}\) contain n sources and m mixtures, respectively, and N is the data size. The task of BSS is to estimate the sources \(\varvec{X}\) (and the mixing matrix \(\varvec{A}\)) without any prior knowledge other than the observation \(\varvec{Y}\). As is well known, this problem does not have a unique solution in general. Therefore, it is necessary to impose some restrictions on the separation process to make the sources distinguishable.
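
For concreteness, the following NumPy sketch generates synthetic data according to model (1); the dimensions and noise level are illustrative choices of ours, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, N = 2, 4, 1000                # sources, mixtures, data size (illustrative)
X = rng.laplace(size=(n, N))        # synthetic sources
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)      # column-normalized mixing matrix
V = 0.01 * rng.standard_normal((m, N))

Y = A @ X + V                       # observed mixtures, model (1)
```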

Independent component analysis (ICA) [1], one of the classical approaches to the over-determined BSS problem (\(m\ge n\)), assumes that the sources are non-Gaussian and as statistically independent as possible. It has led to many well-known approaches, such as Infomax [2], maximum likelihood estimation [3], the maximum a posteriori (MAP) estimator [4] and FastICA [1]. Infomax, which uses a stochastic gradient algorithm to maximize the likelihood, may require some hand-tuning and often fails to converge [5], or converges only slowly. Therefore, various methods have been proposed to maximize the Infomax likelihood much faster by using curvature information, especially the complete set of second-order derivatives (the Hessian). However, the Hessian is a large object and incurs a high computational cost for large data sets. To address the cost issue, [6] adopted a truncated Newton algorithm; another approach is to use approximations of the Hessian, such as the simple quasi-Newton methods in [7] and [8], and the trust-region method in [9]. Nevertheless, the work in [10] compared these methods [7,8,9] and found that the Hessian approximations do yield a low cost per iteration, but that they are not accurate enough on real data. Therefore, [10] proposed the Preconditioned ICA for Real Data (Picard) algorithm, a relative L-BFGS algorithm preconditioned with sparse Hessian approximations, to overcome this problem. However, classical ICA algorithms are weak at separating noisy sources.

Meanwhile, Lee and Seung proposed another effective method, nonnegative matrix factorization (NMF), which takes advantage of nonnegativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations [11]. Afterward, Boutsidis and Gallopoulos described nonnegative double singular value decomposition (NNDSVD) [12], a new method designed to enhance the initialization stage of NMF; it is well suited to initializing NMF algorithms with sparse factors, and many numerical examples showed that NNDSVD makes the approximation error of many NMF algorithms decrease rapidly.

During the past decade, sparsity has been widely used in many fields [13,14,15,16,17,18]. A vast body of work has attempted to exploit sparsity for the BSS problem, and many sparsity-based approaches have been proposed [19,20,21,22,23,24]. They are collectively termed sparse component analysis (SCA), which means that the sources can be sparsely represented using a known common basis or dictionary. For instance, morphological component analysis (MCA) [19] is a decomposition method based on the sparse representation of signals over a signal dictionary; it showed that different dictionaries produce entirely different representations and that the sparser the representation, the better the separation. Bobin et al. then described an extension of MCA, multichannel morphological component analysis (MMCA) [20], adapted to multichannel data. MMCA takes advantage of a large over-complete dictionary to obtain different sparse representations for the n sources based on their morphological diversity. In [21], generalized morphological component analysis (GMCA) was proposed for the much more general case in which each source is a linear combination of several components and each component is sparse in a given dictionary. Moreover, Abolghasemi et al. proposed BMMCA [22], which adaptively learns the dictionaries from the mixed images via K-SVD, adapting MMCA to cases where known dictionaries are not available. Further, the work in [24] extended BMMCA so that only one dictionary is used to sparsely represent the different sources. It incorporated the SimCO optimization framework [25] into the two-stage dictionary learning and mixture learning, thus significantly reducing the implementation effort.

It is worth noting that the above sparsity-related methods all rely on the synthesis model. In this model, a signal \(\varvec{w}\in {\mathbb {R}^{d}}\) can be composed as \(\varvec{w}=\varvec{D\alpha }\), where \(\varvec{D}\in {\mathbb {R}^{d\times k}}\) \((d\ge k)\) is a dictionary and the representation coefficient \(\varvec{\alpha }\in {\mathbb {R}^{k}}\) is expected to be sparse, i.e., \(\Vert \varvec{\alpha }\Vert _{0}=c\ll k\). As mentioned above, the synthesis model has been extensively applied to the BSS problem. Its dual counterpart, the analysis sparse model, has been barely touched in sparse representation [26, 27]. The analysis model is described as \(\varvec{z}=\varvec{\varOmega w}\), where \(\varvec{\varOmega }\in {\mathbb {R}^{p\times d}}\) \((p\ge d)\) is an analysis dictionary. The implication of this model is that the signal \(\varvec{w}\in {\mathbb {R}^{d}}\) can be transformed into a low-dimensional subspace via the linear operator \(\varvec{\varOmega }\). Therefore, the representation coefficient \(\varvec{z}\) is assumed to be sparse, i.e., \(\Vert \varvec{z}\Vert _{0}=c\ll p\).
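
The structural difference between the two models can be illustrated with a short NumPy sketch (our own illustration; all sizes are arbitrary): in the synthesis model the signal is built from a few columns of \(\varvec{D}\), whereas in the analysis model sparsity arises because many rows of \(\varvec{\varOmega }\) are orthogonal to the signal.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, p, c = 64, 32, 80, 4          # illustrative sizes only

# Synthesis model: w = D @ alpha with a sparse coefficient vector alpha.
D = rng.standard_normal((d, k))
alpha = np.zeros(k)
alpha[rng.choice(k, size=c, replace=False)] = rng.standard_normal(c)
w_syn = D @ alpha                   # ||alpha||_0 = c << k

# Analysis model: z = Omega @ w is sparse because w lies in the null space
# of the co-support rows of Omega (the rows orthogonal to w).
Omega = rng.standard_normal((p, d))
Omega /= np.linalg.norm(Omega, axis=1, keepdims=True)   # row-normalized
Lam = rng.choice(p, size=d - c, replace=False)          # co-support rows
_, _, Vt = np.linalg.svd(Omega[Lam])                    # (d-c) x d submatrix
w_ana = Vt[-c:].T @ rng.standard_normal(c)              # w in null(Omega_Lam)
z = Omega @ w_ana
print(int(np.sum(np.abs(z) > 1e-10)), "nonzeros out of", p)   # about p - (d - c)
```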

Although the two models have a very similar structure, the theoretical and simulation results provided in [28] indicate that there is a considerable gap between them, especially in the over-complete case. Moreover, in some respects, the analysis model has advantages over the synthesis model. For example, the analysis model contains a larger number of low-dimensional subspaces for the same subspace dimension \((k=d-l)\), which enriches its descriptive power [26]. In addition, in the synthesis framework, where only a small number of atoms are used to represent each signal, the significance of every atom grows enormously; any wrong choice could potentially lead to a 'domino effect.' In the analysis formulation, however, all atoms take an equal part in describing the signal, which minimizes the dependence on each individual atom and stabilizes the recovery process [28]. More generally, analysis-based methods are a very common structure in image processing and computer vision applications. Due to the smaller dimension of the unknown, compared with a similar-sized synthesis-based method, the analysis-based method leads to a simpler optimization problem in the over-complete case \((p>d)\), which is considerably easier to solve [28]. Therefore, it is natural to incorporate the analysis sparse constraint into the BSS problem.

Recently, one of the effective related attempts was the work in [29], whose experimental results demonstrated that an approach based on the analysis sparse model has better separation performance. Nevertheless, it should be noticed that this work replaced the \(l_{0}\)-norm with the \(l_{1}\)-norm as the sparsity measure to obtain an approximate solution, which will certainly reduce sparsity. The \(l_{0}\)-norm is the most intuitive measure of sparsity. Only when certain conditions, such as the exact recovery condition (ERC) and the restricted isometry property (RIP) [26], are met can the \(l_{1}\)-norm be considered equivalent to the \(l_{0}\)-norm.

Hence, in this paper, we propose a new method, termed analysis BSS (ABSS), to make better use of the sparsity constraint. Different from most BSS algorithms, which are based on the synthesis model, this method is based on the analysis model, and it directly uses the \(l_{0}\)-norm as the measure of sparsity to obtain a more exact solution. We thus arrive at a novel objective function for the BSS problem based on the analysis sparse model. In order to simplify this problem, the solution process is divided into three stages: a dictionary learning stage, a sources estimating stage and a mixing matrix estimating stage. In the first stage, we adopt Analysis KSVD [30] to adaptively estimate the analysis dictionary from the mixed sources. In the second and third stages, the sources and the mixing matrix are each obtained uniquely by a simple least squares linear regression. Finally, the results indicate that using the \(l_{0}\)-norm enhances the separability of the sources.

The organization of this manuscript is as follows. Sect. 2 introduces the proposed framework. The details of our algorithm are presented in Sect. 3. Finally, simulations comparing the proposed algorithm with several typical methods are provided in Sect. 4 to verify its competitive separation performance.

2 Problem formulation

Firstly, BSS problem (1) can be written as the following optimization problem

$$\begin{aligned} \min _{\varvec{A},\varvec{X}}\Vert \varvec{Y}-\varvec{\mathrm{AX}}\Vert _{\mathrm{F}}^2 \end{aligned}$$
(2)

where \(\Vert \cdot \Vert _{F}\) denotes the Frobenius norm. Notice that in this problem, the mixing matrix \(\varvec{A}\in {\mathbb {R}^{m\times n}}\) and the source matrix \(\varvec{X}:\{\varvec{x}_{i}\}_{i=1}^{n}\in {\mathbb {R}^{n\times N}}\) are completely unknown. We can only use the known observed matrix \(\varvec{Y}:\{\varvec{y}_{i}\}_{i=1}^{m}\in {\mathbb {R}^{m\times N}}\), which is the input of our algorithm, to estimate \(\varvec{X}\) and \(\varvec{A}\). Therefore, this problem is ill-posed, and here we impose the analysis sparse constraint on it, that is, the forward projection of each source \(\varvec{x}_{i}\) onto the basis elements is sufficiently sparse. The sparse representation of \(\varvec{x}_{i}\) under the analysis model can be described as

$$\begin{aligned} \min _{\varvec{\varOmega }_{i},\varvec{s}_{i}}\Vert \varvec{s}_{i}\Vert _{0} \quad s.t. \ \varvec{s}_{i}=\varvec{\varOmega }_{i} \varvec{x}_{i} \end{aligned}$$
(3)

where \(\varvec{s}_{i}\) denotes the representation coefficient of \(\varvec{x}_{i}\), most of whose entries are zero, and \(\Vert \cdot \Vert _{0}\) is the \(\ell _{0}\)-norm, which counts the number of nonzeros. In particular, \(\varvec{\varOmega }_{i}\in \mathbb {R}^{p\times d}\) is the analysis dictionary corresponding to the ith source. Optimization problem (3) can be written in its Lagrangian form

$$\begin{aligned} \min _{\varvec{\varOmega }_{i},\varvec{s}_{i}}\Vert \varvec{s}_{i}-\varvec{\varOmega }_{i} \varvec{x}_{i}\Vert _{\mathrm{F}}^2+\mu \Vert \varvec{s}_{i}\Vert _{0} \end{aligned}$$
(4)

Note that formula (4) describes the analysis model, which differs from the synthesis-model expression in BMMCA [22]. In addition, the \(l_{0}\)-norm is directly used as the measure of sparsity, which differs from the method proposed by Fang [29]. Overall, the BSS problem based on the analysis sparse model can be described as the joint optimization problem

$$\begin{aligned} \min _{\varvec{A},\varvec{\varOmega }_{i},\varvec{x}_{i},\varvec{s}_{i}}\lambda \Vert \varvec{Y}-\varvec{\mathrm{AX}}\Vert _{\mathrm{F}}^2 +\sum _{i}\mu \Vert \varvec{s}_{i}\Vert _{0}+\sum _{i}\Vert \varvec{s}_{i}-\varvec{\varOmega }_{i}\varvec{x}_{i}\Vert _{\mathrm{F}}^2 \end{aligned}$$
(5)

where the scalars \(\lambda\) and \(\mu\) control the noise power and the sparsity degree, respectively. To simplify the problem, the dictionary, the sources and the mixing matrix are estimated alternately, so overall optimization problem (5) is divided into the following stages

  • Dictionary learning stage (with \(\varvec{X}\) and \(\varvec{A}\) fixed)

    $$\begin{aligned} \min _{\varvec{\varOmega }_{i},\varvec{s}_{i}} \Vert \varvec{s}_{i}\Vert _{0} \quad s.t.\ \Vert \varvec{s}_{i}-\varvec{\varOmega }_{i} \varvec{x}_{i}\Vert _{\mathrm{F}}^2\le \delta \qquad i=1,\ldots ,n \end{aligned}$$
    (6)
  • Sources estimating stage (with \(\varvec{\varOmega }_{i}\), \(\varvec{s}_{i}\), and \(\varvec{A}\) fixed)

    $$\begin{aligned} \min _{\varvec{X}} \lambda \Vert \varvec{Y}-\varvec{\mathrm{AX}}\Vert _{\mathrm{F}}^2+\sum _{i}\Vert \varvec{s}_{i}-\varvec{\varOmega }_{i} \varvec{x}_{i}\Vert _{\mathrm{F}}^2 \end{aligned}$$
    (7)
  • Mixing matrix estimating stage (with \(\varvec{X}\), \(\varvec{\varOmega }_{i}\), and \(\varvec{s}_{i}\) fixed)

    $$\begin{aligned} \min _{\varvec{A}}\Vert \varvec{Y}-\varvec{\mathrm{AX}}\Vert _{\mathrm{F}}^2 \end{aligned}$$
    (8)

In the overall separation process, the above three optimization subproblems should be addressed iteratively until convergence. The specific details of the proposed algorithm, coined Analysis BSS (ABSS), are described in Sect. 3.

3 ABSS algorithm

Before presenting the ABSS algorithm, we introduce some conventions on the variables in the model, which make the proposed method easier to apply, for example to large image separation problems. Here, the jth \(\sqrt{N}\times \sqrt{N}\) observation image \(\mathcal {Y}_{j}\) is represented as the vector \(\varvec{y}_{j}\in \mathbb {R}^{N\times 1}\), and the ith \(\sqrt{N}\times \sqrt{N}\) source image \(\mathcal {X}_{i}\) is likewise vectorized as \(\varvec{x}_{i}\in \mathbb {R}^{N\times 1}\). Then, we divide each \(\varvec{x}_{i}\) and each \(\varvec{y}_{j}\) into M patches of size \(d\times 1\), which may overlap, where \(\varvec{x}_{i,f}\) is the fth patch of \(\varvec{x}_{i}\) and \(\varvec{y}_{j,f}\) is the fth patch of \(\varvec{y}_{j}\). The observation matrix \(\varvec{y}_{:,f}=[\varvec{y}_{1,f},\varvec{y}_{2,f},\ldots ,\varvec{y}_{m,f}]^{T}\in \mathbb {R}^{m\times d}\), \(f=1,\ldots ,M\), can be modeled as

$$\begin{aligned} \varvec{y}_{:,f}=\varvec{\mathrm{Ax}}_{:,f}+\varvec{v} \end{aligned}$$
(9)

where \(\varvec{x}_{:,f}=[\varvec{x}_{1,f},\varvec{x}_{2,f},\ldots ,\varvec{x}_{n,f}]^{T}\in \mathbb {R}^{n\times d}\) is the source matrix and \(\varvec{v}\) is the additive noise. Thus, the large-scale image separation problem (1) is transformed into M small-scale separation problems (9). Solving overall BSS problem (5) amounts to solving the following problem patch by patch

$$\begin{aligned} \min _{\varvec{A},\varvec{\varOmega }_{i},\varvec{x}_{i,f},\varvec{s}_{i,f}}\lambda \Vert \varvec{y}_{:,f} -\varvec{\mathrm{Ax}}_{:,f}\Vert _{\mathrm{F}}^2&+\sum _{i}\mu \Vert \varvec{s}_{i,f}\Vert _{0}\nonumber \\&+\sum _{i}\Vert \varvec{s}_{i,f}-\varvec{\varOmega }_{i} \varvec{x}_{i,f}\Vert _{\mathrm{F}}^2 \end{aligned}$$
(10)

or equivalently,

$$\begin{aligned} \min _{\varvec{A},\varvec{\varOmega }_{i},\varvec{x}_{i,f},\varvec{s}_{i,f}}\lambda \Vert \varvec{E}_{i,f} -\varvec{a}_{:,i} \, \varvec{x}_{i,f}^{T}\Vert _{\mathrm{F}}^2&+\sum _{i}\mu \Vert \varvec{s}_{i,f}\Vert _{0}\nonumber \\&+\sum _{i}\Vert \varvec{s}_{i,f}-\varvec{\varOmega }_{i} \varvec{x}_{i,f}\Vert _{\mathrm{F}}^2 \end{aligned}$$
(11)

where \(\varvec{E}_{i,f}\) is the multichannel residual of \(\varvec{x}_{i,f}\), defined as \(\varvec{E}_{i,f}=\varvec{y}_{:,f}-\sum _{r=1,r\ne i}^{n}\varvec{a}_{:,r}\,\varvec{x}_{r,f}^{T}\), and \(\varvec{a}_{:,i}\) denotes the ith column of \(\varvec{A}\). Because the above optimization subproblems are invariant to scaling [24], we require \(\varvec{A}\) to be column-normalized and \(\varvec{\varOmega }_{i}\) to be row-normalized, that is

$$\begin{aligned} \Vert \varvec{a}_{:,i}\Vert =1, \qquad i=1,\ldots ,n \end{aligned}$$
(12)

and

$$\begin{aligned} \Vert \varvec{w}_{j,i}\Vert _{2}=1, \qquad j=1,\ldots ,p \end{aligned}$$
(13)

where \(\varvec{w}_{j,i}\) is the jth row of \(\varvec{\varOmega }_{i}\). To minimize problem (11), we propose an alternating algorithm that keeps all but one unknown fixed at a time. The estimation of the analysis dictionary, the sources and the mixing matrix is discussed in the following three subsections, and the procedure of the proposed algorithm is given at the end of this section.
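
The patch conventions above can be made concrete with the following NumPy sketch; the helper names are our own, and a half-patch step reproduces the 50% overlap used in Sect. 4.

```python
import numpy as np

def extract_patches(img, patch=8, step=4):
    """Slice a square image into vectorized, possibly overlapping patches.
    Returns an (M, d) array with d = patch*patch; step = patch//2 gives the
    50% overlap used in the experiments of Sect. 4."""
    H, W = img.shape
    rows = []
    for r in range(0, H - patch + 1, step):
        for c in range(0, W - patch + 1, step):
            rows.append(img[r:r + patch, c:c + patch].reshape(-1))
    return np.stack(rows)

def assemble_patches(patches, shape, patch=8, step=4):
    """Rebuild an image from vectorized patches, averaging overlapping regions
    (the averaging step mentioned at the end of Sect. 3.2)."""
    out = np.zeros(shape)
    weight = np.zeros(shape)
    idx = 0
    for r in range(0, shape[0] - patch + 1, step):
        for c in range(0, shape[1] - patch + 1, step):
            out[r:r + patch, c:c + patch] += patches[idx].reshape(patch, patch)
            weight[r:r + patch, c:c + patch] += 1.0
            idx += 1
    return out / np.maximum(weight, 1.0)
```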

3.1 Dictionary learning stage

During the dictionary learning stage, we fix \(\{\varvec{x}_{i,f}\}_{f=1}^{M}\) and \(\varvec{A}\) to estimate the dictionary \(\varvec{\varOmega }_{i}\), so problem (6) becomes

$$\begin{aligned} \{\hat{\varvec{\varOmega }}_{i},\hat{\varvec{s}}_{i,f}\}=\min _{\varvec{\varOmega }_{i},\varvec{s}_{i,f}} \sum _{f}\Vert \varvec{s}_{i,f}\Vert _{0} \quad s.t.\Vert \varvec{s}_{i,f}-\varvec{\varOmega }_{i} \varvec{x}_{i,f}\Vert _{\mathrm{F}}^2\le \delta \end{aligned}$$
(14)

Note that for each image \(\varvec{x}_{i}\) we need to learn a corresponding dictionary \(\hat{\varvec{\varOmega }}_{i}\). Thus, we adopt Analysis KSVD for the above problem, which jointly learns the \(\ell _{0}\) analysis dictionary \(\varvec{\varOmega }_{i}\) and recovers the correct noiseless signals \(\{{\tilde{\varvec{x}}}_{i,f}\}_{f=1}^{M}\) from their noisy versions \(\{\varvec{x}_{i,f}\}_{f=1}^{M}\). Note that \(\varvec{x}_{i,f}={\tilde{\varvec{x}}}_{i,f}+\varvec{v}_{i,f}\), where \(\varvec{v}_{i,f}\) is a zero-mean white Gaussian additive noise vector, and the analysis coefficients can be described as \(\varvec{s}_{i,f}=\varvec{\varOmega }_{i}{\tilde{\varvec{x}}}_{i,f}\). Then formula (14) can be transformed into the objective function of Analysis KSVD

$$\begin{aligned} \{\hat{\varvec{\varOmega }}_{i},\hat{\tilde{\varvec{x}}}_{i,f}\}=\min _{\varvec{\varOmega }_{i}, {\tilde{\varvec{x}}}_{i,f}} \sum _{f}\Vert \varvec{\varOmega }_{i}{\tilde{\varvec{x}}}_{i,f}\Vert _{0} \quad s.t.\ \Vert {\tilde{\varvec{x}}}_{i,f}-\varvec{x}_{i,f}\Vert _{\mathrm{F}}^2\le \delta \ \end{aligned}$$
(15)

or

$$\begin{aligned}&\{\hat{\varvec{\varOmega }}_{i},\hat{\tilde{\varvec{x}}}_{i,f},\hat{\Lambda }_{i,f}\} =\min _{\varvec{\varOmega }_{i},{\tilde{\varvec{x}}}_{i,f},\Lambda _{i,f}} \sum _{f} \Vert {\tilde{\varvec{x}}}_{i,f}-\varvec{x}_{i,f}\Vert _{\mathrm{F}}^2 \nonumber \\&\qquad s.t.\ {\varvec{\varOmega }_{i}}_{\Lambda _{i,f}}{\tilde{\varvec{x}}}_{i,f}=0, \ \mathrm{Rank}({\varvec{\varOmega }_{i}}_{\Lambda _{i,f}})=d-r \ \end{aligned}$$
(16)

since we assume that the signal \({\tilde{\varvec{x}}}_{i,f}\) resides in an r-dimensional subspace, which implies that \(d-r\) rows of \(\varvec{\varOmega }_{i}\) are orthogonal to it. These rows define the co-support \(\Lambda _{i,f}\) of the signal \({\tilde{\varvec{x}}}_{i,f}\), and \({\varvec{\varOmega }_{i}}_{\Lambda _{i,f}}\) is the submatrix of \(\varvec{\varOmega }_{i}\) containing only the rows indexed by \(\Lambda _{i,f}\). In problem (16), we constrain the solution to reside in an r-dimensional subspace by fixing the rank of the co-support submatrix to \(d-r\), whereas in problem (15), we require the solution to be \(\delta\)-close to the given noisy signal, with the error tolerance derived from the noise power. Given the correct correspondence between r and \(\delta\), problems (15) and (16) can be considered equivalent. Note that in the proposed method, Analysis KSVD uses a two-phase block-coordinate-relaxation approach to solve problem (16). The first phase, called the sparse-coding stage, uses a pursuit method, such as the backward greedy (BG) algorithm or the optimized backward greedy (OBG) algorithm, to estimate \(\{ {\tilde{\varvec{x}}}_{i,f}\}_{1}^{M}\) with \(\varvec{\varOmega }_{i}\) fixed. This is a greedy algorithm that selects rows from \(\varvec{\varOmega }_{i}\) one by one to solve the \(l_{0}\)-norm problem, which can be described as follows

$$\begin{aligned}&\{\hat{\tilde{\varvec{x}}}_{i,f},\hat{\Lambda }_{i,f}\}=\min _{{\tilde{\varvec{x}}}_{i,f}, \Lambda _{i,f}} \Vert {\tilde{\varvec{x}}}_{i,f}-\varvec{x}_{i,f}\Vert _{F}^2 \nonumber \\&\qquad s.t.\ {\varvec{\varOmega }_{i}}_{\Lambda _{i,f}}{\tilde{\varvec{x}}}_{i,f}=0, \ \mathrm{Rank}({\varvec{\varOmega }_{i}}_{\Lambda _{i,f}})=d-r \ \end{aligned}$$
(17)
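
A minimal sketch of such a backward greedy pass is given below; it is a simplified variant of the BG pursuit (the exact selection rule of [30] differs slightly), with names chosen for illustration.

```python
import numpy as np

def backward_greedy(x, Omega, corank):
    """Simplified sketch of the sparse-coding step (17): rows of Omega are added
    to the co-support one by one (the row currently most orthogonal to the
    estimate), and the noisy patch x is projected onto the null space of the
    selected rows.  `corank` plays the role of d - r."""
    x_hat = x.copy()
    cosupport = []
    for _ in range(corank):
        scores = np.abs(Omega @ x_hat)
        scores[cosupport] = np.inf                 # do not reselect chosen rows
        cosupport.append(int(np.argmin(scores)))
        O = Omega[cosupport]                       # current co-support rows
        # x_hat = (I - O^T (O O^T)^{-1} O) x, the projection onto null(O)
        x_hat = x - O.T @ np.linalg.lstsq(O @ O.T, O @ x, rcond=None)[0]
    return x_hat, np.array(cosupport)
```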

Once \(\{\hat{\tilde{\varvec{x}}}_{i,f}\}_{1}^{M}\) is computed, we turn to updating \(\varvec{\varOmega }_{i}\) in the second phase, called the atom update step. The optimization is carried out sequentially for each of the rows \(\varvec{w}_{j,i}\) of \(\varvec{\varOmega }_{i}\). The update of \(\varvec{w}_{j,i}\) should depend only on those signals in \(\{ {\tilde{\varvec{x}}}_{i,f}\}_{1}^{M}\) that are orthogonal to it, and not on the remaining signals. Therefore, we denote by \({\tilde{\varvec{X}}}_{J,i}\) the matrix whose columns are the signals in \(\{ {\tilde{\varvec{x}}}_{i,f}\}_{1}^{M}\) that are orthogonal to \(\varvec{w}_{j,i}\), and by \(\varvec{X}_{J,i}\) the corresponding columns of \(\{\varvec{x}_{i,f}\}_{1}^{M}\). The update step for \(\varvec{w}_{j,i}\) can then be written as

$$\begin{aligned} \{\hat{\varvec{w}}_{j,i},\hat{\tilde{\varvec{X}}}_{J,i}\}=\min _{\varvec{w}_{j,i}, {\tilde{\varvec{X}}}_{J,i}}&\Vert {\tilde{\varvec{X}}}_{J,i}-\varvec{X}_{J,i}\Vert _{F}^{2}\nonumber \\ s.t.&\ {\varvec{\varOmega }_{i}}_{\Lambda _{i,f}}{\tilde{\varvec{x}}}_{i,f}=0, \ \forall f\in J\nonumber \\&\mathrm{Rank}({\varvec{\varOmega }_{i}}_{\Lambda _{i,f}})=d-r, \ \forall f\in J\nonumber \\&\Vert \varvec{w}_{j,i}\Vert _{2}=1 \end{aligned}$$
(18)

Unfortunately, solving problem (18) is in general a difficult task. The work in [30] proposed the following approximation to the above optimization goal

$$\begin{aligned} \hat{\varvec{w}}_{j,i}=\min _{\varvec{w}_{j,i}} \Vert \varvec{w}_{j,i}\varvec{X}_{J,i}\Vert _{F}^{2} \quad s.t.\ \Vert \varvec{w}_{j,i}\Vert _{2}=1 \end{aligned}$$
(19)

For this optimization problem, \(\varvec{w}_{j,i}\) is updated as the eigenvector associated with the smallest eigenvalue of \(\varvec{X}_{J,i}\varvec{X}_{J,i}^{T}\). The process is repeated until a fixed number of iterations is reached, and the analysis coefficients of \({\tilde{\varvec{x}}}_{i,f}\) are obtained as \(\hat{\varvec{s}}_{i,f}=\hat{\varvec{\varOmega }}_{i}\hat{\tilde{\varvec{x}}}_{i,f}\).
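
A minimal sketch of this atom update, assuming the columns of \(\varvec{X}_{J,i}\) have already been collected, is:

```python
import numpy as np

def update_row(X_J):
    """Sketch of the atom update (19): the new row w_{j,i} is the unit-norm
    eigenvector of X_J @ X_J.T associated with its smallest eigenvalue, where
    the columns of X_J (shape d x |J|) are the noisy patches whose estimates
    are orthogonal to the current row."""
    eigvals, eigvecs = np.linalg.eigh(X_J @ X_J.T)   # eigenvalues in ascending order
    w = eigvecs[:, 0]                                # smallest-eigenvalue eigenvector
    return w / np.linalg.norm(w)                     # row normalization (13)
```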

3.2 Sources estimating stage

In this stage, we estimate the source patch \(\varvec{x}_{i,f}\) by solving the following convex minimization problem

$$\begin{aligned} \{\hat{\varvec{x}}_{i,f}\}=\min _{\varvec{x}_{i,f}}\lambda \Vert \varvec{E}_{i,f}-\varvec{a}_{:,i}\, \varvec{x}_{i,f}^{T}\Vert _{\mathrm{F}}^2+\Vert \varvec{s}_{i,f}-\varvec{\varOmega }_{i}\varvec{x}_{i,f}\Vert _{\mathrm{F}}^2 \end{aligned}$$
(20)

where \(\varvec{\varOmega }_{i}\) and \(\varvec{s}_{i,f}\) were already obtained in the previous step, and the other sources in \(\varvec{x}_{:,f}\) are assumed to be given (\(\varvec{E}_{i,f}\) fixed). Then the solution can be obtained by setting the gradient of subproblem (20) to zero

$$\begin{aligned} 0&=\lambda (\varvec{x}_{i,f}\,\varvec{a}_{:,i}^{T}-\varvec{E}_{i,f}^{T})\varvec{a}_{:,i} +\varvec{\varOmega }_{i}^{T}(\varvec{\varOmega }_{i}\varvec{x}_{i,f}-\varvec{s}_{i,f})\nonumber \\&=\lambda \varvec{x}_{i,f}+\varvec{\varOmega }_{i}^{T}\varvec{\varOmega }_{i}\varvec{x}_{i,f}-\lambda \varvec{E}_{i,f}^{T}\,\varvec{a}_{:,i}-\varvec{\varOmega }_{i}^{T}\varvec{s}_{i,f} \end{aligned}$$
(21)

leading to

$$\begin{aligned} \hat{\varvec{x}}_{i,f}=(\lambda \varvec{I}+\varvec{\varOmega }_{i}^{T}\varvec{\varOmega }_{i})^{-1}(\lambda \varvec{E}_{i,f}^{T}\,\varvec{a}_{:,i}+\varvec{\varOmega }_{i}^{T}\varvec{s}_{i,f}) \end{aligned}$$
(22)

where \(\varvec{I}\) is the identity matrix. With the help of the estimated patches \(\{\hat{\varvec{x}}_{i,f}\}_{1}^{M}\), the source \(\varvec{x}_{i}\) can be obtained by averaging over the overlapping regions, and then the source matrix \(\varvec{X}\) can be recovered.
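
A minimal sketch of the closed-form update (22), with argument names chosen for illustration, is:

```python
import numpy as np

def update_source_patch(E_if, a_i, Omega_i, s_if, lam):
    """Closed-form update (22):
    x = (lam*I + Omega^T Omega)^{-1} (lam * E^T a + Omega^T s),
    where E_if is the m x d multichannel residual, a_i the (unit-norm) i-th
    column of A, Omega_i the p x d analysis dictionary and s_if the analysis
    coefficients of the patch."""
    d = Omega_i.shape[1]
    lhs = lam * np.eye(d) + Omega_i.T @ Omega_i
    rhs = lam * (E_if.T @ a_i) + Omega_i.T @ s_if
    return np.linalg.solve(lhs, rhs)                 # avoids an explicit inverse
```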

3.3 Mixing matrix estimating stage

In this stage, all the unknowns except \(\varvec{A}\) are fixed, and problem (5) simplifies to

$$\begin{aligned} \hat{\varvec{a}}_{:,i}=\min _{\varvec{a}_{:,i}}\Vert \varvec{E}_{i}-\varvec{a}_{:,i}\,\varvec{x}_{i}^{T}\Vert _{\mathrm{F}}^{2} \end{aligned}$$
(23)

where the multichannel residual of \(\varvec{x}_{i}\) is defined as \(\varvec{E}_{i}=\varvec{Y}-\sum _{r=1,r\ne i}^{n}\varvec{a}_{:,r}\,\varvec{x}_{r}^{T}\). This minimization subproblem is easily solved by setting its gradient to zero, which yields the estimate of \(\varvec{a}_{:,i}\)

$$\begin{aligned} \varvec{a}_{:,i}=(\varvec{E}_{i}\varvec{x}_{i})(\varvec{x}_{i}^{T}\varvec{x}_{i})^{-1} \end{aligned}$$
(24)

In order to maintain the column norm, the mixing matrix \(\varvec{A}\) needs to be normalized after each update of \(\varvec{a}_{:,i}\).
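
A minimal sketch of this column update, including the normalization, is:

```python
import numpy as np

def update_mixing_column(E_i, x_i):
    """Least-squares update (24), a = (E_i x_i)(x_i^T x_i)^{-1}, followed by the
    column normalization required by (12); E_i is m x N, x_i has length N."""
    a = (E_i @ x_i) / (x_i @ x_i)
    return a / np.linalg.norm(a)
```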

3.4 Complexity analysis

Since the proposed algorithm directly uses the \(l_{0}\)-norm as the measure of sparsity, its computational complexity is necessarily higher than that of Fang's method. The detailed per-iteration complexity of the proposed method is as follows. The complexity of the dictionary learning stage is divided into two parts: in Analysis KSVD, the backward greedy (BG) pursuit used in sparse coding has complexity \(\mathcal {O}(M d^{2} p)\), and the atom update costs \(\mathcal {O}(M d p)\); therefore, line 4 costs \(\mathcal {O}(M d^{2} p+M d p)\). The complexity of updating the representation coefficients (line 6) and computing the residual (line 8) is \(\mathcal {O}(M d p)\) and \(\mathcal {O}((n-1)m N)\), respectively. Updating the sources in line 9 costs \(\mathcal {O}(d^{2} p+d^{3})\). Finally, the mixing matrix estimating stage, including computing the residual (line 11), updating the mixing matrix (line 12) and normalization (line 13), costs \(\mathcal {O}((n-1)m N+m N+2m)\) in total. Since all these calculations are executed for each source (i.e., \(i=1, \ldots , n\)), the overall computational cost per iteration of the algorithm is \(\mathcal {O}(n N d^{2} p+n^{2} m N+n d^{3}+2 m n)\), where \(N=M\times d\).

[Algorithm listing: pseudocode of the proposed ABSS algorithm; the line numbers cited in Sect. 3.4 refer to this listing]

4 Simulations

In the first experiment, we chose two very similar source images (shown in Fig. 2b), mixed using a \(4\times 2\) full-rank, random, column-normalized mixing matrix \(\varvec{A}\), to examine the performance of the ABSS algorithm. Each \(128\times 128\) (\(=N\)) mixture and initialized source was divided into patches of size \(8\times 8\), which avoids overly large dictionaries and provides sufficient training signals for the dictionary learning stage. The overlap percentage of the patches was fixed at 50%; of course, a higher overlap percentage leads to a better recovery result. In addition, three hundred iterations were selected as the stopping criterion. As to the selection of the parameter \(\lambda\), we investigated the effect of different choices of \(\lambda\) on the recovery quality. For this experiment, we calculated the normalized mean square error (NMSE) while varying \(\lambda\). The NMSE in dB measures the distance between the mixing matrix \(\varvec{A}\) and the estimated matrix \(\hat{\varvec{A}}\) and is defined as

$$\begin{aligned} \mathrm{NMSE}=10\log _{10}\left( \frac{\sum _{p,q}(\hat{a}_{p,q}-a_{p,q})^2}{\sum _{p,q}(a_{p,q})^2}\right) \end{aligned}$$
(25)

where \(\hat{a}_{p,q}\) is the (p, q) element of the estimated matrix \(\hat{\varvec{A}}\); the smaller the NMSE, the better the mixing matrix estimation. Figure 1 shows the results for this experiment. It can be seen that \(\lambda =3\) is an optimal choice.
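
For reference, a one-line implementation of (25) (our own helper) is:

```python
import numpy as np

def nmse_db(A_hat, A):
    """NMSE of Eq. (25), in dB, between the estimated and true mixing matrices."""
    return 10 * np.log10(np.sum((A_hat - A) ** 2) / np.sum(A ** 2))
```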

Fig. 1 Normalized mean square error (NMSE) as a function of \(\lambda\)

Then, we applied the following methods for comparison: FastICA [1], Picard [10], BMMCA [22] and the algorithm proposed by Fang [29] (computer OS: Windows 8.1, CPU: Intel(R) Core(TM) i5-5200U @ 2.20 GHz, RAM: 4 GB). The peak signal-to-noise ratio (PSNR) is used to describe the reconstruction performance of the candidate algorithms. PSNR is calculated as

$$\begin{aligned} \mathrm{PSNR}=20\log _{10}(\mathrm{MAX}/\sqrt{\mathrm{MSE}}) \end{aligned}$$
(26)

where MAX is the maximum possible pixel value of the image (e.g., 255 for a uint8 image) and MSE is the mean square error given by \(\mathrm{MSE}=(1/N)\Vert \mathcal {X}-\hat{\mathcal {X}}\Vert _{\mathrm{F}}^{2}\), where \(\mathcal {X}\) and \(\hat{\mathcal {X}}\) are the source image and the recovered image, respectively. The better the quality of the image, the higher the PSNR. Note that in the ABSS algorithm, the number of dictionary atoms is set to 80. Although little can be said theoretically about choosing the optimal dictionary size (or redundancy factor p/d), we investigate it experimentally. Table 1 shows the PSNR values for different numbers of atoms.
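
For reference, a small helper implementing (26) under these conventions (our own illustration) is:

```python
import numpy as np

def psnr_db(x_src, x_rec, max_val=255.0):
    """PSNR of Eq. (26); max_val is the peak pixel value (255 for uint8 images)
    and MSE is averaged over all N pixels."""
    mse = np.mean((np.asarray(x_src, float) - np.asarray(x_rec, float)) ** 2)
    return 20 * np.log10(max_val / np.sqrt(mse))
```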

Table 1 PSNRs (dB) and computation times (in seconds) of reconstructed source images

As shown in Table 1, the separation effect is better when the number of atoms is 80 or 90, but the computation time for 90 atoms is greater than that for 80 atoms. Considering both factors, we set the number of atoms to 80. Figure 2 shows the reconstructed source images of the five test algorithms, and Table 2 presents the average PSNR results.

Fig. 2 a Mixtures (the first row). b Source images (the first column). c–g Separation results by ABSS, BMMCA, Fang, FastICA and Picard (the second to sixth columns), respectively

Table 2 PSNRs (dB) of the five test algorithms

We observe that the proposed algorithm achieves the best performance. BMMCA and Fang give similar results, and the results of FastICA and Picard are not as good as those of the other three algorithms when processing very similar source images. As another measure of the performance of the proposed method, the reconstruction error as a function of the number of iterations is shown in Fig. 3, which shows a monotonic decrease in the separation error. In addition, the error decreases relatively rapidly, reaching a very small value around the 20th iteration and becoming almost zero around the 300th iteration. Therefore, the proposed algorithm converges well.

Fig. 3 MSE as a function of the number of iterations; note that the elements of the image sources have amplitudes in the range [0, 1]

Next, in order to make the experimental results more convincing, we carried out another experiment to separate four sources from six noiseless mixtures. The four selected source images have very different morphologies, including noiselike and bricklike textures. As found from Fig. 4, the parameter \(\lambda\) should be set to 3. The rest of the parameters were the same as in the first experiment. Figure 5 and Table 3 show the separation results of all the methods mentioned above and the corresponding MSEs. As shown in Fig. 5c, the proposed method successfully recovers the image sources, and the achieved MSEs, given in Table 3, are the lowest for the proposed method for all image sources except the cartoon boy. The other methods do not perform as well. For example, as shown in Fig. 5d, although BMMCA perfectly recovers the cartoon boy with the lowest MSE among all the methods, it has a problem recovering the bricklike texture. In addition, Fang, FastICA and Picard all have some difficulty with certain images: they separate all the image sources but introduce interference from other sources.

Fig. 4 Normalized mean square error (NMSE) as a function of \(\lambda\)

Fig. 5 a Mixtures (the first row). b Source images (the first column). c–g Separation results by ABSS, BMMCA, Fang, FastICA and Picard (the second to sixth columns), respectively

Table 3 MSEs of the five test algorithms

In the last simulation, we investigated the performance of all the methods at different noise levels. Gaussian noise with standard deviation varying from 5 to 15 (\(\sigma =5, 10, 15\)) was added to mixtures generated from another two image sources (Lena and Boat). The noisy mixtures are shown in Fig. 7a. As shown in Fig. 6, the selection of \(\lambda\) depends on the noise standard deviation: at noise levels of 5, 10 and 15, the optimal choices of \(\lambda\) were 1, 5 and 10, respectively. The other experimental settings and procedure were the same as in the previous experiment. Table 4 presents the average PSNR results at the different noise levels, and the separation results at noise level 10 are shown in Fig. 7c–g. We observe that all algorithms successfully separated the noisy mixtures, but the proposed algorithm achieves better separation performance than the others. When the noise level is 5, the separation results of all methods except Picard are similar, but as the noise level increases, FastICA performs noticeably worse and Picard performs better than FastICA. In addition, ABSS also shows a clear improvement in denoising; for example, Lena's facial details are the most legible among the five methods. Overall, the above experimental results indicate that ABSS yields a significant performance improvement in both separation and denoising.

Fig. 6 Normalized mean square error (NMSE) as a function of \(\lambda\)

Fig. 7 a Noisy mixtures when \(\sigma =10\) (the first row). b Source images (the first column). c–g Separation results by ABSS, BMMCA, Fang, FastICA and Picard (the second to sixth columns), respectively

Table 4 PSNRs (dB) of the five test algorithms

5 Conclusions and future work

In this paper, we proposed a novel algorithm, based on an analysis sparse constraint on the sources over an adaptive analysis dictionary, to address the BSS problem in which the number of mixtures is not less than the number of sources. Moreover, the \(l_{0}\)-norm was chosen as the measure of sparsity to make better use of the sparsity constraint. The simulation results on both noisy and noiseless mixtures confirm that the proposed method improves the separation performance significantly.

In fact, the way the dictionaries and sources are updated is completely different from that in [29]. In the analysis dictionary learning stage, we adopt Analysis KSVD [30] to adaptively estimate the analysis dictionary from the mixed sources, whereas Fang directly uses SP-ADL, proposed in [31], an efficient algorithm for obtaining the analysis dictionary from a given data set in a KSVD-like manner. In the sources estimating stage, since the subproblem we need to solve is convex, the sources can be uniquely obtained by a simple least squares linear regression instead of the split Bregman iteration (SBI) algorithm introduced in [32], which is used directly in [29] and needs five iterations to converge to an approximate solution. Thus, the convergence rate of the proposed algorithm is faster than that of Fang's method. Finally, the simulation results on both noisy and noiseless mixtures confirm that using the \(l_{0}\)-norm enhances the separability of the sources and that the proposed method improves the separation performance significantly.

However, because the backward greedy algorithm is used in the dictionary learning stage, the proposed algorithm is inevitably slowed down, which is also why the computation times in Table 1 are so high. Further work is required to speed up the algorithm and make it suitable for large-scale problems, for example by using more efficient dictionary learning algorithms [33, 34] for the BSS problem. In addition, this work only focuses on the general over-determined BSS problem and tries to find a new angle from which to deal with it. It would be desirable to extend the method to the underdetermined BSS problem, in which the number of mixtures is less than the number of sources; this problem is harder to deal with than the one addressed in this work.