1 Introduction

Blind Source Separation (BSS) aims to find a set of N unknown signals, called sources and denoted by \(s_j(n)\), knowing only a set of M mixtures of these sources, called observations and denoted by \(x_i(n)\). This discipline is receiving increasing attention thanks to the diversity of its fields of application, which include audio, biomedical, seismic and telecommunication signals. In this paper, we are interested in so-called linear convolutive (LC) mixtures, for which each mixture \(x_i(n)\) is expressed in terms of the sources \(s_j(n)\) and their delayed versions as follows:

$$\begin{aligned} x_i(n) = \sum _{j=1}^{N}\sum _{q=0}^{Q}{ h_{ij}(q)\cdot s_j (n-q)}=\sum _{j=1}^{N}{ h_{ij}(n)* s_j(n)}, \qquad i\in [1,M], \end{aligned}$$
(1)

where:

  • \( h_ {ij}(q) \) represents the impulse response coefficients of the mixing filter linking the source of index j to the sensor of index i,

  • Q is the order of the longest filter,

  • the symbol “\(*\)” denotes the linear convolution operator.
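
As an illustration of Eq. (1), the following minimal NumPy sketch builds M = 2 convolutive mixtures of N = 2 sources. The function name `convolutive_mix`, the random test signals and the three-tap filters are hypothetical choices made only for this example; each mixture is truncated to the length of the sources.

```python
import numpy as np

def convolutive_mix(sources, filters):
    """Apply Eq. (1): x_i(n) = sum_j (h_ij * s_j)(n).

    sources: list of N one-dimensional arrays of equal length.
    filters: filters[i][j] is the impulse response h_ij (1-D array).
    Returns an array of shape (M, n_samples).
    """
    n_samples = len(sources[0])
    mixtures = []
    for row in filters:                       # one row of filters per sensor i
        x_i = np.zeros(n_samples)
        for h_ij, s_j in zip(row, sources):   # sum over the sources j
            x_i += np.convolve(h_ij, s_j)[:n_samples]
        mixtures.append(x_i)
    return np.array(mixtures)

# Toy data (hypothetical): two white-noise sources and filters of order Q = 2.
rng = np.random.default_rng(0)
s = [rng.standard_normal(1000) for _ in range(2)]
h = [[np.array([1.0, 0.5, 0.25]), np.array([0.6, 0.0, 0.3])],
     [np.array([0.7, 0.2, 0.1]), np.array([1.0, -0.4, 0.2])]]
x = convolutive_mix(s, h)
```

Setting every filter to a single coefficient reduces this sketch to the linear instantaneous case \(Q=0\).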

The case of LC mixtures remains of interest in BSS because the performance of existing methods is still modest compared to the particular case of linear instantaneous mixtures, for which \(Q=0\). BSS methods for LC mixtures can be classified into two main families: the so-called temporal methods, which process the mixtures in the time domain, and the so-called frequency methods, which process them in the time-frequency (TF) domain. The former generally perform modestly and rely on more restrictive assumptions than the latter. Indeed, based mostly on the independence of source signals, the most efficient temporal methods are comparable to frequency methods only for very short filters (i.e. low Q), and generally require over-determined mixtures (i.e. \(M>N\)) [12, 16]. Based mostly on the sparsity of source signals in the TF domain, the frequency methods have shown good performance in the determined case (i.e. \(M=N\)) and even in the under-determined case (i.e. \(M < N\)), even as the filter length increases [4, 8, 9, 13,14,15]. These frequency methods start by transposing Eq. (1) into the TF domain using the short-time Fourier transform (STFT) as follows:

$$\begin{aligned} X_i(m,k)=\sum _{j=1}^{N}{ H_{ij}(k)\cdot S_j(m,k)}, \quad m\in [0,T-1], \quad k\in [0,K-1], \end{aligned}$$
(2)

where:

  • \(X_i(m,k)\) and \(S_j(m,k)\) are the STFT representations of \(x_i(n)\) and \(s_j(n)\) respectively,

  • K and T are, respectively, the length of the analysis window and the number of time windows used by the STFT,

  • \(H_{ij}(k)\) is the Discrete Fourier Transform of \(h_{ij}(n)\) calculated on K points.
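
The multiplicative form of Eq. (2) rests on the convolution property of the Discrete Fourier Transform. The sketch below checks this property on a single frame; it is only an illustration, with hypothetical sizes (K = 64, a filter of order Q = 7) chosen so that the linear convolution fits entirely within K samples. In an actual STFT the frames are windowed and overlap, so Eq. (2) holds only approximately, and the approximation is generally better when the filters are short relative to K.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 64                                     # DFT length (hypothetical small value)
h = rng.standard_normal(8)                 # impulse response of order Q = 7
s = rng.standard_normal(K - len(h) + 1)    # frame short enough for h * s to fit in K samples

# DFT of the linear convolution versus product of the K-point DFTs.
lhs = np.fft.fft(np.convolve(h, s), K)
rhs = np.fft.fft(h, K) * np.fft.fft(s, K)
max_error = np.max(np.abs(lhs - rhs))      # expected to be at round-off level
```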

Among the most efficient and more recent frequency methods, we can mention those based on TF masking [2, 4, 6,7,8,9, 13,14,15]. These methods often exploit sparsity by assuming that the source signals are W-disjoint orthogonal, i.e. that they do not overlap in the TF domain. Their principle is to estimate a separation mask, denoted by \(M_j (m, k)\) and specific to each source \(S_j (m, k)\), which groups the TF points where only this source is present. Applying the estimated mask \( M_j (m,k) \) to one of the frequency observations \( X_i(m, k) \) keeps only the TF points of the latter that belong to the source \( S_j (m, k) \), and thus separates it from the rest of the mixture. Depending on the procedure used to estimate the masks, we distinguish between two types of BSS methods based on TF masking: the so-called full-band methods [2, 4, 6, 9], for which the masks are estimated as a whole using a clustering algorithm that processes all frequency bins simultaneously, and the so-called bin-wise methods [7, 8, 13,14,15], for which the masks are estimated using a clustering algorithm that processes only one frequency bin at a time.
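
The masking principle can be illustrated on a toy example. The sketch below is not taken from any of the cited methods: it assumes two sources that are exactly W-disjoint orthogonal (a hypothetical checkerboard support) and uses an oracle binary mask, whereas the cited methods estimate the mask by clustering.

```python
import numpy as np

T, K = 6, 4                                        # hypothetical numbers of frames and bins
rng = np.random.default_rng(2)

# Checkerboard support: source 1 is active where (m + k) is even, source 2 elsewhere.
active1 = np.add.outer(np.arange(T), np.arange(K)) % 2 == 0
S1 = (rng.standard_normal((T, K)) + 1j * rng.standard_normal((T, K))) * active1
S2 = (rng.standard_normal((T, K)) + 1j * rng.standard_normal((T, K))) * ~active1

# One frequency response per bin for each source (hypothetical values).
H11 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
H12 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
X1 = H11 * S1 + H12 * S2                           # first observation, Eq. (2)

M1 = active1.astype(float)                         # oracle binary mask for source 1
Y1 = M1 * X1                                       # keeps only the TF points of source 1
```

With disjoint supports, `Y1` coincides with the contribution `H11 * S1` of the first source. When the supports overlap, a binary mask either removes part of the target or keeps part of the interference.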

Among the most popular full-band methods we can cite those proposed in [4, 9] which are based on the clustering of the level ratios and phase differences between the frequency observations \( X_i (m, k) \) to estimate the separation masks. However, this clustering is not always reliable, especially when the order Q of the mixing filters increases [4]. Moreover, when the maximum distance between the sensors is greater than half the wavelength of the maximum frequency of source signals involved, a problem called spatial aliasing is inevitable [4]. The bin-wise methods [7, 8, 13,14,15] are robust to these two problems. However, these methods require the introduction of an additional step to solve a permutation problem in the estimated masks, when we pass from one frequency bin to another, which is a classical problem that is common to all bin-wise BSS methods.
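
For orientation, the half-wavelength condition mentioned above can be expressed as a frequency limit, \(f = c/(2D)\). The short sketch below evaluates it for the two sensor distances used in Sect. 3; the speed of sound (343 m/s) and the function name are assumptions made for this example and are not given in the text.

```python
SPEED_OF_SOUND_M_S = 343.0   # assumed value for air at room temperature

def aliasing_free_limit_hz(sensor_distance_m, c=SPEED_OF_SOUND_M_S):
    """Highest frequency whose half wavelength is still at least the sensor distance."""
    return c / (2.0 * sensor_distance_m)

for d in (0.3, 1.0):
    print(f"D = {d} m: no spatial aliasing below about {aliasing_free_limit_hz(d):.1f} Hz")
```

Under this assumption the limit is roughly 572 Hz for D = 0.3 m and 171.5 Hz for D = 1 m, far below the 8 kHz bandwidth implied by a 16 kHz sampling rate.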

However, all of these BSS methods based on TF masking (full-band and bin-wise) suffer from artifacts that degrade the quality of the separated signals, because the W-disjoint orthogonality assumption is not perfectly verified in practice. Introduced by the TF masking operation itself, these artifacts become more troublesome as the spectral overlap of the source signals in the TF domain increases. In [11] the authors proposed a first solution to this problem, which consists of a cepstral smoothing of the spectral masks before applying them to the frequency observations \(X_i(m,k)\). An interesting extension of this technique, proposed in [3], consists in applying cepstral smoothing not to the spectral masks but rather to the separated signals, i.e. after applying the separation masks. Since these two techniques [3, 11] had only been validated on a few full-band methods, we recently proposed in [5] to evaluate their effectiveness using a few popular bin-wise methods. However, these two solutions could only reduce one particular type of artifact, called musical noise [3, 5, 11]. In the same spirit, in order to avoid the artifacts caused by the TF masking operation, we propose in this paper a new BSS method which also exploits the sparsity of source signals in the TF domain for determined LC mixtures. Indeed, by focusing on the case of determined mixtures, we show that we can avoid TF masking and also relax the W-disjoint orthogonality assumption. Note that the case of determined mixtures was also addressed in [1], but under a very restrictive assumption, namely that each source signal is silent for at least one whole time frame. Thus, our new method makes it possible to carry out the separation while avoiding the artifacts introduced by TF masking, with sparsity assumptions much less restrictive than those of existing methods.

We begin in Sect. 2 by describing our method. In Sect. 3 we present various experimental results that compare the performance of our method with that of existing methods. Finally, Sect. 4 gives a conclusion and the perspectives of our work.

2 Proposed Method

The sole sparsity assumption of our method is the following.

Assumption: For each source \( s_j (n) \) and for each frequency bin k, there is at least one TF point \((m,k)\) where it is present alone, i.e.:

$$ \begin{aligned} \forall \,j, \ \forall \, k, \ \exists \ m \ /\ S_j(m,k)\ne 0 \ \& \ S_i(m,k)=0, \ \forall \,i\ne j \end{aligned}$$
(3)

Thus, if we denote by \(E_j\) the set of TF points \((m,k)\) that verify assumption (3) for the source of index j, called single-source points, then relation (2) gives us:

$$\begin{aligned} X_i(m,k) =H_{ij}(k) \cdot S_j(m,k), \quad \, \forall (m,k) \in E_j. \end{aligned}$$
(4)

Our method proceeds in two steps. The first step, which exploits the probabilistic masks used by Sawada et al. in [14, 15], consists in identifying, for each source of index “j” and each frequency bin “k”, the index “\(m_{jk}\)” such that the TF point \((m_{jk},k)\) best verifies Eq. (4), and then in estimating the separating filters, denoted by \( F_ {ij}(k) \) and defined by:

$$\begin{aligned} F_{ij}(k)=\frac{X_i(m_{jk},k)}{X_1(m_{jk},k)} =\frac{H_{ij}(k)\cdot S_j(m_{jk},k)}{H_{1j}(k)\cdot S_j(m_{jk},k)} =\frac{H_{ij}(k)}{H_{1j}(k)}, \quad i \in [2,M] \end{aligned}$$
(5)

The second step consists in recombining the mixtures \( X_i (m,k) \) using the separating filters \( F_ {ij}(k) \) in order to finally obtain an estimate of the separated sources. The two steps of our method are the subject of Sects. 2.1 and 2.2 respectively.
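
The end of the first step (selection of \(m_{jk}\) followed by the ratio of Eq. (5)) can be sketched for a single frequency bin as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the masks are taken as given, and the toy masks are computed from the true source powers (an oracle that is not available in practice) only to make the example self-contained.

```python
import numpy as np

def estimate_separating_filters(X, posteriors):
    """X: (T, M) complex observations in one bin; posteriors: (T, N) mask values.

    Returns F of shape (M, N), with F[i, j] = X_i(m_jk) / X_1(m_jk),
    and the selected frame indices m_jk (shape (N,)).
    """
    m_star = np.argmax(posteriors, axis=0)    # frame with maximum presence probability
    X_sel = X[m_star]                         # (N, M): row j is the vector X(m_jk)
    F = (X_sel / X_sel[:, :1]).T              # divide by the first sensor; row 0 is all ones
    return F, m_star

# Toy bin (hypothetical values): frame 3 contains only source 1, frame 7 only source 2.
T = 10
m = np.arange(T)
S = np.stack([(m + 1) * np.exp(0.3j * m), (T - m) * np.exp(-0.7j * m)], axis=1)
S[3, 1] = 0.0
S[7, 0] = 0.0
H = np.array([[1.0 + 0.0j, 0.8 + 0.3j],
              [0.5 - 0.4j, 1.0 + 0.0j]])       # H[i, j] = H_ij(k)
X = S @ H.T                                    # Eq. (2) in one bin

power = np.abs(S) ** 2
posteriors = power / power.sum(axis=1, keepdims=True)   # oracle masks, for illustration only
F, m_star = estimate_separating_filters(X, posteriors)
```

In this noise-free toy case, `F[1, 0]` equals `H[1, 0] / H[0, 0]` and `F[1, 1]` equals `H[1, 1] / H[0, 1]`, as predicted by Eq. (5).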

2.1 Estimation of the Separating Filters

Since the processing in this first step of our method is carried out independently in each frequency bin, we simplify the notation in this section by omitting the frequency bin index “k”. Using a matrix formulation, Eq. (2) gives us:

$$\begin{aligned} \mathbf{X}(m)=\sum _{j=1}^{N}{} \mathbf{H}_{j}\cdot S_{j}(m), \end{aligned}$$
(6)

where \(\mathbf{X}(m)=[X_1(m,k),...,X_{M}(m,k)]^T\), \(\mathbf{H}_{j}=[H_{1j}(k),...,H_{{M}j}(k)]^T\) and \(S_{j}(m)=S_{j}(m,k)\). During this first step, we proceed as follows:

  1. Each vector \(\mathbf{X}(m)\) is normalized and then whitened as follows:

    $$\begin{aligned} \widetilde{\mathbf{X}}(m) =\frac{\mathbf{X}(m)}{||\mathbf{X}(m)||} \quad \text {and} \quad \mathbf{Z}(m)=\frac{\mathbf{W}\widetilde{\mathbf{X}}(m)}{||\mathbf{W}\widetilde{\mathbf{X}}(m)||}, \end{aligned}$$
    (7)

    where \(\mathbf{W}\) is given by \(\mathbf{W} = \mathbf{D}^{-\frac{1}{2}}{} \mathbf{E}^H\), with \(\mathbb {E}\{\widetilde{\mathbf{X}}(m)\widetilde{\mathbf{X}}^H(m)\}=\mathbf{E}{} \mathbf{D}{} \mathbf{E}^H\).

  2. Each vector \(\mathbf{Z}(m)\) is modeled by a complex Gaussian density function of the form [14]:

    $$\begin{aligned} p(\mathbf{Z}(m)|\mathbf{a}_j,\sigma _j)=\frac{1}{(\pi \sigma _j^2)^{M}}\cdot \exp \left( -\frac{||\mathbf{Z}(m)-(\mathbf{a}_j^H\mathbf{Z}(m)).\mathbf{a}_j||^2}{\sigma _j^2}\right) \end{aligned}$$
    (8)

    where \(\mathbf{a}_j\) and \(\sigma _j^2\) are respectively the centroid (with unit norm \(||\mathbf{a}_j||=1\)) and the variance of each cluster \(C_j\). The density of \(\mathbf{Z}(m)\) can then be described by the following mixture model:

    $$\begin{aligned} p(\mathbf{Z}(m)|\theta )=\sum _{j=1}^{N}\alpha _j\cdot p(\mathbf{Z}(m)|\mathbf{a}_j,\sigma _j), \end{aligned}$$
    (9)

    where \(\alpha _j\) are the mixture ratios and \(\theta =\{\mathbf{a}_1,\sigma _1,\alpha _1,...,\mathbf{a}_N,\sigma _N,\alpha _N\}\) is the parameter set of the mixture model. Then, an iterative expectation-maximization (EM) algorithm is used to estimate the parameter set \( \theta \), as well as the posterior probabilities \( P (C_j | \mathbf{Z} (m),\theta ) \) at each TF point, which are none other than the probabilistic masks used in [14]. In the expectation step, these posterior probabilities are given by:

    $$\begin{aligned} P(C_j|\mathbf{Z}(m),\theta )=\frac{\alpha _jp(\mathbf{Z}(m)|\mathbf{a}_j,\sigma _j)}{ p(\mathbf{Z}(m)|\theta )}. \end{aligned}$$
    (10)

    In the maximization step, the update of centroid \(\mathbf{a}_j\) is given by the eigenvector associated with the largest eigenvalue of the matrix \(\mathbf{R}_j\) defined by:

    $$\begin{aligned} \mathbf{R}_j=\sum _{m=0}^{T-1}P(C_j|\mathbf{Z}(m),\theta )\cdot \left\{ \mathbf{Z}(m)\mathbf{Z}^H(m)\right\} . \end{aligned}$$
    (11)

    The parameters \(\sigma _j^2\) and \(\alpha _j\) are updated respectively via the following relations:

    $$\begin{aligned} \sigma _j^2=\frac{\sum _{m=0}^{T-1}P(C_j|\mathbf{Z}(m),\theta )\cdot ||\mathbf{Z}(m)-(\mathbf{a}_j^H\mathbf{Z}(m)).\mathbf{a}_j||^2}{M\cdot \sum _{m=0}^{T-1}P(C_j|\mathbf{Z}(m),\theta )} \end{aligned}$$
    (12)
    $$\begin{aligned} \alpha _j=\dfrac{1}{T}\sum _{m=0}^{T-1}P(C_j|\mathbf{Z}(m),\theta ). \end{aligned}$$
    (13)

    However, since the EM algorithm used in [14, 15] is sensitive to its initialization, we propose in our method to initialize the masks with those obtained by a modified version of the MENUET method [4]. Specifically, in the clustering step for the estimation of the masks, we replaced the k-means algorithm used in [4] by the fuzzy c-means (FCM) algorithm used in [13], in order to obtain probabilistic masks.

  3. After the convergence of the EM algorithm, the classical permutation problem between the different frequency bins is solved by the algorithm proposed in [15], which is based on the inter-frequency correlation between the time sequences of posterior probabilities \(P(C_j|\mathbf{Z}(m),\theta )\) in each frequency bin k. In the following we denote these posterior probabilities by \(P(C_j|\mathbf{Z}(m,k))\).

  4. Unlike the approach adopted in [14, 15], which uses all the TF points of the estimated probabilistic masks \(P(C_j|\mathbf{Z}(m,k))\), in this step we are only interested in identifying one single-source TF point for each source of index “j” and each frequency bin “k”, that is, a single time frame index, denoted by “\(m_{jk}\)”, which best verifies relation (4). We define this index \(m_{jk}\) as the index “m” for which the presence probability of the corresponding source is maximum:

    $$\begin{aligned} m_{jk}=\mathop {\text {argmax}}\limits _{m} P(C_j|\mathbf{Z}(m,k )),\quad \, m\in \left[ 0,T-1\right] \end{aligned}$$
    (14)
  5. After having identified these “best” single-source TF points \((m_{jk},k)\), we finish this first step of our method by estimating the separating filters \(F_{ij}(k)\) defined in (5) by:

    $$\begin{aligned} F_{ij}(k)=\frac{X_i(m_{jk},k)}{X_1(m_{jk},k)} =\frac{H_{ij}(k)}{H_{1j}(k)}, \,\, i \in [2,M] \end{aligned}$$
    (15)
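
Steps 1 and 2 above can be sketched for a single frequency bin as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `whiten` and `em_masks` are hypothetical, the expectation in Eq. (7) is replaced by a sample average, the maximization step is run first because the algorithm is initialized with masks, and small floors are added for numerical safety. The initial masks of the toy example are a hypothetical stand-in for the modified MENUET initialization, and the permutation alignment of step 3 is omitted.

```python
import numpy as np

def whiten(X):
    """Eq. (7). X: (T, M) complex observations of one bin; returns unit-norm Z of shape (T, M)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    R = (Xn.T @ Xn.conj()) / len(Xn)             # sample estimate of E{X~ X~^H}
    d, E = np.linalg.eigh(R)                     # R = E diag(d) E^H
    W = np.diag(d ** -0.5) @ E.conj().T
    Z = Xn @ W.T                                 # row m is W X~(m)
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def em_masks(Z, posteriors, n_iter=30):
    """EM iterations of Eqs. (8)-(13). posteriors: (T, N) initial masks; returns updated masks."""
    T, M = Z.shape
    N = posteriors.shape[1]
    for _ in range(n_iter):
        log_p = np.empty((T, N))
        for j in range(N):
            # Maximization step for cluster j.
            w = posteriors[:, j]
            R_j = (Z.T * w) @ Z.conj()                        # Eq. (11)
            a_j = np.linalg.eigh(R_j)[1][:, -1]               # principal eigenvector, unit norm
            proj = Z @ a_j.conj()                             # a_j^H Z(m) for every frame
            resid = np.linalg.norm(Z - proj[:, None] * a_j, axis=1) ** 2
            sigma2 = max((w @ resid) / (M * max(w.sum(), 1e-12)), 1e-10)   # Eq. (12), floored
            alpha = max(w.mean(), 1e-12)                      # Eq. (13), floored
            # Logarithm of alpha_j * p(Z(m) | a_j, sigma_j), from Eq. (8).
            log_p[:, j] = np.log(alpha) - M * np.log(np.pi * sigma2) - resid / sigma2
        # Expectation step, Eq. (10), computed in the log domain for stability.
        log_p -= log_p.max(axis=1, keepdims=True)
        posteriors = np.exp(log_p)
        posteriors /= posteriors.sum(axis=1, keepdims=True)
    return posteriors

# Toy bin (hypothetical): every frame contains exactly one of the two sources.
T = 40
labels = np.arange(T) % 2
H = np.array([[1.0 + 0.0j, 0.8 + 0.3j],
              [0.5 - 0.4j, 1.0 + 0.0j]])          # columns are the mixing vectors H_j
S = (1.0 + 0.05 * np.arange(T)) * np.exp(0.37j * np.arange(T))
X = S[:, None] * H[:, labels].T                    # X(m) = H_j S_j(m)

init = np.where(labels[:, None] == np.arange(2), 0.8, 0.2)   # soft initial masks (stand-in)
post = em_masks(whiten(X), init)
```

The resulting array `post` plays the role of the probabilistic masks \(P(C_j|\mathbf{Z}(m),\theta )\) from which the index \(m_{jk}\) of Eq. (14) is selected.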

2.2 Estimation of the Separated Sources

In this section, for the sake of clarity, we provide the mathematical basis of the second step of our method for two LC mixtures of two sources, i.e. for \(M=N=2\). The generalization to the case \(M = N > 2\) follows directly. For \(M=N=2\), the mixing Eq. (1) gives us:

$$\begin{aligned} {\left\{ \begin{array}{ll} x_1(n)= h_{11}(n) * s_1(n)+h_{12}(n) * s_2(n)\\ x_2(n)= h_{21}(n) * s_1(n)+h_{22}(n) * s_2(n) \end{array}\right. } \end{aligned}$$
(16)

As we pass to the TF domain, we get:

$$\begin{aligned} {\left\{ \begin{array}{ll} X_1(m,k)= H_{11}(k) \cdot S_1(m,k)+H_{12}(k) \cdot S_2(m,k)\\ X_2(m,k)= H_{21}(k) \cdot S_1(m,k)+H_{22}(k) \cdot S_2(m,k) \end{array}\right. } \end{aligned}$$
(17)

We use the separating filters \(F_{ij}(k)\), with \(i=2\) and \(j=1, 2\), estimated in the first step to recombine these two mixtures as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} X_2(m,k)-F_{22}(k) \cdot X_1(m,k) =\widetilde{S}_{1}(m,k)\\ X_2(m,k)-F_{21}(k)\cdot X_1(m,k) =\widetilde{S}_{2}(m,k) \end{array}\right. } \end{aligned}$$
(18)

Since, according to Eq. (15), we have \(F_{21}(k)=\frac{H_{21}(k)}{H_{11}(k)}\) and \(F_{22}(k)=\frac{H_{22}(k)}{H_{12}(k)}\), substituting (17) into (18) and simplifying gives:

$$\begin{aligned} {\left\{ \begin{array}{ll} \widetilde{S}_1(m,k) = \frac{H_{21}(k).H_{12}(k)-H_{22}(k).H_{11}(k)}{H_{12}(k)} \cdot S_1(m,k) \\ \widetilde{S}_2(m,k)= \frac{H_{22}(k).H_{11}(k)-H_{21}(k).H_{12}(k)}{H_{11}(k)}\cdot S_2(m,k) \end{array}\right. } \end{aligned}$$
(19)

In order to ultimately obtain the contributions of sources in one of the sensors, we propose to add a post-processing step (as in [1]) which consists in multiplying the signals \(\widetilde{S}_j(m,k)\) by filters, denoted by \(G_j(k)\), as follows:

$$\begin{aligned} G_j(k) \cdot \widetilde{S}_j(m,k) =Y_j(m,k),\qquad j\in \{1,2\}, \end{aligned}$$
(20)
$$\begin{aligned} {}\text {where} \qquad G_1(k)=\frac{1}{F_{21}(k)-F_{22}(k)}\qquad \text {and} \qquad G_2(k)=\frac{1}{F_{22}(k)-F_{21}(k)}. \end{aligned}$$
(21)

After all the simplifications are done, we get:

$$\begin{aligned} {\left\{ \begin{array}{ll} Y_1(m,k) = H_{11}(k)\cdot S_1(m,k)\\ Y_2(m,k) = H_{12}(k)\cdot S_2(m,k) \end{array}\right. } \end{aligned}$$
(22)

By denoting \(y_j(n)\) the inverse STFT of \(Y_j(m,k)\) we get:

$$\begin{aligned} {\left\{ \begin{array}{ll} y_1(n)=h_{11}(n)*s_1(n) \\ y_2(n)= h_{12}(n)*s_2(n) \end{array}\right. } \end{aligned}$$
(23)

These signals are none other than the contributions of source signals \(s_1(n)\) and \(s_2(n)\) on the first sensor (see the expression of the mixture \(x_1(n)\) in (16)).
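
The chain of Eqs. (18) to (22) can be checked numerically in one frequency bin. The sketch below is only an illustration: the function name `recombine`, the random source coefficients and the mixing values are hypothetical, and the separating filters are set to their ideal values from Eq. (15) rather than estimated.

```python
import numpy as np

def recombine(X1, X2, F21, F22):
    """Eqs. (18), (20) and (21) for M = N = 2; inputs may be arrays over the frames m."""
    S1_tilde = X2 - F22 * X1      # cancels the contribution of source 2
    S2_tilde = X2 - F21 * X1      # cancels the contribution of source 1
    G1 = 1.0 / (F21 - F22)        # post-processing filters of Eq. (21)
    G2 = 1.0 / (F22 - F21)
    return G1 * S1_tilde, G2 * S2_tilde

rng = np.random.default_rng(3)
T = 50
S1 = rng.standard_normal(T) + 1j * rng.standard_normal(T)
S2 = rng.standard_normal(T) + 1j * rng.standard_normal(T)
H11, H12, H21, H22 = 1.0 + 0.0j, 0.8 + 0.3j, 0.5 - 0.4j, 1.0 + 0.0j

X1 = H11 * S1 + H12 * S2          # Eq. (17)
X2 = H21 * S1 + H22 * S2
Y1, Y2 = recombine(X1, X2, F21=H21 / H11, F22=H22 / H12)
```

Here `Y1` and `Y2` coincide with `H11 * S1` and `H12 * S2`, in agreement with Eq. (22). The filters of Eq. (21) are defined only when \(F_{21}(k)\ne F_{22}(k)\), i.e. when the mixing matrix is invertible in the considered frequency bin.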

3 Results

In order to evaluate the performance of our method and compare it to the most popular bin-wise methods known for their good performance, namely the method proposed by Sawada et al. [15] and the UCBSS method [13], we performed several tests on different sets of mixtures. Each set consists of two mixtures of two real audio sources, sampled at 16 kHz and lasting 10 s each, obtained using different filter sets. The coefficients \(h_{ij}(n)\) of these mixing filters are generated by the toolbox [10], which simulates a real acoustic room characterized by a reverberation time denoted by \({RT}_{60}\). They depend on the distance between the two sensors (microphones), denoted by D, and on the absolute value of the difference between the directions of arrival of the two source signals, denoted by \(\delta \varphi \). For the calculation of the STFT, we used a 2048-sample Hanning window (as analysis window) with a 75% overlap. To measure the performance, we used two of the criteria most commonly used by the BSS community, the Signal to Distortion Ratio (SDR) and the Signal to Artifacts Ratio (SAR), both provided by the BSSeval toolbox [17] and expressed in decibels (dB). The SDR measures the global performance of any BSS method, while the SAR provides specific information on its performance in terms of the artifacts present in the separated signals.
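
For orientation, the STFT settings above translate into the following quantities. The frame count assumes that the signal is neither padded nor extended at its edges, which is an assumption of this sketch and may differ from the convention actually used.

```python
fs_hz = 16_000          # sampling rate
duration_s = 10         # duration of each source signal
window_len = 2048       # analysis window length K, in samples
overlap = 0.75          # fraction of overlap between consecutive windows

hop = int(window_len * (1 - overlap))              # shift between windows, in samples
window_ms = 1000 * window_len / fs_hz              # window duration, in milliseconds
bin_spacing_hz = fs_hz / window_len                # spacing between frequency bins
n_samples = fs_hz * duration_s
n_frames = 1 + (n_samples - window_len) // hop     # assumes no padding at the edges
```

This gives a hop of 512 samples, a window of 128 ms and a bin spacing of 7.8125 Hz.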

For each test we evaluated the performance of the three methods, in terms of SDR and SAR, over 4 different realizations of the mixtures, obtained by using different sets of the source signals mentioned above. Thus, the values provided below for SDR and SAR are averages over these 4 realizations.

In the first experiment, we evaluated the performance as a function of the parameters D and \(\delta \varphi \) for an acoustic room characterized by \({RT}_{60}\) = 50 ms. Table 1 groups the performance for \(D\in \left\{ 0.3\,\mathrm{m}, 1\,\mathrm{m}\right\} \) and \(\delta \varphi \in \left\{ 85^{\circ }, 55^{\circ }, 30 ^{\circ }\right\} \), where the last column for each value of the parameter D represents the average value of SDR and SAR over the three values \(\delta \varphi _i\) of \(\delta \varphi \).

Table 1. SDR (dB) and SAR (dB) as a function of D and \(\delta \varphi \) for \({RT}_{60}\) = 50 ms.

According to Table 1, our method performs better than the other two methods over the 4 realizations of mixtures tested. Indeed, in terms of SDR, the proposed method outperforms these two methods by about 5 dB for D = 0.3 m and 3.5 dB for D = 1 m. This performance difference is even more visible in terms of SAR, which confirms that the artifacts introduced by our method are less significant than those introduced by the other two methods.

In our second experiment, we studied the behavior of our method as the reverberation time increases, with the parameters D and \(\delta \varphi \) fixed to D = 0.3 m and \(\delta \varphi = 55^\circ \) respectively. Table 2 groups the performance of the three methods in terms of SDR, for \({RT}_{60}\) belonging to the set \(\left\{ 50\,\mathrm{ms}, 100\,\mathrm{ms}, 150\,\mathrm{ms}, 200\,\mathrm{ms} \right\} \).

Table 2. SDR (dB) as a function of \({RT}_{60}\) for D = 0.3 m and \(\delta \varphi = 55^\circ \).

According to Table 2, the best performance is again obtained by our method, whatever the reverberation time. However, we note that this performance degrades as \({RT}_{60}\) increases. This result, which is common to all BSS methods, is expected and is mainly explained by the fact that the higher the reverberation time, the less well the assumption made by these methods on the source signals (here, sparseness in the TF domain) is verified.

4 Conclusion and Perspectives

In this paper, we have proposed a new Blind Source Separation method for linear convolutive mixtures, with a sparsity assumption in the time-frequency domain that is much less restrictive than those of existing methods [1, 2, 4, 6,7,8,9, 13,14,15]. Indeed, by focusing on the case of determined mixtures, we have shown that our method avoids the problem of artifacts in the separated signals, from which most of these methods suffer [2, 4, 6,7,8,9, 13,14,15]. According to the results of the tests performed, the performance of our new method, in terms of SDR and SAR, is better than that of the method proposed by Sawada et al. [15] and of the UCBSS method [13], which are known for their good performance among existing methods. Nevertheless, considering that these results were obtained over 4 different realizations of the mixtures and only for some values of the parameters involved, a larger statistical performance study including all these parameters is desirable to confirm these results. Furthermore, it would be interesting to propose a solution to this problem of artifacts in the case of under-determined linear convolutive mixtures as well.