1 Introduction

Speech enhancement and acoustic noise reduction have been active research fields over the last four decades. Existing speech enhancement techniques aim to improve speech quality through algorithms that provide good convergence speed and fast tracking capabilities, since acoustic environments imply very long and time-variant echo paths. A plethora of techniques and algorithms exploiting speech and noise characteristics can be found in the literature (Djendi et al. 2013; Loizou and Kim 2011; Loizou 2013).

Generally, speech enhancement techniques can be categorized as single-channel, dual-channel, or multichannel techniques (Djendi et al. 2009; Ghosh and Tsiartas 2011; Sandoval-Ibarra et al. 2016). Single-channel techniques are used in situations where only one recording microphone is available. They remain an important field of research because of their simple realization and effectiveness, and they are particularly valuable in mobile communication applications, where only a single microphone is used due to cost and size constraints (Sandoval-Ibarra et al. 2016). In recent times, several single-channel algorithms have been proposed in the literature.

Recently, in (Upadhyay 2016; Upadhyay and Karmakar 2015), the problem of single-channel speech enhancement in stationary environments was addressed, and Wiener filtering combined with recursive noise estimation algorithms was proposed to enhance speech signals. In Roy et al. (2016), the authors proposed a single-channel speech enhancement algorithm using a subband iterative Kalman filter: a wavelet filter bank first decomposes the noise-corrupted speech into a number of subbands, each of which is then processed by an efficient Kalman filter. In Lee et al. (2017) and Cho et al. (2016), the authors proposed new single-channel speech enhancement methods using nonnegative matrix factorization (NMF) with spectro-temporal speech presence probabilities and outlier detection. To improve on single-channel solutions, several dual-channel and multichannel enhancement techniques have been proposed in the literature. For example, several dual-channel speech enhancement techniques have been proposed that combine blind source separation with adaptive filters (Djendi 2010; Ikeda and Sugiyama 1999; Al-Kindi and Dunlop 1989; Gerven and Compernolle 1995). The same two-microphone setting was used to propose several dual adaptive filters that work directly on blind noisy speech signals (Sato et al. 2005; Ghribi et al. 2016). We can also cite machine learning and active learning techniques and their use in noisy signal classification and enhancement (Vajda and Santosh 2017; Bouguelia et al. 2018; Zhang et al. 2015). Another research direction for enhancing speech from noisy observations is direction-of-arrival estimation and localization when multiple speech sources are available (Dey and Ashour 2018a, b, c).

Among multichannel speech enhancement techniques, one finds both adaptive and non-adaptive methods, all of which aim to improve on single- and dual-microphone techniques for the same application, i.e. speech enhancement and acoustic noise reduction. In Marro et al. (1998), the authors concluded that in teleconferencing systems the use of hands-free sound pick-up reduces speech quality, due to ambient noise, acoustic echo, and the reverberation produced by the acoustic environment. They presented a theoretical analysis of noise reduction and dereverberation algorithms based on a microphone array combined with a Wiener post-filter, and showed that the transfer function of the post-filter depends on the input signal-to-noise ratio (SNR) and on the noise reduction yielded by the array. The use of a directivity-controlled array instead of a conventional beamformer was proposed to improve the performance of the whole system. Several papers based on the multichannel approach followed. Multichannel enhancement techniques employ microphone arrays and take advantage of the availability of multiple input signals, making it possible to use phase alignment to reject undesired noise components (Meyer 1997; Lotter et al. 2003; Wang et al. 2016; Mildner and Goetze 2006; Senthamizh Selvi et al. 2017; Qingning and Waleed 2006).

In this paper, we focus on the dual-channel approach and propose a new, efficient crosstalk-resistant backward blind source separation (BSS) algorithm for automatic blind speech enhancement. The proposed algorithm is a self-controlled system that does not require any voice activity detector to separate speech from very noisy observations.

This paper is organized as follows: after the introduction in Sect. 1, Sect. 2 presents the noisy observation model adopted in this work. Sect. 3 describes the principle of the backward blind source separation (BSS) structure and two known backward algorithms combined with this structure. Sect. 4 gives the mathematical formulation of the proposed crosstalk-resistant backward BSS algorithm for automatic blind speech enhancement, together with its theoretical analysis. Sect. 5 reports the simulation results of the proposed algorithm in terms of several objective criteria, and finally, Sect. 6 concludes the work.

2 Noisy observations model

In this work, we consider a two-microphone configuration that provides two noisy observations. The two noisy observations are each composed of one speech source signal and one punctual noise. We assume that the speech source is placed close to the first microphone, while the noise source is located close to the second microphone (see Fig. 1). The noisy observations of this model are given by the following relations (Ghosh and Tsiartas 2011; Djendi 2010; Gerven and Compernolle 1995):

Fig. 1
figure 1

The simplified mixture model, \(s\left( n \right)\) and \(b\left( n \right)\) are the speech signal and the noise respectively. \({h_{12}}\left( n \right)\) and \({h_{21}}\left( n \right)\) represent the impulse responses between the channels

$${{\text{m}}_1}\left( {\text{n}} \right)={\text{s}}\left( {\text{n}} \right)+{{\text{h}}_{21}}\left( {\text{n}} \right)\;*{\text{b}}\left( {\text{n}} \right)$$
(1)
$${{\text{m}}_2}\left( {\text{n}} \right)={\text{b}}\left( {\text{n}} \right)+{{\text{h}}_{12}}\left( {\text{n}} \right)\;*{\text{s}}\left( {\text{n}} \right)$$
(2)

The symbol “*” stands for the linear convolution operation. The impulse responses \({{\text{h}}_{{\text{12}}}}\left( {\text{n}} \right)\) and \({{\text{h}}_{{\text{21}}}}\left( {\text{n}} \right)\) model the cross-coupling effects between the two channels; \({\text{s}}\left( {\text{n}} \right)\) and \({\text{b}}\left( {\text{n}} \right)\) are the speech and noise sources, respectively. Note that the source signals (\({\text{s}}\left( {\text{n}} \right)\), \({\text{b}}\left( {\text{n}} \right)\)) and the real filters (\({{\text{h}}_{{\text{12}}}}\left( {\text{n}} \right)\), \({{\text{h}}_{{\text{21}}}}\left( {\text{n}} \right)\)) are unknown, and only the observed signals \({{\text{m}}_{\text{1}}}\left( {\text{n}} \right)\) and \({{\text{m}}_{\text{2}}}\left( {\text{n}} \right)\) are available. In a BSS algorithm, no a priori information is available in the separation process. In practice, the backward BSS (BBSS) structure is often used to retrieve the speech signal from the noisy observations alone; this structure is described in the next section.
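As an illustration, the mixing model of Eqs. (1) and (2) can be simulated directly. The source signals and coupling filters below are hypothetical placeholders (the paper leaves s(n), b(n), h12(n), and h21(n) unspecified); a minimal sketch in Python with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

# Illustrative sources and couplings (assumptions, not taken from the paper).
s = rng.standard_normal(N)        # speech source near microphone 1
b = rng.standard_normal(N)        # noise source near microphone 2
h21 = np.array([0.0, 0.5, 0.25])  # coupling: noise leaking into mic 1
h12 = np.array([0.0, 0.3, 0.15])  # coupling: speech leaking into mic 2

# Noisy observations, Eqs. (1) and (2): m1 = s + h21 * b, m2 = b + h12 * s
m1 = s + np.convolve(b, h21)[:N]
m2 = b + np.convolve(s, h12)[:N]
```

Each microphone thus receives its nearby source plus a filtered, delayed leakage of the other source, which is the crosstalk the separation structure must undo.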

3 Backward BSS (BBSS) structure

The backward blind source separation (BBSS) structure considered in this paper is shown in Fig. 2. Its noisy input signals are \({\text{m}_1}\left( {\text{n}} \right)\) and \({\text{m}_2}\left( {\text{n}} \right)\). The outputs \({s_1}\left( n \right)\) and \({{\text{s}}_{\text{2}}}\left( {\text{n}} \right)\) of this structure are given by the following equations (Djendi et al. 2013; Djendi 2010; Gerven and Compernolle 1995):

Fig. 2
figure 2

Backward blind source separation BBSS structure [Left: simplified mixing model], [Right: backward blind source separation (BSS) structure]

$${{\text{s}}_1}\left( {\text{n}} \right)={{\text{m}}_{\text{1}}}\left( {\text{n}} \right) - {{\text{w}}_{21}}\left( {\text{n}} \right)*\;{{\text{s}}_2}\left( {\text{n}} \right)$$
(3)
$${{\text{s}}_2}\left( {\text{n}} \right)={{\text{m}}_{\text{2}}}\left( {\text{n}} \right) - {{\text{w}}_{12}}\left( {\text{n}} \right)*{{\text{s}}_1}\left( {\text{n}} \right)$$
(4)

Inserting (1) and (2) into (3) and (4), respectively, we get the following output signals:

$${{\text{s}}_{\text{1}}}\left( {\text{n}} \right)=\frac{1}{{\delta \left( {\text{n}} \right)-{{\text{w}}_{{\text{12}}}}\left( {\text{n}} \right) * {{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)}}*\left( {{\text{s}}\left( {\text{n}} \right) * \left( {\delta \left( {\text{n}} \right)-{{\text{h}}_{{\text{12}}}}\left( {\text{n}} \right) * {{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)} \right)\,+\,{\text{b}}\left( {\text{n}} \right) * \left( {{{\text{h}}_{{\text{21}}}}\left( {\text{n}} \right)-{{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)} \right)} \right)$$
(5)
$${{\text{s}}_{\text{2}}}\left( {\text{n}} \right)=\frac{1}{{\delta \left( {\text{n}} \right)-{{\text{w}}_{{\text{12}}}}\left( {\text{n}} \right) * {{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)}}*\left( {{\text{b}}\left( {\text{n}} \right) * \left( {\delta \left( {\text{n}} \right)-{{\text{h}}_{{\text{21}}}}\left( {\text{n}} \right) * {{\text{w}}_{{\text{12}}}}\left( {\text{n}} \right)} \right)\,+\,{\text{s}}\left( {\text{n}} \right) * \left( {{{\text{h}}_{{\text{12}}}}\left( {\text{n}} \right)-{{\text{w}}_{{\text{12}}}}\left( {\text{n}} \right)} \right)} \right)$$
(6)

To obtain the noise signal at the output \({{\text{s}}_{\text{2}}}\left( {\text{n}} \right)\) and the speech signal at the output \({{\text{s}}_{\text{1}}}\left( {\text{n}} \right)\), the conditions \({\text{w}}_{{{\text{21}}}}^{{{\text{opt}}}}={{\text{h}}_{{\text{21}}}}\) and \({\text{w}}_{{{\text{12}}}}^{{{\text{opt}}}}={{\text{h}}_{{\text{12}}}}\) must be satisfied. In this case, the outputs of the BBSS structure become \({{\text{s}}_{\text{1}}}\left( {\text{n}} \right)={\text{s}}\left( {\text{n}} \right)\) and \({{\text{s}}_{\text{2}}}\left( {\text{n}} \right)={\text{b}}\left( {\text{n}} \right)\) (Djendi et al. 2013).
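This perfect-separation condition can be checked numerically. The sketch below assumes, for simplicity, single-echo coupling filters with a zero leading tap, so the feedback loop of Eqs. (3) and (4) can be run sample by sample using only past outputs; the filter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
s = rng.standard_normal(N)        # speech source (illustrative)
b = rng.standard_normal(N)        # noise source (illustrative)
h21 = np.array([0.0, 0.5])        # one-sample-delay couplings (assumption)
h12 = np.array([0.0, 0.3])

m1 = s + np.convolve(b, h21)[:N]  # Eq. (1)
m2 = b + np.convolve(s, h12)[:N]  # Eq. (2)

# Backward structure, Eqs. (3)-(4), with the optimal filters w21 = h21,
# w12 = h12. The zero leading taps mean only past outputs enter each sum.
w21, w12 = h21, h12
s1 = np.zeros(N)
s2 = np.zeros(N)
for n in range(N):
    acc1 = sum(w21[k] * s2[n - k] for k in range(1, len(w21)) if n - k >= 0)
    s1[n] = m1[n] - acc1
    acc2 = sum(w12[k] * s1[n - k] for k in range(1, len(w12)) if n - k >= 0)
    s2[n] = m2[n] - acc2

# At the optimum the outputs equal the sources: s1 = s and s2 = b.
```

This matches Eqs. (5) and (6): with w21 = h21 and w12 = h12 the noise term of s1(n) vanishes and the common denominator cancels the remaining speech filtering.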

3.1 Classical backward BSS (CBBSS) two-channel algorithm

In Gerven and Van Compernolle (1995), the classical backward BSS (CBBSS) two-channel algorithm is used to adjust the coefficients of the two separating filters \({{\text{w}}_{{\text{12}}}}\left( {\text{n}} \right)\) and \({{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)\). The update relations, in the minimum mean squared error (MMSE) sense, of the two adaptive filters are given in vector form as follows:

$${{\mathbf{w}}_{{\text{12}}}}\left( {\text{n}} \right)={{\mathbf{w}}_{{\text{12}}}}\left( {{\text{n}}-{\text{1}}} \right)+{\mu _{{\text{12}}}}{{\text{s}}_{\text{2}}}\left( {\text{n}} \right)\;{{\mathbf{k}}_{\text{1}}}\left( {\text{n}} \right)$$
(7)
$${{\mathbf{w}}_{{\text{21}}}}\left( {\text{n}} \right)={{\mathbf{w}}_{{\text{21}}}}\left( {{\text{n}}-{\text{1}}} \right)+{\mu _{{\text{21}}}}{{\text{s}}_{\text{1}}}\left( {\text{n}} \right)\;{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right)$$
(8)

where

$${s_1}\left( n \right)={{\text{m}}_{\text{1}}}\left( {\text{n}} \right) - {\mathbf{w}}_{{21}}^{T}\left( n \right)\;{{\mathbf{k}}_2}\left( {n - 1} \right)$$
(9)
$${s_2}\left( n \right)={{\text{m}}_{\text{2}}}\left( {\text{n}} \right) - {\mathbf{w}}_{{12}}^{T}\left( n \right)\;{{\mathbf{k}}_1}\left( n \right)$$
(10)

and \({{\mathbf{k}}_{\text{1}}}\left( {\text{n}} \right)={\left[ {{{\text{s}}_{\text{1}}}\left( {\text{n}} \right){\text{, }}{{\text{s}}_{\text{1}}}\left( {{\text{n-1}}} \right){\text{, }}...{\text{, }}{{\text{s}}_{\text{1}}}\left( {{\text{n-L}}+{\text{1}}} \right)} \right]^T}\), \({{\mathbf{k}}_{\text{2}}}{\text{(n)}}={\left[ {{{\text{s}}_{\text{2}}}\left( {\text{n}} \right){\text{, }}{{\text{s}}_{\text{2}}}\left( {{\text{n-1}}} \right){\text{, }}...{\text{, }}{{\text{s}}_{\text{2}}}\left( {{\text{n-L}}+{\text{1}}} \right)} \right]^T}\) are vectors containing the last L samples of the outputs \({{\text{s}}_{\text{1}}}\left( {\text{n}} \right)\) and \({{\text{s}}_{\text{2}}}\left( {\text{n}} \right)\), respectively. \({\mu _{12}}\) and \({\mu _{21}}\) are the step sizes of the adaptive filters \({{\text{w}}_{{\text{12}}}}\left( {\text{n}} \right)\) and \({{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)\), respectively. To ensure stability and convergence of the two-channel CBBSS algorithm toward the optimal solutions, the two step sizes must be selected between 0 and 2 (Djendi 2010; Gerven and Van Compernolle 1995).

A normalized version of this algorithm is obtained by normalizing the step sizes by the input energies \({\mathbf{k}}_{{\text{1}}}^{{\text{T}}}\left( {\text{n}} \right)\;{{\mathbf{k}}_{\text{1}}}\left( {\text{n}} \right)\) and \({\mathbf{k}}_{{\text{2}}}^{{\text{T}}}\left( {\text{n}} \right)\;{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right)\) for the adaptive filters \({{\text{w}}_{{\text{12}}}}\left( {\text{n}} \right)\) and \({{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)\), respectively. This normalization yields the simple stability conditions 0 < \({\mu _{12}}\) < 2 and 0 < \({\mu _{21}}\) < 2.

$${{\mathbf{w}}_{{\text{21}}}}\left( {\text{n}} \right)={{\mathbf{w}}_{{\text{21}}}}\left( {{\text{n}}-{\text{1}}} \right)+\frac{{{\mu _{{\text{21}}}}}}{{{\mathbf{k}}_{{\text{2}}}^{{\text{T}}}\left( {\text{n}} \right)\;\,{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right)+{\xi _{\text{1}}}}}{{\text{s}}_{\text{1}}}\left( {\text{n}} \right)\;{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right)$$
(11)
$${{\mathbf{w}}_{{\text{12}}}}\left( {\text{n}} \right)={{\mathbf{w}}_{{\text{12}}}}\left( {{\text{n}}-{\text{1}}} \right)+\frac{{{\mu _{{\text{12}}}}}}{{{\mathbf{k}}_{{\text{1}}}^{{\text{T}}}\left( {\text{n}} \right)\;\,{{\mathbf{k}}_{\text{1}}}\left( {\text{n}} \right)+{\xi _{\text{2}}}}}{{\text{s}}_{\text{2}}}\left( {\text{n}} \right)\;{{\mathbf{k}}_{\text{1}}}\left( {\text{n}} \right)$$
(12)

where \({\xi _{\text{1}}}\) and \({\xi _{\text{2}}}\) are two small constants introduced to avoid division by zero. The principle of the CBBSS algorithm is similar to that of the normalized least mean square (NLMS) algorithm in the dual-channel case; this equivalence was shown and proven in Gerven and Van Compernolle (1995). The CBBSS algorithm is summarized in Table 1.

Table 1 Summary of the CBBSS algorithm
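As a minimal sketch of the normalized update (11), consider the noise-only scenario (s(n) = 0) in which, classically, a VAD would enable adaptation of w21(n). With w12(n) frozen at zero we have s2(n) = m2(n) = b(n), and (11) reduces to NLMS identification of the coupling filter h21(n). All parameter and filter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, L = 5000, 2
mu21, xi1 = 0.5, 1e-6             # step size and regularizer (illustrative)

b = rng.standard_normal(N)        # noise source; speech absent: s(n) = 0
h21 = np.array([0.5, -0.3])       # unknown coupling filter (illustrative)
m1 = np.convolve(b, h21)[:N]      # Eq. (1) with s(n) = 0
m2 = b.copy()                     # Eq. (2) with s(n) = 0; w12 frozen at 0

w21 = np.zeros(L)
for n in range(L, N):
    k2 = m2[n - L + 1:n + 1][::-1]            # [s2(n), ..., s2(n-L+1)]
    s1 = m1[n] - w21 @ k2                     # output, Eq. (3)
    w21 += mu21 / (k2 @ k2 + xi1) * s1 * k2   # normalized update, Eq. (11)
```

In this noise-free identification setting the filter converges towards h21, which is exactly the optimal solution discussed in Sect. 3.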

4 Proposed robust backward BSS crosstalk-resistant algorithm

4.1 Motivation

In the classical use of the BBSS algorithm, the separating adaptive filters \({{\text{w}}_{{\text{12}}}}{\text{(n)}}\) and \({{\text{w}}_{{\text{21}}}}{\text{(n)}}\) have to converge towards the optimal solutions \({{\text{h}}_{{\text{12}}}}{\text{(n)}}\) and \({{\text{h}}_{{\text{21}}}}{\text{(n)}}\), respectively, to separate the speech and noise components from the noisy observations m1(n) and m2(n) (Djendi et al. 2013; Ghosh and Tsiartas 2011; Djendi 2010). This is made possible by a voice activity detector (VAD), which allows extracting the source signals from the noisy observations with little distortion (Górriz et al. 2010; Mak 2014; Mukherjee et al. 2018a, b). Usually, the adaptive filters \({{\text{w}}_{{\text{12}}}}{\text{(n)}}\) and \({{\text{w}}_{{\text{21}}}}{\text{(n)}}\) are updated alternately: to obtain the speech signal at the output s1(n), the adaptive filter \({{\text{w}}_{{\text{21}}}}{\text{(n)}}\) must be updated only during noise-only periods, while the opposite configuration must be adopted for the second adaptive filter \({{\text{w}}_{{\text{12}}}}{\text{(n)}}\). In this paper, we propose a new automatic BBSS algorithm that updates the cross-filters \({{\text{w}}_{{\text{12}}}}{\text{(n)}}\) and \({{\text{w}}_{{\text{21}}}}{\text{(n)}}\) automatically and alternately without any VAD, and that is robust to crosstalk components.

4.2 Derivation of the proposed algorithm

The mathematical derivation of the proposed algorithm is presented in this section. We recall that the proposed technique exploits the intermittent nature of the speech signal to adjust the adaptive filter coefficients given by relations (11) and (12) (Djendi et al. 2013). We therefore start from the Newton recursion (Sayed 2003; Zoulikha and Djendi 2016; Djendi and Zoulikha 2014) applied to the backward blind source separation structure, given as follows (see Fig. 3):

Fig. 3
figure 3

Proposed algorithm. The new parameters rs1m2(n) and rs2m1(n) are the cross-correlations between the outputs s1(n) and s2(n) and the mixing signals m1(n) and m2(n) respectively

$${\mathbf{w}_{{\text{21}}}}\left( {{\text{n+1}}} \right){\text{ = }}{\mathbf{w}_{{\text{21}}}}\left( {\text{n}} \right){\text{+}}{\mu _{{\text{21}}}}\left( {\text{n}} \right)\frac{{ {\mathbf{p}_{{{\text{s}}_{\text{1}}}{\text{ }}{{\mathbf{k}}_{\text{2}}}}}-{\mathbf{R}_{{{\mathbf{k}}_{\text{2}}}}}{\mathbf{w}_{{\text{21}}}}\left( {\text{n}} \right){\text{ }}}}{{{\zeta _1}\left( {\text{n}} \right){\text{ }}\mathbf{I}+{\mathbf{R}_{{{\mathbf{k}}_2}}}\left( {\text{n}} \right)}}$$
(13)

where \({\mathbf{R}_{{{\mathbf{k}}_{\text{2}}}}}(\text{n})\) represents the autocorrelation matrix of the output vector k2(n). It is given by:

$${\mathbf{R}_{{{\mathbf{k}}_{\text{2}}}}}(\text{n})=E{\left[ {{{\mathbf{k}}_{\text{2}}}\left( n \right)\;{\mathbf{k}}_{{\text{2}}}^{{\text{T}}}\left( n \right)} \right]^{}}$$
(14)

and \({\mathbf{P}_{{{\text{s}}_{\text{1}}}{{\mathbf{k}}_{\text{2}}}}}{\text{(n)}}\) is the cross-correlation vector between the output s1(n) and the output vector k2(n). It is given by:

$${{\mathbf{P}}_{\text{s}1{\mathbf{k}}2}}\left( n \right)=E\left[ {{s_1}\left( n \right)\;{{\mathbf{k}}_2}\left( n \right)} \right]$$
(15)

and \(\mathbf{I}\) is the N × N identity matrix; \({\zeta _1}\left( {\text{n}} \right)\) is a small regularization scalar. The step size µ21 is a control parameter of relation (13) that ensures stability and convergence. The same development applies to relation (12), leading to:

$${\mathbf{w}_{{\text{12}}}}\left( {{\text{n+1}}} \right){\text{\,=\,}}{\mathbf{w}_{{\text{12}}}}\left( {\text{n}} \right){\text{+}}{\mu _{{\text{12}}}}\left( {\text{n}} \right)\frac{{ {\mathbf{p}_{{{\text{s}}_{\text{2}}}{\text{ }}{{\mathbf{k}}_{\text{1}}}}}-{\mathbf{R}_{{{\mathbf{k}}_{\text{1}}}}}{\mathbf{w}_{{\text{12}}}}\left( {\text{n}} \right){\text{ }}}}{{{\zeta _2}\left( {\text{n}} \right){\text{ }}\mathbf{I}+{\mathbf{R}_{{{\mathbf{k}}_1}}}\left( {\text{n}} \right)}}$$
(16)

where \({\mathbf{R}_{{\mathbf{k}}1}}(\text{n})\) is the autocorrelation matrix of the output vector k1(n), given by \({\mathbf{R}_{{{\mathbf{k}}_{\text{1}}}}}(\text{n})=E\left[ {{{\mathbf{k}}_{\text{1}}}\left( n \right)\;{\mathbf{k}}_{{\text{1}}}^{{\text{T}}}\left( n \right)} \right]\). The vector \({\mathbf{P}_{{{\text{s}}_{\text{2}}}{{\mathbf{k}}_{\text{1}}}}}\) is the cross-correlation vector between the output s2(n) and the output vector k1(n), given by \({{\mathbf{P}}_{\text{s}2{\mathbf{k}}1}}\left( n \right)=E\left[ {{s_2}\left( n \right)\;{{\mathbf{k}}_1}\left( n \right)} \right]\); \({\zeta _2}\left( {\text{n}} \right)\) is a small regularization scalar. The step size µ12 is a control parameter of relation (16) that ensures stability and convergence.

In the general case, the terms \({\zeta _1}\left( {\text{n}} \right){\text{ }}\mathbf{I}\) and \({\zeta _2}\left( {\text{n}} \right){\text{ }}\mathbf{I}\) are introduced in the Newton recursions (13) and (16) to regularize the two-channel algorithm. However, as these two regularization parameters are constant, the behavior of the Newton algorithm in (13) and (16) is the same in the transient and steady-state regimes. The idea is to make these parameters time-varying so as to improve either regime: improving the transient regime means increasing the convergence speed of the algorithm, while improving the steady-state regime means reducing the final mean square error (MSE). In other words, we seek a blind two-channel algorithm with both faster convergence and a smaller final MSE.

In this paper, we propose to add the squared norm of the cross-correlation vector between the filtering error s1(n) and the noisy observation m2(n) to the regularization term \({\zeta _1}\left( {\text{n}} \right){\text{ }}\mathbf{I}\) in (13), and the squared norm of the cross-correlation vector between the filtering error s2(n) and the noisy observation m1(n) to the regularization term \({\zeta _2}\left( {\text{n}} \right){\text{ }}\mathbf{I}\) in (16). These two modifications allow the Newton recursions (13) and (16) to be improved in both the transient and steady-state regimes. The proposed automatic speech enhancement solution based on the BBSS algorithm is given by the following relations:

$${\mathbf{w}_{{\text{21}}}}\left( {{\text{n+1}}} \right){\text{ = }}{\mathbf{w}_{{\text{21}}}}\left( {\text{n}} \right){\text{ + }}{\mu _{{\text{21}}}}\left( {\text{n}} \right)\frac{{ {\mathbf{p}_{{{\text{s}}_{\text{1}}}{\text{ }}{{\mathbf{k}}_{\text{2}}}}}-{\mathbf{R}_{{{\mathbf{k}}_{\text{2}}}}}\left( {\text{n}} \right){\text{ }}{\mathbf{w}_{{\text{21}}}}\left( {\text{n}} \right){\text{ }}}}{{\left( {{\zeta _1}\left( {\text{n}} \right)+{{\left\| {{\text{r}_{\text{s}1{\mathbf{m}}2}}\left( n \right)} \right\|}^2}} \right){\text{ }}\mathbf{I}+{\mathbf{R}_{{{\mathbf{k}}_2}}}\left( {\text{n}} \right)}}$$
(17)
$${\mathbf{w}_{{\text{12}}}}\left( {{\text{n+1}}} \right){\text{ = }}{\mathbf{w}_{{\text{12}}}}\left( {\text{n}} \right){\text{ + }}{\mu _{{\text{12}}}}\left( {\text{n}} \right)\frac{{ {\mathbf{p}_{{{\text{s}}_{\text{2}}}{\text{ }}{{\mathbf{k}}_{\text{1}}}}}-{\mathbf{R}_{{{\mathbf{k}}_{\text{1}}}}}\left( {\text{n}} \right){\text{ }}{\mathbf{w}_{{\text{12}}}}\left( {\text{n}} \right){\text{ }}}}{{\left( {{\zeta _2}\left( {\text{n}} \right)+{{\left\| {{\text{r}_{\text{s}2{\mathbf{m}}1}}\left( n \right)} \right\|}^2}} \right){\text{ }}\mathbf{I}+{\mathbf{R}_{{{\mathbf{k}}_1}}}\left( {\text{n}} \right)}}$$
(18)

where \({{\mathbf{r}}_{\text{s}1{\mathbf{m}}2}}\left( n \right)\) is the cross-correlation vector between the output signal s1(n) and the noisy observation vector m2(n), and \({{\mathbf{r}}_{\text{s}2{\mathbf{m}}1}}\left( n \right)\) is the cross-correlation vector between the output signal s2(n) and the noisy observation vector m1(n). They are given as follows:

$${{\mathbf{r}}_{\text{s}1\text{m}2}}\left( n \right)=E\left[ {{{\text{s}}_1}\left( k \right){\text{ }}{{\mathbf{m}}_2}\left( {k - n} \right)} \right]$$
(19)
$${{\mathbf{r}}_{\text{s}2\text{m}1}}\left( n \right)=E\left[ {{{\text{s}}_2}\left( k \right){\text{ }}{{\mathbf{m}}_1}\left( {k - n} \right)} \right]$$
(20)

and

$${\left\| {{{\mathbf{r}}_{\text{s}1{\mathbf{m}}2}}\left( n \right)} \right\|^2}{\text{= }}\sum\limits_{{{\text{k}}={\text{0}}}}^{{\text{L} - 1}} {{{\left| {{{\text{r}}_{\text{s}1{\mathbf{m}}2}}\left( {n - k} \right)} \right|}^{\text{2}}} } {\text{ }}$$
(21)
$${\left\| {{{\mathbf{r}}_{\text{s}2{\mathbf{m}}1}}\left( n \right)} \right\|^2}{\text{= }}\sum\limits_{{{\text{k}}={\text{0}}}}^{{\text{L} - 1}} {{{\left| {{{\text{r}}_{\text{s}2{\mathbf{m}}1}}\left( {n - k} \right)} \right|}^{\text{2}}} } {\text{ }}$$
(22)

where L is the number of samples used in the cross-correlation vector norm. In the following, we derive an automatic and less complex algorithm. We start from relation (17) and then extrapolate the result to relation (18).
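In practice, the expectations in (19)–(22) must be replaced by running estimates. One possible sketch uses an exponentially weighted average; the smoothing factor `lam` is an assumption of this sketch, not a value taken from the paper:

```python
import numpy as np

def xcorr_norm2(s1, m2, L, lam=0.99):
    """Running estimate of ||r_s1m2(n)||^2 in the spirit of Eqs. (19), (21).

    lam is an assumed exponential smoothing factor. Returns the squared-norm
    estimate at every sample index n.
    """
    r = np.zeros(L)                 # estimates of r_s1m2 at lags k = 0..L-1
    out = np.empty(len(s1))
    for n in range(len(s1)):
        for k in range(L):
            if n - k >= 0:
                r[k] = lam * r[k] + (1 - lam) * s1[n] * m2[n - k]
        out[n] = np.sum(r ** 2)     # squared norm, Eq. (21)
    return out

rng = np.random.default_rng(3)
x = rng.standard_normal(4000)
uncorr = xcorr_norm2(rng.standard_normal(4000), x, L=4)  # independent signals
corr = xcorr_norm2(x, x, L=4)                            # correlated signals
```

The estimate stays close to zero for uncorrelated signals and grows when the output is correlated with the opposite observation, which is exactly the quantity the proposed regularization exploits.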

  1. Step 1:

    First, we introduce a parameter \({\beta _1}\) that controls the contribution of \({\left\| {{\text{r}_{\text{s}1{\mathbf{m}}2}}\left( n \right)} \right\|^2}\) to the regularization of (17). We also invoke an ergodicity assumption that allows replacing \({\mathbf{P}_{{{\text{s}}_{\text{1}}}{{\mathbf{k}}_{\text{2}}}}}\left( n \right)\) and \({{\mathbf{P}}_{\text{s}2{\mathbf{k}}1}}\left( n \right)\) by their instantaneous estimates, i.e. \({{\mathbf{P}}_{\text{s}1{\mathbf{k}}2}}\left( n \right)=\left[ {{s_1}\left( n \right)\;{{\mathbf{k}}_2}\left( n \right)} \right]\) and \({{\mathbf{P}}_{\text{s}2{\mathbf{k}}1}}\left( n \right)=\left[ {{s_2}\left( n \right)\;{{\mathbf{k}}_1}\left( n \right)} \right]\). The new update of w21(n) is then given as follows:

    $${\mathbf{w}_{{\text{21}}}}\left( {{\text{n+1}}} \right){\text{ = }}{\mathbf{w}_{{\text{21}}}}\left( {\text{n}} \right){\text{ + }}\frac{{ {\text{ }}{\mu _{{\text{21}}}}\left( {\text{n}} \right){\text{ }}}}{{{\beta _1}\left( {{\zeta _1}\left( {\text{n}} \right)+{{\left\| {{{\mathbf{r}}_{\text{s}1{\mathbf{m}}2}}\left( n \right)} \right\|}^2}} \right){\text{ }}\mathbf{I}+\left( {1 - {\beta _1}} \right){\text{ }}{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right){\text{ }}{\mathbf{k}}_{{\text{2}}}^{{\text{T}}}\left( {\text{n}} \right)}}{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right){\text{ }}{s_1}\left( {\text{n}} \right)$$
    (23)
  2. Step 2:

    In the second step, we reduce the complexity of the update (23) by using the matrix inversion lemma:

    $${\left[ {{\mathbf{A}}+{\mathbf{BC}}{{\mathbf{D}}^{}}} \right]^{-1}}={{\mathbf{A}}^{-1}}{\text{ }}-{\text{ }}{{\mathbf{A}}^{-1}}{\mathbf{B}} {\left[ {{{\mathbf{C}}^{-1}}+{\mathbf{D}}{{\mathbf{A}}^{-1}}{\mathbf{B}}} \right]^{-1}}{\mathbf{D}}{{\mathbf{A}}^{-1}}$$
    (24)

    Equating the denominator of (23) with the left-hand side of (24), we get:

    $${\left[ {\mathbf{A}+\mathbf{B}\mathbf{C}\mathbf{D}} \right]^{{\text{-1}}}}{\text{=}}{\left[ {{\beta _1}\left( {{\zeta _1}\left( {\text{n}} \right)+{{\left\| {{{\mathbf{r}}_{\text{s}1{\mathbf{m}}2}}\left( n \right)} \right\|}^2}} \right){\text{ }}\mathbf{I}+\left( {1 - {\beta _1}} \right){\text{ }}{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right){\text{ }}{\mathbf{k}}_{{\text{2}}}^{{\text{T}}}\left( {\text{n}} \right)} \right]^{ - 1}}$$
    (25)

    If we set \(\mathbf{A}={\alpha _1}\left( {\text{n}} \right)\,\mathbf{I}\), where \({\alpha _1}\left( {\text{n}} \right)={\beta _1}\left( {{\zeta _1}\left( {\text{n}} \right)+{{\left\| {{{\mathbf{r}}_{\text{s}1{\mathbf{m}}2}}\left( n \right)} \right\|}^2}} \right)\), together with \(\mathbf{B}={{\mathbf{k}}_{\text{2}}}\left( n \right)\), \(\mathbf{C}=\left( {1 - {\beta _1}} \right)\), and \(\mathbf{D}={\mathbf{k}}_{{\text{2}}}^{{\text{T}}}\left( {\text{n}} \right)\), then applying (24) yields the following relation:

    $${\left[ {{\alpha _1}\left( {\text{n}} \right)\,\mathbf{I}+\left( {1 - {\beta _1}} \right)\,{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right)\,{\mathbf{k}}_{{\text{2}}}^{{\text{T}}}\left( {\text{n}} \right)} \right]^{ - 1}}=\alpha _{1}^{{ - 1}}\left( {\text{n}} \right)\,\mathbf{I} - \alpha _{1}^{{ - 1}}\left( {\text{n}} \right)\,{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right)\,{\left[ {{{\left( {1 - {\beta _1}} \right)}^{ - 1}}+\alpha _{1}^{{ - 1}}\left( {\text{n}} \right)\,{\mathbf{k}}_{2}^{T}\left( {\text{n}} \right)\,{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right)} \right]^{ - 1}}{\mathbf{k}}_{2}^{T}\left( {\text{n}} \right)\,\alpha _{1}^{{ - 1}}\left( {\text{n}} \right)$$
    (26)
  3. Step 3:

    Further simplification of (26) is possible. Multiplying both sides of (26) by \({\mathbf{k}_{\text{2}}}\left( n \right)\) and rearranging, we get the following simple relation:

    $${\left[ {{\beta _1}\left( {{\zeta _1}\left( {\text{n}} \right)+{{\left\| {{{\mathbf{r}}_{\text{s}1{\mathbf{m}}2}}\left( n \right)} \right\|}^2}} \right){\text{ }}\mathbf{I}+\left( {1 - {\beta _1}} \right){\text{ }}{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right){\text{ }}{\mathbf{k}}_{{\text{2}}}^{{\text{T}}}\left( {\text{n}} \right)} \right]^{ - 1}}{\mathbf{k}}_{{\text{2}}}^{{}}\left( {\text{n}} \right)=\frac{{{\mathbf{k}}_{{\text{2}}}^{{}}\left( {\text{n}} \right)}}{{{\beta _1}\left( {{\zeta _1}\left( {\text{n}} \right)+{{\left\| {{{\mathbf{r}}_{\text{s}1{\mathbf{m}}2}}\left( n \right)} \right\|}^2}} \right){\text{ }}+\left( {1 - {\beta _1}} \right){\text{ }}{{\left\| {{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right)} \right\|}^2}}}$$
    (27)
  4. Step 4:

    Substituting relation (27) into (23), we get the final update relation of the filter w21(n):

    $${\mathbf{w}_{{\text{21}}}}\left( {{\text{n+1}}} \right){\text{ = }}{\mathbf{w}_{{\text{21}}}}\left( {\text{n}} \right){\text{ + }}\frac{{ {\text{ }}{\mu _{{\text{21}}}}\left( {\text{n}} \right){\text{ }}}}{{{\beta _1}\left( {{\zeta _1}\left( {\text{n}} \right)+{{\left\| {{{\mathbf{r}}_{\text{s}1{\mathbf{m}}2}}\left( n \right)} \right\|}^2}} \right){\text{ }}+\left( {1 - {\beta _1}} \right){\text{ }}{{\left\| {{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right)} \right\|}^2}}}{{\mathbf{k}}_{\text{2}}}\left( {\text{n}} \right)\;{s_1}\left( {\text{n}} \right)$$
    (28)

    In our proposed algorithm, we exploit the symmetry of the backward blind source separation structure to obtain the update relation of the adaptive filter w12(n):

    $${\mathbf{w}_{{\text{12}}}}\left( {{\text{n+1}}} \right){\text{ = }}{\mathbf{w}_{{\text{12}}}}\left( {\text{n}} \right){\text{+}}\frac{{ {\text{ }}{\mu _{{\text{12}}}}\left( {\text{n}} \right){\text{ }}}}{{{\beta _2}\left( {{\zeta _2}\left( {\text{n}} \right)+{{\left\| {{{\mathbf{r}}_{\text{s}2{\mathbf{m}}1}}\left( n \right)} \right\|}^2}} \right){\text{ }}+\left( {1 - {\beta _2}} \right){\text{ }}{{\left\| {{{\mathbf{k}}_{\text{1}}}\left( {\text{n}} \right)} \right\|}^2}}}{{\mathbf{k}}_{\text{1}}}\left( {\text{n}} \right)\;{s_2}\left( {\text{n}} \right)$$
    (29)

    where \({\zeta _1}\left( {\text{n}} \right)\) and \({\zeta _2}\left( {\text{n}} \right)\) are small positive constants, and \({\beta _1}\), \({\beta _2}\), \({\mu _{{\text{21}}}}\left( {\text{n}} \right)\), and \({\mu _{{\text{12}}}}\left( {\text{n}} \right)\) are control parameters of the proposed algorithm. These parameters must be carefully selected to achieve the best tradeoff between fast convergence and a low final MSE. The proposed algorithm is summarized in Table 2.
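The scalar simplification in (27) can be verified numerically against the full matrix inverse it replaces; all numerical values below are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)
L = 8
beta1, zeta1 = 0.5, 1e-3
r_norm2 = 0.2                        # stand-in for ||r_s1m2(n)||^2 (assumption)
k2 = rng.standard_normal(L)

a = beta1 * (zeta1 + r_norm2)        # scalar multiplying the identity
M = a * np.eye(L) + (1 - beta1) * np.outer(k2, k2)

full = np.linalg.inv(M) @ k2                      # left-hand side of Eq. (27)
simple = k2 / (a + (1 - beta1) * (k2 @ k2))       # right-hand side of Eq. (27)
```

The agreement between `full` and `simple` is what removes the matrix inversion from the update, reducing its cost from O(L³) to O(L) per iteration.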

Table 2 Summary of the proposed algorithm
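A compact sketch of one iteration of the proposed updates (28) and (29) is given below; the default parameter values are illustrative, not the tuned values used in the paper's experiments:

```python
import numpy as np

def proposed_update(w21, w12, k2, k1, s1, s2, r12_norm2, r21_norm2,
                    mu21=0.5, mu12=0.5, beta1=0.5, beta2=0.5,
                    zeta1=1e-6, zeta2=1e-6):
    """One iteration of the proposed updates, Eqs. (28) and (29).

    r12_norm2 and r21_norm2 are the estimates of ||r_s1m2(n)||^2 and
    ||r_s2m1(n)||^2; parameter defaults are illustrative assumptions.
    """
    d1 = beta1 * (zeta1 + r12_norm2) + (1 - beta1) * (k2 @ k2)
    d2 = beta2 * (zeta2 + r21_norm2) + (1 - beta2) * (k1 @ k1)
    w21 = w21 + mu21 / d1 * s1 * k2   # Eq. (28)
    w12 = w12 + mu12 / d2 * s2 * k1   # Eq. (29)
    return w21, w12
```

Note that the correlation terms only scale the step size: a large ||r_s1m2(n)||² inflates the denominator d1 and shrinks the update of w21(n), which is the self-controlling mechanism analyzed in the next subsection.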

4.3 Theoretical analysis of the proposed algorithm

In this analysis, we adopt a more compact notation for the proposed algorithm of relations (28) and (29). The two-channel updates of the cross-adaptive filters \({\mathbf{w}_{{\text{12}}}}\left( {\text{n}} \right)\) and \({\mathbf{w}_{{\text{21}}}}\left( {\text{n}} \right)\) can then be rewritten as follows:

$${\mathbf{w}_{{\text{21}}}}\left( {{\text{n+1}}} \right){\text{ = }}{\mathbf{w}_{{\text{21}}}}\left( {\text{n}} \right){\text{ + }}{\nabla _{{\text{1 }}}}\left( {\text{n}} \right)\;{\mathbf{k}_{\text{2}}}\left( {\text{n}} \right)\;{s_1}\left( {\text{n}} \right)$$
(30)
$${\mathbf{w}_{{\text{12}}}}\left( {{\text{n+1}}} \right){\text{ = }}{\mathbf{w}_{{\text{12}}}}\left( {\text{n}} \right){\text{ + }}{\nabla _{{\text{2 }}}}\left( {\text{n}} \right)\;{\mathbf{k}_{\text{1}}}\left( {\text{n}} \right)\;{s_2}\left( {\text{n}} \right)$$
(31)

where the two new step-sizes \({\nabla _{{\text{1 }}}}\left( {\text{n}} \right)\) and \({\nabla _{{\text{2 }}}}\left( {\text{n}} \right)\) are given by the following relations:

$${\nabla _{{\text{1 }}}}\left( {\text{n}} \right)=\frac{{ {\text{ }}{\mu _{{\text{21}}}}\left( {\text{n}} \right){\text{ }}}}{{{\beta _1}\left( {{\zeta _1}\left( {\text{n}} \right)+{{\left\| {{{\mathbf{r}}_{\text{s}1{\mathbf{m}}2}}\left( n \right)} \right\|}^2}} \right){\text{ }}+\left( {1 - {\beta _1}} \right){\text{ }}{{\left\| {{\mathbf{k}_{\text{2}}}\left( {\text{n}} \right)} \right\|}^2}}}$$
(32)
$${\nabla _{{\text{2 }}}}\left( {\text{n}} \right)=\frac{{ {\text{ }}{\mu _{{\text{12}}}}\left( {\text{n}} \right){\text{ }}}}{{{\beta _2}\left( {{\zeta _2}\left( {\text{n}} \right)+{{\left\| {{{\mathbf{r}}_{\text{s}2{\mathbf{m}}1}}\left( n \right)} \right\|}^2}} \right){\text{ }}+\left( {1 - {\beta _2}} \right){\text{ }}{{\left\| {{\mathbf{k}_{\text{1}}}\left( {\text{n}} \right)} \right\|}^2}}}$$
(33)

In order to analyze the behavior of the proposed algorithm, particular attention is paid to the step-sizes of relations (32) and (33). From relation (32), we note that the step size \({\nabla _{{\text{1 }}}}\left( {\text{n}} \right)\) of the adaptive filter \({{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)\) is large when the cross-correlation factor \({{\mathbf{r}}_{\text{s}1{\mathbf{m}}2}}\left( n \right)\) is small, i.e. the step size \({\nabla _{{\text{1 }}}}\left( {\text{n}} \right)\) takes large values when the speech signal is absent, and small values in the opposite case. This configuration allows the adaptive filter \({{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)\) to be adjusted during speech absence periods and frozen in the opposite situation. Furthermore, this automatic adjustment of the adaptive filter \({{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)\) formulates an adaptive noise cancellation (ANC) system with a noise-only reference, and makes it possible to cancel the noise components at the output \({s_{{\text{1 }}}}\left( {\text{n}} \right)\).

On the other hand, an inverse relation between the variation of the step-size \({\nabla _{{\text{2 }}}}\left( {\text{n}} \right)\) and the cross-correlation factor \({{\mathbf{r}}_{\text{s}2{\mathbf{m}}1}}\left( n \right)\) is observed, i.e. the step size \({\nabla _{{\text{2 }}}}\left( {\text{n}} \right)\) is large when \({{\mathbf{r}}_{\text{s}2{\mathbf{m}}1}}\left( n \right)\) takes small values, which occurs in speech presence periods. This automatic mechanism allows the adaptive filter \({{\text{w}}_{{\text{12}}}}\left( {\text{n}} \right)\) to be adjusted to suppress the speech signal at the output \({s_{{\text{2 }}}}\left( {\text{n}} \right)\) and to keep the noise source components at that same output \({s_{{\text{2 }}}}\left( {\text{n}} \right)\).

This automatic mechanism, which alternates the updates of the adaptive filters \({{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)\) and \({{\text{w}}_{{\text{12}}}}\left( {\text{n}} \right)\), leads to a blind separation of the speech and noise components at the outputs \({s_{{\text{1 }}}}\left( {\text{n}} \right)\) and \({s_{{\text{2 }}}}\left( {\text{n}} \right)\) without any a priori information about them, i.e. only the mixed signals are available at the input of the algorithm. A demonstration of these conclusions and of this theoretical analysis is given in the simulation part (see Sect. 5.5).

5 Simulation results

In this section, we analyze the behavior of the proposed algorithm in comparison with two two-channel adaptive BSS-based algorithms, namely the classical BSS (CBBSS) algorithm (16) and the variable step-size backward source separation (VSS-BBSS) algorithm (Djendi and Zoulikha 2014).

5.1 Description of the experimental model and the used signals

We have generated the simulated impulse responses with the model proposed in (Djendi et al. 2006), i.e. \({{\text{h}}_{{\text{12}}}}\left( {\text{n}} \right)=\delta \left( {\text{n}} \right)+{\psi _1}\left( {\text{n}} \right)\) and \({{\text{h}}_{{\text{21}}}}\left( {\text{n}} \right)=\delta \left( {\text{n}} \right)+{\psi _2}\left( {\text{n}} \right)\), where \(\delta \left( {\text{n}} \right)\) is the first sample of the impulse response, representing the direct acoustic path from each source to the cross-coupled microphone, and \({\psi _1}\) and \({\psi _2}\) are exponentially weighted tails that model the room effect (Djendi et al. 2006). Figure 4 shows an example of the impulse responses \({{\text{h}}_{{\text{12}}}}\left( {\text{n}} \right)\) (left of Fig. 4) and \({{\text{h}}_{{\text{21}}}}\left( {\text{n}} \right)\) (right of Fig. 4) corresponding to spaced microphones; with a sampling period Ts = 125 µs, the corresponding reverberation time is 30.8 ms, and the length of the impulse responses is \(L=128\) (Djendi et al. 2006).
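As an illustration, this impulse-response model \(h(n)=\delta(n)+\psi(n)\) can be sketched in Python as follows. The decay constant of the exponential weighting and the use of a Gaussian random tail are assumptions made for this example, since the text only specifies the general form of the model.

```python
import numpy as np

def simulated_impulse_response(L=128, decay=0.06, seed=0):
    """Sketch of the model h(n) = delta(n) + psi(n): a unit direct path
    followed by an exponentially weighted random tail modelling the
    room effect. decay and the Gaussian tail are assumptions."""
    rng = np.random.default_rng(seed)
    n = np.arange(L)
    h = np.exp(-decay * n) * rng.standard_normal(L)  # weighted tail psi(n)
    h[0] = 1.0                                       # direct path delta(n)
    return h
```

With L = 128 taps at fs = 8 kHz (Ts = 125 µs), the response spans 16 ms of acoustic path, consistent with the short reverberation time used in the simulations.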

Fig. 4

Simulated impulse responses in the spaced microphones case; [Left]: \({{\text{h}}_{{\text{12}}}}\left( {\text{n}} \right)\), [Right]: \({{\text{h}}_{{\text{21}}}}\left( {\text{n}} \right)\). The real filters length is L = 128. \(f{\text{s}}={\text{8 kHz}}\)

The speech and noise signals are real, sampled at \(f{\text{s}}={\text{8 kHz}}\), and taken from the AURORA database (Zue et al. 1990; Varga and Steeneken 1993; ITU-T 2003). The noises used are white noise, USASI noise (United States of America Standards Institute, now ANSI), street, car, and babble noise. The mixed signals \({m_1}\left( n \right)\) and \({m_2}\left( n \right)\) are generated for different input SNRs, i.e. − 6, 0, and 6 dB. An example of a speech signal, a noise signal, and the mixed signal \({m_1}\left( n \right)\) is given in Fig. 5. The input SNR is selected to be 0 dB at the two microphones.

Fig. 5

Source, noise and mixing signal samples. [Top]: the speech signal and its spectrogram. [Middle]: the noise (white) and its spectrogram. [Bottom]: the mixing signal \({m_1}\left( n \right)\) and its spectrogram. The input SNR is selected to be 0 dB at the two microphones, and the real filters length is L = 128

5.2 Simulation parameters of the algorithms

In order to objectively compare our proposed algorithm with two other competitive ones, i.e. the conventional blind source separation (CBBSS) algorithm (Gerven and Van Compernolle 1995) and the variable step-size blind source separation (VSS-BBSS) algorithm (Djendi and Zoulikha 2014), we have selected the best parameters of each algorithm to achieve its best behavior with speech signals. The parameters of each algorithm are summarized in Table 3. We recall here that the CBBSS algorithm (Gerven and Van Compernolle 1995) uses a manual voice activity detector (MVAD) mechanism to control the adaptation of both estimated adaptive filters \({w_{12}}\left( n \right)\) and \({w_{21}}\left( n \right)\), whereas the VSS-BBSS algorithm (Djendi and Zoulikha 2014), an improved version of CBBSS, uses a variable step-size technique that performs as an automatic voice activity detector (AVAD). Recall that the adaptation of the estimated filters \({w_{12}}\left( n \right)\) and \({w_{21}}\left( n \right)\) by the proposed algorithm is done automatically thanks to the variable step-sizes given by relations (32) and (33), respectively. This modification allows our algorithm to adapt automatically without the need for any VAD system. These parameters are used in all the simulations presented in this paper.

Table 3 Control parameters of the conventional BBSS (CBBSS), the variable step-size BBSS (VSS-BBSS), and the proposed algorithms

From Table 3, we can see that the proposed and simulated algorithms share some parameters. The shared parameters are the lengths of the adaptive filters \({w_{12}}\left( n \right)\) and \({w_{21}}\left( n \right)\), which are selected equal to \(L=128\) or 256 (for more details, see Table 3). The considered simulation situation is exact modeling of the adaptive filter, i.e. the adaptive filter length is equal to that of the real one. The other parameters are specific to each algorithm. Moreover, the control parameters of our algorithm are the optimal ones; several simulations were carried out to obtain these optimal values. Finally, we note that the control parameters of Table 3 are used throughout all the simulations and experiments. All the presented simulations are carried out with speech and noise components sampled at 8 kHz and coded on 16 bits.

5.3 Time-domain outputs of the proposed algorithm

The simulated and proposed algorithms aim to extract the speech at the first output \({{\text{s}}_{\text{1}}}\left( {\text{n}} \right)\) and the noise components at the second output \({{\text{s}}_{\text{2}}}\left( {\text{n}} \right)\). As we are interested in speech enhancement, we focus only on the output \({{\text{s}}_{\text{1}}}\left( {\text{n}} \right)\) and on the behavior of the adaptive cross-filter \({w_{21}}\left( n \right)\). In Fig. 6, we illustrate the output \({{\text{s}}_{\text{1}}}\left( {\text{n}} \right)\) of the proposed, CBBSS, and VSS-BBSS algorithms with the parameters of Table 3. This figure shows the good performance of each algorithm in reducing the acoustic noise components at the output \({{\text{s}}_{\text{1}}}\left( {\text{n}} \right)\). No further performance comparison between the algorithms can be made from this figure.

Fig. 6

The output speech signals of [Top]: the CBBSS, [Middle]: the VSS-BBSS, and [Bottom]: the proposed algorithm. Each output has its spectrogram on the right. L = 256

5.4 Evaluation of the system mismatch (SM) criterion

The system mismatch (SM) criterion is often used to evaluate the convergence speed behavior of an algorithm. The SM criterion evaluates the distance between the estimated adaptive filter coefficients and the real ones. As we are interested only in the output \(~{s_1}(n)\), we focus on the adaptive filter \({{\varvec{w}}_{21}}(n)\) and compute the SM by the following relation (Hu and Loizou 2008):

$$S{M_{{\text{dB}}}}=10~lo{g_{10~}}\left( {\frac{{{{\left\| {{{\varvec{h}}_{21}} - {{\varvec{w}}_{21}}(n)} \right\|}^2}}}{{{{\left\| {{{\varvec{h}}_{21}}} \right\|}^2}}}} \right)$$
(34)

where \({h_{21}}\) is the real impulse response, and the symbol \(\left\| \cdot \right\|\) denotes the Euclidean norm. We have carried out many experiments to evaluate the SM criterion of the three algorithms, i.e. CBBSS, VSS-BBSS, and the proposed RBBSS. The real and adaptive filters have the same length, \(L=128\) and 256. Four noise types from the AURORA database (Zue et al. 1990) are used, i.e. white, USASI, babble, and street. The results obtained by the CBBSS, VSS-BBSS, and proposed algorithms are shown in Fig. 7 for an input SNR of \(~ - 3\;{\text{dB}}\) at the two microphones. From this figure, we can easily see the superiority of our proposed algorithm in terms of convergence speed in comparison with the other ones. We have used the same control parameters for each algorithm as given in Table 3, and the same input signals as described in Sect. 5.1.
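Relation (34) is straightforward to compute. The following Python sketch (with hypothetical names) evaluates the SM in dB between a true impulse response h21 and an adaptive estimate w21:

```python
import numpy as np

def system_mismatch_db(h21, w21):
    """System mismatch of relation (34): the squared Euclidean distance
    between the real response h21 and the estimate w21(n), normalized
    by the energy of h21, expressed in dB."""
    num = np.sum((h21 - w21) ** 2)
    den = np.sum(h21 ** 2)
    return 10.0 * np.log10(num / den)
```

A zero estimate gives 0 dB by construction, and the SM decreases (becomes more negative) as the adaptive filter converges toward the real response.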

Fig. 7

The system mismatch (SM) comparison between the CBBSS, VSS-BBSS and the proposed algorithms for L = 128 [In left], and L = 256 [in right]. The parameters of each algorithm are given in Table 3

5.5 Step-sizes analysis of the proposed algorithm

In order to analyze the behavior of the proposed algorithm, and since we are interested in the speech enhancement problem at the output s1(n), we focus on relation (32) and its evolution in the time domain. Under the same simulation conditions as Sects. 5.2 and 5.3, we have plotted the evolution of the step-size \({\nabla _{{\text{1 }}}}\left( {\text{n}} \right)\). Figures 8 and 9 give the time evolution of the step-size of relation (32) for two adaptive filter lengths, L = 128 and 256. The input speech signal is shown on the same figures.

Fig. 8

Original speech signal (in black), Manual VAD (in green), and the automatic VAD obtained by relation (32) [in red]. The control parameters are the same as given in Table 3 for the proposed algorithm. The adaptive and real filter length is L = 128. (Color figure online)

Fig. 9

Original speech signal (in black), Manual VAD (in green), and the automatic VAD obtained by relation (32) [in red]. The control parameters are the same as given in Table 3 for the proposed algorithm. The adaptive and real filter length is L = 256. (Color figure online)

From Fig. 8 (for L = 128) and Fig. 9 (for L = 256), we can observe that the step size \({\nabla _{{\text{1 }}}}\left( {\text{n}} \right)\) of the filter \({{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)\) is large when the cross-correlation factor \({{\mathbf{r}}_{\text{s}1\text{m}2}}\left( n \right)\) is small, i.e. it takes large values when the speech signal is absent and small values in the opposite case. This configuration allows the filter \({{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)\) to be adjusted during speech absence periods and frozen in the opposite situation, as predicted by the analysis of Sect. 4.3. This automatic adjustment mechanism formulates an adaptive noise cancellation (ANC) system with a noise-only reference, and makes it possible to cancel the noise components at the output \({s_{{\text{1 }}}}\left( {\text{n}} \right)\). On the other hand, an inverse relation between the variation of the step-size \({\nabla _{{\text{2 }}}}\left( {\text{n}} \right)\) and the cross-correlation factor \({{\mathbf{r}}_{\text{s}2\text{m}1}}\left( n \right)\) is observed, i.e. \({\nabla _{{\text{2 }}}}\left( {\text{n}} \right)\) is large when \({{\mathbf{r}}_{\text{s}2\text{m}1}}\left( n \right)\) takes small values, which occurs in speech presence periods. This mechanism allows the adaptive filter \({{\text{w}}_{{\text{12}}}}\left( {\text{n}} \right)\) to be adjusted to suppress the speech signal at the output \({s_{{\text{2 }}}}\left( {\text{n}} \right)\) and to keep the noise source components at that same output.
This automatic mechanism, which alternates the updates of the adaptive filters \({{\text{w}}_{{\text{21}}}}\left( {\text{n}} \right)\) and \({{\text{w}}_{{\text{12}}}}\left( {\text{n}} \right)\), leads to a blind separation of the speech and noise components at the outputs \({s_{{\text{1 }}}}\left( {\text{n}} \right)\) and \({s_{{\text{2 }}}}\left( {\text{n}} \right)\), respectively, without any a priori information about them, i.e. only the mixed signals are available at the inputs of the proposed algorithm.

5.6 Evaluation of the cepstral distance (CD) criterion

The cepstral distance (CD) criterion is used in this section to quantify the processing distortion of the output speech signal of each algorithm, i.e. CBBSS, VSS-BBSS, and the proposed algorithm. The CD criterion is evaluated as the log-spectrum distance between the original speech signal \(s\left( n \right)~\) and the output speech signal \({s_1}\left( n \right)~\) of each algorithm (Hu and Loizou 2008). The CD is computed only in speech presence periods and is given by the following relation:

$$C{D_{{\text{dB}}}}=\frac{{10}}{M}\mathop \sum \limits_{{m=0}}^{{M - 1}} {\log _{10}}\mathop \sum \limits_{{n=Tm}}^{{Tm+T - 1}} {\left( {c{p_s}\left( n \right) - c{p_{{s_1}}}\left( n \right)} \right)^2}$$
(35)

where \(c{p_s}(n)=\frac{1}{{2\pi }}\mathop \int_{{ - \pi }}^{\pi } \log \left| {S\left( \omega \right)} \right|{e^{j\omega n}}d\omega\) and \(c{p_{{s_1}}}(n)=\frac{1}{{2\pi }}\mathop \int_{{ - \pi }}^{\pi } \log \left| {{S_1}\left( \omega \right)} \right|{e^{j\omega n}}d\omega\) are the \(n{\text{th}}\) real cepstral coefficients of the signals \(s(n)\) and \(~{s_1}(n)\), respectively. We recall here that \(S(\omega )\) and \({S_1}(\omega )\) are the short-time Fourier transforms (STFT) of the original speech signal \(s(n)\) and the enhanced one \({s_1}(n)\), respectively. \('T'\) is the frame length over which the CD criterion is averaged, and \('M'\) is the number of segments where only speech is present. We have estimated the CD criterion for three input SNRs at the two microphones, i.e. − \(6{\text{~dB}},{\text{~}}0{\text{~dB}}\) and \({\text{~}}6{\text{~dB}}\). In addition, we have used four types of noise components from the AURORA database (Zue et al. 1990; Varga and Steeneken 1993; ITU-T 2003) to generate the noisy observations, namely white, USASI, babble, and street noises. The simulation parameters of each algorithm are similar to those of the previous experiments and are also summarized in Table 3. The obtained CD results are reported in Fig. 10.
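A discrete-time sketch of relation (35) in Python is given below. The real cepstra are computed with an FFT-based approximation of the integrals above, and, as a simplifying assumption made for this example, every full frame is used rather than only speech-presence frames; the frame length and number of cepstral coefficients are also assumptions.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=16):
    """Real cepstral coefficients: inverse FFT of the log magnitude
    spectrum, a discrete counterpart of the integrals defining cp_s(n)."""
    spectrum = np.abs(np.fft.fft(frame)) + 1e-12   # avoid log(0)
    return np.real(np.fft.ifft(np.log(spectrum)))[:n_coeffs]

def cepstral_distance_db(s, s1, T=256):
    """Relation (35): CD averaged over the M full frames of length T.
    Simplification: all frames are used, not only speech-presence ones."""
    M = min(len(s), len(s1)) // T
    total = 0.0
    for m in range(M):
        cs = real_cepstrum(s[m * T:(m + 1) * T])
        ce = real_cepstrum(s1[m * T:(m + 1) * T])
        total += np.log10(np.sum((cs - ce) ** 2) + 1e-12)
    return 10.0 * total / M
```

Identical input and output signals drive the CD toward very negative values, while spectral distortion between them raises it, which is the behavior reported in Fig. 10.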

Fig. 10

The cepstral distance (CD) evaluation by: (1) BBSS algorithm, (2) the VSS-BSS algorithm, and (3) the proposed algorithm. The simulation parameters of each algorithm are the same as reported on Table 3 except the length of the adaptive filters L = 128. The input SNRs are − 6 dB, 0 dB and 6 dB

The results of Fig. 10 clearly show the efficiency of the proposed RBBSS algorithm in providing an output speech signal that is very close to the original one, with minimal spectral distortion. We have also noted that the proposed algorithm alters the speech signal less than the other ones.

5.7 Evaluation of the segmental SNR (SegSNR) criterion

In this section, we analyze the noise reduction performance of the proposed algorithm in terms of the segmental signal-to-noise ratio (SegSNR) criterion. The SegSNR criterion is computed on frames of \('N~'\) samples between the original speech signal \(s(n)\) and its enhanced version \(~~{s_1}(n)\) for each algorithm. This criterion is estimated as follows (Sayed 2003; Zoulikha and Djendi 2016):

$$SegSN{R_{dB}}=\frac{{10}}{M}\mathop \sum \limits_{{m=0}}^{{M - 1}} lo{g_{10}}\left( {\frac{{\mathop \sum \nolimits_{{n=N{\text{m}}}}^{{N{\text{m}}+N - 1}} {{\left| {{\text{s}}\left( n \right)} \right|}^2}}}{{\mathop \sum \nolimits_{{n=N{\text{m}}}}^{{N{\text{m}}+N - 1}} {{\left| {s(n) - {s_1}\left( n \right)} \right|}^2}}}} \right)$$
(36)

where the parameters \('M'\) and \('N'~\) are the number of frames and the frame length, respectively. We note that at the output we get \('M'\) values of the SegSNR criterion, each averaged over \(~'N'~\) samples. The symbol \(\left| {~.} \right|\) stands for the absolute value, and \(lo{g_{10}}\) is the base-10 logarithm. We recall here that all \('M'\) frames correspond to speech presence periods only. The simulation parameters are the same as given in Table 3. We have evaluated the SegSNR criterion for three input SNRs, i.e. − \(6{\text{~dB}},{\text{~~}}0{\text{~dB}}\) and \(~6~{\text{dB}}\). Moreover, four types of noise are used to generate the noisy observations; these noise components, which are white, USASI, babble, and street noises, are taken from the AURORA database (Zue et al. 1990; Varga and Steeneken 1993; ITU-T 2003). The obtained results are reported in Fig. 11.
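Relation (36) can be sketched in Python as follows. As a simplifying assumption for this example, all full frames are used instead of only speech-presence frames, and a small constant guards the denominator against a zero error signal.

```python
import numpy as np

def seg_snr_db(s, s1, N=256):
    """Segmental SNR of relation (36), averaged over the M full frames
    of N samples. Simplification: every frame is used, not only the
    speech-presence frames retained in the paper."""
    M = min(len(s), len(s1)) // N
    vals = []
    for m in range(M):
        ref = s[m * N:(m + 1) * N]
        err = ref - s1[m * N:(m + 1) * N]
        vals.append(10.0 * np.log10(np.sum(ref ** 2)
                                    / (np.sum(err ** 2) + 1e-12)))
    return np.mean(vals)
```

A higher SegSNR at the output \({s_1}(n)\) indicates stronger noise suppression, which is the quantity compared across algorithms in Fig. 11.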

Fig. 11

The Segmental SNR (SegSNR) criterion evaluation by: (1) CBBSS algorithm, (2) the VSS-BSS algorithm, and (3) the proposed algorithm. The parameters of the simulation are the same as reported on Table 3 except the length of the adaptive filter L = 128. The input SNRs are − 6 dB, 0 dB and 6 dB

According to the obtained results, we can easily see that the proposed RBBSS algorithm behaves more efficiently than the other algorithms and leads to a higher SNR at the output. This means that the proposed algorithm suppresses more noise at the output than the state-of-the-art algorithms, i.e. the CBBSS and VSS-BBSS algorithms. We also conclude that the proposed algorithm performs well in different situations where correlated and uncorrelated noises are present. Finally, the obtained SegSNR results are further proof of the performance superiority of the proposed algorithm when combined with the BSS structure to restore the speech source signal in a blind situation, where no a priori information about the target is available.

6 Conclusion

In this paper, we have proposed a new approach for speech enhancement application. The proposed approach is adaptive and based on the combination between a new automatic adaptive algorithm with the backward blind source separation structure, and allows to automatically adjust the coefficients of the cross-filters.

Intensive experiments were conducted to validate the performance of the proposed algorithm in comparison with two state-of-the-art algorithms, i.e. the classical BBSS and its variable step-size version (VSS-BBSS). The obtained results, expressed in terms of system mismatch, have shown that the proposed algorithm converges quickly to the optimal solutions; this behavior is obtained thanks to the normalization by the norm of the output filtering errors. The obtained CD values have confirmed that the proposed algorithm does not distort the output speech signal, especially in the case of loosely spaced microphones (about − 14 dB of minimum CD values). The SegSNR results have also shown that the proposed algorithm reduces the acoustic noise components by about 50 dB at the output under several input SNR conditions. The residual noise is very small in the case of our proposed algorithm and does not affect the speech intelligibility at the output.

Finally, we conclude that all the obtained results in terms of the CD and SegSNR criteria have shown the superiority of the proposed algorithm in comparison with the other ones. The obtained results have proven the efficiency of the proposed algorithm and show that it can be a good candidate and alternative for speech enhancement and acoustic noise reduction applications. As future work, the proposed algorithm can be combined with active learning techniques for live-stream audio (speech) analysis, one of the contemporary issues in the domain.