1 Introduction

Enhancing a speech signal corrupted by uncorrelated additive noise remains a challenging task, owing to the shortcomings of existing speech enhancement techniques in real-world noise conditions. The presence of noise degrades the performance of speech processing systems, including voice coders, speech recognizers, hearing aids and mobile phones. The objective of speech enhancement is to improve the intelligibility and perceptual quality of speech by minimizing the effect of noise. Existing techniques for this task include Wiener filtering (Deller et al. 2000; Haykin 1996), spectral subtraction (Deller et al. 2000; Boll 1979) and the wavelet transform (WT) (Seok and Bae 1997; Bahoura and Rouat 2001, 2006; Cohen 2001; Lu and Wang 2003; Chen and Wang 2004; Hu and Loizou 2004), among others.

An emerging trend in the speech enhancement domain is the use of a filter bank based on a psychoacoustic model of the human auditory system (critical bands). The underlying principle is that embedding such a psychoacoustic model in the filter bank can improve the perceptual quality and intelligibility of the enhanced speech. Moreover, it is well known that the human auditory system can roughly be described as a non-uniform band-pass filter bank, and that humans can detect a speech signal in noisy environments without prior knowledge of the noise (Taşmaz and Erçelebi 2008). Different frequency scales have been proposed to account for this perceptual aspect of hearing (Mel, Bark, ERB, and so on). It is worth mentioning that the majority of perceptual speech enhancement approaches are based on the wavelet packet transform (Johnson et al. 2007). Wavelet packet transforms have also been combined effectively with other denoising methods to improve the performance of wavelet-based speech enhancement. Accordingly, many hybrid speech enhancement systems combine the WT with other tools such as Wiener filtering (Mahmoudi 1997), spectral subtraction (Shao and Chang 2007) and the Ephraim and Malah approach (Taşmaz and Erçelebi 2008). Daqrouq et al. (2010) investigated wavelet filtering via multistage convolution with reverse biorthogonal wavelets in the high- and low-pass frequency bands of the speech signal: the signal is decomposed into high and low frequency bands, and the noise in each band is removed individually, in separate stages, by wavelet filters. This approach yields better results because it does not cut out speech information, as conventional thresholding does (Daqrouq et al. 2010). Vaz et al. (2013) proposed a method for enhancing speech collected in extremely noisy environments, such as those found during magnetic resonance imaging (MRI) scans, using a two-step noise suppression algorithm. First, they used probabilistic latent component analysis to learn dictionaries of the noise and (speech + noise) portions of the data, and used these to factor the noisy spectrum into estimated speech and noise components. Second, they applied a wavelet packet analysis with a wavelet threshold that minimizes the KL divergence between the estimated speech and noise to achieve further noise suppression (Vaz et al. 2013).

In this paper, we propose a new noise reduction and speech enhancement technique. It integrates a newly proposed WT, which we call the stationary bionic wavelet transform (SBWT), with the maximum a posteriori estimator of the magnitude-squared spectrum (MSS-MAP) (Yang and Loizou 2011). In Yang and Loizou (2011), statistical estimators of the magnitude-squared spectrum (MSS) are derived under the assumption that the MSS of the noisy speech signal can be computed as the sum of the (clean) signal and noise magnitude-squared spectra. Maximum a posteriori (MAP) and minimum mean square error (MMSE) estimators are derived based on a Gaussian statistical model. The gain function of the MAP estimator was found to be the same as the gain function used in the ideal binary mask that is extensively used in computational auditory scene analysis: it is binary, taking the value 1 if the local signal-to-noise ratio (SNR) exceeds 0 dB and the value 0 otherwise. By modeling the local instantaneous SNR as an F-distributed random variable, soft masking techniques incorporating SNR uncertainty were derived. In particular, the soft masking technique that weights the noisy magnitude-squared spectrum by the a priori probability that the local SNR exceeds 0 dB was shown to be identical to the Wiener gain function (Yang and Loizou 2011). The results reported in Yang and Loizou (2011) indicated that the proposed estimators yielded significantly better speech quality than conventional MMSE spectral power estimators, with lower residual noise and lower speech degradation. The SBWT (Talbi and Aicha 2014) is introduced in order to solve the perfect reconstruction problem associated with the bionic wavelet transform (BWT). The MSS-MAP estimation (Yang and Loizou 2011) is used to estimate the speech in the SBWT domain.

The rest of this paper is organized as follows: Sect. 2 describes the proposed speech enhancement technique, giving a detailed overview of the SBWT and the different steps of the technique. Section 3 deals with MSS-MAP estimation in the SBWT domain. Section 4 is devoted to the evaluation metrics. Section 5 presents the results and discussion. Finally, the conclusion is given in Sect. 6.

2 The proposed technique

In this work, we propose a new speech enhancement technique that integrates a newly proposed wavelet transform, which we call the SBWT, with MSS-MAP estimation. The SBWT is introduced in order to solve the perfect reconstruction problem associated with the BWT. The MSS-MAP estimation is used for speech estimation in the SBWT domain. The block diagram of the proposed technique is presented in Fig. 1.

Fig. 1
figure 1

The block diagram of the proposed technique

As shown in Fig. 1, the proposed technique first applies the SBWT to the noisy speech signal. Each of the resulting noisy stationary bionic wavelet coefficients, \({\text{w}}_{\text{i}} , 1 \le {\text{i}} \le 8\), is then denoised separately, yielding eight denoised stationary bionic wavelet coefficients, \({\hat{\text{w}}}_{\text{i}} , 1 \le {\text{i}} \le 8\). The denoising of each coefficient \({\text{w}}_{\text{i}}\) is performed using the technique based on MSS-MAP estimation (Yang and Loizou 2011). Finally, the enhanced speech signal is obtained by applying the inverse transform, SBWT−1, to the denoised coefficients \({\widehat{\text{w}}}_{\text{i}} , 1 \le {\text{i}} \le 8\). A runnable sketch of this chain is given below.
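The following minimal sketch, assuming the numpy and PyWavelets (pywt) packages, illustrates this chain. It is an illustration only, not the paper's implementation: the plain SWT stands in for the full SBWT (the K factor of Sect. 2.2 is omitted, i.e. K = 1), and a simple soft threshold stands in for the per-band MSS-MAP estimator that the technique actually applies.

    import numpy as np
    import pywt

    def enhance(noisy, wavelet="db10", level=3):
        # Pad so the length is divisible by 2**level, as pywt.swt requires.
        n = len(noisy)
        x = np.pad(noisy, (0, (-n) % (2 ** level)))
        # Forward SWT; multiplying each band by the K factor of Eq. (6)
        # would turn these into SBWT coefficients (K = 1 here).
        bands = pywt.swt(x, wavelet, level=level)
        denoised = []
        for cA, cD in bands:
            # Stand-in denoiser: universal soft threshold per band. The
            # proposed technique instead applies MSS-MAP to each band.
            t = np.median(np.abs(cD)) / 0.6745 * np.sqrt(2 * np.log(len(cD)))
            denoised.append((cA, pywt.threshold(cD, t, mode="soft")))
        # Inverse transform (the SBWT^-1 stage of Fig. 1).
        return pywt.iswt(denoised, wavelet)[:n]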

2.1 The bionic wavelet transform

Yao and Zhang (2001) proposed the BWT as an adaptive wavelet transform designed specifically to model the human auditory system. The term 'bionic' means that the BWT is rooted in an active biological mechanism (Johnson et al. 2007), and the BWT decomposition is both perceptually scaled and adaptive (Johnson et al. 2007). The perceptual aspect of this transform comes from the logarithmic spacing of the baseline scale variables, which are designed to match basilar membrane spacing (Johnson et al. 2007). Two adaptation factors then control the time support used at each scale, based on a non-linear perceptual model of the auditory system (Johnson et al. 2007). The basis of this transform is the Giguere–Woodland non-linear transmission line model of the auditory system (Giguere 1993; Giguere and Woodland 1994), an active-feedback electro-acoustic model incorporating the auditory canal, middle ear and cochlea (Johnson et al. 2007). The model yields estimates of the resistance and the time-varying acoustic compliance along the displaced basilar membrane, as functions of a physiological acoustic mass function, a cochlear frequency-position mapping, and feedback factors representing the active mechanisms of the outer hair cells. The net result can be seen as a technique for estimating the time-varying quality factor \(Q_{eq}\) of the cochlear filter banks as a function of the input sound waveform. Giguere and Woodland (1994), Zheng et al. (1999) and Yao and Zhang (2002) give full details of this model. The adaptive nature of the BWT is ensured by a time-varying linear factor T(a, τ), which represents the scaling of the cochlear filter bank quality factor \(Q_{eq}\) at each scale over time. Incorporating this factor directly into the scale factor of a Morlet mother wavelet yields the following formula:

$$X_{BWT}\left( a,\tau \right) = \frac{1}{T\left( a,\tau \right)\sqrt{a}}\int x\left( t \right)\,\tilde{\varphi}^{\ast}\left( \frac{t - \tau}{a \cdot T\left( a,\tau \right)} \right)e^{-jw_{0}\left( \frac{t - \tau}{a} \right)}\,dt$$
(1)

where \(a\) and \(\tau\) denote the scale and time-shift variables, respectively, and \(\tilde{\varphi }\) is expressed as follows:

$$\tilde{\varphi }\left( t \right) = e^{{ - \left( {\frac{t}{{T_{0} }}} \right)^{2} }}$$
(2)

The function \(\tilde{\varphi }\left( t \right)\) is the amplitude envelope of the Morlet mother wavelet, and \(w_{0}\) is the base fundamental frequency of the unscaled mother wavelet, taken here as \(w_{0} = 15{,}165.4\) Hz for the human auditory system, per the original work of Yao and Zhang (2002). The factor \(T_{0}\) represents the initial time support. The scale variable \(a\) is discretized using pre-determined logarithmic spacing across the desired frequency range, so that the center frequency at each scale is given by (Johnson et al. 2007):

$$w_{m} = w_{0} /\left( {1.1623} \right)^{m} , m = 0, 1, 2, \ldots$$
(3)
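As a quick numerical check of Eq. (3), the following sketch (plain numpy) evaluates the center frequencies over the 22 scales m = 7, …, 28 used in the implementation described next:

    import numpy as np

    w0 = 15165.4                  # base frequency (Hz), Yao and Zhang (2002)
    m = np.arange(7, 29)          # the 22 scales m = 7, ..., 28
    wm = w0 / 1.1623 ** m         # Eq. (3)
    print(wm.round(1))            # ~5294 Hz down to ~225 Hz, log-spaced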

In this implementation, based on the original work of Yao and Zhang on cochlear implant coding (Yao and Zhang 2002), coefficients at 22 scales, m = 7, …, 28, are computed using numerical integration of the continuous wavelet transform. These 22 scales correspond to center frequencies logarithmically spaced from \(225\) to \(5300\) Hz, as the sketch above confirms. The adaptation factor T(a, τ) at each time and scale is calculated using the following formula (Johnson et al. 2007):

$$T\left( {a,\tau + \Delta \tau } \right) = \frac{1}{{\left( {1 - G_{1} \frac{{C_{s} }}{{C_{s} + \left| {X_{BWT} \left( {a,\tau } \right)} \right|}}} \right) \cdot \left( {1 + G_{2} \left| {\frac{\partial }{\partial \tau }X_{BWT} \left( {a,\tau } \right)} \right|} \right)}}$$
(4)

where \(G_{1}\) is the active gain factor representing the outer hair cell active resistance function, \(G_{2}\) is the active gain factor representing the time-varying compliance of the basilar membrane, and \(C_{s} = 0.8\) is a constant representing non-linear saturation effects in the cochlear model (Johnson et al. 2007). In practice, the partial derivative in Eq. (4) is approximated by the first difference of the previous points of the BWT at that scale (Johnson et al. 2007). From Eq. (1), we can see that the factor T(a, τ) affects the duration of the amplitude envelope of the wavelet but not the frequency of the associated complex exponential. A useful way to think of the BWT is therefore as a mechanism for adapting the time support of the underlying wavelet according to the quality factor \(Q_{eq}\) of the corresponding cochlear filter model at each scale. Yao and Zhang (2002) proved that the bionic coefficients \(X_{BWT}(a,\tau)\) can be computed as the product of the original WT coefficients \(X_{WT}(a,\tau)\) and a multiplying factor K(a, τ) that is a function of the adaptation factor T(a, τ). For the Morlet mother wavelet, this adaptive multiplying factor is given by:

$$X_{BWT} \left( {a,\tau } \right) = K\left( {a,\tau } \right)X_{WT} \left( {a,\tau } \right)$$
(5)

with

$$K\left( {a,\tau } \right) = \frac{\sqrt \pi }{C}\frac{{T_{0} }}{{\sqrt {1 + T^{2} \left( {a,\tau } \right)} }}$$
(6)

where C is a normalizing constant calculated from the integral of the squared mother wavelet. This representation yields an efficient computational technique for calculating BWT coefficients directly from the original WT coefficients, without computing the numerical integration of Eq. (1) at each scale and time (Johnson et al. 2007); a sketch of this computation is given below. There are several key differences between a filterbank-based wavelet packet transform (WPT) using an orthonormal wavelet such as the Daubechies family, as used in the comparative baseline technique, and the discretized continuous wavelet transform (CWT) using the Morlet mother wavelet, as used in the BWT. One is that the WPT is perfectly reconstructable, while the discretized CWT is an approximation whose exactness depends on the placement and number of the selected frequency bands. Another is that the frequency support of the orthonormal wavelet families used for WPTs and DWTs covers a broader bandwidth, while the Morlet wavelet consists of a single frequency with an exponentially decaying time support. The Morlet mother wavelet is thus more "frequency focused" at each scale, which is what allows the direct adaptation of the time support, the central mechanism of the BWT's adaptation.
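A per-scale sketch of Eqs. (4)–(6) follows. Here \(C_{s} = 0.8\) comes from the text, while the values of \(G_{1}\), \(G_{2}\), \(T_{0}\) and C are illustrative placeholders only (the paper defers their values to Johnson et al. 2007 and Yao and Zhang 2002):

    import numpy as np

    def bwt_from_wt(X_wt, G1=0.87, G2=45.0, Cs=0.8, T0=1.0, C=1.0):
        # X_wt: complex WT coefficients at one scale over time.
        # Cs = 0.8 is given in the text; G1, G2, T0 and C are
        # placeholder values for illustration only.
        T = np.ones(len(X_wt))                 # adaptation factor T(a, tau)
        X_bwt = np.zeros(len(X_wt), dtype=complex)
        for n in range(len(X_wt)):
            K = (np.sqrt(np.pi) / C) * T0 / np.sqrt(1.0 + T[n] ** 2)  # Eq. (6)
            X_bwt[n] = K * X_wt[n]                                    # Eq. (5)
            if n + 1 < len(X_wt):
                # First difference approximates the derivative in Eq. (4).
                dX = abs(X_bwt[n] - X_bwt[n - 1]) if n > 0 else 0.0
                T[n + 1] = 1.0 / ((1.0 - G1 * Cs / (Cs + abs(X_bwt[n])))
                                  * (1.0 + G2 * dX))                  # Eq. (4)
        return X_bwt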

2.2 Stationary bionic wavelet transform (SBWT)

As previously mentioned, our speech enhancement system uses a new wavelet transform, which we call the SBWT. This transform is obtained by replacing the discretized CWT used in the BWT computation with the stationary wavelet transform (SWT). Figure 2 shows the steps of the SBWT computation and of its inverse, SBWT−1. As shown in this figure, the stationary bionic wavelet coefficients are obtained by multiplying the stationary wavelet coefficients, produced by applying the SWT to the input signal, by the K factor of Eq. (6). The steps of the SBWT computation are thus the same as those of the BWT computation, the only difference being that the discretized CWT is replaced by the SWT. The signal is reconstructed by first multiplying the stationary bionic wavelet coefficients by 1/K and then applying SWT−1 to the resulting coefficients.

Fig. 2
figure 2

The stationary bionic wavelet transform (SBWT) and its inverse (SBWT−1)

In the implementation of the SWT and the SBWT, we used the Daubechies mother wavelet with ten vanishing moments (https://www.nag.co.uk/numeric/MB/manual_22_1/pdf/C09/c09aa.pdf).
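As a quick check of the SWT stage (and hence of the SBWT's invertibility), the following sketch, assuming the PyWavelets package, round-trips a random signal through swt/iswt with db10 (the Daubechies wavelet with ten vanishing moments); the reconstruction error is at machine precision:

    import numpy as np
    import pywt

    rng = np.random.default_rng(0)
    x = rng.standard_normal(4096)            # stand-in for a speech frame
    coeffs = pywt.swt(x, "db10", level=3)    # db10: ten vanishing moments
    y = pywt.iswt(coeffs, "db10")
    print(np.max(np.abs(x - y)))             # machine precision, ~1e-13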

Tables 1 and 2 list the values of max(|x − y|) between the original speech signal x and the reconstructed speech signal y obtained after applying the BWT or the SBWT and its inverse. The signal x is obtained by applying the MSS-MAP technique (Yang and Loizou 2011) to the noisy speech signal (Fig. 4). Figure 4 shows the steps of the procedure followed in this paper to verify the perfect reconstruction of each transform, BWT or SBWT.

Table 1 Case of female voice
Table 2 Case of male voice
Fig. 3
figure 3

Filter bank implementation of SWT

2.2.1 Stationary wavelet transform (SWT)

In both the discrete wavelet transform (DWT) and the WPT, the coefficients are downsampled after filtering, which prevents redundancy and allows the same pair of filters to be used at different levels. As a result, these transforms suffer from a lack of shift invariance: small shifts in the input signal can cause major variations in the distribution of energy between coefficients at different levels and may cause errors in reconstruction (Mortazavi and Shahrtash 2008). The SWT solves this problem by eliminating the downsampling step after filtering at each level. Without downsampling, the number of coefficients at each level equals the length of the original signal. Figure 3 shows the decomposition of a signal by the SWT up to two levels. When the downsampling operators are eliminated, the high- and low-pass filters must be modified for the next level of decomposition: the low-pass and high-pass filters at each level are upsampled by inserting a zero between each pair of filter coefficients of the previous level, a procedure known as the à trous algorithm (Mortazavi and Shahrtash 2008; Shensa 1992) and illustrated below. Denoising a signal by the SWT follows the same three steps as with the DWT (Mortazavi and Shahrtash 2008).
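The following minimal sketch, again assuming PyWavelets, illustrates the à trous upsampling step on the db10 decomposition low-pass filter; it is an illustration of the algorithm just described, not code from the cited works:

    import numpy as np
    import pywt

    lo = np.asarray(pywt.Wavelet("db10").dec_lo)   # level-1 low-pass taps
    lo_up = np.zeros(2 * len(lo) - 1)
    lo_up[::2] = lo          # zeros between taps -> filter for next level
    print(len(lo), len(lo_up))                     # 20 -> 39 taps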

It is worth mentioning that to compute the error max(|x − y|) and verify the perfect reconstruction of the two transforms (BWT and SBWT), we first enhanced the speech signal with the MSS-MAP-based technique (Yang and Loizou 2011). This step is needed because the clean speech signal is generally not available; only the noisy speech signal is known. Hence, to compute the error between the original and reconstructed signals, we must first suppress the noise corrupting the original signal, and we chose MSS-MAP (Yang and Loizou 2011) for this purpose.

In Fig. 4, the noisy speech signal is obtained by corrupting the clean speech signal with car noise at SNR = 10 dB; the values listed in Tables 1 and 2 are obtained in this case. These values clearly show that the SBWT yields a lower error between the original signal x and the reconstructed signal y than the BWT. The BWT introduces some distortion in the reconstructed speech signals compared to the original speech signals, especially when the number of scales is N = 22. For the BWT, the error between the original and reconstructed signals (Table 1) is reduced when using N = 30 instead of N = 22.

Fig. 4
figure 4

The procedure of verifying the perfect reconstruction of the wavelet transform (BWT or SBWT)

3 Maximum a posteriori estimator of magnitude-squared spectrum in SBWT domain

Conventional speech enhancement techniques based on thresholding in the wavelet domain may degrade the original speech signal, especially for unvoiced sounds. Therefore, many wavelet-based speech enhancement systems use other tools such as Wiener filtering, spectral subtraction and MMSE-STSA estimation (Taşmaz and Erçelebi 2008; Ephraim and Malah 1984). The latter is used with undecimated wavelet packet perceptual filterbanks in the speech enhancement system proposed by Taşmaz and Erçelebi (2008). That system first performs a perceptual filterbank decomposition (critical bands–undecimated wavelet packet, CB-UWP) of the degraded speech signal by applying the undecimated wavelet packet perceptual transform. This decomposition yields seventeen critical sub-bands, defined by reference to a psychoacoustic model (Taşmaz and Erçelebi 2008). Each critical sub-band is denoised using the speech enhancement technique proposed by Ephraim and Malah (1984), and the clean speech estimate is finally obtained by CB-UWP reconstruction from the denoised sub-band signals. This work adopts the speech enhancement principle of Taşmaz and Erçelebi (2008) (Fig. 1), with the CB-UWP decomposition replaced by the SBWT decomposition and the MMSE-STSA estimation replaced by MSS-MAP estimation. As in the system of Taşmaz and Erçelebi (2008), each stationary bionic wavelet coefficient \({\text{w}}_{\text{i}} , 1 \le {\text{i}} \le 8\) (Fig. 1), obtained by applying the SBWT to the noisy speech signal, is processed as a noisy speech signal and denoised using the MSS-MAP estimator introduced by Yang and Loizou (2011); the two gain functions underlying this estimator are sketched below.
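For orientation, the two gain functions described in Sect. 1 can be sketched as follows. This is a simplified illustration of the binary MAP gain and its Wiener-type soft-masking counterpart, not the full MSS-MAP estimator of Yang and Loizou (2011), whose derivation involves the F-distributed local SNR model:

    import numpy as np

    def binary_map_gain(xi):
        # Binary gain described in Sect. 1: 1 where the local SNR
        # exceeds 0 dB (linear ratio 1), 0 elsewhere.
        return (xi > 1.0).astype(float)

    def wiener_soft_gain(xi):
        # Soft-masking counterpart; the weighting was shown in Yang and
        # Loizou (2011) to coincide with the Wiener gain xi / (1 + xi).
        return xi / (1.0 + xi)

    xi = np.array([0.25, 1.0, 4.0])          # example local SNRs (linear)
    print(binary_map_gain(xi))               # [0. 0. 1.]
    print(wiener_soft_gain(xi))              # [0.2 0.5 0.8]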

As previously mentioned, the SBWT is introduced to solve the perfect reconstruction problem associated with the BWT. Moreover, among wavelet transforms (Biswas et al. 2014; Singh and Mutawa 2016), the SBWT tends to decorrelate the data (Bahoura and Rouat 2006), which simplifies noise cancellation. In addition, applying the MSS-MAP in the SBWT domain (Fig. 1) to denoise the noisy sub-bands \({\text{w}}_{\text{i}} , 1 \le {\text{i}} \le 8\) adapts the noise and speech estimation better than applying the MSS-MAP to the entire noisy speech signal. These facts motivated us to propose this new speech enhancement technique (SBWT/MSS-MAP).

4 The evaluation metrics

To test the performance of the proposed speech enhancement technique, four objective quality measures were used: SNR, segmental signal-to-noise ratio (SSNR), Itakura–Saito distance and perceptual evaluation of speech quality (PESQ).

4.1 Signal-to-noise ratio

The following formula was used to calculate the SNR of enhanced speech signals:

$$SNR\left( {dB} \right) = 10 \cdot log_{10} \left( {\frac{{\mathop \sum \nolimits_{n = 0}^{N - 1} x^{2} \left( n \right)}}{{\mathop \sum \nolimits_{n = 0}^{N - 1} \left( {\hat{x}\left( n \right) - x\left( n \right)} \right)^{2} }}} \right)$$
(7)

where x(n) and \({\hat{\text{x}}}\left( {\text{n}} \right)\) are the original and enhanced signals, respectively, and N is the number of samples in the original signal.
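A direct transcription of Eq. (7), assuming numpy arrays for both signals:

    import numpy as np

    def snr_db(x, x_hat):
        # Eq. (7): x is the original signal, x_hat the enhanced signal.
        return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x_hat - x) ** 2))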

4.2 Segmental signal to noise ratio

The frame-based segmental SNR is an objective measure of speech quality, computed by averaging frame-level estimates as follows:

$$SSNR\left( {dB} \right) = \frac{1}{M}\mathop \sum \limits_{m = 0}^{M - 1} 10 \cdot log_{10} \left( {\frac{{\mathop \sum \nolimits_{{n = N_{m} }}^{{N_{m} + N - 1}} x^{2} \left( n \right)}}{{\mathop \sum \nolimits_{{n = N_{m} }}^{{N_{m} + N - 1}} \left( {\hat{x}\left( n \right) - x\left( n \right)} \right)^{2} }}} \right)$$
(8)

where x(n) and \(\hat{x}\left( n \right)\) are the original and enhanced signals, respectively, M is the number of frames, N is the number of samples in each short-time frame and Nm is the beginning of the m-th frame. Since the SNR can become very small and negative during silence periods, the per-frame values are limited to the range [−10, 35] dB.
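A sketch of Eq. (8) with non-overlapping frames and the [−10, 35] dB limiting applied per frame; the frame length is an illustrative choice, not one fixed by the paper:

    import numpy as np

    def ssnr_db(x, x_hat, frame_len=256):
        # Eq. (8) with non-overlapping frames; frame_len is illustrative.
        vals = []
        for m in range(len(x) // frame_len):
            s = slice(m * frame_len, (m + 1) * frame_len)
            num = np.sum(x[s] ** 2)
            den = np.sum((x_hat[s] - x[s]) ** 2) + 1e-12  # guard div-by-zero
            # Per-frame values limited to [-10, 35] dB as stated above.
            vals.append(np.clip(10.0 * np.log10(num / den), -10.0, 35.0))
        return float(np.mean(vals))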

4.3 Itakura–Saito distance

The Itakura–Saito distance, a measure of the dissimilarity between the clean and enhanced speech, is calculated between sets of linear prediction coefficients (LPC) estimated over synchronous frames. This measure is strongly affected by spectral dissimilarity due to mismatches in formant locations, with little contribution from errors in matching spectral valleys. Such behavior is desirable, since the auditory system is more sensitive to errors in formant position and bandwidth than to spectral valleys between peaks. In this work, the average Itakura–Saito measure (as defined by Eq. (9)) across all speech frames of a given sentence was calculated to evaluate the speech enhancement technique.

$$ISd\left( a,b \right) = \frac{\left( a - b \right)^{T} R\left( a - b \right)}{a^{T} R a}$$
(9)

where a and b are the LPC vectors of the clean speech signal and of the enhanced speech signal \(\hat{x}\left( {\text{n}} \right)\), respectively, R is the autocorrelation matrix, and the superscript T denotes transposition.
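Eq. (9) transcribes directly, assuming the LPC vectors and the autocorrelation matrix have already been estimated for a frame:

    import numpy as np

    def isd(a, b, R):
        # Eq. (9): a, b are LPC vectors of the clean and enhanced frames;
        # R is the autocorrelation matrix of the clean frame.
        d = a - b
        return float(d @ R @ d) / float(a @ R @ a)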

4.4 Perceptual evaluation of speech quality

The perceptual evaluation of speech quality (PESQ) algorithm is an objective quality measure approved as ITU-T recommendation P.862 (Rix et al. 2001). It is an objective measurement tool introduced to predict the results of a subjective mean opinion score (MOS) test. It has been shown (Hu and Loizou 2008; Zavarehei et al. 2006) that PESQ correlates better with MOS than traditional objective speech measures.
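In practice, PESQ scores can be computed with the third-party pesq package (an open-source P.862 implementation on PyPI); its use here is our assumption, as the paper does not name a particular tool:

    import numpy as np
    from pesq import pesq  # third-party package: pip install pesq

    fs = 16000                                   # PESQ supports 8 or 16 kHz
    clean = np.random.randn(fs).astype(np.float32)     # placeholder signals
    enhanced = clean + 0.05 * np.random.randn(fs).astype(np.float32)
    print(pesq(fs, clean, enhanced, "wb"))       # wide-band MOS-LQO score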

5 Results and discussions

In this section, ten Arabic sentences produced by a female speaker and ten produced by a male speaker are used. These sentences are artificially corrupted with different additive noise types (white, F16 cockpit, tank, pink and car noises) at different SNR values. The noises were taken from the AURORA database (Hirsch and Pearce 2000). The Arabic sentences (Table 3) are phonetically balanced material sampled at 16 kHz.

Table 3 The list of the used Arabic speech sentences

The noisy speech signals were enhanced using the proposed technique (SBWT/MSS-MAP), the technique based on MSS-MAP estimation (Yang and Loizou 2011), Wiener filtering (Loizou 2007) and the discrete Fourier transform (DFT)-based speech enhancement technique proposed in Hendriks et al. (2013).

Figures 5, 6, 7 and 8 show the curves obtained from the SNR, SSNR, Itakura–Saito distance (ISd) and PESQ computations, respectively, for the different techniques: the technique based on MSS-MAP estimation (Yang and Loizou 2011), the proposed technique (SBWT/MSS-MAP), Wiener filtering (Deller et al. 2000; Haykin 1996) and DFT-domain based single-microphone noise reduction (Hendriks et al. 2013).

Fig. 5
figure 5

Signal to noise ratio after denoising (SNRf) versus signal to noise ratio before denoising (SNRi): case of a speech signal corrupted by Volvo noise

Fig. 6
figure 6

Segmental signal to noise ratio after denoising (SSNRf) versus segmental signal to noise ratio before denoising (SSNRi): case of a speech signal corrupted by Volvo noise

Fig. 7
figure 7

Itakura–Saito distance after denoising (ISdf) versus Itakura–Saito distance before denoising (ISdi): case of a speech signal corrupted by Volvo noise

Fig. 8
figure 8

Perceptual evaluation of speech quality after denoising (PESQf) versus perceptual evaluation of speech quality before denoising (PESQi): case of a speech signal corrupted by Volvo noise

The SNR results for speech corrupted by Volvo noise show that all the speech enhancement techniques improve the SNR (SNRf > SNRi). Moreover, the proposed technique outperforms all the other techniques in our evaluation, in particular the DFT-domain based single-microphone noise reduction technique (Hendriks et al. 2013), which in turn outperforms the two remaining techniques, MSS-MAP (Yang and Loizou 2011) and Wiener filtering.

The SSNR results for speech corrupted by Volvo noise show that all the speech enhancement techniques improve the SSNR (SSNRf > SSNRi). Again, the proposed technique outperforms all the other techniques, in particular the DFT-domain based single-microphone noise reduction technique (Hendriks et al. 2013), which in turn outperforms MSS-MAP and Wiener filtering.

According to the ISd results for speech corrupted by Volvo noise, the proposed speech enhancement technique (SBWT/MSS-MAP) gives the lowest ISd values. In terms of ISd, the proposed technique therefore outperforms the three other techniques: MSS-MAP, Wiener filtering and DFT-domain based single-microphone noise reduction (Hendriks et al. 2013).

According to the PESQ results for speech corrupted by Volvo noise, the proposed technique (SBWT/MSS-MAP) and the DFT-domain based single-microphone noise reduction technique (Hendriks et al. 2013) outperform the two other techniques, Wiener filtering and MSS-MAP. For higher values of SNRi, the PESQ values after enhancement (PESQf) obtained with the proposed technique (SBWT/MSS-MAP) are almost the same as those obtained with the DFT-domain technique (Hendriks et al. 2013), whereas for lower values of SNRi the DFT-domain technique (Hendriks et al. 2013) outperforms the proposed technique (SBWT/MSS-MAP).

Figure 9 illustrates an example of speech enhancement using the proposed technique.

Fig. 9
figure 9

An example of denoising speech signal corrupted by car noise: a clean speech, b noisy speech (SNR = 10 dB), c denoised speech signal using the proposed technique (SBWT/MSS-MAP)

This figure shows clearly that the proposed technique efficiently reduces the noise while preserving the quality of the original speech signal.

The evaluation of the different techniques [SBWT/MSS-MAP, MSS-MAP (Yang and Loizou 2011) and DFT-domain based single-microphone noise reduction (Hendriks et al. 2013)] is also performed on a speech sentence taken from the TIMIT database and corrupted by noise. This is the English sentence

“She had your dark suit in greasy wash water all year”, pronounced by a female voice and corrupted by car noise at different SNR values.

Tables 4, 5, 6 and 7 list the results of the SNR, SSNR, ISd and PESQ computations, respectively, for the case of Volvo noise.

Table 4 SNR computation (case of Volvo noise)
Table 5 SSNR computation (case of Volvo noise)
Table 6 ISd computation (case of Volvo noise)
Table 7 PESQ computation (case of Volvo noise)

The results of the SNR, SSNR and ISd computations (Tables 4, 5, 6) show that the proposed technique (SBWT/MSS-MAP) outperforms both MSS-MAP (Yang and Loizou 2011) and DFT-domain based single-microphone noise reduction (Hendriks et al. 2013).

The results of the PESQ computation (Table 7) show that the DFT-domain based single-microphone noise reduction technique (Hendriks et al. 2013) outperforms both the proposed technique (SBWT/MSS-MAP) and the MSS-MAP technique (Yang and Loizou 2011).

We also used other speech signals and an additional denoising technique in our evaluation: supervised and online nonnegative matrix factorization (NMF) based noise reduction, proposed in Mohammadiha et al. (2013). Figures 11, 12, 13 and 14 show the curves obtained from the SNR, SSNR, ISd and PESQ computations for different SNR values before enhancement. These results are obtained by applying the proposed technique (SBWT/MSS-MAP) and the three other techniques [DFT-domain based single-microphone noise reduction (Hendriks et al. 2013), MSS-MAP (Yang and Loizou 2011) and supervised and online NMF based noise reduction (Mohammadiha et al. 2013; Girish et al. 2015)] to a speech signal (Fig. 10) corrupted by different types of noise. This speech signal is sampled at 16,000 Hz and pronounced in English by a male voice.

Fig. 10
figure 10

An example of a speech signal corrupted by Volvo noise, used for evaluating the four techniques including the proposed one (SBWT/MSS-MAP)

Fig. 11
figure 11

Signal to noise ratio after denoising (SNRf) versus signal to noise ratio before denoising (SNRi): case of a speech signal (Fig. 10) corrupted by Volvo noise

Fig. 12
figure 12

Segmental signal to noise ratio after denoising (SSNRf) versus segmental signal to noise ratio before denoising (SSNRi): case of a speech signal (Fig. 10) corrupted by Volvo noise

Fig. 13
figure 13

Itakura–Saito distance after denoising (ISdf) versus Itakura–Saito distance before denoising (ISdi): case of a speech signal (Fig. 10) corrupted by Volvo noise

Fig. 14
figure 14

Perceptual evaluation of speech quality after denoising (PESQf) versus perceptual evaluation of speech quality before denoising (PESQi): case of a speech signal (Fig. 10) corrupted by Volvo noise

According to the curves in Fig. 11, in terms of SNR, the proposed technique outperforms the other denoising techniques at higher input SNR (SNRi). However, at lower SNRi, the best technique is the supervised and online NMF based noise reduction technique (Mohammadiha et al. 2013).

According to the curves in Fig. 12, in terms of segmental SNR, the proposed technique outperforms the other denoising techniques.

According to the curves in Fig. 13, in terms of ISd, the proposed technique and the MSS-MAP based technique (Yang and Loizou 2011) outperform the other denoising techniques.

According to the curves in Fig. 14, in terms of PESQ, when the PESQ before denoising (PESQi) is higher, the DFT-domain based single-microphone noise reduction technique (Hendriks et al. 2013) outperforms the other denoising techniques. However, when PESQi is lower, the supervised and online NMF based noise reduction technique (Mohammadiha et al. 2013) outperforms the others. At higher PESQi values, the proposed technique is better than both MSS-MAP (Yang and Loizou 2011) and supervised and online NMF based noise reduction (Mohammadiha et al. 2013).

Figures 15, 16, 17 and 18 show other examples of speech enhancement using the proposed technique.

Fig. 15
figure 15

A speech signal taken from the TIMIT database and corrupted by tank noise, enhanced by the proposed technique (SNRi = 10 dB, SNRf = 16.7383 dB, SSNRi = 1.7965 dB, SSNRf = 7.6179 dB, ISdi = 0.0182, ISdf = 3.7397e-04, PESQi = 2.6675, PESQf = 3.1143)

Fig. 16
figure 16

A speech signal taken from the TIMIT database and corrupted by pink noise, enhanced by the proposed technique (SNRi = 10 dB, SNRf = 15.0956 dB, SSNRi = 1.5896 dB, SSNRf = 6.2249 dB, ISdi = 0.0768, ISdf = 0.0495, PESQi = 2.2660, PESQf = 2.7800)

Fig. 17
figure 17

A speech signal taken from the TIMIT database and corrupted by white noise, enhanced by the proposed technique (SNRi = 10 dB, SNRf = 14.5035 dB, SSNRi = 1.4850 dB, SSNRf = 6.0776 dB, ISdi = 0.5621, ISdf = 0.0495, PESQi = 2.0519, PESQf = 2.7304)

Fig. 18
figure 18

A speech signal taken from the TIMIT database and corrupted by F16 noise, enhanced by the proposed technique (SNRi = 5 dB, SNRf = 11.4539 dB, SSNRi = 1.7233 dB, SSNRf = 3.2526 dB, ISdi = 0.4625, ISdf = 0.4826, PESQi = 1.8480, PESQf = 2.4521)

Here SNRi and SNRf denote the signal-to-noise ratios before and after enhancement, respectively; SSNRi and SSNRf the segmental signal-to-noise ratios before and after enhancement; ISdi and ISdf the Itakura–Saito distances before and after enhancement; and PESQi and PESQf the PESQ scores before and after enhancement.

Figures 19 and 20 illustrate another example of speech denoising using the proposed technique (SBWT/MSS-MAP). Figure 20 shows the spectrograms of the clean speech signal, the noisy speech signal and the enhanced speech signal.

Fig. 19
figure 19

An example of speech enhancement using the proposed technique (SBWT/MSS-MAP): denoising of a speech signal (taken from the TIMIT database) corrupted by Volvo noise with SNR = 10 dB

Fig. 20
figure 20

a The spectrogram of the clean speech signal. b The spectrogram of the noisy speech signal (speech signal corrupted by car noise with SNR = 10 dB). c The spectrogram of the enhanced speech signal

Spectrogram (b) shows that the noise corrupting the speech signal is low-pass, since it is localized in the low-frequency regions. Spectrogram (c) shows that this car noise is suppressed efficiently by the proposed technique (SBWT/MSS-MAP).

6 Conclusion

In this paper, we proposed a new speech enhancement technique that integrates a newly proposed wavelet transform (which we call the SBWT) with MSS-MAP estimation. The SBWT is introduced in order to solve the perfect reconstruction problem associated with the BWT, and the MSS-MAP estimation is used to estimate the speech in the SBWT domain. The performance of the proposed technique (SBWT/MSS-MAP) was compared with that of the technique based on MSS-MAP estimation, Wiener filtering, the DFT-based speech enhancement technique and the supervised and online NMF based noise reduction technique. The evaluation was based on four objective metrics: SNR, SSNR, ISd and PESQ. It used twenty Arabic sentences (ten pronounced by a male voice and ten by a female voice) as well as sentences taken from the TIMIT database, corrupted by different types of noise: car, white, F16, tank and pink. The results of the SNR, SSNR, ISd and PESQ computations show that the proposed technique (SBWT/MSS-MAP) outperforms the technique based on MSS-MAP estimation and Wiener filtering. Compared with the supervised and online NMF based noise reduction technique, the proposed technique is better at higher SNR, while the opposite holds at lower SNR.