1 Introduction

Noise reduction systems are widely used in telecommunication systems to enhance the quality of speech communication in noisy environments. Although better noise reduction can be achieved with a microphone array, most of these systems rely on a single microphone for economic reasons. In principle, a single-microphone noise reduction system uses adaptive filtering to attenuate the time–frequency (T–F) units of the noisy speech that have low SNR and to retain the T–F units with high SNR. In this way, the essential regions of speech are preserved while the noise level is greatly reduced. Many noise reduction systems along this line are available in the literature (Boll 1979; Lim and Oppenheim 1978; Scalart and Filho 1996; Ephraim and Malah 1984, 1985). The Wiener filter (Scalart and Filho 1996; Abd El-Fattah et al. 2014) is a linear filter that recovers the original speech signal from the noisy signal by minimizing the mean square error (MSE) between the estimated (enhanced) signal and the original one. In Wiener filtering, attenuation rules decide which T–F units of the noisy speech need to be attenuated and by how much. Usually, these attenuation rules are optimized so that the enhanced speech is as close as possible to the clean speech. The quality of a single-microphone noise reduction system is therefore determined by its suppression rule. In general, a suppression rule with strong attenuation leads to less noisy speech but introduces more distortion. On the other hand, moderate attenuation introduces less distortion but achieves only a limited amount of noise reduction. For this reason, a balanced trade-off has to be made to obtain a speech signal with low distortion and high quality. To this end, ideal binary masking (IdBM) has been successfully applied in noise reduction systems.
These masks are constructed to retain the time–frequency (T–F) units in which the estimated speech is stronger than the intrusive noise (SNR > 0 dB) and to remove the T–F units in which the intrusive noise is dominant (SNR ≤ 0 dB). The masks can be estimated using either single-microphone or multi-microphone systems. An extensive literature review on time–frequency masking can be found in (Wang 2008). Methods employing binary masks have shown substantial quality improvements with little distortion, even at extremely low SNRs. These encouraging results have motivated researchers to develop and estimate binary masks, and the ideal binary mask has been suggested as the goal of computational auditory scene analysis (CASA) (Wang 2005). Given this evidence of quality and intelligibility improvement, recent research has focused on estimating these masks (Boldt et al. 2008; Saleem et al. 2015a, 2015; Loizou 2009).

In this study, a two-stage noise reduction system is proposed. It is based on Wiener filtering with an improved a priori SNR estimate [to reduce the one-frame delay introduced by the decision-directed approach (Ephraim and Malah 1984)] and on the ideal binary mask (Wang 2005). The ideal binary mask can be defined by comparing the a priori SNR estimate against a threshold (usually 0 dB). However, instead of the a priori SNR, ideal binary mask estimation needs access to the local instantaneous SNR, defined as the ratio of the power spectrum of speech to the power spectrum of noise at every T–F unit. The performance of the proposed system is evaluated with two different intrusive noises (babble and white noise) in terms of speech distortion and residual noise. The rest of the paper is arranged as follows: Sect. 2 gives an overview of the proposed noise reduction system, Sect. 3 presents the experimental setup, and Sects. 4–7 present the results and analysis. Finally, concluding remarks are given in Sect. 8.

2 Overview of the proposed noise reduction system

This section provides an overview of the proposed noise reduction system. In the classical noise reduction model, the noisy speech is given by:

$${\text{y}}(t) = {\text{s}}(t) + {\text{e}}(t)$$
(1)

where s(t) and e(t) denote the clean speech and the noise, respectively. Let \({\text{Y}}(m,\omega_{m})\), \({\text{S}}(m,\omega_{m})\) and \({\text{E}}(m,\omega_{m})\) denote the \(\omega_{m}\) spectral component of short-time frame m of the noisy speech y(t), the clean speech s(t) and the noise e(t), respectively. Both speech and noise are non-stationary in nature; however, over short intervals (10–30 ms) both are assumed to be stationary, and hence quasi-stationarity is assumed in the frame analysis. To reduce the noise level, a spectral gain \({\text{G}}(m,\omega_{m})\) is applied to every short-time spectrum \({\text{Y}}(m,\omega_{m})\). Figure 1 shows the block diagram of the proposed system. In practice, the spectral gain is computed from two principal SNR estimates, the a posteriori SNR and the a priori SNR, given as:

Fig. 1 Block diagram of proposed system

$$\upgamma (m,\omega_{m} ) = \frac{{\left| {{\text{Y}}(m,\omega_{m} )} \right|^{2} }}{{{\text{E}}\{ \left| {{\text{E}}(m,\omega_{m} )} \right|^{2} \} }} = \frac{{\left| {{\text{Y}}(m,\omega_{m} )} \right|^{2} }}{{\upsigma_{e}^{2} (m,\omega_{m} )}}$$
(2)
$$\upxi (m,\omega_{m} ) = \frac{{\left| {{\text{S}}(m,\omega_{m} )} \right|^{2} }}{{{\text{E}}\{ \left| {{\text{E}}(m,\omega_{m} )} \right|^{2} \} }} = \frac{{\upsigma_{\text{S}}^{2} (m,\omega_{m} )}}{{\upsigma_{\text{e}}^{2} (m,\omega_{m} )}}$$
(3)

where E{·} is the expectation operator, and \(\upgamma (m,\omega_{m})\) and \(\upxi (m,\omega_{m})\) are the a posteriori and a priori SNR, respectively. In real-world applications of noise reduction systems, the power spectral densities of the clean speech \(\left| {{\text{S}}(m,\omega_{m})} \right|^{2}\) and of the noise \(\left| {{\text{E}}(m,\omega_{m})} \right|^{2}\) are unknown, since only the noisy speech is available. Therefore, both the instantaneous and the a priori SNR need to be estimated. The power spectral density of the noise can be estimated during speech gaps using the standard recursive relation:

$$\hat{\upsigma }_{\text{e}}^{ 2} (m,\omega_{m} ) = \upzeta \hat{\upsigma }_{\text{e}}^{ 2} (m - 1,\omega_{m} ) + (1 - \upzeta )\tilde{\upsigma }_{\text{Y}}^{ 2} (m - 1,\omega_{m} )$$
(4)

where ζ is the smoothing factor and \(\tilde{\upsigma }_{\text{Y}}^{2} (m - 1,\omega_{m} )\) is the estimate obtained from the noisy speech frame. The two signal-to-noise ratios can be computed as:

$${\text{SNR}}_{\text{INSTANT}} (m,\omega_{m} ) = \frac{{\left| {{\text{Y}}(m,\omega_{m} )} \right|^{2} }}{{\sigma_{\text{e}}^{ 2} (m,\omega_{m} )}} - 1$$
(5)
$$\upxi_{\text{PRIO}}^{\text{DD}} (m,\omega_{m} )= \upbeta \frac{{\left| {{\text{G(}}m - 1,\omega_{m} ) * {\text{Y(}}m,\omega_{m} )} \right|^{ 2} }}{{\hat{\upsigma }_{\text{e}}^{ 2} (m,\omega_{m} - 1 )}} + (1 - \upbeta ) {\text{F\{ SNR}}_{\text{INSTANT}} (m,\omega_{m} ) {\text{\} }}$$
(6)

where \(\upxi_{\text{PRIO}}^{\text{DD}} (m,\omega_{m} )\) is the a priori SNR computed with the decision-directed (DD) approach and F{·} denotes full-wave rectification. The decision-directed approach is computationally efficient and performs remarkably well in noise reduction applications; however, its a priori SNR estimate follows the shape of the instantaneous SNR with a delay of one frame. To reduce this single-frame delay, an improved version of the a priori SNR is used, in which momentum terms are introduced to improve the tracking speed of the proposed system. The improved a priori SNR can be written as:

$$\begin{aligned} \upxi_{\text{PRIO}}^{\text{DD - MT}} (m,\omega_{m} )= \upbeta \frac{{\left| {{\text{G(}}m - 1,\omega_{m} ) * {\text{Y(}}m,\omega_{m} )} \right|^{ 2} }}{{\hat{\upsigma }_{\text{e}}^{ 2} (m,\omega_{m} - 1 )}} + \uplambda (m,\omega_{m} )+ (1 - \upbeta ) {\text{F\{ SNR}}_{\text{INSTANT}} (m,\omega_{m} ) {\text{\} }} \hfill \\ \uplambda (m,\omega_{m} ) { = }\uppsi ( (\upxi_{\text{PRIO}}^{\text{DD}} (m,\omega_{m} - 1) - \upxi_{\text{PRIO}}^{\text{DD}} (m,\omega_{m} - 2 ) )\hfill \\ \end{aligned}$$
(7)

Here \(\upxi_{\text{PRIO}}^{\text{DD - MT}} (m,\omega_{m} )\) is the a priori SNR computed with the modified decision-directed method with momentum terms, \(\uplambda (m,\omega_{m})\) is the momentum term, ψ is the momentum parameter (ψ = 0.998) and β is the smoothing parameter (usually β = 0.98). The estimated magnitude spectrum of the clean speech \(\left| {{\text{S}}_{\text{EST}} (m,\omega_{m})} \right|\) is computed from the noisy speech \({\text{Y}}(m,\omega_{m})\) by multiplying with the Wiener filter gain function:

$$\left| {{\text{S}}_{\text{EST}} (m,\omega_{m} )} \right| = \left| {{\text{Y}}(m,\omega_{m} )} \right|*G_{\text{SQWF}}^{\text{DD - MT}} (m,\omega_{m} )$$
(8)

The square root Wiener gain function \({\text{G}}_{\text{SQWF}}^{\text{DD}} (m,\omega_{m} )\) is given by equation:

$${\text{G}}_{\text{SQWF}}^{\text{DD}} (m,\omega_{m} ) = \sqrt {\frac{{\upxi_{{_{\text{PRIO}} }}^{\text{DD}} (m,\omega_{m} )}}{{\upxi_{\text{PRIO}}^{\text{DD}} (m,\omega_{m} ) + 1}}}$$
(9)
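As a minimal sketch of Eq. (9), the square-root Wiener gain can be evaluated for a single a priori SNR value. The function name `sqrt_wiener_gain` and the dB test values are illustrative assumptions, not taken from the paper.

```python
import math


def sqrt_wiener_gain(xi):
    """Square-root Wiener gain of Eq. (9) for one a priori SNR value (linear scale)."""
    if xi < 0.0:
        raise ValueError("a priori SNR must be non-negative")
    return math.sqrt(xi / (xi + 1.0))


if __name__ == "__main__":
    # Illustrative a priori SNR values given in dB and converted to linear scale.
    for snr_db in (-10.0, 0.0, 10.0):
        xi = 10.0 ** (snr_db / 10.0)
        print(f"{snr_db:+.0f} dB -> gain {sqrt_wiener_gain(xi):.3f}")
```

The gain rises monotonically from 0 towards 1 as the a priori SNR grows; at 0 dB (ξ = 1) it equals 1/√2, so low-SNR units are attenuated strongly while high-SNR units pass almost unchanged.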

With improved a priori SNR, the gain function \({\text{G}}_{\text{SQWF}}^{\text{DD - MT}} (m,\omega_{m} )\) in Eq. (8) becomes:

$${\text{G}}_{\text{SQWF}}^{\text{DD - MT}} (m,\omega_{m} ) = \sqrt {\frac{{\upxi_{\text{PRIO}}^{\text{DD - MT}} (m,\omega_{m} )}}{{\upxi_{\text{PRIO}}^{\text{DD - MT}} (m,\omega_{m} ) + 1}}}$$
(10)
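The first stage described by Eqs. (4)–(10) can be sketched for a single frequency bin as follows. This is an illustration under stated assumptions rather than the authors' implementation: the recursion uses the previous frame's enhanced amplitude in the decision-directed term (the conventional form), the indices "− 1" and "− 2" in the momentum term are read as the two previous frames, F{·} is implemented with `abs()` following the text's "full-wave rectification", and the floor `XI_FLOOR` and the value `zeta=0.9` are hypothetical choices that do not appear in the paper.

```python
import math

BETA = 0.98      # smoothing parameter beta quoted in the text
PSI = 0.998      # momentum parameter psi quoted in the text
XI_FLOOR = 1e-3  # hypothetical floor (not in the paper) so the gain stays real


def update_noise_psd(prev_psd, frame_power, zeta=0.9):
    """Recursive noise PSD smoothing of Eq. (4); zeta=0.9 is a hypothetical value."""
    return zeta * prev_psd + (1.0 - zeta) * frame_power


def first_stage_bin(noisy_mag, noise_psd, beta=BETA, psi=PSI):
    """Enhance the magnitude track of one frequency bin across frames.

    noisy_mag : list of |Y(m)| values for successive frames m
    noise_psd : list of noise power estimates sigma_e^2(m), same length
    Returns the list of enhanced magnitudes |S_EST(m)|.
    """
    enhanced = []
    prev_amp_sq = 0.0        # |S_EST(m-1)|^2, zero before the first frame
    xi_dd_hist = [0.0, 0.0]  # decision-directed estimates at frames m-2 and m-1
    for y, sigma2 in zip(noisy_mag, noise_psd):
        snr_inst = (y * y) / sigma2 - 1.0                       # Eq. (5)
        xi_dd = (beta * prev_amp_sq / sigma2
                 + (1.0 - beta) * abs(snr_inst))                # Eq. (6), assumed form
        momentum = psi * (xi_dd_hist[1] - xi_dd_hist[0])        # lambda in Eq. (7)
        xi_mt = max(xi_dd + momentum, XI_FLOOR)                 # Eq. (7), floored
        gain = math.sqrt(xi_mt / (xi_mt + 1.0))                 # Eq. (10)
        s_est = gain * y                                        # Eq. (8)
        enhanced.append(s_est)
        prev_amp_sq = s_est * s_est
        xi_dd_hist = [xi_dd_hist[1], xi_dd]
    return enhanced


if __name__ == "__main__":
    noisy = [1.0, 1.1, 3.0, 3.2, 1.0]  # illustrative |Y(m)| for one bin
    noise = [1.0] * len(noisy)         # illustrative noise power per frame
    for y, s in zip(noisy, first_stage_bin(noisy, noise)):
        print(f"|Y| = {y:.2f} -> |S_EST| = {s:.3f}")
```

The momentum term adds the most recent change of the decision-directed estimate, so the a priori SNR reacts faster when the instantaneous SNR rises or falls between frames.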

To reduce the residual noise, the pre-processed signals are passed to the second stage. Although the pre-processed speech offers reasonable quality, residual noise remains perceptible and annoying under heavily noisy conditions. The estimate of the clean speech is used to compute the instantaneous SNR in the second stage. \({\text{SNR}}_{\text{INSTANT}}\) is computed as:

$${\text{SNR}}_{\text{INSTANT}} ({\rm m},\upomega_{m} ) \, = \, 10\log_{10} \left( {\frac{{\left| {{\text{S}}({\rm m},\upomega_{m} )} \right|}}{{\left| {{\text{S}}_{\text{EST}} ({\rm m},\upomega_{m} )} \right|}}} \right) \,$$
(11)

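The short-term energies of the filtered waveforms are calculated, followed by a comparison stage. To reduce the residual noise, the ratio of the estimated magnitude spectrum to the clean speech magnitude spectrum, \(\left| {{\text{S}}_{\text{EST}} (m,\omega_{m})} \right|/\left| {{\text{S}}(m,\omega_{m})} \right|\), is compared against a predefined threshold T. The T–F units that satisfy the constraint, i.e. \(\left| {{\text{S}}_{\text{EST}} (m,\omega_{m})} \right|/\left| {{\text{S}}(m,\omega_{m})} \right| \ge {\text{T}}\), are preserved, whereas the T–F units that violate it, i.e. \(\left| {{\text{S}}_{\text{EST}} (m,\omega_{m})} \right|/\left| {{\text{S}}(m,\omega_{m})} \right| < {\text{T}}\), are attenuated. The modified magnitude spectrum \(\left| {{\text{S}}_{\text{M}} (m,\omega_{m})} \right|\) is calculated as: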

$$\left| {{\text{S}}_{{\text{M}}} ({\text{m,}}\upomega _{m} )} \right| = \left\{ {\begin{array}{ll} {\left| {\overline{{\text{S}}} _{{{\text{EST}}}} ({\text{m,}}\upomega _{m} )} \right|} & {\left| {{\text{S}}_{{{\text{EST}}}} ({\text{m,}}\upomega _{m} )} \right|/\left| {{\text{S(m,}}\upomega _{m} )} \right| \ge {\text{ T}}} \\ 0 & {\left| {{\text{S}}_{{{\text{EST}}}} ({\text{m,}}\upomega _{m} )} \right|/\left| {{\text{S(m,}}\upomega _{m} )} \right|{\text{ < T}}} \\ \end{array} } \right.$$
(12)
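The selection rule of Eq. (12) can be sketched as below. The function name, the nested-list layout `[frame][bin]`, the default threshold and the small constant `eps` are assumptions made for illustration; the ratio orientation follows Eq. (12) and the conversion to dB uses the 10·log10 convention of Eq. (11).

```python
import math


def binary_mask_stage(est_mag, ref_mag, threshold_db=-10.0, eps=1e-12):
    """Keep or zero each T-F unit following the rule of Eq. (12).

    est_mag, ref_mag : 2-D lists [frame][bin] holding |S_EST| and |S|
    threshold_db     : threshold T in dB (hypothetical default)
    eps              : small constant that avoids log10(0) and division by zero
    Returns (modified_magnitudes, mask) with the same shape as the inputs.
    """
    modified, mask = [], []
    for est_row, ref_row in zip(est_mag, ref_mag):
        out_row, mask_row = [], []
        for est, ref in zip(est_row, ref_row):
            ratio_db = 10.0 * math.log10((est + eps) / (ref + eps))
            keep = ratio_db >= threshold_db
            mask_row.append(1 if keep else 0)
            out_row.append(est if keep else 0.0)
        modified.append(out_row)
        mask.append(mask_row)
    return modified, mask


if __name__ == "__main__":
    est = [[1.0, 0.05], [0.5, 0.5]]  # illustrative |S_EST| values
    ref = [[1.0, 1.0], [0.5, 2.0]]   # illustrative |S| values
    modified, mask = binary_mask_stage(est, ref, threshold_db=-5.0)
    print(mask)
    print(modified)
```

Because the decision is binary, a unit is either passed unchanged or removed entirely; the threshold alone controls how aggressive the second stage is.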

Following the selection of the T–F units, an inverse STFT is applied to the modified spectrum using the phase of the noisy speech spectrum, followed by the overlap-and-add method to synthesize the noise-reduced speech.
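Two steps of this synthesis can be sketched with the standard library: attaching the noisy phase to the modified magnitudes, and overlap-adding time-domain frames. The inverse transform that sits between the two steps is omitted here, and the function names, the hop size and the sample values are illustrative assumptions.

```python
import cmath


def attach_noisy_phase(modified_mag, noisy_spectrum):
    """Combine each modified magnitude with the phase of the matching noisy bin."""
    return [cmath.rect(mag, cmath.phase(y))
            for mag, y in zip(modified_mag, noisy_spectrum)]


def overlap_add(frames, hop):
    """Sum equal-length time-domain frames placed every `hop` samples."""
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        start = i * hop
        for n, sample in enumerate(frame):
            out[start + n] += sample
    return out


if __name__ == "__main__":
    # One bin with magnitude 2 and the phase of a purely imaginary noisy value.
    print(attach_noisy_phase([2.0], [1j]))
    # Two frames of four samples with 50% overlap.
    print(overlap_add([[1.0] * 4, [1.0] * 4], hop=2))
```

In a complete system the analysis window and hop size must be chosen so that the overlapping windows sum to a constant; that choice is not specified in the text.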

3 Experiments: methodology and setup

This section describes the experimental setup and methodology used to assess the performance and suitability of the proposed noise reduction system. The experiments used the NOIZEUS corpus (Hu and Loizou 2007), which consists of 30 phonetically balanced sentences spoken by three male and three female speakers. The sentences were sampled at 8 kHz and filtered to simulate the frequency characteristics of telephone handsets. The corpus also contains sentences corrupted by non-stationary noises at various SNRs; however, only the clean sentences were used in our experiments. The noisy stimuli were generated by adding babble and white noise to the clean sentences in accordance with ITU-T Recommendation P.56 (ITU-T P.56 1993). Three signal-to-noise ratio levels (−5, 0 and 5 dB) were used to assess the performance. The noise sources were taken from the AURORA database (Hirsch and Pearce 2000). ITU-T Recommendation P.862 (PESQ) (Rix et al. 2001) was used to predict the mean opinion scores (MOS), and ITU-T Recommendation P.835 (ITU-T P.835 2003) was used to predict the amount of residual noise (BAK) and speech distortion (SIG). Spectrogram analysis was also performed to assess the proposed system. To measure speech intelligibility, the normalized subband envelope correlation (NSEC) measure (Boldt and Ellis 2009) was used, which is a good alternative to the speech intelligibility index (SII) and the speech transmission index (STI).
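Generating a noisy stimulus at a target SNR can be sketched as follows. This simplified version scales the noise using the mean power of the whole clean signal; it does not implement the active speech level measurement of ITU-T P.56, and the function name, the sine-wave "speech" and the synthetic white noise are stand-ins chosen for illustration.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals snr_db, then add it."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * e for s, e in zip(clean, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    fs = 8000  # sampling rate used in the experiments
    clean = [math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(fs // 10)]
    white = [rng.gauss(0.0, 1.0) for _ in range(len(clean))]
    for snr_db in (-5, 0, 5):  # SNR levels used in the experiments
        noisy = mix_at_snr(clean, white, snr_db)
        print(snr_db, "dB ->", len(noisy), "samples")
```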

4 Objective measures

A number of objective measures have been proposed in the literature to evaluate the performance of noise reduction systems (Rix et al. 2001; Hansen and Pellom 1998; Klatt 1982; Quackenbush et al. 1988; Kitawaki et al. 1988). The most widely used objective measures include PESQ–MOS and the segmental SNR (\({\text{SNR}}_{\text{SEG}}\)) (Hansen and Pellom 1998). Although PESQ was not originally designed to assess the performance of noise reduction systems, it has been found to correlate well with the mean opinion score (MOS). It predicts MOS values from 1 to 5, where a higher score indicates better speech quality. Similarly, \({\text{SNR}}_{\text{SEG}}\) is another widely used objective measure, and it has the best correlation with background noise reduction. \({\text{SNR}}_{\text{SEG}}\) is defined as:

$${\text{SNR}}_{\text{SEG}} ( {\text{m,}}\upomega_{m} ) { = }\frac{10}{\text{M}}\sum\nolimits_{\text{m = 0}}^{\text{M - 1}} {\log }_{10} \left( {\frac{{\left| {{\text{S(m,}}\upomega_{m} )} \right|^{2} }}{{\left| {\left| {{\text{S(m,}}\upomega_{m} )} \right| - \left| {{\text{S}}_{\text{EST}} ( {\text{m,}}\upomega_{m} )} \right|} \right|^{2} }}} \right) \,$$
(13)

where \(\left| {{\text{S}}(m,\omega_{m})} \right|\) and \(\left| {{\text{S}}_{\text{EST}} (m,\omega_{m})} \right|\) denote the magnitude spectra of frame m of the clean and the estimated speech, respectively, and M is the number of frames. To discard non-speech frames, every frame was thresholded by a 0 dB lower bound and −35 dB upper bound. The performance of a noise reduction system involves a trade-off among musical noise, speech distortion and noise reduction. Neither PESQ–MOS nor \({\text{SNR}}_{\text{SEG}}\) can portray the whole picture of these trade-offs. Therefore, the composite measure associated with ITU-T Recommendation P.835 is used to measure the speech distortion and the residual noise. The composite measure is formed by combining basic objective measures (Loizou 2007), given as:

$$\begin{aligned} {\text{Csig}} &= 3.093 - 1.029\,{\text{S}}_{\text{LLR}} + 0.603\,{\text{S}}_{\text{PESQ}} - 0.009\,{\text{S}}_{\text{WSS}} \\ {\text{Cbak}} &= 1.634 + 0.478\,{\text{S}}_{\text{PESQ}} - 0.007\,{\text{S}}_{\text{WSS}} + 0.063\,{\text{S}}_{{\text{SNR}}_{\text{SEG}}} \end{aligned}$$
(14)

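where \({\text{S}}_{\text{PESQ}}\), \({\text{S}}_{\text{LLR}}\), \({\text{S}}_{\text{WSS}}\) and \({\text{S}}_{{\text{SNR}}_{\text{SEG}}}\) represent the perceptual evaluation of speech quality (PESQ), log-likelihood ratio (LLR), weighted-slope spectral (WSS) distance and segmental SNR scores, respectively.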
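The two composite scores of Eq. (14) are linear combinations and can be evaluated directly. In the sketch below the function names and the input scores are hypothetical, and limiting the result to the 1–5 MOS range in `clamp_mos` is a common convention that is assumed here rather than stated in the text.

```python
def composite_sig(llr, pesq, wss):
    """Speech distortion score Csig of Eq. (14)."""
    return 3.093 - 1.029 * llr + 0.603 * pesq - 0.009 * wss


def composite_bak(pesq, wss, snr_seg):
    """Background intrusiveness score Cbak of Eq. (14)."""
    return 1.634 + 0.478 * pesq - 0.007 * wss + 0.063 * snr_seg


def clamp_mos(score):
    """Limit a score to the 1-5 MOS range (assumed convention)."""
    return min(5.0, max(1.0, score))


if __name__ == "__main__":
    # Hypothetical basic scores, used only to exercise the formulas.
    llr, pesq, wss, snr_seg = 0.8, 2.5, 60.0, 2.0
    print("Csig =", round(clamp_mos(composite_sig(llr, pesq, wss)), 4))
    print("Cbak =", round(clamp_mos(composite_bak(pesq, wss, snr_seg)), 4))
```

The signs of the weights show the direction of each contribution: higher PESQ and segmental SNR raise the scores, whereas larger LLR and WSS distances lower them.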

4.1 Objective performance evaluation

The objective evaluation was performed for noisy (unprocessed) speech, Wiener filtering, spectral subtraction, ideal ratio masking (IdRM) and the proposed system. The measures employed were PESQ–MOS, \({\text{SNR}}_{\text{SEG}}\) and the composite measure (speech distortion, SIG, and residual noise, BAK). For all measures, higher scores indicate better speech quality.

4.2 PESQ evaluation

Table 1 compares the perceptual evaluation of speech quality (PESQ–MOS) scores of the noisy speech, the first stage and the second stage. A clear improvement in PESQ–MOS was observed with the proposed system. The highest improvement in PESQ–MOS was observed with 0 dB white noise (∆ = 1.07), while the lowest improvement was observed with 5 dB babble noise. Compared to the noisy speech, the highest improvement was obtained with 0 dB babble noise (∆ = 1.27) and the lowest with −5 dB white noise (∆ = 0.95). Table 2 compares the speech quality (PESQ–MOS) of the proposed system against noisy speech and speech processed by spectral subtraction, Wiener filtering and the ideal ratio mask. The highest PESQ–MOS scores are obtained with the ideal ratio mask (IRM), as expected. Boldface indicates the best performance relative to noisy speech, spectral subtraction and Wiener filtering.

Table 1 Performance comparison in terms of PESQ–MOS scores, and improvement (∆PESQ) between first and second stage
Table 2 Performance comparison in terms of PESQ–MOS scores between different noise reduction algorithms

4.3 Segmental SNR evaluation

Table 3 compares the segmental SNR (\({\text{SNR}}_{\text{SEG}}\)) of the noisy speech, the first stage and the second stage. An improvement in \({\text{SNR}}_{\text{SEG}}\) was observed with the proposed system in all noise conditions. The highest and lowest improvements in \({\text{SNR}}_{\text{SEG}}\) were noted with 0 dB white noise (∆ = 2.58) and −5 dB babble noise (∆ = 1.46), respectively. The improvement in \({\text{SNR}}_{\text{SEG}}\) shows that significant noise reduction was achieved with the proposed system (by applying the second stage). According to Table 3, the improvements in \({\text{SNR}}_{\text{SEG}}\) after the first stage were negligible compared to the noisy speech, whereas considerable improvements were observed with the complete proposed system.

Table 3 Performance comparison in terms of \({\text{SNR}}_{\text{SEG}}\) scores, and improvement (∆\({\text{SNR}}_{\text{SEG}}\)) between first and second stage
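A frame-averaged SNR in the spirit of Eq. (13) can be sketched as below. Two assumptions are made and should be checked against the actual evaluation code: the signal and error energies are summed over the bins of each frame before taking the logarithm, and the clamping range defaults to the commonly used −10 to 35 dB, which is not necessarily the range used in the experiments. The function name and the sample values are also illustrative.

```python
import math


def segmental_snr(clean_mag, est_mag, lo_db=-10.0, hi_db=35.0, eps=1e-12):
    """Frame-averaged SNR in dB in the spirit of Eq. (13).

    clean_mag, est_mag : 2-D lists [frame][bin] of |S| and |S_EST|
    lo_db, hi_db       : clamp applied to every frame (hypothetical defaults)
    eps                : small constant that avoids division by zero
    """
    total = 0.0
    for clean_row, est_row in zip(clean_mag, est_mag):
        signal = sum(s * s for s in clean_row)
        error = sum((s - e) ** 2 for s, e in zip(clean_row, est_row))
        frame_db = 10.0 * math.log10((signal + eps) / (error + eps))
        total += min(hi_db, max(lo_db, frame_db))
    return total / len(clean_mag)


if __name__ == "__main__":
    clean = [[1.0, 1.0], [2.0, 2.0]]  # illustrative |S| values
    est = [[0.9, 0.9], [2.0, 2.0]]    # illustrative |S_EST| values
    print(round(segmental_snr(clean, est), 3))
```

Clamping each frame before averaging prevents a few silent or perfectly reconstructed frames from dominating the mean.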

4.4 Composite measure evaluation

Neither PESQ–MOS nor \({\text{SNR}}_{\text{SEG}}\) can portray the whole picture of the trade-off between residual noise and speech distortion. To measure the speech distortion introduced by the noise reduction system and the amount of residual noise, the composite measure (discussed in Sect. 4) was used. Table 4 shows the speech distortion scores (SIG) for the first stage and the second stage, together with the improvement (∆SIG). The first stage of the proposed system introduced a considerable amount of speech distortion, which was less evident after the second stage (higher SIG scores). The highest and lowest gains in SIG scores were observed at 5 dB white noise (∆ = 0.98) and −5 dB babble noise (∆ = 0.25), respectively. Table 5 shows the amount of residual noise (BAK) in the enhanced speech after processing by the first and the second stage. The first stage of the proposed system removed a considerable amount of background noise (high BAK values, i.e., less residual noise), and the second stage reduced it further (higher BAK values). The highest and lowest gains in BAK scores were observed at 5 dB white noise (∆ = 0.74) and 5 dB babble noise (∆ = 0.53), respectively. The composite measure indicates that low speech distortion, little residual noise and high speech quality were obtained with the proposed system in all noise conditions.

Table 4 Performance comparison in terms of speech distortion (SIG), and improvement (∆SIG) in distortion between first and second stage
Table 5 Performance comparison in terms of residual noise (BAK), and improvement (∆BAK) in residual noise between first and second stage

5 Speech intelligibility measure

To measure speech intelligibility, the normalized subband envelope correlation (NSEC) measure is used, which is a good alternative to the speech intelligibility index (SII) and the speech transmission index (STI). Figure 2 shows the percentage intelligibility scores across all SNR levels and noise conditions. A significant improvement was obtained with the second stage relative to both the noisy speech and the speech processed by the first stage. At low SNR, the first stage yielded little improvement and the speech intelligibility remained close to that of the noisy speech. However, at higher SNR levels (0 and 5 dB), the improvements in intelligibility were significant. The percentage improvements in intelligibility with the second stage relative to the noisy speech were remarkable in both babble noise (21.34% at −5 dB, 19.71% at 0 dB and 14.79% at 5 dB) and white noise (20.99% at −5 dB, 22.44% at 0 dB and 17.08% at 5 dB). The results in Fig. 2 show that the post-processing stage remarkably improved speech intelligibility in low-SNR background conditions. Figure 3 compares the speech intelligibility (NSEC) of the noisy speech, spectral subtraction, Wiener filtering, the first stage and the second stage. A remarkable improvement in the NSEC scores was observed with the proposed system.

Fig. 2 NSEC-based speech intelligibility prediction in various noisy backgrounds

Fig. 3 NSEC-based speech intelligibility prediction for different state-of-the-art noise reduction systems
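The core idea behind a correlation-based intelligibility score can be sketched as a normalized correlation between subband envelopes of the clean and the processed speech. The sketch covers only this final correlation step: the filterbank, envelope extraction and any compression used by the published NSEC measure are omitted, and the function name and the envelope layout `[band][time]` are assumptions made for illustration.

```python
import math


def normalized_envelope_correlation(ref_env, test_env):
    """Normalized correlation between two sets of subband envelopes.

    ref_env, test_env : 2-D lists [band][time] of non-negative envelope values
    Returns a value between 0 and 1 for non-negative inputs.
    """
    num = ref_energy = test_energy = 0.0
    for ref_band, test_band in zip(ref_env, test_env):
        for r, t in zip(ref_band, test_band):
            num += r * t
            ref_energy += r * r
            test_energy += t * t
    if ref_energy == 0.0 or test_energy == 0.0:
        return 0.0
    return num / math.sqrt(ref_energy * test_energy)


if __name__ == "__main__":
    ref = [[1.0, 0.5, 0.0], [0.2, 0.8, 0.4]]  # illustrative clean envelopes
    half = [[0.5 * v for v in band] for band in ref]
    print(normalized_envelope_correlation(ref, half))
```

Because of the normalization, a uniformly scaled copy of the reference scores the same as the reference itself; only changes in the shape of the envelopes lower the score.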

6 Selection of threshold

For the best performance of the proposed system, an appropriate value of the threshold T must be selected. The influence of the threshold was examined in a set of experiments. The threshold was varied from −10 to 0 dB and the performance was measured in terms of PESQ, \({\text{SNR}}_{\text{SEG}}\), SIG, BAK and NSEC (intelligibility). Figures 4, 5 and 6 show the impact of the threshold on speech quality (PESQ and \({\text{SNR}}_{\text{SEG}}\)), on residual noise (BAK) and speech distortion (SIG), and on the NSEC scores, respectively. In terms of PESQ, the best performance was obtained at T = −10 dB, while in terms of \({\text{SNR}}_{\text{SEG}}\) the best performance was obtained at T = −5 dB. Similarly, in terms of SIG and BAK, the appropriate value was T = −10 dB. The optimal value of T therefore depends on the measure considered, and a trade-off has to be made when selecting the threshold for the proposed system. Based on these results, T should lie between −10 and −5 dB.

Fig. 4 Impact of threshold on PESQ–MOS and \({\text{SNR}}_{\text{SEG}}\) scores

Fig. 5 Impact of threshold on residual noise and speech distortion

Fig. 6 Impact of threshold on speech intelligibility
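The effect of the threshold can be illustrated by sweeping T over the range studied above and counting how many T–F units survive the masking rule. The fraction of retained units is used here only as a simple stand-in for the quality and intelligibility measures, which cannot be computed in a short example; the function names, the step size and the magnitude values are hypothetical.

```python
import math


def retained_fraction(est_mag, ref_mag, threshold_db, eps=1e-12):
    """Fraction of T-F units whose |S_EST|/|S| ratio in dB reaches the threshold."""
    kept = total = 0
    for est_row, ref_row in zip(est_mag, ref_mag):
        for est, ref in zip(est_row, ref_row):
            ratio_db = 10.0 * math.log10((est + eps) / (ref + eps))
            if ratio_db >= threshold_db:
                kept += 1
            total += 1
    return kept / total


def sweep_thresholds(est_mag, ref_mag, start_db=-10.0, stop_db=0.0, step_db=2.5):
    """Return (threshold, retained fraction) pairs from start_db to stop_db."""
    results = []
    t = start_db
    while t <= stop_db + 1e-9:
        results.append((t, retained_fraction(est_mag, ref_mag, t)))
        t += step_db
    return results


if __name__ == "__main__":
    est = [[1.0, 0.5, 0.2, 0.15]]  # illustrative |S_EST| values
    ref = [[1.0, 1.0, 1.0, 1.0]]   # illustrative |S| values
    for t, frac in sweep_thresholds(est, ref):
        print(f"T = {t:+.1f} dB -> {100.0 * frac:.0f}% of units retained")
```

A higher threshold removes more units, which lowers the residual noise but risks discarding speech energy; this is the trade-off behind the recommended range for T.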

7 Spectrogram analysis

To obtain more detailed information about the residual noise and the speech preservation capability of the proposed system, spectrogram analysis was performed. Figure 7 shows sample spectrograms for both stages of the proposed system. The speech utterance was degraded by babble noise at 0 dB SNR (PESQ = 1.51). As the spectrograms of both stages in Fig. 7c, d show, the second stage reduced the background noise more effectively and preserved the speech content better than the first stage and the noisy speech. The proposed noise reduction system removed most of the residual noise while preserving the speech content.

Fig. 7 Spectrogram analysis: a clean speech with PESQ = 4.5, b noisy speech with PESQ = 1.51, c speech processed by first stage with PESQ = 1.87, and d speech

8 Summary and conclusion

A two-stage noise reduction system based on Wiener filtering and ideal binary masking was proposed for reducing background noise in single-microphone recordings at very low signal-to-noise ratio (SNR). In the first stage, Wiener filtering with an improved a priori SNR is applied to the noisy speech for background noise reduction, while in the post-processing second stage the ideal binary mask is estimated in every time–frequency channel using the pre-processed first-stage speech. The energy in every time–frequency channel was compared to a pre-selected threshold T to reduce the residual background noise. All time–frequency channels satisfying the threshold constraint were retained, whereas all other time–frequency channels were attenuated. PESQ was used to predict the mean opinion scores (MOS) and the composite measure was used to predict the amount of residual noise (BAK) and speech distortion (SIG). All measures indicated significant improvements with the proposed noise reduction system. The spectrogram analysis indicated low speech distortion and little residual noise with the proposed system. Moreover, a significant improvement in speech intelligibility was also obtained with the proposed noise reduction system.