1 Introduction

Speech is an effective way of communicating ideas from one person to another. When a speech signal propagates through a highly non-stationary noisy medium, it may be severely distorted. Daily-life noise patterns such as pop music, exhibition halls, multi-talker babble, and restaurants are examples of highly non-stationary noises that create maximum distortion in speech patterns. The distorted speech may become meaningless, and the performance of speech communication systems deteriorates sharply (Zhang 2010). Hence, an effective speech enhancement system is required for the removal or suppression of this highly non-stationary noise from speech patterns. From the listener's point of view, the other purpose of speech enhancement is to improve the intelligibility and clarity of speech patterns for a better understanding of the speech signal.

There are various speech enhancement methods available in the literature (Singh et al. 2014; Boll 1979; McAulay and Malpass 1980; Ephraim 1992; Dendrinos et al. 1991; Ephraim and Trees 1995; Jensen and Hansen 1995; Yi and Loizou 2004; Bahoura and Rouat 2006; Johnson et al. 2007; Hongyan et al. 2008; Jie and Heping 2012; Farah and Celia 2012; Gabor 1946; Singh et al. 2013; Goupillaud et al. 1984). In general, these methods may be classified into four groups: spectral-subtractive algorithms (Singh et al. 2014), statistical-model-based algorithms (McAulay and Malpass 1980; Ephraim 1992), subspace algorithms (Ephraim 1992; Dendrinos et al. 1991; Ephraim and Trees 1995), and wavelet transform (WT) methods (Yi and Loizou 2004; Bahoura and Rouat 2006; Johnson et al. 2007; Hongyan et al. 2008; Jie and Heping 2012; Farah and Celia 2012). Wavelet analysis gives information in both time and frequency; this time-frequency analysis is far better suited than traditional Fourier analysis to speech mixed with highly non-stationary noise (Singh et al. 2014; Gabor 1946). The wavelet transform provides a multi-resolution analysis of highly non-stationary speech signals by using long windows for low frequencies and short windows for high frequencies (Goupillaud et al. 1984). Hence, wavelets are effective for enhancing speech patterns in all noisy environments. Soft and hard thresholds are the most commonly used functions in wavelet-based speech enhancement: residual noise remains in the enhanced speech because the hard threshold function is discontinuous, while the soft threshold function is continuous but introduces an unavoidable error during speech reconstruction.

To improve the quality and intelligibility of speech patterns, this paper proposes a binary mask threshold function based on the db10 wavelet transform for the enhancement of speech patterns mixed with highly non-stationary noise at low (negative) SNR. The binary mask threshold value is set to \(-\)5 dB. The input noisy speech pattern is decomposed into five levels. After thresholding, the speech pattern is reconstructed, and the results show an effective improvement in the performance parameters of the enhanced speech patterns.

The remainder of the paper is organized as follows: Sect. 2 presents the five-level wavelet decomposition and the binary mask thresholding function, followed by the procedure adopted for the proposed method in Sect. 3. Simulation conditions, the parameters used, and a performance analysis of the different methods are given in Sect. 4 for the enhancement of single-channel Hindi speech patterns mixed with highly non-stationary noise at low (negative) SNR. Finally, conclusions are drawn in Sect. 5.

2 Background

2.1 Observation model

A noisy speech signal in the time domain, recorded by a single microphone, is generally given as:

$$\begin{aligned} y(n)=x(n)+v(n) \end{aligned}$$
(1)

where \(y(n)\), \(x(n)\) and \(v(n)\) denote the noisy speech, clean speech and highly non-stationary additive background noise, respectively. To obtain clean speech patterns from the noisy speech patterns, a specific gain, i.e. the Wiener gain, is applied to each spectral component:

$$\begin{aligned} x(k, t)=g(k, t)\, y(k, t) \end{aligned}$$
(2)

where \(g(k, t)\) denotes the gain function, and \(x(k, t)\) and \(y(k, t)\) denote the estimated clean Hindi speech spectrum and the noisy speech spectrum, respectively; \(t\) is the time-frame index and \(k\) is the frequency bin.

2.2 Wavelet transform

A wavelet is a mathematical function used to divide a given function into different scale components. Wavelet transforms fall into two main categories: the continuous wavelet transform (CWT) and the discrete wavelet transform (DWT). The CWT generates substantial redundant information, since it is obtained by continuously scaling and translating the mother wavelet. The mother wavelet can, however, also be scaled and translated over a specific subset of scale and translation values (a representation grid), which makes the DWT more efficient. The DWT has high frequency resolution in low bands and low frequency resolution in high bands, which is very helpful for speech signal processing. The DWT \(W(m,n)\) of a signal \(f_{(t)}\) with respect to a wavelet \(\phi _{(t)}\) is given as:

$$\begin{aligned} W(m,n)=2^{-(m/2)}\sum _{t=0}^{T-1} {f_{(t)} \phi _{((t-n.2^{m})/2^{m})} } \end{aligned}$$
(3)

where \(T\) is the length of the signal \(f_{(t)}\), \(m\) and \(n\) are the integer scaling and translation parameters, and \(t\) is the discrete time index.

After extensive experimentation with different types of mother wavelets, such as Daubechies (orders 1–40), Symlets (orders 2–8), Coiflets (orders 1–5), and BiorSplines (orders 1.1–6.8), db10 was found suitable for this application because its order and shape are similar to those of the noisy speech patterns; the performance of DWT-type wavelets depends on the shape and order of the mother wavelet. In each step of the wavelet transform, a particular scaling function is applied to the input data. If the input data has \(N\) samples, the scaling function calculates \(N/2\) smoothed values; in the ordered wavelet transform, these smoothed values are stored in the lower half of the \(N\)-element input vector.

In wavelet analysis, the noisy speech pattern is split into two types of coefficients, namely approximation and detail coefficients. In this paper, five-level decomposition of the wavelet coefficients is used, since the needed information of the speech pattern remains in the low-frequency band, and the highest decomposition level gives maximum resolution in the lower frequency band of the input speech pattern. The high-pass filter gives the detail coefficients and the low-pass filter gives the approximation coefficients. Five levels of decomposition yield five sets of detail coefficients, D1, D2, D3, D4, D5, and the fifth-level approximation coefficients for the analysis of the highly non-stationary noisy speech pattern. The levels of decomposition are shown in Fig. 1.
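The analysis step described above can be sketched in a few lines of NumPy. For brevity this illustration uses the two-tap Haar filters rather than the paper's 20-tap db10 pair, but the filter-then-downsample structure is identical:

```python
import numpy as np

def dwt_level(x, lo, hi):
    """One analysis step of Eq. (3): filter with the low-pass and
    high-pass filters, then downsample by two."""
    a = np.convolve(x, lo)[1::2]   # approximation (smoothed) coefficients
    d = np.convolve(x, hi)[1::2]   # detail coefficients
    return a, d

# Haar filters, used here for brevity; the db10 filter taps would be
# substituted in their place.
lo = np.array([1.0, 1.0]) / np.sqrt(2)
hi = np.array([1.0, -1.0]) / np.sqrt(2)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)      # stand-in for a noisy speech frame

details, approx = [], x
for _ in range(5):                 # five-level decomposition
    approx, d = dwt_level(approx, lo, hi)
    details.append(d)              # D1 (finest) ... D5 (coarsest)
# 'approx' now holds A5; each level halves the number of samples.
```

Each level halves the band, so the five detail vectors have 512, 256, 128, 64 and 32 samples, matching the \(N/2\) behaviour described above.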

Fig. 1
figure 1

Block diagram of five level wavelet decomposition

3 Proposed method

3.1 Binary mask threshold function

In Eq. 2, a Wiener gain function is used to compute the clean speech spectrum, since this gain function is very effective in terms of speech quality and intelligibility (Scalart and Filho 1996). The Wiener gain depends on the a priori SNR and is calculated as follows (Rangachari and Loizou 2006):

$$\begin{aligned} g(k, t)=\sqrt{\frac{{priori\,SNR}}{{1+priori\,SNR}}} \end{aligned}$$
(4)
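Eq. (4) is straightforward to express in code; a minimal sketch, assuming the a priori SNR is supplied per time-frequency bin in linear (not dB) units:

```python
import numpy as np

def wiener_gain(priori_snr):
    """Square-root Wiener gain of Eq. (4) for a linear a priori SNR."""
    priori_snr = np.asarray(priori_snr, dtype=float)
    return np.sqrt(priori_snr / (1.0 + priori_snr))
```

The gain approaches 1 in high-SNR bins (speech passes through almost unchanged) and 0 in low-SNR bins (noise-dominated bins are attenuated).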

Now, the overall estimate of the noise spectral magnitude using the noisy Wiener gain function is given as:

$$\begin{aligned} {\hat{A}}(k, t)=g (k, t). y(k, t) \end{aligned}$$
(5)

where \(g (k, t)\) is the noisy Wiener gain function. On the basis of this estimated noise spectrum \({\hat{A}}(k, t)\), a binary mask is constructed. If the estimated noise magnitude spectrum is greater than the true noise magnitude spectrum, this is a condition of noise overestimation. The over/under-estimation is checked for each time-frequency bin \((k, t)\) (Fig. 2): bins satisfying the constraint are retained, while bins violating it are zeroed out. The threshold value for the binary mask threshold function is set to \(-\)5 dB for maximum improvement. Using this concept, the modified speech magnitude spectrum is recovered as (Hu and Loizou 2007):

$$\begin{aligned} x_{enhanced}(k, t) =\left\{ {\begin{array}{ll} x(k, t) , &{}\quad \mathrm{if}\;\hat{A}(k, t)> A(k, t) \\ 0 , &{}\quad \mathrm{otherwise} \\ \end{array}} \right. \end{aligned}$$
(6)

where \(\hat{{A}}(k, t)\) and \(A(k, t)\) are the estimated and true noise magnitude spectra for time-frequency bin \((k, t)\). An inverse wavelet transform is applied to compute the enhanced speech spectrum, and finally the overlap-and-add technique is used to synthesize the noise-suppressed speech signal.
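Eq. (6) amounts to a per-bin keep-or-zero decision; a minimal NumPy sketch with hypothetical toy magnitudes (illustrative values only, not taken from real speech):

```python
import numpy as np

def binary_mask(x_wiener, a_est, a_true):
    """Eq. (6): keep a time-frequency bin of the Wiener-filtered
    spectrum when the estimated noise magnitude exceeds the true one;
    zero it out otherwise."""
    return np.where(a_est > a_true, x_wiener, 0.0)

# Toy magnitudes for three bins (hypothetical illustration):
x_w   = np.array([1.0, 2.0, 3.0])   # Wiener-filtered speech magnitudes
a_hat = np.array([0.5, 2.5, 3.5])   # estimated noise magnitudes
a     = np.array([1.0, 1.0, 1.0])   # true noise magnitudes
masked = binary_mask(x_w, a_hat, a)  # first bin is zeroed out
```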

Fig. 2
figure 2

Flow chart of the proposed method for enhancement in single-channel speech patterns

The input signal is obtained by mixing a single-channel Hindi speech pattern with additive background noise such as pop music, exhibition, restaurant, or babble. The db10 wavelet transform is used for decomposition, as db10 is suitable and gives better results for this application. In the next step, the detail coefficients are recovered with the same number of samples as in the input speech. The detail coefficients D1, D2, D3, D4, D5 are then passed to the binary mask threshold function for noise suppression. The denoised coefficients are obtained by applying the binary mask to the detail coefficients (D1 to D5) and the approximation coefficients A5 of the input noisy speech pattern. The inverse wavelet transform is applied to the combined detail and approximation coefficients to obtain the enhanced speech pattern. The capability of the proposed method is measured in terms of performance parameters that show the improvement in quality and intelligibility of the Hindi speech pattern.
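The complete analysis–mask–synthesis loop described above can be sketched as follows. The sketch substitutes the two-tap Haar filters for the paper's db10 pair and accepts an arbitrary per-band masking function, so it illustrates the structure rather than the exact implementation; the input length is assumed to be divisible by \(2^{5}\):

```python
import numpy as np

SQ2 = np.sqrt(2.0)

def analyse(x):
    """One Haar analysis step (low-pass / high-pass, then downsample)."""
    a = (x[0::2] + x[1::2]) / SQ2
    d = (x[1::2] - x[0::2]) / SQ2
    return a, d

def synthesise(a, d):
    """Matching synthesis step (upsample and inverse filter)."""
    x = np.empty(2 * len(a))
    x[0::2] = (a - d) / SQ2
    x[1::2] = (a + d) / SQ2
    return x

def enhance(noisy, mask_fn, levels=5):
    """Decompose, apply mask_fn to D1..D5 and A5, then reconstruct."""
    details, a = [], noisy
    for _ in range(levels):
        a, d = analyse(a)
        details.append(d)
    a = mask_fn(a)                       # mask the approximation band A5
    details = [mask_fn(d) for d in details]
    for d in reversed(details):          # inverse transform, coarse to fine
        a = synthesise(a, d)
    return a

# With an identity mask the chain must reconstruct the input exactly.
rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
y = enhance(x, lambda c: c)
```

Replacing the identity mask with a keep-or-zero decision on the coefficients yields the denoising behaviour of the proposed method.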

4 Simulations and results discussion

4.1 Simulation conditions

In this experiment, the clean Hindi speech patterns have been taken from the IIIT-H (International Institute of Information Technology Hyderabad) Indic speech database (Prahallad et al. 2012), spoken by a female speaker. This database consists of 1000 speech patterns. The clean Hindi speech patterns were mixed with four different types of highly non-stationary noise source patterns at signal-to-noise ratios (SNRs) ranging from \(-\)5 to \(-\)25 dB (in 5 dB steps). These highly non-stationary noise sources (babble, exhibition, restaurant, and pop music) are taken from the AURORA database (Pearce and Hirsch 2000). The sampling rate of the noise and speech patterns is 16 kHz.

4.2 Performance parameters

The performance of the methods is compared on the basis of subjective and objective measurements. The output SNR, perceptual evaluation of speech quality (PESQ), peak SNR (PSNR), and the cepstrum distance measure are taken for the evaluation of the enhanced speech signal.

SNR is the ratio of the RMS amplitude of the signal, \(A_{signal}\), to that of the noise, \(A_{noise}\). It is an objective measure of speech quality, given in dB as:

$$\begin{aligned} {\textit{SNR}}_{dB} =10\log _{10} \left[ {\left( {\frac{A_{signal} }{A_{noise} }} \right) ^{2}} \right] \end{aligned}$$
(7)

Peak-SNR is the ratio between the maximum possible power of the clean speech signal and the power of the corrupting noise. It is calculated as:

$$\begin{aligned} {\textit{PSNR}}=10\log _{10} \left( {\frac{MAX_{signal}^2 }{MSE}} \right) \end{aligned}$$
(8)

where MSE is the mean square error between the clean and enhanced Hindi speech signals.
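Both measures can be computed directly from the waveforms; a minimal sketch of Eq. (7), and of Eq. (8) in its standard peak-power-over-MSE form:

```python
import numpy as np

def snr_db(signal, noise):
    """Eq. (7): 10 log10 of the squared RMS amplitude ratio."""
    rms_s = np.sqrt(np.mean(np.square(signal)))
    rms_n = np.sqrt(np.mean(np.square(noise)))
    return 10.0 * np.log10((rms_s / rms_n) ** 2)

def psnr_db(clean, enhanced):
    """Eq. (8): peak power of the clean signal over the MSE."""
    mse = np.mean(np.square(clean - enhanced))
    return 10.0 * np.log10(np.max(np.square(clean)) / mse)
```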

PESQ is an algorithm that analyzes the enhanced Hindi speech signal sample by sample after a temporal alignment of the enhanced and clean Hindi speech signals. It is standardized as ITU-T recommendation P.862 (02/01) for objective voice quality testing. The mapping function of PESQ is given as (PESQ 2003):

$$\begin{aligned} {\textit{PESQ}}=0.999+\left( {\frac{4.999-0.999}{1+e^{-1.4945{*}x+4.6607}}} \right) \end{aligned}$$
(9)

where \(x\) is the raw PESQ score of the enhanced Hindi speech signal. The PESQ algorithm gives a mean opinion score (MOS) ranging from 1 (bad) to 4.5 (excellent).
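The logistic mapping of Eq. (9) is a one-liner in code; here \(x\) is the raw PESQ score and the constants are those quoted above:

```python
import math

def pesq_to_mos(x):
    """Eq. (9): map a raw PESQ score x onto a MOS-like scale
    bounded below by 0.999 (bad) and above by 4.999 (excellent)."""
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * x + 4.6607))
```

The mapping is monotonic, so ranking methods by raw PESQ or by mapped MOS gives the same ordering.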

The cepstrum distance measure is a frequency-domain distortion measure between the input and output speech signals. Among the several LPC-based methods for evaluating the spectral envelope, the cepstrum distance corresponds best to subjective measures (Kitawaki and Nagabuchi 1988). It is calculated as:

$$\begin{aligned} {\textit{CD}}=\frac{10}{\ln 10} \sqrt{2\sum _{i=1}^P {\{c_y (i)-c_x (i)\}^{2} } } \end{aligned}$$
(10)

where CD is the cepstral distance, and \(c_y (i)\) and \(c_x (i)\) are the cepstral coefficients of the input and output speech signals, respectively. \(P\) is the maximum number of coefficients.
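A minimal sketch of this distance, assuming the per-frame cepstral coefficients have already been computed (e.g. from LPC analysis) and excluding the \(c(0)\) term for simplicity:

```python
import numpy as np

def cepstral_distance(c_in, c_out, p=16):
    """Cepstral distance in dB over the first p coefficients
    (c(0) excluded for simplicity)."""
    diff = np.asarray(c_in[1:p + 1], float) - np.asarray(c_out[1:p + 1], float)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(np.square(diff)))
```

Identical cepstra give a distance of zero; larger spectral-envelope differences give larger distances.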

4.3 Results and discussion

The performance of the proposed method is compared with commonly used methods, namely log-MMSE (Ephraim and Malah 1985), test-psc (Stark 2008), Wiener (Scalart and Filho 1996), IdBM (Wojcicki and Loizou 2012), and spectral subtraction (Boll 1979). Four performance measures, viz. PESQ, output SNR, PSNR, and cepstrum distance, are taken for the comparative analysis. Four types of highly non-stationary noise sources, babble, pop music, restaurant, and exhibition, are taken at input SNR levels varying from \(-\)5 to \(-\)25 dB. The performance parameters are given in Tables 1, 2, 3 and 4, which report the objective measures obtained for the noisy and enhanced single-channel Hindi speech patterns.

Table 1 Output SNR measures values obtained for noisy and enhanced speech pattern
Table 2 Cepstrum Distance measures values obtained for noisy and enhanced speech pattern
Table 3 PESQ measures values obtained for noisy and enhanced speech pattern
Table 4 PSNR measures values obtained for noisy and enhanced speech pattern

The output SNR values are given in Table 1. The proposed method gives the highest output SNR values at all input SNR levels (varied from \(-\)25 to \(-\)5 dB) except at \(-\)10 dB, where it gives the second-highest output SNR of 5.4287 dB while the IdBM method gives the maximum of 6.5900 dB for the Hindi language. For the pop music and exhibition noise sources, the proposed method shows the maximum improvement in output SNR except at \(-\)5 and \(-\)15 dB, respectively. For the restaurant noise case, the proposed method gives the maximum output SNR values in comparison with the other methods. The overall performance of the proposed method is very good in terms of output SNR.

Table 2 gives the cepstrum distance values at various input SNR levels in the presence of the four highly non-stationary noise sources (babble, pop music, restaurant and exhibition). The cepstrum distance must be minimal for maximum improvement in the noisy speech spectrum. The proposed method gives the greatest improvement for all noise sources except restaurant noise, for which it shows the second-greatest improvement. Hence, the restaurant noise case is the exceptional case for the Hindi language in terms of the overall performance of the proposed method.

The PESQ values are given in Table 3. A high PESQ value indicates maximum improvement, and the proposed method gives the highest PESQ values for all noise levels and sources. Since perceptual evaluation of speech is closely related to speech intelligibility, a high PESQ value also implies high speech intelligibility; the proposed method therefore gives the maximum improvement in intelligibility.

The performance of the proposed method is also measured in terms of the PSNR parameter, and the output values are given in Table 4. The proposed method shows the maximum PSNR values in the babble and restaurant noise cases, whereas it is not consistent for the pop music and exhibition noise environments at \(-\)5 and \(-\)15 dB, respectively; at all other noise levels it gives the greatest improvement. These two noise levels are considered exceptional cases, since the overall performance of the proposed method is very high.

The PESQ and PSNR parameter values are given in Tables 3 and 4, respectively. The spectrogram of the single-channel Hindi speech pattern “apke hindi pasand karne par khusi hui”, spoken by a female speaker, is shown in Fig. 3. The single-channel Hindi speech patterns are processed at various input SNR levels by all of the mentioned methods to measure the improvement in quality and intelligibility.

Fig. 3
figure 3

Spectrogram of Hindi utterance, apke hindi pasand karne par khusi hui by a female speaker is given as: a clean speech b noisy speech (pop music noise at \(-\)25 dB SNR), c log-mmse, d test-psc, e Wiener, f idbm, g spectral subtraction, h proposed method

The proposed method gives greater quality and intelligibility than log-MMSE, test-psc, Wiener, IdBM and spectral subtraction. One major difference between the spectrogram of the proposed method and those of the aforementioned methods is that the proposed method leaves no residual noise or impulses in its output spectrogram. It is clear that the output plot of the proposed method gives a clean pattern, whereas Wiener, spectral subtraction and log-MMSE give output patterns with some impulses that create distortion in the speech spectrum. The remaining two methods, test-psc and IdBM, do not improve the noisy speech to a level at which a listener can hear it clearly. The proposed method reduces highly non-stationary noise efficiently and improves quality as well as intelligibility, while the other methods introduce distortions such as impulses into the processed speech. The proposed method is not consistent in some cases of noisy Hindi speech, but its overall performance is very good in comparison with the other enhancement methods; these points are considered exceptional cases for the Hindi-language analysis.

5 Conclusion

In this paper, a binary mask thresholding function based on the db10 mother wavelet transform is proposed and compared with other commonly used methods to measure its effectiveness in enhancing single-channel Hindi speech patterns at low (negative) SNRs ranging from \(-\)5 to \(-\)25 dB (in 5 dB steps). The binary mask thresholding function acts as the decision-making function for the reconstruction of enhanced speech patterns from noisy Hindi speech patterns.

Simulation results for the various methods show that the proposed method gives consistently better results in terms of the quality and intelligibility measures, i.e. higher output SNR, PESQ and PSNR and a lower cepstrum distance, at the different input SNR levels (i.e. \(-\)5, \(-\)10, \(-\)15, \(-\)20 and \(-\)25 dB). The proposed method is not consistent for some noise levels in the Hindi-language database; although, as discussed earlier, a few results are inconsistent, from an overall perspective the proposed method performs better than all the other methods. The spectrograms and listening quality also show that the proposed method gives the greatest improvement in the quality and intelligibility of the reconstructed speech signal.