1 Introduction

Speech is the fundamental mode by which humans communicate ideas to one another. When a speech signal is transmitted through a noisy medium, it may become distorted. The noise may arise from everyday sources such as vehicles, fans, machine guns, tanks, factories and fighter planes, all of which introduce distortion into the speech signal. The distorted speech may become unintelligible. Hence, effective speech enhancement methods are needed to enhance such noisy speech signals. Various speech enhancement methods are available in the literature (Lim and Oppenheim 1979; Loizou 2007; Weiss et al. 1974; Boll 1979; Wiener 1949; Hansen and Clements 1991; Ephraim and Malah 1984, 1985; Hazrati and Loizou 2012; Paliwal et al. 2011, 2012; Wojcicki and Loizou 2012). Among these techniques are spectral subtraction, minimum mean square error (MMSE) based techniques, modulation channel based speech enhancement techniques, Wiener filtering methods and wavelet transform based methods. The spectral subtractive algorithms were initially proposed by Weiss et al. (1974) in the correlation domain and later by Boll (1979) in the Fourier transform domain. After filtering, the spectral subtractive method generates isolated peaks (i.e. musical noise). The optimal filter that minimizes the estimation error is called the Wiener filter. Wiener filtering algorithms exploit the fact that the noise is additive, so an estimate of the clean signal spectrum can be obtained simply by subtracting the noise spectrum from the noisy speech spectrum (Wiener 1949). The main drawback of the iterative Wiener filtering approach is that, as additional iterations are performed, the speech formants shift in location and their bandwidths decrease (Hansen and Clements 1991). The Wiener filter is the optimal complex spectrum estimator, not the optimal magnitude spectrum estimator.
Ephraim and Malah proposed an MMSE estimator that is the optimal magnitude spectrum estimator (Ephraim and Malah 1984). Unlike Wiener filtering, the MMSE estimator does not require a linear model between the observed data and the estimator, but it assumes probability distributions for the speech and noise DFT coefficients, which are taken to be statistically independent and hence uncorrelated. One drawback of this estimator is that, although its error criterion is mathematically tractable, it is not the most subjectively meaningful one. To overcome this problem, a log-MMSE estimator was derived by Ephraim and Malah (1985). Furthermore, more efficient techniques based on the ideal binary mask (IdBM) are available in the literature (Hazrati and Loizou 2012), and the modulation channel selection based method is more efficient for improving both quality and intelligibility (Paliwal et al. 2011, 2012; Wojcicki and Loizou 2012).

Many researchers have worked on speech enhancement using wavelet transform based methods, and many algorithms built on various thresholding concepts have been proposed (Donoho 1995; Aggarwal et al. 2011; Sanam and Shahnaz 2012; Tabibian et al. 2009; Bahoura and Rouat 2001; Zhao et al. 2011; Sheikhzadeh and Abutalebi 2001; Yi and Loizou 2004; Wang and Zhang 2005; Shao and Chang 2007). The quality and intelligibility of the enhanced speech, however, depend on the criterion adopted for the masking threshold. A more efficient concept for threshold selection is adaptive thresholding (Johnson et al. 2007; Sumithra 2009; Sanam and Shahnaz 2012; Yu et al. 2007; Zhou 2010; Ghanbari and Reza 2006). A novel data-adaptive thresholding approach to single-channel speech enhancement is given by Hamid et al. (2013). In that work, a complex signal, formed by combining fractional Gaussian noise with the noisy speech, was used in place of the mixed speech signal. A wavelet packet based binary mask method has been given for mixed noise suppression (Singh et al. 2014, 2015), in which more than one noise is mixed with the clean signal to generate the noisy speech data for performance evaluation.

Over the past four decades, various single-channel speech enhancement methods have been proposed for the reduction or removal of one noise at a time, but they have not been analyzed for mixtures of noises occurring simultaneously. In this paper, a comparative study and implementation of speech enhancement techniques are presented for single-channel Hindi speech patterns corrupted by mixed, highly non-stationary noises. The mixed noises considered are exhibition + pop music, restaurant + train, pop music + train + babble and pop music + babble + car. These four noise groups are used for the quality and intelligibility evaluation of Hindi speech patterns. Well known techniques such as spectral subtraction, Wiener filtering, MMSE, p-MMSE, log-MMSE, ideal channel selection, the modulation channel based method and wavelet transform based methods are implemented, and their subjective and objective performances are analyzed to find the optimal technique for Hindi speech enhancement under environmental conditions where more than one noise is present.

The paper is organized as follows: Sect. 2 presents the background of single-channel speech enhancement techniques for noise reduction and the binary mask function. Simulation conditions are given in Sect. 3. Section 4 presents the results and discussion. Finally, the conclusion is summarized in Sect. 5.

2 Noisy speech enhancement

A mixed, highly non-stationary noisy single-channel Hindi speech signal can be modeled as the sum of clean speech and one or more additive background noises:

$$ y(n) = x(n) + n_{1} (n) + \cdots + n_{N} (n) $$
(1)

where y(n), x(n) and n1(n), …, nN(n) denote the noisy speech, the clean speech and the N additive, highly non-stationary background noises, respectively.
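As a minimal sketch of this signal model, the following Python fragment mixes a clean pattern with several noise sequences sample by sample, per Eq. 1. Plain Python lists stand in for sampled audio, and the function names are illustrative assumptions, not part of the paper's MATLAB implementation.

```python
import math

def mix_signals(clean, noises):
    """Eq. (1): y(n) = x(n) + n1(n) + ... + nN(n), sample by sample."""
    assert all(len(nz) == len(clean) for nz in noises)
    return [x + sum(nz[i] for nz in noises) for i, x in enumerate(clean)]

def snr_db(signal, noise):
    """Global SNR in dB between the clean signal and the total additive noise."""
    p_signal = sum(s * s for s in signal)
    p_noise = sum(n * n for n in noise)
    return 10.0 * math.log10(p_signal / p_noise)
```

Summing the individual noise sequences before calling `snr_db` gives the overall SNR of the mixed-noise condition.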

2.1 Simulated algorithm

The motivation for simulating various speech enhancement algorithms is to enhance noisy single-channel speech patterns corrupted by mixed, highly non-stationary signals. Eight commonly used speech enhancement algorithms are evaluated for the enhancement of mixed noisy single-channel Hindi speech patterns. Wiener filtering, spectral subtraction, log-MMSE and wavelet transform based methods (Daubechies10, Daubechies40, Symlet18, Coiflet5, BiorSpline6.8) are implemented for comparative analysis. The Wiener filtering method is an iterative method based on minimizing the mean square error of the noisy speech (Wiener 1949). Spectral subtraction is a widely used frequency domain method for the reduction of additive uncorrelated noise in a speech pattern (Boll 1979). The log-spectrum based MMSE, described by Ephraim and Malah (1985) after the simple MMSE, assumes a Gaussian model for the complex spectral amplitudes of both speech and noise and gives the optimum estimate of the log-spectrum of the clean speech signal.
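The magnitude spectral subtraction idea can be sketched in a few lines of Python. This is a simplified single-frame illustration under stated assumptions, not the implementation evaluated in the paper: a naive stdlib-only DFT stands in for an FFT, the noise magnitude spectrum is assumed known, and negative differences are floored at zero, which is the source of the musical noise mentioned above.

```python
import cmath

def dft(frame):
    """Naive DFT (O(N^2)); an FFT would be used in practice."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    N = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

def spectral_subtract(noisy_frame, noise_mag):
    """Subtract an estimated noise magnitude spectrum from the noisy
    magnitude spectrum, floor negative values at zero, and rebuild the
    frame with the noisy phase (in the spirit of Boll 1979)."""
    Y = dft(noisy_frame)
    cleaned = [cmath.rect(max(abs(Yk) - Nk, 0.0), cmath.phase(Yk))
               for Yk, Nk in zip(Y, noise_mag)]
    return idft(cleaned)
```

Because only magnitudes are attenuated, the enhanced frame can never carry more energy than the noisy one.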

A wavelet is a mathematical function used to divide a given function into different scale components. It breaks the signal into shifted and dilated versions of a short-term waveform called the mother wavelet. It has high frequency resolution in low-frequency bands and low frequency resolution in high-frequency bands. Hence, it is very helpful in various fields of signal processing and is widely used for signal analysis. The wavelet transform W(s, τ) of a signal x(t) is defined as:

$$ W(s,\,\tau ) = \frac{1}{\sqrt s }\int {x(t)\psi \left( {\frac{(t - \tau )}{s}} \right)dt} $$
(2)

where s > 0 and τ ∊ R, x(t) is the input noisy speech signal and ψ(t) is the mother wavelet function, which satisfies the orthogonality condition and is localized in both the time and frequency domains. Here s is the scaling parameter, which determines the width of the mother wavelet, and τ is the translation parameter, which gives the center of the mother wavelet. The selection of an appropriate mother wavelet plays an important role in the analysis and depends on the application. Various basis functions have been proposed, including the Haar, Morlet, Mexican hat, Daubechies and bi-orthogonal wavelets. The Daubechies10, Daubechies40, Symlet18, Coiflet5 and BiorSpline6.8 mother wavelets are used for the decomposition into detail and approximation coefficients in the proposed work. The five-level wavelet decomposition is shown in Fig. 1. The detail coefficients at the five levels are recovered with the same number of samples as the input speech. These detail coefficients D1–D5 are then passed to the binary mask threshold function for removing noise coefficients. A block diagram of the proposed procedure is given in Fig. 2. The binary mask decision is applied to the detail and approximation coefficients at all five levels. After applying the binary mask decision to the coefficients, we obtain denoised coefficients. These denoised detail coefficients are then combined with the approximation coefficients, and the Inverse Wavelet Transform (IWT) is applied to obtain the denoised speech signal.

Fig. 1
figure 1

Block diagram of wavelet decomposition up to five levels

Fig. 2
figure 2

Flow chart of the proposed method for enhancement of single-channel Hindi speech patterns
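The five-level decomposition and reconstruction cascade of Figs. 1 and 2 can be sketched as below. As a simplifying assumption, the Haar wavelet is used here as the simplest orthogonal stand-in for the Daubechies, Symlet, Coiflet and BiorSpline wavelets actually evaluated, and the binary mask step is omitted, so reconstruction is perfect; the function names are illustrative, not the paper's code.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_step(x):
    """One analysis level: split into approximation and detail halves."""
    approx = [(x[2 * i] + x[2 * i + 1]) / SQRT2 for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / SQRT2 for i in range(len(x) // 2)]
    return approx, detail

def haar_inverse_step(approx, detail):
    """Invert one analysis level (perfect reconstruction)."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return x

def wavelet_decompose(x, levels=5):
    """Cascade as in Fig. 1: peel off detail bands D1..D5, keep A5."""
    details, approx = [], list(x)
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)  # D1 (finest) first
    return approx, details

def wavelet_reconstruct(approx, details):
    """Inverse cascade of Fig. 2; masked detail bands would be substituted
    for the raw ones before this call in the actual denoising procedure."""
    for d in reversed(details):
        approx = haar_inverse_step(approx, d)
    return approx
```

The input length must be divisible by 2^5 = 32 for five clean halvings; practical toolboxes handle arbitrary lengths by signal extension.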

To obtain clean speech patterns from the noisy speech patterns, the estimated noise spectrum is subtracted from the noisy speech spectrum, which is represented as:

$$ x(f,t) = y(f,t) - n(f,t) $$
(3)

where n(f, t) denotes the noise spectrum, and x(f, t) and y(f, t) denote the enhanced Hindi speech and noisy speech spectra, respectively; t and f indicate the frame index and the channel (frequency bin) index, respectively.

In Eq. 3, the clean speech spectrum is computed by subtracting the estimated noise spectrum. This estimate is accurate and very effective in terms of speech quality and intelligibility (Scalart and Filho 1996; Hu and Loizou 2007). In Eq. 4, the a priori SNR is calculated using the speech and noise signals (Feng 2015).

On the basis of this estimated spectrum, a binary mask is constructed and clean speech channels are selected using the ideal binary mask. The term "ideal" indicates that a priori information about the target signal is used. To calculate the binary mask, the following SNR criterion is used (Kim and Loizou 2010):

$$ SNR(f,t) = 10\log_{10} \frac{{\left| {s(f,t)} \right|^{2} }}{{\left| {n(f,t)} \right|^{2} }} $$
(4)

where s(f, t) and n(f, t) represent the clean speech and noise signals, respectively. The noise signal is calculated frame by frame. The binary mask (BM) is then computed using the SNR criterion as (Hazrati and Loizou 2012):

$$ BM(f,t) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {if\;SNR(f,t) > Threshold} \hfill \\ 0 \hfill & {Otherwise} \hfill \\ \end{array} } \right. $$
(5)

The threshold value is set to −6 dB, which lies near the center of the range over which the mask performs well.
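Eqs. 4 and 5 together can be sketched as follows. The function names and the list-of-lists layout for the time-frequency magnitudes are illustrative assumptions, not the paper's code.

```python
import math

def snr_db_tf(clean_tf, noise_tf):
    """Eq. (4): SNR(f, t) = 10 log10(|s(f, t)|^2 / |n(f, t)|^2)."""
    return 10.0 * math.log10((clean_tf ** 2) / (noise_tf ** 2))

def binary_mask(clean_mag, noise_mag, threshold_db=-6.0):
    """Eq. (5): keep a time-frequency unit (mask value 1) only when its
    local SNR exceeds the threshold; otherwise discard it (mask value 0)."""
    return [[1 if snr_db_tf(s, n) > threshold_db else 0
             for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(clean_mag, noise_mag)]
```

Applying the mask is then an element-wise multiplication of the mask with the noisy coefficients.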

3 Simulation conditions

The Wiener, spectral subtraction, log-MMSE and wavelet transform (Daubechies10, Daubechies40, Symlet18, Coiflet5, BiorSpline6.8) based methods are compared with the proposed method for performance evaluation. Clean Hindi speech patterns [taken from the IIIT-H Indic speech database (Prahallad et al. 2012)] were combined with different noise patterns [taken from the NOIZEUS AURORA database (Hirsch and Pearce 2000)] for noisy speech generation. Five types of noise (pop music, babble, car, train and restaurant) are mixed with each other and added to the clean Hindi speech patterns at different signal-to-noise ratio (SNR) levels ranging from −25 to −5 dB. The mixed noise patterns are exhibition + pop music, restaurant + train, pop music + train + babble and pop music + babble + car. These four mixed noise groups are used for the quality and intelligibility evaluation of Hindi speech patterns in terms of the performance measures SNR, PESQ, SII and cepstrum distance. All algorithms were implemented in MATLAB 7.1.

4 Results and discussion

With the aim of improving the quality and intelligibility of mixed, highly non-stationary noisy Hindi speech patterns, four performance parameters are used, and the output values of these parameters are given in Tables 1, 2 and 3.

Table 1 Performance parameter output SNR values for various mixed highly non-stationary noises
Table 2 Performance parameter output PESQ values for various mixed highly non-stationary noises
Table 3 Performance parameter output Cepstrum distance values for various mixed highly non-stationary noises

The output SNR values of the various methods are given in Table 1. The maximum output SNR values are obtained with the Coiflet5, BiorSpline6.8 and Symlet18 wavelet transforms at the various input SNR levels for all noise types.

The highest PESQ values are obtained with the BiorSpline6.8 wavelet transform at all input SNR levels. The output PESQ values in Table 2 show the maximum improvement in the intelligibility and quality of the enhanced Hindi speech patterns.

A lower cepstrum distance corresponds to higher output PESQ values and a greater improvement in speech quality. Table 3 shows all the output cepstrum distance values; the minimum values are obtained with the BiorSpline6.8 wavelet transform.

The MOS parameter is used as a speech intelligibility measure, and the MOS values are given in Table 4. The improvement in MOS increases as the input SNR level increases. The improvement in intelligibility can also be compared on the basis of the spectrograms given in Fig. 3, which show the noisy speech and the speech enhanced by the different methods; the clearest spectrogram is produced by the BiorSpline6.8 wavelet transform. The best listening quality of the enhanced output is given by the proposed method.

Table 4 Performance parameter SII index values for various mixed highly non-stationary noises
Fig. 3
figure 3

Spectrograms of enhancement of single-channel speech (variation of frequency w.r.t. time): a clean, b mixed noisy speech (speech + pop music + babble + train), c Db10, d Db40, e Symlet18, f Coiflet5, g Bior 6.8, h Wiener, i Spectral Sub. j Log-MMSE

5 Conclusion

This paper presents a binary mask threshold function based BiorSpline6.8 wavelet transform method to enhance the quality and intelligibility of Hindi speech patterns corrupted by mixed, highly non-stationary noises at low SNR. A comparative study is also carried out, showing the performance of the conventional methods and the wavelet based algorithms for the enhancement of mixed-noise single-channel Hindi speech patterns. The wavelet domain methods show higher improvement in the quality and intelligibility measures than the other spectral methods, and the BiorSpline6.8 wavelet transform domain method gives the maximum improvement in speech quality and intelligibility parameters such as PESQ and output SNR. In addition, the spectrograms support the same results, and therefore the proposed BiorSpline6.8 method is more suitable than the other speech enhancement methods for the reduction of mixed, highly non-stationary noises at negative SNR from noisy speech patterns.