1 Introduction

Generally, speech enhancement refers to the processing of noisy speech signals so as to improve their perception, through better decoding by either systems or human listeners [1,2,3]. A number of speech enhancement procedures have been formulated to recover the performance of a system when the input is a noise-corrupted speech signal. Still, it is a difficult task to obtain a denoised signal purely by reducing the noise; hence, the performance is limited by a compromise between noise reduction and speech distortion [4,5,6]. Moreover, distorted speech signals fall into two categories, based on medium-to-high SNR and low SNR. Under the first category, the objective is to reduce the noise level so as to produce a natural-sounding signal. In contrast, in the second category, the objective is to lower the noise level while preserving intelligibility. Generally, the major factor that degrades the intelligibility and quality of speech is the background noise. Further, the noise can be stationary or non-stationary, and it is assumed to be additive and uncorrelated with the speech signal [7]. More commonly, speech enhancement approaches are aimed at suppressing the background noise, and they rely in one way or another on an assessment of the background noise. If the background noise changes at a rate much slower than the speech, that is, if the noise is more stationary than the speech, it is simple to estimate the noise during the pauses in the speech.

More particularly, speech enhancement approaches are broadly categorized into temporal processing methods and spectral processing methods. In the temporal processing methods, the degraded speech is processed in the time domain, whereas in the spectral processing methods the processing is carried out in the frequency domain [8]. Spectral subtraction is one of the oldest procedures proposed for reducing the background noise, and it is popular for its easy implementation and minimal complexity. This technique subtracts the average magnitude of the noise spectrum from the noisy speech spectrum. The average magnitude of the noise spectrum is estimated from frames of speech absence. In the stationary noise condition, the initial frames are typically chosen for the estimation, whereas in the non-stationary noise condition, the noise estimate must be updated whenever the characteristics of the noise change. Therefore, the spectral subtraction algorithm becomes inefficient for speech corrupted with non-stationary noise [8,9,10,11].

Effective reduction of the noise in a noisy speech signal improves the efficiency of speech-related applications [12, 13]. Various algorithms have been introduced recently to enhance the intelligibility and the quality of the speech signal. These compensation methods are broadly classified into two groups: the multi-channel algorithms and the single-channel algorithms [14]. In most applications, the users are bound to the single-channel algorithms, since only one input channel is available. The statistical model-based techniques [15, 16] and spectral subtraction [17,18,19] are some modern single-channel algorithms, which usually use the short-time Fourier transform (STFT) for processing the speech signal. These algorithms perform well when the additive noise is weak, but their performance degrades when the noise is strong. Recently, the wavelet transform (WT) has received wide attention over the STFT, because it uses large windows at low frequencies and small windows at high frequencies, whereas the STFT uses a window of fixed size. The variable-size windows of WT yield high frequency resolution in the low-frequency band and high time resolution in the high-frequency band [20]. Thus, the time–frequency resolution is much improved across all the frequency bands of the speech signal. In real-time scenarios, speech is mostly available in a heavily noise-corrupted form [21,22,23,24]. Hence, sub-band division methods effectively enhance the performance by providing a better estimation of the noise. Furthermore, WT works beneficially to build the approximation-based model from the estimated speech signal, even under adverse conditions [25].

Contribution In [26], a single-channel supervised speech enhancement algorithm based on regularized NMF is implemented, in which the a priori magnitude spectral distributions are modeled by Gaussian mixtures. The present work focuses on speech enhancement in the STFT domain. As the frequency domain is known for its noise removal ability, the adoption of the short-time Fourier domain further enhances the speech. A decomposition model, called the diminished empirical mean curve decomposition (D-EMCD), is introduced here to remove the undesired signal. Further, the Wiener filtering process is adopted to accomplish speech enhancement; this paper is an improved version of [26]. The contributions of this paper to the speech enhancement method are as follows:

  • An adaptive tuning factor is proposed to enhance the operation of Wiener filtering.

  • D-EMCD, which is a variant of EMCD, is proposed to decompose the signal, through which the tuning factor is defined.

  • A sophisticated procedure is proposed to define the enhancement process, so that the speech is enhanced under different noise conditions.

The proposed technique first estimates the noise spectrum and identifies the clean speech spectrum by applying Wiener filtering with a unity tuning factor. The resultant signal is decomposed using D-EMCD. The bark frequency of the decomposed signal is determined and fed to the network, which in turn predicts the tuning ratio. Using this tuning ratio, a second-stage Wiener filtering is carried out on the actual noisy signal. Subsequently, the resultant signal is decomposed to extract the enhanced speech.

The rest of the paper is organized as follows: Sect. 2 reviews the related literature, and Sect. 3 describes the proposed speech enhancement algorithm. Sect. 4 discusses the results, and Sect. 5 concludes the paper.

2 Literature review

In 2017, Pejman et al. [27] have proposed an iterative joint maximum a posteriori (ijMAP) amplitude and phase estimator that assumes a non-uniform phase distribution. The experimental outcomes proved the efficiency of the proposed method in improving both the phase and the amplitude estimates. The results were also justified using instrumental measures like speech intelligibility, perceived quality and phase estimation error. Additionally, the approach enabled joint improvement in the perceived quality and the speech intelligibility, and the phase-blind joint MAP estimator exhibited performance comparable to the complex MMSE estimator.

In 2017, Sonay and Mohammad [28] have presented a novel unsupervised speech enhancement method, modeling both the speech spectrogram and its temporal gradient as sparse. The sparsity assumption holds because of the quasi-harmonic nature of speech signals. In this approach, speech enhancement was performed by minimizing a suitable objective function, composed of a data fidelity term and a sparsity-imposing regularization term. Further, the alternating direction method of multipliers (ADMM) was adapted to solve the proposed formulation, and a well-organized iterative procedure was established for carrying out the speech enhancement. Extensive experiments showed that the proposed method outperformed the other competing schemes with respect to varied performance assessment metrics.

In 2017, Hanwook et al. [26] have introduced a single-channel supervised speech enhancement algorithm, formulated on the basis of regularized nonnegative matrix factorization (RNMF). The regularization in the NMF cost function considered the log-likelihood functions of the spectra of both the clean and the noisy speech signals, on the basis of Gaussian mixture models. With the projected regularization used as a priori information in the enhancement stage, the algebraic properties of both the clean speech and the noise signals were exploited. The masking model of the human auditory system was also incorporated to improve the speech quality. Experimental results in terms of source-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ) and segmental signal-to-noise ratio (SNR) showed that the proposed speech enhancement algorithms offered better performance than the other benchmark algorithms.

In 2016, Ruwei et al. [29] have adopted a new filtering process, called improved least mean square adaptive filtering (ILMSAF). It was a speech enhancement algorithm based on a deep neural network (DNN) and noise classification. An adaptive coefficient of the filter's parameters was introduced into the existing least mean square adaptive filtering (LMSAF) algorithm. Initially, the adaptive coefficient of the filter parameters was assessed using a deep belief network (DBN); the enhanced speech was then obtained by ILMSAF. Additionally, a new DNN-based noise classification method was presented to make the approach suitable for several types of noise environments. In accordance with the result of the noise classification, the corresponding ILMSAF model was selected in the enhancement process. The tests gave efficient results for the proposed model under ITU-T G.160. Their method attained significant improvements in terms of varied subjective and objective speech quality measures.

In 2016, Yanping et al. [30] have proposed a new procedure for the reduction of storage space and running time, utilizing a low-rank approximation in a reproducing kernel Hilbert space with little performance loss in the enhanced speech. They also examined the bound on the root-mean-square error between the enhanced vectors obtained by the approximated kernel matrix and by the full kernel matrix. Further, it was observed that the method improved the computation speed of the algorithm with approximated performance, compared with the full kernel matrix.

In 2016, Yang et al. [31] have developed an extension of the gammatone filter bank for speech enhancement, eliminating the effects of both reverberation and noise by restoring the appropriate amplitude and phase. Objective and subjective trials were carried out under numerous noisy reverberant conditions to assess the performance of the proposed system. The signal-to-error ratio (SER), correlation, PESQ and SNR loss were utilized in the objective assessments, while the normalized mean preference score and the correctness in the modified rhyme test (MRT) were utilized in the subjective evaluations. The results of all the evaluations revealed that the proposed arrangement could effectively recover the quality and the intelligibility of speech signals under noisy reverberant situations.

In 2016, Sun et al. [32] have introduced a deep autoencoder (DAE) to represent the residual part, which was obtained by subtracting the estimated clean speech spectrum from the noisy speech spectrum. The enhanced speech signal was then found by transforming the estimated clean speech spectrum back into the time domain. The proposed method was named the separable deep autoencoder (SDAE). Given the under-determined nature of the above optimization problem, the clean speech reconstruction was confined to the convex hull spanned by a pre-trained speech dictionary. New learning algorithms were investigated to preserve the nonnegativity of the parameters in the SDAE. Experimental results on TIMIT with 20 noise types, at various noise levels, demonstrated the superiority of the proposed technique over the conventional baselines.

In 2016, Chazan et al. [33] have presented a single-microphone speech enhancement algorithm. A hybrid approach was proposed by merging the generative mixture of Gaussians (MoG) model and the discriminative deep neural network (DNN). The proposed algorithm was executed in two phases, the training and the testing phases. First, the noise-free speech log power spectral density (PSD) was modeled as a MoG, representing the phoneme-based diversity in the speech signal. A DNN was then trained on a phoneme-labeled database of clean speech signals for phoneme classification, with mel-frequency cepstral coefficients (MFCC) as the input features. In the test phase, a noisy utterance unseen in training was processed. Lastly, the contribution of each component of the proposed process was analyzed, indicating their combined importance.

In 2016, Wang et al. [34] have introduced a method combining the discrete wavelet packet transform (DWPT) and NMF. Briefly, the DWPT was first applied to split a time-domain speech signal into a series of sub-band signals, without introducing any distortion. Then, NMF was used to highlight the speech component in each sub-band. At last, the improved sub-band signals were combined through the inverse DWPT to reconstruct a noise-reduced signal in the time domain. Further, the proposed DWPT-NMF-based speech enhancement technique was evaluated on the Mandarin hearing in noise test (MHINT) task. Experimental results showed that this new method performed very well in promoting the speech quality and intelligibility and outperformed the conventional STFT-NMF (Table 1).

Table 1 Features and challenges of speech enhancement processes

3 Proposed speech enhancement algorithm

The architecture of the proposed speech enhancement algorithm is demonstrated in Fig. 1.

Fig. 1 Proposed architecture for speech enhancement

Step 1: Let \(S\left( n \right)\) be the clean signal. When the noise \(N\) is added to the clean signal, it becomes the noisy signal \(\bar{S}\left( n \right)\), which is given as the input to the NMF process. This results in two spectra, namely the noise spectrum \(N^{\text{s}}\) and the signal spectrum \(\bar{N}^{\text{s}}\).

Step 2: The resultant spectra, \(N^{\text{s}}\) and \(\bar{N}^{\text{s}}\), are then filtered by the Wiener filtering process, which results in the filtered signal \(\bar{S}_{\text{f}} \left( n \right)\).

Step 3: \(\bar{S}_{\text{f}} \left( n \right)\) is then decomposed by the D-EMCD process, yielding the bark frequency \(b^{\prime}\left( {f^{\prime}} \right)\), which is utilized to train the NN classifier.

Step 4: The resultant tuned \(\eta\) from the NN classifier and the spectra \(N^{\text{s}}\) and \(\bar{N}^{\text{s}}\) are given as the inputs to the adaptive Wiener filtering process to filter the input signal \(\bar{S}\left( n \right)\), resulting in the filtered signal \(\overline{\overline{{S_{\text{f}} \left( n \right)}}}\).

Step 5: Finally, the resultant \(\overline{\overline{{S_{\text{f}} \left( n \right)}}}\) is again decomposed using the D-EMCD process, and the decomposed signal is produced as the denoised signal \(\overline{\overline{{S_{\text{D}} \left( n \right)}}}\).
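For clarity, the five steps are composed in the following sketch. It is only an illustrative skeleton, not the implementation: the helper callables (nmf_spectra, wiener_filter, demcd, bark_feature, nn_tuning_ratio) are hypothetical names for the components detailed in the remainder of this section, passed in as arguments.

```python
def enhance(noisy, nmf_spectra, wiener_filter, demcd, bark_feature,
            nn_tuning_ratio):
    """Skeleton of the five-step enhancement pipeline of Fig. 1.

    All helper callables are hypothetical placeholders for the
    components described in Sects. 3.1-3.4.
    """
    # Step 1: NMF splits the noisy input into noise and signal spectra.
    noise_spec, signal_spec = nmf_spectra(noisy)

    # Step 2: first-stage Wiener filtering with unity tuning factor.
    filtered = wiener_filter(noisy, signal_spec, noise_spec, eta=1.0)

    # Step 3: D-EMCD decomposition and bark-frequency feature extraction.
    feature = bark_feature(demcd(filtered))

    # Step 4: the trained NN maps the feature to the tuning ratio eta,
    # which drives the adaptive (second-stage) Wiener filtering.
    eta = nn_tuning_ratio(feature)
    refiltered = wiener_filter(noisy, signal_spec, noise_spec, eta=eta)

    # Step 5: a final D-EMCD pass yields the denoised speech.
    return demcd(refiltered)
```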

The description of the adopted processes is as follows:

NMF The NMF is a dimensionality-reduction tool that decomposes the input signal into spectra under nonnegativity constraints on the elements. The resultant spectra are the noise spectrum and the signal spectrum.

Wiener filter The Wiener filter provides an estimate of a target random process by linear time-invariant (LTI) filtering of an observation corrupted by additive noise. This filter minimizes the mean square error between the estimated random process and the desired process.

D-EMCD The EMCD (empirical mean curve decomposition) decomposes a signal by smoothing its peaks. First, the maxima and the minima of the signal are extracted. Then, they are interpolated and their average is taken. The average signal is subtracted from the original signal to find the residue. The residue and the average signals are checked for their similarity with the original signal, and the process is repeated until the signal is smoothed. In D-EMCD, the iteration is diminished: the first average signal is used for the further steps, because the speech signal tolerates no loss of its maxima and minima.

The D-EMCD is a signal decomposition process that follows a process similar to that of EMCD. The only difference is that the D-EMCD performs the decomposition without any iteration, whereas the existing EMCD is an iterative process.

NN This is a machine learning approach inspired by the functioning of the brain. The NN structure is associated with a learning algorithm, which is used for training purposes.

Adaptive Wiener filter This filter operates on the concept of the Wiener filter, but additionally incorporates the tuned \(\eta\), which is the output of the NN.

Training library The training library of the NN is constructed by giving the known inputs (bark frequency) and their target \(\eta\). With this knowledge, the unknown values are predicted.

Offline and online process The training process is considered the offline process, and the testing process is termed the online process, in which the testing is carried out on the trained system. The offline process identifies the appropriate tuning factor for different noise variances and trains the neural network [35]. The online process is the actual enhancement process, where the trained network is used for determining the tuning factor.

Learning algorithm In this work, the NN approach is trained using the Levenberg–Marquardt algorithm [36].

3.1 Noise estimation using STFT-minimum statistics

The noise power spectral density estimator based on minimum statistics tracks the minima of the smoothed power spectrum of the noisy signal:

$$P\left( {\alpha ,b} \right) = \lambda \left( {\alpha ,b} \right)P\left( {\alpha - 1,b} \right) + \left( {1 - \lambda \left( {\alpha ,b} \right)} \right)\left| {W\left( {\alpha ,b} \right)} \right|^{2}$$
(1)

where \(W\left( {\alpha ,b} \right)\) denotes the STFT coefficient at frame \(\alpha\) and frequency bin \(b\), \(P\left( {\alpha ,b} \right)\) denotes the smoothed power spectral density and \(\lambda \left( {\alpha ,b} \right)\) denotes the frequency- and time-dependent smoothing parameter. A bias compensation factor \(F_{{\rm min} }\) is applied to recover the mean power; it is a function of the length of the minimum search interval and of \(\text{var} \left\{ {P\left( {\alpha ,b} \right)} \right\}\), the variance of the smoothed power spectral density. The variance of \(P\left( {\alpha ,b} \right)\) is estimated while the search interval length of the algorithm is fixed. The variance estimator for frequency bin \(b\) at frame \(\alpha\) is defined as:

$$\mathop {\text{var} }\limits^{ \wedge } \left\{ {P\left( {\alpha ,b} \right)} \right\} = \overline{{P^{2} }} \left( {\alpha ,b} \right) - \bar{P}^{2} \left( {\alpha ,b} \right)$$
(2)

where \(\bar{P}\left( {\alpha ,b} \right)\) and \(\overline{{P^{2} }} \left( {\alpha ,b} \right)\) denote the first-order recursive averages of the smoothed periodogram and of its square, respectively.
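As an illustration, a minimal Python sketch of this estimator is given below. It assumes a fixed smoothing parameter and a fixed minimum-search window; the time- and frequency-dependent \(\lambda \left( {\alpha ,b} \right)\) and the bias compensation factor \(F_{{\rm min} }\) of the full algorithm are omitted for brevity.

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import minimum_filter1d

def min_stat_noise_psd(x, fs, lam=0.85, win_frames=96):
    """Sketch of minimum-statistics noise PSD estimation (Eq. 1).

    lam: fixed smoothing parameter (a simplification of lambda(alpha, b)).
    win_frames: assumed length of the minimum-search interval, in frames.
    """
    _, _, W = stft(x, fs=fs, nperseg=512)   # STFT coefficients W(alpha, b)
    power = np.abs(W) ** 2                  # periodogram |W(alpha, b)|^2

    # First-order recursive smoothing over frames (Eq. 1).
    P = np.empty_like(power)
    P[:, 0] = power[:, 0]
    for a in range(1, power.shape[1]):
        P[:, a] = lam * P[:, a - 1] + (1 - lam) * power[:, a]

    # Track the minimum of the smoothed PSD within the search interval;
    # the bias compensation F_min (omitted here) would rescale the result.
    return minimum_filter1d(P, size=win_frames, axis=1)
```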

This paper employs short-time Fourier transform (STFT)-based noise estimation. Figure 2 illustrates the noise power spectrum of the actual signal, along with the noise estimates obtained by the FFT and by the STFT. Basically, the STFT is used to determine the phase content and the sinusoidal frequency content of a signal as they change over time. Practically, the longer time signal is divided into shorter segments of equal length, and the Fourier transform is applied separately to each segment. Moreover, the STFT can also be interpreted as a filtering operation. More particularly, two properties support the estimation strategy: the magnitude-based shift invariance property and the properties of a linear time–frequency distribution.

Fig. 2 Noise power spectrum: a estimated by FFT and b estimated by STFT-minimum statistics

In Fig. 2, the power spectrum of the noisy speech, obtained from the FFT and from STFT-minimum statistics, is presented. The significant difference between them is that the STFT-based estimate represents the magnitude of the frequency components more faithfully. Figure 3 shows that the D-EMCD decomposed signal correlates with the actual signal while neglecting the undesired spikes and surges, whereas the EMCD signal loses a large amount of information from the actual signal. For reference, the maximum and the minimum peaks are also presented.

Fig. 3 EMCD versus D-EMCD: information-preserving characteristics of D-EMCD

3.2 Adaptive Wiener tuning ratio

The role of the tuning ratio is substantiated in [26]. This paper proposes a neural network (NN) to estimate the tuning ratio, based on the bark frequency \(b^{\prime}\left( {f^{\prime}} \right)\) of the NMF-based filtered D-EMCD signal \(\bar{S}_{\text{D}} \left( n \right)\). The mapping function from the frequency \(f^{\prime}\) to the bark frequency is given as:

$$b^{\prime}\left( {f^{\prime}} \right) = 13\arctan \left( {0.76f^{\prime}} \right) + 3.5\arctan \left[ {\left( {0.133f^{\prime}} \right)^{2} } \right]$$
(3)

where \(f^{\prime}\) is the frequency (in kHz) of \(\bar{S}_{\text{D}} \left( n \right)\). The basis function \(a^{\prime}_{j}\) is formulated as defined in Eq. (4):

$$a^{\prime}_{j} = \left( {W_{j}^{I} b^{\prime}\left( {f^{\prime}} \right)} \right) + W_{j}^{0} ;\quad j = 1, \ldots N_{{{\text{h}}'}}$$
(4)

where \(W_{j}^{\text{I}}\) represents the weight between the input and the \(j{\text{th}}\) hidden neuron, \(N_{{{\text{h}}'}}\) denotes the number of hidden neurons in the NN and \(W_{j}^{0}\) is the weight of the \(j{\text{th}}\) bias neuron. Consequently, the activation function \(\hat{a}^{\prime}_{j}\) is formulated for limiting the amplitude, as represented in Eq. (5).

$$\hat{a}^{\prime}_{j} = \phi \left( {a^{\prime}_{j} } \right) = \frac{1}{{1 + \exp \left( { - a^{\prime}_{j} } \right)}}$$
(5)

The network output \(\eta\) is defined as given in Eq. (6), where \(W_{j}^{\text{H}}\) denotes the weight between the \(j{\text{th}}\) hidden neuron and the output neuron and \(W^{{{\text{H}}_{0} }}\) represents the weight of the bias.

$$\eta = \sum\limits_{j = 1}^{{N_{\text{h'}} }} {W_{j}^{\text{H}} \hat{a^{\prime}}_{j} } + W^{{{\text{H}}_{0} }}$$
(6)
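The following short sketch illustrates Eqs. (3)–(6) as a single forward pass. It assumes \(f^{\prime}\) in kHz (consistent with the constants of Eq. (3)) and weights already obtained from the offline Levenberg–Marquardt training; the random weights in the usage lines are placeholders only.

```python
import numpy as np

def bark(f_khz):
    """Bark mapping of Eq. (3); f_khz is assumed to be in kHz."""
    return 13.0 * np.arctan(0.76 * f_khz) + 3.5 * np.arctan((0.133 * f_khz) ** 2)

def tuning_ratio(b, W_in, W_in0, W_h, W_h0):
    """Forward pass of Eqs. (4)-(6) for a single bark-frequency input b.

    W_in, W_in0: input and bias weights per hidden neuron (length-N_h arrays);
    W_h, W_h0: hidden-to-output weights and output bias. All weights are
    assumed to come from the offline Levenberg-Marquardt training.
    """
    a = W_in * b + W_in0                       # basis function a'_j (Eq. 4)
    a_hat = 1.0 / (1.0 + np.exp(-a))           # sigmoid activation (Eq. 5)
    return float(np.dot(W_h, a_hat) + W_h0)    # network output eta (Eq. 6)

# Usage with the paper's 10 hidden units and random (untrained) weights:
rng = np.random.default_rng(0)
eta = tuning_ratio(bark(1.0), rng.normal(size=10), rng.normal(size=10),
                   rng.normal(size=10), 0.0)
```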

3.3 Spectrum estimation using NMF

For speech signal enhancement, the noisy signal \(\bar{S}\left( n \right)\) is expressed in the time–frequency \(\left( {\alpha ,b} \right)\) domain through the STFT, as given in Eq. (7).

$$\bar{S}\left( {b,\alpha } \right) = S\left( {b,\alpha } \right) + N\left( {b,\alpha } \right)$$
(7)

where \(S\left( {b,\alpha } \right)\), \(\bar{S}\left( {b,\alpha } \right)\) and \(N\left( {b,\alpha } \right)\) denote the STFT of the clean speech, the noisy speech and the noise, respectively, for the \(b{\text{th}}\) frequency bin of frame \(\alpha\). The magnitude spectrum of the noisy speech is approximated as \(|\bar{S}\left( {b,\alpha } \right)| \approx |S\left( {b,\alpha } \right)| + |N\left( {b,\alpha } \right)|\), which is the widely used assumption in NMF-based speech and audio signal processing.

The magnitude spectrum matrices of varied signals are denoted as:

$$V^{\prime} = \left[ {v^{\prime}_{b\alpha } } \right] \in R_{ + }^{B \times T}$$
(8)

where \(v^{\prime}_{b\alpha }\) represents the magnitude spectral value for the \(b{\text{th}}\) bin of \(\alpha\) frame, whereas \(B\) and \(T\) denote the number of frequency bins and time frames, respectively.

Generally, NMF-based speech enhancement processes comprise two stages, namely the training stage and the enhancement stage. In the training stage, Eq. (9) is applied separately to the training data \(V^{\prime}_{\text{S}} \in R_{ + }^{{B \times T_{\text{S}} }}\) and \(V^{\prime}_{\text{N}} \in R_{ + }^{{B \times T_{\text{N}} }}\), resulting in the basis matrices of the clean speech and of the noise, \(W^{\prime}_{\text{S}} = \left[ {w_{{Bm^{\prime}}}^{{'{\text{S}}}} } \right] \in R_{ + }^{{B \times M^{\prime}_{\text{S}} }}\) and \(W^{\prime}_{\text{N}} = \left[ {w_{{Bm^{\prime}}}^{{'{\text{N}}}} } \right] \in R_{ + }^{{B \times M^{\prime}_{\text{N}} }}\), respectively. Here, \(M^{\prime}\) denotes the number of basis vectors:

$$W^{\prime} \leftarrow W^{\prime} \otimes \frac{{\left( {V^{\prime}/W^{\prime}H^{\prime}} \right)H^{{\prime T^{\prime}}} }}{{\Psi H^{{\prime T^{\prime}}} }},\quad H^{\prime} \leftarrow H^{\prime} \otimes \frac{{W^{{\prime T^{\prime}}} \left( {V^{\prime}/W^{\prime}H^{\prime}} \right)}}{{W^{{\prime T^{\prime}}} \Psi }}$$
(9)

where \(\Psi\) is a \(B \times T\) matrix with all entries equal to one and \(T^{\prime}\) denotes the matrix transpose.

In the enhancement stage, the basis matrix is fixed as \(W^{\prime}_{{{\hat{\text{S}}}}} = \left[ {W^{\prime}_{\text{S}} W^{\prime}_{\text{N}} } \right] \in R_{ + }^{{B \times \left( {M^{\prime}_{\text{S}} + M^{\prime}_{\text{N}} } \right)}}\), and the activation matrix \(H^{\prime}_{{{\bar{\text{S}}}}} = \left[ {H_{\text{S}}^{{'T^{\prime}}} H_{\text{N}}^{{'T^{\prime}}} } \right]^{{T^{\prime}}} \in R_{ + }^{{\left( {M^{\prime}_{\text{S}} + M^{\prime}_{\text{N}} } \right) \times T_{{{\bar{\text{S}}}}} }}\) of the noisy speech is estimated by applying the NMF activation update to \(V^{\prime}_{{\bar{S}}} \in R_{ + }^{{B \times T_{{\bar{S}}} }}\). After obtaining the activation matrix of the speech signal, the clean speech spectrum is estimated with the aid of the Wiener filter (WF), as given in Eq. (10):

$$S^{\prime} = \frac{{P^{\prime}_{\text{S}} }}{{P^{\prime}_{\text{S}} + P^{\prime}_{\text{N}} }} \otimes \bar{S}$$
(10)

where \(P^{\prime}_{\text{S}} = \left[ {P^{\prime}_{\text{S}} \left( {b,\alpha } \right)} \right]\) and \(P^{\prime}_{\text{N}} = \left[ {P^{\prime}_{\text{N}} \left( {b,\alpha } \right)} \right] \in R_{ + }^{{B \times T_{{{\bar{\text{S}}}}} }}\) represent the estimated power spectral density (PSD) matrices of the clean speech and of the noise, respectively. They are obtained through the temporal smoothing of periodograms, as defined in Eqs. (11) and (12), respectively:

$$P^{\prime}_{\text{S}} \left( {b,\alpha } \right) = \tau_{\text{S}} P^{\prime}_{\text{S}} \left( {b,\alpha - 1} \right) + \left( {1 - \tau_{\text{S}} } \right)\left( {\left[ {W^{\prime}_{\text{S}} H^{\prime}_{\text{S}} } \right]_{b\alpha } } \right)^{2}$$
(11)
$$P^{\prime}_{\text{N}} \left( {b,\alpha } \right) = \tau_{\text{N}} P^{\prime}_{\text{N}} \left( {b,\alpha - 1} \right) + \left( {1 - \tau_{\text{N}} } \right)\left( {\left[ {W^{\prime}_{\text{N}} H^{\prime}_{\text{N}} } \right]_{b\alpha } } \right)^{2}$$
(12)

where \(\tau_{\text{S}}\) and \(\tau_{\text{N}}\) denote the temporal smoothing factors for speech and noise, respectively.
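A compact numpy sketch of the training-stage updates of Eq. (9) and the enhancement-stage Wiener estimate of Eqs. (10)–(12) follows. The small eps terms guard against division by zero and are an implementation assumption, not part of the formulation above.

```python
import numpy as np

def nmf_kl(V, M, iters=100, eps=1e-10, seed=0):
    """Training-stage multiplicative updates of Eq. (9) on a magnitude
    spectrogram V (B x T), with M basis vectors."""
    rng = np.random.default_rng(seed)
    B, T = V.shape
    W = rng.random((B, M)) + eps
    H = rng.random((M, T)) + eps
    Psi = np.ones((B, T))                    # all-ones matrix Psi of Eq. (9)
    for _ in range(iters):
        W *= ((V / (W @ H + eps)) @ H.T) / (Psi @ H.T + eps)
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ Psi + eps)
    return W, H

def temporal_smooth(P, tau):
    """First-order recursive smoothing over frames (Eqs. 11-12)."""
    out = P.copy()
    for a in range(1, P.shape[1]):
        out[:, a] = tau * out[:, a - 1] + (1 - tau) * P[:, a]
    return out

def wiener_estimate(Ws, Hs, Wn, Hn, S_noisy, tau_s=0.9, tau_n=0.9, eps=1e-10):
    """Enhancement-stage Wiener estimate of Eq. (10)."""
    Ps = temporal_smooth((Ws @ Hs) ** 2, tau_s)   # speech PSD, Eq. (11)
    Pn = temporal_smooth((Wn @ Hn) ** 2, tau_n)   # noise PSD, Eq. (12)
    return Ps / (Ps + Pn + eps) * S_noisy         # element-wise mask, Eq. (10)
```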

3.4 D-EMCD-based Wiener filtering

The Wiener filtering process is based on the proposed D-EMCD decomposition. The underlying EMCD is an iterative decomposition whose initial step is the extraction of both the minima and the maxima; as discussed above, the D-EMCD performs this decomposition in a single pass. Figure 3 illustrates the information-preserving characteristics of D-EMCD. Let \(\bar{S}^{{\rm max} } \left( n \right):\left\{ {\left( {P_{i} ,S\left( {P_{i} } \right)} \right),i = 1, \ldots N_{{\rm max} } } \right\}\) be the maxima signal of the N-element signal \(\bar{S}\left( n \right)\), where \(P_{i}\) denotes the time index and \(N_{{\rm max} }\) represents the number of maxima. Let the minima signal of the actual signal \(\bar{S}\left( n \right)\) be \(\bar{S}^{{\rm min} } \left( n \right):\left\{ {\left( {Q_{i} ,S\left( {Q_{i} } \right)} \right),i = 1, \ldots N_{{\rm min} } } \right\}\), where \(Q_{i}\) is the time index and \(N_{{\rm min} }\) denotes the number of minima.

Furthermore, B-spline interpolation is used to interpolate both the maxima and the minima signals, as defined below:

$$\bar{S}^{I - {\rm max} } \left( n \right) = B\left\{ {\left( {P_{i} ,y\left( {P_{i} } \right)} \right),\bar{S}^{{\rm max} } \left( n \right)} \right\};\quad n = 1, \ldots N$$
(13)
$$\bar{S}^{I - {\rm min} } \left( n \right) = B\left\{ {\left( {Q_{i} ,y\left( {Q_{i} } \right)} \right),\bar{S}^{{\rm min} } \left( n \right)} \right\};\quad n = 1, \ldots N$$
(14)

The deviation \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\delta }_{k} \left( n \right)\) of the signal from its nearest envelope is defined as follows:

$$\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\delta }_{k} \left( n \right) = {\rm min} \left( {|\bar{S}_{k} \left( n \right) - \bar{S}_{k}^{I - {\rm max} } \left( n \right)|,\;|\bar{S}_{k} \left( n \right) - \bar{S}_{k}^{I - {\rm min} } \left( n \right)|} \right)$$
(15)

Moreover, the deviations from the maxima and the minima envelopes are defined as given in Eqs. (16) and (17). Similarly, \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{S}_{k} \left( n \right)\) and \(\bar{S}_{{_{k} }}^{{\rm max} - {\rm min} } \left( n \right)\) are represented as shown in Eqs. (18) and (19), respectively. The resultant signal from the filtering process is the denoised signal.

$$\bar{\delta }_{k}^{{\rm max} } \left( n \right) = |\bar{S}_{k} \left( n \right) - \bar{S}_{k}^{I - {\rm max} } \left( n \right)|$$
(16)
$$\bar{\delta }_{k}^{{\rm min} } \left( n \right) = |\bar{S}_{k} \left( n \right) - \bar{S}_{k}^{I - {\rm min} } \left( n \right)|$$
(17)
$$\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{S}_{k} \left( n \right) = \left\{ {\begin{array}{*{20}l} {\bar{S}_{k} \left( n \right);} \hfill & {{\text{if}}\,\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\delta }_{k} \left( n \right) > \delta_{T} } \hfill \\ {\bar{S}_{{_{k} }}^{{\rm max} - {\rm min} } \left( n \right);} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right.$$
(18)
$$\bar{S}_{k}^{{\rm max} - {\rm min} } \left( n \right) = \left\{ {\begin{array}{*{20}l} {\bar{S}_{k}^{I - {\rm max} } \left( n \right);} \hfill & {{\text{if}}\,\bar{\delta }_{k}^{{\rm max} } \left( n \right) < \bar{\delta }_{k}^{{\rm min} } \left( n \right)} \hfill \\ {\bar{S}_{k}^{I - {\rm min} } \left( n \right);} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right\}$$
(19)
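A single-pass sketch of the D-EMCD of Eqs. (13)–(19) is given below. It assumes the input contains enough extrema for cubic B-spline envelopes, and the default threshold value mirrors the one fixed in Sect. 4.4; both are illustrative assumptions.

```python
import numpy as np
from scipy.signal import argrelextrema
from scipy.interpolate import make_interp_spline

def demcd(s, delta_T=0.5e-4):
    """Single-pass (non-iterative) D-EMCD following Eqs. (13)-(19)."""
    n = np.arange(len(s))
    P = argrelextrema(s, np.greater)[0]       # maxima locations P_i
    Q = argrelextrema(s, np.less)[0]          # minima locations Q_i

    # Cubic B-spline envelopes of the maxima and minima (Eqs. 13-14).
    env_max = make_interp_spline(P, s[P], k=3)(n)
    env_min = make_interp_spline(Q, s[Q], k=3)(n)

    d_max = np.abs(s - env_max)               # deviation of Eq. (16)
    d_min = np.abs(s - env_min)               # deviation of Eq. (17)
    d_hat = np.minimum(d_max, d_min)          # nearest-envelope deviation, Eq. (15)

    # Eq. (19): pick the nearer envelope; Eq. (18): samples far from both
    # envelopes are kept, the rest are snapped to the nearer envelope.
    env = np.where(d_max < d_min, env_max, env_min)
    return np.where(d_hat > delta_T, s, env)

# Example: smooth a noisy tone.
t = np.linspace(0, 1, 2000)
noisy = np.sin(2 * np.pi * 5 * t) \
        + 0.01 * np.random.default_rng(0).standard_normal(2000)
smoothed = demcd(noisy, delta_T=5e-4)
```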

4 Results and discussion

4.1 Dataset and experiments

The speech signal enhancement experimentation is conducted using MATLAB 2015a. The database of speech signals is downloaded from http://ecs.utdallas.edu/loizou/speech/noizeus/. Among the comparison methods, LRA [30] and ILMSAF [29] have publicly available implementations, whereas Vuvuzela [37], OMLSA [38], TSNR [39], HRNR [40] and RNMF [26] are private implementations. The experimentation is carried out on about 30 speech signals, and the number of hidden units is 10. Six noise types, namely airport noise, exhibition noise, restaurant noise, station noise, street noise and babble noise, are added to the speech signals. In addition, the investigation is carried out at different SNR levels: 0 dB, 5 dB, 10 dB and 15 dB.

The speech data are subjected to NMF decomposition, which estimates the signal spectrum as well as the noise spectrum of equal length. The decomposition is performed at different noise levels, so that diverse decomposition effects can be obtained via NMF. The Wiener filtering is applied on the decomposed signal of dimension 513 × 86, followed by D-EMCD. The resultant signal is subjected to feature extraction using the bark frequency, and hence the training library is constructed with the dimension 1 × 30. The training data are obtained for different speech qualities, and the respective tuning ratio of the Wiener filter is set as the target for the respective noise intensities. The training is performed using the Levenberg–Marquardt algorithm. Given a corrupted test speech, the noise intensity is estimated, followed by the estimation of the tuning ratio. Based on the estimated tuning ratio, the Wiener filtering is applied to enhance the corrupted speech signal. A sketch of this offline training step is given below.
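The following minimal sketch fits the 10-hidden-unit network of Sect. 3.2 with a Levenberg–Marquardt least-squares solver. The features and target tuning ratios here are synthetic placeholders, not the actual training library; more than 30 samples are generated only so that the LM solver has at least as many residuals as parameters.

```python
import numpy as np
from scipy.optimize import least_squares

N_H = 10                                  # hidden units, as in Sect. 4.1
rng = np.random.default_rng(0)

# Placeholder training library: bark features and target tuning ratios.
bark_feats = rng.uniform(0.5, 8.0, 200)
target_eta = 0.5 + 0.05 * np.sin(bark_feats)

def forward(w, b):
    """Network of Eqs. (4)-(6): one input, N_H sigmoid hidden units."""
    W_in, W_in0 = w[:N_H], w[N_H:2 * N_H]
    W_h, W_h0 = w[2 * N_H:3 * N_H], w[3 * N_H]
    a_hat = 1.0 / (1.0 + np.exp(-(W_in * b[:, None] + W_in0)))
    return a_hat @ W_h + W_h0

# Levenberg-Marquardt fit of all 3*N_H + 1 weights.
fit = least_squares(lambda w: forward(w, bark_feats) - target_eta,
                    x0=rng.normal(scale=0.1, size=3 * N_H + 1),
                    method='lm')

# Online use: map a new bark feature to its tuning ratio eta.
predict_eta = lambda b: forward(fit.x, np.atleast_1d(b))[0]
```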

4.2 Qualitative analysis

The quality of the selected speech signals is studied in this section. The temporal analysis, the spectral analysis and the time–frequency analysis of the denoising performance are observed for six noise types, namely airport noise, exhibition noise, restaurant noise, station noise, street noise and babble noise, which are added to the speech signals. Figure 4a–f illustrates the temporal analysis of the denoising performance, showing the noisy and the denoised signals for the various noise types and demonstrating the denoising efficiency of the proposed methodology. Figure 5 illustrates the spectral analysis of the denoising performance for the various noise types. Here, the noisy and the denoised signals are shown, and it is found that the proposed method performs well by substantially reducing the noise in the noisy signal. Further, Fig. 6 illustrates the time–frequency analysis of the denoising performance, in which the noisy and the denoised signals of the various noise types are shown. The superior noise-removing ability of the proposed method is clearly visible in this figure.

Fig. 4 Temporal analysis of denoising performance: noisy and denoised signals of various noise types: a airport noise, b exhibition noise, c restaurant noise, d station noise, e street noise and f babble noise

Fig. 5 Spectral analysis of denoising performance: noisy and denoised signals of various noise types: a airport noise, b exhibition noise, c restaurant noise, d station noise, e street noise and f babble noise

Fig. 6 Time–frequency analysis of denoising performance: noisy and denoised signals of various noise types: a airport noise, b exhibition noise, c restaurant noise, d station noise, e street noise and f babble noise

4.3 Quantitative analysis

The proposed speech enhancement algorithm is compared to state-of-the-art methods like low-rank approximation (LRA) [30], ILMSAF [29], Vuvuzela [37], optimally modified log-spectral amplitude (OMLSA) [38], two-step noise reduction (TSNR) [39], harmonic regeneration noise reduction (HRNR) [40] and regularized nonnegative matrix factorization (RNMF) [26]. The quality of the enhanced speech signals is studied with measures like PESQ, SNR, root-mean-square error (RMSE), correlation, STOI, extended STOI (ESTOI), SDR and cumulative squared Euclidean distance (CSED). Further, the investigation proceeds at different SNR levels: 0 dB, 5 dB, 10 dB and 15 dB. Table 2 shows the performance investigation of the proposed method against the existing methods for the airport noise at various dB levels. Similarly, Tables 3, 4, 5, 6 and 7 show the performance investigation of the proposed method for exhibition noise, restaurant noise, station noise, street noise and babble noise, respectively, at different dB levels. From Table 2, the proposed method leads the conventional methods for airport noise at SNR = 5 dB, with 9.26 SDR, 2.36 PESQ, 8.23 SNR, 0.017 RMSE, 0.92 correlation, 0.64 ESTOI, 0.81 STOI and 1606.895 CSED. Table 3 shows that, for the case of exhibition noise at 0 dB, the proposed method attains high SDR, PESQ, SNR, correlation, ESTOI and STOI of 5.26, 1.88, 5.16, 0.83, 0.557 and 0.73, respectively. Subsequently, the RMSE and the CSED values of the proposed method are as low as 0.024 and 2406.215, respectively.

Table 2 Airport noise at different intensity levels
Table 3 Exhibition noise at different intensity levels
Table 4 Restaurant noise at different intensity levels
Table 5 Station noise at different intensity levels
Table 6 Street noise at different intensity levels
Table 7 Babble noise at different intensity levels

In the same way, for the other noise types, such as restaurant noise, station noise, street noise and babble noise at varied dB levels, the proposed method outperforms the others in terms of the performance rate. Further, it is observed that measures like PESQ, SNR, correlation, STOI, ESTOI and SDR of the proposed method are substantially increased, whereas the existing methods show poorer performance with lower values. Similarly, the RMSE and CSED of the proposed method are decreased, while the existing methods show increased RMSE and CSED values, confirming the performance excellence of the proposed method. Apart from this, Table 8 reports the computational time required for denoising a speech signal by the proposed methodology and by the other existing methods. The proposed method requires 2.7934 s to denoise a speech signal; even though this computational time is higher, the proposed method dominates all the existing methods in terms of speech enhancement.

Table 8 Computational time for denoising a speech signal

4.4 Impact of D-EMCD thresholding

In this paper, the threshold value of the D-EMCD is fixed as 0.5e−4. The analysis is performed by varying the threshold \(\delta_{T}\) over the values 0.5, 0.01, 0.05, 0.005 and 0.0005, for all the noise types at the dB levels 0, 5, 10 and 15. Figure 7 illustrates the performance for the airport noise at different threshold values, in terms of (a) SDR, (b) PESQ, (c) SNR, (d) RMSE, (e) correlation, (f) ESTOI, (g) STOI and (h) CSED. As the threshold decreases, the SDR, PESQ, SNR, correlation, ESTOI and STOI increase, exhibiting a drastic improvement at the threshold value of 0.0005. Similarly, the RMSE and the CSED gradually decrease at the same \(\delta_{T}\) value. The same analysis is observed for all the noise types: exhibition noise, restaurant noise, station noise, street noise and babble noise, as demonstrated in Figs. 8, 9, 10, 11 and 12. Figure 13 shows the power spectrum estimation for the denoised signal of all six noise types, clearly characterizing the frequency content of the denoised signals at the varied threshold values 0.5, 0.01, 0.05, 0.005 and 0.0005. The threshold decides the quality of the D-EMCD, and it can be set by trial and error.

Fig. 7 Performance analysis for mitigating the airport noise (with varying threshold): a SDR, b PESQ, c SNR, d RMSE, e correlation, f ESTOI, g STOI and h CSED

Fig. 8 Performance analysis for mitigating the exhibition noise (with varying threshold): a SDR, b PESQ, c SNR, d RMSE, e correlation, f ESTOI, g STOI and h CSED

Fig. 9 Performance analysis for mitigating the restaurant noise (with varying threshold): a SDR, b PESQ, c SNR, d RMSE, e correlation, f ESTOI, g STOI and h CSED

Fig. 10 Performance analysis for mitigating the station noise (with varying threshold): a SDR, b PESQ, c SNR, d RMSE, e correlation, f ESTOI, g STOI and h CSED

Fig. 11 Performance analysis for mitigating the street noise (with varying threshold): a SDR, b PESQ, c SNR, d RMSE, e correlation, f ESTOI, g STOI and h CSED

Fig. 12 Performance analysis for mitigating the babble noise (with varying threshold): a SDR, b PESQ, c SNR, d RMSE, e correlation, f ESTOI, g STOI and h CSED

Fig. 13 Power spectrum of denoised speech with varying threshold, in case of a airport noise, b exhibition noise, c restaurant noise, d station noise, e street noise and f babble noise

The impact of the threshold on the D-EMCD performance is high, but no explicit relationship between the threshold and the performance is known. Hence, the analysis is performed by varying the threshold. The results reveal that a smaller threshold leads to improved performance.

5 Conclusion

In this paper, a speech enhancement algorithm operating in the short-time Fourier domain has been presented to overcome the common drawbacks of the conventional speech enhancement algorithms. Further, a decomposition model, named the diminished empirical mean curve decomposition (D-EMCD), has been introduced to remove the undesired signals, and the Wiener filtering process has been adopted to accomplish effective speech enhancement. The proposed methodology has been developed in MATLAB, and its performance has been analyzed with various measures. Moreover, the proposed method has been compared with the existing methods, proving its superiority.