1 Introduction

In everyday listening environments, background noise distorts speech signals, degrading both their quality and intelligibility. This poses a challenge for applications such as automatic speech recognition and intelligibility enhancement, and it is particularly difficult for monaural speech enhancement at very low signal-to-noise ratios (SNRs). The speech processing community has studied monaural speech enhancement extensively. Single-channel speech enhancement techniques include statistical approaches [92], non-negative matrix factorization [40, 49, 81, 82, 105], and deep neural networks [27, 36, 38, 39].

The advancement of speech and vision processing systems has enabled extensive research and development in human-computer interaction [70], biometric applications [89, 90], security and surveillance [79], and, more recently, computational behavioral analysis [13, 16, 43, 76, 104], audio-visual speech enhancement [59, 60], speech separation [1, 62], automatic speech recognition [5], human listening and live captioning [15], accent identification [57], and automatic speaker age estimation [2]. Emotions can alter the acoustic properties of speech, such as pitch, intensity, and duration; for example, speech produced in a fearful state tends to be higher in pitch and intensity than speech produced in a neutral state. Emotions also affect how speech is perceived, and listeners can often infer a speaker's emotional state from the acoustic cues in their speech. In [73], a feature vector with a minimum number of elements was proposed for recognizing the emotional states of speech. Low-complexity spectral enhancement methods are well suited to hearing-aid users [52]. The spectral subtraction technique, introduced by Boll [4], assumes that speech and noise are uncorrelated in order to remove noise from speech. This approach was enhanced by Berouti et al. [3] to minimize the artifacts caused by noise reduction. These methods can be generalized to improve quality by appropriately adjusting their parameters [42]. In line with this concept, Sim et al. [77] proposed a method for optimal parameter selection based on the minimum mean squared error. Additionally, Hu and Yu [29] suggested an adaptive noise estimation method to further improve quality.

Deep learning has revolutionized speech processing by automatically extracting meaningful features from raw speech signals, eliminating the need for manual feature engineering. This advancement has led to significant improvements in speech processing performance, particularly in challenging scenarios involving noise, accents, and dialects [58]. It is widely acknowledged that transcribing noisy speech with automatic speech recognition (ASR) systems trained on clean data results in markedly reduced recognition accuracy. The challenge is exacerbated for child speakers, whose speech features, such as pitch and formant frequencies, vary considerably with age and present a major obstacle to accurate recognition. In [75], the authors explored methods to improve the noise robustness of ASR systems for children's speech and proposed incorporating a foreground speech segmentation and enhancement module. A method for enhancing dysarthric speech, designed for individuals with cerebral palsy aged 40-60, was introduced in [50]. This method employed Kepstrum analysis and was assessed on dysarthric speech samples, including monosyllabic and bisyllabic samples with distinct Consonant-Vowel and Consonant-Vowel-Consonant-Vowel patterns. The outcomes indicated notable changes in the formants and energy levels of the processed speech signal.

SEGAN [69] is an end-to-end SE model in which only strided convolutions are used in the generator and discriminator; that is, only ordinary convolution operations are employed. Although its performance is good, it suffers from high computational complexity. Wave-U-Net [56] is a time-domain SE model with the basic U-NET [41] architecture, using 1D ordinary convolution layers in the encoder and decoder and a 1D convolution as the bottleneck. A CNN alone cannot adequately model the long-range dependencies of speech signals: these models use only ordinary convolutional layers, and the local receptive field of the convolution limits the model's ability to capture long-range dependencies across input sequences. In the CRN [86] model, to further enhance the performance of U-NET [41], LSTMs are placed between the encoder and decoder to learn long-term dependencies of speech signals. Although this model performs better, LSTMs are prone to overfitting and take a long time to train; an LSTM requires four linear (MLP) layers per cell at each time step, and linear layers demand large amounts of memory bandwidth. Overall, speech enhancement performance is constrained by the CNN's limited receptive field, which restricts its ability to capture long-range dependencies in speech sequences.

The existing baseline models, such as SEGAN [69], Wave-U-Net [56], U-NET [41], Masking [26], CRN [86], Self-attention [6], Autoencoder [66], and Parallel RNN [51], are built using convolution layers only. It is difficult for a CNN alone to model the long-range dependencies of speech signals correctly, because the local receptive field of the convolution limits the model's ability to capture long-range dependencies across input sequences.

To deal with the long-range dependency of speech, some models [51, 86] incorporate LSTMs in the bottleneck. Although these models [51, 61, 85, 108] perform better, LSTMs are prone to overfitting and take a long time to train; an LSTM requires four linear (MLP) layers per cell at each time step, and linear layers demand large amounts of memory bandwidth. The self-attention model computes attention scores by comparing each element in the input sequence with every other element, resulting in a dense attention matrix; this computation becomes expensive as the sequence length increases.

In recent years, speech enhancement has been formulated as supervised learning [94], inspired by the concept of time-frequency (T-F) Masking in computational auditory scene analysis (CASA). For supervised speech enhancement, it is important to choose the right training target [98]. On the one hand, training with a well-defined target can improve speech quality and intelligibility; on the other hand, the target must be amenable to supervised learning. Many training targets have been proposed in the T-F domain, and most fall into two groups. One group comprises Masking-based targets such as the ideal ratio mask (IRM) [98], which characterize the time-frequency relationships between noisy and clean speech. The other group comprises mapping-based targets, such as the target magnitude spectrum (TMS) [25, 55] and the log-power spectrum (LPS) [103], which represent the spectral characteristics of clean speech.

Most of these training targets are defined on the magnitude spectrum of noisy speech, obtained with the short-time Fourier transform (STFT). Consequently, most speech enhancement algorithms modify only the magnitude spectrogram and reuse the noisy phase spectrogram to resynthesize the enhanced time-domain waveform. Two reasons are usually given for not enhancing the phase spectrogram. First, the phase spectrogram lacks a clear structure, which makes it hard to estimate the phase spectrogram of clean speech [101]. Second, phase enhancement was long considered unnecessary for improving speech [95]. However, later research by Paliwal et al. [67] showed that an accurate phase estimate can substantially improve both subjective and objective speech quality, especially when the analysis window for phase spectrum computation is chosen appropriately. Subsequently, several phase enhancement algorithms were developed for speech separation. Mowlaee et al. [72] used the mean squared error (MSE) to estimate the phase spectra of two sources in a mixture. Krawczyk and Gerkmann [46] performed phase enhancement on voiced frames but not on unvoiced frames. Kulmer et al. [47] estimated the phase of clean speech by decomposing the instantaneous noisy phase spectrum and applying temporal smoothing. T-F Masking can also take phase information into consideration. Wang and Wang [97] trained a deep neural network (DNN) that uses the noisy phase and an inverse Fourier transform layer to directly reconstruct the time-domain enhanced signal. Their results suggest that combining speech resynthesis with mask estimation during training improves perceived quality while maintaining objective intelligibility. The phase-sensitive mask (PSM) is another approach [14].

Their results show that the signal-to-distortion ratio (SDR) is higher when the PSM is estimated rather than only the magnitude spectrum enhanced. Williamson et al. [101] found that although the phase spectrogram lacks spectrotemporal structure, both the real and imaginary parts of the clean speech spectrogram have clear structure and can therefore be learned. They accordingly proposed the complex ideal ratio mask (cIRM), which can reconstruct clean speech from noisy speech. In their experiments, a DNN estimates the real and imaginary spectra simultaneously. cIRM estimation differs from [46, 47, 72] in that it can enhance both the magnitude and the phase spectrum of noisy speech. The results show that complex ratio Masking (cRM) improves perceived quality more than IRM estimation, while improving objective intelligibility only slightly or not at all. Fu et al. [18] then used a convolutional neural network (CNN) to estimate the clean real and imaginary spectra from the noisy ones, from which the time-domain waveform is synthesized. They also trained a DNN to map noisy LPS features to clean ones. Their results show that complex spectral mapping with a DNN outperforms LPS spectral mapping in terms of STOI and PESQ.

Over the last decade, CNNs and recurrent neural networks (RNNs) have greatly advanced supervised speech enhancement. RNNs with long short-term memory (LSTM) are used for speech enhancement in [99, 100]. Chen et al. [7] proposed an RNN with four hidden LSTM layers to address speaker generalization in noise-independent models; they found that the RNN generalizes well to untrained speakers and outperforms a feedforward DNN in terms of STOI. CNNs have also been employed for mask estimation and spectral mapping [17, 23, 71]. In [71], Park et al. performed spectral mapping with a convolutional encoder-decoder network (CED), which removes noise as effectively as a DNN or an RNN with far fewer trainable parameters. Grais et al. [23] proposed a similar encoder-decoder architecture. A gated residual network based on dilated convolutions, which exploits long-term contexts through wide receptive fields, was developed in [88]. Convolutional recurrent networks (CRNs) combine the feature extraction capability of CNNs with the temporal modeling capability of RNNs. Naithani et al. [61] constructed a CRN by stacking convolutional, recurrent, and fully connected layers in that order, and a similar CRN architecture was developed in [108]. In [86], a CED and LSTMs were combined into a CRN that amounts to a causal system. Takahashi et al. [85] built a CRN with multiple low-scale convolutional and recurrent layers.

However, since the spectrogram of speech and the complex-valued targets are inherently complex, using complex-valued networks could lead to richer representations and more efficient modeling [20, 32]. Complex models follow the rules of complex multiplication, allowing them to learn the real and imaginary components jointly with this prior knowledge. In prior research, the authors developed complex models using a convolutional recurrent architecture, with encouraging results [109, 110]. More recently, Transformer models [37, 68], and the Conformer architecture in particular, have significantly enhanced sequence modeling capability [24]. Unlike recurrent learning, the Conformer model uses self-attention [93] to capture global dependencies within a sequence while modeling local dependencies through convolutional layers. It is therefore desirable to extend the convolutional recurrent model into a Conformer-style model that uses fully complex networks for speech enhancement.

In [54, 65, 80, 84, 102], the authors proposed DenseNet-based designs to reduce the number of dilated convolution layers needed to cover a large receptive field. With DenseNet, early- and later-layer features are aggregated directly within a single convolution layer via dense skip connectivity. However, this requires many parameters, which is inefficient for high-resolution data, particularly when transforming local features into global ones. In [21], the authors proposed a network design that combines the advantages of DenseNet with those of dilated convolution; standard dilated convolutions were employed with dilation factors computed from layer depth, which resulted in significant aliasing.

In a preliminary study, we proposed a new CRN for complex spectral mapping in monaural speech enhancement [87], based on the architecture in [86]. In this study, we improve the CRN architecture of [87] and examine how complex spectral mapping can be used to enhance monaural speech. First, each convolutional or deconvolutional layer is replaced with a gated linear unit (GLU) block [10] followed by a dense block and efficient channel attention [12]. Second, we add a linear layer on top of the last deconvolutional layer to estimate the real and imaginary spectra.

The main objective of the proposed work is to improve the quality and intelligibility of degraded speech. The main advantage of the GCRN is that it performs better than ordinary CNN approaches. The output of the GCRN is passed to a dense block, whose main advantage is that it mitigates the vanishing gradient problem because the input of a given layer depends not only on the immediately preceding layer but also on several earlier layers. Moreover, a thinner dense network (with fewer channels) outperforms a wider one, which improves parameter efficiency. The output of the dense block is passed to the ECA module to improve information flow across layers by learning a dynamic representation without reducing the dimensionality of the parametric space. The motivation of the proposed work is to improve network performance and computational cost while keeping the same dimensionality. The Efficient Channel Attention (ECA) module extracts useful channel information through cross-channel interaction without altering the channel dimensions. In module testing, choosing an adaptive kernel size k for the ECA improved network performance significantly. The main contributions of this work are as follows:

1.

A gated convolutional recurrent network with efficient channel attention (GCRN-ECA) for complex spectral mapping is proposed, which amounts to a causal system for monaural speech enhancement. Each layer in the encoder and decoder contains a dense block.

2.

Convolutional neural network (CNN)-based techniques suffer from a limited receptive field. To overcome this limitation, a gated convolutional recurrent neural network is proposed.

3.

The receptive field of dilated convolutions increases with the dilation rate, which is used to capture long-range speech contexts, and the dense connectivity provides feature maps with more precise target information by passing them through multiple layers.

4.

To represent the correlation between neighboring noisy speech frames, a two-layer GRU is added in the bottleneck of Wave-U-NET [56]; its simpler architecture increases training speed, and the GRU captures long-range dependencies across input sequences.

5.

The model incorporates a novel ECA network, which improves information flow across layers by learning a dynamic representation without reducing the dimensionality of the parametric space. The ECA chooses an adaptive kernel size (k), which improves accuracy and efficiency by allowing cross-channel interaction while preserving dimensions; thus, the ECA module implements cross-channel interaction without dimensionality reduction.

The remainder of this paper is organized as follows. Section 2 describes monaural speech enhancement in the STFT domain. Section 3 describes the proposed system. Section 4 presents the experimental setup, Section 5 discusses the experimental results, and Section 6 concludes the paper.

2 Monaural speech enhancement in STFT domain

Monaural speech enhancement separates the speech s[t] from the background noise n[t]. A noisy mixture y can be modeled as

$$\begin{aligned} y[t]=n[t]+s[t] \end{aligned}$$
(1)

where time sample index is t. Applying STFT on both sides will lead us to

$$\begin{aligned} Y_{m, f}=N_{m, f}+S_{m, f} \end{aligned}$$
(2)

where N, Y, and S are the STFTs of n, y, and s, and m and f are the time-frame and frequency-bin indices, respectively. In polar coordinates, Eq. (2) is written as

$$\begin{aligned} \left| Y_{m, f}\right| e^{i \theta _Y(m, f)}=\left| N_{m, f}\right| e^{i \theta _N(m, f)}+\left| S_{m, f}\right| e^{i \theta _S(m, f)} \end{aligned}$$
(3)

where \(\theta \) and \(|\cdot |\) denote the phase response and the magnitude response, respectively, and i denotes the imaginary unit. The target magnitude spectrum (TMS) of clean speech is often used as the training target in most spectral mapping-based approaches [25, 55].

In the reconstruction process, the estimated magnitude \(\left| \hat{S}_{m, f}\right| \) is combined with the noisy phase \(\theta _Y(m, f)\). The STFT of a speech signal may also be represented in Cartesian coordinates, which offers a different perspective, so Eq. (2) can be rewritten as

$$\begin{aligned} Y_{m, f}^{(r)}+i Y_{m, f}^{(i)}=\left( S_{m, f}^{(r)}+N_{m, f}^{(r)}\right) +i\left( S_{m, f}^{(i)}+N_{m, f}^{(i)}\right) \end{aligned}$$
(4)

where superscripts (r) and (i) denote the real and imaginary components, respectively. The cIRM [101] is defined as

$$\begin{aligned} M=\frac{Y^{(r)} S^{(r)}+Y^{(i)} S^{(i)}}{\left( Y^{(r)}\right) ^2+\left( Y^{(i)}\right) ^2}+i \frac{Y^{(r)} S^{(i)}-Y^{(i)} S^{(r)}}{\left( Y^{(r)}\right) ^2+\left( Y^{(i)}\right) ^2} \end{aligned}$$
(5)

The noisy spectrogram is converted into the enhanced spectrogram by applying an estimate \(\hat{M}\) of the cIRM:

$$\begin{aligned} \hat{S}=\hat{M} \times Y \end{aligned}$$
(6)

where \(\times \) denotes complex multiplication. Signal approximation [33] performs the Masking by minimizing the difference between clean speech and estimated speech. The loss for cRM-based signal approximation (cRM-SA) is defined as:

$$\begin{aligned} S A=|c R M \times Y-S|^2 \end{aligned}$$
(7)

where \(|\cdot |\) denotes the complex modulus. Spectral mapping is learned from the real and imaginary spectra of noisy speech (\(Y^{(r)}\) and \(Y^{(i)}\)) to those of clean speech (\(S^{(r)}\) and \(S^{(i)}\)), and the time-domain signal is obtained by combining the estimated real and imaginary spectra. Williamson et al. [101] argued that it is not advisable to use a DNN to predict the real and imaginary components of the STFT directly. We show that complex spectral mapping consistently outperforms magnitude spectral mapping, complex ratio Masking, and complex ratio Masking-based signal approximation in terms of STOI and PESQ.
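For concreteness, the following is a minimal NumPy sketch of Eqs. (5) and (6): it computes the cIRM from the noisy and clean complex spectrograms and applies an estimated mask to the noisy spectrogram. The variable names and the small regularization constant are illustrative and do not come from the paper's implementation.

```python
# Minimal NumPy sketch of Eqs. (5)-(6); eps is an assumed regularizer.
import numpy as np

def complex_ideal_ratio_mask(Y: np.ndarray, S: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Eq. (5): cIRM M such that S ~= M * Y (complex multiplication)."""
    Yr, Yi = Y.real, Y.imag
    Sr, Si = S.real, S.imag
    denom = Yr ** 2 + Yi ** 2 + eps           # |Y|^2, regularized to avoid division by zero
    Mr = (Yr * Sr + Yi * Si) / denom          # real part of the mask
    Mi = (Yr * Si - Yi * Sr) / denom          # imaginary part of the mask
    return Mr + 1j * Mi

def apply_mask(M_hat: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Eq. (6): enhanced spectrogram S_hat = M_hat x Y (element-wise complex product)."""
    return M_hat * Y

# Toy usage with random complex spectrograms (F frequency bins x T frames).
F, T = 161, 100
S = np.random.randn(F, T) + 1j * np.random.randn(F, T)   # "clean" spectrogram
N = np.random.randn(F, T) + 1j * np.random.randn(F, T)   # "noise" spectrogram
Y = S + N                                                 # Eq. (2)
M = complex_ideal_ratio_mask(Y, S)
S_hat = apply_mask(M, Y)                                  # recovers S up to the eps regularization
```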

3 Description of the system

3.1 Convolutional recurrent network

A convolutional recurrent network [86] is essentially an encoder-decoder architecture with LSTMs between the encoder and decoder. The encoder contains five convolutional layers and the decoder five deconvolutional layers, with two LSTM layers modeling temporal dependencies between them. The encoder-decoder structure is constructed symmetrically, with the number of kernels increasing in the encoder and decreasing in the decoder. A stride of 2 along the frequency dimension is used in both the convolutional and deconvolutional layers to aggregate context along the frequency axis; in other words, the frequency dimensionality of the feature maps is halved in the encoder and doubled in the decoder, so that the output has the same shape as the input. Skip connections are also used to connect each encoder layer's output to the input of the corresponding decoder layer.

3.2 Gated linear units

The flow of information through the network is controlled by gating mechanisms, which make it possible to model more complex interactions. They were first developed for RNNs [28]. In a study on convolutional modeling of images, Van den Oord et al. [64] used an LSTM-style gating mechanism with masked convolutions:

$$\begin{aligned} \begin{aligned} \textbf{y}&=\tanh \left( \textbf{x} * \textbf{W}_1+\textbf{b}_1\right) \odot \sigma \left( \textbf{x} * \textbf{W}_2+\textbf{b}_2\right) \\&=\tanh \left( \textbf{v}_1\right) \odot \sigma \left( \textbf{v}_2\right) \end{aligned} \end{aligned}$$
(8)

Let \( v_1 = x *W_1 + b_1 \) and \( v_2 = x *W_2 + b_2 \), where the \( W \)’s and \( b \)’s denote kernels and biases, respectively, and \( \sigma \) denotes the sigmoid function. The symbols \( *\) and \( \odot \) represent the convolution operation and element-wise multiplication, respectively. The gradient of the gating is

$$\begin{aligned} \begin{aligned} \nabla \left[ \tanh \left( \textbf{v}_1\right) \odot \sigma \left( \textbf{v}_2\right) \right] =&\tanh ^{\prime }\left( \textbf{v}_1\right) \nabla \textbf{v}_1 \odot \sigma \left( \textbf{v}_2\right) \\&+\sigma ^{\prime }\left( \textbf{v}_2\right) \nabla \textbf{v}_2 \odot \tanh \left( \textbf{v}_1\right) \end{aligned} \end{aligned}$$
(9)

where \(\tanh ^{\prime }\left( \textbf{v}_1\right) \) and \(\sigma ^{\prime }\left( \textbf{v}_2\right) \) lie in the interval (0, 1), and the prime symbol denotes differentiation. As the network depth increases, the gradient gradually vanishes because of the downscaling factors \(\tanh ^{\prime }\left( \textbf{v}_1\right) \) and \(\sigma ^{\prime }\left( \textbf{v}_2\right) \). To address this issue, Dauphin et al. [10] introduced gated linear units (GLUs):

$$\begin{aligned} \begin{aligned} \textbf{y}&=\left( \textbf{x} * \textbf{W}_1+\textbf{b}_1\right) \odot \sigma \left( \textbf{x} * \textbf{W}_2+\textbf{b}_2\right) \\&=\textbf{v}_1 \odot \sigma \left( \textbf{v}_2\right) . \end{aligned} \end{aligned}$$
(10)

The gradient of the GLUs

$$\begin{aligned} \begin{aligned} \nabla \left[ \textbf{v}_1 \odot \sigma \left( \textbf{v}_2\right) \right] =\nabla \textbf{v}_1 \odot \sigma \left( \textbf{v}_2\right) +\sigma ^{\prime }\left( \textbf{v}_2\right) \nabla \textbf{v}_2 \odot \textbf{v}_1 \end{aligned} \end{aligned}$$
(11)

includes a path \(\nabla \textbf{v}_1 \odot \sigma \left( \textbf{v}_2\right) \) without downscaling, which can be treated as a multiplicative skip connection that helps gradients flow through layers. Figure 1(a) and (b) show that a deconvolutional GLU block, called “DeconvGLU,” is analogous to a convolutional GLU block, except that deconvolutional layers replace the convolutional layers.

Fig. 1

Diagram of convolutional GLU and deconvolutional GLU
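As an illustration of Eq. (10), a minimal PyTorch sketch of the convolutional and deconvolutional GLU blocks is given below; the kernel size, stride, and padding are illustrative placeholders rather than the exact settings of the proposed network.

```python
# Minimal sketch of ConvGLU / DeconvGLU: a linear convolution path gated by a
# sigmoid convolution path, as in Eq. (10). Hyperparameters are assumptions.
import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size=(1, 3), stride=(1, 2)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=(0, 1))
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=(0, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = (x * W1 + b1) ⊙ sigmoid(x * W2 + b2)
        return self.conv(x) * torch.sigmoid(self.gate(x))

class DeconvGLU(nn.Module):
    """Same gating, with transposed convolutions for the decoder path."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size=(1, 3), stride=(1, 2)):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size, stride, padding=(0, 1))
        self.gate = nn.ConvTranspose2d(in_ch, out_ch, kernel_size, stride, padding=(0, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deconv(x) * torch.sigmoid(self.gate(x))

# Example: a (batch, channels, time, frequency) feature map is halved in frequency.
x = torch.randn(4, 16, 50, 161)
y = ConvGLU(16, 32)(x)   # -> torch.Size([4, 32, 50, 81])
```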

3.3 Dense block

The idea behind a densely connected network is feature reuse: the output of a given layer is reused multiple times in subsequent layers, i.e., the input to a given layer consists not only of the output of the previous layer but also of the outputs of several earlier layers. This type of network has two advantages. First, it mitigates the vanishing gradient problem, because the input of a given layer does not depend solely on the previous layer but also on several earlier layers. Second, a thinner dense network (with fewer channels) outperforms a wider one, which improves parameter efficiency. The dense connection can be defined as

$$\begin{aligned} \textrm{y}^l=\textrm{g}\left( \textrm{y}^{l-1}, \textrm{y}^{l-2}, \ldots , \textrm{y}^{l-D}\right) \end{aligned}$$
(12)

where \(\textrm{y}^l\) denotes the output of layer l, g is the function computed by a single layer, and D is the depth of the dense connections. The proposed network uses a dense block after each layer in the encoder and decoder. The dense block is shown in Fig. 2.

Fig. 2

Dense block

The dense block consists of five convolution layers, each followed by layer normalization and ReLU. The input of a given layer is formed by concatenating the output of the previous layer with the outputs of several earlier layers, so the number of input channels increases linearly across the successive layers as C, 2C, 3C, 4C, 5C, while the output of each convolution has C channels.
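The following is a minimal PyTorch sketch of such a dense block, assuming a (batch, channels, time, frequency) feature layout; the kernel shape and normalization placement are illustrative assumptions rather than the exact configuration used here.

```python
# Minimal sketch of the dense block: five conv layers with layer norm + ReLU,
# where each layer consumes the concatenation of all previous outputs (C, 2C, ..., 5C).
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, channels: int, freq_bins: int, depth: int = 5):
        super().__init__()
        self.layers = nn.ModuleList()
        for d in range(depth):
            self.layers.append(nn.Sequential(
                nn.Conv2d((d + 1) * channels, channels, kernel_size=(1, 3), padding=(0, 1)),
                nn.LayerNorm(freq_bins),   # normalize over the frequency dimension
                nn.ReLU(inplace=True),
            ))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skip = x
        for layer in self.layers:
            out = layer(skip)                        # each layer outputs C channels (Eq. (12))
            skip = torch.cat([skip, out], dim=1)     # dense connectivity: reuse earlier features
        return out

# Example: C = 16 channels, 161 frequency bins, arbitrary time length.
block = DenseBlock(channels=16, freq_bins=161)
y = block(torch.randn(2, 16, 50, 161))    # -> torch.Size([2, 16, 50, 161])
```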

3.4 Efficient channel attention module

In convolutional neural networks, speech signal features are usually obtained by integrating spatial and channel information over a local receptive field. In the present work, the aim is to strengthen the more significant channel features and suppress the less useful ones according to channel importance. The Squeeze-and-Excitation (SE) module [30], reviewed in [96], first applies global average pooling (GAP) to each channel of the input features and then uses two fully connected non-linear layers to capture cross-channel interactions. This reduces the channel dimensionality, and the reduction has a negative impact on network prediction. Hence, a one-dimensional convolution replaces the fully connected layers of the SE module to make channel attention more efficient. In this paper, the efficient channel attention (ECA) module is adopted as a cross-channel interaction network that avoids dimensionality reduction; compared with the SE module, it improves both the overall computation speed and the prediction results. The module is structured as shown in Fig. 3. In the ECA module, the input is first passed through a global average pooling (GAP) layer, and a 1D convolution is then applied for local channel interaction. The kernel size of the 1D convolution determines the coverage of the cross-channel interaction, and its padding is set by default to half the kernel size (taking the integer part). After the local cross-channel interaction, the result is passed through a sigmoid function, whose output is multiplied element-wise with the input channels to produce the ECA output. To determine the mapping relationship \(\psi \) between the number of channels and the kernel size k, the number of channels is assumed to be a power of two. Equation (13) gives the mapping relationship \(\psi \), where b and \(\gamma \) are set to 2 and 1, respectively:

$$\begin{aligned} k=\psi (C)=\left| \frac{\log _2 C}{\gamma }+\frac{b}{\gamma }\right| _{\text{ odd } } \end{aligned}$$
(13)

where \(|\cdot |_{\text{ odd } }\) denotes the nearest odd number. The extent of cross-channel interaction can thus be controlled by selecting k. ECA improves accuracy and efficiency by allowing cross-channel interaction while preserving the channel dimensions, which is why ECA modules are added in this paper.

Fig. 3

Efficient channel attention mechanism
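For illustration, a minimal PyTorch sketch of the ECA module is given below, using Eq. (13) with b = 2 and γ = 1 to pick the kernel size; the rounding to an odd value and the feature layout are assumptions of this sketch rather than the exact implementation used here.

```python
# Minimal sketch of ECA: global average pooling, a 1D convolution across channels
# with the adaptive kernel size of Eq. (13), and a sigmoid gate on the input channels.
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, channels: int, gamma: int = 1, b: int = 2):
        super().__init__()
        # Eq. (13): adaptive kernel size; rounded to an odd integer (rounding up when even).
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 == 1 else k + 1
        self.pool = nn.AdaptiveAvgPool2d(1)                     # GAP over (time, frequency)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b_, c, _, _ = x.shape
        w = self.pool(x).view(b_, 1, c)          # (B, 1, C): channels treated as a 1D sequence
        w = self.conv(w)                         # local cross-channel interaction, no dim. reduction
        w = torch.sigmoid(w).view(b_, c, 1, 1)   # channel attention weights
        return x * w                             # re-weight the input channels

# Example: 64 channels gives k = 9 here (8 rounded up to the next odd value).
y = ECA(64)(torch.randn(2, 64, 50, 161))         # output has the same shape as the input
```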

3.5 Gated Recurrent Unit (GRU)

By combining the forget and input gates of the LSTM into a single gate, the GRU is defined with two gates, a reset gate rt and an update gate zt. As a variant of the LSTM, the GRU is faster and computationally more efficient, and in some cases it performs even better with less training data. Figure 4 shows the GRU unrolled over multiple hidden time steps. The module structure of the GRU is repetitive and simpler than that of long short-term memory, because every recurrent unit has the same form; it has only two gates, the update gate zt and the reset gate rt in Fig. 4. The update gate controls the extent to which knowledge of the previous hidden state is carried into the current state: the larger the value of the update gate, the more information from the previous state is introduced. The reset gate adjusts the degree to which knowledge of the past state is transferred: the smaller its value, the less information from the previous state is written. Short-term dependencies are therefore mostly captured through the activation of the reset gate, while long-term dependencies are captured through the activation of the update gate. The gate zt acts as a controller of both the input and forget behavior: when zt = 1, the forget gate is closed and the input gate is open, and when zt = 0, the forget gate is open and the input gate is closed, so at each step the previous (t-1) memory is kept and the input of the current time step is discarded. Because the GRU uses two gates instead of the three gates of the LSTM, it reduces network complexity while improving performance, and each GRU layer models the temporal dynamics of speech.

Fig. 4

Block diagram of GRU
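As an illustration of how a two-layer GRU can serve as the bottleneck, the following PyTorch sketch flattens the encoder feature map per time frame, models the frame sequence with the GRU, and restores the original layout; all sizes, including the hidden dimension, are illustrative assumptions.

```python
# Minimal sketch of a two-layer GRU bottleneck on (batch, channels, time, frequency) features.
import torch
import torch.nn as nn

class GRUBottleneck(nn.Module):
    def __init__(self, channels: int, freq_bins: int, num_layers: int = 2):
        super().__init__()
        feat = channels * freq_bins
        self.gru = nn.GRU(input_size=feat, hidden_size=feat,
                          num_layers=num_layers, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, frequency) from the encoder
        b, c, t, f = x.shape
        seq = x.permute(0, 2, 1, 3).reshape(b, t, c * f)    # one feature vector per time frame
        out, _ = self.gru(seq)                              # temporal modeling across frames
        return out.reshape(b, t, c, f).permute(0, 2, 1, 3)  # back to (batch, channels, time, freq)

# Example: bottleneck features with 64 channels and 5 frequency bins after downsampling.
y = GRUBottleneck(channels=64, freq_bins=5)(torch.randn(2, 64, 50, 5))
```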

3.6 Network architecture

This research extends the CRN architecture outlined in [87] to perform complex spectral mapping. The resulting gated convolutional recurrent network (GCRN) incorporates GLUs, dense blocks, and ECA.

Fig. 5

Proposed GCRN-ECA structure

Figure 5 shows the proposed GCRN-ECA structure. As in [18], the real and imaginary spectrograms of noisy speech are treated as two input channels. As shown in Fig. 5, a shared encoder and GRU module are used for both components, while the real and imaginary spectrograms are estimated by two separate decoders. This design is based on multi-task learning [48, 107], in which related prediction tasks are learned simultaneously by sharing information across tasks; estimating the real and imaginary parts of the spectrum constitutes two such related tasks [101]. All signals are analyzed at 16 kHz. Using a 20-ms Hamming window with 50% overlap between adjacent frames, each frame yields a 161-dimensional spectrum, corresponding to a 320-point STFT (16 kHz × 20 ms). Note that the number of feature maps in each decoder layer doubles because of the skip connections. We use a kernel size of 1×3, which does not affect performance. Each convolutional or deconvolutional GLU block is followed by batch normalization [34] and an exponential linear unit (ELU) [8] activation function.
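For clarity, the front end described above can be sketched as follows: an illustrative PyTorch version of the 320-point STFT with a 20-ms Hamming window and 50% overlap, with the real and imaginary spectrograms stacked as two input channels. This is a sketch under those assumptions, not the exact code used in our experiments.

```python
# Minimal sketch of the input feature extraction: 16 kHz speech -> two-channel real/imag STFT.
import torch

def stft_features(wave: torch.Tensor, n_fft: int = 320, hop: int = 160) -> torch.Tensor:
    """wave: (batch, samples) at 16 kHz -> (batch, 2, frames, 161) real/imag channels."""
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hamming_window(n_fft),
                      return_complex=True)                    # (batch, 161, frames)
    feats = torch.stack([spec.real, spec.imag], dim=1)        # two input channels
    return feats.permute(0, 1, 3, 2)                          # (batch, 2, frames, freq=161)

x = stft_features(torch.randn(4, 16000))   # 1 s of audio -> torch.Size([4, 2, 101, 161])
```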

4 Experimental setup

The Voice Bank + DEMAND [91] dataset serves as the basis for both training and testing. It consists of 11,572 pairs of clean and noisy utterances for training, along with 824 noisy clips designated for testing. We also use the Common Voice corpus [9], a publicly available voice dataset powered by volunteer contributors around the world, which can be used to train machine learning models for voice applications. The dataset contains 1,653,880 (about 1.6 million) utterances from 84,659 speakers. From Common Voice, we select the English corpus and randomly choose 2,000 utterances for the training set and 400 utterances for the validation set; the test set, also drawn from Common Voice, consists of 400 utterances. Training and validation sets are constructed using different noise types from Noizeus [63], namely white, pink, restaurant, and babble noise. The noisy mixtures are tested at five SNR levels: -6 dB, -3 dB, 0 dB, 3 dB, and 6 dB. Cross-validation is used on the validation set.

Hyperparameters: There are two convolutional layers with 256 filters each, with strides of 1 and 16 and filter sizes of 11 and 32, respectively. The LSTM has a hidden size of 1024 and two layers. The batch size is 32, and the initial standard deviation is 0.02. Pretraining is performed for 10 epochs. The learning rate starts at 1e-3 and decays by a factor of 0.5 until it reaches a minimum of 1e-6, using the Adam optimizer. Early stopping is applied; otherwise, training runs for a maximum of 200 epochs.
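The optimization schedule above can be sketched as follows; the plateau-based decay criterion and the patience values are assumptions of this sketch, since the text does not specify when the halving of the learning rate is triggered, and the placeholder loops stand in for the actual training and validation code.

```python
# Hedged sketch of the training schedule: Adam at 1e-3, halved down to 1e-6, early stopping,
# at most 200 epochs. The plateau trigger and patience values are assumptions.
import torch

model = torch.nn.Linear(161, 161)          # placeholder for the GCRN-ECA model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3, min_lr=1e-6)

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):                   # maximum of 200 epochs
    train_loss = 0.0                       # ... run one training epoch here ...
    val_loss = 0.0                         # ... run validation here ...
    scheduler.step(val_loss)               # halve the LR when validation stops improving
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:         # early stopping
            break
```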

4.1 Experimental results and analysis

The short-time objective intelligibility (STOI) [83], the perceptual evaluation of speech quality (PESQ) [31], and the signal-to-noise ratio (SNR) are used as objective metrics. The experimental results are compared with those of the existing techniques Wiener [74], SEGAN [69], Wave-U-Net [56], U-NET [41], Masking [26], CRN [86], Self-attention [6], Autoencoder [66], and Parallel RNN [51].
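For reference, these objective metrics can be computed with the commonly used pesq and pystoi packages, as sketched below; these packages are assumed to be installed and are not part of the proposed system, and the SNR expression here is the standard energy-ratio definition.

```python
# Sketch of objective evaluation on time-domain signals at 16 kHz.
import numpy as np
from pesq import pesq      # ITU-T P.862 PESQ (PyPI package "pesq", assumed installed)
from pystoi import stoi    # short-time objective intelligibility (PyPI package "pystoi")

def evaluate(clean: np.ndarray, enhanced: np.ndarray, fs: int = 16000) -> dict:
    return {
        "pesq": pesq(fs, clean, enhanced, "wb"),             # wide-band PESQ at 16 kHz
        "stoi": stoi(clean, enhanced, fs, extended=False),   # STOI in [0, 1]
        "snr_db": 10 * np.log10(np.sum(clean ** 2) /
                                (np.sum((clean - enhanced) ** 2) + 1e-12)),
    }
```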

Table 1 Comparative performance in terms of PESQ in the case of babble noise and street noise

Table 1 shows PESQ values for the existing techniques such as Wiener [74], SEGAN [69], U-NET, CRN, Self-attention, Autoencoder, and Parallel RNN. In the case of babble noise, the PESQ values at -3 dB and 0 dB are 1.83 and 2.08, respectively, for SEGAN [69], while U-NET [41] yields 1.85 and 2.41 at -6 dB and 3 dB. For the CRN method, the PESQ values at -3 dB and 6 dB are 2.24 and 3.02, and for the Autoencoder [66], the values at 0 dB and 3 dB are 2.90 and 3.17. The proposed method yields PESQ values of 2.42, 2.94, and 3.47 at input test SNRs of -6 dB, 0 dB, and 6 dB, respectively, outperforming the other techniques. In the case of street noise, Wave-U-NET [56] gives PESQ values of 1.70 and 2.11 at input SNRs of -6 dB and 0 dB, and the Self-attention [6] method yields 2.46 and 2.86 at -3 dB and 3 dB, whereas the proposed method achieves PESQ values of 3.11 and 3.27 at input test SNRs of 3 dB and 6 dB.

Table 2 Comparative performance in terms of STOI in the case of babble noise and street noise

Comparative STOI performance is shown in Table 2. In the case of street noise, SEGAN [69] gives STOI values of 64.4 and 75.6 at input SNRs of -6 dB and 0 dB, and the Self-attention [6] method yields 78.2 and 89.1 at -3 dB and 3 dB, whereas the proposed method achieves 91.9 and 95.4 at input test SNRs of 3 dB and 6 dB. In the case of babble noise, the STOI values at -3 dB and 0 dB are 69.4 and 75.8 for the Wiener filter [74], while U-NET [41] yields 67.2 and 72.2 at -6 dB and 3 dB. For the CRN [86] method, the STOI values at -3 dB and 6 dB are 77.9 and 92.1, and for the Autoencoder, the values at 0 dB and 3 dB are 88.3 and 92.7. The proposed method yields STOI values of 78.5, 90.3, and 96.6 at input test SNRs of -6 dB, 0 dB, and 6 dB, respectively, outperforming the other techniques.

Table 3 Comparative performance in terms of SNR in the case of babble noise and street noise

Table 3 shows SNR values for the existing techniques such as Wiener [74], SEGAN [69], U-NET, CRN, Self-attention, Autoencoder, and Parallel RNN. In the case of babble noise, the SNR values at -3 dB and 0 dB are -2.18 and 1.19 for the Wiener filter [74], while U-NET [41] yields -2.71 and 6.19 at -6 dB and 3 dB. For the CRN [86] method, the SNR values at -3 dB and 6 dB are 4.36 and 12.3, and for the Autoencoder, the values at 0 dB and 3 dB are 10.2 and 13.3. The proposed method yields SNR values of 6.07, 11.98, and 17.26 at input test SNRs of -6 dB, 0 dB, and 6 dB, respectively, outperforming the other techniques. In the case of street noise, SEGAN [69] gives SNR values of -4.12 and 1.17 at input SNRs of -6 dB and 0 dB, and the Self-attention method yields 7.19 and 12.31 at -3 dB and 3 dB, whereas the proposed method achieves 13.54 and 16.56 at input test SNRs of 3 dB and 6 dB.

5 Discussion on results

SEGAN [69] is an end-to-end SE model in which only strided convolutions are used in the generator and discriminator; that is, only ordinary convolution operations are employed. Although its performance is good, it suffers from high computational complexity. SEGAN [69] achieves average PESQ values of 2.06 and 1.99, average STOI values of 76.82 and 75.84, and average SNR values of 1.72 and 0.926 in the babble and street noise environments, respectively, all of which are better than the Wiener [74] model. The improvement arises because deep learning models can automatically learn relevant features directly from the input data, whereas the Wiener [74] model is sensitive to the noise type. The limitation of SEGAN is its computational complexity. Wave-U-NET [56] is a time-domain SE model with the basic U-NET [41] architecture, using 1D ordinary convolution layers in the encoder and decoder and a 1D convolution as the bottleneck. Wave-U-NET [56] achieves average PESQ values of 2.15 and 2.07 and average STOI values of 77.92 and 76.58 in the babble and street noise environments, both better than the Wiener [74] model, and average SNR values of 2.95 and 1.88, which are better than both the Wiener [74] model and SEGAN [69]. Its performance, however, is poor at low SNRs.

U-NET [41] is a basic encoder-decoder model. Although its encoder extracts good features from noisy speech, the long-term dependency of the speech signal must also be modeled. U-NET [41] achieves average PESQ values of 2.23 and 2.24 and average STOI values of 78.94 and 78.02 in the babble and street noise environments, both better than the Wiener [74] model, with average SNR values of 3.40 and 2.91. CNN models such as SEGAN [69], Wave-U-NET [56], and U-NET [41] alone cannot adequately model the long-range dependencies of speech signals: they use only ordinary convolutional layers, and the local receptive field of the convolution limits their ability to capture long-range dependencies across input sequences. In the CRN [86] model, to further enhance the performance of U-NET [41], LSTMs are placed between the encoder and decoder to learn long-term dependencies of speech signals. The CRN [86] achieves average PESQ values of 2.56 and 2.45 and average STOI values of 82.66 and 81.72 in the babble and street noise environments, both better than the Wiener [74] model, and average SNR values of 7.01 and 6.01, which are better than SEGAN [69], Wave-U-NET [56], and U-NET [41]. Although this model performs better, LSTMs are prone to overfitting and take a long time to train; an LSTM requires four linear (MLP) layers per cell at each time step, and linear layers demand large amounts of memory bandwidth. Speech enhancement performance is also constrained by the CNN's limited receptive field, which restricts its ability to capture long-range dependencies in speech sequences. The Autoencoder [66] model does not use any attention mechanism for feature extraction; attention mechanisms were later added to focus selectively on the features of the speech signal that matter for enhancement. The Self-attention [6] model achieves average PESQ values of 2.82 and 2.66 and average STOI values of 86.26 and 83.74 in the babble and street noise environments, both better than the Wiener [74] model, and average SNR values of 9.91 and 7.81, which are better than the U-NET [41] and CRN models. The limitation of self-attention is that it produces a dense attention matrix, which becomes computationally expensive as the sequence length increases.

Table 4 Subjective Assessment using VCTK dataset

The existing baseline models, such as SEGAN [69], Wave-U-NET [56], U-NET [41], Masking [26], CRN [86], Self-attention [6], Autoencoder [66], and Parallel RNN [51], are built using convolution layers only. It is difficult for a CNN alone to model the long-range dependencies of speech signals correctly, because the local receptive field of the convolution limits the model's ability to capture long-range dependencies across input sequences. To deal with the long-range dependency of speech, some models [51, 86] incorporate LSTMs in the bottleneck. Although these models [51, 61, 85, 108] perform better, LSTMs are prone to overfitting and take a long time to train; an LSTM requires four linear (MLP) layers per cell at each time step, and linear layers demand large amounts of memory bandwidth. The self-attention model computes attention scores by comparing each element in the input sequence with every other element, resulting in a dense attention matrix; this computation becomes expensive as the sequence length increases.

To overcome the above drawbacks, the proposed model introduces dilated dense blocks and GRUs. First, the receptive field of dilated convolutions increases with the dilation rate, which is used to capture long-range speech contexts, and the dense connectivity provides feature maps with more precise target information by passing them through multiple layers. Second, to represent the correlation between neighboring noisy speech frames, a two-layer GRU is added in the bottleneck of U-NET [41]; its simpler architecture increases training speed, and the GRU captures long-range dependencies across input sequences. The vanishing gradient problem is addressed by the GRU's update and reset gates, which control the flow of information into and out of memory, respectively. The advantage of the GRU is that it is easier to modify and does not require separate memory cells, so it trains faster than the LSTM while delivering comparable performance. Moreover, the ECA module implements cross-channel interaction without dimensionality reduction, and an appropriate amount of cross-channel interaction can preserve performance while significantly decreasing model complexity. Hence, the performance of the proposed model is enhanced compared with the existing models, such as SEGAN [69], Wave-U-NET [56], U-NET [41], Masking [26], CRN [86], Self-attention [6], Autoencoder [66], and Parallel RNN [51].

6 Subjective assessment using VCTK dataset

The results of the subjective assessment using the VCTK dataset are shown in Table 4. The subjective listening test methodology is defined by the ITU in recommendation ITU-T P.835 [35]. It evaluates speech quality along three dimensions: signal distortion (CSIG), background distortion (CBAK), and overall quality (COVRL). The methodology reduces listener uncertainty in listening tests by having the enhanced speech rated on a clearly defined five-point scale. The mean opinion score (MOS) for the CSIG, CBAK, and COVRL scales is described in [31].

7 Conclusion

In this work, we proposed a gated convolutional recurrent network with efficient channel attention (GCRN-ECA) for complex spectral mapping, which constitutes a causal system for monaural speech enhancement. Each layer in the encoder and decoder contains a dense block. The receptive field of the dilated convolutions in the dense block increases with the dilation rate, which is used to capture long-range speech contexts, and the dense connectivity provides feature maps with more precise target information by passing them through multiple layers. The GRU captures long-range dependencies across input sequences, and the ECA module implements cross-channel interaction without dimensionality reduction; an appropriate amount of cross-channel interaction can preserve performance while significantly decreasing model complexity. Our results reveal that the proposed GCRN-ECA outperforms existing convolutional neural networks (CNNs) and CRNs in terms of quality and intelligibility, yielding higher objective and subjective scores than existing techniques. The findings show that the proposed model outperforms other competitive baseline methods in both PESQ and STOI across the VCTK and Common Voice datasets.