1 Introduction

In everyday circumstances, recorded speech waveforms are inevitably distorted by noise, which significantly affects real-world applications such as telecommunication and hearing aids. Deep learning-based speech enhancement (SE) approaches aim to recover clean waveforms from degraded ones using neural networks, thereby improving perceived speech quality and mitigating the impact of noise.

Current SE approaches can be broadly divided into two categories: time-domain methods and time-frequency (TF) domain methods. Time-domain SE approaches [1,2,3,4,5] use neural networks to learn a mapping from noisy waveforms to cleaner ones and train directly on the audio signals [6]. Unfortunately, because high-resolution waveforms are generated directly, this class of approach still suffers from inefficiencies and quality constraints. TF-domain SE approaches achieve superior performance: a deep neural network predicts clean frame-level TF-domain representations, from which the enhanced waveforms are reconstructed [7,8,9,10,11]. In [12], a TF-domain model called TFADCSU-Net was presented; it improves information flow inside the model and prevents a notable increase in computational complexity as the number of network layers grows.

Phase is generally not included in commonly used representations, since enhancing phase spectra directly is extremely challenging given their wrapped and unstructured nature. However, recent research has shown that phase information is crucial to the perceptual quality of SE approaches, particularly when the signal-to-noise ratio (SNR) is low [13]. In earlier studies, researchers enhanced only the magnitude spectra and reconstructed the waveforms from the enhanced magnitude and the noisy phase using the inverse short-time Fourier transform (ISTFT) [7,8,9,10]. In [14], the TFA-S-TCN model was proposed, which primarily concentrates on improving the magnitude spectrum and reuses the noisy mixture's phase for signal reconstruction. The absence of phase spectrum enhancement inevitably degrades the quality of the enhanced speech. To address this, several approaches concentrated on enhancing the short-time complex spectra, implicitly restoring the clean magnitude and phase spectra jointly [15, 16]. Recent studies also suggested refining the complex spectrum after enhancing its magnitude [17, 18], which helps mitigate the unbounded estimation issue [19, 20] present in methods that enhance only the complex spectra.

Still, phase estimation remained imperfect due to the compensation effect [21] between the phase and magnitude. To address this, a large number of DNN-based phase-aware algorithms have been proposed, which can be broadly classified into two groups: complex-domain-based [21,22,23] and time-domain-based [24, 25]. In the complex domain, phase information is contained in the implicit relationship between the real and imaginary (RI) components. For instance, in [26], the authors stacked fully-connected (FC) layers to estimate a complex ratio mask (CRM), which is then applied separately to the RI parts of the spectrum to recover phase and magnitude concurrently. Nevertheless, the target's dynamic range is typically compressed by a nonlinear function, which somewhat impairs network training.

Deep neural networks (DNNs) have shown remarkable efficacy in supervised speech enhancement [27], and the use of DNNs for SE has brought tremendous improvement over classical methods [28]. Although effective in noise-independent SE, DNN methods do not generalize well across speakers [29]. Even though the vanilla DNN is a strong model, its efficacy in mismatched scenarios, such as speaker-independent or noise-independent conditions, may be limited because the interdependence between neighboring temporal frames is not explicitly taken into account [29]. In the more recent complex spectral mapping approach of Tan et al. [15], a convolutional recurrent network (CRN) is used to directly map the RI components, producing empirically better results than the CRM. Fu et al. and others have recently employed convolutional neural networks (CNNs) for speech enhancement. Motivated by CNN-based image processing techniques, the T-F representation of the noisy speech mixture is used as the CNN input to estimate the target speech [30]. The CNN used by the authors in [31] to estimate clean complex spectrograms directly from noisy spectrograms outperformed the DNN-based magnitude processing technique.

The convolutional encoder-decoder (CED) is a principle adopted from computer vision research and forms the foundation of several successful CNN designs [32, 33]; the CED mechanism has recently been used in speech enhancement to enrich feature information [34, 35]. In the DCTDCCGRU-based deep learning model of [36], an encoder and decoder were employed to preserve the original information of the audio signals. However, this model concentrates on frequency information and might not adequately capture time-domain properties. The temporal information in speech signals is essential for preserving the speech's naturalness and comprehensibility [37].

Compared with state-of-the-art deep learning techniques for complex spectral mapping, the fully convolutional neural network (CNN) presented by the authors in [38] to process complex spectrograms for noise reduction has shown significant improvement. Zhao et al. also employed a CED network in a post-processing step to improve encoded and subsequently decoded speech, demonstrating impressive generalization even to unknown codecs [39]. To estimate the target speech, an auto-encoder convolutional neural network (AECNN) was presented in [40], with the mean absolute error (MAE) serving as the cost function. The PHASEN network employs two streams and uses phase information to improve the performance of amplitude-based SE [41]. In contrast to traditional CNN architectures that simply use pooling layers to compress the feature dimension, the authors of [42] proposed a CFN-based encoder-decoder with numerous skip connections for monaural speech enhancement. Using strided deconvolutions or upsampling layers, the feature dimension in such models is compressed in the encoder and decompressed in the decoder [43]. High-resolution structural information is preserved by integrating skip connections between encoder and decoder layers of equal size. This is particularly crucial for regression tasks like noise reduction, where a mapping must be learned from a noisy speech spectrum to a target clean speech spectrum of identical size. Dilated CNNs have been employed to enlarge the receptive fields and capture the interdependence between frames. When combined with dilated CNNs for speech enhancement, gated residual networks (GRN) [44] have demonstrated performance superior to RNN-based techniques. However, there are still drawbacks to the strategies listed above. For example, in a standard convolutional encoder-decoder network, a large kernel size can enlarge the model's receptive fields, but at the expense of increased computational cost [40]. There was still room for improvement in speech quality because these techniques could not explicitly and precisely predict the clean phase spectra. Hence, for TF-domain SE techniques, it is imperative to perform explicit prediction and optimization of the phase spectra.

InceptionNet and MHCED use multiple kernels of varying sizes to increase model capacity; however, large kernel sizes [45] are likely to reduce parameter efficiency and restrict the model's applicability in resource-constrained applications. In AlexNet, two group convolutions are applied in parallel at each layer, with each group taking half of the input sequence [46]. Because each convolution group sees only part of the input sequence, each kernel can extract only part of the information across the whole input, which may degrade the model's effectiveness. In ShuffleNet, group convolution channels are rearranged using channel shuffle [47] so that the channels relate to one another. Furthermore, ShuffleNet generates a single feature by sequentially applying ordinary convolution and depth-wise convolution; keeping separate feature maps from the conventional and depth-wise convolutions could improve this feature further. The AECNN architecture uses skip connections only between the encoder and the decoder, feeding data stored in encoder layers to the corresponding decoder layers [40]. Although this may improve enhancement performance, the reuse of information flowing inside the encoder and decoder has not been investigated.

1.1 Contribution

We propose a novel complex spectral mapping framework, the Deep Complex Convolutional Neural Network (DCCNN), based on an encoder and decoder with parallel denoising of the magnitude-phase or real-imaginary spectra, to overcome the performance limitations of previous SE techniques in difficult conditions. In the proposed deep learning model, the encoder maps the input noisy magnitude and phase spectra to compressed TF-domain features for the subsequent decoding stage, while the corresponding decoders mask the magnitude and decode the phase, producing the enhanced magnitude-phase spectrum; finally, the iSTFT is applied to the enhanced spectrum to reconstruct clean waveforms. The phase decoder incorporates a parallel phase estimation architecture to predict the clean phase spectra directly. According to our experimental findings, the proposed DCCNN achieves explicit prediction and optimization of the phase and magnitude spectra, which reduces the compensation effect between them and surpasses state-of-the-art SE methods. A distinctive feature of the proposed deep learning model is the direct enhancement of the phase spectra.

1.2 Problem description

In real-world environments, the time-domain noisy audio signal \(x(t)\) is a combination of additive noise \(n(t)\) and the clean speech signal \(s(t)\), where \(t\) denotes the discrete-time index. The noisy signal is expressed mathematically in Eq. (1).

$$ x(t) = s(t) + n(t) $$
(1)

The signal is transformed into the frequency domain using the STFT, which is applied over consecutive frames. The STFT of the noisy mixture is computed as:

$$ X(k, l) = \sum_{n=-\infty}^{\infty} x(n)\, w(n - lH)\, e^{-j 2\pi k n / N} $$
(2)

In Eq. (2), \(w(n)\) is the window function, \(H\) is the hop size, \(N\) is the FFT size, and \(k\) and \(l\) denote the frequency bin and frame index, respectively. This yields a complex spectrogram:

$$ X(k, l) = X_r(k, l) + j\, X_i(k, l) $$
(3)

In Eq. (3), \(X_r(k, l)\) and \(X_i(k, l)\) are the real and imaginary components of the complex spectrogram, respectively.

1.3 Deep learning-based denoising process

The proposed neural network model is designed to estimate a clean complex spectrogram, \(\hat{S}(k, l)\), from this noisy spectrogram by learning to reconstruct both the magnitude and phase components. The model is trained with a mean squared error (MSE) loss function formulated for the complex-valued spectrogram. The MSE loss is computed as:

$$ \text{MSE} = \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \left[ \left( S_r(k, l) - \hat{S}_r(k, l) \right)^2 + \left( S_i(k, l) - \hat{S}_i(k, l) \right)^2 \right] $$
(4)

where \(L\) denotes the total number of frames and \(K\) the total number of frequency bins. This loss function ensures accurate predictions for both the real \(S_r\) and imaginary \(S_i\) parts, enabling a more accurate reconstruction of the clean speech signal. By addressing both components, the proposed technique preserves the natural dynamics and timbre of the speech, yielding an output that is less noisy while retaining much of the original speech characteristics. After training, the estimated clean spectrogram \(\hat{S}(k, l)\) is processed by the inverse STFT (iSTFT) to obtain a time-domain audio signal; the iSTFT is given by:

$$ \hat{s}(t) = \sum_{l=-\infty}^{\infty} \sum_{k=0}^{N-1} \hat{S}(k, l)\, w(t - lH)\, e^{j 2\pi k t / N} $$
(5)

producing the denoised speech signal \(\hat{s}(t)\). This comprehensive method not only enhances the quality and intelligibility of the denoised speech but also yields significant improvements in signal-to-noise ratio (SNR), as confirmed empirically. Integrating simultaneous magnitude and phase reconstruction in complex spectrogram processing exemplifies a robust approach to handling real-world noisy speech signals.
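As a concrete illustration, the following minimal numpy sketch computes the complex-spectrogram MSE of Eq. (4), assuming the clean and estimated spectra are stored as complex arrays of shape (K, L); the function name and shapes are illustrative, not part of the original implementation.

```python
import numpy as np

def complex_mse(S, S_hat):
    """Mean squared error over real and imaginary parts, as in Eq. (4)."""
    K, L = S.shape
    err_real = (S.real - S_hat.real) ** 2
    err_imag = (S.imag - S_hat.imag) ** 2
    return float((err_real + err_imag).sum() / (K * L))
```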

2 Proposed methodology

The proposed model, DCCNN, is trained by supervised learning using features from the Fourier spectrum; its purpose is to estimate clean audio signals from noisy ones. The model takes 13-frame sequences as input, each consisting of real-imaginary spectrograms derived from the audio signals. This dual-component approach weights the phase information, improving the quality of the reconstructed audio signal. The proposed model learns a mapping from the noisy real-imaginary feature \(X(k, l)\) to an estimated clean real-imaginary feature \(\hat{S}(k, l)\), as shown in Fig. 1.

Fig. 1 DCCNN taking noisy (mixed) spectrograms as input for supervised training and directly mapping them to clean spectrograms during speech enhancement

In the proposed methodology, data processing starts with the clean audio signal (target) and the noisy audio signal (source) \(x(t)\), as shown in Fig. 2, and the following steps are followed in the deep learning speech enhancement methodology:

  1. Given N raw waveform signals for clean and noisy speech, overlapped framing is applied.

  2. Apply the analysis window \(w(n)\) to the framed signals.

  3. Convert the framed and windowed signals into the required representation using the short-time Fourier transform (STFT), obtaining the real-imaginary spectrogram \(X(k, l)\).

  4. Create an annotated dataset of noisy and clean speech feature pairs (noisy_speech_real-img_spectrograms_i, clean_speech_real-img_spectrograms_i) with \(1 \le i \le N\); a preprocessing sketch follows this list.
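A hedged sketch of steps 1-4 using scipy's STFT is given below; the 8 kHz sampling rate, 512-sample window, and 256-sample shift follow Sect. 4.2, while the function names and the stacking of real and imaginary parts into a two-channel feature are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def to_real_imag_spectrogram(wave, fs=8000, win_len=512, hop=256):
    """Frame, window, and STFT a waveform into a real-imaginary spectrogram."""
    _, _, X = stft(wave, fs=fs, window="hann",
                   nperseg=win_len, noverlap=win_len - hop)
    return np.stack([X.real, X.imag], axis=-1)   # shape: (freq_bins, frames, 2)

def build_dataset(noisy_waves, clean_waves):
    """Annotated pairs (noisy_features_i, clean_features_i), 1 <= i <= N."""
    return [(to_real_imag_spectrogram(x), to_real_imag_spectrogram(s))
            for x, s in zip(noisy_waves, clean_waves)]
```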

Fig. 2 Demonstrating the procedure for training the DCCNN SE model to fit a noisy-to-clean prediction function

The training process for DCCNN follows the steps below:

  5. Train the DCCNN model to minimize an objective function while estimating clean features \(\hat{S}(k, l)\) from noisy features \(X(k, l)\).

For denoising:

  6. As shown in the figure, a new noisy feature \(X(k, l)\) is fed to the trained DCCNN, yielding an estimated clean feature \(\hat{S}(k, l)\).

  7. The iSTFT is applied to the estimated clean feature \(\hat{S}\), and the time-domain frames of the denoised signal are obtained.

  8. Synthesis windowing \(w(n)\) is then applied to the time-domain denoised frames of \(\hat{s}(t)\) to reduce spectral leakage.

  9. Finally, the overlap-add method is used to obtain the final time-domain denoised signal \(\hat{s}(t)\) from the estimated clean frames, ensuring a smooth reconstructed audio signal.

The steps of the suggested methodology are summarized in Fig. 2, where the input to the proposed DCCNN model is the time-frequency representation (spectrogram) of a noisy speech signal.
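The denoising steps 6-9 can be sketched as follows, assuming a trained Keras-style `model` that maps a batch of real-imaginary features to estimated clean features; scipy's `istft` performs the synthesis windowing and overlap-add internally, and the function names and shapes here are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(model, noisy_wave, fs=8000, win_len=512, hop=256):
    """Apply the trained model to a noisy waveform and reconstruct the signal."""
    _, _, X = stft(noisy_wave, fs=fs, window="hann",
                   nperseg=win_len, noverlap=win_len - hop)
    features = np.stack([X.real, X.imag], axis=-1)[np.newaxis]   # add batch axis
    S_hat_ri = model.predict(features)[0]                        # estimated clean RI feature
    S_hat = S_hat_ri[..., 0] + 1j * S_hat_ri[..., 1]
    _, s_hat = istft(S_hat, fs=fs, window="hann",
                     nperseg=win_len, noverlap=win_len - hop)    # windowing + overlap-add
    return s_hat
```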

3 System overview

3.1 Network architecture

The proposed Deep Complex Convolutional Neural Network (DCCNN) is developed as a convolutional encoder-decoder architecture for processing the real-imaginary as well as magnitude-phase spectrograms of audio signals in SE. The components and layers of the DCCNN deep learning model are illustrated in Fig. 3. The complex structure provides a full representation of the sound, making it better at differentiating between its speech and noise components. The proposed DCCNN takes the real-imaginary or magnitude-phase spectrum of a noisy mixture, consisting of 13 frames, as input and produces an estimate of the target speech's magnitude-phase or real-imaginary spectrum. The target speech is then reconstructed from this estimated spectrum. At the heart of the encoder, convolutional layers are stacked, each followed by a LeakyReLU activation and batch normalization, starting from 16 filters and rising to 128. The encoder extracts the major characteristics of the input data; the spatial dimensions of the input sequence are handled by strided convolutions, while the depth expands as the network goes deeper.

Fig. 3 Illustration of the proposed DCCNN architecture including all layers with dimensions. The component labels are given at the bottom of the figure

In the proposed DCCNN, the decoder layers mirror the encoder: transposed convolution layers reconstruct the audio signal, forming an enhanced output from the encoded features. Skip connections are applied across the network to ensure that important features flow through and are preserved for high-quality reconstructed speech. More precisely, a skip connection links each CCU's output to the matching symmetric CDU. These skip connections establish a correspondence between the convolutional units in the encoder and the deconvolutional units in the decoder, which is important for maintaining the integrity of spatial and feature information.

3.2 Cluster convolutional units

The proposed DCCNN employs cluster convolutional units (CCUs) on the encoder side of the proposed monaural speech enhancement model. Designed as deep complex convolutional units, they encode the input spectrograms efficiently for processing. The 2D convolution applied in every convolutional layer of the encoder is given as follows:

$$ C(k, l, n) = \sum_{i=1}^{K} \sum_{j=1}^{L} \sum_{m=1}^{M} F(i, j, m, n) \cdot X(k + i - 1,\; l + j - 1,\; m) $$
(6)

where \(C(k, l, n)\) is the output feature map at position \((k, l)\) for the \(n\)-th output channel. This equation summarizes convolution as transforming the input feature matrix \(X\) across its spatial dimensions while changing the channel depth from \(M\) to \(N\), thereby encoding rich and complex patterns from the noisy speech inputs. Following the convolution, the output feature map is passed through a LeakyReLU activation function [48] to introduce non-linearity. Each unit integrates a LeakyReLU activation given by:

$$ y = \max(0.01x,\; x) $$
(7)

This ensures that a small gradient still flows even when the neuron is inactive, improving the network's ability to learn nuanced characteristics during training. Batch normalization follows the activation and stabilizes and accelerates learning by normalizing the output of each convolutional layer. For the encoder of the proposed DCCNN, this forms a robust skeleton with which audio signals can be encoded in a compact and efficient manner, which is fundamental both for the subsequent decoding and for an improved speech output in the presence of background noise. Dropout layers are applied after certain convolutional layers in the encoder to randomly deactivate a fraction of units during training, reducing overfitting and encouraging robust feature learning. Unlike standard CNNs, the proposed DCCNN's encoder processes either the real and imaginary spectra or the magnitude and phase spectra to estimate clean data.
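A minimal Keras sketch of one such cluster convolutional unit is shown below: Conv2D followed by LeakyReLU (Eq. (7)), batch normalization, and dropout. The filter count, 2 × 2 kernel, stride along frequency, and dropout rate are illustrative assumptions rather than the exact published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def ccu(x, filters, dropout_rate=0.2):
    """Encoder unit applied to a (freq, time, channels) feature map."""
    x = layers.Conv2D(filters, kernel_size=2, strides=(2, 1), padding="same")(x)
    x = layers.LeakyReLU(0.01)(x)         # Eq. (7): y = max(0.01x, x)
    x = layers.BatchNormalization()(x)    # stabilizes and speeds up learning
    x = layers.Dropout(dropout_rate)(x)   # regularization described above
    return x
```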

3.3 Cluster deconvolutional units

The decoder of the suggested DCCNN for single-channel speech enhancement is an assembly of cluster deconvolutional units carefully structured to reconstruct denoised audio signals from the encoded features. The operation is performed using transposed convolutional layers (Conv2DTranspose), mathematically defined as:

$$ C'(k, l, n) = \sum_{i=1}^{K} \sum_{j=1}^{L} \sum_{m=1}^{M} F'(i, j, m, n)\, X'(k - i + 1,\; l - j + 1,\; m) $$
(8)

where \(X'\) is the input to the layer, \(F'\) denotes the filter matrix used for deconvolution, and \(C'\) is the output feature map. This process spatially undoes the down-sampling performed during encoding, upscaling the feature maps back to their original spatial dimensions. Each transposed convolution is followed by a LeakyReLU activation [48]:

$$ y' = \max(0.01x',\; x') $$
(9)

This function reintroduces non-linearity into the up-sampled outputs and ensures the maintenance of small gradients when units are inactive, aiding in the deep network’s learning process. Batch normalization is then applied to normalize the outputs of each deconvolutional unit:

$$ \hat{x}' = \frac{x' - \mu_{B'}}{\sqrt{\sigma_{B'}^{2} + \epsilon}} $$
(10)

Here, \(\mu_{B'}\) and \(\sigma_{B'}^{2}\) represent the batch-wise mean and variance, respectively, keeping the learning process stable by maintaining consistent activation distributions. These cluster deconvolutional units (CDUs) play a crucial role in restoring the detailed and nuanced audio features, ensuring the output audio signals are clear and closely resemble the original pre-noise conditions. Using batch normalization and LeakyReLU in the decoder stabilizes training and allows the network to learn more advanced features when reconstructing the enhanced audio signal. Unlike standard CNNs, the decoder (CDU) in the proposed DCCNN processes either the real-imaginary or magnitude-phase spectra of the noisy mixture to learn speech enhancement.
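A matching Keras sketch of a single cluster deconvolutional unit is given below, mirroring the CCU with a transposed convolution (Eq. (8)), LeakyReLU (Eq. (9)), and batch normalization (Eq. (10)); the kernel size and stride are again illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cdu(x, filters):
    """Decoder unit that upsamples a (freq, time, channels) feature map."""
    x = layers.Conv2DTranspose(filters, kernel_size=2, strides=(2, 1),
                               padding="same")(x)   # transposed convolution, Eq. (8)
    x = layers.LeakyReLU(0.01)(x)                   # Eq. (9)
    x = layers.BatchNormalization()(x)              # Eq. (10)
    return x
```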

3.4 Skip connections within encoder and decoder

A convolutional encoder-decoder processes the input sequence through a number of layers, and certain details can be lost because of variation in the dimensions of the feature representations [40, 49]. To improve feature reuse, skip connections between the encoder and decoder are implemented to overcome this problem; in the proposed deep complex CNN, these skip connections play a vital role for complex spectrograms, acting as a conduit for denoising the features of both the real and imaginary spectrograms. Unlike conventional designs, this model does not apply dense layers; rather, it uses convolutional layers that are densely and explicitly connected through skip connections. The connections provide direct paths from the encoder to the decoder, passing along information that is important for speech enhancement. In spectrogram denoising, the decoder thus has access to spectral information, from fine to high level, extracted by the encoder. This holistic perspective helps the decoder build denoised complex spectrograms of higher quality by exploiting the richness of both the real and imaginary parts more effectively. Such strategically placed skip connections optimize the flow of information and improve the model's ability to disentangle the complex relationships present in the spectrogram data for optimum denoising.
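The sketch below assembles encoder and decoder blocks of the kind described above into a U-Net-style model with symmetric skip connections, assuming a two-channel real-imaginary input of 256 frequency bins by 13 frames; the exact bin count, stride placement, and the points at which encoder maps are concatenated into the decoder are illustrative assumptions, not the published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dccnn(freq_bins=256, frames=13, channels=2, base_filters=16, depth=4):
    """U-Net-style assembly of CCU/CDU-like blocks with symmetric skip connections."""
    inp = layers.Input(shape=(freq_bins, frames, channels))
    x, skips = inp, []
    # Encoder: 4 blocks, filters 16 -> 32 -> 64 -> 128, downsampling along frequency
    for i in range(depth):
        x = layers.Conv2D(base_filters * 2 ** i, kernel_size=2,
                          strides=(2, 1), padding="same")(x)
        x = layers.LeakyReLU(0.01)(x)
        x = layers.BatchNormalization()(x)
        skips.append(x)
    # Decoder: 4 mirrored transposed-convolution blocks; each skip connection
    # concatenates an encoder feature map with the decoder map of equal size
    for i in reversed(range(depth)):
        x = layers.Conv2DTranspose(base_filters * 2 ** i, kernel_size=2,
                                   strides=(2, 1), padding="same")(x)
        x = layers.LeakyReLU(0.01)(x)
        x = layers.BatchNormalization()(x)
        if i > 0:
            x = layers.Concatenate()([x, skips[i - 1]])
    out = layers.Conv2D(channels, kernel_size=1, padding="same")(x)  # enhanced RI spectrum
    return Model(inp, out, name="dccnn_sketch")

model = build_dccnn()
model.summary()
```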

4 Selection and processing of data

4.1 Databases and preprocessing

For training and assessment of DCCNN, clean speech data are collected from the TIMIT [50] and CSTR VCTK Corpus [51] databases and noise data from the DEMAND [52] database; all audio is downsampled to 8 kHz. The CSTR VCTK Corpus consists of 110 English speakers with 400 sentences each. To set up the desired noise reduction framework against the baseline methods, the cafeteria, traffic, metro, and bus noise files from the DEMAND database, the café noise type from the QUT [53] noise database, and the babble and restaurant noises from the AURORA-2 [54] database are mixed with the training, development, and testing utterances. The clean speech from the two databases is combined into a single, sizable set with a total of 45,150 audio files and an equal number of male and female speakers. Unique training, development, and test sets are created from 70%, 15%, and 15% of the entire data, respectively. A total of 4 × 7 = 28 training conditions are produced for each batch of data by combining all files with a selected portion of each of the 7 noise samples at SNRs of 0, 5, 10, and 15 dB. In total, 45,150 utterances per clean and noisy set, corresponding to roughly 50.17 h of noisy mixtures (30,102 × 6 s ÷ 3600 = 50.17), are used for training the proposed DCCNN framework, and approximately 7,524 utterances (12.54 h) of noisy speech mixtures are used for each of the development and test sets to evaluate the model. For the development and test data, unseen noise is mixed with clean data at SNRs of −5 and 15 dB to analyze the model's speech enhancement performance.
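As an illustration of how noisy mixtures can be generated at the stated SNRs, the following hedged sketch scales a noise clip so that the clean-to-noise power ratio matches a target value; file handling, speaker balancing, and the 70/15/15 split are omitted, and the function name is illustrative.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then mix."""
    noise = np.resize(noise, clean.shape)            # repeat/trim noise to utterance length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Looping over the 7 noise types and the SNRs of 0, 5, 10, and 15 dB yields the 4 × 7 = 28 training conditions described above.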

4.2 Training and network parameters

The CNN-based deep complex network, which is an encoder-decoder model, is trained using standard backpropagation [55] with the mean squared error (MSE) loss function defined in Sect. 1.3. During training, the Adam optimizer [56] is used with an initial learning rate of 0.001 and a batch size of 1; the other parameters are a window length of 512 and a window shift of 256 for the STFT (providing frequency content and temporal smoothness, respectively), a DFT size of 512 (controlling the frequency resolution), and a context window width of 13 frames to provide temporal context. The model's performance was tested with different values of the optimizer, batch size, window length, window shift, DFT size, and context window width. It was noted that a high learning rate can cause divergence, while the smaller batch size reduces randomness and is memory-efficient for the proposed model. The width of the context window for the complex spectrograms significantly affects the network's ability to capture the temporal dependencies of both clean speech and noise; a wider context window allows the model to better differentiate and separate noise from the clean signal, improving the denoising process. During training of the proposed DCCNN, the learning rate is adjusted using a scheduler that decreases it by a factor of 0.9 after the 5th epoch to avoid under- and over-fitting. If the loss does not decrease after two epochs, training resumes from the epoch with the most favorable development-set loss. If the learning rate falls below 0.00001, training is terminated. To obtain favorable speech enhancement performance, the proposed DCCNN model was tested with different numbers of layers and kernel sizes, which have a large impact on the model's noise reduction capacity: a large number of layers requires high computational resources, while too few layers lead to underfitting during training. The network is therefore designed with 4 encoder blocks and 4 decoder blocks (L = 4), each using a kernel size of 2 × 2. The number of convolutional filters starts at 16 in the first layer and doubles with each subsequent layer, up to a maximum of 128 filters. This layer structure, together with the other parameters mentioned above, balances computational cost against model complexity, and these settings give favorable speech enhancement results in terms of SNRI, PESQ, and STOI.
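A hedged Keras sketch of these training settings is shown below: Adam at an initial learning rate of 0.001, batch size 1, MSE loss, a scheduler that multiplies the rate by 0.9 when the development loss stops improving for two epochs, restoration of the best development-set weights, and a floor of 1e-5 on the learning rate. The specific callbacks and the epoch cap are assumptions; the arrays are noisy/clean spectrogram features as described in Sect. 2.

```python
import tensorflow as tf

def train_dccnn(model, x_train, y_train, x_dev, y_dev, max_epochs=100):
    """Train the DCCNN sketch with the settings described in Sect. 4.2."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mse")
    callbacks = [
        tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.9,
                                             patience=2, min_lr=1e-5),
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                         restore_best_weights=True),
    ]
    return model.fit(x_train, y_train, validation_data=(x_dev, y_dev),
                     batch_size=1, epochs=max_epochs, callbacks=callbacks)
```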

4.3 Instrumental evaluation metrics

We rely solely on instrumental measurements computed on the noisy speech \(x(t)\), the clean speech reference \(s(t)\), and the enhanced speech \(\hat{s}(t)\). As a measure of the system's ability to suppress noise, the signal-to-noise ratio improvement (SNRI) offered during network testing is assessed in accordance with ITU-T G.160 [57]. SNRI is chosen as a performance indicator because it reflects how much noise has been reduced by the proposed Deep Complex Convolutional Neural Network (DCCNN). In addition, we apply the perceptual evaluation of speech quality (PESQ) [58] to obtain the mean opinion score for listening quality objective (MOS-LQO); PESQ approximates the human listening MOS and thus indicates the improvement in speech quality. The short-time objective intelligibility (STOI) metric [59] is used to evaluate the intelligibility of the enhanced speech. STOI takes values in the range [0, 1] and is specifically intended for assessing noise suppression techniques, with high values closely correlated with high intelligibility. Evaluating the proposed DCCNN with STOI indicates whether the enhanced speech remains intelligible to human listeners after noise reduction, which is critical for any communication system.
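The evaluation can be sketched with the open-source `pesq` and `pystoi` packages (narrow-band PESQ for 8 kHz audio) together with a simplified SNR-improvement estimate; note that the full ITU-T G.160 SNRI procedure is more elaborate than the global SNR difference used here.

```python
import numpy as np
from pesq import pesq
from pystoi import stoi

def evaluate(clean, noisy, enhanced, fs=8000):
    """Compute PESQ (MOS-LQO), STOI, and a simplified SNRI for one utterance."""
    def snr(ref, sig):
        n = min(len(ref), len(sig))
        noise = sig[:n] - ref[:n]
        return 10 * np.log10(np.mean(ref[:n] ** 2) / (np.mean(noise ** 2) + 1e-12))
    return {
        "PESQ": pesq(fs, clean, enhanced, "nb"),            # narrow-band mode at 8 kHz
        "STOI": stoi(clean, enhanced, fs, extended=False),  # intelligibility in [0, 1]
        "SNRI": snr(clean, enhanced) - snr(clean, noisy),   # simplified improvement
    }
```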

5 Results and analysis

Several representative SE methods, namely MMSE-LSA and SG-jMAP [60], LSTM-IRM [29], LSTM-MSA, LSTM-cMSA, CED-cSA-du, LSTM-cMSA + DNN-cSA, LSTM-cMSA + CED-cSA-du and LSTM-cMSA + CED-cSA-tr [61], and LSTM-cMSA + CLED-cSA-du [62], were selected for comparison with the proposed DCCNN. They are discussed below for seen and unseen noise types.

5.1 Seen noise types results

Table 1 presents the results obtained on the development set using noise types observed during training. The proposed deep learning approach DCCNN performs significantly better than the existing deep learning methods and the traditional MMSE-LSA and SG-jMAP [60] in terms of PESQ, STOI, and SNRI, not only on average but under every single SNR condition. Most remarkably, the raw noisy speech has an average STOI of 0.75, and the conventional procedures cannot increase intelligibility in terms of this objective metric. The proposed deep learning approach, on the other hand, improves that value by as much as 0.16 points (to 0.91) on average across SNR conditions.

Table 1 Development class seen noise comparison of proposed with baseline methods

Further, for the extremely difficult −5 dB condition, the PESQ of the raw noisy speech is only marginally improved by MMSE-LSA and SG-jMAP, while it is greatly enhanced by the proposed deep learning method DCCNN, despite a comparably low SNR not having been seen during training. Comparing against the LSTM-IRM and LSTM-MSA single-stage baselines, DCCNN consistently outperforms them in terms of PESQ, with average improvements of about 0.65 MOS points. This supports the idea that optimization in the speech spectral domain is more beneficial than in the masking domain. Compared to LSTM-MSA (17.89 dB) averaged over all SNR conditions, the proposed DCCNN also exhibits noise cancellation superior to that of multi-stage filtering networks.

Nonetheless, DCCNN outperforms all baseline approaches in terms of PESQ in the low-SNR conditions of −5 and 0 dB. This is likely because, by employing the real and imaginary components of the clean speech spectrum as the target, DCCNN implicitly estimates the clean magnitude and phase of the noisy mixture's complex spectrum. Particularly in low-SNR conditions, DCCNN also offers far more noise reduction in terms of SNRI than any other single-stage or multi-stage technique. We favor the proposed DCCNN because of this finding and because it offers the most favorable average speech enhancement (SE) ability on the development set. Compared to the best baseline method, LSTM-cMSA + CED-cSA-tr, it obtains an average PESQ improvement of 0.41 points (3.04), an average SNRI gain of 0.12 points (25.46 dB), and an average STOI gain of 0.03 points (0.91) across the development set over all SNR conditions. We interpret this to mean that, even with similar additional noise suppression, the DCCNN network is more effective at reconstructing lost or damaged speech segments, leading to better overall speech quality in terms of PESQ. With regard to intelligibility, the DCCNN approach considerably enhances STOI at all SNR levels. At the lowest SNR level of −5 dB, where improving intelligibility is particularly essential, it delivers a gain of up to 0.13 points (0.82) in STOI over the best baseline approach.

Comparing the outcomes on the test set with those on the development set, the examination of each set yields similar conclusions regarding system ranking and efficiency patterns across all factors, as indicated in Table 2. Every model, even the traditional methods that do not use the development set for parameter adjustment, performs somewhat worse overall on the test set. This indicates that the test set is marginally more challenging for the various speech enhancement techniques, while the deep learning techniques generalize comparatively better on the test-set samples. Compared to the development set, the average results of the proposed DCCNN are slightly poorer in terms of PESQ and STOI; in terms of SNRI, however, they are even marginally better than those of the high-complexity reference, LSTM-cMSA + CLED-cSA-du.

Table 2 Test class seen noise comparison of proposed with baseline methods

5.2 Unseen noise types results

Table 3 displays the results on the unseen-noise test set compared with the baseline techniques, averaged across the several noise types (traffic, cafeteria, metro, bus, café, babble, and restaurant). Similar patterns and model rankings to the seen-noise assessment are observed, indicating that the deep learning-based techniques generalize well to these highly non-stationary unseen noise types. In particular, when averaged across all SNR levels, the proposed DCCNN network outperforms the existing speech enhancement techniques by 0.08 MOS points (2.70), 0.01 (0.90), and 0.93 points (23.40 dB) in PESQ, STOI, and SNRI, respectively. For lower SNR conditions, such as −5 dB, the proposed deep complex CNN improves PESQ by 0.21 MOS points (2.26), while the baseline methods do not improve speech quality in terms of PESQ. The encoder and decoder of our proposed DCCNN system increase each of the quality indicators for all of the analyzed noise types.

Table 3 Test class unseen noise comparison of proposed with baseline methods

5.3 Model evaluation on environmental noises

Table 4 reports the results of evaluating the proposed deep learning model in terms of PESQ, STOI, and SNRI at low SNR levels for unseen environmental noises. With the current encoder and decoder complexity, the proposed Deep Complex Convolutional Neural Network strongly suppresses traffic noise, giving improvements of up to 25.76 dB in SNRI, 3.74 MOS in PESQ, and 0.94 in STOI compared to the other noise types. For bus noise the performance is slightly worse, owing to the strong interference of the noise with the speech signal on a bus.

Table 4 Results and Analysis for speech signals with different noise types

5.4 Analyzing DCCNN model

To investigate further the root causes of the observed quality gains, the enhanced speech spectrograms produced by the proposed deep complex network were examined in a case study of a test-set utterance in traffic noise at various SNR conditions. The spectrograms of the clean speech signal \(s(t)\), the corresponding noisy speech signal \(x(t)\), and the denoised speech signal \(\hat{s}(t)\) are presented in Fig. 4.

Fig. 4 Spectrogram analysis for unseen traffic noise at different SNR levels (dB)

Examining the outputs, the spectrogram comparison demonstrates how much greater noise reduction is made possible by the Deep Complex Convolutional Neural Network (DCCNN) at the current model complexity. This is owing to the complex spectrogram processing, which enhances the noisy signal's phase and magnitude in parallel and assists in reconstructing improved audio signals. Additionally, through its skip connections, the proposed encoder-decoder architecture directly reuses high-resolution speech information intrinsic to the noise characteristics, which also supports a more thorough reconstruction. This demonstrates that our proposed approach can achieve comparable speech restoration and noise reduction characteristics with far fewer model parameters and less computing resource consumption.

6 Conclusion

In this paper, we introduced the Deep Complex Convolutional Neural Network (DCCNN), a speech enhancement (SE) architecture with an encoder-decoder structure whose network arrangement is specifically selected for noise reduction and realistic speech restoration. DCCNN is a supervised model that employs a complex-valued network for complex spectral mapping. The model takes complex-valued input from the spectrograms of the noisy speech signals, consisting of real and imaginary components, and performs complex spectral mapping that simultaneously enhances the magnitude and phase dynamics of the speech. The encoder encodes the noisy magnitude and phase spectra, and the corresponding magnitude-mask decoder and phase decoder output the enhanced magnitude and phase spectra, respectively. The direct improvement of the phase spectra to enhance the PESQ and STOI of speech signals is the primary innovation of the DCCNN. Compared with the baselines, we find a remarkable improvement of over 3 dB in SNR, 0.2 in STOI, and 0.5 in PESQ. In addition, our method outperforms the baseline SE techniques in low-SNR conditions in terms of STOI, consistently surpassing all reference approaches and improving intelligibility in low-SNR environments. In future work, the proposed deep learning model can be evaluated on different datasets and integrated with other deep learning models to gain further MOS points in terms of PESQ and SNRI.

The plan for future enhancement of the proposed DCCNN is to add more robust loss functions and to use depth-wise separable as well as vanilla convolutions with adaptive weights, so that the model can be trained on complex spectrograms with fewer parameters while introducing spatial (up-down) information and contextual dynamic changes.