1 Introduction

In recent years, the field of audio signal processing has witnessed significant advancements, particularly in the area of source separation (SS). Single-channel source separation (SCSS), also known as monaural source separation, refers to the process of separating individual sound sources from a mixed audio signal, typically captured by a single microphone or channel. It has become a highly desirable and challenging task in various applications, such as music production, speech enhancement (SE), audio transcription, and immersive audio systems [1,2,3,4]. Separating mixed speech offers numerous potential benefits, and the role of SS in contemporary speech processing is becoming increasingly crucial, with a growing number of devices expected to perform this task effectively.

While humans can effortlessly separate speech, constructing an automated system that emulates the human auditory system proves to be exceptionally challenging. Consequently, the pursuit of developing effective automatic SS systems has consistently been a significant focus in speech processing research. Conventionally, SS methods relied on the assumption of having multiple microphones or channels to exploit spatial information. However, in many real-world scenarios, such as live concert recordings, teleconferencing, or historical audio restoration, the availability of multiple channels is limited or nonexistent. This limitation prompted the development of SCSS [5,6,7] techniques that aim to recover individual sound sources from a monaural mixture.

Owing to the increasing interest in SS, several conventional SCSS models have been proposed, taking into account factors such as the phase, magnitude, frequency, energy, and spectrogram of the speech signal. Notable success in separating individual speakers has been achieved with factorial hidden Markov models (HMMs) [8]. Moreover, researchers increasingly utilize nonnegative matrix factorization (NMF), a family of multivariate analysis methods that decompose a matrix into two nonnegative matrices of components and weights, to separate source signals in SCSS [9].

However, these conventional approaches often face limitations when dealing with complex acoustic environments, overlapping sources, and nonstationary signals. To overcome these challenges, researchers have turned to deep learning (DL) algorithms [10, 11] and architectures to develop data-driven approaches for SS, achieving unprecedented performance improvements. SCSS focuses on learning a mapping function that estimates individual source signals from mixed audio inputs using a training dataset consisting of paired mixtures and their corresponding source signals [12].

In the context of audio SS, the Short-Time Fourier Transform (STFT) [13] is widely used to analyze and manipulate audio signals. The STFT represents a signal in the time-frequency domain, decomposing it into a series of spectral components. Each component is characterized by its magnitude and phase, which provide valuable cues for separating the sources. In traditional as well as many DL approaches, the magnitude spectrogram has received the most attention and has been the main focus for SS. However, phase information has also been recognized as an important factor in separation performance.

In this study, we propose an approach for SCSS using a U-NET that considers both the real and imaginary parts of the complex spectrum generated by the STFT. In this way, the phase information is retained alongside the magnitude rather than being discarded. Our method aims to leverage the benefits of DL and exploit the additional information contained in the complex spectrum to enhance separation performance. We have designed a modified U-NET architecture that can effectively handle the complex input features and learn to extract individual sources from the mixed audio signal.

The rest of the article is organized as follows: Sect. 2 provides a comprehensive overview of related research in this domain, focusing on the evolution of deep learning techniques. Section 3 presents the U-NET architecture in detail, elucidating its key components and highlighting the reasons behind its suitability for audio SS. Section 4 presents the proposed methodology, describing the architectural choices, proposed algorithm, training, and evaluation procedures. Section 5 showcases the outcomes of the experiments and the subsequent analysis, covering both the dataset employed in this study and the evaluation metrics used to gauge performance. Finally, Sect. 6 concludes the article by summarizing the key findings and outlining future research directions.

2 Related Research

For supervised SS, there are two categories of learning models: (1) traditional methods, such as model-based processes and speech enhancement techniques; and (2) newer methods based on DNNs. As a consequence of the speech production process, the input features and desired outputs of SS exhibit a clear spatiotemporal structure, which makes deep models well suited to the task.

Numerous deep models are actively deployed in speech separation. Sun et al. [14] devised a two-stage method employing two DNN-based algorithms to address the performance limitations of existing speech separation systems. The authors of [15] proposed new training targets alongside existing magnitude-based objectives, using neural networks to compensate for the target phase and thereby attain better separation performance.

To capture the temporal characteristics of the data, Zhou et al. [16] developed a separation system based on an RNN with LSTM units. Supervised speech separation imposes no constraints on the statistical properties of the noise and does not require knowledge of the spatial orientation of the sound sources, which gives it clear advantages and a promising research outlook in monaural, nonstationary, and low-SNR conditions [17, 18].

The deep recurrent neural network (DRNN) is a deep learning model frequently used in speech separation. It identifies the hidden states of recurrent units such as LSTM [19] and the gated recurrent unit (GRU) [20] in SS, in a manner analogous to Markov models. Some past information is preserved through the previous hidden state; however, the magnitude spectra of mixed speech form long sequences, so information is lost during sequence analysis, degrading both the separation of mixed speech and the accuracy of speech prediction.

CNNs have been widely used in DL since LeCun et al. [21] first presented them in 1998. The CNN has clear advantages in 2-D signal processing, and applications such as image recognition have demonstrated its impressive modeling ability. CNNs are now being applied to SS and have outperformed DNN-based speech separation systems in terms of separation performance under identical conditions.

[22] introduces a method for SCSS using deep, fully convolutional denoising autoencoders (CDAEs). Trained to extract specific sources from mixtures, CDAEs compare favorably with deep feedforward neural networks in SS. They learn distinct spectral-temporal patterns that allow them to isolate the sources in mixed signals. The work additionally explores the use of spectral masks to scale the mixed signal according to each source's contribution, ensuring an accurate estimation of the sources in the mixture.

To address the limitations of time-frequency masking, Luo et al. [23] developed Conv-TasNet, a fully convolutional network for SS in the time domain. To mitigate the disparity between accuracy measures such as hit rate, error rate, and classification accuracy, Wang et al. [24] modified the loss function of the CNN.

[25] suggests a system that addresses challenges such as over-smoothing and incomplete separation in SCSS by integrating time-frequency non-negative matrix factorization (TFNMF) and deep neural networks with sigmoid-based normalization (SNDNN). TFNMF is used for feature extraction, and the resulting features are classified through a softmax layer.

The paper [26] introduces VAT-SNet, a time-domain music separation model that directly utilizes music waveform data as input. VAT-SNet enhances the network structure of Conv-TasNet by preserving deep acoustic features through sample-level convolution in both the encoder and decoder. Additionally, it incorporates vocal and accompaniment embeddings from an auxiliary network to enhance the purity of the separation, aligning with the principles of independent component analysis (ICA) and providing a mathematical model for the separation process.

UFLSTM [27] is a deep learning model for speech enhancement. It applies an adaptive power-law transformation to redistribute energy while keeping the total energy of the speech signal constant, improving intelligibility and quality; it also incorporates residual connections to prevent gradient decay and adjusts the forget gate with an attention mechanism.

Fig. 1 U-NET Architecture

Although both conventional and DNN-based separation models have shown impressive results, they each have shortcomings. In a CNN, each element can absorb local features without having to learn global characteristics, exploiting the spatial connectivity of the input data; during feature extraction, localized features are identified first and then combined into more comprehensive features at higher levels. Weight sharing also improves model speed by reducing the number of parameters that must be computed for each neuron.

Combining a number of convolution filters produces feature maps that can recognize the same type of feature at different locations, partly ensuring invariance to displacement and distortion. This study therefore proposes a CNN-based approach to alleviate the loss of long-sequence information in mixed speech. Our model can boost the separation effect by concentrating on the time steps that contribute most and by partially overcoming the short memory of temporal models.

3 U-NET Architecture

In order to extract the features of the desired source from the mixed coefficients, we employed the U-NET architecture. Figure 1 presents a pictorial representation of the network structure, comprising two main components: a contracting path on the left side and an expansive path on the right. The contracting path adheres to the typical architecture of a convolutional network. It involves the repeated application of two \(3\times 3\) convolutions, each followed by a leaky rectified linear unit (LeakyReLU), and then a \(2\times 2\) max pooling operation with a stride of 2 for downsampling. At each downsampling stage, the number of feature channels is doubled.

Each step in the expansive path involves enlarging the feature map through upsampling, followed by a \(2\times 2\) convolution that halves the number of feature channels. The enlarged feature map is then concatenated with the correspondingly cropped feature map from the contracting path; cropping is necessary because border pixels are lost in every convolution. At the final layer, a \(1\times 1\) convolution transforms each 16-component feature vector into the desired number of classes. Altogether, the network comprises 24 convolutional layers. To ensure the output segmentation map can be seamlessly tiled, the input tile size must be chosen so that every \(2\times 2\) max-pooling operation is applied to a layer with even x- and y-dimensions.
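The sketch below is a minimal PyTorch illustration of this encoder-decoder pattern with skip connections; the channel widths, depth, and the name `SmallUNet` are assumptions for illustration rather than the exact 24-layer configuration used here, and 'same' padding is used so no cropping is needed.

```python
# Minimal U-NET sketch (PyTorch). Depth, channel widths, and names are
# illustrative assumptions, not the exact 24-layer network of this article.
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by a LeakyReLU activation.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.LeakyReLU(0.2),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.LeakyReLU(0.2),
    )

class SmallUNet(nn.Module):
    def __init__(self, in_ch=2, out_ch=2, base=16):
        super().__init__()
        self.enc1 = double_conv(in_ch, base)               # contracting path
        self.enc2 = double_conv(base, base * 2)
        self.pool = nn.MaxPool2d(2, stride=2)              # 2x2 max pooling, stride 2
        self.bottleneck = double_conv(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)  # upsampling
        self.dec2 = double_conv(base * 4, base * 2)        # after skip concatenation
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = double_conv(base * 2, base)
        self.out = nn.Conv2d(base, out_ch, kernel_size=1)  # final 1x1 convolution

    def forward(self, x):
        # x: (batch, channels, freq_bins, frames); both spatial dims should
        # be divisible by 4 so that pooling and upsampling shapes match.
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.out(d1)
```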

The Huber loss is a robust alternative to the mean squared error (MSE) loss, which is sensitive to outliers. It applies a quadratic (squared) penalty to small errors and a linear (absolute) penalty to larger ones, which mitigates the influence of outliers and improves model performance. A parameter called delta (\(\delta \)) sets the threshold at which the loss transitions from quadratic to linear: for errors smaller than \(\delta \) the loss behaves like MSE, while for errors exceeding \(\delta \) it behaves like MAE. Mathematically, this loss function is given in Eq. (1), where y denotes the actual or desired value, \(y'\) the predicted value, and \(\delta \) the threshold parameter.

$$L(y, y') = \begin{cases} \frac{1}{2}(y - y')^{2}, & \text{if } |y - y'| \le \delta \\ \delta\,|y - y'| - \frac{1}{2}\delta^{2}, & \text{otherwise} \end{cases}$$
(1)

The network's parameters, totaling 1,941,093, were randomly initialized. The network was trained using backpropagation and the Adam optimizer with a learning rate of 0.001, with default settings for all other parameters.
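As a brief, hedged illustration of this training configuration, the snippet below pairs the Huber loss of Eq. (1) with the Adam optimizer at a learning rate of 0.001, reusing the `SmallUNet` sketch above; the \(\delta \) value is an assumption, since the article does not report it.

```python
import torch
import torch.nn as nn

model = SmallUNet()                          # sketch model from the previous listing
criterion = nn.HuberLoss(delta=1.0)          # Eq. (1); the delta value is an assumption
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # other Adam settings left at defaults
```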

Fig. 2 Block diagram of the proposed SS approach

4 Proposed Method

This section outlines the proposed SCSS technique and the components it utilizes. In the context of audio or time-series data, the STFT represents the signal as a complex matrix, where each element corresponds to a specific frequency and time bin index. The real and imaginary parts of each element jointly encode the magnitude (intensity) and phase of the corresponding frequency component. Unlike most SS systems that focus solely on the magnitude of the STFT and neglect the phase, this article combines the STFT with U-NET, a deep CNN, taking both the real and imaginary components into consideration. Using both components during U-NET training enables the model to effectively capture the complex-valued frequency information in the input data.
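A minimal sketch of constructing such an input is given below, stacking the real and imaginary parts of the mixture's STFT as two channels; the file path, frame length, and hop size are illustrative assumptions.

```python
# Build the network input from the complex STFT of a mixture.
# File path, n_fft, and hop_length are assumptions, not values from the article.
import numpy as np
import librosa

mix, sr = librosa.load("mixture.wav", sr=8000)     # hypothetical mixture m(t)
M = librosa.stft(mix, n_fft=512, hop_length=128)   # complex spectrogram M(tau, f)
M_RI = np.stack([M.real, M.imag], axis=0)          # shape: (2, freq_bins, frames)
# M_RI is fed to the U-NET; the training target P_RI is built the same
# way from the clean source p(t).
```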

It is important to note that no approach is universally superior, and trade-offs exist. The associated trade-offs were that the utilization of U-NET for SS introduced computational complexities, and its performance was contingent upon the quantity and quality of available data. Notably, there were associated risks of overfitting, especially when confronted with limited data, potentially limiting the model’s interpretability. Furthermore, the implementation of U-NET demanded substantial computational resources and prolonged training times. Achieving robust generalization across diverse acoustic environments posed a significant challenge. Therefore, a pivotal aspect in this methodology involved striking a balance between U-NET’s model complexity and the specific requirements of the application.

Against these trade-offs, several factors explain the better performance of the proposed method. Unlike other approaches, incorporating both the real and imaginary components in the model yields a comprehensive representation of the audio signal, capturing both amplitude and phase details. This refined representation enhances accuracy, especially in scenarios involving overlapping speech. Preserving phase information is also crucial for maintaining temporal attributes, leading to more natural and intelligible speech output. The end-to-end learning approach streamlines training, allowing the model to autonomously learn relevant features and promoting better generalization across speakers and acoustic environments. Furthermore, supervised learning with labeled data enhances adaptability to diverse acoustic environments, increasing robustness in real-world scenarios. U-NET's efficiency and hardware acceleration allow real-time audio processing, which is crucial for low-latency applications such as live streaming and interactive platforms. The proposed SS method has two stages, the training stage and the testing stage, which are depicted in Fig. 2.

Algorithm 1 Training and testing stages of the proposed method

4.1 Training Stage

During the training phase, we consider a mixed signal m(t) consisting of two different sources p(t) and q(t). Here, m(t) is used as the input signal, and p(t) is the corresponding label. The STFT is applied to both the mixed and label signals to compute the complex spectrograms M\(_{(\tau , f)}\) and P\(_{(\tau , f)}\), given in Eqs. (2) and (3), with \(\tau \) and f denoting the time and frequency bin indices, respectively.

$$\mathbf{M}_{(\tau, f)} = \mathbf{M}_{R(\tau, f)} + \mathbf{M}_{I(\tau, f)}\,i$$
(2)
$$\mathbf{P}_{(\tau, f)} = \mathbf{P}_{R(\tau, f)} + \mathbf{P}_{I(\tau, f)}\,i$$
(3)

The concatenated forms of the real and imaginary components for both M\(_{RI}^\mathrm{{Train}}\) and P\(_{RI}^\mathrm{{Train}}\) matrices are then forwarded into the U-NET model. The network model next decomposes the M\(_{RI}^\mathrm{{Train}}\) matrix into its bias and weight matrices as per Eq. (4), where the terms W\(_{{\textbf {M}}_{RI}}\) and b\(_{{\textbf {M}}_{RI}}\) represent the weight and bias matrices corresponding to the mixed source, and g represents the nonlinear activation function.

$$\mathbf{M}_{RI}^{\mathrm{Train}} \approx g\!\left(\mathbf{W}_{\mathbf{M}_{RI}} + \mathbf{b}_{\mathbf{M}_{RI}}\right)$$
(4)

Initially, the bias and weight matrices are set to zero and random values, respectively. The weight matrix W\(_{{\textbf {M}}_{RI}}\) and the bias matrix b\(_{{\textbf {M}}_{RI}}\) are updated continuously by minimizing the cost between M\(_{RI}^\mathrm{{Train}}\) and P\(_{RI}^\mathrm{{Train}}\) in Eq. (5), using the update rules in Eqs. (6) and (7), where \(\alpha \) is the learning rate. The model is saved during training, and once training is complete, the best weights and biases are fixed.

$$\mathbf{M}_{RI}(\mathrm{Error}) = \mathbf{M}_{RI}(\mathrm{Label\ Output}) - \mathbf{M}_{RI}(\mathrm{Predicted\ Output})$$
(5)
$$\mathbf{W}_{\mathbf{M}_{RI}}(\mathrm{New}) = \mathbf{W}_{\mathbf{M}_{RI}}(\mathrm{Old}) - \alpha\,\frac{\partial \mathbf{M}_{RI}(\mathrm{Error})}{\partial \mathbf{W}_{\mathbf{M}_{RI}}(\mathrm{Old})}$$
(6)
$$\mathbf{b}_{\mathbf{M}_{RI}}(\mathrm{New}) = \mathbf{b}_{\mathbf{M}_{RI}}(\mathrm{Old}) - \alpha\,\frac{\partial \mathbf{M}_{RI}(\mathrm{Error})}{\partial \mathbf{b}_{\mathbf{M}_{RI}}(\mathrm{Old})}$$
(7)
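For concreteness, a hedged sketch of one training step is shown below; in practice the updates of Eqs. (6) and (7) are carried out by backpropagation and the Adam optimizer rather than hand-written derivatives, and the tensor shapes and batching scheme are assumptions.

```python
# One training step for the sketch model: predict P_RI from M_RI and let
# Adam apply the gradient updates corresponding to Eqs. (6) and (7).
import torch

def train_step(model, criterion, optimizer, m_ri, p_ri):
    # m_ri, p_ri: float tensors of shape (batch, 2, freq_bins, frames)
    optimizer.zero_grad()
    p_pred = model(m_ri)              # estimated P_RI
    loss = criterion(p_pred, p_ri)    # Huber cost on the error of Eq. (5)
    loss.backward()                   # backpropagation of the error
    optimizer.step()                  # weight and bias updates, Eqs. (6) and (7)
    return loss.item()
```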

4.2 Testing Stage

During the testing phase, the mixed signal m(t), a combination of the signals p(t) and q(t), undergoes the STFT to generate the complex spectrogram in Eq. (8).

$$\mathbf{M}_{(\tau, f)} = \mathbf{M}_{R(\tau, f)} + \mathbf{M}_{I(\tau, f)}\,i$$
(8)

From the complex spectrogram of the mixed signal, the real and imaginary components are separated and concatenated to construct M\(_{RI}^\mathrm{{Test}}\), which is passed through the saved U-NET model. The model then generates the enhanced concatenated matrix P\(_{RI}^{E}\) for the first source. To compute the enhanced concatenated matrix Q\(_{RI}^{E}\) for the second source, we subtract P\(_{RI}^{E}\) from M\(_{RI}^\mathrm{{Test}}\) as per Eq. (9).

$$\mathbf{Q}_{RI}^{E} = \mathbf{M}_{RI}^{\mathrm{Test}} - \mathbf{P}_{RI}^{E}$$
(9)

From the first enhanced concatenated matrix P\(_{RI}^{E}\), the real and imaginary components are separated once again to reconstruct a complex matrix P\(^\mathrm{{recmplx}}\), as given in Eq. (10).

$$\mathbf{P}^{\mathrm{recmplx}} = \mathbf{P}_{R}^{E} + \mathbf{P}_{I}^{E}\,i$$
(10)

Similarly, the real and imaginary components are separated from the second enhanced concatenated matrix Q\(_{RI}^{E}\) to reconstruct another complex matrix Q\(^\mathrm{{recmplx}}\) for the second (female) source, as per Eq. (11).

$$\mathbf{Q}^{\mathrm{recmplx}} = \mathbf{Q}_{R}^{E} + \mathbf{Q}_{I}^{E}\,i$$
(11)

From the reconstructed complex matrix P\(^\mathrm{{recmplx}}\), the magnitude and phase components P\(_\mathrm{{Emag}}\) and P\(_\mathrm{{Ephase}}\) are obtained for the first source, as per Eq. (12).

$$\begin{aligned} \mathbf{P}_{\mathrm{Emag}} &= \mathrm{magnitude}(\mathbf{P}^{\mathrm{recmplx}}) \\ \mathbf{P}_{\mathrm{Ephase}} &= \mathrm{phase}(\mathbf{P}^{\mathrm{recmplx}}) \end{aligned}$$
(12)

The magnitude and phase components Q\(_\mathrm{{Emag}}\) and Q\(_\mathrm{{Ephase}}\) for the other source are extracted from the reconstructed complex matrix Q\(^\mathrm{{recmplx}}\), as per Eq. (13).

$$\begin{aligned} \mathbf{Q}_{\mathrm{Emag}} &= \mathrm{magnitude}(\mathbf{Q}^{\mathrm{recmplx}}) \\ \mathbf{Q}_{\mathrm{Ephase}} &= \mathrm{phase}(\mathbf{Q}^{\mathrm{recmplx}}) \end{aligned}$$
(13)

The enhanced magnitude and phase obtained in Eq. (12) are fed into the inverse STFT, which transforms them into a time-domain signal, yielding the first estimated source as per Eq. (14). Similarly, the inverse STFT in Eq. (15) takes the enhanced magnitude and phase from Eq. (13) and generates the second source.

$$\mathbf{p}'(t) = \mathrm{ISTFT}\left(\mathbf{P}_{\mathrm{Emag}} \times \mathbf{P}_{\mathrm{Ephase}}\right)$$
(14)
$$\mathbf{q}'(t) = \mathrm{ISTFT}\left(\mathbf{Q}_{\mathrm{Emag}} \times \mathbf{Q}_{\mathrm{Ephase}}\right)$$
(15)
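The testing stage can be summarized by the following hedged sketch, which estimates the first source, obtains the second by the subtraction of Eq. (9), and reconstructs both waveforms with the inverse STFT as in Eqs. (14) and (15); here the product of magnitude and phase is interpreted as magnitude times \(e^{j\,\mathrm{phase}}\), and the model, file path, and STFT settings are assumptions carried over from the earlier sketches.

```python
# Testing-stage sketch: separate two sources from a mixture using the
# trained sketch model. STFT settings and paths are assumptions; the
# time-frequency dimensions may need padding to a multiple of 4 for the
# pooling stages of the sketch U-NET.
import numpy as np
import librosa
import torch

mix, sr = librosa.load("test_mixture.wav", sr=8000)         # hypothetical test mixture
M = librosa.stft(mix, n_fft=512, hop_length=128)             # Eq. (8)
m_ri = torch.tensor(np.stack([M.real, M.imag], axis=0),
                    dtype=torch.float32).unsqueeze(0)        # (1, 2, F, T)

with torch.no_grad():
    p_ri = model(m_ri).squeeze(0).numpy()                    # enhanced P_RI
q_ri = np.stack([M.real, M.imag], axis=0) - p_ri             # Eq. (9)

def to_time(ri):
    rec = ri[0] + 1j * ri[1]                                 # recomplex, Eqs. (10)/(11)
    mag, phase = np.abs(rec), np.angle(rec)                  # Eqs. (12)/(13)
    return librosa.istft(mag * np.exp(1j * phase), hop_length=128)  # Eqs. (14)/(15)

p_hat = to_time(p_ri)   # first (male) estimated source p'(t)
q_hat = to_time(q_ri)   # second (female) estimated source q'(t)
```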

5 Results and Discussion

This section presents the experimental findings and discussion. First, a brief overview of the experimental design and evaluation methods is given, followed by a description of the metrics used to measure the results. Third, we examine how the joint features compare with the single-domain techniques with respect to the SDR, SIR, fwsegSNR, STOI, and HASQI scores. Fourth, we compare the overall effectiveness of our approach with the CDAE, Conv-TasNet, CASSM, NMF-DNN, VAT-SNet, and ULSTM techniques in terms of PESQ, STOI, fwsegSNR, SDR, SIR, and SAR. Finally, the time-domain waveforms and spectrograms of the clean, mixed, and separated male and female sounds are provided.

5.1 Experimental Setup

To assess the efficiency of the suggested approach, we compare the proposed model with CDAE [22], Conv-TasNet [23], CASSM [24], NMF-DNN [25], VAT-SNet [26], and ULSTM [27]. The speech signals are taken from the GRID audiovisual corpus [28] and used for both training and testing. The corpus contains 1000 utterances spoken by thirty-four speakers (eighteen male and sixteen female). We concatenate all sentences for each speaker. For opposite-gender speech separation, the utterances of six male and six female speakers are used to form the experimental group. Each training signal lasts about 25 min, and each test signal lasts around 60 s. The signals are sampled at 8000 Hz. As in a speech-noise scenario, we treat the female voice as noise and the male voice as the speech signal, and mix the female source with the male source at -10, -5, 0, 5, and 10 dB.
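As an illustration of this mixing procedure, the sketch below combines a male and a female utterance at a target SNR; the helper name and file paths are assumptions.

```python
# Mix two sources at a target SNR in dB (female treated as interference).
import numpy as np
import librosa

def mix_at_snr(speech, interference, snr_db):
    # Scale the interference so that 10*log10(P_speech / P_interference) = snr_db.
    ps = np.mean(speech ** 2)
    pi = np.mean(interference ** 2)
    gain = np.sqrt(ps / (pi * 10 ** (snr_db / 10)))
    n = min(len(speech), len(interference))
    return speech[:n] + gain * interference[:n]

male, _ = librosa.load("male.wav", sr=8000)       # hypothetical paths
female, _ = librosa.load("female.wav", sr=8000)
mixture = mix_at_snr(male, female, snr_db=0)      # also mixed at -10, -5, 5, 10 dB
```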

Fig. 3 Comparison of a SDR, SIR, fwsegSNR, b HASQI and STOI for single versus joint features

Fig. 4 Comparison of fwsegSNR for a male, b female source, respectively

Fig. 5 Comparison of SDR for a male, b female source, respectively

5.2 Evaluation Metrics

The performance of the separated utterances is evaluated through the SDR [29], SIR [29], SAR [29], fwsegSNR [30], STOI [31], PESQ [32], HASPI [33], and HASQI [34] scores. The SDR, a measure of overall speech quality, is the ratio of the power of the input signal to the power of the difference between the input and reconstructed signals; higher SDR scores indicate better reconstruction. SIR measures errors caused by the failure of the separation process to eliminate the interfering signal, so a higher SIR value indicates better separation quality. PESQ is evaluated by comparing the separated speech with the corresponding clean speech and produces scores between -0.5 and 4.5, with higher values indicating better quality. STOI correlates the short-time temporal envelopes of the clean and separated speech, yielding a score between 0 and 1, with higher values indicating greater intelligibility. fwsegSNR evaluates the quality of the obtained signal, with higher values indicating better performance. HASPI and HASQI are indices designed to predict speech perception (intelligibility) and speech quality, respectively, for hearing-impaired and normal-hearing listeners; their scores range from 0 to 1, with higher values indicating better intelligibility and quality.
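A subset of these metrics can be computed with common open-source tools, as in the hedged sketch below (mir_eval for SDR/SIR/SAR, pystoi for STOI, and the pesq package for PESQ); fwsegSNR, HASPI, and HASQI require separate implementations and are omitted, and the clean/estimated arrays are placeholders.

```python
# Compute SDR/SIR/SAR, STOI, and PESQ for the separated male source.
# male_clean, female_clean, p_hat, q_hat are placeholder arrays assumed
# to be time-aligned and of equal length.
import numpy as np
from mir_eval.separation import bss_eval_sources
from pystoi import stoi
from pesq import pesq

fs = 8000
refs = np.stack([male_clean, female_clean])     # reference sources
ests = np.stack([p_hat, q_hat])                 # separated estimates
sdr, sir, sar, _ = bss_eval_sources(refs, ests)

stoi_male = stoi(male_clean, p_hat, fs)
pesq_male = pesq(fs, male_clean, p_hat, "nb")   # narrowband mode for 8 kHz signals
```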

5.3 The Impact of Single Versus Joint Features

The source signals are short, largely stationary, and sparse. Transforming the signal into the time-frequency domain with the STFT yields its complex spectra, which are used for the speech separation techniques. Some existing methods consider only the magnitude of the complex spectra, ignoring the real and imaginary components. For comparison, we evaluate the real and imaginary parts individually, the magnitude alone, and the real and imaginary parts jointly. The SDR, SIR, fwsegSNR, HASQI, and STOI measurements are compared in Fig. 3. As the figures show, the method that uses the real and imaginary parts together outperforms the others. Consequently, the proposed technique examines the real and imaginary parts simultaneously, which improves the quality and intelligibility of SCSS.

Fig. 6 Comparison of SIR for a male, b female source, respectively

Fig. 7 Comparison of SAR for a male, b female source, respectively

5.4 Overall Performance of the Proposed Algorithm

In Fig. 4, the fwsegSNR performance of the proposed model is compared with that of current models. The graphs indicate that the proposed model outperforms the other techniques in all conditions. Compared with the existing approaches, our strategy boosts fwsegSNR by 9.65% at -10 dB SNR, 11.56% at -5 dB, 13.69% at 0 dB, 15.31% at 5 dB, and 17.09% at 10 dB when separating the male source. Similarly, our approach gains 18.56%, 15.26%, 12.85%, 10.16%, and 7.51% at -10, -5, 0, 5, and 10 dB SNR, respectively, for female source separation.

Figure 5 demonstrates that the proposed model's SDR achieves much better results than the alternatives, notably CDAE, Conv-TasNet, CASSM, NMF-DNN, VAT-SNet, and ULSTM, for both the male and female sources. The proposed model's SDR values are greater than those of the previous models in all separation conditions.

The suggested model increases SDR by 7.26 dB at -10 dB SNR, 8.53 dB at -5 dB, 10.19 dB at 0 dB, 11.78 dB at 5 dB, and 13.10 dB at 10 dB when separating the male source. Correspondingly, gains of 13.02 dB, 11.43 dB, 9.84 dB, 7.63 dB, and 4.84 dB are obtained at -10, -5, 0, 5, and 10 dB SNR, respectively, for the female source. Similarly, Fig. 6 shows that the SIR values of the predicted signals are higher than those of the current models.

Figure 7 shows that our proposed approach performs better in terms of the source-to-artifacts ratio (SAR) for both the male and female sources than the other methods considered in this article.

Tables 1, 2, 3, and 4 compare the suggested technique's performance in terms of PESQ and STOI with that of other current approaches. Our technique yields PESQ scores of 2.25 at -10 dB, 2.40 at -5 dB, 2.63 at 0 dB, 2.81 at 5 dB, and 2.98 at 10 dB when separating the male source, improving over the methods used for comparison. Likewise, the separated female source achieves 3.23, 2.98, 2.70, 2.35, and 1.97 at -10, -5, 0, 5, and 10 dB, respectively. The tables further show that the estimated signals have higher STOI performance than the existing models.

Table 1 Comparison of PESQ scores for the male source with six different approaches
Table 2 Comparison of PESQ scores for the female source with six different approaches
Table 3 Comparison of STOI scores for the male source with six different approaches
Table 4 Comparison of STOI scores for the female source with six different approaches

Tables 5, 6, 7, and 8 show the HASPI and HASQI results of several approaches, including CDAE, Conv-TasNet, CASSM, NMF-DNN, VAT-SNet, and ULSTM, for male and female speech separation. Tables 5 and 6 show that U-NET produces higher HASPI values in all separation scenarios. Tables 7 and 8 likewise show that the HASQI results of our approach outperform the other techniques in all separation conditions.

Table 5 Comparison of HASPI values for the male source with six different approaches
Table 6 Comparison of HASPI values for the female source with six different approaches
Table 7 Comparison of HASQI values for the male source with six different approaches
Table 8 Comparison of HASQI values for the female source with six different approaches
Fig. 8 a Waveform, b spectrogram of clean, c waveform, d spectrogram of mixed, and e waveform, f spectrogram of separated male source, respectively

Fig. 9 a Waveform, b spectrogram of clean, c waveform, d spectrogram of mixed, and e waveform, f spectrogram of separated female source, respectively

5.5 Time-Domain and Spectrogram Representation

Time-domain and spectrogram representations offer distinct ways to visualize and analyze signals, especially in signal processing. The time-domain representation depicts the signal's temporal evolution, offering insight into its amplitude and serving as a valuable tool for understanding temporal patterns and identifying specific events. A spectrogram, on the other hand, is a graphical representation of a signal's frequency spectrum over time, adding information about how the frequency content evolves and facilitating the examination of changing spectral characteristics.
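A hedged sketch of producing such waveform and spectrogram views (as in Figs. 8 and 9) for a single signal is given below; the file path and display settings are assumptions.

```python
# Plot the waveform and spectrogram of one signal.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

sig, sr = librosa.load("separated_male.wav", sr=8000)   # hypothetical path
fig, (ax_t, ax_f) = plt.subplots(2, 1, figsize=(8, 6))

ax_t.plot(np.arange(len(sig)) / sr, sig)                 # time-domain waveform
ax_t.set(xlabel="Time (s)", ylabel="Amplitude", title="Waveform")

S_db = librosa.amplitude_to_db(np.abs(librosa.stft(sig)), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax_f)
ax_f.set(title="Spectrogram")
plt.tight_layout()
plt.show()
```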

Figure 8 depicts the time-domain and spectrogram representations of the clean, mixed, and separated signals for the male source. Here, we chose the male utterance that performed best, together with the corresponding mixed and estimated male signals. The figures show that our approach separates the male source from the mixture quite well. Similarly, Fig. 9 shows that the female source is also separated cleanly from the mixed signal.

6 Conclusion

From the perspective of neural architecture, we developed a U-NET, a convolutional neural network architecture built on a few improvements over the original CNN design. The model architecture was created with two principles in mind. The first is the encoder path, which uses max pooling layers with stride 2 to reduce the data size; the convolutional layers are repeated with a growing number of filters in the encoder blocks. The second is the decoder path and its associated connections: moving through the decoder, the number of filters in the convolutional layers decreases while the feature maps are progressively upsampled in the subsequent layers, and skip connections link earlier encoder outputs to the corresponding decoder layers. Using this network architecture to separate the intended sources, we obtain better performance in every SNR scenario. Compared with the outcomes of the other approaches mentioned in this article, the quality and intelligibility of the separated speech signals are enhanced. The experimental results show that the proposed speech separation model outperforms the current models in overall performance across the various evaluation methodologies used to assess the separated speech signals. In the future, we intend to investigate other training and testing procedures using different deep neural networks.