1 Introduction

In recent years, the field of audio signal processing has witnessed significant advancements, particularly in the area of source separation (SS). Single-channel source separation (SCSS), also known as monaural source separation, refers to the process of separating individual sound sources from a mixed audio signal, typically captured by a single microphone or channel. It has become a highly desirable and challenging task in various applications, such as music production, speech enhancement (SE), audio transcription, and immersive audio systems [1,2,3,4]. Separating mixed speech offers numerous potential benefits, and the role of SS in contemporary speech processing is becoming increasingly crucial, with a growing number of devices expected to perform this task effectively.

While humans can effortlessly separate speech, constructing an automated system that emulates the human auditory system proves to be exceptionally challenging. Consequently, the pursuit of developing effective automatic SS systems has consistently been a significant focus in speech processing research. Conventionally, SS methods relied on the assumption of having multiple microphones or channels to exploit spatial information. However, in many real-world scenarios, such as live concert recordings, teleconferencing, or historical audio restoration, the availability of multiple channels is limited or nonexistent. This limitation prompted the development of SCSS [5,6,7] techniques that aim to recover individual sound sources from a monaural mixture.

Owing to the increasing interest in SS, several conventional SCSS models have been proposed, taking into account factors such as the phase, magnitude, frequency, energy, and spectrogram of the speech signal. Notable success in separating individual speakers has been achieved with factorial hidden Markov models (HMMs) [8]. Moreover, researchers increasingly utilize nonnegative matrix factorization (NMF), a family of multivariate analysis methods that decompose a matrix into two nonnegative matrices of components and weights, to separate source signals in SCSS [9].

However, these conventional approaches often face limitations when dealing with complex acoustic environments, overlapping sources, and nonstationary signals. To overcome these challenges, researchers have turned to deep learning (DL) algorithms [10, 11] and architectures to develop data-driven approaches for SS, achieving unprecedented performance improvements. SCSS focuses on learning a mapping function that estimates individual source signals from mixed audio inputs using a training dataset consisting of paired mixtures and their corresponding source signals [12].

In the context of audio SS, the Short-Time Fourier Transform (STFT) [13] is widely used to analyze and manipulate audio signals. The STFT represents a signal in the time-frequency domain, decomposing it into a series of spectral components. Each component is characterized by its magnitude and phase, which provide valuable cues for separating the sources. In traditional as well as many DL approaches, the magnitude spectrogram has received the most attention and has been the main focus for SS. However, phase information has also been recognized as an important factor in separation performance.

In this study, we propose an approach for SCSS using a U-NET that considers both the real and imaginary parts of the complex spectrum generated by the STFT. In this way, the phase information is retained alongside the magnitude rather than being discarded. Our method aims to leverage the benefits of DL and exploit the additional information contained in the complex spectrum to enhance separation performance. We have designed a modified U-NET architecture that can effectively handle the complex input features and learn to extract individual sources from the mixed audio signal.

The rest of the article is organized as follows: Sect. 2 provides a comprehensive overview of related research in this domain, focusing on the evolution of deep learning techniques. Section 3 presents the U-NET architecture in detail, elucidating its key components and highlighting the reasons behind its suitability for audio SS. Section 4 presents the proposed methodology, describing the architectural choices, proposed algorithm, training, and evaluation procedures. Section 5 showcases the outcomes of the experiments and the subsequent analysis, covering both the dataset employed in this study and the evaluation metrics used to gauge performance. Finally, Sect. 6 concludes the article by summarizing the key findings and outlining future research directions.

2 Related Research

For supervised SS, there are two categories of learning models: (1) traditional methods, such as model-based processes and speech enhancement techniques; and (2) newer methods based on DNNs. As a consequence of the speech production process, the input features and desired outputs of SS exhibit a clear spatiotemporal structure, which makes deep models well suited to the task.

Numerous deep models are actively deployed in speech separation. Sun et al. [14] devised a two-stage method employing two DNN-based algorithms to address the performance limitations of existing speech separation systems. The authors of [15] proposed new training targets alongside existing magnitude-based objectives, using neural networks to compensate for the target phase and thereby attain better separation performance.

To capture the temporal characteristics of the data, Zhou et al. [16] developed a separation system based on an RNN with LSTM units. Supervised speech separation imposes no constraints on the statistical properties of the noise and does not require knowledge of the spatial orientation of the sound sources, which gives it clear advantages and a promising research outlook in monaural, nonstationary, and low-SNR conditions [17, 18].

The deep recurrent neural network (DRNN) is a deep learning model frequently used in speech separation. It identifies the hidden states of recurrent units such as LSTM [19] and the gated recurrent unit (GRU) [20] in SS, in a manner analogous to Markov models. Some past information is preserved through the previous hidden state; however, the magnitude spectra of mixed speech form long sequences, so information is lost during sequence analysis, degrading both the separation of mixed speech and the accuracy of speech prediction.

CNNs have been widely used in DL since LeCun et al. [21] first presented them in 1998. The CNN has clear advantages in 2-D signal processing, and applications such as image recognition have demonstrated its impressive modeling ability. CNNs are now being applied to SS and have outperformed DNN-based speech separation systems in terms of separation performance under identical conditions.

[22] introduces a method for SCSS using deep, fully convolutional denoising autoencoders (CDAEs). Trained to extract specific sources from mixtures, CDAEs compare favorably with deep feedforward neural networks in SS. They learn distinct spectral-temporal patterns that allow them to isolate the sources in mixed signals. The work additionally explores the use of spectral masks to scale the mixed signal according to each source's contribution, ensuring an accurate estimation of the sources in the mixture.

To address the limitations of time-frequency masking, Luo et al. [23] developed Conv-TasNet, a fully convolutional network for SS in the time domain. To mitigate the disparity between accuracy measures such as hit rate, error rate, and classification accuracy, Wang et al. [24] modified the loss function of the CNN.

[25] suggests a system that addresses challenges such as over-smoothing and incomplete separation in SCSS by integrating time-frequency non-negative matrix factorization (TFNMF) and deep neural networks with sigmoid-based normalization (SNDNN). TFNMF is used for feature extraction, and the resulting features are classified through a softmax layer.

The paper [26] introduces VAT-SNet, a time-domain music separation model that directly utilizes music waveform data as input. VAT-SNet enhances the network structure of Conv-TasNet by preserving deep acoustic features through sample-level convolution in both the encoder and decoder. Additionally, it incorporates vocal and accompaniment embeddings from an auxiliary network to enhance the purity of the separation, aligning with the principles of independent component analysis (ICA) and providing a mathematical model for the separation process.

UFLSTM [27] is a deep learning model for speech enhancement. It applies an adaptive power-law transformation to redistribute energy while keeping the total energy of the speech signal constant, improving intelligibility and quality; it also incorporates residual connections to prevent gradient decay and adjusts the forget gate with an attention mechanism.

Fig. 1 U-NET Architecture

Although both conventional and DNN-based separation models have shown impressive results, they each have shortcomings. In a CNN, each element can absorb local features without having to learn global characteristics, exploiting the spatial connectivity of the input data; during feature extraction, localized features are identified first and then combined into more comprehensive features at higher levels. Weight sharing also improves model speed by reducing the number of parameters that must be computed for each neuron.

Combining a number of convolution filters produces feature maps that can recognize the same type of feature at different locations, partly ensuring invariance to displacement and distortion. This study therefore proposes a CNN-based approach to alleviate the loss of long-sequence information in mixed speech. Our model can boost the separation effect by concentrating on the time steps that contribute most and by partially overcoming the short memory of temporal models.

3 U-NET Architecture

In order to extract the features of the desired source from the mixed coefficients, we employed the U-NET architecture. Figure 1 presents a pictorial representation of the network structure, comprising two main components: a contracting path on the left side and an expansive path on the right. The contracting path adheres to the typical architecture of a convolutional network. It involves the repeated application of two \(3\times 3\) convolutions, each followed by a leaky rectified linear unit (LeakyReLU), and then a \(2\times 2\) max pooling operation with a stride of 2 for downsampling. At each downsampling stage, the number of feature channels is doubled.

Each step in the expansive path involves enlarging the feature map through upsampling, followed by a \(2\times 2\) convolution that halves the number of feature channels. The enlarged feature map is then concatenated with the correspondingly cropped feature map from the contracting path; cropping is necessary because border pixels are lost in every convolution. At the final layer, a \(1\times 1\) convolution transforms each 16-component feature vector into the desired number of classes. Altogether, the network comprises 24 convolutional layers. To ensure the output segmentation map can be seamlessly tiled, the input tile size must be chosen so that every \(2\times 2\) max-pooling operation is applied to a layer with even x- and y-dimensions.
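The sketch below is a minimal PyTorch illustration of this encoder-decoder pattern with skip connections; the channel widths, depth, and the name `SmallUNet` are assumptions for illustration rather than the exact 24-layer configuration used here, and 'same' padding is used so no cropping is needed.

```python
# Minimal U-NET sketch (PyTorch). Depth, channel widths, and names are
# illustrative assumptions, not the exact 24-layer network of this article.
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by a LeakyReLU activation.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.LeakyReLU(0.2),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.LeakyReLU(0.2),
    )

class SmallUNet(nn.Module):
    def __init__(self, in_ch=2, out_ch=2, base=16):
        super().__init__()
        self.enc1 = double_conv(in_ch, base)               # contracting path
        self.enc2 = double_conv(base, base * 2)
        self.pool = nn.MaxPool2d(2, stride=2)              # 2x2 max pooling, stride 2
        self.bottleneck = double_conv(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)  # upsampling
        self.dec2 = double_conv(base * 4, base * 2)        # after skip concatenation
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = double_conv(base * 2, base)
        self.out = nn.Conv2d(base, out_ch, kernel_size=1)  # final 1x1 convolution

    def forward(self, x):
        # x: (batch, channels, freq_bins, frames); both spatial dims should
        # be divisible by 4 so that pooling and upsampling shapes match.
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.out(d1)
```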

The Huber loss is a robust alternative to the mean squared error (MSE) loss, which is sensitive to outliers. It applies a quadratic (squared) penalty to small errors and a linear (absolute) penalty to larger ones, which mitigates the influence of outliers and improves model performance. A parameter called delta (\(\delta \)) sets the threshold at which the loss transitions from quadratic to linear: for errors smaller than \(\delta \) the loss behaves like MSE, while for errors exceeding \(\delta \) it behaves like MAE. Mathematically, this loss function is given in Eq. (1), where y denotes the actual or desired value, \(y'\) the predicted value, and \(\delta \) the threshold parameter.

$$L(y, y') = \begin{cases} \frac{1}{2}(y - y')^{2}, & \text{if } |y - y'| \le \delta \\ \delta\,|y - y'| - \frac{1}{2}\delta^{2}, & \text{otherwise} \end{cases}$$
(1)

The network's parameters, totaling 1,941,093, were randomly initialized. The network was trained using backpropagation and the Adam optimizer with a learning rate of 0.001, with default settings for all other parameters.
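As a brief, hedged illustration of this training configuration, the snippet below pairs the Huber loss of Eq. (1) with the Adam optimizer at a learning rate of 0.001, reusing the `SmallUNet` sketch above; the \(\delta \) value is an assumption, since the article does not report it.

```python
import torch
import torch.nn as nn

model = SmallUNet()                          # sketch model from the previous listing
criterion = nn.HuberLoss(delta=1.0)          # Eq. (1); the delta value is an assumption
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # other Adam settings left at defaults
```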

Fig. 2 Block diagram of the proposed SS approach

4 Proposed Method

This section outlines the proposed SCSS technique and the components it utilizes. In the context of audio or time-series data, the STFT represents the signal as a complex matrix, where each element corresponds to a specific frequency and time bin index. The real and imaginary parts of each element jointly encode the magnitude (intensity) and phase of the corresponding frequency component. Unlike most SS systems that focus solely on the magnitude of the STFT and neglect the phase, this article combines the STFT with U-NET, a deep CNN, taking both the real and imaginary components into consideration. Using both components during U-NET training enables the model to effectively capture the complex-valued frequency information in the input data.
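A minimal sketch of constructing such an input is given below, stacking the real and imaginary parts of the mixture's STFT as two channels; the file path, frame length, and hop size are illustrative assumptions.

```python
# Build the network input from the complex STFT of a mixture.
# File path, n_fft, and hop_length are assumptions, not values from the article.
import numpy as np
import librosa

mix, sr = librosa.load("mixture.wav", sr=8000)     # hypothetical mixture m(t)
M = librosa.stft(mix, n_fft=512, hop_length=128)   # complex spectrogram M(tau, f)
M_RI = np.stack([M.real, M.imag], axis=0)          # shape: (2, freq_bins, frames)
# M_RI is fed to the U-NET; the training target P_RI is built the same
# way from the clean source p(t).
```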

It is important to note that no approach is universally superior, and trade-offs exist. The associated trade-offs were that the utilization of U-NET for SS introduced computational complexities, and its performance was contingent upon the quantity and quality of available data. Notably, there were associated risks of overfitting, especially when confronted with limited data, potentially limiting the model’s interpretability. Furthermore, the implementation of U-NET demanded substantial computational resources and prolonged training times. Achieving robust generalization across diverse acoustic environments posed a significant challenge. Therefore, a pivotal aspect in this methodology involved striking a balance between U-NET’s model complexity and the specific requirements of the application.

Against these trade-offs, several factors explain the better performance of the proposed method. Unlike other approaches, incorporating both the real and imaginary components in the model yields a comprehensive representation of the audio signal, capturing both amplitude and phase details. This refined representation enhances accuracy, especially in scenarios involving overlapping speech. Preserving phase information is also crucial for maintaining temporal attributes, leading to more natural and intelligible speech output. The end-to-end learning approach streamlines training, allowing the model to autonomously learn relevant features and promoting better generalization across speakers and acoustic environments. Furthermore, supervised learning with labeled data enhances adaptability to diverse acoustic environments, increasing robustness in real-world scenarios. U-NET's efficiency and hardware acceleration allow real-time audio processing, which is crucial for low-latency applications such as live streaming and interactive platforms. The proposed SS method has two stages, the training stage and the testing stage, which are depicted in Fig. 2.

Algorithm 1 Training and testing stages of the proposed method

4.1 Training Stage

During the training phase, we consider a mixed signal m(t) consisting of two different sources p(t) and q(t). Here, m(t) is used as the input signal, and p(t) is the corresponding label. The STFT is applied to both the mixed and label signals to compute the complex spectrograms M\(_{(\tau , f)}\) and P\(_{(\tau , f)}\), given in Eqs. (2) and (3), with \(\tau \) and f denoting the time and frequency bin indices, respectively.

$$\mathbf{M}_{(\tau, f)} = \mathbf{M}_{R(\tau, f)} + \mathbf{M}_{I(\tau, f)}\,i$$
(2)
$$\mathbf{P}_{(\tau, f)} = \mathbf{P}_{R(\tau, f)} + \mathbf{P}_{I(\tau, f)}\,i$$
(3)

The concatenated forms of the real and imaginary components for both M\(_{RI}^\mathrm{{Train}}\) and P\(_{RI}^\mathrm{{Train}}\) matrices are then forwarded into the U-NET model. The network model next decomposes the M\(_{RI}^\mathrm{{Train}}\) matrix into its bias and weight matrices as per Eq. (4), where the terms W\(_{{\textbf {M}}_{RI}}\) and b\(_{{\textbf {M}}_{RI}}\) represent the weight and bias matrices corresponding to the mixed source, and g represents the nonlinear activation function.

$$\mathbf{M}_{RI}^{\mathrm{Train}} \approx g\!\left(\mathbf{W}_{\mathbf{M}_{RI}} + \mathbf{b}_{\mathbf{M}_{RI}}\right)$$
(4)

Initially, the bias and weight matrices are set to zero and random values, respectively. The weight matrix W\(_{{\textbf {M}}_{RI}}\) and the bias matrix b\(_{{\textbf {M}}_{RI}}\) are updated continuously by minimizing the cost between M\(_{RI}^\mathrm{{Train}}\) and P\(_{RI}^\mathrm{{Train}}\) in Eq. (5), using the update rules in Eqs. (6) and (7), where \(\alpha \) is the learning rate. The model is saved during training, and once training is complete, the best weights and biases are fixed.

$$\mathbf{M}_{RI}(\mathrm{Error}) = \mathbf{M}_{RI}(\mathrm{Label\ Output}) - \mathbf{M}_{RI}(\mathrm{Predicted\ Output})$$
(5)
$$\mathbf{W}_{\mathbf{M}_{RI}}(\mathrm{New}) = \mathbf{W}_{\mathbf{M}_{RI}}(\mathrm{Old}) - \alpha\,\frac{\partial \mathbf{M}_{RI}(\mathrm{Error})}{\partial \mathbf{W}_{\mathbf{M}_{RI}}(\mathrm{Old})}$$
(6)
$$\mathbf{b}_{\mathbf{M}_{RI}}(\mathrm{New}) = \mathbf{b}_{\mathbf{M}_{RI}}(\mathrm{Old}) - \alpha\,\frac{\partial \mathbf{M}_{RI}(\mathrm{Error})}{\partial \mathbf{b}_{\mathbf{M}_{RI}}(\mathrm{Old})}$$
(7)
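For concreteness, a hedged sketch of one training step is shown below; in practice the updates of Eqs. (6) and (7) are carried out by backpropagation and the Adam optimizer rather than hand-written derivatives, and the tensor shapes and batching scheme are assumptions.

```python
# One training step for the sketch model: predict P_RI from M_RI and let
# Adam apply the gradient updates corresponding to Eqs. (6) and (7).
import torch

def train_step(model, criterion, optimizer, m_ri, p_ri):
    # m_ri, p_ri: float tensors of shape (batch, 2, freq_bins, frames)
    optimizer.zero_grad()
    p_pred = model(m_ri)              # estimated P_RI
    loss = criterion(p_pred, p_ri)    # Huber cost on the error of Eq. (5)
    loss.backward()                   # backpropagation of the error
    optimizer.step()                  # weight and bias updates, Eqs. (6) and (7)
    return loss.item()
```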

4.2 Testing Stage

During the testing phase, the mixed signal m(t), a combination of the signals p(t) and q(t), undergoes the STFT to generate the complex spectrogram in Eq. (8).

$$\mathbf{M}_{(\tau, f)} = \mathbf{M}_{R(\tau, f)} + \mathbf{M}_{I(\tau, f)}\,i$$
(8)

From the complex spectrogram of the mixed signal, the real and imaginary components are separated and concatenated to construct M\(_{RI}^\mathrm{{Test}}\), which is passed through the saved U-NET model. The model then generates the enhanced concatenated matrix P\(_{RI}^{E}\) for the first source. To compute the enhanced concatenated matrix Q\(_{RI}^{E}\) for the second source, we subtract P\(_{RI}^{E}\) from M\(_{RI}^\mathrm{{Test}}\) as per Eq. (9).

$$\mathbf{Q}_{RI}^{E} = \mathbf{M}_{RI}^{\mathrm{Test}} - \mathbf{P}_{RI}^{E}$$
(9)

From the first enhanced concatenated matrix P\(_{RI}^{E}\), the real and imaginary components are separated once again to reconstruct a complex matrix P\(^\mathrm{{recmplx}}\), as given in Eq. (10).

$$\mathbf{P}^{\mathrm{recmplx}} = \mathbf{P}_{R}^{E} + \mathbf{P}_{I}^{E}\,i$$
(10)

Similarly, the real and imaginary components are separated from the second enhanced concatenated matrix Q\(_{RI}^{E}\) to reconstruct another complex matrix Q\(^\mathrm{{recmplx}}\) for the second (female) source, as per Eq. (11).

$$\mathbf{Q}^{\mathrm{recmplx}} = \mathbf{Q}_{R}^{E} + \mathbf{Q}_{I}^{E}\,i$$
(11)

From the reconstructed complex matrix P\(^\mathrm{{recmplx}}\), the magnitude and phase components P\(_\mathrm{{Emag}}\) and P\(_\mathrm{{Ephase}}\) are obtained for the first source, as per Eq. (12).

$$\begin{aligned} \mathbf{P}_{\mathrm{Emag}} &= \mathrm{magnitude}(\mathbf{P}^{\mathrm{recmplx}}) \\ \mathbf{P}_{\mathrm{Ephase}} &= \mathrm{phase}(\mathbf{P}^{\mathrm{recmplx}}) \end{aligned}$$
(12)

The magnitude and phase components Q\(_\mathrm{{Emag}}\) and Q\(_\mathrm{{Ephase}}\) for the other source are extracted from the reconstructed complex matrix Q\(^\mathrm{{recmplx}}\), as per Eq. (13).

$$\begin{aligned} \mathbf{Q}_{\mathrm{Emag}} &= \mathrm{magnitude}(\mathbf{Q}^{\mathrm{recmplx}}) \\ \mathbf{Q}_{\mathrm{Ephase}} &= \mathrm{phase}(\mathbf{Q}^{\mathrm{recmplx}}) \end{aligned}$$
(13)

The enhanced magnitude and phase obtained in Eq. (12) are fed into the inverse STFT, which transforms them into a time-domain signal, yielding the first estimated source as per Eq. (14). Similarly, the inverse STFT in Eq. (15) takes the enhanced magnitude and phase from Eq. (13) and generates the second source.

$$\mathbf{p}'(t) = \mathrm{ISTFT}\left(\mathbf{P}_{\mathrm{Emag}} \times \mathbf{P}_{\mathrm{Ephase}}\right)$$
(14)
$$\mathbf{q}'(t) = \mathrm{ISTFT}\left(\mathbf{Q}_{\mathrm{Emag}} \times \mathbf{Q}_{\mathrm{Ephase}}\right)$$
(15)
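The testing stage can be summarized by the following hedged sketch, which estimates the first source, obtains the second by the subtraction of Eq. (9), and reconstructs both waveforms with the inverse STFT as in Eqs. (14) and (15); here the product of magnitude and phase is interpreted as magnitude times \(e^{j\,\mathrm{phase}}\), and the model, file path, and STFT settings are assumptions carried over from the earlier sketches.

```python
# Testing-stage sketch: separate two sources from a mixture using the
# trained sketch model. STFT settings and paths are assumptions; the
# time-frequency dimensions may need padding to a multiple of 4 for the
# pooling stages of the sketch U-NET.
import numpy as np
import librosa
import torch

mix, sr = librosa.load("test_mixture.wav", sr=8000)         # hypothetical test mixture
M = librosa.stft(mix, n_fft=512, hop_length=128)             # Eq. (8)
m_ri = torch.tensor(np.stack([M.real, M.imag], axis=0),
                    dtype=torch.float32).unsqueeze(0)        # (1, 2, F, T)

with torch.no_grad():
    p_ri = model(m_ri).squeeze(0).numpy()                    # enhanced P_RI
q_ri = np.stack([M.real, M.imag], axis=0) - p_ri             # Eq. (9)

def to_time(ri):
    rec = ri[0] + 1j * ri[1]                                 # recomplex, Eqs. (10)/(11)
    mag, phase = np.abs(rec), np.angle(rec)                  # Eqs. (12)/(13)
    return librosa.istft(mag * np.exp(1j * phase), hop_length=128)  # Eqs. (14)/(15)

p_hat = to_time(p_ri)   # first (male) estimated source p'(t)
q_hat = to_time(q_ri)   # second (female) estimated source q'(t)
```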

5 Results and Discussion

This section presents the experimental findings and discussion. First, a brief overview of the experimental design and evaluation methods is given, followed by a description of the metrics used to measure the results. Third, we examine how the joint features compare with the single-domain techniques with respect to the SDR, SIR, fwsegSNR, STOI, and HASQI scores. Fourth, we compare the overall effectiveness of our approach with the CDAE, Conv-TasNet, CASSM, NMF-DNN, VAT-SNet, and ULSTM techniques in terms of PESQ, STOI, fwsegSNR, SDR, SIR, and SAR. Finally, the time-domain waveforms and spectrograms of the clean, mixed, and separated male and female sounds are provided.

5.1 Experimental Setup

To assess the efficiency of the suggested approach, we compare the proposed model with CDAE [22], Conv-TasNet [23], CASSM [24], NMF-DNN [25], VAT-SNet [26], and ULSTM [27]. The speech signals are taken from the GRID audiovisual corpus [28] and used for both training and testing. The corpus contains 1000 utterances spoken by thirty-four speakers (eighteen male and sixteen female). We concatenate all sentences for each speaker. For opposite-gender speech separation, the utterances of six male and six female speakers are used to form the experimental group. Each training signal lasts about 25 min, and each test signal lasts around 60 s. The signals are sampled at 8000 Hz. As in a speech-noise scenario, we treat the female voice as noise and the male voice as the speech signal, and mix the female source with the male source at -10, -5, 0, 5, and 10 dB.
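As an illustration of this mixing procedure, the sketch below combines a male and a female utterance at a target SNR; the helper name and file paths are assumptions.

```python
# Mix two sources at a target SNR in dB (female treated as interference).
import numpy as np
import librosa

def mix_at_snr(speech, interference, snr_db):
    # Scale the interference so that 10*log10(P_speech / P_interference) = snr_db.
    ps = np.mean(speech ** 2)
    pi = np.mean(interference ** 2)
    gain = np.sqrt(ps / (pi * 10 ** (snr_db / 10)))
    n = min(len(speech), len(interference))
    return speech[:n] + gain * interference[:n]

male, _ = librosa.load("male.wav", sr=8000)       # hypothetical paths
female, _ = librosa.load("female.wav", sr=8000)
mixture = mix_at_snr(male, female, snr_db=0)      # also mixed at -10, -5, 5, 10 dB
```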

Fig. 3 Comparison of a SDR, SIR, fwsegSNR, b HASQI and STOI for single versus joint features

Fig. 4 Comparison of fwsegSNR for a male, b female source, respectively

Fig. 5 Comparison of SDR for a male, b female source, respectively

5.2 Evaluation Metrics

The performance of the separated utterances is evaluated through the SDR [29], SIR [29], SAR [29], fwsegSNR [30], STOI [31], PESQ [32], HASPI [33], and HASQI [34] scores. The SDR, a measure of overall speech quality, is the ratio of the power of the input signal to the power of the difference between the input and reconstructed signals; higher SDR scores indicate better reconstruction. SIR measures errors caused by the failure of the separation process to eliminate the interfering signal, so a higher SIR value indicates better separation quality. PESQ is evaluated by comparing the separated speech with the corresponding clean speech and produces scores between -0.5 and 4.5, with higher values indicating better quality. STOI correlates the short-time temporal envelopes of the clean and separated speech, yielding a score between 0 and 1, with higher values indicating greater intelligibility. fwsegSNR evaluates the quality of the obtained signal, with higher values indicating better performance. HASPI and HASQI are indices designed to predict speech perception (intelligibility) and speech quality, respectively, for hearing-impaired and normal-hearing listeners; their scores range from 0 to 1, with higher values indicating better intelligibility and quality.
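A subset of these metrics can be computed with common open-source tools, as in the hedged sketch below (mir_eval for SDR/SIR/SAR, pystoi for STOI, and the pesq package for PESQ); fwsegSNR, HASPI, and HASQI require separate implementations and are omitted, and the clean/estimated arrays are placeholders.

```python
# Compute SDR/SIR/SAR, STOI, and PESQ for the separated male source.
# male_clean, female_clean, p_hat, q_hat are placeholder arrays assumed
# to be time-aligned and of equal length.
import numpy as np
from mir_eval.separation import bss_eval_sources
from pystoi import stoi
from pesq import pesq

fs = 8000
refs = np.stack([male_clean, female_clean])     # reference sources
ests = np.stack([p_hat, q_hat])                 # separated estimates
sdr, sir, sar, _ = bss_eval_sources(refs, ests)

stoi_male = stoi(male_clean, p_hat, fs)
pesq_male = pesq(fs, male_clean, p_hat, "nb")   # narrowband mode for 8 kHz signals
```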

5.3 The Impact of Single Versus Joint Features

The source signals are short, largely stationary, and sparse. Transforming the signal into the time-frequency domain with the STFT yields its complex spectra, which are used for the speech separation techniques. Some existing methods consider only the magnitude of the complex spectra, ignoring the real and imaginary components. For comparison, we evaluate the real and imaginary parts individually, the magnitude alone, and the real and imaginary parts jointly. The SDR, SIR, fwsegSNR, HASQI, and STOI measurements are compared in Fig. 3. As the figures show, the method that uses the real and imaginary parts together outperforms the others. Consequently, the proposed technique examines the real and imaginary parts simultaneously, which improves the quality and intelligibility of SCSS.

Fig. 6 Comparison of SIR for a male, b female source, respectively

Fig. 7 Comparison of SAR for a male, b female source, respectively

5.4 Overall Performance of the Proposed Algorithm

In Fig. 4, the fwsegSNR performance of the proposed model is compared with that of current models. The graphs indicate that the proposed model outperforms the other techniques in all conditions. Compared with the existing approaches, our strategy boosts fwsegSNR by 9.65% at -10 dB SNR, 11.56% at -5 dB, 13.69% at 0 dB, 15.31% at 5 dB, and 17.09% at 10 dB when separating the male source. Similarly, our approach gains 18.56%, 15.26%, 12.85%, 10.16%, and 7.51% at -10, -5, 0, 5, and 10 dB SNR, respectively, for female source separation.

Figure 5 demonstrates that the proposed model's SDR achieves much better results than the alternatives, notably CDAE, Conv-TasNet, CASSM, NMF-DNN, VAT-SNet, and ULSTM, for both the male and female sources. The proposed model's SDR values are greater than those of the previous models in all separation conditions.

The suggested model increases SDR by 7.26 dB at -10 dB SNR, 8.53 dB at -5 dB, 10.19 dB at 0 dB, 11.78 dB at 5 dB, and 13.10 dB at 10 dB when separating the male source. Correspondingly, gains of 13.02 dB, 11.43 dB, 9.84 dB, 7.63 dB, and 4.84 dB are obtained at -10, -5, 0, 5, and 10 dB SNR, respectively, for the female source. Similarly, Fig. 6 shows that the SIR values of the predicted signals are higher than those of the current models.

Figure 7 shows that our proposed approach performs better in terms of the source-to-artifacts ratio (SAR) for both the male and female sources than the other methods considered in this article.

Tables 1, 2, 3, and 4 compare the suggested technique's performance in terms of PESQ and STOI with that of other current approaches. Our technique yields PESQ scores of 2.25 at -10 dB, 2.40 at -5 dB, 2.63 at 0 dB, 2.81 at 5 dB, and 2.98 at 10 dB when separating the male source, improving over the methods used for comparison. Likewise, the separated female source achieves 3.23, 2.98, 2.70, 2.35, and 1.97 at -10, -5, 0, 5, and 10 dB, respectively. The tables further show that the estimated signals have higher STOI performance than the existing models.

Table 1 Comparison of PESQ scores for the male source with six different approaches
Table 2 Comparison of PESQ scores for the female source with six different approaches
Table 3 Comparison of STOI scores for the male source with six different approaches
Table 4 Comparison of STOI scores for the female source with six different approaches

Tables 5, 6, 7, and 8 show the HASPI and HASQI results of several approaches, including CDAE, Conv-TasNet, CASSM, NMF-DNN, VAT-SNet, and ULSTM, for male and female speech separation. Tables 5 and 6 show that U-NET produces higher HASPI values in all separation scenarios. Tables 7 and 8 likewise show that the HASQI results of our approach outperform the other techniques in all separation conditions.

Table 5 Comparison of HASPI values for the male source with six different approaches
Table 6 Comparison of HASPI values for the female source with six different approaches
Table 7 Comparison of HASQI values for the male source with six different approaches
Table 8 Comparison of HASQI values for the female source with six different approaches
Fig. 8 a Waveform, b spectrogram of clean, c waveform, d spectrogram of mixed, and e waveform, f spectrogram of separated male source, respectively

Fig. 9 a Waveform, b spectrogram of clean, c waveform, d spectrogram of mixed, and e waveform, f spectrogram of separated female source, respectively

5.5 Time-Domain and Spectrogram Representation

Time-domain and spectrogram representations offer distinct ways to visualize and analyze signals, especially in signal processing. The time-domain representation depicts the signal's temporal evolution, offering insight into its amplitude and serving as a valuable tool for understanding temporal patterns and identifying specific events. A spectrogram, on the other hand, is a graphical representation of a signal's frequency spectrum over time, adding information about how the frequency content evolves and facilitating the examination of changing spectral characteristics.
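A hedged sketch of producing such waveform and spectrogram views (as in Figs. 8 and 9) for a single signal is given below; the file path and display settings are assumptions.

```python
# Plot the waveform and spectrogram of one signal.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

sig, sr = librosa.load("separated_male.wav", sr=8000)   # hypothetical path
fig, (ax_t, ax_f) = plt.subplots(2, 1, figsize=(8, 6))

ax_t.plot(np.arange(len(sig)) / sr, sig)                 # time-domain waveform
ax_t.set(xlabel="Time (s)", ylabel="Amplitude", title="Waveform")

S_db = librosa.amplitude_to_db(np.abs(librosa.stft(sig)), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax_f)
ax_f.set(title="Spectrogram")
plt.tight_layout()
plt.show()
```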

Figure 8 depicts the time-domain and spectrogram representations of the clean, mixed, and separated signals for the male source. Here, we chose the male utterance that performed best, together with the corresponding mixed and estimated male signals. The figures show that our approach separates the male source from the mixture quite well. Similarly, Fig. 9 shows that the female source is also separated cleanly from the mixed signal.

6 Conclusion

From the perspective of neural architecture, we developed a U-NET, a convolutional neural network architecture built on a few improvements over the original CNN design. The model architecture was created with two principles in mind. The first is the encoder path, which uses max pooling layers with stride 2 to reduce the data size; the convolutional layers are repeated with a growing number of filters in the encoder blocks. The second is the decoder path and its associated connections: moving through the decoder, the number of filters in the convolutional layers decreases while the feature maps are progressively upsampled in the subsequent layers, and skip connections link earlier encoder outputs to the corresponding decoder layers. Using this network architecture to separate the intended sources, we obtain better performance in every SNR scenario. Compared with the outcomes of the other approaches mentioned in this article, the quality and intelligibility of the separated speech signals are enhanced. The experimental results show that the proposed speech separation model outperforms the current models in overall performance across the various evaluation methodologies used to assess the separated speech signals. In the future, we intend to investigate other training and testing procedures using different deep neural networks.