1 Introduction

Speech separation aims to recover the original source signals from a mixture in which several signals have been combined. It has attracted considerable attention because of its potential use in many real-world applications, such as hearing aids, automatic speech recognition, communication, medical and multimedia systems, assisted living, humanoid robot control, and the cocktail-party problem [9, 29, 32, 47]. In these applications, well-separated signals are essential for the system to work properly. According to the number of channels, the speech separation problem is categorized into multichannel, binaural, and single-channel types. Single-channel speech separation (SCSS) [4, 17, 26] remains a significant research challenge because only one recording is available and the spatial information that can be extracted is limited [34].

With the growing interest in speech separation, many SCSS models have been proposed that consider various parameters such as phase, magnitude, frequency, energy, and the spectrogram of the speech signal. Roweis proposed factorial hidden Markov models (HMMs), which are very successful at modeling a single speaker [31]. Jang and Lee [15] used a maximum-likelihood approach to separate mixed source signals observed in a single channel (SC). Researchers increasingly use nonnegative matrix factorization (NMF) to separate SC source signals. NMF is a family of multivariate-analysis methods in which a matrix is decomposed into two nonnegative matrices corresponding to components and weights. NMF was first presented by Paatero and Tapper [22] and was introduced to source separation by Lee and Seung [28]. Sparse nonnegative matrix factorization (SNMF) has been applied to factorize speech signals [41, 45]: SNMF learns a sparse representation of the data, which addresses the problem of separating multiple speech sources from a single microphone recording. Sparsity is imposed only on the coefficient matrix during signal detection [44].

Recently, wavelet-based separation methods [11, 12, 27, 42] have emerged. In [42], a speech separation method based on the discrete wavelet transform (DWT) and SNMF is implemented. The authors used wavelet decomposition to speed up separation by rejecting the high-frequency components of the source signals. The DWT splits a signal into low-frequency parts, known as approximation coefficients, and high-frequency parts, known as detail coefficients. Although this reduces separation time, the intelligibility of the individual speakers is severely affected by the complete rejection of the high-frequency components. In [11], a speech enhancement (SE) method based on the stationary wavelet transform (SWT) and NMF is presented. The SWT discards the downsampling used in both the DWT and the discrete wavelet packet transform (DWPT) at every level in order to acquire the shift-invariance property. This leads to redundancy and cannot exploit the sparseness among different speech signals. In [12], the DTCWT is utilized for speech enhancement with NMF, but sparsity is ignored. As a result, the estimated speech becomes less intelligible because errors, i.e., noise or artifacts, are introduced during the NMF decomposition of the signal.

The dictionary learning (DL) algorithm [2, 7, 23, 25, 35, 36, 48, 49] is another useful technique for model-based SCSS. These methods assume that the sparse representations of speech signals from different speakers have some speaker-specific components. Generally speaking, a joint dictionary is a redundant dictionary: the correlation between one source's sub-dictionary and the other sources cannot be avoided, even when sparse constraints are applied while training the dictionaries. Several approaches for learning a discriminative dictionary have been developed, such as the Metaface learning method [49]. A series of approaches are presented in [2, 7, 23], where a joint discriminative dictionary is built by varying the objective function or adding penalty terms. However, the optimization problem becomes difficult as the objective function grows more complex, and the time complexity increases accordingly. The adaptive discriminative dictionary learning (ADDL) procedure [2] assumes that the speech signals of different speakers have distinct constituents. A dictionary column, known as an atom, is considered relevant to a signal if the absolute value of the inner product between the atom and the signal is large. Consequently, when a discriminative dictionary is used to sparsely code several structured speech signals, the coding coefficients of the different underlying sources are distributed separately over the dictionary elements. Sequential discriminative dictionary learning (SDDL) is presented in [48], where both the distinctive and the similar parts of different speakers' signals are considered. The authors of [25] present a sparsity model comprising a pair of joint sparse representations (JSR): one JSR models the mapping between the mixture and the speech, while the other models the mapping between the speech and the noise. The authors of [35] construct a joint dictionary with a common sub-dictionary (CJD), where the common sub-dictionary is built from similar atoms shared between identity sub-dictionaries; the identity sub-dictionaries are trained on the source speech signals of each speaker. In [36], the authors propose a new optimization function to formulate a joint dictionary (OJDL) with multiple identity sub-dictionaries and a common sub-dictionary. The authors of [5] propose a two-level correlative joint sparse representation technique to improve single-channel speech separation: to suppress confusion between speech and noise sources, a two-level joint sparse representation is built using the relationships among the speech and mixture signals and the discriminative property of the joint dictionary. The authors of [16, 39] propose a speech enhancement technique with alternating optimization of the sparse coefficients and the dictionary: the Fisher criterion constrains the objective function of dictionary learning, and then the discriminative dictionary and the corresponding sparse coefficients are obtained. In this way, the mutual interference among the joint dictionaries can be reduced.

Deep learning has received particular attention in the SS community, where a non-linear mapping between the mixture and the speech is learned. Deep learning-based techniques can be divided into two groups based on the association between the noisy input and the desired output: deep neural network (DNN) based masking algorithms [20, 50] and DNN-based regression algorithms [38]. These techniques have been implemented successfully and show outstanding performance in recovering the desired signal from the mixture. However, they suffer from limited features, constraints on the assumed sources, and relatively high computational complexity. In [8, 13], the DTCWT and STFT are therefore combined to take advantage of both transforms and to better resolve the noisy mixture, and SNMF is applied after the DTCWT and STFT to obtain the estimated clean speech. These algorithms work well; however, only the magnitude spectrum is enhanced, and phase enhancement is overlooked.

Most of the techniques discussed above consider only the magnitude, while the phase is not enhanced. Although the magnitude makes a significant contribution to the estimated speech, the improvement of the phase cannot be overlooked. Complex-domain estimation has been used in [37, 46], where it yielded a substantial improvement in speech separation.

The contributions of this paper are briefly listed below:

  1.

    For an accurate and in-depth exploration of the signal, we apply two transforms, the DTCWT and the STFT. The DTCWT decomposes the signal into a set of low- and high-frequency subband signals, which makes each subband more stationary. After the DTCWT, the STFT is applied to each subband signal, building a complex spectrogram per subband. This gives a better representation of the signal for further analysis and processing.

  2.

    Unlike many other algorithms that investigate either the magnitude or the phase component, we handle the magnitude of the signal together with its real and imaginary parts. The technique therefore exploits all of the information available in the signal waveforms. To obtain the best separation, we process the magnitude part as well as the real and imaginary parts. To the best of our knowledge, we are the first to jointly investigate the magnitude, real, and imaginary parts of the signal.

  3.

    In this method, we apply the DTCWT and STFT consecutively and then apply the GJDL algorithm jointly to the magnitude part and the absolute values of the real and imaginary parts of the signal, while preserving the signs. The LARC algorithm is used for sparse coding to find the necessary coefficients. Our approach estimates the initial signals in two ways: one estimate considers only the magnitude part, and the other uses the real and imaginary parts. Finally, the Gini index (GI) is used to exploit the complementary effects of the two initial estimates, so we benefit from combining the GI with the initially estimated signals.

The rest of the paper is organized as follows. Section 2 gives a mathematical description of the problem. Section 3 briefly describes the DTCWT, STFT, GDL, and GI. Section 4 presents the proposed method and summarizes the function of each module. Section 5 presents the experimental settings and results for speech separation using the GRID [30] and TIMIT [53] datasets. Finally, Section 6 concludes the paper. The nomenclature is provided in Table 1.

Table 1 Nomenclature

2 Problem formulation

The single-channel speech separation problem can be defined as follows. Let z(t) be the mixed signal that contains the speech signals x(t) and y(t) of two speakers. The goal of single-channel speech separation is to obtain estimates of x(t) and y(t) from the mixed signal z(t). The mixed signal is defined in Eq. (1).

$$ \mathbf{z}\left(\mathrm{t}\right)=\mathbf{x}\left(\mathrm{t}\right)+\mathbf{y}\left(\mathrm{t}\right) $$
(1)

Applying the DTCWT to Eq. (1) gives the subband relation presented in Eq. (2) as follows

$$ {\mathbf{z}}_{\mathrm{b},\mathrm{tl}}={\mathbf{x}}_{\mathrm{b},\mathrm{tl}}+{\mathbf{y}}_{\mathrm{b},\mathrm{tl}} $$
(2)

where zb, tl, xb, tl and yb, tl denote the subband signals of the mixture, the first source, and the second source, respectively; b is one more than the DTCWT decomposition level, and tl indicates the tree level (described in Subsection 3.1). The STFT is then applied to each subband signal, and the time-frequency representation can be expressed as in Eq. (3):

$$ {\mathbf{Z}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)={\mathbf{X}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)+{\mathbf{Y}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) $$
(3)

where Zb, tl(τ, f), Xb, tl(τ, f) and Yb, tl(τ, f) are the STFT coefficients of zb, tl, xb, tl and yb, tl, respectively. f and τ represent the frequency bin index and time frame index, respectively.

Eq. (3) can be decomposed into real and imaginary parts as in Eq. (4), where ZRb, tl(τ, f) and ZIb, tl(τ, f) represent the real and imaginary parts of Zb, tl(τ, f), and similarly for the other terms.

$$ {\mathbf{ZR}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)+\mathrm{i}{\mathbf{ZI}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)={\mathbf{XR}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)+{\mathrm{i}\mathbf{XI}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)+{\mathbf{YR}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)+\mathrm{i}{\mathbf{YI}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) $$
(4)

The magnitude ∣Zb, tl(τ, f)∣, real |ZRb, tl(τ, f)|, and imaginary |ZIb, tl(τ, f)| parts are learned jointly using the GJDL algorithm, and the estimated signals \( {\overset{\sim }{\mathbf{X1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \), \( {\overset{\sim }{\mathbf{Y1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) are obtained from the magnitude part and \( {\overset{\sim }{\mathbf{X2}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \), \( {\overset{\sim }{\mathbf{Y2}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) from the real and imaginary parts by using LARC.

Let \( {\overset{\sim }{\mathbf{X}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) and \( {\overset{\sim }{\mathbf{Y}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) be the estimated complex speech signals obtained from \( {\overset{\sim }{\mathbf{X1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \), \( {\overset{\sim }{\mathbf{Y1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) and \( {\overset{\sim }{\mathbf{X2}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \), \( {\overset{\sim }{\mathbf{Y2}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \), respectively, by applying the Gini index. Finally, the estimated first and second source speech signals are calculated via the following equations.

$$ \overset{\sim }{\mathbf{x}}\left(\mathrm{t}\right)=\boldsymbol{IDTCWT}\left(\boldsymbol{ISTFT}\ \left({\overset{\sim }{\mathbf{X}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)\right)\right) $$
(5)
$$ \overset{\sim }{\mathbf{y}}\left(\mathrm{t}\right)=\boldsymbol{IDTCWT}\left(\boldsymbol{ISTFT}\ \left({\overset{\sim }{\mathbf{Y}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)\right)\right) $$
(6)

where \( \overset{\sim }{\mathbf{x}}\left(\mathrm{t}\right) \) and \( \overset{\sim }{\mathbf{y}}\left(\mathrm{t}\right) \) are the estimated first and second source signals, and ISTFT and IDTCWT denote the inverse short-time Fourier transform and the inverse dual-tree complex wavelet transform, respectively.

3 Preliminaries

This section presents the concepts used in our proposed technique: the DTCWT, STFT, GDL, and GI.

3.1 DTCWT

Kingsbury introduced the DTCWT in [21] as a computationally efficient transform with several valuable properties, such as approximate shift invariance, perfect reconstruction, and limited redundancy. The DTCWT [14] splits the signal using two filter-bank trees; the first tree delivers the real part of the transform, while the second tree provides the imaginary part. Each tree has a low-pass filter that produces the approximation coefficients and a high-pass filter that produces the detail coefficients. The complex-valued scaling functions and wavelets calculated from the two trees are approximately analytic.

In the first-level DTCWT decomposition, each tree has one approximation coefficient and one detail coefficient. For the upper tree, the approximation coefficient is \( {\mathbf{x}}_{1,1}^1 \) and the detail coefficient is \( {\mathbf{x}}_{2,1}^1 \); the superscript denotes the DTCWT decomposition level (dl), and the first and second subscripts indicate the subband index and tree level (tl), respectively. All subband signals are then downsampled. For the second-level decomposition, the filters are applied to the approximation coefficients only, producing the next set of subband signals, and so on. Two levels of DTCWT decomposition are shown in Fig. 1a, and Fig. 1b shows the IDTCWT.

Fig. 1
figure 1

The two-level filter bank implementation of a the DTCWT and b the IDTCWT
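
As a minimal illustration of the decomposition just described, the sketch below performs a one-level DTCWT on a synthetic signal. It assumes the third-party Python package dtcwt and its Transform1d interface, which are not used in the paper; any DTCWT implementation with a forward and inverse transform would serve the same purpose.

```python
# Minimal sketch: one-level DTCWT of a synthetic signal using the third-party
# "dtcwt" package (an assumption; any DTCWT implementation would do).
import numpy as np
import dtcwt

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)           # stand-in for a speech signal x(t)

transform = dtcwt.Transform1d()
pyramid = transform.forward(x, nlevels=1)

# pyramid.lowpass holds the approximation coefficients, and
# pyramid.highpasses[0] holds the complex level-1 detail coefficients, whose
# real and imaginary parts come from the two filter-bank trees.
print(pyramid.lowpass.shape, pyramid.highpasses[0].shape)

# Near-perfect reconstruction check via the inverse transform (IDTCWT).
x_rec = np.ravel(transform.inverse(pyramid))
print(np.max(np.abs(x_rec[:x.size] - x)))
```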

3.2 STFT

The STFT is a dominant time-frequency analysis tool for audio signal processing [1]. It yields an especially useful class of time-frequency distributions that specify complex amplitude versus time and frequency for any signal. In practice, the data to be transformed are divided into shorter segments of equal length; each segment is Fourier transformed separately, and the complex results are collected into a matrix. It can be expressed as follows.

$$ \boldsymbol{STFT}\ \left\{\mathbf{x}\left(\mathrm{t}\right)\right\}=\mathbf{X}\left(\uptau, \mathrm{f}\right)={\int}_{-\infty}^{\infty}\mathbf{x}\left(\mathrm{t}\right)\boldsymbol{w}\left(\mathrm{t}-\uptau \right){\mathrm{e}}^{-\mathrm{i}2\uppi \mathrm{ft}}\mathrm{dt} $$
(7)

Here, w(t − τ) is the window function, concentrated around the time τ, and x(t) is the signal to be transformed. X(τ, f) is essentially the Fourier transform of x(t)w(t − τ), which captures the phase and magnitude of the signal over time and frequency. The index τ shifts the window along the time axis, and f denotes the frequency.
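
The following sketch shows how the discrete counterpart of Eq. (7) could be computed for one subband signal with SciPy; the 512-point window and 8 kHz rate follow the experimental setup in Section 5.1, and x_sub is a hypothetical stand-in for a subband signal.

```python
# Minimal sketch of Eq. (7) in discrete form for one subband, using SciPy.
import numpy as np
from scipy.signal import stft, istft

fs = 8000
x_sub = np.random.randn(4 * fs)              # hypothetical subband signal x_{b,tl}

f, tau, X = stft(x_sub, fs=fs, window='hann', nperseg=512, noverlap=256)
# X[k, m] is the complex coefficient X_{b,tl}(tau, f): rows index frequency
# bins f, columns index time frames tau.

magnitude = np.abs(X)                        # |X_{b,tl}(tau, f)|
real_part, imag_part = X.real, X.imag        # XR_{b,tl}, XI_{b,tl}
phase = np.angle(X)                          # used as ZP_{b,tl} for the mixture

# The inverse STFT recovers the subband signal (up to edge effects).
_, x_back = istft(X, fs=fs, window='hann', nperseg=512, noverlap=256)
```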

3.3 GDL

The GDL approach is addressed in [33]. During the training process, a speech matrix Z (of size F × T) is approximately factorized into two matrices: a dictionary matrix D (of size F × R) and a coefficient matrix C (of size R × T), where R denotes the number of dictionary atoms. The sparse representation errors of the speech signals x and y over the speech dictionaries Dx and Dy, respectively, are minimized using Eq. (8) and Eq. (9), as follows:

$$ \underset{{\mathbf{D}}_{\mathrm{x}},{\mathbf{C}}_{\mathrm{x}}}{\min }{\left\Vert \mathbf{X}-{\mathbf{D}}_{\mathrm{x}}{\mathbf{C}}_{\mathrm{x}}\right\Vert}_{\mathrm{F}}^2\kern2em \mathrm{s}.\mathrm{t}.{\left\Vert {\mathbf{c}}_{\mathrm{x},\mathrm{k}}\right\Vert}_1\le {\mathrm{q}}_{\mathrm{x}},\forall \mathrm{k}, $$
(8)
$$ \underset{{\mathbf{D}}_{\mathrm{y}},{\mathbf{C}}_{\mathrm{y}}}{\min }{\left\Vert \mathbf{Y}-{\mathbf{D}}_{\mathrm{y}}{\mathbf{C}}_{\mathrm{y}}\right\Vert}_{\mathrm{F}}^2\kern2em \mathrm{s}.\mathrm{t}.{\left\Vert {\mathbf{c}}_{\mathrm{y},\mathrm{k}}\right\Vert}_1\le {\mathrm{q}}_{\mathrm{y}},\forall \mathrm{k}, $$
(9)

where ‖∙‖F indicates the Frobenius norm and ‖∙‖1 indicates the l1 norm. The kth columns of the sparse coding matrices Cx and Cy are denoted by cx, k and cy, k, respectively. qx and qy are the sparsity constraints for the speech signals x and y, respectively. In [33], the LARC scheme is used for sparse coding to solve the cost functions in Eq. (8) and Eq. (9), and an approximate K-SVD scheme is used for the dictionary update, yielding Dx and Dy. The mixed signal is then sparsely represented over the composite dictionary as follows

$$ {\mathbf{Z}}^{\mathrm{test}}=\mathbf{D}\times \mathbf{E}=\left[{\mathbf{D}}_{\mathrm{x}},{\mathbf{D}}_{\mathrm{y}}\right]\times \left[\begin{array}{c}{\mathbf{E}}_{\mathrm{x}}\\ {}{\mathbf{E}}_{\mathrm{y}}\end{array}\right] $$
(10)

where Ex and Ey indicate the sparse coding matrices obtained during the testing stage, corresponding to Dx and Dy. Finally, the estimated speech signals can be obtained in the following way.

$$ \hat{\mathbf{X}}={\mathbf{D}}_{\mathrm{x}}\times {\mathbf{E}}_{\mathrm{x}} $$
(11)
$$ \hat{\mathbf{Y}}={\mathbf{D}}_{\mathrm{y}}\times {\mathbf{E}}_{\mathrm{y}} $$
(12)
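
A minimal NumPy sketch of Eqs. (10)-(12) is given below: the composite dictionary is the column-wise concatenation of the speaker dictionaries, and the rows of the sparse code are split accordingly before reconstructing each source. The dictionaries and the code are random stand-ins; in practice they would come from training and from the LARC coder.

```python
# Sketch of Eqs. (10)-(12): split the composite sparse code and reconstruct
# each source. Shapes: F frequency bins, T frames, Rx/Ry atoms per dictionary.
import numpy as np

F, T, Rx, Ry = 257, 100, 64, 64
Dx = np.abs(np.random.randn(F, Rx))      # stand-in for the trained dictionary D_x
Dy = np.abs(np.random.randn(F, Ry))      # stand-in for D_y
D = np.hstack([Dx, Dy])                  # composite dictionary [D_x, D_y]

E = np.abs(np.random.randn(Rx + Ry, T))  # stand-in sparse code from LARC
Ex, Ey = E[:Rx, :], E[Rx:, :]            # rows aligned with D_x and D_y

X_hat = Dx @ Ex                          # Eq. (11)
Y_hat = Dy @ Ey                          # Eq. (12)
Z_approx = D @ E                         # approximates Z^test as in Eq. (10)
```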

3.4 GI

The GI, introduced in 1921, is used to measure the inequality or sparseness of a wealth or speech distribution. It is the principal measure that satisfies all the desirable criteria for a sparsity measure [10]. The GI equals twice the area between the Lorenz curve and the 45-degree line.

Given data z(t) = [z(1), z(2), …, z(T)], the Lorenz curve, originally defined in [24], is the piecewise-linear function with support (0, 1) defined by the T + 1 points

$$ \mathbf{L}\left(\mathrm{g}/\mathrm{T}\right)=\sum \limits_{\mathrm{j}=1}^{\mathrm{g}}\left[{\mathbf{z}}_{\left(\mathrm{j}\right)}/{\sum}_{\mathrm{k}=1}^{\mathrm{T}}{\mathbf{z}}_{\mathrm{k}}\right],\kern1em \mathrm{g}=0,\dots, \mathrm{T} $$
(13)

where z(1) ≤ z(2) ≤ … ≤ z(T) denote the T values ordered from lowest to largest. The GI, denoted by α, is given by the following equation.

$$ \alpha =1-\frac{1}{\mathrm{T}}\sum \limits_{\mathrm{m}=1}^{\mathrm{T}}\left(\mathbf{L}\left(\frac{\mathrm{m}-1}{\mathrm{T}}\right)+\mathbf{L}\left(\frac{\mathrm{m}}{\mathrm{T}}\right)\right). $$
(14)
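
Equations (13) and (14) translate directly into the short function below; the test values at the end only illustrate that a uniform sequence scores near 0 and a highly sparse one near 1.

```python
# Direct implementation of the Gini index from Eqs. (13)-(14).
import numpy as np

def gini_index(z):
    """Sparsity of the sequence |z(1..T)| measured via the Lorenz curve."""
    z = np.sort(np.abs(np.asarray(z, dtype=float)))        # lowest to largest
    T = z.size
    L = np.concatenate(([0.0], np.cumsum(z) / z.sum()))    # L(g/T), g = 0..T, Eq. (13)
    # Eq. (14): one minus twice the trapezoidal area under the Lorenz curve,
    # i.e. twice the area between the Lorenz curve and the 45-degree line.
    return 1.0 - (1.0 / T) * np.sum(L[:-1] + L[1:])

print(gini_index(np.ones(100)))               # ~0 for a uniform sequence
print(gini_index(np.r_[1.0, np.zeros(99)]))   # ~1 for a very sparse sequence
```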

4 Proposed SS algorithm

In this section, we describe the proposed SS algorithm and the details of its components. Most speech separation systems work on the STFT of the speech signal and consider only the magnitude spectrum. The STFT transforms the time-domain input signal by taking short segments or frames of that signal and treating each segment as stationary. However, a segment may not be sufficiently stationary, because we cannot know with certainty which frequency is present at which time instant. In our proposed algorithm, we first use the DTCWT, which divides the input signal into subbands in which high- and low-frequency components are separated. For a given DTCWT decomposition level, all subbands are represented by xb, tl, where b = dl + 1 and tl takes two values because there are two trees. For one-level DTCWT decomposition, the total number of subbands is 4 (2 × 2); for two-level decomposition it is 6 (3 × 2), and so on. In the proposed technique, we use first-level decomposition, in which the time-domain signal is decomposed into four subband signals. For illustration, the DTCWT breaks down the source signal x(t) into subband signals denoted by xb, tl; for the first-level decomposition, the subbands are x1, 1, x1, 2, x2, 1, and x2, 2, as explained in the DTCWT part of Section 3. The STFT, which gives better transforms for more stationary signals, is then applied to each of the more stationary subband signals produced by the DTCWT decomposition. After applying the DTCWT and STFT successively, the generative joint dictionary learning (GJDL) algorithm is used to jointly learn the magnitude and the absolute values of the real and imaginary parts of the signal. The LARC algorithm finds the required coefficients using these dictionaries. The initial signals are estimated in two ways in our proposed method: the first estimate considers only the magnitude component, whereas the second considers both the real and imaginary components. Figure 2 illustrates the overall framework of the proposed SS algorithm. We use these dual transformations in both the training and testing stages, which are described separately in the following subsections.

Fig. 2
figure 2

Block diagram of the proposed speech separation system, including the training and testing stages

4.1 Training stage

In the training stage, we consider two individual speech sources generating the signals x(t) and y(t). The DTCWT is used to obtain the subband source signals xb, tl and yb, tl from x(t) and y(t). The STFT is applied to each subband source signal to obtain the complex spectra Xb, tl(τ, f) and Yb, tl(τ, f), where τ and f denote the time frame and frequency bin indexes, respectively. We then obtain three parts from Xb, tl(τ, f): the magnitude part XMb, tl(τ, f), the real part XRb, tl(τ, f), and the imaginary part XIb, tl(τ, f); the same operation is applied to Yb, tl(τ, f). We take the absolute values of the real and imaginary parts and concatenate them with the magnitude part as follows.

$$ {\mathbf{XM}\mathbf{RI}}_{\mathrm{b},\mathrm{tl}}^{\mathrm{Train}}=\left[\begin{array}{c}\left|{\mathbf{XM}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)\right|\\ {}\mid {\mathbf{XR}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)\mid \\ {}\mid {\mathbf{XI}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)\mid \end{array}\right] $$
(15)
$$ \kern0.5em {\mathbf{YM}\mathbf{RI}}_{\mathrm{b},\mathrm{tl}}^{\mathrm{Train}}=\left[\begin{array}{c}\left|{\mathbf{YM}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)\right|\\ {}\mid {\mathbf{YR}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)\mid \\ {}\mid {\mathbf{YI}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)\mid \end{array}\right] $$
(16)
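
A minimal sketch of Eq. (15) is shown below (Eq. (16) is identical for the second speaker): the magnitude and the absolute real and imaginary parts of one subband spectrogram are stacked into a single non-negative training matrix. The subband signal here is a random stand-in.

```python
# Sketch of Eq. (15): stack magnitude, |real| and |imaginary| parts of one
# subband spectrogram into a non-negative training matrix for GJDL.
import numpy as np
from scipy.signal import stft

fs = 8000
x_sub = np.random.randn(60 * fs)                  # ~60 s training subband x_{b,tl}
_, _, X = stft(x_sub, fs=fs, nperseg=512, noverlap=256)

XMRI_train = np.vstack([np.abs(X),                # |XM_{b,tl}(tau, f)|
                        np.abs(X.real),           # |XR_{b,tl}(tau, f)|
                        np.abs(X.imag)])          # |XI_{b,tl}(tau, f)|
print(XMRI_train.shape)                           # (3F, T): one column per frame
```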

Based on Eqs. (15) and (16), the GJDL is applied to train the dictionaries DXMRIb, tl and DYMRIb, tl for all three components using Eqs. (8) and (9). Using the LARC algorithm for sparse coding and the approximate K-SVD algorithm for the dictionary update [33], the cost functions in Eqs. (8) and (9) are solved and DXMRIb, tl and DYMRIb, tl are obtained as follows.

$$ {\mathbf{DXMRI}}_{\mathrm{b},\mathrm{tl}}=\boldsymbol{GJDL}\left({\mathbf{XMRI}}_{\mathrm{b},\mathrm{tl}}^{\mathrm{Train}}\right) $$
(17)
$$ {\mathbf{DYMRI}}_{\mathrm{b},\mathrm{tl}}=\boldsymbol{GJDL}\left({\mathbf{YMRI}}_{\mathrm{b},\mathrm{tl}}^{\mathrm{Train}}\right) $$
(18)

Finally, we concatenate the two dictionaries to obtain the concatenated dictionary DXYMRIb, tl in Eq. (19), which is forwarded to the testing stage.

$$ {\mathbf{DXYMRI}}_{\mathrm{b},\mathrm{tl}}=\left[\ {\mathbf{DXMRI}}_{\mathrm{b},\mathrm{tl}}\kern0.50em {\mathbf{DYMRI}}_{\mathrm{b},\mathrm{tl}}\right] $$
(19)
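
The sketch below mirrors Eqs. (17)-(19) structurally, using scikit-learn's DictionaryLearning as a hedged stand-in for GJDL; the paper's GJDL relies on LARC sparse coding and approximate K-SVD updates [33], which this substitute does not reproduce. The feature matrices and the atom count R are assumptions for illustration.

```python
# Structural sketch of Eqs. (17)-(19) with scikit-learn's DictionaryLearning
# as a stand-in for GJDL (the paper's GJDL uses LARC + approximate K-SVD [33]).
import numpy as np
from sklearn.decomposition import DictionaryLearning

R = 64                                                 # atoms per speaker (assumed)
XMRI_train = np.abs(np.random.randn(3 * 257, 400))     # stand-in for Eq. (15)
YMRI_train = np.abs(np.random.randn(3 * 257, 400))     # stand-in for Eq. (16)

def train_dictionary(features, n_atoms):
    # scikit-learn stores samples in rows, so time frames become rows here.
    learner = DictionaryLearning(n_components=n_atoms,
                                 fit_algorithm='lars',
                                 transform_algorithm='lasso_lars',
                                 max_iter=20, random_state=0)
    learner.fit(features.T)
    return learner.components_.T                       # (3F, R): one atom per column

DXMRI = train_dictionary(XMRI_train, R)                # Eq. (17)
DYMRI = train_dictionary(YMRI_train, R)                # Eq. (18)
DXYMRI = np.hstack([DXMRI, DYMRI])                     # Eq. (19)
```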

4.2 Testing stage

The mixed speech signal z(t) is decomposed by applying the DTCWT, producing a set of subband signals zb, tl. The STFT is applied to every subband of the mixed signal to obtain the complex spectrum Zb, tl(τ, f). We then take the magnitude part |ZMb, tl(τ, f)|, the phase ZPb, tl, and the real ZRb, tl(τ, f) and imaginary ZIb, tl(τ, f) parts from the complex spectrum Zb, tl(τ, f), and preserve the signs of the real and imaginary parts. We take the absolute values of the real and imaginary parts and concatenate them with the magnitude part as follows.

$$ {\mathbf{ZM}\mathbf{RI}}_{\mathrm{b},\mathrm{tl}}^{\mathrm{Test}}=\left[\begin{array}{c}\left|{\mathbf{ZM}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)\right|\\ {}\mid {\mathbf{ZR}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)\mid \\ {}\mid {\mathbf{ZI}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)\mid \end{array}\right] $$
(20)

Given the testing mixture signal with its three parts \( {\mathbf{ZMRI}}_{\mathrm{b},\mathrm{tl}}^{\mathrm{Test}} \) and the concatenated dictionary DXYMRIb, tl, we obtain the sparse coding CXYMRIb, tl by using Eq. (10) as follows.

$$ {\mathbf{CXYMRI}}_{\mathrm{b},\mathrm{tl}}=\boldsymbol{LARC}\left({\mathbf{ZMRI}}_{\mathrm{b},\mathrm{tl}}^{\mathrm{Test}},{\mathbf{DXYMRI}}_{\mathrm{b},\mathrm{tl}}\right) $$
(21)

The initially estimated magnitude, real, and imaginary components \( {\overline{\mathbf{XM}}}_{\mathrm{b},\mathrm{tl}} \), \( {\overline{\mathbf{XR}}}_{\mathrm{b},\mathrm{tl}} \), and \( {\overline{\mathbf{XI}}}_{\mathrm{b},\mathrm{tl}} \) for one source signal, and \( {\overline{\mathbf{YM}}}_{\mathrm{b},\mathrm{tl}} \), \( {\overline{\mathbf{YR}}}_{\mathrm{b},\mathrm{tl}} \), and \( {\overline{\mathbf{YI}}}_{\mathrm{b},\mathrm{tl}} \) for the other source, are obtained using the corresponding speech dictionaries and sparse codes, which are extracted from DXYMRIb, tl and CXYMRIb, tl as follows.

$$ {\overline{\mathbf{XM}}}_{\mathrm{b},\mathrm{tl}}={\mathbf{DXM}}_{\mathrm{b},\mathrm{tl}}{\mathbf{CX}}_{\mathrm{b},\mathrm{tl}}, $$
(22)
$$ {\overline{\mathbf{XR}}}_{\mathrm{b},\mathrm{tl}}={\mathbf{DXR}}_{\mathrm{b},\mathrm{tl}}{\mathbf{CX}}_{\mathrm{b},\mathrm{tl}}, $$
(23)
$$ {\overline{\mathbf{XI}}}_{\mathrm{b},\mathrm{tl}}={\mathbf{DXI}}_{\mathrm{b},\mathrm{tl}}{\mathbf{CX}}_{\mathrm{b},\mathrm{tl}}, $$
(24)
$$ {\overline{\mathbf{YM}}}_{\mathrm{b},\mathrm{tl}}={\mathbf{DYM}}_{\mathrm{b},\mathrm{tl}}{\mathbf{CY}}_{\mathrm{b},\mathrm{tl}}, $$
(25)
$$ {\overline{\mathbf{YR}}}_{\mathrm{b},\mathrm{tl}}={\mathbf{DYR}}_{\mathrm{b},\mathrm{tl}}{\mathbf{CY}}_{\mathrm{b},\mathrm{tl}}, $$
(26)
$$ {\overline{\mathbf{YI}}}_{\mathrm{b},\mathrm{tl}}={\mathbf{DYI}}_{\mathrm{b},\mathrm{tl}}{\mathbf{CY}}_{\mathrm{b},\mathrm{tl}}, $$
(27)
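
Below is a hedged sketch of Eqs. (20)-(27): the mixture features are sparse-coded over the concatenated dictionary, and the resulting code and dictionary are partitioned per speaker and per component. scikit-learn's sparse_encode with LARS is only a stand-in for the LARC coder, and all matrices are random placeholders.

```python
# Sketch of Eqs. (20)-(27): sparse-code the stacked mixture features over the
# concatenated dictionary, then split the code and the dictionary rows to
# rebuild per-speaker magnitude / real / imaginary estimates.
import numpy as np
from sklearn.decomposition import sparse_encode

F, R, T = 257, 64, 120
DXYMRI = np.abs(np.random.randn(3 * F, 2 * R))     # stand-in for Eq. (19)
ZMRI_test = np.abs(np.random.randn(3 * F, T))      # stand-in for Eq. (20)

# scikit-learn expects samples in rows, hence the transposes. Eq. (21):
CXYMRI = sparse_encode(ZMRI_test.T, DXYMRI.T,
                       algorithm='lars', n_nonzero_coefs=10).T   # (2R, T)

CX, CY = CXYMRI[:R, :], CXYMRI[R:, :]              # codes for speakers x and y
DXM, DXR, DXI = DXYMRI[:F, :R], DXYMRI[F:2*F, :R], DXYMRI[2*F:, :R]
DYM, DYR, DYI = DXYMRI[:F, R:], DXYMRI[F:2*F, R:], DXYMRI[2*F:, R:]

XM_bar, XR_bar, XI_bar = DXM @ CX, DXR @ CX, DXI @ CX   # Eqs. (22)-(24)
YM_bar, YR_bar, YI_bar = DYM @ CY, DYR @ CY, DYI @ CY   # Eqs. (25)-(27)
```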

The sum of the initial estimates \( {\overline{\mathbf{XM}}}_{\mathrm{b},\mathrm{tl}} \) and \( {\overline{\mathbf{YM}}}_{\mathrm{b},\mathrm{tl}} \) may not equal the mixture magnitude spectrum ZMb, tl. To make the estimates consistent with the mixture, we compute the subband ratio mask (SBRM) using Eq. (28) and Eq. (29) as follows:

$$ {\overline{\mathbf{X1}}}_{\mathrm{b},\mathrm{tl}}=\frac{{\left({\overline{\mathbf{XM}}}_{\mathrm{b},\mathrm{tl}}\right)}^2}{{\left({\overline{\mathbf{XM}}}_{\mathrm{b},\mathrm{tl}}\right)}^2+{\left({\overline{\mathbf{YM}}}_{\mathrm{b},\mathrm{tl}}\right)}^2}\times {\mathbf{ZM}}_{\mathrm{b},\mathrm{tl}}, $$
(28)
$$ {\overline{\mathbf{Y1}}}_{\mathrm{b},\mathrm{tl}}=\frac{{\left({\overline{\mathbf{YM}}}_{\mathrm{b},\mathrm{tl}}\right)}^2}{{\left({\overline{\mathbf{XM}}}_{\mathrm{b},\mathrm{tl}}\right)}^2+{\left({\overline{\mathbf{YM}}}_{\mathrm{b},\mathrm{tl}}\right)}^2}\times {\mathbf{ZM}}_{\mathrm{b},\mathrm{tl}}, $$
(29)
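
Eqs. (28) and (29) amount to a Wiener-like redistribution of the mixture magnitude; a minimal sketch follows, with a small eps (not mentioned in the paper) added to avoid division by zero.

```python
# Sketch of Eqs. (28)-(29): the SBRM redistributes the mixture magnitude ZM
# between the two magnitude estimates.
import numpy as np

eps = 1e-12
XM_bar = np.abs(np.random.randn(257, 120))      # stand-in initial estimates
YM_bar = np.abs(np.random.randn(257, 120))
ZM = XM_bar + YM_bar                            # stand-in mixture magnitude

den = XM_bar**2 + YM_bar**2 + eps
X1_bar = (XM_bar**2 / den) * ZM                 # Eq. (28)
Y1_bar = (YM_bar**2 / den) * ZM                 # Eq. (29)
```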

Now, we combine the phase spectrum ZPb, tl with the estimated source magnitude spectra \( {\overline{\mathbf{X1}}}_{\mathrm{b},\mathrm{tl}} \) and \( {\overline{\mathbf{Y1}}}_{\mathrm{b},\mathrm{tl}} \) to obtain the reconstructed complex spectra \( {\overset{\sim }{\mathbf{X1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) and \( {\overset{\sim }{\mathbf{Y1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) using Eq. (30) and Eq. (31) as follows:

$$ {\overset{\sim }{\mathbf{X1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)={\overline{\mathbf{X1}}}_{\mathrm{b},\mathrm{tl}}\ {\mathrm{e}}^{\mathrm{i}{\mathbf{ZP}}_{\mathrm{b},\mathrm{tl}}}, $$
(30)
$$ {\overset{\sim }{\mathbf{Y1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)={\overline{\mathbf{Y1}}}_{\mathrm{b},\mathrm{tl}}\ {\mathrm{e}}^{\mathrm{i}{\mathbf{ZP}}_{\mathrm{b},\mathrm{tl}}}, $$
(31)

Next, we apply the previously preserved signs by multiplying them with the real and imaginary estimates of the signal. The real and imaginary parts are then joined to form the complex spectra of the speech signals as follows.

$$ {\overline{\mathbf{X2}}}_{\mathrm{b},\mathrm{tl}}={\overline{\mathbf{XR}}}_{\mathrm{b},\mathrm{tl}}+\mathrm{i}{\overline{\mathbf{XI}}}_{\mathrm{b},\mathrm{tl}}, $$
(32)
$$ {\overline{\mathbf{Y2}}}_{\mathrm{b},\mathrm{tl}}={\overline{\mathbf{YR}}}_{\mathrm{b},\mathrm{tl}}+\mathrm{i}{\overline{\mathbf{YI}}}_{\mathrm{b},\mathrm{tl}}, $$
(33)

To make the estimated signals \( {\overline{\mathbf{X2}}}_{\mathrm{b},\mathrm{tl}} \) and \( {\overline{\mathbf{Y2}}}_{\mathrm{b},\mathrm{tl}} \) more accurate, we calculate the complex subband ratio mask (CSBRM) using Eq. (34) and Eq. (35).

$$ {\overset{\sim }{\mathbf{X2}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)=\frac{{\left({\overline{\mathbf{X2}}}_{\mathrm{b},\mathrm{tl}}\right)}^2}{{\left({\overline{\mathbf{X2}}}_{\mathrm{b},\mathrm{tl}}\right)}^2+{\left({\overline{\mathbf{Y2}}}_{\mathrm{b},\mathrm{tl}}\right)}^2}\times {\mathbf{Z}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right), $$
(34)
$$ {\overset{\sim }{\mathbf{Y2}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)=\frac{{\left({\overline{\mathbf{Y2}}}_{\mathrm{b},\mathrm{tl}}\right)}^2}{{\left({\overline{\mathbf{X2}}}_{\mathrm{b},\mathrm{tl}}\right)}^2+{\left({\overline{\mathbf{Y2}}}_{\mathrm{b},\mathrm{tl}}\right)}^2}\times {\mathbf{Z}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right), $$
(35)

The accuracies of \( {\overset{\sim }{\mathbf{X1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) and \( {\overset{\sim }{\mathbf{X2}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) differ because of the different estimation processes: the first is based on the signal's magnitude, while the second is based on the signal's real and imaginary components. \( {\overset{\sim }{\mathbf{Y1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) and \( {\overset{\sim }{\mathbf{Y2}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) are estimated in the same way. Since these estimates have complementary strengths, we use a weighting parameter αb, tl, obtained using Eq. (14), and the fused estimates are calculated as follows:

$$ {\overset{\sim }{\mathbf{X}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)=\left(1-{\upalpha}_{\mathrm{b},\mathrm{tl}}\right){\overset{\sim }{\mathbf{X}\mathbf{1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)+{\upalpha}_{\mathrm{b},\mathrm{tl}}{\overset{\sim }{\mathbf{X}\mathbf{2}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right), $$
(36)
$$ {\overset{\sim }{\mathbf{Y}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)=\left(1-{\upalpha}_{\mathrm{b},\mathrm{tl}}\right){\overset{\sim }{\mathbf{Y}\mathbf{1}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right)+{\upalpha}_{\mathrm{b},\mathrm{tl}}{\overset{\sim }{\mathbf{Y}\mathbf{2}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right), $$
(37)
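
The fusion in Eqs. (36)-(37) is a simple convex combination; the sketch below assumes the weight alpha_{b,tl} has already been obtained from Eq. (14).

```python
# Sketch of Eqs. (36)-(37): convex combination of the magnitude-based and the
# real/imaginary-based complex estimates, weighted by alpha_{b,tl}.
import numpy as np

X1 = np.random.randn(257, 120) + 1j * np.random.randn(257, 120)  # stand-in spectra
X2 = np.random.randn(257, 120) + 1j * np.random.randn(257, 120)
alpha = 0.4                                    # assumed Gini-index weight alpha_{b,tl}

X_tilde = (1.0 - alpha) * X1 + alpha * X2      # Eq. (36); Eq. (37) is analogous for Y
```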

The ISTFT is used to transform the complex spectra \( {\overset{\sim }{\mathbf{X}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) and \( {\overset{\sim }{\mathbf{Y}}}_{\mathrm{b},\mathrm{tl}}\left(\uptau, \mathrm{f}\right) \) back to the subband signals \( {\overset{\sim }{\mathbf{x}}}_{\mathrm{b},\mathrm{tl}} \) and \( {\overset{\sim }{\mathbf{y}}}_{\mathrm{b},\mathrm{tl}} \). Finally, the estimated source speech signals \( \overset{\sim }{\mathbf{x}}\left(\mathrm{t}\right) \) and \( \overset{\sim }{\mathbf{y}}\left(\mathrm{t}\right) \) are obtained by applying the IDTCWT to the subband signals \( {\overset{\sim }{\mathbf{x}}}_{\mathrm{b},\mathrm{tl}} \) and \( {\overset{\sim }{\mathbf{y}}}_{\mathrm{b},\mathrm{tl}} \). The proposed algorithm for the training and testing stages is presented in Table 2.

Table 2 Algorithm for the training and testing stages of the proposed technique

5 Evaluation and results

In this section, the proposed algorithm is analyzed through simulation experiments. We first provide an overview of the data and the performance indicators used to measure the quality of the separated speech. We then show the impact of joint learning on the SDR, SIR, STOI, PESQ, HASPI, and HASQI scores for male-female separation, and explore the impact of GJDL over SNMF in terms of STOI and PESQ scores for the same- and opposite-gender cases. Finally, we compare our algorithm with current mainstream single-channel SS algorithms and use the experimental results to confirm the advantage of the proposed strategy. The comparison algorithms are STFT-SNMF [45], DWT-STFT-SNMF [42], ADDL [2], SWT-SNMF [11], DTCWT-SNMF [12], CJD [35], OJDL [36], and DTCWT-STFT-SNMF [8].

5.1 Data sets and performance evaluation indicators

In this simulation, we collect the speech signals (including different male and female speech) from the GRID audio-visual corpus [3], which is used as the training and testing data. There are 34 speakers (18 male, 16 female), and each speaker utters 1000 sentences. For each speaker, we randomly take 500 utterances for training and 200 utterances for testing. We use two types of speech data grouping: one for same-gender (male-male or female-female) separation and another for opposite-gender (male-female) separation. For same-gender separation, the utterances of eight same-gender speakers form one experimental group, and those of a different eight same-gender speakers form another group. For opposite-gender separation, we choose sixteen male speakers for one experimental group and sixteen female speakers for another. The length of the training signal is about 60 s, and that of the test signal is about 10 s. The sampling rate of the speech signals is 8000 Hz, and the signals are transformed into the time-frequency domain using a 512-point STFT.

In this paper, the following six indicators are used to measure the performance of SS: HASQI [18], HASPI [19], PESQ [19], STOI [40], SDR [43], and SIR [43] metrics.

The SDR [43] approximates the overall speech quality; it is the ratio of the energy of the target signal to the energy of the total error (interference, noise, and artifacts) between the input and the reconstructed signal. Higher SDR scores indicate better recovery.

$$ \mathrm{SDR}=10{\log}_{10}\frac{{\left\Vert {\mathbf{x}}_{\mathrm{target}}\right\Vert}_{\mathrm{l}}^2}{{\left\Vert {\mathbf{e}}_{\mathrm{interf}}+{\mathbf{e}}_{\mathrm{noise}}+{\mathbf{e}}_{\mathrm{artif}}\right\Vert}_{\mathrm{l}}^2} $$
(38)

where xtarget, einterf, enoise, and, eartif are the targeted source, the interference error, the perturbation noise and the artifacts error, respectively.

In addition to the SDR, the SIR [43] reports errors produced by a failure to remove the interfering signal during source separation. A higher SIR corresponds to higher separation quality.

$$ \mathrm{SIR}=10{\log}_{10}\frac{{\left\Vert {\mathbf{x}}_{\mathrm{target}}\right\Vert}_{\mathrm{l}}^2}{{\left\Vert {\mathbf{e}}_{\mathrm{interf}}\right\Vert}_{\mathrm{l}}^2} $$
(39)
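
In practice, the SDR and SIR of Eqs. (38)-(39) are often computed with the BSS-Eval implementation in the mir_eval package, as sketched below; this toolkit is an assumption of the sketch, since the paper only cites [43] for the metric definitions.

```python
# Hedged sketch: SDR and SIR (Eqs. (38)-(39)) via BSS-Eval in mir_eval.
import numpy as np
import mir_eval

fs = 8000
references = np.random.randn(2, 4 * fs)                    # true sources x and y
estimates = references + 0.1 * np.random.randn(2, 4 * fs)  # stand-in separated signals

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(references, estimates)
print(sdr, sir)
```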

PESQ is designed for objective quality assessment and is commonly used to measure the quality of speech signals. It produces scores ranging from −0.50 to 4.50, where higher scores indicate better voice quality. PESQ combines only two parameters, a symmetric disturbance (dSYM) and an asymmetric disturbance (dASYM), which provides a good balance between prediction accuracy and the ability to generalize, as described in [19].

$$ \mathrm{PESQ}=4.5-0.1\ {\mathrm{d}}_{\mathrm{SYM}}-0.0309{\mathrm{d}}_{\mathrm{ASYM}} $$
(40)

STOI [40] is a state-of-the-art speech intelligibility indicator based on the correlation coefficient between the temporal envelopes of the clean speech and the separated speech in short-time regions. It produces scores in the range 0 to 1, where a higher STOI value signifies better intelligibility. STOI is computed from the correlation coefficient between the temporal envelopes of the clean and estimated speech over short overlapping segments. It is a function of the clean and corrupted speech, represented by x and \( \overset{\sim }{\mathrm{x}} \), respectively.

$$ \mathrm{STOI}=\mathrm{Avg}\left(\frac{{\left(\mathbf{x}-{\boldsymbol{\upmu}}_{\mathrm{x}}\right)}^{\mathrm{T}}\ \left(\overset{\sim }{\mathbf{x}}-{\boldsymbol{\upmu}}_{\overset{\sim }{\mathrm{x}}}\right)}{{\left\Vert \mathbf{x}-{\boldsymbol{\upmu}}_{\mathrm{x}}\right\Vert}_{\mathrm{l}}{\left\Vert \overset{\sim }{\mathbf{x}}-{\boldsymbol{\upmu}}_{\overset{\sim }{\mathrm{x}}}\right\Vert}_{\mathrm{l}}}\right) $$
(41)
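
The sketch below implements only the segment-wise correlation of Eq. (41); the full STOI measure [40] additionally involves one-third-octave band decomposition and clipping, which are omitted here.

```python
# Segment-wise correlation of Eq. (41) only (a simplified STOI-like score).
import numpy as np

def stoi_like(x, x_hat, seg_len=256):
    scores = []
    for start in range(0, x.size - seg_len + 1, seg_len):
        a = x[start:start + seg_len] - x[start:start + seg_len].mean()
        b = x_hat[start:start + seg_len] - x_hat[start:start + seg_len].mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            scores.append(float(a @ b) / denom)
    return float(np.mean(scores))

x = np.random.randn(8000)
print(stoi_like(x, x + 0.1 * np.random.randn(8000)))   # close to 1 for a good estimate
```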

HASPI [19] is based on a model of the auditory periphery that incorporates changes due to hearing loss. The index compares the envelope and temporal fine-structure outputs of the auditory model for a reference signal with the outputs of the model for the signal under test. It ranges from 0 to 1, and higher scores correspond to better intelligibility.

The HASPI intelligibility index is specified by:

$$ {\displaystyle \begin{array}{c}\mathrm{p}=-9.047+14.817\mathrm{c}+0.0{\mathrm{a}}_{\mathrm{Low}}+0.0{\mathrm{a}}_{\mathrm{Mid}}+4.616{\mathrm{a}}_{\mathrm{High}}\\ {}\mathrm{HASPI}=\frac{1}{1+{\mathrm{e}}^{-\mathrm{p}}}\end{array}} $$
(42)

where c is the cepstral correlation and aLow, aMid, and aHigh are the low-, mid-, and high-level auditory coherence values, respectively.

HASQI [18] is a model-based objective quality measure developed in the context of hearing aids for both normal-hearing and hearing-impaired listeners. HASQI is the product of two independent indices: the first, QNonlin, captures the effects of noise and nonlinear distortion, and the second, QLinear, captures the effects of linear filtering and spectral changes by focusing on differences in the long-term average spectra. It ranges from 0 to 1, and higher scores correspond to better sound quality.

$$ {\displaystyle \begin{array}{c}{\mathrm{Q}}_{\mathrm{Linear}}=1-0.579{\upsigma}_1-0.421{\upsigma}_2\\ {}{\mathrm{Q}}_{\mathrm{Nonlin}}={\mathrm{c}}^3\\ {}\mathrm{HASQI}={\mathrm{Q}}_{\mathrm{Nonlin}}\times {\mathrm{Q}}_{\mathrm{Linear}}\end{array}} $$
(43)

where c, σ1, and σ2 are the cepstral correlation, the standard deviation of the spectral difference, and the standard deviation of the slope difference, respectively.
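
Equations (42) and (43) reduce to the two small functions below once the auditory-model quantities (c, aLow, aMid, aHigh, σ1, σ2) are available; extracting those quantities requires the full models of [18, 19] and is not shown.

```python
# Direct transcription of Eqs. (42)-(43); the auditory-model quantities
# (c, a_low, a_mid, a_high, sigma1, sigma2) must come from the models in [18, 19].
import numpy as np

def haspi(c, a_low, a_mid, a_high):
    p = -9.047 + 14.817 * c + 0.0 * a_low + 0.0 * a_mid + 4.616 * a_high
    return 1.0 / (1.0 + np.exp(-p))                 # Eq. (42)

def hasqi(c, sigma1, sigma2):
    q_linear = 1.0 - 0.579 * sigma1 - 0.421 * sigma2
    q_nonlin = c ** 3
    return q_nonlin * q_linear                      # Eq. (43)

print(haspi(c=0.8, a_low=0.9, a_mid=0.9, a_high=0.7),
      hasqi(c=0.8, sigma1=0.1, sigma2=0.1))
```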

5.2 Impact of joint learning

The speech signal is short-term stationary and sparse in nature. Most speech separation methods use the STFT to transform the signal into the time-frequency domain, which yields the complex spectrum of that signal. Some approaches consider only the magnitude part while overlooking the real and imaginary parts of the complex spectrum. Here we compare methods that consider only the magnitude part, only the real and imaginary parts, and the magnitude, real, and imaginary parts jointly. Figure 3 shows that joint learning (magnitude, real, and imaginary parts of the complex matrix) outperforms the individual variants (magnitude only, or real and imaginary only) in terms of SDR, SIR, HASQI, and STOI for male-female separation. For this reason, the proposed approach considers the magnitude, real, and imaginary parts together, which improves the separation performance.

Fig. 3
figure 3

Effect of joint learning at opposite gender cases (Ma: magnitude, RI: real and imaginary and MRI: magnitude, real and imaginary)

5.3 Effect of GJDL over SNMF

We explore the impact of GJDL over SNMF in terms of STOI and PESQ scores for the same- and opposite-gender cases. Figure 4 reveals the impact of GJDL relative to SNMF. For all cases considered, the PESQ and STOI scores are computed and averaged. The DTCWT-STFTMRI-SNMF and DTCWT-STFTMRI-GJDL methods use SNMF and GJDL, respectively. Both the PESQ and STOI values of the separated speech are improved in the same- and opposite-gender cases, indicating that, to some degree, DTCWT-STFTMRI-GJDL alleviates the speech distortion introduced during the SS processing.

Fig. 4
figure 4

Effect of GJDL over SNMF method concerning PESQ and STOI at the same and opposite gender cases

5.4 Overall performance comparison of algorithms

In Fig. 5, we show that the proposed model gains considerably better SDR and SIR results than the existing models, namely STFT-SNMF, DWT-STFT-SNMF, ADDL, SWT-SNMF, DTCWT-SNMF, CJD, and OJDL. For all separation cases, the SDR values of the proposed model are higher than those of the existing models. Compared with the existing OJDL method, our proposed method increases the SDR scores by 37.47% and 39.17% for M1 and M2, 21.29% and 18.20% for F1 and F2, and 27.73% and 27.20% for M and F, respectively. The figure also shows that the SIR values of the estimated signals are better than those of the existing models. The DTCWT and STFT are used consecutively for the dual transformation of the signal, which delivers a more flexible framework for improving the feature modules; this is why the proposed method performs better.

Fig. 5
figure 5

Comparison of the separation performance of the nine methods in terms of a SDR and b SIR for the same and opposite gender cases

Fig. 6 presents the comparative performance analysis in terms of STOI and PESQ for the proposed method and the existing methods. Relative to the OJDL method, the proposed model improves STOI from 0.746 to 0.819 for M1, 0.768 to 0.825 for M2, 0.785 to 0.800 for F1, 0.787 to 0.799 for F2, 0.793 to 0.896 for M, and 0.778 to 0.863 for F. The figure also shows that the PESQ scores of the estimated signals are better than those of the existing models. The STOI and PESQ scores in the three cases show that the suggested technique beats the other eight methods, i.e., it delivers the highest speech separation quality relative to the competing schemes. When only the magnitude part is considered, the phase information is not enhanced; by exploiting complex-domain training targets, the phase information can be taken into account. Accordingly, our technique uses the magnitude, real, and imaginary components, which increases the separation performance.

Fig. 6
figure 6

Comparison of the separation performance of the nine methods in terms of a STOI and b PESQ for the same and opposite gender cases

Tables 3 and 4 present the HASPI and HASQI results of the different techniques, including STFT-SNMF, DWT-STFT-SNMF, ADDL, SWT-SNMF, DTCWT-SNMF, CJD, OJDL, DTCWT-STFT-SNMF, and DTCWT-STFTMRI-GJDL, for same-gender and opposite-gender speech separation. From Table 3, we can see that DTCWT-STFTMRI-GJDL attains higher HASPI values for all separation cases. Table 4 shows that the HASQI results of DTCWT-STFTMRI-GJDL also surpass those of the other eight methods for all separation cases.

Table 3 Performance comparison among different techniques concerning HASPI values for the same and opposite gender cases
Table 4 Performance comparison of various approaches in terms of HASQI values for the same and opposite gender cases

To further evaluate the proposed method, the spectrograms produced by the speech separation algorithms are examined. The separation results of the different approaches are displayed in Fig. 7, where the original female and male speech spectrograms are presented in Fig. 7a and b, respectively. The estimated female and male speech spectrograms are presented in Fig. 7c, d, e, f, g, and h for DWT-STFT-SNMF, SWT-SNMF, and the proposed model, respectively. The figure shows that the quality of the separated speech is poor for the DWT-STFT-SNMF method owing to the complete elimination of the high-frequency components, and the male and female speech estimated by the SWT-SNMF method contains more unwanted vocal content. The proposed method recovers male and female speech that is roughly similar to the original signals. It can also be seen that the other methods exhibit more vocal distortion than our algorithm.

Fig. 7
figure 7

Spectrograms of the original male speech, the original female speech, and the recovered female and male speech for DWT-STFT-SNMF, SWT-SNMF, and the proposed model, where the x-axis corresponds to time in seconds and the y-axis to frequency in kHz

Finally, to further confirm the significance of the improvements, we have used the TIMIT database [6] for an additional mixed speech separation investigation. We selected 24 speakers (12 male and 12 female) from the TIMIT database. Each speaker utters ten sentences, resulting in a total of 240 sentences. Of the ten sentences of each speaker, the first eight are selected for training and the remaining two are used for testing. To evaluate the performance of the proposed scheme, we consider the SDR, SIR, STOI, and PESQ scores. From Fig. 8 and Table 5, it can easily be seen that the proposed approach achieves better performance than the eight existing techniques (STFT-SNMF, DWT-STFT-SNMF, ADDL, SWT-SNMF, DTCWT-SNMF, CJD, OJDL, and DTCWT-STFT-SNMF) in terms of the SDR, SIR, STOI, and PESQ scores for opposite-gender separation.

Fig. 8
figure 8

Comparative performance evaluation of the existing and proposed model of SDR and SIR for the opposite gender case considering the TIMIT database

Table 5 Performance assessment of PESQ and STOI values of nine methods for the opposite gender case considering the TIMIT database

6 Conclusion

We have developed a new speech separation framework based on a dual-domain transform in which GJDL is used for joint learning and the GI for fusing the estimates. The main emphasis is on learning the dictionary from the magnitude, real, and imaginary parts jointly, in contrast to the traditional approach of learning from only the magnitude part or only the complex domain. The DTCWT and STFT are used serially for the dual transformation of the signal, which offers a more flexible framework for enhancing the feature segments. GJDL is then used to jointly learn the magnitude and the absolute values of the real and imaginary parts of the signal, and the LARC algorithm computes the required coefficients over these dictionaries. We initially estimate the signals in two ways, one considering only the magnitude part and the other based on the real and imaginary components. Finally, the GI combines the corresponding subbands of the two sets of initial estimates to achieve the final estimated speech signals.

The DTCWT separates the high- and low-frequency components of the time-domain signal, whereas the STFT accurately analyzes the time-frequency components. We also process the signal's magnitude as well as its real and imaginary parts, so the algorithm uses all of the information contained in the signal waveforms. The experimental results reveal that our approach outperforms traditional methods on several evaluation metrics, namely SDR, SIR, HASQI, HASPI, PESQ, and STOI. We use limited features to train GJDL; better performance would require more features, but considering more features would increase the time complexity of both the training and testing stages. In the future, we plan to investigate alternative training and testing algorithms using deep neural networks and to extend the approach to multisource/multichannel processing, which is a very relevant and interesting direction.