1 Introduction

The human auditory system has the fascinating ability to segregate individual acoustic signals from a complex mixture of speech sources in reverberant and noisy environments. It can also locate acoustic sources and estimate their distance, even in the absence of visual information [6, 7, 34, 40, 50]. Inspired by the human auditory system, a significant number of computational techniques, collectively referred to as computational auditory scene analysis (CASA), have been devised to analyse complex acoustic mixtures [59]. The speech segregation process involves separating a single speech source of interest from a mixture of multiple sounds [59, 60]. Deploying an automatic speaker recognition system in a real-world application is considered a difficult task due to the presence of additive noises as well as room reverberation. Several spectral subtraction filtering methods have been investigated for improving the robustness of automatic speaker recognition [59, 60]. Currently, researchers have focused on designing reliable speech processing systems by combining a speech separation module with a robust speaker recognition module [60]. The combined architecture contains a speech segregation module followed by a robust speaker recognizer. This joint architecture is found to improve speech intelligibility and to effectively mitigate the combined effects of reverberation and noise [59, 60].

In the literature, most speech segregation studies have concentrated on monaural speech signals and related features [8, 51]. Various algorithms, such as spectral subtraction, inverse filtering and Wiener filtering, have been suggested for the segregation of monaural speech signals [24, 48]. Woodruff et al. [54] describe robust joint localization and segregation of voiced speech sources by means of combined pitch and azimuth cues in reverberant environments. Woodruff et al. [53] also propose an azimuth-dependent, classifier-based localization method in which the segregation process uses monaural features to improve the estimation of azimuth cues from binaural input. Comparative analyses are largely carried out using various classifiers, including the support vector machine (SVM) and the Gaussian mixture model (GMM) [12]. Recently, many researchers have adopted neural network-based classifiers owing to their improved robustness and performance. A multi-layer perceptron (MLP)-based monaural feature evaluation framework has been demonstrated in a speech segregation application [20]. Several studies have considered pitch and azimuth cues as features for the segregation of speech sources. Wrigley et al. [55] propose recurrent timing neural network-based segregation of two acoustic sources in which pitch and location features are considered. Alinaghi et al. [3] suggest a binaural speech segregation algorithm based on a weighted combination of binaural cues, such as IPD, ILD and IC, and mixing vector models. Weiss et al. [52] have proposed a binaural source separation technique that combines spatial models with a priori trained source models and derived an expectation-maximization (EM) algorithm for the determination of maximum likelihood parameters. Abdipour et al. [1] have suggested a system for the segregation of multiple moving sources from stereo signals based on a statistical model whose maximum likelihood estimation is realized using the expectation-maximization (EM) technique.

Many researchers have successfully used the direct-to-reverberant ratio (DRR) for distance estimation [30, 31]. The direct-to-reverberant ratio depends upon various factors, including room volume, directivity, source-to-receiver distance and reverberation time [13, 26, 30, 31, 56]. Lu et al. suggest a binaural equalization–cancellation technique that estimates the direct energy ratio by locating the acoustic source in a delay-line structure [30, 31]. A spatial correlation matrix model of direct and reverberant sound has been suggested for estimating the absolute distance between an acoustic source and a microphone array [30, 31]. Hioka et al. [13] suggest a spatial correlation matrix model for the segregation of direct and reverberant components. The DRR estimated with this method is restricted to smaller distances, but the method has shown clear improvement in the speech segregation process.

Automatic speaker recognition (ASR) is considered very important, especially in applications such as speech and speaker indexing, document content structuring, call routing, data entry and dictation, and speaker-attributed speech-to-text transcription [5, 29]. The performance of an ASR system is adversely affected by two forms of reverberation, namely self-masking and overlap-masking [42]. Self-masking occurs due to early reflections and diffractions, whereas the higher-impact overlap-masking is due to late reverberation [42]. The binary time–frequency mask is considered the core of computational auditory scene analysis and is used to segregate the desired target from multiple acoustic mixtures [29, 47].

ASR is well supported by various techniques, such as Gaussian mixture models, pattern matching, support vector machines (SVM), hidden Markov models (HMM) and neural networks [5, 29]. Sadjadi et al. [43] suggest the mean Hilbert envelope coefficients (MHEC) method as a replacement for traditional Mel-frequency cepstral coefficients (MFCC) within an I-vector-based speaker acoustic model under noisy reverberant environments. Recently, Gabor filter banks have been efficiently used to construct monaural features for various applications, such as facial emotion recognition, robust speaker recognition and automatic speech recognition [25, 27, 45, 46]. A joint optimization of spectro-temporal features (Gabor filters) along with a neural network acoustic model has been demonstrated and proposed as an improved ASR [25]. Kanagasundaram et al. [22] have shown an improvement in I-vector-based speaker verification by incorporating a channel compensation method. Further, research findings are also available on simultaneous localization and recognition of the target speaker to suppress the combined effects of noise and reverberation [24, 34]. May et al. [34] have proposed a noise-robust binaural scene analyser for the simultaneous localization and recognition of speakers in the presence of competing sound sources. More specifically, the effects of reverberation and noise on speaker verification have been addressed in recent research works [2, 36,37,38]. Al-Ali et al. [2] have introduced a forensic speaker verification system that investigates the combined MFCC and DWT–MFCC features of the input speech signal under different noisy reverberant conditions. Naik et al. have proposed a method based on the evaluation of super- and sub-Gaussian signals, computed using different objective measures of speech quality, to improve the quality of separated audio sources [37].

2 Related Works

Recently, deep learning-based binaural speech segregation has shown better results than monaural speech segregation in applications where reverberation is significant [19, 58]. Zhang et al. [58] have proposed a deep learning-based binaural speech segregation method that employs a fixed beamformer before extracting spectral features and have successfully validated it in various reverberant environments. Jiang et al. [19] have proposed deep neural network (DNN)-based binaural classification of stereo signals to handle complex auditory scenes effectively. The DNN-based binaural classification is found to provide good speech segregation performance in a multi-source environment. The performance of automatic speech recognition is further improved when a combined architecture of deep neural network and recurrent neural network is used [32, 55]. Maas et al. [32] have introduced a noise reduction technique that applies a deep recurrent autoencoder neural network to ensure robustness in an automatic speech recognition system. Recently, Yu et al. [57] have proposed a localization-based stereo speech segregation process in which the soft time–frequency mask generated by a deep neural network is compared with, and shown to be a better model than, GMM/EM for the segregation process. Huang et al. have proposed a solution to the monaural speech separation problem by jointly optimizing a soft masking layer with a deep recurrent neural network [15, 16]. Zhao et al. [60] have introduced a combined approach that consists of deep neural network-based speech segregation followed by a robust speaker identification module, tested under various noisy and reverberant conditions. It is observed that the combined perceptual architecture helps to improve speaker identification performance, and issues related to reverberation time and signal-to-noise ratio are efficiently addressed. Mowlaee et al. [35] have proposed a joint system combining speech separation modules and speaker identification to enhance the intelligibility of automatic speaker recognition. Trowitzsch et al. [49] have suggested a systematic approach to improve the robustness of the classifier through multi-conditional training and by superimposing general environmental sounds.

The present study makes two major contributions. First, binaural classification-based speech segregation is carried out using 83-dimensional features in total: 32-D interaural time difference, 32-D interaural phase difference, 16-D interaural level difference, 2-D interaural coherence and 1-D direct-to-reverberant ratio. These concatenated features are incorporated into a deep recurrent neural network (DRNN)-based joint discriminative training classifier for the segregation of speech signals. The present work considers various performance evaluation metrics, such as source-to-interference ratio, source-to-distortion ratio and source-to-artifacts ratio, for the validation of the proposed model. The obtained results are compared with existing architectures, including the deep neural network, and better performance is observed. Second, a spectro-temporal pattern extractor referred to as Gabor–Hilbert envelope coefficients (GHEC) is proposed. The performance of GHEC is compared with existing monaural features using acoustic speaker models, such as GMM–UBM and I-vector. The results show that the joint architecture, consisting of binaural speech segregation followed by a robust speaker recognizer, helps to improve speech intelligibility even in the presence of both noise and reverberation.

3 Model Architecture

Figure 1 shows the block diagram representation of the proposed joint automatic speech signal segregation and recognition system.

Fig. 1 Block diagram representation of binaural speech segregation along with automatic speaker recognition

The following sections describe the functionality of each component in the binaural signal segregation and recognition modules.

3.1 Binaural Source Segregation Module

The predominant component of the binaural speech segregation module is the deep recurrent neural network classifier. The binaural cues, such as interaural time and level differences (ITD/ILD), interaural phase difference (IPD) and interaural coherence (IC), are first extracted from the binaural auditory front-end. As an important contribution of this study, the DRR is also estimated from the binaural signals through an equalization–cancellation technique [31] and combined with the binaural cues. These binaural cues depend on various factors, including reverberation time, quality and energy of the acoustic source, noise, obstacles and distance, especially in an enclosed space. The resultant features are then incorporated into the DRNN-based joint discriminative training model in order to generate a soft mask.

3.1.1 Binaural Cues Extraction

In the human auditory system, the basilar membrane in the cochlea is responsible for segregating acoustic signals on the basis of their frequencies [34]. Gammatone filters are modelled after the frequency selectivity and other functional properties of the human cochlea. The speech signals arriving at the two ears are decomposed into auditory channels (\(N=32\)) by using a fourth-order Gammatone filter bank followed by inner hair cell processing. Further, these phase-compensated filter banks are used to align the binaural features at common time intervals. The centre frequencies of the filter banks are equally spaced on the Equivalent Rectangular Bandwidth (ERB) scale between 80 Hz and 5 kHz [53, 54].
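
To make the filter bank configuration concrete, the following Python sketch computes 32 centre frequencies equally spaced on the ERB scale between 80 Hz and 5 kHz; it assumes the Glasberg–Moore ERB-number mapping and does not implement the gammatone filters themselves.

```python
import numpy as np

def erb_space(f_low=80.0, f_high=5000.0, n_channels=32):
    """Centre frequencies equally spaced on the ERB-number scale.

    Uses the Glasberg-Moore mapping ERB_number(f) = 21.4*log10(1 + 0.00437*f);
    the 4th-order gammatone filter design itself is not shown here.
    """
    def hz_to_erb(f):
        return 21.4 * np.log10(1.0 + 0.00437 * f)

    def erb_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) / 0.00437

    erbs = np.linspace(hz_to_erb(f_low), hz_to_erb(f_high), n_channels)
    return erb_to_hz(erbs)

centre_freqs = erb_space()          # 32 values from 80 Hz to 5 kHz
print(np.round(centre_freqs[:4], 1), "...", np.round(centre_freqs[-1], 1))
```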

The output of the Gammatone filter bank is processed by half-wave rectification and square-root compression to model transduction in the inner hair cells. The auditory binaural features are computed using a rectangular window of 20 ms at a sampling frequency of 44.1 kHz, with an overlap of 50% between successive frames at a frame shift of 10 ms [53, 54]. The interaural time and level differences are estimated using normalized cross-correlation analysis in the time domain and by calculating the energy per frame, respectively. In general, the interaural time difference (ITD) or interaural phase difference (IPD) captures the discrepancy in arrival times and phases at each ear at low frequencies; these cues are sensitive to source distance, whereas the ILD is considered more robust at higher frequencies (above 1600 Hz) [50]. The estimated peak position for the ITD across the time interval I between the two ears is defined as,

$$\begin{aligned} {\tau }_{{ i}} \left( {t} \right) =\hbox {arg max}C_{{i}} {({t},\gamma )} \end{aligned}$$
(1)

where t is the frame number, \({\gamma }\) is the time lag, and the normalized cross-correlation function [34] of channel \(C_{i}\) is given by,

$$\begin{aligned} C_{i}\left( t,\gamma \right) =\frac{\sum \nolimits _{n=0}^{I-1}\left( s_{i}\left( t\frac{I}{2}-n\right) -\overline{s_{i}}\right) \left( k_{i}\left( t\frac{I}{2}-n-\gamma \right) -\overline{k_{i}}\right) }{\sqrt{\sum \nolimits _{n=0}^{I-1}\left( s_{i}\left( t\frac{I}{2}-n\right) -\overline{s_{i}}\right) ^{2}}\,\sqrt{\sum \nolimits _{n=0}^{I-1}\left( k_{i}\left( t\frac{I}{2}-n-\gamma \right) -\overline{k_{i}}\right) ^{2}}} \end{aligned}$$
(2)

where \(\overline{k_{i}}\) and \(\overline{s_{i}}\) are the mean values of the left and right ear signals, respectively. The comparison of the energy arriving at the two ears is used to derive the interaural level difference, especially in a reverberant environment. The ILD estimation [34] across the time interval I between the two ears is given by,

$$\begin{aligned} \hbox {ild}_{{i}} \left( {t} \right) =20\,{\log }_{{10}} \left( {\frac{\mathop \sum \nolimits _{{n=0}}^{{I-1}} {k}_{{i}} {\big (t}\frac{{I}}{ 2 }{-n\big )}^{{2}}}{\mathop \sum \nolimits _{{n=0}}^{{I-1}} {s}_{{i}} {\big (t}\frac{{I}}{ 2 }{-n\big )}^{{2}}}} \right) \end{aligned}$$
(3)
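
A minimal Python sketch of the per-frame ITD and ILD estimation of Eqs. (1)–(3) is given below; it assumes one frame of a single gammatone channel has already been extracted, and the per-lag normalisation of Eq. (2) is approximated by a single global factor, so it illustrates the idea rather than the exact computation.

```python
import numpy as np

def itd_ild_frame(left, right, fs, max_lag_ms=1.0):
    """Rough per-frame ITD and ILD estimates for one gammatone channel.

    ITD: lag of the cross-correlation peak within +/- max_lag_ms (Eq. 1).
    ILD: left/right frame energy ratio in dB, following Eq. (3).
    """
    l = left - left.mean()
    r = right - right.mean()
    denom = np.sqrt(np.sum(l ** 2) * np.sum(r ** 2)) + 1e-12

    corr = np.correlate(l, r, mode="full") / denom       # linear cross-correlation
    lags = np.arange(-(len(r) - 1), len(l))               # candidate lags in samples
    keep = np.abs(lags) <= int(max_lag_ms * 1e-3 * fs)
    itd = lags[keep][np.argmax(corr[keep])] / fs           # Eq. (1): arg-max over lags

    ild_db = 20.0 * np.log10((np.sum(left ** 2) + 1e-12) /
                             (np.sum(right ** 2) + 1e-12))  # Eq. (3)
    return itd, ild_db
```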

Interaural coherence (IC) [3, 50] is considered a salient feature for analysing the similarity and the strength of correlation between the two ear signals.

$$\begin{aligned} {\hbox {IC}(t)}=\frac{\emptyset _{\mathrm{l,r}} \left( {{\upomega ,{t}}} \right) }{\sqrt{\emptyset _{\mathrm{l,l}} {(\omega ,{t}).}\emptyset _{\mathrm{r,r}} {(\omega ,{t})}}} \end{aligned}$$
(4)

where \(\emptyset _{\mathrm{l,l}} {(\omega ,{t})}\) and \(\emptyset _{\mathrm{r,r}} {(\omega ,{t})}\) represent the auto-power spectral densities (APSD) of the left and right ears, respectively, and \(\emptyset _{\mathrm{l,r}} \left( {{\omega ,{t}}} \right) \) represents the cross-power spectral density (CPSD) of the two time-aligned input channels.
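
The following sketch illustrates one way the coherence of Eq. (4) could be approximated using Welch auto- and cross-power spectral density estimates; the frequency averaging and the choice of nperseg are assumptions for illustration, not the exact gammatone-domain computation.

```python
import numpy as np
from scipy.signal import csd, welch

def interaural_coherence(left, right, fs, nperseg=256):
    """Frame-level interaural coherence in the spirit of Eq. (4).

    Returns the magnitude coherence |CPSD| / sqrt(APSD_l * APSD_r),
    averaged over frequency for a single scalar summary.
    """
    _, p_lr = csd(left, right, fs=fs, nperseg=nperseg)    # cross-PSD
    _, p_ll = welch(left, fs=fs, nperseg=nperseg)          # auto-PSD, left
    _, p_rr = welch(right, fs=fs, nperseg=nperseg)         # auto-PSD, right
    ic = np.abs(p_lr) / np.sqrt(p_ll * p_rr + 1e-12)
    return float(np.mean(ic))
```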

The IPD model [24, 50] provides robust binaural information at high temporal resolution. The interaural transfer function (ITF) is computed from the left–right pair of complex Gammatone filter outputs, \({g}_{\mathrm{l}} \left( {t} \right) \) and \({g}_{\mathrm{r}} \left( {t} \right) \). The computed ITF is complex-valued, carrying both amplitude and phase information, and is given by,

$$\begin{aligned} {\hbox {ITF}(t)}={g}_{\mathrm{l}} \left( {t} \right) {.} \overline{{{g}_{\mathrm{r}} \left( {t} \right) }} ={A}_{\mathrm{l}} \left( {t} \right) .{A}_{\mathrm{r}} \left( {t} \right) .\hbox {e}^{{j(}\emptyset _{\mathrm{l}} \left( {t} \right) -\emptyset _{\mathrm{r}} \left( {t} \right) {)}} \end{aligned}$$
(5)

where \({A}_{\mathrm{l}} \left( {t} \right) \) and \({A}_{\mathrm{r}} \left( {t} \right) \) represent the amplitude information, whereas \(\emptyset _{\mathrm{l}} \left( {t} \right) \) and \(\emptyset _{\mathrm{r}} \left( {t} \right) \) represent the phase information of the ITF for the left and right channels, respectively. The temporally smoothed IPD is obtained from the low-pass filtered ITF [24, 50] and is given as,

$$\begin{aligned} {\hbox {IPD}(t)=\arg ([\hbox {ITF}}\left( {t}\right) ]_{\mathrm{lp}} ) \end{aligned}$$
(6)
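
A small sketch of Eqs. (5)–(6) is shown below; the first-order low-pass smoother and its cut-off value are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def interaural_phase_difference(g_left, g_right, fs, cutoff_hz=30.0):
    """IPD per Eqs. (5)-(6) from complex gammatone channel outputs.

    The ITF = g_l * conj(g_r) is smoothed with a simple one-pole low-pass
    filter before taking its argument.
    """
    itf = g_left * np.conj(g_right)                    # Eq. (5)
    alpha = np.exp(-2.0 * np.pi * cutoff_hz / fs)      # one-pole smoothing factor
    itf_lp = lfilter([1.0 - alpha], [1.0, -alpha], itf)
    return np.angle(itf_lp)                            # Eq. (6)
```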

3.1.2 Direct-to-Reverberant Ratio (DRR)

The distance estimation of a sound source is closely associated with the energy ratio between the direct and reverberant sound signals. The direct-to-reverberant ratio depends on two factors, namely the acoustic properties of the room and the source-to-receiver configuration (i.e. distance and orientation). The DRR is one of the most widely analysed parameters for estimating the distance between source and receiver [30, 31]. Further, it is observed that the DRR decreases with increasing distance between source and receiver and is also affected by room properties such as the reverberation time and room volume. It is expressed in dB and defined as,

$$\begin{aligned} \hbox {DRR}=10\,{\log }_{10}\frac{\sum _{k=0}^{S_{\mathrm{d}}}\left| h[k]\right| ^{2}}{\sum _{k=S_{\mathrm{d}}+1}^{\infty }\left| h[k]\right| ^{2}} \end{aligned}$$
(7)

where \({S}_{\mathrm{d}} \) represents the sample length of the direct sound arrival, and h[k] represents the room impulse response.
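
As an illustration of Eq. (7), the sketch below computes the DRR from a measured room impulse response; the length of the direct-path window (direct_ms) is an assumed value.

```python
import numpy as np

def direct_to_reverberant_ratio(h, fs, direct_ms=2.5):
    """DRR in dB following Eq. (7).

    The direct-path region is taken as a short window (direct_ms) after
    the strongest peak of the room impulse response h.
    """
    peak = np.argmax(np.abs(h))
    s_d = peak + int(direct_ms * 1e-3 * fs)          # end of direct sound
    direct_energy = np.sum(h[:s_d + 1] ** 2)
    reverb_energy = np.sum(h[s_d + 1:] ** 2) + 1e-12
    return 10.0 * np.log10(direct_energy / reverb_energy)
```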

Fig. 2 Deep recurrent neural network structure with different layers

3.1.3 DRNN-Based Joint Discriminative Training Classifier

The complexity of recognizing a target speaker in a reverberant environment depends on two important factors, namely the number of target speakers and the nature of the noise sources [16, 60]. The computational goal of adopting a deep learning model in this study is mainly to separate the target speech source from the input acoustic mixture. The concatenated features, namely the binaural cues and the direct-to-reverberant ratio, are given as input features. The deep recurrent structure is characterized by the temporal connections of a recurrent neural network, and the model is trained to predict the magnitude spectra of the output targets. The optimized deep learning structure with its different layers is shown in Fig. 2.

The network parameters are updated using the back-propagation through time (BPTT) method. The limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm [15, 16] is used to train the models during the training phase. The DRNN model parameters are estimated using the error back-propagation algorithm with stochastic gradient learning in order to achieve state-of-the-art performance in the neural network structure. The time–frequency masking function is integrated as one of the layers in the neural network structure, which reduces the computational complexity. Assume a network of M intermediate layers with a recurrent connection at the kth layer. The hidden activation at this layer [16, 55] is calculated on the basis of the current input at time j by using Eq. (8).

$$\begin{aligned} h^{k}\left( x_{j}\right)= & {} f_{h}\left( x_{j},h_{j-1}^{k}\right) \nonumber \\= & {} \gamma _{k}\left( U^{k}h_{j-1}^{k}+W^{k}\,\gamma _{k-1}\left( W^{k-1}\left( \ldots \gamma _{1}\left( W^{1}x_{j}\right) \right) \right) \right) \end{aligned}$$
(8)

where \({f}_{\mathrm{h}} \) represents a state transition function and \({x}_{{j}} \) is given as input to the neural network at time j. \({U}^{{k}}\) and \({W}^{{k}}\) are the two weight matrices for the kth layer and recurrent connection at that layer, respectively. \({\gamma }_{{k}}\)(.) is the element-wise nonlinear function in kth layer. The output function is given by,

$$\begin{aligned} {y}={f}_{\mathrm{o}}\big ({h}_{{i}}^{{k}}\big ) \end{aligned}$$
(9)

where \({f}_{\mathrm{o}} \) represents an output function.
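
The recurrence of Eqs. (8)–(9) can be sketched in a few lines of numpy; the sketch below assumes ReLU as the element-wise nonlinearity and omits biases, so it illustrates the structure rather than the trained model.

```python
import numpy as np

def drnn_forward(x_seq, Ws, U_k, k, Wo):
    """Forward pass of a simple DRNN with one recurrent layer at depth k.

    Ws is a list of feed-forward weight matrices W^1..W^M, U_k is the
    recurrent weight matrix of layer k, and Wo maps the top hidden state
    to the output (Eq. 9).
    """
    relu = lambda z: np.maximum(z, 0.0)
    h_k = np.zeros(Ws[k - 1].shape[0])          # recurrent state h^k_{j-1}
    outputs = []
    for x_j in x_seq:
        a = x_j
        for W in Ws[:k - 1]:                    # layers below the recurrent one
            a = relu(W @ a)
        h_k = relu(U_k @ h_k + Ws[k - 1] @ a)   # Eq. (8)
        a = h_k
        for W in Ws[k:]:                        # layers above the recurrent one
            a = relu(W @ a)
        outputs.append(Wo @ a)                  # Eq. (9): output function
    return np.array(outputs)
```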

The soft time–frequency masking phase is carried out immediately after the training phase in order to improve the intelligibility of the speech signal. Recent reviews report that soft masking can be used to reduce artifacts and to improve smoothness in the predicted results. It is applied to the predicted magnitude spectrogram for reconstruction, which is followed by the inverse short-time Fourier transform.

The soft time–frequency masking, \({S}_{{j}} {(f)}\) is given by

$$\begin{aligned} {S}_{{j}} \left( {f} \right) =\frac{\left| {{y}_{{1j}} {(f)}} \right| }{\left| {{y}_{{1j}} {(f)}} \right| +\left| {{y}_{{2j}} {(f)}} \right| }\ \end{aligned}$$
(10)

where \({y}_{{1j}} \) and \({y}_{{2j}} \) represent the obtained output predictions. The soft masking is applied to the magnitude spectra, \({T}_{{j}} \left( {f} \right) \) of the original mixture signal in order to obtain segregated spectra, \({s}_{{1}}^{{{\prime }}} \) and \({s}_{{2}}^{{{\prime }}} \) which are given by,

$$\begin{aligned} {s}_{{1}}^{{\prime }} \left( {f} \right)= & {} {S}_{{j}} \left( {f} \right) {T}_{{j}} \left( {f} \right) \nonumber \\ {s}_{{2}}^{{\prime }} \left( {f} \right)= & {} {(1-S}_{{j}} \left( {f} \right) {)T}_{\mathrm{j}} \left( {f} \right) \end{aligned}$$
(11)

The time-domain signal can be obtained by applying the inverse short-time Fourier transform. Further, the signal-to-interference ratio (SIR) can be improved by applying discriminative training criteria, such as the mean squared error (MSE) and the Kullback–Leibler (KL) divergence [15, 16]. DRNN-based speech segregation is an appropriate technique for capturing dynamic temporal behaviour. The spectrograms of the input mixture and the segregated speech signals are shown in Fig. 3a–c, respectively.
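
A minimal sketch of the soft masking and reconstruction steps of Eqs. (10)–(11) is given below, assuming the network's magnitude predictions and the complex mixture STFT are available; the mixture phase is reused for reconstruction (the inverse STFT itself is not shown).

```python
import numpy as np

def soft_mask_separation(y1_mag, y2_mag, mixture_stft):
    """Soft time-frequency masking per Eqs. (10)-(11).

    y1_mag and y2_mag are the magnitude spectrograms predicted for the two
    sources; the mask is applied to the complex mixture STFT so that the
    mixture phase is kept for the later inverse STFT.
    """
    mask = np.abs(y1_mag) / (np.abs(y1_mag) + np.abs(y2_mag) + 1e-12)  # Eq. (10)
    s1_hat = mask * mixture_stft                                       # Eq. (11)
    s2_hat = (1.0 - mask) * mixture_stft
    return s1_hat, s2_hat
```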

Fig. 3 Spectrogram of segregated speech sources using DRNN-based joint discriminative training. a Mixture of two sources; b, c segregated source signals

3.2 Speaker Identification Module

The speaker identification process operates on the signals produced by the speech segregation module. The identification module includes three stages, namely feature extraction, speaker modelling and pattern classification-based decision making [23]. The present study proposes Gabor filter banks for the extraction of monaural features. Further, it uses the Gaussian mixture model–universal background model (GMM–UBM) and I-vector methods for the recognition of a speaker [44]. The training phase captures the distribution of the extracted features with one or more statistical models. Unidentified utterances are then classified in the recognition phase on the basis of their similarity to the corresponding speaker model [23].

3.2.1 Energy-Based Voice Activity Detector (VAD)

All speech samples are pre-processed by down-sampling to 8 kHz, pre-emphasis and removal of silent regions [44]. The energy of each frame of a given speech utterance is calculated, and an empirical threshold is then chosen from the frame energies. Accurate discrimination between speech and non-speech regions is achieved by an energy-based detector that rejects frames whose energy falls below the threshold value.
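
A simple sketch of such an energy-based detector is shown below; the frame length, hop and the relative threshold value are assumptions for illustration.

```python
import numpy as np

def energy_vad(signal, fs, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Flag speech frames whose energy stays above an empirical threshold.

    threshold_db is relative to the most energetic frame of the utterance.
    """
    frame = int(frame_ms * 1e-3 * fs)
    hop = int(hop_ms * 1e-3 * fs)
    starts = range(0, len(signal) - frame + 1, hop)
    energies = np.array([np.sum(signal[s:s + frame] ** 2) for s in starts])
    energies_db = 10.0 * np.log10(energies + 1e-12)
    keep = energies_db > (energies_db.max() + threshold_db)
    return keep          # boolean mask of speech frames
```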

3.2.2 Gabor Hilbert Envelope Coefficients (GHEC)

The present study proposes the Gabor Hilbert Envelope Coefficients (GHEC) method, in which Gabor filters are convolved with the Hilbert envelope, as illustrated in Fig. 4. The spectral, temporal and spectro-temporal characteristics are extracted using a set of 41 Gabor filters. The feature extraction process uses local patches of a 26-channel log-Mel scaled spectrogram. The log-Mel spectrogram reflects basic qualities of the human auditory system, including its resolution across frequencies and its logarithmic intensity perception [45]. The extracted feature components depend on the output of the Gabor filters and its convolution with the Hilbert envelope. The Hilbert envelope approximates the envelope of the auditory nerve response at particular centre frequencies. The output of the Hilbert transform, \({H}_{\mathrm{t}} \left( {{s},\hbox {i}} \right) \), contains both the real signal and its transform, which are used to obtain the envelope, \({H}_{\mathrm{e}} \left( {{s},\hbox {i}} \right) \).

$$\begin{aligned} {H}_{\mathrm{t}} \left( {{s,i}} \right) ={G}_{\mathrm{c}} \left( {{s,i}} \right) +{iG}_{\mathrm{c}}^{\prime } {(s,i)} \end{aligned}$$
(12)

where \({G}_{\mathrm{c}} \left( {{s,i}} \right) \) and \({G}_{\mathrm{c}}^{{{\prime }}} {(s,i)}\) are the real and the Hilbert-transformed signal, respectively, and i is the imaginary unit. The Hilbert envelope [44], \({H}_{\mathrm{e}} \left( {{s},{i}} \right) \), is obtained by using Eq. (13).

$$\begin{aligned} {H}_{\mathrm{e}} \left( {{s,i}} \right) ={G}_{\mathrm{c}} \left( {{s,i}} \right) ^{{2}}+{G}_{\mathrm{c}}^{\mathrm{{\prime }}} {(s,i)}^{{2}} \end{aligned}$$
(13)

Fig. 4 The structure of the proposed envelope coefficients extraction

The Hilbert envelope is smoothed by a low-pass filter with a cut-off frequency of 20 Hz in order to remove undesired higher-frequency components. The smoothed envelope, \({H}_{\mathrm{es}} \left( {{s,i}} \right) \), is segmented into frames of 25 ms duration with a frame shift of 10 ms. Further, discontinuities at the edges of each frame are minimized by using a Hamming window. The sample means are estimated as,

$$\begin{aligned} {M}\left( {{t,i}} \right) =\frac{ 1 }{{N}}\mathop \sum \limits _{{s=0}}^{{N-1}} {w(s)H}_{\mathrm{es}} \left( {{s,i}} \right) \end{aligned}$$
(14)

where w(s) is a Hamming window. The natural logarithm is applied to the estimated parameter, \({M}\left( {{t,i}} \right) \), which serves as a channel normalization factor; it approximates the human perception of loudness and compresses the dynamic range [43]. In the final step, the discrete cosine transform (DCT) is used to perform two functions, namely converting the spectral features into a cepstrum and de-correlating the overlapping feature dimensions [44]. The first and second cepstral derivatives are calculated and appended to the features, yielding 57-dimensional feature vectors that capture the dynamic patterns.
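
The overall GHEC-style pipeline (Hilbert envelope, 20 Hz smoothing, 25 ms/10 ms Hamming framing, log compression and DCT) can be sketched as follows; the filter settings and the number of retained cepstral coefficients are assumptions, and the Gabor filtering stage is assumed to have already produced the band-filtered channel signals.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, get_window
from scipy.fft import dct

def envelope_cepstral_features(channel_outputs, fs, n_ceps=19):
    """GHEC-style envelope cepstra, simplified.

    channel_outputs: array of shape (n_channels, n_samples) holding the
    Gabor-filtered (or otherwise band-filtered) signals.
    """
    env = np.abs(hilbert(channel_outputs, axis=1))        # Hilbert envelope
    b, a = butter(2, 20.0 / (fs / 2.0))                   # 20 Hz low-pass smoothing
    env = filtfilt(b, a, env, axis=1)

    frame, hop = int(0.025 * fs), int(0.010 * fs)         # 25 ms frames, 10 ms shift
    win = get_window("hamming", frame)
    feats = []
    for s in range(0, env.shape[1] - frame + 1, hop):
        m = np.mean(win * env[:, s:s + frame], axis=1)    # Eq. (14) per channel
        feats.append(dct(np.log(m + 1e-12), norm="ortho")[:n_ceps])
    return np.array(feats)   # deltas and double-deltas would be appended afterwards
```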

3.2.3 GMM–UBM Model-Based Speaker Verification and Identification

Speaker recognition identifies an individual by analysing the spectral content of his/her speech signal [23, 41]. Generally, the performance of this process degrades as the reverberation time increases, since the source speech signal reaches the receiver after a series of reflections and diffractions in a reverberant room. In the training phase, the 57-dimensional Gabor Hilbert envelope coefficient (GHEC) features are modelled using a Gaussian mixture model with a universal background model (GMM–UBM) [23, 41]. In the literature, Mel-frequency cepstral coefficients (MFCCs) as well as Gabor filter bank features (GBFB) [45, 46] have been extensively applied in many speech signal processing applications, including speech, emotion and language recognition as well as speaker recognition. The performance of the proposed Gabor Hilbert envelope coefficients (GHEC) is compared with well-known existing methods, including Mel-frequency cepstral coefficients (MFCC) and Gammatone frequency cepstral coefficients (GFCC). Cepstral mean and variance normalization (CMVN) is applied to the monaural features in order to reduce the channel influence and to increase the robustness of the automatic speaker recognition system. The Gaussian mixture model (GMM) [23, 41] is a stochastic model comprising a weighted sum of M multivariate Gaussian component densities. For a D-dimensional feature vector x, the Gaussian mixture model is specified by its probability density function, given by,

$$\begin{aligned} {p(x}{|\lambda )}=\mathop \sum \limits _{{i=1}}^{{M}} {w}_{{i}} {p}_{\mathrm{i}} {(x)} \end{aligned}$$
(15)

where \({w}_{{i}} \) and \({p}_{{i}} {(x)}\) denote the mixture weights and component densities, respectively, with

$$\begin{aligned} p_{i}(x)=\frac{1}{\left( 2\pi \right) ^{D/2}\left| \Sigma _{i}\right| ^{1/2}}\exp \left\{ -\frac{1}{2}\left( x-\mu _{i}\right) ^{\prime }\Sigma _{i}^{-1}\left( x-\mu _{i}\right) \right\} \end{aligned}$$

The unimodal Gaussian densities are parameterized by a \({D}\times 1\) mean vector, \({\mu }_{{i}} \), and a \({D} \times {D}\) covariance matrix, \(\Sigma _{i}\). The parameters of the density model are defined as \({\lambda } =\{{w}_{{i}} ,{\mu }_{{i}} ,\Sigma _{i}\}\), and the mixture weights are required to satisfy the constraint \(\sum \nolimits _{{i=1}}^{{M}} {w}_{{i}} =1\). In this study, a 64-component universal background model (UBM) is trained with the expectation-maximization (EM) algorithm. The mean vectors of the universal background model are adapted with a relevance factor of 19. The speaker models are trained using the maximum a posteriori (MAP) method in order to promote consistency. The adapted mean vectors are concatenated into a supervector for each speaker in the enrolment set, and eventually a target speaker model is constructed [23, 41].
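
A minimal sketch of the UBM training and mean-only MAP adaptation described above is given below, using scikit-learn's GaussianMixture; the mixture count and relevance factor follow the values stated in the text, while the mean-only adaptation is the usual GMM–UBM choice rather than a detail confirmed by the source.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm_and_adapt(background_feats, speaker_feats, n_mix=64, relevance=19.0):
    """Train a UBM with EM and MAP-adapt its means to one speaker."""
    ubm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                          max_iter=100, random_state=0).fit(background_feats)

    # Posterior (occupation) probabilities of the speaker's frames.
    post = ubm.predict_proba(speaker_feats)             # shape (T, n_mix)
    n_i = post.sum(axis=0) + 1e-12                      # soft counts
    ex_i = post.T @ speaker_feats / n_i[:, None]        # first-order statistics

    alpha = n_i / (n_i + relevance)                     # adaptation coefficients
    adapted_means = alpha[:, None] * ex_i + (1.0 - alpha)[:, None] * ubm.means_
    return ubm, adapted_means   # adapted means can be stacked into a supervector
```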

In the recognition phase, the log-likelihood ratio (LLR) score [23, 41] for the given test feature vectors, X, is estimated from two models, namely the target speaker model (\({\gamma }_{\mathrm{tar}} \)) and the universal background model (\({\gamma }_{\mathrm{impost}} \)), and the ratio is derived as,

$$\begin{aligned} \hbox {LLR}_{\mathrm{score}} ={\log }\left( {{X}{|\gamma }_{\mathrm{tar}} } \right) -{\log }\left( {{X}{|\gamma }_{\mathrm{impost}} } \right) \end{aligned}$$
(16)

where \({\gamma }_{\mathrm{tar}} \) is the model of the target speaker and \({\gamma }_{\mathrm{impost}} \) is the model of utterances that do not belong to the target speaker. Speaker verification decides whether a speech source signal is accepted or rejected, mainly on the basis of the decision threshold \(\theta \),

$$\begin{aligned} \hbox {LLR}_{\mathrm{score}} = \left\{ {{\begin{array}{l} {\ge \theta \, \hbox {accepted}} \\ {<\theta \, \hbox {rejected}} \\ \end{array} }} \right. \end{aligned}$$
(17)

Score normalization is performed with respect to the universal background model (UBM) by shifting the log-likelihood scores obtained from the various feature vectors. It is applied to reduce the score variability across different speakers and sessions, improves accuracy and provides a common (speaker-independent) decision threshold. In addition, Z-normalization [23, 41] is carried out for further enhancement and is given by,

$$\begin{aligned} \hbox {LLR}_{\mathrm{norm}} =\frac{{\hbox {LLR}}_{\mathrm{raw}} -{\mu }}{{\sigma }} \end{aligned}$$
(18)

where \({\mu }\) and \({\sigma }\) denote the mean and standard deviation of the impostor scores of a speaker, respectively.
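
The scoring and Z-normalization steps of Eqs. (16)–(18) can be sketched as follows, assuming the target and background models expose a per-frame log-likelihood function such as scikit-learn's score_samples (the target model being, e.g., the MAP-adapted GMM from the enrolment step).

```python
import numpy as np

def llr_score(test_feats, target_gmm, ubm):
    """Average-frame log-likelihood ratio in the spirit of Eq. (16)."""
    return np.mean(target_gmm.score_samples(test_feats)
                   - ubm.score_samples(test_feats))

def z_norm(raw_score, impostor_scores):
    """Z-normalization of Eq. (18) using impostor score statistics."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores) + 1e-12
    return (raw_score - mu) / sigma
```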

3.2.4 I-Vector-Based Speaker Recognition System

For I-vector-based speaker verification, the experiment uses 57-dimensional Gabor Hilbert envelope features with appended delta coefficients, extracted as acoustic features from the speech material. The I-vector, a low-dimensional representation of the Gaussian mixture model (GMM) supervector, was introduced as a major refinement of existing speaker recognition systems [9, 10]. I-vector extraction along with Gaussian probabilistic linear discriminant analysis (GPLDA) has been experimentally shown to be an enhanced and computationally efficient technique in comparison with conventional joint factor analysis (JFA) and support vector machines (SVM) [21]. In general, channel and session variability refers to a mismatch between training and test utterances induced by various factors, including noise sources, variations in the voice of the speaker and environmental conditions. It can be compensated by various methods, such as within-class covariance normalization (WCCN), linear discriminant analysis (LDA) and source-normalized weighted linear discriminant analysis (SN-WLDA) [22]. Joint factor analysis (JFA) [9, 10] is based on the decomposition of the speaker-dependent Gaussian mixture supervector, k, into separate speaker- and channel-dependent components, S and C, respectively, given as,

$$\begin{aligned} k=S+C \end{aligned}$$
(19)

where \(S =m+Vy+Dz\) and \(C=Ux\); m is a session- and speaker-independent supervector extracted using the universal background model (UBM); x, y and z are the speaker- and session-dependent factors in their respective subspaces. V and D specify the speaker subspace, whereas U represents the session subspace.

The I-vector model has shown significantly better performance, especially for short utterances (< 10 s). Its total variability space represents speaker and channel variability simultaneously [21]. The speaker- and channel-dependent Gaussian mixture supervector, k, in an I-vector-based speaker recognition system is computed as,

$$\begin{aligned} k=m+Tw \end{aligned}$$
(20)

where m is the session- and speaker-independent universal background model (UBM) supervector, T is a low-rank rectangular matrix representing the primary directions of variability across all development data, and w denotes the I-vector, a latent random vector with a standard normal prior.

3.2.4.1 Within-Class Covariance Normalization Along with Linear Discriminant Analysis

Within-class covariance normalization (WCCN) is used to compensate for the dimensions of high within-class variance. However, while reducing within-class variability, WCCN also attenuates dimensions carrying between-class variance, which is considered a major demerit. This can be overcome by combining within-class covariance normalization (WCCN) with linear discriminant analysis (LDA). The combined WCCN \(+\) LDA compensation [21, 22] minimizes the within-class variance as well as maximizes the between-class variance and is derived from the following eigenvalue decomposition, denoted as,

$$\begin{aligned} {V}_{\mathrm{b}} {v}=\lambda {V}_{\mathrm{w}} {v} \end{aligned}$$
(21)

Linear discriminant analysis (LDA) [21] is computed using the between-class scatter (\({V}_{\mathrm{b}} \)) and the within-class scatter (\({V}_{\mathrm{w}} \)), given as,

$$\begin{aligned} {V}_{\mathrm{b}}= & {} \mathop \sum \limits _{{s=1}}^{{ S}} {n}_{\mathrm{s}} {(} {\overline{\mu } } _{\mathrm{s}} - \overline{{w}} )({\overline{\mu } } _{\mathrm{s}} -\overline{{w}} )^{{T}}, \nonumber \\ {V}_{\mathrm{w}}= & {} \mathop \sum \limits _{{s=1}}^{{ S}} \mathop \sum \limits _{{i=1}}^{{n}_{\mathrm{s}} } {(w}_{{i}}^{\mathrm{s}} -{\overline{\mu }} _{\mathrm{s}} {)(w}_{{i}}^{\mathrm{s}}-{\overline{\mu }} _{\mathrm{s}} {)}^{{T}}, \end{aligned}$$
(22)

where S is the total number of speakers, \({w}_{{i}}^{\mathrm{s}} \) denotes the I-vector of the ith session of speaker s and \({n}_{\mathrm{s}}\) is the number of utterances of speaker s. The mean I-vector, \(\overline{{\mu }} _{\mathrm{s}} \), of each speaker and the global mean, \(\overline{{w}} \), across all speakers are specified as,

$$\begin{aligned} \overline{{\mu }} _{\mathrm{s}}= & {} \frac{ 1 }{{n}_{\mathrm{s}} }\mathop \sum \limits _{{i=1}}^{{n}_{\mathrm{s}} } {w}_{\mathrm{i}}^{\mathrm{s}} \nonumber \\ \overline{{w}}= & {} \frac{ 1 }{{N}}\mathop \sum \limits _{{s=1}}^{\mathrm{S}} \mathop \sum \limits _{{i=1}}^{{n}_{\mathrm{s}} } {w}_{\mathrm{i}}^{\mathrm{s}} \end{aligned}$$
(23)

where N is the total number of sessions. As described, linear discriminant analysis (LDA) produces a reduced set of axes A through eigenvalue decomposition, whereas the WCCN transformation matrix, B, is derived by Cholesky decomposition, \({BB}^{{T}}={W}^{{-1}}\), where W is computed by using,

$$\begin{aligned} {W}=\frac{ 1 }{{S}}\mathop \sum \limits _{{s=1}}^{{S}} \mathop \sum \limits _{{i=1}}^{{n}_{\mathrm{s}} } {(A}^{{T}}{(w}_{{i}}^{{s}} - \overline{{w}} _{{s}} {))(A}^{{T}}{(w}_{\mathrm{i}}^{{s}} - \overline{{w}} _{\mathrm{s}} {))}^{{T}} \end{aligned}$$
(24)

The resultant WCCN [LDA] is obtained by computing,

$$\begin{aligned} {\hbox {WCCN}\,\hbox {[LDA]}}={B}^{{T}}{A}^{{T}}{W} \end{aligned}$$
(25)
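
A numpy sketch of the LDA + WCCN estimation of Eqs. (21)–(25) is given below; the LDA output dimension, the small regularization terms and the per-speaker normalization inside the WCCN estimate are assumptions following the usual formulation, not details confirmed by the source.

```python
import numpy as np
from scipy.linalg import eigh, cholesky

def lda_wccn(ivectors, labels, n_axes=150):
    """Estimate LDA axes A and WCCN matrix B from labelled i-vectors.

    The combined projection is then applied as B.T @ A.T @ w to an i-vector w.
    """
    speakers = np.unique(labels)
    global_mean = ivectors.mean(axis=0)
    d = ivectors.shape[1]
    Vb, Vw = np.zeros((d, d)), np.zeros((d, d))
    for s in speakers:
        ws = ivectors[labels == s]
        mu_s = ws.mean(axis=0)
        diff = (mu_s - global_mean)[:, None]
        Vb += len(ws) * (diff @ diff.T)                 # between-class scatter, Eq. (22)
        Vw += (ws - mu_s).T @ (ws - mu_s)               # within-class scatter, Eq. (22)

    # Eq. (21): generalised eigen-decomposition V_b v = lambda V_w v
    eigvals, eigvecs = eigh(Vb, Vw + 1e-6 * np.eye(d))
    A = eigvecs[:, np.argsort(eigvals)[::-1][:n_axes]]  # top LDA axes

    # Eq. (24): within-class covariance in the LDA-projected space
    W = np.zeros((n_axes, n_axes))
    for s in speakers:
        ws = (ivectors[labels == s] - ivectors[labels == s].mean(axis=0)) @ A
        W += ws.T @ ws / len(ws)
    W /= len(speakers)
    B = cholesky(np.linalg.inv(W + 1e-6 * np.eye(n_axes)), lower=True)  # BB^T = W^-1
    return A, B
```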

3.2.4.2 Gaussian Probabilistic Linear Discriminant Analysis (GPLDA) Classifier

In the literature, a significant number of works have presented probabilistic linear discriminant analysis (PLDA)-based I-vector speaker recognition systems that model session and speaker variability within the I-vector space effectively. Recently, the length-normalized Gaussian probabilistic linear discriminant analysis (GPLDA) approach was introduced, which converts the I-vector feature behaviour from heavy-tailed to Gaussian [21, 22]. The GPLDA-based I-vector speaker recognition technique involves I-vector extraction, session variability compensation and likelihood ratio scoring. The results of the proposed technique and the baseline methods on the TIMIT dataset [11] are given in Table 1. I-vectors are extracted from the Gabor Hilbert envelope features by using front-end factor analysis, and the GPLDA classifier is applied to the channel-compensated I-vector features [21, 22]. The speaker- and channel-dependent length-normalized I-vector w can be defined as,

$$\begin{aligned} w= \overline{{w}} +{U}_{{1}} {x}_{{1}} {+U}_{{2}} {x}_{{2}} +{\gamma }_{\mathrm{r}} \end{aligned}$$
(26)

where \({\gamma }_{\mathrm{r}} \) is the speaker residual with zero mean, and \({U}_{{1}} \) and \({U}_{{2}} \) are the eigenvoice and eigenchannel matrices, respectively.

The Gaussian probabilistic linear discriminant analysis (GPLDA) scoring [21] is computed using the batch likelihood ratio between the target and test I-vectors. It is calculated as,

$$\begin{aligned} {\ln }\frac{{P(w}_{\mathrm{targ,}} {w}_{\mathrm{test}} {|H}_{{1}} {)}}{{P(w}_{\mathrm{targ}} {|H}_{{0}} {)P(w}_{\mathrm{test}} {|H}_{{0}} {)}} \end{aligned}$$
(27)

where \({H}_{{1}} \) is the hypothesis that the speakers are the same and \({H}_{{0}}\) is the hypothesis that the speakers are different.

The equal error rate (EER) and detection cost function (DCF) are used as performance evaluation metrics [44]. The EER is the operating point at which the false acceptance rate (FAR) and the false rejection rate (FRR) are equal. The detection cost function is computed as a weighted sum of the two error probabilities and is defined as

$$\begin{aligned} {\hbox {DCF}}={C}_{\mathrm{miss}} {E}_{\mathrm{miss}} {P}_{\mathrm{target}} +{C}_{\mathrm{FA}} {E}_{\mathrm{FA}} {(1-P}_{\mathrm{target}} {)} \end{aligned}$$
(28)

where \({C}_{\mathrm{miss}} =10\) and \({C}_{\mathrm{FA}} =1\) represent the cost factors, \({P}_{\mathrm{target}} =0.01\) is the prior probability of a target, and \({E}_{\mathrm{miss}}\) and \({E}_{\mathrm{FA}} \) denote the probability of a miss and of a false alarm, respectively.
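
A straightforward sketch of how the EER and the minimum DCF of Eq. (28) could be computed from a set of trial scores is shown below; the exhaustive threshold sweep is written for clarity rather than efficiency.

```python
import numpy as np

def eer_and_dcf(scores, labels, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """EER and minimum DCF (Eq. 28) from verification scores.

    labels: 1 for target trials, 0 for impostor trials.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    thresholds = np.sort(scores)
    p_miss = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    p_fa = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])

    idx = np.argmin(np.abs(p_miss - p_fa))
    eer = 0.5 * (p_miss[idx] + p_fa[idx])                # FAR == FRR point
    dcf = np.min(c_miss * p_miss * p_target
                 + c_fa * p_fa * (1.0 - p_target))       # Eq. (28), minimized over thresholds
    return eer, dcf
```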

Table 1 shows only a minimal change in the computed detection cost function between the joint factor analysis and I-vector-based techniques. The equal error rate (EER) obtained with within-class covariance normalization (WCCN) combined with LDA is found to be relatively better than that of the other methods. The EER of Gaussian probabilistic linear discriminant analysis is better than that of joint factor analysis. It should be noted that the purpose of the compensation techniques is to promote efficiency in speaker discrimination and to attenuate channel effects/variability. It is also observed that the equal error rate increases as the length of the test utterance decreases.

Table 1 Comparative analysis of joint factor analysis and I-vector-based techniques for different truncations of the testing and training utterances under a reverberant condition (\(\hbox {RT} = 0.38\) s)

4 Results and Discussions

The speech signals are convolved with binaural room impulse responses (BRIRs) obtained from the Aachen Impulse Response (AIR) database for different rooms [18] and from [24]. The study also uses impulse responses obtained from the University of Surrey [17, 19] for four different reverberant rooms (A, B, C and D), covering azimuths between − 90\(^{\circ }\) and 90\(^{\circ }\) spaced by 5\(^{\circ }\) at a distance of 1.5 m. The TIMIT database [11] is composed of high-quality read speech collected from a total of 630 speakers (192 female and 438 male). Each speaker supplies 10 short, phonetically rich English sentences, and the average quantity of speech available per speaker is 30 s. In this study, 9 utterances per speaker are used for training and the remaining utterance is used as a test sample. Similarly, 530 speakers are used for background model training, and the remaining 100 speakers are used as test samples. The NOIZEUS dataset [14] provides the various noises used in the experiments. The binaural signals and deep learning algorithms are computed on a workstation (ThinkStation P300) with an Intel Xeon (E3-1271) 3.6 GHz processor, 32 GB of RAM and a dedicated NVIDIA (Quadro K620) graphics card. The software is MATLAB (R2015a) running on the Windows 7 operating system.

4.1 Module 1: Feature Extraction and Classification-Based Speech Segregation

In this module, the binaural and direct-to-reverberant ratio cues are chosen to generate the soft time–frequency mask and to address the issues arising during the binaural source separation process. The concatenated mixture of these features is given as input to the deep recurrent classifier, the DRNN. Each layer in a typical deep neural network is enhanced with temporal feedback loops in order to turn the existing network structure into a deep recurrent neural network. In this study, the deep recurrent structure is implemented with three hidden layers of 1000 hidden units each, integrated with the joint discriminative training criterion.

The number of neural network parameters, such as weights and biases, increases as the number of input feature dimensions increases. These network parameters are updated through back-propagation through time (BPTT), and the number of epochs is set to 500. Each layer in the network is provided with temporal context information; thus each layer in the recurrent structure is updated with new information that travels upwards, ensuring a hierarchical architecture. Each layer in the hierarchy is characterized by a recurrent neural network. The mean square error (MSE) is computed for each feature vector of the network structure in order to produce a better signal-to-noise ratio and to optimize the network parameters. The limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm [15, 16] is used during the optimization stage to train the networks from random initialization. Further, long short-term memory (LSTM) units are explored in the recurrent structure, which makes it possible to store and recall temporal information over time and to handle the vanishing gradient problem.

It should be noted that the best combination of concatenated features is selected by estimating the output signal-to-noise ratio and the HIT−FA (hit minus false alarm) rate [11]. The correctly identified speech-dominant time–frequency (T–F) units determine the HIT rate, and the wrongly classified noise-dominant T–F units determine the false alarm (FA) rate. The binaural cues obtained from the binaural auditory front-end are combined with the direct-to-reverberant ratio. The various combinations of the resultant concatenated auditory features are validated by estimating the output signal-to-noise ratio as well as the HIT−FA rate. It is presumed that a better combination of binaural features can further improve the performance of the deep recurrent neural network-based speech segregation process.
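
The HIT−FA criterion can be computed from an estimated and an ideal binary mask as in the sketch below; the masks are assumed to be boolean time–frequency matrices of equal size.

```python
import numpy as np

def hit_fa(estimated_mask, ideal_mask):
    """HIT - FA rate from an estimated and an ideal binary T-F mask.

    HIT: fraction of speech-dominant (ideal = 1) units correctly kept.
    FA:  fraction of noise-dominant (ideal = 0) units wrongly kept.
    """
    est = estimated_mask.astype(bool)
    ideal = ideal_mask.astype(bool)
    hit = np.mean(est[ideal]) if ideal.any() else 0.0
    fa = np.mean(est[~ideal]) if (~ideal).any() else 0.0
    return hit - fa
```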

All the segregation-related experiments are carried out using the impulse responses associated with four different reverberant rooms, i.e. A, B, C and D [22, 34]. Babble noise is chosen as the noise source because of its strong masking effect on speech. The babble noise is spread across the speech spectrum at azimuths between − 90\(^{\circ }\) and 90\(^{\circ }\) spaced by 5\(^{\circ }\) at a distance of 1.5 m and is used to train the deep recurrent neural network. An untrained interference angle of 15\(^{\circ }\) is considered for testing in all the experiments. The results of the classification-based speech segregation analysis, obtained at a reverberation time (\(T_{60}\)) of 0.32 s, are shown in Tables 2, 3 and 4.

The signal-to-noise ratio is one of the most popular evaluation metrics for expressing the performance of a source segregation system [60], and it is given by,

$$\begin{aligned} \hbox {SNR}=10\,{\log }_{{10}} \left[ {\frac{\mathop \sum \nolimits _{\mathrm{m}} {x}^{{2}}{(m)}}{\mathop \sum \nolimits _{\mathrm{m}} \left[ {\hat{{x}} {(m)-x(m)}} \right] ^{{2}}}} \right] \end{aligned}$$
(29)

where \({\hat{{x}}}{(m)}\) is the estimated target signal and x(m) is the target signal. The HIT−FA rate is not only considered one of the best evaluation criteria but is also widely used because it correlates well with human speech intelligibility.

The performance of the concatenated binaural cues and direct-to-reverberant ratio in the segregation process is validated by estimating the output signal-to-noise ratio, and the results are shown in Table 2. It is observed that the classifier-based segregation process produces better results as the dimensionality of the interaural level difference increases. Room reverberation and noise influence the performance of the segregation process more strongly than the output signal-to-noise ratio alone reflects. The computational complexity increases further when the dimensionality of the interaural level difference grows beyond 16 dimensions. A lower HIT−FA rate is observed when 16-dimensional interaural level differences are used on their own. A nonlinear relationship is observed between the interaural time difference dimensionality and the output signal-to-noise ratio. The best output HIT−FA rate and output signal-to-noise ratio are observed for the combination of 32-dimensional interaural time and phase differences, 16-dimensional level differences and the direct-to-reverberant ratio. The computational time increases when the dimensionality of the interaural level difference rises above 16-D. Only a minimal change in the output SNR is observed when the dimensionality of the interaural coherence is increased, so the 2-D interaural coherence (IC) is chosen as one of the concatenated features.

Table 2 Performance of combined binaural features along with direct-to-reverberant ratio-based source segregation under babble noise at − 5 dB

The next step involves the concatenation of the 83-dimensional feature set comprising the four binaural cues and the direct-to-reverberant ratio. These combined feature cues are then incorporated into the deep recurrent neural network-based joint discriminative training model.

The deep recurrent neural network is trained using 100 randomly chosen speakers from the TIMIT database, with 9 concatenated sentences per speaker. Testing is then carried out with the remaining unused sentence. The efficiency of the deep recurrent neural network is evaluated by estimating three performance metrics, namely the source-to-interference ratio (SIR), the source-to-artifacts ratio (SAR) and the source-to-distortion ratio (SDR).

Source to distortion ratio (SDR): It is defined as the ratio between the target source and the difference between the estimated and target source signals. A higher SDR denotes better performance [15, 16, 57].

$$\begin{aligned} \hbox {Source to distortion ratio}\,\left( {{\hbox {SDR}}} \right) =10{\log }_{{10}} \frac{\Vert {S}_{\mathrm{tar}}\Vert ^{{2}}}{\Vert {e}_{\mathrm{intf}} {+e}_{\mathrm{noise}} {+e}_{\mathrm{artif}} \Vert ^{{2}}} \end{aligned}$$
(30)

where \({S}_{\mathrm{tar}} \) denotes the target source signal, \({e}_{\mathrm{intf}} \) denotes interference from other sources, \({e}_{\mathrm{noise}} \) denotes deformation caused by noise, and \({e}_{\mathrm{artif}} \) denotes artifacts introduced by the separation algorithm.

Source to interference ratio (SIR): It carries information about errors caused by failures in eliminating the interfering signals [16, 38, 57] and it is specified as,

$$\begin{aligned} \hbox {Source to interference ratio}\,\left( {{\hbox {SIR}}} \right) ={10\log }_{{10}} \frac{\Vert {S}_{\mathrm{tar}}\Vert ^{{2}}}{\Vert {e}_{\mathrm{intf}} \Vert ^{{2}}} \end{aligned}$$
(31)

Source to artifacts ratio (SAR): It quantifies errors due to extraneous artifacts introduced during the source segregation process [16, 38, 57] and it is defined as,

$$\begin{aligned} \hbox {Source to artifacts ratio}\left( {{\hbox {SAR}}} \right) ={10\log }_{{10}} \frac{\Vert {S}_{\mathrm{tar}} {+e}_{\mathrm{intf}} {+e}_{\mathrm{noise}}\Vert ^{{2}}}{\Vert {e}_{\mathrm{artif}} \Vert ^{{2}}} \end{aligned}$$
(32)
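
For reference, the sketch below evaluates Eqs. (30)–(32) directly from the decomposed signal components; in practice the decomposition of an estimate into target, interference, noise and artifact terms is produced by a BSS Eval-style projection, which is not shown here.

```python
import numpy as np

def bss_metrics(s_target, e_interf, e_noise, e_artif):
    """SDR, SIR and SAR of Eqs. (30)-(32), given the decomposed components."""
    def energy(x):
        return np.sum(x ** 2) + 1e-12

    sdr = 10.0 * np.log10(energy(s_target) / energy(e_interf + e_noise + e_artif))
    sir = 10.0 * np.log10(energy(s_target) / energy(e_interf))
    sar = 10.0 * np.log10(energy(s_target + e_interf + e_noise) / energy(e_artif))
    return sdr, sir, sar
```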

In this study, both the masking and the discriminative training model are combined with the deep recurrent neural network in order to improve robustness, and the results are shown in Table 3. The analysis is carried out using the 83-dimensional (\(32\hbox {D ITD}+ 32\hbox {D IPD}+ 16\hbox {D ILD}+ 2\hbox {D IC}+\hbox {1D DRR}\)) cues at a reverberation time of 0.32 s in the presence of babble noise at 20 dB. It is observed that the addition of soft masking within the layered architecture of the network produces better performance, as evaluated in terms of SDR, SIR and SAR. The deep recurrent network is also validated with and without the discriminative training model, and the model with discriminative training outperforms the other structures. All the recurrent structures achieve higher performance metrics than the deep neural network (DNN) model; the temporal hierarchy of the deep recurrent network plays a major role in producing this better performance. Further, the improved performance is mainly due to the back-propagation of gradients with respect to the training objectives, which yields an optimized structure for the model. The present study attempts to further improve the existing deep recurrent network model by integrating both the masking and the discriminative criterion within the layered structure. The obtained results show better performance metrics than the other existing models. The addition of the soft time–frequency (T–F) mask as an internal layer, together with the use of the mean square error (MSE) objective, is believed to be the main reason for the better results.

Table 3 Performance of DRNN with and without masking and discriminative training model

The source segregation process is carried out by incorporating the full set of 83-dimensional features (\(32\hbox {D ITD}+32\hbox {D IPD}+16\hbox {D ILD}+2\hbox {D IC}+1\hbox {D DRR}\)) into the deep recurrent neural network-based joint discriminative model. The performance is validated for different input signal-to-noise ratios in the presence of babble noise, and the results are shown in Table 4. It is experimentally observed that increasing the interaural level difference above 16 dimensions not only increases the complexity but also affects the stability of the system by consuming more time. Also, the output signal-to-noise ratio decreases when the dimensionality of the interaural time difference or interaural phase difference is reduced below 32 dimensions. As expected, the system produces a better output signal-to-noise ratio when the input SNR increases.

Table 4 Performance of classification-based source separation for the different input SNR values

The source segregation process is performed in four different reverberant rooms in the presence of babble noise at 0 dB, and the results are shown in Table 5. The system produces a better signal-to-noise ratio and HIT−FA rate for the rooms with lower reverberation times. However, the performance of the classifier-based source segregation system is not significantly affected by increasing reverberation times; in other words, the system retains good performance metrics for all the rooms with their various reverberation times, even at 0 dB.

Table 5 Performance analysis of binaural-DRR combined source segregation process under babble noise at 0 dB

4.1.1 Comparison Analysis with Baseline Method

In this study, the developed speech separation framework is compared with the results of [4, 33, 54]. Woodruff and Wang [54] have proposed binaural detection, localization and segregation of speech sources that mainly depend on pitch and azimuth cues; the system uses a hidden Markov model to estimate the number of active sources across time. Mandel et al. [33] have suggested the model-based EM source separation and localization (MESSL) system, in which a mask is generated and successfully utilized for separating a desired sound source from stereo signals. Every source in a mixture is represented by a probabilistic model of interaural parameters, evaluated independently at each spectrogram point. Alinaghi et al. [4] have investigated the strengths and weaknesses of mixing vector estimation along with other cues, such as interaural level and phase differences (ILD and IPD), for the separation of stereo speech signals.

The comparative analysis is carried out for the various rooms at their different reverberation times in the presence of babble noise at − 5 dB. As shown in Table 6, incorporating the combined binaural cues along with the direct-to-reverberant feature into the deep recurrent-based joint discriminative model gives better performance, in terms of output signal-to-noise ratio, than the Woodruff–Wang and MESSL models. The results are higher in all four rooms, invariably. The proposed model produces 2.57 times better results than the Woodruff–Wang model in Room A, which has a reverberation time of 0.32 s. The proposed system shows an average improvement in signal-to-noise ratio (SNR) of 0.58 dB over the Woodruff and Wang [54] and Alinaghi et al. [4] models in Room D.

Table 6 Comparative analysis with baseline methods

4.2 Module 2: Speaker Identification and Recognition Module

In this study, a total of 630 speakers (192 female, 438 male) from the TIMIT database are chosen, of which 9 short utterances per speaker are used for training and the remaining utterance is used as a test sample. In all, 530 speakers are used for background model training, and the remaining 100 speakers are used as test samples. The performance of the proposed Gabor Hilbert envelope coefficients (GHEC) is compared with other standard feature extraction techniques, namely Gammatone frequency cepstral coefficients (GFCC) [60], RASTA Mel-frequency cepstral coefficients (RASTA-MFCC), mean Hilbert envelope coefficients (MHEC) [43] and the Gabor filter bank (GBFB) [45, 46], at various SNR values with different noise sources and reverberant room environments. The Gammatone frequency cepstral coefficient (GFCC) feature extraction [59, 60] is performed using a 64-channel Gammatone filter bank with centre frequencies ranging from 50 Hz to 8 kHz. The rectified filter outputs are decimated to 100 Hz, yielding time frames of 10 ms. The magnitudes of the decimated outputs are then compressed by a cubic root operation in order to model loudness compression. The resultant matrix represents a time–frequency (T–F) decomposition of the input, referred to as GF (Gammatone frequency) components, which are correlated with each other. The discrete cosine transform (DCT) is applied to the GF components to reduce dimensionality and to de-correlate the components.

The Mel coefficients are computed by segmenting the input signal into 20 ms frames with a 10 ms frame shift. A Hamming window is applied to each frame and the short-time Fourier transform is used to derive the power spectrum, which is then converted to the Mel scale. Finally, 39-dimensional coefficients are obtained by applying log compression and the discrete cosine transform (DCT). The DCT performs two functions: conversion of the spectral features into the cepstrum and de-correlation of the overlapping feature dimensions. Appending RASTA filtering after the DCT yields the 39-dimensional RASTA-MFCC coefficients. The mean Hilbert envelope coefficient (MHEC) feature extraction [43] uses a 24-channel Gammatone filter bank with centre frequencies spaced on the equivalent rectangular bandwidth (ERB) scale between 300 and 3400 Hz, which decomposes the speech signal into 24 bands.
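For reference, the RASTA-MFCC pipeline above could be prototyped as follows. The librosa front end and the classic RASTA band-pass coefficients (after Hermansky and Morgan) are illustrative assumptions rather than the exact configuration used in this study.

import numpy as np
import librosa
from scipy.signal import lfilter

def rasta_mfcc(y, sr=16000, n_mfcc=13):
    # 20 ms frames, 10 ms shift, Hamming window, Mel power spectrum, log compression
    # and DCT are all handled inside librosa's MFCC routine.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.020 * sr), hop_length=int(0.010 * sr),
                                window='hamming')
    # Append first and second derivatives -> 39-dimensional static + delta + delta-delta.
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    # Classic RASTA band-pass filter applied along time to each coefficient trajectory.
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, feats, axis=1)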

Table 7 Accuracy results of speaker recognition system in various noisy conditions

The Hilbert envelope of each band [43] is computed, followed by mean computation and smoothing. Then, first and second derivatives are computed and appended to the features in order to construct the final 36-dimensional mean Hilbert envelope coefficient (MHEC) feature patterns. In this study, the accuracy measures of the different features are tabulated using the I-vector technique as the acoustic speaker model. The performance overview of the various feature extraction techniques under different noisy environments at various signal-to-noise ratio (SNR) values is shown in Table 7.
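A minimal sketch of this envelope stage is given below, assuming the 24 Gammatone band signals are already available. The smoothing cut-off and frame length are illustrative, and the cepstral compression that reduces the static dimension towards the reported 36 coefficients is omitted here.

import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def band_envelope_features(band_signals, fs=16000, frame_ms=10, smooth_hz=20.0):
    # band_signals: (n_bands, n_samples) outputs of the 24-channel Gammatone filter bank.
    env = np.abs(hilbert(band_signals, axis=1))        # Hilbert envelope per band
    b, a = butter(2, smooth_hz / (fs / 2))             # low-pass smoothing of the envelope
    env = filtfilt(b, a, env, axis=1)
    hop = int(fs * frame_ms / 1000)
    n_frames = env.shape[1] // hop
    # mean of the smoothed envelope within each 10 ms frame
    static = env[:, :n_frames * hop].reshape(env.shape[0], n_frames, hop).mean(axis=2)
    delta = np.gradient(static, axis=1)                # first derivative across frames
    delta2 = np.gradient(delta, axis=1)                # second derivative
    return np.vstack([static, delta, delta2])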

The speaker identification (SID) accuracy has been evaluated for the various feature extraction techniques, including the proposed Gabor–Hilbert envelope coefficients (GHEC), which are computed by convolving the Gabor-filtered components with the Hilbert envelope. The Gabor filter is well documented for its spectro-temporal patterns, which resemble the receptive fields of certain cortical neurons. The performance of the Gabor–Hilbert envelope coefficients (GHEC) is compared with other well-known techniques, including RASTA-MFCC, GFCC, MHEC and GBFB, and the results are shown in Table 7. The speaker identification performance with GHEC (57-dimensional features) significantly outperforms RASTA-MFCC (36-dimensional features) for all noise signals under various SNR values. It also produces comparatively better results than GBFB combined with principal component analysis (PCA) [28], which has 39-dimensional features in total, for the various noise signals, especially at an SNR of − 5 dB. The performance results for the Gabor Hilbert envelope coefficients (GHEC), mean Hilbert envelope coefficients (MHEC) and Gammatone frequency cepstral coefficients (GFCC) are found to be almost similar for the various noise signals, especially at low SNR values.
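To make the spectro-temporal filtering concrete, the snippet below applies a single complex 2-D Gabor kernel to a log-Mel spectrogram. It illustrates the filtering operation only; the kernel parameters and the placeholder spectrogram are assumptions for illustration and are not taken from the proposed GHEC pipeline.

import numpy as np
from scipy.signal import convolve2d

def gabor_2d(omega_k, omega_n, size=9, sigma=2.0):
    # Complex 2-D Gabor kernel: Gaussian envelope modulated by a spectro-temporal sinusoid;
    # omega_k / omega_n are spectral / temporal modulation frequencies in rad per bin.
    k = np.arange(size) - size // 2
    K, N = np.meshgrid(k, k, indexing='ij')
    envelope = np.exp(-(K ** 2 + N ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * (omega_k * K + omega_n * N))
    return envelope * carrier

log_mel = np.log(np.random.rand(40, 200) + 1e-6)   # placeholder log-Mel spectrogram
# The magnitude response highlights amplitude/frequency modulations at the chosen rates.
response = np.abs(convolve2d(log_mel, gabor_2d(0.5, 0.25), mode='same'))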

The computation of the Gabor–Hilbert envelope coefficients (GHEC) is based on spectro-temporal cue extraction from log-Mel spectrograms, which is believed to be a major reason for its robustness and better performance. Further, 2-D Gabor filters are highly receptive to amplitude and frequency modulations. It should be noted that the mean Hilbert envelope coefficients (MHEC) and Gammatone frequency cepstral coefficients (GFCC) are computed using 36- and 31-dimensional features, respectively. The equal error rate (EER) is also estimated, using the I-vector-based technique, in order to validate the performance of the various feature extraction methods under different reverberant conditions, and the results are shown in Fig. 5. The Gabor Hilbert envelope coefficients (GHEC), mean Hilbert envelope coefficients (MHEC) and Gammatone frequency cepstral coefficients (GFCC) all show better performance in higher reverberant conditions than the other methods. The performance of the cepstral coefficient-based methods decreases as reverberation increases. However, the proposed GHEC monaural feature shows lower performance than MHEC in a few cases, possibly due to its sensitivity towards intrinsic factors such as speaking rate, speaking effort, style and pitch information.

Fig. 5
figure 5

Equal error rate (EER) for different features paradigm in the presence of babble noise at 20 dB under various reverberant conditions
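The EER reported in Fig. 5 corresponds to the operating point at which the false-acceptance and false-rejection rates coincide. A minimal routine for computing it from genuine and impostor trial scores (assumed to come from the I-vector back end) is sketched below.

import numpy as np

def equal_error_rate(genuine, impostor):
    # EER: threshold where the false-acceptance rate (FAR) on impostor trials
    # equals the false-rejection rate (FRR) on genuine trials.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return 0.5 * (far[idx] + frr[idx])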

In this study, text-independent Gaussian mixture model–universal background model (GMM–UBM) and I-vector-based speaker recognition are validated in the presence of factory noise. The feature extraction includes both the mean Hilbert envelope coefficient and Gabor Hilbert envelope coefficient techniques. The GMM–UBM and I-vector-based speaker recognition systems use a total of 64 Gaussian mixture components trained on the speech samples. From the results, it is observed that the Gabor Hilbert envelope coefficient (GHEC) shows better performance than the mean Hilbert envelope coefficients in the joint approach for factory noise at a reverberation time of 0.23 s. This comparison is tested at different SNR values, as shown in Fig. 6.
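A simplified GMM–UBM verification loop of this kind can be prototyped with scikit-learn, as sketched below with a 64-mixture UBM and a log-likelihood-ratio score. The warm-start re-estimation used for enrolment is a simple stand-in for full MAP adaptation, and the function names are assumptions rather than the authors' implementation.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_mix=64):
    # Universal background model over pooled frames from the background speakers.
    ubm = GaussianMixture(n_components=n_mix, covariance_type='diag', max_iter=200)
    return ubm.fit(background_features)

def enroll_speaker(ubm, speaker_features):
    # Speaker model re-estimated from the UBM initialisation (stand-in for MAP adaptation).
    spk = GaussianMixture(n_components=ubm.n_components, covariance_type='diag',
                          weights_init=ubm.weights_, means_init=ubm.means_,
                          precisions_init=ubm.precisions_cholesky_ ** 2, max_iter=10)
    return spk.fit(speaker_features)

def llr_score(ubm, spk, test_features):
    # Average per-frame log-likelihood ratio: speaker model versus UBM.
    return spk.score(test_features) - ubm.score(test_features)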

The joint approach for simultaneous speech segregation and automatic speaker recognition described here may be incorporated into real-time autonomous robots. Further, the proposed GHEC monaural feature will be utilized for monaural sound source separation to extract discriminative information in a multi-talker environment.

Fig. 6
figure 6

Performance measure of GHEC and MHEC during joint approach under different acoustic speaker models with single noise source (factory)

The experiment considers 100 randomly chosen speakers from the TIMIT database. The classification results are evaluated when a mixture of two speakers is given as input to the proposed system. The performance of the joint binaural speech segregation and automatic speaker recognition is evaluated under different reverberant conditions, as shown in Fig. 7. The proposed system uses the LSTM-DRNN-based binaural speech segregation technique, which is observed to be the main reason for the better evaluation metrics, such as SDR, SIR and SAR, achieving an average improvement of 5 dB in the room with a reverberation time of 0.89 s over other segregation methods, including the DNN. Specifically, the LSTM-DRNN uses memory blocks to control the information flow at multiple time scales, which is considered one of the reasons for the improved speaker recognition performance even at low signal-to-noise ratios. The proposed binaural speech segregation technique uses an optimized deep recurrent structure that outperforms DNN-based speech segregation by employing the LSTM-DRNN in the architecture when large numbers of speakers are involved in the training. Spatial cues such as IC, IPD, ILD and ITD are incorporated into the optimized deep recurrent structure for binaural classification to enhance robustness at higher reverberation times. The accuracy, sensitivity and specificity [39] for the proposed system are calculated from the confusion matrix and the results (in %) are shown in Table 8.
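The SDR, SIR and SAR figures quoted above follow the BSS Eval methodology; one common open implementation is mir_eval, used below under the assumption that the reference and estimated sources are time-aligned arrays of equal length (the variable names are illustrative).

import numpy as np
import mir_eval

def separation_metrics(reference_sources, estimated_sources):
    # Both inputs have shape (n_sources, n_samples); BSS Eval also returns the
    # permutation that best matches estimates to references.
    sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference_sources,
                                                               estimated_sources)
    return sdr, sir, sar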

Fig. 7
figure 7

Performance measure of joint approach for binaural speech segregation and automatic speaker identification in the different reverberant environments

Table 8 Classification results (in %) for the proposed system in the presence of different reverberation conditions

5 Conclusions

The developed framework demonstrates a combination of two modules, namely speech segregation and speaker recognition. The concatenated acoustic cues, such as the interaural time and level differences, interaural phase difference, interaural coherence and direct-to-reverberant ratio, are successfully incorporated into the deep recurrent structure-based joint discriminative model for the separation of the input speech mixture. The experiments are carried out by considering different dimensions of spatial cues within the deep learning structure. Long short-term memory (LSTM) units are explored to avoid the vanishing gradient problem by introducing memory blocks into the recurrent architecture. The proposed model addresses binaural speech segregation, since in most real-world applications the target speech and interfering sources are located at different positions. The system also proposes a monaural feature, referred to as the Gabor Hilbert envelope coefficients, for the speaker recognition system; it is found to be robust towards extrinsic variations and is implemented by applying temporal envelope extraction. The Hilbert envelope extraction produces slowly varying amplitude modulations in narrow frequency bands, which contain spectro-temporal acoustic information. The performance of the proposed feature extraction in the joint approach is found to be better than other known existing techniques. The spatial cues used for binaural speech segregation are found to be more robust at higher reverberation periods than monaural speech segregation. The proposed deep learning-based binaural speech segregation produced a signal-to-distortion ratio (SDR) improvement averaging 5.85 dB at a high reverberation time (0.89 s) over other accepted conventional models. The accuracy of the proposed monaural feature is computed using different speaker models under various noisy reverberant conditions. The system can help to recognize speakers in a multi-talker environment even in the absence of visual information. The developed framework finds several acoustic-related applications, such as intelligent hearing aid devices, hands-free communication, voice interactive systems and audio surveillance. Future work considers the optimization of GHEC towards intrinsic variations and also the incorporation of the proposed spectral features combined with relevant spatial cues in a binaural classification framework. The real-time implementation of the developed algorithm in an automated system is also under serious consideration.