1 Introduction

In some criminal cases, the voice recorded during a telephone call is the only clue available to investigators. Judicial police and magistrates therefore have a pressing and fully justified need to use such recordings to guide the investigation and to establish the guilt of a suspect or prove his/her innocence. Speaker recognition techniques thus make a valuable contribution to forensic speaker recognition systems.

To this end, forensic speaker recognition is considered one of the disciplines of Speaker Recognition (SR), covering both identification and verification. Although several forensic applications of SR have been developed in recent years, they have not been successful because of the high complexity and variability of the speech signal and the mismatch between modeling and testing conditions, especially in real life (Deshpande & Holambe, 2011). This mismatch can be caused by various sources, such as reverberation, compressed audio, degraded channels, and environmental noise, all of which degrade the performance of the forensic system (Scheffer et al., 2013). Thus, the challenging task for forensic experts is to find effective algorithms for speech enhancement in highly degraded environments, such as those with additive noise (Zhang & Abdulla, 2007).

Several speech enhancement algorithms based on the magnitude spectrum of the speech signal have been developed to overcome this challenge, namely: the Spectral Subtraction Method (SS) (Gustafsson et al., 2004), Spectral Subtraction with Over-subtraction Model (SSOM) (Dixit & Mulge, 2014), Non-Linear Spectral Subtraction (NSS) (Verschuur et al., 2006), Adaptive Noise Cancellation (ANC) (Kwatra et al., 2017) and the Minimum Mean Square Error (MMSE) estimators (Lu & Loizou, 2011).

This study proposes a modification of the MMSE estimators in which the magnitude spectrum estimated using a Fourier Transform (FT) is replaced by the modified group delay (MODGD) spectrum (Asbai & Amrouche, 2017; Parthasarathi et al., 2011). In other words, the independent Gaussian random variables are derived from the MODGD spectrum instead of being estimated directly from the Discrete Fourier Transform (Parthasarathi et al., 2011), so that the MMSE algorithm can exploit the information contained in the phase spectrum.

The proposed modification is motivated by two considerations.

  • In general, the speech signal is a mixed phase signal, because a speaker’s vocal tract is a minimum phase system (Akande & Murphy, 2005), and for minimum phase systems, information can be extracted from either the phase or the magnitude spectrum. Thus, in terms of analysis, the group delay of a mixed phase signal is the sum of the group delays of its minimum phase components (Hegde et al., 2004). In this study, the MODGD spectrum is therefore processed by computing the mean of the posterior density given in Lu and Loizou (2011), so that the MMSE method benefits from the properties of the MODGD (high-resolution formants);

  • Furthermore, Parthasarathi et al. (2011) indicated that the group delay spectrum retains most of the formant information even at low SNRs of environmental noise. The MODGD spectrum is thus less affected by noise than the magnitude spectrum.

The contribution of this work is threefold. First, the MMSE estimators based on the MODGD are better adapted to the noisy test speech segments encountered in many speaker recognition applications. Second, the proposed MMSE-MODGD exploits the information contained in the phase spectrum as well as in the amplitude spectrum. Finally, extensive testing and experimental validation of the proposed MMSE-MODGD were carried out.

2 Forensic Automatic Speaker Recognition (FASR)

In FASR systems, the use of scientific tools is necessary to meet the needs of a court in criminal or civil litigation (Roux et al., 2012). The main fields used in forensic science are biology, chemistry, and medicine (Forest et al., 1983). Despite the predominance of these fields, other disciplines are also used, such as physics, computer science, geology, and psychology (Forest et al., 1983). For example, traditional biometric parameters, such as DNA and fingerprints, are often used in many forensic cases. The nature of the evidence, whether found at the crime scene or collected during investigations, dictates the scientific methods or disciplines needed to study it. In the context of FASR, experts are interested in methods for identifying a recorded voice. This is based on the premise that each person can be identified from a sample of his/her voice. In addition, a suspect can leave recordings of his/her voice on the phone, voicemail, an answering machine or a hidden recorder, which can then be used as evidence. Three databases are generally required to establish a FASR system: the Potential population database (P), the suspected speaker Reference database (R) and the suspected speaker Control database (C). They allow the evidence from the questioned recording (trace) to be calculated and evaluated (Drygajlo et al., 2003; Kenai et al., 2019).

Another methodology adopted in FASR systems requires a statistical model capable of computing a likelihood value when feature vectors are compared against it. This method uses only two databases: the suspected speaker Reference database and the relevant Population database (Drygajlo, 2012). These two databases are used to create two statistical models: (1) a statistical model of the suspected speaker and (2) a statistical model of the relevant population. The Universal Background Model (UBM) (Kenai et al., 2019), trained with the relevant population database, can also be used as the statistical model of the relevant population (Drygajlo, 2012). The multivariate evidence, represented by the ensemble of feature vectors extracted from the questioned recording, is compared with the statistical model of the suspected speaker and with the statistical model of the relevant population to calculate the likelihood ratio (LR). The first comparison gives the similarity likelihood score (numerator of the LR) and the second gives the typicality likelihood score (denominator of the LR) (Drygajlo et al., 2003; Drygajlo, 2012). Figure 1 shows the principle of this methodological approach.
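The similarity/typicality comparison described above can be sketched as follows. This is a minimal illustration, assuming diagonal-covariance Gaussian mixture models for both the suspected speaker and the relevant population, and assuming that the log-likelihood is averaged over frames; the function names and the toy parameters are hypothetical and are not taken from the cited works.

```python
import numpy as np

def gmm_loglik(features, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    features: (T, D), weights: (M,), means and variances: (M, D).
    """
    x = features[:, None, :]                                          # (T, 1, D)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)   # (M,)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=2)     # (T, M)
    log_comp = np.log(weights) + log_norm + log_exp                   # (T, M)
    # log-sum-exp over the mixture components for numerical stability
    peak = log_comp.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return frame_ll.mean()

def log_likelihood_ratio(trace, suspect_gmm, population_gmm):
    """log LR = similarity score (suspect model) - typicality score (population model)."""
    return gmm_loglik(trace, *suspect_gmm) - gmm_loglik(trace, *population_gmm)

# Toy example (hypothetical values): one-component models in two dimensions
suspect = (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]]))
population = (np.array([1.0]), np.array([[3.0, 3.0]]), np.array([[1.0, 1.0]]))
trace = np.zeros((5, 2))          # feature vectors located at the suspect mean
llr = log_likelihood_ratio(trace, suspect, population)
```

A positive log LR supports the hypothesis that the trace comes from the suspected speaker, while a negative value supports the alternative hypothesis.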

Fig. 1 The principle of the FASR methodological approach

However, in real forensic scenarios, the speech signal left by the suspect (trace) is often corrupted by environmental noise, which degrades the performance of the FASR system (Alexander et al., 2004). To this end, this paper discusses the MMSE-MODGD estimator used in speech enhancement (Gerkmann & Hendriks, 2012) to improve the FASR system in noisy environments (Figs. 2, 3, 4).

Fig. 2 Spectrograms of clean speech, noisy speech corrupted with white noise at 0 dB input SNR and speech enhancement methods

Fig. 3 Spectrograms of clean speech, noisy speech corrupted with factory noise at 0 dB input SNR and speech enhancement methods

Fig. 4 Spectrograms of clean speech, noisy speech corrupted with babble noise at 0 dB input SNR and speech enhancement methods

3 Minimum mean square error estimator of the noisy short-time power spectrum

The spectral subtraction method (Berouti et al., 1979) based on MMSE (Gerkmann & Hendriks, 2012) and minimum noise statistics (MS) (Martin, 2001) was used to enhance the speech signal damaged by additive noise. The amplitude of the noisy signal was multiplied by a gain factor. Spectral subtraction, introduced by Boll (1979), is the oldest method for removing noise. It operates in the frequency domain, and its principle is to subtract a noise estimate from the observed signal. Noise is assumed to be additive and stationary or slowly varying, which allows it to be estimated during silence periods. The noisy signal \(y\left( n \right)\) can be written as (Lu & Loizou, 2011):

$$y\left( n \right) = x\left( n \right) + d\left( n \right)$$
(1)

where \(x\left( n \right)\) and \(d\left( n \right)\) represent the clean speech and noise signals, respectively.

Taking the short-time Fourier transform of \(y\left( n \right)\), we obtain:

$$Y\left( {w_{k} } \right) = X\left( {w_{k} } \right) + D\left( {w_{k} } \right)$$
(2)

Equation (2) can be expressed in a polar form as follows:

$$Y_{k} e^{{j\theta_{y} \left( k \right)}} = X_{k} e^{{j\theta_{x} \left( k \right)}} + D_{k} e^{{j\theta_{d} \left( k \right)}}$$
(3)

where, \(\left\{ {Y_{k} , X_{k} , D_{k} } \right\}\) denotes the magnitudes and \(\left\{ {\theta_{y} \left( k \right), \theta_{x} \left( k \right), \theta_{d} \left( k \right)} \right\}\) denotes the phases at frequency bin \(k\) of the noisy speech, clean speech and noise, respectively.

The MMSE estimator of the short-time power spectrum is given by Wolfe and Godsill (2003) as follows:

$$\begin{aligned} \hat{X}_{k}^{2} &= E\left\{ {X_{k}^{2} /Y\left( {w_{k} } \right)} \right\} \hfill \\ &= \int_{0}^{\infty } {X_{k}^{2} f_{{X_{k} }} (X_{k} /Y(w_{k} ))dX_{k} } \hfill \\ &= \frac{{\xi_{k} }}{{1 + \xi_{k} }}\left( {\frac{1}{{\gamma_{k} }} + \frac{{\xi_{k} }}{{1 + \xi_{k} }}} \right)Y_{k}^{2} \hfill \\ \end{aligned}$$
(4)

and,

$$\xi_{k} \equiv \frac{{\sigma_{x}^{2} \left( k \right)}}{{\sigma_{d}^{2} \left( k \right)}}, \gamma_{k} \equiv \frac{{Y_{k}^{2} }}{{\sigma_{d}^{2} \left( k \right)}}$$
(5)
$$\sigma_{x}^{2} \left( k \right) \equiv E\left\{ {X_{k}^{2} } \right\}, \sigma_{d}^{2} \left( k \right) \equiv E\left\{ {D_{k}^{2} } \right\}$$
(6)

where, \(\xi_{k}\) and \(\gamma_{k}\) denote the a priori and a posteriori SNRs, respectively.

The derivations of the above MMSE estimator were based on the following Rician posterior density \(f_{{X_{k} }} \left( {X_{k} /Y\left( {w_{k} } \right)} \right)\):

$$f_{{X_{k} }} \left( {X_{k} /Y\left( {w_{k} } \right)} \right) = \frac{{X_{k} }}{{\sigma_{k}^{2} }}{\text{exp}}\left( { - \frac{{X_{k}^{2} + s_{k}^{2} }}{{2\sigma_{k}^{2} }}} \right)I_{0} \left( {\frac{{X_{k} s_{k} }}{{\sigma_{k}^{2} }}} \right)$$
(7)

where,

$$\frac{1}{{\lambda^{\prime}\left( k \right)}} \equiv \frac{1}{{\sigma_{x}^{2} \left( k \right)}} + \frac{1}{{\sigma_{d}^{2} \left( k \right)}}$$
(8)
$$\upsilon_{k} \equiv \frac{{\xi_{k} }}{{1 + \xi_{k} }}\gamma_{k}$$
(9)
$$\sigma_{k}^{2} \equiv \frac{{\lambda^{\prime}\left( k \right)}}{2}, s_{k}^{2} \equiv \upsilon_{k} \lambda^{\prime}\left( k \right)$$
(10)

\(I_{0} \left( . \right)\) is the modified Bessel function of the first kind of order zero.
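As a short consistency check, offered here as a sketch rather than as part of the cited derivation, the closed form in Eq. (4) can be recovered from Eq. (7). The density in Eq. (7) has the form of a Rice density with scale \(\sigma_{k}\) and noncentrality \(s_{k}\), whose second moment is \(2\sigma_{k}^{2} + s_{k}^{2}\). Using Eqs. (8)–(10), together with \(\lambda^{\prime}\left( k \right) = \frac{\xi_{k}}{1 + \xi_{k}}\sigma_{d}^{2}\left( k \right)\) and \(\sigma_{d}^{2}\left( k \right) = Y_{k}^{2}/\gamma_{k}\) from Eq. (5):

$$\begin{aligned} E\left\{ {X_{k}^{2} /Y\left( {w_{k} } \right)} \right\} &= 2\sigma_{k}^{2} + s_{k}^{2} = \lambda^{\prime}\left( k \right)\left( {1 + \upsilon_{k} } \right) \\ &= \frac{{\xi_{k} }}{{1 + \xi_{k} }}\sigma_{d}^{2} \left( k \right)\left( {1 + \frac{{\xi_{k} }}{{1 + \xi_{k} }}\gamma_{k} } \right) \\ &= \frac{{\xi_{k} }}{{1 + \xi_{k} }}\left( {\frac{1}{{\gamma_{k} }} + \frac{{\xi_{k} }}{{1 + \xi_{k} }}} \right)Y_{k}^{2} \end{aligned}$$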

However, the analysis of the suppression curves revealed that the MMSE spectral power suppression rule of Eq. (4) provides less suppression in regions of low a priori SNR (Wolfe & Godsill, 2003). Lu and Loizou (2011) proposed an improved MMSE estimator of the short-time power spectrum to remedy this problem.

The power spectrum of the noise-corrupted signal is assumed to be the sum of the power spectra of the clean speech and the noise, written as follows:

$$P_{y} \left( w \right) = P_{x} \left( w \right) + P_{d} \left( w \right)$$
(11)

In addition, the derivation of these estimators based on Eq. (11) approximates the power spectrum by the magnitude-squared spectrum, which is the sample estimate of the ensemble average. Therefore, Eq. (11) can be written as follows:

$$Y_{k}^{2} \approx X_{k}^{2} + D_{k}^{2}$$
(12)
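The nature of the approximation in Eq. (12) can be made explicit by taking the squared magnitude of both sides of Eq. (3):

$$Y_{k}^{2} = X_{k}^{2} + D_{k}^{2} + 2X_{k} D_{k} \cos \left( {\theta_{x} \left( k \right) - \theta_{d} \left( k \right)} \right)$$

Equation (12) therefore amounts to neglecting the cross term. This term has zero mean if the phase difference between speech and noise is uniformly distributed and independent of the magnitudes; this condition is stated here as an illustrative assumption and is not taken from the cited derivation.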

Moreover, assuming that the real and imaginary parts of the Discrete Fourier Transform (DFT) coefficients are modeled as independent Gaussian random variables with equal variance (Ephraim & Malah, 1984), the probability density of \(X_{k}^{2}\) is exponential and can be written as follows:

$$f_{{X_{k}^{2} }} \left( {X_{k}^{2} } \right) = \frac{1}{{\sigma_{x}^{2} \left( k \right)}}e^{{ - \frac{{X_{k}^{2} }}{{\sigma_{x}^{2} \left( k \right)}}}}$$
(13)

Similarly, the density of \(D_{k}^{2}\) is given by Eq. (14):

$$f_{{D_{k}^{2} }} \left( {D_{k}^{2} } \right) = \frac{1}{{\sigma_{d}^{2} \left( k \right)}}e^{{ - \frac{{D_{k}^{2} }}{{\sigma_{d}^{2} \left( k \right)}}}}$$
(14)

where, \(\sigma_{x}^{2} \left( k \right)\) and \(\sigma_{d}^{2} \left( k \right)\) are given by Eq. (6).
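The exponential form of Eqs. (13) and (14) can be illustrated numerically. The sketch below draws the real and imaginary parts of a DFT coefficient as independent zero-mean Gaussian variables with equal variance and checks two properties of the exponential law: a mean equal to \(\sigma_{x}^{2}\left( k \right)\) and a tail probability \(P\left( X_{k}^{2} > \sigma_{x}^{2}\left( k \right) \right) = e^{-1}\). The value of `sigma_x2`, the sample size and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_x2 = 2.0            # E{X_k^2}, arbitrary illustrative value
n_samples = 200_000

# Real and imaginary parts: independent, zero mean, variance sigma_x2 / 2 each
real_part = rng.normal(0.0, np.sqrt(sigma_x2 / 2), n_samples)
imag_part = rng.normal(0.0, np.sqrt(sigma_x2 / 2), n_samples)
power = real_part ** 2 + imag_part ** 2       # samples of X_k^2

empirical_mean = power.mean()                  # exponential law predicts sigma_x2
empirical_tail = np.mean(power > sigma_x2)     # exponential law predicts exp(-1)
```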

The posterior probability density of the clean speech magnitude-squared spectrum is obtained using Bayes’ rule as follows:

$$\begin{aligned} f_{{X_{k}^{2} }} \left( {X_{k}^{2} /Y_{k}^{2} } \right) &= \frac{{f_{{Y_{k}^{2} }} \left( {Y_{k}^{2} /X_{k}^{2} } \right)f_{{X_{k}^{2} }} \left( {X_{k}^{2} } \right)}}{{f_{{Y_{k}^{2} }} \left( {Y_{k}^{2} } \right)}} \hfill \\ &= \left\{ {\begin{array}{*{20}c} {\psi_{k} e^{{ - \frac{{X_{k}^{2} }}{\lambda \left( k \right)}}} , } & {if\ \sigma_{x}^{2} \left( k \right) \ne \sigma_{d}^{2} \left( k \right)} \\ {\frac{1}{{Y_{k}^{2} }},} & {if\ \sigma_{x}^{2} \left( k \right) = \sigma_{d}^{2} \left( k \right)} \\ \end{array} } \right\} \hfill \\ \end{aligned}$$
(15)

\(\lambda \left( k \right)\) is defined as:

$$\frac{1}{\lambda \left( k \right)} \equiv \frac{1}{{\sigma_{x}^{2} \left( k \right)}} - \frac{1}{{\sigma_{d}^{2} \left( k \right)}}, {\text{if}}\ \sigma_{x}^{2} \left( k \right) \ne \sigma_{d}^{2} \left( k \right)$$
(16)

and

$$\psi_{k} \equiv \frac{1}{{\lambda \left( k \right)\left\{ {1 - exp\left[ { - \frac{{Y_{k}^{2} }}{\lambda \left( k \right)}} \right]} \right\}}}.$$
(17)

Using Eqs. (12)–(15), the MMSE estimator is obtained by computing the mean of the posterior density given in Eq. (15) as follows:

$$\begin{aligned} \hat{X}_{k}^{2} &= E\left\{ {X_{k}^{2} /Y_{k}^{2} } \right\} \hfill \\ &= \int_{0}^{{Y_{k}^{2} }} {X_{k}^{2} f_{{X_{k}^{2} }} \left( {X_{k}^{2} /Y_{k}^{2} } \right)dX_{k}^{2} } \hfill \\& = \left\{ {\begin{array}{ll} {\left( {\frac{1}{{\upsilon_{k} }} - \frac{1}{{e^{{\upsilon_{k} }} - 1}}} \right)Y_{k}^{2} ,} & {if\ \sigma_{x}^{2} \left( k \right) \ne \sigma_{d}^{2} \left( k \right)} \\ { \frac{1}{2}Y_{k}^{2} , } & {if\ \sigma_{x}^{2} \left( k \right) = \sigma_{d}^{2} \left( k \right)} \\ \end{array} } \right. \hfill \\ \end{aligned}$$
(18)

where, \(\upsilon_{k}\) is defined as:

$$\upsilon_{k} \equiv \frac{{1 - \xi_{k} }}{{\xi_{k} }}\gamma_{k}$$
(19)
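For completeness, the step from Eq. (15) to Eq. (18) can be worked out as follows; this is a sketch added for the reader and uses only the definitions in Eqs. (5), (16), (17) and (19). Writing \(u = X_{k}^{2}\) and \(\lambda = \lambda \left( k \right)\), integration by parts gives, for \(\sigma_{x}^{2} \left( k \right) \ne \sigma_{d}^{2} \left( k \right)\):

$$\begin{aligned} \int_{0}^{{Y_{k}^{2} }} {u\,\psi_{k} e^{{ - \frac{u}{\lambda }}} du} &= \psi_{k} \left[ {\lambda^{2} - \lambda \left( {\lambda + Y_{k}^{2} } \right)e^{{ - \frac{{Y_{k}^{2} }}{\lambda }}} } \right] \\ &= \lambda \left( {1 - \frac{{\upsilon_{k} }}{{e^{{\upsilon_{k} }} - 1}}} \right) = \left( {\frac{1}{{\upsilon_{k} }} - \frac{1}{{e^{{\upsilon_{k} }} - 1}}} \right)Y_{k}^{2} \end{aligned}$$

where

$$\upsilon_{k} = \frac{{Y_{k}^{2} }}{\lambda \left( k \right)} = \gamma_{k} \sigma_{d}^{2} \left( k \right)\left( {\frac{1}{{\sigma_{x}^{2} \left( k \right)}} - \frac{1}{{\sigma_{d}^{2} \left( k \right)}}} \right) = \frac{{1 - \xi_{k} }}{{\xi_{k} }}\gamma_{k}$$

in agreement with Eq. (19). When \(\sigma_{x}^{2} \left( k \right) = \sigma_{d}^{2} \left( k \right)\), the posterior density is uniform on \(\left[ {0,Y_{k}^{2} } \right]\) and its mean is \(\frac{1}{2}Y_{k}^{2}\), which is also the limit of the first case as \(\upsilon_{k} \to 0\).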

4 The proposed modified group delay functions for the MMSE estimator of the noisy short-time power spectrum

A speech signal can be represented completely in the spectral domain only if both the amplitude and the phase information are specified. However, the information extracted from the phase spectrum is more complex than that extracted from the amplitude spectrum, as the phase spectrum is generally discontinuous (or wrapped) within \(\left[ { - \pi ,\pi } \right]\) (Murthy & Yegnanarayana, 2011). A multi-valued function is used to turn it into a continuous function; this is called the unwrapped phase (unwrapping) (Parthasarathi et al., 2011). The derivative of the phase, the “group delay function” (Parthasarathi et al., 2011), is mainly used to extract the information contained in the phase spectrum.

Let \(x\left( n \right)\) be a speech signal whose Fourier transform is written in polar form as in Eq. (3).

The group delay function \(\tau_{X} (\omega )\) of a signal \(x\left( n \right)\) is defined as the negative derivative of the phase spectrum \(\theta (\omega )\), as follows:

$$\tau_{X} \left( \omega \right) = - \frac{d\theta \left( \omega \right)}{{d\omega }}$$
(20)

The group delay function can also be estimated from the speech signal using Eq. (21) (Asbai & Amrouche, 2017):

$$\tau_{X} \left( \omega \right) = \frac{{X_{R} \left( \omega \right)\hat{X}_{R} \left( \omega \right) + X_{I} \left( \omega \right)\hat{X}_{I} \left( \omega \right)}}{{\left| {X\left( \omega \right)} \right|^{2} }}$$
(21)

where \(R\) and \(I\) denote the real and imaginary parts, respectively, \(x(n) \leftrightarrow X(\omega )\) and \(\hat{x}(n) \leftrightarrow \hat{X}(\omega )\) are Fourier transform pairs, and \(\hat{x}(n) = nx(n)\).

The group delay function requires the speech signal to be minimum phase, that is, the poles of the transfer function must lie within the unit circle (Asbai & Amrouche, 2017).
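The equivalence between the definition in Eq. (20) and the direct computation in Eq. (21) can be checked on a simple minimum-phase signal. The sketch below uses the one-pole sequence \(x(n) = a^{n}\), for which differentiating the phase gives the closed form \(\tau_{X}(\omega) = (a\cos \omega - a^{2})/(1 - 2a\cos \omega + a^{2})\). The test signal, the value of `a` and the FFT length are illustrative choices, not settings from this study.

```python
import numpy as np

def group_delay(x, n_fft=1024):
    """Group delay from Eq. (21), without explicit phase unwrapping."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)            # spectrum of x(n)
    X_hat = np.fft.rfft(n * x, n_fft)    # spectrum of n * x(n)
    return (X.real * X_hat.real + X.imag * X_hat.imag) / np.abs(X) ** 2

a = 0.5
x = a ** np.arange(64)                   # one-pole minimum-phase sequence (truncated)
tau = group_delay(x)

# Closed form obtained by differentiating the phase of 1 / (1 - a e^{-jw})
omega = 2 * np.pi * np.arange(len(tau)) / 1024
tau_closed = (a * np.cos(omega) - a ** 2) / (1 - 2 * a * np.cos(omega) + a ** 2)
```

Computing the group delay through Eq. (21) avoids the phase unwrapping step that a direct use of Eq. (20) would require.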

By replacing the amplitude spectrum \(\left| {X(\omega )} \right|\) in Eq. (21) with a smoothed version (Asbai & Amrouche, 2017), we define a MODified Group Delay function (MODGD), which is given as follows:

$$\tau_{X} \left( \omega \right) = \left( {\frac{{\tau_{s} \left( \omega \right)}}{{\left| {\tau_{s} \left( \omega \right)} \right|}}} \right)\left( {\left| {\tau_{s} \left( \omega \right)} \right|^{\alpha } } \right)$$
(22)

where,

$$\tau_{s} \left( \omega \right) = \frac{{X_{R} \left( \omega \right)\hat{X}_{R} \left( \omega \right) + X_{I} \left( \omega \right)\hat{X}_{I} \left( \omega \right)}}{{\left| {S\left( \omega \right)} \right|^{2\gamma } }}$$
(23)

and \(\left| {S(\omega )} \right|\) is a smoothed version of \(\left| {X(\omega )} \right|\); the parameters \(\alpha\) and \(\gamma\) are introduced to control the dynamic range. The length of the cepstral smoothing window is controlled by the parameter lifterω.
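A minimal sketch of Eqs. (22) and (23) is given below. It assumes that the smoothed spectrum \(\left| {S(\omega )} \right|\) is obtained by low-quefrency cepstral liftering of \(\log \left| {X(\omega )} \right|\), which is one common way to realize the cepstral smoothing mentioned above; the function name, the argument names, the FFT length and the floor `eps` are hypothetical choices made for this illustration. The default values of `alpha`, `gamma` and `lifter` follow the settings reported in the experimental protocol.

```python
import numpy as np

def modgd(x, alpha=0.4, gamma=0.9, lifter=8, n_fft=512, eps=1e-12):
    """Modified group delay spectrum of one frame, following Eqs. (22)-(23)."""
    n = np.arange(len(x))
    X = np.fft.fft(x, n_fft)
    X_hat = np.fft.fft(n * x, n_fft)

    # Cepstral smoothing (assumed form): keep the low-quefrency coefficients
    log_mag = np.log(np.maximum(np.abs(X), eps))
    cepstrum = np.fft.ifft(log_mag).real
    window = np.zeros(n_fft)
    window[:lifter] = 1.0
    if lifter > 1:
        window[-(lifter - 1):] = 1.0     # symmetric part of the real cepstrum
    S = np.exp(np.fft.fft(cepstrum * window).real)

    # Eq. (23): numerator of the group delay over the smoothed spectrum
    tau_s = (X.real * X_hat.real + X.imag * X_hat.imag) / S ** (2 * gamma)
    # Eq. (22): keep the sign and compress the dynamic range with alpha
    return np.sign(tau_s) * np.abs(tau_s) ** alpha

# Sanity check on a delayed unit impulse: its group delay equals the delay (3 samples)
impulse = np.zeros(64)
impulse[3] = 1.0
flat = modgd(impulse, alpha=1.0, gamma=1.0)
compressed = modgd(impulse)
```

With `alpha = 1` and `gamma = 1` the delayed impulse yields a constant value of 3 at every frequency, and with the default parameters the same spectrum is compressed to `3 ** 0.4`.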

Therefore, based on Eqs. (22) and (23), Eq. (4) can be written as follows:

$$\begin{aligned} \hat{X}_{k}^{2} &= E\left\{ {\tau_{Xk}^{2} /\tau_{Y} \left( {w_{k} } \right)} \right\} \hfill \\ &= \int_{0}^{\infty } {\tau_{Xk}^{2} f_{{\tau_{Xk} }} \left( {\tau_{Xk} /\tau_{Yk} \left( {w_{k} } \right)} \right)d\tau_{Xk} } \hfill \\ &= \frac{{\xi_{k} }}{{1 + \xi_{k} }}\left( {\frac{1}{{\gamma_{k} }} + \frac{{\xi_{k} }}{{1 + \xi_{k} }}} \right)\tau_{Yk}^{2} \hfill \\ \end{aligned}$$
(24)

where,

$$\xi_{k} \equiv \frac{{\sigma_{x}^{2} \left( k \right)}}{{\sigma_{d}^{2} \left( k \right)}}, \gamma_{k} \equiv \frac{{\tau_{Yk}^{2} }}{{\sigma_{d}^{2} \left( k \right)}}$$
(25)
$$\sigma_{x}^{2} \left( k \right) \equiv E\left\{ {\tau_{Xk}^{2} } \right\}, \sigma_{d}^{2} \left( k \right) \equiv E\left\{ {\tau_{Dk}^{2} } \right\}$$
(26)

Finally, the Rician posterior density \(f_{{X_{k} }} \left( {X_{k} /Y\left( {w_{k} } \right)} \right)\) becomes:

$$f_{{\tau_{Xk} }} \left( {\tau_{Xk} /\tau_{Yk} \left( {w_{k} } \right)} \right) = \frac{{\tau_{Xk} }}{{\sigma_{k}^{2} }}\exp \left( { - \frac{{\tau_{Xk}^{2} + s_{k}^{2} }}{{2\sigma_{k}^{2} }}} \right)I_{0} \left( {\frac{{\tau_{Xk} s_{k} }}{{\sigma_{k}^{2} }}} \right)$$
(27)

Moreover, Eqs. (12), (13) and (14) can be written as follows:

$$\tau_{Yk}^{2} \approx \tau_{Xk}^{2} + \tau_{Dk}^{2}$$
(28)
$$f_{{\tau_{Xk}^{2} }} \left( {\tau_{Xk}^{2} } \right) = \frac{1}{{\sigma_{x}^{2} \left( k \right)}}e^{{ - \frac{{\tau_{Xk}^{2} }}{{\sigma_{x}^{2} \left( k \right)}}}}$$
(29)
$$f_{{\tau_{Dk}^{2} }} \left( {\tau_{Dk}^{2} } \right) = \frac{1}{{\sigma_{d}^{2} \left( k \right)}}e^{{ - \frac{{\tau_{Dk}^{2} }}{{\sigma_{d}^{2} \left( k \right)}}}}$$
(30)

where, \(\sigma_{x}^{2} \left( k \right)\) and \(\sigma_{d}^{2} \left( k \right)\) are given by Eq. (26).

The posterior probability density of the clean speech spectrum, now expressed in terms of the squared MODGD spectrum, becomes:

$$\begin{aligned} f_{{\tau_{Xk}^{2} }} \left( {\tau_{Xk}^{2} /\tau_{Yk}^{2} } \right) &= \frac{{f_{{\tau_{Yk}^{2} }} \left( {\tau_{Yk}^{2} /\tau_{Xk}^{2} } \right)f_{{\tau_{Xk}^{2} }} \left( {\tau_{Xk}^{2} } \right)}}{{f_{{\tau_{Yk}^{2} }} \left( {\tau_{Yk}^{2} } \right)}} \hfill \\ &= \left\{ {\begin{array}{*{20}c} {\psi_{k} e^{{ - \frac{{\tau_{Xk}^{2} }}{\lambda \left( k \right)}}} , } & {if\ \sigma_{x}^{2} \left( k \right) \ne \sigma_{d}^{2} \left( k \right)} \\ {\frac{1}{{\tau_{Yk}^{2} }}, } & {if\ \sigma_{x}^{2} \left( k \right) = \sigma_{d}^{2} \left( k \right)} \\ \end{array} } \right. \hfill \\ \end{aligned}$$
(31)

where \(\lambda \left( k \right)\) is given by Eqs. (16) and (26), and

$$\psi_{k} \equiv \frac{1}{{\lambda \left( k \right)\left\{ {1 - exp\left[ { - \frac{{\tau_{Yk}^{2} }}{\lambda \left( k \right)}} \right]} \right\}}}$$
(32)

Finally, the modified MMSE estimator is given by:

$$\begin{aligned} \hat{X}_{k}^{2}& = E\left\{ {\tau_{Xk}^{2} /\tau_{Yk}^{2} } \right\} \hfill \\ &= \int_{0}^{{\tau_{Yk}^{2} }} {\tau_{Xk}^{2} f_{{\tau_{Xk}^{2} }} \left( {\tau_{Xk}^{2} /\tau_{Yk}^{2} } \right)d\tau_{Xk}^{2} } \hfill \\ &= \left\{ {\begin{array}{ll} {\left( {\frac{1}{{\upsilon_{k} }} - \frac{1}{{e^{{\upsilon_{k} }} - 1}}} \right)\tau_{Yk}^{2} ,} & {if\ \sigma_{x}^{2} \left( k \right) \ne \sigma_{d}^{2} \left( k \right)} \\ { \frac{1}{2}\tau_{Yk}^{2} , } & {if\ \sigma_{x}^{2} \left( k \right) = \sigma_{d}^{2} \left( k \right)} \\ \end{array} } \right. \hfill \\ \end{aligned}$$
(33)

\(\upsilon_{k}\) is given by Eq. (19), where, \(\gamma_{k}\) is given by Eq. (25).
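The modified estimator of Eq. (33) can be sketched as a gain applied to the squared MODGD spectrum of the noisy frame. In the illustration below, the variances \(\sigma_{x}^{2} \left( k \right)\) and \(\sigma_{d}^{2} \left( k \right)\) of Eq. (26) are passed in as inputs, since the way they are estimated is not specified in this section; the function name, the argument names and the small-\(\upsilon_{k}\) threshold `tol` are hypothetical. Near \(\upsilon_{k} = 0\) the closed form is numerically ill-conditioned, so the sketch switches to the series \(1/2 - \upsilon_{k}/12\), which reduces to the equal-variance case of Eq. (33).

```python
import numpy as np

def mmse_modgd_power(tau_y, sigma_x2, sigma_d2, tol=1e-4):
    """Estimate of Eq. (33) from the noisy MODGD values tau_y (per frequency bin)."""
    tau_y2 = np.asarray(tau_y, dtype=float) ** 2
    sigma_x2 = np.asarray(sigma_x2, dtype=float)
    sigma_d2 = np.asarray(sigma_d2, dtype=float)

    xi = sigma_x2 / sigma_d2              # a priori SNR, Eq. (25)
    gamma_k = tau_y2 / sigma_d2           # a posteriori SNR, Eq. (25)
    nu = (1.0 - xi) / xi * gamma_k        # Eq. (19)

    small = np.abs(nu) < tol              # covers sigma_x2 == sigma_d2
    nu_safe = np.where(small, 1.0, nu)    # placeholder to avoid division by zero
    with np.errstate(over="ignore"):
        gain = 1.0 / nu_safe - 1.0 / np.expm1(nu_safe)
    gain = np.where(small, 0.5 - nu / 12.0, gain)
    return gain * tau_y2

# Equal variances: the estimate is half of the noisy squared MODGD value
equal_case = mmse_modgd_power(2.0, 1.0, 1.0)
# Unequal variances: xi = 0.5, gamma_k = 2, hence nu = 2
unequal_case = mmse_modgd_power(np.sqrt(2.0), 0.5, 1.0)
```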

5 Experimental protocol for speech enhancement

Extensive objective quality tests were carried out to evaluate the performance of the proposed MMSE-MODGD estimation method using ten (10) sentences extracted from the NOIZEUS database (Hu & Loizou). In this database, the noisy signals are generated by adding noise from the AURORA and NOISEX-92 databases to the clean signals at overall SNRs of 0 dB, 5 dB and 10 dB. The frame size chosen is 20 ms with a 50% overlap. A sampling frequency of 8 kHz and a Hamming window were used. The methods compared with the proposed MMSE-MODGD are the maximum-likelihood (ML) estimator, the MMSE estimator, the log MMSE estimator, the maximum a posteriori (MAP) estimator, the MMSE estimator incorporating speech presence probability (MMSE-ISP), the log MMSE estimator incorporating speech presence probability (log MMSE-ISP) and the Wiener estimator (Loizou, 2007). The objective assessment was carried out as proposed in Hu and Loizou (2008). The tests carried out to evaluate the proposed method include measures related to the perception of the speech signal: signal distortion (SIG) on a five-point (1–5) scale, background noise (BAK) on a five-point (1–5) scale, and overall quality (OVRL) based on the Mean Opinion Score (MOS) ranging from 1 to 5. The other measures used are the segmental SNR (SegSNR), the weighted-slope spectral distance (WSS), the perceptual evaluation of speech quality (PESQ) and the log-likelihood ratio (LLR) (Hu & Loizou, 2008).
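The framing settings above (20 ms frames, 50% overlap, 8 kHz sampling, Hamming window) and the segmental SNR measure can be sketched as follows. The clamping of the per-frame SNR to the range [-10, 35] dB is a common convention for this measure and is an assumption here, as are the function names and the small constant `eps`; the sketch is not the evaluation code used in this study.

```python
import numpy as np

FS = 8000                      # sampling frequency (Hz)
FRAME = FS * 20 // 1000        # 20 ms  -> 160 samples
HOP = FRAME // 2               # 50% overlap -> 80 samples

def frame_signal(x, frame=FRAME, hop=HOP):
    """Split x into overlapping Hamming-windowed frames, shape (n_frames, frame)."""
    n_frames = 1 + (len(x) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame)

def segmental_snr(clean, enhanced, lo=-10.0, hi=35.0, eps=1e-10):
    """Mean of the per-frame SNRs (dB), each clamped to [lo, hi]."""
    c = frame_signal(clean)
    e = frame_signal(enhanced)
    signal_energy = np.sum(c ** 2, axis=1)
    error_energy = np.sum((c - e) ** 2, axis=1) + eps
    snr_db = 10 * np.log10(signal_energy / error_energy + eps)
    return np.clip(snr_db, lo, hi).mean()

# One second of synthetic signal; an estimate with a 10% amplitude error gives 20 dB
rng = np.random.default_rng(0)
clean = rng.standard_normal(FS)
frames = frame_signal(clean)
seg_snr = segmental_snr(clean, 0.9 * clean)
```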

5.1 Results and discussion

Based on a comparative study using spectrograms, it can be noticed that the proposed MMSE-MODGD method gives good results compared to ML, MMSE, log MMSE, MAP, MMSE-ISP, log MMSE-ISP and Wiener. This good performance achieved by the proposed approach is confirmed by the objective evaluation.

Tables 1, 2 and 3 show the results of the evaluations using the objective measures SIG, BAK, OVRL, PESQ, SegSNR, WSS and LLR on 10 sentences extracted from the NOIZEUS database. The proposed MMSE-MODGD method is compared with ML, MMSE, log MMSE, MAP, MMSE-ISP, log MMSE-ISP and Wiener under degradation by white, factory and babble noise, respectively. The LLR and WSS scores indicate speech loss and should therefore be minimal. The results presented in the tables clearly show that the SIG, BAK and OVRL scores, which reflect the level of perception of the speech signal and the overall quality, are generally higher for the MMSE-MODGD method than for the other methods. These assessments also confirm that speech enhancement based on the MMSE-MODGD method produces a higher segmental SNR, a higher PESQ and a lower WSS than the other methods.

Table 1 Objective evaluations of the MMSE-MODGD technique compared with ML, MMSE, Log-MMSE, MAP, MMSE-ISP, Log-MMSE-ISP and Wiener for speech corrupted with white noise.
Table 2 Objective evaluations of the MMSE-MODGD technique compared with ML, MMSE, Log-MMSE, MAP, MMSE-ISP, Log-MMSE-ISP and Wiener for speech corrupted with factory noise.
Table 3 Objective evaluations of the MMSE-MODGD technique compared with ML, MMSE, Log-MMSE, MAP, MMSE-ISP, Log-MMSE-ISP and Wiener for speech corrupted with babble noise.

6 Experimental protocol for FASR setup

Generally, there are two constraints in FASR scenarios. The first is the non-collaboration of the suspects and the second is the limited number of suspects known by the target person (the person who suffers from the actions of others). Because of these constraints, the number of suspects used to develop FASR systems is very limited.

In this work, all the experiments were performed on the NIST 2000 corpus, which consists of spontaneous telephone speech sampled at 8 kHz. For feature extraction, a vector of 23 MFCCs is extracted from the pre-emphasized speech every 10 ms using a 20 ms Hamming window.

Twenty speakers were chosen as suspects from this corpus. The suspected speaker Reference database (R) was built from one recording of 2 min duration chosen for each suspect; 75% of the duration of this recording was used for modeling and 25% for tests (traces).

The test segment is divided into 4 sections, giving 4 traces for each suspect. The Potential database (P) was a subset of 420 speakers from the same corpus cited above. The GMM-UBM consisted of 256 mixture components trained via the Expectation Maximization (EM) algorithm using 10 iterations (Reynolds & Rose, 1995).

Twenty suspect models were created from the GMM-UBM using maximum a posteriori (MAP) adaptation with a relevance factor r = 16 and 256 mixtures; a data amount of 14 h was used (Reynolds et al., 2000).
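The MAP adaptation step can be illustrated with the sketch below. It assumes that only the component means are adapted and that the UBM has diagonal covariances, which is a widespread configuration for GMM-UBM systems but is not stated explicitly in this section; the function and variable names are hypothetical. Each adapted mean is a weighted combination of the UBM mean and the data mean, with the weight \(n_{i}/(n_{i} + r)\) controlled by the soft count \(n_{i}\) of the component and the relevance factor \(r\).

```python
import numpy as np

def map_adapt_means(features, weights, means, variances, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM-UBM.

    features: (T, D), weights: (M,), means and variances: (M, D).
    """
    x = features[:, None, :]
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum((x - means) ** 2 / variances, axis=2))   # (T, M)
    log_comp -= log_comp.max(axis=1, keepdims=True)
    post = np.exp(log_comp)
    post /= post.sum(axis=1, keepdims=True)          # responsibilities (T, M)

    n_i = post.sum(axis=0)                           # soft counts (M,)
    data_means = post.T @ features / np.maximum(n_i, 1e-10)[:, None]
    a_i = (n_i / (n_i + r))[:, None]                 # adaptation coefficients
    return a_i * data_means + (1.0 - a_i) * means

# Toy example: 16 frames at value 2.0, one-component UBM with mean 0 and r = 16
frames_at_two = np.full((16, 1), 2.0)
adapted = map_adapt_means(frames_at_two, np.array([1.0]),
                          np.array([[0.0]]), np.array([[1.0]]))
```

In the toy example the soft count equals the relevance factor, so the adapted mean lies halfway between the UBM mean and the data mean.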

According to Fig. 1, which explains the FASR methodological approach adopted in our work, three databases are needed:

  1. Potential database (UBM database): contains 420 speakers (420 × 2 min = 14 h);

  2. Trace database (T): contains 20 speakers, each with 4 traces of (0.25 × 2 min)/4 = 7.5 s. The total number of true trials (H0) is therefore 20 × 4 = 80 and the total number of false trials (H1) is 4 × 20 × 20 − 80 (true trials) = 1520;

  3. Reference database: contains 20 speakers, each with 0.75 × 2 min = 1.5 min.
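The database sizes and trial counts listed above follow from simple arithmetic, reproduced in the short sketch below; the variable names are chosen for illustration only.

```python
n_ubm_speakers = 420
minutes_per_recording = 2
n_suspects = 20
traces_per_suspect = 4

# Potential (UBM) database: 420 recordings of 2 min
ubm_hours = n_ubm_speakers * minutes_per_recording / 60

# Each 2-min recording: 75% for the reference model, 25% split into 4 traces
reference_minutes = 0.75 * minutes_per_recording
trace_seconds = 0.25 * minutes_per_recording * 60 / traces_per_suspect

# Every trace is compared with every suspect model
total_trials = traces_per_suspect * n_suspects * n_suspects
true_trials = n_suspects * traces_per_suspect        # H0: same speaker
false_trials = total_trials - true_trials            # H1: different speaker
```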

Performance metrics provide a single numerical value that describes the performance of the LR method in terms of accuracy, discriminating power and calibration: the Probabilities of Misleading Evidence (PMEH0 and PMEH1) and the Equal Proportion Probability (EPP) (Drygajlo et al., 2016; Haraksim & Drygajlo, 2016; Kenai et al., 2019). The values used for the MODGD functions are a cepstral lifter window length lifterω = 8, \(\alpha\) = 0.4 and \(\gamma\) = 0.9.

7 Classical Forensic Automatic Speaker Recognition results

This section evaluates the results obtained in clean and noisy environments.

7.1 FASR performance under clean conditions

An evaluation of FASR based on GMM-UBM performance in terms of EPP, PMEH0 and PMEH1 was performed in a clean environment.

According to Table 4, the results are very satisfactory in terms of EPP, PMEH0 and PMEH1: EPP = 1.25%, and the LR exceeds 1 in 96% of cases when H0 is true and in only 0.4% of cases when H1 is true.
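The way such figures can be read from a set of likelihood ratios is sketched below. The definitions used are stated as assumptions: PMEH0 is taken as the proportion of LRs below 1 when H0 is true, PMEH1 as the proportion of LRs above 1 when H1 is true, and the EPP as the common proportion at the threshold where the proportion of H0 LRs below the threshold equals the proportion of H1 LRs at or above it. The function names and the toy LR values are hypothetical and do not reproduce the results of Table 4.

```python
import numpy as np

def misleading_evidence(lr_h0, lr_h1):
    """Return (PME_H0, PME_H1) under the definitions assumed above."""
    pme_h0 = np.mean(np.asarray(lr_h0) < 1.0)    # H0 true but LR supports H1
    pme_h1 = np.mean(np.asarray(lr_h1) > 1.0)    # H1 true but LR supports H0
    return pme_h0, pme_h1

def equal_proportion_probability(lr_h0, lr_h1):
    """Proportion at the threshold where the two cumulative curves cross."""
    lr_h0 = np.asarray(lr_h0)
    lr_h1 = np.asarray(lr_h1)
    thresholds = np.sort(np.concatenate([lr_h0, lr_h1]))
    below_h0 = np.array([(lr_h0 < t).mean() for t in thresholds])
    above_h1 = np.array([(lr_h1 >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(below_h0 - above_h1))
    return 0.5 * (below_h0[i] + above_h1[i])

# Toy likelihood ratios (hypothetical values)
lr_true = [0.5, 2.0, 3.0, 4.0]      # trials where H0 is true
lr_false = [0.1, 0.2, 0.3, 5.0]     # trials where H1 is true
pme_h0, pme_h1 = misleading_evidence(lr_true, lr_false)
epp = equal_proportion_probability(lr_true, lr_false)
```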

Table 4 Evaluation results obtained in clean environment

7.2 FASR performance in noisy conditions

Three noise types (babble, factory and white) were arbitrarily chosen in this study and added to the questioned recordings (traces) to produce noisy feature vectors. Table 5 presents the performance of the FASR system at SNR = 0 dB and SNR = 5 dB.
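Adding noise to a trace at a prescribed SNR can be sketched as follows. The sketch assumes that the SNR is defined over the whole recording as the ratio of the average powers of the clean signal and of the scaled noise; the function name and the synthetic signals are illustrative and do not correspond to the recordings used in this study.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the mixture has the requested overall SNR (dB)."""
    noise = noise[:len(clean)]                    # noise must be at least as long
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(4000)                 # stand-in for a clean trace
noise = rng.standard_normal(5000)                 # stand-in for a noise recording
noisy = add_noise_at_snr(clean, noise, 5.0)

# Measured SNR of the mixture, which should match the requested value
measured_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```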

Table 5 Evaluation results obtained under noisy environments

Table 5 summarizes the performance of the FASR system in noisy environments in terms of EPP, PMEH0 and PMEH1. The performance degrades as the SNR decreases and improves as it increases, and speech corrupted with babble noise is less degraded than speech corrupted with the other noises. This can be explained by the fact that babble noise is an overlap of several sounds coming from two or more speakers (Djeghiour et al., 2018). Its features resemble those of the voice, and it covers only the low-frequency spectrum, so only the information in low-frequency regions is affected. By contrast, factory and white noises are characterized by a high intensity; they cover both the low- and high-frequency spectrum and affect all the information in the speech signal. Performance is therefore worse with these two noise types than with babble noise.

8 Enhanced Forensic Automatic Speaker Recognition results

In this section, the performance of the system was calculated using the MMSE magnitude enhancement processing and the proposed MMSE-MODGD enhancement processing.

8.1 FASR performance using MMSE-magnitude enhancement processing

Table 6 indicates the results obtained when using the MMSE-magnitude enhancement processing (only the information contained in the magnitude), at SNR = 0 dB and SNR = 5 dB.

Table 6 Evaluation results obtained with MMSE enhancement processing

The results presented in Table 6, obtained by applying the MMSE speech enhancement algorithm to the noisy test (trace) speech, indicate an improvement in performance, reflected by the decrease in EPP with the evolution of the LR.

This improvement is explained by the fact that the MMSE-based magnitude spectrum estimator discards all the broadband noise by eliminating most of the wide peaks that constitute the undesirable variances of the spectrum ordinates (Loizou, 2007).

Moreover, the MMSE based magnitude spectrum estimator provides the posterior Probability Density Function (PDF) of the clean signal given the noisy signal. This PDF is an optimal estimator for a large class of difference distortion measures between clean and noisy signal. This distortion measure assigns zero distortion for estimates in the immediate neighborhood of the clean signal, and uniform distortion for the ones outside this neighborhood (Loizou, 2007; Lu & Loizou, 2011). Therefore, the separation between noise and speech components is better.

8.2 FASR performance using the proposed improved MMSE-MODGD enhancement processing

Table 7 summarizes the results obtained when using the proposed algorithm (improved MMSE-MODGD enhancement processing), taking into account the information contained in the magnitude and phase, at SNR = 0 dB and SNR = 5 dB.

Table 7 Evaluation results obtained with the proposed improved MMSE-MODGD enhancement processing

Comparing the results in Table 7 with those obtained in Sect. 8.1 shows a significant improvement in the FASR performance metrics, in terms of EPP and Probabilities of Misleading Evidence (PMEH0 and PMEH1), for the three kinds of noise (babble, factory and white). In terms of EPP, the improvement represents a 1.84% reduction for babble noise and a 1.25% reduction for the other noises. These results are encouraging, given that a 1% improvement is significant for high-security systems such as FASR systems, where the innocence or indictment of individuals is at stake.

This improvement brought by adding the MMSE-MODGD estimator to the FASR system is explained by the fact that subtracting the noise from the noisy speech signal using the MMSE magnitude spectrum cannot eliminate the deep valleys surrounding the narrow peaks, which remain in the noise spectrum. Therefore, the excursion of noise peaks remains large. However, MMSE-MODGD discards these deep valleys while preserving the peaks and valleys (depth reduction) of the clean magnitude spectrum in the presence of additive noise (properties of the group delay function of a minimum-phase signal).

Moreover, Parthasarathi et al. (2011) indicated that the MODGD spectrum is inversely proportional to the noise power at frequencies corresponding to high noise regions, and directly proportional to the signal power. This indicates that the MODGD spectrum tends to follow the magnitude spectrum of the signal rather than that of the noise.

Thus, on the basis of experiments, it was found that noise distorts the shape of the MODGD spectrum, changes its slopes and reduces its dynamic range less than it does for the FFT spectrum. Most of the time, the frequency locations of the peaks of the higher formants are preserved to some extent in the MODGD spectrum, more so than in the FFT spectrum, in the presence of noise. Therefore, the proposed MMSE-MODGD retains more of the information contained in the noisy speech signal than the conventional MMSE (Gerkmann & Hendriks, 2012), which helps avoid degradation in speech intelligibility and FASR performance.

9 Conclusion

In this work, speech enhancement estimators for noisy speech signals were studied under the assumption that the spectrum of the noisy speech signal can be represented in the complex plane as the sum of the clean signal spectrum and the noise spectrum. In addition to the traditional estimator, which is based on the MMSE principle, an improved estimator was proposed by incorporating modified group delay spectra. Furthermore, compared to the FASR system using the classical MMSE spectral power estimators, the FASR system using the proposed MMSE-MODGD benefited from significantly better speech enhancement quality.

The results of the experiments show that the MODGD spectrum has the potential to reduce noise components in the noisy speech signal, since the MODGD spectrum tends to follow the magnitude spectrum of speech and opposes the noise spectrum. Therefore, it can be concluded that the important information retained in the enhanced speech using the MODGD spectrum can complement that given by the FFT spectrum and bring more reliability and robustness to the FASR system in noisy environments.

In future work, we intend to apply a state-of-the-art technique during the parametrization or training phase, which should be an interesting way to refine the speaker models and obtain better performance for the proposed forensic system. Subsequently, the refined system will be applied to another database specific to the forensic field to compare the two systems.