An improved MMSE estimator based modified group delay spectrum for Forensic Automatic Speaker Recognition

Djeghiour, Salim; Guerti, Mhania

doi:10.1007/s10772-021-09829-9

An improved MMSE estimator based modified group delay spectrum for Forensic Automatic Speaker Recognition

Published: 15 March 2021

Volume 24, pages 687–699, (2021)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

International Journal of Speech Technology Aims and scope Submit manuscript

An improved MMSE estimator based modified group delay spectrum for Forensic Automatic Speaker Recognition

Download PDF

Salim Djeghiour¹ &
Mhania Guerti¹

141 Accesses
Explore all metrics

Abstract

This paper presents an improved speech enhancement algorithm called Minimum Mean Square Error (MMSE), based on the MODified Group Delay spectrum (MODGD), for Forensic Automatic Speaker Recognition (FASR) under noisy environments. This algorithm uses the MODGD instead of the amplitude spectrum, to compute the power spectrum of the noise-corrupt signal. In the proposed estimator, the MODGD retains most of the formants information. Therefore, it enhances the noisy speech signal with high quality even at extremely low Signal-to-Noise Ratio (SNR) levels. The evaluation of the improved algorithm in simulated FASR scenarios was performed by adding different noise levels, extracted from the NOISEX-92 database to the clean NIST2000-traces. The results obtained show that the proposed MMSE–MODGD estimator provides greater suppression of noise components in regions of low SNR than the MMSE estimator. In addition, there is a drastic reduction in Equal Proportion Probability (EPP) (the improvements are 1.84% for babble noise and 1.25% for factory and white noises), combining FASR techniques with the proposed MMSE–MODGD estimator than with the conventional estimator.

Enhanced forensic speaker verification performance using the ICA-EBM algorithm under noisy and reverberant environments

Article 07 May 2020

Real-time adaptive training for forensic speaker verification in reverberation conditions

Article 22 December 2023

Acoustic feature extraction method for robust speaker identification

Article 05 May 2015

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

In some criminal cases, the voice recorded during a telephone call is the only clue available to investigators. There is therefore a very pressing and fully justified demand from the judicial police and magistrates, to use these recordings to guide the investigation, and to establish the guilt of a suspect or prove his/her innocence. Hence, speaker recognition techniques provide a valuable contribution to the Forensic Speaker Recognition System.

To this end, forensic speaker recognition system is considered as one of the disciplines of Speaker Recognition (SR) for both identification and verification. Although several forensic applications for SR have been developed in recent years, they have not been successful due to the high complexity and variability of the speech signal, and the mismatch between modeling and testing conditions especially in real life (Deshpande & Holambe, 2011). The latest can be caused by various sources, such as reverberation, compressed audio, degraded channels, and environmental noise that degrade the performance of the forensic system (Scheffer et al., 2013). Thus, the challenging task for forensic experts is to find effective algorithms for speech enhancement in highly degraded environments, such as additive noise (Zhang & Abdulla, 2007).

Several speech enhancement algorithms, which are based on the magnitude spectrum of the speech signal, have been developed to overcome this challenge, namely: Spectral Subtraction Method (SS) (Gustafsson et al., 2004), Spectral Subtraction with Over subtraction Model (SSOM) (Dixit & Mulge, 2014), Non-Linear Spectral Subtraction (NSS) (Verschuur et al., 2006), Adaptive Noise Cancellation (ANC) (Kwatra et al., 2017) and the Minimum Mean Square Error (MMSE) estimators (Lu & Loizou, 2011).

This study proposes a modification of the MMSE estimators, by replacing the magnitude spectrum estimated using a Fourier Transform (FT), by the MODGD spectrum (Asbai & Amrouche, 2017; Parthasarathi et al., 2011). In other words, the independent Gaussian random variables are derived from the MODGD spectrum, instead of their direct estimation from the Discrete Fourier Transform (Parthasarathi et al., 2011), to improve the MMSE algorithm by exploiting the information contained in the phase spectrums.

The proposed modification is motivated by two considerations.

In general, the speech signal is a mixed phase signal, because a speaker’s vocal tract is a minimum phase system (Akande & Murphy, 2005), and for minimum phase systems, information can be extracted from the phase or magnitude spectrum. Thus, in terms of analysis, the group delay of a mixed phase signal is the sum of the group delay of its minimum phase components (Hegde et al., 2004). In this study, the MODGD spectrum is thus processed by computing the mean of the posteriori density given in Lu and Loizou (2011), to exploit the properties of MODGDs (high resolution formants) on the MMSE method;
Furthermore, Parthasarathi et al. (2011) indicated that, the group delay spectrum retains most of the formants information even at low SNR_s of environmental noise. The MODGD spectrum is less affected by noise than the magnitude spectrum.

The contribution of this work is threefold; first, the MMSE estimators based on the MODGD are better adapted to the noisy speech segments of the tests in many applications of speaker recognition systems. Then, the exploitation of the information contained in the phase as well as in the amplitude spectrum can be noted for the proposed MMSE-MODGD. Finally, extensive testing and experimental validation of the proposed MMSE were carried out.

2 Forensic Automatic Speaker Recognition (FASR)

In the FASR systems, the use of scientific tools is necessary to meet the needs of a court for a crime or civil litigation (Roux et al., 2012). The main fields used in forensic science are: biology, chemistry, and medicine (Forest et al., 1983). Despite the predominance of the latter, other disciplines used such as: physics, computer science, geology, and psychology (Forest et al., 1983). For example, traditional biometric parameters, such as DNA and fingerprints, are often used in many forensic cases. The nature of the evidence, whether found at the crime scene or collected during investigations dictates the scientific methods or disciplines needed to study it. In the context of the FASR, experts are interested in methods of identifying a recorded voice. This is based on the fact that each person can be identified from a sample of his/her voice. In addition, a suspect can leave recordings of his/her voice on the phone, voicemail, an answering machine or a hidden recorder, which can then be used as evidence. Three databases are generally required to establish a FASR system: Potential population database (P), suspected speaker Reference database (R) and suspected speaker Control database (C). They allow calculating and evaluating the evidence from the questioned recording (trace) (Drygajlo et al., 2003; Kenai et al., 2019).

There is also another methodology adopted in FASR systems, which requires a statistical model capable of computing a likelihood value, when feature vectors are compared against such a model. This method uses only two databases: the suspected speaker Reference database and the relevant Population (Drygajlo, 2012). These two databases can be used to create two statistical models: (1) statistical model of the suspected speaker and (2) statistical model of the relevant population. The Universal Background Model (UBM) (Kenai et al., 2019), trained with the relevant population database, can also be used as model of the statistical model of the relevant population (Drygajlo, 2012). The multivariate evidence represented by the ensemble of feature vectors extracted from the questioned recording is compared to model of the statistical model of the suspected speaker and statistical model of the relevant population to calculate the likelihood ratio. The first comparison gives the similarity likelihood score (numerator of LR) and the second one gives the typicality likelihood score (denominator of LR) (Drygajlo et al., 2003; Drygajlo, 2012). Figure 1 shows the principle of this methodological approach.

However, in real forensic scenarios, the speech signal left by the suspects (trace) is often corrupted by the environmental noise, which degrades the performances of the FASR system (Alexander et al., 2004). To this end, this paper discusses the MMSE-MODGD estimator used in speech enhancement (Gerkmann & Hendriks, 2012) to improve the FASR system under snoisy environments (Figs. 2, 3, 4).

3 Minimum mean square error estimator of the noisy short-time power spectrum

The spectral subtraction method (Berouti et al., 1979) based on MMSE (Gerkmann & Hendriks, 2012) and minimum noise statistics (MS) (Martin, 2001) was used to enhance the speech signal damaged by the additive noise. The amplitude of the noisy signal was multiplied with a certain gain factor. Spectral subtraction introduced by Boll (1979), is the oldest method to remove the noise. It operates in the frequency domain, and its principle is to subtract a noise estimate from the observed signal. Noise is assumed to be additive, stationary or slightly varying, which allows to estimate it during silence periods. The noisy signal $y\left( n \right)$ can be written as (Lu & Loizou, 2011):

$$y\left( n \right) = x\left( n \right) + d\left( n \right)$$

(1)

where $x\left( n \right)$ and $d\left( n \right)$ represent the clean speech and noise signals, respectively.

Taking the short-time Fourier transform of $y\left( n \right)$, we obtain:

$$Y\left( {w_{k} } \right) = X\left( {w_{k} } \right) + D\left( {w_{k} } \right)$$

(2)

Equation (2) can be expressed in a polar form as follows:

$$Y_{k} e^{{j\theta_{y} \left( k \right)}} = X_{k} e^{{j\theta_{x} \left( k \right)}} + D_{k} e^{{j\theta_{d} \left( k \right)}}$$

(3)

where, $\left\{ {Y_{k} , X_{k} , D_{k} } \right\}$ denotes the magnitudes and $\left\{ {\theta_{y} \left( k \right), \theta_{x} \left( k \right), \theta_{d} \left( k \right)} \right\}$ denotes the phases at frequency bin $k$ of the noisy speech, clean speech and noise, respectively.

The MMSE estimator of the short-time power spectrum (MMSE) is given by (Wolfe & Godsill, 2003) as follows:

$$\begin{aligned} \hat{X}_{k}^{2} &= E\left\{ {X_{k}^{2} /Y\left( {w_{k} } \right)} \right\} \hfill \\ &= \int_{0}^{\infty } {X_{k}^{2} f_{{X_{k} }} (X_{k} /Y(w_{k} ))dX_{k} } \hfill \\ &= \frac{{\xi_{k} }}{{1 + \xi_{k} }}\left( {\frac{1}{{\gamma_{k} }} + \frac{{\xi_{k} }}{{1 + \xi_{k} }}} \right)Y_{k}^{2} \hfill \\ \end{aligned}$$

(4)

and,

$$\xi_{k} \equiv \frac{{\sigma_{x}^{2} \left( k \right)}}{{\sigma_{d}^{2} \left( k \right)}}, \gamma_{k} \equiv \frac{{Y_{k}^{2} }}{{\sigma_{d}^{2} \left( k \right)}}$$

(5)

$$\sigma_{x}^{2} \left( k \right) \equiv E\left\{ {X_{k}^{2} } \right\}, \sigma_{d}^{2} \left( k \right) \equiv E\left\{ {D_{k}^{2} } \right\}$$

(6)

where, $\xi_{k}$ and $\gamma_{k}$ denote the a priori and a posteriori SNRs, respectively.

The derivations of the above MMSE estimator were based on the following Rician posterior density $f_{{X_{k} }} \left( {X_{k} /Y\left( {w_{k} } \right)} \right)$:

$$f_{{X_{k} }} \left( {X_{k} /Y\left( {w_{k} } \right)} \right) = \frac{{X_{k} }}{{\sigma_{k}^{2} }}{\text{exp}}\left( { - \frac{{X_{k}^{2} + s_{k}^{2} }}{{2\sigma_{k}^{2} }}} \right)I_{0} \left( {\frac{{X_{k} s_{k} }}{{\sigma_{k}^{2} }}} \right)$$

(7)

where,

$$\frac{1}{{\lambda^{\prime}\left( k \right)}} \equiv \frac{1}{{\sigma_{x}^{2} \left( k \right)}} + \frac{1}{{\sigma_{d}^{2} \left( k \right)}}$$

(8)

$$\upsilon_{k} \equiv \frac{{\xi_{k} }}{{1 + \xi_{k} }}\gamma_{k}$$

(9)

$$\sigma_{k}^{2} \equiv \frac{{\lambda^{\prime}\left( k \right)}}{2}, s_{k}^{2} \equiv \upsilon_{k} \lambda^{\prime}\left( k \right)$$

(10)

$I_{0} \left( . \right)$ is the first kind modified Bessel function of zeroth order.

However, the analysis of the suppression curves revealed that the MMSE spectral power suppression rule of Eq. (4) provides less suppression in regions of low a priori SNR (Wolfe & Godsill, 2003). Lu and Loizou 2011) proposed the improved MMSE estimator of the short-time power-spectrum, to remedy the problem of less suppression in regions of low a priori SNR.

The power spectrum of the noise-corrupt signal is assumed to be the sum of the power spectra of the clean speech and noise, written as follows:

$$P_{y} \left( w \right) = P_{x} \left( w \right) + P_{d} \left( w \right)$$

(11)

In addition, an assumption is used in the derivation of these estimators based on Eq. (11) by approximating the power spectrum using the magnitude squared spectrum, which is the sample estimate of the ensemble average. Therefore, Eq. (11) can be written as follows:

$$Y_{k}^{2} \approx X_{k}^{2} + D_{k}^{2}$$

(12)

Moreover, assuming that the real and imaginary parts of the Discrete Fourier Transform (DFT) coefficients are modeled as independent Gaussian random variables with equal variance (Ephraim & Malah, 1984), the probability density of $X_{k}^{2}$ is exponential and can be written as follows:

$$f_{{X_{k}^{2} }} \left( {X_{k}^{2} } \right) = \frac{1}{{{\upsigma }_{{\text{x}}}^{2} \left( {\text{k}} \right)}}{\text{e}}^{{ - \frac{{{\text{X}}_{{\text{k}}}^{2} }}{{{\upsigma }_{{\text{x}}}^{2} \left( {\text{k}} \right)}}}}$$

(13)

Similarly, the density of $D_{k}^{2}$ is given by Eq. (14):

$$f_{{D_{k}^{2} }} \left( {D_{k}^{2} } \right) = \frac{1}{{{\upsigma }_{{\text{d}}}^{2} \left( {\text{k}} \right)}}{\text{e}}^{{ - \frac{{{\text{D}}_{{\text{k}}}^{2} }}{{{\upsigma }_{{\text{d}}}^{2} \left( {\text{k}} \right)}}}}$$

(14)

where, $\sigma_{x}^{2} \left( k \right)$ and $\sigma_{d}^{2} \left( k \right)$ are given by Eq. (6).

The posterior probability density of the clean speech magnitude-squared spectrum is obtained using the Bayes’ rule as follows:

$$\begin{aligned} f_{{X_{k}^{2} }} \left( {X_{k}^{2} /Y_{k}^{2} } \right) &= \frac{{f_{{Y_{k}^{2} }} \left( {Y_{k}^{2} /X_{k}^{2} } \right)f_{{X_{k}^{2} }} \left( {X_{k}^{2} } \right)}}{{f_{{Y_{k}^{2} }} \left( {Y_{k}^{2} } \right)}} \hfill \\ &= \left\{ {\begin{array}{*{20}c} {\psi_{k} e^{{ - \frac{{X_{k}^{2} }}{\lambda \left( k \right)}}} , } & {if\ \sigma_{x}^{2} \left( k \right) \ne \sigma_{d}^{2} \left( k \right)} \\ {\frac{1}{{Y_{k}^{2} }},} & {if\ \sigma_{x}^{2} \left( k \right) = \sigma_{d}^{2} \left( k \right)} \\ \end{array} } \right\} \hfill \\ \end{aligned}$$

(15)

$\lambda \left( k \right)$ is defined as:

$$\frac{1}{\lambda \left( k \right)} \equiv \frac{1}{{\sigma_{x}^{2} \left( k \right)}} - \frac{1}{{\sigma_{d}^{2} \left( k \right)}}, {\text{if}}\ \sigma_{x}^{2} \left( k \right) \ne \sigma_{d}^{2} \left( k \right)$$

(16)

and

$$\psi_{k} \equiv \frac{1}{{\lambda \left( k \right)\left\{ {1 - exp\left[ { - \frac{{Y_{k}^{2} }}{\lambda \left( k \right)}} \right]} \right\}}}.$$

(17)

Using Eqs. (12)–(15), the MMSE estimator is obtained by computing the mean of the posteriori density given in Eq. (15) as follows:

$$\begin{aligned} \hat{X}_{k}^{2} &= E\left\{ {X_{k}^{2} /Y_{k}^{2} } \right\} \hfill \\ &= \int_{0}^{{Y_{k}^{2} }} {X_{k}^{2} f_{{X_{k}^{2} }} \left( {X_{k}^{2} /Y_{k}^{2} } \right)dX_{k}^{2} } \hfill \\& = \left\{ {\begin{array}{ll} {\left( {\frac{1}{{\upsilon_{k} }} - \frac{1}{{e^{{\upsilon_{k} }} - 1}}} \right)Y_{k}^{2} ,} & {if\ \sigma_{x}^{2} \left( k \right) \ne \sigma_{d}^{2} \left( k \right)} \\ { \frac{1}{2}Y_{k}^{2} , } & {if\ \sigma_{x}^{2} \left( k \right) = \sigma_{d}^{2} \left( k \right)} \\ \end{array} } \right. \hfill \\ \end{aligned}$$

(18)

where, $\upsilon_{k}$ is defined as:

$$\upsilon_{k} \equiv \frac{{1 - \xi_{k} }}{{\xi_{k} }}\gamma_{k}$$

(19)

4 The proposed modified group delay functions for the MMSE estimator of the noisy short-time power spectrum

A speech signal can be represented completely in the spectral domain only if the amplitude and phase information is specified. However, the information extracted from the phase spectrum is more complex than the information extracted from the amplitude spectrum, as the phase spectrum is generally discontinuous (orwrapped) between $\left[ { - \pi ,\pi } \right]$ (Murthy & Yegnanarayana, 2011). A multi-valued function is used to make it into a continuous function; this is called the unwrapped phase (unwrapping) (Parthasarathi et al., 2011). The processing of its derivative (i.e., the phase derivative), the “group delay function” (Parthasarathi et al., 2011), is mainly used to extract the information contained in the phase spectrum.

Let $x\left( n \right)$ a speech signal, its Fourier transform is given by Eq. (3).

The group delay function $\tau (\omega )$ of a signal $x\left( n \right)$ is defined as the negative derivative of the phase spectrum $\theta (\omega )$ as follow:

$$\tau_{X} \left( \omega \right) = - \frac{d\theta \left( \omega \right)}{{d\omega }}$$

(20)

The group delay function can also be estimated from the speech signal using Eq. (21) (Asbai & Amrouche, 2017):

$$\tau_{X} \left( \omega \right) = \frac{{X_{R} \left( \omega \right)\hat{X}_{R} \left( \omega \right) + X_{1} \left( \omega \right)\hat{X}_{1} \left( \omega \right)}}{{\left| {X\left( \omega \right)} \right|^{2} }}$$

(21)

where, $R$ and $I$ denote the real part and imaginary part respectively, $x(n) \leftrightarrow X(\omega )$ and $\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x} (n) \leftrightarrow \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{X} (\omega )$ are Fourier Transform pairs, and $\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x} (n) = nx(n)$.

The group delay function requires that the speech signal must be a minimum phase or that the poles of the transfer function be within the unit circle (Asbai & Amrouche, 2017).

By smoothing the amplitude $X(\omega )$ (Asbai & Amrouche, 2017) spectrum in Eq. (21), we define a MODified Group Delay function (MODGD) which given as follows:

$$\tau_{X} \left( \omega \right) = \left( {\frac{{\tau_{s} \left( \omega \right)}}{{\left| {\tau_{s} \left( \omega \right)} \right|}}} \right)\left( {\left| {\tau_{s} \left( \omega \right)} \right|^{\alpha } } \right)$$

(22)

where,

$$\tau_{X} \left( \omega \right) = \frac{{X_{R} \left( \omega \right)\hat{X}_{R} \left( \omega \right) + X_{1} \left( \omega \right)\hat{X}_{1} \left( \omega \right)}}{{\left| {S\left( \omega \right)} \right|^{2Y} }}$$

(23)

and $\left| {S(\omega )} \right|$ is a smoothed version of $\left| {X(\omega )} \right|$; the parameters $\alpha$ and $\gamma$ are introduced to control the dynamic range. The length of the cepstral smoothing window is controlled by the parameter lifter_ω.

Therefore, based on Eqs. (22) and (23), Eq. (4) can be written as follows:

$$\begin{aligned} \hat{X}_{k}^{2} &= E\left\{ {\tau_{Xk}^{2} /\tau_{Y} \left( {w_{k} } \right)} \right\} \hfill \\ &= \int_{0}^{\infty } {\tau_{Xk}^{2} f_{{\tau_{Xk} }} \left( {\tau_{Xk} /\tau_{Yk} \left( {w_{k} } \right)} \right)d\tau_{Xk} } \hfill \\ &= \frac{{\xi_{k} }}{{1 + \xi_{k} }}\left( {\frac{1}{{\gamma_{k} }} + \frac{{\xi_{k} }}{{1 + \xi_{k} }}} \right)\tau_{Yk}^{2} \hfill \\ \end{aligned}$$

(24)

where,

$$\xi_{k} \equiv \frac{{\sigma_{x}^{2} \left( k \right)}}{{\sigma_{d}^{2} \left( k \right)}}, \gamma_{k} \equiv \frac{{\tau_{Yk}^{2} }}{{\sigma_{d}^{2} \left( k \right)}}$$

(25)

$$\sigma_{x}^{2} \left( k \right) \equiv E\left\{ {\tau_{Xk}^{2} } \right\}, \sigma_{d}^{2} \left( k \right) \equiv E\left\{ {\tau_{Dk}^{2} } \right\}$$

(26)

Finally, the Rician posterior density $f_{{X_{k} }} \left( {X_{k} /Y\left( {w_{k} } \right)} \right)$ becomes:

$$f_{{\tau_{Xk} }} \left( {\tau_{Xk} /\tau_{YK} \left( {w_{k} } \right)} \right) = \frac{{\tau_{Xk} }}{{\sigma_{k}^{2} }}\exp \left( { - \frac{{\tau_{Xk}^{2} + s_{k}^{2} }}{{2\sigma_{k}^{2} }}} \right)I_{0} \left( {\frac{{\tau_{Xk} s_{k} }}{{\sigma_{k}^{2} }}} \right)$$

(27)

Moreover, Eqs. (12), (13) and (14) can be written as follows:

$$\tau_{Yk}^{2} \approx \tau_{Xk}^{2} + \tau_{Dk}^{2}$$

(28)

$$f_{{\tau_{Xk}^{2} }} \left( {\tau_{Xk}^{2} } \right) = \frac{1}{{{\upsigma }_{{\text{x}}}^{2} \left( {\text{k}} \right)}}{\text{e}}^{{ - \frac{{{\uptau }_{{{\text{Xk}}}}^{2} }}{{{\upsigma }_{{\text{x}}}^{2} \left( {\text{k}} \right)}}}}$$

(29)

$$f_{{\tau_{Dk}^{2} }} \left( {\tau_{Dk}^{2} } \right) = \frac{1}{{{\upsigma }_{{\text{d}}}^{2} \left( {\text{k}} \right)}}{\text{e}}^{{ - \frac{{\tau_{Dk}^{2} }}{{{\upsigma }_{{\text{d}}}^{2} \left( {\text{k}} \right)}}}}$$

(30)

where, $\sigma_{x}^{2} \left( k \right)$ and $\sigma_{d}^{2} \left( k \right)$ are given by Eq. (26).

The posterior probability density of the clean speech magnitude-squared spectrum become as follows:

$$\begin{aligned} f_{{\tau_{Xk}^{2} }} \left( {\tau_{Xk}^{2} /\tau_{Yk}^{2} } \right) &= \frac{{f_{{\tau_{Yk}^{2} }} \left( {\tau_{Yk}^{2} /\tau_{Xk}^{2} } \right)f_{{\tau_{Xk}^{2} }} \left( {\tau_{Xk}^{2} } \right)}}{{f_{{\tau_{Yk}^{2} }} \left( {\tau_{Yk}^{2} } \right)}} \hfill \\ &= \left\{ {\begin{array}{*{20}c} {\psi_{k} e^{{ - \frac{{\tau_{Xk}^{2} }}{\lambda \left( k \right)}}} , } & {if\ \sigma_{x}^{2} \left( k \right) \ne \sigma_{d}^{2} \left( k \right)} \\ {\frac{1}{{\tau_{Yk}^{2} }}, } & {if\ \sigma_{x}^{2} \left( k \right) = \sigma_{d}^{2} \left( k \right)} \\ \end{array} } \right. \hfill \\ \end{aligned}$$

(31)

where, $\lambda \left( k \right)$ is given by Eqs. (16) and (26).and

$$\psi_{k} \equiv \frac{1}{{\lambda \left( k \right)\left\{ {1 - exp\left[ { - \frac{{\tau_{Yk}^{2} }}{\lambda \left( k \right)}} \right]} \right\}}}$$

(32)

Finally, the modified MMSE estimator is given by:

$$\begin{aligned} \hat{X}_{k}^{2}& = E\left\{ {\tau_{Xk}^{2} /\tau_{Yk}^{2} } \right\} \hfill \\ &= \int_{0}^{{\tau_{Yk}^{2} }} {\tau_{Xk}^{2} f_{{\tau_{Xk}^{2} }} \left( {\tau_{Xk}^{2} /\tau_{Yk}^{2} } \right)d\tau_{Xk}^{2} } \hfill \\ &= \left\{ {\begin{array}{ll} {\left( {\frac{1}{{\upsilon_{k} }} - \frac{1}{{e^{{\upsilon_{k} }} - 1}}} \right)\tau_{Yk}^{2} ,} & {if\ \sigma_{x}^{2} \left( k \right) \ne \sigma_{d}^{2} \left( k \right)} \\ { \frac{1}{2}\tau_{Yk}^{2} , } & {if\ \sigma_{x}^{2} \left( k \right) = \sigma_{d}^{2} \left( k \right)} \\ \end{array} } \right. \hfill \\ \end{aligned}$$

(33)

$\upsilon_{k}$ is given by Eq. (19), where, $\gamma_{k}$ is given by Eq. (25).

5 Experimental protocol for speech enhancement

Extensive objective quality tests were carried out to evaluate the performance of the proposed MMSE-MODGD estimation method using ten (10) sentences extracted from the NOIZEUS database (Hu & Loizou). In this database, the noise signals are generated by adding the noise from the AURORA and NOISEX-92 databases to the clean signals, to an overall SNR of 0 dB, 5 dB and 10 dB. The frame size chosen is 20 ms with a 50% overlap. A sampling frequency of 8 kHz and a Hamming window were used. The methods used for comparison with the proposed MMSE-MODGD are the maximum-likelihood (ML) estimator, the MMSE estimator, the log MMSE estimator, the maximum a posteriori (MAP) estimator, incorporating speech presence probability in MMSE (MMSE-ISP) estimator, incorporating speech presence probability in log MMSE (log MMSE-ISP) estimator and Wiener estimator (Loizou, 2007). The objective assessment was carried out as proposed in Hu and Loizou (2008). The tests carried out to evaluate the proposed method include measures related to the perception of the speech signal on a five-point (1–5) scale of signal distortion (SIG), background noise on a five-point (1–5) scale (BAK) and overall quality (OVRL) based on the Mean Opinion Score (MOS) ranging from 1 to 5. The other measures used are segmental SNR (SegSNR), weighted-slope spectral (WSS), perceptual evaluation of speech quality (PESQ) and log-likelihood ratio (LLR) (Hu & Loizou, 2008).

5.1 Results and discussion

Based on a comparative study using spectrograms, it can be noticed that the proposed MMSE-MODGD method gives good results compared to ML, MMSE, log MMSE, MAP, MMSE-ISP, log MMSE-ISP and Wiener. This good performance achieved by the proposed approach is confirmed by the objective evaluation.

Tables 1, 2 and 3 show the results of the evaluations using the objective measures: SIG, BAK, OVRL, PESQ, SegSNR, WSS and LLR using 10 sentences extracted from the NOIZEUS database. The proposed MMSE-MODGD method is compared with ML, MMSE, log MMSE, MAP, MMSE-ISP, log MMSE-ISP and Wiener, in the context of degradation by a white, factory and babble noises, respectively. The LLR and WSS scores indicate speech loss and should therefore be minimal. The results presented in the tables clearly show that the SIG, BAK and OVRL scores, which reflect the level of perception of the speech signal and the overall quality, are generally higher for the MMSE-MODGD method than for the other methods. The results also show that these assessments confirm that speech improvement based on the MMSE-MODGD method produces a higher segmental SNR, higher PESQ and lower WSS than other methods.

Table 1 Objective evaluations of the MMSE-MODGD technique compared with ML, MMSE, Log-MMSE, MAP, MMSE-ISP, Log-MMSE-ISP and Wiener and corrupted with white noise

Full size table

Table 2 Objective evaluations of the MMSE-MODGD technique compared with ML, MMSE, Log-MMSE, MAP, MMSE-ISP, Log-MMSE-ISP and Wiener and corrupted with factory noise.

Full size table

Table 3 Objective evaluations of the MMSE-MODGD technique compared with ML, MMSE, Log-MMSE, MAP, MMSE-ISP, Log-MMSE-ISP and Wiener and corrupted with babble noise.

Full size table

6 Experimental protocol for FASR setup

Generally, there are two constraints in the FASR scenarios. The first is the non-collaboration of the suspects and the second one is the limited number of suspects known by the target person (person who suffers from the actions of others). Due to these constraints, the number of suspects used to develop such systems (FASR) is really limited.

In this work, all the experiments were performed on the NIST 2000 corpus, which consists of the spontaneous telephone speech sampled at 8 kHz. For feature extraction, a 23 MFCC vector is found from pre-emphasized speech every 10 ms using a 20 ms Hamming window.

Twenty speakers were chosen as suspects from this corpus; the suspected speaker Reference database (R) was recorded with 1 recording of 2 min duration which was chosen for each suspect, and 75% of the duration of this recording was intended for modeling and 25% for tests (traces).

The test segment is divided into 4 sections, to have 4 traces for each suspect. The Potential database (P) used was a subset of 420 speakers from the same corpus cited below. The GMM-UBM consisted of 256 mixture components trained via Expectation Maximization (EM) algorithm using 10 iterations (Reynolds & Rose, 1995).

Twenty suspects models were created through the GMM-UBM using maximum a posteriori (MAP) adaptation with factor relevance r = 16, 256 mixtures and an adaptation data amount of 14 h is used (Reynolds et al., 2000).

According to the Fig. 1, which explains the FASR methodological approach adopted in our work, we need 3 databases:

1.
Potential-database (UBM database): contains 420 speakers (420*2 min = 14 h);
2.
Trace-database (T): contains 20 speakers, each speaker has 4 traces of (0.25*2 min)/4 = 7.5 s. So, the total of true trials (H₀) is 20*4 = 80 and the total of false trials (H₁) is 4*20*20 – 80 (true trials) = 1520;
3.
Reference-database: contains 20 speakers, each speaker has 0.75*2 min = 1.5 min.

Performance metrics provided a single numerical value that described the performance in terms of accuracy, discriminating power and calibration of the LR method (Probabilities of Misleading Evidence, PMEH₀ and PMEH₁), Equal Proportion Probability (EPP) (Drygajlo et al., 2016; Haraksim & Drygajlo, 2016; Kenai et al., 2019).The values used for the MODGD functions are the length of cepstral lifter window lifter_ω = 8 and $\alpha$ = 0.4, $\gamma$ = 0.9.

7 Classical Forensic Automatic Speaker Recognition results

This Section evaluates the results obtained in clean and noisy environments.

7.1 FASR performance under clean conditions

An evaluation of FASR based on GMM-UBM performance in terms of EPP, PMEH₀ and PMEH₁ was performed in a clean environment.

According to Table 4, the results are very satisfying, in terms of EPP, PMEH₀ and PMEH₁. Therefore, EPP = 1, 25%, the LR exceeds 1 in 96% of cases when H₀ is true and in only 0.4% of cases when H₁ is true.

Table 4 Evaluation results obtained in clean environment

Full size table

7.2 FASR performance in noisy conditions

Different noises were arbitrarily chosen in this study (babble, factory and white) that were added to the corpus of the questioned recording (traces) to produce noisy feature vectors. Table 5 presents the performances of FASR, at SNR = 0 dB and SNR = 5 dB.

Table 5 Evaluation results obtained under noisy environments

Full size table

Table 5 summarizes the performances of the FASR under noisy environment, in terms of EPP, PMEH₀ and PMEH₁. It can be noticed that the performance metrics decrease with decreasing SNR, and increase with increasing SNR, and the performance of noisy speech corrupted with babble noise is less degraded compared to the other noises. This can be explained by the fact that, the babble noise is an overlap of several sounds that comes from two or more speakers (Djeghiour et al., 2018). Its features are like those of the voice. It covers only the low frequency spectrum. Therefore, only the information in low frequency regions is affected by this noise. Whereas, the factory and white noises are characterized by a high intensity. They cover the low and high frequency spectrum and they affect all the existing information in the speech signal. The performance is worse when using these two types of noises (factory and white) than those obtained under babble.

8 Enhanced Forensic Automatic Speaker Recognition results

In this Section, the performance of this system was calculated using the MMSE magnitude enhancement processing and our approach proposed MMSE-MODGD enhancement processing.

8.1 FASR performance using MMSE-magnitude enhancement processing

Table 6 indicates the results obtained when using the MMSE-magnitude enhancement processing (only the information contained in the magnitude), at SNR = 0 dB and SNR = 5 dB.

Table 6 Evaluation results obtained with MMSE enhancement processing

Full size table

The results presented in Table 6, when applying the MMSE speech enhancement algorithm on noisy tests (traces) speech, indicate an improvement of the performances represented by the decreasing of the EPP with the evolution of LR.

This improvement is explained by the fact that, the MMSE based magnitude spectrum estimator discards all the broadband noise by eliminating most of the wide peaks that constitute the undesirable variances of the spectrum ordinates (Loizou, 2007).

Moreover, the MMSE based magnitude spectrum estimator provides the posterior Probability Density Function (PDF) of the clean signal given the noisy signal. This PDF is an optimal estimator for a large class of difference distortion measures between clean and noisy signal. This distortion measure assigns zero distortion for estimates in the immediate neighborhood of the clean signal, and uniform distortion for the ones outside this neighborhood (Loizou, 2007; Lu & Loizou, 2011). Therefore, the separation between noise and speech components is better.

8.2 FASR performance using the proposed improved MMSE-MODGD enhancement processing

Table 7 summarizes the results obtained when using the proposed algorithm (improved MMSE-MODGD enhancement processing), taking into account the information contained in the magnitude and phase, at SNR = 0 dB and SNR = 5 dB.

Table 7 Evaluation results obtained with the proposed improved MMSE-MODGD enhancement processing

Full size table

Based on the results in Table 7, it can be observed that when comparing these results with those obtained in Sect. 8.1, a significant improvement of FASR performance metrics in terms of EPP and Probabilities of Misleading Evidence (PMEH₀ and PMEH₁) is observed, for the three kinds of noises (babble, factory and white). Therefore, in terms of EPP, the improvements represent 1.84% reduction for babble noise and 1.25% reduction for other noises. These results are encouraging given that 1% improvement is significant for high security systems such as FASR systems, as the innocence or indictment of individuals is at stake.

This improvement given by the addition of the MMSE-MODGD estimator to the FASR system is explained by the fact that, the subtraction of the noise from the noisy speech signal, when using MMSE-magnitude spectrum cannot eliminate the deep valleys surrounding the narrow peaks, which remain in the noise spectrum. Therefore, the excursion of noise peaks remains large. However, MMSE-MODGD discards these deep valleys by well preserving the peaks and valleys (depth reduction) of the clean magnitude spectrum in the presence of additive noise (properties of the group delay function of a minimum-phase signal).

Moreover, in (Parthasarathi et al., 2011), the authors indicated that the MODGD spectrum is inversely proportional to the noise power at frequencies corresponding to high noise regions, and directly proportional to the signal power. This indicates that, the MODGD spectrum tends to follow the magnitude spectrum of the signal, rather than that of the noise.

Thus, on the basis of experiments, it was found that noise distorts the shape of the MODGD spectrum less than the FFT spectrum, changes its slopes and reduces the dynamic range of the MODGD spectrum less than the FFT. Most of the time, the frequency locations of the peaks of the higher formants are preserved to some extent in the MODGD spectrum compared to the FFT spectrum in the presence of noise. Therefore, our proposal for MMSE-MODGD retains more information contained in the noisy speech signal than conventional MMSE (Gerkmann & Hendriks, 2012), to avoid any degradation in speech intelligibility and FASR performance.

9 Conclusion

In this work, speech enhancement estimators of noisy speech signal were studied under the assumption that the spectrum of the noisy speech signal can be represented in complex plane as sum of clean signal spectrum and noise spectrum. In addition to the traditional estimator, which is based on the MMSE principles, the improved estimator was proposed by incorporating modified group delay spectrums. Furthermore, compared to the FASR performance using the classical MMSE spectral power estimators, the FASR using the proposed MMSE-MODGD resulted in significantly better speech enhancement quality.

The results of the experiments show that MODGD spectrum has the potential to reduce noise components in the noisy speech signal, since the MODGD spectra tends to follow the magnitude spectrum of speech and opposes the noise spectrum. Therefore, it can be concluded that the important information retained in the enhanced speech using the MODGD spectrum can complement that given by FFT spectrum and give more reliability and robustness to the FASR system under noisy environments.

In future work, we intend to apply a state of the art technique during the parametrization or training phase, which should be an interesting approach to refine the speaker models to obtain a better performance for the proposed forensic system. Subsequently, the latter will be applied to another database specific to the forensic field to compare the two systems.

References

Akande, O. O., & Murphy, P. J. (2005). Estimation of the vocal tract transfer function with application to glottal wave analysis. Speech Communication, 46(1), 15–36.
Article Google Scholar
Alexander, A., Botti, F., Dessimoz, D., & Drygajlo, A. N. D. R. Z. E. J. (2004). The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications. Forensic Science International, 146, S95–S99.
Article Google Scholar
Asbai, N., & Amrouche, A. (2017). Boosting scores fusion approach using Front-End Diversity and adaboost Algorithm, for speaker verification. Computers & Electrical Engineering, 62, 648–662.
Article Google Scholar
Berouti, M., Schwartz, R., & Makhoul, J. (1979). Enhancement of speech corrupted by acoustic noise. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP'79. (Vol. 4, pp. 208–211). IEEE.
Boll, S. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2), 113–120.
Article Google Scholar
De Forest, P. R., Gaensslen, R. D., & Lee, H. C. (1983). Forensic science: an introduction to criminalistics. New York: McGraw-Hill Humanities/Social Sciences/Languages.
Google Scholar
Deshpande, M. S., & Holambe, R. S. (2011). Robust speaker identification in babble noise. In Proceedings of the International Conference & Workshop on Emerging Trends in Technology. (pp. 635–640). ACM
Dixit, S., & Mulge, D. M. Y. (2014). Review on speech enhancement techniques. International Journal of Computer Science and Mobile Computing, 3(8), 285–290.
Google Scholar
Djeghiour, S., Asbai, N., Kenai, O., & Guerti, M. (2018). Forensic Automatic Speaker Recognition under Noisy Environments. IC3E’2018.University of Bouira (pp. 1–5).
Drygajlo, A. (2012). Automatic speaker recognition for forensic case assessment and interpretation. In Forensic Speaker Recognition (pp. 21–39). New York: Springer.
Drygajlo, A., Jessen, M., Gfroerer, S., Wagner, I., Vermeulen, J., & Niemi, T. (2016). Methodological guidelines for best practice in forensic semiautomatic and automatic speaker recognition. VerlagfürPolizeiwissenschaft.
Drygajlo, A., Meuwly, D., & Alexander, A. (2003). Statistical methods and Bayesian interpretation of evidence in forensic automatic speaker recognition. In Eighth European Conference on Speech Communication and Technology.
Ephraim, Y., & Malah, D. (1984). Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1109–1121.
Article Google Scholar
Gerkmann, T., & Hendriks, R. C. (2012). Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Transactions on Audio, Speech, and Language Processing, 20(4), 1383–1393.
Article Google Scholar
Gustafsson, H., Lindgren, U., Claesson, I., & Nordholm, S. (2004). U.S. Patent No. 6,717,991. Washington, DC: U.S. Patent and Trademark Office.
Haraksim, R., & Drygajlo, A. (2016). Measuring performance in forensic automatic speaker recognition: VQ, GMM-UBM, i-vectors. Biosig 2016.
Hegde, R. M., Murthy, H. A., & Rao, G. R. (2004). Application of the modified group delay function to speaker identification and discrimination. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP'04). (Vol. 1, pp. I-517). IEEE.
Hu, Y., & Loizou, P. C. Noizeus: A noisy speech corpus for evaluation of speech enhancement algorithms. Retrieved from http://www.utdallas.edu/~loizou/speech/noizeus/.
Hu, Y., & Loizou, P. C. (2008). Evaluation of objective quality measures for speech enhancement, IEEE Trans. Audio, Speech Language Process., 16(1), 229–238.
Article Google Scholar
Kenai, O., Djeghiour, S., Asbai, N., & Guerti, M. (2019). Forensic gender speaker recognition under clean and noisy environments. Procedia Computer Science, 151, 897–902.
Article Google Scholar
Kwatra, N., Milani, A. A., & Alderson, J. (2017). U.S. Patent No. 9,824,677. Washington, DC: U.S. Patent and Trademark Office.
Loizou, P. C. (2007). Speech enhancement theory and practice (1st ed.). Boca raton: CRC Press.
Book Google Scholar
Lu, Y., & Loizou, P. C. (2011). Estimators of the magnitude-squared spectrum and methods for incorporating SNR uncertainty. IEEE Transactions on Audio, Speech, and Language Processing, 19(5), 1123–1137.
Article Google Scholar
Martin, R. (2001). Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on Speech and Audio Processing, 9(5), 504–512.
Article Google Scholar
Murthy, H. A., & Yegnanarayana, B. (2011). Group delay functions and its applications in speech technology. Sadhana, 36(5), 745–782.
Article Google Scholar
Parthasarathi, S. H. K., Padmanabhan, R., & Murthy, H. A. (2011). Robustness of group delay representations for noisy speech signals. International Journal of Speech Technology, 14(4), 361.
Article Google Scholar
Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3), 19–41.
Article Google Scholar
Reynolds, D. A., & Rose, R. C. (1995). Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1), 72–83.
Article Google Scholar
Roux, C., Crispino, F., & Ribaux, O. (2012). From forensics to forensic science. Current Issues Criminal Justice, 24, 7.
Article Google Scholar
Scheffer, N., Ferrer, L., Lawson, A., Lei, Y., & McLaren, M. (2013). Recent developments in voice biometrics: Robustness and high accuracy. In 2013 IEEE International Conference on Technologies for Homeland Security (HST), (pp. 447–452). IEEE.
Verschuur, C., Lutman, M., & Wahat, N. H. A. (2006). Evaluation of a non linear spectral subtraction noise suppression scheme in cochlear implant users. Cochlear Implants International, 7(4), 193–196.
Article Google Scholar
Wolfe, P. J., & Godsill, S. J. (2003). Efficient alternatives to the Ephraim and Malah suppression rule for audio signal enhancement. EURASIP Journal on Applied Signal Processing, 2003, 1043–1051.
MATH Google Scholar
Zhang, Y., & Abdulla, W. H. (2007). Robust speaker identification in noisy environment using cross diagonal GTF-ICA feature. In 2007 6th International Conference on Information, Communications & Signal Processing, (pp. 1–4). IEEE.

Download references

Author information

Authors and Affiliations

Signal and Communications Laboratory, Ecole Nationale Polytechnique, Algiers, Algeria
Salim Djeghiour & Mhania Guerti

Authors

Salim Djeghiour
View author publications
You can also search for this author in PubMed Google Scholar
Mhania Guerti
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Salim Djeghiour.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Djeghiour, S., Guerti, M. An improved MMSE estimator based modified group delay spectrum for Forensic Automatic Speaker Recognition. Int J Speech Technol 24, 687–699 (2021). https://doi.org/10.1007/s10772-021-09829-9

Download citation

Received: 31 March 2020
Accepted: 01 March 2021
Published: 15 March 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s10772-021-09829-9

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

An improved MMSE estimator based modified group delay spectrum for Forensic Automatic Speaker Recognition

Abstract

Similar content being viewed by others

Enhanced forensic speaker verification performance using the ICA-EBM algorithm under noisy and reverberant environments

Real-time adaptive training for forensic speaker verification in reverberation conditions

Acoustic feature extraction method for robust speaker identification

1 Introduction

2 Forensic Automatic Speaker Recognition (FASR)

3 Minimum mean square error estimator of the noisy short-time power spectrum

4 The proposed modified group delay functions for the MMSE estimator of the noisy short-time power spectrum

5 Experimental protocol for speech enhancement

5.1 Results and discussion

6 Experimental protocol for FASR setup

7 Classical Forensic Automatic Speaker Recognition results

7.1 FASR performance under clean conditions

7.2 FASR performance in noisy conditions

8 Enhanced Forensic Automatic Speaker Recognition results

8.1 FASR performance using MMSE-magnitude enhancement processing

8.2 FASR performance using the proposed improved MMSE-MODGD enhancement processing

9 Conclusion

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

An improved MMSE estimator based modified group delay spectrum for Forensic Automatic Speaker Recognition

Abstract

Similar content being viewed by others

Enhanced forensic speaker verification performance using the ICA-EBM algorithm under noisy and reverberant environments

Real-time adaptive training for forensic speaker verification in reverberation conditions

Acoustic feature extraction method for robust speaker identification

Explore related subjects

1 Introduction

2 Forensic Automatic Speaker Recognition (FASR)

3 Minimum mean square error estimator of the noisy short-time power spectrum

4 The proposed modified group delay functions for the MMSE estimator of the noisy short-time power spectrum

5 Experimental protocol for speech enhancement

5.1 Results and discussion

6 Experimental protocol for FASR setup

7 Classical Forensic Automatic Speaker Recognition results

7.1 FASR performance under clean conditions

7.2 FASR performance in noisy conditions

8 Enhanced Forensic Automatic Speaker Recognition results

8.1 FASR performance using MMSE-magnitude enhancement processing

8.2 FASR performance using the proposed improved MMSE-MODGD enhancement processing

9 Conclusion

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation