1 Introduction

The development of mobile devices such as smartphones and tablets and the use of voice biometrics (e.g., for access control, device personalization, and transaction banking) have paved the way for a large number of new multimedia applications [7, 39]. Automatic Speaker Recognition (ASR) refers to recognizing a person based on his/her voice as a biometric feature. It consists of two tasks: Speaker Identification (SI) and Speaker Verification (SV). In the speaker identification task, an unknown speaker is compared against a set of known speakers, and the best matching speaker gives the identification. Speaker verification is the process of accepting or rejecting the identity claim of a speaker [16]. Speaker recognition systems are broadly classified into two categories: Text-Independent (TI-SR) and Text-Dependent (TD-SR). In TI-SR, the speaker can pronounce any sentence to be recognized, i.e., the system does not impose any constraint on the training and test sessions, whereas TD-SR systems use the same phrases/sentences for training and test sessions [2, 10]. Nowadays, many applications use speaker recognition to improve authentication procedures, such as banking over wireless digital communication networks, security control for confidential information, telephone shopping, database access services and voice mail [10]. Feature extraction is the crucial component of a speaker recognition system: the speech signal is represented in a compact manner, and the extracted features should be capable of separating the speakers from each other in the feature space.

The effects of additive noise and/or channel distortion have always been among the most important problems in speaker recognition research. Various techniques have been proposed to improve the performance of speaker recognition systems in the presence of noise. Speech enhancement methods include, for example, Spectral Subtraction (SS) or Nonlinear Spectral Subtraction (NSS), Wiener filtering and Kalman filtering. Moreover, other processing techniques have been proposed to increase the robustness of ASR systems. Some of these techniques use feature normalization, such as Cepstral Mean and Variance Normalization (CMVN), Relative Spectral (RASTA) processing of speech, or feature warping [24, 27]. In [22], multi-condition models and missing-feature techniques were used to compensate for the degraded signal. The work in [31] proposed a soft spectral subtraction method that handles missing features in speaker verification. A recent work on robust speaker recognition based on the i-vector technique made significant progress in reducing the channel effect and the additive noise [4]. Different channel compensation techniques have been used, such as Within-Class Covariance Normalization (WCCN), Linear Discriminant Analysis (LDA) and Nuisance Attribute Projection (NAP) [32]. In [26], the authors proposed a new variant of robust Mel Frequency Cepstral Coefficients (MFCCs) extracted from the estimated bispectral magnitude (Bispectral-MFCCs). Score-domain techniques such as H-norm, Z-norm, and T-norm have been studied in [2]. Recently, [40] proposed a new feature based on articulatory movement to characterize the relative motion trajectory of articulators in short-duration utterances. In [22], a multi-system fusion approach that uses multiple streams of noise-robust features for i-vector fusion was developed. The authors in [20] analyzed the effects of multi-condition training on i-vector PLDA.
In [6], the authors proposed the use of Gammatone product-spectrum cepstral coefficients under noisy conditions and speech codecs.

The great majority of past studies have addressed the effect of additive noise on speech and speaker recognition. However, only a few studies have reported the impact of Additive White Gaussian Noise (AWGN) and Rayleigh fading channels on speaker recognition performance. For instance, the work in [17] shows the effects of speech codecs, with AWGN and Rayleigh fading noise, on the performance of speaker recognition systems. In [36], autoregressive MFCCs and Speech Activity Detection (SAD) algorithms were applied to a speaker recognition system over an AWGN channel. In [13], a combination of modified LPC with the Wavelet Transform (WT) in AWGN and real noise environments was proposed. In [5], the authors proposed an approach that uses Immittance Spectral Frequency (ISF) acoustic features extracted directly from the encoded bitstream transmitted through a noisy channel (AWGN and Rayleigh).

MFCCs are the most commonly used features in speaker and speech recognition systems. However, the MFCC features, which are computed using a mel-scaled filter bank, are known to be very sensitive to additive noise. The auditory model based on the mel scale in standard MFCCs may not be optimal for speaker recognition [34], and the logarithmic nonlinearity used in MFCCs to compress the dynamic range of the filter bank energies does not possess noise immunity. In [42], the authors proposed a new front-end speech feature based on a cochlear filter model, referred to as Gammatone Frequency Cepstral Coefficients (GFCCs). The work in [41] showed that the GFCC features yield superior speaker recognition performance compared with other features such as MFCCs in noisy environments. Despite their relative robustness, it is important to mention that the GFCC features are usually obtained by using the Fast Fourier Transform (FFT) [19]. Since the FFT assumes the signal is stationary within a given short-term frame, it may fail to analyze the non-stationary segments in transient states, which is not suitable for speaker recognition. Another popular feature extraction technique is linear prediction (LP) filtering, a well-known all-pole method for modeling the vocal tract with a small number of parameters. The main drawback of the conventional LP method is that the resulting spectral envelope may contain very sharp peaks for speakers with high pitch frequency.

Several modifications of the LP method with improved robustness against noise have been developed. One can cite Weighted Linear Prediction (WLP), Stabilized Weighted Linear Prediction (SWLP), and regularized linear prediction spectrum analysis. Temporally weighted linear prediction methods [37] were studied in speaker verification under additive-noise conditions. Extended Weighted Linear Prediction (XLP) [29] was evaluated for both channel distortion and additive noise. The study in [28] introduced a new algorithm based on linear predictive analysis utilizing an autoregressive (AR) Gaussian mixture model. In [30], the authors used an algorithm providing MFCC features for speaker verification under vocal effort mismatch.

The use of a new linear predictive modeling approach in this work is motivated by the ability of linear predictive methods to capture relevant information from the two major parts of the voice production mechanism: the glottal excitation and the vocal tract. The LP signal analysis of this work uses a Gaussian mixture autoregressive model to compress the spectrum parameters. Besides this, it was shown in [11] that, even at low SNRs of environmental noise, the Gammatone filter bank and cubic-root rectification provide more robust features than the mel filter bank and logarithmic nonlinearity.

In this paper, we propose a new feature extraction approach providing Mixture of Linear Prediction Gammatone Cepstral Coefficients (MLPGCCs). The Mixture Linear Prediction (MLP) method is based on an autoregressive (AR) mixture model processed by Gammatone filter banks. This combination (i.e., MLP and Gammatone) is expected to take advantage of both the MLP properties and Gammatone filtering to improve the robustness of the speaker verification system under channel transmission noise [18]. The performance of the speaker verification system is evaluated using i-vector and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA) modeling.

The remainder of this paper is organized as follows. A brief introduction to channel transmission noise is presented in Section 2. In Section 3, we describe the proposed MLPGCCs feature extraction algorithm. The block diagram of an MLPGCCs-based text-independent speaker verification system is presented in Section 4. Section 5 reports the performance evaluation carried out by comparing the proposed method with conventional extracted features. Finally, conclusions are summarized in Section 6.

2 Channel transmission noise

A communication system, as illustrated in Fig. 1, can be divided into two parts. The first part is digital and consists of the source encoder/decoder, the channel encoder/decoder and the digital modulator/demodulator. The second part is analog and consists of the transmitter, the receiver and the channel models. The modulation process involves changing some parameters of a carrier wave, thus obtaining a set of signals suitable for a transmission channel. There are two main types of signal degradation introduced by transmission channels: the first is attenuation and random variation of the signal amplitude, and the second is distortion of the signal spectrum. Signal attenuation results from the degradation of the signal power level over distance, while random variation of the signal amplitude results from channel noise and multipath Rayleigh fading effects.

Fig. 1 General diagram of a basic communication system [25]

In order to implement the communication system, we used the standard Adaptive Multi-Rate Wideband (AMR-WB) speech codec, introduced by the European Telecommunications Standards Institute (ETSI). AMR provides better speech quality and more robustness to background noise. Binary Phase Shift Keying (BPSK) modulation and demodulation are simulated. We want to transmit symbols from an alphabet {m_i; i = 1, ..., M}, with a signal x_i(t), suitable for transmission, assigned to each symbol m_i. After transmission, we obtain a distorted version of the original x_i(t), denoted y_i(t). On the other hand, the distortion due to quantization and channel errors may make the received symbol \( \hat{m}_i \) different from the transmitted one m_i. We define the Additive White Gaussian Noise (AWGN) channel, which modifies the transmitted signal as

$$ y(t)={x}_i(t)+n(t) $$
(1)

where n(t) is white Gaussian-distributed noise of zero mean and variance \( \sigma_n^2 = N_0/2 \). In the AWGN channel, the noise is added to the transmitted signal by specifying the signal-to-noise ratio (SNR) value. To simulate the fading channel, we apply a random signal envelope a and a random phase θ to the transmitted signal:

$$ y(t)=a{e}^{- j\theta}{x}_i(t)+n(t) $$
(2)

When there is no dominant received component, the envelope is Rayleigh-distributed and is defined by:

$$ p(a)=\frac{a}{\sigma^2}\exp \left(\frac{-{a}^2}{2{\sigma}^2}\right) $$
(3)

where 2σ² = E[a²] is the mean power of the fading and the phase is uniformly distributed. In our investigation, we use the Rayleigh fading channel, which has been shown to be a realistic model for simulating fading channels. In mobile environments, the speech is compressed by a conventional speech codec and then transmitted to the server, where the recognition is performed using features extracted from the decoded signal. The Rayleigh fading channel is simulated based on the modified sum-of-sinusoids method. The quadrature components of the Rayleigh fading process are given by:

$$ u(t)=\sqrt{\frac{2}{E}}\sum \limits_{i=1}^E\cos \left({\varpi}_dt\cos {\alpha}_i+{\phi}_i\right)+j\sqrt{\frac{2}{E}}\sum \limits_{i=1}^E\cos \left({\varpi}_dt\sin {\alpha}_i+{\phi}_i\right) $$
(4)

where α_i = (2πi − π − θ_i)/4π, i = 1, 2, ..., E, ϖ_d is the maximum angular Doppler frequency, and ϕ_i and θ_i are statistically independent and uniformly distributed on [−π, π] [1].
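
To make the channel simulation concrete, the following minimal Python/NumPy sketch reproduces the AWGN channel of Eq. (1), BPSK transmission, and a Rayleigh fading process built from the sum-of-sinusoids of Eq. (4). It is a sketch under stated assumptions: the Doppler frequency, the number of sinusoids E = 16, the 4E denominator in the angles of arrival, and the unit-power normalization are illustrative choices, not values specified in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def awgn(x, snr_db):
    """AWGN channel of Eq. (1): add complex white Gaussian noise at a given SNR."""
    p_sig = np.mean(np.abs(x) ** 2)
    p_noise = p_sig / 10 ** (snr_db / 10)
    n = np.sqrt(p_noise / 2) * (rng.standard_normal(len(x))
                                + 1j * rng.standard_normal(len(x)))
    return x + n

def rayleigh_fading(n_samples, f_d, f_s, E=16):
    """Complex Rayleigh fading via the modified sum-of-sinusoids of Eq. (4)."""
    t = np.arange(n_samples) / f_s
    wd = 2 * np.pi * f_d                      # maximum angular Doppler frequency
    i = np.arange(1, E + 1)
    theta = rng.uniform(-np.pi, np.pi, E)     # theta_i ~ U[-pi, pi]
    phi = rng.uniform(-np.pi, np.pi, E)       # phi_i   ~ U[-pi, pi]
    alpha = (2 * np.pi * i - np.pi - theta) / (4 * E)   # angles of arrival (4E assumed)
    u = (np.sqrt(2 / E) * np.cos(wd * np.outer(t, np.cos(alpha)) + phi).sum(axis=1)
         + 1j * np.sqrt(2 / E) * np.cos(wd * np.outer(t, np.sin(alpha)) + phi).sum(axis=1))
    return u / np.sqrt(2)                     # normalize to unit average power

# BPSK symbols over both channels, as in Eqs. (1) and (2)
bits = rng.integers(0, 2, 10_000)
x = (2.0 * bits - 1.0).astype(complex)        # map {0,1} -> {-1,+1}
y_awgn = awgn(x, snr_db=5)
h = rayleigh_fading(len(x), f_d=100.0, f_s=8_000.0)
y_fading = awgn(h * x, snr_db=5)              # Eq. (2): fading gain, then noise
detected = np.real(np.conj(h) * y_fading) < 0 # coherent BPSK detection
ber = np.mean(detected != (bits == 0))
```

The envelope |h| of the generated process follows the Rayleigh density of Eq. (3), so the sketch can be used to corrupt speech waveforms at the SNR levels studied later.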

3 Mixture linear prediction Gammatone features

Feature extraction is a crucial component of the Automatic Speaker Verification (ASV) system. Generally speaking, speech feature extraction methods aim at extracting relevant information about the speaker. In this work, we have implemented different feature extraction techniques that share the modeling of the peripheral auditory system, namely MFCCs, GFCCs and the proposed MLPGCCs. The block diagram of the feature extraction is depicted in Fig. 2.

Fig. 2 Block diagram of (1) MFCCs, (2) GFCCs, and (3) MLPGCCs feature extraction

3.1 Mixture linear prediction

Linear prediction (LP) analysis is used to estimate the parameters of an autoregressive (AR) model by minimizing the prediction error. In the LP model of speech processing, each sample is predicted as a linear weighted sum of the past p samples, where p is the order of prediction. The predicted signal \( \hat{s}(n) \) is defined as:

$$ \hat{s}(n)=-\sum \limits_{k=1}^p{a}_ks\left(n-k\right) $$
(5)

In the mixture autoregressive model, the signal \( s_n \), n ≥ 0, can be modeled as a mixture of J autoregressive processes with a conditional density function defined by [30]:

$$ f\left({s}_n/{s}_{n-1},...,{s}_0,\lambda \right)=\sum \limits_{i=1}^J{\pi}_{n,i}\frac{1}{\sigma_i}\varphi \left(\frac{u_{n,i}}{\sigma_i}\right) $$
(6)

where λ is the model parameter set and φ(.) is the standard normal density function. The distribution of a hidden state variable is given by:

$$ {\pi}_{n,i}=P\left({q}_n=i/{s}_{n-1},\ldots,{s}_0,\lambda \right),\quad 1\le i\le J $$
(7)

where \( q_n \in \{1,\ldots,J\} \) selects one of the J AR processes. Given \( q_n = i \), the sample is generated as:

$$ {s}_n={a}_{0,i}+\sum \limits_{k=1}^p{a}_{k,i}{s}_{n-k}+{u}_{n,i},\quad 1\le i\le J $$
(8)

where \( a_{0,i} \) are the intercept (constant) terms. The mixture linear prediction is inspired by the principle of the Gaussian Mixture Model (GMM), which is defined by the set of parameters:

$$ {\lambda}_{GMM}=\left({P}_1,\ldots,{P}_J,{\mu}_1,\ldots,{\mu}_J,{\sigma}_1^2,\ldots,{\sigma}_J^2\right) $$
(9)

where \( P_i \), \( \mu_i \) and \( {\sigma}_i^2 \), 1 ≤ i ≤ J, are the component weights, Gaussian mean values and Gaussian variances, respectively. The mixture linear prediction (MLP) model is defined as follows [28]:

$$ {\lambda}_{MLP}=\left({P}_1,\ldots,{P}_J,{a}_{0,1},{a}_{1,1},\ldots,{a}_{p,1},{a}_{0,2},\ldots,{a}_{p,J},{\sigma}_1^2,\ldots,{\sigma}_J^2\right) $$
(10)

The parameters of this model are estimated by the Expectation-Maximization (EM) algorithm according to the following steps:

  • In the E (expectation) step, estimate the excitations \( u_{n,i} \) as the prediction residuals:

$$ {e}_{n,i}={s}_n-{a}_{0,i}-{\sum}_{k=1}^p{a}_{k,i}{s}_{n-k} $$
(11)

The hidden-state posterior probabilities are defined by:

$$ {\gamma}_{n,i}=P\left({q}_n=i/{s}_n,\ldots,{s}_{n-p},{\lambda}_{MLP}\right)=\max \left(0.01,\frac{P_i\left(1/\sqrt{2\pi {\sigma}_i^2}\right)\exp \left(-{e}_{n,i}^2/\left(2{\sigma}_i^2\right)\right)}{\sum \limits_{j=1}^J{P}_j\left(1/\sqrt{2\pi {\sigma}_j^2}\right)\exp \left(-{e}_{n,j}^2/\left(2{\sigma}_j^2\right)\right)}\right) $$
(12)
  • In the M (maximization) step, the component weights are re-estimated as \( {P}_i=\frac{\sum_n{\gamma}_{n,i}}{\sum_n1} \) and the noise variances as \( {\sigma}_i^2=\frac{\sum_n{\gamma}_{n,i}{e}_{n,i}^2}{\sum_n{\gamma}_{n,i}} \). To determine the AR parameters \( a_{k,i} \), define \( x_{n,0}=1 \) (for the intercept) and \( x_{n,k}={s}_{n-k} \), k ≥ 1, and then solve the following normal equations:

$$ \sum \limits_{k=0}^p{a}_{k,i}{\sum}_n{\gamma}_{n,i}{x}_{n,k}{x}_{n,j}={\sum}_n{\gamma}_{n,i}{s}_n{x}_{n,j},\quad 0\le j\le p $$
(13)
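
The E and M steps above translate directly into code. The following NumPy sketch runs the EM recursion of Eqs. (11)-(13) on one signal frame; the random initialization of the AR coefficients and the small ridge added to the normal equations are illustrative stability choices, while the 0.01 posterior floor follows Eq. (12).

```python
import numpy as np

def mlp_em(s, J=2, p=8, n_iter=5, floor=0.01):
    """EM estimation of the mixture linear prediction model (Eqs. 11-13)."""
    N = len(s)
    # x_{n,0} = 1 (intercept), x_{n,k} = s_{n-k} for k = 1..p
    X = np.ones((N - p, p + 1))
    for k in range(1, p + 1):
        X[:, k] = s[p - k:N - k]
    y = s[p:]
    rng = np.random.default_rng(0)
    a = 0.01 * rng.standard_normal((J, p + 1))   # AR coefficients a_{k,i} (with intercept)
    P = np.full(J, 1.0 / J)                      # component weights P_i
    var = np.full(J, np.var(y) + 1e-8)           # noise variances sigma_i^2
    for _ in range(n_iter):
        # E-step: residuals e_{n,i} (Eq. 11) and floored posteriors gamma_{n,i} (Eq. 12)
        e = y[None, :] - a @ X.T                                 # shape (J, N-p)
        logw = (np.log(P)[:, None] - 0.5 * np.log(2 * np.pi * var)[:, None]
                - e ** 2 / (2 * var[:, None]))
        g = np.exp(logw - logw.max(axis=0))
        g = np.maximum(g / g.sum(axis=0), floor)                 # floor at 0.01
        # M-step: weights, variances, and weighted normal equations (Eq. 13)
        P = g.mean(axis=1)
        P /= P.sum()                             # renormalize (flooring breaks sum-to-one)
        for i in range(J):
            w = g[i]
            A = (X * w[:, None]).T @ X + 1e-8 * np.eye(p + 1)    # ridge for stability
            b = (X * w[:, None]).T @ y
            a[i] = np.linalg.solve(A, b)
            var[i] = (w * (y - X @ a[i]) ** 2).sum() / w.sum()
    return P, a, var
```

Each component's AR polynomial then provides an all-pole spectral envelope, and the mixture replaces the single FFT-based spectrum estimate in the front end.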

3.2 Gammatone auditory filter bank

Gammatone filters are a popular way of modeling auditory processing at the cochlea. The Gammatone function, first introduced in [12], characterizes physiological impulse-response data gathered from primary auditory nerve fibers. Gammatone filters were used for characterizing data obtained by reverse correlation from measurements of auditory nerve responses of the cat's cochlea. The impulse response of a Gammatone filter centered at frequency f_c is defined as:

$$ g(t)=K{t}^{\left(n-1\right)}{e}^{-2\pi Bt}\cos \left(2\pi {f}_ct+\phi \right) $$
(14)

where K is the amplitude factor; n is the filter order; f_c is the central frequency in Hertz (Hz); ϕ is the phase shift; and B is the bandwidth parameter, which determines the duration of the impulse response. The Equivalent Rectangular Bandwidth (ERB) is a psychoacoustic measure of the auditory filter bandwidth at each point along the cochlea. The filter bank center frequencies are uniformly spaced on the ERB scale between 200 and 3400 Hz (assuming a telephone bandwidth at a sampling rate of F_s = 8 kHz). The formula for calculating the ERB (in Hz) at any frequency f (in Hz) is:

$$ ERB=\frac{f}{Q_{ear}}+{B}_{\mathrm{min}} $$
(15)

where Q_ear = 9.26449 and B_min = 24.7 are known as the Glasberg and Moore parameters [8, 9]. The frequency response of the 64-channel Gammatone filter bank is illustrated in Fig. 3.

Fig. 3 A Gammatone filter bank with 64 filters

Herein, we used a bank of 64 filters whose center frequencies range from 50 Hz to 8000 Hz, relative to the sampling frequency of the speech signal. The magnitudes of the down-sampled outputs are then loudness-compressed by a cubic-root operation [42] such that:

$$ {G}_m\left[i\right]={\left|{g}_{decimate}\left[i,m\right]\right|}^{1/3},\quad i=0,\ldots,N-1,\quad m=0,\ldots,M-1 $$
(16)

Here, N = 64 refers to the number of frequency (filter) channels, m is the frame index, and M is the number of time frames obtained after decimation. The resulting responses G_m[i] form a matrix representing the time-frequency (T-F) decomposition of the input signal. This T-F representation is a variant of the cochleagram.
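
As an illustration, the front end of Eqs. (14)-(16) can be sketched in a few lines of NumPy. The ERB spacing follows Eq. (15), the impulse response follows Eq. (14) with a fourth-order filter, and the cubic-root compression follows Eq. (16); the 128-ms truncation of the impulse responses, the unit-energy gain K, the 1.019·ERB bandwidth, the Nyquist upper limit on the center frequencies, and the 100-Hz frame rate after decimation are common choices assumed here rather than taken from the paper.

```python
import numpy as np

def gammatone_cochleagram(sig, fs=8000, n_ch=64, f_lo=50.0, order=4):
    """Gammatone filter bank (Eq. 14) on ERB-spaced centers (Eq. 15),
    followed by decimation and cubic-root compression (Eq. 16)."""
    ear_q, min_bw = 9.26449, 24.7
    # Center frequencies uniformly spaced on the ERB-rate scale up to Nyquist
    lo = np.log(f_lo + ear_q * min_bw)
    hi = np.log(fs / 2 + ear_q * min_bw)
    fc = np.exp(np.linspace(lo, hi, n_ch)) - ear_q * min_bw
    bw = 1.019 * (fc / ear_q + min_bw)        # 1.019 * ERB(fc), with ERB from Eq. (15)
    t = np.arange(int(0.128 * fs)) / fs       # 128-ms truncated impulse response
    hop = fs // 100                           # decimate to a 100-Hz frame rate
    rows = []
    for f, b in zip(fc, bw):
        g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t)
        g /= np.sqrt((g ** 2).sum()) + 1e-12  # unit-energy gain (amplitude factor K)
        y = np.convolve(sig, g, mode="same")
        rows.append(np.abs(y[::hop]) ** (1.0 / 3.0))   # Eq. (16): |.|^(1/3)
    return np.stack(rows)                     # (n_ch, M) cochleagram G_m[i]
```

A discrete cosine transform applied across the channel axis of this cochleagram then yields the Gammatone cepstral coefficients used by the GFCC and MLPGCC front ends.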

4 Speaker verification using mixture linear prediction Gammatone features

The investigated systems apply i-vector extraction and channel compensation on top of the extracted features (MFCCs, GFCCs, and MLPGCCs). The GPLDA technique is used to build the speaker model. The block diagram of the proposed speaker verification system is shown in Fig. 4.

Fig. 4 Block diagram of the i-vector-GPLDA speaker verification system using MLPGCCs features employed in this study

4.1 Total variability i-vector modeling

Speaker verification based on the i-vector approach involves different stages: i-vector extraction, GPLDA modeling, and scoring using the batch likelihood ratio.

4.1.1 I-vector extraction

The i-vector approach [4] is inspired by Joint Factor Analysis (JFA). In JFA, speaker and channel effects are independently modeled using eigenvoice (speaker subspace) and eigenchannel (channel subspace) models:

$$ M=m+ Vy+ Ux $$
(17)

where M is the speaker super-vector and m represents the speaker- and channel-independent super-vector, which can be taken to be the universal background model (UBM) super-vector. Both V and U are low-rank transformation matrices. The variables x and y are assumed to be independent and to have standard normal distributions. In the i-vector extraction, the speaker and channel super-vector M is represented as:

$$ M=m+ Tw $$
(18)

where m is a speaker- and channel-independent super-vector, T is a low-rank matrix representing the primary directions of variation across a large collection of development data, and w is the i-vector, a latent variable with a standard normal prior N(0, I).
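
The paper does not spell out the i-vector estimator, so the sketch below uses the standard posterior-mean solution of the model in Eq. (18): given zero- and first-order Baum-Welch statistics collected against the UBM, the i-vector is the mean of the posterior of w. The diagonal-covariance UBM and the variable names are assumptions of this sketch.

```python
import numpy as np

def extract_ivector(N, F, T, Sigma):
    """Posterior-mean i-vector for one utterance under M = m + Tw (Eq. 18).
    N:     (C,)     zero-order Baum-Welch statistics per UBM component
    F:     (C*D,)   first-order statistics, centered on the UBM means m
    T:     (C*D, R) total variability matrix
    Sigma: (C*D,)   diagonal covariance super-vector of the UBM
    """
    R = T.shape[1]
    D = len(F) // len(N)
    Nd = np.repeat(N, D)                      # expand counts to super-vector size
    TtS = T.T / Sigma                         # T' Sigma^{-1}
    L = np.eye(R) + TtS @ (Nd[:, None] * T)   # posterior precision of w
    return np.linalg.solve(L, TtS @ F)        # E[w | utterance]
```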

4.1.2 GPLDA modeling and scoring

The PLDA technique was originally proposed in [31] for face recognition, and later adapted to i-vectors for speaker verification in [15, 21]. This technique, called Gaussian Probabilistic LDA (GPLDA), divides the i-vector space into speaker and session variability subspaces and has shown significant intersession compensation performance for i-vector speaker verification [14]. In the GPLDA modeling approach, a speaker- and channel-dependent i-vector \( w_{s,r} \) can be defined as

$$ {w}_{s,r}=\eta +H{z}_s+{\varepsilon}_{s,r} $$
(19)

where η is the mean of the i-vectors, Η is the eigenvoice matrix, \( z_s \) is the speaker factor and \( \varepsilon_{s,r} \) is the residual for each session.

The scoring in GPLDA is conducted using the batch likelihood ratio between a target and test i-vector [33]. Given two i-vectors, w1 and w2, the batch likelihood ratio can be calculated as follows:

$$ Score\left({w}_1,{w}_2\right)=\log \frac{P\left({w}_1,{w}_2/{\varphi}_1\right)}{P\left({w}_1,{w}_2/{\varphi}_0\right)} $$
(20)

where φ_1 denotes the hypothesis that the i-vectors represent the same speaker and φ_0 denotes the hypothesis that they do not.
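
Under the generative model of Eq. (19), a pair of i-vectors is jointly Gaussian under both hypotheses, so the batch likelihood ratio of Eq. (20) has a closed form: the same-speaker hypothesis couples the pair through the shared speaker factor z_s. A minimal sketch, assuming a full residual covariance Σ_ε estimated during GPLDA training:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gplda_llr(w1, w2, eta, H, Sigma_eps):
    """Batch likelihood ratio of Eq. (20) under the GPLDA model of Eq. (19)."""
    Phi = H @ H.T                                # across-speaker covariance H H'
    Tot = Phi + Sigma_eps                        # total covariance of one i-vector
    d = len(eta)
    pair = np.concatenate([w1 - eta, w2 - eta])
    # phi_1: shared speaker factor -> cross-covariance Phi between the two i-vectors
    same = np.block([[Tot, Phi], [Phi, Tot]])
    # phi_0: independent speakers -> zero cross-covariance
    diff = np.block([[Tot, np.zeros((d, d))], [np.zeros((d, d)), Tot]])
    return (multivariate_normal.logpdf(pair, mean=np.zeros(2 * d), cov=same)
            - multivariate_normal.logpdf(pair, mean=np.zeros(2 * d), cov=diff))
```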

5 Evaluation experiments

The proposed features have been analyzed and evaluated by carrying out various experiments on the ASV system. We use the NIST 2008 Speaker Recognition Evaluation (SRE) corpus, containing single-channel microphone-recorded conversational segments of 8 minutes or longer between the target speaker and an interviewer [23]. The speaker models were obtained from clean training speech data. The clean waveforms are transcoded by passing them through the AMR-WB codec (encoding and decoding) [35]. The mobile channel was simulated using two noise channels, AWGN and Rayleigh fading, with different variances yielding SNRs of −5, 0, 5, 10, and 15 dB. In all experiments, the feature vectors contain 20 cepstral coefficients and the log-energy/C0, appended with the first- and second-order time derivatives, thus providing 63-dimensional feature vectors, followed by cepstral mean and variance normalization (CMVN). A self-adaptive VAD (VQ-VAD) is employed to remove silence and low-energy speech segments. We utilized three different acoustic features: (a) Mel Frequency Cepstral Coefficients (MFCCs) and (b) Gammatone Frequency Cepstral Coefficients (GFCCs) as our baselines, and (c) Mixture Linear Prediction Gammatone Cepstral Coefficients (MLPGCCs). The feature vectors were extracted every 10 ms, using a Hamming window of 25 ms, with the magnitude spectrum obtained by FFT or by MLP with prediction orders p = 8, 14 and 20. After feature extraction, each speaker model is adapted from a 512-component UBM trained using the entire database. For the total variability matrix training, the UBM training dataset is used. The EM training is performed over five iterations. We use 400 total factors (i.e., the i-vector size is 400); LDA is then applied to reduce the dimension of the i-vectors to 200, followed by length normalization. In the variability compensation and scoring process, a GPLDA model with added noise is used. In practice, the MSR Identity Toolbox [38] was used to implement the i-vector-GPLDA processing. We evaluate the speaker verification accuracy using the equal error rate (EER).
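
For reference, the EER reported throughout this section is the operating point at which the false acceptance rate meets the false rejection rate. A minimal sketch of its computation from verification scores follows; sweeping the threshold over the sorted scores and averaging the two rates at the closest crossing is an implementation assumption.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: threshold sweep where the false-accept rate meets the miss rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    is_target = np.concatenate([np.ones(len(target_scores)),
                                np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)
    is_target = is_target[order]
    frr = np.cumsum(is_target) / is_target.sum()                  # targets rejected
    far = 1.0 - np.cumsum(1 - is_target) / (1 - is_target).sum()  # non-targets accepted
    k = np.argmin(np.abs(far - frr))
    return 0.5 * (far[k] + frr[k])
```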

5.1 ASV performance in additive white Gaussian noise (AWGN) channel

In this subsection, we investigate the effect of the AWGN channel on overall system performance for the different feature extraction methods (MFCCs, GFCCs and MLPGCCs). The first experiment aims to find the optimal number of MLP iterations for estimating the prediction model. Figure 5 shows the EER as a function of the number of MLP iterations. We can see an improvement in accuracy when the number of iterations is set to 5 or 7. We found that the optimal number of MLP iterations is 5, compared to 7 as found in [30].

The goal of the next experiments is to evaluate the verification performance using the MFCCs, GFCCs and MLPGCCs features and the combination of the MLPGCCs and GFCCs features. Here, we consider the context of mismatched conditions where the test data is distorted with AWGN at an SNR level of 5 dB. The results obtained by using a development set and i-vector GPLDA are displayed in Tables 1 and 2 and Fig. 6. It can be observed from Table 1 that the proposed features perform better than GFCCs and MFCCs at almost all SNR levels and in the clean condition. We also note that the MLPGCCs with p = 20 slightly outperform the method presented in [29]. This can be explained by the fact that for LP orders in the range 8–20, the LP residual contains mostly the information about the excitation source.

Table 1 ASV performance in terms of EER (%) under the AWGN channel at different SNRs for the MFCC and GFCC features and the proposed MLPGCCs with different prediction orders (p = 8, 14, 20)
Table 2 ASV performance in terms of EER (%) under the AWGN channel at different SNRs for the GFCC and MLPGCC features and the combined features
Fig. 5 The effect of varying the number of MLP iterations on the performance of speaker verification

From Fig. 6, it is clear that the EER decreases as the signal-to-noise ratio increases for all features. The results show that MLPGCCs with 5 iterations give a better recognition rate than the other features at all SNR levels.

Results in Table 1 and Fig. 6 indicate that the proposed feature extraction method achieves reductions in average equal error rate (EER) ranging from 9.41% to 6.65% and from 3.72% to 1.50% compared with the MFCC and GFCC features, respectively, when the test speech signals are corrupted by the Additive White Gaussian Noise (AWGN) channel at SNRs ranging from −5 dB to 15 dB.

Fig. 6 Performance comparison of alternative noise-robust features considered in this study against MFCCs on the ASV task under the AWGN channel at different SNRs

Furthermore, we combined the two methods for estimating the short-term spectrum, namely FFT and mixture LP, to improve the EER. This combination is performed by logistic regression, where the fusion weights are trained using the BOSARIS Toolkit [3]. The effect of this combination of GFCCs and MLPGCCs is investigated under the AWGN channel. The results are summarized in Table 2. In comparison with the results obtained by the MLPGCCs-based system, as shown in Table 1 and Fig. 6, we observe a significant EER reduction of between 0.43% and 0.59% at all SNR levels.
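
The BOSARIS Toolkit performs this fusion in MATLAB; an equivalent sketch of linear logistic regression score fusion in Python is given below. The use of scikit-learn and the uncalibrated log-odds output are assumptions of this sketch, not a description of the toolkit itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# scores_dev: (n_trials, 2) matrix with one column per system (GFCCs, MLPGCCs)
# labels_dev: (n_trials,) with 1 for target trials and 0 for non-target trials
def train_score_fusion(scores_dev, labels_dev):
    """Learn fusion weights w0 + w1*s_gfcc + w2*s_mlpgcc by logistic regression."""
    return LogisticRegression().fit(scores_dev, labels_dev)

def fuse_scores(fuser, scores_eval):
    """Fused score is the log-odds (linear combination) of the per-system scores."""
    return fuser.decision_function(scores_eval)
```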

5.2 ASV performance in Rayleigh fading channel

In the case of the Rayleigh fading channel, we carried out the same processing as in the case of AWGN channel distortion. The results obtained on the development set for the different features using i-vector GPLDA are shown in Fig. 7. It can be seen that there is a drop in the accuracy of the verification system as the SNR decreases. It can also be noticed that there is an accuracy improvement for the MLPGCC features compared to the MFCC and GFCC features. Moreover, MLPGCCs with 7 iterations give a better recognition rate than MLPGCCs with 5 iterations.

Fig. 7 Performance comparison of alternative noise-robust features considered in this study against MFCCs on the ASV task under the Rayleigh fading channel at different SNRs

As a result, when the test speech signals are corrupted by the Rayleigh fading channel, the proposed feature extraction method achieves reductions in the average equal error rate (EER) ranging from 23.63% to 7.8% and from 10.88% to 6.8% over the conventional MFCC and GFCC features, respectively, at SNRs ranging from −5 dB to 15 dB. In addition, Table 3 summarizes the results of the GFCCs and MLPGCCs combination under the Rayleigh fading channel. These results show a significant EER reduction of between 1.92% and 3.88% at all SNR levels.

Table 3 ASV performance in terms of EER (%) under the Rayleigh fading channel at different SNRs for the GFCC and MLPGCC features and the combined features

6 Conclusion

In this paper, a new feature extraction method based on mixture linear prediction Gammatone analysis is proposed. The MLPGCC features are evaluated in a speaker verification system using i-vector GPLDA modeling in mobile communications, considering the impact of transmission channel distortion. The key point of our idea is to take advantage of the characteristics of the linear prediction approach by using the iterative parameter re-estimation of a mixture autoregressive (AR) model, instead of the standard spectrum estimation performed by the FFT. The new features are evaluated on the NIST 2008 dataset by considering the effects of a noisy transmission channel (AWGN and Rayleigh fading). Experimental results show that the proposed MLPGCCs outperform the conventional MFCC and GFCC features in the speaker verification task. The best performance is obtained in the context of the AWGN channel (vs. the Rayleigh fading channel). The combination of the proposed and conventional features achieves better performance than each system alone on data corrupted by transmission channel noise. The results have shown that the proposed MLPGCCs considerably improve the robustness under all types of channel distortion. We have also demonstrated that the algorithm using the Gammatone filter bank and mixture linear prediction is better suited to the context of transmission channel noise than the FFT and the mel filter bank. Future research includes the study of MLPGCC system performance under other types of noise and degradation, such as convolutive noise and reverberation.