1 Introduction

Due to the growing need for secure access and forensic investigations, improving Automatic Speaker Recognition (ASR) systems has become an attractive challenge. ASR covers both verification and identification. Automatic Speaker Verification (ASV) is the use of a machine to verify a person's claimed identity from their voice. In Automatic Speaker Identification (ASI), there is no a priori identity claim, and the system decides who the speaker is Campbell (1997).

Current state-of-the-art systems in text-independent speaker recognition use cepstral coefficients as baseline features, together with speaker modeling techniques such as the Universal Background Gaussian Mixture Model (GMM-UBM) Reynolds et al. (2000) and Gaussian supervectors (GMM-SVM) Campbell et al. (2006).

This work was originally devoted to a robust ASR task using Support Vector Machine (SVM) Wan and Renals (2003), Karam and Campbell (2008) and hybrid GMM-SVM based recognizers. Throughout this study, it clearly appeared that dimensionality reduction is an attractive way to process the huge quantity of data without loss of recognition performance. Many different approaches have been studied to improve a system's accuracy with a minimum size of input data. In Jokic et al. (2012), the authors discuss possibilities for dimensionality reduction of the standard MFCC feature vectors by applying Principal Component Analysis (PCA). The results showed that PCA is an interesting method to reduce dimensionality without decreasing system performance. The GMM-UBM is adopted in Li and Dong (2013), where the MAP (Maximum A Posteriori) adapted means are improved by using MLLR (Maximum Likelihood Linear Regression) and EigenVoice.

In Hanilci and Ertas (2011), the authors first partition the UBM data into clusters using the Vector Quantization (VQ) algorithm; a transformation matrix is then obtained by applying PCA to the set of feature vectors in each cluster. Finally, multiple speaker models are constructed from this set of transformed feature vectors through MAP adaptation. The best results were achieved using K = 2 local regions with model order M = 256, for which the obtained EER is below 12.2 %. In Minkyung et al. (2010), the authors propose a global eigenvector matrix based PCA for the speaker recognition (SR) task, to deal with the large amount of training data required when the eigenvector matrix of each speaker is calculated. The training data of all speakers are used to compute a single covariance matrix, from which the global eigenvalues and eigenvectors are derived to perform the PCA. The proposed method shows better performance while requiring less storage space.

A Fishervoice-based feature fusion method incorporating PCA and LDA is proposed in Zhang and Zheng (2013), in which the high-dimensional input data are simply projected into a lower-dimensional subspace. Results show that this technique can effectively reduce the Equal Error Rate (EER) for utterances as short as about 2 s. In Jiang et al. (2013), the authors transform the original features extracted from speech files by PCA and KPCA (Kernel PCA) to select effective emotional features for Automatic Speech Emotion Recognition (ASER). Results showed that feature dimension reduction significantly improves the accuracy of the ASER system. In Lee (2004), Lee introduced a local fuzzy PCA based GMM, which creates regions using a fuzzy clustering algorithm followed by PCA in each region. The author concluded that this technique gives comparable accuracy for the speaker identification task with a reduced data dimension: with k = 2 clusters and 64 mixtures, the performance is the same as or better than that of the conventional GMM.

As mentioned above, the main idea of this work is to find a new scheme for speaker recognition modeling based on dimensionality reduction, with improved performance. The ability of PCA to reduce the size of the adapted mean vectors obtained from the GMM-UBM model is investigated. Moreover, the paper investigates the influence of dialect Yun and Hansen (2009, 2011), Chitturi and Hansen (2007) on ASR systems.

The rest of the paper is organized as follows. Section 2 reviews the SVM and GMM-SVM classifiers used for the ASR task. Dimensionality reduction applied in the front-end part of the ASR system is then described in Sect. 3. Section 4 details the proposed scheme based on GMM-PCA-SVM modeling. Section 5 presents the data sets used and the experimental results in both clean and noisy environments. Finally, Sect. 6 concludes the paper.

2 Speaker recognition using SVM and GMM-SVM

2.1 SVM modeling

The Support Vector Machine (SVM) is a powerful discriminative classifier aimed at minimizing the generalization error. It fits an Optimal Separating Hyperplane (OSH) between classes by focusing on the training samples that lie at the edge of the class distributions, the support vectors, and separates the classes with a maximum-margin hyperplane boundary (see Fig. 1).

Fig. 1

Principle of support vector machine (SVM) classification

When the data are not linearly separable in the finite-dimensional input space, a kernel function \(k(\cdot ,\cdot )\) is used; this leads to an easier separation of the two classes with a hyperplane. A linear hyperplane in the high-dimensional kernel feature space, a Hilbert space \((H)\), corresponds to a nonlinear decision boundary in the original input space. More details can be found in Vapnik's book Vapnik (1998) and Burges' tutorial Burges (1998).

The SVM decision function is constructed from sums of a kernel function \(k(\cdot ,\cdot )\) as follows:

$$\begin{aligned} f(x)=sign\left[ {\sum _{i=1}^N {\alpha _i} t_i k(x,x_i)+b} \right] \hbox { with } \sum _{i=1}^N {\alpha _i} t_i =0 \end{aligned}$$
(1)

where \(t_i\) are the ideal outputs, \(x_i\) are the support vectors selected from the training data, \(\alpha _i\) are the Lagrange multipliers, and \(b\) is the bias.

The Radial Basis Function (RBF) and the polynomial kernels are commonly used, and take respectively the following forms:

$$\begin{aligned} k(x,x_i)&= e^{-\gamma \left\| {x-x_i} \right\| ^{2}}\end{aligned}$$
(2)
$$\begin{aligned} k(x_i ,x_j)&= (x_i .x_j +1)^{d} \end{aligned}$$
(3)

where \(\gamma \) is the width of the Radial Basis Function and \(d\) is the order of the polynomial function.
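As an illustration of Eqs. (1)–(3), the decision function with the RBF kernel can be sketched as follows (a minimal NumPy sketch; the names alphas, targets, support_vecs, bias and gamma are illustrative assumptions, not taken from any particular toolkit):

```python
import numpy as np

def rbf_kernel(x, x_i, gamma):
    # Eq. (2): k(x, x_i) = exp(-gamma * ||x - x_i||^2)
    return np.exp(-gamma * np.sum((x - x_i) ** 2))

def svm_decision(x, support_vecs, targets, alphas, bias, gamma):
    # Eq. (1): f(x) = sign( sum_i alpha_i * t_i * k(x, x_i) + b )
    score = sum(a * t * rbf_kernel(x, sv, gamma)
                for a, t, sv in zip(alphas, targets, support_vecs))
    return np.sign(score + bias)
```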

2.2 GMM-SVM speaker recognition

The Gaussian Mixture Model (GMM) is a probabilistic density model used to represent a speaker. GMMs are easy to implement and are commonly used for language identification, gender identification and automatic speaker recognition tasks. The GMM gives the likelihood of a D-dimensional cepstral vector \(\vec {x}\) under a mixture model \(\lambda \) of M multivariate Gaussians Reynolds et al. (2000):

$$\begin{aligned} p(x/\lambda )=\sum _{i=1}^M {\pi _i b_i (x)} \end{aligned}$$
(4)

where \(\pi _i\) represents the mixture weights and \(b_i (x),i=1,...,M\) are the component densities given by:

$$\begin{aligned} b_i (x)=\frac{1}{(2\pi )^{D/2}\left| {\Sigma _i} \right| ^{1/2}}\exp \left[ {-\frac{1}{2}(x-\mu _i )^{\prime }\Sigma _i ^{-1}(x-\mu _i)} \right] \end{aligned}$$
(5)

with mean vector \(\mu _i\) and covariance matrix \(\Sigma _i\). The mixture weights satisfy the constraint that \(\sum _{i=1}^M {\pi _i =1}\). These parameters are estimated using the Expectation–Maximization (EM) algorithm Reynolds et al. (2000). For speaker recognition, each speaker is modeled by a GMM and is referred to by its model \(\lambda \).
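The likelihood computation of Eqs. (4) and (5) can be sketched as follows (a minimal NumPy sketch assuming full covariance matrices, whereas diagonal covariances are commonly used in practice):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    # Eq. (5): multivariate Gaussian component density b_i(x)
    d = len(x)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

def gmm_likelihood(x, weights, means, covs):
    # Eq. (4): p(x | lambda) = sum_i pi_i * b_i(x)
    return sum(w * gaussian_density(x, mu, cov)
               for w, mu, cov in zip(weights, means, covs))
```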

The UBM is generally a large GMM learned from multiple speech files to represent the speaker-independent distribution of features; its parameters (means, variances and weights) are found using the EM algorithm. The hypothesized speaker-specific model is derived by adapting the parameters of the UBM with the speaker's training speech and a form of Bayesian adaptation, MAP Reynolds et al. (2000). The specifications of the adaptation are given below.

Given a UBM model and training vectors from the hypothesized speaker, \(X=\left\{ x_{1}, x_2 ,..., x_{T} \right\} \), we first determine the probabilistic alignment of the training vectors into the UBM mixture components. That is, for mixture \(i\) in the UBM, we compute

$$\begin{aligned}&\Pr (i/x_t) = \frac{\pi _i b_i (x_t)}{\sum _{j=1}^M {\pi _j b_j (x_t)}}\end{aligned}$$
(6)
$$\begin{aligned}&n_i (X) = \sum _{t=1}^T {\Pr (i/x_t)}\end{aligned}$$
(7)
$$\begin{aligned}&E_i (X) = \frac{1}{n_i (X)}\sum _{t=1}^T {\Pr (i/x_t)\, x_t} \end{aligned}$$
(8)

This is the same as the expectation step of the EM algorithm. These new sufficient statistics computed from the training data are then used to update the old UBM sufficient statistics and create the adapted mean for mixture \(i\) with the equations:

$$\begin{aligned} \hat{\mu }_i&= \alpha _i E_i (X)+(1-\alpha _i)\mu _i ,\quad i=1,...,M\end{aligned}$$
(9)
$$\begin{aligned} \alpha _i&= \frac{n_i (X)}{n_i (X)+r} \end{aligned}$$
(10)

where \(r\) is a fixed relevance factor.
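The mean-only MAP adaptation of Eqs. (6)–(10) can be sketched as follows (a minimal NumPy sketch that reuses the gaussian_density function from the previous sketch; the argument names and the default relevance factor are illustrative):

```python
import numpy as np

def map_adapt_means(X, ubm_weights, ubm_means, ubm_covs, r=16.0):
    # X: (T, D) array of training vectors from the hypothesized speaker
    T, M = len(X), len(ubm_means)
    # Eq. (6): posterior probability of each mixture for each frame
    post = np.empty((T, M))
    for i in range(M):
        post[:, i] = ubm_weights[i] * np.array(
            [gaussian_density(x, ubm_means[i], ubm_covs[i]) for x in X])
    post /= post.sum(axis=1, keepdims=True)
    # Eq. (7): occupation counts n_i(X); Eq. (8): first-order statistics E_i(X)
    n = post.sum(axis=0)                               # shape (M,)
    E = (post.T @ X) / np.maximum(n[:, None], 1e-10)   # shape (M, D)
    # Eq. (10): adaptation coefficients; Eq. (9): adapted means
    alpha = n / (n + r)
    return alpha[:, None] * E + (1.0 - alpha[:, None]) * ubm_means
```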

Another approach that has become popular is the hybrid GMM-SVM system; its main goal is to exploit the complementary information provided by the traditional GMM to the SVM-based system. In this approach, instead of using the MFCC features directly, the hybrid classifier uses the adapted Gaussian means of the mixture components, obtained from the UBM through MAP adaptation, as input to the SVM for discrimination and decision. An illustrative block diagram of the GMM-SVM classifier is given in Fig. 2.
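A minimal sketch of the supervector construction, under the assumption that the adapted means are simply stacked into one vector and fed to an SVM (scikit-learn's SVC is used here purely for illustration; the paper does not specify a toolkit):

```python
import numpy as np
from sklearn.svm import SVC

def gmm_supervector(adapted_means):
    # Stack the M adapted D-dimensional mean vectors into one M*D vector
    return np.asarray(adapted_means).reshape(-1)

# Hypothetical usage: one supervector per utterance, label +1 for the target
# speaker utterances and -1 for the background (impostor) utterances.
# X_sv = np.vstack([gmm_supervector(m) for m in all_adapted_means])
# svm = SVC(kernel='rbf', gamma='scale').fit(X_sv, labels)
```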

Fig. 2

Block diagram of the GMM-SVM based speaker recognition system

3 Dimensionality reduction in the front-end part

In order to investigate the influence of dimensionality reduction on the ASR system, PCA is applied to the input feature vectors (MFCCs) extracted from the speech signal for the SVM based speaker recognition system.

The SVM system is based on the principle of structural risk minimization. It is considered well suited for classification and is therefore used in our work. The difficulty of the SVM classifier is setting its optimal parameters \((C,\gamma )\) to achieve the lowest misclassification rate. These parameters are determined during the training phase; the final step is the testing phase, which evaluates the robustness of the classifier. The RBF kernel was used to compute the SVM decision function, and all results presented in this study were obtained with this kernel.

In this paper, the SVM was trained directly on the acoustic space, which characterizes the client data and the impostor data. In this way, 15 unknown speakers were used to represent the impostors for the recognition task.

For the PCA-SVM model, PCA was applied to the feature vectors in the front-end part, independently for each speaker. This leads to a better representation of the intra-speaker variability and reduces the effective size of the input data (MFCCs). The SVM block diagram is given in Fig. 3.

Fig. 3

Block diagram of the SVM based speaker recognition system

The PCA Jolliffe (2010) technique is an unsupervised feature extraction method Izquierdo-Verdiguier et al. (2014): it rotates the coordinate system so that the directions of the axes are oriented along progressively decreasing variance of the data Kuncheva and Faithfull (2014). This technique transforms a number of correlated variables into a smaller number of uncorrelated ones, the Principal Components (PCs) Malarvizhi and Sivasarathadevi (2013), while preserving the maximum variance during the projection process. The following paragraphs detail the theoretical foundations of the PCA routine.

The initialization step of the system consists of the creation of the eigenspace. Let the training set be the input feature vectors (MFCCs), \(X=\left\{ x_{1}, x_2 ,..., x_{M} \right\} \). The mean of the set is defined by:

$$\begin{aligned} \overline{X} =\frac{1}{M}\sum _{i=1}^M {x_i} \end{aligned}$$
(11)

Each feature vector differs from the mean \(\overline{X}\) by the vector:

$$\begin{aligned} \vartheta _p =x_\mathrm{p} -\overline{X} \hbox { with } p=1...M \end{aligned}$$
(12)

The \(\vartheta _p\) vectors are arranged into the \(N\times M\) matrix \(\delta \), on which the PCA is performed Delac et al. (2005). The covariance matrix of \(X\) is then computed from \(\delta \) as:

$$\begin{aligned} C=\delta \delta ^{T} \end{aligned}$$
(13)

Let \(\left\{ {\lambda _1 ,\lambda _2 ,...,\lambda _n} \right\} \) be the eigenvalues of the covariance matrix C, ordered from largest to smallest, and \(\phi =\left\{ {\omega _1 ,\omega _2 ,....,\omega _n} \right\} \) the corresponding eigenvectors. \(\phi \) is the transformation matrix which projects the original data \(X\) onto an orthogonal feature space.

The dimensionality reduction is then performed by keeping the principal components that capture most of the variance in the data set and discarding the rest. The transformation matrix \(\phi \) therefore consists of the first \(D\) eigenvectors, those associated with the \(D\) largest eigenvalues, where \(D\) is the new dimension. Figure 4 illustrates the block diagram of the PCA-SVM based speaker recognition system.
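A minimal NumPy sketch of Eqs. (11)–(13) followed by the projection onto the first D eigenvectors (the covariance matrix is left unnormalized, as in Eq. (13)):

```python
import numpy as np

def pca_transform(X, D):
    # X: (N, M) matrix whose M columns are the feature vectors x_1 .. x_M
    x_bar = X.mean(axis=1, keepdims=True)     # Eq. (11): mean of the set
    delta = X - x_bar                         # Eq. (12): centered vectors
    C = delta @ delta.T                       # Eq. (13): covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # eigendecomposition of C
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues, largest first
    phi = eigvecs[:, order[:D]]               # keep the first D eigenvectors
    return phi.T @ delta, phi                 # projected data (D, M) and phi
```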

Fig. 4

Block diagram of the PCA-SVM based speaker recognition system

4 The proposed GMM-PCA-SVM based speaker recognition modeling

4.1 System overview

The block diagram of the proposed GMM-PCA-SVM based speaker recognition system is depicted in Fig. 5. First, a Voice Activity Detector (VAD) is applied. For a given speech utterance, the energy of every frame is computed, and an empirical threshold is determined from the maximum frame energy. Frames are then classified as speech or silence against this threshold, and the silence (non-speech) segments are removed.
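A minimal sketch of this energy-based VAD, assuming 20 ms frames with a 10 ms shift at 16 kHz and a threshold set as a fraction of the maximum frame energy (the fraction used below is an arbitrary illustrative value, not the empirical threshold of the paper):

```python
import numpy as np

def energy_vad(signal, frame_len=320, hop=160, ratio=0.06):
    # Split the signal into overlapping frames and compute their energies
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energies = np.array([np.sum(np.asarray(f, dtype=float) ** 2) for f in frames])
    threshold = ratio * energies.max()        # empirical threshold from max energy
    speech = [f for f, e in zip(frames, energies) if e > threshold]
    # Keep only the frames classified as speech
    return np.concatenate(speech) if speech else np.array([])
```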

Fig. 5

The proposed GMM-PCA-SVM based speaker recognition system

ASR systems use short-term spectral features Harrag et al. (2011) to represent speaker-specific characteristics. Indeed, short-term spectral features convey the glottal source and the vocal tract shape and length of a speaker, and thus lead to a better representation of a given speaker.

In this study, 12 MFCCs plus their delta and double-delta cepstral coefficients are extracted, yielding 36-dimensional feature vectors that represent the feature space. These features are extracted using a 20 ms Hamming window with a 10 ms shift. The window tapers the signal at its edges and therefore reduces side effects Hanilci and Ertas (2011). Finally, Cepstral Mean Subtraction (CMS) Kinnunen and Li (2010) is applied by subtracting the cepstral mean from the feature vectors in order to center the data around their average.
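A minimal sketch of this front end using librosa (an illustrative choice of library; its default mel filterbank and liftering differ in detail from the exact configuration described above, and the CMS step is the simple mean subtraction described in the text):

```python
import numpy as np
import librosa

def extract_features(signal, sr=16000):
    # 12 MFCCs computed with a 20 ms Hamming window and a 10 ms shift
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12,
                                n_fft=int(0.02 * sr),
                                hop_length=int(0.01 * sr),
                                window='hamming')
    delta = librosa.feature.delta(mfcc)            # delta coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)  # double-delta coefficients
    feats = np.vstack([mfcc, delta, delta2])       # 36 x n_frames matrix
    # Cepstral Mean Subtraction: center each coefficient around zero
    return feats - feats.mean(axis=1, keepdims=True)
```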

4.2 Modeling phase

In the proposed GMM-PCA-SVM scheme, the main idea is to introduce dimensionality reduction, using the PCA technique, into the core of the recognizer. The proposed process is shown in Fig. 5.

The mean vectors obtained from the UBM through MAP adaptation are projected with the PCA into an orthogonal feature space. The new reduced mean vectors are then used as input to the SVM model for scoring.
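A minimal sketch of this modeling step under one possible reading of the scheme, in which the adapted means of each utterance are stacked into a supervector, the supervectors are reduced with the pca_transform sketch above, and an SVM is trained on the reduced vectors (the dimension D and the use of scikit-learn's SVC are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def gmm_pca_svm_train(adapted_means_list, labels, D=20):
    # adapted_means_list: one (M, d) matrix of adapted means per utterance
    supervectors = np.vstack([m.reshape(1, -1) for m in adapted_means_list])
    # PCA projection of the supervectors (columns of the input are the vectors)
    reduced, phi = pca_transform(supervectors.T, D)
    # Train the SVM on the reduced vectors, one column per utterance
    svm = SVC(kernel='rbf', gamma='scale').fit(reduced.T, labels)
    return svm, phi
```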

4.3 Double dimensionality reduction

To further investigate the contribution of the PCA technique, dimensionality reduction is also applied in the front-end part of the proposed GMM-PCA-SVM system (see Fig. 6).

Fig. 6
figure 6

Block diagram of the double dimensionality reduction, the PCA-GMM-PCA-SVM based speaker recognition system

5 Experimental results

5.1 Corpora

The corpus used in this work is drawn from the TIMIT database Garofolo et al. (1993), one of the first corpora available with a large number of speakers, which has been used for many speaker recognition studies. This database includes phonetic and word transcriptions as well as a 16-bit, 16 kHz speech file for each utterance, recorded in “.ADC” format.

The database consists of a set of 8 sentences of about 3 s each, spoken in English by 491 speakers and divided into 8 dialect regions (Dr1 to Dr8) of the United States. We selected 5 phonetically rich sentences (SX recordings) for training and 3 other utterances (SI sentences), different from the previous ones, for testing. In this way, the text independence of the speaker recognition task was preserved.

5.2 Speaker recognition using SVM and GMM-SVM

To evaluate the influence of dialect and database size on ASR, a comparative study of the SVM and GMM-SVM systems is performed. In this study, Gaussian mixture models with M = 32 components were used. The parameter \(\alpha _i\) is calculated as in Eq. (10). For GMM-MAP training, only the mean values of the Gaussian components were adapted, with a relevance factor of 16; the weights and covariance matrices were not modified.

The so-called impostor model is used as an a priori model for the estimation of the speaker models. For this purpose, a gender-balanced UBM consisting of 2048 mixture components was trained using the EM algorithm. The UBM models the general acoustic space of 120 unknown speakers (impostors), 60 male and 60 female, where each speaker utters five different sequences. In a last step, an SVM classifier is trained for scoring, using the target GMM supervectors and an SVM background consisting of the GMM supervectors of 25 impostors labeled as (–1). Table 1 presents the results in terms of EER for the different dialects and different sizes of the subdatabases contained in the TIMIT dataset.

Table 1 Performance of the SVM and GMM-SVM based speaker recognition systems, in terms of EER (%)

Table 1 shows the accuracy of the speaker recognition system in terms of EER (%) for both the SVM and GMM-SVM classifiers. As expected, in most cases the GMM-SVM outperforms the SVM system. For example, the EER obtained with the SVM model for the Dr8 subset is 26.89 %, whereas it is below 22.1 % for the hybrid GMM-SVM system.

When three subsets of the TIMIT corpus have almost the same number of speakers but different dialects (Dr4, Dr5 and Dr7), the GMM-SVM and SVM accuracies are nearly the same across these subsets. For example, for the SVM classifier, the EER in Dr4 (South Midland dialect, 65 speakers) is 8.8 %, in Dr5 (Southern dialect, 65 speakers) it is 8.71 % and in Dr7 (Western dialect, 66 speakers) it is 8.18 %. On the other hand, a difference in accuracy is noticed with the SVM classifier between the Dr1 and Dr6 subsets: the EER in Dr1 (New England dialect, 47 speakers) is 14.83 %, while it is 16.4 % for Dr6 (Southern dialect, 47 speakers). Therefore, one cannot confirm that the dialect has an influence on the ASR task for either system. However, the number of speakers has a strong influence on the recognition rate of both classifiers: the greater the number of speakers, the smaller the EER. This is clearly seen with Dr8 (Army Brat dialect, 25 speakers) and Dr2 (Northern dialect, 90 speakers), for which the EERs are 26.89 % and 6.93 % for the GMM-SVM and SVM classifiers, respectively.

5.3 Speaker recognition using the PCA dimensionality reduction

The main goal of the experiments described in this section is to evaluate the recognition performance of the proposed system using PCA dimensionality reduction in the core of the classifier. Results obtained when applying PCA in the front-end part of the ASR system are also presented in Table 2.

Table 2 Performance of the GMM-PCA-SVM, PCA-GMM-PCA-SVM and PCA-SVM systems, in terms of EER (%)

Compared to Table 1, we can observe that using PCA dimensionality reduction leads to a notable increase in accuracy for both the SVM and the hybrid GMM-SVM classifiers. It is clearly seen that the proposed GMM-PCA-SVM system outperforms the other ones on all subsets of the TIMIT database.

5.4 Speaker recognition in noisy environment

The SVM and GMM-PCA-SVM classifiers have been evaluated in both clean and noisy environments, using a set of 176 speakers drawn from the eight subsets of the TIMIT database.

For a real-world setting, two different noisy environments, Train station and Subway noises from the NOISEUS database, have been used at Signal-to-Noise Ratios (SNR) of 0, 5 and 10 dB. The experimental protocol is the same as the one detailed previously in this paper. In the clean environment, the obtained results are expressed by the Detection Error Tradeoff (DET) curve (see Fig. 7).
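Since the results are reported as EER values and DET curves, a minimal sketch of how the EER can be obtained from target and impostor scores is given below (a simple threshold sweep; dedicated evaluation toolkits typically use finer interpolation):

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    # Sweep candidate thresholds over all observed scores
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    eer, best_gap = 1.0, np.inf
    for th in thresholds:
        far = np.mean(impostor_scores >= th)   # false acceptance rate
        frr = np.mean(target_scores < th)      # false rejection rate
        if abs(far - frr) < best_gap:          # keep the point where FAR ~ FRR
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```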

Fig. 7

Speaker recognition in clean environment

In the clean case, a slight degradation is noticed when applying the PCA technique in the front-end part of the SVM classifier. For example, the EER increases from 4.2 % (SVM alone) to 5 % (PCA-SVM). For the proposed GMM-PCA-SVM model, the PCA makes an important contribution to the recognition accuracy: the EER decreases from 3.92 % for the conventional GMM-SVM based classifier to 2.94 % for the proposed GMM-PCA-SVM one.

Figures 8 and 9 present the accuracy of the proposed system in the different noisy environments. It is clearly seen that the proposed GMM-PCA-SVM speaker recognition system is more robust than the conventional SVM or GMM-SVM based speaker recognition systems.

Fig. 8

Comparative performance of the speaker recognition systems using speech corrupted with Subway noise

Fig. 9

Comparative performance of the speaker recognition systems using speech corrupted with Train station noise

Concerning the noisy environments, the contribution of the PCA is clearly noticeable for both the SVM and GMM-SVM systems. The best accuracies are reached with the proposed GMM-PCA-SVM system. Applying PCA in the front-end part of the SVM system also brings interesting results: for instance, for Subway noise at SNR = 0 dB, the EER is 12 % for the SVM based system alone, while it is below 10.2 % for the PCA-SVM based system.

In the speech signal, subsets of variables are expected to be highly correlated with each other. These variables are largely redundant and consequently play the same role in defining the outcome of interest. As a result, the system is trained on unnecessary samples, which leads to a loss of time and performance. Furthermore, when the speech data are corrupted by different noises, the information carried by these redundant or less significant samples is lost, which causes a serious degradation in system accuracy.

The basic solution is to combine, using the PCA technique, these variables into a smaller number of components that account for most of the variance in the observed data. One of the principal assumptions of PCA is that components with large variance correspond to interesting dynamics, while components with low variance correspond to noise. In this paper, PCA is used in the modeling phase of the classifier: it transforms the adapted mean vectors into an orthogonal feature space and allows discarding the low-weight transformed features. This considerably enhances performance by removing correlations between variables.

6 Conclusion

In this paper, a new GMM-PCA-SVM scheme has been proposed for ASR. The concept, based on dimensionality reduction, consists of applying the PCA technique to the adapted mean vectors in the modeling phase of the GMM-SVM based speaker recognition system. A comparative study has shown that this new scheme brings interesting results in both clean and noisy environments.

In addition, dimensionality reduction was also applied to both the front-end stage and the speaker modeling core; in that case, the overall reduction was not more effective, due to the large loss of information caused by the repeated reduction.

Moreover, the results show that the dialect did not have a visible effect on the system's performance. However, the size of the database (number of speakers) strongly affected the accuracy of both classifiers.

For future work, additional features, such as prosodic and voice quality features, could be merged with the proposed method to improve the speaker recognition accuracy.