
1 Introduction

Biometric systems are essentially pattern recognition systems that operate by acquiring biometric data from an individual. Instead of passwords and PIN codes, which can be forgotten or stolen, or signatures, which can easily be forged, body characteristics such as voice, face, fingerprints and gait have been considered as discriminative features that cannot easily be stolen or forged [1]. Human relationships are essentially based on communication between individuals, and language, in both its written and spoken forms, supports all aspects of human interaction. Indeed, individuals can communicate with one another using only the human vocal apparatus. Hence, the acoustic signal of human speech carries not only what is being said but also individual characteristics of the speaker, such as speaking style, speaker-specific traits and emotions, accent, state of health, and transmission channel properties. Every person possesses a unique voice, and even when the same person repeats the same words, the resulting sounds are not identical. Among the important directions in speech analysis research is the field of speaker recognition, a domain that has received considerable attention from the scientific community for many years and continues to do so. Indeed, human speech is one of the most widely used and least intrusive biometric measures.

In this article, we refer to speaker recognition systems, which use human speech to recognize an individual [2]. Over the past decade, numerous speaker recognition algorithms have been developed in the literature [3]. However, the performance of these speaker recognition systems usually degrades drastically when only limited data are available.

To mitigate the problem of speaker recognition from short utterances, this article introduces a new robust speaker recognition system based on new cepstral features that combine the well-known state-of-the-art Mel Frequency Cepstral Coefficients (MFCC) [3, 22] with newer robust features, the Power Normalized Cepstral Coefficients (PNCC), which have recently proved efficient and successful for speech and speaker recognition applications [31,32,33,34]. We evaluate the effectiveness of these combined features on speakers taken from the TIMIT [16] and VoxCeleb2 [15] databases.

The rest of this paper is organized as follows. Section 2 explains the Support Vector Machines technique, Sect. 3 describes related work in the speaker recognition field and explains the utility of the proposed approach, the experimental protocol is presented in Sect. 4, experimental results are discussed in Sect. 5, and conclusions are drawn in Sect. 6.

2 Support Vector Machines

2.1 Linear Support Vector Machines

An SVM is a classifier based on separating hyperplanes. Consider the problem of separating a set of m training vectors S = {(xi, yi)}, where xi \( \in \) Rn is a feature vector, yi \( \in \) {+1, −1} is a class label and i \( \in \) {1,…,m}, into two classes with a separating hyperplane given by the following equation:

$$ \omega \cdot x + b = 0 $$
(1)

This hyperplane must maximize the margin, so it must satisfy the following constraints:

$$ y_{i} (\omega \cdot x_{i} + b) \ge + 1,\quad \forall i \in \{ 1, \ldots ,m\} $$
(2)

The best separating hyperplane must maximize the margin M given by the equation:

$$ M = \frac{2}{\left\| \omega \right\|} $$
(3)

In fact, the optimal hyperplane is the one that minimizes:

$$ \phi (\omega ) = \frac{1}{2}\,\omega \cdot \omega $$
(4)
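
To make this concrete, the short sketch below (an illustrative example with toy data, not taken from the paper's experiments) fits a linear SVM with scikit-learn and recovers the margin of Eq. (3) from the learned weight vector; a large value of C approximates the hard-margin case described above.

```python
# Illustrative sketch only: linear SVM on toy data, recovering w, b and the margin.
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: x_i in R^2, y_i in {+1, -1}
X = np.array([[2.0, 2.0], [2.5, 3.0], [-2.0, -2.0], [-3.0, -2.5]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6)    # large C approximates hard-margin separation
clf.fit(X, y)

w = clf.coef_[0]                     # hyperplane: w.x + b = 0  (Eq. 1)
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)     # Eq. (3): M = 2 / ||w||
print("w =", w, "b =", b, "margin =", margin)
```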

2.2 Non-linear Support Vector Machines

When the training vectors of the two classes are not linearly separable, Cortes and Vapnik [8] introduce slack variables \( \xi_{i} \ge 0 \) to measure the misclassification errors.

To solve the optimization problem, the classification error must be minimized [9]. The optimal hyperplane must satisfy the following inequalities:

$$ \omega \cdot x_{i} + b \ge + 1 - \xi_{i} ,\quad \text{if } y_{i} = + 1 $$
(5)
$$ \omega \cdot x_{i} + b \le - 1 + \xi_{i} ,\quad \text{if } y_{i} = - 1 $$
(6)

In this case, the optimal hyperplane is determined by the vector \( \omega \) that minimizes the following function:

$$ \phi (\omega ,\xi ) = \frac{1}{2}\,\omega \cdot \omega + C\sum\limits_{i = 1}^{m} {\xi_{i} } $$
(7)

Where \( \xi = (\xi_{1} , \ldots ,\xi_{m} ) \) and C is a constant that weights the penalty on the classification errors.
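
Assuming the standard soft-margin formulation, constraints (5)–(6) and objective (7) can be collected into a single constrained minimization:

$$ \min_{\omega ,b,\xi } \;\frac{1}{2}\,\omega \cdot \omega + C\sum\limits_{i = 1}^{m} {\xi_{i} } \quad \text{subject to}\quad y_{i} (\omega \cdot x_{i} + b) \ge 1 - \xi_{i} ,\;\xi_{i} \ge 0,\;i = 1, \ldots ,m $$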

2.3 Kernel Support Vector Machines

When a linear boundary is inappropriate, the principle of the SVM consists in mapping the training vectors into a high-dimensional space in which an optimal separating hyperplane can be found.

The SVM replaces the inner products of the input data \( (x_{i} ,x_{j} ) \) with a kernel function \( K(x_{i} ,x_{j} ) \) to construct an optimal hyperplane in the new space. The kernel function maps the input data via an associated function \( \Phi \) into a high-dimensional feature space in which the mapped data can be separated linearly.

Although many different kernel functions exist, the following are the most commonly used:

  • Linear: \( K(x_{i} ,x_{j} ) = x_{i}^{T} x_{j} \)

  • Polynomial: \( K(x_{i} ,x_{j} ) = (\gamma x_{i}^{T} x_{j} + r)^{d} ,\;\gamma > 0 \)

  • Radial Basis Function (RBF): \( K(x_{i} ,x_{j} ) = \exp ( - \gamma \left\| {x_{i} - x_{j} } \right\|^{2} ),\;\gamma > 0 \)

Where \( \gamma \), r and d are kernel parameters.
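
For illustration (not taken from the paper), the following sketch computes the three kernels above explicitly with NumPy and checks them against scikit-learn's pairwise kernel functions; the values of \( \gamma \), r and d are arbitrary.

```python
# Illustrative sketch: the linear, polynomial and RBF kernels computed explicitly.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

X = np.random.RandomState(0).randn(5, 3)    # five vectors x_i in R^3
gamma, r, d = 0.5, 1.0, 3                   # arbitrary kernel parameters

K_lin = X @ X.T                                             # x_i^T x_j
K_poly = (gamma * (X @ X.T) + r) ** d                       # (gamma x_i^T x_j + r)^d
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, -1)  # ||x_i - x_j||^2
K_rbf = np.exp(-gamma * sq_dist)                            # exp(-gamma ||x_i - x_j||^2)

# Cross-check against scikit-learn's implementations
assert np.allclose(K_lin, linear_kernel(X))
assert np.allclose(K_poly, polynomial_kernel(X, gamma=gamma, coef0=r, degree=d))
assert np.allclose(K_rbf, rbf_kernel(X, gamma=gamma))
```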

3 Related Works

For classification problems, most paradigms belong to one of two families: generative models, such as Gaussian Mixture Models (GMM), and discriminative classifiers, such as SVMs. Generative models need only training samples from the target class or speaker and build a statistical model that describes the target speaker's distribution. Discriminative classifiers, in contrast, require training data for both the target and impostor speakers and learn an optimal separation between the different speakers.

Most state-of-the-art speaker recognition systems rely on the generative training of GMMs. In fact, the problem has traditionally been addressed by directly modelling the spectral content of the speech with GMMs [10]. However, the generative training of Gaussian mixture models does not directly optimize the classification performance, which is why it became interesting to develop alternative discriminative approaches that address the classification problem directly [11, 12]. Other recent works resort to neural network techniques [4]; in particular, deep neural networks (DNN) have been used for speaker verification systems [4,5,6,7].

A notable trend in recent advances in the speaker recognition field is the increasing adoption of SVMs, which have proved to be an effective method for speaker recognition applications [13, 26,27,28,29,30]. In fact, owing to the kernel, which is the main design component of an SVM, this classifier is able to find an appropriate metric in the SVM feature space relevant to the classification problem [14]. Generally, these systems achieve comparable or superior performance to generative methods with much less training data.

Even so, most of these techniques have been applied to related problems such as speaker verification, and effective recognition methods are still lacking for the short-utterance, text-independent speaker identification task.

For speaker recognition applications, feature extraction is another fundamental phase. Indeed, this step is essential to capture the speaker-specific characteristics [23]. State-of-the-art applications use appropriate features, among which the most successful are the Linear Prediction Coefficients (LPC) [17], the Perceptual Linear Prediction (PLP) coefficients [20], and, more recently, the popular MFCC spectral features. The MFCCs achieve a high level of performance due to the perceptually motivated Mel-spaced filter bank processing of the Fourier transform and to the robustness to the environment and flexibility that can be achieved with cepstral analysis [3, 22].

Recently, the PNCC coefficients have proved highly efficient in the domain of speech recognition and also for speaker recognition applications [31,32,33,34].

In this work, we try to enhance the performance of the proposed system by using combined MFCC and PNCC features, thus profiting from the robustness of both feature sets for the speaker recognition task. The resulting combined feature vectors are evaluated in a speaker identification system when only short utterances are available, and the performance of the proposed system is compared against the results obtained with the baseline systems.
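
One plausible realization of such a combination (an assumption for illustration, not a description of the authors' exact pipeline) is a frame-level concatenation of the two 39-dimensional streams, sketched below; the extract_mfcc and extract_pncc helpers are hypothetical placeholders.

```python
# Illustrative sketch: frame-wise concatenation of MFCC and PNCC feature streams.
import numpy as np

def combine_features(mfcc, pncc):
    """Concatenate per-frame MFCC and PNCC vectors (39 + 39 = 78 dimensions)."""
    n = min(mfcc.shape[1], pncc.shape[1])      # align the two frame counts
    return np.vstack([mfcc[:, :n], pncc[:, :n]])

# mfcc = extract_mfcc("utterance.wav")     # hypothetical helper, shape (39, n_frames)
# pncc = extract_pncc("utterance.wav")     # hypothetical helper, shape (39, n_frames)
# combined = combine_features(mfcc, pncc)  # shape (78, n_frames)
```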

4 Experiments

4.1 Test Database

We performed our experiments using the TIMIT dataset. The TIMIT corpus comprises recordings of 630 speakers (438 male, 192 female) [16] covering eight major dialects of American English. Table 1 lists the different dialect regions of the TIMIT database and their respective codes. For each speaker, there are ten different utterances recorded over a clean channel. The dataset contains about 5.25 h of audio in wav format. The sampling frequency of the utterances is 16 kHz with 16-bit resolution. The recordings are single-channel, and the mean duration of each utterance is 3.28 s.

Table 1. The different dialect regions of TIMIT database.

The second set of experiments is performed using speakers from the VoxCeleb2 database [15]. This corpus contains over a million utterances from a large pool of speakers. Whereas the TIMIT corpus contains clearly read speech, VoxCeleb2 contains more background noise and overlapping speech.

4.2 Acoustic Features

In our experiments, we used cepstral features extracted from the speech signal using a 25 ms Hamming window with a 10 ms frame shift. Twelve MFCC coefficients together with the log energy are thus calculated every 10 ms. Delta and double-delta coefficients are then appended to obtain a 39-dimensional final vector, a configuration reported as the most effective in the literature [3]. We also use 39-dimensional PNCC feature vectors.
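
For illustration, the sketch below shows one way (our own assumption, not the authors' exact toolchain) to obtain such 39-dimensional MFCC vectors with librosa, using a 25 ms Hamming window, a 10 ms shift, 12 MFCCs plus log energy, and their deltas and double deltas; the input file name is hypothetical.

```python
# Illustrative sketch: 39-dimensional MFCC features (12 MFCCs + log energy,
# plus delta and double-delta coefficients) with a 25 ms window and 10 ms shift.
import numpy as np
import librosa

y, sr = librosa.load("speaker_utterance.wav", sr=16000)  # hypothetical 16 kHz file

n_fft = int(0.025 * sr)   # 25 ms analysis window (400 samples)
hop = int(0.010 * sr)     # 10 ms frame shift (160 samples)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop,
                            window="hamming", center=False)
# Replace the 0th cepstral coefficient with the per-frame log energy
frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
mfcc[0, :] = np.log(np.sum(frames ** 2, axis=0) + 1e-10)

delta = librosa.feature.delta(mfcc, order=1)    # first-order derivatives
delta2 = librosa.feature.delta(mfcc, order=2)   # second-order derivatives
features = np.vstack([mfcc, delta, delta2])     # final 39 x n_frames matrix
```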

4.3 SVM Systems

The classification is performed with SVMs, which have proved their efficiency compared with other classification systems in this domain [3, 18].

We used two SVM kernel functions in our experiments: the first system uses the linear kernel and the second uses the radial basis function (RBF) kernel.

To compare our results with other approaches, we built the two kernel systems with low-dimensional vectors and limited training data. In fact, unlike Dehak [19], who used the NIST SRE 2006 corpus in which the training and test utterances contain 2.5 min of speech on average, we used utterances with a mean duration of about 3 s from the TIMIT and VoxCeleb2 databases. Besides, we used MFCC features, which have proved their efficiency in speaker recognition [3], instead of Linear Frequency Cepstral Coefficients (LFCC), which are widely criticized because of the non-linear character of speech [21]. Moreover, as in [3], we use 39-dimensional MFCC feature vectors extracted from the speech signal instead of the 60-dimensional feature vectors commonly used in [24, 25].

Following the protocol suggested in [3], we use 64 speakers. For the TIMIT database, we divide the utterances spoken by each speaker into 8 utterances for training and 2 utterances for testing. We then further reduce the training duration and use only 3 utterances per speaker for training and 2 utterances for testing. For the VoxCeleb2 database, the first set of experiments uses about 24 s for training and 6 s for testing, and the second set uses about 10 s for training and 6 s for testing.
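
For illustration only, the following sketch shows the per-speaker 8/2 utterance split described above, under the assumption of a hypothetical directory layout with one wav file per utterance.

```python
# Illustrative sketch: split each speaker's ten utterances into train and test sets.
import glob

def split_speaker_utterances(speaker_dir, n_train=8, n_test=2):
    """Return (training files, test files) for one speaker directory."""
    wavs = sorted(glob.glob(f"{speaker_dir}/*.wav"))   # ten utterances per TIMIT speaker
    return wavs[:n_train], wavs[-n_test:]

# train_files, test_files = split_speaker_utterances("TIMIT/DR1/FCJF0")  # hypothetical path
# Reduced-training condition: split_speaker_utterances(path, n_train=3, n_test=2)
```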

5 Results and Discussion

We examine the performance of the speaker recognition systems described previously by carrying out the following experimental evaluations. We use two baseline systems, the first based on MFCC features and the second based on PNCC features, while the proposed system is based on the combined MFCC and PNCC features.

The different speaker recognition systems were implemented and evaluated with a series of experiments. For each kind of kernel, we varied its parameters to find the values that give the best learning. After the learning phase, we carried out a set of experiments in the test phase.
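
As an example of such a parameter sweep (a sketch under our own assumptions, not the authors' exact procedure), a cross-validated grid search over C for the linear kernel and over C and \( \gamma \) for the RBF kernel could look as follows, assuming one fixed-length feature vector per utterance.

```python
# Illustrative sketch: cross-validated grid search over SVM kernel parameters.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data: 64 speakers x 8 training utterances, 78-dimensional vectors
rng = np.random.RandomState(0)
X_train = rng.randn(64 * 8, 78)
y_train = np.repeat(np.arange(64), 8)       # speaker labels

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]},
]
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```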

We start by presenting the first set of experiments in Table 2. For the TIMIT database, we give the speaker identification rates (IR) obtained with the linear and RBF kernels using 8 utterances per speaker for training and 2 utterances for testing. For the VoxCeleb2 database, we give the results obtained with 24 s for training and 6 s for testing.

Table 2. Speaker identification rates with SVM-based systems using RBF and linear kernels.

From the experimental results, we notice that the SVM system with the RBF kernel achieves the best identification rates.

Comparing our results with those obtained by the baseline systems, we can see that the proposed system outperforms both the standard MFCC coefficients and the PNCC features. In fact, the combined features yield 100% correct identification on the TIMIT database, against only 97.66% and 99.22% with the PNCC and MFCC features respectively. The results are also improved for the VoxCeleb2 database, reaching 93.75% correct identification against only 88.28% and 89.06% with the MFCC and PNCC features respectively.

For further comparison, a second set of experiments was carried out with a shorter training duration. In fact, we use only 3 utterances for training and 2 utterances for testing for the TIMIT database, and about 10 s for training and 6 s for testing with the VoxCeleb2 database. The results are reported in Table 3.

Table 3. Speaker identification rates with SVM-based systems using RBF and Linear kernels with reduced training duration.

The results obtained highlight the influence of short utterances on our system when limited data are available in the training phase. Compared with the results obtained with the baseline approaches, it is clear that the proposed features outperform the standard ones, yielding 98.43% correct identification with the RBF kernel on TIMIT against only 96.88% and 96.09% with the PNCC and MFCC coefficients respectively. The same observation also holds for the VoxCeleb2 database, which reaches 90.63% correct identification against only 73.44% and 78.13% with the MFCC and PNCC coefficients respectively.

6 Conclusions and Perspectives

In this paper, we present a new enhanced system based on the SVM approach for the speaker recognition task. This system focuses on the formulation of new features aimed at recognizing speakers with much reduced information. In fact, we do not need an additional training dataset as in traditional algorithms, nor do we require incorporating further complex algorithms. We plan to evaluate the proposed features with other approaches under different conditions.