
1 Introduction

The automatic speaker recognition (ASR) system was first introduced by Pruzansky et al. [7]. There are two primary tasks within speaker recognition (SR), namely speaker identification (SI) and speaker verification (SV). This paper is concerned with SI, and we use the terms SI and SR synonymously. A general block diagram for SI and SV is shown in Fig. 1a. SR can also be classified into text-dependent and text-independent recognition, and further divided into open-set and closed-set identification.

Fig. 1 Block diagram a for SR and b for MFCC feature extraction

An immense number of features have been invented, but the features popularly used today for robust SR are the Linear Predictive Cepstral Coefficient (LPCC) and Perceptual Linear Predictive Cepstral Coefficient (PLPCC) [9], Gammatone Frequency Cepstral Coefficient (GFCC) [11], Mel-Frequency Cepstral Coefficient (MFCC), combination of MFCC and phase information [6], Modified Group Delay Feature (MODGDF) [5], Mel Filter Bank Energy-Based Slope Feature [4], i-Vector [3], and Bottleneck Feature of DNN (BF-DNN). In some cases, to increase robustness, combined features are developed by fusing some of these robust features; examples are LPCC+MFCC, MFCC+GFCC, and PLPCC+MFCC+GFCC. The state-of-the-art methods for modeling/classification are Vector Quantization (VQ) [10], Hidden Markov Model (HMM), Gaussian Mixture Model (GMM) [8], GMM-Universal Background Model (GMM-UBM) [1], Support Vector Machine (SVM), Deep Neural Network (DNN), and hybrid models like VQ/GMM, SVM/GMM, and HMM/GMM. Among these, the hybrid HMM/GMM model is very useful for SR in noisy environments because the HMM isolates the speech feature vectors from the noisy feature vectors and then estimates the multivariate probability density function using a GMM in the feature space.

2 Feature Extraction

The first step of SR is feature extraction, also known as front-end processing. It transforms (maps) the raw speech data into the feature space. Features like MFCC and GFCC are computed using frequency-domain analysis and the spectrogram. In our experiment, the MFCC feature is used for SR. The block diagram for extracting the MFCC feature is shown in Fig. 1b. The computation of MFCC is discussed briefly as follows:

Pre-emphasis: The speech signal is passed through a HPF to boost the amplitude of the high-frequency components. If s(n) is the speech signal, then pre-emphasis is implemented as \(\tilde{s}(n) = s(n)-\alpha s(n-1)\), where \(0.9<\alpha <1\). Generally, the typical value of \(\alpha \) is 0.97.
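
As a minimal sketch (not part of the original text), this step can be written in Python/NumPy as follows, assuming the speech samples are already loaded into a 1-D array `s`:

```python
import numpy as np

def pre_emphasis(s, alpha=0.97):
    """High-pass pre-emphasis: s~(n) = s(n) - alpha * s(n-1)."""
    s = np.asarray(s, dtype=float)
    # The first sample has no predecessor, so it is kept unchanged.
    return np.append(s[0], s[1:] - alpha * s[:-1])
```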

Framing: To compute MFCC, short-time processing of the speech signal is required. The whole speech signal is broken into overlapping frames. Typically, a 25–60 ms frame is chosen with an overlap of 15–45 ms.

Window: To compute the Short-Time Fourier Transform (STFT) of x(n), where x(n) is a short-time frame, we must choose a window function h(n). A typical window function h(n) is given by

$$\begin{aligned} h(n) = \beta - \left( 1-\beta \right) \cos \left( \frac{2 \pi n}{N-1}\right) \end{aligned}$$
(1)

where N is the window length. Here, \(\beta = 0.54\) for Hamming window and \(\beta = 0.5\) for Hanning window.

DFT and FFT: The Discrete Fourier Transform (DFT) of the windowed signal is computed as \(X(\omega ,n) = \sum _{m=-\infty }^{\infty }x(m)h(n-m)e^{-j\omega m}, ~where~ 0 \le n \le N-1\). For the discrete STFT, the continuous \(X(\omega ,n)\) is sampled at N (the length of the windowed signal) equally spaced points in frequency (\(\omega \)) as \(X(k,n)= X(k) = X(\omega ,n)|_{\omega = \frac{2 \pi }{N}k}, ~where~ 0 \le k \le N-1\). The graphical display of \(|X(k,n)|\) as color intensity is known as the spectrogram. Fortunately, the two previous equations can be simplified with the help of the Fast Fourier Transform (FFT) as \(X(k) = \mathcal {FFT}\{x(n)h(n)\}, ~where~ 0 \le k \le N-1\). To facilitate the FFT, N should be a power of 2. If N is not a power of 2, the frame is zero-padded up to the nearest power of 2; otherwise, zero padding is not required.

Magnitude Spectrum: The squared magnitude spectrum is computed as \(S(k) = |X(k)|^2\), where \(0 \le k \le N-1\).
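
The framing, windowing, and squared-magnitude-spectrum steps can be sketched as follows; the frame length, frame shift, FFT size, and array names are illustrative choices and not prescribed by the text:

```python
import numpy as np

def power_spectrum(signal, frame_len, frame_shift, n_fft=512, beta=0.54):
    """Split the signal into overlapping frames, window them, and return S(k) = |X(k)|^2."""
    starts = range(0, len(signal) - frame_len + 1, frame_shift)
    frames = np.array([signal[s:s + frame_len] for s in starts])
    # Generalized Hamming/Hanning window, Eq. (1): beta - (1 - beta) * cos(2*pi*n / (N-1)).
    n = np.arange(frame_len)
    window = beta - (1 - beta) * np.cos(2 * np.pi * n / (frame_len - 1))
    # np.fft.rfft zero-pads each windowed frame to n_fft points and returns the one-sided spectrum.
    X = np.fft.rfft(frames * window, n=n_fft)
    return np.abs(X) ** 2
```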

Mel-Scale Filter Bank: On the mel scale, \(n_B\) overlapping triangular filters are placed between \(M(f_{min})\) and \(M(f_{max})\) to form a filter bank. The relation between the mel scale (mel) and the linear scale (Hz) is given by

$$\begin{aligned} M(f) = 1127\log _e\left( 1+\frac{f}{700}\right) \end{aligned}$$
(2)

where f is in Hz and M(f) is in mel. A filter in the filter bank is characterized by its start, center, and end frequencies, i.e., \(M(f_s)\), \(M(f_c)\), and \(M(f_e)\), respectively. Using the inverse operation of (2), we can compute \(f_s\), \(f_c\), and \(f_e\) with the following equation:

$$\begin{aligned} f = 700(e^{\frac{M(f)}{1127}}-1) \end{aligned}$$
(3)

where f is in Hz and M(f) is in mel. Next, we map the frequencies \(f_s\), \(f_c\), and \(f_e\) to the corresponding nearest FFT index numbers \(f_{bin}^s\), \(f_{bin}^c\), and \(f_{bin}^e\), respectively, called FFT bins, using the following equation:

$$\begin{aligned} f_{bin} = \left\lfloor \frac{(N+1)\cdot f}{F_s}\right\rfloor , \qquad f = f_s,~f_c,~f_e \end{aligned}$$
(4)

Here, \(F_s\) is the sampling frequency of the speech signal. The filter weight is maximum (equal to 1) at the center bin \(f_{bin}^c\), and zero weight is assumed at the start and end bins, \(f_{bin}^s\) and \(f_{bin}^e\). The weights are calculated as follows:

$$\begin{aligned} H_m(k) = {\left\{ \begin{array}{ll} 0 &{}\text {if } k<f_{bin}^s\\ \frac{k - f_{bin}^s}{f_{bin}^c - f_{bin}^s} &{}\text {if } f_{bin}^s\le k \le f_{bin}^c\\ \frac{f_{bin}^e - k}{f_{bin}^e - f_{bin}^c} &{}\text {if } f_{bin}^c \le k \le f_{bin}^e\\ 0 &{}\text {if } k>f_{bin}^e \end{array}\right. } \end{aligned}$$
(5)

Filter Energy: The filter bank is set over the squared magnitude spectrum S(k). For each filter in the filter bank, the filter weights are multiplied with the corresponding S(k) values and all the products are summed up to get the filter energy; the \(n_B\) filter energies are denoted by \(\{\tilde{S}(k)\}_{k=1}^{k=n_B}\). Taking the logarithm, we get the log energies, \(\{\log (\tilde{S}(k))\}_{k=1}^{k=n_B}\).
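
A compact sketch of Eqs. (2)–(5) and the log filter energies is given below; the function names are illustrative, and no guard is included for degenerate filters whose start and center bins coincide (a practical implementation should handle that case):

```python
import numpy as np

def mel(f):
    """Hz to mel, Eq. (2)."""
    return 1127.0 * np.log(1.0 + f / 700.0)

def inv_mel(m):
    """Mel to Hz, Eq. (3)."""
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_min, f_max):
    """Triangular filter weights H_m(k) on the one-sided spectrum, Eqs. (4)-(5)."""
    # n_filters + 2 equally spaced mel points give the start/center/end frequency of each filter.
    mel_points = np.linspace(mel(f_min), mel(f_max), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_points) / fs).astype(int)   # Eq. (4)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        s, c, e = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, s:c] = (np.arange(s, c) - s) / (c - s)      # rising slope
        H[m - 1, c:e] = (e - np.arange(c, e)) / (e - c)      # falling slope
    return H

def log_filter_energies(S, H):
    """Log filter energies log(S~) from the squared magnitude spectrum S (one row per frame)."""
    return np.log(S @ H.T + np.finfo(float).eps)
```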

DCT: To perform the Discrete Cosine Transform (DCT), the following operation is carried out.

$$\begin{aligned} C_n = \sum _{k=1}^{D}(\log \tilde{S}(k))\cos \left( n\left( k-\frac{1}{2}\right) \frac{\pi }{D}\right) , \qquad n = 1,2,\ldots ,D \end{aligned}$$
(6)

Here, \(D=n_B\) is the number of coefficients, and the vector \((C_1, C_2, \ldots , C_D)'\) is called the MFCC vector.
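
This DCT step can be sketched as follows, keeping the first 13 coefficients as the experiments below do; `log_E` is assumed to hold one row of \(n_B\) log filter energies per frame:

```python
import numpy as np

def mfcc_from_log_energies(log_E, n_ceps=13):
    """DCT of the log filter energies, Eq. (6), keeping the first n_ceps coefficients."""
    D = log_E.shape[-1]                       # D = n_B filter-bank channels
    k = np.arange(1, D + 1)
    n = np.arange(1, n_ceps + 1)
    # Basis matrix: cos(n * (k - 1/2) * pi / D) for each output coefficient n.
    basis = np.cos(np.outer(n, k - 0.5) * np.pi / D)
    return log_E @ basis.T                    # MFCC vectors, one row per frame
```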

3 Speaker Model

The models that are used frequently in SR are Linear Discriminant Analysis (LDA), Probabilistic LDA (PLDA), Gaussian Mixture Model (GMM), GMM-Universal Background Model (GMM-UBM), Hidden Markov Model (HMM), Artificial Neural Network (ANN), Deep Neural Network (DNN), Vector Quantization (VQ), Dynamic Time Warping (DTW), and Support Vector Machine (SVM). GMM is the most popular model used in SR. These models are used to build speaker templates. Score-domain compensation aims to remove handset-dependent biases from the likelihood ratio scores; the most prevalent methods include H-norm, Z-norm, and T-norm.

3.1 Vector Quantization (VQ)

Vector Quantization (VQ) is used as a preliminary method for clustering the feature vectors so that subsequent modeling can be carried out more efficiently. The grouping is done by minimizing the Euclidean distance between vectors. If we get V vectors after the feature extraction phase, then after VQ we get K vectors, where K < V. This set of K vectors is called the codebook and represents the set of centroids of the individual clusters. In the modeling stage, the GMM is built upon these K vectors.
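
A minimal sketch of building the K-vector codebook, here using scikit-learn's k-means purely for illustration (the text does not prescribe a particular clustering implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def vq_codebook(features, K=512, seed=0):
    """Cluster the V feature vectors (rows of `features`) into K centroids, the codebook."""
    km = KMeans(n_clusters=K, random_state=seed, n_init=10).fit(features)
    return km.cluster_centers_        # shape (K, D): one centroid per cluster
```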

3.2 Gaussian Mixture Model (GMM)

Let the \(j\mathrm{{th}}\) speaker have K quantized feature vectors of dimension D, viz. \({\mathcal {X}}\) = \( \{ \varvec{x_t} \in \mathbb {R}^D : 1 \le t \le K \}\). The GMM for the \(j\mathrm{{th}}\) speaker, \(\lambda _j\), is the weighted sum of M component D-variate Gaussian densities, where the mixture weights \(w_i ~\{i=1 ~to~M\}\) must satisfy \(\sum _{i=1}^{M} w_i = 1\). Hence, the GMM model \(\lambda _j\) is given by \(p(\varvec{x_t} | \lambda _j) = \sum _{i=1}^{M}w_i \mathcal {N}(\varvec{x}_t; \varvec{\mu _i},{{\varvec{\varSigma }}_i})\), where \(\mathcal {N}\) (\(\varvec{x}_t\); \(\varvec{\mu _i}\),\({{\varvec{\varSigma }}_i}\)) \(\{i=1 ~to~M\}\) are D-variate Gaussian density functions given by

$$\begin{aligned} \mathcal {N}(\varvec{x}_t; \varvec{\mu }_i,{{\varvec{\varSigma }}_i}) = \frac{1}{(2\pi )^{D/2}|{\varvec{\varSigma }}_i|^{1/2}} \exp \left\{ -\frac{1}{2}(\varvec{x}_t - \varvec{\mu }_i)'\,{\varvec{\varSigma }}_i^{-1}(\varvec{x}_t - \varvec{\mu }_i)\right\} \end{aligned}$$
(7)

with mean vector \(\varvec{\mu }_i\) \(\in \) \(\mathbb {R}^D\) and covariance matrix \({\varvec{\varSigma }}_i\) \(\in \) \(\mathbb {R}^{D \times D}\). (\(\varvec{x}_t\) - \(\varvec{\mu }_i)'\) represents the transpose of vector (\(\varvec{x}_t\) - \(\varvec{\mu }_i\)). The GMM model for \(j\mathrm{{th}}\) speaker \({\lambda _j}\) is parameterized by weight \({w_i}\), mean vector \(\varvec{\mu }_i\), and covariance matrix \({ {\varvec{\varSigma }}_i}\). Hence, \( {\lambda }_j = \{w_i,\varvec{\mu }_i,{{\varvec{\varSigma }}}_i \}\).

Fig. 2 A block diagram of GMM

These three parameters are computed with the help of the EM algorithm. At the beginning of the EM iterations, the three parameters must be initialized for each Gaussian component. The initialization can be completely random, but to converge faster one can also use the k-means clustering algorithm. A block diagram for GMM is shown in Fig. 2.

3.2.1 Maximum Likelihood (ML) Parameter Estimation (MLE):

The aim of the EM algorithm is to re-estimate the parameters after initialization so that they maximize the likelihood (ML) value given by

$$\begin{aligned} {p(\mathcal {X}|\lambda _j)=\prod _{t=1}^{K} p({\varvec{x}_t}|\lambda _j)} \end{aligned}$$
(8)

The EM algorithm begins with an initial model \(\lambda _0\) and re-estimates a new model \(\lambda \) such that \(p(\mathcal {X}|\lambda ) \ge p(\mathcal {X}|\lambda _0)\).

To estimate the model parameters, the mean vector \({\varvec{\mu }}_i\) is initialized using the k-means clustering algorithm, and this mean vector is used to initialize the covariance matrix \({\varvec{\varSigma }}_i\). The weight \({w_i}\) is initialized to \(1/M\). In each EM iteration, the three parameters are re-estimated according to the following three equations to get the new model \(\lambda _{new}\).

$$\begin{aligned} w_i = \frac{1}{K}{\sum _{t=1}^{K}}\mathcal P(i|\varvec{x}_t, \lambda _j) \end{aligned}$$
(9)
$$\begin{aligned} \varvec{\mu }_i = \frac{ {\sum _{t=1}^{K}}\mathcal P(i|\varvec{x}_t, \lambda _j)\varvec{x}_t}{{\sum _{t=1}^{K}}\mathcal P(i|\varvec{x}_t, \lambda _j)} \end{aligned}$$
(10)
$$\begin{aligned} {\varvec{\varSigma }}_i = \frac{{\sum _{t=1}^{K}}\mathcal P(i|\varvec{x}_t, \lambda _j) (\varvec{x}_t - \varvec{\mu }_i)(\varvec{x}_t - \varvec{\mu }_i)'}{{\sum _{t=1}^{K}}\mathcal P(i|\varvec{x}_t, \lambda _j)} \end{aligned}$$
(11)

The iteration continues until a suitable convergence criterion holds. For the covariance matrix \({\varvec{\varSigma }}_i\), only the diagonal elements are kept and all off-diagonal elements are set to zero. The probability \(\mathcal P(i|\varvec{x}_t, \lambda _j)\) is given by

$$\begin{aligned} \mathcal {P}(i|\varvec{x}_t, \lambda _j) = \frac{w_i \mathcal {N}(\varvec{x}_t; \varvec{\mu _i},{{\varvec{\varSigma }}_i})}{\sum _{m=1}^{M}w_m \mathcal {N}(\varvec{x}_t; \varvec{\mu _m},{{\varvec{\varSigma }}_m})} \end{aligned}$$
(12)
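
One EM re-estimation step according to Eqs. (9)–(12) can be sketched as follows, assuming diagonal covariance matrices as stated above; `sigma` is taken to hold one vector of per-dimension variances for each of the M components, and SciPy is used only to evaluate the Gaussian densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, w, mu, sigma):
    """One EM iteration for a diagonal-covariance GMM, Eqs. (9)-(12).

    X: (K, D) quantized feature vectors; w: (M,) weights; mu: (M, D) means; sigma: (M, D) variances.
    """
    K, _ = X.shape
    M = len(w)
    # E-step: component posteriors P(i | x_t, lambda), Eq. (12).
    lik = np.array([w[i] * multivariate_normal.pdf(X, mean=mu[i], cov=np.diag(sigma[i]))
                    for i in range(M)]).T                        # shape (K, M)
    post = lik / lik.sum(axis=1, keepdims=True)
    # M-step: Eqs. (9)-(11), keeping only the diagonal of each covariance matrix.
    Ni = post.sum(axis=0)                                        # soft counts per component
    w_new = Ni / K                                               # Eq. (9)
    mu_new = (post.T @ X) / Ni[:, None]                          # Eq. (10)
    sigma_new = (post.T @ X ** 2) / Ni[:, None] - mu_new ** 2    # diagonal of Eq. (11)
    return w_new, mu_new, sigma_new
```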

4 Speaker Identification with MLL Score

Let there be S speakers \(\mathcal {S}\) = {1, 2, 3, \(\ldots \), S}, represented by the GMMs \(\lambda _1\), \(\lambda _2\), \(\lambda _3\), \(\ldots \), \(\lambda _S\). Now, the task is to find the speaker model with the maximum posterior probability for the set of feature vectors \(\mathcal {X}\) of the test speaker. Using the minimum-error Bayes decision rule, the identified speaker is given by

$$\begin{aligned} \hat{S} = \underset{k \in \mathcal {S}}{\arg \max } \left( \sum _{t=1}^{K}\log (p(\varvec{x}_t|\lambda _k))\right) \end{aligned}$$
(13)

Here, \(\hat{S}\) is the identified speaker, and the \(k\mathrm{{th}}\) speaker's log-likelihood (LL) score is given by \(\sum _{t=1}^{K}\log (p(\varvec{x}_t|\lambda _k))\) [8]. The identified speaker \(\hat{S}\) has the maximum log-likelihood (MLL) score.
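
A sketch of this decision rule, reusing the diagonal-covariance GMM representation from the previous section (each speaker model is assumed to be a tuple of weights, means, and per-dimension variances):

```python
import numpy as np
from scipy.stats import multivariate_normal

def identify_speaker(X, models):
    """Return the index of the speaker model with the maximum log-likelihood score, Eq. (13)."""
    scores = []
    for w, mu, sigma in models:                      # one (w, mu, sigma) tuple per speaker
        # p(x_t | lambda_k) for every test vector x_t, summed over the M mixture components.
        lik = sum(w[i] * multivariate_normal.pdf(X, mean=mu[i], cov=np.diag(sigma[i]))
                  for i in range(len(w)))
        scores.append(np.sum(np.log(lik)))           # log-likelihood (LL) score
    return int(np.argmax(scores))                    # MLL score -> identified speaker
```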

Table 1 Recognition accuracy for the databases IITG-MV SR, Hyke-2011, and ELSDSR for 5 EM iterations and 512 VQ clusters

5 Experimental Results and Discussion

We conducted the SR experiment extensively over three databases, namely the IITG Multi-Variability Speaker Recognition Database (IITG-MV SR), ELSDSR, and Hyke-2011. The IITG-MV SR database contains speech recorded in a noisy environment with five recording devices, namely a digital recorder (D01), headset (H01), Tablet PC (T01), Nokia 5130c mobile (M01), and Sony Ericsson W350i mobile (M02). However, ELSDSR and Hyke-2011 contain clean speech; i.e., the noise level is very low and the speech is recorded with a microphone. The sampling frequency is 16 kHz for D01, H01, and T01, and 8 kHz for M01, M02, ELSDSR, and Hyke-2011. For the 16 kHz speech signals, we chose a frame size of about 25 ms and an overlap of about 17 ms, i.e., a frame shift of \((25-17)=8\) ms; for the 8 kHz speech signals, a 50 ms frame size and about 34 ms overlap, i.e., a frame shift of \((50-34)=16\) ms. The pre-emphasis factor \(\alpha \) is set to 0.97. A 512-point FFT algorithm is used to compute the FFT. For the mel-scale frequency conversion, the minimum and maximum linear frequencies are \(f_{min} = 0\) or 300 Hz and \(f_{max} = 5000\) Hz. The frequency \(f_{min}\) has a significant effect on the accuracy of ASR. The number of triangular filters in the filter bank is \(n_B = 26\), which produces 26 MFC coefficients; among them, the first 13 are chosen to create an MFCC feature vector of dimension \(D=13\). The accuracy rates for the mentioned databases are shown in Table 1. In VQ, we consider 512 clusters to reduce the large number of vectors, upon which the GMM is built using 5 EM iterations.

It is clearly seen that the accuracy rate is lower for noisy speech than for clean speech. This is because the noise distorts the frequency spectrum of the signal considerably, and the vectors in the feature space are shifted and distorted from the original vectors. All the databases show the highest accuracy with a vector dimension equal to 13 and the number of Gaussian components equal to 32, and accuracy degrades beyond these limits. Another observation is that the bandwidth of the filters in the filter bank on the linear scale (Hz) also influences the accuracy rate. SR under mismatched and reverberant conditions is a more challenging task because, in these cases, the performance of the SR system degrades drastically. Other important issues for SR are language dependency and device mismatch. It has been seen that the accuracy rate degrades if there is a mismatch of language between training and testing data. Especially with a device mismatch between training and testing data, the accuracy rate degrades drastically. Though GMM shows a satisfactory accuracy rate, HMM is more robust than GMM and provides better results under environmental mismatch. Hybrid HMM/GMM-based SR in a noisy environment performs better than GMM-only SR. In a noisy environment, the accuracy of GMM-based SR degrades more rapidly than that of HMM/GMM-based SR.

6 Conclusion

SR has a very close relation with speech recognition. Emotion extraction from speech data using corpus-based features and sentiment orientation techniques could be thought of as an extension of the SR experiment [2]. In this paper, we describe an SR experiment and analyze the feature extraction and modeling/classification steps. It is very important to mention that the number of GMM components and the mel filter bank's minimum frequency \(f_{min}\) have a significant influence on the recognition accuracy. Since there are substantial differences in accuracy between clean speech data and noisy speech data, we can infer that noise shifts the data from its true orientation. Various normalization techniques in the feature domain and the modeling/classification domain could be applied to combat the unwanted shift of data in the feature space. Indeed, various filtering techniques are also available to reduce the effect of noise before transforming the data into the feature space.