1 Introduction

Enhancing automatic speaker recognition (ASR) systems has become an attractive challenge due to the growing need for secure access control and forensic investigation. The main objective of ASR is to determine the identity of a speaker, regardless of what the speaker is saying (Dharanipragada et al. 2007; Huang et al. 2016; Shuling 2009; Gandhiraj and Sathidevi 2007). ASR comprises two tasks: verification and identification. In automatic speaker verification (ASV), a speaker's utterance is matched against his/her enrolled pattern in the database and classified as either client or impostor (Furui 1981). ASV systems are widely used in security applications such as telephone transactions. In automatic speaker identification (ASI), the utterance of an unknown speaker is compared with the patterns of all enrolled speakers to determine the best match (Xu and Yang 2016; Li and Gao 2016; Hossain et al. 2007). ASI can be divided into closed-set and open-set tasks. In closed-set ASI, the test speaker is assumed to belong to a finite set of enrolled speakers. Open-set ASI additionally allows the decision that the test speaker does not belong to any of the enrolled speakers.

ASR comprises two stages: feature extraction and classification. The feature extraction stage can be viewed as a data-reduction procedure that captures the main speaker characteristics with as little data as possible. Several schemes exist for extracting speech features using various coefficient types, such as linear prediction coefficients (LPCs) (Mellahi and Hamdi 2015), linear prediction cepstral coefficients (LPCCs) (Polur and Miller 2005), Mel-frequency cepstral coefficients (MFCCs) (Selva Nidhyananthan et al. 2016), and multiple kernel weighted MFCCs (Subba Ramaiah and Rajeswara Rao 2016).

The classification phase comprises two stages: speaker modeling and speaker matching. In the speaker modeling stage, the speaker is registered in the system using features extracted from training data. When a data sample of an unknown speaker is received, feature matching schemes map the features of the input speech sample to the pattern of an enrolled speaker. The combination of speaker modeling and matching schemes is known as a classifier. Classification schemes employed in ASI systems include Gaussian mixture models (GMMs) (Ding and Yen 2015; Qian et al. 2008), vector quantization (VQ), hidden Markov models (HMMs) (Polur and Miller 2005), ANNs (Galushkin 2007; Hayati and Shirvany 2007) and SVMs (Naeeni et al. 2010; Boujelbene et al. 2010; Zergat and Amrouche 2014).

In this paper, an efficient noise-robust MKMFCC–SVM method for ASI is presented. The proposed method uses MKMFCC feature parameterization, in which multiple kernels such as the exponential and tangential weight the MFCCs, and an SVM for classification. Cepstral features combining the Mel filter bank with tangential/exponential functions are used in the cepstral coefficient parameterization. The multiple kernel weighting functions account for both low- and high-energy frames of the analyzed audio signal, so that no frames are dropped. The remainder of the paper is organized as follows. Section 2 explores feature extraction using the MKMFCC. Then, the SVM is described in Sect. 3. Section 4 details the proposed MKMFCC–SVM ASI. Section 5 presents the data sets used and the test results. Finally, Sect. 6 concludes the paper.

2 MKMFCC feature extraction

The MKMFCC employs two distinct kernel functions to weight the MFCC coefficients (Ramaiah and Rao 2016). Kernel weighting offers a natural way of mixing and integrating various data types. Moreover, the flexible combination of suitable kernel design with modern kernel schemes has demonstrated the strength of this class of methods, whose statistical and computational characteristics are well understood across several machine learning methods. The MKMFCC is illustrated in Fig. 1 and its steps are detailed as follows.

Fig. 1
figure 1

Block diagram of MKMFCC (Ramaiah and Rao 2016)

2.1 Pre-emphasis

The pre-emphasis stage flattens the speech spectrum, as it boosts the high-frequency band and attenuates the low-frequency band. It can be estimated as

$$B(m)=A(m) - C \cdot A(m - 1)$$
(1)

where \(C\) is a constant, \(A(m)\) is the input signal, \(B(m)\) is the output signal, and \(m\) is the speech sample index.
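As an illustration, a minimal NumPy sketch of Eq. (1) is given below; the value \(C = 0.97\) is a conventional choice assumed here, since the paper does not fix the constant.

```python
import numpy as np

def pre_emphasis(a, c=0.97):
    """Eq. (1): B(m) = A(m) - C * A(m - 1).

    c = 0.97 is a conventional value, assumed for illustration;
    the paper does not specify the constant.
    """
    a = np.asarray(a, dtype=float)
    b = np.empty_like(a)
    b[0] = a[0]                 # first sample has no predecessor
    b[1:] = a[1:] - c * a[:-1]  # subtract the scaled previous sample
    return b
```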

2.2 Framing

The speech signal is split into \(L\) short blocks (frames) of \(M\) samples each; a typical frame length is 20–40 ms. Neighbouring frames are shifted by a hop of \(R\) samples, where \(R<M\), so that consecutive frames overlap.

2.3 Hamming windowing

In the Hamming windowing stage, each frame is tapered so that its edges decay smoothly, which reduces spectral leakage between neighbouring frequencies. The Hamming window can be represented as \(W(m),\;0 \le m \le M - 1\). The windowed speech signal is computed as

$$B(m)=A(m) \cdot W(m)$$
(2)

where \(W(m)\) is the Hamming window, computed as

$$W(m)=0.54 - 0.46\cos \left( {\frac{{2\pi m}}{{M - 1}}} \right);\,0 \le m \le M - 1.$$
(3)
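The framing and windowing steps of Sects. 2.2 and 2.3 can be sketched together as follows; the 25 ms frame and 10 ms hop mentioned in the docstring are illustrative choices within the 20–40 ms range stated above.

```python
import numpy as np

def frame_and_window(signal, frame_len, hop):
    """Split the signal into overlapping frames of M = frame_len samples
    shifted by R = hop samples (Sect. 2.2), then taper each frame with
    the Hamming window of Eq. (3) (Sect. 2.3).

    For 8 kHz speech, frame_len = 200 and hop = 80 correspond to a
    25 ms frame with a 10 ms hop (illustrative values).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    m = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * m / (frame_len - 1))  # Eq. (3)
    return frames * window  # the window broadcasts over all frames
```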

2.4 Fast Fourier transform (FFT)

In the FFT stage, each windowed frame is transformed to the frequency domain. The power spectrum of the \(l\)-th block can be computed as

$${P_l}(k)=\frac{1}{M}{\left| {{A_l}(k)} \right|^2}$$
(4)

where the discrete Fourier transform (DFT) of the corresponding block is estimated as

$${A_l}(k)=\sum\limits_{{m=1}}^{M} {B(m) \cdot {e^{ - j2\pi km/M}};{\kern 1pt} 1 \le k \le K}$$
(5)

where \(K\) is the DFT length and \(B(m)\) covers the \(M\)-sample analysis window.
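A short sketch of Eqs. (4)–(5) using NumPy's FFT follows; the 512-point FFT length is an assumption, not a value from the paper.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Eqs. (4)-(5): DFT of each windowed frame, then the periodogram
    estimate P(k) = |A(k)|^2 / M.

    rfft zero-pads each frame to n_fft points and keeps only the
    non-negative frequency bins; n_fft = 512 is assumed here.
    """
    a = np.fft.rfft(frames, n=n_fft, axis=1)     # Eq. (5)
    return (np.abs(a) ** 2) / frames.shape[1]    # Eq. (4)
```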

2.5 Mel filter bank processing

The signal power spectrum is passed through a bank of triangular filters; each filter computes a weighted sum of the spectral components it spans, with the filter edges placed on the Mel scale. Figure 2 illustrates the Mel scale filter bank.

Fig. 2
figure 2

Mel scale filter bank

The highest and lowest frequencies \(F_H\) and \(F_L\) of the periodogram estimates must be specified. The filter locations are equally spaced in Mel frequency:

$$MEL(f)=1125 \times \ln \left( {1+\frac{f}{{700}}} \right)$$
(6)

The Mel filter bank edge frequencies are mapped to FFT bin indices as

$$G(l)=\frac{{({N_{{\text{FFT}}}}+1) \times h(l)}}{{{f_s}}}$$
(7)

where \(N_{\text{FFT}}\) is the FFT length, \(h(l)\) are the filter edge frequencies in Hz, and \(f_s\) is the sample rate. The filter bank can be computed as

$$M_{f} (k) = \left\{ {\begin{array}{*{20}l} 0 \hfill & {k < G(f - 1)} \hfill \\ {\frac{{k - G(f - 1)}}{{G(f) - G(f - 1)}}} \hfill & {G(f - 1) \le k \le G(f)} \hfill \\ {\frac{{G(f + 1) - k}}{{G(f + 1) - G(f)}}} \hfill & {G(f) \le k \le G(f + 1)} \hfill \\ 0 \hfill & {k> G(f + 1)} \hfill \\ \end{array} } \right.$$
(8)

where \(f=1,\dots,F\) indexes the \(F\) Mel filters.
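The construction of Eqs. (6)–(8) may be sketched as follows; the filter count and FFT length in the docstring are typical values assumed for illustration.

```python
import numpy as np

def mel_filter_bank(n_filters, n_fft, sample_rate, f_low=0.0, f_high=None):
    """Build the triangular Mel filters of Eq. (8).

    The F + 2 edge points are spaced uniformly on the Mel scale of
    Eq. (6) between f_low and f_high (F_L and F_H above) and mapped
    to FFT bin indices as in Eq. (7). A typical setup is 26 filters
    with n_fft = 512.
    """
    if f_high is None:
        f_high = sample_rate / 2.0                       # Nyquist limit
    mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)     # Eq. (6)
    mel_inv = lambda m: 700.0 * (np.exp(m / 1125.0) - 1.0)
    mel_pts = np.linspace(mel(f_low), mel(f_high), n_filters + 2)
    g = np.floor((n_fft + 1) * mel_inv(mel_pts) / sample_rate).astype(int)  # Eq. (7)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for f in range(1, n_filters + 1):
        for k in range(g[f - 1], g[f]):                  # rising edge of Eq. (8)
            bank[f - 1, k] = (k - g[f - 1]) / (g[f] - g[f - 1])
        for k in range(g[f], g[f + 1]):                  # falling edge of Eq. (8)
            bank[f - 1, k] = (g[f + 1] - k) / (g[f + 1] - g[f])
    return bank
```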

2.6 Filter bank energy

Each filter in the bank is applied to the power spectrum, and the weighted spectral components are summed to produce one energy coefficient per filter. The filter bank energy can be computed as

$$E(l)=\sum\limits_{{m=0}}^{{M/2}} {\log \left| {A(m)} \right|B(m)\left( {k\frac{{2\pi }}{M}} \right) \times W{T_m}}$$
(9)

where \(W{T_m}\) is the multiple kernel weighting function, computed as

$$W{T_m}=W{T_{m1}}+W{T_{m2}}$$
(10)
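Here \(W{T_{m1}}\) and \(W{T_{m2}}\) correspond to the two kernels (exponential and tangential) named in Sect. 1. Their exact functional forms are given in Ramaiah and Rao (2016) and are not reproduced in this paper, so the sketch below uses illustrative placeholder kernels only.

```python
import numpy as np

def weighted_log_energies(power_spec, bank, wt):
    """Eqs. (9)-(10): log filter bank energies scaled by the combined
    multiple kernel weight WT = WT_m1 + WT_m2."""
    energies = power_spec @ bank.T          # (n_frames, n_filters)
    return np.log(energies + 1e-10) * wt    # small offset avoids log(0)

# Placeholder kernel weights, one per filter; NOT the forms of
# Ramaiah and Rao (2016), which this paper does not reproduce.
n_filters = 26
idx = np.arange(n_filters) / n_filters
wt = np.exp(-idx) + np.tanh(idx)            # WT = WT_m1 + WT_m2, Eq. (10)
```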

2.7 Discrete cosine transform (DCT)

The DCT is performed to transform the log Mel spectrum estimates back to the time (cepstral) domain.

$$E(l)=\overline {E} (k)$$
(11)

where

$$\overline {E} (k)=\left\{ {\begin{array}{*{20}{l}} {E(l),}&{k={k_l}} \\ {0,}&{otherwise} \end{array}} \right.$$
(12)

The cepstral coefficients can be computed as

$$W{C_s}(m)=\frac{1}{{M^{\prime}}}\sum\limits_{{k=0}}^{{M^{\prime} - 1}} {\bar {E}(k)\,{e^{jk(2\pi /M^{\prime})m}}}$$
(13)

where \(W{C_s}(m)\) denotes the multiple kernel weighted Mel frequency cepstral coefficients (MKMFCCs).
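Although Eq. (13) is written as an inverse-DFT-style expansion, the step is named a DCT in this subsection; the sketch below therefore uses the conventional DCT-II, with the usual choice of keeping 13 coefficients assumed here.

```python
import numpy as np
from scipy.fft import dct

def cepstral_coefficients(log_energies, n_ceps=13):
    """Sect. 2.7: decorrelate the weighted log Mel energies with a
    DCT-II and keep the first n_ceps coefficients as the MKMFCCs.

    n_ceps = 13 is a conventional choice, assumed for illustration.
    """
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```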

2.8 Delta energy and spectrum

The delta (time-derivative) energy features are appended to the acoustic feature vector. This addition improves recognition accuracy and strengthens robustness to noise and echo.
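This subsection gives no formula, so the sketch below uses the standard regression form of delta features over a ±2-frame window, which is an assumption.

```python
import numpy as np

def add_deltas(features, width=2):
    """Append first-order (delta) dynamic features to the static ones,
    using the standard regression formula over +/- width frames."""
    t = len(features)
    padded = np.pad(features, ((width, width), (0, 0)), mode='edge')
    num = sum(n * (padded[width + n:t + width + n]
                   - padded[width - n:t + width - n])
              for n in range(1, width + 1))
    deltas = num / (2.0 * sum(n * n for n in range(1, width + 1)))
    return np.hstack([features, deltas])  # static + delta vector
```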

2.9 Cepstral normalization

In the normalization procedure, the mean of each coefficient is subtracted and the result is divided by its variance.
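A one-function sketch of this normalization follows; note that many implementations divide by the standard deviation rather than the variance, but the text above states the variance, so that is what is used here.

```python
import numpy as np

def cepstral_normalize(features, eps=1e-10):
    """Sect. 2.9: subtract the per-coefficient mean and divide by the
    per-coefficient variance, as stated in the text."""
    mu = features.mean(axis=0)
    var = features.var(axis=0)
    return (features - mu) / (var + eps)
```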

3 Classification using support vector machine

The classification stage in ASI systems is a feature matching procedure between the new speaker's features and the features saved in the database. The SVM is based on statistical learning theory (Boujelbene et al. 2010). It seeks the boundary between feature classes that separates them as precisely as possible. The features are separated linearly by a hyperplane, which can be considered a linear classifier. The SVM transforms the input features into a high-dimensional feature space (Zergat and Amrouche 2014; Campbell et al. 2007).

3.1 Geometric margin

We need to estimate the distance from the two patterns to the separator. This distance is the margin between the two patterns, i.e., the minimum distance between a pattern and the hyperplane, marked with a dashed line in Fig. 3.

Fig. 3
figure 3

Separating different patterns with a hyperplane

To formulate this distance \(r\), let \({\text{X}}^{\prime} - {\text{X}}\) define the dotted line that is perpendicular to the decision boundary, i.e., parallel to the normal vector \({\text{w}}\) of the hyperplane. The unit vector in the direction of the normal to the hyperplane may be estimated as:

$$\bar {\text{v}}=\frac{{\text{w}}}{{\left| {\text{w}} \right|}}$$
(14)

so that the displacement vector along the normal may be written as:

$$\bar{\text{r}}={\text{r}}\star\bar {\text{v}}$$
(15)

Since,

$${\text{r}}={\text{X}}^{\prime} - {\text{X}}$$
(16)

So,

$${\text{X}}^{\prime}={\text{X}} - {\text{r}}\frac{{\text{w}}}{{\left| {\text{w}} \right|}}$$
(17)

The functional margin between the hyperplane and the closest patterns of the two data classes may be estimated as:

$${\text{z(}}{{\text{X}}_{\text{i}}}{\text{)}}={{\text{y}}_{\text{i}}}{\text{(}}{{\text{w}}^{\text{T}}}{{\text{X}}_{\text{i}}}+{\text{b)}}$$
(18)

where \({\text{w}}\) is the normal vector of the decision hyperplane, \({{\text{X}}_{\text{i}}}\) is the \(i\)-th data point, and \({{\text{y}}_{\text{i}}}\) is its class label (+1 or −1).

The margin distance may be estimated as:

$$\rho=\frac{2}{{\left\| w \right\|}}$$
(19)

3.2 Separation technique of SVM

The main aim of the SVM is to determine the optimal separating hyperplane. Thus, finding the optimal separating hyperplane may be cast as the optimization problem:

$$\begin{array}{*{20}{l}} {{\text{Maximize}}:\rho} \\ {{\text{Subject to}}:{\text{z(}}{{\text{X}}_{\text{i}}}{\text{)}} \ge {\text{1}}} \end{array}$$
(20)

Using the Lagrange multiplier scheme, the problem of Eq. (20) can be converted into minimizing the objective function, restated as:

$${\text{L(w,b,}}\alpha {\text{)}}=\frac{{\text{1}}}{{\text{2}}}{{\text{w}}^{\text{T}}}{\text{w}} - \sum\limits_{{{\text{i}}={\text{1}}}}^{{\text{n}}} {{\alpha_{\text{i}}}} {\text{(}}{{\text{y}}_{\text{i}}}{\text{(}}{{\text{w}}^{\text{T}}}{{\text{x}}_{\text{i}}}+{\text{b)}} - {\text{1)}}$$
(21)

where the constant \(\alpha_{\text{i}}\) is a Lagrange multiplier. Differentiating \({\text{L(w,b,}}\alpha{\text{)}}\) with respect to w and b:

$$\frac{{\partial {\text{L(w,b,}}\alpha {\text{)}}}}{{\partial {\text{w}}}}={\text{w}} - \sum\limits_{{{\text{i}}={\text{1}}}}^{{\text{n}}} {{\alpha _{\text{i}}}} {{\text{y}}_{\text{i}}}{{\text{x}}_{\text{i}}}=0$$
(22)
$$\frac{{\partial {\text{L(w,b,}}\alpha {\text{)}}}}{{\partial {\text{b}}}}=\sum\limits_{{{\text{i}}={\text{1}}}}^{{\text{n}}} {{\alpha _{\text{i}}}} {{\text{y}}_{\text{i}}}=0$$
(23)

Substituting from Eqs. (22) and (23) into Eq. (21):

$${\text{L(w,b,}}\alpha {\text{)}}=\sum\limits_{{{\text{i}}={\text{1}}}}^{{\text{n}}} {{\alpha _{\text{i}}}} - \frac{{\text{1}}}{{\text{2}}}\sum\limits_{{{\text{i,j}}={\text{1}}}}^{{\text{n}}} {{{\text{y}}_{\text{i}}}{{\text{y}}_{\text{j}}}{\alpha _{\text{i}}}{\alpha _{\text{j}}}} {{\text{x}}_{\text{i}}}{{\text{x}}_{\text{j}}}$$
(24)

The optimization of Eq. (24) can be treated as a convex quadratic programming problem under the conditions:

$$\sum\limits_{{{\text{i}} = {\text{1}}}}^{{\text{n}}} {{\text{y}}_{{\text{i}}} \alpha _{{\text{i}}} } = 0{\kern 1pt} \,{\text{and}}\,\alpha _{{\text{i}}} \ge 0{\kern 1pt}$$
(25)

Replacing the inner product with a kernel function \({\text{k(}}\cdot{\text{,}}\cdot{\text{)}}\), Eq. (24) becomes:

$${\text{L(w,b,}}\alpha {\text{)}}=\sum\limits_{{{\text{i}}={\text{1}}}}^{{\text{n}}} {{\alpha _{\text{i}}}} - \frac{{\text{1}}}{{\text{2}}}\sum\limits_{{{\text{i,j}}={\text{1}}}}^{{\text{n}}} {{{\text{y}}_{\text{i}}}{{\text{y}}_{\text{j}}}{\alpha _{\text{i}}}{\alpha _{\text{j}}}} {\text{k(}}{{\text{X}}_{\text{i}}}{\text{,}}{{\text{X}}_{\text{j}}})$$
(26)

The decision function of the hyperplane may be estimated as:

$${\text{z(X)}}={\text{sign}}\left[ {\sum\limits_{{\text{i}}}^{{\text{n}}} {{{\text{y}}_{\text{i}}}{\alpha _{\text{i}}}} {\text{k(}}{{\text{X}}_{\text{i}}}{\text{,}}{{\text{X}}_{\text{j}}}{\text{)}}+{\text{b}}} \right]$$
(27)
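As a usage illustration, the sketch below trains the classifier of Eqs. (26)–(27) with scikit-learn, whose SVC is built on LIBSVM (the library named in Sect. 4.2); the RBF kernel and the random stand-in features are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in data for illustration: one feature vector per row and one
# speaker label per vector (the real system would use MKMFCC vectors).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 13))
y_train = rng.integers(0, 4, size=100)

clf = SVC(kernel='rbf', C=1.0)       # kernel k(Xi, Xj) of Eq. (26)
clf.fit(X_train, y_train)            # solves the dual problem via SMO
speaker = clf.predict(X_train[:1])   # decision rule of Eq. (27)
```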

4 The proposed MKMFCC–SVM ASI system

This section gives a full description of the proposed MKMFCC–SVM ASI system, which uses MKMFCC feature extraction and an SVM classification algorithm. Initially, the audio signals of multiple speakers are taken as input to the ASR system. Feature extraction is performed, in which the feature vector sequences representing the patterns of the speech signal are extracted. The MFCC features are extracted, and the multiple kernel weighting function is applied to the Mel filter bank energies to generate the MKMFCC coefficients. After the feature extraction phase, speech classification is carried out with the SVM.

4.1 Feature extraction phase

The feature extraction phase captures speaker-related properties for effective recognition. The MKMFCCs are adopted in the proposed ASI since they preserve and enhance formant information from the spectral envelope. The MFCC spectral features differ from other acoustic features in their time-frequency analysis and frequency smoothing schemes.

4.2 SVM implementation for feature matching phase

This work utilizes sequential minimal optimization (SMO) (You et al. 2010). SMO is selected over other optimization schemes because of its reliability on large datasets, and because the LIBSVM library used here for the SVM implementation is based on SMO and can be linked to the Matlab platform. Computing the kernel matrix used in SVMs normally requires considerable time, and this time grows quickly with the number of training samples, which produces a larger kernel matrix. To bypass this difficulty, SMO divides the problem into a series of smaller quadratic programming problems. The SMO procedure may be summarized as follows (a sketch of the core update is given after these steps):

  1. Step 1

    Choose an arbitrary first Lagrange multiplier \(\alpha_1\).

  2. Step 2

    Choose a second Lagrange multiplier \(\alpha_2\).

  3. Step 3

    Update the second Lagrange multiplier using Eq. (28):

    $$\alpha _{2}^{{new}}={\alpha _2}+\frac{{{y_2}({E_1} - {E_2})}}{k}$$
    (28)
  4. Step 4

    Assign the updated value, i.e. \(\alpha _{{\text{2}}}^{{{\text{new, assigned}}}} \leftarrow \alpha _{{\text{2}}}^{{{\text{new}}}}.\)

  5. Step 5

    If the Lagrange multiplier did not change, go back to Step 1.

  6. Step 6

    Update the first Lagrange multiplier.

  7. Step 7

    If all Lagrange multipliers satisfy the condition of Step 5, terminate; otherwise, go to Step 1.
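A compact sketch of Steps 3, 4 and 6 follows; the divisor \(k\) of Eq. (28) is taken to be \(\eta = K_{11} + K_{22} - 2K_{12}\) and the clipping bounds follow standard SMO, both of which are assumptions beyond what the steps above state.

```python
def smo_pair_update(alpha1, alpha2, y1, y2, e1, e2, k11, k22, k12, C):
    """One pass over Steps 3-4 and 6 of the SMO loop above (a sketch,
    not the LIBSVM implementation used in the paper).

    e1, e2 are the prediction errors of the two chosen samples; the
    divisor k of Eq. (28) is eta = K11 + K22 - 2*K12 in standard SMO.
    """
    eta = k11 + k22 - 2.0 * k12
    alpha2_new = alpha2 + y2 * (e1 - e2) / eta             # Step 3, Eq. (28)
    if y1 == y2:                                           # feasible box for alpha2
        lo, hi = max(0.0, alpha1 + alpha2 - C), min(C, alpha1 + alpha2)
    else:
        lo, hi = max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    alpha2_new = min(max(alpha2_new, lo), hi)              # Step 4: assign clipped value
    alpha1_new = alpha1 + y1 * y2 * (alpha2 - alpha2_new)  # Step 6: update first multiplier
    return alpha1_new, alpha2_new
```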

5 Experimental tests

In the presence of telephone-like and noise-like degradations, speaker recognition is not an easy task. Noise-like degradation masks the speech signal, so the extracted features become inaccurate and unreliable for recognition. Telephone-like degradation acts as a low-pass filter on the speech signal and may remove many speaker features. In this section, four speaker recognition tests are performed with different degradation types: AWGN, colored noise, telephone-like degradation with AWGN, and telephone-like degradation with colored noise. The telephone-like degradation is simulated by applying a low-bandwidth low-pass filter to the speech signals.
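The degradations can be simulated as sketched below for the AWGN case; the fourth-order Butterworth filter and 3400 Hz cutoff are illustrative assumptions, since the paper only states that a low-bandwidth low-pass filter is used.

```python
import numpy as np
from scipy.signal import butter, lfilter

def degrade(speech, snr_db, sample_rate, telephone=False, cutoff_hz=3400.0):
    """Add AWGN at a target SNR, optionally after a telephone-like
    low-pass filter (Sect. 5).

    The Butterworth order and cutoff are illustrative; colored noise
    would replace the white noise with a spectrally shaped sequence.
    """
    speech = np.asarray(speech, dtype=float)
    if telephone:
        b, a = butter(4, cutoff_hz / (sample_rate / 2.0), btype='low')
        speech = lfilter(b, a, speech)
    p_signal = np.mean(speech ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))   # from the SNR definition
    noise = np.random.default_rng(0).normal(0.0, np.sqrt(p_noise), speech.shape)
    return speech + noise
```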

In the ASI training stage, a database of 80 speakers is used. Each speaker repeats a given Arabic sentence 15 times, so 1200 speech samples are used to extract MKMFCCs for the proposed MKMFCC–SVM ASI, and MFCCs and polynomial coefficients for the MFCC–SVM ASI, constituting the database feature vectors. In the testing stage, each speaker is asked to repeat the sentence and the audio signal is subjected to degradation. Features comparable to those extracted during training are then computed from the degraded speech signals and used in the classification stage. Five feature extraction schemes are employed in this paper.

In the first scheme, the MKMFCC and MFCC features are extracted directly from the speech signals. In the second scheme, features are extracted from the DCT of the speech signals. In the third scheme, features are extracted from the concatenation of the original speech signal and its DCT in one feature vector. In the fourth scheme, features are extracted from the DWT of the speech signals. In the fifth scheme, features are extracted from the concatenation of the original speech signal and its DWT in one feature vector. Comparisons are performed to assess the performance of the MKMFCC–SVM ASI against the MFCC–SVM ASI in terms of identification rate using the above five feature extraction schemes in the four degradation situations; the test results are shown in Tables 1, 2, 3 and 4. First, the results in Tables 1, 2, 3 and 4 confirm the superiority of the proposed MKMFCC–SVM ASI over the MFCC–SVM ASI with all five feature extraction schemes in all four degradation cases. It is also clear from Tables 1, 2, 3 and 4 that, for both the proposed MKMFCC–SVM ASI and the MFCC–SVM ASI, the features extracted from the speech plus DWT signals and the speech plus DCT signals give the highest recognition rates in all four degradation cases. In the AWGN case, the features extracted from the speech plus DWT signals give the best recognition rates at all SNRs. For the colored noise case shown in Table 2, the features extracted from the speech plus DCT signals achieve the best recognition rates at all SNRs. In the telephone-like degradation with AWGN and colored noise cases, shown in Tables 3 and 4 respectively, performance suffers because the low-pass filter eliminates many speech features. The features extracted from the speech plus DWT signals achieve the best recognition rates for the telephone-like degradation with AWGN at all SNRs, whereas the features extracted from the speech plus DCT signals achieve the best recognition rates for the telephone-like degradation with colored noise at all SNRs.

Table 1 Identification rate of MKMFCC–SVM ASI and MFCC–SVM ASI with different transforms in the presence of AWGN with different SNR
Table 2 Identification rate of MKMFCC–SVM ASI and MFCC–SVM ASI with different transforms in the presence of colored noise with different SNR
Table 3 Identification rate of MKMFCC–SVM ASI and MFCC–SVM ASI with different transforms in the presence of telephone like degradation and AWGN with different SNR
Table 4 Identification rate of MKMFCC–SVM ASI and MFCC–SVM ASI with different transforms in the presence of telephone like degradation and colored noise with different SNR

6 Conclusion

This paper introduced an efficient noise-robust ASI method using MKMFCC and SVM. A comparative study was conducted between the proposed MKMFCC–SVM ASI and the MFCC–SVM ASI in terms of identification rate, using five feature extraction methods in the presence of four degradation cases. Experimental tests prove the effectiveness of the proposed MKMFCC–SVM ASI in extracting features from telephone-like and noise-degraded audio signals.