Abstract
This paper proposes robust noise automatic speaker identification (ASI) scheme named MKMFCC–SVM. It based on the Multiple Kernel Weighted Mel Frequency Cepstral Coefficient (MKMFCC) and support vector machine (SVM). Firstly, the MKMFCC is employed for extracting features from degraded audio and it uses multiple kernels such as the exponential and tangential and for MFCC’s weighting. Secondly, the extracted features are then categorized with the SVM classification technique. A comparative study is performed between the proposed MKMFCC–SVM and the MFCC–SVM ASI schemes using the MKMFCC and MFCCs with five schemes for extracting features from telephone-analogous and noisy-like degraded audio signals. Experimental tests prove that the proposed MKMFCC–SVM ASI scheme yields higher identification rate in noise presence or degradation.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
Enhancing automatic speaker recognition (ASR) systems has become an attractive challenge due to the growing needs for secure access or criminalistics inquiries. The main objective of ASR is to determine and recognize speaker personality, regardless of what speaker is clarifying (Dharanipragada et al. 2007; Huang et al. 2016; Shuling 2009; Gandhiraj and Sathidevi 2007). The ASR includes both verifying and identifying phases. In automatic speaker verification (ASV), speaker’s speech is matched according to his/her pattern within the database and categorized either customer or imposter (Furui 1981). ASV systems can be usually utilized in many security fields such as telephone transactions. With automatic speaker identification (ASI), speech talking of anonymous speaker is tested and matched with patterns of all recognized speakers to determine the top matched speaker (Xu and Yang 2016; Li and Gao 2016; Hossain et al. 2007). ASI can be divided into either closed or open sets. Closed set ASI include that speaker under test was previously recognized to be one from finite set of speakers. Open set ASI involves the preference of defining declaring that test speaker may not belong to any one from recognized speakers.
ASR includes two phases stages, named, feature extracting and classification phases. The feature extracting phase may be thought as data reducing procedure with the potential of capturing main speaker features with reduced data as possible. There exist several schemes for speech features extraction using various coefficients types like linear prediction coefficients (LPCs) (Mellahi and Hamdi 2015), linear prediction cepstral coefficients (LPCCs) (Polur and Miller 2005), Mel-frequency cepstral coefficients (MFCCs) (Selva Nidhyananthan et al. 2016), and multiple kernel weighted MFCCs (Subba Ramaiah and Rajeswara Rao 2016).
Classification phase may be thought as a procedure that includes two stages named as; speaker modeling/ matching stages. In speaker modeling stage, the speaker is registered in the system with extracted features resulted from training data. If data sample of anonymous speaker is received, feature matching schemes can be utilized for mapping features of input speech data sample to a pattern that may correspond to a recognized speaker. The combination of speaker modeling/matching schemes may be known as classifier. Classification schemes employed in ASI systems may cover Gaussian mixture models (GMMs) (Ding and Yen 2015; Qian et al. 2008), vector quantization (VQ), hidden Markov models (HMMs) (Polur and Miller 2005), ANNs (Galushkin 2007; Hayati and shirvany 2007) and SVM (Naeeni et al. 2010; Boujelbene et al. 2010; Zergat and Amrouche 2014).
In this paper, an efficient robust noise MKMFCC–SVM method for ASI is presented. The proposed method utilizes the MKMFCC as feature parameterization with multiple kernels such as the exponential and tangential to weight the MFCC’s and the SVM for classification. The cepstral features combining the Mel filter bank tangential/exponential functions were utilized in cepstral coefficient parameterization. Multiple kernel weighted functions may help in considering low/ high energy frames of recognized audio signal, such that no frames dropped out. The paper remainder may be arranged as follow. Section 2 explores feature extraction using the MKMFCC. Then, the SVM is described in Sect. 3. Section 4 detailed the proposed MKMFCC–SVM ASI. Section 5 presents the utilized data sets and test results. And finally, Sect. 6 concludes the paper.
2 MKMFCC feature extraction
The MKMFCC employs two distinct kernel functions for the MFCC coefficients weighting (Ramaiah and Rao 2016). The kernel weighting offers a natural way for mixing and integrating various data types. Also, flexible mixture of suited kernel design and modern kernel schemes proved the superiority of such class of methods whose statistical and computational characteristics are well known by several machine learning methods. The MKMFCC is illustrated in Fig. 1 and detailed steps are given as follows.
2.1 Pre-emphasis
The pre-emphasis stage is employed for flattening speech spectrum, as it increases the high frequency band amplitude and reduce the low frequency band. It can be estimated by,
where C, A, B, m are constant value, input signal, output signal, and speech signal samples.
2.2 Framing
The speech signal sample is split into short L blocks of M samples. The speech block length is ranged as 20–40 ms. The neighbouring blocks are unattached by R factor; where \(R<M\).
2.3 Hamming windowing
During hamming window stage, all close frequencies in speech streams are integrated together. The hamming windowing can be represented as \(W(m):1 \le m \le M - 1.\) The speech signal after employing windowing can be computed as,
where \(W(m)\) is the hamming window and it is computed as,
2.4 Fast fourier transform (FFT)
During FFT stage, the speech signal are FFT transformed. The block power spectrum can be computed as,
The Discrete Fourier Transform (DFT) of correspondent block can be estimated as,
where k is the DFT length and \(B(m)\) covering M sample long analysis window.
2.5 Mel filter bank processing
Signal frequencies will be filtered using triangular filter for estimating filter spectral components weighted sum and Mel scale triangular filter output border. Figure 2 illustrates the Mel scale filter bank.
The high and low F H /F L frequency spectral components of periodogram estimations must be considered. The filter locations have equivalent space in Mel frequency.
The Mel Filter bank can be estimated with FFT as
The filter bank can be computed as;
where f = 1 to F is Mel Filters number.
2.6 Filter bank energy
The filter bank is bonded by power spectrum and summed up to some coefficients. The filter bank energy can be computed as;
where \(W{T_m}\) is the multiple kernel weighted function that can be computed as;
2.7 Discrete cosine transform (DCT)
The DCT is performed for transforming the log Mel spectrum estimates to spatial domain.
where
The cepstral coefficient can be computed as;
\(W{C_s}(m)\) represents multiple kernel weighted Mel frequency cepstral coefficients.
2.8 Delta energy and spectrum
The energy patterns or features are summed within the acoustic features vector. The addition enhances audio recognition accuracy and dominates noise robustness as well as the echo.
2.9 Cepstral normalization
In the normalization procedure, the average of each of coefficients will be subtracted and divided with variance.
3 Classification using support vector machine
The classification stage in ASI systems is a feature matching procedure among the new speaker features and the database saved features. The SVM depends on the statistical learning theory (Boujelbene et al. 2010). It is based on finding the best interval among between feature levels to be precisely isolated as much as possible. Such features must be divided linearly using the hyper-plane which may be consider like linear classifier. The SVM transform input features into feature space with large dimension (Zergat and Amrouche 2014; Campbell et al. 2007).
3.1 Geometric margin
It is required to estimate the space from the two patterns to separator. The space is the margin among the two patterns which is the minimum space among the pattern and hyperplane, defined with dashed line in Fig. 3.
For formulating such distance r, let x′–x defines the dotted line which is perpendicular to decision border and parallel to the hyperplane with the normal vector w. The unit vector of normal vector direction to the hyperplane may be estimated as:
So that the distance r may be estimated as:
Since,
So,
The margin among the hyper-plane and the closest two patterns of the two data classes may be estimated as:
where w is the decision hyperplane normal vector, \({{\text{X}}_{\text{i}}}\) is the data point, and \({{\text{y}}_{\text{i}}}\) is the data point class (+ 1 or − 1).
The margin distance may be estimated as:
3.2 Separation technique of SVM
The main aim of SVM is to determine the optimal separately hyperplane. So, the optimal separately hyperplane may be considered as optimizing problem:
Using Lagrange multiplier scheme, Eq. (20) can be minimized and the objective function can be restated as:
where constant αi is Lagrange multiplier. By differentiating αi with respect to w and b:
Substituting from Eqs. (22) and (23) into Eq. (21):
The minimization of Eq. (24) can be considered as a convex quadratic programming problem with condition:
The minimization of the Eq. (24) will be:
The hyperplane may be estimated as:
4 The proposed MKMFCC–SVM ASI system
The full description of the proposed MKMFCC–SVM ASI system using MKMFCC feature extraction and SVM classification algorithm is addressed. Initially, the audio signals for multiple speakers are taken as input for the ASR system. Feature extraction is performed in which the feature vector sequences representing feature patterns about speech signal is extracted. The MFCC features are extracted, and multiple kernel weighted function is performed for generating the MKMFCC coefficients using Mel filter bank energy. After feature extracting phase, speech classification phase is employed with the SVM.
4.1 Feature extraction phase
The feature extraction phase include speaker related properties for effective recognition. The KMFCCs are considered within the proposed ASI since it enhances and preserves information formant from spectral envelope. The MFCC spectral feature differs from other acoustic features in time frequency analysis and requency smoothing schemes.
4.2 SVM implementation for feature matching phase
The research paper utilizes sequential minimal optimization (SMO) (You et al. 2010). The SMO selection rather than other optimization schemes is due to reliability of SMO scheme on large datasets and the LIBSVM library utilized for SVM implementation using SMO can be linked to the Matlab platform. Much time is required for Kernel matrix calculation utilized in SVMs under normal situation, this time grows quickly when training samples number are exist, resulting in a larger Kernel matrix. To bypass such difficulty, SMO divides the problem into a series of smaller quadratic programming problems. The SMO procedure may be summarized as:
-
Step 1
Choose an arbitrary Lagrange multiplier α.
-
Step 2
Choose other Lagrange multiplier.
-
Step 3
Upgrade the other second Lagrange multiplier using Eq. (28):
$$\alpha _{2}^{{new}}={\alpha _2}+\frac{{{y_2}({E_1} - {E_2})}}{k}$$(28) -
Step 4
Set the Lagrange multiplier, i.e. \(\alpha _{{\text{2}}}^{{{\text{new, assigned}}}} \leftarrow \alpha _{{\text{2}}}^{{{\text{new}}}}.\)
-
Step 5
If the Lagrange multiplier is not varied, go back to Step 1.
-
Step 6
Upgrade the earliest Lagrange multiplier.
-
Step 7
If all Lagrange multiplier satisfy step 5 conditions, end. Else, go to step 1.
5 Experimental tests
With existence of telephone and noise-analogous degradations, speaker recognition process may be not an easy process. The noise-analogous degradation tries to disguise the speech signal so the extracted features will not accurate and infeasible for recognition. The telephone-analogous degradation may be considered as a low-pass filter on the speech signal that may remove a lot of speaker features. In this section, different four speaker recognition tests are performed with different degradation types. The considered degradations will be AWGN, colored noise, telephone-analogous degradations with AWGN and telephone-analogous degradations with colored noise. The telephone-analogous degradations have been simulated using low-pass filter of low bandwidth applied on speech signals.
During ASI training stage, a database that includes 80 speakers is utilized. Every speaker iterates a given Arabic clause 15 times. As a result, 1200 speech models will be utilized for providing MKMFCCs using the proposed MKMFCC–SVM ASI, MFCCs and polynomial coefficients for MFCC–SVM ASI to constitute database features vector. During enrolling stage, every speaker is requested to repeat the clause and the audio signal is subjected to degradation. Comparable features like utilized during enrollment will be also evolved from such the degraded speech signals, and utilized in the classification stage. Five methods for feature extraction are employed in the paper.
In first scheme, features extraction of the MKMFCCs, and MFCCs is performed directly using only the speech signals. In second scheme, features extraction is performed using DCT of speech signals. In third scheme, features extraction is performed from the concatenation of both the original speech signal and DCT of speech signal in one features vector. In fourth scheme, features extraction is performed using DWT of speech signals. In fifth scheme, features extraction is performed from the concatenation of both the original speech signal and DWT of speech signal in one features vector. Comparisons are performed to inspect the performance of MKMFCC–SVM ASI with respect to MFCC–SVM ASI in terms of identification rate using the above mentioned five feature extraction schemes in four degradation situations, and test results are shown in Tables 1, 2, 3 and 4. Firstly, the results shown in Tables 1, 2, 3 and 4 ensured and proved the superiority of the proposed MKMFCC–SVM ASI compared with MFCC–SVM ASI using the five feature extraction schemes in all the four degradation cases. Also, it is clear from the results in Tables 1, 2, 3 and 4 for both the proposed MKMFCC–SVM ASI and MFCC–SVM ASI that the extracted features from the audio plus DWT audio signals and audio plus DCT audio signals have the highest recognition rate in all the four degradation cases. In AWGN case, the extracted features using speech plus DWT speech signals have the best recognition rates with different SNRs. For colored noise case shown in Table 2, the extracted features using speech plus DCT speech signals achieve the best recognition rates with different SNRs. In telephone-analogous degradations with AWGN and colored noise cases shown in Tables 3 and 4, respectively, the performance suffers since the low-pass filter eliminates a lot of speech features. The extracted features using speech plus DWT speech signals achieve the best recognition rates for the telephone like degradations and AWGN at all SNRs. But the extracted features from the audio plus DCT audio signals achieve the best recognition rates for the telephone like degradations and colored noise at all SNRs.
6 Conclusion
The paper introduced an efficient robust noise ASI method using MKMFCC and SVM. A comparative study is held between the proposed MKMFCC–SVM ASI and MFCC–SVM ASI in terms of identification rate measure using five methods for extracting features in presence of five degrading cases. Experimental tests prove the effectiveness of the proposed MKMFCC–SVM ASI for extracting features from telephone and noisy-like degraded audio signals.
References
Boujelbene, S. Z., Mezghani, D. B. A., & Ellouze, N. (2010). Improving SVM by modifying kernel functions for speaker identification task. International Journal of Digital Content Technology and its Applications, 4(6, 100–105.
Campbell, W. M., Campbell, J. P., Gleason, T. P., Reynolds, D. A., & Shen, W. (2007). Speaker verification using support vector machines and high-level features. IEEE Transactions on Audio, Speech and Language Processing, 15(7), 2085–2094.
Dharanipragada, S., Yapanel, U. H., & Rao, B. D. (2007) Robust feature extraction for continuous speech recognition using the MVDR spectrum estimation method. IEEE Transactions on Audio, Speech, and Language Processing, 15(1), 224–234.
Ding, I.-J., & Yen, C.-T. (2015) Enhancing GMM speaker identification by incorporating SVM speaker verification for intelligent web-based speech applications. Multimedia Tools and Applications, 74, 5131–5140.
Furui, S. (1981) Cepstral Analysis Technique for Automatic Speaker Verification. IEEE Transactions on Acoustics, Speech, and Signal Processing, 20(2), 254–272.
Galushkin, A. I. (2007). Neural networks theory. Berlin: Springer.
Gandhiraj, R., Sathidevi, P. S. (2007). Auditory-based wavelet packet filter bank for speech recognition using neural network. In Proceedings of the 15th International Conference on Advanced Computing and Communications, pp. 666–671.
Hayati, M., shirvany, Y. (2007). Artificial neural network approach for short term load forecasting for Illam region. Proceeding of World Academy of Science, Engineering and Technology, 22. ISSN 1307–6884.
Hossain, M., Ahmed, B., Asrafi, M. (2007). A real time speaker identification using artificial neural network. In 10th International Conference on Computer and Information Technology, pp. 1–5.
Huang, C., Song, B., & Zhao, L. (2016). Emotional speech feature normalization and recognition based on speaker-sensitive feature clustering. International Journal of Speech Technology, 19, 805–816.
Li, Z., & Gao, Y. (2016). Acoustic feature extraction method for robust speaker identification. Multimedia Tools and Applications, 75, 7391–7406.
Mellahi, T., & Hamdi, R. (2015). LPC-based formant enhancement method in Kalman filtering for speech enhancement. International Journal of Electronics and Communications, 69(2), 545–554.
Naeeni, B. H., Amindavar, H., & Bakhshi, H. (2010). Blind per tone equalization of multilevel signals using support vector machines for OFDM in wireless communication. International Journal of Electronics and Communications, 64(2), 186–190.
Polur, P. D., & Miller, G. E. (2005). Experiments with fast Fourier transform, linear predictive and cepstral coefficients in dysarthric speech recognition algorithms using hidden Markov model. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 13(4), 558–561.
Qian, F., Hu, G., & Yao, X. (2008). Semi-supervised internet network traffic classification using a Gaussian mixture model. International Journal of Electronics and Communications, 62(7), 557–564.
Ramaiah, V. S., & Rao, R. R. (2016). Speaker diarization system using MKMFCC parameterization and WLI-fuzzy clustering. International Journal of Speech Technology, 19, 945–963.
Selva Nidhyananthan, S., Shantha Selva Kumari, R., & Senthur Selvi, T. (2016). Noise robust speaker identification using RASTA-MFCC Feature with quadrilateral filter bank structure. Wireless Personal Communications, 91, 1321–1333.
Shuling, L., & Wang C. (2009). Nonspecific speech recognition method based on composite LVQ1 and LVQ2 network. In Chinese Control and Decision Conference (CCDC), pp. 2304–2388.
Xu, L., & Yang, Z. (2016). Speaker identification based on state space model. International Journal of Speech Technology, 19, 404–414.
You, C. H., Lee, K. A., & Li, H. (2010). GMM-SVM kernel with a Bhattacharyya-based distance for speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 1300–1312.
Zergat, K. Y., & Amrouche, A. (2014). New scheme based on GMM-PCA-SVM modeling for automatic speaker recognition. International Journal of Speech Technology, 17, 373–381.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Faragallah, O.S. Robust noise MKMFCC–SVM automatic speaker identification. Int J Speech Technol 21, 185–192 (2018). https://doi.org/10.1007/s10772-018-9494-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10772-018-9494-9