Abstract
Speaker identification is an emerging technology. It is essentially a "one-to-many mapping" technique: an utterance from an unknown speaker is analysed and compared with the speech models of all known speakers to determine the speaker's identity. The speaker identification process is divided into two phases, namely the training phase and the test phase. In the training phase, speech utterances from 15 speakers, both male and female, are used to obtain individual speaker models by extracting features such as gammatone cepstral coefficients (GTCC) and mel-frequency cepstral coefficients (MFCC). In the test phase, a random utterance spoken by each speaker is compared with the speaker models obtained by k-means clustering so as to identify the particular speaker accurately. Finally, a comparison is made between the MFCC and GTCC feature vectors in terms of accuracy in predicting the exact speaker. Speaker identification technology (SIT) is used in various applications, such as voice biometrics for authentication, surveillance of telephone conversations, and forensic tracing of a suspect's voice during crime investigations. It is also used in Google's speech recognition system to unlock gadgets with the speaker's voice, which serves as a password for privacy protection. The accuracy for the stated problem is expected to lie in the range 80–90%. In addition to the two feature extraction techniques, a convolutional neural network is deployed on the same set of speech utterances, thereby enhancing the accuracy beyond 95%.
1 Introduction
As we all know, speech is the fastest way of interacting among humans. This reality sparked researchers to consider the speech signal as an effective tool for interacting with computers, which creates the need for machines to understand human speech and a pathway to utilize speech for different applications. One such pathway is speaker identification technology (SIT), where a human's speech is used to unlock various technologies [1]. Speaker identification is a "one-to-many mapping" technique in which the speaker is identified by matching the unknown speaker's speech with templates of all enrolled speakers; in other words, an utterance from an unknown speaker is analysed and compared with the speech models of known speakers. Basically, speaker identification is a pattern recognition problem. Text-independent systems are widely used for the speaker identification process [2]. Text independence means that the speech given at training and at testing differ, as the ultimate goal is to identify the person behind a particular voice and not the content of the speech. A text-independent system uses techniques such as acoustic and speech analysis. Acoustics is the branch of physics that deals with the oscillation of matter in solids, liquids and gases, covering vibration and sound. Speech analysis is the study of speech sounds, i.e. the analysis of voice tones produced by the vocal folds. In speaker identification, predominantly speech analysis is used. The voice prints in this work are processed and stored using technologies such as vector quantization [3] and neural networks [4]. Here, we have used the k-means algorithm as the classifier for the VQ technique and a convolutional neural network for training the model in the neural network technique.
Speaker identification technology (SIT) uses the power of voice as a biometric for authentication, recognizing a speaker automatically and with high accuracy based on their voice [5]. This technology is used in surveillance for monitoring telephone conversations and in forensics for tracing a suspect's voice during crime investigations. It is also used in online banking services, in monitoring elderly people's health, and in helping people with dementia identify who is speaking with them [6].
2 Speech an Overview
As noted above, speech is the fastest way of interacting among humans. Speech is the output of the vocal tract system excited by the vibration of the vocal cords, driven by acoustic air pressure from the lungs; this acoustic air pressure is a function of time. The speech signal can normally be segmented into voiced and unvoiced segments. The source of excitation for a voiced segment is a quasi-periodic pulse train of air, whereas for an unvoiced segment it is a white-noise-like signal. Depending on how the vocal tract is shaped, different sounds are produced, such as fricatives, semivowels, consonants, diphthongs and vowels.
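In practice, the voiced/unvoiced distinction is often made with simple short-time measures. The following is a minimal NumPy sketch (not from the paper; the 10 ms frame length, the 120 Hz tone standing in for voiced speech, and the noise level are illustrative assumptions) showing that a voiced-like frame has high energy and a low zero-crossing rate, while an unvoiced-like frame shows the opposite:

```python
import numpy as np

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return np.mean(frame ** 2)

def zero_crossing_rate(frame):
    """Fraction of sample pairs whose sign changes."""
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0

sr = 16000
t = np.arange(sr // 100) / sr                     # one 10 ms frame
voiced = np.sin(2 * np.pi * 120 * t)              # quasi-periodic, pitch-like
rng = np.random.default_rng(0)
unvoiced = 0.1 * rng.standard_normal(len(t))      # low-amplitude, noise-like

# Voiced speech: high energy, few zero crossings; unvoiced: the reverse.
```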
2.1 Formalizing the Speech
The speech signal results from the glottal pulse train convolved with the vocal tract frequency response, which acts as an impulse response. Let x(t) be the speech signal, e(t) the glottal excitation signal and h(t) the impulse response of the vocal tract system. Then

$$x\left( t \right) = e\left( t \right)*h\left( t \right)$$

where * denotes convolution.
Normally, the log spectrum corresponds to the speech signal, the spectral envelope to the vocal tract frequency response, and the spectral details to the glottal pulses. In the spectral envelope, the formants carry the identity of the sound.
In terms of digital signal processing, the vocal tract acts as a filter that shapes the glottal pulse train, which carries the information about the pitch. The speech signal s(n) is given by

$$s\left( n \right) = g\left( n \right)*v\left( n \right)$$

where

g(n) = excitation signal and

v(n) = impulse response of the vocal tract system.
Before features can be extracted from the audio signals, let us see how the speech signal should be prepared for analysis. First and foremost, the analogue speech signal has to be digitized and pre-emphasized before it is used for analysis. Later, spectral analysis techniques are carried out to extract the features from the audio speech signal. This involves two steps:
1. The analogue speech signal from the sound pressure wave is converted to digitized form.

2. Significant frequency components are emphasized, i.e. digital filtering.
The role of digitization is to produce a high signal-to-noise ratio (SNR) on the sampled audio data. After digitization comes amplification, performed by a pre-emphasis filter that boosts the spectrum of the audio signal by up to 20 dB/decade, since the natural voice signal has a negative slope of 20 dB/decade. In pre-emphasis, the high-frequency components of the voice signal are amplified to counter the damping at higher frequencies. Here,

$$H\left( z \right) = 1 - az^{ - 1}$$

is our pre-emphasis filter, with filtering coefficient a = 0.95. In general, there are two major concerns in any speech communication system.
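In the time domain, the filter H(z) = 1 − az⁻¹ corresponds to the difference equation y[n] = x[n] − a·x[n−1]. A minimal NumPy sketch (an illustration, not the authors' code):

```python
import numpy as np

def pre_emphasis(signal, a=0.95):
    """Apply the first-order high-pass filter y[n] = x[n] - a * x[n-1]."""
    # The first sample is kept as-is since it has no predecessor.
    return np.append(signal[0], signal[1:] - a * signal[:-1])

# A slowly varying (low-frequency) signal is strongly attenuated:
x = np.ones(8)
y = pre_emphasis(x)
```

After the first sample, every output value is 1 − 0.95 = 0.05, illustrating how low-frequency content is suppressed while fast changes pass through.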
1. Preservation of the information contained in the speech signal.

2. Representation of the speech signal in a form that allows modifications without degrading the original information content.
Hence, these two major concerns have to be dealt with before further processing of speech signals in any communication system.
2.2 Framing and Windowing
As the speech signal is not stationary over its full length, framing is done so that it remains stationary for a short period, since the glottal system cannot change instantaneously. Framing is the splitting of the speech signal into smaller chunks; before framing, the speech signal is first filtered using the pre-emphasis filter. Framing is also necessary because machines cannot compute over an unbounded number of data points, and simply cutting the signal off at either end would lead to information loss. In framing, the speech signal is divided into frames of 8–16 ms length, shifted with an overlap of up to 50% so that each frame retains information about the previous one. The frames of the speech signal are then multiplied with a window. Windows improve the ability of an FFT to extract spectral data. The window we preferred for multiplication with the frames is the Hamming window, which reduces spectral leakage (ripple) so that we get a clearer picture of the original signal's frequency spectrum. The Hamming window w(n) is given by

$$w\left( n \right) = 0.54 - 0.46\cos \left( {\frac{2\pi n}{{N - 1}}} \right),\quad 0 \le n \le N - 1$$
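The framing and windowing steps above can be sketched as follows; the 10 ms frame length and 50% overlap at 16 kHz are illustrative choices consistent with the ranges given in the text, not values fixed by the paper:

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)   # 0.54 - 0.46*cos(2*pi*n/(N-1))
    return np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])

# 16 kHz audio: 10 ms frames (160 samples) with 50% overlap (hop = 80 samples)
x = np.random.randn(16000)
frames = frame_signal(x, frame_len=160, hop_len=80)
```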
2.3 Understanding the Cepstrum
We know that the speech signal can be represented as follows:

$$s\left( n \right) = g\left( n \right)*v\left( n \right)$$

By using the log magnitude spectrum, we can separate the vocal tract information from the glottal pulse information, which cannot be done with the normal spectrum. Taking the IDFT of the log magnitude spectrum then yields the cepstrum. The information relative to the spectral envelope (vocal tract frequency response) lies in the lower end of the quefrency domain, and the information relative to the spectral details (glottal pulse excitation) lies in the higher end. Thus, the excitation can be removed by passing the cepstrum through a low-pass lifter.
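The cepstrum computation and low-pass liftering described above can be sketched in a few lines of NumPy (a minimal illustration; the 512-sample frame and the liftering cutoff of 30 are assumed values, not taken from the paper):

```python
import numpy as np

def real_cepstrum(frame):
    """Cepstrum = IDFT of the log magnitude spectrum of a windowed frame."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # small offset avoids log(0)
    return np.real(np.fft.ifft(log_mag))

def low_pass_lifter(cepstrum, cutoff):
    """Keep only the low-quefrency coefficients (the spectral envelope)."""
    liftered = np.zeros_like(cepstrum)
    liftered[:cutoff] = cepstrum[:cutoff]
    return liftered

frame = np.hamming(512) * np.random.randn(512)
c = real_cepstrum(frame)
envelope_part = low_pass_lifter(c, cutoff=30)   # excitation removed
```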
Thus, after analysis, the speech signals can be used to train the speaker models by extracting the feature vectors of each enrolled speaker. The speaker models are then constructed, and the test feature vectors are given to the classifier, which identifies the speaker corresponding to the test utterance.
3 Feature Extraction
3.1 Mel-frequency Cepstral Coefficients (MFCC) [2]
The mel-frequency cepstrum is a short-time power spectrum representation of a speech signal obtained after computing a discrete cosine transform. The words cepstrum, quefrency, liftering and rahmonic are wordplays on spectrum, frequency, filtering and harmonic, respectively; the former terms correspond to the quefrency (cepstral) domain and the latter to the frequency domain. MFCC is based on the characteristics of human hearing perception: the human ear perceives frequency on a nonlinear, roughly logarithmic scale. This necessitates an audio feature that can be depicted in the time–frequency domain and has a perceptually relevant amplitude and frequency representation. One such audio feature is the mel-frequency cepstral coefficients. Figure 1 describes the procedure for MFCC extraction.
After the speech waveform undergoes pre-emphasis, frame blocking and windowing, the discrete Fourier transform of the signal is computed, and the log amplitude spectrum is obtained. Then, mel-scaling is performed. The mel-scale is a logarithmic, perceptually informed scale for pitch: equal distances on the scale correspond to equal "perceptual" distances. In the mel filter bank, the spacing between mel points is uniform, whereas the spacing between the corresponding frequency points is not. After mel-scaling, the log amplitude spectrum undergoes a discrete cosine transform, and the cepstrum of the speech signal is obtained. The advantages of MFCC are as follows:
1. MFCC describes the "large" structures of the spectrum, that is, it focuses on the phonemes.

2. It ignores the fine spectral structures such as pitch.

3. It works well in speech and music processing.
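Putting the MFCC pipeline together (|DFT| → mel filter bank → log → DCT), a compact NumPy/SciPy sketch for a single windowed frame might look as follows. The 26 filters and 13 coefficients are conventional choices in line with the counts mentioned later in the text, and the triangular-filter construction is a common textbook formulation rather than the authors' exact implementation:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sr):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fbank

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """MFCC of one windowed frame: power spectrum -> mel energies -> log -> DCT."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    mel_energies = mel_filter_bank(n_filters, n_fft, sr) @ power
    return dct(np.log(mel_energies + 1e-10), norm='ortho')[:n_coeffs]

frame = np.hamming(512) * np.random.randn(512)
coeffs = mfcc(frame, sr=16000)
```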
3.2 Gammatone Cepstral Coefficients (GTCC) [7,8,9,10]
The gammatone cepstral coefficients capture the movement of the basilar membrane in the cochlea during hearing. GTCCs model the physical changes that occur within the ear during hearing more closely and accurately, and are therefore more representative of the human auditory system than mel-frequency cepstral coefficients. Unlike the mel filter bank, a gammatone filter bank is used here, as it models the human auditory system and therefore uses the ERB scale. The gammatone filter bank is often used in the front end of cochlea simulations, transforming complex sounds into multichannel activity as observed in the auditory nerve. The GFB is designed in such a way that the centre frequencies of the filters are distributed in proportion to their bandwidths and are linearly spaced on the ERB scale. In general, the equivalent rectangular bandwidth (ERB) scale gives an approximation to the filter bandwidths in human hearing, modelling the filters as rectangular bandpass filters. The gammatone filter is given by

$$g\left( t \right) = at^{n - 1} {\text{e}}^{ - 2\pi bt} \cos \left( {2\pi f_{{\text{c}}} t + \varphi } \right)$$
where a is amplitude factor, t is time in seconds, n is filter order, fc is the centre frequency, b is bandwidth and φ is phase factor. Figure 2 indicates the modules used for GTCC extraction.
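The gammatone impulse response g(t) can be generated directly from this formula. In the sketch below, the ERB expression and the 1.019 bandwidth scaling are common choices from the psychoacoustics literature (Glasberg and Moore), assumed here for illustration; the 25 ms duration and 1 kHz centre frequency are likewise arbitrary:

```python
import numpy as np

def gammatone_ir(fc, sr, n=4, duration=0.025, phi=0.0):
    """Impulse response g(t) = a * t**(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phi)."""
    t = np.arange(int(duration * sr)) / sr
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # equivalent rectangular bandwidth in Hz
    b = 1.019 * erb                           # common bandwidth scaling factor
    g = t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)
    return g / np.max(np.abs(g))              # normalize peak amplitude (sets a)

ir = gammatone_ir(fc=1000.0, sr=16000)        # one 4th-order filter centred at 1 kHz
```

A full gammatone filter bank repeats this for centre frequencies spaced linearly on the ERB scale.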
3.3 How Many MFCC and GTCC Coefficients?
Traditionally, the first 12–14 coefficients are computed for the MFCC analysis and 42 coefficients for the GTCC analysis. These lower-order coefficients retain the information about the formants and the spectral envelope (vocal tract frequency response), which is what the analysis requires, whereas the higher-order coefficients retain information about the fast-changing spectral details (glottal pulse excitation).
3.4 Why Discrete Cosine Transform Instead of IDFT?
- It is a simplified version of the Fourier transform.

- The discrete cosine transform yields only real-valued coefficients, discarding the imaginary parts, whereas the IDFT computes both real and imaginary valued coefficients.

- The discrete cosine transform decorrelates energy in different mel bands.

- It reduces the dimension needed to represent the spectrum.
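The real-versus-complex point can be checked in a couple of lines (a toy illustration on a hypothetical vector of 26 log mel energies, not data from the paper):

```python
import numpy as np
from scipy.fftpack import dct

# A stand-in for 26 log mel-band energies.
log_mel = np.log(np.abs(np.random.randn(26)) + 1.0)

c_dct = dct(log_mel, norm='ortho')   # real-valued coefficients only
c_idft = np.fft.ifft(log_mel)        # complex-valued in general
```

Truncating `c_dct` to its first 12–14 entries then gives the dimension reduction described in Sect. 3.3.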
4 K-means Clustering [3]
K-means clustering belongs to the group of exclusive clustering methods, where each data point belongs exclusively to one group. K-means clustering is a vector quantization method, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centre or centroid), which serves as the prototype of the cluster. We use k-means clustering rather than hierarchical clustering because k-means works on the actual observations and creates a single level of clusters, rather than working on the variation between every pair of observations and producing a multilevel hierarchy of clusters. The variable k indicates the number of clusters. The algorithm runs iteratively, evaluating the Euclidean distance between each data point and the cluster centres and assigning each point to the nearest cluster. Based on the mel-frequency cepstral coefficients and gammatone cepstral coefficients extracted from the trained set of speakers, k-means clusters are formed. The algorithm iterates until the data points within each cluster stop changing; at the end of each run, it records the clusters and their total variance, repeats the same steps from a different starting point, and finally compares the results to select the clustering with the best variance. To find the value of k, a trial-and-error approach is used starting from k = 1, the worst case, where the variation within the dataset is largest. Each time the number of clusters is increased, the variation decreases; when the number of clusters equals the number of data points, the variation is zero.
In this project, the optimal k-value is found to be four, as the variation is reduced there and a further increase in the number of clusters does not affect the variation much; that value of k is termed the elbow point. We preferred k-means as the classifier over the other mentioned classifiers since it is simple to carry out and also applies to large sets of data like the Berlin database. It has the advantage of generalizing to clusters of different shapes and sizes. Each cluster is assumed to represent a speaker's features; the cluster centroids are formed using the k-means algorithm, and finally the speaker data is classified.
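The training and test phases described above can be sketched end to end: fit one k = 4 codebook per speaker, then identify a test utterance by the codebook with the smallest mean frame-to-centroid distance. This is a self-contained toy illustration; the Gaussian "feature" clouds stand in for real MFCC/GTCC frames, and the plain k-means below is a minimal implementation, not the authors' code:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

rng = np.random.default_rng(1)
# Hypothetical stand-in for per-speaker 13-dim feature frames: each speaker's
# frames are drawn around a distinct mean so the codebooks are separable.
train = {s: rng.normal(loc=float(s), scale=0.3, size=(200, 13)) for s in range(3)}

# Training phase: one k = 4 codebook per speaker (the elbow point found above).
codebooks = {s: kmeans(feats, k=4) for s, feats in train.items()}

def identify(test_feats, codebooks):
    """Test phase: pick the speaker whose codebook gives the minimum mean distance."""
    def distortion(c):
        d = np.linalg.norm(test_feats[:, None, :] - c[None, :, :], axis=2)
        return d.min(axis=1).mean()
    return min(codebooks, key=lambda s: distortion(codebooks[s]))

test_utterance = rng.normal(loc=1.0, scale=0.3, size=(50, 13))  # from speaker 1
predicted = identify(test_utterance, codebooks)
```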
5 Convolutional Neural Network [4, 11]
5.1 Why not Fully Connected Neural Network?
We know that computers read images as pixels in three planes, i.e. the RGB planes. If an image has dimensions 28*28*3, then the number of weights feeding each unit in the first hidden layer of a fully connected network will be 2352, whereas real-life images have larger dimensions such as 200*200*3, in which case that number of weights will be 120,000. We would therefore need to deal with a huge number of parameters and require a greater number of neurons, which can eventually lead to overfitting. That is why we do not use fully connected neural networks for image classification.
5.2 Why Convolutional Neural Network?
In CNN, a particular neuron is connected only to a small region of that layer unlike connecting to the complete layer as it was in fully connected networks. Thereby, CNN requires a smaller number of weights and a smaller number of neurons.
5.3 What is Convolutional Neural Network?
- A convolutional neural network is "a sort of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex".

- Visual cortex: "small regions of cells that are sensitive to specific regions of the visual field". In other words, "some individual neuronal cells in the brain respond only to edges of a certain orientation". For example, some neurons fire when exposed to vertical edges and some when exposed to horizontal or diagonal edges.
5.4 How CNN Works?
CNN has four layers. They are:
- Convolution layer

- ReLU layer

- Pooling layer

- Fully connected layer
In this project, the spectrograms of the 15 input speakers' speech utterances are obtained. A spectrogram is a visual representation of the signal strength, or loudness, of a signal over time across a range of frequencies. The spectrogram images of the 15 input speakers are fed to the layers (Fig. 3).
Convolution Layer: In the convolution layer, the spectrogram images are piece by piece compared with the predefined filter patches to get the convolved output. Steps involved are:
- The feature and image patches are lined up.

- Each image pixel is multiplied by the corresponding feature pixel.

- The products are added up.

- The sum is divided by the total number of pixels in the feature.
ReLU Layer
- Every non-positive value in the filtered images is replaced with zero.

- This prevents negative values from cancelling positive ones when activations are summed in later layers.

- ReLU stands for "rectified linear unit". The following transfer function is used:

$$F\left( n \right) = \left\{ {\begin{array}{*{20}c} {n,\,{\text{for}}\,n > 0} \\ {0,\,{\text{for}}\,n \le 0} \\ \end{array} } \right.$$
Pooling Layer: In this layer, the size of the image stack is reduced. Steps involved are:
- Choose a window size (two or three).

- Choose stride = 2.

- Move the window over the filtered images.

- Take the maximum value from each window.
Stacking up the Layers: The above layers are stacked up one more time to minimize the size of the image stack.
Fully Connected Layers
- The actual classification takes place in this final layer.

- Here, the filtered and shrunk images are lined up into a single sheet; that is, the values are listed in the form of a vector.

- The fully connected layer has 15 outputs, one for each speaker.
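The per-layer operations described in Sects. 5.4 (convolution as patchwise multiply–add–normalize, ReLU as zeroing non-positive values, max pooling with a 2×2 window and stride 2) can be sketched in plain NumPy. This is a didactic single-channel illustration of the layer mechanics, not the paper's trained network; the 16×16 "spectrogram" patch and random 3×3 kernel are assumptions:

```python
import numpy as np

def convolve_valid(image, kernel):
    """Slide the kernel over the image; each output value is the mean of the
    elementwise products, matching the convolution-layer steps listed above."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel) / kernel.size
    return out

def relu(x):
    """Replace every non-positive value with zero."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2, stride=2):
    """Take the maximum value from each window, shrinking the feature map."""
    H, W = x.shape
    out = np.empty((1 + (H - size) // stride, 1 + (W - size) // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

spectrogram = np.random.randn(16, 16)   # stand-in for a spectrogram patch
kernel = np.random.randn(3, 3)
fmap = max_pool(relu(convolve_valid(spectrogram, kernel)))
```

Stacking these operations again, flattening the result into a vector and feeding it to a 15-output layer reproduces the architecture outlined above.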
5.5 Test Phase
Certain frames of the spectrograms (nearly 80%) are given as input to the convolution layer, which contains the predefined filters, to train the model. The remaining frames of each speaker's spectrogram are given as test input, and classification is performed. During classification, two rates have to be considered for accurate analysis:
- False acceptance rate

- False rejection rate
When we accept a user whom we should actually have rejected, it is counted in the false acceptance rate; this issue is also called a false positive. A false rejection occurs when we reject a user whom we should actually have accepted. As the false acceptance rate (FAR) decreases, the false rejection rate (FRR) increases, and vice versa (Fig. 4).
The point at which the lines intersect is known as the equal error rate (EER). At this point, the percentage of false acceptances equals the percentage of false rejections. Generally, FAR and FRR are configured in a system by adjusting the decision criterion to be more or less strict.
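The FAR/FRR trade-off and the EER crossing point can be demonstrated by sweeping a decision threshold over match scores. The Gaussian score distributions below are hypothetical stand-ins for a real matcher's genuine and impostor scores, not results from the paper:

```python
import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    """Accept when score >= threshold.
    FAR = fraction of impostors accepted; FRR = fraction of genuine users rejected."""
    far = np.mean(impostor_scores >= threshold)
    frr = np.mean(genuine_scores < threshold)
    return far, frr

rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)     # hypothetical genuine-user scores
impostor = rng.normal(-2.0, 1.0, 1000)   # hypothetical impostor scores

# Sweep the threshold; the EER is where the FAR and FRR curves cross.
thresholds = np.linspace(-6, 6, 601)
rates = np.array([far_frr(genuine, impostor, t) for t in thresholds])
eer_idx = np.argmin(np.abs(rates[:, 0] - rates[:, 1]))
eer = rates[eer_idx].mean()
```

Raising the threshold makes the system stricter (lower FAR, higher FRR); lowering it does the opposite, exactly the trade-off discussed in Sect. 6.2.1.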
6 Results and Discussions
6.1 Feature Extraction Results
A random utterance spoken by the speakers is tested against the trained cluster models to predict the particular speaker. The minimum distance obtained in the testing phase is used to determine the recognition rate, from which a confusion matrix or average recognition rate table can be developed. The results obtained for both features are tabulated in Tables 1 and 2.
Based on the confusion matrix, it is evident that the overall accuracy of the speaker identification system computed with gammatone cepstral coefficients (GTCC) is higher than that computed with mel-frequency cepstral coefficients (MFCC). Among the female and male speakers considered, the overall accuracy for female speakers is higher than that for male speakers with both feature extraction techniques (MFCC and GTCC). Although the overall female accuracy is higher, the confusion matrix shows that some individual female speakers' accuracies are lower than those of individual male speakers. This can be dealt with quantitatively in two ways.
6.1.1 Quantitative Analysis
- Frequency Response Analysis

When the frequency responses of two female speakers and two male speakers are plotted, it is observed that the two female speakers' spectra overlap over most of the frequency range, showing a similarity between the female speakers' frequencies, whereas between the two male speakers there is a clear difference in frequency content (Figs. 5 and 6).

- Correlation Coefficient Analysis

When correlation analysis is carried out between two female speakers' speech utterances, between two male speakers' speech utterances, and between a male speaker's and a female speaker's speech utterance, it is observed that the correlation between the two female speakers is higher than that between the two male speakers. The correlation between a male speaker and a female speaker yields negative correlation coefficients, indicating that these classes are well separated and that the overall performance of the system is good (Table 3).
6.2 Convolution Neural Network Results
The testing accuracy or average recognition rate obtained after classification is shown in Fig. 7.
So, in order to obtain an accurate classification model, the false acceptance rate (FAR) and false rejection rate (FRR) must be taken into consideration while modelling.
6.2.1 Effect of FAR and FRR
- If the value of FAR is set to the lowest possible value, then FRR tends to increase sharply. In other words, when the FAR is too low, the overall system will be more secure but less user-friendly, as it may falsely reject correct users.

- If the value of FRR is set to the lowest possible value, then FAR tends to increase sharply. In other words, when the FRR is too low, the overall system will be more user-friendly but less secure, admitting false users by mistake.
Therefore, it is our choice whether to give priority to security or to user convenience. If neither is to be compromised, both values can be set to the same point, the equal error rate.
7 Conclusions
In this paper, a speaker identification system is assessed by employing features such as mel-frequency cepstral coefficients (MFCC) and gammatone cepstral coefficients (GTCC), and a VQ-based minimum distance classifier is used to classify the speakers from their speech utterances. Additionally, a convolutional neural network (CNN) was deployed to model a classification system for speaker identification to enhance the accuracy further. From the observations, it is evident that the GTCC feature tracks the particular speaker more precisely than the MFCC feature. Since the GTCC feature captures the movement of the basilar membrane within the cochlea and so mimics the human auditory system, a speaker is predicted more accurately than with MFCC. GTCC provides a better overall accuracy of 90.12% as compared to MFCC, whose overall accuracy is 77.79%; the accuracy obtained from the CNN model is 98.71%. A combination of features can be considered for improving the performance further. This speaker identification system would find applications in voice biometrics for authentication, in surveillance for monitoring telephone conversations, and in forensics for tracing a suspect's voice during crimes. It is also usable in Google's speech recognition system to unlock gadgets with the speaker's voice as a password for privacy protection.
References
Jahangir R, Ali I (2017) Speaker identification through artificial intelligence techniques: a comprehensive review and research challenges. Expert Syst Appl
Leu F, Lin G (2017) An MFCC based speaker identification system. In: 2017 IEEE 31st international conference on advanced information networking and applications, pp 1055–1062
Escamilla E, Perez H, Martinez J, Suzuki MM (2012) Speaker recognition using MFCC and VQ techniques. IEEE
Albawi S, Al Zawi S, Mohammed TA (2017) Understanding of a convolutional neural network. In: 2017 International Conference on Engineering and Technology (ICET)
Furui S (2009) 40 years of progress in automatic speaker recognition. In: Nixon MS (ed) Advances in Biometrics, vol 5558. Springer, Berlin
Bansal PK, Sharma V (2013) A review on speaker recognition and challenges. Int J Eng Technol 2(5)
Fathima R, Raseena PE (2013) Gammatone cepstral coefficients for speakers identification. Int J Adv Res Electr Electron Instrum Eng 2
Valero X, Alias F (2012) Gammatone cepstral coefficients: biologically inspired features for non-speech audio classification. IEEE Trans Multimedia 14(6):1684–1689
Ayoub B, Jamal K, Arsalane Z (2016) Gammatone frequency cepstral coefficients for speaker identification over VoIP networks. In: 2016 International conference on Information Technology for Organizations Development (IT4OD), Fez, Morocco, pp 1–5
Wang H, Zhang C (2019) The applications of gammatone frequency cepstral coefficients for forensic voice comparison under noisy conditions. Aust Forensic Sci
Chauhan R, Ghanshala KK, Joshi RC (2018) Convolutional Neural Network (CNN) for image detection and recognition. In: 2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC)
Revathi AS, Jeyalakshmi C (2017) Comparative analysis on the use of features and models for validating language identification system. In: 2017 International Conference on Inventive Computing and Informatics (ICICI), pp 693–698

Dharini D, Revathy A (2014) Singer identification using clustering algorithm. In: 2014 International Conference on Communications and Signal Processing (ICCSP), pp 1927–1931

Revathi AS, Jeyalakshmi C, Muruganantham T (2018) Perceptual features based rapid and robust language identification system for various Indian classical languages
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Revathi, A., Gayathri, G., Jeyalakshmi, C. (2022). Speaker Identification Using Multiple Features and Models. In: Sengodan, T., Murugappan, M., Misra, S. (eds) Advances in Electrical and Computer Technologies. ICAECT 2021. Lecture Notes in Electrical Engineering, vol 881. Springer, Singapore. https://doi.org/10.1007/978-981-19-1111-8_41
Print ISBN: 978-981-19-1110-1
Online ISBN: 978-981-19-1111-8