
1 Introduction

Speech is the fastest and most natural way for humans to interact. This reality sparked interest among researchers in using the speech signal as an effective means of interacting with computers, which in turn requires machines to understand human speech and provides a pathway to use speech for different applications. One such pathway is speaker identification technology (SIT), in which a person's speech is used to unlock various technologies [1]. Speaker identification is a one-to-many mapping technique: an utterance from an unknown speaker is analysed and compared against the speech models (templates) of all known speakers to determine the speaker's identity. Fundamentally, speaker identification is a pattern recognition problem. Text-independent systems are widely used for speaker identification [2]. Text independence means that the speech given at training and at testing is different, since the ultimate goal is to identify the person behind a particular voice and not the content of the speech. Text-independent systems rely on techniques such as acoustics and speech analysis. Acoustics is the branch of physics that deals with the oscillation of matter in solids, liquids and gases, covering vibration and sound. Speech analysis is the study of speech sounds, in other words, the analysis of the voice tones produced by the vocal folds. In speaker identification, speech analysis predominates. The voice prints in this work are processed and stored using vector quantization [3] and neural networks [4]: the k-means algorithm is used as the classifier for the VQ technique, and a convolutional neural network is trained for the neural network technique. Speaker identification technology uses the power of voice as a biometric for authentication, recognizing a speaker automatically and with high accuracy from their voice alone [5]. The technology is used in surveillance for eavesdropping on telephone conversations and in forensics for backtracking a suspect's voice during crimes. It is also used in online banking services, in monitoring the health of elderly people, and in helping people with dementia identify who is speaking with them [6].

2 Speech an Overview

Speech is the fastest and most natural way for humans to interact. Speech is the output of the vocal tract system excited by the vibration of the vocal cords, driven by acoustic air pressure from the lungs; this acoustic air pressure is a function of time. The speech signal can normally be segmented into voiced and unvoiced segments. The source of excitation for a voiced segment is a quasi-periodic train of air pulses, whereas for an unvoiced segment it is a noise-like (white noise) signal. Depending on how the vocal tract is shaped, different sounds can be produced, such as vowels, diphthongs, semivowels, fricatives and other consonants.

2.1 Formalizing the Speech

The speech signal results from the convolution of the glottal excitation with the vocal tract frequency response, which acts as an impulse response. Let s(t) be the speech signal, e(t) the glottal (excitation) pulse train produced at the glottis, and h(t) the vocal tract frequency response, i.e. its impulse response.

$${\text{Speech}},\,s\left( t \right) = e\left( t \right)*h\left( t \right)$$
(1)

In the log spectrum of the speech, the spectral envelope corresponds to the vocal tract frequency response and the spectral details correspond to the glottal pulses. Within the spectral envelope, the formants carry the identity of the sound.

In terms of digital signal processing, the vocal tract acts as a filter that shapes the glottal pulse train, which carries the pitch information together with higher-frequency noise-like components. The speech signal s(n) is given by

$$s\left( n \right) = g\left( n \right)*h\left( n \right)$$
(2)

where

g(n) = excitation signal

h(n) = impulse response of the vocal tract system.

Since speech signals are processed digitally, before extracting features from the audio signal let us consider how the speech signal is prepared for analysis. First and foremost, the analogue speech signal has to be digitized and pre-emphasized before it is used for analysis. Spectral analysis techniques are then applied to extract the features from the speech signal. This involves two steps:

  1. The analogue speech signal (a sound pressure wave) is converted to digitized form.

  2. Significant frequency components are emphasized, i.e. digital filtering.

The role of digitization is to produce a high signal-to-noise ratio (SNR) in the sampled audio data. Digitization is followed by amplification, carried out by a pre-emphasis filter that boosts the spectrum of the audio signal by up to 20 dB/decade, since the natural voice spectrum has a negative slope of about 20 dB/decade. In pre-emphasis, the high-frequency components of the voice signal are amplified to compensate for the damping at higher frequencies. Here,

$$y\left( n \right) = x\left( n \right) - a*x\left( {n - 1} \right)$$
(3)

is our pre-emphasis filter, with filtering coefficient a = 0.95 (a short code sketch of this step is given at the end of this subsection). In general, there are two major concerns in any speech communication system.

  1. Preservation of the information contained in the speech signal.

  2. Representation of the speech signal in a form that can be modified easily, without degrading the original information content present in the speech signal.

Hence, these two major concerns have to be dealt with before further processing of speech signals in any communication system.
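As a concrete illustration of the pre-emphasis step in Eq. (3), a minimal NumPy sketch is given below. It assumes the signal has already been digitized, and the coefficient a = 0.95 follows the text; the function name is ours, not part of any referenced toolkit.

```python
import numpy as np

def pre_emphasis(x, a=0.95):
    """First-order pre-emphasis filter: y[n] = x[n] - a * x[n-1]."""
    # The first sample is kept unchanged; every later sample subtracts a
    # scaled copy of its predecessor, boosting the high-frequency content.
    return np.append(x[0], x[1:] - a * x[:-1])
```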

2.2 Framing and Windowing

Since the speech signal is not stationary over long durations, it is split into short frames within which it can be treated as stationary, because the glottal system cannot change abruptly. Framing is therefore the splitting of the speech signal into smaller chunks; before framing, the speech signal is first passed through the pre-emphasis filter. Framing is also needed because machines cannot compute with an unbounded number of data points, and simply cutting the signal off at either end would lead to information loss. In framing, the speech signal is divided into frames of 8–16 ms length, shifted with an overlap of up to 50% so that each frame retains information about the previous frame. The frames of the speech signal are then multiplied by a window; windowing improves the ability of the FFT to extract spectral data by reducing the discontinuities at the frame edges. The window preferred here is the Hamming window, which reduces the ripple (spectral leakage) so that a clearer picture of the original signal's frequency spectrum is obtained. The Hamming window w(n) is given below; a short framing-and-windowing sketch follows the equation.

$$w\left( n \right) = 0.54 - 0.46\cos (2\pi n/(N - 1)),\quad 0 \le n \le N - 1$$
(4)
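The following is a minimal sketch of framing and Hamming windowing under the assumptions stated above (16 ms frames, 50% overlap, signal longer than one frame); `np.hamming` implements exactly the window of Eq. (4).

```python
import numpy as np

def frame_and_window(signal, fs, frame_ms=16, overlap=0.5):
    """Split a pre-emphasized signal into overlapping frames and apply a Hamming window."""
    frame_len = int(round(fs * frame_ms / 1000.0))   # samples per 16 ms frame
    hop = int(round(frame_len * (1.0 - overlap)))    # 50% overlap -> hop of half a frame
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)                   # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    return np.stack([signal[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])      # shape: (n_frames, frame_len)
```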

2.3 Understanding the Cepstrum

We know that the speech signal can be represented as follows:

$${\text{Speech}},\,s\left( t \right) = e\left( t \right)*h\left( t \right)$$
(5)
$${\text{Taking the Fourier transform}},\,S\left( w \right) = E\left( w \right) \cdot H\left( w \right)$$
(6)
$${\text{Taking log on both sides}},\,\log \left( {S\left( w \right)} \right) = \log \left( {E\left( w \right)} \right) + \log \left( {H\left( w \right)} \right)$$
(7)

By using the log magnitude spectrum, the vocal tract information and the glottal pulse information can be separated, which cannot be done with the ordinary spectrum. Taking the IDFT of the log magnitude spectrum then yields the cepstrum. In the quefrency domain the two components are physically separated: the information related to the spectral envelope (vocal tract frequency response) lies at the lower end of the quefrency axis, and the information related to the spectral details (glottal pulse excitation) lies at the higher end. Thus, the excitation e(t) can be removed by passing the cepstrum through a low-pass lifter.
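A minimal sketch of this computation is given below: the real cepstrum of one windowed frame is obtained as the IDFT of the log magnitude spectrum, and low-pass liftering keeps only the low-quefrency (vocal tract) part. The lifter cutoff of 30 quefrency bins is an illustrative assumption, not a value from the text.

```python
import numpy as np

def real_cepstrum(frame, lifter_cutoff=30):
    """Real cepstrum of a windowed frame and its low-pass liftered version."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # log magnitude spectrum
    cepstrum = np.real(np.fft.ifft(log_mag))     # IDFT -> quefrency domain
    # Low-pass liftering: keep the low-quefrency samples (spectral envelope,
    # vocal tract) and zero out the high-quefrency ones (glottal excitation).
    liftered = np.zeros_like(cepstrum)
    liftered[:lifter_cutoff] = cepstrum[:lifter_cutoff]
    liftered[-lifter_cutoff + 1:] = cepstrum[-lifter_cutoff + 1:]  # mirror half
    return cepstrum, liftered
```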

Thus, after this analysis, the speech signals can be used to train the speaker models by extracting feature vectors for each speaker. The speaker models are then constructed, the test feature vectors are given to the classifier, and the identity of the test speaker is tracked.

3 Feature Extraction

3.1 Mel-frequency Cepstral Coefficients (MFCC) [2]

The mel-frequency cepstrum is a representation of the short-time power spectrum of a speech signal, obtained by applying a discrete cosine transformation to its log mel spectrum. The words cepstrum, quefrency, liftering and rahmonic are plays on the words spectrum, frequency, filtering and harmonic, respectively; the former terms live in the cepstral (quefrency) domain, whose independent variable has the dimension of time, while the latter belong to the frequency domain. MFCC is based on the characteristics of human hearing perception: the human ear resolves frequency nonlinearly, roughly logarithmically. This motivates an audio feature that can be depicted in the time–frequency domain and has a perceptually relevant amplitude and frequency representation; mel-frequency cepstral coefficients are one such feature. Figure 1 describes the procedure for MFCC extraction.

Fig. 1 Block diagram of MFCC extraction

The speech waveform undergoes pre-emphasis, frame blocking and windowing; the discrete Fourier transform of each frame is then computed, and the log amplitude spectrum is obtained. Next, mel-scaling is performed. The mel scale is a logarithmic, perceptually informed scale for pitch: points that are equidistant on the mel scale are perceived as equally spaced in pitch. In the mel filter bank, the spacing between the filter centre frequencies is constant on the mel scale, whereas on the linear frequency axis the spacing grows with frequency. After mel-scaling, the log amplitude spectrum undergoes a discrete cosine transform, and the cepstrum of the speech signal is obtained (a short sketch of this pipeline follows the list below). The advantages of MFCC are as follows:

  1. MFCC describes the "large" structures of the spectrum, that is, it focuses on the phonemes.

  2. It ignores fine spectral structures such as pitch.

  3. It works well in speech and music processing.
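As a hedged illustration of the pipeline in Fig. 1, the sketch below uses the librosa library, which implements the same chain (framing/windowing, DFT, mel filter bank, log, DCT). The synthetic signal is only a placeholder; in practice `y` would be a recorded utterance, and 13 coefficients are kept in line with Sect. 3.3.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)            # placeholder for a 1-s utterance
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames): first 13 MFCCs
```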

3.2 Gammatone Cepstral Coefficients (GTCC) [7,8,9,10]

Gammatone cepstral coefficients capture the movement of the basilar membrane in the cochlea during hearing. GTCCs model the physical changes that occur within the ear more closely and accurately and are therefore more representative of the human auditory system than mel-frequency cepstral coefficients. Instead of a mel filter bank, a gammatone filter bank is used here, as it models the human auditory system and hence uses the ERB scale. The gammatone filter bank is often used at the front end of cochlea simulations, transforming complex sounds into multichannel activity such as that observed in the auditory nerve. The filter bank is designed so that the centre frequencies of the filters are distributed in proportion to their bandwidths and are linearly spaced on the ERB scale. The equivalent rectangular bandwidth (ERB) scale gives an approximation to the bandwidths of the filters in human hearing, modelling each auditory filter as a rectangular band-pass filter of equivalent bandwidth. The gammatone filter is given by

$$g\left( t \right) = at^{n - 1} {\text{e}}^{ - 2\pi bt} {\text{cos}}\left( {2\pi f_{\text{c}} t + \phi } \right)$$
(8)

where a is the amplitude factor, t is time in seconds, n is the filter order, f_c is the centre frequency, b is the bandwidth and φ is the phase factor. Figure 2 indicates the modules used for GTCC extraction; a short sketch of the gammatone impulse response follows the figure.

Fig. 2 Block diagram of GTCC extraction
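The following is a minimal sketch of the impulse response of Eq. (8). The filter order 4 and the ERB-based bandwidth b = 1.019 · ERB(f_c) with ERB(f) ≈ 24.7 + 0.108 f are conventional values from the auditory-modelling literature, assumed here rather than taken from the text.

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.05, order=4, a=1.0, phi=0.0):
    """Impulse response g(t) = a * t**(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phi), Eq. (8)."""
    t = np.arange(0, duration, 1.0 / fs)       # time axis in seconds
    b = 1.019 * (24.7 + 0.108 * fc)            # bandwidth from the ERB of the centre frequency
    return a * t**(order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)
```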

3.3 How Many MFCC and GTCC Coefficients?

Traditionally, the first 12–14 coefficients are computed for the MFCC analysis and 42 coefficients for the GTCC analysis. These lower-order coefficients retain the information about the formants and the spectral envelope (vocal tract frequency response), which is what the analysis requires, whereas the higher-order coefficients retain information about the fast-changing spectral details (glottal pulse excitation).

3.4 Why Discrete Cosine Transform Instead of IDFT?

  • It is closely related to the Fourier transform but simpler to compute.

  • The discrete cosine transform yields only real-valued coefficients, whereas the IDFT produces both real and imaginary parts.

  • The discrete cosine transform decorrelates the energies in the different mel bands.

  • It reduces the number of coefficients needed to represent the spectrum (see the sketch after this list).
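A minimal sketch of this step is shown below: the type-2 DCT of one frame's log mel-band energies yields real-valued cepstral coefficients, of which only the first few are kept. The 26 mel bands and 13 retained coefficients are illustrative assumptions.

```python
import numpy as np
from scipy.fftpack import dct

log_mel_energies = np.random.randn(26)                               # placeholder log filter-bank outputs
cepstral_coeffs = dct(log_mel_energies, type=2, norm='ortho')[:13]   # real-valued, low-dimensional
```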

4 K-means Clustering [3]

K-means clustering belongs to the family of exclusive clustering methods, in which each data point belongs to exactly one group. It is a vector quantization method originating in signal processing that aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean (cluster centre or centroid), the centroid serving as a prototype of the cluster. K-means clustering is used here rather than hierarchical clustering because k-means operates directly on the observations and produces a single level of clusters, instead of building a multilevel hierarchy from the dissimilarity between every pair of observations. The variable K denotes the number of clusters. The algorithm runs iteratively, assigning each data point to one of the K groups based on the features provided: it evaluates the Euclidean distance between each data point and the cluster centres and assigns the point to the nearest cluster. Based on the mel-frequency cepstral coefficients and gammatone cepstral coefficients extracted from the training set of speakers, k-means clusters are formed. The algorithm iterates until the assignment of data points to clusters stops changing; at the end of each run, it records the clusters and their total variance, repeats the procedure from a different starting point, and finally selects the result with the lowest variance.

The value of K is found by trial and error, starting from K = 1. K = 1 is the worst case, as the variation within the single cluster is largest; each time the number of clusters is increased, the variation decreases, and when the number of clusters equals the number of data points the variation becomes zero. In this work, the optimal value is found to be K = 4, beyond which a further increase in the number of clusters does not reduce the variation much; this value of K is termed the elbow point. K-means is preferred over the other classifiers mentioned because it is simple to implement and scales to large datasets such as the Berlin database, and it generalizes readily to clusters of different sizes. Each cluster represents a group of a speaker's feature vectors; the cluster centroids are formed using the k-means algorithm, and the test data are finally classified against them.
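The sketch below is a hedged illustration of this VQ scheme using scikit-learn's `KMeans`: one codebook of K = 4 centroids is fitted per speaker, and a test utterance is assigned to the speaker whose codebook gives the minimum average distortion. The function and variable names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(features_per_speaker, k=4):
    """Fit one k-means codebook (VQ centroids) per speaker.
    features_per_speaker: dict speaker_id -> array of shape (n_frames, n_coeffs)."""
    return {spk: KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats).cluster_centers_
            for spk, feats in features_per_speaker.items()}

def identify(test_features, codebooks):
    """Assign the test utterance to the speaker with minimum average VQ distortion."""
    def distortion(centroids):
        # Euclidean distance of every test frame to every centroid of this speaker.
        d = np.linalg.norm(test_features[:, None, :] - centroids[None, :, :], axis=2)
        return d.min(axis=1).mean()          # each frame matched to its nearest centroid
    return min(codebooks, key=lambda spk: distortion(codebooks[spk]))
```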

5 Convolutional Neural Network [4, 11]

5.1 Why not Fully Connected Neural Network?

We know that computers read images as pixels across three colour planes, i.e. the RGB planes. If an image has dimensions 28*28*3, then each neuron in the first hidden layer of a fully connected network needs 2352 weights; real-life images are much larger, say 200*200*3, in which case each neuron in the first hidden layer needs 120,000 weights. We would therefore have to deal with a huge number of parameters and a correspondingly large number of neurons, which can eventually lead to overfitting. That is why fully connected neural networks are not used for image classification.

5.2 Why Convolutional Neural Network?

In a CNN, a particular neuron is connected only to a small region of the previous layer, unlike in fully connected networks where it is connected to the complete layer. A CNN therefore requires far fewer weights and neurons.

5.3 What is Convolutional Neural Network?

  • A convolutional neural network is "a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex".

  • Visual cortex: "small regions of cells that are sensitive to specific regions of the visual field". In other words, individual neuronal cells in the brain respond only to edges of a certain orientation; for example, some neurons fire when exposed to vertical edges and others when exposed to horizontal or diagonal edges.

5.4 How CNN Works?

A CNN comprises four types of layers:

  • Convolution layer

  • ReLU layer

  • Pooling layer

  • Fully connected layer

In this project, spectrograms of the speech utterances of 15 input speakers are obtained. A spectrogram is a visual representation of a signal's strength, or loudness, over time across a range of frequencies. The spectrogram images of the 15 input speakers are fed to the layers (Fig. 3); a short sketch of the spectrogram computation is given after the figure.

Fig. 3 Overview of convolutional neural network
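The following minimal sketch shows one way to turn an utterance into a log-magnitude spectrogram "image" suitable as CNN input; the sampling rate, FFT window of 256 samples and 50% overlap are assumptions, and the random signal is only a placeholder for a real recording.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                     # assumed sampling rate
y = np.random.randn(fs)                        # placeholder for a 1-s digitized utterance
f, t, Sxx = spectrogram(y, fs=fs, nperseg=256, noverlap=128)
log_spec = 10 * np.log10(Sxx + 1e-10)          # log-magnitude spectrogram fed to the CNN
```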

Convolution Layer: In the convolution layer, the spectrogram images are compared piece by piece with the filter patches to obtain the convolved output. The steps involved are:

  • Line up the feature (filter patch) with an image patch

  • Multiply each image pixel by the corresponding feature pixel

  • Add the products

  • Divide the sum by the total number of pixels in the feature

ReLU Layer

  • Every negative value in the filtered images is replaced with zero.

  • This introduces non-linearity and prevents the values from summing up to zero.

  • ReLU stands for "rectified linear unit". The transfer function used is

    $$F\left( x \right) = \left\{ {\begin{array}{*{20}c} {x,\,{\text{for}}\,x > 0} \\ {0,\,{\text{for}}\,x \le 0} \\ \end{array} } \right.$$

Pooling Layer: In this layer, the size of the image stack is reduced. Steps involved are:

  • Choose window size (two or three)

  • Choose stride = 2

  • Window is moved over filtered images

  • Maximum value is taken from each window

Stacking up the Layers: The above layers are stacked one more time to further reduce the size of the image stack.

Fully Connected Layers

  • Actual classification takes place in this final layer.

  • Here, the filtered and shrunk feature maps are flattened, that is, their values are lined up into a single vector.

  • Here, the fully connected layer has 15 outputs, one per speaker; a layer-by-layer code sketch follows this list.
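The PyTorch sketch below mirrors the layer sequence described above (convolution, ReLU, pooling, stacked once more, then a fully connected layer with 15 outputs). The channel counts, kernel sizes and the 128 × 128 spectrogram input size are assumptions for illustration, not values reported in the text.

```python
import torch
import torch.nn as nn

class SpeakerCNN(nn.Module):
    """Minimal sketch of the described architecture; sizes are assumed, not from the paper."""
    def __init__(self, n_speakers=15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution layer
            nn.ReLU(),                                   # ReLU layer
            nn.MaxPool2d(kernel_size=2, stride=2),       # pooling layer (window 2, stride 2)
            nn.Conv2d(8, 16, kernel_size=3, padding=1),  # layers stacked one more time
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                # line the shrunk maps into one vector
            nn.Linear(16 * 32 * 32, n_speakers),         # fully connected layer, one output per speaker
        )

    def forward(self, x):                                # x: (batch, 1, 128, 128) spectrogram patches
        return self.classifier(self.features(x))

# Example forward pass on a dummy batch of spectrogram images.
logits = SpeakerCNN()(torch.randn(4, 1, 128, 128))        # -> shape (4, 15)
```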

5.5 Test Phase

About 80% of the spectrogram frames of each speaker are given as input to the convolution layer, which contains the predefined filters, in order to train the model. The remaining frames of each speaker's spectrogram are given as test inputs, and classification is carried out on them. During classification, two rates have to be considered for an accurate analysis:

  • False acceptance rate.

  • False rejection rate.

A false acceptance, also called a false positive, occurs when we accept a user who should actually have been rejected; a false rejection occurs when we reject a user who should actually have been accepted. As the false acceptance rate (FAR) decreases, the false rejection rate (FRR) increases, and vice versa (Fig. 4).

Fig. 4 Equal error rate

The point at which the two curves intersect is known as the equal error rate (EER); at this point, the percentage of false acceptances equals the percentage of false rejections. In practice, FAR and FRR are configured by adjusting the decision threshold so that the system is more or less strict.
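A minimal sketch of this trade-off is shown below: given match scores for genuine and impostor trials (higher score meaning more likely genuine), FAR and FRR are swept over candidate thresholds and the EER is read off where the two rates are closest. The score arrays are assumed inputs.

```python
import numpy as np

def far_frr_eer(genuine_scores, impostor_scores):
    """Sweep a decision threshold and locate the point where FAR ~ FRR (the EER)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # impostors accepted
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])    # genuine users rejected
    idx = np.argmin(np.abs(far - frr))                                   # crossing point
    return far, frr, (far[idx] + frr[idx]) / 2.0
```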

6 Results and Discussions

6.1 Feature Extraction Results

A random utterance spoken by each speaker is tested against the trained cluster models to predict the speaker: the minimum distance obtained in the testing phase determines the predicted identity, from which the recognition rate, a confusion matrix and an average recognition rate table can be derived. The results obtained for both features are tabulated in Tables 1 and 2.

Table 1 MFCC confusion matrix using clustering
Table 2 GTCC confusion matrix using clustering

From the confusion matrices, it is evident that the overall accuracy of the speaker identification system computed with gammatone cepstral coefficients (GTCC) is higher than that computed with mel-frequency cepstral coefficients (MFCC). Among the female and male speakers considered for the evaluation, the overall accuracy for female speakers is higher than that for male speakers with both feature extraction techniques (MFCC and GTCC). Although the overall female accuracy is higher, the confusion matrices show that some individual female speakers are recognized less accurately than some individual male speakers. This can be examined quantitatively in two ways.

6.1.1 Quantitative Analysis

  • Frequency Response Analysis

When the frequency responses of two female speakers and two male speakers are plotted, the two female speakers' responses tend to overlap over most of the frequency range, indicating a strong similarity between their spectral content, whereas the two male speakers show a clearer difference in frequency content (Figs. 5 and 6).

Fig. 5 Frequency response of two female speakers

Fig. 6 Frequency response of two male speakers

  • Correlation Coefficient Analysis

When correlation analysis is carried out between the speech utterances of two female speakers, between those of two male speakers, and between a male and a female speaker, the correlation between the two female speakers is found to be higher than that between the two male speakers. The correlation between a male and a female speaker is negative, which indicates that the overall performance of the system is good (Table 3).

Table 3 Correlation coefficient between speakers
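For reference, the correlation coefficient between two utterances can be computed as sketched below; the signals are truncated to a common length, and `np.corrcoef` returns the Pearson coefficient. The function name is ours.

```python
import numpy as np

def speech_correlation(x, y):
    """Pearson correlation coefficient between two digitized speech signals."""
    n = min(len(x), len(y))                  # compare over a common length
    return np.corrcoef(x[:n], y[:n])[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
```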

6.2 Convolution Neural Network Results

The testing accuracy or average recognition rate obtained after classification is shown in Fig. 7.

Fig. 7 CNN testing accuracy

So, in order to obtain an accurate classification model, the false acceptance rate (FAR) and false rejection rate (FRR) must be taken into consideration during modelling.

6.2.1 Effect of FAR and FRR

  • If FAR is set to the lowest possible value, FRR tends to increase sharply. In other words, when the FAR is very low, the system is more secure but less user-friendly, since it may falsely reject legitimate users.

  • If FRR is set to the lowest possible value, FAR tends to increase sharply. In other words, when the FRR is very low, the system is more user-friendly but less secure, since it may admit false users by mistake.

Therefore, it is a design choice whether to give priority to security or to user convenience. If neither is to be compromised, the operating point can be set where the two rates are equal, i.e. at the equal error rate.

7 Conclusions

In this paper, a speaker identification system is assessed using mel-frequency cepstral coefficients (MFCC) and gammatone cepstral coefficients (GTCC) as features, with a VQ-based minimum-distance classifier used to identify the speakers from their speech utterances. Additionally, a convolutional neural network (CNN) was deployed to model a classification system for speaker identification and to enhance the accuracy further. From the observations, it is evident that the GTCC feature tracks a particular speaker more precisely than the MFCC feature. Since GTCC captures the movement of the basilar membrane within the cochlea and thus mimics the human auditory system, speakers are predicted more accurately than with MFCC; the accuracy of the system using GTCC is therefore improved over MFCC. GTCC provides a better overall accuracy of 90.12% compared with the 77.79% overall accuracy of MFCC, and the accuracy obtained with the CNN model is 98.71%. A combination of features could be considered to improve the performance further. This speaker identification system would find applications in voice biometrics for authentication, in surveillance for eavesdropping on telephone conversations and in forensics for backtracking a suspect's voice during crimes. Similar technology is also used in Google's speech recognition system to unlock gadgets, where the speaker's voice serves as a password for privacy protection.