1 Introduction

Speaker recognition is the task of recognizing a person from a known set of voices. It is classified into speaker identification and speaker verification. Speaker identification determines who is talking from a known set of voice samples, whereas speaker verification accepts or rejects the claimed identity of a speaker, i.e., it makes a yes-or-no decision. Speaker identification is further classified into text-dependent and text-independent identification. Text-dependent speaker identification requires the same utterance in the training and testing phases, whereas in text-independent speaker identification the training and testing utterances differ. A speaker identification system consists of two distinct phases, a training phase and a testing phase. In the training phase, the features computed from the voice of each speaker are modeled and stored in a database. In the testing phase, the features extracted from the utterance of an unknown speaker are compared with the speaker models stored in the database to identify the unknown person.

The feature extraction step in speaker identification transforms the raw speech signal into a set of compact, less redundant feature vectors [1]. Features that emphasize speaker-specific properties are used to train the speaker model. Because feature extraction is the first step in speaker identification, the quality of speaker modeling and classification depends on it [2].

In the computation of Mel-frequency cepstral coefficients (MFCCs), the spectrum is estimated from windowed speech frames. The spectrum is then multiplied by a triangular Mel filter bank to perform auditory spectral analysis. Next, the logarithm of the filter bank outputs is taken, followed by the discrete cosine transform. An important step in the computation of MFCCs is the Mel filter bank [1, 3]. The MFCC technique computes speech parameters based on how humans hear and perceive sound [2]. However, MFCC does not consider the contribution of the piriform fossa, which gives rise to high-frequency components [4].

The auditory filter formed by the cochlea inside the human ear has a frequency bandwidth termed the critical band. The existence of auditory filters was demonstrated experimentally by Harvey Fletcher [5]. The auditory filters are responsible for frequency selectivity inside the cochlea, which helps the listener discriminate between different sounds. These critical band filters are designed using frequency scales, i.e., the Mel scale and the Bark scale [5]. MFCCs are widely used in speaker recognition systems [2, 6–8]. In previous work, many researchers have demonstrated the dominant performance of MFCCs and have contributed to enhancing the robustness of MFCC features as well as of speaker recognition systems; such efforts are reported in [2, 7–15]. The importance of the speaker-specific information present in wideband speech is demonstrated in [16].

This paper presents the importance of frequency band selection. Speaker-specific information extends beyond the telephone pass band [16]. The performance of the MFCC scheme in different frequency bands is demonstrated in this paper. The paper is organized as follows. Section 2 discusses the Mel frequency-warping scale and the MFCC computation process. The experimental set-up is discussed in Sect. 3. Section 4 discusses the results, followed by the conclusion in Sect. 5.

2 Mel Scale and MFCC

Nerves in the human auditory system respond differently to the various frequencies in a sound. For example, a 1 kHz tone triggers particular nerves, while the nerves tuned to other frequencies remain quiet. This response is nonlinear in frequency, and each nerve behaves like a band-pass filter that is roughly triangular in shape. This behaviour was observed in how the human ear perceives melody. The Mel scale is based on pitch perception [5]. It uses triangular filters and is approximately linear below 1 kHz and logarithmic above 1 kHz. The relationship between Mel-scale frequencies and linear frequencies is given by the following equation,

$$ F_{\text{mel}} = 2595 \log_{10} \left( 1 + \frac{F_{\text{Linear}}}{700} \right) \tag{1} $$
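As a small illustration of Eq. (1), the sketch below converts between linear frequency and the Mel scale. The function names `hz_to_mel` and `mel_to_hz` are hypothetical and do not come from the paper. The inverse mapping is obtained by solving Eq. (1) for the linear frequency.

```python
import math


def hz_to_mel(f_hz):
    """Map a linear frequency in Hz to the Mel scale, following Eq. (1)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    """Invert Eq. (1): map a Mel value back to a linear frequency in Hz."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # Band edges that appear in the experiments of this paper.
    for f in (1000.0, 2000.0, 4000.0, 4850.0, 7750.0):
        print(f"{f:7.1f} Hz -> {hz_to_mel(f):8.2f} mel")
```

With these constants, 1 kHz maps to approximately 1000 mel, which is the usual anchor point of the scale.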

Figure 1 shows the Mel-scale filter bank. The MFCC procedure starts with pre-emphasis, which boosts the higher frequencies. A high-pass filter with transfer function H(z) = 1 − az⁻¹, where 0.9 ≤ a ≤ 1, is generally used for pre-emphasis. The pre-emphasized signal is divided into frames of duration 10–30 ms with 25–50 % overlap to avoid loss of information. Over this short duration, the speech signal is assumed to remain stationary. Each frame is then multiplied by a Hamming window to reduce discontinuities at the frame edges. After windowing, the fast Fourier transform is used to estimate the frequency content of each frame. Next, the spectrum of the windowed frame is weighted by the Mel filter bank, which is based on the Mel scale given in Eq. (1). The vocal tract response is separated from the excitation signal by taking the logarithm of the Mel filter bank outputs, followed by the discrete cosine transform.
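The first three stages described above (pre-emphasis, framing, and Hamming windowing) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' MATLAB implementation. The function names are hypothetical, and the defaults (a = 0.95, 256-sample frames, 50 % overlap) are taken from the experimental set-up in Sect. 3.

```python
import math


def pre_emphasize(signal, a=0.95):
    """Apply y[n] = x[n] - a * x[n-1], i.e. the filter H(z) = 1 - a z^-1."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - a * signal[n - 1]
                          for n in range(1, len(signal))]


def split_into_frames(signal, frame_len=256, overlap=0.5):
    """Cut the signal into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def hamming(n_points):
    """Hamming window w[n] = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def preprocess(signal, a=0.95, frame_len=256, overlap=0.5):
    """Pre-emphasize, frame, and window a signal; returns a list of frames."""
    frames = split_into_frames(pre_emphasize(signal, a), frame_len, overlap)
    win = hamming(frame_len)
    return [[x * w for x, w in zip(frame, win)] for frame in frames]


if __name__ == "__main__":
    fs = 16000  # sampling frequency in Hz
    # Synthetic 440 Hz tone, 0.1 s long, used only for demonstration.
    tone = [math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(fs // 10)]
    frames = preprocess(tone)
    print(len(frames), "frames of", len(frames[0]), "samples")
```

At a sampling frequency of 16 kHz, a 256-sample frame lasts 16 ms, which lies within the 10–30 ms range given above.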

Fig. 1 Mel filter bank
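The sketch below shows one way to build a triangular Mel filter bank restricted to a chosen frequency band and to turn a windowed frame into cepstral coefficients. It rests on several assumptions not stated in the paper: the filters are applied to the power spectrum, the filter edges are equally spaced on the Mel scale between the band limits, and the DCT is an unnormalized DCT-II. A naive DFT replaces the FFT so that the example needs only the standard library. All function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_fft, sample_rate, f_low, f_high):
    """Triangular filters equally spaced on the Mel scale in [f_low, f_high].

    Returns n_filters rows with n_fft // 2 + 1 weights each (one per bin).
    """
    n_bins = n_fft // 2 + 1
    mel_lo, mel_hi = hz_to_mel(f_low), hz_to_mel(f_high)
    # n_filters + 2 edge points: each filter spans three consecutive points.
    edges = [mel_to_hz(mel_lo + (mel_hi - mel_lo) * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        left, centre, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_bins):
            f = k * sample_rate / n_fft  # frequency of DFT bin k in Hz
            if left < f <= centre:
                row.append((f - left) / (centre - left))
            elif centre < f < right:
                row.append((right - f) / (right - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def power_spectrum(frame, n_fft):
    """Naive DFT power spectrum for bins 0 .. n_fft / 2 (O(N^2) stand-in for an FFT)."""
    spec = []
    for k in range(n_fft // 2 + 1):
        re = sum(x * math.cos(2.0 * math.pi * k * n / n_fft)
                 for n, x in enumerate(frame))
        im = -sum(x * math.sin(2.0 * math.pi * k * n / n_fft)
                  for n, x in enumerate(frame))
        spec.append(re * re + im * im)
    return spec


def mfcc_from_frame(frame, bank, n_fft, n_ceps=12):
    """Log Mel energies followed by a DCT-II; the 0th coefficient is dropped."""
    spec = power_spectrum(frame, n_fft)
    energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
    log_e = [math.log(max(e, 1e-12)) for e in energies]  # floor avoids log(0)
    m = len(log_e)
    return [sum(log_e[j] * math.cos(math.pi * c * (j + 0.5) / m)
                for j in range(m))
            for c in range(1, n_ceps + 1)]


if __name__ == "__main__":
    fs, n_fft = 16000, 256
    bank = mel_filter_bank(20, n_fft, fs, 0.0, 4850.0)
    # Synthetic 1 kHz tone (unwindowed), used only for demonstration.
    frame = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(n_fft)]
    print(mfcc_from_frame(frame, bank, n_fft))
```

Changing `f_high` and `n_filters` in `mel_filter_bank` corresponds to the two quantities that are varied in the experiments of this paper.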

3 Experimental Set-up

In this paper, the effect of frequency band selection on the performance of Mel-frequency cepstral coefficients in a text-independent speaker identification system is evaluated on the TIMIT [17] database. The TIMIT database consists of recordings of 630 speakers, of whom 438 are male and 192 are female. There are ten different sentences for each speaker, sampled at 16 kHz, which makes a total of 6300 sentences recorded from eight dialect regions of the United States. For training the speaker model, eight sentences, five SX and three SI (approximately 24 s), were used. For testing, the two remaining SA sentences (about 3 s each) were used. All the experiments were performed on an HP Pavilion g6 laptop with a CPU speed of 2.50 GHz and 4 GB RAM, using MATLAB 8.1.

The speech signal is pre-emphasized with the first-order high-pass filter H(z) = 1 − 0.95z⁻¹. The signal is divided into frames of 256 samples with 50 % overlap, and each frame is multiplied by a Hamming window. The spectrum of the windowed signal is calculated by the fast Fourier transform (FFT). The spectrum is multiplied by the Mel filter bank, followed by the logarithm and the discrete cosine transform (DCT), to obtain the MFCCs. A speaker model is generated for each speaker from the MFCCs using vector quantization (LBG algorithm) and is stored in the database. In the testing phase, the MFCC features of an unknown speaker are extracted, and the Euclidean distance between these features and each speaker model stored in the database is calculated. The speaker is recognized on the basis of the minimum Euclidean distance between the test MFCC features and the stored speaker models. The experiments are first carried out for different numbers of MFCC filters, i.e., 20 and 29, in the frequency band 0–4 kHz. Next, the upper frequency limit is varied up to 7.75 kHz with 20 MFCC filters. Finally, the number of MFCC filters is varied as 13, 20, and 29 in the significant frequency band to observe the average recognition rate. In each experiment, the first 12 cepstral coefficients, excluding the 0th coefficient, are selected, and the number of clusters in vector quantization is 32.
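The modeling and matching steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the LBG codebook is grown by binary splitting with a fixed number of refinement iterations (rather than a convergence threshold), the codebook size is assumed to be a power of two, and the match score is the Euclidean distance to the nearest codeword averaged over all test vectors. The function names, the speaker labels, and the two-dimensional toy features are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, codebook):
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def lbg_codebook(vectors, size=32, eps=0.01, n_iter=10):
    """Grow a codebook by splitting (1, 2, 4, ...) with k-means style refinement."""
    codebook = [mean_vector(vectors)]
    while len(codebook) < size:
        # Split every codeword into a slightly perturbed pair.
        codebook = [[c * (1.0 + s) for c in cw]
                    for cw in codebook for s in (eps, -eps)]
        for _ in range(n_iter):
            cells = [[] for _ in codebook]
            for v in vectors:
                cells[nearest(v, codebook)].append(v)
            # An empty cell keeps its previous codeword.
            codebook = [mean_vector(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
    return codebook


def average_distortion(vectors, codebook):
    """Mean Euclidean distance from each vector to its nearest codeword."""
    total = sum(min(sq_dist(v, cw) for cw in codebook) ** 0.5 for v in vectors)
    return total / len(vectors)


def identify(test_vectors, speaker_models):
    """Return the enrolled speaker whose codebook gives the smallest distortion."""
    return min(speaker_models,
               key=lambda name: average_distortion(test_vectors,
                                                   speaker_models[name]))


if __name__ == "__main__":
    rng = random.Random(0)

    def cloud(cx, cy, n):
        # Toy 2-D "features"; real MFCC vectors would have 12 dimensions.
        return [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)] for _ in range(n)]

    models = {"spk_a": lbg_codebook(cloud(2.0, 2.0, 200), size=4),
              "spk_b": lbg_codebook(cloud(-3.0, 1.0, 200), size=4)}
    print(identify(cloud(2.0, 2.0, 50), models))
```

In the set-up of this paper, `size` would be 32 and each vector would hold the 12 selected cepstral coefficients of one frame.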

4 Results and Discussion

The frequency band 0–4 kHz is analyzed for 20 and 29 MFCC filters. This band is analyzed in two steps: first the band 0–2 kHz, and then the band 0–4 kHz. The recognition rate in percent is calculated as

$$ {\text{Recognition}}\;{\text{Rate}} = \frac{{{\text{Number}}\,{\text{of}}\,{\text{correct}}\,{\text{matches}}}}{{{\text{Total}}\,{\text{number}}\,{\text{of}}\,{\text{test}}\,{\text{speakers}}}} \times 100\% $$
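The recognition rate defined above can be computed as in the following sketch. The function name and the speaker labels are hypothetical, and the example values are illustrative only; they are not results from this paper.

```python
def recognition_rate(predicted, actual):
    """Percentage of test speakers whose predicted identity is correct."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


if __name__ == "__main__":
    # Four hypothetical test speakers, three of them identified correctly.
    actual = ["spk1", "spk2", "spk3", "spk4"]
    predicted = ["spk1", "spk2", "spk4", "spk4"]
    print(recognition_rate(predicted, actual))  # prints 75.0
```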

Table 1 and Fig. 2 show the average recognition rate observed in these bands.

Table 1 Recognition rate for frequency range 0–4 kHz

Fig. 2 Effect on recognition rate by varying MFCC filters up to 8 kHz

It is observed that the frequency band 0–4 kHz provides better resolution than the frequency band 0–2 kHz: an average recognition rate of 95.95 % is observed for 20 MFCC filters in the band 0–4 kHz. This indicates that speaker-specific information is present up to 4 kHz. Also, varying the number of filters in these bands has less effect on the recognition rate than varying the frequency band. Thus, in addition to the number of filters, it is important to select a frequency band that offers good resolution for speaker identification.

In the subsequent experiments, 20 MFCC filters are used and the frequency band is varied up to 7.75 kHz. Table 2 and Fig. 3 show the effect of varying the frequency band on the recognition rate.

Table 2 Effect on average recognition rate by varying frequency band up to 7.75 kHz for 20 MFCC filters

Fig. 3 Effect on average recognition rate by varying frequency band for 20 MFCC filters

From Table 2, it is observed that 0–4.85 kHz is the significant frequency band, since the maximum average recognition rate of 97.37 % is achieved in this band for 20 MFCC filters. Beyond this band, the average recognition rate decreases, as shown in Table 2. This shows that speaker-specific information extends beyond 4 kHz, and therefore the choice of frequency band is important. Next, in the significant frequency band, i.e., 0–4.85 kHz, the number of MFCC filters is varied and the effect on the average recognition rate is observed. Table 3 shows the effect of varying the number of MFCC filters in the frequency band 0–4.85 kHz.

From Table 3 and Fig. 4, it is observed that varying the number of filters in the significant frequency band brings little improvement in the average recognition rate.

Table 3 Effect of varying MFCC filters in frequency band 0–4.85 kHz

Fig. 4 Effect of varying MFCC filters in frequency band 0–4.85 kHz

5 Conclusion

In this paper, the significance of selecting the frequency band for Mel-frequency cepstral coefficients (MFCC) in speaker identification is demonstrated. First, the frequency band 0–2 kHz is selected and the number of MFCC filters is varied in this band. Next, the band is extended to 0–4 kHz and the number of MFCC filters is again varied. It is found that considerably more speaker-specific information is present in the frequency band 0–4 kHz than in 0–2 kHz. Further, the frequency band is varied up to 7.75 kHz. The maximum average recognition rate of 97.37 % is achieved in the frequency band 0–4.85 kHz for 20 MFCC filters, which indicates that speaker-specific information is present up to 4.85 kHz. Beyond this band, the recognition rate decreases. In the significant frequency band 0–4.85 kHz, the number of MFCC filters is varied as 13, 20, and 29, and little improvement in the average recognition rate is observed.