1 Introduction

The primary goal of speech analysis is to characterize efficiently the information present in the speech signal, either to identify what is spoken or to identify the speaker. Since both speech recognition and speaker identification involve pattern recognition, the speech analysis techniques used for the two tasks are very similar. In speech signal analysis, the kind of information retained in the feature vector depends on the application for which the speech signal is analyzed. For example, speaker-dependent attributes are fully relevant for speaker identification, but they are often superfluous for speech recognition. In speaker identification, N comparisons are required between the test pattern and the N stored enrolled patterns. The speaker is identified among the N speakers in the database on the basis of the minimum probability of error.

The speaker can be identified either in relation to the text spoken or independently of it. The former task is called text-dependent speaker identification and the latter is called text-independent speaker identification. Text-dependent identification is the simpler of the two: a small set of specific words is used in the enrollment phase, and the speaker must use words from the same set in the test phase for correct identification. A text-independent speaker identification system imposes no limitation on the words or phrases that can be used for identifying the speaker. Since the speaker is free to use any utterance during testing, irrespective of the utterance used during enrollment, this mode of speaker identification is comparatively complex and challenging.

Efficient representation of the speaker-oriented information present in the speech signal is of utmost importance for a good text-independent speaker identification system. The speaker-dependent aspects of speech therefore have to be represented economically, with reduced dimensionality. The selected features must have large inter-speaker variability and small intra-speaker variability, and must be robust against mimicry. Different researchers have investigated features such as MFCC, BFCC, Perceptual Linear Prediction (PLP) coefficients [1], and Linear Predictive Residual Cepstral Coefficients (LPRCC). A few researchers have used the delta (Δ) and delta–delta (Δ–Δ) coefficients of the features to improve the identification accuracy. Most researchers agree on the superiority of MFCC features over other features in speaker identification and speech recognition. Gaussian mixture model (GMM) modeling with multivariate Gaussian distributions represents the characteristics of the human vocal tract well, so the identity of the individual speaker is reflected in the model. The performance of the GMM degrades under noisy conditions [2]. Compared to the GMM, the universal background model (UBM) based GMM reduces the dimensionality and computational complexity. The universal background model captures the overall characteristics of the population in a single pool [3, 4] and, at the same time, adapts that pool to the individual speaker.

The main challenges for an accurate speaker identification system are noise from hostile environments and channel distortions [5]. When speech recordings are made using microphones with different sensitivities, the channel effects become prominent. Variations in the speech spectral components due to noise can be handled effectively by RASTA processing [6]. This improves the performance of the identification system in the presence of convolutional and additive noise [7]. Cepstral mean normalization minimizes the degradation in the perceived quality of speech by means of channel equalization [8].

In the proposed work, the combined RASTA–MFCC feature is analyzed for different filter bank structures, with the aim of improving the performance of the identification system in noisy environments. The experimental results show that the RASTA–MFCC feature outperforms the conventional MFCC feature in unknown channel and noisy environments.

The paper is organized as follows. The overall speaker identification system and the preprocessing are discussed in Sect. 2. The new RASTA–MFCC feature extraction process is explained in detail in Sect. 3. Section 4 describes the design of the new quadrilateral filter bank structure. Section 5 deals with the modeling of features using the GMM–UBM technique and analyzes the results obtained. Finally, Sect. 6 concludes the proposed work.

2 Speaker Identification Systems

Speaker identification is carried out in two stages, as shown in Fig. 1. In the enrollment (training) stage, the speech samples are represented as patterns and the resulting patterns/vectors are modeled. In the testing stage, the test pattern is compared one by one with all the stored models by means of the log likelihood ratio, and the model giving the minimum probability of error is selected.

Fig. 1 Speaker identification system

For dimensionality reduction, the voiced and unvoiced regions are separated. Since the speaker-specific information is present in the voiced region [9], only that region is retained for further processing. The unvoiced and silence regions are removed from the speech samples using an energy based thresholding technique, which reduces the computational requirements. To handle the time-varying nature of the speech signal, the voiced regions of the speech are segmented into frames, and each frame is windowed to provide smooth tapering.
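
The energy based thresholding step can be sketched as follows. The paper does not state the block length or the threshold, so the function name `remove_low_energy`, the 256-sample block and the threshold of 10 % of the mean block energy are all illustrative assumptions.

```python
import numpy as np

def remove_low_energy(signal, block_len=256, rel_threshold=0.1):
    """Drop blocks whose short-time energy falls below a relative threshold.

    block_len and rel_threshold are assumed values, not taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    num_blocks = len(signal) // block_len
    if num_blocks == 0:
        return np.empty(0)
    # Split the signal into non-overlapping blocks (trailing samples are dropped).
    blocks = signal[:num_blocks * block_len].reshape(num_blocks, block_len)
    # Short-time energy of each block.
    energy = np.mean(blocks ** 2, axis=1)
    # Keep only blocks that exceed a fraction of the average energy.
    keep = energy > rel_threshold * energy.mean()
    return blocks[keep].reshape(-1)
```

A relative threshold is used here so that the sketch does not depend on the recording level; a fixed absolute threshold would be an equally plausible reading of the text.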

Even though the speech signal is only quasi-periodic, when it is processed in frames of 10–30 ms duration its characteristics resemble those of stationary and periodic signals, mainly where vowels occur. Biological production of speech is essentially a filtering operation in which voiced sound is produced by a periodic source exciting a vocal tract filter. Over the duration of a frame, the speech is treated as a stationary signal because its characteristics change only gradually between sounds. If framing is done without overlap between frames, information at the transition between adjacent frames may be lost. Usually an overlap of more than 50 % of the frame size yields better results [10]. Research on speech and speaker recognition widely agrees that the Hamming window is well suited to speech processing applications. The window length is kept the same as the frame length. The choice of window shape for producing the desired smoothing [11] depends on its effect in speech analysis. Windowing is applied to each frame in order to taper the signal towards zero at the beginning and the end of the frame.
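
Framing and windowing can be sketched as below. The frame length of 256 samples and the 50 % overlap are the values reported in Sect. 5; the function name `frame_and_window` and the choice to drop an incomplete final frame are assumptions made for this illustration.

```python
import numpy as np

def frame_and_window(signal, frame_len=256, overlap=0.5):
    """Split a signal into overlapping frames and apply a Hamming window."""
    signal = np.asarray(signal, dtype=float)
    hop = int(frame_len * (1.0 - overlap))  # 128 samples for 50 % overlap
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    # The window has the same length as the frame, as stated in the text.
    return signal[idx] * np.hamming(frame_len)
```

Note that the Hamming window tapers each frame to 0.08 of its amplitude at the edges rather than exactly to zero.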

3 Noise Robust RASTA–MFCC

RASTA filtering is applied to the windowed speech signal to minimize the effects of noise, especially convolutional and additive noise [12]. Filtering is followed by the extraction of MFCC from the RASTA filtered signal in order to yield the RASTA–MFCC features. The steps followed in obtaining the RASTA–MFCC feature are depicted in Fig. 2.

Fig. 2 RASTA–MFCC feature extraction process

RASTA processing improves the performance of a recognizer in noisy conditions. It compensates for the effect of abrupt spectral changes in the speech signal by means of filtering. Fast spectral changes between consecutive frames are smoothed by low pass filtering [13]. In general, the bigger the auditory structures, the greater the sensitivity to lower speech/sound frequencies. Within the mammal family, humans have relatively low sensitivity to lower frequency sounds. RASTA processing involves computing the critical-band power spectrum, applying a static nonlinear compressing transformation, filtering the time trajectory of each compressed spectral component, and multiplying by the equal loudness curve. Finally, an all-pole model of the spectrum is computed.
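
The trajectory filtering step can be illustrated with the classic RASTA band-pass filter, H(z) = 0.1 (2 + z⁻¹ − z⁻³ − 2z⁻⁴) / (1 − 0.98 z⁻¹), written here in causal form. The paper does not give its filter coefficients, so this transfer function, the pole value of 0.98 and the function name `rasta_filter` are assumptions taken from the standard RASTA formulation. The sketch filters the time trajectory of each log-compressed band energy.

```python
import numpy as np

def rasta_filter(log_spec, pole=0.98):
    """Band-pass filter each band's time trajectory (rows = frames).

    Implements y[t] = pole*y[t-1] + 0.1*(2x[t] + x[t-1] - x[t-3] - 2x[t-4]).
    Coefficients are the standard RASTA values, assumed rather than quoted.
    """
    x = np.asarray(log_spec, dtype=float)
    numer = np.array([0.2, 0.1, 0.0, -0.1, -0.2])
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        # FIR part: weighted sum of the current and four previous frames.
        for k, b in enumerate(numer):
            if t - k >= 0:
                y[t] += b * x[t - k]
        # IIR part: leaky integration of the previous output.
        if t > 0:
            y[t] += pole * y[t - 1]
    return y
```

Because the numerator coefficients sum to zero, a constant offset in the log spectrum (for example a fixed channel gain) decays to zero at the output, which is the property that makes the filter useful against convolutional distortion.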

The lower cutoff frequency of the band-pass filter determines the slowest spectral change that is passed, so that slower variations are suppressed, whereas the upper cutoff frequency determines the fastest spectral change that is preserved. Computation of the squared magnitude of the FFT follows RASTA filtering. Pre-emphasis boosts the energy of the high frequency content of the squared magnitude spectrum. The pre-emphasis that equalizes the spectral tilt of speech is given in Eq. (1), with the pre-emphasis factor α set to 0.97.

$$\hat{s}\left( n \right) = s\left( n \right) - \alpha s\left( {n - 1} \right)$$
(1)

where s(n) is the nth sample of the speech signal, s(n − 1) is the (n − 1)th sample of the speech signal, and \(\hat{s}\left( n \right)\) is the nth sample of the pre-emphasized signal.
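
Equation (1) translates directly into a first-order difference. In this minimal sketch the function name `pre_emphasis` is hypothetical, and passing the first sample through unchanged is an assumption, since Eq. (1) does not define s(−1).

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Apply s_hat(n) = s(n) - alpha * s(n - 1) from Eq. (1)."""
    signal = np.asarray(signal, dtype=float)
    if signal.size == 0:
        return signal
    # The first sample has no predecessor, so it is kept as it is (assumption).
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```

For a constant input the output after the first sample is 1 − α = 0.03 times the input, which shows how strongly the low-frequency content is attenuated.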

Human auditory perception is a nonlinear process. The mapping from linear frequency to the Mel scale resembles the human auditory pattern. As shown in Fig. 3, the Mel scale mapping is approximately linear for frequencies up to 1 kHz and logarithmic above that. The relation between the Hertz and Mel scales [14] is given as follows.

$$mel\left( f \right) = 2595\log_{10} \left( {1 + \left( {\frac{{f_{\text{Hz}} }}{700}} \right)} \right)$$
(2)
Fig. 3 Relation between Mel and linear frequency
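
Equation (2) and its inverse can be written as two small helpers. The function names are hypothetical; the inverse is obtained by solving Eq. (2) for the frequency in Hz.

```python
import math

def hz_to_mel(f_hz):
    """Eq. (2): map a frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of Eq. (2): map a Mel value back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

With these constants 1000 Hz maps to almost exactly 1000 Mel, which is the anchoring point of the scale.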

Conventionally, the triangular critical-band filters span the Nyquist range. The transforms of the filters are made symmetrical about the Nyquist frequency. As shown in Fig. 4, the Mel axis filter bank is constructed with 40 non-uniform filters. In order to have a smooth transition between adjacent critical bands and to preserve the correlation among them, a Gaussian filter bank is also developed with 40 non-uniform filters, as shown in Fig. 5.

Fig. 4 Triangular filter bank structure

Fig. 5 Gaussian filter bank structure
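
A triangular Mel filter bank of the kind shown in Fig. 4 can be sketched as follows. The 40 filters and the 16 kHz sampling rate come from the paper; the FFT size of 256, the equal spacing of the filter edges on the Mel axis and the unit peak height are common conventions that are assumed here, and the function name is hypothetical.

```python
import numpy as np

def mel_triangular_filterbank(num_filters=40, n_fft=256, sample_rate=16000):
    """Return a (num_filters, n_fft // 2 + 1) matrix of triangular filters."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Edges equally spaced on the Mel axis from 0 Hz to the Nyquist frequency.
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    # Centre frequency of every FFT bin up to Nyquist.
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft

    bank = np.zeros((num_filters, bin_freqs.size))
    for m in range(num_filters):
        lo, ctr, hi = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        rising = (bin_freqs - lo) / (ctr - lo)
        falling = (hi - bin_freqs) / (hi - ctr)
        # The triangle is the lower of the two slopes, clipped at zero.
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank
```

Multiplying this matrix by the squared magnitude spectrum of a frame gives one energy per filter. A Gaussian bank would replace the two linear slopes with a bell curve centred on `ctr`; its width is not specified in the text, so it is not sketched here.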

After Mel scale warping, the Mel spectral coefficients are obtained, and their discrete cosine transform (DCT) is taken in order to yield the RASTA Mel frequency cepstral coefficients. MFCC extraction is similar to cepstrum calculation except that the frequency axis is on the Mel scale. Applying the DCT gives a reduced representation of the data. Equation (3) gives the RASTA–MFCC coefficients.

$$c\left( l \right) = \frac{\sqrt 2 }{M}\sum\limits_{m = 1}^{M} {X^{\prime } \left( m \right)\cos \left( {\frac{l\pi }{M}\left( {m - \frac{1}{2}} \right)} \right)}$$
(3)

where X′(m) are the Mel spectral coefficients, M is the number of filters.
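
Equation (3) can be evaluated with a cosine basis matrix, as in the sketch below. The number of retained coefficients (13) and the starting index l = 0 are assumptions, since the paper does not state the range of l; the function name is hypothetical.

```python
import numpy as np

def cepstral_coefficients(mel_spectrum, num_coeffs=13):
    """Eq. (3): c(l) = sqrt(2)/M * sum_m X'(m) cos(l*pi/M * (m - 1/2))."""
    x = np.asarray(mel_spectrum, dtype=float)
    M = x.shape[-1]                      # number of filters
    m = np.arange(1, M + 1)              # filter index m = 1..M
    l = np.arange(num_coeffs)[:, None]   # coefficient index (assumed 0-based)
    basis = np.cos(l * np.pi / M * (m - 0.5))   # shape (num_coeffs, M)
    return (np.sqrt(2.0) / M) * x @ basis.T
```

Keeping only the first few coefficients is what produces the reduced representation: a flat Mel spectrum, for instance, is described entirely by c(0), and all higher coefficients vanish.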

To minimize the effect of the channel on the speech recording, noise spectral subtraction [15] and Cepstral Mean Normalization (CMN) can be used. Since the former method requires an estimate of the noise [16], the latter method is preferred for mitigating the effect of a variable communication environment. In CMN, the average value of each RASTA–MFCC coefficient over the whole length of the speech is subtracted from that coefficient in every frame, as follows.

$$\hat{c}_{ik} = c_{ik} - \frac{1}{N}\sum\limits_{j = 1}^{N} {c_{ij} }$$
(4)

where \(c_{ik}\) is the ith feature element in the kth frame, \(\hat{c}_{ik}\) is its normalized value, and N is the number of frames in the utterance. Cepstral mean normalization is a post-processing step, and the resulting feature is the noise robust RASTA–MFCC feature.
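
Cepstral mean normalization as in Eq. (4) amounts to removing the per-coefficient mean over all frames. In this sketch the function name is hypothetical and the feature matrix is assumed to hold one frame per row.

```python
import numpy as np

def cepstral_mean_normalization(features):
    """Subtract the utterance-level mean of each coefficient (Eq. 4).

    features: array of shape (num_frames, num_coeffs).
    """
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)
```

A stationary channel adds a constant offset to every cepstral frame, so subtracting the mean removes that offset without needing a separate noise estimate.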

4 Quadrilateral Filter Bank Structure

In the triangular or Gaussian shaped Mel filter bank design, the speaker-dependent information around the lower frequency range of each filter bin is not given much weight, so less of its energy is captured than is possible. Hence, a new filter bank with quadrilateral shaped filter bins is designed, in which the lower frequency of the current filter bin is the first intermediate frequency of the previous filter bin. The center frequencies of the first and the last filter bins are determined from Moore and Glasberg's ERB (Equivalent Rectangular Bandwidth) expression as given by [17].

$$af_{{c_{i} }}^{2} + bf_{{c_{i} }} + c = \frac{1}{2}\left( {f_{{\mathop {high}\nolimits_{i} }} - f_{{\mathop {low}\nolimits_{i} }} } \right)$$
(5)
$$ERB_{i} = 24.7\left( {0.00437f_{{c_{i} }} + 1} \right)$$
(6)

For the remaining filter bins the center frequency is found using Eq. (7),

$$f_{{c_{i} }} = f_{{\mathop {low}\nolimits_{i} }} + \left( {i - 1} \right)\left[ {\frac{{f_{{\mathop {high}\nolimits_{i} }} - f_{{\mathop {low}\nolimits_{i} }} }}{N + 2}} \right]$$
(7)

The lower and upper frequency of each filter bin is found using the following Eqs. (8) and (9),

$$\left( {700 + f_{{c_{i} }} } \right)^{2} = \left( {700 + f_{{\mathop {low}\nolimits_{i} }} + 2ERB_{i} } \right)\left( {700 + f_{{\mathop {low}\nolimits_{i} }} } \right)$$
(8)
$$f_{{\mathop {high}\nolimits_{i} }} = f_{{\mathop {low}\nolimits_{i} }} + 2ERB_{i}$$
(9)

The two intermediate frequencies \(f_{\text{int} 1}\) and \(f_{\text{int} 2}\) are found using Eqs. (10) and (11),

$$f_{{\text{int} 1}} = 0.25 * \left( {f_{{\mathop {high}\nolimits_{i} }} + f_{{\mathop {low}\nolimits_{i} }} } \right)$$
(10)
$$f_{{\text{int} 2}} = 0.75 * \left( {f_{{\mathop {high}\nolimits_{i} }} + f_{{\mathop {low}\nolimits_{i} }} } \right)$$
(11)

where \(a = 6.23 \times 10^{-6}\), \(b = 93.39 \times 10^{-3}\), \(c = 28.52\), \(f_{{c_{i} }}\) is the center frequency of the ith filter bin, \(f_{{high_{i} }}\) is the upper frequency of the ith filter bin, \(f_{{low_{i} }}\) is the lower frequency of the ith filter bin, \(f_{{\text{int} 1}}\) is the first intermediate frequency of the ith filter bin, \(f_{{\text{int} 2}}\) is the second intermediate frequency of the ith filter bin, i = 1, 2, …, N, and N is the total number of filter bins in the filter bank.
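
The bandwidth relations in Eqs. (5), (6), (8) and (9) can be checked numerically with the sketch below. The function names are hypothetical. The closed form in `lower_edge` is not given in the paper: it is obtained here by writing u = 700 + f_low in Eq. (8), which gives u² + 2·ERB·u − (700 + f_c)² = 0, and taking the positive root.

```python
import math

# Constants of the quadratic ERB expression in Eq. (5).
A, B, C = 6.23e-6, 93.39e-3, 28.52

def erb_quadratic(fc_hz):
    """Left-hand side of Eq. (5): half the bandwidth of the bin, in Hz."""
    return A * fc_hz ** 2 + B * fc_hz + C

def erb_linear(fc_hz):
    """Eq. (6): ERB in Hz for a center frequency in Hz."""
    return 24.7 * (0.00437 * fc_hz + 1.0)

def lower_edge(fc_hz, erb_hz):
    """Solve Eq. (8) for f_low given the center frequency and the ERB."""
    k = 700.0 + fc_hz
    return -erb_hz + math.sqrt(erb_hz ** 2 + k ** 2) - 700.0

def upper_edge(f_low_hz, erb_hz):
    """Eq. (9): f_high = f_low + 2 * ERB."""
    return f_low_hz + 2.0 * erb_hz
```

At 1 kHz the two ERB expressions give about 128.1 Hz and 132.6 Hz respectively, so they agree to within a few percent in the region that carries most of the speech energy.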

The amplitudes of the four vertices of each quadrilateral bin are [0, 0.7, 1, 0]. The value 0.7 was found to be the optimum height for the second vertex after a series of tests with values in the range [0.5, 1.0]. The designed quadrilateral filter bank is placed on the Mel frequency scale in order to approximate closely the human cochlear membrane, and the resulting filter bank structure is shown in Fig. 6.

Fig. 6 Quadrilateral filter bank structure
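
The shape of a single quadrilateral bin can be sketched by piecewise linear interpolation between its four vertices. The text gives the vertex heights [0, 0.7, 1, 0] but does not state explicitly which frequency each vertex sits at, so the four frequencies are passed in as arguments here; the function name and the parameter names `f_second` and `f_peak` are hypothetical.

```python
import numpy as np

def quadrilateral_filter(freqs_hz, f_low, f_second, f_peak, f_high,
                         second_height=0.7):
    """Response of one quadrilateral bin at the given frequencies.

    The vertices are (f_low, 0), (f_second, second_height), (f_peak, 1)
    and (f_high, 0); their frequencies must be strictly increasing.
    """
    vertex_freqs = [f_low, f_second, f_peak, f_high]
    vertex_heights = [0.0, second_height, 1.0, 0.0]
    # Linear interpolation between vertices, zero outside the bin.
    return np.interp(freqs_hz, vertex_freqs, vertex_heights,
                     left=0.0, right=0.0)
```

Compared with a triangle of the same support, the raised second vertex gives more weight to the lower part of each bin, which is the motivation stated at the start of this section.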

5 Modeling and Result Analysis

The objective of GMM–UBM modeling is to estimate model parameters that match the distribution of the training feature vectors. The UBM is trained by estimating the parameter set \(\lambda_p\), which consists of the mean vectors, variance vectors and weights of the mixture components. The background model first captures the common characteristics of the population and then adjusts them to the individual. The Log Likelihood Ratio (LLR) score is the tool used to identify the speaker under test. The test speaker is compared with the enrolled speakers in terms of their likelihood, and the enrolled speaker with the maximum LLR score is declared the identified speaker. In GMM–UBM, each speaker model is represented by the background model together with the model adapted from it. The density function is calculated for GMM–UBM with 256 mixtures. The posterior probability of the ith mixture component given the feature vector \(x_t\) is

$$P_{r} \left( {i \mid x_{t} ,\lambda } \right) = \frac{{w_{i} p_{i} \left( {x_{t} } \right)}}{{\sum\nolimits_{j = 1}^{M} {w_{j} p_{j} \left( {x_{t} } \right)} }}$$
(12)
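
The posterior in Eq. (12) can be computed for all frames and mixture components at once. This sketch assumes diagonal covariance matrices, which matches the "variance vector" mentioned above, and works in the log domain for numerical stability; the function name is hypothetical.

```python
import numpy as np

def component_posteriors(x, weights, means, variances):
    """Eq. (12): Pr(i | x_t, lambda) for a diagonal-covariance GMM.

    x: (T, D) feature vectors; weights: (M,); means, variances: (M, D).
    Returns an array of shape (T, M) whose rows sum to one.
    """
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    diff = x[:, None, :] - means[None, :, :]              # (T, M, D)
    # Log of each Gaussian density p_i(x_t).
    log_pdf = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                      + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_num = np.log(weights) + log_pdf                   # log(w_i p_i(x_t))
    # Subtract the row maximum before exponentiating to avoid underflow.
    log_num -= log_num.max(axis=1, keepdims=True)
    num = np.exp(log_num)
    return num / num.sum(axis=1, keepdims=True)
```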

In the Expectation Maximization algorithm, the values of the model parameters change at every iteration. The new coefficients are calculated at every iteration using Eqs. (13), (14) and (15).

$$\mu_{i}^{new} = \alpha_{i}^{m} E(x)_{i} + \left( {1 - \alpha_{i}^{m} } \right)\mu_{i}$$
(13)
$$\left( {\sigma_{i}^{new} } \right)^{2} = \alpha_{i}^{v} E\left( {x^{2} } \right)_{i} + \left( {1 - \alpha_{i}^{v} } \right)\left( {\sigma_{i}^{2} + \mu_{i}^{2} } \right) - \left( {\mu_{i}^{new} } \right)^{2}$$
(14)
$$w_{i}^{new} = \left[ {\frac{{\alpha_{i}^{w} n_{i} }}{T} + \left( {1 - \alpha_{i}^{w} } \right)w_{i} } \right]\gamma$$
(15)

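where \(\alpha_{i} = n_{i} /\left( {n_{i} + r} \right)\), \(n_{i}\) is the soft count of feature vectors assigned to the ith mixture component, r is the relevance factor, T is the number of feature vectors, and γ is a scale factor that makes the adapted weights sum to one.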
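
The mean update of Eq. (13) can be sketched as follows, given the component posteriors of Eq. (12) for the enrollment data. The relevance factor of 16 is an assumed value, since the paper does not report the one it used, and the function name is hypothetical.

```python
import numpy as np

def adapt_means(x, posteriors, ubm_means, relevance=16.0):
    """Eq. (13): move each UBM mean towards the speaker's data.

    x: (T, D) enrollment features; posteriors: (T, M) from Eq. (12);
    ubm_means: (M, D). The relevance factor is an assumption.
    """
    x = np.asarray(x, dtype=float)
    posteriors = np.asarray(posteriors, dtype=float)
    ubm_means = np.asarray(ubm_means, dtype=float)
    n = posteriors.sum(axis=0)                               # soft counts n_i
    # E(x)_i: posterior-weighted mean of the data for each component.
    ex = posteriors.T @ x / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]                   # adaptation weight
    return alpha * ex + (1.0 - alpha) * ubm_means
```

Components that see little or no enrollment data have α close to zero and therefore stay at the UBM values, which is why adaptation remains reliable with short training segments.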

GMM distribution represents the best distribution of feature vectors for hypothesis H0. The UBM is used for modeling the alternative hypothesis H1 in the likelihood ratio test. For a given set of N background speaker models, the alternative hypothesis H1 is represented by Eq. (16),

$$p\left( {X \mid \lambda_{p}^{\prime } } \right) = \max \left( {p\left( {X \mid \lambda_{1} } \right),\;p\left( {X \mid \lambda_{2} } \right),\; \ldots ,\;p\left( {X \mid \lambda_{N} } \right)} \right)$$
(16)

With the UBM treated as the prior model, a speaker-specific model is derived by using maximum likelihood estimation. For T independent and identically distributed observations \(X = \{ x_{1} , x_{2} , \ldots , x_{T} \}\), the average log likelihood ratio is determined using Eq. (17).

$$E\left( {LLR\left( x \right)} \right) = \frac{1}{T}\sum\limits_{t = 1}^{T} {\left( {\log p\left( {x_{t} |\lambda_{target} } \right) - \log p\left( {x_{t} |\lambda_{ubm} } \right)} \right)}$$
(17)
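
Scoring with Eq. (17) and picking the enrolled speaker with the maximum score can be sketched as below. Each model is represented as a `(weights, means, variances)` tuple with diagonal covariances; that representation and all function names are assumptions made for this illustration.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Per-frame log p(x_t | lambda) for a diagonal-covariance GMM."""
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    diff = x[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                      + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    a = np.log(weights) + log_pdf
    # Log-sum-exp over the mixture components.
    a_max = a.max(axis=1, keepdims=True)
    return a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))

def average_llr(x, target_model, ubm_model):
    """Eq. (17): mean over frames of log p(x|target) - log p(x|ubm)."""
    return float(np.mean(gmm_log_likelihood(x, *target_model)
                         - gmm_log_likelihood(x, *ubm_model)))

def identify(x, speaker_models, ubm_model):
    """Return the enrolled speaker whose model gives the maximum LLR."""
    scores = {name: average_llr(x, model, ubm_model)
              for name, model in speaker_models.items()}
    return max(scores, key=scores.get)
```

Dividing by the number of frames T makes scores comparable between test utterances of different lengths.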

The MEPCO speech biometric database, with 50 speakers, is used in the proposed speaker identification system; 10 of these speakers additionally act as impostors. The recordings for the MEPCO speech biometric database were made using Gold Wave version 5.58 software and a condenser microphone, at a 16 kHz sampling rate in mono mode with PCM coding. In order to accommodate the time-varying nature of speech, the recordings were made on different days. In order to test the robustness of the proposed speaker identification system against real world noise, the recordings were made in a classroom environment with disturbances such as other students' speech, humming noise from the air conditioner, ceiling fan noise and electricity generator noise. Six speech samples were recorded from each speaker, and each recording is 3 s long.

Since only voiced speech carries useful speaker-specific information, the unvoiced and silence regions of the speech waveforms are removed using the energy based thresholding technique. Silence removal reduces the processing requirement by almost half. Since the speech signal is assumed to be short-time stationary, the voiced speech is divided into overlapping frames of 256 samples with 50 % overlap. Compared to other windows, the Hamming window produces much less spectral leakage; hence framing is followed by Hamming windowing. Pre-emphasis is done with a pre-emphasis factor of 0.97. In order to make the speech features robust against noise, RASTA filtering is performed on the windowed speech frames. To obtain the RASTA–MFCC feature, a filter bank of 40 filters is implemented for both the triangular and the Gaussian filter banks. The feature models are obtained using 256 mixtures in the GMM–UBM models. Out of the 6 recording sessions for every speaker, the first three sessions are used for training the speaker model and the remaining three sessions are used for testing the identity of the speaker. For every speaker, the ratio of the number of correctly identified sessions to the total number of test sessions is calculated. This ratio, expressed as a percentage, is treated as the identification accuracy for that speaker. The identification accuracy is calculated in the same way for all 50 speakers, and the average of these 50 values is the identification accuracy of the proposed speaker identification system.

$$Identification\;Accuracy\;(\% ) = \frac{Number\;of\;sessions\;correctly\;identified}{Total\;number\;of\;sessions} \times 100$$
(18)
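
The two-level averaging described above, Eq. (18) per speaker followed by the mean over speakers, can be written as a short sketch. The function names and the example numbers in the usage note are hypothetical and are not results from the paper.

```python
def identification_accuracy(num_correct, num_sessions):
    """Eq. (18): percentage of test sessions identified correctly."""
    return 100.0 * num_correct / num_sessions

def system_accuracy(per_speaker_accuracies):
    """Average of the per-speaker identification accuracies."""
    return sum(per_speaker_accuracies) / len(per_speaker_accuracies)
```

For example, a speaker with two of three test sessions identified correctly scores about 66.67 %, and two speakers scoring 100 % and 50 % give a system accuracy of 75 %.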

The proposed speaker identification system provides an identification accuracy of 94.5 % for the triangular filter bank design and 96 % for the Gaussian filter bank design. The performance of the RASTA–MFCC feature is compared with that of the traditional MFCC feature in a text-independent speaker identification system under noisy conditions. It is found that the RASTA–MFCC feature is more robust: it provides an identification accuracy of 97.67 % with the quadrilateral filter bank on the speech database of 50 speakers, while the MFCC method provides an accuracy of 88 %. GMM–UBM modeling is used for its effective resistance to impostor attacks. Bar charts comparing the performance of different GMM modeling methods with different MFCC features are shown in Figs. 7 and 8. A comparison of the performance of the various filter bank structures with RASTA–MFCC features modeled using GMM–UBM is shown in Fig. 9.

Fig. 7 Comparison between different features and modeling for triangular filter bank structure

Fig. 8 Comparison between different features and modeling for Gaussian filter bank structure

Fig. 9 Comparison between different filter bank structures

All 10 impostor speakers have been correctly identified. Table 1 shows that the proposed quadrilateral filter bank with the RASTA–MFCC feature outperforms other speaker identification techniques. The reason is that the proposed filter bank captures more of the low frequency, high energy speaker-specific information than of the high frequency information for speaker modeling.

Table 1 Comparison of various speaker identification works

6 Conclusion

In this paper, text-independent speaker identification in a noisy environment is implemented using RASTA–MFCC as the feature vector and GMM–UBM as the modeling technique. A new quadrilateral filter bank structure is designed, and its performance is found to be better than that of conventional filter banks. Experimental results show that the RASTA–MFCC features obtained with the quadrilateral filter bank are more robust to noise than those obtained with the triangular and Gaussian filter banks. UBM adaptation is faster than GMM training. The quality of the UBM-based model is better than that of the GMM when only short training segments, on the order of 2–5 s, are available. The only drawback is that the detection time of the UBM is longer than that of the GMM. Speaker identification systems may find applications in telephone banking, attendance systems, computer security, database access systems, and forensics.