1 Introduction

With the rapid growth in the use of portable devices, the Internet of Things (IoT) has seen considerable development. Users may carry these devices everywhere, so user authentication for them is an important issue. Since many devices can only be accessed after authentication, repeatedly entering account names and passwords is inconvenient. A natural user interface (NUI) is therefore needed to address this problem. Multi-touch, gesture-based, and speech-based NUIs are popular nowadays; they rely on devices such as touch screens, the Kinect, and microphones to receive and process signals. Among these NUIs, speech can also be used for identification because it is a unique biometric. In this work, the biometric characteristics of speech are used to recognize the user's identity.

Speaker authentication systems are of two types: text-dependent [2, 3, 20] and text-independent [12, 15, 17, 28]. A text-dependent system is similar to a traditional password-based authentication system: the speaker passes the system by uttering a pre-specified password. A text-independent system models the acoustic characteristics of a speaker and uses these acoustic models to identify speakers. Since a text-independent system is not limited to particular text, it is more convenient to use. This work proposes a text-independent speaker recognition system to authenticate a user.

In a speaker recognition system, audio feature extraction and classifier modeling are the two main components. Audio feature extraction follows the pre-processing stage. Previous research has reported various types of audio features, including amplitude and power in the temporal domain, chroma and harmonicity in the frequency domain, and cepstral coefficients in the cepstral domain. The most frequently used audio features are cepstral coefficients [14], which can be extracted by two different approaches [21]. The first is the parametric approach, which is designed to closely match the resonant structure of the human vocal tract that produces speech sounds; it is mainly based on linear predictive analysis, and the resulting linear predictive coefficients (LPCs) can be converted to LPC cepstral coefficients (LPCCs). The second approach is non-parametric and models the human auditory perception system; the Mel frequency cepstral coefficient (MFCC) [27] belongs to this category.
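
As a concrete illustration of the non-parametric approach, the short sketch below extracts 13-dimensional MFCCs from a speech file. It is a minimal example assuming the librosa library; the file name and parameter values are placeholders, not the exact front end used in this paper.

```python
# Minimal MFCC extraction sketch. Assumes the librosa library is installed;
# the file name, sampling rate, and coefficient count are illustrative choices.
import librosa

signal, sr = librosa.load("speech.wav", sr=16000)        # mono speech, resampled to 16 kHz
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)  # 13 cepstral coefficients per frame
print(mfcc.shape)                                        # (13, number_of_frames)
```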

The classifier also plays an important role in a speaker recognition system. The Gaussian mixture model (GMM) [10, 18, 31], the support vector machine (SVM) [1, 19, 25, 28], the neural network (NN) [15, 30], and hybrids of these models [4, 5, 11] are commonly used for this purpose. A GMM-based recognition system models the probability density function of a speech signal as a mixture of Gaussian components; the means, variances, and weights of a speaker's GMM are used to recognize the speaker by maximizing the log likelihood. The SVM is a binary classifier that makes decisions by finding the optimal hyperplane separating positive from negative data; it can also map speaker features into a high-dimensional space before classification. In many works, the GMM is combined with the SVM to recognize speakers. An NN can also replace the GMM in audio signal processing and can model multiple speakers in a single network. In this paper, an SVM is adopted as the classifier.

To make a speaker recognition system more useful, the following two issues must be considered. The first is that natural speech is easy to copy, and a speaker recognition system can be fooled by voice conversion software [9]. However, people seldom whisper, so recording whispered speech is much more difficult than recording normal speech; a whisper-based speaker recognition system is therefore safer than one based on normal speech [23]. The second issue is whether a speaker recognition system is robust against background noise. In practical situations, speech utterances rarely occur in isolation: the captured speech signal is usually accompanied by environmental noise, which degrades speaker recognition performance. In this paper, we address the second issue and present an audio-feature-based method to enhance speaker recognition performance in noisy environments. The feature is derived from the theory of spectral-temporal receptive fields (STRFs).

The STRF performs multi-resolution analysis in the spectral and temporal domains; it is a computational auditory model inspired by psychoacoustic and neurophysiological studies of the early and central stages of the mammalian auditory system [8]. This auditory analysis model consists of two primary stages. The first stage is an early auditory system that simulates processing at the auditory periphery and produces an auditory spectrum. The second stage is a model of the primary auditory cortex (A1) in the central auditory system, which is based on the assumption that each neuronal response area is tuned to a specific range of frequencies and intensities. In recent years, this theory has been applied to speech recognition by Woojay et al. [29], who used the STRF theory to propose a new feature selection method that parallels the computation of the MFCC. The authors' previous work [26] discusses the concept of the STRF and shows that the proposed STRF-based feature can be used in speech recognition. In this paper, we further exploit the possibility of using the STRF-based feature for speaker recognition in noisy environments. Our work is not the first attempt to use the STRF in speaker recognition: Chi et al. presented a masking-based speaker identification algorithm [7] that distinguishes speech from non-speech based on spectro-temporal modulation energies and can enhance speaker recognition in noisy environments.

The remainder of the paper is organized as follows. Section 2 provides an overview of the proposed speaker recognition system. Section 3 describes the STRF theory and the audio feature extraction method. Section 4 presents the speaker classifier. Section 5 shows the experimental results under various conditions, and Section 6 draws conclusions.

2 System overview

Figure 1 shows an overview of the proposed speaker recognition system. The system contains three modules, executed in the following order: the pre-processing module, the feature extraction module, and the recognition module. In the first module, framing and voice activity detection are performed as pre-processing of the input speech. In the second module, the MFCC and STRF-based features are extracted. Conventional speaker recognition usually adopts only MFCCs as acoustic features. The MFCC accounts for the nonlinear frequency resolution that simulates the hearing characteristics of the human ear, but this is only a crude approximation of the auditory periphery. Therefore, the STRF-based feature is fused with conventional MFCCs to form a more effective acoustic feature set for speaker recognition. The STRF theory comes from a physiological model of the mammalian auditory system and generates a cortical response, which we convert into the STRF-based feature. In the third module, SVM models are trained and tested for the speaker recognition task.

Fig. 1 Block diagram of proposed speaker recognition system

3 Speaker feature extraction

The STRF model has two stages [8, 26]. The first is an early auditory system that simulates the human hearing system and generates auditory spectrograms. The second is a model of the primary auditory cortex (A1) in the central auditory system.

3.1 Early auditory system

First, an affine wavelet transform of the signal s(t) is performed by passing the audio signal through a cochlear filter bank. The cochlear output y_C is obtained by

$$ {y}_C\left(t,f\right)=s(t){*}_th\left(t,f\right) $$
(1)

where h(t, f) denotes the impulse response of each filter, and ∗_t represents the convolution operation in the time domain.

Next, the cochlear output y_C is fed into the hair cell stage. This stage consists of a high-pass filter that converts pressure into velocity, a nonlinear compression function g(u) that models the saturation of the hair cell response, and a low-pass filter w(t) that reduces phase-locking on the auditory nerve:

$$ {y}_A\left(t,f\right)=g\left({\partial}_t{y}_C\left(t,f\right)\right){\ast}_tw(t) $$
(2)

The lateral inhibitory network (LIN) in the cochlear nucleus performs frequency selectivity. As in Eq. (3), its function can be approximated by taking the first-order derivative with respect to the tonotopic (frequency) axis and then passing the result through a half-wave rectifier.

$$ {y}_{LIN}\left(t,f\right)= \max \left({\partial}_f{y}_A\left(t,f\right),0\right) $$
(3)

Finally, the output of the first stage is obtained by convolving y_LIN(t, f) with a short window:

$$ y\left(t,f\right)={y}_{LIN}\left(t,f\right){\ast}_t\mu \left(t,\tau \right) $$
(4)

where y(t, f) is the resulting auditory spectrogram, \( \mu \left(t,\tau \right)={e}^{\frac{-t}{\tau }}u(t) \), and τ is the time constant in milliseconds. Figure 2 compares conventional Fourier spectrograms with auditory spectrograms.
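
A simplified, assumption-heavy sketch of Eqs. (1)-(4) is given below: a crude band-pass filter bank stands in for the cochlear filters, a tanh stands in for the compressive nonlinearity g(u), and leaky integration implements the short exponential window μ(t, τ). It illustrates the flow of the early auditory stage rather than reproducing the exact model of [8].

```python
# Simplified auditory-spectrogram sketch for Eqs. (1)-(4).
# Filter shapes, channel spacing, and constants are illustrative assumptions.
import numpy as np
from scipy.signal import butter, lfilter

def auditory_spectrogram(s, sr, n_channels=64, tau=0.008):
    # Eq. (1): cochlear filter bank, here crude 2nd-order band-pass filters
    # on a logarithmic frequency axis.
    edges = np.logspace(np.log10(100), np.log10(sr / 2 * 0.9), n_channels + 1)
    y_C = np.stack([
        lfilter(*butter(2, [edges[k], edges[k + 1]], btype="band", fs=sr), s)
        for k in range(n_channels)
    ])                                            # shape: (channels, time)

    # Eq. (2): hair-cell stage: temporal derivative, compressive
    # nonlinearity, and a simple low-pass smoothing filter.
    y_A = np.tanh(np.diff(y_C, axis=1, prepend=0.0))
    b, a = butter(1, 4000, btype="low", fs=sr)
    y_A = lfilter(b, a, y_A, axis=1)

    # Eq. (3): lateral inhibition approximated by a derivative across
    # channels followed by half-wave rectification.
    y_LIN = np.maximum(np.diff(y_A, axis=0, prepend=0.0), 0.0)

    # Eq. (4): leaky integration with time constant tau (short exponential window).
    alpha = np.exp(-1.0 / (tau * sr))
    y = lfilter([1 - alpha], [1, -alpha], y_LIN, axis=1)
    return y
```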

Fig. 2 Examples comparing Fourier spectrograms and auditory spectrograms: a Fourier spectrogram of a clean speech; b Fourier spectrogram of a noisy speech; c auditory spectrogram of a clean speech; d auditory spectrogram of a noisy speech

3.2 Model of primary auditory cortex (A1)

The so-called STRF is the product of a spatial impulse response h_S and a temporal impulse response h_T:

$$ STRF={h}_S\cdot {h}_T $$
(5)

Equations (6) and (7) give the spatial impulse response h_S and the temporal impulse response h_T, respectively.

$$ {h}_S\left(f,\omega, \theta \right)={h}_{scale}\left(f,\omega \right) \cos \theta +{\widehat{h}}_{scale}\left(f,\omega \right) \sin \theta $$
(6)
$$ {h}_T\left(t,\varOmega, \varphi \right)={h}_{rate}\left(t,\varOmega \right) \cos \varphi +{\widehat{h}}_{rate}\left(t,\varOmega \right) \sin \varphi $$
(7)

where ω and Ω denote the spatial density and velocity parameters of the spatial and temporal impulse response filters, respectively, ĥ denotes the Hilbert transform of the corresponding impulse response, and θ and φ represent the characteristic phases.

For an input auditory spectrogram y(t, f), the spectral-temporal response STRF(t, f, Ω, ω, φ, θ), i.e., the cortical representation, is calculated by

$$ STRF\left(t,f,\varOmega, \omega, \varphi, \theta \right)=y\left(t,f\right){\ast}_{tf}\left[{h}_S\left(f,\omega, \theta \right)\cdot {h}_T\left(t,\varOmega, \varphi \right)\right] $$
(8)

where ∗_tf represents the convolution operation in both the time and frequency domains.
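
To make Eq. (8) concrete, the sketch below convolves an auditory spectrogram with separable spectro-temporal filters, one per (rate, scale) pair, and keeps the magnitude of the response. Gabor-shaped filters are used here as stand-ins for the model's seed functions h_S and h_T, and the filter lengths and parameter grids are illustrative assumptions rather than the settings of [8].

```python
# Scale-rate analysis sketch for Eq. (8): magnitude response of the auditory
# spectrogram to separable spectral (scale) and temporal (rate) filters.
# Gabor-shaped filters are an illustrative substitute for the model's seed functions.
import numpy as np
from scipy.signal import fftconvolve

def gabor(length, cycles_per_sample):
    n = np.arange(length) - length // 2
    envelope = np.exp(-0.5 * (n / (length / 6.0)) ** 2)
    return envelope * np.cos(2 * np.pi * cycles_per_sample * n)

def cortical_response(y, scales, rates, frame_rate, channels_per_octave=24):
    # y: auditory spectrogram, shape (frequency_channels, time_frames)
    responses = np.empty((len(rates), len(scales)) + y.shape)
    for i, rate in enumerate(rates):            # rate in Hz (temporal modulation)
        h_T = gabor(64, rate / frame_rate)
        for j, scale in enumerate(scales):      # scale in cycles/octave (spectral modulation)
            h_S = gabor(32, scale / channels_per_octave)
            kernel = np.outer(h_S, h_T)         # separable 2-D STRF, as in Eq. (5)
            responses[i, j] = np.abs(fftconvolve(y, kernel, mode="same"))
    return responses                            # (rates, scales, frequency, time)

# Illustrative parameter grids (not the paper's exact settings).
scales = 2.0 ** np.arange(-3, 3.5, 0.5)         # cycles/octave
rates = np.array([2, 4, 8, 16, 32], dtype=float)  # Hz
```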

Figure 3 gives the scale-rate representation of the 36th frame of the speech signal in Fig. 2. The horizontal axis represents the rate parameter in both downward and upward directions, while the vertical axis presents the scale response. Figure 4 shows the MFCCs of the same speech signal. Comparing Fig. 4 with Fig. 3 shows that the STRF representation is more robust to noise than the MFCC.

Fig. 3 Scale-rate representation: a scale-rate representation of a clean speech; b scale-rate representation of a noisy speech

Fig. 4 MFCC representation: a MFCC representation of a clean speech; b MFCC representation of a noisy speech

3.3 STRF based feature extraction

The STRF representation reveals the joint spectral-temporal modulations of the auditory spectrogram. In the STRF, the low and high scales capture the formants and harmonics, respectively, and this scale information is utilized to derive the STRF-based features. The scale values from 2^{-3} to 2^{3} with intervals of 0.5 cycle/octave are taken to span the frequency range of the auditory spectrum. In this paper, three different STRF-based features are derived from the STRF representation [26]. The first STRF-based feature S(t, ω) is obtained by summing the STRF magnitude representation over all rates and frequencies:

$$ S(t,\omega)=\sum_f\sum_{\varOmega}\left|STRF(t,f,\varOmega,\omega,0,0)\right|,\qquad \omega =1,2,\dots,N_{\omega} $$
(9)

where N_ω is the number of scales. Note that the characteristic phases φ and θ in Eq. (9) are both set to zero [29].

Next, a non-linear function is applied to S(t, ω) to generate the second STRF-based feature S_L(t, ω). Here, a logarithmic function is utilized for this purpose.

$$ S_L(t,\omega)=\log\left(S(t,\omega)\right),\qquad \omega =1,2,\dots,N_{\omega} $$
(10)

A discrete cosine transform (DCT) [14] is applied to S_L(t, ω) to create the third STRF-based feature S_DL(t, k):

$$ S_{DL}(t,k)=\sum_{\omega =1}^{N_{\omega}}S_L(t,\omega)\cos\left(\frac{2\pi \omega k}{N_{\omega}}\right),\qquad k=1,2,\dots,N_k $$
(11)

where N_k is the feature dimension of S_DL(t, k), and N_k is smaller than or equal to N_ω.
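
Under the layout assumed in the previous sketch (a magnitude cortical response with axes rate, scale, frequency, time), Eqs. (9)-(11) reduce to a few array operations. The small constant added before the logarithm and the use of a standard DCT are implementation choices of this sketch, not specifications from the paper.

```python
# STRF-based features of Eqs. (9)-(11). `strf` is assumed to be the magnitude
# cortical response with axes (rate, scale, frequency, time), as sketched above.
import numpy as np
from scipy.fft import dct

def strf_features(strf, n_k=None):
    # Eq. (9): sum magnitudes over all rates and frequency channels
    # -> S(t, omega), arranged here as (scale, time).
    S = strf.sum(axis=(0, 2))

    # Eq. (10): logarithmic compression.
    S_L = np.log(S + 1e-12)                  # small constant avoids log(0)

    # Eq. (11): DCT across the scale axis, keeping the first N_k coefficients.
    S_DL = dct(S_L, axis=0, norm="ortho")
    if n_k is not None:
        S_DL = S_DL[:n_k]
    return S, S_L, S_DL
```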

4 Speaker classifier

Conventional speaker recognition uses a GMM to model the acoustic characteristics of speakers, with the GMM parameters estimated by maximizing the likelihood of the speaker data. Alternatively, the SVM has already proven its performance in speaker recognition [1, 17, 28]. An SVM discriminates between two classes of speaker data by training a nonlinear decision boundary.

Given speech features from two different speakers, an SVM finds the maximum-margin hyperplane [22]. Assume {(x_i, s_i), i = 1, 2, …, N} are the training speaker data, where a pair (x_i, s_i) denotes that the training speech feature x_i comes from speaker s_i, with s_i ∈ {+1, −1}. After the maximum-margin hyperplane is learned by an algorithm such as sequential minimal optimization [16], a test speech feature x is classified as s ∈ {+1, −1} based on the following speaker decision function:

$$ s\left(\mathbf{x}\right)=\mathrm{sign}\left(\left(\mathbf{w}\cdot \mathbf{x}\right)+b\right) $$
(12)

where w ⋅ x + b = 0 is the maximum-margin hyperplane. A kernel version of the speaker decision function can be derived as

$$ s\left(\mathbf{x}\right)=\mathrm{sign}\left({\displaystyle \sum_{i=1}^N{\mu}_ik\left(\mathbf{x},{\mathbf{x}}_i\right)}+b\right) $$
(13)

where k is a kernel function, and μ_i is a Lagrange multiplier.
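
A direct numerical reading of Eq. (13) is sketched below, with an RBF kernel as one possible choice of k; the multipliers μ_i, the bias b, and the support vectors are assumed to come from the SVM training step.

```python
# Kernel-form speaker decision function of Eq. (13).
# mu, b, and the support vectors are assumed to be given by SVM training.
import numpy as np

def rbf_kernel(x, xi, gamma=2.0):
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def decide(x, support_vectors, mu, b, kernel=rbf_kernel):
    score = sum(mu_i * kernel(x, x_i) for mu_i, x_i in zip(mu, support_vectors)) + b
    return 1 if score >= 0 else -1           # s(x) in {+1, -1}
```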

Recognition of multiple speakers can be solved by a multi-class SVM built from a set of two-class SVMs. In this paper, the multi-class SVM is constructed with the one-against-one method [13, 24], and a voting strategy determines the final speaker recognition result. For recognizing m speakers, the required number of two-class SVMs is m(m − 1)/2.
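
For illustration, the sketch below builds a one-against-one multi-class SVM with scikit-learn (whose SVC class wraps LIBSVM); the feature matrix and speaker labels are random placeholders, and the final check simply confirms that m(m − 1)/2 pairwise classifiers are created and combined by voting.

```python
# One-against-one multi-class SVM sketch. X is a (num_frames, num_features)
# matrix of fused MFCC/STRF features and y the speaker labels; both are placeholders.
import numpy as np
from sklearn.svm import SVC

m = 36                                        # number of speakers
X = np.random.randn(m * 50, 26)               # placeholder features
y = np.repeat(np.arange(m), 50)               # placeholder speaker labels

clf = SVC(kernel="rbf", decision_function_shape="ovo")
clf.fit(X, y)

# One-against-one builds m(m-1)/2 pairwise classifiers; the decision
# function therefore has m(m-1)/2 columns, and prediction uses voting.
print(clf.decision_function(X[:1]).shape[1] == m * (m - 1) // 2)  # True
predicted_speaker = clf.predict(X[:1])
```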

5 Experimental results

5.1 Experimental setting

Speech samples are taken from a dataset containing 36 speakers, with 37 speech clips recorded for each speaker. The duration of each clip for training and testing ranges roughly from 1 s to 2 min. In the experiments, all speech clips were divided into frames of 16 ms with an overlap of 8 ms between consecutive frames, and a Hamming window was applied to each frame. Finally, the MFCC feature and the proposed feature were extracted. Two-fold cross-validation was used to estimate the performance of the proposed speaker recognition system.
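
The framing and windowing step can be written as a short routine; a minimal sketch, assuming a 16 kHz sampling rate (the paper does not specify one), is:

```python
# Framing sketch: 16 ms frames with an 8 ms hop and a Hamming window.
# The 16 kHz sampling rate is an assumed value for illustration.
import numpy as np

def frame_signal(signal, sr=16000, frame_ms=16, hop_ms=8):
    frame_len = int(sr * frame_ms / 1000)                 # 256 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)                     # 128 samples at 16 kHz
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len   # assumes signal spans at least one frame
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    return frames                                          # shape: (n_frames, frame_len)
```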

The proposed STRF feature was computed with scales of 2^{−3}, 2^{−2}, …, 2^{3}. The SVM settings and parameters are the defaults of the LIBSVM tool [6]. A multi-class SVM with a radial basis function (RBF) kernel was used to train the models, and the gamma value of the RBF kernel was set to 2.
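
A scikit-learn rendering of this configuration (SVC wraps LIBSVM, so its defaults are close to the LIBSVM defaults), combined with the two-fold cross-validation described above, might look as follows; the data are placeholders.

```python
# Classifier configuration sketch: RBF kernel, gamma = 2, otherwise library
# defaults, evaluated with two-fold cross-validation. The fused feature matrix
# and speaker labels below are random placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

features = np.random.randn(36 * 20, 26)        # placeholder fused MFCC + STRF features
labels = np.repeat(np.arange(36), 20)          # placeholder speaker labels

clf = SVC(kernel="rbf", gamma=2.0)
scores = cross_val_score(clf, features, labels, cv=2)
print("mean recognition rate:", scores.mean())
```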

5.2 Comparison of the recognition systems

The first experiment compares the speaker recognition performance of 13-dimensional MFCCs (MFCC13), S, S_L, and S_DL; the results are listed in Table 1. MFCC13 yields a recognition rate of 93.07 %, the highest among all the individual features. The proposed STRF-based feature S_DL yields a recognition rate that is 17.72 % and 15.87 % higher than those of S_L and S, respectively. In the next experiment, MFCCs (MFCC13) were combined with the STRF-based features. The speaker recognition rate of each combined feature set is given in Table 2. The results reveal that all of the combined feature sets increase the recognition rate compared with MFCCs alone. The best result of 96.92 % is obtained when S_DL and MFCCs are adopted as the combined feature set.

Table 1 Performance of each feature
Table 2 Performance of combined features

Since an access control system may be used anywhere, it is important to consider speech in noisy environments. Table 3 shows the speaker recognition rates for noisy speech. In this experiment, noisy speech signals were generated by adding white noise to the clean speech at five different signal-to-noise ratio (SNR) levels: 20 dB, 15 dB, 10 dB, 5 dB, and 0 dB. In Table 3, the average recognition rate of S_DL is 14.56 % and 16.85 % higher than those of S and S_L, respectively; in other words, S_DL yields the most favorable results among the three STRF-based features. For the proposed S_DL feature, the speaker recognition rate differs by only 1.69 % between clean speech and 0 dB noisy speech, whereas the recognition rate of the MFCC feature decreases by 25.74 % from clean speech to 0 dB noisy speech, and S_DL outperforms the MFCC at 0 dB. These results show that the proposed S_DL feature is highly robust to noisy conditions. In addition, the best recognition rate is produced by fusing S_DL with MFCC13; the speaker recognition rate exceeds 90 % whenever the power of the speech signal is higher than the power of the noise. Conventional MFCCs can be interpreted as information derived from the energies of band-pass filters; combining the formant and harmonic information captured by the STRF-based feature with conventional MFCCs further improves speaker recognition performance.
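
The noisy test signals can be generated by scaling white Gaussian noise to a target SNR before adding it to the clean speech; a minimal sketch:

```python
# Add white Gaussian noise to clean speech at a target SNR (in dB).
import numpy as np

def add_white_noise(clean, snr_db):
    noise = np.random.randn(len(clean))
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example: the five SNR conditions used in Table 3 (placeholder clean signal).
for snr in (20, 15, 10, 5, 0):
    noisy = add_white_noise(np.random.randn(16000), snr)
```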

Table 3 Performance for noisy speech

In the speaker recognition field, a GMM with MFCCs is a widely used standard method. A GMM-based speaker recognition system with 13-dimensional MFCCs is therefore taken as the baseline, and its speaker recognition rates are listed in Table 4. Table 4 reveals that the proposed system outperforms the GMM baseline under both clean and various noisy conditions.

Table 4 Performance comparison with baseline

To compare the computational complexity of the different features, the feature extraction time for one second of audio signal is measured. The 13-dimensional MFCCs (MFCC13) and the proposed STRF-based feature require 0.18 s and 0.75 s, respectively. Although the computational time of the proposed STRF scale-based feature is about four times that of the MFCCs, the increase is acceptable. In addition, the SVM classification requires roughly 0.05 s per second of audio signal. Therefore, the proposed speaker recognition system meets the real-time requirement.

Our experiments were conducted to recognize 36 speakers. If the speaker recognition system had to deal with thousands of speakers, proximity indexes might be required, and the gain of several percentage points in recognition rate might vanish because of the higher dimensionality resulting from combining the STRF-based features with MFCCs. However, the main contribution of this paper is robust speaker recognition performance under noisy conditions. Taking audio with an SNR of 0 dB or 5 dB as an example, the recognition rate of the proposed combined feature set is about 18 % higher than that of conventional MFCCs. Therefore, the improvement of about 18 % in recognition rate remains significant even if a drop of several percentage points occurs after proximity indexes are applied.

6 Conclusions

This work proposes a robust speaker recognition system based on a combination of STRF-based features and MFCCs. The STRF model captures the spectral and temporal variations of an analyzed auditory spectrogram, and its scale features reveal characteristics in the spatial domain such as formants and harmonics. Therefore, the following new scale-based acoustic features are proposed: the energy of each scale, the logarithmic scale energy, and the DCT coefficients of the logarithmic scale energy. The proposed feature set consisting of STRF-based features and MFCCs significantly improves the recognition rates in robust speaker recognition tasks over those achieved with conventional MFCCs.