1 Introduction

Advancements in technology have made it possible to use biometrics to authenticate persons without requiring their physical presence. Various human traits are used as biometric features to grant access to authorized persons or to restrict unauthorized persons from accessing an intended service. Face, fingerprint, iris, retina and DNA are the most frequently used biometrics for person authentication, and combinations of these biometrics are also deployed to enhance security in person authentication/verification. Specialized equipment and computers are necessary to extract these biometrics. Voice can also be used as a biometric for person authentication; apart from a computer, only cost-effective microphones are required to record the speech. When voice is used as a biometric for speaker authentication, speaker recognition takes two forms: speaker identification and speaker verification. Speaker identification is a one-to-many mapping between the test speaker and all enrolled speaker models. Speaker verification is a one-to-one mapping that checks the genuineness of the speaker's claim in order to take an accept/reject decision. Text-dependent speaker identification/verification uses the same speech utterance for training and testing; password-specific person authentication falls into this category. Text-independent speaker identification/verification uses different speech utterances for training and testing, so speakers need not remember the utterance spoken during the training session and are not bound to any fixed set of text utterances.

A multilevel framework [3] has been developed using speech as a biometric for person authentication. Voice is used as a biometric [24] for person authentication against replay attacks. Offline signature verification and speaker verification [22] are combined, as both biometrics are widely accepted for person authentication. A speaker verification system [26] is implemented using whispered speech for voice-based biometric systems. Fingerprint and speech [4] are used as multimodal biometrics for mobile authentication applications. Face, fingerprint and speech [1] are used as multimodal biometrics for person authentication. Face and speech are used as multimodal biometrics [5, 19, 25] for speaker verification. A speech-based biometric system [20] against voice conversion attacks has been developed. The Two-Dimensional PalmHash Code (2DPHC) [10,11,12,13,14] was used for secure palmprint verification. Techniques for action recognition from sensor-generated data are discussed in [15], where a novel approach for complex activity recognition is proposed. The feasibility of career path prediction from social network data has been studied scientifically and systematically [16]. An efficient algorithm to identify temporal patterns among actions and utilize the identified patterns to represent activities for automated recognition has been presented [17]. A multitask multi-view learning method is introduced to predict the water quality of a station [18] by fusing multiple datasets from different domains. Tracking of generic human motion is proposed using a fusion formulation that integrates low- and high-dimensional tracking approaches into one framework [2].

In this work, playback audios are obtained by recording the stored speech samples and saving them at the same sampling rate as the original speeches. A testing algorithm is developed to distinguish between recorded speech and original speech and to authenticate speakers using password-specific speaker models. Feature extraction, recorded speech models, original speech models, password-specific speaker models and the testing procedure are developed so that the speaker is authenticated in a robust manner and the system has strong immunity against playback spoofing attacks. The proposed algorithm is extended to check the authenticity of 44 speakers against replay-laptop and replay-laptop-HQ-speaker attacks by considering the speech utterances from the AVSpoof database, with testing done on genuine models.

2 Materials and methods

2.1 About the database

The speech database considered in this work contains the speeches of 8 female and 8 male speakers, and the speeches of all speakers are used for analysis to authenticate the sixteen speakers. For each isolated word there are 26 utterances per speaker, of which 16 are used for training and the remaining 10 for testing. For the classification between recorded and original speech, 256 utterances across all words are used for training and 160 utterances for testing. For speaker authentication based on original speeches, 16 password-specific utterances are used for training and 10 utterances of each word for testing. The other database used in our work is the AVSpoof database [6].

Data availability

All relevant data are within the paper and its supporting information files.

2.2 Features based on cepstrum

The feature extraction algorithm stems from two ideas: (1) modeling of the vocal tract and (2) homomorphic filtering of speech. In the vocal tract model, speech is produced by passing an excitation through a filter whose response models the effect of the vocal tract on the excitation. Homomorphic filtering converts the convolution of the excitation and the vocal tract response into an addition of the logarithms of their transforms, so that the vocal tract response can be separated from the excitation by linear filtering. Formant locations and bandwidths indicate the variation between the speeches uttered by different speakers. The features used for any classification must be reliable enough to perform the pattern recognition task: they must have high inter-class variation and low intra-class variation, and they should represent the case study well so that recognition/classification accuracy improves and speech can be used as a biometric for authenticating speakers. The MFPLPC speech analysis method [7,8,9, 23] extracts perceptual features with filters spaced on the mel scale. The proposed method includes a training phase and a classification phase: recorded speech is first distinguished from original speech, and subsequently classification is done to authenticate the speaker by applying the features to all password-specific speaker models.
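
The following is a minimal sketch of the homomorphic filtering idea that these features build on, not the full MFPLPC pipeline of [7,8,9, 23]; the FFT size and lifter cutoff are illustrative assumptions.

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """Real cepstrum of one windowed speech frame: IFFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # small offset avoids log(0)
    return np.fft.irfft(log_mag, n_fft)

def vocal_tract_cepstrum(frame, n_fft=512, lifter_cutoff=30):
    """Low-time liftering: keep the low-quefrency coefficients that model the
    vocal tract envelope and discard the high-quefrency part dominated by the
    excitation (cutoff of 30 is an assumption)."""
    c = real_cepstrum(frame, n_fft)
    c[lifter_cutoff:-lifter_cutoff] = 0.0
    return c[:lifter_cutoff]                      # compact vocal-tract feature vector
```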

2.3 Training phase

The overall flow diagram shown in Fig. 1 indicates the modules used for extracting the proposed features, the creation of templates, and the testing procedure for authenticating sixteen persons with speech utterances chosen from the TIMIT database. Recorded speeches are created by recording the speeches already stored in a computer using the "Audacity" software, and the recorded speeches are stored in a separate folder at the same sampling rate as the original speeches. Sixteen recorded speeches of the 16 speakers are concatenated before extracting the features for recorded speech model creation. Similarly, original speech models are created by first concatenating the original speech utterances; the proposed features are extracted after performing the conventional preprocessing steps of pre-emphasis, frame blocking and windowing. Password-specific speaker models are created in a similar manner.
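
A minimal sketch of this preprocessing chain is given below, using the 16 ms frames with 8 ms overlap and Hamming window stated in Section 2.5; the sampling rate and pre-emphasis coefficient are assumptions.

```python
import numpy as np

def preprocess(speech, fs=16000, frame_ms=16, shift_ms=8, alpha=0.97):
    """Pre-emphasis, frame blocking and Hamming windowing of a 1-D speech signal."""
    emphasized = np.append(speech[0], speech[1:] - alpha * speech[:-1])  # pre-emphasis
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // shift  # assumes speech >= one frame
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * shift: i * shift + frame_len] * window
                       for i in range(n_frames)])
    return frames   # shape: (n_frames, frame_len); features are extracted per frame
```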

Fig. 1 The proposed person authentication system

2.4 Training model based on the clustering technique

Apart from reliable and promising feature extraction, another important aspect of any pattern matching approach for speech/language/speaker/emotion/noise recognition or classification is the generation of reference patterns or templates. Templates must be created as representative models for individual speakers to achieve good accuracy in practical tasks, so that speech can be used as a biometric for speaker authentication. For a given set of L training vectors of recorded/original speech, a VQ-based clustering technique [21] is used as a pattern clustering approach to generate a codebook of M codewords as a representative template for recorded and original speech. The basic idea of the clustering approach is to reduce the signal's information rate by using a codebook with a relatively small number of codewords (centroids) compared to the number of training vectors. In general, the training vectors are converted into clusters, and the cluster centroids represent the recorded/original speech and the password-specific speaker training data. The procedure [21, 23] used for converting the set of training vectors into a set of clusters is based on distance computation.
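
A minimal sketch of the codebook generation is shown below, using SciPy's K-means routine as the clustering step; the codebook size of 64 is an assumption, since the paper does not state M.

```python
import numpy as np
from scipy.cluster.vq import kmeans

def build_codebook(training_vectors, n_codewords=64):
    """Cluster the L training feature vectors into M centroids (the codebook)."""
    data = np.asarray(training_vectors, dtype=np.float64)
    codebook, _distortion = kmeans(data, n_codewords)
    return codebook

# One codebook per class: recorded speech, original speech, and each
# password-specific speaker model, e.g.
# recorded_codebook = build_codebook(recorded_features)
# original_codebook = build_codebook(original_features)
```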

The classification procedure [21, 23] for an arbitrary spectral analysis test vector chooses the codebook vector closest to the input vector and uses that codebook vector as the resulting spectral representation. This is often referred to as nearest-neighbour labeling or an optimal encoding procedure. The classification procedure is essentially a quantizer that accepts a spectral analysis vector as input and outputs the index of the codebook vector (cluster centroid) that best matches the input. This is done by computing the Euclidean distance between each test vector and the M cluster centroids. The spectral distance measure for comparing features vi and vj is as in (1)

$$ d\left(v_i, v_j\right) = d_{ij}\;\begin{cases} = 0 & \text{when } v_i = v_j \\ > 0 & \text{otherwise} \end{cases} $$
(1)

If the codebook vectors of an M-vector codebook are denoted ym, 1 ≤ m ≤ M, and the new spectral vector to be classified is denoted v, then the index m* of the best codebook entry is as in (2)

$$ m^{\ast} = \underset{1 \le m \le M}{\arg\min}\; d\left(v, y_m\right) $$
(2)

The clusters are formed in such a way that the training data distribution is characterized and captured, producing a representative model or template for each speech class/speaker. It is observed that the most frequently occurring vectors have smaller Euclidean distances to the centroids than the least frequently occurring ones.
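
A minimal sketch of the nearest-codeword search of Eq. (2) is given below; function and variable names are illustrative.

```python
import numpy as np

def min_codeword_distance(v, codebook):
    """Eq. (2): index of the nearest codeword y_m* and its distance to test vector v."""
    d = np.sum((codebook - v) ** 2, axis=1)   # squared Euclidean distance to each centroid
    m_star = int(np.argmin(d))
    return m_star, d[m_star]
```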

2.5 Speaker authentication based on modeling technique

In general, any recognition system involves extracting features from the training and test data, creating VQ codebook models for all the classes, and testing the feature vectors of each test utterance against the template models to detect the identity of the class of that utterance from among the enrolled classes. The training data for the recorded/original speech classification system is formed by concatenating the 160 speech utterances corresponding to the recorded/original speeches. The training data for speaker authentication is formed by concatenating the password-specific speech utterances of the speaker. Other sets of 16 password-specific speech utterances of a speaker are used for testing, and 16 speakers comprising 8 male and 8 female speakers are considered for developing the speaker authentication system.

Initially, features are extracted from the concatenated recorded or original speeches, the extracted features are applied to the training algorithm, and reference models or templates are created for recorded/original speech. For feature extraction, the recorded/original speech signal is passed through a pre-emphasis block, followed by frame blocking, which converts the speech signal into overlapping frames of 16 ms duration with 8 ms overlap. Each frame is windowed using a Hamming window [21], which is considered the appropriate window for speech communication applications. Then the MFPLPC [7,8,9, 23] features are extracted. These feature vectors are applied to the VQ block to generate the codebook for recorded/original speech by adopting a K-means clustering procedure in which L training vectors are mapped to M clusters. Each feature vector in a block is normalized appropriately before being given as input to the module that generates the reference templates for recorded/original speech. For testing, the test data can be a recorded or an original speech utterance. The recorded/original speech classification system is evaluated by applying the perceptual features extracted from the test utterance to all templates corresponding to recorded/original speech. The testing procedure first finds the minimum distance between each test vector and the cluster centroids, and the average of these minimum distances is computed for each speech model. The test utterance is associated with the recorded or original speech model that yields the minimum of these averages. If the test speech is classified as recorded speech, it is not allowed to participate in the subsequent process. For speaker authentication, the feature vectors of the original test speeches are applied to the password-specific speaker models and, based on the minimum distance criterion, a speaker is authenticated or identified.
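
The following is a minimal sketch of this two-stage testing procedure under the stated minimum-average-distance rule; the function and model names are illustrative, not the exact implementation used in the paper.

```python
import numpy as np

def model_score(test_vectors, codebook):
    """Average of the minimum squared distances between each test vector and the
    codebook centroids; a lower score means the model better explains the utterance."""
    d = np.linalg.norm(test_vectors[:, None, :] - codebook[None, :, :], axis=2) ** 2
    return d.min(axis=1).mean()

def authenticate(test_vectors, recorded_cb, original_cb, speaker_cbs):
    """Stage 1: reject playback if the recorded-speech model scores lower.
    Stage 2: pick the password-specific speaker model with the minimum average distance."""
    if model_score(test_vectors, recorded_cb) < model_score(test_vectors, original_cb):
        return None                               # classified as recorded speech: reject
    scores = {spk: model_score(test_vectors, cb) for spk, cb in speaker_cbs.items()}
    return min(scores, key=scores.get)            # authenticated speaker label
```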

3 Results and discussion

The performance of the speaker authentication system, which uses speech as a biometric based on perceptual features extracted from the recorded/original speech signal and a clustering technique as the modeling technique, is evaluated by computing the squared Euclidean distance between the feature vectors of the test utterance and the templates for recorded/original speech. If the test utterance is classified as recorded speech, it is prevented from undergoing further processing. If it is classified as an original speech utterance, its feature vectors are applied to the password-specific speaker models and a speaker is authenticated based on the minimum distance classifier. The accuracy of speaker authentication is the number of times the given speech utterances are correctly identified for a particular password utterance divided by the total number of test utterances for each speaker. The formula used for computing the recognition accuracy is as in (3).

$$ \%RA = \frac{\text{Number of times the speaker is authenticated correctly}}{\text{Total number of test utterances}} $$
(3)

Figures 2 and 3 show the recorded and original versions of the password-specific speech "RUGBY" uttered by a female speaker in the time and frequency domains.

Fig. 2 Recorded speech "RUGBY" and its spectrogram

Fig. 3 Original speech "RUGBY" and its spectrogram

The distance plot shown in Fig. 4 depicts the efficiency of the testing procedure used for discriminating recorded and original speech. Ten speeches are considered for testing. For each original test speech, the average of the minimum distances between the test vectors and the clusters of the original speech model is found to be the minimum. If a test speech is classified as original speech, it is permitted to undergo the further process of authenticating the speaker.

Fig. 4 Distance plot – discriminating original speech and recorded speech

Figure 5 depicts the effectiveness of the testing algorithm in isolating recorded speech from original speech. If a test speech is classified as recorded speech by comparing the test vectors with the recorded and original speech models, it is ignored and stopped from further processing. For each recorded test speech, the average of the minimum distances to the recorded speech model is found to be the minimum, showing that the testing algorithm is efficient in isolating recorded test speech from original speech, i.e. in providing immunity against playback attacks.

Fig. 5 Distance plot – isolation of the recorded speech from the original speech

Figure 6 indicates the effectiveness of the testing algorithm in discriminating speaker 2 from the enrolled speakers for each test speech by applying the test vectors to all password-specific speaker models. The passwords used are "enter, erase, go, help, no, rugby, repeat, stop, start, yes, zero, one, two, three, four and five" for the eight female and eight male speakers considered in our work. Figure 7 depicts the discrimination of speaker 10 from the enrolled speaker models for each test speech.

Fig. 6 Distance plot – authentication of speaker 2

Fig. 7 Distance plot – authentication of speaker 10

Table 1 shows the confusion matrix for speaker authentication using voice passwords; the system provides 100% accuracy for this small set of 16 speakers, with testing done on ten test speeches per speaker.

Table 1 Confusion matrix – speaker authentication – MFPLPC Feature – TIMIT database

From Table 1 it is evident that the developed algorithm is robust in separating recorded and original speeches and in authenticating speakers: the features of the ten test speeches are applied first to the models that discriminate original speeches from recorded speeches, and subsequently to the password-specific speaker models to authenticate the speaker from the set of sixteen enrolled speakers, so the algorithm provides immunity against playback attacks. The performance of the system is also evaluated in terms of the PSNR between the original and recorded speech utterances of each speaker, given in Table 2. High PSNR values indicate the closeness between the original and recorded speech utterances. The proposed algorithm is robust in discriminating recorded and original speech utterances by applying the features to the original and recorded speech models: if an utterance is classified as recorded speech it is forbidden from further processing, and if it is classified as original speech it undergoes the further process of speaker authentication.

Table 2 Performance evaluation – PSNR between original and recorded speech
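
A minimal sketch of the PSNR computation between an original utterance and its recorded version is given below; it assumes the two signals are time-aligned and takes the peak from the original signal, which is an assumption since the paper does not specify the exact formula.

```python
import numpy as np

def psnr(original, recorded):
    """PSNR (dB) between an original utterance and its recorded (playback) version."""
    n = min(len(original), len(recorded))          # compare over the common length
    mse = np.mean((original[:n] - recorded[:n]) ** 2)
    peak = np.max(np.abs(original[:n]))
    return 10.0 * np.log10(peak ** 2 / (mse + 1e-12))   # epsilon guards against mse = 0
```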

This work is also extended to authenticating speakers against replay attacks using speech utterances chosen from the AVSpoof database [6]. The training phase involves extracting MFPLPC features, and MFPLPC features concatenated with a probability feature, from the "READ" genuine speech utterances; models/templates are created for 44 speakers comprising 31 male and 13 female speakers. In the testing phase, features are extracted from the "PASS" genuine utterances and applied to the models, and a speaker is authenticated based on the distance between the test features and the models. The authentication or recognition accuracy (%RA) is calculated as the number of times the speaker is authenticated correctly out of the total number of utterances considered for each speaker. The performance of the system is evaluated against the replay attacks Replay-laptop and Replay-laptop-HQ-speaker. The evaluation is done by extracting the features from the "PASS" attack utterances, applying these features to the genuine models, and classifying the speaker with the minimum distance classifier. Figure 8 depicts the performance of the system against replay attacks for MFPLPC and for MFPLPC concatenated with probability. The probability is computed for each frame as the ratio of the number of samples whose spectral energy is greater than the average spectral energy of the frame to the total number of samples in the frame. The overall accuracy of the system is 88% for MFPLPC and 91% for MFPLPC concatenated with probability when genuine test utterances are tested against the genuine models. The rejection rate is 78% and 94.1% for the MFPLPC feature when testing is done on replay-laptop-HQ-speaker and replay-laptop attack utterances against the genuine models, and 75.5% and 91.4% for MFPLPC concatenated with probability on the same attack utterances.
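
A minimal sketch of the per-frame probability feature is given below; interpreting "samples" as spectral bins of the frame is an assumption, and the FFT size is illustrative.

```python
import numpy as np

def frame_probability(frame, n_fft=512):
    """Fraction of spectral bins whose energy exceeds the frame's average spectral energy."""
    energy = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    return np.count_nonzero(energy > energy.mean()) / energy.size

# The scalar is appended to each frame's MFPLPC vector, e.g.
# augmented = np.append(mfplpc_vector, frame_probability(frame))
```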

Fig. 8 Performance of the system against replay attacks – MFPLPC feature – AVSpoof database

Figure 9 indicates the efficiency of the feature MFPLPC concatenated with probability in authenticating speakers against replay attacks.

Fig. 9 Performance of the system against replay attacks – MFPLPC concatenated with probability – AVSpoof database

The plots shown in Figs. 8 and 9 indicate the robustness of the feature selection and modeling technique used to authenticate speakers against replay attacks with the speech utterances chosen from the AVSpoof database.

4 Conclusions

This work discusses the application of perceptual features, a perceptual feature fused with probability, and a clustering technique in building a robust speaker authentication system that uses speech as a biometric against playback attacks, with speech utterances chosen from the TIMIT database. In this work, recorded speeches are considered as playback attacks. The speaker authentication system is developed by first using the perceptual feature and the VQ-based clustering technique to create recorded and original speech models, and the testing procedure is used to discriminate recorded from original speeches. Password-specific training speeches of a speaker are concatenated for feature extraction, and a model is subsequently created for each speaker using the modeling technique. This work mainly emphasizes the importance and significance of the modeling technique in isolating recorded speeches considered as playback attacks. Feature vectors of original speeches are applied to the password-specific speaker models and, based on the testing procedure, a speaker is authenticated. The performance of the system is evaluated as the number of times the speaker is authenticated correctly by the minimum distance classifier over the password-specific test speeches given for each speaker, and the algorithm provides the maximum accuracy for this small set of speakers. The algorithm is highly robust in isolating recorded speeches considered as playback attacks from original speeches and in authenticating the speaker using voice as a biometric. The work is also extended to check the authenticity of 44 speakers against replay-laptop and replay-laptop-HQ-speaker attacks with speech utterances chosen from the AVSpoof database, and the performance of the system is evaluated in terms of the rejection rate. The rejection rate is found to be very low when genuine speech utterances are tested on genuine models and high when attack speech utterances are tested on genuine models, revealing that the feature selection and modeling technique used are robust against replay attacks as well. The rejection rate is better for the MFPLPC feature than for MFPLPC concatenated with probability for the replay-laptop-HQ-speaker and replay-laptop attacks.