1 Introduction

The literature on spoof detection for automatic speaker verification (ASV) systems documents both the vulnerability of such systems to different spoof attacks [7] and the development of countermeasures [31] against them. Researchers have sought to improve the robustness of countermeasures through feature enhancement approaches [4, 16, 35] and through the choice of classifier [15, 37].

With the advent of biometric systems [8], the identification and verification of biometric traits [23] has gained momentum. Because most of these applications are used for authentication, the reliability of the system is a central concern. One of the biometric traits used is the voice print [21]. For authentication, an automatic speaker verification (ASV) system should prevent any fraudulent entry. All kinds of biometric systems, including ASV systems, are vulnerable to spoof attacks [7]. The detection of spoof attacks on speaker verification systems [12] remains an open research problem, as emerging high-quality techniques that imitate the naturalness of the human voice [17, 25] pose a serious threat to the system. The research community has responded with feature re-engineering [22] and with the classification of these features using various classifiers [9, 18, 37].

Countermeasures based on linear frequency cepstral coefficient (LFCC) features have performed well in detecting unknown attacks, as reported in the literature. Many researchers have chosen LFCC as a candidate to fuse with other features to improve the performance of a countermeasure in spoof detection. LFCC has contributed under both logical access (LA) and physical access (PA) scenarios, supporting a generalised countermeasure against different kinds of spoof attacks on ASV systems. This has motivated the authors to leverage the potential of LFCC by fine-tuning the weighting factors in the linear sub-bands, with the aim of boosting the discriminating characteristics of the cepstrum while suppressing noise. The resulting feature, GaussFCC, is tested with the traditional generative classifier GMM and with the discriminative classifier BiLSTM, which is known for its ability to learn long-term dependencies in sequence data and which proved to be a complementary classifier for GaussFCC.

In their previous work [16], the authors focused on capturing most of the significant energy variations in a few cepstral coefficients by obtaining the energy variation pattern that forms the basis for filter-based cepstral coefficient (FBCC) feature extraction. This significantly improved the countermeasure under both LA and PA conditions. In the current paper, the authors use BiLSTM as a classifier to separate genuine from spoofed utterances using GaussFCC features. Here the cepstral coefficients capture the variations at the feature level, and the BiLSTM classifier learns bidirectional long-term dependencies across the sequence of cepstral coefficients for classification. In this way, a significant performance improvement of the countermeasure is achieved under the physical access condition and comparatively good performance under the logical access condition. Both works indicate that a complementary feature-classifier combination improves spoof detection by capturing the time-varying information within the sequence, which adds to the discrimination of spoofed from genuine utterances.

The rest of the paper is organised as follows: Sect. 2 discusses LFCC in unknown spoof detection on ASV systems. Section 3 describes the corpus and classifiers used. Section 4 explains the proposed feature for spoof detection. Section 5 presents the performance analysis of the countermeasure using the proposed feature with GMM and BiLSTM classifiers. This is followed by Sect. 6 for conclusions and Sect. 7 for acknowledgement.

2 Performance of LFCC in Unknown Spoof Detection on ASV Systems

Enhancing the discriminating nature of the features used for detection is indispensable given the increasing naturalness of synthetic and replayed speech. A feature used for a classification problem needs to capture the inter-class discrimination or intra-class affinity among the data under classification. The challenge of selecting such features varies with the application at hand. The research community is confronted with the task of spoof detection on speaker verification systems. This task demands a robust feature that can capture the traces of a spoof attack in the utterance presented to the system. Research has produced countermeasures with robust features for detecting such attacks. These countermeasures differ either in the feature used in front-end processing or in the classifier used in back-end classification.

Feature re-engineering has produced modified versions of the filterbank that emphasise the frequency components of interest [27, 36]. The traditionally used features in this category are linear frequency cepstral coefficients (LFCCs), mel frequency cepstral coefficients (MFCCs) and inverted mel frequency cepstral coefficients (IMFCCs) [33]. Most of these features differ in the sub-band analysis, depending on the application under consideration. For example, LFCC, MFCC and IMFCC are based on linearly scaled, mel-scaled and inverted mel-scaled sub-band analysis, respectively.
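
The difference between the three sub-band scales can be illustrated with a small sketch that places filter centre frequencies on each scale. The mel formula below is the widely used 2595·log10(1 + f/700) variant, and the inverted mel centres are obtained by mirroring the mel centres about the middle of the band; both choices, and all function names, are assumptions made for illustration and are not taken from the cited works.

```python
import math

def hz_to_mel(f_hz):
    """Common mel-scale mapping (one of several variants in use)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def linear_centres(n_filters, f_min, f_max):
    """Centre frequencies equally spaced in Hz (LFCC-style)."""
    step = (f_max - f_min) / (n_filters + 1)
    return [f_min + step * (i + 1) for i in range(n_filters)]

def mel_centres(n_filters, f_min, f_max):
    """Centre frequencies equally spaced on the mel scale (MFCC-style)."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]

def inverted_mel_centres(n_filters, f_min, f_max):
    """Assumed IMFCC-style layout: mel centres mirrored within the band."""
    return sorted(f_min + f_max - c for c in mel_centres(n_filters, f_min, f_max))
```

With 40 filters over 0 to 8000 Hz, the linear centres are evenly spaced, the mel centres are dense at low frequencies, and the mirrored centres are dense at high frequencies.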

Previous research [24] presents a detailed comparative analysis of features that enhance the performance of countermeasures in detecting spoof attacks on ASV systems under LA condition. The analysis uses the ASVspoof 2015 dataset [32], which contains spoofed utterances generated from ten different algorithms (S1–S10); the evaluation set contains both known (S1–S5) and unknown (S6–S10) attacks. The features analysed fall into three groups. The short-term power spectrum features are the filterbank-based cepstral features RFCC, LFCC, MFCC and IMFCC; the all-pole modelling-based cepstral features, namely linear prediction cepstral coefficients (LPCC) and perceptual linear prediction cepstral coefficients (PLPCC); the spectral flux-based feature, namely sub-band spectral flux coefficients (SSFC); and the sub-band spectral centroid-based features, namely sub-band centroid frequency coefficients (SCFC) and spectral centroid magnitude coefficients (SCMC). The short-term phase features are modified group delay function (MGDF), all-pole group delay function (APGDF), cosine-phase function (CosPhase) and relative phase shift (RPS). The spectral features with long-term processing are modulation spectrum (ModSpec), shifted delta coefficients (SDCs), frequency domain linear prediction (FDLP) and mean Hilbert envelope coefficients (MHECs). These 17 features were evaluated with GMM and SVM classifiers. The \(\Delta \Delta \) coefficients of LFCC proved the most effective for unknown attacks, with an average equal error rate (EER\(\%\)) of 1.67 using the GMM classifier, outperforming the other features.

In [33], the authors have presented a comparison of LFCC, MFCC, IMFCC and CQCC features under PA condition. The experiments were conducted on two datasets, namely ASVspoof 2017 and BTAS 2016. The EER\(\%\) of LFCC for ASVspoof 2017 dataset was 3.31 and 2.04 for unknown attacks.

In [28], the performance of LFCC for unknown attacks is listed with EER\(\%\) of 1.670 on ASVspoof 2015 dataset. It is shown to outperform cepstral coefficients and change in the instantaneous frequency (CFCC-IF) feature, system with i-vectors based on MFCCs, mel frequency principal coefficients and cosine-phase principal coefficients feature and magnitude- and phase-based feature.

In [11], with score-level fusion of LFCC and TECC (Teager energy cepstral coefficients) on ASVspoof 2017 dataset, the countermeasure outperformed fusion of TECC with MFCC and CQCC as well.

Hence, the LFCC feature set has proved consistently effective for detecting spoof attacks on ASV systems, both as a stand-alone feature and as a candidate feature for fusion.

3 Database and Classifiers

3.1 Speech Corpus with Spoof and Bonafide Utterances

The speech corpus used for the proposed work is the ASVspoof 2019 corpus [29], detailed in Table 1. The dataset is categorised into logical access (LA) and physical access (PA) scenarios. The LA condition consists of speech-synthesised and voice-converted utterances. The PA condition consists of recorded and replayed utterances. Each of LA and PA contains bonafide and spoofed utterances for training, development and evaluation. There are 8 male and 12 female speakers. The duration of utterances in the dataset is in the range of 1–11 s.

Table 1 ASVspoof 2019 speech corpus

The training and development sets contain known attacks, and the evaluation set contains 2 known and 11 unknown spoofing attacks. There are six known attacks, of which two are voice conversion (VC) systems and four are text-to-speech synthesis (TTS) systems. The TTS systems use either waveform concatenation or neural network-based speech synthesis with a conventional source-filter vocoder or a WaveNet-based vocoder. Among the 11 unknown systems, there are two VC, six TTS and three hybrid TTS-VC systems. These are implemented with various waveform generation methods, including classical vocoding, Griffin-Lim, generative adversarial networks, neural waveform models, waveform concatenation, waveform filtering, spectral filtering and their combinations. The references related to the known and unknown attacks and their implementation details are given in [29].

3.2 Generative and Discriminative Classifiers

Spoof detection using GaussFCC is explored with GMM and BiLSTM classifiers. Both systems show significant improvement over the LFCC-based baseline system under both LA and PA conditions; comparatively, the system with the GMM classifier performs better under LA condition and the system with the BiLSTM classifier performs better under PA condition. In [1], the authors present a detailed discussion on generative and discriminative models.

3.2.1 GMM Classifier

Gaussian mixture model is a generative approach where the joint distribution is considered in the model. Traditional Gaussian mixture model (GMM) [6] is used here. Two such models are generated, one for GaussFCC features from spoofed utterances and a second one for GaussFCC features from bonafide utterances.

The GaussFCC features are extracted from the test utterance and presented to the spoofed and bonafide GMMs. The log likelihood scores \(S_\mathrm{b}\) and \(S_\mathrm{sf}\) are computed for the bonafide and spoofed model, respectively. The log likelihood difference is computed as \(\lambda =S_\mathrm{b}-S_\mathrm{sf}\). Here, \(\lambda \) is the final score of each test utterance. A positive value of \(\lambda \) classifies the utterance as bonafide; otherwise, the utterance is classified as spoofed. The GMM classifier used is shown in Fig. 1.
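
The scoring rule above can be sketched in a few lines. This is a minimal illustration, assuming diagonal-covariance GMMs whose parameters are already trained and assuming the utterance score is the average per-frame log likelihood; the source does not state either choice, and the function names and the `(weights, means, variances)` tuple layout are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_frame_loglik(x, weights, means, variances):
    """log sum_k w_k N(x; mu_k, var_k), computed with log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def utterance_score(frames, gmm):
    """Average per-frame log likelihood (averaging is an assumption)."""
    return sum(gmm_frame_loglik(x, *gmm) for x in frames) / len(frames)

def llr_decision(frames, bonafide_gmm, spoof_gmm):
    """Return (lambda, label) with lambda = S_b - S_sf."""
    s_b = utterance_score(frames, bonafide_gmm)
    s_sf = utterance_score(frames, spoof_gmm)
    lam = s_b - s_sf
    return lam, ("bonafide" if lam > 0 else "spoofed")
```

Each GMM is passed as a tuple `(weights, means, variances)`, and `frames` is a list of feature vectors for one utterance.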

Fig. 1 Spoof detection in ASV systems using GMM classifier

3.2.2 BiLSTM Classifier

The limited long-term memory of the RNN is addressed by LSTM units, which add a second state, the cell state C(t), alongside the hidden state h(t). Gates remove or add information to C(t) based on the input value x(t) and the previous hidden value h(t-1). The gates are implemented using sigmoidal layers. The feature that makes LSTM appealing in the field of speech processing is its ability to learn “long-term dependencies” [10]. The bidirectional LSTM inherits this property.
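
For reference, the gating mechanism described above is commonly written as follows. This is the standard textbook formulation, given as a sketch: the weight matrices \(W\), \(U\), the biases \(b\) and the symbol \(\mathrm{sigm}\) (logistic sigmoid, written this way to avoid a clash with the standard deviation \(\sigma \) used for the filters) are notation introduced here for illustration, and the source does not state which LSTM variant was used.

```latex
\begin{aligned}
f(t) &= \mathrm{sigm}\big(W_f\,x(t) + U_f\,h(t-1) + b_f\big) && \text{(forget gate)}\\
i(t) &= \mathrm{sigm}\big(W_i\,x(t) + U_i\,h(t-1) + b_i\big) && \text{(input gate)}\\
o(t) &= \mathrm{sigm}\big(W_o\,x(t) + U_o\,h(t-1) + b_o\big) && \text{(output gate)}\\
\tilde{C}(t) &= \tanh\big(W_c\,x(t) + U_c\,h(t-1) + b_c\big) && \text{(candidate cell state)}\\
C(t) &= f(t)\odot C(t-1) + i(t)\odot \tilde{C}(t)\\
h(t) &= o(t)\odot \tanh\big(C(t)\big)
\end{aligned}
```

The forget gate scales the previous cell state and the input gate scales the candidate update, which is how information is removed from or added to C(t).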

BiLSTM is a bidirectional LSTM, in which the signal propagates in both backward and forward direction in time. The simple architecture of the BiLSTM classifier used here is shown in Fig. 2.

Fig. 2 Spoof detection in ASV systems using BiLSTM

The number of frames generated for each utterance differs with the duration of the speech. For each frame, however, the number of cepstral coefficients retrieved is always 120, including the dynamic coefficients, namely the delta and delta–delta coefficients. Padding can be reduced by sorting the training and testing data by sequence length and choosing a mini-batch size such that the sequences in a mini-batch have nearly the same length. This is an added advantage when the application deals with speech utterances [14].
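
The effect of sorting by length on padding can be checked with a short sketch. The function name and the toy frame counts are hypothetical; the frame counts are chosen only to mimic a mix of short and long utterances.

```python
def padding_frames(lengths, batch_size, sort_first):
    """Total number of padded frames when each mini-batch is padded
    to the length of its longest sequence."""
    order = sorted(lengths) if sort_first else list(lengths)
    total = 0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total

# Toy frame counts for six utterances (illustrative values only).
lengths = [100, 1100, 120, 1000, 110, 1050]
unsorted_pad = padding_frames(lengths, batch_size=3, sort_first=False)  # 2970
sorted_pad = padding_frames(lengths, batch_size=3, sort_first=True)     # 180
```

In this toy case, grouping utterances of similar length cuts the padding from 2970 frames to 180.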

4 Proposed Feature: GaussFCC

In the current research work, we propose to modify the weighting function of the linearly scaled LFCC feature to enhance its capability in spoof detection on ASV systems. Because Gaussian-weighted energy sub-bands are used to retrieve the cepstral coefficients, the proposed feature is referred to as GaussFCC. The Gaussian filter is formulated using the Gaussian membership function, represented as Gaussian(x:c,s), where c and s are the mean and standard deviation, respectively. The filterbank is shown in Fig. 3.
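
A minimal sketch of the two weighting functions is given below for comparison. The Gaussian membership function is taken in its usual unit-peak form, exp(-(x-c)^2 / (2 s^2)); the source does not spell out the formula, so this form, the function names and the triangular counterpart are assumptions for illustration.

```python
import math

def gaussian(x, c, s):
    """Gaussian membership function with mean c and standard deviation s.
    The peak value is 1 at x == c (no 1/(s*sqrt(2*pi)) normalisation)."""
    return math.exp(-((x - c) ** 2) / (2.0 * s ** 2))

def triangular(x, lo, c, hi):
    """Triangular weight rising from lo to the centre c and falling to hi."""
    if x <= lo or x >= hi:
        return 0.0
    if x <= c:
        return (x - lo) / (c - lo)
    return (hi - x) / (hi - c)
```

Unlike the triangular weight, the Gaussian weight never reaches exactly zero, so each filter retains a small contribution from frequencies outside the nominal band edges.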

Fig. 3 GaussFCC filterbank

The number of filters used for the experiments is 40, with the aim of obtaining 40 cepstral coefficients. The finer spectral details are captured by the higher-order cepstral coefficients [19]; hence, all 40 cepstral coefficients are retained. The experiments are conducted with the energy sub-bands of GaussFCC closely resembling those of LFCC, in order to assess GaussFCC performance under that condition. Two sets of experiments are investigated in this paper, based on the method used to compute the standard deviation (\(\sigma \)). In the first set of experiments, \(\sigma \) is approximated by a value tuned using the \(\alpha \)-factor. The performance of the countermeasure is found to be good under LA condition when \(\alpha \) is set to 3, and under PA condition when \(\alpha \) is set to 2, in the following equation,

$$\begin{aligned} \sigma = \frac{(f_{i+2} - f_{i})}{2*{\alpha }} \end{aligned}$$
(1)

where \(f_{i}\) and \(f_{i+2}\) are the lower and higher frequency values of the band, and the bandwidth (BW) is \(\frac{(f_{i+2} - f_{i})}{2}\), so that \(\sigma = \mathrm{BW}/\alpha \). The second set of experiments uses a constant value \(\sigma = 128\), chosen in the neighbourhood of the full width half maximum region, which yields good results. In this case, a folded Gaussian filter is set at the minimum and maximum frequency. The results of these two sets of experiments are discussed in the result analysis section. Figure 4 shows the placement and bandwidth of a Gaussian filter used for the experiments, along with the triangular filter at the same position for comparison, when the number of filters is 40 over a frequency range of 0 to 8000 Hz.
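
As a worked example of Eq. 1, the sketch below computes \(\sigma \) for both values of \(\alpha \). It assumes that the band edges are equally spaced in Hz, with 42 edge points for 40 filters between 0 and 8000 Hz, as in a conventional linear filterbank; this spacing and the function names are assumptions, since the source does not list the edge frequencies. The full width at half maximum of a Gaussian, \(2\sqrt{2\ln 2}\,\sigma \), is included for reference.

```python
import math

def linear_band_edges(n_filters, f_min, f_max):
    """n_filters + 2 edge points equally spaced in Hz (assumed layout)."""
    step = (f_max - f_min) / (n_filters + 1)
    return [f_min + step * i for i in range(n_filters + 2)]

def sigma_from_alpha(edges, i, alpha):
    """Eq. 1: sigma = (f_{i+2} - f_i) / (2 * alpha) for the i-th filter."""
    return (edges[i + 2] - edges[i]) / (2.0 * alpha)

def fwhm(sigma):
    """Full width at half maximum of a Gaussian with standard deviation sigma."""
    return 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma

edges = linear_band_edges(40, 0.0, 8000.0)
sigma_la = sigma_from_alpha(edges, 0, alpha=3)  # about 65.0 Hz under these assumptions
sigma_pa = sigma_from_alpha(edges, 0, alpha=2)  # about 97.6 Hz under these assumptions
```

A larger \(\alpha \) gives a smaller \(\sigma \) and therefore a narrower filter, so under these assumptions the LA setting uses narrower Gaussian weights than the PA setting.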

Fig. 4 GaussFCC filter with different \(\sigma \) values along with triangular filter (refer Eq. 1)

The GaussFCC feature extraction stages are shown in Fig. 5. The pre-emphasis of the speech utterance is not performed throughout the experiments.

Fig. 5 GaussFCC computation flow

The additional information captured by GaussFCC compared with LFCC is shown in Fig. 6, which depicts the filtered energy captured for a spoofed utterance that is detected by the GaussFCC feature and missed by the LFCC feature. Figure 7 shows the first, second, third, fourth, twentieth and fortieth cepstral coefficients obtained from a randomly selected sequence of frames. A positive value of a cepstral coefficient indicates that the spectral energy is concentrated more in the low frequency region, and a negative value indicates that most of the spectral energy is concentrated in the high frequency region [5]. The highlighted values in the figure show the significant variation that results from the change in weighting function while processing the same utterance. As mentioned earlier, the idea behind changing the weighting function is to use the linearly scaled sub-band analysis to its full potential, following its success as witnessed in the literature. The statistical significance of the GaussFCC approach is studied by computing the entropy of the probability distribution obtained from the histograms of utterances after the application of triangular and Gaussian filters. The entropy used is the one introduced by Shannon [26], \(E = - \sum _{i=1}^{n}[P_{i}\log _{b}(P_{i})]\). The entropy of the probability distribution obtained from the triangular (\(E_{tri}\)) and Gaussian (\(E_{gauss}\)) weighted filtered energy is 0.3571 and 0.2930, respectively, under LA condition, and 0.4026 and 0.3353, respectively, under PA condition. The entropy values indicate that Gaussian weighting tends to elicit more information than the triangular weighting function [3]. The robustness of the feature is further evident from the empirical results obtained with the ASVspoof 2019 corpus, which are discussed in the result analysis section.
As far as linear filters are concerned, the energy analysis bands are linearly scaled. The outcome justifies the intuitive idea that the linear filter captures sufficient information required to detect traces of spoof attack.
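
The entropy computation described above can be sketched as follows. The number of histogram bins and the logarithm base \(b\) are not stated in the source, so both are left as parameters here; the function names are hypothetical.

```python
import math

def histogram_probabilities(values, n_bins):
    """Normalised histogram of the values over equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant input
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)  # put the maximum in the last bin
        counts[k] += 1
    return [c / len(values) for c in counts]

def shannon_entropy(probs, base=2.0):
    """E = -sum_i P_i * log_b(P_i), skipping empty bins."""
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)
```

Applying `shannon_entropy(histogram_probabilities(energies, n_bins))` to the filtered energies from the triangular and the Gaussian filterbank would give the two quantities compared in the text, up to the unstated choice of bins and base.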

Fig. 6 Visualisation of the energy captured by GaussFCC (left) and LFCC (right) features (the energy captured is of spoofed utterance detected by GaussFCC and missed by LFCC)

Fig. 7 Cepstral coefficients obtained in LFCC (A) and GaussFCC (B) features (significant variations are highlighted using circles)

4.1 Pre-emphasis or No Pre-emphasis in Spoof Detection

The paper analyses the performance of the countermeasure with features extracted from non-pre-emphasised utterances. Figure 8 shows the spectrum of two utterances before and after pre-emphasis. Utterance U1 is identified as spoof even after pre-emphasis, but utterance U2 is no longer identified as spoof after pre-emphasis. The spectra in Fig. 8c and d show the energy suppressed by pre-emphasis. A difference between the pre-emphasised signal and the original is also observed when the sound is played. The pre-emphasis filter used for analysis is \(H(z) = 1-0.97z^{-1}\). U1 and U2 are utterances captured under the PA scenario. The noise could be a trace of the attack from the record and replay devices or the ambience.
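
To make the effect of pre-emphasis concrete, the sketch below applies the filter in the time domain, reading it as the usual first-order high-pass form \(H(z) = 1-0.97z^{-1}\), that is, y[n] = x[n] - 0.97 x[n-1]. Passing the first sample through unchanged is a common convention assumed here, and the function name is hypothetical.

```python
def pre_emphasis(signal, coeff=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1].
    The first sample is passed through unchanged (assumed convention)."""
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - coeff * signal[n - 1])
    return out

# A constant (zero-frequency) signal is almost removed after the first sample,
# whereas a signal alternating at the Nyquist rate is nearly doubled.
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

This illustrates why low-frequency energy, which may carry traces of a replay attack, is suppressed when pre-emphasis is applied.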

Fig. 8 Spectrum analysis of speech utterance from ASVspoof 2019 dataset before and after pre-emphasis. a, b are non-pre-emphasised speech utterance; c, d are pre-emphasised speech utterance

5 Performance Analysis

5.1 Experimental Setup

The details of the corpus are given in Sect. 3. The metric used for performance analysis is the minimum tandem detection cost function (min t-DCF) [13]. The utterance is not pre-emphasised. The signal is framed with a frame length of 20 ms and an overlap of 10 ms, and a Hanning window is applied. 40 Gaussian filters are used for sub-band analysis of energy, and 40 cepstral coefficients are retrieved. With the static and dynamic (delta and delta–delta) values, the feature set consists of 120 coefficients. For the GMM classifier, the number of mixture components is 512 for LA and 256 for PA. Experiments were performed on development and evaluation data under both LA and PA scenarios for 100 iterations each. The BiLSTM classifier consists of a hidden layer of 64 nodes, with an input sequence size of 120, followed by a fully connected layer of two nodes, one each for the bonafide and spoof classes. The mini-batch size is set to 100, and the experiment was run for 100 epochs. In conventional backpropagation algorithms, the error flowing backward tends to explode or vanish depending on the weights, which hampers the learning of long time lags by the network. To overcome this problem, the gradient threshold is set to a constant, which is 1 here.
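
The framing and windowing step can be sketched as follows. A sampling rate of 16 kHz is assumed, which is consistent with the 0 to 8000 Hz analysis range mentioned earlier but is not stated in this section; the symmetric Hann window formula and the function names are likewise assumptions for illustration.

```python
import math

def hann_window(n):
    """Symmetric Hann window of length n (end points are zero)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def frame_signal(signal, sample_rate=16000, frame_ms=20, overlap_ms=10):
    """Split a signal into overlapping, Hann-windowed frames.
    Trailing samples that do not fill a whole frame are dropped."""
    frame_len = sample_rate * frame_ms // 1000              # 320 samples at 16 kHz
    hop = sample_rate * (frame_ms - overlap_ms) // 1000     # 160 samples at 16 kHz
    window = hann_window(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

Under these assumptions, a 1 s utterance yields 99 frames of 320 samples each, and each frame is subsequently reduced to 40 static coefficients plus their delta and delta–delta values, giving the 120-dimensional feature vector.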

5.2 Analysis I (GaussFCC1): GaussFCC with \(\sigma \) Controlled Using \(\alpha \)-factor

The performance of the countermeasure is studied with the filter bandwidth set using Eq. 1. The performance of the countermeasure was found to be significant using \(\alpha =3\) for attacks under the logical access condition. The performance improved for attacks under physical access condition when the \(\alpha \) value was set to 2 in Eq. 1.

5.3 Analysis II (GaussFCC2): GaussFCC with \(\sigma \) Set to a Constant

The constant value for \(\sigma \) was chosen experimentally to be 128 while analysing the bandwidth in the neighbourhood of the full width half maximum region. The performance of the countermeasure with the two versions of the proposed feature, GaussFCC1 and GaussFCC2, is shown in Table 2 for attacks under both LA and PA scenarios, using the GMM and BiLSTM classifiers. The experiments and results analysis substantiate the robustness of the countermeasure against attacks under LA and PA scenarios with the proposed feature GaussFCC.

Table 2 Performance of two versions of GaussFCC using GMM classifier and BiLSTM classifier with no pre-emphasis
Table 3 Performance of countermeasure using GaussFCC on individual attacks under LA condition
Table 4 Information on attack types under PA condition
Table 5 Performance of countermeasure in min t-DCF on varying both, the attacker-to-talker distance and the quality of device

5.4 Analysis III: Performance Analysis of Countermeasure on Individual Attacks Under LA and PA Conditions Using Generative and Discriminative Classifiers for GaussFCC1 and GaussFCC2 Features

Table 3 shows the spoof detection rate of the countermeasure for the individual attacks under the logical access condition. The highlighted values are the best results for each classifier compared with its baseline system, although the GMM classifier shows better spoof detection than the BiLSTM classifier when the values are compared between the classifiers for each of the features. Hence, the generative classifier performs better than the discriminative classifier for spoof detection under LA condition.

In Table 4, the information on the attacker-to-talker distance [34] and three categories of quality of the replay device used to record and replay the utterances as in ASVspoof 2019 corpus [2, 20, 30] is shown.

Based on the above information, the performance of the countermeasure is captured for the individual attack type under PA condition and is depicted in Table 5. The results show that discriminative (BiLSTM) classifier is good at detecting spoof under PA condition compared to the generative classifier. The comparison of values of GaussFCC1 and GaussFCC2 using BiLSTM classifier depicts that both the features are complementary and could be the optimal candidates for fusion at score level.

Table 5 shows that when the quality of the device is perfect and the attacker-to-talker distance is varied, GaussFCC2 captures the details of the attack better compared to other features. It also shows that when the device quality is low and the attacker-to-talker distance is varied, GaussFCC1 captures the details of the attack better compared to other features. Figure 4 shows that the analysis energy band for GaussFCC1 is comparatively narrower than the one for GaussFCC2.

5.5 Analysis IV: Performance Analysis of Countermeasure on Female and Male Speaker Utterances

Table 6 shows the number of male and female utterances available in the development and evaluation sets of the ASVspoof 2019 dataset. This information is used for further analysis.

Table 6 Number of male and female utterances in development and evaluation dataset of ASVspoof2019 corpus

Table 7 shows the spoof detection performance for male and female utterances, categorised by feature, with both generative and discriminative classifiers. In [38], LFCC is suggested to be a good option for the speaker recognition task, especially for female trials. The results in Table 7 show that, for the spoof detection task, LFCC and its modified versions GaussFCC1 and GaussFCC2 remain unbiased. The lower improvement in female trials could be attributed to the number of female utterances being greater than that of male utterances (Table 6).

Table 7 Performance of countermeasures on female and male utterances under both LA and PA conditions. (columnwise) Across classifiers and (rowwise) across female and male utterances for both development and evaluation data
Table 8 Score-level fusion of GaussFCC1 and GaussFCC2 from each of GMM and BiLSTM classifier

5.6 Analysis V: Score-Level Fusion of GaussFCC1 and GaussFCC2

Tables 3, 5 and 7 show that the GaussFCC1 and GaussFCC2 features are complementary with the GMM classifier under LA condition and with the BiLSTM classifier under PA condition. This observation was tested further using score-level fusion of GaussFCC1 and GaussFCC2 for both classifiers, and the results support it. The result of the fusion is shown in Table 8, where the highlighted values indicate comparatively good performance. The observations in the paper show empirically that under LA condition spoof detection is good using the GMM classifier with the proposed feature (both in isolation and in combination) and under PA condition spoof detection is good using the BiLSTM classifier. In all the experimental outcomes, both proposed features, GaussFCC1 and GaussFCC2, have outperformed the baseline feature, traditional LFCC.
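
Score-level fusion can take several forms, and the source does not state which rule was used. As a purely illustrative sketch, the simplest option is a weighted sum of the per-utterance scores from the two systems; the function name, the equal default weights and the absence of score normalisation are all assumptions.

```python
def fuse_scores(scores_a, scores_b, weight=0.5):
    """Weighted-sum fusion of two per-utterance score lists.
    weight applies to scores_a and (1 - weight) to scores_b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same utterances")
    return [weight * a + (1.0 - weight) * b for a, b in zip(scores_a, scores_b)]
```

For example, `fuse_scores(gaussfcc1_scores, gaussfcc2_scores)` would average the two systems' scores per utterance before the detection threshold is applied; the two argument names are placeholders for score lists produced by the same classifier.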

In a nutshell, the comparative analysis of the countermeasure with GaussFCC features using both GMM and BiLSTM classifiers is depicted in Fig. 9.

The comparison shows that the countermeasure performs well against attacks under the LA scenario when used with the GMM classifier and under the PA scenario when used with the BiLSTM classifier. To get the best out of these generative and discriminative classifiers, as suggested by the authors of [1], the research could be taken forward with classifier-level fusion.

Based on the empirical observations, the countermeasure has shown robustness in spoof detection on a speaker verification system using a complementary combination of features and classifiers, as discussed. The results show consistent performance of the countermeasure at all stages of analysis, namely across features, classifiers and gender-based utterances. This leads to the choice of optimal features for score-level fusion and the appropriate selection of classifiers for spoof detection in an automatic speaker verification system, based on experiments with the ASVspoof 2019 datasets.

Fig. 9 Performance comparison of baseline and proposed features using generative (GMM) and discriminative (BiLSTM) classifiers

6 Conclusions

The literature bears witness to good spoof detection in ASV systems with linear frequency cepstral coefficients, both in isolation and in fusion with other features, as discussed in the paper. The authors set out to enhance the linear frequency-based feature so as to use it to its full potential in spoof detection. This was achieved by changing the weighting function from triangular to Gaussian. The cepstral coefficients obtained showed significant variations as a result. Subsequently, the entropy computed over the probability distribution of the Gaussian filtered energy spectrum indicated that the changes in cepstral coefficients are due to the information gain achieved using Gaussian filters. The two variations of the modified feature are GaussFCC1 and GaussFCC2, as discussed. Both features were used with two different classifiers, the generative GMM and the discriminative BiLSTM, to detect spoof attacks under LA and PA conditions. The paper presents an elaborate analysis of the empirical results for the possible combinations of features and classifiers. These combinations were also studied separately for female and male speakers' utterances. Throughout the analysis, the performance of the countermeasure showed consistent improvement, which empirically supports GaussFCC1 and GaussFCC2 as optimal candidates for score-level fusion. The generative classifier was found to perform well under LA condition and the discriminative classifier under PA condition. The utterances considered for the experiments were not pre-emphasised in the time domain, for the reasons discussed in the paper.
The pre-processing of speech signals might cause loss or modification of information at each level of processing, which could be averted; from this perspective, the paper investigates the performance of the countermeasure against spoof attacks with non-pre-emphasised utterances. Further investigation could be carried out by increasing the number of cepstral coefficients and studying the influence of the discrete cosine transform on the logarithm of the Gaussian filtered energy spectrum and its subsequent impact on the cepstrum.