1 Introduction

Authentication is one of the pillars of information assurance, as it attaches a valid identity to the information. Classically, and even today, text passwords have been sufficient to protect applications from unauthorized access. However, typing a password is a tiresome and time-consuming technique. Various human physiological characteristics like the retina, fingerprint, voice, etc. can be used to identify a person uniquely. Voice is the easiest and most comfortable means of association with objects as compared to other traits. Moreover, it offers more than one characteristic, such as the shape of the vocal tract, pitch, time delay, etc., to differentiate individuals. Voice based authentication systems, i.e., automatic speaker verification (ASV) systems, have become popular and convenient alternatives to existing security systems due to the technological advancements of recent years. Unlike other biometric systems, they cause no discomfort and pose no health risk to the user, as there is no direct contact with the machine. Studies have revealed that 90% of people are excited about using speech signal based biometrics instead of classical ones (Beranek, 2013).

An ASV system processes the voice captured through a microphone and either accepts or rejects the claimed identity. The task of speaker verification is to check whether the speech presented by a claimer is genuine or not. The frontend and backend are the two equally important parts of such systems for achieving the desired functionality. As shown in Fig. 1, the input voice signal is processed at the frontend of the ASV system; then the validity check and verification of the speaker (by comparing the genuineness of his/her voice with the legitimate user's speech already existing in the database) are accomplished in the backend part of the system to accept or reject the claimed identity.

Fig. 1 Components of ASV system

The frontend of the system extracts information about the uniqueness of the speaker and the genuineness of the signal, which is embedded in the input speech signal in the form of its characteristics. Characteristics of a speech signal like phase, time delay, frequency, sampling rate, pitch, magnitude, etc. vary from signal to signal. These characteristics can be compared to differentiate signals produced by different sources. Features describing these characteristics can be grouped into three categories, i.e., short-term power spectrum features, short-term phase features, and features involving long-term processing steps (Sahidullah et al., 2015). A wide range of feature extraction techniques that can capture the clues of speech manipulation are being used for designing the frontend of ASV systems. Feature extraction techniques that extract cepstrum domain features have dominated speech and speaker related tasks, since they can model the human auditory system and the human vocal tract properly (Dinkel et al., 2018; Paul et al., 2015). The classification model of the backend identifies the processable artifacts from the applied speech features. As speaker verification falls under the class of classification problems, machine learning approaches are suitable for drawing conclusions from the seen data. Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and similar models dominated this area of verification for years. These models are suitable for processing speech related data and were used consistently for different ASV related activities. However, they are not efficient for non-linear, or nearly non-linear, distributions of data. In the last few years, with the development of more advanced algorithms, the research community has shifted to deep learning models that can process huge datasets with complex relationships. A good speech dataset is essential for providing robustness to the system, i.e., for making it efficient under highly varying acoustic conditions. Such speech datasets are recorded under different acoustic conditions and may contain large speaker and speech related variations. Various dataset releasing communities are cooperating by releasing acoustically and phonetically enriched datasets. Audio recordings, speaker identities, and metadata for the audio and the speakers are the key components of any speech corpus. Training the ASV system with a good corpus gives it enough experience to take the decision of acceptance or rejection on further real-life data. This paper discusses and analyses various state-of-the-art feature extraction techniques, classification models, and datasets proposed by researchers (Kumar & Aggarwal, 2020a, b).

Although these systems are being used effectively in different forms, for example, smartphone unlocking policies, voice surveillance systems in finance, voice chat applications, etc., they are not completely reliable due to their vulnerability to various kinds of spoofing attacks. ASV systems can be attacked directly or indirectly at different stages, from the entry of the speech signal into the system to the verification of the claimer. Direct access attacks are inserted via the microphone or channel, whereas the whole ASV arrangement is vulnerable to indirect access attacks at different stages. Speech Synthesis (SS)/Text-to-Speech (TTS), Voice Conversion (VC), replay, mimicry, and twins attacks are the potential attacks on these systems. This paper presents a precise classification of these spoofing attacks based on the level of system access required to accomplish them. INTERSPEECH (ASVspoof consortium, 2019; Lavrentyeva et al., 2017; Sahidullah et al., 2016; Zhizheng et al., 2017) has initiated sessions to highlight these risks as well as the challenges in the design and implementation of ASV systems. The first session, ASVspoof 2013, held in Lyon, France, aimed to spread awareness about the vulnerability of ASV systems to spoofing attacks. The ASVspoof 2015 challenge was to propose countermeasures able to differentiate original speech from speech generated by TTS and VC techniques. The replay attack was addressed by ASVspoof 2017, whose focus was designing countermeasure solutions for this attack. The ASVspoof 2019 challenge aimed to extend the previous challenges, and it identified the need for new ASV-centric evaluation measures to assess countermeasures.

2 Related surveys

Several surveys have been carried out on ASV systems and their spoof detection techniques. The main focus of the existing survey works is spoofing attacks on the system. The work of Wu et al. (2015a, 2015b, 2015c) provides a good classification of spoofing attacks along with the potential attack points of the ASV system. The replay attack received separate attention for the ASVspoof 2017 challenge in a survey by Patil and Kamble (2018). A lot of research has been done in designing spoof-free countermeasures. The survey of Sahidullah et al. (2019) discusses the types of different attacks, their spoofing procedures, and the countermeasures designed especially against them. However, all these discussions still leave room for improvement, as feature extraction techniques, backend classification models, datasets, and spoofing attacks have rarely been explored together in a single discussion. For instance, the work of Sahidullah et al. (2019) covers different feature extraction techniques applied to the frontend design of countermeasures focusing on SS, VC, replay, and mimicry attack types, individually or in combination. Speech corpora and their protocols, along with the evaluation metrics for countermeasures, contribute equally to the development of ASV systems; however, recent studies cover them only partially. Kamble et al. (2020) cover some of the dedicated speech corpora and almost all the evaluation measures in this area. Motivated by all these discussions, the survey proposed in this paper explores the various important contributions involved in the development chain of ASV systems. Table 1 provides a comparative view of the proposed survey against other recent surveys. The points listed below describe the main contributions of the proposed survey work:

  i.

    Knowledge of speech signal processing is essential for frontend design of any kind of speech based system. This paper presents the detailed computation mechanism of traditional and modern speech feature extraction techniques applied for ASV frontend design.

  ii.

    Machine learning techniques are suitable to adapt the classification clues from the huge dataset. This work provides the study of architectures of classical machine learning and deep learning networks adopted by ASV systems.

  iii.

    This paper describes, to the best of our knowledge, approximately all datasets applied in speaker verification systems. A dataset enriched with speaker, spoofing, etc. variations plays a remarkable role in the development of these systems.

  iv.

    Evaluation measures, against which the accuracy of a countermeasure is marked, are also covered in this paper. Advancements in spoof generating techniques contribute to strengthening the defence mechanisms of any security system. Some attack types added recently to this field, and new insights into generating all the applied spoofing attacks, form one of the major parts of the discussion in this paper.

  v.

    This survey identifies the revolutionary time periods in ASV systems. It analyses traditional and modern countermeasures, i.e., combinations of different frontend and backend techniques trained with different datasets, to check the accuracy status of these systems.

  vi.

    This survey identifies some attack types that are not being targeted by today's countermeasures. It discusses all the methodologies, techniques, and concepts while keeping in mind readers ranging from newcomers to researchers and developers of ASV systems. It highlights new noticeable facts and system requirements for researchers and developers to consider in future work.

Table 1 Comparison of several surveys on ASV systems

3 Automatic speaker verification (ASV) system

Automatic speaker verification systems are successfully contributing to the development of biometric surveillance and authentication systems based on human behavioural and physiological characteristics (Koolwaaij & Boves, 1999; Singh et al., 2018). The task of speaker verification is sometimes confused with speaker identification. In speaker identification, an unknown speaker is matched against an existing pool of known speakers in the database, and the closest matching speaker is declared the desired identity. In speaker verification, a claimed known speaker is accepted or rejected based on the genuineness of his/her voice. The verification model checks whether the applied speech is original (coming directly from the speaker) or generated by tricks (spoofed). The decision of acceptance or rejection is taken based on a threshold value. These systems can be classified into text-dependent ASV (TD-ASV) and text-independent ASV (TI-ASV) systems. A TD-ASV system requires a registered text password to be said correctly, whereas a TI-ASV system establishes the identity of the user only by verifying his/her voice, freely accepting any speech content (Marinov, 2003; Reynolds & Rose, 1995). The former systems are suitable only for authentication purposes, whereas the latter are applicable to authentication as well as biometric surveillance. The frontend and backend parts of the ASV system, shown in Fig. 2a–b, play equally significant roles in delivering the desired functionality of the system.

Fig. 2 Frontend and backend of ASV system

Fig. 3 a 30 static CQCC features, b 30 first-order CQCC features, c 30 second-order CQCC features (y-axis: feature index)

3.1 Approaches to design frontend of ASV system

The frontend extracts features from the applied speech signal after converting the analog signal to digital by passing it through a sampler, a quantizer, and then an encoder. The continuous-time speech signal is converted into discrete time by the sampler and then passed to a quantizer to obtain finite amplitude values. Each quantized value is then associated with a digital word by the encoder (Ochiai et al., 2014; Picone, 1993). Feature extraction techniques such as Linear Predictive Coding (LPC), Mel Frequency Cepstrum Coefficients (MFCC), Linear Predictive Cepstrum Coefficients (LPCC), etc. are applied to the digital signal to obtain the required features (Chen et al., 2018). The extracted features are analysed at the backend to verify the speaker's genuineness. Most of the classical feature extraction techniques involve the tasks of filtering, linear predictive coding, and cepstrum calculation in some combination. First we highlight some important characteristics of these techniques, and in the next subsection we discuss the different feature extraction techniques, which are also summarised in Table 2.

  • Filtering Filter-bank based feature extraction can imitate the behaviour of the human auditory system. Artificial cochlea-based (Patel & Patil, 2015; Patil & Kamble, 2018) and Fourier-transform-based filters are used as standards for frequency analysis of speech signals. The applied filter outputs a short-time energy signal. A short-time energy signal obtained by a Fourier-transform based filter undergoes functions such as the logarithm to deliver an acoustic feature vector (Dua et al., 2018a, b; Ochiai et al., 2014).

  • Linear Predictive Coding (LPC) The state of a sample \(s\left( t \right)\) of a speech signal at some point of time t can be predicted on the basis of its previous states, because the current state of a time-varying signal depends on the former states (Ochiai et al., 2014). This concept is formulated (Eq. 1) for feature extraction under Linear Predictive Coding (LPC), also called autoregressive modeling. If the order of modeling is o and the prediction noise is \(n\left( t \right)\), then sample \(s\left( t \right)\) can be modeled as:

    $$ s\left( t \right) = \mathop \sum \limits_{i = 1}^{o} \beta_{i} s\left( {t - i} \right) + n\left( t \right) $$
    (1)

where \(\beta_{i}\) denote the coefficients of prediction found to minimize the Mean-Square-Error (MSE) for the speech segment’s window starting and ending at p1 and p2, respectively. MSE is defined as

$$ P_{n } = \mathop \sum \limits_{{t = p_{1} }}^{{p_{2} }} \left\{ {n\left( t \right)} \right\}^{2} $$
(2)
  • Cepstrum The cepstrum is another approach for modeling the linguistic class (vocal tract information) of a speech signal. A speech signal, being a time-varying signal, can be obtained by the convolution of the vocal tract filter and the excitation signal. To obtain its cepstrum, the signal undergoes a sequence of operators: a Fourier transform, a logarithmic operator, and then an inverse Fourier transform. After the Fourier transform, the speech spectrum becomes the product of the vocal tract filter and excitation spectra; this power spectrum is a function of frequency. The log operator converts this product of power spectra into a sum of log spectra. The inverse Fourier transform is then applied to the summed log power spectrum to obtain the cepstrum of the initial speech signal (a minimal numerical sketch of this pipeline is given after this list).

  • Static and Dynamic Features Static features capture general information from the speech, whereas dynamic features also capture contextual variations and can model speaker-specific information more precisely. The dynamic features, delta and delta-delta (Δ, ΔΔ), are the first-order and second-order derivatives of the static features, respectively. Dynamic features have proved to be better for speaker verification systems. Figure 3a–c show the spectrographic view of these features extracted for an utterance of the ASVspoof 2019 dataset. First, the static, first-order, and second-order 30 CQCC features are extracted for 400 frames in MATLAB; these features are then plotted with functions of Python's librosa library (Cheuk et al., 2019; Glover et al., 2011).
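The cepstrum pipeline just described (Fourier transform, logarithm, inverse Fourier transform) can be sketched in a few lines of Python. This is an illustrative sketch only: frame windowing and liftering are omitted, and the test frame is a synthetic tone rather than real speech.

    import numpy as np

    def real_cepstrum(frame):
        # FFT -> log power spectrum -> inverse FFT, as described above.
        spectrum = np.fft.fft(frame)
        log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)  # small constant avoids log(0)
        return np.fft.ifft(log_power).real

    # Toy usage: a 25 ms frame of a synthetic 100 Hz tone sampled at 16 kHz.
    fs = 16000
    t = np.arange(int(0.025 * fs)) / fs
    print(real_cepstrum(np.sin(2 * np.pi * 100 * t))[:5])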

Table 2 Summary of all classical feature extraction techniques

3.1.1 Mel-frequency cepstral coefficients (MFCCs)

Cepstral analysis based Mel-frequency Cepstral Coefficients (MFCCs) are the most common feature coefficients used in spoof detection. In this approach, the mel scale, a unit based on pitch comparisons, is used for frequency representation. The mel scale is perceptually motivated by the human auditory system. To extract coefficients with the MFCC technique, the Fast Fourier Transform (FFT) (Prithvi & Kumar, 2016) or the Discrete Fourier Transform (DFT) (Todisco et al., 2018) is applied to the input speech signal, which results in an audio spectrum. Then an applied triangular, Gaussian, etc. filter bank changes the scale of the spectrum to the mel scale (Chakroborty & Saha, 2009). The logarithm of the spectrum is calculated before applying the Discrete Cosine Transform (DCT), which results in the MFCCs (Cai et al., 2019). Generally, the static, first-order, and second-order derivatives of the first 12 to 14 coefficients are able to perform well for ASV systems (Balamurali et al., 2019; Dua et al., 2018a, b). Figure 4 shows the general approach of MFCC extraction. The mathematical process of MFCC extraction is as follows:

$$ D_{DFT} \left( i \right) = DFT\left( f \right) $$
(3)
$$ MFB\left( j \right) = \mathop \sum \limits_{i = 1}^{L} \left| {D_{DFT} \left( i \right)} \right|^{2} W_{j} \left( i \right) $$
(4)
$$ MFCC\left( r \right) = \mathop \sum \limits_{j = 1}^{J} log\left[ {MFB\left( j \right)} \right]cos\left\{ {\frac{{r\left( {j - 0.5} \right)\pi }}{J}} \right\} $$
(5)
Fig. 4 MFCC feature extraction

where f is the audio frame, \(D_{DFT} \left( i \right)\) is its DFT, \(MFB\left( j \right)\) is the mel-scaled frequency spectrum calculated with the jth mel filter bank \(W_{j} \left( i \right)\) out of J filter banks in total, L is the total number of DFT indices, and \(MFCC\left( r \right)\) gives the rth MFCC feature.
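The DFT, mel filter bank, logarithm, and DCT pipeline of Eqs. 3–5 is implemented, for example, by librosa (the library the authors use for plotting). The sketch below is a minimal illustration with an assumed file path and frame settings; it also appends the delta and delta-delta dynamic features discussed earlier.

    import numpy as np
    import librosa

    # Load an utterance (path is illustrative) and extract 13 MFCCs per frame.
    y, sr = librosa.load("utterance.flac", sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=160)

    # First- and second-order dynamic (delta, delta-delta) coefficients.
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)

    features = np.vstack([mfcc, delta, delta2])  # 39 x n_frames
    print(features.shape)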

3.1.2 Inverse mel-frequency cepstral coefficients (IMFCCs)

MFCCs are concerned with the low frequency regions of the spectrum, whereas the high frequency regions are taken care of by the Inverse Mel-frequency Cepstral Coefficients (IMFCCs) (Mohammadi & Mohammadi, 2017). These features are obtained by a process similar to that of MFCC, except for the use of inverted mel-scale filters (Cai et al., 2019; Saranya & Murthy, 2018), which shift the emphasis of the frequency domain from low to high and thus offer complementary information (Pritam et al., 2018). Figure 5 shows the steps of the IMFCC feature extraction technique.

Fig. 5 IMFCC feature extraction

3.1.3 Linear frequency cepstral coefficients (LFCC)

Linear scale based Linear Frequency Cepstral Coefficient (LFCC) features can model all frequency regions equally (Mohammadi & Mohammadi, 2017) and have proved to be better than MFCCs for ASV systems (Pritam et al., 2018). These features are extracted by applying a triangular filter bank that provides a constant resolution of the spectrum. LFCCs are helpful in speech recognition as well as in speaker identification, as they can model the length of the vocal tract in the higher frequency regions (Sahidullah et al., 2015; Pritam et al., 2018). Figure 6 depicts the complete process of LFCC extraction.

Fig. 6 LFCC feature extraction

3.1.4 Constant Q cepstral coefficients (CQCC)

In the case of CQCCs, the frequency bins are geometrically spaced and the Q factor is constant, whereas in Fourier-based transforms the frequency bins are regularly spaced, which leads to a variable Q factor. To generate CQCCs, the perceptually motivated Constant Q Transform (CQT) is applied to the speech signal, which ensures the constant Q factor. Higher frequency resolution at lower frequencies and higher temporal resolution at higher frequencies are the speech-distinguishing properties offered by the CQT (Balamurali et al., 2019; Saranya & Murthy, 2018). These features are especially applied to LA and PA spoof detection in ASV systems, and they achieve better performance than instantaneous frequency cosine coefficient (IFCC), MFCC, LFCC, ICMC, etc. features (Jelil et al., 2017; Todisco et al., 2018). Extraction of these features from speech starts with the application of the CQT, which converts the time domain into the frequency domain. Then the power spectrum is computed and a logarithm operation is applied. Before applying the DCT to the log power spectrum, uniform re-sampling is performed to convert the geometrically spaced CQT bins into linearly spaced bins, making the spectrum compatible with the DCT operation (Brown, 1991; Brown & Puckette, 1992; Todisco et al., 2017). This whole process is summarized by Eqs. 6 and 7.

$$ C_{CQT} \left( l \right) = CQT\left( {s\left( t \right)} \right) $$
(6)
$$ CQCC\left( r \right) = \mathop \sum \limits_{l = 1}^{L} log\left| {C_{CQT} \left( l \right)} \right|^{2} cos\left\{ {\frac{{r\left( {l - 0.5} \right)\pi }}{L}} \right\} $$
(7)

where \(C_{CQT} \left( l \right)\) is the CQT of the speech sample \(s\left( t \right)\), l indexes the total of L linearly spaced bins, and \(CQCC\left( r \right)\) denotes the rth of the extracted coefficients. Figure 7 depicts this process clearly.

Fig. 7 CQCC feature extraction
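As a rough illustration of the CQT, log power, uniform resampling, and DCT chain, the sketch below uses librosa's CQT. The file path and the spectral-resolution parameters are placeholders chosen for brevity and do not reproduce the reference CQCC implementation.

    import numpy as np
    import librosa
    from scipy.signal import resample
    from scipy.fftpack import dct

    y, sr = librosa.load("utterance.flac", sr=16000)

    # Constant Q transform: geometrically spaced frequency bins (parameters illustrative).
    C = np.abs(librosa.cqt(y, sr=sr, fmin=librosa.note_to_hz("C1"),
                           n_bins=168, bins_per_octave=24))
    log_power = np.log(C ** 2 + 1e-12)

    # Uniform resampling of the geometrically spaced bins so that the DCT can be applied.
    uniform = resample(log_power, num=128, axis=0)

    # DCT along the (now linearly spaced) frequency axis; keep the first 30 coefficients.
    cqcc = dct(uniform, type=2, axis=0, norm="ortho")[:30, :]
    print(cqcc.shape)  # (30, n_frames)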

Yang et al. (2018) have proposed Extended Constant Q Cepstral Coefficients (eCQCC), which are the concatenation of the DCTs of the log-scale and log-linear power spectra, preceded by the same procedure as for CQCCs. eCQCCs perform better than the baseline CQCCs on the ASVspoof 2015 and ASVspoof 2017 datasets.

3.1.5 Linear predictive cepstrum coefficients (LPCC)

Linear Predictive Cepstrum Coefficients (LPCC) extract the characteristics of a speaker's voice by applying one of the oldest techniques, Linear Predictive Coding (LPC), to a speech frame. Application of LPC gives the LPC coefficients, which are converted into LPCCs using the autoregressive (recursive) function given below (Balamurali et al., 2019; Prithvi et al., 2016). The speech sample \(s\left( t \right)\) is the input to an LPC filter having l linear predictive coefficients [β0, β1,…, βl-1] and error signal \(n\left( t \right)\). These linear predictive coefficients are converted into r LPCCs [C0, C1,…, Cr-1] as:

$$ C_{{j}} = \left\{ {\begin{array}{lll} {\ln \left( {P_{n} } \right)} &\quad {if\, j = 0} \\ { - \beta _{j} + \frac{1}{j}\mathop \sum \limits_{{k = 1}}^{{j - 1}} \left\{ { - \left( {j - k} \right)\beta _{k} C_{{j - k}} } \right\}} &\quad {if\, 1 \le j \le l} \\ {\frac{1}{j}\mathop \sum \limits_{{k = 1}}^{l} \left\{ {\frac{{ - \left( {j - k} \right)}}{j}\beta _{k} C_{{j - k}} } \right\}} &\quad {if\, l < j < r} \\ \end{array} } \right. $$
(8)

where j indexes the LPCC coefficients, l is the number of linear predictive coefficients (the LPC order), and \(P_{n}\) is the power of the error signal. Figure 8 shows this process clearly.

Fig. 8 LPCC feature extraction
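A sketch of the widely used LPC-to-cepstrum recursion is given below. Sign conventions for the LPC coefficients differ between references, so this follows the commonly cited form rather than reproducing Eq. 8 symbol for symbol, and the example coefficients are arbitrary.

    import numpy as np

    def lpc_to_lpcc(lpc, error_power, n_ceps):
        # lpc: LPC coefficients [beta_1, ..., beta_l]; error_power: P_n; n_ceps: r.
        l = len(lpc)
        c = np.zeros(n_ceps)
        c[0] = np.log(error_power)
        for j in range(1, n_ceps):
            acc = lpc[j - 1] if j <= l else 0.0
            for k in range(1, j):
                if j - k <= l:
                    acc += (k / j) * c[k] * lpc[j - k - 1]
            c[j] = acc
        return c

    print(lpc_to_lpcc(np.array([0.9, -0.4, 0.1]), error_power=0.02, n_ceps=8))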

3.1.6 Perceptual linear prediction (PLP)

Perceptual Linear Prediction (PLP) is one of the popular short-term spectral features (Wu et al., 2015a, b, c). These are perceptually motivated, linear predictive coding based features that model the characteristics of the human auditory system (Dua et al., 2018a, b; Zouhir & Ouni, 2014). PLP features are computed from a spectrum estimated by a windowed periodogram obtained through the DFT; therefore, these features have a high variance. Alam et al. (2013) have proposed a multi-windowing technique for spectrum estimation that gives a reduced-variance spectrum and performs better than baseline PLP features for the speaker verification task with an i-vector classifier.

3.1.7 Power normalized cepstrum coefficients (PNCC)

Power Normalized Cepstrum Coefficient (PNCC) features can perform better than other features, like MFCC, for noisy speech (with 0 dB to 15 dB Signal to Noise Ratio (SNR)) (Al-Kaltakchi et al., 2016). These features are based on the human auditory system and try to simulate its processing. The process of PNCC extraction starts with the application of a gammatone filter bank for frequency analysis of the spectrum, preceded by a Short-term Fourier Transform (STFT) operation. For noise reduction and robustness to reverberation, this frequency spectrum is passed through a sequence of operations that are nonlinear and time varying in nature. After this medium-time processing, the mean power normalization stage is invoked, which lowers the effect of changing amplitude values. This stage processes the input with a power-law nonlinearity with an exponent value of 1/15 (Kim & Stern, 2016), which approximates the biological auditory system. Finally, the application of the DCT outputs the PNCC features, as shown in Fig. 9. These features perform better in combination with other features like MFCC, LFCC, IMFCC, etc. for speaker identification (Al-Kaltakchi et al., 2016; Mohammadi & Mohammadi, 2017), but their computational complexity is higher than that of others (MFCC, PLP, etc.) (Kim & Stern, 2016).

Fig. 9 PNCC feature extraction

3.1.8 All-pole group delay function (APGDF)

The human ear cannot recognize the phase of sound. Phase based features are also less used because they are computationally complex to extract, whereas magnitude based features like MFCC get more attention as they can be perceived by the human auditory system (Rajan et al., 2013). However, phase is a crucial characteristic of a speech signal, and it can be used to distinguish utterances generated by different sources. It can be obtained from a speech signal by the group-delay methodology with the help of an all-pole model. Phase based speaker discrimination has been introduced in ASV recently (Pal et al., 2018; Sahidullah et al., 2015). A group-delay function, given by Eq. 9, plays the major role in the extraction of these features (Pal et al., 2018). Applying this function produces spurious high-amplitude spikes, which are reduced by the all-pole model of the audio signal. The APGDF function \(A\left( z \right)\) can be calculated as:

$$ A\left( z \right) = \frac{{F_{r} \left( z \right)XF_{r} \left( z \right) + F_{i} \left( z \right)XF_{i} \left( z \right)}}{{\left| {F\left( z \right)} \right|^{2} }} $$
(9)
$$ F\left( z \right) = FFT\left( {s\left( t \right)} \right) $$
(10)
$$ XF\left( z \right) = FFT\left( {X\left( t \right)} \right) $$
(11)
$$ X\left( t \right) = ts\left( t \right) $$
(12)

where \(F\left( z \right)\) is the FFT of the speech sample \(s\left( t \right)\), \(XF\left( z \right)\) is the FFT of \(X\left( t \right)\), and the subscripts i and r of \(F\left( z \right)\) and \(XF\left( z \right)\) denote their imaginary and real parts. The Modified Group Delay Function (MODGDF) is another feature extraction technique that models the phase of the speech signal (Hegde et al., 2004; Wu et al., 2016).
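Eqs. 9–12 translate almost directly into NumPy. The sketch below computes the group-delay function from the raw FFT for brevity; in APGDF the spectrum F(z) would instead come from an all-pole (LPC) model of the frame, and the test frame here is synthetic.

    import numpy as np

    def group_delay(frame):
        n = np.arange(len(frame))
        F = np.fft.fft(frame)            # Eq. (10)
        XF = np.fft.fft(n * frame)       # Eqs. (11)-(12): FFT of t * s(t)
        denom = np.abs(F) ** 2 + 1e-12   # small constant avoids division by zero
        return (F.real * XF.real + F.imag * XF.imag) / denom  # Eq. (9)

    fs = 16000
    t = np.arange(400) / fs
    frame = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
    print(group_delay(frame)[:5])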

3.1.9 Sub-band centroid frequency coefficients (SCFC)

Sub-band Centroid Frequency Coefficients (SCFC) are formant based features, which have been investigated as an alternative to cepstral features for speech recognition tasks (Dua et al., 2017; Paliwal, 1998). Further, these features offer complementary sub-band information that is not captured by cepstral features. Balamurali et al. (2019) included SCFC features in their feature set, together with CQCC, IMFCC, LPCC, etc., for replay attack detection. In the extraction process of these features, first, k sub-bands are marked with their initial and ending frequency edges on the spectrum of a speech signal. Then the weighted average frequency of each sub-band is computed; this is called the sub-band centroid frequency. The sub-band centroid frequency of the nth sub-band is given by:

$$ C_{n} = \frac{{\mathop \sum \nolimits_{{j = l_{n} }}^{{u_{n} }} j\left| {Z\left[ j \right]X_{n} \left[ j \right]} \right|}}{{\mathop \sum \nolimits_{{j = l_{n} }}^{{u_{n} }} \left| {Z\left[ j \right]X_{n} \left[ j \right]} \right|}} $$
(13)

where \(Z\left[ j \right]\) is the spectrum of the speech frame corresponding to signal \(s\left( t \right)\), the nth of the marked sub-bands starts at lower frequency edge \(l_{n}\) and ends at upper frequency edge \(u_{n}\), and \(X_{n} \left[ j \right]\) is the frequency-sampled response of the corresponding filter.
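A minimal sketch of Eq. 13 is shown below. A flat (all-ones) filter response X_n[j] is assumed, and the spectrum comes from a random test frame, so both are placeholders.

    import numpy as np

    def subband_centroids(power_spectrum, band_edges):
        # band_edges: list of (lower, upper) DFT-bin indices l_n, u_n for each sub-band.
        centroids = []
        for lo, hi in band_edges:
            j = np.arange(lo, hi + 1)
            weights = power_spectrum[lo:hi + 1]     # |Z[j]| with X_n[j] = 1 assumed
            centroids.append(np.sum(j * weights) / (np.sum(weights) + 1e-12))
        return np.array(centroids)

    spec = np.abs(np.fft.rfft(np.random.randn(512)))
    print(subband_centroids(spec, [(1, 64), (65, 128), (129, 256)]))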

3.1.10 Deep learning based feature extraction techniques

The use of deep learning for feature extraction tasks emerged around 2011 (Chen et al., 2015) and has opened a new era for feature extraction. Usually, CNNs and RNNs are used in the context of computer vision for tasks like object detection and image recognition, but there are indications that, if an audio signal is represented appropriately, CNNs can be made suitable for audio as well (Shuvaev et al., 2017). In this case, the hidden layers of deep learning networks are used to extract feature vectors of speech data. d-vectors, j-vectors, x-vectors, etc. come under these types of techniques. In the case of d-vectors, a Deep Neural Network (DNN) is used for feature extraction: the output layer of the network is ignored, and the values of the activation functions at the last hidden layer are taken as feature vectors. j-vectors are an extension of d-vectors (Kinnunen & Li, 2010). Another advanced technique, x-vectors, uses the Time Delay Neural Network (TDNN) embedding architecture, and the extracted features outperform the classical i-vectors (Chen & Salman, 2011).
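The d-vector idea, i.e., discarding the output layer and averaging the last hidden activations over an utterance, can be sketched in PyTorch as below. The layer sizes, feature dimension, and number of speakers are illustrative assumptions, not values from any cited work.

    import torch
    import torch.nn as nn

    class SpeakerDNN(nn.Module):
        """Frame-level speaker classifier whose hidden activations give d-vectors."""
        def __init__(self, n_features=39, n_speakers=100):
            super().__init__()
            self.hidden = nn.Sequential(
                nn.Linear(n_features, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
            )
            self.output = nn.Linear(256, n_speakers)  # used only during training

        def forward(self, x):
            return self.output(self.hidden(x))

        def d_vector(self, frames):
            # Ignore the output layer; average last-hidden-layer activations
            # over all frames of the utterance to obtain one d-vector.
            with torch.no_grad():
                return self.hidden(frames).mean(dim=0)

    model = SpeakerDNN()
    frames = torch.randn(200, 39)        # 200 frames of 39-dim features (illustrative)
    print(model.d_vector(frames).shape)  # torch.Size([256])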

3.2 Approaches to design backend of ASV system

The backend of an ASV system, as described by Fig. 2b, comprises a classification model that takes the features of the speech signal and the claimed identities as input. During training, the classification model finds the discriminating patterns associated with bonafide and spoofed utterances in the applied features and learns the characteristics of the different classes well. The trained model then takes the decision of acceptance or rejection for the claimed identity of a test utterance by matching it against the speech characteristics of the system's users. Various classical machine learning and deep learning approaches used to design the backend are discussed below.

3.2.1 Classical machine learning approaches

Some classical machine learning approaches are generative in nature, and some are discriminative classifiers. These approaches have been applied to spoof detection on the available datasets since the initial research on ASV systems. Some of the most commonly used classical machine learning approaches for the backend design of these systems are discussed below.

3.2.1.1 GMM based models

The Gaussian Mixture Model (GMM) has been a de-facto standard for developing ASV systems since the emergence of the idea of ASV. The use of the GMM in audio spoofing is inspired by the observation that general speaker-dependent shapes are found in the Gaussian components, and it can model the difference between genuine and synthetic speech effectively (Chettri et al., 2020; Pal et al., 2018). A special property of the GMM is that it can fit smooth curves to arbitrarily spread densities. Classically, a GMM represents the distribution of the speaker's features through the positions and elliptic shapes of its Gaussian components. To implement the model, its parameters are first initialized randomly; then the Expectation Maximization (EM) algorithm is used to optimize the parameters by maximizing the likelihood estimate (Suthokumar et al., 2017). The GMM with a Universal Background Model (UBM) is mainly used in speech related tasks. In this case, an additional GMM, i.e., a UBM, is trained on a huge development dataset, and Maximum a Posteriori (MAP) estimation is then used to obtain the speaker/task specific models (Al-Kaltakchi et al., 2016).
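In practice this is often realized by training one GMM on bonafide features and one on spoofed features and scoring test utterances by their log-likelihood ratio (the ASVspoof baselines follow this pattern with 512-component GMMs). The sketch below uses scikit-learn with random stand-in features and a reduced component count, so all data and sizes are illustrative.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Frame-level features for the two classes (random stand-ins for e.g. CQCCs).
    bona_train = np.random.randn(5000, 30)
    spoof_train = np.random.randn(5000, 30) + 0.5

    # One GMM per class, trained with the EM algorithm.
    gmm_bona = GaussianMixture(n_components=64, covariance_type="diag").fit(bona_train)
    gmm_spoof = GaussianMixture(n_components=64, covariance_type="diag").fit(spoof_train)

    def llr_score(utterance_frames):
        # Average per-frame log-likelihood ratio; higher means more bonafide-like.
        return gmm_bona.score(utterance_frames) - gmm_spoof.score(utterance_frames)

    print(llr_score(np.random.randn(300, 30)))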

3.2.1.2 SVM based models

The support vector machine (SVM) has discriminative properties. It generates a hyperplane in the feature space that divides the data into two classes, with each class lying on a different side of the plane. The SVM separates the classes with the maximum margin between them and applies regularization to avoid misclassifying examples. SVMs have been used successfully for speaker verification tasks as well as for spoof detection (De Leon et al., 2012; Godoy et al., 2015). One-class SVMs are also used to identify abnormal data by training only on genuine speech (Hanilçi et al., 2015). A Radial Basis Function (RBF) kernel based SVM can detect unknown spoofing attacks mixed into the speech data well (Godoy et al., 2015).
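Both variants mentioned above, the two-class RBF-kernel SVM and the one-class SVM trained only on genuine speech, are sketched below with scikit-learn; the utterance-level feature vectors are random placeholders.

    import numpy as np
    from sklearn.svm import SVC, OneClassSVM

    X_bona = np.random.randn(200, 60)        # utterance-level features (placeholders)
    X_spoof = np.random.randn(200, 60) + 0.7

    # Two-class RBF-kernel SVM on bonafide vs. spoofed examples.
    X = np.vstack([X_bona, X_spoof])
    y = np.array([1] * len(X_bona) + [0] * len(X_spoof))
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

    # One-class SVM trained on genuine speech only; outliers are flagged as spoofs.
    oc = OneClassSVM(kernel="rbf", nu=0.05).fit(X_bona)

    print(clf.predict(X_spoof[:3]), oc.predict(X_spoof[:3]))  # -1 = outlier (spoof)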

3.2.1.3 HMM based models

The Hidden Markov Model (HMM) is a well-known technique for speech recognition, speaker identification, and speaker verification tasks (De Leon et al., 2012; Dua et al., 2019a, b; Varchol et al., 2008). These models are widely used for designing TD-ASV systems. A rich mathematical framework and a robust architecture are the two strengths of the HMM (Gong & Yang, 2020; Rose & Juang, 1996). An HMM has two parts: the first is the sequence of states, or Markov chain, and the second is the collection of output distributions. The former part characterizes the information from the speech signal, and the latter converts the state sequences of the Markov chain into observations, hiding the states from the observer (Dua et al., 2012a, 2012b).

3.2.1.4 K-means algorithm

Since speaker verification is a classification problem, an unsupervised clustering approach can also be applied to it. The K-means clustering algorithm can find distinct clusters in the input vectors of speech features. This algorithm first assigns random centroids to the chosen k clusters and assigns the vectors to clusters by minimizing the distortion between each vector and its centroid. The iterative algorithm then redistributes the vectors to achieve the minimum distortion within each cluster. This algorithm is also applied in training GMM models and can deliver good classification performance for speech data as well.

3.2.2 Deep learning approaches

It has been observed that deep learning is well suited to the audio spoofing community. Deep learning can process large datasets with complex distribution structures. These approaches can be used successfully with feature vectors extracted by various feature extraction techniques as well as with the raw speech signal. Various deep learning based architectures, individually or in ensembles, are being used as the backend of ASV systems. Applying the raw waveform directly to the model is also popular, as the hidden layers of such a network can build relevant features from the raw data. Some deep learning based models are discussed below.

3.2.2.1 DNN based models

Deep neural networks are discriminative in nature. They are trained to capture the discrimination among the classes rather than to enhance the classification ability of their hidden layers. These networks are capable of representing the features, as after training deep features can be drawn out from a suitable hidden layer. Preprocessing of the data is vital for training. The adjacent context of each frame is fed along with it, which enhances performance; common feature vectors have the same dimension, which makes this approach practical. Figure 10 shows a general architecture of the DNN network. A similar architecture is used in the work of Dinkel et al. (2018), with five fully connected blocks made of linear and batch normalization layers, each having 1024 parameters. These blocks use the ReLU activation function for nonlinearity.

Fig. 10 Architecture of CNN with Elu activation function for non-linearity
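A rough sketch of such a stack of fully connected blocks (linear, batch normalization, ReLU) is given below in PyTorch. The input feature dimension and the two-class output layer are assumptions made for illustration; the block is not the exact configuration of Dinkel et al. (2018).

    import torch.nn as nn

    def fc_block(in_dim, out_dim):
        # Linear -> batch normalization -> ReLU, as in the blocks described above.
        return nn.Sequential(nn.Linear(in_dim, out_dim),
                             nn.BatchNorm1d(out_dim),
                             nn.ReLU())

    # Five stacked blocks followed by a two-class (bonafide/spoof) output layer.
    model = nn.Sequential(
        fc_block(257, 1024),   # 257 = assumed input feature dimension
        fc_block(1024, 1024),
        fc_block(1024, 1024),
        fc_block(1024, 1024),
        fc_block(1024, 1024),
        nn.Linear(1024, 2),
    )
    print(model)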

3.2.2.2 CNN based models

Convolutional neural networks are suitable for finding local and discernible patterns in the dataset, which helps to differentiate bonafide from spoofed speech. These networks require very little preprocessing of the data as they use kernels in their convolutional layers. Pooling layers are an essential building block of the CNN; they reduce the spatial size of the data to simplify the computations and reduce the number of parameters (Mittal & Dua, 2021b). Hidden layers make use of different activation functions to decide whether to fire a neuron. ReLU is the most used activation function, and Softmax is used in the output layer. When a raw speech signal is passed to a CNN, its hidden layers learn the features of the signal well by adjusting their weights, and flatten, fully connected, etc. layers perform the classification of the claimer, as shown by Fig. 11.

Fig. 11 General architecture of CNN

3.2.2.3 RNN based networks

A recurrent neural network is capable of capturing the temporal history of the speech signal. The RNN belongs to the sequence-based scoring group, meaning it does not make any adjustment to the network between the training and evaluation phases. An RNN can remember a sequence of complex past events through its cell mechanism. It calculates an output vector at time t for every input vector \(x_{t}\) and gives a single label, either spoofed or bonafide, to each sequence. As it produces an output vector for each time step, these output vectors need to be reduced to a single vector \(v\). Dinkel et al. (2018) have proposed three approaches for doing so: (1) if T is the last time step, the output vector \(o_{t}\) at that time is passed to \(v\) (\(v = o_{t}\)); (2) a mean is taken over time by summing all output vectors \(o_{t}\) up to time T, dividing by T, and passing the result to \(v\); (3) \(v\) is set to an attention value, i.e., a weighted average calculated by summing the weighted output vectors \(o_{t}\) over time T.

Figure 12 shows a common RNN model, where \(x_{t}\) is the input signal, \(o_{t}\) denotes the output of the different RNN units at different times, and Softmax is the output layer. Being a sequential model, it can obtain more information from the time-varying speech signal (Dua et al., 2019a, b; Sahidullah et al., 2019).

Fig. 12 Architecture of RNN for ASV systems
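The three sequence-to-vector reductions listed above can be sketched as follows; the attention weight vector here is a generic illustration and may differ from the exact formulation used by Dinkel et al. (2018).

    import torch
    import torch.nn.functional as F

    o = torch.randn(100, 64)   # RNN outputs o_t for T=100 time steps, 64-dim each
    w = torch.randn(64)        # assumed (learned) attention parameter vector

    v_last = o[-1]                                   # (1) output at the last time step
    v_mean = o.mean(dim=0)                           # (2) mean over all time steps
    alpha = F.softmax(o @ w, dim=0)                  # (3) attention weight per step
    v_attn = (alpha.unsqueeze(1) * o).sum(dim=0)     #     attention-weighted average

    print(v_last.shape, v_mean.shape, v_attn.shape)  # all torch.Size([64])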

3.2.2.4 LSTM based networks

The Long Short-Term Memory (LSTM) network is a special kind of RNN that can hold information for an extended period. Basically, it is designed to reduce the long-term dependency problem. An RNN is a chain of simple neural network modules; unlike in the RNN, the repeating module of the LSTM has a different structure comprising four interacting neural network layers, each connected in a very special way. To implement an LSTM, the RNN blocks of Fig. 12 contain LSTM units. LSTMs overcome the vanishing gradient drawback of RNNs and are less prone to overfitting (Mittal & Dua, 2021a). A network of two LSTM layers in addition to a DenseNet can be applied well for replay attack detection when trained with MFCC features (Huang & Pun, 2019).

3.2.2.5 Wave-U-Net based networks

Audio signals have high temporal correlations over long ranges, so they need high-quality separation. Wave-U-Net computes and combines feature maps at contrasting time scales by repeatedly resampling them (Chettri et al., 2019).

3.2.2.6 Ensemble of different models

Ensembling is a technique of combining different machine learning models dedicated to solving the same problem. Individual models may not perform well due to high bias or high variance, but they might have learnt different kinds of facts from the data (Kadyan et al., 2021a, b; Kumar & Aggarwal, 2020a, b, c). So, to combine all the usefulness of the single models into one, ensembling is done by stacking, boosting, or bagging methods. Ensembles of ResNets, CNNs, etc. achieve better performance for speech data too (Fawaz et al., 2019).

4 Contributing datasets and evaluation measures

This section describes and analyses the different datasets used in various state-of-the-art ASV systems and discusses evaluation metrics in its latter part.

4.1 ASV datasets

The dataset is also a crucial factor in the development of ASV systems. For training, development, and testing of a designed methodology, a suitable database is always required, especially for machine learning techniques. A database covering all kinds of circumstances and rich in protocols contributes highly to the reliability of a countermeasure (Chen & Salman, 2011). The datasets discussed below are assisting remarkably in the development of threat-free ASV systems. Table 3 gives a complete summary of these datasets, where M denotes Male, F indicates Female, T denotes Training, D stands for Development, E stands for Evaluation, I has been used for Imposter, and B and S are used for Bonafide and Spoofed, respectively.

Table 3 Summary of ASV datasets used in development of ASV systems

4.1.1 YOHO

The English language based YOHO speaker verification dataset was developed in 1989, under a US government contract, to support research on TD-ASV. A total of 138 speakers (106 males and 32 females) took part in the collection of high quality speech. Enrolment data was recorded in four sessions (each with 24 utterances) per speaker, whereas verification data was recorded in ten sessions, each with four utterances. All speakers were asked to pronounce two-digit numbers from 21 to 97 in combinations of three, e.g., "Thirty-Six, Forty-Five, Eighty-Nine." All the utterances were created at an 8000 Hz sampling rate (Campbell, 1995).

4.1.2 Wall Street Journal (WSJ)

DARPA initiated the creation of the Wall Street Journal (WSJ) dataset during 1991, especially to support speech recognition. WSJ0 and WSJ1 are the two accessible databases built under this program. The speech data of WSJ0 was recorded with two microphones; it has three sets containing the speeches recorded by each microphone individually along with a mixture of them, and each set has all documents, transcripts, tests, etc. WSJ1 has 78,000 utterances in the training set, making 73 h of speech data, and 8200 utterances in the testing set, making 8 h of speech data (Paul & Baker, 1992).

4.1.3 TIMIT

The TIMIT dataset was recorded in 1993 at Texas Instruments, Inc. All sounds were recorded at a 16 kHz sampling rate by 630 speakers. Each speaker was asked to speak 10 phonetically rich American English sentences. This dataset is divided into training and testing subsets, both of which have useful variations in phonemes, speaker characteristics, etc. (Garofalo, 1990).

4.1.4 NIST

The NIST Speaker Recognition Evaluation (SRE) dataset has speech files recorded in American English for 2,225 h via microphone and telephone systems. SRE is an ongoing series supporting the development of text-independent speaker recognition systems. The series was initiated in 1996 with the objectives of improving the performance of speech based systems, providing a common platform to researchers in this field, and supporting the community in its idea of advancing voice based technologies. NIST 2019 SRE is the latest step of this ongoing series (Sturim et al., 2016).

4.1.5 Spoofing and antispoofing (SAS) corpus

The Spoofing and Antispoofing (SAS) corpus contains a diverse range of attacks generated by nine speech synthesis (SS) and VC algorithms. This database includes two protocols, one for evaluating the ASV system and another for creating the spoofed utterances. Synthetic speech is part of the corpus along with the natural speech. Non-realistic silence has been removed from the utterances, which makes the dataset a more realistic SS and VC spoofed corpus (Wu et al., 2015a, 2015b, 2015c).

4.1.6 RedDots

The RedDots dataset has a large number of recording sessions with a small number of English utterances in each. The goal while designing the dataset was to provide 52 weekly sessions (one year) to each speaker. Each session was about two minutes long with 24 sentences (10 common, 10 unique, 2 free-choice, 2 free-text). This dataset has large inter-speaker and intra-speaker variations (Lee et al., 2015). Afterwards, the Replayed RedDots dataset was created by re-recording the utterances of the original corpus under different environmental conditions. Both of these databases support the development of replay attack free ASV systems, as the original RedDots provides genuine utterances and Replayed RedDots provides the related spoofed data (Kinnunen et al., 2017).

4.1.7 AVspoof

The AVspoof dataset is designed to support ASV systems as well as anti-spoofing techniques. It contains SS, VC, and replay attacks in a balanced ratio. The replay attacks were generated by various recording devices to include variations in the spoofs, the SS attacks were generated mostly by HMM techniques, and the VC attacks were generated by Festvox. 31 males and 13 females participated in the recording sessions under various environmental conditions with different recording devices. Speakers were asked to read out sentences and phrases and to speak freely about any topic for 3 to 10 min (ASVspoof).

4.1.8 VoxCeleb

It is a collection of audio-visual data extracted from videos chosen from YouTube. The dataset has good diversity in the nationality of the speakers, as there are speakers with Indian, American, Finnish, etc. accents. 61% of the speakers are male, and 39% are female. Utterances are at least 3 s in length (VoxCeleb, 2019). VoxCeleb1 and VoxCeleb2 are the two versions of this dataset, each having audio files, face videos, metadata about speakers, etc. in its training and testing sets. Table 3 shows the number of speakers and utterances in these versions. Their Finnish language based sets are contributing to mimicry attack detection for ASV systems (Vestman et al., 2020).

4.1.9 voicePA

The voicePA dataset was created with the help of the AVspoof dataset. Its genuine data is a subset of the genuine data of the AVspoof dataset, spoken by 44 speakers, each contributing four recording sessions organized in different environments (ASVspoof, 2019). These sessions were recorded by the high quality microphones of a laptop, a Samsung S3, and an iPhone 3GS. The spoofed data consists of 24 types of presentation attacks recorded by five different devices in three different environments. These spoofed utterances are based on the genuine data; SS and VC spoofed audios taken from the original dataset were also replayed (Korshunov et al., 2018a, b). Table 3 shows the number of male and female speakers in the different sets of the dataset.

4.1.10 BioCPqD-PA

This dataset was recorded by 222 participants under different environmental conditions in the Portuguese language. The dataset contains 27,253 genuine and 391,687 presentation-attack utterances. The presentation-attack audios were recorded under 24 configurations made up of 8 loudspeakers and 3 microphones in an isolated room, whereas the genuine data was recorded with one laptop. The dataset is partitioned into training, development, and evaluation sets recorded by different pairs of microphones and loudspeakers. Every set has the voices of all participating speakers (Korshunov et al., 2018a, b).

4.1.11 ASVspoof 2015

The ASVspoof 2015 dataset is derived from the Spoofing and Anti-spoofing (SAS) dataset. It has TTS and VC spoofed utterances along with genuine utterances. 45 males and 61 females contributed to the creation of the genuine speech, and the spoofed speech was generated by ten different algorithms, S1 to S10. Here S1 to S5 are known attacks, and S6 to S10 are unknown attacks introduced in the evaluation set. The whole dataset is partitioned into training (3750 genuine, 12,625 spoofed), development (3497 genuine, 49,875 spoofed), and evaluation (9404 genuine, 184,000 spoofed) sets (Wu et al., 2015a, 2015b, 2015c).

4.1.12 ASVspoof 2017

The ASVspoof 2017 dataset is designed from the RedDots corpus to focus on the replay attack in ASV. It contains the voices of 42 speakers recorded under 61 different combinations of recording devices, replay devices, and environmental conditions, collected over 179 sessions. Version 2.0 of this dataset was released in 2018 after removing errors such as mislabeling, empty files, and zero-sequence artifacts (Delgado et al., 2018).

4.1.13 ASVspoof 2019

The ASVspoof 2019 dataset is extracted from the VCTK corpus. It is divided into Logical Access (LA) and Physical Access (PA) subsets. LA has TTS and VC spoofed speech, and PA has replay spoofed speech. Both of these subsets are further partitioned into training, development, and evaluation subsets. The training subset is generated by 8 males and 12 females, the development subset by 4 males and 6 females, and the evaluation subset by 21 males and 27 females. These subsets are recorded under the same recording conditions; only the sets of speakers are disjoint. The training and development subsets contain known attacks generated by the same algorithms, and the evaluation subset also has unknown attacks made by different synthesizing algorithms. This dataset includes two evaluation measure protocols, namely the Equal Error Rate (EER) and the tandem Detection Cost Function (t-DCF) (Wang et al., 2019).

4.2 Evaluation measures

Evaluation of ASV systems is done on the basis of a predefined threshold. All the evaluation metrics consider the false acceptances and false rejections of the system; they aim to reduce the false acceptance rate or to balance the trade-off between false acceptances and false rejections. This section highlights some of the threshold based evaluation metrics of ASV.

4.2.1 Equal error rate (EER)

An ASV system either accepts or rejects the claimed identity, so there are four possibilities for a classification to be correct or incorrect: True Acceptance (TA), True Rejection (TR), False Acceptance (FA), and False Rejection (FR). TA and TR are the desirable outcomes, but FR and FA are harmful situations for the system. These possibilities are determined on the basis of a predefined threshold \(\tau\) (Todisco et al., 2017). In the case of FA, a spoofed utterance having a score greater than or equal to \(\tau\) gets accepted, and in the case of FR, a genuine utterance having a score less than \(\tau\) gets rejected. To measure the performance of ASV, the Equal Error Rate (EER) is used, which is the value at which the False Acceptance Rate (FAR) (Eq. 14) and the False Rejection Rate (FRR) (Eq. 15) become equal.

$$ FAR = \frac{{Count\left( {FA} \right)}}{{Count\left( {spoofed\, utterances} \right)}} $$
(14)
$$ FRR = \frac{{Count\left( {FR} \right)}}{{Count\left( {genuine\, utterances} \right)}} $$
(15)
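A simple way to estimate the EER from Eqs. 14 and 15 is sketched below. It picks the threshold where FAR and FRR are closest (production implementations usually interpolate the DET curve), and the score distributions are synthetic.

    import numpy as np

    def compute_eer(genuine_scores, spoofed_scores):
        # Sweep candidate thresholds and find where FAR (Eq. 14) and FRR (Eq. 15) cross.
        thresholds = np.sort(np.concatenate([genuine_scores, spoofed_scores]))
        far = np.array([(spoofed_scores >= t).mean() for t in thresholds])
        frr = np.array([(genuine_scores < t).mean() for t in thresholds])
        idx = np.argmin(np.abs(far - frr))
        return (far[idx] + frr[idx]) / 2, thresholds[idx]

    genuine = np.random.normal(2.0, 1.0, 1000)   # illustrative score distributions
    spoofed = np.random.normal(0.0, 1.0, 1000)
    eer, tau = compute_eer(genuine, spoofed)
    print(f"EER = {eer:.3f} at threshold {tau:.3f}")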

4.2.2 Detection error tradeoff (DET) curve

Early research used the Detection Error Tradeoff (DET) curve for the evaluation of ASV systems. It is suitable for binary classification problems. It plots the error trade-off curve with FAR on the x-axis and FRR on the y-axis. Reynolds et al. (2000) show a comparative representation of their models with DET curves.

4.2.3 Half total error rate (HTER)

FAR and FRR are inversely proportional to each other, so they can be illustrated as a function of predefined threshold \(\tau\) for a particular dataset DS. Equation 16 shows the calculation method of HTER.

$$ HTER_{{\left( {DS, \tau } \right)}} = \frac{{FRR_{{\left( {DS, \tau } \right)}} + FAR_{{\left( {DS, \tau } \right)}} }}{2} $$
(16)

4.2.4 Tandem detection cost function (t-DCF)

The tandem Detection Cost Function (t-DCF) is an ASV-centric evaluation measure (Kinnunen et al., 2018). The ASVspoof 2019 challenge has provided ASV and countermeasure protocols explicitly (Todisco et al., 2019). The t-DCF of a system can be calculated by Eq. 17.

$$ tDCF_{SYST} \left( \tau \right) = Val_{FA}^{ASV} .\alpha_{tar} .L_{FA}^{ASV} \left( \tau \right) + Val_{FR}^{ASV} .\alpha_{non - tar} .L_{FR}^{ASV} \left( \tau \right) $$
(17)

where \(Val_{FA}^{ASV}\) and \(Val_{FR}^{ASV}\) are the cost values of the false acceptance of a non-target utterance and the false rejection of a target utterance, respectively. \(L_{FA}^{ASV} \left( \tau \right)\) and \(L_{FR}^{ASV} \left( \tau \right)\) are the values of FAR and FRR of the ASV system, respectively, at threshold \(\tau\). \(\alpha_{tar}\) is the probability of an utterance being a target and \(\alpha_{non - tar}\) is the probability of an utterance being a non-target. All these computations are performed with the assumption that an ideal countermeasure has 0% FR and FA, which makes the EER 0% (Eq. 18).

$$ L_{FA}^{CM} = L_{FR}^{CM} = 0 $$
(18)

\(L_{FA}^{CM}\) is the value of FAR of the countermeasure and \(L_{FR}^{CM}\) is the value of FRR of the countermeasure at threshold \(\tau\).

5 Spoofing attacks to the ASV systems

Spoofing attacks are categorized into direct and indirect access attacks on the basis of the level of access to the system required to conduct them. Direct attacks are introduced via the microphone and transmission channel, whereas indirect attacks are injected during speech processing, the internal distribution of information, classification, and even just before the declaration of the result after verification of the claimer (Wu et al., 2015a, 2015b, 2015c). The various known direct spoofing attacks are categorized into Logical Access (LA) and Physical Access (PA) attacks (Chettri et al., 2020). LA attacks, which are generated algorithmically, comprise Voice Conversion (VC) and Text-to-Speech (TTS) spoofing attacks. The best of these algorithms produce speech very similar to bonafide speech, and these synthetic speeches are injected directly into the system with no involvement of microphones. On the other hand, PA attacks are accomplished by presenting impersonated speech physically or by playing recorded speech back in front of the microphone; these spoofing attacks are the replay, mimicry, and twins attacks (Chettri et al., 2019). The potential risk from direct spoofing attacks increases due to the enhancement of speech synthesizing toolkits (Suthokumar et al., 2017) and (Vestman et al., 2020) the public availability of user data via social media, audio/video sharing platforms, service provider websites, etc. Figure 15 shows the complete classification of all types of attacks on an ASV system.

5.1 Direct access attacks

Direct spoofing attacks are the most common and easily attainable threats to ASV systems. These attacks can be performed without complete knowledge of the system (Wu et al., 2015a, 2015b, 2015c). LA attacks are performed at the transmission level, and PA attacks are performed at the speech input level (via the microphone). Standard datasets like AVspoof, ASVspoof 2015, ASVspoof 2019, voicePA, etc. are enriched with these attacks.

5.1.1 Logical access (LA) attacks

Progress in the development of voice synthesis algorithms has promoted Logical Access (LA) attacks. Some online platforms and open source software tools (Festival and Festvox) are available to generate these threats directly. Injection of these attacks takes place directly via the transmission channel. Accessing the channel becomes easy in the case of telephonic communications applied in banking, e-commerce, etc. (Reynolds & Rose, 1995). To perform the attack, an imposter pretending to be a legitimate user puts a synthesized speech utterance into the channel and gains access to the system. The most general LA attacks are discussed below.

5.1.1.1 Voice conversion (VC) spoofing attack

Although Voice Conversion (VC) has been a useful technique for the development of personalized speech driven systems for nearly the last two decades (Lim & Kwan, 2011; Patil & Kamble, 2018), it is broadly being used as a threat to ASV systems (Mohammadi & Kain, 2017; Pellom & Hansen, 1999). VC attacks are conducted by applying to the system an artificial speech that is generated by converting the imposter's voice into the target speaker's voice. Converted speech can be achieved by GMM based, bilinear based, codebook based, neural network based, etc. approaches; some are defined in Helander and Gabbouj (2012). During training, the process of voice conversion involves a transformation function that is applied to phone- or frame-aligned utterances of the imposter and the target speaker. This function converts the voice characteristics of the imposter's utterance into those of the target speaker's voice (Patil & Kamble, 2018). This threat affects the performance of ASV systems considerably.

5.1.1.2 Text-to-speech (TTS) or speech synthesis (SS) spoofing attack

The Speech Synthesis (SS) spoofing attack is generated by the Text-to-Speech (TTS) method. Synthetic speech is produced by this method in two steps. First, the input text is converted into a linguistic representation containing elements like phonemes. In the second step, the speech waveform is generated from this representation. The speech waveform can be generated by different approaches (Patil & Kamble, 2018; Wu et al., 2015a, 2015b, 2015c). Indumathi and Chandra (2012) present three generations (G1, G2, G3) of speech synthesis based on acoustic characteristics, the usage of speech corpora, and statistical models, in that order. Formant synthesis and articulatory speech synthesis belong to the first generation (G1), where formant synthesis is the very first technique in the field of SS, which tries to model the transfer function of the human vocal tract. Formant synthesis produces low quality, robot-like sounding speech, but articulatory speech synthesis is able to produce far more natural speech as it simulates the biological sound production system (Karpe & Vernekar, 2018). The corpus based second generation (G2) includes the concatenative speech and sinusoidal synthesis techniques. G2 was initiated in 1980 with the usage of small datasets in SS; then, in 1990, large-sized corpora were collected (Patil & Kamble, 2018). The year 2000 marked the start of the statistical model based third generation (G3), which contributed to SS with Hidden Markov Model (HMM) based and unit selection based techniques. HMM synthesis produces very natural speech by using maximum likelihood criteria to model characteristics of speech like the fundamental frequency (Indumathi & Chandra, 2012). Unit selection is another approach of G3, which makes use of a large variety of speech from the corpus to deliver a better quality voice (Karpe & Vernekar, 2018). From 2010, a succeeding generation (G4) based on deep learning has been introducing improvements in acoustic characteristic prediction and overcoming the limitations of traditional techniques (Ze et al., 2013). Since the start of G4, various Deep Neural Network (DNN), Recurrent Neural Network (RNN), etc. based techniques have been proposed that provide drastic improvements in this field. Figure 13 shows the complete view of these generations. These emerging SS technologies are contributing to the implementation of TTS systems like text readers, digital personal assistants, etc.

Fig. 13 Generations of speech synthesis
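As a rough illustration of the two-step TTS pipeline described above, the sketch below uses a made-up toy lexicon for the text-analysis step and generates each phoneme as a short sine tone at an invented "formant" frequency; it is only meant to show the structure (text to phonemes, phonemes to waveform), not any real synthesis technique.

```python
import numpy as np

# Step 1: text analysis -- map words to phoneme-like symbols.
# TOY_LEXICON is a made-up lookup table standing in for a real
# grapheme-to-phoneme front end.
TOY_LEXICON = {"open": ["OW", "P", "AH", "N"], "door": ["D", "AO", "R"]}

def text_to_phonemes(text):
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, []))
    return phonemes

# Step 2: waveform generation -- here each phoneme becomes a short sine
# tone at an invented frequency, loosely mimicking formant synthesis (G1);
# real systems use concatenation, HMMs, or neural vocoders instead.
TOY_FORMANTS = {"OW": 500, "P": 120, "AH": 700, "N": 250, "D": 150, "AO": 600, "R": 300}

def phonemes_to_waveform(phonemes, sr=16000, dur=0.08):
    t = np.arange(int(sr * dur)) / sr
    segments = [np.sin(2 * np.pi * TOY_FORMANTS[p] * t) for p in phonemes]
    return np.concatenate(segments) if segments else np.zeros(0)

waveform = phonemes_to_waveform(text_to_phonemes("open door"))
```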

5.1.2 Physical access (PA) attacks

These attacks are carried out by presenting spoofed utterances to the microphone of the ASV system. They are the easiest form of attack because the impostor does not need to generate the spoofed speech algorithmically to gain access to the system. The taxonomy of these attacks is presented below.

5.1.2.1 Replay attack

To conduct a replay attack on an ASV system, an impostor only needs a recording device, which is easily available today, and favourable acoustic conditions (Wu et al., 2015a, 2015b, 2015c). The impostor covertly records the voice of a target registered user, then plays the recording back at the input of the system while claiming that user’s identity. Speech recorded in a quiet environment is highly likely to attack the system successfully, but replayed speech carries distinctive clues such as altered acoustic features and the additive and convolutional distortions introduced by the intermediate devices. In most cases the initial 400 ms of an utterance are enough to classify it as spoofed or genuine (Chettri et al., 2018). The ASVspoof 2017 challenge began addressing replay attacks by providing the standard ASVspoof 2017 version 1.0 and version 2.0 datasets, as very little data and research had previously been available for this attack type (Oo et al., 2019). Using this dataset, Lavrentyeva et al. (2017) show that deep-learning-based countermeasures identify replay attacks better than classical GMM approaches.
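A small sketch of how the 400 ms observation might be exploited in practice is given below, assuming the librosa library is available for audio loading and MFCC extraction; the file path, sampling rate, and feature settings are illustrative only.

```python
import numpy as np
import librosa

def leading_segment_features(path, sr=16000, segment_ms=400, n_mfcc=20):
    """MFCC summary of only the first segment_ms of an utterance,
    following the observation that the initial frames already carry
    most of the replay cues."""
    y, sr = librosa.load(path, sr=sr)
    n_samples = int(sr * segment_ms / 1000)
    y = y[:n_samples]                                   # keep first 400 ms only
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Average over frames to obtain a fixed-length vector for a classifier.
    return np.mean(mfcc, axis=1)

# Hypothetical usage:
# features = leading_segment_features("claimed_utterance.wav")
```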

5.1.2.2 Mimicry attack

The vulnerability of ASV systems to mimicry attacks was identified about fifty years ago (Vestman et al., 2020). Since then, various experiments and analyses have been conducted to establish this fact. Lau et al. (2004) attacked a Gaussian Mixture Model based ASV countermeasure, trained on 138 speakers of the YOHO dataset (Lau et al., 2005), using two mimics as impostors, and verified that ASV systems can be deceived by mimicking a valid user’s voice. Other notable studies are presented by Hautamäki et al. (2013, 2014), in which the authors trained a cosine-scoring i-vector countermeasure and a Gaussian Mixture Model Universal Background Model (GMM-UBM) on a Finnish-language dataset; their experiments show a significant increase in the False Acceptance Rate (FAR) of the i-vector countermeasure under mimicry attack. Because no technology is involved, an impostor can attempt this attack with minimal effort, simply by modifying his or her own voice characteristics to cross the system’s security barriers.

A skilful attacker manipulates his or her prosodic features, lexical behaviour, and related cues to produce a voice similar to that of a target user. In a particularly elegant attack strategy (Fig. 14), the professional first finds the closest matching enrolled speaker with the help of a general ASV system, practises on an experimental machine, and only then attacks the target ASV system. The same procedure can be used to build mimicry-resistant ASV systems with the help of a suitable dataset containing voices of well-known people, such as VoxCeleb1 or VoxCeleb2 (Vestman et al., 2020).

Fig. 14 Mimicry attack scenario
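The attacker-side step of finding the closest matching enrolled speaker can be sketched as simple cosine scoring over speaker embeddings (for example i-vectors), assuming such embeddings are available; the embedding dimension and the randomly generated vectors below are placeholders.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical speaker embeddings (e.g., i-vectors) of publicly known voices.
rng = np.random.default_rng(1)
enrolled = {f"celebrity_{i}": rng.normal(size=400) for i in range(100)}
attacker_embedding = rng.normal(size=400)

# The attacker ranks enrolled speakers by similarity to his/her own voice
# and picks the closest one as the target to practise imitating.
best_target = max(enrolled, key=lambda name: cosine_score(attacker_embedding, enrolled[name]))
```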

5.1.2.3 Twins attack

Identical twins are known to have similar voice features because the shapes and structures of their vocal tracts are similar or nearly identical; a twins attack can therefore be seen as a mimicry attack based on matching physiological characteristics. The spectrographic patterns of their speech samples nevertheless show speaker-specific variation (Lindberg & Blomberg, 1999). Feature sets such as MFCC combined with the Teager Energy Operator (TEO), or MFCC combined with the Variable-length Teager Energy Operator (VTEO), have proved effective in distinguishing identical twins or triplets across several Indian languages (Patil & Parhi, 2009; Patil et al., 2017).
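For reference, the discrete Teager Energy Operator underlying the TEO/VTEO features mentioned above is Ψ[x(n)] = x(n)² − x(n−1)·x(n+1); a minimal sketch of its computation on a signal is shown below (the pairing with MFCCs used in the cited work is omitted).

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Example: for a pure tone the operator is roughly constant, tracking both
# the amplitude and the frequency of the signal.
t = np.arange(0, 1, 1 / 16000)
tone = np.sin(2 * np.pi * 200 * t)
teo = teager_energy(tone)
```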

5.2 Indirect access attacks

Carrying out an indirect access attack requires access to internal parts of the ASV system, such as the feature extraction unit, the database of registered users, the verification model, or the decision-making unit (Wu et al., 2015a, 2015b, 2015c). Gaining access to these components is complicated, so indirect attacks are infrequent. Even a single successful attack, however, compromises the privacy of users and the confidentiality of data, so these attacks also need to be identified and prevented (Fig. 15).

Fig. 15 Types of attacks to the ASV system

6 ASV spoofing and countermeasures discussion

Work on designing spoof-resistant countermeasures has been ongoing since the speaker verification problem first arose. Initially, classical machine learning was applied with different feature extraction techniques, and ASV countermeasures have always been improved in parallel with advances in feature extraction and classification technology. The introduction of efficient deep learning models has opened a new era in ASV. A detailed discussion of the countermeasures of the old and new eras is provided below. Table 4 summarises the different aspects of the countermeasures of these eras, and Table 5 lists all the notations used in the paper.

Table 4 Summary of different aspects of old and new era countermeasures
Table 5 Notations used in the paper

6.1 ASV spoofing and countermeasures: old era

The most significant work on ASV systems can be traced to the latter decades of the twentieth century. The early period of development relied on models such as HMMs and GMMs. These techniques contributed remarkably to countermeasure design and are compatible with almost all kinds of speech features. Villalba et al. (2015) trained an HMM with static and dynamic MFCC features extracted from ATR speech data for the speaker verification task of a TD-ASV system, achieving 0% EER for human speakers, while a separate HMM system was used for speech synthesis. When the synthetic speech generated by that synthesis system was tested against the reference HMM model, it produced a remarkably high false acceptance rate of 70% using only a single synthetic utterance from each user. Concatenation of isolated words, re-synthesis, and diphone synthesis represent worst-case scenarios for the security of an ASV system; among these, word concatenation produced the spoofed utterances that were hardest to reject for an HMM trained with LPCC features (Sztahó et al., 2019). Reynolds et al. (2000) state that, for TI-ASV systems, the GMM has been the most successful model for maximum-likelihood modelling. They used a GMM-UBM with mel-scale cepstral coefficients, taking into account whether the speech came from a carbon-button or an electret microphone handset: they trained 2048-mixture GMMs for the carbon-button handset and 1024-mixture gender-dependent UBMs for the electret handset, and finally built a 2048-mixture gender-dependent UBM that performed better than the others. The HTK toolkit is used for speech processing together with HMMs (Dua et al., 2012a, 2012b; Wong et al., 2001); MFCCs extracted with it, together with a GMM, were used to check vulnerability to mimicry attacks (Lau et al., 2004). GMMs with 32, 64, and 128 mixtures performed best for three sets of utterances grouped by duration (Shanmugapriya & Venkataramani, 2011). Speech synthesis (SS) and mimicry were the main known attacks for decades, but with advancing technology voice conversion (VC) was added to the list of ASV vulnerabilities; this attack was included in the ASVspoof 2015 dataset along with SS. A 128-mixture GMM trained with Cochlear Filter Cepstral Coefficients with Instantaneous Frequency (CFCCIF) was used to counter it (Kersta & Colangelo, 1970; Patel & Patil, 2015). SVMs have been used together with neural networks and GMMs (Godoy et al., 2015; Hanilçi et al., 2015), a one-class SVM has been combined with a DNN as a supporting model (Hanilçi et al., 2015), and the k-means algorithm similarly plays a supporting role in ASV. Unvoiced speech frames carry more spoofing information than voiced speech; in the work of Wu et al. (2015a, 2015b, 2015c), an experiment separating low- and high-energy frames was carried out to detect SS and VC attacks on the ASVspoof 2015, SAS, and BTAS 2016 datasets. Static and dynamic features extracted by techniques such as MFCC, Instantaneous Frequency Cosine Coefficients (IFCC), and CQCC have been compared on GMM-based systems: CQCC features perform best when single systems are compared, but an ensemble of all features performed best for spoof detection when trained on the replay-attack data of the ASVspoof 2017 dataset (Jelil et al., 2017).
After the ASVspoof 2017 challenge, research on replay attacks increased drastically, and many reliable countermeasures are now available for this single attack type. The old era was dominated by GMMs for decades, but the rise of deep learning drew the research community towards neural-network-based models, starting a new era in the development of ASV systems. Because classical machine learning techniques, and GMMs in particular, are easy to understand and well suited to speech-based systems, a considerable amount of research involving them is still ongoing.
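Since GMM-UBM scoring dominates this era, the following minimal sketch shows the basic log-likelihood-ratio decision using scikit-learn; the feature matrices are random placeholders, only a handful of mixtures are used, and the speaker model is trained from scratch rather than MAP-adapted from the UBM as real systems do.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
ubm_features = rng.normal(size=(5000, 13))                # pooled background MFCC frames
speaker_features = rng.normal(loc=0.3, size=(800, 13))    # enrolment frames of the claimed speaker
test_features = rng.normal(loc=0.3, size=(300, 13))       # frames of the test utterance

# Train the universal background model and the speaker model.
# (Real systems MAP-adapt the speaker model from the UBM and use
# hundreds to thousands of mixtures.)
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(ubm_features)
spk = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(speaker_features)

# Average per-frame log-likelihood ratio; accept if it exceeds a threshold.
llr = np.mean(spk.score_samples(test_features) - ubm.score_samples(test_features))
decision = "accept" if llr > 0.0 else "reject"
```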

6.2 ASV spoofing and countermeasures: new era

Deep-learning-based countermeasures are easy to implement and need less preprocessing of the data. ASV research adopted deep learning less than a decade ago and has since tracked its progress from DNNs to CNNs, RNNs, LSTMs, autoencoders, and related architectures. Speaker verification is a two-class classification problem, and a DNN can handle this classification well. Hanilçi et al. (2015) use a DNN together with an SVM on the ASVspoof 2015 dataset, which contains SS and VC attacks; a fusion of these models achieves less than 0.05% EER for nine of the ten spoofing attack types. The DNN model uses a softmax output layer.
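A minimal PyTorch sketch of such a two-class DNN with a softmax over the output classes is given below; the input dimension, layer sizes, and batch are illustrative and do not reproduce the cited system.

```python
import torch
import torch.nn as nn

class SpoofDNN(nn.Module):
    """Small feed-forward network for two-class (genuine vs. spoof) decisions."""
    def __init__(self, input_dim=60, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2),            # logits for {genuine, spoof}
        )

    def forward(self, x):
        return self.net(x)

model = SpoofDNN()
features = torch.randn(16, 60)                   # a batch of illustrative feature vectors
probs = torch.softmax(model(features), dim=1)    # softmax over the two classes
# Training would typically minimise nn.CrossEntropyLoss on the raw logits.
```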

Deep learning models can also handle raw input audio signals if the signal is presented to the model appropriately; they can even process music audio files (Lee et al., 2017; Morfi & Stowell, 2018). The ASVspoof 2017 challenge brought new insight into the replay attack and prompted extensive study of this attack type. A CNN with five convolutional, two fully connected, five softmax, and four network-in-network layers achieves good accuracy (Chettri et al., 2018). The VoicePA and BioCPqD-PA datasets have been used with a shallow CNN with 20 neurons in the convolutional layer and with a deep CNN with three convolutional layers for Presentation Attack Detection (PAD) (Korshunov et al., 2018a, b); both networks are trained on MFCC features. Ensembles of ResNets, of Fully Convolutional Neural Networks (FCN), and of other neural networks have proved most suitable for time-series classification tasks (Fawaz et al., 2019). Chettri et al. (2019) train CNN, Convolutional Recurrent Neural Network (CRNN), One-Dimensional CNN (1D-CNN), and Wave-U-Net models, alongside classical classification models, on the ASVspoof 2019 dataset. Time- and frequency-domain representations of the speech signal are applied to the input layers of these networks, and the models are trained with early stopping and a binary cross-entropy loss function; Adam is the most widely used optimizer in ASV-related and other networks. Individual performance is measured by EER and t-DCF for each model, but an ensemble performs best on both measures (Chettri et al., 2019). The ASVspoof 2019 dataset has two partitions: LA, which focuses on SS and VC attacks, and PA, which focuses on replay attacks.
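Because EER is the main metric quoted throughout this section, the following sketch shows one simple way to compute it from genuine and spoof detection scores (higher scores are assumed to indicate genuine speech); the score arrays are random placeholders, and t-DCF is omitted because it requires additional cost parameters.

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """EER: the error rate at the threshold where false acceptance of
    spoofed trials equals false rejection of genuine trials."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    eer, best_gap = 1.0, np.inf
    for th in thresholds:
        far = np.mean(spoof_scores >= th)     # spoofed trials wrongly accepted
        frr = np.mean(genuine_scores < th)    # genuine trials wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

rng = np.random.default_rng(3)
eer = equal_error_rate(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000))
```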

All of these studies show that deep learning is also suitable for time-varying speech signals, presented either directly or through a feature extraction technique, and that ensembles of different models achieve better performance than the individual models. Mimicry and twins attacks have not yet been targeted much with modern machine learning techniques. This deep-learning-based new era offers good prospects for further improving ASV systems.

7 Conclusion, discussion and future expectations

This paper has examined each part of ASV systems by analysing different research works in the field. The survey finds that MFCCs are the most commonly used and reliable features, while CQCC features have proved the most efficient, achieving better performance. It also notes that dynamic features model speaker- and speech-specific information better than static features. Some earlier databases and corpora focused only on speaker recognition tasks; the datasets provided by the ASVspoof community in 2015, 2017, and 2019 are richer corpora covering direct access attacks than the classical datasets. Mimicry-related research addresses different languages, but the other attacks are studied mostly in English. Although deep learning has entered ASV systems, researchers are still attracted to GMMs. In classical learning, GMMs suit TI-ASV systems, whereas HMMs suit TD-ASV systems because they model the temporal information of known text; for TI-ASV they are less efficient than GMMs. Deep learning is trained both on features extracted by different feature extraction techniques and on raw speech waveforms, and its use has started a new era in these systems. Spoofing techniques are improving in parallel with the enhancement of ASV countermeasures, and deep learning has also given new insight into SS or TTS spoofing attacks with the start of a new generation (G4) of text-to-speech conversion. Nevertheless, the state of ASV systems is good enough that industry is already using them in practice. This survey aims to provide sufficient information for beginners as well to develop, start, or continue research on ASV systems. We list some suggestions and facts to be considered in future work on ASV systems:

  i. So far, different countermeasures suit different particular sets of attacks and work well only with particular speech corpora. We draw the attention of the research community to the design of a single countermeasure that performs well in all respects.

  ii. Mimicry and twins attacks have not been targeted much with modern machine learning techniques, although early GMM-based research studied mimicry thoroughly. We therefore advocate focusing on these attack types with deep learning in the near future.

  iii. Although indirect access attacks require a different kind of effort to mount, they should also be considered when installing an ASV system, because even a single threat can break the robustness of the whole security mechanism.

  iv. A set of features that can model all kinds of speech variation, together with a suitable combination of classification models, still needs to be chosen for countermeasure design.

  v. A single standard dataset covering every kind of possible attack, with utterances spoken by males and females of every age group, in different languages, and under different environmental conditions, appears to be needed; in particular, recently designed datasets do not include twins and mimicry attacks.

This paper has reviewed the various technologies and advancements proposed in this area and brings together a broad knowledge base in one place. For future work, a single countermeasure that can cope with all kinds of spoofing attacks should be the next research target, and hybrid systems should be developed to combine the benefits of different technologies.