7.1 Introduction

As one of our primary methods of communication, the speech modality has natural appeal as a biometric in one of two different scenarios: text-independent and text-dependent. While text-dependent automatic speaker verification (ASV) systems use fixed or randomly prompted utterances with known text content, text-independent recognisers operate on arbitrary utterances, possibly spoken in different languages. Text-independent methods are best suited to surveillance scenarios where speech signals are likely to originate from noncooperative speakers. In authentication scenarios, where cooperation can be readily assumed, text-dependent ASV is generally more appropriate since better performance can then be achieved with shorter utterances. On the other hand, text-independent recognisers are also used for authentication in call-centre applications such as caller verification in telephone banking (Footnote 1). On account of its utility in surveillance applications, evaluation sponsorship and dataset availability, text-independent ASV dominates the field.

The potential for ASV to be spoofed is now well recognised [1]. Since speaker recognition is commonly used in telephony or other unattended, distributed scenarios without human supervision, speech is arguably more prone to malicious interference or manipulation than other biometric signals. However, while spoofing is relevant to authentication scenarios and therefore text-dependent ASV, almost all prior work has been performed on text-independent datasets more suited to surveillance. While this observation most likely reflects the absence of viable text-dependent datasets in the recent past, progress in the development of spoofing countermeasures for ASV is lagging behind that in other biometric modalities (Footnote 2).

Nonetheless, there is growing interest in assessing the vulnerabilities of ASV to spoofing and there are new initiatives to develop countermeasures [1]. This article reviews the past work, which is predominantly text-independent. While the use of different datasets, protocols and metrics hinders such a task, we aim to describe and analyse four different spoofing attacks considered thus far: impersonation, replay, speech synthesis and voice conversion. Countermeasures for all four spoofing attacks are also reviewed and we discuss the directions which must be taken in future work to address weaknesses in the current research methodology and to properly protect ASV systems from the spoofing threat.

7.2 Automatic Speaker Verification

This section describes state-of-the-art approaches to text-independent automatic speaker verification (ASV) and their potential vulnerabilities to spoofing.

7.2.1 Feature Extraction

Since speech signals are nonstationary, features are commonly extracted from short-term segments (frames) of 20–30 ms in duration. Typically, mel-frequency cepstral coefficient (MFCC), linear predictive cepstral coefficient (LPCC), or perceptual linear prediction (PLP) features are used as a descriptor of the short-term power spectrum. These are usually appended with their time-derivative coefficients (deltas and double-deltas) and they undergo various normalisations such as global mean removal or short-term Gaussianization or feature warping [2]. In addition to spectral features, prosodic and high-level features have been studied extensively [3–5], achieving comparable results to state-of-the-art spectral recognisers [6]. For more details regarding popular feature representations used in ASV, readers are referred to [7].
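The framing, delta and mean-removal steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the 25 ms frame length, the 10 ms hop, the regression window of width 2 and all function names are illustrative choices not taken from the cited work, and the cepstral analysis itself (MFCC, LPCC or PLP) is omitted.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a list of samples into overlapping short-term frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def delta(features, width=2):
    """Regression-based time derivative of a sequence of feature vectors.

    Frames beyond the sequence edges are replaced by the nearest edge frame.
    Applying the function twice gives double-deltas.
    """
    n_frames = len(features)
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n_frames):
        vec = []
        for d in range(len(features[t])):
            acc = 0.0
            for k in range(1, width + 1):
                nxt = features[min(t + k, n_frames - 1)][d]
                prv = features[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            vec.append(acc / denom)
        out.append(vec)
    return out


def mean_normalise(features):
    """Global mean removal: subtract the per-dimension utterance mean."""
    n_frames = len(features)
    dim = len(features[0])
    means = [sum(f[d] for f in features) / n_frames for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in features]
```

In such a pipeline, each frame would first be mapped to a cepstral vector, and the static, delta and double-delta vectors would then be concatenated frame by frame.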

The literature shows that ASV systems based on both spectral and prosodic features are vulnerable to spoofing. As described in Sect. 7.3, state-of-the-art voice conversion and statistical parametric speech synthesisers may also use mel-cepstral and linear prediction representations; spectral recognisers can be particularly vulnerable to synthesis and conversion attacks which use ‘matched’ parameterisations. Recognisers which utilise prosodic parameterisations are in turn vulnerable to human impersonation.

7.2.2 Modelling and Classification

Approaches to ASV generally focus on modelling the long-term distribution of spectral vectors. To this end, the Gaussian mixture model (GMM) [8, 9] has become the de facto modelling technique. Early ASV systems used maximum likelihood (ML) [8] and maximum a posteriori (MAP) [9] training. In the latter case, a speaker-dependent GMM is obtained from the adaptation of a previously trained universal background model (UBM). Adapted GMM mean supervectors obtained in this way were combined with support vector machine (SVM) classifiers in [10]. This idea led to the development of many successful speaker model normalisation techniques including nuisance attribute projection (NAP) [11, 12] and within-class covariance normalisation (WCCN) [13]. These techniques aim to compensate for intersession variation, namely differences in supervectors corresponding to the same speaker caused by channel or session mismatch.
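To make the MAP adaptation step concrete, the following sketch adapts only the means of a one-dimensional UBM towards a speaker's enrolment data. It is an illustration under assumptions rather than the procedure of any cited system: the relevance factor of 16, the restriction to scalar features and the function names are all choices made here for brevity.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a 1-D GMM (the UBM) to enrolment data.

    Each adapted mean interpolates between the UBM mean and the
    posterior-weighted data mean; components that see little data
    stay close to the UBM.
    """
    n_comp = len(means)
    counts = [0.0] * n_comp      # soft occupation counts per component
    first = [0.0] * n_comp       # posterior-weighted sums of the data
    for x in data:
        likes = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        for i in range(n_comp):
            post = likes[i] / total
            counts[i] += post
            first[i] += post * x
    adapted = []
    for i in range(n_comp):
        alpha = counts[i] / (counts[i] + relevance)
        data_mean = first[i] / counts[i] if counts[i] > 0.0 else means[i]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[i])
    return adapted
```

Stacking the adapted means of all components into a single vector gives the kind of mean supervector that is passed to the SVM classifier.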

Parallel to the development of SVM-based discriminative models, generative factor analysis models were pioneered in [14–16]. In particular, joint factor analysis (JFA) [14] can improve ASV performance by incorporating distinct speaker and channel subspace models. These subspace models require the estimation of various hyper-parameters using labelled utterances. Subsequently, JFA evolved into a much-simplified model that is now the state of the art. The so-called total variability model or ‘i-vector’ representation [17] uses latent variable vectors of low dimension (typically 200–600) to represent an arbitrary utterance. Unlike JFA, the training of an i-vector extractor is essentially an unsupervised process which leads to only one subspace model. Accordingly, it can be viewed as an approach to dimensionality reduction, while compensation for session, environment and other nuisance factors is applied in the computationally light back-end classification. To this end, probabilistic linear discriminant analysis (PLDA) [18] with length-normalised i-vectors [19] has proven particularly effective.

Being based on the transformation of short-term cepstra, conversion and synthesis techniques also induce a form of ‘channel shift’. Since they aim to attenuate channel effects, approaches to intersession compensation may present vulnerabilities to spoofing through the potential to confuse spoofed speech with channel-shifted speech of a target speaker. However, even if there is some evidence to the contrary, i.e., that recognisers employing intersession compensation might be intrinsically more robust to voice conversion attacks [20], all have their roots in the standard GMM and independent spectral observations. None of them utilises time sequence information, a key characteristic of speech which might otherwise afford some protection from spoofing.

7.2.3 System Fusion

In addition to the development of increasingly robust models and classifiers, there is a significant emphasis within the ASV community on the study of classifier fusion. This is based on the assumption that independently trained recognisers capture different aspects of the speech signal not covered by any individual classifier. Fusion also provides a convenient vehicle for large-scale research collaborations promoting independent classifier development and benchmarking [21]. The base recognisers can differ in their features, classifiers or hyper-parameter training sets [22]. A simple, yet robust approach to fusion involves the weighted summation of the base classifier scores, where the weights are optimised according to a logistic regression cost function. For recent trends in fusion, readers are referred to [23].
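The weighted-sum fusion described above can be sketched as follows. This is a minimal illustration, assuming plain batch gradient descent on the logistic regression cost; the learning rate, the number of epochs and the function names are arbitrary choices made here, and practical systems typically rely on dedicated calibration toolkits instead.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse(scores, weights, bias):
    """Weighted summation of the base classifier scores for one trial."""
    return bias + sum(w * s for w, s in zip(weights, scores))


def train_fusion(score_vectors, labels, lr=0.1, epochs=2000):
    """Fit fusion weights by minimising the logistic regression cost.

    score_vectors: one list of base classifier scores per trial.
    labels: 1 for target trials, 0 for impostor trials.
    """
    n_systems = len(score_vectors[0])
    n_trials = len(score_vectors)
    weights = [0.0] * n_systems
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_systems
        grad_b = 0.0
        for x, y in zip(score_vectors, labels):
            err = sigmoid(fuse(x, weights, bias)) - y
            for i in range(n_systems):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n_trials for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_trials
    return weights, bias
```

A base classifier that separates targets from impostors more reliably on the training trials tends to receive a larger weight, which is also why spoofing a single, heavily weighted sub-system can be enough to shift the fused score.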

While we are unaware of any spoofing or anti-spoofing studies on fused ASV systems, some insight into their likely utility can be gained from related work in fused, multi-modal biometric systems; whether the scores originate from different biometric modalities or sub-classifiers applied to the same biometric trait makes little difference. A common claim is that multi-biometric systems should be inherently resistant to spoofing since an impostor is less likely to succeed in spoofing all the different subsystems. We note, however, that [24] suggests it might suffice to spoof only one modality under a score fusion setting in the case where the spoofing of a single, significantly weighted sub-system is particularly effective.

7.3 Spoofing and Countermeasures

Spoofing attacks are performed on a biometric system at the sensor or acquisition level to bias score distributions toward those of genuine clients, thus provoking increases in the false acceptance rate (FAR). This section reviews past work to evaluate vulnerabilities and to develop spoofing countermeasures. We consider impersonation, replay, speech synthesis and voice conversion.
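The effect of spoofing on the FAR can be illustrated with a few lines of Python. The sketch below is hypothetical throughout: the scores are made-up numbers, the function names are illustrative, and the EER is approximated by scanning the observed scores as candidate thresholds.

```python
def far_frr(target_scores, nontarget_scores, threshold):
    """False acceptance and false rejection rates at a given threshold.

    A trial is accepted when its score is at or above the threshold.
    """
    far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    return far, frr


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER and its threshold by scanning all observed scores."""
    best = None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        far, frr = far_frr(target_scores, nontarget_scores, thr)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Hypothetical scores, for illustration only.
genuine = [2.0, 2.5, 3.0, 3.5]
zero_effort = [-1.0, -0.5, 0.0, 0.5]
spoofed = [1.8, 2.2, 2.7, 0.4]

# Fix the threshold on the baseline (zero-effort impostors) ...
eer, threshold = equal_error_rate(genuine, zero_effort)
# ... then measure the FAR when impostor trials are replaced by spoofed ones.
spoofed_far, _ = far_frr(genuine, spoofed, threshold)
```

Keeping the threshold fixed at the baseline operating point, as done here, is what several of the studies reviewed below refer to as the FAR at the EER threshold.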

7.3.1 Impersonation

Impersonation refers to spoofing attacks whereby a speaker attempts to imitate the speech of another speaker. It is one of the most obvious forms of spoofing and among the earliest studied.

7.3.1.1 Spoofing

The work in [25] showed that impersonators can readily adapt their voice to overcome ASV, but only when their natural voice is already similar to that of the target (the closest targets were selected from the YOHO corpus using an ASV system). Further work in [26] showed that impersonation increased the FAR from close to 0 % to between 10 and 60 %. Linguistic expertise was not found to be useful, except in cases when the voice of the target speaker was very different to that of the impersonator. However, contradictory findings reported in [27] suggest that even while professional imitators are better impersonators than average people, they are unable to spoof an ASV system.

In addition to spoofing studies, impersonation has been a subject in acoustic-phonetic studies [28–30]. These have shown that imitators tend to be effective in mimicking long-term prosodic patterns and the speaking rate, though it is less clear that they are as effective in mimicking formant and other spectral characteristics. For instance, the imitator involved in the studies reported in [28] was not successful in translating his formant frequencies towards the target, whereas the opposite is reported in [31].

Characteristic of all studies involving impersonation is the use of relatively few speakers, different languages and ASV systems. The target speakers involved in such studies are also often public figures or celebrities and it is difficult to collect technically comparable material from both the impersonator and the target. These aspects of the past work make it difficult to conclude whether or not impersonation poses a genuine threat. Since impersonation is thought to involve mostly the mimicking of prosodic and stylistic cues, it is perhaps considered more effective in fooling human listeners than today’s state-of-the-art ASV systems [32].

7.3.1.2 Countermeasures

Since the threat of impersonation is not fully understood due to limited studies involving small datasets, it is perhaps not surprising that there is no prior work to investigate countermeasures against impersonation. If the threat is proven to be genuine, then the design of appropriate countermeasures might be challenging. Unlike the spoofing attacks discussed below, all of which can be assumed to leave traces of the physical properties of the recording and playback devices, or signal processing artefacts from synthesis or conversion systems, impersonators are live human beings who produce entirely natural speech.

7.3.2 Replay

Replay attacks involve the presentation of previously-recorded speech from a genuine client in the form of continuous speech recordings, or samples resulting from the concatenation of shorter segments. Replay is a relatively low-technology attack within the grasp of any potential attacker even without specialised knowledge in speech processing. The availability of inexpensive, high-quality recording devices and digital audio editing software might suggest that replay is both effective and difficult to detect.

7.3.2.1 Spoofing

In contrast to research involving speech synthesis and voice conversion spoofing attacks, where large datasets are generally used for assessment, e.g. NIST datasets, all the past work to assess vulnerabilities to replay attacks relates to small, often purpose-collected datasets, typically involving no more than 15 speakers. While results generated with such small datasets have low statistical significance, differences between baseline performance and that under spoofing highlight the vulnerability.

The vulnerability of ASV systems to replay attacks was first investigated in a text-dependent scenario [33] where the concatenation of recorded digits was tested against a hidden Markov model (HMM) based ASV system. Results showed an increase in the FAR (EER threshold) from 1 to 89 % for male speakers and from 5 to 100 % for female speakers.

The work in [34] investigated text-independent ASV vulnerabilities through the replaying of far-field recorded speech in a mobile telephony scenario where signals were transmitted by analogue and digital telephone channels. Using a baseline ASV system based on JFA, their work showed an increase in the EER from 1 % to almost 70 % when impostor accesses were replaced by replayed spoof attacks. A physical access scenario was considered in [35]. While the baseline performance of their GMM-UBM ASV system was not reported, experiments showed that replay attacks produced an FAR of 93 %.

7.3.2.2 Countermeasures

A countermeasure for replay attack detection in the case of text-dependent ASV was reported in [36]. The approach is based upon the comparison of new access samples with stored instances of past accesses. New accesses which are deemed too similar to previous access attempts are identified as replay attacks. A large number of different experiments, all relating to a telephony scenario, showed that the countermeasures succeeded in lowering the EER in most of the experiments performed.
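The idea of comparing each new access with stored instances of past accesses can be sketched as below. The details are assumptions made for illustration and are not taken from [36]: each access is summarised by a hypothetical fixed-length vector, cosine similarity is used as the comparison, and the threshold of 0.999 is arbitrary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class AccessHistory:
    """Stores summaries of past accesses and flags near-identical repeats."""

    def __init__(self, threshold=0.999):
        self.threshold = threshold  # hypothetical similarity threshold
        self.stored = []

    def is_replay(self, summary):
        """True if the new access is too similar to any stored access."""
        return any(cosine_similarity(summary, past) >= self.threshold
                   for past in self.stored)

    def check_and_store(self, summary):
        """Flag a suspected replay; otherwise remember the access."""
        flagged = self.is_replay(summary)
        if not flagged:
            self.stored.append(list(summary))
        return flagged
```

The underlying assumption is that two genuine utterances of the same phrase are never identical, so an access that matches a stored one almost exactly is more likely to be a recording than a live repetition.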

While some form of text-dependent or challenge-response countermeasure is usually used to prevent replay attacks, text-independent solutions have also been investigated. The authors of [34] showed that it is possible to detect replay attacks by measuring the channel differences caused by far-field recording [37]. While they show spoof detection error rates of less than 10 %, it is feasible that today’s state-of-the-art approaches to channel compensation will render some ASV systems still vulnerable.

Two different replay attack countermeasures are compared in [35]. Both are based on the detection of differences in channel characteristics expected between licit and spoofed access attempts. Replay attacks incur channel noise from both the recording device and the loudspeaker used for replay. The detection of channel effects beyond those introduced by the recording device of the ASV system thus serves as an indicator of replay. The performance of a baseline GMM-UBM system with an EER of 40 % under spoofing attack falls to 29 % with the first countermeasure and a more respectable EER of 10 % with the second countermeasure.

7.3.3 Speech Synthesis

Speech synthesis, commonly referred to as text-to-speech (TTS), is a technique for generating intelligible, natural sounding artificial speech for any arbitrary text. Speech synthesis is used widely in various applications including in-car navigation systems, e-book readers, voice-over functions for the visually impaired and communication aids for the speech impaired. More recent applications include spoken dialogue systems, communicative robots, singing speech synthesisers and speech-to-speech translation systems.

Typical speech synthesis systems have two main components: text analysis and speech waveform generation, which are sometimes referred to as the front-end and back-end, respectively. In the text analysis component, input text is converted into a linguistic specification consisting of elements such as phonemes. In the speech waveform generation component, speech waveforms are generated from the produced linguistic specification.

There are four major approaches to speech waveform generation. In the early 1970s, the speech waveform generation component used very low-dimensional acoustic parameters for each phoneme, such as formants, corresponding to vocal tract resonances with hand-crafted acoustic rules [38]. In the 1980s, the speech waveform generation component used a small database of phoneme units called ‘diphones’ (the second half of one phone plus the first half of the following phone) and concatenated them according to the given phoneme sequence by applying signal processing, such as linear predictive (LP) analysis, to the units [39]. In the 1990s, larger speech databases were collected and used to select more appropriate speech units that match both phonemes and other linguistic contexts such as lexical stress and pitch accent in order to generate high-quality natural sounding synthetic speech with appropriate prosody. This approach is generally referred to as ‘unit selection’, and is used in many speech synthesis systems, including commercial products [40–44]. In the late 1990s another data-driven approach emerged, ‘statistical parametric speech synthesis’, and has grown in popularity in recent years [45–48]. In this approach, several acoustic parameters are modelled using a time-series stochastic generative model, typically a hidden Markov model (HMM). HMMs represent not only the phoneme sequences but also various contexts of the linguistic specification in a similar way to the unit selection approach. Acoustic parameters generated from HMMs and selected according to the linguistic specification are used to drive a vocoder (a simplified speech production model with which speech is represented by vocal tract and excitation parameters) in order to generate a speech waveform.

The first three approaches are unlikely to be effective in ASV spoofing since they do not provide for the synthesis of speaker-specific formant characteristics. Furthermore, diphone or unit selection approaches generally require a speaker-specific database that covers all the diphones or relatively large amounts of speaker-specific data with carefully prepared transcripts. In contrast, state-of-the-art HMM-based speech synthesisers [49, 50] can learn individualised speech models from relatively little speaker-specific data by adapting background models derived from other speakers based on the standard model adaptation techniques drawn from speech recognition, i.e. maximum likelihood linear regression (MLLR) [51, 52].

7.3.3.1 Spoofing

There is a considerable volume of research in the literature which has demonstrated the vulnerability of ASV to synthetic voices generated with a variety of approaches to speech synthesis. Experiments using formant, diphone and unit selection-based synthetic speech in addition to the simple cut-and-paste of speech waveforms have been reported [33, 34, 53].

ASV vulnerabilities to HMM-based synthetic speech were first demonstrated over a decade ago [54] using an HMM-based, text-prompted ASV system [55] and an HMM-based synthesiser where acoustic models were adapted to specific human speakers [56, 57]. The ASV system scored feature vectors against speaker and background models composed of concatenated phoneme models. When tested with human speech the ASV system achieved an FAR of 0 % and an FRR of 7 %. When subjected to spoofing attacks with synthetic speech, the FAR increased to over 70 %; however, this work involved only 20 speakers.

Large-scale experiments using the Wall Street Journal corpus containing 284 speakers and two different ASV systems (GMM-UBM and SVM using Gaussian supervectors) were reported in [58]. Using a state-of-the-art HMM-based speech synthesiser, the FAR was shown to rise to 86 and 81 % for the GMM-UBM and SVM systems, respectively. Spoofing experiments using HMM-based synthetic speech against a forensic speaker verification tool, BATVOX, were also reported in [59] with similar findings. Today’s state-of-the-art speech synthesisers thus present a genuine threat to ASV.

7.3.3.2 Countermeasures

Only a small number of attempts to discriminate synthetic speech from natural speech have been investigated and there is currently no general solution which is independent from specific speech synthesis methods. Previous work has demonstrated the successful detection of synthetic speech based on prior knowledge of the acoustic differences of specific speech synthesisers, such as the dynamic ranges of spectral parameters at the utterance level [60] and variance of higher order parts of mel-cepstral coefficients [61].

There are some attempts which focus on acoustic differences between vocoders and natural speech. Since the human auditory system is known to be relatively insensitive to phase [62], vocoders are typically based on a minimum-phase vocal tract model. This simplification leads to differences in the phase spectra between human and synthetic speech, differences which can be utilised for discrimination [58, 63].

Based on the difficulty in reliable prosody modelling in both unit selection and statistical parametric speech synthesis, other approaches to synthetic speech detection use F0 statistics [64, 65]. F0 patterns generated for the statistical parametric speech synthesis approach tend to be over-smoothed and the unit selection approach frequently exhibits ‘F0 jumps’ at concatenation points of speech units.
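As an illustration of the kind of F0 statistics involved, the sketch below computes the variance of F0 over voiced frames and the largest jump between consecutive voiced frames. It is not the feature set of [64, 65]: the choice of statistics, the convention that unvoiced frames carry an F0 of zero and the function name are assumptions made here.

```python
def f0_statistics(f0_track):
    """Summary statistics of an F0 contour.

    f0_track: per-frame F0 values in Hz, with 0.0 marking unvoiced frames.
    A low variance would be consistent with over-smoothed contours, a
    large maximum jump with discontinuities at unit boundaries.
    """
    voiced = [f for f in f0_track if f > 0.0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    mean = sum(voiced) / len(voiced)
    variance = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    max_jump = 0.0
    for prev, cur in zip(f0_track, f0_track[1:]):
        # Only compare frames that are adjacent in time and both voiced.
        if prev > 0.0 and cur > 0.0:
            max_jump = max(max_jump, abs(cur - prev))
    return {"mean": mean, "variance": variance, "max_jump": max_jump}
```

In practice such statistics would be fed to a classifier trained on natural and synthetic speech rather than compared with hand-set thresholds.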

7.3.4 Voice Conversion

Voice conversion is a sub-domain of voice transformation [66] which aims to convert one speaker’s voice towards that of another. The field has attracted increasing interest in the context of ASV vulnerabilities for over a decade [67]. Unlike TTS, which requires text input, voice conversion operates directly on speech samples. In particular, the goal is to transform, according to a conversion function \(\mathcal {F}\) with parameters \(\mathbf {\theta }\), the feature vectors (\(\mathbf {x}\)) corresponding to speech from a source speaker (spoofer) so that they are closer to those of a target speaker (\(\mathbf {y}\)):

$$\mathbf {y} = \mathcal {F}(\mathbf {x}, \mathbf {\theta }). \tag{7.1}$$

Most voice conversion approaches adopt a training phase which requires frame-aligned pairs \(\{(\mathbf {x}_t,\mathbf {y}_t)\}\) in order to learn the transformation parameters \(\mathbf {\theta }\). Frame alignment is usually achieved using dynamic time warping (DTW) on parallel source-target training utterances with identical text content. The trained conversion function is then applied to new source utterances of arbitrary text content at run-time.
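The frame alignment step can be sketched with a textbook dynamic time warping routine that returns the aligned index pairs from which the training pairs \(\{(\mathbf {x}_t,\mathbf {y}_t)\}\) would be built. This is a minimal version under assumptions: Euclidean frame distance, the three standard step directions and no path constraints; the function names are illustrative.

```python
def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw_align(source, target):
    """Align two feature sequences with dynamic time warping.

    Returns the accumulated cost and the warping path as a list of
    (source_index, target_index) pairs.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance source only
                                 cost[i][j - 1])      # advance target only
    # Backtrack from the end to recover the frame pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path
```

Each pair on the returned path links one source frame to one target frame, so a source frame may be paired with several target frames (and vice versa) when the two speakers utter the same text at different rates.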

A large number of specific conversion approaches have been reported. One of the earliest and simplest techniques employs vector quantisation (VQ) with codebooks [68] or segmental codebooks [69] of paired source-target frame vectors to represent the conversion function. However, VQ introduces frame-to-frame discontinuity problems. Among the more recent conversion methods, joint density Gaussian mixture model (JD-GMM) [70–72] has become a standard baseline method. It achieves smooth feature transformations using a local linear transformation. Despite its popularity, known problems of JD-GMM include over-smoothing [73–75] and over-fitting [76, 77] which have led to the development of alternative linear conversion methods such as partial least square (PLS) regression [76], tensor representation [78], a trajectory hidden Markov model [79], a mixture of factor analysers [80], local linear transformation [73] and a noisy channel model [81]. Non-linear approaches, including artificial neural networks [82, 83], support vector regression [84], kernel partial least square [85] and conditional restricted Boltzmann machines [86], have also been studied. As alternatives to data-driven conversion, frequency warping techniques [87–89] have also attracted attention.

The approaches to voice conversion considered above are usually applied to the transformation of spectral envelope features, though the conversion of prosodic features such as fundamental frequency [90–93] and duration [91, 94] has also been studied. In contrast to parametric methods, unit selection approaches can be applied directly to feature vectors coming from the target speaker to synthesise converted speech [95]. Since they use target speaker data directly, unit selection approaches arguably pose a greater risk to ASV than statistical approaches [96].

In general, only the most straightforward of the spectral conversion methods have been utilised in ASV vulnerability studies. Even when trained using a non-parallel technique and non-ideal telephony data, the baseline JD-GMM approach, which produces over-smooth speech with audible artefacts, is shown to increase significantly the FAR of modern ASV systems [20, 96]; unlike the human ear, current recognisers are essentially ‘deaf’ to obvious conversion artefacts caused by imperfect signal analysis-synthesis models and poorly trained conversion functions.

7.3.4.1 Spoofing

When applied to spoofing, voice conversion aims to synthesise a new speech signal such that features extracted for ASV are close in some sense to the target speaker. Some of the first work relevant to text-independent ASV spoofing includes that in [32, 97]. The work in [32] showed that a baseline EER increased from 16 to 26 % as a result of voice conversion which also converted prosodic aspects not modelled in typical ASV systems. The work in [97] investigated the probabilistic mapping of a speaker’s vocal tract information towards that of another, target speaker using a pair of tied speaker models, one of ASV features and another of filtering coefficients. This work targeted the conversion of spectral-slope parameters. The work showed that a baseline EER of 10 % increased to over 60 % when all impostor test samples were replaced with converted voice. In addition, signals subjected to voice conversion did not exhibit any perceivable artefacts indicative of manipulation.

The work in [20] investigated ASV vulnerabilities using a popular approach to voice conversion [70] based on JD-GMMs, which requires a parallel training corpus for both source and target speakers. Even if converted speech would be easily detectable by human listeners, experiments involving five different ASV systems showed universal susceptibility to spoofing. The FAR of the most robust, JFA system increased from 3 % to over 17 %.

Other work relevant to voice conversion includes attacks referred to as artificial signals. It was noted in [98] that certain short intervals of converted speech yield extremely high scores or likelihoods. Such intervals are not representative of intelligible speech but they are nonetheless effective in overcoming typical text-independent ASV systems which lack any form of speech quality assessment. The work in [98] showed that artificial signals optimised with a genetic algorithm provoke increases in the EER from 10 % to almost 80 % for a GMM-UBM system and from 5 % to almost 65 % for a factor analysis (FA) system.

Fig. 7.1 An example of a spoofed speech detector combined with speaker verification [99]. Based on prior knowledge that many analysis–synthesis modules used in voice conversion and TTS systems discard natural speech phase, phase characteristics parametrised via the modified group delay function (MGDF) can be used for discriminating natural and synthetic speech

7.3.4.2 Countermeasures

Some of the first work to detect converted voice draws on related work in synthetic speech detection [100]. While the cosine phase and modified group delay function (MGDF) countermeasures proposed in [63, 99] are effective in detecting spoofed speech (see Fig. 7.1), they are unlikely to detect converted voice with real-speech phase [97].

Two approaches to artificial signal detection are reported in [101]. Experimental work shows that supervector-based SVM classifiers are naturally robust to such attacks whereas all spoofing attacks can be detected using an utterance-level variability feature which detects the absence of natural, dynamic variability characteristic of genuine speech. An alternative approach based on voice quality analysis is less dependent on explicit knowledge of the attack but less effective in detecting attacks.

A related approach to detect converted voice is proposed in [102]. Probabilistic mappings between source and target speaker models are shown to yield converted speech with less short-term variability than genuine speech. The thresholded, average pair-wise distance between consecutive feature vectors is used to detect converted voice with an EER of under 3 %.
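The thresholded, average pair-wise distance between consecutive feature vectors can be sketched as follows. The Euclidean distance and the function names are assumptions made for illustration, and the threshold would in practice be tuned on development data; no value from [102] is implied.

```python
def mean_consecutive_distance(features):
    """Average Euclidean distance between consecutive feature vectors."""
    if len(features) < 2:
        raise ValueError("need at least two frames")
    total = 0.0
    for prev, cur in zip(features, features[1:]):
        total += sum((a - b) ** 2 for a, b in zip(prev, cur)) ** 0.5
    return total / (len(features) - 1)


def looks_converted(features, threshold):
    """Flag an utterance whose short-term variability is unusually low.

    threshold is a hypothetical value to be set on development data.
    """
    return mean_consecutive_distance(features) < threshold
```

An utterance is flagged when its frames change less from one to the next than is typical of genuine speech, which is the property that the converted speech in [102] is reported to exhibit.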

Due to the fact that current analysis–synthesis techniques operate at the short-term frame level, the use of temporal magnitude/phase modulation features, a form of long-term feature, is proposed in [103] to detect both speech synthesis and voice conversion spoofing attacks. Another form of long-term feature is reported in [104]. The approach is based on the local binary pattern (LBP) analysis of sequences of acoustic vectors and is successful in detecting converted voice. Interestingly, the approach is less reliant on prior knowledge and can also detect different spoofing attacks, examples of which were not used for training or optimisation.

7.3.5 Summary

As shown above, ASV spoofing and countermeasures have been studied with a multitude of different datasets, evaluation protocols and metrics, with highly diverse experimental designs, different ASV recognisers and with different approaches to spoofing; the lack of any commonality makes the comparison of results, vulnerabilities and countermeasure performance an extremely challenging task. Drawing carefully upon the literature and the authors’ own experience with various spoofing approaches, we have nevertheless made such an attempt. Table 7.1 aims to summarise the threat of spoofing for the four approaches considered above. Accessibility (practicality) reflects whether the threat is available to the masses or limited to the technically knowledgeable. Effectiveness (risk), in turn, reflects the success of each approach in provoking higher false acceptance rates.

Table 7.1 A summary of the four approaches to ASV spoofing, their expected accessibility and risk

Although some studies have shown that impersonation can fool ASV recognisers, in practice, the effectiveness seems to depend on the skill of the impersonator, the similarity of the attacker’s voice to that of the target speaker and the recogniser itself. Replay attacks are highly effective in the case of text-independent ASV and fixed-phrase text-dependent systems. Even if the effectiveness is reduced in the case of randomised, phrase-prompted text-dependent systems, replay attacks are the most accessible approach to spoofing, requiring only a recording and playback device such as a tape recorder or a smart phone.

Speech synthesis and voice conversion attacks pose the greatest risk. While voice conversion systems are not yet commercially available, both free and commercial text-to-speech (TTS) systems with pre-trained voice profiles are widely available, even if commercial off-the-shelf (COTS) systems do not include the functionality for adaptation to specific target voices. While accessibility is therefore medium in the short term, speaker adaptation remains a highly active research topic. It is thus only a matter of time until flexible, speaker-adapted synthesis and conversion systems become readily available. Then, both effectiveness and accessibility should be considered high.

7.4 Discussion

In this section, we discuss current approaches to evaluation and some weaknesses in the current evaluation methodology. While much of the following is not necessarily specific to the speech modality, with research in spoofing and countermeasures in ASV lagging behind that related to other biometric modalities, the discussion below is particularly pertinent.

7.4.1 Protocols and Metrics

While countermeasures can be integrated into existing ASV systems, they are most often implemented as independent modules which allow for the explicit detection of spoofing attacks. The most common approach in this case is to concatenate the two classifiers in series.
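As a minimal sketch of such a series combination, the Python fragment below screens each trial with the countermeasure before it reaches the verifier. The names (`CascadeDecision`, `cascade_verify`), the ordering (countermeasure first) and the convention that higher scores indicate genuine speech or the claimed speaker are assumptions made for illustration, not details of any particular system.

```python
from dataclasses import dataclass


@dataclass
class CascadeDecision:
    accepted: bool
    rejected_by: str  # "" if accepted, else "countermeasure" or "asv"


def cascade_verify(cm_score, asv_score, cm_threshold, asv_threshold):
    """Series combination of a spoofing countermeasure and an ASV system.

    Assumption: higher cm_score means "more likely genuine speech" and
    higher asv_score means "more likely the claimed speaker".
    """
    # Stage 1: the countermeasure rejects trials it labels as spoofed.
    if cm_score < cm_threshold:
        return CascadeDecision(False, "countermeasure")
    # Stage 2: only trials labelled genuine are passed to the verifier.
    if asv_score < asv_threshold:
        return CascadeDecision(False, "asv")
    return CascadeDecision(True, "")
```

Recording which stage rejected a trial is a design choice that makes it possible to separate countermeasure errors from verification errors during later analysis.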

The assessment of countermeasure performance on its own is relatively straightforward; results are readily analysed with standard detection error trade-off (DET) profiles [106] and related metrics. It is often of interest, however, that the assessment reflects the impact of countermeasures on ASV performance. Assessment is then non-trivial and calls for the joint optimisation of combined classifiers. Results furthermore reflect the performance of specific ASV systems. As described in Sect. 7.3, there are currently no standard evaluation protocols, metrics or ASV systems which might otherwise be used to conduct evaluations. There is thus a need to define such standards in the future.
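For a stand-alone detector, the points of a DET profile are simply pairs of false acceptance and false rejection rates obtained by sweeping the decision threshold. The sketch below illustrates this with plain Python; the function names are hypothetical, scores at or above the threshold are assumed to be accepted, and the equal error rate is only approximated by the operating point where the two rates are closest. Plotting on the normal-deviate scale used by DET plots is omitted.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (FAR, FRR) when trials scoring >= threshold are accepted."""
    far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    return far, frr


def det_points(target_scores, nontarget_scores):
    """One (threshold, FAR, FRR) triple per distinct observed score."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    return [(t, *error_rates(target_scores, nontarget_scores, t))
            for t in thresholds]


def approximate_eer(target_scores, nontarget_scores):
    """Approximate EER: mean of FAR and FRR where they are closest."""
    points = det_points(target_scores, nontarget_scores)
    _, far, frr = min(points, key=lambda p: abs(p[1] - p[2]))
    return (far + frr) / 2
```

The same functions apply to a countermeasure or to an ASV system; only the meaning of "target" and "non-target" trials changes.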

Fig. 7.2 An example of four DET profiles needed to analyse vulnerabilities to spoofing and countermeasure performance, both on licit and spoofed access attempts. Results correspond to spoofing attacks using synthetic speech and a standard GMM-UBM classifier assessed on the male subset of the NIST’06 SRE dataset

Candidate standards are being drafted within the scope of the EU FP7 TABULA RASA project.Footnote 3 Here, independent countermeasures preceding biometric verification are optimised at three different operating points where thresholds are set to obtain FARs (the probability of labelling a genuine access as a spoofing attack) of 1, 5 or 10 %. Samples labelled as genuine accesses are then passed to the verification system.Footnote 4 Performance is assessed using four different DET profiles,Footnote 5 examples of which are illustrated in Fig. 7.2. The four profiles illustrate performance of the baseline system with zero-effort impostors, the baseline system with active countermeasures, the baseline system where all impostor accesses are replaced with spoofing attacks and, finally, the baseline system with spoofing attacks and active countermeasures.

Consideration of all four profiles is needed to gauge the impact of countermeasure performance on licit transactions (any deterioration in false rejection—difference between first and second profiles) and improved robustness to spoofing (improvements in false acceptance—difference between third and fourth profiles). While the interpretation of such profiles is trivial, different plots are obtained for each countermeasure operating point. Further work is required to design intuitive, universal metrics which represent the performance of spoofing countermeasures when combined with ASV.
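The four-profile analysis can be sketched as follows for a single ASV operating point; sweeping `asv_threshold` would trace out each DET profile. This is an illustrative sketch under stated assumptions, not the TABULA RASA implementation: all names are hypothetical, each trial is a `(cm_score, asv_score)` pair, higher scores favour acceptance, and the scores are invented toy values. The parameter `genuine_as_spoof_rate` corresponds to the countermeasure operating point described above (the proportion of genuine accesses labelled as spoofing attacks).

```python
def cm_threshold_at_rate(genuine_cm_scores, genuine_as_spoof_rate):
    """Countermeasure threshold below which roughly the requested
    fraction of genuine accesses falls (and is labelled as spoofed)."""
    ranked = sorted(genuine_cm_scores)
    k = int(genuine_as_spoof_rate * len(ranked))
    return ranked[min(k, len(ranked) - 1)]


def accepted(trial, asv_threshold, cm_threshold=None):
    """Accept a (cm_score, asv_score) trial; cm_threshold=None disables
    the countermeasure."""
    cm_score, asv_score = trial
    if cm_threshold is not None and cm_score < cm_threshold:
        return False
    return asv_score >= asv_threshold


def operating_point(genuine, impostors, asv_threshold, cm_threshold=None):
    """Return (FAR, FRR) for one set of genuine and impostor trials."""
    far = sum(accepted(t, asv_threshold, cm_threshold)
              for t in impostors) / len(impostors)
    frr = sum(not accepted(t, asv_threshold, cm_threshold)
              for t in genuine) / len(genuine)
    return far, frr


def four_profiles(genuine, zero_effort, spoofed, asv_threshold, cm_threshold):
    """(FAR, FRR) for the four configurations discussed in the text."""
    return {
        "baseline": operating_point(genuine, zero_effort, asv_threshold),
        "baseline + countermeasure": operating_point(
            genuine, zero_effort, asv_threshold, cm_threshold),
        "spoofing": operating_point(genuine, spoofed, asv_threshold),
        "spoofing + countermeasure": operating_point(
            genuine, spoofed, asv_threshold, cm_threshold),
    }


# Invented toy trials: (countermeasure score, ASV score).
genuine = [(0.9, 3.0), (0.8, 2.5), (0.7, 2.0), (0.1, 2.8)]
zero_effort = [(0.9, 0.5), (0.8, 1.2), (0.85, -0.5), (0.6, 0.0)]
spoofed = [(0.2, 2.6), (0.3, 2.2), (0.75, 2.4), (0.1, 0.9)]

cm_threshold = cm_threshold_at_rate([cm for cm, _ in genuine], 0.25)
profiles = four_profiles(genuine, zero_effort, spoofed,
                         asv_threshold=1.0, cm_threshold=cm_threshold)
```

With these toy values the countermeasure reduces false acceptance under spoofing from 0.75 to 0.25, at the cost of raising false rejection from 0 to 0.25. Repeating the computation for each countermeasure operating point gives a different set of four results, which is why a single, universal metric is hard to define.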

7.4.2 Datasets

While some works have shown the potential for detecting spoofing without prior knowledge or training data indicative of a specific attack [63, 104, 107], all previous works are based on some implicit prior knowledge, i.e. the nature of the spoofing attack and/or the targeted ASV system is known. While training and evaluation data with known spoofing attacks might be useful to develop and optimise appropriate countermeasures, the precise nature of spoofing attacks can never be known in practice. Estimates of countermeasure performance so obtained should thus be considered at best optimistic. Furthermore, the majority of the past work was also conducted under matched conditions, i.e. data used to learn target models and that used to effect spoofing were collected in the same or similar acoustic environment and over the same or similar channel. The performance of spoofing countermeasures when subjected to realistic session variability is then unknown.

While much of the past work already uses standard datasets, e.g. NIST SRE data, spoofed samples are obtained by treating them with non-standard algorithms. Standard datasets containing both licit transactions and spoofed speech from a multitude of different spoofing algorithms and with realistic session variability are therefore needed to reduce the use of prior knowledge and to improve the comparability of different countermeasures and their performance against varied spoofing attacks. Collaboration with colleagues in other speech and language processing communities, e.g. voice conversion and speech synthesis, will help to assess vulnerabilities to state-of-the-art spoofing attacks and also to assess countermeasures when details of the spoofing attacks are unknown. The detection of spoofing will then be considerably more challenging but more reflective of practical use cases.

7.5 Conclusions

This contribution reviews previous work to assess the threat from spoofing to automatic speaker verification (ASV). While there are currently no standard datasets, evaluation protocols or metrics, the study of impersonation, replay, speech synthesis and voice conversion spoofing attacks reported in this article indicates genuine vulnerabilities. We nonetheless argue that significant additional research is required before the issue of spoofing in ASV is properly understood and conclusions can be drawn.

In particular, while the situation is slowly changing, the majority of past work involves text-independent ASV, most relevant to surveillance. The spoofing threat is pertinent in authentication scenarios where text-dependent ASV might be preferred. Greater effort is therefore needed to investigate spoofing in text-dependent scenarios, with particularly careful consideration being given to the design of appropriate datasets and protocols.

Second, almost all ASV spoofing countermeasures proposed thus far are dependent on training examples indicative of a specific attack. Given that the nature of spoofing attacks can never be known in practice, and with the variety in spoofing attacks being particularly high in ASV, future work should investigate new countermeasures which generalise well to unforeseen attacks. Formal evaluations with standard datasets, evaluation protocols, metrics and even standard ASV systems are also needed to address weaknesses in the current evaluation methodology.

Finally, some of the vulnerabilities discussed in this article involve relatively high-cost and high-technology attacks. While the trend of open source software may cause this to change, such attacks are beyond the competence of the unskilled and in such cases the level of vulnerability is arguably overestimated. While we have touched on this issue in this article, a more comprehensive risk-based assessment is needed to ensure such evaluations are not overly alarmist. Indeed, the work discussed above shows that countermeasures, some of them relatively trivial, have the potential to detect spoofing attacks with manageable impacts on system usability.