Speaker Recognition Anti-spoofing

Evans, Nicholas; Kinnunen, Tomi; Yamagishi, Junichi; Wu, Zhizheng; Alegre, Federico; Leon, Phillip  De

doi:10.1007/978-1-4471-6524-8_7

Nicholas Evans⁶,
Tomi Kinnunen⁷,
Junichi Yamagishi^8,9,
Zhizheng Wu¹⁰,
Federico Alegre⁶ &
…
Phillip De Leon¹¹

Part of the book series: Advances in Computer Vision and Pattern Recognition ((ACVPR))

2270 Accesses
20 Citations

Abstract

Progress in the development of spoofing countermeasures for automatic speaker recognition is less advanced than equivalent work related to other biometric modalities. This chapter outlines the potential for even state-of-the-art automatic speaker recognition systems to be spoofed. While the use of a multitude of different datasets, protocols and metrics complicates the meaningful comparison of different vulnerabilities, we review previous work related to impersonation, replay, speech synthesis and voice conversion spoofing attacks. The article also presents an analysis of the early work to develop spoofing countermeasures. The literature shows that there is significant potential for automatic speaker verification systems to be spoofed, that significant further work is required to develop generalised countermeasures, that there is a need for standard datasets, evaluation protocols and metrics and that greater emphasis should be placed on text-dependent scenarios.

Access provided by Autonomous University of Puebla. Download chapter PDF

On the study of replay and voice conversion attacks to text-dependent speaker verification

Article 03 December 2015

Introduction to Voice Presentation Attack Detection and Recent Advances

Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

7.1 Introduction

As one of our primary methods of communication, the speech modality has natural appeal as a biometric in one of two different scenarios: text-independent and text-dependent. While text-dependent automatic speaker verification (ASV) systems use fixed or randomly prompted utterances with known text content, text-independent recognisers operate on arbitrary utterances, possibly spoken in different languages. Text-independent methods are best suited to surveillance scenarios where speech signals are likely to originate from noncooperative speakers. In authentication scenarios, where cooperation can be readily assumed, text-dependent ASV is generally more appropriate since better performance can then be achieved with shorter utterances. On the other hand, text-independent recognisers are also used for authentication in call-centre applications such as caller verification in telephone banking.^{Footnote 1} On account of its utility in surveillance applications, evaluation sponsorship and dataset availability, text-independent ASV dominates the field.

The potential for ASV to be spoofed is now well recognised [1]. Since speaker recognition is commonly used in telephony or other unattended, distributed scenarios without human supervision, speech is arguably more prone to malicious interference or manipulation than other biometric signals. However, while spoofing is relevant to authentication scenarios and therefore text-dependent ASV, almost all prior work has been performed on text-independent datasets more suited to surveillance. While this observation most likely reflects the absence of viable text-dependent datasets in the recent past, progress in the development of spoofing countermeasures for ASV is lagging behind that in other biometric modalities.^{Footnote 2}

Nonetheless, there is growing interest to assess the vulnerabilities of ASV to spoofing and new initiatives to develop countermeasures [1]. This article reviews the past work which is predominantly text-independent. While the use of different datasets, protocols and metrics hinders such a task, we aim to describe and analyse four different spoofing attacks considered thus far: impersonation, replay, speech synthesis and voice conversion. Countermeasures for all four spoofing attacks are also reviewed and we discuss the directions which must be taken in future work to address weaknesses in the current research methodology and to properly protect ASV systems from the spoofing threat.

7.2 Automatic Speaker Verification

This section describes state-of-the-art approaches to text-independent automatic speaker verification (ASV) and their potential vulnerabilities to spoofing.

7.2.1 Feature Extraction

Since speech signals are nonstationary, features are commonly extracted from short-term segments (frames) of 20–30 ms in duration. Typically, mel-frequency cepstral coefficient (MFCC), linear predictive cepstral coefficient (LPCC), or perceptual linear prediction (PLP) features are used as a descriptor of the short-term power spectrum. These are usually appended with their time-derivative coefficients (deltas and double-deltas) and they undergo various normalisations such as global mean removal or short-term Gaussianization or feature warping [2]. In addition to spectral features, prosodic and high-level features have been studied extensively [3–5], achieving comparable results to state-of-the-art spectral recognisers [6]. For more details regarding popular feature representations used in ASV, readers are referred to [7].

The literature shows that ASV systems based on both spectral and prosodic features are vulnerable to spoofing. As described in Sect. 7.3, state-of-the-art voice conversion and statistical parametric speech synthesisers may also use mel-cepstral and linear prediction representations; spectral recognisers can be particularly vulnerable to synthesis and conversion attacks which use ‘matched’ parameterisations. Recognisers which utilise prosodic parameterisations are in turn vulnerable to human impersonation.

7.2.2 Modelling and Classification

Approaches to ASV generally focus on modelling the long-term distribution of spectral vectors. To this end, the Gaussian mixture model (GMM) [8, 9] has become the de facto modelling technique. Early ASV systems used maximum likelihood (ML) [8] and maximum a posteriori (MAP) [9] training. In the latter case, a speaker-dependent GMM is obtained from the adaptation of a previously trained universal background model (UBM). Adapted GMM mean supervectors obtained in this way were combined with support vector machine (SVM) classifiers in [10]. This idea lead to the development of many successful speaker model normalisation techniques including nuisance attribute projection (NAP) [11, 12] and within-class covariance normalisation (WCCN) [13]. These techniques aim to compensate for intersession variation, namely differences in supervectors corresponding to the same speaker caused by channel or session mismatch.

Parallel to the development of SVM-based discriminative models, generative factor analysis models were pioneered in [14–16]. In particular, joint factor analysis (JFA) [14] can improve ASV performance by incorporating distinct speaker and channel subspace models. These subspace models require the estimation of various hyper-parameters using labelled utterances. Subsequently, JFA evolved into a much-simplified model that is now the state of the art. The so-called total variability model or ‘i-vector’ representation [17] uses latent variable vectors of low-dimension (typically 200–600) to represent an arbitrary utterance. Unlike JFA, the training of an i-vector extractor is essentially an unsupervised process which leads to only one subspace model. Accordingly it can be viewed as a approach to dimensionality reduction, while compensation for session, environment and other nuisance factors are applied in the computationally light back-end classification. To this end, probabilistic linear discriminant analysis (PLDA) [18] with length-normalised i-vectors [19] has proven particularly effective.

Being based on the transformation of short-term cepstra, conversion and synthesis techniques also induce a form of ’channel shift’. Since they aim to attenuate channel effects, approaches to intersession compensation may present vulnerabilities to spoofing through the potential to confuse spoofed speech with channel-shifted speech of a target speaker. However, even if there is some evidence to the contrary, i.e., that recognisers employing intersession compensation might be intrinsically more robust to voice conversion attacks [20], all have their roots in the standard GMM and independent spectral observations. Neither utilises time sequence information, a key characteristic of speech which might otherwise afford some protection from spoofing.

7.2.3 System Fusion

In addition to the development of increasingly robust models and classifiers, there is a significant emphasis within the ASV community on the study of classifier fusion. This is based on the assumption that independently trained recognisers capture different aspects of the speech signal not covered by any individual classifier. Fusion also provides a convenient vehicle for large-scale research collaborations promoting independent classifier development and benchmarking [21]. Different classifiers can involve different features, classifiers, or hyper-parameter training sets [22]. A simple, yet robust approach to fusion involves the weighted summation of the base classifier scores, where the weights are optimised according to a logistic regression cost function. For recent trends in fusion, readers are referred to [23].

While we are unaware of any spoofing or anti-spoofing studies on fused ASV systems, some insight into their likely utility can be gained from related work in fused, multi-modal biometric systems; whether the scores originate from different biometric modalities or sub-classifiers applied to the same biometric trait makes little difference. A common claim is that multi-biometric systems should be inherently resistant to spoofing since an impostor is less likely to succeed in spoofing all the different subsystems. We note, however, that [24] suggests it might suffice to spoof only one modality under a score fusion setting in the case where the spoofing of a single, significantly weighted sub-system is particularly effective.

7.3 Spoofing and Countermeasures

Spoofing attacks are performed on a biometric system at the sensor or acquisition level to bias score distributions toward those of genuine clients, thus provoking increases in the false acceptance rate (FAR). This section reviews past work to evaluate vulnerabilities and to develop spoofing countermeasures. We consider impersonation, replay, speech synthesis and voice conversion.

7.3.1 Impersonation

Impersonation refers to spoofing attacks whereby a speaker attempts to imitate the speech of another speaker and is one of the most obvious forms of spoofing and earliest studied.

7.3.1.1 Spoofing

The work in [25] showed that impersonators can readily adapt their voice to overcome ASV, but only when their natural voice is already similar to that of the target (the closest targets were selected from YOHO corpus using an ASV system). Further work in [26] showed that impersonation increased FAR rates from close to 0 % to between 10 and 60 %. Linguistic expertise was not found to be useful, except in cases when the voice of the target speaker was very different to that of the impersonator. However, contradictory findings reported in [27] suggest that even while professional imitators are better impersonators than average people, they are unable to spoof an ASV system.

In addition to spoofing studies, impersonation has been a subject in acoustic-phonetic studies [28–30]. These have shown that imitators tend to be effective in mimicking long-term prosodic patterns and the speaking rate, though it is less clear that they are as effective in mimicking formant and other spectral characteristics. For instance, the imitator involved in the studies reported in [28] was not successful in translating his formant frequencies towards the target, whereas the opposite is reported in [31].

Characteristic to all studies involving impersonation is the use of relatively few speakers, different languages and ASV systems. The target speakers involved in such studies are also often public figures or celebrities and it is difficult to collect technically comparable material from both the impersonator and the target. These aspects of the past work makes it difficult to conclude whether or not impersonation poses a genuine threat. Since impersonation is thought to involve mostly the mimicking of prosodic and stylistic cues, it is perhaps considered more effective in fooling human listeners than today’s state-of-the-art ASV systems [32].

7.3.1.2 Countermeasures

While the threat of impersonation is not fully understood due to limited studies involving small datasets, it is perhaps not surprising that there is no prior work to investigate countermeasures against impersonation. If the threat is proven to be genuine, then the design of appropriate countermeasures might be challenging. Unlike the spoofing attacks discussed below, all of which can be assumed to leave traces of the physical properties of the recording and playback devices, or signal processing artefacts from synthesis or conversion systems, impersonators are live human beings who produce entirely natural speech.

7.3.2 Replay

Replay attacks involve the presentation of previously-recorded speech from a genuine client in the form of continuous speech recordings, or samples resulting from the concatenation of shorter segments. Replay is a relatively low-technology attack within the grasp of any potential attacker even without specialised knowledge in speech processing. The availability of inexpensive, high-quality recording devices and digital audio editing software might suggest that replay is both effective and difficult to detect.

7.3.2.1 Spoofing

In contrast to research involving speech synthesis and voice conversion, spoofing attacks where large datasets are generally used for assessment, e.g. NIST datasets, all the past work to assess vulnerabilities to replay attacks relates to small, often purpose-collected datasets, typically involving no more than 15 speakers. While results generated with such small datasets have low statistical significance, differences between baseline performance and that under spoofing highlight the vulnerability.

The vulnerability of ASV systems to replay attacks was first investigated in a text-dependent scenario [33] where the concatenation of recorded digits was tested against a hidden Markov model (HMM) based ASV system. Results showed an increase in the FAR (EER threshold) from 1 to 89 % for male speakers and from 5 to 100 % for female speakers.

The work in [34] investigated text-independent ASV vulnerabilities through the replaying of far-field recorded speech in a mobile telephony scenario where signals were transmitted by analogue and digital telephone channels. Using a baseline ASV system based on JFA, their work showed an increase in the EER of 1 % to almost 70 % when impostor accesses were replaced by replayed spoof attacks. A physical access scenario was considered in [35]. While the baseline performance of their GMM-UBM ASV system was not reported, experiments showed that replay attacks produced an FAR of 93 %.

7.3.2.2 Countermeasures

A countermeasure for replay attack detection in the case of text-dependent ASV was reported in [36]. The approach is based upon the comparison of new access samples with stored instances of past accesses. New accesses which are deemed too similar to previous access attempts are identified as replay attacks. A large number of different experiments, all relating to a telephony scenario, showed that the countermeasures succeeded in lowering the EER in most of the experiments performed.

While some form of text-dependent or challenge-response countermeasure is usually used to prevent replay attacks, text-independent solutions have also been investigated. The same authors in [34] showed that it is possible to detect replay attacks by measuring the channel differences caused by far-field recording [37]. While they show spoof detection error rates of less than 10 % it is feasible that today’s state-of-the-art approaches to channel compensation will render some ASV systems still vulnerable.

Two different replay attack countermeasures are compared in [35]. Both are based on the detection of differences in channel characteristics expected between licit and spoofed access attempts. Replay attacks incur channel noise from both the recording device and the loudspeaker used for replay and thus the detection of channel effects beyond those introduced by the recording device of the ASV system thus serves as an indicator of replay. The performance of a baseline GMM-UBM system with an EER 40 % under spoofing attack falls to 29 % with the first countermeasure and a more respectable EER of 10 % with the second countermeasure.

7.3.3 Speech Synthesis

Speech synthesis, commonly referred to as text-to-speech (TTS), is a technique for generating intelligible, natural sounding artificial speech for any arbitrary text. Speech synthesis is used widely in various applications including in-car navigation systems, e-book readers, voice-over functions for the visually impaired and communication aids for the speech impaired. More recent applications include spoken dialogue systems, communicative robots, singing speech synthesisers and speech-to-speech translation systems.

Typical speech synthesis systems have two main components: text analysis and speech waveform generation, which are sometimes referred to as the front-end and back-end, respectively. In the text analysis component, input text is converted into a linguistic specification consisting of elements such as phonemes. In the speech waveform generation component, speech waveforms are generated from the produced linguistic specification.

There are four major approaches to speech waveform generation. In the early 1970s, the speech waveform generation component used very low-dimensional acoustic parameters for each phoneme, such as formants, corresponding to vocal tract resonances with hand-crafted acoustic rules [38]. In the 1980s, the speech waveform generation component used a small database of phoneme units called ’diphones’ (the second half of one phone plus the first half of the following phone) and concatenated them according to the given phoneme sequence by applying signal processing, such as linear predictive (LP) analysis, to the units [39]. In the 1990s, larger speech databases were collected and used to select more appropriate speech units that match both phonemes and other linguistic contexts such as lexical stress and pitch accent in order to generate high-quality natural sounding synthetic speech with appropriate prosody. This approach is generally referred to as ‘unit selection’, and is used in many speech synthesis systems, including commercial products [40–44]. In the late 1990s another data-driven approach emerged, ‘Statistical parametric speech synthesis’, and has grown in popularity in recent years [45–48]. In this approach, several acoustic parameters are modelled using a time-series stochastic generative model, typically a hidden Markov model (HMM). HMMs represent not only the phoneme sequences but also various contexts of the linguistic specification in a similar way to the unit selection approach. Acoustic parameters generated from HMMs and selected according to the linguistic specification are used to drive a vocoder (a simplified speech production model with which speech is represented by vocal tract and excitation parameters) in order to generate a speech waveform.

The first three approaches are unlikely to be effective in ASV spoofing since they do not provide for the synthesis of speaker-specific formant characteristics. Furthermore, diphone or unit selection approaches generally require a speaker-specific database that covers all the diphones or relatively large amounts of speaker-specific data with carefully prepared transcripts. In contrast, state-of-the-art HMM-based speech synthesisers [49, 50] can learn individualised speech models from relatively little speaker-specific data by adapting background models derived from other speakers based on the standard model adaptation techniques drawn from speech recognition, i.e. maximum likelihood linear regression (MLLR) [51, 52].

7.3.3.1 Spoofing

There is a considerable volume of research in the literature which has demonstrated the vulnerability of ASV to synthetic voices generated with a variety of approaches to speech synthesis. Experiments using formant, diphone and unit selection-based synthetic speech in addition to the simple cut-and-paste of speech waveforms have been reported [33, 34, 53].

ASV vulnerabilities to HMM-based synthetic speech were first demonstrated over a decade ago [54] using an HMM-based, text-prompted ASV system [55] and an HMM-based synthesiser where acoustic models were adapted to specific human speakers [56, 57]. The ASV system scored feature vectors against speaker and background models composed of concatenated phoneme models. When tested with human speech the ASV system achieved an FAR of 0 % and an FRR of 7 %. When subjected to spoofing attacks with synthetic speech, the FAR increased to over 70 %, however this work involved only 20 speakers.

Large-scale experiments using the Wall Street Journal corpus containing 284 speakers and two different ASV systems (GMM-UBM and SVM using Gaussian supervectors) was reported in [58]. Using a state-of-the-art HMM-based speech synthesiser, the FAR was shown to rise to 86 and 81 % for the GMM-UBM and SVM systems, respectively. Spoofing experiments using HMM-based synthetic speech against a forensics speaker verification tool BATVOX was also reported in [59] with similar findings. Today’s state-of-the-art speech synthesisers thus present a genuine threat to ASV.

7.3.3.2 Countermeasures

Only a small number of attempts to discriminate synthetic speech from natural speech have been investigated and there is currently no general solution which is independent from specific speech synthesis methods. Previous work has demonstrated the successful detection of synthetic speech based on prior knowledge of the acoustic differences of specific speech synthesisers, such as the dynamic ranges of spectral parameters at the utterance level [60] and variance of higher order parts of mel-cepstral coefficients [61].

There are some attempts which focus on acoustic differences between vocoders and natural speech. Since the human auditory system is known to be relatively insensitive to phase [62], vocoders are typically based on a minimum-phase vocal tract model. This simplification leads to differences in the phase spectra between human and synthetic speech, differences which can be utilised for discrimination [58, 63].

Based on the difficulty in reliable prosody modelling in both unit selection and statistical parametric speech synthesis, other approaches to synthetic speech detection use F0 statistics [64, 65]. F0 patterns generated for the statistical parametric speech synthesis approach tend to be over-smoothed and the unit selection approach frequently exhibits ‘F0 jumps’ at concatenation points of speech units.

7.3.4 Voice Conversion

Voice conversion is a sub-domain of voice transformation [66] which aims to convert one speaker’s voice towards that of another. The field has attracted increasing interest in the context of ASV vulnerabilities for over a decade [67]. Unlike TTS, which requires text input, voice conversion operates directly on speech samples. In particular, the goal is to transform according to a conversion function $\fancyscript{F}$ the feature vectors ($\mathbf {x}$) corresponding to speech from a source speaker (spoofer) to that they are closer to those of target a speaker ($\mathbf {y}$):

$$\begin{aligned} \mathbf {y} = \fancyscript{F}(\mathbf {x}, \mathbf {\theta }). \end{aligned}$$

(7.1)

Most voice conversion approaches adopt a training phase which requires frame-aligned pairs $\{(\mathbf {x}_t,\mathbf {y}_t)\}$ in order to learn the transformation parameters $\mathbf {\theta }$. Frame alignment is usually achieved using dynamic time warping (DTW) on parallel source-target training utterances with identical text content. The trained conversion function is then applied to new source utterances of arbitrary text content at run-time.

A large number of specific conversion approaches have been reported. One of the earliest and simplest techniques employs vector quantisation (VQ) with codebooks [68] or segmental codebooks [69] of paired source-target frame vectors to represent the conversion function. However, VQ introduces frame-to-frame discontinuity problems. Among the more recent conversion methods, joint density Gaussian mixture model (JD-GMM) [70–72] has become a standard baseline method. It achieves smooth feature transformations using a local linear transformation. Despite its popularity, known problems of JD-GMM include over-smoothing [73–75] and over-fitting [76, 77] which has led to the development of alternative linear conversion methods such as partial least square (PLS) regression [76], tensor representation [78], a trajectory hidden Markov model [79], a mixture of factor analysers [80], local linear transformation [73] and a noisy channel model [81]. Non-linear approaches, including artificial neural networks [82, 83], support vector regression [84], kernel partial least square [85] and conditional restricted Boltzmann machines [86], have also been studied. As alternatives to data-driven conversion, frequency warping techniques [87–89] have also attracted attention.

The approaches to voice conversion considered above are usually applied to the transformation of spectral envelope features, though the conversion of prosodic features such as fundamental frequency [90–93] and duration [91, 94] has also been studied. In contrast to parametric methods, unit selection approaches can be applied directly to feature vectors coming from the target speaker to synthesise converted speech [95]. Since they use target speaker data directly, unit selection approaches arguably pose a greater risk to ASV than statistical approaches [96].

In general, only the most straightforward of the spectral conversion methods have been utilised in ASV vulnerability studies. Even when trained using a non-parallel technique and non-ideal telephony data, the baseline JD-GMM approach, which produces over-smooth speech with audible artefacts, is shown to increase significantly the FAR of modern ASV systems [20, 96]; unlike the human ear, current recognisers are essentially ‘deaf’ to obvious conversion artefacts caused by imperfect signal analysis-synthesis models and poorly trained conversion functions.

7.3.4.1 Spoofing

When applied to spoofing, voice conversion aims to synthesise a new speech signal such that features extracted for ASV are close in some sense to the target speaker. Some of the first work relevant to text-independent ASV spoofing includes that in [32, 97]. The work in [32] showed that a baseline EER increased from 16 to 26 % as a result of voice conversion which also converted prosodic aspects not modelled in typical ASV systems. The work in [97] investigated the probabilistic mapping of a speaker’s vocal tract information towards that of another, target speaker using a pair of tied speaker models, one of ASV features and another of filtering coefficients. This work targeted the conversion of spectral-slope parameters. The work showed that a baseline EER of 10 % increased to over 60 % when all impostor test samples were replaced with converted voice. In addition, signals subjected to voice conversion did not exhibit any perceivable artefacts indicative of manipulation.

The work in [20] investigated ASV vulnerabilities using a popular approach to voice conversion [70] based on JD-GMMs, which requires a parallel training corpus for both source and target speakers. Even if converted speech would be easily detectable by human listeners, experiments involving five different ASV systems showed universal susceptibility to spoofing. The FAR of the most robust, JFA system increased from 3 % to over 17 %.

Other work relevant to voice conversion includes attacks referred to as artificial signals. It was noted in [98] that certain short intervals of converted speech yield extremely high scores or likelihoods. Such intervals are not representative of intelligible speech but they are nonetheless effective in overcoming typical text-independent ASV systems which lack any form of speech quality assessment. The work in [98] showed that artificial signals optimised with a genetic algorithm provoke increases in the EER from 10 % to almost 80 % for a GMM-UBM system and from 5 % to almost 65 % for a factor analysis (FA) system.

7.3.4.2 Countermeasures

Some of the first work to detect converted voice draws on related work in synthetic speech detection [100]. While the proposed cosine phase and modified group delay function (MGDF) countermeasures proposed in [63, 99] are effective in detecting spoofed speech (see Fig. 7.1), they are unlikely to detect converted voice with real-speech phase [97].

Two approaches to artificial signal detection are reported in [101]. Experimental work shows that supervector-based SVM classifiers are naturally robust to such attacks whereas all spoofing attacks can be detected using an utterance-levelvariability feature which detects the absence of natural, dynamic variability characteristic of genuine speech. An alternative approach based on voice quality analysis is less dependent on explicit knowledge of the attack but less effective in detecting attacks.

A related approach to detect converted voice is proposed in [102]. Probabilistic mappings between source and target speaker models are shown to yield converted speech with less short-term variability than genuine speech. The thresholded, average pair-wise distance between consecutive feature vectors is used to detect converted voice with an EER of under 3 %.

Due to fact that current analysis–synthesis techniques operate at the short-term frame level, the use of temporal magnitude/phase modulation features, a form of long-term feature, are proposed in [103] to detect both speech synthesis and voice conversion spoofing attacks. Another form of long-term feature is reported in [104]. The approach is based on the local binary pattern (LBP) analysis of sequences of acoustic vectors and is successful in detecting converted voice. Interestingly, the approach is less reliant on prior knowledge and can also detect different spoofing attacks, examples of which were not used for training or optimisation.

7.3.5 Summary

As shown above, ASV spoofing and countermeasures have been studied with a multitude of different datasets, evaluation protocols and metrics, with highly diverse experimental designs, different ASV recognisers and with different approaches to spoofing; the lack of any commonality makes the comparison of results, vulnerabilities and countermeasure performance an extremely challenging task. Drawing carefully upon the literature and the authors’ own experience with various spoofing approaches, we have nevertheless made such an attempt. Table 7.1 aims to summarise the threat of spoofing for the four approaches considered above. Accessibility (practicality) reflects whether the threat is available to the masses or limited to the technically knowledgeable. Effectiveness (risk), in turn, reflects the success of each approach in provoking higher false acceptance rates.

Table 7.1 A summary of the four approaches to ASV spoofing, their expected accessibility and risk

Full size table

Although some studies have shown that impersonation can fool ASV recognisers, in practice, the effectiveness seems to depend both on the skill of the impersonator, the similarity of the attacker’s voice to that of the target speaker and on the recogniser itself. Replay attacks are highly effective in the case of text-independent ASV and fixed-phrase text-independent systems. Even if the effectiveness is reduced in the case of randomised, phrase-prompted text-dependent systems, replay attacks are the most accessible approach to spoofing, requiring only a recording and playback device such as a tape recorder or a smart phone.

Speech synthesis and voice conversion attacks pose the greatest risk. While voice conversion systems are not yet commercially available, both free and commercial text-to-speech (TTS) systems with pre-trained voice profiles are widely available, even if commercial off-the-shelf (COTS) systems do not include the functionality for adaptation to specific target voices. While accessibility is therefore medium in the short term, speaker adaptation remains a highly active research topic. It is thus only a matter of time until flexible, speaker-adapted synthesis and conversionsystems become readily available. Then, both effectiveness and accessibility should be considered high.

7.4 Discussion

In this section, we discuss current approaches to evaluation and some weaknesses in the current evaluation methodology. While much of the following is not necessarily specific to the speech modality, with research in spoofing and countermeasures in ASV lagging behind that related to other biometric modalities, the discussion below is particularly pertinent.

7.4.1 Protocols and Metrics

While countermeasures can be integrated into existing ASV systems, they are most often implemented as independent modules which allow for the explicit detection of spoofing attacks. The most common approach in this case is to concatenate the two classifiers in series.

The assessment of countermeasure performance on its own is relatively straightforward; results are readily analysed with standard detection error trade-off (DET) profiles [106] and related metrics. It is often of interest, however, that the assessment reflects their impact on ASV performance. Assessment is then non-trivial and calls for the joint optimisation of combined classifiers. Results furthermore reflect the performance of specific ASV systems. As described in Sect. 7.3, there are currently no standard evaluation protocols, metrics or ASV systems which might otherwise be used to conduct evaluations. There is a thus a need to define such standards in the future.

Candidate standards are being drafted within the scope of the EU FP7 TABULA RASA project.^{Footnote 3} Here, independent countermeasures preceding biometric verification are optimised at three different operating points where thresholds are set to obtain FARs (the probability of labelling a genuine access as a spoofing attack) of 1, 5 or 10 %. Samples labelled as genuine accesses are then passed to the verification system.^{Footnote 4} Performance is assessed using four different DET profiles,^{Footnote 5} examples of which are illustrated in Fig. 7.2. The four profiles illustrate performance of the baseline system with zero-effort impostors, the baseline system with active countermeasures, the baseline system where all impostor accesses are replaced with spoofing attacks and, finally, the baseline system with spoofing attacks and active countermeasures.

Consideration of all four profiles is needed to gauge the impact of countermeasure performance on licit transactions (any deterioration in false rejection—difference between first and second profiles) and improved robustness to spoofing (improvements in false acceptance—difference between third and fourth profiles). While the interpretation of such profiles is trivial, different plots are obtained for each countermeasure operating point. Further work is required to design intuitive, universal metrics which represent the performance of spoofing countermeasures when combined with ASV.

7.4.2 Datasets

While some works have shown the potential for detecting spoofing without prior knowledge or training data indicative of a specific attack [63, 104, 107], all previous works are based on some implicit prior knowledge, i.e. the nature of the spoofing attack and/or the targeted ASV system is known. While training and evaluation data with known spoofing attacks might be useful to develop and optimise appropriate countermeasures, the precise nature of spoofing attacks can never be known in practice. Estimates of countermeasure performance so obtained should thus be considered at best optimistic. Furthermore, the majority of the past work was also conducted under matched conditions, i.e. data used to learn target models and that used to effect spoofing were collected in the same or similar acoustic environment and over the same or similar channel. The performance of spoofing countermeasures when subjected to realistic session variability is then unknown.

While much of the past work already uses standard datasets, e.g. NIST SRE data, spoofed samples are obtained by treating them with non-standard algorithms. Standard datasets containing both licit transactions and spoofed speech from a multitude of different spoofing algorithms and with realistic session variability are therefore needed to reduce the use of prior knowledge, to improve the comparability of different countermeasures and their performance against varied spoofing attacks. Collaboration with colleagues in other speech and language processing communities, e.g. voice conversion and speech synthesis, will help to assess vulnerabilities to state-of-the art spoofing attacks and also to assess countermeasures when details of the spoofing attacks are unknown. The detection of spoofing will then be considerably more challenging but more reflective of practical use cases.

7.5 Conclusions

This contribution reviews previous work to assess the threat from spoofing to automatic speaker verification (ASV). While there are currently no standard datasets, evaluation protocols or metrics, the study of impersonation, replay, speech synthesis and voice conversion spoofing attacks reported in this article indicate genuine vulnerabilities. We nonetheless argue that significant additional research is required before the issue of spoofing in ASV is properly understood and conclusions can be drawn.

In particular, while the situation is slowly changing, the majority of past work involves text-independent ASV, most relevant to surveillance. The spoofing threat is pertinent in authentication scenarios where text-dependent ASV might be preferred. Greater effort is therefore needed to investigate spoofing in text-dependent scenarios with particularly careful consideration being given to design appropriate datasets and protocols.

Second, almost all ASV spoofing countermeasures proposed thus far are dependent on training examples indicative of a specific attack. Given that the nature of spoofing attacks can never be known in practice, and with the variety in spoofing attacks being particularly high in ASV, future work should investigate new countermeasures which generalise well to unforeseen attacks. Formal evaluations with standard datasets, evaluation protocols, metrics and even standard ASV systems are also needed to address weaknesses in the current evaluation methodology.

Finally, some of the vulnerabilities discussed in this paper involve relatively high-cost and high-technology attacks. While the trend of open source software may cause this to change, such attacks are beyond the competence of the unskilled and in such case the level of vulnerability is arguably overestimated. While we have touched on this issue in this article, a more comprehensive risk-based assessment is needed to ensure such evaluations are not overly-alarmist. Indeed, the work discussed above shows that countermeasures, some of them relatively trivial, have the potential to detect spoofing attacks with manageable impacts on system usability.

Notes

1.
http://www.nuance.com/landing-pages/products/voicebiometrics/freespeech.asp.
2.
http://www.tabularasa-euproject.org/.
3.
http://www.tabularasa-euproject.org/.
4.
In practice samples labelled as spoofing attacks cannot be fully discarded since so doing would unduly influence false reject and false acceptance rates calculated as a percentage of all accesses.
5.
Produced with the TABULA RASA Score-toolkit: http://publications.idiap.ch/downloads/reports/2012/Anjos_Idiap-Com-02-2012.pdf.

References

Evans N, Kinnunen T, Yamagishi J (2013) Spoofing and countermeasures for automatic speaker verification. In: Proceedings of interspeech, annual conference of the international speech communication association, Lyon, France
Google Scholar
Pelecanos J, Sridharan S (2001) Feature warping for robust speaker verification. In: Proceedings of Odyssey 2001: the speaker and language recognition workshop, Crete, Greece, pp 213–218
Google Scholar
Shriberg E, Ferrer L, Kajarekar S, Venkataraman A, Stolcke A (2005) Modeling prosodic feature sequences for speaker recognition. Speech Commun 46(3–4):455–472
Article Google Scholar
Dehak N, Kenny P, Dumouchel P (2007) Modeling prosodic features with joint factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 15(7):2095–2103
Article Google Scholar
Siddiq S, Kinnunen T, Vainio M, Werner S (2012) Intonational speaker verification: a study on parameters and performance under noisy conditions. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), Kyoto, Japan, pp 4777–4780
Google Scholar
Kockmann M, Ferrer L, Burget L, Cěrnocký J (2011) i-vector fusion of prosodic and cepstral features for speaker verification. In: Proceedings of interspeech, annual conference of the international speech communication association, Florence, Italy, pp 265–268
Google Scholar
Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Commun 52(1):12–40
Article Google Scholar
Reynolds D, Rose R (1995) Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans Speech Audio Process 3:72–83
Article Google Scholar
Reynolds DA, Quatieri TF, Dunn RB (2000) Speaker verification using adapted Gaussian mixture models. Digital Signal Process 10(1):19–41
Article Google Scholar
Campbell WM, Sturim DE, Reynolds DA (2006) Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process Lett 13(5):308–311
Article Google Scholar
Solomonoff A, Campbell W, Boardman I (2005) Advances in channel compensation for SVM speaker recognition. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), pp 629–632, Philadelphia, USA
Google Scholar
Burget L, Matějka P, Schwarz P, Glembek O, Černocký J (2007) Analysis of feature extraction and channel compensation in a GMM speaker recognition system. IEEE Trans Audio Speech Lang Process 15(7):1979–1986
Article Google Scholar
Hatch AO, Kajarekar S, Stolcke A (2006) Within-class covariance normalization for svm-based speaker recognition. In: Proceedings of IEEE international conference on spoken language process (ICSLP), pp 1471–1474
Google Scholar
Kenny, P (2006) Joint factor analysis of speaker and session variability: theory and algorithms. technical report CRIM-06/08-14
Google Scholar
Kenny P, Boulianne G, Ouellet P, Dumouchel P (2007) Speaker and session variability in GMM-based speaker verification. IEEE Trans Audio Speech Lang Process 15(4):1448–1460
Article Google Scholar
Kenny P, Ouellet P, Dehak N, Gupta V, Dumouchel P (2008) A study of inter-speaker variability in speaker verification. IEEE Trans Audio Speech Lang Process 16(5):980–988
Article Google Scholar
Dehak N, Kenny P, Dehak R, Dumouchel P, Ouellet P (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19(4):788–798
Article Google Scholar
Li P, Fu Y, Mohammed U, Elder JH, Prince SJ (2012) Probabilistic models for inference about identity. IEEE Trans Pattern Anal Mach Intell 34(1):144–157
Article Google Scholar
Garcia-Romero D, Espy-Wilson CY (2011) Analysis of i-vector length normalization in speaker recognition systems. In: Proceedings of interspeech, annual conference of the international speech communication association, Florence, Italy, pp 249–252
Google Scholar
Kinnunen T, Wu ZZ, Lee KA, Sedlak F, Chng ES, Li H (2012) Vulnerability of speaker verification systems against voice conversion spoofing attacks: the case of telephone speech. In: Proceedings of IEEE international conference on acoustics speech and signal process (ICASSP), pp 4401–4404
Google Scholar
Saeidi R et al (2013) I4U submission to NIST SRE 2012: a large-scale collaborative effort for noise-robust speaker verification. In: Proceedings of interspeech, annual conference of the international speech communication association, Lyon, France
Google Scholar
Brümmer N, Burget L, Černocký J, Glembek O, Grézl F, Karafiát M, Leeuwen D, Matějka P, Schwartz P, Strasheim A (2007) Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006. IEEE Trans Audio Speech Lang Process 15(7):2072–2084
Article Google Scholar
Hautamäki V, Kinnunen T, Sedlák F, Lee KA, Ma B, Li H (2013) Sparse classifier fusion for speaker verification. IEEE Trans Audio Speech Lang Process 21(8):1622–1631
Article Google Scholar
Akhtar Z, Fumera G, Marcialis GL, Roli F (2012) Evaluation of serial and parallel multibiometric systems under spoong attacks. In: Proceedings of 5th Int. Conference on biometrics (ICB 2012), pp 283–288, New Delhi, India
Google Scholar
Lau YW, Wagner M, Tran D (2004) Vulnerability of speaker verification to voice mimicking. In: Proceedings of 2004 international symposium on Intelligent multimedia, video and speech processing, 2004. IEEE, pp 145–148
Google Scholar
Lau Y, Tran D, Wagner M (2005) Testing voice mimicry with the yoho speaker verification corpus. Knowledge-based intelligent information and engineering systems. Springer, Berlin, p 907
Google Scholar
Mariéthoz J, Bengio S (2005) Can a professional imitator fool a GMM-based speaker verification system? IDIAP Research Report 05–61
Google Scholar
Eriksson A, Wretling P (1997) How flexible is the human voice?—a case study of mimicry. In: Proceedings of Eurospeech, ESCA European conference on speech communication and technology, pp 1043–1046. http://www.ling.gu.se/anders/papers/a1008.pdf
Zetterholm E, Blomberg M, Elenius D (2004) A comparison between human perception and a speaker verification system score of a voice imitation. In: Proceedings of tenth australian international conference on speech science and technology, Macquarie University, Sydney, Australia, pp 393–397
Google Scholar
Farrús M, Wagner M, Anguita J, Hernando J (2008) How vulnerable are prosodic features to professional imitators? In: The speaker and language recognition workshop (Odyssey 2008), Stellenbosch, South Africa
Google Scholar
Kitamura T (2008) Acoustic analysis of imitated voice produced by a professional impersonator. In: Proceedings of interspeech, annual conference of the international speech communication association, Brisbane, Australia, pp 813–816
Google Scholar
Perrot P, Aversano G, Blouet R, Charbit M, Chollet G (2005) Voice forgery using ALISP: indexation in a client memory. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), vol 1, pp 17–20
Google Scholar
Lindberg J, Blomberg M et al (1999) Vulnerability in speaker verification-a study of technical impostor techniques. Proc Eur Conf speech Commun Technol 3:1211–1214
Google Scholar
Villalba J, Lleida E (2010) Speaker verification performance degradation against spoofing and tampering attacks. In: FALA 10 workshop, pp 131–134
Google Scholar
Wang ZF, Wei G, He QH (2011) Channel pattern noise based playback attack detection algorithm for speaker recognition. Int Conf Mach Learn Cybern (ICMLC) 4:1708–1713
Google Scholar
Shang W, Stevenson M (2010) Score normalization in playback attack detection. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), pp 1678–1681
Google Scholar
Villalba J, Lleida E (2011) Preventing replay attacks on speaker verification systems. In: Proceedings of the IEEE international carnahan conference on security technology, (ICCST) 2011, pp 1–8
Google Scholar
Klatt DH (1980) Software for a cascade/parallel formant synthesizer. J Acoust Soc Am 67:971–995
Article Google Scholar
Moulines E, Charpentier F (1990) Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun 9:453–467
Article Google Scholar
Hunt A, Black AW (1996) Unit selection in a concatenative speech synthesis system using a large speech database. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), pp 373–376
Google Scholar
Breen A, Jackson P (1998) A phonologically motivated method of selecting nonuniform units. In: Proceedings of IEEE international conference on spoken language process (ICSLP), pp 2735–2738
Google Scholar
Donovan RE, Eide EM (1998) The IBM trainable speech synthesis system. In: Proceedings of IEEE international conference on spoken language process (ICSLP), pp 1703–1706
Google Scholar
Beutnagel B, Conkie A, Schroeter J, Stylianou Y, Syrdal A (1999) The AT&T next-gen TTS system. In: Proceedings of joint ASA, EAA and DAEA meeting, pp 15–19
Google Scholar
Coorman G, Fackrell J, Rutten P, Coile B (2000) Segment selection in the L & H realspeak laboratory TTS system. In: Proceedings of international conference on speech and language processing, pp 395–398
Google Scholar
Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T (1999) Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In: Proceedings of Eurospeech, ESCA European conference on speech communication and technology, pp 2347–2350
Google Scholar
Ling ZH, Wu YJ, Wang YP, Qin L, Wang RH (2006) USTC system for blizzard challenge 2006 an improved HMM-based speech synthesis method. In: Proceedings of the blizzard challenge workshop
Google Scholar
Black AW (2006) CLUSTERGEN: a statistical parametric synthesizer using trajectory modeling. In: Proceedings of interspeech, annual conference of the international speech communication association, pp 1762–1765
Google Scholar
Zen H, Toda T, Nakamura M, Tokuda K (2007) Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005. IEICE Trans Inf Syst E90–D(1):325–333
Article Google Scholar
Zen H, Tokuda K, Black AW (2009) Statistical parametric speech synthesis. Speech Communication 51(11):1039–1064. doi:10.1016/j.specom.2009.04.004
Article Google Scholar
Yamagishi J, Kobayashi T, Nakano Y, Ogata K, Isogai J (2009) Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Trans Speech Audio Lang Process 17(1):66–83
Article Google Scholar
Leggetter CJ, Woodland PC (1995) Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput Speech Lang 9:171–185
Article Google Scholar
Woodland PC (2001) Speaker adaptation for continuous density HMMs: A review. In: Proceedings of ISCA workshop on adaptation methods for speech recognition, p 119
Google Scholar
Foomany F, Hirschfield A, Ingleby M (2009) Toward a dynamic framework for security evaluation of voice verification systems. In: IEEE toronto international conference on science and technology for humanity (TIC-STH), pp 22–27. doi:10.1109/TIC-STH.2009.5444499
Masuko T, Hitotsumatsu T, Tokuda K, Kobayashi T (1999) On the security of HMM-based speaker verification systems against imposture using synthetic speech. In: Proceedings of Eurospeech, ESCA European conference on speech communication and technology
Google Scholar
Matsui T, Furui S (1995) Likelihood normalization for speaker verification using a phoneme- and speaker-independent model. Speech Commun 17(1–2):109–116
Article Google Scholar
Masuko T, Tokuda K, Kobayashi T, Imai S (1996) Speech synthesis using HMMs with dynamic features. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP)
Google Scholar
Masuko T, Tokuda K, Kobayashi T, Imai S (1997) Voice characteristics conversion for HMM-based speech synthesis system. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP)
Google Scholar
De Leon PL, Pucher M, Yamagishi J, Hernaez I, Saratxaga I (2012) Evaluation of speaker verification security and detection of HMM-based synthetic speech. IEEE Trans Audio Speech Lang Process 20(8):2280–2290. doi:10.1109/TASL.2012.2201472
Article Google Scholar
Galou, G (2011) Synthetic voice forgery in the forensic context: a short tutorial. In: Forensic speech and audio analysis working group (ENFSI-FSAAWG), pp 1–3
Google Scholar
Satoh T, Masuko T, Kobayashi T, Tokuda K (2001) A robust speaker verification system against imposture using an HMM-based speech synthesis system. In: Proceedings of Eurospeech, ESCA European conference on speech technology
Google Scholar
Chen LW, Guo W, Dai LR (2010) Speaker verification against synthetic speech. In: Proceedings of 7th international symposium on chinese spoken language processing (ISCSLP), pp 309–312 (29 Nov–3 Dec 2010). doi:10.1109/ISCSLP.2010.5684887
Quatieri TF (2002) Discrete-time speech signal processing principles and practice. Prentice-hall, Inc
Google Scholar
Wu Z, Chng ES, Li H (2012) Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition. In: Proceedings of interspeech, annual conference of the international speech communication association
Google Scholar
Ogihara A, Unno H, Shiozakai A (2005) Discrimination method of synthetic speech using pitch frequency against synthetic speech falsification. IEICE Trans Fundam Electron Commun Comput Sci 88(1):280–286
Article Google Scholar
De Leon PL, Stewart B, Yamagishi J (2012) Synthetic speech discrimination using pitch pattern statistics derived from image analysis. In: Proceedings of interspeech, annual conference of the international speech communication association, Portland, Oregon, USA
Google Scholar
Stylianou Y (2009) Voice transformation: a survey. In: Proceedings of IEEE international conference on acoustics speech and signal process (ICASSP), pp 3585–3588
Google Scholar
Pellom BL, Hansen JH (1999) An experimental study of speaker verification sensitivity to computer voice-altered imposters. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), vol 2, pp 837–840
Google Scholar
Abe M, Nakamura S, Shikano K, Kuwabara H (1988) Voice conversion through vector quantization. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), pp 655–658
Google Scholar
Arslan LM (1999) Speaker transformation algorithm using segmental codebooks (STASC). Speech Commun 28(3):211–226
Article Google Scholar
Kain A, Macon MW (1998) Spectral voice conversion for text-to-speech synthesis. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), vol 1, pp 285–288
Google Scholar
Stylianou Y, Cappé O, Moulines E (1998) Continuous probabilistic transform for voice conversion. IEEE Trans Speech Audio Process 6(2):131–142
Article Google Scholar
Toda T, Black AW, Tokuda K (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans Audio Speech Lang Process 15(8):2222–2235
Article Google Scholar
Popa V, Silen H, Nurminen J, Gabbouj M (2012) Local linear transformation for voice conversion. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), pp 4517–4520
Google Scholar
Chen Y, Chu M, Chang E, Liu J, Liu R (2003) Voice conversion with smoothed GMM and MAP adaptation. In: Proceedings of Eurospeech, ESCA European conference on speech communication and technology, pp 2413–2416
Google Scholar
Hwang HT, Tsao Y, Wang HM, Wang YR, Chen SH (2012) A study of mutual information for GMM-based spectral conversion. In: Proceedings of Interspeech, annual conference of the international speech communication association
Google Scholar
Helander E, Virtanen T, Nurminen J, Gabbouj M (2010) Voice conversion using partial least squares regression. IEEE Trans Audio Speech Lang Process 18(5):912–921
Article Google Scholar
Pilkington NC, Zen H, Gales MJ (2011) Gaussian process experts for voice conversion. In: Twelfth annual conference of the international speech communication association
Google Scholar
Saito D, Yamamoto K, Minematsu N, Hirose K (2011) One-to-many voice conversion based on tensor representation of speaker space. In: Proceedings of Interspeech, annual conference of the international speech communication association, pp 653–656
Google Scholar
Zen H, Nankaku Y, Tokuda K (2011) Continuous stochastic feature mapping based on trajectory HMMs. IEEE Trans Audio Speech Lang Process 19(2):417–430
Article Google Scholar
Wu Z, Kinnunen T, Chng ES, Li H (2012) Mixture of factor analyzers using priors from non-parallel speech for voice conversion. IEEE Signal Process Lett 19(12):914–917
Article Google Scholar
Saito D, Watanabe S, Nakamura A, Minematsu N (2012) Statistical voice conversion based on noisy channel model. IEEE Trans Audio Speech Lang Process 20(6):1784–1794
Article Google Scholar
Narendranath M, Murthy HA, Rajendran S, Yegnanarayana B (1995) Transformation of formants for voice conversion using artificial neural networks. Speech commun 16(2):207–216
Article Google Scholar
Desai S, Raghavendra EV, Yegnanarayana B, Black AW, Prahallad K (2009) Voice conversion using artificial neural networks. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), pp 3893–3896
Google Scholar
Song P, Bao Y, Zhao L, Zou C (2011) Voice conversion using support vector regression. Electron Lett 47(18):1045–1046
Article Google Scholar
Helander E, Silén H, Virtanen T, Gabbouj M (2012) Voice conversion using dynamic kernel partial least squares regression. IEEE Trans Audio Speech Lang Process 20(3):806–817
Article Google Scholar
Wu Z, Chng ES, Li H (2013) Conditional restricted boltzmann machine for voice conversion. In: The first IEEE china summit and international conference on signal and information processing (ChinaSIP)
Google Scholar
Sundermann D, Ney H (2003) VTLN-based voice conversion. In: Proceedings of the 3rd IEEE international symposium on signal processing and information technology, 2003. ISSPIT 2003, pp 556–559
Google Scholar
Erro D, Moreno A, Bonafonte A (2010) Voice conversion based on weighted frequency warping. IEEE Trans Audio Speech Lang Process 18(5):922–931
Article Google Scholar
Erro D, Navas E, Hernaez I (2013) Parametric voice conversion based on bilinear frequency warping plus amplitude scaling. IEEE Trans Audio Speech Lang Process 21(3):556–566
Article Google Scholar
Gillet B, King S (2003) Transforming F0 contours. In: Proceedings of Eurospeech, ESCA European conference on speech communication and technology, pp 101–104
Google Scholar
Wu CH, Hsia CC, Liu TH, Wang JF (2006) Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis. IEEE Trans Audio Speech Lang Process 14(4):1109–1116
Article Google Scholar
Helander EE, Nurminen J (2007) A novel method for prosody prediction in voice conversion. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), pp IV-509
Google Scholar
Wu ZZ, Kinnunen T, Chng ES, Li H (2010) Text-independent F0 transformation with non-parallel data for voice conversion. In: Eleventh annual conference of the international speech communication association
Google Scholar
Lolive D, Barbot N, Boeffard O (2008) Pitch and duration transformation with non-parallel data. Speech prosody 2008:111–114
Google Scholar
Sundermann D, Hoge H, Bonafonte A, Ney H, Black A, Narayanan S (2006) Text-independent voice conversion based on unit selection. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), vol 1, pp I-I
Google Scholar
Wu Z, Larcher A, Lee KA, Chng ES, Kinnunen T, Li H (2013) Vulnerability evaluation of speaker verication under voice conversion spoong: the effect of text constraints. In: Proceedings of interspeech, annual conference of the international speech communication association, Lyon, France
Google Scholar
Matrouf D, Bonastre JF, Fredouille C (2006) Effect of speech transformation on impostor acceptance. In: Proceedings of IEEE international conference on acoustics, speech and signal process (ICASSP), vol 1, pp I-I
Google Scholar
Alegre F, Vipperla R, Evans N, Fauve B (2012) On the vulnerability of automatic speaker recognition to spoofing attacks with artificial signals. In: Proceedings of EURASIP Euro signal processing conference (EUSIPCO)
Google Scholar
Wu Z, Kinnunen T, Chng ES, Li H, Ambikairajah E (2012) A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case. In: Signal and information processing association annual summit and conference (APSIPA ASC), 2012 Asia-Pacific, pp 1–5
Google Scholar
De Leon PL, Hernaez I, Saratxaga I, Pucher M, Yamagishi J (2011) Detection of synthetic speech for the problem of imposture. In: Proceedings of IEEE international conference on acoustic, speech and signal process (ICASSP), pp 4844–4847, Dallas, USA
Google Scholar
Alegre F, Vipperla R, Evans N, et al (2012) Spoofing countermeasures for the protection of automatic speaker recognition systems against attacks with artificial signals. In: Proceedings of interspeech, annual conference of the international speech communication association
Google Scholar
Alegre F, Amehraye A, Evans N (2013) Spoofing countermeasures to protect automatic speaker verification from voice conversion. In: Proceedings of IEEE international conference on acoustic, speech and signal process (ICASSP)
Google Scholar
Wu Z, Xiao X, Chng ES, Li H (2013) Synthetic speech detection using temporal modulation feature. In: Proceedings of IEEE international conference on acoustic, speech and signal process (ICASSP)
Google Scholar
Alegre F, Vipperla R, Amehraye A, Evans N (2013) A new speaker verification spoofing countermeasure based on local binary patterns. In: Proceedings of interspeech, annual conference of the international speech communication association, Lyon, France
Google Scholar
Hautamki RG, Kinnunen T, Hautamki V, Leino T, Laukkanen AM (2013) I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry. In: Proceedings of interspeech, annual conference of the international speech communication association
Google Scholar
Martin A, Doddington G, Kamm T, Ordowski M, Przybocki M (1997) The DET curve in assessment of detection task performance. In: Proceedings of Eurospeech, ESCA European conference on speech communication and technology, pp 1895–1898
Google Scholar
Alegre F, Amehraye A, Evans N (2013) A one-class classification approach to generalised speaker verification spoofing countermeasures using local binary patterns. In: Proceedings of international conference on biometrics: theory, applications and systems (BTAS), Washington DC, USA
Google Scholar

Download references

Acknowledgments

This work was partially supported by the TABULA RASA project funded under the 7th Framework Programme of the European Union (EU) (grant agreement number 257289), by the Academy of Finland (project no. 253120) and by EPSRC grants EP/I031022/1 (NST) and EP/J002526/1 (CAF).

Author information

Authors and Affiliations

Department of Multimedia Communications, Campus SophiaTech, EURECOM, 450 Route des Chappes, 06410, Biot, France
Nicholas Evans & Federico Alegre
Speech and Image Processing Unit, School of Computing, University of Eastern Finland (UEF), P.O. Box 111, FI-80101, Joensuu, Finland
Tomi Kinnunen
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan
Junichi Yamagishi
University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, UK
Junichi Yamagishi
Emerging Research Lab, School of Computer Engineering, Nanyang Technological University (NTU), N4-B1a-02 C2I, Nanyang Avenue, Singapore, 639798, Singapore
Zhizheng Wu
Department 3-O, Klipsch School of Electrical and Computer Engineering, New Mexico State University, PO Box 30001 , Las Cruces, NM, 88003-8001, USA
Phillip De Leon

Authors

Nicholas Evans
View author publications
You can also search for this author in PubMed Google Scholar
Tomi Kinnunen
View author publications
You can also search for this author in PubMed Google Scholar
Junichi Yamagishi
View author publications
You can also search for this author in PubMed Google Scholar
Zhizheng Wu
View author publications
You can also search for this author in PubMed Google Scholar
Federico Alegre
View author publications
You can also search for this author in PubMed Google Scholar
Phillip De Leon
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Nicholas Evans .

Editor information

Editors and Affiliations

Idiap Research Institute, Martigny, Switzerland
Sébastien Marcel
University of Southampton, United Kingdom
Mark S. Nixon
Institute of Automation Center for Biometrics and Security Resea, Chinese Academy of Sciences, Beijing, China
Stan Z. Li

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Evans, N., Kinnunen, T., Yamagishi, J., Wu, Z., Alegre, F., Leon, P. . (2014). Speaker Recognition Anti-spoofing. In: Marcel, S., Nixon, M., Li, S. (eds) Handbook of Biometric Anti-Spoofing. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-6524-8_7

Download citation

DOI: https://doi.org/10.1007/978-1-4471-6524-8_7
Published: 18 July 2014
Publisher Name: Springer, London
Print ISBN: 978-1-4471-6523-1
Online ISBN: 978-1-4471-6524-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Speaker Recognition Anti-spoofing

Abstract

Similar content being viewed by others

On the study of replay and voice conversion attacks to text-dependent speaker verification

Introduction to Voice Presentation Attack Detection and Recent Advances

Introduction to Voice Presentation Attack Detection and Recent Advances

Keywords

7.1 Introduction

7.2 Automatic Speaker Verification

7.2.1 Feature Extraction

7.2.2 Modelling and Classification

7.2.3 System Fusion

7.3 Spoofing and Countermeasures

7.3.1 Impersonation

7.3.1.1 Spoofing

7.3.1.2 Countermeasures

7.3.2 Replay

7.3.2.1 Spoofing

7.3.2.2 Countermeasures

7.3.3 Speech Synthesis

7.3.3.1 Spoofing

7.3.3.2 Countermeasures

7.3.4 Voice Conversion

7.3.4.1 Spoofing

7.3.4.2 Countermeasures

7.3.5 Summary

7.4 Discussion

7.4.1 Protocols and Metrics

7.4.2 Datasets

7.5 Conclusions

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this chapter

Cite this chapter

Download citation

Share this chapter

Publish with us

Search

Navigation