1 Originality and contribution

This paper addresses liveness detection in frontal face videos. The liveness check is performed by measuring the degree of synchrony between the lips and the voice extracted from a video sequence. Four original methods are derived for this purpose: three based on co-inertia analysis and a fourth based on coupled hidden Markov models. The main contributions of this work are a full theoretical description of these methods and an experimental comparison of the main asynchrony detection algorithms on a publicly available database, allowing for future performance comparisons.

2 Introduction

Oral communication between people is a means of communication which is intrinsically multimodal. Not only does it include acoustic information but it also conveys complementary visual information. Acoustic information is classically used for state-of-the-art automatic speech processing applications such as automatic speech transcription or speaker authentication, while visual information is of great help in adverse environments where acoustic information is degraded (background noise, channel distortion, etc.). It provides complementary clues that can help in the analysis of the acoustic signal [1]. In extreme cases, visual information can even be used on its own. For instance, it is well known that deaf people can learn how to lip read. The joint analysis of acoustic and visual speech improves the robustness of automatic speech recognition systems [2, 3].

In the framework of identity verification based on talking faces, most systems in the literature fuse scores from speaker verification and face recognition tests. Nevertheless, a number of systems have attempted to make use of visual speech information to improve overall authentication performance [4–6].

One major weakness of these systems is that they do not take into account realistic impostor attack scenarios. Most existing systems, for example, could easily be fooled by simple attacks such as recording the voice of the target in advance and replaying it in front of the microphone, or simply placing a picture of the target’s face in front of the camera. Another problem emerges in audio–visual speaker recognition when several faces appear in the video and the true speaker must be selected before identification or verification can take place. Systems such as the one described in [5] jointly model acoustic and visual speech in order to improve speaker verification performance with respect to independent modeling. The audio–visual biometric system described in [6] performs better when the visual stream is incorporated for both identification and verification. The robustness of these systems against non-synchronized video attacks or complex scenes with several face candidates, however, has not been tested.

One solution that has been proposed in the recent literature is to test liveness by studying the degree of synchrony between the acoustic signal and lip motion [7, 8]. Synchrony detection is not a new problem in audio–visual analysis. It is a major issue in fields such as speaker location [9] and speaker association [10–12]. Studies in the area used measures such as canonical correlation (CANCOR) [10] and mutual information (MI) [9, 11, 12] to distinguish the true speaker from a set of candidates. Synchrony detection in video-based biometrics would solve the problem of complex scenes where several faces are present in the image. Furthermore, it would allow the detection of attacks that cause audio–visual inconsistency. A number of studies in the biometrics field have already dealt with asynchrony detection. For instance, the method introduced in [7] fuses the speech and lip parameters in a single audiovisual feature vector stream, and then models it within a Gaussian mixture model (GMM) for each client. The results obtained with this method are impressive (1% equal error rate) for easy replay attacks constructed with a voice recording and a still photograph, although it has not been tested using a voice recording and an image sequence taken from another video. The method described in [8] uses co-inertia analysis (CoIA) correlation evolution to create liveness scores based on different delays between audio and image sequences.

The main aim of this paper is to describe a series of new asynchrony detection techniques and compare them to existing ones. The techniques presented increase the robustness of audio–visual biometric systems against spoof attacks. In addition to their application in the biometrics field, these techniques can also be applied to any generic audio–visual consistency assessment or monologue detection task. Two new approaches for measuring synchrony between audio and visual speech and detecting possible asynchrony are proposed. The first approach is based on co-inertia analysis (CoIA), from which three new algorithms for detecting liveness are derived. The second is a Bayesian approach based on coupled hidden Markov models (CHMMs). CANCOR, MI and the CoIA-based method proposed by Eveno and Besacier are also tested in the same experimental framework for comparison purposes.

The rest of the paper is organized as follows. Section 3 introduces the acoustic and visual features used in the experiments. The first approach (based on CoIA) is described in Sect. 4 and the second (based on CHMMs) in Sect. 5. A third method based on the fusion of the two previous approaches is investigated in Sect. 6. The methods used for comparison are introduced in Sect. 7. Finally, the performance of each method is evaluated using real data from the BANCA audiovisual database; evaluation protocols and results are discussed in Sect. 8.

3 Audiovisual speech features

3.1 Acoustic speech features

Mel-frequency cepstral coefficients (MFCC) are classical acoustic speech features in automatic speech processing. They are state-of-the-art features in many applications, including automatic speech recognition and speaker verification systems.

Every 10 ms, a 20 ms window is extracted from the acoustic signal, and 12 MFCCs plus the signal energy are computed to produce a 13-dimensional acoustic speech feature vector. First- and second-order time derivatives are then appended, yielding a 39-dimensional feature vector every 10 ms.
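As an illustration, the following minimal sketch (not the authors' original front-end) computes a comparable 39-dimensional MFCC stream at 100 Hz with librosa; the 0th cepstral coefficient stands in for the energy term, and the sampling rate and FFT size are assumptions.

```python
import librosa
import numpy as np

def acoustic_features(wav_path, sr=16000):
    """Sketch of a 13 MFCC + delta + delta-delta front-end at 100 Hz."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 20 ms analysis window, 10 ms hop: one 13-dimensional vector every 10 ms.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512,
                                win_length=int(0.020 * sr),
                                hop_length=int(0.010 * sr))
    d1 = librosa.feature.delta(mfcc)           # first-order time derivatives
    d2 = librosa.feature.delta(mfcc, order=2)  # second-order time derivatives
    return np.vstack([mfcc, d1, d2]).T         # shape (frames, 39)
```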

3.2 Visual speech features

Visual speech features can be classified into two categories, depending on whether they are based on the shape or the appearance of the mouth [13]. The first category includes features that are directly related to the shape of the lip, such as the openness of the mouth, the location of particular lip landmarks, etc. The second category, in contrast, considers the mouth area as a whole and includes features that have been extracted directly from the pixels corresponding to a region of interest (ROI) around the mouth area.

Shape-based features

Robust tracking of lip landmarks is a mandatory preliminary step towards extracting shape-based features. A Lucas–Kanade-based tracker [14] is used to track the location of a collection of facial landmarks (including lip landmarks) throughout the video sequence, as shown in the example in Fig. 1. Shape features corresponding to three separate dimensions (height, width and area of the mouth) are then straightforwardly extracted from the location of these lip landmarks.

Fig. 1 Shape-based feature extraction

Appearance-based features

The mouth detection algorithm described in [15] was used to locate the lip area, as shown in Fig. 2. A discrete cosine transform (DCT) was then applied to the grey-level, size-normalized ROI, and the first 30 DCT coefficients (read in zig-zag order, corresponding to the lowest spatial frequencies) were kept as the visual speech features.
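A possible implementation of this step is sketched below, assuming the ROI has already been cropped, converted to grey levels and size-normalized (the 32 × 32 size is an assumption); the 2-D DCT coefficients are read in zig-zag order and the first 30 are kept.

```python
import numpy as np
from scipy.fft import dctn

def zigzag_indices(h, w):
    """Pixel coordinates ordered by anti-diagonals, i.e. increasing spatial frequency."""
    return sorted(((r, c) for r in range(h) for c in range(w)),
                  key=lambda rc: (rc[0] + rc[1], rc[1] if (rc[0] + rc[1]) % 2 else rc[0]))

def appearance_features(roi, n_coeffs=30):
    """roi: grey-level, size-normalized mouth region, e.g. a (32, 32) array."""
    coeffs = dctn(roi.astype(float), norm='ortho')    # 2-D DCT of the ROI
    order = zigzag_indices(*roi.shape)[:n_coeffs]     # lowest spatial frequencies first
    return np.array([coeffs[r, c] for r, c in order])
```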

Fig. 2 Appearance-based feature extraction

Sample rate

The visual speech sample rate depends on the frame rate of the audiovisual sequence. Whereas current video cameras work at a frame rate of 25 or 29.97 frames/s (depending on the codec), the acoustic speech features presented in Sect. 3.1 are extracted at a sample rate of 100 Hz.

The algorithms presented here require acoustic and visual features with equal sample rates. The chosen solution was therefore to linearly interpolate the visual features so that both acoustic and visual features have a sample rate of 100 Hz.
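A minimal sketch of this interpolation step, assuming a 25 frames/s input and a feature matrix with one row per video frame:

```python
import numpy as np

def upsample_visual(feats, fps_in=25.0, rate_out=100.0):
    """Linearly interpolate (T, d) visual features from fps_in to rate_out Hz."""
    t_in = np.arange(feats.shape[0]) / fps_in                # original time stamps (s)
    t_out = np.arange(0.0, t_in[-1] + 1e-9, 1.0 / rate_out)  # 10 ms output grid
    # interpolate each feature dimension independently
    return np.stack([np.interp(t_out, t_in, feats[:, d])
                     for d in range(feats.shape[1])], axis=1)
```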

Visual dynamic features

As with acoustic features, first- and second-order derivatives are also appended to static visual features. In the end, nine-dimensional shape-based features and 90-dimensional appearance-based features are available every 10 ms.

4 Co-inertia approach: CoIA

4.1 Theoretical aspects

CoIA was first introduced by Dolédec and Chessel [16] in the field of biology to uncover hidden relationships between species and their environment. Since we did not find a derivation of co-inertia analysis in the literature, we include the following one:

Given two multivariate random variables \(X = \left( X_1,\ldots,X_n \right)^t \in {{\mathbb{R}}}^n\) and \(Y = \left(Y_1,\ldots,Y_m\right)^t \in {{\mathbb{R}}}^m\) with covariance matrix \(C_{XY} = \hbox{cov}{\left( X,Y \right)} = \hbox{E}{\{XY^t\}} \in {{\mathbb{M}}}_{n \times m},\) where E{·} denotes the expectation operator, CoIA finds \({\bf a} \in {{\mathbb{U}}}^n\) and \({\bf b} \in {{\mathbb{U}}}^m,\) with \({{\mathbb{U}}}^l = \left\{ z \in {{\mathbb{R}}}^l | \| z \| = 1 \right\},\) such that the projections of X and Y on these two vectors have maximum covariance:

$$ \begin{aligned} \left( {{\mathbf{a}}}, {{\mathbf{b}}} \right) &= \mathop {\hbox{argmax} }\limits_{{\left( a, b \right) \in {{\mathbb{U}}}^n \times {{\mathbb{U}}}^m}} \hbox{cov} {\left( a^t X , b^t Y \right)} \\ &=\mathop{\hbox{argmax}}\limits_{{\left( a, b \right) \in {{\mathbb{U}}}^n \times {{\mathbb{U}}}^m}} \hbox{E}{\{(a^tX)(Y^tb)\}} \\ &=\mathop{\hbox{argmax}}\limits_{{\left( a, b \right) \in {{\mathbb{U}}}^n \times {{\mathbb{U}}}^m}} a^tC_{XY}b. \\ \end{aligned} $$
(1)

Proposition 1 (CoIA)

a is the eigenvector corresponding to the highest eigenvalue λ of matrix \(C_{XY} C_{XY}^t\) and b is proportional to \(C_{XY}^t {\bf a}\).

Proof of Proposition 1

Let us denote

$$ \rho = a^t C_{XY} b $$
(2)

In the process of maximizing ρ, one can assume that ρ > 0 (change a into −a if ρ < 0): maximizing ρ is therefore equivalent to maximizing \(\rho^2\).

$$ \rho^2 = \left( a^t C_{XY} b \right)^t \left( a^t C_{XY} b \right) $$
(3)
$$ \rho^2 = \left[ \left( C_{XY}^t a \right)^t b \right]^t \left[ \left( C_{XY}^t a \right)^t b \right]. $$
(4)

According to the Cauchy–Schwarz inequality, \(\rho^2 \leq \| C_{XY}^t a \|^2 \cdot \| b \|^2,\) with equality if and only if b can be written as \(\mu C_{XY}^t a,\) with \(\mu \in {{\mathbb{R}}}.\) Therefore, Eq. 2 becomes:

$$\rho = a^t C_{XY} \left( \mu C_{XY}^t a \right)$$
(5)
$$\rho = \mu a^t \left( C_{XY} C_{XY}^t \right) a. $$
(6)

Since \(\| a \|=1,\) ρ is proportional to the Rayleigh quotient \(R \left( C_{XY} C_{XY}^t, a \right) = (a^tC_{XY}C_{XY}^ta)/(a^ta),\) which is maximized when a is the eigenvector of \(C_{XY} C_{XY}^t\) associated with the largest eigenvalue \(\lambda_1\).\({\hfill \square} \)

Sorting the eigenvalues of \(C_{XY} C_{XY}^t\) in decreasing order \(\{\lambda_1, \ldots,\lambda_d\},\) CoIA recursively finds the orthogonal vectors \(\left\{ {\bf a}_{\bf 1}, \ldots, {\bf a}_{{\bf d}} \right\}\) and \(\left\{{\bf b}_{{\bf 1}}, \ldots, {\bf b}_{{\bf d}} \right\}\) which maximize the covariance between the projections \({\bf a}_{{\bf k}}^tX\) and \({\bf b}_{{\bf k}}^tY\) (d being the rank of \(C_{XY}\)). In other words, CoIA rotates X and Y into a new coordinate system that maximizes their covariance.

In the following, A and B will denote n × d and m × d matrices containing the directions of the new coordinate systems:

$$ {{\mathbf{A}}} = \left[ \begin{array}{*{20}l} {{\mathbf{a}}}_{{\mathbf{1}}} | \ldots | {{\mathbf{a}}}_{{\mathbf{d}}} \end{array}\right] \quad \hbox{and} \quad {{\mathbf{B}}} = \left[ \begin{array}{*{20}l} {{\mathbf{b}}}_{{\mathbf{1}}} | \ldots | {{\mathbf{b}}}_{{\mathbf{d}}} \end{array} \right]. $$
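In practice, Proposition 1 amounts to a singular value decomposition of the cross-covariance matrix: the columns of A are its left singular vectors and those of B its right singular vectors. A minimal numpy sketch (not the authors' code) follows; the centering of the features is an assumption.

```python
import numpy as np

def coia(X, Y):
    """X: (T, n) acoustic features, Y: (T, m) visual features, frame-synchronous."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C_xy = Xc.T @ Yc / Xc.shape[0]                       # cross-covariance matrix (n x m)
    U, s, Vt = np.linalg.svd(C_xy, full_matrices=False)
    # a_k = U[:, k] (eigenvector of C_xy C_xy^t) and b_k = V[:, k], proportional to C_xy^t a_k
    return U, Vt.T, s                                    # A, B, covariances of the projections
```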

4.2 Application of CoIA

4.2.1 Extracting correlated acoustic and visual speech features

Given synchronized acoustic and visual features \(X \in {{\mathbb{R}}}^n\) and \(Y \in {{\mathbb{R}}}^m,\) CoIA can be used to compute matrices A and B, which, in turn, can be used to extract correlated acoustic and visual features \({{\mathcal{X}}} = {\bf A}^tX\) and \({{\mathcal{Y}}} = {\bf B}^tY\) of dimension d as follows:

$$\begin{aligned} \forall k \in \{ 1, \ldots, d \},\quad {{\mathcal{X}}}_{k} &= {{\mathbf{a}}}_{{\mathbf{k}}}^t X = \sum_{i=1}^n {{\mathbf{a}}}_{{{\mathbf{k}}}_i} X_{i}\\ {{\mathcal{Y}}}_{k} &= {{\mathbf{b}}}_{{\mathbf{k}}}^t Y = \sum_{i=1}^m {{\mathbf{b}}}_{{{\mathbf{k}}}_i} Y_i. \\ \end{aligned} $$
(7)

The effect of CoIA on real data is shown in Fig. 3, which contains features extracted from the audiovisual sequence "1002_f_g1_s02_1002_en.avi" from the BANCA database [17].

Fig. 3 Original acoustic and visual features (top left: \(X_1\), \(X_2\) and \(X_3\); bottom left: \(Y_1\), \(Y_2\) and \(Y_3\)) and first correlated acoustic and visual features (top right: \({\mathcal{X}}_1\); bottom right: \({\mathcal{Y}}_1\)). The correlation between X and Y is much more evident when looking at \({\mathcal{X}}_1\) and \({\mathcal{Y}}_1\)

Remark

CoIA can be used to reduce the dimension of the acoustic and visual features without losing those components that carry the most information regarding correlation. This is particularly important when working with CHMMs such as those described in Sect. 5. The curse of dimensionality is a major issue for these models because the small size of the BANCA database does not permit accurate training with high-dimensional features. The only requirement in our case is that all the acoustic and visual features X and Y be transformed using the same matrices \({\bf A}^{\Omega}\) and \({\bf B}^{\Omega}\).

Synchronized acoustic and visual features \(X^{\Omega}\) and \(Y^{\Omega}\) can be extracted from a training set Ω (the BANCA world model part wm, in our case). CoIA transformation matrices \({\bf A}^{\Omega} = \left[ \begin{array}{*{20}l} {\bf a}_{{\bf 1}}^{\Omega} | \ldots | {\bf a}_{{\bf d}}^{\Omega} \end{array} \right]\) and \({\bf B}^{\Omega} = \left[ \begin{array}{*{20}l} {\bf b}_{{\bf 1}}^{\Omega} | \ldots | {\bf b}_{{\bf d}}^{\Omega} \end{array}\right]\) are then obtained by applying CoIA to Ω, and the transformed acoustic and visual features \({{\mathcal{X}}}^{\Omega}\) and \({{\mathcal{Y}}}^{\Omega}\) are computed using Eq. 8:

$$ \begin{aligned} \forall k \in \{ 1, \ldots, d \}, \quad {{\mathcal{X}}}^{\Omega}_{k}& = {{\mathbf{a}}}_{{{\mathbf{k}}}^{\Omega}}^t X = \sum_{i=1}^n {{\mathbf{a}}}^{\Omega}_{{{\mathbf{k}}}_i} X_{i}\\ {{\mathcal{Y}}}^{\Omega}_{k} &= {{\mathbf{b}}}_{{{\mathbf{k}}}^{\Omega}}^t Y = \sum_{i=1}^m {{\mathbf{b}}}^{\Omega}_{{{\mathbf{k}}}_i} Y_i. \\ \end{aligned} $$
(8)

The dimensions of the transformed acoustic and visual features \({\mathcal {X}}^{\Omega}\) and \({{\mathcal{Y}}}^{\Omega}\) can then be conveniently reduced by keeping only the D most informative ones with respect to correlation.

4.2.2 Measuring audiovisual speech synchrony

In this section, we introduce a method involving the use of correlated acoustic and visual features to measure how well voice X and lips Y correspond to each other. We distinguish between three different methods (world-, self- or piecewise self-training), though they all share a common framework:

  1.

    The transformation matrices \({\bf A}^{\Omega}\) and \({\bf B}^{\Omega}\) are derived by means of CoIA from a training set Ω composed of acoustic and visual features \(X^{\Omega}\) and \(Y^{\Omega}\).

  2.

    Acoustic and visual features \(X^{\Gamma}\) and \(Y^{\Gamma}\) from a test utterance Γ are then transformed into \({{\mathcal{X}}}^{\Omega}\) and \({{\mathcal{Y}}}^{\Omega}\) using the previously computed matrices \({\bf A}^{\Omega}\) and \({\bf B}^{\Omega}\):

    $$ \begin{aligned} {\mathcal{X}}^{\Omega} &= {\mathbf{A}}^{{\Omega}^t} X^{\Gamma}\\ {\mathcal{Y}}^{\Omega} & = {\mathbf{B}}^{{\Omega}^t} Y^{\Gamma}. \\ \end{aligned} $$
    (9)
  3.

    Direct correlation is computed between each dimension of \({{\mathcal{X}}}^{\Omega}\) and \({{\mathcal{Y}}}^{\Omega}\) and used as a measure \(s\left(X^{\Gamma}, Y^{\Gamma} \right)\) of synchronization between \(X^{\Gamma}\) and \(Y^{\Gamma}\), whereby the higher the correlation, the greater the synchronization:

    $$ \begin{aligned} s\left(X^{\Gamma}, Y^{\Gamma} \right) &=\frac{1}{D} \sum_{k=1}^{D} \frac{{{\mathcal{X}}^{\Omega}_{k}}^t {\mathcal{Y}}^{\Omega}_{k}}{\sqrt{{{\mathcal{X}}^{\Omega}_{k}}^t {\mathcal{X}}^{\Omega}_{k}} \sqrt{{{\mathcal{Y}}^{\Omega}_{k}}^t {\mathcal{Y}}^{\Omega}_{k}}} \\ & =\frac{1}{D} \sum_{{\mathbf{k}}=1}^{D} \frac{\left({\mathbf{a}}_{\mathbf{k}}^{{\Omega}^t} X^{\Gamma} \right)^t \left({\mathbf{b}}_{\mathbf{k}}^{{\Omega}^t} Y^{\Gamma} \right)}{\sqrt{\left({\mathbf{a}}_{\mathbf{k}}^{{\Omega}^t} X^{\Gamma} \right)^t \left({\mathbf{a}}_{\mathbf{k}}^{{\Omega}^t} X^{\Gamma} \right)} \sqrt{\left( {\mathbf{b}}_{\mathbf{k}}^{{\Omega}^t} Y^{\Gamma} \right)^t \left({\mathbf{b}}_{\mathbf{k}}^{{\Omega}^t} Y^{\Gamma} \right)}}. \end{aligned} $$
    (10)

    The three methods mostly differ in how the training set Ω and the test set Γ are built; the scoring step they share is sketched in code after this list.
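The scoring step shared by the three methods (Eq. 10) can be sketched as follows, assuming A and B were obtained with the hypothetical coia() routine above and that D projected dimensions are kept:

```python
import numpy as np

def synchrony_score(X_test, Y_test, A, B, D=3):
    """Mean normalized correlation of the first D pairs of CoIA-projected components (Eq. 10)."""
    Px = X_test @ A[:, :D]                         # projected acoustic features
    Py = Y_test @ B[:, :D]                         # projected visual features
    num = np.einsum('tk,tk->k', Px, Py)            # per-dimension inner products
    den = np.linalg.norm(Px, axis=0) * np.linalg.norm(Py, axis=0)
    return float(np.mean(num / den))
```

For the world-training method, A and B would come from the training set Ω; for self-training, from the test utterance itself.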

World training method

As proposed in the previous paragraph, one can use a large set of synchronized audiovisual sequences (the world model part wm of BANCA, in our case) to obtain \(X^{\Omega}\) and \(Y^{\Omega}\). CoIA can then be used to compute matrices \({\bf A}^{\Omega}\) and \({\bf B}^{\Omega}\), modeling the average best correspondence between voice and lips. For a given test utterance Γ, all the features in Γ are transformed using Eq. 9 to obtain \({{\mathcal{X}}}^{\Omega}\) and \({{\mathcal{Y}}}^{\Omega}\). A synchronization score \(s(X^{\Gamma}, Y^{\Gamma})\) for test utterance Γ is then obtained using Eq. 10.

Self training method

This method differs from the above in that a different training set is used to obtain matrices A Ω and B Ω. Using a given test utterance Γ, CoIA is directly performed on data X Γ and Y Γ. In other words, the training and the test sets are the same: Γ = Ω.

Piecewise self training method

Bearing in mind that the purpose of this synchronization measure is to discriminate between synchronized and non-synchronized audiovisual sequences, this third method is slightly different from the previous one. The intuition is the following (where a sub-sequence Λ is a sequence extracted from the original utterance sequence Γ by keeping only some of the samples, that is, \(\Lambda \subset \Gamma\)):

  • if sequence Γ is synchronized, then every sub-sequence should follow the same synchronization model: a model \(\left({\bf A}^{\Omega},{\bf B}^{\Omega}\right)\) which is optimal with respect to a sub-sequence \(\Omega \subset \Gamma\) would also be optimal with respect to any other sub-sequence \(\Theta \subset \Gamma;\)

  • if the sequence is not synchronized, then a model (A Ω,B Ω) which is optimal with respect to a sub-sequence Ω would not make sense for another sub-sequence \(\Theta \subset \Gamma\) with \(\Omega \cap \Theta = \varnothing.\)

Let us introduce some notations:

  • N is the number of samples in the sequence \(\Gamma: X^{\Gamma} = \{ x^1, \ldots, x^N \}\) and \(Y^{\Gamma} = \{ y^1, \ldots, y^N \}.\)

  • \({\mathfrak{P}}_{\Gamma}\) is the collection of all subsets of Γ of cardinality \(\lfloor N/2 \rfloor.\)

CoIA is applied to each training subsequence \(\Omega \in {\mathfrak{P}}_{\Gamma}\) to produce transformation matrices A Ω and B Ω. The remaining features in the sequence (Θ = Γ−Ω) are then transformed using the transformation matrices: \({{\mathcal{X}}}^{\Omega} = {\bf A}^{{\Omega}^t} X^{\Theta}\) and \({{\mathcal{Y}}}^{\Omega} = {\bf B}^{{\Omega}^t} Y^{\Theta}.\) The synchronization measure s(X Θ, Y Θ) is computed as in Eq. 10, for every subsequence \(\Omega \in {\mathfrak{P}}_{\Gamma}.\) The final synchronization measure for sequence Γ is obtained via Eq. 11:

$$ s(X^{\Gamma}, Y^{\Gamma}) =\frac{1} {\hbox{card} {{\mathfrak{P}}_{\Gamma}}} \sum_{\Omega \in {\mathfrak{P}}_{\Gamma}} s\left(X^{\Theta}, Y^{\Theta}\right). $$
(11)

In practice, because it is not computationally feasible to use every \(\Omega \in {\mathfrak{P}}_{\Gamma},\) only a few (50, in our case) are drawn randomly to compute the final synchronization measure.
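A sketch of the piecewise self-training score, reusing the hypothetical coia() and synchrony_score() routines above; the number of random draws (50) follows the text, while the seed is an assumption:

```python
import numpy as np

def piecewise_self_training_score(X, Y, n_draws=50, D=3, seed=0):
    """Average of Eq. 10 scores over random half-size training/test splits (Eq. 11)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    scores = []
    for _ in range(n_draws):
        omega = rng.choice(N, size=N // 2, replace=False)   # training subset Omega
        theta = np.setdiff1d(np.arange(N), omega)           # complementary subset Theta
        A, B, _ = coia(X[omega], Y[omega])
        scores.append(synchrony_score(X[theta], Y[theta], A, B, D=D))
    return float(np.mean(scores))
```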

5 Dynamic approach: CHMMs

5.1 Theoretical aspects

A CHMM can be seen as a collection of HMMs in which the state at time t of every HMM in the collection is conditioned on the states of all the HMMs in the collection at time t−1. This is illustrated in Fig. 4. The fact that the next state of every HMM depends on the states of all the HMMs is useful for capturing interactions between the acoustic and visual streams.

Fig. 4 In a CHMM, the next state of each HMM depends on the states of all the HMMs in the collection

A CHMM can be completely described by the parameters \({\varvec{\lambda}}=\{\lambda^i\} = \left\{\{\pi^i_{s_i}\},\{a^{i,s_i}_{{\bf r}}\},\{b^i_{s_i}(\cdot)\}\right\},\) for every stream \(i \in \{1,\ldots,N_h\},\) where \(N_h\) is the number of streams; \(s_i \in \{1,\ldots,NS_i\},\) where \(NS_i\) is the number of states in stream i; \(\pi^i_{s_i}\) is the initial probability of state \(s_i\) for stream i; \(a^{i,s_i}_{{\bf r}}\) is the state transition probability for stream i and state \(s_i\) given the composite state \({\bf r}=\{r_1,\ldots,r_{N_h}\};\) and \(b^i_{s_i}(\cdot)\) is the output probability density function for stream i and state \(s_i\). The transition probabilities for stream i are defined as:

$$ a^{i,s_i}_{{\mathbf{r}}} = \hbox{P}(q^i_t=s_i | q^{1}_{t-1}=r_{1},\ldots, q^{N_h}_{t-1}=r_{N_h}).$$
(12)

The output probability density function for every state \(s_i\) and stream i is modelled as a Gaussian mixture model (GMM) with \(M^i_{s_i}\) mixture components. Let \(o^i_t\) be the observation of stream i at time t (in this case, \({\bf o^1} = {{\mathcal{X}}}^{\Omega}\) and \({\bf o^2} = {{\mathcal{Y}}}^{\Omega}\)). The output probability density function can be written as:

$$ b^i_{s_i}(o^i_t) = \hbox{p}(o^i_t|q^i_t=s_i) = \sum^{M^i_{s_i}}_{m=1}{w^i_{s_i,m} {{\mathcal{N}}}(o^i_t;\mu^i_{s_i,m}, \sigma^i_{s_i,m})}.$$
(13)

The initial states for the training sequences are obtained using the five internal states of an energy-based voice activity detector (VAD) applied to the most correlated acoustic and visual features \({{\mathcal{X}}}_1\) and \({{\mathcal{Y}}}_1,\) as defined in Eq. 7. Figure 6 shows the architecture of the VAD state machine, and Fig. 5 shows the VAD internal state sequence for a given signal. The VAD was chosen because it was believed that the system would be able to distinguish between synchronized and non-synchronized streams by paying attention only to major signal changes (when a word starts or ends, when the signal is in a high-energy interval, etc.). The state transition probabilities \(a^{i,s_i}_{{\bf r}}\) are initially estimated from the state transitions obtained from the VAD sequences of all the training sequences:

$$ a^{i,s_i}_{{\mathbf{r}}}=\frac{n^{i,s_i}_{{\mathbf{r}}}} {n^{i}_{{\mathbf{r}}}} $$
(14)

where \(n^{i,s_i}_{\bf r}\) is the number of transitions to state \(s_i\) of stream i from the composite state \({\bf r}=\{r_1,\ldots,r_{N_h}\},\) and \(n^i_{{\bf r}}\) is the total number of times that the CHMM visits the composite state \({\bf r}=\{r_1,\ldots,r_{N_h}\}\) before the last sample, summed over all training sequences. The initial state probabilities \(\pi^i_{s_i}\) can be estimated as \(\pi^i_{s_i} = n^i_{s_i}/ns,\) where \(n^i_{s_i}\) is the number of training sequences in which the first state of stream i is state \(s_i\), and ns is the total number of training sequences (Fig. 6).
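The counting step of Eq. 14 for a two-stream CHMM can be sketched as follows, assuming the VAD-derived state sequences of both streams are already available as (T, 2) integer arrays:

```python
import numpy as np

def init_transitions(state_seqs, n_states=(5, 5)):
    """Estimate a^{i,s_i}_r from joint state sequences of the two streams (Eq. 14)."""
    n1, n2 = n_states
    counts = [np.zeros((n1, n2, n1)), np.zeros((n1, n2, n2))]  # counts[i][r1, r2, s_i]
    for q in state_seqs:
        for t in range(1, q.shape[0]):
            r1, r2 = q[t - 1]                                  # composite state at t-1
            for i in range(2):
                counts[i][r1, r2, q[t, i]] += 1                # transition to s_i at t
    # normalise over s_i: a^{i,s_i}_r = n^{i,s_i}_r / n^i_r
    return [c / np.maximum(c.sum(axis=-1, keepdims=True), 1) for c in counts]
```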

Fig. 5 Energy signal, VAD internal state sequence used to estimate the initial states for the training sequences, and normal VAD output (voice/non-voice)

Fig. 6 VAD state machine. The only input variable is represented by E. Configurable timers T1 and T2 and thresholds th1, th2 and th3 can be tuned to modify the VAD behaviour

It should be noted that the stream states are calculated independently for both streams in the training process. We can expect the output distribution for each stream and state to be the same as if an HMM were trained on each stream separately, because no relation with the other stream's state is used for the initial state estimation. Dynamic relationships between the streams are then learnt from the combined state sequence of both streams.

The Baum–Welch algorithm adapted to the CHMM framework is iterated 20 times to train the CHMM. The Viterbi algorithm is used to calculate the sequence of states for every stream and the frame log-likelihoods. This framework has been derived in previous studies such as [2].

5.2 Bayesian framework to detect audiovisual asynchrony

In order to detect asynchrony between the acoustic and visual streams X and Y, a hypothesis test can be performed with the following hypothesis:

  • \({{\mathcal{H}}}_{\bf 0}:\) Because the streams are produced synchronously, the state sequences are dependent on each other. This hypothesis is represented by the CHMM \({\varvec{\lambda}}.\)

  • \({{\mathcal{H}}}_{\bf 1}:\) Because the streams are produced by independent sources, the state sequences are independent of each other. This hypothesis is represented by the two-stream HMM \({\varvec{\lambda}}^{\prime},\) as described in [3].

The test we performed in our study is a slight modification of the classical Bayesian test:

$${{\mathcal{H}}}_{\bf 0} \, \hbox{is accepted} \, \Longleftrightarrow \frac{\hbox{p}({{\mathcal{X}}},{{\mathcal{Y}}}, Q| {\varvec{\lambda}})} {\hbox{p}({{\mathcal{X}}}, {{\mathcal{Y}}},Q^{\prime}| {\varvec{\lambda}}^{\prime})} > \theta,$$
(15)

where Q and Q′ are the most likely state sequences. These likelihoods are provided by the Viterbi algorithm. This test approximates the classical Bayesian test when one state sequence is much more likely than the others. If the two-stream HMM \({\varvec{\lambda}}^{\prime}=\{\{{\pi^{\prime}}^i_{s_i}\}, \{{a^{\prime}}^{i,s_i}_{r_i}\}, \{{b^{\prime}}^i_{s_i}(\cdot)\}\}\) was an independently trained model, then the slightest mismatch in the learned output distributions would thwart the effectiveness of the hypothesis test. In addition, dynamic relationships between the streams are encoded in the combined state sequences Q and Q′. The two-stream HMM \({\varvec{\lambda}}^{\prime}\) used in this hypothesis test, therefore, is an uncoupled version of the CHMM \({\varvec{\lambda}},\) where the parameters for both the output distributions and the initial state probabilities are shared, and the state transition probabilities are computed from the CHMM \({\varvec{\lambda}}\) parameters:

$$ \left. \begin{array}{*{20}l} {\pi^{\prime}}^i_{s_i} = \pi^i_{s_i} \\ b^{\prime i}_{s_i}(\cdot) = b^i_{s_i}(\cdot) \\ \end{array} \right\} \quad \forall \; i \in \{1,\ldots,N_h\} , s_i \in \{1,\ldots,NS_i\}. $$
(16)

This enhances asynchrony discrimination because random effects introduced by the training of the output probability density functions are removed and only differences in the decoded state sequences are taken into account: if \({{\mathcal{H}}}_1\) holds, then rare joint state transitions in Q are likely to make the ratio in Eq. 15 fall below θ. The state transition matrix of \({\varvec{\lambda}}^{\prime}\) is defined in such a way that the next state \(s_i\) of every HMM i depends only on its previous state \(r_i\). It is known that:

$$ \begin{aligned} a_{r_{i}}^{\prime i,s_i} &= \hbox{P}(q^i_t=s_i|q^i_{t-1}=r_i)\\ &= \sum_{{{\mathbf{q}}}_{{\mathbf{t-1}}}|q^i_{t-1}=r_i} \hbox{P}(q^{i}_{t}=s_i| {{{\mathbf{q}}}_{{\mathbf{t-1}}}}={{\mathbf{r}}}) \prod_{j=1,j\neq i}^{N_h}\hbox{P}(q^j_{t-1}=r_j)\\ & = \sum_{{r_{1}=1}}^{NS_{1}}\ldots {\sum_{r_{i-{1}}=1}^{NS_{i-1}}}{\sum_{r_{i+1}= 1}^{NS_{i+1}}} \ldots {\sum_{r_{N_h}=1}^{NS_{N_{h}}}}{a_{{\bf r}}^{i,{s_{i}}}} \prod_{j=1,j\neq i}^{N_h} \hbox{P}(q^j_{t-1}=r_j).\\ \end{aligned} $$
(17)

The probability \(\hbox{P}(q^i_t = r_i)\) can be calculated. It depends on time, however, and it is not desirable to work with time-dependent state transition probabilities. Therefore, since \(\hbox{P}(q^i_t=r_i)\) converges quickly to its limit \(\lim_{t\rightarrow\infty} \hbox{P}(q^i_t=r_i)\) for ergodic HMMs, this stationary value is computed with the following iterative procedure (sketched in code after the list):

  1.

    Initialization: for t = 1,

    $$\hbox{P}(q^i_1=s_i) = \pi^i_{s_i}. $$
    (18)
  2.

    Induction:

    $$ \hbox{P}(q^i_t = s_i) = \sum_{{\mathbf{r}}} a^{i,s_i}_{{\mathbf{r}}} \prod_{j = 1}^{N_h} \hbox{P}(q^j_{t-1} = r_j) $$
    (19)
  3.

    Stop condition:

    $$ \left| \frac{ {\hbox{P}}(q^i_t=s_i) - {\hbox{P}}(q^i_{t-1}=s_i)} {{\hbox{P}}(q^i_t=s_i)} \right| < 10^{-6}. $$
    (20)
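A sketch of this uncoupling procedure for the two-stream case, with coupled transition tensors a1[r1, r2, s1] and a2[r1, r2, s2] (e.g. from the hypothetical init_transitions() sketch above) and initial state probabilities pi1, pi2 assumed available:

```python
import numpy as np

def uncouple(a1, a2, pi1, pi2, tol=1e-6, max_iter=1000):
    """Stationary state probabilities by iterating Eq. 19 until Eq. 20, then Eq. 17."""
    p1, p2 = pi1.copy(), pi2.copy()                      # Eq. 18: start from the initial law
    for _ in range(max_iter):
        joint = np.outer(p1, p2)                         # P(q^1 = r1) P(q^2 = r2)
        new_p1 = np.einsum('ijs,ij->s', a1, joint)       # Eq. 19, stream 1
        new_p2 = np.einsum('ijs,ij->s', a2, joint)       # Eq. 19, stream 2
        done = (np.max(np.abs(new_p1 - p1) / np.maximum(new_p1, 1e-12)) < tol and
                np.max(np.abs(new_p2 - p2) / np.maximum(new_p2, 1e-12)) < tol)
        p1, p2 = new_p1, new_p2
        if done:                                         # Eq. 20: relative change below 1e-6
            break
    # Eq. 17: marginalise the other stream's previous state with its stationary law
    a1_unc = np.einsum('ijs,j->is', a1, p2)              # uncoupled matrix for stream 1
    a2_unc = np.einsum('ijs,i->is', a2, p1)              # uncoupled matrix for stream 2
    return a1_unc, a2_unc, p1, p2
```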

An example of the uncoupled transition matrices obtained by this uncoupling procedure is illustrated in Fig. 7. It should be noted that the original CHMM from which the uncoupled transition matrices are obtained has 250 different \(\{a^{i,s_i}_{\bf r}\}\) parameters.

Fig. 7 Uncoupled HMMs obtained with the uncoupling procedure described in Sect. 5.2

6 Bayesian fusion using GMM as a probability density function estimator

CoIA and CHMMs are different approaches to asynchrony detection. While CoIA uses linear correlation as a measure of synchrony between acoustic and visual features, the CHMM uses dynamic statistics to determine whether acoustic and visual features are synchronous. Because they use complementary information, fusing them could lead to improved performance. Statistical fusion techniques such as GMM fusion [18] can be used for this purpose. In our framework, the joint probability density function f of the CoIA and CHMM scores \(s_1\) and \(s_2\) for both the synchronized \({\mathfrak{S}}\) and non-synchronized \({\mathfrak{N}}\) acoustic and visual features is modeled using two GMMs:

$$ f_{{\mathfrak{S}}}({{\mathbf{s}}}(X,Y)) = \hbox{P}({{\mathbf{s}}}(X,Y) |X \, \hbox{and}\, Y \, \hbox{are synchronized})$$
(21)
$$ f_{{\mathfrak{N}}}({{\mathbf{s}}}(X,Y)) = \hbox{P}({{\mathbf{s}}}(X,Y) |X \, \hbox{and} \, Y \hbox{are not synchronized}),$$
(22)

where \({\bf s}(X,Y) = (s_1(X,Y), s_2(X,Y))^t\) and \(f_{\mathfrak{S}}\) and \(f_{\mathfrak{N}}\) can both be expressed as follows:

$$ f_{{\mathfrak{\rho}}}({{\mathbf{s}}}) = \sum_{i=1}^{N}{w_i^{{\mathfrak{\rho}}} \frac{1}{\sqrt{(2\pi)^d \left\|\Gamma_i^{{\mathfrak{\rho}}}\right\|}} \exp{\left(-\frac{1}{2}({{\mathbf{ s}}}-\mu_i^{{\mathfrak{\rho}}})^T {\Gamma_i^{{\mathfrak{\rho}}}}^{-1} ({{\mathbf{s}}}-\mu_i^{{\mathfrak{\rho}}})\right)}} $$
(23)

\(f_{\mathfrak{S}}\) and \(f_{\mathfrak{N}}\) are initialized using the LBG algorithm and trained using the EM algorithm. To discriminate between synchronized and non-synchronized acoustic and visual streams, the following hypothesis test is performed:

$$ X \, \hbox{and}\, Y \, \hbox{are synchronized} \, \Longleftrightarrow \frac{f_{\mathfrak{S}}({\bf s} (X,Y))}{f_{\mathfrak{N}}({\bf s}(X,Y))} > \theta.$$
(24)

We used the above method in our fusion experiments described below. Additional results using the sum rule are reported as a baseline for fusion [19].
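A minimal sketch of the fusion stage with scikit-learn; GaussianMixture's k-means/EM fitting stands in for the LBG/EM training described above, and the number of components is an assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_fusion(scores_sync, scores_async, n_components=4):
    """scores_*: (n_sequences, 2) arrays of (CoIA, CHMM) scores, used to fit Eqs. 21-22."""
    f_sync = GaussianMixture(n_components=n_components).fit(scores_sync)
    f_async = GaussianMixture(n_components=n_components).fit(scores_async)
    return f_sync, f_async

def is_synchronized(score_vec, f_sync, f_async, log_theta=0.0):
    """Log-likelihood ratio test of Eq. 24."""
    s = np.asarray(score_vec).reshape(1, -1)
    llr = f_sync.score_samples(s)[0] - f_async.score_samples(s)[0]
    return llr > log_theta
```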

7 Other methods for asynchrony detection

Several asynchrony detection techniques have already been studied in the literature, as indicated in the introduction. We performed the same experiments with the CHMM and CoIA approaches (including Eveno and Besacier's approach), CANCOR and MI in order to compare their performance. Although descriptions of these approaches can be found in the literature [8–12], some implementation issues must be discussed in order to facilitate understanding of the results presented later in this paper.

Eveno’s measure

In a similar liveness test framework [8], Eveno and Besacier apply CANCOR analysis and CoIA to the tested sequence in order to obtain the first projection vectors \({\bf a}_1\) and \({\bf b}_1\). The design of their synchrony measure M(X,Y) (summarized by Eq. 27) results from observing the value of the correlation ρ between \({\bf a}_1^t X\) and \({\bf b}_1^t Y\) as a function of the shift δ between audio and visual features: its maximum value \(\rho_{\rm ref}\) is often obtained for a small negative shift:

$$ \rho_{\rm ref} = \mathop{\max}\limits_{-80\,\hbox{ms} \leq \delta \leq 0} \left[ \hbox{corr} \left({ {{\mathbf{a}}}_1}^t X^{\delta}, { {{\mathbf{b}}}_1}^t Y \right) \right] $$
(25)
$$\rho_{\rm avg} = \hbox{mean} \left[ \hbox{corr} \left( { {{\mathbf{a}}}_1}^t X^{\delta}, { {{\mathbf{b}}}_1}^t Y \right) \right]$$
(26)

where X δ is the δ-shifted X.

$$ M(X,Y) = \frac{1}{2\Delta + 1}\left( \frac{\rho_{\rm ref}}{\rho_{\rm avg}} - 1 \right)\sum_{\delta = -\Delta}^{\Delta} f\left(\hbox{corr}\left({\bf a}_1^t X^{\delta}, {\bf b}_1^t Y\right)\right) $$
(27)

where f(ρ) = 1 if ρ ≤ \(\rho_{\rm ref}\) and 0 otherwise, and Δ corresponds to a time-shift of 400 ms (10 visual frames). M(X, Y) can be seen as a measure of the peakiness of the maximum found in the interval [−80, 0] ms. Our implementation of Eveno's algorithm uses slightly different acoustic features from those described in [8] (MFCC instead of LPC). The major difference between our self-training method and Eveno's approach is that we consider more than just the first dimensions \({{\mathcal{X}}}_1\) and \({{\mathcal{Y}}}_1.\) Moreover, the world-training method is also quite different in that it makes use of a prior training step in which universal transformation matrices \({\bf A}^{\Omega}\) and \({\bf B}^{\Omega}\) are learned.
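A sketch of this measure on the first projected components \({\bf a}_1^t X\) and \({\bf b}_1^t Y\) at 100 Hz (so 80 ms corresponds to 8 samples and Δ = 400 ms to 40 samples); the handling of the overlap when shifting is an implementation assumption:

```python
import numpy as np

def shifted_corr(x, y, delta):
    """Correlation between x shifted by delta samples and y (overlapping part only)."""
    if delta < 0:
        xs, ys = x[-delta:], y[:delta]
    elif delta > 0:
        xs, ys = x[:-delta], y[delta:]
    else:
        xs, ys = x, y
    return np.corrcoef(xs, ys)[0, 1]

def eveno_measure(x, y, max_shift=40, ref_window=8):
    corrs = {d: shifted_corr(x, y, d) for d in range(-max_shift, max_shift + 1)}
    rho_ref = max(corrs[d] for d in range(-ref_window, 1))          # Eq. 25
    rho_avg = np.mean(list(corrs.values()))                         # Eq. 26
    peaky = sum(1 for rho in corrs.values() if rho <= rho_ref)      # f(.) in Eq. 27
    return (rho_ref / rho_avg - 1.0) * peaky / (2 * max_shift + 1)  # Eq. 27
```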

CANCOR

CANCOR analysis is applied to synchrony detection [10] in the same manner as CoIA is. All the synchrony detection techniques described for CoIA can be directly tested using the CANCOR approach. The same training sets are used for the estimation of CoIA and CANCOR transformation matrices.

MI

Mutual information between visual and acoustic parameters can be defined in several ways depending on the probability density estimator used to model the joint and separate feature vectors. In our case, we use GMMs as density estimators for the visual, acoustic and joint visual–acoustic features in the CoIA-transformed space; the MI measure is therefore defined as:

$$ MI(X,Y) = \sum_{t=1}^{N} f_{AV}({{\mathcal{X}}}_t,{{\mathcal{Y}}}_t) \log \left(\frac{ f_{AV}({{\mathcal{X}}}_t,{{\mathcal{Y}}}_t) } {f_A({{\mathcal{X}}}_t)f_V({{\mathcal{Y}}}_t)}\right) , $$
(28)

where \(f_{AV}\), \(f_A\) and \(f_V\) are the GMM probability density functions, as defined in Eq. 23, for the joint audio–visual features, the audio features and the visual features, respectively.
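A sketch of the MI score of Eq. 28, assuming the three GMMs have been trained beforehand (for example with scikit-learn's GaussianMixture) on CoIA-transformed world-model features:

```python
import numpy as np

def mi_score(Xc, Yc, gmm_av, gmm_a, gmm_v):
    """Xc, Yc: (T, D) CoIA-transformed acoustic and visual features."""
    joint = np.hstack([Xc, Yc])
    log_f_av = gmm_av.score_samples(joint)   # log f_AV(x_t, y_t)
    log_f_a = gmm_a.score_samples(Xc)        # log f_A(x_t)
    log_f_v = gmm_v.score_samples(Yc)        # log f_V(y_t)
    # Eq. 28: density-weighted log ratio, summed over the test sequence
    return float(np.sum(np.exp(log_f_av) * (log_f_av - log_f_a - log_f_v)))
```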

8 Experiments

8.1 Experimental framework

BANCA database

We conducted our experiments using the English part of the BANCA database [17], which was originally designed for biometric system evaluation purposes only. Two disjoint groups of 26 people (13 male and 13 female) recorded 24 sequences of approximately 15 s, in which they each pronounced a sequence of 10 digits and either their name and address (client access) or the name and address of another person (impostor access). The recordings were performed under three different conditions (controlled, degraded and adverse) as shown in Fig. 8. Sixty additional sequences from 30 different people were also recorded (under controlled, degraded and adverse conditions) and used to create the world model.

Fig. 8 Three different recording conditions: left controlled (DV camera), middle degraded (webcam), right adverse (background noise)

Evaluation protocols

Because we focus on asynchrony detection in this paper, the experimental protocols described in [17], which were designed for identity verification, are not valid here. For each group, all 312 (26 × 12) original client access sequences are naturally synchronized. Therefore, for each group, 3,432 (26 × 12 × 11) asynchronous recordings were built artificially using the audio and video from two different recordings in which the name and address pronounced were the same, both acoustically and visually. Two asynchrony detection protocols were derived from these two sets of synchronized and non-synchronized audiovisual sequences:

Controlled:

Only recordings from the controlled conditions are used. This protocol can be used to compare the suitability of both shape-based and appearance-based visual speech features for asynchrony detection. Only the controlled part of the world model recordings of BANCA can be used to train models. As a result, for each group, 104 synchronized and 312 non-synchronized sequences were tested using this protocol.

Pooled:

All three conditions (controlled, degraded and adverse) are used. This protocol can be used to estimate the robustness of the CoIA and CHMM asynchrony detection methods. All the world model recordings of BANCA can be used to train models. As a result, for each group, 312 synchronized and 3,432 non-synchronized sequences were tested using this protocol.

Although it is very unlikely that an impostor would own both an audio and a video recording of the client pronouncing two different utterances, these protocols deal with an extremely challenging, if not the most challenging, synchrony detection task and therefore constitute a useful framework in which to compare the performance of the different synchrony measures we propose.

Performance measure and comparison

Given a decision threshold θ, an asynchrony detection system can commit two types of error: it can falsely accept a non-synchronized sequence and classify it as a synchronized sequence (false acceptance) or it can falsely reject a synchronized sequence and classify it as a non-synchronized sequence (false rejection). A low θ value would tend to increase the number of false acceptances (FA) and reciprocally a high θ value would tend to increase the number of false rejections (FR). Consequently we defined the false acceptance rate (FAR) and false rejection rate (FRR) as a function of θ (one objective being to find the best compromise between those two error rates):

$$\hbox{FAR}(\theta) =\frac{\hbox{FA}(\theta)} {\hbox{NI}} \, \hbox{and} \, \, \,\hbox{FRR}(\theta) =\frac{\hbox{FR}(\theta)}{\hbox{NC}} $$
(29)

where NI and NC are the numbers of non-synchronized and synchronized sequences respectively. Detection error tradeoff (DET) curves are usually plotted to compare such detection algorithms [20]. The (FAR(θ), FRR(θ)) point is plotted for every possible θ value and the resulting curve can be used to easily compare two systems: the closer the target curve is to the origin, the better.

Depending on the application, we might want to place more or less importance on false rejection or acceptance errors. The weighted error rate (WER), presented in [21], is therefore introduced:

$$\hbox{WER} \left( r \right) =\frac{1}{1+r} \left( r \cdot \hbox{FAR} + \hbox{FRR} \right).$$
(30)

Two possible applications were mentioned in the introduction. Although synchrony detection can be performed using the same algorithms in both applications, different compromises between FAR and FRR should be assumed, and hence different values should be chosen for the weight r:

r = 10:

This configuration corresponds to a biometric authentication system with strict security requirements, where the most important constraint is to detect spoof attacks. It is therefore ten times more costly to falsely accept a non-synchronized sequence than to reject a synchronized sequence (in that case, a genuine client would have to repeat his/her access attempt).

r = 1:

This configuration might be used in an application where no hard binary decision (synchronized vs. non-synchronized) is needed. It could be used, for example, to select the true speaker from a large group of people on a screen.
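Given synchrony scores for the synchronized and non-synchronized test sequences, FAR(θ), FRR(θ) and WER(r) can be computed as sketched below (the acceptance convention "score ≥ θ" is an assumption):

```python
import numpy as np

def far_frr(scores_sync, scores_async, theta):
    """Eq. 29: fraction of accepted non-synchronized and rejected synchronized sequences."""
    far = np.mean(np.asarray(scores_async) >= theta)
    frr = np.mean(np.asarray(scores_sync) < theta)
    return far, frr

def wer(scores_sync, scores_async, theta, r=1.0):
    """Eq. 30: weighted error rate for a given decision threshold theta."""
    far, frr = far_frr(scores_sync, scores_async, theta)
    return (r * far + frr) / (1.0 + r)
```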

Are the results generalizable and conclusive? Because the BANCA database is divided into two disjoint groups, namely G1 and G2, the WERs for one group (the test set) are calculated using the thresholds that minimize the WERs for the other group (the training set). This prevents the results from being biased by the choice of threshold. 95% confidence intervals are then computed using the method proposed in [22] with the following formula, where α = 1.960 and \(\overline{\hbox{WER}(r)}\) is the estimate obtained on the test set:

$$ \hbox{WER}(r) \in \overline{{\rm WER}(r)} \pm \alpha \cdot \sqrt{\frac{1}{1+r^2}} \cdot \sqrt{\frac{r^2}{NI} \cdot \overline{\hbox{FAR}} (1-\overline{\hbox{FAR}})+ \frac{1}{NC} \cdot \overline{\hbox{FRR}} (1-\overline{\hbox{FRR}})}.$$
(31)

Given the small number of tests performed, it is important to make sure that any difference between the error rates of two methods is statistically significant and therefore conclusive.

8.2 Experimental results

Table 1 shows the asynchrony detection performance of the different methods compared in this paper in terms of WER (1.0) and WER (10). All the experiments were performed using both shape-based (shp) and appearance-based (app) visual features. Algorithms based on CANCOR, CHMM, MI and CoIA were used for asynchrony detection. The Method column indicates the audiovisual synchrony measurement method used in the correlation-based CANCOR and CoIA cases; for the MI and CHMM algorithms, it indicates the design parameters regarding stream dimension and number of Gaussians.

Table 1 WER (1), WER (10) and their 95% confidence intervals (in subscripts) for the different algorithms for the controlled (C) and pooled (P) protocol using appearance (app) and shape-based (shp) visual parameters (VP)

The DET curves for the different algorithms are shown in Figs. 9 and 10 (controlled and pooled protocols, respectively). It must be noted that parameters such as the stream dimension or the number of Gaussians per state (in the case of the CHMM) may slightly alter the performance of these methods for the same dataset. These parameters were chosen empirically to achieve a good compromise in terms of performance. In the case of the CHMM and MI approaches, the dimension of the streams used and the number of Gaussians are shown in Table 1. The CoIA and CANCOR methods use correlated acoustic and visual streams of dimension 3; in other words, D = 3 in Eq. 10 (Fig. 11).

Fig. 9 Controlled protocol DET curves for the best methods shown in Table 1, using appearance-based (left) and shape-based (right) visual features

Fig. 10 Pooled protocol DET curves for the best methods shown in Table 1, using appearance-based and shape-based visual features

Fig. 11 Controlled protocol DET curves for the best CoIA and CHMM methods, sum rule and GMM fusion algorithms

Shape-based versus appearance-based visual features

Performance was much better with appearance-based visual features than with shape-based visual features for all of the methods, without exception. This suggests that shape-based visual features do not exhibit sufficient linear dependencies with the acoustic speech features, nor enough temporal information. Indeed, only the outer lip contours were modeled by the lip tracker: the area, height and width of the mouth and their time derivatives do not provide enough information for synchrony analysis. Appearance-based features, in contrast, implicitly encode not only the shape of the mouth but also additional information, such as whether the mouth is actually open or whether the tongue or teeth are visible.

CANCOR versus CoIA

While WT performed far better than (P)ST for the CANCOR-based synchrony measure, CoIA-based measures did not behave in the same way: (P)ST methods yielded better results in all cases. This observation coincides with the findings of [8]. CANCOR needs much more training data to accurately estimate the transformation matrices A and B: world training (where a lot of training data is available) therefore results in better modeling than self training (where only the sequence itself can be used for training). CoIA is much less dependent on the amount of training data available and is even better at modeling and uncovering the intrinsic synchrony of a given audiovisual sequence.

CHMM robustness against degraded conditions

A quick comparison of the performance of CoIA and CHMM under the controlled and pooled protocols shows that CHMM performs better than CoIA in degraded test conditions. The WER (1.0) of ST appearance-based CoIA increased from 8.25% for the controlled protocol to 11.9% for the pooled protocol (a statistically significant degradation). In comparison, we observed only a small, not statistically significant, degradation in the performance of appearance-based CHMM. The same observation holds for the security-oriented performance measure WER (10). The robustness of the CHMM against low-quality features is further highlighted with the less informative shape-based features, for which it achieves the best performance.

Piecewise self-training

One of the contributions of this paper is the introduction of the piecewise self-training approach. It seems to be particularly effective for applications where more security is needed [defined by the error rate WER (10)], and where conditions are controlled. Indeed, in such circumstances, piecewise self-training always results in a small (yet not statistically significant) improvement over the self-training approach.

MI

This approach seems to be the least successful of all those tested in this paper for asynchrony detection. However, this does not mean that MI should not be used for monologue detection or speaker association. What it means is that the technique may not be appropriate when a global threshold is required, as in the case of a biometric application or a synchrony quality assessment task.

Sum rule and GMM fusion

Performance improved when CoIA and CHMM were fused. The two systems encoded different types of synchrony data, and hence, when fused, resulted in improved performance, even though the two systems were using the same audiovisual features.

9 Conclusion and future work

The results reported in Sect. 8 demonstrate the effectiveness of both CoIA and CHMM as asynchrony detection methods. They were tested in a difficult framework for asynchrony detection, where the video sequence and the voice are taken from the same user uttering the same text.

Asynchrony detection can be a useful anti-spoofing technique for real-life impostor attacks in biometric identity verification systems, among other applications such as speaker location and monologue detection.

The methods we presented can easily be adapted to identity verification systems based on audiovisual speech features. Client-dependent models can be derived, which would provide complementary information to speaker or face verifiers working in a multimodal framework.

Synchrony evaluation could also be used in other fields that are not directly related to biometrics, speaker location or monologue detection. It could be used, for example, to replace tasks that are currently done manually, such as the alignment of video and soundtrack in movie post-production, or the evaluation of the quality of dubbing into a foreign language.

New directions of research in asynchrony detection emerge from this paper. We have shown how fusing CoIA and CHMM scores can lead to improved performance. Appearance- and shape-based systems can also be fused at the feature level. The two systems offer different ways of integrating multiple information sources: while CoIA can be applied to concatenated appearance- and shape-based visual features, a CHMM can work with an acoustic stream, an appearance-based visual feature stream and a shape-based visual feature stream. Structural improvements to the CHMM are also a possibility in future studies, given that CHMM families have already been used successfully for audiovisual speech recognition purposes [23, 24]. A large number of training audiovisual utterances containing different phonetic units is required if acceptable speech recognition accuracy is to be achieved. The uncoupling procedure described in Sect. 5.2 can be applied to such a CHMM audiovisual speech recognizer to obtain an asynchrony detector, and the results would most likely be much more accurate than those described here. Our system suffered from structural limitations due to insufficient training material, which resulted in poor audiovisual speech unit modeling, mostly based on the evolution of the most correlated components of both streams. Another promising research direction emerges from recently derived tensor-based classification frameworks [25, 26]. Tensors encoding audio-visual speech features from several consecutive sampling periods can retain most of the dynamic relationship between lip movement and speech dynamics, while the use of tensor algebra can overcome the scarcity of training data. Equivalent CoIA and CANCOR tensor techniques should be derived in future work and tested on the audio-visual asynchrony detection problem presented in this paper.