1 Introduction

In recent years, speech-based interaction with computers has made significant progress. Digital voice assistants are now ubiquitous due to their integration into many commodity devices such as smartphones, TVs, and cars; companies also increasingly use machine learning techniques to drive service bots that interact with their customers. These virtual agents aim for a user-friendly human-machine interface while keeping maintenance costs low. However, a significant challenge is to appeal to humans by delivering information through the medium that is most comfortable to them. While speech-based interaction is already very successful, as demonstrated by virtual assistants like Siri, Alexa, Google, etc., the visual counterpart is largely missing. This comes as no surprise, given that a user would also like to associate the visuals of a face with the generated audio, similar to the ideas behind video conferencing. In fact, the level of engagement for audio-visual interactions is higher than for purely audio-based ones [9, 24].

Fig. 1. Neural Voice Puppetry enables applications like facial animation for digital assistants or audio-driven facial reenactment.

The aim of this work is to provide the missing visual channel by introducing Neural Voice Puppetry, a photo-realistic facial animation method that can be used in the scenario of a visual digital assistant (Fig. 1). To this end, we build on recent advances in text-to-speech synthesis [14, 21], which can provide a synthetic audio stream for any text generated by a digital agent. As visual basis, we leverage a short target video of a real person. The key component of our method is to estimate lip motions that fit the input audio and to render the appearance of the target person in a convincing way. This mapping from audio to visual output is trained using the ground truth that can be gathered from a target video (aligned real audio and image data). We designed Neural Voice Puppetry to be an easy-to-use audio-to-video translation tool that requires neither vast amounts of video footage of a single target person nor any manual user input. In our experiments, the target videos are comparably short (2–3 min), which allows us to work on a large amount of video footage that can be downloaded from the Internet. To enable this easy applicability to new videos, we generalize specific parts of our pipeline. Specifically, we compute a latent expression space that is generalized among multiple persons (116 in our experiments). This also ensures the capability to handle different audio inputs.

Besides generating the visual appearance of a digital agent, our method can also be used for audio-based facial reenactment. Facial reenactment is the process of re-animating a target video in a photo-realistic manner with the expressions of a source actor [28, 34]. It enables a variety of applications, ranging from consumer-level teleconferencing through photo-realistic virtual avatars [20, 26, 27] to movie production applications such as video dubbing [11, 17]. Recently, several authors started to exploit the audio signal for facial reenactment [5, 23, 31]. This has the potential to avoid failures of visual-based approaches when the visual signal is not reliable, e.g., due to occlusions, noise, or distorted views. Many of these approaches, however, lack video-realism [5, 31], since they work in a normalized space of facial imagery (cropped, frontal faces) to be agnostic to head movements. An exception is the work of Suwajanakorn et al. [23], who have shown photo-realistic videos of President Obama synthesized solely from the audio signal. This approach, however, requires very large quantities of training data (17 h of President Obama's weekly speeches), which limits its application and generalization to other identities. In contrast, our method only needs 2–3 min of a target video to learn the person-specific talking style and appearance. Our underlying latent 3D model space inherently provides 3D consistency and temporal stability, which allows us to generate natural, full-frame imagery. In particular, it enables the disentanglement of rigid head motion from facial expressions.

To enable photo-realistic renderings of digital assistants as well as audio-driven facial reenactment, we make the following contributions:

  • A temporal network architecture, called Audio2ExpressionNet, that maps an audio stream to coefficients of a 3D blendshape basis representing person-specific talking styles. Exploiting features from a pre-trained speech-to-text network, we generalize the Audio2ExpressionNet over a dataset of news speakers.

  • Based on a short target video sequence (2–3 min), we extract a representation of the person-specific talking style, since our goal is to preserve the talking style of the target video during reenactment.

  • A novel light-weight neural rendering network based on neural textures that allows us to generate photo-realistic video content reproducing the person-specific appearance. It surpasses the quality and speed of state-of-the-art neural rendering methods [10, 29].

2 Related Work

Neural Voice Puppetry is a facial reenactment approach based only on audio input. In the literature, there are many video-based facial reenactment systems that enable dubbing and other general facial expression manipulations. Our focus in this related work section lies on audio-based methods, which can be organized into facial animation and facial reenactment. Facial animation concentrates on the prediction of expressions that can be applied to a predefined avatar. In contrast, audio-driven facial reenactment aims to generate photo-realistic videos of an existing person, including all idiosyncrasies.

Video-Driven Facial Reenactment: The state-of-the-art report of Zollhöfer et al. [34] discusses several works on video-driven facial reenactment. Most methods rely on a reconstruction of a source and a target face using a parametric face model. The target face is reenacted by replacing its expression parameters with those of the source face. Thies et al. [28] use a static skin texture and a data-driven approach to synthesize the mouth interior. In Deep Video Portraits [18], a generative adversarial network conditioned on synthetic renderings is used to produce photo-realistic skin texture that can handle skin deformations. Recently, Thies et al. [29] proposed the use of neural textures in conjunction with a deferred neural renderer. Their results show that neural textures can be used to generate high-quality facial reenactments; for instance, they produce high-fidelity mouth interiors with fewer artifacts. Kim et al. [17] analyzed the notion of style for facial expressions and showed its importance for dubbing. In contrast to Kim et al. [17], we directly estimate the expressions in the target talking-style domain and, thus, do not need to apply any transfer or adaptation method.

Audio-Driven Facial Animation: These methods do not focus on photo-realistic results but on the prediction of facial motions [6, 16, 22, 25, 30]. Karras et al. [16] drive a 3D facial animation using an LSTM that maps input waveforms to the 3D vertex coordinates of a face mesh, also considering the emotional state of the person. In contrast to our method, it needs high-quality 3D reconstructions for supervised training and does not render photo-realistic output. Taylor et al. [25] use a neural network to map phonemes to the parameters of a reference face model. It is trained on data of a single person speaking for eight hours. They show animations of different synthetic avatars using deformation retargeting. VOCA [6] is an end-to-end deep neural network for speech-to-animation translation trained on multiple subjects. Similar to our approach, a low-dimensional audio embedding based on features of the DeepSpeech network [13] is used. From this embedding, VOCA regresses the 3D vertices of the FLAME face model [19], conditioned on a subject label. It requires high-quality 4D scans recorded in a studio setup. Our approach works on 'in-the-wild' videos, with a focus on temporally coherent predictions and photo-realistic renderings.

Audio-Driven Facial Reenactment: Audio-driven facial reenactment has the goal of generating photo-realistic videos that are in sync with the input audio stream. There are a number of techniques for audio-driven facial reenactment [2, 3, 5, 8, 31, 32], but only few generate photo-realistic, natural, full-frame images [23]. Suwajanakorn et al. [23] use an audio stream of President Barack Obama to synthesize a high-quality video of him speaking. A recurrent neural network is trained on many hours of his speech to learn the mouth shape from the audio. The mouth is then composited with proper 3D matching to reanimate an original video in a photo-realistic manner. Because of the huge amount of required training data (17 h), the approach is not applicable to other target actors. In contrast, our approach only needs a 2–3 min long target video.

Chung et al. [5] present a technique that animates the mouth of a still, normalized image to follow an audio speech signal. First, the image and audio are projected into a latent space through deep encoders. A decoder then utilizes the joint embedding of the face and audio to synthesize the talking head. The technique is trained on tens of hours of data in an unsupervised manner. Another 2D image-based method has been presented by Vougioukas et al. [31]. They use a temporal GAN to produce a video of a talking face given a still image and an audio signal as input. The generator feeds the still image and the audio to an encoder-decoder architecture with an RNN to better capture temporal relations. Discriminators on a per-frame and on a sequence level improve temporal coherence; as conditioning, they also take the audio signal as input to enforce synchronization of the synthesized mouth with the audio. In [32], a dedicated mouth-audio sync discriminator is used to improve the results. In contrast to our method, these 2D image-based methods are restricted to a normalized image space of cropped and frontalized images; they are not applicable to generating full-frame images with 3D-consistent motions.

Text-Based Video Editing: Fried et al. [10] presented a technique for text-based editing of videos. Their approach allows overwriting existing video segments with new text in a seamless manner. A face model [12] is registered to the examined video, and a viseme search finds video segments with mouth movements similar to the edited text. The corresponding face parameters of the matching video segment are blended with the original sequence parameters based on a heuristic, followed by a deep renderer that synthesizes photo-realistic results. The method is person-specific and requires a one-hour-long training sequence of the target actor; it is, thus, not applicable to short videos from the Internet. The viseme search is slow (\(\sim \)5 min for three words) and does not allow for interactive results.

Fig. 2. Pipeline of Neural Voice Puppetry. Given an audio sequence, we use the DeepSpeech RNN to predict a window of character logits that is fed into a small network. This generalized network predicts coefficients that drive a person-specific expression blendshape basis. We render the target face model with the new expressions using a novel light-weight neural rendering network.

3 Overview

Neural Voice Puppetry consists of two main components (see Fig. 2): a generalized and a specialized part. A generalized network predicts a latent expression vector, thus spanning an audio-expression space. This audio-expression space is shared among all persons and allows for reenactment, i.e., transferring the predicted motions from one person to another. To ensure generalizability w.r.t. the input audio, we use features extracted by a pre-trained speech-to-text network [13] as input to estimate the audio-expressions. The audio-expressions are interpreted as blendshape coefficients of a 3D face model rig. This face rig is person-specific and is optimized in the second part of our pipeline. This specialized stage captures the idiosyncrasies of a target person, including the facial motion and appearance. It is trained on a short video sequence of 2–3 min (in comparison to the hours required by state-of-the-art methods). The 3D facial motions are represented as delta-blendshapes, which we constrain to be in the subspace of a generic face template [1, 28]. A neural texture in conjunction with a novel neural rendering network is used to store and re-render the appearance of the face of an individual person.

Fig. 3. Samples of the training corpus used to optimize the Audio2ExpressionNet.

4 Data

In contrast to previous model-based methods, Neural Voice Puppetry is based on 'in-the-wild' videos that can be downloaded from the Internet. The videos have to be in sync with their audio stream, such that we can extract ground truth pairs of audio features and image content. In our experiments, the videos have a resolution of \(512 \times 512\) at 25 fps.

Training Corpus for the Audio2ExpressionNet: Figure 3 shows an overview of the video training corpus used to train the small network that predicts the 'audio expressions' from the input audio features (see Sect. 5.1). The dataset consists of 116 videos with an average length of 1.7 min (302750 frames in total). We selected the training corpus such that the persons are in a neutral mood (commentators of the German public TV).

Target Sequences: For a target sequence, we extract the person-specific talking style, i.e., we compute a mapping from the generalized audio-expression space to the actual facial movements of the target actor (see Sect. 5.3). The sequences are 2–3 min long and, thus, easy to obtain from the Internet.

4.1 Preprocessing

In an automatic preprocessing step, we extract face tracking information as well as audio features needed for training.

3D Face Tracking: Our method uses a statistical face model and delta-blendshapes [1, 28] as a 3D latent space for modeling facial animation. The 3D face model reduces the face space to a few hundred parameters (100 for shape, 100 for albedo, and 76 for expressions) and stays fixed in this work. Using the dense face tracking method of Thies et al. [28], we estimate the model parameters for every frame of a sequence. During tracking, we extract the per-frame expression parameters that are used to train the audio-to-expression network. To train our neural renderer, we also store the rasterized texture coordinates of the reconstructed face mesh.
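As an illustration of this parametric representation, the following minimal sketch assembles face vertices from a mean shape plus linear identity and delta-blendshape expression terms. The variable names and matrix layout are our own assumptions; the albedo parameters only affect per-vertex colors and are omitted here.

```python
import numpy as np

def face_model_vertices(v_mean: np.ndarray,       # (3V,) mean face geometry
                        shape_basis: np.ndarray,  # (3V, 100) identity basis
                        expr_basis: np.ndarray,   # (3V, 76) delta-blendshape basis
                        alpha: np.ndarray,        # (100,) shape parameters
                        delta: np.ndarray         # (76,) expression parameters
                        ) -> np.ndarray:
    """Return the flattened (3V,) vertex positions of the reconstructed face."""
    return v_mean + shape_basis @ alpha + expr_basis @ delta
```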

Audio-Feature Extraction: Each video contains a synced audio stream. We use the recurrent feature extractor of the pre-trained speech-to-text model DeepSpeech [13] (v0.1.0). Similar to VOCA [6], we extract a window of character logits per video frame. Each window consists of 16 time intervals of 20 ms each, resulting in an audio feature of size \(16\times 29\). The DeepSpeech model is generalized among thousands of different voices, trained on Mozilla's CommonVoice dataset.
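A minimal sketch of how such per-frame feature windows could be assembled is shown below; it assumes the DeepSpeech logits are already available as a `(num_steps, 29)` array with one row per 20 ms interval. The helper name and the clamping at the sequence borders are our assumptions.

```python
import numpy as np

def audio_feature_windows(deepspeech_logits: np.ndarray,
                          num_frames: int,
                          fps: float = 25.0,
                          steps_per_second: float = 50.0,  # one logit row per 20 ms
                          window: int = 16) -> np.ndarray:
    """Assemble one (window x 29) logit window per video frame.

    deepspeech_logits: (num_steps, 29) character logits.
    Returns: (num_frames, window, 29), windows centered on each video frame.
    """
    num_steps, num_chars = deepspeech_logits.shape
    features = np.zeros((num_frames, window, num_chars), dtype=np.float32)
    for f in range(num_frames):
        center = int(round(f / fps * steps_per_second))
        start = center - window // 2
        for i in range(window):
            s = min(max(start + i, 0), num_steps - 1)  # clamp at sequence borders
            features[f, i] = deepspeech_logits[s]
    return features
```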

5 Method

To enable photo-realistic facial reenactment based on audio signals, we employ a 3D face model as an intermediate representation of facial motion. A key component of our pipeline is the audio-based expression estimation. Since every person has their own talking style and, thus, different expressions, we establish person-specific expression spaces that can be computed for every target sequence. To ensure generalization among multiple persons, we create a latent audio-expression space that is shared by all persons. From this audio-expression space, one can map to the person-specific expression space, enabling reenactment. Given the estimated expressions and the extracted audio features, we apply a novel light-weight neural rendering technique that generates the final output image.

5.1 Audio2ExpressionNet

Our method is designed to generate temporally smooth predictions of facial motions. To this end, we employ a deep neural network with two stages. First, we predict per-frame facial expressions. Since these predictions are potentially noisy, we apply an expression-aware temporal filtering network: given the noisy per-frame predictions as input, it predicts filter weights to compute smooth audio-expressions for a single frame. The per-frame and the filtering network are trained jointly and output audio-expression coefficients. This audio-expression space is shared among multiple persons and is interpreted as blendshape coefficients. Per person, we compute a blendshape basis that lies in the subspace of our generic face model [28]. The networks are trained with a loss that operates on the vertex level of this face model.

Per-Frame Audio-Expression Estimation Network: Since our goal is a generalized audio-based expression estimation, we rely on generalized audio features. We use the RNN part of the speech-to-text approach DeepSpeech [13] to extract these features. The features represent the logits of the DeepSpeech alphabet for a 20 ms audio signal. For each video frame, we extract a time window of 16 features around the frame, each consisting of 29 logits (the length of the DeepSpeech alphabet). This \(16 \times 29\) tensor is the input to our per-frame estimation network (see Fig. 4). To map from this feature space to the per-frame audio-expression space, we apply four convolutional layers and three fully connected layers. Specifically, we apply 1D convolutions with kernel size 3 and stride 2, filtering along the time dimension. The convolutional layers have a bias and are followed by a leaky ReLU (slope 0.02). The feature dimensions are reduced successively from \((16\times 29)\) via \((8\times 32)\), \((4\times 32)\), and \((2\times 64)\) to \((1\times 64)\). This reduced feature is the input to the fully connected layers, which have a bias and are also followed by a leaky ReLU (slope 0.02), except for the last layer. The fully connected layers map the 64 features from the convolutional network to 128, then to 64, and, finally, to the audio-expression space of dimension 32, where a TanH activation is applied.
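A sketch of this per-frame network in PyTorch is given below; the padding of the 1D convolutions and the weight initialization are not specified in the text and are assumptions.

```python
import torch
import torch.nn as nn

class PerFrameAudioExpressionNet(nn.Module):
    """Maps a (B, 16, 29) DeepSpeech logit window to (B, 32) audio-expressions."""
    def __init__(self, expression_dim: int = 32):
        super().__init__()
        # Time dimension is reduced 16 -> 8 -> 4 -> 2 -> 1 (padding=1 is an assumption).
        self.conv = nn.Sequential(
            nn.Conv1d(29, 32, kernel_size=3, stride=2, padding=1, bias=True),
            nn.LeakyReLU(0.02),
            nn.Conv1d(32, 32, kernel_size=3, stride=2, padding=1, bias=True),
            nn.LeakyReLU(0.02),
            nn.Conv1d(32, 64, kernel_size=3, stride=2, padding=1, bias=True),
            nn.LeakyReLU(0.02),
            nn.Conv1d(64, 64, kernel_size=3, stride=2, padding=1, bias=True),
            nn.LeakyReLU(0.02),
        )
        self.fc = nn.Sequential(
            nn.Linear(64, 128), nn.LeakyReLU(0.02),
            nn.Linear(128, 64), nn.LeakyReLU(0.02),
            nn.Linear(64, expression_dim), nn.Tanh(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.permute(0, 2, 1)        # (B, 29, 16): alphabet logits as channels
        x = self.conv(x).squeeze(-1)  # (B, 64)
        return self.fc(x)             # (B, expression_dim), in [-1, 1]
```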

Temporally Stable Audio-Expression Estimation: To generate temporally stable audio-expression predictions, we jointly learn a filtering network that takes T per-frame estimates as input (see Fig. 4(b)). Specifically, we estimate the audio-expressions for frame t as a linear combination of the per-frame predictions of the time steps \(t-T/2\) to \(t+T/2\). The weights for the linear combination are computed by a neural network that receives the audio-expressions as input (which results in an expression-aware filtering). The filter-weight prediction network consists of five 1D convolutions followed by a linear layer with softmax activation (see supplemental material for a detailed description). This content-aware temporal filtering is inspired by the self-attention mechanism [33].
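The following sketch illustrates such an expression-aware filter; since the exact layer sizes are only given in the supplemental material, the channel widths below are assumptions.

```python
import torch
import torch.nn as nn

class ExpressionAwareFilter(nn.Module):
    """Predicts T softmax weights from T noisy per-frame audio-expressions
    and returns their weighted linear combination (the filtered expression)."""
    def __init__(self, expression_dim: int = 32, T: int = 8):
        super().__init__()
        self.weight_net = nn.Sequential(               # five 1D convolutions ...
            nn.Conv1d(expression_dim, 16, 3, padding=1), nn.LeakyReLU(0.02),
            nn.Conv1d(16, 8, 3, padding=1), nn.LeakyReLU(0.02),
            nn.Conv1d(8, 4, 3, padding=1), nn.LeakyReLU(0.02),
            nn.Conv1d(4, 2, 3, padding=1), nn.LeakyReLU(0.02),
            nn.Conv1d(2, 1, 3, padding=1), nn.LeakyReLU(0.02),
            nn.Flatten(),                               # ... followed by a linear layer
            nn.Linear(T, T),
        )

    def forward(self, per_frame: torch.Tensor) -> torch.Tensor:
        # per_frame: (B, T, expression_dim) noisy per-frame estimates around frame t
        w = self.weight_net(per_frame.permute(0, 2, 1))  # (B, T) unnormalized weights
        w = torch.softmax(w, dim=-1).unsqueeze(-1)       # (B, T, 1) filter weights
        return (w * per_frame).sum(dim=1)                # (B, expression_dim)
```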

Fig. 4. Audio2ExpressionNet: (a) the per-frame audio-expression estimation network that takes DeepSpeech features as input; (b) to obtain smooth audio-expressions, we employ a content-aware filtering along the time dimension.

Person-Specific Expressions: To retrieve the 3D model from the audio-expression space, we learn a person-specific audio-expression blendshape basis, which we constrain by the generic blendshape basis of our statistical face model, i.e., the audio-expression blendshapes of a person are linear combinations of the generic blendshapes. This linear relation results in a linear mapping from the audio-expression space, which is the output of the generalized network, to the generic blendshape basis. The mapping is person-specific, resulting in N matrices of dimension \(76 \times 32\) during training (N being the number of training sequences and 76 the number of generic blendshapes).
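The person-specific mapping can be illustrated as follows; the initialization and the exact tensor layout are assumptions, but the structure (one 76 x 32 matrix per person, applied on top of the generic delta-blendshape basis) follows the description above.

```python
import torch
import torch.nn as nn

class PersonSpecificMapping(nn.Module):
    """Maps 32-D audio-expressions to per-vertex offsets via person-specific
    76 x 32 matrices and the fixed generic delta-blendshape basis."""
    def __init__(self, generic_basis: torch.Tensor, num_persons: int,
                 expression_dim: int = 32):
        super().__init__()
        self.register_buffer('generic_basis', generic_basis)  # (3V, 76), fixed
        self.mappings = nn.Parameter(                          # (N, 76, 32), learned
            0.01 * torch.randn(num_persons, generic_basis.shape[1], expression_dim))

    def forward(self, audio_expr: torch.Tensor, person_id: torch.Tensor) -> torch.Tensor:
        # audio_expr: (B, 32), person_id: (B,) long indices into the mappings
        generic_coeffs = torch.bmm(self.mappings[person_id],          # (B, 76, 32)
                                   audio_expr.unsqueeze(-1)).squeeze(-1)  # (B, 76)
        return generic_coeffs @ self.generic_basis.t()                # (B, 3V) offsets
```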

Loss: The network and the mapping matrices are learned end-to-end using the visually tracked training corpus and a vertex-based loss function, with a higher weight (10\(\times \)) on the mouth region of the face model. Specifically, we compute a vertex-to-vertex distance between the audio-based prediction and the visually tracked face model in terms of a root mean squared (RMS) distance:

$$\begin{aligned} L_{expr} = RMS(v_t - v_t^{*}) + \lambda \cdot L_{temp} \end{aligned}$$

with \(v_t\) being the vertices based on the filtered expression estimation of frame t and \(v_t^{*}\) the visually tracked face vertices. In addition to the absolute loss between the predictions and the visually tracked face geometry, we use a temporal loss that considers the vertex displacements of consecutive frames:

$$\begin{aligned} \begin{aligned} \small L_{temp}&= RMS((v_t-v_{t-1}) - (v_t^{*}-v_{t-1}^{*})) + RMS( (v_{t+1}-v_{t}) - (v_{t+1}^{*}-v_{t}^{*})) \\&+ RMS( (v_{t+1}-v_{t-1}) - (v_{t+1}^{*}-v_{t-1}^{*})) \end{aligned} \end{aligned}$$

These forward, backward and central differences are weighted with \(\lambda \) (in our experiments \(\lambda =20\)). The losses are measured in millimeters.
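A compact sketch of this loss for a window of three consecutive frames is given below; the 10\(\times \) mouth weighting is omitted for brevity, and the tensor layout is an assumption.

```python
import torch

def rms(x: torch.Tensor) -> torch.Tensor:
    """Root mean squared value over all vertex coordinates (in mm)."""
    return torch.sqrt((x ** 2).mean())

def expression_loss(v: torch.Tensor, v_gt: torch.Tensor, lam: float = 20.0) -> torch.Tensor:
    """Sketch of L_expr for frames t-1, t, t+1.

    v, v_gt: (3, V, 3) predicted / visually tracked vertices of the three frames.
    The per-vertex mouth weighting (10x on the mouth region) is omitted here.
    """
    absolute = rms(v[1] - v_gt[1])                       # RMS(v_t - v_t*)
    d = lambda a, b: (v[a] - v[b]) - (v_gt[a] - v_gt[b])  # displacement differences
    temporal = rms(d(1, 0)) + rms(d(2, 1)) + rms(d(2, 0))  # backward, forward, central
    return absolute + lam * temporal
```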

5.2 Neural Face Rendering

Based on recent advances in neural rendering, we employ a novel light-weight neural rendering technique that uses neural textures to store the appearance of a face. Our rendering pipeline synthesizes the lower face in the target video based on the audio-driven expression estimates. Specifically, we use two networks (see supplemental material for an overview figure): one network focuses on the face interior, and a second network embeds this rendering into the original image. The estimated 3D face model is rendered with a neural texture [29], using the rigid pose observed in the original target image. The neural texture has a resolution of \(256\times 256\times 16\). The network for the face interior translates the rendered feature descriptors to RGB colors. It has a structure similar to a classical U-Net with 5 layers, but instead of strided convolutions, which downsample in each layer, we use dilated convolutions with increasing dilation factor and a stride of one; instead of transposed convolutions, we use standard convolutions. All convolutions have a kernel size of \(3\times 3\). Note that dilated convolutions do not increase the number of learnable parameters compared to strided ones, but they do increase the memory load during training and testing. Dilated convolutions reduce visual artifacts and lead to smoother results, both spatially and temporally (see video). The second network, which blends the face interior with the 'background image', has the same structure. To remove potential movements of the chin in the background image, we erode the background image around the rendered face; the second network inpaints these missing regions.
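The following PyTorch sketch illustrates the described face-interior renderer with dilated instead of strided convolutions; channel widths, the skip-connection wiring, and the final activation are not specified in the text and are assumptions.

```python
import torch
import torch.nn as nn

class DilatedFaceRenderer(nn.Module):
    """U-Net-like renderer: 5 dilated 3x3 convolutions (dilation 1, 2, 4, 8, 16,
    stride 1) instead of downsampling, 5 standard 3x3 convolutions instead of
    transposed ones, with skip connections between the two halves.
    Input: 16-channel rasterized neural texture, output: RGB image."""
    def __init__(self, in_channels: int = 16, width: int = 64):
        super().__init__()
        self.down = nn.ModuleList()
        c = in_channels
        for i in range(5):                               # dilation 1, 2, 4, 8, 16
            self.down.append(nn.Sequential(
                nn.Conv2d(c, width, 3, padding=2 ** i, dilation=2 ** i),
                nn.LeakyReLU(0.02)))
            c = width
        self.up = nn.ModuleList()
        for i in range(5):                               # standard convolutions
            self.up.append(nn.Sequential(
                nn.Conv2d(2 * width if i > 0 else width, width, 3, padding=1),
                nn.LeakyReLU(0.02)))
        self.to_rgb = nn.Conv2d(width, 3, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for layer in self.down:
            x = layer(x)
            skips.append(x)
        for i, layer in enumerate(self.up):
            if i > 0:
                x = torch.cat([x, skips[-(i + 1)]], dim=1)  # skip connection
            x = layer(x)
        return self.to_rgb(x)  # final activation (e.g. sigmoid to [0, 1]) omitted
```

Since every convolution keeps the full spatial resolution, the skip connections can be simple channel-wise concatenations, which is one reason this variant avoids the block artifacts of strided/transposed upsampling.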

Loss: We use a per-frame loss function that combines an \(\ell _1\) loss to measure absolute errors with a VGG style loss [15]:

$$\begin{aligned} L_{rendering} = \ell _1(I, I^{*}) + \ell _1(\hat{I}, I^{*}) + VGG(I, I^{*}) \end{aligned}$$

with I being the final synthetic image, \(I^*\) the ground truth image, and \(\hat{I}\) the intermediate result of the first network that focuses on the face interior (this loss term is masked to that region).
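A sketch of this rendering loss is shown below; which VGG layers are used for the perceptual term is not specified, so the chosen layers (and the use of plain feature distances) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class VGGFeatureLoss(nn.Module):
    """Perceptual loss on a few frozen VGG16 feature maps."""
    def __init__(self, layers=(3, 8, 15)):               # relu1_2, relu2_2, relu3_3
        super().__init__()
        self.vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
        self.layers = set(layers)
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                feats.append(x)
        return feats

    def forward(self, pred, target):
        return sum(F.l1_loss(a, b)
                   for a, b in zip(self._features(pred), self._features(target)))

def rendering_loss(final_img, interior_img, target_img, interior_mask, vgg_loss):
    """L_rendering = l1(I, I*) + l1(I_hat, I*) masked to the face interior + VGG(I, I*)."""
    return (F.l1_loss(final_img, target_img)
            + F.l1_loss(interior_img * interior_mask, target_img * interior_mask)
            + vgg_loss(final_img, target_img))
```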

5.3 Training

Our training procedure has two stages: a generalization and a specialization phase. In the first phase, we train the Audio2ExpressionNet on all sequences of our dataset (see Sect. 4) in a supervised fashion. Given the visual face tracking information, we know the 3D face model of a specific person for every frame. During training, we reproduce these 3D reconstructions from the audio input by optimizing the network parameters and the person-specific mappings from the audio-expression space to the 3D space. In the second phase, the rendering network for a specific target sequence is trained. Given the ground truth images and the visual tracking information, we train the neural renderer end-to-end, including the neural texture.

New Target Video: Since the audio-based expression estimation network is generalized among multiple persons, we can apply it to unseen actors. The person-specific mapping between the predicted audio-expression coefficients and the expression space of the new person can be obtained by solving a linear system of equations. Specifically, we extract the audio-expressions for all training images and compute the linear mapping to the visually estimated expressions. In addition, the person-specific rendering network for the new target video is trained from scratch (see supplemental material for further information).
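Fitting this mapping amounts to an ordinary least-squares problem, as sketched below; the array shapes follow the dimensions given in Sect. 5.1, and the function name is hypothetical.

```python
import numpy as np

def fit_person_specific_mapping(audio_expr: np.ndarray,
                                visual_expr: np.ndarray) -> np.ndarray:
    """Fit the person-specific mapping for a new target video.

    audio_expr:  (F, 32) audio-expressions predicted by the generalized network
                 for all F training frames.
    visual_expr: (F, 76) visually tracked generic blendshape coefficients.
    Returns the (76, 32) mapping M minimizing ||audio_expr @ M.T - visual_expr||.
    """
    M_t, *_ = np.linalg.lstsq(audio_expr, visual_expr, rcond=None)  # (32, 76)
    return M_t.T                                                    # (76, 32)
```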

5.4 Inference

At test time, we only require a source audio sequence. Based on the selected target actor, we use the corresponding person-specific mapping. The mapping from the audio features to the person-specific expression space takes less than 2 ms on an Nvidia 1080 Ti. Generating the 3D model and rasterizing it using these predictions takes another 2 ms. The deferred neural rendering takes \(\sim 5\) ms, which results in a real-time capable pipeline.

Text-to-Video: Our pipeline is trained on real video sequences, where the audio is in sync with the visual content. Thus, we learn a mapping directly from audio to video that ensures synchronicity. Instead of going directly from text to video, for which such a natural training corpus is not available, we synthesize a voice from the text and feed it into our pipeline. For our experiments, we used samples from the DNN-based text-to-speech demo of IBM Watson (see Footnote 1), which gives us state-of-the-art synthetic audio streams that are comparable to the synthetic voices of virtual assistants.

6 Results

Neural Voice Puppetry has several important use cases, i.e., audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head (see supplemental video). In the following sections, we discuss these results, including comparisons to state-of-the-art approaches.

Fig. 5. Self-reenactment: evaluation of our rendering network and the audio-prediction network. The error plot shows the Euclidean photo-metric error.

6.1 Ablation Studies

Self-Reenactment: We use self-reenactment to evaluate our pipeline (Fig. 5), since it gives us access to a ground truth video sequence for which we can also retrieve visual face tracking. As distance measurements, we use an \(\ell _2\) distance in color space (colors in [0, 1]) and the corresponding PSNR values. Using these measures, we evaluate the rendering network and the entire reenactment pipeline. Specifically, we compare results driven by visually tracked mouth movements to results driven by the audio-based predictions (see video). The mean color difference of the re-rendering on the test sequence of 645 frames is 0.003 for the visual and 0.005 for the audio-based expressions, which corresponds to a PSNR of 41.48 and 36.65, respectively. In addition to the photo-metric measurements, we computed the 2D mouth landmark distances relative to the eye distance using Dlib, resulting in 0.022 for visual tracking and 0.055 for the audio-based predictions.
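For reference, the two measures could be computed as sketched below; the exact landmark indices and the normalization details are assumptions.

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray) -> float:
    """PSNR for images with color values in [0, 1]."""
    mse = np.mean((pred - gt) ** 2)
    return float(10.0 * np.log10(1.0 / mse))

def relative_mouth_error(mouth_pred: np.ndarray, mouth_gt: np.ndarray,
                         eye_left: np.ndarray, eye_right: np.ndarray) -> float:
    """Mean 2D mouth-landmark distance, normalized by the inter-eye distance.

    mouth_pred / mouth_gt: (M, 2) Dlib mouth landmarks of the generated and the
    ground-truth frame; eye_left / eye_right: (2,) eye centers of the ground truth.
    """
    eye_dist = np.linalg.norm(eye_left - eye_right)
    return float(np.mean(np.linalg.norm(mouth_pred - mouth_gt, axis=1)) / eye_dist)
```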

In the supplemental video, we also show a side-by-side comparison of our rendering network using dilated convolutions and a variant with strided convolutions (with a kernel size of 4 to reduce block artifacts in the upsampling). Both networks are trained for the same number of epochs (50). As can be seen, dilated convolutions lead to visually more pleasing results (smoother in the spatial and temporal domain). Compared to the results with dilated convolutions reported above, strided convolutions result in a lower PSNR of 40.12 with visual tracking and 36.32 with audio-based predictions.

Temporal Smoothness: We also evaluated the benefit of using a temporal expression prediction network. Besides the temporally smoother predictions shown in the supplemental video, it also improves the prediction accuracy of the mouth shape: the relative 2D mouth landmark error improves from 0.058 (per-frame prediction) to 0.055 (temporal prediction).

Generalization/Transferability: Our results cover different target persons, which demonstrates the wide applicability of our method, including the reproduction of different person-specific talking styles and appearances. As can be seen in the supplemental video, the expression estimation network trained on multiple target sequences (302750 frames) results in more coherent predictions than a network trained solely on a sequence of Obama (3145 frames). Using more target videos increases the training corpus size and the variety of input voices and, thus, leads to more robustness. In the video, we also show transfers from different source languages to different target videos that are themselves originally in different languages. In Table 1, we report the corresponding quantitative measurements of the achieved lip sync using SyncNet [4]. SyncNet is trained on the BBC news program (English); nevertheless, the authors state that it also works well for other languages. As a reference for the measurements of the different videos, we list the values for the original target videos. Higher confidence values are better, while a value below 1 indicates uncorrelated audio and video streams. The original video of Macron has the lowest measured confidence, which propagates to the reenactment results.

Table 1. Analysis of generated videos with different source/target languages. Based on SyncNet  [4], we measure the audio-visual sync (offset/confidence). As a reference, we list the sync measurements for the original target video (right).

6.2 Comparisons to State-of-the-art Methods

In the following, as well as in the supplemental document, we compare to model-based and purely image-based approaches for audio-driven facial reenactment.

Preliminary User Study: In a preliminary user study, we evaluated the visual quality and audio-visual sync of state-of-the-art methods. The user study is based on videos taken from the supplemental materials of the respective publications (assuming that the authors show their best-case scenarios). Note that the videos of the different methods show (potentially) different persons (see supplemental material). In total, 56 attendees with a computer science background judged the synchronicity and visual quality ('very bad', 'bad', 'neither bad nor good', 'good', 'very good') of 24 videos shown in randomized order. In Fig. 6, we show the percentage of attendees that rated the specific approach as good or very good. As can be seen, the 2D image-based approaches achieve a high audio-visual sync (especially Vougioukas et al. [32]), but they lack visual quality and are not able to synthesize natural videos (outside of the normalized space). Our approach gives the best visual quality and also a high audio-visual sync, similar to state-of-the-art video-based reenactment approaches like Thies et al. [29].

Fig. 6. User study: percentage of attendees (56 in total) that rated the visual and audio-visual quality as good or very good.

Fig. 7. Visual quality comparison to the image-based methods (a) 'You said that?' [5] and (b) 'Realistic Speech-Driven Facial Animation with GANs' [32] (driven by the same input audio stream, respectively), including the synchronicity measurements using SyncNet [4] (offset/confidence).

Image-Based Methods: Our method aims for high-quality output that is embedded in a real video, including the person-specific talking style, and exploits an explicit 3D model representation of the face to ensure 3D-consistent movements. This is fundamentally different from image-based approaches, which operate in a normalized space of facial imagery (cropped, frontal faces) and do not capture person-specific talking styles, but can therefore be applied to single input images. In Fig. 7, as well as in the video, we show generated images of state-of-the-art image-based methods [5, 32]. The figure illustrates the inherent differences in visual quality, which have also been quantified in our user study (see Fig. 6), and includes the quantitative synchronicity measurements using SyncNet. Notably, Vougioukas et al. [32] achieve a high confidence score, while our method is in the range of the target video it has been trained on (compare to Fig. 6).

Fig. 8. Comparison to state-of-the-art audio-driven model-based video avatars using the same input audio stream. Our approach is applicable to multiple targets, especially where only 2–3 min of training data are available.

Model-Based Audio-Driven Methods: In Fig. 8, we show a representative image from a comparison to Taylor et al. [25], Karras et al. [16], and Suwajanakorn et al. [23]. Only the method of Suwajanakorn et al. is able to produce photo-realistic output. Their method is tailored to the scenario where a large video dataset of the target person is available and is, thus, limited in its applicability: they demonstrate it on sequences of Obama, using 14 h of training data and 3 h for validation. In contrast, our method works on short 2–3 min target video clips. Measuring the audio-visual sync with SyncNet [4], the generated Obama videos in Fig. 8 (top row) result in \(-2\)/5.9 (offset/confidence) for the person-specific approach of Suwajanakorn et al. and 0/5.2 for our generalized expression prediction network.

7 Limitations

As can be seen in the supplemental video, our approach works robustly on different audio sources and target videos, but it still has limitations. In particular, our method fails in the scenario of multiple voices in the audio stream; recent work addresses this 'cocktail party' issue using visual cues [7]. As for all other reenactment approaches, the target videos have to be occlusion-free to allow good visual tracking. In addition, the audio-visual sync of the original videos has to be good, since it transfers to the quality of the reenactment.

We assume that the target actor has a constant talking style during a target sequence. In follow-up work, we plan to estimate the talking style from the audio signal to adaptively control the expressiveness of the facial motions.

8 Conclusion

We presented a novel audio-driven facial reenactment approach that is generalized among different audio sources. This allows us not only to synthesize videos of a talking head from the audio sequence of another person, but also to generate photo-realistic videos based on a synthesized voice, i.e., text-driven video synthesis that is in sync with the artificial voice. We hope that our work is a stepping stone towards photo-realistic audio-visual assistants.