1 Introduction

Creating personalized audio-driven talking portraits has many applications in teleconferencing, video production, VR/AR games, and the movie industry. Given its great potential, research on talking face generation (Taylor et al., 2017; Thies et al., 2020; Zhou et al., 2019b, 2021; Zhang et al., 2021c; Ji et al., 2021) has enjoyed massive popularity in recent years, with emphasis on creating lip-synced (Prajwal et al., 2020; Thies et al., 2020) portraits with diverse head motions, talking styles, and emotions (Yi et al., 2020; Wu et al., 2021). However, the ability to change the lighting conditions of audio-driven portraits remains under-explored, even though it is critical to real-world applications, where we expect the portrait in the foreground to be seamlessly harmonized with backgrounds under different illuminations.

Fig. 1

Relighting Talking Portrait with Assigned Background. (Left) Our method takes a monocular video as input and estimates the corresponding normal and albedo which can be driven by audio. (Right) Talking portrait renderings with different illuminations, where lighting and shading are placed at the bottom. The rightmost three are relighted by HDR background images. Only a single video is required as the training data, without any extra annotations

To generate a relightable talking portrait from a single video, we argue that the underlying model should be capable of (1) estimating fine-grained 3D head geometry from monocular videos, (2) performing reflectance decomposition without any extra annotations, and (3) generalizing to novel driving audio. However, most learning-based methods either operate only on the 2D plane (Zhou et al., 2019b, 2021; Prajwal et al., 2020), or leverage structural intermediate representations (Chen et al., 2019; Cudeiro et al., 2019a; Ji et al., 2021; Thies et al., 2020; Wu et al., 2021; Zhou et al., 2020) or neural radiance fields (Guo et al., 2021; Yao et al., 2022; Shen et al., 2022). No fine-grained 3D geometry can be acquired for reflectance decomposition in these studies. On the other hand, adapting existing relighting techniques (Sun et al., 2019; Wang et al., 2020; Pandey et al., 2021; Zhang et al., 2021a) is too expensive for audio-driven video portraits given their dependence on multi-view or dynamically lighted data.

To bridge this gap, we propose ReliTalk, a novel framework for relightable audio-driven talking portrait generation that only requires a single monocular video as input, as shown in Fig. 1. Our key insight is the self-supervised implicit decomposition of geometry and reflectance, both of which can be further driven by input audio. Specifically, the proposed approach first extracts expression- and pose-related representations based on 3D facial priors (Li et al., 2017), and refines them into delicate normal maps through implicit functions. The initial normals then play a critical role in reflectance decomposition, which disentangles the human head into a set of intrinsic normal, albedo, diffuse, and specular maps by dynamically estimating the lighting condition of the given video. To avoid relying on knowledge from expensive capture data (i.e. Light Stage (Debevec et al., 2000)), we carefully design several learning objectives to decompose the human portrait into the corresponding maps from monocular videos, which will be introduced in the following sections.

To learn an audio-to-face mapping that better generalizes to unseen audio, we introduce mesh-aware guidance to assist lip-syncing, especially when the training video is too short to cover enough audio variance. Specifically, we use a model pre-trained on the VOCA dataset (Cudeiro et al., 2019b) to obtain lip-related meshes as additional guidance. Phoneme-related features and lip-related meshes are separately encoded and then concatenated to achieve more accurate audio-driven animations. Phoneme-related features enable the network to learn richer mouth shapes, while mesh-aware features provide coarse information on the opening and closing of the lips, even when the input audio is far from the audio used in training.

Natural talking portrait videos usually provide a limited perspective of the target person, who faces the camera without turning around. Moreover, the lack of multi-view information inherently hampers accurate estimation of 3D geometry. To address the ill-posed inverse problem of geometry and reflectance decomposition caused by the single view, limited motion variance, and unknown illumination, we design identity-consistent supervision (ICS) with simulated multiple lighting conditions to refine normal maps. The key insight is that we relight the human portrait on-the-fly during the training stage, sampling different lights and using an identity-consistent loss to update normal maps.

We evaluate our approach on both real and synthetic datasets. Overall, ReliTalk drives and relights dynamic human portraits in high fidelity, outperforming other methods in both perceptual quality and reconstruction correctness. Our contributions are summarized as follows:

  • We propose ReliTalk, a novel framework that learns relightable audio-driven talking portrait generation from only a single monocular portrait video.

  • We introduce additional audio-to-mesh guidance to improve the mapping accuracy, especially when the single training video has only limited audio variance.

  • We design identity-consistent supervision with simulated multiple lighting conditions, addressing the ill-posed problem caused by the limited views available in a single video.

2 Related Work

Inverse Rendering Recovering and disentangling the appearance of observed images into geometry and reflectance is a long-standing problem in computer vision and graphics. Prior works (Barron & Malik, 2014; Liu et al., 2019) address this challenge with physics-based priors on synthetic image data. However, they fail to extract the underlying 3D representation. Later approaches (Chan et al., 2022; Or-El et al., 2022; Xu et al., 2022; Sun et al., 2022; Zhao et al., 2022; Pan et al., 2020; Chan et al., 2021) successfully extract 3D representations with a 3D generator and refine the output using image-based CNNs. Recently, methods based on implicit representations (Zhang et al., 2021b; Srinivasan et al., 2021) propose learning 3D reflectance and geometry from multi-view images. In this work, we aim to tackle a harder problem, i.e., inverse rendering from a monocular video of a talking human face. Note that only limited-view information is available, as the person is always oriented toward the camera.

Portrait Relighting The One-Light-at-A-Time (OLAT) capture system allows detailed portrait geometry and reflectance to be obtained, and many methods based on it have achieved impressive success (Sun et al., 2019; Wang et al., 2020; Pandey et al., 2021; Zhang et al., 2021a). However, it is only applicable in a constrained environment due to its complexity and expense. Other methods (Zhou et al., 2019a; Hou et al., 2021, 2022; Caselles et al., 2023) simulate multi-lighting data and train a network to predict relighted results. Due to their limited simulation methods, the final results fall far behind OLAT-based methods. Yeh et al. (2022) synthesize a high-quality multi-lighting dataset, but it is still not available to the public. Another simplified strategy, which requires the user to capture a selfie video or a sequence of images to gain multi-view information, has also been proposed (Nestmeyer et al., 2020; Wang et al., 2022). Relighting4D (Chen & Liu, 2022) can even relight dynamic humans from free viewpoints using only videos. However, their rendering quality is strongly tied to the accuracy of the estimated geometry, requiring sufficient viewpoints in the videos. Our method is able to relight portraits with finer details from a monocular portrait video even when little multi-view information is available.

Fig. 2

Overview of Our Proposed Framework. Denote the input video with unknown illuminations as \({\textbf {V}} = \{I_{1}, I_{2},..., I_{t}\}\) with audio sequence \({\textbf {a}} = \{a_{1}, a_{2},..., a_{t}\}\), where t is the number of frames. Generally, our aim is to extract the geometry and reflectance information from video \({\textbf {V}}\) in an unsupervised manner and then drive the geometry deformation according to the audio

Audio-driven Talking Face Face animation has wide applications, drawing great research interest in computer vision and graphics. Recent methods for audio-driven animation (Cudeiro et al., 2019a; Fan et al., 2022; Karras et al., 2017; Richard et al., 2021; Suwajanakorn et al., 2017; Kim et al., 2018) are usually data-driven and can be divided into two categories. One is generalized animation (Cudeiro et al., 2019a; Fan et al., 2022; Richard et al., 2021), which utilizes large datasets containing paired audio/speech and lip/face data. Wav2Lip (Prajwal et al., 2020) trains a mapping from audio to lips on LRS2 (Chung et al., 2017). Instead of learning a highly heterogeneous and nonlinear mapping from audio to video directly, Everybody’s Talkin’ (Song et al., 2022) additionally involves a statistical linear 3D face model and builds an easier mapping from audio to the parameters of a 3DMM (Blanz & Vetter, 1999). Our proposed method takes a similar strategy and drives the whole portrait by controlling the parameters of the FLAME model (Li et al., 2017). The other is personalized animation (Suwajanakorn et al., 2017; Karras et al., 2017; Tang et al., 2022), which usually does not rely on a large dataset for training and builds one model for each person. Recently, with the emergence of Neural Radiance Fields (NeRF) (Mildenhall et al., 2020; Barron et al., 2021b, a), many NeRF-based audio-driven methods have been proposed (Guo et al., 2021; Liu et al., 2022; Yao et al., 2022). However, these methods cannot drive the portraits well given novel audio. DFRF (Shen et al., 2022) mitigates this issue with a pre-trained base model, but the results are still unsatisfactory.

3 Our Approach

Given a monocular video of a talking portrait, our framework can re-render the human portrait with novel illuminations driven by the input audio. Denote the input video with unknown illuminations as \({\textbf {V}} = \{I_{1}, I_{2},..., I_{t}\}\) with audio sequence \({\textbf {a}} = \{a_{1}, a_{2},..., a_{t}\}\), where t is the number of frames. The key aim of our framework is to extract the geometry and reflectance information from video \({\textbf {V}}\) in an unsupervised manner, and the geometry deformation is driven by the audio accordingly. Specifically, we neurally model the expression- and pose-related geometry of human heads based on the FLAME model (Li et al., 2017). Then, an audio-to-geometry mapping is learned to drive the portrait and also provide a good initial normal estimation (Sect. 3.1). Meanwhile, the reflectance components, i.e., normal N, albedo A, shading \(S_{shad}\), and specular \(S_{spec}\) maps, are decomposed via carefully designed priors (Sect. 3.2). During training, the lighting condition L of the given video is estimated on-the-fly, and the training objective is reconstructing the whole video. In addition, multiple lighting conditions are randomly simulated for identity-consistent supervision which further refines geometry estimation. With the well-disentangled geometry and reflectance, we use audio from the user to drive the portrait by controlling expression and pose coefficients, then render it with any desired illuminations, which seamlessly harmonizes with the background. The whole pipeline is shown in Fig. 2.

In this paper, \(I, N, A\in {\mathbb {R}}^{3 \times H \times W}\), \(S_{shad}, S_{spec}\in {\mathbb {R}}^{1 \times H \times W}\) where H and W are height and width respectively.

3.1 Audio-Driven Synthesis

Expression- and Pose-related Geometry Estimating the surface normal of talking portraits from monocular videos is a non-trivial task, given the ill-posed nature of single-view reconstruction. To address this issue, we leverage a parametric model, FLAME (Li et al., 2017), as a human head prior for modeling the expression- and pose-related human portrait:

$$\begin{aligned} \text {FLAME}({\beta }, {\theta }, {\psi }): {\mathbb {R}}^{|{\beta }| \times |{\theta }| \times |{\psi }|} \rightarrow {\mathbb {R}}^{n\times 3}, \end{aligned}$$
(1)

which takes coefficients of shape \({\beta } \in {\mathbb {R}}^{|{\beta }|}\), pose \({\theta } \in {\mathbb {R}}^{|{\theta }|}\), and expression \({\psi } \in {\mathbb {R}}^{|{\psi }|}\) as input. We use an off-the-shelf tool (Feng et al., 2021) to estimate these parameters. Intuitively, this parametric human portrait model offers a good initialization of the 3D geometry, which facilitates further refinement.

However, this initial parametric portrait model is not well-aligned with the details of the given human portrait. To refine the initial model, we use nearest surface intersection search (Zheng et al., 2022) to optimize the initial mesh and calculate the normal N as the normalized gradient on the surface. This normal N will be further optimized during the reflectance decomposition process (Sect. 3.2).

Mesh-Aware Audio-to-Expression Translation From the perspective of the mapping function, learning a direct mapping from audio to talking video is hard due to the high dimensionality of the video domain. In contrast, mapping audio signals to expressions and poses of the head is much easier. To enable robust talking portrait generation, we leverage a mesh-aware audio-to-expression translation strategy. Benefiting from the FLAME-based (Li et al., 2017) design, our extracted head geometry is expression- and pose-related. We first use DeepSpeech (Amodei et al., 2016) to extract phoneme-related audio features \(f_\text {pho} \in {\mathbb {R}}^{16 \times 29}\):

$$\begin{aligned} f_\text {pho} = \text {DeepSpeech}(a). \end{aligned}$$
(2)

Then the extracted audio features \(f_\text {pho}\) are fed to a model pre-trained on the VOCA dataset (Cudeiro et al., 2019b) to predict lip-related mesh vertices \(V_\text {lip} \in {\mathbb {R}}^{N_V \times 3}\) (\(N_V\) is the number of selected vertices) as additional guidance:

$$\begin{aligned} V_\text {lip} = F_\text {mesh}(V_\text {template}, f_\text {pho}), \end{aligned}$$
(3)

where \(V_\text {template}\) is the zero-pose template mesh.

Lip-related vertices and phoneme-related features are separately encoded and concatenated to predict expressions and poses of the FLAME model by a learnable network:

$$\begin{aligned} {\hat{\psi }}, {\hat{\theta }} = F_\text {exp}(E_\text {lip}(V_\text {lip}), E_\text {pho}(f_\text {pho})), \end{aligned}$$
(4)

where \(E_\text {lip}\) and \(E_\text {pho}\) are two feature encoders and \(F_\text {exp}\) is a network that concatenates the two kinds of features and predicts expressions and poses. Meanwhile, to address the unstable prediction caused by using a single audio frame, we input neighboring frames and use attention layers in \(F_\text {exp}\) to integrate multi-frame audio information. This learning process is supervised by \({\mathcal {L}}_\text {exp} = \Vert {\hat{\psi }} - {\psi } \Vert _2^2 + \Vert {\hat{\theta }} - {\theta } \Vert _2^2\).
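To make the translation module concrete, a minimal PyTorch sketch of the two-branch fusion in Eqs. 2-4 is given below. The use of 1D convolutions, self-attention, and 8 adjacent frames follows Sect. 4.1; the layer widths, the number of lip vertices, and the coefficient dimensions are illustrative assumptions rather than our exact configuration.

```python
# A minimal sketch of the mesh-aware audio-to-expression translator (Eqs. 2-4).
# The 1D-convolutional encoders, self-attention fusion, and 8-frame window follow
# Sect. 4.1; layer widths, vertex count, and coefficient dimensions are assumptions.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    def __init__(self, n_lip_verts=100, n_exp=50, n_pose=6, d=64):
        super().__init__()
        # E_pho: 1D convs over the 16 DeepSpeech windows (29 logits each).
        self.E_pho = nn.Sequential(
            nn.Conv1d(29, d, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(d, d, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1))
        # E_lip: encodes the flattened lip vertices from the VOCA-pretrained model.
        self.E_lip = nn.Sequential(nn.Linear(n_lip_verts * 3, d), nn.LeakyReLU(0.2))
        # F_exp: self-attention over neighboring audio frames, then a regression head.
        self.attn = nn.MultiheadAttention(2 * d, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * d, n_exp + n_pose)
        self.n_exp = n_exp

    def forward(self, f_pho, v_lip):
        # f_pho: (B, T, 16, 29) DeepSpeech features; v_lip: (B, T, n_lip_verts, 3); T = 8.
        B, T = f_pho.shape[:2]
        pho = self.E_pho(f_pho.flatten(0, 1).transpose(1, 2)).squeeze(-1)   # (B*T, d)
        lip = self.E_lip(v_lip.flatten(0, 1).flatten(1))                    # (B*T, d)
        feat = torch.cat([pho, lip], dim=-1).view(B, T, -1)                 # (B, T, 2d)
        feat, _ = self.attn(feat, feat, feat)                               # fuse neighbors
        out = self.head(feat[:, T // 2])                                    # center frame
        return out[:, :self.n_exp], out[:, self.n_exp:]                     # psi_hat, theta_hat

# Training uses the coefficient loss L_exp = ||psi_hat - psi||^2 + ||theta_hat - theta||^2.
```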

Neural Video Rendering Network Given the new driven coefficients \({\psi }\), \({\theta }\), we can feed them into Eq. 1 and recalculate the geometry to obtain a new normal \({\hat{N}}\) that fits the input audio. Here we find that it is non-trivial to faithfully relate the audio signals to all face deformations (e.g. head movement). The translation network \(F_\text {exp}\) performs poorly if required to fit all poses. Therefore, we only predict lip-related poses and directly use the sequence of lip-unrelated poses from the existing video. In this way, we only need to regenerate lip-related areas (including the cheek and chin) and blend lip-unrelated areas from the existing video.

We first use a ResNet-based local network \(F_\text {local}\) to translate the newly generated normal into lip-related areas. Taking the eroded lip-unrelated areas of the existing video as background together with the output of the first network, another blending network \(F_\text {blend}\) outputs the blended image \({\hat{I}}\):

$$\begin{aligned} {\hat{I}} = F_\text {blend}(F_\text {local}({\hat{N}}) \odot {M}, I \odot (1 - {M}^{d})), \end{aligned}$$
(5)

where M is the lip-related area and \({M}^{d}\) is the dilated area for the network \(F_\text {blend}\) to inpaint. This process is learned by:

$$\begin{aligned} {\mathcal {L}}^\text {local}_\text {rgb} = \Vert F_\text {local}({\hat{N}}) \odot {M} - I \odot {M} \Vert _2^2, \end{aligned}$$
(6)
$$\begin{aligned} {\mathcal {L}}^\text {blend}_\text {rgb} = \Vert {\hat{I}} - I \Vert _2^2, \end{aligned}$$
(7)
$$\begin{aligned} {\mathcal {L}}^\text {blend}_\text {per} = \Vert \text {VGG}({\hat{I}}) - \text {VGG}(I) \Vert _2^2, \end{aligned}$$
(8)

where \({\mathcal {L}}^\text {blend}_\text {per}\) is the perceptual loss and VGG denotes a pretrained face VGG network (Parkhi et al., 2015) that returns extracted embedding features. The perceptual loss is only applied to the blending network to generate vivid results, while the local network is only expected to generate the rough lip area.
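The masked composition of Eq. 5 and the losses of Eqs. 6-8 can be sketched as follows; the max-pooling mask dilation and the channel-wise concatenation of the two inputs to \(F_\text {blend}\) are illustrative assumptions about interface details not specified above.

```python
# A minimal sketch of the masked composition (Eq. 5) and losses (Eqs. 6-8). The
# max-pool mask dilation and the concatenation of inputs to F_blend are assumptions;
# vgg_face stands for the pretrained face VGG feature extractor.
import torch
import torch.nn.functional as F

def dilate(mask, k=15):
    # Binary mask dilation approximated by max pooling (kernel size is an assumption).
    return F.max_pool2d(mask, kernel_size=k, stride=1, padding=k // 2)

def rendering_losses(F_local, F_blend, vgg_face, N_hat, I, M):
    local = F_local(N_hat)                                          # lip-area prediction
    M_d = dilate(M)                                                 # dilated lip mask M^d
    I_hat = F_blend(torch.cat([local * M, I * (1 - M_d)], dim=1))   # Eq. 5
    L_local = ((local * M - I * M) ** 2).mean()                     # Eq. 6
    L_blend = ((I_hat - I) ** 2).mean()                             # Eq. 7
    L_per = ((vgg_face(I_hat) - vgg_face(I)) ** 2).mean()           # Eq. 8 (perceptual loss)
    return I_hat, L_local, L_blend, L_per
```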

3.2 Reflectance Decomposition

To enable rendering the talking portrait under novel illuminations, the reflectance and environmental lighting should be appropriately disentangled and estimated.

Lighting Following previous work (Ramamoorthi & Hanrahan, 2001; Barron & Malik, 2014; Basri & Jacobs, 2003; Wang et al., 2008; Shu et al., 2017; Zhou et al., 2019a; Hou et al., 2021), the environmental lighting \(L \in {\mathbb {R}}^{9}\) is represented as a 9-dimensional spherical harmonics coefficient vector. However, the lighting conditions of online talking videos are unknown, which makes inverse rendering difficult. Inspired by Relighting4D (Chen & Liu, 2022), we first initialize the lighting L from the front of the human face and then treat it as a trainable parameter to optimize. During inference, any given HDR lighting is converted to a 9-dimensional spherical harmonics coefficient vector.
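A standard way to perform this conversion is to project an equirectangular HDR map onto the first nine real spherical-harmonic basis functions (Ramamoorthi & Hanrahan, 2001); the sketch below is only illustrative, and the reduction to a single grayscale 9-D vector is an assumption.

```python
# A possible projection of an equirectangular HDR environment map onto the first nine
# (band 0-2) real spherical-harmonic coefficients used as the lighting vector L. The
# basis constants follow Ramamoorthi & Hanrahan (2001); the grayscale reduction and
# function names are assumptions for illustration.
import numpy as np

def sh9_basis(x, y, z):
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2)], axis=-1)

def hdr_to_sh9(env):                                        # env: (H, W, 3) HDR image
    H, W = env.shape[:2]
    theta = (np.arange(H) + 0.5) / H * np.pi                # polar angle per row
    phi = (np.arange(W) + 0.5) / W * 2.0 * np.pi            # azimuth per column
    theta, phi = np.meshgrid(theta, phi, indexing="ij")
    x = np.sin(theta) * np.cos(phi)
    y = np.sin(theta) * np.sin(phi)
    z = np.cos(theta)
    d_omega = (np.pi / H) * (2.0 * np.pi / W) * np.sin(theta)   # per-pixel solid angle
    radiance = env.mean(axis=-1)                            # single channel, since L is 9-D
    basis = sh9_basis(x, y, z)                              # (H, W, 9)
    return (radiance[..., None] * basis * d_omega[..., None]).sum(axis=(0, 1))   # (9,)
```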

Normal Map Although \({\hat{N}}\) provides a rough estimate of the portrait geometry, its deviation from the real geometry will be amplified during relighting. To further refine the geometry while still keeping the structure of \({\hat{N}}\), we use a network \(F_\text {normal}\) to predict a normal residual:

$$\begin{aligned} \delta N = F_\text {normal}({\hat{I}}, {\hat{N}}). \end{aligned}$$
(9)

We add an \(L_1\) regularization on \(\delta N\), i.e., \({\mathcal {L}}_{\delta N}=\Vert \delta N\Vert _1\). The final predicted normal N is \(N = {\hat{N}} + \delta N\).

Shading Map Given the normal N and lighting L, we can calculate the shading map \(S_\text {shad}\) using a network \(F_\text {shad}\) conditioned on the normal and lighting:

$$\begin{aligned} S_\text {shad} = F_\text {shad}(N, L). \end{aligned}$$
(10)
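While \(F_\text {shad}\) is realized as a learned network in our pipeline, a common analytic reference for this step is to weight the 9-D SH lighting by the clamped-cosine kernel (Ramamoorthi & Hanrahan, 2001) and evaluate it at every pixel normal. The sketch below shows this reference formulation, not necessarily the exact form of \(F_\text {shad}\).

```python
# An analytic reference for diffuse shading from a per-pixel normal and a 9-D SH lighting
# vector: weight the SH coefficients by the clamped-cosine kernel (pi, 2*pi/3, pi/4 for
# bands 0-2) and evaluate the basis at each normal. The paper's F_shad is learned; this
# sketch only illustrates one plausible realization.
import torch

A_HAT = torch.tensor([3.141593,
                      2.094395, 2.094395, 2.094395,
                      0.785398, 0.785398, 0.785398, 0.785398, 0.785398])

def sh_shading(N, L):                       # N: (B, 3, H, W) unit normals, L: (9,)
    x, y, z = N[:, 0], N[:, 1], N[:, 2]
    basis = torch.stack([
        0.282095 * torch.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2)], dim=1)   # (B, 9, H, W)
    weights = (A_HAT.to(N.device) * L).view(1, 9, 1, 1)
    return (weights * basis).sum(dim=1, keepdim=True).clamp(min=0.0)   # (B, 1, H, W)
```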

Albedo Map To represent the illumination-invariant base color of the human face, we use a network \(F_\text {albedo}\) to predict the albedo map A from the appearance:

$$\begin{aligned} A = F_\text {albedo}({\hat{I}}). \end{aligned}$$
(11)

Although there is no ground truth for the albedo in our setting, it is expected to satisfy two physical priors: smoothness and parsimony (Barron & Malik, 2014). Smoothness requires that variations in the albedo map be small and sparse. To achieve this, we use total variation regularization on the skin area:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\text{ smooth }}(A)&=\sum _{h=1}^{H} \sum _{w=1}^{W}\left\| \varvec{\beta }_{h+1, w}-\varvec{\beta }_{h, w}\right\| _2^2 \\&\quad +\sum _{h=1}^{H} \sum _{w=1}^{W}\left\| \varvec{\beta }_{h, w+1}-\varvec{\beta }_{h, w}\right\| _2^2, \end{aligned} \end{aligned}$$
(12)

where \(\varvec{\beta }_*\) are the values of albedo A within the skin area.
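A sketch of this smoothness term is given below; applying the skin mask by element-wise multiplication and averaging the squared differences are implementation assumptions.

```python
# A minimal sketch of the smoothness term in Eq. 12: squared differences between
# vertical and horizontal neighbors of the albedo, restricted to the skin region.
# Masking by multiplication and averaging are assumptions for illustration.
import torch

def smoothness_loss(A, skin_mask):          # A: (B, 3, H, W), skin_mask: (B, 1, H, W)
    Am = A * skin_mask                      # keep only skin-area albedo values
    dh = (Am[:, :, 1:, :] - Am[:, :, :-1, :]) ** 2   # vertical neighbors
    dw = (Am[:, :, :, 1:] - Am[:, :, :, :-1]) ** 2   # horizontal neighbors
    return dh.mean() + dw.mean()
```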

In addition to piece-wise smoothness, the second property we expect from the albedo map is parsimony, which means that the palette with which an albedo image is painted should be small. This property is imposed only as a soft constraint that encourages the palette to be sparse enough. For the parsimony prior, we penalize the network by minimizing the entropy of the albedo map (Chen & Liu, 2022):

$$\begin{aligned} {\mathcal {L}}_{\text {parsimony}} ={\mathbb {E}}[-\log (p(A))], \end{aligned}$$
(13)

where \(p(\cdot )\) is the probability density function (PDF). Since estimating the PDF of the continuous-valued albedo map A during training is difficult, we use Monte Carlo sampling with a soft Gaussian histogram over predefined bins to approximate the PDF of A.
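The following sketch illustrates one way to realize this soft-histogram entropy penalty; the bin count, Gaussian bandwidth, and number of Monte Carlo samples are assumptions.

```python
# A minimal sketch of the parsimony term in Eq. 13: sample albedo values, build a soft
# Gaussian histogram over predefined bins as a differentiable PDF estimate, and penalize
# its entropy. Bin count, bandwidth, and sample size are assumptions.
import torch

def parsimony_loss(A, n_bins=64, sigma=0.02, n_samples=4096):
    vals = A.flatten()
    idx = torch.randint(0, vals.numel(), (n_samples,), device=A.device)   # Monte Carlo samples
    samples = vals[idx]
    bins = torch.linspace(0.0, 1.0, n_bins, device=A.device)              # predefined bins
    w = torch.exp(-0.5 * ((samples[:, None] - bins[None, :]) / sigma) ** 2)  # soft assignment
    p = w.sum(dim=0)
    p = p / (p.sum() + 1e-8)                                              # normalized histogram
    return -(p * torch.log(p + 1e-8)).sum()                               # entropy to minimize
```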

Specular Map Prior works (He et al., 2016; Shu et al., 2017) for portrait relighting, especially those requiring no One-Light-at-A-Time (OLAT) data, employ only simple diffuse lighting to model the human face. However, since specular highlights widely appear on human faces, modeling specular effects is key to photorealistic rendering. Therefore, we leverage the Blinn–Phong model (1977) to incorporate a specular component:

$$\begin{aligned} R_\text {spec}\left( N, \omega _i, \omega _o\right) = \frac{s+2}{2 \pi }\left( h\left( \omega _i, \omega _o\right) \cdot N \right) ^s, \end{aligned}$$
(14)

where \(h(\omega _i, \omega _o) = \text {normalize}(\omega _i + \omega _o)\), and s is the Phong exponent that controls the apparent smoothness of the surface. Then the specular map \(S_\text {spec}\) can be calculated by accumulating \(R_\text {spec}\left( N, \omega _i, \omega _o\right) \) under illumination from different directions:

$$\begin{aligned} S_\text {spec} = F_\text {spec}(N, L) = \sum _{\omega _i}(L(\omega _i) \odot R_\text {spec}\left( N, \omega _i, \omega _o\right) ), \end{aligned}$$
(15)

in which \(\omega _o\) always points toward the front in this paper. In experiments, we also find that the specular component produced by the Blinn–Phong model never perfectly aligns with the real face in the video. Inspired by SunStage (Wang et al., 2022), we use another network \(F_\text {cspec}\) to predict a coefficient map \(C_\text {spec} \in {\mathbb {R}}^{1 \times H \times W}\) to flexibly adjust the final specular:

$$\begin{aligned} C_\text {spec} = F_\text {cspec}({\hat{I}}, N). \end{aligned}$$
(16)

For the coefficient map, we apply a TV loss mentioned in Eq. 12 to avoid checkerboard artifacts. Finally, we synthesize the video frame \({\tilde{I}}\) via image-based rendering:

$$\begin{aligned} F_\text {render}: {\tilde{I}} = A\odot (S_\text {shad}+C_\text {spec} \odot S_\text {spec}), \end{aligned}$$
(17)

where \(\odot \) denotes the element-wise product. The training objective is the reconstruction loss against the input frames:

$$\begin{aligned} {\mathcal {L}}^\text {render}_\text {rgb} = \Vert {\tilde{I}} - I \Vert _2^2. \end{aligned}$$
(18)
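For reference, the specular accumulation of Eqs. 14-15 and the composite of Eq. 17 can be sketched as a direct discretization with sampled incident directions; the Fibonacci-lattice sampling, the sample count, and the Phong exponent value are assumptions rather than our exact implementation.

```python
# A minimal sketch of the Blinn-Phong specular accumulation (Eqs. 14-15) and the final
# composite (Eq. 17). Fibonacci-lattice direction sampling, the sample count, and the
# Phong exponent are assumptions; the viewer direction w_o points toward the front.
import math
import torch
import torch.nn.functional as F

def sh9_basis_t(d):                          # d: (n, 3) unit directions -> (n, 9)
    x, y, z = d[:, 0], d[:, 1], d[:, 2]
    return torch.stack([
        0.282095 * torch.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2)], dim=-1)

def fibonacci_sphere(n, device):
    i = torch.arange(n, device=device) + 0.5
    z = 1.0 - 2.0 * i / n
    r = torch.sqrt(torch.clamp(1.0 - z ** 2, min=0.0))
    phi = math.pi * (1.0 + 5.0 ** 0.5) * i
    return torch.stack([r * torch.cos(phi), r * torch.sin(phi), z], dim=-1)   # (n, 3)

def specular_map(N, L, s=32.0, n_dirs=128):
    # N: (B, 3, H, W) unit normals; L: (9,) SH lighting vector.
    dirs = fibonacci_sphere(n_dirs, N.device)                 # incident directions w_i
    w_o = torch.tensor([0.0, 0.0, 1.0], device=N.device)      # frontal viewer
    h = F.normalize(dirs + w_o, dim=-1)                       # half vectors h(w_i, w_o)
    radiance = (sh9_basis_t(dirs) * L).sum(-1).clamp(min=0.0) # per-direction radiance L(w_i)
    n_dot_h = torch.einsum('bchw,dc->bdhw', N, h).clamp(min=0.0)
    phong = (s + 2.0) / (2.0 * math.pi) * n_dot_h ** s        # Eq. 14
    d_omega = 4.0 * math.pi / n_dirs                          # uniform-sphere quadrature weight
    return (radiance.view(1, -1, 1, 1) * phong).sum(1, keepdim=True) * d_omega   # (B, 1, H, W)

def render(A, S_shad, C_spec, S_spec):
    return A * (S_shad + C_spec * S_spec)                     # Eq. 17
```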

Identity-Consistent Supervision with Relighting The coarse renderer \(F_\text {coarse}\) synthesizes RGB pixel values according to the normal N. Without multi-view supervision, a face area under highlights will be treated as a raised region even if it is originally smooth, leading to artifacts during relighting. To address this issue, we propose identity-consistent supervision with simulated multiple lighting conditions, which is performed on-the-fly during training. We assume that a well-trained face recognition network extracts similar embeddings when the lighting condition varies. Therefore, after the decomposition has nearly converged, we randomly sample a new lighting condition and enforce identity consistency between the two rendered images under different lighting conditions:

$$\begin{aligned} {\mathcal {L}}_\text {consistent} = \Vert E_\text {id}(I^\text {relight}) - E_\text {id}(I) \Vert _2^2, \end{aligned}$$
(19)

where \(I^\text {relight}\) is the rendering under the randomly sampled lighting, and \(E_\text {id}\) is the embedding extracted by a pre-trained face recognition network (Schroff et al., 2015).
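This supervision can be sketched as follows; the random SH lighting sampler and the use of the facenet-pytorch implementation of FaceNet (Schroff et al., 2015) are illustrative assumptions, and sh_shading / specular_map refer to the earlier sketches.

```python
# A minimal sketch of identity-consistent supervision (Eq. 19): sample a new lighting,
# re-render the portrait with the same decomposed maps, and match face-recognition
# embeddings. The SH sampler and the facenet-pytorch embedder are assumptions; the
# frames are assumed already resized/normalized as the embedder expects.
import torch
from facenet_pytorch import InceptionResnetV1

E_id = InceptionResnetV1(pretrained='vggface2').eval()   # pre-trained face recognition net

def sample_sh_lighting(device, ambient=0.6, scale=0.4):
    L_new = scale * torch.randn(9, device=device)         # random directional component
    L_new[0] = ambient + L_new[0].abs()                   # keep a positive ambient term
    return L_new

def identity_consistent_loss(A, N, C_spec, I):
    L_new = sample_sh_lighting(I.device)
    S_shad = sh_shading(N, L_new)                         # diffuse shading under new light
    S_spec = specular_map(N, L_new)                       # specular map under new light
    I_relight = A * (S_shad + C_spec * S_spec)            # Eq. 17 with the sampled lighting
    with torch.no_grad():
        emb_ref = E_id(I)                                 # embedding of the original frame
    emb_rel = E_id(I_relight)                             # gradients flow through relit frame
    return ((emb_rel - emb_ref) ** 2).mean()              # Eq. 19
```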

3.3 Optimization and Inference

During the training phase, training the entire framework directly may cause the networks to converge to locally optimal results for each decomposed map, as there is no ground truth available for any component of the reflectance decomposition. Therefore, we first train the networks for audio-driven synthesis to learn a rough expression- and pose-related geometry.

Yet, there is no off-the-shelf ground truth for the normal map N to supervise geometry refinement. To address this, a coarse renderer \(F_\text {coarse}\) is used to predict the RGB result \({\hat{I}}\) conditioned on the normal N, supervised by an image reconstruction loss. This self-supervised learning process encourages the normal map to capture more details of the surface shape without requiring extra annotations. After a rough normal map N is obtained, it is combined with the RGB portrait image as input to the reflectance decomposition for further optimization, which also stabilizes the decomposition.

The overall loss is:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}&= \lambda ^\text {local}_\text {rgb} {\mathcal {L}}^\text {local}_\text {rgb} + \lambda ^\text {blend}_\text {rgb} {\mathcal {L}}^\text {blend}_\text {rgb} + \lambda ^\text {blend}_\text {per} {\mathcal {L}}^\text {blend}_\text {per} \\&\quad + \lambda ^\text {render}_\text {rgb} {\mathcal {L}}^\text {render}_\text {rgb} + \lambda _{\delta N} {\mathcal {L}}_{\delta N} + \lambda _\text {consistent} {\mathcal {L}}_\text {consistent} \\&\quad + \lambda _\text {exp} {\mathcal {L}}_\text {exp} + \lambda _{\text {parsimony}} {\mathcal {L}}_{\text {parsimony}} + \lambda ^\text {total}_\text {smooth} {\mathcal {L}}^\text {total}_\text {smooth}, \end{aligned} \end{aligned}$$
(20)

where \(\lambda \)’s are the weights and are set to 1, 1, 100, 1, 1, 3, 1, 0.001, and 1 respectively. Here \({\mathcal {L}}^\text {total}_\text {smooth}\) is similar to Eq. 12 but is added to all decomposed maps:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^\text {total}_\text {smooth}&= {\mathcal {L}}_{\text{ smooth }}(A) + {\mathcal {L}}_{\text{ smooth }}(N) \\&\quad + {\mathcal {L}}_{\text{ smooth }}(R_\text {spec}) + {\mathcal {L}}_{\text{ smooth }}(C_\text {spec}). \end{aligned} \end{aligned}$$
(21)
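Putting the terms together, the overall objective can be assembled as below, using the weights listed above; the dictionary keys are hypothetical names for the individual loss terms computed elsewhere.

```python
# A minimal sketch of the overall objective in Eq. 20 with the weights stated above.
# The keys of `losses` are hypothetical names for terms computed elsewhere in training.
WEIGHTS = {
    'local_rgb': 1.0, 'blend_rgb': 1.0, 'blend_per': 100.0,
    'render_rgb': 1.0, 'delta_N': 1.0, 'consistent': 3.0,
    'exp': 1.0, 'parsimony': 0.001, 'total_smooth': 1.0,
}

def total_loss(losses):
    return sum(WEIGHTS[k] * losses[k] for k in WEIGHTS)
```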

In the inference phase, new audio will drive the portrait by controlling expression and pose coefficients. Meanwhile, desired illuminations will replace the learned light L of the original video to relight the whole video, thus seamlessly harmonizing with the background.

Fig. 3

Details of Our Proposed Framework. Our framework decomposes the video I into a set of normal N, albedo A, shading \(S_\text {shad}\), and specular \(S_\text {spec}\) maps. Specifically, we neurally model the expression- and pose-related geometry of human heads based on the FLAME model (Li et al., 2017). Then, the reflectance components are decomposed via multiple carefully designed priors (Sect. 3.2). With the well-disentangled geometry and reflectance, we use audio from the user to drive the human portrait by controlling expression and pose coefficients, then render it with any desired illuminations, which seamlessly harmonizes with the background

4 Experiments

4.1 Implementation Details

Network Architecture In audio-driven synthesis, the lip-related feature encoder \(E_\text {lip}\) and the phoneme-related feature encoder \(E_\text {pho}\) both use 1D convolutional neural networks. The decoder \(F_\text {exp}\) is also a 1D convolutional neural network, but with a self-attention mechanism (Zhang et al., 2019) to predict pose and expression coefficients from 8 adjacent frames. For the local network \(F_\text {local}\), we use a ResNet (He et al., 2016) with 6 residual blocks, while for the blending network \(F_\text {blend}\), we use a U-Net of depth 5 with dilated convolutions (Thies et al., 2020). To obtain coherent results, we adjust the mask size to leave some missing area between the generated lip area and the given background area, which will be inpainted by \(F_\text {blend}\). As shown in Fig. 3, we choose the \(80\times 80\) region around the mouth as the Driven Area and remove the \(120\times 120\) region around the mouth to form the Existing Area. Here we also add the lip area generated by Wav2Lip (Prajwal et al., 2020) as an additional input of \(F_\text {blend}\) to improve performance when the input audio is far from the audio used in training (i.e. audio from a new person).

For reflectance decomposition, we choose a U-Net (Isola et al., 2017) of depth 8 as the architecture of the specular weight predictor \(F_\text {cspec}\). To obtain smoother albedo maps and normal residuals, we choose a ResNet (He et al., 2016) with 6 residual blocks as the architecture of the albedo predictor \(F_\text {albedo}\) and the normal residual predictor \(F_\text {normal}\).

Running Time We conduct our experiments on a single NVIDIA V100 GPU. Training takes around 1 day per person, and inference takes around 0.6 s per frame.

4.2 Dataset

Real Video Data AD-NeRF (Guo et al., 2021) and HDTF (Zhang et al., 2021c) collect several high-resolution talking videos in different scenes to better evaluate generation performance. Following this practice, we choose celebrity videos from YouTube whose protagonists are news anchors, entrepreneurs, or presidents as our real video set. We collect 8 public videos with an average length of 3 min from 7 identities. We split each video into around \(80\%\) of the frames for training and \(20\%\) for evaluation. These videos are all available online, and we will provide the corresponding source links for reproduction purposes.

Synthetic Video Data Talking videos with ground-truth illuminations are not available from online collections. To evaluate our relighting algorithm quantitatively, we synthesize talking videos with the same motion sequence but different lighting conditions using a modern graphics pipeline, as shown in Fig. 4. Specifically, we render 6 sequences (2 min, 25 fps), one for each person, in 10 different lighting conditions with the Cycles renderer (Hess, 2013) in Blender (Community, 2018), a photorealistic ray-tracing renderer. All mesh models, textures, and displacement maps are released by FaceScape (Yang et al., 2020; Zhu et al., 2021). We combine displacement maps and textures in a physically-based skin material featuring sub-surface scattering (Christensen, 2015) for photo-realistic rendering. We drive our head models with expression coefficients and head rotation angles estimated from our own recorded talking videos.

Fig. 4

Visualization of Synthetic Data. We render 6 sequences (2 min, 25 fps), one for each person, in 10 different lighting conditions with the Cycles renderer (Hess, 2013) in Blender (Community, 2018)

Fig. 5

Qualitative Comparison of Real Video Driving. Our method successfully drives the motion of the lips. Compared to AD-IMAvatar (Zheng et al., 2022), AD-NeRF (Guo et al., 2021), and DFRF (Shen et al., 2022), our generated lip motion is closer to the ground truth (zoom in for a better view)

4.3 Evaluation Metrics

For evaluation metrics, we report peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and perceptual similarity (LPIPS) (Zhang et al., 2018) to measure the quality of the generated results. For datasets, we collect 8 talking portrait videos from YouTube with an average length of around 3 min as our real video set (most are used in AD-NeRF (Guo et al., 2021) or HDTF (Zhang et al., 2021c)) and additionally render synthetic videos of 6 persons with an average length of around 2 min for quantitative comparison. More details are provided in our supplementary materials. To measure audio-driven accuracy, we further report SyncNet confidence (Chung & Zisserman, 2017) to evaluate audio-visual synchronization.
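The image-quality metrics can be computed with standard packages; the sketch below uses scikit-image for PSNR/SSIM and the lpips package for LPIPS, with color-range handling as an assumption, while SyncNet confidence requires its own pipeline and is omitted.

```python
# A minimal sketch of the image-quality metrics: PSNR/SSIM via scikit-image and LPIPS
# via the lpips package. Color-range handling is an assumption; SyncNet confidence
# needs the separate SyncNet pipeline and is not shown here.
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')

def evaluate(pred, gt):                      # pred, gt: (H, W, 3) uint8 numpy arrays
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```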

4.4 Qualitative Comparison

Currently, there is no unified framework for relightable audio-driven talking portrait generation. Therefore, we first compare our method with audio animation methods and relighting methods separately, then trivially combine two existing frameworks as the baseline of relightable audio-driven talking portrait generation.

Comparison on Audio-Driven Talk In this work, we focus on personalized animation, which only uses one video for training. We choose two representative personalized audio animation methods, AD-NeRF (Guo et al., 2021) and DFRF (Shen et al., 2022). We also modify the FLAME-based method IMAvatar (Zheng et al., 2022) to obtain a simple audio-driven version, AD-IMAvatar, as an additional baseline. In Fig. 5, AD-IMAvatar generates only coarse talking portraits with blurry teeth areas, and AD-NeRF is prone to producing artifacts at the junction of the neck and head. Compared to the results of AD-NeRF and DFRF, our generated lip motion is closer to the ground truth. Notably, our framework succeeds in generating clear teeth areas.

Comparison on Relighting In this paper, we compare our relighting performance with five advanced methods. DPR (Zhou et al., 2019a), SMFR (Hou et al., 2021), and GCFR (Hou et al., 2022) are trained on publicly available data and release their models. Since almost no One-Light-at-A-Time (OLAT) based methods release their code or models, we requested the authors of SIPR-W (Wang et al., 2020) and TR (Pandey et al., 2021) to run inference with their models on our provided inputs and use the returned results for comparison (Fig. 6).

As presented in Fig. 7, although both DPR and SMFR are able to reflect the given lighting conditions on generated images when a simple directional light is given, their generated portraits lack the specular texture of a real human face. This is mainly because they do not model specular reflection, which is a significant and noticeable feature of the human face. Meanwhile, the recent method GCFR tends to produce unnatural shadows. In contrast, ReliTalk renders realistic human portraits with preserved facial details.

Fig. 6

Qualitative Comparisons of Real Video Relighting. We compare our method against DPR (Zhou et al., 2019a), SMFR (Hou et al., 2021), SIPR-W (Wang et al., 2020) and TR (Pandey et al., 2021). ReliTalk renders human portraits with high fidelity even under complex lighting from HDR environment maps

Additionally, when complex lighting optimized from HDR images is used (as shown in Fig. 6), DPR and SMFR tend to produce unsatisfactory results, many of which are not relevant to the given lighting conditions, and SMFR even fails to reconstruct some faces. SIPR-W generates relighted results whose color is similar to the background but cannot reflect the varied lighting on the face. Although TR succeeds in generating some vivid relighted results, it loses some facial details and also mildly hurts the original identity. In contrast, our framework performs well on both types of lighting and successfully renders the specular texture of the human face. This enables our generated avatar to blend seamlessly with various backgrounds, as long as HDR data of the background is available, by matching the shading and lighting of the avatar to those of the background.

4.5 Quantitative Comparison

Evaluation on Audio-Driven Talk As shown in Table 1, we compare our method with AD-IMAvatar, AD-NeRF, and DFRF. Among these baselines, DFRF achieves comparable performance in PSNR, SSIM, and LPIPS, while its SyncNet confidence is slightly lower than that of AD-NeRF. Our method significantly outperforms all baselines on all metrics.

Fig. 7

Qualitative Comparisons Under Directional Lights. We compare our method against three baseline methods, DPR (Zhou et al., 2019a), SMFR (Hou et al., 2021) and GCFR (Hou et al., 2022), for portrait relighting under directional light

Table 1 Quantitative results of audio-driven real videos

Although our method uses some portrait areas from existing frames, both AD-NeRF and DFRF use pose parameters from the existing sequence; similarly, DFRF only generates the remaining face area with the neck and collar given. Therefore, for a fair comparison, we recalculate PSNR, SSIM, and LPIPS purely in the driven area (\(120\times 120\) resolution). As shown in the brackets of Table 1, our method still significantly outperforms all baselines on all metrics.

Evaluation on Synthetic Relighting Dataset As shown in Table 2, we achieve the highest PSNR and SSIM on the synthetic dataset, and both are significantly higher than those of the other two methods, which indicates that our generated images are closer to the ground truth. Meanwhile, the lowest LPIPS also shows that our results have the highest perceptual quality. It is notable that SMFR almost fails on our synthetic video dataset, which is perhaps caused by the distribution gap between our synthesized video data and real data. As a result, our ReliTalk outperforms DPR and SMFR both qualitatively and quantitatively.

In terms of practicality, DPR and SMFR need a pre-collected face image dataset to train their networks. At the inference stage, a new lighting or a new portrait that is out of the training distribution will significantly hurt their performance. While our method may not be able to handle all persons within a single model, it is still practical because the training data, a short talking portrait video, is readily available and easy to obtain.

4.6 Ablation of Core Modules

Effects of Mesh-Aware Guidance Mesh-aware guidance is used to assist lip-syncing. We use a model pre-trained on the VOCA dataset (Cudeiro et al., 2019b) to obtain lip-related meshes as additional guidance. Phoneme-related features and lip-related meshes are separately encoded and then concatenated to generate pose and expression coefficients. As shown in Table 3, mesh-aware guidance improves prediction accuracy significantly. In addition, we find that the improvement is substantial (PSNR increases from 29.5445 to 34.8186) when the training video is too short to cover enough phonemes (2450 training frames), while the improvement is mild for the long video (6500 training frames). This implies that mesh-aware features offer the network approximate information about the movements of the lips, such as opening and closing, even when the input audio is significantly different from the audio used during training.

Table 2 Quantitative results of synthetic video relighting
Table 3 Ablation results of Mesh-Aware Guidance
Fig. 8

Finer Normal under Identity-Consistent Supervision with Relighting. Without ICS, the normal map estimation may contain severe artifacts that cause unrealistic rendering given novel illuminations (zoom in for clearer results)

Fig. 9

Ablation of Reflectance Decomposition. (a) Without initial normal, (b) without normal residual, (c) without parsimony prior, (d) without smoothness constraints, (e) without specular weight, (f) without specular map, (g) our final method. Our final method gains the most vivid relighting results

Effects of Identity-Consistent Supervision Identity-consistent supervision with relighting is employed to lessen the influence of lacking multi-view information. We visualize its effects in Fig. 8. Instead of a well-structured normal, the network prefers to generate an irregular one whose surface varies with the change of color on the face, because this is an easier mapping for the coarse renderer. However, those irregular areas become very noticeable when a different lighting is given (left of Fig. 8). Under identity-consistent supervision, such a distorted face is hard to recognize as the same person, urging the network to produce a well-structured normal that yields reasonable relighting results under lighting from various directions (right of Fig. 8). To quantitatively evaluate the improvements of identity-consistent supervision, we calculate metrics in the image space of the normal map. Here we propose \(\text {PSNR}_\text {grad}\), which is computed on the 2D gradient in the image space, to jointly measure the normal quality. As shown in Table 4, although the normal refined by ICS does not achieve a higher PSNR, its \(\text {PSNR}_\text {grad}\) is significantly higher, which means that it has a better surface shape. The higher SSIM also proves the effectiveness of our method.

Table 4 Ablation Results of Identity-Consistent Supervision.
Fig. 10

Ablation of Reflectance Decomposition. (a) Without initial normal, (b) without normal residual, (c) without parsimony prior, (d) without smoothness constraints, (e) without specular weight, (f) without specular map, (g) our final method. Our final method gains the most vivid relighting results

Fig. 11

Ablation of Reflectance Decomposition. (a) Without initial normal, (b) without normal residual, (c) without parsimony prior, (d) without smoothness constraints, (e) without specular weight, (f) without specular map, (g) our final method. Our final method gains the most vivid relighting results

4.7 Ablation of Reflectance Decomposition

To avoid relying on knowledge from expensive capture data (i.e. Light Stage (Debevec et al., 2000)), we decompose the human portrait into the corresponding maps from monocular videos through several careful designs.

Initial Normal Different from previous audio-driven generation methods, which only generate the final portrait image, our audio-to-geometry mapping also provides a good initial normal estimation. As shown in row (a) of Figs. 9, 10, 11, the reflectance decomposition cannot converge properly without the initial normal estimation because the normal map lacks constraints.

Normal Residual The initial normal is not accurate because we have neither the ground truth of the normal map nor multi-view information of the portrait. Without the normal residual, the resulting irregular areas become very noticeable when new lighting is given (row (b) of Fig. 9).

Parsimony Prior Parsimony means that the palette with which an albedo image is painted should be small. Without the parsimony prior, the albedo absorbs more details while normal details are reduced, which is clearly reflected in the shading map of row (c) (Fig. 9). Therefore, the final relighting result is less vivid.

Smoothness Constraints Without smoothness constraints, some decomposed maps overfit the training data. As shown in row (d) of Fig. 9, the predicted specular weight is chaotic, reducing the vividness of the relighting result.

Specular Weight We use a network \(F_\text {cspec}\) to predict a specular weight \(C_\text {spec}\) for flexibly adjusting the final specular. As shown in row (e) of Fig. 9, the relighting result is blurred without this design.

Specular Map Prior works (He et al., 2016; Shu et al., 2017) for portrait relighting only employ simple diffuse lighting to model the human face. However, ignoring the specular phenomena that widely appear on human faces makes the final rendering result less photo-realistic (row (f) of Fig. 9).

4.8 Ablation of Training Frames

We also conduct a convergence ablation over the number of training frames. As shown in Fig. 12, 750 training frames (a clip of 30 s) are enough for basic relighting performance, but more training frames bring better results. In this example, the unnatural division between hair and forehead gradually disappears when more training frames are used.

Fig. 12

Ablation of Training Frames. a 750 training frames, b 1500 training frames, c 3000 training frames, d 6000 training frames. 750 training frames (a clip of 30 s) are enough for basic relighting performance, but more training frames bring better results. In this example, the unnatural division between hair and forehead gradually disappears when more training frames are used

5 Conclusion

We propose ReliTalk, a novel framework for relightable audio-driven talking portrait generation that requires only an easily accessible monocular portrait video as input, whereas previous light-stage-based methods do not publicly release their data or code. Our method decomposes the human portrait into well-disentangled geometry and reflectance, which are also expression- and pose-related. During inference, we use audio from the user to drive the human portrait by controlling expression and pose coefficients, then render it with any desired illumination, seamlessly harmonizing it with the background.

However, there are still some limitations of our relighting model. (1) We only consider one-bounce direct environment light, and thus our method cannot handle furry appearances, such as beards and long hair. (2) Our method assumes that the appearance of the human face does not change throughout the entire video. Therefore, actions that change the appearance, such as putting on glasses or hats, would cause inaccurate estimation of the reflectance.

In the future, we want to design a more realistic physical model that can take into account various complex lighting conditions.

Societal Impacts Our code is publicly released to promote further research. Users only need to input a talking video of the target person and are then able to freely generate a talking portrait with the desired audio and background. Although this increases the risk of forged videos, our approach also provides a new type of forged sample for researchers to improve defense methods.