
1 Introduction

Recently, facial animation has received considerable attention from the industrial graphics, gaming and film communities. The face is capable of imparting a wide range of information, not only about the subject's emotional state but also about the tension of the moment in general. An engaging and crucial task is 3D avatar animation, which has lately become feasible [34]. With modern technology, a 3D avatar can be generated from a single uncalibrated camera [9] or even from a self-portrait image [33]. At the same time, capturing facial expressions is an important task for perceiving the behaviour and emotions of people. To tackle the problem of facial expression generation, it is essential to understand and model the facial muscle activations that are related to various emotions. Several studies have attempted to decompose facial expressions in two-dimensional domains such as images and videos [15, 31, 44, 50]. However, modeling facial expressions on high-resolution 3D meshes remains unexplored.

In contrast, a few studies have attempted 3D speech-driven facial animation based exclusively on vocal audio and identity information [13, 21]. Nevertheless, the emotional reactions of a subject are not always expressed vocally, and speech-driven facial animation approaches neglect the importance of facial expressions. For instance, sadness and happiness are two very common emotions that are conveyed mainly through facial deformations. To this end, facial expressions are a major component of the entertainment industry and can convey the emotional state of both the scene and the identity.

People signify their emotions using facial expressions in similar manners. For instance, people express their happiness through mouth and cheek deformations that vary according to the subject's emotional state and characteristics. Thus, one can describe expressions as “unimodal” distributions [15], with gradual changes from the neutral state to the apex state. Similarly to speech signals, emotion expressions are highly correlated with facial motion, but the two lie in different domains. Modeling the relation between these two domains is essential for the task of realistic facial animation. However, in order to disentangle identity information and facial expression, it is essential to have a sufficient amount of data. Although most of the publicly available 3D datasets contain a large variety of facial expressions, they are captured from only a few subjects. Due to this difficulty, prior work has focused only on generating expressions in 2D.

Our aim is to generate realistic 3D facial animation given a target expression and a static neutral face. Synthesis of facial expressions on new subjects can be achieved by transferring generalized deformations [39, 50]. In order to produce realistic expressions, we map and model facial animation directly on the mesh space, rather than focusing on specific face landmarks. Specifically, the proposed method comprises two parts: (a) a recurrent LSTM encoder that projects the expected expression motion onto an expression latent space, and (b) a mesh decoder that decodes each latent time-sample into a mesh deformation, which is added to the neutral-expression identity mesh. The mesh decoder utilizes the intrinsic lightweight mesh convolutions introduced in [6], along with unpooling operations that act directly on the mesh space [35]. We train our model in an end-to-end fashion on a large-scale 4D face dataset. The devised methodology tackles a novel and unexplored problem, i.e. the generation of 4D expressions given a single neutral-expression mesh. Both the desired length and the target expression are fully defined and controlled by the user. Our work considerably deviates from methods in the literature as it can be used to generate 4D full-face customised expressions in real time. Finally, our study is the first 3D facial animation framework that utilizes an intrinsic encoder-decoder architecture operating directly on the mesh space using mesh convolutions instead of fully connected layers, as opposed to [13, 21].

2 Related Work

Facial Animation Generation. Following the progress of 3DMMs, several approaches have attempted to decouple the expression and identity subspaces and have built linear [2, 5, 45] and nonlinear [6, 25, 35] expression morphable models. However, all of the aforementioned studies focus on static 3D meshes and cannot model 3D facial motion. Recently, a few studies attempted to model the relation between speech and facial deformation for the task of 3D facial motion synthesis. Karras et al. [21] modeled speech formant relationships with 5K vertex positions, generating facial motion from LPC audio features. While this was the first approach to tackle facial motion directly on 3D meshes, the model is subject specific and cannot be generalized across different subjects. In the same direction, in [13], facial animation was generated from a static neutral template of the identity and a speech signal, processed with DeepSpeech [20] to obtain more robust speech features. A different approach was followed in [32], where 3D facial motion is generated by regressing on a set of action units, given MFCC audio features processed by RNN units. However, their model is trained on parameters extracted from 2D videos instead of 3D scans. Tzirakis et al. [41] combined predicted blendshape coefficients with a mean face to synthesize 3D facial motion from speech, also replacing the fully connected layers used in previous studies with an LSTM. Blendshape coefficients are also predicted from audio using attentive LSTMs in [40]. In contrast with the aforementioned studies, the proposed method models facial animation directly on 3D meshes. Furthermore, although blendshape coefficients may be easy to model, they rely on predefined face rigs, a factor that limits their generalization to new unseen subjects.

Geometric Deep Learning. Recently, the enormous number of applications involving data residing in non-Euclidean domains has motivated the generalization of several popular deep learning operations, such as convolution, to graphs and manifolds. The main efforts include the reformulation of regular convolution operators so that they can be applied to structures that lack a consistent ordering or directions, as well as the invention of pooling techniques for graph downsampling. All relevant endeavours lie within the new research area of Geometric Deep Learning (GDL) [7]. The first attempts defined convolution in the spectral domain, by applying filters inspired by graph signal processing techniques [37]. These methods mainly boil down to either an eigendecomposition of a Graph Shift Operator (GSO) [8], such as the graph Laplacian, or to approximations thereof, using polynomials [14, 23] or rational complex functions [24] of the GSO in order to obtain strict spatial localization and reduced computational complexity. Subsequent attempts generalize conventional CNNs by introducing patch operators that extract spatial relations of adjacent nodes within a local patch. To this end, several approaches generalized local patches to graph data, using geodesic polar charts [27] or anisotropic diffusion operators on manifolds [4] or graphs [3]. MoNet [28] generalized previous spatial approaches by learning the patches themselves with Gaussian kernels. In the same direction, SplineCNN [17] replaced Gaussian kernels with B-spline functions, with a significant speed advantage. Recent studies focused on soft-attention methods to weight adjacent nodes [42, 43]. However, in contrast to regular convolutions, the way permutation invariance is enforced in most of the aforementioned operators inevitably renders them unaware of vertex correspondences. To tackle this, Bouritsas et al. [6] defined local node orderings, instantiated with the spiral operator of [26], by exploiting the fixed underlying topology of certain deformable shapes, and built correspondence-aware anisotropic operators.

Facial Expression Datasets. Another major reason that 4D generative models have not been widely exploited is the limited number of 3D datasets. Over the past decade, several 3D face databases have been published. However, most of them are static [19, 29, 36, 38, 49], consist of few subjects [10, 12, 35, 46], or contain limited [1, 16, 36] or spontaneous expressions [47, 48], making them inappropriate for tasks such as facial expression synthesis. On the other hand, the recently proposed 4DFAB dataset [11] contains six dynamic 3D facial expressions from 180 subjects, which makes it ideal for subject-independent facial expression generation. In contrast with all previously mentioned datasets, 4DFAB, due to its large number of subjects, is a promising resource for disentangling facial expression from identity information.

3 Learnable Mesh Operators: Background

3.1 Spiral Convolution Networks

We define a 3D facial surface discretized as a triangular mesh \(\mathcal {M} = (\mathcal {V}, \mathcal {E},\mathcal {F})\), with \(\mathcal {V}\) the set of N vertices and \(\mathcal {E}\), \(\mathcal {F}\) the sets of edges and faces, respectively. Let also \(X \in \mathbb {R}^{N \times d}\) denote the feature matrix of the mesh. In contrast to regular domains, when applying convolution operators to graph-based structures there is no consistent way to order the input coordinates. However, in a fixed-topology setting such an ordering is beneficial, since it allows keeping track of the existing correspondences. In [6], the authors identified this problem and ordered the vertices using spiral trajectories [26]. In particular, given a vertex \(v \in \mathcal {V}\), we can define the k-ring and k-disk as:

$$\begin{aligned} ring^{(0)}(v) &= \{v\},\\ ring^{(k+1)}(v) &= \mathcal {N}(ring^{(k)}(v)) \setminus disk^{(k)}(v), \\ disk^{(k)}(v) &= \bigcup \limits _{i=0,\dots ,k} ring^{(i)}(v) \end{aligned}$$
(1)

where \(\mathcal {N}(S)\) is the set of all vertices adjacent to at least one vertex in \(S\).

Once the \(ring^{(k)}\) is defined, the spiral trajectory centered around vertex v can be defined as:

$$\begin{aligned} S(v,k) = \{ring^{(0)}(v), ring^{(1)}(v), \dots , ring^{(k)}(v)\} \end{aligned}$$
(2)

To be consistent across all vertices, one can pad or truncate \(S(v,k)\) to a fixed length L. To fully define the spiral ordering, we also have to declare the starting direction and the orientation of the spiral sequence. In the current study, we adopt the settings of [6] and select the initial vertex of \(S(v,k)\) to lie in the direction of the shortest geodesic path between v and a static reference vertex. Given that all 3D faces share the same topology, the spiral ordering \(S(v,k)\) is the same across all meshes and therefore needs to be computed only once. With all of the above, spiral convolution can be defined as:

$$\begin{aligned} \mathbf {f}^*_{v} = \sum _{j=0}^{|S(v,k)|-1} \mathbf {f}(S_j(v,k)) \mathbf {W}_j \end{aligned}$$
(3)

where \(|S(v,k)|\) denotes the total length of the spiral trajectory, \(\mathbf {f}(S_j(v,k))\) are the d-dimensional input features of the jth vertex of the spiral trajectory, \(\mathbf {f}^*_v\) the respective output, and \(\mathbf {W}_j\) are the filter weights.
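For concreteness, the following is a minimal PyTorch sketch of how the spiral convolution of Eq. (3) could be implemented, assuming the spiral trajectories have already been precomputed, padded or truncated to a fixed length L, and stored as an integer index matrix. The class and variable names (e.g. `SpiralConv`, `spiral_indices`) are our own illustrative choices, not part of the original implementation.

```python
import torch
import torch.nn as nn

class SpiralConv(nn.Module):
    """Sketch of the spiral convolution of Eq. (3).

    `spiral_indices` is a precomputed LongTensor of shape (N, L) holding,
    for every vertex, the indices of the L vertices of its padded or
    truncated spiral trajectory S(v, k).
    """
    def __init__(self, in_dim, out_dim, spiral_indices):
        super().__init__()
        self.register_buffer("spiral_indices", spiral_indices)   # (N, L)
        n_vertices, spiral_len = spiral_indices.shape
        # One weight matrix W_j per spiral position, realised as a single
        # linear layer over the concatenated features of the spiral.
        self.linear = nn.Linear(spiral_len * in_dim, out_dim)

    def forward(self, x):
        # x: (batch, N, in_dim) vertex features
        batch, n_vertices, in_dim = x.shape
        idx = self.spiral_indices.reshape(-1)             # (N * L,)
        neigh = x[:, idx, :]                              # gather spiral neighbours
        neigh = neigh.reshape(batch, n_vertices, -1)      # (batch, N, L * in_dim)
        return self.linear(neigh)                         # (batch, N, out_dim)
```

Since the topology is fixed, the gather indices are shared by all meshes and all layers operating at the same resolution, which keeps the operator lightweight.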

3.2 Mesh Unpooling Operations

In order to let our graph convolutional decoder generate faces sampled from a latent space, it is essential to use unpooling operations, in analogy with transposed convolutions in regular settings. Each graph convolution is followed by an upsampling layer that acts directly on the mesh space by increasing the number of vertices. We use the sampling operations introduced in [35], based on sparse matrix multiplications with upsampling matrices \(Q_u \in \mathbb {R}^{m\times n}\), where \(m>n\). Since the upsampling operation changes the topology of the mesh, and in order to retain the face structure, the upsampling matrices \(Q_u\) are defined on the basis of the corresponding down-sampling matrices. The barycentric coordinates of the vertices discarded during the downsampling procedure are stored and used to recover these vertices during upsampling.
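As an illustration, a single unpooling step could be sketched as follows in PyTorch. We assume the precomputed upsampling matrix is stored as a sparse tensor of shape (m, n); this shape convention and the function name are ours, not details taken from [35].

```python
import torch

def upsample(x, q_up):
    """Mesh unpooling via sparse matrix multiplication.

    x    : (batch, n, d) vertex features on the coarse mesh
    q_up : sparse (m, n) upsampling matrix (m > n) whose rows hold the
           barycentric weights stored during downsampling
    Returns (batch, m, d) features on the finer mesh.
    """
    return torch.stack([torch.sparse.mm(q_up, x_i) for x_i in x])
```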

4 Model

The overall architecture of our model consists of two major components (see Fig. 1). The first is a temporal encoder, using an LSTM layer that encodes the expected facial motion of the target expression. It takes as input a temporal signal \(e \in \mathbb{R}^{6 \times T}\) whose length T equals that of the target facial expression and which carries information about the time-stamps at which the generated facial expression should reach its onset, apex and offset modes. Each time-frame of the signal e can be characterised as a one-hot encoding of one of the six expressions, with an amplitude that indicates the scale of the expression. The second component of our network is a frame decoder with four layers of mesh convolutions, each followed by an upsampling layer. Each upsampling layer increases the number of vertices by a factor of five, and every mesh convolution is followed by a ReLU activation [30]. Finally, the output of the decoder is added to the identity's neutral face. Given a time sample from the latent space, the frame decoder network models the expected deformations on the neutral face. Each output time frame can be expressed as:

$$\begin{aligned} \begin{array}{c} \hat{x}_t = D(z_t) + x_{id}, \\ z_t = E(e_t) \end{array} \end{aligned}$$
(4)

where \(D(\cdot )\) denotes the mesh decoder network, \(E(\cdot )\) the LSTM encoder, \(e_t\) the facial motion information for time-frame t, and \(x_{id}\) the neutral face of the identity. The network details can be found in Table 1. We trained our model for 100 epochs with a learning rate of 0.001, decayed by a factor of 0.99 every epoch. We used the Adam optimizer [22] with a weight decay of 5e−5.
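A minimal sketch of this encoder-decoder, following Eq. (4), is given below. `mesh_decoder` stands in for the spiral-convolution/unpooling stack of Table 1 and is assumed rather than reproduced; all names are illustrative.

```python
import torch
import torch.nn as nn

class ExpressionGenerator(nn.Module):
    """Sketch of Eq. (4): x_hat_t = D(z_t) + x_id with z_t = E(e_t).

    `mesh_decoder` is assumed to map (batch, latent_dim) -> (batch, N, 3).
    """
    def __init__(self, mesh_decoder, latent_dim=64, n_expressions=6):
        super().__init__()
        self.encoder = nn.LSTM(input_size=n_expressions,
                               hidden_size=latent_dim,
                               batch_first=True)
        self.decoder = mesh_decoder

    def forward(self, e, x_id):
        # e    : (batch, T, 6) expression motion signal
        # x_id : (batch, N, 3) neutral face of the identity
        z, _ = self.encoder(e)                      # (batch, T, latent_dim)
        frames = [self.decoder(z[:, t]) + x_id      # deformation added to the neutral face
                  for t in range(z.shape[1])]
        return torch.stack(frames, dim=1)           # (batch, T, N, 3)
```

Under the stated training setup, the optimizer would be Adam with a learning rate of 0.001 and a weight decay of 5e−5, with the learning rate decayed by a factor of 0.99 per epoch (e.g. via an exponential learning-rate scheduler).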

Fig. 1. Network architecture of the proposed method.

Loss Function. The mesh decoder network outputs a motion deformation for each time-frame with respect to the expected facial animation. To train our model we minimize both a reconstruction loss \(L_r\) and a temporal coherence loss \(L_c\), as proposed in [21]. Specifically, we define our loss between the generated time frame \(\hat{x}_t\) and its ground truth value \(x_t\) as:

$$\begin{aligned} \begin{array}{c} L_r(\hat{x}_t,x_t) = \left\Vert \hat{x}_t-x_t\right\Vert _{1} \\ L_c(\hat{x}_t,x_t) = \left\Vert (\hat{x}_t-\hat{x}_{t-1})- (x_t-x_{t-1})\right\Vert _{1} \\ L(\hat{x}_t,x_t) = L_r(\hat{x}_t,x_t) + L_c(\hat{x}_t,x_t) \end{array} \end{aligned}$$
(5)

Although the reconstruction term \(L_r\) can be sufficient to encourage the model to match the ground truth vertices at each time step, on its own it does not produce high-quality, realistic animation. In contrast, the temporal coherence term \(L_c\) ensures the temporal stability of the generated frames by matching the distances between consecutive frames of the ground truth and generated expressions.
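A sketch of Eq. (5) in PyTorch could look as follows; whether the \(L_1\) norms are summed or averaged over vertices is an implementation choice, and we average here for illustration.

```python
import torch

def expression_loss(x_hat, x):
    """Sketch of Eq. (5); x_hat and x have shape (batch, T, N, 3)."""
    l_rec = (x_hat - x).abs().mean()                 # L_r: per-vertex L1 reconstruction
    d_hat = x_hat[:, 1:] - x_hat[:, :-1]             # generated frame-to-frame motion
    d_gt = x[:, 1:] - x[:, :-1]                      # ground-truth frame-to-frame motion
    l_coh = (d_hat - d_gt).abs().mean()              # L_c: temporal coherence
    return l_rec + l_coh
```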

Table 1. Mesh decoder architecture

5 Experiments

5.1 Dynamic 3D Face Database

To train our expression generative model we use the recently published 4DFAB [11]. 4DFAB contains dynamic 3D meshes of 180 people (60 female, 120 male) aged between 5 and 75 years. The captured meshes display a variety of complex and exaggerated facial expressions, namely happy, sad, surprise, angry, disgust and fear. The 4DFAB database also exhibits high variability in terms of ethnic origin, including subjects from more than 30 different ethnic groups. We split the dataset into 153 subjects for training and 27 for testing. The data were captured at 60 fps; we therefore sample each expression approximately every 5 frames in order to allow our model to generate extreme facial deformations. Given the high quality of the data (each mesh is composed of 28K vertices) as well as the relatively large number of subjects, 4DFAB is a rich and rather challenging choice for training generative models.

Expression Motion Labels. In this study, we rely on the assumption that each expression can be characterised by four phases of its evolution (see Fig. 2). First, the subject starts from a neutral pose and, at a certain point, their face starts to deform in order to express their emotional state; we call this the onset phase. After the subject's expression reaches its apex state, the face deforms back from the peak emotional state until it becomes neutral again; we call this the offset phase. Thus, each time frame is assigned a label that reflects its emotional state phase. We consider the emotional state as a value ranging from 0 to 1 assigned to each frame, with 0 representing the neutral phase and 1 the apex phase. The onset and offset phases are represented via a linear interpolation between the neutral and apex phases (see Fig. 2). However, expressions may also vary in terms of extremeness, i.e. the level of intensity of the subject's expression. To let our model learn diverse extremeness levels for each expression, it is essential to scale each expression motion label from [0, 1] to \([0,s_i]\), where \(s_i \in (0,1]\) represents the scaled value of the apex state according to the intensity of the expression. Intuitively, the extremeness of each expression is proportional to its absolute mean deformation; we can thus calculate the scaling factor \(s_i\) as:

$$\begin{aligned} s_i = \frac{clip(\frac{m_i - \mu _e}{\sigma _e})+1}{2} \end{aligned}$$
(6)

where \(m_i\) is a scalar value representing the absolute value of the mean deformation of the sequence from the neutral frame, and \(\mu _e\), \(\sigma _e\) are the mean and standard deviation of the deformation of the respective expression. The clip(\(\cdot\)) function clips values to [−1, 1].
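The construction of the per-frame motion labels and of the scaling factor of Eq. (6) can be sketched as follows; the frame indices delimiting onset, apex and offset are assumed to be given per sequence, and the function names are our own.

```python
import numpy as np

def motion_label(n_frames, onset_end, apex_end, expr_idx, scale=1.0, n_expressions=6):
    """Per-frame expression motion labels: a linear onset ramp from 0 to
    `scale`, a constant apex plateau, and a linear offset ramp back to 0,
    placed in the one-hot channel of the target expression."""
    amp = np.zeros(n_frames)
    amp[:onset_end] = np.linspace(0.0, scale, onset_end)            # onset phase
    amp[onset_end:apex_end] = scale                                  # apex phase
    amp[apex_end:] = np.linspace(scale, 0.0, n_frames - apex_end)    # offset phase
    e = np.zeros((n_frames, n_expressions))
    e[:, expr_idx] = amp
    return e                                                         # shape (n_frames, 6)

def extremeness_scale(m_i, mu_e, sigma_e):
    """Scaling factor s_i of Eq. (6)."""
    return (np.clip((m_i - mu_e) / sigma_e, -1.0, 1.0) + 1.0) / 2.0
```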

Fig. 2. Sample subjects from the 4DFAB database posing an expression along with expression motion labels.

5.2 Dynamic Facial Expressions

The proposed model for the generation of facial expressions is assessed both qualitatively and quantitatively. The model is trained by feeding the neutral frame of each subject and the manifested motion (i.e. the time-frames where the expression reaches onset, apex and offset modes) of the target expression. We evaluated the performance of the proposed model by its ability to generate expressions of 27 unobserved test subjects. To this end, we calculated the reconstruction loss as the per-vertex Euclidean distance between each generated sample and its corresponding ground truth.
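This evaluation metric can be computed as in the short sketch below (a minimal example; we assume the meshes are expressed in millimetres and that generated and ground truth sequences are aligned frame by frame).

```python
import torch

def per_vertex_error(x_hat, x):
    """Mean per-vertex Euclidean distance between a generated sequence and
    its ground truth, both of shape (T, N, 3); meshes assumed in mm."""
    return (x_hat - x).norm(dim=-1).mean()
```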

Baseline. As a comparison baseline we implemented an expression blendshape decoder that transforms the latent representation \(z_t\) of each time-frame, i.e. the LSTM outputs, to an output mesh. In particular, expression blendshapes were modeled by first subtracting the neutral face of each subject from its corresponding expression meshes, for all corresponding video frames. With this operation we capture and model just the motion deformation of each expression. Then, we applied Principal Component Analysis (PCA) to the motion deformations to reduce each expression to a latent vector. For a fair comparison, we use the same latent size for both the baseline and the proposed method.
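A sketch of this PCA blendshape baseline is given below, using scikit-learn's PCA purely for illustration. In the actual baseline the latent vectors come from the LSTM encoder; `decode_blendshapes` merely shows how a latent code maps back to a mesh, and all shapes and names are our assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_blendshape_basis(expression_frames, neutral_faces, latent_dim=64):
    """Fit the PCA expression-blendshape basis on motion deformations
    (expression minus neutral), with meshes flattened to 3N-dim vectors.

    expression_frames, neutral_faces : arrays of shape (n_frames, 3N)
    """
    deformations = expression_frames - neutral_faces
    pca = PCA(n_components=latent_dim)
    pca.fit(deformations)
    return pca

def decode_blendshapes(pca, z, neutral_face):
    """Map a latent code z (length 64) back to a mesh by decoding the
    deformation and adding it to the neutral face (shape (3N,))."""
    deformation = pca.inverse_transform(np.asarray(z).reshape(1, -1))[0]
    return neutral_face + deformation
```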

The results presented in Table 2 show that the proposed model outperforms the baseline with regards to all the expressions, as well as on the entire dataset (0.39 mm vs 0.44 mm).

Table 2. Generalization per-vertex loss over all expressions, along with the total loss.

Moreover, as can be seen in Fig. 3, the proposed method produces more realistic animation, especially in the mouth region, compared to the PCA blendshapes. The error visualizations in Fig. 3 show that the blendshape model produces only mild transitions between frames and cannot generate extreme deformations. Note also that the errors of the proposed method are mostly concentrated around the mouth and the eyebrows, due to the fact that our model is subject independent and each identity expresses its emotions in different ways and to varying extents. In other words, the proposed method models each expression with respect to the motion labels without taking identity information into account, thus the generated expressions can deviate somewhat from the ground truth subject-dependent expression.

Fig. 3. Color heatmap visualization of the error metric for both the baseline (top rows) and the proposed (bottom rows) model against the ground truth test data, for four different expressions.

Table 3. Contrasting classification performance between ground truth (test) data, data generated by the proposed method, and the blendshape baseline.

For a subjective qualitative assessment, Fig. 4 shows several expressions generated by the proposed method. Since several expressions, such as anger and sadness, mainly involve eyebrow deformations that cannot be easily visualised in still images, we encourage the reader to check our supplementary material for more qualitative results.

Fig. 4. Frames of generated expressions along with their expected motion labels: Fear, Angry, Surprise, Sad, Happy, Disgust (from top to bottom).

5.3 Classification of Generated 4D Expressions

To further assess the quality of the generated expressions we implemented a classifier trained to identify expressions. The architecture of the classifier is based on a projection onto a PCA space followed by three fully connected layers. The sequence classification is performed in two steps. First, we computed 64 PCA coefficients to represent all expression and deformation variations of the training set, and used this projection as a frame encoder that maps the unseen test data to a latent space. The latent representations of the frames are then concatenated and processed by the fully connected layers in order to predict the expression of the given sequence. The network was trained on the same training set that was originally used for the generation of expressions, with the Adam optimizer and a weight decay of 5e−3, for 13 epochs. Table 3 shows the classification performance achieved on the ground truth test data and on the data generated by the proposed model and by the baseline model, respectively.
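The classifier could be sketched as follows; the hidden layer sizes are our own assumptions, since only the PCA dimensionality (64) and the number of fully connected layers (three) are specified.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """PCA frame encoder followed by three fully connected layers.

    `pca_components` (64, 3N) and `pca_mean` (3N,) are assumed to be
    precomputed on the training set; the hidden sizes (256, 64) are
    illustrative assumptions.
    """
    def __init__(self, pca_components, pca_mean, n_frames, n_classes=6):
        super().__init__()
        self.register_buffer("components", pca_components)   # (64, 3N)
        self.register_buffer("mean", pca_mean)                # (3N,)
        self.mlp = nn.Sequential(
            nn.Linear(64 * n_frames, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        # x: (batch, T, 3N) sequence of flattened meshes, T == n_frames
        z = (x - self.mean) @ self.components.T              # (batch, T, 64)
        return self.mlp(z.flatten(start_dim=1))              # (batch, n_classes) logits
```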

As can be seen in Table 3, the data generated by the proposed model achieve classification performance similar to the ground truth data across almost every expression. In particular, the surprise, disgust and fear expressions generated by the proposed model are even more easily classified than the ground truth test data. Note also that both the ground truth data and the data generated by the proposed model achieve an F1-score of 0.68.

5.4 Loss per Frame

Since the overall loss is calculated over all frames of the generated expression, it does not reveal the ability of the proposed model to generate the onset, apex and offset of each expression with low error. To evaluate the performance of the model in each expression phase, we calculated the average \(L_1\) distance between the generated frame and the corresponding ground truth frame at every time step of the evolving expression. Figure 5 shows that the apex phase, which usually takes place between time-frames 30 and 80, has an increased \(L_1\) error for both models. However, the proposed method exhibits a stable loss of around 0.45 mm across the apex phase, whereas the blendshape baseline struggles to model the extreme deformations that characterize the apex phase of the expression.
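The per-frame error curve underlying Fig. 5 can be obtained as in the following sketch, assuming the test sequences have been aligned to a common length T.

```python
import torch

def per_frame_l1(x_hat, x):
    """Average L1 error at each frame index over a batch of sequences;
    x_hat and x have shape (batch, T, N, 3); returns a tensor of shape (T,)."""
    return (x_hat - x).abs().mean(dim=(0, 2, 3))
```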

Fig. 5. Average per-frame \(L_1\) error for the proposed method and the PCA-based blendshape baseline.

5.5 Interpolation on the Latent Space

To qualitatively evaluate the representation power of the proposed LSTM encoder we applied linear interpolation in the expression latent space. Specifically, we choose two different apex expression labels from our test set and encode them with our LSTM encoder into two latent variables \(z_0\) and \(z_1\), each of size 64. We then produce all intermediate encodings by linearly interpolating between them, i.e. \(z_{\alpha } = \alpha z_1 + (1-\alpha )z_0\), \(\alpha \in (0,1)\). The latent samples \(z_{\alpha }\) are then fed to our mesh decoder network. We visualize the interpolations between different expressions in Fig. 6.
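A sketch of this interpolation is given below; `decoder` denotes the trained frame decoder, and the number of interpolation steps is an arbitrary choice on our part.

```python
import torch

def interpolate_latents(z0, z1, decoder, steps=10):
    """Decode meshes along the line between two latent codes z0, z1 (size 64)."""
    alphas = torch.linspace(0.0, 1.0, steps)
    return [decoder((1.0 - a) * z0 + a * z1) for a in alphas]
```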

Fig. 6. Interpolation on the latent space between different expressions.

5.6 Expression Generation In-the-Wild

Since expression generation is an essential task in the graphics and film industries, we propose a real-world application of our 3D facial expression generator. In particular, we collected several image pairs showing a neutral and various expressions of the same identity, and attempted to realistically synthesize the 4D animation of the target expression relying solely on our proposed approach. In order to acquire a neutral 3D template for animation, we applied a fitting methodology to the neutral image, as proposed in [18]. Using the fitted neutral mesh, we applied our model to generate several target expressions, as shown in Fig. 7. Our method is able to synthesize a series of realistic 3D facial expressions, which demonstrates the ability of our framework to animate a template mesh given a desired expression. We can qualitatively evaluate the similarity of the generated expression to the target expression of the same identity by comparing the generated mesh with the fitted 3D face.

Fig. 7. Generation of expressions in-the-wild from 2D images.

6 Limitations and Future Work

Although the proposed framework is able to model and generate a wide range of realistic expressions, it cannot thoroughly model variations in extremeness. As mentioned in Sect. 5.1, we attempted to incorporate the extremeness variations of each subject into the training procedure by using an intuitive scaling trick. However, it is not certain that the mean absolute deformation of each mesh always represents the extremeness of the conducted expression. In addition, in order to synthesize realistic facial animation, it is essential to model, along with the shape deformations, the facial wrinkles of each expression. Thus, we will attempt to generalize facial expression animation to both shape and texture, by extending our model to texture prediction.

7 Conclusion

In this paper, we propose the first generative model to synthesize dynamic 3D facial expressions from a still neutral 3D mesh. Our model captures both local and global expression deformations using graph-based upsampling and convolution operators. Given a neutral-expression mesh of a subject and a time signal that conditions the expected expression motion, the proposed model generates dynamic facial expressions for the same subject that respect the time conditions and the anticipated expression. The proposed method models the animation of each expression and deforms the neutral face of each subject according to a desired motion. Both the expression and the motion can be fully defined by the user. Results show that the proposed method outperforms expression blendshapes and creates motion-consistent deformations, validated both qualitatively and quantitatively. In addition, we assessed whether the generated expressions can be correctly classified by a classifier trained on the same dataset. The classification results corroborate our qualitative findings, showing that the generated data can be classified similarly to those created by the blendshape model. In summary, the proposed model is the first that attempts, and manages, to synthesize realistic and high-quality facial expressions from a single neutral face input.