1 Introduction

Facial expression synthesis is an image-to-image translation task that aims to change the expression of a given face image to a desired one. Photorealistic facial expression synthesis is useful for data augmentation when training face and expression recognition models, and for producing animations for human-computer interaction and entertainment. Image synthesis has received considerable attention since the arrival of generative adversarial networks (GANs) [13] and their conditional variant [24]. A GAN consists of two networks, a generator and a discriminator, trained with an adversarial learning scheme: the generator tries to fool the discriminator by producing realistic-looking fake images, while the discriminator learns to distinguish real images from fake ones. GANs have been utilized to solve many image-related problems including image synthesis [26], image-to-image translation [16, 36], image super-resolution [3, 5], and facial attribute editing [12, 33, 35]. Recent GAN-based methods have shown impressive results on facial expression synthesis tasks [6, 27, 31] as well. Despite producing photorealistic and plausible results, these models have two limitations. First, they require large datasets to synthesize photorealistic expressions; as shown in [1, 4, 6, 18], existing models introduce color degradation and noticeable artifacts in the synthesized expressions when trained on smaller datasets. Second, the majority of existing methods [6, 7, 27, 33] share a common generator architecture [17] that incurs a significant computational cost and prohibits deployment on resource-constrained devices.

In this paper, we aim to build a simpler, smaller, and effective facial expression synthesis model called the Ultimate Skip Generative Adversarial Network (US-GAN). The generator of the proposed US-GAN consists of four modules: encoding layers, a single residual block, decoding layers, and an ultimate skip connection, as illustrated in Fig. 1. The encoding layers enable the network to encode key facial details. The residual block helps to refine these details. The decoding layers then decode high-level facial details from this latent encoding. Finally, we propose to directly link the input image to the output image with a skip connection, called the ultimate skip connection. We hypothesize that including an ultimate skip connection in the generator of a GAN improves the transfer of identity, facial details, and color details from input to output. Since the generator is relieved from the task of producing such details, it can devote its parameters to learning expression mappings only and does not suffer from color degradation and block artifacts.

Fig. 1

The generator of the proposed framework. US-GAN takes a face image with any expression and the target expression label as input and produces the output image with the target expression. It consists of four modules: i) encoding layers, ii) single residual block, iii) decoding layers, and iv) an ultimate skip connection

The contributions of this work can be summarized as follows:

1. Incorporation of an ultimate skip connection improves preservation of identity, facial details, and color details while inducing convincing expressions.

2. The ultimate skip connection leads to a model with three times fewer parameters that can be trained on a dataset two orders of magnitude smaller than that of the state-of-the-art GANimation model.

3. US-GAN qualitatively outperforms the state of the art in realism, mapped expression, and identity preservation by 25%, 43%, and 58%, respectively.

4. Quantitatively, US-GAN improves the face verification score (FVS) by 7% and reduces the average content distance (ACD) by 27% compared to the state of the art.

5. US-GAN generalizes well to out-of-dataset facial images.

The rest of the paper is structured as follows. Section 2 reviews state-of-the-art image-to-image translation and facial expression synthesis models. The proposed US-GAN architecture is explained in Section 3 along with loss functions, datasets, and evaluation metrics. Qualitative as well as quantitative experimental results are presented in Section 4. Section 5 discusses the effectiveness of our proposed method and we conclude in Section 6.

2 Related work

2.1 Image-to-image translation

Generative Adversarial Networks (GANs) [13] and their conditional variant (cGAN) [24] have been successfully utilized for image synthesis and image-to-image translation problems. The Pix2pix model [16] employed a conditional GAN with an \(\ell_{1}\) image reconstruction loss to enforce generated samples to be close to target images. To learn the mapping between two domains from paired datasets, it utilized U-Net [29] and PatchGAN [16] architectures in the generator and discriminator, respectively. Zhu et al. [36] introduced a cycle consistency loss to perform cross-domain mapping using unpaired datasets. Their network, named CycleGAN, contains two generators and two discriminators to learn the cyclical, cross-domain mappings, and adopts Johnson et al.'s architecture [17] in its generators. While Pix2pix and CycleGAN can be used to learn mappings between facial expressions, these networks fail to produce realistic expressions, as demonstrated in [18], where the smaller size of facial expression synthesis datasets is suggested as a possible reason for their failure.

2.2 Facial expression synthesis

GANs have been widely used for facial expression synthesis. Ding et al. [8] proposed ExprGAN to synthesize facial expressions with controllable intensity; however, it fails to preserve the identity details of the input image. The GC-GAN model [28] induces a desired facial expression on an input image. StarGAN [6] learns mappings among multiple expressions using a single, shared generator; when trained on a small dataset, its authors reported color degradation artifacts. The GANimation model [27], trained on the large EmotioNet [10] dataset, extracts action units from a target face and transfers them to an input face, so the quality of its expressions is highly dependent on the extracted action units. Liu et al. [21] proposed an encoder-decoder architecture, called STGAN, with symmetric skip connections and selective transfer units for facial attribute manipulation. The Cascade-EF GAN [34] generates sharper, more realistic images by employing local and global attention in a progressive manner; for out-of-dataset images, it was fine-tuned on the large AffectNet [25] dataset. In summary, current state-of-the-art GAN-based facial expression synthesis models require large datasets to induce satisfactory expressions on in- and out-of-dataset images, and their quality degrades significantly when they are trained on smaller datasets. In contrast, our proposed US-GAN produces realistic expressions on both in- and out-of-dataset images by employing only hundreds of images for training.

3 Materials and methods

In this work, we aim to learn a mapping that transforms an input image \(\boldsymbol{x} \in \mathcal{R}^{D\times D \times 3}\) with a known original expression co into an output image \(\boldsymbol{y} \in \mathcal{R}^{D\times D \times 3}\) with a target expression ct. Both expression vectors co and ct are one-hot encodings over C expression classes. The proposed US-GAN consists of two modules: a generator and a discriminator.

3.1 Generator

Figure 2 presents an overview of the complete US-GAN pipeline. The generator consists of the following four modules.

Fig. 2

US-GAN consists of two modules: a generator (\(\mathbb {G}\)) and a discriminator (\(\mathbb {D}\)). \(\mathbb {G}\) takes an input image (x) and a target expression vector (ct) and generates an output image (y) with the target expression. The same network \(\mathbb {G}\) is then used to reconstruct the input image, yielding \(\boldsymbol {\hat {x}}\), from y and the original expression vector (co). Discriminator head \(\mathbb {D}_{I}\) learns to distinguish real images x from fake images y, while head \(\mathbb {D}_{c}\) classifies the discriminator input into an expression class

3.1.1 Encoding layers

There are three encoding layers in our proposed network. The first layer takes an input volume of size D × D × (3 + C), formed by concatenating the image x with the spatially replicated target expression label, and generates a feature volume of size D × D × 64. The second encoding layer takes this feature volume, performs a strided convolution for downsampling, and produces an output volume of size \(\frac {D}{2} \times \frac {D}{2} \times 128\). The third encoding layer performs another strided convolution and provides an encoded volume of size \(\frac {D}{4} \times \frac {D}{4} \times 256\).
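The following PyTorch sketch shows one way these three encoding layers could be implemented. It is a minimal sketch, not the authors' code: the kernel sizes, padding, and the use of instance normalization inside the encoder are assumptions, since the text specifies only the channel counts and the stride-2 downsampling.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Three encoding layers: D x D x (3 + C)  ->  D/4 x D/4 x 256."""

    def __init__(self, num_classes):
        super().__init__()
        self.layers = nn.Sequential(
            # Layer 1: keeps the spatial size, (3 + C) -> 64 channels
            nn.Conv2d(3 + num_classes, 64, kernel_size=7, stride=1, padding=3),
            nn.InstanceNorm2d(64, affine=True),
            nn.ReLU(inplace=True),
            # Layer 2: strided convolution, D -> D/2, 64 -> 128 channels
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(128, affine=True),
            nn.ReLU(inplace=True),
            # Layer 3: strided convolution, D/2 -> D/4, 128 -> 256 channels
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(256, affine=True),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.layers(x)
```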

3.1.2 Residual block

Residual connections [14] assist deep networks in learning better representations by directly transferring a layer's input to a later layer's output so that only the residual transformation needs to be modelled. This makes the learning task easier. For expression synthesis, residual connections help to preserve facial details. We employ only one residual block in the body of our network. The residual block consists of two 3 × 3 convolutional layers with 256 channels, two instance normalization layers [32], and a ReLU layer [11]. Each convolutional layer is followed by an instance normalization layer, the ReLU activation is applied to the output of the first instance normalization layer, and the block input is added directly to the output of the second instance normalization layer.
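A minimal PyTorch sketch of this residual block, following the description above (3 × 3 convolutions, 256 channels, instance normalization after each convolution, ReLU after the first normalization, identity addition after the second), is:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Single residual block of the US-GAN generator bottleneck."""

    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm2d(channels, affine=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm2d(channels, affine=True),
        )

    def forward(self, x):
        # The block input is added directly after the second instance norm.
        return x + self.body(x)
```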

3.1.3 Decoding layers

The first decoding layer takes the \(\frac {D}{4} \times \frac {D}{4} \times 256\) volume produced by the residual block and performs fractionally-strided convolutions to produce an upsampled volume of size \(\frac {D}{2} \times \frac {D}{2} \times 128\). The second decoding layer takes these upsampled features and applies fractionally-strided convolution again to produce an upsampled volume of size D × D × 64. The last layer transforms these feature maps into an output image \(\mathbb {G}(\boldsymbol {x},\mathbf {c}_{t})\) of size D × D × 3.
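A corresponding sketch of the decoding layers is given below. The transposed-convolution kernel sizes and the bounded (tanh) output of the last layer are assumptions; the text specifies only the channel counts and the upsampling factors.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Two fractionally-strided convolutions plus an output layer:
    D/4 x D/4 x 256  ->  D x D x 3."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            # D/4 -> D/2, 256 -> 128 channels
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(128, affine=True),
            nn.ReLU(inplace=True),
            # D/2 -> D, 128 -> 64 channels
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(64, affine=True),
            nn.ReLU(inplace=True),
            # Output layer: 64 -> 3 channels
            nn.Conv2d(64, 3, kernel_size=7, stride=1, padding=3),
            nn.Tanh(),  # assumption: bounded residual image
        )

    def forward(self, h):
        return self.layers(h)
```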

3.1.4 Ultimate skip connection

Given a large training set, it is possible to learn encodings rich enough, and decoders powerful enough, to reproduce all relevant input details [6]. For smaller training sets, however, transferring the input directly to the output via an ultimate skip connection makes the encoding and decoding tasks easier. The output of the generator is obtained by adding the input to the output of the last decoding layer:

$$ \boldsymbol{y} = \mathbb{G}(\boldsymbol{x}, \boldsymbol{c}_{t}) + \boldsymbol{x}. $$
(1)

This allows the encoding and decoding layers to focus purely on the expressions, as can be seen in Fig. 3. Since input details relating to identity, facial features, and overall color are already transferred via the ultimate skip connection, the parameters of the generator only need to learn the residual expression \(\mathbb {G}(\boldsymbol {x},\boldsymbol {c}_{t})\). Similar ideas have been explored for image restoration [23] and facial attribute editing [30].
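Putting the pieces together, the generator forward pass with the ultimate skip connection of (1) can be sketched as follows. It reuses the Encoder, ResidualBlock, and Decoder sketches above; the spatial tiling of the one-hot target label is an assumption consistent with the D × D × (3 + C) input volume of Section 3.1.1.

```python
import torch
import torch.nn as nn

class USGANGenerator(nn.Module):
    """Generator sketch: y = G(x, c_t) + x, Eq. (1)."""

    def __init__(self, num_classes):
        super().__init__()
        self.encoder = Encoder(num_classes)
        self.res_block = ResidualBlock(256)
        self.decoder = Decoder()

    def forward(self, x, c_t):
        # Tile the one-hot target expression over the spatial dimensions and
        # concatenate it with the image along the channel axis.
        b, _, h, w = x.shape
        c_map = c_t.view(b, -1, 1, 1).expand(b, c_t.size(1), h, w)
        inp = torch.cat([x, c_map], dim=1)
        residual = self.decoder(self.res_block(self.encoder(inp)))
        return residual + x  # ultimate skip connection
```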

Fig. 3

Illustration of residual and final output image synthesized by our proposed US-GAN. The network is encouraged to generate only expression-related details since the ultimate skip connection carries over identity, facial, and color details directly from the input image

3.2 Discriminator

The discriminator \(\mathbb {D}\) transforms its input image into a feature volume of size \(\frac {D}{2^{6}}\times \frac {D}{2^{6}} \times 2048\) through a sequence of six layers of 4 × 4 strided convolution filters. Starting from 64 channels in the first layer, each convolution layer doubles the number of channels. Each convolution is followed by a LeakyReLU activation. The feature volume after the sixth layer is converted into a score \(\mathbb {D}_{I}\) that is interpreted as the probability of the discriminator's input being a real image. In parallel, the volume is also converted into a C × 1 vector of probabilities \(\mathbb {D}_{c}\) representing the expression of the discriminator's input image.
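A sketch of this two-headed discriminator is given below. The kernel sizes of the two output heads are assumptions in the style of StarGAN-like discriminators, and the realness head returns raw scores rather than probabilities, as is usual when training with a Wasserstein loss.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Six 4x4 strided convolutions (64 -> 2048 channels) with LeakyReLU,
    followed by a realness head D_I and an expression head D_c."""

    def __init__(self, image_size=128, num_classes=7):
        super().__init__()
        layers, in_ch, out_ch = [], 3, 64
        for _ in range(6):  # D -> D / 2^6
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.01)]
            in_ch, out_ch = out_ch, out_ch * 2
        self.backbone = nn.Sequential(*layers)
        feat_size = image_size // 2 ** 6  # e.g. 128 / 64 = 2
        self.head_real = nn.Conv2d(in_ch, 1, kernel_size=3, stride=1, padding=1)
        self.head_cls = nn.Conv2d(in_ch, num_classes, kernel_size=feat_size, bias=False)

    def forward(self, img):
        h = self.backbone(img)
        score = self.head_real(h)                      # D_I: realness score map
        logits = self.head_cls(h).view(h.size(0), -1)  # D_c: C expression logits
        return score, logits
```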

3.3 Loss formulations

The proposed model is trained by minimizing a combination of three loss functions.

3.3.1 Adversarial loss

In order to avoid training instability and generate higher quality images, we use a Wasserstein adversarial loss [2] with a gradient penalty, defined as

$$ \mathcal{L}_{A} = E\left[ \mathbb{D}_{I}(\boldsymbol{x})\right] - E\left[\mathbb{D}_{I}(\boldsymbol{y})\right] - \lambda_{gp} E\left[ \left(\Vert \nabla_{\boldsymbol{\bar{x}}} \mathbb{D}_{I}(\boldsymbol{\bar{x}}) \Vert_{2} - 1\right)^{2} \right], $$
(2)

where \(\mathbb {D}_{I}(\cdot )\) is proportional to the probability of its input image being real, λgp is the gradient penalty coefficient, and \(\boldsymbol {\bar {x}}\) is a uniform random linear combination of x and y.
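The terms of (2) can be computed with automatic differentiation as sketched below; the snippet assumes the two-headed discriminator interface of Section 3.2 and is not the authors' implementation.

```python
import torch

def gradient_penalty(disc, real, fake, lambda_gp=10.0):
    """Penalty term of Eq. (2); x_bar is a uniform random linear
    combination of a real and a generated image."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_bar = (alpha * real + (1.0 - alpha) * fake).requires_grad_(True)
    score, _ = disc(x_bar)  # only the realness head D_I is penalized
    grads, = torch.autograd.grad(outputs=score.sum(), inputs=x_bar, create_graph=True)
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

def wasserstein_terms(disc, real, fake):
    """E[D_I(x)] - E[D_I(y)], the first two terms of Eq. (2)."""
    real_score, _ = disc(real)
    fake_score, _ = disc(fake)
    return real_score.mean() - fake_score.mean()
```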

3.3.2 Image reconstruction loss

Let \(\hat {\boldsymbol {x}}\) be the reconstruction of the original image x, generated from the synthesized image y and the original expression co, as

$$ \hat{\boldsymbol{x}} = \mathbb{G}(\boldsymbol{y},\boldsymbol{c}_{o}) + \boldsymbol{y}. $$
(3)

In order to softly enforce that the face in the input x and the generated image y correspond to the same person, we utilize the cycle reconstruction loss [36] defined as

$$ \mathcal{L}_{R} = E\left[ \Vert \boldsymbol{x} - \hat{\boldsymbol{x}} \Vert_{1} \right]. $$
(4)
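A sketch of this cycle reconstruction, assuming a generator whose forward pass already adds its input as in (1), is:

```python
import torch.nn.functional as F

def reconstruction_loss(gen, x, y, c_o):
    """Eqs. (3)-(4): rebuild x from the generated image y and the
    original expression c_o, then take the L1 distance."""
    x_hat = gen(y, c_o)  # gen already adds its input (ultimate skip), Eq. (3)
    return F.l1_loss(x_hat, x)
```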

3.3.3 Expression classification loss

We define a multiclass cross-entropy loss between target expression ct and classified expression \(\hat {\mathbf {c}}_{t}\) of the generated image y as

$$ \mathcal{L}_{C}^{F} = E\left[ -\log \mathbb{D}_{c}(\boldsymbol{c}_{t} \,|\, \boldsymbol{y}, \hat{\boldsymbol{c}}_{t}) \right], $$
(5)

and between original expression co and classified expression \(\hat {\mathbf {c}}_{o}\) of the input image x as

$$ \mathcal{L}_{C}^{R} = E\left[ -\log \mathbb{D}_{c}(\boldsymbol{c}_{o} \,|\, \boldsymbol{x}, \hat{\boldsymbol{c}}_{o}) \right]. $$
(6)

The idea is to penalize any deviation between target and classified expression of y and between original and classified expression of x.
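Both terms reduce to a standard cross-entropy on the classifier head \(\mathbb {D}_{c}\); a sketch, assuming integer class labels and the discriminator interface sketched earlier, is:

```python
import torch.nn.functional as F

def classification_losses(disc, x, y, c_o_idx, c_t_idx):
    """Eqs. (5)-(6): the fake-image term L_C^F drives the generator,
    the real-image term L_C^R trains the classifier head."""
    _, logits_fake = disc(y)
    _, logits_real = disc(x)
    loss_fake = F.cross_entropy(logits_fake, c_t_idx)  # L_C^F, Eq. (5)
    loss_real = F.cross_entropy(logits_real, c_o_idx)  # L_C^R, Eq. (6)
    return loss_fake, loss_real
```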

3.3.4 Overall GAN objectives

The overall objective functions for discriminator \(\mathbb {D}\) and generator \(\mathbb {G}\) can be written as

$$ \mathcal{L}_{\mathbb{D}} = -\mathcal{L}_{A} + \lambda_{C} \mathcal{L}_{C}^{R}, $$
(7)
$$ \mathcal{L}_{\mathbb{G}} = \mathcal{L}_{A} + \lambda_{C} \mathcal{L}_{C}^{F} + \lambda_{R} \mathcal{L}_{R}, $$
(8)

where λC and λR denote the weights of the classification and reconstruction losses, respectively. Minimizing \({\mathscr{L}}_{\mathbb {D}}\) encourages the discriminator to improve its ability to i) differentiate between real and fake images, and ii) classify the expression of its input image. Minimizing \({\mathscr{L}}_{\mathbb {G}}\) encourages the generator to produce fake images that i) are hard to distinguish from real images, ii) have the desired expression, and iii) preserve the identity of the input image.
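The two objectives can be assembled from the loss sketches above. In the generator sketch, the terms of \(\mathcal{L}_{A}\) that do not depend on \(\mathbb {G}\) (the real-image score and the gradient penalty) are dropped, which leaves the generator gradient unchanged; this is an implementation convenience, not something specified in the text.

```python
def discriminator_loss(disc, gen, x, c_o_idx, c_t, c_t_idx,
                       lambda_c=1.0, lambda_gp=10.0):
    """Eq. (7): L_D = -L_A + lambda_C * L_C^R."""
    y = gen(x, c_t).detach()  # do not backpropagate into G here
    l_a = wasserstein_terms(disc, x, y) - gradient_penalty(disc, x, y, lambda_gp)
    _, l_cls_real = classification_losses(disc, x, y, c_o_idx, c_t_idx)
    return -l_a + lambda_c * l_cls_real

def generator_loss(disc, gen, x, c_o, c_o_idx, c_t, c_t_idx,
                   lambda_c=1.0, lambda_r=10.0):
    """Eq. (8): L_G = L_A + lambda_C * L_C^F + lambda_R * L_R."""
    y = gen(x, c_t)
    fake_score, _ = disc(y)
    l_a = -fake_score.mean()  # G-dependent part of L_A
    l_cls_fake, _ = classification_losses(disc, x, y, c_o_idx, c_t_idx)
    l_rec = reconstruction_loss(gen, x, y, c_o)
    return l_a + lambda_c * l_cls_fake + lambda_r * l_rec
```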

3.4 Implementation details

We train the proposed US-GAN model from scratch for 350 epochs using the Adam optimizer [19] with β1 = 0.5,β2 = 0.999, learning rate 0.0001 and batch size of 8. Following [6], we set λC = 1, λR = 10 and λgp = 10 for all experiments.
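The training configuration above can be sketched as follows. The data loader, the random sampling of target expressions, and the strict alternation of one discriminator update per generator update are assumptions; the text specifies only the optimizer, its hyperparameters, and the loss weights.

```python
import torch
import torch.nn.functional as F

NUM_CLASSES = 7
gen = USGANGenerator(num_classes=NUM_CLASSES)
disc = Discriminator(image_size=128, num_classes=NUM_CLASSES)

opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4, betas=(0.5, 0.999))

# `loader` is assumed to yield (image, expression-label) batches of size 8.
for epoch in range(350):
    for x, c_o_idx in loader:
        c_t_idx = torch.randint(0, NUM_CLASSES, (x.size(0),))  # random targets
        c_t = F.one_hot(c_t_idx, NUM_CLASSES).float()
        c_o = F.one_hot(c_o_idx, NUM_CLASSES).float()

        opt_d.zero_grad()
        discriminator_loss(disc, gen, x, c_o_idx, c_t, c_t_idx).backward()
        opt_d.step()

        opt_g.zero_grad()
        generator_loss(disc, gen, x, c_o, c_o_idx, c_t, c_t_idx).backward()
        opt_g.step()
```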

3.5 Dataset

We trained our model using three publicly available datasets: KDEF [22], RaFD [20], and CFEE [9]. The KDEF dataset consists of 490 images of the seven universal expressions collected from 35 male and 35 female participants. RaFD contains 8,040 facial expression images collected from 67 participants at five different angles; we used only the 469 frontal images in our experiments. The CFEE dataset contains 5,060 compound facial expression images of 230 participants, of which we used 1,610 images. In total, we used 2,569 images covering seven facial expressions from the three datasets, with 90% of the images used for training and the rest for testing. All facial images are center-cropped and resized to 128 × 128. To evaluate the effectiveness of our proposed model on out-of-dataset images, images of celebrities, paintings, and avatars were downloaded from the Internet. These images differ significantly from the distribution of the training datasets. Example in-dataset and out-of-dataset images are shown in Fig. 4.

Fig. 4

Example images from in- and out-of-dataset

3.6 Evaluation metrics

We compare models in terms of i) number of learnable parameters, and ii) size of training sets. We compare the quantitative performance of models in terms of the following two metrics.

1. Average Content Distance (ACD) is the squared Euclidean distance between the features ϕ(x) of the input image and the features ϕ(y) of the generated image, where ϕ(⋅) is extracted using a face classifier (Footnote 1); a minimal sketch of this computation is given after this list.

    $$ ACD(\boldsymbol{x},\boldsymbol{y}) = \Vert \phi(\boldsymbol{x}) - \phi(\boldsymbol{y}) \Vert_{2}^{2} $$
    (9)
2. Face Verification Score (FVS) computes the similarity between the input and synthesized images using Face++ (Footnote 2) and returns a value between 0 and 100 indicating the likeness of the two faces.
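A sketch of the ACD computation of (9) is shown below, where `phi` stands for any pretrained face feature extractor; the specific classifier used by the authors is the one referenced in the footnote.

```python
import torch

def average_content_distance(phi, x, y):
    """Eq. (9): squared Euclidean distance between identity features of
    the input image x and the generated image y, averaged over a batch."""
    with torch.no_grad():
        diff = phi(x) - phi(y)
    return (diff ** 2).sum(dim=-1).mean().item()
```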

For qualitative comparison, we performed a user study to compute user preference percentages for StarGAN, STGAN, GANimation, and our proposed US-GAN. Eighty users participated in the study. Eighteen input face images from in- and out-of-dataset were randomly selected for evaluation, and synthesized expressions for these images were generated using each of the four models. Each input image in a neutral expression was placed on the left, with its four manipulated versions placed next to it in random order (as shown in Fig. 5). The human evaluators were not made aware of the source algorithm of any manipulated image and were asked to select one image for each of the following three questions.

1. Which image looks most realistic (without considering the expression)?

2. Which image has the most convincing expression?

3. Which image best preserves the identity of the input face?

Fig. 5

A sample from the user study. An input image in a neutral expression was placed alongside its four manipulated versions, arranged in random order in one row, and human evaluators were asked to vote for the best synthesized image in terms of i) realism, ii) mapped expression, and iii) identity preservation

3.7 Ablation study

We conducted ablation studies to investigate the effectiveness of the ultimate skip connection and the number of residual blocks in the generator of the proposed US-GAN.

3.7.1 Ultimate skip connection

To qualitatively demonstrate the usefulness of the ultimate skip connection, we designed the following two experiments:

1. Train US-GAN with and without the ultimate skip connection (Footnote 3) to observe the difference.

2. Train STGAN [21] with and without an ultimate skip connection to observe the difference.

3.7.2 Number of residual blocks

To validate the effectiveness of residual blocks in the bottleneck of the US-GAN generator, we train US-GAN with one and with six residual blocks. We denote the US-GAN model with R residual blocks as US-GAN-R, where R ∈ {1,6}. The comparison with six residual blocks is motivated by state-of-the-art models such as StarGAN [6] and GANimation [27], which both use six residual blocks.

4 Results

In this section, we conduct extensive experiments to evaluate the performance of our proposed method. We first discuss baseline details in Section 4.1. Qualitative and quantitative evaluations are then presented in Sections 4.2 and 4.3, respectively.

4.1 Baselines

We compare our proposed US-GAN with three state-of-the-art, multi-domain facial expression synthesis models: StarGAN [6], STGAN [21], and GANimation [27]. For StarGAN and STGAN, we used the code and hyperparameter settings provided by the authors and trained the models on the same combined dataset (KDEF, RaFD, and CFEE) used to train US-GAN. For GANimation, we used a model pre-trained for 30 epochs on the large EmotioNet dataset [10].

4.2 Qualitative evaluation

Generalization of US-GAN to out-of-dataset imagery can be observed in Figs. 6 and 7. The ultimate skip connection transfers input image details so that the network parameters can focus on generating expressions, which leads to realistic expressions and preserved identities, facial details, and color details. A comparison with three state-of-the-art facial manipulation models, StarGAN [6], STGAN [21], and GANimation [27], is presented in Fig. 8. While existing models can induce expressions on both in- and out-of-dataset images, GANimation introduces strong artifacts around the eyes, nose, and mouth, StarGAN fails to recover the true input image colors, and STGAN introduces pseudo-periodic artifacts in the synthesized images. In comparison, the proposed US-GAN successfully introduces the desired expression without adding irrelevant changes.

Fig. 6

Facial expression synthesis results on in-dataset (top two rows) and out-of-dataset (last two rows) testing images. The proposed method, trained on the RaFD, KDEF, and CFEE datasets, generates plausible expressions while preserving identity and retaining facial details. Despite the relatively small training set, these results demonstrate both the in-dataset and the out-of-dataset generalization strength of the proposed method

Fig. 7

Expressions synthesised by US-GAN on Rows 1 to 3: in-dataset, and Rows 4 to 7: out-of-dataset images. Due to the ultimate skip connection, the proposed method preserves input image details and colors and induces convincing expressions

Fig. 8

Comparison of facial expression synthesis results obtained by the proposed US-GAN and other state-of-the-art models. Left: An in-dataset image. Right: An out-of-dataset image. GANimation [27] introduces aging and other noticeable artifacts on all in- and out-of-dataset images. STGAN [21] introduces pseudo-periodic artifacts. StarGAN [6] introduces a pinkish bias and, for out-of-dataset images, generates artifacts. The proposed US-GAN successfully synthesizes expressions while preserving identity, facial details, and color details

4.3 Quantitative evaluation

Table 1 shows that our proposed method has three times fewer parameters than StarGAN and GANimation and is trained on a dataset two orders of magnitude smaller. Compared to STGAN, the proposed method has an order of magnitude fewer parameters. The ACD and FVS values indicate that US-GAN is the most effective at preserving the identity and other features of the input.

Table 1 Compared to the state of the art, the proposed US-GAN has more than three times fewer parameters and is trained on a dataset two orders of magnitude smaller. It yields the best identity preservation (lowest ACD and highest FVS) between inputs and outputs

The summarized results of the user study are provided in Fig. 9. Compared to the second-best performing model, US-GAN yielded a 25% improvement in realism, a 43% improvement in the plausibility of mapped expressions, and a 58% improvement in identity preservation.

Fig. 9

User study results for facial expression synthesis. The proposed US-GAN outperforms StarGAN [6], STGAN [21] and GANimation [27] based on user preferences

4.4 Ablation study

4.4.1 Ultimate skip connection

Figure 10 demonstrates that the ultimate skip connection directly leads to the preservation of input image details, both for the proposed US-GAN and for STGAN [21]. The second row contains US-GAN results. The third row shows that when the ultimate skip connection is removed, the resulting model fails to preserve input image colors and introduces artifacts. For STGAN, the fourth row shows that adding an ultimate skip connection leads to a clear reduction in artifacts and improves the transfer of facial details from input to output.

Fig. 10

Impact of the ultimate skip connection on generated images. First row: Input images. Row 2: US-GAN results. Row 3: US-GAN trained without the ultimate skip connection. Row 4: STGAN [21] trained after adding an ultimate skip connection. Row 5: STGAN results. The ultimate skip connection helps to preserve the input image and overall color details while introducing convincing expressions in the proposed US-GAN. The last row shows that, without the ultimate skip connection, STGAN fails to preserve color details and introduces noise-like artifacts in the synthesized images

4.4.2 Number of residual blocks

Existing models such as StarGAN [6] and GANimation [27] use six residual blocks in the generator. Figure 11 demonstrates that, powered by the ultimate skip connection, even one residual block can produce plausible, identity- and color-preserving transformations. In other words, the ultimate skip connection reduces the need for many parameters, which can help improve generalization.

Fig. 11

First row: StarGAN with six residual blocks failed to preserve the true colors of an input image. Second row: US-GAN with six residual blocks produced sharper expressions with better preserved facial and color details. Third row: US-GAN with only one residual block did not suffer from significant drop in quality of results since most of the heavy lifting related to facial and color details is already carried out by the ultimate skip connection

5 Discussion and future directions

We now address a few questions raised by our results and place our results in context of existing work.

Why an ultimate skip connection?

Skip connections have already been used between encoding and decoding layers [21], since they allow easier residuals of intermediate tasks to be learned within deep networks. In hindsight, it seems natural to apply residual learning to the original, end-to-end expression synthesis task.

The ultimate skip connection has been shown to be fundamentally important for expression synthesis since it performs the heavy lifting of transferring non-expression-related details from the input to the output at no cost. This leaves the learnable parameters of the generator free to focus purely on expression synthesis.

This can also be viewed from the perspective of residual learning. Residual learning, as popularized by the ResNet model [14], works by learning easier, residual sub-problems within deep network layers instead of complete transformations. By incorporating a direct ultimate skip connection from input to output, we have applied residual learning to the original, end-to-end expression synthesis problem and made it easier to solve.

Since the problem has been made easier, we can solve it with fewer parameters (only one residual block in US-GAN instead of six in competing models). As a consequence, US-GAN exhibits better generalization on out-of-dataset imagery.

Are residual blocks even necessary?

Another question raised, but not answered, by our current work is whether residual blocks are necessary at all, since even one residual block, powered by the ultimate skip connection, produced better results than the StarGAN and GANimation models, which both contain six residual blocks. Perhaps increasing direct skip connections between encoding and decoding layers, in the manner of U-Net [29] or DenseNet [15], can be more effective than residual blocks.

What is the limitation of the proposed ultimate skip connection?

Although the ultimate skip connection helps to recover input image details, it slightly reduces the intensity of the synthesized expressions, as can be observed in Fig. 10. In other words, the ultimate skip connection helps to synthesize realistic expressions at the cost of a weakened expression manipulation ability. This can perhaps be alleviated by gating the skip connection through attention mechanisms [3].

6 Conclusion

We have proposed US-GAN, a smaller and more effective model for facial expression synthesis. Our primary contribution is demonstrating the benefit of an ultimate skip connection that transfers identity, facial, and color details directly from input to output. This eases the task of the generator, which can then focus on inducing expressions only. It also reduces the number of learnable parameters, since multiple residual blocks are no longer needed, which improves generalization. Compared to state-of-the-art models, US-GAN has more than three times fewer parameters and is trained on a dataset two orders of magnitude smaller. Based on the ACD and FVS metrics, US-GAN generates realistic expressions while best preserving the identity and details of the input face. It also outperforms the state of the art in image realism, expression plausibility, and identity preservation according to human evaluators. Our results indicate that the ultimate skip connection is fundamentally important for the facial expression synthesis task. The proposed method can potentially be extended by exploring intermediate skip connections as an alternative to residual blocks and by incorporating spatial attention to improve the expressiveness of results.