1 Introduction

Facial expression synthesis is an image-to-image translation task that aims to change the expression of a given face image to a desired one. Photorealistic facial expression synthesis is useful for data augmentation when training face and expression recognition models, and for producing animations for human-computer interaction and entertainment. Image synthesis has received considerable attention since the arrival of generative adversarial networks (GANs) [13] and their conditional variant [24]. A GAN consists of two networks, a generator and a discriminator, trained with an adversarial learning scheme: the generator tries to fool the discriminator by producing realistic-looking fake images, while the discriminator learns to distinguish real images from fake ones. GANs have been utilized to solve many image-related problems including image synthesis [26], image-to-image translation [16, 36], image super-resolution [3, 5], and facial attribute editing [12, 33, 35]. Recent GAN-based methods have shown impressive results on facial expression synthesis tasks [6, 27, 31] as well. Despite producing photorealistic and plausible results, these models have two limitations. First, they require large datasets to synthesize photorealistic expressions; as shown in [1, 4, 6, 18], existing models introduce color degradation and noticeable artifacts in the synthesized expressions when trained on smaller datasets. Second, the majority of existing methods [6, 7, 27, 33] share a common generator architecture [17] that incurs a significant computational cost and prohibits deployment on resource-constrained devices.

In this paper, we aim to build a simpler, smaller, and effective facial expression synthesis model called the Ultimate Skip Generative Adversarial Network (US-GAN). The generator of the proposed US-GAN consists of four modules: encoding layers, a single residual block, decoding layers, and an ultimate skip connection, as illustrated in Fig. 1. The encoding layers enable the network to encode key facial details. The residual block helps to refine these details. The decoding layers then decode high-level facial details from this latent encoding. Finally, we propose to directly link the input image to the output image with a skip connection, called the ultimate skip connection. We hypothesize that including an ultimate skip connection in the generator of a GAN improves the transfer of identity, facial details, and color details from input to output. Since the generator is relieved from the task of producing such details, it can devote its parameters to learning expression mappings only and does not suffer from color degradation and block artifacts.

Fig. 1

The generator of the proposed framework. US-GAN takes a face image with any expression and the target expression label as input and produces the output image with the target expression. It consists of four modules: i) encoding layers, ii) single residual block, iii) decoding layers, and iv) an ultimate skip connection

The contributions of this work can be summarized as follows:

1. Incorporation of an ultimate skip connection improves preservation of identity, facial details, and color details while inducing convincing expressions.

2. The ultimate skip connection leads to a model with three times fewer parameters that can be trained on a dataset two orders of magnitude smaller than that of the state-of-the-art GANimation model.

3. US-GAN qualitatively outperforms the state of the art in realism, mapped expression, and identity preservation by 25%, 43%, and 58%, respectively.

4. Quantitatively, US-GAN improves the face verification score (FVS) by 7% and reduces the average content distance (ACD) by 27% compared to the state of the art.

5. US-GAN generalizes well to out-of-dataset facial images.

The rest of the paper is structured as follows. Section 2 reviews state-of-the-art image-to-image translation and facial expression synthesis models. The proposed US-GAN architecture is explained in Section 3 along with loss functions, datasets, and evaluation metrics. Qualitative as well as quantitative experimental results are presented in Section 4. Section 5 discusses the effectiveness of our proposed method and we conclude in Section 6.

2 Related work

2.1 Image-to-image translation

Generative Adversarial Networks (GANs) [13] and their conditional variant (cGAN) [24] have been successfully utilized for image synthesis and image-to-image translation problems. The Pix2pix model [16] employed a conditional GAN with an \(\ell_{1}\) image reconstruction loss to enforce generated samples to be close to target images. To learn the mapping between two domains from paired datasets, it utilized U-Net [29] and PatchGAN [16] architectures in the generator and discriminator, respectively. Zhu et al. [36] introduced a cycle consistency loss to perform cross-domain mapping using unpaired datasets. Their network, named CycleGAN, contains two generators and two discriminators to learn the cyclical, cross-domain mappings, and adopts Johnson et al.'s architecture [17] in its generators. While Pix2pix and CycleGAN can be used to learn mappings between facial expressions, these networks fail to produce realistic expressions, as demonstrated in [18], where the smaller size of facial expression synthesis datasets is suggested as a possible reason for their failure.

2.2 Facial expression synthesis

GANs have been widely used for facial expression synthesis. Ding et al. [8] proposed ExprGAN to synthesize facial expressions with controllable intensity; however, it fails to preserve the identity details of the input image. The GC-GAN model [28] induces a desired facial expression on an input image. StarGAN [6] learns mappings among multiple expressions using a single, shared generator; when trained on a small dataset, its authors reported color degradation artifacts. The GANimation model [27], trained on the large EmotioNet [10] dataset, extracts action units from a target face and transfers them to an input face, so the quality of its expressions is highly dependent on the extracted action units. Liu et al. [21] proposed an encoder-decoder architecture, called STGAN, with symmetric skip connections and selective transfer units for facial attribute manipulation. The Cascade-EF GAN [34] generates sharper, more realistic images by employing local and global attention in a progressive manner; for out-of-dataset images, it was fine-tuned on the large AffectNet [25] dataset. In summary, current state-of-the-art GAN-based facial expression synthesis models require large datasets to induce satisfactory expressions on in- and out-of-dataset images, and their quality degrades significantly when they are trained on smaller datasets. In contrast, our proposed US-GAN produces realistic expressions on both in- and out-of-dataset images by employing only hundreds of images for training.

3 Materials and methods

In this work, we aim to learn a mapping that transforms an input image \(\boldsymbol{x} \in \mathcal{R}^{D\times D \times 3}\) with a known original expression co into an output image \(\boldsymbol{y} \in \mathcal{R}^{D\times D \times 3}\) with a target expression ct. Both expression vectors co and ct are one-hot encodings over C expression classes. The proposed US-GAN consists of two modules: a generator and a discriminator.

3.1 Generator

Figure 2 presents an overview of the complete US-GAN pipeline. The generator consists of the following four modules.

Fig. 2

US-GAN consists of two modules: a generator (\(\mathbb {G}\)) and a discriminator (\(\mathbb {D}\)). \(\mathbb {G}\) takes an input image (x) and a target expression vector (ct) and generates an output image (y) with the target expression. The same network \(\mathbb {G}\) is then used to reconstruct the input image, yielding \(\boldsymbol {\hat {x}}\), from y and the original expression vector (co). Discriminator head \(\mathbb {D}_{I}\) learns to distinguish real images x from fake images y, while head \(\mathbb {D}_{c}\) classifies the discriminator input into an expression class

3.1.1 Encoding layers

There are three encoding layers in our proposed network. The first layer takes an input volume of size D × D × (3 + C), formed by concatenating the image x with the spatially replicated target expression label, and generates a feature volume of size D × D × 64. The second encoding layer takes this feature volume, performs a strided convolution for downsampling, and produces an output volume of size \(\frac {D}{2} \times \frac {D}{2} \times 128\). The third encoding layer performs another strided convolution and provides an encoded volume of size \(\frac {D}{4} \times \frac {D}{4} \times 256\).
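The following PyTorch sketch shows one way these three encoding layers could be implemented. It is a minimal sketch, not the authors' code: the kernel sizes, padding, and the use of instance normalization inside the encoder are assumptions, since the text specifies only the channel counts and the stride-2 downsampling.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Three encoding layers: D x D x (3 + C)  ->  D/4 x D/4 x 256."""

    def __init__(self, num_classes):
        super().__init__()
        self.layers = nn.Sequential(
            # Layer 1: keeps the spatial size, (3 + C) -> 64 channels
            nn.Conv2d(3 + num_classes, 64, kernel_size=7, stride=1, padding=3),
            nn.InstanceNorm2d(64, affine=True),
            nn.ReLU(inplace=True),
            # Layer 2: strided convolution, D -> D/2, 64 -> 128 channels
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(128, affine=True),
            nn.ReLU(inplace=True),
            # Layer 3: strided convolution, D/2 -> D/4, 128 -> 256 channels
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(256, affine=True),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.layers(x)
```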

3.1.2 Residual block

Residual connections [14] assist deep networks in learning better representations by directly transferring a layer's input to a later layer's output so that only the residual transformation needs to be modelled. This makes the learning task easier. For expression synthesis, residual connections help to preserve facial details. We employ only one residual block in the body of our network. The residual block consists of two 3 × 3 convolutional layers with 256 channels, two instance normalization layers [32], and a ReLU layer [11]. Each convolutional layer is followed by an instance normalization layer, the ReLU activation is applied to the output of the first instance normalization layer, and the block input is added directly to the output of the second instance normalization layer.
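A minimal PyTorch sketch of this residual block, following the description above (3 × 3 convolutions, 256 channels, instance normalization after each convolution, ReLU after the first normalization, identity addition after the second), is:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Single residual block of the US-GAN generator bottleneck."""

    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm2d(channels, affine=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm2d(channels, affine=True),
        )

    def forward(self, x):
        # The block input is added directly after the second instance norm.
        return x + self.body(x)
```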

3.1.3 Decoding layers

The first decoding layer takes the \(\frac {D}{4} \times \frac {D}{4} \times 256\) volume produced by the residual block and performs fractionally-strided convolutions to produce an upsampled volume of size \(\frac {D}{2} \times \frac {D}{2} \times 128\). The second decoding layer takes these upsampled features and applies fractionally-strided convolution again to produce an upsampled volume of size D × D × 64. The last layer transforms these feature maps into an output image \(\mathbb {G}(\boldsymbol {x},\mathbf {c}_{t})\) of size D × D × 3.
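A corresponding sketch of the decoding layers is given below. The transposed-convolution kernel sizes and the bounded (tanh) output of the last layer are assumptions; the text specifies only the channel counts and the upsampling factors.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Two fractionally-strided convolutions plus an output layer:
    D/4 x D/4 x 256  ->  D x D x 3."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            # D/4 -> D/2, 256 -> 128 channels
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(128, affine=True),
            nn.ReLU(inplace=True),
            # D/2 -> D, 128 -> 64 channels
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(64, affine=True),
            nn.ReLU(inplace=True),
            # Output layer: 64 -> 3 channels
            nn.Conv2d(64, 3, kernel_size=7, stride=1, padding=3),
            nn.Tanh(),  # assumption: bounded residual image
        )

    def forward(self, h):
        return self.layers(h)
```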

3.1.4 Ultimate skip connection

Given a large training set, it is possible to learn encodings rich enough, and decoders powerful enough, to reproduce all relevant input details [6]. For smaller training sets, however, transferring the input directly to the output via an ultimate skip connection makes the encoding and decoding tasks easier. The output of the generator is obtained by adding the input to the output of the last decoding layer:

$$ \boldsymbol{y} = \mathbb{G}(\boldsymbol{x}, \boldsymbol{c}_{t}) + \boldsymbol{x}. $$
(1)

This allows the encoding and decoding layers to focus purely on the expressions, as can be seen in Fig. 3. Since input details relating to identity, facial features, and overall color are already transferred via the ultimate skip connection, the parameters of the generator only need to learn the residual expression \(\mathbb {G}(\boldsymbol {x},\boldsymbol {c}_{t})\). Similar ideas have been explored for image restoration [23] and facial attribute editing [30].
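Putting the pieces together, the generator forward pass with the ultimate skip connection of (1) can be sketched as follows. It reuses the Encoder, ResidualBlock, and Decoder sketches above; the spatial tiling of the one-hot target label is an assumption consistent with the D × D × (3 + C) input volume of Section 3.1.1.

```python
import torch
import torch.nn as nn

class USGANGenerator(nn.Module):
    """Generator sketch: y = G(x, c_t) + x, Eq. (1)."""

    def __init__(self, num_classes):
        super().__init__()
        self.encoder = Encoder(num_classes)
        self.res_block = ResidualBlock(256)
        self.decoder = Decoder()

    def forward(self, x, c_t):
        # Tile the one-hot target expression over the spatial dimensions and
        # concatenate it with the image along the channel axis.
        b, _, h, w = x.shape
        c_map = c_t.view(b, -1, 1, 1).expand(b, c_t.size(1), h, w)
        inp = torch.cat([x, c_map], dim=1)
        residual = self.decoder(self.res_block(self.encoder(inp)))
        return residual + x  # ultimate skip connection
```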

Fig. 3

Illustration of residual and final output image synthesized by our proposed US-GAN. The network is encouraged to generate only expression-related details since the ultimate skip connection carries over identity, facial, and color details directly from the input image

3.2 Discriminator

The discriminator \(\mathbb {D}\) transforms its input image into a feature volume of size \(\frac {D}{2^{6}}\times \frac {D}{2^{6}} \times 2048\) through a sequence of six layers of 4 × 4 strided convolution filters. Starting from 64 channels in the first layer, each convolution layer doubles the number of channels. Each convolution is followed by a LeakyReLU activation. The feature volume after the sixth layer is converted into a score \(\mathbb {D}_{I}\) that is interpreted as the probability of the discriminator's input being a real image. In parallel, the volume is also converted into a C × 1 vector of probabilities \(\mathbb {D}_{c}\) representing the expression of the discriminator's input image.
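A sketch of this two-headed discriminator is given below. The kernel sizes of the two output heads are assumptions in the style of StarGAN-like discriminators, and the realness head returns raw scores rather than probabilities, as is usual when training with a Wasserstein loss.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Six 4x4 strided convolutions (64 -> 2048 channels) with LeakyReLU,
    followed by a realness head D_I and an expression head D_c."""

    def __init__(self, image_size=128, num_classes=7):
        super().__init__()
        layers, in_ch, out_ch = [], 3, 64
        for _ in range(6):  # D -> D / 2^6
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.01)]
            in_ch, out_ch = out_ch, out_ch * 2
        self.backbone = nn.Sequential(*layers)
        feat_size = image_size // 2 ** 6  # e.g. 128 / 64 = 2
        self.head_real = nn.Conv2d(in_ch, 1, kernel_size=3, stride=1, padding=1)
        self.head_cls = nn.Conv2d(in_ch, num_classes, kernel_size=feat_size, bias=False)

    def forward(self, img):
        h = self.backbone(img)
        score = self.head_real(h)                      # D_I: realness score map
        logits = self.head_cls(h).view(h.size(0), -1)  # D_c: C expression logits
        return score, logits
```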

3.3 Loss formulations

The proposed model is trained by minimizing a combination of three loss functions.

3.3.1 Adversarial loss

In order to avoid training instability and generate higher quality images, we use a Wasserstein adversarial loss [2] with a gradient penalty, defined as

$$ \mathcal{L}_{A} = E\left[ \mathbb{D}_{I}(\boldsymbol{x})\right] - E\left[\mathbb{D}_{I}(\boldsymbol{y})\right] - \lambda_{gp} E\left[ \left(\Vert \nabla_{\boldsymbol{\bar{x}}} \mathbb{D}_{I}(\boldsymbol{\bar{x}}) \Vert_{2} - 1\right)^{2} \right], $$
(2)

where \(\mathbb {D}_{I}(\cdot )\) is proportional to the probability of its input image being real, λgp is the gradient penalty coefficient, and \(\boldsymbol {\bar {x}}\) is a uniform random linear combination of x and y.
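The terms of (2) can be computed with automatic differentiation as sketched below; the snippet assumes the two-headed discriminator interface of Section 3.2 and is not the authors' implementation.

```python
import torch

def gradient_penalty(disc, real, fake, lambda_gp=10.0):
    """Penalty term of Eq. (2); x_bar is a uniform random linear
    combination of a real and a generated image."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_bar = (alpha * real + (1.0 - alpha) * fake).requires_grad_(True)
    score, _ = disc(x_bar)  # only the realness head D_I is penalized
    grads, = torch.autograd.grad(outputs=score.sum(), inputs=x_bar, create_graph=True)
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

def wasserstein_terms(disc, real, fake):
    """E[D_I(x)] - E[D_I(y)], the first two terms of Eq. (2)."""
    real_score, _ = disc(real)
    fake_score, _ = disc(fake)
    return real_score.mean() - fake_score.mean()
```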

3.3.2 Image reconstruction loss

Let \(\hat {\boldsymbol {x}}\) be the reconstruction of the original image x, generated from the synthesized image y and the original expression co, as

$$ \hat{\boldsymbol{x}} = \mathbb{G}(\boldsymbol{y},\boldsymbol{c}_{o}) + \boldsymbol{y}. $$
(3)

In order to softly enforce that the face in the input x and the generated image y correspond to the same person, we utilize the cycle reconstruction loss [36] defined as

$$ \mathcal{L}_{R} = E\left[ \Vert \boldsymbol{x} - \hat{\boldsymbol{x}} \Vert_{1} \right]. $$
(4)
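A sketch of this cycle reconstruction, assuming a generator whose forward pass already adds its input as in (1), is:

```python
import torch.nn.functional as F

def reconstruction_loss(gen, x, y, c_o):
    """Eqs. (3)-(4): rebuild x from the generated image y and the
    original expression c_o, then take the L1 distance."""
    x_hat = gen(y, c_o)  # gen already adds its input (ultimate skip), Eq. (3)
    return F.l1_loss(x_hat, x)
```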

3.3.3 Expression classification loss

We define a multiclass cross-entropy loss between target expression ct and classified expression \(\hat {\mathbf {c}}_{t}\) of the generated image y as

$$ \mathcal{L}_{C}^{F} = E\left[ -\log \mathbb{D}_{c}(\boldsymbol{c}_{t} \,|\, \boldsymbol{y}, \hat{\boldsymbol{c}}_{t}) \right], $$
(5)

and between original expression co and classified expression \(\hat {\mathbf {c}}_{o}\) of the input image x as

$$ \mathcal{L}_{C}^{R} = E\left[ -\log \mathbb{D}_{c}(\boldsymbol{c}_{o} \,|\, \boldsymbol{x}, \hat{\boldsymbol{c}}_{o}) \right]. $$
(6)

The idea is to penalize any deviation between target and classified expression of y and between original and classified expression of x.
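Both terms reduce to a standard cross-entropy on the classifier head \(\mathbb {D}_{c}\); a sketch, assuming integer class labels and the discriminator interface sketched earlier, is:

```python
import torch.nn.functional as F

def classification_losses(disc, x, y, c_o_idx, c_t_idx):
    """Eqs. (5)-(6): the fake-image term L_C^F drives the generator,
    the real-image term L_C^R trains the classifier head."""
    _, logits_fake = disc(y)
    _, logits_real = disc(x)
    loss_fake = F.cross_entropy(logits_fake, c_t_idx)  # L_C^F, Eq. (5)
    loss_real = F.cross_entropy(logits_real, c_o_idx)  # L_C^R, Eq. (6)
    return loss_fake, loss_real
```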

3.3.4 Overall GAN objectives

The overall objective functions for discriminator \(\mathbb {D}\) and generator \(\mathbb {G}\) can be written as

$$ \mathcal{L}_{\mathbb{D}} = -\mathcal{L}_{A} + \lambda_{C} \mathcal{L}_{C}^{R}, $$
(7)
$$ \mathcal{L}_{\mathbb{G}} = \mathcal{L}_{A} + \lambda_{C} \mathcal{L}_{C}^{F} + \lambda_{R} \mathcal{L}_{R}, $$
(8)

where λC and λR denote the weights of the classification and reconstruction losses, respectively. Minimizing \({\mathscr{L}}_{\mathbb {D}}\) encourages the discriminator to improve its ability to i) differentiate between real and fake images, and ii) classify the expression of its input image. Minimizing \({\mathscr{L}}_{\mathbb {G}}\) encourages the generator to produce fake images that i) are hard to distinguish from real images, ii) have the desired expression, and iii) preserve the identity of the input image.
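The two objectives can be assembled from the loss sketches above. In the generator sketch, the terms of \(\mathcal{L}_{A}\) that do not depend on \(\mathbb {G}\) (the real-image score and the gradient penalty) are dropped, which leaves the generator gradient unchanged; this is an implementation convenience, not something specified in the text.

```python
def discriminator_loss(disc, gen, x, c_o_idx, c_t, c_t_idx,
                       lambda_c=1.0, lambda_gp=10.0):
    """Eq. (7): L_D = -L_A + lambda_C * L_C^R."""
    y = gen(x, c_t).detach()  # do not backpropagate into G here
    l_a = wasserstein_terms(disc, x, y) - gradient_penalty(disc, x, y, lambda_gp)
    _, l_cls_real = classification_losses(disc, x, y, c_o_idx, c_t_idx)
    return -l_a + lambda_c * l_cls_real

def generator_loss(disc, gen, x, c_o, c_o_idx, c_t, c_t_idx,
                   lambda_c=1.0, lambda_r=10.0):
    """Eq. (8): L_G = L_A + lambda_C * L_C^F + lambda_R * L_R."""
    y = gen(x, c_t)
    fake_score, _ = disc(y)
    l_a = -fake_score.mean()  # G-dependent part of L_A
    l_cls_fake, _ = classification_losses(disc, x, y, c_o_idx, c_t_idx)
    l_rec = reconstruction_loss(gen, x, y, c_o)
    return l_a + lambda_c * l_cls_fake + lambda_r * l_rec
```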

3.4 Implementation details

We train the proposed US-GAN model from scratch for 350 epochs using the Adam optimizer [19] with β1 = 0.5,β2 = 0.999, learning rate 0.0001 and batch size of 8. Following [6], we set λC = 1, λR = 10 and λgp = 10 for all experiments.
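The training configuration above can be sketched as follows. The data loader, the random sampling of target expressions, and the strict alternation of one discriminator update per generator update are assumptions; the text specifies only the optimizer, its hyperparameters, and the loss weights.

```python
import torch
import torch.nn.functional as F

NUM_CLASSES = 7
gen = USGANGenerator(num_classes=NUM_CLASSES)
disc = Discriminator(image_size=128, num_classes=NUM_CLASSES)

opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4, betas=(0.5, 0.999))

# `loader` is assumed to yield (image, expression-label) batches of size 8.
for epoch in range(350):
    for x, c_o_idx in loader:
        c_t_idx = torch.randint(0, NUM_CLASSES, (x.size(0),))  # random targets
        c_t = F.one_hot(c_t_idx, NUM_CLASSES).float()
        c_o = F.one_hot(c_o_idx, NUM_CLASSES).float()

        opt_d.zero_grad()
        discriminator_loss(disc, gen, x, c_o_idx, c_t, c_t_idx).backward()
        opt_d.step()

        opt_g.zero_grad()
        generator_loss(disc, gen, x, c_o, c_o_idx, c_t, c_t_idx).backward()
        opt_g.step()
```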

3.5 Dataset

We trained our model using three publicly available datasets: KDEF [22], RaFD [20], and CFEE [9]. The KDEF dataset consists of 490 images of the seven universal expressions collected from 35 male and 35 female participants. RaFD contains 8,040 facial expression images collected from 67 participants at five different angles; we used only the 469 frontal images in our experiments. The CFEE dataset contains 5,060 compound facial expression images of 230 participants, of which we used 1,610 images. In total, we used 2,569 images covering seven facial expressions from the three datasets, with 90% of the images used for training and the rest for testing. All facial images are center-cropped and resized to 128 × 128. To evaluate the effectiveness of our proposed model on out-of-dataset images, images of celebrities, paintings, and avatars were downloaded from the Internet. These images differ significantly from the distribution of the training datasets. Example in-dataset and out-of-dataset images are shown in Fig. 4.

Fig. 4

Example images from in- and out-of-dataset

3.6 Evaluation metrics

We compare models in terms of i) number of learnable parameters, and ii) size of training sets. We compare the quantitative performance of models in terms of the following two metrics.

1. Average Content Distance (ACD) is the squared Euclidean distance between the features ϕ(x) of the input image and the features ϕ(y) of the generated image, where ϕ(⋅) is extracted using a face classifier (Footnote 1); a minimal sketch of this computation is given after this list.

    $$ ACD(\boldsymbol{x},\boldsymbol{y}) = \Vert \phi(\boldsymbol{x}) - \phi(\boldsymbol{y}) \Vert_{2}^{2} $$
    (9)
2. Face Verification Score (FVS) computes the similarity between the input and synthesized images using Face++ (Footnote 2) and returns a value between 0 and 100 indicating the likeness of the two faces.
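A sketch of the ACD computation of (9) is shown below, where `phi` stands for any pretrained face feature extractor; the specific classifier used by the authors is the one referenced in the footnote.

```python
import torch

def average_content_distance(phi, x, y):
    """Eq. (9): squared Euclidean distance between identity features of
    the input image x and the generated image y, averaged over a batch."""
    with torch.no_grad():
        diff = phi(x) - phi(y)
    return (diff ** 2).sum(dim=-1).mean().item()
```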

For qualitative comparison, we performed a user study to compute user preference percentages for StarGAN, STGAN, GANimation, and our proposed US-GAN. Eighty users participated in the study. Eighteen input face images from in- and out-of-dataset were randomly selected for evaluation, and synthesized expressions for these images were generated using each of the four models. Each input image in a neutral expression was placed on the left, with its four manipulated versions placed next to it in random order (as shown in Fig. 5). The human evaluators were not made aware of the source algorithm of any manipulated image and were asked to select one image for each of the following three questions.

1. Which image looks most realistic (without considering the expression)?

2. Which image has the most convincing expression?

3. Which image best preserves the identity of the input face?

Fig. 5

A sample from the user study. An input image in a neutral expression was placed alongside its four manipulated versions, arranged in random order in one row, and human evaluators were asked to vote for the best synthesized image in terms of i) realism, ii) mapped expression, and iii) identity preservation

3.7 Ablation study

We conducted ablation studies to investigate the effectiveness of the ultimate skip connection and the number of residual blocks in the generator of the proposed US-GAN.

3.7.1 Ultimate skip connection

To qualitatively demonstrate the usefulness of the ultimate skip connection, we designed the following two experiments:

1. Train US-GAN with and without the ultimate skip connection (Footnote 3) to observe the difference.

2. Train STGAN [21] with and without an ultimate skip connection to observe the difference.

3.7.2 Number of residual blocks

To validate the effectiveness of residual blocks in the bottleneck of the US-GAN generator, we train US-GAN with one and with six residual blocks. We denote the US-GAN model with R residual blocks as US-GAN-R, where R ∈ {1,6}. The comparison with six residual blocks is motivated by state-of-the-art models such as StarGAN [6] and GANimation [27], which both use six residual blocks.

4 Results

In this section, we conduct extensive experiments to evaluate the performance of our proposed method. We first discuss baseline details in Section 4.1. Qualitative and quantitative evaluations are then presented in Sections 4.2 and 4.3, respectively.

4.1 Baselines

We compare our proposed US-GAN with three state-of-the-art, multi-domain facial expression synthesis models: StarGAN [6], STGAN [21], and GANimation [27]. For StarGAN and STGAN, we used the code and hyperparameter settings provided by the authors and trained the models on the same combined dataset (KDEF, RaFD, and CFEE) used to train US-GAN. For GANimation, we used a model pre-trained for 30 epochs on the large EmotioNet dataset [10].

4.2 Qualitative evaluation

Generalization of US-GAN to out-of-dataset imagery can be observed in Figs. 6 and 7. The ultimate skip connection transfers input image details so that the network parameters can focus on generating expressions, which leads to realistic expressions and preserved identities, facial details, and color details. A comparison with three state-of-the-art facial manipulation models, StarGAN [6], STGAN [21], and GANimation [27], is presented in Fig. 8. While existing models can induce expressions on both in- and out-of-dataset images, GANimation introduces strong artifacts around the eyes, nose, and mouth, StarGAN fails to recover the true input image colors, and STGAN introduces pseudo-periodic artifacts in the synthesized images. In comparison, the proposed US-GAN successfully introduces the desired expression without adding irrelevant changes.

Fig. 6

Facial expression synthesis results on in-dataset (top two rows) and out-of-dataset (last two rows) testing images. The proposed method, trained on the RaFD, KDEF, and CFEE datasets, generates plausible expressions while preserving identity and retaining facial details. Despite the relatively small training set, these results demonstrate both the in-dataset and the out-of-dataset generalization strength of the proposed method

Fig. 7

Expressions synthesised by US-GAN on Rows 1 to 3: in-dataset, and Rows 4 to 7: out-of-dataset images. Due to the ultimate skip connection, the proposed method preserves input image details and colors and induces convincing expressions

Fig. 8

Comparison of facial expression synthesis results obtained by the proposed US-GAN and other state-of-the-art models. Left: An in-dataset image. Right: An out-of-dataset image. GANimation [27] introduces aging and other noticeable artifacts on all in- and out-of-dataset images. STGAN [21] introduces pseudo-periodic artifacts. StarGAN [6] introduces a pinkish bias and, for out-of-dataset images, generates artifacts. The proposed US-GAN successfully synthesizes expressions while preserving identity, facial details, and color details

4.3 Quantitative evaluation

Table 1 shows that our proposed method has three times fewer parameters than StarGAN and GANimation and is trained on a dataset two orders of magnitude smaller. Compared to STGAN, the proposed method has an order of magnitude fewer parameters. The ACD and FVS values indicate that US-GAN is the most effective at preserving the identity and other features of the input.

Table 1 Compared to the state of the art, the proposed US-GAN has more than three times fewer parameters and is trained on a dataset two orders of magnitude smaller. It yields the best identity preservation (lowest ACD and highest FVS) between inputs and outputs

The summarized results of the user study are provided in Fig. 9. Compared to the second-best performing model, US-GAN yielded a 25% improvement in realism, a 43% improvement in the plausibility of mapped expressions, and a 58% improvement in identity preservation.

Fig. 9

User study results for facial expression synthesis. The proposed US-GAN outperforms StarGAN [6], STGAN [21] and GANimation [27] based on user preferences

4.4 Ablation study

4.4.1 Ultimate skip connection

Figure 10 demonstrates that the ultimate skip connection directly leads to the preservation of input image details, both for the proposed US-GAN and for STGAN [21]. The second row contains US-GAN results. The third row shows that when the ultimate skip connection is removed, the resulting model fails to preserve input image colors and introduces artifacts. For STGAN, the fourth row shows that adding an ultimate skip connection leads to a clear reduction in artifacts and improves the transfer of facial details from input to output.

Fig. 10

Impact of the ultimate skip connection on generated images. First row: Input images. Row 2: US-GAN results. Row 3: US-GAN trained without the ultimate skip connection. Row 4: STGAN [21] trained after adding an ultimate skip connection. Row 5: STGAN results. The ultimate skip connection helps to preserve the input image and overall color details while introducing convincing expressions in the proposed US-GAN. The last row shows that, without the ultimate skip connection, STGAN fails to preserve color details and introduces noise-like artifacts in the synthesized images

4.4.2 Number of residual blocks

Existing models such as StarGAN [6] and GANimation [27] use six residual blocks in the generator. Figure 11 demonstrates that, powered by the ultimate skip connection, even one residual block can produce plausible, identity- and color-preserving transformations. In other words, the ultimate skip connection reduces the need for many parameters, which can help improve generalization.

Fig. 11

First row: StarGAN with six residual blocks failed to preserve the true colors of an input image. Second row: US-GAN with six residual blocks produced sharper expressions with better preserved facial and color details. Third row: US-GAN with only one residual block did not suffer from significant drop in quality of results since most of the heavy lifting related to facial and color details is already carried out by the ultimate skip connection

5 Discussion and future directions

We now address a few questions raised by our results and place our results in context of existing work.

Why an ultimate skip connection?

Skip connections have already been used between encoding and decoding layers [21], since they allow easier residuals of intermediate tasks to be learned within deep networks. In hindsight, it seems natural to apply residual learning to the original, end-to-end expression synthesis task.

The ultimate skip connection has been shown to be fundamentally important for expression synthesis since it performs the heavy lifting of transferring non-expression-related details from the input to the output at no cost. This leaves the learnable parameters of the generator free to focus purely on expression synthesis.

This can also be viewed from the perspective of residual learning. Residual learning, as popularized by the ResNet model [14], works by learning easier, residual sub-problems within deep network layers instead of complete transformations. By incorporating a direct ultimate skip connection from input to output, we have applied residual learning to the original, end-to-end expression synthesis problem and made it easier to solve.

Since the problem has been made easier, we can solve it with fewer parameters (only one residual block in US-GAN instead of six in competing models). As a consequence, US-GAN exhibits better generalization on out-of-dataset imagery.

Are residual blocks even necessary?

Another question raised, but not answered, by our current work is whether residual blocks are necessary at all, since even one residual block, powered by the ultimate skip connection, produced better results than the StarGAN and GANimation models, which both contain six residual blocks. Perhaps increasing direct skip connections between encoding and decoding layers, in the manner of U-Net [29] or DenseNet [15], can be more effective than residual blocks.

What is the limitation of the proposed ultimate skip connection?

Although the ultimate skip connection helps to recover input image details, it slightly reduces the intensity of the synthesized expressions, as can be observed in Fig. 10. In other words, the ultimate skip connection helps to synthesize realistic expressions at the cost of a weakened expression manipulation ability. This can perhaps be alleviated by gating the skip connection through attention mechanisms [3].

6 Conclusion

We have proposed US-GAN, a smaller and more effective model for facial expression synthesis. Our primary contribution is demonstrating the benefit of an ultimate skip connection that transfers identity, facial, and color details directly from input to output. This eases the task of the generator, which can then focus on inducing expressions only. It also reduces the number of learnable parameters, since multiple residual blocks are no longer needed, which improves generalization. Compared to state-of-the-art models, US-GAN has more than three times fewer parameters and is trained on a dataset two orders of magnitude smaller. Based on the ACD and FVS metrics, US-GAN generates realistic expressions while best preserving the identity and details of the input face. It also outperforms the state of the art in image realism, expression plausibility, and identity preservation according to human evaluators. Our results indicate that the ultimate skip connection is fundamentally important for the facial expression synthesis task. The proposed method can potentially be extended by exploring intermediate skip connections as an alternative to residual blocks and by incorporating spatial attention to improve the expressiveness of results.