1 Introduction

Cross-modal generation aims to generate data from one modality conditioned on another correlated modality, and it has attracted considerable research effort. Early research on cross-modal generation usually generates low-dimensional data from high-dimensional data, such as voice-to-text [1, 2] and image-to-text [3, 4]. Recently, thanks to the rapid growth of generative adversarial networks (GANs) [5] and the increase in multi-modal datasets [6], it has become possible to generate complex data from low-dimensional data, such as text-to-image [7, 8] and audio-to-image [9, 10]. Audio and visual information are the most important perceptual modalities in our daily life. We believe that research on cross-modal audiovisual generation can endow machines with human-like capabilities of imagination and interpretation. Here, we leverage voices to directly generate speakers' facial images with GANs.

Many previous works have addressed the audio-to-image generation problem. Duarte et al. applied conditional GANs (cGANs) [11] to directly generate faces from voice [12]. Later, Wen et al. used auxiliary classifier GANs (AC-GANs) [15] to directly generate faces [14]. Oh et al. leveraged an encoder–decoder network to learn the cross-modal visual–audio relationship and then generated the face based on the corresponding static face and voice [13]. However, the faces generated in the studies above usually suffer from unsatisfactory artifacts and missing parts. The reasons are twofold. First, face generation in previous works usually considers the identity information of target faces but ignores the corresponding facial expressions. One's expression usually changes with different emotions when she/he talks to others, so emotion is a key to constructing high-quality facial images. Second, we find that GANs with a single discriminator are unable to learn the complex mapping between the audio and visual modalities: a single discriminator only constrains the generated images to lie on one manifold of the true data distribution. To improve the quality of the generated faces, a promising way is to use multiple discriminators, rather than only one, to help the generator learn more fine-grained facial features from audio.

In this paper, we propose a novel model, facial expression GANs (abbr. FE-GAN), to generate faces from given voice information. In a nutshell, FE-GAN considers emotion and identity variations from face and voice simultaneously. The semantic consistency between a human's voice and face [16, 17] inspires us to adopt both identity labels and emotion labels for model training. Specifically, multiple discriminators can take the emotion and identity constraints into account so that the generator also retains more emotion and identity characteristics. The simplified pipeline of the proposed method is shown in Fig. 1. The core of FE-GAN is composed of one generator network (G-net) and two discriminator–classifier pairs, namely (C1-net, D1-net) and (C2-net, D2-net). In the generation process, the voice encoder extracts the Fbank features \(F_\mathrm{v}\) from a voice clip V and obtains the voice embedding \(E_\mathrm{v}\) through V-net. Next, taking \(E_\mathrm{v}\) as input, the generator G-net generates the face image \(I_\mathrm{G}\). Finally, the two discriminators D1-net and D2-net are used to distinguish whether a face image is real or fake, while the auxiliary classifiers C1-net and C2-net predict its identity and emotion. This design of FE-GAN can not only learn the one-to-one mapping between faces and voices but also capture various emotions of the target person that are correlated with the input speech. Our contributions can be summarized as follows:

(1) We propose an effective GAN model (FE-GAN) for cross-modal voice-to-face generation. It explores the emotion and identity relationships in the cross-modal voice-to-image task and generates sharper facial images with expressions. (2) We adopt two discriminators and two classifiers in GANs. They help the model generate more realistic images and transfer label information to the generator. Besides, we explore the optimization problem of multiple discriminators and classifiers and present a triple loss to optimize FE-GAN. (3) We conduct qualitative and quantitative experiments on the RAVDESS [18] and eNTERFACE [19] datasets. The results show that FE-GAN outperforms previous GAN methods [12, 14] and achieves the best performance on a series of metrics with remarkable improvements.

The rest of the paper is organized as follows. Section 2 reviews the relevant research. Section 3 gives the technical details of the proposed FE-GAN. Section 4 reports the experimental results. Section 5 concludes this paper.

Fig. 1

The simplified pipeline of the proposed method. Our method is divided into two parts: the voice encoder (gray dashed box) and FE-GAN (blue dashed box). (1) The voice encoder consists of VAD (voice activity detection) and V-net, which takes Fbank features \(F_\mathrm{v}\) as input and outputs embedding features \(E_\mathrm{v}\). (2) FE-GAN consists of five parts: G-net, C1-net, C2-net, D1-net and D2-net. FE-GAN transfers the embedding \(E_\mathrm{v}\) into a face image \(I_\mathrm{G}\) and then predicts its source (true or fake) and categories (emotion and identity)

2 Related work

2.1 Generative adversarial networks

GANs [5] are an elegant game-theoretic architecture that can easily be combined with other backbone networks and mechanisms. A vanilla GAN [5] consists of two neural networks: a generator and a discriminator. Given a random noise sample, the generator attempts to generate an image that fools the discriminator, while the discriminator is responsible for distinguishing generated images from real ones. To address training instability and obtain high-quality results, many variants of GANs have been developed. For example, conditional GANs (cGANs) [11] introduce a conditional constraint to provide additional attribute information; the condition can be class labels, object attributes or feature embeddings. However, this brings additional noise to the network and extra burden to the training process. Compared with cGANs, auxiliary classifier GANs (AC-GANs) [15] leverage an additional auxiliary classifier, which shares weights with the discriminator, to help supervise the learning process and generate sharper images. Besides, dual discriminator GANs (D2GANs) [20] and generative multi-adversarial networks (GMANs) [21] extend the GAN architecture with multiple discriminators to improve generation performance.

Recently, many cross-modal methods use GANs and their variants to generate faces from voice [12,13,14, 22,23,24,25]. Inspired by the success of GANs in cross-modal generation tasks, we build our FE-GAN model on AC-GANs [15] and D2GANs [20]. Different from these two GANs, we employ two discriminators with corresponding classifiers to guide the generator toward photo-realistic facial images.

2.2 Audio representations selection and extraction

In human interaction, voice carries various emotion and identity cues, which are conveyed by linguistic information (e.g., words, sentences and language meaning) [26] and prosodic information (e.g., voice pitch, tempo, loudness and intonation) [27]. Linguistic content varies dynamically and is highly dependent on word dictionaries and language models [26, 28], so it is unreliable and difficult to infer a speaker's emotion and identity from linguistic features [29]. Compared with linguistic information, prosodic information is global-level and does not track the dynamic variation in the voice [30]. Thus, we decide to learn audio representations from speech prosody and transfer the emotion and identity knowledge into face images.

The quality of audio representations influences the results of generation methods. Most audio-related methods analyze speech either with hand-crafted prosody features (e.g., Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP), spectrograms, Fbank and Fourier transforms) or with a neural network that indirectly learns high-level representations. Compared with hand-crafted methods [31, 32], convolutional neural networks (CNNs) can learn robust high-dimensional features and achieve high accuracy in emotion and identity classification [30, 33, 34]. Therefore, we also use a CNN as the audio feature extractor (V-net) to extract emotion and identity information from prosody features. Our experiments confirm that CNNs can learn temporal filters across features and, through a fully connected layer, distill an entire utterance into a static representation for the model.

2.3 Audio-to-visual generation

Many methods have been proposed to reconstruct visual information from different types of audio signals. Existing studies on audio-to-visual generation mainly synthesize a specific talking face from an audio clip and a still image. For example, Chung et al. [23] use facial landmarks and a voice clip to synthesize a talking-face video with an encoder–decoder CNN model. Chen et al. [22] design cascaded GANs combined with an RNN to learn joint features from voice clips and facial landmarks and generate talking-face videos from these features. Furthermore, Vougioukas et al. [24] and Yi et al. [35] consider facial expressions in generation; they adopt GANs to synthesize a talking-face video from voice and an image.

On the other hand, some methods try to generate lip shapes from voice to synthesize a face of a specific identity with the corresponding lip shape. Suwajanakorn et al. [36] and Jalalifar et al. [37] use long short-term memory (LSTM) networks to generate talking-mouth features from voice and synthesize a talking video of Obama conditioned on these landmarks. To improve the quality of the synthesized lips, Sadoughi and Busso [38] propose cGANs to learn emotion features from speech and generate lip animation with different expressions. However, these methods need to parametrize the reconstructed face model a priori, which often requires post-processing with computer graphics techniques to produce realistic, albeit subject-dependent, results.

Very few works try to leverage audio to directly generate facial images, which differs from the above-mentioned methods that use both audio and visual modalities as inputs. Existing methods on voice-to-face generation [12,13,14] use CNNs to extract embedding features from the input voice; the features are then fed into a generator or decoder to generate the corresponding images. Moreover, some works generate images conditioned directly on music [10, 39]. To overcome the shortcomings of conventional cross-modal GAN models and generate more realistic faces, we introduce emotion into our facial expression GAN (FE-GAN) and perform voice-to-face generation.

Fig. 2

The detailed architecture of V-net and FE-GAN. The symbol \(+\) represents the concatenation operation; the / represents an OR operation that chooses the corresponding label; the label symbols \(l_\mathrm{fe}\), \(l_\mathrm{fi}\), \(l_\mathrm{vi}\), \(l_\mathrm{ve}\) (yellow blocks) represent face emotion, face identity, voice identity and voice emotion, respectively; \(I_T\) and \(I_\mathrm{G}\) (gray blocks) represent the real face from the dataset and the fake face from G-net, respectively; blue lines and green dotted lines denote forward and backward propagation paths, respectively. The dimensions of the inputs and outputs are denoted on top of the blocks. Besides, the loss equations \(L_\mathrm{G}\), \(L_\mathrm{D1}\), \(L_\mathrm{D2}\), \(L_\mathrm{C1}\), \(L_\mathrm{C2}\) (green blocks) and other symbols are described in the rest of Sect. 3

3 Proposed methods

3.1 Overview of V-net and FE-GAN

This section gives the detailed architecture of V-net and FE-GAN, as shown in Fig. 2. V-net is a standard CNN with normalization that learns a voice embedding from speech prosody features. FE-GAN is composed of G-net (which generates a face image from a voice embedding) and two discriminator–classifier pairs, D1-net with C1-net and D2-net with C2-net, which judge whether a face image is real from the identity and emotion perspectives, respectively.

After extracting speakers' voices and faces from videos, we obtain training tuples of \(F_\mathrm{v}\), \(I_T\), \(l_\mathrm{ve}\), \(l_\mathrm{vi}\), \(l_\mathrm{fe}\), \(l_\mathrm{fi}\), where \(F_\mathrm{v}\) are the Fbank features extracted from the speaker's voice, \(I_T\) is the face image, and \(l_{xy}\) is the label of attribute y for modality x, where x can be v (voice) or f (face) and y can be i (identity) or e (emotion). Given the identity label \(l_\mathrm{vi}\) and the Fbank features \(F_\mathrm{v}\), we first pre-train V-net to classify a person through her/his voice. After pre-training V-net, the voice embedding \(E_\mathrm{v}\) of each voice can be extracted.

Subsequently, given a voice embedding \(E_\mathrm{v}\) with Gaussian noise \(N_\mathrm{g}\), G-net is trained to generate the target face \(I_\mathrm{G}\). Concurrently, we use the true face \(I_T\) with labels (\(l_\mathrm{fi}\), \(l_\mathrm{fe}\)) and the fake face \(I_\mathrm{G}\) with labels (\(l_\mathrm{vi}\), \(l_\mathrm{ve}\)) to train the discriminators D1-net and D2-net together with the auxiliary networks C1-net and C2-net. In this way, D1-net and D2-net are trained to distinguish whether the input face image \(I_T\) or \(I_\mathrm{G}\) is true or fake, while C1-net and C2-net are trained to classify the identity and emotion labels of the input face, respectively. Besides, the proposed triple loss, which combines the losses of the generator, the discriminators and the classifiers, is designed to optimize FE-GAN.

3.2 Pre-processing and V-net

We first apply a voice activity detection (VAD) module [40] to the original voice to remove silent frames (e.g., in the RAVDESS dataset, the average duration of the original voice is 3.6 s; after removing silent parts, it is shortened to 2.4 s). Then, the voice clips are resampled at 32 kHz and a single audio channel is preserved. Next, we repeat each audio clip 3–4 times and remove the redundancy so that all clips become 10 s long. Furthermore, the Fbank features (\(F_\mathrm{v}\)), MFCC and spectrogram are computed by fast Fourier transform with a window length of 33 ms (milliseconds) and a hop length of 16 ms. In addition, we use the ResNet-18-based face detector in Dlib [41] to detect the face regions in the videos and resize them to \(128 \times 128\) pixels. To augment the training data, we use random cropping on the audio features (with cropping lengths of 300–800 ms) and left–right flipping on the images.
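The following is a minimal pre-processing sketch with torchaudio under the settings above; it is an illustration rather than the exact pipeline, `torchaudio.functional.vad` (which only trims leading silence) stands in for the VAD module [40], and the random cropping and flipping augmentations are omitted.

```python
# Minimal audio pre-processing sketch (assumptions noted above).
import torch
import torchaudio
import torchaudio.functional as AF

TARGET_SR = 32_000            # paper: resample to 32 kHz, single channel
TARGET_LEN = 10 * TARGET_SR   # paper: repeat/trim every clip to 10 s

def load_fbank(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)           # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)       # keep a single channel
    wav = AF.resample(wav, sr, TARGET_SR)

    # Placeholder for the VAD module [40]; torchaudio's vad trims leading silence only.
    wav = AF.vad(wav, sample_rate=TARGET_SR)

    # Repeat the clip and cut it so every sample is exactly 10 s long.
    reps = TARGET_LEN // wav.shape[1] + 1
    wav = wav.repeat(1, reps)[:, :TARGET_LEN]

    # 64-bin Fbank features, 33 ms window, 16 ms hop -> (T, 64), transposed to (64, T).
    fbank = torchaudio.compliance.kaldi.fbank(
        wav, num_mel_bins=64, frame_length=33.0, frame_shift=16.0,
        sample_frequency=TARGET_SR)
    return fbank.t()                          # F_v fed to V-net
```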

Our V-net aims to classify the features \(F_\mathrm{v}\) into different identity categories and to extract the voice embedding features \(E_\mathrm{v}\). V-net takes the \(64 \times T\) (frequency \(\times \) time) dimensional \(F_\mathrm{v}\) as input and outputs the \(1 \times 128\) dimensional features \(E_\mathrm{v}\). The top row of Fig. 2 shows the network architecture of V-net: there are 5 one-dimensional convolutional layers (1D-Conv1, 1D-Conv2, \(\ldots \), 1D-Conv5) with kernel size 3, stride 2 and padding 1, each followed by a batch normalization (BN) operation with Leaky-ReLU as the activation function. After the 5th convolutional layer, the feature has 256 channels and a decimated temporal dimension. Next, we apply 1D average pooling along the temporal dimension, which efficiently aggregates information over time and makes the model applicable to input speech of varying duration. After the pooling layer, V-net compresses the features into the \(1 \times 128\) dimensional \(E_\mathrm{v}\). Besides, a cross-entropy loss with the softmax function is used to train V-net.
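A minimal PyTorch sketch of V-net consistent with the description above; the intermediate channel widths and the 256-to-128 projection after pooling are our assumptions, since the text only fixes the kernel/stride settings, the final 256 channels and the 128-d output.

```python
import torch
import torch.nn as nn

class VNet(nn.Module):
    """Voice encoder sketch: 5 strided 1D convolutions, temporal average pooling,
    a 128-d embedding E_v and an identity classification head."""

    def __init__(self, n_identities: int, channels=(64, 64, 128, 128, 256)):
        super().__init__()
        layers, in_ch = [], 64                        # F_v is 64 x T
        for out_ch in channels:                       # intermediate widths are assumed
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm1d(out_ch),
                       nn.LeakyReLU(0.2, inplace=True)]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)           # average over the time axis
        self.embed = nn.Linear(256, 128)              # assumed projection to the 128-d E_v
        self.classify = nn.Linear(128, n_identities)  # trained with softmax cross-entropy

    def forward(self, fv: torch.Tensor):
        h = self.conv(fv)                             # (B, 256, T/32)
        e_v = self.embed(self.pool(h).squeeze(-1))    # (B, 128) voice embedding E_v
        return e_v, self.classify(e_v)
```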

3.3 G-net

G-net learns the emotion and identity mapping between voice embeddings and generated images so that it can generate realistic face images to deceive the discriminators. The architecture of G-net is shown in the middle row of Fig. 2. First, the voice embedding \(E_\mathrm{v}\) is concatenated with the \(1 \times 128\) dimensional noise \(N_\mathrm{g}\), and this concatenated embedding is mapped to \(1 \times 1 \times 128\) by two fully connected layers (FC1, FC2) with BN and the ReLU function. Then, we use 6 two-dimensional transposed convolution layers (Tr-Conv1–6) to upsample to the \(3 \times 128 \times 128\) dimensional \(I_\mathrm{G}\). Each layer has kernel size 4, stride 2 and padding 1 and is followed by BN and ReLU, except the first layer (kernel size 4, stride 1 and padding 0) and the last layer (kernel size 1, stride 1 and padding 0). The numbers of channels in the transposed layers are 1024-512-256-128-64-32. To improve the generative capacity of G-net, we add a dropout strategy and a Tanh activation function inspired by Wasserstein GANs [42].
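A PyTorch sketch of one plausible reading of G-net: FC1/FC2 map the concatenated 256-d vector to a 128-d code reshaped to 128×1×1, six transposed convolutions (channels 1024-512-256-128-64-32) upsample it, and a final 1×1 layer produces the 3×128×128 image. The hidden FC width, the placement of dropout and this interpretation of the layer count are assumptions.

```python
import torch
import torch.nn as nn

class GNet(nn.Module):
    """Generator sketch: [E_v ; N_g] -> FC1/FC2 -> 128x1x1 -> transposed convs -> 3x128x128."""

    def __init__(self, dropout: float = 0.3):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(256, 256), nn.BatchNorm1d(256), nn.ReLU(inplace=True),  # FC1 (width assumed)
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(inplace=True))  # FC2

        def up(in_ch, out_ch, k, s, p):
            return [nn.ConvTranspose2d(in_ch, out_ch, k, s, p),
                    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
                    nn.Dropout2d(dropout)]              # dropout placement assumed

        chans = [1024, 512, 256, 128, 64, 32]
        layers = up(128, chans[0], 4, 1, 0)              # 1x1 -> 4x4
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += up(cin, cout, 4, 2, 1)             # 4 -> 8 -> 16 -> 32 -> 64 -> 128
        layers += [nn.ConvTranspose2d(32, 3, 1, 1, 0), nn.Tanh()]
        self.deconv = nn.Sequential(*layers)

    def forward(self, e_v: torch.Tensor, n_g: torch.Tensor) -> torch.Tensor:
        code = self.fc(torch.cat([e_v, n_g], dim=1))     # (B, 128)
        return self.deconv(code.view(-1, 128, 1, 1))     # (B, 3, 128, 128) = I_G
```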

3.4 D-net and C-net

The original AC-GANs perform backpropagation mainly through one discriminator and one classifier. A single discriminator judges images from only one perspective rather than from different semantic perspectives. Likewise, a single classifier cannot solve the multi-label consistency problem. In this paper, we argue that corresponding voices and faces should match on both types of semantic labels. Therefore, apart from distinguishing real from fake identity attributes of the speaker with D1-net, we also distinguish real from fake emotion attributes with D2-net. To further control label consistency during generation, we use two corresponding classifiers, C1-net and C2-net, to make sure the generated faces carry the same labels as the input audio.

D1-net and D2-net are designed to discriminate whether the input image is the real face \(I_T\) or the fake face \(I_\mathrm{G}\). The fake and true labels are coupled with \(I_\mathrm{G}\) and \(I_T\), respectively, and the images are fed into D1-net and D2-net to obtain two scores. The architecture of the two discriminators is shown in the bottom row of Fig. 2. They both have 6 two-dimensional convolution layers, each followed only by a Leaky-ReLU function. The numbers of channels in the convolution layers are the inverse of G-net, that is, 32-64-128-256-512-1024, and the other parameters such as kernel size and stride are also inverted. Finally, we apply FC7 with 1 channel and a sigmoid activation function to obtain a score as the output. Besides, our discriminators are based on the DCGAN [43] architecture.

C1-net is the identity classifier that ensures the speaker's facial identity, and C2-net is the emotion classifier that helps reconstruct the speaker's expressions. In other words, the emotion category of \(I_\mathrm{G}\) should be consistent with the corresponding voice emotion label \(l_\mathrm{ve}\), and the face emotion label \(l_\mathrm{fe}\) should be consistent with the category of \(I_T\). In addition, C1-net and C2-net share weights with the convolution layers in D1-net and D2-net, respectively. The architectures of the classifiers are similar to D1-net and D2-net, as shown in the bottom row of Fig. 2; they also consist of 6 two-dimensional convolution layers followed by Leaky-ReLU functions, an FC7 and a softmax function. The FC7 layers of the two classifiers have n and m channels, respectively (n denotes the number of speakers, and m denotes the number of voice emotion categories).
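A sketch of one discriminator–classifier pair with a shared convolutional backbone, following the channel widths above; the final spatial resolution (2×2 here for 128×128 inputs) and the flattening before FC7 are assumptions.

```python
import torch
import torch.nn as nn

class DCPair(nn.Module):
    """One discriminator/classifier pair (D1/C1 or D2/C2) sketch: a shared stack of
    six strided 4x4 convolutions (channels 32-64-128-256-512-1024, Leaky-ReLU only),
    a 1-unit real/fake head and an n_classes-unit label head."""

    def __init__(self, n_classes: int):
        super().__init__()
        chans, layers, in_ch = [32, 64, 128, 256, 512, 1024], [], 3
        for out_ch in chans:
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            in_ch = out_ch
        self.shared = nn.Sequential(*layers)              # weights shared by D and C
        self.fc_d = nn.Linear(1024 * 2 * 2, 1)            # FC7 of the discriminator
        self.fc_c = nn.Linear(1024 * 2 * 2, n_classes)    # FC7 of the classifier

    def forward(self, img: torch.Tensor):
        h = self.shared(img).flatten(1)                   # (B, 4096) for 128x128 input
        real_logit = self.fc_d(h)                         # sigmoid applied in the loss
        class_logits = self.fc_c(h)                       # softmax applied in the loss
        return real_logit, class_logits

# D1/C1 scores realism and identity (n speakers); D2/C2 scores realism and emotion (m classes).
```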

3.5 Triple loss

Our triple loss is composed of three parts: the G-net loss \(L_\mathrm{G}\), the two discriminator losses \(L_\mathrm{D1}\) and \(L_\mathrm{D2}\), and the two classifier losses \(L_\mathrm{C1}\) and \(L_\mathrm{C2}\). The generator and discriminator losses are both designed to reduce the differences between the true face \(I_T\) and the generated face \(I_\mathrm{G}\). The classifier losses aim to guarantee semantic consistency, which constrains the generated faces to the specific class domains. We use these losses to optimize the different parts of FE-GAN; the backpropagation paths of these losses are shown in Fig. 2.

First, we adopt the cross-entropy loss with softmax activation as losses of two classifiers. Here, the loss equations of \(L_\mathrm{C1}\) and \(L_\mathrm{C2}\) are defined as:

$$\begin{aligned} L_\mathrm{C1}= & {} \, - {\sum \limits _{j = 1}^{n}{p\left( l_\mathrm{fi}^{j} \right) \mathrm{log}\left( p\left( {l}_\mathrm{fi}^{j}\left( \mathrm{C1},I_{T} \right) \right) \right) }} \nonumber \\&-{\sum \limits _{j = 1}^{n}{p\left( l_\mathrm{vi}^{j} \right) \mathrm{log}\left( p\left( {l}_\mathrm{vi}^{j}\left( {\mathrm{C1},I_\mathrm{G}} \right) \right) \right) }} \end{aligned}$$
(1)

where p(l) denotes the probability of label l; \(l_\mathrm{fi}^{j}\) and \(l_\mathrm{vi}^{j}\) denote the j-th face and voice identity labels, respectively; \(l_\mathrm{fi}^{j} (\mathrm{C1},I_T)\) and \(l_\mathrm{vi}^{j} (\mathrm{C1},I_\mathrm{G})\) denote that the label predicted by C1-net, given the true and generated faces respectively, is the j-th identity label; and n denotes the number of identity categories.

$$\begin{aligned} L_\mathrm{C2}= & {} \, - {\sum \limits _{j = 1}^{m}{p\left( l_\mathrm{fe}^{j} \right) \mathrm{log}\left( p\left( {l}_\mathrm{fe}^{j}\left( \mathrm{C2},I_{T} \right) \right) \right) }} \nonumber \\&- {\sum \limits _{j = 1}^{m}{p\left( l_\mathrm{ve}^{j} \right) \mathrm{log}\left( p\left( {l}_\mathrm{ve}^{j}\left( {\mathrm{C2},I_\mathrm{G}} \right) \right) \right) }} \end{aligned}$$
(2)

where p(l) denotes the probability of label l; \(l_\mathrm{fe}^{j}\) and \(l_\mathrm{ve}^{j}\) denote the j-th face and voice emotion labels, respectively; \(l_\mathrm{fe}^{j} (\mathrm{C2},I_T)\) and \(l_\mathrm{ve}^{j} (\mathrm{C2},I_\mathrm{G})\) denote that the label predicted by C2-net, given the true and generated faces respectively, is the j-th emotion label; and m denotes the number of emotion categories.
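Equations (1) and (2) are ordinary cross-entropy terms over the true face with its face label and over the generated face with the label carried by the voice. A compact PyTorch sketch, assuming the classifier heads output raw logits:

```python
import torch
import torch.nn.functional as F

def classifier_loss(logits_real: torch.Tensor, logits_fake: torch.Tensor,
                    face_label: torch.Tensor, voice_label: torch.Tensor) -> torch.Tensor:
    """Eq. (1)/(2) sketch: cross-entropy on I_T against its face label plus
    cross-entropy on I_G against the label carried by the input voice."""
    return F.cross_entropy(logits_real, face_label) + F.cross_entropy(logits_fake, voice_label)

# L_C1 uses identity logits/labels (l_fi, l_vi, n classes);
# L_C2 uses emotion  logits/labels (l_fe, l_ve, m classes).
```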

Then, the generator loss \(L_\mathrm{G}\) of G-net is defined as:

$$\begin{aligned} L_\mathrm{G}= & {} {\frac{1}{2}E}_{{({E_\mathrm{v}, N_\mathrm{g}})}\sim \mathrm{data}} [- \mathrm{log}D_{1}\left( {G\left( {E_\mathrm{v},N_\mathrm{g}} \right) } \right) \nonumber \\&\quad - \mathrm{log}D_{2}\left( {G\left( {E_\mathrm{v},N_\mathrm{g}} \right) } \right) ]\end{aligned}$$
(3)

where \(D_1\), \(D_2\) and G represent the discriminators D1-net and D2-net and the generator G-net, respectively; the embedding feature \(E_\mathrm{v}\) comes from V-net; \(G(E_\mathrm{v},N_\mathrm{g})\) takes \(E_\mathrm{v}\) and a random noise \(N_\mathrm{g}\) as input and generates a fake image \(I_\mathrm{G}\), that is, \(G\left( {E_\mathrm{v},N_\mathrm{g}} \right) = I_\mathrm{G}\); \(D_{1}(G(\cdot ))\) is the score assigned by discriminator D1-net, and \(D_{2}(G(\cdot ))\) is the analogous score assigned by D2-net; e.g., \(D_{1}(I_T)\) is the score from D1-net given a real image \(I_T\).

Meanwhile, the two discriminator losses \(L_\mathrm{D1}\) and \(L_\mathrm{D2}\) are formulated as:

$$\begin{aligned} \begin{aligned} L_{\mathrm{D}_{i = 1,2}}=&\, { E_{{(I_{T})}\sim \mathrm{data}}\left[{\mathrm{log}\left( {D_{i}\left( I_{T} \right) } \right) } \right]}\\&+ {E_{(E_\mathrm{v},N_\mathrm{g})\sim \mathrm{data}}\left[{{\log }\left( 1 - D_{i}\left( {G\left( {E_\mathrm{v},N_\mathrm{g}} \right) } \right) \right) } \right]}. \end{aligned} \end{aligned}$$
(4)

Finally, we implement the cross-entropy loss with the sigmoid function as the loss functions \(L_\mathrm{G}\), \(L_\mathrm{D1}\) and \(L_\mathrm{D2}\), and our triple loss \(L_\mathrm{triple}\) is a combination of the above losses:

(5)

where \( \lambda _1 \) and \( \lambda _2 \) are hyper-parameters that control the relative weights of \( L_\mathrm{D1} \) and \( L_\mathrm{D2} \), respectively. In the triple loss, the generator learns to minimize \(L_\mathrm{G}\) on the generated \( I_\mathrm{G}\), while the two discriminators learn to give higher scores to the real images \(I_T\) and lower scores to the generated images \(I_\mathrm{G}\), maximizing \( L_\mathrm{D1} \) and \( L_\mathrm{D2} \). Besides, the two classifiers minimize \( L_\mathrm{C1} \) and \( L_\mathrm{C2} \) between the labels predicted from \(I_\mathrm{G}\) or \(I_T\) and the target emotion and identity labels.
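The sketch below spells out Eqs. (3) and (4) with the sigmoid cross-entropy mentioned above, together with the combination we assume for Eq. (5), namely that \(\lambda_1\) and \(\lambda_2\) scale \(L_\mathrm{D1}\) and \(L_\mathrm{D2}\) while the remaining terms enter with unit weight; treat the exact weighting as an assumption.

```python
import torch
import torch.nn.functional as F

def generator_loss(d1_fake_logit: torch.Tensor, d2_fake_logit: torch.Tensor) -> torch.Tensor:
    """Eq. (3) sketch: 1/2 * [-log D1(I_G) - log D2(I_G)], written with BCE on logits."""
    ones = torch.ones_like(d1_fake_logit)
    return 0.5 * (F.binary_cross_entropy_with_logits(d1_fake_logit, ones) +
                  F.binary_cross_entropy_with_logits(d2_fake_logit, ones))

def discriminator_loss(real_logit: torch.Tensor, fake_logit: torch.Tensor) -> torch.Tensor:
    """Eq. (4) sketch, written as a quantity to minimize: -log D(I_T) - log(1 - D(I_G))."""
    return (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) +
            F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))

# Assumed combination for Eq. (5):
#   L_triple = L_G + lambda1 * L_D1 + lambda2 * L_D2 + L_C1 + L_C2,
# with lambda1 = 0.7 and lambda2 = 0.3 as reported in Sect. 4.1.
```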

Note that the proposed triple loss is different from the losses in Triangle GANs (\(\varDelta \)-GANs) [44] and Triple GANs [45], which apply their losses between the input image and the reconstructed image in the image space. In this paper, we employ a triple loss across two different modalities to optimize our FE-GAN. In addition, FE-GAN is trained in a semi-supervised manner: the generator and the discriminators are trained iteratively. That is, the generator is fixed while the two discriminators and two classifiers are updated once; then, we fix the discriminators and update the parameters of the generator.
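A sketch of one alternating training step, reusing the modules and loss helpers sketched earlier in this section; whether the classifier terms also flow into the generator update, and where \(\lambda_1\)/\(\lambda_2\) are applied, are our assumptions rather than details stated in the text.

```python
import torch
import torch.nn.functional as F

def train_step(g, dc1, dc2, opt_g, opt_d,
               e_v, i_t, l_fi, l_vi, l_fe, l_ve, lam1=0.7, lam2=0.3):
    n_g = torch.randn(e_v.size(0), 128, device=e_v.device)

    # 1) Update discriminators and classifiers with the generator fixed.
    i_g = g(e_v, n_g).detach()
    r1, c1_real = dc1(i_t)
    f1, c1_fake = dc1(i_g)
    r2, c2_real = dc2(i_t)
    f2, c2_fake = dc2(i_g)
    loss_d = (lam1 * discriminator_loss(r1, f1) + lam2 * discriminator_loss(r2, f2)
              + classifier_loss(c1_real, c1_fake, l_fi, l_vi)    # L_C1 (identity)
              + classifier_loss(c2_real, c2_fake, l_fe, l_ve))   # L_C2 (emotion)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 2) Update the generator with the discriminators fixed; the classifier terms
    #    on I_G (an assumption) push label consistency back into G-net.
    i_g = g(e_v, n_g)
    f1, c1_fake = dc1(i_g)
    f2, c2_fake = dc2(i_g)
    loss_g = (generator_loss(f1, f2)
              + F.cross_entropy(c1_fake, l_vi)
              + F.cross_entropy(c2_fake, l_ve))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```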

4 Experiments

4.1 Datasets and settings

To validate the performance of FE-GAN on the voice-to-face generation task, our experiments are run on two multi-modal datasets: RAVDESS [18] and eNTERFACE [19]. They are collected in lab-controlled environments where the speakers are asked to read given sentences with certain voice emotions and facial expressions. RAVDESS consists of 1440 clips expressed by 24 actors with 8 emotion categories. eNTERFACE contains 1166 clips expressed by 43 speakers with 6 emotion categories. Table 1 summarizes the details of the datasets used in our work.

Table 1 Summary of datasets’ sample numbers, duration time and emotion categories

Our model is implemented in PyTorch and trained on an Nvidia GeForce RTX 2080 Ti. V-net and FE-GAN are trained separately. First, V-net is pre-trained on the RAVDESS [18] or eNTERFACE [19] dataset with the SGD optimizer, a batch size of 64 and an initial learning rate of 0.03 that is halved every 100 epochs. Next, FE-GAN is trained with the Adam optimizer, a batch size of 64 and a learning rate of 0.0002. In addition, the hyper-parameters \(\lambda _{1}\) and \(\lambda _{2}\) in the triple loss are set to 0.7 and 0.3, respectively.
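The reported settings translate into roughly the following optimizer setup; this is a sketch where `v_net`, `g_net`, `dc1` and `dc2` are the module names assumed in the earlier sketches, and Adam's momentum terms are left at their defaults since they are not stated.

```python
import torch

# V-net pre-training: SGD, lr 0.03 halved every 100 epochs, batch size 64.
opt_v = torch.optim.SGD(v_net.parameters(), lr=0.03)
sched_v = torch.optim.lr_scheduler.StepLR(opt_v, step_size=100, gamma=0.5)

# FE-GAN training: Adam with lr 2e-4 for the generator and for both D/C pairs.
opt_g = torch.optim.Adam(g_net.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(list(dc1.parameters()) + list(dc2.parameters()), lr=2e-4)

# lambda1 = 0.7, lambda2 = 0.3 in the triple loss; batch size 64 for both stages.
```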

4.2 Evaluation metrics

To evaluate the realism and variation of the generated images, we choose the Inception score (IS) [46], Fréchet Inception Distance (FID) [47] and classification accuracy as quantitative metrics. IS is defined as:

$$\begin{aligned} \mathrm{IS}\left( g \right) = \mathrm{exp}\left( E_{x\sim g}D_{KL}\left( p\left( y \big | x \right) \big | \big | p\left( y \right) \right) \right) \end{aligned}$$
(6)

where \(x \sim g\) represents images sampled from the generator; p(y) and p(y|x) are the marginal and conditional label distributions, respectively.
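A small NumPy sketch of Eq. (6), assuming the class posteriors p(y|x) have already been computed by a pre-trained Inception network on the generated images:

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Eq. (6) sketch: probs is an (N, classes) array of p(y|x) over N generated images."""
    p_y = probs.mean(axis=0, keepdims=True)                       # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                               # exp of the mean KL divergence
```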

FID measures the overall quality of the generated images. It computes the Wasserstein-2 distance between the generated images and the real images in the feature space of a pre-trained Inception-v3 network [48]. FID is defined as follows:

$$\begin{aligned} \mathrm{FID}\left( {x,g} \right)= & {} \, \left\| {\mu _{x} - \mu _{g}} \right\| _{2}^{2} \nonumber \\&+ \mathrm{Tr}\left( \Sigma _{x} + \Sigma _{g} - 2\left( \Sigma _{x}\Sigma _{g} \right) ^{\frac{1}{2}} \right) \end{aligned}$$
(7)

where \((\mu _{x}, \mu _{g})\) and \( (\Sigma _{x}, \Sigma _{g}) \) are the means and covariances of the images from the true data distribution and the generator's learned distribution, respectively. The authors of FID [47] show that FID is consistent with human judgment and more robust to noise than IS.
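A corresponding sketch of Eq. (7), assuming the Inception-v3 feature means and covariances have already been estimated for the real and generated image sets:

```python
import numpy as np
from scipy import linalg

def fid(mu_x: np.ndarray, sigma_x: np.ndarray,
        mu_g: np.ndarray, sigma_g: np.ndarray) -> float:
    """Eq. (7) sketch: Frechet distance between Gaussians fitted to Inception-v3 features."""
    covmean = linalg.sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(covmean):          # numerical noise can introduce a tiny imaginary part
        covmean = covmean.real
    diff = mu_x - mu_g
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```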

In our experiments, a lower IS value indicates that the model produces images with less variety that are not associated with the voice features, while a higher IS indicates that the model falls into mode collapse and the images have blurry parts. Thus, a reasonable IS for a model is close to that of the dataset. FID is a more reliable and comprehensive metric: a lower FID value means the generated images are closer to the distribution of the dataset. In addition, to evaluate the model's performance on identity and emotion preservation, we compute the emotion and identity classification accuracy with a VGG-Face network [49]. To obtain the accuracy, VGG-Face is pre-trained on the RAVDESS or eNTERFACE dataset, and the pre-trained VGG-Face is then applied to our generated results. Because the previous works [12, 14] do not consider emotion during generation, we are not able to compute the emotion accuracy of their results.

4.3 Ablation experiments

Two ablation experiments are conducted on the RAVDESS dataset to (1) find which kind of audio feature is the most suitable for our task and (2) analyze the contribution of each component of FE-GAN.

First of all, we perform an ablation experiment on different audio features: MFCC, Fbank and spectrogram. Specifically, we report IS and FID using the same model and training method while varying only the audio features. Table 2 shows the quantitative results. Fbank achieves the best (lowest) FID score (58.79) with an IS of 1.71, and MFCC is in second place with a slightly higher FID (64.35) and a lower IS (1.65). Compared with Fbank and MFCC, the spectrogram performs the worst, with the highest FID (96.16) and the highest IS (1.91) among all the audio features. The reason may be threefold: (1) the spectrogram is too primitive, so it may include much information irrelevant to emotion and identity; (2) MFCC outperforms the spectrogram, but it only retains 13-dimensional features related to speech content and discards some information about emotion and identity; (3) Fbank is the best because it preserves more prosodic and acoustic information from the input voice. Figure 3 shows the qualitative results. We can observe that Fbank obtains better generated images than MFCC and the spectrogram. In general, the images generated with Fbank are sharper and have more distinct expressions around the mouth and eyes. Therefore, the Fbank feature is selected in the rest of the experiments.

Second, we conduct an experiment to evaluate the impact of four components of FE-GAN, namely (a) C1-net, (b) C2-net, (c) D1-net and (d) D2-net. These four settings share the same baseline (FE-GAN), but a particular part is removed in each; that is, FE-GAN runs without (a) C1-net, (b) C2-net, (c) D1-net or (d) D2-net. Table 2 shows their IS and FID scores.

(a) Influence of C1-net: Adding C1-net to the model yields a dramatic improvement of 42.3% in FID and 12.8% in IS. As C1-net and D1-net share weights, the well-trained C1-net provides basic identity information to D1-net so that the generated images of different speakers are forced to keep the identity labels of voices and images consistent.

(b) Influence of C2-net: Adding C2-net leads to a further improvement of 47.0% in FID and 10.4% in IS. The emotion feature is a special identity-related feature that relies on facial attributes, and C1-net alone has weak ability to extract emotion features. Thus, we use C2-net to learn the emotion representation from the voice. The combination of C1-net and C2-net progressively reduces mode collapse during training and improves the classification accuracy of the generated images.

(c) Influence of D1-net: Using D1-net improves FID by 53.8% and IS by 3.4%. The discriminator loss \(L_\mathrm{D1}\) provides G-net with strong guidance toward the ground truth. Besides, since C1-net shares weights with D1-net, it can optimize G-net from the point of view of the identity label distribution. Therefore, G-net learns the identity-level semantic relevance between image and voice.

(d) Influence of D2-net: A single discriminator only discerns images by one attribute and cannot precisely control the content of the generated images. Therefore, we add a second discriminator to improve the discrimination ability and generation performance. The extra D2-net supplies the missing emotion information to our model. Table 2 shows that D2-net improves FID by 34.7% and IS by 2.3%, which means that two discriminators perform better than a single discriminator and further improve image quality. The shared weights also help to learn a better D2-net.

Finally, Fig. 4 visualizes the influence of the above components. The images generated by the full model have more fine-grained details and are more similar to the ground truth.

Table 2 Ablation experiments: FID and IS results of different audio features, duration time, noises and components on RAVDESS dataset
Fig. 3

Ablation experiment 1: images generated from different voice features on the RAVDESS dataset. GT represents the ground truth. The red circles depict the mouth regions under analysis for different expressions

4.4 Robustness tests

The robustness of FE-GAN is evaluated in this section. Two robustness experiments are conducted to verify how the FID and IS scores vary under different input conditions: (1) audio with different levels of noise and (2) audio with different durations.

Table 3 Robustness tests: FID and IS results of different duration times and noises on RAVDESS dataset

First of all, we study the effect of various noise levels on image quality. On the RAVDESS dataset, we add babble noise of different intensities to the voice at four signal-to-noise ratios (SNRs): 1 dB, 5 dB, 10 dB and 25 dB. The qualitative results for the experiments with added noise can be seen in Figs. 5 and 6. As the noise intensity increases, we observe that the generated images gradually become blurry and unrecognizable. The reason may be that the useful features are destroyed by the noise, since the noise contains no identity or emotion information. Moreover, the quantitative results of this experiment are reported in Table 3. We also observe that the FID and IS scores gradually degrade as the noise level increases, which is consistent with the qualitative results.

Fig. 4

Ablation experiment 2: images generated under the four component settings on the RAVDESS dataset. The red circles depict the obscured and incorrect regions under analysis for different expressions

Fig. 5

Robustness tests with female speakers: images generated from voices under different noise conditions and durations on the RAVDESS dataset. The left side indicates the noise levels, and the right side indicates the emotion types. Each row shows the faces generated under one of the four noise conditions with different durations

Fig. 6

Robustness tests with male speakers: images generated from voices under different noise conditions and durations on the RAVDESS dataset

The effect of different audio durations on FE-GAN is then evaluated. We conduct experiments on 1 s, 3 s, 5 s and 10 s voice segments. We observe that the audio duration has an obvious effect on the quality of the reconstructions, as shown in Figs. 5 and 6. The qualitative results show that a longer input voice improves the performance. For example, when using 10 s voice segments (the 4th column in Figs. 5 and 6), the generated faces are clearer, more recognizable and contain less background noise. Furthermore, the corresponding quantitative results are shown in Table 3. We find that feeding longer audio segments as input leads to considerable improvement in the FID and IS scores; that is, the reconstructed faces capture personal attributes and emotions better, regardless of the level of added noise.

Besides, Figs. 5 and 6 also show a qualitative comparison of the effect of gender. We find that the model successfully captures latent attributes such as gender, reconstructing facial images from different voices.

4.5 Comparison to state-of-the-art

To verify the effectiveness of our FE-GAN model, we compare it with two state-of-the-art methods, AC-GANs [14] and cGANs [12], on the RAVDESS and eNTERFACE datasets. Table 4 shows the comparison results for FID and IS, and Table 5 shows the identity classification accuracy.

Table 4 FID and IS results of different methods on RAVDESS and eNTERFACE dataset
Table 5 Identity classification accuracy of different methods on RAVDESS and eNTERFACE dataset

We first conduct the comparison on RAVDESS. Table 4 shows that FE-GAN performs better than AC-GANs, improving FID by 23.7% and IS by 2.8%. Compared with the cGANs method, FE-GAN also improves FID by 48.1% and IS by 2.3%. As shown in Table 5, we improve the training identity accuracy by 9.7% and 15.1% compared with AC-GANs and cGANs, respectively. On the testing set, FE-GAN achieves increases of 9.8% and 18.0% over AC-GANs and cGANs. Besides, FE-GAN achieves a high emotion accuracy of 95.08% on the training set. These quantitative results reveal that fully utilizing both identity and emotion information from the voice can significantly boost classification performance. Furthermore, the qualitative results on RAVDESS for FE-GAN and the two competitors are shown in Fig. 7. It can be seen that FE-GAN not only generates faces with more accurate identity information but also preserves more expression information. Our results contain less background noise and are more realistic than the samples generated without emotion.

Fig. 7

Generated images of different methods on the RAVDESS dataset. The red circles depict the eye regions under analysis for different expressions

Fig. 8

Generated images of different methods on the eNTERFACE dataset

To further verify the robustness of our method for voice-to-face generation, we evaluate it on another dataset, eNTERFACE, and give the comparison results in Tables 4 and 5. In Table 4, we observe that our method achieves the highest IS (1.89) and the lowest FID (84.58), demonstrating its effectiveness and robustness. As shown in Table 5, FE-GAN also achieves the highest identity accuracy in training (99.23%) and testing (76.83%) and outperforms the comparison methods on eNTERFACE. However, there are still some defects in the generated images. Figure 8 shows that the faces have blurs and artifacts, and even corrupted facial expressions around the eye, nose and mouth regions. Besides, our method obtains a relatively low emotion accuracy of 73.15% on training images. This is because of the imbalanced data distribution and the large variance within the same class of training data; a low-quality and poorly controlled dataset may cause unstable generation results. Although the images in Fig. 8 are not sharp, we can still see that the identity of the generated images is semantically consistent with the input audio, which means our method has captured semantic attributes from the speech features to some extent.

4.6 Limitation of FE-GAN

During our experiments, we found some generated images with observable failures, as shown in Figs. 3, 4 and 7. The major problems include moderate artifacts (e.g., the texture and color of the face seem unnatural), loss of facial contours and details (e.g., the teeth, hair and eyebrow regions are obscure or missing), and minor semantic inconsistency (e.g., compared with the GT images). There are two main reasons for these problems: (1) the intra-personal and inter-personal variances of emotion in the datasets are large, which makes it hard for FE-GAN to learn these face and voice emotion features effectively; (2) the input embedding features come from a single modality (voice) instead of multiple modalities (voice and face). That is, a part of the facial attributes is irrelevant to the speaker's voice, so the generator cannot build a mapping between voice and face for them. Therefore, it is unable to generate high-quality teeth, hair, eyebrows and head pose using only single-modality features.

5 Conclusion

Facial expression plays an important role in high-quality face generation, and human perception is very sensitive to subtle facial expressions. Therefore, without taking the emotion of the face and voice into account, it is hard to generate sharp and proper face images. In this paper, we propose a novel FE-GAN that incorporates emotion into the voice-to-face generation problem. Specifically, audio emotion and identity are used to directly generate face images with expressions. FE-GAN includes one generator and two discriminators with their auxiliary classifiers. The core idea is to use the auxiliary classifiers to help the discriminators better identify whether a face image is generated or true based on the identity and emotion it represents, so that the generator can be trained to produce more realistic face images. Finally, the proposed triple loss facilitates the generalization and optimization ability of the model. Experimental results show that our proposed method outperforms state-of-the-art approaches from both quantitative and qualitative perspectives.

FE-GAN has its own limitations. First, the output of the single generator suffers from mode collapse and over-fitting. For example, some facial identity features and emotion features mask each other, resulting in ambiguity and pixel jittering in the images, and some emotion classes have insufficient samples, which affects the generation of face images. On the other hand, it is hard for the model to achieve the best balance between the two discriminators during training. In addition, the intensity of the expressions should be considered to further improve the quality of the generated images.