
1 Introduction

Constructing a multi-faceted image from a single image is a well-investigated problem with several real-life applications. Essential applications of creating a multi-posed image from a single image include identification, detecting malicious actors and criminals in public, and capturing the identity of people in general. Constructing a multi-posed image is a challenging task, since it involves imagining what an object might look like when seen from another pose [3]. It requires the construction of unknown possibilities and hence a very rich embedding space, so that the constructed view of the object preserves the same identity and remains relevant in context.

Several research efforts have been made to address this problem using different models, such as synthesis-based models and data-based models [16, 19]. These GAN-based models consist of a linear framework, i.e. an encoder-decoder followed by a discriminator. Here, the main purpose of the encoder (E) is to map the input images to the latent space (Z); the resulting embeddings are fed into the decoder (G), after some manipulation, to generate multi-faceted images [1, 2].

However, it has been found empirically that the linear framework is not powerful enough to learn an appropriate embedding space. The output a linear framework generates for a multi-faceted image is not sharp enough and does not preserve identity across differently posed images. Learning an incomplete embedding space leads to poor generalization on test or unseen images. The primary reason for the inability of linear frameworks to learn complete representations is that, during training, G only sees the fraction of Z that E maps to, while at test time the model is very likely to encounter samples corresponding to the unseen embedding space. This results in poor generalization.

In order to tackle this problem, Tian et al. [14] proposed a dual-pathway architecture, termed Complete Representation GAN (CR-GAN). Unlike the linear framework, CR-GAN uses two pathways: besides the typical reconstruction path, the authors introduced a generation path for constructing multi-faceted images from embeddings randomly sampled from Z. Both paths share the same G, which aids the learning of E and the discriminator (D). In their model, E is forced to be an inverse of G, which theoretically should yield complete representations spanning the entire Z space.

However, the experiments conducted in this work demonstrate that one encoder is not sufficient to span the entire Z space. Therefore, to address this challenge, we propose DUO-GAN, a dual-encoder architecture that learns complete representations for multi-facet generation. The primary purpose is to distribute the task of spanning the entire Z space across two encoders instead of one, as proposed in the previous work. We empirically demonstrate that the dual-encoder architecture produces more realistic results than prior work in this field.

2 Related Work

Several researchers have contributed to constructing a multi-faceted image from a single image. The significant works in this field are presented as follows.

Goodfellow et al. [5] first introduced GAN to learn generative models via an adversarial process. In the proposed model, a two-player min-max game is played between a generator (G) and a discriminator (D). By competing with each other in this game, both G and D tend to improve themselves. GANs have been used in various fields such as image synthesis and super-resolution image generation. Every model proposed on top of GAN manipulates constraints on Z and attempts to cover more and more of the embedding space for better image synthesis.
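For reference, this adversarial training corresponds to the standard min-max objective of [5], written here in the notation used in the rest of the paper (x denotes real data and z the noise input):

$$\begin{aligned} \min _{G}\,\max _{D}\; \underset{\mathbf {x}\sim \mathbf {P}_{\mathbf {x}}}{\mathbb {E}}[\log D(\mathbf {x})] + \underset{\mathbf {z}\sim \mathbf {P}_{\mathbf {z}}}{\mathbb {E}}[\log (1 - D(G(\mathbf {z})))] \end{aligned}$$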

Hassner et al. [8] proposed a 3D face model in order to generate a frontal face for any subject. Sagonas et al. [13] used a statistical model for joint frontal face reconstruction, which is quite useful. However, the reported results were limited, as frontal face generation from a side view is a very challenging task because of occlusion and the variation in spatial features in side-view face images.

Yan et al. [16] addressed the problem of multi-pose generation to a certain extent by exploiting projection information with their Perspective Transformer Nets, whereas Yang et al. [17] proposed a model that incrementally rotates faces in fixed yaw angles. For generating multiple poses, Hinton et al. [9] tried generating images with view variance using an auto-encoder. Tian et al. [14] proposed the dual-pathway architecture CR-GAN for constructing multiple poses. However, all the above-mentioned systems fail to construct realistic images in unseen, in-the-wild conditions. In comparison, DUO-GAN spans the embedding space much more exhaustively using its multi-path architecture and produces higher-quality images than previously proposed models.

Preserving identity consistently across images with numerous poses is a very active research area. Previously, DR-GAN [15] attempted to solve this problem by providing a pose code along with the image data during training. Li et al. [12] approached this challenge by using Canonical Correlation Analysis to compare the differences between the sub-spaces of various poses. Tian et al. [14] tried solving this problem with a dual-pathway architecture. We propose a dual-encoder, dual-pathway architecture, which results in a much better generation of multi-faceted images.

3 The Proposed Method

Most of the previous research in this field involves a linear network, i.e. an encoder-decoder generator network followed by a discriminator network. As found empirically, such a linear network is incapable of spanning the entire embedding space, which leads to incomplete learning: a single encoder can only span a limited space, irrespective of the variance and quantity of data. Consequently, at test time, when an unseen image is passed through G, the unseen input is very likely mapped to uncovered embedding space, which leads to poor image generation.

Tian et al. [14] proposed CR-GAN, which uses a dual-pathway architecture to cover the embedding space more extensively than a linear framework. Its primary novelty is a second generation path that aims to map the entire Z space to corresponding targets. However, we empirically found that the single encoder used in the dual-pathway architecture is not powerful enough to span the entire embedding space. This fact motivates us to use a dual-encoder architecture for spanning the embedding space more extensively. Figure 1 illustrates the comparison between our proposed model, CR-GAN, and other linear networks. The proposed model consists of two paths, namely the generation path and the reconstruction path, described in the following subsections; a minimal interface sketch of the components is given below.
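To make the loss descriptions in this section concrete, the following minimal PyTorch-style sketch fixes the component signatures we assume throughout: E1/E2 return an identity code and a view estimate, G takes a view label and a code, and D returns a realness score and view logits. The layer configurations are placeholders (the actual model uses ResNet blocks as in WGAN-GP [7]), and all names are illustrative rather than the authors' code.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        # E1 / E2: image -> (identity code z, 9-way view logits).
        def __init__(self, z_dim=119, n_views=9):
            super().__init__()
            self.backbone = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                                          nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.z_head = nn.Linear(64, z_dim)
            self.v_head = nn.Linear(64, n_views)

        def forward(self, x):
            h = self.backbone(x)
            return torch.tanh(self.z_head(h)), self.v_head(h)

    class Generator(nn.Module):
        # G: (one-hot view label v, latent code z) -> image.
        def __init__(self, z_dim=119, n_views=9, img_size=128):
            super().__init__()
            self.img_size = img_size
            self.fc = nn.Linear(z_dim + n_views, 3 * img_size * img_size)

        def forward(self, v, z):
            out = torch.tanh(self.fc(torch.cat([v, z], dim=1)))
            return out.view(-1, 3, self.img_size, self.img_size)

    class Discriminator(nn.Module):
        # D: image -> (realness score D_s, view logits D_v).
        def __init__(self, n_views=9):
            super().__init__()
            self.backbone = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                                          nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.s_head = nn.Linear(64, 1)
            self.v_head = nn.Linear(64, n_views)

        def forward(self, x):
            h = self.backbone(x)
            return self.s_head(h), self.v_head(h)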

Fig. 1. Comparison of models: BiGAN, DR-GAN, TP-GAN, CR-GAN, and the proposed model

3.1 Generator Path

This path is similar to the generation path proposed in CR-GAN [14]. Here, neither encoder is involved, and G is trained to generate images from random noise. We provide a view label v and random noise z; the aim is to produce a realistic image G(v, z) with view label v, while, as in standard GANs, D aims to distinguish the outputs of G from real images. D minimizes Eq. 1, while G maximizes Eq. 2.

$$\begin{aligned} \underset{\mathbf {z}\sim \mathbf {P}_{\mathbf {z}}}{\mathbb {E}}{[\textit{D}_{s}(\textit{G}(\textit{v},\mathbf {z}))]}-\underset{\mathbf {x}\sim \mathbf {P}_{\mathbf {x}}}{\mathbb {E}}{[\textit{D}_{s}(\mathbf {x})]}+ \mathbf {C}_{1}\underset{\mathbf {\hat{x}}\sim \mathbf {P}_{\mathbf {\hat{x}}}}{\mathbb {E}}[(\Vert \nabla _{\hat{x}}\textit{D}_{s}(\hat{x})\Vert _2 - 1)^2] - \mathbf {C}_2\underset{\mathbf {x}\sim \mathbf {P}_{\mathbf {x}}}{\mathbb {E}}[\mathbf {P}(\textit{D}_{v}(x) = \textit{v})] \end{aligned}$$
(1)

Here, \(\mathbf {P}_{\mathbf {x}}\) represents the data distribution, \(\mathbf {P}_{\mathbf {z}}\) the uniform noise distribution, and \(\mathbf {P}_{\mathbf {\hat{x}}}\) the distribution of interpolations between data constructed from different images. In the proposed model, we randomly pass either \(\textit{v}_i\) or \(\textit{v}_k\), as we want G to learn to generate high-quality images from \(\mathbf {\hat{x}}\), the interpolation of \(\textit{x}_i\) and \(\textit{x}_k\), as further discussed in Sect. 3.2. We also found experimentally that feeding \(\mathbf {\hat{x}}\) in the first phase of training did not give good results, possibly because of the noise introduced by the interpolation.

$$\begin{aligned} \underset{\mathbf {z}\sim \mathbf {P}_{\mathbf {z}}}{\mathbb {E}}{[\textit{D}_{s}(\textit{G}(\textit{v},\mathbf {z}))]} + \mathbf {C}_3 \underset{\mathbf {z}\sim \mathbf {P}_{\mathbf {z}}}{\mathbb {E}}{[\mathbf {P}(\textit{D}_{v}(\textit{G}(\textit{v},\mathbf {z})) = \textit{v})]} \end{aligned}$$
(2)
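The following is a minimal sketch, under the interface assumptions above, of how Eqs. 1 and 2 could be computed in the generation path. The probability terms \(\mathbf {P}(\textit{D}_{v}(\cdot ) = \textit{v})\) are realised here as cross-entropy losses, Eq. 2 is negated so that G can be trained by gradient descent, and the helper names are our own.

    import torch

    def gradient_penalty(D, real, fake):
        # WGAN-GP term of Eqs. 1 and 3: (||grad_x_hat D_s(x_hat)||_2 - 1)^2
        # evaluated on random interpolates x_hat between real and fake samples.
        alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
        x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
        d_s, _ = D(x_hat)
        grads = torch.autograd.grad(d_s.sum(), x_hat, create_graph=True)[0]
        return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    def generation_path_losses(G, D, x_real, v_onehot, z, C1=10.0, C2=1.0, C3=1.0):
        # Sketch of Eq. 1 (minimised by D) and Eq. 2 (maximised by G, negated here).
        ce = torch.nn.functional.cross_entropy
        v_idx = v_onehot.argmax(dim=1)            # view label as class index
        x_fake = G(v_onehot, z)

        d_fake_s, _ = D(x_fake.detach())          # no gradient into G for Eq. 1
        d_real_s, d_real_v = D(x_real)
        loss_D = (d_fake_s.mean() - d_real_s.mean()
                  + C1 * gradient_penalty(D, x_real, x_fake.detach())
                  + C2 * ce(d_real_v, v_idx))

        d_fake_s, d_fake_v = D(x_fake)            # gradient flows into G for Eq. 2
        loss_G = -d_fake_s.mean() + C3 * ce(d_fake_v, v_idx)
        return loss_D, loss_G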

The model is trained in two phases, phase 1 and phase 2, with batch size b and time steps t.

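A rough, illustrative sketch of how such a two-phase loop might be organised is given below, reusing the interfaces and loss helpers sketched in this section (phase 1 runs the generation path, phase 2 the reconstruction path of Sect. 3.2). The triplet data loader, the helper reconstruction_path_losses (sketched in Sect. 3.2), and all names are our assumptions, not the authors' exact listing.

    import torch
    from torch.nn.functional import one_hot

    def train(E1, E2, G, D, loader, opt_E, opt_G, opt_D, epochs=50):
        # `loader` is assumed to yield same-identity triplets (x_i, x_j, x_k)
        # together with their integer view labels (v_i, v_j, v_k).
        for _ in range(epochs):
            for x_i, v_i, x_j, v_j, x_k, v_k in loader:
                b = x_i.size(0)

                # Phase 1: generation path -- update D, then G, on random (v, z).
                z = torch.empty(b, 119, device=x_i.device).uniform_(-1, 1)
                v_rand = one_hot(torch.randint(0, 9, (b,), device=x_i.device), 9).float()
                loss_D, _ = generation_path_losses(G, D, x_i, v_rand, z)
                opt_D.zero_grad(); loss_D.backward(); opt_D.step()
                _, loss_G = generation_path_losses(G, D, x_i, v_rand, z)
                opt_G.zero_grad(); loss_G.backward(); opt_G.step()

                # Phase 2: reconstruction path -- update D, E1 and E2; G is not stepped.
                loss_D2, loss_E = reconstruction_path_losses(
                    E1, E2, G, D, x_i, x_j, x_k, v_i, v_j, v_k)
                opt_D.zero_grad(); loss_D2.backward(); opt_D.step()
                opt_E.zero_grad(); loss_E.backward(); opt_E.step()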

3.2 Reconstruction Path

In the reconstruction path, we train \(\mathbf {E1}\), \(\mathbf {E2}\) and \(\mathbf {D}\), but not \(\mathbf {G}\). Here, \(\mathbf {G}\) re-generates images from the features extracted by \(\mathbf {E1}\) and \(\mathbf {E2}\), which makes both encoders act as inverses of \(\mathbf {G}\). Passing different poses into \(\mathbf {E1}\) and \(\mathbf {E2}\) ensures that they cover different parts of the embedding space, which in turn leads to more complete learning of the latent embedding space. Further, the outputs of \(\mathbf {E1}\) and \(\mathbf {E2}\) are combined by interpolating between the latent points from each encoder, in the same spirit as \(\hat{x}\) in the generation path.

To make sure the images re-constructed by \(\mathbf {G}\) from the features extracted by \(\mathbf {E1}\) and \(\mathbf {E2}\) share the same identity, we use a cross-reconstruction task that forces \(\mathbf {E1}\) and \(\mathbf {E2}\) to preserve identity. More precisely, we pass images of the same identity but with different poses into \(\mathbf {E1}\) and \(\mathbf {E2}\). The primary goal is to re-construct an image \(\mathbf {x}_j\) from the interpolation of images \(\mathbf {x}_i\) and \(\mathbf {x}_k\). To do this, \(\mathbf {E1}\) takes \(\mathbf {x}_i\) and \(\mathbf {E2}\) takes \(\mathbf {x}_k\); the two encoders output identity-preserving codes \(\mathbf {\bar{z}_i}\) and \(\mathbf {\bar{z}_k}\) with respective view estimates \(\mathbf {\bar{v}_i}\) and \(\mathbf {\bar{v}_k}\).

\(\mathbf {G}\) takes the interpolated \(\mathbf {\bar{z}}\) and the view \(\mathbf {v}_j\) as input, and is trained to reconstruct the image of the same person with view \(\mathbf {v}_j\). Here \(\mathbf {\bar{z}}\) should help \(\mathbf {G}\) preserve identity and carry the essential latent features of the person. \(\mathbf {D}\) is trained to differentiate the fake image \(\mathbf {\tilde{x}_j}\) from the real ones \(\mathbf {x}_i\) and \(\mathbf {x}_k\). Thus, \(\mathbf {D}\) minimizes Eq. 3.
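Under the same interface assumptions as before, the cross-reconstruction step and the losses of Eqs. 3 and 4 could be sketched as follows (the interpolation weight, the use of an L1 loss for \(\mathbf {L}_1\), cross-entropy for the probability and view terms, and all names are our assumptions; gradient_penalty is the helper sketched in Sect. 3.1):

    import torch
    from torch.nn.functional import cross_entropy, l1_loss, one_hot

    def reconstruction_path_losses(E1, E2, G, D, x_i, x_j, x_k, v_i, v_j, v_k,
                                   C1=10.0, C2=1.0, C3=1.0, C4=1.0, C5=0.01,
                                   alpha=0.5):
        # v_i, v_j, v_k are integer view labels of the same-identity images x_i, x_j, x_k.
        z_i, v_i_hat = E1(x_i)                       # identity code + view estimate
        z_k, v_k_hat = E2(x_k)
        z_bar = alpha * z_i + (1 - alpha) * z_k      # interpolated latent code
        x_tilde = G(one_hot(v_j, 9).float(), z_bar)  # reconstruct the third pose x_j

        # Eq. 3 (minimised by D): no gradient into the encoders or G here.
        d_fake_s, _ = D(x_tilde.detach())
        d_i_s, d_i_v = D(x_i)
        d_k_s, _ = D(x_k)
        loss_D = (2 * d_fake_s.mean() - d_i_s.mean() - d_k_s.mean()
                  + C1 * gradient_penalty(D, x_i, x_tilde.detach())
                  + C2 * cross_entropy(d_i_v, v_i))

        # Eq. 4 (maximised by E1/E2; negated here for gradient descent).
        d_fake_s, d_fake_v = D(x_tilde)
        loss_E = (-d_fake_s.mean()
                  + C3 * cross_entropy(d_fake_v, v_j)
                  + C4 * l1_loss(x_tilde, x_j)
                  + C5 * (cross_entropy(v_i_hat, v_i) + cross_entropy(v_k_hat, v_k)))
        return loss_D, loss_E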

$$\begin{aligned} \begin{aligned} \underset{\mathbf {x}_i, \mathbf {x}_j, \mathbf {x}_k\sim \mathbf {P}_{\mathbf {x}}}{\mathbb {E}} [2\, \textit{D}_{s}(\tilde{x}_j) - \textit{D}_{s}(x_i) - \textit{D}_{s}(x_k)] + \mathbf {C}_1 \underset{\mathbf {\hat{x}}\sim \mathbf {P}_{\mathbf {\hat{x}}}}{\mathbb {E}}[(\Vert \nabla _{\hat{x}}\textit{D}_{s}(\hat{x})\Vert _2 - 1)^2] \\ - \mathbf {C}_2\underset{\mathbf {x}_i\sim \mathbf {P}_{\mathbf {x}}}{\mathbb {E}}[\mathbf {P}(\textit{D}_{v}(x_i) = \textit{v}_i)] \end{aligned} \end{aligned}$$
(3)

Here, \(\mathbf {\tilde{x}}_j = \mathbf {G}(\mathbf {v}_j, \mathbf {\bar{z}})\), where \(\mathbf {\bar{z}}\) is the interpolation of the codes produced by \(\mathbf {E1}\) and \(\mathbf {E2}\) described above. The encoders help \(\mathbf {G}\) generate a realistic image with view \(\mathbf {v}_j\). Basically, \(\mathbf {E1}\) and \(\mathbf {E2}\) maximize Eq. 4.

$$\begin{aligned} \underset{\mathbf {x}_i, \mathbf {x}_j, \mathbf {x}_k \sim \mathbf {P}_{\mathbf {x}}}{\mathbb {E}} [\textit{D}_{s}(\tilde{x}_j) + \mathbf {C}_3\, \mathbf {P}(\textit{D}_{v}(\tilde{x}_j) = \mathbf {v}_j) - \mathbf {C}_4\, \mathbf {L}_1(\tilde{x}_j, \mathbf {x}_j) - \mathbf {C}_5\, \mathbf {L}_v (\mathbf {E}_v (\mathbf {x}_i), \mathbf {v}_i)] \end{aligned}$$
(4)

Here, \(\mathbf {L}_1\) is the loss ensuring that \(\tilde{x}_j\) is reconstructed properly from \(\mathbf {x}_j\), and \(\mathbf {L}_v\) is the cross-entropy loss between the ground-truth and estimated views, applied to \(\mathbf {E1}\) and \(\mathbf {E2}\).

This dual-encoder, dual-pathway network efficiently spans the complete embedding space. In the first path, \(\mathbf {G}\) learns to produce better images from random noise, which in turn leads to better outputs when \(\mathbf {G}\) is later driven by the features produced by \(\mathbf {E1}\).

In comparison to previously proposed linear networks, the proposed dual-encoder, dual-pathway network helps better solve the problem of multi-facet construction in the following ways:

  • It leads to better coverage of the latent embedding space, which in turn leads to better generation of multi-faceted pictures.

  • Once trained on good-quality images, the model works well even for low-quality images, probably because of the expansive embedding space covered.

4 Experiments and Results

This section describes the experimental setup, the benchmark datasets, and the experimental results, and compares the results with the existing state of the art in the field. Since the encoder part cannot be separated from the rest of the model, we cannot directly compare the feature-encoding capability of the respective models; instead, we compare the models' outputs and their ability to reconstruct images. Therefore, we compare the images produced by the two models and compute the root mean square error (RMSE) of the reconstructed images.
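For clarity, a minimal sketch of how a per-image RMSE could be computed between a reconstructed image and its ground truth (function and variable names are ours, not the authors'):

    import numpy as np

    def rmse(reconstructed: np.ndarray, ground_truth: np.ndarray) -> float:
        # Root mean square error between two images of identical shape.
        diff = reconstructed.astype(np.float64) - ground_truth.astype(np.float64)
        return float(np.sqrt(np.mean(diff ** 2)))

    # Average RMSE over a test set of (output, target) pairs:
    # avg_rmse = sum(rmse(out, tgt) for out, tgt in pairs) / len(pairs)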

4.1 Experimental Settings

  • Benchmark Dataset: In this experimental work, we used Multi-PIE [6] and 300wLP [18] as the primary datasets. Both are labelled datasets collected in artificially constructed environments. The Multi-PIE dataset consists of 250 subjects with 9 poses within \(\pm \) \(60^{\circ }\), two expressions and twenty illuminations. We use the first 200 subjects for training and the remaining 50 for testing. 300wLP contains view labels that are used to extract images with yaw angles from \(-60^{\circ }\) to \(+60^{\circ }\), divided into 9 intervals, so that they align with the Multi-PIE labels when fed into the model.

  • Implementation Details: The network implementation is modified from CR-GAN, where each encoder \(\mathbf {E}\) shares the dual-pathway architecture with \(\mathbf {G}\). The main structure of our model is adopted from the residual networks (ResNets) as proposed in WGAN-GP [7], where E shares a similar network structure with D. During training, \(\mathbf {v}\) is set to a 9-dimensional one-hot vector and \(\mathbf {z} \in [-1, 1]^{119}\) in the latent embedding space. The batch size chosen for our model is 20. We used the Adam optimizer [11] with a learning rate of 0.0001 and momentum terms of [0.01, 0.89]. Keeping the rest of the parameters at the CR-GAN defaults, we set \(\mathbf {C}_1\) = 10, \(\mathbf {C}_2\) through \(\mathbf {C}_4\) = 1, and \(\mathbf {C}_5\) = 0.01. Finally, we train the model for 50 epochs. These settings are collected in the sketch below.
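For concreteness, a sketch of how these settings could be wired up (names are ours; the Adam beta values are taken verbatim from the momentum terms reported above):

    import torch

    BATCH_SIZE = 20
    Z_DIM      = 119            # z ~ U[-1, 1]^119
    N_VIEWS    = 9              # 9-dimensional one-hot view label v
    LR         = 1e-4
    BETAS      = (0.01, 0.89)   # momentum terms reported above
    C1, C2, C3, C4, C5 = 10.0, 1.0, 1.0, 1.0, 0.01
    EPOCHS     = 50

    # One optimiser per module group (component classes as sketched in Sect. 3):
    # opt_D = torch.optim.Adam(D.parameters(), lr=LR, betas=BETAS)
    # opt_G = torch.optim.Adam(G.parameters(), lr=LR, betas=BETAS)
    # opt_E = torch.optim.Adam(list(E1.parameters()) + list(E2.parameters()),
    #                          lr=LR, betas=BETAS)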

Table 1. Average RMSE (in mm) using the dual-encoder architecture, validated against CR-GAN.

4.2 Results and Discussion

The primary aim of the proposed DUO-GAN model is to learn complete representations by using its dual-encoder, dual-pathway architecture to span the entire embedding space. We conduct experiments to evaluate these contributions with respect to CR-GAN. The comparative results are shown in Table 1, and Fig. 2 shows how the model performs in in-the-wild settings.

Fig. 2. Sample output on test images

In order to demonstrate the applicability of the proposed model, we compare it with four GANs, namely BiGAN, DR-GAN, TP-GAN, and CR-GAN, as depicted in Fig. 1. CR-GAN [14] used a dual-pathway architecture for spanning the embedding space and learning better representations; the authors added a second generation pathway in order to make the encoder an inverse of the generator. However, in practice, a single encoder does not seem to be powerful enough to span the entire embedding space. In comparison, DUO-GAN uses two encoders to span the entire embedding space, which learns the representation more effectively. The output produced by the proposed model is presented in Fig. 3.

Fig. 3. Sample output on similar, but unseen images

DR-GAN [15] also tackled the problem of generating multi-view images from a single image, but through a linear network. Since, in a linear network, the input of the decoder is simply the output of the encoder, the model is not very robust to images outside the dataset. In comparison, we use a second generation path, which leads to better learning and generalization.

TP-GAN [10] also used a dual architecture for solving this problem. However, unlike our model, it uses two separate structures that do not share parameters. Further, these two independent structures in TP-GAN aim to learn different sets of features, whereas our architecture aims to learn them collectively.

BiGAN [4] aims to learn a \(\mathbf {G}\) and an \(\mathbf {E}\) jointly; theoretically, \(\mathbf {E}\) should be an inverse of \(\mathbf {G}\). Because of its linear network, BiGAN leads to poor learning and does not produce good generations, especially for unseen data.

5 Conclusion

In this paper, we investigated different models for constructing multi-facet images from a single image and compared them. We proposed a dual-encoder, dual-pathway model called DUO-GAN for better learning of representations. The proposed model leverages this architecture to span the latent embedding space more completely and produces higher-quality images than existing models.