1 Introduction

The task of synthesizing novel views of objects from a single reference frame/view is an important problem which has a variety of practical applications in computer vision, graphics, and robotics. In computer vision, view synthesis can be used to generate 3D point cloud representations of objects from a single input image [40]; in the context of hand pose estimation algorithms, generating additional synthetic views can also help reduce occlusion and improve the accuracy of the estimated poses [12, 14]. In computer graphics, view synthesis has been used to apply changes in lighting and viewpoint to single-view images of faces [24]. In robotics, synthetic views can be used to help predict unobserved part locations and improve the performance of object grasping with manipulators [42]. However, synthesizing novel views from a single input image is a formidable task with serious complications arising from the complexity of the target object and the presence of heavily self-occluded parts.

Generative models have been shown to provide effective frameworks for representing complex, structured datasets and generating realistic samples from underlying data distributions [7, 13]. This concept has also been extended to form conditional models capable of sampling from conditional distributions in order to allow certain properties of the generated data to be controlled or selected [26, 43]. Generative models without encoders [5, 46] are generally used to sample broadly from the data distribution; however, these models are not designed to incorporate input data and therefore cannot preserve the characteristic features of a specified input. Models incorporating encoding components have been proposed to overcome this limitation by learning to map input data onto an associated latent space representation within a generative framework [21, 25]. The resulting inference models allow the defining features of inputs to be preserved while specified target properties are adjusted through conditioning [45].

Conventional conditional models have largely relied on rather simple methods, such as concatenation, for implementing this conditioning process; however, cGANs [27] have shown that utilizing the conditioning information in a less trivial, more methodical manner has the potential to significantly improve the performance of conditional generative models. In this work, we provide a general framework for effectively performing inference with conditional generative models by strategically controlling the interaction between conditioning information and latent representations within a generative inference model. In this framework, a conditional transformation unit (CTU), \(\varPhi \), is introduced to provide a means for navigating the underlying manifold structure of the latent space. The CTU is realized as a collection of convolutional layers which are designed to approximate the latent space operators defined by mapping encoded inputs to the encoded representations of specified targets (see Fig. 1). This is enforced by introducing a consistency loss term to guide the CTU mappings during training. In addition, a conditional discriminator unit (CDU), \(\varPsi \), also realized as a collection of convolutional layers, is included in the network’s discriminator. The CDU is designed to improve the network’s ability to identify and eliminate transformation-specific artifacts in the network’s predictions.

The network has also been equipped with RGB balance parameters consisting of three values \(\{\theta _R, \theta _G, \theta _B \}\) designed to give the network the ability to quickly adjust the global color balance of the images it produces to better align with that of the true data distribution. In this way, the network is easily able to remove unnatural hues and focus on estimating local pixel values by adjusting the three RGB parameters rather than correcting each pixel individually. In addition, we introduce a novel estimation strategy for efficiently learning shape and color properties simultaneously; a task-divided decoder (TD) is designed to produce a coarse pixel value map along with a refinement map in order to split the network’s overall task into distinct, dedicated network components.

Summary of contributions:

1. We introduce the conditional transformation unit, with a family of modular filter weights, to learn high-level mappings within a low-dimensional latent space. In addition, we present a consistency loss term which is used to guide the transformations learned during training.

2. We propose a novel framework for 3D object view synthesis which separates the generative process into distinct network components dedicated to learning (i) coarse pixel value estimates, (ii) a pixel refinement map, and (iii) the global RGB color balance of the dataset.

3. We introduce the conditional discriminator unit designed to improve adversarial training by identifying and eliminating transformation-specific artifacts present in the generated images.

Each contribution proposed above has been shown to provide a significant improvement to the network’s overall performance through a series of ablation studies. The resulting latent transformation neural network (LTNN) has been evaluated in a series of comparative studies across a diverse range of experiments, where it is shown to be comparable with existing state-of-the-art models for (i) simultaneous multi-view synthesis of real hand depth images in real-time, (ii) the synthesis of rotated views of rigid objects given a single image, and (iii) object view synthesis and attribute modification of real and synthetic faces.

Moreover, the CTU conditioning framework allows additional conditioning information, or target views, to be added to the training procedure ad infinitum without any increase in the network’s inference time.

Fig. 1

The conditional transformation unit \(\varPhi \) constructs a collection of mappings \(\{\varPhi _k\}\) in the latent space which produce object view changes to the decoded outputs. Conditioning information is used to select the appropriate convolutional weights \(\omega _k\) for the specified transformation; the encoding \(l_x\) of the original input image x is transformed to \(\, \widehat{l}_{y_k} = \varPhi _k(l_x) = {\text {conv}}(l_x,\omega _k)\) and provides an approximation to the encoding \(\, l_{y_k}\) of the attribute-modified target image \(y_k\)

2 Related work

Conditional generative models have been widely used in computer vision areas such as geometric prediction [30, 34, 40, 48] and non-rigid object modification such as human face deformation [1, 11, 33, 47]. Dosovitskiy et al. [8] proposed a supervised, conditional generative model trained to generate images of chairs, tables, and cars with specified attributes which are controlled by transformation and view parameters passed to the network. MV3D [40] is a pioneering deep learning approach to object view synthesis which uses an encoder–decoder network to directly generate the pixels of a target view, with depth information included in the loss function and viewpoint information passed as a conditional term. The appearance flow network (AFN) [48] proposed a method for view synthesis of objects by predicting appearance flow fields, which are used to move pixels from an input to a target view. However, this method requires detailed camera pose information and is not capable of predicting pixels which are missing in the source views. M2N [38] performs view prediction with a recurrent network which uses a self-learned confidence map to iteratively synthesize views, combining a recurrent pixel generator with appearance flow. TVSN [30] uses a visibility map, which indicates the visible parts of a target image, to identify occlusion across different views. However, this method requires mesh models for each object in order to extract visibility maps for training the network. The DFN by Jia et al. [20] proposed using a dynamic filter which is conditioned on a sequence of previous frames; this is fundamentally different from our method since the filter is applied to the original inputs rather than the latent embeddings. Moreover, it relies on temporal information and is not applicable to predictions from a single image. The IterGAN model introduced by Galama and Mensink [10] is also designed to synthesize novel views from a single image, with a specific emphasis on the synthesis of rotated views of objects in small, iterative steps. The conditional variational autoencoder (CVAE) incorporates conditioning information into the standard variational autoencoder (VAE) framework [23] and is capable of synthesizing specified attribute changes in an identity-preserving manner [37, 45]. Other works have introduced a clamping strategy to enforce a specific organizational structure in the latent space [24, 33]; these networks require extremely detailed labels for supervision, such as the graphics code parameters used to create each example, and are therefore very difficult to implement for more general tasks (e.g., training with real images). These models all rely on additional knowledge for training, such as depth information, camera poses, or mesh models, and, since model efficiency was not a design consideration, their high computational demand and large parameter counts make them unsuitable for embedded systems and real-time applications.

CVAE-GAN [2] further adds adversarial training to the CVAE framework in order to improve the quality of generated predictions. Zhang et al. [47] introduced the conditional adversarial autoencoder (CAAE), designed to model age progression/regression in human faces. This is achieved by concatenating conditioning information (i.e., age) with the input’s latent representation before proceeding to the decoding process. The framework also includes an adaptive discriminator with conditional information passed using a resize/concatenate procedure. To the best of our knowledge, all existing conditional generative models designed for inference use fixed hidden layers and concatenate conditioning information directly with latent representations. In contrast to these existing methods, the proposed model incorporates conditioning information by defining dedicated, transformation-specific convolutional layers at the latent level. This conditioning framework allows the network to synthesize multiple transformed views from a single input, while retaining a fully convolutional structure which avoids the dense connections used in existing inference-based conditional models. Most significantly, the proposed LTNN framework is shown to be comparable with the state-of-the-art models in a diverse range of object view synthesis tasks, while requiring substantially fewer FLOPs and less memory for inference than other methods.

3 Latent transformation neural network

In this section, we introduce the methods used to define the proposed LTNN model. We first give a brief overview of the LTNN network structure. We then detail how conditional transformation unit mappings are defined and trained to operate on the latent space, followed by a description of the conditional discriminator unit implementation and the network loss function used to guide the training process. Lastly, we describe the task division framework used for the decoding process.

The basic workflow of the proposed model is as follows (a code sketch of this forward pass is given after the list):

1. Encode the input image x to a latent representation \(l_x ={\text {Encode}}(x)\).

2. Use conditioning information k to select conditional, convolutional filter weights \(\omega _k\).

3. Map the latent representation \(l_x \) to \( \widehat{l}_{y_k} = \varPhi _k(l_x) = {\text {conv}}(l_x, \omega _k)\), an approximation of the encoded latent representation \(l_{y_k}\) of the specified target image \(y_k\).

4. Decode \(\widehat{l}_{y_k}\) to obtain a coarse pixel value map and a refinement map.

5. Scale the channels of the pixel value map by the RGB balance parameters and take the Hadamard product with the refinement map to obtain the final prediction \(\widehat{y}_k\).

6. Pass real images \(y_k\) as well as generated images \(\widehat{y}_k\) to the discriminator, and use the conditioning information to select the discriminator’s conditional filter weights \(\overline{\omega }_k\).

7. Compute the loss and update the weights using backpropagation with the ADAM optimizer.
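The following minimal sketch illustrates steps 1–5 of this workflow in TensorFlow-style Python. The `encoder` and `decoder` callables, the per-view CTU kernel and Swish parameter collections, and all tensor shapes are hypothetical placeholders of our own, not the authors’ released implementation.

```python
import tensorflow as tf

def ltnn_forward(x, k, encoder, decoder, ctu_kernels, ctu_betas, rgb_balance):
    """Sketch of the LTNN forward pass (steps 1-5); all arguments are assumed
    placeholders: x has shape [B, H, W, C], k is the target-view index."""
    l_x = encoder(x)                                        # 1. encode the input to latent features

    w_k, beta_k = ctu_kernels[k], ctu_betas[k]              # 2. select CTU weights for view k
    z = tf.nn.conv2d(l_x, w_k, strides=1, padding="SAME")   # 3. CTU mapping in the latent space
    l_y_hat = z * tf.sigmoid(beta_k * z)                    #    with a per-view Swish activation

    value_map, refine_map = decoder(l_y_hat)                # 4. task-divided decoding (two heads)

    # 5. rescale color channels by the RGB balance parameters and apply the
    #    refinement map via the Hadamard (element-wise) product
    y_hat = tf.reshape(rgb_balance, [1, 1, 1, 3]) * value_map * refine_map
    return y_hat
```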

Fig. 2

Selected methods for incorporating conditioning information; the proposed LTNN method is illustrated on the left, and six conventional alternatives are shown to the right

3.1 Conditional transformation unit

Generative models have frequently been designed to explicitly disentangle the latent space in order to enable high-level attribute modification through linear, latent space interpolation. This linear latent structure is imposed by design decisions, however, and may not be the most natural way for a network to internalize features of the data distribution. Several approaches have been proposed which include nonlinear layers for processing conditioning information at the latent space level. In these conventional conditional generative frameworks, conditioning information is introduced by combining features extracted from the input with features extracted from the conditioning information (often using dense connection layers); these features are typically combined using standard vector concatenation, although some have opted to use channel concatenation.

In particular, conventional approaches for incorporating conditional information generally fall into three classes: (1) apply a fully connected layer before and after concatenating a vector storing conditional information [24, 40, 47, 48], (2) flatten the network features and concatenate with a vector storing conditional information [30], (3) tile a conditional vector to create a two-dimensional array with the same shape as the network features and concatenate channel-wise [2, 38]. Since the first class is more prevalent than the others in practice, we have subdivided this class into four cases: FC-Concat-FC [47], FC-Concat-2FC [24], 2FC-Concat-FC [48], and 2FC-Concat-2FC [40]. Six of these conventional conditional network designs are illustrated in Fig. 2 along with the proposed LTNN network design for incorporating conditioning information.

Rather than directly concatenating conditioning information with network features, we propose using a conditional transformation unit (CTU), consisting of a collection of distinct convolutional mappings in the network’s latent space. More specifically, the CTU maintains independent convolution kernel weights for each target view in consideration. Conditioning information is used to select which collection of kernel weights, i.e., which CTU mapping, should be used in the CTU convolutional layer to perform a specified transformation. In addition to the convolutional kernel weights, each CTU mapping incorporates a Swish activation [32] with independent parameters for each specified target view. The kernel weights and Swish parameters of each CTU mapping are selectively updated by controlling the gradient flow based on the conditioning information provided.
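A CTU layer of this form might be sketched as follows. This is an illustrative, Keras-style implementation under our own assumptions (variable names, shapes, and the layer interface are not taken from the paper); it captures the idea of maintaining one \(3\times 3\) kernel and one Swish parameter per target view and routing gradients only through the selected branch.

```python
import tensorflow as tf

class ConditionalTransformationUnit(tf.keras.layers.Layer):
    """Illustrative CTU: one 3x3 kernel and one Swish parameter per target view."""

    def __init__(self, num_views, channels, **kwargs):
        super().__init__(**kwargs)
        self.kernels = [
            self.add_weight(name=f"ctu_kernel_{k}", shape=(3, 3, channels, channels))
            for k in range(num_views)
        ]
        self.betas = [
            self.add_weight(name=f"ctu_beta_{k}", shape=(), initializer="ones")
            for k in range(num_views)
        ]

    def call(self, latent, k):
        # Only the weights selected by k enter the computation graph, so only
        # they receive gradients for this example (selective update via conditioning).
        z = tf.nn.conv2d(latent, self.kernels[k], strides=1, padding="SAME")
        return z * tf.sigmoid(self.betas[k] * z)  # Swish with a per-view parameter
```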

The CTU mappings are trained to transform the encoded, latent space representation of the network’s input in a manner which produces high-level view or attribute changes upon decoding. In this way, different angles of view, light directions, and deformations, for example, can be generated from a single input image. In one embodiment, the training process for the conditional transformation units can be designed to form a semigroup \(\{\varPhi _t\}_{t\ge 0}\) of operators:

$$\begin{aligned} \hbox {i.e.,}\quad {\left\{ \begin{array}{ll} \quad \varPhi _0 = id &{} \\ \quad \varPhi _{t+s} = \varPhi _t \circ \varPhi _s &{} \forall \, \, t,s \ge 0 \end{array}\right. } \end{aligned}$$
(1)

defined on the latent space and trained to follow the geometric flow corresponding to a specified attribute. In the context of rotating three-dimensional objects, for example, the transformation units are trained on the input images paired with several target outputs corresponding to different angles of rotation; the network then uses conditioning information, which specifies the angle by which the object should be rotated, to select the appropriate transformation unit. In this context, the semigroup criteria correspond to the fact that rotating an object \(10^{\circ }\) twice should align with the result of rotating the object by \(20^{\circ }\) once.

Since the encoder and decoder are not influenced by the specified angle of rotation, the network’s encoding/decoding structure learns to model objects at different angles simultaneously; the single, low-dimensional latent representation of the input contains all information required to produce rotated views of the original object. Other embodiments can depart from this semigroup formulation, however, and instead train the conditional transformation units to produce a more diverse collection of non-sequential viewpoints, as is the case for multi-view hand synthesis.

To enforce this behavior on the latent space CTU mappings in practice, a consistency term is introduced into the loss function, as specified in Eq. 2. This loss term is minimized precisely when the CTU mappings behave as depicted in Fig. 1; in particular, the output of the CTU mapping associated with a particular transformation is designed to match the encoding of the associated ground truth target view. More precisely, given an input image x, the consistency loss associated with the kth transformation is defined in terms of the ground truth, transformed target view \(y_{k}\) by:

$$\begin{aligned} \mathscr {L}_{consist} = \big \Vert \varPhi _k({\text {Encode}}[x]) - {\text {Encode}}[y_k] \big \Vert _1. \end{aligned}$$
(2)
Table 1 Ablation/comparison results for six conventional alternatives for fusing conditioning information into the latent space, and ablation study of the conditional transformation unit (CTU), conditional discriminator unit (CDU), and task-divided decoder (TD)

3.2 Discriminator and loss function

The discriminator used in the adversarial training process is also passed conditioning information specifying the transformation which the model has attempted to make. The conditional discriminator unit (CDU), which is implemented as a convolutional layer with modular weights similar to the CTU, is trained to specifically identify unrealistic artifacts produced by the corresponding conditional transformation unit mappings. This is accomplished by maintaining independent convolutional kernel weights for each specified target view and using the conditioning information passed to the discriminator to select the kernel weights for the CDU layer. The incorporation of this context-aware discriminator structure significantly boosts the performance of the network (see Table 1). The discriminator, \(\mathscr {D}\), is trained using the adversarial loss term \(\mathscr {L}^\mathscr {D}_{adv}\) defined in Eq. 3. The proposed model uses the adversarial loss in Eq. 4 to effectively capture multimodal distributions [36], which helps to sharpen the generated views.

$$\begin{aligned} \mathscr {L}^\mathscr {D}_{adv}&= - \log \mathscr {D}(y_k, \overline{\omega }_k)- \log \big (1 - \mathscr {D}(\widehat{y}_k, \overline{\omega }_k) \big ) \end{aligned}$$
(3)
$$\begin{aligned} \mathscr {L}_{adv}&= - \log \mathscr {D}(\widehat{y}_k, \overline{\omega }_k). \end{aligned}$$
(4)

Reducing total variation is widely used in view synthesis methods [30, 47]. In particular, the \(\mathscr {L}_{smooth}\) term, inspired by total variation image denoising, is used to reduce noise in the generated images by penalizing variation between neighboring pixels. Experimental evidence shows that the inclusion of the \(\mathscr {L}_{smooth}\) loss term leads to an improvement in the overall quality of the synthesized images (see Table 1). We have experimented with various shift sizes and found that a shift size of 1 yields the best performance.

Additional loss terms corresponding to accurate structural reconstruction and smoothness [19] in the generated views are defined in Eqs. 5 and 6:

$$\begin{aligned} \mathscr {L}_{recon}&= \Vert \widehat{y}_k - y_k \Vert _2^2 \end{aligned}$$
(5)
$$\begin{aligned} \mathscr {L}_{smooth}&= \sum _{i\in \{0,\pm 1\}} \, \sum _{j\in \{0,\pm 1\}} \, \big \Vert \, \widehat{y}_k - \tau _{i,j}\widehat{y}_k \, \big \Vert _1, \end{aligned}$$
(6)

where \(y_k\) is the modified target image corresponding to an input x,   \(\overline{\omega }_k\) are the weights of the CDU mapping corresponding to the kth transformation, \(\varPhi _k\) is the CTU mapping for the kth transformation, \(\widehat{y}_k= {\text {Decode}}\big ( \varPhi _k\big ({\text {Encode}}[x]\big ) \big )\) is the network prediction, and \(\tau _{i,j}\) is the two-dimensional, discrete shift operator. The final loss function for the encoder and decoder components is given by:

$$\begin{aligned} \mathscr {L} = \lambda \cdot \mathscr {L}_{adv} + \rho \cdot \mathscr {L}_{recon} + \gamma \cdot \mathscr {L}_{smooth} + \kappa \cdot \mathscr {L}_{consist} \end{aligned}$$
(7)

with hyperparameters typically selected so that \(\lambda , \rho \gg \gamma , \kappa \). The consistency loss is designed to guide the CTU mappings toward approximations of the latent space mappings which connect the latent representations of input images and target images as depicted in Fig. 1. In particular, the consistency term enforces the condition that the transformed encoding, \(\widehat{l}_{y_k} = \varPhi _k(\text{ Encode }[x])\), approximates the encoding of the kth target image, \(l_{y_k} = \text{ Encode }[y_k]\), during the training process.
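As a concrete reference, the loss terms above might be assembled as in the sketch below. The helper arguments (`encoder`, the selected CTU mapping `phi_k`, and the discriminator outputs `d_real`/`d_fake`) are assumed placeholders, the mean reduction over pixels is our own simplification of the norms in Eqs. 2, 5, and 6, and the circular shift stands in for the discrete shift operator \(\tau _{i,j}\).

```python
import tensorflow as tf

EPS = 1e-8  # numerical stability inside the logarithms

def generator_loss(x, y_k, y_hat, d_fake, encoder, phi_k,
                   lam=0.8, rho=0.2, gamma=0.000025, kappa=0.00005):
    """Weighted generator loss of Eq. 7, built from Eqs. 2 and 4-6 (sketch)."""
    l_adv = -tf.reduce_mean(tf.math.log(d_fake + EPS))                    # Eq. 4
    l_recon = tf.reduce_mean(tf.square(y_hat - y_k))                      # Eq. 5
    l_smooth = 0.0                                                        # Eq. 6
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            # circular shift used here for simplicity in place of tau_{i,j}
            shifted = tf.roll(y_hat, shift=[i, j], axis=[1, 2])
            l_smooth += tf.reduce_mean(tf.abs(y_hat - shifted))
    l_consist = tf.reduce_mean(tf.abs(phi_k(encoder(x)) - encoder(y_k)))  # Eq. 2
    return lam * l_adv + rho * l_recon + gamma * l_smooth + kappa * l_consist

def discriminator_loss(d_real, d_fake):
    """Adversarial discriminator loss of Eq. 3 (sketch)."""
    return -tf.reduce_mean(tf.math.log(d_real + EPS) + tf.math.log(1.0 - d_fake + EPS))
```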

Fig. 3

Proposed task-divided design for the LTNN decoder. The coarse pixel value estimation map is split into RGB channels, rescaled by the RGB balance parameters, and multiplied element-wise by the refinement map values to produce the final network prediction

Fig. 4

The proposed network structure for the encoder/decoder (left) and discriminator (right) for \(64 \times 64\) input images. Features have been color-coded according to the type of layer which has produced them. The CTU and CDU components both store and train separate collections of \(3\times 3\) filter weights for each conditional transformation; in particular, the number of distinct \(3\times 3\) filters associated with the CTU and CDU corresponds to the number of distinct conditional transformations the network is designed to produce. For \(256 \times 256\) input images, we have added two Block v1/MaxPool layers at the front of the encoder and two Conv/Interpolation layers at the end of the decoder

3.3 Task-divided decoder

The decoding process has been divided into three tasks: estimating the refinement map, the pixel values, and the global RGB color balance of the dataset. We have found that this decoupled estimation framework helps the network converge to better minima and produce sharp, realistic outputs without additional loss terms. The decoding process begins with a series of convolutional layers followed by bilinear interpolation to upsample the low-resolution latent information. The last component of the decoder’s upsampling process consists of two distinct convolutional layers used for the task division; one layer is allocated to predicting the refinement map, while the other is trained to predict pixel values. The refinement map layer incorporates a sigmoidal activation function which outputs scaling factors intended to refine the coarse pixel value estimates; the pixel value estimation layer does not use an activation function so that the output values are not restricted to the range of a specific activation function. RGB balance parameters, consisting of three trainable variables, are used as weights for balancing the color channels of the pixel value map. The Hadamard product, \(\odot \), of the refinement map and the RGB-rescaled value map serves as the network’s final output:

$$\begin{aligned} \widehat{y}= & {} \, \left[ \widehat{y}_{R}, \, \widehat{y}_{G}, \, \widehat{y}_{B}\right] \quad \text{ where }\quad \nonumber \\ \widehat{y}_{C}= & {} \theta _C \, \cdot \widehat{y}^{value}_C \odot \widehat{y}^{refine}_C \quad \text{ for }\quad C \in \{R, G, B\} \end{aligned}$$
(8)

In this way, the network has the capacity to mask values which lie outside of the target object (i.e., by setting the corresponding refinement map values to zero), which allows the value map to focus on the object itself during the training process. Experimental results show that the refinement maps learn to produce masks which closely resemble the target objects’ shapes and have sharp drop-offs along the boundaries. No additional information has been provided to the network for training the refinement map; the masking behavior illustrated in Figs. 3 and 6 is learned implicitly by the network during training and is made possible by the design of the network’s architecture. As shown in Fig. 3, the refinement map forms a shape mask which suppresses per-pixel errors by zeroing out values lying outside the target object.
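For illustration, the two decoder heads and the RGB balance parameters of Eq. 8 might be wired together as follows; the layer sizes and names are assumptions rather than the paper’s exact configuration.

```python
import tensorflow as tf

# Sketch of the task-divided decoder head (Eq. 8); `features` denotes the
# decoder's penultimate feature maps, shape [B, H, W, C] (illustrative only).
value_head = tf.keras.layers.Conv2D(3, 3, padding="same", activation=None)        # unbounded pixel values
refine_head = tf.keras.layers.Conv2D(3, 3, padding="same", activation="sigmoid")  # scaling factors in (0, 1)
rgb_balance = tf.Variable(tf.ones([3]), name="rgb_balance")                       # theta_R, theta_G, theta_B

def decode_output(features):
    value_map = value_head(features)
    refine_map = refine_head(features)
    # Rescale each color channel and take the Hadamard product with the refinement map
    return tf.reshape(rgb_balance, [1, 1, 1, 3]) * value_map * refine_map
```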

Fig. 5

Layer definitions for Block v1 and Block v2 collaborative filters. Once the total number of output channels, \(N_{\text{ out }}\), is specified, the remaining \(N_{\text{ out }} - N_{\text{ in }}\) output channels are allocated to the non-identity filters (where \(N_{\text{ in }}\) denotes the number of input channels). For the Block v1 layer at the start of the proposed LTNN model, for example, the input is an image with \(N_{\text{ in }} = 3\) channels and the specified number of output channels is \(N_{\text{ out }} = 32\). Three of the 32 channels are accounted for by the identity component, and the remaining 29 channels are divided among the three non-identity filters. When the remaining channel count is not divisible by 3, we allocate the remainder of the output channels to the single \(3\times 3\) convolutional layer. Swish activation functions are used for each filter; however, the filters with multiple convolutional layers do not use activation functions for the intermediate \(3\times 3\) convolutional layers

Fig. 6

Qualitative comparison of \(360^{\circ }\) view prediction of rigid objects. A single image, shown in the first column of the “Ground” row, is used as the input for the network. Results are shown for the proposed network with and without task division (“w/o TD”) as well as a comparison with M2N. The pixel value map and refinement maps corresponding to the task division framework are also provided as well as an inverted view of the refinement map for better visibility

4 Architecture details

An overview of the pipeline is shown in Fig. 4. Input images are passed through a Block v1 collaborative filter layer (see Fig. 5) along with a max pooling layer to produce the features at the far left end of the figure. At the bottleneck between the encoder and decoder, a conditional transformation unit (CTU) is applied to map the \(2\times 2\) latent features directly to the transformed \(2\times 2\) latent features on the right. This CTU is implemented as a convolutional layer with \(3\times 3\) filter weights selected based on the conditioning information provided to the network. The features near the end of the decoder are processed by two independent layers, one producing the value estimation map and the other producing the refinement map; these are implemented as transpose convolutional layers for non-rigid objects and as convolutional layers followed by bilinear interpolation for rigid objects. The channels of the value estimation map are rescaled by the RGB balance parameters, and the Hadamard product is taken with the refinement map to produce the final network output. For the rigid object experiments, we add a hyperbolic tangent activation after the Hadamard product to bound the output values to the range \([-1,1]\). The CDU is designed to have the same \(3\times 3\) kernel size as the CTU and is applied between the third and fourth layers of the discriminator. For the stereo face dataset [9] experiment, we add an additional Block v1 layer to the encoder and an additional convolutional layer followed by bilinear interpolation to the decoder in order to use the full \(128\times 128\times 3\) resolution images; for the \(256\times 256\times 3\) resolution images of rigid object views, we add two Block v1 layers and two convolutional layers followed by bilinear interpolation.

The encoder incorporates two main block layers, as defined in Fig. 5, which are designed to provide efficient feature extraction; these blocks follow a design similar to that proposed by Szegedy et al. [39], but include dense connections between blocks, as introduced by Huang et al. [16]. We normalize the output of each network layer using the batch normalization method described in [18]. For the decoder, we have opted for a minimalist design, inspired by the work of [31]. Standard convolutional layers with \(3\times 3\) filters and same padding are used through the penultimate decoding layer; the value estimation and refinement maps are produced by transpose convolutional layers, also with same padding, using \(1\times 1\) filters for non-rigid objects and \(5\times 5\) filters for the other experiments. All parameters have been initialized using the variance scaling initialization method described in [15].

Our method has been implemented using the TensorFlow framework. The models have been trained using stochastic gradient descent with the ADAM optimizer [22] with initial parameters: \({\text {learning\_rate}}\) = 0.005, \(\beta _1\) = 0.9, and \(\beta _2\) = 0.999 (as defined in the TensorFlow API r1.6 documentation for \({\text {tf.train.AdamOptimizer}}\)), along with loss function hyperparameters: \(\lambda \) = 0.8, \(\rho \) = 0.2, \(\gamma \) = 0.000025, and \(\kappa \) = 0.00005 (as introduced in Eq. 7). The discriminator is updated once every two encoder/decoder updates, and one-sided label smoothing [36] has been used to improve the stability of the discriminator training procedure.
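A configuration sketch matching the reported settings is given below. The variable names are ours, the modern Keras optimizer API stands in for the original `tf.train.AdamOptimizer`, and the label-smoothing target value is an assumption (only the use of one-sided label smoothing is stated above).

```python
import tensorflow as tf

# Optimizer and loss-weight configuration following Sect. 4 (sketch; names are ours).
optimizer_g = tf.keras.optimizers.Adam(learning_rate=0.005, beta_1=0.9, beta_2=0.999)
optimizer_d = tf.keras.optimizers.Adam(learning_rate=0.005, beta_1=0.9, beta_2=0.999)

loss_weights = {"lambda": 0.8, "rho": 0.2, "gamma": 0.000025, "kappa": 0.00005}  # Eq. 7

DISC_UPDATE_PERIOD = 2  # discriminator updated once per two encoder/decoder updates
REAL_LABEL = 0.9        # one-sided label smoothing target (assumed value)
```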

5 Experiments and results

We conduct experiments on a diverse collection of datasets including both rigid and non-rigid objects. To show the generalizability of our method, we have conducted a series of experiments: (i) hand pose estimation using a synthetic training set and real NYU hand depth image data [41] for testing, (ii) synthesis of rotated views of rigid objects using the 3D object dataset [4], (iii) synthesis of rotated views using a real face dataset [9], and (iv) the modification of a diverse range of attributes on a synthetic face dataset [17]. For each experiment, we have trained the models using 80% of the datasets. Since ground truth target depth images were not available for the real hand dataset, an indirect metric has been used to quantitatively evaluate the model as described in Sect. 5.2. Ground truth data were available for all other experiments, and models were evaluated directly using the \(L_1\) mean pixel-wise error and the structural similarity index measure (SSIM) [44] used in [30, 38]. To compare the proposed framework with existing works, two comparison groups have been formed: conditional inference methods, CVAE-GAN [2] and CAAE [47], with comparable hourglass structures for comparison on experiments with non-rigid objects, and view synthesis methods, MV3D [40], M2N [38], AFN [48], and TVSN [30], for comparison on experiments with rigid objects. Additional ablation experiments have been performed to compare the proposed CTU conditioning method with other conventional concatenation methods (see Fig. 2); results are shown in Fig. 9 and Table 1.

Table 2 FLOPs and parameter counts corresponding to inference for a single image with resolution \(256\times 256\times 3\)
Table 3 Quantitative comparison for \(360^{\circ }\) view synthesis of rigid objects

5.1 Experiment on rigid objects

Rigid object experiment We have experimented with novel 3D view synthesis tasks given a single view of an object with an arbitrary pose. The goal of this experiment is to synthesize an image of the object after a specified transformation or change in viewpoint has been applied to the original view. To evaluate our method in the context of rigid objects, we have performed a collection of tests on the chair and car datasets. Given a single input view of an object, we leverage the LTNN model to produce \(360^{\circ }\) views of the object. We have tested our model’s ability to perform \(360^{\circ }\) view estimation on 3D objects and compared the results with other state-of-the-art methods. The models are trained on the same dataset used in M2N [38]. The car and chair categories from the ShapeNet [3] 3D model repository have been rotated horizontally 18 times by \({20}^{\circ }\) along with elevation changes of \({0}^{\circ }\), \({10}^{\circ }\), and \({20}^{\circ }\). The M2N and TVSN results are slightly better for the car category; however, these works have incorporated skip connections between the encoder layers and decoder layers, as proposed in U-net [35], which substantially increases the computational demand for these networks (see Table 2). As can be seen in Tables 2 and 3, the proposed model is comparable with existing models specifically designed for the task of multi-view prediction while requiring the fewest FLOPs for inference among all compared methods. The low computational cost of the LTNN model highlights the efficiency of the CTU/CDU framework for incorporating conditional information into the network for view synthesis. Moreover, as seen in the qualitative results provided in Fig. 6, using a task-divided decoder helps to eliminate artifacts in the generated views; in particular, the spokes on the back of the chair and the spoiler on the back of the car are seen to be synthesized much more clearly when using a task-divided decoder.

Fig. 7

Quantitative evaluation for multi-view hand synthesis using the real NYU dataset

Fig. 8

Comparison of CVAE-GAN (top) with the proposed LTNN model (bottom) using the noisy NYU hand dataset [41]. The input depth-map hand pose image is shown to the far left, followed by the network predictions for nine synthesized viewpoints. The views synthesized using LTNN are seen to be sharper and also yield higher accuracy for pose estimation (see Fig. 11)

Fig. 9

LTNN ablation experiment results and comparison with alternative conditioning frameworks using the synthetic hand dataset. Our models: conditional transformation unit (CTU), conditional discriminator unit (CDU), task-divided decoder (TD), and LTNN consisting of all previous components. Alternative concatenation methods: channel-wise concatenation (CH Concat), fully connected concatenation (FC Concat), and reshaped fully connected feature vector concatenation (RE Concat)

5.2 Experiment on non-rigid objects

Hand pose experiment To assess the performance of the proposed network on non-rigid objects, we consider the problem of hand pose estimation. As the number of available viewpoints of a given hand is increased, the task of estimating the associated hand pose becomes significantly easier [14]. Motivated by this fact, we synthesize multiple views of a hand given a single view and evaluate the accuracy of the estimated hand pose using the synthesized views. The underlying assumption of the assessment is that the accuracy of the hand pose estimation will be improved precisely when the synthesized views provide faithful representations of the true hand pose. Since ground truth predictions for the real NYU hand dataset were not available, the LTNN model has been trained using a synthetic dataset generated using 3D mesh hand models. The NYU dataset does, however, provide ground truth coordinates for the input hand pose; using this, we were able to indirectly evaluate the performance of the model by assessing the accuracy of a hand pose estimation method using the network’s multi-view predictions as input. More specifically, the LTNN model was trained to generate nine different views which were then fed into the pose estimation network from Choi et al. [6] (also trained using the synthetic dataset). For an evaluation metric, the maximum error in the predicted joint locations has been computed for each frame (i.e., each hand pose in the dataset). The cumulative number of frames with maximum error below a threshold distance \(\epsilon _D\) has then been computed, as is commonly used in hand pose estimation tasks [6, 29]. A comparison of the pose estimation results using synthetic views generated by the proposed model, the CVAE-GAN model, and the CAAE model is presented in Fig. 7, along with the results obtained by performing pose estimation using the single-view input frame alone. In particular, for a threshold distance \(\epsilon _D = 40\,\text{ mm }\), the proposed model yields the highest accuracy with \(61.98\%\) of the frames having all predicted joint locations within a distance of \(40\,\text{ mm }\) from the ground truth values. The second highest accuracy is achieved with the CVAE-GAN model with \(45.70\%\) of frames predicted within the \(40\,\text{ mm }\) threshold.
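This evaluation metric can be summarized by the following sketch, which assumes hypothetical arrays of predicted and ground truth 3D joint positions in millimeters; the array names and shapes are illustrative only.

```python
import numpy as np

def frames_within_threshold(pred_joints, gt_joints, threshold_mm=40.0):
    """Fraction of frames whose worst joint error is below the threshold.
    pred_joints, gt_joints: arrays of shape (num_frames, num_joints, 3), in mm."""
    per_joint_error = np.linalg.norm(pred_joints - gt_joints, axis=-1)  # (frames, joints)
    max_error_per_frame = per_joint_error.max(axis=-1)                  # worst joint per frame
    return np.mean(max_error_per_frame < threshold_mm)                  # e.g. 0.6198 -> 61.98%
```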

A comparison of the quantitative hand pose estimation results is provided in Fig. 7 where the proposed LTNN framework is seen to provide a substantial improvement over existing methods; qualitative results are also available in Fig. 8. Ablation study results for assessing the impact of individual components of the LTNN model are also provided in Fig. 9; in particular, we note that the inclusion of the CTU, CDU, and task-divided decoder each provides significant improvements to the performance of the network. With regard to real-time applications, the proposed model runs at 114 fps without batching and at 1975 fps when applied to a mini-batch of size 128 (using a single TITAN Xp GPU and an Intel i7-6850K CPU).

Fig. 10

Qualitative evaluation for view synthesis of real faces using the image dataset [9]

Fig. 11

Quantitative evaluation with SSIM of model performances for experiment on the real face dataset [9]. Higher values are better

Fig. 12

Quantitative evaluation with \(L_1\) of model performances for experiment on the real face dataset [9]. Lower values are better

Fig. 13

Simultaneous learning of multiple attribute modifications. Azimuth and age (left), light and age (center), and light and azimuth (right) combined modifications are shown. The network has been trained using four CTU mappings per attribute (e.g., four azimuth mappings and four age mappings); results shown have been generated by composing CTU mappings in the latent space and decoding

Real face experiment We have also conducted an experiment using a real face dataset to show the applicability of LTNN to real images. The stereo face database [9], consisting of images of 100 individuals from 10 different viewpoints, was used for experiments with real faces. These faces were first segmented using the method of [28], and then we manually cleaned up the failure cases. The cleaned faces have been cropped and centered to form the final dataset. The LTNN model was trained to synthesize images of input faces corresponding to three consecutive horizontal rotations. Qualitative results for the real face experiment are provided in Fig. 10; in particular, we note that the quality of the views generated by the proposed LTNN model is consistent for each of the four views, while the quality of the views generated using other methods decreases substantially as the change in angle is increased. This illustrates the advantage of using CTU mappings to navigate the latent space and avoid the accumulation of errors inherent to iterative methods. Moreover, as shown in Figs. 11 and 12, the LTNN model provides substantial improvements over alternative methods with respect to the SSIM and \(L_1\) metrics and converges much faster as well.

Fig. 14

Near-continuous attribute modification is attainable using piecewise-linear interpolation in the latent space. Provided a grayscale image (corresponding to the faces on the far left), modified images corresponding to changes in light direction (first), age (second), azimuth (third), and elevation (fourth) are produced with 17 degrees of variation. These attribute-modified images have been produced using nine CTU mappings, corresponding to varying degrees of modification, and linearly interpolating between the discrete transformation encodings in the latent space

5.3 Diverse attribute exploration

To evaluate the proposed framework’s performance on a more diverse range of attribute modification tasks, we use a synthetic face dataset and compare against other conditional generative models, CVAE-GAN and CAAE, with hourglass structures comparable to the LTNN model. The images generated by the LTNN model are available in Fig. 13. These models have been trained to synthesize discrete changes in elevation, azimuth, light direction, and age from a single image. As shown in Tables 4 and 5, the LTNN model outperforms the CVAE-GAN and CAAE models by a significant margin in both the SSIM and \(L_1\) metrics; additional quantitative results are provided in Table 1, along with a collection of ablation results for the LTNN model.

Multiple attributes can also be modified simultaneously using LTNN by composing CTU mappings. For example, one can train four CTU mappings \(\{\varPhi _k^{light}\}_{k=0}^3\) corresponding to incremental changes in lighting and four CTU mappings \(\{\varPhi _k^{azim}\}_{k=0}^3\) corresponding to incremental changes in azimuth. In this setting, the network predictions for lighting and azimuth changes correspond to the values of \({\text {Decode}}[\varPhi _k^{light}(l_x)]\) and \({\text {Decode}}[\varPhi _k^{azim}(l_x)]\), respectively (where \(l_x\) denotes the encoding of the original input image). To predict the effect of simultaneously changing both lighting and azimuth, we can compose the associated CTU mappings in the latent space; that is, we may take our network prediction for the lighting change associated with \(\varPhi _i^{light}\) combined with the azimuth change associated with \(\varPhi _j^{azim}\) to be:

$$\begin{aligned} \widehat{y}= & {} {\text {Decode}}[ \widehat{l}_y ] \quad \text{ where } \nonumber \\ \widehat{l}_y= & {} \varPhi _i^{light}\circ \varPhi _j^{azim}(l_x) = \varPhi _i^{light}\big [\varPhi _j^{azim}(l_x) \big ]. \end{aligned}$$
(9)
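A minimal sketch of this composition, assuming the placeholder `encoder`, `decoder`, and per-view CTU mapping collections from the earlier sketches, is:

```python
def compose_attributes(x, i, j, encoder, decoder, ctu_light, ctu_azim):
    """Simultaneous lighting (index i) and azimuth (index j) modification by
    composing CTU mappings in the latent space (Eq. 9); all arguments are
    assumed placeholders."""
    l_x = encoder(x)
    return decoder(ctu_light[i](ctu_azim[j](l_x)))
```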
Table 4 Quantitative results for light direction and age modification on the synthetic face dataset
Table 5 Quantitative results for azimuth and elevation modification on the synthetic face dataset

5.4 Near-continuous attribute modification

Near-continuous attribute modification is also possible within the proposed framework; this can be performed by a simple, piecewise-linear interpolation procedure in the latent space. For example, we can train nine CTU mappings \(\{\varPhi _k\}_{k=0}^8\) corresponding to incremental \({7}^{\circ }\) changes in elevation \(\{\theta _k\}_{k=0}^8\). The network predictions for an elevation change of \(\theta _0={0}^{\circ }\) and \(\theta _1={7}^{\circ }\) are then given by the values \({\text {Decode}}[\varPhi _0(l_x)]\) and \({\text {Decode}}[\varPhi _1(l_x)]\), respectively (where \(l_x\) denotes the encoding of the input image). To predict an elevation change of \({3.5}^{\circ }\), we can perform linear interpolation in the latent space between the representations \(\varPhi _0(l_x)\) and \(\varPhi _1(l_x)\); that is, we may take our network prediction for the intermediate change of \({3.5}^{\circ }\) to be:

$$\begin{aligned} \widehat{y} = {\text {Decode}}[\widehat{l}_y ] \quad \text{ where } \quad \widehat{l}_y = 0.5\cdot \varPhi _0(l_x) + 0.5\cdot \varPhi _1(l_x). \end{aligned}$$
(10)

More generally, we can interpolate between the latent CTU map representations to predict a change \(\theta \) via:

$$\begin{aligned} \widehat{y} ={\text {Decode}}[\widehat{l}_y ] \quad \text{ where } \quad \widehat{l}_y = \lambda \cdot \varPhi _k(l_x) + (1-\lambda )\cdot \varPhi _{k+1}(l_x), \end{aligned}$$
(11)

where \(k\in \{0,\dots ,7\}\) and \(\lambda \in [0,1]\) are chosen so that \(\theta = \lambda \cdot \theta _k+ (1-\lambda )\cdot \theta _{k+1}\). In this way, the proposed framework naturally allows for continuous attribute changes to be approximated while only requiring training for a finite collection of discrete changes. Qualitative results for near-continuous attribute modification on the synthetic face dataset are provided in Fig. 14; in particular, we note that views generated by the network effectively model gradual changes in the attributes without any noticeable degradation in quality. This highlights the fact that the model has learned a smooth latent space structure which can be navigated effectively by the CTU mappings while maintaining the identities of the original input faces.
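The interpolation scheme of Eq. 11 can be written as the following sketch, again assuming placeholder `encoder`, `decoder`, and per-angle CTU mappings; the bracketing logic is our own illustration of the piecewise-linear scheme described above.

```python
import bisect

def interpolate_attribute(x, theta, thetas, ctus, encoder, decoder):
    """Near-continuous attribute change (Eq. 11): piecewise-linear interpolation
    between the two CTU mappings whose training angles bracket theta.
    thetas: sorted list of discrete training angles; ctus: matching CTU mappings."""
    k = min(max(bisect.bisect_right(thetas, theta) - 1, 0), len(thetas) - 2)
    lam = (thetas[k + 1] - theta) / (thetas[k + 1] - thetas[k])  # theta = lam*theta_k + (1-lam)*theta_{k+1}
    l_x = encoder(x)
    l_y = lam * ctus[k](l_x) + (1.0 - lam) * ctus[k + 1](l_x)
    return decoder(l_y)
```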

6 Conclusion

In this work, we have introduced an effective, general framework for incorporating conditioning information into inference-based generative models. We have proposed a modular approach to incorporating conditioning information using CTUs and a consistency loss term, defined an efficient task-divided decoder setup for deconstructing the data generation process into manageable subtasks, and shown that a context-aware discriminator can be used to improve the performance of the adversarial training process. The performance of this framework has been assessed on a diverse range of tasks and shown to perform comparably with the state-of-the-art methods while reducing computational operations and memory consumption.