Abstract
We propose a fully convolutional conditional generative neural network, the latent transformation neural network, capable of rigid and non-rigid object view synthesis using a lightweight architecture suited for real-time applications and embedded systems. In contrast to existing object view synthesis methods which incorporate conditioning information via concatenation, we introduce a dedicated network component, the conditional transformation unit. This unit is designed to learn the latent space transformations corresponding to specified target views. In addition, a consistency loss term is defined to guide the network toward learning the desired latent space mappings, a task-divided decoder is constructed to refine the quality of generated views of objects, and an adaptive discriminator is introduced to improve the adversarial training process. The generalizability of the proposed methodology is demonstrated on a collection of three diverse tasks: multi-view synthesis on real hand depth images, view synthesis of real and synthetic faces, and the rotation of rigid objects. The proposed model is shown to be comparable with the state-of-the-art methods in structural similarity index measure and \(L_{1}\) metrics while simultaneously achieving a 24% reduction in the compute time for inference of novel images.
1 Introduction
The task of synthesizing novel views of objects from a single reference frame/view is an important problem which has a variety of practical applications in computer vision, graphics, and robotics. In computer vision, view synthesis can be used to generate 3D point cloud representations of objects from a single input image [40]; in the context of hand pose estimation algorithms, generating additional synthetic views can also help reduce occlusion and improve the accuracy of the estimated poses [12, 14]. In computer graphics, view synthesis has been used to apply changes in lighting and viewpoint to single-view images of faces [24]. In robotics, synthetic views can be used to help predict unobserved part locations and improve the performance of object grasping with manipulators [42]. However, synthesizing novel views from a single input image is a formidable task with serious complications arising from the complexity of the target object and the presence of heavily self-occluded parts.
Generative models have been shown to provide effective frameworks for representing complex, structured datasets and generating realistic samples from underlying data distributions [7, 13]. This concept has also been extended to form conditional models capable of sampling from conditional distributions in order to allow certain properties of the generated data to be controlled or selected [26, 43]. Generative models without encoders [5, 46] are generally used to sample from broad classes of the data distributions; however, these models are not designed to incorporate input data and therefore cannot preserve characteristic features of specified input data. Models have also been proposed which incorporate encoding components to overcome this by learning to map input data onto an associated latent space representation within a generative framework [21, 25]. The resulting inference models allow for the defining features of inputs to be preserved while specified target properties are adjusted through conditioning [45].
Conventional conditional models have largely relied on rather simple methods, such as concatenation, for implementing this conditioning process; however, cGANs [27] have shown that utilizing the conditioning information in a less trivial, more methodical manner has the potential to significantly improve the performance of conditional generative models. In this work, we provide a general framework for effectively performing inference with conditional generative models by strategically controlling the interaction between conditioning information and latent representations within a generative inference model. In this framework, a conditional transformation unit (CTU), \(\varPhi \), is introduced to provide a means for navigating the underlying manifold structure of the latent space. The CTU is realized in the form of a collection of convolutional layers which are designed to approximate the latent space operators defined by mapping encoded inputs to the encoded representations of specified targets (see Fig. 1). This is enforced by introducing a consistency loss term to guide the CTU mappings during training. In addition, a conditional discriminator unit (CDU), \(\varPsi \), also realized as a collection of convolutional layers, is included in the network’s discriminator. This CDU is designed to improve the network’s ability to identify and eliminate transformation-specific artifacts in the network’s predictions.
The network has also been equipped with RGB balance parameters consisting of three values \(\{\theta _R, \theta _G, \theta _B \}\) designed to give the network the ability to quickly adjust the global color balance of the images it produces to better align with that of the true data distribution. In this way, the network is easily able to remove unnatural hues and focus on estimating local pixel values by adjusting the three RGB parameters rather than correcting each pixel individually. In addition, we introduce a novel estimation strategy for efficiently learning shape and color properties simultaneously; a task-divided decoder (TD) is designed to produce a coarse pixel value map along with a refinement map in order to split the network’s overall task into distinct, dedicated network components.
Summary of contributions:
-
1.
We introduce the conditional transformation unit, with a family of modular filter weights, to learn high-level mappings within a low-dimensional latent space. In addition, we present a consistency loss term which is used to guide the transformations learned during training.
-
2.
We propose a novel framework for 3D object view synthesis which separates the generative process into distinct network components dedicated to learning (i) coarse pixel value estimates, (ii) pixel refinement map, and (iii) the global RGB color balance of the dataset.
-
3.
We introduce the conditional discriminator unit designed to improve adversarial training by identifying and eliminating transformation-specific artifacts present in the generated images.
Each contribution proposed above has been shown to provide a significant improvement to the network’s overall performance through a series of ablation studies. The resulting latent transformation neural network (LTNN) is placed through a series of comparative studies on a diverse range of experiments where it is seen to be comparable with the existing state-of-the-art models for (i) simultaneous multi-view synthesis of real hand depth images in real-time, (ii) the synthesis of rotated views of rigid objects given a single image, and (iii) object view synthesis and attribute modification of real and synthetic faces.
Moreover, the CTU conditioning framework allows for additional conditioning information, or target views, to be added to the training procedure ad infinitum without any reduction in the network’s inference speed.
2 Related work
Conditional generative models have been widely used in computer vision areas such as geometric prediction [30, 34, 40, 48] and non-rigid object modification such as human face deformation [1, 11, 33, 47]. Dosovitskiy et al. [8] proposed a supervised, conditional generative model trained to generate images of chairs, tables, and cars with specified attributes which are controlled by transformation and view parameters passed to the network. MV3D [40] is pioneering deep learning work for object view synthesis which uses an encoder–decoder network to directly generate the pixels of a target view, with depth information in the loss function and viewpoint information passed as a conditional term. The appearance flow network (AFN) [48] proposed a method for view synthesis of objects by predicting appearance flow fields, which are used to move pixels from an input to a target view. However, this method requires detailed camera pose information and is not capable of predicting pixels which are missing in the source views. M2N [38] performs view prediction with a recurrent network, iteratively synthesizing views using a recurrent pixel generator with appearance flow, guided by a self-learned confidence map. TVSN [30] uses a visibility map, which indicates the visible parts in a target image, to identify occlusion in different views. However, this method requires mesh models for each object in order to extract visibility maps for training the network. The DFN by Jia et al. [20] proposed using a dynamic filter which is conditioned on a sequence of previous frames; this is fundamentally different from our method since the filter is applied to the original inputs rather than the latent embeddings. Moreover, it relies on temporal information and is not applicable for predictions given a single image.
The IterGAN model introduced by Galama and Mensink [10] is also designed to synthesize novel views from a single image, with a specific emphasis on the synthesis of rotated views of objects in small, iterative steps. The conditional variational autoencoder (CVAE) incorporates conditioning information into the standard variational autoencoder (VAE) framework [23] and is capable of synthesizing specified attribute changes in an identity preserving manner [37, 45]. Other works have introduced a clamping strategy to enforce a specific organizational structure in the latent space [24, 33]; these networks require extremely detailed labels for supervision, such as the graphics code parameters used to create each example, and are therefore very difficult to implement for more general tasks (e.g., training with real images). These models are all reliant on additional knowledge for training, such as depth information, camera poses, or mesh models, and are not applicable in embedded systems and real-time applications due to their high computational demand and large number of network parameters, as these methods were not designed with model efficiency in mind.
CVAE-GAN [2] further adds adversarial training to the CVAE framework in order to improve the quality of generated predictions. Zhang et al. [47] introduced the conditional adversarial autoencoder (CAAE) designed to model age progression/regression in human faces. This is achieved by concatenating conditioning information (i.e., age) with the input’s latent representation before proceeding to the decoding process. The framework also includes an adaptive discriminator with conditional information passed using a resize/concatenate procedure. To the best of our knowledge, all existing conditional generative models designed for inference use fixed hidden layers and concatenate conditioning information directly with latent representations. In contrast to these existing methods, the proposed model incorporates conditioning information by defining dedicated, transformation-specific convolutional layers at the latent level. This conditioning framework allows the network to synthesize multiple transformed views from a single input, while retaining a fully convolutional structure which avoids the dense connections used in existing inference-based conditional models. Most significantly, the proposed LTNN framework is shown to be comparable with the state-of-the-art models in a diverse range of object view synthesis tasks, while requiring substantially fewer FLOPs and less memory for inference than other methods.
3 Latent transformation neural network
In this section, we introduce the methods used to define the proposed LTNN model. We first give a brief overview of the LTNN network structure. We then detail how conditional transformation unit mappings are defined and trained to operate on the latent space, followed by a description of the conditional discriminator unit implementation and the network loss function used to guide the training process. Lastly, we describe the task division framework used for the decoding process.
The basic workflow of the proposed model is as follows:
-
1.
Encode the input image x to a latent representation \(l_x ={\text {Encode}}(x)\).
-
2.
Use conditioning information k to select conditional, convolutional filter weights \(\omega _k\).
-
3.
Map the latent representation \(l_x \) to \( \widehat{l}_{y_k} = \varPhi _k(l_x) = {\text {conv}}(l_x, \omega _k)\), an approximation of the encoded latent representation \(l_{y_k}\) of the specified target image \(y_k\).
-
4.
Decode \(\widehat{l}_{y_k}\) to obtain a coarse pixel value map and a refinement map.
-
5.
Scale the channels of the pixel value map by the RGB balance parameters and take the Hadamard product with the refinement map to obtain the final prediction \(\widehat{y}_k\).
-
6.
Pass real images \(y_k\) as well as generated images \(\widehat{y}_k\) to the discriminator, and use the conditioning information to select the discriminator’s conditional filter weights \(\overline{\omega }_k\).
-
7.
Compute loss and update weights using ADAM optimization and backpropagation.
3.1 Conditional transformation unit
Generative models have frequently been designed to explicitly disentangle the latent space in order to enable high-level attribute modification through linear, latent space interpolation. This linear latent structure is imposed by design decisions, however, and may not be the most natural way for a network to internalize features of the data distribution. Several approaches have been proposed which include nonlinear layers for processing conditioning information at the latent space level. In these conventional conditional generative frameworks, conditioning information is introduced by combining features extracted from the input with features extracted from the conditioning information (often using dense connection layers); these features are typically combined using standard vector concatenation, although some have opted to use channel concatenation.
In particular, conventional approaches for incorporating conditional information generally fall into three classes: (1) apply a fully connected layer before and after concatenating a vector storing conditional information [24, 40, 47, 48], (2) flatten the network features and concatenate with a vector storing conditional information [30], (3) tile a conditional vector to create a two-dimensional array with the same shape as the network features and concatenate channel-wise [2, 38]. Since the first class is more prevalent than the others in practice, we have subdivided this class into four cases: FC-Concat-FC [47], FC-Concat-2FC [24], 2FC-Concat-FC [48], and 2FC-Concat-2FC [40]. Six of these conventional conditional network designs are illustrated in Fig. 2 along with the proposed LTNN network design for incorporating conditioning information.
Rather than directly concatenating conditioning information with network features, we propose using a conditional transformation unit (CTU), consisting of a collection of distinct convolutional mappings in the network’s latent space. More specifically, the CTU maintains independent convolution kernel weights for each target view in consideration. Conditioning information is used to select which collection of kernel weights, i.e., which CTU mapping, should be used in the CTU convolutional layer to perform a specified transformation. In addition to the convolutional kernel weights, each CTU mapping incorporates a Swish activation [32] with independent parameters for each specified target view. The kernel weights and Swish parameters of each CTU mapping are selectively updated by controlling the gradient flow based on the conditioning information provided.
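A minimal sketch of the CTU idea, assuming per-view \(3\times 3\) convolution kernels and per-view Swish parameters selected by the conditioning index (all shapes and values are illustrative):

```python
# Per-target-view convolution kernel weights and Swish parameters, selected by
# the conditioning index k. The 3x3 "same"-padded convolution is written out
# directly for clarity rather than using a deep learning framework.
import numpy as np

rng = np.random.default_rng(1)
N_VIEWS, C = 3, 4                       # number of target views, latent channels

kernels = rng.standard_normal((N_VIEWS, C, C, 3, 3)) * 0.1  # omega_k for each view k
betas = np.array([1.0, 0.8, 1.2])                           # per-view Swish parameter

def swish(x, beta):
    return x / (1 + np.exp(-beta * x))

def ctu(l_x, k):
    """Apply the k-th conditional transformation to latent features l_x of shape (C, H, W)."""
    w = kernels[k]
    c, h, wdt = l_x.shape
    padded = np.pad(l_x, ((0, 0), (1, 1), (1, 1)))  # "same" padding
    out = np.zeros_like(l_x)
    for co in range(c):
        for i in range(h):
            for j in range(wdt):
                out[co, i, j] = np.sum(w[co] * padded[:, i:i + 3, j:j + 3])
    return swish(out, betas[k])

l_x = rng.standard_normal((C, 2, 2))    # 2x2 latent grid, as in the paper's bottleneck
l_hat = ctu(l_x, k=1)
```

Selective gradient updates would then touch only `kernels[k]` and `betas[k]` for the sampled condition, leaving the other views' parameters unchanged.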
The CTU mappings are trained to transform the encoded, latent space representation of the network’s input in a manner which produces high-level view or attribute changes upon decoding. In this way, different angles of view, light directions, and deformations, for example, can be generated from a single input image. In one embodiment, the training process for the conditional transformation units can be designed to form a semigroup \(\{\varPhi _t\}_{t\ge 0}\) of operators:

\(\varPhi _s \circ \varPhi _t \,=\, \varPhi _{s+t} \quad \text {for all } s, t \ge 0,\)

defined on the latent space and trained to follow the geometric flow corresponding to a specified attribute. In the context of rotating three-dimensional objects, for example, the transformation units are trained on the input images paired with several target outputs corresponding to different angles of rotation; the network then uses conditioning information, which specifies the angle by which the object should be rotated, to select the appropriate transformation unit. In this context, the semigroup criteria correspond to the fact that rotating an object \(10^{\circ }\) twice should align with the result of rotating the object by \(20^{\circ }\) once.
Since the encoder and decoder are not influenced by the specified angle of rotation, the network’s encoding/decoding structure learns to model objects at different angles simultaneously; the single, low-dimensional latent representation of the input contains all information required to produce rotated views of the original object. Other embodiments can depart with this semigroup formulation, however, training conditional transformation units to instead produce a more diverse collection of non-sequential viewpoints, for example, as is the case for multi-view hand synthesis.
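The semigroup criterion can be illustrated with exact rotation operators on a toy two-dimensional latent space; learned CTU mappings only approximate this property, guided by the consistency loss:

```python
# In a latent space where each Phi_t is an exact 2D rotation by t degrees,
# applying Phi_10 twice agrees with applying Phi_20 once (semigroup property).
import numpy as np

def phi(deg):
    t = np.deg2rad(deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

l_x = np.array([1.0, 0.0])               # toy latent code
twice_10 = phi(10) @ (phi(10) @ l_x)     # rotate 10 degrees, twice
once_20 = phi(20) @ l_x                  # rotate 20 degrees, once
```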
To enforce this behavior on the latent space CTU mappings in practice, a consistency term is introduced into the loss function, as specified in Eq. 2. This loss term is minimized precisely when the CTU mappings behave as depicted in Fig. 1; in particular, the output of the CTU mapping associated with a particular transformation is designed to match the encoding of the associated ground truth target view. More precisely, given an input image x, the consistency loss associated with the kth transformation is defined in terms of the ground truth, transformed target view \(y_{k}\) by:

\(\mathscr {L}_{consist} \,=\, \big \Vert \varPhi _k\big ({\text {Encode}}[x]\big ) - {\text {Encode}}[y_k] \big \Vert _1.\)
3.2 Discriminator and loss function
The discriminator used in the adversarial training process is also passed conditioning information which specifies the transformation which the model has attempted to make. The conditional discriminator unit (CDU), which is implemented as a convolutional layer with modular weights similar to the CTU, is trained to specifically identify unrealistic artifacts which are being produced by the corresponding conditional transformation unit mappings. This is accomplished by maintaining independent convolutional kernel weights for each specified target view and using the conditioning information passed to the discriminator to select the kernel weights for the CDU layer. The incorporation of this context-aware discriminator structure has significantly boosted the performance of the network (see Table 1). The discriminator, \(\mathscr {D}\), is trained using the adversarial loss term \(\mathscr {L}^\mathscr {D}_{adv}\) defined in Eq. 3. The proposed model uses the adversarial loss in Eq. 4 to effectively capture multimodal distributions [36], which helps to sharpen the generated views.
Reducing the total variation is widely used in view synthesis methods [30, 47]. In particular, the \(L_{smooth}\) term is used to reduce noise in the generated images by reducing the variation of pixels, which is inspired by total variation image denoising. Experimental evidence shows that the inclusion of the \(L_{smooth}\) loss term leads to an improvement in the overall quality of the synthesized images (see Table 1). We have experimented with various shift sizes and found that the shift size \(\tau =1\) yields the best performance.
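A sketch of a total-variation smoothness term with shift size \(\tau = 1\); the exact normalization and set of shifts used in the paper may differ:

```python
# L_smooth-style penalty: mean absolute difference between the image and its
# one-pixel horizontal and vertical shifts (total-variation denoising idea).
import numpy as np

def l_smooth(img, tau=1):
    dh = np.abs(img[:, tau:] - img[:, :-tau]).mean()   # horizontal shift
    dv = np.abs(img[tau:, :] - img[:-tau, :]).mean()   # vertical shift
    return dh + dv

flat = np.ones((8, 8))                                  # perfectly smooth image
noisy = flat + np.random.default_rng(2).standard_normal((8, 8))
```

A flat image incurs zero penalty, while added pixel noise increases it, which is exactly the behavior that sharpens the generated views.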
Additional loss terms corresponding to accurate structural reconstruction and smoothness [19] in the generated views are defined in Eqs. 5 and 6:

\(\mathscr {L}_{recon} \,=\, \big \Vert \widehat{y}_k - y_k \big \Vert _1,\)

\(\mathscr {L}_{smooth} \,=\, \sum _{i,j} \big \Vert \widehat{y}_k - \tau _{i,j}\, \widehat{y}_k \big \Vert _1,\)

where \(y_k\) is the modified target image corresponding to an input x, \(\overline{\omega }_k\) are the weights of the CDU mapping corresponding to the kth transformation, \(\varPhi _k\) is the CTU mapping for the kth transformation, \(\widehat{y}_k= {\text {Decode}}\big ( \varPhi _k\big ({\text {Encode}}[x]\big ) \big )\) is the network prediction, and \(\tau _{i,j}\) is the two-dimensional, discrete shift operator. The final loss function for the encoder and decoder components is given by:

\(\mathscr {L} \,=\, \lambda \, \mathscr {L}_{recon} + \rho \, \mathscr {L}_{consist} + \gamma \, \mathscr {L}_{adv} + \kappa \, \mathscr {L}_{smooth},\)

with hyperparameters typically selected so that \(\lambda , \rho \gg \gamma , \kappa \). The consistency loss is designed to guide the CTU mappings toward approximations of the latent space mappings which connect the latent representations of input images and target images as depicted in Fig. 1. In particular, the consistency term enforces the condition that the transformed encoding, \(\widehat{l}_{y_k} = \varPhi _k(\text{ Encode }[x])\), approximates the encoding of the kth target image, \(l_{y_k} = \text{ Encode }[y_k]\), during the training process.
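As a numeric sketch, the weighted combination can be computed directly with the hyperparameter values reported in Sect. 4. The assignment of \(\lambda\) to reconstruction, \(\rho\) to consistency, \(\gamma\) to the adversarial term, and \(\kappa\) to smoothness is an assumption for illustration, and the individual loss values are placeholders:

```python
# Weighted sum of the loss terms; lambda and rho dominate gamma and kappa,
# so reconstruction and consistency drive most of the gradient signal.
lam, rho, gamma, kappa = 0.8, 0.2, 0.000025, 0.00005

def total_loss(l_recon, l_consist, l_adv, l_smooth):
    return lam * l_recon + rho * l_consist + gamma * l_adv + kappa * l_smooth

loss = total_loss(l_recon=0.5, l_consist=0.3, l_adv=1.0, l_smooth=2.0)
```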
3.3 Task-divided decoder
The decoding process has been divided into three tasks: estimating the refinement map, pixel values, and RGB color balance of the dataset. We have found this decoupled framework for estimation helps the network converge to better minima to produce sharp, realistic outputs without additional loss terms. The decoding process begins with a series of convolutional layers followed by bilinear interpolation to upsample the low-resolution latent information. The last component of the decoder’s upsampling process consists of two distinct convolutional layers used for the task division; one layer is allocated for predicting the refinement map, while the other is trained to predict pixel values. The refinement map layer incorporates a sigmoidal activation function which outputs scaling factors intended to refine the coarse pixel value estimations; the pixel value estimation layer does not use an activation so that the output values are not restricted to the range of a specific activation function. RGB balance parameters, consisting of three trainable variables, are used as weights for balancing the color channels of the pixel value map. The Hadamard product, \(\odot \), of the refinement map and the RGB-rescaled value map serves as the network’s final output:

\(\widehat{y}_k \,=\, \widehat{r}_k \odot \big ( \varTheta \odot \widehat{v}_k \big ),\)

where \(\widehat{r}_k\) denotes the refinement map, \(\widehat{v}_k\) the pixel value map, and \(\varTheta \) the channel-wise RGB balance parameters \(\{\theta _R, \theta _G, \theta _B\}\).
In this way, the network has the capacity to mask values which lie outside of the target object (i.e., by setting refinement map values to zero), which allows the value map to focus on the object itself during the training process. Experimental results show that the refinement maps learn to produce masks which closely resemble the target objects’ shapes and have sharp drop-offs along the boundaries. No additional information has been provided to the network for training the refinement map; the masking behavior illustrated in Figs. 3 and 6 is learned implicitly by the network during training and is made possible by the design of the network’s architecture. As shown in Fig. 3, the refinement map produces a shape mask which suppresses per-pixel errors outside the target object.
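The decoder's output combination can be sketched as follows; shapes and parameter values are illustrative:

```python
# Task-divided output: a sigmoid refinement map in (0, 1), an unbounded pixel
# value map, and three RGB balance parameters combined by Hadamard product.
import numpy as np

rng = np.random.default_rng(3)
H = W = 4
value_map = rng.standard_normal((3, H, W))        # no activation: unrestricted values
refine_logits = rng.standard_normal((3, H, W))
refine_map = 1.0 / (1.0 + np.exp(-refine_logits))  # sigmoid scaling factors
theta = np.array([1.02, 0.97, 1.01])               # trainable RGB balance (illustrative)

output = refine_map * (theta[:, None, None] * value_map)
```

Driving `refine_map` toward zero outside the object implements the learned masking behavior described above.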
4 Architecture details
The overview of the pipeline is shown in Fig. 4. Input images are passed through a Block v1 collaborative filter layer (see Fig. 5) along with a max pooling layer to produce the features at the far left end of the figure. At the bottleneck between the encoder and decoder, a conditional transformation unit (CTU) is applied to map the \(2\times 2\) latent features directly to the transformed \(2\times 2\) latent features on the right. This CTU is implemented as a convolutional layer with \(3\times 3\) filter weights selected based on the conditioning information provided to the network. The features near the end of the decoder component are processed by two independent layers, transpose convolutions for non-rigid objects and bilinear interpolation for rigid objects: one corresponding to the value estimation map and the other corresponding to the refinement map. The channels of the value estimation map are rescaled by the RGB balance parameters, and the Hadamard product is taken with the refinement map to produce the final network output. For the rigid object experiments, a hyperbolic tangent activation is applied after the Hadamard product to bound the output values to \([-1,1]\). The CDU is designed to have the same \(3\times 3\) kernel size as the CTU and is applied between the third and fourth layers of the discriminator. For the stereo face dataset [9] experiment, we have added an additional Block v1 layer in the encoder and an additional convolutional layer followed by bilinear interpolation in the decoder to utilize the full \(128\times 128\times 3\) resolution images; two Block v1 layers and two such convolutional layers are used for the \(256\times 256\times 3\) resolution rigid object views.
The encoder incorporates two main block layers, as defined in Fig. 5, which are designed to provide efficient feature extraction; these blocks follow a similar design to that proposed by Szegedy et al. [39], but include dense connections between blocks, as introduced by Huang et al. [16]. We normalize the output of each network layer using the batch normalization method described in [18]. For the decoder, we have opted for a minimalist design, inspired by the work of [31]. Standard convolutional layers with \(3\times 3\) filters and “same” padding are used through the penultimate decoding layer; transpose convolutional layers with \(1\times 1\) filters (for non-rigid objects) or \(5\times 5\) filters (for the other experiments), also with “same” padding, produce the value estimation and refinement maps. All parameters have been initialized using the variance scaling initialization method described in [15].
Our method has been implemented and developed using the TensorFlow framework. The models have been trained using stochastic gradient descent with the ADAM optimizer [22] with initial parameters \({\text {learning\_rate}} = 0.005\), \(\beta _1 = 0.9\), and \(\beta _2 = 0.999\) (as defined in the TensorFlow API r1.6 documentation for \({\text {tf.train.AdamOptimizer}}\)), along with loss function hyperparameters \(\lambda = 0.8\), \(\rho = 0.2\), \(\gamma = 0.000025\), and \(\kappa = 0.00005\) (as introduced in Eq. 7). The discriminator is updated once every two encoder/decoder updates, and one-sided label smoothing [36] has been used to improve the stability of the discriminator training procedure.
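The alternating update schedule can be sketched as a simple counter; this is a schematic of the 2:1 update ratio, not the actual training loop:

```python
# Encoder/decoder (generator) weights are updated at every step, while the
# discriminator is updated once for every two generator updates.
def training_schedule(n_steps):
    gen_updates = disc_updates = 0
    for step in range(n_steps):
        gen_updates += 1          # encoder/decoder updated every step
        if step % 2 == 1:
            disc_updates += 1     # discriminator updated every second step
    return gen_updates, disc_updates

gen, disc = training_schedule(100)
```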
5 Experiments and results
We conduct experiments on a diverse collection of datasets including both rigid and non-rigid objects. To show the generalizability of our method, we have conducted a series of experiments: (i) hand pose estimation using a synthetic training set and real NYU hand depth image data [41] for testing, (ii) synthesis of rotated views of rigid objects using the 3D object dataset [4], (iii) synthesis of rotated views using a real face dataset [9], and (iv) the modification of a diverse range of attributes on a synthetic face dataset [17]. For each experiment, we have trained the models using 80% of the datasets. Since ground truth target depth images were not available for the real hand dataset, an indirect metric has been used to quantitatively evaluate the model as described in Sect. 5.2. Ground truth data were available for all other experiments, and models were evaluated directly using the \(L_1\) mean pixel-wise error and the structural similarity index measure (SSIM) [44] used in [30, 38]. To evaluate the proposed framework against existing works, two comparison groups have been formed: conditional inference methods, CVAE-GAN [2] and CAAE [47], with comparable hourglass structures for comparison on experiments with non-rigid objects, and view synthesis methods, MV3D [40], M2N [38], AFN [48], and TVSN [30], for comparison on experiments with rigid objects. Additional ablation experiments have been performed to compare the proposed CTU conditioning method with other conventional concatenation methods (see Fig. 2); results are shown in Fig. 9 and Table 1.
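The \(L_1\) evaluation metric is the mean absolute pixel-wise error between a generated view and its ground truth target; a direct sketch:

```python
# Mean absolute pixel-wise error between a predicted view and the ground truth.
import numpy as np

def l1_error(pred, target):
    return np.abs(pred - target).mean()

target = np.zeros((3, 8, 8))            # toy ground truth image
pred = np.full((3, 8, 8), 0.1)          # toy prediction, uniformly off by 0.1
err = l1_error(pred, target)
```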
5.1 Experiment on rigid objects
Rigid object experiment We have experimented with novel 3D view synthesis tasks given a single view of an object with an arbitrary pose. The goal of this experiment is to synthesize an image of the object after a specified transformation or change in viewpoint has been applied to the original view. To evaluate our method in the context of rigid objects, we have performed a collection of tests on the chair and car datasets. Given a single input view of an object, we leverage the LTNN model to produce \(360^{\circ }\) views of the object. We have tested our model’s ability to perform \(360^{\circ }\) view estimation on 3D objects and compared the results with the other state-of-the-art methods. The models are trained on the same dataset used in M2N [38]. The car and chair categories from the ShapeNet [3] 3D model repository have been rotated horizontally 18 times by \({20}^{\circ }\) along with elevation changes of \({0}^{\circ }\), \({10}^{\circ }\), and \({20}^{\circ }\). The M2N and TVSN results are slightly better for the car category; however, these works have incorporated skip connections between the encoder layers and decoder layers, proposed in U-net [35], which substantially increases the computational demand for these networks (see Table 2). As can be seen in Tables 2 and 3, the proposed model is comparable with existing models specifically designed for the task of multi-view prediction while requiring the least FLOPs for inference compared with all other methods. The low computational cost of the LTNN model highlights the efficiency of the CTU/CDU framework for incorporating conditional information into the network for view synthesis. Moreover, as seen in the qualitative results provided in Fig. 6, using a task-divided decoder helps to eliminate artifacts in the generated views; in particular, the spokes on the back of the chair and the spoiler on the back of the car are seen to be synthesized much more clearly when using a task-divided decoder.
5.2 Experiment on non-rigid objects
Hand pose experiment To assess the performance of the proposed network on non-rigid objects, we consider the problem of hand pose estimation. As the number of available view points of a given hand is increased, the task of estimating the associated hand pose becomes significantly easier [14]. Motivated by this fact, we synthesize multiple views of a hand given a single view and evaluate the accuracy of the estimated hand pose using the synthesized views. The underlying assumption of the assessment is that the accuracy of the hand pose estimation will be improved precisely when the synthesized views provide faithful representations of the true hand pose. Since ground truth predictions for the real NYU hand dataset were not available, the LTNN model has been trained using a synthetic dataset generated using 3D mesh hand models. The NYU dataset does, however, provide ground truth coordinates for the input hand pose; using this, we were able to indirectly evaluate the performance of the model by assessing the accuracy of a hand pose estimation method using the network’s multi-view predictions as input. More specifically, the LTNN model was trained to generate nine different views which were then fed into the pose estimation network from Choi et al. [6] (also trained using the synthetic dataset). For an evaluation metric, the maximum error in the predicted joint locations has been computed for each frame (i.e., each hand pose in the dataset). The cumulative number of frames with maximum error below a threshold distance \(\epsilon _D\) has then been computed, as is commonly used in hand pose estimation tasks [6, 29]. A comparison of the pose estimation results using synthetic views generated by the proposed model, the CVAE-GAN model, and the CAAE model is presented in Fig. 7, along with the results obtained by performing pose estimation using the single-view input frame alone.
In particular, for a threshold distance \(\epsilon _D = 40\,\text{ mm }\), the proposed model yields the highest accuracy with \(61.98\%\) of the frames having all predicted joint locations within a distance of \(40\,\text{ mm }\) from the ground truth values. The second highest accuracy is achieved with the CVAE-GAN model with \(45.70\%\) of frames predicted within the \(40\,\text{ mm }\) threshold.
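The threshold metric used above, namely the fraction of frames whose worst predicted joint lies within \(\epsilon _D\) of the ground truth, can be sketched as follows (the array shapes and names are illustrative assumptions):

```python
import numpy as np

def fraction_within(joint_errors_mm, eps_d):
    """Fraction of frames whose *maximum* joint error is below eps_d (mm).

    joint_errors_mm: array of shape (num_frames, num_joints) holding the
    Euclidean distance between each predicted and ground-truth joint.
    """
    max_per_frame = joint_errors_mm.max(axis=1)
    return float(np.mean(max_per_frame < eps_d))

# Toy example: 3 frames, 2 joints each.
errs = np.array([[10.0, 35.0],   # worst joint 35 mm -> within a 40 mm threshold
                 [50.0,  5.0],   # worst joint 50 mm -> outside
                 [20.0, 39.0]])  # worst joint 39 mm -> within
print(fraction_within(errs, 40.0))  # 0.6666666666666666 (2 of 3 frames)
```

Sweeping `eps_d` over a range of distances produces the cumulative curves plotted in Fig. 7.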
A comparison of the quantitative hand pose estimation results is provided in Fig. 7 where the proposed LTNN framework is seen to provide a substantial improvement over existing methods; qualitative results are also available in Fig. 8. Ablation study results for assessing the impact of individual components of the LTNN model are also provided in Fig. 9; in particular, we note that the inclusion of the CTU, CDU, and task-divided decoder each provides significant improvements to the performance of the network. With regard to real-time applications, the proposed model runs at 114 fps without batching and at 1975 fps when applied to a mini-batch of size 128 (using a single TITAN Xp GPU and an Intel i7-6850K CPU).
Real face experiment We have also conducted an experiment on a real face dataset to demonstrate the applicability of LTNN to real images. The stereo face database [9], consisting of images of 100 individuals from 10 different viewpoints, was used for the experiments with real faces. The faces were first segmented using the method of [28], failure cases were cleaned up manually, and the cleaned faces were cropped and centered to form the final dataset. The LTNN model was trained to synthesize images of input faces corresponding to three consecutive horizontal rotations. Qualitative results for the real face experiment are provided in Fig. 10; in particular, we note that the quality of the views generated by the proposed LTNN model is consistent for each of the four views, while the quality of the views generated using other methods decreases substantially as the change in angle increases. This illustrates the advantage of using CTU mappings to navigate the latent space, which avoids the accumulation of errors inherent to iterative methods. Moreover, as shown in Figs. 11 and 12, the LTNN model provides substantial improvements over alternative methods with respect to the SSIM and \(L_1\) metrics and converges much faster as well.
5.3 Diverse attribute exploration
To evaluate the proposed framework on a more diverse range of attribute modification tasks, we use a synthetic face dataset and compare against conditional generative models with hourglass structures comparable to that of LTNN: CVAE-GAN and CAAE. These models have been trained to synthesize discrete changes in elevation, azimuth, light direction, and age from a single image; images generated by the LTNN model are shown in Fig. 13. As shown in Tables 4 and 5, the LTNN model outperforms the CVAE-GAN and CAAE models by a significant margin in both the SSIM and \(L_1\) metrics; additional quantitative results are provided in Table 1, along with a collection of ablation results for the LTNN model.
Multiple attributes can also be modified simultaneously using LTNN by composing CTU mappings. For example, one can train four CTU mappings \(\{\varPhi _k^{light}\}_{k=0}^3\) corresponding to incremental changes in lighting and four CTU mappings \(\{\varPhi _k^{azim}\}_{k=0}^3\) corresponding to incremental changes in azimuth. In this setting, the network predictions for lighting and azimuth changes correspond to the values of \({\text {Decode}}[\varPhi _k^{light}(l_x)]\) and \({\text {Decode}}[\varPhi _k^{azim}(l_x)]\), respectively (where \(l_x\) denotes the encoding of the original input image). To predict the effect of simultaneously changing both lighting and azimuth, we can compose the associated CTU mappings in the latent space; that is, we may take our network prediction for the lighting change associated with \(\varPhi _i^{light}\) combined with the azimuth change associated with \(\varPhi _j^{azim}\) to be: \({\text {Decode}}\big [\varPhi _j^{azim}\big (\varPhi _i^{light}(l_x)\big )\big ]\).
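The latent-space composition described above can be sketched with simple linear stand-ins for the learned CTU mappings; the matrices, latent dimension, and identity decoder below are placeholders for illustration only, not the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 8

# Stand-ins for trained CTU mappings: one linear map per discrete lighting
# step and one per azimuth step (the real CTUs are learned transformations
# applied to the convolutional latent code).
phi_light = [rng.standard_normal((latent_dim, latent_dim)) for _ in range(4)]
phi_azim = [rng.standard_normal((latent_dim, latent_dim)) for _ in range(4)]

def decode(z):
    """Placeholder for the trained decoder."""
    return z  # identity stand-in

l_x = rng.standard_normal(latent_dim)  # encoding of the original input image

# Single-attribute predictions: Decode[Phi_i^light(l_x)], Decode[Phi_j^azim(l_x)]
light_only = decode(phi_light[2] @ l_x)
azim_only = decode(phi_azim[1] @ l_x)

# Joint prediction by composing the CTU maps in the latent space:
# Decode[Phi_j^azim(Phi_i^light(l_x))]
both = decode(phi_azim[1] @ (phi_light[2] @ l_x))
print(both.shape)  # (8,)
```

No joint training over lighting-azimuth pairs is needed: each CTU is trained on its own attribute, and the composition happens entirely in the latent space at inference time.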
5.4 Near-continuous attribute modification
Near-continuous attribute modification is also possible within the proposed framework; this can be performed by a simple, piecewise-linear interpolation procedure in the latent space. For example, we can train nine CTU mappings \(\{\varPhi _k\}_{k=0}^8\) corresponding to incremental \({7}^{\circ }\) changes in elevation \(\{\theta _k\}_{k=0}^8\). The network predictions for an elevation change of \(\theta _0={0}^{\circ }\) and \(\theta _1={7}^{\circ }\) are then given by the values \({\text {Decode}}[\varPhi _0(l_x)]\) and \({\text {Decode}}[\varPhi _1(l_x)]\), respectively (where \(l_x\) denotes the encoding of the input image). To predict an elevation change of \({3.5}^{\circ }\), we can perform linear interpolation in the latent space between the representations \(\varPhi _0(l_x)\) and \(\varPhi _1(l_x)\); that is, we may take our network prediction for the intermediate change of \({3.5}^{\circ }\) to be: \({\text {Decode}}\big [\tfrac{1}{2}\,\varPhi _0(l_x)+\tfrac{1}{2}\,\varPhi _1(l_x)\big ]\).
More generally, we can interpolate between the latent CTU map representations to predict a change \(\theta \) via: \({\text {Decode}}\big [\lambda \cdot \varPhi _k(l_x)+(1-\lambda )\cdot \varPhi _{k+1}(l_x)\big ]\),
where \(k\in \{0,\dots ,7\}\) and \(\lambda \in [0,1]\) are chosen so that \(\theta = \lambda \cdot \theta _k+ (1-\lambda )\cdot \theta _{k+1}\). In this way, the proposed framework naturally allows for continuous attribute changes to be approximated while only requiring training for a finite collection of discrete changes. Qualitative results for near-continuous attribute modification on the synthetic face dataset are provided in Fig. 14; in particular, we note that views generated by the network effectively model gradual changes in the attributes without any noticeable degradation in quality. This highlights the fact that the model has learned a smooth latent space structure which can be navigated effectively by the CTU mappings while maintaining the identities of the original input faces.
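The interpolation scheme can be sketched in the same simplified setting as before, with linear maps standing in for the trained CTU mappings and an identity stand-in for the decoder (all names and dimensions below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim = 8
thetas = [7.0 * k for k in range(9)]  # elevation change handled by each CTU
phis = [rng.standard_normal((latent_dim, latent_dim)) for _ in range(9)]

def decode(z):
    """Placeholder for the trained decoder."""
    return z

def predict(theta, l_x):
    """Piecewise-linear interpolation between adjacent CTU latent codes:
    Decode[lambda * Phi_k(l_x) + (1 - lambda) * Phi_{k+1}(l_x)],
    with theta = lambda * theta_k + (1 - lambda) * theta_{k+1}."""
    k = min(int(theta // 7.0), 7)
    lam = (thetas[k + 1] - theta) / (thetas[k + 1] - thetas[k])
    z = lam * (phis[k] @ l_x) + (1.0 - lam) * (phis[k + 1] @ l_x)
    return decode(z)

l_x = rng.standard_normal(latent_dim)
mid = predict(3.5, l_x)  # halfway between the 0- and 7-degree CTUs
print(np.allclose(mid, 0.5 * (phis[0] @ l_x) + 0.5 * (phis[1] @ l_x)))  # True
```

Only the nine discrete mappings are ever trained; every intermediate elevation change is obtained by blending the two nearest latent representations at inference time.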
6 Conclusion
In this work, we have introduced an effective, general framework for incorporating conditioning information into inference-based generative models. We have proposed a modular approach to incorporating conditioning information using CTUs and a consistency loss term, defined an efficient task-divided decoder setup for deconstructing the data generation process into manageable subtasks, and shown that a context-aware discriminator can be used to improve the performance of the adversarial training process. The performance of this framework has been assessed on a diverse range of tasks and shown to perform comparably with the state-of-the-art methods while reducing computational operations and memory consumption.
References
Antipov, G., Baccouche, M., Dugelay, J.L.: Face aging with conditional generative adversarial networks (2017). arXiv preprint arXiv:1702.01983
Bao, J., Chen, D., Wen, F., Li, H., Hua, G.: CVAE-GAN: fine-grained image generation through asymmetric training (2017). arXiv preprint arXiv:1703.10155
Chang, A., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: an information-rich 3D model repository (2015). arXiv preprint arXiv:1512.03012
Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: an information-rich 3D model repository. Technical Report, Stanford University—Princeton University—Toyota Technological Institute at Chicago (2015). arXiv:1512.03012 [cs.GR]
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2172–2180 (2016)
Choi, C., Kim, S., Ramani, K.: Learning hand articulations by hallucinating heat distribution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3104–3113 (2017)
Dinerstein, J., Egbert, P.K., Cline, D.: Enhancing computer graphics through machine learning: a survey. Vis. Comput. 23(1), 25–43 (2007)
Dosovitskiy, A., Tobias Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1538–1546 (2015)
Fransens, R., Strecha, C., Van Gool, L.: Parametric stereo for multi-pose face recognition and 3D-face modeling. In: International Workshop on Analysis and Modeling of Faces and Gestures, pp. 109–124. Springer (2005)
Galama, Y., Mensink, T.: Iterative GANs for rotating visual objects (2018)
Gauthier, J.: Conditional generative adversarial nets for convolutional face generation. Class project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester (2014)
Ge, L., Liang, H., Yuan, J., Thalmann, D.: Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3593–3601 (2016)
Goodfellow, I.J.: NIPS 2016 Tutorial: Generative Adversarial Networks (2017). CoRR arXiv:1701.00160
Guan, H., Chang, J.S., Chen, L., Feris, R.S., Turk, M.: Multi-view appearance-based 3D hand pose estimation. In: 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW’06), pp. 154–154. IEEE (2006)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks (2016). arXiv preprint arXiv:1608.06993
Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3D face model for pose and illumination invariant face recognition. In: 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 296–301. IEEE (2009)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456 (2015)
Jason, J.Y., Harley, A.W., Derpanis, K.G.: Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In: Computer Vision—ECCV 2016 Workshops, pp. 3–10. Springer (2016)
Jia, X., De Brabandere, B., Tuytelaars, T., Gool, L.V.: Dynamic filter networks. In: Advances in Neural Information Processing Systems, pp. 667–675 (2016)
Kim, S., Kim, D., Choi, S.: Citycraft: 3D virtual city creation from a single image. Vis. Comput. (2019). https://doi.org/10.1007/s00371-019-01701-x
Kingma, D., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes (2013). arXiv preprint arXiv:1312.6114
Kulkarni, T.D., Whitney, W.F., Kohli, P., Tenenbaum, J.: Deep convolutional inverse graphics network. In: Advances in Neural Information Processing Systems, pp. 2539–2547 (2015)
Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial autoencoders (2015). arXiv preprint arXiv:1511.05644
Mirza, M., Osindero, S.: Conditional generative adversarial nets (2014). arXiv preprint arXiv:1411.1784
Miyato, T., Koyama, M.: cGANs with projection discriminator (2018). arXiv preprint arXiv:1802.05637
Nirkin, Y., Masi, I., Tuan, A.T., Hassner, T., Medioni, G.: On face segmentation, face swapping, and face perception. In: 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2018), pp. 98–105. IEEE (2018)
Oberweger, M., Lepetit, V.: Deepprior++: improving fast and accurate 3D hand pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 585–594 (2017)
Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3D view synthesis. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 702–711. IEEE (2017)
Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: ENet: a deep neural network architecture for real-time semantic segmentation (2016). arXiv preprint arXiv:1606.02147
Ramachandran, P., Zoph, B., Le, Q.V.: Swish: a self-gated activation function (2017). arXiv preprint arXiv:1710.05941
Reed, S., Sohn, K., Zhang, Y., Lee, H.: Learning to disentangle factors of variation with manifold interaction. In: International Conference on Machine Learning, pp. 1431–1439 (2014)
Rezende, D.J., Eslami, S.A., Mohamed, S., Battaglia, P., Jaderberg, M., Heess, N.: Unsupervised learning of 3D structure from images. In: Advances in Neural Information Processing Systems, pp. 4996–5004 (2016)
Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer (2015)
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016)
Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: Advances in Neural Information Processing Systems, pp. 3483–3491 (2015)
Sun, S.H., Huh, M., Liao, Y.H., Zhang, N., Lim, J.J.: Multi-view to novel view: Synthesizing novel views with self-learned confidence. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 155–171 (2018)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Tatarchenko, M., Dosovitskiy, A., Brox, T.: Multi-view 3D models from single images with a convolutional network. In: European Conference on Computer Vision, pp. 322–337. Springer (2016)
Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Gr. 33(5), 169 (2014)
Varley, J., DeChant, C., Richardson, A., Ruales, J., Allen, P.: Shape completion enabled robotic grasping. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2442–2447. IEEE (2017)
Wang, Q., Artières, T., Chen, M., Denoyer, L.: Adversarial learning for modeling human motion. Vis. Comput. (2018). https://doi.org/10.1007/s00371-018-1594-7
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2image: conditional image generation from visual attributes. In: European Conference on Computer Vision, pp. 776–791. Springer (2016)
Zhang, S., Han, Z., Lai, Y.K., Zwicker, M., Zhang, H.: Stylistic scene enhancement GAN: mixed stylistic enhancement generation for 3D indoor scenes. Vis. Comput. 35(6–8), 1157–1169 (2019)
Zhang, Z., Song, Y., Qi, H.: Age progression/regression by conditional adversarial autoencoder. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5810–5818 (2017)
Zhou, T., Tulsiani, S., Sun, W., Malik, J., Efros, A.A.: View synthesis by appearance flow. In: European Conference on Computer Vision, pp. 286–301. Springer (2016)
Acknowledgements
Karthik Ramani acknowledges the US National Science Foundation Awards NRI-1637961 and IIP-1632154. Guang Lin acknowledges the US National Science Foundation Awards DMS-1555072, DMS-1736364 and DMS-1821233. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agency. We gratefully appreciate the support of NVIDIA Corporation with the donation of GPUs used for this research.
Cite this article
Kim, S., Winovich, N., Chi, HG. et al. Latent transformations neural network for object view synthesis. Vis Comput 36, 1663–1677 (2020). https://doi.org/10.1007/s00371-019-01755-x