1 Introduction

Three-dimensional (3D) content creation involves coordinated work among artists, modelers, designers and animators, and building a seamless pipeline for it remains a key industrial challenge. The virtual reality, robotics and computer graphics industries know this process to be time-consuming and cumbersome. Meanwhile, many collections of 3D models have been created and published in online repositories such as ShapeNet [1] and ModelNet [2]. These repositories contain large-scale information about the textures, styles, structures and poses of object classes, which can assist designers during modeling. Tools that leverage this information can enforce data-driven constraints, provide completions of partially designed objects, or even synthesize whole shapes from an image or merely a random noise vector.

One of the key challenges faced by current academic research is to develop algorithms that can understand, analyze and automatically generate 3D content. Generative models address this need [3, 4]. Shape generation with generative adversarial networks (GANs) [5,6,7] and variational auto-encoders (VAEs) [8, 9] is currently one of the most active topics in deep learning and computer vision. These generative models serve as a test-bed for high-quality representation learning, feature extraction and unsupervised recognition using probabilistic latent spaces and manifolds.

While deep generative models provide a generic, flexible mathematical framework with immense expressive power, the performance of VAE- and GAN-based methods still leaves much to be desired under challenging conditions. A downside of the 3D-VAE [4] is that it relies on a direct mean squared error rather than an adversarial network, so it tends to produce unreliable reconstructions, analogous to the blurry outputs seen in image generation. Moreover, choosing a too simplistic prior such as the standard normal distribution is known to yield over-regularized models with only a few active latent dimensions and, as a result, very poor hidden representations. 3D-GAN [3] is based on the original GAN architecture and training approach, which is well known to suffer from instability. Alongside this instability, GANs are prone to the missing-mode problem: the output of the generator G becomes confined to a few modes of the data distribution and lacks diversity.

An important factor in the aforementioned problem is the lack of control on the discriminator during GAN training. Inspired by the observation that the optimization objectives of supervised learning are more stable, we suggest adding a supervised signal as a regularizer on top of the discriminator's target. The recent literature [10,11,12] has also shown that the prior distribution of the latent variable plays a crucial role in generative models. Selecting an appropriate prior for a GAN is, however, not trivial. Specifically, an appropriate prior should satisfy the following criteria:

  • The prior should be an expressive distribution that is flexible enough to capture a-priori knowledge, e.g., as achieved by Gaussian mixture models (GMMs).

  • The prior should retain as much information from the real samples as possible, yielding a richer and more informative prior.

These insights lead us to adopt the variational mixture of posteriors prior (VampPrior) as the distribution of the latent variable. Its multimodal nature gives our prior an advantage over many simpler priors in terms of training complexity and expressiveness. Specifically, instead of modeling the prior directly as a Gaussian mixture model (GMM), the VampPrior is a mixture whose components are variational posteriors conditioned on a set of learnable pseudo-inputs (Eq. 2). Importantly, the prior and posterior are coupled in the VampPrior, which implicitly injects information from the real data into the generator network and thus simultaneously promotes the variety and fidelity of the generated samples. GANs can therefore benefit dramatically from this prior. Moreover, the use of pseudo-inputs protects the model from potential overfitting and keeps training inexpensive, so the prior strikes a good compromise between computational convenience and flexibility. Building on these ideas, we propose a novel generative model, called VampPrior-3DGAN, which integrates this prior with the regularizer above to compensate for missing modes.

Contributions Our principal contributions are:

  • We introduce the variational mixture of posteriors prior (VampPrior) [11] as the distribution of the latent variable in a GAN. It enriches the prior and encodes more information about the real samples.

  • Our encoder serves as a regularizer that penalizes missing modes, thereby improving the GAN's training stability and sample quality.

  • We propose VampPrior-3DGAN, which learns the prior from data and thus models multimodality for 3D generation tasks. Our method can learn multimodal distributions and generate high-fidelity, diverse 3D shapes.

  • We show that our model has favorable properties, such as high architectural compatibility across tasks ranging from shape generation to image-to-shape reconstruction.

The rest of the paper is organized as follows. Section 2 describes some related studies. Section 3 details the proposed method. Section 4 presents and discusses the results and findings. Section 5 concludes the paper.

2 Related work

Shape generation There are two main schools of work on shape generation: (a) the native 3D school. This school is characterized by training directly on 3D datasets such as ShapeNet [1], and stays in the 3D domain from training to inference. Representative works include 3D-GAN [3], GET3D [13], TextCraft [14] (which adds text conditioning), AutoSDF [15], MeshDiffusion [16], etc. Such methods tend to be fast and have no trouble generating the categories present in the dataset; however, generating shapes that require 'imagination' remains challenging. (b) The 2D upscaling school. These approaches draw on the imaginative power of 2D generative AI to drive the generation of 3D content, and have recently made rapid progress riding on breakthroughs in 2D deep generative models such as Imagen [17] and Stable Diffusion (SD) [18]. OpenAI Point\(\cdot \)E [19] takes text as input, generates an image using the 2D diffusion model GLIDE [20], and then generates a point cloud from that image using a 3D point cloud diffusion model. DreamFusion [21] generates multiple views with a 2D generative model (e.g., Imagen [17]) and then reconstructs them with NeRF [22]; the authors adopt a GAN-like scheme in which NeRF and Imagen iterate back and forth, the advantage being greater diversity. Magic3D [23] cleverly divides the reconstruction process into two steps: the first uses only NeRF for rough shape generation, and the second uses a differentiable rasterizer for refinement.

Point cloud completion The pioneering PointNet [24, 25] sparked a boom in 3D vision and many follow-up studies. PointNet directly inspired researchers to learn global feature embeddings from point clouds for point cloud generation (PSG [26]) and point cloud completion [27, 28]. However, predicting local details and thin shape structures remains a challenge. To address this, several works [29,30,31,32] exploit multi-scale local point features to reconstruct complete point clouds with fine-grained geometric detail, and attention mechanisms have yielded impressive completion results [31, 32]. As a challenging conditional generation problem, point cloud completion is still open. In the last two years, diffusion models (DDPM [33], Stable Diffusion [18]) have made many breakthroughs and become dominant in 2D AIGC, whereas in 3D content generation they are still at the exploration stage. Luo and Hu [34] were the first to use DDPM for unconditional point cloud generation. Lyu et al. [35] and Zhou et al. [36] further use conditional DDPM for point cloud completion; the major difference is that Zhou et al. do not refine or upsample the coarse point cloud generated by DDPM, as Lyu et al. do.

Single-view reconstruction Significant progress has been made in the field of single-view 3D reconstruction, where the choice of representation is clearly critical to the quality of the reconstructed shape. Volume-based methods [3, 37, 38] account for most of the work on single-view 3D reconstruction, but due to their memory consumption they lack the scalability needed to reconstruct high-resolution, detailed shapes. Point cloud-based methods [26, 39, 40] have a smaller memory footprint, but because point clouds lack topological connectivity, they require post-processing to obtain surfaces. Mesh-based methods [41,42,43] utilize connectivity information, but depend heavily on the underlying template they deform; in general, there is no direct way to change the topology of the underlying mesh during reconstruction to achieve better edge and mesh flow, which would facilitate better shape quality. Methods based on implicit surfaces (level sets [44], SDFs [45], occupancy [46] and implicit fields [47]) have recently received increasing attention due to their desirable properties for single-view 3D reconstruction; however, owing to inefficient sampling, they usually produce over-smoothed reconstructions. New implicit representations are constantly being proposed, one of the most promising being neural radiance fields (NeRF [22, 48, 49]), and further breakthroughs in NeRF-based single-view 3D reconstruction [50, 51] can be expected in the near future.

Regularization of latent space Another branch of related work, perhaps the one most closely related to ours, concerns the regularization of GANs and the learning of a meaningfully structured latent space. [52] proposes two novel regularizers for the GAN training objective: a geometric metrics regularizer and a mode regularizer. GM-GAN [53] incorporates sparse prior knowledge into the model by sampling latent vectors from a multimodal probability distribution that better matches the sparse characteristics of the data space.

3 Methodology

In this section, we first provide two intuitions and then the corresponding solutions underlying our variant of 3D-GAN, dubbed VampPrior-3DGAN, whose main purpose is to address mode collapse and improve sample diversity.

Geometric intuition Canonically, the GAN training procedure (Fig. 1) can be viewed as a non-cooperative two-player min–max game, in which the discriminator D attempts to distinguish real from generated examples, whereas the generator G tries to fool the discriminator by pushing generated samples toward regions of higher discriminator values.

Fig. 1 Generative adversarial network

We argue that training the discriminator D can be interpreted as training an evaluation metric on the sample space. Then, the generator G has to take advantage of the local gradient \(\nabla \log D(G)\) provided by the discriminator to improve itself, namely to move toward the data manifold. The value function describing the GAN’s min–max game can be formally formulated by

$$\begin{aligned} \begin{aligned} \min _{G}\max _{D}~V(D, G) ~=~ {\mathbb {E}}_{{\textbf {x}} \sim {p_{\textsc {data}}}}[\log {D({\textbf {x}})}] + {\mathbb {E}}_{{\textbf {z}}\sim {p_{{\textbf {z}}}}} [\log (1 - D \circ G({\textbf {z}}))] \end{aligned} \end{aligned}$$
(1)

where \(p_{\textsc {data}}\) is the data-generating distribution and \({\textbf {z}} \in {\mathbb {R}}^{d_{\textbf {z}}}\) is a latent variable drawn from a distribution \(p({\textbf {z}})\) such as \({\mathcal {N}}(0, I)\) or \({\mathcal {U}}[-1, 1]\).
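For concreteness, the sketch below shows one alternating update under Eq. 1. The networks, optimizers and latent dimension are placeholders (not the architectures used in this paper), and the generator step uses the common non-saturating variant of the \(\log (1 - D \circ G({\textbf {z}}))\) term rather than the literal minimax form.

```python
import torch
import torch.nn.functional as F

def gan_step(D, G, real_x, opt_d, opt_g, d_z=200):
    """One illustrative min-max update for Eq. 1.

    D maps a voxel batch to a real/fake logit of shape (B, 1);
    G maps a latent batch to a voxel batch. All of these are placeholders.
    """
    b = real_x.size(0)
    z = torch.randn(b, d_z)                       # z ~ N(0, I)

    # Discriminator: maximize log D(x) + log(1 - D(G(z))).
    fake_x = G(z).detach()
    d_loss = F.binary_cross_entropy_with_logits(D(real_x), torch.ones(b, 1)) \
           + F.binary_cross_entropy_with_logits(D(fake_x), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: non-saturating form, i.e., minimize -log D(G(z)).
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```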

Compared with the objective of the GAN generator, the optimization targets of supervised learning are more stable. The difference is clear: the optimization target of the GAN generator is a learned discriminator, whereas in supervised models the targets are distance functions with nice geometric properties. The latter usually provide much better-behaved training gradients than the former.

These insights lead us to incorporate a supervised training signal as a regularizer on top of the discriminator target. Assume the generator \(G({\textbf {z}}): Z \rightarrow X\) produces samples by first drawing from a fixed prior distribution in the latent space Z and then applying a deterministic trainable transformation G into the sample space X. Together with G, we jointly train an encoder \(E({\textbf {x}}): X \rightarrow Z\). Letting d be a similarity metric in the data space, we add \({\mathbb {E}}_{{\textbf {x}} \sim p_{\textsc {data}}}[d({\textbf {x}}, G\circ E({\textbf {x}}))]\) as a metric regularizer.

The geometric motivation for this metric regularizer is straightforward. We are trying to match the generated manifold to the real data manifold by geometric distances, in addition to the gradient provided by the discriminator D. Adding an encoder is equivalent to first training a point-to-point mapping \(G(E({\textbf {x}}))\) between the two manifolds and then minimizing the expected distance between points on these manifolds.
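A minimal sketch of this metric regularizer, taking d to be voxel-wise mean squared error and assuming E returns a latent code directly (both assumptions made for illustration):

```python
import torch

def metric_regularizer(E, G, real_x):
    """E_x[ d(x, G(E(x))) ] with d taken to be voxel-wise MSE (an assumption).

    real_x: a batch of occupancy grids, e.g., shape (B, 1, 32, 32, 32).
    E returns a latent code (e.g., the posterior mean); G decodes it back
    to a grid of the same shape as real_x.
    """
    recon = G(E(real_x))
    return torch.mean((recon - real_x) ** 2)
```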

In addition to the metric regularizer, we introduce a prior regularizer, motivated by the following intuition, to further penalize missing modes.

Latent prior intuition In traditional GANs, the optimization target for the generator is the empirical sum \(\sum _i{\nabla \log {D(G({\textbf {z}}_i))}}\). The missing mode problem is caused by the conjunction of two facts: (1) the areas near missing modes are rarely visited by the generator, by definition, thus providing very few examples to improve the generator around those areas [52], and (2) both missing modes and non-missing modes tend to correspond to a high value of D, because the generator is not perfect, so the discriminator can take strong decisions locally and attain a high value of D even near non-missing modes. For most \({\textbf {z}}\), the gradient of the discriminator \(\nabla \log D(G({\textbf {z}}))\) implicitly pushes the density of the generator distribution toward the major mode. Only when \(G({\textbf {z}})\) is very close to a minor mode does the generator get gradients pushing it toward that minor mode. However, such \({\textbf {z}}\) may have low or zero probability under the prior distribution \(p_{\textbf {z}}({\textbf {z}})\). We argue that this problem can be addressed in two ways.

First solution: increase the depth of the generator network. The rationale for extending the generator's depth is that the vanilla 3D-GAN [3], in essence, attempts to learn a mapping from a simplistic prior distribution \(p_{{\textbf {z}}}({\textbf {z}}) \sim {\mathcal {N}}(0, I)\) or \({\mathcal {U}}[-1, 1]\) to the complicated three-dimensional data distribution. Such a mapping requires a deep generator that can decode this single simplistic Gaussian (or uniform) distribution so as to disentangle the diverse underlying modes or factors of variation within the real 3D data and encourage sample diversity. This, in turn, translates into a requirement for large amounts of input data. When real 3D data is limited yet originates from diverse modes, however, increasing the network depth becomes infeasible and, furthermore, leads to overfitting.

Second solution: rather than increasing the generator's depth, we instead enrich the prior distribution \(p_{{\textbf {z}}}({\textbf {z}})\) to strengthen the generator. Although our central idea, using a mixture model for the latent variable, has been suggested in several papers, mostly in the context of variational inference (for instance, GMVAE [10]), in the context of GANs we have an additional consideration: on top of a richer distribution, we would like the prior to exploit information from the real samples. Concretely, if the prior learns to assign separate regions of the latent space to individual datapoints, the generator can decode a hidden representation into its corresponding voxel representation much more easily. Combining this insight with the approach above for penalizing missing modes, we propose a hybrid architecture for the GAN objective in which the prior distribution of the latent vector is a variational mixture of posteriors prior (VampPrior), first introduced by [11] to extend the VAE. The VampPrior is a mixture distribution whose components are variational posteriors conditioned on learnable pseudo-inputs:

$$\begin{aligned} p_{\lambda }({\textbf {z}}) = \frac{1}{K} \sum _{k=1}^{K}{E({\textbf {z}} \mid {\textbf {u}}_k)}, \end{aligned}$$
(2)

where K is the number of components and \({\textbf {u}}_k\) is a parameterized vector referred to as a pseudo-input, which is learned through backpropagation and can be thought of as a hyperparameter of the prior, alongside the parameters \(\phi \) of the posterior, \(\lambda = \{{\textbf {u}}_1, {\textbf {u}}_2, \ldots , {\textbf {u}}_K, \phi \}\).

Importantly, the VampPrior is multimodal. It makes the prior of the GAN more expressive and thereby prevents over-regularization. By incorporating pseudo-inputs with \(K \ll N\), it avoids potential overfitting, which makes the model cost-effective to train. More specifically, the prior and posterior are coupled in the VampPrior, which implicitly injects training data information into the generator network, and the learnable pseudo-inputs automatically tune the prior to best suit the data distribution.
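As a concrete illustration of Eq. 2, the sketch below builds the VampPrior as an equal-weight mixture of the encoder's Gaussian posteriors evaluated at K learnable pseudo-inputs. The encoder interface (returning a mean and a log-variance), the `pseudo_inputs` module and all shapes are assumptions made for illustration, not the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn

class VampPrior(nn.Module):
    """p_lambda(z) = (1/K) sum_k q_phi(z | u_k)  (Eq. 2), sketched for a
    Gaussian encoder q_phi(z|x) = N(mu(x), diag(exp(logvar(x))))."""

    def __init__(self, encoder, pseudo_inputs):
        super().__init__()
        self.encoder = encoder              # shared with the VAE/GAN encoder
        self.pseudo_inputs = pseudo_inputs  # module returning the K pseudo-inputs

    def sample(self, n):
        """Draw n latents: pick a component k uniformly, then z ~ q_phi(z|u_k)."""
        u = self.pseudo_inputs()                        # (K, *input_shape)
        mu, logvar = self.encoder(u)                    # each (K, d_z)
        k = torch.randint(0, mu.size(0), (n,))
        eps = torch.randn(n, mu.size(1))
        return mu[k] + eps * torch.exp(0.5 * logvar[k])

    def log_prob(self, z):
        """log p_lambda(z) via log-mean-exp over the K Gaussian components."""
        u = self.pseudo_inputs()
        mu, logvar = self.encoder(u)                    # (K, d_z)
        z = z.unsqueeze(1)                              # (N, 1, d_z)
        log2pi = math.log(2 * math.pi)
        log_comp = -0.5 * ((z - mu) ** 2 / logvar.exp()
                           + logvar + log2pi).sum(-1)   # (N, K)
        return torch.logsumexp(log_comp, dim=1) - math.log(mu.size(0))
```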

Next, we describe this intuition more precisely.

Fig. 2 Flowchart of VampPrior-3DGAN for shape generation, completion and reconstruction

3.1 VampPrior-3DGAN

In this section, we train a voxel-based VAE jointly with the GAN model. The model architecture is presented in Fig. 2.

Learned pseudo-inputs The previous observation suggests preferring an expressive prior, so that the generator can easily decode a hidden representation onto a voxel grid. In other words, the encoder should be trained to have large variance; to achieve this, the VampPrior should be attracted by dissimilar pseudo-inputs that occupy separate regions of the latent space. Within this framework, we use real samples as the weights of a small network: a random noise vector \({\textbf {n}}\) with the same dimension as \({\textbf {x}}\) is added to \({\textbf {x}}\) element-wise, and backpropagation then tunes these weights so as to learn the pseudo-inputs. The input of this network is an identity matrix of order K, referred to as the idle-input. The schematic representation is shown in Fig. 3.
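One possible reading of this construction, sketched below as an assumption rather than the exact implementation: a single linear layer maps the K-by-K identity idle-input to K pseudo-inputs, its weight matrix is initialized with K real training samples perturbed by element-wise noise, and backpropagation then refines these rows.

```python
import torch
import torch.nn as nn

class PseudoInputs(nn.Module):
    """Generate K learnable pseudo-inputs u_1..u_K from a fixed identity
    'idle-input' (Fig. 3). Illustrative sketch; sizes and the activation
    are assumptions."""

    def __init__(self, K, input_dim, real_init=None, noise_std=0.01):
        super().__init__()
        self.register_buffer("idle_input", torch.eye(K))     # (K, K), fixed
        self.linear = nn.Linear(K, input_dim, bias=False)
        if real_init is not None:
            # Initialize the weights with K flattened real samples plus
            # element-wise noise, so u_k starts near a real training shape.
            init = real_init + noise_std * torch.randn_like(real_init)  # (K, input_dim)
            with torch.no_grad():
                self.linear.weight.copy_(init.t())            # weight: (input_dim, K)
        self.act = nn.Hardtanh(0.0, 1.0)                      # keep occupancies in [0, 1]

    def forward(self):
        # Feeding the identity selects one weight column per row,
        # so each output row is one learnable pseudo-input u_k.
        return self.act(self.linear(self.idle_input))         # (K, input_dim)
```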

Fig. 3 Multilayer perceptron (MLP) for pseudo-inputs learning

Loss Given a batch of training samples, we first pass them through the encoder E and then reparameterize the encoder output to provide the input of the generator G. After jointly training the encoder and generator, we dynamically sample from Eq. 2 in the latent space of the encoder E. Sampling from this dynamic and far more powerful mixture prior in the space Z, the generator G can produce more diverse samples, which are then judged by the discriminator D. We update the generator G and the discriminator D alternately. The encoder and generator update their parameters by minimizing the following losses:

$$\begin{aligned}&{\mathcal {L}}_\textrm{gan}^{g} = {\mathbb {E}}_{{\textbf {x}}\sim {p_{\textsc {data}}}}{[\alpha _1{d({\textbf {x}}, G \circ E({\textbf {x}}))} + \alpha _2{\log {D(G \circ E({\textbf {x}}))}}]} \nonumber \\&\qquad \qquad - {\mathbb {E}}_{{\textbf {z}}\sim {p_{\lambda }({\textbf {z}})}}{[\log {D(G({\textbf {z}}))}]},\end{aligned}$$
(3)
$$\begin{aligned}&{\mathcal {L}}_\textrm{vae}^{e} ~=~ {\mathbb {E}}_{{\textbf {x}} \sim p_{\textsc {data}}}[\alpha _1{d({\textbf {x}}, G \circ E({\textbf {x}}))} ] ~ \nonumber \\&\qquad \qquad + \alpha _3 KL(p_{\lambda }({\textbf {z}}) \Vert E({\textbf {z}} \mid {\textbf {x}})), \end{aligned}$$
(4)

where \({\mathbb {E}}_{{\textbf {x}}\sim {p_{\textsc {data}}}}{[{\log {D(G \circ E({\textbf {x}}))}}]}\) is the mode regularizer to encourage \(G \circ E({\textbf {x}})\) to move toward a nearby mode of the data generating distribution and \(KL(\cdot )\) is the KL divergence. \(\alpha _1\), \(\alpha _2\) and \(\alpha _3\) are the trade-off parameters controlling the fidelity and diversity of the fake samples. In this way, we can achieve fair probability mass distribution across different modes. The discriminator updates its parameters by minimizing the following loss:

$$\begin{aligned} {\mathcal {L}}_\textrm{gan}^{d}&= -{\mathbb {E}}_{{\textbf {x}} \sim p_{\textsc {data}}}[\log (D({\textbf {x}})) + \log (1 - D(G \circ E({\textbf {x}})))] \nonumber \\&\quad - {\mathbb {E}}_{{\textbf {z}} \sim p_{\lambda }({\textbf {z}})}[\log (1 - D(G({\textbf {z}})))], \end{aligned}$$
(5)

where \(D(\cdot )\) is the probability of the input being a real volumetric shape and \(1 - D(\cdot )\) that of a synthetic one. The second term denotes the probability of the input being a synthetic shape generated by the encoder and generator networks, while the third term denotes that of the input being a synthetic shape generated directly from the VampPrior distribution. We implement D as a convolutional neural network whose last layer outputs the probability of the sample being a synthetic shape. For training this network, each mini-batch comprises synthetic shapes \(\tilde{{\textbf {x}}}\) randomly generated from the VampPrior distribution \(p_{\lambda }({\textbf {z}})\) and real shapes \({\textbf {x}}\). The target labels for the cross-entropy loss layer are 0 for every \({\textbf {x}}_j\) and 1 for every \(\tilde{{\textbf {x}}}_{i}\). The parameters of the discriminator D are then updated for each mini-batch by taking a stochastic gradient descent (SGD) step on the mini-batch loss gradient.
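For concreteness, a minimal training-step sketch is given below. It follows the descriptive intent of Eqs. 3–5 (the metric and mode regularizers are minimized together with the adversarial and KL terms) rather than reproducing the equations' exact sign conventions; the choice of MSE for d, the trade-off values, the assumption that D outputs probabilities, and the helpers `vamp_prior.sample` and `vamp_prior.kl` are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def train_step(E, G, D, vamp_prior, real_x, opt_eg, opt_d,
               a1=1.0, a2=0.1, a3=0.1):
    """One alternating update sketching Eqs. 3-5 (illustrative only).
    E returns (mu, logvar); D returns a probability; vamp_prior.sample and
    vamp_prior.kl are hypothetical helpers for the prior of Eq. 2."""
    b = real_x.size(0)

    # ----- encoder/generator update (Eqs. 3-4) -----
    mu, logvar = E(real_x)
    z_post = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
    recon = G(z_post)
    fake = G(vamp_prior.sample(b))

    metric_reg = F.mse_loss(recon, real_x)              # d(x, G(E(x)))
    mode_reg = -torch.log(D(recon) + 1e-8).mean()       # pull G(E(x)) toward a data mode
    adv_g = -torch.log(D(fake) + 1e-8).mean()           # fool D on prior samples
    kl = vamp_prior.kl(mu, logvar)                      # KL term of Eq. 4 (hypothetical helper)
    eg_loss = a1 * metric_reg + a2 * mode_reg + adv_g + a3 * kl
    opt_eg.zero_grad(); eg_loss.backward(); opt_eg.step()

    # ----- discriminator update (Eq. 5) -----
    d_loss = (-torch.log(D(real_x) + 1e-8)
              - torch.log(1 - D(recon.detach()) + 1e-8)
              - torch.log(1 - D(fake.detach()) + 1e-8)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    return eg_loss.item(), d_loss.item()
```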

3.2 Shape reconstruction

We tackle the problem of reconstructing objects from RGB images with a novel training methodology. The reconstruction model uses the same architecture as shape generation (see Fig. 2 and Table 1); in contrast to shape generation, the input here is an image. The encoder converts an image into a 200-dimensional vector of means and variances. We then dynamically fit a variational mixture of posteriors model to the latent space of the encoder network to produce the noise vector. The latent vector is passed through the generator network to produce a reconstructed object, which is then given to the discriminator to judge its validity.

The same generator and discriminator networks as used for shape generation (VampPrior-3DGAN) are reused in this system, and the encoder network is a simple five-layer convolutional neural network. During training, the discriminator and encoder networks are updated at every batch while the generator learns only every two batches. This schedule is key to integrating the two systems: if the encoder is not trained alongside the discriminator at every iteration, the system does not converge. This makes sense, since both networks should learn similar features about the objects being created at approximately the same rate.
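A minimal sketch of such an image encoder is given below, assuming the 200-dimensional latent code stated above; the channel counts, kernel sizes and normalization choices are illustrative (they follow the BN/Leaky-ReLU/stride-2 convention of Table 1, not its exact layer specification).

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Five-layer convolutional encoder mapping an RGB image to a 200-D
    Gaussian posterior (mean and log-variance). Channel counts, kernel
    sizes and the input resolution are illustrative assumptions."""

    def __init__(self, z_dim=200):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.BatchNorm2d(32), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 2 * z_dim, 5, stride=2, padding=2), nn.AdaptiveAvgPool2d(1),
        )
        self.z_dim = z_dim

    def forward(self, img):                      # img: (B, 3, H, W)
        h = self.conv(img).flatten(1)            # (B, 2 * z_dim)
        mu, logvar = h[:, :self.z_dim], h[:, self.z_dim:]
        return mu, logvar
```

During training, this encoder would be stepped together with the discriminator at every batch, while the generator updates every second batch, as described above.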

3.3 Partial shape completion

We also address voxel shape completion from sparse point clouds in this paper. This problem arises when only a single view of an object is available, or when large parts of the object are occluded, as in robotic applications. To make informed decisions (e.g., for path planning and navigation), it is of utmost importance to efficiently build a representation of the environment that is as complete as possible. We reconstruct an object's complete 3D shape and volume when presented with only a part of the object from a single perspective. We tackle this problem to show the generative power of our system, to highlight that our model is applicable to realistic robotic problems, and to demonstrate that it easily extends to reproducing 3D shapes from multiple input types.

Table 1 Details of model architectures used in the experiments. The models were trained using Adam [54] optimizers. BN: Batch normalization, LR: Leaky ReLU, s2: stride 2

4 Experiments

In this section, we verify that our architecture performs on par with or better than state-of-the-art generative frameworks. To assess the quality of the proposed neural generative model for 3D shapes, we conduct several extensive experiments. In Sect. 4.2, we investigate the model's ability to generate diverse samples. In Sect. 4.3, we test its ability to reconstruct real-world images, comparing our results to 3D-R2N2 [37] and NRSfM [55]. Finally, we demonstrate shape completion from a single-perspective scan of a depth sensor.

4.1 Datasets

  • ModelNet  There are two variants of the ModelNet dataset, ModelNet10 and ModelNet40, introduced in [2], with 10 and 40 target classes, respectively. ModelNet10 contains 3D shapes that are pre-aligned with the same pose across all categories. In contrast, ModelNet40 (which includes the shapes found in ModelNet10) features a variety of poses. To assess the ability of our model to handle 3D shapes of great variety and complexity, we augment each class of ModelNet10 with up to 12 rotations while avoiding the risk of overfitting. For the shape completion task, we construct a synthetic dataset based on ModelNet, taking 15 random perspectives for each object in the ModelNet10 dataset. A test set of entirely unseen objects was held back for evaluation, examples of which can be observed in the first rows of Figs. 12 and 13.

  • PASCAL 3D  The PASCAL 3D dataset is composed of images from the PASCAL VOC 2012 dataset [56], augmented with 3D annotations from PASCAL3D+ [57]. PASCAL3D+ images exhibit much more variability than current 3D datasets, with on average more than 3,000 object instances per category. We voxelize the 3D CAD models at a resolution of \(32\times 32\times 32\) and use the same training and testing splits as NRSfM [55], which is also used for real-world image reconstruction. Note that the only pre-processing applied was image cropping and padding with 0-intensity pixels to create final samples of resolution \(100 \times 100\).

  • Synthetic Dataset  A new synthetic dataset of images and 3D models was created solely to train and validate our models for 3D object reconstruction from a single RGB image. The models were obtained from the online ShapeNet repository [1]. The dataset consists of six object classes, rendered in front of background images from the SUN dataset [58] and overlaid with random textures from the Describable Textures Dataset [59]. Each RGB image is accompanied by its ground-truth 3D model from the ShapeNet repository at a \(64\times 64\times 64\) voxel resolution. To test our models, we also collected the IKEA dataset from the Google 3D Warehouse, which consists of 800 images rendering a large collection of objects and their corresponding object models. These objects fall into six categories, namely beds, bookcases, chairs, desks, sofas and tables, and are evaluated at a resolution of \(64\times 64\times 64\). The dataset is a strong evaluation tool for heavily occluded images in realistic scenes, with only the constraint that the object is centered within the image.

Fig. 4 Shape generation results by our VampPrior-3DGAN model on ModelNet40. The picture is best viewed in color on screen

Fig. 5 Visualization comparison of diversity between 3D-GAN (left) and VampPrior-3DGAN (right). The picture is best viewed in color

Table 2 Comparing sample diversity by the augmented inception-score values for baseline 3D-GAN and VampPrior-3DGAN across the 5 categories of ModelNet40 dataset

4.2 Evaluating shape generation and learning

To examine our model's ability to generate high-resolution 3D shapes with realistic details, we design tasks involving shape generation and shape interpolation. We add Gaussian noise to the learned latent codes of test data taken from ModelNet and then use our model to generate "unseen" samples that differ from the input voxels.

The proposed VampPrior-3DGAN is able to transition smoothly between two objects. Our shape generation results are illustrated in Fig. 4, and a comparison with previous state-of-the-art results is depicted in Fig. 5. Figure 6 shows the results of our shape interpolation experiment, from both within-class and across-class perspectives.
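A minimal sketch of the interpolation procedure, assuming the posterior mean returned by the encoder is used as the shape vector (an assumption for illustration):

```python
import torch

def interpolate_shapes(E, G, x_a, x_b, steps=8):
    """Linear interpolation between the latent codes of two shapes (Fig. 6).
    E returns (mean, log-variance); the posterior means serve as shape vectors."""
    za, _ = E(x_a.unsqueeze(0))               # (1, d_z)
    zb, _ = E(x_b.unsqueeze(0))               # (1, d_z)
    ts = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    z = (1 - ts) * za + ts * zb               # (steps, d_z)
    return G(z)                               # a sequence of morphing voxel grids
```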

For our VampPrior-3DGAN framework, the number of pseudo-inputs (Eq. 2), denoted by \(N_{{pseudoInput}}\), is chosen empirically: more complicated data distributions require more pseudo-inputs. A larger \(N_{{pseudoInput}}\) yields relatively higher diversity, but also increases memory requirements. Our experiments indicate that increasing \(N_{{pseudoInput}}\) beyond a certain point has little to no effect on model capacity, since the VampPrior components tend to 'crowd' and become redundant. We use an \(N_{{pseudoInput}}\) between 50 and 100 in our experiments.

To quantitatively characterize the diversity of the generated voxel samples, we design an augmented version of the inception score, a measure that has been found to correlate well with human evaluation, and use it across our experiments in place of human annotators. We describe this score next.

Fig. 6 Continuous morphing of output shapes achieved by linear interpolation of shape vectors. The picture is best viewed in color

Fig. 7 A graph depicting the discrimination (left), generation (mid) and reconstruction loss (right) at each iteration, while training the VampPrior-3DGAN system on the ModelNet10 bed dataset

Fig. 8 Comparison of reconstruction loss on the ModelNet10 chair dataset with different priors

Augmented inception score The inception score is widely regarded as a good assessment of sample quality:

$$\begin{aligned} \exp ({\mathbb {E}}_x [KL(p(y \mid x) \Vert p(y))]), \end{aligned}$$
(6)

where x denotes one sample, \(p(y \mid x)\) is the softmax output of a trained classifier over the labels, and p(y) is the overall label distribution of the generated samples. The intuition behind this score is that a strong classifier usually has high confidence for good samples. However, it is also desirable to have diversity within the voxel samples of a particular category. To characterize this diversity, we use a cross-entropy style score \(-p(y \mid x_i) \log (p(y \mid x_j))\), where the \(x_j\) are samples of the same class as \(x_i\) according to the outputs of the trained inception model. We incorporate this cross-entropy style term into the original inception-score formulation and define the augmented inception score as a KL divergence:

$$\begin{aligned} \exp ({\mathbb {E}}_{x_i} [{\mathbb {E}}_{x_j} [KL(p(y\mid x_i) \Vert p(y \mid x_j))]]). \end{aligned}$$
(7)

Essentially, this augmented inception score can be viewed as a proxy for measuring intra-class sample diversity along with the sample quality. In our experiments, we report the augmented inception score on a per-class basis and a combined score averaged over all classes.
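A minimal sketch of Eq. 7, assuming samples are grouped by the classifier's argmax prediction when forming same-class pairs (the exact pairing scheme is an assumption, not fixed by the text):

```python
import torch

def augmented_inception_score(probs):
    """Eq. 7, evaluated within each predicted class:
    exp( E_{x_i} E_{x_j} [ KL(p(y|x_i) || p(y|x_j)) ] ), averaged over classes.

    probs: (N, C) softmax outputs of the pretrained 3D inception classifier
    on the generated samples.
    """
    eps = 1e-12
    labels = probs.argmax(dim=1)
    scores = []
    for c in labels.unique():
        p = probs[labels == c]                          # (n_c, C)
        logp = torch.log(p + eps)
        # pairwise KL(p_i || p_j) for all i, j assigned to this class
        kl = (p.unsqueeze(1) * (logp.unsqueeze(1) - logp.unsqueeze(0))).sum(-1)
        scores.append(torch.exp(kl.mean()))
    return torch.stack(scores).mean().item()
```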

Fig. 9 Reconstruction samples for PASCAL3D from the separately trained VampPrior-3DGAN

Table 3 Per-category voxel predictive performance on PASCAL 3D, as measured by IoU

For evaluation, we first pretrain a 3D inception network, generalized from the image inception network, on the training set of ModelNet40. The 3D inception network is a four-layer CNN classifier. After pretraining, the last layer of the network is fine-tuned by transfer learning and then used to compute the augmented inception scores for the generated samples. We select five categories: airplane, car, chair, sofa and vase. Note that, during training, we augment the dataset with rotated versions of the voxels. We compare the generated results of 3D-GAN and VampPrior-3DGAN. Figure 5 shows the samples generated by each model. Our samples are visibly better and arise from a more stable training procedure. The samples generated by our framework also exhibit greater diversity, both visibly and as reflected in the augmented inception scores (Table 2).

For the ModelNet10 dataset, we assume that the data-generating distribution can be approximated by 10 dominant modes, where a "mode" is a connected component of the data manifold. We first examined whether our system was able to generate objects from a distribution consisting of just one object, set in 12 different orientations, from the ModelNet10 dataset, with \(N_{pseudoInput}\) set to 50. This task was clearly successful, producing sufficiently varied objects of high quality, as shown in Fig. 4. The quality of the objects can be tracked via the reconstruction and generation losses, shown in Fig. 7. Figure 8 compares the reconstruction loss curve with that of 3D-GAN: our VampPrior-3DGAN converges faster and its reconstruction error is at least 0.04 lower. Moreover, to isolate the effectiveness of the introduced prior, we further replace the VampPrior with a mixture of Gaussians (dubbed GM-3DGAN); the figure shows the difference in reconstruction quality due to the two priors, with the VampPrior on average 0.02 lower than the GM prior.

Table 4 Average precision scores on the IKEA dataset
Table 5 Comparison of computational efficiency, model size and IoU for single-view 3D reconstruction on the ShapeNet testing set. FIT: Forward inference time
Fig. 10 IoU metric on the ShapeNet subset

Fig. 11 Sample reconstruction results from a single image using the VampPrior-3DGAN model, from a distribution consisting of the chair class of the ShapeNet Core dataset. The 1st row shows the RGB input; the 2nd row shows the reconstruction of our method. The picture is best viewed in color on screen

4.3 Evaluating shape reconstruction from single image

Another application of the proposed VampPrior-3DGAN is single-image shape reconstruction. This is a challenging problem, forcing our model to deal with real-world images under a variety of lighting conditions and resolutions. Furthermore, there are many instances of model occlusion as well as different color gradings.

Metrics We use two metrics to evaluate the performance of 3D reconstruction. The first metric is voxel Intersection-over-Union (IoU) between a predicted voxel grid and its ground truth. It is formally defined as follows:

$$\begin{aligned} \text {IoU} = \frac{\sum _{ijk}\left[ I(y'_{ijk}> p) * I(y_{ijk})\right] }{\sum _{ijk}\left[ I \left( I(y'_{ijk}>p) + I(y_{ijk}) \right) \right] }, \end{aligned}$$
(8)

where \(I(\cdot )\) is an indicator function, (i, j, k) is the three-dimensional voxel index, \(y'_{ijk}\) is the predicted value at voxel (i, j, k), \(y_{ijk}\) is the corresponding ground-truth value, and p is the threshold for voxelization. The higher the IoU, the better the reconstruction of a 3D model. We also report the average precision as a secondary metric; higher values indicate higher-confidence reconstructions.
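A minimal sketch of Eq. 8, with the voxelization threshold p set to the usual 0.5 (an assumption, not fixed by the text):

```python
import torch

def voxel_iou(pred, gt, threshold=0.5):
    """Voxel IoU of Eq. 8. `pred` holds predicted occupancy probabilities,
    `gt` holds binary ground-truth occupancies."""
    pred_bin = pred > threshold
    gt_bin = gt.bool()
    intersection = (pred_bin & gt_bin).sum().float()
    union = (pred_bin | gt_bin).sum().float()
    return (intersection / union).item()
```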

Fig. 12 First row: sample synthetic single-perspective Kinect scans created by the authors, produced from the chair class of the ModelNet10 dataset. Second row: the corresponding shape completion result of our VampPrior-3DGAN framework. Third row: the corresponding shape completion result of 3D-GAN. Final row: the corresponding ground-truth volumetric grid. The picture is best viewed in color on screen

Fig. 13 First row: sample synthetic single-perspective Kinect scans created by the authors, produced from the bed class of the ModelNet10 dataset. Second row: the corresponding shape completion result of our VampPrior-3DGAN framework. Third row: the corresponding ground-truth volumetric grid. The picture is best viewed in color on screen

To test our model on this application, we use the PASCAL3D dataset and utilize the same exact training and testing splits from [55]. We compare our results with those reported for recent approaches, including the NRSfM [55] and 3D-R2N2 [37] models. Note that these also used the exact same experimental configurations as we did.

Fig. 14 Failure cases in shape completion on the bed category of the ModelNet10 dataset. The picture is best viewed in color on screen

For this task, we train our model in two different ways: (1) jointly on all categories, and (2) separately on each category. In Fig. 9, we observe better reconstructions from the separately trained VampPrior-3DGAN when compared to previous work. Unlike the NRSfM, our model does not require any segmentation, pose information or keypoints. In addition, our model is trained from scratch while the 3D-R2N2 is pretrained using the ShapeNet dataset. However, the jointly trained VampPrior-3DGAN did not outperform the 3D-R2N2, which is also jointly trained. The performance gap is due to the fact that the 3D-R2N2 is specifically designed for image reconstruction and employs a residual network to help the model learn richer semantic features.

Quantitatively, we compare our model to the NRSfM and two versions of 3D-R2N2, one with an LSTM structure and another with a deep residual network. The IoU results are shown in Table 3. Observe that our jointly trained model performs comparably to the 3D-R2N2 LSTM variant while the separately trained version surpasses the 3D-R2N2 ResNet structure in 8 out of 10 categories, half of them by a wide margin.

In this experiment, we trained networks with our method on the task of single-image 3D reconstruction. This task was also performed by the 3D-GAN system [3] and therefore provides a fair, quantitative basis for comparing the two methods. The results evaluated on the IKEA dataset are shown in Table 4. Our system consistently outperforms the original 3D-GAN system and several other previous approaches, with a mean average precision of 62.1% across all classes. Figure 10 shows the IoU results compared with state-of-the-art methods on the ShapeNet subset, and Fig. 11 illustrates example reconstructions produced by our method.

Table 5 lists the number of parameters, model size, forward inference time, training time and IoU of the different methods. Although our model underperforms Pix2Vox by about 1.2 in the IoU metric, it uses about 30% fewer parameters than 3D-R2N2. For a fair comparison, the running times are obtained on the same PC with an NVIDIA GTX 1080 Ti GPU. In single-view reconstruction, our method is about nine times faster in forward inference than 3D-R2N2.

4.4 Shape completion

The task of completing the 3D shape of objects from a single-perspective scan of a depth sensor is handled by a voxel-encoded variant of our VampPrior-3DGAN system. Two models were trained: the first on chair and bed objects and the second on all objects in the ModelNet10 dataset. The experiments are clearly quite successful, and examples of recovered objects from the test set can be viewed in the second rows of Figs. 12 and 13. In addition, Fig. 14 shows several failure cases on the bed category of ModelNet10; these mostly occur when the sample is far from the data distribution and the shape itself is very complex.

5 Conclusion and future work

In this paper, we introduced a novel GAN-based deep generative model, VampPrior-3DGAN, with a powerful prior in the form of a mixture of variational posteriors. Our model succeeds in shape generation from complex multimodal distributions involving multiple distinct classes, and we demonstrate that it can learn distributions spanning multiple object classes in multiple orientations. In addition, we show how the method extends smoothly to single-image reconstruction without altering the network architecture, and demonstrate its generative power by recovering three-dimensional objects from a single image under varying lighting, background and texture, achieving state-of-the-art performance on the synthetic dataset. Finally, we again show the system's generative power by successfully applying it to object completion from a single-perspective scan. While we have focused on dense regular-grid-based shape generation and binary occupancy maps, it is straightforward to extend the framework to scene generation and reconstruction from scene images or depth maps.