
1 Introduction

While humans live and navigate in the 3D world, they reason about it semantically. Given only the class of an object, a human can easily imagine its 3D shape. An object’s class, depth, and shape are closely related, and a deep model should reason explicitly about all three to truly understand a 3D scene.

Fig. 1. Image-to-semantic voxel model translation using our SSZ model. Input color image (left), 2D-to-3D contour alignment (center), semantic voxel model output (right).

There has been exciting recent progress in single-image 3D object reconstruction  [1,2,3,4]. While modern models can reconstruct the human body  [5] or arbitrary objects  [3] from a single view, they usually focus on predicting a single instance of a single object class. Recently proposed multilayer depth maps  [6] take a step towards 3D reconstruction of the whole scene, but they do not provide semantic labeling of the 3D scene. On the other hand, 3D scene semantic segmentation models  [7] require a 3D model as input.

In this paper, we propose a Single Shot Z-space segmentation and 3D reconstruction model (SSZ) for single image-to-semantic voxel model translation. Different from modern baselines, our SSZ model performs joint 3D voxel model reconstruction and 3D scene semantic segmentation from a single image. Moreover, a modern architecture based on volumetric residual blocks allows our SSZ model to provide near-real-time performance at inference.

We hypothesize that semantic labeling of 3D object classes can aid a deep model in learning to reason explicitly about 3D scene structure. To this end, we propose a multiclass semantic voxel model that represents the whole 3D scene visible to the camera. In our semantic voxel model, each voxel holds the ID of its class. Moreover, we leverage trapezium-shaped voxels to keep each voxel aligned with its corresponding pixel (see Fig. 1). Such a 3D representation allows us to design direct 2D-to-3D skip connections that leverage contour correspondences between an image and a 3D model. We use the assumptions of Ronneberger et al.  [8] and Sandler et al.  [9] as a starting point to incorporate a U-net-like generator with inverted residual blocks and skip connections into our framework.

Generative modeling  [10] of 3D shapes has demonstrated promising progress recently  [11]. Inspired by adversarial learning of 3D shapes, we incorporate a 3D pose discriminator into our framework. Specifically, we simultaneously train two models: an SSZ generator and an adversarial Pose6DoF discriminator (see Fig. 2). The aim of our Pose6DoF discriminator is twofold. Firstly, it estimates the poses of all object instances in the SSZ generator’s output. Secondly, it qualifies each object instance as either ‘real’ or ‘fake.’ The aim of our SSZ generator is to fool the Pose6DoF discriminator by producing a realistic and geometrically accurate semantic voxel model.

We collected a large SemanticVoxels dataset to train and evaluate our model and baselines. Our SemanticVoxels dataset includes 116k color images and pixel-level aligned semantic voxel models of nine object classes: person, car, truck, van, bus, building, tree, bicycle, ground.

Experiments on our SemanticVoxels dataset and various public benchmarks demonstrate that our SSZ model achieves the state of the art in single-image 3D scene reconstruction. We show quantitative and qualitative results demonstrating our SSZ model’s ability to reconstruct a detailed voxel model of the whole scene from a single image. Moreover, our SSZ model produces both a high-resolution 3D model and a multiclass 3D semantic segmentation from a single image.

The developed model can estimate the shape, pose, and class of all objects in the scene in applications such as autonomous driving, robotics, and single-photo 3D scene reconstruction.

We present four key technical contributions: (1) an SSZ generator architecture for single-shot 3D scene reconstruction and segmentation from a single image with 2D-to-3D skip connections and volumetric inverted residual blocks, (2) a generative-adversarial framework for training a volumetric generator against a 6DoF pose reasoning discriminator, (3) a large SemanticVoxels dataset with 116k samples, where each sample includes a color image, a view-centered semantic voxel model, a depth map, and pose annotations for nine object classes: person, car, truck, van, bus, building, tree, bicycle, ground, and (4) an evaluation of our SSZ model and state-of-the-art baselines on ShapeNet and our SemanticVoxels dataset.

Fig. 2. SSZ framework.

2 Related Work

Single-Photo 3D Reconstruction. Deep networks for generating 3D models from a single photo fall into two groups: object-centered models  [12] and view-centered models  [2, 3, 6, 13]. Object-centered models  [12] reconstruct an object’s 3D model in the same coordinate system for any camera pose with respect to the object. While the object-centered setting is generally easier in terms of data collection and model structure, most object-centered models fail to generalize to new object classes. The main reason for this is the absence of explicit reasoning about connections between the object’s shape in the image and the reconstructed 3D shape.

View-centered models  [1, 3, 13,14,15] overcome this problem using paired datasets. Such datasets include a separate 3D model in the camera coordinate system for each image. Collecting view-centered 3D shape datasets is challenging, as the camera pose must be recovered for each image. Still, explicit coding of the camera pose in the dataset allows a model to learn complicated 2D-to-3D reconstruction techniques. Hence, view-centered models are generally more robust to new object classes and backgrounds  [13].

Multi-view models  [13, 14, 16,17,18] leverage multiple images of a single object to improve 3D reconstruction accuracy. Related to our semantic frustum voxel models are projective convolutional networks (PCN)  [14] that use view-centered frame projection for 3D model reconstruction and segmentation from multiple images. Unlike PCN, our SSZ model uses a view-centered frame at training time. Closely related to our Pose6DoF discriminator is the geometric adversarial loss (GAL)  [19] focused on the consistency of reconstructed 3D shapes. Unlike GAL, our pose adversarial loss function is designed for multiple objects and focused on the scene structure.

3D Model Representations. While images are commonly represented as multichannel 2D tensors to train deep models, volumetric 3D shapes are more challenging to incorporate into a deep learning pipeline. Therefore, 3D reconstruction deep models can be divided into groups by the 3D model representation they use.

Voxel Models divide the object space into equal volume elements that encode the probability p of space being either empty or occupied by an object. While voxel models are the most straightforward data representation for volumetric convolutional neural networks  [12, 20,21,22,23,24,25,26,27,28,29,30,31], they consume large amounts of GPU memory. Hence, the resolution of most modern methods is limited to 128\(\times \)128\(\times \)128 voxels. Matryoshka networks  [32] overcome this problem by leveraging a memory-efficient shape encoding, which recursively decomposes a 3D shape into nested shape layers. Leveraging semantic annotations to improve 3D reconstruction accuracy has demonstrated promising results recently  [33].

Depth Map estimation methods  [6, 34,35,36,37,38] are closely connected to 3D model reconstruction. Still, only the visible surface of the object is reconstructed in such methods. Closely related to our SSZ model is the property of depth maps to preserve contour correspondence between the input image and the reconstructed depth map. This correspondence allows the use of skip connections between generator layers  [8, 39] to increase model resolution and robustness to new object classes.

Deformable Meshes allow the use of polygonal models for network training  [40,41,42,43,44,45,46,47,48]. While this representation consumes less GPU memory than voxel models, it is best suited for symmetric, smooth objects such as hair  [42] or human faces  [35, 49,50,51,52,53]. The semantic description of the scene at the object level  [54] is related to the multiclass semantic voxel models in our SSZ model. Similar to our semantic voxel model is 3D-RCNN  [55] for instance-level 3D object reconstruction. Unlike 3D-RCNN, our SSZ is a single-shot detector.

Frustum Voxel Models  [56,57,58] are similar to voxel models but utilize a view-oriented projection similar to depth maps. Being designed specifically for single-photo 3D reconstruction, frustum voxel models (fruxel models) can significantly improve performance for a generator with skip connections. In this paper, we extend the fruxel model 3D representation to multiclass 3D scene reconstruction. We train our generator to produce tensors of \(n \times w \times h \times d\) elements, where n is the number of classes and w, h, d are the numbers of elements along the width, height, and depth of a fruxel model.

3 Method

Our goal is training an SSZ generator \(G: (\textit{\textbf{A}}) \rightarrow \textit{\textbf{B}}\) translating an input image \(\textit{\textbf{A}}\) into a multiclass frustum voxel model of the scene \(\textit{\textbf{F}}\). Specifically, for an input image \(\textit{\textbf{A}} \in \mathbb {R}^{w \times h \times 3}\) our model predicts a probability tensor \(\textit{\textbf{B}} \in [0,1]^{n \times w \times h \times d}\), where n is the number of classes. Each element of \(\textit{\textbf{B}}\) represents the probability p(x, y, z) of the point with coordinates (x, y, z) belonging to object class i. We find the resulting fruxel model \(\textit{\textbf{F}} \in \{0, 1, \ldots , n-1\}^{w \times h \times d}\) as the per-voxel argmax of the probability map \(\textit{\textbf{B}}\)

$$\begin{aligned} \textit{\textbf{F}}(x,y,z) = \mathop {\mathrm {arg\,max}}\limits _{i \in \{0, 1, \ldots , n-1\}} \textit{\textbf{B}}(i,x,y,z). \end{aligned}$$
(1)
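The conversion from \(\textit{\textbf{B}}\) to \(\textit{\textbf{F}}\) in Eq. (1) is a per-voxel argmax over the class dimension. The following PyTorch snippet is a minimal sketch of this step (not the authors' code); the tensor sizes and the softmax used to build a dummy probability tensor are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): collapse the class-probability tensor B
# of shape (n, w, h, d) into a fruxel label volume F of shape (w, h, d), as in Eq. (1).
import torch

def probabilities_to_fruxels(B: torch.Tensor) -> torch.Tensor:
    """B: (n, w, h, d) class probabilities -> F: (w, h, d) class indices."""
    return B.argmax(dim=0)

B = torch.softmax(torch.randn(10, 64, 64, 64), dim=0)  # e.g. 9 object classes + 'air'
F = probabilities_to_fruxels(B)                        # integer labels in {0, ..., 9}
```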

Inspired by generative models for 3D reconstruction, we train two models simultaneously: a generator network G and an adversarial discriminator D (see Fig. 2). The aim of our Pose6DoF discriminator \(D: (\textit{\textbf{A}},\textit{\textbf{F}}) \rightarrow \textit{\textbf{C}}\) is to predict a certificate \(\textit{\textbf{C}} \in \{t,q,r\}^{u,v,w}\), where u, v, w are the dimensions of the discriminator output, \(t \in R^3\) is the object translation in the view-centered coordinate frame, \(q \in R^4\) is the object rotation quaternion, and \(r \in [0, 1]\) is the probability of the object being ‘real’ or ‘fake’. The certificate \(\textit{\textbf{C}}\) describes the poses of object instances in the scene and qualifies them as either ‘real’ or ‘fake.’ The aim of our generator G is to produce a realistic and geometrically accurate semantic voxel model \(\textit{\textbf{F}}\). To this end, the objective of our generator G is to maximize the probability of the discriminator D making a mistake in the certificate \(\textit{\textbf{C}}\) by qualifying a synthesized semantic voxel model \(\hat{\textit{\textbf{F}}}\) as a real sample \(\textit{\textbf{F}}\) from the training dataset. On the other hand, the generator is forced to minimize the error between the ground truth object poses (t, q) and the predicted poses \((\hat{t}, \hat{q})\).

Two loss functions govern the training process of our framework: a negative log-likelihood loss \(\mathcal {L}_{NLL}(\textit{\textbf{B}}, \hat{\textit{\textbf{B}}})\) and a pose adversarial loss \(\mathcal {L}_{adv}(\textit{\textbf{C}}, \hat{\textit{\textbf{C}}})\). Inspired by the efficiency of the negative log-likelihood loss for the task of 2D semantic segmentation  [59], we leverage a similar loss function for our 3D semantic labeling. The aim of our \(\mathcal {L}_{NLL}(\textit{\textbf{B}}, \hat{\textit{\textbf{B}}})\) loss is to maximize the probability p(x, y, z) of a voxel being labeled with the correct object class

$$\begin{aligned} \mathcal {L}_{NLL}(\textit{\textbf{B}}, \hat{\textit{\textbf{B}}}) = \frac{1}{q\cdot w\cdot h \cdot d} \sum _{x=0}^{w}\sum _{y=0}^{h}\sum _{z=0}^{d} \sum _{i = 0}^{n}-k_{i}\cdot \log \Big (\hat{\textit{\textbf{B}}}(f,x,y,z)\Big ), \end{aligned}$$
(2)

where \(k_{i}\) is a scalar weight of object class i, \(q = \sum _{i = 0}^{n}k_{i}\) is the sum of the weights of all classes, \(f = \textit{\textbf{F}}(x,y,z)\) is the index of the correct object class for the point (x, y, z), and \(\sum _{f=1}^{n} \hat{\textit{\textbf{B}}}(f, x, y, z) = 1\). The negative log-likelihood loss introduces a penalty only for voxels where the predicted class does not equal the target class. Hence, under such an objective, the voxels representing the empty space of the scene could be filled with any class without any penalty. To avoid this, we use an additional ‘air’ class that forces the loss function to include empty voxels in the training process.
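As a hedged illustration, a weighted voxel-wise objective in the spirit of Eq. (2) can be expressed with PyTorch's `NLLLoss`, which applies per-class weights and averages by the total applied weight. The class weights, tensor sizes, and the 'air' class index below are assumptions for the sketch rather than the exact training configuration.

```python
# Sketch of a weighted voxel-wise negative log-likelihood, in the spirit of Eq. (2).
import torch
import torch.nn as nn

n_classes = 10                          # 9 object classes + an 'air' class (assumed index 9)
k = torch.ones(n_classes)               # per-class weights k_i (placeholder values)
criterion = nn.NLLLoss(weight=k)        # expects log-probabilities and integer targets

logits = torch.randn(1, n_classes, 32, 32, 32, requires_grad=True)
B_hat = torch.log_softmax(logits, dim=1)             # predicted log-probabilities
F_gt = torch.randint(0, n_classes, (1, 32, 32, 32))  # ground-truth fruxel labels

loss = criterion(B_hat, F_gt)
loss.backward()
```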

We first present our semantic frustum voxel model in Sect. 3.1 and then discuss our SSZ generator in Sect. 3.2. After that, in Sect. 3.3, we introduce our Pose6DoF discriminator that provides the adversarial loss. Finally, we present our SemanticVoxels dataset in Sect. 3.4.

3.1 Semantic Frustum Voxel Model

Unlike the rectangular voxel model, the fruxel model leverages trapezium-shaped voxels. The trapezium of each fruxel lies on the ray that connects a pixel on the sensor matrix and a point on an object (see Fig. 3). Let \(I = \{0, 1, \dots , n - 1\}\) be the set of n classes that the deep model has to predict in the image. Then the semantic voxel model \(F \in \{0, 1, \ldots , n-1\}^{w \times h \times d}\) is a 3D tensor in which each element contains the index \(i \in I\) of the class of an object located in the given fruxel.

In this sense, the fruxel model can be regarded as a multilayer 3D semantic segmentation. Each slice is a boolean intersection of an object and a thin box orthogonal to the camera optical axis located at a given distance. A fruxel model can be described by the following set of parameters \(\{z_n,z_f,d,\alpha \}\), where \(z_n\) is the distance from the camera to the near frustum clipping plane, \(z_f\) is the distance to the far clipping plane, d is the number of slices, and \(\alpha \) is the camera’s horizontal field of view (see Fig. 3).
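To make the parameterization \(\{z_n,z_f,d,\alpha \}\) concrete, the sketch below maps a camera-frame point to fruxel indices. Linear slicing along the optical axis, a square pixel aspect ratio, and the grid sizes are our assumptions; the paper does not prescribe these details.

```python
# Hedged geometric sketch: camera-frame point (X, Y, Z) -> fruxel indices (ix, iy, iz)
# for a frustum parameterized by {z_n, z_f, d, alpha}.
import math

def point_to_fruxel(X, Y, Z, w=128, h=128, d=128, z_n=2.0, z_f=12.0, alpha_deg=40.0):
    if not (z_n <= Z <= z_f):
        return None                                    # point lies outside the frustum
    tan_half = math.tan(math.radians(alpha_deg) / 2.0)
    u = X / (Z * tan_half)                             # horizontal view coordinate in [-1, 1]
    v = Y / (Z * tan_half * h / w)                     # vertical coordinate, square pixels assumed
    if abs(u) > 1.0 or abs(v) > 1.0:
        return None
    ix = int((u + 1.0) / 2.0 * (w - 1))
    iy = int((v + 1.0) / 2.0 * (h - 1))
    iz = int((Z - z_n) / (z_f - z_n) * (d - 1))        # linear slicing along the optical axis
    return ix, iy, iz

print(point_to_fruxel(0.5, -0.2, 5.0))                 # prints the fruxel indices for a sample point
```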

Fig. 3. Frustum voxel model: slice generation by the boolean intersection of a cutting plane with 3D objects (left). A 3D model composed of trapezium-shaped elements (middle). Top view illustrating the fruxel model parameters (right).

3.2 SSZ Generator

A defining feature of image-to-voxel translation problems is that they transform high-resolution 2D features to their 3D counterparts. While such translation can be achieved using hidden embedded representations  [12], explicit feature translation using skip connections improves model generalization ability. We use assumptions made by Ronneberger et al.  [8] and Sandler et al.  [9] as a starting point for our SSZ generator. Namely, we connect the corresponding layers of an encoder and a decoder using skip connections that we term ‘copy-inflate.’

While feature maps in the encoder are 3D tensors \(\textit{\textbf{M}}_{e} \in R^{w \times h \times c}\), their corresponding feature maps in the decoder are 4D tensors \(\textit{\textbf{M}}_{d}\in R^{w\times h \times d \times c}\), where c is the number of channels in a feature map. To match the dimensions, our ‘copy-inflate’ skip connections expand a new depth dimension by copying the 2D slices of each channel in an encoder feature map \(\textit{\textbf{M}}_{e}\) d times. While the ‘copy-inflate’ connection does not add new information to the expanded feature maps \(\textit{\textbf{M}}_d\), the pixel-level contour correspondence between \(\textit{\textbf{M}}_{e}\) and \(\textit{\textbf{M}}_{d}\) allows the model to reason explicitly about relationships between 2D contours and the corresponding 3D shape.
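A minimal sketch of a ‘copy-inflate’ skip connection as we read it from the description above: the 2D encoder feature map is replicated d times along a new depth axis and fused with the volumetric decoder feature map. The channel-first tensor layout and the concatenation-style fusion are assumptions of this sketch.

```python
# 'Copy-inflate' 2D-to-3D skip connection (sketch): replicate (B, C, H, W) features
# d times along a new depth axis to match a volumetric decoder map (B, C, D, H, W).
import torch

def copy_inflate(feat_2d: torch.Tensor, depth: int) -> torch.Tensor:
    return feat_2d.unsqueeze(2).expand(-1, -1, depth, -1, -1)

enc = torch.randn(1, 64, 32, 32)                 # encoder feature map M_e
skip = copy_inflate(enc, depth=32)               # inflated to (1, 64, 32, 32, 32)
dec = torch.randn(1, 64, 32, 32, 32)             # decoder feature map M_d
fused = torch.cat([dec, skip], dim=1)            # U-Net-style concatenation
```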

We build the encoder and decoder of our model using inverted residual blocks  [60, 61]. This stimulates effective gradient propagation through our model. Moreover, the modified inverted residual blocks allow near-real-time inference with the trained model. Each block of the encoder includes inverted residual blocks similar to  [61] and additional pointwise and depthwise convolutions that downscale the feature map.

We use volumetric inverted residual blocks to construct our decoder. Each volumetric inverted residual block includes a volumetric depthwise separable deconvolution layer followed by a Leaky ReLU activation and a pointwise volumetric convolution. We believe that the depthwise separable convolution in our volumetric inverted residual blocks facilitates learning diverse filters for 2D and 3D feature maps. The resulting generator architecture is presented in Fig. 4.
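The block structure described above can be sketched in PyTorch as follows; the expansion ratio, kernel size, negative slope, and identity residual connection are our assumptions and not the exact layer configuration of the SSZ decoder.

```python
# Hedged sketch of a volumetric inverted residual block: depthwise-separable 3D
# deconvolution, LeakyReLU, and a pointwise 3D convolution, wrapped in a residual.
import torch
import torch.nn as nn

class VolumetricInvertedResidual(nn.Module):
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Conv3d(channels, hidden, kernel_size=1)        # pointwise expansion
        self.depthwise = nn.ConvTranspose3d(hidden, hidden, kernel_size=3,
                                            padding=1, groups=hidden)   # depthwise 3D deconvolution
        self.act = nn.LeakyReLU(0.2, inplace=True)
        self.project = nn.Conv3d(hidden, channels, kernel_size=1)       # pointwise projection

    def forward(self, x):
        y = self.act(self.expand(x))
        y = self.act(self.depthwise(y))
        return x + self.project(y)                                      # residual connection

out = VolumetricInvertedResidual(32)(torch.randn(1, 32, 16, 16, 16))    # shape preserved
```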

Fig. 4. SSZ generator.

3.3 Pose6DoF Discriminator

Our Pose6DoF discriminator aims to provide an adversarial loss function focused on the pose accuracy of the objects predicted by our SSZ generator. Different from modern volumetric discriminators  [11], which qualify the input voxel model as either ‘real’ or ‘fake,’ our Pose6DoF discriminator estimates the 6DoF poses of objects in the scene as well as their perceptual realism. Hence, the architecture of our Pose6DoF discriminator fuses a pose estimation model and a discriminator.

We hypothesize that an additional pose term in the adversarial loss improves the accuracy of our SSZ generator in terms of depth estimation. During training, our Pose6DoF discriminator receives either a real fruxel model \(\textit{\textbf{F}}\) from the dataset or a generator output \(\hat{\textit{\textbf{F}}}\). The objective of our Pose6DoF discriminator is twofold. Firstly, it must detect all instances of objects of all classes and predict their 6DoF poses. Secondly, for each instance, it must predict whether the instance is ‘real’ or ‘fake.’

We use a PatchGAN discriminator  [39] as a starting point for our Pose6DoF discriminator. Specifically, our architecture is similar to the encoder part of our SSZ generator with 2D convolutions replaced by volumetric convolutions. Our Pose6DoF is a conditional discriminator \(D: (\textit{\textbf{A}},\textit{\textbf{F}}) \rightarrow \textit{\textbf{C}}\) that receives an image \(\textit{\textbf{A}}\) and fruxel model \(\textit{\textbf{F}}\) concatenated to a single tensor. Given the input \((\textit{\textbf{A}}, \textit{\textbf{F}})\) the model predicts a certificate \(\textit{\textbf{C}} \in \{t,q,r\}^{u,v,w}\). The discriminator output’s structure is inspired by single-shot object detection models  [62].
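As a hedged sketch of the certificate structure, each cell of the \(u \times v \times w\) output grid can carry 3 translation channels, 4 quaternion channels, and 1 realness channel; the 1\(\times \)1\(\times \)1 convolutional head, the quaternion normalization, and the sigmoid on the realness score below are our assumptions.

```python
# Sketch of a certificate head producing C = {t, q, r} over a (u, v, w) grid.
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class CertificateHead(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        self.head = nn.Conv3d(in_channels, 3 + 4 + 1, kernel_size=1)

    def forward(self, feat):                       # feat: (B, C, u, v, w)
        out = self.head(feat)
        t = out[:, 0:3]                            # per-cell translation
        q = Fn.normalize(out[:, 3:7], dim=1)       # per-cell unit quaternion
        r = torch.sigmoid(out[:, 7:8])             # per-cell 'real'/'fake' probability
        return t, q, r

t, q, r = CertificateHead(256)(torch.randn(1, 256, 4, 4, 4))
```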

The aim of our adversarial loss \(\mathcal {L}_{adv}(G, D)\) is twofold. Firstly, it introduces a penalty for incorrect object poses. Secondly, it penalizes unrealistic 3D object instances predicted by G

$$\begin{aligned} \begin{aligned} \mathcal {L}_{adv}(G, D) = \mathbb {E}_{\textit{\textbf{F}}}[\log D(\textit{\textbf{F}})]&+ \mathbb {E}_{\textit{\textbf{A}}}[\log (1-D(G(\textit{\textbf{A}})))] \\&+\,\sum _{j=0}^{m} ||R(\hat{q}_j)\hat{t}_j -R(q_j)t_j||^2, \end{aligned} \end{aligned}$$
(3)

where R(q) is the mapping from the quaternion q to a rotation matrix, and m is the number of object instances. Please see the Supplementary material for details on our Pose6DoF discriminator.
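The pose term of Eq. (3) can be sketched as follows; the (w, x, y, z) quaternion convention and the per-instance tensor shapes are assumptions of this illustration.

```python
# Sketch of the pose term in Eq. (3): sum over instances of ||R(q_hat) t_hat - R(q) t||^2.
import torch

def quat_to_rotmat(q: torch.Tensor) -> torch.Tensor:
    """q: (m, 4) unit quaternions in (w, x, y, z) order -> (m, 3, 3) rotation matrices."""
    w, x, y, z = q.unbind(dim=-1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)

def pose_term(q_hat, t_hat, q_gt, t_gt):
    """All inputs are (m, 4) quaternions or (m, 3) translations for m object instances."""
    diff = torch.bmm(quat_to_rotmat(q_hat), t_hat.unsqueeze(-1)) \
         - torch.bmm(quat_to_rotmat(q_gt), t_gt.unsqueeze(-1))
    return (diff.squeeze(-1) ** 2).sum()
```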

3.4 SemanticVoxels Dataset

Our SemanticVoxels dataset was inspired by the VoxelCity dataset  [56]. It includes 116k samples of 3D and 2D data. Each data sample represents a single camera pose and includes a color image, a semantic frustum voxel model, a depth map, a camera pose, and object pose annotations for all classes. We used 8k images of 10 street scenes from  [56] to increase the diversity of the dataset. The SemanticVoxels dataset makes the following contributions to the VoxelCity dataset: (1) 8k new real images of 20 street scenes, (2) 100k synthetic images of 200 scenes, (3) 116k new semantic voxel annotations for 9 object classes.

We made our dataset consistent with the NuScenes dataset format  [63]. Our dataset is divided into two splits: real and synthetic. The real split was generated using a Structure-from-Motion (SfM) technique similar to  [64, 65]. It contains 16k images. We present additional details on our SemanticVoxels dataset in the Supplementary material. Example scenes from the dataset are shown in Fig. 5.

Fig. 5. Examples of color images with 6D pose annotations and ground truth semantic voxel models from our SemanticVoxels dataset.

Fig. 6. Synthetic data generation using a GAN. Training pix2pixHD to generate realistic color images from edges (left). Generating paired data samples by rendering a non-realistic 3D model \(A_S\), calculating its edges E, and generating a realistic GAN image \(\hat{A}_{S \rightarrow R}\) (right).

Synthetic Data Generation Using GANs. Generating 3D datasets is challenging when paired images and view-centered 3D models are required  [66]. To overcome this problem, we developed a method based on generative modeling. Inspired by recent advances in generating realistic images from object contours  [67,68,69,70], we hypothesize that object edges are very similar for real images and for non-realistic images generated using a 3D model. Therefore, a ground truth color image for a voxel model can be generated from the edges of a 3D model rendered in a non-realistic setup. Our pipeline is presented in Fig. 6. Firstly, we generate a training dataset from random images of objects of the given classes from the COCO dataset  [71]. For each real image A, we generate contours E using the Canny operator  [72]. We then train the pix2pixHD  [67] model on the task of edges-to-image translation.

We generate the dataset samples by creating virtual scenes S containing 3D models of various classes of objects. For each scene, we render a non-realistic image of the scene \(A_S\) and a corresponding frustum voxel model F. We extract the edges E from the image \(A_S\) and generate a realistic color image \(\hat{A}_{S \rightarrow R}\) using the pix2pixHD  [67] model.
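The edge-extraction step can be sketched with OpenCV; the file names and Canny thresholds below are placeholders, and `pix2pixhd_translate` stands in for whatever inference entry point the trained pix2pixHD model exposes.

```python
# Sketch of the edges step: compute Canny edges E from a non-realistic rendering A_S,
# then feed E to the trained pix2pixHD edges-to-image model.
import cv2

A_S = cv2.imread("render_scene_0001.png", cv2.IMREAD_GRAYSCALE)  # non-realistic rendering (placeholder path)
E = cv2.Canny(A_S, threshold1=100, threshold2=200)               # edge map
cv2.imwrite("edges_scene_0001.png", E)

# A_hat = pix2pixhd_translate(E)  # hypothetical call to the trained edges-to-image model
```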

4 Experiments

We evaluate our SSZ model and baselines on our SemanticVoxels dataset, the ShapeNet dataset  [73], and the ScanNet dataset  [74]. We train all models on the train splits of ShapeNet and our SemanticVoxels dataset for the task of outdoor single-photo 3D reconstruction. For the task of 3D semantic scene completion, we use the train and test splits of the ScanNet dataset  [74]. While our SSZ model simultaneously predicts voxel models for N classes of objects, all baselines predict only a single object class from a single photo. Therefore, we perform a per-class accuracy comparison with the baseline models, using the 3D Intersection over Union (IoU) metric. Our experiments are threefold. Firstly, we perform a qualitative evaluation to demonstrate the rich 3D scene model details and multiclass reconstruction provided by our SSZ. Then, we evaluate our model and baselines quantitatively to prove the accuracy of the 3D shape and pose of the reconstructed 3D models. Finally, we demonstrate the necessity of all components in our SSZ model by performing an ablation study.
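For reference, the per-class 3D IoU used throughout the evaluation can be computed as below; this is a generic sketch of the metric, not the authors' evaluation script.

```python
# Per-class 3D Intersection over Union between predicted and ground-truth label volumes.
import torch

def voxel_iou(pred: torch.Tensor, gt: torch.Tensor, class_id: int) -> float:
    """pred, gt: (w, h, d) integer class volumes; returns the IoU for one class."""
    p = pred == class_id
    g = gt == class_id
    union = (p | g).sum().item()
    return float((p & g).sum().item()) / union if union > 0 else float('nan')
```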

4.1 Baselines

We compare our SSZ model to three single-image 3D reconstruction baselines, DISN  [4], Pix2Vox  [3], and 3D-R2N2  [2], and one 3D semantic scene completion baseline, TS3DSC  [75]. The Deep Implicit Surface Network (DISN)  [4] for high-quality single-view 3D reconstruction predicts a detail-rich 3D mesh from a single 2D image and can capture holes in a 3D shape using signed distance fields. Pix2Vox  [3] exploits an encoder-decoder architecture to generate coarse 3D volumes and refine them using a fusion block. 3D-R2N2  [2] utilizes a view-based generator that can tackle the single- or multi-view reconstruction problem. Two Stream 3D Semantic Scene Completion (TS3DSC)  [75] leverages a two-stream model that uses the input depth and color modalities to perform semantic segmentation of indoor scenes. We train DISN, Pix2Vox, 3D-R2N2, and our SSZ model on the train splits of ShapeNet and our SemanticVoxels dataset. We train TS3DSC and our SSZ model on the train split of ScanNet. We test all models on the test splits of ShapeNet, ScanNet, and our SemanticVoxels dataset.

4.2 Training Details

Our SSZ framework was trained on the SemanticVoxels dataset using the PyTorch library  [76]. For training on the ShapeNet dataset, we convert ground truth 3D models to fruxel models with parameters \(\{z_n=3,z_f=10,d=128,\alpha =60^{\circ }\}\). For training on the SemanticVoxels dataset, we use fruxel models with parameters \(\{z_n=2,z_f=12,d=128,\alpha =40^{\circ }\}\). The training was performed using the NVIDIA 2080 RTX GPU and took 82 h for the ShapeNet dataset and 173 h for our SemanticVoxels dataset. For network optimization, we use minibatch SGD with an Adam solver. We set the learning rate to 0.0002 with momentum parameters \(\beta _1 = 0.5\), \(\beta _2 = 0.999\) similar to  [39].
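The optimizer setup described above can be written in PyTorch as follows; the placeholder modules merely stand in for the SSZ generator and the Pose6DoF discriminator.

```python
# Adam with lr = 0.0002 and momentum parameters (0.5, 0.999), as stated above.
import torch
import torch.nn as nn

generator = nn.Conv2d(3, 8, kernel_size=3)       # placeholder for the SSZ generator
discriminator = nn.Conv3d(8, 1, kernel_size=3)   # placeholder for the Pose6DoF discriminator

opt_G = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
```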

4.3 Qualitative Evaluation

We evaluate our model and the baselines qualitatively by reconstructing 3D scenes with multiple objects from single images. None of the compared baselines can perform semantic segmentation of the resulting 3D model. Hence, to perform a fair evaluation, we extract a single class from our resulting fruxel model and compare it to the output of the baselines. Qualitative results for the ShapeNet  [73] dataset are presented in Fig. 7. The Pix2Vox and 3D-R2N2 models are the best competing baselines, demonstrating the correct structure of the 3D shape. While the DISN model attempts to reconstruct the interior structure of the 3D model, its shape differs from the ground-truth model. The voxel model generated by our SSZ framework demonstrates more details and pose correspondence to the input image. The results for our SemanticVoxels dataset are presented in Fig. 8. Unlike the ShapeNet dataset, our SemanticVoxels dataset includes images with multiple objects. During the training stage, we use single-class ground truth 3D models for the baselines. We select the 3D model of the object that occupies the largest area in the image. Only the Pix2Vox model can reconstruct the rough shape of the object. We believe that our ‘copy-inflate’ skip connections allow our model to reconstruct 3D scenes with multiple objects. For more qualitative results on our SemanticVoxels dataset, see the Supplementary material. Qualitative results for ScanNet  [74] are given in Fig. 9. While the baseline TS3DSC  [75] model receives both depth and color information as input, our SSZ model leverages only a single color input image. Still, our framework outperforms TS3DSC both in fine details and in the number of reconstructed object classes.

Fig. 7. Examples of 3D reconstruction using DISN  [4], Pix2Vox  [3], 3D-R2N2  [2], and our SSZ model on the ShapeNet  [77] dataset. Note that all baselines fail to reconstruct multi-instance input images.

Fig. 8. Examples of 3D reconstruction using DISN  [4], Pix2Vox  [3], 3D-R2N2  [2], and our SSZ model on our SemanticVoxels dataset.

4.4 Quantitative Results

We compare quantitative results in terms of 3D IoU. We present per-class 3D IoU for the ShapeNet dataset in Table 1. Pix2Vox and 3D-R2N2 are the next best performing models after our SSZ model. The Pix2Vox model performs best on plane and boat models, outperforming our model for these classes. Our SSZ model demonstrates the best mean IoU compared to the baselines. Quantitative results on our SemanticVoxels dataset demonstrate that our SSZ model successfully reconstructs complex scenes with multiple non-rigid objects of different classes (see Table 2). 3D-R2N2 is the next best performing model for challenging non-rigid classes such as humans. The Pix2Vox model demonstrates the next best results in mean IoU. Our SSZ model demonstrates the best results in reconstructing non-rigid objects with complex structures such as humans.

4.5 Ablation Studies

We evaluate the necessity of all components of our model by performing 3D scene reconstructions using ablated versions of our model. We first remove our Pose6DoF discriminator to check the geometric accuracy of the reconstructed scene (see Fig. 10). The qualitative comparison demonstrates that this ablated version of our model introduces distortions of the scene geometry. Therefore, our pose loss forces the generator to learn to reconstruct the invisible parts of objects and their dimensions along the camera’s optical axis.

Fig. 9. Examples of 3D reconstruction using TS3DSC  [75] and our SSZ model on the ScanNet  [74] dataset.

Table 1. Per-category IoU for different object classes for ShapeNet images.
Table 2. Per-category IoU for different object classes on our SemanticVoxels dataset.

Secondly, we compare the performance of the SSZ generator without the 2D and 3D inverted residual blocks. The ablated version of our model fails to reconstruct textureless objects such as the ground and fine shape details. Furthermore, the ablated version cannot reconstruct rare object classes such as bicycles (see Table 2). Therefore, all components of our SSZ framework contribute to the accuracy of the trained generator, allowing it to achieve state-of-the-art performance for the task of single-photo 3D reconstruction of multiclass non-rigid objects.

Fig. 10. Evaluation of ablated versions of our SSZ model.

5 Conclusions

We demonstrated that volumetric residual blocks can learn reconstruction and segmentation of 3D scenes from a single image. Furthermore, our frustum voxel model 3D scene representation allows the use of 2D-to-3D skip connections, facilitating the generalization ability of our SSZ model and robust reconstruction of previously unseen objects. Our main observation is that multiclass 3D scene reconstruction and semantic segmentation require a similar number of model parameters as the single-class image-to-voxel model translation task. Moreover, rich semantic data in the training dataset allows our model to reason explicitly about geometric relationships between object classes.

Compared to state-of-the-art image-to-voxel model translation models, our SSZ framework surpasses the leading results in both 3D IoU and pose accuracy for multiclass 3D scene reconstruction. Moreover, our SSZ model is end-to-end trainable. While modern GPUs pose hardware challenges for increasing voxel model resolution, graph convolution networks demonstrate promising results in voxel model super-resolution. The development of a mixed image-to-voxel model with graph convolution super-resolution is an exciting direction for future work.