1 Introduction

Thanks to the availability of consumer-level 3D data acquisition sensors such as the Microsoft Kinect or the Asus Xtion Live, it is now quite easy to build digital copies of environments or objects. Despite the rapid development of these sensing technologies, some problems persist in their raw output. A notorious example is the missing-data artifact, which emerges due to self-occlusion, occlusion by other objects, or insufficient coverage by the viewing frustum of the device. While the occlusion-related problems have been addressed extensively in the literature as shape completion or scene completion, the coverage-related version has, to our knowledge, not been studied yet. In this paper, we address this new problem, which we cast as scene extrapolation. Specifically, given the first half of the scene, we extrapolate it to recover the second half. This is a valid scenario, as the 3D sensor may not be able to capture the other half due to, for instance, physical limitations. It is also a more challenging problem than the completion task, which can exploit the available surroundings of the occluded holes to be filled, information that is unavailable to our extrapolation framework. Our problem is also fundamentally different from the family of shape or scene synthesis problems, which aim to reconstruct the target based on textual or geometric descriptions such as captions or 3D point clouds (see Fig. 1).

Fig. 1

An illustration of the relevant problems. a Scene completion aims at estimating the missing region in the middle of the scene. b Scene synthesis constructs 3D scenes from point clouds. c Scene extrapolation estimates the unseen part of the environment from the seen part (the problem addressed in this paper) [best viewed in color]

In this paper, we study the problem of 3D scene extrapolation with convolutional neural networks (CNNs). The dataset we use [38] consists of voxelized 3D indoor scenes with per-voxel object labels. Our proposed models take the first half of each 3D scene as input and estimate the other half. We train our models with a voxel-wise Softmax cross-entropy loss.

The main contributions of this paper are as follows. First, we introduce a novel problem, different from other completion tasks, namely the extrapolation of a half-scene into a full one. Second, we propose a deep CNN model that uses both 3D and 2D inputs to estimate the unseen half of the scene given the visible part as its input.

2 Related work

Missing data can be completed in various ways depending on the size of the missing regions. We broadly categorize these methods as rule-based and learning-based approaches.

One set of rule-based approaches utilizes an ‘as smooth as possible’ filling strategy for the completion of sufficiently small missing parts, or holes [26, 42, 51]. When the holes grow beyond a certain extent, context-based copy-paste techniques yield more accurate results than smooth filling techniques do. In this line of research, suitable patches from other areas of the same object [11, 29, 37] or from a database of similar objects [9, 21, 24] are copied to the missing parts. Although these methods generally arise in the context of 3D shape completion, appropriate modifications enable image [2, 13, 22] and video [16, 40] completion as well. A common limitation of these works is the assumption that the missing part already exists elsewhere in the context. Besides, these algorithms may generate patches inconsistent with the original geometry, as they may not take full advantage of the global information. Although semi-automatic methods that operate under the guidance of a user [12, 46] may improve the completion results, a more principled solution is based on supervised machine learning techniques, as we discuss in the sequel.

A different rule-based pipeline is built on the family of iterative closest point algorithms [3, 4, 30, 45]. These algorithms are applicable when multiple sets of partial data are available at arbitrary orientations, a case that typically emerges as a result of 3D scanning from different views [28]. By alternating between closest-point correspondence and rigid transformation optimization, these methods unify all partial scans into one full scan, hence completing the geometry. Requiring access to other complementary partial data is a fundamental drawback of these works.

Scene completion, posed either on the 2D image domain or the 3D Euclidean domain, is more challenging than the shape completion case discussed thus far. The difficulty mainly stems from the fact that it is an inherently ill-posed problem, as there are infinitely many geometric pieces that look plausible, e.g., a chair and a garbage bin are equally likely to be occluded by a table. A rule-based approach such as [52] should, therefore, be replaced with a machine learning-based scheme, as the latter imitates the reasoning of a human, for whom the completion task is trivial, e.g., we immediately infer the identity of a highly occluded component based on years of natural training. To this end, [38] recently proposed a 3D convolutional network that takes a single depth image as input and outputs volumetric occupancy and semantic labels for all voxels in the camera view frustum. Although this is the closest work to our paper, we deal with the voxels that are not in the view frustum, hence the even more challenging task of scene extrapolation instead of completion, or interpolation. Similar to [38], unobserved geometry is predicted by [6] using structured random forest learning, which is able to handle the large dimensionality of the output space.

Fig. 2

The scene extrapolation problem addressed in this paper, and how we use deep learning to solve it. Our deep CNN model takes half of the 3D voxelized scene as input and, after a series of convolution operations, generates the full scene as output. The numbers inside the layers indicate the kernel sizes. Highlighted layers show dilated convolutions. The architecture contains several residual connections, as shown by the arrows [best viewed in color]

Recent deep generative models have shown promising success on the completion of images at the expense of significant training times, e.g., on the order of months [17, 19, 25, 47]. There are also promising 3D shape completion results based on deep neural nets [5, 44], which again suffer from the tedious training stage.

In the shape synthesis problem, given a description of a shape or a scene to be synthesized, the aim is to reconstruct the query in the form of an image [49, 50], a video [39], or a surface embedded in 3D [7, 15, 18, 36]. There are many sources for the descriptive information, ranging from a verbal description of the scene to the 3D point cloud of a shape. Synthesis can be performed by either generating the objects [41] or retrieving them from a database [8, 20].

In deep learning, there are many generative models that scale well to large output spaces such as natural images. Prominent models include generative adversarial networks (GANs) [10], variational autoencoders [32], and pixel-level convolutional or recurrent networks [34]. In addition to modeling the extrapolation problem as a voxel-wise classification problem, we chose GANs among the generative models since (1) they are easier to construct and train, (2) they are computationally cheaper during training and inference, and (3) they yield better and sharper results.

In summary, the completion and synthesis of shapes, scenes, images and videos have been extensively studied in the literature. However, scene extrapolation is a new problem, which we propose to solve in this paper using deep networks.

Table 1 The architecture details of our CNN model
Fig. 3

The hybrid model used in our paper. We feed half of the 3D voxelized scene as input to the CNN-SE network (see Sect. 3.1). In parallel, the top-view projection of the first half (i.e., \(\mathbf{f }^{\mathrm{top}}\)) is fed into PixelCNN [34] to generate the top view of the other half (\(\mathbf{s }^{\mathrm{top}}\)). The 2D-to-3D network, which consists of multiple residual blocks, maps its 2D input to 3D. This output is then aggregated with the CNN-SE network output and goes into a smoothing network consisting of 4 ResNeXt [43] blocks [best viewed in color]

3 Scene extrapolation with deep models

We introduce two main models: CNN-SE and Hybrid-SE, where the latter uses CNN-SE as a sub-network and extends it with a 2D top-view projection as an additional input. We also test the U-Net architecture [35] on the scene extrapolation task.

3.1 A convolutional neural network for scene extrapolation (CNN-SE)

CNNs [23] are specialized neural networks with local connectivity and shared weights. With these constraints, together with other operations such as pooling and non-linearities, CNNs essentially learn a hierarchical set of filters that extract from the input the information crucial to the problem at hand.

In our approach, we use a deep fully convolutional network for the 3D semantic scene extrapolation task (Fig. 2). Fully convolutional models consist only of convolutional layers and use neither pooling nor fully connected layers. Given the first half of a 3D voxelized scene (\(\mathbf{f }^{\mathrm{3D}} \in {\mathbb {V}}^{w\times h \times d}\), where \({\mathbb {V}}=\{1,\ldots , 14\}\) is the set of object categories each voxel can take, and w, h, d are, respectively, the width, the height and the depth of the scene), our task is to generate the other half of the scene (\(\mathbf{s }^{\mathrm{3D}} \in {\mathbb {V}}^{w\times h \times d}\)), which is semantically consistent with the first half (\(\mathbf{f }^{\mathrm{3D}}\)) and conceptually realistic (e.g., objects have correct shapes, boundaries and sizes). Since our approaches work on voxelized data, we treat each 3D voxelized scene as a ‘2D image’ with multiple channels, where the channels correspond to planes of the scene. We observe that each voxel in a 3D scene depends strongly on its adjacent voxels, or in other words, each plane of voxels resembles its neighboring planes. Therefore, we convert 3D voxelized scenes to stacks of 2D planes, like RGB images, except that the input scene has 42 channels (42 planes of voxels, i.e., w is 42) instead of three. Our models complete the missing channels of an input 3D scene.
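
The following minimal sketch illustrates this representation step (written with NumPy; the exact axis ordering is an assumption, since only the 42-channel convention above is fixed, and the shapes follow the resolution reported in Sect. 4.1):

import numpy as np

# Minimal sketch of the channel representation. Assumed shapes: the full
# scene is 84 x 42 x 84 voxels (Sect. 4.1), so each half is
# w x h x d = 42 x 42 x 84, with integer labels in {1, ..., 14}.

def half_scene_to_channels(half_scene):
    """Map a (w=42, h=42, d=84) label grid to an (h, d, w) tensor whose
    42 channels are the voxel planes along the width axis, analogous to
    the 3 channels of an RGB image."""
    assert half_scene.shape == (42, 42, 84)
    return np.transpose(half_scene, (1, 2, 0))  # (h, d, w) = (42, 84, 42)

# Example with random labels:
first_half = np.random.randint(1, 15, size=(42, 42, 84))
x = half_scene_to_channels(first_half)
print(x.shape)  # (42, 84, 42): a "2D image" with 42 channels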

We have experimented with many architectures and structures, including pooling layers, bottleneck architectures with strided and deconvolution layers, and dilated convolutions [48]. However, we found that a network using only convolution operations with stride 1 yielded better results. Therefore, we design our architecture as convolutional layers with stride 1 and ‘same’ padding, see Fig. 2.

Table 1 lists the details of the CNN-SE model. The fully convolutional model takes the input scene and first applies a \(7 \times 7\) convolution; the architecture then continues with 6 residual blocks [14]. Each residual block contains 3 ReLU non-linearities (defined as \(\text {relu}(x)=\max (0,x)\); see [33]), each followed by a convolution. The network finishes with a \(1\times 1\) convolutional layer and a Softmax classifier with the voxel-wise cross-entropy loss, \(L_{\mathrm{CE}}\), between the generated and the ground truth voxels, as follows:

$$\begin{aligned} L_{\mathrm{CE}} = \frac{1}{m} \sum _{i} H(\mathbf{t }^{\mathrm{3D}}_{i}, \mathbf{s }^{\mathrm{3D}}_{i}), \end{aligned}$$
(1)

where m is the batch size; \(\mathbf{s }^{\mathrm{3D}}_i\) is the extrapolated scene for the ith input, and \(\mathbf{t }^{\mathrm{3D}}_i\) is the corresponding ground truth. The cross-entropy (H) between two vectors (or serialized matrices) is defined as follows:

$$\begin{aligned} H(\mathbf{p }, \mathbf{q }) = - \sum _{i} p_{i} \log (q_{i}). \end{aligned}$$
(2)
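
The architecture can be sketched as follows (a minimal TensorFlow/Keras sketch, not the exact Table 1 configuration; the number of filters per layer and the output head, which stacks 42 plane-wise maps of 14-class logits, are assumptions):

import tensorflow as tf
from tensorflow.keras import layers

# Sketch of a CNN-SE-style generator. Assumptions: 64 filters per layer,
# 14 categories (13 objects plus empty), and 42 output planes of logits.
NUM_CLASSES, NUM_PLANES, FILTERS = 14, 42, 64

def residual_block(x):
    """Three (ReLU -> 3x3 stride-1 convolution) pairs with a skip connection."""
    shortcut = x
    for _ in range(3):
        x = layers.ReLU()(x)
        x = layers.Conv2D(FILTERS, 3, strides=1, padding="same")(x)
    return layers.Add()([shortcut, x])

def build_cnn_se(height=42, depth=84):
    # Input: the first half of the scene as a 42-channel "image".
    inp = layers.Input(shape=(height, depth, NUM_PLANES))
    x = layers.Conv2D(FILTERS, 7, strides=1, padding="same")(inp)
    for _ in range(6):
        x = residual_block(x)
    # 1x1 convolution producing per-plane, per-class logits.
    x = layers.Conv2D(NUM_PLANES * NUM_CLASSES, 1, padding="same")(x)
    logits = layers.Reshape((height, depth, NUM_PLANES, NUM_CLASSES))(x)
    return tf.keras.Model(inp, logits)

model = build_cnn_se()

Because every convolution uses stride 1 and ‘same’ padding, the spatial resolution of the input is preserved all the way to the output logits, matching the design choice discussed above.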

We also use a smoothness penalty term, \(L_S\), to make the network produce smoother results by enforcing consistency between each generated plane \(\mathbf{s }_{ij}^{p}\) (i.e., the generated jth voxel plane for the ith input instance) and its next plane in the ground truth, \(\mathbf{t }_{i(j+1)}^{p}\):

$$\begin{aligned} L_S = \frac{1}{m} \sum _{i,j} H\left( \mathbf{t }_{i(j+1)}^{p}, \mathbf{s }_{ij}^{p}\right) , \end{aligned}$$
(3)

where m is again the batch size.

Moreover, to counter overfitting, we add \(L_2\) regularization on the weights:

$$\begin{aligned} L_2 = \frac{\lambda }{2} \sum _{i}^{} w_{i}^{2}, \end{aligned}$$
(4)

where \(\lambda \) is a constant controlling the strength of the regularization (\(\lambda \) is tuned to 0.001), and \(w_i\) denotes a network weight.

In addition, the focal loss \(L_f\) [27] is added for faster convergence. Therefore, the total loss that we minimize with our CNN model is as follows:

$$\begin{aligned} L_{\mathrm{total}} = L_{\mathrm{CE}} + L_S + L_2 + L_f. \end{aligned}$$
(5)
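
A hedged sketch of this combined objective, written against logits of shape (m, h, d, planes, classes), is given below; the regularization weight \(\lambda = 0.001\) follows the text, while the focal-loss focusing parameter \(\gamma\), the zero-based class indexing and the reduction details are assumptions:

import tensorflow as tf

def total_loss(model, logits, targets, gamma=2.0, lam=0.001):
    """logits: (m, h, d, planes, classes); targets: zero-based class indices (m, h, d, planes)."""
    num_classes = logits.shape[-1]
    onehot = tf.one_hot(targets, depth=num_classes)
    probs = tf.nn.softmax(logits, axis=-1)

    # Eq. 1: voxel-wise cross-entropy against the ground truth.
    l_ce = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=onehot, logits=logits))

    # Eq. 3: smoothness term, comparing generated plane j with the
    # ground-truth plane j+1 along the extrapolation (plane) axis.
    l_s = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.one_hot(targets[..., 1:], depth=num_classes),
        logits=logits[..., :-1, :]))

    # Eq. 4: L2 regularization (all trainable parameters, for simplicity).
    l_2 = lam / 2.0 * tf.add_n(
        [tf.reduce_sum(tf.square(w)) for w in model.trainable_weights])

    # Focal loss [27]: down-weights voxels that are already well classified.
    p_t = tf.reduce_sum(onehot * probs, axis=-1)
    l_f = tf.reduce_mean(-tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t + 1e-8))

    return l_ce + l_s + l_2 + l_f  # Eq. 5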

3.2 Scene extrapolation with hybrid model (hybrid-SE)

Figure 3 shows the overall view of our hybrid model. This model, in addition to taking the first half in 3D (\(\mathbf{f }^{\mathrm{3D}}\)) as input, processes the 2D top-view projection in parallel. \(\mathbf{f }^{\mathrm{3D}}\) is fed to the CNN-SE model as explained in Sect. 3.1; in parallel, \(\mathbf{f }^{\mathrm{top}}\), the 2D top-view projection of \(\mathbf{f }^{\mathrm{3D}}\), goes into an autoregressive generative model, namely PixelCNN [34], to generate the top view of the second half (\(\mathbf{s }^{\mathrm{top}}\)). This top view should provide the locations and identities of the objects in the space to be extrapolated. A 2D-to-3D network then takes this generated 2D top view as input and predicts the third dimension to lift it to 3D. The outputs of the CNN-SE network (\(\mathbf{s }^{\mathrm{3D}}\)) and the 2D-to-3D network are aggregated and fed into a smoothing network consisting of 4 ResNeXt blocks [43]. In this model, we use the loss in Eq. 5 for the CNN-SE network; Eqs. 1–4 for the PixelCNN network; the focal loss together with Eqs. 1–2 and 4 for the 2D-to-3D network; and Eqs. 1 and 2 for the smoother network.
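
The data flow of the hybrid model can be summarized by the following structural sketch (plain Python; the four sub-networks are passed in as callables, and aggregating the two branches by element-wise addition is an assumption, since the text only states that the outputs are aggregated):

# Structural sketch of the Hybrid-SE forward pass (Fig. 3).
def hybrid_se_forward(f_3d, f_top, cnn_se, pixel_cnn, to_3d, smoother):
    s_3d = cnn_se(f_3d)           # Sect. 3.1: 3D extrapolation branch
    s_top = pixel_cnn(f_top)      # PixelCNN [34]: extrapolate the 2D top view
    s_top_3d = to_3d(s_top)       # 2D-to-3D network: lift the top view to 3D
    aggregated = s_3d + s_top_3d  # assumed aggregation of the two branches
    return smoother(aggregated)   # smoothing network of 4 ResNeXt [43] blocks

Channel-wise concatenation followed by a convolution would be an equally plausible aggregation; Fig. 3 only fixes the order of the sub-networks.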

Our hybrid model predicts the top view of the unseen part. However, this 2D extrapolation is itself as challenging as the 3D extrapolation. In order to demonstrate the full potential of having a good top-view estimate of the unseen part, we also solved the 3D extrapolation by feeding in the known 2D top view of the unseen part.

Fig. 4

Sample scenes from the synthetic SUNCG dataset [38], a challenging dataset containing diverse kinds of chairs, beds, sofas, shelves, tables, cabinets and coffee tables in different shapes and sizes [best viewed in color]

4 Experiments and results

In this section, we train and evaluate our models on the SUNCG [38] dataset. Our code and models are available at https://github.com/aliabbasi/d3dsse.

Table 2 The training details for our models
Fig. 5

Results from our models on the SUNCG dataset [38]. ‘w gt’ indicates a known 2D top view [best viewed in color]

4.1 Datasets

We used the SUNCG [38] synthetic 3D scene dataset for training and evaluation. This dataset contains 45,622 different scenes, about 400k rooms and 2644 unique objects. We constructed the scenes from the information provided as a JSON file for each scene: we parsed each JSON file and built a separate scene for each room. In the room extraction process, we ignored scene types such as outdoor scenes and swimming pools. We used the binvox [31] voxelizer, which uses a space-carving approach to obtain both surface and interior voxels. To make the voxelization process faster, we first voxelized each object individually and then placed it in its location in each room. During object voxelization, we used a voxel size of 6 cm and voxelized each object with respect to its dimensions: the longest dimension of the object was divided by 6 to give the voxelization resolution for that object.
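
The per-object resolution computation can be sketched as follows (a hedged sketch, not the exact pipeline; the bounding-box interface and the use of the binvox ‘-d’ grid-resolution flag are assumptions):

import subprocess
import numpy as np

# Sketch of the per-object voxelization step. The 6 cm voxel size is as
# stated in the text; invoking binvox [31] with '-d <resolution>' is an
# assumption about the exact command line.
VOXEL_SIZE_CM = 6.0

def voxelization_resolution(bbox_min_cm, bbox_max_cm):
    """Longest bounding-box dimension (in cm) divided by the 6 cm voxel size."""
    extents = np.asarray(bbox_max_cm, dtype=float) - np.asarray(bbox_min_cm, dtype=float)
    return int(np.ceil(extents.max() / VOXEL_SIZE_CM))

def voxelize_object(mesh_path, bbox_min_cm, bbox_max_cm):
    res = voxelization_resolution(bbox_min_cm, bbox_max_cm)
    subprocess.run(["binvox", "-d", str(res), mesh_path], check=True)

# Example: an object whose longest dimension is 240 cm gets resolution 240 / 6 = 40.
print(voxelization_resolution([0, 0, 0], [240, 90, 75]))  # 40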

In our experiments, we removed categories that did not have a sufficient number of instances, or whose objects are much smaller than the most frequently occurring ones (such as beds, sofas, cabinets and windows), which reduced the number of object categories from 84 to 13 (plus one category for empty space). The removed objects include TVs, plants, toys, kitchen appliances, floor lamps, people and cats. We set the fixed resolution in our models to \(84\times 42\times 84\). We observed that this resolution was sufficient to represent the objects and the content of the scenes. We also removed scenes with fewer than 12,000 non-empty voxels, leaving about 211k scenes; we used 200k for training and the rest as test data. The data processing and construction took roughly a week. Figure 4 shows some sample scenes from this challenging dataset.

Table 3 The accuracy, completeness, precision and recall measures of our models

4.2 Training details

We used TensorFlow [1] to implement our models. Table 2 summarizes the training details for each model individually. For all training processes, we used a single NVIDIA GeForce Titan X GPU.

4.3 Results

Figure 5 shows the results of our experiments. We use two metrics to evaluate the results. The first is accuracy, defined as the number of correctly generated voxels in the output divided by the total number of estimated voxels; for this metric, empty voxels are also counted as correct if they are estimated at the right place. The second metric is completeness, which is the number of correctly generated non-empty voxels in the output divided by the total number of non-empty voxels. Table 3 shows the quantitative results of our models, as well as per-category F1 scores.
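
The two metrics, as defined above, correspond to the following sketch (NumPy; treating one particular label index as the empty category and normalizing completeness by the ground-truth non-empty voxels are assumptions):

import numpy as np

EMPTY = 0  # assumed index of the "empty" category; adapt to the actual label set

def accuracy(pred, gt):
    """Correctly estimated voxels (empty ones included) over all estimated voxels."""
    return float(np.mean(pred == gt))

def completeness(pred, gt):
    """Correctly generated non-empty voxels over all non-empty ground-truth voxels."""
    nonempty = gt != EMPTY
    return float(np.sum((pred == gt) & nonempty) / np.sum(nonempty))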

4.4 Discussion

The architecture We have experimented with many architectures and structures for the generator, including pooling layers and bottleneck architectures with strided and deconvolution layers; in the end, we concluded that a simple architecture of stride-1 convolution layers generates better results. This suggests that it is better to keep the widths of the layers unreduced and allow the network to process information along the width of the scene at each convolution. This also allows the network to obtain a better estimate of the sizes of the objects, since each convolution has access to the full width of the scene. Moreover, the worse performance with increased filter sizes suggests that highly local processing is preferable, as larger filters average information across voxels.

Future work One can formulate the scene extrapolation task as a sequence modeling problem, such that the input sequence (the visible scene) is encoded by one sequence modeling framework, and the rest of the scene is decoded by another. Moreover, a context network can be trained in parallel with the generation network, such that the context network captures high-level scene information and the layout of the scene and modulates the generation process.

5 Conclusion

In this paper, we have proposed using convolutional networks for the scene extrapolation problem, which has not been addressed in the literature before. We showed that the proposed models are able to extrapolate the scene toward the unseen part and to successfully extend the ground, the walls and the partially visible objects.

However, we observed that the networks were unable to hallucinate objects in the extrapolated part of the scene if no part of them was visible in the input scene. This is likely due to the highly challenging nature of the extrapolation problem, owing to the huge inter-class and intra-class variability of the objects. Extrapolating at a level where new objects can be generated would likely require (1) a much larger dataset than the one we used and those available in the literature, and (2) the extraction and use of higher-level semantic information from the input scene as a modulator for the deep extrapolating networks.

Scene extrapolation is a challenging and highly relevant problem for the computer graphics and deep learning communities, and our paper offers a new research problem and direction with which new and better learning and estimation methods can be developed.