
1 Introduction

Inferring 3D information from 2D sensory data such as images and videos has long been a central research topic in computer vision. The conventional approach to building 3D models typically relies on detecting, matching, and triangulating local image features (e.g., patches, superpixels, edges, and SIFT features). Although significant progress has been made over the past decades, these methods still suffer from some fundamental problems. In particular, local feature detection is sensitive to a large number of factors such as scene appearance (e.g., textureless areas and repetitive patterns), lighting conditions, and occlusions. Further, the resulting noisy, point cloud-based 3D models often fail to meet the increasing demand for high-level 3D understanding in real-world applications.

Fig. 1. The Structured3D dataset. From a large collection of house designs (a) created by professional designers, we automatically extract a variety of ground truth 3D structure annotations (b) and generate photo-realistic 2D images (c).

When perceiving 3D scenes, humans are remarkably effective in using salient global structures such as lines, contours, planes, smooth surfaces, symmetries, and repetitive patterns. Thus, if a reconstruction algorithm can take advantage of such global information, it is natural to expect the algorithm to obtain more accurate results. Traditionally, however, it has been computationally challenging to reliably detect such global structures from noisy local image features. Recently, deep learning-based methods have shown promising results in detecting various forms of structure directly from the images, including lines  [12, 40], planes  [16, 19, 35, 36], cuboids  [10], floorplans  [17, 18], room layouts  [14, 26, 41], abstracted 3D shapes  [28, 32], and smooth surfaces  [11].

With the fast development of deep learning methods comes the need for large amounts of accurately annotated data. In order to train the proposed neural networks, most prior works collect their own sets of images and manually label the structure of interest in them. Such a strategy has several shortcomings. First, due to the tedious process of manually labeling and verifying all the structure instances (e.g., line segments) in each image, existing datasets typically have limited sizes and scene diversity, and the annotations may contain errors. Second, since each study primarily focuses on one type of structure, none of these datasets has multiple types of structure labeled. As a result, existing methods are unable to exploit relations between different types of structure (e.g., lines and planes) as humans do for effective, efficient, and robust 3D reconstruction.

In this paper, we present a large synthetic dataset with rich annotations of 3D structure and photo-realistic 2D renderings of indoor man-made environments (Fig. 1). At the core of our dataset design is a unified representation of 3D structure which enables us to efficiently capture multiple types of 3D structure in the scene. Specifically, the proposed representation considers any structure as a relationship among geometric primitives. For example, a “wireframe” structure encodes the incidence and intersection relationships between line segments, whereas a “cuboid” structure encodes the rotational and reflective symmetry relationships among its planar faces. With our “primitive + relationship” representation, one can easily derive the ground truth annotations for a wide variety of semi-global and global structures (e.g., lines, wireframes, planes, regular shapes, floorplans, and room layouts), and also exploit their relations in future data-driven approaches (e.g., the wireframe formed by intersecting planar surfaces in the scene).

Table 1. An overview of datasets with structure annotations. \(^\dag \): The actual numbers are not explicitly given and hard to estimate, because these datasets contain images from Internet (LSUN Room Layout, PanoContext), or multiple sources (LayoutNet). \(^*\): Dataset is unavailable online at the time of publication.

To create a large-scale dataset with the aim of facilitating research on data-driven methods for structured 3D scene understanding, we leverage the availability of professional interior designs and millions of production-level 3D object models – all coming with fine geometric details and high-resolution textures (Fig. 1(a)). We first use computer programs to automatically extract information about 3D structure from the original house design files. As shown in Fig. 1(b), our dataset contains rich annotations of 3D room structure including a variety of geometric primitives and relationships. To further generate photo-realistic 2D images (Fig. 1(c)), we utilize industry-leading rendering engines to model the lighting conditions. Currently, our dataset consists of more than 196k images of 21,835 rooms in 3,500 scenes (i.e., houses).

To showcase the usefulness and uniqueness of the proposed Structured3D dataset, we train deep networks for room layout estimation on a subset of the dataset. We show that the models trained on both synthetic and real data outperform the models trained on real data only. Further, following the spirit of  [8, 27], we show how multi-modal annotations in our dataset can benefit domain adaptation tasks.

In summary, the main contributions of this paper are:

  • We create the Structured3D dataset, which contains rich ground truth 3D structure annotations of 21,835 rooms in 3,500 scenes, and more than 196k photo-realistic 2D renderings of the rooms.

  • We introduce a unified “primitive + relationship” representation. This representation enables us to efficiently capture a wide variety of semi-global or global 3D structures and their mutual relationships.

  • We verify the usefulness of our dataset by using it to train deep networks for room layout estimation and demonstrating improved performance on public benchmarks.

Fig. 2. Example annotations of structure in existing datasets. The reference number indicates the paper from which each illustration is originally taken.

2 Related Work

Datasets. Table 1 summarizes existing datasets for structured 3D scene modeling. Additionally, [28, 32] provide datasets with structured representations of single objects. We show example annotations in these datasets in Fig. 2. Note that ground truth annotations in most datasets are manually labeled. This is one main reason why all these datasets have limited size, i.e., contain no more than a few thousand images. One exception is [16], which employs a multi-model fitting algorithm to automatically extract planes from 3D scans in the ScanNet dataset [9]. But such algorithms are sensitive to noise and outliers in the data and thus introduce errors in the annotations (Fig. 2(a)). Similar to our work, SceneCity 3D [40] also contains synthetic images with ground truth automatically extracted from CAD models, but the number of scenes is limited to 230. Further, none of these datasets has more than one type of structure labeled, although different types of structure often have strong relations among them. For example, from the wireframe in Fig. 2(b) humans can easily identify other types of structure such as planes and cuboids. Our new dataset aims to bridge the gap between what is needed to train machine learning models to achieve human-level holistic 3D scene understanding and what is offered by existing datasets.

Note that our dataset is very different from other popular large-scale 3D datasets, such as NYU v2 [23], SUN RGB-D [24], 2D-3D-S [3, 4], ScanNet [9], and Matterport3D [6], in which the ground truth 3D information is stored in the format of point clouds or meshes. These datasets lack ground truth annotations of semi-global or global structures. While it is theoretically possible to extract 3D structure by applying structure detection algorithms to the point clouds or meshes (e.g., extracting planes from ScanNet as done in [16]), the detection results are often noisy and may even contain errors. In addition, for some types of structure like wireframes and room layouts, how to reliably detect them from raw sensor data remains an active research topic in computer vision.

In recent years, synthetic datasets have played an important role in the successful training of deep neural networks. Notable examples for indoor scene understanding include SUNCG [25], SceneNet RGB-D [20], and InteriorNet [15]. These datasets exceed real datasets in terms of scene diversity and frame numbers. But just like their real counterparts, these datasets lack ground truth structure annotations. Another issue with some synthetic datasets is the degree of realism of both the 3D models and the 2D renderings. [38] shows that physically-based rendering could boost the performance of various indoor scene understanding tasks. To ensure the quality of our dataset, we make use of 3D room models created by professional designers and state-of-the-art industrial rendering engines. Table 2 summarizes the differences among these 3D scene datasets.

Room Layout Estimation. Room layout estimation aims to reconstruct the enclosing structure of an indoor scene, consisting of walls, floor, and ceiling. Existing public datasets (e.g., PanoContext [37] and LayoutNet [41]) assume a simple box-shaped layout. PanoContext [37] collects about 500 panoramas from the SUN360 dataset [33], while LayoutNet [41] extends the layout annotations to include panoramas from 2D-3D-S [3]. Recently, MatterportLayout [42] collects 2,295 RGB-D panoramas from Matterport3D [6] and extends the annotations to Manhattan layouts. We note that all room layouts in these real datasets are manually labeled by humans. Since the room structure may be occluded by furniture and other objects, the “ground truth” inferred by humans may not be consistent with the actual layout. In our dataset, all ground truth 3D annotations are automatically extracted from the original house design files.

Table 2. Comparison of 3D scene datasets. \(^\dag \): Meshes are obtained by 3D reconstruction algorithm. Notations for applications: O (object detection), U (scene understanding), S (image synthesis), M (structured 3D modeling).

3 A Unified Representation of 3D Structure

The main goal of our dataset is to provide rich annotations of ground truth 3D structure. A naive way to do so is generating and storing different types of 3D annotations in the same format as existing works, like wireframes as in  [12], planes as in  [16], floorplans as in  [17], and so on. But this leads to a lot of redundancy. For example, planes in man-made environments are often bounded by a number of line segments, which are part of the wireframe. Even worse, by representing wireframes and planes separately, the relationships between them are lost. In this paper, we present a unified representation in order to minimize redundancy while preserving mutual relationships. We show how the most common types of structure studied in the literature (e.g., planes, cuboids, wireframes, room layouts, and floorplans) can be derived from our representation.

Fig. 3. The ground truth 3D structure annotations in our dataset are represented by primitives and relationships. (a): Junctions and lines. (b): Planes. We highlight the planes in a single room. (c): Plane-line and line-junction relationships. We highlight a junction, the three lines intersecting at the junction, and the planes intersecting at each of the lines. (d): Cuboids. We highlight one cuboid instance. (e): Manhattan world. We use different colors to denote planes aligned with different directions. (f): Semantic objects. We highlight a “room”, a “balcony”, and the “door” connecting them.

Our representation of the structure is largely inspired by the early work of Witkin and Tenenbaum  [31], which characterizes structure as “a shape, pattern, or configuration that replicates or continues with little or no change over an interval of space and time”. Accordingly, to describe any structure, we need to specify: (i) what pattern is continuing or replicating (e.g., a patch, an edge, or a texture descriptor), and (ii) the domain of its replication or continuation. In this paper, we call the former primitives and the latter relationships.

3.1 The “Primitive + Relationship” Representation

We now show how to describe a man-made environment using a unified representation. For ease of exposition, we assume all objects in the scene can be modeled by piece-wise planar surfaces. But our representation can be easily extended to more general surfaces. An illustration of our representation is shown in Fig. 3.

Primitives. Generally, a man-made scene has the following geometric primitives:

  • Planes \(\mathbf {P}\): We model the scene as a collection of planes \(\mathbf {P}= \{p_1, p_2, \ldots \}\). Each plane is described by its parameters \(p = \{\mathbf {n}, d\}\), where \(\mathbf {n}\) and d denote the surface normal and the distance to the origin, respectively.

  • Lines \(\mathbf {L}\): When two planes intersect in the 3D space, a line is created. We use \(\mathbf {L}= \{l_1, l_2, \ldots \}\) to represent the set of all 3D lines in the scene.

  • Junction Points \(\mathbf {X}\): When two lines meet in the 3D space, a junction point is formed. We use \(\mathbf {X}= \{x_1, x_2, \ldots \}\) to represent the set of all junction points.

Relationships. Next, we define some common types of relationships between the geometric primitives:

  • Plane-Line Relationships (\(R_1\)): We use a matrix \(W_1\) to record all incidence and intersection relationships between planes in \(\mathbf {P}\) and lines in \(\mathbf {L}\). Specifically, the ij-th entry of \(W_1\) is 1 if \(l_i\) is on \(p_j\), and 0 otherwise. Note that two planes intersect at some line if and only if the corresponding entry in \(W_1^TW_1\) is nonzero (see the code sketch after this list).

  • Line-Point Relationships (\(R_2\)): Similarly, we use a matrix \(W_2\) to record all incidence and intersection relationships between lines in \(\mathbf {L}\) and points in \(\mathbf {X}\). Specifically, the mn-th entry of \(W_2\) is 1 if \(x_m\) is on \(l_n\), and 0 otherwise. Note that two lines intersect at some junction if and only if the corresponding entry in \(W_2^TW_2\) is nonzero.

  • Cuboids (\(R_3\)): A cuboid is a special arrangement of plane primitives with rotational and reflection symmetry along x-, y- and z-axes. The corresponding symmetry group is the dihedral group \(D_{2h}\).

  • Manhattan World (\(R_4\)): This is a special type of 3D structure commonly used for indoor and outdoor scene modeling. It can be viewed as a grouping relationship, in which all the plane primitives can be grouped into three classes, \(\mathbf {P}_1\), \(\mathbf {P}_2\), and \(\mathbf {P}_3\), \(\mathbf {P}= \bigcup _{i=1}^3 \mathbf {P}_i\). Further, each class is represented by a single normal vector \(\mathbf {n}_i\), such that \(\mathbf {n}_i^T \mathbf {n}_j = 0, i\ne j\).

  • Semantic Objects (\(R_5\)): Semantic information is critical for many 3D computer vision tasks. It can be regarded as another type of grouping relationship, in which each semantic object instance corresponds to one or more primitives defined above. For example, each “wall”, “ceiling”, or “floor” instance is associated with one plane primitive; each “chair” instance is associated with a set of multiple plane primitives. Further, such a grouping is hierarchical. For example, we can further group one floor, one ceiling, and multiple walls to form a “living room” instance. And a “door” or a “window” is an opening which connects two rooms (or one room and the outer space).
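To make the bookkeeping concrete, below is a minimal Python sketch of the primitives and the incidence matrices \(W_1\) and \(W_2\) for three planes meeting at a room corner. The field names and layout are illustrative only and do not reflect the dataset's actual annotation format.

```python
import numpy as np

# Primitives: three planes meeting at a room corner (each p = {n, d}),
# the three lines of their pairwise intersections, and one junction.
planes = [
    {"normal": np.array([1.0, 0.0, 0.0]), "distance": 0.0},  # wall A
    {"normal": np.array([0.0, 1.0, 0.0]), "distance": 0.0},  # wall B
    {"normal": np.array([0.0, 0.0, 1.0]), "distance": 0.0},  # floor
]

# R1: W1[i, j] = 1 iff line l_i lies on plane p_j.
W1 = np.array([
    [1, 1, 0],   # l_0 = wall A ∩ wall B
    [1, 0, 1],   # l_1 = wall A ∩ floor
    [0, 1, 1],   # l_2 = wall B ∩ floor
])
# R2: W2[m, n] = 1 iff junction x_m lies on line l_n.
W2 = np.array([[1, 1, 1]])  # the corner junction lies on all three lines

# Two planes intersect at some line iff the corresponding off-diagonal
# entry of W1^T W1 is nonzero; two lines meet at a junction iff the
# corresponding entry of W2^T W2 is nonzero.
plane_pairs = W1.T @ W1
line_pairs = W2.T @ W2
assert plane_pairs[0, 1] == 1 and line_pairs[0, 2] == 1
```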

Note that the relationships are not mutually exclusive, in the sense that a primitive can belong to multiple relationship instances of the same type or different types. For example, a plane primitive can be shared by two cuboids, and at the same time belong to one of the three classes in the Manhattan world model.

Discussion. The primitives and relationships we discussed above are just a few of the most common examples. They are by no means exhaustive. For example, our representation can be easily extended to include other primitives such as parametric surfaces. And besides cuboids, there are many other types of regular or symmetric shapes in man-made environments, where each type corresponds to a different symmetry group.

Our representation of 3D structures is also related to the graph representations in semantic scene understanding  [2, 13, 30]. As these graphs focus on semantics, geometry is represented in simplified manners by (i) 6D object poses and (ii) coarse, discrete spatial relations such as “supported by”, “front”, “back”, and “adjacent”. In contrast, our representation focuses on modeling the scene geometry using fine-grained primitives (i.e., junctions, lines, and planes) and relationships (in terms of topology and regularities). Thus, it is highly complementary to the scene graphs in prior work. Intuitively, it can be used for geometric analysis and synthesis tasks, in a similar way as scene graphs are used for semantic scene understanding.

3.2 Relation to Existing Models

Given our representation which contains primitives \(\mathcal {P} = \{\mathbf {P}, \mathbf {L}, \mathbf {X}\}\) and relationships \(\mathcal {R} = \{R_1, R_2, \ldots \}\), we show how several types of 3D structure commonly studied in the literature can be derived from it. We again refer readers to Fig. 3 for illustrations of these structures.

Planes: A large volume of studies in the literature model the scene as a collection of 3D planes, where each plane is represented by its parameters and boundary. To generate such a model, we simply use the plane primitives \(\mathbf {P}\). For each \(p\in \mathbf {P}\), we further obtain its boundary by using matrix \(W_1\) in \(R_1\) to find all the lines in \(\mathbf {L}\) that form an incidence relationship with p.

Wireframes: A wireframe consists of lines \(\mathbf {L}\) and junction points \(\mathbf {X}\), and their incidence and intersection relationships (\(R_2\)).

Cuboids: This model is the same as \(R_3\).

Manhattan Layouts: A Manhattan room layout model includes a “room” as defined in \(R_5\) which also satisfies the Manhattan world assumption (\(R_4\)).

Floorplans: A floorplan is a 2D vector representation that consists of a set of line segments and semantic labels (e.g., room types). To obtain such a vector representation, we can identify all lines in \(\mathbf {L}\) and junction points in \(\mathbf {X}\) which lie on a “floor” (as defined in \(R_5\)). To further obtain the semantic room labels, we can project all “rooms”, “doors”, and “windows” (as defined in \(R_5\)) to this floor.

Abstracted 3D shapes: In addition to room structures, our representation can also be applied to individual 3D object models to create abstractions in the form of wireframes or cuboids, as described above.
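As a small illustration of how such derived models follow from the representation, the hypothetical helpers below recover a plane's boundary lines from \(W_1\) and read off the wireframe incidences from \(W_2\), reusing the matrices from the corner sketch above; they are not part of the dataset's tooling.

```python
import numpy as np

def plane_boundary(j, W1):
    """Indices of all lines incident to plane p_j, i.e., its boundary via R1."""
    return np.flatnonzero(W1[:, j])

def wireframe(W2):
    """The wireframe as (junction, line) incidence pairs read off from R2."""
    return list(zip(*np.nonzero(W2)))

# With W1 and W2 from the corner example above:
#   plane_boundary(0, W1) -> array([0, 1])   (wall A is bounded by l_0 and l_1)
#   wireframe(W2)         -> [(0, 0), (0, 1), (0, 2)]
```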

4 The Structured3D Dataset

Fig. 4. Comparison of 3D house designs. (a): The 3D models in our database are created by professional designers using high-quality furniture models from world-leading manufacturers. Most designs are being used in real-world production. (b): The 3D models in the SUNCG dataset [25] are created using Planner 5D [1], an online tool for amateur interior design.

Our unified representation enables us to encode a rich set of geometric primitives and relationships for structured 3D modeling. With this representation, our ultimate goal is to build a dataset that can be used to train machines to achieve human-level understanding of the 3D environment.

As a first step towards this goal, in this section, we describe our ongoing effort to create a large-scale dataset of indoor scenes which include (i) ground truth 3D structure annotations of the scene and (ii) realistic 2D renderings of the scene. Note that in this work we focus on extracting ground truth annotations on the room structure only. We plan to extend our dataset to include 3D structure annotations of individual furniture models in the future.

In the following, we describe our general procedure to create the dataset. We refer readers to the supplementary materials for additional details, including dataset statistics and example annotations.

4.1 Extraction of Structured 3D Models

To extract a “primitive + relationship” scene representation, we utilize a large database of house designs hand-crafted by professional designers. An example design is shown in Fig. 4(a). All information about a design is stored in an industry-standard format in the database, so that specifications about the geometry (e.g., the precise size of each wall), textures and materials, and functions (e.g., which room the wall belongs to) of all objects can be easily retrieved.

From the database, we have selected 3,500 house designs with 21,835 rooms. We created a computer program to automatically extract all the geometric primitives associated with the room structure, which consists of the ceiling, floor, walls, and openings (doors and windows). Given the precise measurements and associated information of these entities, it is straightforward to generate all planes, lines, and junctions, as well as their relationships (\(R_1\) and \(R_2\)).

Since the measurements are highly accurate and noise-free, other types of relationships such as the Manhattan world (\(R_4\)) and cuboids (\(R_3\)) can also be easily obtained by clustering the primitives, followed by a geometric verification process. Finally, to include semantic information (\(R_5\)) in our representation, we map the relevant labels provided by the professional designers to the geometric primitives in our representation. Figure 3 shows examples of the extracted geometric primitives and relationships.
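As a rough sketch of this clustering-and-verification step (under the simplifying assumption of noise-free normals, which holds for our design files; the actual extraction code may differ), the snippet below groups plane normals into the three Manhattan classes of \(R_4\) and checks their mutual orthogonality.

```python
import numpy as np

def manhattan_classes(normals, tol=1e-6):
    """Group plane normals into three mutually orthogonal classes (R4).

    A sketch for noise-free normals; real scan data would require robust
    clustering instead of exact comparisons.
    """
    classes, axes = [], []
    for n in normals:
        n = n / np.linalg.norm(n)
        for k, axis in enumerate(axes):
            if abs(abs(np.dot(n, axis)) - 1.0) < tol:  # parallel up to sign
                classes[k].append(n)
                break
        else:
            axes.append(n)
            classes.append([n])
    assert len(axes) <= 3, "more than three normal directions found"
    # Geometric verification: class directions must be pairwise orthogonal.
    for i in range(len(axes)):
        for j in range(i + 1, len(axes)):
            assert abs(np.dot(axes[i], axes[j])) < tol
    return classes

# Example: walls and floor of an axis-aligned room.
normals = [np.array([1., 0., 0.]), np.array([-1., 0., 0.]),
           np.array([0., 1., 0.]), np.array([0., 0., 1.])]
groups = manhattan_classes(normals)  # three classes: x, y, z directions
```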

Fig. 5. Examples of our rendered panoramic images.

4.2 Photo-Realistic 2D Rendering

To ensure the quality of our 2D renderings, our rendering engine is developed in collaboration with a company specializing in interior design rendering. Our engine uses a well-known ray-tracing method [21], a Monte Carlo approach to approximating realistic Global Illumination (GI), for RGB rendering. The other ground truth images are obtained by a customized path-tracing renderer built on top of Intel Embree [29], an open-source collection of ray-tracing kernels for x86 CPUs.

Each room is manually created by professional designers using over one million CAD models of furniture from world-leading manufacturers. These high-resolution furniture models are measured in real-world dimensions and are used in real production. A default lighting setup is also provided. Figure 4 compares the 3D models in our database with those in SUNCG [25], which are created using Planner 5D [1], an online tool for amateur interior design.

At the time of rendering, a panoramic or pin-hole camera is placed at random locations not occupied by objects in the room. We use \(512 \times 1024\) resolution for panoramas and \(720 \times 1280\) for perspective images. Figure 5 shows example panoramas rendered by our engine. For each room, we generate different configurations (full, simple, and empty) by removing some or all of the furniture. We also modify the lighting setup to generate images with different temperatures. For each image, our dataset also includes the depth map and semantic mask. Figure 6 illustrates the degree of photo-realism of our dataset, where we compare the rendered images with photos of real decoration guided by the design.

Fig. 6. Photo-realistic rendering vs. real-world decoration. The first and third columns are rendered images.

4.3 Use Cases

Due to the unique characteristics of our dataset, we envision it contributing to computer vision research in terms of both methodology and applications.

Methodology. As our dataset contains multiple types of 3D structure annotations as well as ground truth labels (e.g., semantic maps, depth maps, and 3D object bounding boxes), it enables researchers to design novel multi-modal or multi-task approaches for a variety of vision tasks. As an example, we show in Sect. 5 that, by leveraging multi-modal annotations in our dataset, we can boost the performance of existing room layout estimation methods in the domain adaptation framework.

Applications. Our dataset also facilitates research on a number of problems and applications. For example, as shown in Table 1, all publicly available datasets for room layout estimation are limited to simple cuboid rooms. Our dataset is the first to provide general (non-cuboid) room layout annotations. As another example, existing datasets for floorplan reconstruction [7, 18] contain about 100–150 scenes, whereas our dataset includes 3,500 scenes.

Another major line of research that would benefit from our dataset is image synthesis. With a photo-realistic rendering engine, we are able to generate images given any scene configurations and viewpoints. These images may be used as ground truth for tasks including image inpainting (e.g., completing an image when certain furniture is removed) and novel view synthesis.

Finally, we would like to emphasize the potential of our dataset in terms of extension capabilities. As we mentioned before, the unified representation enables us to include many other types of structure in the dataset. As for 2D rendering, depending on the application, we can easily simulate different effects such as lighting conditions, fisheye and novel camera designs, motion blur, and imaging noise. Furthermore, the dataset may be extended to include videos for applications such as visual SLAM  [5].

5 Experiments

5.1 Experiment Setup

To demonstrate the benefits of our dataset, we use it to train deep neural networks for room layout estimation, an important task in structured 3D modeling.

Real Dataset. We use the same dataset as LayoutNet  [41]. The dataset consists of images from PanoContext  [37] and 2D-3D-S  [3], including 818 training images, 79 validation images, and 166 test images. Note that both datasets only provide cuboid layout annotations.

Table 3. Room layout statistics. \(^\dag \): MatterportLayout is the only other dataset with non-cuboid layout annotations, but is unavailable at the time of publication.

Our Structured3D Dataset. In this experiment, we use a subset of panoramas with the original lighting and full configuration. Each panorama corresponds to a different room in our dataset. We show statistics of different room layouts in our dataset in Table 3. Since the current real dataset only contains cuboid layout annotations (i.e., 4 corners), we choose 12k panoramic images with cuboid layouts from our dataset. We split the images into 10k for training, 1k for validation, and 1k for testing.

Evaluation Metrics. Following  [26, 41], we adopt three standard metrics: (i) 3D IoU: intersection over union between predicted 3D layout and the ground truth, (ii) Corner Error (CE): normalized \(\ell _2\) distance between predicted corner and ground truth, and (iii) Pixel Error (PE): pixel-wise error between predicted plane classes and ground truth.
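For concreteness, the snippet below sketches how two of these metrics can be computed; the exact normalization and corner-matching conventions follow the respective benchmarks, so treat this as illustrative rather than the official evaluation code.

```python
import numpy as np

def corner_error(pred_corners, gt_corners, img_h, img_w):
    """Mean L2 distance between matched corners, normalized by the image
    diagonal (the normalization we assume here; conventions vary)."""
    d = np.linalg.norm(pred_corners - gt_corners, axis=1)
    return d.mean() / np.sqrt(img_h ** 2 + img_w ** 2)

def pixel_error(pred_mask, gt_mask):
    """Fraction of pixels whose predicted plane class (ceiling / floor /
    wall id) differs from the ground truth."""
    return np.mean(pred_mask != gt_mask)

# Example on a 512x1024 panorama with 4 layout corners (row, column):
pred = np.array([[100, 250], [100, 760], [400, 250], [400, 760]], float)
gt   = np.array([[102, 252], [ 98, 758], [401, 249], [402, 761]], float)
ce = corner_error(pred, gt, 512, 1024)                          # ≈ 0.002
pe = pixel_error(np.zeros((512, 1024)), np.zeros((512, 1024)))  # 0.0
```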

Baselines. We choose two recent CNN-based approaches, LayoutNet [41, 42] and HorizonNet [26], based on their performance and source code availability. LayoutNet uses a CNN to predict a corner probability map and a boundary map from the panorama and vanishing lines, then optimizes the layout parameters based on the network predictions. HorizonNet represents a room layout as three 1D vectors, i.e., the floor-wall and ceiling-wall boundary positions and the existence of wall-wall boundaries, and trains a CNN to directly predict these three vectors. We follow the default training settings of the respective methods. For specific training procedures, please refer to the supplementary materials.

5.2 Experiment Results

Table 4. Quantitative evaluation under different training schemes. The best and the second best results are boldfaced and underlined, respectively.

Augmenting Real Datasets. In this experiment, we train LayoutNet and HorizonNet in four different manners: (i) training only on our synthetic dataset (“s”), (ii) training only on the real dataset (“r”), (iii) training on the synthetic and real datasets with Balanced Gradient Contribution (BGC) [22] (“s + r”), and (iv) pre-training on our synthetic dataset, then fine-tuning on the real dataset (“s \(\rightarrow \) r”). We adopt the training set of LayoutNet as the real dataset in this experiment. The results are shown in Table 4. As one can see, augmenting real datasets with our synthetic data boosts the performance of both networks. We refer readers to the supplementary materials for more qualitative results.
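Our reading of BGC [22] is that every mini-batch mixes synthetic and real samples in a fixed proportion, so both domains contribute to each gradient step; the sketch below (with hypothetical loader, model, and loss names) shows one way such mixed batches can be formed in PyTorch.

```python
import itertools
import torch

def train_mixed_epoch(synth_loader, real_loader, model, optimizer, loss_fn):
    """One epoch of mixed-domain training: each update sees a fixed
    proportion of synthetic and real samples (our reading of BGC [22]).
    The mixing ratio is set via the two loaders' batch sizes."""
    for (xs, ys), (xr, yr) in zip(synth_loader, itertools.cycle(real_loader)):
        x = torch.cat([xs, xr], dim=0)   # e.g. 3:1 synthetic : real
        y = torch.cat([ys, yr], dim=0)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```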

Table 5. Quantitative evaluation using varying synthetic data size in pre-training. The best and the second best results are boldfaced and underlined, respectively.

Performance vs. Synthetic Data Size. We further study the relationship between the number of synthetic images used in pre-training and the accuracy on the real dataset. We sample 1k, 5k, and 10k synthetic images for pre-training, then fine-tune the model on the real dataset. The results are shown in Table 5. As expected, using more synthetic data generally improves the performance.

Table 6. Domain adaptation results. NA: non-adaptive baseline. +DA: align layout estimation output. +Depth: align both layout estimation and depth outputs. Real: train in the target domain.

Domain Adaptation. Domain adaptation techniques (e.g., [27]) have been shown to be effective in bridging the performance gap when directly applying models learned on synthetic data to real environments. In this experiment, we do not assume access to ground truth layout labels in the real dataset. We adopt LayoutNet as the task network and use PanoContext and 2D-3D-S separately as the target domain. We apply a discriminator network to align the output features of the LayoutNet for the two domains. Inspired by [8], we further leverage the multi-modal annotations in our dataset by adding another decoder branch to the LayoutNet for depth prediction. We concatenate the boundary, corner, and depth predictions as the input to the discriminator network. The results are shown in Table 6. By incorporating additional information, i.e., the depth map, we further boost the performance on both datasets. This illustrates the advantage of including multiple types of ground truth in our dataset.
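The sketch below conveys the gist of this output-level alignment in the spirit of [8, 27]: a small discriminator tries to tell whether the concatenated boundary, corner, and depth predictions come from the synthetic or the real domain, and the task network is trained to fool it. The architecture, channel counts, and loss weight are illustrative assumptions, not our exact setup.

```python
import torch
import torch.nn as nn

# Hypothetical discriminator over concatenated boundary, corner, and depth
# prediction maps (3 input channels here; purely illustrative).
disc = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=2, padding=1),
)
bce = nn.BCEWithLogitsLoss()

def adaptation_losses(pred_syn, pred_real, lam=0.01):
    """pred_* are concatenated (boundary, corner, depth) maps, shape (B,3,H,W).
    Returns the discriminator loss and the adversarial loss that pushes the
    task network's real-domain outputs toward the synthetic distribution."""
    d_syn, d_real = disc(pred_syn.detach()), disc(pred_real.detach())
    loss_disc = bce(d_syn, torch.ones_like(d_syn)) + \
                bce(d_real, torch.zeros_like(d_real))
    loss_adv = lam * bce(disc(pred_real), torch.ones_like(d_real))
    return loss_disc, loss_adv
```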

Fig. 7. Limitation of real datasets. Left: PanoContext dataset. Right: 2D-3D-S dataset. Blue lines are ground truth layouts and green lines are predictions (Color figure online).

Limitation of Real Datasets. Due to human errors, the annotations in real datasets are not always consistent with the actual room layout. In the left image of Fig. 7, the room has a non-cuboid layout, but the ground truth is labeled as a cuboid. In the right image, the front wall is not labeled in the ground truth. These examples illustrate the limitation of using real datasets as benchmarks. We avoid such errors in our dataset by automatically generating ground truth from the original design files.

6 Conclusion

In this paper, we present Structured3D, a large synthetic dataset with rich ground truth 3D structure annotations of 21,835 rooms and more than 196k photo-realistic 2D renderings. Among many potential use cases of our dataset, we further demonstrate its benefit in augmenting real data and facilitating domain adaptation for the room layout estimation task.

We view this work as an important and exciting step towards building intelligent machines which can achieve human-level holistic 3D scene understanding. In the future, we will continue to add more 3D structure annotations of the scenes and objects to the dataset, and explore novel ways to use the dataset to advance techniques for structured 3D modeling and understanding.