1 Introduction

3D semantic segmentation is a fundamental perception task that has received substantial attention from both industry and academia due to its wide applications in robotics, augmented reality, and human-computer interaction, to name a few. Data-hungry deep learning approaches have attained remarkable success on 3D semantic segmentation [22, 30, 46, 55, 61, 63]. Nevertheless, harvesting a large amount of annotated data is expensive and time-consuming [3, 7].

An appealing avenue to overcome such data scarcity is to leverage simulation data where both data and labels can be obtained for free. Simulated datasets can be arbitrarily large, easily adapted to different label spaces and customized for various usages [8, 18, 26, 34, 53, 72]. However, due to notorious domain gaps in point patterns and context (see Fig. 1), models trained on simulated scenes suffer drastic performance degradation when generalized to real-world scenarios. This motivates us to study sim-to-real unsupervised domain adaptation (UDA), leveraging labeled source data (simulation) and unlabeled target data (real) for effectively adapting knowledge across domains.

Recent efforts on 3D domain adaptation for outdoor scene parsing have made considerable progress [23, 29, 60, 71]. However, they often adopt the LiDAR-specific range image format, which is not applicable to indoor scenes constructed from RGB-D sequences. Besides, such outdoor approaches can be sub-optimal in addressing the indoor domain gaps arising from different scene construction processes. Further, indoor scenes have more sophisticated interior context than outdoor ones, which makes the context gap a more essential issue in indoor settings. Here, we explore sim-to-real UDA in the 3D indoor scenario, which is challenging and largely underexplored.

Fig. 1.

The domain gaps between simulated scenes from 3D-FRONT [8] and real-world scenes from ScanNet [7]. (a): The point pattern gap. The simulated scene is perfect, without occlusions or noise, while the real-world scene inevitably contains scan occlusion and noise patterns such as rough surfaces. (b): The context gap. While the simulated scene has a simple layout with regularly placed objects, the real scene is complex with cluttered interiors.

Challenges.  Our empirical studies on sim-to-real adaptation reveal two unique challenges in this setting: the point pattern gap owing to different sensing mechanisms, and the context gap due to dissimilar semantic layouts. As shown in Fig. 1(a), simulated scenes tend to contain complete objects and smooth surfaces, while real scenes include inevitable scan occlusions and noise patterns introduced when reconstructing point clouds from RGB-D videos captured by depth cameras [3, 7]. Also, even professionally designed layouts in simulated scenes are much simpler and more regular than real layouts, as illustrated in Fig. 1(b).

To tackle the above domain gaps, we develop DODA, a holistic two-stage pipeline with a pretraining stage and a self-training stage, a design widely proven effective in UDA settings [49, 64, 73]. As the root of the challenges lies in “data”, we design two data-oriented modules which dramatically reduce domain gaps without incurring any computational cost during inference. Specifically, we develop Virtual Scan Simulation (VSS) to mimic the occlusion and noise patterns that occur during the construction of real scenes. Such pattern imitation yields a model that transfers better to real-world data. Afterwards, to adapt the model to the target domain, we design Tail-aware Cuboid Mixing (TACM) to boost self-training. While source supervision stabilizes gradients with clean labels during self-training, it unfortunately introduces context bias. Thus, TACM creates an intermediate domain by splitting, permuting, mixing and re-sampling source and target cuboids, which explicitly mitigates the context gap by breaking and rectifying the source bias with target pseudo-labeled data, and simultaneously eases the long-tail issue by oversampling tail cuboids.

To the best of our knowledge, we are the first to explore unsupervised domain adaptation for 3D indoor semantic segmentation. To verify the effectiveness of DODA, we construct the first 3D indoor sim-to-real UDA benchmark on the simulated dataset 3D-FRONT [8] and two widely used real-world scene understanding datasets, ScanNet [7] and S3DIS [3], along with 8 popular UDA methods with task-specific modifications as baselines. Experimental results show that DODA obtains 22% and 19% mIoU gains over the source-only model on 3D-FRONT \(\rightarrow \) ScanNet and 3D-FRONT \(\rightarrow \) S3DIS, respectively. Even compared to existing UDA methods, over 13% improvement is still achieved. It is also noteworthy that the proposed VSS lifts previous UDA methods by a large margin (8% \(\sim \) 14%) as a plug-and-play data augmentation, and TACM further facilitates real-world cross-site adaptation tasks with 4% \(\sim \) 5% improvements.

2 Related Work

3D Indoor Semantic Segmentation focuses on obtaining point-wise category predictions from point clouds, a fundamental yet challenging task due to the irregularity and sparsity of 3D point clouds. Some previous works [41, 53] feed 3D grids constructed from point clouds into 3D convolutional neural networks. Other approaches [6, 16] employ sparse convolution [17] to exploit the sparsity of the 3D voxel representation and accelerate computation. Another line of work [25, 45, 46, 61, 70] directly extracts feature embeddings from raw point clouds with hierarchical feature aggregation schemes. Recent methods [55, 63] assign position-related kernel functions to local point areas to perform dynamic convolutions. Additionally, graph-based works [31, 52, 59] adopt graph convolutions that mimic the point cloud structure for point representation learning. Although the above methods achieve prominent performance on various indoor scene datasets, they require large-scale human-annotated datasets, a requirement we aim to relax using simulation data. Our experimental investigation is built upon the sparse-convolution-based U-Net [6, 16] due to its high performance.

Unsupervised Domain Adaptation aims at adapting models trained on annotated source data to unlabeled target samples. Because it spares annotation effort for data-hungry deep neural networks, UDA has received great attention from the computer vision community. Some previous works [38, 39] attempt to learn domain-invariant representations by minimizing the maximum mean discrepancy [5]. Another line of research leverages adversarial training [14] to align distributions in feature [10, 20, 50], pixel [13, 20, 21] or output space [56] across domains. Adversarial attacks [15] have also been utilized in [35, 66] to train domain-invariant classifiers. Recently, self-training has been investigated for this problem [49]; it formulates UDA as a supervised learning problem guided by pseudo-labeled target data and achieves state-of-the-art performance in semantic segmentation [73] and object detection [28, 48].

Lately, with the rise of 3D vision, UDA has also attracted much attention in 3D tasks such as 3D object classification [1, 47], 3D outdoor semantic segmentation [23, 29, 44, 60, 67] and 3D outdoor object detection [40, 58, 64, 65, 69]. In particular, Wu et al. [60] propose intensity rendering, geodesic alignment and domain calibration modules to close the sim-to-real gaps of outdoor 3D semantic segmentation datasets. Jaritz et al. [23] explore multi-modality UDA by leveraging images and point clouds simultaneously. Nevertheless, no previous work studies UDA on 3D indoor scenes, and the unique point pattern and context gaps render 3D outdoor UDA approaches not readily applicable to indoor scenarios. Hence, in this work, we make the first attempt at UDA for 3D indoor semantic segmentation. Particularly, we focus on the most practical and challenging scenario – simulation-to-real adaptation.

Data Augmentation for UDA has also been investigated to remedy data-level gaps across domains. Data augmentation techniques have been widely employed to construct an intermediate domain [13, 29, 48] that benefits optimization and facilitates gradual domain adaptation. However, they mainly target image-like input formats, which are not suitable for sparse and irregular raw 3D point clouds. Different from existing works, we build a holistic pipeline with two data-oriented modules in the two stages that manipulate raw point clouds to mimic target point cloud patterns and create a cuboid-based intermediate domain.

3 Method

3.1 Overview

In this work, we aim at adapting a 3D semantic scene parsing model trained on a source domain \(\mathcal {D}_s = \{(P^s_i, Y^s_i)\}_{i=1}^{N_s}\) of \(N_s\) samples to an unlabeled target domain \(\mathcal {D}_t = \{P^t_i\}_{i=1}^{N_t}\) of \(N_t\) samples. P and Y represent the point cloud and the point-wise semantic labels respectively.

In this section, we present DODA, a data-oriented domain adaptation framework that simultaneously closes the pattern and context gaps by imitating target patterns and breaking the source bias with a generated intermediate domain. Specifically, as shown in Fig. 2, DODA begins with pretraining the 3D scene parsing model F on labeled source data with our proposed virtual scan simulation module for better generalization. VSS places virtual cameras in feasible regions of source scenes to simulate occlusion patterns, and jitters source points to imitate the sensing and reconstruction noise of real scenes. Pseudo labels are then generated with the pretrained model. In the self-training stage, we develop tail-aware cuboid mixing to build an intermediate domain between source and target, constructed by splitting and mixing cuboids from both domains. Besides, cuboids containing a high percentage of tail classes are over-sampled to overcome the class imbalance issue when learning with pseudo-labeled data. Elaborations of our tailored VSS and TACM are presented in the following parts.

Fig. 2.

Our DODA framework consists of two data-oriented modules: Virtual Scan Simulation (VSS) and Tail-aware Cuboid Mixing (TACM). VSS mimics real-world data patterns and TACM constructs an intermediate domain by mixing source and target cuboids. P denotes the point cloud; Y denotes the semantic labels and \(\hat{Y}\) denotes the pseudo labels. The superscripts s, t and m stand for the source, target and intermediate domains, respectively. Colored arrows denote the source training flow, the target training flow and the target pseudo-label generation procedure. Best viewed in color. (Color figure online)

3.2 Virtual Scan Simulation

DODA starts by training a 3D scene parsing network on labeled source data, which provides pseudo labels on the target domain in the subsequent self-training stage. Hence, a model with good generalization ability is highly desirable. As analyzed in Sect. 1, different scene construction procedures cause point pattern gaps across domains, significantly hindering the transferability of source-trained models. Specifically, we find that the absence of occlusion patterns and sensing or reconstruction noise in simulation scenes causes severe negative transfer during adaptation, which cannot be readily addressed by previous UDA methods (see Sect. 5). This is potentially because models trained on clean source data are incapable of extracting useful features for challenging real-world scenarios with ubiquitous occlusions and noise. To this end, we propose a plug-and-play data augmentation technique, namely virtual scan simulation, which imitates the camera scanning procedure to augment the simulation data.

VSS includes two parts: the occlusion simulation that puts virtual cameras in feasible regions of simulated scenes to imitate occlusions in the scanning process, and the noise simulation that randomly jitters point patterns to mimic sensing or reconstruction errors, through which the pattern gaps are largely bridged.

Fig. 3.

Virtual scan simulation. (a): We simulate occlusion patterns by simulating camera poses and determining visible ranges. (b): We simulate noise by randomly jittering points to generate realistic irregular point patterns such as rough surfaces.

Occlusion Simulation.   Scenes in real-world datasets are reconstructed from RGB-D frame sequences and thus suffer from inevitable occlusions, while simulated scenes contain complete objects without any hidden points. We mimic occlusion patterns on the simulation data by simulating the real-world data acquisition procedure, which we divide into the following three steps:

a) Simulate camera poses. To put virtual cameras in a given simulation scene, we need to determine camera poses, including camera positions and camera orientations. First, feasible camera positions where a handheld camera can be placed are determined by checking the free space in the simulated environment. We voxelize and project \(P^s\) to the bird's eye view (BEV) and remove voxels containing instance points or the room boundary. The centers of the remaining free-space voxels are considered feasible x-y coordinates for virtual cameras, as shown in Fig. 3(a) (i). For the z axis, we randomly sample the camera height in the top half of the room.

Second, for each camera position v, we randomly generate a camera orientation using the direction from the camera position v to a corresponding randomly sampled point of interest h on the wall, as shown in Fig. 3(a) (i). This ensures that simulated camera orientations are uniformly distributed among all potential directions without being influenced by scene-specific layout bias.
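For concreteness, the following is a minimal NumPy sketch of this pose-sampling step. It assumes the occupied BEV voxels (instances plus room boundary) and a set of wall points are precomputed; the function name, voxel size and default values are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def sample_camera_poses(points, occupied_xy, wall_points, n_v=4, voxel=0.5, rng=None):
    """points: (N, 3) scene points; occupied_xy: (M, 2) BEV voxel indices covered by
    instances or the room boundary; wall_points: (W, 3) points on walls; n_v: #cameras.
    Returns a list of (position, orientation) camera poses."""
    rng = rng or np.random.default_rng()
    xy_min, xy_max = points[:, :2].min(0), points[:, :2].max(0)
    grid = np.ceil((xy_max - xy_min) / voxel).astype(int)
    free = np.ones(grid, dtype=bool)
    free[occupied_xy[:, 0], occupied_xy[:, 1]] = False        # drop occupied BEV voxels
    free_idx = np.argwhere(free)                              # assumes >= n_v free voxels
    z_min, z_max = points[:, 2].min(), points[:, 2].max()

    poses = []
    for idx in free_idx[rng.choice(len(free_idx), n_v, replace=False)]:
        xy = xy_min + (idx + 0.5) * voxel                     # free voxel center -> (x, y)
        z = rng.uniform((z_min + z_max) / 2.0, z_max)         # height in the top half of the room
        v = np.array([xy[0], xy[1], z])                       # camera position
        h = wall_points[rng.integers(len(wall_points))]       # random point of interest on a wall
        d = (h - v) / (np.linalg.norm(h - v) + 1e-8)          # camera orientation from v toward h
        poses.append((v, d))
    return poses
```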

b) Determine visible range. Given a virtual camera pose and a simulated 3D scene, we can determine the spatial range that the camera covers, i.e., \(R_v \), which is determined by the camera field of view (FOV) (see Fig. 3(a) (ii)). To ease the modeling difficulty, we decompose the FOV into the horizontal viewing angle \(\alpha _h\), the vertical viewing angle \(\alpha _v\) and the viewing mode \(\eta \), which determine the horizontal range, the vertical range and the shape of the viewing frustum, respectively. For the viewing mode \(\eta \), we approximate three versions from simple to sophisticated, namely fixed, parallel and perspective, with details presented in the supplementary materials. As illustrated in Fig. 3(a) (ii), we show an example of the visible range \(R_v\) with random \(\alpha _h\) and \(\alpha _v\), and \(\eta \) set to the fixed mode.

c) Determine visible points. After obtaining the visible range \(R_v\), we determine the visibility of each point within \(R_v\). Specifically, we convert the point cloud to the camera coordinate system and extend [27] with spherical projection to filter out occluded points and obtain the visible points. By taking the union of visible points from all virtual cameras, we finally obtain the point set \(P^s_v\) with occluded points removed. In this way, we generate occlusion patterns in simulation scenes by mimicking the real-world scanning process, and we adjust the intensity of occlusion by changing the number of camera positions \(n_v\) and the FOV configuration to ensure that enough semantic context is covered for model learning.
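A simplified NumPy sketch of the visibility test for a single camera is given below. It approximates the extension of [27] with a spherical-projection range buffer: points are restricted to the FOV, binned by azimuth and elevation, and only the nearest point per angular bin is kept. The angular resolution and the yaw-only alignment with the camera direction are simplifying assumptions.

```python
import numpy as np

def visible_points(points, cam_pos, cam_dir, alpha_h=180.0, alpha_v=75.0, res_deg=0.5):
    """Approximate occlusion filtering via a spherical-projection range buffer.
    Returns indices of points visible from one camera (a sketch, not the exact filter)."""
    rel = points - cam_pos                                    # camera-centered coordinates
    yaw0 = np.arctan2(cam_dir[1], cam_dir[0])
    yaw = np.arctan2(rel[:, 1], rel[:, 0]) - yaw0             # azimuth w.r.t. view direction
    yaw = (yaw + np.pi) % (2 * np.pi) - np.pi                 # wrap to [-pi, pi)
    dist = np.linalg.norm(rel, axis=1) + 1e-8
    pitch = np.arcsin(np.clip(rel[:, 2] / dist, -1.0, 1.0))   # elevation angle

    in_fov = (np.abs(np.degrees(yaw)) <= alpha_h / 2) & (np.abs(np.degrees(pitch)) <= alpha_v / 2)
    idx = np.where(in_fov)[0]

    # Spherical range image: keep only the closest point falling into each angular bin.
    u = ((np.degrees(yaw[idx]) + alpha_h / 2) / res_deg).astype(int)
    v = ((np.degrees(pitch[idx]) + alpha_v / 2) / res_deg).astype(int)
    key = u * (int(alpha_v / res_deg) + 2) + v                # unique id per (u, v) bin
    order = np.argsort(dist[idx])                             # nearest points first
    _, first = np.unique(key[order], return_index=True)       # first occurrence = nearest per bin
    return idx[order[first]]
```

The final point set \(P^s_v\) is then obtained by taking the union of the returned indices over all \(n_v\) camera poses.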

Noise Simulation. Besides occlusion patterns, sensing and reconstruction errors are unavoidable when generating 3D point clouds from sensor-captured RGB-D videos, which results in non-uniformly distributed points and rough surfaces in real-world datasets (see Fig. 1(a)). To account for this, we equip our VSS with a noise simulation module, which injects perturbations into each point as follows:

$$\begin{aligned} \tilde{P}^s = \{p + \varDelta p \ | \ p \in P^s_v \}, \end{aligned}$$
(1)

where \(\varDelta p\) denotes the point perturbation following a uniform distribution ranging from \(-\delta _p\) to \(\delta _p\), and \(\tilde{P}^s\) is the perturbed simulation point cloud. Though simple, this module efficiently imitates the noise in terms of non-uniform and irregular point patterns, as illustrated in Fig. 3.
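In code, the noise simulation reduces to a one-line perturbation; the sketch below uses an illustrative value for \(\delta _p\).

```python
import numpy as np

def jitter_points(points, delta_p=0.02, rng=None):
    """Perturb each point with uniform noise in [-delta_p, delta_p] per axis (Eq. 1);
    delta_p = 0.02 is an illustrative value, not the tuned setting."""
    rng = rng or np.random.default_rng()
    return points + rng.uniform(-delta_p, delta_p, size=points.shape)
```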

Model Pretraining on Source Data. Adopting VSS as a data augmentation for simulated data, we train the model with the cross-entropy loss in Eq. (2), following the settings in [24, 37].

$$\begin{aligned} \min \mathcal {L}_{pre} = \sum \limits _{i=1}^{N_s} \text {CE}(S^s_i, Y^s_{i}), \end{aligned}$$
(2)

where \(\text {CE}(\cdot ,\cdot )\) is the cross-entropy loss and \(S\) denotes the predicted semantic scores after applying softmax to the logits.
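A minimal PyTorch-style sketch of one pretraining step is shown below. The `augment` callable stands in for VSS (occlusion plus noise simulation), and the per-point logits/labels layout is an assumed convention.

```python
import torch.nn.functional as F

def pretrain_step(model, points, labels, optimizer, augment):
    """One source-domain pretraining step following Eq. (2)."""
    points, labels = augment(points, labels)      # apply VSS on the fly
    logits = model(points)                        # (N, c) per-point logits
    loss = F.cross_entropy(logits, labels)        # CE over softmax scores S^s
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```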

Fig. 4.

An illustration of tail-aware cuboid mixing, which contains cuboid mixing and tail cuboid over-sampling. Notice that for clarity, we take \((n_x,n_y,n_z)=(2,2,1)\) as an example.

3.3 Tail-aware Cuboid Mixing

After obtaining a more transferable scene parsing model with VSS augmentation, we further adopt self-training [32, 54, 62, 69, 73] to adapt the model by directly utilizing target pseudo-labeled data for supervision. Since target pseudo labels are rather noisy, containing incorrect labels that lead to erroneous supervision [65], we also introduce source supervision to harvest its clean annotations and improve the fraction of correct labels. However, directly utilizing source data unfortunately brings source bias and large discrepancies into the joint optimization. Even though point pattern gaps have already been alleviated by the proposed VSS, the model still suffers from the context gap due to different scene layouts. Fortunately, the availability of target domain data gives us the chance to rectify such context gaps. To this end, we design Tail-aware Cuboid Mixing (TACM) to construct an intermediate domain \(\mathcal {D}_m\) that combines source and target cuboid-level patterns (see Fig. 4), which augments and rectifies source layouts with target-domain context. Besides, it decreases the difficulty of jointly optimizing over source and target domains with large distribution discrepancies by providing a bridge for adaptation. TACM further moderates the pseudo label class imbalance by cuboid-level tail-class oversampling. Details on pseudo labeling, cuboid mixing and tail cuboid oversampling are as follows.

Pseudo Label Generation. To employ self-training after pretraining, we first need to generate pseudo labels \(\hat{Y}^t\) for target scenes \(P^t\). Similar to previous paradigms [24, 62, 65, 73], we obtain pseudo labels via the following equation:

$$\begin{aligned} \hat{Y}^t_{i,j} = {\left\{ \begin{array}{ll} 1, &{} {\textbf {if}}~\max (S^t_i) > T~\text {and}~j=\mathop {\arg \max } S^t_i, \\ 0, &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(3)

where \(\hat{Y}^t_{i}=[\hat{Y}^t_{i,1},\cdots ,\hat{Y}^t_{i, c}]\), c is the number of classes and T is the confidence threshold to filter out uncertain predictions.
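In implementation, Eq. (3) amounts to confidence thresholding on the softmax scores. The sketch below stores pseudo labels as class indices with an ignore index instead of the one-hot form of Eq. (3), a common implementation choice; the threshold value is illustrative.

```python
import torch

def generate_pseudo_labels(logits, threshold=0.9, ignore_index=-100):
    """Per-point pseudo labels via confidence thresholding (Eq. 3)."""
    scores = torch.softmax(logits, dim=-1)        # (N, c) semantic scores S^t
    conf, labels = scores.max(dim=-1)
    labels[conf <= threshold] = ignore_index      # uncertain predictions are discarded
    return labels
```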

Cuboid Mixing. Given labeled source data and pseudo-labeled target data, we carry out cuboid mixing to construct a new intermediate domain \(\mathcal {D}_m\), as shown in Fig. 2 and Fig. 4. For each target scene, we randomly sample a source scene to perform cuboid mixing. We first partition the two scenes into several cuboids of varying sizes, which serve as the smallest units for mixing, as in Eq. (4):

$$\begin{aligned} P = \{\gamma _{ijk}\}, i \in \{1, ..., n_x\}, j \in \{1, ..., n_y\}, k \in \{1, ..., n_z\}, \nonumber \\ \gamma _{ijk}=\{p \mid p~\text {in}~[x_{i-1}, y_{j-1}, z_{k-1}, x_{i}, y_{j}, z_{k}]\}, \end{aligned}$$
(4)

where \(\gamma _{ijk}\) denotes a single cuboid; \(n_x\), \(n_y\) and \(n_z\) stand for the number of partitions along the x, y and z axes, respectively; and each cuboid \(\gamma _{ijk}\) is constrained in a six-tuple bounding box \([x_{i-1}, y_{j-1}, z_{k-1}, x_{i}, y_{j}, z_{k}]\) defined by the partition positions \(x_i, y_j, z_k\) of the corresponding dimensions. These partition positions are first initialized as equal divisions and then perturbed with randomness to enhance diversity, as below:

$$\begin{aligned} x_{i}= {\left\{ \begin{array}{ll} \frac{i}{n_x}\max p_x+(1-\frac{i}{n_x})\min p_x, &{}{\textbf {if}}~i\in \{0, n_x\},\\ \frac{i}{n_x}\max p_x+(1-\frac{i}{n_x})\min p_x + \varDelta \phi , &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(5)

where \(\varDelta \phi \) is a random perturbation following a uniform distribution ranging from \(-\delta _\phi \) to \(\delta _\phi \). The same formulation is also adopted for \(y_j\) and \(z_k\). After partitioning, the source and target cuboids are first spatially permuted with probability \(\rho _{s}\) and then randomly mixed with probability \(\rho _{m}\), as depicted in Fig. 4 and Fig. 2.
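The sketch below illustrates the cuboid partition of Eqs. (4) and (5) and a simplified cuboid mixing in NumPy. The perturbation range, the mixing probability and the omission of the spatial permutation step (probability \(\rho _s\)) are simplifying assumptions.

```python
import numpy as np

def partition_boundaries(coords, n, delta_phi, rng):
    """Jittered equal-division boundaries along one axis (Eq. 5); endpoints stay fixed.
    Assumes delta_phi is smaller than half a cell so boundaries remain ordered."""
    lo, hi = coords.min(), coords.max()
    b = lo + (hi - lo) * np.arange(n + 1) / n
    b[1:-1] += rng.uniform(-delta_phi, delta_phi, size=max(n - 1, 0))
    return b

def cuboid_ids(points, n_xyz=(2, 2, 1), delta_phi=0.3, rng=None):
    """Assign every point to a cuboid index following Eq. (4); delta_phi is illustrative."""
    rng = rng or np.random.default_rng()
    ids = np.zeros(len(points), dtype=int)
    for axis, n in enumerate(n_xyz):
        b = partition_boundaries(points[:, axis], n, delta_phi, rng)
        bins = np.clip(np.searchsorted(b, points[:, axis], side="right") - 1, 0, n - 1)
        ids = ids * n + bins                      # composite (i, j, k) index
    return ids

def cuboid_mix(p_s, y_s, p_t, y_t, n_xyz=(2, 2, 1), rho_m=0.5, rng=None):
    """For each cuboid, take the target cuboid with probability rho_m, else the source one."""
    rng = rng or np.random.default_rng()
    ids_s, ids_t = cuboid_ids(p_s, n_xyz, rng=rng), cuboid_ids(p_t, n_xyz, rng=rng)
    parts_p, parts_y = [], []
    for k in range(int(np.prod(n_xyz))):
        p, y, m = (p_t, y_t, ids_t == k) if rng.random() < rho_m else (p_s, y_s, ids_s == k)
        parts_p.append(p[m])
        parts_y.append(y[m])
    return np.concatenate(parts_p), np.concatenate(parts_y)
```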

Though ConDA [29] shares some similarities with our cuboid mixing by mixing source and target data, it aims to preserve cross-domain context consistency while ours attempts to mitigate context gaps. Besides, ConDA operates on 2D range images, which is inapplicable to reconstructed indoor scenes obtained by fusing depth images. Our cuboid mixing leverages the freedom of the raw 3D representation, i.e., the point cloud, and is thus generalizable to arbitrary 3D scenarios.

Tail Cuboid Over-Sampling. Besides embedding target context to source data, our cuboid mixing technique also allows adjusting the category distributions by designing cuboid sampling strategies. Here, as an add-on advantage, we leverage this nice property to alleviate the biased pseudo label problem [2, 19, 36, 73] in self-training: tail categories only occupy a small percentage of pseudo labeled data. Specifically, we sample cuboids with tail categories more frequently, namely tail cuboid over-sampling, detailed as follows.

We calculate the per-class pseudo label ratio \(r \in [0, 1]^{c}\) and define the \(n_r\) least common categories as tail categories. A cuboid is regarded as a tail cuboid if its pseudo label ratio exceeds the corresponding value in r for at least one of the \(n_r\) tail categories. We construct a tail cuboid queue Q of size \(N_q\) to store tail cuboids. Formally, \(\gamma ^t_{q[w]}\) denotes the \(w^{th}\) tail cuboid in Q, as shown in Fig. 4. Notice that throughout training, Q is dynamically updated with a First In, First Out (FIFO) rule since cuboids are randomly split in each iteration as in Eq. (4). In each training iteration, we ensure that at least u tail cuboids are present in each mixed scene by sampling cuboids from Q and replacing existing cuboids if needed. With this simple over-sampling strategy, we make the cuboid mixing process tail-aware and relieve the class imbalance issue in self-training. Experimental results in Sect. 6 further demonstrate the effectiveness of our tail cuboid over-sampling strategy.
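A simplified sketch of the tail cuboid queue is shown below; the queue size, the number of tail classes and the helper names are illustrative.

```python
from collections import deque
import numpy as np

class TailCuboidQueue:
    """FIFO queue Q of cuboids rich in tail-class pseudo labels."""
    def __init__(self, class_ratio, n_r=3, size=100):
        self.class_ratio = np.asarray(class_ratio)               # per-class pseudo label ratio r
        self.tail_classes = np.argsort(self.class_ratio)[:n_r]   # n_r least common categories
        self.queue = deque(maxlen=size)                           # FIFO update

    def maybe_push(self, cuboid_points, cuboid_labels):
        # A tail cuboid exceeds the dataset-level ratio r on at least one tail category.
        for c in self.tail_classes:
            if np.mean(cuboid_labels == c) > self.class_ratio[c]:
                self.queue.append((cuboid_points, cuboid_labels))
                return True
        return False

    def sample(self, u, rng=None):
        # Draw up to u tail cuboids to guarantee their presence in a mixed scene.
        rng = rng or np.random.default_rng()
        idx = rng.choice(len(self.queue), size=min(u, len(self.queue)), replace=False)
        return [self.queue[i] for i in idx]
```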

Self-training with Target and Source Data. In the self-training stage, VSS is first adopted to augment the source domain data to reduce the pattern gap, and then TACM mixes source and target scenes to construct a tail-aware intermediate domain \(\mathcal {D}_m=\{P^m\}\) with labels \(\hat{Y}^m\) combining source ground truth and target pseudo labels. To alleviate the noisy supervision from incorrect target pseudo labels, we minimize the dense cross-entropy loss on source data \(\tilde{P}^s\) and intermediate-domain data \(P^m\) as below:

$$\begin{aligned} \min \mathcal {L}_{st} = \sum \limits _{i=1}^{N_t} \text {CE}(S^m_i, \hat{Y}^m_{i}) + \lambda \sum \limits _{i=1}^{N_s} \text {CE}(S^s_i, Y^s_{i}), \end{aligned}$$
(6)

where \(\lambda \) denotes the trade-off factor between losses.
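A minimal PyTorch sketch of Eq. (6) is given below; using an ignore index for filtered pseudo labels and the default value of \(\lambda \) are implementation assumptions.

```python
import torch.nn.functional as F

def self_training_loss(logits_m, y_m, logits_s, y_s, lam=1.0, ignore_index=-100):
    """Eq. (6): CE on the TACM-mixed intermediate-domain scene plus weighted CE on the
    (VSS-augmented) source scene; unreliable pseudo labels carry the ignore index."""
    loss_m = F.cross_entropy(logits_m, y_m, ignore_index=ignore_index)
    loss_s = F.cross_entropy(logits_s, y_s)
    return loss_m + lam * loss_s
```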

4 Benchmark Setup

4.1 Datasets

3D-FRONT [8] is a large-scale dataset of synthetic 3D indoor scenes, containing 18,968 rooms furnished with 13,151 CAD 3D furniture objects from 3D-FUTURE [9]. The room layouts are created by professional designers and span 31 scene categories and 34 object semantic super-classes. We randomly select 4,995 rooms as training samples and 500 rooms as validation samples after filtering out noisy rooms. Notice that we obtain source point clouds by uniformly sampling points from the original meshes with CloudCompare [12] at a surface density of 1,250 points per square unit. A comparison between 3D-FRONT and other simulation datasets is detailed in the supplemental materials.

ScanNet [7] is a popular real-world indoor 3D scene understanding dataset, consisting of 1,613 real 3D scans with dense semantic annotations (1,201 scans for training, 312 scans for validation and 100 scans for testing). It provides semantic annotations for 20 categories.

S3DIS [3] is another well-known real-world indoor 3D point cloud dataset for semantic segmentation. It contains 271 scenes across six areas, with point-wise annotations over 13 categories. Following previous works [33, 46], we use the fifth area as the validation split and the other areas as the training split.

Label Mapping. Due to the differing category taxonomies of the datasets, we condense the label space to 11 categories for the 3D-FRONT \(\rightarrow \) ScanNet and 3D-FRONT \(\rightarrow \) S3DIS settings, individually. Besides, we condense 8 categories for the cross-site settings between S3DIS and ScanNet. Please refer to the Suppl. for the detailed taxonomy.

4.2 UDA Baselines

As shown in Table 1 and 2, we reproduce 7 popular 2D UDA methods and 1 3D outdoor method as UDA baselines, encompassing MCD [51], AdaptSegNet [56], CBST [73], MinEnt [57], AdvEnt [57], Noisy Student [62], APO-DA [66] and SqueezeSegV2 [60]. These baselines cover most existing streams, such as adversarial alignment, discrepancy minimization, self-training and entropy-guided adaptation. To apply these image-based methods to our setting, we make task-specific modifications, which are detailed in the supplemental materials.

5 Experiments

To validate our method, we benchmark DODA and other popular UDA methods with extensive experiments on 3D-FRONT [8], ScanNet [7] and S3DIS [3]. Moreover, we explore a more challenging setting, from the simulated 3D-FRONT [8] to the real-world RGB-D dataset NYU-V2 [42], presented in the supplementary materials. To verify the generalizability of VSS and TACM, we further integrate VSS into previous UDA methods and adopt TACM in the real-world cross-site UDA setting. Note that since textures for some background classes are not provided in the 3D-FRONT dataset, we focus on adaptation using 3D point positions only. Implementation details, including network and training details, are provided in the Suppl.

Comparison to Other UDA Methods. As shown in Table 1 and Table 2, compared to the source-only model, DODA largely lifts the adaptation performance in terms of mIoU by around 21% and 19% on 3D-FRONT \(\rightarrow \) ScanNet and 3D-FRONT \(\rightarrow \) S3DIS, respectively. DODA also shows its superiority over other popular UDA methods, obtaining \(14\%\sim 22\%\) performance gains on 3D-FRONT \(\rightarrow \) ScanNet and \(13\%\sim 19\%\) gains on 3D-FRONT \(\rightarrow \) S3DIS. Even when only equipping the source-only model with the VSS module, our DODA (only VSS) still outperforms the UDA baselines by around \(4\% \sim 10\%\), indicating that the pattern gap caused by different sensing mechanisms significantly harms adaptation results and has not been readily addressed by previous methods. Comparing DODA with DODA (w/o TACM), we observe that TACM mainly contributes to categories such as bed and bookshelf on ScanNet, since cuboid mixing forces the model to focus more on local semantic clues and the object shapes inside cuboids. It is noteworthy that although DODA yields general improvements across almost all categories in both the pretraining and self-training stages, challenging classes such as bed on ScanNet and sofa on S3DIS attain more conspicuous performance lifts, demonstrating the strength of DODA in tackling troublesome categories. However, the effectiveness of all UDA methods for column and beam on S3DIS is limited due to their large disparities in data patterns across domains and low appearance frequencies in the source domain. To illustrate the reproducibility of DODA, all results are repeated three times and reported as the average performance along with the standard deviation.

Table 1. Adaptation results of 3D-FRONT \(\rightarrow \) ScanNet in terms of mIoU. We indicate the best adaptation result in bold. \(\dagger \) denotes pretrain generalization results with VSS
Table 2. Adaptation results of 3D-FRONT \(\rightarrow \) S3DIS in terms of mIoU. We indicate the best adaptation result in bold. \(\dagger \) denotes pretrain generalization results with VSS

VSS Plug-and-Play Results to Other UDA Methods. Since VSS works as a data augmentation in our DODA, we argue that it can serve as a plug-and-play module to mimic occlusion and noise patterns on simulation data, and is orthogonal to existing UDA strategies. As demonstrated in Table 3, equipped with VSS, current popular UDA approaches consistently surpass their original performance by around \(8\%\sim 13\%\). It also verifies that previous 2D-based methods fail to close the point pattern gap in 3D indoor scene adaptations, while our VSS can be incorporated into various pipelines to boost performance.

TACM Results in Cross-Site Adaptation. TACM serves as a general module for alleviating domain shift, and we show that it consistently mitigates domain discrepancies even in real-to-real adaptation settings. In cross-site adaptation, scenes collected from different sites or room types also exhibit a considerable data distribution gap. As shown in Table 4, the domain gaps in real-to-real adaptation tasks are large when comparing the source-only and oracle results. When TACM is adopted in the self-training pipelines, they obtain 5.64% and 3.66% relative performance boosts on ScanNet \(\rightarrow \) S3DIS and S3DIS \(\rightarrow \) ScanNet, respectively. These results verify that TACM is general in relieving data gaps, especially the context gap, on various 3D scene UDA tasks. We provide the cross-site benchmark with more UDA methods in the Suppl.

Table 3. UDA results equipped with VSS on 3D-FRONT \(\rightarrow \) ScanNet
Table 4. Cross-site adaptation results with TACM

6 Ablation Study

In this section, we conduct extensive ablation experiments to investigate the individual components of our DODA. All experiments are conducted on 3D-FRONT \(\rightarrow \) ScanNet for simplicity. Default settings are marked in bold.

Component Analysis. Here, we investigate the effectiveness of each component and module of DODA. As shown in Table 5, occlusion simulation brings the largest performance gain (around \(9.7\%\)), indicating that a model trained on complete scenes struggles to adapt to scenes with occlusion patterns. Noise simulation further complements VSS by imitating sensing and reconstruction noise, yielding a boost of about \(1.3\%\). The two sub-modules jointly mimic realistic scenes, largely alleviating the point distribution gap and leading to a more generalizable source-only model. In the self-training stage, VSS also surpasses the baseline by around 13% thanks to its efficacy in reducing the point pattern gap and facilitating the generation of high-quality pseudo labels. Cuboid mixing combines cuboid patterns from the source and target domains to moderate context-level bias, further boosting the performance by around \(2.4\%\). Moreover, cuboid-level tail-class over-sampling yields a \(0.9\%\) improvement with greater gains on tail classes; for instance, desk on ScanNet achieves a 6% gain (see Suppl.).

Table 5. Component Analysis for DODA on 3D-FRONT \(\rightarrow \) ScanNet

VSS: Visible Range. Here, we study the effect of the visible range in VSS, which is jointly determined by the horizontal angle \(\alpha _h\), vertical angle \(\alpha _v\), viewing mode \(\eta \) and the number of cameras \(n_v\). As shown in Table 6, fewer cameras (\(n_v=2\)) and a smaller viewing angle (\(\alpha _v=45^\circ \)) cause around 2% performance degradation due to the smaller visible range. Decreasing \(\alpha _h\) to 90\(^\circ \) can still achieve performance similar to \(\alpha _h = 180^\circ \) when using more cameras (\(n_v=8\)), demonstrating that sufficient semantic coverage is a vital factor. Besides, among the three viewing modes \(\eta \), the simplest fixed mode achieves the highest performance in comparison to the parallel and perspective modes. Even though parallel and perspective are closer to real scanning practice, they cannot cover a sufficient range with a limited number of cameras, since real-world scenes are reconstructed from hundreds or thousands of view frames. This again demonstrates that large spatial coverage is essential. To trade off effectiveness against the efficiency of on-the-fly VSS, we use the fixed mode with 4 camera positions by default.

Table 6. Ablation study of visible range design on 3D-FRONT \(\rightarrow \) ScanNet
Table 7. Ablation study of cuboid partitions on 3D-FRONT \(\rightarrow \) ScanNet

TACM: Cuboid Partition. We study various cuboid partition schemes in Table 7. Notice that random rotation around the z axis is performed before cuboid partition, so partitions along the x and y axes can be treated as equivalent. While horizontal partitioning yields consistent performance beyond 50% mIoU, vertical partitioning does not show robust improvements, suggesting that mixing vertical spatial context is unnecessary. Simultaneous partitioning on the x and y axes also improves performance (i.e., (2,2,1) and (2,3,1)), while overly small cuboids (i.e., (3,3,1)) provide insufficient context cues in each cuboid and cause a slight decrease in mIoU.

TACM: Data-Mixing Method. We compare TACM with other popular data-mixing methods in Table 8. Experimental results show the superiority of TACM, which outperforms Mix3D [43], CutMix [68] and Copy-Paste [11] by around \(2.2\%\) to \(2.9\%\). TACM effectively alleviates the context gap while preserving local context clues. Mix3D, however, results in large overlapping areas, which is unnatural and causes semantic confusion. CutMix and Copy-Paste only disrupt local areas without sufficiently perturbing the broader context (see Suppl.).

TACM: Tail Cuboid Over-Sampling with Class-Balanced Loss. Tail cuboid over-sampling brings significant gains on tail classes, as discussed above. As demonstrated in Table 9, the class-balanced Lovász loss [4] also boosts performance by treating each category more equally. We highlight that TACM can also be combined with other class-balancing methods during optimization, since it eases the long-tail issue at the data level.

Table 8. Ablation study of data-mixing methods on 3D-FRONT \(\rightarrow \) ScanNet
Table 9. Investigation of pseudo label class imbalance issue on 3D-FRONT \(\rightarrow \) ScanNet

7 Limitations and Open Problems

Although our model largely closes the domain gaps across simulation and real-world datasets, we still suffer from the inherent limitations of the simulation data. For some categories such as beam and column, the simulator fails to generate realistic shape patterns, resulting in huge negative transfer. Besides, room layouts need to be developed by experts, which may limit the diversity and complexity of the created scenes. Therefore, in order to make simulation data benefit real-world applications, there are still several open problems: how to handle the failure modes of the simulator, how to unify the adaptation and simulation stage in one pipeline, and how to automate the simulation process, to name a few.

8 Conclusions

We have presented DODA, a data-oriented domain adaptation method with virtual scan simulation and tail-aware cuboid mixing for 3D indoor sim-to-real unsupervised domain adaptation. Virtual scan simulation yields a more transferable model by mitigating the real-to-simulation point pattern gap. Tail-aware cuboid mixing rectifies context biases by creating a tail-aware intermediate domain and facilitating self-training to effectively leverage pseudo-labeled target data, further reducing domain gaps. Our extensive experiments not only show the prominent performance of DODA on two sim-to-real UDA tasks, but also illustrate the potential of TACM for general 3D UDA scene parsing tasks. More importantly, we have built the first benchmark for 3D indoor scene unsupervised domain adaptation, including sim-to-real adaptation and cross-site real-world adaptation. The benchmark suite will be publicly available. We hope our work can inspire further investigation of this problem.