Abstract
We present a new approach to instill 4D dynamic object priors into learned 3D representations by unsupervised pre-training. We observe that dynamic movement of an object through an environment provides important cues about its objectness, and thus propose to imbue learned 3D representations with such dynamic understanding, which can then be effectively transferred to improve performance in downstream 3D semantic scene understanding tasks. We propose a new data augmentation scheme leveraging synthetic 3D shapes moving in static 3D environments, and employ contrastive learning under 3D-4D constraints that encode 4D invariances into the learned 3D representations. Experiments demonstrate that our unsupervised representation learning improves downstream 3D semantic segmentation, object detection, and instance segmentation, and notably improves performance in data-scarce scenarios. Our results show that our 4D pre-training method improves downstream tasks such as object detection mAP@0.5 by 5.5%/6.5% over training from scratch on ScanNet/SUN RGB-D while involving no additional run-time overhead at test time.
Keywords
- 3D scene understanding
- Point cloud recognition
- 3D semantic segmentation
- 3D instance segmentation
- 3D object detection
1 Introduction
3D semantic scene understanding has seen remarkable progress in recent years, in large part driven by advances in deep learning as well as the introduction of large-scale, annotated datasets [1, 7, 10]. In particular, notable progress has been made to address core 3D scene understanding tasks such as 3D semantic segmentation, object detection, and instance segmentation, which are fundamental to many real-world computer vision applications such as robotics, mixed reality, or autonomous driving. Such approaches have developed various methods to learn on different 3D scene representations, such as sparse or dense volumetric grids [6, 7, 11], point clouds [29, 30], meshes [19], or multi-view approaches [8, 38]. Recently, driven by the success of unsupervised representation learning for transfer learning in 2D, 3D scene understanding methods have been augmented with unsupervised 3D pre-training to further improve performance on downstream 3D scene understanding tasks [17, 20, 43, 46] (Fig. 1).
While such 3D representation learning has focused on feature representations learned from static 3D scenes, we observe that important notions of objectness are given by 4D dynamic observations – for instance, object segmentations can often be naturally intuited by observing objects moving around an environment without any annotations required, which can be more difficult in a static 3D observation. We thus propose to leverage this powerful 4D signal in unsupervised pre-training to imbue 4D object priors into learned 3D representations, which can then be effectively transferred to various downstream 3D scene understanding tasks for improved recognition performance.
In this work, we introduce 4DContrast to learn about objectness from both static 3D and dynamic 4D information in learned 3D representations. We leverage a combination of static 3D scanned scenes and a database of synthetic 3D shapes, and augment the scenes with moving synthetic shapes to generate 4D sequence data with inherent motion correspondence. We then employ a contrastive learning scheme under both 3D and 4D constraints, correlating local 3D point features with each other as well as with 4D sequence features, thus imbuing learned objectness from dynamic information into the 3D representation learning.
To demonstrate our approach, we pre-train on ScanNet [7] along with ModelNet [41] shapes for unsupervised 3D representation learning. Experiments on 3D semantic segmentation, object detection, and instance segmentation show that 4DContrast learns effective features that can be transferred to achieve improved performance in various downstream 3D scene understanding tasks. 4DContrast can also generalize from pre-training on ScanNet and ModelNet to improved performance on SUN RGB-D [36]. Additionally, we show that our learned representations remain robust in limited training data scenarios, consistently improving performance under various amounts of available training data.
Our main contributions are summarized as follows:
- We propose the first method to leverage 4D sequence information and constraints for 3D representation learning, showing transferability of the learned features to the downstream 3D scene understanding tasks of 3D semantic segmentation, object detection, and instance segmentation.
- Our new unsupervised pre-training based on constructing 4D sequences from synthetic 3D shapes in real-world, static 3D scenes improves performance across a variety of downstream tasks and different datasets.
2 Related Work
3D Semantic Scene Understanding. Driven by rapid developments in deep learning and the introduction of several large-scale, annotated 3D datasets [1, 7, 10], notable progress has been made in 3D semantic scene understanding, in particular the tasks of 3D semantic segmentation [6,7,8, 11, 18, 19, 24, 30, 32], 3D object detection [25,26,27, 42, 47], and 3D instance segmentation [9, 13, 16, 21, 45]. Many methods have been proposed, largely focusing on learning on various 3D representations, such as sparse or dense volumetric grids [6, 7, 11], point clouds [21, 27, 29, 30], meshes [19, 35], or multi-view hybrid representations [8, 22]. In particular, approaches leveraging backbones built with sparse convolutions [6, 11] have shown strong effectiveness across a variety of 3D scene understanding tasks and datasets. We propose a new unsupervised pre-training approach to learn 4D priors in learned 3D representations, leveraging sparse convolutional backbones for both 3D and 4D feature extraction.
3D Representation Learning. Inspired by the success of representation learning in 2D, particularly that leveraging instance discrimination with contrastive learning [2, 3, 15], recent works have explored unsupervised learning with 3D pretext tasks that can be leveraged for fine-tuning on downstream 3D scene understanding tasks [5, 14, 17, 20, 23, 31, 33, 34, 39, 40, 43, 46]. For instance, [14, 34] learn feature representations from point-based instance discrimination for object classification and segmentation, and [17, 43, 46] extend to more complex 3D scenes by generating correspondences from various different views of scene point clouds.
In particular, given the more scarce data availability of real-world 3D environments, Hou et al. [17] additionally demonstrate the efficacy of contrastive 3D pretraining for various 3D semantic scene understanding tasks under a variety of limited training data scenarios. In contrast to these methods that employ 3D-only pretext tasks for representation learning, we propose to learn from 4D sequence data to embed 4D priors into learned 3D representations for more effective transfer to downstream 3D tasks.
Recently, Huang et al. [20] propose to learn from the inherent sequence data of RGB-D video to incorporate the notion of a temporal sequence. Constraints are established across pairs of frames in the sequence; however, the sequence data itself represents static scenes without any movement within the scene, limiting the temporal signal that can be learned. In contrast, we consider 4D sequence data containing object movement through the scene, which can provide additional semantic signal about objectness through an object’s motion. Additionally, Rao et al. [31] propose to learn from 3D scenes that are synthetically generated by randomly placing synthetic CAD models on a rectangular layout. They employ object-level contrastive learning on object-level features, resulting in improved 3D object detection performance. We also leverage synthetic CAD models for data augmentation, but we compose them with real-world 3D scan data to generate 4D sequences of objects in motion, and exploit learned 4D features to enhance learned 3D representations, with performance improvement on various downstream 3D scene understanding tasks.
3 4D Invariant Representation Learning
4DContrast presents a new approach to 3D representation learning: our key idea is to employ 4D constraints during pre-training, in order to imbue learned features with 4D invariance that captures the objectness learned from seeing an object in motion. We consider a dataset of 3D scans \(\mathcal {S}=\{S_i\}\) as well as a dataset of synthetic 3D objects \(\mathcal {O} = \{O_j\}\), and construct dynamic sequences with inherent correspondence information by moving a synthetic object \(O_j\) in a static 3D scan \(S_i\). This enables us to establish 4D correspondences along with 3D-4D correspondence as constraints under a contrastive learning framework for unsupervised pre-training. An overview of our approach is shown in Fig. 2.
3.1 Revisiting SimSiam
We first revisit SimSiam [4], which introduced a simple yet powerful approach for contrastive 2D representation learning. Inspired by the effectiveness of SimSiam, we build an unsupervised contrastive learning scheme for embedding 4D priors into 3D representations.
SimSiam considers two augmented variants of an image I, \(I_1\) and \(I_2\), which are input to a weight-shared encoder network \(\varPhi _{\text {2D}}\) (a 2D convolutional backbone followed by a projection MLP). A prediction MLP head \(P_{\text {2D}}\) then transforms the output of one view as \(p_1^{\text {2D}}=P_{\text {2D}}(\varPhi _{\text {2D}}(I_1))\) to match the other output \(z_2^{\text {2D}}=\varPhi _{\text {2D}}(I_2)\), by minimizing the negative cosine similarity [12]:
\[\mathcal {D}(p_1^{\text {2D}}, z_2^{\text {2D}}) = -\frac{p_1^{\text {2D}}}{\Vert p_1^{\text {2D}}\Vert _2}\cdot \frac{z_2^{\text {2D}}}{\Vert z_2^{\text {2D}}\Vert _2} \tag{1}\]
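SimSiam also uses a stop-gradient (SG) operation that treats \(z_2^{\text {2D}}\) as a constant during back-propagation, to prevent collapse during training, thus modifying Eq. 1 as: \(\mathcal {D}(p_1^{\text {2D}}, SG(z_2^{\text {2D}}))\). A symmetrized loss is defined for the two augmented inputs, with \(p_2^{\text {2D}}\) and \(z_1^{\text {2D}}\) obtained analogously by swapping the roles of the two views:
\[\mathcal {L}^{\text {2D}} = \frac{1}{2}\mathcal {D}(p_1^{\text {2D}}, SG(z_2^{\text {2D}})) + \frac{1}{2}\mathcal {D}(p_2^{\text {2D}}, SG(z_1^{\text {2D}})) \tag{2}\]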
SimSiam has been shown to be very effective at learning invariances under various image augmentations, without requiring negative samples or very large batches. We thus build on this contrastive framework for our 3D-4D constraints, as it allows for our high-dimensional pre-training design.
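To make the loss concrete, the following is a minimal Python sketch of the negative cosine similarity and its symmetrized form as described above. The function names `neg_cosine` and `symmetrized_loss` are illustrative and not taken from any released code; since plain Python has no automatic differentiation, the stop-gradient is only indicated in a comment.

```python
import math

def neg_cosine(p, z):
    """Negative cosine similarity D(p, z) between two feature vectors."""
    dot = sum(pi * zi for pi, zi in zip(p, z))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_z = math.sqrt(sum(zi * zi for zi in z))
    return -dot / (norm_p * norm_z)

def symmetrized_loss(p1, z1, p2, z2):
    """0.5 * D(p1, SG(z2)) + 0.5 * D(p2, SG(z1)).

    In an autograd framework, z1 and z2 would be detached (stop-gradient)
    before this call; here they are simply plain lists of floats.
    """
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

# Predictions that point in the same direction as the targets: minimum loss.
loss_aligned = symmetrized_loss([1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0])
# Predictions orthogonal to the targets: loss of zero.
loss_orthogonal = symmetrized_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

The loss depends only on the direction of the features, not on their magnitude, which is why the aligned example reaches the minimum of -1 despite the differing vector lengths.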
3.2 4D-Invariant Contrastive Learning
To imbue effective 4D priors into learned 3D features, we consider a static 3D scan S and a synthetic 3D object O as a training sample, and compose them together to form dynamic object movement in the scene \(\{F_0,...,F_{t-1}\}\) for t time steps (as described in Sect. 3.3). We then establish spatial correspondences between frames (3D-3D), spatio-temporal correspondences (3D-4D), and dynamic correspondences (4D-4D) as constraints. 3D features are extracted with a 3D encoder \(\varPhi _{\text {3D}}\) and 4D features with a 4D encoder \(\varPhi _{\text {4D}}\), with respective prediction MLPs \(P_{\text {3D}}\) and \(P_{\text {4D}}\).
Inter-Frame Spatial Correspondence. For each pair of frames \((F_i, F_j)\) in a training sequence F, we consider their spatial correspondence across sequence frames in order to implicitly encode pose invariance over the dynamic sequence. That is, points that correspond to the same location in the original 3D scene S or original object O should also correspond in feature space. For each pair of point locations \((\textbf{a}_i, \textbf{b}_j)\) in the set of corresponding point locations \(\mathcal {A}_{i,j}\) from frames \((F_i, F_j)\), we obtain their 3D backbone features at the respective locations: \(p^{\text {3D}}_{i,a}=P_{\text {3D}}(\varPhi _{\text {3D}}(F_i))(\textbf{a}_i)\) and \(z^{\text {3D}}_{j,b}=\varPhi _{\text {3D}}(F_j)(\textbf{b}_j)\), with \(p^{\text {3D}}_{j,b}\) and \(z^{\text {3D}}_{i,a}\) obtained analogously. We then compute a symmetrized negative cosine similarity loss between features of corresponding point locations:
\[\mathcal {L}^{\text {3D}}_{i,j} = \sum _{(\textbf{a}_i, \textbf{b}_j)\in \mathcal {A}_{i,j}} \left( \frac{1}{2}\mathcal {D}(p^{\text {3D}}_{i,a}, SG(z^{\text {3D}}_{j,b})) + \frac{1}{2}\mathcal {D}(p^{\text {3D}}_{j,b}, SG(z^{\text {3D}}_{i,a}))\right) \tag{3}\]
In Fig. 3, we use arrows to indicate constraints between frame \(F_{t-2}\) and frame \(F_{t-1}\).
We compute Eq. 3 over each pair of frames \((F_i, F_j)\) in the whole sequence F:
\[\mathcal {L}^{\text {3D}} = \sum _{(i,j)} \mathcal {L}^{\text {3D}}_{i,j} \tag{4}\]
By establishing constraints across 3D frames in a 4D sequence, we encode pose invariance of moving objects across varying background into the learned 3D features.
Spatio-Temporal Correspondence. In addition to implicitly encoding pose invariance of moving objects, we establish explicit 3D-4D correspondences to learn 4D priors, encouraging 4D-invariance in the learned features. For a training sequence \(F=\{F_0,...,F_{t-1}\}\), we use the 4D encoder \(\varPhi _\text {4D}\) and the 4D predictor \(P_\text {4D}\) to extract 4D features from the whole sequence. Here \(z^{\text {4D}}_{i,a}\) denotes the 4D features output by the 4D encoder \(\varPhi _\text {4D}\) at point location \(\textbf{a}_i\) in frame i, and \(p^{\text {4D}}_{i,a}\) denotes the 4D features output by the 4D predictor \(P_\text {4D}\). Then for a frame \(F_i\), we consider each 3D point \(\textbf{a}_i\in \mathcal {A}_i\), where \(\mathcal {A}_i\) is the set of points of frame \(F_i\), and establish a constraint between its corresponding 3D feature extracted by the 3D network (\(\varPhi _\text {3D}\) and \(P_\text {3D}\)) and its corresponding 4D feature extracted by the 4D network (\(\varPhi _\text {4D}\) and \(P_\text {4D}\)):
\[\mathcal {L}^{\text {3D4D}}_{i} = \sum _{\textbf{a}_i\in \mathcal {A}_i} \left( \frac{1}{2}\mathcal {D}(p^{\text {3D}}_{i,a}, SG(z^{\text {4D}}_{i,a})) + \frac{1}{2}\mathcal {D}(p^{\text {4D}}_{i,a}, SG(z^{\text {3D}}_{i,a}))\right) \tag{5}\]
As shown in Fig. 3, we use arrows to indicate constraints of frame \(F_{t-1}\). For the entire input sequence F, we calculate Eq. 5 for every frame, and the 3D-4D contrastive loss \(\mathcal {L}^{\text {3D4D}}\) is defined as:
\[\mathcal {L}^{\text {3D4D}} = \sum _{i=0}^{t-1} \mathcal {L}^{\text {3D4D}}_{i} \tag{6}\]
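Additionally, in order to learn spatio-temporally consistent 4D representations, we employ 4D-4D correspondence constraints inherent to the 4D features within the same point cloud sequence. This is formulated analogously to Eq. 3, replacing the 3D features with the 4D features from different time steps that correspond spatially:
\[\mathcal {L}^{\text {4D}}_{i,j} = \sum _{(\textbf{a}_i, \textbf{b}_j)\in \mathcal {A}_{i,j}} \left( \frac{1}{2}\mathcal {D}(p^{\text {4D}}_{i,a}, SG(z^{\text {4D}}_{j,b})) + \frac{1}{2}\mathcal {D}(p^{\text {4D}}_{j,b}, SG(z^{\text {4D}}_{i,a}))\right) \tag{7}\]
In Fig. 3, we use arrows to indicate 4D constraints between frame \(F_{t-2}\) and frame \(F_{t-1}\). We evaluate Eq. 7 over every pair of frames \((F_i, F_j)\) in the entire input sequence F, with the 4D contrastive loss \(\mathcal {L}^\text {4D}\) defined as:
\[\mathcal {L}^{\text {4D}} = \sum _{(i,j)} \mathcal {L}^{\text {4D}}_{i,j} \tag{8}\]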
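Joint Learning. Our overall training loss \(\mathcal {L}\) consists of three parts: the 3D contrastive loss \(\mathcal {L}^{\text {3D}}\), the 3D-4D contrastive loss \(\mathcal {L}^{\text {3D4D}}\), and the 4D contrastive loss \(\mathcal {L}^{\text {4D}}\):
\[\mathcal {L} = w_{\text {3D}}\mathcal {L}^{\text {3D}} + w_{\text {3D4D}}\mathcal {L}^{\text {3D4D}} + w_{\text {4D}}\mathcal {L}^{\text {4D}} \tag{9}\]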
where constant weights \(w_{\text {3D}}\), \(w_{\text {3D4D}}\) and \(w_{\text {4D}}\) are used to balance the losses.
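As an illustration of how the three terms combine, the sketch below evaluates the 3D-3D, 3D-4D, and 4D-4D losses on toy features and forms their weighted sum. It is a simplified sketch under stated assumptions rather than our training code: features are plain Python lists indexed as `[frame][point]`, each unordered frame pair is counted once, the per-pair terms are unnormalized sums, and the default weights of 1.0 are placeholders (the text only states that the weights are constants). All function names are hypothetical.

```python
import math
from itertools import combinations

def neg_cosine(p, z):
    """Negative cosine similarity D(p, z)."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def symmetric(p_a, z_a, p_b, z_b):
    # Stop-gradient on z_a and z_b is implied; plain Python has no autograd.
    return 0.5 * neg_cosine(p_a, z_b) + 0.5 * neg_cosine(p_b, z_a)

def pair_loss(p, z, i, j, matches):
    """Symmetrized loss summed over matched rows (a in frame i, b in frame j)."""
    return sum(symmetric(p[i][a], z[i][a], p[j][b], z[j][b]) for a, b in matches)

def joint_loss(p3d, z3d, p4d, z4d, matches, w3d=1.0, w3d4d=1.0, w4d=1.0):
    """Weighted sum of the 3D-3D, 3D-4D, and 4D-4D contrastive terms.

    matches[(i, j)] lists the matched rows between frames i < j.
    The default weights are placeholders, not values from the paper.
    """
    pairs = list(combinations(range(len(p3d)), 2))  # each pair once (assumption)
    loss_3d = sum(pair_loss(p3d, z3d, i, j, matches[(i, j)]) for i, j in pairs)
    loss_4d = sum(pair_loss(p4d, z4d, i, j, matches[(i, j)]) for i, j in pairs)
    # 3D-4D term: the same point in the same frame, seen by both networks.
    loss_3d4d = sum(
        symmetric(p3d[i][a], z3d[i][a], p4d[i][a], z4d[i][a])
        for i in range(len(p3d)) for a in range(len(p3d[i]))
    )
    return w3d * loss_3d + w3d4d * loss_3d4d + w4d * loss_4d

# Two frames with one point each; all features identical, so every term is minimal.
feat = [[[1.0, 0.0]], [[1.0, 0.0]]]
total = joint_loss(feat, feat, feat, feat, {(0, 1): [(0, 0)]})
```

In this toy case the 3D-3D and 4D-4D terms each contribute -1 (one frame pair, one match) and the 3D-4D term contributes -2 (two frames, one point each).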
3.3 Generating 4D Correspondence via Scene-Object Augmentation
To learn from 4D sequence data to embed 4D priors into learned 3D representations, we leverage existing large-scale real-world 3D scan datasets in combination with synthetic 3D shape datasets. This enables generation of 4D correspondences without requiring any labels – by augmenting static 3D scenes with generated trajectories of a moving synthetic object within the scene, which provides inherent 4D correspondence knowledge across the object motion. Thus for pre-training, we consider pairs of reconstructed scans and an arbitrarily sampled synthetic 3D shape (S, O), and generate a 4D sequence \(F=\{F_0,...,F_{t-1}\}\) by moving the object through the scene.
Trajectory Generation. We first generate a trajectory for O in S. We voxelize S at 10 cm voxel resolution, and accumulate occupied surface voxels in the height dimension to acquire a 2D map of the scene geometry. Valid object locations are then identified as those in the 2D map with a voxel accumulation \(\le \)1, with the max height of the accumulated voxels near the floor (within 20 cm of the average floor height). For the object O, we consider all possible 2D locations, and if O does not exceed the valid region (based on its bounding sphere), then the location is taken as a candidate object position. A random position sampled from these candidate positions is taken as the starting point of the trajectory. We then randomly sample a step distance in [30, 90] cm and a step direction such that the angular change in trajectory is <150\(^\circ \), and select the nearest valid candidate position as the second trajectory point. We repeat this process for t time steps in the sequence to obtain 4D scene-object augmentations for pre-training.
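The procedure above can be sketched in a few lines of Python. This is a simplified illustration under several assumptions: the scene is given as a list of `(x, y, z)` points in meters, the floor height `floor_z` is known, the voxel height is approximated by the lower face of the voxel, the first heading is drawn uniformly, and the heading bookkeeping ignores the small direction change caused by snapping to the nearest candidate. The function names are hypothetical.

```python
import math
import random

CELL = 0.10  # 10 cm grid cells, as in the text

def valid_cells(points, floor_z, cell=CELL, tol=0.20):
    """2D cells whose column holds at most one occupied voxel, within `tol` of the floor."""
    columns = {}
    for x, y, z in points:
        key = (math.floor(x / cell), math.floor(y / cell))
        columns.setdefault(key, set()).add(math.floor(z / cell))
    return {key for key, zs in columns.items()
            if len(zs) <= 1 and abs(max(zs) * cell - floor_z) <= tol}

def candidates(valid, radius, cell=CELL):
    """Valid cells whose bounding-sphere footprint stays inside the valid region."""
    r = math.ceil(radius / cell)
    offsets = [(dx, dy) for dx in range(-r, r + 1) for dy in range(-r, r + 1)
               if math.hypot(dx, dy) * cell <= radius]
    return [c for c in sorted(valid)
            if all((c[0] + dx, c[1] + dy) in valid for dx, dy in offsets)]

def sample_trajectory(cands, t, rng, cell=CELL, max_turn_deg=150.0):
    """Random walk over candidate cells: 30-90 cm steps, turn bounded by 150 degrees."""
    path = [rng.choice(cands)]
    heading = rng.uniform(-math.pi, math.pi)  # first heading is unconstrained
    for _ in range(t - 1):
        heading += math.radians(rng.uniform(-max_turn_deg, max_turn_deg))
        step = rng.uniform(0.30, 0.90)
        x = path[-1][0] * cell + step * math.cos(heading)
        y = path[-1][1] * cell + step * math.sin(heading)
        # Snap the proposed position to the nearest candidate cell.
        path.append(min(cands, key=lambda c: math.hypot(c[0] * cell - x, c[1] * cell - y)))
    return path

# Toy scene: an empty, flat 2 m x 2 m floor sampled once per cell.
rng = random.Random(0)
floor = [(0.05 + 0.1 * ix, 0.05 + 0.1 * iy, 0.0) for ix in range(20) for iy in range(20)]
valid = valid_cells(floor, floor_z=0.0)
cands = candidates(valid, radius=0.25)
path = sample_trajectory(cands, t=4, rng=rng)
```

Working on a 2D grid keeps the validity test to set lookups; a real scan would additionally contain furniture and walls, which show up as columns with more than one occupied voxel and are excluded by `valid_cells`.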
4D Sequence Generation. A sequence of point clouds is then generated based on the computed object trajectory for the scan, up to sequence length t, by compositing the object into the scene under its translation and rotation steps per frame. This provides inherent correspondence information between 3D scene locations and 4D object movement through the scene.
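A minimal sketch of this compositing step is given below. It assumes that the per-frame rotation is about the vertical axis and that correspondences are tracked by tagging every point with a `(source, index)` identifier; both are illustrative choices, and the function names are hypothetical.

```python
import math

def place_object(obj_points, position, yaw):
    """Rotate object points about the vertical axis by `yaw`, then translate to `position`."""
    c, s = math.cos(yaw), math.sin(yaw)
    px, py, pz = position
    return [(c * x - s * y + px, s * x + c * y + py, z + pz) for x, y, z in obj_points]

def compose_sequence(scene_points, obj_points, poses):
    """Build one frame per pose; each point keeps its (source, index) id across frames."""
    frames = []
    for position, yaw in poses:
        moved = place_object(obj_points, position, yaw)
        frame = [(("scene", k), p) for k, p in enumerate(scene_points)]
        frame += [(("object", k), p) for k, p in enumerate(moved)]
        frames.append(frame)
    return frames

def correspondences(frame_i, frame_j):
    """Pairs of row indices (a, b) that refer to the same source point."""
    index_j = {pid: b for b, (pid, _) in enumerate(frame_j)}
    return [(a, index_j[pid]) for a, (pid, _) in enumerate(frame_i) if pid in index_j]

# Two static scene points and a one-point "object" that moves and turns by 90 degrees.
scene = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
chair = [(1.0, 0.0, 0.0)]
poses = [((0.0, 0.0, 0.0), 0.0), ((1.0, 0.0, 0.0), math.pi / 2)]
frames = compose_sequence(scene, chair, poses)
matches = correspondences(frames[0], frames[1])
```

Because the identifiers survive the rigid motion, the object point is matched across frames even though its coordinates change; once frames are subsampled or augmented, points missing from either frame simply drop out of `matches`.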
Scene Augmentation. We augment the 4D sequences by randomly sampling different points across the geometry in each individual frame. We also randomly remove cubic chunks of points in the background 3D scene for additional data variation, with the number of chunks removed randomly sampled from [5, 15] and the size of the chunks randomly sampled in [0.15, 0.45] as a proportion of the scene extent. We discard any sequences that do not have enough correspondences in their frames; that is, \(\ge \)30% of the points in the original scan and \(\ge \)30% of the points of the synthetic object should be consistently represented in each frame, and each frame must maintain at least 50% of its points through the augmentation process. Additionally, we further augment the static 3D frame interpretations of the sequence (but not the sequence) by applying random rotation, translation, and scaling to each individually considered 3D frame.
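The chunk removal and the sequence filter can be sketched as follows. Several details are assumptions made for illustration: the chunk size is taken relative to the largest axis extent of the scene, chunk corners are drawn uniformly inside the bounding box, "consistently represented" is read as "present in every frame", and the 50% rule is applied to the combined scene and object points of a frame. The function names are hypothetical.

```python
import random

def remove_chunks(points, rng, n_range=(5, 15), size_range=(0.15, 0.45)):
    """Return the indices of points that survive random cubic chunk removal."""
    lo = [min(p[d] for p in points) for d in range(3)]
    hi = [max(p[d] for p in points) for d in range(3)]
    extent = max(h - l for l, h in zip(lo, hi))  # largest axis extent (assumption)
    keep = set(range(len(points)))
    for _ in range(rng.randint(*n_range)):
        side = rng.uniform(*size_range) * extent
        corner = [rng.uniform(l, h) for l, h in zip(lo, hi)]
        keep -= {k for k in keep
                 if all(corner[d] <= points[k][d] <= corner[d] + side for d in range(3))}
    return keep

def sequence_is_valid(kept_scene, kept_object, n_scene, n_object,
                      min_shared=0.30, min_frame=0.50):
    """kept_scene / kept_object: one set of surviving point indices per frame."""
    shared_scene = set.intersection(*kept_scene)
    shared_object = set.intersection(*kept_object)
    if len(shared_scene) < min_shared * n_scene:
        return False
    if len(shared_object) < min_shared * n_object:
        return False
    return all(len(s) + len(o) >= min_frame * (n_scene + n_object)
               for s, o in zip(kept_scene, kept_object))

# Chunk removal on a random toy cloud.
rng = random.Random(0)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
kept = remove_chunks(cloud, rng)

# Two frames sharing half of the scene points and all object points: accepted.
ok = sequence_is_valid([set(range(10)), set(range(5))],
                       [set(range(10)), set(range(10))], 10, 10)
# Two frames with disjoint scene points: rejected.
bad = sequence_is_valid([set(range(5)), set(range(5, 10))],
                        [set(range(10)), set(range(10))], 10, 10)
```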
3.4 Network Architecture for Pre-training
During pre-training, we leverage correspondences induced by our 4D data generation, between encoded 3D frames as well as across the encoded 4D sequence. To this end, we employ 3D and 4D feature extractors as meta-architectures for 4D-invariant learning.
To extract per-point features from a 3D scene, we use a 3D encoder \(\varPhi _\text {3D}\) and a 3D predictor \(P_\text {3D}\). \(\varPhi _\text {3D}\) is a U-Net architecture based on sparse 3D convolutions with residual blocks, followed by a \(1\times 1\times 1\) sparse convolutional projection layer, and \(P_\text {3D}\) consists of two \(1\times 1\times 1\) sparse convolutional layers.
To extract spatio-temporal features from a 4D sequence, we use a 4D encoder \(\varPhi _\text {4D}\) and a 4D predictor \(P_\text {4D}\). These are structured analogously to the 3D feature extraction, using sparse 4D convolutions instead. For more detailed architecture specifications, we refer to the supplemental material.
4 Experimental Setup
We demonstrate the effectiveness of our 4D-informed pre-training of learned 3D representations for a variety of downstream 3D scene understanding tasks.
Pre-training Setup. We use reconstructed 3D scans from ScanNet [7] and synthetic 3D shapes from ModelNet [41] to compose our 4D sequence data for pre-training. We use the official ScanNet train split with 1201 train scans, augmented with shapes from ModelNet from eight furniture categories: chair, desk, dresser, nightstand, sofa, table, bathtub, and toilet. For each 3D scan, we generate 20 trajectories of an object moving through the scan, following Sect. 3.3 with \(t=4\). For sequence generation we use 2 cm resolution for the scene and 1000 randomly sampled points from the synthetic object to compose together.
The 3D and 4D sparse U-Nets are implemented with MinkowskiEngine [6] using 2 cm voxel size for 3D and 5 cm voxel size for 4D. For pre-training we consider only geometry information from the scene-object sequence augmentations. We use an SGD optimizer with initial learning rate 0.25 and a batch size of 12. The learning rate is decreased by a factor of 0.99 every 1000 steps. We train for 50K steps until convergence.
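For reference, the pre-training learning rate schedule described above can be written as a small function. This is a sketch that assumes the decay is applied as a staircase, i.e. once after every full block of 1000 steps; the function name is hypothetical.

```python
def pretrain_lr(step, base_lr=0.25, gamma=0.99, every=1000):
    """Learning rate after `step` optimizer steps under staircase exponential decay."""
    return base_lr * gamma ** (step // every)

lr_start = pretrain_lr(0)        # initial learning rate
lr_first = pretrain_lr(1000)     # after the first decay
lr_final = pretrain_lr(50_000)   # at the end of the 50K-step schedule
```

Under this reading the learning rate is multiplied by 0.99 fifty times over the full schedule, so it ends at roughly 60% of its initial value.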
Fine-Tuning on Downstream Tasks. We use the same pre-trained backbone network in the three 3D scene understanding tasks of semantic segmentation, instance segmentation, and object detection. For semantic segmentation, we directly use the U-Net architecture for dense label prediction, and for object detection and instance segmentation, we use VoteNet [27] and PointGroup [21] respectively, both with our pre-trained 3D U-Net backbone. All experiments, including comparisons with state of the art, are trained with geometric information only, unless otherwise noted. Fine-tuning experiments on semantic segmentation are trained with a batch size of 48 for 10K steps, using an initial learning rate of 0.8 with polynomial decay with power 0.9. For instance segmentation, we use the same training setup as PointGroup, and use an initial learning rate of 0.1. For object detection, the network is trained for 500 epochs, and the learning rate is 0.001 and decayed by a factor of 0.5 at epochs 250, 350, and 450. We use a batch size of 6 on ScanNet and 16 on SUN RGB-D.
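The two fine-tuning schedules with fully specified hyperparameters can be sketched in the same way. The polynomial schedule is assumed to decay to zero at the final step, and the detection schedule is assumed to halve the learning rate at the start of each listed epoch; both are common conventions rather than details stated in the text, and the function names are hypothetical.

```python
def semseg_lr(step, base_lr=0.8, total_steps=10_000, power=0.9):
    """Polynomial decay for semantic segmentation fine-tuning."""
    return base_lr * (1.0 - step / total_steps) ** power

def detection_lr(epoch, base_lr=0.001, gamma=0.5, milestones=(250, 350, 450)):
    """Step decay for object detection fine-tuning."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

lr_semseg_start = semseg_lr(0)
lr_semseg_end = semseg_lr(10_000)
lr_det_last = detection_lr(499)  # last of the 500 epochs
```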
5 Results
We demonstrate that our learned features under 3D-4D constraints transfer effectively to a variety of downstream 3D scene understanding tasks. We consider both in-domain transfer to 3D scene understanding tasks on ScanNet [7] (Sect. 5.1), as well as out-of-domain transfer to SUN RGB-D [36] (Sect. 5.2); a summary is shown in Table 1. We also show data-efficient scene understanding (Sect. 5.3) and additional analysis (Sect. 5.4). Note that for all downstream experiments, we do not use the 4D backbone and thus use the same 3D U-Net architecture as PointContrast [43] and CSC [17].
All experiments, including our method and all baseline comparisons, are trained on geometric data only without any color information.
5.1 ScanNet
We first demonstrate our 4DContrast pre-training in fine-tuning for 3D object detection, semantic segmentation, and instance segmentation on ScanNet [7], showing the effectiveness of learning 3D features under 4D constraints. Tables 2, 3, and 4 evaluate performance on 3D object detection, semantic segmentation, and instance segmentation, respectively.
Table 2 shows 3D object detection results, for which our pretraining approach improves over baseline training from scratch (+5.5% mAP@0.5) as well as over the strong 3D-based pre-training methods of RandomRooms [31], PointContrast [43] and CSC [17].
In Tables 3 and 4, we evaluate semantic segmentation and instance segmentation in comparison with state-of-the-art 3D pre-training approaches [17, 43], as well as a baseline trained from scratch. These pre-training approaches improve notably over training from scratch, and our 4DContrast approach, leveraging learned representations under 4D constraints, leads to additional performance improvement over training from scratch (+2.3% mIoU for semantic segmentation and +4.2% mAP@0.5 for instance segmentation). We show qualitative results for semantic segmentation in Fig. 4.
5.2 SUN RGB-D
We additionally show that our 4DContrast learning scheme can produce transferable representations across datasets. We leverage our pre-trained weights from ScanNet + ModelNet, and explore downstream 3D object detection on the SUN RGB-D [36] dataset. SUN RGB-D is a dataset of RGB-D images, containing 10,335 frames captured with a variety of commodity RGB-D sensors. It contains 3D object bounding box annotations for 10 class categories. We follow the official train/test split of 5,285 train frames and 5,050 test frames.
Table 5 shows 3D object detection performance on SUN RGB-D, with qualitative results visualized in Fig. 5. We use the same pre-training as with ScanNet, with downstream fine-tuning on SUN RGB-D data. 4DContrast improves over training from scratch (+6.5% mAP@0.5), with our learned representations surpassing the 3D-based pre-training [17, 31, 43, 46].
5.3 Data-Efficient 3D Scene Understanding
We evaluate our approach in the scenario of limited training data, as shown in Fig. 6. 4DContrast improves over baseline training from scratch as well as over the state-of-the-art data-efficient scene understanding method CSC [17] in semantic segmentation and object detection under various percentages of ScanNet training data. With only 20% of the training data, we can recover \(87\%\) of the semantic segmentation performance obtained by training from scratch with 100% of the training data. In object detection, our pre-training enables improved performance for all percentage settings, notably in the very limited regime with +3.0/4.5% mAP@0.5 over CSC/training from scratch at 10% data, and +2.5/5.9% mAP@0.5 with 20% data.
5.4 Ablation Studies
Effect of 3D and 4D Data Augmentation. We consider a baseline variant of our approach that considers only the static 3D scene data without any scene-object augmentations with 3D-3D constraints during pre-training in Table 6 (Ours (3D data, 3D-3D only)), which provides some improvement over training from scratch but is notably improved with our 4D pre-training formulation. We additionally consider using our 4D scene-object augmentation with only 3D-3D constraints between sequence frames during pre-training (Ours (4D data, 3D-3D only)) in Table 6, which helps to additionally improve performance with implicitly learned priors from 4D data. Both are further improved by our approach to explicitly learn 4D priors in 3D features.
Effect of 4D-Invariant Contrastive Priors. In Table 6, we see that learning 4D-invariant contrastive priors through our 3D-4D and 4D-4D constraints during pre-training improves upon the variants that rely on data augmentation only. Additionally, Table 7 compares the 3D variant of our approach with our full 4D-based pre-training across a variety of downstream tasks, showing consistent improvements from learned 4D-based priors.
Effect of SimSiam Contrastive Learning. We also consider the effect of our SimSiam contrastive framework as PointContrast [43] leverages a PointInfoNCE contrastive loss. We note that the 3D variant of our approach (Ours (3D data, 3D-3D only)) reflects a PointContrast [43] setting using our scene augmentation and SimSiam architecture, which our 4D-based feature learning outperforms.
5.5 Discussion
While 4DContrast pre-training demonstrates the effectiveness of leveraging 4D priors for learned 3D representations, various limitations remain. In particular, 4D feature learning with sparse convolutions involves considerable memory during pre-training, so we use half-resolution for characterizing 4D features relative to 3D features and limited sequence durations. Additionally, we consider a subset of 4D motion when augmenting scenes with moving synthetic objects, and believe exploration of articulated motion or more complex dynamic object interactions would lead to additional insight and robustness of learned feature representations.
Memory and Speed. Our 4D-imbued pre-training results in consistent improvements across a variety of tasks and datasets, even when using only the learned 3D backbone for downstream training and inference. Thus, our method maintains the same memory and speed costs for inference as purely 3D-based pre-training approaches. For pre-training, our joint 3D-4D training uses additional parameters (33M for the 4D network in addition to the 38M for the 3D network), but due to jointly learning 4D priors with SimSiam, we do not require as large a batch size as PointContrast [43] (12 vs. their 48), nor as many iterations (up to 30K vs. 60K), resulting in slightly less total memory use and pre-training time overall.
6 Conclusion
We have presented 4DContrast, a new approach for 3D representation learning that incorporates 4D priors into learned features during pre-training. We propose a data augmentation scheme to construct 4D sequences of moving synthetic objects in static 3D scenes, without requiring any semantic labels. This enables learning from 4D sequences, and we establish contrastive constraints between learned 3D features and 4D features from the inherent correspondences given in the 4D sequence generation. Our experiments demonstrate that our 4D-imbued pre-training results in performance improvement across a variety of 3D downstream tasks and datasets. Additionally, our learned features effectively transfer to limited training data scenarios, significantly outperforming the state of the art in the low training data regime. We hope that this will lead to additional insights in 3D representation learning and new possibilities in 3D scene understanding.
References
Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: International Conference on 3D Vision, pp. 667–676 (2017)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020)
Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020)
Chen, X., He, K.: Exploring simple Siamese representation learning. In: Conference on Computer Vision and Pattern Recognition, pp. 15750–15758 (2021)
Chen, Y., et al.: Shape self-correction for unsupervised point cloud understanding. In: International Conference on Computer Vision, pp. 8382–8391 (2021)
Choy, C., Gwak, J., Savarese, S.: 4D spatio-temporal convnets: Minkowski convolutional neural networks. In: Conference on Computer Vision and Pattern Recognition, pp. 3075–3084 (2019)
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: richly-annotated 3d reconstructions of indoor scenes. In: Conference on Computer Vision and Pattern Recognition, pp. 5828–5839 (2017)
Dai, A., Nießner, M.: 3DMV: joint 3D-multi-view prediction for 3D semantic scene segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 458–474. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_28
Engelmann, F., Bokeloh, M., Fathi, A., Leibe, B., Nießner, M.: 3D-MPA: multi-proposal aggregation for 3D semantic instance segmentation. In: Conference on Computer Vision and Pattern Recognition, pp. 9031–9040 (2020)
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Conference on Computer Vision and Pattern Recognition, pp. 3354–3361 (2012)
Graham, B., Engelcke, M., Van Der Maaten, L.: 3D semantic segmentation with submanifold sparse convolutional networks. In: Conference on Computer Vision and Pattern Recognition, pp. 9224–9232 (2018)
Grill, J.B., et al.: Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733 (2020)
Han, L., Zheng, T., Xu, L., Fang, L.: OccuSeg: occupancy-aware 3D instance segmentation. In: Conference on Computer Vision and Pattern Recognition, pp. 2940–2949 (2020)
Hassani, K., Haley, M.: Unsupervised multi-task feature learning on point clouds. In: International Conference on Computer Vision, pp. 8160–8171 (2019)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
Hou, J., Dai, A., Nießner, M.: 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In: Conference on Computer Vision and Pattern Recognition, pp. 4421–4430 (2019)
Hou, J., Graham, B., Nießner, M., Xie, S.: Exploring data-efficient 3D scene understanding with contrastive scene contexts. In: Conference on Computer Vision and Pattern Recognition, pp. 15587–15597 (2021)
Hu, W., Zhao, H., Jiang, L., Jia, J., Wong, T.T.: Bidirectional projection network for cross dimension scene understanding. In: Conference on Computer Vision and Pattern Recognition, pp. 14373–14382 (2021)
Huang, J., Zhang, H., Yi, L., Funkhouser, T., Nießner, M., Guibas, L.J.: TextureNet: consistent local parametrizations for learning from high-resolution signals on meshes. In: Conference on Computer Vision and Pattern Recognition, pp. 4440–4449 (2019)
Huang, S., Xie, Y., Zhu, S.C., Zhu, Y.: Spatio-temporal self-supervised representation learning for 3D point clouds. In: International Conference on Computer Vision, pp. 6535–6545 (2021)
Jiang, L., Zhao, H., Shi, S., Liu, S., Fu, C.W., Jia, J.: PointGroup: dual-set point grouping for 3D instance segmentation. In: Conference on Computer Vision and Pattern Recognition, pp. 4867–4876 (2020)
Kundu, A., et al.: Virtual multi-view fusion for 3D semantic segmentation. In: European Conference on Computer Vision, pp. 518–535 (2020)
Liang, H., et al.: Exploring geometry-aware contrast and clustering harmonization for self-supervised 3D object detection. In: International Conference on Computer Vision, pp. 3293–3302 (2021)
Nekrasov, A., Schult, J., Litany, O., Leibe, B., Engelmann, F.: Mix3D: out-of-context data augmentation for 3D scenes. In: 2021 International Conference on 3D Vision (3DV), pp. 116–125. IEEE (2021)
Nie, Y., Hou, J., Han, X., Nießner, M.: RfD-Net: point scene understanding by semantic instance reconstruction. In: Conference on Computer Vision and Pattern Recognition, pp. 4608–4618 (2021)
Qi, C.R., Chen, X., Litany, O., Guibas, L.J.: ImVoteNet: boosting 3D object detection in point clouds with image votes. In: Conference on Computer Vision and Pattern Recognition, pp. 4404–4413 (2020)
Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep Hough voting for 3D object detection in point clouds. In: International Conference on Computer Vision, pp. 9277–9286 (2019)
Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum PointNets for 3D object detection from RGB-D data. In: Conference on Computer Vision and Pattern Recognition, pp. 918–927 (2018)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Neural Information Processing Systems (2017)
Rao, Y., Liu, B., Wei, Y., Lu, J., Hsieh, C.J., Zhou, J.: RandomRooms: unsupervised pre-training from synthetic shapes and randomized layouts for 3D object detection. In: International Conference on Computer Vision, pp. 3283–3292 (2021)
Rozenberszki, D., Litany, O., Dai, A.: Language-grounded indoor 3D semantic segmentation in the wild. arXiv preprint arXiv:2204.07761 (2022)
Sanghi, A.: Info3D: representation learning on 3D objects using mutual information maximization and contrastive learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12374, pp. 626–642. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58526-6_37
Sauder, J., Sievers, B.: Self-supervised deep learning on point clouds by reconstructing space. In: Neural Information Processing Systems (2019)
Schult, J., Engelmann, F., Kontogianni, T., Leibe, B.: DualConvMesh-Net: joint geodesic and Euclidean convolutions on 3D meshes. In: Conference on Computer Vision and Pattern Recognition, pp. 8612–8622 (2020)
Song, S., Lichtenberg, S.P., Xiao, J.: SUN RGB-D: a RGB-D scene understanding benchmark suite. In: Conference on Computer Vision and Pattern Recognition, pp. 567–576 (2015)
Song, S., Xiao, J.: Deep sliding shapes for amodal 3D object detection in RGB-D images. In: Conference on Computer Vision and Pattern Recognition, pp. 808–816 (2016)
Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3D shape recognition. In: International Conference on Computer Vision, pp. 945–953 (2015)
Wang, H., Liu, Q., Yue, X., Lasenby, J., Kusner, M.J.: Unsupervised point cloud pre-training via occlusion completion. In: International Conference on Computer Vision, pp. 9782–9792 (2021)
Wang, P.S., Yang, Y.Q., Zou, Q.F., Wu, Z., Liu, Y., Tong, X.: Unsupervised 3D learning for shape analysis via multiresolution instance discrimination. In: AAAI Conference on Artificial Intelligence, vol. 35, pp. 2773–2781 (2021)
Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes. In: Conference on Computer Vision and Pattern Recognition, pp. 1912–1920 (2015)
Xie, Q., et al.: MLCVNet: multi-level context VoteNet for 3D object detection. In: Conference on Computer Vision and Pattern Recognition, pp. 10447–10456 (2020)
Xie, S., Gu, J., Guo, D., Qi, C.R., Guibas, L., Litany, O.: PointContrast: unsupervised pre-training for 3D point cloud understanding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12348, pp. 574–591. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8_34
Yi, L., Zhao, W., Wang, H., Sung, M., Guibas, L.J.: GSPN: generative shape proposal network for 3D instance segmentation in point cloud. In: Conference on Computer Vision and Pattern Recognition, pp. 3947–3956 (2019)
Zhang, B., Wonka, P.: Point cloud instance segmentation using probabilistic embeddings. In: Conference on Computer Vision and Pattern Recognition, pp. 8883–8892 (2021)
Zhang, Z., Girdhar, R., Joulin, A., Misra, I.: Self-supervised pretraining of 3D features on any point-cloud. In: International Conference on Computer Vision, pp. 10252–10263 (2021)
Zhang, Z., Sun, B., Yang, H., Huang, Q.: H3DNet: 3D object detection using hybrid geometric primitives. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 311–329. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_19
Acknowledgements
This project is funded by the Bavarian State Ministry of Science and the Arts and coordinated by the Bavarian Research Institute for Digital Transformation (bidt), the TUM Institute of Advanced Studies (TUM-IAS), the ERC Starting Grant Scan2CAD (804724), and the German Research Foundation (DFG) Grant Making Machine Learning on Static and Dynamic 3D Data Practical.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, Y., Nießner, M., Dai, A. (2022). 4DContrast: Contrastive Learning with Dynamic Correspondences for 3D Scene Understanding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13692. Springer, Cham. https://doi.org/10.1007/978-3-031-19824-3_32
DOI: https://doi.org/10.1007/978-3-031-19824-3_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19823-6
Online ISBN: 978-3-031-19824-3
eBook Packages: Computer Science (R0)