
1 Introduction

3D semantic scene understanding has seen remarkable progress in recent years, driven in large part by advances in deep learning as well as the introduction of large-scale, annotated datasets [1, 7, 10]. In particular, notable progress has been made on core 3D scene understanding tasks such as 3D semantic segmentation, object detection, and instance segmentation, which are fundamental to many real-world computer vision applications such as robotics, mixed reality, or autonomous driving. These approaches have developed various methods to learn on different 3D scene representations, such as sparse or dense volumetric grids [6, 7, 11], point clouds [29, 30], meshes [19], or multi-view approaches [8, 38]. Recently, driven by the success of unsupervised representation learning for transfer learning in 2D, 3D scene understanding methods have been augmented with unsupervised 3D pre-training to further improve performance on downstream 3D scene understanding tasks [17, 20, 43, 46] (Fig. 1).

Fig. 1.

We propose 4DContrast to imbue learned 3D representations with 4D priors. We introduce a data augmentation scheme to composite synthetic 3D objects with real-world 3D scans to create 4D sequence data with inherent correspondence information. We then leverage a combination of 3D-3D, 3D-4D, and 4D-4D constraints within a contrastive learning framework to learn 4D invariance in the 3D representations. The learned features can be transferred to improve performance in various downstream 3D scene understanding tasks.

While such 3D representation learning has focused on feature representations learned from static 3D scenes, we observe that important notions of objectness are given by 4D dynamic observations – for instance, object segmentations can often be naturally intuited by observing objects moving around an environment, without any annotations required, whereas this can be more difficult from a static 3D observation. We thus propose to leverage this powerful 4D signal in unsupervised pre-training to imbue 4D object priors into learned 3D representations, which can then be effectively transferred to various downstream 3D scene understanding tasks for improved recognition performance.

In this work, we introduce 4DContrast to learn about objectness from both static 3D and dynamic 4D information in learned 3D representations. We leverage a combination of static 3D scanned scenes and a database of synthetic 3D shapes, and augment the scenes with moving synthetic shapes to generate 4D sequence data with inherent motion correspondence. We then employ a contrastive learning scheme under both 3D and 4D constraints, correlating local 3D point features with each other as well as with 4D sequence features, thus imbuing learned objectness from dynamic information into the 3D representation learning.

To demonstrate our approach, we pre-train on ScanNet [7] along with ModelNet [41] shapes for unsupervised 3D representation learning. Experiments on 3D semantic segmentation, object detection, and instance segmentation show that 4DContrast learns effective features that can be transferred to achieve improved performance in various downstream 3D scene understanding tasks. 4DContrast also generalizes from pre-training on ScanNet and ModelNet to improved performance on SUN RGB-D [36]. Additionally, we show that our learned representations remain robust in limited training data scenarios, consistently improving performance under various amounts of available training data.

Our main contributions are summarized as follows:

  • We propose the first method to leverage 4D sequence information and constraints for 3D representation learning, showing transferability of the learned features to the downstream 3D scene understanding tasks of 3D semantic segmentation, object detection, and instance segmentation.

  • Our new unsupervised pre-training based on constructing 4D sequences from synthetic 3D shapes in real-world, static 3D scenes improves performance across a variety of downstream tasks and different datasets.

2 Related Work

3D Semantic Scene Understanding. Driven by rapid developments in deep learning and the introduction of several large-scale, annotated 3D datasets [1, 7, 10], notable progress has been made in 3D semantic scene understanding, in particular the tasks of 3D semantic segmentation [6,7,8, 11, 18, 19, 24, 30, 32], 3D object detection [25,26,27, 42, 47], and 3D instance segmentation [9, 13, 16, 21, 45]. Many methods have been proposed, largely focusing on learning on various 3D representations, such as sparse or dense volumetric grids [6, 7, 11], point clouds [21, 27, 29, 30], meshes [19, 35], or multi-view hybrid representations [8, 22]. In particular, approaches leveraging backbones built with sparse convolutions [6, 11] have shown strong effectiveness across a variety of 3D scene understanding tasks and datasets. We propose a new unsupervised pre-training approach to learn 4D priors in learned 3D representations, leveraging sparse convolutional backbones for both 3D and 4D feature extraction.

3D Representation Learning. Inspired by the success of representation learning in 2D, particularly that leveraging instance discrimination with contrastive learning [2, 3, 15], recent works have explored unsupervised learning with 3D pretext tasks that can be leveraged for fine-tuning on downstream 3D scene understanding tasks [5, 14, 17, 20, 23, 31, 33, 34, 39, 40, 43, 46]. For instance, [14, 34] learn feature representations from point-based instance discrimination for object classification and segmentation, and [17, 43, 46] extend to more complex 3D scenes by generating correspondences from various different views of scene point clouds.

In particular, given the relative scarcity of real-world 3D environment data, Hou et al. [17] additionally demonstrate the efficacy of contrastive 3D pre-training for various 3D semantic scene understanding tasks under a variety of limited training data scenarios. In contrast to these methods that employ 3D-only pretext tasks for representation learning, we propose to learn from 4D sequence data to embed 4D priors into learned 3D representations for more effective transfer to downstream 3D tasks.

Recently, Huang et al. [20] propose to learn from the inherent sequence data of RGB-D video to incorporate the notion of a temporal sequence. Constraints are established across pairs of frames in the sequence; however, the sequence data itself represents static scenes without any movement within the scene, limiting the temporal signal that can be learned. In contrast, we consider 4D sequence data containing object movement through the scene, which can provide additional semantic signal about objectness through an object’s motion. Additionally, Rao et al. [31] propose to learn from 3D scenes that are synthetically generated by randomly placing synthetic CAD models on a rectangular layout. They employ object-level contrastive learning on object-level features, resulting in improved 3D object detection performance. We also leverage synthetic CAD models for data augmentation, but we compose them with real-world 3D scan data to generate 4D sequences of objects in motion, and exploit learned 4D features to enhance learned 3D representations, with performance improvement on various downstream 3D scene understanding tasks.

Fig. 2.

Method overview. 4DContrast learns effective 3D feature representations imbued with 4D signal from moving object sequences. During pre-training, we augment static 3D scene data with a moving object from a synthetic shape dataset. We can then establish dynamic correspondences between the spatio-temporal features learned from the 4D sequence and the 3D features of individual static frames. We employ contrastive learning not only under 3D geometric correspondence between individual frames, but also between frames and their corresponding 4D counterpart, as well as under 4D-4D constraints to anchor the 4D feature learning. This enables 4D-invariant representation learning, which we can apply to various downstream 3D scene understanding tasks.

3 4D-Invariant Representation Learning

4DContrast presents a new approach to 3D representation learning: our key idea is to employ 4D constraints during pre-training, in order to imbue learned features with 4D invariance and a notion of objectness learned from seeing an object in motion. We consider a dataset of 3D scans \(\mathcal {S}=\{S_i\}\) as well as a dataset of synthetic 3D objects \(\mathcal {O} = \{O_j\}\), and construct dynamic sequences with inherent correspondence information by moving a synthetic object \(O_j\) in a static 3D scan \(S_i\). This enables us to establish 4D correspondences along with 3D-4D correspondences as constraints under a contrastive learning framework for unsupervised pre-training. An overview of our approach is shown in Fig. 2.

3.1 Revisiting SimSiam

We first revisit SimSiam [4], which introduced a simple yet powerful approach for contrastive 2D representation learning. Inspired by the effectiveness of SimSiam, we build an unsupervised contrastive learning scheme for embedding 4D priors into 3D representations.

SimSiam considers two augmented variants \(I_1\) and \(I_2\) of an image I, which are input to a weight-shared encoder network \(\varPhi _{\text {2D}}\) (a 2D convolutional backbone followed by a projection MLP). A prediction MLP head \(P_{\text {2D}}\) then transforms the output of one view as \(p_1^{\text {2D}}=P_{\text {2D}}(\varPhi _{\text {2D}}(I_1))\) to match the other output \(z_2^{\text {2D}}=\varPhi _{\text {2D}}(I_2)\), minimizing the negative cosine similarity [12]:

$$\begin{aligned} \mathcal {D}(p_1^{\text {2D}}, z_2^{\text {2D}}) = -\frac{p_1^{\text {2D}}}{{||p_1^{\text {2D}}||}_2}\cdot \frac{z_2^{\text {2D}}}{{||z_2^{\text {2D}}||}_2}. \end{aligned}$$
(1)

SimSiam also uses a stop-gradient (SG) operation that treats \(z_2^{\text {2D}}\) as a constant during back-propagation, to prevent collapse during training, thus modifying Eq. 1 as: \(\mathcal {D}(p_1^{\text {2D}}, SG(z_2^{\text {2D}}))\). A symmetrized loss is defined over the two augmented inputs:

$$\begin{aligned} \mathcal {L^\text {2D}} = \frac{1}{2}\mathcal {D}(p_1^{\text {2D}}, SG(z_2^{\text {2D}})) + \frac{1}{2}\mathcal {D}(p_2^{\text {2D}}, SG(z_1^{\text {2D}})). \end{aligned}$$
(2)

SimSiam has been shown to be very effective at learning invariances under various image augmentations, without requiring negative samples or very large batches. We thus build from this contrastive framework for our 3D-4D constraints, as it allows for our high-dimensional pre-training design.
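For concreteness, the following is a minimal PyTorch sketch (our illustration, not released code from SimSiam or 4DContrast) of the negative cosine similarity with stop-gradient and the symmetrized loss of Eqs. 1 and 2; the function names are ours.

```python
import torch
import torch.nn.functional as F

def neg_cosine(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """D(p, SG(z)) of Eq. 1: negative cosine similarity with stop-gradient on z."""
    z = z.detach()  # stop-gradient: z is treated as a constant in back-propagation
    return -(F.normalize(p, dim=-1) * F.normalize(z, dim=-1)).sum(dim=-1).mean()

def simsiam_loss(p1, z1, p2, z2):
    """Symmetrized loss of Eq. 2, with p_k = P(Phi(I_k)) and z_k = Phi(I_k)."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

The same symmetrized form is reused below for our 3D-3D, 3D-4D, and 4D-4D constraints, with image features replaced by per-point features.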

Fig. 3.

4DContrast pre-training. We visualize 3D-3D, 3D-4D, and 4D-4D losses across frame and spatio-temporal correspondence. Note that losses are established across all pairs of frames for \(\mathcal {L^\text {3D}}\) and \(\mathcal {L^\text {4D}}\), and across all frames for \(\mathcal {L^\text {3D4D}}\); for visualization we only show those associated with frames \(F_{t-2}\) and \(F_{t-1}\), and only those for \(F_{t-1}\) for \(\mathcal {L^\text {3D4D}}\). Each loss only propagates back according to the gradient arrows due to stop-gradient operations for stable training. (Color figure online)

3.2 4D-Invariant Contrastive Learning

To imbue effective 4D priors into learned 3D features, we consider a static 3D scan S and a synthetic 3D object O as a training sample, and compose them together to form a sequence \(\{F_0,...,F_{t-1}\}\) of dynamic object movement in the scene over t time steps (as described in Sect. 3.3). We then establish spatial correspondences between frames (3D-3D), spatio-temporal correspondences (3D-4D), and dynamic correspondences (4D-4D) as constraints. 3D features are extracted with a 3D encoder \(\varPhi _{\text {3D}}\) and 4D features with a 4D encoder \(\varPhi _{\text {4D}}\), with respective prediction MLPs \(P_{\text {3D}}\) and \(P_{\text {4D}}\).

Inter-Frame Spatial Correspondence. For each pair of frames \((F_i, F_j)\) in a training sequence F, we consider their spatial correspondence across sequence frames in order to implicitly encode pose invariance over the dynamic sequence. That is, points that correspond to the same location in the original 3D scene S or original object O should also correspond in feature space. Given the set of corresponding point locations \(\mathcal {A}_{i,j}\) from frames \((F_i, F_j)\), for each pair of point locations \((\textbf{a}_i, \textbf{b}_j)\in \mathcal {A}_{i,j}\) we obtain their 3D backbone features at the respective locations: \(p^{\text {3D}}_{i,a}=P_{\text {3D}}(\varPhi _{\text {3D}}(F_i))(\textbf{a}_i)\) and \(z^{\text {3D}}_{j,b}=\varPhi _{\text {3D}}(F_j)(\textbf{b}_j)\). We then compute a symmetrized negative cosine similarity loss between features of corresponding point locations:

$$\begin{aligned} \mathcal {L}^{\text {3D}}_{\mathcal {A}_{i,j}} = \sum _{(a,b)\in \mathcal {A}_{i,j}} \Big ( \frac{1}{2}\mathcal {D}(p^{\text {3D}}_{i,a}, SG(z^{\text {3D}}_{j,b})) + \frac{1}{2}\mathcal {D}(p^{\text {3D}}_{j,b}, SG(z^{\text {3D}}_{i,a})) \Big ). \end{aligned}$$
(3)

In Fig. 3, we use arrows to indicate constraints between frame \(F_{t-2}\) and frame \(F_{t-1}\).

We compute Eq. 3 over each pair of frames in the whole sequence F:

$$\begin{aligned} \mathcal {L}^{\text {3D}} = \sum _{0\le i<j\le t-1}\mathcal {L}^{\text {3D}}_{\mathcal {A}_{i,j}}. \end{aligned}$$
(4)

By establishing constraints across 3D frames in a 4D sequence, we encode pose invariance of moving objects across varying backgrounds into the learned 3D features.
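As an illustration of how these constraints can be accumulated, the sketch below (our simplification, building on the simsiam_loss helper sketched in Sect. 3.1) gathers the features of corresponding points from each frame pair and applies the symmetrized loss of Eq. 3 over all pairs as in Eq. 4; the tensor layout and variable names are ours, and we average over points where the equations sum.

```python
from itertools import combinations

def loss_3d(feats_p, feats_z, correspondences):
    """feats_p[i]: predictor features P_3D(Phi_3D(F_i)) per point, shape (N_i, C).
       feats_z[i]: encoder features Phi_3D(F_i) per point, shape (N_i, C).
       correspondences[(i, j)]: (idx_i, idx_j) index tensors of matched points."""
    total = 0.0
    t = len(feats_p)
    for i, j in combinations(range(t), 2):          # all frame pairs i < j (Eq. 4)
        idx_i, idx_j = correspondences[(i, j)]
        total = total + simsiam_loss(
            feats_p[i][idx_i], feats_z[i][idx_i],   # p, z at locations a in F_i
            feats_p[j][idx_j], feats_z[j][idx_j])   # p, z at locations b in F_j
    return total
```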

Spatio-Temporal Correspondence. In addition to implicitly encoding pose invariance of moving objects, we establish explicit 3D-4D correspondences to learn 4D priors, encouraging 4D-invariance in the learned features. For a training sequence \(F=\{F_0,...,F_{t-1}\}\), we use the 4D encoder \(\varPhi _\text {4D}\) and the 4D predictor \(P_\text {4D}\) to extract 4D features from the whole sequence. Here \(z^{\text {4D}}_{i,a}\) denotes the 4D features output by the 4D encoder \(\varPhi _\text {4D}\) at point location \(\textbf{a}_i\) in frame i, and \(p^{\text {4D}}_{i,a}\) denotes the 4D features output by the 4D predictor \(P_\text {4D}\). Then for a frame \(F_i\), we consider each 3D point \(\textbf{a}_i\in \mathcal {A}_i\) in the set of points of \(F_i\), and establish a constraint between its 3D feature extracted by the 3D network (\(\varPhi _\text {3D}\) and \(P_\text {3D}\)) and its 4D feature extracted by the 4D network (\(\varPhi _\text {4D}\) and \(P_\text {4D}\)):

$$\begin{aligned} \mathcal {L}^{\text {3D4D}}_{\mathcal {A}_i} = \sum _{a\in \mathcal {A}_i} \Big ( \frac{1}{2}\mathcal {D}(SG({p^{\text {3D}}_{i,a}}), {z^{\text {4D}}_{i,a}}) + \frac{1}{2}\mathcal {D}(SG({p^{\text {4D}}_{i,a}}), {z^{\text {3D}}_{i,a}}) \Big ). \end{aligned}$$
(5)

In Fig. 3, arrows indicate the constraints for frame \(F_{t-1}\). For the entire input sequence F, we calculate Eq. 5 for every frame, and the 3D-4D contrastive loss \(\mathcal {L}^{\text {3D4D}}\) is defined as:

$$\begin{aligned} \mathcal {L}^{\text {3D4D}} = {\sum ^{t-1}_{i=0}}\mathcal {L}^{\text {3D4D}}_{\mathcal {A}_i}. \end{aligned}$$
(6)

Additionally, in order to learn spatio-temporally consistent 4D representations, we employ 4D-4D correspondence constraints inherent to the 4D features within the same point cloud sequence. This is formulated analogously to Eq. 3, replacing the 3D features with the 4D features from different time steps that correspond spatially:

$$\begin{aligned} \mathcal {L}^{\text {4D}}_{\mathcal {A}_{i,j}} = \sum _{a\in \mathcal {A}_{i,j}} \Big ( \frac{1}{2}\mathcal {D}(p^{\text {4D}}_{i,a}, SG(z^{\text {4D}}_{j,a})) + \frac{1}{2}\mathcal {D}(p^{\text {4D}}_{j,a}, SG(z^{\text {4D}}_{i,a})) \Big ). \end{aligned}$$
(7)

In Fig. 3, we use arrows to indicate 4D constraints between frame \(F_{t-2}\) and frame \(F_{t-1}\). We evaluate Eq. 7 over every pair of frames in the entire input sequence F, with the 4D contrastive loss \(\mathcal {L}^\text {4D}\) defined as:

$$\begin{aligned} \mathcal {L}^{\text {4D}} = \sum _{0\le i<j\le t-1}\mathcal {L}^{\text {4D}}_{\mathcal {A}_{i,j}}. \end{aligned}$$
(8)

Joint Learning. Our overall training loss \(\mathcal {L}\) consists of three parts including 3D contrastive loss \(\mathcal {L}^{\text {3D}}\), 3D-4D contrastive loss \(\mathcal {L}^{\text {3D4D}}\), and 4D contrastive loss \(\mathcal {L}^{\text {4D}}\):

$$\begin{aligned} \mathcal {L} = w_{\text {3D}}\mathcal {L}^{\text {3D}} + w_{\text {3D4D}}\mathcal {L}^{\text {3D4D}} + w_{\text {4D}}\mathcal {L}^{\text {4D}}, \end{aligned}$$
(9)

where constant weights \(w_{\text {3D}}\), \(w_{\text {3D4D}}\) and \(w_{\text {4D}}\) are used to balance the losses.
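To make the assembly of these terms concrete, the sketch below (ours, reusing the neg_cosine and simsiam_loss helpers sketched in Sect. 3.1) computes the per-frame 3D-4D term of Eqs. 5 and 6 and the weighted joint objective of Eq. 9; the default weights shown are placeholders, not the values used in our experiments.

```python
def loss_3d4d(p3d, z3d, p4d, z4d):
    """Eqs. 5-6: p3d[i]/z3d[i] and p4d[i]/z4d[i] hold per-point 3D and 4D
    features of frame i, gathered at the same point locations a in A_i."""
    total = 0.0
    for i in range(len(p3d)):                     # every frame of the sequence
        # Eq. 5 places the stop-gradient on the predictor outputs; since the
        # cosine similarity is symmetric, passing the predictor output as the
        # second (detached) argument of neg_cosine reproduces this.
        total = total + 0.5 * neg_cosine(z4d[i], p3d[i]) \
                      + 0.5 * neg_cosine(z3d[i], p4d[i])
    return total

def total_loss(l_3d, l_3d4d, l_4d, w_3d=1.0, w_3d4d=1.0, w_4d=1.0):
    """Eq. 9: weighted sum of the 3D, 3D-4D, and 4D contrastive losses."""
    return w_3d * l_3d + w_3d4d * l_3d4d + w_4d * l_4d
```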

3.3 Generating 4D Correspondence via Scene-Object Augmentation

To learn from 4D sequence data to embed 4D priors into learned 3D representations, we leverage existing large-scale real-world 3D scan datasets in combination with synthetic 3D shape datasets. This enables generation of 4D correspondences without requiring any labels – by augmenting static 3D scenes with generated trajectories of a moving synthetic object within the scene, which provides inherent 4D correspondence knowledge across the object motion. Thus for pre-training, we consider a pair \((S, O)\) of a reconstructed scan and an arbitrarily sampled synthetic 3D shape, and generate a 4D sequence \(F=\{F_0,...,F_{t-1}\}\) by moving the object through the scene.

Trajectory Generation. We first generate a trajectory for O in S. We voxelize S at 10 cm voxel resolution, and accumulate occupied surface voxels along the height dimension to acquire a 2D map of the scene geometry. Valid object locations are then identified as those cells in the 2D map with a voxel accumulation \(\le \)1 whose maximum accumulated voxel height lies near the ground floor (within 20 cm of the average floor height). For the object O, we consider all possible 2D locations, and if O does not exceed the valid region (based on its bounding sphere), the location is taken as a candidate object position. A random position sampled from these candidate positions is taken as the starting point of the trajectory. We then randomly sample a step distance in [30, 90] cm and a step direction such that the angular change in trajectory is <150\(^\circ \), and select the nearest valid candidate position as the next trajectory point. We repeat this process for t time steps in the sequence to obtain 4D scene-object augmentations for pre-training; a simplified sketch of this sampling procedure follows.
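The sketch below (our own NumPy simplification, not the exact implementation) follows the thresholds stated above; the floor-height estimate and the bounded-turn rule are coarse approximations, and the bounding-sphere validity test is omitted for brevity.

```python
import numpy as np

def valid_map(scene_points, voxel=0.10, floor_band=0.20):
    """2D map of candidate object positions: cells with <=1 accumulated surface
    voxel whose top surface stays within 20 cm of the (approximate) floor."""
    ij = np.floor(scene_points[:, :2] / voxel).astype(int)
    ij -= ij.min(axis=0)                               # shift map origin to zero
    shape = tuple(ij.max(axis=0) + 1)
    counts = np.zeros(shape, dtype=int)
    tops = np.full(shape, -np.inf)
    np.add.at(counts, (ij[:, 0], ij[:, 1]), 1)         # accumulate occupied voxels
    np.maximum.at(tops, (ij[:, 0], ij[:, 1]), scene_points[:, 2])
    floor = np.percentile(scene_points[:, 2], 1)       # crude floor height estimate
    return (counts <= 1) & (tops <= floor + floor_band)

def sample_trajectory(cand_xy, t=4, rng=None):
    """Random walk over candidate 2D positions (in meters): 30-90 cm steps,
    bounded turns, snapped to the nearest valid candidate position."""
    rng = rng or np.random.default_rng()
    traj = [cand_xy[rng.integers(len(cand_xy))]]       # random starting point
    heading = rng.uniform(0.0, 2.0 * np.pi)
    for _ in range(t - 1):
        heading += rng.uniform(-150, 150) * np.pi / 180.0   # turn < 150 degrees
        step = rng.uniform(0.30, 0.90)
        target = traj[-1] + step * np.array([np.cos(heading), np.sin(heading)])
        nearest = cand_xy[np.argmin(np.linalg.norm(cand_xy - target, axis=1))]
        traj.append(nearest)
    return np.stack(traj)
```

Candidate positions cand_xy can be recovered from the valid map cells, e.g. via np.argwhere scaled by the voxel size and shifted back by the map origin.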

4D Sequence Generation. A sequence of point clouds is then generated based on the computed object trajectory for the scan, up to sequence length t, by compositing the object into the scene under its translation and rotation steps per frame. This provides inherent correspondence information between 3D scene locations and 4D object movement through the scene.
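A minimal sketch of compositing a single frame is given below (our illustration; the per-frame yaw rotation about the height axis is an assumption, and variable names are ours). Keeping per-point provenance makes the scene/object correspondences across frames available for free to the contrastive losses.

```python
def composite_frame(scene_pts, obj_pts, position_xy, heading):
    """Place the object at one trajectory step and merge it with the scene."""
    c, s = np.cos(heading), np.sin(heading)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # yaw rotation
    placed = obj_pts @ R.T
    placed[:, :2] += position_xy
    frame = np.concatenate([scene_pts, placed], axis=0)
    # 0 = background scan point, 1 = synthetic object point (correspondence id)
    provenance = np.concatenate([np.zeros(len(scene_pts)), np.ones(len(obj_pts))])
    return frame, provenance
```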

Table 1. Summary of fine-tuning of 4DContrast for various downstream 3D scene understanding tasks and datasets. Our pre-training approach learns effective, transferable features, resulting in notable improvement over the baseline learning paradigm of training from scratch.

Scene Augmentation. We augment the 4D sequences by randomly sampling different points across the geometry in each individual frame. We also randomly remove cubic chunks of points in the background 3D scene for additional data variation, with the number of removed chunks randomly sampled from [5, 15] and the size of the chunks randomly sampled in [0.15, 0.45] as a proportion of the scene extent (see the sketch below). We discard any sequences that do not have enough correspondences in their frames; that is, \(\ge \)30% of the points in the original scan and \(\ge \)30% of the points of the synthetic object should be consistently represented in each frame, and each frame must maintain at least 50% of its points through the augmentation process. Additionally, we augment the static 3D frame interpretations of the sequence (but not the sequence itself) by applying random rotation, translation, and scaling to each individually considered 3D frame.
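The chunk-removal augmentation can be sketched as follows (our simplification; whether chunk sizes are sampled per axis or as a single scalar factor is an assumption on our part).

```python
def remove_chunks(scene_pts, rng=None):
    """Randomly carve out 5-15 axis-aligned cubic chunks, each sized at a
    0.15-0.45 fraction of the scene extent, from the background scan."""
    rng = rng or np.random.default_rng()
    keep = np.ones(len(scene_pts), dtype=bool)
    lo, hi = scene_pts.min(axis=0), scene_pts.max(axis=0)
    extent = hi - lo
    for _ in range(rng.integers(5, 16)):                 # 5 to 15 chunks
        size = rng.uniform(0.15, 0.45) * extent          # chunk edge lengths
        center = rng.uniform(lo, hi)                     # random chunk center
        inside = np.all(np.abs(scene_pts - center) < size / 2.0, axis=1)
        keep &= ~inside
    return scene_pts[keep]
```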

3.4 Network Architecture for Pre-training

During pre-training, we leverage correspondences induced by our 4D data generation, between encoded 3D frames as well as across the encoded 4D sequence. To this end, we employ 3D and 4D feature extractors as meta-architectures for 4D-invariant learning.

To extract per-point features from a 3D scene, we use a 3D encoder \(\varPhi _\text {3D}\) and a 3D predictor \(P_\text {3D}\). \(\varPhi _\text {3D}\) is a U-Net architecture based on sparse 3D convolutions with residual blocks, followed by a \(1\times 1\times 1\) sparse convolutional projection layer, and \(P_\text {3D}\) consists of two \(1\times 1\times 1\) sparse convolutional layers.

To extract spatio-temporal features from a 4D sequence, we use a 4D encoder \(\varPhi _\text {4D}\) and a 4D predictor \(P_\text {4D}\). These are structured analogously to the 3D feature extraction, using sparse 4D convolutions instead. For more detailed architecture specifications, we refer to the supplemental material.
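As a rough sketch of how such heads can be instantiated with MinkowskiEngine (assumed here; the channel width, the nonlinearity between the two predictor convolutions, and the backbone configuration are our own placeholders, and the supplemental material holds the actual specification):

```python
import torch.nn as nn
import MinkowskiEngine as ME

def projection_head(channels, D):
    # single 1x1x1 (or 1x1x1x1 for D=4) sparse convolution after the U-Net
    return ME.MinkowskiConvolution(channels, channels, kernel_size=1, dimension=D)

def prediction_head(channels, D):
    # two 1x1x1 sparse convolutions; the intermediate nonlinearity is assumed
    return nn.Sequential(
        ME.MinkowskiConvolution(channels, channels, kernel_size=1, dimension=D),
        ME.MinkowskiReLU(inplace=True),
        ME.MinkowskiConvolution(channels, channels, kernel_size=1, dimension=D),
    )

# The 3D branch (D=3) and 4D branch (D=4) use the same construction around
# their respective sparse U-Net backbones.
proj_3d, pred_3d = projection_head(96, D=3), prediction_head(96, D=3)
proj_4d, pred_4d = projection_head(96, D=4), prediction_head(96, D=4)
```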

4 Experimental Setup

We demonstrate the effectiveness of our 4D-informed pre-training of learned 3D representations for a variety of downstream 3D scene understanding tasks.

Pre-training Setup. We use reconstructed 3D scans from ScanNet [7] and synthetic 3D shapes from ModelNet [41] to compose our 4D sequence data for pre-training. We use the official ScanNet train split with 1201 train scans, augmented with shapes from ModelNet from eight furniture categories: chair, desk, dresser, nightstand, sofa, table, bathtub, and toilet. For each 3D scan, we generate 20 trajectories of an object moving through the scan, following Sect. 3.3 with \(t=4\). For sequence generation we use 2 cm resolution for the scene and 1000 randomly sampled points from the synthetic object to compose together.

The 3D and 4D sparse U-Nets are implemented with MinkowskiEngine [6], using a 2 cm voxel size for 3D and a 5 cm voxel size for 4D. For pre-training we consider only geometric information from the scene-object sequence augmentations. We use an SGD optimizer with an initial learning rate of 0.25 and a batch size of 12. The learning rate is decreased by a factor of 0.99 every 1000 steps. We train for 50K steps until convergence.
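For reference, the optimization setup described above maps to the following sketch (ours; model, data_iter, and compute_losses are hypothetical placeholders, and the SGD momentum is an assumption not stated in the text).

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.25, momentum=0.9)
# decrease the learning rate by a factor of 0.99 every 1000 steps
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.99)

for step in range(50_000):
    l_3d, l_3d4d, l_4d = compute_losses(next(data_iter))  # hypothetical helper
    loss = total_loss(l_3d, l_3d4d, l_4d)                 # Eq. 9, sketched above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```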

Fine-Tuning on Downstream Tasks. We use the same pre-trained backbone network in the three 3D scene understanding tasks of semantic segmentation, instance segmentation, and object detection. For semantic segmentation, we directly use the U-Net architecture for dense label prediction, and for object detection and instance segmentation, we use VoteNet [27] and PointGroup [21] respectively, both with our pre-trained 3D U-Net backbone. All experiments, including comparisons with state of the art, are trained with geometric information only, unless otherwise noted. Fine-tuning experiments on semantic segmentation are trained with a batch size of 48 for 10K steps, using an initial learning rate of 0.8 with polynomial decay with power 0.9. For instance segmentation, we use the same training setup as PointGroup, with an initial learning rate of 0.1. For object detection, the network is trained for 500 epochs with an initial learning rate of 0.001, decayed by a factor of 0.5 at epochs 250, 350, and 450. We use a batch size of 6 on ScanNet and 16 on SUN RGB-D.

Table 2. 3D object detection on ScanNet. Our 4DContrast pre-training leads to improved performance in comparison with state of the art object detection and 3D pretraining schemes.

5 Results

We demonstrate that our learned features under 3D-4D constraints transfer effectively to a variety of downstream 3D scene understanding tasks. We consider both in-domain transfer to 3D scene understanding tasks on ScanNet [7] (Sect. 5.1), as well as out-of-domain transfer to SUN RGB-D [36] (Sect. 5.2); a summary is shown in Table 1. We also show data-efficient scene understanding (Sect. 5.3) and additional analysis (Sect. 5.4). Note that for all downstream experiments, we do not use the 4D backbone and thus use the same 3D U-Net architecture as PointContrast [43] and CSC [17].

All experiments, including our method and all baseline comparisons, are trained on geometric data only without any color information.

5.1 ScanNet

We first demonstrate our 4DContrast pre-training in fine-tuning for 3D object detection, semantic segmentation, and instance segmentation on ScanNet [7], showing the effectiveness of learning 3D features under 4D constraints. Tables 2, 3, and 4 evaluate performance on 3D object detection, semantic segmentation, and instance segmentation, respectively.

Table 2 shows 3D object detection results, for which our pre-training approach improves over the baseline of training from scratch (+5.5% mAP@0.5) as well as over the strong 3D-based pre-training methods of RandomRooms [31], PointContrast [43], and CSC [17].

In Tables 3 and 4, we evaluate semantic segmentation and instance segmentation in comparison with state-of-the-art 3D pre-training approaches [17, 43], as well as the baseline paradigm of training from scratch. These pre-training approaches improve notably over training from scratch, and our 4DContrast approach, leveraging learned representations under 4D constraints, leads to additional performance improvement over training from scratch (+2.3% mIoU for semantic segmentation and +4.2% mAP@0.5 for instance segmentation). We show qualitative results for semantic segmentation in Fig. 4.

5.2 SUN RGB-D

We additionally show that our 4DContrast learning scheme can produce transferable representations across datasets. We leverage our pre-trained weights from ScanNet + ModelNet, and explore downstream 3D object detection on the SUN RGB-D [36] dataset. SUN RGB-D is a dataset of RGB-D images, containing 10,335 frames captured with a variety of commodity RGB-D sensors. It contains 3D object bounding box annotations for 10 class categories. We follow the official train/test split of 5,285 train frames and 5,050 test frames.

Table 5 shows 3D object detection performance on SUN RGB-D, with qualitative results visualized in Fig. 5. We use the same pre-training as with ScanNet, with downstream fine-tuning on SUN RGB-D data. 4DContrast improves over training from scratch (+6.5% mAP@0.5), with our learned representations surpassing the 3D-based pre-training [17, 31, 43, 46].

Table 3. Semantic segmentation on ScanNet. Our 4D-informed pre-training learns effective features that lead to improved performance over training from scratch as well as over the state-of-the-art 3D-based pre-training of CSC [17] and PointContrast [43].
Table 4. Instance segmentation on ScanNet. Our 4D-imbued pre-training leads to significantly improved results over training from scratch, as well as favorable performance over other 3D-only pretraining schemes.
Fig. 4.

Qualitative results on ScanNet semantic segmentation. Our 4DContrast pre-training to encode 4D priors enables more consistent segmentation results, in comparison to training from scratch as well as 3D-based PointContrast [43] pre-training.

5.3 Data-Efficient 3D Scene Understanding

We evaluate our approach in the scenario of limited training data, as shown in Fig. 6. 4DContrast improves over baseline training from scratch as well as over the state-of-the-art data-efficient scene understanding approach CSC [17] in semantic segmentation and object detection under various percentages of ScanNet training data. With only 20% of the training data, we recover \(87\%\) of the semantic segmentation performance obtained by training from scratch with 100% of the training data. In object detection, our pre-training enables improved performance in all percentage settings, notably in the very limited regime with +3.0/4.5% mAP@0.5 over CSC/training from scratch at 10% data, and +2.5/5.9% mAP@0.5 with 20% data.

Table 5. 3D object detection on SUN RGB-D. Our 4D-based pre-training learns effective 3D representations, improving performance over training from scratch and state-of-the-art 3D pre-training methods. \(^*\)indicates that PointNet++ is used as a backbone instead of a 3D U-Net.

5.4 Ablation Studies

Effect of 3D and 4D Data Augmentation. We first consider a baseline variant of our approach that uses only the static 3D scene data, without any scene-object augmentation, with 3D-3D constraints during pre-training in Table 6 (Ours (3D data, 3D-3D only)); this provides some improvement over training from scratch but is notably improved upon by our 4D pre-training formulation. We additionally consider using our 4D scene-object augmentation with only 3D-3D constraints between sequence frames during pre-training (Ours (4D data, 3D-3D only)) in Table 6, which further improves performance via implicitly learned priors from the 4D data. Both are further improved by our approach of explicitly learning 4D priors in the 3D features.

Effect of 4D-Invariant Contrastive Priors. In Table 6, we see that learning 4D-invariant contrastive priors through our 3D-4D and 4D-4D constraints during pre-training improves upon the variants using data augmentation alone. Additionally, Table 7 compares the 3D variant of our approach against our full 4D-based pre-training across a variety of downstream tasks, showing consistent improvements from the learned 4D-based priors.

Effect of SimSiam Contrastive Learning. We also consider the effect of our SimSiam contrastive framework as PointContrast [43] leverages a PointInfoNCE contrastive loss. We note that the 3D variant of our approach (Ours (3D data, 3D-3D only)) reflects a PointContrast [43] setting using our scene augmentation and SimSiam architecture, which our 4D-based feature learning outperforms.

Fig. 5.

Qualitative results on SUN RGB-D [36] object detection. Our 4DContrast pre-training to encode 4D priors enables more accurate detection results, in comparison to training from scratch as well as 3D-based PointContrast [43] pre-training. Different colors denote different objects.

Fig. 6.

Data-efficient learning on ScanNet semantic segmentation and object detection. Under limited data scenarios, our 4D-imbued pre-training effectively improves performance over training from scratch as well as the state-of-the-art CSC [17].

5.5 Discussion

While 4DContrast pre-training demonstrates the effectiveness of leveraging 4D priors for learned 3D representations, various limitations remain. In particular, 4D feature learning with sparse convolutions requires considerable memory during pre-training, so we use half the resolution for 4D features relative to 3D features and limit sequence durations. Additionally, we consider only a subset of possible 4D motion when augmenting scenes with moving synthetic objects, and believe exploration of articulated motion or more complex dynamic object interactions would lead to additional insight and robustness of learned feature representations.

Memory and Speed. Our 4D-imbued pre-training results in consistent improvements across a variety of tasks and datasets, even with only using the learned 3D backbone for downstream training and inference. Thus, our method maintains the same memory and speed costs for inference as purely 3D-based pre-training approaches. For pre-training, our joint 3D-4D training uses additional parameters (33M for the 4D network in addition to the 38M for the 3D network), but due to jointly learning 4D priors with SimSiam, we do not require as large a batch size to train as PointContrast [43] (12 vs their 48), nor as many iterations (up to 30K vs 60K), resulting in slightly less total memory use and pre-training time overall.

Table 6. Additional ablation variants: compared to a baseline using 3D-3D constraints on static 3D scene data only, leveraging augmented 4D sequence data improves feature learning even under 3D-only constraints. Our final 4DContrast pre-training, leveraging constraints with learned 4D features, achieves the best performance.
Table 7. Extended ablation of the 3D-only variant of our approach on ScanNet.

6 Conclusion

We have presented 4DContrast, a new approach for 3D representation learning that incorporates 4D priors into learned features during pre-training. We propose a data augmentation scheme to construct 4D sequences of moving synthetic objects in static 3D scenes, without requiring any semantic labels. This enables learning from 4D sequences, and we establish contrastive constraints between learned 3D features and 4D features from the inherent correspondences given by the 4D sequence generation. Our experiments demonstrate that our 4D-imbued pre-training results in performance improvement across a variety of 3D downstream tasks and datasets. Additionally, our learned features effectively transfer to limited training data scenarios, significantly outperforming state of the art in the low training data regime. We hope that this will lead to additional insights in 3D representation learning and new possibilities in 3D scene understanding.