
1 Introduction

Recently, deep learning (DL)-based methods have achieved significant performance gains on the point cloud segmentation task, a fundamental and critical step for understanding realistic scenes [13] and analyzing 3D geometric data [59]. However, it is extremely costly and labor-intensive to collect abundant dense annotations of 3D point clouds for model training. It is therefore highly desirable to develop effective algorithms that can segment point cloud data well with only weak annotations.

Fig. 1.

Illustration of the proposed Dual Adaptive Transformation (DAT) model. We encourage DAT to produce consistent predictions under local and regional adaptive transformations. Note that only a few labeled points inside the whole scene are used to train our model. During testing, only the segmentation module (blue) is used to generate the segmentation prediction. (Color figure online)

For semantic image segmentation tasks, there are different types of weak annotations, including image-level labels [31, 41, 55], scribbles [22, 48], or partially labeled samples [23, 28, 49]. For the point cloud segmentation task, following the recent work [24], we consider partially labeled samples as weak annotations for model training, i.e., only a few sparse points inside the whole scene are labeled and all other points are unlabeled. The latest model, 1T1C [24], attempts to train a segmentation model with limited labeled points and then propagate the labels to the unlabeled points as pseudo labels for iteratively refining the model. However, such a training strategy is time-consuming and is often affected by unreliable pseudo labels, resulting in sub-optimal segmentation performance. Here, we hypothesize that the weakly supervised segmentation performance can be further improved by adding more constraints on the unlabeled 3D points.

To exploit the unlabeled data, consistency-based learning methods have shown promising progress in natural image classification and segmentation. For example, [2, 8, 34] encouraged the model to produce invariant results under various strong data augmentations. However, it is non-trivial to transfer these image-based strong augmentation techniques to point cloud processing, and point cloud-specific augmentations are still in early exploration [4, 19]. This motivates us to investigate an effective transformation method that leverages large amounts of unlabeled 3D points by applying sufficient smoothness constraints for weakly supervised point cloud segmentation.

Specifically, in this paper, we propose a Dual Adaptive Transformation (DAT) model, which encourages consistent predictions between the original point cloud and its locally/regionally adaptively transformed versions. As shown in Fig. 1, we first design a Local Adaptive Perturbation (LAP) module that computes adaptive perturbations for both point coordinates and their associated features. Meanwhile, considering that the feature distributions differ considerably across classes, we further embed class-aware information into the LAP module to generate class-aware adaptive feature perturbations. Then, to capture more structural information in point clouds, we introduce a Regional Adaptive Deformation (RAD) module that applies adaptive deformations to pre-defined superpoints, which enforces the consistency constraint at the region level.

We evaluate our DAT model with two popular backbones on the large-scale S3DIS [1] and ScanNet-v2 [6] datasets. By effectively leveraging the unlabeled point clouds, our DAT model is able to segment point cloud data with very few annotations, setting new state-of-the-art (SOTA) performance for the weakly supervised point cloud segmentation task. For example, on the S3DIS dataset [1], the DAT model outperforms the previous SOTA model 1T1C [24] by 6.5% under the “One Thing One Click” annotation setting. Note that our proposed strategy can be easily combined with other frameworks. For instance, based on our design, the segmentation performance of the 1T1C [24] model can be further improved by 2.9%/3.0% on the ScanNet-v2 test/validation set [6], respectively.

Overall, our main contributions are three-fold:

  • We propose a novel Dual Adaptive Transformation (DAT) model for weakly supervised point cloud segmentation, with the key insight that applying the consistency constraint under local and regional adaptive transformations can effectively leverage a large amount of unlabeled 3D points and facilitate better model training.

  • We introduce the Local Adaptive Perturbation (LAP) module, where we inject adaptive perturbations into the point coordinates and the associated feature inputs separately. Meanwhile, we embed information about the class-aware point feature distributions into the generation of the local adaptive feature perturbations, which leads to better performance.

  • We introduce the Regional Adaptive Deformation (RAD) module, where we generate structural adaptive deformations at the region level, i.e., adaptive deformations such as shifting, scaling, and rotation of the superpoint regions. Such regional deformations introduce another level of consistency constraint that is complementary to LAP.

2 Related Work

2.1 Deep Learning on Point Clouds

DL-based methods have achieved great progress in processing point cloud data. For example, the PointNet model [32] used permutation-invariant operators such as pooling layers to aggregate features from all points. PointNet++ [33] further designed a hierarchical spatial structure to extract local geometric features. Furthermore, graph-based methods [17, 18, 42] built a graph over all points and applied message passing on the graph. For instance, DGCNN [42] used a kNN graph to perform graph convolutions. To capture contextual relationships, SPG [17] constructed a graph on sub-regions, i.e., superpoints. DeepGCNs [18] explored depth in graph convolutional networks. Afterwards, [26, 35, 38, 54] further improved performance by directly applying continuous convolutions on the points without any quantization. SpiderCNN [54] used polynomial functions to generate the kernel weights, and Spherical CNN [35] used spherical convolutions to address the 3D rotation equivariance problem. KPConv [38] constructed kernel weights based on the input coordinates and achieved good performance. Similarly, InterpCNN [26] interpolated point-wise kernel weights by utilizing the coordinate information. Different from point convolution networks, voxel-based methods [5] first quantize all points into regular voxels and then apply 3D convolutions on the voxels to obtain point features.

In this paper, we adopt the point-based KPConv model [38] as our backbone, and the model is trained by encouraging dual adaptive transformation consistency for weakly supervised point cloud segmentation. Furthermore, in Sect. 4.2, we also extend our method to the voxel-based framework MinkowskiNet [5] to demonstrate the generalization ability of our training strategy.

2.2 Weakly Supervised Point Cloud Segmentation

Several DL-based methods have been proposed recently for the weakly supervised point cloud segmentation task [7, 9, 10, 12, 20, 27, 30, 36, 40, 51, 56, 60]. For example, Wang et al. [39] proposed to generate point cloud segmentation labels by back-projecting 2D image annotations into 3D space. However, annotating large-scale image semantic segmentation datasets is itself extremely labor-intensive. To reduce labeling costs, Wei et al. [44] used the Class Activation Map (CAM) [50, 58] to generate pseudo segmentation masks from sub-cloud-level annotations. However, its performance is limited due to the lack of localization information in the labels. To address this issue, Xu et al. [53] labeled 10% of the points in the whole point cloud, achieving performance comparable to fully supervised references. Then, the 1T1C method [24] under the “One Thing One Click” setting was introduced to tackle this task with even fewer labeled points, i.e., only one labeled point per thing in each scene.

Here, we follow the 1T1C method [24] to conduct experiments. Different from the iterative refinement mechanism used in 1T1C, which incurs significant computational cost, we propose an end-to-end training strategy that trains a model in the same weakly supervised setting without the need for any iterative refinement.

2.3 Consistency-Based Semi-supervised Learning

Our work is closely related to consistency-based semi-supervised learning (SSL) [45, 46], whose basic idea is to leverage unlabeled data based on the smoothness assumption, i.e., deep models should output consistent results under various small perturbations or augmentations. For example, Bortsova et al. [3] enforced the model to produce invariant predictions for unlabeled images under different transformations. For the semi-supervised image classification task, the VAT model [29] designed an adversarial perturbation and then encouraged consistency between the original data and its adversarial counterpart. Temporal ensembling [16] and mean teacher [37] generated similar distributions for perturbed inputs. Meanwhile, the mutual learning strategy has been studied for semi-supervised learning [47, 57]. For instance, the dual-student model [14] enforced two sub-networks to learn from each other by constraining their predictions to be consistent. FixMatch [34] further explicitly generated pseudo labels from weakly augmented data and used them to guide the predictions on strongly augmented samples.

Motivated by the consistency-based semi-supervised learning methods which encourage the model to produce consistent results under various perturbations, we propose the DAT model for the weakly supervised point cloud segmentation task with two major novel designs, i.e. the LAP and RAD modules.

3 Methods

Figure 2 gives an overview of the proposed Dual Adaptive Transformation (DAT) model, which consists of three main modules: the class-aware Local Adaptive Perturbation (LAP) module, the Regional Adaptive Deformation (RAD) module, and the original SEGmentation (SEG) module. LAP contains a novel Class-aware Perturbation Generator (CPG) that produces semantic perturbations at the point level. RAD generates structurally augmented examples by applying various deformations at the region level. SEG contains a conventional point cloud segmentation backbone.

3.1 Segmentation Module

We first define a set of notations for the weakly supervised point cloud segmentation task. Specifically, consider a set of points \(X = [C, F] \in \mathbb {R}^{N \times (3+D_f)}\) with point coordinates \(C \in \mathbb {R}^{N \times 3}\) and the corresponding features \(F \in \mathbb {R}^{N \times D_f}\) as the model input, and denote \(Y \in \mathbb {R}^{N \times 1}\) as the ground-truth label, which is very sparse with only M known entries, \(M \ll N\). The output of the SEG module is the predicted segmentation mask \(\hat{Y}\). The segmentation module aims to train the backbone model with the few labeled points in Y. Here, we adopt the popular segmentation framework KPConv [38] as our backbone. With the kernel parameters denoted as \(\theta \), the model prediction is given by \(p(\hat{y}_i|c_i, f_i; \theta )\), \(i \in \{1, ..., N\}\), where \(c_i\) and \(f_i\) are respectively the coordinates and features of point \(x_i\). We train the segmentation module by applying a cross-entropy loss \(\mathcal {L}_{seg}\) on the few labels in Y and the corresponding predictions in \(\hat{Y}\).
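To make the sparse supervision concrete, the snippet below is a minimal PyTorch-style sketch of restricting the cross-entropy loss to the few labeled points; the names `logits`, `labels`, and `IGNORE_LABEL` are our own and not taken from the KPConv codebase.

```python
import torch
import torch.nn.functional as F

IGNORE_LABEL = -100  # hypothetical marker for unlabeled points

def segmentation_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the few labeled points only.

    logits: (N, K_c) per-point class scores from the SEG backbone.
    labels: (N,) ground-truth class ids; unlabeled points carry IGNORE_LABEL.
    """
    # ignore_index drops the unlabeled points from the loss, so supervision
    # comes only from the M << N annotated points.
    return F.cross_entropy(logits, labels, ignore_index=IGNORE_LABEL)
```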

3.2 Local Adaptive Perturbation Module

We design a Local Adaptive Perturbation (LAP) module to generate perturbed examples \(X^{lap}\) by applying adaptive perturbations to the point coordinates and the corresponding features. In particular, the input to LAP is the point cloud X and the output is the perturbed examples \(X^{lap}\) obtained by injecting the adaptive perturbations \(R^{ada}\). Inspired by VAT [29], which was proposed for semi-supervised image classification and achieves local distributional smoothness (LDS) as a smoothness constraint regularizing unlabeled data, we encourage our model to generate consistent outputs for each input point \(x \in X\) and its perturbed version \(x+r^{ada}\), where \(r^{ada} \in R^{ada}\) is the corresponding adaptive perturbation:

$$\begin{aligned} \mathcal {LDS}(x; \theta ) = D\left[ p(\hat{y}|x; \theta ),p(\hat{y}|x+r^{ada};\theta )\right] . \end{aligned}$$
(1)
Fig. 2.

Overall pipeline of our proposed Dual Adaptive Transformation (DAT) model, which consists of three main modules: the SEGmentation (SEG) module (blue), the Local Adaptive Perturbation (LAP) module (yellow), and the Regional Adaptive Deformation (RAD) module (green). The SEG module adopts the KPConv backbone and is trained with the few labeled points. The LAP module generates class-aware perturbed examples for each point. The RAD module generates structurally deformed data by applying adaptive affine transformations to each region. Note that, during testing, we only employ the SEG module to process point cloud data. (Color figure online)

Here, D is a non-negative loss function that measures the divergence between the predictions for x and \(x+r^{ada}\). Then, we compute \(r^{ada}\) by estimating the gradient g of \(\mathcal {LDS}\) at a random input vector d as

$$\begin{aligned} \begin{aligned}&g = \nabla _{r} D\left[ p(\hat{y}|x; \theta ), p(\hat{y}|x+r; \theta )\right] \Big |_{r=\xi d} \\&r^{ada} = \epsilon \times g / \Vert g\Vert _2, \end{aligned} \end{aligned}$$
(2)

where \(\xi \) and \(\epsilon \) are two hyper-parameters that control the magnitude of the perturbation, and g can be efficiently computed via back-propagation through the network.

Considering that the input point coordinates and features are two different types of inputs, we generate their perturbations separately. In other words, for an input point x consisting of coordinates c and features f, we generate the adaptively perturbed data \(c+r^{ada}_c\) and \(f+r^{ada}_f\) with initial random unit vectors \(d_{c}\) and \(d_{f}\), respectively.

Class-Aware Perturbation Generator. Note that many existing perturbation-based semi-supervised image classification methods [29] generate the initial perturbations d by sampling them from an iid Gaussian distribution. However, in the point cloud segmentation task, directly applying this to generate \(d_{f}\) may not be optimal: for different classes, the input point feature distributions differ considerably across dimensions, so a class-agnostic iid Gaussian sampling might generate unrealistic perturbations.

Fig. 3.

Visual results for the superpoint estimation and the generated dual adaptive transformed examples during the training stage.

Therefore, we propose a Class-aware Perturbation Generator (CPG) to obtain \(d_{f}\) for each point. Specifically, in each training iteration, we generate the pseudo labels \(\hat{y}\) for all points with the current model parameters \(\hat{\theta }\), where \(\hat{y} \in \{1, ..., K_c\}\) with \(K_c\) being the number of classes. Based on these, we establish a zero-mean multivariate normal distribution \(\mathcal {N}(0, \Sigma _k)\), where \(\Sigma _{k}\) is the class-conditional covariance matrix estimated from all input point features (e.g. rgbh for KPConv) that belong to pseudo-class k. Afterward, we update the covariance matrix in an online manner [43] with the feature statistics of each mini-batch. In this way, at each iteration, \(d_{f}\) is generated by sampling from the up-to-date class-aware multivariate Gaussian distribution. For \(d_{c}\), we adopt the conventional approach, i.e., sampling the initial input vectors from an iid Gaussian distribution, since point clouds are unordered and individual coordinates alone are not closely related to the class of a point, as also observed in the PointNet model [32].
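As a rough illustration of the CPG, the sketch below keeps one feature covariance per pseudo-class, updates it with an exponential moving average as a stand-in for the online update of [43], and samples \(d_f\) from the resulting zero-mean Gaussian; the class name, the momentum value, and the covariance estimator are our assumptions rather than the authors' implementation.

```python
import torch

class ClassAwarePerturbationGenerator:
    """Minimal CPG sketch: one feature covariance per pseudo-class, updated online."""

    def __init__(self, num_classes: int, feat_dim: int, momentum: float = 0.99):
        self.momentum = momentum  # assumed EMA momentum; the paper only states "online" updates
        self.cov = torch.stack([torch.eye(feat_dim) for _ in range(num_classes)])

    @torch.no_grad()
    def update(self, feats: torch.Tensor, pseudo_labels: torch.Tensor) -> None:
        """feats: (N, D_f) input point features; pseudo_labels: (N,) predicted classes."""
        for k in pseudo_labels.unique():
            fk = feats[pseudo_labels == k]
            if fk.shape[0] < 2:
                continue
            # Second-moment estimate used as the class-conditional covariance (estimator assumed).
            cov_k = (fk.T @ fk) / (fk.shape[0] - 1)
            self.cov[k] = self.momentum * self.cov[k] + (1 - self.momentum) * cov_k

    def sample(self, pseudo_labels: torch.Tensor) -> torch.Tensor:
        """Draw one feature direction d_f per point from N(0, Sigma_k) of its pseudo-class."""
        d_f = torch.empty(pseudo_labels.shape[0], self.cov.shape[-1])
        for k in pseudo_labels.unique():
            dist = torch.distributions.MultivariateNormal(
                torch.zeros(self.cov.shape[-1]), covariance_matrix=self.cov[k])
            d_f[pseudo_labels == k] = dist.sample((int((pseudo_labels == k).sum()),))
        return d_f
```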

With the generated \(d_c\) and \(d_{f}\), our LDS loss for point clouds now becomes:

$$\begin{aligned} \begin{aligned} \mathcal {LDS}(x; \theta )&= D\left[ p(\hat{y}|c, f; \theta ), p(\hat{y}|c+\xi _{c} d_{c}, f+\xi _{f} d_{f}; \theta )\right] \\ g_{c}&= \nabla _{\xi _{c} d_{c}} \mathcal {LDS}(x; \theta ) \\ g_{f}&= \nabla _{\xi _{f} d_{f}} \mathcal {LDS}(x; \theta ), \end{aligned} \end{aligned}$$
(3)

where we use the Kullback-Leibler divergence (KL-div) for D. Finally, we obtain the \(r_c^{ada}\) and \(r_f^{ada}\) by

$$\begin{aligned} \begin{aligned}&r_{c}^\textrm{ada} = \epsilon _{c} g_{c}/\Vert g_{c}\Vert _2 \\&r_{f}^\textrm{ada} = \epsilon _{f} g_{f}/\Vert g_{f}\Vert _2. \end{aligned} \end{aligned}$$
(4)

In this way, the perturbed examples \(X^{lap}\) are obtained by point-wise adding the perturbations \(r_c^\textrm{ada}\) and \(r_f^\textrm{ada}\) to the coordinates c and the features f, respectively. One example is visualized in the third column of Fig. 3.
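Putting Eqs. (1)-(4) together, one LAP step could be sketched as follows, assuming a differentiable `model(coords, feats)` that returns per-point logits; the default hyper-parameters mirror those reported in Sect. 4.1, while the function itself and the global gradient normalization are our own reading rather than the released code.

```python
import torch
import torch.nn.functional as F

def lap_perturbations(model, coords, feats, d_c, d_f,
                      xi_c=10.0, xi_f=0.1, eps_c=1.0, eps_f=0.05):
    """One LAP step (Eqs. 1-4): turn the initial directions d_c, d_f into the adaptive
    perturbations r_c^ada, r_f^ada with a single gradient of the KL consistency."""
    with torch.no_grad():
        p_clean = F.softmax(model(coords, feats), dim=-1)

    # Scaled initial perturbations; gradients are taken w.r.t. them (Eq. 3).
    r_c = (xi_c * F.normalize(d_c, dim=-1)).requires_grad_(True)
    r_f = (xi_f * F.normalize(d_f, dim=-1)).requires_grad_(True)

    logp_pert = F.log_softmax(model(coords + r_c, feats + r_f), dim=-1)
    lds = F.kl_div(logp_pert, p_clean, reduction="batchmean")
    g_c, g_f = torch.autograd.grad(lds, [r_c, r_f])

    # Eq. (4): normalized gradients rescaled by epsilon give the adaptive perturbations.
    r_c_ada = eps_c * g_c / (g_c.norm() + 1e-12)
    r_f_ada = eps_f * g_f / (g_f.norm() + 1e-12)
    return coords + r_c_ada, feats + r_f_ada  # the locally perturbed example X^lap
```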

3.3 Regional Adaptive Deformation Module

In addition to the local adaptive perturbations, and considering that point clouds often contain various structural local deformations such as region shifts, rotations, and scalings, we further design a Regional Adaptive Deformation (RAD) module to generate structural local deformations. The RAD module takes the point cloud X as input and outputs region-level augmented examples \(X^{rad}\) by deforming each region with adaptive affine transformations \(A^{ada}\). As shown in Fig. 2, we first over-segment the point cloud X into a set of superpoints \(S_i, i \in \{1, ..., K_s\}\) via [6, 17]. For each superpoint \(S_i\), we generate the adaptively deformed example \(S_i^{ada}\). Combining all \(S_i^{ada}, i \in \{1, ..., K_s\}\), we obtain \(X^{rad}\).

For each superpoint \(S_i\), we first generate the initial affine transformation matrices \(A_{i,j}\), whose parameters are randomly sampled from an iid Gaussian distribution. Then, we deform each superpoint as

$$\begin{aligned} S_i^{int} = S_i \cdot \prod _{j=1}^{K_a} \xi _{A} A_{i,j}, \end{aligned}$$
(5)

where \(A_{i,j}, j \in \{1, ..., K_a\}\), corresponds to the j-th type of deformation. Combining all \(S_i^{int}, i \in \{1, ..., K_s\}\), we obtain the initially deformed point cloud \(X^{int}\). The \(\mathcal {LDS}\) loss becomes

$$\begin{aligned} \begin{aligned} \mathcal {LDS}(X; \theta )&= D\left[ p(\hat{y}|x; \theta ), p(\hat{y}|x^{int}; \theta )\right] \\ g_{A_{i,j}}&= \nabla _{\xi _{A} A_{i,j}} \mathcal {LDS}(x; \theta ) . \end{aligned} \end{aligned}$$
(6)

Then, we obtain the \(A^{ada}_{i,j}\) by

$$\begin{aligned} A^{ada}_{i,j} = \epsilon _{A} g_{A_{i,j}}/\Vert g_{A_{i,j}}\Vert _2. \end{aligned}$$
(7)

Finally, the regionally deformed examples \(X^{rad}\) are obtained by combining all the deformed superpoints \(S_i^{ada}\), each of which is computed as

$$\begin{aligned} S_{i}^{ada} = S_i \cdot \prod _{j=1}^{K_a} A_{i, j}^{ada}. \end{aligned}$$
(8)

Specifically, we use the following three types of affine transformations: translation, scaling, and rotation.

Algorithm 1. Generation of the transformed examples under both LAP and RAD.

One RAD example is given in the fourth column of Fig. 3. Algorithm 1 summarizes the process of generating the adversarial examples under both LAP and RAD.
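Under our reading of Eqs. (5)-(8), the RAD procedure could be sketched as follows: each superpoint's coordinates are deformed by small random affine matrices, the gradient of the KL consistency with respect to these matrices is normalized into adaptive deformations, and the deformations are applied region by region. The helper names, the identity-plus-residual application of the matrices, and the unoptimized per-superpoint loop are our assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def rad_deformations(model, coords, feats, superpoint_ids,
                     num_affine=3, xi_a=0.1, eps_a=0.05):
    """RAD sketch: per-superpoint adaptive affine deformations of the point coordinates."""
    with torch.no_grad():
        p_clean = F.softmax(model(coords, feats), dim=-1)

    ids = superpoint_ids.unique()
    # Initial transforms A_{i,j}: identity plus small Gaussian parameters (our reading of Eq. 5).
    A = (torch.eye(3) + xi_a * torch.randn(len(ids), num_affine, 3, 3)).requires_grad_(True)

    def deform(matrices):
        # Apply the product of each superpoint's K_a affine matrices to its coordinates.
        deformed = coords
        for idx, sp in enumerate(ids):
            mask = (superpoint_ids == sp).unsqueeze(-1)
            M = matrices[idx][0]
            for j in range(1, num_affine):
                M = M @ matrices[idx][j]
            deformed = torch.where(mask, coords @ M, deformed)  # unoptimized but clear
        return deformed

    lds = F.kl_div(F.log_softmax(model(deform(A), feats), dim=-1),
                   p_clean, reduction="batchmean")
    (g,) = torch.autograd.grad(lds, [A])

    # Eq. (7): normalize each matrix gradient; applied as a residual on the identity
    # (our choice, so the final regional deformation stays small).
    g_norm = g.flatten(2).norm(dim=2).clamp_min(1e-12).view(len(ids), num_affine, 1, 1)
    A_ada = torch.eye(3) + eps_a * g / g_norm
    return deform(A_ada), feats  # the regionally deformed example X^rad
```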

3.4 Training Losses

The overall training loss can be written as

$$\begin{aligned} \mathcal {L}_{total} = \mathcal {L}_{seg} + \alpha \mathcal {L}_{lc} + \beta \mathcal {L}_{rc}, \end{aligned}$$
(9)

where \(\mathcal {L}_{seg}\), \(\mathcal {L}_{lc}\) and \(\mathcal {L}_{rc}\) are the Segmentation Loss, Local Consistency Loss, and Regional Consistency Loss, respectively, and \(\alpha \) and \(\beta \) are trade-off weights, both set to 2 to balance the losses. The Segmentation Loss \(\mathcal {L}_{seg}\) guides the segmentation prediction with the limited annotations in Y. Specifically, we follow KPConv [38] and use the cross-entropy loss for \(\mathcal {L}_{seg}\) to train the segmentation prediction \(\hat{Y}\). The Local Consistency Loss \(\mathcal {L}_{lc}\) encourages consistency and penalizes the prediction difference between the original point cloud X and the locally perturbed examples \(X^{lap}\). The Regional Consistency Loss \(\mathcal {L}_{rc}\) ensures consistency between X and its regionally deformed examples \(X^{rad}\). \(\mathcal {L}_{lc}\) and \(\mathcal {L}_{rc}\) are defined as

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{lc} = D\left[ p(\hat{y}|x; \theta ), p(\hat{y}|x^{lap}; \theta )\right] \\&\mathcal {L}_{rc} = D\left[ p(\hat{y}|x; \theta ), p(\hat{y}|x^{rad}; \theta )\right] , \end{aligned} \end{aligned}$$
(10)

where D is the KL-div loss.
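Combining the pieces, a single training step under the total loss in Eq. (9) might look as follows, reusing the hypothetical helpers sketched above; as described in Sect. 4.1, only one of the two consistency terms is applied per iteration, chosen with probability 0.5.

```python
import random
import torch
import torch.nn.functional as F

ALPHA, BETA = 2.0, 2.0  # trade-off weights alpha and beta from Eq. (9)

def training_step(model, coords, feats, labels, superpoint_ids, cpg, optimizer):
    """One DAT training step: sparse supervision plus one randomly chosen consistency term."""
    logits = model(coords, feats)
    loss = segmentation_loss(logits, labels)   # L_seg on the few labeled points
    p_clean = F.softmax(logits.detach(), dim=-1)
    pseudo = logits.argmax(dim=-1)             # pseudo labels used by the CPG
    cpg.update(feats, pseudo)

    if random.random() < 0.5:
        # Local consistency L_lc on the LAP example (first line of Eq. 10).
        c_aug, f_aug = lap_perturbations(model, coords, feats,
                                         d_c=torch.randn_like(coords),
                                         d_f=cpg.sample(pseudo))
        weight = ALPHA
    else:
        # Regional consistency L_rc on the RAD example (second line of Eq. 10).
        c_aug, f_aug = rad_deformations(model, coords, feats, superpoint_ids)
        weight = BETA

    loss = loss + weight * F.kl_div(F.log_softmax(model(c_aug, f_aug), dim=-1),
                                    p_clean, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```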

4 Experiments and Results

4.1 Implementation Details

Datasets. Following the 1T1C [24] model, we conduct experiments on two large-scale point cloud datasets: S3DIS [1] and ScanNet-v2 [6]. The S3DIS dataset consists of 3D scans of 271 rooms across 6 areas with 13 categories. For fair comparison, we train the segmentation model on Areas 1, 2, 3, 4, and 6 and test on Area 5, as in [24]. The ScanNet-v2 dataset contains 1201, 312, and 100 3D scans for training, validation, and testing, respectively.

Weak Annotation Scheme. For fair comparison, on the S3DIS dataset we label the data under the “One Thing One Click” (OTOC) setting as in 1T1C [24]: in each object, we randomly select one point (each point with equal probability) as the labeled point. Therefore, only 0.02% of the points in the whole point cloud have annotations. On the ScanNet-v2 dataset, we evaluate our DAT model on the “3D Semantic label with Limited Annotations” benchmark [6]. In this benchmark, only 20 points are labeled in each room scene.
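To illustrate the OTOC scheme, the following hypothetical helper keeps one uniformly sampled labeled point per object instance from a fully annotated scan; `instance_ids` and `IGNORE_LABEL` are illustrative names from the earlier sketches. Setting `clicks_per_thing` to 3 would correspond to the OTTC setting evaluated in Sect. 4.2.

```python
import torch

def simulate_otoc_labels(labels: torch.Tensor, instance_ids: torch.Tensor,
                         clicks_per_thing: int = 1) -> torch.Tensor:
    """Keep only `clicks_per_thing` uniformly random labeled points per object instance."""
    weak = torch.full_like(labels, IGNORE_LABEL)
    for inst in instance_ids.unique():
        idx = (instance_ids == inst).nonzero(as_tuple=True)[0]
        picked = idx[torch.randperm(idx.numel())[:clicks_per_thing]]
        weak[picked] = labels[picked]
    return weak
```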

Table 1. Comparison of our DAT with several existing methods on the S3DIS Area-5 set. Note that we report results based on the KPConv [38] backbone.
Table 2. Comparison of our DAT with its variant methods under the KPConv framework. Note that all experiments are conducted under the OTOC setting on the S3DIS dataset.

Experiment Setting. Unless otherwise specified, we implement our proposed DAT training method based on the KPConv rigid model. We use SGD to train the model with a learning rate of 0.01 and a batch size of 2. Following 1T1C [24], we use the geometrical partition results [17] and the mesh segment results [6] as the superpoints for the S3DIS and ScanNet-v2 datasets, respectively. We set the hyper-parameters \(\xi _{c} = 10\), \(\xi _{f} = 0.1\), \(\xi _{A}=0.1\), \(\epsilon _{c} = 1\), \(\epsilon _{f} = 0.05\), \(\epsilon _{A} = 0.05\). During model training, to reduce GPU memory consumption, we apply the segmentation loss \(\mathcal {L}_{seg}\) at every iteration and randomly apply either the local consistency loss \(\mathcal {L}_{lc}\) or the regional consistency loss \(\mathcal {L}_{rc}\), each with probability 0.5. All experiments are conducted on a single NVIDIA RTX 3090 GPU with PyTorch 1.7.0 and CUDA 11.0.

4.2 Evaluations on S3DIS Dataset

Comparing with State-of-the-Art Methods. Table 1 shows the results of our DAT and several SOTA methods on the S3DIS Area-5 set. By effectively exploiting the unlabeled data, the DAT model trained with few labeled points achieves results comparable to the upper bound (i.e., the fully supervised KPConv model trained with 100% labeled data). Furthermore, under the OTOC setting, the DAT model significantly outperforms the second-best method, 1T1C, by 6.4% mIoU on the S3DIS dataset. In addition, we further evaluate the “One Thing Three Clicks” (OTTC) setting, where we annotate three points for each target. Our model outperforms the corresponding second-best method 1T1C [24] by 3.2%.

Table 3. Ablation studies of our DAT on the Class-aware Perturbation Generator (CPG) used in the LAP module under the OTOC setting on the S3DIS dataset.

Ablation Studies: Comparisons with Baselines. We perform ablation studies on the S3DIS dataset to show the effectiveness of our proposed DAT. The first baseline uses only the segmentation loss \(\mathcal {L}_{seg}\) on the few labeled points to train the segmentation model, denoted as “Our Baseline” in Table 2. Our proposed DAT outperforms “Our Baseline” by 6.4%. Another baseline applies random noise to all points to generate perturbed examples and then uses the KL-div loss to encourage prediction consistency between the original point cloud and the perturbed examples. Specifically, similar to our designed LAP, the random noise can be applied to the point coordinates, the point features, or both, denoted as “Ours w/ Noise”. As Table 2 shows, DAT significantly outperforms both baselines, which suggests that our adaptive perturbations regularize the unlabeled data better than random noise.

Effects of LAP and RAD. To demonstrate the effect of the two novel modules, as shown in Table 2, separately applying the consistency loss to the transformed examples generated by LAP (Ours w/ LAP) or RAD (Ours w/ RAD) already significantly improves the mIoU results over “Our Baseline”. This suggests that enforcing consistency between the predictions on the transformed examples and the original point clouds leads to better segmentation masks. “Our DAT” denotes applying the consistency loss to both LAP and RAD. Table 2 shows that combining both modules further improves mIoU by 2.6% and 1.7% compared with using only LAP or only RAD, respectively.

Effects of CPG. We further verify the effectiveness of the CPG designed for the LAP module. “Feat. w/o CPG” denotes generating the initial perturbation \(d_f\) from an iid Gaussian distribution instead of the class-aware multivariate Gaussian distribution. Table 3 shows that our class-aware perturbation generator boosts segmentation performance under all settings, which suggests that class-aware information is critical for the point cloud segmentation task.

Besides, Fig. 4 gives three examples of the covariance matrices computed in the CPG, randomly selected from all 13 covariance matrices. We can observe that different classes have clearly different covariance matrices.

Fig. 4.

Three covariance matrices estimated via our designed CPG module under the OTOC setting on the S3DIS dataset.

Table 4. Ablation studies of our DAT on the different affine transformations used in the RAD module under the OTOC setting on the S3DIS dataset.

Different Affine Transformations in RAD. Table 4 shows the mIoU results of our DAT with different affine transformations. “Ours w/ RAD” indicates that we only apply the consistency loss to the deformed examples generated by RAD, and “Our DAT” indicates that we use all the transformed examples generated by LAP and RAD to train the model. As Table 4 shows, “Our DAT” achieves the best performance when using all three affine transformations (i.e., translation, scaling, and rotation).

Table 5. To demonstrate the generalization ability, we report results with the MinkowskiNet32 [5] backbone on the S3DIS Area-5 set. “Our DAT*” denotes that we only use the LAP module to train the backbone.
Table 6. Comparison of our DAT model with several existing methods on the ScanNet-v2 test set. “Our DAT\(\dagger \)” denotes that our DAT is built upon the 1T1C [24] model.

Generalization Ability. To verify the generalization ability, we further use our training strategy to train a voxel-based segmentation framework (i.e., MinkowskiNet [5]). Unlike point-based methods, voxel-based methods first project the point cloud into regular voxels and then apply 3D sparse convolutions on them. Since the projection operation is non-differentiable and cannot back-propagate gradients to the point coordinates, we only employ the LAP module to add adaptive perturbations to the input features with the CPG module (labeled as “Our DAT*” in Table 5). Table 5 shows that, under the OTOC/OTTC setting, our model improves the mIoU results by 5.9%/3.2% compared to the respective “Our Baseline”, which demonstrates that this training strategy is general and effective and can be easily applied to various point cloud frameworks.

4.3 Evaluations on ScanNet-v2 Dataset

Tables 6 and 7 give the results on the test and validation sets of the ScanNet-v2 dataset in the “3D Semantic label with Limited Annotations” benchmark, respectively. We use the officially provided 20-point annotations as the sparse labels to train the model. Compared with “Our Baseline”, our DAT (denoted as “Our DAT”) with the KPConv backbone achieves impressive performance gains of 3.2% and 3.9% mIoU on the ScanNet-v2 test and validation sets, respectively.

Meanwhile, such a training strategy can be easily combined with existing point cloud segmentation models. For example, on the ScanNet-v2 dataset, we build our DAT upon the 1T1C model, which is used to generate pseudo labels for all training data; we then use these pseudo labels to train our DAT. Built upon the 1T1C model (denoted as “Our DAT \(\dagger \)” in Tables 6 and 7), our DAT further improves the mIoU results by 2.9% and 3.0% on the ScanNet-v2 test and validation sets compared with 1T1C, respectively. This suggests that our training strategy can further improve the performance of other SOTA models.

Table 7. Comparison of our DAT model with several existing methods on the ScanNet-v2 validation set. “Our DAT\(\dagger \)” denotes that our DAT is built upon the 1T1C [24] model.
Fig. 5.

Two results of our DAT on the S3DIS (first two rows, under the “OTOC” setting) and ScanNet-v2 datasets (last two rows, under the “20 points” setting).

4.4 Qualitative Results

Figure 5 shows segmentation results obtained by our proposed DAT model on the S3DIS and ScanNet-v2 datasets. It reveals that the DAT model can preserve most object structures and segment the 3D point clouds accurately with only weak annotations for training.

5 Conclusion

In this paper, we have presented the Dual Adaptive Transformation (DAT) model for the weakly supervised point cloud segmentation task, with two novel designs, i.e., the LAP and RAD modules. First, the LAP module generates point-wise adaptive coordinate perturbations and class-aware adaptive feature perturbations based on the online-estimated class distributions. Second, the RAD module generates regional adaptive deformations by applying a set of adaptive affine transformations to the superpoint regions. Extensive experiments under multiple weakly supervised settings demonstrate that our proposed DAT model achieves new SOTA segmentation performance on the S3DIS and ScanNet-v2 datasets.