
1 Introduction

The task of instance segmentation has recently gained popularity. As an extension to semantic segmentation, this task needs to separate pixels/points that have identical categories into individual groups. In the 2D image domain, many approaches [4, 5, 10, 12, 18] have been proposed and achieve promising results. With the growing availability of 3D sensors, more and more research has focused on 3D scene understanding, which is a fundamental necessity for robotic vision, autonomous driving, and virtual reality. Although instance segmentation in the 3D domain has started to draw attention and has been discussed in [21, 29, 30, 33, 34], it still lags behind its 2D image counterpart and is far from being solved.

Fig. 1. Comparison of the instance segmentation results with and without the proposed Instance-Aware Module (IAM). The proposed IAM successfully encodes instance-aware information and geometric knowledge, which are critical for separating adjacent instances. Note that different instances are presented in different colours. (Color figure online)

Similar to the tasks of dense prediction in 2D images [2, 16, 35], context is also important in the 3D domain. For 3D point clouds, PointNet++ [24] is the first work that captures local structure information and has been successfully utilized in the task of semantic segmentation. It maintains an encoder-decoder architecture, which includes several set-abstraction layers and feature-propagation layers for down-sampling and up-sampling, respectively. Algorithms such as radius search and k-nearest-neighbour (K-NN) search are utilized for aggregating local context knowledge. Building on this powerful network, many methods [21, 29, 30] have been proposed to tackle the task of instance segmentation on point clouds. To encode meaningful context information, ASIS [30] associates the two tasks so that they can cooperate with each other. JSIS3D [21] applies a multi-value Conditional Random Field (CRF) that formulates semantic and instance segmentation as a joint optimization in a unified framework. However, these methods fail to explicitly encode instance contextual knowledge and geometric information, which are critical for separating adjacent instances and handling complex situations. For example, two neighbouring chairs can easily be confused and grouped as a single instance if boundaries and geometric information are not encoded in the embedding space (e.g., the second row in Fig. 1). In this paper, we address this problem by proposing an Instance-Aware Module (IAM) that learns instance-level context by locating representative regions for each input point. Moreover, geometric knowledge is explicitly encoded in the embedding space, which is an informative indicator for identifying points belonging to the same instance. The whole framework can be trained in an end-to-end manner to tackle instance segmentation and semantic segmentation simultaneously with little computational overhead.

Specifically, as shown in Fig. 2, our method maintains an encoder-decoder architecture. Different from previous methods that only maintain an instance grouping branch and a semantic segmentation branch, we introduce a novel light-weight instance-aware module, which localizes representative points within the same instance for each input point. The information from these representative points is then aggregated into the decoding process of the instance branch, generating instance-aware contexts for learning discriminative point-level embeddings. Moreover, the normalized geometric centroids of these representative points (predicted from every input point feature) are directly added to the embedding space, which provides critical geometric knowledge for identifying adjacent instances and reducing their ambiguity.

The training of the instance-aware module is regularized jointly by the bounding box and instance segmentation supervision, such that the meaningful semantic regions are tightly bounded by the spatial extent of the instance and guided towards representative regions of the instance.

Compared with the conventional representation of an instance by the vertices of a bounding box, learning semantically meaningful regions helps to remove unrelated background and noise information. As the module is applied in the bottleneck layer, very little additional computation is introduced. Compared with ASIS [30], which needs to exhaustively search the neighbours of every input point, our approach shows superiority in both efficiency and effectiveness.

To validate the effectiveness of our proposed method, extensive experiments have been conducted on three popular benchmarks. The flexibility of our method allows it to be applied not only to indoor scenes but also to objects with fine-grained part labels. State-of-the-art performance is achieved on these datasets. To summarize, our main contributions are listed as follows.

  • We propose a novel Instance-Aware Module, which successfully encodes instance-dependent context information for point cloud instance segmentation.

  • Our method explicitly encodes instance-related geometric information, which is informative and helpful to produce discriminative embedding features.

  • The proposed framework can be trained in an end-to-end manner and shows superiority over previous methods in both efficiency and effectiveness. With the proposed method, state-of-the-art results are achieved on different tasks.

2 Related Work

Instance segmentation on point clouds has only recently begun to receive attention. In this section, we briefly review existing approaches that are related to this field.

2.1 Deep Learning on Point Clouds

Deep learning-based methods for 3D feature extraction can be roughly categorized into three classes: voxel-based, multi-view-based, and point-based. Voxel-based methods [9, 19, 25, 32] utilize 3D convolutional neural networks for feature extraction on voxelized spatial grids, which can be easily influenced by the density of the points. Meanwhile, such methods are constrained by high memory consumption and low running speed, because a large proportion of the computation is wasted on vacant voxels. Many approaches have been proposed to address this problem [9, 25]. Octree [25] modifies the convolution operation by generating average hidden states in empty space. SparseConv [9] processes spatially sparse data more efficiently by using a hash table to avoid unnecessary memory usage in vacant space. The second category is multi-view-based methods [13, 23, 26], which first project 3D shapes or point clouds into 2D images and utilize conventional 2D CNNs for feature extraction. Hou et al. proposed 3D-SIS [13], which leverages both 2D RGB input and 3D geometric information; 2D features are back-projected into 3D grids. Unlike the above methods, directly extracting features on point clouds is more efficient and straightforward. PointNet [22] is the pioneering work that directly learns a spatial encoding of each point. A symmetric function is utilized to process unordered point sets. To effectively encode local context information and obtain representative features, many approaches [14, 15, 24, 27, 28] have been proposed. Qi et al. proposed PointNet++ [24], which applies PointNet recursively on a nested partitioning of the input point cloud. Thomas et al. proposed KPConv [27], which defines continuous convolution weights by interpolating among several kernel points. In our experiments, we utilize PointNet++ as the backbone to verify the effectiveness of our method.

2.2 Instance Segmentation on Point Cloud

Although the task of instance segmentation on 2D images has made huge progress since Mask-RCNN [10] was proposed, its 3D point cloud counterpart lags far behind. SGPN [29] is the first deep-learning-based method developed in this field. It generates point groups by predicting three outputs: a similarity matrix, a confidence map, and a semantic prediction map. Due to the pair-wise term, the method occupies a large amount of GPU memory and suffers from slow running speed and a small training batch size. On the other hand, generating instance groups from three matrices requires many hyper-parameters, making it less stable across different scenarios. Wang et al. proposed ASIS [30] to address the problem by removing the pair-wise prediction and introducing a discriminative loss for instance embedding. The loss pulls the embeddings of the same instance towards the cluster center and pushes the cluster centers away from each other. However, the method fails to utilize geometric information and is unaware of the spatial distribution of the instances. GSPN [34], proposed by Yi et al., generates shape proposals using a generative model for instance segmentation. Due to its emphasis on geometric understanding for object proposals, it achieved promising performance on both indoor and part-instance datasets. However, its large GPU memory requirement and two-stage training procedure make it impractical with limited computation resources. MPNet [11] proposes a memory-based module to deal with the imbalance of point cloud data. In this work, we propose an Instance-Aware Module (IAM) to encode instance context knowledge and geometric information. State-of-the-art performance on three large open benchmarks shows the superiority of our method over previous approaches in both effectiveness and efficiency.

3 Method

Fig. 2. The overall framework of our proposed one-stage method, which follows a simple encoder-decoder architecture. The input point cloud first goes through a shared encoder network, followed by two parallel decoders: one for semantic segmentation and one for instance grouping. A novel instance-aware module (IAM) is proposed to generate representative points for instance segmentation. We use the coordinates of the representative points to gather augmenting features for the instance segmentation branch, and the geometric information of these coordinates to extend the instance embedding. The whole framework is end-to-end trainable.

In this section, we describe our proposed Instance-Aware Module (IAM), which can encode both instance-aware context and instance-related geometric information. Details of the approach are presented below.

3.1 Network Framework

As shown in Fig. 2, we apply an encoder-decoder architecture. The encoder is shared by two tasks and takes point sets \(P \in \mathbb {R}^{N\times D}\) as input, where N denotes the total number of points and D is the input feature dimension. The input features can consist of colour and position information, e.g., X, Y, Z, R, G, and B. The decoder contains two parallel branches: one for semantic segmentation and one for instance embedding. The semantic segmentation branch generates per-point classification results \(S \in \mathbb {R}^{N \times D_c}\), where \(D_c\) is the number of categories. Focal loss [17] \(L_{fl}\) is applied to address the category imbalance during training. The instance branch outputs per-point embedding features \(E \in \mathbb {R}^{N \times D_e}\) for learning a distance metric, where \(D_e\) is the embedding dimension. Embeddings belonging to the same instance should end up close together, and embeddings belonging to different instances should end up far apart. During inference, a clustering algorithm is applied to obtain the final grouping results. The novel IAM produces instance-aware knowledge by detecting the spatial extent of an instance. Through the IAM, representative points located on the corresponding instance provide instance-aware knowledge of two kinds: (1) instance-related contextual information, obtained by detecting a set of regions that tightly cover the spatial extent of the instance, and (2) instance geometric knowledge that is critical for separating adjacent objects.
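To make the data flow concrete, the following PyTorch-style sketch mirrors the two-branch layout described above. It is only an illustration: the per-point MLPs stand in for the PointNet++ encoder-decoder, and all module and variable names are placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    """Minimal sketch of the shared-encoder / dual-decoder layout (not the official code).

    Tensor shapes follow Sect. 3.1:
      input  P: (B, N, D)  -> semantic logits S: (B, N, D_c)
                            -> instance embeddings E: (B, N, D_e)
    """

    def __init__(self, d_in=9, d_feat=128, num_classes=13, d_embed=5):
        super().__init__()
        # Placeholder for the shared PointNet++ encoder.
        self.encoder = nn.Sequential(nn.Linear(d_in, d_feat), nn.ReLU(),
                                     nn.Linear(d_feat, d_feat), nn.ReLU())
        # Two parallel decoders on top of the shared features.
        self.sem_head = nn.Linear(d_feat, num_classes)   # per-point class scores, trained with focal loss
        self.ins_head = nn.Linear(d_feat, d_embed)       # per-point embedding E, trained with L_ins

    def forward(self, points):
        feats = self.encoder(points)          # (B, N, d_feat)
        return self.sem_head(feats), self.ins_head(feats)

# Example: one block of 4,096 points with 9-D input features.
if __name__ == "__main__":
    sem, emb = TwoBranchHead()(torch.randn(2, 4096, 9))
    print(sem.shape, emb.shape)  # (2, 4096, 13) and (2, 4096, 5)
```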

3.2 Instance-Aware Module

We propose an instance-aware module (IAM) mainly for selecting representative points that capture spatial instance context. For a point \(p_i\) with position \((x_i, y_i, z_i)\), point-level offsets are predicted by the contextual detection branch to represent the spatial extent of the instance, denoted as \(\{\varDelta x_{i}^k, \varDelta y_{i}^k, \varDelta z_{i}^k\}_{k=1}^K\). The representative regions of the instance predicted by \(p_i\) are denoted \(\mathcal {R}_{i}\), which can be simply represented as:

$$\begin{aligned} \mathcal {R}_{i} = \{(x_i + \varDelta x_{i}^k, y_i + \varDelta y_{i}^k, z_i + \varDelta z_{i}^k)\}_{k=1}^{K}, \end{aligned}$$
(1)

where K is the number of representative points and i indexes the i-th point. The axis-aligned bounding box predicted by every point is formulated as \(\mathcal {B}_i\) through a min-max function F: \(\mathcal {B}_i = F(\mathcal {R}_{i})\).
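The following sketch illustrates Eq. 1 and the min-max box function F, assuming the contextual detection branch outputs an offset tensor of shape (N, K, 3); the function names and the (min-corner, max-corner) box encoding are illustrative assumptions.

```python
import torch

def representative_points(xyz, offsets):
    """Eq. (1): R_i = {(x_i + dx_i^k, y_i + dy_i^k, z_i + dz_i^k)}, k = 1..K.

    xyz:     (N, 3)    input point coordinates
    offsets: (N, K, 3) per-point offsets from the contextual detection branch
    returns: (N, K, 3) representative points R_i for every input point
    """
    return xyz.unsqueeze(1) + offsets

def min_max_box(reps):
    """B_i = F(R_i): axis-aligned box spanned by the K representative points."""
    box_min = reps.min(dim=1).values   # (N, 3)
    box_max = reps.max(dim=1).values   # (N, 3)
    return torch.cat([box_min, box_max], dim=-1)  # (N, 6): (xmin, ymin, zmin, xmax, ymax, zmax)

# Example with K = 18 representative points per input point, as used in the paper.
reps = representative_points(torch.rand(4096, 3), 0.1 * torch.randn(4096, 18, 3))
boxes = min_max_box(reps)   # (4096, 6)
```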

Learning these representative regions is jointly driven by both the spatial bounding boxes and the instance grouping labels, such that \(\mathcal {R}_i\) can tightly encompass the instance. To achieve this, three losses are used: \(L_{bnd}\), \(L_{cen}\) and \(L_{ins}\) (the last two will be discussed in the next section). \(L_{bnd}\) maximizes the overlap between the predicted and ground-truth bounding boxes. A 3D IoU loss is utilized in our paper:

$$\begin{aligned} L_{bnd} = \frac{1}{N}\sum _{i=1}^{N} \left( 1 - IoU(GT_i, \mathcal {B}_i)\right) , \end{aligned}$$
(2)

where N is the total number of points, \(\mathcal {B}_i\) is the predicted bounding box of the i-th point, and \(GT_{i}\) is the ground-truth 3D axis-aligned bounding box of the i-th point. To better understand the detection branch, we visualize \(\mathcal {R}_{i}\) in Fig. 3. Green points are the selected \(p_i\), and red points are the predicted \(\mathcal {R}_{i}\). We set the number of representative points to 18, which empirically works well in our experiments; employing more points brings only limited improvement. Therefore, for efficiency, we choose \(K=18\). Instance-related regions are located and successfully cover the spatial extent of the instance. In the next section, we provide details of how this instance contextual information is incorporated.
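A minimal sketch of the 3D IoU loss in Eq. 2 for axis-aligned boxes is given below; boxes are assumed to be encoded as (min-corner, max-corner), matching the sketch after Eq. 1, and the function names are illustrative.

```python
import torch

def axis_aligned_iou3d(pred, gt, eps=1e-6):
    """IoU between axis-aligned 3D boxes given as (..., 6) = (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = torch.maximum(pred[..., :3], gt[..., :3])          # intersection lower corner
    hi = torch.minimum(pred[..., 3:], gt[..., 3:])          # intersection upper corner
    inter = (hi - lo).clamp(min=0).prod(dim=-1)
    vol_pred = (pred[..., 3:] - pred[..., :3]).clamp(min=0).prod(dim=-1)
    vol_gt = (gt[..., 3:] - gt[..., :3]).clamp(min=0).prod(dim=-1)
    return inter / (vol_pred + vol_gt - inter + eps)

def bounding_loss(pred_boxes, gt_boxes):
    """Eq. (2): L_bnd = mean_i (1 - IoU(GT_i, B_i)) over all N points."""
    return (1.0 - axis_aligned_iou3d(pred_boxes, gt_boxes)).mean()
```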

Fig. 3. Visualization of detected representative points. The green point is randomly selected, and the red points are the corresponding meaningful regions output by the IAM. Due to the encoded instance context information, our method can separate adjacent objects. (Figure best viewed in color) (Color figure online)

3.3 Instance Branch

Conventionally, the inputs of the instance decoder are the down-sampled bottleneck points \(P_b \subseteq P\), and the corresponding features are denoted as \(F_b\). These features are gradually propagated to the full set of points through several up-sampling layers. To encode the instance context during the propagation process, we utilize the meaningful semantic regions \(\mathcal {R}_b\) of the bottleneck points.

Encode Instance-Aware Context. Representations of \(F_b\) are augmented by aggregating information from \(\mathcal {R}_{b}\), which covers the instance spatial extent. As these detected points are not necessarily located on the input points, the features of \(\mathcal {R}_{b}\) are interpolated using K-NN. The interpolated features are then added to the original \(F_b\), generating features containing both local representation and instance context. Compared with ASIS [30], which has to search neighbours for every input point, our method is more efficient: as K-NN is applied in the bottleneck layer, the search space in \(P_b\) is much smaller than that in P, introducing very limited computation overhead. The combined features are gradually upsampled during the decoding process, propagating the instance-aware context to all points.
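The sketch below shows one plausible form of this aggregation step, assuming inverse-distance K-NN interpolation with three neighbours and simple averaging over the K representative regions; the exact interpolation and fusion used by the authors may differ.

```python
import torch

def knn_interpolate(query_xyz, src_xyz, src_feat, k=3, eps=1e-8):
    """Inverse-distance K-NN interpolation of src_feat onto query_xyz.

    query_xyz: (Q, 3)  representative points R_b (not necessarily on input points)
    src_xyz:   (M, 3)  bottleneck point coordinates P_b
    src_feat:  (M, C)  bottleneck features F_b
    returns:   (Q, C)  interpolated features at the query locations
    """
    dist = torch.cdist(query_xyz, src_xyz)            # (Q, M)
    d, idx = dist.topk(k, dim=-1, largest=False)      # k nearest bottleneck points
    w = 1.0 / (d + eps)
    w = w / w.sum(dim=-1, keepdim=True)               # (Q, k) inverse-distance weights
    return (src_feat[idx] * w.unsqueeze(-1)).sum(dim=1)

def aggregate_instance_context(bottleneck_xyz, bottleneck_feat, reps):
    """Augment F_b with context gathered at the representative points of each bottleneck point.

    reps: (M, K, 3) representative points R_b predicted for the M bottleneck points.
    """
    m, k, _ = reps.shape
    ctx = knn_interpolate(reps.reshape(m * k, 3), bottleneck_xyz, bottleneck_feat)
    ctx = ctx.reshape(m, k, -1).mean(dim=1)           # average over the K regions (an assumption)
    return bottleneck_feat + ctx                      # local feature + instance context
```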

Encode Geometric Information. Geometric information is critical for identifying two close objects. To learn a discriminative embedding feature, we directly concatenate the normalized centroids of coordinates to the embedding space. Considering the centroid \(C(\mathcal {B}_i)\) predicted by point \(p_i\), where \(C(\cdot )\) is the function for computing the geometric centroid of a given bounding box, the final per-point embedding feature can be represented as \(\hat{E}_i = \text {Concat}(E_i, C(\mathcal {B}_i))\), where \(E_i\) is the embedding feature produced by the instance branch. Besides, to force the geometric information to be consistent for points that have the same instance label, we pull the predicted geometric centroids of the same instance towards the cluster center by:

$$\begin{aligned} L_{cen} = \frac{1}{M} \sum _{m=1}^{M}\frac{1}{N_m}\sum _{i=1}^{N_m}[ \Vert C(\mathcal {B}_i) - \mu _m \Vert -\sigma _v]_+^2, \end{aligned}$$
(3)

where M is the total number of instances, and \(N_m\) is the number of points in the m-th instance. \(\mu _m\) refers to the average predicted geometric centroid of the m-th instance. \([x]_+\) is defined as \([x]_+ = \max (0, x)\) and \(\sigma _v\) is a loose margin. \(L_{cen}\) forces the additional geometric information to have low variation within an instance and to be informative for separating adjacent objects.
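A small sketch of the geometric embedding and the centroid constraint in Eq. 3 follows; the per-instance loop is written for clarity rather than speed, and the function names are illustrative.

```python
import torch

def box_centroid(boxes):
    """C(B_i): centroid of an axis-aligned box given as (..., 6) min/max corners."""
    return 0.5 * (boxes[..., :3] + boxes[..., 3:])

def centroid_loss(centroids, instance_ids, sigma_v=0.5):
    """Eq. (3): pull predicted centroids of the same instance towards their mean."""
    loss, count = 0.0, 0
    for m in instance_ids.unique():
        c = centroids[instance_ids == m]                       # (N_m, 3)
        mu = c.mean(dim=0, keepdim=True)
        hinge = (torch.norm(c - mu, dim=-1) - sigma_v).clamp(min=0)
        loss = loss + (hinge ** 2).mean()
        count += 1
    return loss / max(count, 1)

def geometric_embedding(embeddings, boxes):
    """E_hat = Concat(E, C(B_i)); embeddings: (N, D_e), boxes: (N, 6) from the IAM."""
    return torch.cat([embeddings, box_centroid(boxes)], dim=-1)
```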

The informative per-point embeddings \(\{ \hat{E}_n\}_{n=1}^N\) are used to learn a distance metric that pulls intra-instance embeddings toward the cluster center and pushes instance centers away from each other. The loss function is formulated as:

$$\begin{aligned} \begin{aligned} L_{ins}&= \underbrace{ \frac{1}{M(M-1)}\sum _{a=1}^{M}\sum _{b=1 \atop b\ne a}^{M} [2\sigma _d - \Vert \mu _a - \mu _b \Vert ]_+^2}_{inter-instance} \\&+ \underbrace{ \frac{1}{M}\sum _{m=1}^{M}\frac{1}{N_m} \sum _{i=1}^{N_m} [\Vert \mu _m - \hat{E}_i \Vert - \sigma _v]_+^2 }_{intra-instance}, \end{aligned} \end{aligned}$$
(4)

where M is the total number of instances, \(N_m\) is the number of points in the m-th instance, and \(\sigma _d\) and \(\sigma _v\) are relaxation margins. During training, the first term pushes instance clusters away from each other and the second term pulls the embeddings towards their cluster center. During inference, a fast mean-shift algorithm is applied for clustering different instances in the embedding space.
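A sketch of the discriminative loss in Eq. 4 is shown below; the explicit loops are kept for readability, and a vectorized version would be used in practice.

```python
import torch

def discriminative_loss(embeddings, instance_ids, sigma_d=1.5, sigma_v=0.5):
    """Eq. (4): intra-instance pull + inter-instance push on the per-point embeddings E_hat."""
    ids = instance_ids.unique()
    means = torch.stack([embeddings[instance_ids == m].mean(dim=0) for m in ids])  # (M, D)

    # Intra-instance term: pull embeddings towards their cluster centre mu_m.
    intra = 0.0
    for m, mu in zip(ids, means):
        e = embeddings[instance_ids == m]
        intra = intra + ((torch.norm(e - mu, dim=-1) - sigma_v).clamp(min=0) ** 2).mean()
    intra = intra / len(ids)

    # Inter-instance term: push cluster centres at least 2 * sigma_d apart.
    if len(ids) > 1:
        dist = torch.cdist(means, means)                      # (M, M) pairwise centre distances
        off_diag = ~torch.eye(len(ids), dtype=torch.bool)
        inter = ((2 * sigma_d - dist[off_diag]).clamp(min=0) ** 2).mean()
    else:
        inter = embeddings.new_tensor(0.0)
    return intra + inter
```

At inference time, the resulting embeddings can then be clustered, for example with `sklearn.cluster.MeanShift`, to recover the instance groups.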

To summarize, our method is end-to-end trainable and supervised by four losses, whose weights are all set to 1 in our experiments:

$$\begin{aligned} L = L_{fl} + L_{bnd} + L_{cen} + L_{ins}, \end{aligned}$$
(5)

4 Experiments

In this section, we evaluate the effectiveness of our proposed method. Both qualitative and quantitative experiments are conducted and reported.

4.1 Datasets

We evaluate on three popular datasets that have instance annotations: Stanford 3D Indoor Semantic Dataset (S3DIS) [1], ScanNetV2 [3], and PartNet [20]. S3DIS is collected in 6 large-scale indoor areas, covering 272 rooms. The whole dataset contains more than 215 million points and consists of 13 common semantic categories. ScanNetV2 [3] is an RGB-D video dataset. It contains more than 1,500 scans, which are split into 1,201, 300, and 100 scans for training, validation, and testing, respectively. The dataset contains 40 classes in total, of which 13 categories are evaluated. Different from the above two datasets, PartNet [20] is a consistent large-scale dataset with fine-grained object annotations. It consists of more than 570k part instances covering 24 object categories, and each object contains 10,000 points. Similar to GSPN [34], we select the five categories that have the largest number of training examples.

4.2 Evaluation Metrics

On the S3DIS dataset, we conduct 6-fold cross-validation. Similar to SGPN [29] and ASIS [30], the performance on Area-5 is also reported. On ScanNetV2 [3], we report our results on the validation set, which contains more instances and gives more stable results. On the PartNet [20] dataset, the five selected categories are Chair, Storage, Table, Lamp, and Vase. Both coarse- and fine-grained results are included. Different levels of different categories are trained separately and independently. The evaluation metrics for semantic segmentation are the overall pixel-wise accuracy (oAcc), category-wise mean accuracy (mAcc), and mean intersection-over-union (mIoU). Instance segmentation is evaluated by the mean instance-wise coverage (mCov), mean size-weighted instance-wise coverage (mWCov), and mean instance precision (mPrec) and recall (mRec) with an IoU threshold of 0.5. The weight for mWCov is calculated as \(w_i = \frac{| N_{i} |}{\sum _{k}| N_{k}|}\), where i indexes the i-th ground-truth instance and \(N_{k}\) is the number of points of the k-th ground-truth instance.
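For reference, a sketch of the coverage metrics given binary per-point masks is shown below; it assumes each ground-truth instance is matched to its best-overlapping prediction, and the function name is illustrative.

```python
import numpy as np

def coverage_scores(gt_masks, pred_masks):
    """Sketch of mCov / mWCov for one scene.

    gt_masks:   list of (N,) bool arrays, one per ground-truth instance
    pred_masks: list of (N,) bool arrays, one per predicted instance
    """
    ious = []
    for g in gt_masks:
        best = 0.0
        for p in pred_masks:
            inter = np.logical_and(g, p).sum()
            union = np.logical_or(g, p).sum()
            best = max(best, inter / union if union > 0 else 0.0)
        ious.append(best)
    ious = np.array(ious)
    sizes = np.array([g.sum() for g in gt_masks], dtype=float)
    m_cov = ious.mean()                                # unweighted coverage
    m_wcov = (ious * sizes / sizes.sum()).sum()        # size-weighted coverage: w_i = |N_i| / sum_k |N_k|
    return m_cov, m_wcov
```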

4.3 Implementation Details

For S3DIS [1] and ScanNetV2 [3], each scan contains millions of points, making it hard to process all the data at once. In our experiments, we split each scene into \(1\,m \times 1\,m\) overlapping blocks with a 0.5 m stride. Then, 4,096 points are randomly sampled from each block. Similar to SGPN [29], every point is represented by a 9-D feature (XYZ, RGB, and normalized positions in the block \(N_X, N_Y, N_Z\)). PartNet [20], on the other hand, is designed for shape analysis and contains 10,000 points for each object. We randomly sample 8,000 points for training and 10,000 for testing.
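A sketch of this block-splitting and sampling pipeline is given below; the normalization of the last three feature channels within each block is an assumption about the exact convention, and the function name is illustrative.

```python
import numpy as np

def scene_to_blocks(xyz, rgb, block=1.0, stride=0.5, num_points=4096):
    """Split a scene into overlapping 1 m x 1 m blocks and sample 4,096 points per block.

    xyz: (N, 3) coordinates in metres; rgb: (N, 3) colours in [0, 1].
    Each sampled point gets a 9-D feature (X, Y, Z, R, G, B, N_X, N_Y, N_Z), where the
    last three are positions normalized inside the block (the authors' exact normalization
    may differ slightly).
    """
    blocks = []
    for x0 in np.arange(xyz[:, 0].min(), xyz[:, 0].max(), stride):
        for y0 in np.arange(xyz[:, 1].min(), xyz[:, 1].max(), stride):
            mask = ((xyz[:, 0] >= x0) & (xyz[:, 0] < x0 + block) &
                    (xyz[:, 1] >= y0) & (xyz[:, 1] < y0 + block))
            idx = np.where(mask)[0]
            if idx.size == 0:
                continue
            choice = np.random.choice(idx, num_points, replace=idx.size < num_points)
            pts, col = xyz[choice], rgb[choice]
            span = np.maximum(pts.max(0) - pts.min(0), 1e-6)
            norm = (pts - pts.min(0)) / span                       # normalized block positions
            blocks.append((np.concatenate([pts, col, norm], axis=1), choice))  # (4096, 9) features
    return blocks
```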

Although our method is not restricted to any specific network, all experiments are conducted with vanilla PointNet++ [24] as the backbone (without multi-scale grouping); we leave other backbone choices for future study. A single GTX 1080Ti GPU is used for training with the batch size set to 16. The initial learning rate is set to 0.01 (0.001 for S3DIS) and divided by 2 every 300k iterations. We use the Adam optimizer with momentum set to 0.9, and the whole network is trained for 100 epochs. The hyper-parameters for the discriminative loss are identical to the original setting in [30]: \(\sigma _v = 0.5\), \(\sigma _d = 1.5\). Besides, for testing on whole scenes of S3DIS and ScanNetV2, a method named BlockMerging [29] is used to merge blocks according to the segmentation of the overlapping areas.

Table 1. Ablation study on the ScanNetV2 dataset. Both \(AP_{50}\) and \(AP_{25}\) are reported on the validation set. FL refers to focal loss. InsContext refers to instance-aware context. \(\mathbf{L_{cen}}\) refers to the centroid constraint loss in Eq. 3. GE refers to geometric embedding.

4.4 Ablation Studies

We first build a strong baseline that contains two decoder branches: one for semantic segmentation and the other for instance embedding. Two losses are used to supervise the two branches: the cross-entropy loss for the segmentation task and the discriminative loss for instance grouping. The discriminative loss forces points belonging to the same instance to lie close together in the embedding space while keeping a large margin between points belonging to different instances. The loss weights are set to 1.0. We conduct our experiments on the ScanNetV2 validation set.

Focal Loss. Focal loss [17] was first proposed in object detection to address the imbalance between positive and negative samples. Due to the category imbalance in point clouds, we apply focal loss in the segmentation branch with default parameters identical to [17]. As shown in Table 1, the focal loss improves \(AP_{50}\) by 2.0, from 22.0 to 24.0.
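For reference, a per-point multi-class focal loss of this form can be sketched as follows; the defaults \(\gamma = 2\) and \(\alpha = 0.25\) follow [17], and any per-class weighting is omitted.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=2.0, alpha=0.25):
    """Per-point multi-class focal loss for the semantic branch.

    logits: (N, D_c) raw class scores; labels: (N,) integer categories.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)   # log-prob of the true class
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()      # down-weights easy, well-classified points
```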

Instance Aware Module. We study the influence of the proposed instance-aware module, which first locates representative points of the instance and then aggregates features from these sampled points. Encoding knowledge of the spatial extent helps to separate and distinguish close instances. As shown in Table 1, the instance-aware decoder boosts the performance by a large margin, improving \(AP_{50}\) from 24.0 to 27.6 and \(AP_{25}\) from 45.5 to 48.2. Besides, simply enlarging the dimension of the embedding space cannot bring further improvement (as shown in ASIS [30]). The proposed geometric embedding provides informative knowledge, which brings about a 2.6% improvement in \(AP_{50}\), demonstrating the effectiveness of our proposed method. Qualitative results are shown in Fig. 4. Our method shows robustness in crowded scenes, which require more discriminative features to separate different instances.

Fig. 4. Comparison of the results with and without the Instance-Aware Module. Due to the successfully encoded instance context and geometric information, our method generates discriminative results, especially for nearby objects.

Table 2. Instance segmentation results on the S3DIS dataset. Both Area-5 and 6-fold performance are reported. mCov: mean instance-wise IoU coverage. mWCov: mean size-weighted IoU coverage. mPrec: mean precision with IoU threshold 0.5. mRec: mean recall with IoU threshold of 0.5. All our results are achieved on a vanilla PointNet++ [24] backbone without multi-scale grouping for fair comparison.
Table 3. Per-class comparison of our proposed method with the state of the art on the S3DIS semantic segmentation task, tested on all areas (6-fold). Our results utilize the vanilla PointNet++ [24] without multi-scale grouping. Even with a simple baseline, the proposed method surpasses the complex graph-based methods. mA: mean pixel-wise accuracy. mI: mean category-wise IoU.

Centroid Constraint Loss. The centroid constraint loss \(L_{cen}\) is designed to maintain consistency for points belonging to the same instance. The loss serves as a regularizer that constrains the embedding features from the same instance to have a small variance. Moreover, it also helps stabilize the centroids when they are concatenated to the embedding space. As can be inferred from Table 1, the utilization of \(L_{cen}\) improves \(AP_{50}\) from 27.6 to 28.9. By further combining the geometric embeddings with the per-point features, we improve \(AP_{50}\) from 28.9 to 31.5.

Table 4. Instance segmentation results on ScanNetV2 benchmark (validation set). The metric of mAP@0.25 is reported. All methods except [8] are based on PointNet or PointNet++. (Categories of Table, Toilet, and Window are not presented in the table.)

4.5 Comparison with State-of-the-Art Methods

In this section, we make a comprehensive comparison with other state-of-the-art methods on three popular benchmarks. Our method can not only be applied to indoor scenes but also achieves promising results on the hierarchical 3D part dataset. The results on S3DIS [1], ScanNetV2 [3], and PartNet [20] show the superiority of our method in both efficiency and effectiveness.

Training and Testing Efficiency. As the first method to tackle instance segmentation on point clouds, SGPN [29] needs to predict a pair-wise similarity matrix, which requires a large amount of memory: each sample takes about 2.7 GB for training. GSPN [34] needs two training stages, and each sample takes about 6 GB of memory for training due to the generative network. ASIS [30] addresses the problem by removing the memory-consuming parts and learning a discriminative embedding. However, due to the massive usage of K-NN for every point, training ASIS requires more than 700 MB per sample, and inference takes 60 ms per block. As we only utilize K-NN in the bottleneck layer, training IAM needs only about 400 MB per sample and reduces the running time to 42 ms per block, showing the superiority of our method in both effectiveness and efficiency.

Quantitative Results on S3DIS. Instance segmentation performance on Area-5 and 6-fold cross-validation results are reported in Table 2. We compare our method with other state-of-the-art results. Equipped with instance-aware knowledge, improvements of 2.4% and 7.7% are achieved in mPrec and mRec for instance segmentation, respectively. Although employing a simple backbone, our method surpasses previous methods that need more complex operations and more memory for training. Moreover, we also report the performance on the semantic segmentation task in Table 3. The results are evaluated with 6-fold cross-validation. Our method is built upon vanilla PointNet++ [24] and achieves better results compared with methods that apply multi-view features [7] or even graph CNNs [14, 31]. Qualitative instance grouping results are shown in Fig. 5. We compare the performance of our method with ASIS [30], showing the effectiveness of the encoded instance-aware knowledge.

Fig. 5. Visualization of the instance segmentation results on S3DIS indoor scenes. From left to right: input point cloud, the ground truth of instance segmentation, the results of our proposed method, and the results of ASIS [30]. As shown in the figure, our method produces discriminative embedding features for distinguishing adjacent objects. Note that different instances are presented with different colors, and the same instance in different methods does not necessarily share the same color. (Color figure online)

Table 5. Instance segmentation results on PartNet. We report part-category mAP (%) under an IoU threshold of 0.5. There are three different levels for evaluation: coarse-grained, middle-grained, and fine-grained. We select the five categories with the largest amount of training data for training and evaluation.

Quantitative Results on ScanNetV2. The quantitative performance on ScanNetV2 is presented in Table 4, evaluated on the validation set. Both mAP@0.25 and mAP@0.5 are reported. The results of [30] and [34] are reproduced using the open-source code. For a fair comparison, methods based on PointNet [22] or PointNet++ [24] are reported. Compared with the state-of-the-art ASIS [30], our method achieves promising results and boosts mAP@0.25 and mAP@0.5 by significant margins of 8.4% and 6.5%, respectively. Figure 6(a) shows qualitative results of instance segmentation on ScanNetV2.

Quantitative Results on PartNet. The performance on PartNet [20] is shown in Table 5. Different from indoor scenes, PartNet provides fine-grained and hierarchical object part annotations: level-1 contains the coarsest annotations and level-3 the finest. Similar to GSPN [34], we report the performance on the five categories that have the largest number of training samples: Chair, Storage, Table, Lamp, and Vase. mAP@0.5 is reported. Each category at each level is trained separately. Our method achieves state-of-the-art results on most categories and levels, substantially improving the performance. Figure 6(b) shows qualitative results of instance segmentation on PartNet, covering different categories and fine-grained levels.

Fig. 6. Visualization of the instance segmentation results on (a) ScanNetV2 and (b) PartNet. Our method successfully discriminates adjacent objects that are difficult to separate. Note that different instances are presented with different colors, and the same instance in different methods does not necessarily share the same color. (Color figure online)

5 Conclusion

In this paper, we present a novel method for solving point cloud instance segmentation and semantic segmentation simultaneously. An instance-aware module (IAM) is proposed to encode both instance-aware context and geometric information. Extensive experimental results show that our method achieves state-of-the-art performance on several benchmarks and demonstrates superiority in both effectiveness and efficiency.