1 Introduction

A point cloud is a collection of points with spatial coordinates and possibly additional features such as color or intensity. Visualizing a point cloud of a scene provides intuitive and accurate information about 3D space. Modern technologies such as 3D imaging, photogrammetry, and SLAM can produce colored point clouds. The applications of point clouds cover a wide variety of fields, from augmented reality and autonomous navigation to Scan-to-BIM in construction.

One basic task in point cloud processing is segmentation, which partitions the point cloud into groups, each of which exhibits certain homogeneous characteristics. In particular, semantic segmentation groups points with similar semantics, and instance segmentation further divides the points into object instances. For either segmentation task, the key to the problem is feature learning. On the one hand, extracting discriminative features from the point cloud is vital to the subsequent clustering and segmentation processes. While searching-based algorithms [1] show good performance on clustering data with distinctive features, a 3D point cloud in its original space is not so easily separable. For low-dimensional, structured data such as 2D images, the convolutional network is well known for its ability to extract features for various tasks [2]. This trend is extending to the 3D field, as many recent semantic segmentation works [3,4,5,6,7] have shown promising results using the concept of convolution on an unordered point set. On the other hand, not all features are useful for segmentation, and feature selection is a well-studied topic [8]. In convolutional networks, feature selection is accomplished by the attentional layer, which several recent works [9,10,11] use to tackle the isotropic nature of convolutional kernels. However, the effectiveness of attentional convolution on instance segmentation remains unexplored. In addition, the weight matrix in previous works depends only on spatial or propagated feature differences. Although this may seem sufficient for the semantic task, the relationship between color and geometry is more important when delineating an instance boundary.

Intuitively, people delineate an instance boundary by paying more attention to geometry or color in different circumstances; e.g., we distinguish two instances of walls based on verticality, and a whiteboard from the wall mainly based on the color of its frame. To model an attention mechanism that adapts to the instance boundary, we propose the BEACon network. We first define a generalized version of point set convolution and demonstrate how the design choices of BEACon fit into those definitions. For each layer, the boundary is represented as differences in multiple feature spaces, including geometry and color, and the attentional weight is generated by feeding the boundary information to a set of multilayer perceptrons (MLPs). After the instance embedding is obtained, we use the Cut-Pursuit algorithm [12] for clustering and further design a vicinity merging algorithm specifically for datasets of large indoor spaces. Experiments on the S3DIS [13] dataset show a significant improvement on the instance task over the most recent works. We also test BEACon on the PartNet [14] dataset to demonstrate its effectiveness on part instance segmentation. To summarize, our main contributions are as follows:

  • We propose a network that incorporates a boundary-embedded attention mechanism for instance segmentation.

  • We explicitly model the influence of both geometry and color changes on the attentional weight. Experimental results demonstrate its benefit, especially for instance segmentation.

2 Related work

This section briefly reviews prior art on semantic and instance segmentation of point clouds. Being more related to our work, deep learning approaches that directly process unordered point sets are our focus. Other methods include the volumetric approach [15,16,17,18], which requires voxelization of the input data, and the multi-view approach [19].

2.1 Semantic segmentation

PointNet [20] pioneered the direct application of MLPs to unordered point sets. However, it does not take the spatial context of a point's vicinity into consideration. The approaches to capturing a larger spatial context can be divided into three categories: point-based, graph-based, and CNN-based.

2.1.1 Point-based approach

Several methods use neighborhood context, recurrent neural networks (RNNs), or kernels to aggregate local information. Ye et al. [21] develop pointwise pyramid pooling to capture the spatial context at different scales and explore the across-block relationship with an RNN. ShapeContextNet [22] uses kernels to extract local features and trains the shape context using a self-attention network. EdgeConv is proposed in DGCNN [23] for feature propagation; the graph in each layer is constructed dynamically in feature space, allowing points to be grouped even over long distances. However, the above-mentioned methods aggregate information over all the input points in each layer. Much of the information overlaps, and the network becomes unnecessarily large.

2.1.2 Graph-based approach

This approach incorporates graph convolutional neural networks into the proposed network structures. The superpoint graph [24] first partitions the whole scene into small patches based on geometric features and then applies a graph convolutional neural network to predict the semantic label of each patch. In the follow-up work [25], the manually selected geometric features are replaced by a lightweight network called the local point embedder. PGCNet [26] takes a similar approach by partitioning the scene into planar surfaces, followed by patch graph construction to produce the semantic output. Wang et al. [27] transfer the local neighborhood point set into the spectral domain, where the structural information is encoded in the graph topology.

2.1.3 CNN-based approach

Different from 2D images, 3D data have no regular grid-like partition scheme, and different design choices can be made for kernel weight and kernel shape. On the one hand, the kernel function can be regarded as a weight matrix, with the weight defined based on features in the neighborhood. Liu et al. [9] propose relation-shape convolution, in which the kernel function is mapped with an MLP based on surrounding spatial features. PointWeb [11] adds an adaptive feature adjustment (AFA) module on top of PointNet++ [4] to adjust the learning based on feature differences. Wang et al. introduce GACNet [10], in which the kernel function is obtained with an MLP on the difference between spatial and propagated features. Li et al. [28] use the distance norm as the kernel function and propose an adversarial network as a guide to refine the segmentation result.

On the other hand, the kernel can be modeled with locations, which can be fixed in place or trainable. Lei et al. [29] adopt a fixed spherical bin kernel to extract local features. Komarichev et al. [30] transfer the 3D kernel into 2D by projecting the points onto an annular ring normal to the local geometry. While the point locations of all the above kernels are pre-defined, KPConv [31] generalizes the idea of point convolution and models the kernel points with trainable locations. It also emphasizes the radius neighborhood rather than kNN.

2.2 Instance segmentation

SGPN [32] is the first framework proposed to solve this problem, using a similarity matrix and a confidence map. ASIS [33] explores the mutual aid between the semantic and instance tasks and proposes semantic-aware instance segmentation and instance-fused semantic segmentation. JSIS3D [34] also emphasizes the joint relationship, modeled by multi-value conditional random fields. 3D-SIS [35] and 3D-BoNet [36] take a different approach by predicting the bounding box of each instance and subsequently predicting a point mask to obtain the segmentation result. MTML [37] applies a multi-task learning strategy, predicting both the instance embedding and the point offset in 3D space.

Our BEACon network is inspired by the recent success of attentional convolution for semantic segmentation. However, we observe that the relationship between geometry and color plays a more important role when delineating an instance boundary. Thus, we aim to improve instance segmentation by designing the attentional weight with embedded boundary information.

3 Our method

This section presents the methods used to construct the BEACon network. Section 3.1 generalizes the idea of point set convolution, which provides a guideline for designing the B-Conv layer in Sect. 3.2. Section 3.3 details the network structure and the loss function. Section 3.4 explains the necessity of the vicinity merging algorithm, which is designed for datasets of large indoor spaces such as S3DIS.

3.1 Generalization of point set convolution

Given a point cloud with point sets \(\mathcal {P} \in \mathbb {R}^{N \times 3}\) and corresponding feature sets \(\mathcal {F} \in \mathbb {R}^{N \times C}\), the general point convolution of \(\mathcal {F}\) by a kernel g at a point \(x \in \mathbb {R}^3\) is defined in KPConv [31] as:

$$\begin{aligned} (\mathcal {F}*g)(x) = \sum _{x_i \in \mathcal {N}_x, f_i \in \mathcal {N}_f} g(x_i-x)f_i \end{aligned}$$
(1)

where \(\mathcal {N}_x \subset \mathcal {P}\) (\(\mathcal {N}_f \subset \mathcal {F}\)) is the neighboring point set (feature set) defined around the query point x. However, the kernel function can be generalized to take the difference between features as well. In addition, the input feature for a particular layer can be processed by a feature mapping function, denoted as \(\mathcal {M}(\cdot )\), before the convolution operation:

$$\begin{aligned} (\mathcal {F}*g)(x) = \sum _{x_i \in \mathcal {N}_x, f_i \in \mathcal {N}_f} g(x_i-x, f_i-f)\cdot \mathcal {M}(f_i) \end{aligned}$$
(2)

Note that the aggregation function for convolution is a summation. This aggregation function can be made more general and replaced by other functions such as max pooling. In image processing, the image has fewer feature elements after a strided operation. In point set convolution, a similar effect is achieved by sampling query points \(\mathcal {P}_q\) from the point set \(\mathcal {P}\). Common sampling methods include inverse density sampling [38], furthest point sampling [3,4], and grid down-sampling [31,39].

$$\begin{aligned}&\mathcal {P}_q = \mathcal {S}(\mathcal {P}) \end{aligned}$$
(3)
$$\begin{aligned}&(\mathcal {F}*g)(x) = \mathcal {A}\big ( g(x_i-x, f_i-f)\cdot \mathcal {M}(f_i) \big ) \end{aligned}$$
(4)

The generalized point set convolution can be represented as in Eqs. 3 and 4, where \(\mathcal {S}(\cdot )\) is the sampling function, \(\mathcal {A}(\cdot )\) denotes the aggregation function, \(x_i \in \mathcal {N}_x, f_i \in \mathcal {N}_f,\) and \(x \in \mathcal {P}_q\).
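
For concreteness, the following minimal NumPy sketch shows one way to realize Eqs. 3 and 4. The function names, the kNN neighborhood search, and the loop-based structure are our illustrative choices, not the authors' implementation; the kernel \(g\) and the mapping \(\mathcal {M}\) are passed in as placeholder callables.

```python
import numpy as np

def farthest_point_sampling(points, n_query):
    """One possible sampling function S(.): iteratively pick the point
    farthest from the already-selected set."""
    idx = [0]
    dist = np.full(len(points), np.inf)
    for _ in range(n_query - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[-1]], axis=1))
        idx.append(int(dist.argmax()))
    return np.asarray(idx)

def point_set_conv(points, feats, n_query, k, kernel_g, mapping_m,
                   aggregate=np.max):
    """Generalized point set convolution (Eqs. 3-4): sample query points,
    gather kNN neighborhoods, weight mapped features with the kernel,
    then aggregate. kernel_g and mapping_m are placeholder callables."""
    q_idx = farthest_point_sampling(points, n_query)
    out = []
    for q in q_idx:
        # kNN neighborhood N_x around the query point x
        nn = np.argsort(np.linalg.norm(points - points[q], axis=1))[:k]
        dx = points[nn] - points[q]           # x_i - x
        df = feats[nn] - feats[q]             # f_i - f
        w = kernel_g(dx, df)                  # kernel g(., .), shape (k, C')
        m = mapping_m(feats[nn])              # feature mapping M(f_i), shape (k, C')
        out.append(aggregate(w * m, axis=0))  # aggregation function A(.)
    return points[q_idx], np.stack(out)
```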

The novelty of our proposed BEACon network builds on the above generalization. We interpret the kernel g as an attentional layer that fuses both local geometry \((x_i-x)\) and color \((f_i-f)\) information, and the stride-like downsampling is achieved with furthest point sampling as our sampling function \(\mathcal {S}(\cdot )\).

Fig. 1 Visualization of the query point (red dot at the origin) and \(\Delta XYZ\), \(\Delta RGB\), \(\Delta F_{\mathrm{geo}}\) in 3D, color and geo-feature spaces (scattering is omitted). It can be observed that a picture on the wall can only be separated in color space. To some extent, BEACon learns the “shapes” in all of those spaces and generates the attentional weight by exploring their inter-relationship

3.2 B-Conv layer

Embedding boundary information into an attentional matrix is the core operation of BEACon's layers, termed B-Conv layers. This information can guide the network to learn more discriminative local features in the neighborhood, a claim we verify experimentally in Sect. 4.4. In classic image processing, an edge is commonly computed from the gradient of nearby pixels, and multiple criteria can be used to produce the binary label [40]. For point clouds, however, the gradient of either geometry or color alone cannot guarantee the boundary of the desired instance. Rather, it is the relative relationship between geometry and color differences that describes the instance boundary and can provide more clues to the attentional matrix.

With these considerations, we formally define the instance boundary as differences in four spaces: 3D space, color space, geo-feature space, and propagated feature space. The difference in 3D space transforms the neighborhood area into a local coordinate system around the query point, while the differences in color space and geo-feature space provide additional similarity measures between neighboring points, as shown in Fig. 1. Since the propagated feature better describes a larger spatial context, the fourth space, propagated feature space, is added whenever the layer is not an input layer. Intuitively, more attention should be given to nearer points in 3D space, but the network can also adjust its attention based on the feature distribution in the other three spaces.

The design of the B-Conv layer can be explained with the generalized point set convolution, as illustrated in Fig. 2. More specifically, \(\mathcal {S}(\cdot )\) is furthest point sampling, which extracts the query points with shape \(N_q \times 3\) from the pool points with shape \(N \times 3\). We adopt kNN to search for a fixed number of neighbors around the query points with a pre-defined dilation rate D. The layer branches from here to generate the attentional weight and the propagated feature. To calculate the differences, the query point feature is subtracted from the neighboring features in the four spaces. The kernel function \(g(\cdot )\) embeds the instance boundary in two steps. The first step uses four MLPs to extract high-level features unique to each of the four difference spaces. After concatenation, the second step uses another MLP to explore the inter-relationship between those features and generate the attentional weight with dimension \(N_q \times K \times C^\prime \). To generate the propagated feature, we simply feed the gathered neighborhood feature to an MLP, i.e., \(\mathcal {M}(\cdot ) = \mathrm{MLP}(\cdot )\).

Fig. 2 The architecture of the B-Conv layer, illustrated in terms of the generalized point set convolution

The attentional weight is multiplied element-wise with the propagated feature. Although the aggregation function is defined as a summation in the convolution operation, we experimentally show that max pooling as \(\mathcal {A}(\cdot )\) learns more discriminative features in the neighborhood.
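
To make the layer concrete, the following PyTorch sketch shows our reading of the kernel \(g(\cdot )\), the feature mapping, and the aggregation. The channel sizes, the shared-MLP construction, and the \((B, C, N_q, K)\) tensor layout are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

def mlp(c_in, c_out):
    # Shared pointwise MLP (1x1 conv over the query/neighbor dimensions)
    return nn.Sequential(nn.Conv2d(c_in, c_out, 1), nn.BatchNorm2d(c_out), nn.ReLU())

class BConvKernel(nn.Module):
    """Sketch of the boundary-embedded kernel g(.). Inputs are per-neighbor
    differences in the four spaces, each shaped (B, C_space, N_q, K);
    the output feature is shaped (B, C', N_q) after max-pooling over K."""
    def __init__(self, c_geo_feat, c_prop, c_out, c_hidden=32):
        super().__init__()
        self.mlp_xyz  = mlp(3, c_hidden)           # Delta XYZ
        self.mlp_rgb  = mlp(3, c_hidden)           # Delta RGB
        self.mlp_geo  = mlp(c_geo_feat, c_hidden)  # Delta geo-features
        self.mlp_prop = mlp(c_prop, c_hidden)      # Delta propagated features
        self.mlp_fuse = mlp(4 * c_hidden, c_out)   # inter-relationship -> weight
        self.mlp_feat = mlp(c_prop, c_out)         # feature mapping M(.)

    def forward(self, d_xyz, d_rgb, d_geo, d_prop, neigh_feat):
        # Step 1: per-space MLPs; Step 2: fuse into the attentional weight
        h = torch.cat([self.mlp_xyz(d_xyz), self.mlp_rgb(d_rgb),
                       self.mlp_geo(d_geo), self.mlp_prop(d_prop)], dim=1)
        attn = self.mlp_fuse(h)           # attentional weight, N_q x K x C'
        feat = self.mlp_feat(neigh_feat)  # propagated feature
        # element-wise scaling, then max-pool over the K neighbors as A(.)
        return (attn * feat).max(dim=-1).values
```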

Relationship to Prior Works As an attentional convolution network, BEACon shares traits with recent works and is inspired mainly by GACNet [10] and PointWeb [11]. However, there are key differences: (1) BEACon only processes downsampled query points, whereas graph pooling happens at the end of the layer operation in GACNet. (2) During the kNN search, BEACon applies a dilation rate to increase the layer's receptive field. (3) GACNet learns a weighted average to sum up neighboring features and PointWeb learns a bias to adjust them, while BEACon learns a weight matrix to scale the neighboring features. (4) Both GACNet and PointWeb use the feature difference as input to the attention matrix, while BEACon decouples the difference into separate feature spaces and explicitly models the influence of geometry and color on the attentional weight.

Fig. 3 The architecture of the BEACon network for semantic segmentation (top) and instance segmentation (bottom). The number on each encoder layer indicates the size of the output matrix, e.g., 128 points with 256 features after the third layer

3.3 BEACon network

The BEACon network consists of two parallel branches for semantic and instance segmentation (Fig. 3). The semantic branch and the instance branch share the same encoder but have different decoders. At the end of the network, the semantic branch generates the semantic probabilities and the instance branch generates the embedding of the input point cloud. The initial input feature for a point is composed of XYZ, RGB, and geo-features, namely linearity, planarity, scattering, and verticality. To preserve finer-scale features, we add skip-links between corresponding layers of the encoder and the decoder.
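
The paper does not spell out the geo-feature formulas; the sketch below uses a common eigenvalue-based definition from the covariance-feature literature, which we assume here for illustration, computed over the 20-nn neighborhoods mentioned in Sect. 4.1. The exact formulas used by BEACon may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def geo_features(xyz, k=20):
    """Linearity, planarity, scattering, verticality from the eigen-
    decomposition of each point's k-NN covariance (an assumed, commonly
    used definition; not necessarily the authors' exact formulas)."""
    _, nn = cKDTree(xyz).query(xyz, k=k)
    out = np.empty((len(xyz), 4))
    for i, idx in enumerate(nn):
        p = xyz[idx] - xyz[idx].mean(axis=0)
        # eigh returns ascending eigenvalues; reverse to l1 >= l2 >= l3
        l, v = np.linalg.eigh(p.T @ p / k)
        l1, l2, l3 = l[::-1]
        l1 = max(l1, 1e-10)
        linearity   = (l1 - l2) / l1
        planarity   = (l2 - l3) / l1
        scattering  = l3 / l1
        verticality = abs(v[2, 0])  # |z| of the estimated normal direction
        out[i] = (linearity, planarity, scattering, verticality)
    return out
```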

3.3.1 Interpolation layer

The decoder starts with interpolation layers to restore the scale of the original point cloud. We still use kNN to search for the neighboring points, but in this case, the number of query points is larger than the number of pool points. The interpolated point feature is a linear combination of the nearest points' features, with weights calculated as the inverse of the point distance.
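
A minimal sketch of this interpolation follows, assuming k = 3 nearest neighbors as in PointNet++-style feature propagation (the exact k is not stated in the paper):

```python
import numpy as np
from scipy.spatial import cKDTree

def interpolate_features(query_xyz, pool_xyz, pool_feat, k=3, eps=1e-8):
    """Upsample: each (denser) query point receives a linear combination
    of its k nearest pool points' features, weighted by inverse distance."""
    dist, idx = cKDTree(pool_xyz).query(query_xyz, k=k)
    w = 1.0 / (dist + eps)                # inverse-distance weights, (N_q, k)
    w = w / w.sum(axis=1, keepdims=True)  # normalize per query point
    return (pool_feat[idx] * w[..., None]).sum(axis=1)
```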

3.3.2 Inverse B-Conv layer

The inverse B-Conv layer is an interpolation layer followed by a B-Conv layer. The skip-linked feature is concatenated with the interpolated feature as input to the B-Conv layer, and a new neighborhood search is conducted before the standard B-Conv operation. To keep the model small while adjusting the neighborhood features at the finest scale, we only apply the inverse B-Conv layer at the last convolution layer.

3.3.3 Loss functions

The output layer is a simple classifier composed of several fully connected layers and dropout layers. During training, the losses are defined separately for the semantic and instance branches, and their sum is used to update the whole network.

The semantic branch is supervised by the classical cross-entropy loss. The instance branch, however, does not have a fixed number of labels at runtime and therefore adopts class-agnostic instance embedding learning, similar to the one in [41]. The loss function can be formulated as follows:

$$\begin{aligned} L_{\mathrm{ins}} = L_{\mathrm{var}} + L_{\mathrm{dist}} + \eta \cdot L_{\mathrm{reg}} \end{aligned}$$
(5)

where \(L_{\mathrm{var}}\) aims to pull the instance embedding toward its instance center, \(L_{\mathrm{dist}}\) encourages separation between instance clusters, and \(L_{\mathrm{reg}}\) is the regularization term. Each term can be further defined as follows:

$$\begin{aligned} L_{\mathrm{var}}= & {} \frac{1}{I} \sum _{i=1}^{I} \frac{1}{N_i} \sum _{j=1}^{N_i} [\Vert \mu _i-e_j\Vert _1 - \delta _v]_+^2 \end{aligned}$$
(6)
$$\begin{aligned} L_{\mathrm{dist}}= & {} \frac{1}{I(I-1)} \sum _{i_A=1}^{I} \sum _{\substack{i_B=1 \\ i_B \ne i_A}}^{I} [2\delta _d-\Vert \mu _{i_A}-\mu _{i_B}\Vert _1]_+^2 \end{aligned}$$
(7)
$$\begin{aligned} L_{\mathrm{reg}}= & {} \frac{1}{I} \sum _{i=1}^{I} \Vert \mu _i\Vert _1 \end{aligned}$$
(8)

where I is the number of ground-truth instances, \(N_i\) is the number of points in instance i, \(\mu _i\) is the mean embedding of instance i, \(\Vert \cdot \Vert _1\) is the \(l_1\) distance, \(e_j\) is the instance embedding of an input point, \(\delta _v\) and \(\delta _d\) are margins that define the attractive and repulsive forces, and \([x]_+ = \max (0,x)\).
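
A PyTorch sketch of Eqs. 5–8 follows, using the margins \(\delta _v = 0.5\), \(\delta _d = 1.5\), and \(\eta = 0.001\) from Sect. 4.2; the tensor layouts and looping structure are our assumptions.

```python
import torch

def instance_loss(embed, inst_labels, delta_v=0.5, delta_d=1.5, eta=0.001):
    """Discriminative instance loss (Eqs. 5-8) with l1 distances.
    embed: (N, E) point embeddings; inst_labels: (N,) instance ids."""
    ids = inst_labels.unique()
    mu = torch.stack([embed[inst_labels == i].mean(dim=0) for i in ids])  # (I, E)
    I = len(ids)

    # L_var: pull embeddings toward their instance mean, hinged at delta_v
    l_var = 0.0
    for row, i in enumerate(ids):
        d = (mu[row] - embed[inst_labels == i]).abs().sum(dim=1)
        l_var = l_var + ((d - delta_v).clamp(min=0) ** 2).mean()
    l_var = l_var / I

    # L_dist: push instance means apart, hinged at 2 * delta_d
    pd = (mu[:, None, :] - mu[None, :, :]).abs().sum(dim=2)  # (I, I) l1 dists
    off_diag = ~torch.eye(I, dtype=torch.bool)
    l_dist = ((2 * delta_d - pd[off_diag]).clamp(min=0) ** 2).mean() if I > 1 else 0.0

    # L_reg: keep the embedding magnitudes bounded
    l_reg = mu.abs().sum(dim=1).mean()
    return l_var + l_dist + eta * l_reg
```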

At test time, we use the Cut-Pursuit algorithm [12] to cluster the instance embeddings of the entire room. The category of each instance is determined by the mode of the semantic labels of its points.

3.4 Vicinity merging

For a dataset consisting of large indoor spaces, it is common to divide the space into smaller volumes. However, this introduces problems: an instance may be divided into multiple parts, and because of the separated geometry, the embeddings differ even for the same instance. For example, the handle of a chair may be separated from the rest of the chair structure and be classified as clutter. To mitigate this problem, we concatenate the predicted semantic label to the end of the instance embedding before feeding it into the Cut-Pursuit algorithm, making the embeddings more consistent if they belong to the same category.
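
A minimal sketch of this augmentation, assuming a 5-dim embedding and integer semantic predictions; the scale factor balancing the label channel against the embedding channels is our assumption.

```python
import numpy as np

def augment_embedding(embed, sem_pred, scale=1.0):
    """Append the predicted semantic label as an extra embedding channel,
    so same-category points sit closer before Cut-Pursuit clustering.
    The scale factor is an assumption, not a value from the paper."""
    sem_channel = scale * sem_pred.astype(embed.dtype)[:, None]
    return np.concatenate([embed, sem_channel], axis=1)
```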

Fig. 4 The effect of vicinity merging. Generally, connected instances are merged together if they belong to the same category

In addition, we propose a vicinity merging process specifically for datasets of large indoor spaces; its effect is shown in Fig. 4. The vicinity merging algorithm is based on a simple rule: if two instances are from the same semantic category and are directly connected, they should be merged into one instance. For a few special categories, we also use common knowledge to add more rules to the merging criteria; planarity, for example, is an additional condition for merging wall instances. Note that the proposed vicinity merging process is universal to similar datasets and can be extended to other datasets in which each object instance is connected within itself and separable from others, like common daily objects. In other special cases, hand-crafted rules are needed to achieve better merging performance. A sketch of the basic rule is given below.
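
In this sketch, "directly connected" is tested by the closest point pair falling within a distance threshold; both the test and the threshold value are our assumptions, and the semantic label is assumed constant within each instance (as after the mode assignment in Sect. 3.3.3).

```python
import numpy as np
from scipy.spatial import cKDTree

def vicinity_merge(xyz, inst, sem, touch_dist=0.05):
    """Merge same-category instances that are directly connected;
    repeat until no more merges occur. touch_dist is an assumed value."""
    changed = True
    while changed:
        changed = False
        ids = list(np.unique(inst))
        for i, a in enumerate(ids):
            pa = inst == a
            if not pa.any():
                continue  # a was merged away earlier in this pass
            tree = cKDTree(xyz[pa])
            for b in ids[i + 1:]:
                pb = inst == b
                if not pb.any() or sem[pa][0] != sem[pb][0]:
                    continue  # skip empty or different-category instances
                d, _ = tree.query(xyz[pb], k=1)
                if d.min() < touch_dist:  # directly connected -> merge
                    inst[pb] = a
                    pa = inst == a
                    tree = cKDTree(xyz[pa])  # refresh after absorbing b
                    changed = True
        # category-specific rules (e.g., RANSAC planarity for walls,
        # Sect. 4.2.2) would be applied here as extra merge conditions.
    return inst
```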

4 Experiments

4.1 Datasets

S3DIS [13] is the primary training source for our BEACon network. It contains 6 areas and 270 rooms, most of which are office settings. In total, 13 classes are annotated, and each point has both semantic and instance annotations. To make the network less sensitive to scan noise, we first grid-downsample each room point cloud with a 0.02 m grid. Geo-features are calculated based on a 20-nn search over the entire room. The room is then divided into 1.2 m \(\times \) 1.2 m blocks with two strategies. For the semantic branch, adjacent blocks overlap by 0.8 m, so each point is predicted three times and the predicted probabilities are averaged. For the instance branch, the blocks are sampled in a non-overlapping fashion, so the entire room is predicted exactly once for instance embedding. Each block is further divided into batches with a maximum of 4096 points. After prediction, the per-point labels are back-projected to the full point set for evaluation purposes.
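
The block division can be sketched as follows; the sliding stride (block size minus overlap) and the boundary handling are our assumptions.

```python
import numpy as np

def block_indices(xyz, block=1.2, stride=0.4):
    """Divide a room into block x block columns in the xy-plane.
    stride = block - overlap (0.4 m, our reading) reproduces the
    overlapping semantic-branch blocks; stride = block gives the
    non-overlapping instance-branch sampling."""
    lo, hi = xyz[:, :2].min(axis=0), xyz[:, :2].max(axis=0)
    blocks = []
    for x0 in np.arange(lo[0], hi[0], stride):
        for y0 in np.arange(lo[1], hi[1], stride):
            m = ((xyz[:, 0] >= x0) & (xyz[:, 0] < x0 + block) &
                 (xyz[:, 1] >= y0) & (xyz[:, 1] < y0 + block))
            if m.any():
                blocks.append(np.nonzero(m)[0])  # point indices per block
    return blocks
```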

Table 1 Semantic segmentation results on S3DIS, \(_6\) is for sixfold cross-validation
Table 2 Instance segmentation results on S3DIS, \(_6\) is for sixfold cross-validation

PartNet [14] consists of 573,585 part instances over 26,671 3D models covering 24 object categories. Semantic and instance annotations are prepared for each category. The number of part instances per object ranges from 2 to 220 with an average of 18, and each object consists of 10,000 points. Similar to S3DIS, we calculate the geo-features based on all the points of one object and randomly sample the points into 4 batches of 2500 points.

4.2 Implementation details of the BEACon network

For the S3DIS dataset, each point is represented by a 10-dim feature vector, including 3D coordinates (XYZ), color (RGB), and geo-features. We set \(\delta _v = 0.5\), \(\delta _d = 1.5\) and \(\eta = 0.001\) in the loss function. To augment the dataset, a rotation perturbation of \(\pi /32\) around the z-axis and a scale variance of 0.001 in all directions are used. The Adam optimizer is used for training with a base learning rate of 0.001 and a decay rate of 0.8 every 5000 steps. The minimum learning rate is capped at \(10^{-6}\). The embedding dimension for instance segmentation is set to 5, and the regularization strength for Cut-Pursuit is set to 3 with a 5-nn graph. During training, we randomly sample 2048 points from each batch and train for 60 epochs with batch size 4. During testing, we use all the available points in the block as input. For the PartNet dataset, we keep all the settings similar to those for S3DIS.

4.2.1 Evaluation metrics

For the evaluation of semantic predictions, the accuracy and IoU for each category are obtained, and the mean accuracy (mAcc) and mean IoU are calculated by averaging the per-class accuracy and IoU. In addition, the overall accuracy (oAcc) is calculated over all predicted points.

For instance segmentation, the coverage and weighted coverage are evaluated, along with precision and recall. Coverage is the average instance-wise IoU between each ground-truth instance and its best-matched prediction. Weighted coverage additionally weights each instance's IoU by the ratio of that ground-truth instance's points to all ground-truth points. Precision and recall are defined with an IoU threshold of 0.5, and the mean precision (mPrec) and mean recall (mRec) are obtained by averaging the per-category results. A sketch of the coverage metrics is given below.
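
For reference, the two coverage metrics under their usual definitions (our reading of the text) can be computed as:

```python
import numpy as np

def coverage(gt_masks, pred_masks, weighted=False):
    """mCov / mWCov: for each ground-truth instance take the best-IoU
    prediction; weighted coverage scales each term by that instance's
    share of all ground-truth points. Masks are boolean arrays over points."""
    total = sum(g.sum() for g in gt_masks)
    ious, weights = [], []
    for g in gt_masks:
        best = max((np.logical_and(g, p).sum() / np.logical_or(g, p).sum()
                    for p in pred_masks), default=0.0)
        ious.append(best)
        weights.append(g.sum() / total)
    if weighted:
        return float(np.sum(np.array(ious) * np.array(weights)))
    return float(np.mean(ious))
```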

4.2.2 Vicinity merging details

As mentioned in Sect. 3.4, only datasets such as S3DIS require vicinity merging. For each category, we iterate through all the instances and merge the ones that are directly connected, repeating this procedure until no more instances can be merged. For ceilings and floors, we merge all instances unconditionally. For walls, we first filter out the small instances and then use RANSAC to fit planes to the instances; the instances are merged only if they belong to the same plane and are directly connected. For chairs, instances that are directly connected or whose projections onto the horizontal plane intersect are merged.

4.3 S3DIS results

The evaluation of S3DIS follows the Area 5 validation and sixfold cross-validation protocols. We report our semantic results in Table 1. Although not specifically designed for semantic segmentation, our BEACon network shows competitive performance among the most recent works. We observe that by incorporating both geometry and color information into the attentional layer, the network obtains a clearer boundary between object instances, which in turn better delineates the boundary between semantic classes as well. Instance results are shown in Table 2. The BEACon network outperforms state-of-the-art methods by a large margin in all four metrics. Compared with ASIS, we achieve more than 15% improvement on mCov and mWCov, with a 4.09% improvement on mean precision and 14.56% on mean recall.

Fig. 5 Qualitative results on the S3DIS dataset (Area 5)

We provide some qualitative results of our predictions in Fig. 5. For the semantic results, different colors correspond to different categories. For the instance results, we randomly assign each instance a color; the color has no meaning other than indicating different instances. Most of the instances are correctly recalled. However, one drawback of the vicinity merging algorithm is that, for some classes, it makes two directly connected objects indistinguishable.

Table 3 Computational time analysis on an RTX2080Ti for processing a point cloud with 4096 points

Computational Time The BEACon network has 2.5 M parameters, 56% more than ASIS (1.6 M). However, the Cut-Pursuit algorithm is much faster and processes the whole room point cloud at once. For an input with 4096 points in office-39 (last row in Fig. 5), although the BEACon network inference time is 103 ms (55 ms for ASIS), the overall processing time is 236 ms, which is faster than ASIS (253 ms), as shown in Table 3. We also examine the time cost of using both geometry and color information: the BEACon network inference time is about 3 ms longer than that of the partial-attention variant, but the instance segmentation performance is significantly higher, as discussed in Sect. 4.4.1. The network takes about 3–4 h to converge on a single RTX2080Ti.

4.4 Ablation study

We evaluate our design choices by removing or replacing certain components. The results are collectively reported in Table 4. Experiment \({\textcircled {{6}}}\) is equivalent to our BEACon approach.

Table 4 Ablation studies on S3DIS dataset (Area 5)

4.4.1 Effect of attention mechanism

To show the effectiveness of our proposed attention mechanism, we specifically design a baseline network without the attention kernel, denoted as \({\textcircled {{1}}}\) in Table 4. The attention module is removed, and the input of each layer is concatenated with \(\Delta XYZ\) to provide localized information. In experiments \({\textcircled {{2}}}\) to \({\textcircled {{4}}}\), we study partial attention by feeding only a single feature difference to generate the attentional weight. Experiment \({\textcircled {{2}}}\) resembles the attention mechanism commonly used in semantic segmentation, with the spatial difference concatenated with the propagated feature difference as input to the attentional weight. By embedding the color and geometry differences as boundary information, BEACon (experiment \({\textcircled {{6}}}\)) improves on experiment \({\textcircled {{2}}}\) by +1.04 mIoU and shows a large performance gain on mPrec (+4.07) and mRec (+1.61). The results indicate that geometry-color-based attention performs on par with its spatial-only counterpart for semantic segmentation but can largely benefit instance segmentation. We also test the weighted-sum attention mechanism of GACNet [10] in experiment \({\textcircled {{5}}}\), where the weight matrix is normalized across the neighborhood and the adjusted features are summed up. The results show that max pooling performs best as the aggregation function.

We further visualize the effect of attention by comparing the neighboring point attention around carefully selected query points in experiments \({\textcircled {{1}}}\) and \({\textcircled {{6}}}\), as shown in Fig. 6. The attention is calculated as the histogram of neighborhood indices after the max-pooling operation; in other words, a neighboring point with maximum attention has most of its features remaining after the aggregation function. We extract the result from layer 3, where each query point has 32 neighbors with a dilation rate \(D = 2\). Compared to the network without attention, BEACon has a smaller attention spread when the query point is near the edge of the picture and a larger spread at the center of the picture. While BEACon puts maximum attention on the points that have similar color and geometry to the query point, \({\textcircled {{1}}}\) tends to divert its attention randomly. When the query point is on the chair, BEACon puts most of the attention on the structure of the chair, while \({\textcircled {{1}}}\) spreads its attention to the wall, causing the wrong features to be aggregated further down the line. The sketch below illustrates how this attention histogram can be computed.
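
A short sketch of the histogram computation, where attn_feat is a hypothetical tensor holding the weighted neighborhood features of one query point:

```python
import torch

def neighbor_attention(attn_feat):
    """Attention as visualized in Fig. 6: the histogram of the neighbor
    index that survives max-pooling, counted per channel.
    attn_feat: weighted features of one query point, shape (C', K)."""
    winners = attn_feat.argmax(dim=-1)  # winning neighbor per channel, (C',)
    hist = torch.bincount(winners, minlength=attn_feat.shape[-1])
    return hist.float() / hist.sum()    # fraction of channels each neighbor wins
```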

Fig. 6 Comparison between experiment \({\textcircled {{1}}}\) (top) and experiment \({\textcircled {{6}}}\) on neighboring point attention. The point receiving the most attention is colored yellow

4.4.2 Effect of clustering and merging algorithm

We compare our selected clustering algorithm, Cut-Pursuit (CP), to another commonly used algorithm, MeanShift (MS). One simple strategy is to directly replace Cut-Pursuit with MeanShift in our design. However, the computational complexity of MeanShift increases quadratically with the number of input points, making it impractical to process the entire room at once. We therefore use MeanShift only inside each batch. It takes 55 minutes to test and evaluate the entire Area 5, which is 5 minutes longer than BEACon. We also test MeanShift with the BlockMerging (BM) strategy used in [33]. BlockMerging requires the blocks to overlap; unlike Cut-Pursuit with vicinity merging (VM), we have to predict the instance embedding 3 times in this case. The entire evaluation of Area 5 then takes 110 minutes.

The advantage of Cut-Pursuit lies in its effectiveness and speed. Compared to MeanShift, which solves the clustering problem with a purely density-based approach, Cut-Pursuit models it as a global graph-cut optimization and yields a smoother segmentation result. In addition, it can process the entire room at once in a short time. Thus, Cut-Pursuit is chosen to cluster the instance embeddings in this research. Regarding the merging algorithm, one drawback of BlockMerging is that it requires an overlapped area between the block being processed and the already processed blocks; vicinity merging has no such limitation.

4.5 PartNet results

We show the effectiveness of the BEACon network for part instance segmentation using the four largest categories in the PartNet dataset, following the evaluation protocol in GSPN [43], where the third annotation level is used. Note that the RGB values provided by PartNet are misleading, so we omit color as input and embed the boundary information without color. In fact, the point cloud of each shape is randomly sampled on the CAD model surface, and on many models the outer and inner surfaces belong to the same category but have different colors. We report our experimental results in Table 5. In PartNet [14], multiple network architectures are tested for the semantic task, and we report the highest score in the paper regardless of the method used. For a network that processes both tasks simultaneously, BEACon has a semantic score close to networks specifically designed for semantic segmentation. Our instance segmentation outperforms the best method in PartNet, with a maximum 25.02% improvement on the chair category. We also provide qualitative results in Fig. 7. With geometry-embedded boundary information, BEACon can distinguish instances based on small geometric differences, such as the horizontal support under a table.

Table 5 Segmentation results on PartNet dataset
Fig. 7 Qualitative results on the PartNet dataset

5 Conclusion

We have presented BEACon, a boundary-embedded attentional convolution network for point cloud instance segmentation. We draw inspiration from human perception and model an attentional weight that adapts itself to the relationship between geometry and color differences. Experimental results show that our network learns more discriminative features around the neighborhood and achieves better performance than the state of the art on several benchmarks.