1 Introduction

In recent years, with the aid of deep learning, autonomous driving has achieved significant breakthroughs in multiple tasks, such as object detection, motion forecasting, and semantic segmentation. Among these, point cloud semantic segmentation (PCSS), which is commonly used to understand driving scenes, has drawn more and more attention. In the past five years in particular, numerous novel PCSS methods [18, 35, 47] based on deep learning frameworks have been proposed, and several public PCSS datasets have been released, such as Semantic3D [15], ScanNet [9], and SemanticKITTI [3].

To achieve superior performance, deep learning generally relies on a large amount of annotated data to train large-scale models. However, model performance is still not saturated with respect to the size of the annotated data [54]. Moreover, annotating a large amount of data costs substantial human labor and time, and sometimes only domain professionals can annotate the data [4]. More importantly, 3D point cloud data are generally sparse and unorganized, and a point cloud often includes more than 100,000 points [3], which makes point cloud annotation difficult. Active learning (AL) is an effective way to address this problem: it selects the most informative and representative samples from the unlabeled data for annotation, which greatly reduces the cost of annotation.

Existing AL methods mostly operate at the sample level and pay less attention to dense prediction tasks. Most works [36] are proposed for image processing and natural language processing tasks. However, since point clouds have an unorganized and irregular structure, these image-oriented methods cannot be directly applied to them. In addition, compared with images, point clouds typically contain rich geometric information [33] and intensity information. Besides, they are often collected in sequences, which carry temporal information [3]. This information, which is mostly overlooked in recent works [26, 43, 52], has the potential to improve AL model performance.

In this paper, we focus on these characteristics of the point cloud and propose a novel sample selection and annotation pipeline. Specifically, our proposed method takes representativeness, uncertainty, and diversity into consideration and conducts multi-granularity sample selection: inter-frame and intra-frame. For inter-frame selection, we consider the sample representativeness within the sequence, so as to single out a subset that can represent the distribution of the entire sequence. Since the coverage areas of adjacent frames usually overlap to varying degrees, labeling all point clouds is uneconomical and produces a lot of redundancy. Inspired by the point cloud registration algorithm [5], we develop a novel matching score function to evaluate the similarity of two frames within the sequence. According to whether the matching score is smaller or larger than a similarity threshold, we determine which of the two frames becomes a member of the representative subset. As shown in Fig. 3, a representative subset selected from a sequence can cover the whole area of the point cloud sequence with fewer samples, reduce overlapping areas, and thus lower the annotation cost.

As for intra-frame selection, not all annotated points within a frame contribute to the model’s improvement [52]; that is, redundancy also exists in intra-frame annotation. Besides, due to the nature of dense prediction, it is laborious to annotate every point in the PCSS task. To make point cloud annotation more efficient while maximizing segmentation performance, we argue that the unit of point cloud annotation can be changed from the frame to a small portion of segmented regions [52]. We therefore make a tradeoff between annotation labor and efficiency to alleviate the expensive point-by-point labeling [43]. Specifically, we propose a novel method to reduce redundancy at the intra-frame granularity under the guidance of uncertainty estimation and point cloud intensity. In detail, we first segment a point cloud into regions that serve as the fundamental labeling units using two unsupervised algorithms [33, 46]. Next, uncertainty estimation is carried out on these segmented regions. Furthermore, to avoid selecting typically uncertain segmented regions that occur in several point clouds, we introduce the exclusive intensity information of the point cloud [19] to complement the segmented region information estimation. Finally, the segmented regions with high uncertainty and diversity are selected for annotation.

AL aims to minimize the size of the training set, which exactly matches the natural demand of semisupervised learning [27]. Semisupervised learning utilizes both labeled and unlabeled data to train models and is well suited to address the scarcity of labeled data in real-world tasks. Pseudolabeling is one of the main techniques in semisupervised learning: a model trained on partially labeled data predicts the unlabeled data, and data with high prediction confidence are assigned pseudolabels [51]. Therefore, the integration of semisupervised learning and AL has attracted research interest in recent years [45, 50]. However, such integration has rarely been explored for PCSS in the recent literature. In this paper, to further reduce human annotation labor, we propose to automatically select and pseudolabel a portion of the confident unlabeled data. The proposed method searches for the most certain and informative unlabeled data under the guidance of a high confidence threshold. Specifically, we first leverage the trained model to predict the unlabeled data and obtain prediction confidences. The data with high prediction confidence are then selected and added to the labeled data pool. Finally, the labeled data and pseudolabeled data are exploited to fine-tune the model.

Experimental results show that our method significantly outperforms existing deep active learning approaches on the SemanticKITTI dataset and achieves state-of-the-art performance on the S3DIS dataset. Our proposed method achieves 90% of the performance of fully supervised learning while requiring less than 15% and 3% of the annotations on the S3DIS and SemanticKITTI datasets, respectively. The ablation studies also verify the effectiveness of each component of our method.

In summary, the major contributions of this paper are as follows:

  • We propose a new multi-granularity sample selection and annotation AL pipeline for point cloud semantic segmentation.

  • We introduce semisupervised learning to automatically select and pseudolabel data with high prediction confidence, effectively reducing annotation costs.

  • Experiments on the challenging SemanticKITTI dataset show that our approach outperforms existing deep active learning methods in segmentation accuracy and substantially reduces human annotation labor and computational costs.

2 Related works

2.1 3D semantic segmentation

Recently, 3D PCSS has achieved great progress with the aid of deep learning. The purpose of 3D PCSS is to divide a point cloud into several objects according to the predicted semantic meanings of points. According to the representation of the point cloud data, 3D semantic segmentation methods can be classified into three categories: point-based [18, 35], projection-based [34], and voxel-based [7, 29]. Point-based methods directly process unstructured point clouds but suffer from efficiency bottlenecks. In order to employ two-dimensional (2D) convolutional neural network (CNN) architectures, projection-based methods focus on converting the 3D point cloud into 2D pseudo-images, which results in information loss. Voxel-based methods convert a point cloud into 3D voxels processed by 3D volumetric convolutions. Although this retains the 3D geometric information, a very high resolution is required to avoid losing too much information. Overall, these methods heavily rely on fully annotated datasets, and densely annotating point clouds is laborious and time-consuming. To this end, we focus on how to train a model with less annotated data while achieving performance similar to fully supervised training.

2.2 Deep active learning

As a machine learning method, AL has been of research interest for a couple of decades for increasing label efficiency and reducing annotation costs. AL selects the most informative and representative samples from the unlabeled dataset into the labeled pool through a query strategy and then iteratively trains the model until the annotation budget is exhausted or pre-defined termination conditions are reached. The query strategy is therefore extremely important. The main query strategies include uncertainty-based approaches [4, 20, 23, 30], distribution-based approaches [2, 14, 31], and expected-model-change approaches [21, 41]. Various methods were proposed to measure the uncertainty of unlabeled samples through the posterior probability of a predicted class [23], the difference between the first and second predictions [20], or the entropy of class posterior probabilities [30]. Some earlier studies [8, 42] also estimated sample uncertainty by referring to a committee of classifiers. The distribution-based approach queries samples by considering the selection of core subsets and chooses the samples that represent the whole dataset, such as clustering algorithms [31], Gaussian processes [14], and context-aware methods [2]. The expected-model-change approach primarily chooses the unlabeled samples that induce the largest change in the current model by estimating the expected gradient length [41], expected future errors [38], or expected output changes [21].

Deep learning (DL) has achieved unparalleled breakthroughs in various fields, but DL is often greedy for large amounts of labeled data [16]. Therefore, many researchers have high expectations for combining DL and AL, referred to as deep active learning (DAL) [36], owing to AL’s capacity to effectively reduce labeling costs. Gal et al. [12] proposed an influential AL framework for high-dimensional data based on Bayesian deep learning, estimating uncertainty through Monte Carlo (MC) dropout integration. However, Sener and Savarese [40] pointed out that this method is unsuitable for large datasets because of batch sampling. They then proposed a core-set approach from the perspective of distribution, constructing a core set that is representative of the entire original dataset. They showed that minimizing the core-set loss is equivalent to the k-Center problem, which can be tackled by an efficient approximate solution. William et al. [4] proposed an ensemble-based AL method for deriving well-behaved uncertainty estimates for unlabeled data. They compared it against the Bayesian deep learning approach [12] and the density-based approach [40], and the results show that ensemble-based AL can effectively counteract the class-imbalance problem during acquisition and leads to better-calibrated predictive uncertainties. Yoo and Kweon [54] introduced a novel active learning method with a loss prediction module that is learned to predict the target loss of the unlabeled dataset. By considering the difference between a pair of loss predictions, the loss prediction module can discard the scale of the real loss changes. Inspired by semisupervised learning, some researchers [13, 17, 45, 50, 55] have assigned pseudolabels to high-confidence samples to further improve the accuracy and stability of the DAL model. In addition, some researchers have combined generative adversarial networks (GAN) [48], reinforcement learning [28], and transfer learning [10] with AL to achieve various purposes.

2.3 AL for semantic segmentation

Semantic segmentation has important applications in various fields, such as autonomous driving [24], image processing [1], and high-resolution remote sensing [32]. Combining AL with semantic segmentation also helps alleviate the annotation cost. Although many AL approaches for semantic segmentation have been proposed, most of them focus on 2D image segmentation [6, 22, 44, 53]. Recently, a few researchers have applied AL to 3D point cloud segmentation. Lin et al. [26] first combined AL with DL for semantic segmentation of large-scale airborne laser scanning (ALS) point clouds. They proposed a segment-based query function, considering interactions among points within segments, to assess the informativeness of samples. Based on this training framework, they introduced incremental learning to save training time and added a mutual information metric to estimate model-dependent uncertainty [25]. Shi et al. [43] proposed a superpoint-based [11] AL strategy that better exploits the limited annotation budget and further designed shape-level diversity and local spatial consistency constraints. Observing that only a small portion of annotated regions is sufficient for 3D scene understanding, Wu et al. [52] proposed a region-based and diversity-aware AL method. In this paper, from the perspective of uncertainty, representativeness, and diversity, we propose a multi-granularity sample selection and annotation pipeline that combines the unique 3D geometric information of the point cloud and the sequential relationship between frames.

3 Methodology

In this section, we describe our multi-granularity and semisupervised AL pipeline in detail. We first introduce the architecture of the pipeline. Then, the proposed inter-frame selection approach is presented, followed by the segmented region-based intra-frame selection strategy. Furthermore, we illustrate how to compute the confidence of segmented regions in order to apply pseudolabels for the semisupervised learning task. Next, the details of the networks adopted in our work are explained. Finally, we introduce how we leverage the query strategy to select segmented regions with high uncertainty and diversity for annotation and to pick out segmented regions with high confidence for pseudolabeling.

3.1 Architecture of the proposed pipeline

The purpose of PCSS is to train a model on a dataset so that the model assigns a predicted label to each point, which is a dense prediction task. Therefore, the labor and time cost of sample annotation required for training a PCSS model is very high. In order to improve the efficiency of manual annotation, we first obtain a representative subset \(D_{\mathrm{NDT}}\) from the original point cloud dataset \(D_{\mathrm{orig}}\) through the normal distributions transform (NDT) algorithm. Next, we over-segment the 3D point cloud scans in \(D_{\mathrm{NDT}}\) into supervoxels using the voxel cloud connectivity segmentation (VCCS) [33] algorithm. Subsequently, the locally convex connected patches (LCCP) [46] algorithm is used to obtain the segmented regions from the generated supervoxels. Each segmented region contains several points, so it is convenient and time-saving to annotate such regions. We thus obtain a segmented 3D point cloud dataset D, which can be divided into two subsets: a small labeled subset \(D_{\mathrm{L}}\) containing randomly selected point cloud scans and a large unlabeled subset \(D_{\mathrm{U}}\).

Our multi-granularity and semisupervised active learning pipeline can be divided into five steps (a schematic sketch follows the list):

  1. Obtaining a representative subset \(D_{\mathrm{NDT}}\) from the original point cloud dataset \(D_{\mathrm{orig}}\) through the NDT algorithm.

  2. Generating a segmented 3D point cloud dataset D through the VCCS [33] and LCCP [46] algorithms.

  3. Training a network on the current labeled subset \(D_{\mathrm{L}}\) to assign a label to each point.

  4. Calculating the information score of segmented regions from two items, softmax entropy and point cloud intensity, as shown in Fig. 1a, and computing the softmax confidence of segmented regions, as shown in Fig. 1c.

  5. Selecting \(\textit{Top-K}\) segmented regions for annotators to assign labels and moving them from the unlabeled subset \(D_{\mathrm{U}}\) into the current labeled subset \(D_{\mathrm{L}}\), as shown in Fig. 1b; meanwhile, picking out \(\textit{Top-M}\) high-confidence segmented regions from \(D_{\mathrm{U}}\), assigning them pseudolabels, and also feeding them into \(D_{\mathrm{L}}\), as shown in Fig. 1d.
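To make the flow of these five steps concrete, the following minimal Python sketch strings them together. Every helper passed in (`ndt_select`, `segment_regions`, `train`, `information_score`, `confidence_margin`, `annotate`) is a hypothetical placeholder for the corresponding component described in Sects. 3.2–3.6, not part of any existing library.

```python
import random

def al_pipeline(d_orig, model, rounds, top_k, top_m, n_init,
                ndt_select, segment_regions, train,
                information_score, confidence_margin, annotate,
                delta_match=0.1, delta_h=0.9):
    """Schematic sketch of the five-step loop; all helpers are placeholders."""
    d_ndt = ndt_select(d_orig, delta_match)        # step 1: inter-frame selection
    regions = segment_regions(d_ndt)               # step 2: VCCS + LCCP region ids
    d_l = set(random.sample(regions, n_init))      # small initial labeled subset D_L
    d_u = set(regions) - d_l                       # large unlabeled subset D_U
    pseudo = set()

    for _ in range(rounds):
        train(model, d_l | pseudo)                 # step 3: train / fine-tune the network
        pseudo = set()                             # pseudolabels are erased afterwards

        # step 4: information score (Fig. 1a) and confidence margin (Fig. 1c)
        info = {r: information_score(model, r) for r in d_u}
        conf = {r: confidence_margin(model, r) for r in d_u}

        # step 5: Top-K regions are human-annotated and moved into D_L (Fig. 1b);
        # Top-M confident regions receive temporary pseudolabels (Fig. 1d)
        chosen = sorted(d_u, key=info.get, reverse=True)[:top_k]
        d_l |= set(annotate(chosen))
        d_u -= set(chosen)
        pseudo = set(sorted((r for r in d_u if conf[r] > delta_h),
                            key=conf.get, reverse=True)[:top_m])
    return model
```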

Fig. 1 Multi-granularity and semisupervised active learning pipeline. In our proposed architecture, the network is first trained in a supervised manner on the labeled subset \(D_{\mathrm{L}}\). The network then produces the softmax entropy and intensity of all segmented regions in the unlabeled subset \(D_{\mathrm{U}}\). a Combining segmented region entropy with point cloud intensity to form the selection indicators. b The \(\textit{Top-K}\) segmented regions are selected for the annotator to label and moved to the labeled subset \(D_{\mathrm{L}}\) for the next round. c Calculating the classification score for each segmented region. d Assigning pseudolabels to \(\textit{Top-M}\) segmented regions and moving them to the labeled subset \(D_{\mathrm{L}}\)

3.2 Registration-based inter-frame selection

Generally speaking, a point cloud dataset contains multiple sequences, each of which contains multiple frames. Consecutive frames in the same sequence have overlapping areas and include a large number of repeated categories, so we employ a point cloud matching approach to screen out a subset that can represent the sequence from the perspective of map building.

Considering robustness and efficiency, we choose the NDT algorithm [5] as the point cloud registration method, because NDT does not need to establish explicit correspondences between points or features, and all derivatives can be calculated analytically. The NDT transforms the discrete set of 2D points reconstructed from a single point cloud scan into a piecewise continuous and differentiable probability density, which consists of a set of normal distributions and can be used to match another scan through Newton’s algorithm [5]. During the registration of two point cloud scans through the NDT algorithm, once the registration process converges or reaches the maximum number of iterations, a registration score \({\text{score}}_{\mathrm{match}}\) is obtained and used to construct the matching score function for screening representative point clouds:

$$\begin{aligned} {\text{score}}_{\mathrm{match}} = 1 - \sum _{i} \exp \left( \frac{-\left( x_{i}^{\prime }-q_{i} \right) ^{\mathrm{T}} \Sigma _{i}^{-1} \left( x_{i}^{\prime }-q_{i} \right) }{2} \right) , \end{aligned}$$
(1)

where \(x_{i}^{\prime }\), \(\Sigma _{i}^{-1}\) and \(q_{i}\) denote the following:

  • \(x_{i}^{\prime }\) denotes the point \(x_{i}\) mapped into the coordinate frame of the target scan according to the rotation and translation parameters P. \(x_{i}\) is the 2D point reconstructed from laser scan sample i, expressed in the coordinate frame of the input scan.

  • \(\Sigma _{i}\) and \(q_{i}\) represent the covariance matrix and the mean of the normal distribution corresponding to point \(x_{i}^{\prime }\).
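As a concrete illustration, the following NumPy sketch evaluates Eq. 1 for a set of transformed points and the per-cell normal distributions they fall into; mapping \(x_{i}\) into the target frame and looking up the NDT cell of each point are assumed to have been done beforehand.

```python
import numpy as np

def ndt_match_score(x_prime, q, sigma):
    """Matching score of Eq. 1.

    x_prime : (N, 2) input-scan points mapped into the target frame.
    q       : (N, 2) means of the NDT cells the transformed points fall into.
    sigma   : (N, 2, 2) covariances of the corresponding NDT cells.
    A small score indicates a good match (large overlap between the scans).
    """
    d = x_prime - q                                    # residuals x'_i - q_i
    sigma_inv = np.linalg.inv(sigma)                   # per-cell inverse covariance
    maha = np.einsum('ni,nij,nj->n', d, sigma_inv, d)  # (x'-q)^T Sigma^{-1} (x'-q)
    return 1.0 - np.exp(-maha / 2.0).sum()
```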

In our work, when the registration score \({\text{score}}_{\mathrm{match}}\) of two point cloud scans is less than a threshold \(\delta _{\mathrm{match}}\), we consider the overlapping area of the two scans to be large, and we discard the input frame and retain the target frame. On the contrary, when it is greater than \(\delta _{\mathrm{match}}\), we take the current input frame as the target frame for the next matching. The outline of the proposed inter-frame selection approach, given a point cloud sequence \({\textbf{S}} = \{ s_{1}, s_{2}, \ldots , s_{n}\}\) of n scans and an initial representative subset \({\textbf{S}}^{\prime }=\{s_{1}\}\), is as follows:

  1. Take scan \(s_{1}\) as the target frame and scan \(s_{2}\) as the input frame, and calculate their matching score \({\text{score}}_{\mathrm{match}}^{1-2}\) through the NDT algorithm. If \({\text{score}}_{\mathrm{match}}^{1-2}\) is less than the threshold \(\delta _{\mathrm{match}}\), scan \(s_{2}\) is discarded and there is no need to update the subset \({\textbf{S}}^{\prime }\).

  2. Next, take scan \(s_{3}\) as the input frame and perform the registration between scan \(s_{3}\) and scan \(s_{1}\). If their matching score \({\text{score}}_{\mathrm{match}}^{1-3}\) is larger than the threshold \(\delta _{\mathrm{match}}\), scan \(s_{3}\) is added to the subset \({\textbf{S}}^{\prime }\) and taken as the new target frame.

  3. Repeat the above steps until the point cloud registration of every frame in the sequence is completed.

We then obtain a representative subset \({\textbf{S}}^{\prime } = \{ s_{1}^{\prime }, s_{2}^{\prime }, \ldots , s_{m}^{\prime } \}\) that represents the whole sequence. The process of inter-frame selection is illustrated in detail in Algorithm 1.

Algorithm 1 Inter-frame selection
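A minimal Python sketch of this greedy selection loop is given below; `match_score` stands for any NDT registration routine that returns \({\text{score}}_{\mathrm{match}}\) for a target/input pair (e.g., via Eq. 1) and is an assumed callable rather than a specific library function.

```python
def select_representative_frames(sequence, match_score, delta_match=0.2):
    """Greedy inter-frame selection following the steps above.

    sequence    : list of point cloud scans [s_1, ..., s_n].
    match_score : callable(target, input_scan) -> score_match from NDT registration.
    A small score means heavy overlap, so the input frame is discarded;
    otherwise it joins the subset S' and becomes the new target frame.
    """
    subset = [sequence[0]]
    target = sequence[0]
    for scan in sequence[1:]:
        if match_score(target, scan) > delta_match:  # low similarity: keep the frame
            subset.append(scan)
            target = scan                            # target frame for the next match
    return subset
```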

Obviously, the number of point clouds selected from the same sequence differs with different thresholds. Taking sequence 07 (with 1101 point cloud scans) in the SemanticKITTI dataset as an example, the number of point clouds selected under different thresholds is shown in Fig. 2. For instance, when the threshold is \(\delta _{\mathrm{match}} = 0.2\), a representative subset \({\textbf{S}}^{\prime }\) (with 330 point cloud scans) is selected from sequence 07. The selected point cloud scans are then used to build the map, as shown in Fig. 3. The results show that the subset selected by the NDT matching algorithm can completely represent all the elements in the scene.

Fig. 2 The number of point clouds selected from the sequence \(07\) (with \(1101\) point cloud scans) in SemanticKITTI dataset by setting different thresholds

Fig. 3 Leveraging \(330\) representative point cloud scans selected from the sequence \(07\) in SemanticKITTI dataset with threshold \(\delta _{\mathrm{match}} = 0.2\) to build the map

3.3 Segmented region-based intra-frame selection

The labeling cost varies greatly depending on the target task. In the annotation process, it is relatively cheap to select closed polygons to form a semantic annotation for a 2D image, but 3D point-wise data require expensive point-by-point labeling [43, 54]. However, not all annotated points within a frame contribute to the model’s improvement [52]. Besides, for the same number of annotated points, if the selected points are scattered across the whole frame, the model performance may be good, but the difficulty and time consumption of annotation increase greatly, and the limited budget is hard to exploit.

To alleviate the time and labor of manual point-by-point labeling, we first leverage the VCCS [33] and LCCP [46] algorithms to segment a point cloud scan into segmented regions that serve as the fundamental label querying units. Then, in each active selection step, we calculate segmented region information with softmax entropy and point cloud intensity.

3.3.1 Segmented region generation

Geometrically constrained supervoxels All points in a point cloud scan are required to be annotated in the supervised task or conventional AL, which is labor-intensive. If we can divide a point cloud scan into connected segmented regions as the basic unit of annotation, the efficiency of annotation will be greatly improved. So, we first employ the VCCS [33] algorithm to process the original point cloud scan and generate geometrically constrained supervoxels. The VCCS algorithm is composed of four parts: (1) construct the adjacency graph of the voxel cloud to ensure that supervoxels are connected in space; (2) select a number of seed points to initialize the supervoxels; (3) calculate the normalized distance \(d_{\mathrm{norm}}\) from three distances: the spatial distance \(d_{\mathrm{s}}\), the color distance \(d_{\mathrm{c}}\), and the distance \(d_{\mathrm{f}}\) in fast point feature histogram (FPFH) space [39]; and (4) use flow-constrained local iterative clustering to generate geometrically constrained supervoxels, as shown in Fig. 4.
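The distance of step (3) can be sketched as below; the weights \(w_{c}\), \(w_{s}\), \(w_{f}\) and the seed-resolution normalization of the spatial term are illustrative assumptions rather than the exact constants of the VCCS formulation [33].

```python
import numpy as np

def vccs_distance(d_c, d_s, d_f, r_seed, w_c=0.2, w_s=0.4, w_f=1.0):
    """Normalized supervoxel distance d_norm used during VCCS clustering.

    d_c : color distance, d_s : spatial distance, d_f : FPFH-space distance.
    The weights and the (3 * r_seed) spatial normalization are assumptions
    chosen for illustration, not the exact constants of [33].
    """
    return np.sqrt(w_c * d_c ** 2
                   + w_s * (d_s / (3.0 * r_seed)) ** 2
                   + w_f * d_f ** 2)
```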

Fig. 4 Visualization of over-segmenting an original 3D point cloud scan into supervoxels using the VCCS [33] algorithm. Points of the same color belong to the same supervoxel

Point cloud partitioning The geometrically constrained supervoxels obtained in the last step are not isolated; they can be further merged into larger segmented regions. We therefore leverage the LCCP [46] algorithm to segment the supervoxel adjacency graph by classifying whether the connection between two supervoxels is convex or concave using two criteria: the extended convexity criterion (CC) and the sanity criterion (SC). Finally, these small supervoxels are merged into larger segmented regions through a region-growing process according to the classification results, as shown in Fig. 5.
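A much simplified sketch of the basic convexity test is shown below; it keeps only the centroid/normal angle comparison with the concavity tolerance angle and omits the extended CC and sanity criteria of [46], so it should be read as an illustration of the idea rather than the full LCCP rule.

```python
import numpy as np

def is_convex_connection(x1, n1, x2, n2, beta_thresh_deg=10.0):
    """Simplified convexity test between two adjacent supervoxels.

    x1, x2 : supervoxel centroids; n1, n2 : unit surface normals.
    The connection is treated as convex when the normals open away from each
    other along the line joining the centroids, with a small concavity
    tolerance angle beta_thresh (a simplification of the LCCP CC criterion).
    """
    d = x1 - x2
    d = d / (np.linalg.norm(d) + 1e-12)
    a1 = np.degrees(np.arccos(np.clip(np.dot(n1, d), -1.0, 1.0)))
    a2 = np.degrees(np.arccos(np.clip(np.dot(n2, d), -1.0, 1.0)))
    return a1 <= a2 + beta_thresh_deg  # convex or only slightly concave
```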

Fig. 5 Visualization of the segmented regions obtained after applying the LCCP [46] algorithm to the previous supervoxels. Points of the same color belong to the same segmented region

3.3.2 Calculating segmented region information

In each AL selection step, the trained network predicts the probability \(p(y_{i}=j \vert x_{i})\) that each point \(x_{i}\) belongs to the \(j_{\mathrm{th}}\) category. Then, we calculate the information of a segmented region from two aspects: (1) the softmax entropy based on this probability and (2) the point cloud intensity, which are introduced in detail as follows.

Segmented region entropy As a widely studied strategy in AL, uncertainty sampling aims to select the most uncertain samples from the unlabeled subset \(D_{\mathrm{U}}\) for annotation. In this paper, we use the softmax entropy to measure the uncertainty of a segmented region. We first obtain the softmax probability \(p(y_{i}=j \vert x_{i})\) of each point \(x_{i}\) belonging to the \(j_{\mathrm{th}}\) category in the unlabeled subset \(D_{\mathrm{U}}\). Then, we calculate the region entropy \(E_{n}\) of the \(n_{\mathrm{th}}\) segmented region \(R_{n}\) by averaging the entropy of the points within the unlabeled region \(R_{n}\), as shown in Eq. 2,

$$\begin{aligned} E_{n} = -\frac{1}{N}\sum _{i=1}^{N} \sum _{j} P \left( y_{i}=j\vert x_{i}; \Theta \right) \log {P\left( y_{i}=j\vert x_{i}; \Theta \right) }, \end{aligned}$$
(2)

where N is the number of points in \(R_{n}\) and \(\Theta\) denotes the network parameters. If the trained network is confident about a predicted category, it assigns that category a probability much greater than the other categories, and the entropy \(E_{n}\) is low. On the contrary, a higher entropy value is obtained when the trained network is not confident about its prediction.
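A minimal NumPy sketch of Eq. 2, assuming the per-point softmax probabilities of one region are stacked into an (N, C) array:

```python
import numpy as np

def region_entropy(probs, eps=1e-12):
    """Mean per-point softmax entropy of one segmented region (Eq. 2).

    probs : (N, C) array of softmax probabilities P(y_i = j | x_i; Theta)
            for the N points of region R_n over C classes.
    """
    point_entropy = -(probs * np.log(probs + eps)).sum(axis=1)  # entropy per point
    return float(point_entropy.mean())                          # average over the region
```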

Point cloud intensity After obtaining the entropy \(E_{n}\) of each segmented region, the most obvious strategy is to select the top-ranked regions for annotation. However, segmented regions with higher entropy \(E_{n}\) may lead to redundant annotation effort if similar regions appear in the same querying step. To provide the network with more diverse information, we can leverage the intensity of each point in a point cloud scan. The reason is that intensity differs from material to material: the intensities reflected from the same material are similar, while pulses reflected from different materials differ [19]. Based on this property, we pick the intensity as a diversity-aware selection criterion to select diverse segmented regions for the network. We compute the region intensity score \(I_{n}\) of the \(n_{\mathrm{th}}\) segmented region \(R_{n}\) by averaging the intensities of the points within the unlabeled region \(R_{n}\), as shown in Eq. 3,

$$\begin{aligned} I_{n}=\frac{1}{N}\sum _{i=1}^{N}\rho _{i}, \end{aligned}$$
(3)

where \(\rho _{i}\) is the intensity of point \(x_{i}\) and N is the number of points in \(R_{n}\).

After calculating the softmax entropy \(E_{n}\) and intensity \(I_{n}\) of each segmented region, we combine them linearly to form the information score \(\sigma _{n}\) of the \(n_{\mathrm{th}}\) segmented region \(R_{n}\), as shown in Eq. 4.

$$\begin{aligned} \sigma _{n}=\alpha E_{n}+\beta I_{n}. \end{aligned}$$
(4)

where \(\alpha\) and \(\beta\) are weighting coefficients. Finally, we obtain an information list \(\sigma\) sorted in descending order,

$$\begin{aligned} \sigma =\left( \sigma _{1},\sigma _{2},\ldots ,\sigma _{n} \right) . \end{aligned}$$
(5)
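The intensity score of Eq. 3 and the combined information score of Eq. 4 can then be computed and ranked per region, for example as in the sketch below (which reuses `region_entropy` from above; the default weights follow the settings reported in Sect. 4.1.5):

```python
import numpy as np

def region_information(probs, intensities, alpha=1.0, beta=0.05):
    """Information score sigma_n = alpha * E_n + beta * I_n (Eqs. 2-4)."""
    e_n = region_entropy(probs)          # Eq. 2, defined in the sketch above
    i_n = float(np.mean(intensities))    # Eq. 3: mean point intensity rho_i
    return alpha * e_n + beta * i_n

def rank_regions(regions, alpha=1.0, beta=0.05):
    """Sorted information list sigma of Eq. 5 (descending order).

    regions : dict mapping a region id to its (probs, intensities) pair.
    """
    scores = {rid: region_information(p, i, alpha, beta)
              for rid, (p, i) in regions.items()}
    return sorted(scores, key=scores.get, reverse=True)
```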

3.4 Segmented region confidence estimation

In our work, at each AL iteration, the most informative unlabeled segmented regions are selected for annotation, and the network is retrained with the augmented labeled dataset. In this way, the redundant annotation of non-informative regions is avoided, greatly reducing human annotation labor. In fact, the subset \(D_{\mathrm{U}}\) also contains a considerable amount of otherwise ignored unlabeled data on which the model is highly confident. After the network is trained with the initial labeled subset \(D_{\mathrm{L}}\), we can use its predictive capability to generate relatively accurate pseudolabels for unlabeled segmented regions in subset \(D_{\mathrm{U}}\).

We select segmented regions with high confidence from subset \(D_{\mathrm{U}}\) when the predicted probability difference \(S_{\mathrm{mar}}\) between the two most likely class labels is larger than a threshold \(\delta _{H}\). The pseudolabel \(y_{c}^{\mathrm{pseudo}}\) is defined as:

$$\begin{aligned} y_{c}^{\mathrm{pseudo}} = {\left\{ \begin{array}{ll} \mathop {\text{argmax}}\limits _{j}\ p\left( y_{i} = j \vert x_{i};\Theta \right) , &\quad \text{if}\; S_{\mathrm{mar}} > \delta _{H} \\ {\text{None}}, &\quad \text{otherwise}, \end{array}\right. } \end{aligned}$$
(6)

where the threshold \(\delta _{H}\) is set to a large value to obtain highly confident pseudolabels. \(S_{\mathrm{mar}}\) is formulated as follows:

$$\begin{aligned} S_{\mathrm{mar}} = S_{\mathrm{conf}}^{c_{1}} - S_{\mathrm{conf}}^{c_{2}}, \end{aligned}$$
(7)

where \(S_{\mathrm{conf}}^{c_{1}}\) and \(S_{\mathrm{conf}}^{c_{2}}\) represent the classification scores of the highest and second-highest predicted class labels for a segmented region, respectively. As shown in Eq. 8, given a segmented region R with N points, we calculate the confidence of the predicted class labels for all points and obtain the classification scores \(S_{\mathrm{conf}}^{c_{1}}\) and \(S_{\mathrm{conf}}^{c_{2}}\) by averaging the predicted probabilities over all points in the segmented region.

$$\begin{aligned} \begin{aligned} S_{\mathrm{conf}}^{c_{1}}&= \frac{1}{N} \sum _{n=1}^{N}P\left( y_{n}^{c1} \vert R;\Theta \right) , \\ S_{\mathrm{conf}}^{c_{2}}&= \frac{1}{N} \sum _{n=1}^{N}P\left( y_{n}^{c2} \vert R;\Theta \right) . \end{aligned} \end{aligned}$$
(8)

Through the probability difference \(S_{\mathrm{mar}}\), we can avoid selecting noisy segmented regions to assign pseudolabels.

For the segmented regions that meet the pseudolabeling condition, we arrange the segmented regions in descending order of their probability difference \(S_{\mathrm{mar}}\) to obtain a descending list \(\varphi _{S}\),

$$\begin{aligned} \varphi _{S}=\left( S_{\mathrm{mar}}^{1},S_{\mathrm{mar}}^{2},\dots ,S_{\mathrm{mar}}^{n} \right) . \end{aligned}$$
(9)
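The following NumPy sketch shows a region-level reading of Eqs. 6–8, where the two most likely classes are determined from the region-averaged probabilities; this is one plausible implementation under that assumption rather than a definitive one.

```python
import numpy as np

def region_pseudolabel(probs, delta_h=0.9):
    """Pseudolabel decision for one segmented region (Eqs. 6-8).

    probs : (N, C) softmax probabilities of the region's N points.
    Returns (label, margin): the pseudolabel (or None) and S_mar.
    """
    mean_probs = probs.mean(axis=0)               # average over the region's points
    c1, c2 = np.argsort(mean_probs)[-2:][::-1]    # two most likely classes c1, c2
    s_mar = mean_probs[c1] - mean_probs[c2]       # Eq. 7: S_conf^c1 - S_conf^c2
    label = int(c1) if s_mar > delta_h else None  # Eq. 6: keep only confident regions
    return label, float(s_mar)
```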

3.5 PCSS network

The PCSS network is a crucial component of our pipeline for 3D deep learning. Many point-based [35] and voxel-based [37] networks have been proposed to process 3D data, but most of them suffer from high memory consumption and computational costs. To better demonstrate the effectiveness of the proposed AL pipeline, we pick MinkowskiNet [7], based on sparse convolution, and SPVCNN [29], based on point-voxel CNN, as the PCSS networks in this paper.

MinkowskiNet was proposed for spatio-temporal perception and can directly process 3D point cloud scans using high-dimensional convolutions. To achieve this, it adopts sparse tensors and sparse convolutions for three reasons:

  1. The sparse tensor can better express and generalize to high-dimensional spaces.

  2. The sparse convolution is similar to the standard convolution and can leverage all architectural innovations such as residual connections and batch normalization.

  3. The sparse convolution is efficient and fast because it only computes outputs for predefined coordinates and saves them into a compact sparse tensor.

To implement efficient and generalized sparse convolution, the authors provide an open-source library which includes sparse tensor quantization, generalized sparse convolution, max pooling, and so on. Furthermore, MinkowskiNet leverages a hybrid kernel (a cross-shaped kernel and a cubic kernel) to mitigate the growth of computational cost and parameter count caused by increasing dimensions.

SPVCNN is composed of a fine-grained point-based branch that keeps the 3D data at high resolution without a large memory footprint, and a coarse-grained voxel-based branch that aggregates neighboring features without random memory access [29]. For large outdoor scenes [3], it further introduces sparse point-voxel convolution (SPVConv), which enhances PVConv with sparse convolution to enable higher resolution in the voxel-based branch.

3.6 Annotating labels for segmented regions

On the one hand, according to the final descending list \(\sigma\), we select \(\textit{Top-K}\) segmented regions for annotators to assign labels. In our experiments, we actually use the ground truth of the segmented regions as the labels instead of having them labeled by human annotators. Then, these labeled segmented regions \(D_{\mathrm{label}}\) are moved from the unlabeled subset \(D_{\mathrm{U}}\) to the labeled subset \(D_{\mathrm{L}}\). Note that only a small portion of a point cloud scan is added to the subset \(D_{\mathrm{L}}\) in each active selection, as shown in Fig. 1b, because we take the segmented region as the basic labeling unit.

On the other hand, after obtaining the final descending list \(\varphi _{S}\), we select \(\textit{Top-M}\) segmented regions to assign pseudolabels. These pseudolabeled regions \(D_{\mathrm{pseudo}}\) are then fed into the labeled subset \(D_{\mathrm{L}}\) from the unlabeled subset \(D_{\mathrm{U}}\). After completing segmented region information estimation, label annotation, region confidence estimation, and pseudolabeling, we repeat the AL loop to fine-tune the PCSS network on the updated subset \(D_{\mathrm{L}}\) until the annotation budget is exhausted or the maximum number of iterations is reached. Note that after each fine-tuning step, we put the high-confidence samples \(D_{\mathrm{pseudo}}\) back into \(D_{\mathrm{U}}\) and erase their pseudolabels.

4 Experiments

In this section, we first introduce our experimental settings, including the two datasets, the initial portion of fully labeled point cloud scans, the maximum number of iterations, and the annotation budget. Then, we compare our approach with existing methods to demonstrate its effectiveness. Next, to verify the contribution of each individual strategy, we conduct ablation experiments. Finally, based on the experimental results, we discuss the limitations of our method and directions for future work.

4.1 Experimental settings

4.1.1 Datasets

We evaluate the performance of our approach and compare it with other AL methods on two large-scale challenging datasets, S3DIS and SemanticKITTI. S3DIS is a commonly used indoor semantic segmentation dataset which is divided into 6 large areas with a total of 271 rooms. We take Area5 as the validation set and perform active learning training on the remaining areas. SemanticKITTI [3] is a representative outdoor dataset released in 2019 for autonomous driving. It consists of 22 sequences with a total of 43,552 point cloud scans; sequences 00 to 10 are used as the training set, where sequence 08 serves as the validation set, and the remaining sequences form the test set. The total number of training points is \({{\text{total}}_{\mathrm{number}} = 2{,}349{,}559{,}532}\).

4.1.2 Segmented region generation

We employ the VCCS [33] algorithm to over-segment a 3D point cloud scan into supervoxels with given voxel resolution \(R_{\mathrm{voxel}}\) and seed resolution \(R_{\mathrm{seed}}\). Considering the density difference between indoor and outdoor point cloud, we set \(R_{\mathrm{voxel}}\), \(R_{\mathrm{seed}}\) to a small value (\(R_{\mathrm{voxel}} = 0.05\), \(R_{\mathrm{seed}} = 0.5\)) for S3DIS dataset, and a large value (\(R_{\mathrm{voxel}} = 0.15\), \(R_{\mathrm{seed}} = 3.5\)) for SemanticKITTI dataset. The \(R_{\mathrm{voxel}}\) represents the voxel resolution which will be used for the segmentation, \(R_{\mathrm{seed}}\) denotes the distance between supervoxels. After that, flow-constrained local iterative clustering is used to generate geometrically constrained supervoxels based on spatial connection. Next, we utilize the LCCP algorithm to cluster these supervoxels into larger segmented regions through CC criterion with \(\beta _{\mathrm{Tresh}} = 10^{\circ }\), and SC criterion with \(\alpha _{\mathrm{smooth}} = 0.1\). The \(\beta _{\mathrm{Tresh}}\) denotes the concavity tolerance angle, and \(\alpha _{\mathrm{smooth}}\) is utilized to calculate the smoothness constraint.
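For reference, these region generation settings can be collected into a small configuration block (a sketch mirroring the values above):

```python
# Region generation parameters (Sect. 4.1.2), grouped per dataset.
SEGMENTATION_PARAMS = {
    "S3DIS":         {"R_voxel": 0.05, "R_seed": 0.5,
                      "beta_thresh_deg": 10.0, "alpha_smooth": 0.1},
    "SemanticKITTI": {"R_voxel": 0.15, "R_seed": 3.5,
                      "beta_thresh_deg": 10.0, "alpha_smooth": 0.1},
}
```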

4.1.3 Annotation budget

In each active label acquisition step, because the number of points in different segmented regions varies, we set the annotation budget as a fixed portion of the total training points rather than a fixed number of segmented regions, for a fair comparison with other methods. The number of pseudolabel acquisitions is also set as a fixed portion of the total points.

4.1.4 Active learning settings

At the beginning of each experiment, we first randomly select a small portion \(x_{\mathrm{init}}\%\) of the fully labeled point clouds as the initially labeled subset \(D_{\mathrm{L}}\) and treat the rest as the unlabeled subset \(D_{\mathrm{U}}\). Then, we perform K rounds of the following steps: (1) training the PCSS network on subset \(D_{\mathrm{L}}\); (2) selecting a portion \(x_{\mathrm{label}}\%\) of the total training points from subset \(D_{\mathrm{U}}\) for annotation according to the respective AL querying method; (3) if pseudolabels are adopted, selecting a portion \(x_{\mathrm{pseudo}}\%\) of the total training points to assign pseudolabels with \(\delta _{H}=0.9\); (4) moving the newly annotated points into subset \(D_{\mathrm{L}}\) and fine-tuning the network. In order to ensure the reliability of the experimental results, each experiment is conducted three times and the results are averaged.

Specifically, we set \(x_{\mathrm{init}}=3\%\), \(K=7\) and \(x_{\mathrm{label}}=2\%\) for S3DIS dataset, and \(x_{\mathrm{init}}=1\%\), \(K=5\) and \(x_{\mathrm{label}}=1\%\) for SemanticKITTI dataset [52].
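These active learning settings can be summarized as the following configuration sketch (the pseudolabel proportion \(x_{\mathrm{pseudo}}\) is reported only as a fixed portion and is therefore left as a placeholder):

```python
# Active learning settings (Sect. 4.1.4); x_pseudo is a placeholder.
AL_SETTINGS = {
    "S3DIS":         {"x_init": 3.0, "K": 7, "x_label": 2.0,
                      "delta_h": 0.9, "x_pseudo": None},
    "SemanticKITTI": {"x_init": 1.0, "K": 5, "x_label": 1.0,
                      "delta_h": 0.9, "x_pseudo": None},
}
```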

4.1.5 Network training

For both the S3DIS and SemanticKITTI datasets, the networks are trained with the Adam optimizer (initial learning rate = 0.001) and the cross-entropy loss [52]. The voxel resolution for both datasets is set to 5 cm.

On the S3DIS dataset, we train the networks on 3 TITAN RTX GPUs with a batch size of 9. We first train both networks for 200 epochs on 3% of the fully labeled point cloud scans and then fine-tune them for 150 epochs after each addition of 2% actively annotated data into subset \(D_{\mathrm{L}}\). Since the point clouds in the S3DIS dataset do not include intensity information, we set \(\alpha = 1, \beta = 0\) in Eq. 4 for this dataset.

On the SemanticKITTI dataset, we train both networks on 4 GTX 1080Ti GPUs and set the batch size to 8. We initially train both networks for 100 epochs on 1% of the fully labeled point cloud scans and then fine-tune them for 30 epochs after each addition of 1% actively annotated data into subset \(D_{\mathrm{L}}\). Referring to [52], the weight of the softmax entropy in Eq. 4 is set to \(\alpha = 1\). Based on the experimental results, we set \({\beta =0.05.}\)

4.2 Comparison with other methods

We compare our approach with 7 other AL methods, including random point cloud scan selection (RAND); uncertainty-based methods, such as softmax confidence (CONF [50]), softmax margin (MARG [50]), softmax entropy (ENT [50]), and segmented entropy (SEG-ENT [13]); and diversity-based methods, such as the core-set approach (CoSET [40]) and ReDAL [52], a recent region-based and diversity-aware AL approach.

4.2.1 Inter-frame selection

The inter-frame selection algorithm proposed in this research cannot be employed to reduce the inter-frame redundancy of the S3DIS dataset, since its point clouds are not collected in chronological sequence. As a result, we only conduct inter-frame selection comparison experiments on the SemanticKITTI dataset. To fairly verify the effectiveness of the inter-frame selection method based on the NDT registration algorithm, we adopt random selection as the active query method. The experimental results are shown in Table 1. RAND and \({\text{RAND}}_{\mathrm{NDT}}\) indicate that point cloud scans are randomly selected for annotation from the original unlabeled dataset \(D_{\mathrm{orig}}\) and from the unlabeled dataset \(D_{\mathrm{NDT}}\), respectively. Note that the dataset \(D_{\mathrm{orig}}\) contains 19,130 training point cloud scans; after NDT matching with the threshold \(\delta _{\mathrm{match}} = 0.1\), the dataset \(D_{\mathrm{NDT}}\) contains 9335 point cloud scans. For the SPVCNN network, our inter-frame selection method achieves 90% of the fully supervised performance (\({\text{mIoU}}_{\mathrm{supvis}}^{\mathrm{SPVCNN}}=63.52\%\)) with merely 5% of annotated data. With the MinkowskiNet network, our method is also better than RAND. Although the training data available for active queries are reduced, our method enables the model to be trained on more diverse and informative labeled data.

Table 1 Results of mIoU performance (%) on SemanticKITTI with SPVCNN and MinkowskiNet in frame annotation
Fig. 6 Visualization of SemanticKITTI on sequence 08 validation subset with SPVCNN network. With our AL approach, the model can correctly identify persons on the sidewalk with merely 5% annotated points

4.2.2 Intra-frame selection

The visualization of SemanticKITTI on the sequence 08 validation subset with the SPVCNN network is shown in Fig. 6. The experimental comparison results on the SemanticKITTI dataset are shown in Figs. 7 and 8, where the x-axis represents the percentage of annotated points and the y-axis represents the mIoU obtained by the network. Under both networks, our proposed multi-granularity and semisupervised AL pipeline consistently outperforms the previous methods on the PCSS task. We find that our method outperforms all other AL methods in both experiments with initial \({x_{\mathrm{init}}=1\%}\) labeled data, which again verifies the effectiveness of the inter-frame selection method based on the NDT registration algorithm.

Fig. 7 Experimental results of different AL methods on SemanticKITTI with SPVCNN. We compare our multi-granularity and semisupervised AL method with other approaches. It is obvious that our method highly outperforms previous AL approaches

Fig. 8 Experimental results of different AL methods on SemanticKITTI with MinkowskiNet. We compare our multi-granularity and semisupervised AL method with other approaches. It is obvious that our method highly outperforms previous AL approaches

As for SPVCNN, in Table 2 we observe that our AL method achieves 90% of the fully supervised performance with merely 3% of annotated data and reaches 97.95% of the fully supervised performance with 5% of annotated points. In particular, it outperforms the recent state-of-the-art (SOTA) method ReDAL [52] by 6.6%, 7.4%, 8.4%, and 6.9% when using 2%, 3%, 4%, and 5% of labeled points, respectively. With the MinkowskiNet network, in Table 3, our AL method achieves 90% of the fully supervised performance \(({\text{mIoU}}_{\mathrm{supvis}}^{\mathrm{MinkuNet}}=61.4\%)\) with merely 2% of annotated data, and it even reaches 99.48% of the fully supervised performance with only 4% of annotated points.

Table 2 Results of mIoU performance (%) on SemanticKITTI with SPVCNN
Table 3 Results of mIoU performance (%) on SemanticKITTI with MinkowskiNet

On the S3DIS dataset, as shown in Figs. 9 and 10, our method substantially outperforms all other AL methods except ReDAL. As shown in Tables 4 and 5, the mIoU performance we obtain is very close to that of ReDAL. The main reason is that the point clouds in the S3DIS dataset do not include intensity information. Therefore, we cannot leverage point cloud intensity to reduce intra-frame redundancy, which results in both networks being trained on redundantly annotated data. Nevertheless, this result also demonstrates that our method achieves SOTA performance by leveraging segmented region entropy and pseudolabels alone.

Fig. 9 Experimental results of different AL methods on S3DIS with SPVCNN. We compare our multi-granularity and semisupervised AL method with other approaches. Except for ReDAL approach, our method highly outperforms previous AL approaches

Fig. 10 Experimental results of different AL methods on S3DIS with MinkowskiNet. We compare our multi-granularity and semisupervised AL method with other approaches. Except for ReDAL approach, our method highly outperforms previous AL approaches

Table 4 Results of mIoU performance (%) on S3DIS with SPVCNN
Table 5 Results of mIoU performance (%) on S3DIS with MinkowskiNet

4.3 Ablation studies

We verify the effectiveness of the segmented regions, point cloud intensity, pseudolabels, and NDT-based inter-frame selection in our proposed pipeline on the SemanticKITTI dataset with 5% of annotated points for a fair comparison.

The results are shown in Table 6 and Fig. 11, where ENT and \({\text{ENT}}_{\mathrm{reg}}\) represent querying the annotated points by calculating the softmax entropy of a point cloud scan and the segmented region entropy, respectively. Inten, Pseu, and NDT, respectively, denote selecting the segmented regions using point cloud intensity, training the network with pseudolabels, and selecting segmented regions from the representative dataset screened out by the NDT algorithm.

Table 6 Ablation study with 5% of annotated data on SemanticKITTI with SPVCNN network
Fig. 11 Ablation study. Segmented region entropy, point cloud intensity, pseudolabels and NDT all yield improvements to mIoU

In Table 6, we can observe that changing the annotation unit from a point cloud scan to segmented regions contributes the most to the improvement, with a gain of about 6.15% mIoU. Furthermore, with the aid of Inten, Pseu, and NDT, the mIoU performance of the segmented region entropy is improved by 1.90%, 2.84%, and 2.49%, respectively. From the comparison of the combination (\({\text{ENT}}_{\mathrm{reg}}\) + Inten) and the combination (\({\text{ENT}}_{\mathrm{reg}}\) + Inten + Pseu), we find that pseudolabels play a key role in the performance of the trained network.

From Fig. 11, we observe that the performance of “\({\text{ENT}}_{\mathrm{reg}}\)” is similar to that of “\({\text{ENT}}_{\mathrm{reg}}\) + Inten.” The reason is that, without the diverse intensity information, the selected segmented regions still contain redundant regions. This result also validates the feasibility of using point cloud intensity as the diversity indicator. Although the final performance of the group (\({\text{ENT}}_{\mathrm{reg}}\) + Inten + Pseu) and the group (\({\text{ENT}}_{\mathrm{reg}}\) + Inten + Pseu + NDT) is very close, the training data for the latter are reduced from 19,130 scans to 9335 scans after inter-frame selection. This result shows that the inter-frame selection method effectively reduces inter-frame redundancy and enables the model to be trained on a more representative dataset. Despite the fact that the quantity of point clouds available for model training is reduced by 51.20%, the model performance is not compromised. Besides, less training data means less training time and storage consumption. The result also validates the importance of our intra-frame selection strategy.

The group (\({\text{ENT}}_{\mathrm{reg}}\) + Pseu) outperforms the group (\({\text{ENT}}_{\mathrm{reg}}\) + NDT) by only 0.34%, and the performance of the group (\({\text{ENT}}_{\mathrm{reg}}\) + Inten + NDT) is weaker than that of the group (\({\text{ENT}}_{\mathrm{reg}}\) + Inten + Pseu). It can be seen that the Pseu approach feeds the model with supplementary pseudolabeled training data, which improves the model performance. The NDT method, on the other hand, enables the model to be trained on less redundant and more informative data. Although it can improve model performance, the NDT method is a coarse-grained selection method that filters out redundant information at the frame level, which may remove data that are necessary for enhancing the model performance. In summary, there are two ways to improve model performance: either feed the model with a large amount of trainable data, including pseudolabeled data, or provide data that are diverse and representative.

4.4 Discussion

4.4.1 Per-class IoU results

A comparison of the performance of our method with the fully supervised one is shown in Table 7. For the SPVCNN network, our method is on par with full supervision (Full) on most categories and is even better on the building category. Although the performance on the three categories of other-vehicle, parking, and terrain is weaker than the fully supervised one, our method still achieves 91%, 86%, and 93% of the fully supervised results, respectively. The main reason is that the inter-frame filtering method is a coarse-grained method, which may filter out some useful information. Another possible reason is the imbalanced class distribution of the SemanticKITTI dataset. As for the MinkowskiNet network, our method outperforms the fully supervised result on some small objects, such as motorcycle, person, and bicyclist.

Table 7 Per-class results of IoU performance (%) with 5% of annotated data on SemanticKITTI with two networks

4.4.2 Performance change

To investigate the relationship between segmentation performance and the proportion of annotated data, we expand the annotated data proportion to 10% and conduct experiments on the SemanticKITTI dataset, as shown in Fig. 12. The results show that our method achieves 99.15% of the fully supervised performance with 10% of the annotated data. It can be seen that the model performance improves slowly from 62.11% to 62.98% as the annotated data increase from 5% to 10%. The main reason for the slow improvement is that, as the proportion of annotated data increases, the proportion of the newly annotated data that is informative for the model decreases. Another possible reason is that the diversity criterion of our active query function only utilizes one piece of information, the point cloud intensity, which makes it difficult to feed the model with more diverse data.

Fig. 12 Experimental results of our method on SemanticKITTI with more annotated data

4.4.3 Computational costs

We report the computational time (in minutes) of four methods presented in the ablation study in Table 8, where \({\text{ENT}}_{\mathrm{int}}\), \({\text{ENT}}_{\mathrm{pse}}\) and \({\text{ENT}}_{\mathrm{ndt}}\), respectively, denote “\({\text{ENT}}_{\mathrm{reg}}\) + Inten,” “\({\text{ENT}}_{\mathrm{reg}}\) + Pseu” and “\({\text{ENT}}_{\mathrm{reg}}\) + NDT.” \(T_{\mathrm{train}}\) and \(T_{\mathrm{cal}}\) denote the average training time per epoch in an AL loop and the calculation time for active querying, respectively.

Table 8 Computational time (min) with 5% of annotated data on SemanticKITTI with SPVCNN network

Because the amount of annotated data is the same for the initial training, the training time \(T_{\mathrm{train}}\) is approximately the same for each method. It can be seen that, as the proportion of annotated data increases, the calculation time \(T_{\mathrm{cal}}\) for active querying tends to decrease. This is because the amount of data in the unlabeled dataset \(D_{\mathrm{U}}\) gradually decreases, resulting in less computation for querying. Using point cloud intensity for calculating segmented region information has no impact on the calculation time \(T_{\mathrm{cal}}\). The addition of pseudolabeling, on the other hand, increases the active querying time by 19.0%, with a mean value of 23.07 min. Compared to the \({\text{ENT}}_{\mathrm{reg}}\) method, the mean calculation time \(T_{\mathrm{cal}}\) and training time \(T_{\mathrm{train}}\) of the \({\text{ENT}}_{\mathrm{ndt}}\) method are reduced by 54.6% and 50.5%, respectively. The reason for this reduction in computational cost is that, after NDT-based inter-frame selection, the number of point clouds in the training set decreases from 19,130 to 9335, which considerably reduces the computation on the unlabeled dataset. This result also validates the importance of our registration-based inter-frame selection.

4.4.4 Hyper-parameters analysis

We conduct a parametric study of three important parameters of our method: the registration threshold (\(\delta _{\mathrm{match}}\)), the weight of the point cloud intensity (\(\beta\)) in Eq. 4, and the proportion of pseudolabeled data (\(x_{\mathrm{pseudo}}\)). During the experiment, we keep the other settings unchanged and evaluate how the mIoU performance varies with each parameter; the results are shown in Table 9. It can be seen that the mIoU performance decays as the point cloud intensity weight increases. This is because uncertain samples are more important for improving model performance than diverse samples [52]. As \(x_{\mathrm{pseudo}}\) increases, the model performance tends to flatten out. The reason may be that the amount of mislabeled data in the \(D_{\mathrm{pseudo}}\) dataset also increases with \(x_{\mathrm{pseudo}}\), which has a negative effect on the model performance. Although the NDT-based inter-frame selection can effectively reduce computational costs, it can degrade model performance when \(\delta _{\mathrm{match}}\) is set too large, because it is a coarse-grained selection method that may remove data necessary for enhancing the model performance.

Table 9 The mIoU performance (%) under different hyper-parameter settings with 5% of annotated data on SemanticKITTI with SPVCNN network

4.5 Limitations and future work

Although our proposed method has proved effective in reducing human annotation labor and computational costs, there are still two pivotal limitations. The first is that the diversity criterion used in our active query function only utilizes the point cloud intensity. In fact, there is additional information that can be used, such as the color properties contained in the S3DIS dataset’s point clouds, because regions with substantial color variance are more likely to indicate semantic diversity.

The other limitation is that the class imbalance of the dataset during acquisition is not considered. Deep learning models are usually trained and evaluated with the assumption that the dataset is balanced or nearly so. In reality, real-world datasets are frequently unevenly distributed between categories, as is the case for S3DIS and SemanticKITTI. A model trained on a skewed dataset is likely to be overwhelmed by samples from the majority categories. To summarize, we argue that active learning should not only select informative and diverse samples to decrease annotation costs, but should also be able to alleviate the imbalance in the labeled subset to improve the model’s accuracy and robustness. In addition, scribble annotation is a popular and effective method that retains as much information as possible and allows relatively high performance compared with fully supervised training [49]. In future work, active learning can be integrated with scribble annotations, i.e., only scribbling the uncertain and diverse data, to further minimize annotation labor.

5 Conclusion

In this paper, we propose a multi-granularity and semisupervised active learning pipeline for point cloud semantic segmentation. We first propose a novel inter-frame selection module based on the NDT registration algorithm to select a representative subset. Then, two key components, the segmented region entropy and the point cloud intensity, are designed to select the most informative and diverse regions to annotate, rather than whole point cloud scans as in traditional approaches. Next, through an efficient pseudolabeling method, our approach further improves cost efficiency. Finally, we conduct extensive experiments and ablation studies with two networks on the SemanticKITTI dataset, where our method achieves SOTA cost efficiency and greatly outperforms existing works.