Abstract
Recent successes in point cloud semantic segmentation heavily rely on a large amount of annotated data. Furthermore, three-dimensional point cloud data are generally sparse and unorganized, and a frame of point cloud usually includes more than 100,000 points, which increases the difficulty of point cloud annotation. To reduce the annotation efforts, we propose a multi-granularity semisupervised active learning pipeline which aims to select representative, uncertain and diverse data to annotate. To better exploit annotating budget, we first leverage the conventional point cloud registration algorithm to develop a matching score function which is used to select a representative subset. And then we change the annotating units from a point cloud scan to segmented regions through two semisupervised methods. Subsequently, in each active selection step, segmented region information is calculated with two terms: softmax entropy and point cloud intensity, and the latter serves to encourage region diversity. Finally, to further reduce annotation effort, semisupervised learning is introduced to our pipeline to automatically select a portion of unlabeled segmented regions with high confidence and assign pseudolabels to them. Extensive experiments show that our approach greatly outperforms previous active learning methods, and we obtain the mean class intersection-over-union performance of 95% fully supervised learning with merely 3% of labeled data on SemanticKITTI dataset.
1 Introduction
In recent years, with the aid of deep learning, autonomous driving has achieved significant breakthroughs in multiple tasks, such as object detection, motion forecasting, and semantic segmentation. Among these, point cloud semantic segmentation (PCSS) is an emerging field that is commonly used to understand the driving scene and has drawn increasing attention. In the past 5 years in particular, numerous novel PCSS methods [18, 35, 47] based on deep learning frameworks have been proposed, and several public PCSS datasets have been released, such as Semantic3D [15], ScanNet [9], and SemanticKITTI [3].
To achieve superior performance of the model, deep learning generally relies on a large amount of annotated data to strengthen the large-scale model. However, the performance of the model is still not saturated with respect to the size of annotated data [54]. Moreover, it costs lots of human labor and time to annotate a large amount of data, and sometimes only relevant professionals can annotate data [4]. More importantly, 3D point cloud data are generally sparse and unorganized, and a point cloud often includes more than 100,000 points [3], which results in difficulties of point cloud annotation. Active learning (AL) is an effective method to solve this problem. The purpose of AL is to select the most informative and representative samples from the unlabeled data to annotate, which greatly reduces the cost of annotation.
Existing AL methods are mostly at the sample level and focus less on dense prediction tasks. Most of the works [36] are proposed for image processing and natural language processing tasks. However, since point cloud is an unorganized and irregular structure, these methods for image cannot be directly applied to it. In addition, compared with images, point cloud typically contains rich geometric information [33] and intensity information. Besides, it is often collected in sequence, which contains temporal information [3]. This information, which is mostly not involved in recent works [26, 43, 52], has the potential to improve the AL model performance.
In this paper, we focus on these characteristics of the point cloud and propose a novel sample selection and annotation pipeline. Specifically, our proposed method takes representativeness, uncertainty, and diversity into consideration and conducts sample selection at two granularities: inter-frame and intra-frame. For inter-frame selection, we consider sample representativeness within the sequence, so as to single out a subset that can represent the distribution of the entire sequence. The coverage areas of adjacent frames usually overlap to varying degrees, so labeling all point clouds is uneconomical and produces a lot of redundancy. Inspired by the point cloud registration algorithm [5], we develop a novel matching score function which is used to evaluate the similarity of two frames within the sequence. Depending on whether the matching score is smaller or larger than a similarity threshold, we determine which of the two frames becomes a member of the representative subset. As shown in Fig. 3, a representative subset selected from a sequence can cover the whole area of the point cloud sequence with fewer samples, reduce the occurrence of overlapping areas, and thus lower the annotation costs.
As for intra-frame selection, not all annotated points within the frame contribute to the model’s improvement [52], that is, redundancy also exists in the intra-frame annotation. Besides, due to the particularity of the dense prediction task, it is laborious to annotate every point in PCSS task. To make the point cloud annotation more efficient and encourage maximizing the segmentation performance, we argue that the unit of point cloud annotation can be changed from the frame to a small portion of segmented regions [52]. Therefore, we make a tradeoff between annotating labor and efficiency to alleviate the expensive point-by-point labeling [43]. Specifically, we propose a novel method to reduce the redundancy of the intra-frame granularity under the guidance of uncertainty estimation and point cloud intensity. In detail, we first segment a point cloud into regions as the fundamental labeled units using two unsupervised algorithms [33, 46]. Next, uncertainty estimation is carried out on such segmented regions. Furthermore, to avoid selecting some typically uncertain segmented regions which exist in several point clouds, we introduce exclusive intensity information of point cloud [19] to complement segmented region information estimation. Finally, the segmented regions with uncertainty and diversity are selected to annotate.
AL aims at minimizing the amount of labeled training data, which matches the natural demand of semisupervised learning [27]. Semisupervised learning utilizes both labeled and unlabeled data to train models and is well suited to real-world tasks that lack labeled data. Pseudolabeling is one of the common methods in semisupervised learning. Its goal is to leverage the model trained on partially labeled data to predict unlabeled data for generating pseudolabels [51]. Data for which the model prediction has high confidence are then assigned pseudolabels. The integration of semisupervised learning and AL has therefore attracted research interest in recent years [45, 50]. However, this integration has rarely been applied to PCSS in recent literature. In this paper, to further reduce the human annotating labor, we propose to automatically select and pseudolabel a portion of the confident unlabeled data. The proposed method aims at searching for the most certain and informative unlabeled data with the guidance of a high-confidence threshold. Specifically, we first leverage the trained model to predict unlabeled data and obtain the prediction confidence. The data with high prediction confidence are then selected and added to the labeled data pool. Finally, the labeled data and pseudolabeled data are exploited to fine-tune the model.
Experimental results show that our method significantly outperforms existing deep active learning approaches on the SemanticKITTI dataset and achieves state-of-the-art performance on the S3DIS dataset. Our proposed method achieves 90% of the performance of fully supervised learning while requiring less than 15% and 3% of the annotations on the S3DIS and SemanticKITTI datasets, respectively. The ablation studies also verify the effectiveness of each component proposed in our method.
In summary, the major contributions of this paper are as follows:
- We propose a new multi-granularity sample selection and annotation AL pipeline for point cloud semantic segmentation.
- We introduce semisupervised learning to automatically select and annotate the high prediction confidence data for effectively reducing annotation costs.
- Experiments on challenging SemanticKITTI dataset show that our approach outperforms existing deep active learning methods in classification accuracy and could highly reduce human annotation labor and computational costs.
2 Related works
2.1 3D semantic segmentation
Recently, 3D PCSS has achieved great progress with the aid of deep learning. The purpose of 3D PCSS is to divide a point cloud into several objects according to the predicted semantic meanings of points. According to the representation of the point cloud data, 3D semantic segmentation methods can be classified into three categories: point-based [18, 35], projection-based [34], and voxel-based [7, 29]. Point-based methods directly process unstructured point clouds and suffer from efficiency bottlenecks. To employ two-dimensional (2D) convolutional neural network (CNN) architectures, projection-based methods convert the 3D point cloud to 2D pseudo-images, at the cost of information loss. Voxel-based methods convert a point cloud into 3D voxels processed by 3D volumetric convolutions. Although this representation retains the 3D geometric information, it requires a very high resolution to avoid losing much information. Overall, these methods heavily rely on fully annotated datasets, and densely annotating point clouds is laborious and time-consuming. To this end, we focus on how to train a model with less annotated data while achieving performance similar to fully supervised training.
2.2 Deep active learning
As a machine learning method, AL has been of research interest for a couple of decades for increasing label efficiency and reducing annotated costs. AL selects the most informative and representative samples from the unlabeled dataset into the labeled pool through the query strategy and then iteratively trains the model until the annotated budget is exhausted or the pre-defined termination conditions are reached. Therefore, the query strategy is becoming extremely important. The main query strategies include the uncertainty-based approach [4, 20, 23, 30], distribution-based approach [2, 14, 31] and expected model change approach [21, 41]. Various methods were proposed to measure the uncertainty of the unlabeled samples through the posterior probability of a predicted class [23], the difference between the first prediction and the second one [20], or the entropy of class posterior probabilities [30]. Some earlier studies [8, 42] also estimated the sample uncertainty referring to a committee of classifiers. The distribution-based approach queries samples by considering the selection of core subsets and chooses the samples which represent the whole dataset, like clustering algorithm [31], Gaussian process [14] and context-aware methods [2]. The expected model change approach primarily chooses the unlabeled samples that can make the largest change on the current model through estimating expected gradient length [41], expected future errors [38], or expected output changes [21].
Deep learning (DL) has achieved unparalleled breakthroughs in various fields, but it is often very greedy for large amounts of labeled data [16]. Therefore, many researchers have high expectations for combining DL and AL, referred to as deep active learning (DAL) [36], because of AL’s capacity to effectively reduce labeling costs. Gal et al. [12] proposed a significant AL framework for high-dimensional data based on Bayesian deep learning, which estimates uncertainty through Monte Carlo (MC) Dropout integration. However, Sener and Savarese [40] pointed out that this method is unsuitable for large datasets because of batch sampling. They then proposed a Core-set approach from the perspective of distribution to construct a core set which is representative of the entire original dataset. They showed that minimizing the core-set loss is equivalent to the k-Center problem, which can be tackled by an efficient approximate solution. Beluch et al. [4] proposed an ensemble-based AL for deriving well-behaved uncertainty estimates for unlabeled data. They also compared it against the Bayesian deep learning approach [12] and the density-based approach [40], and the results show that ensemble-based AL can effectively counteract the class-imbalance problem during acquisition and lead to more calibrated predictive uncertainties. Yoo and Kweon [54] introduced a novel active learning method with a loss prediction module which is learned to predict the target loss of the unlabeled dataset. By considering the difference between a pair of loss predictions, the loss prediction module could discard the scale of the real loss changes. Inspired by semisupervised learning, some researchers [13, 17, 45, 50, 55] have assigned pseudo-labels to high-confidence samples in order to further improve the accuracy and keep the stability of the DAL model because of the majority and consistency.
In addition, some researchers combined generative adversarial networks (GAN) [48], reinforcement learning [28], and transfer learning [10] with AL to achieve various purposes, respectively.
2.3 AL for semantic segmentation
Semantic segmentation has important applications in various fields, like autonomous driving [24], image processing [1], and high-resolution remote sensing [32]. Combining AL with semantic segmentation is also conducive to alleviating the annotation cost. Although many AL approaches for semantic segmentation have been proposed, most of them focus on 2D image segmentation [6, 22, 44, 53]. Recently, a few researchers are applying AL to 3D point cloud segmentation. Lin et al. [26] first combined AL with DL for semantic segmentation of large-scale airborne laser scanning (ALS) point clouds. They proposed a segment-based query function, considering interactions among points within segments, to assess the informativeness of samples. Based on the previous training framework, they introduced incremental learning to save the training time and added mutual information metric to estimate model-dependent uncertainty [25]. Shi et al. [43] proposed a super-point-based [11] AL strategy which could better exploit the limited annotation cost. And they further designed shape-level diversity and local spatial consistency constraint. Observing that only a small portion of annotated regions are sufficient for 3D scene understanding, Wu et al. [52] proposed a region-based and diversity-aware AL. In this paper, from the perspective of uncertainty, representativeness, and diversity, we propose a multi-granularity sample selection and annotation pipeline which combines the unique 3D geometric information of the point cloud and the sequential relationship between frames.
3 Methodology
In this section, we describe our multi-granularity and semisupervised AL pipeline in detail. We first introduce the architecture of our pipeline. Then, the proposed inter-frame selection approach is presented. And then, we introduce the segmented region-based inner frame selection strategy in detail. Furthermore, we illustrate how to compute the confidence of segmented regions to further apply pseudolabels for semisupervised learning task. Next, the details of the network adopted in our work are explained. Finally, we introduce how we leverage the query strategy to select the segmented regions with uncertainty and diversity for annotation and pick out segmented regions with high confidence probability for pseudolabeling.
3.1 Architecture of the proposed pipeline
The purpose of PCSS is to train a model on a dataset so that the model assigns a predicted label to each point, which is a dense prediction task. Therefore, the labor and time cost of sample annotation required in the training of a PCSS model are very high. To improve the efficiency of manual annotation, we first obtain a representative subset \(D_{\mathrm{NDT}}\) from the original point cloud dataset \(D_{\mathrm{orig}}\) through the normal distributions transform (NDT) algorithm. Next, we over-segment the 3D point cloud scans from \(D_{\mathrm{NDT}}\) into supervoxels using the voxel cloud connectivity segmentation (VCCS) [33] algorithm. Subsequently, the locally convex connected patches (LCCP) [46] algorithm is used to obtain the segmented regions from the generated supervoxels. Each segmented region contains several points, so it is convenient and time-saving to annotate such regions. We now have a segmented 3D point cloud dataset D, which can be divided into two subsets: a small labeled subset \(D_{\mathrm{L}}\) containing randomly selected point cloud scans, and a large unlabeled subset \(D_{\mathrm{U}}\).
Our multi-granularity and semisupervised active learning can be divided into 5 steps:
1. Obtaining a representative subset \(D_{\mathrm{NDT}}\) from the original point cloud dataset \(D_{\mathrm{orig}}\) through the NDT algorithm.
2. Generating a segmented 3D point cloud dataset D through VCCS [33] and LCCP [46] algorithms.
3. Training a network on the current labeled subset \(D_{\mathrm{L}}\) for assigning a label to each point.
4. Calculating the information score of segmented regions with two terms, softmax entropy and point cloud intensity, as shown in Fig. 1a, and computing the softmax confidence of segmented regions as shown in Fig. 1c.
5. Selecting \(\textit{Top-K}\) segmented regions for annotators to annotate exclusive labels, and moving them from the unlabeled subset \(D_{\mathrm{U}}\) into the current labeled subset \(D_{\mathrm{L}}\) as shown in Fig. 1b. Meanwhile, picking out \(\textit{Top-M}\) segmented regions with pseudolabels from \(D_{\mathrm{U}}\) and also feeding them into \(D_{\mathrm{L}}\) as shown in Fig. 1d.
3.2 Registration-based inter-frame selection
Generally speaking, a point cloud dataset contains multiple sequences, each of which contains multiple frames. Continuous frames in the same sequence have overlapping areas and include a large number of repeated categories, so we employ a point cloud matching approach to screen out a subset which could represent the sequence from the perspective of building-map.
Considering robustness and efficiency, we choose the NDT algorithm [5] as the point cloud registration method. This is because NDT does not need to establish explicit correspondences between points or features, and all derivatives could be calculated analytically. The NDT transforms the discrete set of 2D points reconstructed from a single point cloud scan into a piecewise continuous and differentiable probability density, which consists of a set of normal distributions and can be used to match another scan through Newton’s algorithm [5]. During the registration of the two point cloud scans through the NDT algorithm, if the registration process converges or reaches the maximum number of iterations, a registration score \({\text{score}}_{\mathrm{match}}\) will be obtained, which is used to construct the matching score function for screening representative point clouds.
where \(x_{i}^{\prime }\), \(\Sigma _{i}^{-1}\) and \(q_{i}\) are defined as follows:
- \(x_{i}^{\prime }\) denotes the point \(x_{i}\) mapped into the coordinate frame of the target scan according to the parameters P of rotation and displacement. \(x_{i}\) is the reconstructed 2D point of laser scan sample i of the input scan in the coordinate frame of the input scan.
- \(\Sigma _{i}\) and \(q_{i}\) represent the covariance matrix and the mean of the normal distribution corresponding to point \(x_{i}^{\prime }\).
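As an illustration of how these quantities interact, the following is a minimal sketch of a per-scan score, assuming the standard NDT form from [5] in which each mapped point contributes \(\exp (-\tfrac{1}{2} d^{T} \Sigma ^{-1} d)\) with \(d = x_{i}^{\prime } - q_{i}\). The function name `ndt_score` and the array layout are hypothetical, and the sign and normalization actually used for \({\text{score}}_{\mathrm{match}}\) in this work are not shown in the text and may differ.

```python
import numpy as np


def ndt_score(points_mapped, means, covariances):
    """Sum of Gaussian terms, one per mapped point (assumed standard NDT form).

    points_mapped: (N, d) points x_i' already mapped into the target frame.
    means:         (N, d) mean q_i of the NDT cell each point falls into.
    covariances:   (N, d, d) covariance Sigma_i of the same cells.
    """
    diff = points_mapped - means                 # x_i' - q_i
    inv_cov = np.linalg.inv(covariances)         # batched Sigma_i^{-1}
    # Quadratic form d^T Sigma^{-1} d, evaluated independently for every point.
    quad = np.einsum("ni,nij,nj->n", diff, inv_cov, diff)
    return float(np.sum(np.exp(-0.5 * quad)))
```

In this form, a point that coincides with the mean of its cell contributes exactly 1, and the contribution decays as the point moves away from the mean relative to the cell covariance.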
In our work, when the registration score \({\text{score}}_{\mathrm{match}}\) of two point cloud scans is less than a threshold \(\delta _{\mathrm{match}}\), we consider that the overlapping area of the two point cloud scans is large, and then discard the input frame and retain the target frame. Conversely, when it is greater than \(\delta _{\mathrm{match}}\), we take the current input frame as the target frame for the next matching. The outline of the proposed inter-frame selection approach, given a point cloud sequence \({\textbf{S}} = \{ s_{1}, s_{2}, \ldots , s_{n}\}\) of n scans and an initial representative subset \({\textbf{S}}^{\prime }=\{s_{1}\}\), is as follows:
1. Take the scan \(s_{1}\) as the target frame and scan \(s_{2}\) as the input frame, and then calculate their matching score \({\text{score}}_{\mathrm{match}}^{1-2}\) through the NDT algorithm. If \({\text{score}}_{\mathrm{match}}^{1-2}\) is less than the threshold \(\delta _{\mathrm{match}}\), there is no need to update subset \({\textbf{S}}^{\prime }\).
2. Next, take the scan \(s_{3}\) as the input frame, and perform the registration between scan \(s_{3}\) and scan \(s_{1}\). If their matching score \({\text{score}}_{\mathrm{match}}^{1-3}\) is larger than the threshold \(\delta _{\mathrm{match}}\), scan \(s_{3}\) will be added to the subset \({\textbf{S}}^{\prime }\) and taken as the target frame at the same time.
3. Repeat the above steps until the point cloud registration of each frame in the sequence is completed.
And then, we can achieve a representative subset \({\textbf{S}}^{\prime } = \{ s_{1}^{\prime }, s_{2}^{\prime }, \ldots , s_{m}^{\prime } \}\) that represents the whole sequence. The process of inter-frame selection is illustrated in detail in Algorithm 1.
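The selection procedure outlined above can be sketched as a single pass over the sequence. This is a minimal illustration rather than the authors' Algorithm 1: `match_score` is a hypothetical callable standing in for the NDT registration score, and the handling of a score exactly equal to the threshold (treated here as "discard") is an assumption, since the text does not specify it.

```python
from typing import Callable, List, Sequence, TypeVar

Scan = TypeVar("Scan")


def select_representative(
    scans: Sequence[Scan],
    match_score: Callable[[Scan, Scan], float],
    delta_match: float,
) -> List[int]:
    """Return indices of scans kept in the representative subset S'."""
    if not scans:
        return []
    selected = [0]          # S' is initialised with the first scan s_1
    target = 0              # index of the current target frame
    for i in range(1, len(scans)):
        score = match_score(scans[target], scans[i])
        # Score above the threshold: the input frame is kept and becomes
        # the new target. Otherwise it is discarded (assumed for equality too).
        if score > delta_match:
            selected.append(i)
            target = i
    return selected


# Toy usage: scans are 1D positions and the "score" is their distance.
kept = select_representative([0, 1, 3, 4, 7], lambda a, b: abs(a - b), 2)
print(kept)  # [0, 2, 4]
```

Because every input frame is compared only with the most recently retained frame, the pass needs one registration per scan, and a larger threshold yields a smaller subset.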
It is obvious that the number of point clouds selected from the same sequence will be different with different thresholds. Taking the sequence 07 (with 1101 point cloud scans) in SemanticKITTI dataset as an example, the number of point clouds selected by setting different thresholds is shown in Fig. 2. For example, when the threshold is \(\delta _{\mathrm{match}} = 0.2\), a representative subset \({\textbf{S}}^{\prime }\) (with 330 point cloud scans) is selected from the sequence 07. Then, the selected point cloud scans are used to build the map, as shown in Fig. 3. The results show that the subset selected by the NDT matching algorithm can represent all the elements in the scene completely.
3.3 Segmented region-based inner frame selection
The labeling cost varies greatly depending on target tasks. In the annotation process, it is relatively cheap to select closed polygons to form a semantic annotation for a 2D image, but 3D point-wise data require expensive point-by-point labeling [43, 54]. However, not all annotated points within the frame contribute to the model’s improvement [52]. Besides, when annotating the same number of points, if the selected points are scattered in the whole frame, although the model performance may be very good, the difficulty and time consumption of annotation will be greatly increased, and it is hard to exploit the limited budget.
To alleviate the time and labor of manual point-by-point labeling, we first leverage VCCS [33] and LCCP [46] algorithms to segment a point cloud scan into segmented regions which can be taken as the fundamental label querying units. Then, in each active selection step, we calculate segmented regions information with softmax entropy and point cloud intensity.
3.3.1 Segmented regions generation
Geometrically constrained supervoxels All points in a point cloud scan are required to be annotated in the supervised task or conventional AL, which is labor-intensive. If we can divide a point cloud scan into connective segmented regions as the basic unit of annotation, it will greatly improve the efficiency of annotation. So, we first employ VCCS [33] algorithm to deal with the original point cloud scan for generating geometrically constrained supervoxels. The VCCS algorithm is composed of 4 parts: (1) construct the adjacency graph for the voxel-cloud to ensure these supervoxels connection in space; (2) select a number of seed points to initialize the supervoxels; (3) calculate the normalized distance \(d_{\mathrm{norm}}\) with three distances: spatial distance \(d_{\mathrm{s}}\), color distance \(d_{\mathrm{c}}\) and distance \(d_{\mathrm{f}}\) in fast point feature histograms (FPFH) space [39]; and (4) use a flow-constrained local iterative clustering for generating geometrically constrained supervoxels as shown in Fig. 4.
Point cloud partitioning These geometrically constrained supervoxels gained in the last step are not isolated; they can be further merged into larger segmented regions. So next we leverage LCCP [46] algorithm to segment the supervoxel adjacency graph by classifying whether the connection relation between two supervoxels is convex or concave through two criteria: extended convexity criterion (CC) and sanity criterion (SC). Finally, these small supervoxels can merge into larger segmented regions as shown in Fig. 5 through a region-growing process according to the discriminant results.
3.3.2 Calculating segmented regions information
In each AL selection step, the trained network predicts the probability \(p(y_{i}=j \vert x_{i})\) of each point \(x_{i}\) belonging to the \(j_{\mathrm{th}}\) category. Then, we calculate the information of a segmented region from two aspects: (1) softmax entropy based on the probability; (2) point cloud intensity, which is introduced in detail as follows.
Segmented region entropy As a widely concerned aspect in AL, uncertainty sampling aims to select the most uncertain samples to annotate from unlabeled subset \(D_{\mathrm{U}}\). In this paper, we use softmax entropy to measure the uncertainty of a segmented region. We first obtain the softmax probability \(p(y_{i}=j \vert x_{i})\) of each point \(x_{i}\) belonging to the \(j_{\mathrm{th}}\) category in the unlabeled subset \(D_{\mathrm{U}}\). Then, we calculate the region entropy \(E_{n}\) for the \(n_{\mathrm{th}}\) segmented region \(R_{n}\) through averaging the entropy of points within unlabeled region \(R_{n}\) as shown in Eq. 2,
\[E_{n} = -\frac{1}{N}\sum _{i=1}^{N}\sum _{j} p(y_{i}=j \vert x_{i};\Theta )\log p(y_{i}=j \vert x_{i};\Theta ),\]
where \(R_{n}\) contains N points and \(\Theta\) denotes the network parameters. If the trained network is quite confident about a predicted category, it assigns that category a much higher probability than the other categories, and the entropy \(E_{n}\) is low. Conversely, a higher entropy value is obtained when the trained network is not confident about its prediction.
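The region entropy described here can be computed for all regions of a scan at once. The sketch below assumes per-point softmax outputs and an integer region index per point (for example, produced by the VCCS and LCCP steps); the function name `region_entropy` and the small constant `eps` guarding the logarithm are illustrative choices, not part of the original method.

```python
import numpy as np


def region_entropy(probs, region_ids, eps=1e-12):
    """Mean per-point softmax entropy of each segmented region.

    probs:      (P, C) softmax probabilities for P points and C classes.
    region_ids: (P,) integer region index of each point, in 0..R-1.
    Returns an (R,) array with one entropy value E_n per region.
    """
    # Entropy of every point; eps avoids log(0) for hard predictions.
    point_entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    n_regions = int(region_ids.max()) + 1
    sums = np.bincount(region_ids, weights=point_entropy, minlength=n_regions)
    counts = np.bincount(region_ids, minlength=n_regions)
    # Average over the N points of each region; empty regions yield 0.
    return sums / np.maximum(counts, 1)
```

A region whose points are all predicted with probability 1 for a single class gets an entropy close to 0, while a two-class region with uniform predictions gets \(\log 2\).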
Point cloud intensity After obtaining the entropy \(E_{n}\) of each segmented region, the most obvious way is to select the top-ranked regions for annotation. However, segmented regions with higher entropy \(E_{n}\) may result in redundant annotation effort if they appear in the same querying step. To provide more diverse information to the network, we can leverage the intensity of each point in a point cloud scan, because intensity differs from material to material: the intensities of pulses reflected from the same material are similar, while those reflected from different materials differ [19]. Based on this, we pick the intensity as a diversity-aware selection criterion to select diverse segmented regions for the network. We compute the region intensity score \(I_{n}\) for the \(n_{\mathrm{th}}\) segmented region \(R_{n}\) by averaging the intensity of points within unlabeled region \(R_{n}\) as shown in Eq. 3,
\[I_{n} = \frac{1}{N}\sum _{i=1}^{N}\rho _{i},\]
where \(\rho _{i}\) is the intensity of point \(x_{i}\).
After calculating the softmax entropy \(E_{n}\) and intensity \(I_{n}\) of each segmented region, we can combine them linearly to form the information score \(\sigma _{n}\) of the \(n_{\mathrm{th}}\) segmented region \(R_{n}\) as shown in Eq. 4.
Finally, sorting the information scores of all segmented regions in decreasing order yields a sorted information list \(\sigma\).
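The scoring and ranking step can be sketched as follows. The weights `alpha` and `beta` of the linear combination are hypothetical placeholders, since the coefficients are not given in this text, and the helper names `region_mean` and `top_k_regions` are illustrative.

```python
import numpy as np


def region_mean(values, region_ids, n_regions):
    """Average a per-point quantity (e.g. intensity) over each region."""
    sums = np.bincount(region_ids, weights=values, minlength=n_regions)
    counts = np.bincount(region_ids, minlength=n_regions)
    return sums / np.maximum(counts, 1)


def top_k_regions(entropy, intensity, k, alpha=1.0, beta=1.0):
    """Rank regions by a linear information score and return the best k.

    alpha and beta are assumed weights; the original values are not stated.
    Returns (indices of the k highest-scoring regions, all scores).
    """
    sigma = alpha * np.asarray(entropy) + beta * np.asarray(intensity)
    order = np.argsort(-sigma, kind="stable")   # decreasing score
    return order[:k], sigma


# Toy usage with three regions.
idx, sigma = top_k_regions([0.1, 0.9, 0.5], [0.2, 0.1, 0.7], k=2)
print(idx.tolist())  # [2, 1]
```

In practice the two terms may need to be brought to a comparable scale before they are summed, since raw intensity and entropy have different ranges; how this is handled in the original pipeline is not described here.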
3.4 Segmented region confidence estimation
In our work, at each AL iterative process, the most informative unlabeled segmented regions are selected for annotating, and the network is retrained with added labeled dataset. In this way, the redundant annotation of noninformative regions is avoided, greatly reducing human annotation labor. Actually, the subset \(D_{\mathrm{U}}\) also contains an adequate amount of ignored unlabeled data with high confidence. After the network is trained with the initial labeled subset \(D_{\mathrm{L}}\), we can use its predictive capability to generate relatively accurate pseudolabels for unlabeled segmented regions in subset \(D_{\mathrm{U}}\).
We select the segmented regions with high confidence from subset \(D_{\mathrm{U}}\), i.e., those whose predicted probability difference \(S_{\mathrm{mar}}\) between the two most likely class labels is larger than a threshold \(\delta _{H}\). The pseudolabel \(y_{c}^{\mathrm{pseudo}}\) is defined as:
where the threshold \(\delta _{H}\) is set to a large value to obtain highly confident pseudolabels. \(S_{\mathrm{mar}}\) is formulated as follows:
\[S_{\mathrm{mar}} = S_{\mathrm{conf}}^{c_{1}} - S_{\mathrm{conf}}^{c_{2}},\]
where \(S_{\mathrm{conf}}^{c_{1}}\), \(S_{\mathrm{conf}}^{c_{2}}\) represent the classification scores of the highest and second highest predicted class labels for a segmented region, respectively. As shown in Eq. 8, given a segmented region R with N points, we calculate the confidence of the predicted class label for all points and obtain the classification scores \(S_{\mathrm{conf}}^{c_{1}}\) and \(S_{\mathrm{conf}}^{c_{2}}\) for a segmented region by averaging the predicted probabilities of all points in the segmented region.
\[S_{\mathrm{conf}}^{c} = \frac{1}{N}\sum _{i=1}^{N} p(y_{i}=c \vert x_{i};\Theta ).\]
Through the probability difference \(S_{\mathrm{mar}}\), we can avoid selecting noisy segmented regions to assign pseudolabels.
For the segmented regions which meet the pseudolabeling condition, we arrange them in descending order according to their probability difference \(S_{\mathrm{mar}}\) to obtain a descending list \(\varphi _{S}\).
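The confidence estimation and ranking can be sketched as below, assuming that high confidence corresponds to a large margin, so that a region is a pseudolabel candidate when \(S_{\mathrm{mar}}\) exceeds \(\delta _{H}\). The function name `select_pseudolabels` and its return layout are illustrative, and taking the class with the highest region score as the pseudolabel is an assumption.

```python
import numpy as np


def select_pseudolabels(probs, region_ids, delta_h, m):
    """Pick up to m confident regions and their pseudolabels.

    probs:      (P, C) per-point softmax probabilities, C >= 2.
    region_ids: (P,) integer region index of each point, in 0..R-1.
    delta_h:    margin threshold; only regions with S_mar > delta_h qualify.
    Returns (chosen region indices, their pseudolabels, all margins).
    """
    n_regions = int(region_ids.max()) + 1
    sums = np.zeros((n_regions, probs.shape[1]))
    np.add.at(sums, region_ids, probs)            # accumulate per region
    counts = np.bincount(region_ids, minlength=n_regions)[:, None]
    conf = sums / np.maximum(counts, 1)           # region score per class
    top2 = np.sort(conf, axis=1)[:, -2:]          # [second highest, highest]
    margin = top2[:, 1] - top2[:, 0]              # S_mar
    labels = conf.argmax(axis=1)                  # assumed pseudolabel rule
    candidates = np.flatnonzero(margin > delta_h)
    ranked = candidates[np.argsort(-margin[candidates], kind="stable")]
    chosen = ranked[:m]                           # Top-M of the sorted list
    return chosen, labels[chosen], margin
```

Regions whose two best class scores are close never enter the candidate list, which is how the margin keeps ambiguous (noisy) regions from receiving pseudolabels.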
3.5 PCSS network
The PCSS network is a crucial component in our pipeline for 3D deep learning. Currently, many point-based [35] and voxel-based [37] networks are proposed to process 3D data. However, most of these methods suffer from high memory consumption and computational costs. To better demonstrate the effectiveness of the proposed AL pipeline, we pick MinkowskiNet [7] based on sparse convolution and SPVCNN [29] based on point-voxel CNN as the PCSS networks in this paper.
MinkowskiNet is proposed for spatio-temporal perception which can directly process 3D point cloud scans using high-dimensional convolutions. To achieve this, it adopts sparse tensors and convolutions for three reasons:
1. The sparse tensor can better express and generalize high-dimensional spaces.
2. The sparse convolution is similar to the standard convolution and can therefore leverage all architectural innovations such as residual connections and batch normalization.
3. The sparse convolution is efficient and fast because it only computes outputs for predefined coordinates and saves them into a compact sparse tensor.
To implement efficient and generalized sparse convolution, it proposes an open-source library which includes sparse tensor quantization, generalized sparse convolution, max pooling, and so on. Furthermore, MinkowskiNet leverages a hybrid kernel (cross-shaped kernel and cubic kernel) to resolve the problem of computational cost and the number of parameters in a network caused by increasing dimensions.
SPVCNN is composed of a fine-grained point-based branch that keeps the 3D data in high resolution without large memory footprint, and a coarse-grained voxel-based branch which aggregates the neighboring features without random memory accesses [29]. And for large outdoor scenes [3], it further proposes sparse point-voxel convolution (SPVConv) that enhances PVConv with the sparse convolution to enable higher resolutions in the voxel-based branch.
3.6 Annotating labels for segmented regions
On the one hand, according to the final decreasing order \(\sigma\), we select the \(\textit{Top-K}\) segmented regions for annotators to label. In our experiments, we use the ground truth of each selected region as its label instead of labels from human annotators. Then, these labeled segmented regions \(D_{\mathrm{label}}\) are moved from the unlabeled subset \(D_{\mathrm{U}}\) to the labeled subset \(D_{\mathrm{L}}\). Note that only a small portion of a point cloud scan is added to the subset \(D_{\mathrm{L}}\) in each active selection, as shown in Fig. 1b, because we take the segmented region as the basic labeling unit.
On the other hand, after getting the final descending list \(\varphi _{S}\), we select the \(\textit{Top-M}\) segmented regions to assign pseudolabels. Then, these pseudolabeled regions \(D_{\mathrm{pseudo}}\) are moved from the unlabeled subset \(D_{\mathrm{U}}\) to the labeled subset \(D_{\mathrm{L}}\). After completing the segmented region information estimation, label annotation, region confidence estimation and pseudolabeling, we repeat the AL loop, fine-tuning the PCSS network on the updated subset \(D_{\mathrm{L}}\) until the annotation budget is exhausted or the maximum number of iterations is reached. Note that after each fine-tuning step, we put the high-confidence samples \(D_{\mathrm{pseudo}}\) back into \(D_{\mathrm{U}}\) and erase their pseudolabels.
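The bookkeeping of this loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `score_fn` and `train_fn` are hypothetical callbacks standing in for the region scoring and the network fine-tuning, regions are represented by plain identifiers, and the initial training on \(D_{\mathrm{L}}\) is assumed to have happened before the loop starts.

```python
def select_regions(info, margin, top_k, top_m):
    """Pick Top-K regions to annotate and Top-M regions to pseudolabel."""
    # sigma: regions in decreasing order of their information score.
    sigma = sorted(info, key=info.get, reverse=True)
    to_label = sigma[:top_k]
    picked = set(to_label)
    # phi_S: eligible regions in decreasing order of probability difference.
    phi = sorted((r for r in margin if r not in picked),
                 key=margin.get, reverse=True)
    return to_label, phi[:top_m]


def run_active_learning(regions, initial_labeled, score_fn, train_fn,
                        rounds, top_k, top_m):
    """Run `rounds` selection steps and return the final (D_L, D_U) sets."""
    labeled = set(initial_labeled)
    unlabeled = set(regions) - labeled
    for _ in range(rounds):
        # score_fn returns two dicts: information scores for all unlabeled
        # regions, and probability differences for the regions that meet
        # the pseudolabeling condition.
        info, margin = score_fn(unlabeled)
        to_label, to_pseudo = select_regions(info, margin, top_k, top_m)
        labeled |= set(to_label)
        unlabeled -= set(to_label)
        # Pseudolabeled regions take part in this fine-tuning step only.
        train_fn(labeled, set(to_pseudo))
        # They were never removed from `unlabeled`, which corresponds to
        # putting them back into D_U and erasing their pseudolabels.
    return labeled, unlabeled
```

Keeping the pseudolabeled regions out of the permanent labeled set means that a wrong pseudolabel can only influence one fine-tuning step before the region is scored again.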
4 Experiments
In this section, we first introduce our experimental settings, including two datasets, the initial portion of all labeled point cloud scans, maximum iteration, and annotation budget. Then, we compare our approach with other existing methods to demonstrate the effectiveness of our method. Next, to verify the contribution of each individual strategy, we conduct ablation experiments. Finally, based on the experimental results, we present the limitations of our method and the directions for future work.
4.1 Experimental settings
4.1.1 Datasets
We evaluate our approach and compare it with other AL methods on two large-scale, challenging datasets: S3DIS and SemanticKITTI. S3DIS is a commonly used indoor semantic segmentation dataset divided into 6 large areas with a total of 271 rooms. We take Area5 as the validation set and perform active learning training on the remaining areas. SemanticKITTI [3] is a representative outdoor dataset released in 2019 for autonomous driving. It consists of 22 sequences with a total of 43,552 point cloud scans. Sequences 00 to 10 form the training set, except for sequence 08, which is used as the validation set, and the remaining sequences form the test set. The total number of training points is \({{\text{total}}_{\mathrm{number}} = 2{,}349{,}559{,}532}\).
4.1.2 Segmented region generation
We employ the VCCS [33] algorithm to over-segment a 3D point cloud scan into supervoxels with given voxel resolution \(R_{\mathrm{voxel}}\) and seed resolution \(R_{\mathrm{seed}}\). Considering the density difference between indoor and outdoor point clouds, we set \(R_{\mathrm{voxel}}\) and \(R_{\mathrm{seed}}\) to small values (\(R_{\mathrm{voxel}} = 0.05\), \(R_{\mathrm{seed}} = 0.5\)) for the S3DIS dataset, and to large values (\(R_{\mathrm{voxel}} = 0.15\), \(R_{\mathrm{seed}} = 3.5\)) for the SemanticKITTI dataset. Here, \(R_{\mathrm{voxel}}\) is the voxel resolution used for the segmentation, and \(R_{\mathrm{seed}}\) denotes the distance between supervoxels. After that, flow-constrained local iterative clustering is used to generate geometrically constrained supervoxels based on spatial connection. Next, we utilize the LCCP algorithm to cluster these supervoxels into larger segmented regions through the CC criterion with \(\beta _{\mathrm{Tresh}} = 10^{\circ }\) and the SC criterion with \(\alpha _{\mathrm{smooth}} = 0.1\). The \(\beta _{\mathrm{Tresh}}\) denotes the concavity tolerance angle, and \(\alpha _{\mathrm{smooth}}\) is utilized to calculate the smoothness constraint.
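To make the role of the concavity tolerance angle concrete, the sketch below tests whether the connection between two adjacent supervoxels counts as convex. It follows a simplified reading of the basic convexity test in the LCCP literature and is not the library implementation used in the experiments; the function name `is_convex_connection` is hypothetical, the SC criterion is omitted, and centroids and normals are assumed to be given.

```python
import numpy as np


def is_convex_connection(x1, n1, x2, n2, beta_thresh_deg=10.0):
    """Return True if two adjacent supervoxels are convexly connected.

    x1, x2: centroids of the two supervoxels
    n1, n2: their surface normals (normalized inside the function)
    """
    x1, n1, x2, n2 = (np.asarray(v, dtype=float) for v in (x1, n1, x2, n2))
    d = x1 - x2
    d_hat = d / np.linalg.norm(d)
    n1 = n1 / np.linalg.norm(n1)
    n2 = n2 / np.linalg.norm(n2)
    # Convex if the normals open away from each other along the
    # line that joins the two centroids.
    if np.dot(n1 - n2, d_hat) > 0.0:
        return True
    # Otherwise tolerate slightly concave connections whose normals
    # differ by less than the concavity tolerance angle.
    cos_beta = np.clip(np.dot(n1, n2), -1.0, 1.0)
    beta_deg = np.degrees(np.arccos(cos_beta))
    return bool(beta_deg < beta_thresh_deg)
```

With the \(10^{\circ }\) tolerance, nearly flat surfaces affected by noise are still merged, whereas a clear valley between two surfaces separates them into different regions.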
4.1.3 Annotation budget
In each active label acquisition step, because the number of points in different segmented regions varies, we set the annotation budget as a fixed portion of the total training points instead of a fixed number of segmented regions, for a fair comparison with other methods. The number of pseudolabel acquisitions is also set as a fixed portion of the total points.
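A point-based budget of this kind can be sketched as below. The function name `select_within_budget` and the rule of stopping at the first region that would exceed the budget are assumptions made for illustration; the text above only specifies that the budget is a fixed portion of the total training points.

```python
def select_within_budget(ranked_regions, region_sizes, total_points, portion):
    """Take regions in ranked order until the point budget is used up.

    ranked_regions: region ids, most informative first
    region_sizes:   dict mapping region id -> number of points
    total_points:   number of training points in the whole dataset
    portion:        fraction of total_points that may be annotated
    """
    budget = int(total_points * portion)
    chosen, used = [], 0
    for region in ranked_regions:
        size = region_sizes[region]
        # Assumption: stop as soon as the next region does not fit.
        if used + size > budget:
            break
        chosen.append(region)
        used += size
    return chosen, used
```

For SemanticKITTI, a 1% portion of the 2,349,559,532 training points corresponds to roughly 23.5 million annotated points per acquisition step, however many regions that turns out to be.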
4.1.4 Active learning settings
At the beginning of each experiment, we first randomly select a small portion \(x_{\mathrm{init}}\%\) of fully labeled point clouds as the initially labeled subset \(D_{\mathrm{L}}\) and treat the rest as the unlabeled subset \(D_{\mathrm{U}}\). Then, we perform K rounds of the following steps: (1) train the PCSS network on subset \(D_{\mathrm{L}}\); (2) select a portion \(x_{\mathrm{label}}\%\) of total training points from subset \(D_{\mathrm{U}}\) for annotation according to different AL querying methods; (3) if pseudolabels are adopted, select a portion \(x_{\mathrm{pseudo}}\%\) of total training points for assigning pseudolabels at \(\delta _{H}=0.9\); (4) move the newly annotated points into subset \(D_{\mathrm{L}}\) and fine-tune the network. To ensure the reliability of the experimental results, each experiment is conducted three times and the results are averaged.
Specifically, we set \(x_{\mathrm{init}}=3\%\), \(K=7\) and \(x_{\mathrm{label}}=2\%\) for S3DIS dataset, and \(x_{\mathrm{init}}=1\%\), \(K=5\) and \(x_{\mathrm{label}}=1\%\) for SemanticKITTI dataset [52].
4.1.5 Network training
For both S3DIS and SemanticKITTI datasets, the networks are trained with Adam optimizer (initial learning rate = 0.001) and cross-entropy loss [52]. And the voxel resolution of both datasets is set to 5 cm.
On the S3DIS dataset, we train the networks on 3 TITAN RTX GPUs with a batch size of nine. In the training, we first train both networks for 200 epochs on 3% of the fully labeled point cloud scans and then fine-tune the two networks for 150 epochs after adding 2% active annotated data into subset \(D_{\mathrm{L}}\) each time. Since the point clouds in the S3DIS dataset do not include intensity information, we set \(\alpha = 1, \beta = 0\) in Eq. 4 for the dataset.
On the SemanticKITTI dataset, we train both networks on 4 GTX 1080Ti GPUs and set the batch size to 8. In the training, we initially train both networks for 100 epochs on 1% of the fully labeled point cloud scans and then fine-tune the two networks for 30 epochs after adding 1% active annotated data into subset \(D_{\mathrm{L}}\) each time. Referring to [52], the weight of softmax entropy in Eq. 4 is set as \(\alpha = 1\). Based on the experimental results, we set \({\beta =0.05.}\)
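The effect of the two weights can be illustrated with the sketch below. It assumes that Eq. 4 is a weighted sum of a per-region softmax entropy term and a per-region intensity term; averaging the point entropies inside a region and the name `region_information` are assumptions of this example, and the intensity term is taken as a precomputed input rather than derived here.

```python
import numpy as np


def region_information(probs, region_ids, intensity_term, alpha=1.0, beta=0.05):
    """Score each region as alpha * mean entropy + beta * intensity term.

    probs:          (N, C) softmax outputs for N points
    region_ids:     (N,) integer region index of every point, in [0, R)
    intensity_term: (R,) precomputed intensity value per region
    """
    probs = np.asarray(probs, dtype=float)
    region_ids = np.asarray(region_ids, dtype=np.int64)
    intensity_term = np.asarray(intensity_term, dtype=float)
    n_regions = len(intensity_term)
    # Softmax entropy of every point (small epsilon avoids log(0)).
    point_entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # Mean entropy of the points belonging to each region.
    sums = np.bincount(region_ids, weights=point_entropy, minlength=n_regions)
    counts = np.bincount(region_ids, minlength=n_regions)
    mean_entropy = sums / np.maximum(counts, 1)
    return alpha * mean_entropy + beta * intensity_term
```

Setting `beta=0` reduces the score to the entropy term alone, which matches the S3DIS configuration where no intensity information is available.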
4.2 Comparison with other methods
We compare our approach with 7 other AL methods, including random point cloud scan selection (RAND), uncertainty-based methods, such as softmax confidence (CONF [50]), softmax margin (MARG [50]), softmax entropy (ENT [50]) and segmented-entropy (SEG-ENT [13]), and diversity-based methods, such as the core-set approach (CoSET [40]) and ReDAL [52], which is a recent region-based and diversity-aware AL approach.
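For readers less familiar with the uncertainty-based baselines, the three softmax measures can be computed per point as in the sketch below. These are the textbook definitions; how the cited works aggregate them over a scan or region is not specified here, and the function name `uncertainty_measures` is hypothetical.

```python
import numpy as np


def uncertainty_measures(probs):
    """Return per-point softmax confidence, margin and entropy.

    probs: (N, C) softmax outputs with C >= 2.
    Low confidence, low margin and high entropy all indicate uncertainty.
    """
    probs = np.asarray(probs, dtype=float)
    # Class probabilities sorted from largest to smallest per point.
    ordered = np.sort(probs, axis=1)[:, ::-1]
    confidence = ordered[:, 0]
    margin = ordered[:, 0] - ordered[:, 1]
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return confidence, margin, entropy
```

A point with probabilities (0.5, 0.5) has confidence 0.5, margin 0 and the maximum two-class entropy, so all three measures rank it as more uncertain than a point with (0.9, 0.1).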
4.2.1 Inter-frame selection
The inter-frame selection algorithm proposed in this research cannot be employed to reduce the inter-frame redundancy of the S3DIS dataset since the point clouds in the S3DIS dataset are not collected in chronological sequence. As a result, we only conduct inter-frame selection comparison experiments on the SemanticKITTI dataset. To fairly verify the effectiveness of the inter-frame selection method based on the NDT registration algorithm, we adopt the random selection method as the active query method. The experimental results are shown in Table 1. RAND and \({\text{RAND}}_{\mathrm{NDT}}\) indicate that point cloud scans are randomly selected from the original unlabeled dataset \(D_{\mathrm{orig}}\) and the unlabeled dataset \(D_{\mathrm{NDT}}\) for annotation, respectively. Note that the dataset \(D_{\mathrm{orig}}\) contains 19,130 training point cloud scans. After the NDT matching with the threshold \(\delta _{\mathrm{match}} = 0.1\), the dataset \(D_{\mathrm{NDT}}\) contains 9335 point cloud scans. For the SPVCNN network, our inter-frame selection method can achieve \(90\%\) of the performance of fully supervised methods (\({\text{mIoU}}_{\mathrm{supvis}}^{\mathrm{SPVCNN}}=63.52\%\)) with merely 5% of annotated data. With the MinkowskiNet network, our method is also better than RAND. Although the training data for active queries is reduced, our method allows the model to be trained on more diverse and informative labeled data.
4.2.2 Intra-frame selection
The visualization of SemanticKITTI on sequence 08 validation subset with SPVCNN network is shown in Fig. 6. And the experimental comparison results on the SemanticKITTI dataset are shown in Figs. 7 and 8 where the x-axis represents the percentage of annotated points, and the y-axis means the mIoU obtained by the network. Under both networks, our proposed multi-granularity and semisupervised AL pipeline consistently outperforms the previous methods over the PCSS task. We find that our method outperforms any other AL methods on two experiments with initial \({x_{\mathrm{init}}=1\%}\) labeled data. It verifies the effectiveness of the inter-frame selection method based on the NDT registration algorithm again.
As for the SPVCNN, in Table 2, we observe that our AL method can achieve 90% performance of the result of fully supervised methods with merely 3% of annotated data, and it reaches 97.95% fully supervised performance with 5% of annotated points. Particularly, it, respectively, outperforms the recent state-of-the-art (SOTA) method ReDAL [52] by 6.6%, 7.4%, 8.4%, and 6.9% when using 2%, 3%, 4%, and 5% labeled points. With the network of MinkowskiNet, in Table 3, our AL method can achieve 90% performance of the result of fully supervised methods \(({\text{mIoU}}_{\mathrm{supvis}}^{\mathrm{MinkuNet}}=61.4\%)\) with merely 2% of annotated data, and it can even reach 99.48% fully supervised performance with only 4% of annotated points.
On the S3DIS dataset, as shown in Figs. 9 and 10, our method highly outperforms any other AL methods except for ReDAL. As shown in Tables 4 and 5, the performance of mIoU we obtained is very close to those obtained by ReDAL. The main reason for this is that the point clouds in the S3DIS dataset do not include diverse intensity information. Therefore, we cannot leverage the intensity information of the point cloud to reduce its intra-frame redundancy which results in both networks being trained on the redundant annotated dataset. Furthermore, this result also demonstrates that our method achieves SOTA performance by leveraging segmented region entropy and pseudolabels.
4.3 Ablation studies
We verify the effectiveness of segmented region, point cloud intensity, pseudolabels and NDT in our proposed pipeline on SemanticKITTI dataset with 5% of annotated points for fair comparison.
The results are shown in Table 6 and Fig. 11, where ENT and \({\text{ENT}}_{\mathrm{reg}}\) represent querying the annotated points by calculating the softmax entropy of a point cloud scan and the segmented region entropy, respectively. Inten, Pseu and NDT, respectively, denote selecting the segmented regions using point cloud intensity, training the network with pseudolabels, and selecting segmented regions from a representative dataset screened out by the NDT algorithm.
In Table 6, we can observe that changing the annotating units from a point cloud scan to segmented regions contributes most to the improvement with about 6.15% mIoU. Furthermore, with the aid of Inten, Pseu and NDT, the mIoU performance of segmented region entropy yields an improvement of 1.90%, 2.84% and 2.49%, respectively. From the comparison of combination (\({\text{ENT}}_{\mathrm{reg}}\) + Inten) and combination (\({\text{ENT}}_{\mathrm{reg}}\) + Inten + Pseu), we find that pseudolabels play a key role in the performance of the trained network.
From Fig. 11, we observe that the performance of “\({\text{ENT}}_{\mathrm{reg}}\)” is similar to “\({\text{ENT}}_{\mathrm{reg}}\) + Inten.” The reason is that without the diverse intensity information, the selected segmented regions still contain redundant regions. The result also validates the feasibility of selecting point cloud intensity information as the diversity indicator. Although the final performance of group (\({\text{ENT}}_{\mathrm{reg}}\) + Inten + Pseu) and group (\({\text{ENT}}_{\mathrm{reg}}\) + Inten + Pseu + NDT) is very close, the training data for the latter are reduced from 19,130 scans to 9335 scans after inter-frame selection. This result shows that the inter-frame selection method effectively reduces inter-frame redundancy, and it enables the model to be trained on a more representative dataset. Although the quantity of point clouds available for model training is reduced by 51.20%, the model performance is not compromised by the reduction in the training dataset. Moreover, less training data means less training time and storage consumption. The result also validates the importance of our inter-frame selection strategy.
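As a quick check of the reduction quoted above, the percentage follows directly from the two scan counts:

\[
\frac{19{,}130 - 9335}{19{,}130} = \frac{9795}{19{,}130} \approx 0.5120 = 51.20\%.
\]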
The group (\({\text{ENT}}_{\mathrm{reg}}\) + Pseu) outperforms the group (\({\text{ENT}}_{\mathrm{reg}}\) + NDT) by only 0.34%, and the performance of group (\({\text{ENT}}_{\mathrm{reg}}\) + Inten + NDT) is weaker than that of the group (\({\text{ENT}}_{\mathrm{reg}}\) + Inten + Pseu). It can be seen that the Pseu approach feeds the model with supplementary pseudolabeled training data, which can improve the model performance. The NDT method, on the other hand, enables the model to be trained on less redundant and more informative data. Although it can improve the model performance, the NDT method is a coarse-grained selection method which filters out redundant information frame by frame. This may result in the removal of data that is necessary for enhancing the model performance. In summary, there are two ways to improve model performance: either by feeding the model with a large amount of trainable data, including pseudolabeled data, or by providing data that is diverse and representative.
4.4 Discussion
4.4.1 Per-class IoU results
A comparison of the performance of our method with the fully supervised one is shown in Table 7. For the SPVCNN network, our method is on par with full supervision (Full) on most categories, and even better on the building category. Although the performance on the three categories of other-vehicle, parking, and terrain is weaker than the fully supervised one, our method achieves 91%, 86%, and 93% of the fully supervised result, respectively. The main reason is that the inter-frame filtering method is a coarse-grained method, which may filter out some useful information. Another possible reason is the imbalanced class distribution in the SemanticKITTI dataset. As for the MinkowskiNet network, our method outperforms the fully supervised result for some small objects, such as motorcycle, person and bicyclist.
4.4.2 Performance change
To investigate the relationship between segmentation performance and the proportion of annotated data, we expand the annotated data proportion to 10% and conduct experiments on the SemanticKITTI dataset, as shown in Fig. 12. The results show that our method achieves 99.15% performance of full supervision result with 10% of the annotated data. It can be seen that the model performance slowly improves from 62.11 to 62.98% as the annotated data increase from 5 to 10%. The main reason for the slow performance improvement is that as the proportion of annotated data increases, the proportion of the new annotated data that is valid for the model decreases. Another possible reason is that the diversity filtering criteria proposed in this paper, when designing the active query function, only utilizes one piece of information, the point cloud intensity, which makes it difficult to feed the model with more diverse data.
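For reference, the relative figure can be reproduced from the reported mIoU values, assuming that the experiment in Fig. 12 uses SPVCNN so that the fully supervised reference is the \(63.52\%\) mIoU given in Sect. 4.2.1:

\[
\frac{62.98}{63.52} \approx 0.9915 = 99.15\%, \qquad 62.98 - 62.11 = 0.87 \ \text{mIoU points}.
\]

In other words, doubling the annotated data from 5 to 10% yields less than one additional mIoU point.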
4.4.3 Computational costs
We report the computational time (in minutes) of four methods presented in ablation study in Table 8 where \({\text{ENT}}_{\mathrm{int}}\), \({\text{ENT}}_{\mathrm{pse}}\) and \({\text{ENT}}_{\mathrm{ndt}}\), respectively, denote “\({\text{ENT}}_{\mathrm{reg}}\) + Inten,” “\({\text{ENT}}_{\mathrm{reg}}\) + Pseu” and “\({\text{ENT}}_{\mathrm{reg}}\) + NDT.” And \(T_{\mathrm{train}}\) and \(T_{\mathrm{cal}}\) denote the average time per epoch in an AL loop and the calculating time for active querying, respectively.
Because the amount of annotated data is the same for the initial training, the training time \(T_{\mathrm{train}}\) is approximately the same for each method. As the proportion of annotated data increases, the calculation time \(T_{\mathrm{cal}}\) for active querying tends to decrease, because the amount of data in the unlabeled dataset \(D_{\mathrm{U}}\) gradually decreases, resulting in less computation for querying. Using point cloud intensity data for calculating segmented region information has no impact on the calculation time \(T_{\mathrm{cal}}\). The addition of pseudolabeling, on the other hand, increases active querying time by 19.0%, with a mean value of 23.07 min. Compared to the \({\text{ENT}}_{\mathrm{reg}}\) method, the mean calculating time \(T_{\mathrm{cal}}\) and training time \(T_{\mathrm{train}}\) of the \({\text{ENT}}_{\mathrm{ndt}}\) method are reduced by 54.6% and 50.5%, respectively. The reason for this reduction in computational cost is that, after NDT-based inter-frame selection, the number of point clouds in the training set decreases from 19,130 to 9335, resulting in considerably less computation on the unlabeled dataset. This result also validates the importance of our registration-based inter-frame selection.
4.4.4 Hyper-parameters analysis
We conduct a parametric study of three important parameters proposed in our method, which are the registration threshold (\(\delta _{\mathrm{match}}\)), the weight of point cloud intensity (\(\beta\)) in Eq. 4, and the proportion of pseudolabeled data (\(x_{\mathrm{pseudo}}\)). During the experiment, we keep the other settings unchanged and evaluate how the mIoU performance varies with the parameter under study. The results are shown in Table 9. It can be seen that the mIoU performance decays as the weight of point cloud intensity (\(\beta\)) gradually increases. This is because samples with uncertainty are more important for improving model performance than samples with diversity [52]. As \(x_{\mathrm{pseudo}}\) increases, the model performance tends to flatten out. The reason may be that the amount of mislabeled data in the \(D_{\mathrm{pseudo}}\) dataset also increases as \(x_{\mathrm{pseudo}}\) increases, which has a negative effect on the model performance. Although the NDT-based inter-frame selection can effectively reduce computational costs, it can degrade model performance when \(\delta _{\mathrm{match}}\) is set too large. This is because NDT-based inter-frame selection is a coarse-grained selection method that may result in the removal of data that is necessary for enhancing the model performance.
4.5 Limitations and future work
Although our proposed method proved to be effective in reducing human annotation labor and computational costs, there are still two pivotal limitations. The first is that the diversity filtering criteria proposed in this paper, when designing the active query function, only utilize the point cloud intensity. In fact, there is additional information that can be used, such as the color properties contained in the S3DIS dataset’s point clouds. It is because regions with substantial color variances are more likely to suggest semantic diversity.
The other limitation is that the imbalance of categories in the dataset during the acquisition is not considered. Deep learning models are usually trained and evaluated with the assumption that the dataset is balanced or nearly so. In reality, datasets in real-world scenarios, such as S3DIS and SemanticKITTI, are frequently unevenly distributed between categories. A model trained on a skewed dataset is likely to be overwhelmed by samples coming from the majority categories. To summarize, we argue that active learning should not only select informative and diverse samples to decrease annotating costs, but should also be able to alleviate the imbalance in the labeled subset to improve the model’s accuracy and robustness. In addition, scribble-annotation is a popular and effective method that retains as much information as possible to allow relatively high performance when compared to fully supervised training [49]. In future work, active learning can be integrated with scribble-annotations, i.e., only scribbling the uncertain and diverse data, to further minimize annotation labor.
5 Conclusion
In this paper, we propose a multi-granularity and semisupervised active learning pipeline for point cloud semantic segmentation. We first propose a novel inter-frame selection module based on the NDT registration algorithm to select a representative subset. Then, two key components, the segmented region entropy and point cloud intensity, are designed to select the most informative and diverse regions to annotate rather than a traditional point cloud scan. Next, through the efficient pseudolabeling method, our method further improves cost efficiency. Finally, we conduct extensive experiments and ablation studies with two networks on the SemanticKITTI dataset, where our method achieves SOTA cost efficiency and greatly outperforms all existing works.
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author (Z. Pan) and S. Ye on reasonable request.
References
Abdel-Salam R, Mostafa R, Abdel-Gawad AH (2022) RIECNN: real-time image enhanced CNN for traffic sign recognition. Neural Comput Appl 34:6085–6096. https://doi.org/10.1007/s00521-021-06762-5
Aodha OM, Campbell ND, Kautz J et al (2014) Hierarchical subquery evaluation for active learning on a graph. In: 2014 IEEE conference on computer vision and pattern recognition, pp 564–571. https://doi.org/10.1109/CVPR.2014.79
Behley J, Garbade M, Milioto A et al (2019) SemanticKITTI: a dataset for semantic scene understanding of lidar sequences. In: 2019 IEEE/CVF international conference on computer vision (ICCV), pp 9296–9306. https://doi.org/10.1109/ICCV.2019.00939
Beluch WH, Genewein T, Nurnberger A et al (2018) The power of ensembles for active learning in image classification. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, pp 9368–9377. https://doi.org/10.1109/CVPR.2018.00976
Biber P, Straßer W (2003) The normal distributions transform: a new approach to laser scan matching. In: 2003 IEEE/RSJ international conference on intelligent robots and systems, Las Vegas, Nevada, USA, October 27–November 1, 2003. IEEE, pp 2743–2748. https://doi.org/10.1109/IROS.2003.1249285
Casanova A, Pinheiro PO, Rostamzadeh N et al (2020) Reinforced active learning for image segmentation. In: 8th International conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net
Choy C, Gwak J, Savarese S (2019) 4D spatio-temporal convnets: Minkowski convolutional neural networks. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3070–3079. https://doi.org/10.1109/CVPR.2019.00319
Dagan I, Engelson SP (1995) Committee-based sampling for training probabilistic classifiers. In: Machine learning, proceedings of the twelfth international conference on machine learning, Tahoe City, California, USA, July 9–12, 1995. Morgan Kaufmann, pp 150–157. https://doi.org/10.1016/b978-1-55860-377-6.50027-x
Dai A, Chang AX, Savva M et al (2017) Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 2432–2443. https://doi.org/10.1109/CVPR.2017.261
Deng C, Xue Y, Liu X et al (2019) Active transfer learning network: a unified deep joint spectral-spatial feature learning model for hyperspectral image classification. IEEE Trans Geosci Remote Sens 57(3):1741–1754. https://doi.org/10.1109/TGRS.2018.2868851
Deng S, Dong Q, Liu B, Hu Z (2022) Superpoint-guided semi-supervised semantic segmentation of 3D point clouds. In: 2022 International conference on robotics and automation (ICRA), pp 9214–9220. https://doi.org/10.1109/ICRA46639.2022.9811904
Gal Y, Islam R, Ghahramani Z (2017) Deep Bayesian active learning with image data. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, proceedings of machine learning research, vol 70. PMLR, pp 1183–1192
Gu B, Zhai Z, Deng C et al (2021) Efficient active learning by querying discriminative and representative samples and fully exploiting unlabeled data. IEEE Trans Neural Netw Learn Syst 32(9):4111–4122. https://doi.org/10.1109/TNNLS.2020.3016928
Guo Y (2010) Active instance sampling via matrix partition. In: Advances in neural information processing systems, vol 23. Curran Associates Inc., pp 802–810
Hackel T, Savinov N, Ladicky L et al (2017) Semantic3d.net: a new large-scale point cloud classification benchmark. In: ISPRS Annals of the photogrammetry, remote sensing and spatial information sciences, pp 91–98
Hinton GE, Srivastava N, Krizhevsky A et al (2012) Improving neural networks by preventing co-adaptation of feature detectors. CoRR arXiv:1207.0580
Hossain HMS, Roy N (2019) Active deep learning for activity recognition with context aware annotator selection. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining, KDD 2019, Anchorage, AK, USA, August 4–8, 2019. ACM, pp 1862–1870. https://doi.org/10.1145/3292500.3330688
Hu Q, Yang B, Xie L et al (2020) Randla-net: efficient semantic segmentation of large-scale point clouds. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 11105–11114. https://doi.org/10.1109/CVPR42600.2020.01112
Hui L, Di L, Xianfeng H et al (2008) Laser intensity used in classification of lidar point cloud data. In: IGARSS 2008—2008 IEEE international geoscience and remote sensing symposium, pp II-1140–II-1143. https://doi.org/10.1109/IGARSS.2008.4779201
Joshi AJ, Porikli F, Papanikolopoulos N (2009) Multi-class active learning for image classification. In: 2009 IEEE conference on computer vision and pattern recognition, pp 2372–2379. https://doi.org/10.1109/CVPR.2009.5206627
Käding C, Rodner E, Freytag A et al (2016) Active and continuous exploration with deep neural networks and expected model output changes. CoRR arXiv:1612.06129
Konyushkova K, Sznitman R, Fua P (2015) Introducing geometry in active learning for image segmentation. In: 2015 IEEE international conference on computer vision (ICCV), pp 2974–2982. https://doi.org/10.1109/ICCV.2015.340
Lewis DD, Catlett J (1994) Heterogeneous uncertainty sampling for supervised learning. In: Machine learning, proceedings of the eleventh international conference, Rutgers University, New Brunswick, NJ, USA, July 10–13, 1994. Morgan Kaufmann, pp 148–156. https://doi.org/10.1016/b978-1-55860-335-6.50026-x
Li J, Jiang F, Yang J et al (2021) Lane-deeplab: lane semantic segmentation in automatic driving scenarios for high-definition maps. Neurocomputing 465:15–25. https://doi.org/10.1016/j.neucom.2021.08.105
Lin Y, Vosselman G, Cao Y et al (2020) Active and incremental learning for semantic ALS point cloud segmentation. ISPRS J Photogramm Remote Sens 169:73–92. https://doi.org/10.1016/j.isprsjprs.2020.09.003
Lin Y, Vosselman G, Cao Y et al (2020) Efficient training of semantic point cloud segmentation via active learning. In: ISPRS annals of the photogrammetry, remote sensing and spatial information sciences, pp 243–250. https://doi.org/10.5194/isprs-annals-V-2-2020-243-2020
Liu C, Li J, He L (2019) Superpixel-based semisupervised active learning for hyperspectral image classification. IEEE J Sel Top Appl Earth Observ Remote Sens 12(1):357–370. https://doi.org/10.1109/JSTARS.2018.2880562
Liu Z, Wang J, Gong S et al (2019) Deep reinforcement active learning for human-in-the-loop person re-identification. In: 2019 IEEE/CVF international conference on computer vision (ICCV), pp 6121–6130. https://doi.org/10.1109/ICCV.2019.00622
Liu Z, Tang H, Zhao S et al (2021) Pvnas: 3d neural architecture search with point-voxel convolution. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2021.3109025
Luo W, Schwing AG, Urtasun R (2013) Latent structured active learning. In: Advances in neural information processing systems 26: 27th annual conference on neural information processing systems 2013. Proceedings of a meeting held December 5–8, 2013, Lake Tahoe, Nevada, United States, pp 728–736
Nguyen HT, Smeulders AWM (2004) Active learning using pre-clustering. In: Machine learning, proceedings of the twenty-first international conference ICML 2004, Banff, Alberta, Canada, July 4–8, 2004, ACM international conference proceeding series, vol 69. ACM. https://doi.org/10.1145/1015330.1015349
Pan Y, Pi D, Chen J et al (2021) FDPPGAN: remote sensing image fusion based on deep perceptual patchGAN. Neural Comput Appl 33:9589–9605. https://doi.org/10.1007/s00521-021-05724-1
Papon J, Abramov A, Schoeler M et al (2013) Voxel cloud connectivity segmentation - supervoxels for point clouds. In: 2013 IEEE conference on computer vision and pattern recognition, pp 2027–2034. https://doi.org/10.1109/CVPR.2013.264
Peng K, Fei J, Yang K et al (2022) MASS: multi-attentional semantic segmentation of LiDAR data for dense top-view understanding. IEEE Trans Intell Transp Syst 23(9):15824–15840. https://doi.org/10.1109/TITS.2022.3145588
Qi CR, Yi L, Su H et al (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in neural information processing systems 30: annual conference on neural information processing systems 2017, December 4–9, 2017, Long Beach, CA, USA, pp 5099–5108
Ren P, Xiao Y, Chang X et al (2022) A survey of deep active learning. ACM Comput Surv 54(9):180:1-180:40. https://doi.org/10.1145/3472291
Riegler G, Ulusoy AO, Geiger A (2017) Octnet: learning deep 3d representations at high resolutions. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 6620–6629. https://doi.org/10.1109/CVPR.2017.701
Roy N, Mccallum A (2001) Toward optimal active learning through Monte Carlo estimation of error reduction. In: Proceedings of the international conference on machine learning, pp 441–448
Rusu RB, Blodow N, Beetz M (2009) Fast point feature histograms (FPFH) for 3D registration. In: 2009 IEEE international conference on robotics and automation, pp 3212–3217. https://doi.org/10.1109/ROBOT.2009.5152473
Sener O, Savarese S (2018) Active learning for convolutional neural networks: a core-set approach. In: 6th International conference on learning representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, conference track proceedings. OpenReview.net
Settles B, Craven M, Ray S (2007) Multiple-instance active learning. In: Advances in neural information processing systems 20, proceedings of the twenty-first annual conference on neural information processing systems, Vancouver, British Columbia, Canada, December 3–6, 2007. Curran Associates Inc., pp 1289–1296
Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: Proceedings of the fifth annual ACM conference on computational learning theory, COLT 1992, Pittsburgh, PA, USA, July 27–29, 1992. ACM, pp 287–294. https://doi.org/10.1145/130385.130417
Shi X, Xu X, Chen K et al (2021) Label-efficient point cloud semantic segmentation: an active learning approach. CoRR arXiv:2101.06931
Siddiqui Y, Valentin J, Niessner M (2020) Viewal: active learning with viewpoint entropy for semantic segmentation. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 9430–9440. https://doi.org/10.1109/CVPR42600.2020.00945
Siméoni O, Budnik M, Avrithis Y et al (2021) Rethinking deep active learning: using unlabeled data at model training. In: 2020 25th International conference on pattern recognition (ICPR), pp 1220–1227. https://doi.org/10.1109/ICPR48806.2021.9412716
Stein SC, Schoeler M, Papon J et al (2014) Object partitioning using local convexity. In: 2014 IEEE conference on computer vision and pattern recognition, pp 304–311. https://doi.org/10.1109/CVPR.2014.46
Tatarchenko M, Park J, Koltun V et al (2018) Tangent convolutions for dense prediction in 3D. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, pp 3887–3896. https://doi.org/10.1109/CVPR.2018.00409
Tran T, Do T, Reid ID et al (2019) Bayesian generative active deep learning. In: Proceedings of the 36th international conference on machine learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, proceedings of machine learning research, vol 97. PMLR, pp 6295–6304
Unal O, Dai D, Van Gool L (2022) Scribble-supervised LiDAR semantic segmentation. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 2697–2707
Wang K, Zhang D, Li Y et al (2017) Cost-effective active learning for deep image classification. IEEE Trans Circuits Syst Video Technol 27(12):2591–2600. https://doi.org/10.1109/TCSVT.2016.2589879
Wang J-X, Chen S-B, Ding CHQ, Tang J, Luo B (2022) RanPaste: paste consistency and pseudo label for semisupervised remote sensing image semantic segmentation. IEEE Trans Geosci Remote Sens 60:1–16. https://doi.org/10.1109/TGRS.2021.3102026
Wu TH, Liu YC, Huang YK et al (2021) Redal: region-based and diversity-aware active learning for point cloud semantic segmentation. In: 2021 IEEE/CVF international conference on computer vision (ICCV), pp 15490–15499. https://doi.org/10.1109/ICCV48922.2021.01522
Xie B, Yuan L, Li S, Liu CH, Cheng X (2022) Towards fewer annotations: active learning via region impurity and prediction uncertainty for domain adaptive semantic segmentation. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 8058–8068. https://doi.org/10.1109/CVPR52688.2022.00790
Yoo D, Kweon IS (2019) Learning loss for active learning. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 93–102. https://doi.org/10.1109/CVPR.2019.00018
Yuan T, Wan F, Fu M et al (2021) Multiple instance active learning for object detection. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5326–5335. https://doi.org/10.1109/CVPR46437.2021.00529
Acknowledgements
The work was supported in part by the National Key Research and Development Program of China under Grant No. 2021YFB2501300 and in part by the National Important Science & Technology Specific Projects under Grant No. 2017ZX01038201.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Ye, S., Yin, Z., Fu, Y. et al. A multi-granularity semisupervised active learning for point cloud semantic segmentation. Neural Comput & Applic 35, 15629–15645 (2023). https://doi.org/10.1007/s00521-023-08455-7
DOI: https://doi.org/10.1007/s00521-023-08455-7