Abstract
3D point cloud semantic segmentation is fundamental for autonomous driving. Most approaches in the literature neglect an important aspect, i.e., how to deal with domain shift when handling dynamic scenes. This can significantly hinder the navigation capabilities of self-driving vehicles. This paper advances the state of the art in this research field. Our first contribution consists in analysing a new unexplored scenario in point cloud segmentation, namely Source-Free Online Unsupervised Domain Adaptation (SF-OUDA). We experimentally show that state-of-the-art methods have a rather limited ability to adapt pre-trained deep network models to unseen domains in an online manner. Our second contribution is an approach that relies on adaptive self-training and geometric-feature propagation to adapt a pre-trained source model online without requiring either source data or target labels. Our third contribution is to study SF-OUDA in a challenging setup where source data is synthetic and target data is point clouds captured in the real world. We use the recent SynLiDAR dataset as a synthetic source and introduce two new synthetic (source) datasets, which can stimulate future synthetic-to-real autonomous driving research. Our experiments show the effectiveness of our segmentation approach on thousands of real-world point clouds (Code and synthetic datasets are available at https://github.com/saltoricristiano/gipso-sfouda).
Keywords
- Online domain adaptation
- Source-free unsupervised domain adaptation
- Point cloud segmentation
- Geometric propagation
1 Introduction
Autonomous driving requires accurate and efficient 3D visual scene perception algorithms. Low-level visual tasks such as detection and segmentation are crucial to enable higher-level tasks such as path planning [11, 35] and obstacle avoidance [46]. Deep learning-based methods have proven to be the most suitable option to meet these requirements so far, but at the cost of requiring large-scale annotated datasets for training [29]. Relying only on annotated data is not always a viable solution. This problem can be mitigated by considering synthetic data, as it can be generated at low cost with potentially unlimited annotations and under different environmental conditions [12, 23]. However, when a model trained on synthetic data is deployed in the real world, it will typically underperform due to domain shift, e.g., caused by varying lighting conditions, clutter, occlusions and materials with different reflective properties [56]. We argue that a 3D semantic segmentation algorithm running on an autonomous vehicle should be capable of adapting online – handling scenarios that are visited for the first time while driving – and it should do so by only using the newly captured data. A variety of research works have addressed the adaptation problem in the context of 3D semantic segmentation. However, most approaches operate offline and assume to have access to training (source) data [28, 61, 63, 69, 72, 73]. In this paper, we argue that these two assumptions are too restrictive in an autonomous driving scenario (Fig. 1). On the one hand, offline adaptation would be equivalent to performing model adaptation on the data a vehicle has captured only after navigation has terminated, which is clearly a sub-optimal solution for autonomous driving [30]. On the other hand, having to rely on source data may not be a viable option, as it requires the method to store and query potentially large amounts of data, thus hindering scalability [33, 36].
To overcome these limitations, in this paper we explore the new problem of Source-Free Online Unsupervised Domain Adaptation (SF-OUDA) for semantic segmentation, i.e., that of adapting a deep semantic segmentation model while a vehicle navigates in an unseen environment, without relying on human supervision. Specifically, we first implement, adapt and thoroughly analyze existing adaptation methods for the 3D semantic segmentation problem in an SF-OUDA setup. We experimentally observe that none of these methods provides consistent and satisfactory performance when employed in an SF-OUDA setting. However, there are elements of interest that, when carefully combined and extended, can be generally applicable. This leads us to design GIPSO (Geometrically Informed Propagation for Source-free Online adaptation), the first SF-OUDA method for 3D point cloud segmentation, which builds upon recent advances in the literature and exploits geometric information and temporal consistency to support the domain adaptation process. We also introduce two new synthetic datasets to benchmark SF-OUDA on two different real-world datasets, i.e. SemanticKITTI [3, 13, 14] and nuScenes [4]. We validate our approach on these new synthetic-to-real benchmarks. Our motivation for creating these datasets is to make evaluation more comprehensive and to assess the generalization ability of different techniques across experimental setups. In summary, our contributions are:
- A thorough experimental analysis of existing domain adaptation methods for 3D semantic segmentation in an SF-OUDA setting;
- A novel method for SF-OUDA that exploits low-level geometric properties and temporal information to continuously adapt a 3D segmentation model;
- The introduction of two new LiDAR synthetic datasets that are compatible with the SemanticKITTI and nuScenes datasets.
2 Related Work
Point Cloud Semantic Segmentation. Point cloud segmentation methods can be classified into quantization-free and quantization-based architectures. The former process the input point clouds in their original 3D format. Examples include PointNet [43], which is based on a series of multi-layer perceptrons. PointNet++ [44] builds upon PointNet by using multi-scale sampling and neighbourhood aggregation to encode both global and local features. RandLA-Net [21] extends PointNet++ [44] by embedding local spatial encoding, random sampling and attentive pooling. These methods are computationally inefficient when large-scale point clouds are used. The latter provide a computationally efficient alternative, as input point clouds can be mapped into efficient representations, namely range maps [39, 60, 61], polar maps [67], 3D voxel grids [8, 16, 17, 70] or 3D cylindrical voxels [71]. Quantization-based approaches can be based on sparse convolutions [16, 17] or Minkowski convolutions [8]. We use the Minkowski Engine [8] as it provides a suitable trade-off between accuracy and efficiency.
Unsupervised Domain Adaptation. Offline UDA can be performed either using source data [20, 37, 48, 72] or without using source data (source-free UDA) [33, 36, 49, 62]. Online UDA can be used to adapt a model to an unlabelled continuous target data stream through source domain supervision [58]. It can be employed for classification [40], image semantic segmentation [58], depth estimation [55, 68], robot manipulation [38], human mesh reconstruction [19] and occupancy mapping [54]. The assumption of unsupervised target input data can be relaxed and applied for online adaptation in classification [31], video-object segmentation [57] and motion planning [53]. Recently, test-time adaptation methods have been applied to online UDA in classification by using supervision from source data [50, 52, 59]. We tackle source-free online UDA for point cloud segmentation for the first time.
Domain Adaptation for Point Cloud Segmentation. Domain shift in point cloud segmentation occurs due to differences in (i) sampling noise, (ii) structure of the environment and (iii) class distributions [26, 61, 63, 69]. The domain adaptation problem can be formulated as a 3D surface completion task [63] or addressed with ray casting system capable of transferring the target sensor sampling pattern to the source data [28]. Other approaches tackle the domain adaptation problem in the synthetic-to-real setting (i.e., point cloud in the source domain are synthetic, while target ones are collected with LiDAR sensors) [60, 61, 69]. Attention models can be used to aggregate contextual information with large receptive fields at early layers of the model [60, 61]. Geodesic correlation alignment and progressive domain calibration can be also used to further improve domain adaptation effectiveness [61]. Authors in [69] argue that the method in [61] cannot be trained end-to-end as it employs a multi-stage pipeline. Therefore, they propose an end-to-end approach to simulate the dropout noise of real sensors on synthetic data through a generative adversarial network. Unlike these methods, we focus on SF-OUDA and propose a novel adaptation method which invokes geometry for propagating reliable pseudo-labels on target data.
3 Datasets for Synthetic-to-Real Adaptation
Autonomous driving simulators enable users to create ad-hoc synthetic datasets that can resemble real-world scenarios. Examples of popular simulators are GTA-V [64] and CARLA [12]. In principle, synthetic datasets should be compatible with their real-world counterparts [3, 4, 14], i.e., they should share the same semantic classes and the same sensor specifications, such as the resolution (32 vs. 64 channels) and the horizontal field of view (e.g., 90\(^\circ \) vs. 360\(^\circ \)). However, this is not the case for most of the synthetic datasets in the literature. The SynthCity [18] dataset contains large-scale point clouds that are generated from collections of several LiDAR scans, making it unsuitable for online domain adaptation as no odometry data is provided. PreSIL [23] and GTA-LiDAR’s [61] point clouds are captured from a moving vehicle using a simulated Velodyne HDL64E [34], as in SemanticKITTI; however, they are rendered with a different field of view, i.e., \(90^\circ \) as opposed to the \(360^\circ \) of SemanticKITTI. SynLiDAR’s [2] point clouds are obtained using a simulated Velodyne HDL64E with a \(360^\circ \) field of view, as in SemanticKITTI. However, the odometry data is not provided, i.e., point clouds are all expressed in their local reference frames. Therefore, domain adaptation algorithms that are based on ray casting, like [28], cannot be used.
To enable full compatibility with SemanticKITTI [3] and nuScenes [4], we present a new synthetic dataset, namely Synth4D, which we created using the CARLA simulator [12]. Table 1 compares Synth4D to the other synthetic datasets. Synth4D is composed of two sets of point cloud sequences, one compatible with SemanticKITTI and one compatible with nuScenes. Each set is composed of 20K labelled point clouds. Synth4D is captured using a vehicle navigating in four scenarios, i.e., town, highway, rural area and city. Because UDA requires consistent labels between source and target, we mapped the labels of Synth4D with those of SemanticKITTI/nuScenes using the original instructions given to annotators [3, 4], thus producing eight macro classes: vehicle, pedestrian, road, sidewalk, terrain, manmade, vegetation and unlabelled. Figure 2 shows examples of annotated point clouds from Synth4D. See Supp. Mat. for more details.
4 SF-OUDA
We formulate the problem of SF-OUDA for 3D point cloud segmentation as follows. Given a deep network model \(F_\mathcal {S}\) that is pre-trained with supervision on the source domain \(\mathcal {S}\), we aim to adapt \(F_\mathcal {S}\) on the target domain \(\mathcal {T}\) given an unlabelled point cloud stream as input. \(F_\mathcal {S}\) is pre-trained using the source data \(\varGamma _\mathcal {S} = \{ (X^i_\mathcal {S}, Y^i_\mathcal {S}) \}_{i=1}^{M_\mathcal {S}}\), where \(X^i_\mathcal {S}\) is a synthetic point cloud, \(Y^i_\mathcal {S}\) is the segmentation mask of \(X^i_\mathcal {S}\) and \(M_\mathcal {S}\) is the number of available synthetic point clouds. Let \(X^t_\mathcal {T}\) be a point cloud of our stream at time t and \(F^t_\mathcal {T}\) be the target model adapted using \(X^t_\mathcal {T}\) and \(X^{t-w}_\mathcal {T}\), with \(w > 0\). \(Y_\mathcal {T}\) is the set of unknown target labels and C is the number of classes contained in \(Y_\mathcal {T}\). The source classes and the target classes are coincident.
4.1 Our Approach
The input to GIPSO is the point cloud \(X^t_\mathcal {T}\) and an already processed point cloud \(X^{t-w}_\mathcal {T}\). These point clouds are used to adapt \(F_\mathcal {S}\) to \(\mathcal {T}\) through self-supervision (Fig. 3). The input is processed by two modules. The first module aims to create labels for self-supervision by segmenting \(X^t_\mathcal {T}\) with the source model \(F_\mathcal {S}\). Because these labels are produced by the model without supervision, we refer to them as pseudo-labels. We select a subset of segmented points that have reliable pseudo-labels through an adaptive selection criterion, and propagate them to less reliable points. The propagation uses geometric similarity in the feature space to increase the number of pseudo-labels available for self-supervision. To this end, we use an auxiliary deep network (\(F_{aux}\)) that is specialized in extracting geometrically-informed representations from 3D points. The second module aims to encourage temporal regularization of semantic information between \(X^t_\mathcal {T}\) and \(X^{t-w}_\mathcal {T}\). Unlike recent works [22], where a global point cloud descriptor of the scene is learnt, we exploit a self-supervised framework based on stop gradient [6] to ensure smoothness over time. Self-supervision through pseudo-label geometric propagation and temporal regularization are concurrently optimized to achieve the desired domain adaptation objective (Sect. 4.2).
Adaptive Pseudo-label Selection. An accurate selection of pseudo-labels is key to reliably adapt a model. In dynamic real-world scenarios, where new structures appear in/disappear from the LiDAR field of view, traditional pseudo-labeling techniques [7, 51] can suffer from unexpected variations of class distributions, producing overconfident incorrect pseudo-labels and making more populated classes prevail over others [72, 73]. We overcome this problem by designing a class-balanced adaptive-thresholding strategy to choose reliable pseudo-labels. First, we compute an uncertainty index to filter out likely unreliable pseudo-labels. Second, we apply a different threshold for each class based on the uncertainty index distribution. This uncertainty index is directly related to the robustness of the output class distribution for each point. Robust pseudo-labels can be extracted from those points that consistently provide similar output distributions under different dropout perturbations [27]. We found that this approach works better than alternative confidence-based approaches [72, 73].
Given the point cloud \(X^t_\mathcal {T}\), we perform J iterations of inference with \(F_\mathcal {S}\) by using dropout and obtain

\(p_\mathcal {T}^t = \frac{1}{J}\sum _{j=1}^{J} F_\mathcal {S}(X_\mathcal {T}^t \mid d_j), \qquad (1)\)
where \(p_\mathcal {T}^t\) is the output distribution of \(F_\mathcal {S}\) given \(X_\mathcal {T}^t\), averaged over the dropout perturbations \(d_j\) applied at each iteration j. We compute the uncertainty index \(\nu _\mathcal {T}^t\) as the variance over the C classes of \(p_\mathcal {T}^t\) as

\(\nu _\mathcal {T}^t = E\big [(p_\mathcal {T}^t - \mu _\mathcal {T}^t)^2\big ], \qquad (2)\)
where \(\mu _\mathcal {T}^t = E[p_\mathcal {T}^t]\) is the expected value of \(p_\mathcal {T}^t\). Then, we select the least uncertain points by using a different uncertainty threshold for each class. Let \(\lambda _c^t\) be the uncertainty threshold of class c at time t. Since \(\nu _\mathcal {T}^t\) defines the uncertainty of each point, we group the \(\nu _\mathcal {T}^t\) values per class and compute \(\lambda _c^t\) as the a-th percentile of \(\nu _\mathcal {T}^t\) for class c. Therefore, at time t and for class c, we select only those points whose uncertainty index is lower than \(\lambda _c^t\) and use their pseudo-labels as seed pseudo-labels.
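The adaptive selection described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `select_seed_pseudo_labels` is ours, and we interpret the uncertainty index as the variance of each class probability across the J dropout passes, averaged over classes.

```python
import numpy as np

def select_seed_pseudo_labels(probs_per_pass, percentile=1.0):
    """Select per-class reliable pseudo-labels from J stochastic
    (dropout) forward passes.

    probs_per_pass: (J, N, C) softmax outputs of the source model for
    N points over C classes, one slice per dropout pass.
    Returns (labels, seed_mask): argmax pseudo-labels and a boolean
    mask marking the points kept as seeds.
    """
    p_mean = probs_per_pass.mean(axis=0)          # averaged distribution
    labels = p_mean.argmax(axis=1)                # (N,) pseudo-labels
    # Uncertainty index: variance across passes, averaged over classes.
    nu = probs_per_pass.var(axis=0).mean(axis=1)  # (N,)

    seed_mask = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        cls = labels == c
        # Class-wise adaptive threshold: a-th percentile of nu for class c.
        lam_c = np.percentile(nu[cls], percentile)
        seed_mask |= cls & (nu <= lam_c)
    return labels, seed_mask
```

Points whose output distribution barely changes under dropout get a low index and survive the per-class percentile cut, which keeps the selection balanced even when one class dominates the scene.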
Geometric Pseudo-label Propagation. Typically, seed pseudo-labels are few and uninformative for the adaptation of the target model – the deep network is already confident about them. Therefore, we aim to propagate these pseudo-labels to potentially informative points. This is challenging because the model may drift during adaptation. We propose to use the features produced by an auxiliary geometrically-informed encoder \(F_{aux}\) to propagate seed pseudo-labels to geometrically-similar points. Geometric features can be extracted using deep networks that compute 3D local descriptors [1, 15, 41]. 3D local descriptors are compact representations of local geometries with great generalization abilities across domains. Our intuition is that, while the propagation in the metric space may propagate only in the spatial neighborhood of seed pseudo-labels, the use of geometric features would allow us to propagate to geometrically similar points, which can be distant from their seeds in the metric space (Fig. 4).
Given a seed pseudo-labeled point \(\tilde{\textbf{x}}^t \in X^t_\mathcal {T}\), we compute a set of geometric similarities as

\(\mathcal {G}_{\tilde{\textbf{x}}}^t = \big \{ -\Vert F_{aux}(\tilde{\textbf{x}}^t) - F_{aux}(\textbf{x}^t) \Vert _2 \,:\, \textbf{x}^t \in X^t_\mathcal {T} \setminus \{\tilde{\textbf{x}}^t\} \big \}, \qquad (3)\)
where \(||\cdot ||_2\) is the \(l_2\)-norm and \(\mathcal {G}_{\tilde{\textbf{x}}}^t\) is the set that contains the similarity values between \(\tilde{\textbf{x}}^t\) and all the other points of \(X^t_\mathcal {T}\) (except \(\tilde{\textbf{x}}^t\)). Then, we select the points that correspond to top K values in \(\mathcal {G}_{\tilde{\textbf{x}}}^t\) and assign the pseudo-label of \(\tilde{\textbf{x}}^t\) to them. Let \(Y^t_\mathcal {T}\) be the final set of pseudo-labels that we use for fine-tuning our model.
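The top-K propagation step can be sketched as below. This is a brute-force NumPy illustration under our own naming: the exhaustive distance computation stands in for the efficient feature-space search one would use on full LiDAR scans.

```python
import numpy as np

def propagate_pseudo_labels(feats, labels, seed_mask, k=10):
    """Propagate each seed pseudo-label to the K points whose auxiliary
    geometric features are most similar (smallest l2 distance).

    feats: (N, D) per-point descriptors from the auxiliary encoder.
    labels: (N,) argmax pseudo-labels; seed_mask: (N,) boolean seeds.
    Returns (labels_out, labeled_mask) with the enlarged label set.
    """
    labels_out = labels.copy()
    labeled = seed_mask.copy()
    for s in np.flatnonzero(seed_mask):
        d = np.linalg.norm(feats - feats[s], axis=1)  # l2 in feature space
        d[s] = np.inf                                 # exclude the seed itself
        # The K most similar points receive the seed's pseudo-label.
        nn = np.argpartition(d, k)[:k]
        labels_out[nn] = labels[s]
        labeled[nn] = True
    return labels_out, labeled
```

Because the search runs in descriptor space rather than metric space, a seed on one car can label a geometrically similar car on the other side of the road.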
Self-supervised Temporal Consistency Loss. While the vehicle moves, the LiDAR sensor samples the environment from different viewpoints generating point clouds with different point distributions due to clutter and/or occlusions. As points of consecutive point clouds can be simply matched over time by using the vehicle’s odometry [4, 14], we can reasonably consider local variations of point distributions as local augmentations with the same semantic information. As a result, we can exploit recent self-supervised techniques to enforce temporal smoothness of our semantic features.
We begin by computing the set of corresponding points between \(X^{t-w}_\mathcal {T}\) and \(X^t_\mathcal {T}\) by using the vehicle’s odometry. Let \(T_{t-w \rightarrow t} \in \mathbb {R}^{4\times 4}\) be the rigid transformation (from odometry) that maps \(X^{t-w}_\mathcal {T}\) in the reference frame of \(X^t_\mathcal {T}\). We define the set of corresponding points \(\varOmega ^{t, t-w}\) as

\(\varOmega ^{t, t-w} = \big \{ (\textbf{x}^{t}, \textbf{x}^{t-w}) \,:\, \textbf{x}^{t} = \texttt{NN}(T_{t-w \rightarrow t} \circ \textbf{x}^{t-w}, X^t_\mathcal {T}),\ \Vert T_{t-w \rightarrow t} \circ \textbf{x}^{t-w} - \textbf{x}^{t} \Vert _2 < \tau \big \}, \qquad (4)\)
where \(\texttt{NN}(n,m)\) is the nearest-neighbour search given the set m and the query n, \(\circ \) is the operator that applies \(T_{t-w \rightarrow t}\) to a 3D point and \(\tau \) is a distance threshold.
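The correspondence search can be sketched as follows, assuming a homogeneous 4×4 odometry transform. Names are illustrative, and the brute-force nearest-neighbour loop stands in for an efficient spatial index (e.g., a KD-tree) in a real pipeline.

```python
import numpy as np

def find_correspondences(pts_prev, pts_curr, T_prev_to_curr, tau=0.3):
    """Match points of the previous scan to the current one after
    aligning with the odometry transform; keep only nearest-neighbour
    pairs closer than the distance threshold tau.

    pts_prev: (M, 3), pts_curr: (N, 3), T_prev_to_curr: (4, 4) rigid pose.
    Returns a list of (i_prev, j_curr) index pairs.
    """
    # Apply T ∘ x: map previous points into the current reference frame.
    homo = np.c_[pts_prev, np.ones(len(pts_prev))]
    aligned = (T_prev_to_curr @ homo.T).T[:, :3]
    pairs = []
    for i, p in enumerate(aligned):
        d = np.linalg.norm(pts_curr - p, axis=1)
        j = int(d.argmin())          # nearest neighbour in the current scan
        if d[j] < tau:
            pairs.append((i, j))
    return pairs
```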
We adapt the self-supervised learning framework proposed in SimSiam [6] to semantically smooth point clouds over time. We add an encoder network \(h(\cdot )\) and a predictor head \(f(\cdot )\) to the target model \(F_\mathcal {T}\) and minimize the negative cosine similarity between consecutive semantic representations of corresponding points. Let \(z^t \triangleq h(x^t)\) be the encoder features over the target backbone for \(x^t\) and let \(q^t \triangleq f(h(x^t))\) be the respective predictor features. We minimize the negative cosine similarity as

\(\mathcal {D}(q^{t}, z^{t-w}) = - \frac{q^{t}}{\Vert q^{t} \Vert _2} \cdot \frac{z^{t-w}}{\Vert z^{t-w} \Vert _2}, \qquad (5)\)
Time consistency is symmetric in the backward direction, hence we use the corresponding point of \(x^t\) from \(\varOmega ^{t, t-w}\) and define our self-supervised temporal consistency loss as

\(\mathcal {L}_{reg} = \frac{1}{2} \mathcal {D}(q^{t}, z^{t-w}) + \frac{1}{2} \mathcal {D}(q^{t-w}, z^{t}), \qquad (6)\)
where stop-grad is applied on \(z^{t}\) and \(z^{t-w}\).
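The symmetric loss can be sketched numerically as below. This NumPy version only computes the loss value: NumPy has no autodiff, so the stop-gradient is indicated by comments and would be realized by the training framework (e.g., `detach()` in PyTorch).

```python
import numpy as np

def neg_cosine(q, z):
    """Negative cosine similarity D(q, z) between l2-normalised vectors."""
    q = q / np.linalg.norm(q)
    # In an autodiff framework, z would be wrapped in stop-grad here.
    z = z / np.linalg.norm(z)
    return -float(np.dot(q, z))

def temporal_consistency_loss(q_t, z_t, q_tw, z_tw):
    """SimSiam-style symmetric loss between the predictor features (q)
    and encoder features (z) of two corresponding points at times t
    and t-w; gradients would not flow through z_t and z_tw."""
    return 0.5 * neg_cosine(q_t, z_tw) + 0.5 * neg_cosine(q_tw, z_t)
```

When the representations of the two matched points agree, the loss reaches its minimum of −1, which is what pushes the semantic features to stay smooth over time.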
4.2 Online Model Update
Classes are typically highly unbalanced in each point cloud, e.g., the pedestrian class may contain \(1\%\) of the number of points of the vegetation class. Therefore, we use the soft Dice loss [25], as we found it works well when classes are unbalanced. Let \(\mathcal {L}_{dice}\) be our soft Dice loss that uses the pseudo-labels selected through Eq. 3 as supervision. We define the overall adaptation objective as \(\mathcal {L}_{tot} = \mathcal {L}_{dice} + \mathcal {L}_{reg}\), where \(\mathcal {L}_{reg}\) is our regularization loss defined in Eq. 6.
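A soft Dice loss of this kind can be sketched as below. This is our own simplified NumPy variant for illustration, not necessarily identical to the formulation in [25]; its class-imbalance robustness comes from every class contributing equally regardless of how many points it covers.

```python
import numpy as np

def soft_dice_loss(probs, labels, num_classes, eps=1e-7):
    """Soft Dice loss averaged over classes.

    probs: (N, C) predicted class probabilities for N points.
    labels: (N,) integer (pseudo-)labels used as supervision.
    """
    one_hot = np.eye(num_classes)[labels]          # (N, C) targets
    inter = (probs * one_hot).sum(axis=0)          # per-class intersection
    union = probs.sum(axis=0) + one_hot.sum(axis=0)
    dice = (2.0 * inter + eps) / (union + eps)     # per-class Dice score
    return float(1.0 - dice.mean())                # 0 for a perfect match
```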
5 Experiments
5.1 Experimental Setup
Source and Target Datasets. We pre-train our source models on Synth4D and SynLiDAR [2], and validate our approach on the official validation sets of SemanticKITTI [3] and nuScenes [4] (target domains). In SemanticKITTI, we use sequence 08, which is composed of 4071 point clouds captured at 10 Hz. In nuScenes, we use 150 sequences, each composed of 40 point clouds captured at 2 Hz.
Implementation Details. We use MinkowskiNet as the deep network for point cloud segmentation [8]. We use ADAM: initial learning rate of 0.01 with exponential decay, batch size 16 and weight decay \(10^{-5}\). As auxiliary network \(F_{aux}\), we use the PointNet-based architecture proposed in [41] trained on Synth4D, which outputs a geometric feature (descriptor) for a given 3D point. For online adaptation, we fix the learning rate to \(10^{-3}\) and do not use schedulers, as they would require prior knowledge about the stream length. Because we adapt our model on each new incoming point cloud, we use a batch size equal to 1. We set \(J=5\), \(a=1\), \(\tau =0.3\) cm and use 0.5 dropout probability. We set \(K=10\), \(w=5\) on SemanticKITTI, and \(K=5\), \(w=1\) on nuScenes. Parameters are the same in all the experiments.
Evaluation Protocol. We follow the traditional evaluation procedure for online learning methods [5, 65], i.e., we evaluate the model performance on a new incoming frame using the model adapted up to the previous frame. We compute the Intersection over Union (IoU) [45] and report the average IoU (mIoU) improvement over the source model (averaged over all the target sequences). We also evaluate the target upper bound (target) of our method, obtained by fine-tuning the source model online with ground-truth labels for all the points in the scene.
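The per-frame metric can be sketched as follows (helper names are ours; classes absent from both prediction and ground truth are skipped, one common convention):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class Intersection over Union between predicted and
    ground-truth labels; NaN for classes absent from both."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious[c] = inter / union
    return ious

def miou(pred, gt, num_classes):
    """Mean IoU over the classes that actually occur in the frame."""
    return float(np.nanmean(per_class_iou(pred, gt, num_classes)))
```

In the online protocol this would be computed on each incoming frame with the model adapted up to the previous frame, then averaged over the sequence.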
5.2 Benchmarking Existing Methods for SF-OUDA
Because our approach is the first that specifically tackles SF-OUDA in the context of 3D point cloud segmentation, we perform an in-depth analysis of the literature to identify previous adaptation methods that can be re-purposed for SF-OUDA. Additionally, we experimentally evaluate their effectiveness on the considered datasets. We identify three categories of methods, as detailed below.
Batch normalization-based methods perform domain adaptation by considering different statistics for source and target samples within Batch Normalization (BN) layers. Here, we consider ADABN [32] and ONDA [38]. ADABN [32] is a source-free adaptation method which operates by updating the BN statistics assuming that all target data are available (offline adaptation). ONDA [38] is the online version of ADABN [32], where the target BN statistics are updated online based on the target data within a mini-batch. This can be regarded as a SF-OUDA method. However, these approaches are general-purpose methods and have not been previously evaluated for 3D point cloud segmentation.
Prototype-based adaptation methods use class centroids, i.e. prototypes, to generate target pseudo-labels that can be transferred to other samples via clustering. We implement SHOT [33] and ProDA [66]. SHOT [33] exploits Information Maximization (IM) to promote cluster compactness during offline adaptation. We implement SHOT by adapting the pre-trained model with the proposed IM loss online on each incoming target point cloud. ProDA [66] adopts a centroid-based weighting strategy to denoise target pseudo-labels, while also considering supervision from source data. We adapt ProDA to SF-OUDA by applying the same weighting strategy but removing source data supervision. We update target centroids at each incremental learning step. We refer to our SF-OUDA versions of SHOT and ProDA as SHOT\(^*\) and ProDA\(^*\), respectively.
Self-training-based methods exploit source model predictions to adapt to the target domain by re-training the model. We implement CBST [72] and TPLD [51]. CBST [72] relies on prediction confidence to select the most reliable pseudo-labels. A confidence threshold is computed offline for each target class to avoid class imbalance. Our implementation of CBST, which we denote as CBST\(^*\), uses the same class-balanced selection strategy but updates the thresholds online on each incoming frame. Moreover, no source data are considered, as we are in an SF-OUDA setting. TPLD [51], originally designed for 2D semantic segmentation, uses the pseudo-label selection mechanism in [72] but introduces a pixel-level pseudo-label densification process. We implement TPLD by removing source supervision and replacing the densification procedure with a 3D spatial nearest-neighbor propagation. Our version of TPLD is denoted as TPLD\(^*\).
Besides re-purposing existing approaches for SF-OUDA, we also evaluate an additional baseline, i.e. the rendering-based method RayCast [28]. This approach is based on the idea that target-like data can be obtained with photorealistic rendering applied to the source point clouds. Thus, adaptation is performed by simply training on target-like data. While RayCast can be regarded as an offline adaptation approach, we select it as it only requires the parameters of the real sensor to obtain target-like data from source point clouds.
5.3 Results
Evaluating GIPSO. Tables 2, 3 and 4 report the results of our quantitative evaluation in the cases of Synth4D \(\rightarrow \) SemanticKITTI, SynLiDAR \(\rightarrow \) SemanticKITTI and Synth4D \(\rightarrow \) nuScenes, respectively. The numbers in the tables indicate the improvement over the source model. GIPSO achieves an average IoU improvement of +4.31 on Synth4D \(\rightarrow \) SemanticKITTI, +3.70 on SynLiDAR \(\rightarrow \) SemanticKITTI and +0.85 on Synth4D \(\rightarrow \) nuScenes. GIPSO outperforms both offline and online methods by a large margin on Synth4D \(\rightarrow \) SemanticKITTI and SynLiDAR \(\rightarrow \) SemanticKITTI, while it achieves a lower improvement on Synth4D \(\rightarrow \) nuScenes. On SemanticKITTI, GIPSO can effectively improve road, sidewalk, terrain, manmade and vegetation. vehicle is the best performing class, with an IoU improvement above +13. pedestrian is the worst performing class on all the datasets. pedestrian is a challenging class because it is significantly unbalanced compared to the others, also in the source domain. Although we attempted to mitigate the problem of unbalanced classes using adaptive thresholding and the soft Dice loss, there are still situations that are difficult to address (see Sect. 6 for details). On nuScenes, the improvement is smaller because the lower resolution of its point clouds makes patterns less distinguishable and more difficult to segment.
Evaluating State-of-the-Art Methods. We also analyze the performance of the existing methods discussed in Sect. 5.2. Batch normalization-based methods perform poorly on all the datasets, with only ADABN [32] showing a minor improvement on nuScenes. We argue that non-i.i.d. batch samples arising in the online setting play an important role in this degradation, as they can have detrimental effects on models with BN layers [24]. SHOT\(^*\) and ProDA\(^*\) perform poorly in almost all the experiments, except on Synth4D \(\rightarrow \) nuScenes where ProDA\(^*\) achieves +0.29. This minor improvement may be due to the short sequences of nuScenes (40 frames) making centroids less likely to drift. This does not occur in SemanticKITTI, where the long sequence causes a rapid drift (see details in Sect. 5.4). CBST\(^*\) and TPLD\(^*\) improve on SemanticKITTI but perform poorly on nuScenes. This can be ascribed to the noisy pseudo-labels that are selected using their confidence-based filtering approach. Lastly, RayCast [28] achieves +1.37 on Synth4D \(\rightarrow \) SemanticKITTI, but underperforms on Synth4D \(\rightarrow \) nuScenes with a degradation of −3.46. RayCast was originally proposed for real-to-real adaptation, therefore we believe that its performance may be affected by the large difference in point cloud resolution between Synth4D and nuScenes. RayCast underperforms GIPSO in the online setup, thus showing how offline solutions can fail in dynamic domains. Note that RayCast cannot be evaluated using SynLiDAR, because SynLiDAR does not provide odometry information.
5.4 In-Depth Analyses
Ablation Study. Table 5 shows the results of our ablation study on Synth4D \(\rightarrow \) SemanticKITTI. When we use only the adaptive pseudo-label selection (A), we achieve +1.07 compared to the source. Combining A with the temporal regularization (T) further improves performance to +3.65. We then achieve our best performance by adding the geometric propagation (P) of the pseudo-labels.
Oracle Study. We analyze the importance of using a reliable pseudo-label selection metric. Table 6 shows the pseudo-label accuracy of the K best candidate points, selected based on the distance from their centroids (as proposed in [66]), on confidence (as proposed in [72]) and on uncertainty (ours). Centroid-based selection shows a low accuracy even at \(K=1\), which tends to worsen as K increases. Confidence-based selection is more reliable than centroid-based selection. We found uncertainty-based selection to be the most reliable at smaller values of K; we deem a few reliable pseudo-labels more important than many unreliable ones.
Per-Class Temporal Behavior. Figure 5a shows the mIoU over time for each class on Synth4D \(\rightarrow \) SemanticKITTI. We can observe that six out of seven classes have a steady improvement: vehicle is the best performing class, followed by vegetation and manmade. Drops in mIoU are typically due to sudden geometric variations of the point cloud, e.g., a road junction after a straight road, or a jammed road after an empty road. pedestrian remains the most challenging class.
Temporal Compactness of Features. We assess how well points are organized in the feature space over time. We use the Davies-Bouldin Index (DBI), which is typically used in clustering to measure intra- and inter-class feature distances [10]. The lower the DBI, the better the quality of the features. We use SHOT\(^*\) and ProDA\(^*\) as comparisons with our method, and the source and target models as references. Figure 5b shows the DBI variations over time. SHOT\(^*\)'s behavior is typical of a drift, as features of different classes become interwoven. ProDA\(^*\) does not drift, but it produces features that are worse than the source model's. Our approach lies between the source and target models, with a tendency to get closer to the target.
Different 3D Local Descriptors. We assess the effectiveness of different 3D local descriptors. We test the FPFH [47] (handcrafted) and FCGF [9] (deep learning) descriptors. GIPSO achieves +3.56 mIoU with FPFH, +4.12 mIoU with FCGF and +4.31 mIoU with DIP [41]. This is in line with the experiments shown in [42], where DIP shows a superior generalization capability across domains compared to FCGF.
Performance with Global Features. We assess the GIPSO performance on Synth4D\(\rightarrow \)SemanticKITTI when the global temporal consistency loss proposed in STRL [22] is used instead of our per-point loss (Eq. 5). This variation achieves \(+1.74\) mIoU, showing that per-point temporal consistency is key.
Qualitative Results. Figure 6 shows the comparison between GIPSO and the source model on Synth4D\(\rightarrow \)SemanticKITTI. The first row shows frame 178 of SemanticKITTI with an improvement of \(+27.14\) mIoU (large). The classes vehicle, sidewalk and terrain are incorrectly segmented by the source model; after adaptation, we can see a significant improvement in segmentation on these classes. The second and third rows show frame 1193 and frame 2625 with improvements of \(+10.00\) mIoU (medium) and \(+4.99\) mIoU (small), respectively. Improvements are visible after adaptation in the classes vehicle, sidewalk and road. The last row shows a segmentation drift for road that is caused by incorrect pseudo-labels.
6 Discussions
Conclusions. We studied for the first time the problem of SF-OUDA for 3D point cloud segmentation in a synthetic-to-real setting. We experimentally showed that existing approaches do not suffice to cope with domain shift in this scenario. We presented GIPSO, which relies on adaptive self-training and geometric-feature propagation to address SF-OUDA. We also introduced a novel synthetic dataset, Synth4D, composed of two splits matching the sensor setups of SemanticKITTI and nuScenes, respectively. Experiments on three different benchmarks showed that GIPSO outperforms state-of-the-art approaches.
Limitations. GIPSO's limitations relate to geometric propagation and long-tailed classes. If objects of different classes share similar geometric structures, geometric propagation may be deleterious. This can be mitigated by using another sensor modality (e.g., RGB) or by accounting for multi-scale signals to exploit context information. If severe class imbalance occurs, semantic segmentation accuracy may be affected, e.g., for the pedestrian class in Tables 2, 3 and 4. This can be mitigated by re-weighting the loss with a class-balanced term computed on the source.
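As a sketch of the mitigation mentioned last, a class-balanced term can be derived from source label frequencies and applied to a cross-entropy loss. The inverse-frequency weighting below is one common choice, not necessarily what would be used in practice; all names are illustrative:

```python
import numpy as np

def inverse_frequency_weights(source_labels: np.ndarray, num_classes: int,
                              smooth: float = 1.0) -> np.ndarray:
    """Per-class weights inversely proportional to the (smoothed) source
    label frequency, normalised so that the weights average to 1.
    Rare classes (e.g. pedestrian) get weights above 1."""
    counts = np.bincount(source_labels, minlength=num_classes) + smooth
    weights = 1.0 / counts
    return weights * num_classes / weights.sum()

def weighted_cross_entropy(logits: np.ndarray, labels: np.ndarray,
                           weights: np.ndarray) -> float:
    """Cross-entropy with the per-class weight applied to each point."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_point = -log_probs[np.arange(len(labels)), labels]
    return float(np.mean(weights[labels] * per_point))
```

Since the weights are computed on source labels only, this term stays compatible with the source-free constraint at adaptation time: the source statistics can be stored with the pre-trained model, without retaining the source point clouds themselves.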
References
Ao, S., Hu, Q., Yang, B., Markham, A., Guo, Y.: SpinNet: learning a general surface descriptor for 3D point cloud registration. In: CVPR (2021)
Xiao, A., Huang, J., Guan, D., Zhan, F., Lu, S.: SynLiDAR: learning from synthetic LiDAR sequential point cloud for semantic segmentation. arXiv (2021)
Behley, J., et al.: SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In: ICCV (2019)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: CVPR (2020)
Cesa-Bianchi, N., Conconi, A., Gentile, C.: On the generalization ability of on-line learning algorithms. T-IT 50(9), 2050–2057 (2004)
Chen, X., He, K.: Exploring simple siamese representation learning. In: CVPR (2021)
Chen, Y., Li, W., Sakaridis, C., Dai, D., van Gool, L.: Domain adaptive faster R-CNN for object detection in the wild. In: CVPR (2018)
Choy, C., Gwak, J., Savarese, S.: 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In: CVPR (2019)
Choy, C., Park, J., Koltun, V.: Fully convolutional geometric features. In: ICCV (2019)
Davies, D., Bouldin, D.: A cluster separation measure. T-PAMI PAMI-1(2), 224–227 (1979)
Dolgov, D., Thrun, S., Montemerlo, M., Diebel, J.: Path planning for autonomous vehicles in unknown semi-structured environments. IJRR 29(5), 485–501 (2010)
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: an open urban driving simulator. In: ACRL (2017)
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. IJRR 32(11), 1231–1237 (2013)
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR (2012)
Gojcic, Z., Zhou, C., Wegner, J., Wieser, A.: The perfect match: 3D point cloud matching with smoothed densities. In: CVPR (2019)
Graham, B., Engelcke, M., van der Maaten, L.: 3D semantic segmentation with submanifold sparse convolutional networks. In: CVPR (2018)
Graham, B., van der Maaten, L.: Submanifold sparse convolutional networks. arXiv (2017)
Griffiths, D., Boehm, J.: SynthCity: a large scale synthetic point cloud. arXiv (2019)
Guan, S., Xu, J., Wang, Y., Ni, B., Yang, X.: Bilevel online adaptation for out-of-domain human mesh reconstruction. In: CVPR (2021)
Hoffman, J., et al.: CyCADA: cycle-consistent adversarial domain adaptation. In: ICML (2018)
Hu, Q., et al.: RandLA-Net: efficient semantic segmentation of large-scale point clouds. In: CVPR (2020)
Huang, S., Xie, Y., Zhu, S., Zhu, Y.: Spatio-temporal self-supervised representation learning for 3D point clouds. In: ICCV (2021)
Hurl, B., Czarnecki, K., Waslander, S.: Precise synthetic image and LiDAR (PreSIL) dataset for autonomous vehicle perception. In: IVS (2019)
Ioffe, S.: Batch renormalization: towards reducing minibatch dependence in batch-normalized models. arXiv (2017)
Jadon, S.: A survey of loss functions for semantic segmentation. In: CIBCB (2020)
Jaritz, M., Vu, T.H., de Charette, R., Wirbel, E., Pérez, P.: xMUDA: cross-modal unsupervised domain adaptation for 3D semantic segmentation. In: CVPR (2020)
Kendall, A., Badrinarayanan, V., Cipolla, R.: Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In: BMVC (2017)
Langer, F., Milioto, A., Haag, A., Behley, J., Stachniss, C.: Domain transfer for semantic segmentation of LiDAR data using deep neural networks. In: IROS (2021)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Levinson, J., et al.: Towards fully autonomous driving: systems and algorithms. In: IV (2011)
Li, D., Hospedales, T.: Online meta-learning for multi-source and semi-supervised domain adaptation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12361, pp. 382–403. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58517-4_23
Li, Y., Wang, N., Shi, J., Liu, J., Hou, X.: Revisiting batch normalization for practical domain adaptation. arXiv (2016)
Liang, J., Hu, D., Feng, J.: Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In: ICML (2020)
Velodyne Lidar: VelodyneLidar (2021). www.velodynelidar.com
Liu, C., Lee, S., Varnhagen, S., Tseng, H.: Path planning for autonomous vehicles using model predictive control. In: IV (2017)
Liu, Y., Zhang, W., Wang, J.: Source-free domain adaptation for semantic segmentation. In: CVPR (2021)
Long, M., Cao, Z., Wang, J., Jordan, M.: Conditional adversarial domain adaptation. In: NeurIPS (2018)
Mancini, M., Karaoguz, H., Ricci, E., Jensfelt, P., Caputo, B.: Kitting in the wild through online domain adaptation. In: IROS (2018)
Milioto, A., Vizzo, I., Behley, J., Stachniss, C.: RangeNet++: fast and accurate LiDAR semantic segmentation. In: IROS (2019)
Moon, J., Das, D., Lee, C.: Multi-step online unsupervised domain adaptation. In: ICASSP (2020)
Poiesi, F., Boscaini, D.: Distinctive 3D local deep descriptors. In: ICPR (2021)
Poiesi, F., Boscaini, D.: Learning general and distinctive 3D local deep descriptors for point cloud registration. T-PAMI (2022)
Qi, C., Su, H., Mo, K., Guibas, L.: PointNet: deep learning on point sets for 3D classification and segmentation. In: CVPR (2017)
Qi, C., Yi, L., Su, H., Guibas, L.: PointNet++: deep hierarchical feature learning on point sets in a metric space. arXiv (2017)
Rahman, M.A., Wang, Y.: Optimizing intersection-over-union in deep neural networks for image segmentation. In: Bebis, G., et al. (eds.) ISVC 2016. LNCS, vol. 10072, pp. 234–244. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-50835-1_22
Rosolia, U., Bruyne, S.D., Alleyne, A.: Autonomous vehicle control: a nonconvex approach for obstacle avoidance. T-CST 25(2), 469–484 (2016)
Rusu, R., Blodow, N., Beetz, M.: Fast point feature histograms (FPFH) for 3D registration. In: ICRA (2009)
Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: CVPR (2018)
Saltori, C., Lathuiliére, S., Sebe, N., Ricci, E., Galasso, F.: SF-UDA\(^3\rm {D}\): source-free unsupervised domain adaptation for LiDAR-based 3D object detection. arXiv (2020)
Schneider, S., Rusak, E., Eck, L., Bringmann, O., Brendel, W., Bethge, M.: Improving robustness against common corruptions by covariate shift adaptation. In: NeurIPS (2020)
Shin, I., Woo, S., Pan, F., Kweon, I.S.: Two-phase pseudo label densification for self-training based domain adaptation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12358, pp. 532–548. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58601-0_32
Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A., Hardt, M.: Test-time training with self-supervision for generalization under distribution shifts. In: ICML (2020)
Tanneberg, D., Peters, J., Rueckert, E.: Efficient online adaptation with stochastic recurrent neural networks. In: Humanoids (2017)
Tompkins, A., Senanayake, R., Ramos, F.: Online domain adaptation for occupancy mapping. arXiv (2020)
Tonioni, A., Tosi, F., Poggi, M., Mattoccia, S., Stefano, L.D.: Real-time self-adaptive deep stereo. In: CVPR (2019)
Torralba, A., Efros, A.: Unbiased look at dataset bias. In: CVPR (2011)
Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for video object segmentation. arXiv (2017)
Volpi, R., Jorge, P.D., Larlus, D., Csurka, G.: On the road to online adaptation for semantic image segmentation. In: CVPR (2022)
Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: fully test-time adaptation by entropy minimization. In: ICLR (2021)
Wu, B., Wan, A., Yue, X., Keutzer, K.: SqueezeSeg: convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud. In: ICRA (2018)
Wu, B., Zhou, X., Zhao, S., Yue, X., Keutzer, K.: SqueezeSegV2: improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud. In: ICRA (2019)
Yang, S., van de Weijer, J., Herranz, L., Jui, S., et al.: Exploiting the intrinsic neighborhood structure for source-free domain adaptation. In: NeurIPS (2021)
Yi, L., Gong, B., Funkhouser, T.: Complete & label: a domain adaptation approach to semantic segmentation of LiDAR point clouds. arXiv (2021)
Yue, X., Wu, B., Seshia, S., Keutzer, K., Sangiovanni-Vincentelli, A.: A LiDAR point cloud generator: from a virtual world to autonomous driving. In: ICMR (2018)
Zhan, X., Xie, J., Liu, Z., Ong, Y., Loy, C.: Online deep clustering for unsupervised representation learning. In: CVPR (2020)
Zhang, P., Zhang, B., Zhang, T., Chen, D., Wang, Y., Wen, F.: Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In: CVPR (2021)
Zhang, Y., et al.: PolarNet: an improved grid representation for online LiDAR point clouds semantic segmentation. In: CVPR (2020)
Zhang, Z., Lathuilière, S., Pilzer, A., Sebe, N., Ricci, E., Yang, J.: Online adaptation through meta-learning for stereo depth estimation. arXiv (2019)
Zhao, S., et al.: ePointDA: an end-to-end simulation-to-real domain adaptation framework for LiDAR point cloud segmentation. arXiv (2020)
Zhou, Y., Tuzel, O.: VoxelNet: end-to-end learning for point cloud based 3D object detection. In: CVPR (2018)
Zhu, X., et al.: Cylindrical and asymmetrical 3D convolution networks for LiDAR segmentation. In: CVPR (2021)
Zou, Y., Yu, Z., Vijaya Kumar, B.V.K., Wang, J.: Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 297–313. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_18
Zou, Y., Yu, Z., Liu, X., Kumar, B., Wang, J.: Confidence regularized self-training. In: ICCV (2019)
Acknowledgements
This work was partially supported by OSRAM GmbH, by the Italian Ministry of Education, Universities and Research (MIUR) “Dipartimenti di Eccellenza 2018–2022”, by the EU JPI/CH SHIELD project, by the PRIN project PREVUE (Prot. 2017N2RK7K), by the EU ISFP PROTECTOR (101034216) project and by the EU H2020 MARVEL (957337) project. It was carried out in the Vision and Learning joint laboratory of FBK and UNITN.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Saltori, C. et al. (2022). GIPSO: Geometrically Informed Propagation for Online Adaptation in 3D LiDAR Segmentation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13693. Springer, Cham. https://doi.org/10.1007/978-3-031-19827-4_33