1 Introduction

Fig. 1

We study Lidar Panoptic Segmentation (LPS) in an Open World (LiPSOW). In each of the 2\(\times \)3 subfigures, the left panel visualizes segmented points colored w.r.t. semantic classes, where red encodes unknown; the right panel visualizes segmented thing instances of known and unknown classes. For autonomous navigation, LPS methods should be evaluated in the presence of novel thing object instances and stuff regions, usually termed unknown. We call this setting LiPSOW; methods should, in particular, segment points into unknown object instances that fall outside the K semantic classes of the predefined vocabulary. For example, given the predefined vocabulary of SemanticKITTI (Behley et al., 2019), unknown objects can be trailers, containers, signaling structures, highway bridge foundations, and buses, as visualized in this figure (Color figure online)

Lidar Panoptic Segmentation (LPS) (Behley et al., 2021; Fong et al., 2021) unifies lidar point classification and segmentation, both important for autonomous agents interacting with an open environment. In LPS, each point must be classified as one of K pre-defined semantic classes. LPS also defines the notions of stuff and thing: thing classes are countable (e.g., pedestrian and car) and must be assigned unique instance identities. Amorphous regions, such as vegetation and road, are defined as stuff.

1.1 Motivation

Solving LPS has attracted increasing attention owing to its practical value for robotic applications, but the current setup fails to consider realistic open-world testing environments, where robots must know when they observe regions that do not fit the predefined vocabulary of K known classes (e.g., fallen-tree-trunk or overturned-truck) and recognize these regions as unknown obstacles.

1.2 Lidar Panoptic Segmentation in the Open World

We study LPS in the open world (LiPSOW, Fig. 1), motivated by real-world challenges: AV companies already operate autonomous fleets in different geo-locations, and these vehicles constantly observe new, previously unknown semantic classes over time. To study such situations, we introduce the LiPSOW evaluation protocol. For example, LiPSOW allows one to train models on SemanticKITTI (Behley et al., 2019), using its K common semantic classes for K-way classification of predicted segments and, importantly, gathering all remaining rare classes into a catch-all other class (Hendrycks et al., 2019; Kong & Ramanan, 2021) to better detect unknown objects. Evaluation is then performed on the KITTI360 (Liao et al., 2021) dataset, recorded in the same city with the same sensors but labeled with more classes (Lin et al., 2022). This effectively expands the class vocabulary to include instances of unknown classes. The main challenge in LiPSOW is to recognize and segment the K known classes defined in the SemanticKITTI vocabulary, and to recognize unknown classes that appear in the test set, i.e., KITTI360.

1.3 Technical Insights

Prior efforts in LPS (Aygün et al., 2021; Gasperini et al., 2021; Hong et al., 2021; Li et al., 2022) learn to group known classes but fail to generalize to unknown classes. Based on this observation, Wong et al. (2020) suggest learning to segment known classes and leaning on bottom-up point clustering methods (Teichman et al., 2011; Nunes et al., 2022; Wong et al., 2020; Moosmann et al., 2009) to segment unknown instances. Our findings, derived from our publicly available LiPSOW benchmark, instead suggest a unified treatment of known and unknown classes: (i) we learn which points do not correspond to the K known classes via outlier exposure (Hendrycks et al., 2019; Kong & Ramanan, 2021), (ii) we segment unknown and thing classes using class-agnostic bottom-up methods at multiple hierarchies, and (iii) we learn which segments in the segmentation tree are likely objects using labeled data, akin to class-agnostic training of a Region Proposal Network (Ren et al., 2015). Surprisingly, this approach is not only effective for segmenting novel instances with high fidelity, but also outperforms learned point-grouping methods on known thing classes.

1.4 Contributions

We make three major contributions: (i) We introduce LiPSOW, a new problem setting that extends LPS to the open world, and establish an evaluation protocol to study it. (ii) We repurpose existing LPS methods to address LiPSOW and comprehensively analyze their performance. (iii) Drawing insights from an in-depth analysis of existing LPS methods, we propose an approach that combines lidar semantic segmentation with a non-learned clustering algorithm and a learned scoring function. Our method effectively segments objects in a class-agnostic fashion, covering both known and unknown classes at test time. To foster future research, we make our code publicly available.

2 Related Work

2.1 Lidar Semantic and Panoptic Segmentation

Recent Lidar Semantic Segmentation (LSS) and Lidar Panoptic Segmentation (LPS) methods are data-driven and fueled by the developments in learning representations from point sets (Qi et al., 2016, 2017; Thomas et al., 2019) and publicly available densely labeled datasets (Behley et al., 2019, 2021; Fong et al., 2021). LSS methods classify points into K classes, for which dense supervision is available during training. Prior works focus on developing strong encoder-decoder-based architectures for sparse 3D data (Qi et al., 2016, 2017; Thomas et al., 2019; Yan et al., 2018; Choy et al., 2019; Tang et al., 2020; Zhu et al., 2021; Loiseau et al., 2022; Ye et al., 2021), the fusion of information obtained from different 3D representations (Alonso et al., 2020; Xu et al., 2021; Li et al., 2022) or neural architecture search (Tang et al., 2020). On the other hand, LPS methods (Behley et al., 2021; Sirohi et al., 2021; Gasperini et al., 2021) must additionally segment instances of thing classes. Early methods combine Lidar semantic segmentation networks with 3D object detectors, with a heuristic fusion of both sources of information (Behley et al., 2021). Efficient-LPS (Sirohi et al., 2021) follows two-stage image-based object detection architectures using a range-image-based convolutional backbone. Several methods focus on end-to-end learning using point-based (Thomas et al., 2019; Hong et al., 2021, 2024; Li et al., 2023) or sparse voxel (Zhu et al., 2021) backbones. Recent efforts focus on additional modalities such as camera-lidar fusion (Marcuzzi et al., 2023; Zhang et al., 2023) or different views of Lidar data, such as range-view map (Li et al., 2023) to improve model performance. In addition to learning to classify points, these methods learn to group points in space (Gasperini et al., 2021; Hong et al., 2021; Zhou et al., 2021; Razani et al., 2021; Li et al., 2022), space and time (Aygün et al., 2021; Kreuzberg et al., 2022), or resort to bottom-up geometric clustering to segment instances (Zhao et al., 2022, 2021). Unlike our work, these methods do not consider the open-world environment, in which a pre-fixed class vocabulary is insufficient to capture all semantic classes that are encountered during the autonomous operation.

2.2 Bottom-Up Lidar Instance Segmentation

Bottom-up grouping based on Euclidean distance has been used to isolate object instances in Lidar scans in a class-agnostic manner since the dawn of Lidar-based perception (Thorpe et al., 1991). Existing methods employ techniques such as flood-filling (Douillard et al., 2011; Teichman et al., 2011) and connected components (Klasing et al., 2008) estimated in the rasterized bird’s-eye view, as well as bottom-up grouping (Moosmann et al., 2009; Behley et al., 2013; McInnes et al., 2017) based on density-based (Ester et al., 1996) or graph-based (Wang et al., 2012) clustering. Nunes et al. (2022) propose to segment object instances with DBSCAN and refine segments using GraphCuts (Boykov & Funka-Lea, 2006). Since one-size-fits-all clustering parameters are difficult to obtain, Hu et al. (2020) propose constructing a hierarchical tree of several plausible Lidar segmentations, obtained using a density-based clustering method (Ester et al., 1996). These regions are then scored using a learned objectness regressor, and the optimal instance segmentation (w.r.t. the regressed objectness function) can be obtained via a cut in this tree. In this paper, we demonstrate that a data-driven Lidar Panoptic Segmentation network, in conjunction with hierarchical tree construction, forms a strong baseline for Lidar Panoptic Segmentation in an Open World.

2.3 Domain Adaptation for Lidar Segmentation

Domain adaptation aims to improve the generalization ability of segmentation models, trained on a source domain, by adapting them to the (unlabeled) target domain. Prior works focus on adapting 3D representations (Langer et al., 2020; Yi et al., 2021) or feature representations (Rist et al., 2019; Jiang & Saripalli, 2021; Wu et al., 2019; Shaban et al., 2023; Kong et al., 2023). While our paper focuses on identifying unseen novel objects (unknowns) under similar sensor distributions and geographic regions, these methods focus on adapting to target distributions under significant shifts in sensor configuration or environment.

2.4 Open-Set Recognition (OSR)

OSR requires training on data from K known classes and recognizing examples from unknown classes encountered during testing (Scheirer et al., 2012). Many OSR approaches train a K-way classification network and then exploit the trained model for OSR (Yoshihashi et al., 2019; Oza & Patel, 2019). Recent work shows that a more realistic setup is to allow access to some outlier examples during training (Hendrycks et al., 2019; Kong & Ramanan, 2021). Such outliers are used as instances of other class (i.e., held-out samples that do not correspond to pre-defined K-classes) during training, significantly boosting OSR performance. In the context of (lidar) semantic segmentation, Kong and Ramanan (2021); Cen et al. (2022) approximate the distribution of novel objects by synthesizing instances of novel classes. Different from the aforementioned, we tackle OSR through the lens of lidar panoptic segmentation.

2.5 Open-Vocabulary Object Detection

Recent efforts utilize bottom-up segmentation, combined with Kalman-filter-based object trackers (c.f., Teichman & Thrun, 2012; Dewan et al., 2015; Osep et al., 2018, 2020), to pseudo-label instances of moving objects in Lidar (Najibi et al., 2022; Zhang et al., 2023) or stereo video streams (Osep et al., 2019, 2018), and use these instances to train object detectors for moving objects. Moreover, Najibi et al. (2023) demonstrate that detected moving objects can also be classified in a zero-shot manner by distilling CLIP (Radford et al., 2021) features to Lidar. Different from the aforementioned, we tackle LPS, which entails dense segmentation and recognition of thing and stuff classes for moving as well as stationary objects. Segmented instances that our method classifies as unknown could be further classified in a similar fashion as proposed in Najibi et al. (2023) to obtain a fine-grained semantic interpretation of segmented regions.

2.6 Open-Set (Lidar) Segmentation

Early works by Teichman et al. (2011), Moosmann et al. (2009), Moosmann and Stiller (2013), and Held et al. (2016) can be understood as early attempts towards open-set Lidar instance segmentation. In Teichman et al. (2011), after bottom-up segmentation of individual point clouds, objects are tracked across time and classified as car, cyclist, pedestrian, or other. The works most similar to ours are Wong et al. (2020) and Hwang et al. (2021), which study OSR in lidar point clouds and images, respectively. Their setup assumes complete annotations for stuff classes, i.e., stuff classes are assumed to be exhaustively labeled. This is an unrealistic assumption, since new stuff classes (e.g., bridges and tunnels) may also be encountered at test time and must be recognized as novel. Differently, our setup assumes novel (i.e., unknown) classes can appear among both stuff and thing classes. This is a realistic setup, further justified by the ontology change from SemanticKITTI to KITTI360 (Lin et al., 2022), where several new stuff classes are encountered (e.g., Fig. 3: gate, wall, tunnel, bridge, garage, stop, rail track). This subtle yet crucial distinction separates LiPSOW from previous settings. Prior works (Wong et al., 2020; Hwang et al., 2021; Cen et al., 2022) build their methods and evaluation on the assumption that the unknown consists of only thing classes, i.e., they assume complete annotations for stuff. Our experimental validation (Sect. 5.3) confirms that the open-set semantic segmentation method of Cen et al. (2022), which assumes complete annotation of stuff classes, does not perform well in our proposed setup. While Wong et al. (2020) is the first work investigating LPS in open-set conditions, it conducts experiments on a proprietary dataset and does not release code or data. We repurpose publicly available datasets to foster future research on LiPSOW. Finally, we suggest a different approach to LiPSOW that unifies instance segmentation of known and unknown classes in a class-agnostic manner, in contrast to Wong et al. (2020), which learns to segment known classes and segments instances of novel classes only via DBSCAN.

3 Open World Lidar Panoptic Segmentation

In this section, we review the problem of Lidar Panoptic Segmentation (LPS) and discuss the limitations of its setup from the perspective of open-world deployment in Sect. 3.1. To address the limitations, we introduce LPS in an Open World setting (LiPSOW) in Sect. 3.2.

3.1 Lidar Panoptic Segmentation

Definition. LPS takes a Lidar point cloud \(\mathcal {P} = \{ \textbf{p}_i \in \mathbb {R}^3\}_{i=1}^N\) as input, and aims to classify points w.r.t. a predefined vocabulary \(\mathcal {K} = \{1, \ldots , K\}\) of K categorical labels and segment object instances. Categorical labels are divided into (1) thing classes, covering countable objects such as cars and persons, and (2) stuff classes that cover uncountable amorphous regions such as road and vegetation. For thing points, LPS methods must segment object instances (e.g., every car). Mathematically, LPS methods learn a function \(f(\cdot ; {\varvec{\theta }})\) parameterized by \({\varvec{\theta }}\), mapping an input point \(\textbf{p}_i\) to a semantic label k and object instance ID\(_i\), i.e., \(f(\textbf{p}_i; {\varvec{\theta }}) \rightarrow (k, \text {ID}_i)\), where \(k\in \mathcal {K}\), ID\(_i\) is a unique ID for the object instance \(\textbf{p}_i\) belongs to, and particularly, ID\(_i=0\) means that \(\textbf{p}_i\) belongs to one of the stuff classes (i.e., class-k is stuff). LPS measures the per-point classification accuracy (i.e., Lidar semantic segmentation) and per-instance segmentation accuracy.

Remarks. LPS does not properly formulate the real-world case in which there exist points belonging to an unknown catch-all superclass, which contains various unknown classes encountered only during testing. For example, a vocabulary in AVs likely does not have labels such as sliding-unattended-stroller or fallen-tree-trunk, but AVs must segment them into individual instances for safe maneuvers such as “stop”, “yield”, or “change-lane”. We address this in detail below.

3.2 LPS in an Open World

Definition. Extending LPS, Lidar Panoptic Segmentation in the Open World (LiPSOW) further requires classifying points into an unknown class if they do not belong to any of the predefined K semantic classes, and segmenting them as unknown instances. That is, unknown covers all classes that do not correspond to any of the K predefined classes and may contain unannotated instances that, without prior knowledge, cannot be treated as a stuff or thing class. Formally, we define the unknown class as the \((K+1)^{th}\) class, so LiPSOW methods learn a function \(f(\cdot ; {\varvec{\theta }})\) parameterized by \({\varvec{\theta }}\), mapping an input point \(\textbf{p}_i\) to a semantic label k and instance ID\(_i\), i.e., \(f(\textbf{p}_i; {\varvec{\theta }}) \rightarrow (k, \text {ID}_i)\), where \(k\in \{1, \dots , K, K+1\}\). As before, ID\(_i\) is a unique instance ID for point i, and ID\(_i=0\) implies that class-k is stuff.
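The following minimal sketch (an illustration of the label convention above, not the benchmark's actual data format) shows one way to represent LiPSOW predictions: a per-point semantic label in \(\{1, \dots , K+1\}\) and an instance ID, with ID 0 reserved for stuff points. The class indices and example values are hypothetical.

```python
import numpy as np

K = 19           # number of known classes in the predefined vocabulary (illustrative value)
UNKNOWN = K + 1  # catch-all label for points outside the K known classes

# Hypothetical prediction for a cloud of N points: each point gets a semantic
# label k in {1, ..., K+1} and an instance ID (0 for stuff points).
N = 5
semantic = np.array([1, 1, 7, UNKNOWN, UNKNOWN])  # e.g., 1 = car, 7 = road (stuff)
instance = np.array([1, 1, 0, 2, 2])              # two car points share ID 1, road has ID 0,
                                                  # an unknown object gets its own ID 2

# Panoptic output as (semantic, instance) pairs per point.
panoptic = np.stack([semantic, instance], axis=1)
print(panoptic)
```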

3.2.1 Significance

By definition, LiPSOW algorithms should be able to distinguish unknown objects from the predefined K classes and to segment the corresponding object instances. This ability is useful for many applications. First, recognizing unknown objects is crucial for safety-critical robotic operation, e.g., AVs should recognize a never-before-seen sliding-unattended-stroller to avoid collisions and casualties. Second, unknown instances could be used in conjunction with active learning (Ren et al., 2021; Zhan et al., 2022) to select valuable examples (those recognized as unknown), reducing data collection and annotation costs.

Remark I. Can unknowns be seen during training? Related to LiPSOW is open-set recognition, the task of recognizing unknown examples at test time. A conventional setup is that all training data is labeled w.r.t. the K predefined classes, and test-time examples may originate from any of the K known classes or the (K+1)\(^{th}\) unknown class (Scheirer et al., 2012; Bendale & Boult, 2016; Yoshihashi et al., 2019; Oza & Patel, 2019). While these works suggest that unknown examples should not be part of the training set, recent work has comprehensively demonstrated that a more reasonable setup is to explicitly exploit outlier data or diverse other examples during training. In particular, Hendrycks et al. (2019) and Kong and Ramanan (2021) show that such models effectively generalize to truly unseen examples. With real-world AV applications in mind, we consider the latter setup, i.e., the training set contains instances of additional classes, labeled as other. Instances of these classes are available during training but, importantly, do not overlap with the K known classes. They serve as possible instances of the unknown, (K+1)\(^{th}\) class. Note that the other class is often named void or unlabeled in many contemporary benchmarks (Cordts et al., 2016; Neuhold et al., 2017).

Remark II. How to define unknown semantic classes? Strictly defining all objects that could appear in the open-world would be difficult. Instead, inspired by the image segmentation literature (Martin et al., 2001; Fomenko et al., 2022), which suggests that humans have a consensus on which image regions constitute object instances, we define unknown instances as those that were (i) not annotated w.r.t. a categorical label in the train-set but (ii) labeled as objects by human annotators in the test-set.

Remark III. Can only novel thing classes be regarded as unknowns? Prior efforts tackling the open-set setting in images and point clouds (Wong et al., 2020; Hwang et al., 2021) assume stuff classes are exhaustively labeled. This is an unrealistic assumption, since new stuff classes (e.g., Fig. 3: tunnel, bridge, rail track, etc.) may also be encountered at test time and must be recognized as novel. This subtle yet crucial distinction separates LiPSOW from prior work.

4 Methodology

Fig. 2

Open-World Lidar Panoptic Segmentation (OWL): We first run a \(K+1\)-way semantic segmentation network on a point cloud (a, b) and classify points as stuff, thing, and unknown (point color encodes semantic classes, red points represent unknowns). Then we construct a hierarchical tree of “all possible” segments for thing and unknown points (c) and train a segment-scoring function to cut the tree (d), finally producing instance and semantic segmentation results (e) (Color figure online)

Existing methods for LPS (Aygün et al., 2021; Gasperini et al., 2021; Zhou et al., 2021; Razani et al., 2021) learn to classify points and learn to group points that represent thing classes. These methods work under the LPS setting, where semantic and instance-level supervision is given for all classes. For LiPSOW, we need to rethink existing methodology: LiPSOW methods must (just as in LPS) recognize the predefined semantic classes and segment instances of thing classes. Additionally, they need to cope with the inherent difficulty of recognizing and segmenting the unknown class, which mixes stuff and things. Unfortunately, for this catch-all class, exhaustive semantic and instance-level supervision is not available.

To design a LiPSOW method, named Open-World Lidar Panoptic Segmentor (OWL), we draw inspiration from two-stage object detectors (Ren et al., 2015). It has been shown in the literature that such networks can be repurposed for image-based open-set (Dhamija et al., 2018) and open-world object detection (Hwang et al., 2021; Weng et al., 2021; Liu et al., 2022; Joseph et al., 2021; Fomenko et al., 2022). Two-stage networks have also been adopted for 3D object detection (Shi et al., 2019; Chen et al., 2015). However, a 3D analogue of the region proposal network (RPN), a key component that would allow us to recognize instances of novel classes, is non-trivial due to the large 3D search space (Shi et al., 2019; Chen et al., 2015). For this reason, prior works constrain the set of anchor boxes to the mean size of each semantic class (e.g., car- and pedestrian-sized boxes). This approach does not scale to the large variety of object instances that may appear in the other class. Instead, we rely on the observations of early work on Lidar perception (Teichman et al., 2011; Teichman & Thrun, 2012; Held et al., 2016; Behley et al., 2013), which shows that simple bottom-up grouping of points yields a compact set of class-agnostic object candidates. In the following sections, we outline a simple and effective method for LiPSOW, based on data-driven approaches (Thomas et al., 2019; Aygün et al., 2021), perceptual-grouping-based methods (Klasing et al., 2008; Douillard et al., 2011; Hu et al., 2020; Behley et al., 2013), and recent findings in open-set recognition (Kong & Ramanan, 2021).

4.1 High-Level Overview

We propose a two-stage network for LiPSOW, trained sequentially. We adopt an encoder-decoder point-based backbone (Thomas et al., 2019) to classify points in a \(K+1\)-way fashion. We explicitly train our network to distinguish points of the K known classes from the other class, which serves as a surrogate for the unknown class during training (Fig. 2b). This network predicts labels for stuff, thing, and the mixed other class. In the second stage, we run a non-learned clustering algorithm on points recognized as thing or other (Fig. 2c) and apply a learned scoring function to derive the final instance segmentation. To this end, we produce a hierarchical segmentation tree (c.f., Hu et al. (2020)), train a second-stage network that estimates how likely a point segment is an object, and run a tree-cut algorithm (Hu et al., 2020) to obtain a unique, globally optimal point-to-instance assignment (Fig. 2d). Importantly, this method treats the thing and other classes in a unified manner, producing instance segmentations for both. We present the individual components of our Open-World Lidar Panoptic Segmentor (OWL) baseline below.
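To make the two-stage structure concrete, the following sketch outlines the inference flow; all function arguments are placeholders for the components described in the next subsections, not names from our released implementation.

```python
import numpy as np

def owl_inference(points, semantic_net, thing_or_other_labels,
                  build_segment_tree, score_segment, cut_tree):
    """Sketch of OWL inference for a single scan `points` of shape (N, 3).

    semantic_net(points)          -> per-point labels in {1, ..., K+1}
    build_segment_tree(subset)    -> list of (node_id, parent_id, point_indices)
    score_segment(segment_points) -> objectness score in [0, 1]
    cut_tree(tree, scores)        -> list of point-index arrays (final instances)
    """
    labels = semantic_net(points)                       # stage 1: K+1-way point classification
    mask = np.isin(labels, thing_or_other_labels)       # keep thing and other points
    subset = points[mask]
    tree = build_segment_tree(subset)                   # stage 2: hierarchical grouping
    scores = {nid: score_segment(subset[idx]) for nid, _, idx in tree}
    instances = cut_tree(tree, scores)                  # unique point-to-instance assignment
    return labels, instances
```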

4.2 Semantic Segmentation Network

We train an encoder-decoder architecture that operates directly on the point cloud \(P \in \mathbb {R}^{N\times 3}\). In particular, we train a well-established KPConv (Thomas et al., 2019) network with deformable convolutions; however, we note that a variety of backbones suitable for learning representations from unstructured point sets could be used (Qi et al., 2016; Yan et al., 2018). We use the KPConv-based LPS network (Aygün et al., 2021) due to its (i) open-source implementation and (ii) point-based backbone, which directly learns fine-grained per-point features, as opposed to 3D sparse-convolutional networks (Choy et al., 2019) that estimate per-voxel features. We attach a semantic classifier on top of the decoder feature representation \(F \in \mathbb {R}^{N \times D}\) that outputs a semantic map \(S \in \mathbb {R}^{N \times {(K + 1)}}\). We train this network head using the cross-entropy loss. The difference from conventional (lidar) semantic segmentation training is that we explicitly introduce an additional catch-all class by holding out rare (other) classes during training (Sect. 5.1.2). This class is analogous to the catch-all background class (Ren et al., 2015), common practice in training two-stage object detectors. The final (K+1)-way softmax provides a smooth distribution indicating the likelihood of a point being one of the K classes or the other class. Points classified as other at test time are treated as unknowns.

Such a catch-all other class is not common practice in training semantic segmentation networks, as it is usually assumed that points (or pixels) are densely and exhaustively labeled. In LiPSOW, however, this assumption no longer holds; without the catch-all class, we would incentivize the network to label every point as one of the K known classes.
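As a minimal sketch (not our exact training code), the catch-all class can be realized by widening the classification head to \(K+1\) outputs and remapping held-out rare classes to the extra index before applying the standard cross-entropy loss; K, D, and N below are illustrative placeholders.

```python
import torch
import torch.nn as nn

K = 10   # number of known classes (illustrative)
D = 256  # per-point feature dimension from the backbone (placeholder)
N = 4096 # points per scan (placeholder)

# Semantic head on top of per-point decoder features F in R^{N x D}:
# outputs S in R^{N x (K+1)}, where index K is the catch-all "other" class.
head = nn.Linear(D, K + 1)

features = torch.randn(N, D)            # stand-in for backbone features
labels = torch.randint(0, K + 1, (N,))  # held-out rare classes remapped to label K

logits = head(features)                 # (N, K+1)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()

# At test time, points whose argmax is the catch-all index are treated as unknown.
pred = logits.argmax(dim=1)
unknown_mask = pred == K
```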

4.3 Segmenting any Object

Using a proximity-based point grouping method, we can construct combinatorially many possible point segments from a point cloud of size N. We learn a function \(f(p) \rightarrow [0,1],\;p \subset P \in \mathbb {R}^{N\times 3}\) that scores objectness of a subset of points in a data-driven manner to indicate how likely a point segment encapsulates an object. This is analogous to image-based object proposal generation methods (Alexe et al., 2012; Zitnick & Dollár, 2014) that adopt sliding window search and learn an objectness score to rank windows. The advantage of such methods over recent work on data-driven pixel/point grouping (Kong & Fowlkes, 2018; Aygün et al., 2021; Razani et al., 2021; Zhou et al., 2021) is that the set of possible objects should naturally cover all objects, irrespective of class labels.

To understand whether our segmentation tree covers most relevant objects, we measure recall using labeled instances. To this end, we follow Hu et al. (2020) and construct a hierarchical segmentation tree T by applying Euclidean clustering recursively with a decreasing distance threshold using DBSCAN (Ester et al., 1996), with the parameters recommended by Hu et al. (2020). Our experiments show that this approach recalls \(97.2\%\) of instances labeled in the SemanticKITTI (Behley et al., 2019, 2021) validation set (see Sect. 5.2). This shows that the approach can segment a large variety of objects without needing to learn how to group instances of known classes: these instances are already included in the segmentation tree.
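The sketch below illustrates the recursive construction of such a segmentation tree with DBSCAN; the distance-threshold schedule and min_samples value are illustrative placeholders rather than the exact parameters recommended by Hu et al. (2020).

```python
import numpy as np
from sklearn.cluster import DBSCAN

def build_segment_tree(points, indices=None, eps_schedule=(2.0, 1.0, 0.5, 0.25), min_samples=5):
    """Return a tree of candidate segments; each node holds the point indices
    of one candidate segment and a list of child nodes."""
    if indices is None:
        indices = np.arange(len(points))
    node = {"indices": indices, "children": []}
    if len(eps_schedule) == 0 or len(indices) < min_samples:
        return node
    labels = DBSCAN(eps=eps_schedule[0], min_samples=min_samples).fit_predict(points[indices])
    for cluster_id in np.unique(labels):
        if cluster_id == -1:  # DBSCAN noise: not grouped at this level
            continue
        child_indices = indices[labels == cluster_id]
        # Recurse with the next (smaller) threshold to obtain finer segments.
        node["children"].append(
            build_segment_tree(points, child_indices, eps_schedule[1:], min_samples)
        )
    return node

# Usage on a toy point cloud (in practice: the thing/other points of a lidar scan).
pts = np.random.rand(1000, 3) * 20.0
tree = build_segment_tree(pts)
```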

4.4 Learning an Objectness Function

There are several ways to learn such a function \(f(p) \rightarrow [0,1]\) that estimates how likely a subset of points represents an object. One approach is to estimate a per-point objectness score. Following Aygün et al. (2021), this can be learned by regressing a truncated distance \(O \in \mathbb {R}^{N \times 1}\) to the nearest labeled instance center atop the decoder features \(F \in \mathbb {R}^{N \times D}\). The objectness value can then be averaged over the segment \(p \subset P\). Alternatively, we can train a holistic classifier as a second-stage network by pooling point segment features, followed by fully-connected layers, similar to the PointNet (Qi et al., 2016) classification network. In this case, we pre-build hierarchical segmentation trees \(T_i\) for each point cloud i in the training set and minimize a training loss based on the signal we obtain from matching segments in the trees to the set of labeled instances, \(GT_i\). One possibility is to use a binary cross-entropy loss (similar to how the RPN is trained); alternatively, we can directly regress the objectness value to be proportional to the point-set intersection-over-union. We detail the network architecture and training recipes in the appendix (“B Implementation Details” Section) and discuss design choices for training such a network in Sect. 5.2.
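A sketch of the holistic second-stage scorer described above: per-point backbone features belonging to a candidate segment are max-pooled (PointNet-style) and passed through a small MLP, supervised either with binary cross-entropy on matched/unmatched segments or by regressing the point-set IoU with the best-matching labeled instance. Layer sizes and the example targets are illustrative, not our exact configuration.

```python
import torch
import torch.nn as nn

class SegmentObjectness(nn.Module):
    """Pool per-point features of a segment and predict an objectness score in [0, 1]."""

    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, segment_features):             # (M, feat_dim) features of one segment
        pooled = segment_features.max(dim=0).values  # PointNet-style max pooling
        return self.mlp(pooled).squeeze(-1)          # scalar objectness

scorer = SegmentObjectness()
seg_feats = torch.randn(300, 256)  # stand-in for features of one candidate segment
score = scorer(seg_feats)

# Option 1: binary cross-entropy against a matched / not-matched label.
bce = nn.functional.binary_cross_entropy(score, torch.tensor(1.0))
# Option 2: regress the point-set IoU with the best-matching ground-truth instance.
target_iou = torch.tensor(0.8)  # hypothetical matched IoU
reg = nn.functional.mse_loss(score, target_iou)
```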

4.5 Unique Point-to-Instance Assignment

The segmentation tree T provides a hierarchy of point segments and their corresponding scores. However, for LiPSOW, we need to assign points to instances uniquely. Intuitively, this property is satisfied by any “cut” in the tree: we can simply output the leaf nodes of the tree after the cut is performed. The question then boils down to where to cut such that the overall segmentation score is as good as possible according to some criterion. It was shown in Hu et al. (2020) that we can compute the optimal worst-case segmentation efficiently by simply traversing the tree, as long as segments become strictly smaller at each tree level. The optimal worst-case segmentation is the one that yields the best overall (global) segmentation score when that score is defined as the worst objectness among its local segments (this can be evaluated efficiently by looking at the leaf nodes after the cut). This approach ensures a unique point-to-segment assignment, i.e., no overlap. This algorithm is not a contribution of this paper; for completeness, we provide it with a detailed explanation in the appendix (“C.1 Segmentation Tree Generation” Section).
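For illustration, the recursion below paraphrases the worst-case tree cut on the node dictionaries produced by the tree-construction sketch above: a node is kept as a single segment if its objectness is at least as good as the worst score achievable by optimally cutting its children. It assumes each node's children partition the node's points, and `objectness` is an assumed dictionary of predicted scores keyed by node identity; this is a sketch, not a verbatim re-implementation of Hu et al. (2020).

```python
def worst_case_cut(node, objectness):
    """Return (segments, worst_score) for the subtree rooted at `node`.

    The global score of a segmentation is defined as the worst objectness
    among its segments; this recursion maximizes that worst-case score.
    """
    own_score = objectness[id(node)]
    if not node["children"]:
        return [node], own_score
    child_segments, child_worst = [], 1.0
    for child in node["children"]:
        segs, worst = worst_case_cut(child, objectness)
        child_segments.extend(segs)
        child_worst = min(child_worst, worst)
    # Keep the parent as one segment only if it scores no worse than the
    # best achievable worst-case score of its children's segmentation.
    if own_score >= child_worst:
        return [node], own_score
    return child_segments, child_worst

# Usage (with `tree` from the construction sketch and a learned scorer):
#   scores = {id(n): predicted_objectness(n) for n in all_nodes(tree)}
#   segments, _ = worst_case_cut(tree, scores)
```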

4.6 Inference

At inference time, we first make a forward pass through our network, construct the segmentation tree on points classified as thing or other (i.e., unknown), and run our objectness classifier on each segment in the tree. We then run the tree-cut algorithm to obtain a unique point-to-instance assignment. Finally, as semantic labels within a segment may be inconsistent, we assign the majority-vote semantic label to each segmented instance.
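A small sketch of this final harmonization step (variable shapes are illustrative): each instance's points are overwritten with the most frequent semantic label among them.

```python
import numpy as np

def majority_vote(semantic_labels, instance_masks):
    """Overwrite per-point semantic labels so that every instance has a single class.

    semantic_labels: (N,) per-point class predictions (non-negative integers)
    instance_masks:  list of boolean arrays of shape (N,), one per segmented instance
    """
    labels = semantic_labels.copy()
    for mask in instance_masks:
        votes = labels[mask]
        if len(votes) > 0:
            labels[mask] = np.bincount(votes).argmax()  # most frequent label wins
    return labels
```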

Fig. 3

We base the LiPSOW setup on the SemanticKITTI (Behley et al., 2019, 2021) and KITTI360 (Liao et al., 2021) datasets. We train and validate models on SemanticKITTI, and re-purpose the KITTI360 dataset, which contains classes not labeled in SemanticKITTI (i.e., unknown), as a test set. A detailed list of our taxonomy is provided in the appendix (Table 5) (Color figure online)

5 Experiments

In this section, we first outline our experimental setup for LiPSOW (Sect. 5.1). Then, we discuss and ablate our Open-World Lidar Panoptic Segmentor (OWL) (Sect. 5.2) on a standard lidar panoptic segmentation (LPS) benchmark, SemanticKITTI (Behley et al., 2019, 2021). Finally, we demonstrate the generality of our approach with cross-dataset evaluation (Sect. 5.3).

5.1 Evaluation Protocol of LiPSOW

We set up an evaluation protocol to simulate the conditions that occur when a robot operates within a certain geographic region (i.e., a city), e.g., a robot-taxi fleet. Here, data is recorded by the same sensor type (cross-sensor domain generalization is out of scope for this paper), which is reasonable in practice when deploying robot taxis of the same type. Importantly, however, even though we focus on a certain geographic region, source and target data must be recorded at disjoint locations (i.e., sequences that appear in the test set should not be recorded in precisely the same city districts). In such a setting, domain shifts are often gradual: as we record more data over time, we observe a larger variety of known regions and more other objects that appear in the long tail of the object class distribution.

5.1.1 Open-World Lidar Panoptic KITTI

To study LiPSOW in such a setting, we base our experimental setup on SemanticKITTI (Behley et al., 2019) and KITTI360 (Liao et al., 2021) datasets. They were recorded in distinct regions of Karlsruhe, Germany (Table 3a). We use SemanticKITTI for model training and validation, and KITTI360 sequences only for testing.

5.1.2 Source (Train/Val) Domain

In Fig. 3b (blue “known” set) we visualize the SemanticKITTI classes (Behley et al., 2019). The dashed inner circle (pink) denotes the rarer classes, which we merge into a single other class. These are examples of regions different from the K known classes, i.e., the \((K+1)^{th}\) catch-all class. This allows evaluating how well the model learns to separate known classes from other by measuring IoU for the other class and mIoU for all classes within the source domain. We call this taxonomy Vocabulary 1. However, with this construction of Vocabulary 1, classes such as bicycle and motorcycle end up in other, even though they are important for autonomous driving and commonly observed in urban environments. Therefore, we also construct Vocabulary 2, which holds out only the rarest categories as other. We provide further details on vocabulary construction and taxonomy in the appendix (“A LiDAR Panoptic Segmentation in Open-World” Section and Table 5).

5.1.3 Target (Test) Domain

We evaluate models on KITTI360, which encompasses all SemanticKITTI classes and, importantly, an additional 10 thing classes (with instance labels) and 7 stuff classes, which are used as novel classes in our experiments. We discuss vocabulary changes that ensure the SemanticKITTI and KITTI360 vocabularies are consistent in the appendix (“A LiDAR Panoptic Segmentation in Open-World” Section and Table 5).

Table 1 LPS results on SemanticKITTI validation set.

5.1.4 Metrics

We repurpose evaluation metrics proposed in the context of semantic segmentation (mean intersection-over-union, mIoU (Everingham et al., 2010)) and panoptic segmentation (panoptic quality, PQ (Kirillov et al., 2019)). To quantify point classification performance (mIoU), we simply treat the other class as “just one more class”. To quantify panoptic segmentation, we split the evaluation into known classes and other classes. We evaluate known classes using \(PQ = SQ \times RQ\), as defined by Kirillov et al. (2019). The segmentation quality (SQ) term averages instance-level IoU over true positives (TP), while the recognition quality \(\left( RQ = \frac{|\text {TP}_c|}{|\text {TP}_c| + \frac{1}{2} |\text {FP}_c| + \frac{1}{2} |\text {FN}_c|}\right) \) is the F1 score (harmonic mean of precision and recall). For other classes, the task definition specifies neither which semantic classes are thing classes nor the vocabulary of target instance classes. As we cannot annotate every possible semantic class, it is important not to penalize false positives (FPs), as these cannot be clearly defined. Therefore, we follow Wong et al. (2020) and Liu et al. (2022) and replace the RQ term with recall \(\left( \frac{|\text {TP}_c|}{|\text {TP}_c| + |\text {FN}_c|}\right) \), yielding what we call UQ (unknown quality). Note that our UQ is computed slightly differently from the UQ introduced by Wong et al. (2020), because we do not penalize segment predictions that overlap unlabeled stuff in the other class.
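To make the metric concrete, a minimal sketch of UQ for the unknown class is given below, assuming predicted segments have already been matched to ground-truth unknown instances at IoU > 0.5 (the PQ convention); unlike PQ, unmatched predictions (false positives) do not enter the denominator.

```python
def unknown_quality(matched_ious, num_gt_instances):
    """UQ = SQ x Recall for the unknown class.

    matched_ious: list of IoU values, one per ground-truth unknown instance
                  matched to a prediction with IoU > 0.5 (true positives).
    num_gt_instances: total number of labeled unknown instances (TP + FN).
    """
    tp = len(matched_ious)
    if tp == 0 or num_gt_instances == 0:
        return 0.0
    sq = sum(matched_ious) / tp     # segmentation quality: mean IoU over TPs
    recall = tp / num_gt_instances  # replaces RQ; false positives are not penalized
    return sq * recall

# Example: 3 of 5 labeled unknown instances recalled with IoUs 0.9, 0.7, 0.6.
print(unknown_quality([0.9, 0.7, 0.6], 5))  # ~0.44
```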

5.2 Lidar Panoptic Segmentation

A pre-requisite for good performance on LiPSOW is good performance on known classes. Therefore, we first compare our method with state-of-the-art LPS methods on SemanticKITTI (Behley et al., 2019) in Table 1. The top-performing method on the val-set for known classes is GP-S3Net (Razani et al., 2021). Due to its strong Transformer-based backbone, GP-S3Net obtains \(73.0\%\) mIoU and the highest PQ of \(63.3\%\). The remaining methods are in the ballpark of 60–65% mIoU and 55–59% PQ. We note that methods such as CPGNet (Li et al., 2022) and LCPS (Zhang et al., 2023) are also highly performant; however, they utilize multi-modal inputs (range-views and images, respectively). We base our OWL on the well-established and easily extendable KPConv (Thomas et al., 2019), similar to 4D-PLS.

5.2.1 Semantic Oracle

With the semantic oracle experiment, we aim to answer the question: how far can we get in LPS with our baseline? To answer this, we replace our learned classification network with ground-truth semantic maps (GT semantic map), available in the validation set, but retain our instance branch. This yields a near-perfect PQ of \(98.3\%\): we recall \(97.2\%\) of thing objects with \(99.4\%\) precision. This experiment questions whether point-to-instance grouping needs to be learned at all; nearly all labeled instances are already included in our segmentation tree. Moreover, it suggests that future efforts should focus on further improving point-classification performance, and that semantic labeling may be sufficient: near-perfect instance labels can be obtained via hierarchical clustering.

5.2.2 Objectness Oracle

Given a KPConv-based semantic network, what performance can we obtain with a perfect objectness scoring function? To answer this, we use GT instance labels to score each segment by the IoU with its best-matching GT segment (obj. oracle). With this approach, we obtain \(59.2\%\) PQ. By propagating the semantic majority vote within each segment, we can further improve precision and recall and, consequently, PQ (\(+0.5\%\)). This is the upper bound we can obtain with the KPConv semantic backbone.

5.2.3 Ablations

We start with a variant of our network with semantic and objectness heads (c.f., (Aygün et al., 2021)). In this case, we compute per-segment objectness by averaging per-point objectness scores (see Sect. 4). The class-specific variant builds the segmentation tree separately for each semantic class, while the class-agnostic variant builds it on all thing and other points, effectively dropping fine-grained semantic information. With both, we obtain a PQ of \(58.7\%\), a \(+2.2\%\) improvement over 4D-PLS, which uses identical segmentation and objectness networks for inference. With the class-agnostic variant, we observe an improvement of \(+0.2\%\) in precision (\(79.9\%\)). We conclude this approach is more accurate because it is less sensitive to errors in semantic classification (e.g., a truck classified partly as truck and partly as car cannot be holistically segmented by the class-specific variant). That said, we can safely treat thing and other classes in a unified manner. Next, instead of averaging per-point objectness, we train a holistic second-stage objectness classifier using a cross-entropy loss (see Sect. 4) and observe a PQ improvement (\(58.8\%\), \(+0.1\%\)). Training the second-stage network with a regression loss instead gains an additional \(+0.1\%\) in PQ. We hypothesize that the regression loss yields smoother objectness scores compared to sharp-peaked (and overconfident) binary classification scores, which is beneficial for the tree-cut algorithm.

Table 2 Results of Lidar Semantic Segmentation. Methods are trained on SemanticKITTI under the specified vocabulary and evaluated on the SemanticKITTI validation set. Experiments indicate that OSeg (Cen et al., 2022) struggles to generalize to held-out other classes
Table 3 Open-World Lidar Panoptic KITTI

5.3 Open-World Lidar Panoptic Segmentation

We now study within-dataset performance on SemanticKITTI and cross-dataset performance on SemanticKITTI \(\rightarrow \) KITTI360 for two different source-domain vocabularies in Table 3. Vocabulary 1 merges rarer classes into a catch-all other class (as discussed in Sect. 5.1), while Vocabulary 2 closely follows the original SemanticKITTI class definitions. We provide further details on the construction of these vocabularies in the appendix (“A LiDAR Panoptic Segmentation in Open-World” Section). We report results for known thing and stuff classes using Panoptic Quality and mean IoU, and for the unknown class (i.e., points classified as other during inference on KITTI360) we report Unknown Quality (UQ), Recall, and IoU. We analyze open-world generalization based on cross-dataset performance (i.e., on the unknown class), not on within-dataset performance.

5.3.1 Baselines

We compare our OWL to vanilla 4D-PLS (Aygün et al., 2021), trained in a single-scan setting. This network uses the same KPConv backbone (Thomas et al., 2019). We also train PolarSeg-Panoptic (Zhou et al., 2021), one of the top performers on standard LPS (see Table 1) for which source code is available. 4D-PLS\(^{\dagger }\) modifies the inference procedure: for points classified as other, we lower the center-objectness threshold. We provide details in the appendix (“B Implementation Details” Section). This is based on the intuition that the objectness head of 4D-PLS should generalize to novel classes, but with lower confidence. Finally, OWL\(^\ddagger \) is a modified variant of OWL that uses a learned point-grouping mechanism (c.f., Aygün et al. (2021)) for thing classes and the hierarchical segmentor for the other class (i.e., it does not treat instance segmentation of thing and other classes in a unified manner). This baseline is inspired by Wong et al. (2020); however, we use 4D-PLS as the backbone network and bottom-up grouping as described in Sect. 4.

5.3.2 Lidar Semantic Segmentation

We compare our approach to OSeg (Cen et al., 2022), a state-of-the-art method for open-set Lidar Semantic Segmentation (LSS), in Table 2. OSeg only tackles semantic point classification of Lidar scans and does not address the instance segmentation aspect of Lidar Panoptic Segmentation. OSeg consists of two stages: (i) open-set semantic segmentation, where each point is classified into one of the K known classes or a catch-all class using redundant classifiers, and (ii) incremental learning, where the catch-all categories are incorporated into the model. For a fair comparison with our setting, we report the performance of the first stage of OSeg, without incremental learning. We validate performance using our proposed vocabulary splits on SemanticKITTI using the publicly available implementation (details in the appendix). As OSeg performs only point classification (i.e., semantic segmentation), we can compare to it only in terms of mean intersection-over-union against our base network for point classification (4DPLS). Results show that OSeg (Cen et al., 2022) significantly underperforms our base 4DPLS network on both known and unknown classes. OSeg's weak performance on the unknown class may be caused by its object synthesis, which implicitly assumes that the other category consists only of thing and not stuff classes, violating the spirit of LiPSOW.

5.3.3 Lidar Panoptic Segmentation

Vocabulary 1 In Table 3, OWL is the top performer for known classes (\(+1.6\%\) w.r.t. 4D-PLS and \(+0.8\%\) w.r.t. PolarSeg-Panoptic). Similarly, for other, OWL recalls \(48.4\%\) of objects, compared to \(10.8\%\) recalled by 4D-PLS and \(14.7\%\) by PolarSeg-Panoptic. Note that the only other objects with instance labels in SemanticKITTI are bicycle and motorcycle. For the unknown class in the cross-domain setting (KITTI360), 4D-PLS recalls only \(2.0\%\) of instances (\(1.3\%\) UQ). By changing the inference, 4D-PLS\(^\dagger \) recalls \(6.0\%\) of objects. OWL recalls \(45.1\%\) of objects, leading to \(36.3\%\) UQ. This suggests there is significant potential for further improvement without modifying our instance segmentor: the bottleneck appears to be the point-level classifier (see the low IoU of \(11.4\%\)). This result also highlights that SemanticKITTI by itself is not sufficient for studying open-world LPS methods due to its limited number of classes with instance labels. KITTI360 contains a significantly larger number of instances in the unknown class, exposing the poor generalization of methods trained to segment only the K known classes.

In SemanticKITTI\(\rightarrow \) KITTI360, we observe a performance drop in cross-domain evaluation, including for known classes. 4D-PLS performance drops from \(67.8\%\) (SemanticKITTI) to \(56.1\%\) PQ (KITTI360) and from \(79.8\%\) to \(65.3\%\) mIoU. PolarSeg-Panoptic works well when evaluated in a within-domain setting but fails to generalize to KITTI360 (\(0.7\%\) PQ). This suggests that existing models are very sensitive to data distribution shifts. Future efforts should aim to improve point classification performance not only within-domain but also in cross-domain settings.

Table 4 Per-class results (Vocabulary 2) w.r.t. PQ for known classes and UQ for the unknown. We observe a performance drop for rarer classes (e.g., motorcycle) and among stuff classes (e.g., building, vegetation) that may look different in different areas of the city

Vocabulary 2 This setting follows the official SemanticKITTI vocabulary more closely and exposes a smaller number of semantic classes as instances of the other class during training. We observe that Vocabulary 1 generalizes much better across datasets, suggesting that grouping rare classes into a catch-all other class leads to better generalization.

SemanticKITTI\(\rightarrow \) KITTI360 gap Why does performance drop from SemanticKITTI to KITTI360, even though both were recorded in the same city with the same sensor? To answer this question, we analyze per-class performance in Table 4 (Vocabulary 2) and confusion tables (Fig. 5). For common thing classes, there is a minimal performance drop (\(90.6\rightarrow 88.3\) PQ for car): cars should indeed look identical in different regions of the city. As expected, we observe a performance drop for rarer thing classes (\(65.4\rightarrow 45.8\) for motorcycle, \(40.3 \rightarrow 23.2\) for bicycle), as only a handful of instances of these classes are observed in the (dominantly static) SemanticKITTI. We observe larger performance drops for stuff classes (e.g., building, vegetation, fence, terrain). The reason for this drop is two-fold. First, KITTI360 covers a larger area of the city that does not overlap with SemanticKITTI; we therefore observe a larger diversity of these classes, which confuses the semantic classifier. Second, these classes are often confused with stuff classes labeled only in KITTI360. For example, building is commonly confused with the KITTI360 classes garage and wall, and fence is confused with wall and gate. The class trash bin is often confused with the stuff class sidewalk, likely due to context: trash bins are usually seen on sidewalks. Outlier synthesis (Cen et al., 2022; Kong & Ramanan, 2021) could be used to reduce this confusion in the future.

5.4 Confusion Analysis

To further analyze the per-class semantic segmentation performance, we plot extended confusion matrices, similar to those reported in open-set object detection (Dhamija et al., 2020). The horizontal axis represents the ground-truth classes, and the vertical axis represents OWL predictions. We extend the other class into its fine-grained split on the horizontal axis. Since Vocabulary 1 consists of more held-out classes, we visualize the matrices for this setting.

5.4.1 SemanticKITTI

In Fig. 4, we show the confusion of OWL on SemanticKITTI. Among known classes, we observe that the terrain class is most confused with the vegetation class. For the other class, we observe significant confusion with the known classes. For example, other-vehicle is often misclassified as car or truck. Furthermore, other-ground and parking are often misclassified as sidewalk and road. Confusion most commonly arises between classes that share a super-class.

Fig. 4

The extended confusion matrix for OWL trained on SemanticKITTI and evaluated in-domain (on SemanticKITTI), using Vocabulary 1. On the left side, we see the confusion among known classes. On the right, we can see which known classes are confused with classes that form the other class. For known classes, we observe a confusion between (related) terrain and vegetation. We also observe that several other points are misclassified as known. Class other-vehicle is often misclassified as car or truck, while other-ground and parking are commonly misclassified as sidewalk and road classes. This explains the low IoU, observed in Table 2 (main paper) on other in SemanticKITTI (Color figure online)

Fig. 5

The extended confusion matrix for OWL trained on SemanticKITTI and evaluated in cross-domain setting (on KITTI-360), using Vocabulary 1. On the left side, we see the confusion among known classes. On the right, we can see which known classes are confused with classes that form the other class. Contrary to the in-domain confusion, we observe more confusion within known classes. For instance, car and truck classes are often confused. The class sidewalk is often misclassified as terrain, while almost all known classes are confused with vegetation. As can be seen, there is confusion between known and unknown classes. Ground and parking are often predicted as road and sidewalk. Class wall (a novel other-stuff class) is confused with fence, building, and vegetation, presumably due to their geometric similarity. Class trailer is frequently confused with class car. As demonstrated, cross-domain semantic segmentation is a challenging problem (Color figure online)

Fig. 6

Qualitative results on KITTI360 dataset. OWL successfully segments several other objects (left, shown in red; right: segmented instances). Issues and challenges: we observe OWL occasionally under-segments other instances (see, e.g., top row) (Color figure online)

5.4.2 KITTI360

In Fig. 5, we visualize the extended confusion matrix for OWL on KITTI360. Among the known classes, we observe confusion between the car and truck classes. Furthermore, fence and terrain are frequently misclassified as vegetation. Next, we analyze confusion between known and other classes. We observe that ground and parking are often misclassified as road and sidewalk. Wall is confused with fence, building, and vegetation, while trailer is commonly confused with car. This demonstrates the challenge of open-world generalization in semantic segmentation, indicating the need for future research on this front.

5.5 Qualitative Results

In Fig. 6, we show several visual examples for OWL on the KITTI360 dataset (Vocabulary 1), focusing on instances of unknown classes. As can be seen, OWL performs well in challenging cases with several unknown objects in the same scene. For example, in sixth-row-middle and eighth-row-left, our method segments common objects such as trunk, sign board, and pole. Moreover, OWL segments several rarer objects. For example, in first-row-left, OWL segments a wheelbarrow. In first-row-right, OWL segments a bus stop. In fifth-row-left, fifth-row-middle, and sixth-row-left, OWL segments a swing, a stroller, and a motorcycle.

However, we also observe failure cases. In the top row of Fig. 6, OWL sometimes under-segments nearby objects. For example, in first-row-middle and first-row-right, neighboring poles and signs are clustered into a single instance.

To further showcase the performance of our method, we provide a video with OWL inference on SemanticKITTI and KITTI360 lidar sweeps.

6 Discussion and Conclusion

We investigate Lidar Panoptic Segmentation in an Open World setting (LiPSOW), for which we set up baselines and an evaluation protocol. We demonstrate that our OWL performs significantly better than prior work for in-domain and cross-domain evaluations. In addition to better generalization across domains, OWL segments a large number of instances in the other class. Finally, we observed that grouping rare classes into a catch-all other class leads to significantly better cross-domain generalization. We hope our insights spark future investigation and help build perception models that can generalize to novel environments.

We envision LiPSOW as a first stage towards an end-to-end continual learning paradigm where unknown objects from lidar scans are discovered online and clustered offline. Based on discovered object clusters (with human-in-the-loop annotations to provide further categorical refinement), the network can be updated using incremental/continual learning. Such signals can be incorporated into trajectory prediction or motion planning algorithms to enable safe maneuvers.