1 Introduction

Semantic segmentation of 3D point clouds has attracted considerable attention in areas such as autonomous driving, augmented reality (AR) and virtual reality (VR). For autonomous systems such as self-driving cars, accurate scene understanding is especially important, because errors in the perception process can lead to serious accidents and downstream planning problems. In this paper, we aim to improve the performance of semantic segmentation in self-driving scenarios, which allows us to perceive the entire 3D scene in a high-quality, point-by-point manner. 3D data is collected by optical or radar sensors, usually in the form of point clouds, which can also be represented in other views, as shown in Fig. 1.

Fig. 1 Three different views of a point cloud. (a) Point-based view (top left): points are irregular and unordered, which makes finding the neighbors of a point inefficient. (b) Voxel-based view (top right): voxelization discretizes continuous point cloud data into discrete voxels, introducing quantization loss and a sharp increase in computation as the resolution grows. (c) Range-based view (bottom): the range image distorts physical dimensions due to spherical projection, and the depth information of objects is partially lost

More specifically, the semantic segmentation of point clouds can be broadly divided into three directions. Researchers may divide the original point cloud into voxel units defined in Cartesian coordinates, as depicted in Fig. 1(b), and use 3D convolution [1] to construct a dedicated network. A voxel-based view preserves physical size and has friendly memory locality. However, it is relatively sparse and requires very high resolution to eliminate quantization information loss, which can increase the computational and memory footprint cubically. Moreover, the choice of voxel size can greatly affect the performance of the network.

Recently, point-based views, depicted in Fig. 1(a), have attracted increasing attention, and much work uses raw point cloud data directly in the network. PointNet [2] served as a pioneering work, extracting point features with a multilayer perceptron (MLP) applied to each point, but it lacked local context modeling capability. Building on PointNet, later studies [3, 4] extracted local features for each point by aggregating its neighborhood features. However, points in a point cloud are unstructured, and searching the neighbourhood of a point is inefficient due to the random nature of memory access.

Some parallel works [5,6,7,8] followed a spherical projection scheme, i.e., a range-based view, as shown in Fig. 1(c), which transforms the 3-dimensional information of the points into a 2-dimensional image and then applies well-established convolutional neural networks for feature extraction. This representation is dense and allows convolution kernels to gather a large receptive field on the image, which greatly eases the point-sparsity problem. However, due to the spherical projection, the depth information of the points is partially lost and the apparent size and distance of objects are changed, so objects in dense and cluttered scenes may overlap with each other severely.
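For concreteness, the sketch below shows a standard spherical projection onto a range image, in the style used by range-based methods such as RangeNet++ [6]; the image size and vertical field of view are illustrative values for a 64-beam sensor and are not taken from this paper.

```python
import numpy as np

def spherical_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) point cloud onto an H x W range image (sketch)."""
    fov_down_rad = np.radians(fov_down)
    fov_rad = np.radians(fov_up) - fov_down_rad          # total vertical field of view

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1)        # range of each point

    yaw = np.arctan2(y, x)                               # azimuth
    pitch = np.arcsin(np.clip(z / (depth + 1e-8), -1.0, 1.0))  # elevation

    # Normalize angles to [0, 1] and scale to pixel coordinates.
    u = np.clip(np.floor(0.5 * (1.0 - yaw / np.pi) * W), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor((1.0 - (pitch - fov_down_rad) / fov_rad) * H), 0, H - 1).astype(np.int32)

    range_image = np.full((H, W), -1.0, dtype=np.float32)
    range_image[v, u] = depth                            # points mapping to the same pixel overwrite
    return range_image, (v, u)
```

The returned (v, u) indices are what a range branch would later use to map image features back to the points.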

Fig. 2 Performance distribution of different methods on the SemanticKITTI dataset

In conjunction with the results presented in Fig. 2, for the segmentation of large-scale point clouds, we find that: 1) the voxel-based approach tends to outperform the point-based and range-based approaches, and the best performance achieved by the point-based and range-based ones is about the same; 2) the fusion-based approaches generally perform much better than the point-based, range-based and voxel-based ones.

It is therefore natural to combine different views to exploit complementary information, retaining their strengths while reducing their weaknesses. PVCNN [9] employed a fusion strategy of point clouds and voxels, in which voxels provide coarse-grained local feature information while the point cloud retains fine-grained geometric features through a simple MLP applied at each point. Although this approach offered a promising perspective, the simple additive fusion of point clouds and voxels does not yield significant performance gains, as shown in Fig. 2. Therefore, more sophisticated and effective fusion strategies are needed to further improve performance. RPVNet [10] was the first network to fuse three views; it leveraged a hashing mechanism to record the locations of the same points in different views and performed feature fusion across views. This offers a promising prospect.

In this paper, we propose a deeply adaptive range-point-voxel network that is intended to synergize the representations of the three views. As shown in Fig. 4, we design an RPVLayer that takes the original point cloud as input and transforms it into range, point and voxel representations, all of which are mapped back to point features after feature extraction. In order to minimize the loss of point quality, the final feature alignment and fusion is performed at the point level after re-inferring the features of anomalous points. Unlike previous approaches [11,12,13,14,15] that fused at the upstream or downstream ends of the network, we consider that fusing in the middle of the network loses less information. Our network not only reduces the problems caused by the range and voxel branches through the FRM module, depicted in Fig. 6, but also efficiently captures fine-grained information and structural features at the point level from different views at different levels through the continuous iteration of the FGA module. The details of each module of the FGA will be described in Section 3.1.

Our main contributions are as follows:

  • We design a novel range-point-voxel interactive structure which overcomes the drawbacks of single-view point-based, voxel-based and range-based approaches by allowing the different views to compensate for and enhance each other in a soft, adaptive way.

  • Considering the excellent performance of convolutional neural networks on 2D images, we propose a novel context-modeling extractor called the Dynamic Feature Pyramid Extractor (DFPE). We construct a dynamic skeleton network in the range branch, specifically designed for processing spherical range images of LiDAR data, which extracts features from the range branch at different scales suitable for semantic segmentation.

  • We propose a Feature Refinement Module (FRM), which divides points into abnormal points (points with low confidence) and normal points (points with high confidence). For the points with low confidence, we jointly reason over their own features, the features of the point branch and the features of neighboring points, and then compute self-attention. This solves the problem of losing point quality when features from the range view and voxel view are restored back to the point level.

  • We propose a novel fusion mechanism, a channel-based multi-fine-grained self-attention mechanism, which contains only one attention layer that captures and fuses different fine-grained features of the same object from different views.

  • We conduct detailed experiments whose results show that our method achieves 69.8% mIoU and 77.1% mIoU on the SemanticKITTI [16] and nuScenes [17] datasets, respectively, outperforming most state-of-the-art algorithms.

2 Related work

In this section, we review the exploration of attention mechanisms in the field of point cloud segmentation, along with the four conventional research directions.

2.1 Self-attention in point cloud segmentation

Self-Attention (SA) is a very powerful neural network operation for capturing complex dependencies among elements in sequence or set data. In the field of point cloud learning, SA has been widely used as a core module. Some early works [18, 19] used SA as an auxiliary module to advance the development of point cloud learning.

Recent studies have shown that point cloud learning methods based on Transformer models have great potential. PT (Point Transformer) [20] and PCT (Point Cloud Transformer) [21] were two pure Transformer models that achieved significant improvements on point cloud tasks. In particular, the performance of PT [20] far surpassed previous methods [22, 23], which suggested that the Transformer model has a great advantage in modelling the non-locality of point clouds and in multi-scale feature fusion for point cloud learning. Subsequent works, such as PCTMA-Net [24] and Hierarchical Transformer [25], further improved the performance of point cloud Transformers by introducing the standard multi-head attention mechanism and the shifted-window mechanism.

However, all of the above methods were based on point-level self-attention mechanisms, whose complexity grows quadratically with the number of input points, restricting them to a limited local receptive field. Therefore, many subsequent works [26,27,28,29] have adopted voxels as attention tokens and proposed voxel-level self-attention mechanisms. These methods usually used a de-voxelization operator to obtain a point-by-point feature representation.

In addition, PatchFormer [30] proposed a method to compute the attention map between points and patches by directly aggregating the patch features for each point. These methods have made significant progress in considering multi-scale features, but they still face the challenge of size-aware feature learning, i.e., how to effectively capture and utilise information at different scales in the input data. In point cloud tasks, size-aware feature learning is important for dealing with point cloud objects of different sizes and complexities, so this remains an area of intensive research. To the best of our knowledge, ours is the first network to fuse three views at the point level with the help of self-attention. We propose a range-point-voxel multi-fine-grained self-attention in the RPVLayer, which differs from the voxel attention described above in that it directly captures the relationships among range, point and voxel features. Furthermore, fine-grained and coarse-grained tokens are preserved in the attention layer, enabling multi-scale features to be freely disentangled in integrated point cloud learning.

2.2 Point-based segmentation

PointNet [2] was the first attempt to directly process point cloud data through an MLP-based network. Although subsequent studies [3, 4] demonstrated its effectiveness on indoor point cloud data, most of these methods could not be easily scaled up to large-scale outdoor data due to computational and memory constraints. RandLA-Net [31] employed random sampling together with local feature aggregation to reduce the information loss introduced by the sampling, which provided a feasible way to accelerate point cloud processing, but it still could not overcome the accuracy loss caused by sampling. KPConv [4] achieved the best performance among current point-based methods by introducing spatial kernel-based point convolution. However, it faces the same problem when dealing with large scenes and still cannot be trained directly on all of the data. To address this, a method of balanced sampling by radius classification has been proposed to reduce the amount of data, but this sampling may destroy some of the information inherent in the point cloud. Although point-based methods usually have fewer parameters [31], they inevitably involve inefficient local neighbor search operations, which limits their efficiency and performance when dealing with large-scale scenes.

2.3 Voxel-based segmentation

In early voxel-based approaches [32, 33], the point cloud data was converted into voxel grids and a standard 3D convolution was subsequently applied for semantic segmentation. More recent efforts [34, 35] have been devoted to accelerating the 3D convolution while improving performance and reducing computational cost. Meanwhile, some variant approaches [36, 37] partition the 3D space in different ways. Among them, Cylinder3D [37] introduced asymmetric residual blocks to reduce the computational burden and to ensure that features associated with rectangular objects are captured. AF2S3Net [38], on the other hand, achieved state-of-the-art performance based on previous work [33] by introducing two novel attention modules, the Attentional Feature Fusion Module (AF2M) and the Adaptive Feature Selection Module (AFSM). These modules efficiently learn local and global contextual information and emphasize fine details. In addition, the method used a hybrid loss function with geometry-aware anisotropy [39] to recover fine detail information. Conventional voxel methods may suffer severe information loss when the resolution is reduced. Our method employs a multi-view approach to compensate for this shortcoming, which improves the recovery of detailed information.

2.4 Range-based segmentation

Range-based point cloud segmentation methods [6, 7, 23, 40, 41] utilize 2D convolutional neural networks (CNNs) by mapping 3D point clouds onto dense 2D spherical grids. For example, RangeNet++ [6] employed the DarkNet backbone from YOLOv3 [42] as a feature extractor with efficient K-nearest-neighbour (KNN) post-processing. SalsaNext [7] used SalsaNet [43] as a baseline and also introduced an uncertainty-aware mechanism for point feature learning. In addition, KPRNet [40] stood out in this class of methods by employing a powerful ResNeXt-101 backbone and an Atrous Spatial Pyramid Pooling layer to achieve state-of-the-art segmentation performance, as well as the innovative use of KPConv [4] in place of the less efficient KNN as a segmentation head. Although range-based methods can leverage well-established 2D image segmentation techniques, mapping 3D point clouds onto 2D spherical grids distorts physical dimensions. This problem is mitigated in our proposed RPV-CASNet, which employs a multi-view interactive learning approach to process 3D point cloud data more accurately, alleviating the size distortion of objects to a large extent.

2.5 Multi-view fusion

Given the limitations of single views, recent approaches [44,45,46,47,48,49] have attempted to fuse information from two or more views to improve point cloud segmentation performance. For example, the approach in [11] combined point-level information from the bird's-eye view and range images, which was then passed to a subsequent network for early fusion. AMVNet [13], on the other hand, devised a late fusion strategy that calculates the uncertainty in the outputs of different views and further refines the results with an additional network. FusionNet [14] proposed a point-voxel interaction MLP for aggregating features between neighbourhood voxels and the corresponding points, which reduced the cost of neighbourhood search and maintained satisfactory accuracy on large-scale point clouds. In particular, PVCNN [9] proposed an effective point-voxel fusion method in which the voxels provide coarse-grained local feature information while the points retain fine-grained geometric features through a simple MLP. The works in [50, 51] fused RGB images and range images, but they relied on a calibration matrix to associate images and point clouds in a hard-correlation manner and were thus susceptible to calibration errors [52]. The simple additive fusion strategy of RPVNet [10] for multi-view cross-fertilisation ignores the fact that the noise introduced by points during the mapping to the range view is also propagated into the other modalities, which can severely handicap network performance. The aforementioned methods face the same problem, i.e., the fusion strategy is relatively simple, e.g. additive operations, which limits the performance of the network. In contrast, our method utilizes the information provided by multiple views more efficiently and is able to select and fuse the parts most helpful for point cloud segmentation. This gives our approach greater potential and flexibility in improving point cloud segmentation performance.

3 FGA transformer block

In this section, we first introduce our overall network architecture and then the components of the network, exemplified by the RPVLayer in the FGA module, which consists of three parts: the Dynamic Feature Pyramid Extractor (DFPE), the Feature Refinement Module (FRM), and the Multi-Fine-Grained Feature Self-Attention Module (MFGFSAM). Finally, we present our proposed improved and efficient data augmentation method.

Fig. 3 (a) Overview. (b) FGA Transformer Block

Fig. 4 Range-Point-Voxel Cross Adaptive Layer

3.1 Framework overview

Our proposed algorithm, named RPV-CASNet, is built on top of the FGA Transformer Block, whose core idea is to introduce effective multi-scale features into each attention layer and allow each point to adapt its attention domain. The network therefore relies entirely on the FGA block to learn and interact with channel features from different views, achieving fusion of multi-fine-grained features from objects of different scales. The details of the FGA block are shown in Fig. 3(b). In order to efficiently capture features from different views and scales in each attention layer, the FGA must overcome three challenges: loss of point quality, integration of features from distant regions, and accurate discrimination of multi-scale features from objects at different scales. To address the first challenge, drawing on ideas from Video Anomaly Detection (VAD), we design an FRM that handles anomalies and re-infers point features in a more subtle way. To address the second challenge, we design a new range-point-voxel interaction structure on which we use a novel channel-based multi-fine-grained attention mechanism to plan the learning of effective multi-scale features for the point cloud. To overcome the third challenge, we introduce a range-point-voxel triple decoupling strategy based on the standard multi-head self-attention mechanism. We also propose a simple but effective dynamic residual block and use it to dynamically build our DFPE module for extracting features at different scales in the range branch.

3.2 Range-point-voxel cross adaptive layer

Our RPVLayer is shown in Fig. 4. It contains three main parts: feature extraction, the FRM and the MFGFSAM. Specifically, the input point cloud first passes through a point embedding layer with a linear transformation, followed by feature extraction using a simple MLP, and is then transformed into three branches. For the range branch we use the DFPE as the backbone network; for the point branch we deploy a lightweight PointNet++; and for the voxel branch we use a lightweight SPVNAS network. Finally, we obtain the channel features of the three point sequences, which are fused in our FRM and MFGFSAM.
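The data flow just described can be summarised by the following schematic sketch; every callable passed in (the embedding, the shared MLP, the three branches, the FRM and the MFGFSAM) is a placeholder for a component named in the text, and the exact interfaces are assumptions made for illustration.

```python
def rpv_layer(points, embed, mlp, range_branch, point_branch, voxel_branch, frm, mfgfsam):
    """Schematic forward pass of one RPVLayer (cf. Fig. 4)."""
    feats = mlp(embed(points))                    # point embedding + simple MLP

    # Each branch extracts features in its own view and maps them back to the points.
    r_feat = range_branch(feats, points)          # range view  -> per-point features (DFPE)
    p_feat = point_branch(feats, points)          # point view  -> per-point features (PointNet++)
    v_feat = voxel_branch(feats, points)          # voxel view  -> per-point features (SPVNAS)

    # FRM re-infers the features of low-confidence ("anomalous") points in each branch.
    r_feat, p_feat, v_feat = frm(r_feat, p_feat, v_feat, points)

    # MFGFSAM fuses the three aligned point sequences with channel self-attention.
    return mfgfsam(r_feat, p_feat, v_feat)
```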

Fig. 5 Dynamic feature pyramid extractor

3.2.1 Dynamic feature pyramid extractor

Here we introduce an extractor for modeling context, called the "Dynamic Feature Pyramid Extractor" (DFPE for short), which handles the features of the point cloud after it is mapped into an "image" in the range branch. DFPE uses regular 2D convolution operations and provides contextual information at different scales through efficient dynamic computation, so it can be added to a wide range of models non-destructively. A generic block diagram of DFPE is shown in Fig. 5.

The DFPE module consists of three parallel dynamic residual blocks (shown as dashed boxes). Inspired by [15], in order to better introduce contextual information and optimise the feature extraction process, we use convolutional kernels of different sizes to design our DFPE module. In the actual deployment, the dynamic residual blocks mainly consist of Conv 3\(\times \)3 (S=1, P=1), Conv 5\(\times \)5 (S=1, P=2), Conv 7\(\times \)7 (S=1, P=3) and Conv 1\(\times \)1 2D convolution kernels. Each convolutional layer is followed by a Batch Normalisation layer and a ReLU activation layer, except for the 1\(\times \)1 convolutional layer in the dynamic feature pyramid extractor, which has no ReLU. Also, in order to preserve local features, we introduce a skip connection with a 2D convolution after the first convolution within each residual block. To further reduce the computational overhead, we use depthwise convolution (DWConv) with the same kernel sizes. Finally, the outputs of the three parallel convolutional blocks are combined by a Conv 1\(\times \)1 operation to generate the final feature tensor for further processing. Instead of following the complex design of networks such as RangeNet++, our view is that: 1. a portion of the points will be covered or occluded after mapping, and such networks do not take these points into account; 2. our extractor is very efficient even though it contains only three sub-blocks: through the coordinated iteration of the downsampling operation and the FGA block, the number of points in the network decreases, so after mapping the occluded points are exposed to some extent and captured by our dynamic extractor; 3. deploying a dedicated mapping-based network (encoder and decoder) would introduce additional memory and computational overhead. We use the same architecture on both datasets. Details of the deployment are presented in Section 4.2.
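A minimal PyTorch sketch of one possible reading of this design is given below; the channel widths, the exact placement of the skip convolution and other details not fixed by the text are assumptions.

```python
import torch
import torch.nn as nn

class DynamicResidualBlock(nn.Module):
    """One parallel branch: k x k depthwise conv (BN + ReLU), a 1 x 1 conv
    (BN only, no ReLU) and a convolutional skip connection."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.dwconv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size, stride=1,
                      padding=kernel_size // 2, groups=channels),   # depthwise, keeps resolution
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.pwconv = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                    nn.BatchNorm2d(channels))       # 1x1, no ReLU
        self.skip = nn.Conv2d(channels, channels, 1)                 # skip connection with a conv

    def forward(self, x):
        y = self.dwconv(x)
        return self.pwconv(y) + self.skip(y)

class DFPE(nn.Module):
    """Three parallel dynamic residual blocks (3x3, 5x5, 7x7) fused by a 1x1 conv."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([DynamicResidualBlock(channels, k) for k in (3, 5, 7)])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, range_image):                                  # (B, C, H, W)
        return self.fuse(torch.cat([b(range_image) for b in self.branches], dim=1))
```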

Fig. 6 Feature Refinement Module

Fig. 7 Multi-Fine-Grained Feature Self-Attention Module

3.2.2 Feature refinement module

The point cloud introduces noise or loses some information after mapping and voxelization operations, which is an important challenge in multi-view fusion. Motivated by VAD, we propose a module called FRM, which is placed before the MFGFSAM. The exact deployment is shown in Fig. 6.

Specifically, the inputs to our FRM are three sequential sets of point-level features from the feature extraction stage: R \(\left\{ R_{x_{m}}\right\} ^{N}_{m=1}\) (rfeature collection from the range branch), V \(\left\{ V_{x_{n}}\right\} ^{N}_{n=1}\) (vfeature collection from the voxel branch) and P \(\left\{ P_{x_{k}}\right\} _{k=1}^{N}\) (pfeature collection from the point branch). Taking the rfeature/vfeature set and the pfeature set as inputs, we multiply the feature vectors and apply a sigmoid activation in order to screen out a point set of size n \(\left( n\ll N\right) \) as anomalous points. For the anomalous point sequence n in the range branch/voxel branch, with the help of the lossless point features of the point branch and the threshold \(\delta \), we embed the range-branch/voxel-branch features, the point-branch features and the coordinate information of the anomalous points together with complementary weights S and \(\left( 1-S\right) \). With the help of a KD-tree we can easily obtain the features of the neighbouring points around each anomalous point (we select 16 neighbours); we then concatenate the three parts along the channel dimension, pass them through a linear layer and a Softmax layer to obtain the corresponding weight matrices, and multiply these with the original features as the final re-representation of the anomalous point features. Normal points, on the contrary, are simply "copied". Similarly, we perform the same operation for the anomalies in the point branch. A key insight is that once the sequence of anomalies is separated, both the range branch/voxel branch and the point branch re-infer representations of the anomalies, which prevents the network from over-relying on the point branch. Finally, we obtain the re-inferred point-level feature sets of R, P and V, namely \(R=\left\{ \left\{ R_{x_{m}}\right\} ^{a}_{m=1},\left\{ R_{x_{m}}\right\} ^{N}_{m=a}\right\} \), \(V=\left\{ \left\{ V_{x_{n}}\right\} ^{b}_{n=1},\left\{ V_{x_{n}}\right\} ^{N}_{n=b}\right\} \) and \(P=\left\{ \left\{ P_{x_{k}}\right\} ^{c}_{k=1}, \left\{ P_{x_{k}}\right\} ^{N}_{k=c}\right\} \), where \(\left\{ R_{x_{m}}\right\} ^{N}_{m=a} \), \(\left\{ V_{x_{n}}\right\} ^{N}_{n=b}\) and \(\left\{ P_{x_{k}}\right\} ^{N}_{k=c}\) are the "anomalous" features after re-inference in the corresponding sets.
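The following sketch illustrates the re-inference step for a single branch (range or voxel); the threshold \(\delta \), the mixing weight S, the averaging of neighbour features and the brute-force nearest-neighbour search used in place of the KD-tree are simplifying assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class FRMBranch(nn.Module):
    """Sketch of FRM re-inference for one branch (range or voxel)."""
    def __init__(self, channels, delta=0.5, k=16, s=0.7):
        super().__init__()
        self.delta, self.k, self.s = delta, k, s
        self.proj = nn.Linear(2 * channels + 3, channels)   # blended + neighbours + xyz -> weights

    def forward(self, branch_feat, point_feat, coords):     # (N, C), (N, C), (N, 3)
        # Confidence score: element-wise product of the two views, summed and squashed.
        alpha = torch.sigmoid((branch_feat * point_feat).sum(dim=1))
        abnormal = alpha < self.delta                        # psi = 0 -> anomalous point

        refined = branch_feat.clone()                        # normal points are simply copied
        if abnormal.any():
            # Blend branch and point features with weights S and (1 - S).
            blended = self.s * branch_feat[abnormal] + (1 - self.s) * point_feat[abnormal]
            # k nearest neighbours in Euclidean space (stand-in for the KD-tree lookup).
            idx = torch.cdist(coords[abnormal], coords).topk(self.k, largest=False).indices
            neigh = branch_feat[idx].mean(dim=1)             # aggregated neighbour features
            joint = torch.cat([blended, neigh, coords[abnormal]], dim=1)
            weights = torch.softmax(self.proj(joint), dim=1) # linear + softmax -> weight matrix
            refined[abnormal] = branch_feat[abnormal] * weights
        return refined
```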

Table 1 Detailed deployment of the network
Table 2 Evaluation on the SemanticKITTI dataset. Our results use GANMix (see Section 3.3), where \(\ddagger \) indicates that GANMix is used for data augmentation

The division between anomalous and normal points is set as follows:

$$\begin{aligned} \psi = {\left\{ \begin{array}{ll} 0, &{} \text {if } \alpha < \delta \\ 1, &{} \text {if } \alpha \ge \delta \end{array}\right. } \end{aligned}$$
(1)

We distinguish between abnormal and normal points via the proxy task \(\psi \). Specifically, if the value of \(\alpha \) is less than \(\delta \), the proxy task \(\psi \) treats the corresponding point features as abnormal and then further refines the features of these abnormal points. Conversely, points with \(\alpha \ge \delta \) are considered normal.

Table 3 Evaluation on the nuScenes dataset. Our results do not use any GANMix-like trick

3.2.3 Multi-fine-grained feature self-attention module

Rather than operating at the voxel level, MFGFSAM computes channel attention directly at the point level from the point, voxel and range sequences: the query is derived from the channel features of the point sequence, the branch with the purest features, while the keys and values are derived from the channel features of the point branch itself and of the other two branches, respectively. This makes our MFGFSAM more flexible, as shown in Fig. 7.

The inputs to this module are the range features \(R_{f} \in R^{D_{nr}\times \left( D_{h} \times D_{d}\right) }\), where \(D_{nr}\) is the number of points in the range branch in the nth FGA transformer block and \(\left( D_{h} \times D_{d} \right) \) is the dimension of the points obtained through the range-to-point mapping. Similarly, \(P_{f} \in R^{D_{np}\times \left( D_{h} \times D_{d}\right) }\), where \(D_{np}\) is the number of points in the point branch in the nth FGA transformer block and \(\left( D_{h} \times D_{d} \right) \) is the feature dimension of the points obtained there, and \(V_{f} \in R^{D_{nv}\times \left( D_{h} \times D_{d}\right) }\), where \(D_{nv}\) is the number of points in the voxel branch in the nth FGA transformer block and \(\left( D_{h} \times D_{d} \right) \) is the channel feature dimension of the points obtained through the voxel-to-point mapping. In particular, to facilitate the subsequent channel self-attention computation, \(D_{nr} = D_{np} = D_{nv}\) in the nth block. The RPVLayer in the nth FGA block is computed according to (2)-(7):

$$\begin{aligned} R_{f},P_{f},V_{f}= & {} LN(R_{f}),LN(P_{f}),LN(V_{f}) \end{aligned}$$
(2)
$$\begin{aligned} Q_{R},K_{R},V_{R}= & {} \psi (R_{f} \mid r_{i}) W^{Q}_{1}, \psi (R_{f} \mid r_{i}) W^{K}_{1}, \nonumber \\{} & {} \psi (R_{f} \mid r_{i}) W^{V}_{1} \end{aligned}$$
(3)
$$\begin{aligned} Q_{V},K_{V},V_{V}= & {} \psi (V_{f} \mid r_{i}) W^{Q}_{2}, \psi (V_{f} \mid r_{i}) W^{K}_{2}, \nonumber \\{} & {} \psi (V_{f} \mid r_{i}) W^{V}_{2} \end{aligned}$$
(4)
$$\begin{aligned} K_{R} , V_{R}= & {} Q_{R} + K_{R},Q_{R} + V_{R} \nonumber \\ K_{V} , V_{V}= & {} Q_{V} + K_{V},Q_{V} + V_{V} \end{aligned}$$
(5)
$$\begin{aligned} Q_{p}^{1} ,Q_{p}^{2} , Q_{p}^{3}= & {} P_{f} W_{1}^{Q}, P_{f} W_{2}^{Q},P_{f} W_{3}^{Q} \end{aligned}$$
(6)
$$\begin{aligned} F_{coarse1}^{PR}, F_{coarse2}^{PV}= & {} MSA_\frac{H}{3}( Q_{p}^{1}, K_{R} , V_{R}) , \nonumber \\{} & {} MSA_\frac{H}{3}( Q_{p}^{2}, K_{V} , V_{V}) \end{aligned}$$
(7)

Here LN is layer normalization, W is a linear matrix, MSA is channel-based multi-head self-attention, and \(\psi (\cdot \mid r_{i}) \) denotes downsampling the feature with rate \( r_{i} \) in the ith head. We reshape the input features from \(\left( B, N,C \right) \) to \(\left( B,H,W,C\right) \) and then use a convolution kernel with stride \( r_{i} \) to implement the downsampling layer. \(F_{coarse1}^{PR} ,F_{coarse2}^{PV} \in R^{D_{nr}\times ( D_{\frac{H}{3}}\times D_{d})} \), and all linear matrices W project the input vectors, at the channel level, onto output vectors with only one-third of the input dimension. In the specific deployment, the MSA heads are executed in parallel. In addition, we further reduce the computational cost by reducing the query, key and value dimensions to one-third of the original ones through the linear layers. Our RPVLayer can directly compute the attention maps between range, point and voxel, which allows our model to capture the relationships among the different granularities of range, point and voxel through channel features.

3.2.4 Range-point-voxel triple decoupling strategy

As shown in Fig. 7, MFGFSAM contains a point-attention branch, a point-range attention branch and a point-voxel attention branch. The point-range and point-voxel attention branches model coarse-grained features with larger receptive fields, while the point-attention branch better captures fine-grained features at a finer scale. Inspired by the Shunted Transformer [53], we use a channel-based multi-head self-attention mechanism in the RPVLayer to combine the coarse multi-scale features from the point-range and point-voxel attention branches with the fine-grained features from the point-attention branch. In particular, our RPVLayer is computed as:

$$\begin{aligned} K_{p}, V_{p}= & {} P_{f} W_{3}^{K}, P_{f} W_{3}^{V} \end{aligned}$$
(8)
$$\begin{aligned} F_{fine3}^{P}= & {} MSA_\frac{H}{3}( Q_{p}^{3}, K_{p} , V_{p}) \end{aligned}$$
(9)
$$\begin{aligned} F^{'}= & {} Concat(F_{coarse1}^{PR} ,F_{coarse2}^{PV}, F_{fine3}^{P}) \end{aligned}$$
(10)

Here \(F^{'}\in R^{D_{nr}\times ( D_{h}\times D_{d})}\) is the output of the RPVLayer. Our RPVLayer is able to learn multi-scale features from different objects in a disentangled manner within a single attention layer. With the help of the RPVLayer, the FGA can combine a range of receptive field sizes from different objects at different levels.
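A condensed sketch of the channel-based multi-fine-grained self-attention of (2)-(10) is given below; multi-head splitting, the strided-convolution downsampling \(\psi (\cdot \mid r_{i})\) and the output projection are omitted or simplified, so this illustrates the mechanism rather than reproducing the exact module.

```python
import torch
import torch.nn as nn

class MFGFSAMSketch(nn.Module):
    """Queries come from the point branch; keys/values come from the range,
    voxel and point branches, each projected to one third of the channels."""
    def __init__(self, dim):
        super().__init__()
        self.norm_r, self.norm_p, self.norm_v = (nn.LayerNorm(dim) for _ in range(3))
        d3 = dim // 3
        self.q = nn.ModuleList([nn.Linear(dim, d3) for _ in range(3)])
        self.kv = nn.ModuleList([nn.Linear(dim, 2 * d3) for _ in range(3)])

    @staticmethod
    def channel_attn(q, k, v):
        # Attention over channels: a (d3 x d3) map relating channel features, cf. eqs. (7) and (9).
        attn = torch.softmax(q.transpose(-2, -1) @ k / k.shape[-2] ** 0.5, dim=-1)
        return v @ attn

    def forward(self, r_f, p_f, v_f):                        # aligned (N, C) point sequences
        r_f, p_f, v_f = self.norm_r(r_f), self.norm_p(p_f), self.norm_v(v_f)   # eq. (2)
        outputs = []
        for i, src in enumerate((r_f, v_f, p_f)):
            q = self.q[i](p_f)                               # query always from the point branch
            k, v = self.kv[i](src).chunk(2, dim=-1)
            if i < 2:                                        # eq. (5): fold the query into K and V
                k, v = q + k, q + v
            outputs.append(self.channel_attn(q, k, v))
        return torch.cat(outputs, dim=-1)                    # eq. (10): coarse PR, coarse PV, fine P
```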

Table 4 Comparison of efficiency (run-time) and accuracy (mIoU) on the SemanticKITTI val set
Table 5 Effect of different view combinations on network performance
Table 6 Impact of different modules on network performance

3.3 GANMix

Although some data augmentation methods [54, 55] have achieved good results in indoor point cloud segmentation, relatively few studies [56, 57] have addressed outdoor scenes. To address the problem of category imbalance in LiDAR semantic segmentation, we propose a data augmentation method called GANMix.

Empirically, increasing the number of samples in rare categories lets the network predict less frequently occurring objects more accurately. Inspired by this, we extract object instances of scarce categories, such as bicycles or vehicles, from each frame in the training set and put them into a small sample pool. Unlike simply copying and pasting the same object over and over again, we generate a certain number of realistic samples with the help of a Generative Adversarial Network (GAN) to provide greater robustness and realism. During training, we select samples uniformly at random from the small sample pool to keep the categories balanced. We then apply random operations such as scaling and rotation to these samples. To better match the real environment, we randomly place these objects on points of the ground category. Finally, we obtain some new sparse objects from other scenes and "paste" them into the current training scene to simulate objects in various environments.
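The paste step of GANMix can be sketched as follows; the sample-pool format, the label codes and the scaling range are illustrative assumptions, and the GAN that produces the pooled instances is assumed to have been trained offline.

```python
import numpy as np

def ganmix_paste(scene_points, scene_labels, sample_pool, ground_label, rng=np.random):
    """Paste one rare-class instance (real or GAN-generated) into the current scan (sketch).

    scene_points is assumed to be an (N, 3) array and scene_labels an (N,) int array."""
    obj_points, obj_label = sample_pool[rng.randint(len(sample_pool))]   # (M, 3) points, class id

    # Random scaling and rotation about the Z axis.
    scale = rng.uniform(0.9, 1.1)
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    obj = ((obj_points - obj_points.mean(axis=0)) * scale) @ rot_z.T

    # Drop the object onto a randomly chosen ground point of the scene.
    ground_idx = np.flatnonzero(scene_labels == ground_label)
    anchor = scene_points[rng.choice(ground_idx), :3]
    obj += anchor - np.array([0.0, 0.0, obj[:, 2].min()])    # align the object's lowest point with the ground

    new_points = np.concatenate([scene_points, obj], axis=0)
    new_labels = np.concatenate([scene_labels, np.full(len(obj), obj_label)], axis=0)
    return new_points, new_labels
```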

This approach aims to address the problem of category imbalance in point cloud semantic segmentation and to improve the performance of the model by generating realistic samples.

4 Experiments and results

In this section, we first introduce the two datasets and the evaluation metric in Section 4.1. Section 4.2 then describes the network training strategy and the detailed deployment, and the experimental results are given in Section 4.3. Finally, the ablation studies and the visualisation results are described in Sections 4.5 and 4.7, respectively.

4.1 Datasets and evaluation metrics

We evaluate our RPV-CASNet on the SemanticKITTI dataset [16] and the nuScenes dataset [17].

Datasets SemanticKITTI [16], derived from the KITTI Vision Benchmark [15], is a large-scale dataset for point cloud segmentation in driving scenarios. It consists of 43,552 LiDAR scans from 22 sequences collected in German cities. Captured with a Velodyne HDL-64E LiDAR, each scan has about 120k points. The 22 sequences are divided into 3 groups: the training set (00 to 10, except 08; 19,130 scans), the validation set (08; 4,071 scans), and the test set (11 to 21; 20,351 scans). SemanticKITTI provides up to 28 categories, but the official evaluation ignores categories with only a few points and merges classes that differ only in movement state, so a set of 19 valid categories is used.

nuScenes [17] is a newly released dataset for LiDAR semantic segmentation containing 1,000 scenes collected from different areas of Boston and Singapore. Each scene is 20 seconds long and is sampled at 20 Hz using a Velodyne HDL-32E sensor, resulting in a total of 40,000 annotated frames. It uses 28,130 samples for training, 6,019 for validation and 6,008 for testing. After merging similar classes and removing rare classes, a total of 16 classes are retained for LiDAR semantic segmentation.

Evaluation metric Following the official recommendations of [16, 17], we use the mean intersection-over-union over all classes (mIoU) as the evaluation metric. mIoU can be formulated as:

$$\begin{aligned} mIoU = \frac{1}{C} \sum _{c=1}^{C}\frac{TP_{c}}{TP_{c}+FP_{c}+FN_{c}} \end{aligned}$$
(11)

Where \(TP_{c}\), \(FP_{c}\) and \(FN_{c}\) denote the true positive, false positive and false negative predictions for category c, and C is the number of categories.
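For reference, a direct implementation of (11) from integer label arrays could look as follows; skipping classes that appear in neither the prediction nor the ground truth is a common convention and an assumption here.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Compute mIoU over classes from integer label arrays (sketch of Eq. (11))."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom > 0:                       # skip classes absent from both pred and target
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else 0.0
```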

4.2 Training strategy and network deployment

In the training phase, we use a cross-entropy loss function. For both the SemanticKITTI dataset and the nuScenes dataset, we use the ADAM optimizer; the learning rate has an initial value of 0.006 and is gradually reduced through multi-step decay. The networks are trained for 60 and 80 epochs, respectively. Training our network on an Nvidia GeForce RTX 3090 takes about a week. During training, we utilize widely used data augmentation strategies, including global scaling with a random scaling factor sampled from [0.95, 1.05] and global rotation around the Z-axis by a random angle. We also incorporate our proposed GANMix method in the last ten epochs of the training phase to further fine-tune the network. The detailed network deployment is shown in Table 1.
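The global augmentations mentioned above (random scaling sampled from [0.95, 1.05] and random rotation about the Z axis) can be sketched as follows; the point array layout is an assumption.

```python
import numpy as np

def augment_scan(points):
    """Apply global scaling and Z-axis rotation to an (N, >=3) point array (sketch)."""
    scale = np.random.uniform(0.95, 1.05)
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    points = points.copy()
    points[:, :3] = (points[:, :3] * scale) @ rot_z.T   # extra columns (e.g. intensity) untouched
    return points
```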

4.3 Results on SemanticKITTI and nuScenes

In this experiment, we compare our method with most existing state-of-the-art LiDAR segmentation methods on the SemanticKITTI test set in Table 2 and on the nuScenes validation set in Table 3. Our method outperforms most models in terms of mIoU.

Table 7 The impact of different fusion methods on networks
Table 8 Threshold setting for hyperparameter \(\delta \)

Our method, RPV-CASNet, achieves overall segmentation performance almost equal to that of RPVNet [10] on both datasets, and the segmentation accuracy of some classes even exceeds that of the former. Thanks to the strong robustness of our multi-view interactive learning, RPV-CASNet is more accurate than most state-of-the-art networks on small targets. On the SemanticKITTI dataset, RPV-CASNet improves segmentation accuracy by 6.0, 6.7, 6.9, 4.3, 5.5 and 0.3 percentage points over CPGNet [58] on the small target categories bicycle, motorcycle, other vehicles, person, motorcyclist and traffic sign, respectively. Compared to SPVNAS [34], accuracy on bicycle, motorcycle, other vehicles, person, bicyclist, pole and traffic sign also improves by 18.3, 17.4, 4.2, 9.0, 1.3, 0.1 and 4.8 percentage points, respectively. The segmentation accuracy of four small target classes, bicycle, other vehicles, pedestrian and traffic sign, even exceeds that of RPVNet [10] by 0.5, 0.6, 0.5 and 0.7 percentage points, respectively. On the nuScenes dataset, for the two small-target categories bicycle and pedestrian, RPV-CASNet improves segmentation accuracy over AMVNet [13] by 3.9 and 11.8 percentage points, over Cylinder3D [37] by 1.8 and 5.4 percentage points, and over RPVNet [10] by 0.8 and 0.2 percentage points. These results show that our model achieves significant improvements on the small target segmentation task.

4.4 Model appraisal

We provide a comparison of efficiency and accuracy in Table 4. Compared with other methods, although we reduce the complexity of the model in the self-attention computation, our model still has a higher number of parameters; however, its latency is lower than that of SPVCNN, RandLA-Net and RPVNet.

4.5 Ablation studies

Performance of different view combinations We explore the effect of different view combinations on the network, including RP, RV, PV and RPV. We train on a quarter of the training set and validate on the validation sets of the two datasets. The results in Table 5 show that: 1. the P-branch contributes the least to the performance of the whole network, while the V-branch contributes the most; 2. the P-branch and R-branch contribute almost equally; 3. the best performance is achieved with all three views (RPV).

Effects of different modules As shown in Table 6, we analyse the contribution of each module to the overall performance of the network: our FRM module improves mIoU by 3.5% and 4.8% on the two datasets, respectively, while MFGFSAM brings improvements of 3.7% and 3.9%, respectively.

Variants of fusion style We explore the impact of different fusion methods on network performance in Table 7. Our multi-fine-grained feature fusion module gains 3.7 points over simple addition on the SemanticKITTI dataset and 3.9 points over concatenation on the nuScenes dataset. We also compare our proposed channel-based multi-fine-grained self-attention mechanism with the standard multi-head self-attention mechanism, obtaining improvements of 1.8 and 2.2 points, respectively.

Threshold setting for hyperparameter \( \delta \) We conduct extensive experiments to determine the threshold of the hyperparameter \(\delta \) on the different datasets; the experiments reveal that about \( 10\% - 25\% \) of the features are classified as anomalies. \( PR_{\delta }\) denotes the hyperparameter setting between the point view and the range view, and \( PV_{\delta } \) denotes the hyperparameter setting between the point view and the voxel view. See Table 8 for details.

Table 9 Comparison of different methods of finding anomalies
Fig. 8 Qualitative comparison with RPVNet [10], SPVCNN [34] and Cylinder3D [37] on the SemanticKITTI dataset. Segmentation errors are highlighted in red. Best viewed in color

Fig. 9 Quantitative comparison with RPVNet [10] on selected small target classes of the SemanticKITTI dataset. Segmentation errors are highlighted in red. Best viewed in color

Comparison of different similarity measures We compare different methods for separating anomalies in the FRM. On the SemanticKITTI dataset, our point-wise multiplication with sigmoid outperforms the Euclidean-distance variant by 1.6 percentage points, while the cosine-similarity variant performs worst at 64.3% mIoU, a gap of up to 3.7 percentage points from ours. Likewise, on the nuScenes dataset, our proposed method outperforms the cosine-similarity and Euclidean-distance variants by 4.0 and 1.8 percentage points, respectively, as shown in Table 9. This suggests that matching the semantic features of the same object across modalities is very important: features of different modalities differ in feature space, and simply computing cosine similarity or Euclidean distance sometimes cannot satisfy the needs of complex tasks. This points to a promising direction for our future work.
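The three scoring functions compared in Table 9 can be sketched as follows; treating lower scores as more anomalous in all three cases is an assumption made to keep the comparison uniform.

```python
import torch
import torch.nn.functional as F

def anomaly_scores(branch_feat, point_feat, method="dot_sigmoid"):
    """Cross-view agreement scores per point; lower means more anomalous (sketch)."""
    if method == "dot_sigmoid":        # ours: element-wise product, summed, then sigmoid
        return torch.sigmoid((branch_feat * point_feat).sum(dim=1))
    if method == "cosine":             # cosine similarity in feature space
        return F.cosine_similarity(branch_feat, point_feat, dim=1)
    if method == "euclidean":          # negative distance so that smaller distance = higher score
        return -torch.norm(branch_feat - point_feat, dim=1)
    raise ValueError(f"unknown method: {method}")
```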

4.6 Scalability of GANMix

Due to restricted code availability, we apply our GANMix to two algorithms, Cylinder3D and RandLA-Net, for the small target classes bicycle and motorcycle. As shown in Table 2, segmentation accuracy for the bicycle and motorcycle categories improves dramatically, by 14.2 and 18.6 points and by 12.2 and 16.0 points on the two algorithms, together with improvements of 1.7 and 1.5 percentage points in overall segmentation accuracy. This shows that our GANMix scales well across methods.

4.7 Visualization of results

We visualize the segmentation results of our proposed RPV-CASNet and of state-of-the-art algorithms on the SemanticKITTI dataset in Fig. 8. We also compare segmentation results with RPVNet [10] on some of the small target classes in Fig. 9, where segmentation errors are highlighted in red. Our network performs better on these classes.

5 Conclusion

This paper presents a new algorithm, RPV-CASNet, for fine-grained segmentation through point cloud multi-view fusion. The algorithm utilises a channel self-attention mechanism to fuse three different views, range, point and voxel, and integrates them more finely through an interactive structure, the RPVLayer, to take full advantage of their differences. The RPVLayer contains two key designs: the Feature Refinement Module (FRM) and the Multi-Fine-Grained Feature Self-Attention Module (MFGFSAM). The FRM re-corrects, at a finer granularity, the anomalies introduced in the range view and voxel view by the mapping and voxelisation operations, while the MFGFSAM efficiently integrates tokens from distant regions and maintains multi-scale features within a single attention layer. In addition, this paper introduces a Dynamic Feature Pyramid Extractor (DFPE) that extracts rich multi-scale features from the range view. By conducting experiments on two publicly available large-scale datasets, we demonstrate the efficiency and competitiveness of RPV-CASNet.

6 Discussion

Tables 2 and 3 clearly demonstrate the advantages of our algorithm over most fusion-based state-of-the-art methods, and it achieves satisfactory performance. It is particularly noteworthy that the mIoU of our proposed RPV-CASNet on the SemanticKITTI and nuScenes datasets is very close to that of RPVNet [10], and it even exceeds the accuracy of RPVNet on five small target categories. This indicates that our network performs well in fine-grained segmentation of LiDAR point cloud data. However, some challenges remain to be explored in future work. On reflection, we confirm that the self-attention mechanism has unrivalled advantages in capturing complex dependencies between elements and is particularly suited to point cloud data. Yet, when performing the self-attention computation in MFGFSAM, we found that about one-third of the features in the range branch and voxel branch were under-utilised, which may have limited the performance of our network. We will therefore continue to explore new, efficient fusion mechanisms to address this issue in future work.