1 Introduction

Three-dimensional detection is one of the most important problems in environment perception for autonomous driving, and a great deal of outstanding research has emerged. These methods are usually divided according to the sensors they use: Lidar-based and multi-sensor fusion (usually RGB + Lidar).

Some representative works based on multi-sensor fusion are as follows. Chen et al. [1] use a compact multi-view representation to encode sparse 3D point clouds and propose a multi-layer fusion strategy. Ku et al. [2] further extend [1] by taking an RGB image and a BEV (Bird's Eye View) map as input, using an FPN [3] to obtain full-resolution feature maps of both, and then extracting corresponding feature crops through crop-and-resize, which are fused to achieve 3D detection. Liang et al. [4] propose a Continuous Fusion Layer to fuse multi-scale image features into lidar features. Qi et al. [5] combine dimensionality-reduction techniques with mature 2D object detectors to complete the detection process. Liang et al. [6] achieve a fully integrated feature representation by fusing point-wise and ROI-wise features. Xu et al. [7] process the image data and the raw point cloud separately with CNN and PointNet architectures, and then combine the outputs in a new fusion network. Wang et al. [8] design a dense pixel-level fusion method to integrate the features of RGB data and point clouds in a more appropriate way. Vora et al. [9] project each lidar point into the output of an image semantic-segmentation network and concatenate the channel-wise activations with the intensity measurement of each lidar point, strengthening the point cloud with image semantics.

For detection methods based on multi-sensor fusion, most works project the point cloud onto the RGB image to obtain a dense representation. Although in principle they can combine the advantages of multiple sensors to further improve detection results, the specific fusion strategy still needs further research and development, including how to improve the fusion effect and reduce processing time.

Some representative Lidar-based detection methods are as follows. Zhou et al. [10] combine the BEV and the perspective view so that the two effectively complement each other, and propose the concept of dynamic voxelization, allowing each point to aggregate contextual information learned from different views. Yang et al. [11] propose a point-based spherical-anchor proposal generation model for point cloud object detection, which is general enough to achieve high recall, and introduce a PointsPool layer that combines the advantages of point-based and voxel-based methods for efficient and effective prediction. Lang et al. [12] use [13] to learn point cloud features in vertical columns (pillars), converting the complex three-dimensional point cloud space into a two-dimensional plane. Shi et al. [14] detect 3D objects from the raw point cloud and generate 3D proposals directly from it, achieving a higher recall rate than previous proposal-generation methods. Yang et al. [15] represent the scene with a BEV map and estimate three-dimensional objects with a pixel-level neural network. Ali et al. [16] use deconvolution and connect features layer by layer, so that the network retains location information while acquiring deeper feature information. Shi et al. [17] further extend [14] and make better use of box label information. Kuang et al. [18] improve [19] by processing voxels at multiple scales and fusing them with feature maps of different layers to improve detection accuracy. Chen et al. [20] effectively integrate the coordinate and indexed-convolution features of each point with an attention mechanism, retaining both accurate localization information and context information. Yang et al. [21] design a new SA module and discard the FP module [22], raising the detection speed to 25 FPS.

For Lidar-based detection, part of the work uses point-based methods that process each point directly. The advantage is that the obtained information is richer and the receptive field is flexible, but for the large scenes encountered in autonomous driving the number of points is often very large, so point-based methods face a huge computational burden. Another part of the work uses voxel-based methods, which convert the raw data space into a series of voxels and reduce computation time with 3D CNNs; the disadvantage is that the quality of the voxel features directly affects the detection results, and the receptive field cannot be changed flexibly.

In response to the above problems, this paper proposes a voxel-based method that uses only Lidar information as input. The raw points are grouped by fitting a Gaussian model, and sampling and grouping are then performed on this basis to generate high-quality feature maps and further improve detection accuracy.

2 Our method

This paper proposes a two-stage detector, the Gaussian-Voxel Detection Network (GVnet). In stage one, Gaussian-Voxel Feature Encoding is applied to the raw point cloud, voxelization is carried out, a 3D CNN is used to generate high-quality feature maps, and the feature maps are passed to an RPN [23] to generate a series of 3D proposals.

In stage two, voxel-ROI pooling is used to refine the proposals generated by the RPN. The corresponding receptive field is obtained through the mapping relationship between the raw points in a voxel and the feature map. The receptive field is then adjusted by sampling points from the Gaussian model corresponding to the raw points. This strengthens the features corresponding to each proposal and improves the classification and regression tasks.

The overall framework of GVnet is shown in Fig. 1.

Fig. 1 GVnet framework

2.1 Gaussian-voxel feature encoding

Traditional VFE [17] first groups and samples the raw points, and these two steps directly affect the quality of the subsequent refinement. The usual sampling operation unifies the number of points contained in each voxel. Suppose the number of points per voxel is set to k and the number of points a voxel actually contains is j. When j > k, points in the voxel are randomly dropped; when j < k, a certain number of points are added to the voxel, either by (1) adding points with a value of 0 or (2) copying some points already in the voxel. Obviously, there is a gap between the features of the points obtained by the above sampling and those of the raw points, which is not conducive to the subsequent 3D CNN feature extraction.
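To make the conventional sampling concrete, the following sketch (illustrative only; the function name, the "zero"/"copy" options, and the (j, d) point layout are our own assumptions) shows the per-voxel dropout/padding that GVnet replaces:

```python
import numpy as np

def sample_voxel_points(points: np.ndarray, k: int, pad_mode: str = "zero") -> np.ndarray:
    """Unify the number of points in one non-empty voxel to k (conventional VFE-style sampling).

    points: (j, d) array of the raw points falling into this voxel.
    pad_mode: "zero" pads with all-zero points, "copy" duplicates existing points.
    """
    j = points.shape[0]
    if j > k:                                   # too many points: random dropout
        keep = np.random.choice(j, k, replace=False)
        return points[keep]
    if j < k:                                   # too few points: pad up to k
        if pad_mode == "zero":
            pad = np.zeros((k - j, points.shape[1]), dtype=points.dtype)
        else:                                   # "copy": duplicate randomly chosen points
            pad = points[np.random.choice(j, k - j, replace=True)]
        return np.concatenate([points, pad], axis=0)
    return points
```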

In response to the above problem, this article proposes a new VFE method. First, Gaussian clustering is performed on the raw data to obtain the series of Gaussian models that the classes obey, and then the whole data are voxelized. When sampling each non-empty voxel, the Gaussian model corresponding to that voxel is used for the operation. This keeps the features of the generated points as close as possible to those of the raw points and improves the effect of the encoder. The disorder of point cloud data can be described well by a Gaussian mixture model (GMM) [24].

First define \(x_{j}\) as the \(j\)-th data point, where \(j = 1,2,3,...,N\). \(K\) is the number of Gaussian components in the GMM. \(\alpha_{k}\) is the probability that a data point belongs to the \(k\)-th Gaussian component, where \(\alpha_{k} \ge 0,\sum\nolimits_{k = 1}^{K} {\alpha_{k} = 1}\). \(\varphi (x\left| {\theta_{k} } \right.)\) is the distribution function of the \(k\)-th Gaussian component. \(\gamma_{jk}\) is the probability that the \(j\)-th data point belongs to the \(k\)-th component. Hence, the distribution of the Gaussian mixture model is:

$$ P(x\left| \theta \right.) = \sum\limits_{k = 1}^{K} {\alpha_{k} \varphi (x\left| {\theta_{k} } \right.)} $$
(1)

The parameters \(\theta\) are estimated by maximizing the log-likelihood:

$$ \log L(\theta ) = \sum\limits_{j = 1}^{N} {\log P(x_{j} \left| \theta \right.)} = \sum\limits_{j = 1}^{N} {\log \left( {\sum\limits_{k = 1}^{K} {\alpha_{k} \varphi (x_{j} \left| {\theta_{k} } \right.)} } \right)} $$
(2)

The model parameters are initialized and then all parameters are estimated iteratively (the EM algorithm). Each iteration includes two steps. The first step computes, based on the current parameters, the probability that data point j comes from component k:

$$ \gamma_{jk} = \frac{{\alpha_{k} \varphi (x_{j} \left| {\theta_{k} } \right.)}}{{\sum\nolimits_{k = 1}^{K} {\alpha_{k} \varphi (x_{j} \left| {\theta_{k} } \right.)} }} $$
(3)

The second step then updates the model parameters for the new iteration:

$$ \begin{aligned} \mu_{k} & = \frac{{\sum\nolimits_{j = 1}^{N} {\gamma_{jk} x_{j} } }}{{\sum\nolimits_{j = 1}^{N} {\gamma_{jk} } }} \\ \Sigma_{k} & = \frac{{\sum\nolimits_{j = 1}^{N} {\gamma_{jk} (x_{j} - \mu_{k} )(x_{j} - \mu_{k} )^{T} } }}{{\sum\nolimits_{j = 1}^{N} {\gamma_{jk} } }} \\ \alpha_{k} & = \frac{{\sum\nolimits_{j = 1}^{N} {\gamma_{jk} } }}{N} \\ \end{aligned} $$
(4)

The above steps are repeated until the parameters converge, giving the Gaussian mixture model corresponding to the original point cloud, as shown in Fig. 2.
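As a minimal illustration of how the converged GMM can drive the sampling step, the following sketch uses scikit-learn's GaussianMixture to run the EM iterations of Eqs. (2)-(4) and then draws k points from each fitted component; the function name, the library choice, and the per-component sampling are our assumptions rather than the authors' implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_and_sample(points: np.ndarray, n_components: int = 10, k: int = 15):
    """Fit a GMM to the raw point cloud and sample points from it.

    points: (N, 3) raw point coordinates. Returns the fitted model and,
    for each component, k points drawn from that component's Gaussian,
    which replace the zero/copy padding of conventional VFE.
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          max_iter=100)          # EM iterations as in Eqs. (3)-(4)
    gmm.fit(points)

    samples_per_component = []
    for c in range(n_components):
        # Draw k points from the c-th Gaussian N(mu_c, Sigma_c).
        pts = np.random.multivariate_normal(gmm.means_[c], gmm.covariances_[c], size=k)
        samples_per_component.append(pts)
    return gmm, samples_per_component
```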

Fig. 2 GMM result comparison

It can be seen from Fig. 2 that the data processed by the GMM obey their respective distributions, and the blue points are the class centers. By sampling within the corresponding Gaussian distribution, the quality of the points subsequently sent to the MLP network for feature extraction is improved, and a higher-quality feature map is obtained for use in stage two.

2.2 Voxel-ROI pooling

In order to obtain higher accuracy, the feature map generated by the 3D CNN is first sent to the RPN to generate a series of 3D proposals; ROI Pooling [23] then brings them to the same spatial size, and the corresponding ROI vectors are output to complete classification and regression. The overall process is shown in Fig. 3:

Fig. 3 Stage two work process

G Point refers to a point with high confidence that obeys the corresponding Gaussian distribution. 3D sparse convolution is chosen as the backbone of the 3D CNN. Voxelization of the point cloud generates about 5k–8k voxels with a sparsity of about 0.005. Direct use of 3D convolution would consume a huge amount of computing time and memory, which sparse convolution avoids: [25] limits the sparsity of the output through the sparsity of the input data, greatly reducing the computational cost of subsequent convolution operations.
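A minimal sketch of such a sparse 3D backbone, assuming the spconv library and illustrative channel/stride choices (this is not the exact GVnet architecture):

```python
import torch.nn as nn
import spconv.pytorch as spconv   # assumes the spconv library used by SECOND-style detectors

class SparseBackbone(nn.Module):
    """Minimal sparse 3D backbone sketch; channel sizes and layout are illustrative only."""

    def __init__(self, in_channels: int = 4):
        super().__init__()
        self.net = spconv.SparseSequential(
            spconv.SubMConv3d(in_channels, 16, 3, padding=1, indice_key="subm1"),
            nn.BatchNorm1d(16), nn.ReLU(),
            spconv.SparseConv3d(16, 32, 3, stride=2, padding=1),  # downsample, stride 2
            nn.BatchNorm1d(32), nn.ReLU(),
            spconv.SparseConv3d(32, 64, 3, stride=2, padding=1),  # downsample, stride 2
            nn.BatchNorm1d(64), nn.ReLU(),
        )

    def forward(self, feats, coords, spatial_shape, batch_size):
        # feats: (M, C) features of non-empty voxels; coords: (M, 4) = (batch, z, y, x)
        x = spconv.SparseConvTensor(feats, coords.int(), spatial_shape, batch_size)
        return self.net(x).dense()   # dense feature map passed on to the RPN
```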

A major disadvantage of voxel-based detection is that the receptive field cannot be adjusted flexibly. In response to this problem, this paper designs a pooling method with better feature-acquisition performance, named Voxel-ROI Pooling. The specific method is to map the features corresponding to a 3D proposal back to the voxels before pooling and to obtain the Gaussian distribution information of the corresponding points. The mapping relationship is similar to that between a feature and an ROI in a 2D CNN [26], as shown in Fig. 4.

Fig. 4 ROI to raw data mapping relationship

As shown in Fig. 4, \((x,y)\) represents a coordinate on the feature map, and \((x^{^{\prime}} ,y^{^{\prime}} )\) represents the corresponding coordinate in the original space. The conversion relationship between the two is:

$$ (x^{^{\prime}} ,y^{^{\prime}} ) = (Sx,Sy) $$
(5)
$$ S = \prod\limits_{i} {s_{i} } $$
(6)

where \(s_{i}\) is the stride of the i-th intermediate layer. After obtaining the feature representation in the raw data, this article further samples the m points with the highest confidence among the raw points corresponding to each proposal as feature points, which controls the size and characteristics of the receptive field. The specific principle is shown in Fig. 5:

Fig. 5 Correspondence between receptive field and raw point

The receptive field is controlled according to the formula:

$$ Q = (W - K + 2P)/S + 1 $$
(7)

where Q is the feature size, W is the input size, K is the convolution kernel size, P is the padding, and S is the stride. Rearranging gives the input size:

$$ W = (Q - 1) \cdot S - 2P + K $$
(8)

Here W is the receptive field corresponding to the feature, and the points within the receptive field directly affect the output feature. Therefore, during pooling, the Voxel-ROI Pooling method can effectively adjust the receptive field and further enhance the voxel features.
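A small numerical sketch of the mapping in Eqs. (5)-(6) and the receptive-field calculation in Eq. (8); the function names and the example strides are hypothetical:

```python
def feature_to_raw(x: int, y: int, strides) -> tuple:
    """Map a feature-map coordinate back to the original space, Eqs. (5)-(6)."""
    S = 1
    for s in strides:          # product of the strides of the intermediate layers
        S *= s
    return S * x, S * y

def receptive_field(Q: int, K: int, P: int, S: int) -> int:
    """Input size (receptive field) covered by a feature of size Q, Eq. (8)."""
    return (Q - 1) * S - 2 * P + K

# Hypothetical numbers: two stride-2 layers, a 3x3 kernel, no padding.
print(feature_to_raw(5, 7, strides=[2, 2]))    # -> (20, 28)
print(receptive_field(Q=3, K=3, P=0, S=2))     # -> 7 raw units per feature
```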

Finally, these features are sent to a single-layer MLP network that outputs a fixed-size one-dimensional vector, giving the final feature expression used for detection.

2.3 Loss function

The loss function of the network is divided into two parts. The first part is used for classification. Because positive and negative samples in point cloud data are highly imbalanced, Focal Loss [27] is selected as the solution, as shown in the formula:

$$ L_{cls} = - \alpha_{t} (1 - p_{t} )^{\gamma } \log (p_{t} ) $$
(9)

For a negative sample predicted with high probability, \((1 - p_{t} )^{\gamma }\) approaches 0, so its loss value is reduced and it is effectively suppressed. For a positive sample predicted with low probability, \((1 - p_{t} )^{\gamma }\) has little effect on its loss; in this way the loss of easy samples is reduced and the contribution of difficult samples to the gradient is increased. The experiments in this paper show that GVnet performs best when \(\gamma = 2\) and \(\alpha = 0.25\).
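A minimal sketch of the focal loss in Eq. (9), assuming sigmoid classification outputs and mean reduction (neither is specified in the paper); the defaults follow the reported \(\gamma = 2\), \(\alpha = 0.25\):

```python
import torch

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss, Eq. (9); targets are 0/1 labels of the same shape as logits."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)          # probability of the true class
    alpha_t = torch.where(targets == 1, torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))).mean()
```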

For the second part, \({\text{smooth}}_{L1}\) [28] is used for bounding box regression:

$$ L_{{{\text{reg}}}} (t^{\mu } ,v) = \sum\limits_{{i \in \left\{ {x,y,z,l,w,h} \right\}}} {{\text{smooth}}_{L1} (t_{i}^{\mu } - v_{i} )} $$
(10)

In which:

$$ {\text{Smooth}}_{L1} (x) = \left\{ {\begin{array}{*{20}c} {0.5x^{2} ,} & {{\text{if}}\,\,\left| x \right| < 1} \\ {\left| x \right| - 0.5,} & {{\text{otherwise}}} \\ \end{array} } \right. $$
(11)

Further, the total loss function can be obtained:

$$ L_{{{\text{total}}}} = L_{{{\text{cls}}}} + L_{{{\text{reg}}}} $$
(12)

The total loss function is the sum of the above two parts without any weighting. The training details of the loss function are explained in the experiment section.
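A corresponding sketch of the regression and total losses in Eqs. (10)-(12), using PyTorch's built-in smooth L1 (the sum reduction over the box residuals is our assumption):

```python
import torch
import torch.nn.functional as F

def regression_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Smooth-L1 loss over the (x, y, z, l, w, h) box residuals, Eqs. (10)-(11).
    With the default beta = 1, F.smooth_l1_loss uses 0.5*x^2 for |x| < 1 and |x| - 0.5 otherwise."""
    return F.smooth_l1_loss(pred, target, reduction="sum")

def total_loss(l_cls: torch.Tensor, l_reg: torch.Tensor) -> torch.Tensor:
    """Unweighted sum of the classification and regression terms, Eq. (12)."""
    return l_cls + l_reg
```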

3 Experiment

This part introduces some details of GVnet training and its performance on the KITTI [29], nuScenes [30], and Waymo [31] datasets, and compares the results with some other commonly used 3D detection models. Finally, an ablation experiment shows that the G-VFE and Voxel-ROI pooling proposed in this paper effectively improve the accuracy of the model.

3.1 Training step

The KITTI dataset contains real image data collected from scenes such as urban areas, rural areas, and highways. Each image can contain up to 15 cars and 30 pedestrians, with various degrees of occlusion and truncation. KITTI is also one of the most important datasets in the field of autonomous driving, so this experiment uses it for network performance evaluation, with Average Precision (AP) as the evaluation metric [32]. 40% of the training set is split off as a validation set to monitor the performance of the model in real time and prevent over-fitting.

The nuScenes dataset consists of 1000 scenes, each 20 s long and containing a variety of scenarios. Each scene has 40 key frames, that is, 2 key frames per second; the other frames are sweeps. Key frames are manually annotated, and each frame contains several annotations in the form of bounding boxes, covering not only size and range but also category, visibility, and so on.

The Waymo dataset contains 1000 driving segments, each consisting of 20 s of continuous driving footage. Vehicles, pedestrians, bicycles, signage, etc., are carefully labeled, with a total of 1200 3D labels and 1.2 million 2D labels.

The network framework GVnet proposed in this paper is trained on a single NVIDIA RTX 3090; training takes about 15 h for KITTI, 20 h for nuScenes, and 30 h for Waymo.

To prevent over-fitting, data augmentation is used. The raw data are first randomly flipped along the x-axis, then rotated by \(\left[ { - \frac{\pi }{4},\frac{\pi }{4}} \right]\) around the z-axis, and finally scaled by a random factor in the range \(\left[ {0.95,1.05} \right]\).
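A sketch of this augmentation pipeline; the (x, y, z, ...) point layout and the interpretation of "flip along the x-axis" as negating the y coordinate are assumptions:

```python
import numpy as np

def augment(points: np.ndarray) -> np.ndarray:
    """Global augmentation described above, applied per training sample (illustrative)."""
    pts = points.copy()
    if np.random.rand() < 0.5:                        # random flip along the x-axis
        pts[:, 1] = -pts[:, 1]                        # assumes (x, y, z, ...) layout; negate y
    theta = np.random.uniform(-np.pi / 4, np.pi / 4)  # rotation around the z-axis
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    pts[:, :2] = pts[:, :2] @ rot.T
    pts[:, :3] *= np.random.uniform(0.95, 1.05)       # global scaling
    return pts
```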

For each class of samples in a voxel, 15 points are used as features; the NMS threshold is set to 0.8; the voxel size is \(\left[ {\begin{array}{*{20}c} {0.1} & {0.1} & {0.2} \\ \end{array} } \right]\); the number of cluster sets is 10 for KITTI and 20 for nuScenes and Waymo. Adam is used as the optimizer, the batch size is 4, the total number of epochs is 80, and the initial learning rate is 0.01. From the 30th to the 50th epoch, the learning rate is gradually decreased to 0.0001. The convergence of the loss function is shown in Fig. 6:

Fig. 6 Comparison of loss convergence with different models

As can be seen from Fig. 6, because GVnet uses G-VFE with a more reasonable sampling operation, its initial loss value is lower than that of the other methods, which helps the model converge quickly and further improves performance.

Performance on datasets

Table 1 shows the performance of GVnet and other network models on the KITTI test set. The comparison shows that, for 3D detection, GVnet achieves a certain accuracy improvement over the other methods: under AP70, the easy, moderate, and hard modes improve by 0.3, 0.46, and 0.82, respectively. In terms of BEV detection, GVnet also improves to varying degrees.

Table 1 Comparison of different network models on KITTI test set

Table 2 shows the performance of GVnet on the val set. GVnet is a two-stage network framework: although some inference time is added, the gain is an increase in accuracy, which is consistent with the design concept of a two-stage detector.

Table 2 Comparison on KITTI val

Table 3 shows the performance of GVnet and other network models on the nuScenes dataset. Our method performs well on most classes, and mAP improves by 2.07 compared with PointPillars. For datasets with such large scenes, GVnet can still achieve good performance by increasing the number of cluster sets.

Table 3 Test on nuScenes dataset

Table 4 shows the performance of GVnet and other detection methods on the Waymo dataset. For the 3D detection task, GVnet improves mAP by 3.78 compared with the other methods; for BEV detection, GVnet improves mAP by 0.45.

Table 4 Test on Waymo dataset

In terms of point cloud visualization, each frame of the KITTI dataset contains a large number of points, so software such as MeshLab cannot display scene details well. Therefore, in order to better demonstrate the detection effect of GVnet, this experiment visualizes the detection results with KITTI Viewer Web [23]; some of the results are shown in Fig. 7.

Fig. 7 Visual performance of GVnet on KITTI dataset

It can be seen from Fig. 7 that GVnet performs well in terms of recall and accuracy, but the price is that the more complicated sampling operation leads to false detections of some plausible targets.

3.2 Ablation experiment

In order to better explain the rationality of the G-VFE and V-ROI Pooling operations, this paper carries out an ablation experiment on the KITTI dataset. G-VFE and the original VFE are used as the encoder of the network backbone, and then the original ROI Pooling and V-ROI Pooling are used for feature extraction, respectively. The combined experimental results are shown in Table 5.

Table 5 Comparison of experimental effects of different components of the network model

It can be seen from Table 5 that the network using G-VFE + V-ROI Pooling performs best under the Mod. mAP indicator. The network that uses only G-VFE + ROI Pooling generates a higher-quality feature map, so its mAP is 1.47 higher than that of the original VFE + ROI Pooling. In summary, the G-VFE and V-ROI Pooling modules proposed in this article can effectively improve the overall accuracy of the network model by adjusting the performance of the corresponding parts, which fully demonstrates their rationality. However, performing GMM clustering before feature extraction causes an additional computational burden; this shortcoming is reflected in the inference time, and a high-performance GPU is required to run this method.

4 Conclusion

This paper proposes a two-stage 3D detection network. The original Voxel Feature Encoder is optimized in stage one: the Gaussian distributions that the raw points conform to are estimated, the Gaussian Mixture Model is used for voxel feature encoding, and a 3D CNN is applied to obtain high-quality feature maps. In the ROI Pooling of stage two, features are sampled based on the raw points in each voxel to control the receptive field and enhance the features; this approach is designed to focus on key points and ignore unimportant points. The output features are then sent to the fully connected layer to complete the classification and regression of the object. Finally, the proposed method was compared with other methods on the datasets, and an ablation experiment was carried out to demonstrate the rationality of the proposed GVnet. However, the number of GMM clusters must be set manually in advance, which prevents end-to-end training of the entire network and reduces the generalization ability of the model; different datasets require different numbers of clusters to be set in advance. Therefore, using GMM as a module of the network and adding it to the training process requires further research and development.