1 Introduction

The realization of autonomous driving depends on an accurate environment perception system. With the continuous improvement of autonomous driving technology, 3D object detection methods have developed rapidly [1,2,3,4,5,6,7,8]. Since 3D objects are randomly distributed in the perception space of the lidar, most of the 3D object information contained in a point cloud is derived from the spatial relationships between points rather than the three-dimensional coordinates of a single point [8]. Certain works [8,9,10,11,12] directly process the raw point cloud data; however, because point clouds are unordered and sensitive to rotation [8], it is difficult for these methods to capture detailed object information. Other works convert point cloud data into 2D bird-view or forward-view images and apply 2D convolutional neural networks (CNNs), which are comparatively mature, to predict 3D objects. However, the 3D-to-2D conversion causes a substantial loss of original data and lowers model accuracy. Voxel-based methods [4,5,6,13,14,15,16,17] transform point cloud data into 3D voxels and retain the spatial relationships between points, but this approach increases the computational cost and also loses original data.

In addition, certain networks [6, 15, 17] adopt a two-stage approach to further optimize the regions of interest through flexible changes of the receptive field and more detailed feature extraction from the point cloud. The accuracy of two-stage object detection models is notably higher than that of one-stage models.

By comparing existing 3D object detection models, we discover that the commonly employed point cloud voxelization methods cause a loss of original information, which is essential for predicting the 3D bounding box. In addition, most existing models focus on improving the detection accuracy of a single type of object, such as cars. However, in a lidar sensing system, the types and sizes of targets are diverse, and the characteristics and appropriate detection methods of different objects also differ. Therefore, it is worthwhile to improve a model's detection robustness across different types of objects.

In contrast, we propose a 3D object detection neural network that effectively combines point-based and voxel-based methods with fused voxel information. The network structure is shown in Fig. 1. In Stage-I, in addition to the point coordinates, we obtain the point cloud density of each voxel as supplementary information during voxelization. Then, through the voxel preprocessing module (VPM), the voxel density and coordinates are fully combined at multiple levels to obtain more elaborate initial voxel features. We apply a sparse convolution backbone to extract multiscale voxel features. We then design a new region proposal network (Cross-RPN) that integrates multiscale and multidepth context features with a multilayer intersecting structure. In addition, we improve the target generation strategy commonly employed in 3D object detection. Furthermore, we merge our network with several advanced two-stage methods, namely RoI-aware pooling (PartA2), RoI-grid pooling (PV-RCNN) and voxel RoI pooling (Voxel RCNN).

Fig. 1

Overall architecture of our network, which consists of two stages. FuseNet: our network, which extracts initial voxel features via the voxel preprocessing module and generates 3D proposal bounding boxes via Cross-RPN. Box refine stage: advanced refinement methods that optimize and improve the 3D proposal boxes

2 Related Works

2.1 One-Stage 3D Object Detection

2.1.1 Point-Based Method

PointNet [8], proposed by researchers at Stanford University, directly processes the point cloud and adopts max pooling as a symmetric function to address the permutation invariance of the raw point cloud. Considering the lack of local information in PointNet, the same researchers further proposed PointNet++ [9], which applies a Farthest Point Sampling (FPS) module to capture the relative relationships between points and offers a more flexible receptive field.

2.1.2 Voxel-Based Method

Zhou et al. [4] proposed VoxelNet, which divides the point cloud space into a three-dimensional regular grid, uses a PointNet network to extract point cloud features within each nonempty voxel, and effectively utilizes the 3D information in the point cloud. Yan et al. [5] employed sparse convolutions to replace the dense voxel convolutions in VoxelNet, which improved the operating efficiency of the network and reduced memory consumption. SA-SSD [14] proposes an auxiliary training scheme that uses semantic segmentation and center point prediction tasks to assist the detection task; in the test stage, the auxiliary network is detached from the main network. SE-SSD [19] uses a teacher-student structure and an IoU-based matching strategy to achieve excellent performance.

2.1.3 Image-Based Method

Complex-YOLO [20] transforms the point cloud into a bird-view image and applies the YOLO network, originally designed for 2D object detection, to 3D object detection. Yang et al. proposed PIXOR [21], which applies ResNet as the backbone network and encodes the bird-view representation as an occupancy map. HDNET [22] adds high-precision map information to the network to reduce the impact of ground slopes on object detection.

2.2 Two-Stage 3D Object Detection

Building on one-stage object detection networks, RefineDet [23] proposed a two-stage RPN to perform secondary optimization on the predicted 3D bounding boxes. The PartA2 network proposed by Shi et al. [6] uses voxel semantic segmentation results to optimize the detection results through an RoI-aware pooling module, remarkably improving accuracy. PV-RCNN [15] combines voxel-based and point-based methods with a VSA module, which effectively improves object detection performance. Voxel RCNN [4] uses voxel RoI pooling to extract regional features from 3D voxel features for refinement and achieves a balance between accuracy and efficiency. However, two-stage methods increase the number of model parameters, resulting in degraded real-time performance.

3 FuseNet: Detection with Fused Information

In this section, we propose a one-stage 3D object detection network for lidar point clouds (FuseNet) and combine FuseNet with certain advanced refinement methods.

3.1 FuseNet

FuseNet consists of a voxel preprocessing module, a backbone network and a Cross-RPN module, which together generate the bounding boxes and classifications of RoIs from point clouds.

3.1.1 Voxel Preprocessing Module (VPM)

The disadvantage of a voxel-based 3D object detection network is that voxelization of the point cloud leads to a loss of original point cloud information, which makes it difficult for the network to extract object details. Existing methods [4,5,6, 15, 24] attempt to recover object details from the voxel features inside the neural network. For example, PartA2 [6] predicts the intra-object positions of voxels, and PV-RCNN [15] and Voxel RCNN [4] map voxels to key points to maintain the 3D positional relationships of voxels. However, these methods only extract the average coordinates of points as the initial information during voxelization, so the information loss caused by voxelization is not alleviated.

For short-range object detection, the point cloud mapping of the object is clear and dense, and changes in point cloud density are obvious. As shown in Fig. 2, when the car is perpendicular or parallel to the lidar field of view, the point cloud distribution is uniform, and there is an abrupt change in density at the left and right boundaries. When there is an angle between the car and the lidar, the point cloud density on the angled surface of the car decays with the angle and distance. For long-range object detection, the point cloud mapping of the object is sparse, usually consisting of several to dozens of points, so the number and density of points can reflect the size of the object and thus help determine its class.

Fig. 2

Point cloud mapping for cars at different angles. a indicates that the car is perpendicular to the lidar field of view, and b indicates that it is not perpendicular

Considering the usefulness of such density information for visual discrimination, we attempt to fuse the voxel density information with the coordinate information. Due to the unordered nature of the raw point cloud data, extracting local information such as density is complex. By studying the process of sparse voxel data generation, we found that a point-based method can be utilized to expand the initial voxel features after the points are mapped to the voxels at the corresponding locations and before the sparse voxel data are generated; that is, each voxel is regarded as a point, and the data of all points within the voxel serve as the source of the voxel's internal features. An efficient fusion method is also critical. After many comparative experiments on the network structure, we fuse features at multiple levels, including elementwise concatenation and elementwise addition, and introduce attention mechanisms to improve the efficiency of the module. The network structure is shown in Fig. 3.

Fig. 3

Structure of the voxel preprocessing module. In the figure, \(N\) represents the number of nonempty voxels, \(M\) represents the maximum number of points in a single voxel, and \(\left( {W, H, L} \right)\) represents the voxel dimensions

Specifically, we voxelize the point cloud within the specified valid space \(\left( x:\left[ 0, 70.4 \right]\,{\text{m}},\; y:\left[ -40, 40 \right]\,{\text{m}},\; z:\left[ -3, 1 \right]\,{\text{m}} \right)\); that is, the space is divided into 3D voxel blocks of equal size (\(\left( 0.05\,{\text{m}}, 0.05\,{\text{m}}, 0.1\,{\text{m}} \right)\) is selected), and the points are allocated to the corresponding voxels according to their coordinates. We then design a combined extraction module: the voxel density and coordinate information are fused by voxelwise concatenation, where the density information is obtained from the point number of each voxel normalized by the maximum, and the coordinate information is obtained from the point coordinates within a single voxel. We then adopt MiniPointNet, a reduced version of PointNet [8], to extract detailed voxel information from the individual point features and the spatial relationships of points within a single voxel. Next, two 3D sparse convolution branches with \({\text{kernel size}} = \left( {3,3} \right)\), \({\text{stride}} = \left( {1,1} \right)\) and \({\text{depth}} = 16\) are used to aggregate the density information and the MiniPointNet features of the nonempty voxels surrounding each nonempty voxel. The partial density information and partial coordinate information are fused by voxelwise addition to obtain the final initial features of the voxel.
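For illustration, the following minimal sketch shows how the density feature described above could be attached to the per-point features during voxelization. It assumes PyTorch tensors, reads the "maximum-normalized point number" as the per-voxel point count divided by the allowed maximum per voxel, and uses a hypothetical function name; it is one possible implementation, not the exact code of the VPM.

```python
import torch

def build_voxel_inputs(voxel_points, num_points_per_voxel, max_points=10):
    """Sketch: append a normalized density feature to every point in a voxel,
    producing the (N, M, 5) tensor consumed by MiniPointNet.

    voxel_points:         (N, M, 4) tensor of (x, y, z, intensity), zero-padded
    num_points_per_voxel: (N,) tensor with the valid point count of each voxel
    """
    n, m, _ = voxel_points.shape
    # Density = point count normalized by the maximum points allowed per voxel.
    density = (num_points_per_voxel.float() / max_points).clamp(max=1.0)   # (N,)
    density = density.view(n, 1, 1).expand(n, m, 1)                        # (N, M, 1)
    # Concatenate the density feature to the raw point features -> (N, M, 5).
    points_with_density = torch.cat([voxel_points, density], dim=-1)
    return points_with_density, density[:, 0, 0]   # per-voxel density as a side output
```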

MiniPointNet consists of two MLP layers and one pooling layer. The input data dimension of MiniPointNet is \(\left( {N, M, 5} \right)\), in which \(N\) represents the number of nonempty voxels, \(M\) represents the maximum number of points in a single voxel, and 5 is the feature depth (x, y, z, intensity, density). In the MLP layers, we apply convolutions with \({\text{kernel size}} = 1\), \({\text{stride}} = 1\), and \({\text{depth}} = 8\). In the pooling layer, we set the max pooling size to \(M\). The dimension of the output data is \(\left( {N, 8} \right)\).
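A minimal sketch of MiniPointNet consistent with this description is given below; the ReLU activations and exact module layout are our assumptions.

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """Two shared (pointwise) MLP layers implemented as kernel-size-1
    convolutions, followed by max pooling over the M points of each voxel."""

    def __init__(self, in_channels=5, out_channels=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(out_channels, out_channels, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, voxel_points):           # (N, M, 5)
        x = voxel_points.transpose(1, 2)       # (N, 5, M), channels first for Conv1d
        x = self.mlp(x)                        # (N, 8, M)
        x, _ = torch.max(x, dim=2)             # symmetric max pooling over points
        return x                               # (N, 8) per-voxel feature
```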

3.1.2 Backbone Network

As shown in Fig. 1, we apply a commonly employed backbone network to generate multiscale voxel features, composed of four sparse convolution blocks with filter numbers of 16, 32, 32, 64, and 64. Each sparse convolution block consists of two submanifold convolutions with \({\text{kernel size}} = 3\) and one sparse convolution with \({\text{stride}} = 2\). Each convolution is followed by batch normalization and ReLU nonlinearity. Finally, the output tensors are stacked along the Z axis to generate bird's-eye-view (BEV) feature maps. The parameters of the backbone network are given in Table 1.
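The block structure can be sketched as follows. Dense nn.Conv3d layers are used here purely as a readable stand-in for the sparse and submanifold convolutions of the actual backbone (e.g., as provided by a sparse convolution library such as spconv), and the channel widths are illustrative.

```python
import torch.nn as nn

def sparse_block(c_in, c_out):
    """One backbone block: two 'submanifold-style' 3x3x3 convolutions plus a
    stride-2 downsampling convolution, each followed by BatchNorm and ReLU.
    Dense convolutions stand in for the sparse ones used in practice."""
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm3d(c_out), nn.ReLU(inplace=True),
        nn.Conv3d(c_out, c_out, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm3d(c_out), nn.ReLU(inplace=True),
        nn.Conv3d(c_out, c_out, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm3d(c_out), nn.ReLU(inplace=True),
    )

# Four downsampling stages; channel widths follow the 16/32/32/64/64
# progression quoted above (exact per-layer settings are listed in Table 1).
backbone = nn.ModuleList([
    sparse_block(16, 32),
    sparse_block(32, 32),
    sparse_block(32, 64),
    sparse_block(64, 64),
])
```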

Table 1 Parameters of Convolutions in the Backbone Network

3.1.3 Target Generation Strategy

In our experiments, we found that when the depth of the network increases, the model accuracy does not increase and may even decrease. Investigating this, we determined that in bounding box regression networks [4,5,6, 15, 19], the direct output of the network is the scaling coefficient of the center position and size of the preset anchors. However, for the center position of the bounding box, a scaling coefficient of the coordinate position lacks practical theoretical meaning, which weakens the correlation between the input data and the predicted results.

To solve this problem, an improved anchor generation strategy is proposed in this paper: the deviation between the anchor \(\left( {x_{a} ,y_{a} ,z_{a} ,l_{a} ,w_{a} ,h_{a} ,\theta_{a} } \right)\) and the object bounding box label \(\left( {x_{{{\text{gt}}}} ,y_{{{\text{gt}}}} ,z_{{{\text{gt}}}} ,l_{{{\text{gt}}}} ,w_{{{\text{gt}}}} ,h_{{{\text{gt}}}} ,\theta_{{{\text{gt}}}} } \right)\) is calculated as the regression target of the network. The deviation is computed as shown in Eq. (1):

$$ \left\{ \begin{array}{l} x_{res} = \dfrac{x_{\text{gt}} - x_{a}}{dx} \\ y_{res} = \dfrac{y_{\text{gt}} - y_{a}}{dy} \\ z_{res} = \dfrac{z_{\text{gt}} - z_{a}}{dz} \\ l_{res} = \log \left( \dfrac{l_{\text{gt}}}{l_{a}} \right) / dx \\ w_{res} = \log \left( \dfrac{w_{\text{gt}}}{w_{a}} \right) / dy \\ h_{res} = \log \left( \dfrac{h_{\text{gt}}}{h_{a}} \right) / dz \\ \theta_{res} = \sin \left( \theta_{\text{gt}} - \theta_{a} \right) \end{array} \right. $$
(1)

where \(\left( x_{res}, y_{res}, z_{res}, l_{res}, w_{res}, h_{res}, \theta_{res} \right)\) represents the deviation between the anchor and the ground-truth 3D bounding box, namely, the network regression target, and \(\left( dx, dy, dz \right)\) represents the voxel resolution used when the anchors are generated.

For the object center deviation \(\left( {x_{res} ,y_{res} ,z_{res} } \right)\), since the deviation between the labeled center of the bounding box and the corresponding anchor center is usually within a single voxel and thus small, we directly use the position deviation as the regression target and remove the voxel resolution information from it. The actual distance deviation of the center of the 3D bounding box is thereby converted into a deviation in voxel units, so the actual metric scale does not need to be retained during network prediction, which simplifies the network task.

For the bounding box size deviations \(\left( {l_{res} ,w_{res} ,h_{res} } \right)\), because the voxel features have different resolutions along the \(\left( {x,y,z} \right)\) axes, the sensitivity of the network predictions differs across directions of the voxel space. Therefore, we also remove the voxel resolution information from these deviations, which normalizes the sensitivity of the network predictions and improves network performance.
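The target encoding of Eq. (1) can be written compactly as follows; the function name and argument layout are illustrative.

```python
import math

def encode_targets(gt, anchor, voxel_size):
    """Regression targets of Eq. (1). `gt` and `anchor` are
    (x, y, z, l, w, h, theta) tuples; `voxel_size` is (dx, dy, dz),
    the voxel resolution used when the anchors were generated."""
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    dx, dy, dz = voxel_size
    return (
        (xg - xa) / dx,                # center deviation in voxel units
        (yg - ya) / dy,
        (zg - za) / dz,
        math.log(lg / la) / dx,        # size deviation, normalized by resolution
        math.log(wg / wa) / dy,
        math.log(hg / ha) / dz,
        math.sin(tg - ta),             # angle deviation
    )
```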

3.1.4 Cross-RPN

In 2D and 3D object detection networks, anchor-based region proposal networks (RPNs) are widely employed to determine the areas where objects may exist, for example, in Fast R-CNN [3], STD [12], Faster R-CNN [18], SECOND [5], PartA2 [6], PV-RCNN [15] and Voxel RCNN [4]. For objects of different sizes, the importance of feature maps with different resolutions varies in object boundary prediction, so fully fusing features with different resolutions is key to detecting multiple object types. EfficientDet [25] proposed a multidimensional compound scaling method for object detection and uses the multiscale feature fusion structure BiFPN to perform multilevel fusion of features with different depths, widths and resolutions, showing that the full fusion of context information is effective for multitype object detection.

Inspired by this design, we propose an anchor-based cross-structure region proposal network (Cross-RPN): with multiscale BEV features from the backbone network as input (we choose \(D/4, D/8, D/16\), in which \(D\) is the initial voxel dimension), 2D convolution, max pooling, and upsampling interpolation are utilized to perform multilevel fusion of bird-view features with different depths and dimensions.

For features with the same dimension but different depths, the channel reduction module [6] is used to unify the feature depths. For features with the same depth but different dimensions, a dimension-reduction module, which consists of max pooling with \({\text{kernel size}} = 3\) and an upsampling interpolation function with \({\text{scale factor}} = 2\), is employed to unify the dimensions. Next, 3D bounding boxes, which consist of the center position, box size and angle, are generated with two convolution branches. The network structure is shown in Fig. 4. We choose \({\text{depth}} = 64\) and \({\text{stride}} = 1\) for all 2D convolutions in Cross-RPN.

Fig. 4

Structure of Cross-RPN. \(T_{D}^{i}\) represents the features with \(D\) dimensions at the \(i\)th layer, and \(T_{\tilde{D}}^{i}\) represents the features with a dimension different from \(D\) at the \(i\)th layer
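The fusion primitives used inside each Cross-RPN layer can be sketched as follows; the stride and padding of the pooling step, the bilinear interpolation mode and the example channel widths are our assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

def unify_depth(features, channel_reduction):
    """Depth unification: a 1x1 convolution (a stand-in for the channel
    reduction module of [6]) maps features of any depth to the common depth of 64."""
    return channel_reduction(features)

def shrink(features):
    """Dimension unification, downward path: kernel-3 max pooling halves the BEV resolution."""
    return F.max_pool2d(features, kernel_size=3, stride=2, padding=1)

def grow(features):
    """Dimension unification, upward path: interpolation with scale factor 2
    doubles the BEV resolution."""
    return F.interpolate(features, scale_factor=2, mode="bilinear", align_corners=False)

# Example: bring a deeper feature map up to the resolution and depth of a
# shallower one before element-wise fusion inside a Cross-RPN layer.
reduce_to_64 = nn.Conv2d(128, 64, kernel_size=1, stride=1)  # illustrative depths
# fused = shallow_feat + unify_depth(grow(deep_feat), reduce_to_64)
```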

In Table 3, we investigate the effect of the depth of Cross-RPN. Table 3 shows that when the number of Cross-RPN layers exceeds four, the network performance changes little. Considering the computation and complexity of the model, we choose a Cross-RPN module with four layers in FuseNet.

The loss of Cross-RPN consists of a classification loss and a regression loss. For the classification loss \(L_{RPN-cls}\), we use the focal loss, as shown in Eq. (2). For the regression loss, we use the Smooth L1 loss function together with the binary classification of the orientation direction \(L_{dir}\) [6]:

$$ L_{RPN-cls} = \left\{ \begin{array}{ll} -\alpha \left( 1 - p \right)^{\gamma} \log p, & \text{if } y = 1 \\ -\left( 1 - \alpha \right) p^{\gamma} \log \left( 1 - p \right), & \text{otherwise} \end{array} \right. $$
(2)
$$ L_{RPN-reg} = \sum\limits_{n \in \left\{ x,y,z,l,w,h,\theta \right\}} L_{Smooth-L1} \left( n_{res}^{p}, n_{res} \right) + \sigma L_{dir} $$
(3)

where \(n_{res}^{p}\) represents the predicted regression parameter and \(n_{res}\) represents the label of the regression parameter. \(\sigma\) is the weight of the direction loss, and \(\sigma = 0.2\). The loss of Stage-I, \(L_{{\text{stage - I}}}\), can be formulated as:

$$ L_{\text{stage-I}} = L_{RPN-cls} + \delta \frac{1}{M_{\text{pos}}} L_{RPN-reg} $$
(4)

where \(N_{pos}\) represents the number of foreground voxels, \(M_{pos}\) represents the number of positive samples among the RoIs, and \(\delta\) is the weight (\(\delta = 2\) is selected).
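For clarity, the Stage-I loss of Eqs. (2)-(4) can be assembled as in the following sketch; the focal loss parameters \(\alpha\) and \(\gamma\) are the common defaults and are assumptions here, whereas \(\sigma = 0.2\) and \(\delta = 2\) follow the text.

```python
import torch
import torch.nn.functional as F

def stage1_loss(cls_prob, cls_label, reg_pred, reg_target, dir_loss,
                m_pos, alpha=0.25, gamma=2.0, sigma=0.2, delta=2.0):
    """Sketch of the Stage-I loss: focal classification loss (Eq. 2),
    Smooth L1 regression loss plus direction term (Eq. 3), combined as in Eq. 4."""
    # Focal classification loss, Eq. (2).
    p = cls_prob.clamp(1e-6, 1 - 1e-6)
    positive = cls_label == 1
    l_cls = torch.where(positive,
                        -alpha * (1 - p) ** gamma * torch.log(p),
                        -(1 - alpha) * p ** gamma * torch.log(1 - p)).sum()
    # Regression loss, Eq. (3): Smooth L1 over the 7 box residuals + direction loss.
    l_reg = F.smooth_l1_loss(reg_pred, reg_target, reduction="sum") + sigma * dir_loss
    # Total Stage-I loss, Eq. (4), with the regression term normalized by M_pos.
    return l_cls + delta * l_reg / m_pos
```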

3.2 Box Refine Stage

The detection results of FuseNet can be easily optimized by existing refinement methods. We further apply the RoI-aware pooling module proposed in PartA2 [6], the RoI-grid pooling module proposed in PV-RCNN [15] and the voxel RoI pooling module proposed in Voxel RCNN [4] to refine the bounding box parameters and classification. During the experiments, we keep the structure and parameters of Stage-II the same as in PartA2, PV-RCNN and Voxel RCNN. The augmented network architecture is shown in Fig. 1, and the validation results with different refinement methods are shown in Table 4.

3.3 Overall Loss

The total loss of the network is the weighted sum of \(L_{{\text{stage - I}}}\) and \(L_{{\text{stage - II}}}\). Since the Stage-II modules used in this paper are taken directly from previous works [4, 6, 15], we apply the loss functions defined in those works to compute the Stage-II loss \(L_{{\text{stage - II}}}\).

$$ L_{\text{total}} = L_{\text{stage-I}} + \gamma L_{\text{stage-II}} $$
(5)

where \(\gamma\) is the weight of the Stage-II loss and differs for different Stage-II modules. For example, \(\gamma = 2\) is selected for the RoI-aware pooling and voxel RoI pooling modules, and \(\gamma = 1.5\) is selected for the RoI-grid pooling module.

4 Experiment and Analysis

In this section, we introduce detailed information about model training and analyze the network output validation results.

4.1 Dataset and Data Augmentation

All experiments in this paper were conducted on the KITTI 3D detection benchmark [7], which includes 7481 training samples and 7518 test samples. The training samples are further divided into a train split (3712 samples) and a validation split (3769 samples). Our model was trained on the train split and evaluated on the validation split and the test set. The ground-truth objects are classified into three difficulty levels, Easy, Moderate, and Hard, according to the occlusion of the objects in the ground-truth labels. During training, we adopt the data augmentation strategy proposed in SECOND [5]: in addition to rotation, translation and scaling, we insert ground-truth objects from other samples of the train split to increase the number of positive samples in the input data and improve training performance.
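The ground-truth sampling part of this augmentation can be sketched as follows; the axis-aligned bird-view collision test and all function names are simplifications of the rotated-box test used in SECOND [5].

```python
import numpy as np

def gt_sampling(points, gt_boxes, db_points, db_boxes, num_to_add=10):
    """Sketch: paste object points and boxes sampled from other training frames
    (db_points, db_boxes) into the current frame when their bird-view boxes do
    not overlap existing boxes.

    points:   (P, 4) lidar points of the current frame
    gt_boxes: (G, 7) boxes (x, y, z, l, w, h, theta) of the current frame
    """
    def bev_aabb(box):
        x, y, _, l, w, _, _ = box
        return (x - l / 2, y - w / 2, x + l / 2, y + w / 2)

    def overlaps(a, b):
        return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

    kept = list(gt_boxes)
    for obj_pts, obj_box in list(zip(db_points, db_boxes))[:num_to_add]:
        if any(overlaps(bev_aabb(obj_box), bev_aabb(k)) for k in kept):
            continue  # skip objects that would collide with existing ones
        points = np.concatenate([points, obj_pts], axis=0)
        kept.append(obj_box)
    return points, np.stack(kept, axis=0)
```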

4.2 Model Details

We follow the network parameter settings of SECOND [5]. The effective range of the point cloud is limited to \(x \in \left[ 0, 70.4 \right]\,{\text{m}}\), \(y \in \left[ -40, 40 \right]\,{\text{m}}\), \(z \in \left[ -3, 1 \right]\,{\text{m}}\), and the voxel size is \(\left( 0.05\,{\text{m}}, 0.05\,{\text{m}}, 0.1\,{\text{m}} \right)\). The maximum number of points per voxel is 10. For the car, bicycle and pedestrian classes, the predefined anchor sizes are (3.9 m, 1.6 m, 1.56 m), (1.76 m, 0.6 m, 1.73 m) and (0.8 m, 0.6 m, 1.73 m), respectively, and the anchor orientations are predefined as \(0^\circ\) and \(90^\circ\). The positive-sample thresholds are 0.6, 0.5 and 0.5, and the negative-sample thresholds are 0.45, 0.35 and 0.35. RoIs whose Intersection over Union (IoU) with a ground-truth box is greater than the positive threshold are treated as positive samples, and RoIs whose IoU is less than the negative threshold are treated as negative samples. Anchor centers are aligned to voxel centers.
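These settings can be summarized in a single configuration structure; the dictionary layout below is our own and is not tied to any particular codebase.

```python
# Illustrative configuration mirroring the settings described above.
DETECTION_CONFIG = {
    "point_cloud_range": {"x": (0.0, 70.4), "y": (-40.0, 40.0), "z": (-3.0, 1.0)},  # meters
    "voxel_size": (0.05, 0.05, 0.1),   # meters
    "max_points_per_voxel": 10,
    "anchors": {
        # class: (length, width, height) in meters, yaw in degrees, IoU thresholds (pos, neg)
        "car":        {"size": (3.90, 1.60, 1.56), "rotations": (0.0, 90.0), "iou": (0.60, 0.45)},
        "bicycle":    {"size": (1.76, 0.60, 1.73), "rotations": (0.0, 90.0), "iou": (0.50, 0.35)},
        "pedestrian": {"size": (0.80, 0.60, 1.73), "rotations": (0.0, 90.0), "iou": (0.50, 0.35)},
    },
}
```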

4.3 Ablation Experiments

In this section, we introduce several ablation experiments on the VPM module, Cross-RPN module and target generation strategy to prove their effectiveness for multiple object detection tasks. Our model is validated on the BEV detection and 3D detection benchmarks of KITTI. Average precision (AP) with 11 recall positions is adopted as the evaluation standard. The evaluation results are shown in Table 2.

Table 2 Performance comparison of different methods on the KITTI 3D val dataset at moderate difficulty with AP calculated with 11 recall positions

4.3.1 Effect of VPM

The second row of Table 2 shows that the mAP of the network with the VPM module is improved by 0.25% compared with that without the VPM module, with the largest gains on small objects. For the car class, the change in AP is not obvious, but for the pedestrian and bicycle classes, the AP increases by 0.79% and 0.15%, respectively, indicating that point cloud density information is beneficial for detecting small objects.

4.3.2 Effect of Cross-RPN

As shown in the third row of Table 2, when the Cross-RPN module is applied as the region proposal network, the detection APs of the car, pedestrian and bicycle classes change by −0.28%, +3.12% and +0.26%, respectively, and the mAP over cars, pedestrians and cyclists improves by 1.04%. This finding proves that the full integration of context information with different depths and resolutions is beneficial for 3D point cloud object detection.

4.3.3 Effect of Target Generation Strategy

The fourth row of Table 2 shows that our target generation strategy also achieves a better regression effect on the KITTI val split: the AP of pedestrians and bicycles improves by 0.61% and 1.63%, respectively, while the car AP remains at the same level. The mAP over cars, pedestrians and cyclists improves by 1.20%.

As shown in Table 3, the Cross-RPN module with four layers achieves the best results on the car and cyclist classes, whereas the module with six layers performs best on the pedestrian class. We choose Cross-RPN with four layers in FuseNet and in the other combined networks mentioned in this paper.

Table 3 Performance comparison of the network with Cross-RPN of different depths on the KITTI val dataset at moderate difficulty

4.3.4 Two-Stage Network Based on FuseNet

For most voxel-based approaches, our VPM module can be easily inserted before the sparse voxel data generation step, and the Cross-RPN module can directly replace the original RPN, which improves network performance. As shown in Table 2, with the VPM module, Cross-RPN and our target generation strategy, the APs of our network on the car, pedestrian and bicycle classes are improved by 0.22%, 3.66% and 4.66%, respectively, compared with SECOND [5]. In terms of mAP, our FuseNet framework outperforms SECOND [5] by 2.85%.

In addition, our proposed network, which consists of the VPM, the backbone and Cross-RPN, can be directly employed as Stage-I of a two-stage network. Considering that most two-stage networks apply SECOND [5] as Stage-I, we further combine our modules with the refinement methods of several two-stage networks, including PartA2, PV-RCNN and Voxel RCNN. Table 4 shows that our network can be flexibly applied to two-stage detectors with increased performance. Compared with two-stage detectors based on SECOND, two-stage detectors based on our FuseNet achieve better results on the car, pedestrian and cyclist classes, and the mAPs of PartA2, PV-RCNN and Voxel RCNN are improved by 2.17%, 0.96% and 0.99%, respectively.

Table 4 Performance comparison of different two-stage methods with our FuseNet and SECOND

4.3.5 Comparison with State-of-the-Art Methods

We compare the performance of our model with state-of-the-art models for multitype object detection on the KITTI test split; the results are shown in Table 5. On the 3D object detection benchmark, our model maintains the same level of car AP as Voxel RCNN and outperforms state-of-the-art multitype networks on the cyclist AP. On the BEV object detection benchmark, our network achieves better results than Voxel RCNN and is very close to SE-SSD, which ranks first on the KITTI benchmark. Our network outperforms state-of-the-art networks with AP gains of 2.51% and 1.64% for the cyclist class at the Moderate and Hard levels, respectively.

Table 5 Performance comparison of different methods on the KITTI 3D test set

We additionally calculate the mean average precision (mAP) over the three classes and compare our network with other networks. As shown in Table 6, the combination of our network and the Stage-II of Voxel RCNN outperforms other networks by 0.99% mAP and significantly improves the APs of pedestrians and cyclists. This result shows that our method attains a more balanced performance and is more suitable for complex scenes with a variety of objects.

Table 6 Performance comparison of state-of-the-art methods on KITTI val split at moderate difficulty

5 Conclusion

In this paper, we proposed FuseNet, a 3D object detection neural network for lidar point clouds. The network effectively combines the advantages of point-based and voxel-based methods through two proposed modules, VPM and Cross-RPN. First, the fusion of coordinate and density information effectively reduces the loss of detail caused by voxelization and improves the accuracy of voxel-based methods; more detailed initial voxel features raise the upper bound of voxel-based 3D object detection networks. Second, the full integration of contextual information in Cross-RPN improves the RoI extraction efficiency of the RPN and compensates for its shortcomings in multitype object detection tasks. Last, our network can easily be extended as part of a more complex network. Whether used independently or combined with other networks, our network achieves better results than existing networks. Combined with the voxel RoI pooling module, our model achieves better AP on the KITTI dataset than other state-of-the-art models for multitype object detection and is more suitable for complex scenes. Our work demonstrates that information fusion is key to improving the robustness of neural networks and is a promising direction for future research.